Title: Word Spotting: Indexing Handwritten Manuscripts
1Word Spotting: Indexing Handwritten Manuscripts
- Michael D. Fecina
- IST 497/597
- 22-JAN-02
2History
- OCR was used in the past for indexing machine-typed letters and documents
- OCR does not work well with handwritten documents because of:
- noise (ink marks, perhaps)
- variations among writing styles
- inconsistencies in formation of letters/words
3More history
- OCR is used to segment a page into words, then break each word into its characters
- OCR is successful with clean machine fonts against a clean background
- Character segmentation is too difficult with handwritten documents
4Motivation
- To efficiently index historical handwritten documents
- To simplify reading documents where the handwriting is particularly hard to read
- Eventually, just as with images, it is hoped that automatic indexing of documents will be available
5Specific important documents to be indexed
- W.E.B. Du Bois
- Washington's and other Presidents' writings located in the Library of Congress
- Over 6,400 scanned 8-bit grey-level images of Washington's manuscripts
- Serve as valuable resources for scholars as well as others who wish to consult original source material
6What is word spotting?
- A method by which handwritten material can be indexed
- Assumes documents are written by the same person
- Assumes that variation between occurrences of the same word is minimal
- The above assumption does not always hold true (a significant contributor to error)
7More about word spotting
- Avoid recognizing the words
- Use word images
- What is difficult about it?
- Segmenting the page into words
- Ascenders, descenders
- Noise, inconsistencies
- Matching the words effectively
8Methodology
- Obtain grey-level image of document.
- Reduce image by ½ using Gaussian filtering and sub-sampling.
- Image is then binarized by thresholding (characters white, background black).
- Binary image is segmented into words (word images).
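The preprocessing steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the presented implementation: the function names, the kernel width `sigma`, the threshold value, and the assumption that the scan shows dark ink on light paper are all hypothetical.

```python
import numpy as np

def gaussian_kernel(sigma=1.0, radius=2):
    """1-D Gaussian weights, normalised to sum to 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()

def smooth(image, sigma=1.0, radius=2):
    """Separable Gaussian filter: rows first, then columns."""
    k = gaussian_kernel(sigma, radius)
    padded = np.pad(image.astype(float), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def preprocess(grey, threshold=128, sigma=1.0):
    """Halve the image, then binarize it.

    Returns a boolean array that is True ("white") on characters and
    False ("black") on background, assuming dark ink on light paper.
    """
    reduced = smooth(grey, sigma)[::2, ::2]   # filter, then keep every 2nd pixel
    return reduced < threshold
```

Filtering before sub-sampling limits aliasing of thin pen strokes when every second pixel is dropped.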
9Methodology
- Each word image is tested against every other word image, but pairs are pruned based on image area and aspect ratio.
- Matching produces equivalence classes.
- The top n equivalence classes are chosen. The top s classes are removed and noted as stop words. Then the user provides the ASCII equivalent for the remaining top m classes.
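One way to picture the pruning step is a cheap size test that runs before any expensive matching. The sketch below is an assumption-laden illustration: the names `should_compare` and `candidate_pairs` and both ratio thresholds are hypothetical, and the slides do not give the actual criteria.

```python
from itertools import combinations

def should_compare(shape_a, shape_b, min_area_ratio=0.7, min_aspect_ratio=0.8):
    """Return True if two word images are similar enough in size to match.

    Each shape is (height, width) in pixels. Both thresholds are
    illustrative values, not taken from the presentation.
    """
    ha, wa = shape_a
    hb, wb = shape_b
    area_a, area_b = ha * wa, hb * wb
    aspect_a, aspect_b = wa / ha, wb / hb
    area_ratio = min(area_a, area_b) / max(area_a, area_b)
    aspect_ratio = min(aspect_a, aspect_b) / max(aspect_a, aspect_b)
    return area_ratio >= min_area_ratio and aspect_ratio >= min_aspect_ratio

def candidate_pairs(shapes):
    """Index pairs that survive pruning and still need full matching."""
    return [(i, j) for i, j in combinations(range(len(shapes)), 2)
            if should_compare(shapes[i], shapes[j])]
```

For example, `candidate_pairs([(20, 60), (21, 62), (20, 200)])` keeps only the first two images as a pair, since the third is far wider than either.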
10Details of Word Segmentation
- Spacing between characters is smaller than spacing between words
- If two white pixels are separated by less than a certain distance k, the intermediate pixels are made white
- Done in both the horizontal and vertical directions so that descenders are included
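The gap-filling rule above can be sketched as a run-length smoothing pass. The helper names are hypothetical, and combining the horizontal and vertical passes with a logical OR is an assumption, since the slide does not say how the two directions are merged.

```python
import numpy as np

def fill_gaps_1d(line, k):
    """Make white every pixel lying between two white pixels closer than k."""
    out = line.copy()
    white = np.flatnonzero(line)
    for a, b in zip(white[:-1], white[1:]):
        if b - a < k:
            out[a:b] = True
    return out

def fill_gaps(binary, k_horizontal, k_vertical):
    """Apply the 1-D rule along rows and along columns.

    `binary` is a boolean array, True on character pixels.
    The OR of the two passes is an assumed way of combining them.
    """
    horiz = np.array([fill_gaps_1d(row, k_horizontal) for row in binary])
    vert = np.array([fill_gaps_1d(col, k_vertical) for col in binary.T]).T
    return horiz | vert
```

Choosing k between the typical character gap and the typical word gap is what lets characters merge while words stay apart.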
11Word segmentation
- Errors do occur using this algorithm (e.g. the dot over an i or j)
- However, a minimum length is required. This prevents the dots of i/j from becoming separate word images
- If large gaps are left in some instances of a word but not in others, the instances are segmented as different words
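A rough sketch of the minimum-length filter: extract connected regions from the gap-filled image and drop those that are too narrow to be words. Treating "length" as bounding-box width, using 4-connectivity, and the value of `min_width` are all assumptions made for illustration.

```python
from collections import deque
import numpy as np

def word_boxes(mask, min_width=3):
    """Bounding boxes (top, left, bottom, right) of connected white regions.

    Regions narrower than `min_width` pixels (e.g. a stray i-dot) are
    discarded. Uses 4-connectivity and a breadth-first flood fill.
    """
    h, w = mask.shape
    seen = np.zeros((h, w), dtype=bool)
    boxes = []
    for r0 in range(h):
        for c0 in range(w):
            if not mask[r0, c0] or seen[r0, c0]:
                continue
            queue = deque([(r0, c0)])
            seen[r0, c0] = True
            top, bottom, left, right = r0, r0, c0, c0
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    nr, nc = r + dr, c + dc
                    if (0 <= nr < h and 0 <= nc < w
                            and mask[nr, nc] and not seen[nr, nc]):
                        seen[nr, nc] = True
                        queue.append((nr, nc))
            if right - left + 1 >= min_width:
                boxes.append((top, left, bottom, right))
    return boxes
```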
12Senior Document
13Segmented Senior Document
14Two primary algorithms used for word matching
- EDM (Euclidean Distance Mapping (D. 1980))
- Fast, but assumes that no distortions have occurred except for relative translation
- Does well matching words with relatively low variation relative to the template
- SLH (Scott and Longuet-Higgins (1991))
- Assumes an affine transformation between the words
- Slow, computationally expensive in current implementations
15Matching with EDM
- Alignment: vertical by baseline, horizontal by coinciding left sides (thus vertical alignment > horizontal alignment)
- XOR image is computed: corresponding pixels are XORed to produce the difference between the images
- Not good as the sole measure of image difference, since equal weight is given to isolated pixels and blobs
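The alignment and XOR step might look like the sketch below. It assumes the baseline row of each word image is already known and passed in; how the baseline is estimated is not covered by the slides, and the function name is hypothetical.

```python
import numpy as np

def xor_aligned(a, b, baseline_a, baseline_b):
    """XOR two boolean word images after aligning them.

    Vertically the baseline rows are made to coincide; horizontally the
    left edges coincide. Both images are pasted onto a shared canvas that
    is large enough for either one.
    """
    above = max(baseline_a, baseline_b)
    below = max(a.shape[0] - baseline_a, b.shape[0] - baseline_b)
    width = max(a.shape[1], b.shape[1])
    canvas_a = np.zeros((above + below, width), dtype=bool)
    canvas_b = np.zeros_like(canvas_a)
    top_a = above - baseline_a
    top_b = above - baseline_b
    canvas_a[top_a:top_a + a.shape[0], :a.shape[1]] = a
    canvas_b[top_b:top_b + b.shape[0], :b.shape[1]] = b
    return canvas_a ^ canvas_b   # True where exactly one image has ink
```

Counting the True pixels of the result gives the raw difference that the last bullet warns against using on its own.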
16XOR for Lloyd
What's in each one, but not both, of the images
17EDM Step
- EDM is computed by assigning to each white pixel in the image its minimum distance to a black pixel
- A white pixel inside a blob will get a larger distance than an isolated white pixel
- An error measure (E_EDM) can now be calculated by summing the distance measures for each pixel
18Forming Blobs using EDM
- The distance between every white pixel and the nearest black pixel is computed
- If distance < threshold, the pixel is assumed to be noise.
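The distance mapping and noise threshold from the last two slides can be sketched with a brute-force computation. This is illustrative only: a real implementation would use a fast distance transform, the threshold value is hypothetical, and dropping below-threshold pixels before summing is an assumption about how the two steps combine.

```python
import numpy as np

def distance_map(xor_image):
    """For each white pixel, the Euclidean distance to the nearest black pixel.

    Black pixels get 0. If the image has no white or no black pixels the
    distance is undefined, and an all-zero map is returned by convention.
    """
    white = np.argwhere(xor_image)
    black = np.argwhere(~xor_image)
    dist = np.zeros(xor_image.shape, dtype=float)
    if len(white) == 0 or len(black) == 0:
        return dist
    diffs = white[:, None, :] - black[None, :, :]
    nearest = np.sqrt((diffs ** 2).sum(axis=2)).min(axis=1)
    dist[tuple(white.T)] = nearest
    return dist

def edm_error(xor_image, noise_threshold=1.5):
    """Sum of distances, ignoring pixels below the noise threshold."""
    dist = distance_map(xor_image)
    dist[dist < noise_threshold] = 0.0
    return float(dist.sum())
```

An isolated white pixel scores 1 and is discarded, while the centre of a 5 x 5 blob scores 3, so large mismatched regions dominate the error.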
19Problems with EDM
- EDM does not discriminate well between good and bad matches
- Fails when there is significant distortion in the words
- Need for a matching algorithm that models some variation -> SLH
20SLH Matching technique
- Affine transformation allows for scaling and shear deformations in both directions
- Much more accurate than the Euclidean Distance Mapping technique
- Computationally slow and expensive because the SVD (Singular Value Decomposition) must be computed for a large matrix
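The SVD at the heart of the Scott and Longuet-Higgins method can be sketched as follows: build a Gaussian proximity matrix between two point sets, replace its singular values with ones, and keep pairs that dominate both their row and their column. How feature points are sampled from word images, the value of `sigma`, and the function name are assumptions; the affine-fitting step that would follow is omitted.

```python
import numpy as np

def slh_match(points_a, points_b, sigma=5.0):
    """Pair up points from two sets using the SLH correspondence rule.

    Returns a list of (i, j) index pairs. `sigma` controls how far apart
    two points may be and still attract each other (illustrative value).
    """
    a = np.asarray(points_a, dtype=float)
    b = np.asarray(points_b, dtype=float)
    # Proximity matrix: G[i, j] = exp(-r_ij^2 / (2 sigma^2))
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    g = np.exp(-sq_dist / (2.0 * sigma ** 2))
    # Replace the singular values with ones: P = U E V^T
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    p = u @ vt
    matches = []
    for i in range(p.shape[0]):
        j = int(np.argmax(p[i]))
        if int(np.argmax(p[:, j])) == i:   # largest in its row and its column
            matches.append((i, j))
    return matches
```

The proximity matrix has one entry per pair of points, which is why the SVD becomes the bottleneck as the number of points per word grows.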
21Differences (+/-)
- EDM does not account for any distortions and thus performs poorly when handwriting is bad
- SLH almost always produces the correct rankings even if the handwriting is bad
- Two areas need to be improved in both:
- Speed and word validity discrimination
22Tests . . .
- Two documents, Senior and Hudson, were compared using both matching algorithms
- The statistical information for both documents is as follows
23Test Information
- Since the SLH algorithm is slow but more accurate than the EDM, EDM was applied first
- A cut-off threshold was used to limit the number of classes (words) displayed, and remained constant in both tests
24Senior Document - classes
stop words
25EDM on Senior
- The EDM algorithm performed quite well on the Senior Document
- Average precision of 78%
- Since the handwriting is good, this performance was expected
- Remember that EDM does not account for much variation in word images
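For readers unfamiliar with the metric, average precision over a ranked list of matches can be computed as below. This assumes the standard non-interpolated definition and that the ranking contains every true instance of the word; the slides do not state which variant was used.

```python
def average_precision(ranked_relevance):
    """Mean of the precision values at each rank where a correct match occurs.

    `ranked_relevance` is a sequence of booleans, best match first,
    True where the retrieved word image is a correct instance.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0
```

A ranking of correct, wrong, correct, correct scores (1/1 + 2/3 + 3/4) / 3, so a single early mistake already costs a noticeable amount.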
26EDM matches for Lloyd
50% correct
27EDM problems . . .
- The algorithm performs poorly in that it cannot
discriminate well between valid and invalid words.
28The Hudson Document
29EDM on Hudson
- Average precision was 57.9%, much lower than that of the Senior Document (78%)
- Difference in precision attributed to the handwriting
- Difficult to read even for humans looking at greyscale images at 300 dpi
30Problems with EDM/Hudson
31SLH matches for Lloyd
62.5% correct
3/8 incorrect
32SLH on Senior
- Proved to be very accurate yet, as mentioned before, slow
- Average precision of SLH was 86.3%, compared to 78.7% for EDM
- SLH ranked the words correctly, and also showed much greater discrimination in match error
33SLH on Hudson
- Very difficult because of the writing, but ranking proved to be much better than with EDM
- Performance on templates like "they" was good, probably because they are simple, repetitive words
- Correct ranking for the word "Standard"
34Standard template with SLH
35Current matching techniques
- SLH is more than reasonably accurate, but slow in current implementations. References report this will change ...
- Requires matching every word against every other word
- O(N^2).
- N = 220 x 6400, so N^2 is on the order of 10^12 matches!
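A quick check of the arithmetic behind the quadratic cost, taking N as the product 220 x 6400 from the slide. Reading 220 as an average number of words per page is an assumption; 6400 is the number of scanned images mentioned earlier.

```python
words_per_page = 220          # assumed meaning of the "220" on the slide
pages = 6400                  # scanned images in the collection
n = words_per_page * pages    # total word images

all_pairs = n * n                    # every word against every other word
unordered_pairs = n * (n - 1) // 2   # if each pair is only matched once

print(f"N = {n:,}")
print(f"N^2 = {all_pairs:,}")
print(f"N(N-1)/2 = {unordered_pairs:,}")
```

Even counting each pair once, the total stays near 10^12, so pruning and faster matching matter more than halving the work.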
36Main problems with wordspotting
37Main problems with wordspotting
- Word scaling and connection of ascenders,
descenders
38Main problems with wordspotting
39Recent Work
- Fixed bugs and some problems with the algorithm
- Recently have successfully segmented the 6400 scanned images of George Washington's documents
- It's impractical (too labor intensive) to compute segmentation statistics on the entire collection.
40Future Work
- Continue working on new methods to match words effectively and efficiently
- Possible ideas for better matching techniques include:
- Combining multiple word features
- Language probability for automatic indexing
- Integrate into indexing scheme (back-of-book index)
41Conclusion
- EDM works reasonably well for matching words, but SLH is better since it accounts for variations
- SLH pays the price: it is computationally expensive
- Future work is needed, but progress thus far is very encouraging
42Questions?
- Thanks for listening to me blob (?) about Wordspotting.
- Any Comments/Questions?