Title: Document Image Analysis for Digital Libraries
1Document Image Analysis for Digital Libraries
- Prateek Sarkar
- Perceptual Document Analysis
- Palo Alto Research Center
- California, USA
2Palo Alto Research Center (PARC)
- 1970s
  - First laser printer (Xerox)
  - First object-oriented programming language, Smalltalk (ParcPlace)
  - Personal distributed computing/client-server architecture
  - Ethernet, graphical user interface, and pop-up menus (Xerox, Apple)
  - Page Description Languages (Adobe), Bravo text editor (MSWord)
  - DFB solid state diode laser (SDL)
- 1980s
  - Optical LAN (Synoptics), document management software (Documentum)
  - Linguistic technology in spell checkers (Microlytics)
  - Collective programming/multi-user domains invented
  - Liveboard (LiveWorks), information visualization (InXight)
  - Multibeam lasers (Xerox), a-Si TFTs in displays, medical imagers (dpiX)
- 1990s
  - Encryption (Semaphore)
  - Ethnographic studies of field service (Xerox)
  - Constraint-based scheduler (Xerox)
3PARC Research Labs
- Computing Sciences
  - Ad Hoc and Sensor Networks, Security, Privacy, Interoperability, Ubiquitous Computing Applications, User-centered Design, Workscapes, Organizations
- Intelligent Systems
  - Embedded Reasoning, Human Information Interaction, Natural Language Processing, Perceptual Document Analysis, User Interfaces, Visualization
- Electronic Materials and Devices
  - Class III-V Materials, Large Area Electronics, MEMS, Optoelectronics, Optical Detectors
- Hardware Systems
  - Biomedical Systems, Biodefense, Particle Manipulation, Piezo Materials and Devices, Printing Systems, Renewable Energy
4Perceptual document analysis
- Develop probabilistic modeling frameworks for
perceptual analysis of documents
5(No Transcript)
621 300 92 9.00-03
SALE INVOICE 1NNoICE NO
INCE DATE DUE DATE
FAX TO RILL TO REMIT
TO ACCOUNT NO
PO
NUMBER LOCATION OF UNITS SAME AS ABOVE
UNIT NO. SERIAL NO. 26.361.00 UNIT
NO. SERIAL NO 35.351.00 DOWN
PAYMENT BUILDING DELIVERY BUILDING DELIVERY BLOCK
AND LEVEL BLOCKAND LEVEL ANCHORMIEDOWN DECKING
ELECTRICAL PLUMBING INSTALLATION SITE MANAGEMENT
SKIRTING VINYL 0.00 0.00 400.00 0.00 2,100.00
780,00 960,00 1.350 00 3.925 00 1,100.00 1,300.00
TOTAL DUE THIS INVOICE 63,767.00
7(No Transcript)
8Document imaging services
- Xerox document image scanning centres
- Americas, Europe, Asia
- Millions of pages scanned every day
- These images need
- Indexing
- Storage
- Classification and sorting
- Functional Role Labeling
- and much more
9Document image libraries subtasks
- Document layout understanding
- Character recognition
- Functional role labeling
- Image cleanup/enhancement
- Indexing
- Organizing
- Restructuring
- Summarizing
- Cross-linking
- Redaction
- Privacy management
- Distribution
- Interaction: searching, browsing, learning, annotation, authoring, publishing
10PARC research on Digital libraries
- Digital repositories by intent
- Information dissemination
- Record keeping
- Institutional document repositories/archives
- Personal document collections
- Collaborative collections
11PARC research on Digital Libraries
- Digital Libraries by content
- Textual
- Structured/semi-structured (databases, email)
- Images with symbolic content
- Images with natural content
- Music
- Video
-
12Robust OCR Using Fisher Kernels
DICE Document Image Classification Engine
Provably optimal Functional Role Labeling
Contour labeling Using MRFs
SenseMaking
Ubitext
UpLib Personal DL
Beyond text and keywords
13Provably optimal algorithm for Functional Role
Labeling
14(No Transcript)
1521 300 92 9.00-03
SALE INVOICE 1NNoICE NO
INCE DATE DUE DATE
FAX TO RILL TO REMIT
TO ACCOUNT NO
PO
NUMBER LOCATION OF UNITS SAME AS ABOVE
UNIT NO. SERIAL NO. 26.361.00 UNIT
NO. SERIAL NO 35.351.00 DOWN
PAYMENT BUILDING DELIVERY BUILDING DELIVERY BLOCK
AND LEVEL BLOCKAND LEVEL ANCHORMIEDOWN DECKING
ELECTRICAL PLUMBING INSTALLATION SITE MANAGEMENT
SKIRTING VINYL 0.00 0.00 400.00 0.00 2,100.00
780,00 960,00 1.350 00 3.925 00 1,100.00 1,300.00
TOTAL DUE THIS INVOICE 63,767.00
16(No Transcript)
17Functional Role Labeling
18Functional Role Labeling
- Identify salient groups in images and label them
- Morphological cues
  - Proximity, color similarity
- Perceptual cues
  - Alignments, curvilinear continuity, closure
- Semantic cues
  - Semantic validation of a grouping hypothesis
19Functional Role Labeling
20Functional Role Labeling - Why it is a
difficult problem
21Table parsing
22Table parsing
Vertical Alignment
23Table parsing
Symmetry
24Table parsing
25Table parsing
Horizontal Alignment (revisited)
26Image Parsing
- Parsing images is unlike parsing text
- No natural ordering in 2D
- Segmentation or grouping is intractable
- Polynomial complexity with restrictions
- Factored HMMs, XY-grammars
27Image parsing
Simplify: let perceptual grouping principles generate grouping hypotheses. Focus on the labeling problem.
28The labeling problem
Labels
Groups/Regions/Segments
29(No Transcript)
30The labeling problem - individual labeling
- Individual labeling (simple case)
- Computational complexity: $C \cdot N$
- Number of objects to label: $N$
- Number of labels: $C$
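As a minimal sketch of the simple case, assuming each object comes with a table of per-label scores (the `scores` values and the function name are hypothetical), every object is decided on its own, so the cost is one pass over N rows of C entries:

```python
def label_individually(scores):
    """scores[i][c] is the score of giving object i label c.

    Each object is decided independently, so the total work is
    N * C score lookups.
    """
    return [max(range(len(row)), key=lambda c: row[c]) for row in scores]


# Hypothetical scores for N = 2 objects and C = 3 labels.
scores = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2]]
labels = label_individually(scores)
print(labels)
```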
31The labeling problem - joint labeling
32Classification with complex joints
- The joint distribution is a complex mess with all kinds of dependence factors.
- The general problem to solve
  - $(y_1, y_2, \ldots, y_N) = \arg\max P(y_1, \ldots, y_N, x_1, \ldots, x_N)$
- Have to search over $C^N$ interpretations
- Compare to simple case
  - $(y_1, y_2, \ldots, y_N) = \arg\max P(y_1, x_1)\, P(y_2, x_2) \cdots P(y_N, x_N)$
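To make the cost of the general problem concrete, here is a hedged sketch of exhaustive joint labeling. The toy joint score (per-object scores multiplied together, with a mutual-exclusion rule that label 0 may be used at most once) is an assumption made for illustration, not the model from the talk.

```python
from itertools import product
from math import prod

# Hypothetical per-object scores for N = 3 objects and C = 2 labels.
UNARY = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]


def joint_score(labeling):
    """Toy joint score: label 0 may appear at most once (assumed constraint)."""
    if labeling.count(0) > 1:
        return 0.0
    return prod(UNARY[i][c] for i, c in enumerate(labeling))


def exhaustive_joint_labeling(num_objects, num_labels, score):
    """Score every one of the C**N labelings and keep the best."""
    best, best_score, evaluated = None, float("-inf"), 0
    for labeling in product(range(num_labels), repeat=num_objects):
        evaluated += 1
        s = score(labeling)
        if s > best_score:
            best, best_score = labeling, s
    return best, best_score, evaluated


best, best_score, evaluated = exhaustive_joint_labeling(3, 2, joint_score)
print(best, evaluated)
```

Labeling each object independently would pick label 0 for both of the first two objects, which the joint constraint forbids; the exhaustive search finds the valid optimum, but only by visiting all 2^3 = 8 labelings.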
33A search factorizable bounds
- Identify a factorizable upper bound
- The general problem to solve
- $f(y_1, x_1)\, f(y_2, x_2) \cdots f(y_N, x_N) > P(y_1, \ldots, y_N, x_1, \ldots, x_N)$
- Sort interpretations by upper bound (easy!)
- Evaluate an interpretation only if its upper bound is higher than the score of the best interpretation so far.
- Excellent average case performance.
34Sorting by the upper bound
$F(y_1, y_2, \ldots, y_N) = f(1, c_1) \cdot f(2, c_2) \cdots f(N, c_N)$
- (C,B) (B,B) (C,A)
- (B,B) (C,A) (A,B) (B,A)
- (C,A) (A,B) (B,A) (C,C)
-
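The bound-and-sort search described above can be sketched as follows. This is one possible reading, not the talk's implementation: labelings are generated lazily in decreasing order of the factorized bound using a heap, and the search stops once the next bound cannot beat the best true score found. The toy factors and joint score are assumptions; the only requirement is that the product of factors is never below the joint score.

```python
import heapq
from math import prod

# Hypothetical factors f(i, c) for N = 3 objects and C = 2 labels.
FACTORS = [[0.6, 0.4], [0.7, 0.3], [0.2, 0.8]]


def joint_score(labeling):
    """Toy joint score, never above the product of factors (assumed model)."""
    if labeling.count(0) > 1:  # mutual exclusion on label 0
        return 0.0
    return prod(FACTORS[i][c] for i, c in enumerate(labeling))


def best_first_labeling(factors, score):
    n = len(factors)
    # For each object, label indices sorted by decreasing factor value.
    order = [sorted(range(len(f)), key=lambda c, f=f: -f[c]) for f in factors]

    def bound(ranks):
        return prod(factors[i][order[i][r]] for i, r in enumerate(ranks))

    start = (0,) * n
    heap = [(-bound(start), start)]
    seen = {start}
    best, best_score, evaluated = None, float("-inf"), 0
    while heap:
        neg_bound, ranks = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no remaining labeling can beat the best one found
        labeling = tuple(order[i][r] for i, r in enumerate(ranks))
        evaluated += 1
        s = score(labeling)
        if s > best_score:
            best, best_score = labeling, s
        # Successors lower one object's rank, so their bound cannot increase.
        for i in range(n):
            if ranks[i] + 1 < len(order[i]):
                nxt = ranks[:i] + (ranks[i] + 1,) + ranks[i + 1:]
                if nxt not in seen:
                    seen.add(nxt)
                    heapq.heappush(heap, (-bound(nxt), nxt))
    return best, best_score, evaluated


best, best_score, evaluated = best_first_labeling(FACTORS, joint_score)
print(best, evaluated)
```

On this toy problem the search evaluates 2 of the 8 possible labelings and still returns the same optimum an exhaustive search would, because every unexplored labeling has a bound no higher than the score already achieved.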
35Drawing titles are close to drawing numbers
Drawing numbers are visually salient wrt
surroundings
There is at most one drawing number on a sheet.
36Variable format engineering drawings examples
37Variable format engineering drawings example
38Variable format engineering drawings segmentation
39Variable format engineering drawings feature
measurements
Mutual exclusion constraint
40Variable format engineering drawings results
- C = five labels of interest
- DN, DT, DNI, DTI, Other
- N = 50 to 80 text regions to be labeled
- Number of possible labelings: $50^4$
- 1000 images tested
- Drawing number correctly identified in about 80% of cases
- Errors mostly: OCR > segmentation > labeling
41Variable format engineering drawings results
- Typically only 200-1000 of $50^4$ label hypotheses are explored
- No heuristic pruning of search space
- Best labeling guaranteed
- Application benefits
- Faster search
- Research benefits
- Enables complex model design
- If in trouble, fix the models, not the search
42Document Image Classification Engine DICE
43Visual classification of document images
44Document Image Classification
- Goal: Sort and classify diverse scanned and electronic documents into generic categories and specific known templates.
[Figure: Example Processing Architecture. Labels: Document Classification Module, document category database, text-based features, OCR, electronic documents, layout analysis, layout-based features, classifier engine, paper documents, free text analysis redaction, letters, statements, acct. anonymization, personal information filtering, forms]
Slide Courtesy Eric Saund, PARC
45Existing methods
- Template matching
- Universal
- Explicit search over space of variations
- High level features
- Layout-based features
- Dubiel, Dengel. FormClas. (DAS, 1998)
- Hu, Kashi, Wilfong. (ICDAR, 1999)
- Shin, Doermann, Rosenfeld (IJDAR, 2001)
- OCR followed by text categorization
- Feature extraction is a bottleneck
- Key-feature based methods
- Rule based, design by trial and error
- Special mention
- Diligenti et al., Hidden Tree Markov Models. PAMI 2003
46Expected within class variability
47Expected within class variability
48Our approach: key ideas
- Compute low-level visual features on document image
  - e.g., locations of large intensity variations
- A document image produces a scatter-plot of these features
- A document-image category is represented by a probability distribution that would generate such a scatter plot
49Choice of features
- Haar filter features
  - 6 FilterTypes
  - 100 different window sizes
- Each filter is applied to an image
  - at multiple scales
  - at all locations of an image.
- If filter response exceeds preset thresholds a 5-dimensional feature fires.
  - (FeatureType, log(filterWidth), log(filterHeight), x, y), denoted (f, w, h, x, y)
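A rough sketch of how such features could be produced, under stated assumptions: a single left/right two-rectangle filter stands in for the six filter types, responses are normalized by window area, and the threshold and window sizes are made-up values. An integral image keeps each box sum constant-time.

```python
import math


def integral_image(img):
    """ii[r][c] holds the sum of img over rows < r and columns < c."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        row_sum = 0
        for c in range(w):
            row_sum += img[r][c]
            ii[r + 1][c + 1] = ii[r][c + 1] + row_sum
    return ii


def box_sum(ii, x, y, w, h):
    """Sum of the w-by-h box whose top-left pixel is (x, y)."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_features(img, sizes, threshold):
    """Emit (f, log w, log h, x, y) wherever the edge filter fires.

    Filter type 0 (the only one sketched here) is right half minus left
    half; every width in sizes is assumed to be even.
    """
    ii = integral_image(img)
    height, width = len(img), len(img[0])
    feats = []
    for w, h in sizes:
        half = w // 2
        for y in range(height - h + 1):
            for x in range(width - w + 1):
                left = box_sum(ii, x, y, half, h)
                right = box_sum(ii, x + half, y, half, h)
                response = (right - left) / (w * h)
                if abs(response) > threshold:
                    feats.append((0, math.log(w), math.log(h), x, y))
    return feats


# Toy 4x4 image with a vertical edge between columns 1 and 2.
img = [[0, 0, 1, 1] for _ in range(4)]
feats = haar_features(img, sizes=[(2, 2)], threshold=0.4)
print(feats)
```

The list of tuples returned for one image is the scatter of 5-dimensional points that the category models are later fitted to.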
50A document image as a scatter-plot
51Choice of probability model
- Latent Conditional Independence
  - $p(x_1, x_2, \ldots, x_n) = \sum_k p(k)\, p(x_1 \mid k)\, p(x_2 \mid k) \cdots p(x_n \mid k)$
  - (generative model)
52Latent Conditional Independence
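$p(x, y) = \sum_k p(k)\, p(x \mid k)\, p(y \mid k)$
[Figure: graphical model with latent node k and prior p(k), observed nodes x and y with conditionals p(x|k) and p(y|k), and a plate of size N_m]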
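As a small numerical illustration of the two-variable formula, assuming discrete x and y and made-up probability tables, the joint is a sum over components of a prior times two independent conditionals:

```python
def lci_joint(p_k, p_x_given_k, p_y_given_k, x, y):
    """p(x, y) = sum_k p(k) p(x|k) p(y|k) for discrete x and y."""
    return sum(pk * px[x] * py[y]
               for pk, px, py in zip(p_k, p_x_given_k, p_y_given_k))


# Hypothetical model with two latent components and binary x, y.
p_k = [0.5, 0.5]
p_x_given_k = [[0.9, 0.1], [0.2, 0.8]]
p_y_given_k = [[0.7, 0.3], [0.1, 0.9]]

p00 = lci_joint(p_k, p_x_given_k, p_y_given_k, 0, 0)
total = sum(lci_joint(p_k, p_x_given_k, p_y_given_k, x, y)
            for x in range(2) for y in range(2))
print(p00, total)
```

Here p(x=0, y=0) is 0.325 while the product of the marginals p(x=0) p(y=0) is 0.55 * 0.4 = 0.22, so x and y are dependent overall even though they are independent given k.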
53Latent Conditional Independence (LCI)
54LCI model one per image category
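[Figure: LCI graphical model for category c. Latent component k with p(k|c); observed feature dimensions f, w, h, x, y with conditionals p(f|k,c), p(w|k,c), p(h|k,c), p(x|k,c), p(y|k,c); plates n = 1..N_m and m = 1..M]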
55Per category LCI model training
Use EM algorithm for parameter estimation
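[Figure: LCI graphical model for category c. Latent component k with p(k|c); observed feature dimensions f, w, h, x, y with conditionals p(f|k,c), p(w|k,c), p(h|k,c), p(x|k,c), p(y|k,c); plates n = 1..N_m and m = 1..M]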
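A minimal EM sketch for one category's model, under a simplifying assumption that every feature dimension is discrete (in the talk only one of the five dimensions is discrete; the distributions used for the others are not given in this transcript). The function names, the random initialization, and the small smoothing constant are all illustrative choices.

```python
import random
from math import prod


def normalize(values):
    total = sum(values)
    return [v / total for v in values]


def em_lci(data, num_components, num_values, iterations=15, seed=0):
    """Fit p(k) and per-dimension tables p(x_d | k) by EM.

    data: list of tuples of discrete values.
    num_values[d]: number of distinct values in dimension d.
    """
    rng = random.Random(seed)
    dims = len(num_values)
    p_k = [1.0 / num_components] * num_components
    tables = [[normalize([rng.random() + 0.5 for _ in range(num_values[d])])
               for d in range(dims)] for _ in range(num_components)]
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [p_k[k] * prod(tables[k][d][x[d]] for d in range(dims))
                     for k in range(num_components)]
            z = sum(joint)
            resp.append([j / z for j in joint])
        # M-step: re-estimate the prior and tables from soft counts.
        for k in range(num_components):
            p_k[k] = sum(r[k] for r in resp) / len(data)
            for d in range(dims):
                counts = [1e-6] * num_values[d]  # tiny smoothing (assumption)
                for x, r in zip(data, resp):
                    counts[x[d]] += r[k]
                tables[k][d] = normalize(counts)
    return p_k, tables


# Toy data: two kinds of points in two binary dimensions.
data = [(0, 0)] * 20 + [(1, 1)] * 20
p_k, tables = em_lci(data, num_components=2, num_values=[2, 2])
print(p_k)
```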
56Per category LCI model testing
Use Maximum Likelihood classification
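[Figure: LCI graphical model for category c. Latent component k with p(k|c); observed feature dimensions f, w, h, x, y with conditionals p(f|k,c), p(w|k,c), p(h|k,c), p(x|k,c), p(y|k,c); plates n = 1..N_m and m = 1..M]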
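Maximum likelihood classification can be sketched as scoring an image's feature scatter under each category's model and picking the highest log-likelihood. This assumes, as the plate over n in the figure suggests, that each feature draws its own latent component, so the image log-likelihood is a sum of per-feature log mixture probabilities. The two toy models and their category names are hypothetical.

```python
from math import log, prod


def log_likelihood(features, model):
    """Sum over features of log sum_k p(k|c) prod_d p(x_d | k, c)."""
    p_k, tables = model
    total = 0.0
    for x in features:
        total += log(sum(pk * prod(t[d][x[d]] for d in range(len(x)))
                         for pk, t in zip(p_k, tables)))
    return total


def classify(features, models):
    """Return the category whose model gives the highest log-likelihood."""
    return max(models, key=lambda c: log_likelihood(features, models[c]))


# Hypothetical single-component models over two binary feature dimensions.
models = {
    "form_A": ([1.0], [[[0.9, 0.1], [0.8, 0.2]]]),
    "form_B": ([1.0], [[[0.1, 0.9], [0.3, 0.7]]]),
}
features = [(0, 0), (0, 1), (0, 0)]
predicted = classify(features, models)
print(predicted)
```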
57NIST Database of Tax Forms
20 form categories
5590 images with category labels
58NIST Tax Forms Data
- Fixed layout forms
- Various degrees of markup (filling in)
- Small translations and skew
- Lightness/Darkness variations
- Obscured corners
59Example feature distributions
60Experiments on NIST forms data
- 5-dim Haar filter features
  - (FeatureType, log(filterWidth), log(filterHeight), x, y)
  - One feature dimension is discrete
- 10,000 features for each image
- Train on 10 images per category
- K = 100 components
- 15 EM iterations
- 10 classification errors in 5390 test samples.
- 1-1.5 sec per image (US Letter, 300dpi)
  - Java implementation, 3GHz Pentium
[Figure labels: ML Classification, Feature extraction]
61Looking forward
- 7 out of 10 errors were on a single category
  - K = 100 for that category was overkill
  - Automatically identify K or adopt a non-parametric model
- Also applied to telling first page from second page of journal articles
- Non-rigid layout categories require deformable models
  - Handled within the same graphical models framework
  - Incorporating appropriate hidden variables
  - Parts models, Displacement models
- Acknowledgments: Mithun Das Gupta, Univ. Illinois
- Csurka et al. ECCV (2004); Sudderth et al. CVPR (2005)
62Clustering by appearance
Use EM algorithm for parameter estimation
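[Figure: LCI graphical model with category node c. Latent component k with p(k|c); observed feature dimensions f, w, h, x, y with conditionals p(f|k,c), p(w|k,c), p(h|k,c), p(x|k,c), p(y|k,c); plates n = 1..N_m and m = 1..M]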
64My personal digital Library in UpLib
65Images I keep in my UpLib
- Articles I read
- My notes on paper
- Letters from family and friends
- Receipts, bills and copies of paperwork
- Family medical prescriptions
- Cartoons I like
- Sudoku-s I enjoyed
- Artwork of my four-year old daughter
67Fisher kernels Robust OCR
OCR: Independent pixel noise does not matter. Test your system on correlated noise (interference).
68Fisher kernels Robust OCR
- Generative models
  - Learn p(features | class) from data
  - Compute p(class | features) using Bayes rule
  - Modular, composable
  - Rejection criterion
- Discriminative models
  - Learn p(class | features) directly from data
  - Learn the boundaries only
  - Often, lower error rate with small samples
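A short sketch of the generative route, assuming class priors and class-conditional likelihoods are already available as numbers (the values, class names, and evidence threshold below are made up). Bayes rule turns them into posteriors, and the total evidence p(features) gives a natural rejection criterion that a purely discriminative model does not provide.

```python
def classify_with_rejection(priors, likelihoods, min_evidence):
    """Return (label, posterior), or (None, {}) if the input is rejected.

    priors[c] = p(class c); likelihoods[c] = p(features | class c).
    """
    joint = {c: priors[c] * likelihoods[c] for c in priors}
    evidence = sum(joint.values())  # p(features)
    if evidence < min_evidence:
        return None, {}  # no class explains the observation well
    posterior = {c: j / evidence for c, j in joint.items()}
    return max(posterior, key=posterior.get), posterior


priors = {"e": 0.5, "c": 0.5}
label, posterior = classify_with_rejection(
    priors, {"e": 0.02, "c": 0.06}, min_evidence=1e-3)
rejected, _ = classify_with_rejection(
    priors, {"e": 1e-9, "c": 1e-9}, min_evidence=1e-3)
print(label, posterior, rejected)
```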
69Fisher kernels Robust OCR
70Fisher kernels Robust OCR
- Fisher kernel
- Observation likelihood
- Maximum likelihood parameter estimate
- Fisher kernel
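The formulas on this slide did not survive in the transcript. As a hedged sketch of the standard Fisher kernel definition, the Fisher score of an observation is the gradient of its log-likelihood with respect to the model parameters, and the kernel is the inner product of two scores weighted by the inverse Fisher information. The independent Bernoulli pixel model below is an assumption chosen only because its score and information have simple closed forms; it may differ from the model used in the talk.

```python
def fisher_score(x, theta):
    """Gradient of log p(x | theta) for independent Bernoulli pixels.

    d/d theta_j of [x_j log theta_j + (1 - x_j) log(1 - theta_j)]
    equals (x_j - theta_j) / (theta_j (1 - theta_j)).
    """
    return [(xj - tj) / (tj * (1.0 - tj)) for xj, tj in zip(x, theta)]


def fisher_information_diag(theta):
    """Diagonal Fisher information of the Bernoulli model."""
    return [1.0 / (tj * (1.0 - tj)) for tj in theta]


def fisher_kernel(x1, x2, theta):
    """K(x1, x2) = U(x1)^T I^{-1} U(x2), with a diagonal I."""
    u1, u2 = fisher_score(x1, theta), fisher_score(x2, theta)
    info = fisher_information_diag(theta)
    return sum(a * b / i for a, b, i in zip(u1, u2, info))


# Hypothetical two-pixel glyph model with maximum likelihood estimate theta.
theta = [0.5, 0.5]
k_same = fisher_kernel([1, 0], [1, 0], theta)
k_diff = fisher_kernel([1, 0], [1, 1], theta)
print(k_same, k_diff)
```

The resulting kernel values can be fed to any kernel classifier, which is how a generative model's parameters end up shaping a discriminative decision boundary.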
71Fisher kernels Robust OCR
72Fisher kernels Robust OCR
73The Bit Flip Model (DID)
- For glyph c of size (w, h)
74Fisher kernels Robust OCR
ABBYY
Fisher
DID
75Ubitext Paper to PDA
76Degree of interest tree
77ScentIndex
- End of book indices are carefully designed by author/editor/publisher
- Make use of this index to provide an active, dynamic, clickable interface to users
- On search, show search terms as well as related terms in a rearranged clickable index.
78Digital library content
- Text, email, web-history, pdf
- Photos, Music, Audio, Video, Movies
- Addresses, bookmarks, calendars, quick notes
- Recipes, travelogues, manuals, org-charts, genealogy
- Computer code
- Graphs and charts
- Data: astronomical, genomic, medical, geopolitical,
- Reference: dictionaries, encyclopedia, handbooks,
- Design: VLSI, airplane, building architecture, fashion,
- Collaborative artifacts: tags, annotations, ratings, links,
79What do we want beyond search
- User interface
- Sub-second interactions
- visual attention
- Real-time interactions
- search responsiveness
- incremental query building
- Long-term interaction
- Support for information foraging
- User modeling
- Knowledge tools (annotation, content authoring,
excerpting, collaborative tagging)
80What do we want beyond search
- The lay-of-the-land
- Browsing
- Search by scent and example
- Knowledge discovery tools and aids