Title: Content Level Access to Digital Library of India Pages
1Content Level Access to Digital Library of India
Pages
- Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
- CVIT, IIIT Hyderabad
2Digital Library of India (DLI)
http://www.dli.iiit.ac.in/
Vision: To enhance access to information and knowledge for the masses.
- Partner in the Million Book Universal Digital Library Programme.
- Information for people
- Dataset for researchers
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi Pratha, C. V. Jawahar. The Digital Library of India Project: Process, Policies and Architecture. ICDL, 2007.
3Digital Library of India (DLI)
Vision: To enhance access to information and knowledge for the masses.
Languages
- 41 different languages
- Includes
- - Hindi, Telugu, Marathi, ...
- - English, French, Greek, ...
Content Statistics
- Books: 4 lakh
- Pages: 134 million
- Words: 26 billion
Source: http://www.new1.dli.ernet.in/
4Digital Library of India (DLI)
Metadata search
- Supports metadata-based search.
- No content-level access.
Example search query: "Indian freedom struggle and independence"
5Digital Library of India (DLI)
- Need content-level access
- Content and metadata
Example search query: "Indian freedom struggle and independence"
6Digital Library of India (DLI)
Reliable text representation?
- Need content-level access
- Content and metadata
Example search query: "Indian freedom struggle and independence"
7Goal
Digital Library of India Search
- Build a search engine with support for Indian languages.
- Word spotting
8Goal
Indian Language Document Search Engine
- Text query support
9Goal
Indian Language Document Search Engine
- Multi-keyword support
10Goal
Indian Language Document Search Engine
- Ranking based on occurrences
11Goal
Indian Language Document Search Engine
- Semantically related words
12Goal
Indian Language Document Search Engine
- Seamless scaling to billions of word images
- Sub-second retrieval
13Text from OCR
[Figure: a Hindi page and a Telugu page]
- Hindi title: Praachin Bhaartiy Vichaar Aur Vibhutiyaan, published in 1624
- Telugu title: Andhra Vagmayaramba Dasha, published in 1960
14Text from OCR
[Figure: Hindi and Telugu sample pages with cuts marked]
15Text from OCR
[Figure: Hindi and Telugu sample pages with cuts and merges marked]
16Text from OCR
[Figure: Hindi and Telugu sample pages with cuts marked]
- Variations in script, font and typesetting.
17Text from OCR
[Figure: character-level OCR results for Hindi and Telugu, from [1]]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati. Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts. In ICDAR MOCR Workshop, 2011.
18Text from OCR
[Figure: word-level OCR results for Hindi and Telugu, from [1]]
19Text from OCR
[Figure: OCR-based search results for Hindi and Telugu]
20BoVW for Image Retrieval
[Figure: text retrieval and image recognition side by side; a query image yields ranked retrieved results]
Josef Sivic, Andrew Zisserman. Video Google: A Text Retrieval Approach to Object Matching in Videos. ICCV 2003.
21BoVW for Image Retrieval
- Fixed Length Representation
- Invariant to common deformations
[Figure: query image and ranked retrieved results]
22BoVW for Document Image Retrieval
R. Shekhar and C. V. Jawahar. Word Image
Retrieval Using Bag of Visual Words. In DAS, 2012.
23BoVW for Document Image Retrieval
Histogram of Visual Words
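A histogram of visual words can be sketched as below. This is a minimal illustration and not the authors' implementation: the toy 2-D codebook and descriptors and the names `nearest_word` and `bovw_histogram` are assumptions made for the example. In practice the descriptors would be SIFT vectors and the codebook would typically come from clustering.

```python
from collections import Counter

def nearest_word(descriptor, codebook):
    """Index of the codebook centre closest to the descriptor (squared Euclidean)."""
    def sq_dist(centre):
        return sum((d - c) ** 2 for d, c in zip(descriptor, centre))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

def bovw_histogram(descriptors, codebook):
    """L1-normalised histogram of visual-word counts for one word image."""
    counts = Counter(nearest_word(d, codebook) for d in descriptors)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(codebook)
    return [counts.get(i, 0) / total for i in range(len(codebook))]

# Toy data (assumed): three visual words and four local descriptors.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
descriptors = [(0.1, 0.0), (0.9, 1.2), (1.1, 0.8), (4.8, 5.1)]
hist = bovw_histogram(descriptors, codebook)
print(hist)
```

Every word image maps to a vector of the same length regardless of its width, which is the fixed-length property the earlier slide mentions. A cut or merge removes or adds a few local descriptors, so the histogram shifts slightly rather than failing outright.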
24BoVW for Document Image Retrieval
Cuts
25BoVW for Document Image Retrieval
Cuts
Histogram of Visual Words
26BoVW for Document Image Retrieval
Merges
27BoVW for Document Image Retrieval
Merges
Histogram of Visual Words
28BoVW for Document Image Retrieval
- Robust against degradation
- Geometry is lost
- Use spatial verification
- - SIFT based.
- - Longest subsequence alignment.
R. Shekhar and C. V. Jawahar. Word Image Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient Framework for Searching Text in Noisy Document Images. In DAS, 2012.
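One way to picture spatial verification by longest subsequence alignment is to order each image's visual words from left to right and compare the two sequences. The sketch below is an assumption-laden illustration rather than the method of either cited paper: the `(x, word_id)` keypoint format and the names `lcs_length`, `word_sequence` and `spatial_score` are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def word_sequence(keypoints):
    """Visual-word ids ordered left to right; keypoints are (x, word_id) pairs."""
    return [word for _, word in sorted(keypoints)]

def spatial_score(query_kps, candidate_kps):
    """Fraction of the query's word sequence preserved, in order, in the candidate."""
    q = word_sequence(query_kps)
    c = word_sequence(candidate_kps)
    return lcs_length(q, c) / len(q) if q else 0.0

# Toy keypoints (assumed). Both candidates contain the query's visual words,
# but only the first keeps them in the same left-to-right order.
query = [(10, 3), (20, 7), (30, 1), (40, 5)]
in_order = [(5, 3), (15, 7), (22, 9), (33, 1), (41, 5)]
reversed_order = [(5, 5), (15, 1), (25, 7), (35, 3)]
print(spatial_score(query, in_order), spatial_score(query, reversed_order))
```

A plain histogram cannot separate `query` from `reversed_order`, since both contain the same visual words. The ordered comparison restores the geometry the histogram discards and can be used to re-rank the top BoVW results.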
29Query Expansion
[Figure: the query image histogram is used to query the database, giving a refined histogram]
30Query Expansion
[Figure: the query histogram is used to query the database; results shown at ranks 1 to 6]
- Better results
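A common form of query expansion averages the query histogram with the histograms of its top-ranked results and queries again with the refined histogram. The sketch below assumes that form; the histogram-intersection similarity, the `top_k` value, the toy database and the function names are illustrative assumptions, not details taken from the slides.

```python
def intersection(h1, h2):
    """Histogram intersection similarity."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def search(query_hist, database, top_k=None):
    """Image ids ranked by decreasing similarity to the query histogram."""
    ranked = sorted(database,
                    key=lambda i: intersection(query_hist, database[i]),
                    reverse=True)
    return ranked if top_k is None else ranked[:top_k]

def expand_query(query_hist, database, top_k=2):
    """Average the query histogram with the histograms of its top_k results."""
    top = search(query_hist, database, top_k)
    hists = [query_hist] + [database[i] for i in top]
    return [sum(column) / len(hists) for column in zip(*hists)]

# Toy database (assumed). "c" is relevant, but the query lacks its dominant
# visual word (bin 2), for example because of a cut in the query image.
database = {
    "a": [0.5, 0.4, 0.1, 0.0],
    "b": [0.4, 0.3, 0.3, 0.0],
    "c": [0.1, 0.2, 0.7, 0.0],
    "d": [0.2, 0.2, 0.0, 0.6],
}
query = [0.6, 0.4, 0.0, 0.0]
first_pass = search(query, database)
refined = expand_query(query, database, top_k=2)
second_pass = search(refined, database)
print(first_pass, second_pass)
```

In this toy case the refined histogram picks up weight in bin 2 from the top results, so "c" moves ahead of the unrelated "d" on the second pass.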
31Text Query Support
- Originally formulated in a query by example
setting.
32Text Query Support
- Originally formulated in a query-by-example setting.
- Need text queries
[Figure: an input text query is converted into a text query histogram]
33Observations
- Are the results of OCR and BoVW complementary?
[Figure: comparison of the BoVW and OCR result sets]
34Observations
[Figure: mAP plotted against number of characters]
35Observations
- The OCR system has high precision, while the BoVW approach has high recall.
- Example (GT: 5)
- - OCR output list: precision 1, recall 0.4
- - BoVW output list: precision 0.8, recall 1
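The OCR figures in the example can be reproduced with a small calculation. The sketch assumes the ground truth holds five word images and that the OCR list returns two of them and nothing else; the slide does not show the actual lists, so the ids below are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved list against the ground truth."""
    relevant = set(relevant)
    hits = sum(1 for item in retrieved if item in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical ids: five ground-truth matches, OCR finds two and returns no errors.
ground_truth = ["w1", "w2", "w3", "w4", "w5"]
ocr_list = ["w1", "w2"]
ocr_precision, ocr_recall = precision_recall(ocr_list, ground_truth)
print(ocr_precision, ocr_recall)
```

A short but clean list gives perfect precision with low recall. This is the pattern that makes the OCR and BoVW outputs worth combining.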
36Fusion
- Fusion techniques
- - Naïve fusion
[Figure: mAP chart for OCR]
37Fusion
- Fusion techniques
- - Naïve fusion
[Figure: mAP chart for BoVW]
38Fusion
- Fusion techniques
- - Naïve fusion
- - Concatenate the OCR results with the BoVW results
[Figure: mAP chart for OCR and BoVW]
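Naïve fusion as described, concatenating the OCR results with the BoVW results, can be sketched in a few lines. Dropping BoVW entries that OCR already returned is an assumption made here so that no image appears twice; the function name and ids are hypothetical.

```python
def naive_fusion(ocr_results, bovw_results):
    """OCR results first, followed by BoVW results not already present."""
    fused = list(ocr_results)
    seen = set(ocr_results)
    for item in bovw_results:
        if item not in seen:
            fused.append(item)
            seen.add(item)
    return fused

# Hypothetical ranked lists of word-image ids.
fused = naive_fusion(["w1", "w2"], ["w3", "w1", "w4", "w2", "w5"])
print(fused)
```

Placing the OCR list first keeps its high-precision hits at the top, and the BoVW list then supplies the matches OCR missed.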
39Fusion
- Fusion techniques
- - Edit distance based fusion
[Figure: mAP chart for OCR and BoVW]
40Fusion
- Fusion techniques
- - Edit distance based fusion
- - Reorder the BoVW results using the BoVW score and a modified edit distance cost
[Figure: mAP chart for BoVW]
42Fusion
- Fusion techniques
- - Edit distance based fusion
[Figure: mAP chart for OCR and BoVW]
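The idea of reordering BoVW results with an edit distance cost can be sketched as follows. The slides do not define the modified cost, so this example substitutes the plain Levenshtein distance between the query text and each candidate's OCR output, mixed with the BoVW score by an assumed weight `alpha`. All names, scores and strings are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance with unit insert, delete and substitute costs."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # match or substitute
        prev = curr
    return prev[-1]

def rerank(query_text, candidates, alpha=0.5):
    """Reorder (image_id, bovw_score, ocr_text) candidates by a fused score.

    bovw_score is assumed to lie in [0, 1]; alpha weights it against the
    text similarity derived from the edit distance.
    """
    def fused_score(candidate):
        _, bovw_score, ocr_text = candidate
        longest = max(len(query_text), len(ocr_text), 1)
        text_similarity = 1.0 - edit_distance(query_text, ocr_text) / longest
        return alpha * bovw_score + (1 - alpha) * text_similarity
    ranked = sorted(candidates, key=fused_score, reverse=True)
    return [image_id for image_id, _, _ in ranked]

# Hypothetical BoVW output for the text query "bharat".
candidates = [
    ("img1", 0.80, "bharti"),
    ("img2", 0.75, "bharat"),
    ("img3", 0.70, "nagar"),
]
reranked = rerank("bharat", candidates)
print(reranked)
```

Here the candidate whose OCR text matches the query exactly overtakes one with a slightly higher BoVW score.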
43Fusion
- Fusion techniques
- - Hybrid fusion
[Figure: mAP chart for OCR and BoVW]
44Fusion
- Fusion techniques
- - Hybrid fusion
- - Re-query BoVW using the OCR retrieved results.
- - Combine the results using rank aggregation techniques.
[Figure: mAP chart for BoVW]
46Fusion
- Fusion techniques
- - Hybrid fusion
[Figure: mAP chart for OCR and BoVW]
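The slides do not say which rank aggregation technique the hybrid fusion uses. As one possibility, the sketch below applies a Borda count to several ranked lists, each imagined as the BoVW result of re-querying with one word image that OCR retrieved. The choice of Borda count, the function name and the lists are all assumptions.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Merge ranked lists: an item at position p in a list of length n earns n - p points."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - position
    # Highest total first; ties broken by item id so the output is deterministic.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical BoVW rankings, one per OCR-retrieved word image used as a query.
rankings = [
    ["a", "b", "c"],
    ["b", "a", "d"],
    ["b", "c", "a"],
]
aggregated = borda_aggregate(rankings)
print(aggregated)
```

Items that rank well across many re-queries rise to the top, while an item that appears in only one list, such as "d", stays low.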
47Experimental Results
48Experimental Details
- OCR [1]
- Feature detector
- - Harris interest point detection [2]
- Feature descriptor
- - SIFT [2]
- Indexing
- - Lucene [3]
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V. Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G. S. Lehal, and C. Bhagavati. Experiences of Integration and Performance Testing of Multilingual OCR for Printed Indian Scripts. In ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
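Treating visual words as index terms is what lets a text search engine such as Lucene index word images. The toy inverted index below illustrates that idea in plain Python. It does not use or imitate Lucene's API, and the class, its scoring rule (summed minimum counts) and the sample data are assumptions made for the example.

```python
from collections import Counter, defaultdict

class InvertedIndex:
    """Maps each visual word to the images that contain it."""

    def __init__(self):
        self.postings = defaultdict(dict)  # visual word -> {image_id: count}

    def add(self, image_id, visual_words):
        for word, count in Counter(visual_words).items():
            self.postings[word][image_id] = count

    def query(self, visual_words, top_k=10):
        """Rank images by the number of visual-word occurrences shared with the query."""
        scores = Counter()
        for word, query_count in Counter(visual_words).items():
            for image_id, count in self.postings.get(word, {}).items():
                scores[image_id] += min(query_count, count)
        return [image_id for image_id, _ in scores.most_common(top_k)]

# Hypothetical word images described by their visual-word ids.
index = InvertedIndex()
index.add("img1", [3, 7, 7, 1])
index.add("img2", [3, 9])
index.add("img3", [4, 4])
results = index.query([3, 7, 1])
print(results)
```

Only the postings of the query's own visual words are visited, so the cost of a query depends on those lists rather than on the total number of indexed images.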
49Test Bed
[Figure: sample word images]
Language       Books   Pages    Words       Annotation
Hindi (HS1)    11      1,000    362,593     Yes
Hindi (HS2)    52      10,196   4,290,864   No
Telugu (TS1)   11      1,000    161,276     Yes
Telugu (TS2)   69      13,871   2,531,069   No
DLI Corpus
- In addition, we used the fully annotated HP1 and TP1 datasets.
50Evaluation Measures
- Precision = TP / (TP + FP)
- Recall = TP / (TP + FN)
- mAP (Mean Average Precision)
- - Mean, over all queries, of the area under the precision-recall curve.
- Precision@10
- - Shows how accurate the top 10 retrieved results are.
TP: True Positive, FP: False Positive, FN: False Negative
[Figure: precision-recall curve]
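The ranking measures above can be computed as in the sketch below. It uses the common definition of average precision (the mean of the precision values at each rank where a relevant item appears) as an approximation of the area under the precision-recall curve. The function names and toy rankings are assumptions.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top k results that are relevant."""
    relevant = set(relevant)
    return sum(1 for item in ranked[:k] if item in relevant) / k

def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where relevant items occur."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (ranked_list, relevant_items) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

# Two hypothetical queries.
runs = [
    (["a", "x", "b", "y", "c"], {"a", "b", "c"}),
    (["x", "a"], {"a"}),
]
ap_first = average_precision(*runs[0])       # (1/1 + 2/3 + 3/5) / 3
p_at_5 = precision_at_k(runs[0][0], runs[0][1], k=5)
map_score = mean_average_precision(runs)
print(ap_first, p_at_5, map_score)
```

Precision@k looks only at the head of the list, whereas mAP rewards placing every relevant item early. The two measures can therefore move differently when comparing methods.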
51Comparison of naïve BoVW with BoVW Query Expansion
Language       Queries   BoVW Search        BoVW Query Expansion
                         mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       62.54   81.30      66.09   83.86
Telugu (TP1)   100       71.13   78         73.08   79.89
52Comparison of naïve BoVW with BoVW Text Query Support
Language       Queries   BoVW Search        BoVW using Text Queries
                         mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       62.54   81.30      56.32   73.89
Telugu (TP1)   100       71.13   78         69.06   78.83
53Comparative performance of different fusion techniques on HP1 and TP1
Language       Queries   Naïve              Edit Distance      Hybrid
                         mAP     Prec@10    mAP     Prec@10    mAP     Prec@10
Hindi (HP1)    100       75.66   90.7       79.58   90.8       80.37   91.4
Telugu (TP1)   100       76.02   81.2       78.01   81.4       80.23   83.7
54Performance statistics on DLI Annotated Corpus
Language       Queries   OCR                BoVW               Fusion
                         mAP     Prec@10    mAP     Prec@10    mAP     Prec@10
Hindi (HS1)    100       14.95   62.60      60.55   95.5       68.81   95.6
Telugu (TS1)   100       27.03   62.10      74.38   90.6       78.41   91.9
55Performance statistics on DLI Un-Annotated Corpus
Language       Queries   Precision@N   OCR     BoVW    Fusion
Hindi (HS2)    50        Prec@10       82.03   96.94   97.11
Hindi (HS2)    50        Prec@20       75.16   94.83   95.42
Hindi (HS2)    50        Prec@30       71.12   92.82   93.16
Telugu (TS2)   50        Prec@10       90.85   99.14   99.14
Telugu (TS2)   50        Prec@20       85.42   98.00   98.85
Telugu (TS2)   50        Prec@30       80.76   96.38   96.57
56Retrieved Results
57Retrieved Results
58Failure Cases
- The word images shown in the figure fail in both OCR and BoVW.
- Reasons
- - (a) The word image is short and contains a character that is no longer in use.
- - (b) The word image is highly degraded.
59Implementation Details
- Search engine development
- - An elegant web-based search and retrieval interface.
[Figure: Lucene scalability; axes: time in milliseconds, number of images, number of visual words]
[Figure: sample retrieved page]
60Search Architecture (Ongoing)
Ranked Results
61Ongoing Work
- Learn to improve from annotated datasets
- - Use a visual confusion matrix to improve BoVW results from annotated datasets.
- Necessity of costly features for re-ranking
- - The images shown in the failure cases would require costly features to be retrieved.
- Use of machine learning algorithms.
- Exploration of features better than SIFT.
62Thank You