Content Level Access to Digital Library of India Pages - PowerPoint PPT Presentation

About This Presentation
Title:

Content Level Access to Digital Library of India Pages

Description:

Content Level Access to Digital Library of India Pages. Praveen Krishnan, Ravi Shekhar, C.V. Jawahar. CVIT, IIIT Hyderabad.

Slides: 63

Transcript and Presenter's Notes

1
Content Level Access to Digital Library of India
Pages
  • Praveen Krishnan, Ravi Shekhar, C.V. Jawahar
  • CVIT, IIIT Hyderabad

2
Digital Library of India (DLI)
http://www.dli.iiit.ac.in/
Vision: To enhance access to information and
knowledge for the masses.
  • Partner in the Million Book Universal Digital
    Library Programme.

Information for people
Dataset for researchers
Vamshi Ambati, N. Balakrishnan, Raj Reddy, Lakshmi
Pratha, C. V. Jawahar. The Digital Library of India
Project: Process, Policies and Architecture.
ICDL, 2007.
3
Digital Library of India (DLI)
Vision: To enhance access to information and
knowledge for the masses.
Languages
  • 41 different languages
  • Includes
  • - Hindi, Telugu, Marathi, ...
  • - English, French, Greek, ...
Content Statistics
  • Books: 4 Lakhs
  • Pages: 134 Million
  • Words: 26 Billion

Source: http://www.new1.dli.ernet.in/
4
Digital Library of India (DLI)
Meta data search
  • Supports Meta data based search.
  • No Content Level Access

Indian freedom struggle and independence
Search
5
Digital Library of India (DLI)
  • Need Content Level Access
  • Content + Meta Data

Indian freedom struggle and independence
Search
6
Digital Library of India (DLI)
Reliable Text Representation?
  • Need Content Level Access
  • Content + Meta Data

Indian freedom struggle and independence
Search
7
Goal
Digital Library of India Search
  • Build a search engine with support for Indian
    languages.
  • Word Spotting

8
Goal
Indian Language Document Search Engine
Text Query Support
Page 1
9
Goal
Indian Language Document Search Engine
Multi Keyword Support
Page 1
10
Goal
Indian Language Document Search Engine
Ranks based on Occurrences
Page 1
11
Goal
Indian Language Document Search Engine
Semantically Related Words
Page 1
12
Goal
Indian Language Document Search Engine
Seamless scaling to billions of word images.
Sub-second retrieval.
Page 1
13
Text from OCR
Hindi Page
Telugu Page
  • Hindi Title: Praachin Bhaartiy Vichaar Aur
    Vibhutiyaan, Published in 1624
  • Telugu Title: Andhra Vagmayaramba Dasha,
    Published in 1960
14
Text from OCR
Hindi Page
Telugu Page
Cuts
Cuts
15
Text from OCR
Hindi Page
Telugu Page
Cuts
Merges
16
Text from OCR
Hindi Page
Telugu Page
Cuts
Variations in Script, Font and Typesetting.
17
Text from OCR
Char
Hindi
Telugu
[1] D. Arya, T. Patnaik, S. Chaudhury, C. V.
Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G.
S. Lehal, and C. Bhagavati, Experiences of
Integration and Performance Testing of
Multilingual OCR for Printed Indian Scripts, in
ICDAR MOCR Workshop, 2011.
18
Text from OCR
Word
Hindi
Telugu
19
Text from OCR
Search
Hindi
Telugu
20
BoVW for Image Retrieval
Text Retrieval
Image Recognition
Query Image
Ranked Retrieved Results
Josef Sivic, Andrew Zisserman. Video Google: A
Text Retrieval Approach to Object Matching in
Videos. ICCV, 2003.
21
BoVW for Image Retrieval
  • Fixed Length Representation
  • Invariant to common deformations

Query Image
Ranked Retrieved Results
22
BoVW for Document Image Retrieval
R. Shekhar and C. V. Jawahar. Word Image
Retrieval Using Bag of Visual Words. In DAS, 2012.
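A minimal sketch of the bag-of-visual-words representation, assuming local descriptors (for example SIFT) have already been extracted from each word image and a codebook has already been learned by clustering. The function names, the toy 2-D codebook, and the cosine ranking are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bovw_histogram(descriptors, codebook):
    """Quantize each descriptor to its nearest codebook entry and count."""
    # Squared Euclidean distance between every descriptor and every visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d2.argmin(axis=1)  # visual word id per descriptor
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist  # L2-normalised, fixed length

def rank_by_cosine(query_hist, database_hists):
    """Return database indices sorted from most to least similar, plus scores."""
    scores = database_hists @ query_hist  # cosine, since rows are unit length
    return np.argsort(-scores, kind="stable"), scores

# Toy data: a 3-word codebook in 2-D (real codebooks are far larger).
codebook = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
query = bovw_histogram(np.array([[0.1, 0.2], [9.8, 0.1], [10.2, -0.3]]), codebook)
db = np.stack([
    bovw_histogram(np.array([[0.0, 9.9], [0.2, 10.1]]), codebook),              # unrelated
    bovw_histogram(np.array([[0.3, 0.0], [9.9, 0.2], [10.1, 0.1]]), codebook),  # similar
])
order, scores = rank_by_cosine(query, db)
```

Because every image maps to a vector of the same length regardless of how many descriptors it produced, word images of different widths can be compared directly.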
23
BoVW for Document Image Retrieval
Histogram of Visual Words
24
BoVW for Document Image Retrieval
Cuts
25
BoVW for Document Image Retrieval
Cuts
Histogram of Visual Words
26
BoVW for Document Image Retrieval
Merges
27
BoVW for Document Image Retrieval
Merges
Histogram of Visual Words
28
BoVW for Document Image Retrieval
  • Robust against degradation
  • Lost Geometry
  • Use Spatial Verification
  • SIFT based.
  • Longest Subsequence alignment.
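One way to restore some of the geometry that a histogram discards is to order the visual words of each word image from left to right and align the two sequences. The sketch below uses the standard longest common subsequence (LCS) recurrence for the alignment. Reading "Longest Subsequence alignment" as LCS, the normalisation, and the toy word ids are all assumptions made for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of sequences a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def spatial_score(query_words, candidate_words):
    """LCS length normalised by the longer sequence, in [0, 1]."""
    longest = max(len(query_words), len(candidate_words))
    return lcs_length(query_words, candidate_words) / longest if longest else 0.0

def rerank(query_words, candidates):
    """Re-order (name, visual_word_sequence) candidates by spatial score."""
    return sorted(candidates,
                  key=lambda c: spatial_score(query_words, c[1]),
                  reverse=True)

query_words = [4, 7, 7, 2, 9]  # visual word ids sorted by x-coordinate
candidates = [
    ("same words, shuffled", [9, 2, 7, 7, 4]),
    ("same order, one cut", [4, 7, 2, 9]),
]
reranked = rerank(query_words, candidates)
```

Both candidates have nearly the same histogram as the query, but only the second preserves the left-to-right order, so it moves to the top after verification.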

R. Shekhar and C. V. Jawahar. Word Image
Retrieval Using Bag of Visual Words. In DAS, 2012.
I. Z. Yalniz and R. Manmatha. An Efficient
Framework for Searching Text in Noisy Document
Images. In DAS, 2012.
29
Query Expansion
Querying
Database
Query Image
Query Image
Histogram
Refined Histogram
30
Query Expansion
Querying
Database
Query Image
Query Histogram
Rank 1
Rank 2
Rank 3
Rank 4
Rank 5
Rank 6
Better Results
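The slides show a query histogram being refined from the first round of results. A common way to do this is average query expansion: re-query with the combination of the original histogram and the top-ranked database histograms. The sketch below assumes that scheme, L2-normalised histograms, and a hypothetical `top_k` parameter; the slides do not state which refinement was actually used.

```python
import numpy as np

def l2n(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def search(query_hist, db_hists):
    """Rank database rows by cosine similarity to the query."""
    return np.argsort(-(db_hists @ query_hist), kind="stable")

def expand_query(query_hist, db_hists, top_k=2):
    """Sum the query with its top_k results and re-normalise.

    Up to scale this equals averaging them, which is all cosine ranking needs.
    """
    top = search(query_hist, db_hists)[:top_k]
    return l2n(query_hist + db_hists[top].sum(axis=0))

db = np.stack([l2n(np.array(h, dtype=float)) for h in [
    [3, 1, 0, 0],  # 0: close to the query
    [2, 1, 1, 0],  # 1: close to the query, also contains visual word 2
    [0, 1, 3, 0],  # 2: shares no visual word with the original query
    [0, 0, 0, 4],  # 3: unrelated
]])
query = l2n(np.array([1.0, 0.0, 0.0, 0.0]))
first_pass = search(query, db)
refined = expand_query(query, db, top_k=2)
second_pass = search(refined, db)
```

After expansion, image 2 gets a non-zero score because the refined histogram has picked up visual words from the first-pass results; this is the mechanism by which expansion can improve recall.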
31
Text Query Support
  • Originally formulated in a query by example
    setting.

32
Text Query Support
  • Originally formulated in a query by example
    setting.
  • Need Text Queries

Input Text Query
Text Query Histogram
33
Observations
  • Are the results of OCR and BoVW complementary?

BoVW OCR
OCR BoVW
34
Observations
  • mAP vs. Word Length

mAP
No. of Characters
35
Observations
  • The OCR system has high precision, while the BoVW
    approach has high recall.
  • Example: GT = 5

OCR output list: Precision = 1, Recall = 0.4
BoVW output list: Precision = 0.8, Recall = 1
36
Fusion
  • Fusion Techniques-
  • Naïve Fusion

mAP Chart
OCR
37
Fusion
  • Fusion Techniques-
  • Naïve Fusion

mAP Chart
BoVW
38
Fusion
  • Fusion Techniques-
  • Naïve Fusion
  • Concatenating OCR Results with BoVW

mAP Chart
OCR
BoVW
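Read literally, naïve fusion places the OCR result list first (high precision) and appends the BoVW list (high recall) after it. The sketch below assumes that order and drops duplicates; the word-image identifiers are made up.

```python
def naive_fusion(ocr_results, bovw_results):
    """Concatenate OCR results with BoVW results, keeping the first occurrence."""
    fused, seen = [], set()
    for item in list(ocr_results) + list(bovw_results):
        if item not in seen:
            seen.add(item)
            fused.append(item)
    return fused

ocr_results = ["w12", "w40"]                        # few, but usually correct
bovw_results = ["w40", "w07", "w12", "w33", "w91"]  # more complete, noisier
fused = naive_fusion(ocr_results, bovw_results)
```

The fused list keeps the OCR hits at the top and lets BoVW fill in the word images that OCR missed.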
39
Fusion
  • Fusion Techniques-
  • Edit Distance Based Fusion

mAP Chart
OCR
BoVW
40
Fusion
  • Fusion Techniques-
  • Edit Distance Based Fusion

mAP Chart
  • Reordering BoVW
  • BoVW score
  • Modified Edit distance cost

BoVW
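A hedged sketch of reordering a BoVW result list with an edit-distance term. The slides mention a "modified" edit distance cost without defining it, so this example substitutes the plain Levenshtein distance between the query text and the OCR output of each retrieved word image, and mixes it with the BoVW score through a hypothetical weight `alpha`. The transliterated strings are placeholders.

```python
def edit_distance(s, t):
    """Plain Levenshtein distance (unit cost insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute or match
        prev = curr
    return prev[-1]

def text_similarity(query, ocr_text):
    """1 minus the edit distance normalised by the longer string."""
    longest = max(len(query), len(ocr_text))
    return 1.0 - edit_distance(query, ocr_text) / longest if longest else 1.0

def rerank_bovw(query, bovw_results, alpha=0.5):
    """bovw_results: list of (image_id, bovw_score in [0, 1], ocr_text)."""
    def fused(r):
        return alpha * r[1] + (1 - alpha) * text_similarity(query, r[2])
    return sorted(bovw_results, key=fused, reverse=True)

results = [
    ("img_a", 0.90, "bharat"),    # visually close, OCR reads a different word
    ("img_b", 0.85, "swatantr"),  # OCR text one character short of the query
    ("img_c", 0.40, "xyz"),
]
reranked = rerank_bovw("swatantra", results)
```

Here `img_b` overtakes `img_a` because its OCR text nearly matches the query, even though its BoVW score is slightly lower.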
42
Fusion
  • Fusion Techniques-
  • Edit Distance Based Fusion

mAP Chart
OCR
BoVW
43
Fusion
  • Fusion Techniques-
  • Hybrid Fusion

mAP Chart
OCR
BoVW
44
Fusion
  • Fusion Techniques-
  • Hybrid Fusion

mAP Chart
  • Re-querying BoVW using OCR retrieved results.
  • Using rank aggregation techniques.

BoVW
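A sketch of the hybrid idea under stated assumptions: each word image returned by OCR is used as a new BoVW query, and the resulting ranked lists are merged by rank aggregation. The slides do not name the aggregation method, so a Borda count is used here purely as an example; `bovw_search` and the toy index are hypothetical.

```python
from collections import defaultdict

def borda_aggregate(ranked_lists):
    """Combine ranked lists: position p in a list of length n earns n - p points."""
    points = defaultdict(int)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, item in enumerate(ranking):
            points[item] += n - position
    # Sort by points (descending), then by item id to break ties deterministically.
    return sorted(points, key=lambda item: (-points[item], item))

def hybrid_fusion(ocr_hits, bovw_search):
    """Re-query BoVW once per OCR hit, then aggregate the resulting rankings."""
    return borda_aggregate([bovw_search(hit) for hit in ocr_hits])

# Hypothetical BoVW results, keyed by the word image used as the query.
fake_index = {
    "w12": ["w12", "w40", "w07", "w33"],
    "w40": ["w40", "w12", "w33", "w91"],
}
fused = hybrid_fusion(["w12", "w40"], fake_index.__getitem__)
```

Items that appear high in several of the re-query lists accumulate the most points, so agreement between lists is rewarded.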
46
Fusion
  • Fusion Techniques-
  • Hybrid Fusion

mAP Chart
OCR
BoVW
47
Experimental Results
48
Experimental Details
  • OCR [1]
  • Feature Detector
  • Harris interest point detection [2]
  • Feature Descriptor
  • SIFT [2]
  • Indexing
  • Lucene [3]

[1] D. Arya, T. Patnaik, S. Chaudhury, C. V.
Jawahar, B. B. Chaudhury, A. G. Ramakrishnan, G.
S. Lehal, and C. Bhagavati, Experiences of
Integration and Performance Testing of
Multilingual OCR for Printed Indian Scripts, in
ICDAR MOCR Workshop, 2011.
[2] http://www.vlfeat.org
[3] http://lucene.apache.org/
49
Test Bed
Sample Word Images
Language      Books  Pages   Words      Annotation
Hindi (HS1)   11     1,000   362,593    Yes
Hindi (HS2)   52     10,196  4,290,864  No
Telugu (TS1)  11     1,000   161,276    Yes
Telugu (TS2)  69     13,871  2,531,069  No
DLI Corpus
  • In addition, we used the fully annotated HP1 and TP1
    datasets.

50
Evaluation Measures
  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • mAP (Mean Average Precision)
  • Mean of the area under the precision-recall
    curve over all the queries.
  • Precision@10
  • Shows how accurate the top 10 retrieved
    results are.

TP: True Positive, FP: False Positive, FN: False
Negative
Precision-Recall Curve
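The measures above can be sketched in a few lines. Average precision is computed here in its usual non-interpolated form (the mean of the precision values at the rank of each relevant hit, divided by the number of relevant items), which approximates the area under the precision-recall curve. The function names and toy lists are illustrative; the first example reuses the earlier case of five ground-truth items and a short, fully correct OCR list.

```python
def precision_recall(retrieved, relevant):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    relevant = set(relevant)
    tp = sum(1 for r in retrieved if r in relevant)
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

def precision_at_k(retrieved, relevant, k=10):
    """Fraction of the top-k results that are relevant (divides by k)."""
    relevant = set(relevant)
    return sum(1 for r in retrieved[:k] if r in relevant) / k

def average_precision(retrieved, relevant):
    """Sum of precision at each relevant hit, divided by the number of relevant items."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(runs):
    """runs: list of (retrieved, relevant) pairs, one per query."""
    return sum(average_precision(r, g) for r, g in runs) / len(runs)

ground_truth = ["a", "b", "c", "d", "e"]  # 5 relevant word images
ocr_list = ["a", "b"]                     # short, all correct
p, r = precision_recall(ocr_list, ground_truth)
```

With two correct results out of five relevant items, `p` is 1.0 and `r` is 0.4, matching the OCR example on the Observations slide.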
51
                        BoVW Search       BoVW Query Expansion
Language      Queries   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100       62.54   81.30     66.09   83.86
Telugu (TP1)  100       71.13   78        73.08   79.89
Comparison of naïve BoVW with BoVW Query
Expansion
52
                        BoVW Search       BoVW using Text Queries
Language      Queries   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100       62.54   81.30     56.32   73.89
Telugu (TP1)  100       71.13   78        69.06   78.83
Comparison of naïve BoVW with BoVW Text Query
Support
53
                        Naïve             Edit Distance     Hybrid
Language      Queries   mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HP1)   100       75.66   90.7      79.58   90.8      80.37   91.4
Telugu (TP1)  100       76.02   81.2      78.01   81.4      80.23   83.7
Comparative performance of different fusion
techniques on HP1 and TP1
54
                        OCR               BoVW              Fusion
Language      Queries   mAP     Prec@10   mAP     Prec@10   mAP     Prec@10
Hindi (HS1)   100       14.95   62.60     60.55   95.5      68.81   95.6
Telugu (TS1)  100       27.03   62.10     74.38   90.6      78.41   91.9
Performance statistics on DLI Annotated Corpus
55
Language      Queries   Precision@N   OCR     BoVW    Fusion
Hindi (HS2)   50        Prec@10       82.03   96.94   97.11
Hindi (HS2)   50        Prec@20       75.16   94.83   95.42
Hindi (HS2)   50        Prec@30       71.12   92.82   93.16
Telugu (TS2)  50        Prec@10       90.85   99.14   99.14
Telugu (TS2)  50        Prec@20       85.42   98.00   98.85
Telugu (TS2)  50        Prec@30       80.76   96.38   96.57
Performance statistics on DLI Un-Annotated Corpus
56
Retrieved Results
57
Retrieved Results
58
Failure Cases
  • The word images shown in the figure fail in both
    OCR and BoVW.
  • Reasons
  • (a) A short word image containing a character
    that is no longer in use.
  • (b) A highly degraded word image.

59
Implementation Details
  • Search Engine Development
  • An elegant web-based search and retrieval
    interface.

Lucene Scalability
Time in milliseconds
No of Images
Sample Retrieved Page
No of Visual Words
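Lucene is a Java library built around an inverted index. The toy Python class below is only a stand-in to show why indexing visual words this way keeps lookups fast: a query touches only the posting lists of the visual words it contains, not every image in the collection. The class, its methods, and the histogram-intersection scoring are hypothetical and are not Lucene's API.

```python
from collections import Counter, defaultdict

class VisualWordIndex:
    def __init__(self):
        # visual word id -> {image id: count of that word in the image}
        self.postings = defaultdict(dict)

    def add(self, image_id, visual_words):
        for word, count in Counter(visual_words).items():
            self.postings[word][image_id] = count

    def search(self, query_words, top_k=10):
        """Score images by summed min(query count, image count) over shared words."""
        scores = Counter()
        for word, q_count in Counter(query_words).items():
            # .get avoids creating empty posting lists for unseen words.
            for image_id, count in self.postings.get(word, {}).items():
                scores[image_id] += min(q_count, count)
        return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]

index = VisualWordIndex()
index.add("page3_word17", [5, 5, 8, 21])
index.add("page9_word02", [5, 8, 8, 13])
index.add("page1_word44", [30, 31])
hits = index.search([5, 5, 8])
```

The third image shares no visual word with the query, so it is never visited during the search.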
60
Search Architecture (Ongoing)
Ranked Results
61
Ongoing Work
  • Learn to improve from annotated dataset
  • Use of a visual confusion matrix to improve BoVW
    results from annotated datasets.
  • Necessity of Costly Features for Re-ranking
  • The images shown in the failure cases would require
    costly features to be retrieved.
  • Use of machine learning algorithms.
  • Exploration of features better than SIFT.

62
Thank You