Title: 1. OCR Engines for Printed Documents 2. Development of OHR Modules
1 1. OCR Engines for Printed Documents; 2. Development of OHR Modules
- C.V. Jawahar
- International Institute of Information Technology
- Gachibowli, Hyderabad.
2 Summary
- Title: OCR Engines for Printed Documents
- Proposers: C.V. Jawahar, P.J. Narayanan, Anoop Namboodiri
- Institution: International Institute of Information Technology, Hyderabad
- Languages: Telugu, Malayalam, Hindi
- Components to be Implemented
- Data Collection and Annotation
- Page Segmentation
- Modular Classifier Engine
- Adaptive OCR Framework
3 Language: Telugu, Hindi, Malayalam
Name the component: Annotated data, with data annotation tool and formats
List the technique(s) that will be used: Qt, platform independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Large corpus (200,000 words x 3).
Name the domain for which the performance will be optimized: Not applicable
Name other evaluation metrics in addition to the domain: Evaluation metrics based on this data can be defined
4 Annotated Data Collection and APIs
- Existing Datasets
- Need benchmark datasets of Indian scripts
- Should represent style, font, quality variations
- Essential for reliable performance measurements
- Features
- Data from 3 different types of scripts: Hindi, Telugu, Malayalam
- Annotated at character, word and region levels
- Representation formats clearly specified in consultation with experts
- APIs for reading and storing in the specified format
- Data selected from different sources for font, style, layout and quality variations
- Dataset Specs
- 200,000 annotated words each in 3 scripts
- Benchmark for evaluation
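The three annotation levels and the read/store API listed above could be sketched as follows. This is only an illustration in Python for brevity: the actual representation format is to be fixed in consultation with experts, so the JSON layout, the class names, the box convention and the functions `dumps_page` and `loads_page` are all assumptions.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List, Tuple

# Assumed convention: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]

@dataclass
class CharAnnotation:
    box: Box
    text: str  # Unicode text of one character or akshara

@dataclass
class WordAnnotation:
    box: Box
    text: str
    chars: List[CharAnnotation] = field(default_factory=list)

@dataclass
class RegionAnnotation:
    box: Box
    kind: str  # hypothetical labels such as "text" or "graphics"
    words: List[WordAnnotation] = field(default_factory=list)

@dataclass
class PageAnnotation:
    image: str   # name of the scanned page image
    script: str  # e.g. "Telugu", "Hindi", "Malayalam"
    regions: List[RegionAnnotation] = field(default_factory=list)

def dumps_page(page: PageAnnotation) -> str:
    """Store: serialize a page annotation to a JSON string (assumed format)."""
    return json.dumps(asdict(page), ensure_ascii=False, indent=2)

def loads_page(text: str) -> PageAnnotation:
    """Read: rebuild the region/word/character hierarchy from a JSON string."""
    data = json.loads(text)
    regions = []
    for r in data["regions"]:
        words = []
        for w in r["words"]:
            chars = [CharAnnotation(tuple(c["box"]), c["text"]) for c in w["chars"]]
            words.append(WordAnnotation(tuple(w["box"]), w["text"], chars))
        regions.append(RegionAnnotation(tuple(r["box"]), r["kind"], words))
    return PageAnnotation(data["image"], data["script"], regions)
```

Keeping characters nested inside words and words inside regions lets one file serve evaluation at any of the three levels.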
5 Data Collection and Annotation Schedule
- Notes
- Useful Benchmark
- Data made available to all as and when it gets ready
6 Language: Script independent
Name the component: Adaptive Page Segmentation Algorithm
List the technique(s) that will be used: Qt, platform independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Comparable with the state of the art in English.
Name the domain for which the performance will be optimized: Printed documents, various quality levels
Name other evaluation metrics in addition to the domain: Splits/Merges to Ground Truth
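One possible reading of the splits/merges metric is sketched below in Python. The slides do not define the metric precisely, so the definitions used here are assumptions: a ground-truth region counts as split when more than one detected region overlaps it significantly, and a detected region counts as a merge when it overlaps more than one ground-truth region. The function names and the `min_overlap` threshold are hypothetical.

```python
def intersection_area(a, b):
    """Area shared by two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)

def overlaps(a, b, min_overlap):
    """True if the shared area covers at least min_overlap of the smaller box."""
    smaller = min(a[2] * a[3], b[2] * b[3])
    return smaller > 0 and intersection_area(a, b) / smaller >= min_overlap

def count_splits_and_merges(truth, detected, min_overlap=0.1):
    """Return (splits, merges) of detected regions against ground truth.

    splits: ground-truth regions covered by more than one detected region.
    merges: detected regions covering more than one ground-truth region.
    """
    splits = sum(
        1 for t in truth
        if sum(overlaps(t, d, min_overlap) for d in detected) > 1
    )
    merges = sum(
        1 for d in detected
        if sum(overlaps(t, d, min_overlap) for t in truth) > 1
    )
    return splits, merges
```

Measuring overlap against the smaller box keeps a small fragment inside a large region from being ignored; other normalizations are equally plausible.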
7 Adaptive Page Segmentation Module
- Current Segmentation Algorithms
- Graph-based
- White-space based
- Texture-based
- Each works in specific situations but not in all cases, and in particular not on Indian language (IL) documents.
- Features
- Trainable from examples
- Detects and labels text and graphics regions
- XML Representation to suit the reconstruction
- Specially suited for classes of Indian scripts (scripts with shirorekha, scripts with scattered components)
- Provision for supervised and unsupervised learning
- Robust to print quality degradations
- A Robust Page Segmentation Algorithm
- Input: Document image
- Output: XML description of regions
- Implementation: Modular (C)
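To make the "XML description of regions" output concrete, here is a small Python sketch that writes and reads labeled regions. The element and attribute names (`page`, `region`, `type`, `x`, `y`, `width`, `height`) and both function names are assumptions for illustration; the slides do not specify the schema.

```python
import xml.etree.ElementTree as ET

def regions_to_xml(image_name, regions):
    """Serialize labeled regions to an XML string.

    regions: list of (label, (x, y, width, height)) pairs, where label is
    a hypothetical region type such as "text" or "graphics".
    """
    page = ET.Element("page", {"image": image_name})
    for index, (label, (x, y, w, h)) in enumerate(regions):
        ET.SubElement(page, "region", {
            "id": str(index),
            "type": label,
            "x": str(x),
            "y": str(y),
            "width": str(w),
            "height": str(h),
        })
    return ET.tostring(page, encoding="unicode")

def xml_to_regions(xml_text):
    """Parse the XML produced by regions_to_xml back into (label, box) pairs."""
    page = ET.fromstring(xml_text)
    keys = ("x", "y", "width", "height")
    return [
        (region.get("type"), tuple(int(region.get(k)) for k in keys))
        for region in page.findall("region")
    ]
```

A flat list of typed boxes is enough for text/graphics labeling; a nested schema would be needed if reading order or region hierarchy has to be reconstructed.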
8 Page Segmentation Plans
- Notes
- Script-specific limitations will be addressed
- Performance comparable to the state of the art in English
- Fully shared among participants
9 Language: Script independent
Name the component: Modular Classification Algorithm
List the technique(s) that will be used: C, platform independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Comparable with the state of the art in English.
Name the domain for which the performance will be optimized: Printed documents, various quality levels
Name other evaluation metrics in addition to the domain: Percentage accuracy for specific datasets.
10 Modular Classification System
- Classifiers for OCR
- Large number of classes
- Similar shapes
- Complex decision boundaries
- Reduced accuracies due to cascading errors.
- Features
- Designed as ensemble of modular classifiers
- Highly accurate component classifiers using techniques like SVMs
- Optimal classifier combination schemes
- Boosting the classifier performance
- Discriminative features for similar shapes
- Compact and efficient
- Interfaces as per common agreement
- An Efficient Modular Classifier
- Input: Component image
- Output: Component class label
- Implementation: C
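The idea of an ensemble of modular component classifiers can be illustrated with a pairwise (one-vs-one) scheme combined by majority vote. The Python sketch below is an assumption-laden stand-in: each component is a nearest-centroid classifier rather than an SVM, majority voting is only one of many possible combination schemes, and the names `train_pairwise` and `classify` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def centroid(vectors):
    """Mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_pairwise(samples):
    """Build one binary component classifier per pair of classes.

    samples: dict mapping class label -> list of feature vectors.
    Each component stores the two class centroids (a stand-in for an SVM).
    """
    centroids = {label: centroid(vs) for label, vs in samples.items()}
    return [
        (a, b, centroids[a], centroids[b])
        for a, b in combinations(sorted(samples), 2)
    ]

def classify(classifiers, x):
    """Combine the component decisions by majority vote."""
    votes = Counter()
    for a, b, centre_a, centre_b in classifiers:
        closer_to_a = squared_distance(x, centre_a) <= squared_distance(x, centre_b)
        votes[a if closer_to_a else b] += 1
    return votes.most_common(1)[0][0]
```

Because every component only separates two classes, a pair of similar shapes can be given its own discriminative features without retraining the rest of the ensemble.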
11 Classifier Development Plans
- Notes
- Generic code, useful for all scripts
- Fully shared among participants
12 Adaptive Learnable OCR Framework
[Figure: diagram with the labels OCR, Learning Framework, Text Documents, Document Images]
13 Language: Telugu
Name the component: OHR System for Telugu
List the technique(s) that will be used: C, platform independent (Linux, Windows)
What is the performance of these techniques in other languages? Part of the techniques (preprocessing, classification engine) will be useful for other scripts. The complete engine is for Telugu only.
Give an estimate of the expected performance: Around 95% on legible handwriting
Name the domain for which the performance will be optimized: High resolution digitizers
Name other evaluation metrics in addition to the domain: Percentage accuracy for specific datasets.
14 Online Handwriting Recognizer for Telugu
- Current Algorithms
- No existing algorithm for Telugu OHR
- Script layout is much more complex
- Large number of classes or aksharas
- Features
- Trainable from examples
- Combine online and offline features for high accuracy
- Spline-based pre-processing that is resilient to various noise distributions
- Human motor model based stroke representation for efficiency and accuracy
- Use of discriminant features
- Proposed Algorithm
- Input: OHR data from high resolution digitizers
- Output: Unicode representation of the data
- Implementation: Modular (C)
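As a rough illustration of stroke pre-processing for digitizer data, the Python sketch below resamples a pen stroke to a fixed number of points equally spaced along its length. Note that this uses plain linear interpolation, a deliberately simpler stand-in and not the spline-based method proposed above; the function name `resample_stroke` and the point format are assumptions.

```python
import math

def resample_stroke(points, n):
    """Return n points equally spaced along the polyline through `points`.

    points: list of (x, y) pen samples in writing order.
    Uses linear interpolation between samples (not a spline fit).
    """
    if n < 2 or len(points) < 2:
        raise ValueError("need at least two input points and n >= 2")

    # Cumulative arc length at each input sample.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]
    if total == 0:
        return [tuple(points[0])] * n  # pen did not move

    resampled = []
    segment = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(points) - 2 and cumulative[segment + 1] < target:
            segment += 1
        length = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if length == 0 else (target - cumulative[segment]) / length
        x0, y0 = points[segment]
        x1, y1 = points[segment + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled
```

Equal spacing removes the dependence on writing speed, which is one reason resampling commonly precedes feature extraction in online handwriting systems.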
15 Recognizer Development Plans
- Notes
- Fully shared among participants
- Pre-processing modules useful for other languages
16 Language: Telugu
Name the component: Annotated data from 100 writers, 1000 words each
List the technique(s) that will be used: Tool in Qt, platform independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Large corpus (100,000 words).
Name the domain for which the performance will be optimized: Data from high resolution digitizers
Name other evaluation metrics in addition to the domain: Evaluation metrics based on this data can be defined
17 Annotated Data Collection and APIs
- Existing Datasets
- Existing datasets are mostly independent characters or digits
- Need benchmark data in each language
- Essential for reliable performance measurements and comparisons
- Features
- Data from 100 Telugu writers, each writing 1000 words in running text under natural settings
- Annotated at character, word and region levels
- Representation formats clearly specified in consultation with experts
- APIs for reading and storing in the specified format
- Data selected from different sources for font, style, layout and quality variations
- Dataset Specs
- Writers: 100 native writers, 1000 words each
- Device: High resolution digitizers
18 Data Collection and Annotation Schedule
- Notes
- Useful Benchmark
- Data made available to all as and when it gets ready
19 Demo 1: Recognizing Station Names
20 Investigating Team
- C.V. Jawahar completed his PhD at IIT-Kharagpur, with a thesis on aspects of image segmentation. He has done extensive work in the fields of OCR, Document Understanding, and Information Search and Retrieval from Digital Libraries. He has initiated several activities at our institute, including the development of efficient and accurate OCR engines and page layout analysis for digital libraries, especially for Indian languages. He has several publications on these topics.
- Anoop Namboodiri holds a PhD in Pattern Recognition from Michigan State University; his thesis dealt with the understanding of handwritten as well as printed documents. He has been working on topics related to Document Segmentation and Understanding, Handwriting Recognition, and other aspects of Pattern Recognition. He has initiated or been actively involved in activities related to Handwriting Recognition, Document Layout Analysis for Digital Libraries, and OCR at the institute.
- P.J. Narayanan holds a PhD in Computer Vision from the University of Maryland. He has done extensive work in the fields of Image Segmentation, Scene Understanding, and Image Rendering. He has been actively involved in research and development activities related to Digital Libraries.
21 Publications
- C.V. Jawahar, M. Meshesha and A. Balasubramanian, "Searching in Document Images", Proc. of Indian Conference on Vision Graphics and Image Processing, 2004, Calcutta, India, pp. 622--627.
- A. Balasubramanian, M. Meshesha and C.V. Jawahar, "Retrieval from Document Image Collections", Proc. of 7th IAPR Workshop on Document Analysis Systems, Nelson, New Zealand, 2006 (LNCS 3872), pp. 1--12.
- S. Rawat, K.S. Kumar, M. Meshesha, A. Balasubramanian, I.D. Sikdar, and C.V. Jawahar, "A Semi-Automatic Adaptive OCR for Digital Libraries", Proc. of 7th IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp. 13--24.
- P. Sankar, V. Ambati, L. Hari and C.V. Jawahar, "Digitizing A Million Books: Challenges for Document Analysis", Proc. of 7th IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp. 425--436.
22 Publications
- K.S. Kumar, A.M. Namboodiri and C.V. Jawahar, "Learning to Segment Document Images", Proceedings of the 1st International Conference on Pattern Recognition and Machine Intelligence, Kolkata, India, December 2005, pp. 471--476.
- M. Meshesha and C.V. Jawahar, "Recognition of Printed Amharic Documents", Proceedings of 8th International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, Vol. 1, pp. 784--788.
- M.P. Kumar and C.V. Jawahar, "Configurable Hybrid Architectures for Character Recognition Applications", Proceedings of 8th International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, Vol. 1, pp. 1199--1203.
- C.V. Jawahar, MNSSK Pavan Kumar and S.S. Ravikiran, "A Bilingual OCR system for Hindi-Telugu Documents and its Applications", Proceedings of the International Conference on Document Analysis and Recognition, Aug. 2003, Edinburgh, Scotland, pp. 408--413.
23 Publications
- M.N.S.S.K. Pavan Kumar and C.V. Jawahar, "Design of Hierarchical Classifier with Hybrid Architectures", Proceedings of First International Conference on Pattern Recognition and Machine Intelligence (PReMI 2005), Kolkata, India, December 2005, pp. 276--279.
- Ranjith Kumar, Vamsi Chaitanya and C.V. Jawahar, "A Novel Approach to Script Separation", Proceedings of the International Conference on Advances in Pattern Recognition (ICAPR), Dec. 2003, Calcutta, India, pp. 289--292.
- MNSSK Pavan Kumar and C.V. Jawahar, "On Improving Design of Multiclass Classifiers", Proceedings of the International Conference on Advances in Pattern Recognition (ICAPR), Dec. 2003, Calcutta, India, pp. 109--112.
- MNSSK Pavan Kumar, S.S. Ravikiran, Abhishek Nayani, C.V. Jawahar and P.J. Narayanan, "Tools for Developing OCRs for Indian Scripts", Proceedings of the Workshop on Document Image Analysis and Retrieval (DIAR, CVPR'03), Jun. 2003, Madison, WI.
24 Publications
- Mudit Agrawal, M.N.S.S.K. Pavan Kumar and C.V. Jawahar, "Indexing and Retrieval of Devanagari Text from Printed Documents", Proceedings of the National Conference on Document Analysis and Recognition (NCDAR), Jul. 2003, Mandya, India, pp. 244--251.
- Vishwanatha Kaushik and C.V. Jawahar, "Detection of Devanagari Text in Digital Images using Connected Component Analysis", Proceedings of the National Conference on Document Analysis and Recognition (NCDAR), Jul. 2001, Mandya, India, pp. 41--48.
- K. Alahari, S. Lahari and C.V. Jawahar, "Discriminant Substrokes for Online Handwriting Recognition", Proc. of 8th International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, Vol. 1, pp. 499--503.
- Anoop Namboodiri and Anil Jain, "Robust Segmentation of Unconstrained Online Handwritten Documents", Proc. of the Indian Conference on Vision Graphics and Image Processing, Dec. 2004, Calcutta, India, pp. 165--170.
25 Publications
- A. Bhaskarbhatla, S. Madhavanath, M. Pavan Kumar, A. Balasubramanian, and C.V. Jawahar, "Representation and Annotation of Online Handwritten Data", Proc. of International Workshop in Frontiers of Handwriting Recognition, Oct. 2004, Tokyo, Japan, pp. 136--141.
- Anil K. Jain and Anoop M. Namboodiri, "Indexing and Retrieval of On-line Handwritten Documents", Proc. of 7th International Conference on Document Analysis and Recognition, Edinburgh, Scotland, Aug. 2003, pp. 655--659.