1. OCR Engines for Printed Documents 2. Development of OHR Modules

1
1. OCR Engines for Printed Documents
2. Development of OHR Modules

C.V. Jawahar
International Institute of Information Technology
Gachibowli, Hyderabad.

2
Summary

Title: OCR Engines for Printed Documents
Proposers: C.V. Jawahar, P.J. Narayanan, Anoop Namboodiri
Institution: International Institute of Information Technology, Hyderabad
Languages: Telugu, Malayalam, Hindi
Components to be Implemented:
Data Collection and Annotation
Page Segmentation
Modular Classifier Engine
Adaptive OCR Framework

3
Language: Telugu, Hindi, Malayalam
Name the component: Annotated data with Data Annotation Tool and Formats
List the technique(s) that will be used: Qt, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Large corpus (200,000 words x 3).
Name the domain for which the performance will be optimized: Not Applicable
Name other evaluation metrics in addition to the domain: Evaluation metrics based on this data can be defined.
4
Annotated Data Collection and APIs

Existing Datasets
Need benchmark datasets of Indian scripts
Should represent style, font, quality variations
Essential for reliable performance measurements

Features
Data from 3 different types of scripts: Hindi, Telugu, Malayalam
Annotated at character, word and region levels
Representation formats clearly specified in consultation with experts
APIs for reading and storing in the specified format
Data selected from different sources for font, style, layout and quality variations
Dataset Specs
200,000 annotated words each in 3 scripts
Benchmark for evaluation
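
As a rough illustration of annotation at region, word and character levels with read/store APIs, here is a minimal sketch. The record layout, the field names and the use of JSON are all assumptions made for this example; the slides only say that the actual formats are specified in consultation with experts.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


# Hypothetical three-level annotation records; boxes are [x0, y0, x1, y1].
@dataclass
class CharAnnotation:
    text: str
    box: List[int]


@dataclass
class WordAnnotation:
    text: str
    box: List[int]
    chars: List[CharAnnotation] = field(default_factory=list)


@dataclass
class RegionAnnotation:
    kind: str  # e.g. "text" or "graphics"
    box: List[int]
    words: List[WordAnnotation] = field(default_factory=list)


def store(regions: List[RegionAnnotation]) -> str:
    """Serialize a page's annotations to a JSON string (assumed format)."""
    return json.dumps([asdict(r) for r in regions], ensure_ascii=False)


def load(text: str) -> List[RegionAnnotation]:
    """Rebuild the annotation objects from the JSON produced by store()."""
    regions = []
    for r in json.loads(text):
        words = [
            WordAnnotation(
                w["text"], w["box"], [CharAnnotation(**c) for c in w["chars"]]
            )
            for w in r["words"]
        ]
        regions.append(RegionAnnotation(r["kind"], r["box"], words))
    return regions
```

Storing and then loading a page should give back equal objects, which is a cheap consistency check for any format of this kind.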

5
Data Collection and Annotation Schedule

Notes
Useful Benchmark
Data made available to all as and when it gets ready

6
Language: Script Independent
Name the component: Adaptive Page Segmentation Algorithm
List the technique(s) that will be used: Qt, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Comparable with state of the art in English.
Name the domain for which the performance will be optimized: Printed documents, various quality levels
Name other evaluation metrics in addition to the domain: Splits/Merges to Ground Truth
7
Adaptive Page Segmentation Module

Current Segmentation Algorithms
Graph-based
White-space based
Texture-based
Each works in specific situations, but not in all cases, particularly for Indian language (IL) documents.

Features
Trainable from examples
Detects and labels text and graphics regions
XML Representation to suit the reconstruction
Specially suited for classes of Indian scripts (scripts with shirorekha, scripts with scattered components)
Provision for supervised and unsupervised learning
Robust to print quality degradations

A Robust Page Segmentation Algorithm
Input: Document Image
Output: XML Description of regions
Implementation: Modular (C)
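
To make the stated output concrete, the sketch below writes labelled regions as XML. The element and attribute names (page, region, type, x0, y0, x1, y1) are hypothetical; the slides do not give the actual schema, and the proposed module is in C rather than Python.

```python
import xml.etree.ElementTree as ET


def regions_to_xml(image_name, regions):
    """Build an XML description of labelled page regions.

    regions is a list of (label, (x0, y0, x1, y1)) pairs, where label is
    e.g. "text" or "graphics". The schema here is assumed for illustration.
    """
    page = ET.Element("page", {"image": image_name})
    for index, (label, (x0, y0, x1, y1)) in enumerate(regions):
        ET.SubElement(
            page,
            "region",
            {
                "id": str(index),
                "type": label,
                "x0": str(x0),
                "y0": str(y0),
                "x1": str(x1),
                "y1": str(y1),
            },
        )
    return ET.tostring(page, encoding="unicode")
```

A later reconstruction stage could parse this string with ET.fromstring and walk the region elements in order.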

8
Page Segmentation Plans

Notes
Script limitations will be solved
Performance comparable to state of the art in English
Fully shared among participants
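
The evaluation metric named for this module is splits/merges against ground truth. One possible way to count them is sketched below, assuming regions are axis-aligned boxes and that two boxes match once their overlap covers at least a fraction min_frac of the smaller one. Both the box model and the threshold are assumptions; the slides do not define the matching rule.

```python
def area(box):
    """Area of a box (x0, y0, x1, y1); empty or inverted boxes give 0."""
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])


def overlap(a, b):
    """Area of the intersection of two boxes."""
    return area((max(a[0], b[0]), max(a[1], b[1]),
                 min(a[2], b[2]), min(a[3], b[3])))


def count_splits_merges(truth, detected, min_frac=0.1):
    """Return (splits, merges) for detected regions against ground truth.

    A split is a ground-truth region matched by more than one detected
    region; a merge is a detected region matching more than one
    ground-truth region. min_frac is an assumed matching threshold.
    """
    def matches(a, b):
        inter = overlap(a, b)
        return inter > 0 and inter >= min_frac * min(area(a), area(b))

    splits = sum(
        1 for t in truth if sum(matches(t, d) for d in detected) > 1
    )
    merges = sum(
        1 for d in detected if sum(matches(t, d) for t in truth) > 1
    )
    return splits, merges
```

For example, a paragraph detected as two side-by-side halves counts as one split, while two paragraphs swallowed by a single detected box count as one merge.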

9
Language: Script Independent
Name the component: Modular Classification Algorithm
List the technique(s) that will be used: C, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Comparable with state of the art in English.
Name the domain for which the performance will be optimized: Printed documents, various quality levels
Name other evaluation metrics in addition to the domain: Percentage accuracy for specific datasets.
10
Modular Classification System

Classifiers for OCR
Large number of classes
Similar shapes
Complex decision boundaries
Reduced accuracies due to cascading errors.

Features
Designed as ensemble of modular classifiers
Highly accurate component classifiers using techniques like SVMs
Optimal classifier combination schemes
Boosting the classifier performance
Discriminative features for similar shapes
Compact and efficient
Interfaces as per common agreement

An Efficient Modular Classifier
Input: Component Image
Output: Component Class label
Implementation: C
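
The idea of an ensemble of modular classifiers can be sketched as one small classifier per pair of classes, combined by majority vote. This is only an illustration: each pairwise decision below is a nearest-class-mean rule standing in for the SVMs mentioned on the slide, majority voting is just one possible combination scheme, and the class and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mean_vector(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


class PairwiseEnsemble:
    """One two-class decision per pair of labels, combined by voting."""

    def __init__(self, samples):
        # samples maps each class label to its training feature vectors.
        self.means = {label: mean_vector(v) for label, v in samples.items()}
        self.pairs = list(combinations(sorted(self.means), 2))

    def classify(self, features):
        votes = Counter()
        for a, b in self.pairs:
            # Stand-in component classifier: pick the closer class mean.
            closer_a = (squared_distance(features, self.means[a])
                        <= squared_distance(features, self.means[b]))
            votes[a if closer_a else b] += 1
        return votes.most_common(1)[0][0]
```

Because every component only separates two classes, a pair of similar shapes can later be given its own discriminative features or a stronger classifier without touching the rest of the ensemble.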

11
Classifier Development Plans

Notes
Generic code, useful for all scripts
Fully shared among participants

12
Adaptive Learnable OCR Framework
[Diagram with blocks: Document Images, OCR, Text Documents, Learning Framework]
13
Language: Telugu
Name the component: OHR System for Telugu
List the technique(s) that will be used: C, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Part of the techniques will be useful for other scripts (Preprocessing, Classification Engine). Complete engine for Telugu only.
Give an estimate of the expected performance: Around 95% on legible handwriting
Name the domain for which the performance will be optimized: High resolution digitizers
Name other evaluation metrics in addition to the domain: Percentage accuracy for specific datasets.
14
Online Handwriting Recognizer for Telugu

Current Algorithms
No existing algorithm for Telugu OHR
Script layout is much more complex
Large number of classes or aksharas

Features
Trainable from examples
Combine online and offline features for high accuracy
Spline-based pre-processing that is resilient to various noise distributions
Human motor model based stroke representation for efficiency and accuracy
Use of discriminant features

Proposed Algorithm
Input: OHR Data from high resolution digitizers
Output: Unicode representation of the data
Implementation: Modular (C)
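
As a simple illustration of stroke pre-processing for digitizer input, the sketch below resamples a pen stroke to a fixed number of points spaced evenly along its length. Note that this uses plain linear interpolation, not the spline-based method the slide proposes; it is a simpler stand-in chosen to stay dependency-free, and the function name is hypothetical.

```python
import math


def resample_stroke(points, n):
    """Return n points spaced evenly by arc length along a pen stroke.

    points is a list of (x, y) samples from the digitizer, in pen order.
    Linear interpolation is used between the original samples.
    """
    if n < 2 or len(points) < 2:
        raise ValueError("need at least two input points and n >= 2")

    # Cumulative arc length at each original sample.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]
    if total == 0:
        return [points[0]] * n  # the pen did not move

    resampled = []
    segment = 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the segment that contains the target length.
        while segment < len(points) - 2 and cumulative[segment + 1] < target:
            segment += 1
        span = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if span == 0 else (target - cumulative[segment]) / span
        x0, y0 = points[segment]
        x1, y1 = points[segment + 1]
        resampled.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return resampled
```

Resampling of this kind removes the dependence on writing speed, so that later feature extraction sees the same number of points for every stroke.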

15
Recognizer Development Plans

Notes
Fully shared among participants
Pre-processing modules useful for other languages

16
Language: Telugu
Name the component: Annotated data from 100 writers, 1000 words each
List the technique(s) that will be used: Qt-based tool, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Large corpus (100,000 words).
Name the domain for which the performance will be optimized: Data from High Resolution Digitizers
Name other evaluation metrics in addition to the domain: Evaluation metrics based on this data can be defined.
17
Annotated Data Collection and APIs

Existing Datasets
Existing datasets are mostly independent characters or digits
Need benchmark data in each language
Essential for reliable performance measurements and comparisons

Features
Data from 100 Telugu writers, each writing 1000 words in running text under natural settings
Annotated at character, word and region levels
Representation formats clearly specified in consultation with experts
APIs for reading and storing in the specified format
Data selected from different sources for font, style, layout and quality variations

Dataset Specs
Writers: 100 native writers, 1000 words each
Device: High resolution digitizers

18
Data Collection and Annotation Schedule

Notes
Useful Benchmark
Data made available to all as and when it gets ready

19
Demo 1: Recognizing Station Names
20
Investigating Team

C.V. Jawahar completed his PhD at IIT-Kharagpur, with a thesis on aspects of image segmentation. He has done extensive work in the fields of OCR, Document Understanding, and Information Search and Retrieval from Digital Libraries. He has initiated several activities at our institute, including the development of efficient and accurate OCR engines and page layout analysis for digital libraries, especially for Indian languages. He has several publications to his credit that are related to these topics.

Anoop Namboodiri holds a PhD in the field of Pattern Recognition from Michigan State University; his thesis dealt with the understanding of handwritten as well as printed documents. He has been working on topics related to Document Segmentation and Understanding, Handwriting Recognition, and other aspects of Pattern Recognition. He has initiated or been actively involved in activities related to Handwriting Recognition, Document Layout Analysis for Digital Libraries, and OCR at the institute.

P.J. Narayanan holds a PhD in the field of Computer Vision from the University of Maryland. He has done extensive work in the fields of Image Segmentation, Scene Understanding, and Image Rendering. He has been actively involved in research and development activities related to Digital Libraries.

21
Publications

C.V. Jawahar, M. Meshesha and A. Balasubramanian, "Searching in Document Images", Proc. of Indian Conference on Vision Graphics and Image Processing, 2004, Calcutta, India, pp. 622--627.
A. Balasubramanian, M. Meshesha and C.V. Jawahar, "Retrieval from Document Image Collections", Proc. of 7th IAPR Workshop on Document Analysis Systems, Nelson, New Zealand, 2006 (LNCS 3872), pp. 1--12.
S. Rawat, K.S. Kumar, M. Meshesha, A. Balasubramanian, I.D. Sikdar and C.V. Jawahar, "A Semi-Automatic Adaptive OCR for Digital Libraries", Proc. of 7th IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp. 13--24.
P. Sankar, V. Ambati, L. Hari and C.V. Jawahar, "Digitizing A Million Books: Challenges for Document Analysis", Proc. of 7th IAPR Workshop on Document Analysis Systems, 2006 (LNCS 3872), pp. 425--436.

22
Publications

K.S. Kumar, A.M. Namboodiri and C.V. Jawahar, "Learning to Segment Document Images", Proceedings of the 1st International Conference on Pattern Recognition and Machine Intelligence, Kolkata, India, December 2005, pp. 471--476.
M. Meshesha and C.V. Jawahar, "Recognition of Printed Amharic Documents", Proceedings of 8th International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, Vol. 1, pp. 784--788.
M.P. Kumar and C.V. Jawahar, "Configurable Hybrid Architectures for Character Recognition Applications", Proceedings of 8th International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, Vol. 1, pp. 1199--1203.
C.V. Jawahar, MNSSK Pavan Kumar and S.S. Ravikiran, "A Bilingual OCR system for Hindi-Telugu Documents and its Applications", Proceedings of the International Conference on Document Analysis and Recognition, Aug. 2003, Edinburgh, Scotland, pp. 408--413.

23
Publications

M.N.S.S.K. Pavan Kumar and C.V. Jawahar, "Design of Hierarchical Classifier with Hybrid Architectures", Proceedings of First International Conference on Pattern Recognition and Machine Intelligence (PReMI 2005), Kolkata, India, December 2005, pp. 276--279.
Ranjith Kumar, Vamsi Chaitanya and C.V. Jawahar, "A Novel Approach to Script Separation", Proceedings of the International Conference on Advances in Pattern Recognition (ICAPR), Dec. 2003, Calcutta, India, pp. 289--292.
MNSSK Pavan Kumar and C.V. Jawahar, "On Improving Design of Multiclass Classifiers", Proceedings of the International Conference on Advances in Pattern Recognition (ICAPR), Dec. 2003, Calcutta, India, pp. 109--112.
MNSSK Pavan Kumar, S.S. Ravikiran, Abhishek Nayani, C.V. Jawahar and P.J. Narayanan, "Tools for Developing OCRs for Indian Scripts", Proceedings of the Workshop on Document Image Analysis and Retrieval (DIARCVPR'03), Jun. 2003, Madison, WI.

24
Publications

Mudit Agrawal, M.N.S.S.K. Pavan Kumar and C.V. Jawahar, "Indexing and Retrieval of Devanagari Text from Printed Documents", Proceedings of the National Conference on Document Analysis and Recognition (NCDAR), Jul. 2003, Mandya, India, pp. 244--251.
Vishwanatha Kaushik and C.V. Jawahar, "Detection of Devanagari Text in Digital Images using Connected Component Analysis", Proceedings of the National Conference on Document Analysis and Recognition (NCDAR), Jul. 2001, Mandya, India, pp. 41--48.
K. Alahari, S. Lahari and C.V. Jawahar, "Discriminant Substrokes for Online Handwriting Recognition", Proc. of 8th International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, Vol. 1, pp. 499--503.
Anoop Namboodiri and Anil Jain, "Robust Segmentation of Unconstrained Online Handwritten Documents", Proc. of the Indian Conference on Vision Graphics and Image Processing, Dec. 2004, Calcutta, India, pp. 165--170.

25
Publications

A. Bhaskarbhatla, S. Madhavanath, M. Pavan Kumar, A. Balasubramanian and C.V. Jawahar, "Representation and Annotation of Online Handwritten Data", Proc. of International Workshop in Frontiers of Handwriting Recognition, Oct. 2004, Tokyo, Japan, pp. 136--141.
Anil K. Jain and Anoop M. Namboodiri, "Indexing and Retrieval of On-line Handwritten Documents", Proc. of 7th International Conference on Document Analysis and Recognition, Edinburgh, Scotland, Aug. 2003, pp. 655--659.