1. OCR Engines for Printed Documents 2. Development of OHR Modules

1
1. OCR Engines for Printed Documents
2. Development of OHR Modules
  • C.V. Jawahar
  • International Institute of Information Technology
  • Gachibowli, Hyderabad.

2
Summary
  • Title: OCR Engines for Printed Documents
  • Proposers: C.V. Jawahar, P.J. Narayanan, Anoop
    Namboodiri
  • Institution: Intl. Inst. of Information
    Technology, Hyderabad
  • Languages: Telugu, Malayalam, Hindi
  • Components to be Implemented:
  • Data Collection and Annotation
  • Page Segmentation
  • Modular Classifier Engine
  • Adaptive OCR Framework

3
Language: Telugu, Hindi, Malayalam
Name the component: Annotated data with Data Annotation Tool and Formats
List the technique(s) that will be used: Qt, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Large corpus (200,000 words x 3).
Name the domain for which the performance will be optimized: Not Applicable
Name other evaluation metrics in addition to the domain: Evaluation metrics based on this data can be defined.
4
Annotated Data Collection and APIs
  • Existing Datasets
  • Need benchmark datasets of Indian scripts
  • Should represent style, font, quality variations
  • Essential for reliable performance measurements
  • Features
  • Data from 3 different types of scripts: Hindi,
    Telugu, Malayalam.
  • Annotated at character, word and region levels
  • Representation formats clearly specified in
    consultation with experts
  • APIs for reading and storing in the specified
    format
  • Data selected from different sources for font,
    style, layout and quality variations.
  • Dataset Specs
  • 200,000 annotated words each in 3 scripts
  • Benchmark for evaluation
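
The slide asks for annotation at character, word and region levels, with APIs for reading and storing a specified format. The format itself is not given in the presentation, so the following is only a minimal sketch of what such an API could look like. The element names (page, region, word, char), the box convention (x, y, w, h) and the function names are all assumptions made for illustration, and Python is used for brevity although the slides name C and Qt.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box convention: (x, y, width, height) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Char:
    box: Box
    text: str  # Unicode label of the character


@dataclass
class Word:
    box: Box
    text: str
    chars: List[Char] = field(default_factory=list)


@dataclass
class Region:
    box: Box
    kind: str  # hypothetical labels such as "text" or "graphics"
    words: List[Word] = field(default_factory=list)


def _box_attrs(box):
    # Turn a box into string attributes for an XML element.
    return dict(zip(("x", "y", "w", "h"), map(str, box)))


def _read_box(elem):
    # Read the four box attributes back as integers.
    return tuple(int(elem.get(k)) for k in ("x", "y", "w", "h"))


def to_xml(regions, script):
    """Store a page annotation (region > word > char) as an XML string."""
    root = ET.Element("page", script=script)
    for r in regions:
        r_el = ET.SubElement(root, "region", kind=r.kind, **_box_attrs(r.box))
        for w in r.words:
            w_el = ET.SubElement(r_el, "word", text=w.text, **_box_attrs(w.box))
            for c in w.chars:
                ET.SubElement(w_el, "char", text=c.text, **_box_attrs(c.box))
    return ET.tostring(root, encoding="unicode")


def from_xml(xml_text):
    """Read the XML produced by to_xml back into (script, regions)."""
    root = ET.fromstring(xml_text)
    regions = []
    for r_el in root.findall("region"):
        words = []
        for w_el in r_el.findall("word"):
            chars = [Char(_read_box(c), c.get("text")) for c in w_el.findall("char")]
            words.append(Word(_read_box(w_el), w_el.get("text"), chars))
        regions.append(Region(_read_box(r_el), r_el.get("kind"), words))
    return root.get("script"), regions
```

Keeping the three levels nested means a word-level or region-level benchmark can be derived from the same file that holds the character-level ground truth.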

5
Data Collection and Annotation Schedule
  • Notes
  • Useful Benchmark
  • Data made available to all as and when it gets
    ready

6
Language: Script Independent
Name the component: Adaptive Page Segmentation Algorithm
List the technique(s) that will be used: Qt, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Comparable with the state of the art in English.
Name the domain for which the performance will be optimized: Printed documents, various quality levels
Name other evaluation metrics in addition to the domain: Splits/Merges to Ground Truth
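
The presentation names splits and merges against ground truth as the evaluation metric but does not define them. One plausible reading, sketched here under stated assumptions: a ground-truth region counts as split when it significantly overlaps more than one detected region, and a detected region counts as a merge when it significantly overlaps more than one ground-truth region. The box convention (x0, y0, x1, y1), the function names and the 10% overlap threshold are illustrative choices, not taken from the slides.

```python
from typing import List, Tuple

# Assumed box convention: (x0, y0, x1, y1), with x1 and y1 exclusive.
Box = Tuple[int, int, int, int]


def _area(b):
    # Area of a box; degenerate or inverted boxes count as empty.
    return max(0, b[2] - b[0]) * max(0, b[3] - b[1])


def _overlap(a, b):
    # Area of the intersection of two boxes (0 if they do not intersect).
    inter = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return _area(inter)


def significant(a, b, min_frac=0.1):
    """True if the overlap covers at least min_frac of the smaller box."""
    smaller = min(_area(a), _area(b))
    return smaller > 0 and _overlap(a, b) / smaller >= min_frac


def count_splits_merges(truth: List[Box], detected: List[Box], min_frac=0.1):
    """Return (splits, merges) of a segmentation against ground truth."""
    splits = sum(
        1 for t in truth
        if sum(significant(t, d, min_frac) for d in detected) > 1
    )
    merges = sum(
        1 for d in detected
        if sum(significant(t, d, min_frac) for t in truth) > 1
    )
    return splits, merges
```

For example, two ground-truth text lines covered by a single detected block give (0, 1), while one ground-truth line cut into two detected halves gives (1, 0).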
7
Adaptive Page Segmentation Module
  • Current Segmentation Algorithms
  • Graph-based
  • White-space based
  • Texture-based
  • Each works in specific situations, but not in all
    cases, particularly for Indian language (IL)
    documents.
  • Features
  • Trainable from examples
  • Detects and labels text and graphics regions
  • XML Representation to suit the reconstruction
  • Specially suited for classes of Indian scripts
    (scripts with shirorekha, scripts with scattered
    components)
  • Provision for supervised and unsupervised
    learning
  • Robust to print quality degradations
  • A Robust Page Segmentation Algorithm
  • Input: Document Image
  • Output: XML Description of regions
  • Implementation: Modular (C)
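
To make the input/output contract above concrete (document image in, XML description of regions out), here is a small sketch. It is not the trainable algorithm proposed in the slides: it substitutes a plain row-projection, white-space based cut, labels every band as text, and uses an invented XML layout (page and region elements with x, y, w, h attributes). The image is assumed to be a binarized list of rows with 1 for ink.

```python
import xml.etree.ElementTree as ET


def row_bands(image):
    """Return (top, bottom) row ranges, bottom exclusive, that contain ink."""
    bands, start = [], None
    for y, row in enumerate(image):
        has_ink = any(row)
        if has_ink and start is None:
            start = y                      # a new band begins
        elif not has_ink and start is not None:
            bands.append((start, y))       # a blank row closes the band
            start = None
    if start is not None:
        bands.append((start, len(image)))  # band runs to the bottom edge
    return bands


def segment_to_xml(image):
    """Describe each ink band as a region element in an XML string."""
    root = ET.Element("page", width=str(len(image[0])), height=str(len(image)))
    for top, bottom in row_bands(image):
        # Columns that contain ink inside this band give the horizontal extent.
        cols = [x for row in image[top:bottom] for x, v in enumerate(row) if v]
        ET.SubElement(
            root, "region", type="text",
            x=str(min(cols)), y=str(top),
            w=str(max(cols) - min(cols) + 1), h=str(bottom - top),
        )
    return ET.tostring(root, encoding="unicode")
```

A fixed white-space cut like this is exactly the kind of method the slide says works only in specific situations; scripts with scattered components would need the learned, adaptive segmentation the proposal describes.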

8
Page Segmentation Plans
  • Notes
  • Script limitations will be solved
  • Performance comparable to state of the art in
    English
  • Fully shared among participants

9
Language: Script Independent
Name the component: Modular Classification Algorithm
List the technique(s) that will be used: C, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Comparable with the state of the art in English.
Name the domain for which the performance will be optimized: Printed documents, various quality levels
Name other evaluation metrics in addition to the domain: Percentage accuracy for specific datasets.
10
Modular Classification System
  • Classifiers for OCR
  • Large number of classes
  • Similar shapes
  • Complex decision boundaries
  • Reduced accuracies due to cascading errors.
  • Features
  • Designed as ensemble of modular classifiers
  • Highly accurate component classifiers using
    techniques like SVMs
  • Optimal classifier combination schemes
  • Boosting the classifier performance
  • Discriminative features for similar shapes
  • Compact and efficient
  • Interfaces as per common agreement
  • An Efficient Modular Classifier
  • Input: Component Image
  • Output: Component Class label
  • Implementation: C
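
The idea of an ensemble of modular classifiers with a combination scheme can be sketched as follows. The slides mention SVMs as component classifiers and do not specify the combination rule, so this example makes two substitutions that should be read as assumptions: each component is a simple nearest-mean classifier for one pair of classes, and the outputs are combined by majority vote. Class names and feature vectors are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def mean(vectors):
    # Component-wise mean of a list of equal-length feature vectors.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    # Squared Euclidean distance between two feature vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


class PairwiseClassifier:
    """Component classifier for one pair of classes (nearest mean here,
    standing in for the SVMs mentioned in the slides)."""

    def __init__(self, label_a, samples_a, label_b, samples_b):
        self.label_a, self.mean_a = label_a, mean(samples_a)
        self.label_b, self.mean_b = label_b, mean(samples_b)

    def predict(self, x):
        closer_to_a = sq_dist(x, self.mean_a) <= sq_dist(x, self.mean_b)
        return self.label_a if closer_to_a else self.label_b


class ModularClassifier:
    """One component per class pair, combined by majority vote."""

    def __init__(self, training):
        # training: dict mapping class label -> list of feature vectors
        self.components = [
            PairwiseClassifier(a, training[a], b, training[b])
            for a, b in combinations(sorted(training), 2)
        ]

    def predict(self, x):
        votes = Counter(c.predict(x) for c in self.components)
        return votes.most_common(1)[0][0]
```

Because each component only has to separate two classes, it can use features chosen for that pair, which is one way to handle the similar shapes listed above. The number of components grows quadratically with the number of classes, so a practical system would need the compact, hierarchical arrangement the slide alludes to.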

11
Classifier Development Plans
  • Notes
  • Generic code, useful for all scripts
  • Fully shared among participants

12
Adaptive Learnable OCR Framework
[Figure: block diagram with the labels OCR, Learning Framework, Text Documents and Document Images]
13
Language: Telugu
Name the component: OHR System for Telugu
List the technique(s) that will be used: C, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Parts of the techniques (Preprocessing, Classification Engine) will be useful for other scripts. The complete engine is for Telugu only.
Give an estimate of the expected performance: Around 95% on legible handwriting
Name the domain for which the performance will be optimized: High resolution digitizers
Name other evaluation metrics in addition to the domain: Percentage accuracy for specific datasets.
14
Online Handwriting Recognizer for Telugu
  • Current Algorithms
  • No existing algorithm for Telugu OHR
  • Script layout is much more complex
  • Large number of classes or aksharas
  • Features
  • Trainable from examples
  • Combine online and offline features for high
    accuracy
  • Spline-based pre-processing that is resilient to
    various noise distributions
  • Human motor model based stroke representation for
    efficiency and accuracy
  • Use of discriminant features
  • Proposed Algorithm
  • Input: OHR Data from high resolution digitizers
  • Output: Unicode representation of the data
  • Implementation: Modular (C)
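
As an illustration of the kind of pre-processing and online features an OHR pipeline works with, the sketch below resamples a pen stroke to equally spaced points and computes the writing direction between consecutive points. This is a common, simplified stand-in: it uses linear interpolation rather than the spline-based method the slides propose, and the function names are hypothetical. A stroke is assumed to be a list of (x, y) pen positions from the digitizer.

```python
import math


def resample(stroke, n):
    """Resample a stroke to n points equally spaced along its length."""
    # Cumulative arc length at each original point.
    dists = [0.0]
    for (x0, y0), (x1, y1) in zip(stroke, stroke[1:]):
        dists.append(dists[-1] + math.hypot(x1 - x0, y1 - y0))
    total = dists[-1]
    if total == 0 or n < 2:
        return [stroke[0]] * n  # a dot, or too few points requested
    out, seg = [], 0
    for i in range(n):
        target = total * i / (n - 1)
        # Advance to the original segment that contains the target length.
        while seg < len(stroke) - 2 and dists[seg + 1] < target:
            seg += 1
        span = dists[seg + 1] - dists[seg]
        t = 0.0 if span == 0 else (target - dists[seg]) / span
        (x0, y0), (x1, y1) = stroke[seg], stroke[seg + 1]
        out.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return out


def directions(points):
    """Online feature: writing direction (radians) between successive points."""
    return [
        math.atan2(y1 - y0, x1 - x0)
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    ]
```

Resampling removes the dependence on writing speed and digitizer sampling rate, so the direction sequence of the same shape written quickly or slowly becomes comparable. Offline features, which the slides propose to combine with online ones, would instead be computed from an image rendered from the strokes.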

15
Recognizer Development Plans
  • Notes
  • Fully shared among participants
  • Pre-processing modules useful for other languages

16
Language: Telugu
Name the component: Annotated data from 100 writers, 1000 words each
List the technique(s) that will be used: Tool: Qt, Platform Independent (Linux, Windows)
What is the performance of these techniques in other languages? Effective for all scripts.
Give an estimate of the expected performance: Large corpus (100,000 words).
Name the domain for which the performance will be optimized: Data from High Resolution Digitizers
Name other evaluation metrics in addition to the domain: Evaluation metrics based on this data can be defined.
17
Annotated Data Collection and APIs
  • Existing Datasets
  • Existing datasets are mostly independent
    characters or digits
  • Need benchmark data in each language
  • Essential for reliable performance measurements
    and comparisons
  • Features
  • Data from 100 Telugu writers, each writing 1000
    words in running text under natural settings.
  • Annotated at character, word and region levels
  • Representation formats clearly specified in
    consultation with experts
  • APIs for reading and storing in the specified
    format
  • Data selected from different sources for font,
    style, layout and quality variations.
  • Dataset Specs
  • Writers: 100 native writers, 1000 words each.
  • Device: high resolution digitizers

18
Data Collection and Annotation Schedule
  • Notes
  • Useful Benchmark
  • Data made available to all as and when it gets
    ready

19
Demo 1: Recognizing Station Names
20
Investigating Team
  • C.V. Jawahar received his PhD from
    IIT Kharagpur, with a thesis on aspects of
    image segmentation. He has done extensive
    work in the fields of OCR, Document
    Understanding, and Information Search and
    Retrieval from Digital Libraries. He has
    initiated several activities at our institute,
    including the development of efficient and
    accurate OCR engines and page layout analysis
    for digital libraries, especially for Indian
    languages. He has several publications on
    these topics.
  • Anoop Namboodiri holds a PhD in the field of
    Pattern Recognition from Michigan State
    University and his thesis dealt with
    understanding of handwritten as well as printed
    documents. He has been working in topics related
    to Document Segmentation and Understanding,
    Handwriting Recognition as well as other aspects
    of Pattern Recognition. He has initiated or has
    been actively involved in activities related to
    Handwriting Recognition, Document Layout Analysis
    for Digital Libraries, and OCR at the institute.
  • P.J. Narayanan holds a PhD in the field of
    Computer Vision from University of Maryland. He
    has done extensive work in the fields of Image
    Segmentation, Scene Understanding, and Image
    Rendering. He has been actively involved with the
    research and development activities related to
    Digital Libraries.

21
Publications
  • C.V. Jawahar, M. Meshesha and A. Balasubramanian,
    "Searching in Document Images", Proc. of Indian
    Conference on Vision Graphics and Image
    Processing, 2004, Calcutta, India, pp. 622--627.
  • A. Balasubramanian, M. Meshesha and C.V. Jawahar,
    "Retrieval from Document Image Collections",
    Proc. of 7th IAPR Workshop on Document Analysis
    Systems, Nelson, New Zealand, 2006 (LNCS 3872),
    pp. 1--12.
  • S. Rawat, K.S. Kumar, M. Meshesha,
    A. Balasubramanian, I.D. Sikdar, and C.V. Jawahar,
    "A Semi-Automatic Adaptive OCR for Digital
    Libraries", Proc. of 7th IAPR Workshop on
    Document Analysis Systems, 2006 (LNCS 3872), pp.
    13--24.
  • P. Sankar, V. Ambati, L. Hari and C.V. Jawahar,
    "Digitizing A Million Books: Challenges for
    Document Analysis", Proc. of 7th IAPR Workshop on
    Document Analysis Systems, 2006 (LNCS 3872), pp.
    425--436.

22
Publications
  • K.S. Kumar, A.M. Namboodiri and C.V. Jawahar,
    "Learning to Segment Document Images",
    Proceedings of the 1st International Conference
    on Pattern Recognition and Machine Intelligence,
    Kolkata, India, December 2005, pp. 471--476.
  • M. Meshesha and C.V. Jawahar, "Recognition of
    Printed Amharic Documents", Proceedings of 8th
    International Conference on Document Analysis and
    Recognition, Seoul, Korea, 2005, Vol. 1, pp.
    784--788.
  • M.P. Kumar and C.V. Jawahar, "Configurable Hybrid
    Architectures for Character Recognition
    Applications", Proceedings of 8th International
    Conference on Document Analysis and Recognition,
    Seoul, Korea, 2005, Vol. 1, pp. 1199--1203.
  • C.V. Jawahar, M.N.S.S.K. Pavan Kumar and S.S.
    Ravikiran, "A Bilingual OCR system for
    Hindi-Telugu Documents and its Applications",
    Proceedings of the International Conference on
    Document Analysis and Recognition, Aug. 2003,
    Edinburgh, Scotland, pp. 408--413.

23
Publications
  • M.N.S.S.K. Pavan Kumar and C.V. Jawahar,
    "Design of Hierarchical Classifier with Hybrid
    Architectures", Proceedings of First
    International Conference on Pattern Recognition
    and Machine Intelligence (PReMI 2005), Kolkata,
    India, December 2005, pp. 276--279.
  • Ranjith Kumar, Vamsi Chaitanya and C.V. Jawahar,
    "A Novel Approach to Script Separation",
    Proceedings of the International Conference on
    Advances in Pattern Recognition (ICAPR), Dec.
    2003, Calcutta, India, pp. 289--292.
  • M.N.S.S.K. Pavan Kumar and C.V. Jawahar, "On
    Improving Design of Multiclass Classifiers",
    Proceedings of the International Conference on
    Advances in Pattern Recognition (ICAPR), Dec.
    2003, Calcutta, India, pp. 109--112.
  • M.N.S.S.K. Pavan Kumar, S.S. Ravikiran, Abhishek
    Nayani, C.V. Jawahar and P.J. Narayanan, "Tools
    for Developing OCRs for Indian Scripts",
    Proceedings of the Workshop on Document Image
    Analysis and Retrieval (DIARCVPR'03), Jun. 2003,
    Madison, WI.

24
Publications
  • Mudit Agrawal, M.N.S.S.K. Pavan Kumar and C.V.
    Jawahar, "Indexing and Retrieval of Devanagari
    Text from Printed Documents", Proceedings of the
    National Conference on Document Analysis and
    Recognition (NCDAR), Jul. 2003, Mandya, India,
    pp. 244--251.
  • Vishwanatha Kaushik and C.V. Jawahar, "Detection
    of Devanagari Text in Digital Images using
    Connected Component Analysis", Proceedings of the
    National Conference on Document Analysis and
    Recognition (NCDAR), Jul. 2001, Mandya, India,
    pp. 41--48.
  • K. Alahari, S. Lahari and C.V. Jawahar,
    "Discriminant Substrokes for Online Handwriting
    Recognition", Proc. of 8th International
    Conference on Document Analysis and Recognition,
    Seoul, Korea, 2005, Vol. 1, pp. 499--503.
  • Anoop Namboodiri and Anil Jain, "Robust
    Segmentation of Unconstrained Online Handwritten
    Documents", Proc. of the Indian Conference on
    Vision Graphics and Image Processing, Dec. 2004,
    Calcutta, India, pp. 165--170.

25
Publications
  • A. Bhaskarbhatla, S. Madhavanath, M. Pavan Kumar,
    A. Balasubramanian, and C.V. Jawahar,
    "Representation and Annotation of Online
    Handwritten Data", Proc. of International
    Workshop in Frontiers of Handwriting Recognition,
    Oct. 2004, Tokyo, Japan, pp. 136--141.
  • Anil K. Jain and Anoop M. Namboodiri, "Indexing
    and Retrieval of On-line Handwritten Documents",
    Proc. of 7th International Conference on Document
    Analysis and Recognition, Edinburgh, Scotland,
    Aug. 2003, pp. 655--659.