Title: Document Image Analysis
1 Document Image Analysis, CSE 717: An Introduction
2 Document Image Analysis
- DIA is the theory and practice of recovering the symbol structures of digital images scanned from paper or produced by computer
- DIA is a subfield of digital image processing
- Digital images of natural objects (X-rays, fingerprints, faces, scenery, etc.) are NOT part of DIA
- Digital images of symbolic objects (postal addresses, printed articles, forms, music sheets, engineering drawings, topographic maps) belong to DIA
- Source: scanners, printers, fax machines, hand!
- Incidental text: license plates, billboards, subtitles, in photos and video
- WWW ??
- DIA's grand goal is to take us to the land of the paperless office
3 Document Image Analysis
- Textual Processing
  - Optical Character Recognition: text
  - Page Layout Analysis: skew, blocks, paragraphs
- Graphical Processing
  - Line Processing: lines, curves, corners
  - Region and Symbol Processing: filled regions
4 Document Image Analysis
Processing levels, each with its Text task and its Graphics task:
Pixels
- Text: Preprocessing (representation, noise removal, binarization, skew, script ID, font ID)
- Graphics: Preprocessing (representation, noise removal, binarization, thinning, vectorization)
Primitives
- Text: Glyph Recognition (connected components, strokes, punctuation, words)
- Graphics: Primitive Recognition (straight lines, curve segments, junctions, nodes, loops, characters)
Structures
- Text: Text Recognition (word segmentation, text line reconstruction, table analysis, linguistics)
- Graphics: Structure Recognition (text fields, legends, labels, dimensions, graphics symbols)
Documents
- Text: Page Layout Analysis (text versus non-text, physical component analysis, logical component analysis, functional component analysis, compression)
- Graphics: Interpretation (component recognition, connectivity analysis, CAD layer separation, database attribute extraction, compression)
Corpus
- Text: Information Retrieval (document classification, indexing, search, security, authentication, privacy)
- Graphics: Database, CAD (validation, search, update)
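To make the lowest two levels concrete, the steps named above as binarization (pixels) and connected components (primitives) can be sketched as follows. This is only an illustrative sketch, not the course's method: Otsu's method is one common choice of global threshold, and the function names (`otsu_threshold`, `binarize`, `connected_components`) and the toy image are assumptions made for this example.

```python
from collections import deque

def otsu_threshold(gray, levels=256):
    """Return the threshold t that maximizes between-class variance (Otsu).

    Pixels with value <= t are treated as the dark (ink) class.
    """
    hist = [0] * levels
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # pixel count of the dark class so far
    sum0 = 0  # intensity sum of the dark class so far
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize(gray, t):
    """Map dark pixels (ink) to 1 and light pixels (paper) to 0."""
    return [[1 if v <= t else 0 for v in row] for row in gray]

def connected_components(binary):
    """Return bounding boxes (top, left, bottom, right) of 8-connected ink blobs."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r in range(h):
        for c in range(w):
            if binary[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            seen[r][c] = True
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# Toy 5x6 grayscale "scan": light paper (250) with two dark blobs.
gray = [
    [250, 250, 250, 250, 250, 250],
    [250,  20,  20, 250, 250, 250],
    [250,  20,  20, 250,  30, 250],
    [250, 250, 250, 250,  30, 250],
    [250, 250, 250, 250, 250, 250],
]
threshold = otsu_threshold(gray)
boxes = connected_components(binarize(gray, threshold))
print(threshold, boxes)
```

Each bounding box is a candidate glyph or glyph fragment. A real system would add noise removal before this step and merge or split components afterwards.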
5 Postal Examples
6 Forms
7 Unconstrained Text
8 Graphics Documents
9 References
- Handbook of Character Recognition and Document Image Analysis, H. Bunke and P. S. P. Wang (editors), World Scientific Press
- Document Image Analysis, O'Gorman and Kasturi, IEEE Computer Society Press
- International Conference on Document Analysis and Recognition proceedings
- International Workshop on Document Analysis Systems proceedings
- Symposium on Document Image Understanding Technology
10
- OCR Features and Systems
  - Script ID, Devanagari OCR, Tamil OCR, machine-printed (MP) versus handwritten (HW) text
- Handwriting Recognition
  - Postal applications, Arabic documents
- Classifiers and Learning
  - Multi-classifier systems
- Layout Analysis
  - Skew correction, geometric methods, text/graphics separation, logical labeling
- Tables and Forms
  - Detecting tables in HTML documents, use of graph grammars, semantics
- Document Engineering
  - Processing of historical documents (palm leaf manuscripts)
- Camera Based DIA
  - Locating and reading barcodes
- New Applications
  - CAPTCHA
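As an illustration of the skew correction topic listed under Layout Analysis, here is a minimal sketch of one classical approach, projection-profile search: de-rotate the ink pixels by each candidate angle and keep the angle whose horizontal projection is most sharply peaked. The names `profile_score`, `estimate_skew`, and `synthetic_page`, the search range, and the step size are all assumptions for this example. Other techniques, such as Hough-transform methods, are equally valid.

```python
import math
from collections import Counter

def profile_score(points, angle_deg):
    """Sum of squared row counts after de-rotating ink pixels by angle_deg.

    The score is highest when text lines collapse onto few rows.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    rows = Counter(round(y * cos_a - x * sin_a) for x, y in points)
    return sum(n * n for n in rows.values())

def estimate_skew(points, max_angle=5.0, step=0.5):
    """Return the candidate angle (degrees) with the sharpest projection profile.

    Positive angles mean baselines whose y grows with x (image coordinates).
    """
    steps = int(round(max_angle / step))
    candidates = [k * step for k in range(-steps, steps + 1)]
    return max(candidates, key=lambda ang: profile_score(points, ang))

def synthetic_page(skew_deg, baselines=(20, 40, 60), width=200):
    """Hypothetical test data: one-pixel-thick text baselines at a known skew."""
    slope = math.tan(math.radians(skew_deg))
    return [(x, round(y0 + x * slope)) for y0 in baselines for x in range(width)]

points = synthetic_page(2.0)
print(estimate_skew(points))
```

The exhaustive search over candidate angles keeps the sketch short. In practice a coarse-to-fine search, or running the search on a downsampled image, reduces the cost.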