Title: Document Image Indexing
1Document Image Indexing
2Indexing of document images
- Apply IR techniques in modified form
- Different approaches
- Methods differ in how much analysis they do, i.e. how rich the document representations involved are
- image objects and/or image structure
- layout objects and/or layout structure
- logical objects and/or logical structure
3Approaches
- full document conversion
- geometric analysis, OCR, logical analysis
- results mostly incomplete
- methods remain valid only when OCR quality is reasonable
- expensive process, not always feasible to apply to millions of pages
- partial document conversion
- only recognize important features
- cheap analysis, or cheap pre-processing with expensive processing of limited document parts only
4Text characterization (deSilva)
- Focus on proper nouns
- Examples
- names of people, places, important objects
- Characteristics
- important for indexing
- difficult to do extensive post-processing as the
set of proper nouns can be very large
5Text characterization (deSilva)
- Observations made in experiments
- 95.5% of proper nouns are capitalized
- 35% of capitalized words are proper nouns
- beginning of sentence: 10% proper nouns
- 85% of capitalized words following a one-letter word are proper nouns
- average length is one character larger than for other words
6Text characterization
[Diagram, top to bottom]
Proper nouns
- High level features: syntactic category of previous/next word
Candidate nouns
- Low level features: capitalization, length of word, length of previous/next word, position in the sentence
Box based image abstraction
- Characters and words
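The low level features listed on this slide can be sketched in a few lines. This is only an illustration, not deSilva's method: it assumes the words are already available as text tokens (the slides work on a box based image abstraction instead), and the function names and the selection rule (capitalized, not sentence-initial) are hypothetical choices motivated by the observations on the previous slide.

```python
def low_level_features(words):
    """Compute the slide's low level features for each word of one sentence."""
    features = []
    for i, word in enumerate(words):
        prev_word = words[i - 1] if i > 0 else ""
        next_word = words[i + 1] if i + 1 < len(words) else ""
        features.append({
            "word": word,
            "capitalized": word[:1].isupper(),
            "length": len(word),
            "prev_length": len(prev_word),
            "next_length": len(next_word),
            "position": i,  # position in the sentence
        })
    return features


def candidate_proper_nouns(words):
    """Assumed rule: keep capitalized words that do not start the sentence."""
    return [f["word"] for f in low_level_features(words)
            if f["capitalized"] and f["position"] > 0]


sentence = "The ship Titanic left a Southampton dock".split()
candidates = candidate_proper_nouns(sentence)
```

A real system would pass such candidates on to the high level features (syntactic category of neighbouring words) before accepting them as proper nouns.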
7Identification of document function (Doermann)
- Document functions
- reading: the user is supposed to read the whole document
- browsing: the user is supposed to quickly go through the document
- searching: the user is supposed to look for specific parts of the document
- Observed properties
- reading: few titles, large content blocks
- browsing: large number of head/body pairings
- searching: large number of small similar-sized blocks
8Identification of document function: Reading, Browsing, or Searching?
[Figure: example documents labelled Browsing, Searching, and Reading]
9Identification of document function
[Diagram, top to bottom]
Document function
- High level features: distribution of functional units
Salient regions: titles, abstracts, index keys, etc.
- Low level features: zone properties, position on the page
Box based image abstraction
- Zones
10Presentation
- Basis is logical tree
- Functions
- reading
- depth-first search of the logical tree
- browsing
- pruned depth-first search of the logical tree
- searching
- decision tree based on the logical tree
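As a rough illustration of the first two presentation functions, the sketch below traverses a small logical tree depth first (reading) and depth first with pruning (browsing). The `Node` class, the function names, and the choice to prune by depth are assumptions made for this example; the slides do not say which pruning criterion is used.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)


def reading_order(node):
    """Full depth-first traversal: every logical element is presented."""
    yield node.label
    for child in node.children:
        yield from reading_order(child)


def browsing_order(node, max_depth=1, depth=0):
    """Pruned depth-first traversal: stop below max_depth (assumed criterion)."""
    yield node.label
    if depth < max_depth:
        for child in node.children:
            yield from browsing_order(child, max_depth, depth + 1)


doc = Node("document", [
    Node("section 1", [Node("paragraph 1"), Node("paragraph 2")]),
    Node("section 2", [Node("paragraph 3")]),
])
full = list(reading_order(doc))
pruned = list(browsing_order(doc, max_depth=1))
```

Searching would replace the fixed traversal by user-driven decisions at each node, which is why it is described as a decision tree on the same logical tree.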
11Presentation by document function
[Figure: paths through the logical tree for Reading, Browsing, and Searching; the searching path depends on the user]
12Layout similarity (Doermann)
- Which documents are similar?
13Layout similarity
- Different measures required
- mapping between the typed components
- one-to-one mapping between components
- overlap of components which are matched
- relative positions of document parts
- shape of document parts
- The above measures are not independent
- order and relevance of the different measures have to be chosen
14Edit distance
- Definition
- the minimum number of actions that you have to perform to transform one layout structure into the other
- Actions
- delete an object
- move an object
- change the shape of an object
- Weighting
- the different actions can have different
weighting depending on the application
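A minimal sketch of a weighted layout edit cost, under a strong simplifying assumption: objects in the two layouts are already matched by name, so no search over alternative matchings is needed. The function name, the box representation `(x, y, width, height)`, and the default weights are hypothetical. An object that exists in only one layout is counted as one delete.

```python
def layout_edit_cost(layout_a, layout_b, weights=None):
    """Weighted count of delete / move / reshape actions between two layouts.

    Each layout maps an object name to a box (x, y, width, height).
    Objects are assumed to be matched by name.
    """
    w = {"delete": 1.0, "move": 1.0, "reshape": 1.0}
    if weights:
        w.update(weights)
    cost = 0.0
    for name in layout_a.keys() | layout_b.keys():
        if name not in layout_a or name not in layout_b:
            cost += w["delete"]  # object present in only one layout
            continue
        xa, ya, wa, ha = layout_a[name]
        xb, yb, wb, hb = layout_b[name]
        if (xa, ya) != (xb, yb):
            cost += w["move"]
        if (wa, ha) != (wb, hb):
            cost += w["reshape"]
    return cost


page_a = {"title": (0, 0, 100, 10), "body": (0, 20, 100, 80), "figure": (0, 110, 50, 50)}
page_b = {"title": (0, 5, 100, 10), "body": (0, 20, 100, 60)}
cost = layout_edit_cost(page_a, page_b)
```

Raising the weight of one action, for example `layout_edit_cost(page_a, page_b, {"delete": 2.0})`, makes that kind of difference count more heavily, which is the application-dependent weighting the slide refers to.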
15Graphics indexing (Lorenz)
- Basic components
- lines, parallel lines, adjacent lines, junctions
- text
- (circles, ellipses, etc.)
- Feature frequency weighting
- technique similar to text indexing as before
- indexing focussed on salient basic components which occur often in the query graphics, but are rare with respect to the whole collection of graphics
- allows access to heterogeneous collections of data (e.g. text and graphics)
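The weighting idea (frequent in the query graphic, rare in the collection) resembles tf-idf weighting from text retrieval. The sketch below applies a smoothed tf-idf formula to basic graphic components; the exact formula used by Lorenz is not given on the slide, so the formula, the function name, and the component names are assumptions.

```python
import math
from collections import Counter


def feature_weights(query_components, collection):
    """Weight each component of the query graphic by frequency times rarity.

    query_components: list of component names found in the query graphic.
    collection: list of graphics, each a list of component names.
    """
    n_graphics = len(collection)
    term_freq = Counter(query_components)
    weights = {}
    for component, freq in term_freq.items():
        doc_freq = sum(1 for graphic in collection if component in graphic)
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_graphics) / (1 + doc_freq))
        weights[component] = freq * idf
    return weights


collection = [
    ["line", "junction"],
    ["line", "parallel_lines"],
    ["line", "circle"],
]
weights = feature_weights(["line", "circle", "circle"], collection)
```

Here "line" occurs in every graphic and gets weight zero, while "circle" is frequent in the query but rare in the collection and therefore dominates the index.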
16Example document
17Example histogram
[Figure: histogram of frequency per symbol]
18Indexing by spatial information
- observation
- It turns out that many people in fact do access
archives by remembering partial layout, knowing
approximately where things are positioned and how
they are related
19Spatial relations and indexing
- Spatial queries on document information
- mixture of
- document labels, document properties, keywords
- label examples: abstract, title, footnote
- property examples: large square text box, text box with low/high aspect ratio
- keyword examples: text box containing the word "motorcycle", picture with the keyword "typing unit" in it
- spatial relations
- left-of, right-of, above, below, adjacent, etc.
20Example
- give me documents with a large box above a (box
with large aspect ratio containing the word
Titanic)
[Figure: query specification (a box of any type above a text box containing "Titanic") and a result (a figure above the text "The Titanic arrived")]
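The example query can be sketched as a predicate over boxes. Everything below is an illustrative assumption: the `Box` class, the thresholds for "large" and "large aspect ratio", and the definition of `above` (entirely higher on the page, with horizontal overlap, y growing downward).

```python
from dataclasses import dataclass


@dataclass
class Box:
    x: float
    y: float
    w: float
    h: float
    kind: str = "text"
    text: str = ""

    @property
    def area(self):
        return self.w * self.h

    @property
    def aspect_ratio(self):
        return self.w / self.h


def above(a, b):
    """a lies entirely above b and overlaps it horizontally."""
    return a.y + a.h <= b.y and a.x < b.x + b.w and b.x < a.x + a.w


def matches_query(boxes, word="titanic", min_area=10000, min_aspect=5.0):
    """A large box above a box with large aspect ratio containing the word."""
    for lower in boxes:
        if lower.aspect_ratio >= min_aspect and word in lower.text.lower():
            for upper in boxes:
                if upper is not lower and upper.area >= min_area and above(upper, lower):
                    return True
    return False


hit = [Box(0, 0, 200, 150, "figure"), Box(0, 160, 200, 20, "text", "The Titanic arrived")]
miss = [Box(0, 0, 200, 20, "text", "The Titanic arrived"), Box(0, 30, 200, 150, "figure")]
```

The first page matches because the large figure sits above the wide text line; the second has the same boxes in the opposite order and does not.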
21Conclusion (indexing)
- partial document analysis
- relatively cheap methods based on simple characteristics
- capable of indexing documents efficiently and effectively
- can always be combined with full OCR and layout and logical analysis
- methods for text do apply in the same way for graphics
22Multimedia Indexing
23Authoring versus visual analysis
[Diagram labels: content descriptive metadata; intentions; multimedia content (text, video objects, etc.); partial script; sensory content (objects, images, etc.); multimedia script; extracted multimedia structure and content; structure multimedia document; (analog) multimedia document; digital multimedia document; multimedia document]
24Multimedia structures
- Geometric structure
- the layout of the multimedia document
- Logical structure
- the interpretation of the multimedia document
- Non-linear (hypertext) structure
- relations between (logical) elements in the
document
Note: structures and relations can also be time based, hence synchronization is important
25Video example
26Introduction
- Single media indexing
- text (standard information retrieval)
- video (Brunelli)
- documents (Doermann)
- image (Informedia)
- figure (HyperDoc)
- audio (Informedia)
27Multi media indexing (examples)
- Figure and text
- manuals with labels in figures and explanations in the text
- caption of the figure explaining the content
- Text and image
- caption of newspaper picture
- context of a picture on a web page
- Audio and video
- commentator explaining what you see
28Multi media indexing (examples)
- Audio and image
- expert describing a picture
- photographer annotating his picture
- Text and video
- film script
- closed captions of news
29Multi media analysis
- General approach
- find common ontology
- analyze both media and express the result in the common ontology
- Most often based on text plus one other modality
30Overview
- Multimodal Document Indexing
- The HyperDocument system
- From document to hypertext
- The IMAT system
- From document to reusable fragments
- Multimodal Video Indexing
- Name-It
- Face-Name association
- Informedia
- Multimodal Video Summaries
- Review paper (Snoek)
- General framework and overview
31Multimodal Document Indexing
32The HyperDoc system
- Data
- an (old) manual with annotated pictures and associated texts
- Goal
- WWW based access to the paper version of the
document
34Document Structures
[Figure: one page shown in its geometric, logical, and hypertext structure; labelled parts: header, figure, page number, caption, text body, text]
35Document structures
- Structure definition
- a set of objects and their relations (links)
- Structure types
- we identify different (hypertext) structures which pose restrictions on the admissible relations between objects
36Hierarchical Structure
- Tree shaped structure
- links at each level
- Example: Geometric structure
- grouping of elements in columns
- Example: Logical structure
- grouping of captions and figures
- sections, subsections
37Linear Structure
- Set of connected links
- no loops
- access to first element only
- relative links
- Example: Reading order
- depth-first traversal of the logical structure of the main text body
- Example: Lists
- tables
- figures
38Index Structure
- Ordered set of links
- outgoing links only
- Examples
- index to text elements
- keywords
- labels in figures
39Side-loop Structure
- Structure consisting of
- two links in opposite directions, from and to one component
- no other links out of the component
- Examples
- footnotes
- references
40Cross-group Structure
- Structure with
- two components
- links between them
- Examples
- whole text body and set of figures
- defines scope of each figure
- one figure and its scope
- relations between figure content and text
41Cross-reference Structure
- Remaining relations
- semantic relations between keywords
- semantic relations between paragraphs
42From Paper to HyperDocument Access
- Document Image Analysis
- layout analysis
- content analysis
- objects and text (by OCR) in figures
- text of the paragraphs (using OCR)
- Logical Analysis
- interpretation of document parts
- Hypertext analysis
- identifying instantiations of the six hypertext structures
- Presentation design
- present the structures to the user
43Figure content analysis
- Here focus on labels in an image
- plain text labels
- generic text labels
- icon labels
- legend labels
44From object to content
- Figure label detection
- candidate characters should have height in [(1-a) modal_height, (1+a) modal_height]
- grouping into complete multi-line labels based on predicates and actions as explained in the document analysis lecture
- Object content
- text objects and figure labels are processed with commercial OCR
- logical labels are identified by processing the OCR output
- e.g. titles (indicated by "view"), notes (indicated by "note")
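The height criterion for candidate label characters can be sketched as follows. The function name and the tolerance value `a=0.2` are assumptions; the slide only states that heights should lie within a band around the modal height.

```python
from collections import Counter


def label_candidates(heights, a=0.2):
    """Return indices of components whose height is close to the modal height.

    heights: heights (in pixels) of the connected components in the figure.
    a: relative tolerance around the modal height (assumed value).
    """
    modal_height = Counter(heights).most_common(1)[0][0]
    low, high = (1 - a) * modal_height, (1 + a) * modal_height
    return [i for i, h in enumerate(heights) if low <= h <= high]


component_heights = [10, 10, 11, 10, 30, 9, 4]
kept = label_candidates(component_heights)
```

Components that are much taller (likely graphics) or much smaller (likely noise or punctuation) than the modal character height are dropped before grouping into labels.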
45Legends
- Definition
- a legend is a list of icon-name pairs
- Use
- legends can be very important in document image
analysis as they provide a relation between
objects in the image and associated semantic
concepts
46Legend label detection and analysis
To decompose the legend picture, projection profiles in the x- and y-direction (counting the number of pixels) are used.
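A minimal sketch of projection profiles on a binary image, written in plain Python to stay self-contained. The image representation (a list of rows with 1 for ink and 0 for background) and the helper names are assumptions. Gaps in the row profile separate legend entries; gaps in the column profile separate the icon from the name.

```python
def projection_profiles(image):
    """Count ink pixels per row (y profile) and per column (x profile)."""
    row_profile = [sum(row) for row in image]
    col_profile = [sum(col) for col in zip(*image)]
    return row_profile, col_profile


def runs(profile):
    """Return (start, end) index ranges, end exclusive, of non-empty runs."""
    segments, start = [], None
    for i, value in enumerate(profile):
        if value > 0 and start is None:
            start = i
        elif value == 0 and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(profile)))
    return segments


legend = [
    [1, 1, 0, 1, 1, 1],
    [1, 1, 0, 1, 0, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 1, 0],
]
rows, cols = projection_profiles(legend)
row_segments = runs(rows)
col_segments = runs(cols)
```

In this toy legend the row profile yields two entries and the column profile splits each entry into an icon part and a name part.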
47From objects to Geometric and Logical Structure
[Diagram labels: basic geometric object, basic logical object, content, column detection, grouping and text analysis, reading order, geometric structure, logical structure]
Logical structure: search for occurrences from the start of each text line
- chapter: <white_space><numeral>
- section: <white_space><numeral><.><numeral>
- check whether the sequence is increasing properly
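The heading patterns and the sequence check can be sketched with regular expressions. The exact patterns and the meaning of "increasing properly" are assumptions here: chapters are expected to increase by one, and sections are expected to restart at 1 within each chapter and increase by one.

```python
import re

# Optional leading white space, then the numbering, then a space or line end.
SECTION = re.compile(r"^\s*(\d+)\.(\d+)(?!\S)")
CHAPTER = re.compile(r"^\s*(\d+)(?!\S)")


def find_headings(lines):
    """Return (chapter, section) pairs; section is 0 for chapter headings."""
    headings = []
    for line in lines:
        match = SECTION.match(line)
        if match:
            headings.append((int(match.group(1)), int(match.group(2))))
            continue
        match = CHAPTER.match(line)
        if match:
            headings.append((int(match.group(1)), 0))
    return headings


def increases_properly(headings):
    """Check that chapter and section numbers follow each other without gaps."""
    chapter, section = 0, 0
    for ch, sec in headings:
        if sec == 0:
            if ch != chapter + 1:
                return False
            chapter, section = ch, 0
        else:
            if ch != chapter or sec != section + 1:
                return False
            section = sec
    return True


lines = ["1 Introduction", "Some text.", " 1.1 Safety", "1.2 Tools", "2 Assembly", "2.1 Frame"]
headings = find_headings(lines)
```

The sequence check is what filters out false positives such as a text line that merely happens to start with a number.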
48Hypertext Analysis
[Diagram: logical structure -> hypertext analysis -> structured HyperDocument]
Structures produced by hypertext analysis:
- Hierarchical structure
- Linear structure
- Index structure
- Side-loop structure
- Cross-group structure
- Cross-references
49Hypertext Analysis
- Hierarchical structure
- geometric structure irrelevant after document image analysis
- logical structure most important
- Linear structure
- detected reading order
- list of detected figures
50Hypertext Analysis
- Index structure
- list of detected labels
- important keywords
- can be found using statistical analysis as explained in document indexing
- Side-loop structure
- relies on OCR to detect superscripts or other conventions
- Cross-reference structure
- should be found by semantic analysis of the text
51Cross-group structure
- Cross-group links from text to whole figures
- search for reference patterns, e.g.
- <Note reference figure> <numeral>
- <Note reference figure> <numeral> <and> <<numeral><,>>
- Consistency checking
- check figure number range
- check for order in one reference sequence
- Figure scope
- the part of the text between different references
defines the scope of the figure in the text
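A minimal sketch of reference detection, range checking, and figure scopes. The regular expression only covers the English forms "figure N", "fig. N", and "fig N"; it, the function name, and the rule that a scope runs until the next reference to a different figure are assumptions for illustration. The order check within one reference sequence is not implemented here.

```python
import re

FIG_REF = re.compile(r"\bfig(?:ure|\.)?\s*(\d+)", re.IGNORECASE)


def figure_scopes(text, n_figures):
    """Return (figure_number, start, end) character ranges in the text.

    References outside 1..n_figures fail the range check and are ignored.
    A scope runs from a reference to the next reference to a different figure.
    """
    refs = [(int(m.group(1)), m.start()) for m in FIG_REF.finditer(text)]
    refs = [(n, pos) for n, pos in refs if 1 <= n <= n_figures]
    scopes = []
    for n, pos in refs:
        if scopes and scopes[-1][0] == n:
            continue  # same figure again: its scope simply continues
        if scopes:
            prev_n, prev_start, _ = scopes[-1]
            scopes[-1] = (prev_n, prev_start, pos)
        scopes.append((n, pos, len(text)))
    return scopes


text = ("See figure 1 for the lever. Figure 1 also shows the knob. "
        "In fig. 2 the carriage is drawn. Figure 9 is missing.")
scopes = figure_scopes(text, n_figures=2)
```

The reference to figure 9 is discarded by the range check, so the scope of figure 2 runs to the end of the text.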
52Cross-group structure
- Cross group links between figure and text
- use scope(s) of specific figure as found in above step
- match text of label with the body of text
- match each individual word, combine close matches
- match semantic label of an icon with the body of
text by considering the legend
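Matching a multi-word label against the text in a figure's scope can be sketched as follows: each label word is matched individually, and matches are combined when they lie close together. The function name, the punctuation stripping, and the `max_gap` distance (in words) are assumptions; a real system would also need to tolerate OCR errors.

```python
def match_label(label, body_words, max_gap=2):
    """Return positions of the first label word that have all other label
    words within max_gap words in the body text."""
    def norm(word):
        return word.lower().strip(".,;:()")

    body = [norm(w) for w in body_words]
    label_words = [norm(w) for w in label.split()]
    # Individual matches: positions of every label word in the body.
    positions = {lw: [i for i, b in enumerate(body) if b == lw] for lw in label_words}
    hits = []
    for p in positions[label_words[0]]:
        # Combine close matches: every label word must occur near p.
        if all(any(abs(p - q) <= max_gap for q in positions[lw]) for lw in label_words):
            hits.append(p)
    return hits


body = ("Turn the paper release lever, then pull the release gently. "
        "The lever for paper release is on the left.").split()
hits = match_label("release lever", body)
```

With the default gap only the first occurrence, where "release" and "lever" are adjacent, is accepted; a larger gap would also accept looser co-occurrences.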
53Presentation rules
- Make structures explicit
- provide access to all 6 structures identified
- Allow for media specific navigation
- provide access to the set of figures and the text
- Leave out irrelevant information
- don't show irrelevant layout information
- show side-loops only on request
54Document presentation (HyperDoc)
- Make structures explicit
- make explicit the logical structure and all links derived from the logical structure
- introduce anchors in both text and figures for the links in cross-group structures
- Allow for media specific navigation
- Use different frames for figures and text
- next/prev buttons for figures, scrollbar for text
- Leave out all irrelevant information
- remove page numbers
- show footnotes only on request
55HyperDoc presentation
56HyperDoc summary
- Model
- for hyperdocuments at least 6 different structures can be identified
- Processing
- scanning
- layout analysis
- content analysis
- logical analysis
- hypertext analysis
- presentation
- based on the structures
57The IMAT system
- Data
- a large set of manuals from different companies with text (in digital format) and figures (in both digital and paper format)
- Goal
- automatic decomposition of the dataset into
reusable fragments so that they can be used in
system assisted generation of training material
58Introduction
Value = Assets x Reconfigurability (R. Jain, ACM Multimedia 2000)
Index terms
high value
59Introduction
Both should be decomposed and indexed for reuse
60Applications
- Course development assistance
- Example scenarios
- Query based selection of fragments
- On-the-job training
- Consult limited part of the manual when you need it
- Personalized delivery
- Deliver information based on task, level of
expertise, etc
61Why Difficult?
- Not meant for reuse
- Based on linear reading order
- Information implicit
- Document structure
- Conventions used
62Goal
- Automatic decomposition and annotation based on
- Explicit representation of the different levels of representation of a document
- Formalization of the implicit information
- A general approach suited for both text and
graphics
63Datamodel
- Three levels of document representation
- Layout primitives and their structure
- Logical primitives and their structure
- Indexed fragments and their structure
64Example graphics data
65Example text data
<item>
<bold> The processor is connected to amp-1. The purpose of the connection is to allow disabling .. of the processor </bold>
<italics> A more elaborate description tells you that . </italics>
</item>
66Layout Primitives
Definition: the smallest components in the document with consistent visual representation.
67Logical Primitives
Definition: the smallest components in the document that can be assigned a role.
68Indexed Fragments
Definition: the logical primitives endowed with semantic index terms allowing for reuse.
69Document Knowledge
- Vocabulary
- Domain ontology
- concepts to describe what the manual is about
- index terms needed for reuse
- Visual dictionary
- The set of symbols and their visualization
70Document Knowledge
- Knowledge from authoring process
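[Diagram: authoring goes from index terms to logical primitives via semantic style rules, and from logical primitives to layout primitives via layout style rules; document analysis goes the other way, using inverse layout style rules and inverse semantic style rules]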
71Layout Analysis
- Low level analysis
- Standard tags for text
- Symbol matching to image
- Detection of text, lines etc.
- Optical Character Recognition
XML/SVG tagged datafile
72Logical Analysis
- Bottom-up analysis
- to derive the possible role in the document
- Top-down analysis
- grammar based analysis
- to select the genuine role
Inverted layout style rules
Note: not unique
73Semantic Analysis
- Similar analysis as for layout analysis
- instantiates each component as a concept in the ontology
Standardized logical primitives
Inverted semantic style rules
Again not unique
Indexed fragments
74Graphics storage
75Authoring functionality
Reasoning/Ontology
76Disabling of the processor
the disable connection ...
77Conclusion
- Summary
- A set of tools is presented that automatically converts a technical manual into a set of indexed fragments which can be reused for many different purposes
- Extension
- Method is general, hence applying the techniques
to video based training material is an
interesting and viable option