Title: Inferring Structure Information from Typography
1 Inferring Structure Information from Typography
- Christian Fuß
- Dipl.-Inform. Felix Gatzemeier
- Michael Kirchhof
- Dipl.-Inform. Oliver Meyer
- Department of Computer Science III, RWTH Aachen
2 Overview
- Context
- Deriving Structure Information
- Partitioning
- Typographic abstraction
- Determine Type
- Conclusion
- Cooperation project of
- Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)
3 Today's Publication Chain
Publisher
Copy Editing
Web Publ.
Reader
Typesetting
Reading
4 Classification of Submissions
TEX
Submissions
MS Word
Unformatted
Formatted
Somehow Formatted
Correctly Formatted
5 Basic Assumptions
Known target document type
Textual Nature
Typographic markup
Consistent markup
6 Deriving Structure Information
- In: MS Word document
- Record Formatting (Format Tuples)
- Locate the Elements
- Reduce Format Tuples to Patterns
- Determine Types
- Out: XML document (also interactively)
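The output end of the chain above can be sketched in a few lines of Python. This is only an illustration, not the prototype's implementation: the function name `runs_to_xml`, the wrapper element names `document` and `para`, and the element types `plain` and `emph` are all hypothetical, and the input is assumed to be text runs that already carry a determined element type.

```python
import xml.etree.ElementTree as ET

def runs_to_xml(paragraphs, root_tag="document"):
    """Serialize typed text runs as XML.

    paragraphs: list of paragraphs, each a list of (element_type, text) pairs.
    The tags 'document' and 'para' are assumptions made for this sketch.
    """
    root = ET.Element(root_tag)
    for runs in paragraphs:
        para = ET.SubElement(root, "para")
        for element_type, text in runs:
            # One child element per run, named after its element type.
            ET.SubElement(para, element_type).text = text
    return ET.tostring(root, encoding="unicode")

# Hypothetical element types for the running example sentence.
example_xml = runs_to_xml(
    [[("plain", "Is this a "), ("emph", "dagger"), ("plain", "?")]]
)
```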
7 Format Tuples
- The basic typographic abstraction
- FormatTuple("Is this a dagger?") = (Times, 22pt, regular, roman)
- Here: (Font, Size, Weight, Variation)
- Planned: search expressions modulo text
- More general: including regular expressions of text content or context
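A format tuple as described above could be modelled as an immutable record. The sketch below is an assumption about representation only; the class name `FormatTuple` follows the slide, while the field names are chosen for illustration.

```python
from typing import NamedTuple

class FormatTuple(NamedTuple):
    """(Font, Size, Weight, Variation) of a piece of formatted text."""
    font: str
    size_pt: int
    weight: str
    variation: str

# The format of the example sentence on the slide.
regular = FormatTuple("Times", 22, "regular", "roman")

# A bold span differs from it in exactly one component.
bold = regular._replace(weight="bold")

# Tuples are hashable, so they can serve as keys of a format-to-type map.
format_to_type = {regular: "dummyType1", bold: "dummyType2"}
```

Because equality is component-wise, two runs share a format tuple only if font, size, weight and variation all agree.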
8 Locate the Elements
- Tree-Partitioning of Formatted Character Streams on
- Format Tuple changes
- Paragraph breaks
- Nesting of Inline Elements
- Is this a dagger? → <ft1>
- Is this a dagger? → <ft1 <ft2> ft1>
- Is this a dagger? → <ft1 <ft2> >
- Is this a dagger? → <ft1 <ft2 <ft3> > >
- Format-To-Type Map
FormatTuple                       ElementType
ft1 (times, 22pt, reg, roman)     dummyType1
ft2 (times, 22pt, bold, roman)    dummyType2
ft3 (times, 22pt, reg, italic)    dummyType3
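The partitioning step can be illustrated with a simplified, flat version: split a stream of formatted characters wherever the format tuple changes or a paragraph ends, then give each distinct format a dummy type. The sketch deliberately omits the nesting of inline elements; the function names and the use of "\n" as the paragraph break marker are assumptions.

```python
from itertools import groupby

def partition(stream):
    """Split (char, format) pairs into paragraphs of (format, text) runs.

    A new run starts at every format change; "\n" ends a paragraph.
    """
    paragraphs, current = [], []
    key = lambda pair: (pair[0] == "\n", pair[1])
    for (is_break, fmt), group in groupby(stream, key=key):
        if is_break:
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append((fmt, "".join(ch for ch, _ in group)))
    if current:
        paragraphs.append(current)
    return paragraphs

def build_format_map(paragraphs):
    """Assign dummyType1, dummyType2, ... in order of first appearance."""
    fmap = {}
    for runs in paragraphs:
        for fmt, _ in runs:
            fmap.setdefault(fmt, f"dummyType{len(fmap) + 1}")
    return fmap

FT1 = ("times", "22pt", "reg", "roman")
FT2 = ("times", "22pt", "bold", "roman")
stream = (
    [(c, FT1) for c in "Is this a "]
    + [(c, FT2) for c in "dagger"]
    + [(c, FT1) for c in "?"]
    + [("\n", FT1)]
    + [(c, FT1) for c in "Next"]
)
paragraphs = partition(stream)
fmap = build_format_map(paragraphs)
```

Here `paragraphs` holds two paragraphs, the first with three runs, and `fmap` maps `FT1` to `dummyType1` and `FT2` to `dummyType2`.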
9 Format patterns
- Identity too restrictive → wildcard generalization
- Is this a dagger? ?(,,)
Times Times Times
22pt 22pt 22pt
regular bold regular bold
roman roman roman
- ?(?, a, b) ?(a, a, b) ?(a, b, ?) ?(a, b, b)
- ?(?, a, ?) propagated to paragraph level
- Format-To-Type Map
10 Determine Types
- Replace dummy types in Format-To-Type Map
- Preconfiguration by publisher
- Controlled Learning from the author
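The replacement of dummy types could look roughly like the following sketch. It assumes that the publisher's preconfiguration is a plain mapping from formats to element types and that the author is consulted through a callback; `determine_types`, `ask_author` and the element type names are illustrative only.

```python
def determine_types(format_map, preconfigured, ask_author):
    """Replace dummy types by real element types.

    format_map:    format -> dummy type
    preconfigured: format -> element type supplied by the publisher
    ask_author:    callback (format, dummy) -> element type, or None to skip
    Returns (resolved map, newly learned entries).
    """
    resolved, learned = {}, {}
    for fmt, dummy in format_map.items():
        if fmt in preconfigured:
            resolved[fmt] = preconfigured[fmt]
            continue
        answer = ask_author(fmt, dummy)
        if answer:
            learned[fmt] = answer      # remember the author's decision
            resolved[fmt] = answer
        else:
            resolved[fmt] = dummy      # stays a dummy type for now
    return resolved, learned

FT1 = ("times", "22pt", "reg", "roman")
FT2 = ("times", "22pt", "bold", "roman")
FT3 = ("times", "22pt", "reg", "italic")
format_map = {FT1: "dummyType1", FT2: "dummyType2", FT3: "dummyType3"}
publisher = {FT1: "paragraph"}  # hypothetical preconfiguration

# Stand-in for an interactive dialogue with the author.
ask = lambda fmt, dummy: "emphasis" if fmt[2] == "bold" else None

resolved, learned = determine_types(format_map, publisher, ask)
```

Keeping the learned entries separate from the publisher's preconfiguration makes it possible to review the author's decisions before reusing them.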
11 Further usable information
- Allowed context from the DTD
- Paragraph standard format
- Text patterns
- Bullets
- Enumeration
- Whitespace
- ASCII Markup (Is this a dagger?)
- Format pattern match confidence
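Text patterns such as bullets and enumerations can be detected with simple regular expressions. The heuristics below are a rough sketch; the particular bullet characters and enumeration styles are assumptions, not a list taken from the prototype.

```python
import re

# Leading "-", "*" or "•" followed by whitespace.
BULLET = re.compile(r"^\s*[-*•]\s+")
# Leading number or single letter followed by "." or ")" and whitespace.
ENUMERATION = re.compile(r"^\s*(\d+|[a-zA-Z])[.)]\s+")

def classify_line(line):
    """Return a coarse text-pattern label for one line of plain text."""
    if BULLET.match(line):
        return "bullet"
    if ENUMERATION.match(line):
        return "enumeration"
    if not line.strip():
        return "whitespace"
    return "text"

labels = [classify_line(s) for s in ["- item", "1. First", "a) second", "   ", "Is this a dagger?"]]
```

Such labels are only hints: a line like "A. Smith wrote" would also be labelled as an enumeration, so they are best combined with the other sources of information listed above.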
12 Motivational aspects
- Quick feedback on formal correctness
- Publication preview while keeping format freedom
- (Via XSL) flexible previews of other formats
- New structure-based functionality
- Structure editing
- Structure evaluation
- Document templates
13 Conclusion
- Summary
- 4-step inference
- Record format tuples
- Locate the elements
- Reduce tuples to patterns
- Determine types
- Increase efficiency of publication chain
- Provide unobtrusive structuring for non-expert authors
- Plans
- Cautious extension of inference
- Validation of document
- Evaluation with authors