Inferring Structure Information from Typography

September 14, 2000 / Digital Documents and Electronic Publishing 2000


1
Inferring Structure Information from Typography
  • Christian Fuß
  • Dipl.-Inform. Felix Gatzemeier
  • Michael Kirchhof
  • Dipl.-Inform. Oliver Meyer
  • Department of Computer Science III, RWTH Aachen

2
Overview
  • Context
  • Deriving Structure Information
  • Partitioning
  • Typographic abstraction
  • Determine Type
  • Conclusion
  • Cooperation project of
  • Prototype aTool in the WEP group of the
    Global-Info Project
    (www.global-info.org)

3
Today's Publication Chain
Publisher
Copy Editing
Web Publ.
Reader
Typesetting
Reading
4
Classification of Submissions
TEX
Submissions
MS Word
Unformatted
Formatted
Somehow Formatted
Correctly Formatted
5
Basic Assumptions
Known target document type
Textual Nature
Typographic markup
Consistent markup
6
Deriving Structure Information
  • In: MS Word document
  • Record Formatting (Format Tuples)
  • Locate the Elements
  • Reduce Format Tuples to Patterns
  • Determine Types
  • Out: XML document
  • Also interactively
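The output end of this pipeline, writing typed elements as an XML document, might be sketched as follows. This is only an illustration using Python's standard `xml.etree.ElementTree`; the element names (`document`, `para`, `plain`, `emphasis`) and the function `to_xml` are assumptions made for the example and are not taken from the presentation.

```python
import xml.etree.ElementTree as ET

def to_xml(paragraphs, root_tag="document"):
    """Serialize paragraphs of (element_type, text) runs as an XML string.

    The tag names are hypothetical; a real target document type would
    supply its own element names.
    """
    root = ET.Element(root_tag)
    for runs in paragraphs:
        para = ET.SubElement(root, "para")
        for element_type, text in runs:
            # One child element per typed run, in document order.
            ET.SubElement(para, element_type).text = text
    return ET.tostring(root, encoding="unicode")

# Hypothetical typed runs for the sample sentence used on the slides.
typed_runs = [[("plain", "Is this a "), ("emphasis", "dagger"), ("plain", "?")]]
xml_text = to_xml(typed_runs)
print(xml_text)
```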

7
Format Tuples
  • The basic typographic abstraction
  • FormatTuple("Is this a dagger?") = (Times, 22pt, regular, roman)
  • Here: Font, Size, Weight, Variation
  • Planned: Search expressions modulo Text
  • More general: Including regular expressions
    of text content or context.
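A format tuple as described on this slide could be modelled as a small immutable record. The sketch below is one possible representation, not the prototype's actual data structure; the names `FormatTuple`, `Run`, and `format_tuple`, and the field names, are assumptions chosen for illustration.

```python
from typing import NamedTuple

class FormatTuple(NamedTuple):
    """The four components named on the slide: Font, Size, Weight, Variation."""
    font: str
    size_pt: int
    weight: str
    variation: str

class Run(NamedTuple):
    """A piece of text together with the formatting it carries (hypothetical)."""
    text: str
    fmt: FormatTuple

def format_tuple(run: Run) -> FormatTuple:
    # The abstraction keeps only the typography and ignores the text itself.
    return run.fmt

run = Run("Is this a dagger?", FormatTuple("Times", 22, "regular", "roman"))
other = Run("Some other text", FormatTuple("Times", 22, "regular", "roman"))

# Two runs with different text but equal typography share one format tuple.
print(format_tuple(run) == format_tuple(other))
```

Because a `NamedTuple` is hashable, such tuples can later serve directly as dictionary keys, for example in a format-to-type map.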

8
Locate the Elements
  • Tree-Partitioning of Formatted Character Streams on
  • Format Tuple changes
  • Paragraph breaks
  • Nesting of Inline Elements
  • Is this a dagger? → <ft1>
  • Is this a dagger? → <ft1 <ft2> ft1>
  • Is this a dagger? → <ft1 <ft2> >
  • Is this a dagger? → <ft1 <ft2 <ft3> > >
  • Format-To-Type Map

FormatTuple                          ElementType
ft1 (times, 22pt, reg, roman)        dummyType1
ft2 (times, 22pt, bold, roman)       dummyType2
ft3 (times, 22pt, reg, italic)       dummyType3
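The partitioning step and the initial Format-To-Type Map can be sketched roughly as follows. This is a simplified, flat version: it splits a stream of formatted characters on paragraph breaks and on format tuple changes, but does not build the nested tree of inline elements shown on the slide. The function names, the use of `"\n"` as paragraph break, and the choice of which words in the sample sentence are bold or italic are all assumptions made for the example.

```python
from itertools import groupby

def partition(chars):
    """Split (character, format_tuple) pairs into paragraphs of runs.

    A "\n" character is treated as a paragraph break (an assumption);
    within a paragraph, a new run starts whenever the format tuple changes.
    """
    paragraphs, current = [], []
    for ch, fmt in chars:
        if ch == "\n":
            paragraphs.append(current)
            current = []
        else:
            current.append((ch, fmt))
    if current:
        paragraphs.append(current)
    return [
        [("".join(c for c, _ in group), fmt)
         for fmt, group in groupby(para, key=lambda item: item[1])]
        for para in paragraphs
    ]

def assign_dummy_types(paragraphs):
    """Give every distinct format tuple a placeholder type, in order of first use."""
    mapping = {}
    for runs in paragraphs:
        for _, fmt in runs:
            if fmt not in mapping:
                mapping[fmt] = f"dummyType{len(mapping) + 1}"
    return mapping

ft1 = ("times", "22pt", "reg", "roman")
ft2 = ("times", "22pt", "bold", "roman")
ft3 = ("times", "22pt", "reg", "italic")

def styled(text, fmt):
    return [(c, fmt) for c in text]

# Hypothetical formatting of the sample sentence.
chars = (styled("Is ", ft1) + styled("this", ft2) + styled(" a ", ft1)
         + styled("dagger", ft3) + styled("?", ft1))

paragraphs = partition(chars)
format_to_type = assign_dummy_types(paragraphs)
print(paragraphs)
print(format_to_type)
```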
9
Format patterns
  • Identity too restrictive → wildcard
    generalization Is this a
    dagger? ?(,,) Times Times Times 22pt 22pt 22pt
    regular bold regular bold roman roman roman
  • ?(?, a, b) ?(a, a, b) ?(a, b, ?) ?(a,
    b, b)
  • ?(?, a, ?) propagated to paragraph level
  • Format-To-Type Map
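One plausible reading of the wildcard generalization is a component-wise merge: where all format tuples agree on a component the value is kept, and where they differ it is replaced by a wildcard. The sketch below follows that reading only; the exact operator on the slide is not recoverable from the transcript, and the names `WILDCARD`, `generalize`, and `matches` are assumptions.

```python
WILDCARD = "*"

def generalize(*tuples):
    """Merge format tuples component-wise; differing components become wildcards."""
    return tuple(
        values[0] if all(v == values[0] for v in values) else WILDCARD
        for values in zip(*tuples)
    )

def matches(pattern, fmt):
    """A format tuple matches a pattern if every non-wildcard component is equal."""
    return all(p == WILDCARD or p == f for p, f in zip(pattern, fmt))

pattern = generalize(
    ("Times", "22pt", "regular", "roman"),
    ("Times", "22pt", "bold", "roman"),
    ("Times", "22pt", "regular", "roman"),
)
print(pattern)
print(matches(pattern, ("Times", "22pt", "bold", "roman")))
print(matches(pattern, ("Helvetica", "22pt", "bold", "roman")))
```

Under this reading, a pattern is less restrictive than exact tuple identity: a paragraph that mixes regular and bold text in the same font and size still maps to a single pattern.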

10
Determine Types
  • Replace dummy types in Format-To-Type Map
  • Preconfiguration by publisher
  • Controlled Learning from the author
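Replacing the dummy types could work roughly as in the sketch below: a publisher-supplied configuration is consulted first, and the author is asked only about formats it does not cover. The function `determine_types`, the `ask_author` callback, and the element names (`para`, `emphasis`) are hypothetical and stand in for whatever the prototype actually uses.

```python
def determine_types(format_to_type, publisher_config, ask_author):
    """Replace dummy types with real element types.

    publisher_config: dict from format tuple to element type (preconfiguration).
    ask_author: callable returning an element type for a format tuple,
                or None if the author gives no answer (controlled learning).
    """
    resolved = {}
    for fmt, dummy in format_to_type.items():
        if fmt in publisher_config:
            resolved[fmt] = publisher_config[fmt]
        else:
            answer = ask_author(fmt)
            # Keep the dummy type when nobody can name the element yet.
            resolved[fmt] = answer if answer else dummy
    return resolved

ft1 = ("times", "22pt", "reg", "roman")
ft2 = ("times", "22pt", "bold", "roman")
ft3 = ("times", "22pt", "reg", "italic")

format_to_type = {ft1: "dummyType1", ft2: "dummyType2", ft3: "dummyType3"}
publisher_config = {ft1: "para"}          # hypothetical preconfiguration
author_answers = {ft3: "emphasis"}        # hypothetical author feedback

resolved = determine_types(format_to_type, publisher_config, author_answers.get)
print(resolved)
```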

11
Further useable information
  • Allowed context from the DTD
  • Paragraph standard format
  • Text patterns
  • Bullets
  • Enumeration
  • Whitespace
  • ASCII Markup (Is this a dagger?)
  • Format pattern match confidence
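Text patterns such as bullets and enumerations could be detected with simple regular expressions on the start of a paragraph. The patterns and category names below are illustrative assumptions, not the rules used by the prototype, and they will misclassify some inputs (for example a sentence starting with an initial such as "A. ").

```python
import re

# Leading bullet character followed by whitespace.
BULLET = re.compile(r"^\s*[-*•]\s+")
# Leading number or single letter followed by "." or ")" and whitespace.
ENUMERATION = re.compile(r"^\s*(\d+|[a-zA-Z])[.)]\s+")

def classify_paragraph(text):
    """Return a rough structural hint derived from the paragraph's leading text."""
    if BULLET.match(text):
        return "bullet-item"
    if ENUMERATION.match(text):
        return "enumerated-item"
    return "plain"

for sample in ["• Context", "1. Record format tuples", "b) Locate the elements", "Conclusion"]:
    print(sample, "->", classify_paragraph(sample))
```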

12
Motivational aspects
  • Quick feedback on formal correctness
  • Publication preview while keeping format freedom
  • (Via XSL) flexible previews of other formats
  • New structure-based functionality
  • Structure editing
  • Structure evaluation
  • Document templates

13
Conclusion
  • Summary
  • 4-step inference
  • Record format tuples
  • Locate the elements
  • Reduce tuples to patterns
  • Determine types
  • Increase efficiency of publication chain
  • Provide unobtrusive structuring for non-expert
    authors
  • Plans
  • Cautious extension of inference
  • Validation of document
  • Evaluation with authors
