On Embedding Machine-Processable Semantics into Documents

1
On Embedding Machine-Processable Semantics into
Documents
  • Krishnaprasad Thirunarayan
  • Department of Computer Science & Engineering
  • Wright State University
  • Dayton, OH 45435, USA

2
Talk Outline
  • Background and Motivation (Why?)
  • Goals (What?)
  • Details (How?)
  • Conclusions

3
Background and Motivation

4
Content Extraction
[Figure: content extraction formalizes the document using a controlled vocabulary. Labels: Heterogeneous Doc.; Spec.; Defn.; Rep.]

5
Problems with this approach to content extraction
  • Archiving the spec (for human comprehension)
    separately from its formalization is not
    conducive to traceability.
  • Manual extraction from the spec (from scratch) for
    each use is labor intensive, time consuming, and
    prone to typographical errors.

6
Observation
  • Conceptually, every piece of information in an
    extraction owes its existence to a phrase in the
    spec and, possibly, to the controlled vocabulary.
  • So, explore techniques to maintain the correspondence
    between a spec fragment and its formalization.
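
One way to picture this correspondence is a record that stores each formalized item together with a pointer back to the phrase that justifies it. The sketch below is only an illustration in Python; the class names, the term identifier, and the sample sentence are all made up for the example and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceSpan:
    """Character range of a phrase inside the spec text."""
    start: int
    end: int


@dataclass(frozen=True)
class ExtractedFact:
    """One formalized item plus a pointer back to the phrase it came from."""
    term: str         # controlled-vocabulary term (hypothetical naming scheme)
    value: str        # value read from the spec
    span: SourceSpan  # where the supporting phrase sits in the spec


def supporting_phrase(spec_text: str, fact: ExtractedFact) -> str:
    """Return the spec fragment that justifies the fact."""
    return spec_text[fact.span.start:fact.span.end]


# Made-up sample sentence, used only to exercise the data structure.
spec = "Tensile strength shall be 165 ksi for sheet 0.50 mm and under."
phrase = "165 ksi"
start = spec.index(phrase)
fact = ExtractedFact("strength.tensile", "165", SourceSpan(start, start + len(phrase)))
```

Because the span travels with the fact, a reviewer can always ask which words in the spec a given value came from.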

7
Goal
8
General Problem
  • Embed domain-specific mark-up (annotations) into a
    human-sensible document
  • to make explicit the semantics of content text and
    complex data, and
  • to augment an interpretation in a modular
    fashion.
  • Document text: human comprehensible
  • Semantic mark-up: machine processable

9
Details (How?)
10
Nature of Specs
  • Semi-structured
  • Heterogeneous: text, tables, images
  • Constrained technical vocabulary
  • Available as MS Word document

11
Pre-processing Spec
  • Abstract content from the spec document by removing
    display-oriented information
  • Save text
  • Save tabular data, preserving grid layout
  • Retain links to images
  • Note: the Save As text option in MS Word is inadequate
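
As a rough illustration of "preserving grid layout", the Python sketch below writes table cells as fixed-width text columns so that the rows stay aligned in plain ASCII. It assumes the cells have already been pulled out of the Word document as lists of strings with the same number of cells per row; the function name and the " | " separator are choices made for this example, not something the talk prescribes.

```python
def render_grid(rows):
    """Render rows of cell strings as fixed-width text columns."""
    # Width of each column is the length of its longest cell.
    widths = [max(len(row[c]) for row in rows) for c in range(len(rows[0]))]
    lines = []
    for row in rows:
        cells = [cell.ljust(width) for cell, width in zip(row, widths)]
        # Drop trailing padding on the last cell.
        lines.append(" | ".join(cells).rstrip())
    return "\n".join(lines)


rows = [
    ["Thickness (mm)", "Tensile Strength (ksi)", "Yield Strength (ksi)"],
    ["0.50 and under", "165", "155"],
    ["0.50 - 1.00", "160", "150"],
    ["1.00 - 1.50", "155", "145"],
]
text = render_grid(rows)
```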

12
Heterogeneous Document
13
XML generated by Majix
14
ASCII Output
15
Annotating Pre-processed Spec
  • Embedding machine-processable semantics
  • Recognizing and tagging text using the controlled
    vocabulary
  • A by-product of document indexing and semantic
    search
  • Tagging tabular data to make its semantics
    explicit: same grid layout, but different
    interpretation and dependencies based on
    headings
  • Explore the XML-based programming language Water
    for defining data and its behavior (semantics)
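
Recognizing and tagging controlled-vocabulary terms can be sketched with a plain regular-expression pass, shown here in Python. The vocabulary entries, the `<term id="...">` element, and the function name are assumptions made for illustration; the talk does not specify a tag format for running text.

```python
import re

# Hypothetical controlled vocabulary: surface phrase -> term identifier.
VOCABULARY = {
    "tensile strength": "strength.tensile",
    "yield strength": "strength.yield",
    "thickness": "thickness",
}


def tag_terms(text, vocabulary=VOCABULARY):
    """Wrap each vocabulary phrase in a <term id="..."> element, keeping the text."""
    # Longest phrases first so multi-word terms win over shorter ones.
    phrases = sorted(vocabulary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(p) for p in phrases) + r")\b",
        re.IGNORECASE,
    )

    def wrap(match):
        term_id = vocabulary[match.group(1).lower()]
        return f'<term id="{term_id}">{match.group(1)}</term>'

    return pattern.sub(wrap, text)


tagged = tag_terms("Yield strength depends on thickness.")
```

The original wording stays in place inside each tag, so the annotated text remains readable by people.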

16
Locating Controlled Vocabulary Terms
17
Example Table
Thickness (mm)    Tensile Strength (ksi)    Yield Strength (ksi)
0.50 and under    165                       155
0.50 - 1.00       160                       150
1.00 - 1.50       155                       145
18
Example of Tagged Table
  • Thickness (mm)  Tensile Strength (ksi)  Yield Strength (ksi)
  • table.<setHeading thickness strength.tensile strength.yield/>
  • 0.50 and under  165  155
  • table.<addRow 0 0.50 165 155/>
  • 0.50 - 1.00  160  150
  • table.<addRow 0.50 1.00 160 150/>
  • 1.00 - 1.50  155  145
  • table.<addRow 1.00 1.50 155 145/>
  • ...
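
The interleaving on this slide, each source row followed by its addRow annotation, could in principle be generated rather than typed by hand. The Python sketch below assumes the thickness cell is either "X and under" or "A - B", and assumes the annotation is written with angle brackets as `table.<addRow ... />`; both the parsing rules and the helper names are illustrative guesses.

```python
import re


def parse_thickness_range(cell):
    """Turn a thickness cell into (lower, upper) bounds as strings."""
    under = re.fullmatch(r"\s*([\d.]+)\s+and under\s*", cell)
    if under:
        # "X and under" is read as the range from 0 up to X.
        return "0", under.group(1)
    between = re.fullmatch(r"\s*([\d.]+)\s*-\s*([\d.]+)\s*", cell)
    if between:
        return between.group(1), between.group(2)
    raise ValueError(f"unrecognized thickness cell: {cell!r}")


def annotate_rows(rows):
    """Emit each source row followed by its addRow annotation."""
    lines = []
    for thickness, tensile, yield_ in rows:
        lower, upper = parse_thickness_range(thickness)
        lines.append(f"{thickness} {tensile} {yield_}")
        lines.append(f"table.<addRow {lower} {upper} {tensile} {yield_}/>")
    return lines
```

Cells that match neither pattern raise an error instead of being silently skipped, which suits the irregular tables that real specs contain.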

19
Example of Processing Code
    <defclass table rows=required=vector heading=optional=vector>
      <defmethod setHeading t=required ts=required ys=required>
        <set heading=<vector t ts ys/>/>
      </>
      <defmethod addRow smin smax ts ys>
        <set rows=table.rows.<insert <vector smin smax ts ys/>/>/>
      </>
      <defmethod computeYieldStrength> </>
      <defmethod computeTensileStrength> </>
    </>

20
(contd)
    <defclass table rows=required=vector heading=optional=vector>
      <defmethod computeTensileStrength>
        <set temp=fluid.Thickness/>
        <set i=0/>
        <do>
          <until <and temp.<less table.rows.<get i/>.1/>
                      temp.<more_or_equal table.rows.<get i/>.0/> /> >
            table.rows.<get i/>.2
          </until>
          <set i=i.<plus 1/>/>
        </do>
      </>
    </>

21
(contd)
    <defclass table rows=required=vector heading=optional=vector>
    </>
    fluid.<set Thickness=0.60/>
    <try
      <set TensileStrength=table.<computeTensileStrength/>/>
      TensileStrength
    >
      "TABLE out of range error occurred"
    </try>
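
For readers unfamiliar with Water, the behavior spread over the last three slides can be approximated in Python. This is a loose paraphrase, not a translation: the class and exception names are invented, and the thickness is passed as an argument rather than read from a fluid variable. The range test mirrors the loop above (lower bound inclusive, upper bound exclusive).

```python
class TableRangeError(LookupError):
    """Raised when a thickness falls outside every row of the table."""


class StrengthTable:
    def __init__(self):
        self.heading = None
        self.rows = []

    def set_heading(self, t, ts, ys):
        self.heading = (t, ts, ys)

    def add_row(self, smin, smax, ts, ys):
        self.rows.append((smin, smax, ts, ys))

    def _lookup(self, thickness, column):
        # Same test as the loop on the slides: smin <= thickness < smax.
        for row in self.rows:
            if row[0] <= thickness < row[1]:
                return row[column]
        raise TableRangeError(f"thickness {thickness} is out of range")

    def compute_tensile_strength(self, thickness):
        return self._lookup(thickness, 2)

    def compute_yield_strength(self, thickness):
        return self._lookup(thickness, 3)


table = StrengthTable()
table.set_heading("thickness", "strength.tensile", "strength.yield")
table.add_row(0, 0.50, 165, 155)
table.add_row(0.50, 1.00, 160, 150)
table.add_row(1.00, 1.50, 155, 145)

try:
    result = table.compute_tensile_strength(0.60)
except TableRangeError:
    result = "TABLE out of range error occurred"
```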

22
Water
  • XML-based OO Scripting Language
  • Facilitates creating Web Services
  • Run methods remotely via web-browser
  • Generalizes dynamic typing to constraint checking
  • Conformance of actuals to formals
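
The idea of checking that actual arguments conform to formal parameters, with constraints that go beyond a type name, can be illustrated in Python. This is not Water's mechanism, only a sketch of the concept; the function, the formals table, and the constraints in it are hypothetical.

```python
def check_actuals(formals, actuals):
    """Check call arguments against per-parameter constraints.

    formals maps a parameter name to (required, predicate); predicate is any
    function returning True when a value is acceptable.
    """
    for name, (required, predicate) in formals.items():
        if name not in actuals:
            if required:
                raise TypeError(f"missing required argument: {name}")
            continue
        if not predicate(actuals[name]):
            raise TypeError(f"argument {name}={actuals[name]!r} violates its constraint")
    unknown = set(actuals) - set(formals)
    if unknown:
        raise TypeError(f"unexpected arguments: {sorted(unknown)}")
    return True


# Hypothetical formals for an addRow-like method.
ADD_ROW_FORMALS = {
    "smin": (True, lambda v: isinstance(v, (int, float)) and v >= 0),
    "smax": (True, lambda v: isinstance(v, (int, float)) and v > 0),
    "note": (False, lambda v: isinstance(v, str)),
}
```

A predicate can express any condition on the value (here, non-negative numbers), which is the sense in which constraint checking generalizes a plain type check.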

23
Pros and cons
  • Encoding Improvement
  • The amount of tagging can be controlled by suitably
    delimiting table data and annotating it with the
    corresponding string-processing method
  • Master Copy Update
  • Changes to the spec require manual modification of
    the archived annotated version.
  • Irregular Tables in Specs
  • Different units, etc.
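
One of the irregularities named above, differing units, could be handled by normalizing values before they reach the table. The Python sketch below assumes millimetres and ksi as base units, matching the example table; the dictionaries and function name are illustrative, while the factors are standard conversions (1 in = 25.4 mm, 1 ksi = 1000 psi, 1 ksi ≈ 6.894757 MPa).

```python
# Conversion factors to the units used in the example table.
TO_MM = {"mm": 1.0, "in": 25.4}
TO_KSI = {"ksi": 1.0, "psi": 0.001, "MPa": 1.0 / 6.894757}


def normalize(value, unit, factors):
    """Convert value to the table's base unit, or fail on an unknown unit."""
    try:
        return value * factors[unit]
    except KeyError:
        raise ValueError(f"unsupported unit: {unit!r}") from None
```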

24
Some Related Work
  • Microsoft Smart Tags
  • Recognize controlled words in Office 2003
    documents and associate a predefined list of
    actions with each occurrence
  • SHOE
  • Table data in a declarative (logic) language

25
Prolog rendition
    strengthTableRow(0,    0.50, 165, 155).
    strengthTableRow(0.50, 1.00, 160, 150).
    strengthTableRow(1.00, 1.50, 155, 145).
    % ...
    strengthTable(Thickness, TensileStrength, YieldStrength) :-
        strengthTableRow(L, U, TensileStrength, YieldStrength),
        L =< Thickness, U > Thickness.
    thicknessToTensileStrength(Thickness, TensileStrength) :-
        strengthTable(Thickness, TensileStrength, _).
    thicknessToYieldStrength(Thickness, YieldStrength) :-
        strengthTable(Thickness, _, YieldStrength).
    ?- thicknessToYieldStrength(0.6, YS).

26
Conclusions
27
A Step towards the Holy Grail
  • Ultimately, enable authoring and/or extracting the
    human-comprehensible and machine-processable
    parts of a document hand in hand, and keep them
    side by side.