From Tables To Frames

1
From Tables To Frames
The Third International Semantic Web Conference (ISWC 2004),
November 7-11, 2004, Hiroshima, Japan
  • Aleksander Pivk (1, 2), Philipp Cimiano (2), York Sure (2)
  • (1) Jozef Stefan Institute, Ljubljana, Slovenia
  • (2) AIFB Institute, University of Karlsruhe,
    Karlsruhe

09.11.2004
2
Outline
  • Motivation
  • Foundation Table Model
  • Methodology
  • Evaluation
  • Conclusion
  • Future Work

3
Motivation
  • problem: the well-known annotation bottleneck
  • solution: automatic metadata generation
  • goal: describe the semantics of tables in a
    model-theoretic way (F-Logic)
  • tables with different structure but the same meaning
    (should) have the same representation
  • benefit: enables e.g. query answering
  • all conferences where Prof. Studer is on the PC
  • all tours to COUNTRY at DATE where price < AMOUNT

4
Foundation Table Model
  • dimensions of the table model [Hurst00]:
  • graphical (image processing)
  • physical (inter-cell relative location)
  • structural (organization of cells indicating
    their navigational relationship)
  • functional (purpose of regions in terms of data
    access)
  • two functional cell types: A-cell (attribute cell)
    and I-cell (instance cell)
  • two functional I-cell roles: data and access
  • semantic (relation between cell content,
    structure and orientation)
  • a frame makes explicit:
  • the meaning of the cell contents (F-Logic
    concepts)
  • the functional dimension of the table (method
    signature)
  • the semantic dimension of the table (frame
    structure)
  • example

5
Table model
6
Simple Table Classes
7
Complex Table Classes
1. Over-expanded labels
3. Combination (running example)
8
Methodology
  • the methodology instantiates the table model
    stepwise
  • main differences:
  • the graphical component is not considered
  • the semantic component is extended

9
Cleaning and Normalization
  • construct an initial matrix structure
  • DOM tree
  • cleaning: fix syntactic errors (CyberNeko HTML
    parser)
  • normalization: align the table and resolve
    cells spanning multiple rows/columns (colspan,
    rowspan)
  • example
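
The normalization step can be illustrated with a minimal sketch. It assumes cells arrive as (text, rowspan, colspan) tuples per row, which is a hypothetical input format; the slides only state that the DOM tree is cleaned and that spanning cells are handled. Copying a spanning cell into every grid position it covers is one common way to obtain a rectangular matrix, not necessarily the authors' exact procedure.

```python
def expand_spans(rows):
    """Expand rowspan/colspan cells into a rectangular matrix.

    `rows` is a list of table rows; each row is a list of
    (text, rowspan, colspan) tuples in source order (assumed format).
    """
    grid = {}  # (row, col) -> cell text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a rowspan from a row above.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every position the cell covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    # Positions never covered by any cell become empty strings.
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]


# Illustrative (made-up) table: "Price" spans two columns.
table = [
    [("Tour Code", 1, 1), ("Price", 1, 2)],
    [("T1", 1, 1), ("Single", 1, 1), ("Double", 1, 1)],
]
matrix = expand_spans(table)
```

After expansion every row has the same number of columns, so later steps can address cells by (row, column) index.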

10
Structure Detection
  • detecting table orientation
  • relies on the similarity of cells (size, content,
    token types)
  • intuition:
  • if rows are similar, the orientation is vertical
    (top-to-bottom)
  • if columns are similar, the orientation is
    horizontal (left-to-right)
  • initialize logical units (LUs) and regions
  • split the table into LUs
  • group same-sized, similar cells into regions
    within LUs

11
Structure Detection
  • heuristics for assigning initial functional
    types and probabilities to cells
  • I-cell: the content of the cell consists mostly of
    tokens recognized as dates, numbers, and currencies
  • the lower-right cell is always an I-cell (p = 1)
  • the upper-left cell is always an A-cell (p = 1)
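
A minimal sketch of the initial type assignment follows. The regular expressions, the 0.5 majority threshold, and the way a probability is derived from the token share are all assumptions made for illustration; the slides only fix the two corner cells at p = 1.

```python
import re

# Hypothetical token recognizers; the slides do not specify them.
NUMBER = re.compile(r"^[\d.,]+$")
CURRENCY = re.compile(r"^[$€£¥][\d.,]+$")
DATE = re.compile(r"^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")


def is_instance_token(token):
    """True if the token looks like a date, a number, or a currency amount."""
    return any(p.match(token) for p in (NUMBER, CURRENCY, DATE))


def initial_type(cell):
    """Return (functional type, probability) for one cell's text."""
    tokens = cell.split()
    if not tokens:
        return ("A", 0.5)  # assumption: empty cells get an uninformative guess
    share = sum(is_instance_token(t) for t in tokens) / len(tokens)
    # "Mostly" is read here as a strict majority of tokens (assumption).
    return ("I", share) if share > 0.5 else ("A", 1.0 - share)


def initial_types(matrix):
    """Assign initial types to every cell, then apply the corner rules."""
    types = [[initial_type(cell) for cell in row] for row in matrix]
    types[0][0] = ("A", 1.0)    # upper-left cell is always an A-cell
    types[-1][-1] = ("I", 1.0)  # lower-right cell is always an I-cell
    return types


types = initial_types([["Room", "Price"], ["Single", "2,510"]])
```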

12
Table Orientation
  • token type hierarchy
  • hierarchical ordering permits measuring the
    distance between different types (i.e. in number
    of edges)
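
The edge-count distance can be sketched as below. The hierarchy in this snippet is hypothetical: the slide's actual tree is a figure that is not reproduced in the transcript, so the type names and their arrangement are assumptions chosen only to make the example runnable.

```python
# Hypothetical token type hierarchy, stored as child -> parent.
PARENT = {
    "ALPHANUMERIC": "TOKEN",
    "PUNCT": "TOKEN",
    "ALPHA": "ALPHANUMERIC",
    "NUMBER": "ALPHANUMERIC",
    "FIRST_UPPER": "ALPHA",
    "ALL_UPPER": "ALPHA",
    "SMALL_NUMBER": "NUMBER",
    "LARGE_NUMBER": "NUMBER",
}


def path_to_root(token_type):
    """List the type and all of its ancestors, ending at the root."""
    path = [token_type]
    while token_type in PARENT:
        token_type = PARENT[token_type]
        path.append(token_type)
    return path


def type_distance(a, b):
    """Number of edges on the path between two types in the hierarchy."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {t: i for i, t in enumerate(path_b)}
    for steps_from_a, t in enumerate(path_a):
        if t in depth_in_b:
            # First shared ancestor: edges up from a plus edges up from b.
            return steps_from_a + depth_in_b[t]
    raise ValueError("types do not share a root")
```

Sibling types such as FIRST_UPPER and ALL_UPPER end up two edges apart, while types from different branches are further away.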

13
Table Orientation
  • difference between two cells
  • difference between rows/columns
  • orientation decision
  • example
  • orientation set to vertical
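
The formulas on this slide are not preserved in the transcript, so the following is only a sketch of the stated intuition. The cell difference used here (share of token positions whose coarse type differs) and the comparison of adjacent rows and columns are assumptions, not the authors' definitions.

```python
def coarse_type(token):
    """Very coarse token typing (assumption): NUMBER vs. TEXT."""
    digits = token.replace(",", "").replace(".", "")
    return "NUMBER" if digits.isdigit() else "TEXT"


def cell_difference(a, b):
    """Share of token positions whose coarse types differ, in [0, 1]."""
    tokens_a, tokens_b = a.split(), b.split()
    longest = max(len(tokens_a), len(tokens_b))
    if longest == 0:
        return 0.0
    same = sum(coarse_type(x) == coarse_type(y)
               for x, y in zip(tokens_a, tokens_b))
    # Unpaired tokens (different cell sizes) count as differences.
    return 1.0 - same / longest


def mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def orientation(matrix):
    """Vertical if rows resemble each other more than columns do."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_diff = mean(cell_difference(matrix[r][c], matrix[r + 1][c])
                    for r in range(n_rows - 1) for c in range(n_cols))
    col_diff = mean(cell_difference(matrix[r][c], matrix[r][c + 1])
                    for r in range(n_rows) for c in range(n_cols - 1))
    return "vertical" if row_diff <= col_diff else "horizontal"


# Made-up example: each row is one record, so rows look alike.
rooms = [["Room", "Price"], ["Single", "720"], ["Double", "1,430"]]
```

Transposing the same table flips the decision, which matches the intuition that similar columns indicate a left-to-right reading.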

14
Discovery of Regions
  • algorithm (7 steps)
  • determine the table class
  • 1D, 2D, and complex (partition labels,
    over-expanded labels, combination)
  • reformulate the table

15
Discovery of Regions
  • initialize logical units and regions
  • splits:
  • at every row with a cell spanning multiple columns
    (vertical orientation)
  • at every column with a cell spanning multiple rows
    (horizontal orientation)
  • regions:
  • group same-sized, similar cells within one
    logical unit
  • update functional types and probabilities
  • learn string patterns of regions
  • learn significant forward and backward patterns
  • a pattern is a sequence of token types and tokens
    that describes the content of a significant number
    of cells
  • e.g. the pattern FIRST_UPPER Room covers Double
    Room and Single Room
  • implementation of the DATAPROG algorithm [Lerman et
    al., 2003]
  • example
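
To make the notion of a pattern covering cells concrete, here is a small sketch. It only checks whether a given pattern matches a cell; it does not implement DATAPROG's pattern learning, and the token typing rules and type names other than FIRST_UPPER are assumptions.

```python
# Type names a pattern element may use (assumed, apart from FIRST_UPPER).
TYPE_NAMES = {"NUMBER", "ALL_UPPER", "FIRST_UPPER", "ALPHA", "TOKEN"}


def token_type(token):
    """Assign one syntactic type to a token (simplified rules)."""
    if token.isdigit():
        return "NUMBER"
    if token.isalpha():
        if token.isupper():
            return "ALL_UPPER"
        if token[0].isupper():
            return "FIRST_UPPER"
        return "ALPHA"
    return "TOKEN"


def covers(pattern, cell):
    """True if every pattern element matches the token at its position.

    A pattern element is either a type name or a literal token.
    """
    elements, tokens = pattern.split(), cell.split()
    if len(elements) != len(tokens):
        return False
    for element, token in zip(elements, tokens):
        if element in TYPE_NAMES:
            if token_type(token) != element:
                return False
        elif element != token:
            return False
    return True


def coverage(pattern, cells):
    """Fraction of cells in a region that the pattern covers."""
    return sum(covers(pattern, c) for c in cells) / len(cells)
```

A pattern would count as significant for a region when its coverage is high enough; the threshold for that is not given on the slides.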

16
Discovery of Regions
18
Discovery of Regions
  • do while (distribution in LU is not uniform)
    (uniformity: a logical unit consists of logical
    sub-units, where each sub-unit includes only
    regions of the same size and orientation)
  • choose the best coherent region
  • it is used to propagate and normalize the
    neighboring regions
  • choose region that maximizes
  • normalize logical sub-unit
  • choose neighboring regions (e.g. only within the
    same rows for vertical orientation)
  • two options:
  • neighboring regions within one column DO NOT
    extend over the boundaries of the best region
  • neighboring regions within one column DO extend
    over the boundaries of the best region
  • update string patterns for updated regions
  • example

19
Building FTM
  • functional table model (FTM)
  • regions as nodes arranged in a tree
  • properties of leaf nodes:
  • are only regions consisting exclusively of
    I-cells
  • are assigned their functional role (access, data)
  • are assigned two semantic labels:
  • a label describing the content of the region
    (instances)
  • a label combining the region label and the labels
    of parent A-cell nodes
  • inner nodes are either regions consisting of
    A-cells or connection nodes (e.g. root)
  • construction of the FTM:
  • bottom-up approach (from the lowest logical unit
    upwards)
  • description through an example

20
Building FTM
  • type of the (colored) logical unit: I-cells only
  • regions are turned into leaves
  • semantic labels and roles are set to a default
    value

21
Building FTM
  • type of the (colored) logical unit: A-cells only
  • regions turned into inner nodes and connected to
    appropriate sub-nodes (leaves)

22
Building FTM
  • type of the (colored) logical unit: special case
  • close a subtree by inserting a connection node
    that reflects a logical separation in the table
    (transition from an LU with only A-cells to an LU
    with I-cells)
  • assign functional roles to leaves within a
    connected sub-tree
  • the functional role access is assigned to all
    consecutive leaves (from the left) that together
    form a unique identifier (key); the other leaves
    are assigned the functional role data
  • (possible) change of reading orientation in the
    new logical unit
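
The key-based role assignment can be sketched as follows, assuming each leaf region is available as a list of cell values with one value per record (a hypothetical representation). The sketch grows a prefix of leaves from the left until the combined values identify every record uniquely.

```python
def assign_roles(columns):
    """Assign 'access' to the shortest left prefix of leaves that forms a key.

    `columns` holds one list of values per leaf region, ordered left to
    right; all lists have the same length (one entry per record).
    If no prefix is unique, every leaf ends up as 'access' (edge case the
    slides do not discuss).
    """
    n_records = len(columns[0])
    roles = ["data"] * len(columns)
    for k in range(1, len(columns) + 1):
        roles[k - 1] = "access"
        # Combine the first k leaves into one tuple per record.
        keys = set(zip(*columns[:k]))
        if len(keys) == n_records:
            break  # the prefix is a unique identifier
    return roles


# Made-up regions: room type and person class together identify a price.
regions = [
    ["Single", "Single", "Double", "Double"],
    ["Adult", "Child", "Adult", "Child"],
    ["720", "360", "1430", "720"],
]
roles = assign_roles(regions)
```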

23
Building FTM
  • type of the (colored) logical unit: A-cells only
  • regions turned into inner nodes and connected to
    appropriate sub-nodes (leaves)
  • finally, connect all unconnected nodes to a root
    node

24
Building FTM
  • recapitulation of FTM
  • consider multiple-level sub-trees for merging
  • conditions: same tree structure and at least one
    level of matching A-cells
  • merging step:
  • merge nodes at the same position and level (leaf
    and inner nodes)
  • if merged inner nodes (A-cells) are not equal:
  • find a semantic label for the new merged node
  • create a new leaf node (with the A-cells as values)
  • assign the functional role access to the new leaf
  • example

25
Building FTM
26
Semantic Enriching of FTM
  • find semantic labels for regions by consulting:
  • WordNet lexical ontology: use synsets to find
    hypernyms
  • GoogleSets service: an additional way to find
    synonyms
  • transformations of a region's cell labels:
  • punctuation removal
  • stopword removal
  • compute IDF (a document is a cell) for each word
    and filter out the words whose value is below a
    threshold
  • select words that appear at the end of the labels
    (the nominal head of a nominal compound is at the
    end)
  • query GoogleSets with the remaining words to
    filter out the ones that are not mutually similar
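
The label transformations can be sketched as a small pipeline. The stopword list, the natural-log IDF formula, and the threshold value are assumptions; the WordNet and GoogleSets lookups are left out, so the sketch stops at the candidate words that would be passed on to them.

```python
import math
import string

# Tiny illustrative stopword list (assumption).
STOPWORDS = {"the", "of", "a", "an", "and", "per", "for", "in"}


def clean(label):
    """Remove punctuation and stopwords, lowercase the rest."""
    no_punct = label.translate(str.maketrans("", "", string.punctuation))
    return [w for w in no_punct.lower().split() if w not in STOPWORDS]


def idf(cells):
    """IDF per word, treating every cell as one document."""
    n = len(cells)
    doc_freq = {}
    for words in cells:
        for w in set(words):
            doc_freq[w] = doc_freq.get(w, 0) + 1
    return {w: math.log(n / df) for w, df in doc_freq.items()}


def candidate_words(labels, threshold):
    """Per cell, keep words with IDF >= threshold and pick the last one."""
    cells = [clean(label) for label in labels]
    scores = idf(cells)
    candidates = []
    for words in cells:
        kept = [w for w in words if scores[w] >= threshold]
        if kept:
            # The head of a nominal compound is assumed to be at the end.
            candidates.append(kept[-1])
    return candidates


# Made-up region labels; "room" occurs in every cell, so its IDF is 0.
words = candidate_words(["Single Room", "Double Room", "Twin Room"], 0.5)
```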

27
Semantic Enriching of FTM
  • assign each leaf its semantic label that
    describes the content (instances) of the region

[Figure: FTM tree with Root and Connection Node as inner nodes and the regions Tour Code, Valid, Class, and Price; each leaf carries a <label> and a functional role (e.g. data) over its cell values]
28
Final FTM
  • (final) semantic labels of leaves
  • label is a combination of a region label and
    parent A-cell nodes labels

29
Map FTM to a Frame
  • method is a tuple
  • frame is a pair
  • generation of a frame
  • create a method m for every leaf node whose
    functional role is data
  • parameters of m are all leaf nodes with the
    functional role access; they must be located on
    the same level of m's sub-tree or on m's parent
    path towards the root node
  • set the range of m according to the syntactic
    token type of its region
  • names for parameters and methods are obtained
    from the final FTM
  • example

Tour [Code => ALPHANUMERIC;
      DateValid => DATE;
      Price (PersonClass, RoomClass, TypePrice) => LARGE_NUMBER].
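
The generation rules above can be sketched as follows. The `Leaf` class, the `path` field, and the tree layout with `Unit1` and `Unit2` are hypothetical names introduced for illustration; the token types given to the access leaves are also made up. An access leaf becomes a parameter of a method when its parent path is a prefix of the method's parent path, which is one reading of "same level of the sub-tree or on the parent path towards the root".

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    name: str        # final semantic label of the leaf
    role: str        # "access" or "data"
    token_type: str  # syntactic token type of the region
    path: tuple      # ancestor node names, from the root to the parent


def is_prefix(short, long):
    return long[:len(short)] == short


def build_frame(frame_name, leaves):
    """Render an F-Logic style frame signature from FTM leaves."""
    methods = []
    for m in leaves:
        if m.role != "data":
            continue
        # Parameters: access leaves in m's sub-tree level or above it.
        params = [a.name for a in leaves
                  if a.role == "access" and is_prefix(a.path, m.path)]
        signature = f"{m.name} ({', '.join(params)})" if params else m.name
        methods.append(f"{signature} => {m.token_type}")
    return f"{frame_name} [{'; '.join(methods)}]."


# Hypothetical tree layout that reproduces the frame shown above.
leaves = [
    Leaf("Code", "data", "ALPHANUMERIC", ("Root", "Unit1")),
    Leaf("DateValid", "data", "DATE", ("Root", "Unit1")),
    Leaf("PersonClass", "access", "ALPHA", ("Root", "Unit2")),
    Leaf("RoomClass", "access", "ALPHA", ("Root", "Unit2")),
    Leaf("TypePrice", "access", "ALPHA", ("Root", "Unit2")),
    Leaf("Price", "data", "LARGE_NUMBER", ("Root", "Unit2")),
]
frame = build_frame("Tour", leaves)
```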
30
Evaluation
  • task:
  • for each table, compare the automatically
    generated frame against two manually created frames
  • measure in terms of Precision, Recall, and
    F-measure
  • dataset:
  • consists of 21 tables: 3 tables for each simple
    table class (1D, 2D) and 5 tables for each
    complex table class
  • tourism domain
  • annotators:
  • 14 subjects
  • each subject had to annotate 3 tables, each
    belonging to a different table class
  • (14 x 3 = 21 x 2 = 42)

31
Evaluation
  • performed along the following 4 functions
  • example: m1 (X, Y) => INTEGER vs.
    method1 (X, YY, W) => INTEGER
  • syntactic correctness:
  • how well the functional dimension of the table is
    captured (SynC = 2/3)
  • strict comparison:
  • calculate how identical the name_M, range_M, and
    P_M identifiers of methods are (P = 2/4, R = 2/5)
  • soft comparison:
  • for soft matching we used a combination of TF-IDF
    and the Jaro-Winkler string distance scheme [Cohen
    et al., 2003]
  • calculate soft matching for identifiers of
    methods (P = 3/4, R = 3/5, where Y ≈ YY)
  • conceptual comparison:
  • conceptually equivalent identifiers have been
    determined (e.g. RegionType ≡ RegionLocation)
  • calculate conceptual matching for identifiers of
    methods (P = 4/4, R = 4/5, where m1 ≡ method1)
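
The precision and recall figures for the example pair of methods can be reproduced with a short sketch. Here `difflib.SequenceMatcher` with a 0.6 threshold stands in for the TF-IDF and Jaro-Winkler combination used in the evaluation, and the set of conceptually equivalent identifiers is entered by hand; both are assumptions made to keep the example self-contained.

```python
from difflib import SequenceMatcher


def precision_recall(generated, reference, match):
    """Greedy one-to-one matching of identifiers under `match`."""
    remaining = list(reference)
    hits = 0
    for g in generated:
        for r in remaining:
            if match(g, r):
                hits += 1
                remaining.remove(r)  # each reference identifier is used once
                break
    return hits / len(generated), hits / len(reference)


def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else 0.0


def strict(a, b):
    return a == b


def soft(a, b):
    # Stand-in similarity (assumption), not the measure used in the talk.
    return SequenceMatcher(None, a, b).ratio() >= 0.6


EQUIVALENT = {("m1", "method1")}  # hand-entered conceptual equivalences


def conceptual(a, b):
    return soft(a, b) or (a, b) in EQUIVALENT


# Identifiers: method name, parameters, range.
generated = ["m1", "X", "Y", "INTEGER"]            # m1 (X, Y) => INTEGER
reference = ["method1", "X", "YY", "W", "INTEGER"]  # method1 (X, YY, W) => INTEGER

strict_pr = precision_recall(generated, reference, strict)
soft_pr = precision_recall(generated, reference, soft)
conceptual_pr = precision_recall(generated, reference, conceptual)
```

With these inputs the three comparisons yield (2/4, 2/5), (3/4, 3/5), and (4/4, 4/5), the values quoted on the slide.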

32
Evaluation
  • performed from 2 aspects:
  • average: consider all frames
  • maximum: choose only the best manually created
    frame for each generated frame
  • results

33
Conclusion
  • we have shown that our methodology instantiates
    the underlying table model stepwise
  • experiments show that
  • from a conceptual point of view, the system finds
    appropriate names for frames in almost 75% of cases
  • it finds totally identical names in more than 50%
    of cases
  • we demonstrated and evaluated the successful
    automatic generation of frames from HTML tables

34
Future Work
  • generate one (most general) frame from multiple
    tables
  • reduction of complexity
  • population of ontologies with instances
  • show feasibility of approach in practical setting
  • use given ontology as background knowledge

35
Thank you
Questions?
36
Inter-annotator agreement
  • max(F_X) = F_conceptual = 60%
  • only 2 totally identical frames (2/21 = 9.52%)
  • only 5 identical frames from a conceptual view
    (5/21 = 23.81%)
  • these 5 tables cover all 1D class tables and 2
    (out of 3) 2D class tables
  • possible reasons for low agreement:
  • the annotators did not follow the guidelines
    precisely
  • the task itself is hard
  • the annotation guidelines were not clear/detailed
    enough
  • actual results

37
Example 1
38
Example 1
  • Generated Frame:
    Tour [Name (Code) => TOKEN;
          Price (Code) => CURRENCY;
          Hotel (Code) => TOKEN;
          Meal (Code) => TOKEN]
  • Annotator 1:
    Tour [TourCode => ALPHANUMERIC;
          TourName => TOKEN;
          Price => CURRENCY;
          Hotel => TOKEN;
          Meal => TOKEN]
  • Annotator 2:
    TourCode [TourName => TOKEN;
              Price => CURRENCY;
              Hotel => ALPHANUMERIC;
              Meal => ALPHANUMERIC]

39
Example 2
40
Example 2
  • Generated Frame:
    Trip [Cost (TimePeriod) => CURRENCY;
          Insurance (TimePeriod) => CURRENCY]
  • Annotator 1:
    Trip [Cost (Duration) => CURRENCY;
          Insurance (Duration) => CURRENCY]
  • Annotator 2:
    Trip [Duration => ALPHANUMERIC;
          DurationType => ALPHANUMERIC;
          Cost => CURRENCY;
          Insurance => CURRENCY]

41
Example 3
42
Example 3
  • Generated Frame:
    Transportation [Description (Transportation) => STRING;
                    HalfDay (Transportation) => CURRENCY;
                    FullDay (Transportation) => CURRENCY;
                    HoursHakone (Transportation) => CURRENCY]
  • Annotator 1:
    Transportation [Vehicle => ALPHANUMERIC;
                    Seats => NUMBER;
                    WheelChairs => NUMBER;
                    JumpSeats => NUMBER;
                    Baggage => NUMBER;
                    Toilet => NUMBER;
                    Duration (TourType) => NUMBER;
                    Cost (TourType) => CURRENCY]