Title: From Tables To Frames
1From Tables To Frames
The Third International Semantic Web Conference (ISWC 2004)
November 7-11, 2004, Hiroshima, Japan
- Aleksander Pivk (1,2), Philipp Cimiano (2), York Sure (2)
- (1) Jozef Stefan Institute, Ljubljana, Slovenia
- (2) AIFB Institute, University of Karlsruhe, Karlsruhe
09.11.2004
2Outline
- Motivation
- Foundation Table Model
- Methodology
- Evaluation
- Conclusion
- Future Work
3Motivation
- problem: the well-known annotation bottleneck
- solution: automatic metadata generation
- goal: describe the semantics of tables in a model-theoretic way (F-Logic)
  - tables with different structure but the same meaning (should) have the same representation
- benefit: enables e.g. query answering
  - all conferences where Prof. Studer is in the PC
  - all tours to COUNTRY at DATE where price < AMOUNT
4Foundation Table Model
- dimensions of the table model [Hurst00]
  - graphical (image processing)
  - physical (inter-cell relative location)
  - structural (organization of cells indicating their navigational relationship)
  - functional (purpose of regions in terms of data access)
    - two functional cell types: A-cell and I-cell
    - two functional I-cell roles: data and access
  - semantic (relation between cell content, structure and orientation)
- a frame makes explicit
  - the meaning of the cell contents (F-Logic concepts)
  - the functional dimension of the table (method signature)
  - the semantic dimension of the table (frame structure)
- example
5Table model
6Simple Table Classes
7Complex Table Classes
1. Over-expanded labels
3. Combination (running example)
8Methodology
- the methodology instantiates the table model step by step
- main differences
  - the graphical component is not considered
  - the semantic component is extended
9Cleaning & Normalization
- construct an initial matrix structure
  - DOM tree
  - cleaning: fixing syntactic errors (CyberNeko HTML parser)
  - normalization: aligning the table, handling cells that span multiple rows/columns (colspan, rowspan)
- example
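The normalization step can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` as a stand-in for the CyberNeko parser named on the slide, and the names `TableGrid` and `normalize` are hypothetical. Each spanning cell is simply copied into every matrix slot it covers.

```python
from html.parser import HTMLParser

class TableGrid(HTMLParser):
    """Collect the cells of one table as [text, rowspan, colspan] per row."""
    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cells per <tr>
        self._cell = None   # cell currently being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:
                self.rows.append([])  # tolerate a missing <tr>
            a = dict(attrs)
            self._cell = ["", int(a.get("rowspan") or 1), int(a.get("colspan") or 1)]
            self.rows[-1].append(self._cell)

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0] += data.strip()

def normalize(html):
    """Return a rectangular matrix; spanning cells are repeated in each slot."""
    parser = TableGrid()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, row in enumerate(parser.rows):
        c = 0
        for text, rowspan, colspan in row:
            while (r, c) in grid:  # slot already filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]
```

Repeating the spanning cell's text is only one possible design choice; it keeps the matrix rectangular so that later steps can compare cells by position.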
10Structure Detection
- detecting table orientation
  - rely on similarity of cells (size, content, token types)
  - intuition
    - if rows are similar, then the orientation is vertical (top-down)
    - if columns are similar, then the orientation is horizontal (left-to-right)
- initialize logical units (LUs) and regions
  - split the table into LUs
  - group same-sized, similar cells into regions within LUs
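The orientation intuition can be sketched in a few lines. The exact difference formulas are not preserved in this text, so `cell_diff` below is an assumed measure (token type mismatch plus relative length gap), and `token_type`, `line_diff` and `orientation` are hypothetical names.

```python
import re

def token_type(cell):
    """Very coarse, assumed token typing of a whole cell."""
    text = cell.strip()
    if re.fullmatch(r"[\d.,]+", text):
        return "NUMBER"
    if re.fullmatch(r"[A-Za-z ]+", text):
        return "WORDS"
    return "ALPHANUMERIC"

def cell_diff(a, b):
    """Assumed cell difference in [0, 1]: type mismatch plus relative size gap."""
    type_gap = 0.0 if token_type(a) == token_type(b) else 1.0
    longest = max(len(a), len(b)) or 1
    size_gap = abs(len(a) - len(b)) / longest
    return (type_gap + size_gap) / 2

def line_diff(lines):
    """Mean difference of cells at the same position in neighboring lines."""
    diffs = [cell_diff(x, y)
             for first, second in zip(lines, lines[1:])
             for x, y in zip(first, second)]
    return sum(diffs) / len(diffs) if diffs else 0.0

def orientation(table):
    """Similar rows -> vertical; similar columns -> horizontal."""
    rows = table
    cols = [list(col) for col in zip(*table)]
    return "vertical" if line_diff(rows) <= line_diff(cols) else "horizontal"

table = [["DP9LAX01AB", "1,450", "Hilton"],
         ["DP9LAX02AB", "2,300", "Ritz"],
         ["DP9SFO01AB", "1,980", "Plaza"]]
```

In the sample table every row has the same shape (code, number, name), so neighboring rows differ little and the sketch reads the table top-down; transposing the table flips the decision.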
11Structure Detection
- heuristics for assigning initial functional types and probabilities to cells
  - I-cell: the content of the cell consists mostly of tokens recognized as dates, numbers, and currencies
  - the lower-right cell is always an I-cell (p = 1)
  - the upper-left cell is always an A-cell (p = 1)
12Table Orientation
- token type hierarchy
  - the hierarchical ordering permits measuring the distance between different types (i.e. as a number of edges)
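Measuring distance as a number of edges can be sketched as follows. The hierarchy shown on the slide is not preserved in this text, so the `PARENT` map below is purely hypothetical; only the edge-counting idea comes from the talk.

```python
# Hypothetical token type hierarchy (child -> parent); not the one from the slide.
PARENT = {
    "ALPHANUMERIC": "TOKEN",
    "PUNCTUATION": "TOKEN",
    "ALPHA": "ALPHANUMERIC",
    "NUMBER": "ALPHANUMERIC",
    "FIRST_UPPER": "ALPHA",
    "ALL_UPPER": "ALPHA",
    "SMALL_NUMBER": "NUMBER",
    "LARGE_NUMBER": "NUMBER",
}

def path_to_root(token_type):
    """List of types from the given type up to the root of the hierarchy."""
    path = [token_type]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def type_distance(a, b):
    """Number of edges on the path between two types in the hierarchy."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {t: i for i, t in enumerate(path_b)}
    for i, t in enumerate(path_a):
        if t in depth_in_b:  # first common ancestor
            return i + depth_in_b[t]
    raise ValueError("types are not in the same hierarchy")
```

With this assumed hierarchy, two sibling types are two edges apart, while a type compared with itself has distance zero.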
13Table Orientation
- difference between two cells
- difference between rows/columns
- orientation decision
- example
  - orientation set to vertical
14Discovery of Regions
- algorithm (7 steps)
  - determine the table class
    - 1D, 2D, and complex (partition labels, over-expanded labels, combination)
  - reformulate the table
15Discovery of Regions
- initialize logical units and regions
  - splits
    - every row with a cell spanning multiple columns (vertical orientation)
    - every column with a cell spanning multiple rows (horizontal orientation)
  - regions
    - group same-sized, similar cells within one logical unit
- update functional types and probabilities
- learn string patterns of regions
  - learn significant forward and backward patterns
  - a pattern is a sequence of token types and tokens that describes the content of a significant number of cells
    - e.g. the pattern "FIRST_UPPER Room" covers "Double Room" and "Single Room"
  - implementation of the DATAPROG algorithm [Lerman et al., 2003]
- example
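What it means for a pattern to cover a cell can be sketched as below. This only checks a given pattern against cells; it does not learn patterns and is not the DATAPROG algorithm. The helper names and the two token type tests are assumptions.

```python
import re

def matches_token(element, token):
    """A pattern element is either a token type name or a literal token."""
    if element == "FIRST_UPPER":
        return token[:1].isupper() and token[1:] == token[1:].lower()
    if element == "NUMBER":
        return re.fullmatch(r"\d+([.,]\d+)*", token) is not None
    return element == token  # literal token

def covers(pattern, cell, forward=True):
    """True if the cell starts (forward) or ends (backward) with the pattern."""
    tokens = cell.split()
    if len(tokens) < len(pattern):
        return False
    window = tokens[:len(pattern)] if forward else tokens[-len(pattern):]
    return all(matches_token(e, t) for e, t in zip(pattern, window))

def coverage(pattern, cells, forward=True):
    """Fraction of the cells in a region that the pattern covers."""
    return sum(covers(pattern, c, forward) for c in cells) / len(cells)

pattern = ["FIRST_UPPER", "Room"]
```

A pattern would count as significant when its coverage over a region is high enough; the threshold for that is not stated in this text.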
16Discovery of Regions
17Discovery of Regions
18Discovery of Regions
- do while (distribution in the LU is not uniform)
  (uniformity: a logical unit consists of logical sub-units, where each sub-unit includes only regions of the same size and orientation)
  - choose the best coherent region
    - used to propagate and normalize the neighboring regions
    - choose the region that maximizes
  - normalize the logical sub-unit
    - choose neighboring regions (i.e. only within the same rows for vertical orientation)
    - two options
      - neighboring regions within one column DO NOT extend over the boundaries of the best region
      - neighboring regions within one column DO extend over the boundaries of the best region
    - update string patterns for the updated regions
- example
19Building FTM
- functional table model (FTM)
  - regions as nodes arranged in a tree
  - properties of leaf nodes
    - are only regions consisting exclusively of I-cells
    - are assigned their functional role (access, data)
    - are assigned two semantic labels
      - a label describing the content of the region (instances)
      - a label combining the region label with the labels of the parent A-cell nodes
  - inner nodes are either regions consisting of A-cells or connection nodes (e.g. root)
- construction of the FTM
  - bottom-up approach (from the lowest logical unit upwards)
  - description through an example
20Building FTM
- type of the (colored) logical unit: I-cells only
  - regions are turned into leaves
  - semantic labels and roles are set to a default value
21Building FTM
- type of the (colored) logical unit: A-cells only
  - regions are turned into inner nodes and connected to the appropriate sub-nodes (leaves)
22Building FTM
- type of the (colored) logical unit: special case
  - close a subtree by inserting a connection node, which reflects a logical separation in the table (transition from an LU with only A-cells to an LU with I-cells)
  - assign functional roles to leaves within a connected sub-tree
    - the functional role access is assigned to all consecutive leaves (from the left) that together form a unique identifier (key); the other leaves are assigned the functional role data
  - (possible) change of reading orientation in the new logical unit
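The role assignment can be sketched as a search for the shortest left prefix of leaves whose combined values are unique. Taking the shortest prefix is an assumption, as are the name `assign_roles` and the sample values.

```python
def assign_roles(leaves):
    """leaves: list of (label, values) in left-to-right order.

    Returns {label: "access" | "data"}. The shortest left prefix whose
    combined values identify every row is treated as the key (an assumption).
    If no prefix is unique, all leaves end up as access.
    """
    labels = [label for label, _ in leaves]
    columns = [values for _, values in leaves]
    n_rows = len(columns[0]) if columns else 0
    key_len = len(columns)
    for k in range(1, len(columns) + 1):
        keys = {tuple(col[i] for col in columns[:k]) for i in range(n_rows)}
        if len(keys) == n_rows:  # every row has a distinct key
            key_len = k
            break
    return {label: ("access" if i < key_len else "data")
            for i, label in enumerate(labels)}

roles = assign_roles([
    ("Tour Code", ["T1", "T1", "T2"]),
    ("Valid", ["Jan", "Feb", "Jan"]),
    ("Price", ["100", "120", "100"]),
])
```

In the sample, the tour code alone repeats, but tour code and validity together are unique, so both become access leaves and the price becomes a data leaf.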
23Building FTM
- type of the (colored) logical unit: A-cells only
  - regions are turned into inner nodes and connected to the appropriate sub-nodes (leaves)
- finally, connect all unconnected nodes to a root node
24Building FTM
- recapitulation of the FTM
  - consider multiple-level sub-trees for merging
    - conditions: same tree structure and at least one level of matching A-cells
  - merging step
    - merge nodes at the same position and level (leaf and inner nodes)
    - if merged inner nodes (A-cells) are not equal
      - find a semantic label for the new merged node
      - create a new leaf node (with the A-cells as values)
      - assign the functional role access to the new leaf
- example
25Building FTM
26Semantic Enriching of FTM
- find semantic labels for regions by consulting
  - WordNet lexical ontology: use synsets to find hypernyms
  - GoogleSets service: an additional way to find synonyms
- transformations of the regions' cell labels
  - punctuation removal
  - stopword removal
  - compute the IDF (a document is a cell) for each word, and filter out the ones with a value lower than the threshold
  - select words that appear at the end of the labels (the nominal head of a nominal compound is at the end)
  - query GoogleSets with the remaining words to filter out the ones that are not mutually similar
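The label transformations can be sketched without any external service. The WordNet and GoogleSets lookups are left out; the stopword list, the threshold value, and the function names are assumptions made for illustration.

```python
import math
import string

# Tiny illustrative stopword list (an assumption, not the one used in the talk).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "per", "the", "to", "with"}

def tokenize(label):
    """Lowercase, strip punctuation, drop stopwords."""
    cleaned = label.lower().translate(str.maketrans("", "", string.punctuation))
    return [w for w in cleaned.split() if w not in STOPWORDS]

def idf(labels):
    """IDF per word, treating every cell label as one document."""
    docs = [set(tokenize(label)) for label in labels]
    vocab = set().union(*docs)
    return {w: math.log(len(docs) / sum(w in d for d in docs)) for w in vocab}

def head_candidates(labels, threshold=0.5):
    """Last surviving word of each label once low-IDF words are filtered out."""
    scores = idf(labels)
    heads = []
    for label in labels:
        kept = [w for w in tokenize(label) if scores[w] >= threshold]
        if kept:
            heads.append(kept[-1])
    return heads

labels = ["The Double Room,", "Single Room", "Suite (with view)"]
```

Here "room" occurs in two of the three cells, so its IDF falls below the assumed threshold and it is dropped before the head words are selected.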
27Semantic Enriching of FTM
- assign each leaf its semantic label that
describes the content (instances) of the region
[Figure: FTM tree; node labels: Root, Connection Node, Tour Code, Valid, Class, Price; leaf annotations: <label>, data; leaf values include 35,450, 32,500, 30,550, 25,800/22,900 and 2,510, 1,430, 720, 360]
28Final FTM
- (final) semantic labels of leaves
  - a label is a combination of the region label and the labels of the parent A-cell nodes
29Map FTM to a Frame
- a method is a tuple (nameM, rangeM, PM): its name, its range, and its set of parameters
- a frame is a pair
- generation of a frame
  - create a method m for every leaf node whose functional role is data
  - the parameters of m are all leaf nodes with the functional role access that are located on the same level of m's sub-tree or on m's parent path towards the root node
  - set the range of m according to the syntactic token type of its region
  - names for parameters and methods are obtained from the final FTM
- example
  Tour[Code => ALPHANUMERIC;
       DateValid => DATE;
       Price(PersonClass, RoomClass, TypePrice) => LARGE_NUMBER].
30Evaluation
- task
  - for each table, compare the automatically generated frame against two manually created frames
  - measured in terms of Precision, Recall and F-measure
- dataset
  - consists of 21 tables: 3 tables for each simple table class (1D, 2D) and 5 tables for each complex table class
  - tourism domain
- annotators
  - 14 subjects
  - each subject had to annotate 3 tables, each belonging to a different table class
  - (14 x 3 = 21 x 2 = 42)
31Evaluation
- performed along the following 4 functions
  - example: m1(X, Y) => INTEGER vs. method1(X, YY, W) => INTEGER
  - syntactic correctness
    - how well the functional dimension of the table is captured (SynC = 2/3)
  - strict comparison
    - calculate how many of the nameM, rangeM, and PM identifiers of methods are identical (P = 2/4, R = 2/5)
  - soft comparison
    - for soft matching we used a combination of TFIDF and the Jaro-Winkler string distance scheme [Cohen et al., 2003]
    - calculate soft matching for identifiers of methods (P = 3/4, R = 3/5, where Y matches YY)
  - conceptual comparison
    - conceptually equivalent identifiers have been determined (e.g. RegionType and RegionLocation)
    - calculate conceptual matching for identifiers of methods (P = 4/4, R = 4/5, where m1 matches method1)
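The precision and recall figures of the example can be reproduced with a short sketch. `difflib.SequenceMatcher` with a 0.6 threshold is an assumed stand-in for the TFIDF and Jaro-Winkler combination, and the function names are hypothetical. Syntactic correctness is left out because its definition is not given in this text.

```python
from difflib import SequenceMatcher
from fractions import Fraction

def strict(a, b):
    return a == b

def soft(a, b, threshold=0.6):
    # Stand-in similarity; the talk used TFIDF combined with Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio() >= threshold

def conceptual(equivalent):
    """Soft matching extended by hand-listed equivalent identifier pairs."""
    pairs = {frozenset(p) for p in equivalent}
    return lambda a, b: soft(a, b) or frozenset((a, b)) in pairs

def count_matches(gen, ref, same):
    """gen/ref are (name, range, params); parameters are matched one-to-one."""
    hits = int(same(gen[0], ref[0])) + int(same(gen[1], ref[1]))
    free = list(ref[2])
    for p in gen[2]:
        partner = next((q for q in free if same(p, q)), None)
        if partner is not None:
            free.remove(partner)
            hits += 1
    return hits

def precision_recall(gen, ref, same):
    hits = count_matches(gen, ref, same)
    return Fraction(hits, 2 + len(gen[2])), Fraction(hits, 2 + len(ref[2]))

def f_measure(p, r):
    return 2 * p * r / (p + r) if p + r else Fraction(0)

generated = ("m1", "INTEGER", ["X", "Y"])
manual = ("method1", "INTEGER", ["X", "YY", "W"])
```

The generated method has 4 identifiers and the manual one has 5, which is where the denominators 4 and 5 in the example come from; each comparison mode only changes how many identifiers count as matching.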
32Evaluation
- performed from 2 aspects
  - average: consider all frames
  - maximum: choose only the best manually created frame for each generated frame
- results
33Conclusion
- we have shown that our methodology instantiates the underlying table model step by step
- experiments show that
  - from a conceptual point of view, the system gets appropriate names for frames in almost 75% of cases
  - it gets totally identical names in more than 50% of cases
- we demonstrated and evaluated the successful automatic generation of frames from HTML tables
34Future Work
- generate one (most general) frame from multiple tables
  - reduction of complexity
- population of ontologies with instances
- show the feasibility of the approach in a practical setting
- use a given ontology as background knowledge
35TNX
?
36Inter-annotator agreement
- max(F_X) = F_conceptual = 60%
- only 2 totally identical frames (2/21 = 9.52%)
- only 5 identical frames from a conceptual view (5/21 = 23.81%)
  - these 5 tables cover all 1D class tables and 2 (out of 3) 2D class tables
- possible reasons for low agreement
  - the annotators did not follow the guidelines precisely
  - the task itself is hard
  - the annotation guidelines were not clear/detailed enough
- actual results
37Example 1
38Example 1
- Generated Frame
  Tour[Name(Code) => TOKEN;
       Price(Code) => CURRENCY;
       Hotel(Code) => TOKEN;
       Meal(Code) => TOKEN]
- Annotator 1
  Tour[TourCode => ALPHANUMERIC;
       TourName => TOKEN;
       Price => CURRENCY;
       Hotel => TOKEN;
       Meal => TOKEN]
- Annotator 2
  TourCode[TourName => TOKEN;
           Price => CURRENCY;
           Hotel => ALPHANUMERIC;
           Meal => ALPHANUMERIC]
39Example 2
40Example 2
- Generated Frame
  Trip[Cost(TimePeriod) => CURRENCY;
       Insurance(TimePeriod) => CURRENCY]
- Annotator 1
  Trip[Cost(Duration) => CURRENCY;
       Insurance(Duration) => CURRENCY]
- Annotator 2
  Trip[Duration => ALPHANUMERIC;
       DurationType => ALPHANUMERIC;
       Cost => CURRENCY;
       Insurance => CURRENCY]
41Example 3
42Example 3
- Generated Frame
  Transportation[Description(Transportation) => STRING;
                 HalfDay(Transportation) => CURRENCY;
                 FullDay(Transportation) => CURRENCY;
                 HoursHakone(Transportation) => CURRENCY]
- Annotator 1
  Transportation[Vehicle => ALPHANUMERIC;
                 Seats => NUMBER;
                 WheelChairs => NUMBER;
                 JumpSeats => NUMBER;
                 Baggage => NUMBER;
                 Toilet => NUMBER;
                 Duration(TourType) => NUMBER;
                 Cost(TourType) => CURRENCY]