Title: From Tessellations to Table Interpretation
1From Tessellations to Table Interpretation
- R. C. Jandhyala (1), M. Krishnamoorthy (1),
- G. Nagy (1), R. Padmanabhan (1),
- S. Seth (2), W. Silversmith (1)
- (1) DocLab, Rensselaer Polytechnic Institute
- (2) Computer Science and Engineering, University of
Nebraska-Lincoln
- (Supported by NSF Grants 044114854 and 0414644,
and Rensselaer Center for Open Source Software)
2Goal: Construction of a narrow-domain ontology
from semi-structured web data (table
understanding)
3Outline
- Tilings (rectangular tessellations)
- X-Y trees (1984)
- Grammars
- Tables
- Wang Categories (1996)
5Web tables
- Cannot precisely define human-understandable
tables.
- Convert to a smaller set of admissible tables.
- Why? Algorithmic ease.
6Admissible Tables
- Have stub, headings and data cells.
7Factor out layout-equivalent tables
9Rectangular Tessellations
- Partition of an isothetic rectangle into
rectangles.
- Uniquely defined by junction points (location and
type).
- Number of tessellations increases rapidly with
table size.
10XY Tessellations
- Special case of rectangular tessellations.
- Successive horizontal and vertical cuts.
- Easily represented by trees.
11A tiling and its X-Y tree (aka slicing structure,
puzzle tree, tree map)
12Non-slicing structures: No XY tree
In fact, X-Y tilings are a vanishingly small
fraction of all tilings. This is not a limitation, because
tables never contain this spiral structure.
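One way to see the difference between slicing and non-slicing tilings is to search for cuts directly. The sketch below is an illustration only, not the authors' code. It assumes tiles are given as (x0, y0, x1, y1) tuples that exactly tile a rectangle; the names find_cuts and xy_tree are hypothetical. It returns a nested tree when successive cuts exist and None for a spiral (pinwheel) arrangement.

```python
def find_cuts(tiles, axis):
    """Coordinates along `axis` (0 = x, 1 = y) where a straight line crosses
    the whole region without passing through the interior of any tile."""
    lo = min(t[axis] for t in tiles)
    hi = max(t[axis + 2] for t in tiles)
    candidates = sorted({t[axis] for t in tiles} | {t[axis + 2] for t in tiles})
    return [
        c for c in candidates
        if lo < c < hi and not any(t[axis] < c < t[axis + 2] for t in tiles)
    ]


def xy_tree(tiles):
    """Return a leaf tile, a pair (cut, children), or None if no XY tree exists.

    "V" marks vertical cuts (children left to right),
    "H" marks horizontal cuts (children top to bottom).
    """
    if len(tiles) == 1:
        return tiles[0]
    for axis, label in ((0, "V"), (1, "H")):
        cuts = find_cuts(tiles, axis)
        if not cuts:
            continue
        lo = min(t[axis] for t in tiles)
        hi = max(t[axis + 2] for t in tiles)
        bounds = [lo] + cuts + [hi]
        children = []
        for a, b in zip(bounds, bounds[1:]):
            # Tiles lying entirely inside the strip between two adjacent cuts.
            part = [t for t in tiles if a <= t[axis] and t[axis + 2] <= b]
            sub = xy_tree(part)
            if sub is None:
                return None
            children.append(sub)
        return (label, children)
    return None  # several tiles but no full-length cut: a non-slicing structure


# A title cell spanning two columns above two ordinary cells.
table = [(0, 0, 2, 1), (0, 1, 1, 2), (1, 1, 2, 2)]
# Five tiles arranged in a spiral around a central tile.
pinwheel = [(0, 0, 2, 1), (2, 0, 3, 2), (1, 2, 3, 3), (0, 1, 1, 3), (1, 1, 2, 2)]

table_tree = xy_tree(table)
pinwheel_tree = xy_tree(pinwheel)
```

When both vertical and horizontal cuts are available, this sketch arbitrarily tries vertical cuts first.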
13Fundamental Idea
- Use XY trees to automate table processing and
understanding.
14Table to XY tree: EX2XY
- Applicable to any XY tessellation.
- Input: Excel table
- Copy and paste or import.
- Edit to make admissible.
- Output: XY tree
- as XML for portability.
- as parenthesized string for grammars.
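To make the XML output concrete, here is a minimal sketch of how an XY tree might be serialized with Python's standard xml.etree.ElementTree module. It is not the EX2XY implementation. The tuple representation of the tree, the function name to_xml, and the cut attribute are assumptions of this sketch; the cell labels are made up.

```python
import xml.etree.ElementTree as ET


def to_xml(node, block_id="1"):
    """Convert an XY tree into nested <block> elements.

    A leaf is a string (the cell text); an internal node is a pair
    (cut, children) where cut is "H" or "V".
    """
    block = ET.Element("block", {"id": block_id})
    if isinstance(node, str):
        ET.SubElement(block, "content").text = node
        return block
    cut, children = node
    block.set("cut", cut)  # hypothetical attribute, used only in this sketch
    for i, child in enumerate(children, start=1):
        # Child ids extend the parent's id with the child's position.
        block.append(to_xml(child, f"{block_id}.{i}"))
    return block


# Hypothetical tree: a title above a stub and a header/data column.
tree = ("H", ["Title", ("V", ["Stub", ("H", ["Header", "Data"])])])
root = to_xml(tree)
xml_text = ET.tostring(root, encoding="unicode")
```

The dotted ids record the path from the root to each block, so a block's position in the tree can be read from its id alone.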
15Example
(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
16After import into Excel
17After Editing
18Output - XML
<block id='1.1.2.1' range='17,230,2'>
  <content>
    Real gross domestic product, expenditure-based,
    by province and territory (millions of chained
    (2002) dollars)
  </content>
</block>
20Table Grammars
- Can characterize entire families of tables.
- Developed grammar for one family.
- Input: nested parenthesized notation.
- Output: accept/reject as an example of the family.
21Grammar
- For parsing column headers:
- S → A (Rule 1)
- A → B (Rule 2)
- B → c X B | c X (Rules 3 and 4)
- X → c X | A X | A | c (Rules 5, 6, 7 and 8)
- S is the start symbol.
- A generates all admissible column headers.
- B generates category trees.
- c is a root category.
- X generates sub-categories.
22Table Grammars
- Cannot check if table is consistent.
- Need further geometric alignment and lexical
checks.
24Logical Structure of Tables
- How to interpret a table?
- Describe the relationship between header cells and
content cells (Wang, U. Waterloo, 1996).
- Wang notation:
- Elegant description.
- Dimensionality: number of category trees.
- Cartesian product of the categories maps to the data.
25Layout-independent Wang Notation
Different layouts with the same information have the same
Wang notation
26Wang Category Trees for either table
- characteristic
- gonsity
- hepth
- fleck burlam falder multon
- Any data cell can be designated by a path
through each category tree.
- Leaves correspond to row or column headings.
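The idea that one path per category tree designates a data cell can be sketched in a few lines. This is an illustration under assumptions, not the authors' representation: category trees are written as nested dicts, the helper name leaf_paths is hypothetical, and the root label "kind" for the second tree is invented because the slide does not name it.

```python
from itertools import product


def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of a category tree given as nested dicts."""
    for label, subtree in tree.items():
        path = prefix + (label,)
        if subtree:
            yield from leaf_paths(subtree, path)
        else:
            yield path


# Category trees from the slide; "kind" is a hypothetical root label.
characteristic = {"characteristic": {"gonsity": {}, "hepth": {}}}
kind = {"kind": {"fleck": {}, "burlam": {}, "falder": {}, "multon": {}}}

# One key per data cell: a path through each category tree.
keys = list(product(leaf_paths(characteristic), leaf_paths(kind)))
```

With two leaves in one tree and four in the other, the Cartesian product yields eight keys, one for each data cell of a two-dimensional table.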
27Real Table Understanding
- Analyzing logical structure is not sufficient.
- Need additional information from title,
footnotes, captions, etc.
- Semantic analysis of the labels is also important;
it needs external knowledge.
28Does Wang Notation always exist?
- Not always!
- Inconsistent tables do not have Wang Notation.
- Others can be edited using virtual headers.
29XY tree to Wang Notation Algorithm
- Input: XY trees.
- Output: XML version of Wang Notation.
- Checks for table consistency.
30Algorithm
- Locate principal regions: stub, headers and
content cells.
- Extract Wang categories.
- Compute Cartesian product of category paths.
- Match each key to the content of a delta cell.
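The four steps above can be sketched on a plain grid of strings. This is a simplified illustration, not the XY2WANG algorithm, which works on XY trees. It assumes a rectangular grid in which spanning header cells have already been repeated in every position they cover, and it assumes the caller supplies the number of header rows and stub columns. The function name interpret and the cell values are hypothetical.

```python
from itertools import product


def interpret(grid, header_rows, stub_cols):
    """Map (row path, column path) keys to the contents of the delta cells."""
    # Step 1: locate the principal regions.
    col_header = [row[stub_cols:] for row in grid[:header_rows]]
    stub = [row[:stub_cols] for row in grid[header_rows:]]
    delta = [row[stub_cols:] for row in grid[header_rows:]]

    # Step 2: read one category path per column and per row.
    col_paths = [tuple(col) for col in zip(*col_header)]
    row_paths = [tuple(row) for row in stub]

    # A crude consistency check: no two rows or columns may share a path.
    if len(set(col_paths)) != len(col_paths) or len(set(row_paths)) != len(row_paths):
        raise ValueError("inconsistent table: duplicate category path")

    # Steps 3 and 4: Cartesian product of paths, each key matched to its cell.
    keys = product(enumerate(row_paths), enumerate(col_paths))
    return {(rp, cp): delta[i][j] for (i, rp), (j, cp) in keys}


grid = [
    ["",        "fleck", "burlam", "falder", "multon"],
    ["gonsity", "d11",   "d12",    "d13",    "d14"],
    ["hepth",   "d21",   "d22",    "d23",    "d24"],
]
cells = interpret(grid, header_rows=1, stub_cols=1)
```

The duplicate-path test stands in for the fuller consistency checks; a table with two identical column headings would be rejected rather than silently overwritten.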
31Conclusions
- Admissible layouts identified for ease of
processing.
- Algorithms developed for
- extracting XY trees from tables.
- extracting Wang notation from XY trees.
- Family of tables identified using a grammar.
32Future work
- Augmentations: captions, aggregates, units, etc.
- Expand the grammar.
- Automate conversion of tables to admissible
formats.
(http://www40.statcan.ca/l01/cst01/agri111a-eng.htm)
33THANK YOU
34Goal: construction of a narrow-domain
ontology from semi-structured web data (table
understanding)
- Currently multon is the best choice for rapitting
velters. It is about 25 better than burlam or
falder, which have the same girby (hepth/gonsity
ratio).
- Check another table to see whether elmer is even
better.
- NOT TODAY!
35H-first tree can be transformed into V-first
tree (and vice versa)
36EX2XY Algorithm
- Two workhorses:
- Vertical_cut returns the leftmost sub-rectangle of
a given rectangle.
- Horizontal_cut returns the topmost sub-rectangle of
a given rectangle.
37EX2XY Algorithm (contd.)
- Used in a pair of procedures P1 and P2.
- P1 cuts vertically and submits each resulting
sub-rectangle, starting with the first, to P2 for horizontal cuts.
- Similarly, P2 cuts horizontally and submits to P1.
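The alternation between P1 and P2 can be sketched as a pair of mutually recursive functions. This is an illustration under assumptions, not the EX2XY source. The grid is assumed to hold one unique label per cell, with a merged cell's label repeated in every position it covers; rectangles are half-open (row0, col0, row1, col1) tuples. The helper names pieces and is_single_cell and the stuck flag are hypothetical.

```python
def vertical_cut(grid, rect):
    """Return the leftmost sub-rectangle that a vertical cut splits off `rect`."""
    r0, c0, r1, c1 = rect
    for c in range(c0 + 1, c1):
        # A cut between columns c-1 and c is legal if no merged cell spans it.
        if all(grid[r][c - 1] != grid[r][c] for r in range(r0, r1)):
            return (r0, c0, r1, c)
    return rect  # no vertical cut is possible


def horizontal_cut(grid, rect):
    """Return the topmost sub-rectangle that a horizontal cut splits off `rect`."""
    r0, c0, r1, c1 = rect
    for r in range(r0 + 1, r1):
        if all(grid[r - 1][c] != grid[r][c] for c in range(c0, c1)):
            return (r0, c0, r, c1)
    return rect  # no horizontal cut is possible


def pieces(grid, rect, cut):
    """Repeatedly split the first sub-rectangle off `rect` using `cut`."""
    out = []
    while True:
        first = cut(grid, rect)
        out.append(first)
        if first == rect:
            return out
        r0, c0, r1, c1 = rect
        # Continue with what remains to the right of, or below, `first`.
        rect = (r0, first[3], r1, c1) if cut is vertical_cut else (first[2], c0, r1, c1)


def is_single_cell(grid, rect):
    r0, c0, r1, c1 = rect
    return len({grid[r][c] for r in range(r0, r1) for c in range(c0, c1)}) == 1


def p1(grid, rect, stuck=False):
    """Cut vertically and pass each piece to p2 for horizontal cuts."""
    if is_single_cell(grid, rect):
        return grid[rect[0]][rect[1]]
    parts = pieces(grid, rect, vertical_cut)
    if len(parts) == 1:
        if stuck:  # neither direction admits a cut
            raise ValueError("not an XY tessellation")
        return p2(grid, rect, stuck=True)
    return ("V", [p2(grid, part) for part in parts])


def p2(grid, rect, stuck=False):
    """Cut horizontally and pass each piece to p1 for vertical cuts."""
    if is_single_cell(grid, rect):
        return grid[rect[0]][rect[1]]
    parts = pieces(grid, rect, horizontal_cut)
    if len(parts) == 1:
        if stuck:
            raise ValueError("not an XY tessellation")
        return p1(grid, rect, stuck=True)
    return ("H", [p1(grid, part) for part in parts])


# Title "T" spans three columns; stub "S" spans two rows.
grid = [
    ["T", "T", "T"],
    ["S", "A", "B"],
    ["S", "1", "2"],
]
tree = p1(grid, (0, 0, 3, 3))
```

Starting with p1 or p2 only changes which direction is tried first; when that direction admits no cut, the sketch falls through to the other one.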
38Parenthesized notation
- P-notation has a 1:1 correspondence with general
trees.
- For the above table, the XY tree sentence is:
- Sxy c c c c c c c c c c c c.
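The one-to-one correspondence between parenthesized strings and trees can be illustrated with a small round-trip sketch. The function names parse and unparse are hypothetical, trees are represented as nested Python lists with string leaves, and the example string is made up rather than taken from the table above.

```python
def parse(text):
    """Turn a parenthesized string into a nested list with string leaves."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == ")":
            raise ValueError("unexpected ')'")
        if tok != "(":
            return tok  # a leaf symbol such as "c"
        children = []
        while pos < len(tokens) and tokens[pos] != ")":
            children.append(node())
        if pos >= len(tokens):
            raise ValueError("missing ')'")
        pos += 1  # consume the closing parenthesis
        return children

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def unparse(tree):
    """Inverse of parse: write a nested list back as a parenthesized string."""
    if isinstance(tree, str):
        return tree
    return "(" + " ".join(unparse(child) for child in tree) + ")"


sentence = "(c (c c) c)"
tree = parse(sentence)
round_trip = unparse(tree)
```

Because every opening parenthesis starts exactly one subtree and every closing parenthesis ends it, parsing and unparsing are inverses up to whitespace.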
39A table with six Wang dimensions
40XY2WANG: Other features
- Handles more complex scenarios:
- Higher dimensionality.
- Deeper nesting of headers.
- Repetitive headers.
41(http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
42Table Augmentations Example
43Raghav's Experiment
44Results
45Results (Contd.)
46Conclusion
- Average total time to process a table: 231
seconds.
- Average table size: 587 cells before
preprocessing.
- Average preprocessing time: 104 seconds.
- 3-category tables took approximately 27 seconds
more than 2-category tables.
47Conclusion (Contd.)
- Tables with aggregates and footnotes take more time
to process.
- Strong correlation between processing time and
table size.
- For the future: automatically segmenting
augmentations, categories and delta cells using
visual cues.