Title: Extracting Relations from XML Documents
1. Extracting Relations from XML Documents
IBM Almaden and Columbia University
2. Extraction for Data Integration: Motivating Example
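[Figure: two source schemas mapped to one target relation. External Schema: Products > books, music, video > item > title, author, publisher, ISBN, price. Native Schema: Publications > book > booktitle, author, publisher, ISBN, price. Target relation: ISBN, Title, Author, Publisher, Price.]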
3. Why Extract Data from XML?
- XML query processing is still in development; it is still not as fast as an RDBMS
- Relational query processing is still standard for many business applications
- By extracting into one relational schema, avoid the overhead of XML runtime data integration
- Extracted relations can be best exploited for relatively static data (e.g., product catalogs)
4. Related Work
- XTRACT (induces DTDs)
- Lore/DataGuides
- HTML Wrappers (LixTo, RoadRunner, WHISK, STALKER, ...)
- Plain Text Information Extraction (Proteus, Snowball, Rapier)
- Supervised/Assisted XML Schema Mapping (e.g., Clio)
5. Outline
- Motivation
- Problem statement
- XMLMiner approach
- Training XMLMiner
- Extraction from new documents
- Some observations from the prototype
- Summary
6. Problem Statement
- Given a target flat relation R, extract information for the tuples in R from XML (or HTML) documents, with potentially significant variations in schema.
- Problems with current integration/extraction approaches:
  - Hard-coding the rules/queries requires significant effort. The resulting rules can be brittle.
  - XML Schema or DTD is not always provided.
7. XMLMiner Approach
- Learn signatures from example XML documents
  - Represent document structure while maintaining flexibility (to allow schema variations)
- Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node. (The subtree may contain more detailed info of the tuple than needed.)
- Represent input document nodes as vectors, and then find the closest (i.e., most similar) instance node vector
- Use labels and data values to map children of the instance node to target tuple attributes
8. XMLMiner Architecture: Training and Extraction
[Figure: XMLMiner architecture diagram; the recoverable label "Canonical Tree" appears twice.]
9. High Level Description
- Training
  - Each XML document is merged/split into a schema-like tree, called the canonical tree
  - User identifies the attribute nodes (under the instance node) corresponding to the target tuple attributes
  - System derives the instance node in the tree
  - Build a model for the structure of the tuple and each attribute
- Extracting
  - Apply the model to find the most likely instance node and attribute nodes in the new XML documents
10. Training Stage I: Create Canonical Tree for Each Example Document
11. Canonical Form Conversion Example: Merging Similar Nodes
[Figure: Original Document Structure and Merged Document.]
- Merge all siblings with the same label (e.g., sibling Item nodes are merged into a single Item node)
- Intuition: siblings with the same label represent similar entities.
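
The merging step can be sketched in a few lines. This is a minimal illustration, not the XMLMiner implementation: it only merges same-label siblings (the split of heterogeneous nodes is not covered), it ignores text values, and the function name `merge_siblings` and the sample document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def merge_siblings(node):
    """Return a copy of `node` in which siblings with the same label are merged, recursively."""
    merged = ET.Element(node.tag)
    groups = {}  # label -> list of sibling elements carrying that label
    for child in node:
        groups.setdefault(child.tag, []).append(child)
    for tag, members in groups.items():
        # Pool the children of all same-label siblings under one node,
        # then merge that pooled node's children in turn.
        pooled = ET.Element(tag)
        for member in members:
            pooled.extend(list(member))
        merged.append(merge_siblings(pooled))
    return merged

# Hypothetical sample document with two Item siblings of different shape.
doc = ET.fromstring(
    "<Products><Books>"
    "<Item><Title/><Author/></Item>"
    "<Item><Title/><ISBN/></Item>"
    "</Books></Products>"
)
canonical = merge_siblings(doc)
print(ET.tostring(canonical, encoding="unicode"))
```

The merged tree keeps one Item node whose children are the union of the labels seen under any Item, which is what lets optional children survive into the canonical form.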
12. Example: Split Heterogeneous Nodes → Canonical Form
[Figure: Canonical Tree.]
13. Training Stage I Result: Canonical Tree
[Figure: Original Document and its Canonical Form.]
14. Training Stage II: Generate Instance Node Signatures
- Features used to create signatures for an instance node I (item) in the canonical tree:
  - A: ancestors of I
  - S: siblings of I
  - C: descendants of I
  - I: self tag of I
- Siblings and ancestors → position of I in the document
- The descendants → internal structure of I
15. Training Stage (cont.): Example Instance Node Signature
Signature (A, S, C, I) for Item:
A = {Products, Books}
S = {Category_Desc}
C = {Title, Author, Publisher, New, Used, ISBN, Price, Num_Copies}
I = {Item}
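
As a rough sketch of how such a signature could be collected from a canonical tree, the function below gathers the labels of the four regions for a chosen node. The function name `region_signature` and the exact nesting of the sample tree are assumptions for illustration; only the labels come from the example signature.

```python
import xml.etree.ElementTree as ET

def region_signature(root, target):
    """Collect the labels in regions A, S, C and I of `target` within the tree at `root`."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    ancestors, node = [], target
    while node in parent_of:
        node = parent_of[node]
        ancestors.append(node.tag)
    parent = parent_of.get(target)
    siblings = [] if parent is None else [c.tag for c in parent if c is not target]
    descendants = [d.tag for d in target.iter() if d is not target]
    # Ancestors are reported from the root down to the parent.
    return {"A": ancestors[::-1], "S": siblings, "C": descendants, "I": [target.tag]}

# Assumed layout: the nesting under Item is a guess that is consistent with the labels.
tree = ET.fromstring(
    "<Products><Books><Category_Desc/>"
    "<Item><Title/><Author/>"
    "<Publisher>"
    "<New><ISBN/><Price/><Num_Copies/></New>"
    "<Used><ISBN/><Price/><Num_Copies/></Used>"
    "</Publisher></Item>"
    "</Books></Products>"
)
sig = region_signature(tree, tree.find("Books/Item"))
print(sig["A"], sig["S"], sorted(set(sig["C"])), sig["I"])
```

Labels are kept as lists here so that repeated labels can later be turned into term frequencies.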
16. Signature Similarity
- Vector space model, TF-IDF weights for terms
- Incorporates structure (similarity-by-region)
SX: A = {Products: 1}, S = {Music: 0.33, Video: 0.33}, C = {Title: 0.33, Author: 0.33, Publisher: 0.33, New: 0.2, Used: 0.2, ISBN: 0.6, Price: 0.2, Copies: 0.5}, I = {Item}
SY: A = {Products: 1, Books: 0.5}, S = {CDs: 0.5}, C = {Title: 0.33, Author: 0.33, Publisher: 0.33, ISBN: 0.6, Price: 0.2, Copies: 0.5}, I = {Book}
Similarity(SX, SY) is computed region by region from SX.A · SY.A, SX.S · SY.S, SX.C · SY.C, and SX.I · SY.I
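
A similarity-by-region computation over the two example signatures might look like the sketch below. How the per-region terms are combined is not spelled out here, so the plain sum is an assumption, as are the cosine normalization of each region and the weight of 1.0 given to the self tag.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def similarity(sx, sy, regions=("A", "S", "C", "I")):
    """Assumed combination: sum of the per-region cosine similarities."""
    return sum(cosine(sx[r], sy[r]) for r in regions)

# Term weights copied from the example; the 1.0 on the self tag is an assumption.
SX = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},
}
SY = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},
}
print(similarity(SX, SY))
```

With these inputs only the ancestor and descendant regions contribute, since the sibling labels and self tags of SX and SY do not overlap.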
17. Training Stage III: Attribute Signatures
- Structural + data signature S(D, A, S, C, I)
  - Data signature D for the values of R.X (e.g., can be a histogram of values for X)
  - Structure signature for attribute X (A, S, C, I)
- Similar to instance signature
  - Original instance node → document root
  - A → ancestors (Item, Publisher, New)
  - I → self (ISBN)
  - S → siblings (Price, NumCopies)
  - C → null
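
One way to picture a histogram-style data signature D is sketched below. The choice to bucket values by a coarse character pattern is purely an assumption for illustration (raw values such as ISBNs rarely repeat across documents), and the names `shape` and `data_signature` and the sample values are hypothetical.

```python
import re
from collections import Counter

def shape(value):
    """Collapse a value to a coarse pattern: digit runs become 9, letter runs become a."""
    collapsed = re.sub(r"[0-9]+", "9", value)
    return re.sub(r"[A-Za-z]+", "a", collapsed)

def data_signature(values):
    """Normalized histogram of value patterns for one attribute."""
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}

# Hypothetical training values for two attributes.
isbn_sig = data_signature(["0-13-110362-8", "0-201-53771-0"])
price_sig = data_signature(["$24.99", "$7.50", "free"])
print(isbn_sig, price_sig)
```

Because the result is again a sparse weight vector, it could be compared with the same vector-space machinery used for the structural regions.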
18. Outline
- Motivation
- Problem statement
- XMLMiner approach
- Training XMLMiner
- Extraction from new documents
- XMLMiner prototype
- Summary
19. Extraction Stage
- Assumption: input documents have internal regularity
- Compute the canonical tree for some of the input documents
- Build the signature of each node in the canonical form, and compute its similarity with known instance node signatures
- Map descendants of the highest scoring node to attributes of the target table using attribute signatures
20. Extraction I: Represent Test Documents in Canonical Form
[Figure: Test Document: Publications with three book elements whose children vary (title, author, publisher, editor, price, ISBN). Canonical Form: Publications > book > title, author, publisher, editor, price, ISBN.]
- Intuition
  - Robustness (allows optional nodes)
  - Efficiency: the canonical form has fewer nodes than the original tree
21. Extraction II: Find Instance Node in Canonical Tree
- For each node K in CT:
  - Compute the signature of K: SK
  - Compute the score for K as Similarity(SK, SI)
  - SI is the signature of instance node I from training
- The node with the highest score is the instance node in CT
[Figure: canonical tree Publications > book > title, author, publisher, editor, price, ISBN.]
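
The search over the canonical tree can be sketched as follows. Several simplifications here are assumptions rather than part of the method as presented: labels are lowercased before comparison, every label has weight 1 instead of a TF-IDF weight, and the region scores are summed.

```python
import math
import xml.etree.ElementTree as ET

def signature(node, parent_of):
    """Lowercased label sets for the ancestors, siblings, descendants and self tag of `node`."""
    ancestors, cur = set(), node
    while cur in parent_of:
        cur = parent_of[cur]
        ancestors.add(cur.tag.lower())
    parent = parent_of.get(node)
    siblings = set() if parent is None else {c.tag.lower() for c in parent if c is not node}
    descendants = {d.tag.lower() for d in node.iter() if d is not node}
    return {"A": ancestors, "S": siblings, "C": descendants, "I": {node.tag.lower()}}

def overlap(a, b):
    """Cosine similarity of two label sets treated as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0

def similarity(s1, s2):
    return sum(overlap(s1[r], s2[r]) for r in ("A", "S", "C", "I"))

def find_instance_node(canonical_root, trained_sig):
    """Score every node K against the trained signature and return the best one."""
    parent_of = {c: p for p in canonical_root.iter() for c in p}
    return max(canonical_root.iter(),
               key=lambda k: similarity(signature(k, parent_of), trained_sig))

# Trained signature SI: the example signature for Item, lowercased.
SI = {
    "A": {"products", "books"},
    "S": {"category_desc"},
    "C": {"title", "author", "publisher", "new", "used", "isbn", "price", "num_copies"},
    "I": {"item"},
}
ct = ET.fromstring(
    "<Publications><book><title/><author/><publisher/>"
    "<editor/><price/><ISBN/></book></Publications>"
)
best = find_instance_node(ct, SI)
print(best.tag)
```

In this toy case the descendant region carries the decision, because the ancestor, sibling and self labels of the test document share nothing with the training signature.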
22. Extraction III: Map Children of Instance Node to Attributes
- For each node J of the subtree at K:
  - For each attribute X of R:
    - ASJ = attribute signature of J
    - ASX = attribute signature of X
    - Compute the score for J as Similarity(ASJ, ASX)
- Pick the mapping such that the product of the scores over the attributes of R is maximized.
[Figure: subtree rooted at book with children title, author, publisher, editor, price, ISBN.]
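
Choosing the mapping that maximizes the product of scores can be illustrated with an exhaustive search. The brute-force enumeration, the requirement that each node serves at most one attribute, and all score values below are assumptions for the example; the scores are invented and do not come from the prototype.

```python
from itertools import permutations
from math import prod

def best_mapping(scores, attributes, nodes):
    """Try every one-to-one assignment of nodes to attributes and keep the highest product.

    `scores[(node, attribute)]` is the attribute-signature similarity; missing pairs count as 0.
    """
    best, best_score = None, -1.0
    for chosen in permutations(nodes, len(attributes)):
        p = prod(scores.get((n, a), 0.0) for n, a in zip(chosen, attributes))
        if p > best_score:
            best, best_score = dict(zip(attributes, chosen)), p
    return best, best_score

# Hypothetical similarity scores between subtree nodes and target attributes.
scores = {
    ("ISBN", "ISBN"): 0.95,
    ("title", "Title"): 0.9,
    ("editor", "Title"): 0.1,
    ("author", "Author"): 0.8,
    ("editor", "Author"): 0.6,
    ("publisher", "Publisher"): 0.9,
    ("price", "Price"): 0.7,
}
attributes = ["ISBN", "Title", "Author", "Publisher", "Price"]
nodes = ["title", "author", "publisher", "editor", "price", "ISBN"]
mapping, score = best_mapping(scores, attributes, nodes)
print(mapping, score)
```

Exhaustive search is only practical for small subtrees; for larger ones, maximizing a product of positive scores is equivalent to maximizing a sum of log-scores, which is a standard assignment problem.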
23. Extraction IV: Generate XPath Queries for the New Documents
- Apply XPath queries to the new XML documents
- Simple XPath queries can be handled by the Xerces parser or a more advanced streaming parser
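
To illustrate the last step, the sketch below applies simple path queries to a new document and emits one tuple per instance node. It uses the limited XPath subset of Python's `xml.etree.ElementTree` rather than Xerces, and the paths, the function name `extract_tuples`, and the sample document are assumptions standing in for what the mapping stage would produce.

```python
import xml.etree.ElementTree as ET

def extract_tuples(doc_root, instance_path, attribute_paths):
    """Run one path query per target attribute under every instance node."""
    rows = []
    for instance in doc_root.findall(instance_path):
        # findtext returns None when an optional node is absent.
        rows.append({attr: instance.findtext(path) for attr, path in attribute_paths.items()})
    return rows

# Hypothetical queries, as if generated from the instance node and attribute mapping.
instance_path = "./book"
attribute_paths = {
    "ISBN": "ISBN",
    "Title": "title",
    "Author": "author",
    "Publisher": "publisher",
    "Price": "price",
}
new_doc = ET.fromstring(
    "<Publications>"
    "<book><title>A</title><author>X</author><publisher>P</publisher>"
    "<price>10</price><ISBN>111</ISBN></book>"
    "<book><title>B</title><author>Y</author><publisher>Q</publisher>"
    "<editor>E</editor><ISBN>222</ISBN></book>"
    "</Publications>"
)
rows = extract_tuples(new_doc, instance_path, attribute_paths)
for row in rows:
    print(row)
```

Missing optional nodes simply yield empty fields in the extracted tuple, and extra nodes such as editor are ignored.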
24. XMLMiner Prototype
Successfully finds best instance node (Book) in
test document
25. Summary
- Partially supervised, low-effort XML → relational extraction
- Flexible vector space representation that preserves some original structure
- Can potentially be more robust than current state-of-the-art systems that rely on rules