Extracting Relations from XML Documents

Authors: C. T. Howard Ho, Joerg Gerhardt, Eugene Agichtein*, Vanja Josifovski (IBM Almaden and Columbia University*)


Provided by: emoryEdu

Learn more at: http://www.mathcs.emory.edu


1
Extracting Relations from XML Documents
IBM Almaden and Columbia University
2
Extraction for Data Integration: Motivating Example
[Figure: an External Schema and a Native Schema drawn as trees. Node labels: Products, books, music, video, item, title, author, publisher, ISBN, price; and Publications, book, booktitle, author, publisher, ISBN, price. Target relation columns: ISBN, Title, Author, Publisher, Price.]

3
Why Extract Data from XML?

XML query processing is still in development. Still not as fast as RDBMS
Relational query processing is still standard for many business applications
By extracting into one relational schema, avoid overhead of XML runtime data integration
Extracted relations can be best exploited for relatively static data (e.g., product catalogs)

4
Related Work

XTRACT (induces DTDs)
Lore/DataGuides
HTML Wrappers (LixTo, RoadRunner, WHISK, STALKER, ...)
Plain Text Information Extraction (Proteus, Snowball, Rapier)
Supervised/Assisted XML Schema Mapping (e.g., Clio)

5
Outline

Motivation
Problem statement
XMLMiner approach
Training XMLMiner
Extraction from new documents
Some observations from the prototype
Summary

6
Problem Statement

Given a target flat relation R, extract information for the tuples in R from XML (or HTML) documents, with potentially significant variations in schema.
Problems with current integration/extraction approaches:
Hard-coding the rules/queries requires significant effort. The resulting rules can be brittle.
An XML Schema or DTD is not always provided

7
XMLMiner Approach

Learn signatures from example XML documents
Represent document structure while maintaining flexibility (to allow schema variations)
Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node. (The subtree may contain more detailed info of the tuple than needed.)
Represent input document nodes as vectors, and then find the closest (i.e., most similar) instance node vector
Use labels and data values to map children of the instance node to target tuple attributes

8
XMLMiner Architecture: Training and Extraction
[Figure: XMLMiner architecture diagram; recoverable labels: Canonical Tree (twice).]
9
High Level Description

Training:
Each XML document is merged/split into a schema-like tree, called the canonical tree
The user identifies the attribute nodes (under the instance node) corresponding to the target tuple attributes
The system derives the instance node in the tree
Build a model for the structure of the tuple and each attribute
Extracting:
Apply the model to find the most likely instance node and attribute nodes in the new XML documents

10
Training Stage I: Create Canonical Tree for Each Example Document
11
Canonical Form Conversion Example: Merging Similar Nodes
[Figure: Original Document Structure next to the Merged Document.]

Merge all siblings with the same label (e.g., repeated Item siblings → a single Item)
Intuition: siblings with the same label represent similar entities.
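
The merging step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the XMLMiner implementation: the function name merge_siblings and the sample document are hypothetical, text values are ignored, and the later step of splitting heterogeneous nodes is not covered.

```python
import xml.etree.ElementTree as ET


def merge_siblings(node):
    """Return a copy of `node` in which same-label siblings are merged recursively."""
    merged = ET.Element(node.tag)
    groups = {}
    for child in node:
        groups.setdefault(child.tag, []).append(child)
    for tag, same_label in groups.items():
        # Pool the children of all same-label siblings under one temporary node,
        # then merge that pooled node recursively.
        pooled = ET.Element(tag)
        for sibling in same_label:
            pooled.extend(list(sibling))
        merged.append(merge_siblings(pooled))
    return merged


# Hypothetical document: two Item siblings with partly different children.
doc = ET.fromstring(
    "<Products><Books>"
    "<Item><Title/><Author/><ISBN/></Item>"
    "<Item><Title/><Price/></Item>"
    "</Books></Products>"
)
canonical = merge_siblings(doc)
item_count = len(canonical.findall("Books/Item"))
labels = [child.tag for child in canonical.find("Books/Item")]
```

After merging, the single Item node carries the union of the child labels seen under any Item, which is what lets optional children survive in the schema-like tree.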

12
Example: Split Heterogeneous Nodes → Canonical Form
[Figure: resulting Canonical Tree.]
13
Training Stage I Result: Canonical Tree
[Figure: Original Document next to its Canonical Form.]
14
Training Stage II: Generate Instance Node Signatures

Features used to create signatures for an instance node I (item) in the canonical tree:
A: Ancestors of I
S: Siblings of I
C: Descendants of I
I: Self (tag of I)
Siblings and ancestors → position of I in the document
Descendants → internal structure of I
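
As a rough sketch of how the four feature sets could be collected, the Python below walks a tree and gathers the labels for one node. The function name signature and the sample tree layout are assumptions made for illustration; only the four regions (A, S, C, I) come from the slide, and term weighting is left out.

```python
import xml.etree.ElementTree as ET


def signature(root, node):
    """Return the (A, S, C, I) label sets for `node` within the tree at `root`."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}
    ancestors = set()
    current = node
    while current in parent:
        current = parent[current]
        ancestors.add(current.tag)
    siblings = {s.tag for s in parent[node] if s is not node} if node in parent else set()
    descendants = {d.tag for d in node.iter() if d is not node}
    return {"A": ancestors, "S": siblings, "C": descendants, "I": {node.tag}}


# Hypothetical canonical tree; the nesting below Item is an assumption.
tree = ET.fromstring(
    "<Products><Books><Category_Desc/>"
    "<Item><Title/><Author/>"
    "<Publisher><New><ISBN/><Price/><Num_Copies/></New><Used/></Publisher>"
    "</Item></Books></Products>"
)
sig = signature(tree, tree.find("Books/Item"))
```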

15
Training Stage (cont.) Example: Instance Node Signature
Signature (A, S, C, I) for Item:
A = {Products, Books}
S = {Category_Desc}
C = {Title, Author, Publisher, New, Used, ISBN, Price, Num_Copies}
I = Item
16
Signature Similarity

Vector Space model, TFIDF weights for terms
Incorporates structure (similarity-by-region)

SX: A = {Products: 1, ...}, S = {Music: 0.33, Video: 0.33}, C = {Title: 0.33, Author: 0.33, Publisher: 0.33, New: 0.2, Used: 0.2, ISBN: 0.6, Price: 0.2, Copies: 0.5}, I = Item
SY: A = {Products: 1, Books: 0.5}, S = {CDs: 0.5}, C = {Title: 0.33, Author: 0.33, Publisher: 0.33, ISBN: 0.6, Price: 0.2, Copies: 0.5}, I = Book
Similarity(SX, SY) = SX.A · SY.A + SX.S · SY.S + SX.C · SY.C + SX.I · SY.I
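
A small worked example may help with the similarity-by-region idea. The sketch below assumes the regions are combined as a plain sum of per-region dot products; the operators are not legible in the transcript, so that combination, the weight 1.0 given to the self tag, and the function names are all assumptions. The term weights are the ones listed on the slide.

```python
def dot(u, v):
    """Dot product of two sparse term -> weight vectors."""
    return sum(weight * v[term] for term, weight in u.items() if term in v)


def similarity(sx, sy):
    """Assumed combination: sum of the dot products of the A, S, C and I regions."""
    return sum(dot(sx[region], sy[region]) for region in ("A", "S", "C", "I"))


SX = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},  # weight for the self tag is an assumption
}
SY = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},
}
per_region = {r: dot(SX[r], SY[r]) for r in ("A", "S", "C", "I")}
score = similarity(SX, SY)
```

Under these assumptions the ancestors region contributes 1.0, the descendants region about 0.98, and the siblings and self regions contribute nothing because they share no terms, so the two nodes still score as similar despite the different tags Item and Book.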
17
Training Stage III: Attribute Signatures

Structural and data signature: S = (D, A, S, C, I)
Data signature D for the values of R.X (e.g., can be a histogram of values for X)
Structure signature for attribute X: (A, S, C, I)
Similar to the instance signature
Original instance node → document root
A → ancestors (Item, Publisher, New)
I → self (ISBN)
S → siblings (Price, NumCopies)
C → null
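
One possible reading of the data signature D is a normalized histogram that can be compared across documents. The sketch below is an assumption-heavy illustration: the slide only says D can be a histogram of values, so bucketing values by a coarse character "shape", the cosine comparison, and all names and sample values here are hypothetical choices.

```python
import math
from collections import Counter


def shape(value):
    """Coarse value shape (an assumed bucketing): digits -> 9, letters -> a."""
    return "".join("9" if c.isdigit() else "a" if c.isalpha() else c for c in value)


def data_signature(values):
    """Normalized histogram of value shapes for one attribute."""
    counts = Counter(shape(v) for v in values)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.items()}


def cosine(p, q):
    """Cosine similarity of two sparse histograms."""
    num = sum(w * q[k] for k, w in p.items() if k in q)
    den = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return num / den if den else 0.0


# Hypothetical training values for R.Price and values from two candidate nodes.
trained_price = data_signature(["12.99", "7.50", "19.95"])
candidate_price = data_signature(["24.00", "8.25"])
candidate_title = data_signature(["Dune", "Emma"])
price_match = cosine(trained_price, candidate_price)
title_match = cosine(trained_price, candidate_title)
```

With these sample values the candidate price node scores far higher against the trained Price signature than the candidate title node does, which is the kind of signal the data signature is meant to add on top of the structural regions.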

18
Outline

Motivation
Problem statement
XMLMiner approach
Training XMLMiner
Extraction from new documents
XMLMiner prototype
Summary

19
Extraction Stage

Assumption: input documents have internal regularity
Compute the canonical tree for some of the input documents
Build the signature of each node in the canonical form, and compute its similarity with known instance node signatures
Map descendants of the highest-scoring node to attributes of the target table using attribute signatures

20
Extraction I: Represent Test Documents in Canonical Form
[Figure: a test document rooted at Publications with several book nodes, next to its canonical form; node labels: book, title, author, publisher, editor, price, ISBN.]

Intuition:
Robustness (allows optional nodes)
Efficiency: the canonical form has fewer nodes than the original tree

21
Extraction II: Find Instance Node in Canonical Tree
[Figure: canonical tree with nodes Publications, book, title, author, publisher, editor, price, ISBN.]

For each node K in CT:
  Compute the signature of K: SK
  Compute the score for K as Similarity(SK, SI)
  SI is the signature of instance node I from training
The node with the highest score is the instance node in CT
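
The loop above can be sketched as follows. This is a simplified illustration, not the system's scoring: it uses unweighted label sets with a per-region set cosine instead of TFIDF vectors, lowercases labels before comparing, and sums the regions; all of those choices and the function names are assumptions.

```python
import math
import xml.etree.ElementTree as ET

REGIONS = ("A", "S", "C", "I")


def signatures(root):
    """Map every node of the tree to its lowercased (A, S, C, I) label sets."""
    parent = {child: p for p in root.iter() for child in p}
    result = {}
    for node in root.iter():
        ancestors, current = set(), node
        while current in parent:
            current = parent[current]
            ancestors.add(current.tag.lower())
        siblings = ({s.tag.lower() for s in parent[node] if s is not node}
                    if node in parent else set())
        descendants = {d.tag.lower() for d in node.iter() if d is not node}
        result[node] = {"A": ancestors, "S": siblings, "C": descendants,
                        "I": {node.tag.lower()}}
    return result


def set_cosine(a, b):
    """Cosine similarity of two label sets viewed as binary vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def similarity(sk, si):
    return sum(set_cosine(sk[r], si[r]) for r in REGIONS)


def find_instance_node(ct_root, trained):
    """Score every node K in CT against the trained signature SI; keep the best."""
    scored = {node: similarity(sk, trained) for node, sk in signatures(ct_root).items()}
    return max(scored, key=scored.get)


# Trained signature SI (hypothetical, lowercased) and a canonical test tree CT.
SI = {"A": {"products", "books"}, "S": {"category_desc"},
      "C": {"title", "author", "publisher", "isbn", "price"}, "I": {"item"}}
CT = ET.fromstring(
    "<Publications><book><title/><author/><publisher/>"
    "<editor/><price/><ISBN/></book></Publications>"
)
best = find_instance_node(CT, SI)
```

The normalization matters here: with raw overlap counts the root Publications and the book node would tie on shared descendants, while the set cosine favors book because its descendant set contains fewer labels that the trained signature lacks.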
22
Extraction III: Map Children of Instance Node to Attributes
[Figure: instance node book with nodes title, author, publisher, editor, price, ISBN.]

For each node J of the subtree at K:
  For each attribute X of R:
    ASJ ← attribute signature of J
    ASX ← attribute signature of X
    Compute the score for J as Similarity(ASJ, ASX)
Pick the mapping such that the product of the scores over the attributes of R is maximized.
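
To make the final selection step concrete, the sketch below picks the mapping with the largest product of scores by brute force. It assumes the mapping is one-to-one (each attribute gets a distinct node) and uses made-up scores; the slide does not say how the maximization is carried out, and exhaustive search like this is only practical for a handful of attributes.

```python
from itertools import permutations
from math import prod


def best_mapping(scores, attributes, nodes):
    """Return the one-to-one attribute -> node mapping with the largest score product."""
    best, best_value = None, -1.0
    for chosen in permutations(nodes, len(attributes)):
        value = prod(scores[(node, attr)] for node, attr in zip(chosen, attributes))
        if value > best_value:
            best, best_value = dict(zip(attributes, chosen)), value
    return best, best_value


ATTRIBUTES = ["ISBN", "Title", "Author", "Publisher", "Price"]
NODES = ["title", "author", "publisher", "editor", "price", "ISBN"]


def toy_score(node, attribute):
    """Made-up stand-in for Similarity(ASJ, ASX)."""
    if node.lower() == attribute.lower():
        return 0.9
    if (node, attribute) == ("editor", "Author"):
        return 0.4  # a plausible but weaker candidate
    return 0.1


SCORES = {(n, a): toy_score(n, a) for n in NODES for a in ATTRIBUTES}
mapping, value = best_mapping(SCORES, ATTRIBUTES, NODES)
```

Using the product rather than the sum means a single very poor attribute match drags the whole mapping down, so the editor node is left unmapped here instead of displacing author.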
23
Extraction IV: Generate XPath Queries for the New Documents

Apply XPath queries to the new XML documents
Simple XPath queries can be handled by the Xerces parser or a more advanced streaming parser
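
A minimal sketch of this last step, assuming the instance node path and the attribute mapping are already known. It uses the limited XPath subset of Python's xml.etree.ElementTree as a stand-in for the Xerces parser named on the slide; the function names, paths and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_queries(mapping):
    """Turn an attribute -> child label mapping into relative XPath queries."""
    return {attr: f"./{label}" for attr, label in mapping.items()}


def extract(xml_text, instance_path, queries):
    """Apply the queries to every instance node and return one dict per tuple."""
    root = ET.fromstring(xml_text)
    rows = []
    for instance in root.findall(instance_path):
        # findtext returns None when an optional node is missing.
        rows.append({attr: instance.findtext(q) for attr, q in queries.items()})
    return rows


queries = build_queries({"ISBN": "ISBN", "Title": "title", "Author": "author",
                         "Publisher": "publisher", "Price": "price"})
new_document = (
    "<Publications>"
    "<book><title>Title A</title><author>Author A</author>"
    "<publisher>Publisher A</publisher><price>10.00</price><ISBN>isbn-a</ISBN></book>"
    "<book><title>Title B</title><author>Author B</author>"
    "<publisher>Publisher B</publisher><ISBN>isbn-b</ISBN></book>"
    "</Publications>"
)
rows = extract(new_document, "./book", queries)
```

Each returned dict corresponds to one tuple of the target relation; the second book has no price node, so its Price value comes back as None rather than failing the extraction.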

24
XMLMiner Prototype
Successfully finds best instance node (Book) in
test document
25
Summary

Partially supervised, low-effort XML → relational extraction
Flexible vector space representation that preserves some original structure
Can potentially be more robust than current state-of-the-art systems that rely on rules
