Extracting Relations from XML Documents

1
Extracting Relations from XML Documents
IBM Almaden and Columbia University
2
Extraction for Data Integration: Motivating
Example
[Figure: External Schema and Native Schema trees. Node labels shown: Products, Publications, books, music, video, book, item; leaf labels: title (booktitle in one schema), author, publisher, ISBN, price.]
Target relation: ISBN, Title, Author, Publisher, Price

3
Why Extract Data from XML?
  • XML query processing is still under development
    and is not yet as fast as an RDBMS
  • Relational query processing is still standard for
    many business applications
  • By extracting into one relational schema, avoid
    overhead of XML runtime data integration
  • Extracted relations can be best exploited for
    relatively static data (e.g., product catalogs)

4
Related Work
  • XTRACT (induces DTDs)
  • Lore/DataGuides
  • HTML Wrappers (LixTo, RoadRunner, WHISK, STALKER,
    ...)
  • Plain Text Information Extraction (Proteus,
    Snowball, Rapier)
  • Supervised/Assisted XML Schema Mapping (e.g.,
    Clio)

5
Outline
  • Motivation
  • Problem statement
  • XMLMiner approach
  • Training XMLMiner
  • Extraction from new documents
  • Some observations from the prototype
  • Summary

6
Problem Statement
  • Given a target flat relation R, extract
    information for the tuples in R from XML (or
    HTML) documents, with potentially significant
    variations in schema.
  • Problems with current integration/extraction
    approaches:
    • Hard-coding the rules/queries requires
      significant effort; the resulting rules can be
      brittle.
    • An XML Schema or DTD is not always provided.

7
XMLMiner Approach
  • Learn signatures from example XML documents
  • Represent document structure while maintaining
    flexibility (to allow schema variations)
  • Assume that a tuple in the target relation
    corresponds to a subtree rooted at an instance
    node.  (The subtree may contain more detailed
    info of the tuple than needed.)
  • Represent input document nodes as vectors, and
    then find the closest (i.e., most similar)
    instance node vector
  • Use labels and data values to map children of the
    instance node to target tuple attributes

8
XMLMiner Architecture: Training and Extraction
[Figure: architecture diagram; the label "Canonical Tree" appears twice]
9
High Level Description
  • Training:
    • Each XML document is merged/split into a
      schema-like tree, called the canonical tree
    • The user identifies the attribute nodes (under
      the instance node) corresponding to the target
      tuple attributes
    • The system derives the instance node in the tree
    • Build a model for the structure of the tuple and
      each attribute
  • Extracting:
    • Apply the model to find the most likely instance
      node and attribute nodes in the new XML documents

10
Training Stage I: Create Canonical Tree for each
Example Document
11
Canonical Form Conversion Example: Merging
Similar Nodes
[Figure: Original Document Structure and Merged Document]
  • Merge all siblings with the same label (e.g.,
    Item, Item → Item)
  • Intuition: siblings with the same label
    represent similar entities.
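
The merge step can be sketched in Python as follows. This is only an illustration under stated assumptions, not XMLMiner's implementation: the nested-dict tree representation, the helper names merge_siblings and _union, and the sample document are all hypothetical, and the later split step for heterogeneous nodes is not covered.

```python
import xml.etree.ElementTree as ET


def _union(dst, src):
    """Recursively merge the label tree `src` into `dst` (in place)."""
    for label, subtree in src.items():
        _union(dst.setdefault(label, {}), subtree)


def merge_siblings(elem):
    """Return a nested dict {label: subtree} with same-label siblings merged."""
    merged = {}
    for child in elem:
        # Siblings that share a tag collapse into one node whose children
        # are the union of the children of all those siblings.
        _union(merged.setdefault(child.tag, {}), merge_siblings(child))
    return merged


# Hypothetical sample document: two Item siblings with different children.
doc = ET.fromstring(
    "<Products><Books>"
    "<Item><Title/><Author/><ISBN/></Item>"
    "<Item><Title/><Price/></Item>"
    "</Books></Products>"
)
canonical = {doc.tag: merge_siblings(doc)}
print(canonical)
```

The two Item elements end up as a single Item node that carries every child label seen under either of them, which is what lets optional children (here Author, ISBN, Price) survive the merge.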

12
Example: Split Heterogeneous Nodes → Canonical
Form
Canonical Tree
13
Training Stage I Result: Canonical Tree
[Figure: Original Document alongside its Canonical Form]
14
Training Stage II: Generate Instance Node
Signatures
  • Features used to create signatures for an
    instance node I (item) in the canonical tree:
    • A: Ancestors of I
    • S: Siblings of I
    • C: Descendants of I
    • I: Self tag of I
  • Siblings and Ancestors → position of I in the
    document
  • Descendants → internal structure of I
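
A minimal sketch of how the four regions could be collected from a canonical tree is shown below. The nested-dict tree encoding, the function name signature, and the exact nesting of the sample tree are assumptions made for illustration; only the label sets come from the slides.

```python
def signature(tree, path):
    """Build the (A, S, C, I) label sets for the node reached by `path`.

    `tree` is a nested dict {label: subtree}; `path` is the non-empty list
    of labels from the root down to the node of interest.
    """
    parent, node = None, tree
    for label in path:
        parent, node = node, node[label]

    def descendants(sub):
        found = set()
        for label, child in sub.items():
            found.add(label)
            found |= descendants(child)
        return found

    return {
        "A": set(path[:-1]),            # ancestors of I
        "S": set(parent) - {path[-1]},  # siblings of I
        "C": descendants(node),         # all descendants of I
        "I": {path[-1]},                # self tag of I
    }


# Hypothetical canonical tree; the nesting below Item is an assumption.
tree = {"Products": {"Books": {
    "Category_Desc": {},
    "Item": {
        "Title": {}, "Author": {},
        "Publisher": {"New": {"ISBN": {}, "Price": {}, "Num_Copies": {}},
                      "Used": {}},
    },
}}}
sig = signature(tree, ["Products", "Books", "Item"])
print(sig)
```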

15
Training Stage (cont.): Example Instance Node
Signature
Signature (A, S, C, I) for Item:
  A = {Products, Books}
  S = {Category_Desc}
  C = {Title, Author, Publisher, New, Used,
       ISBN, Price, Num_Copies}
  I = Item
16
Signature Similarity
  • Vector Space model, TFIDF weights for terms
  • Incorporates structure (similarity-by-region)

SX: A = {Products: 1}
    S = {Music: 0.33, Video: 0.33}
    C = {Title: 0.33, Author: 0.33, Publisher: 0.33, New: 0.2,
         Used: 0.2, ISBN: 0.6, Price: 0.2, Copies: 0.5}
    I = Item
SY: A = {Products: 1, Books: 0.5}
    S = {CDs: 0.5}
    C = {Title: 0.33, Author: 0.33, Publisher: 0.33,
         ISBN: 0.6, Price: 0.2, Copies: 0.5}
    I = Book
Similarity(SX, SY) = SX.A · SY.A + SX.S · SY.S
                   + SX.C · SY.C
                   + SX.I · SY.I
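
The region-wise comparison can be sketched as below, assuming the per-region terms are plain dot products of sparse weight vectors that are summed with equal region weights and no normalization; the slides do not state the exact combination. The weight of 1.0 for the self tag is also an assumption, since no weight is given for it.

```python
def dot(u, v):
    """Dot product of two sparse term -> weight vectors."""
    return sum(w * v[t] for t, w in u.items() if t in v)


def similarity(sx, sy, regions=("A", "S", "C", "I")):
    """Sum of per-region dot products (assumed: equal region weights)."""
    return sum(dot(sx[r], sy[r]) for r in regions)


SX = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},  # assumed weight
}
SY = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},  # assumed weight
}
print(round(similarity(SX, SY), 4))
```

Because terms are matched only within the same region, a label such as Books contributes when it appears as an ancestor in both signatures but not when it is an ancestor in one and a sibling in the other.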
17
Training Stage III: Attribute Signatures
  • Combined structure and data signature S(D, A, S, C, I):
    • Data signature D for the values of R.X (e.g.,
      a histogram of the values of X)
    • Structure signature for attribute X: (A, S, C, I)
  • Similar to the instance signature:
    • Original instance node → document root
    • A → ancestors (Item, Publisher, New)
    • I → self (ISBN)
    • S → siblings (Price, NumCopies)
    • C → null
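
One possible form of the data signature D is sketched below. The slides say only that D can be a histogram of values; the choice of a character-class histogram, the cosine comparison, the function names, and the sample values are all assumptions made for illustration.

```python
import math
from collections import Counter


def char_class(ch):
    """Coarse character class used as the histogram bucket."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "alpha"
    return "other"


def data_signature(values):
    """Normalized character-class histogram over all sample values."""
    counts = Counter(char_class(ch) for v in values for ch in v)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: n / total for k, n in counts.items()}


def cosine(h1, h2):
    """Cosine similarity of two histograms; 0.0 if either is empty."""
    num = sum(w * h2.get(k, 0.0) for k, w in h1.items())
    den = (math.sqrt(sum(w * w for w in h1.values()))
           * math.sqrt(sum(w * w for w in h2.values())))
    return num / den if den else 0.0


# Hypothetical training values for R.ISBN and two candidate nodes.
isbn_train = data_signature(["0-13-110362-8", "0-201-03801-3"])
candidate_isbn = data_signature(["1-55860-622-X"])
candidate_title = data_signature(["Database Systems"])
print(cosine(isbn_train, candidate_isbn), cosine(isbn_train, candidate_title))
```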

18
Outline
  • Motivation
  • Problem statement
  • XMLMiner approach
  • Training XMLMiner
  • Extraction from new documents
  • XMLMiner prototype
  • Summary

19
Extraction Stage
  1. Assumption: input documents have internal
    regularity
  2. Compute canonical tree for some of the input
    documents
  3. Build signature of each node in the canonical
    form, and compute similarity with known instance
    node signatures
  4. Map descendants of highest scoring node to
    attributes of target table using attribute
    signatures

20
Extraction I: Represent test documents in
canonical form
[Figure: a Test Document rooted at Publications with three book elements, and its Canonical Form with a single book node. Child labels shown: title, author, publisher, editor, price, ISBN.]
  • Intuition:
    • Robustness (allows optional nodes)
    • Efficiency: the canonical form has fewer nodes
      than the original tree

21
Extraction II: Find Instance Node in Canonical
Tree
[Figure: canonical tree CT of the test document: Publications > book > title, author, publisher, editor, price, ISBN]
  • For each node K in CT:
    • Compute the signature of K, SK
    • Compute the score for K as Similarity(SK, SI)
    • SI is the signature of instance node I from
      training
  • The node with the highest score is the instance
    node in CT
22
Extraction III: Map children of instance node to
attributes
[Figure: instance node book with children title, author, publisher, editor, price, ISBN]
  • For each node J of the subtree at K:
    • For each attribute X of R:
      • ASJ ← attribute signature of J
      • ASX ← attribute signature of X
      • Compute the score for J as Similarity(ASJ, ASX)
  • Pick the mapping such that the product of the
    scores over the attributes of R is maximized.
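
The final selection step can be sketched as a brute-force search, assuming the mapping is one-to-one (each attribute of R gets a distinct node) and that there are at least as many candidate nodes as attributes. The function name best_mapping and the score values are hypothetical; in practice the scores would come from Similarity(ASJ, ASX).

```python
from itertools import permutations
from math import prod


def best_mapping(scores, attributes):
    """Pick the attribute -> node mapping with the largest score product.

    `scores[node][attr]` is the similarity of node J to attribute X.
    Tries every ordered choice of distinct nodes, which is acceptable for
    the handful of attributes in a typical target relation.
    """
    best, best_score = None, -1.0
    for chosen in permutations(list(scores), len(attributes)):
        total = prod(scores[n][a] for n, a in zip(chosen, attributes))
        if total > best_score:
            best, best_score = dict(zip(attributes, chosen)), total
    return best, best_score


# Illustrative scores only; not taken from the presentation.
scores = {
    "title":  {"Title": 0.9, "Author": 0.2, "Price": 0.1},
    "author": {"Title": 0.3, "Author": 0.8, "Price": 0.1},
    "editor": {"Title": 0.2, "Author": 0.6, "Price": 0.1},
    "price":  {"Title": 0.1, "Author": 0.1, "Price": 0.9},
}
mapping, score = best_mapping(scores, ["Title", "Author", "Price"])
print(mapping, score)
```

Maximizing the product rather than picking the best node per attribute independently means one very poor match drags the whole mapping down, so a node such as editor is left unmapped instead of displacing author.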
23
Extraction IV: Generate XPath queries for the new
documents
  • Apply XPath queries to the new XML documents
  • Simple XPath queries can be handled by the Xerces
    parser or a more advanced streaming parser
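
As an illustration of applying such queries, the sketch below uses Python's standard xml.etree.ElementTree, which supports a limited XPath subset, in place of Xerces. The instance path, the attribute paths, and the sample document are hypothetical stand-ins for what the earlier mapping step would produce.

```python
import xml.etree.ElementTree as ET

# Hypothetical new document with an optional editor and missing fields.
doc = ET.fromstring(
    "<Publications>"
    "<book><title>T1</title><author>A1</author>"
    "<ISBN>111</ISBN><price>10</price></book>"
    "<book><title>T2</title><editor>E2</editor><ISBN>222</ISBN></book>"
    "</Publications>"
)

# Assumed output of the mapping step: one path for the instance node and
# one relative path per attribute of the target relation.
instance_path = ".//book"
attribute_paths = {
    "ISBN": "ISBN",
    "Title": "title",
    "Author": "author",
    "Publisher": "publisher",
    "Price": "price",
}

# One tuple per instance node; attributes with no match become None.
rows = [
    {attr: node.findtext(path) for attr, path in attribute_paths.items()}
    for node in doc.findall(instance_path)
]
for row in rows:
    print(row)
```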

24
XMLMiner Prototype
Successfully finds the best instance node (Book) in
the test document
25
Summary
  • Partially supervised, low-effort XML → relational
    extraction
  • Flexible vector space representation that
    preserves some original structure
  • Can potentially be more robust than current
    state-of-the-art systems that rely on rules