Title: Querying/Translating Heterogeneous XML Data

1
Querying/Translating Heterogeneous XML Data

Srujana Merugu

2
Executive Summary

Techniques
Framework for Querying and Translating Heterogeneous Data
Algorithms for computing Structural and Semantic Similarity (Tree Matching, Hybrid Matching, Bipartite Matching)
Implementation
Java Software Package (HXMLQ)
Demo
Evaluation
Performance Measures (ROC curves, Querying/Preprocessing Times and Storage, Robustness)
Data (1077 documents and 80 DTDs) from HNC NER Corpus, www.xml.org, www.xcbl.org

3
Outline

Problem Definition & Motivation
System Overview
Key Techniques
Lexical Processing, Ranking, Transforming
Software Package Overview
API
Demo Walk Through
System Evaluation
Future Work & Conclusion

4
XML Document (example)

<books>
  <book title="Pierre The Ambiguities">
    <author>Herman Melville</author>
    <price>9.99</price>
  </book>
  <book title="Heart of Darkness">
    <author>Joseph Conrad</author>
    <price>12.99</price>
  </book>
</books>

5
Document Type Definition (example)

<!ELEMENT books (book*)>
<!ELEMENT book (author, price)>
<!ATTLIST book title CDATA "Unknown">
<!ELEMENT author (#PCDATA)>
<!ELEMENT price (#PCDATA)>

6
Problem Definition

Given a query object (described in XML), find all matching objects from a collection of XML databases/documents based on different schemas.
Given an XML object and some data schemas (DTDs), identify the relevant ones and translate the object to these schemas.

7
Example

Given XML Object
<student>
  <name> abc </name>
  <degree>XXX</degree>
  <GPA> XXX </GPA>
  <nation> xyz </nation>
  <SSN>56778988</SSN>
</student>
(XXX = unknown)

Query Result
<student_record>
  <student_name> abc </student_name>
  <program> M.S Electrical Engineering </program>
  <Grade> 3.2 </Grade>
  <SSN> 56778988 </SSN>
  <country>xyz</country>
</student_record>

Translated Object
<student_record>
  <student_name> abc </student_name>
  <program> XXX </program>
  <Grade> XXX </Grade>
  <SSN> 56778988 </SSN>
  <country>xyz</country>
</student_record>

8
Motivation

Querying across Schemas
Provides precise results for general queries
Makes it easier to integrate different databases
Translating Data across Schemas
Useful for automatic processing of XML data
Can be used to homogenize data for effective storage

Existing Querying Techniques
Document Retrieval (general queries, but returns only documents)
Database querying (precise results, but queries must follow the data schema)

9
New Idea

Find the best mappings from query tags to data DTD tags
Generate database queries using the original query object and the mappings, and obtain the results
Translate the original object to the relevant data schemas (DTDs) using the mappings

10
System Overview

Three Phases of Operation
Pre-Processing
Mapping the Query to Data Schemas
Querying/Translation

11
Pre-processing
12
Mapping Query To Data
13
Querying/Translation
14
Lexical Processing

Words are collected from DTD Tags while parsing
WordNet is used to find synonyms for each word
Word similarity is computed using simple rules (e.g. 0.9 for close synonyms, 0.5 for not-so-close synonyms)
Word groups are refined by merging the WordNet groups

15
Example Word Lists

Word File
ABSTRACT
ABSTRACTION
ACCESSIBILITY

Word Sim File
ABSTRACT ABSTRACTION 0.3
ABSTRACT ADUMBRATION 0.5
ABSTRACT OUTLINE 0.5
ABSTRACT PRECIS 0.5

Word Group File
ABSTRACT ABSTRACTION ADUMBRATION OUTLINE PRECIS SYNOPSIS
ACCESSIBILITY AVAILABILITY HANDINESS
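
A minimal Java sketch of how a Word Sim File in the three-column form shown above could be loaded and queried. The class name WordSimTable, the symmetric lookup, and the default scores (1.0 for identical words, 0.0 for unlisted pairs) are assumptions made for illustration; they are not taken from HXMLQ.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Symmetric word-similarity table built from "WORD1 WORD2 score" lines. */
final class WordSimTable {
    private final Map<String, Double> scores = new HashMap<>();

    /** Order-independent key so that (A, B) and (B, A) share one entry. */
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    /** Parses lines such as "ABSTRACT OUTLINE 0.5"; other lines are skipped. */
    static WordSimTable parse(List<String> lines) {
        WordSimTable table = new WordSimTable();
        for (String line : lines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length != 3) {
                continue; // blank or malformed line
            }
            table.scores.put(key(parts[0], parts[1]), Double.parseDouble(parts[2]));
        }
        return table;
    }

    /** Assumed defaults: 1.0 for identical words, 0.0 for unlisted pairs. */
    double similarity(String a, String b) {
        if (a.equals(b)) {
            return 1.0;
        }
        return scores.getOrDefault(key(a, b), 0.0);
    }

    public static void main(String[] args) {
        WordSimTable table = parse(List.of(
                "ABSTRACT ABSTRACTION 0.3",
                "ABSTRACT OUTLINE 0.5"));
        System.out.println(table.similarity("OUTLINE", "ABSTRACT"));
    }
}
```

Storing each pair under an order-independent key keeps the file half as large as a full matrix while still answering lookups in either direction.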

16
Example - Indices

Word Id   Word            List of Group Ids
1         ABSTRACT        1
2         ABSTRACTION     1
3         ACCESSIBILITY   2

Group Id  List of Word Ids
1         1, 2, 28, 1332, 1479, 1913
2         3, 136, 872

Group Id  List of Matching DTD Tags (DTD No., Tag Object)
1         (23, T1), (35, T2), (62, T3), (62, T4) ..
2         (23, T6), (38, T7) ..
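
A small Java sketch of how the last index (word group to matching DTD tags) could be used to count matching tags per DTD. Counting distinct data tags is only one plausible reading of a "tag count", and the names TagCounter, Posting and countMatchingTags are hypothetical, not part of HXMLQ.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Counts, per DTD, the distinct data tags reached through a query's word groups. */
final class TagCounter {
    /** One inverted-index posting: a tag object inside a numbered DTD. */
    static final class Posting {
        final int dtdNo;
        final String tag;

        Posting(int dtdNo, String tag) {
            this.dtdNo = dtdNo;
            this.tag = tag;
        }
    }

    static Map<Integer, Integer> countMatchingTags(
            Map<Integer, List<Posting>> index, Set<Integer> queryGroups) {
        // Collect the distinct tags hit in each DTD.
        Map<Integer, Set<String>> tagsPerDtd = new TreeMap<>();
        for (int group : queryGroups) {
            for (Posting p : index.getOrDefault(group, List.of())) {
                tagsPerDtd.computeIfAbsent(p.dtdNo, k -> new HashSet<>()).add(p.tag);
            }
        }
        // Reduce each tag set to its size; TreeMap keeps DTD numbers sorted.
        Map<Integer, Integer> counts = new TreeMap<>();
        tagsPerDtd.forEach((dtd, tags) -> counts.put(dtd, tags.size()));
        return counts;
    }

    public static void main(String[] args) {
        Map<Integer, List<Posting>> index = new HashMap<>();
        index.put(1, List.of(new Posting(23, "T1"), new Posting(35, "T2"),
                new Posting(62, "T3"), new Posting(62, "T4")));
        index.put(2, List.of(new Posting(23, "T6"), new Posting(38, "T7")));
        System.out.println(countMatchingTags(index, Set.of(1, 2)));
    }
}
```

Because the lookup only touches the postings of the query's own groups, its cost depends on the query size and the number of matching tags, not on the total number of indexed DTDs.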
17
Ranking Methods

1. Tag Count/Similarities
Find relevant DTDs based on the number of matching tags or the sum of their similarities.
2. Bi-Partite Matching
Build a bipartite graph for each DTD with semantic similarity as the edge weight
Use the Hungarian Algorithm/Heavy Edge Matching to obtain maximum weight bipartite matchings (mappings)
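
A hedged Java sketch of heavy edge matching, assuming the term refers to the usual greedy heuristic: visit edges from heaviest to lightest and keep an edge whenever both of its endpoints are still free. This approximates, but does not guarantee, the maximum weight matching that the Hungarian algorithm would return. The class name, edge weights, and tag pairs below are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Greedy heavy-edge matching on a bipartite graph of query tags and data tags. */
final class HeavyEdgeMatching {
    static final class Edge {
        final String queryTag;
        final String dataTag;
        final double weight; // semantic similarity of the two tags

        Edge(String queryTag, String dataTag, double weight) {
            this.queryTag = queryTag;
            this.dataTag = dataTag;
            this.weight = weight;
        }
    }

    /** Returns a query-tag to data-tag mapping in which no tag is used twice. */
    static Map<String, String> match(List<Edge> edges) {
        List<Edge> sorted = new ArrayList<>(edges);
        sorted.sort(Comparator.comparingDouble((Edge e) -> e.weight).reversed());

        Set<String> usedDataTags = new HashSet<>();
        Map<String, String> mapping = new LinkedHashMap<>();
        for (Edge e : sorted) {
            boolean bothFree = !mapping.containsKey(e.queryTag)
                    && !usedDataTags.contains(e.dataTag);
            if (bothFree) {
                mapping.put(e.queryTag, e.dataTag);
                usedDataTags.add(e.dataTag);
            }
        }
        return mapping;
    }

    public static void main(String[] args) {
        List<Edge> edges = List.of(
                new Edge("Name", "GivenName", 0.6),
                new Edge("FirstName", "GivenName", 0.9),
                new Edge("Name", "PersonName", 0.5));
        System.out.println(match(edges));
    }
}
```

The greedy pass costs one sort plus a linear scan, which is why it is attractive when many candidate DTDs must be ranked quickly.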

18
Ranking Methods (Cont.)

3. Tree Matching
Use the Weighted Tree Matching Algorithm to compute the Tree Edit Distance for each mapping and combine it with Semantic Similarity
4. Hybrid Matching
Construct a bipartite graph for each relevant DTD
Search for mappings with low tree edit distance and high semantic similarity (slightly greedy method)

19
Heavy Edge Matching
20
Tree Matching

Query DTD and Data DTD are trees, say T1 and T2
Let H be the mapping from the nodes (tags) of the query tree to the data tree.
H maps T1 to some structure embedded in T2
Objective
To compute the similarity/distance between T1 and H(T1)

21
Tree Edit Distance

Tree Edit Distance
Minimum number of insertions/deletions/replacements of nodes
A good measure of structural dissimilarity/distance
Computation is NP-complete for unordered trees and O(n^2 m^2) for ordered trees (n, m = number of tree nodes)
Can be done faster here because the individual node mappings are also known

22
Algorithm - Tree Matching

Cost(root) = 0
Add the path from the root of T2 to H(root) to the known paths (only one way)
Traverse T1 depth-wise and, for each node n:
  a) If H(n) has H(parent(n)) as an ancestor,
       Cost(n) = (length of the new path from H(parent(n)) to H(n)) - 1
     else
       Cost(n) = (length of the new path from H(n) to H(parent(n))) - 1
       (the path might consist of a union of two parts)
  b) Add the new path to the known paths
Sum up the node costs
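
A minimal Java sketch of this cost computation, under one reading of the slide: Cost(n) is taken as the number of T2 edges on the path between H(parent(n)) and H(n) that are not yet known, minus 1. Flooring the cost at 0 is an added assumption. The sketch assumes every query node is mapped, ignores deletion costs and weights, and keeps known edges in a plain hash set. All class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Per-node edit costs of a query tree T1 embedded in a data tree T2 through a mapping H. */
final class TreeMatchingCost {
    /** Nodes from the given node up to the T2 root (the root has no parent entry). */
    private static List<String> pathToRoot(String node, Map<String, String> parentT2) {
        List<String> path = new ArrayList<>();
        for (String cur = node; cur != null; cur = parentT2.get(cur)) {
            path.add(cur);
        }
        return path;
    }

    /** Edges on the T2 path between a and b; each edge is named by its child node. */
    private static Set<String> pathEdges(String a, String b, Map<String, String> parentT2) {
        List<String> upA = pathToRoot(a, parentT2);
        List<String> upB = pathToRoot(b, parentT2);
        Set<String> edges = new HashSet<>();
        int i = 0;
        while (!upB.contains(upA.get(i))) { // climb from a to the common ancestor
            edges.add(upA.get(i));
            i++;
        }
        String commonAncestor = upA.get(i);
        for (String n : upB) {              // climb from b to the common ancestor
            if (n.equals(commonAncestor)) {
                break;
            }
            edges.add(n);
        }
        return edges;
    }

    /** queryOrder lists T1 nodes depth-first, root first; the maps give child -> parent. */
    static Map<String, Integer> nodeCosts(List<String> queryOrder, Map<String, String> parentT1,
                                          Map<String, String> h, Map<String, String> parentT2) {
        Set<String> known = new HashSet<>();
        Map<String, Integer> cost = new LinkedHashMap<>();
        String root = queryOrder.get(0);
        cost.put(root, 0);
        // The path from the T2 root down to H(root) is known from the start.
        List<String> rootPath = pathToRoot(h.get(root), parentT2);
        known.addAll(rootPath.subList(0, rootPath.size() - 1));
        for (String n : queryOrder.subList(1, queryOrder.size())) {
            Set<String> newEdges = pathEdges(h.get(parentT1.get(n)), h.get(n), parentT2);
            newEdges.removeAll(known);
            cost.put(n, Math.max(0, newEdges.size() - 1)); // floor at 0 is an assumption
            known.addAll(newEdges);
        }
        return cost;
    }

    public static void main(String[] args) {
        Map<String, String> parentT1 = Map.of("ID", "Student", "PersonalInfo", "Student");
        Map<String, String> parentT2 = Map.of("ID", "Applicant", "Resume", "Applicant",
                "Bio-Data", "Resume");
        Map<String, String> h = Map.of("Student", "Applicant", "ID", "ID",
                "PersonalInfo", "Bio-Data");
        System.out.println(nodeCosts(List.of("Student", "ID", "PersonalInfo"),
                parentT1, h, parentT2));
    }
}
```

In the small demo, PersonalInfo maps two levels below Applicant (through Resume), so one intermediate node has to be inserted and its cost is 1, while ID maps to a direct child and costs 0.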

23
Algorithm - Tree Matching (Cont.)

The known paths (starting from the root) are stored in the form of a suffix tree
The cost of each node is the number of insertions and deletions required for the node. No replacements are required.
Time complexity is O((number of nodes in T1) × (longest path in T2))
Variant
Instead of counting each insertion/deletion only once, we weight it with the importance of all the nodes requiring it and then sum the counts
Have to take care of query tag deletion costs and use the immediate ancestor instead of the parent

24
Hybrid Matching

Problem with two step Approach
Number of maximal mappings per DTD could be very high (6 nodes with 3 partners each => 3^6 = 729 mappings)
Tree Edit Distance computation will be expensive if we choose all of them, and accuracy will drop if we don't.
Key Idea
Compute partial Tree Edit Distance and Semantic Similarity measures and also define a hybrid of these measures.
Construct/search for mappings by choosing edges which improve the hybrid measure (somewhat greedy method)

25
Algorithm - Hybrid Matching

Build a bipartite graph from the query tree, T1, to a data tree, T2
Initialize the mapping M to the empty set
Traverse T1 depth-wise and, for each node n:
  a) For each edge e incident on n, compute the increase in the hybrid similarity measure obtained by adding e to M
  b) Edge set E = set of all edges (n,x)
  c) Repeat until n is assigned or there are no more edges in E:
       Let (n,m) be the edge with the maximum increase in E
       If (n,m) has maximum weight among all the edges (n,m)
         M = M ∪ {edge (n,m)}
       else
         M = M ∪ {edge (n,m)} (with Prob. based on ?weight)
       Remove the edge (n,m) from the set E
  d) Compute the node cost using the chosen edge
Sum up and normalize the node costs

26
Example Query Object

<?xml version="1.0"?>
<Student>
  <ID> 123 </ID>
  <PersonalInfo>
    <Name>
      <FirstName>Michael</FirstName>
      <LastName>McGuire</LastName>
    </Name>
  </PersonalInfo>
</Student>

27
Example Query DTD Tree
28
Example Relevant DTDs

Student → STUDENT → Groups 1124
ID → ID → Groups 753
PersonalInfo → PERSONAL, INFO → Groups 943, 467
Name → NAME → Groups 137, 897
FirstName → FIRST, NAME → Groups 152, 684, 137, 897
LastName → LAST, NAME → Groups 355, 594, 137, 897
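
The first step in each row above, splitting a tag into upper-case words, could look like the following Java sketch. The splitting rule (break on '_', '-', and lower-to-upper case boundaries) is an assumption chosen to reproduce these rows; the slides do not state the exact rule, and TagWords is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Splits a DTD tag name into upper-case words for word-group lookup. */
final class TagWords {
    static List<String> split(String tag) {
        List<String> words = new ArrayList<>();
        // Break at '_' or '-' and at every lower-case to upper-case boundary.
        for (String part : tag.split("[_\\-]|(?<=[a-z])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                words.add(part.toUpperCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        for (String tag : List.of("Student", "ID", "PersonalInfo", "FirstName")) {
            System.out.println(tag + " -> " + split(tag));
        }
    }
}
```

All-capital tags such as ID contain no lower-to-upper boundary, so they stay in one piece.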

29
Example Relevant DTDs (Cont.)

Inverted Indices: Group → (DTD No., Tag)
Group 137 → (1, T1) (1, T2) (1, T3) (5, T6)
Group 152 → (1, T1) (4, T5)
Group 355 → (1, T2) (4, T8)
Group 467 → (1, T3) (4, T9)
Group 594 → (1, T2)
Group 684 → (1, T1) (4, T5) (5, T7)
Group 753 → (1, T4)
Group 897 → (1, T1) (1, T2) (1, T3)
Group 943 → (1, T4)
Group 1124 → (4, T5) (6, T10)
Plain Tag Counts: (DTD1, 5) (DTD4, 4) (DTD5, 2) (DTD6, 1)

30
Example Target DTD
Applicant
  ID
  Resume
    Bio-Data
      PersonName
        GivenName
        FamilyName
31
Example Bi-Partite Graph
32
Example Heavy Edge Matching
33
Example Heavy Edge Matching
34
Example Tree Matching
Node           Edit Cost
Root           Applicant
Student        DeletionCost
ID             0
PersonalInfo   1
Name           1
FirstName      2
LastName       3 (2 in unweighted case)
35
Example Tree Matching
Node           Edit Cost
Root           Applicant
Student        DeletionCost
ID             0
PersonalInfo   1
Name           0
FirstName      0
LastName       0
36
Example Hybrid Matching
37
Transforming Rules

For Query Generation and Translation
Leaf Query Nodes (Simple Elements, Attributes)
Node value is the content of the node
Higher Query Nodes (Complex Elements)
If none of the child nodes are mapped to the target schema, node value is the concatenated text of the entire sub-tree under the node
Otherwise, node value is the text directly under the node
For Translation
Nest elements so that common ancestors correspond to each other
Distribute node value among children using some fixed rules

38
Example Query Generation

Mappings
Root - Applicant
Student -
Student.ID - Applicant.ApplicantId
Student.PersonalInfo - Applicant.Resume.Bio-Data
Student.PersonalInfo.Name - Applicant.Resume.Bio-Data.PersonName
Student.PersonalInfo.Name.FirstName - Applicant.Resume.Bio-Data.PersonName.GivenName
Student.PersonalInfo.Name.LastName - Applicant.Resume.Bio-Data.PersonName.FamilyName

Operator File
MATCH_OP ApplicantId =
MATCH_OP FirstName hasWord
MATCH_OP LastName grep
JOIN_OP TOP AND gp1 ApplicantId
JOIN_OP gp1 OR FirstName LastName

Query
SELECT Applicant WHERE (Applicant.ApplicantId = 123) AND ((Applicant.Resume.Bio-Data.PersonName.GivenName hasWord Michael) OR (Applicant.Resume.Bio-Data.PersonName.FamilyName grep McGuire))
39
Example - Translation
Translated Object

<?xml version="1.0"?>
<Applicant>
  <ApplicantId> 123 </ApplicantId>
  <Resume>
    <Bio-Data>
      <PersonName>
        <GivenName>Michael</GivenName>
        <FamilyName>McGuire</FamilyName>
      </PersonName>
    </Bio-Data>
  </Resume>
</Applicant>

Creation Steps

Make Applicant - no map
Make ApplicantId - fill it
Append ApplicantId
Make Resume - no map
Make Bio-Data - do nothing
Make PersonName - do nothing
Make GivenName - fill it
Append GivenName
Make FamilyName - fill it
Append FamilyName
Append PersonName
Append Bio-Data
Append Resume
Return the Applicant Object
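
The creation steps can be sketched with the standard Java DOM API (javax.xml.parsers and org.w3c.dom). This is an illustrative reconstruction of the make/fill/append sequence with the target structure hard-coded, not the DocTranslator code from HXMLQ; the class and helper names are hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Builds the translated Applicant object by making, filling and appending elements. */
final class TranslationSketch {
    /** "Make" step: creates an element; a non-null value is the "fill it" step. */
    private static Element make(Document doc, String name, String value) {
        Element element = doc.createElement(name);
        if (value != null) {
            element.setTextContent(value);
        }
        return element;
    }

    static Document translate(String id, String firstName, String lastName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No XML parser available", e);
        }
        Element applicant = make(doc, "Applicant", null);      // no map
        applicant.appendChild(make(doc, "ApplicantId", id));   // fill and append
        Element resume = make(doc, "Resume", null);            // no map
        Element bioData = make(doc, "Bio-Data", null);         // mapped, nothing to fill
        Element personName = make(doc, "PersonName", null);    // mapped, nothing to fill
        personName.appendChild(make(doc, "GivenName", firstName));
        personName.appendChild(make(doc, "FamilyName", lastName));
        bioData.appendChild(personName);
        resume.appendChild(bioData);
        applicant.appendChild(resume);
        doc.appendChild(applicant);                             // return the Applicant object
        return doc;
    }

    public static void main(String[] args) {
        Document doc = translate("123", "Michael", "McGuire");
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

Children are appended bottom-up, mirroring the order of the Append steps, so each parent is attached only after its subtree is complete.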
40
Performance Analysis

Preprocessing
Storage: O(#DataTags + #Words + #SimilarityPairs + #WordGroups × AvgDataTags/WordGroup)
Time: O(#DataTags × AvgWordGroups/DataTag + #Words + #SimilarityPairs + #WordGroups)
Query Time
Linear in the number of tags in the query
Linear in the number of first-level matching schemas
Independent of the number of actual words and schemas

41
Software Package Overview

Main components
Source Code
API Documentation
Case Study Data
XML Documents, DTDs, Queries/Results, Word Files
Demo
Supporting Libraries and Install Files
XML Parsers (Oracle, Saxon), Database Utilities (Lore, Oracle)

42
Code Description

Four Main Levels
Specification Code
interfaces and abstract classes
Implementation Code
the actual functionality
Application Code
Integrated applications
Graphical Interface Code

43
Code Description (Cont.)

Controls/Settings
FilePathOptions, ThresholdOptions, MiscellaneousOptions etc.
Processors
XMLInputProcessor, WordProcessor, InvertedIndexProcessor, DatabaseIndexer, DTDSimProcessor, QueryProcessor, DocTranslator etc.
Data Structures
BiGraph, Mapping etc.
Function Objects
TagWtComputationRoutine, TreeEditDistanceRoutine etc.
XML Parser Extensions
ExtDOMParser, ExtDTD etc.

44
Top Level API

Methods of the Class hxmlq.HXMLQFunc
Configuration
public static void initialiseApp(String optionFile) throws hxmlq.HXMLException
Parameters
optionFile: file containing all config information
Pre-Processing Data
public static void preprocessData() throws hxmlq.HXMLException
Indexing Data
public static void indexLore() throws hxmlq.HXMLException

45
Top Level API (Cont.)

Mapping Query To Data
public static void loadQuery(String query, int mode) throws hxmlq.HXMLException
Parameters
query: string containing XML data or the file name
mode: 0 for string input and 1 for file input
Generating Database Queries
public static void generateLoreQueries(String operator, int mode) throws hxmlq.HXMLException
Parameters
operator: string containing the operator info or the file name
mode: 0 for string input and 1 for file input

46
Top Level API (Cont.)

Obtaining Database Results
public static void obtainLoreResults() throws hxmlq.HXMLException
Translating Document
public static void translateDoc() throws hxmlq.HXMLException

47
System Configuration
48
System Configuration
49
System Configuration
50
Pre-Processing/Indexing
51
Querying/Translation Menu
52
Loading the Query/Operators
53
Mapping Query To Data
54
Translating Documents
55
Generate Database Queries
56
Obtain Database Results
57
Help Menu
58
System Evaluation

Data Set
1077 documents (News Articles, Business Forms, Resumes etc.) covering different topics
80 DTDs
50 queries based on the data
Source: HNC/ATS NER Corpus, www.editml.com, www.xml.org, www.xcbl.org

59
System Evaluation (Cont.)

Performance Measures
Quality
Precision - Recall Curves
Speed
Pre-processing and Query Mapping Time (machine dependent)
Storage
Number of DTD nodes and word groups, pairs etc.
Robustness
Performance with different similarities and tag weights
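
For reference, one point on a precision-recall curve can be computed as in the Java sketch below, using the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). Returning 0.0 for an empty set is a convention assumed here, and the class name and document ids are made up for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Precision and recall of a retrieved result set against the relevant set. */
final class RetrievalQuality {
    /** Number of retrieved items that are also relevant. */
    private static int hits(Set<String> retrieved, Set<String> relevant) {
        Set<String> both = new HashSet<>(retrieved);
        both.retainAll(relevant);
        return both.size();
    }

    static double precision(Set<String> retrieved, Set<String> relevant) {
        if (retrieved.isEmpty()) {
            return 0.0; // assumed convention for an empty result
        }
        return (double) hits(retrieved, relevant) / retrieved.size();
    }

    static double recall(Set<String> retrieved, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0; // assumed convention when nothing is relevant
        }
        return (double) hits(retrieved, relevant) / relevant.size();
    }

    public static void main(String[] args) {
        Set<String> retrieved = Set.of("d1", "d2", "d3", "d4");
        Set<String> relevant = Set.of("d1", "d2", "d5");
        System.out.println(precision(retrieved, relevant) + " " + recall(retrieved, relevant));
    }
}
```

Sweeping the similarity threshold that decides which DTDs count as retrieved, and recomputing both values at each setting, traces out the curve.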

60
Precision-Recall Curves
61
Time/Storage Statistics

62
Robustness

Similarity Measure
(1 - α) × Avg. Semantic Similarity + α / (1 + tree edit distance)
α is varied
Importance Weights of Query Tags
Uniform,
Based on Tag Similarities,
Based on Hierarchy
Precision/Recall are not much affected; Ranking changes
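
A small Java sketch of the similarity measure, reading it as (1 - alpha) × (average semantic similarity) + alpha / (1 + tree edit distance), where alpha stands for the weight that is varied. The name alpha, the [0, 1] range check, and the class name are assumptions made for illustration.

```java
/** Hybrid of semantic similarity and structural closeness, weighted by alpha. */
final class HybridSimilarity {
    static double score(double alpha, double avgSemanticSimilarity, double treeEditDistance) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must lie in [0, 1]");
        }
        // A tree edit distance of 0 gives a structural term of alpha; larger distances shrink it.
        return (1.0 - alpha) * avgSemanticSimilarity + alpha / (1.0 + treeEditDistance);
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.0, 0.5, 1.0}) {
            System.out.println(alpha + " -> " + score(alpha, 0.8, 1.0));
        }
    }
}
```

With alpha = 0 the score is purely semantic and with alpha = 1 it is purely structural, so varying alpha shows how sensitive the ranking is to each component.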

63
Future Work

Extend the approach to XML Schemas (instead of DTDs) and handle nested elements
Develop criteria for picking thresholds based on overall coverage and accuracy
Improve the word similarity measures (instead of using simple rules)
Define a query language for specifying the query object with operators and also incorporate regular expression matching
Complete the interface with Oracle DBMS on both Unix and PC platforms
Incorporate rule-based techniques into Translation and extend it to include entities

64
Conclusions

Technique for querying heterogeneous XML data and translating data across DTDs
Tree Matching and Hybrid Matching Algorithms consider both structural and semantic similarity
The current system is interfaced with Lore DBMS and it delivers reasonably good results on the evaluation data set
Requires some extensions to work with real-life data
