Title: Querying/Translating Heterogeneous XML Data
1 Querying/Translating Heterogeneous XML Data
2 Executive Summary
- Techniques
- Framework for Querying and Translating Heterogeneous Data
- Algorithms for computing Structural and Semantic Similarity
- (Tree Matching, Hybrid Matching, Bipartite Matching)
- Implementation
- Java Software Package (HXMLQ)
- Demo
- Evaluation
- Performance Measures (ROC curves, Querying/Preprocessing Times and Storage, Robustness)
- Data (1077 documents and 80 DTDs) from HNC NER Corpus, www.xml.org, www.xcbl.org
3 Outline
- Problem Definition and Motivation
- System Overview
- Key Techniques
- Lexical Processing, Ranking, Transforming
- Software Package Overview
- API
- Demo Walk Through
- System Evaluation
- Future Work and Conclusion
4 XML Document (example)
- <books>
- <book title="Pierre The Ambiguities">
- <author>Herman Melville</author>
- <price>9.99</price>
- </book>
- <book title="Heart of Darkness">
- <author>Joseph Conrad</author>
- <price>12.99</price>
- </book>
- </books>
5 Document Type Definition (example)
- <!ELEMENT books (book*)>
- <!ELEMENT book (author, price)>
- <!ATTLIST book title CDATA "Unknown">
- <!ELEMENT author (#PCDATA)>
- <!ELEMENT price (#PCDATA)>
6 Problem Definition
- Given a query object (described in XML), find all matching objects from a collection of XML databases/documents based on different schemas.
- Given an XML object and some data schemas (DTDs), identify the relevant ones and translate the object to these schemas.
7 Example
- Given XML Object
- <student>
- <name> abc </name>
- <degree>XXX</degree>
- <GPA> XXX </GPA>
- <nation> xyz </nation>
- <SSN>56778988</SSN>
- </student>
- (XXX = unknown)
- Query Result
- <student_record>
- <student_name> abc </student_name>
- <program> M.S. Electrical Engineering </program>
- <Grade> 3.2 </Grade>
- <SSN> 56778988 </SSN>
- <country>xyz</country>
- </student_record>
- Translated Object
- <student_record>
- <student_name> abc </student_name>
- <program> XXX </program>
- <Grade> XXX </Grade>
- <SSN> 56778988 </SSN>
- <country>xyz</country>
- </student_record>
8 Motivation
- Querying across Schemas
- Provides precise results for general queries
- Makes it easier to integrate different databases
- Translating Data across Schemas
- Useful for automatic processing of XML data
- Can be used to homogenize data for effective storage
- Existing Querying Techniques
- Document Retrieval (general queries, but returns only whole documents)
- Database Querying (precise results, but queries must follow the data schema)
9 New Idea
- Find the best mappings from query tags to data DTD tags
- Generate database queries using the original query object and the mappings, and obtain the results
- Translate the original object to the relevant data schemas (DTDs) using the mappings
10 System Overview
- Three Phases of Operation
- Pre-Processing
- Mapping the Query to Data Schemas
- Querying/Translation
11 Pre-processing
[Figure: pre-processing diagram; surviving label: DBMS]
12 Mapping Query To Data
13 Querying/Translation
14 Lexical Processing
- Words are collected from DTD tags while parsing
- WordNet is used to find synonyms for each word
- Word similarity is computed using some rules
- (e.g., 0.9 for close synonyms, 0.5 for not-so-close synonyms)
- Word groups are refined by merging the WordNet groups
15 Example Word Lists
- Word File
- ABSTRACT
- ABSTRACTION
- ACCESSIBILITY
- Word Sim File
- ABSTRACT ABSTRACTION 0.3
- ABSTRACT ADUMBRATION 0.5
- ABSTRACT OUTLINE 0.5
- ABSTRACT PRECIS 0.5
- Word Group File
- ABSTRACT ABSTRACTION ADUMBRATION OUTLINE PRECIS SYNOPSIS
- ACCESSIBILITY AVAILABILITY HANDINESS
16 Example - Indices
Word Id | Word          | List of Group Ids
1       | ABSTRACT      | 1
2       | ABSTRACTION   | 1
3       | ACCESSIBILITY | 2

Group Id | List of Word Ids
1        | 1, 2, 28, 1332, 1479, 1913
2        | 3, 136, 872

Group Id | List of Matching DTD Tags (DTD No, Tag Object)
1        | (23, T1), (35, T2), (62, T3), (62, T4) ..
2        | (23, T6), (38, T7) ..
17 Ranking Methods
- 1. Tag Count/Similarities
- Find relevant DTDs based on the number of matching tags or the sum of their similarities.
- 2. Bi-Partite Matching
- Build a bipartite graph for each DTD with semantic similarity as the edge weight
- Use the Hungarian Algorithm/Heavy Edge Matching to obtain maximum weight bipartite matchings (mappings)
18 Ranking Methods (Cont.)
- 3. Tree Matching
- Use the Weighted Tree Matching Algorithm to compute the Tree Edit Distance for each mapping and combine it with Semantic Similarity
- 4. Hybrid Matching
- Construct a bipartite graph for each relevant DTD
- Search for mappings with low tree edit distance and high semantic similarity (slightly greedy method)
19 Heavy Edge Matching
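The figure for this slide did not survive, so the following is only a minimal sketch of one common reading of heavy edge matching: scan the bipartite edges from heaviest to lightest and keep an edge when both of its endpoints are still free. The class name `HeavyEdgeMatching`, the `Edge` type, and the weights in `demo()` are assumptions made for illustration and are not part of HXMLQ. Unlike the Hungarian algorithm, this greedy pass is not guaranteed to return a maximum weight matching.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class HeavyEdgeMatching {
    /** One weighted edge of the bipartite graph: query tag to data tag. */
    static final class Edge {
        final String queryTag;
        final String dataTag;
        final double weight;

        Edge(String queryTag, String dataTag, double weight) {
            this.queryTag = queryTag;
            this.dataTag = dataTag;
            this.weight = weight;
        }
    }

    /** Greedy matching: visit edges by descending weight, keep an edge if both ends are free. */
    static Map<String, String> match(List<Edge> edges) {
        List<Edge> byWeight = new ArrayList<>(edges);
        byWeight.sort((a, b) -> Double.compare(b.weight, a.weight));
        Map<String, String> mapping = new LinkedHashMap<>();
        Set<String> usedDataTags = new HashSet<>();
        for (Edge e : byWeight) {
            boolean queryFree = !mapping.containsKey(e.queryTag);
            boolean dataFree = !usedDataTags.contains(e.dataTag);
            if (queryFree && dataFree) {
                mapping.put(e.queryTag, e.dataTag);
                usedDataTags.add(e.dataTag);
            }
        }
        return mapping;
    }

    /** Hypothetical weights, chosen only to exercise the method. */
    static Map<String, String> demo() {
        List<Edge> edges = new ArrayList<>();
        edges.add(new Edge("Name", "PersonName", 0.9));
        edges.add(new Edge("FirstName", "PersonName", 0.5));
        edges.add(new Edge("FirstName", "GivenName", 0.8));
        edges.add(new Edge("LastName", "PersonName", 0.5));
        edges.add(new Edge("LastName", "FamilyName", 0.8));
        return match(edges);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

With these illustrative weights each query tag ends up paired with a distinct data tag, because the heavier Name/PersonName edge is taken before the lighter edges that compete for PersonName.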
20 Tree Matching
- Query DTD and Data DTD are trees, say T1, T2
- Let H be the mapping from the nodes (tags) of the query tree to the data tree.
- H maps T1 to some structure embedded in T2
- Objective
- To compute similarity/distance between T1 and H(T1)
21 Tree Edit Distance
- Tree Edit Distance
- Minimum no. of insertions/deletions/replacements of nodes
- A good measure for structural dissimilarity/distance
- Computation is NP-complete for unordered trees and O(n^2 m^2) for ordered trees (n, m = numbers of tree nodes)
- Can be done faster as we know individual node mappings as well
22 Algorithm - Tree Matching
- Cost(root) = 0
- Add the path from the root of T2 to H(root) to the known paths (only one way)
- Traverse T1 depth-wise and for each node n:
- a) If H(n) has H(parent(n)) as an ancestor,
- Cost(n) = (length of the new path from H(parent(n)) to H(n)) - 1
- else
- Cost(n) = (length of the new path from H(n) to H(parent(n))) - 1
- (might consist of union of two parts)
- b) Add the new path to the known paths
- Sum up the node costs
23 Algorithm - Tree Matching (Cont.)
- The known paths (starting from the root) are stored in the form of a suffix tree
- The cost of each node is the number of insertions and deletions required for the node. No replacements are required.
- Time complexity is O((# nodes in T1) × (longest path in T2))
- Variant
- Instead of counting each insertion/deletion only once, we weight it with the importance of all the nodes requiring it and then sum the counts
- Have to take care of query tag deletion costs and use the immediate ancestor instead of the parent
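The unweighted cost computation can be sketched as follows. This is an illustration under stated assumptions, not the HXMLQ code: every query node is assumed to be mapped, the query root is treated as anchored at the data root, deletion costs are ignored, the known paths are kept in a plain set of T2 edges (each edge named by its child node) rather than a suffix tree, and the per-node cost is clamped at 0 when no new edge is needed. The class name `TreeMatchCost` and its methods are hypothetical. The tree and mapping in `demo()` follow the Student/Applicant example used elsewhere in these slides.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class TreeMatchCost {
    /** Edges from a T2 node up to the T2 root; each edge is identified by its child node. */
    static Set<String> edgesToRoot(String node, Map<String, String> parentT2) {
        Set<String> edges = new HashSet<>();
        for (String cur = node; parentT2.get(cur) != null; cur = parentT2.get(cur)) {
            edges.add(cur);
        }
        return edges;
    }

    /**
     * t1DepthFirst: query nodes in depth-first order, root first.
     * parentT1, parentT2: child -> parent (roots have no entry). h: query node -> data node.
     */
    static int totalCost(List<String> t1DepthFirst, Map<String, String> parentT1,
                         Map<String, String> parentT2, Map<String, String> h) {
        Set<String> known = new HashSet<>();
        int total = 0;
        for (String n : t1DepthFirst) {
            Set<String> path = edgesToRoot(h.get(n), parentT2);
            String parent = parentT1.get(n);
            if (parent != null) {
                // The tree path between H(parent(n)) and H(n) is the symmetric
                // difference of their root paths (one part up, one part down).
                for (String e : edgesToRoot(h.get(parent), parentT2)) {
                    if (!path.remove(e)) {
                        path.add(e);
                    }
                }
            }
            int newEdges = 0;
            for (String e : path) {
                if (known.add(e)) {
                    newEdges++;       // edge was not on any known path yet
                }
            }
            if (parent != null) {     // the root itself costs 0
                total += Math.max(0, newEdges - 1);
            }
        }
        return total;
    }

    static int demo() {
        Map<String, String> parentT1 = Map.of(
                "ID", "Student", "PersonalInfo", "Student", "Name", "PersonalInfo",
                "FirstName", "Name", "LastName", "Name");
        Map<String, String> parentT2 = Map.of(
                "ApplicantId", "Applicant", "Resume", "Applicant", "Bio-Data", "Resume",
                "PersonName", "Bio-Data", "GivenName", "PersonName", "FamilyName", "PersonName");
        Map<String, String> h = Map.of(
                "Student", "Applicant", "ID", "ApplicantId", "PersonalInfo", "Bio-Data",
                "Name", "PersonName", "FirstName", "GivenName", "LastName", "FamilyName");
        List<String> order = List.of("Student", "ID", "PersonalInfo", "Name", "FirstName", "LastName");
        return totalCost(order, parentT1, parentT2, h);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this example only PersonalInfo contributes a cost (the Resume level has to be inserted between Applicant and Bio-Data), so the total is 1.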
24 Hybrid Matching
- Problem with the Two-Step Approach
- Number of maximal mappings per DTD could be very high (3 nodes with 3 partners each => 729 mappings)
- Tree Edit Distance computation will be expensive if we choose all of them, and accuracy will drop if we don't.
- Key Idea
- Compute partial Tree Edit Distance and Semantic Similarity measures and also define a hybrid of these measures.
- Construct/search for mappings by choosing edges which improve the hybrid measure (somewhat greedy method)
25 Algorithm - Hybrid Matching
- Build a bipartite graph from the query tree, T1, to a data tree, T2
- Initialize the mapping M to the empty set
- Traverse T1 depth-wise and for each node n:
- a) For each edge e incident on n, compute the increase in the hybrid similarity measure obtained by adding the edge e to M
- b) Edge set E = set of all edges (n, x)
- c) Repeat until n is assigned or there are no more edges in E:
- Let (n, m) be the edge with maximum increase in E
- If (n, m) has maximum weight among all the edges (n, m)
- M = M ∪ {edge (n, m)}
- else
- M = M ∪ {edge (n, m)} (with probability based on Δweight)
- Remove the edge (n, m) from the set E
- d) Compute the node cost using the chosen edge
- Sum up and normalize the node costs
26 Example - Query Object
- <?xml version="1.0"?>
- <Student>
- <ID> 123 </ID>
- <PersonalInfo>
- <Name>
- <FirstName>Michael</FirstName>
- <LastName>McGuire</LastName>
- </Name>
- </PersonalInfo>
- </Student>
27 Example - Query DTD Tree
28 Example - Relevant DTDs
- Student → STUDENT → Groups 1124
- ID → ID → Groups 753
- PersonalInfo → PERSONAL, INFO → Groups 943, 467
- Name → NAME → Groups 137, 897
- FirstName → FIRST, NAME → Groups 152, 684, 137, 897
- LastName → LAST, NAME → Groups 355, 594, 137, 897
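The first step in each row above, turning a tag name into upper-case words, might look like the sketch below. The splitting rule (break at lower-to-upper case boundaries and at `_`, `-`, `.`) is an assumption inferred from the examples; the slides do not state the rule HXMLQ uses, and the class name `TagWords` is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TagWords {
    /** Splits a tag name such as "PersonalInfo" or "student_name" into upper-case words. */
    static List<String> split(String tag) {
        // Insert a space where a lower-case letter or digit is followed by an upper-case letter.
        String spaced = tag.replaceAll("([a-z0-9])([A-Z])", "$1 $2");
        List<String> words = new ArrayList<>();
        // Then cut at whitespace, underscores, hyphens, and dots.
        for (String part : spaced.split("[\\s_\\-.]+")) {
            if (!part.isEmpty()) {
                words.add(part.toUpperCase(Locale.ROOT));
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(split("PersonalInfo"));
        System.out.println(split("FirstName"));
        System.out.println(split("ID"));
    }
}
```

Each resulting word would then be looked up in the word index to obtain its group ids.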
29 Example - Relevant DTDs (Cont.)
- Inverted Indices: Group → (DTD No., Tag)
- Group 137 → (1, T1) (1, T2) (1, T3) (5, T6)
- Group 152 → (1, T1) (4, T5)
- Group 355 → (1, T2) (4, T8)
- Group 467 → (1, T3) (4, T9)
- Group 594 → (1, T2)
- Group 684 → (1, T1) (4, T5) (5, T7)
- Group 753 → (1, T4)
- Group 897 → (1, T1) (1, T2) (1, T3)
- Group 943 → (1, T4)
- Group 1124 → (4, T5) (6, T10)
- Plain Tag Counts: (DTD1, 5) (DTD4, 4) (DTD5, 2) (DTD6, 1)
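A tag-count ranking over such an inverted index can be sketched as below. The counting rule used here (a DTD scores one point for each query tag that reaches it through at least one word group) is an assumption, since the slides do not spell out the exact rule, and the data in `demo()` is a small hypothetical subset rather than the full index above. The class name `TagCountRanking` is also hypothetical.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

final class TagCountRanking {
    /**
     * queryTagGroups: query tag -> ids of the word groups its words belong to.
     * groupToDtds: word group id -> numbers of the DTDs that contain a matching tag.
     * Returns DTD number -> count of query tags with at least one match in that DTD.
     */
    static Map<Integer, Integer> countMatches(Map<String, List<Integer>> queryTagGroups,
                                              Map<Integer, List<Integer>> groupToDtds) {
        Map<Integer, Set<String>> matchedTags = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> q : queryTagGroups.entrySet()) {
            for (Integer group : q.getValue()) {
                List<Integer> dtds = groupToDtds.getOrDefault(group, Collections.emptyList());
                for (Integer dtd : dtds) {
                    matchedTags.computeIfAbsent(dtd, k -> new HashSet<>()).add(q.getKey());
                }
            }
        }
        Map<Integer, Integer> counts = new TreeMap<>();
        for (Map.Entry<Integer, Set<String>> e : matchedTags.entrySet()) {
            counts.put(e.getKey(), e.getValue().size());
        }
        return counts;
    }

    /** Hypothetical two-tag query against a three-group index. */
    static Map<Integer, Integer> demo() {
        Map<String, List<Integer>> queryTagGroups = Map.of(
                "ID", List.of(753),
                "Name", List.of(137, 897));
        Map<Integer, List<Integer>> groupToDtds = Map.of(
                137, List.of(1, 5),
                753, List.of(1),
                897, List.of(1));
        return countMatches(queryTagGroups, groupToDtds);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because only the inverted index is consulted, the work grows with the number of query tags and matching postings rather than with the total number of schemas.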
30 Example - Target DTD
Applicant
  ID
  Resume
    Bio-Data
      PersonName
        GivenName
        FamilyName
31 Example - Bi-Partite Graph
32 Example - Heavy Edge Matching
33 Example - Heavy Edge Matching
34 Example - Tree Matching
Node         | Edit Cost
Root         | Applicant
Student      | DeletionCost
ID           | 0
PersonalInfo | 1
Name         | 1
FirstName    | 2
LastName     | 3 (2 in unweighted case)
35 Example - Tree Matching
Node         | Edit Cost
Root         | Applicant
Student      | DeletionCost
ID           | 0
PersonalInfo | 1
Name         | 0
FirstName    | 0
LastName     | 0
36 Example - Hybrid Matching
37 Transforming Rules
- For Query Generation and Translation
- Leaf Query Nodes (Simple Elements, Attributes)
- Node value is the content of the node
- Higher Query Nodes (Complex Elements)
- If none of the child nodes are mapped to the target schema
- Node value is the concatenated text of the entire sub-tree under the node
- Otherwise
- Node value is the text directly under the node
- For Translation
- Nest elements so that common ancestors correspond to each other
- Distribute the node value among children using some fixed rules
38 Example - Query Generation
- Mappings
Root -> Applicant
Student -> (no mapping)
Student.ID -> Applicant.ApplicantId
Student.PersonalInfo -> Applicant.Resume.Bio-Data
Student.PersonalInfo.Name -> Applicant.Resume.Bio-Data.PersonName
Student.PersonalInfo.Name.FirstName -> Applicant.Resume.Bio-Data.PersonName.GivenName
Student.PersonalInfo.Name.LastName -> Applicant.Resume.Bio-Data.PersonName.FamilyName
- Operator File
MATCH_OP ApplicantId =
MATCH_OP FirstName hasWord
MATCH_OP LastName grep
JOIN_OP TOP AND gp1 ApplicantId
JOIN_OP gp1 OR FirstName LastName
- Query
SELECT Applicant WHERE (Applicant.ApplicantId = 123) AND ((Applicant.Resume.Bio-Data.PersonName.GivenName hasWord "Michael") OR (Applicant.Resume.Bio-Data.PersonName.FamilyName grep "McGuire"))
39 Example - Translation
Translated Object
- <?xml version="1.0"?>
- <Applicant>
- <ApplicantId> 123 </ApplicantId>
- <Resume>
- <Bio-Data>
- <PersonName>
- <GivenName>Michael</GivenName>
- <FamilyName>McGuire</FamilyName>
- </PersonName>
- </Bio-Data>
- </Resume>
- </Applicant>
Creation Steps
- Make Applicant - no map
- Make ApplicantId - fill it
- Append ApplicantId
- Make Resume - no map
- Make Bio-Data - do nothing
- Make PersonName - do nothing
- Make GivenName - fill it
- Append GivenName
- Make FamilyName - fill it
- Append FamilyName
- Append PersonName
- Append Bio-Data
- Append Resume
- Return the Applicant Object
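The creation steps can be approximated with the standard Java DOM API. The sketch below is an assumption-laden illustration, not the HXMLQ `DocTranslator`: it takes only leaf values keyed by dotted target paths, builds the tree top-down instead of in the make/fill/append order listed above, and assumes all paths share one root element. The class `TranslationSketch` and its methods are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

final class TranslationSketch {
    /** Builds a target document from leaf values keyed by dotted target paths. */
    static Document translate(Map<String, String> leafValuesByTargetPath) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("No usable XML parser configuration", e);
        }
        for (Map.Entry<String, String> entry : leafValuesByTargetPath.entrySet()) {
            Node current = doc;
            for (String step : entry.getKey().split("\\.")) {
                current = childOrCreate(doc, current, step); // "make" and "append" if missing
            }
            current.setTextContent(entry.getValue());        // "fill it"
        }
        return doc;
    }

    /** Returns the child element with the given name, creating and appending it if absent. */
    static Element childOrCreate(Document doc, Node parent, String name) {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.ELEMENT_NODE && c.getNodeName().equals(name)) {
                return (Element) c;
            }
        }
        Element created = doc.createElement(name);
        parent.appendChild(created);
        return created;
    }

    public static void main(String[] args) {
        Map<String, String> values = new LinkedHashMap<>();
        values.put("Applicant.ApplicantId", "123");
        values.put("Applicant.Resume.Bio-Data.PersonName.GivenName", "Michael");
        values.put("Applicant.Resume.Bio-Data.PersonName.FamilyName", "McGuire");
        Document doc = translate(values);
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```

Intermediate elements such as Resume and Bio-Data are created without content, which corresponds to the "no map" and "do nothing" steps.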
40 Performance Analysis
- Preprocessing
- Storage: O(#DataTags + #Words + #SimilarityPairs + #WordGroups × AvgDataTags/WordGroup)
- Time: O(#DataTags × AvgWordGroups/DataTag + #Words + #SimilarityPairs + #WordGroups)
- Query Time
- Linear in the number of tags in the query
- Linear in the number of first-level matching schemas
- Independent of the number of actual words and schemas
41 Software Package Overview
- Main components
- Source Code
- API Documentation
- Case Study Data
- XML Documents, DTDs, Queries/Results, Word Files
- Demo
- Supporting Libraries and Install Files
- XML Parsers (Oracle, Saxon), Database Utilities (Lore, Oracle)
42 Code Description
- Four Main Levels
- Specification Code
- interfaces and abstract classes
- Implementation Code
- the actual functionality
- Application Code
- Integrated applications
- Graphical Interface Code
43 Code Description (Cont.)
- Controls/Settings
- FilePathOptions, ThresholdOptions, MiscellaneousOptions etc.
- Processors
- XMLInputProcessor, WordProcessor, InvertedIndexProcessor, Database-Indexer, DTDSimProcessor, QueryProcessor, DocTranslator etc.
- Data Structures
- BiGraph, Mapping etc.
- Function Objects
- TagWtComputationRoutine, TreeEditDistanceRoutine etc.
- XML Parser Extensions
- ExtDOMParser, ExtDTD etc.
44 Top Level API
- Methods of the Class hxmlq.HXMLQFunc
- Configuration
- public static void initialiseApp(String optionFile) throws hxmlq.HXMLException
- Parameters
- optionFile: file containing all config information
- Pre-Processing Data
- public static void preprocessData() throws hxmlq.HXMLException
- Indexing Data
- public static void indexLore() throws hxmlq.HXMLException
45 Top Level API (Cont.)
- Mapping Query To Data
- public static void loadQuery(String query, int mode) throws hxmlq.HXMLException
- Parameters
- query: string containing XML data or the file name
- mode: 0 for string input and 1 for file input
- Generating Database Queries
- public static void generateLoreQueries(String operator, int mode) throws hxmlq.HXMLException
- Parameters
- operator: string containing the operator info or the file name
- mode: 0 for string input and 1 for file input
46 Top Level API (Cont.)
- Obtaining Database Results
- public static void obtainLoreResults() throws hxmlq.HXMLException
- Translating Document
- public static void translateDoc() throws hxmlq.HXMLException
47 System Configuration
48 System Configuration
49 System Configuration
50 Pre-Processing/Indexing
51 Querying/Translation Menu
52 Loading the Query/Operators
53 Mapping Query To Data
54 Translating Documents
55 Generate Database Queries
56 Obtain Database Results
57 Help Menu
58 System Evaluation
- Data Set
- 1077 documents (News Articles, Business Forms, Resumes etc.) covering different topics
- 80 DTDs
- 50 queries based on the data
- Source: HNC/ATS NER Corpus, www.editml.com, www.xml.org, www.xcbl.org
59 System Evaluation (Cont.)
- Performance Measures
- Quality
- Precision-Recall Curves
- Speed
- Pre-processing and Query Mapping Time (machine dependent)
- Storage
- Number of DTD nodes and word groups, pairs etc.
- Robustness
- Performance with different similarities and tag weights
60 Precision-Recall Curves
61 Time/Storage Statistics
62 Robustness
- Similarity Measure
- (1 - α) × Avg. Semantic Similarity + α / (1 + tree edit distance)
- α is varied
- Importance Wts. of Query Tags
- Uniform,
- Based on Tag Similarities,
- Based on Hierarchy
- Precision/Recall not much affected, Ranking changes
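For concreteness, the similarity measure on this slide can be written as a small function. The mixing symbol and the arithmetic operators were lost in the slide text, so the form used here, (1 - alpha) * avgSemanticSimilarity + alpha / (1 + treeEditDistance), is a reconstruction and should be read as an assumption; the class and parameter names are hypothetical.

```java
final class HybridSimilarity {
    /**
     * Combines semantic and structural similarity.
     * alpha = 0 uses only the average semantic similarity;
     * alpha = 1 uses only the structural term 1 / (1 + tree edit distance).
     */
    static double score(double avgSemanticSimilarity, double treeEditDistance, double alpha) {
        if (alpha < 0.0 || alpha > 1.0) {
            throw new IllegalArgumentException("alpha must lie in [0, 1]");
        }
        if (treeEditDistance < 0.0) {
            throw new IllegalArgumentException("tree edit distance must be non-negative");
        }
        return (1.0 - alpha) * avgSemanticSimilarity + alpha / (1.0 + treeEditDistance);
    }

    public static void main(String[] args) {
        // Equal weighting of an average semantic similarity of 0.8 and an edit distance of 1.
        System.out.println(score(0.8, 1.0, 0.5));
    }
}
```

Sweeping `alpha` between 0 and 1 while re-ranking the candidate DTDs would mirror the robustness experiment summarized above.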
63 Future Work
- Extend the approach to XML Schemas (instead of DTDs) and handle nested elements
- Develop criteria for picking thresholds based on overall coverage and accuracy
- Improve the word similarity measures (instead of using simple rules)
- Define a query language for specifying the query object with operators and also incorporate regular expression matching
- Complete the interface with Oracle DBMS on both Unix and PC platforms
- Incorporate rule-based techniques into Translation and extend it to include entities
64 Conclusions
- Technique for querying heterogeneous XML data and translating data across DTDs
- Tree Matching and Hybrid Matching Algorithms consider both structural and semantic similarity
- The current system is interfaced with Lore DBMS and delivers reasonably good results on the evaluation data set
- Requires some extensions to work with real-life data