Querying/Translating Heterogeneous XML Data

Provided by: yah4

1
Querying/Translating Heterogeneous XML Data
  • Srujana Merugu

2
Executive Summary
  • Techniques
      • Framework for Querying and Translating
        Heterogeneous Data
      • Algorithms for computing Structural and Semantic
        Similarity (Tree Matching, Hybrid Matching,
        Bipartite Matching)
  • Implementation
      • Java Software Package (HXMLQ)
      • Demo
  • Evaluation
      • Performance Measures (ROC curves,
        Querying/Preprocessing Times and Storage,
        Robustness)
      • Data (1077 documents and 80 DTDs) from HNC NER
        Corpus, www.xml.org, www.xcbl.org

3
Outline
  • Problem Definition & Motivation
  • System Overview
  • Key Techniques
      • Lexical Processing, Ranking, Transforming
  • Software Package Overview
      • API
      • Demo Walk Through
  • System Evaluation
  • Future Work & Conclusion

4
XML Document (example)
  • <books>
  •   <book title="Pierre The Ambiguities">
  •     <author>Herman Melville</author>
  •     <price>9.99</price>
  •   </book>
  •   <book title="Heart of Darkness">
  •     <author>Joseph Conrad</author>
  •     <price>12.99</price>
  •   </book>
  • </books>

5
Document Type Definition (example)
  • <!ELEMENT books ( book* ) >
  • <!ELEMENT book ( author, price ) >
  • <!ATTLIST book title CDATA "Unknown">
  • <!ELEMENT author ( #PCDATA ) >
  • <!ELEMENT price ( #PCDATA ) >

6
Problem Definition.
  • Given a query object (described in XML) find all
    matching objects from a collection of XML
    databases/documents based on different schemas.
  • Given an XML object and some data schemas (DTDs),
    identify the relevant ones and translate the
    object to these schemas

7
Example
  • Given XML Object
  • <student>
  •   <name> abc </name>
  •   <degree>XXX</degree>
  •   <GPA> XXX </GPA>
  •   <nation> xyz </nation>
  •   <SSN>56778988</SSN>
  • </student>
  • (XXX = unknown)
  • Query Result
  • <student_record>
  •   <student_name> abc </student_name>
  •   <program> M.S Electrical Engineering </program>
  •   <Grade> 3.2 </Grade>
  •   <SSN> 56778988 </SSN>
  •   <country>xyz</country>
  • </student_record>
  • Translated Object
  • <student_record>
  •   <student_name> abc </student_name>
  •   <program> XXX </program>
  •   <Grade> XXX </Grade>
  •   <SSN> 56778988 </SSN>
  •   <country>xyz</country>
  • </student_record>

8
Motivation
  • Querying across Schemas
      • Provides precise results for general queries
      • Makes it easier to integrate different databases
  • Translating Data across Schemas
      • Useful for automatic processing of XML data
      • Can be used to homogenize data for effective
        storage
  • Existing Querying Techniques
      • Document Retrieval (general queries, but returns
        only whole documents)
      • Database Querying (precise results, but queries
        must follow the data schema)

9
New Idea
  • Find the best mappings from query tags to data
    DTD tags
  • Generate database queries using original query
    object and the mappings and obtain the results
  • Translate the original object to the relevant
    data schemas (DTDs) using the mappings

10
System Overview
  • Three Phases of Operation
  • Pre-Processing
  • Mapping the Query to Data Schemas
  • Querying/Translation

11
Pre-processing
DBMS
12
Mapping Query To Data
13
Querying/Translation
14
Lexical Processing
  • Words are collected from DTD tags while parsing
  • WordNet is used to find synonyms for each word
  • Word similarity is computed using simple rules
    (e.g., 0.9 for close synonyms, 0.5 for not-so-close
    synonyms)
  • Word groups are refined by merging the WordNet
    synonym groups

15
Example Word Lists
  • Word File
  • ABSTRACT
  • ABSTRACTION
  • ACCESSIBILITY
  • Word Sim File
  • ABSTRACT ABSTRACTION 0.3
  • ABSTRACT ADUMBRATION 0.5
  • ABSTRACT OUTLINE 0.5
  • ABSTRACT PRECIS 0.5
  • Word Group File
  • ABSTRACT ABSTRACTION ADUMBRATION OUTLINE PRECIS
    SYNOPSIS
  • ACCESSIBILITY AVAILABILITY HANDINESS
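
A word similarity file like the one above could be loaded and queried roughly as follows. This is only a sketch: the class name WordSimTable, the whitespace-separated "WORD WORD score" line format, and the default scores (1.0 for identical words, 0.0 for pairs with no entry) are assumptions, not details taken from the HXMLQ package.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper; the "WORD WORD score" line format is an assumption.
class WordSimTable {
    private final Map<String, Double> sims = new HashMap<>();

    // Order-independent key so that similarity(a, b) == similarity(b, a).
    private static String key(String a, String b) {
        return a.compareTo(b) <= 0 ? a + "|" + b : b + "|" + a;
    }

    // Parses one line such as "ABSTRACT OUTLINE 0.5".
    void addLine(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected 'WORD WORD score': " + line);
        }
        sims.put(key(parts[0].toUpperCase(), parts[1].toUpperCase()),
                 Double.parseDouble(parts[2]));
    }

    // Identical words score 1.0; pairs with no entry score 0.0 (both assumed).
    double similarity(String a, String b) {
        String x = a.toUpperCase();
        String y = b.toUpperCase();
        if (x.equals(y)) {
            return 1.0;
        }
        return sims.getOrDefault(key(x, y), 0.0);
    }

    static WordSimTable fromLines(String... lines) {
        WordSimTable table = new WordSimTable();
        for (String line : lines) {
            table.addLine(line);
        }
        return table;
    }
}
```

Storing each pair under an order-independent key keeps the table symmetric without writing every entry twice.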

16
Example - Indices
Word Id   Word            List of Group Ids
1         ABSTRACT        1
2         ABSTRACTION     1
3         ACCESSIBILITY   2

Group Id  List of Word Ids
1         1, 2, 28, 1332, 1479, 1913
2         3, 136, 872

Group Id  List of Matching DTD Tags (DTD No, Tag Object)
1         (23, T1), (35, T2), (62, T3), (62, T4) ..
2         (23, T6), (38, T7) ..
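
Given a group-to-DTD index of this kind, a plain tag count per DTD can be sketched as below. The class and method names are hypothetical, and counting each query tag at most once per DTD is an assumption about how the count is defined.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the "tag count" ranking step.
class TagCountRanker {
    // queryTagGroups: query tag -> ids of the word groups its words belong to.
    // groupIndex: word group id -> DTD numbers that contain a tag in that group.
    // Returns DTD number -> number of query tags with at least one match there.
    static Map<Integer, Integer> countMatches(Map<String, List<Integer>> queryTagGroups,
                                              Map<Integer, List<Integer>> groupIndex) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (List<Integer> groups : queryTagGroups.values()) {
            // Collect the DTDs matched by this query tag, without duplicates.
            Set<Integer> dtdsForTag = new HashSet<>();
            for (int group : groups) {
                dtdsForTag.addAll(
                    groupIndex.getOrDefault(group, Collections.<Integer>emptyList()));
            }
            for (int dtd : dtdsForTag) {
                counts.merge(dtd, 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

DTDs with the highest counts would then be passed on to the more expensive matching steps.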
17
Ranking Methods
  • 1. Tag Count/Similarities
      • Find relevant DTDs based on the number of matching
        tags or the sum of their similarities.
  • 2. Bipartite Matching
      • Build a bipartite graph for each DTD with semantic
        similarity as the edge weight
      • Use the Hungarian Algorithm or Heavy Edge Matching to
        obtain maximum-weight bipartite matchings (mappings)

18
Ranking Methods (Cont.)
  • 3. Tree Matching
      • Use the Weighted Tree Matching Algorithm to compute
        the Tree Edit Distance for each mapping and combine
        it with Semantic Similarity
  • 4. Hybrid Matching
      • Construct a bipartite graph for each relevant DTD
      • Search for mappings with low tree edit distance
        and high semantic similarity (slightly greedy
        method)

19
Heavy Edge Matching
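
The slide content is not in the transcript. As a rough illustration only, heavy edge matching is commonly implemented as a greedy heuristic: visit edges in order of decreasing weight and keep an edge when both of its endpoints are still free. The sketch below assumes that variant; the class name and the edge representation are hypothetical. Unlike the Hungarian algorithm, this heuristic does not guarantee a maximum-weight matching.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a greedy heavy-edge matching on a bipartite graph.
class HeavyEdgeMatching {
    static final class Edge {
        final String queryTag;
        final String dataTag;
        final double weight; // semantic similarity of the two tags

        Edge(String queryTag, String dataTag, double weight) {
            this.queryTag = queryTag;
            this.dataTag = dataTag;
            this.weight = weight;
        }
    }

    // Returns query tag -> data tag for the greedily chosen edges.
    static Map<String, String> match(List<Edge> edges) {
        List<Edge> sorted = new ArrayList<>(edges);
        // Heaviest edges first.
        sorted.sort((a, b) -> Double.compare(b.weight, a.weight));
        Map<String, String> mapping = new LinkedHashMap<>();
        Set<String> usedDataTags = new HashSet<>();
        for (Edge e : sorted) {
            // Keep the edge only if neither endpoint is matched yet.
            if (!mapping.containsKey(e.queryTag) && !usedDataTags.contains(e.dataTag)) {
                mapping.put(e.queryTag, e.dataTag);
                usedDataTags.add(e.dataTag);
            }
        }
        return mapping;
    }
}
```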
20
Tree Matching
  • Query DTD and Data DTD are trees, say T1 and T2
  • Let H be the mapping from the nodes (tags) of the
    query tree to the data tree.
  • H maps T1 to some structure embedded in T2
  • Objective
      • To compute the similarity/distance between T1 and
        H(T1)

21
Tree Edit Distance
  • Tree Edit Distance
      • Minimum number of insertions/deletions/replacements
        of nodes
      • A good measure of structural dissimilarity/distance
  • Computation is NP-complete for unordered trees
    and O(n^2 m^2) for ordered trees (n, m = number of
    tree nodes)
  • Can be done faster here, as the individual node
    mappings are already known

22
Algorithm - Tree Matching
  • Cost(root) = 0
  • Add the path from the root of T2 to H(root) to the
    known paths (there is only one such path)
  • Traverse T1 depth-first and, for each node n:
  • a) If H(n) has H(parent(n)) as an ancestor,
         Cost(n) = (length of the new path from
         H(parent(n)) to H(n)) - 1
       else
         Cost(n) = (length of the new path from
         H(n) to H(parent(n))) - 1
         (the path might consist of the union of two parts)
  • b) Add the new path to the known paths
  • Sum up the node costs

23
Algorithm - Tree Matching (Cont.)
  • The known paths (starting from the root) are stored
    in the form of a suffix tree
  • The cost of each node is the number of insertions and
    deletions required for the node. No replacements
    are required.
  • Time complexity is O((number of nodes in T1) x (longest
    path in T2))
  • Variant
      • Instead of counting each insertion/deletion only
        once, weight it by the importance of all
        the nodes requiring it and then sum the counts
      • Have to take care of query tag deletion costs and
        use the immediate ancestor instead of the parent
24
Hybrid Matching
  • Problem with two step Approach
  • The number of maximal mappings per DTD could be
    very high (e.g., 6 nodes with 3 partners each =>
    3^6 = 729 mappings)
  • Tree Edit Distance computation will be expensive
    if we choose all of them, and accuracy will drop
    if we don't.
  • Key Idea
  • Compute partial Tree Edit Distance and Semantic
    Similarity measures and also define a hybrid of
    these measures.
  • Construct/search for mappings by choosing edges
    which improve the hybrid measure (somewhat
    greedy method)

25
Algorithm - Hybrid Matching
  • Build a bipartite graph from the query tree, T1,
    to a data tree, T2
  • Initialize the mapping M to the empty set
  • Traverse T1 depth-first and, for each node n:
  • a) For each edge e incident on n, compute the
       increase in the hybrid similarity measure
       obtained by adding the edge e to M
  • b) Edge set E = set of all edges (n, x)
  • c) Repeat until n is assigned or there are no more
       edges in E:
         Let (n, m) be the edge in E with the maximum
         increase
         If (n, m) has the maximum weight among all the
         edges in E
           M = M ∪ {(n, m)}
         else
           M = M ∪ {(n, m)} (with probability based on
           Δweight)
         Remove the edge (n, m) from the set E
  • d) Compute the node cost using the chosen edge
  • Sum up and normalize the node costs

26
Example Query Object
  • <?xml version="1.0"?>
  • <Student>
  •   <ID> 123 </ID>
  •   <PersonalInfo>
  •     <Name>
  •       <FirstName>Michael</FirstName>
  •       <LastName>McGuire</LastName>
  •     </Name>
  •   </PersonalInfo>
  • </Student>

27
Example Query DTD Tree
28
Example Relevant DTDs
  • Student → STUDENT → Groups 1124
  • ID → ID → Groups 753
  • PersonalInfo → PERSONAL, INFO → Groups 943, 467
  • Name → NAME → Groups 137, 897
  • FirstName → FIRST, NAME → Groups 152, 684, 137, 897
  • LastName → LAST, NAME → Groups 355, 594, 137, 897
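
The first step shown above, breaking a tag name into upper-case words, might look like the following sketch. The splitting rules (underscores, hyphens, whitespace, and lower-to-upper case boundaries) and the class name are assumptions; the transcript does not say how HXMLQ tokenizes tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: split a tag name into the words used for group lookup.
class TagTokenizer {
    static List<String> words(String tag) {
        List<String> out = new ArrayList<>();
        // Split on separators or at a lower-case/upper-case boundary.
        for (String part : tag.split("[_\\-\\s]+|(?<=[a-z])(?=[A-Z])")) {
            if (!part.isEmpty()) {
                out.add(part.toUpperCase(Locale.ROOT));
            }
        }
        return out;
    }
}
```

With these rules an all-capitals tag such as ID stays a single word, which matches the example above.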

29
Example Relevant DTDs (Cont.)
  • Inverted Indices: Group → (DTD No., Tag)
  • Group 137 → (1, T1) (1, T2) (1, T3) (5, T6)
  • Group 152 → (1, T1) (4, T5)
  • Group 355 → (1, T2) (4, T8)
  • Group 467 → (1, T3) (4, T9)
  • Group 594 → (1, T2)
  • Group 684 → (1, T1) (4, T5) (5, T7)
  • Group 753 → (1, T4)
  • Group 897 → (1, T1) (1, T2) (1, T3)
  • Group 943 → (1, T4)
  • Group 1124 → (4, T5) (6, T10)
  • Plain Tag Counts: (DTD1, 5) (DTD4, 4) (DTD5, 2)
    (DTD6, 1)

30
Example Target DTD
Applicant
  ID
  Resume
    Bio-Data
      PersonName
        GivenName
        FamilyName
31
Example Bi-Partite Graph
32
Example Heavy Edge Matching
33
Example Heavy Edge Matching
34
Example Tree Matching
Node           Edit Cost
Root           Applicant
Student        DeletionCost
ID             0
PersonalInfo   1
Name           1
FirstName      2
LastName       3 (2 in unweighted case)
35
Example Tree Matching
Node           Edit Cost
Root           Applicant
Student        DeletionCost
ID             0
PersonalInfo   1
Name           0
FirstName      0
LastName       0
36
Example Hybrid Matching
37
Transforming Rules
  • For Query Generation and Translation
      • Leaf Query Nodes (Simple Elements, Attributes)
          • Node value is the content of the node
      • Higher Query Nodes (Complex Elements)
          • If none of the child nodes are mapped to the
            target schema, the node value is the concatenated
            text of the entire sub-tree under the node
          • Otherwise, the node value is the text directly
            under the node
  • For Translation
      • Nest elements so that common ancestors correspond
        to each other
      • Distribute the node value among children using some
        fixed rules

38
Example Query Generation
  • Mappings
      Root - Applicant
      Student -
      Student.ID - Applicant.ApplicantId
      Student.PersonalInfo - Applicant.Resume.Bio-Data
      Student.PersonalInfo.Name -
        Applicant.Resume.Bio-Data.PersonName
      Student.PersonalInfo.Name.FirstName -
        Applicant.Resume.Bio-Data.PersonName.GivenName
      Student.PersonalInfo.Name.LastName -
        Applicant.Resume.Bio-Data.PersonName.FamilyName
      FirstName hasWord
      LastName hasWord
  • Operator File
      MATCH_OP ApplicantId
      MATCH_OP FirstName hasWord
      MATCH_OP LastName grep
      JOIN_OP TOP AND gp1 ApplicantId
      JOIN_OP gp1 OR FirstName LastName
  • Query
      SELECT Applicant WHERE (Applicant.ApplicantId = 123)
      AND ((Applicant.Resume.Bio-Data.PersonName.GivenName
      hasWord Michael) OR
      (Applicant.Resume.Bio-Data.PersonName.FamilyName
      grep McGuire))
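
The way mappings, operators and query values combine into the final query can be illustrated with a small string-building sketch. It does not parse the operator file format; the class name, the method names, and the use of "=" for the ApplicantId match are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: assemble the example query from tag mappings.
class LoreQuerySketch {
    // Renders one condition on the mapped target path of a query tag.
    static String condition(Map<String, String> mapping, String queryPath,
                            String operator, String value) {
        String target = mapping.get(queryPath);
        if (target == null) {
            throw new IllegalArgumentException("No mapping for " + queryPath);
        }
        return "(" + target + " " + operator + " " + value + ")";
    }

    static String exampleQuery() {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("Student.ID", "Applicant.ApplicantId");
        mapping.put("Student.PersonalInfo.Name.FirstName",
                    "Applicant.Resume.Bio-Data.PersonName.GivenName");
        mapping.put("Student.PersonalInfo.Name.LastName",
                    "Applicant.Resume.Bio-Data.PersonName.FamilyName");

        // Group gp1: FirstName OR LastName.
        String gp1 = "(" + String.join(" OR ",
            condition(mapping, "Student.PersonalInfo.Name.FirstName", "hasWord", "Michael"),
            condition(mapping, "Student.PersonalInfo.Name.LastName", "grep", "McGuire")) + ")";

        // Top level: ApplicantId AND gp1.
        String where = String.join(" AND ",
            condition(mapping, "Student.ID", "=", "123"), gp1);
        return "SELECT Applicant WHERE " + where;
    }
}
```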
39
Example - Translation
Translated Object
  • <?xml version="1.0"?>
  • <Applicant>
  •   <ApplicantId> 123 </ApplicantId>
  •   <Resume>
  •     <Bio-Data>
  •       <PersonName>
  •         <GivenName>Michael</GivenName>
  •         <FamilyName>McGuire</FamilyName>
  •       </PersonName>
  •     </Bio-Data>
  •   </Resume>
  • </Applicant>

Creation Steps
  • Make Applicant - no map
  • Make ApplicantId - fill it
  • Append ApplicantId
  • Make Resume - no map
  • Make Bio-Data - do nothing
  • Make PersonName - do nothing
  • Make GivenName - fill it
  • Append GivenName
  • Make FamilyName - fill it
  • Append FamilyName
  • Append PersonName
  • Append Bio-Data
  • Append Resume
  • Return the Applicant Object
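
The creation steps can be followed with the standard Java DOM API. This is a hand-written sketch for this one example, not the DocTranslator code from the package: the class name, the method signature, and the hard-coded element order are assumptions.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: build the translated Applicant object step by step.
class TranslationSketch {
    // "Make X - fill it": create the element and copy the mapped query value in.
    private static Element filled(Document doc, String tag, String value) {
        Element e = doc.createElement(tag);
        e.setTextContent(value);
        return e;
    }

    static Document translate(String id, String firstName, String lastName) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
        Element applicant = doc.createElement("Applicant");         // no map
        applicant.appendChild(filled(doc, "ApplicantId", id));      // make, fill, append
        Element resume = doc.createElement("Resume");               // no map
        Element bioData = doc.createElement("Bio-Data");            // mapped, nothing to fill
        Element personName = doc.createElement("PersonName");       // mapped, nothing to fill
        personName.appendChild(filled(doc, "GivenName", firstName));
        personName.appendChild(filled(doc, "FamilyName", lastName));
        bioData.appendChild(personName);
        resume.appendChild(bioData);
        applicant.appendChild(resume);
        doc.appendChild(applicant);                                  // return the Applicant object
        return doc;
    }
}
```

Children are appended to their parent only after they have been completed, mirroring the Make/Append order of the creation steps.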
40
Performance Analysis
  • Preprocessing
  • Storage: O(#DataTags + #Words + #SimilarityPairs +
    #WordGroups x AvgDataTags/WordGroup)
  • Time: O(#DataTags x AvgWordGroups/DataTag +
    #Words + #SimilarityPairs + #WordGroups)
  • Query Time
  • Linear in the number of tags in the query
  • Linear in the number of first level matching
    schemas
  • Independent of number of actual words and schemas

41
Software Package Overview
  • Main components
  • Source Code
  • API Documentation
  • Case Study Data
  • XML Documents, DTDs, Queries/Results, Word Files
  • Demo
  • Supporting Libraries and Install Files
  • XML Parsers (Oracle, Saxon), Database
    Utilities (Lore, Oracle)

42
Code Description
  • Four Main Levels
  • Specification Code
  • interfaces and abstract classes
  • Implementation Code
  • the actual functionality
  • Application Code
  • Integrated applications
  • Graphical Interface Code

43
Code Description (Cont.)
  • Controls/Settings
  • FilePathOptions, ThresholdOptions,
    MiscellaneousOptions etc.
  • Processors
  • XMLInputProcessor, WordProcessor,
    InvertedIndexProcessor, Database-Indexer,
    DTDSimProcessor, QueryProcessor, DocTranslator
    etc.
  • Data Structures
  • BiGraph, Mapping etc.
  • Function Objects
  • TagWtComputationRoutine, TreeEditDistanceRoutine
    etc.
  • XML Parser Extensions
  • ExtDOMParser, ExtDTD etc.

44
Top Level API
  • Methods of the Class hxmlq.HXMLQFunc
  • Configuration
      • public static void initialiseApp(String optionFile)
        throws hxmlq.HXMLException
      • Parameters
          • optionFile: file containing all config information
  • Pre-Processing Data
      • public static void preprocessData()
        throws hxmlq.HXMLException
  • Indexing Data
      • public static void indexLore()
        throws hxmlq.HXMLException

45
Top Level API (Cont.)
  • Mapping Query To Data
      • public static void loadQuery(String query, int mode)
        throws hxmlq.HXMLException
      • Parameters
          • query: string containing XML data or the file
            name
          • mode: 0 for string input and 1 for file input
  • Generating Database Queries
      • public static void generateLoreQueries(String
        operator, int mode) throws hxmlq.HXMLException
      • Parameters
          • operator: string containing the operator info
            or the file name
          • mode: 0 for string input and 1 for file input

46
Top Level API (Cont.)
  • Obtaining Database Results
      • public static void obtainLoreResults()
        throws hxmlq.HXMLException
  • Translating Document
      • public static void translateDoc()
        throws hxmlq.HXMLException

47
System Configuration
48
System Configuration
49
System Configuration
50
Pre-Processing/Indexing
51
Querying/Translation Menu
52
Loading the Query/Operators
53
Mapping Query To Data
54
Translating Documents
55
Generate Database Queries
56
Obtain Database Results
57
Help Menu
58
System Evaluation
  • Data Set
  • 1077 documents (News Articles, Business Forms,
    Resumes etc.) covering different topics
  • 80 DTDs
  • 50 queries based on the data
  • Sources: HNC/ATS NER Corpus, www.editml.com,
    www.xml.org, www.xcbl.org

59
System Evaluation (Cont.)
  • Performance Measures
  • Quality
  • Precision - Recall Curves
  • Speed
  • Pre-processing and Query Mapping Time (machine
    dependent)
  • Storage
  • Number of DTD nodes and word groups, pairs etc.
  • Robustness
  • Performance with different similarities and tag
    weights
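
Points on a precision-recall curve can be obtained by cutting the ranked result list at successive depths. The sketch below shows one such point; the class name and the convention of returning 0.0 for an empty cut-off or an empty relevant set are assumptions.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch: precision and recall for the top-k ranked results.
class PrecisionRecall {
    // Returns {precision, recall} for the first k entries of the ranking.
    static double[] atK(List<String> ranked, Set<String> relevant, int k) {
        int limit = Math.min(k, ranked.size());
        int hits = 0;
        for (int i = 0; i < limit; i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
            }
        }
        double precision = limit == 0 ? 0.0 : (double) hits / limit;
        double recall = relevant.isEmpty() ? 0.0 : (double) hits / relevant.size();
        return new double[] { precision, recall };
    }
}
```

Evaluating atK for k = 1, 2, ... over each query's ranking yields the points that make up the curve.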

60
Precision-Recall Curves
61
Time/Storage Statistics
 
62
Robustness
  • Similarity Measure
      • (1 - α) x Avg. Semantic Similarity + α / (1 + tree
        edit distance)
      • α is varied
  • Importance Weights of Query Tags
      • Uniform
      • Based on Tag Similarities
      • Based on Hierarchy
  • Precision/Recall is not much affected; the ranking
    changes
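
The combined similarity measure can be written as a one-line function. The sketch reads the measure as a weighted sum of the average semantic similarity and 1 / (1 + tree edit distance), with a mixing weight called alpha here; the plus signs, the parameter name, and the restriction of alpha to [0, 1] are assumptions.

```java
// Hypothetical sketch of the combined (hybrid) similarity measure.
class HybridScore {
    // alpha = 0 uses semantic similarity only; alpha = 1 uses structure only.
    static double score(double alpha, double avgSemanticSimilarity, double treeEditDistance) {
        if (alpha < 0 || alpha > 1) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        return (1 - alpha) * avgSemanticSimilarity + alpha / (1 + treeEditDistance);
    }
}
```

A robustness check of the kind described above would recompute the ranking for several values of alpha and compare the resulting precision and recall.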

63
Future Work
  • Extend the approach to XML Schemas (instead of
    DTDs) and handle nested elements
  • Develop criteria for picking thresholds based on
    overall coverage and accuracy
  • Improve the word similarity measures (instead of
    using simple rules)
  • Define a query language for specifying the query
    object with operators and also incorporate
    regular expression matching
  • Complete the interface with Oracle DBMS both
    on Unix and PC Platforms.
  • Incorporate rule based techniques into
    Translation and extend it to include entities

64
Conclusions
  • Technique for querying heterogeneous XML data
    and translating data across DTDs
  • Tree Matching and Hybrid Matching Algorithms
    consider both structural and semantic similarity
  • The current system is interfaced with Lore DBMS
    and it delivers reasonably good results on the
    evaluation data set.
  • Requires some extensions to work with real life
    data
