Flexible Querying of XML Documents - PowerPoint PPT Presentation

1 / 36
About This Presentation
Title:

Flexible Querying of XML Documents

Description:

... APIs in ... SAXParser APIs. 23. Mapping to Lucene. XML documents to Lucene ... Implemented using Lucene 2.0 APIs. Indexing: Time and Space Intensive. But ... – PowerPoint PPT presentation

Number of Views:64
Avg rating:3.0/5.0
Slides: 37
Provided by: TKPr6
Learn more at: http://cecs.wright.edu
Category:

less

Transcript and Presenter's Notes

Title: Flexible Querying of XML Documents


1
Flexible Querying of XML Documents
  • Krishnaprasad Thirunarayan and Trivikram Immaneni
  • Department of Computer Science and Engineering
  • Wright State University
  • Dayton, OH-45435, USA

2
Talk Outline
  • Goal (What?)
  • Background and Motivation (Why?)
  • Query Language and Examples (What?)
  • Implementation Details (How?)
  • Evaluation and Applications (Why?)
  • Conclusions

3
Goal
4
  • Develop a keyword-based XML Query Language and
    its Semantics that is
  • flexible and sufficiently expressive
  • easy to use (for query formulation)
  • Implement, reusing mature software components,
    for efficient indexing and search

5
Background and Motivation
6
XML vs Text Documents
  • DATA Exploit metadata/markup and aggregation
    structure implicit in XML documents
  • For expressiveness and precision
  • QUERY Obtain progressively improved extractions
    using convenient keyword-based queries in
    contrast with accurate extractions using complex
    XML-based queries

7
Relationship to Other Work
  • Extends XSEarch (Cohen et al)
  • Expressive power Incorporates attributes and
    their values
  • Equivalence (E.g., RDF)
  • ltT A"s"/gt
  • vs
  • ltTgt ltAgt s lt/Agt lt/Tgt

8
  • Invariance under Refinement
  • ltTgt ltAgt word_1 and word_2 lt/Agt lt/Tgt
  • vs
  • ltTgt ltAgt
  • ltBgt word_1 lt/Bgt
  • and
  • ltCgt word_2 lt/Cgt
  • lt/Agt lt/Tgt

9
Coherence
  • Interconnectedness Infer related pieces of
    information using aggregation implicit in XML
  • Cohen et al Name equivalence
  • XSEarch
  • Li et al Structural equivalence
  • Scheme-free XML
  • Guo et al Completeness
  • XRANK

10
  • Information Retrieval
  • Explore robust relevance ranking strategy to deal
    with high recall
  • Variation on TFIDF
  • Naïve implementation computationally prohibitive
  • Extension beyond type-delimited-document
    unclear

11
Query Language and Examples (What?)
12
Query Syntax
  • Entity-Attribute-Keyword
  • Search Terms
  • eak
  • ea, ak, ek
  • e, a, k
  • Signed/optional Search Terms
  • eak vs eak

13
Single Search Term Satisfaction
  • eak
  • The search term eak is satisfied by a tree
    containing a subtree with the top element e that
    is associated with the attribute a with value
    containing k, or a subelement a with descendant
    text node containing k.

14
Example (Mondial)
  • ltcountry id"f0_149" name"Austria"
    capital"f0_1467" population"8023244"
  • datacode"AU" total_area"83850"
    population_growth"0.41"
  • infant_mortality"6.2" ...
    government"federal republic" ...gt ...
  • lt/countrygt
  • nameVienna is satisfied by
  • ltprovince id"f0_17447" name"Vienna" ...gt
  • ltcity id"f0_1467" country"f0_149"
    province"f0_17447" ...gt
  • ltnamegtViennalt/namegt ltpopulation
    year"94"gt1583000lt/populationgt lt/citygt
  • lt/provincegt ...
  • nameVienna is satisfied by a part of it
  • ltnamegtViennalt/namegt.

15
Example (Heterogeneity)
  • ltauthorgtltnamegtAdam Dinglelt/namegtlt/authorgt
  • ltauthor name"A. Dingle" gtlt/authorgt
  • ltarticle id"3"gt
  • _at_inproceedingsIMN97,
  • author"Adam Dingle and Ed MacNair and
    Thao Nguyen",
  • lt/articlegt
  • authornameDingle misses the last one.

16
Query Answer Candidate
  • Query Answer Candidate for the query
    Q(t_1,t_2,...,t_m), is a
  • Most preferred satisfying collection of trees
    (P_1,P_2,...,P_m)
  • Precise smallest enclosing
  • Adequate optional search terms satisfied as
    much as possible

17
Query Answer
  • Query Answer for the query Q(t_1,t_2,...,t_m), is
    a
  • Query Answer Candidate (P_1,P_2,...,P_m) in
    which
  • Trees P_is are Interconnected
  • Specifies trees related to the same real-world
    entity

18
Interconnectedness (Cohen et al)
  • Two subtrees T_a and T_b are said to be
    interconnected if the path from their roots to
    the lowest common ancestor does not contain two
    distinct nodes with the same element, or the only
    distinct nodes with the same element are these
    roots.

19
Interconnectedness (Li et al)
  • Two subtrees T_a and T_b are said to be
    interconnected, if the path from T_a's root to
    their lowest common ancestor in the tree does not
    contain another node that is the lowest common
    ancestor of T_a and a distinct subtree T_b ,
    where T_b' has the same root element label as T_b.

20
Interconnectedness (Two Approaches)
21
Implementation Details (How?)
22
Tools Used
  • Apache Lucene 2.0 APIs in Java
  • A high-performance, text search engine library
    with smart indexing strategies.
  • Further tuned for memory-centric operation in
    contrast with disk-centric defaults
  • SAXParser APIs

23
Mapping to Lucene
  • XML documents to Lucene documents for indexing
  • XML keyword-based queries to Lucene queries for
    searching
  • ENCODING
  • XML fragment of an XML document is referred to
    internally using the filename and the XPath (of
    the XML fragment's root from the XML document
    root)

24
Evaluation and Application (Why?)
25
Experiments
  • DATASETs Sigmod, Mondial, and DBLP.
  • PLATFORMS
  • For Sigmod and Mondial datasets HP xw9300
    Workstation with 2 GHz AMD Opteron dual-core
    processor (270), 4 GB of main memory, and 250 GB
    7200 rpm hard drive, running 32-bit Windows XP.
  • (java -Xms750M -Xmx1500M).
  • For DBLP dataset SUN Ultra-40 Workstation with
    2.4 GHz dual AMD Opteron dual-core processor
    (280), 8GB of main memory, and 250GB 7500 rpm
    hard drive, running 64-bit Solaris 10.
  • (java -Xms1000M -Xmx3600M).

26
Dataset Sizes
27
Dataset Indexing via Lucene
28
Query Answer Computation Time vs Display Time
29
More Subtle Example
  • In Extended Paper
  • A Coherent Keyword-Based XML Query Language

30
Pubs.xml
  • ltpublicationsgt
  • - ltbookgt
  •   lttitlegtModern Information Retrievallt/titlegt
  •   ltauthorgtRicardo Baeza-Yateslt/authorgt  
    ltauthorgtBerthier Ribeiro-Netolt/authorgt
  • - ltchaptergt
  •   lttitlegtDigital Librarieslt/titlegt
  •   ltauthorgtEdward A. Foxlt/authorgt  
    ltauthorgtOhm Sornillt/authorgt
  •   lt/chaptergt
  •   lt/bookgt
  • - ltarticlegt
  •   lttitlegtThe Anatomy of a Large-Scale
    Hypertextual Web Search Enginelt/titlegt
  •   ltauthorgtSergey Brinlt/authorgt  
    ltauthorgtLawrence Pagelt/authorgt
  •   lt/articlegt
  • - ltarticlegt
  •   lttitlegtAn Algorithm for Suffix
    Strippinglt/titlegt
  •   ltauthorgtM.F.Porterlt/authorgt
  •   lt/articlegt
  • - ltarticlegt
  • lttitlegtIndexing by Latent Semantic
    Analysislt/titlegt  

31
Characteristics of Pubs.xml
  • Total number of authors 7
  • Total number of titles 5
  • Title Distribution
  • 1 book (with 1 chapter) 3 articles
  • Author Distribution
  • 2 ( 2 ) 2 1 0

32
Queries to Pubs.xml (Answer counts)
  • Arbitrary mix and match of authors and titles 7
    5 35
  • author, title (8 hits)
  • author, title (7 hits)
  • author, title, author (4 hits)

33
Completeness ( pWord, qWord) (Guo et al)
34
Conclusions
35
  • Developed declarative semantics for keyword-based
    XML Query language with an effective query
    answering algorithm
  • Developed a notion of interconnectedness that
    provides coherent answers
  • Implemented using Lucene 2.0 APIs
  • Indexing Time and Space Intensive
  • But
  • Query Answering Quick

36
  • THANK YOU!
Write a Comment
User Comments (0)
About PowerShow.com