Provided by: islabK
Title: Indexing Methods for Efficient XML Query Processing


1
Indexing Methods for Efficient XML Query
Processing
  • Jun-Ki Min
  • KAIST
  • http://islab.kaist.ac.kr/jkmin/

2
XML
  • eXtensible Markup Language
  • The de facto standard for data representation and exchange on
    the Web
  • XML Data
  • An instance of semistructured data
  • self-describing
  • irregularly structured

3
XML Data
  • Comprises hierarchically nested collections of
    elements
  • An element can contain
  • An atomic data value
  • A sequence of subelements
  • attributes composed of name-value pairs
  • ID-IDREF relationship
  • Tree or Graph representation

4
XML Example
<libraryDB>
  <book editor="1">
    <title> title1 </title>
    <author> author1 </author>
    <chapter> </chapter>
  </book>
  <paper>
    <title> title2 </title>
    <author id="1"> author2 </author>
    <author> author3 </author>
    <section> </section>
  </paper>
</libraryDB>
[Slide labels: ToXin, Index Fabric, APEX]

5
XML Query
  • XML Query Language
  • XSLT, XML-QL, XPath, XQuery
  • Use path expressions to traverse the irregularly
    structured data
  • ex) /libraryDB/book/title or //title
  • Searching the whole XML data => inefficiency
  • Structural Summary, Path Index
  • Improve efficiency by restricting the search to only the
    relevant portions of the XML data
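
To make the two example path expressions concrete, here is a minimal sketch that evaluates them by scanning the whole document, which is exactly the cost a path index tries to avoid. It uses Python's standard `xml.etree.ElementTree`; the XML string is a reconstruction of the deck's running example (attribute quoting and the empty `chapter`/`section` elements are assumptions). ElementTree paths are relative to the root element, so `/libraryDB/book/title` is written as `./book/title`.

```python
import xml.etree.ElementTree as ET

# Reconstruction of the running example; attribute quoting is assumed.
LIBRARY_XML = """
<libraryDB>
  <book editor="1">
    <title>title1</title>
    <author>author1</author>
    <chapter/>
  </book>
  <paper>
    <title>title2</title>
    <author id="1">author2</author>
    <author>author3</author>
    <section/>
  </paper>
</libraryDB>
"""

root = ET.fromstring(LIBRARY_XML.strip())

# /libraryDB/book/title : a full path from the root.
full_path_titles = [t.text for t in root.findall("./book/title")]

# //title : a partial-matching path; every element must be visited.
any_depth_titles = [t.text for t in root.findall(".//title")]

print(full_path_titles)
print(any_depth_titles)
```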

6
Schemas for XML
  • DTD, XML Schema
  • Specify the constraints on XML data
  • <!ELEMENT book (title, author, chapter)>
  • Schemas are not mandatory
  • => lack of an external schema
  • Structural Summary
  • Summary of label paths
  • Path Index
  • Structural Summary + Extents

7
Schemas for XML
  • Applications
  • User Interface
  • XML Data Design, Editing
  • Query Formulation
  • Query Validation
  • Query Optimization
  • Path Index

8
Structural Summary
  • DTD Extraction
  • XTRACT
  • based on element information
  • Structural Summary
  • Representative Objects
  • based on path information

9
XTRACT
  • Garofalakis, Gionis, Rastogi, Seshadri, Shim
    SIGMOD 00
  • Infers a concise and accurate DTD
  • Chooses a DTD from candidate DTDs
  • (a b), (b a) => (a|b)* or (a b)|(b a)
  • Based on the Minimum Description Length (MDL)
    principle
  • Ranks each candidate DTD by the number of bits
    required to describe the subelement sequences in
    terms of that candidate DTD
  • (a|b)*: 6 (for DTD) + 3 + 3 = 12
  • (a b)|(b a): 9 (for DTD) + 1 + 1 = 11
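
The ranking above can be sketched as a small cost comparison. This is only an illustration of the MDL idea: the candidate expressions `(a|b)*` and `(a b)|(b a)` are a reading of the garbled slide text, and the per-DTD and per-sequence costs (6, 3, 3 and 9, 1, 1) are copied from the slide rather than derived from XTRACT's actual encoding scheme.

```python
def mdl_cost(dtd_cost, sequence_costs):
    """Total description length: cost of the DTD plus the cost of
    encoding every input sequence in terms of that DTD."""
    return dtd_cost + sum(sequence_costs)

# Candidate DTDs for the input sequences (a b) and (b a).
# Costs are taken from the slide, not computed here.
candidates = {
    "(a|b)*": (6, [3, 3]),       # short DTD, expensive sequences
    "(a b)|(b a)": (9, [1, 1]),  # longer DTD, cheap sequences
}

costs = {dtd: mdl_cost(*parts) for dtd, parts in candidates.items()}
best = min(costs, key=costs.get)

print(costs)
print(best)
```

The more specific DTD wins because the extra cost of writing it down is smaller than what it saves when describing the data.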

10
Representative Objects(RO)
  • Nestorov, Ullman, Wiener, Chawathe ICDE 97
  • Provide a concise representation of the inherent
    schema of semistructured hierarchical data
  • Full-RO
  • Describes all simple paths
  • k-RO
  • A k-RO guarantees that its paths of length ≤ k+1
    exist in the data.
  • 1-RO
  • Simplest, very compact representation

11
Representative Objects(RO)
12
Path Index
  • Access Support Relations
  • Deterministic
    • Strong DataGuide
    • Index Fabric
    • ToXin
    • APEX
  • Non-Deterministic
    • 1-Index
    • A(k)-Index
    • FB-Index

13
Access Support Relations
  • Kemper, Moerkotte IS 92
  • Originated from OODBMS
  • select Name
  • from Mercedes.Manufactures.Composition.Division
  • Support joins along arbitrary reference chains
  • Generalization of the Join Index [Valduriez 87]
  • Based on the paths in the schema
  • Materialize access paths of arbitrary length
  • Support only predefined subsets of paths.

14
DataGuides
  • Goldman, Widom VLDB 97
  • An implementation of the Full-RO
  • Summary of label paths from the root (= simple
    paths)
  • Concise: describes every unique simple path
    exactly once, regardless of the number of times
    it appears
  • Accurate: does not contain label paths that do not
    appear in the data
  • Convenient: can be stored and accessed using
    techniques similar to those available for
    processing semistructured data

15
DataGuides
  • The construction algorithm emulates the conversion
    of a non-deterministic finite automaton (NFA) to a
    deterministic finite automaton (DFA)
  • Intuitively, a simple path is represented as a
    node in the DataGuide
  • One XML data instance may have multiple DataGuides

16
Strong DataGuide
  • If two simple paths reach the same set of nodes,
    they are represented by a single node.
  • Linear time and linear space for tree-structured
    data
  • Exponential time and exponential space (worst
    case) for graph-structured data
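
The NFA-to-DFA analogy can be sketched as a subset construction over an edge-labelled graph: each DataGuide node is the set of data nodes reached by some label path (its target set), and two paths with the same target set share one node. This is a minimal illustration, not the published algorithm; the function name `strong_dataguide` and the node numbering in the example are assumptions.

```python
from collections import defaultdict

def strong_dataguide(edges, root):
    """Subset construction: map each target set (frozenset of data
    nodes) to its outgoing {label: target set} transitions."""
    children = defaultdict(list)
    for src, label, dst in edges:
        children[src].append((label, dst))

    start = frozenset([root])
    guide = {}
    pending = [start]
    while pending:
        targets = pending.pop()
        if targets in guide:      # target set already has a guide node
            continue
        by_label = defaultdict(set)
        for node in targets:
            for label, dst in children.get(node, []):
                by_label[label].add(dst)
        guide[targets] = {l: frozenset(d) for l, d in by_label.items()}
        pending.extend(guide[targets].values())
    return start, guide

# Hypothetical numbering of a small library document.
edges = [
    (1, "book", 2), (1, "paper", 6),
    (2, "title", 3), (2, "author", 4),
    (6, "title", 7), (6, "author", 8), (6, "author", 9),
]
start, guide = strong_dataguide(edges, root=1)
paper_authors = guide[frozenset({6})]["author"]
print(len(guide), sorted(paper_authors))
```

Both `author` children of the paper collapse into one guide node, so the label path `paper.author` is described exactly once.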

17
1/2/T-Index
  • Milo and Suciu ICDT 99
  • 1-Index
  • Summarizes all label paths starting from the root
  • Supports queries of the form $q = P\,x$, where
    $P = /l_1/l_2/\ldots/l_n$
  • Non-deterministic
  • Based on backward bisimulation, which originates
    from graph verification
  • Extents are disjoint
  • More compact than the Strong DataGuide

18
1-Index
  • Equivalence relation ($\equiv$)
  • $v \equiv u$ iff $L_v = L_u$
  • where $L_x = \{\, w \mid w \text{ is a simple path from the root to } x \,\}$
  • Index nodes: the collection of all equivalence
    classes
  • Exponential construction cost
  • Backward bisimulation ($\approx_b$)
  • If $x \approx_b y$ and $x$ is the root, then $y$ is
    the root
  • Conversely, if $x \approx_b y$ and $y$ is the root,
    then $x$ is the root
  • If $x \approx_b y$ and $(x', l, x)$ is an edge, then
    there exists an edge $(y', l, y)$ such that
    $x' \approx_b y'$
  • Conversely, if $x \approx_b y$ and $(y', l, y)$ is an
    edge, then there exists an edge $(x', l, x)$ such
    that $x' \approx_b y'$

19
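$\equiv$ vs $\approx_b$
[Figure: example graph with edges labelled a, b, c, d and nodes X and Y]
  • $X \equiv Y$ since $L_X = L_Y = \{a.b.d,\ a.c.d\}$
  • $X \not\approx_b Y$
  • $v \approx_b u \Rightarrow v \equiv u$
  • $O(m \log m)$ construction cost [Paige and Tarjan 87]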
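
The difference between the two relations can be checked on a small graph. The sketch below assumes an edge-labelled DAG shaped like the slide's example (the exact figure is not recoverable, so the node names `r`, `n1`, `P`, `Q1`, and so on are hypothetical). It uses naive iterated refinement rather than the Paige and Tarjan algorithm, so it does not achieve the O(m log m) bound.

```python
from collections import defaultdict

def label_paths(edges, root):
    """L_x for every node x of an acyclic edge-labelled graph."""
    incoming = defaultdict(list)
    nodes = {root}
    for src, label, dst in edges:
        incoming[dst].append((src, label))
        nodes.update((src, dst))
    memo = {root: {()}}
    def paths(node):
        if node not in memo:
            memo[node] = {w + (label,)
                          for parent, label in incoming[node]
                          for w in paths(parent)}
        return memo[node]
    return {n: paths(n) for n in nodes}

def backward_bisimulation(edges, root):
    """Coarsest backward bisimulation by naive partition refinement."""
    incoming = defaultdict(list)
    nodes = {root}
    for src, label, dst in edges:
        incoming[dst].append((src, label))
        nodes.update((src, dst))
    block = {n: int(n == root) for n in nodes}  # root is its own class
    while True:
        # Split by (current block, set of (label, parent block) pairs).
        sig = {n: (block[n],
                   frozenset((l, block[p]) for p, l in incoming[n]))
               for n in nodes}
        ids = {}
        refined = {n: ids.setdefault(s, len(ids)) for n, s in sig.items()}
        if len(ids) == len(set(block.values())):
            return refined
        block = refined

edges = [
    ("r", "a", "n1"), ("r", "a", "n2"),
    ("n1", "b", "P"), ("n2", "c", "P"), ("P", "d", "X"),
    ("r", "a", "m1"), ("r", "a", "m2"),
    ("m1", "b", "Q1"), ("m2", "c", "Q2"),
    ("Q1", "d", "Y"), ("Q2", "d", "Y"),
]
L = label_paths(edges, "r")
B = backward_bisimulation(edges, "r")
print(L["X"] == L["Y"], B["X"] == B["Y"])
```

X and Y have the same label-path sets, yet they are not backward-bisimilar: X's only parent is reached by both `b` and `c`, while each parent of Y is reached by just one of them.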

20
1-Index vs Strong DataGuide
  • For tree-structured data, the strong DataGuide and
    the 1-Index coincide

21
2/T-Index
  • 2-Index
  • Supports queries of the form $x_1\,P\,x_2$
  • ex) //title
  • Equivalence relation ($\equiv$)
  • $(v, u) \equiv (v', u')$ iff $L_{(v,u)} = L_{(v',u')}$
  • where $L_{(x,y)} = \{\, w \mid w \text{ is a label path from } x \text{ to } y \,\}$
  • Summary of path information between two arbitrary
    nodes
  • T-Index
  • Generalization of the 1-Index and 2-Index
  • $(v_1, \ldots, v_n) \equiv (u_1, \ldots, u_n)$ iff
    $L_{(v_1, \ldots, v_n)} = L_{(u_1, \ldots, u_n)}$
  • Conceptually similar to Access Support Relations
  • Supports only predefined paths

22
Index Fabric
  • Cooper, Sample, Franklin, Hjaltason, Shadmon,
    VLDB 01
  • Tree Structured Data
  • Conceptually similar to the strong DataGuide
  • Layered structure
  • Uses a Patricia trie to index a large number of
    search keys
  • The simple path of an element that has a data
    value is encoded as a special character sequence
  • Stores keys that combine the encoded sequence with
    the data value.

23
Index Fabric
[Figure: XML data]
  • Keeps only the information of elements that have
    data values
  • Patricia trie: lossy compression
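
The key encoding can be sketched as follows. Each tag gets a one-character designator, and every element with a data value yields one key made of the designators along its path plus the value. The designator assignment (letters in first-seen order), the `|` separator, and the use of a sorted list with binary search in place of a Patricia trie are all assumptions of this sketch, not details of Index Fabric itself.

```python
import bisect
import string
import xml.etree.ElementTree as ET

LIBRARY_XML = """
<libraryDB>
  <book><title>title1</title><author>author1</author><chapter/></book>
  <paper><title>title2</title><author>author2</author>
         <author>author3</author><section/></paper>
</libraryDB>
"""

def build_keys(root, sep="|"):
    """Return (tag -> designator, sorted keys) for valued elements."""
    designators, keys = {}, []
    def walk(elem, prefix):
        if elem.tag not in designators:  # sketch supports up to 26 tags
            designators[elem.tag] = string.ascii_uppercase[len(designators)]
        encoded = prefix + designators[elem.tag]
        value = (elem.text or "").strip()
        if value:                        # only elements with data values
            keys.append(encoded + sep + value)
        for child in elem:
            walk(child, encoded)
    walk(root, "")
    keys.sort()
    return designators, keys

def prefix_search(keys, prefix):
    """All keys starting with prefix, via binary search on sorted keys."""
    hits = []
    for key in keys[bisect.bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break
        hits.append(key)
    return hits

designators, keys = build_keys(ET.fromstring(LIBRARY_XML.strip()))
path = "".join(designators[t] for t in "libraryDB/paper/author".split("/"))
paper_authors = prefix_search(keys, path + "|")
print(keys)
print(paper_authors)
```

A path query becomes a prefix lookup over string keys, which is why a trie over the keys is a natural fit.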

24
ToXin
  • Rizzolo, Mendelzon WebDB 01
  • Tree Structured Data
  • Conceptually similar to the strong DataGuide (not
    the minimal DataGuide)
  • Supports both forward and backward navigation
  • Path Tree (= strong DataGuide)
  • Each node of the Path Tree has an Index Table or a
    Value Table
  • Index Table (IT): parent-child relationships
  • Value Table (VT): owner-value relationships

25
ToXin
[Figure: XML data]
  • Index Tables

    path             parent  child
    libraryDB        null    1
    libraryDB.book   1       2
    libraryDB.paper  1       6

  • Value Tables

    path                   parent  value
    libraryDB.book.author  ?       author1

  • Since ToXin keeps parent-child relationships, it
    supports path expressions with value predicates
  • ex) /libraryDB/book[author = "author1"]
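
A minimal sketch of the two table kinds, assuming preorder node numbering starting at 1 (which reproduces the parent/child pairs listed above). The function names and the choice to key value tables by the owning element's own id are assumptions; the original value-table column is not recoverable from the transcript.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import defaultdict

LIBRARY_XML = """
<libraryDB>
  <book><title>title1</title><author>author1</author><chapter/></book>
  <paper><title>title2</title><author>author2</author>
         <author>author3</author><section/></paper>
</libraryDB>
"""

def build_tables(root):
    """Index tables: path -> [(parent id, child id)].
    Value tables: path -> [(owner id, value)]."""
    index_tables = defaultdict(list)
    value_tables = defaultdict(list)
    ids = itertools.count(1)
    def walk(elem, parent_id, parent_path):
        node_id = next(ids)
        path = elem.tag if parent_path is None else f"{parent_path}.{elem.tag}"
        index_tables[path].append((parent_id, node_id))
        text = (elem.text or "").strip()
        if text:
            value_tables[path].append((node_id, text))
        for child in elem:
            walk(child, node_id, path)
    walk(root, None, None)
    return index_tables, value_tables

def parents_with_child_value(index_tables, value_tables, child_path, value):
    """Backward step: find owners of the value, then their parents."""
    owners = {o for o, v in value_tables[child_path] if v == value}
    return [p for p, c in index_tables[child_path] if c in owners]

it, vt = build_tables(ET.fromstring(LIBRARY_XML.strip()))
# /libraryDB/book[author = "author1"]
books = parents_with_child_value(it, vt, "libraryDB.book.author", "author1")
print(it["libraryDB.book"], it["libraryDB.paper"], books)
```

The predicate is answered by a value-table lookup followed by one backward step through the index table, without touching the document.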

26
A(k)-Index
  • Kaushik, Shenoy, Bohannon, Gudes ICDE 02
  • The Strong DataGuide and the 1-Index record all
    simple paths
  • Increased index size => increased search space
  • Approximation of the 1-Index
  • Non-deterministic
  • Utilizes local similarity (degree k)
  • Reduces the size of the index graph

27
A(k)-Index
  • k-bisimulation ($\approx^k$)
  • For any two nodes $v$ and $u$, $v \approx^0 u$ iff
    $u$ and $v$ have the same label
  • $v \approx^k u$ iff $v \approx^{k-1} u$ and for every
    parent $v'$ of $v$ there is a parent $u'$ of $u$
    such that $v' \approx^{k-1} u'$, and vice versa
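
The definition translates directly into k rounds of refinement over a node-labelled graph. This is a sketch under assumed names (`k_bisimulation`, `classes`) and a made-up seven-node example; it shows how a larger k separates nodes whose ancestors differ further up.

```python
from collections import defaultdict

def k_bisimulation(labels, parents, k):
    """Return node -> class signature under k-bisimulation.
    Round 0 groups by label; each further round also compares the
    classes of the parents from the previous round."""
    block = dict(labels)
    for _ in range(k):
        block = {
            node: (block[node],
                   frozenset(block[p] for p in parents.get(node, ())))
            for node in labels
        }
    return block

def classes(block):
    """Group nodes by signature, as a sorted list of sorted node lists."""
    groups = defaultdict(list)
    for node, sig in block.items():
        groups[sig].append(node)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical data graph: root -> a -> c -> d and root -> b -> c -> d.
labels = {1: "root", 2: "a", 3: "b", 4: "c", 5: "c", 6: "d", 7: "d"}
parents = {2: [1], 3: [1], 4: [2], 5: [3], 6: [4], 7: [5]}

for k in range(3):
    print(k, classes(k_bisimulation(labels, parents, k)))
```

With k = 0 both `c` nodes and both `d` nodes share a class; k = 1 separates the `c` nodes; only k = 2 separates the `d` nodes, so the index grows as k increases.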

28
A(k)-Index
  • Building cost: $O(km)$
  • In general, for the 1-Index, $k < \log m$
  • Query Processing
  • Label path expressions of length ≤ k+1
  • Precise
  • Label path expressions of length > k+1
  • Safe, but may include false results
  • Validation => requires scanning the data

29
APEX: Adaptive Path indEx for XML Data
  • Chung, Min, Shim SIGMOD 02
  • The Strong DataGuide and the 1-Index keep all
    simple paths
  • Users issue partial-matching path queries
  • //book/title
  • Exhaustive navigation of the index structure for
    partial-matching path queries may degrade
    performance

30
APEX
  • Deterministic
  • Approximation of DataGuides
  • Efficient processing of partial-matching path
    queries
  • Workload-aware
  • Self-tuning strategies [Chaudhuri et al. 00]
  • Utilizes the query workload
  • Builds APEX from both the XML data and frequently
    used paths
  • Sequential pattern mining [Agrawal and Srikant 95]

31
APEX
Frequently used paths: book.title

    node  extent
    0     <null,0>
    1     <0,1>
    2     <1,2>
    3     <1,6>
    4     <2,4>, <6,8>, <6,9>
    5     <2,5>
    6     <6,10>
    7     <2,8>
    8     <2,3>
    9     <6,7>

  • Hash Tree
  • Keeps frequently used paths
  • Prevents the exhaustive search
  • Graph Structure
  • Structural summary + extents
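
How frequently used paths shape the extents can be sketched like this. Each edge is filed under the longest suffix of its root path that is a frequently used path, or under its last label otherwise. This is a simplified illustration on tree data only: a plain dictionary stands in for the hash tree, the ID-IDREF edge and the summary graph are omitted, and the function names are assumptions.

```python
import itertools
import xml.etree.ElementTree as ET
from collections import defaultdict

LIBRARY_XML = """
<libraryDB>
  <book><title>title1</title><author>author1</author><chapter/></book>
  <paper><title>title2</title><author>author2</author>
         <author>author3</author><section/></paper>
</libraryDB>
"""

def build_extents(root, frequent_paths):
    """Map each required path (tuple of labels) to its edge extent."""
    frequent = {tuple(p.split(".")) for p in frequent_paths}
    extents = defaultdict(list)
    ids = itertools.count(1)
    def walk(elem, parent_id, parent_labels):
        node_id = next(ids)
        labels = parent_labels + (elem.tag,)
        key = labels[-1:]                       # default: last label only
        for n in range(len(labels), 1, -1):     # longest suffix first
            if labels[-n:] in frequent:
                key = labels[-n:]
                break
        extents[key].append((parent_id, node_id))
        for child in elem:
            walk(child, node_id, labels)
    walk(root, 0, ())                           # 0 = virtual document root
    return frequent, extents

def lookup(frequent, extents, query):
    """Answer //l1/.../ln for a single label or a frequently used path."""
    q = tuple(query.strip("/").split("/"))
    if len(q) > 1 and q not in frequent:
        raise NotImplementedError("needs navigation of the summary graph")
    return sorted(e for key, edges in extents.items()
                  if key[-len(q):] == q for e in edges)

frequent, extents = build_extents(ET.fromstring(LIBRARY_XML.strip()),
                                  ["book.title"])
print(lookup(frequent, extents, "//book/title"))
print(lookup(frequent, extents, "//title"))
```

Because `book.title` has its own entry, `//book/title` is answered by a single lookup instead of a walk over the whole summary.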

32
FB Index
  • Kaushik, Bohannon, Naughton, Korth SIGMOD 02
  • Supports twig (branching) path expressions
  • /A/BC
  • Basic Idea
  • For every edge $e$ labelled $l$ from $v$ to $u$, add
    an (inverse) edge $e^{-1}$ with label $l^{-1}$ from
    $u$ to $v$
  • Then compute the 1-Index on this modified graph.
  • Very large index space
  • Apply some heuristics
  • Exploiting local similarity: k-bisimulation

[Figure: edges labelled A, B, and C^-1]

33
Discussion
  • Path Index
  • Improves query performance by restricting the
    search space
  • Can be applied to various applications
  • Selectivity Estimation
  • QBE(Query By Example)
  • Future Work
  • Support twig queries
  • Query Optimization
  • cost formula of path index

34
Thank You!
  • Any Question?
  • http://islab.kaist.ac.kr/jkmin
  • jkmin@islab.kaist.ac.kr

35
Reference
  • C. Chung, J. Min and K. Shim, APEX: An Adaptive
    Path Index for XML Data, SIGMOD 02
  • B. Cooper, N. Sample, M. Franklin, G. Hjaltason
    and M. Shadmon, A Fast Index for Semistructured
    Data, VLDB 01
  • M. Garofalakis, A. Gionis, R. Rastogi, S.
    Seshadri and K. Shim, XTRACT: A System for
    Extracting Document Type Descriptors from XML
    Documents, SIGMOD 00
  • R. Goldman and J. Widom, DataGuides: Enabling
    Query Formulation and Optimization in
    Semistructured Databases, VLDB 97
  • R. Kaushik, P. Bohannon, J. Naughton and H.
    Korth, Covering Indexes for Branching Path
    Queries, SIGMOD 02
  • R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes,
    Exploiting Local Similarity for Indexing Paths
    in Graph-Structured Data, ICDE 02
  • A. Kemper and G. Moerkotte, Access Support
    Relations: An Indexing Method for Object Bases,
    Information Systems 92
  • T. Milo and D. Suciu, Index Structures for Path
    Expressions, ICDT 99
  • S. Nestorov, J. Ullman, J. Wiener and S.
    Chawathe, Representative Objects: Concise
    Representations of Semistructured, Hierarchical
    Data, ICDE 97
  • F. Rizzolo and A. Mendelzon, Indexing XML Data
    with ToXin, WebDB 01
  • R. Paige and R. Tarjan, Three Partition
    Refinement Algorithms, SIAM Journal on Computing
    87
  • P. Valduriez, Join Indices, TODS 87