Title: Indexing Methods for Efficient XML Query Processing
1 Indexing Methods for Efficient XML Query Processing
- Jun-Ki Min
- KAIST
- http://islab.kaist.ac.kr/jkmin/
2 XML
- eXtensible Markup Language
- The de facto standard for data representation and exchange on the Web
- XML Data
- An instance of semistructured data
- self-describing
- irregularly structured
3 XML Data
- Comprises hierarchically nested collections of elements
- An element can contain
- An atomic data value
- A sequence of subelements
- Attributes composed of name-value pairs
- ID-IDREF relationships
- Tree or graph representation
4 XML Example
<libraryDB>
  <book editor="1">
    <title> title1 </title>
    <author> author1 </author>
    <chapter> </chapter>
  </book>
  <paper>
    <title> title2 </title>
    <author id="1"> author2 </author>
    <author> author3 </author>
    <section> </section>
  </paper>
</libraryDB>
[Figure labels: ToXin, Index Fabric, APEX]
5 XML Query
- XML Query Languages
- XSLT, XML-QL, XPath, XQuery
- Use path expressions to traverse the irregularly structured data
- ex) /libraryDB/book/title or //title
- Searching the whole XML data => inefficiency
- Structural Summary and Path Index
- Restrict the search to only the relevant portions of the XML data
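The contrast between a full path and a partial matching path can be sketched with Python's standard `xml.etree.ElementTree`. This is only an illustration without any index: the document below is a simplified copy of the library example (attributes omitted), and the variable names are assumptions made for this sketch.

```python
import xml.etree.ElementTree as ET

# Simplified copy of the library example (attributes omitted).
LIBRARY_XML = (
    "<libraryDB>"
    "<book><title>title1</title><author>author1</author><chapter/></book>"
    "<paper><title>title2</title><author>author2</author>"
    "<author>author3</author><section/></paper>"
    "</libraryDB>"
)

root = ET.fromstring(LIBRARY_XML)

# /libraryDB/book/title: follow the path step by step from the root element.
full_path = [t.text for t in root.findall("./book/title")]

# //title: no prefix is known, so every element has to be inspected.
partial_path = [t.text for t in root.iter("title")]

# Number of elements a whole-document scan has to visit.
visited = sum(1 for _ in root.iter())

print(full_path, partial_path, visited)
```

Without an index, the `//title` query touches all elements of the document, which is the inefficiency a structural summary is meant to avoid.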
6 Schemas for XML
- DTD, XML Schema
- Specify the constraints of XML data
- <!ELEMENT book (title, author, chapter)>
- Schemas are not mandatory
- => lack of an external schema
- Structural Summary
- Summary of label paths
- Path Index
- Structural Summary + Extents
7 Schemas for XML
- Applications
- User Interface
- XML Data Design, Editing
- Query Formulation
- Query Validation
- Query Optimization
- Path Index
8 Structural Summary
- DTD Extraction
- XTRACT
- based on element information
- Structural Summary
- Representative Objects
- based on path information
9 XTRACT
- [Garofalakis, Gionis, Rastogi, Seshadri, Shim, SIGMOD 00]
- Infers a concise and accurate DTD
- Chooses a DTD from candidate DTDs
- (ab), (ba) => (a|b)* or (ab)|(ba)
- Based on the Minimum Description Length (MDL) Principle
- Ranks each candidate DTD by the number of bits required to describe the subelement sequences in terms of the candidate DTD
- (a|b)*: 6 (for DTD) + 3 + 3 = 12
- (ab)|(ba): 9 (for DTD) + 1 + 1 = 11
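The MDL arithmetic on this slide can be reproduced with a small sketch. The cost model here is a simplifying assumption chosen to match the slide's numbers: the DTD costs one unit per character, a sequence under (a|b)* costs one unit for the repetition count plus one unit per symbol, and a sequence under (ab)|(ba) costs one unit for the chosen alternative. The function name `mdl_cost` is hypothetical and is not XTRACT's actual encoding scheme.

```python
def mdl_cost(dtd, per_sequence_cost, sequences):
    """Total description length: cost of the DTD plus cost of each sequence."""
    dtd_cost = len(dtd)  # assumed: one unit per character of the DTD
    data_cost = sum(per_sequence_cost(s) for s in sequences)
    return dtd_cost + data_cost

sequences = ["ab", "ba"]

# (a|b)*: one unit for the repetition count plus one choice per symbol.
general = mdl_cost("(a|b)*", lambda s: 1 + len(s), sequences)

# (ab)|(ba): one unit to select which alternative was used.
specific = mdl_cost("(ab)|(ba)", lambda s: 1, sequences)

print(general, specific)  # 12 11
```

Under this model the more specific DTD wins (11 < 12), even though its own description is longer.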
10 Representative Objects (RO)
- [Nestorov, Ullman, Wiener, Chawathe, ICDE 97]
- Provide a concise representation of the inherent schema of semistructured hierarchical data
- Full-RO
- Describes all simple paths
- K-RO
- Guarantees that its paths of length at most k+1 exist in the data
- 1-RO
- Simplest, very compact representation
11 Representative Objects (RO)
12 Path Index
- Access Support Relations
- Deterministic
- Strong DataGuide
- Index Fabric
- ToXin
- APEX
- Non-Deterministic
- 1-Index
- A(k) Index
- FB Index
13 Access Support Relations
- [Kemper, Moerkotte, IS 92]
- Originated from OODBMS
- select Name
- from Mercedes.Manufactures.Composition.Division
- Support joins along arbitrary reference chains
- Generalization of the Join Index [Valduriez 87]
- Based on the paths in the schema
- Materialize access paths of arbitrary length
- Support only predefined subsets of paths
14 DataGuides
- [Goldman, Widom, VLDB 97]
- An implementation of Full-RO
- Summary of label paths from the root (simple paths)
- Concise: describes every unique simple path exactly once, regardless of the number of times it appears
- Accurate: does not contain label paths that do not appear in the data
- Convenient: can be stored and accessed using techniques similar to those available for processing semistructured data
15 DataGuides
- The construction algorithm emulates the conversion of a non-deterministic finite automaton (NFA) to a deterministic finite automaton (DFA)
- Intuitively, a simple path is represented as a node in the DataGuide
- One XML data set may have multiple DataGuides
16 Strong DataGuide
- If the sets of nodes reachable by two simple paths are equal, the simple paths are represented as a single node
- Linear time and linear space for tree-structured data
- Exponential time and exponential space (worst case) for graph-structured data
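The NFA-to-DFA flavour of the construction can be sketched as a subset construction over the data graph. This is a minimal sketch rather than the published algorithm: the edge-list encoding `(parent, label, child)`, the node ids, and the function names are assumptions made for illustration. Each DataGuide node corresponds to a target set, and two label paths with the same target set share one node.

```python
from collections import defaultdict, deque

def strong_dataguide(edges, root):
    """Subset construction: each DataGuide node is a set of data nodes."""
    children = defaultdict(list)
    for parent, label, child in edges:
        children[parent].append((label, child))

    start = frozenset([root])
    guide = {}                      # target set -> {label: target set}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        if state in guide:
            continue                # same target set: reuse the existing node
        targets = defaultdict(set)
        for node in state:
            for label, child in children[node]:
                targets[label].add(child)
        guide[state] = {lab: frozenset(ns) for lab, ns in targets.items()}
        queue.extend(guide[state].values())
    return start, guide

def target_set(start, guide, label_path):
    """Follow a label path in the DataGuide; empty set if the path is absent."""
    state = start
    for label in label_path:
        state = guide[state].get(label)
        if state is None:
            return frozenset()
    return state

# Library example as an edge-labeled tree; node 0 is a virtual root.
EDGES = [
    (0, "libraryDB", 1), (1, "book", 2), (2, "title", 3), (2, "author", 4),
    (2, "chapter", 5), (1, "paper", 6), (6, "title", 7), (6, "author", 8),
    (6, "author", 9), (6, "section", 10),
]
start, guide = strong_dataguide(EDGES, 0)
print(sorted(target_set(start, guide, ["libraryDB", "paper", "author"])))
```

On a tree every data node belongs to exactly one target set, which is why the construction stays linear there; on graphs the number of distinct target sets can grow exponentially.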
17 1/2/T-Index
- [Milo and Suciu, ICDT 99]
- 1-Index
- Summarizes all label paths starting from the root
- Supports queries of the form q = P x, where P = /l1/l2/.../ln
- Non-deterministic
- Based on backward bisimulation, which originated in graph verification
- Extents are disjoint
- More compact than Strong DataGuides
18 1-Index
- Equivalence relation (≡)
- v ≡ u iff L_v = L_u
- where L_x = { w | w is a simple path from the root to x }
- The index nodes are the collection of all equivalence classes
- Exponential construction cost
- Backward bisimulation (≈b)
- If x ≈b y and x is the root, then y is the root
- Conversely, if x ≈b y and y is the root, then x is the root
- If x ≈b y and (x', l, x) is an edge, then there exists an edge (y', l, y) such that x' ≈b y'
- Conversely, if x ≈b y and (y', l, y) is an edge, then there exists an edge (x', l, x) such that x' ≈b y'
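19 ≡ vs ≈b
[Figure: example graph with edge labels a, b, c, d and nodes X and Y]
- X ≡ Y since L_X = L_Y = {a.b.d, a.c.d}
- X ≉b Y (X and Y are not backward bisimilar)
- v ≈b u => v ≡ u
- O(m log m) construction cost [Paige and Tarjan 87]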
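The difference between the two relations can be checked on a small hypothetical graph. The sketch below computes backward bisimulation by naive iterative partition refinement, not by the O(m log m) Paige and Tarjan algorithm, and computes label-path sets by recursion, which assumes an acyclic graph. The graph, node names, and function names are assumptions chosen so that X and Y have the same label paths but different parent structure.

```python
from collections import defaultdict

def backward_bisimulation(edges, root):
    """Naive refinement: split blocks until incoming (label, parent block) sets agree."""
    nodes = {root}
    parents = defaultdict(set)
    for parent, label, child in edges:
        nodes.update((parent, child))
        parents[child].add((label, parent))
    block = {n: (0 if n == root else 1) for n in nodes}   # the root stands alone
    while True:
        signature = {
            n: (block[n], frozenset((lab, block[p]) for lab, p in parents[n]))
            for n in nodes
        }
        ids = {}
        refined = {n: ids.setdefault(signature[n], len(ids)) for n in nodes}
        if len(set(refined.values())) == len(set(block.values())):
            return refined                                 # no block was split
        block = refined

def label_paths(edges, root, target):
    """All label paths from the root to target (assumes an acyclic graph)."""
    parents = defaultdict(set)
    for parent, label, child in edges:
        parents[child].add((label, parent))
    def paths(node):
        if node == root:
            return {()}
        return {p + (lab,) for lab, q in parents[node] for p in paths(q)}
    return paths(target)

# X is reached through two separate parents, Y through one shared parent.
EDGES = [
    ("r", "a", "n1"), ("n1", "b", "n2"), ("n2", "d", "X"),
    ("r", "a", "n3"), ("n3", "c", "n4"), ("n4", "d", "X"),
    ("r", "a", "m1"), ("m1", "b", "m2"), ("m1", "c", "m2"), ("m2", "d", "Y"),
]
blocks = backward_bisimulation(EDGES, "r")
same_language = label_paths(EDGES, "r", "X") == label_paths(EDGES, "r", "Y")
print(same_language, blocks["X"] == blocks["Y"])  # True False
```

Here X ≡ Y holds because both are reached by exactly a.b.d and a.c.d, yet the refinement separates them because Y's parent has incoming b and c edges while each parent of X has only one of them.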
20 1-Index vs Strong DataGuide
- For tree-structured data, the strong DataGuide and the 1-Index coincide
21 2/T-Index
- 2-Index
- Supports queries of the form x1 P x2
- ex) //title
- Equivalence relation (≡)
- (v, u) ≡ (v', u') iff L(v, u) = L(v', u')
- where L(x, y) = { w | w is a label path from x to y }
- Summary of path information between two arbitrary nodes
- T-Index
- Generalization of the 1/2-Index
- (v1, ..., vn) ≡ (u1, ..., un) iff L(v1, ..., vn) = L(u1, ..., un)
- Conceptually similar to Access Support Relations
- Supports only predefined paths
22 Index Fabric
- [Cooper, Sample, Franklin, Hjaltason, Shadmon, VLDB 01]
- Tree-structured data
- Conceptually similar to the strong DataGuide
- Layered structure
- Uses a Patricia trie to index a large number of search keys
- The simple path of an element that has a data value is encoded as a special character sequence
- Each key is the combination of the encoded sequence and the data value
23 Index Fabric
[Figure: XML data]
- Keeps only the information of elements that have data values
- Patricia trie: lossy compression
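The key encoding can be sketched as follows. This is an illustration under stated assumptions, not the Index Fabric implementation: each tag gets a one-letter designator in order of first appearance, a key is the designator sequence followed by a space and the data value, and a sorted list with binary search stands in for the Patricia trie. The function names are hypothetical.

```python
import bisect
import string
import xml.etree.ElementTree as ET

def build_keys(root):
    """Encode the path of every value-bearing element as designators + value."""
    designators = {}
    keys = []
    def visit(elem, prefix):
        if elem.tag not in designators:
            # Assumed scheme: one uppercase letter per tag (at most 26 tags).
            designators[elem.tag] = string.ascii_uppercase[len(designators)]
        path = prefix + designators[elem.tag]
        text = (elem.text or "").strip()
        if text:                       # elements without a data value are skipped
            keys.append(path + " " + text)
        for child in elem:
            visit(child, path)
    visit(root, "")
    return designators, sorted(keys)

def prefix_search(keys, prefix):
    """Return all keys starting with prefix (stand-in for a trie lookup)."""
    i = bisect.bisect_left(keys, prefix)
    found = []
    while i < len(keys) and keys[i].startswith(prefix):
        found.append(keys[i])
        i += 1
    return found

root = ET.fromstring(
    "<libraryDB>"
    "<book><title>title1</title><author>author1</author><chapter/></book>"
    "<paper><title>title2</title><author>author2</author>"
    "<author>author3</author><section/></paper>"
    "</libraryDB>"
)
designators, keys = build_keys(root)
print(keys)
print(prefix_search(keys, "AFD "))   # authors of papers
```

A path query with a value predicate becomes a single key lookup, and a path query without a value becomes a prefix search over the encoded sequence.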
24 ToXin
- [Rizzolo, Mendelzon, WebDB 01]
- Tree-structured data
- Conceptually similar to the strong DataGuide (not the minimal DataGuide)
- Supports both forward and backward navigation
- Path Tree (= strong DataGuide)
- A node of the Path Tree has an Index Table or Value Tables
- Index Table (IT): parent-child relationships
- Value Table (VT): owner-value relationships
25 ToXin
[Figure: XML data]
- Index Tables (parent, child)
- LibraryDB: parent = null, child = 1
- LibraryDB.book: parent = 1, child = 2
- LibraryDB.paper: parent = 1, child = 6
- Value Table (parent, value)
- LibraryDB.book.author: value = author1
- Since ToXin keeps parent-child relationships, it supports path expressions with value predicates
- ex) /libraryDB/book[author = "author1"]
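How index tables and value tables answer a value predicate can be sketched like this. The sketch assumes preorder node ids (which reproduces the ids 1, 2, and 6 shown on the slide), stores the value table as (owner id, value) pairs, and uses hypothetical function names; it is not ToXin's actual storage layout.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_toxin(root):
    """Build one index table and one value table per label path."""
    index_tables = defaultdict(list)   # label path -> [(parent id, child id)]
    value_tables = defaultdict(list)   # label path -> [(owner id, value)]
    counter = 0
    def visit(elem, parent_id, prefix):
        nonlocal counter
        counter += 1                   # preorder numbering, starting at 1
        node_id = counter
        path = prefix + (elem.tag,)
        index_tables[path].append((parent_id, node_id))
        text = (elem.text or "").strip()
        if text:
            value_tables[path].append((node_id, text))
        for child in elem:
            visit(child, node_id, path)
    visit(root, None, ())
    return index_tables, value_tables

def parents_with_value(index_tables, value_tables, path, value):
    """Backward navigation: value table -> owners -> parents via the index table."""
    path = tuple(path)
    owners = {owner for owner, v in value_tables[path] if v == value}
    return sorted(p for p, child in index_tables[path] if child in owners)

root = ET.fromstring(
    "<libraryDB>"
    "<book><title>title1</title><author>author1</author><chapter/></book>"
    "<paper><title>title2</title><author>author2</author>"
    "<author>author3</author><section/></paper>"
    "</libraryDB>"
)
index_tables, value_tables = build_toxin(root)
# /libraryDB/book[author = "author1"]
books = parents_with_value(index_tables, value_tables,
                           ["libraryDB", "book", "author"], "author1")
print(books)  # [2]
```

The predicate is resolved from the value side first and then navigated backward one step, so no book element without a matching author is ever visited.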
26 A(k)-Index
- [Kaushik, Shenoy, Bohannon, Gudes, ICDE 02]
- The strong DataGuide and the 1-Index record all simple paths
- Increased index size => increased search space
- Approximation of the 1-Index
- Non-deterministic
- Utilizes local similarity (degree k)
- Reduces the size of the index graph
27 A(k)-Index
- k-bisimulation (≈k)
- For any two nodes v and u, v ≈0 u iff v and u have the same label
- v ≈k u iff v ≈k-1 u and, for every parent v' of v, there is a parent u' of u such that v' ≈k-1 u', and vice versa
28 A(k)-Index
- Building cost: O(km)
- In general, compared with the 1-Index, k < log m
- Query Processing
- Label path expressions of length <= k+1
- precise
- Label path expressions of length > k+1
- safe, but may include false results
- validation => requires scanning the data
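The effect of k on the partition can be sketched with k rounds of refinement over a node-labeled graph. This is a minimal sketch of the k-bisimulation definition, assuming a dictionary of node labels and a list of (parent, child) edges; the node ids follow a preorder numbering of the library example and the function name is hypothetical.

```python
from collections import defaultdict

def k_bisimulation(labels, edges, k):
    """Return a block id per node after k rounds of parent-based refinement."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    block = dict(labels)                       # round 0: same label, same block
    for _ in range(k):
        signature = {
            n: (block[n], frozenset(block[p] for p in parents[n]))
            for n in labels
        }
        ids = {}
        block = {n: ids.setdefault(signature[n], len(ids)) for n in labels}
    return block

LABELS = {
    1: "libraryDB", 2: "book", 3: "title", 4: "author", 5: "chapter",
    6: "paper", 7: "title", 8: "author", 9: "author", 10: "section",
}
EDGES = [(1, 2), (2, 3), (2, 4), (2, 5), (1, 6), (6, 7), (6, 8), (6, 9), (6, 10)]

a0 = k_bisimulation(LABELS, EDGES, 0)
a1 = k_bisimulation(LABELS, EDGES, 1)
print(a0[3] == a0[7], a1[3] == a1[7])  # True False
```

With k = 0 both title nodes share one index node, so //book/title would return a false result that must be validated against the data; with k = 1 the titles under book and paper are separated.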
29 APEX: Adaptive Path indEx for XML Data
- [Chung, Min, Shim, SIGMOD 02]
- The strong DataGuide and the 1-Index keep all simple paths
- Users often use partial matching path queries
- //book/title
- Exhaustive navigation of the index structure for partial matching path queries may degrade performance
30 APEX
- Deterministic
- Approximation of DataGuides
- Efficient processing of partial matching path queries
- Workload-aware
- Self-tuning strategies [Chaudhuri et al. 00]
- Utilizes the query workload
- Builds APEX from both the XML data and the frequently used paths
- Sequential pattern mining [Agrawal and Srikant 95]
31 APEX
[Figure: APEX built over the XML data, with frequently used path book.title]
- Extents
- 0: <null,0>
- 1: <0,1>
- 2: <1,2>
- 3: <1,6>
- 4: <2,4>, <6,8>, <6,9>
- 5: <2,5>
- 6: <6,10>
- 7: <2,8>
- 8: <2,3>
- 9: <6,7>
- Hash Tree
- Keeps frequently used paths
- Prevents exhaustive search
- Graph Structure
- Structural summary + extents
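The role of the hash structure can be illustrated with a heavily simplified sketch. It assumes that each element's edge <parent, child> is stored in the extent of the longest frequently used path that is a suffix of its root path, falling back to its own label; a plain dictionary stands in for the hash tree, the graph structure is left out entirely, and the IDREF edge of the example is ignored. All function names are hypothetical, and this is not the APEX construction algorithm itself.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_extents(root, frequent_paths):
    """Group <parent, child> edges by the longest frequent suffix of each path."""
    required = {tuple(p) for p in frequent_paths}
    extents = defaultdict(list)
    counter = 0
    def visit(elem, parent_id, path):
        nonlocal counter
        counter += 1                      # preorder ids; the virtual root is 0
        node_id = counter
        path = path + (elem.tag,)
        key = (elem.tag,)                 # default: the label alone
        for n in range(len(path), 1, -1): # try the longest suffix first
            if path[-n:] in required:
                key = path[-n:]
                break
        extents[key].append((parent_id, node_id))
        for child in elem:
            visit(child, node_id, path)
    visit(root, 0, ())
    return extents

def lookup(extents, frequent_paths, query):
    """Answer //query directly when it is a single label or a frequent path."""
    q = tuple(query)
    if len(q) > 1 and q not in {tuple(p) for p in frequent_paths}:
        raise LookupError("not a frequently used path; graph navigation is needed")
    return sorted(edge for key, edges in extents.items()
                  if key[-len(q):] == q for edge in edges)

root = ET.fromstring(
    "<libraryDB>"
    "<book><title>title1</title><author>author1</author><chapter/></book>"
    "<paper><title>title2</title><author>author2</author>"
    "<author>author3</author><section/></paper>"
    "</libraryDB>"
)
FREQUENT = [["book", "title"]]
extents = build_extents(root, FREQUENT)
print(lookup(extents, FREQUENT, ["book", "title"]))  # [(2, 3)]
print(lookup(extents, FREQUENT, ["title"]))          # [(2, 3), (6, 7)]
```

The frequent query //book/title is answered by one dictionary access, with no traversal of the summary, which is the behaviour the hash tree is meant to provide.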
32 FB Index
- [Kaushik, Bohannon, Naughton, Korth, SIGMOD 02]
- Supports twig path expressions
- /A/BC
- Basic Idea
- For every edge e labelled l from v to u, add an (inverse) edge e^-1 with label l^-1 from u to v
- Then compute the 1-Index on this modified graph
- Very large index space
- Apply some heuristics
- Exploit local similarity: k-bisimulation
[Figure: path A, B, C^-1]
33 Discussion
- Path Index
- Improves query performance by restricting the search space
- Can be applied to various applications
- Selectivity Estimation
- QBE (Query By Example)
- Future Work
- Support twig queries
- Query Optimization
- Cost formula of path index
34 Thank You!
- Any questions?
- http://islab.kaist.ac.kr/jkmin
- jkmin_at_islab.kaist.ac.kr
35 References
- C. Chung, J. Min and K. Shim, APEX: An Adaptive Path Index for XML Data, SIGMOD 02
- B. Cooper, N. Sample, M. Franklin, G. Hjaltason and M. Shadmon, A Fast Index for Semistructured Data, VLDB 01
- M. Garofalakis, A. Gionis, R. Rastogi, S. Seshadri and K. Shim, XTRACT: A System for Extracting Document Type Descriptors from XML Documents, SIGMOD 00
- R. Goldman and J. Widom, DataGuides: Enabling Query Formulation and Optimization in Semistructured Databases, VLDB 97
- R. Kaushik, P. Bohannon, J. Naughton and H. Korth, Covering Indexes for Branching Path Queries, SIGMOD 02
- R. Kaushik, P. Shenoy, P. Bohannon and E. Gudes, Exploiting Local Similarity for Indexing Paths in Graph-Structured Data, ICDE 02
- A. Kemper and G. Moerkotte, Access Support Relations: An Indexing Method for Object Bases, Information Systems 92
- T. Milo and D. Suciu, Index Structures for Path Expressions, ICDT 99
- S. Nestorov, J. Ullman, J. Wiener and S. Chawathe, Representative Objects: Concise Representations of Semistructured, Hierarchical Data, ICDE 97
- F. Rizzolo and A. Mendelzon, Indexing XML Data with ToXin, WebDB 01
- R. Paige and R. Tarjan, Three Partition Refinement Algorithms, SIAM Journal on Computing 87
- P. Valduriez, Join Indices, TODS 87