Title: A Document-based Approach to Indexing XML Data
1A Document-based Approach to Indexing XML Data
- Ya-Hui Chang and Tsan-Lung Hsieh
- Department of Computer Science
- National Taiwan Ocean University
- yahui_at_cs.ntou.edu.tw
- Sept. 10th, 2002
2Overview
- XML introduction
- Element block
- Element tree
- Two types of index structures
- Document index
- Element index
- Experiment results
- Conclusion
3Element Block
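[Figure: sample XML document describing one book, with Title "Principles of database systems", Author (Lastname "Ullman", Firstname "Jeffrey"), Publisher "Computer Science Press", Date "1999" and Keyword "database"]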
4Element Tree
Example of Offset Blocks
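5The Query Processor
[Figure: query processing flow. A Query goes through three steps, Identifying Document (using the Document Index), Determining Position (using the Element Index) and Retrieving Data (from the XML Document), and produces the Result.]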
6The Index Structures
- Purpose
- Providing efficient query processing over
multiple XML documents
- Two types
- Document index
- Representing the correspondence of document
identifiers and element values
- Element index
- Representing the positions of elements
7Document Index
- Based on B-Tree
- The size of each node is restricted by the order
- The tree is balanced
[Figure: example B-Tree with order 5]
8Document Index (cont)
- Each node is represented as an XML document.
- The search-key value is stored in the attribute
key of the element Pointer, while the document
identifier is stored as the element content.
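[Figure: a B-Tree node stored as the XML file B3.bt, whose Pointer elements carry document identifiers such as B0001 and B0002, shown with its DTD, which declares the key attribute as CDATA #REQUIRED]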
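The node representation described on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the root element name `Node`, the search keys, and the function name `lookup` are assumptions made for the example. Only the `Pointer` element, its `key` attribute, and identifiers of the form `B0001` come from the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical node of the document index. The root element name "Node"
# and the two search keys are assumptions made for illustration.
LEAF_NODE = """
<Node>
  <Pointer key="Ullman">B0001</Pointer>
  <Pointer key="Widom">B0002</Pointer>
</Node>
""".strip()


def lookup(node_xml, search_key):
    """Return the document identifiers stored under search_key in one node."""
    root = ET.fromstring(node_xml)
    # The search key is the "key" attribute; the identifier is the content.
    return [p.text for p in root.iter("Pointer") if p.get("key") == search_key]


print(lookup(LEAF_NODE, "Ullman"))
```

Storing each node as a small XML document means the index can be read with the same parser as the data, at the cost of parsing one node file per B-Tree level visited.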
9Element Index
- The position information of elements is
represented based on the order specified in DTD,
or the element tree.
- The element indexes are partitioned into offset
blocks corresponding to element blocks to capture
the nesting structures of elements.
- It is named "offset" because we keep the relative
positions of elements, which reduces the cost of
maintenance.
- Offset tuples constitute the offset block
- the first component records the offset to the
parent element
- the last component records the pointer to the
offset tuple for the next sibling element
- the other components record the relative
positions of sub-elements.
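The layout of an offset tuple can be made concrete with a small helper that labels each component. This is only a sketch: the function name and the dictionary keys are hypothetical, and the sample values are the Author tuple (57, 12, 43, 0) from the construction example later in the deck, where 0 serves as the null sibling pointer.

```python
def describe_offset_tuple(offset_tuple, child_names):
    """Label the components of one offset tuple.

    child_names lists the sub-elements in element-tree (DTD) order.
    """
    if len(offset_tuple) != len(child_names) + 2:
        raise ValueError("tuple length must be number of children + 2")
    # First component: offset to the parent element.
    labelled = {"offset_to_parent": offset_tuple[0]}
    # Middle components: one relative position (or pointer) per sub-element.
    labelled.update(zip(child_names, offset_tuple[1:-1]))
    # Last component: pointer to the tuple of the next sibling element.
    labelled["next_sibling"] = offset_tuple[-1]
    return labelled


author = describe_offset_tuple((57, 12, 43, 0), ["Lastname", "Firstname"])
print(author)
```

Because Lastname and Firstname are stored relative to Author, moving the whole Author element changes only its offset to the parent, not the two inner components.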
10Example of Offset Blocks
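[Figure: offset blocks for a Books element with two Book children, shown next to the element tree; pointers are labelled as child links or sibling links]
- Books: Books, pointer (child link to Book1), null
- Book1: Book1, Title1, pointer (child link to Author1), Publisher1, Date1, Keyword1, pointer (sibling link to Book2)
- Author1: Author1, Lastname1, Firstname1, pointer (sibling link to Author2)
- Author2: Author2, Lastname2, Firstname2, null
- Book2: Book2, Title2, pointer (child link to Author3), Publisher2, Date2, Keyword2, null
- Author3: Author3, Lastname3, Firstname3, null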
11Example of Retrieving Offsets
- Suppose we plan to retrieve all the data
corresponding to the path /Books/Book/Title.
- Based on the element tree, Book is the first
child of Books, and Title is the first child of
Book.
- This information tells us which components to
retrieve in the offset tuples of Books and Book.
- We also need to follow the sibling links.
12Example of Retrieving Offsets (cont)
- Suppose the input path is
/Books/Book/Author/Lastname, where Book is the
first child, Author is the second child and
Lastname is the first child.
- We need to process the sibling elements for both
Author and Book.
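The two retrieval examples can be traced with a short sketch. It assumes the offset tuples from the deck's construction example, with 0 as the null sibling pointer, and it assumes every component of a tuple is filled in. The names `ELEMENT_TREE`, `OFFSET_TUPLES` and `resolve` are hypothetical, and this is one possible reading of the procedure rather than the authors' code.

```python
# Element tree: each internal element maps to its children in DTD order.
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# Offset tuples indexed by tuple id: (offset to parent, one component per
# child, next-sibling pointer).
OFFSET_TUPLES = [
    (0, 1, 0),                    # Books
    (9, 9, 2, 145, 193, 213, 0),  # Book
    (57, 12, 43, 0),              # Author
]

NULL = 0  # assumed null sibling pointer (tuple 0 is the root)


def resolve(path, tuples=OFFSET_TUPLES, tree=ELEMENT_TREE):
    """Return the absolute start positions of all elements matching path."""
    steps = path.strip("/").split("/")
    # Each entry is (tuple id, or None for a leaf, absolute position).
    frontier = [(0, tuples[0][0])]
    for parent, child in zip(steps, steps[1:]):
        # The element tree tells us which component to read.
        slot = 1 + tree[parent].index(child)
        next_frontier = []
        for tuple_id, position in frontier:
            value = tuples[tuple_id][slot]
            if child not in tree:
                # Leaf: the component is the offset from the parent.
                next_frontier.append((None, position + value))
                continue
            # Internal element: the component points to the first child's
            # tuple; follow the sibling links to reach the others.
            child_id = value
            while True:
                next_frontier.append((child_id, position + tuples[child_id][0]))
                child_id = tuples[child_id][-1]
                if child_id == NULL:
                    break
        frontier = next_frontier
    return [position for _, position in frontier]


print(resolve("/Books/Book/Title"))
print(resolve("/Books/Book/Author/Lastname"))
```

With the sample tuples, the first path resolves to position 18 (0 + 9 + 9) and the second to position 78 (0 + 9 + 57 + 12), which match the absolute tag positions listed in the construction example.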
13Constructing Algorithm
- Idea: perform a linear scan on the XML document,
retrieving the absolute positions of all tags to
calculate offsets.
- Data structures used:
- StartTagList: the sequence of start-tags and
their absolute positions
- EndTagList: the sequence of end-tags and their
absolute positions
- Stack: all unfinished elements; the top is the
most recent one, which is also the parent of the
current element
- Each internal node of the element tree needs to
record how many child nodes it has.
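The construction idea can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' algorithm verbatim: the two tag lists are merged into one position-ordered scan, sibling pointers start out as null (0) instead of being set when the end-tag is met, and each leaf element is assumed to occur at most once per parent. The function name and the dictionary form of the element tree are hypothetical.

```python
NULL = 0  # assumed null sibling pointer; tuple 0 is the root, never a sibling


def build_offset_tuples(start_tags, end_tags, tree):
    """Compute offset tuples from absolute tag positions.

    start_tags / end_tags: lists of (element name, absolute position).
    tree: internal element name -> child names in DTD order.
    """
    # Merge both lists into a single scan ordered by position.
    events = sorted(
        [(pos, "start", name) for name, pos in start_tags]
        + [(pos, "end", name) for name, pos in end_tags]
    )
    tuples = []
    stack = [("/", 0, -1)]  # (name, absolute position, tuple id)
    last_sibling = {}       # (parent tuple id, name) -> latest tuple id
    for pos, kind, name in events:
        if kind == "end":
            if name in tree:  # only internal elements sit on the stack
                stack.pop()
            continue
        parent_name, parent_pos, parent_id = stack[-1]
        if name not in tree:
            # Leaf: store its offset in the parent's tuple.
            slot = 1 + tree[parent_name].index(name)
            tuples[parent_id][slot] = pos - parent_pos
            continue
        # Internal element: allocate a tuple sized by its child count.
        tuple_id = len(tuples)
        tuples.append([pos - parent_pos] + [None] * len(tree[name]) + [NULL])
        if parent_id >= 0:
            slot = 1 + tree[parent_name].index(name)
            previous = last_sibling.get((parent_id, name))
            if previous is None:
                tuples[parent_id][slot] = tuple_id  # child link
            else:
                tuples[previous][-1] = tuple_id     # sibling link
            last_sibling[(parent_id, name)] = tuple_id
        stack.append((name, pos, tuple_id))
    return tuples


# Tag positions taken from the worked example in the deck.
TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}
START = [("Books", 0), ("Book", 9), ("Title", 18), ("Author", 66),
         ("Lastname", 78), ("Firstname", 109), ("Publisher", 154),
         ("Date", 202), ("Keyword", 222)]
END = [("Title", 62), ("Lastname", 104), ("Firstname", 138), ("Author", 150),
       ("Publisher", 198), ("Date", 218), ("Keyword", 248), ("Book", 257),
       ("Books", 266)]

for tuple_id, offsets in enumerate(build_offset_tuples(START, END, TREE)):
    print(tuple_id, offsets)
```

On this input the sketch yields the tuples (0, 1, 0), (9, 9, 2, 145, 193, 213, 0) and (57, 12, 43, 0), the same values as the Final Data slide of the worked example.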
14Initial Data
[Figure: initial state for the sample XML document; lists are shown with the next entry to process on the right, and the stack with its top on the left]
- StartTagList: ..., ('Title', 18), ('Book', 9), ('Books', 0)
- EndTagList: ..., ('Firstname', 138), ('Lastname', 104), ('Title', 62)
- Offset Tuples: (empty)
- Stack: ('/', 0, -1)
15Round 1
- StartTagList: ..., ('Title', 18), ('Book', 9), ('Books', 0)
- EndTagList: ..., ('Firstname', 138), ('Lastname', 104), ('Title', 62)
- Offset Tuples: 0: (0, _, _)
- Stack: ('Books', 0, 0), ('/', 0, -1)
16Round 2
- StartTagList: ..., ('Author', 66), ('Title', 18), ('Book', 9)
- EndTagList: ..., ('Firstname', 138), ('Lastname', 104), ('Title', 62)
- Offset Tuples: 0: (0, 1, _); 1: (9, _, _, _, _, _, _)
- Stack: ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
17Round 3
- StartTagList: ..., ('Lastname', 78), ('Author', 66), ('Title', 18)
- EndTagList: ..., ('Firstname', 138), ('Lastname', 104), ('Title', 62)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, _, _, _, _, _)
- Stack: ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
18Round 4
- StartTagList: ..., ('Firstname', 109), ('Lastname', 78), ('Author', 66)
- EndTagList: ..., ('Author', 150), ('Firstname', 138), ('Lastname', 104)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _); 2: (57, _, _, _)
- Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
19Round 5
- StartTagList: ..., ('Publisher', 154), ('Firstname', 109), ('Lastname', 78)
- EndTagList: ..., ('Author', 150), ('Firstname', 138), ('Lastname', 104)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _); 2: (57, 12, _, _)
- Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
20Round 6
- StartTagList: ..., ('Date', 202), ('Publisher', 154), ('Firstname', 109)
- EndTagList: ..., ('Publisher', 198), ('Author', 150), ('Firstname', 138)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _); 2: (57, 12, 43, _)
- Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
21Round 7
- StartTagList: ('Keyword', 222), ('Date', 202), ('Publisher', 154)
- EndTagList: ..., ('Date', 218), ('Publisher', 198), ('Author', 150)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, _, _, _, _); 2: (57, 12, 43, 0)
- Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
22Round 8
- StartTagList: ('Keyword', 222), ('Date', 202), ('Publisher', 154)
- EndTagList: ..., ('Keyword', 248), ('Date', 218), ('Publisher', 198)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, _, _, _); 2: (57, 12, 43, 0)
- Stack: ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
23Round 9
- StartTagList: ('Keyword', 222), ('Date', 202)
- EndTagList: ('Books', 266), ('Book', 257), ('Keyword', 248), ('Date', 218)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, 193, _, _); 2: (57, 12, 43, 0)
- Stack: ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
24Round 10
- StartTagList: ('Keyword', 222)
- EndTagList: ('Books', 266), ('Book', 257), ('Keyword', 248)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, 193, 213, _); 2: (57, 12, 43, 0)
- Stack: ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
25Round 11
- StartTagList: (empty)
- EndTagList: ('Books', 266), ('Book', 257)
- Offset Tuples: 0: (0, 1, _); 1: (9, 9, 2, 145, 193, 213, 0); 2: (57, 12, 43, 0)
- Stack: ('Book', 9, 1), ('Books', 0, 0), ('/', 0, -1)
26Round 12
- StartTagList: (empty)
- EndTagList: ('Books', 266)
- Offset Tuples: 0: (0, 1, 0); 1: (9, 9, 2, 145, 193, 213, 0); 2: (57, 12, 43, 0)
- Stack: ('Books', 0, 0), ('/', 0, -1)
27Final Data
- StartTagList: (empty)
- EndTagList: (empty)
- Offset Tuples: 0: (0, 1, 0); 1: (9, 9, 2, 145, 193, 213, 0); 2: (57, 12, 43, 0)
- Stack: ('/', 0, -1)
28Performance Evaluation
- Comparison with DOM: shows the efficiency of
utilizing the pre-built element index
- DOM (Document Object Model): a tree-based parsing
mechanism where each element is a node
- Using the Microsoft MSXML 3.0 DOM API
- Construction of the cost model: shows the
scalability of our indexing scheme
- Comparison with Lore: shows the performance of
the whole query processor
- Lore: a specialized database system for
semi-structured/XML data
29Comparison with DOM
30Cost Model
- The I/O cost consists of processing the following
four portions of data:
- The internal nodes of the document index
- The leaf nodes of the document index
- The offset blocks
- The XML files
- The cost model is as follows
31Experiment Setups
32Experiment Data
33Queries to Compare with Lore
34Experiment Results
35Conclusions
- Summary
- We construct a query processor to retrieve data
from multiple XML documents, which utilizes two
index structures:
- The document index can quickly identify the
required document
- The maintainable element index can quickly
determine the precise location of the desired data
- Experimental results show the efficiency of our
approach.
- Future work
- Supporting more complicated queries
- Improving space utilization