A Document-based Approach to Indexing XML Data

1
A Document-based Approach to Indexing XML Data

Ya-Hui Chang and Tsan-Lung Hsieh
Department of Computer Science
National Taiwan Ocean University
yahui_at_cs.ntou.edu.tw
Sept. 10th, 2002

2
Overview

XML introduction
Element block
Element tree
Two types of index structures
Document index
Element index
Experiment results
Conclusion

3
Element Block
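<Books>
  <Book>
    <Title>Principles of database systems</Title>
    <Author>
      <Lastname>Ullman</Lastname>
      <Firstname>Jeffrey</Firstname>
    </Author>
    <Publisher>Computer Science Press</Publisher>
    <Date>1999</Date>
    <Keyword>database</Keyword>
  </Book>
</Books>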
4
Element Tree
Example of Offset Blocks
5
The Query Processor
Input: Query
Identifying Document, using the Document Index
Determining Position, using the Element Index
Retrieving Data, from the XML Document
Output: Result
6
The Index Structures

Purpose: providing efficient query processing over multiple XML documents
Two types:
Document index: representing the correspondence of document identifiers and element values
Element index: representing the positions of elements

7
Document Index

Based on the B-tree:
the size of each node is restricted by the order
the tree is balanced

Order = 5
8
Document Index (cont)

Each node is represented as an XML document.
Search-key value is represented as the attribute
key of the element Pointer, while the document
identifier is represented as the content.

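[Figure: the XML document for node B3.bt, containing Pointer entries with the identifiers B0001 and B0002, and its DTD, which declares the attribute key as CDATA #REQUIRED]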
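
As a rough illustration of the node layout described on this slide, the following Python sketch looks up a search key in a single node. Only the Pointer element, its key attribute, and identifiers of the form B0001 come from the slide; the root element name Node, the sample key values, and the function name lookup_documents are assumptions made for this example, not the authors' actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical node document. The root name "Node" and the key values are
# invented for illustration; only <Pointer key="...">id</Pointer> follows the slide.
LEAF_NODE_XML = """
<Node>
  <Pointer key="database">B0001</Pointer>
  <Pointer key="network">B0002</Pointer>
</Node>
"""


def lookup_documents(node_xml, search_key):
    """Return the contents of all Pointer elements whose key equals search_key."""
    root = ET.fromstring(node_xml.strip())
    return [
        (pointer.text or "").strip()
        for pointer in root.iter("Pointer")
        if pointer.get("key") == search_key
    ]


print(lookup_documents(LEAF_NODE_XML, "database"))  # ['B0001']
```

A full lookup would repeat this step from the root node down to a leaf, following the B-tree search order; that traversal is omitted here.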
9
Element Index

The position information of elements is represented based on the order specified in the DTD, or the element tree.
The element indexes are partitioned into offset blocks, which correspond to element blocks and capture the nesting structure of elements.
They are named offset blocks because we keep the relative positions of elements, which reduces the cost of maintenance.
Offset tuples constitute the offset block:
the first component records the offset to the parent element
the last component records the pointer to the offset tuple for the next sibling element
the other components record the relative positions of sub-elements.

10
Example of Offset Blocks
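Offset block for Books:
(Books, pointer, null)
Offset block for Book:
(Book1, Title1, pointer, Publisher1, Date1, Keyword1, pointer)
(Book2, Title2, pointer, Publisher2, Date2, Keyword2, null)
Offset block for Author:
(Author1, Lastname1, Firstname1, pointer)
(Author2, Lastname2, Firstname2, null)
(Author3, Lastname3, Firstname3, null)
Child link: the pointer in a sub-element position leads to the offset tuple of that child (for example, from Books to Book1).
Sibling link: the pointer in the last position leads to the offset tuple of the next sibling (for example, from Book1 to Book2).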
11
Example of Retrieving Offsets

Suppose we plan to retrieve all the data
corresponding to the path /Books/Book/Title.
Based on the element tree, Book is the first
child of Books, and Title is the first child of
Book.
This information tells us which components to
retrieve in the offset tuples of Books and Book.
We also need to follow the sibling links.

12
Example of Retrieving Offsets (cont)

Suppose the input path is /Books/Book/Author/Lastname, where Book is the first child of Books, Author is the second child of Book, and Lastname is the first child of Author.
We need to process the sibling elements for both Author and Book.
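
The retrieval steps on the last two slides can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: it assumes offset tuples are stored as lists indexed by tuple id, that a pointer value of 0 means null, and that a sibling tuple's first component is its offset from the shared parent. The tuple values are the ones shown on the Final Data slide; the names resolve_path, ELEMENT_TREE, and TUPLES are hypothetical.

```python
# Element tree for the Books example: internal element -> ordered child names.
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}

# Offset tuples: [offset to parent, one component per child..., sibling pointer].
TUPLES = [
    [0, 1, 0],                       # 0: Books
    [9, 9, 2, 145, 193, 213, 0],     # 1: Book
    [57, 12, 43, 0],                 # 2: Author
]


def resolve_path(path, tuples, tree):
    """Return the absolute start positions of all elements matching path."""
    steps = path.strip("/").split("/")
    name = steps[0]
    current = [(0, tuples[0][0])]  # (tuple id, absolute start position)
    for step in steps[1:]:
        # The element tree tells us which component to read.
        slot = 1 + tree[name].index(step)
        if step not in tree:
            # Leaf element: the component is an offset from its parent.
            return [pos + tuples[tid][slot] for tid, pos in current
                    if tuples[tid][slot] is not None]
        reached = []
        for tid, pos in current:
            child = tuples[tid][slot]  # child link
            while child:               # follow sibling links until null (0)
                reached.append((child, pos + tuples[child][0]))
                child = tuples[child][-1]
        current, name = reached, step
    return [pos for _, pos in current]


print(resolve_path("/Books/Book/Title", TUPLES, ELEMENT_TREE))            # [18]
print(resolve_path("/Books/Book/Author/Lastname", TUPLES, ELEMENT_TREE))  # [78]
```

The results 18 and 78 agree with the absolute positions of the Title and Lastname start-tags in the StartTagList of the construction example.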

13
Constructing Algorithm

Idea: performing a linear scan on the XML document, retrieving the absolute positions of all tags to calculate offsets.
Data structures used:
StartTagList: the sequence of start-tags and their absolute positions
EndTagList: the sequence of end-tags and their absolute positions
Stack: all unfinished elements; the one on top is the most recent, which is also the parent of the current element
Each internal node of the element tree needs to record how many child nodes it has.
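
The construction idea can be sketched in Python as below. This is an illustrative reading of the slides, not the authors' code. It assumes that the two tag lists can be merged into one position-ordered scan, that a leaf child stores its offset from the parent while an internal child stores the id of its own tuple, that 0 stands for a null pointer, and that only the first occurrence of a repeated leaf is recorded. The function and variable names are hypothetical; the tag positions are those of the walkthrough that follows.

```python
# Element tree for the Books example: internal element -> ordered child names.
ELEMENT_TREE = {
    "Books": ["Book"],
    "Book": ["Title", "Author", "Publisher", "Date", "Keyword"],
    "Author": ["Lastname", "Firstname"],
}
NULL = 0  # null pointer; tuple 0 is the root, so it is never a child or sibling

START_TAGS = [("Books", 0), ("Book", 9), ("Title", 18), ("Author", 66),
              ("Lastname", 78), ("Firstname", 109), ("Publisher", 154),
              ("Date", 202), ("Keyword", 222)]
END_TAGS = [("Title", 62), ("Lastname", 104), ("Firstname", 138),
            ("Author", 150), ("Publisher", 198), ("Date", 218),
            ("Keyword", 248), ("Book", 257), ("Books", 266)]


def build_offset_tuples(start_tags, end_tags, tree):
    """Build offset tuples from (name, absolute position) tag lists."""
    # One linear scan over all tags in document order; flag 0 = start, 1 = end.
    events = sorted([(pos, 0, name) for name, pos in start_tags]
                    + [(pos, 1, name) for name, pos in end_tags])
    tuples = []
    stack = [("/", 0, -1)]  # (name, absolute position, tuple id)
    last_sibling = {}       # (parent tuple id, name) -> latest tuple id
    for pos, is_end, name in events:
        if is_end:
            if name in tree:  # an internal element is finished
                _, _, tid = stack.pop()
                tuples[tid][-1] = NULL  # no sibling known yet
            continue
        parent_name, parent_pos, parent_id = stack[-1]
        if name in tree:
            # Internal element: open a new tuple sized by its number of children.
            tid = len(tuples)
            tuples.append([pos - parent_pos] + [None] * len(tree[name]) + [None])
            if parent_id >= 0:
                slot = 1 + tree[parent_name].index(name)
                if tuples[parent_id][slot] is None:
                    tuples[parent_id][slot] = tid  # child link
                else:
                    tuples[last_sibling[(parent_id, name)]][-1] = tid  # sibling link
                last_sibling[(parent_id, name)] = tid
            stack.append((name, pos, tid))
        else:
            # Leaf element: record its position relative to the parent.
            slot = 1 + tree[parent_name].index(name)
            if tuples[parent_id][slot] is None:
                tuples[parent_id][slot] = pos - parent_pos
    return tuples


for tid, components in enumerate(build_offset_tuples(START_TAGS, END_TAGS, ELEMENT_TREE)):
    print(tid, components)
```

For this input the sketch yields the tuples (0, 1, 0), (9, 9, 2, 145, 193, 213, 0) and (57, 12, 43, 0), the same values as the offset tuples on the Final Data slide.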

14
Initial Data
StartTagList: ('Title', 18), ('Book', 9), ('Books', 0)
EndTagList: ('Firstname', 138), ('Lastname', 104), ('Title', 62)
Offset Tuples: (empty)
Stack: (/, 0, -1)
15
Round 1
StartTagList: ('Title', 18), ('Book', 9), ('Books', 0)
EndTagList: ('Firstname', 138), ('Lastname', 104), ('Title', 62)
Offset Tuples:
0: (0, _, _)
Stack: ('Books', 0, 0), (/, 0, -1)
16
Round 2
StartTagList: ('Author', 66), ('Title', 18), ('Book', 9)
EndTagList: ('Firstname', 138), ('Lastname', 104), ('Title', 62)
Offset Tuples:
0: (0, 1, _)
1: (9, _, _, _, _, _, _)
Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
17
Round 3
StartTagList: ('Lastname', 78), ('Author', 66), ('Title', 18)
EndTagList: ('Firstname', 138), ('Lastname', 104), ('Title', 62)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, _, _, _, _, _)
Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
18
Round 4
StartTagList: ('Firstname', 109), ('Lastname', 78), ('Author', 66)
EndTagList: ('Author', 150), ('Firstname', 138), ('Lastname', 104)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, _, _, _, _)
2: (57, _, _, _)
Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
19
Round 5
StartTagList: ('Publisher', 154), ('Firstname', 109), ('Lastname', 78)
EndTagList: ('Author', 150), ('Firstname', 138), ('Lastname', 104)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, _, _, _, _)
2: (57, 12, _, _)
Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
20
Round 6
StartTagList: ('Date', 202), ('Publisher', 154), ('Firstname', 109)
EndTagList: ('Publisher', 198), ('Author', 150), ('Firstname', 138)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, _, _, _, _)
2: (57, 12, 43, _)
Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
21
Round 7
StartTagList: ('Keyword', 222), ('Date', 202), ('Publisher', 154)
EndTagList: ('Date', 218), ('Publisher', 198), ('Author', 150)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, _, _, _, _)
2: (57, 12, 43, 0)
Stack: ('Author', 66, 2), ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
22
Round 8
StartTagList: ('Keyword', 222), ('Date', 202), ('Publisher', 154)
EndTagList: ('Keyword', 248), ('Date', 218), ('Publisher', 198)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, 145, _, _, _)
2: (57, 12, 43, 0)
Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
23
Round 9
StartTagList: ('Keyword', 222), ('Date', 202)
EndTagList: ('Books', 266), ('Book', 257), ('Keyword', 248), ('Date', 218)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, 145, 193, _, _)
2: (57, 12, 43, 0)
Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
24
Round 10
StartTagList: ('Keyword', 222)
EndTagList: ('Books', 266), ('Book', 257), ('Keyword', 248)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, 145, 193, 213, _)
2: (57, 12, 43, 0)
Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
25
Round 11
StartTagList: (empty)
EndTagList: ('Books', 266), ('Book', 257)
Offset Tuples:
0: (0, 1, _)
1: (9, 9, 2, 145, 193, 213, 0)
2: (57, 12, 43, 0)
Stack: ('Book', 9, 1), ('Books', 0, 0), (/, 0, -1)
26
Round 12
StartTagList: (empty)
EndTagList: ('Books', 266)
Offset Tuples:
0: (0, 1, 0)
1: (9, 9, 2, 145, 193, 213, 0)
2: (57, 12, 43, 0)
Stack: ('Books', 0, 0), (/, 0, -1)
27
Final Data
StartTagList: (empty)
EndTagList: (empty)
Offset Tuples:
0: (0, 1, 0)
1: (9, 9, 2, 145, 193, 213, 0)
2: (57, 12, 43, 0)
Stack: (/, 0, -1)
28
Performance Evaluation

Comparison with DOM: showing the efficiency of utilizing the pre-built element index
DOM (Document Object Model): a tree-based parsing mechanism where each element is a node
Using the Microsoft MSXML 3.0 DOM API
Construction of the cost model: showing the scalability of our indexing scheme
Comparison with Lore: showing the performance of the whole query processor
Lore: a specialized database system for semi-structured/XML data

29
Comparison with DOM
30
Cost Model

The I/O cost consists of processing the following four portions of data:
The internal nodes of the document index
The leaf nodes of the document index
The offset blocks
The XML files
The cost model is as follows

31
Experiment Setups
32
Experiment Data
33
Queries to Compare with Lore
34
Experiment Results
35
Conclusions

Summary
We construct a query processor that retrieves data from multiple XML documents and utilizes two index structures:
the document index can quickly identify the required document
the maintainable element index can quickly determine the precise location of the desired data
Experiment results show the efficiency of our approach.
Future work
Supporting more complicated queries
Improving space utilization