1
CMPUT 692 Course Project

Efficient Index Structures for Semi-Structured Data

Dean Cheng
April 15, 2005
Department of Computing Science, University of Alberta
2
Outline

Introduction
DataGuides
APEX
XISS
Index Fabric
ViST
XRegion
Discussion
Conclusion/Recommendation

3
Introduction

XML has become the gold standard for data exchange, and XML documents keep growing in size
An efficient means of querying XML data is needed
XML data can be highly irregular, and XML queries can be complex
Solution: use indexes

4
DataGuides

One of the earliest index structures for semi-structured data
Models XML data as a graph and queries as paths
Stores all paths from the root to the leaves
Handles simple paths efficiently (no wildcards, no branching)
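
The idea of storing every root-to-node label path can be sketched as follows. This is only a minimal illustration under stated assumptions, not the DataGuide construction itself: the toy document, the `(label, children)` tuple layout, and the name `build_path_index` are all hypothetical, and the data is a tree rather than a general graph.

```python
from collections import defaultdict

# Hypothetical toy document: each node is (label, [children]).
doc = ("order", [
    ("seller", [("name", []), ("item", []), ("item", [])]),
    ("buyer", [("name", [])]),
])

def build_path_index(root):
    """Map each distinct label path to the preorder ids of the nodes it reaches."""
    index = defaultdict(list)
    next_id = 0
    stack = [(root, ())]
    while stack:
        (label, children), prefix = stack.pop()
        path = prefix + (label,)
        index[path].append(next_id)
        next_id += 1
        # Push children in reverse so they are popped in document order.
        for child in reversed(children):
            stack.append((child, path))
    return dict(index)

index = build_path_index(doc)
# A simple path query is a single dictionary lookup, with no traversal of the data.
print(index[("order", "seller", "item")])
```

Both `item` nodes share one entry in the index, which is why a simple path is cheap to answer, while a query with wildcards or branches would have to combine several entries.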

5
APEX (1)

Motivation: storing all paths from the root to the leaves is very inefficient for complex queries (traversal and join costs)
Stores only paths of length two, plus frequent query paths mined from the query workload; queries are answered by joins
More flexible and faster than the strong DataGuide

6
APEX (2)

Frequent pattern mining example
Let the required paths be A, B, C, D, B.D
Let the query workload be A.D, C, A.D
Let minSup be 0.6 (with 3 queries, remove paths whose count is < 2)
Counts: A = 2, B = 0, C = 1, D = 2, A.D = 2, B.D = 0
Updated required paths: A, B, C, D, A.D (a path of length 1 always stays in the required path set)
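
The example above can be reproduced with a short script. This is a sketch under assumptions rather than the APEX algorithm as published: paths are dotted strings, a path is counted once per query in which it occurs as a contiguous subpath, and the helper names `subpaths` and `mine_required_paths` are hypothetical.

```python
def subpaths(query):
    """All contiguous subpaths of a dotted path, e.g. 'A.D' -> {'A', 'D', 'A.D'}."""
    labels = query.split(".")
    return {
        ".".join(labels[i:j])
        for i in range(len(labels))
        for j in range(i + 1, len(labels) + 1)
    }

def mine_required_paths(required, workload, min_sup):
    """Count path occurrences in the workload and drop infrequent multi-label paths."""
    counts = {path: 0 for path in required}
    for query in workload:
        # subpaths() returns a set, so each query counts a path at most once.
        for path in subpaths(query):
            counts[path] = counts.get(path, 0) + 1
    threshold = min_sup * len(workload)
    # Single-label paths are always kept; longer paths need enough support.
    kept = {p for p, c in counts.items() if "." not in p or c >= threshold}
    return counts, kept

counts, kept = mine_required_paths(
    required=["A", "B", "C", "D", "B.D"],
    workload=["A.D", "C", "A.D"],
    min_sup=0.6,
)
print(counts)
print(sorted(kept))
```

With three queries the threshold is 0.6 × 3 = 1.8, so B.D (count 0) is dropped and A.D (count 2) is added, while C (count 1) survives only because it has a single label.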

7
APEX (3)

Platform: Pentium III 866 MHz, MS Windows 2000, 512 MB of main memory
Datasets: Play (regular), FlixML (irregular), GedML (highly irregular)

8
APEX (4)
9
XISS (1)

Similar to APEX, breaks XML data down into subunits: B+Tree indexes for element, attribute, name, value, and structure
Introduces a numbering scheme <order, size> to determine ancestor-descendant relationships in constant time
Uses join algorithms to produce results

10
XISS (2)

Each node is labeled with an extended preorder number (order) and a range of descendants (size)
Y is a descendant of X iff order(X) < order(Y) <= order(X) + size(X)
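
The constant-time check can be illustrated with a small sketch. It assumes the simplest possible labeling: a plain preorder number for `order` and the exact number of descendants for `size`, with no slack reserved for later insertions (which an extended preorder would allow). The tree layout and the names `label_tree` and `is_ancestor` are hypothetical.

```python
def label_tree(tree, start=1):
    """Assign (order, size) to every node of a (name, children) tree with unique names."""
    labels = {}

    def visit(node, order):
        name, children = node
        next_order = order + 1
        for child in children:
            next_order = visit(child, next_order)
        # size = number of descendants visited below this node
        labels[name] = (order, next_order - order - 1)
        return next_order

    visit(tree, start)
    return labels

def is_ancestor(x, y):
    """True iff the node labeled x is an ancestor of the node labeled y."""
    order_x, size_x = x
    order_y, _ = y
    return order_x < order_y <= order_x + size_x

tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
labels = label_tree(tree)
print(labels)
print(is_ancestor(labels["b"], labels["d"]), is_ancestor(labels["b"], labels["e"]))
```

The test needs only two integer comparisons per pair of nodes, so no path between them has to be traversed.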

11
Intermediate (XISS and APEX)

Advantages
Can handle both simple and complex queries
APEX introduces automatic detection and update of frequent query paths
XISS uses B+Trees for all indexes, taking advantage of RDBMS technology
Disadvantage
Join cost over the smaller subunits

12
Index Fabric

Uses a Patricia trie to index strings; stores all paths from the root to the leaves, plus refined paths
Key points: compact and balanced, therefore very efficient for indexing strings and for simple paths
Requires a DBA to define the refined paths, so it is less flexible than APEX, which detects frequent paths automatically
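
Indexing paths as strings can be sketched as follows. Everything here is an assumption made for illustration: the one-letter tag codes in `DESIGNATORS`, the sample values, and the helper names are hypothetical, and a sorted list searched with `bisect` stands in for the Patricia trie, which it does not implement.

```python
import bisect

# Hypothetical one-character codes for element tags.
DESIGNATORS = {"invoice": "I", "buyer": "B", "seller": "S", "name": "N"}

def encode(path, value):
    """Turn a root-to-leaf path plus its leaf value into a single string key."""
    return "".join(DESIGNATORS[tag] for tag in path) + value

keys = sorted([
    encode(["invoice", "buyer", "name"], "ABC Corp"),
    encode(["invoice", "seller", "name"], "Acme Inc"),
    encode(["invoice", "buyer", "name"], "Zed Ltd"),
])

def prefix_search(sorted_keys, prefix):
    """Return every key that starts with the given prefix."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi]

# A simple path query becomes one prefix lookup over the string keys.
matches = prefix_search(keys, encode(["invoice", "buyer", "name"], ""))
print(matches)
```

Because a whole path collapses into one key, a simple path query is a single string search; a branching query still needs several lookups, which is where refined paths come in.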

13
ViST (1)

Previous indexes handle branching path queries by decomposing the query into multiple sub-queries and joining the sub-query results to form the final answer
ViST encodes paths at the structure level, so complex queries need not be decomposed
The encoding uses both paths and values

14
ViST (2)

Preorder sequence: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
Encoding: (symbol, prefix) pairs

15
ViST (3)

Preorder sequence: PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
Query: find orders with Boston sellers and NY buyers (v5 = Boston, v7 = NY)
Encoded as (P, ε)(S, P)(L, PS)(v5, PSL)(B, P)(L, PB)(v7, PBL); no join required
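
The (symbol, prefix) encoding of the query above can be reproduced with a short preorder traversal. This is a sketch for illustration only: the `(symbol, children)` tuple layout and the function name `encode` are assumptions, and the empty string plays the role of the empty prefix of the root.

```python
def encode(node, prefix=""):
    """Preorder list of (symbol, prefix) pairs; prefix is the path from the root."""
    symbol, children = node
    pairs = [(symbol, prefix)]
    for child in children:
        # Each child sees the path that leads down to its parent.
        pairs.extend(encode(child, prefix + symbol))
    return pairs

# Branching query: orders (P) with a seller (S) located (L) in v5
# and a buyer (B) located (L) in v7.
query = ("P", [
    ("S", [("L", [("v5", [])])]),
    ("B", [("L", [("v7", [])])]),
])

pairs = encode(query)
print(pairs)
```

Both branches of the query end up in one sequence, so the query can be matched as a whole instead of being split into sub-queries whose results are joined.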

16
ViST (4)

Uses a suffix tree

17
ViST (5)

RIST: uses a static labeling scheme to determine ancestor-descendant relationships in the suffix tree
ViST: uses a dynamic labeling scheme to determine ancestor-descendant relationships; no suffix tree is required (hence the name Virtual Suffix Tree); implemented using B+Trees

18
ViST (6)

Implementation: B+Tree API from the Berkeley DB library
Platform: Linux machine with a 662 MHz Pentium III CPU and 256 MB of main memory
Datasets: DBLP and XMARK

19
XRegion

A generic method for mapping XML data into a relational database schema; this is important no matter which index structure is used
Partitions according to the cardinalities of node occurrences: reduces data fragmentation and stores related data in one table, which lowers I/O and join costs
Ancestor-descendant relationships are kept in a meta table for efficient lookup

20
Discussion: Desired Properties (1)

Handle both simple and complex path expressions while limiting the traversal and join costs required to answer queries
A labeling scheme for ancestor-descendant relationships is important for reducing traversal cost
Take advantage of existing relational database technologies (B+Trees, as used by XISS and ViST)

21
Discussion: Desired Properties (2)

The index structure should allow dynamic data insertion, deletion, structural changes, etc.
Use the query workload to mine frequent query paths, as shown in APEX
Use a good mapping strategy such as XRegion (less data fragmentation, lower I/O and join costs)
This helps in handling the increasing size of XML data by using relational database technologies

22
Conclusion/Recommendation

ViST seems to have the most desirable properties
XRegion outperforms other generic mapping methods
APEX is the only one that utilizes the query workload
Recommendation: use XRegion to map XML data into a relational database, use ViST to index the paths in XRegion's meta table, and use frequent query mining to provide refined-path functionality dynamically

23
References

[1] Chin-Wan Chung, Jun-Ki Min, and Kyuseok Shim. APEX: An adaptive path index for XML data. In SIGMOD Conference, pages 121-132, 2002.
[2] Brian Cooper, Neal Sample, Michael J. Franklin, Gisli R. Hjaltason, and Moshe Shadmon. A fast index for semistructured data. In VLDB, pages 341-350, 2001.
[3] Roy Goldman and Jennifer Widom. DataGuides: Enabling query formulation and optimization in semistructured databases. In VLDB, pages 436-445, 1997.
[4] Quanzhong Li and Bongki Moon. Indexing and querying XML data for regular path expressions. In VLDB, pages 361-370, 2001.
[5] Haixun Wang, Sanghyun Park, Wei Fan, and Philip S. Yu. ViST: A dynamic index method for querying XML data by tree structures. In SIGMOD Conference, pages 110-121, 2003.
[6] Meng Xue. XRegion: A structure-based approach to storing XML data in relational databases. Master of Science thesis, Computing Science, University of Alberta, 2004.