Presentation for Cmpe-521

Slides: 55

Provided by: evr5

1
Presentation for Cmpe-521

VIST: Virtual Suffix Tree
Prepared by
Evren CEYLAN - 2003700163
Asli UYAR - 2003701321

VIST:
A Dynamic Index Method for Querying XML Data by
Tree Structures
Written by Haixun Wang, Sanghyun Park, Wei Fan, and
Philip S. Yu (SIGMOD 2003)

3
What is XML?

XML: Extensible Markup Language
XML is of great importance in data exchange.
As a result, much research has gone into providing
flexible query mechanisms for extracting
data from XML documents.

4
VIST: Virtual Suffix Tree

In this paper, VIST is proposed for searching XML
documents.
XML documents and XML queries are represented
as structure-encoded sequences (explained in the
following slides).
With this representation, querying XML data is
shown to be equivalent to finding
subsequence matches.

5
Index Methods in XML

Previous index methods:
Disassemble a query into multiple sub-queries,
and then join the results of these sub-queries to
produce the final answers.

6
What does VIST do?

Converts both XML data and XML queries to
structure-encoded sequences.
Uses tree structures as the basic unit of query
in order to avoid highly expensive join
operations.
In other words, uses structure-encoded sequences
instead of nodes or paths.

7
What does VIST do?

Matches structured queries against structured
data as a whole, without breaking down the
queries into sub-queries of paths or nodes and
relying on join operations.
Supports dynamic index update.

What does VIST do?
In this paper, VIST is shown to be
effective and efficient in supporting structural
queries.

9
Introduction

XML has growing importance in data exchange
(extracting data from XML documents).
XML provides a flexible way to define
semi-structured data.
In this paper a novel index structure called
VIST (Virtual Suffix Tree) is introduced.
VIST offers better performance and usability
than previous approaches to XML indexing.

In XML query language design, expressing complex
structural or graphical queries is one of the
major concerns.
(Figure 2 displays four sample queries in
graph form.)

11
In previous approaches

i. Indexes are created on paths (e.g. /P/S/I/M
in Q1). Path indexes can answer simple queries
efficiently (Q1 has no branches).
ii. However, queries that involve branching
structures (such as Q2) have to be disassembled
into sub-queries, whose results are then combined by
expensive join operations to produce the final results.
iii. So, these methods are inefficient in
handling queries with branching structures.

12
In VIST approach

Objective: to provide a general method so that
structural XML queries need not be decomposed
into sub-queries.
Result: no need to perform expensive join
operations.

13
Method

XML data and XML queries are transformed into
structure-encoded sequences.
A virtual suffix tree is used to organize the
structure-encoded sequences.
VIST also speeds up the matching process.

14
Structure

VIST's index structure has two parts: the
D-Ancestor index and the S-Ancestor index (explained
in the following slides).
VIST unifies structural indexes and value indexes
into a single index.
To achieve this, a method called dynamic virtual
suffix tree labeling is proposed (index
updates can be performed directly on the BTrees).

15
Structure-Encoded Sequences

Sequential representation of both XML data and
XML queries.

Objective: modeling XML queries through
sequence matching lets us avoid unnecessary
join operations in query processing.
Result: structure-encoded sequences are used
instead of paths or nodes.

17
Mapping Data and Queries to Structure-Encoded
Sequences

Stage 1
Let's consider the purchase record example in
Figure 3.
Notation: capital letters represent names of
attributes.
Lowercase letters represent attribute
values.
To encode attribute values as integers we use
a hash function h( ).
e.g. v1 = h(dell) and v2 = h(ibm)
v1 and v2 are used to represent dell and ibm
respectively.
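
A minimal sketch of the value-encoding step in Python. The slides only say that a hash function h( ) is used; choosing CRC-32 from the standard library here is an assumption made so that the same value maps to the same integer on every run.

```python
import zlib

def h(value: str) -> int:
    """Map an attribute value to a non-negative integer.

    CRC-32 is a stand-in chosen for this sketch; the slides do not
    say which hash function is actually used.
    """
    return zlib.crc32(value.encode("utf-8"))

v1 = h("dell")   # integer code standing for the value "dell"
v2 = h("ibm")    # integer code standing for the value "ibm"
```

Python's built-in hash() is avoided on purpose: for strings it is randomized per process, which would make stored index keys unusable after a restart.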

18
Stage 2

An XML document is represented by the preorder
sequence of its tree structure.
e.g. the preorder sequence of the tree in Figure 3
is
PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8

19
Stage 3

Definition: a structure-encoded sequence is a
sequence of (symbol,prefix) pairs
D = (a1,p1), (a2,p2), . . . , (an,pn)
ai: a node in the XML document tree.
pi: the path from the root node to node ai.

Figure 3 can be converted into a
structure-encoded sequence.
D ... ... (Figure 4)
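
The definition can be sketched in a few lines of Python. This is an illustration, not the paper's code: the nested-tuple tree format and the function name encode are assumptions, the empty string stands for the empty prefix of the root, and the small tree is chosen so that the result equals Doc1 from the example slides.

```python
def encode(tree, prefix=""):
    """Return the preorder list of (symbol, prefix) pairs of a tree.

    A tree is a (symbol, children) tuple; the prefix of a node is the
    concatenation of the symbols on the path from the root to its parent.
    """
    symbol, children = tree
    pairs = [(symbol, prefix)]
    for child in children:
        pairs.extend(encode(child, prefix + symbol))
    return pairs

# Hypothetical document: P has child S; S has children N (value v1) and L (value v2).
doc = ("P", [("S", [("N", [("v1", [])]),
                    ("L", [("v2", [])])])])
sequence = encode(doc)
# sequence == [("P", ""), ("S", "P"), ("N", "PS"),
#              ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
```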

21
Benefits

The benefit of modeling XML queries through sequence
matching is that structural queries can be processed as a
whole instead of being broken into smaller query
units (paths or nodes of the XML document tree).
This matters because combining the results of
sub-queries by join operations is expensive.

22
The VIST Approach

Presented in 3 stages:
Naïve algorithm: based on suffix trees
RIST: improves the naïve algorithm by using
BTrees to index suffix tree nodes
VIST: an index structure relying only on
the BTrees

23
Requirements

An XML indexing method needs to meet the following:
It should support structural queries directly. This
is done with structure-encoded sequences.
Instead of relying on suffix trees, the index
method should use better indexing techniques such as
BTrees.
The index structure should allow dynamic data
insertion, deletion, etc.

24
A Naïve Algorithm Based on Suffix Trees

The most widely used index structure for
subsequence matching is the suffix tree.

25
Example

2 XML documents, Doc1 and Doc2, and
2 XML queries, Q1 and Q2,
as structure-encoded sequences:
Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
Doc2 = (P,e) (B,P) (L,PB) (v2,PBL)
Q1 = (P,e) (B,P) (L,PB) (v2,PBL)
Q2 = (P,e) (L,P) (v2,PL)

26
Example (Contd)

A tree structure for Doc1 and Doc2 is shown in
Figure 5

27
Example (Contd)

As shown above, elements in the sequences
correspond to nodes in the suffix tree.
Since the nodes are involved in 2 different trees
(the XML document tree and the suffix tree),
there are 2 kinds of ancestor-descendant
relationships among the nodes:
i ) D-Ancestorship (in the XML document tree)
e.g. (S,P) is a D-ancestor of (L,PS)
ii ) S-Ancestorship (in the suffix tree)
e.g. (v1,PSN) is an S-ancestor of (L,PS)

28
Naïve Algorithm based on the suffix trees

The NaiveSearch algorithm is based on suffix trees.
It is a naïve method for non-contiguous
subsequence matching.

29
For example to match Q2

Start with the root node, which matches the 1st
element of Q2, that is (P,e).
Then search under the root for all nodes that
match (L,P), which yields (L,PS) and (L,PB).
Finally, search for
- (v2,PSL) under the node labeled (L,PS)
- (v2,PBL) under the node labeled (L,PB)
Algorithm 1 searches nodes first by
S-Ancestorship, and then by D-Ancestorship.
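
The walk-through above can be sketched in Python as follows. This is an illustrative sketch, not the paper's Algorithm 1. Three things are assumptions made here: the suffix tree is approximated by a trie of whole sequences, the empty string stands for the root prefix written e on the slides, and query prefixes may contain a `*` wildcard, so that Q2 is written with (L,P*) and (v2,P*L) in order to match both (L,PS) and (L,PB). All class and function names are hypothetical.

```python
class Node:
    def __init__(self, element=None):
        self.element = element      # (symbol, prefix); None for the dummy root
        self.children = {}
        self.doc_ids = []           # documents whose sequence ends at this node

def insert(root, sequence, doc_id):
    node = root
    for element in sequence:
        node = node.children.setdefault(element, Node(element))
    node.doc_ids.append(doc_id)

def descendants(node):
    for child in node.children.values():
        yield child
        yield from descendants(child)

def match_prefix(pattern, prefix, binding):
    """Return (matched, binding); '*' stands for zero or more symbols."""
    if binding is not None:                 # reuse what '*' matched earlier
        pattern = pattern.replace("*", binding)
    if "*" not in pattern:
        return pattern == prefix, binding
    head, tail = pattern.split("*", 1)
    ok = (len(prefix) >= len(head) + len(tail)
          and prefix.startswith(head) and prefix.endswith(tail))
    return ok, (prefix[len(head):len(prefix) - len(tail)] if ok else None)

def naive_search(node, query, i=0, binding=None, found=None):
    found = set() if found is None else found
    if i == len(query):                     # whole query matched
        found.update(node.doc_ids)
        for d in descendants(node):
            found.update(d.doc_ids)
        return found
    symbol, pattern = query[i]
    for d in descendants(node):             # S-Ancestorship: scan the subtree
        if d.element[0] != symbol:
            continue
        ok, new_binding = match_prefix(pattern, d.element[1], binding)
        if ok:                              # D-Ancestorship: prefix agrees
            naive_search(d, query, i + 1, new_binding, found)
    return found

doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]   # wildcard form is an assumption

root = Node()
insert(root, doc1, 1)
insert(root, doc2, 2)
r1 = sorted(naive_search(root, q1))   # [2]
r2 = sorted(naive_search(root, q2))   # [1, 2]
```

The loop over descendants(node) is the subtree scan that makes this approach expensive on large indexes.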

30
Difficulties of Naive Algorithm

There are difficulties in using a suffix tree to
index structure-encoded sequences.
The major difficulty:
Searching for nodes satisfying both
S-Ancestorship and D-Ancestorship is extremely
costly, because we need to traverse a large
portion of the subtree for each match.

31
RIST: Indexing by Ancestor-Descendant
Relationships

Improves the naïve algorithm by eliminating the
expensive traversal operations in the suffix tree.
When we reach node X after matching, we can jump
directly to those nodes Y of which X is both a
D-Ancestor and an S-Ancestor.
So, we no longer need to search among the
descendants of X to find the Ys one by one.

32
RIST Algorithm

1. Index nodes in the suffix tree by their
(Symbol,Prefix) pairs, using a
BTree.
i. This enables us to search
nodes by their (Symbol,Prefix) pairs, that is, by
D-Ancestorship.
ii. This BTree is
called the D-Ancestorship BTree.

33
RIST Algorithm

2. Among all the nodes satisfying D-Ancestorship,
we are interested in the ones satisfying
S-Ancestorship as well.
i. Labels are created for suffix tree nodes in
order to tell the relationship between 2 nodes.
ii. We use BTrees to index nodes by labels.
iii. This BTree is called the S-Ancestorship
BTree.

34
Labeling Notation

<nx, sizex>
nx: the prefix traversal order of x in the suffix
tree.
sizex: the total number of descendants of x in the
suffix tree.
This kind of labeling is shown in Figure 5.

35
Labeling Notation

Note: with this labeling, the S-Ancestorship
between any two nodes can be decided easily.
If x and y are labeled <nx, sizex> and <ny,
sizey>, node x is an S-Ancestor of y if ny ∈ (nx,
nx + sizex].
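
As a sketch of this labeling in Python: the tree below is the trie formed by Doc1 and Doc2 from the earlier example, nodes are identified by their (symbol, prefix) pairs, and the empty string stands for the root prefix e. The dictionary format and function names are assumptions of this sketch, and the resulting numbers are not claimed to match Figure 5.

```python
def label(children, root):
    """Assign <n, size> to every node.

    n is the preorder (prefix traversal) rank, size the number of
    descendants. `children` maps a node to its list of child nodes.
    """
    labels = {}
    counter = 0

    def visit(node):
        nonlocal counter
        n = counter
        counter += 1
        size = sum(visit(child) + 1 for child in children.get(node, []))
        labels[node] = (n, size)
        return size

    visit(root)
    return labels

def is_s_ancestor(x_label, y_label):
    """x is an S-Ancestor of y if ny lies in (nx, nx + sizex]."""
    (nx, sizex), (ny, _) = x_label, y_label
    return nx < ny <= nx + sizex

children = {
    ("P", ""): [("S", "P"), ("B", "P")],
    ("S", "P"): [("N", "PS")],
    ("N", "PS"): [("v1", "PSN")],
    ("v1", "PSN"): [("L", "PS")],
    ("L", "PS"): [("v2", "PSL")],
    ("B", "P"): [("L", "PB")],
    ("L", "PB"): [("v2", "PBL")],
}
labels = label(children, ("P", ""))
# labels[("P", "")] == (0, 8), labels[("S", "P")] == (1, 4), labels[("B", "P")] == (6, 2)
```

With these labels, is_s_ancestor(labels[("S", "P")], labels[("L", "PS")]) holds, while the same test against labels[("L", "PB")] fails, since 7 lies outside (1, 5].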

36
Constructing the BTrees

Insert all suffix tree nodes into the
D-Ancestorship BTree using their (Symbol,Prefix)
pairs as keys.
All nodes x inserted with the same
(Symbol,Prefix) are indexed by an
S-Ancestorship BTree, using the nx values of
their labels as keys.
Shown in Figure 6.
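
The two-level lookup can be sketched in Python with ordinary containers standing in for the BTrees (an assumption of this sketch): a dict plays the D-Ancestorship BTree, and a list of <n, size> labels sorted by n plays each S-Ancestorship BTree. The labels are the preorder labels of the trie formed by Doc1 and Doc2; the function name jump is hypothetical.

```python
from bisect import bisect_right

# D-Ancestorship index: (symbol, prefix) -> labels <n, size>, sorted by n.
d_index = {
    ("P", ""): [(0, 8)],
    ("S", "P"): [(1, 4)],
    ("N", "PS"): [(2, 3)],
    ("v1", "PSN"): [(3, 2)],
    ("L", "PS"): [(4, 1)],
    ("v2", "PSL"): [(5, 0)],
    ("B", "P"): [(6, 2)],
    ("L", "PB"): [(7, 1)],
    ("v2", "PBL"): [(8, 0)],
}

def jump(key, nx, sizex):
    """Return the nodes with the given (symbol, prefix) key that are
    S-descendants of a node labeled <nx, sizex>.

    The key lookup enforces D-Ancestorship; the range query
    nx < n <= nx + sizex enforces S-Ancestorship.
    """
    entries = d_index.get(key, [])
    ns = [n for n, _ in entries]
    lo = bisect_right(ns, nx)             # first entry with n > nx
    hi = bisect_right(ns, nx + sizex)     # first entry with n > nx + sizex
    return entries[lo:hi]

under_root = jump(("L", "PS"), 0, 8)      # [(4, 1)]
under_s = jump(("L", "PB"), 1, 4)         # []: (L,PB) is not below (S,P)
```

Each step of a query then costs one key lookup plus one range query, instead of a scan over the whole subtree.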

37
Building the DocID BTree

The DocID BTree stores, for each node x (using nx
as key), the document IDs of those XML sequences
that end at node x when they are inserted into
the suffix tree.
Shown in DocID BTree

38
In summary

Unlike the naïve algorithm, RIST does not use
suffix trees for subsequence matching (it uses the
D-Ancestorship BTree and S-Ancestorship BTree).
From any node, instead of searching the entire
subtree under the node, we can jump to the
sub-nodes that match the next element in the query.
So, RIST supports non-contiguous subsequence
matching efficiently.

39
VIST: The Virtual Suffix Tree

RIST uses a static scheme to label suffix tree
nodes, which prevents it from supporting
dynamic insertions.
For any node x labeled <n, size>, later
insertions can change the number of nodes that
appear before x (in the prefix order),
as well as the size of the subtree rooted at x,
which means neither n nor size can be fixed.

40
VIST: The Virtual Suffix Tree

The purpose of the suffix tree is to provide a
labeling mechanism to encode S-Ancestorship.
Suppose a node x is created for element di
during the insertion of the sequence
d1, ..., di, ..., dk.

41
VIST: The Virtual Suffix Tree

If we can estimate
i. how many different elements may
follow di in future insertions, and
ii. the occurrence probability of each of these
elements,
then we can label x's child nodes right away instead of
waiting until all sequences are inserted.

42
VIST: The Virtual Suffix Tree (Contd)

It also means that
the suffix tree itself is no longer needed,
because its labeling mechanism is inefficient.
With this labeling, dynamic data insertion and
deletion are supported.

43
Top down scope allocation

A tree structure defines nested scopes: the
scope of a child node is a subscope of its parent
node, and the root node has the maximum scope, which
covers the scope of every node.

44
Top down scope allocation

In dynamic scope allocation there is a parameter
called λ, which is the expected number of child
nodes of any node.
λ is usually assumed to be 2.
Without knowledge of the occurrence rate of
each child node, 1/λ of the remaining scope
is allocated to x's 1st inserted child, 1/λ of what
then remains to the 2nd, and so on.
For a node labeled <n, size> and λ = 2:
Child1: <n+1, size/2>
Child2: <n+1+size/2, size/4>

45
Dynamic scope of a Suffix Tree Node

The dynamic scope of a node is a triple
<n, size, k>,
where k is the number of subscopes allocated
inside the current scope.
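
A minimal Python sketch of top-down scope allocation with the triple <n, size, k>. Several details are assumptions of this sketch rather than statements from the slides: a scope is treated as the half-open range [n, n + size), the node itself uses the number n so that size - 1 numbers remain for its children, and integer division is used. Under these assumptions the subscopes come out close to, but not exactly, the size/2 and size/4 figures of the slide.

```python
from dataclasses import dataclass

@dataclass
class Scope:
    n: int          # label of the node
    size: int       # width of the scope [n, n + size)
    k: int = 0      # number of subscopes already allocated inside

def allocate_child(parent: Scope, lam: int = 2) -> Scope:
    """Give the next child 1/lam of the parent's remaining scope."""
    remaining = parent.size - 1          # the parent itself occupies n
    start = parent.n + 1
    for _ in range(parent.k):            # replay the k earlier allocations
        taken = remaining // lam
        start += taken
        remaining -= taken
    child = Scope(start, remaining // lam)
    if child.size < 1:
        raise ValueError("scope exhausted")   # not handled in this sketch
    parent.k += 1
    return child

root = Scope(0, 20480)
first = allocate_child(root)     # Scope(n=1, size=10239, k=0)
second = allocate_child(root)    # Scope(n=10240, size=5120, k=0)
```

Storing k is what lets the next subscope be computed without looking at any sibling: the earlier allocations can be replayed from <n, size> alone.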

46
Algorithm of VIST

VIST uses the same sequence matching algorithm as
RIST.
A dynamic method for labeling suffix tree nodes is
presented that does not require building the suffix tree.

47
Algorithm of VIST

The method relies on insensitive estimations of
the number of attribute values.
Because of that, the labeling mechanism is based
on a virtual suffix tree.

Example
- let's look at the index structure before
and after insertion

49
Algorithm of VIST

Suppose that, before the insertion, the index structure
already contains the following sequence:
Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS)
(v2,PSL)
The sequence to be inserted:
Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

50
Assumptions of the Example

There are 2 assumptions for the example:
Max = 20480
The dynamic scope allocation method uses the
parameter λ = 2

The insertion process is much like that of
inserting a sequence into a suffix tree.
We follow the branches, and when there is no
branch to follow we create one.
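
The insertion of Doc2 after Doc1 can be sketched in Python as below, with Max = 20480 and λ = 2 as in the example. Everything else is an assumption of this sketch: a plain dict stands in for the BTrees, a virtual root with scope <0, Max> sits above (P,e), each entry records the label of its parent so that "following a branch" can be decided from the index alone, a scope is the half-open range [n, n + size), and scope exhaustion is not handled. The labels produced are therefore illustrative and may differ from those in the paper's figure.

```python
def allocate(parent, lam):
    """Carve the next subscope (1/lam of what remains) out of parent."""
    remaining, start = parent["size"] - 1, parent["n"] + 1
    for _ in range(parent["k"]):          # skip subscopes already handed out
        taken = remaining // lam
        start, remaining = start + taken, remaining - taken
    parent["k"] += 1
    return {"n": start, "size": remaining // lam, "k": 0,
            "parent": parent["n"], "docs": []}

def insert(index, root, sequence, doc_id, lam=2):
    """Follow existing branches; create a labeled entry when none exists."""
    node = root
    for element in sequence:
        entries = index.setdefault(element, [])
        child = next((e for e in entries if e["parent"] == node["n"]), None)
        if child is None:
            child = allocate(node, lam)
            entries.append(child)
        node = child
    node["docs"].append(doc_id)

root = {"n": 0, "size": 20480, "k": 0, "parent": None, "docs": []}
index = {}   # (symbol, prefix) -> list of entries; "" stands for the prefix e
doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]
insert(index, root, doc1, 1)
insert(index, root, doc2, 2)

l_labels = [(e["n"], e["size"]) for e in index[("L", "PS")]]
# l_labels == [(5, 639), (2562, 1279)]
```

Doc2 reuses the entries for (P,e) and (S,P), then receives a fresh (L,PS) entry inside the scope of (S,P), next to the subscope already given to (N,PS). No existing label is changed by the insertion.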

52
CONCLUSION

VIST (a dynamic index method) is developed for
XML documents.
XML data and XML queries are converted into
sequences that encode their structural
information.

53
VIST's Pros

Uses tree structures as the basic unit of query to
avoid expensive join operations.
Supports dynamic data insertion and deletion.
Unlike some data structures used in other
approaches, the index structure of VIST, which is
based on BTrees, is well supported by DBMSs.