Title: Presentation for Cmpe-521
1Presentation for Cmpe-521
- VIST Virtual Suffix Tree
- Prepared by
- Evren CEYLAN 2003700163
- Asli UYAR - 2003701321
2- VIST
- A Dynamic Index Method for Querying XML Data by
Tree Structures
- Written by Haixun Wang, Sanghyun Park, Wei Fan,
Philip S. Yu (SIGMOD 2003)
3What is XML?
- XML: Extensible Markup Language
- Plays an important role in data exchange.
- As a result, much research has gone into flexible
query mechanisms for extracting data from XML
documents.
4VIST Virtual Suffix Tree
- In this paper, VIST is proposed for searching XML
documents.
- XML documents and XML queries are represented as
structure-encoded sequences (explained in the
following slides).
- With this representation, querying XML data is
equivalent to finding subsequence matches.
5Index Methods in XML
- Previous index methods
- Disassemble a query into multiple sub-queries,
and then join the results of these sub-queries to
provide final answers.
6What does VIST do?
- Converts both XML data and XML queries to
structure-encoded sequences.
- Uses tree structures as the basic unit of query
in order to avoid highly expensive join
operations.
- In other words, uses structure-encoded sequences
instead of nodes or paths.
7What does VIST do?
- Matches structured queries against structured
data as a whole, without breaking the queries
down into sub-queries of paths or nodes and
relying on join operations.
- Supports dynamic index update.
8- What does VIST do?
- In this paper, it is shown that VIST is
effective and efficient in supporting structural
queries.
9Introduction
- XML has a growing importance in data exchange
(extracting data from XML documents).
- XML provides a flexible way to define
semi-structured data.
- In this paper a novel index structure called
VIST (Virtual Suffix Tree) is introduced.
- VIST offers better performance and usability
than previous approaches to XML indexing.
10 - In XML query language design, expressing complex
structural or graph queries is one of the
major concepts.
- (Figure 2 shows four sample queries in graph
form.)
11In previous approaches
- i. Indexes are created on paths (e.g. /P/S/I/M
in Q1). Path indexes can answer simple queries
efficiently (Q1 has no branches).
- ii. However, queries that involve branching
structures (such as Q2) have to be disassembled
into sub-queries, whose results are then combined
by expensive join operations to produce the final
results.
- iii. So these methods are inefficient at
handling such branching queries.
12In VIST approach
- Objective: to provide a general method so that
structural XML queries need not be decomposed
into sub-queries.
- Result: no need to perform expensive join
operations.
13Method
- XML data and XML queries are transformed into
structure-encoded sequences.
- A Virtual Suffix Tree is used to organize the
structure-encoded sequences.
- VIST also speeds up the matching process.
14Structure
- VIST's index structure includes two parts: the
D-Ancestor index and the S-Ancestor index
(explained in the following slides).
- VIST unifies structural indexes and value indexes
into a single index.
- To achieve this, a method called dynamic virtual
suffix tree labeling is proposed (index updates
can be performed directly on BTrees).
15Structure-Encoded Sequences
- Sequential representation of both XML Data and
XML Queries.
16- Objective: modeling XML queries through
sequence matching lets us avoid unnecessary
join operations in query processing.
- Result: structure-encoded sequences are used
instead of paths or nodes.
17Mapping Data and Queries to Structure-Encoded
Sequences
- Stage 1
- Let's consider the purchase record example in
Figure 3.
- Notation: capital letters represent names of
attributes.
- Lowercase letters represent attribute values.
- To encode attribute values as integers we use a
hash function h( ).
- e.g. v1 = h(dell) and v2 = h(ibm)
- v1 and v2 are used to represent dell and ibm,
respectively.
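A minimal sketch of the value-encoding step. The slides do not say which hash function h( ) is; CRC32 is used here purely as an assumed stand-in because it gives the same integer on every run.

```python
import zlib

def h(value: str) -> int:
    """Map an attribute value to a non-negative integer.

    CRC32 is an assumption of this sketch, not the function used in the paper.
    """
    return zlib.crc32(value.encode("utf-8"))

v1 = h("dell")   # integer code standing in for the value "dell"
v2 = h("ibm")    # integer code standing in for the value "ibm"
```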
18Stage 2
- Represent an XML document by the preorder
sequence of its tree structure.
- e.g. the preorder sequence of the tree in Figure 3
is
- PSNv1IMv2Nv3IMv4INv5Lv6BLv7Nv8
19Stage 3
- Definition: a structure-encoded sequence is a
sequence of (symbol,prefix) pairs
- D = (a1,p1), (a2,p2), . . . , (an,pn)
- ai: a node in the XML document tree.
- pi: the path from the root node to node ai.
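The definition above can be sketched as a preorder walk that carries the path from the root along with it. The tuple-based tree layout and the name `encode` are assumptions made for illustration; the empty string plays the role of the empty prefix (written e in the slides), and the small tree is chosen so that the output equals the Doc1 sequence used in the later example slides.

```python
def encode(node, prefix=""):
    """Yield (symbol, prefix) pairs of a tree in preorder.

    node is a (symbol, children) tuple; this layout is hypothetical.
    """
    symbol, children = node
    yield (symbol, prefix)
    for child in children:
        # The path to a child is the parent's path plus the parent's symbol.
        yield from encode(child, prefix + symbol)

# P has one child S; S has children N (value v1) and L (value v2).
doc1_tree = ("P", [("S", [("N", [("v1", [])]),
                          ("L", [("v2", [])])])])

sequence = list(encode(doc1_tree))
```

Reading only the first component of each pair gives back the plain preorder sequence of Stage 2.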
20- Figure 3 can be converted into the
structure-encoded sequence.
- D ... ... (Figure 4)
21Benefits
- The benefit of modeling XML queries through
sequence matching is that structural queries can
be processed as a whole instead of being broken
into smaller query units (paths or nodes of the
XML document tree).
- Combining the results of the sub-queries by join
operations is expensive.
22The VIST Approach
- Presented in 3 stages
- Naïve algorithm: based on suffix trees
- RIST: improves the naïve algorithm by using
BTrees to index suffix tree nodes
- VIST: an index structure that relies only on
BTrees
23Requirements
- An XML indexing method needs to meet the
following requirements
- It should support structural queries directly.
This is done by structure-encoded sequences.
- Instead of relying on suffix trees, the index
method should use better-supported indexing
techniques such as BTrees.
- The index structure should allow dynamic data
insertion and deletion, etc.
24A Naïve Algorithm Based on Suffix Trees
- The most widely used index structure for
subsequence matching is the suffix tree.
25Example
- 2 XML documents called Doc1 and Doc2,
- 2 XML queries called Q1 and Q2,
- all given as structure-encoded sequences
(e denotes the empty prefix).
-
- Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
- Doc2 = (P,e) (B,P) (L,PB) (v2,PBL)
-
- Q1 = (P,e) (B,P) (L,PB) (v2,PBL)
- Q2 = (P,e) (L,P) (v2,PL)
26Example (Contd)
- A tree structure for Doc1 and Doc2 is shown in
Figure 5
27Example (Contd)
- As shown above, the elements of the sequences
are represented by nodes in the suffix tree.
- Since the nodes take part in 2 different trees
(the XML document tree and the suffix tree),
there are 2 kinds of ancestor-descendant
relationships among the nodes.
- i ) D-Ancestorship (ancestor in the document tree)
- e.g. (S,P) is a D-ancestor of (L,PS)
- ii ) S-Ancestorship (ancestor in the suffix tree)
- e.g. (v1,PSN) is an S-ancestor of (L,PS)
28Naïve Algorithm based on the suffix trees
- The NaiveSearch algorithm is based on suffix trees.
- It is a naïve method for non-contiguous
subsequence matching.
29For example to match Q2
- Start with the root node, which matches the 1st
element of Q2, that is (P,e).
- Then search under the root for all nodes that
match (L,P), i.e. L nodes anywhere below P, which
yields (L,PS) and (L,PB).
- Finally, search for
- - (v2,PSL) under the node labeled (L,PS)
- - (v2,PBL) under the node labeled (L,PB)
- Algorithm 1 searches nodes first by
S-Ancestorship, and then by D-Ancestorship.
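The walk-through above can be sketched as follows. This is not Algorithm 1 from the paper but a simplified illustration under stated assumptions: a plain trie of the document sequences stands in for the suffix tree, the "anywhere below P" part of Q2 is written explicitly as a `*` wildcard (at most one per prefix), and all names (`Node`, `insert`, `match_prefix`, `naive_search`) are hypothetical.

```python
class Node:
    def __init__(self, elem=None):
        self.elem = elem        # (symbol, prefix); None for the virtual root
        self.children = {}      # elem -> Node
        self.doc_ids = []       # documents whose sequence ends here

def insert(root, seq, doc_id):
    node = root
    for elem in seq:
        node = node.children.setdefault(elem, Node(elem))
    node.doc_ids.append(doc_id)

def descendants(node):
    for child in node.children.values():
        yield child
        yield from descendants(child)

def match_prefix(pattern, prefix, binding):
    """Return (matched, binding); binding is the text the wildcard stands for."""
    if "*" not in pattern:
        return pattern == prefix, binding
    head, tail = pattern.split("*", 1)
    if binding is not None:                     # wildcard already instantiated
        return head + binding + tail == prefix, binding
    fits = (len(prefix) >= len(head) + len(tail)
            and prefix.startswith(head) and prefix.endswith(tail))
    return fits, (prefix[len(head):len(prefix) - len(tail)] if fits else None)

def naive_search(node, query, i=0, binding=None):
    if i == len(query):                         # whole query matched
        found = set(node.doc_ids)
        for d in descendants(node):
            found.update(d.doc_ids)
        return found
    symbol, pattern = query[i]
    result = set()
    for d in descendants(node):                 # S-Ancestorship: scan the subtree
        if d.elem[0] != symbol:
            continue
        ok, new_binding = match_prefix(pattern, d.elem[1], binding)
        if ok:                                  # D-Ancestorship: prefix agrees
            result |= naive_search(d, query, i + 1, new_binding)
    return result

root = Node()
insert(root, [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"),
              ("L", "PS"), ("v2", "PSL")], 1)
insert(root, [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")], 2)

q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]
```

Here `naive_search(root, q2)` visits every descendant of each matched node, which is exactly the cost the next slides set out to remove.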
30Difficulties of Naive Algorithm
- There are difficulties in using a suffix tree to
index structure-encoded sequences.
- The major difficulty is explained below
- Searching for nodes that satisfy both
S-Ancestorship and D-Ancestorship is extremely
costly (because we need to go over a large
portion of the subtree for each match).
31RIST Indexing by Ancestor-Descendant
Relationships
- Improves the naïve algorithm by eliminating the
expensive subtree traversals in the suffix tree.
- When we reach node X after a match, we can jump
directly to those nodes Y of which X is both a
D-Ancestor and an S-Ancestor.
- So we no longer need to search among the
descendants of X to find the Ys one by one.
32RIST Algorithm
- 1. Index the nodes of the suffix tree by their
(Symbol,Prefix) pairs, using a BTree.
- i. This lets us look up nodes by their
(Symbol,Prefix) pairs, which is how
D-Ancestorship is checked.
- ii. This BTree is called the D-Ancestorship
BTree.
33RIST Algorithm
- 2. Among all the nodes satisfying D-Ancestorship,
we are interested in the ones that satisfy
S-Ancestorship as well.
- i. Labels are created for the suffix tree nodes
in order to tell the relationship between 2 nodes.
- ii. We use BTrees to index the nodes by label.
- iii. This BTree is called the S-Ancestorship
BTree.
34Labeling Notation
- <nx, sizex>
- nx: prefix (preorder) traversal order of x in the
suffix tree.
- sizex: total number of descendants of x in the
suffix tree.
- This kind of labeling is shown in Figure 5.
35Labeling Notation
- Note: with this labeling, the S-Ancestorship
between any two nodes can be decided easily.
- If x and y are labeled <nx, sizex> and <ny,
sizey>, node x is an S-Ancestor of y if
ny ∈ (nx, nx + sizex].
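A small sketch of this labeling and of the resulting ancestor test. The tuple-based tree and the function names are assumptions for illustration; the tree is the trie formed by Doc1 and Doc2 from the example slides, with node names written out as strings.

```python
def assign_labels(node, labels, next_n=0):
    """Label every node with (n, size): preorder number and descendant count."""
    name, children = node
    n = next_n
    next_n += 1
    for child in children:
        next_n = assign_labels(child, labels, next_n)
    labels[name] = (n, next_n - n - 1)
    return next_n

def is_s_ancestor(label_x, label_y):
    """x is an S-Ancestor of y when ny lies in (nx, nx + sizex]."""
    nx, size_x = label_x
    ny, _ = label_y
    return nx < ny <= nx + size_x

tree = ("(P,e)", [
    ("(S,P)", [("(N,PS)", [("(v1,PSN)", [("(L,PS)", [("(v2,PSL)", [])])])])]),
    ("(B,P)", [("(L,PB)", [("(v2,PBL)", [])])]),
])

labels = {}
assign_labels(tree, labels)
```

With these labels the test needs two comparisons and no tree traversal, which is what makes it usable as a BTree range condition.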
36Constructing the BTrees
- Insert all suffix tree nodes into the
D-Ancestorship BTree using their (Symbol,Prefix)
pairs as keys.
- All nodes x inserted with the same
(Symbol,Prefix) are indexed by an
S-Ancestorship BTree, using the nx values of
their labels as keys.
- Shown in Figure 6
37Building the DocID BTree
- The DocID BTree stores, for each node x (using nx
as key), the document IDs of those XML sequences
that end at node x when they are inserted into
the suffix tree.
- Shown in DocID BTree
38In summary
- Unlike the naïve algorithm, RIST does not
traverse the suffix tree for subsequence matching
(it uses the D-Ancestorship BTree and the
S-Ancestorship BTree).
- From any node, instead of searching the entire
subtree under the node, we can jump to the
descendant nodes that match the next element in
the query.
- So RIST supports non-contiguous subsequence
matching efficiently.
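The three indexes and the "jump" can be sketched as below. This is an illustration under assumptions, not the paper's implementation: Python dictionaries and sorted lists searched with `bisect` stand in for the BTrees, a virtual root with label 0 sits above (P,e), queries use exact (symbol, prefix) pairs without wildcards, and every function name is hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

class Node:
    def __init__(self):
        self.children = {}   # (symbol, prefix) -> Node
        self.doc_ids = []    # ids of sequences that end at this node

def build_indexes(docs):
    root = Node()                              # virtual root above (P, "")
    for doc_id, seq in docs.items():
        node = root
        for elem in seq:
            node = node.children.setdefault(elem, Node())
        node.doc_ids.append(doc_id)

    d_index = defaultdict(list)  # D-Ancestor: (symbol, prefix) -> ascending n
    size_of = {}                 # n -> number of descendants
    doc_index = {}               # DocID: n -> document ids
    counter = 0

    def visit(node, elem):
        nonlocal counter
        n = counter
        counter += 1
        if elem is not None:
            d_index[elem].append(n)    # preorder visit keeps each list sorted
        if node.doc_ids:
            doc_index[n] = node.doc_ids
        for child_elem, child in node.children.items():
            visit(child, child_elem)
        size_of[n] = counter - n - 1

    visit(root, None)
    return d_index, size_of, doc_index

def rist_search(query, d_index, size_of, doc_index):
    doc_ns = sorted(doc_index)
    results = set()

    def step(i, n):
        last = n + size_of[n]                  # scope of the current node
        if i == len(query):                    # collect documents in scope
            for m in doc_ns[bisect_left(doc_ns, n):bisect_right(doc_ns, last)]:
                results.update(doc_index[m])
            return
        candidates = d_index.get(query[i], [])
        lo = bisect_right(candidates, n)       # range query for (n, last]
        hi = bisect_right(candidates, last)
        for m in candidates[lo:hi]:
            step(i + 1, m)

    step(0, 0)
    return results

docs = {
    1: [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")],
    2: [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
}
indexes = build_indexes(docs)
```

For example, `rist_search([("P", ""), ("L", "PS"), ("v2", "PSL")], *indexes)` skips (S,P), (N,PS) and (v1,PSN) without visiting them, because each step is a key lookup followed by a range search on labels.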
39VIST The Virtual Suffix Tree
- RIST uses a static scheme to label suffix tree
nodes, and that prevents it from supporting
dynamic insertions.
- This is because, for any node x labeled <n, size>,
later insertions can change the number of nodes
that appear before x (in the prefix order),
- as well as the size of the subtree rooted at x,
which means neither n nor size can be fixed.
40VIST The Virtual Suffix Tree
- The purpose of the suffix tree is to provide a
labeling mechanism to encode S-Ancestorship.
- Suppose a node x is created for element di
during the insertion of the sequence
- d1, ..., di, ..., dk.
41VIST The Virtual Suffix Tree
- If we can estimate
- i. how many different elements will possibly
follow di in future insertions, and
- ii. the occurrence probability of each of these
elements,
- then we can label x's child nodes right away
instead of waiting until all sequences are
inserted.
42VIST The Virtual Suffix Tree (Contd)
- It also means
- the suffix tree itself is no longer needed,
because node labels can be assigned without
building it.
- dynamic data insertion and deletion are
supported.
43Top down scope allocation
- A tree structure defines nested scopes: the
scope of a child node is a subscope of its parent
node's scope, and the root node has the maximum
scope, which covers the scope of every other node.
44Top down scope allocation
- In dynamic scope allocation there is a parameter
called λ, which is the expected number of child
nodes of any node.
- λ is usually assumed to be 2.
- Without knowledge of the occurrence rate of
each child node, 1/λ of the remaining scope
is allocated to x's 1st inserted child, 1/λ of
what is then left to the 2nd child, and so on.
- For a node x with scope <n, size> and λ = 2:
- Child1 <n+1, size/2>
- Child2 <n+1+size/2, size/4>
45Dynamic scope of a Suffix Tree Node
- The dynamic scope of a node is a triple
<n, size, k>,
- where k is the number of subscopes allocated
inside the current scope.
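A minimal sketch of allocating the next subscope from a dynamic scope triple. It assumes integer division, treats a scope <n, size> as the half-open range [n, n + size), and raises an error when no room is left; the function name and the handling of exhausted scopes are assumptions of this sketch rather than details taken from the paper.

```python
def allocate_child(scope, lam=2):
    """Return (child_scope, updated_parent_scope) for scope = (n, size, k)."""
    n, size, k = scope
    start, remaining = n + 1, size
    for _ in range(k):                 # skip the k subscopes already handed out
        taken = remaining // lam
        start += taken
        remaining -= taken
    child_size = remaining // lam      # 1/lam of whatever is still free
    if child_size == 0:
        raise ValueError("scope exhausted")   # assumed behaviour for this sketch
    return (start, child_size, 0), (n, size, k + 1)

root = (0, 20480, 0)
child1, root = allocate_child(root)    # first inserted child
child2, root = allocate_child(root)    # second inserted child
```

With λ = 2 each new child receives half of what is left, so early children get large scopes and later ones progressively smaller scopes.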
46Algorithm of VIST
- VIST uses the same sequence matching algorithm as
RIST.
- A dynamic method for labeling suffix tree nodes
is presented that works without building the
suffix tree.
47Algorithm of VIST
- The method relies on rough estimates of
the number of attribute values.
- Because the labels come from these estimates
rather than from a materialized tree, the
labeling mechanism is based on a virtual suffix
tree.
48- Example
- Let's look at the index structure before
and after an insertion.
49Algorithm of VIST
- Suppose that, before the insertion, the index
structure already contains the following sequence
- Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS)
(v2,PSL)
- The sequence to be inserted
- Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)
50Assumptions of the Example
- There are 2 assumptions for the example
- Max = 20480
- The dynamic scope allocation method uses the
parameter λ = 2
51- The insertion process is much like that of
inserting a sequence into a suffix tree.
- We follow the branches, and when there is no
branch to follow we create one.
52CONCLUSION
- VIST (a dynamic index method) is developed for
XML documents.
- XML data and XML queries are converted into
sequences that encode their structural
information.
53VIST's Pros
- Uses tree structures as the basic unit of query to
avoid expensive join operations.
- Supports dynamic data insertion and deletion.
- Unlike some data structures used in other
approaches, the index structure of VIST, which is
based on BTrees, is well supported by DBMSs.
54-
- End of Presentation
- Questions ?