Presentation for Cmpe-521 - PowerPoint PPT Presentation

Provided by: evr5
Transcript and Presenter's Notes


1
Presentation for Cmpe-521
  • VIST: Virtual Suffix Tree
  • Prepared by:
  • Evren CEYLAN - 2003700163
  • Asli UYAR - 2003701321

2
  • VIST
  • A Dynamic Index Method for Querying XML Data by
    Tree Structures
  • Written by Haixun Wang, Sanghyun Park, Wei Fan, and
    Philip S. Yu (SIGMOD 2003)

3
What is XML?
  • XML: Extensible Markup Language
  • Plays an important role in data exchange.
  • Consequently, much research has gone into providing
    flexible query mechanisms for extracting data from
    XML documents.

4
VIST: Virtual Suffix Tree
  • The paper proposes VIST as an index for searching XML
    documents.
  • XML documents and XML queries are represented as
    structure-encoded sequences (explained in later
    slides).
  • With this representation, querying XML data is
    equivalent to finding subsequence matches.

5
Index Methods in XML
  • Previous index methods disassemble a query into
    multiple sub-queries, and then join the results of
    these sub-queries to produce the final answers.

6
What does VIST do?
  • Converts both XML data and XML queries to
    structure-encoded sequences.
  • Uses tree structures as the basic unit of query
    in order to avoid expensive join operations.
  • In other words, it uses structure-encoded sequences
    instead of nodes or paths.

7
What does VIST do?
  • Matches structured queries against structured
    data as a whole, without breaking down the
    queries into sub-queries of paths or nodes and
    relying on join operations.
  • Supports dynamic index update.

8
What does VIST do?
  • The paper shows that VIST is effective and
    efficient in supporting structural queries.

9
Introduction
  • XML has a growing importance in data exchange
    (extracting data from XML documents)
  • XML provides a flexible way to define
    semi-structured data
  • The paper introduces a novel index structure
    called VIST (Virtual Suffix Tree).
  • VIST offers better performance and usability than
    previous approaches to XML indexing.

10
  • In XML query language design, expressing complex
    structural or graph-shaped queries is one of the
    major concerns.
  • (Figure 2 shows four sample queries in graph form.)

11
In previous approaches
  • i. Indexes are created on paths (e.g. /P/S/I/M
    in Q1). Path indexes can answer simple queries
    efficiently (Q1 has no branches).
  • ii. However, queries that involve branching
    structures (such as Q2) have to be disassembled
    into sub-queries, whose results are then combined by
    expensive join operations to produce the final results.
  • iii. So, these methods are inefficient in
    handling branching queries.

12
In VIST approach
  • Objective: provide a general method so that
    structural XML queries need not be decomposed
    into sub-queries.
  • Result: no need to perform expensive join
    operations.

13
Method
  • XML data and XML queries are transformed into
    structure-encoded sequences.
  • A virtual suffix tree is used to organize the
    structure-encoded sequences.
  • VIST also speeds up the matching process.

14
Structure
  • VIST's index structure has two parts: the
    D-Ancestor index and the S-Ancestor index (explained
    in later slides).
  • VIST unifies structural indexes and value indexes
    into a single index.
  • To achieve this, a method called dynamic virtual
    suffix tree labeling is proposed, so that index
    updates can be performed directly on BTrees.

15
Structure-Encoded Sequences
  • Sequential representation of both XML Data and
    XML Queries.

16
  • Objective: modeling XML queries as sequence
    matching lets us avoid unnecessary join operations
    in query processing.
  • Result: structure-encoded sequences are used
    instead of paths or nodes.

17
Mapping Data and Queries to Structure-Encoded
Sequences
  • Stage 1
  • Let's consider the purchase record example in
    Figure 3.
  • Notation: capital letters represent element and
    attribute names.
  • Lowercase letters represent attribute values.
  • To encode attribute values as integers, we use a
    hash function h( ).
  • e.g. v1 = h("dell") and v2 = h("ibm")
  • v1 and v2 are used to represent "dell" and "ibm"
    respectively.
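
A minimal sketch of the value-encoding step. The slides only say that a hash function h( ) is used; the choice of CRC-32 here is an assumption made for illustration, picked because Python's built-in hash() for strings is randomized between interpreter runs and so would not give stable index keys.

```python
import zlib

def h(value: str) -> int:
    """Map an attribute value to a stable 32-bit integer.

    CRC-32 is an illustrative stand-in, not the hash used by the paper.
    """
    return zlib.crc32(value.encode("utf-8"))

v1 = h("dell")
v2 = h("ibm")
```

Any deterministic hash would do for the sketch. Collisions between distinct values are possible and would have to be resolved when results are returned.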

18
Stage 2
  • Represent an XML document by the preorder
    sequence of its tree structure.
  • e.g. the preorder sequence of the tree in Figure 3
    is
  • P S N v1 I M v2 N v3 I M v4 I N v5 L v6 B L v7 N v8
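
A small sketch of the preorder step, assuming a tree is stored as nested (label, children) tuples. The tree below is a hypothetical fragment of a purchase record, not the full tree of Figure 3.

```python
# Hypothetical fragment: P -> S -> (N -> v1), (L -> v2)
tree = ("P", [("S", [("N", [("v1", [])]),
                     ("L", [("v2", [])])])])

def preorder(node):
    """Return node labels in preorder: parent first, then children left to right."""
    label, children = node
    out = [label]
    for child in children:
        out.extend(preorder(child))
    return out

sequence = preorder(tree)
```

For this fragment the result is P, S, N, v1, L, v2.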

19
Stage 3
  • Definition: a structure-encoded sequence is a
    sequence of (symbol, prefix) pairs
  • D = (a1,p1), (a2,p2), . . . , (an,pn)
  • ai: a node in the XML document tree.
  • pi: the path from the root node to node ai.
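
The definition can be sketched as a preorder walk that carries the path from the root. This assumes single-letter element names, so that a prefix can be written as a plain string, and uses the empty string for the empty prefix of the root. The tree is a hypothetical example.

```python
def structure_encode(node, prefix=""):
    """Return the (symbol, prefix) pairs of a (label, children) tree in preorder."""
    label, children = node
    pairs = [(label, prefix)]
    for child in children:
        # The child's prefix is the path from the root down to its parent.
        pairs.extend(structure_encode(child, prefix + label))
    return pairs

# Hypothetical tree: P -> S -> (N -> v1), (L -> v2)
tree = ("P", [("S", [("N", [("v1", [])]),
                     ("L", [("v2", [])])])])
encoded = structure_encode(tree)
```

With longer element names, a tuple of names would be a safer prefix representation than string concatenation.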

20
  • The tree in Figure 3 can be converted into a
    structure-encoded sequence.
  • D = ... ... (Figure 4)

21
Benefits
  • The benefit of modeling XML queries as sequence
    matching is that structural queries can be processed
    as a whole instead of being broken into smaller query
    units (paths or nodes of the XML document tree).
  • This matters because combining the results of
    sub-queries by join operations is expensive.

22
The VIST Approach
  • Presented in 3 stages:
  • Naïve algorithm: based on suffix trees
  • RIST: improves the naïve algorithm by using
    BTrees to index suffix tree nodes
  • VIST: an index structure that relies only on
    BTrees

23
Requirements
  • An XML indexing method should meet the following
    requirements:
  • It should support structural queries directly. This
    is done with structure-encoded sequences.
  • Instead of relying on suffix trees, it should use
    better-supported indexing techniques such as BTrees.
  • The index structure should allow dynamic data
    insertion, deletion, etc.

24
A Naïve Algorithm Based on Suffix Trees
  • The most widely used index structure for
    subsequence matching is the suffix tree.

25
Example
  • Two XML documents, Doc1 and Doc2, and two XML
    queries, Q1 and Q2, as structure-encoded sequences
    (e denotes the empty prefix, * is a wildcard):
  • Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS) (v2,PSL)
  • Doc2 = (P,e) (B,P) (L,PB) (v2,PBL)
  • Q1 = (P,e) (B,P) (L,PB) (v2,PBL)
  • Q2 = (P,e) (L,P*) (v2,P*L)
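
A sketch of what "querying is subsequence matching" means on these four sequences, with no index at all. Two assumptions are made here: the empty prefix is written as an empty string, and a query prefix may contain `*` as a wildcard, matched with `fnmatchcase` from the standard library. The check is greedy and does not verify that a wildcard is bound consistently across query elements.

```python
from fnmatch import fnmatchcase

doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q1 = [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")]
q2 = [("P", ""), ("L", "P*"), ("v2", "P*L")]

def element_matches(q, d):
    """Same symbol, and the document prefix fits the query prefix pattern."""
    return q[0] == d[0] and fnmatchcase(d[1], q[1])

def is_subsequence_match(query, doc):
    """True if the query elements occur in doc in order, not necessarily adjacent."""
    i = 0
    for d in doc:
        if i < len(query) and element_matches(query[i], d):
            i += 1
    return i == len(query)

results = {name: [is_subsequence_match(q, d) for d in (doc1, doc2)]
           for name, q in (("Q1", q1), ("Q2", q2))}
```

Under these assumptions Q1 matches only Doc2, while Q2 matches both documents.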

26
Example (Contd)
  • The suffix tree built from Doc1 and Doc2 is shown in
    Figure 5.

27
Example (Contd)
  • As shown above, elements in the sequences
    correspond to nodes in the suffix tree.
  • Since the nodes take part in 2 different trees
    (the XML document tree and the suffix tree), there
    are 2 kinds of ancestor-descendant relationships
    among the nodes.
  • i) D-Ancestorship (in the document tree)
  • e.g. (S,P) is a D-ancestor of (L,PS)
  • ii) S-Ancestorship (in the suffix tree)
  • e.g. (v1,PSN) is an S-ancestor of (L,PS)

28
Naïve Algorithm based on the suffix trees
  • The NaiveSearch algorithm is based on suffix trees.
  • It is a naïve method for non-contiguous
    subsequence matching.

29
For example, to match Q2:
  • Start with the root node, which matches the 1st
    element of Q2, that is (P,e).
  • Then search under the root for all nodes that
    match (L,P*), which yields (L,PS) and (L,PB).
  • Finally, search for
  • - (v2,PSL) under the node labeled (L,PS)
  • - (v2,PBL) under the node labeled (L,PB)
  • Algorithm 1 searches nodes first by
    S-Ancestorship, and then by D-Ancestorship.
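
The walk-through above can be sketched as a recursive search over a tree of inserted sequences. This is an illustrative reading of the naive search, not the paper's Algorithm 1 verbatim: the `Node` class, the virtual root, and the use of `fnmatchcase` for `*` in query prefixes are all assumptions of the sketch.

```python
from fnmatch import fnmatchcase

class Node:
    def __init__(self, elem=None):
        self.elem = elem          # (symbol, prefix), None for the virtual root
        self.children = {}
        self.doc_ids = []         # documents whose sequence ends at this node

def insert(root, seq, doc_id):
    node = root
    for elem in seq:
        node = node.children.setdefault(elem, Node(elem))
    node.doc_ids.append(doc_id)

def subtree(node):
    """Yield node and all of its descendants."""
    yield node
    for child in node.children.values():
        yield from subtree(child)

def matches(q, d):
    return q[0] == d[0] and fnmatchcase(d[1], q[1])

def naive_search(node, query, i=0):
    """Scan every descendant of node for the next query element (the costly part)."""
    if i == len(query):
        return {doc_id for n in subtree(node) for doc_id in n.doc_ids}
    found = set()
    for child in node.children.values():
        for d in subtree(child):
            if matches(query[i], d.elem):
                found |= naive_search(d, query, i + 1)
    return found

root = Node()
insert(root, [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")], 1)
insert(root, [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")], 2)
q2_hits = naive_search(root, [("P", ""), ("L", "P*"), ("v2", "P*L")])
q1_hits = naive_search(root, [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")])
```

Each recursion level rescans a whole subtree, which is the cost the later slides set out to remove.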

30
Difficulties of Naive Algorithm
  • There are difficulties in using suffix trees to
    index structure-encoded sequences.
  • The major difficulty:
  • Searching for nodes that satisfy both
    S-Ancestorship and D-Ancestorship is extremely
    costly, because we need to traverse a large
    portion of the subtree for each match.

31
RIST: Indexing by Ancestor-Descendant
Relationships
  • Improves the naïve algorithm by eliminating the
    expensive traversals of the suffix tree.
  • When we reach node X after matching, we can jump
    directly to those nodes Y of which X is both a
    D-Ancestor and an S-Ancestor.
  • So, we no longer need to search among the
    descendants of X to find the Ys one by one.

32
RIST Algorithm
  • 1. Index the nodes of the suffix tree by their
    (Symbol,Prefix) pairs, using a BTree.
  • i. This enables us to search nodes by their
    (Symbol,Prefix) pairs, that is, by D-Ancestorship.
  • ii. This BTree is called the D-Ancestorship BTree.

33
RIST Algorithm
  • 2. Among all the nodes satisfying D-Ancestorship,
    we are interested in the ones that satisfy
    S-Ancestorship as well.
  • i. Labels are created for suffix tree nodes in
    order to tell the relationship between 2 nodes.
  • ii. We use BTrees to index nodes by their labels.
  • iii. This BTree is called the S-Ancestorship
    BTree.

34
Labeling Notation
  • <nx, sizex>
  • nx: the prefix (preorder) traversal order of x in the
    suffix tree.
  • sizex: the total number of descendants of x in the
    suffix tree.
  • This kind of labeling is shown in Figure 5.

35
Labeling Notation
  • Note: with this labeling, the S-Ancestorship
    between any two nodes can be decided easily.
  • If x and y are labeled <nx, sizex> and <ny,
    sizey>, node x is an S-Ancestor of y if
    ny ∈ (nx, nx + sizex].
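
A sketch of static <n, size> labeling and the interval test, on a hypothetical five-node tree with unique node names. Numbering from 0 is an assumption; any starting value works as long as n follows preorder.

```python
def assign_labels(node, labels, counter=0):
    """Label a (name, children) tree with (preorder number, descendant count)."""
    name, children = node
    n = counter
    counter += 1
    for child in children:
        counter = assign_labels(child, labels, counter)
    labels[name] = (n, counter - n - 1)
    return counter

def is_s_ancestor(x, y):
    """x is an ancestor of y iff ny lies in the interval (nx, nx + sizex]."""
    (nx, size_x), (ny, _) = x, y
    return nx < ny <= nx + size_x

# Hypothetical tree: a -> (b -> c, d), e
tree = ("a", [("b", [("c", []), ("d", [])]), ("e", [])])
labels = {}
assign_labels(tree, labels)
```

Here b is labeled (1, 2), so d with n = 3 falls inside (1, 3] while e with n = 4 does not.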

36
Constructing the BTrees
  • Insert all suffix tree nodes into the
    D-Ancestorship BTree using their (Symbol,Prefix)
    pairs as keys.
  • For all nodes x inserted with the same
    (Symbol,Prefix), we index them by an
    S-Ancestorship BTree, using the nx values of
    their labels as keys.
  • Shown in Figure 6.

37
Building the DocID BTree
  • The DocID BTree stores, for each node x (using nx
    as key), the document IDs of those XML sequences
    that end at node x when they are inserted into
    the suffix tree.
  • Shown in DocID BTree

38
In summary
  • Unlike the naïve algorithm, RIST does not traverse the
    suffix tree for subsequence matching (it uses the
    D-Ancestorship BTree and the S-Ancestorship BTree).
  • From any node, instead of searching the entire
    subtree under the node, we can jump to the
    descendant nodes that match the next element in the
    query.
  • So, RIST supports non-contiguous subsequence
    matching efficiently.
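
A compact sketch of the RIST idea, under several assumptions: a dict stands in for the D-Ancestorship BTree, sorted lists searched with `bisect` stand in for the S-Ancestorship and DocID BTrees, a virtual root is numbered 0, and queries contain no wildcards. All function names are hypothetical.

```python
from bisect import bisect_left
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [(symbol, prefix), ...]} -> (d_index, doc_index, root_size)."""
    root = {"children": {}, "docs": []}
    for doc_id, seq in docs.items():
        node = root
        for elem in seq:
            node = node["children"].setdefault(elem, {"children": {}, "docs": []})
        node["docs"].append(doc_id)

    d_index = defaultdict(list)   # (symbol, prefix) -> sorted [(n, size), ...]
    doc_index = []                # sorted [(n, doc_id), ...]
    counter = 0

    def visit(node, elem):
        nonlocal counter
        n = counter
        counter += 1
        for child_elem, child in node["children"].items():
            visit(child, child_elem)
        size = counter - n - 1    # number of descendants
        if elem is not None:
            d_index[elem].append((n, size))
        doc_index.extend((n, doc_id) for doc_id in node["docs"])
        return size

    root_size = visit(root, None)
    for labels in d_index.values():
        labels.sort()
    doc_index.sort()
    return d_index, doc_index, root_size

def in_scope(entries, lo, hi):
    """Entries whose first field n satisfies lo <= n <= hi, found by binary search."""
    return entries[bisect_left(entries, (lo,)):bisect_left(entries, (hi + 1,))]

def search(query, d_index, doc_index, root_size):
    results = set()
    def step(i, n, size):
        if i == len(query):
            results.update(doc_id for _, doc_id in in_scope(doc_index, n, n + size))
            return
        # Key lookup gives D-Ancestorship, the range (n, n + size] gives S-Ancestorship.
        for ny, sy in in_scope(d_index.get(query[i], []), n + 1, n + size):
            step(i + 1, ny, sy)
    step(0, 0, root_size)
    return results

docs = {
    1: [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")],
    2: [("P", ""), ("B", "P"), ("L", "PB"), ("v2", "PBL")],
}
d_index, doc_index, root_size = build_index(docs)
hits = search([("P", ""), ("L", "PS"), ("v2", "PSL")], d_index, doc_index, root_size)
```

The tree is only needed while building the labels. The search itself touches nothing but the key lookup and the two range scans, which is the property the summary above describes.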

39
VIST: The Virtual Suffix Tree
  • RIST uses a static scheme to label suffix tree
    nodes, which prevents it from supporting
    dynamic insertions.
  • For any node x labeled <n, size>, later
    insertions can change the number of nodes that
    appear before x (in the prefix order),
  • as well as the size of the subtree rooted at x,
    which means neither n nor size can be fixed.

40
VIST: The Virtual Suffix Tree
  • The purpose of the suffix tree is to provide a
    labeling mechanism that encodes S-Ancestorship.
  • Suppose a node x is created for element di
    during the insertion of the sequence
  • d1, ..., di, ..., dk.

41
VIST: The Virtual Suffix Tree
  • If we can estimate
  • i. how many different elements may
    follow di in future insertions, and
  • ii. the occurrence probability of each of these
    elements,
  • then we can label x's child nodes right away instead
    of waiting until all sequences are inserted.

42
VIST: The Virtual Suffix Tree (Contd)
  • It also means
  • the suffix tree itself is no longer needed,
    because its labeling mechanism is inefficient.
  • It supports dynamic data insertion and deletion.

43
Top down scope allocation
  • A tree structure defines nested scopes: the
    scope of a child node is a subscope of its parent
    node, and the root node has the maximum scope, which
    covers the scope of every node.

44
Top down scope allocation
  • In dynamic scope allocation there is a parameter
    called λ, which is the expected number of child
    nodes of any node.
  • λ is usually assumed to be 2.
  • Without knowledge of the occurrence rate of
    each child node, 1/λ of the remaining scope
    is allocated to x's 1st inserted child, 1/λ of what
    then remains to the 2nd, and so on.
  • For λ = 2 and x labeled <n, size>:
  • Child1: <n + 1, size/2>
  • Child2: <n + 1 + size/2, size/4>
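
A sketch of the allocation rule, assuming integer division and assuming that a scope <n, size> covers the half-open range [n, n + size), so that sibling scopes never overlap. The function names and the error raised when a scope runs out are hypothetical.

```python
def child_scopes(n, size, count, lam=2):
    """Scopes (n_i, size_i) handed to the first `count` children of <n, size>."""
    scopes = []
    next_free, remaining = n + 1, size   # n itself labels the parent node
    for _ in range(count):
        child_size = remaining // lam    # 1/lam of what is still unallocated
        if child_size == 0:
            raise ValueError("scope exhausted")
        scopes.append((next_free, child_size))
        next_free += child_size
        remaining -= child_size
    return scopes

def contains(outer, inner):
    """True if scope `inner` lies strictly inside scope `outer`."""
    (n, size), (m, msize) = outer, inner
    return n < m and m + msize <= n + size

first_two = child_scopes(0, 20480, 2)
```

With n = 0 and size = 20480 the first two children receive <1, 10240> and <10241, 5120>, in line with the Child1 and Child2 formulas for λ = 2.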

45
Dynamic scope of a Suffix Tree Node
  • The dynamic scope of a node is a triple
    <n, size, k>,
  • where k is the number of subscopes allocated
    inside the current scope.

46
Algorithm of VIST
  • VIST uses the same sequence matching algorithm as
    RIST.
  • A dynamic method for labeling suffix tree nodes is
    presented that does not require building the
    suffix tree.

47
Algorithm of VIST
  • The method relies on rough estimates of
    the number of attribute values.
  • Because of that, the labeling mechanism is based
    on a virtual suffix tree.

48
Example
  • Let's look at the index structure before
    and after an insertion.

49
Algorithm of VIST
  • Suppose that, before the insertion, the index
    structure already contains the following sequence:
  • Doc1 = (P,e) (S,P) (N,PS) (v1,PSN) (L,PS)
    (v2,PSL)
  • The sequence to be inserted:
  • Doc2 = (P,e) (S,P) (L,PS) (v2,PSL)

50
Assumptions of the Example
  • There are 2 assumptions for the example:
  • Max = 20480
  • The dynamic scope allocation method uses the
    parameter λ = 2

51
  • The insertion process is much like that of
    inserting a sequence into a suffix tree.
  • We follow the branches, and when there is no
    branch to follow we create one.
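
The insertion of Doc1 and then Doc2 can be traced with a small sketch, using Max = 20480 and λ = 2 as stated in the example. Everything else is an assumption of the sketch: a dict keyed by (parent n, element) stands in for the BTrees, all documents are assumed to share one root element that receives the full scope <0, Max>, and the `Scope` class and `insert` function are hypothetical names.

```python
MAX, LAM = 20480, 2   # the two assumptions stated for the example

class Scope:
    """Dynamic scope <n, size, k> plus bookkeeping for the next free subscope."""
    def __init__(self, n, size):
        self.n, self.size, self.k = n, size, 0
        self.next_free = n + 1
        self.remaining = size

    def allocate_child(self):
        child_size = self.remaining // LAM
        if child_size == 0:
            raise ValueError("scope exhausted")
        child = Scope(self.next_free, child_size)
        self.next_free += child_size
        self.remaining -= child_size
        self.k += 1
        return child

index = {}   # (parent n or None, element) -> Scope

def insert(seq):
    """Follow existing branches; allocate a new subscope where none exists."""
    labels, parent = [], None
    for elem in seq:
        key = (None if parent is None else parent.n, elem)
        node = index.get(key)
        if node is None:
            node = Scope(0, MAX) if parent is None else parent.allocate_child()
            index[key] = node
        labels.append((elem, node.n, node.size))
        parent = node
    return labels

doc1 = [("P", ""), ("S", "P"), ("N", "PS"), ("v1", "PSN"), ("L", "PS"), ("v2", "PSL")]
doc2 = [("P", ""), ("S", "P"), ("L", "PS"), ("v2", "PSL")]
labels1 = insert(doc1)
labels2 = insert(doc2)
```

Under these assumptions Doc2 reuses the scopes of (P,e) and (S,P), and its (L,PS) becomes the second child of (S,P) with scope <5122, 2560>, next to the <2, 5120> scope already held by (N,PS). No labels assigned for Doc1 change.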

52
CONCLUSION
  • VIST (a dynamic index method) is developed for
    XML Documents.
  • XML data and XML queries are converted into
    sequences that encode their structural
    information.

53
VIST's Pros
  • Uses tree structures as the basic unit of query to
    avoid expensive join operations.
  • Supports dynamic data insertion and deletion.
  • Unlike some data structures used in other
    approaches, the index structure of VIST, which is
    based on BTrees, is well supported by DBMSs.

54
  • End of Presentation
  • Questions ?