A Fast Index For Semistructured Data - PowerPoint PPT Presentation

Slides: 29
Provided by: sa69

1
A Fast Index For Semistructured Data
  • Presented by
  • Alexandra Martinez
  • CIS 6930 - Indexing Large Databases

2
Indexing Semistructured Data
  • Introduction
  • Detour: Tries and Patricia Tries
  • The Index Fabric Structure
  • Indexing XML with the Index Fabric
  • Results and Conclusion

3
Introduction
  • Semistructured Data
  • Data with an irregular or changing organization.
  • Often represented as a graph (elements, relations,
    schema)
  • Queries over semistructured data
  • Navigating paths through graph
  • Indexes usually built for efficient access.
  • Conventional techniques
  • Use a relational database (translation and
    querying are costly)
  • Use a native semistructured repository (new, poor
    query performance)

4
Introduction
  • Proposed approach
  • Relies on relational db but provides better
    performance.
  • Encodes data paths as strings and inserts them
    into an index that is optimized for string
    searching.
  • To evaluate a query, the path is encoded as a
    search key string, then looked up in the index.
  • (+) No need to know the data schema a priori.
  • (+) High performance with changing and irregular
    structure.
  • (+) Can accelerate queries along different access
    paths.

5
Indexing Semistructured Data
  • Introduction
  • Detour: Tries and Patricia Tries
  • The Index Fabric Structure
  • Indexing XML with the Index Fabric
  • Results and Conclusion

6
Detour: Tries & Patricia Tries
  • A trie is a tree that stores strings; it
    represents each character as an edge on the path
    from the root to a leaf.
  • Patricia tries (PTs) are a more compact form of
    tries. A PT is similar to a trie, except that
    nodes with only one child have been removed.
  • The numbers inside the nodes (depth) indicate the
    character position in the string to compare to
    the labels on the outgoing edges.
  • PTs achieve compression at the cost of no longer
    storing the complete keys, but rather the
    differences between keys.
  • PTs are unbalanced structures.
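
The depth-labelled nodes described above can be sketched in Python as follows. This is a minimal illustration, not code from the presentation: the names `Leaf`, `Node`, `build`, and `lookup`, the `END` terminator, and the sample keys are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

END = "\0"  # assumed terminator, so that no key is a prefix of another


@dataclass
class Leaf:
    key: str  # the complete key is kept with the data, not inside the trie


@dataclass
class Node:
    depth: int  # character position to compare against the edge labels
    edges: Dict[str, Union["Node", Leaf]] = field(default_factory=dict)


def _build(keys: List[str], depth: int) -> Union[Node, Leaf]:
    if len(keys) == 1:
        return Leaf(keys[0])
    # Skip positions where every key agrees: these correspond to the
    # one-child nodes that a Patricia trie removes.
    while len({k[depth] for k in keys}) == 1:
        depth += 1
    node = Node(depth)
    groups: Dict[str, List[str]] = {}
    for k in keys:
        groups.setdefault(k[depth], []).append(k)
    for ch, group in groups.items():
        node.edges[ch] = _build(group, depth + 1)
    return node


def build(keys: List[str]) -> Union[Node, Leaf]:
    """Build a Patricia trie over distinct, terminated keys."""
    return _build(sorted({k + END for k in keys}), 0)


def lookup(root: Union[Node, Leaf], key: str) -> bool:
    key += END
    node = root
    while isinstance(node, Node):
        if node.depth >= len(key):
            return False
        node = node.edges.get(key[node.depth])
    # Skipped characters were never compared, so the full key must be
    # verified against the one stored with the data.
    return isinstance(node, Leaf) and node.key == key
```

For the sample keys cat, cash, castle, far, and fast, the root compares position 0 and the node reached through the edge labelled "c" compares position 2, because every key in that subtrie has "a" at position 1. The final comparison in `lookup` is the price of the compression: the trie stores only the differences between keys.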

7
Detour: Tries & PTs Example
A trie indexing one string
A trie indexing multiple strings
A Patricia trie indexing multiple strings
8
Indexing Semistructured Data
  • Introduction
  • Detour: Tries and Patricia Tries
  • The Index Fabric Structure
  • Balancing Patricia tries
  • Two kinds of links
  • Searching
  • Updates
  • Indexing XML with the Index Fabric
  • Results and Conclusion

9
Fabric - Balancing Patricia Tries
  • PTs are not balanced; in large DBs the imbalance
    can be large and result in performance degradation.
  • Problem is solved by introducing multiple layers
    into the PT.
  • Horizontal layers are added to skip some of the
    vertical levels.
  • Horizontal structure is always balanced.
  • Balancing the PT allows for searches (and
    updates) in time proportional to the number of
    layers instead of the length of the indexed keys.

10
Fabric - Balancing PTs Example
Layer 1 indexes common prefixes of each subtrie
(block) in Layer 0
Root is always at leftmost layer

11
Fabric - Balancing Patricia Tries
12
Fabric - Two Kinds of Links
  • Labeled Far Link
  • This link is the same as an edge between a parent
    and a child in a normal trie, except that the
    parent is in layer i+1 and the child is in layer
    i.
  • Unlabeled Direct Link
  • Connects a node in layer i+1 with a node
    representing the same prefix in layer i.

13
Fabric - Searching
  • Start at the root node of the block in the leftmost
    layer (layer N).
  • Within a block, compare characters in the search
    key to edge labels, and keep following the edges.
  • If the edge is a far link, the search proceeds to
    a block in the next layer (N-1).
  • If the search misses in this block, backtrack
    (follow the direct link instead of the far link).
  • If no labeled edge matches, follow a direct link
    to a new block in the next layer (N-1).
  • Eventually layer 0 is reached.
  • If no labeled edge matches there, the key is not
    found.
  • Otherwise, a path is followed to the data.
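
The search steps above can be traced with a toy two-layer model. This is a heavily simplified sketch under stated assumptions, not the Index Fabric implementation: layer-0 blocks are plain dictionaries standing in for sub-tries, there is a single layer-1 block, and the names `UpperNode`, `search`, the sample keys, and the record labels are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

# Assumption: a layer-0 block is modelled as a dict (key -> data)
# instead of a Patricia sub-trie.
Block = Dict[str, str]


@dataclass
class UpperNode:
    depth: int      # character position compared at this node
    direct: Block   # unlabeled direct link: same prefix, next layer
    # Labeled edges: an UpperNode value is a normal edge inside the
    # layer-1 block, a Block value is a far link into layer 0.
    edges: Dict[str, Union["UpperNode", Block]] = field(default_factory=dict)


def search(node: UpperNode, key: str) -> Optional[str]:
    child = node.edges.get(key[node.depth]) if node.depth < len(key) else None
    if isinstance(child, UpperNode):
        hit = search(child, key)      # keep following edges in this block
        if hit is not None:
            return hit
    elif child is not None:           # far link to a layer-0 block
        if key in child:
            return child[key]
    # No labeled edge matched, or the far link missed: backtrack and
    # follow the direct link instead.
    return node.direct.get(key)


# Hypothetical layer-0 blocks, grouped by common prefix.
block_root = {"cat": "rec1", "fig": "rec6"}        # prefix ""
block_cas = {"cash": "rec2", "castle": "rec3"}     # prefix "cas"
block_fa = {"far": "rec4", "fast": "rec5"}         # prefix "fa"

# Layer 1: the root compares position 0; node `ca` compares position 2.
ca = UpperNode(depth=2, direct=block_root, edges={"s": block_cas})
root = UpperNode(depth=0, direct=block_root, edges={"c": ca, "f": block_fa})
```

In this model `search(root, "castle")` follows "c", then the far link "s", and finds the record in one layer-0 block. `search(root, "fig")` follows the far link "f" into the "fa" block, misses because only one character was compared, and backtracks through the direct link.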

14
Fabric - Search Examples: castle, fast
15
Fabric - Searching: One I/O
  • A search accesses one block per layer.
  • Since the horizontal layers are balanced, all
    searches traverse the same number of layers and
    so access the same number of blocks.
  • Compact storage of keys: blocks have very high
    fan-out, so the PT has low height.
  • Ex.: 3 layers are sufficient to store a billion
    keys.
  • The 2 upper layers are kept in memory, layer 0 on
    disk.
  • Thus searches require a single index I/O.
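
The "3 layers for a billion keys" figure can be checked with a little arithmetic. The sketch below assumes an idealized structure in which every block references the same number of entries; the fan-out of 1000 is an assumption chosen for illustration, since the slides give no figure, and `layers_needed` is a hypothetical helper.

```python
def layers_needed(num_keys: int, fan_out: int) -> int:
    """Smallest number of layers whose combined fan-out covers num_keys.

    Idealized model (assumption): each block holds `fan_out` entries,
    so `n` layers can address fan_out ** n keys.
    """
    if fan_out < 2:
        raise ValueError("fan_out must be at least 2")
    layers, reachable = 1, fan_out
    while reachable < num_keys:
        reachable *= fan_out
        layers += 1
    return layers


# With an assumed fan-out of 1000 entries per block:
print(layers_needed(10**9, 1000))  # 3
```

Under that assumption the two upper layers together hold about 1 + 1000 blocks, which is small enough to keep in memory, leaving only the layer-0 block to be read from disk.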

16
Fabric - Updates
  • Similar to B-trees. Very efficient, like
    searches.
  • Insertion
  • Involves a change to a single block in the lowest
    layer (layer 0). If a block has no space for the
    insertion, it splits. Splits may cascade to
    higher layers.
  • Deletion
  • Find block to be updated, remove edge pointing to
    the key. Blocks might merge to compact trie.
  • Updates
  • Deletion followed by Insertion.
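
The split-and-merge behaviour can be illustrated with a B-tree-like stand-in. This is not the fabric's trie-splitting procedure: blocks are modelled as sorted lists of keys, only layer 0 is represented, cascading to higher layers is omitted, and `CAPACITY`, `find_block`, `insert`, `delete`, and `update` are hypothetical names.

```python
import bisect
from typing import List

CAPACITY = 4  # hypothetical maximum number of keys per block


def find_block(blocks: List[List[str]], key: str) -> int:
    # The first key of every block except block 0 acts as a separator.
    separators = [b[0] for b in blocks[1:]]
    return bisect.bisect_right(separators, key)


def insert(blocks: List[List[str]], key: str) -> None:
    i = find_block(blocks, key)
    block = blocks[i]
    if key in block:
        return
    bisect.insort(block, key)
    if len(block) > CAPACITY:  # no space left: split the block in two
        mid = len(block) // 2
        blocks[i:i + 1] = [block[:mid], block[mid:]]


def delete(blocks: List[List[str]], key: str) -> None:
    i = find_block(blocks, key)
    if key not in blocks[i]:
        return
    blocks[i].remove(key)
    if i + 1 < len(blocks) and len(blocks[i]) + len(blocks[i + 1]) <= CAPACITY:
        blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]  # merge to compact
    elif not blocks[i] and len(blocks) > 1:
        del blocks[i]  # drop an emptied last block


def update(blocks: List[List[str]], old_key: str, new_key: str) -> None:
    # An update is a deletion followed by an insertion.
    delete(blocks, old_key)
    insert(blocks, new_key)
```

Starting from `blocks = [[]]`, each insertion touches exactly one block; a second block appears only when the first one overflows, which mirrors the point that most updates change a single layer-0 block.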

17
Indexing Semistructured Data
  • Introduction
  • Detour: Tries and Patricia Tries
  • The Index Fabric Structure
  • Indexing XML with the Index Fabric
  • Designators
  • Raw Paths
  • Refined Paths
  • Results and Conclusion

18
Indexing XML - Example XML
19
Indexing XML - Designators
  • Designator - A unique special character(s)
    assigned to each tag that appears in the XML
  • Designator Dictionary maintains mapping between
    tags and designators

Designator Dictionary
The designator-encoded XML strings are inserted
into the Index Fabric.
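
A designator dictionary can be sketched as a plain mapping from tag names to single characters. The function name `build_designators`, the use of uppercase letters as designators, and the sample tags are assumptions for illustration; the slides do not say which characters are used.

```python
import string
from typing import Dict, Iterable


def build_designators(tags: Iterable[str]) -> Dict[str, str]:
    """Assign each distinct tag a unique one-character designator.

    Assumption: designators are drawn from A-Z in order of first
    appearance, so this toy version supports at most 26 tags
    (an IndexError is raised beyond that).
    """
    dictionary: Dict[str, str] = {}
    for tag in tags:
        if tag not in dictionary:
            dictionary[tag] = string.ascii_uppercase[len(dictionary)]
    return dictionary


designators = build_designators(["invoice", "buyer", "seller", "buyer"])
print(designators)  # {'invoice': 'A', 'buyer': 'B', 'seller': 'C'}
```

A tag that appears more than once keeps the designator it received the first time, so the same path always encodes to the same string.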
20
Indexing XML - Raw Paths
  • Raw paths index the hierarchical structure of the
    XML by encoding root-to-leaf paths as strings.
  • Simple path expressions starting at the root
    require a single index lookup.
  • Ex.: the XML fragment
  • <A>alpha<B>beta<C>gamma</C></B></A>
  • can be represented as a tree with 3 root-to-leaf
    paths:
  • 1) <A>alpha  2) <A><B>beta  3) <A><B><C>gamma
  • which are encoded as
  • 1) A alpha  2) A B beta  3) A B C gamma
  • under the mapping f, where
  • f(<A>) = A, f(<B>) = B, f(<C>) = C
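
The encoding of this fragment can be reproduced with Python's standard `xml.etree.ElementTree` module. This is a sketch, not the paper's encoder: the function name `raw_paths` and the space-separated key layout are assumptions taken from how the keys are written on the slide.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterator, Tuple


def raw_paths(elem: ET.Element, f: Dict[str, str],
              prefix: Tuple[str, ...] = ()) -> Iterator[str]:
    """Yield one designator-encoded key per text value in the tree."""
    path = prefix + (f[elem.tag],)      # append this tag's designator
    text = (elem.text or "").strip()
    if text:
        yield " ".join(path + (text,))
    for child in elem:
        yield from raw_paths(child, f, path)


fragment = "<A>alpha<B>beta<C>gamma</C></B></A>"
f = {"A": "A", "B": "B", "C": "C"}      # the mapping f from the slide
keys = list(raw_paths(ET.fromstring(fragment), f))
print(keys)  # ['A alpha', 'A B beta', 'A B C gamma']
```

Each key would then be inserted into the index, so a simple path query that starts at the root becomes a single string lookup.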

21
Indexing XML - Raw Paths Example
[Figure: the XML fragment <A>alpha <B>beta <C>gamma </C> </B> </A> drawn as a tree whose root-to-leaf paths 1, 2, and 3 yield the keys "A alpha", "A B beta", and "A B C gamma".]
22
Indexing XML - Refined Paths
  • Specialized paths through the XML that optimize
    frequently occurring access patterns.
  • Can support queries that have wildcards,
    alternates, and constants.
  • The DBA decides which refined paths are appropriate.

23
Refined Paths - An Example
  • Frequent query Q: Find the invoices where company
    X sold to company Y, i.e. find <buyer> tags that
    are siblings of a <seller> tag.
  • Assign a designator Z to such a path.
  • Encode the info indexed by this refined path in a
    key.
  • Insert the created keys into the Index Fabric.
  • Keys refer to the XML fragments that answer Q.

Pattern: <invoice> <buyer>Y</buyer> <seller>X</seller> </invoice>
Data: <invoice> <buyer>ABC Corp</buyer> <seller>Acme Inc</seller> </invoice>
Key: Z ABC Corp Acme Inc
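
A refined-path key for this example can be generated as follows. This is a sketch under assumptions: the function name `refined_keys` and the key layout (designator, buyer, seller, separated by spaces) are inferred from the sample key on the slide, and the standard `xml.etree.ElementTree` module stands in for whatever parser the real system uses.

```python
import xml.etree.ElementTree as ET
from typing import List

REFINED_DESIGNATOR = "Z"  # designator assigned to the buyer/seller path


def refined_keys(xml_text: str) -> List[str]:
    """Build one key per invoice that has both a buyer and a seller."""
    root = ET.fromstring(xml_text)
    keys = []
    for invoice in root.iter("invoice"):
        buyer = invoice.find("buyer")
        seller = invoice.find("seller")
        if buyer is not None and seller is not None:
            buyer_name = (buyer.text or "").strip()
            seller_name = (seller.text or "").strip()
            keys.append(f"{REFINED_DESIGNATOR} {buyer_name} {seller_name}")
    return keys


doc = ("<invoice><buyer>ABC Corp</buyer>"
       "<seller>Acme Inc</seller></invoice>")
print(refined_keys(doc))  # ['Z ABC Corp Acme Inc']
```

Answering Q for a given pair of companies then amounts to building the same string from the query constants and looking it up in the index.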
24
Indexing Semistructured Data
  • Introduction
  • Detour: Tries and Patricia Tries
  • The Index Fabric Structure
  • Indexing XML with the Index Fabric
  • Results and Conclusion

25
Results
  • Index Fabric compared to
  • DBMS's native B-tree index over tables generated
    by STORED.
  • DBMS's native B-tree index over tables generated
    by basic edge-mapping (roots and edges).
  • Index Fabric outperforms B-tree indexes.
  • Index Fabric offers significant optimization
    especially for complex queries, refined paths.

26
Conclusions
  • Indexing for semistructured data poses significant
    challenges.
  • Many problems are not yet solved, e.g. efficient
    processing of queries involving complex regular
    expressions.
  • Index Fabric: indexing for XML stored in a
    relational DB (can work for other models)
  • Interesting features
  • Combines aspects of Patricia Tries (scaling) and
    B-trees (balanced, optimized for disk access)
  • No a priori knowledge of structure is needed

27
References
  • B. Cooper et al. A Fast Index for Semistructured
    Data. In Proc. VLDB, 2001. Available at
  • B. Cooper and M. Shadmon. The Index Fabric: A
    mechanism for indexing and querying the same data
    in many different ways. Technical Report, 2000.
    Available at
    http://www.rightorder.com/technology/overview.pdf

28
Thanks