Title: A Fast Index For Semistructured Data
1. A Fast Index For Semistructured Data
- Presented by Alexandra Martinez
- CIS 6930 - Indexing Large Databases
2. Indexing Semistructured Data
- Introduction
- Detour: Tries and Patricia Tries
- The Index Fabric Structure
- Indexing XML with the Index Fabric
- Results and Conclusion
3. Introduction
- Semistructured data
  - Data with an irregular or changing organization.
  - Often represented as a graph (elements, relations, schema).
- Queries over semistructured data
  - Navigate paths through the graph.
  - Indexes are usually built for efficient access.
- Conventional techniques
  - Use a relational database (translation and querying are costly).
  - Use a native semistructured repository (new, poor query performance).
4. Introduction
- Proposed approach
  - Relies on a relational DB but provides better performance.
  - Encodes data paths as strings and inserts them into an index that is optimized for string searching.
  - To evaluate a query, the path is encoded as a search key string, which is then looked up in the index.
  - (+) No need to know the data schema a priori.
  - (+) High performance with changing and irregular structure.
  - (+) Can accelerate queries along different access paths.
6. Detour: Tries and Patricia Tries
- A trie is a tree that stores strings; it represents each character as an edge on the path from the root to a leaf.
- Patricia tries (PTs) are a more compact form of tries. A PT is similar to a trie, except that nodes with only one child have been removed.
- The number inside each node (its depth) indicates the character position in the string to compare against the labels on the outgoing edges.
- PTs achieve compression at the cost of no longer storing the complete keys, but rather the differences between keys.
- PTs are unbalanced structures.
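The bullets above can be illustrated with a minimal Python sketch. It is not the structure from the paper: the node layout (nested `(depth, edges)` tuples), the `END` terminator, and the names `build_patricia` and `lookup` are all assumptions made for this example. It shows the two points made on the slide: one-child nodes are skipped by jumping straight to the first position where keys differ, and because skipped characters are never compared, a lookup has to verify the full key at the leaf.

```python
END = "\0"  # assumed terminator so that no key is a prefix of another


def common_prefix_len(keys):
    """Length of the prefix shared by every key in a non-empty list."""
    first, last = min(keys), max(keys)
    n = 0
    while n < len(first) and n < len(last) and first[n] == last[n]:
        n += 1
    return n


def _build(keys):
    # A single remaining key becomes a leaf that refers to the full key.
    if len(keys) == 1:
        return keys[0]
    # Skip every position on which all keys agree: this removes the
    # one-child nodes that a plain trie would contain.
    depth = common_prefix_len(keys)
    groups = {}
    for key in keys:
        groups.setdefault(key[depth], []).append(key)
    return (depth, {label: _build(group) for label, group in groups.items()})


def build_patricia(keys):
    """Build a toy Patricia trie from a non-empty collection of strings."""
    return _build(sorted({k + END for k in keys}))


def lookup(node, key):
    key += END
    while isinstance(node, tuple):
        depth, edges = node
        if depth >= len(key) or key[depth] not in edges:
            return False
        node = edges[key[depth]]
    # Skipped characters were never compared, so check the full key.
    return node == key


trie = build_patricia(["cat", "cash", "castle", "casting", "far", "fast"])
print(lookup(trie, "castle"), lookup(trie, "cot"))
```

In this sketch the node reached through the `c` edge has depth 2, because every key starting with `c` also has `a` at position 1, so that position is never tested. This is why a lookup for `cot` reaches the leaf for `cat` and is rejected only by the final comparison.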
7. Detour: Tries and PTs Example
[Figure panels: a trie indexing one string; a trie indexing multiple strings; a Patricia trie indexing multiple strings]
8. Indexing Semistructured Data
- Introduction
- Detour: Tries and Patricia Tries
- The Index Fabric Structure
  - Balancing Patricia tries
  - Two kinds of links
  - Searching
  - Updates
- Indexing XML with the Index Fabric
- Results and Conclusion
9. Fabric: Balancing Patricia Tries
- PTs are not balanced; in large databases the imbalance can be large and result in performance degradation.
- The problem is solved by introducing multiple layers into the PT.
- Horizontal layers are added to skip some of the vertical levels.
- The horizontal structure is always balanced.
- Balancing the PT allows for searches (and updates) in time proportional to the number of layers instead of the length of the indexed keys.
10. Fabric: Balancing PTs Example
[Figure notes: Layer 1 indexes the common prefixes of each subtrie (block) in Layer 0. The root is always at the leftmost layer.]
11. Fabric: Balancing Patricia Tries
12. Fabric: Two kinds of links
- Labeled far link
  - The same as an edge between a parent and a child in a normal trie, except that the parent is in layer i+1 and the child is in layer i.
- Unlabeled direct link
  - Connects a node in layer i+1 with a node representing the same prefix in layer i.
13. Fabric: Searching
- Start at the root node of the block at the leftmost layer (layer N).
- Within a block, compare characters in the search key to edge labels, and keep following the edges.
- If an edge is a far link, the search proceeds to a block in the next layer (N-1).
- If the search misses in that block, backtrack (follow the direct link instead of the far link).
- If no labeled edge matches, follow a direct link to a new block in the next layer (N-1).
- Eventually layer 0 is reached.
  - If no labeled edge matches, the key is not found.
  - Otherwise, a path is followed to the data.
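The search steps above can be sketched in Python on a heavily simplified toy model. Everything here is an assumption made for illustration: the `Block` class, its fields, and the two-layer example layout are hypothetical, each upper-layer block is reduced to a single node that tests one key position, and a layer-0 block is a plain dictionary standing in for a Patricia subtrie. The sketch only aims to show the control flow: follow a matching far link, fall back to the direct link on a miss or when no label matches, and resolve the key in layer 0.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    layer: int
    depth: int = 0                            # key position tested in this block
    far: dict = field(default_factory=dict)   # label -> block in the next layer
    direct: Optional["Block"] = None          # same prefix, next layer
    keys: dict = field(default_factory=dict)  # layer 0 only: key -> data


def search(block, key):
    """Return the data stored for key, or None (data values must not be None)."""
    if block.layer == 0:
        return block.keys.get(key)
    label = key[block.depth] if block.depth < len(key) else None
    target = block.far.get(label)
    if target is not None:
        found = search(target, key)
        if found is not None:
            return found
        # Miss in the far-linked block: backtrack to the direct link.
    if block.direct is not None:
        return search(block.direct, key)
    return None


# Hypothetical two-layer layout: one layer-0 block holds the keys with
# prefix "cas", the other holds everything else.
cas_block = Block(layer=0, keys={"cash": "rec-cash", "castle": "rec-castle",
                                 "casting": "rec-casting"})
rest_block = Block(layer=0, keys={"cat": "rec-cat", "far": "rec-far",
                                  "fast": "rec-fast"})
root = Block(layer=1, depth=2, far={"s": cas_block}, direct=rest_block)

print(search(root, "castle"), search(root, "fast"), search(root, "dog"))
```

In this toy layout `castle` is found through the far link labeled `s`. The key `fast` also has `s` at position 2, so it first follows the same far link, misses, and is then found through the direct link, which is the backtracking case described on the slide.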
14. Fabric: Search Examples (castle, fast)
15. Fabric: Searching with One I/O
- A search accesses one block per layer.
- Since the horizontal layers are balanced, all searches traverse the same number of layers and therefore access the same number of blocks.
- Compact storage of keys: blocks have a very high fan-out, so the PT has low height.
- Example: 3 layers are sufficient to store a billion keys.
  - The 2 upper layers are kept in memory, layer 0 on disk.
  - Thus searches require a single index I/O.
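The arithmetic behind the three-layer example can be checked with a short Python sketch. The slides do not state a fan-out, so the value of 1000 entries per block used below is purely an assumption, as are the function and variable names. With a fan-out of f, L layers can address roughly f^L keys.

```python
def layers_needed(n_keys, fan_out):
    """Smallest number of layers L such that fan_out ** L >= n_keys."""
    layers, capacity = 1, fan_out
    while capacity < n_keys:
        capacity *= fan_out
        layers += 1
    return layers


ASSUMED_FAN_OUT = 1000     # hypothetical; not a figure from the slides
LAYERS_IN_MEMORY = 2       # the two upper layers, as stated on the slide

layers = layers_needed(10**9, ASSUMED_FAN_OUT)
disk_reads_per_search = max(layers - LAYERS_IN_MEMORY, 0)
print(layers, disk_reads_per_search)
```

Under this assumed fan-out, a billion keys need three layers, and with the two upper layers cached only the layer-0 block has to be read from disk.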
16. Fabric: Updates
- Similar to B-trees. Very efficient, like searches.
- Insertion
  - Involves a change to a single block in the lowest layer (layer 0). If a block has no space for the insertion, it splits. Splits may cascade to higher layers.
- Deletion
  - Find the block to be updated and remove the edge pointing to the key. Blocks might merge to compact the trie.
- Updates
  - A deletion followed by an insertion.
17. Indexing Semistructured Data
- Introduction
- Detour: Tries and Patricia Tries
- The Index Fabric Structure
- Indexing XML with the Index Fabric
  - Designators
  - Raw Paths
  - Refined Paths
- Results and Conclusion
18. Indexing XML: Example XML
19. Indexing XML: Designators
- Designator: a unique special character (or characters) assigned to each tag that appears in the XML.
- The designator dictionary maintains the mapping between tags and designators.
- The designator-encoded XML strings are inserted into the Index Fabric.
[Figure: designator dictionary]
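A designator dictionary of this kind could be sketched as follows. This is only an illustration: the class name, its methods, and the choice of uppercase letters as designators are assumptions for the example. A real system would need designators that cannot collide with characters in the data.

```python
import string


class DesignatorDictionary:
    """Toy mapping from XML tag names to single-character designators."""

    def __init__(self, alphabet=string.ascii_uppercase):
        self._unused = iter(alphabet)   # designators not yet handed out
        self.by_tag = {}                # tag -> designator

    def designator(self, tag):
        # Assign the next unused designator the first time a tag is seen.
        if tag not in self.by_tag:
            try:
                self.by_tag[tag] = next(self._unused)
            except StopIteration:
                raise ValueError("no unused designators left") from None
        return self.by_tag[tag]


dictionary = DesignatorDictionary()
for tag in ["invoice", "buyer", "seller", "buyer"]:
    dictionary.designator(tag)
print(dictionary.by_tag)
```

Repeated tags reuse the designator they were given first, so the dictionary stays small even when the document is large.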
20. Indexing XML: Raw Paths
- Raw paths index the hierarchical structure of the XML by encoding root-to-leaf paths as strings.
- Simple path expressions starting at the root require a single index lookup.
- Example: the XML fragment
  - <A>alpha<B>beta<C>gamma</C></B></A>
- can be represented as a tree with 3 root-to-leaf paths:
  - 1) <A>alpha 2) <A><B>beta 3) <A><B><C>gamma
- which are encoded as
  - 1) A alpha 2) A B beta 3) A B C gamma
- under the mapping f, where
  - f(<A>) = A, f(<B>) = B, f(<C>) = C
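The encoding on this slide can be reproduced with a short Python sketch using the standard `xml.etree.ElementTree` module. The function name `raw_paths`, the dictionary `F` standing in for the mapping f, and the use of a plain `set` in place of the Index Fabric are assumptions for illustration. The sketch ignores attributes and any text that follows a closing tag.

```python
import xml.etree.ElementTree as ET

# Mapping f from the slide: tag name -> designator.
F = {"A": "A", "B": "B", "C": "C"}


def raw_paths(elem, prefix=()):
    """Yield one designator-encoded key per element that carries text."""
    path = prefix + (F[elem.tag],)
    text = (elem.text or "").strip()
    if text:
        yield " ".join(path + (text,))
    for child in elem:
        yield from raw_paths(child, path)


fragment = "<A>alpha<B>beta<C>gamma</C></B></A>"
keys = list(raw_paths(ET.fromstring(fragment)))
index = set(keys)            # stand-in for the string index

# A simple path query from the root becomes a single key lookup.
print(keys, "A B beta" in index)
```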
21. Indexing XML: Raw Paths Example
[Figure: the XML fragment <A>alpha<B>beta<C>gamma</C></B></A>, its tree (A with text alpha, child B with text beta, grandchild C with text gamma), and the three encoded keys: A alpha, A B beta, A B C gamma]
22. Indexing XML: Refined Paths
- Specialized paths through the XML that optimize frequently occurring access patterns.
- Can support queries that have wildcards, alternates, and constants.
- The DBA decides which refined paths are appropriate.
23. Refined Paths: An Example
- Frequent query Q: find the invoices where company X sold to company Y, i.e. find <buyer> tags that are siblings of a <seller> tag:
  - <invoice> <buyer>Y</buyer> <seller>X</seller> </invoice>
- Assign a designator Z to such a path.
- Encode the information indexed by this refined path in a key:
  - <invoice> <buyer>ABC Corp</buyer> <seller>Acme Inc</seller> </invoice> is encoded as the key Z ABC Corp Acme Inc
- Insert the created keys into the Index Fabric.
- Keys refer to the XML fragments that answer Q.
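As a rough illustration of the refined path above, the following Python sketch builds the key for each invoice that has both a buyer and a seller, and answers the frequent query with one lookup. The function name `refined_keys`, the `<invoices>` wrapper element, and the dictionary used in place of the Index Fabric are assumptions for this example. The key layout (designator Z, then buyer, then seller) follows the slide.

```python
import xml.etree.ElementTree as ET

REFINED_DESIGNATOR = "Z"   # designator assigned to the buyer/seller path


def refined_keys(xml_text):
    """Map each refined-path key to the invoice element it refers to."""
    index = {}
    for invoice in ET.fromstring(xml_text).iter("invoice"):
        buyer = invoice.find("buyer")
        seller = invoice.find("seller")
        if buyer is None or seller is None:
            continue   # the refined path needs both siblings
        key = " ".join([REFINED_DESIGNATOR,
                        (buyer.text or "").strip(),
                        (seller.text or "").strip()])
        index[key] = invoice
    return index


document = """
<invoices>
  <invoice><buyer>ABC Corp</buyer><seller>Acme Inc</seller></invoice>
  <invoice><buyer>ABC Corp</buyer></invoice>
</invoices>
"""
index = refined_keys(document)

# "Find the invoices where Acme Inc sold to ABC Corp" is one key lookup.
hit = index.get("Z ABC Corp Acme Inc")
print(sorted(index), hit is not None)
```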
25. Results
- Index Fabric compared to
  - the DBMS's native B-tree index over tables generated by STORED;
  - the DBMS's native B-tree index over tables generated by basic edge-mapping (roots and edges).
- Index Fabric outperforms the B-tree indexes.
- Index Fabric offers significant optimization, especially for complex queries and refined paths.
26. Conclusions
- Indexing semistructured data poses significant challenges.
- Many problems are not yet solved, e.g. efficient processing of queries involving complex regular expressions.
- Index Fabric: indexing for XML stored in a relational DB (can work for other models).
- Interesting features
  - Combines aspects of Patricia tries (scaling) and B-trees (balanced, optimized for disk access).
  - No a priori knowledge of the structure is needed.
27. References
- B. Cooper et al. A Fast Index for Semistructured Data. In Proc. VLDB, 2001.
- B. Cooper and M. Shadmon. The Index Fabric: A mechanism for indexing and querying the same data in many different ways. Technical Report, 2000. Available at http://www.rightorder.com/technology/overview.pdf
28. Thanks