Title: Indexing Mixed Types for Approximate Retrieval
1. Indexing Mixed Types for Approximate Retrieval
- Liang Jin, UC Irvine
- Nick Koudas, University of Toronto
- Chen Li, UC Irvine
- Anthony K.H. Tung, National University of Singapore
VLDB 2005. Liang Jin and Chen Li are supported by NSF CAREER Award IIS-0238586.
2. Queries with Mixed-Type Predicates
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| < 5
- SIMILARTO
- a domain-specific function
- returns a similarity value between two strings
- Example: edit distance ed("Tom Hanks", "Ton Hank") = 2
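The edit-distance example above can be checked with a short sketch. This is the textbook Levenshtein dynamic program (unit-cost insertions, deletions, and substitutions); the function name `edit_distance` is chosen here for illustration and is not taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions that turn a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


print(edit_distance("Tom Hanks", "Ton Hank"))  # 2
```

The two edits in the example are the substitution m → n and the deletion of the trailing s.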
3. Why fuzzy predicates?
- Errors in queries
- User doesn't remember a string exactly
- User types a wrong string
4. Problem Formulation
Given: a query with fuzzy predicates on strings and range predicates on numeric attributes, on a single relation.
Goal: answer the query efficiently.
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| < 5
5. Rest of the talk
- Motivation: supporting queries with mixed-type predicates
- Our approach: MAT-tree
- Construction and maintenance of MAT-tree
- Experiments
6. Assumptions
- One fuzzy string predicate (edit distance)
- One numeric predicate
Query: (Qs, ds, Qn, dn)
Example:
SELECT * FROM Movies WHERE star SIMILARTO 'Schwarrzenger' AND |year - 1980| < 5
(Schwarrzenger, 2, 1980, 5)
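To make the query form (Qs, ds, Qn, dn) concrete, here is a minimal sketch of the sequential-scan baseline: every record is tested against both predicates. The class name `MixedQuery`, the sample records, and the use of inclusive thresholds (<=) are assumptions made for illustration; the slides write the numeric predicate with a strict "<".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MixedQuery:
    qs: str    # query string Qs
    ds: int    # edit-distance threshold
    qn: float  # query number Qn
    dn: float  # numeric threshold


def edit_distance(a, b):
    """Textbook Levenshtein distance, kept compact for this sketch."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def sequential_scan(records, q):
    """Return all (string, number) records satisfying both predicates.
    The cheap numeric test runs first so the costly edit distance is
    computed only for records that pass it."""
    return [(s, n) for s, n in records
            if abs(n - q.qn) <= q.dn and edit_distance(s, q.qs) <= q.ds]


# Hypothetical sample data, not from the IMDB data set used in the talk.
records = [("Tom Hanks", 1956), ("Ton Hank", 1958),
           ("Tom Hardy", 1977), ("Tim Hanks", 1990)]
print(sequential_scan(records, MixedQuery("Tom Hanks", 2, 1956, 5)))
```

An index such as the MAT-tree aims to return the same answers while touching far fewer records.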
7. Intuition of MAT (Mixed-Attribute-Type) Tree
- 2 > 1 + 1
- One integrated indexing structure is better than two independent indexing structures on two attributes
- Indexing numeric attributes: B-tree or R-tree
- Indexing strings as a tree to support fuzzy predicates?
MAT-tree
8. Answering a query (Qs, ds, Qn, dn)
- Traverse the MAT-tree top-down
- At each node, prune by checking:
- whether [Qn - dn, Qn + dn] overlaps the node's numeric range
- whether minEditDistance(Qs, Tn) < ds, where Tn is the trie stored at the node
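The traversal above can be sketched as follows. This is an illustration under stated assumptions, not the MAT-tree implementation: each node's string summary is a plain set of strings standing in for the compressed trie Tn, thresholds are treated as inclusive, and `Node`, `make_leaf`, `make_inner`, and `search` are hypothetical names.

```python
from dataclasses import dataclass, field


def edit_distance(a, b):
    """Textbook Levenshtein distance, kept compact for this sketch."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


@dataclass
class Node:
    lo: float            # numeric range covered by this subtree
    hi: float
    strings: frozenset   # stand-in for the compressed trie Tn (assumption)
    children: list = field(default_factory=list)
    records: list = field(default_factory=list)  # non-empty only in leaves


def make_leaf(records):
    years = [n for _, n in records]
    return Node(min(years), max(years),
                frozenset(s for s, _ in records), records=list(records))


def make_inner(children):
    strings = frozenset().union(*(c.strings for c in children))
    return Node(min(c.lo for c in children), max(c.hi for c in children),
                strings, children=list(children))


def search(node, qs, ds, qn, dn, out, visited=None):
    """Top-down traversal that prunes a subtree when either test fails."""
    if visited is not None:
        visited.append(node)
    # Numeric test: [qn - dn, qn + dn] must overlap the node's range.
    if node.hi < qn - dn or node.lo > qn + dn:
        return out
    # String test: the closest string the node may hold must be within ds.
    if min(edit_distance(qs, s) for s in node.strings) > ds:
        return out
    for s, n in node.records:  # leaf: verify each record exactly
        if abs(n - qn) <= dn and edit_distance(s, qs) <= ds:
            out.append((s, n))
    for child in node.children:
        search(child, qs, ds, qn, dn, out, visited)
    return out


leaf1 = make_leaf([("Tom Hanks", 1956), ("Ton Hank", 1958)])
leaf2 = make_leaf([("Meg Ryan", 1961), ("Tom Hardy", 1977)])
root = make_inner([leaf1, leaf2])
print(search(root, "Tom Hanks", 2, 1956, 5, []))
```

In this toy tree the second leaf passes the numeric test (1961 is inside the query range) but is pruned by the string test, which is the situation where checking both attributes in one structure pays off.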
9. Challenge
- How to represent strings so that they fit into a limited space (disk-based) and still support fuzzy-predicate pruning
10. Existing Approaches to Indexing Strings as Trees
- M-tree
- Edit distance as a metric space
- Q-tree
- Utilizes the q-gram property of strings
- See our paper for details
11. Representing strings as a trie
12. Compressing a trie
- Select k representative nodes (centers) for compression.
- Each center is in the format <alphabet, height>.
- A compressed trie represents more strings than the original trie.
13. Minimum edit distance between a string and a trie
- minEditDistance(Qs, Tn)?
- Convert the trie to an automaton.
- Compute the minimum distance between a string and an automaton [Myers and Miller, 1989].
- Early termination is possible.
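A minimal sketch of computing minEditDistance(Qs, Tn) on an uncompressed trie. It swaps in a simpler technique than the one cited above: a row-by-row Levenshtein dynamic program carried down the trie, not the Myers and Miller automaton algorithm. The nested-dict trie, the `END` marker, and the function names are assumptions for illustration. Early termination here means abandoning a branch once no string below it can beat the best distance found so far.

```python
END = "$"  # marker key: a stored string ends at this node (assumes no "$" in data)


def build_trie(strings):
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def min_edit_distance(query, trie):
    """Smallest edit distance between query and any string in the trie."""
    best = [float("inf")]  # smallest distance found so far

    def visit(node, row):
        # row[j] = edit distance between the current trie prefix and query[:j]
        if END in node:
            best[0] = min(best[0], row[-1])
        for ch, child in node.items():
            if ch == END:
                continue
            new_row = [row[0] + 1]
            for j, qc in enumerate(query, start=1):
                cost = 0 if qc == ch else 1
                new_row.append(min(new_row[j - 1] + 1,   # insert query char
                                   row[j] + 1,           # delete trie char
                                   row[j - 1] + cost))   # match or substitute
            # Row minima never decrease along a path, so min(new_row) is a
            # lower bound for every string below this child.
            if min(new_row) < best[0]:
                visit(child, new_row)

    visit(trie, list(range(len(query) + 1)))
    return best[0]


trie = build_trie(["Tom Hanks", "Tom Cruise", "Meg Ryan"])
print(min_edit_distance("Ton Hank", trie))  # 2
```

For pruning against a threshold ds, the same bound could be used to stop as soon as a branch's row minimum exceeds ds.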
14. Compressed trie → Automaton
- Each node is a state.
- Each edge becomes a transition between two states.
- For a compressed node <S, L>, expand it to L levels. At each level, all characters in S become single states and are connected to a common tail ε.
Example: convert a compressed node <{a,b,c}, 2> into automaton nodes.
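The expansion rule for a compressed node can be sketched as below. State numbering, the dictionary layout, and the function names are hypothetical. The slides do not say which states are accepting; this sketch assumes every tail state accepts, so the expanded node accepts strings over S of length 1 to L.

```python
def expand_compressed_node(chars, levels, start=0):
    """Expand a compressed node <S, L> into automaton states.

    Returns (transitions, epsilons, tails):
      transitions: (state, char) -> character state
      epsilons:    character state -> common tail state of its level
      tails:       tail state of each level, in order
    """
    transitions, epsilons, tails = {}, {}, []
    next_id = start + 1
    entry = start  # state from which the current level is entered
    for _ in range(levels):
        char_states = []
        for ch in sorted(chars):  # one single state per character in S
            transitions[(entry, ch)] = next_id
            char_states.append(next_id)
            next_id += 1
        tail = next_id  # common tail of this level
        next_id += 1
        for st in char_states:
            epsilons[st] = tail
        tails.append(tail)
        entry = tail  # the next level starts from this tail
    return transitions, epsilons, tails


def accepts(s, transitions, epsilons, tails, start=0):
    """Simulate the expanded node; any tail state accepts (assumption)."""
    state = start
    for ch in s:
        if (state, ch) not in transitions:
            return False
        state = epsilons[transitions[(state, ch)]]
    return state in tails


t, e, tails = expand_compressed_node({"a", "b", "c"}, 2)
print(len(t), len(e), tails)  # 6 6 [4, 8]
print(accepts("ab", t, e, tails), accepts("abc", t, e, tails))  # True False
```

For <{a,b,c}, 2> this yields nine states: the entry state, three character states and one tail per level.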
15. Outline
- Motivation: supporting queries with mixed-type predicates
- Our approach: MAT-tree
- Construction and maintenance of MAT-tree
- Experiments
16. Constructing MAT-tree
- Option 1: insert records one by one.
- Option 2:
- bulk-load records
- construct the MAT-tree bottom-up
17. Compressing a trie
- Important
- Accurately represent strings in a limited space.
- Minimize information loss.
- Maintain the pruning power during a traversal.
- Three methods
- (1) Reducing the number of accepted strings
- (2) Keeping accepted strings clustered
- (3) Combining (1) and (2)
18. Method (1): Reducing the number of accepted strings
- Intuition
- reducing this number makes the compressed trie more accurate
- Goodness function: number of accepted strings
- Algorithm: Randomized
- Randomly select k initial centers
- Randomly select one of the centers
- Randomly select an unselected node
- Swap them if doing so improves the goodness function
- Repeat for a certain number of iterations
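The Randomized steps above amount to a swap-based local search, sketched below. The goodness function is passed in as a hypothetical `cost` callable where lower is better (for the method above it would count accepted strings, which this sketch does not implement). The function name, the `iterations` default, and the `seed` parameter are illustrative assumptions.

```python
import random


def randomized_centers(nodes, k, cost, iterations=1000, seed=None):
    """Pick k centers by random swaps, keeping a swap only if cost drops.

    nodes: hashable candidate nodes; cost: callable on a list of centers.
    Requires k <= len(nodes).
    """
    rng = random.Random(seed)
    nodes = list(nodes)
    centers = rng.sample(nodes, k)          # random initial centers
    chosen = set(centers)
    others = [n for n in nodes if n not in chosen]
    best = cost(centers)
    for _ in range(iterations):
        if not others:
            break
        i = rng.randrange(k)                # a random current center
        j = rng.randrange(len(others))      # a random unselected node
        centers[i], others[j] = others[j], centers[i]
        new = cost(centers)
        if new < best:
            best = new                      # keep the improving swap
        else:
            centers[i], others[j] = others[j], centers[i]  # undo
    return centers, best


# Toy usage: nodes are integers and the stand-in cost is their sum.
centers, best = randomized_centers(range(20), 3, cost=sum, seed=0)
print(sorted(centers), best)
```

Because only improving swaps are kept, the result is a local optimum of the supplied cost, and its quality depends on the number of iterations.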
19. Method (2): Keeping accepted strings clustered
- Intuition
- keep the accepted strings similar to the original ones by letting them share a common prefix
- Place k centers as close to the root as possible.
- Algorithm: BreadthFirst
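One plausible reading of "place k centers as close to the root as possible" is to take the first k trie nodes in breadth-first order, sketched below. The slides give no further detail, so the nested-dict trie, identifying a node by its prefix, excluding the root, and visiting siblings in sorted order are all assumptions.

```python
from collections import deque


def build_trie(strings):
    """Nested-dict trie; each key is one character."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
    return root


def breadth_first_centers(trie, k):
    """Return the prefixes of the first k non-root nodes in BFS order."""
    centers = []
    queue = deque([("", trie)])
    while queue and len(centers) < k:
        prefix, node = queue.popleft()
        if prefix:  # assumption: the root itself is not a center
            centers.append(prefix)
        for ch in sorted(node):
            queue.append((prefix + ch, node[ch]))
    return centers


trie = build_trie(["tom", "tim", "ann"])
print(breadth_first_centers(trie, 3))  # ['a', 't', 'an']
```

Centers chosen this way sit on shallow levels, so the strings accepted below each center still share the center's prefix.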
20. Method (3): Combining (1) and (2)
- Intuition
- minimize the number of accepted strings and, at the same time, maintain their similarity to the originals
- Algorithm: BottomUp
- Keep shrinking the trie bottom-up until we have k nodes
- At each step, compress the node that minimizes the number of additional strings
21. Dynamic maintenance
- Insertion (s, n)
- Search the index for (s, n). If it's not in the index, identify the correct leaf node.
- If no overflow
- update the MBR of the leaf node and of its ancestors, recursively, if necessary.
- If overflow
- Split the leaf node
- Construct two compressed tries
- Cascade the split to the ancestors if necessary.
- Deletion and update are handled similarly
22. Outline
- Motivation: supporting queries with mixed-type predicates
- Our approach: MAT-tree
- Construction and maintenance of MAT-tree
- Experiments
23. Setting
- Data
- IMDB: 100K movie star records (name and YOB)
- Customers: 50K records (name and YOB)
- Test bed
- PC: 2.4 GHz P4, 1.2 GB memory, Windows XP
- Visual C++ compiler
- Results are similar on both data sets; results are reported for IMDB.
24. Implemented approaches
- B-tree
- Q-tree
- B-tree + Q-tree
- BQ-tree
- BM-tree
- Sequential scan
- BBQ-tree?
25. 2 > 1 + 1
An integrated indexing structure is better than two separate indexing structures
(ds = 3, dn = 4)
26. Scalability
27. Effect of numeric threshold dn
28. Effect of string threshold ds
29. Dynamic maintenance: time
30. Dynamic maintenance: MAT quality
31. Number of centers
- Increasing the number of clusters may not reduce the running time: pruning power versus computational cost
- For BottomUp and BreadthFirst (compared to Randomized): centers are close to the root, so early termination is more likely
32. Conclusion
- MAT-tree: an efficient indexing structure for queries with mixed-type predicates
- Can be efficiently constructed and maintained
- Future work: develop a uniform framework to support different kinds of similarity functions
The Flamingo Project: http://www.ics.uci.edu/flamingo/
Q&A?