Title: An Efficient Index Structure for String Databases
1An Efficient Index Structure for String Databases
- Tamer Kahveci
- Ambuj K. Singh
- Presented By
- Atul Ugalmugale/Nikita Rasam
2- Issue ?
- Quickly find the substrings in a large database that are similar to a given query string, using a small index structure.
- In some applications we store, search and analyze long sequences of discrete characters, which we call strings.
- There is a frequent need to find similarities in genetic data, web data and event sequences.
3- Applications ?
- Information Retrieval: A typical application of information retrieval is text searching. Given a large collection of documents and some text keywords, we want to find the documents that contain these keywords.
- When searching for keywords on the web, by "mtallica" we usually mean "metallica".
4- Computational Biology: The problem is similar in computational biology. Here we have a long DNA sequence and we want to find subsequences in it that approximately match a query sequence.
- ATGCATACGATCGATT
- TGCAATGGCTTAGCTA
- Animal species from the same family are bound to have more similar DNA.
5- Video data can be viewed as an event sequence if a pre-specified set of events is detected and stored as a sequence. Searching for similar event subsequences can be used to find related video segments.
6- String search algorithms proposed so far are in-memory algorithms.
- They scan the whole database for each query.
- The size of the string database grows faster than the available memory capacity, and extensive memory requirements make the search techniques impractical.
- They suffer from disk I/Os when the database is too large.
- Performance deteriorates for long query patterns.
7- Similarity Metrics
- The difference between two strings s1 and s2 is generally defined as the minimum number of edit operations needed to transform s1 into s2, called the edit distance (ED).
- Edit operations
- Insert
- Delete
- Replace
8- Suppose we have two strings x,y
- e.g. x = kitten, y = sitting
- and we want to transform x into y.
- A closer look
- k i t t e n
- s i t t i n g
- 1st step: kitten → sitten (Replace)
- 2nd step: sitten → sittin (Replace)
- 3rd step: sittin → sitting (Insert)
- What is the edit distance between survey and surgery?
- s u r v e y → s u r g e y (replace, cost 1) → s u r g e r y (insert, cost 1)
- Edit distance = 2
9- In the general version of edit distance, different operations may have different costs, or the costs may depend on the characters involved.
- For example, replacement could be more expensive than insertion, or replacing a with o could be less expensive than replacing a with k.
- This is called the weighted edit distance.
10- Global Alignment
- The global alignment (or similarity) of s1 and s2 is defined as the maximum-valued alignment of s1 and s2.
- Given two strings S1 and S2, a global alignment is obtained by inserting spaces into S1 and S2, or at their ends, so that they have the same length, and then writing one above the other.
- Example: S1 = qacdbd, S2 = qawdb
- qac_dbd
- qa_wdb_
- Edits and alignments are dual.
- A sequence of edits can be converted into a global alignment.
- An alignment can be converted into a sequence of edits.
11- Local Alignment
- Given two strings X and Y, find two substrings x and y from X and Y, respectively, such that their alignment score (in the global sense) is maximum over all pairs of such substrings (empty substrings are allowed).
- Scoring function S(x,y):
- S(x,y) = 2 if x = y
- S(x,y) = -2 if x ≠ y
- S(x,y) = -1 if x = _ or y = _
- Example: X = pqraxabcstvq, Y = yxaxbacsll; x = axabcs, y = axbacs
- a x a b _ c s
- a x _ b a c s
- Score: 2 + 2 - 1 + 2 - 1 + 2 + 2 = 8
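The local alignment score above can be computed with the standard Smith-Waterman recurrence. The following is a minimal sketch, assuming the slide's scoring values (match 2, mismatch -2, space -1); the function name `local_alignment_score` is illustrative and not taken from the slides.

```python
def local_alignment_score(X, Y, match=2, mismatch=-2, gap=-1):
    """Best local alignment score of X and Y (Smith-Waterman, score only)."""
    best = 0
    prev = [0] * (len(Y) + 1)          # scores for the previous row of the table
    for a in X:
        cur = [0]                      # an empty substring always scores 0
        for j, b in enumerate(Y, start=1):
            s = match if a == b else mismatch
            cell = max(0,                  # start a new local alignment here
                       prev[j - 1] + s,    # align a with b
                       prev[j] + gap,      # align a with a space
                       cur[j - 1] + gap)   # align b with a space
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(local_alignment_score("pqraxabcstvq", "yxaxbacsll"))  # 8
```

Only the score is returned; recovering the substrings x and y would require keeping the full table and tracing back from the best cell.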
12String Matching Problem
- Whole Matching
- Find the edit distance ED(q,s) between a data string s and a query string q.
- Substring Matching
- Consider all substrings s[i..j] of s that are close to the query string.
- Two Types of Queries
- Range search seeks all the substrings of S that are within an edit distance of r of a given query q (r is the query range).
- K-nearest neighbor search seeks the K closest substrings of S to q.
13Challenges in solving the substring matching problem
- Finding the edit distance is very costly in terms of both time and space.
- The strings in the database may be very long.
- The database size for most applications grows exponentially.
- New approach to overcome challenges
- Define a lower bound distance for substring searching.
- Improve this lower bound by using the idea of wavelet transformation.
- Use the MRS index structure based on the aforementioned distance formulations.
14A dynamic programming algorithm for computing the edit distance
- Problem: find the edit distance between strings x and y.
- Create a (|x|+1) × (|y|+1) matrix C, where C[i,j] represents the minimum number of operations to match x[1..i] with y[1..j]. The matrix is constructed as follows.
- C[i,0] = i
- C[0,j] = j
- C[i,j] = min of:
- C[i-1,j-1] + cost (replace)
- C[i,j-1] + 1 (insert)
- C[i-1,j] + 1 (delete)
- cost = 0 if x[i] = y[j], else 1
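The recurrence above translates directly into code. This is a minimal sketch with unit costs; the function name `edit_distance` is illustrative.

```python
def edit_distance(x, y):
    """Unit-cost edit distance using the (|x|+1) x (|y|+1) matrix C."""
    m, n = len(x), len(y)
    C = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        C[i][0] = i                    # delete the first i characters of x
    for j in range(n + 1):
        C[0][j] = j                    # insert the first j characters of y
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            C[i][j] = min(C[i - 1][j - 1] + cost,  # replace (or match)
                          C[i][j - 1] + 1,         # insert
                          C[i - 1][j] + 1)         # delete
    return C[m][n]


print(edit_distance("kitten", "sitting"))  # 3
print(edit_distance("survey", "surgery"))  # 2
```

The table needs O(|x|·|y|) time and space, which is the cost the slides refer to when they call edit distance expensive.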
15How do we perform substring search?
- The same dynamic programming algorithm can be used to find the substrings most similar to a query string q.
- The difference is that we set C[0,j] = 0 for all j, since any text position could be the potential start of a match.
- If the similarity distance bound is k, we report all positions j where C[m,j] ≤ k (m is the last row, m = |q|).
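A minimal sketch of this substring variant follows, assuming unit costs and keeping only one column of C at a time. The function name `approx_match_ends` and the choice to report 1-based end positions are assumptions made for illustration.

```python
def approx_match_ends(q, text, k):
    """End positions j (1-based) in text where some substring ending at j
    is within edit distance k of the query q."""
    m = len(q)
    prev = list(range(m + 1))          # column 0: C[i,0] = i
    ends = []
    for j, t in enumerate(text, start=1):
        cur = [0]                      # C[0,j] = 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if q[i - 1] == t else 1
            cur.append(min(prev[i - 1] + cost,  # replace (or match)
                           prev[i] + 1,         # skip a text character
                           cur[i - 1] + 1))     # skip a query character
        if cur[m] <= k:                # last row within the distance bound
            ends.append(j)
        prev = cur
    return ends


print(approx_match_ends("abc", "xxabcxx", 0))  # [5]
print(approx_match_ends("abc", "abd", 1))      # [2, 3]
```

Note that this still scans the whole text for every query, which is the behavior the index structure in the later slides is meant to avoid.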
16Frequency Vector
- Let s be a string from the alphabet Σ = {α1, ..., ασ}. Let ni be the number of occurrences of the character αi in s for 1 ≤ i ≤ σ; then the frequency vector is f(s) = [n1, ..., nσ].
- Example
- s = AATGATAG
- f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
- Let s be a string from the alphabet Σ = {α1, ..., ασ} and let f(s) = [v1, ..., vσ] be the frequency vector of s; then v1 + ... + vσ = |s|.
- An edit operation on s has one of the following effects on f(s), for 1 ≤ i, j ≤ σ and i ≠ j:
- vi = vi + 1
- vi = vi - 1
- vi = vi + 1 and vj = vj - 1
17Effect of Edit Operations on Frequency Vector
- Delete decreases an entry by 1.
- Insert increases an entry by 1.
- Replace = Insert + Delete
- Example
- s = AATGATAG, f(s) = [4, 0, 2, 2]
- (del. G), s = AAT.ATAG, f(s) = [4, 0, 1, 2]
- (ins. C), s = AACTATAG, f(s) = [4, 1, 1, 2]
- (A→C), s = ACCTATAG, f(s) = [3, 2, 1, 2]
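The example can be replayed in a few lines. This is a small sketch, assuming the alphabet order A, C, G, T used on the slide; the function name `frequency_vector` is illustrative.

```python
from collections import Counter


def frequency_vector(s, alphabet="ACGT"):
    """Count how often each alphabet character occurs in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
print(frequency_vector("AATATAG"))   # after deleting G:      [4, 0, 1, 2]
print(frequency_vector("AACTATAG"))  # after inserting C:     [4, 1, 1, 2]
print(frequency_vector("ACCTATAG"))  # after replacing A by C: [3, 2, 1, 2]
```

Each edit changes at most two entries by one, which is what makes a distance on frequency vectors usable as a bound on edit distance.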
18Frequency Distance
- Let u and v be integer points in σ-dimensional space. The frequency distance FD1(u,v) between u and v is defined as the minimum number of steps needed to go from u to v (or equivalently from v to u) by moving to a neighbor point at each step.
- Frequency vector: f(s) = [n1, ..., nσ].
- Let s1 and s2 be two strings from the alphabet Σ = {α1, ..., ασ}; then FD1(f(s1), f(s2)) ≤ ED(s1, s2).
19An Approximation to ED: Frequency Distance (FD1)
- s = AATGATAG, f(s) = [4, 0, 2, 2]
- q = ACTTAGC, f(q) = [2, 2, 1, 2]
- pos = (4-2) + (2-1) = 3
- neg = (2-0) = 2
- FD1(f(s),f(q)) = 3
- ED(q,s) = 4
- FD1(f(s1),f(s2)) = max{pos, neg}.
- FD1(f(s1),f(s2)) ≤ ED(s1,s2).
20Frequency Distance Calculation
/* u and v are σ-dimensional integer points */
Algorithm FD1(u,v)
- posDistance = negDistance = 0
- For i = 1 to σ
- FD1(u, v) = max{posDist, negDist}
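A minimal sketch of FD1 follows. The loop body is inferred from the pos/neg example on the previous slide (sum the positive differences and the negative differences separately), so treat it as an assumption rather than the authors' exact pseudocode; the name `fd1` is illustrative.

```python
def fd1(u, v):
    """Frequency distance between two integer points of equal dimension."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)  # total surplus of u over v
    neg = sum(b - a for a, b in zip(u, v) if b > a)  # total surplus of v over u
    return max(pos, neg)


f_s = [4, 0, 2, 2]  # f("AATGATAG")
f_q = [2, 2, 1, 2]  # f("ACTTAGC")
print(fd1(f_s, f_q))  # 3
```

The maximum is taken because one replace operation can reduce pos and neg by one each, while an insert or delete reduces only one of them.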
21Wavelet Vector Computation
- Let s = c0c1...c(n-1) be a string from the alphabet Σ = {α1, ..., ασ}. The k-th level wavelet transformation ψk(s), 0 ≤ k ≤ log2 n, of s is defined as ψk(s) = [v[k,0], ..., v[k,n/2^k - 1]], where v[k,i] = [A[k,i], B[k,i]] and
- A[k,i] = f(ci) if k = 0
- A[k,i] = A[k-1,2i] + A[k-1,2i+1] if 0 < k ≤ log2 n
- B[k,i] = 0 if k = 0
- B[k,i] = A[k-1,2i] - A[k-1,2i+1] if 0 < k ≤ log2 n
22Using Local Information: Wavelet Decomposition of Strings
- s = AATGATAC, f(s) = [4, 1, 1, 2]
- s = AATG + ATAC = s1 + s2
- f(s1) = [2, 0, 1, 1]
- f(s2) = [2, 1, 0, 1]
- ψ1(s) = f(s1) + f(s2) = [4, 1, 1, 2]
- ψ2(s) = f(s1) - f(s2) = [0, -1, 1, 0]
23Wavelet Decomposition of a String: General Idea
- A[i,j] = f(s(j·2^i : (j+1)·2^i - 1))
- B[i,j] = A[i-1,2j] - A[i-1,2j+1]
[Figure labels: first wavelet coefficient, second wavelet coefficient, ψ(s)]
24Wavelet Transformation Example
- s = TCAC, n = |s| = 4
- ψ0(s) = [v[0,0], v[0,1], v[0,2], v[0,3]]
- = [(A[0,0], B[0,0]), (A[0,1], B[0,1]), (A[0,2], B[0,2]), (A[0,3], B[0,3])]
- = [(f(t), 0), (f(c), 0), (f(a), 0), (f(c), 0)]
- = [([0,0,1], 0), ([0,1,0], 0), ([1,0,0], 0), ([0,1,0], 0)]
- ψ1(s) = [([0,1,1], [0,-1,1]), ([1,1,0], [1,-1,0])]
- ψ2(s) = [([1,2,1], [-1,0,1])]
[Figure labels: first wavelet coefficient, second wavelet coefficient]
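The TCAC example can be reproduced with a short routine. This is a sketch under two assumptions: the string length is a power of two, and the alphabet order is A, C, T as in the example. The names `one_hot` and `wavelet_levels` are illustrative, and the zero second coefficient at level 0 is stored as a zero vector rather than the scalar 0.

```python
def one_hot(ch, alphabet):
    """Frequency vector of a single character."""
    return [1 if ch == a else 0 for a in alphabet]


def wavelet_levels(s, alphabet):
    """Return [psi_0(s), psi_1(s), ...], each a list of (A, B) pairs."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("string length must be a power of two")
    level = [(one_hot(ch, alphabet), [0] * len(alphabet)) for ch in s]
    levels = [level]
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            A = [l + r for l, r in zip(left, right)]   # sum of the two halves
            B = [l - r for l, r in zip(left, right)]   # difference of the halves
            nxt.append((A, B))
        level = nxt
        levels.append(level)
    return levels


levels = wavelet_levels("TCAC", "ACT")
print(levels[1])  # [([0, 1, 1], [0, -1, 1]), ([1, 1, 0], [1, -1, 0])]
print(levels[2])  # [([1, 2, 1], [-1, 0, 1])]
```

At the top level the first coefficient is the frequency vector of the whole string, and the second records how the characters are split between its two halves.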
25Wavelet Distance Calculation
26Maximum Frequency Distance Calculation
- FD(s1,s2) = max{FD1(f(s1), f(s2)), FD2(ψ(s1), ψ(s2))}
- FD1 is the Frequency Distance
- FD2 is the Wavelet Distance
27MRS-Index Structure Creation
[Figure sequence, slides 27-32: a window of size w = 2^a slides over string s1 c times (c = box capacity); the resulting entry is labeled Ta,1.]
33Using Different Resolutions
[Figure: string s1 indexed with window size w = 2^a (entry Ta,1) and with window size w = 2^(a+1) (entry Ta+1,1).]
34MRS-Index Structure
35MRS-index properties
- Relative MBR volume (Precision) decreases when
- c increases.
- w decreases.
- MBRs are highly clustered.
[Figure: box volume plotted against box capacity]
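To make the roles of the window size w and the box capacity c concrete, here is a simplified sketch of building one row of such an index. It is an assumption-laden illustration, not the authors' construction: it stores only frequency vectors (the first wavelet coefficient), groups every c consecutive windows into one minimum bounding rectangle (MBR), and the name `build_row` is hypothetical.

```python
def build_row(s, w, c, alphabet="ACGT"):
    """Slide a window of length w over s and group the frequency vectors of
    every c consecutive windows into one MBR, returned as (low, high)."""
    if len(s) < w:
        return []
    idx = {ch: i for i, ch in enumerate(alphabet)}
    f = [0] * len(alphabet)
    for ch in s[:w]:                        # frequency vector of the first window
        f[idx[ch]] += 1
    boxes = []
    lo, hi, count = f[:], f[:], 1
    for start in range(1, len(s) - w + 1):
        f[idx[s[start - 1]]] -= 1           # character leaving the window
        f[idx[s[start + w - 1]]] += 1       # character entering the window
        if count == c:                      # current box is full: close it
            boxes.append((lo, hi))
            lo, hi, count = f[:], f[:], 0
        lo = [min(a, b) for a, b in zip(lo, f)]
        hi = [max(a, b) for a, b in zip(hi, f)]
        count += 1
    boxes.append((lo, hi))
    return boxes


for box in build_row("AATGATAG", w=4, c=2):
    print(box)
# ([2, 0, 1, 1], [2, 0, 1, 1])
# ([1, 0, 1, 1], [2, 0, 1, 2])
# ([2, 0, 1, 1], [2, 0, 1, 1])
```

Because consecutive windows differ by one outgoing and one incoming character, their vectors stay close together, which fits the slide's remark that the MBRs are highly clustered.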
36Frequency Distance to an MBR
- Let q be the query string of length 2^i, where a …
- Given an MBR B, we define FD(q,B) = min over all s in B of FD(q,s).
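A simple way to bound FD(q,B) from below without visiting the points in the box is sketched here. This is not the algorithm from the paper: it is a plain per-dimension bound that ignores the fact that all vectors in a box sum to the window length, so it may be looser than the true minimum. The names `fd1` and `fd1_box_lower_bound` are illustrative.

```python
def fd1(u, v):
    """Frequency distance between two integer points."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def fd1_box_lower_bound(q, lo, hi):
    """Lower bound on fd1(q, u) for every point u with lo <= u <= hi."""
    below = sum(l - x for x, l in zip(q, lo) if x < l)  # q lies under the box
    above = sum(x - h for x, h in zip(q, hi) if x > h)  # q lies over the box
    return max(below, above)


q = [1, 1, 0, 2]                          # f("ACTT")
lo, hi = [1, 0, 1, 1], [2, 0, 1, 2]       # an MBR holding the two points below
members = [[1, 0, 1, 2], [2, 0, 1, 1]]
print(fd1_box_lower_bound(q, lo, hi))     # 1
print(min(fd1(q, u) for u in members))    # 1
```

Since the bound never exceeds the distance to any point inside the box, a box whose bound is larger than the query range can be skipped without losing answers.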
37Range Search Algorithm
38Range Queries
1. Partition the query string into subqueries at the various resolutions available in the index.
2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine ε.
3. Read the disk pages corresponding to the last result set, and postprocess to eliminate false retrievals.
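Step 1 can be illustrated with a small sketch. The slides do not say how the split is chosen, so the greedy longest-first rule below is an assumption, as are the function name `partition_query` and its parameters (smallest exponent `a`, number of resolutions `num_levels`).

```python
def partition_query(query_len, a, num_levels):
    """Split a query of length query_len into pieces whose lengths are window
    sizes 2^a, ..., 2^(a+num_levels-1). Returns (offset, length) pairs; a tail
    shorter than 2^a is left uncovered."""
    sizes = [2 ** (a + i) for i in range(num_levels - 1, -1, -1)]  # largest first
    pieces, offset = [], 0
    while True:
        remaining = query_len - offset
        fit = next((sz for sz in sizes if sz <= remaining), None)
        if fit is None:                 # nothing fits the remaining tail
            break
        pieces.append((offset, fit))
        offset += fit
    return pieces


# Window sizes 2^4 .. 2^7, query of length 208 = 128 + 64 + 16
print(partition_query(208, a=4, num_levels=4))
# [(0, 128), (128, 64), (192, 16)]
```

Each piece would then be searched on the index row with the matching window size, as described in step 2.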
[Figure: data strings s1, s2, ..., sd indexed at window sizes w = 2^4, 2^5, 2^6 and 2^7; the query q is split into subqueries q1, q2 and q3.]
39K-Nearest Neighbor Algorithm
40k-Nearest Neighbor Query [KSF96, SK98]
[Figure sequence, slides 40-43: k = 3; r = edit distance to the 3rd closest substring.]
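The idea of using the distance r to the k-th closest candidate as a search radius can be sketched as a two-phase search. For brevity this sketch works on whole strings rather than substrings and uses FD1 on frequency vectors as the lower bound, with no index; those simplifications and all function names are assumptions.

```python
import heapq
from collections import Counter


def edit_distance(x, y):
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j - 1] + (a != b), cur[j - 1] + 1, prev[j] + 1))
        prev = cur
    return prev[-1]


def freq(s, alphabet):
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


def fd1(u, v):
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    return max(pos, neg)


def knn(strings, q, k, alphabet="ACGT"):
    """Return the k strings closest to q as (edit distance, string) pairs."""
    if not strings or k <= 0:
        return []
    fq = freq(q, alphabet)
    bounds = [(fd1(freq(s, alphabet), fq), s) for s in strings]
    # Phase 1: take the k strings with the smallest lower bound and let r be
    # the largest true edit distance among them.
    seeds = heapq.nsmallest(k, bounds)
    r = max(edit_distance(s, q) for _, s in seeds)
    # Phase 2: range query with radius r on the lower bound, then verify.
    # Every true neighbor has ED <= r, hence lower bound <= r, so none is lost.
    survivors = [s for lb, s in bounds if lb <= r]
    return sorted((edit_distance(s, q), s) for s in survivors)[:k]


data = ["AATGATAG", "ACTTAGC", "GGGGCCCC", "ACTTAGG", "TTTT"]
print(knn(data, "ACTTAGC", k=2))  # [(0, 'ACTTAGC'), (1, 'ACTTAGG')]
```

Only the strings that survive the lower-bound filter pay for an exact edit distance computation in phase 2.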
44Experimental Settings
- w = 128, 256, 512, 1024.
- Human chromosomes from www.ncbi.nlm.nih.gov
- chr02, chr18, chr21, chr22
- Plotted results are from the chr18 dataset.
- Queries are selected randomly from the data set, with 512 ≤ |q| ≤ 10000.
- An NFA-based technique [BYN99] is implemented for comparison.
45Experimental Results 1: Effect of Box Capacity (10-NN)
- The cost of the MRS-index increases as the box capacity increases.
- The cost of the MRS-index is much lower than that of the NFA technique for all these box capacities.
- Although using two wavelet coefficients slightly improves the performance for the same box capacity, the size of the index structure is doubled. For the same amount of memory, the single-coefficient version performs better.
46Experimental Results 2: Effect of Window Size (10-NN)
- The MRS-index structure outperforms the NFA technique for all the window sizes.
- The performance of the MRS-index structure itself improves as the window size increases.
47Experimental Results 3: k-NN queries
- Although the performance of the MRS-index structure drops for large values of k, it still performs better than the NFA technique.
- Speedups of up to 45 times were achieved for 10 nearest neighbors. The speedup for 200 nearest neighbors is 3 times.
- As the number of nearest neighbors increases, the performance of the MRS-index structure approaches that of the NFA technique.
48Experimental Results 4: Range Queries
- The MRS-index structure performed up to 12 times faster than the NFA technique. The performance of the MRS-index structure improved when the queries were selected from different data strings. This is because the DNA strings have a high self-similarity.
- The performance of the MRS-index structure deteriorates as the error rate increases. This is because the size of the candidate set increases as the error rate increases.
49Discussion
- In-memory (index size is 1-2% of the database size).
- Lossless search.
- 3 to 45 times faster than the NFA technique for k-NN queries.
- 2 to 12 times faster than the NFA technique for range queries.
- Can be used to speed up any previously defined technique.
50THANK YOU