Title: Index Structures for String Databases
1. Index Structures for String Databases
- Alexandra Martinez
- Computational Molecular Biology
- CISE, University of Florida
- Spring 2004
2. Outline
- Problem Definition
- Motivation
- Background
- Proposed Solution
- Future Work
3. Problem Definition
- Substring searching in databases
- Ex: Similarity of two DNA strings
- Functional relationships
Query Q:
T C G A T T A C A G T G A A T
Database S:
G C A T T C G A T G G A C T G G A C T A G T G A A T C A G T
5. Motivation for Indexing
- Very large databases, exponential growth
  - Ex: GenBank (NCBI) doubles every 15 months.
- Most string search algorithms are in-memory algorithms
  - Scan the whole database for each query.
  - Suffer from disk I/O when the database is too large.
- For index-based techniques, the index structure is larger than the database
  - Index size exceeds memory size -> the index resides on disk.
  - Performance deteriorates for long queries.
- Need efficient external memory algorithms!
7. Background: Edit Distance
- Edit operations
  - Insert, Delete, Replace
- Edit distance between s1 and s2
  - Minimum number of edit operations needed to transform s1 into s2.
  - ED(ABC, ABDA) = 2
  - Time and space complexity is O(mn) using dynamic programming, where m = |s1| and n = |s2|.
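The dynamic program mentioned above can be sketched as follows. This is a minimal illustration with unit cost for each operation; the function name `edit_distance` is chosen here and does not come from the slides.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of insert/delete/replace operations turning s1 into s2."""
    m, n = len(s1), len(s2)
    # dp[i][j] holds the edit distance between s1[:i] and s2[:j].
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i  # delete all i characters of s1[:i]
    for j in range(n + 1):
        dp[0][j] = j  # insert all j characters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # delete s1[i-1]
                dp[i][j - 1] + 1,         # insert s2[j-1]
                dp[i - 1][j - 1] + cost,  # match or replace
            )
    return dp[m][n]


print(edit_distance("ABC", "ABDA"))  # 2, as in the slide example
```

The table has (m+1)(n+1) cells and each cell is filled in constant time, which gives the O(mn) time and space bound.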
8. Background: Alignments
- Alignment
  - Matches the characters of s1 and s2 in increasing order of position.
  - Each character pair $p_k$ is assigned a score $s(p_k)$.
  - Alignment value $= \sum_k s(p_k)$
  - Example (R = replace, I = insert, D = delete):
      A C T - - T A G C
        R   I I       D
      A A T G A T A G -
- Global alignment
  - Maximum alignment value of s1 and s2.
- Local alignment
  - Highest alignment value over all substrings of s1 and s2.
  - Ex: BLAST
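Global and local alignment values can be sketched with the standard dynamic programs (Needleman-Wunsch and Smith-Waterman style). The scoring values below (match +1, mismatch -1, gap -1) are assumptions for illustration only; the slides do not fix a scoring scheme.

```python
def global_alignment(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best alignment value over alignments of all of s1 with all of s2."""
    m, n = len(s1), len(s2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap  # s1[:i] aligned against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap  # s2[:j] aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two characters
                score[i - 1][j] + gap,       # gap in s2
                score[i][j - 1] + gap,       # gap in s1
            )
    return score[m][n]


def local_alignment(s1, s2, match=1, mismatch=-1, gap=-1):
    """Best alignment value over all pairs of substrings of s1 and s2."""
    m, n = len(s1), len(s2)
    score = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            score[i][j] = max(
                0,  # start a fresh local alignment here
                score[i - 1][j - 1] + pair,
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
            best = max(best, score[i][j])
    return best
```

The only differences between the two are the zero floor and taking the maximum over all cells, which let a local alignment start and end anywhere.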
9. Background: Query Types
- String database S = {s1, s2, ..., sd}
- Range queries
  - Seek all substrings of S that are within an edit distance of r (the range) to the input query q.
- Nearest neighbor queries (kNN)
  - Seek the k substrings of S closest to the input query q.
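As a baseline for the two query types, here is a brute-force sketch that scans every candidate substring. It shows the semantics only and is not the indexed method of the talk; the function names and the `(string index, start, end)` result format are assumptions made for this example.

```python
import heapq


def edit_distance(a, b):
    """Unit-cost edit distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def range_query(database, q, r):
    """All (string index, start, end) with ED(database[idx][start:end], q) <= r."""
    hits = []
    for idx, s in enumerate(database):
        for start in range(len(s)):
            # The edit distance is at least the length difference, so only
            # lengths in [|q| - r, |q| + r] can qualify.
            lo = max(1, len(q) - r)
            hi = min(len(s) - start, len(q) + r)
            for length in range(lo, hi + 1):
                end = start + length
                if edit_distance(s[start:end], q) <= r:
                    hits.append((idx, start, end))
    return hits


def knn_query(database, q, k):
    """The k closest substrings as (distance, string index, start, end)."""
    scored = []
    for idx, s in enumerate(database):
        for start in range(len(s)):
            for end in range(start + 1, len(s) + 1):
                scored.append((edit_distance(s[start:end], q), idx, start, end))
    return heapq.nsmallest(k, scored)
```

Every query touches the whole database here, which is the cost the index structure is meant to avoid.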
10. Background: Wavelets
- The wavelet transform (WT) provides a time-frequency representation of a signal.
- WT was developed as an alternative to the Short-Time Fourier Transform (STFT).
- WT solves the resolution problem of the STFT:
  - Narrow window -> good time resolution, poor frequency resolution.
  - Wide window -> good frequency resolution, poor time resolution (as w -> ∞, the STFT becomes the FT).
- Complexity of the WT is O(N).
11. Background: Wavelets (2)
[Figure: four panels: original time signal; STFT with small window; STFT with large window; wavelet transform]
12. Outline
- Problem Definition
- Motivation
- Background
- Proposed Solution
  - General Approach
  - Frequency Vector Distance
  - Wavelet Transform Distance
  - MRS Index Structure
  - Range and Nearest Neighbor Queries
- Future Work
13. General Approach
- Map the substrings of the database into an integer space, using either:
  - the frequency vector, or
  - the vector of wavelet coefficients.
- Define a distance function in this integer space that is a lower bound on the actual edit distance.
- Cluster the vectors of consecutive substrings into Minimum Bounding Rectangles (MBRs).
- Obtain an array of MBRs for different resolutions -> grid.
15. Frequency Vector
- s: string from the alphabet $\Sigma = \{\alpha_1, ..., \alpha_\sigma\}$
- $n_i$: number of occurrences of $\alpha_i$ in s ($1 \le i \le \sigma$)
- Define the frequency vector of s as $f(s) = [n_1, ..., n_\sigma]$
- Example
  - s = AATGATAG
  - f(s) = [nA, nC, nG, nT] = [4, 0, 2, 2]
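The definition above translates directly into a few lines of code. The sketch below assumes the DNA alphabet in the order A, C, G, T, matching the slide example; the name `frequency_vector` is illustrative.

```python
from collections import Counter

DNA = "ACGT"  # alphabet order used in the slide example


def frequency_vector(s, alphabet=DNA):
    """Count the occurrences of each alphabet character in s."""
    counts = Counter(s)
    return [counts[ch] for ch in alphabet]


print(frequency_vector("AATGATAG"))  # [4, 0, 2, 2]
```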
16. Effect of Edit Ops on the Frequency Vector
- Delete: decreases an entry by 1
- Insert: increases an entry by 1
- Replace: Insert + Delete
- Example (entries ordered A, C, G, T)
  - s = AATGATAG => f(s) = [4, 0, 2, 2]
  - Del G: s = AAT.ATAG => f(s) = [4, 0, 1, 2]
  - Ins C: s = AACTATAG => f(s) = [4, 1, 1, 2]
  - A->C: s = ACCTATAG => f(s) = [3, 2, 1, 2]
17. Frequency Distance (FD1): A Lower Bound on the ED
- Define FD1(u, v) as the minimum number of steps needed to go from u to v (or vice versa) by moving to a neighbor point at each step.
- Two points u and v in the $\sigma$-dimensional space are neighbors if one of them can be obtained from the other by a single edit operation.
18. Frequency Distance: Example
- s = AATGATAG => f(s) = [4, 0, 2, 2]
- t = ACTTAGC => f(t) = [2, 2, 1, 2]
- pos = (4-2) + (2-1) = 3 (total surplus of f(s) over f(t))
- neg = (2-0) = 2 (total surplus of f(t) over f(s))
- FD1(f(s), f(t)) = max{pos, neg} = 3
- ED(s, t) = 4
- FD1(f(s), f(t)) <= ED(s, t)
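The pos/neg computation in this example can be written as a short function. This is a sketch under the assumption of a DNA alphabet ordered A, C, G, T; the names `fd1` and `frequency_vector` are illustrative.

```python
def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


def fd1(u, v):
    """Frequency distance: max of total surplus of u over v and of v over u."""
    pos = sum(a - b for a, b in zip(u, v) if a > b)
    neg = sum(b - a for a, b in zip(u, v) if b > a)
    # One replace can fix one surplus and one deficit at once, so the larger
    # of the two totals is the number of steps needed.
    return max(pos, neg)


u = frequency_vector("AATGATAG")
v = frequency_vector("ACTTAGC")
print(fd1(u, v))  # 3, while ED(s, t) = 4
```

Because FD1 never exceeds the edit distance, any substring whose FD1 to the query is above the query radius can be discarded without computing the edit distance.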
20. Wavelet Transformation: Example
- s = TCAC, n = |s| = 4
- $\psi_0(s) = [v_{0,0}, v_{0,1}, v_{0,2}, v_{0,3}]$
  $= [(A_{0,0}, B_{0,0}), (A_{0,1}, B_{0,1}), (A_{0,2}, B_{0,2}), (A_{0,3}, B_{0,3})]$
  $= [(f(T), 0), (f(C), 0), (f(A), 0), (f(C), 0)]$
  $= [([0,0,0,1], 0), ([0,1,0,0], 0), ([1,0,0,0], 0), ([0,1,0,0], 0)]$
- $\psi_1(s) = [([0,1,0,1], [0,-1,0,1]), ([1,1,0,0], [1,-1,0,0])]$
- $\psi_2(s) = [([1,2,0,1], [-1,0,0,1])]$
  - First wavelet coefficient: [1,2,0,1]
  - Second wavelet coefficient: [-1,0,0,1]
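21. Wavelet Transformation: String Decomposition
- $A_{k,i} = A_{k-1,2i} + A_{k-1,2i+1}$, for $1 \le k \le \log_2 n$
- $B_{k,i} = A_{k-1,2i} - A_{k-1,2i+1}$, for $0 \le i \le n/2^k - 1$
- Base case: $A_{0,i}$ is the frequency vector of the i-th character of s, and $B_{0,i} = 0$.
[Figure: decomposition tree indexed by position i and level k; the top level gives $\psi(s)$, whose two components are the first and second wavelet coefficients]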
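The decomposition can be sketched as a bottom-up loop over levels. This sketch assumes the string length is a power of two and a DNA alphabet ordered A, C, G, T; it stores the level-0 B component as a zero vector rather than a scalar 0, and the name `wavelet_levels` is illustrative.

```python
def wavelet_levels(s, alphabet="ACGT"):
    """Return [level 0, level 1, ...], each a list of (A, B) vector pairs."""
    n = len(s)
    if n == 0 or n & (n - 1):
        raise ValueError("string length must be a power of two")
    index = {ch: i for i, ch in enumerate(alphabet)}
    zero = [0] * len(alphabet)

    # Level 0: A is the frequency vector of a single character, B is zero.
    level = []
    for ch in s:
        a = list(zero)
        a[index[ch]] = 1
        level.append((a, list(zero)))
    levels = [level]

    # Level k: A is the sum and B the difference of adjacent A's at level k-1.
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level), 2):
            left, right = level[i][0], level[i + 1][0]
            nxt.append((
                [x + y for x, y in zip(left, right)],
                [x - y for x, y in zip(left, right)],
            ))
        level = nxt
        levels.append(level)
    return levels


first, second = wavelet_levels("TCAC")[-1][0]
print(first, second)  # [1, 2, 0, 1] [-1, 0, 0, 1]
```

The first coefficient at the top level is the frequency vector of the whole string; the second records how the character counts differ between its two halves.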
22. Wavelet Distance (FD2): A Lower Bound on the ED
- Maximum frequency distance:
  FD(s1, s2) = max{FD1(f(s1), f(s2)), FD2($\psi$(s1), $\psi$(s2))}
24-29. MRS Index Creation
[Figure sequence over string s1 with window size w = 2^a]
- A window of length w = 2^a is placed on s1 and transformed; the result goes into an MBR.
- The window slides c times (c = box capacity).
- Result: MBRs containing the wavelet coefficients of the substrings of s1.
- T_{a,1}: tree of MBRs for a resolution of w = 2^a over s1.
30. Using Different Resolutions
[Figure: two trees of MBRs over s1: T_{a,1} for window size w = 2^a and T_{a+1,1} for window size w = 2^(a+1)]
31. MRS Index Structure
- T_{i,j}: index for the j-th string and window size 2^i
- Columns: database strings, 1 <= j <= d
- Rows: resolution levels, a <= i <= b
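The grid of per-string, per-resolution indexes can be sketched as below. Several simplifications are assumptions of this example and not taken from the slides: each window is mapped to its frequency vector only (the first wavelet coefficient), each MBR covers a non-overlapping run of `capacity` consecutive windows, MBRs are kept in a flat list instead of a tree, and string indices j are 0-based.

```python
def frequency_vector(s, alphabet="ACGT"):
    """Count the occurrences of each alphabet character in s."""
    return [s.count(ch) for ch in alphabet]


def build_row(s, w, capacity, alphabet="ACGT"):
    """MBRs for one string at window size w, as (first window start, low, high)."""
    vectors = [frequency_vector(s[i:i + w], alphabet) for i in range(len(s) - w + 1)]
    mbrs = []
    for start in range(0, len(vectors), capacity):
        box = vectors[start:start + capacity]
        low = [min(col) for col in zip(*box)]   # lower corner of the rectangle
        high = [max(col) for col in zip(*box)]  # upper corner of the rectangle
        mbrs.append((start, low, high))
    return mbrs


def build_mrs_index(database, a, b, capacity):
    """Map (i, j) to the MBR list for string j at window size 2**i."""
    return {
        (i, j): build_row(s, 2 ** i, capacity)
        for i in range(a, b + 1)
        for j, s in enumerate(database)
    }


index = build_mrs_index(["AACCGGTT"], a=1, b=2, capacity=3)
for start, low, high in index[(1, 0)]:
    print(start, low, high)
```

Consecutive windows differ by one character leaving and one entering, so their vectors are close and the rectangles stay small.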
33. Range Queries
1. Partition the query string into subqueries at the various resolutions available in the index.
2. Perform a partial range query for each subquery on the corresponding row of the index structure, and refine the query radius ε.
3. Read the disk pages corresponding to the last result set, and postprocess to eliminate false retrievals.
[Figure: index grid with columns s1, s2, ..., sd and rows for window sizes w = 2^4, 2^5, 2^6, 2^7; the query q is split into subqueries q1, q2, q3]
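Step 1 can be sketched as a greedy split of the query into pieces whose lengths are powers of two available in the index. The longest-first strategy and the handling of a leftover tail shorter than 2^a are assumptions of this example; the slides only state that the query is partitioned by resolution.

```python
def partition_query(q, a, b):
    """Split q into pieces of length 2**i with a <= i <= b, longest first.

    Returns (pieces, leftover), where pieces is a list of (i, subquery) and
    leftover is the tail of q that is shorter than the smallest window 2**a.
    """
    pieces = []
    pos = 0
    for i in range(b, a - 1, -1):
        w = 2 ** i
        while len(q) - pos >= w:
            pieces.append((i, q[pos:pos + w]))
            pos += w
    return pieces, q[pos:]


# A query of length 208 with resolutions 2^4 .. 2^7 splits into 128 + 64 + 16.
pieces, leftover = partition_query("A" * 208, a=4, b=7)
print([2 ** i for i, _ in pieces], len(leftover))  # [128, 64, 16] 0
```

Each piece `(i, subquery)` would then be searched on row i of the index grid, as described in step 2.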
34-35. k-Nearest Neighbor Queries: Phase 1
- Example with k = 3.
- B = set of the k closest MBRs to the query string q.
- r = k-th smallest edit distance of the strings in B to q.
36-37. k-Nearest Neighbor Queries: Phase 2
- Perform a range query using r as the query radius.
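The two phases can be sketched on a simplified setting. This example works on whole strings instead of substrings grouped in MBRs, and uses FD1 on frequency vectors as the lower bound; both simplifications, and all function names, are assumptions for illustration.

```python
def edit_distance(a, b):
    """Unit-cost edit distance using two rows of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def frequency_vector(s, alphabet="ACGT"):
    return [s.count(ch) for ch in alphabet]


def fd1(u, v):
    pos = sum(x - y for x, y in zip(u, v) if x > y)
    neg = sum(y - x for x, y in zip(u, v) if y > x)
    return max(pos, neg)


def knn(strings, q, k):
    """Return the k nearest strings as sorted (distance, string) pairs.

    Assumes len(strings) >= k.
    """
    fq = frequency_vector(q)
    bound = {s: fd1(frequency_vector(s), fq) for s in strings}

    # Phase 1: take the k candidates with the smallest lower bound and let r
    # be the k-th smallest true distance among them. At least k strings lie
    # within r, so the true k-th neighbor is within r as well.
    candidates = sorted(strings, key=bound.get)[:k]
    r = sorted(edit_distance(s, q) for s in candidates)[k - 1]

    # Phase 2: range query with radius r. The lower bound filters cheaply,
    # the edit distance removes false retrievals.
    hits = []
    for s in strings:
        if bound[s] <= r:
            d = edit_distance(s, q)
            if d <= r:
                hits.append((d, s))
    return sorted(hits)[:k]
```

Phase 2 cannot miss a true neighbor because FD1 never exceeds the edit distance, so every string within r of the query passes the filter.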
39. Future Work
- Adapt the MRS-Index to work as an external index over the tuples of a database.
- Evaluate and compare the performance of the two distance functions, FD1 and FD2.
- Test with protein sequences rather than DNA sequences.
40. References
- T. Kahveci, A. K. Singh. Efficient Index Structures for String Databases. VLDB 2001, pages 351-360.
- O. Camoglu, T. Kahveci, A. K. Singh. PSI: Indexing Protein Structures for Fast Similarity Search. Bioinformatics, 11, pages 1-3, 2003.
- R. Polikar. The Wavelet Tutorial. http://users.rowan.edu/polikar/WAVELETS/