Reference-Based Indexing of Sequence Databases (VLDB 06)

Transcript and Presenter's Notes

1
Reference-Based Indexing of Sequence Databases (VLDB 06)

Jayendra Venkateswaran
Deepak Lachwani
Tamer Kahveci
Christopher Jermaine
Presented by Angela Siu

2
Content

Introduction
Related work
Reference-Based Methodology
Selection of References
Mapping of References
Search Algorithm
Experimental evaluation
Conclusion

3
Introduction

Many and/or very long sequences
Similarity search
Genomics, proteomics, dictionary search
Edit distance metric
Dynamic programming (Expensive)
Reference-based indexing
Reduce number of comparisons
Choice of references

4
Related work

Index structures
k-gram indexing
Exact matches of k-gram
Extend to find longer alignments with errors
E.g., FASTA, BLAST
Performance deteriorates quickly as the size of the database increases
Suffix tree
Handles mismatches inefficiently
Excessive memory usage: 10 to 37 bytes per letter
Vector space indexing
SST, frequency vector,
Store the occurrence of each letter in sequence
Lower bound of actual distance
Performs poorly as the query range increases
Reference-based indexing
A variation of vector space indexing
VP-tree, MVP-Tree, iDistance, Omni, M-Tree,
Slim-Tree, DBM-Tree, DF-Tree

5
Reference-Based indexing

A sequence database S
A set of reference sequences V
Pre-compute edit distances ED
$ED(s_i, v_j)$, $(\forall s_i \in S) \wedge (\forall v_j \in V)$
Similarity Search
Distance threshold $\epsilon$
Triangle inequality
Prune sequences that are too close to or too far away from a reference
$LB = \max_{v_j \in V} |ED(q, v_j) - ED(v_j, s_i)|$
$UB = \min_{v_j \in V} (ED(q, v_j) + ED(v_j, s_i))$
If $\epsilon < LB$, add $s_i$ to the pruned set
If $\epsilon > UB$, add $s_i$ to the result set
If $LB \le \epsilon \le UB$, add $s_i$ to the candidate set
Each $s_i$ in the candidate set is compared with the query using dynamic programming
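
The bounds above can be illustrated with a minimal Python sketch. The function names `edit_distance` and `classify` are hypothetical, and the sketch assumes the distances to each reference have already been computed; it follows the three rules on this slide rather than any particular implementation.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def classify(q_to_ref, s_to_ref, eps):
    """Classify one database sequence as 'pruned', 'result', or 'candidate'.

    q_to_ref[j] holds ED(q, v_j); s_to_ref[j] holds the precomputed ED(v_j, s).
    """
    lb = max(abs(dq - ds) for dq, ds in zip(q_to_ref, s_to_ref))
    ub = min(dq + ds for dq, ds in zip(q_to_ref, s_to_ref))
    if eps < lb:
        return "pruned"
    if eps > ub:
        return "result"
    return "candidate"   # needs a real comparison with the query


refs = ["kitten", "abc"]
q, s = "kitten", "sitting"
q_to_ref = [edit_distance(q, v) for v in refs]   # [0, 6]
s_to_ref = [edit_distance(v, s) for v in refs]   # [3, 7]
for eps in (2, 3, 4):
    print(eps, classify(q_to_ref, s_to_ref, eps))
# prints: 2 pruned, 3 candidate, 4 result (one pair per line)
```

Here LB and UB are both 3, so only a threshold of exactly 3 leaves the sequence in the candidate set.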

6
Cost Analysis

Memory
Main memory: B bytes
Number of sequences: N
Number of references assigned: k
Average size of a sequence: z bytes
Sequence-reference mapping of sequence s and reference $v_i$: the pair $(i, ED(s, v_i))$
B = <storage of references> + <storage of pre-computed edit distances>
$B = 8kN + zk$
Time
Query set: Q
Time taken for one sequence comparison: t
Average size of candidate set: $c_{avg}$
Total query time
= <query-reference comparisons> + <candidate-query comparisons>
$= tk|Q| + t c_{avg}|Q|$
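
As a quick check of the two cost formulas, the following Python sketch evaluates them for made-up inputs. The function names and all numeric values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def index_memory_bytes(n_sequences, k_refs, avg_seq_bytes):
    """B = 8kN + zk: distance entries plus the reference sequences themselves."""
    return 8 * k_refs * n_sequences + avg_seq_bytes * k_refs


def total_query_time(t, k_refs, n_queries, c_avg):
    """t*k*|Q| for query-reference comparisons plus t*c_avg*|Q| for candidates."""
    return t * k_refs * n_queries + t * c_avg * n_queries


# Hypothetical values: N = 1000 sequences, k = 10 references, z = 100 bytes
print(index_memory_bytes(1000, 10, 100))   # 80000 + 1000 = 81000
# Hypothetical values: t = 2 time units, |Q| = 5 queries, c_avg = 20 candidates
print(total_query_time(2, 10, 5, 20))      # 100 + 200 = 300
```

The second term of the time formula is the one that reference selection tries to shrink, since a smaller candidate set means fewer dynamic-programming comparisons.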

7
Selection of references

Omni method
Existing approach
References near the convex hull of the database
Sequences near the hull pruned by multiple,
redundant references
Sequences far away from the hull cannot be pruned
Poor pruning rates

8
Proposed methods

Goal: choose references that represent all parts of the database
Two novel strategies
Maximum Variance (MV)
Maximize the spread of database around the
references
Maximum Pruning (MP)
Optimizes pruning based on a set of sample queries

9
Maximum Variance

If q is close to reference v
Prune sequences far away from v
Accept sequences close to v
If q is far away from v
Prune sequences close to v
Select references with high variance of distances
Assume queries follow the same distribution as
the database sequences
New reference prunes some part of the database
not pruned by existing set of references

10
Maximum Variance

Measure closeness of sequences
L: the length of the longest sequence in S
$\mu_i$: mean of distances of $s_i$
$\sigma_i$: variance of distances of $s_i$
w: a cut-off distance
$w = L \cdot perc$, where $0 < perc < 1$
$s_j$ is close to $s_i$ if $ED(s_i, s_j) < (\mu_i - w)$
$s_j$ is far away from $s_i$ if $ED(s_i, s_j) > (\mu_i + w)$
Choose perc = 0.15, derived from experiment

11
Maximum Variance
[Figure: Maximum Variance selection. Labels: Sequence database, Random subset, Variance (calculate, sort), Candidate Reference Set; remove sequences close to or far away from the reference.]

12
Maximum Variance

Time complexity
Step 2: $O(N S L^2)$
Step 4: $O(N \log N)$
Step 5: $O(mN)$
Overall time
$O(N L^2 S + N \log N + mN)$
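
The algorithm listing itself is not part of this transcript, so the following Python sketch is only one plausible reading of the Maximum Variance slides: compute each sequence's distances to a random subset, sort by variance, then repeatedly take the highest-variance sequence as a reference and drop sequences that are close to or far away from it. The name `select_max_variance` and its parameters are hypothetical.

```python
import random
from statistics import mean, pvariance


def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def select_max_variance(db, m, perc=0.15, sample_size=None, seed=0):
    """Pick up to m references with high variance of distances (assumed reading)."""
    rng = random.Random(seed)
    sample = db if sample_size is None else rng.sample(db, min(sample_size, len(db)))
    w = max(len(s) for s in db) * perc            # cut-off distance w = L * perc

    # Mean and variance of each sequence's distances to the random subset.
    stats = []
    for idx, s in enumerate(db):
        dists = [edit_distance(s, t) for t in sample]
        stats.append((pvariance(dists), mean(dists), idx))
    stats.sort(reverse=True)                      # highest variance first

    refs, removed = [], set()
    for _var, mu, idx in stats:
        if len(refs) == m:
            break
        if idx in removed:
            continue
        refs.append(idx)
        removed.add(idx)
        # Drop sequences close to or far away from the new reference.
        # Distances are recomputed here for brevity; a real index would cache them.
        for j, s in enumerate(db):
            if j not in removed:
                d = edit_distance(db[idx], s)
                if d < mu - w or d > mu + w:
                    removed.add(j)
    return [db[i] for i in refs]


db = ["aaaa", "aaab", "bbbb", "abab", "zzzzzzzz"]
print(select_max_variance(db, m=2))
```

Fewer than m references may be returned when the removal step empties the candidate pool first.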

Algorithm

13
Maximum Pruning

Combinatorially tries to compute the best
reference sets for a given query distribution
Greedy approach
Start with an initial reference set
Consider each sequence in the database as a
candidate
Iteratively, replace an existing reference with a
new one if pruning is improved
Gain: the amount of improvement in pruning
Stop if no further improvement
Sampling-based optimization
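
The greedy replacement loop described above might be sketched as follows. This is an illustration under stated assumptions, not the paper's listing: the initial reference set is simply the first m sequences, "pruning" is measured as the number of (query, sequence) pairs whose lower bound exceeds the query range, and the names `pruned_pairs` and `select_max_pruning` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def pruned_pairs(refs, db, queries, eps):
    """Count (query, sequence) pairs that at least one reference prunes."""
    return sum(
        1
        for q in queries
        for s in db
        if any(abs(edit_distance(q, v) - edit_distance(v, s)) > eps for v in refs)
    )


def select_max_pruning(db, queries, m, eps):
    refs = list(db[:m])                      # assumed initial reference set
    best = pruned_pairs(refs, db, queries, eps)
    improved = True
    while improved:                          # stop when no swap improves pruning
        improved = False
        for cand in db:
            if cand in refs:
                continue
            # Gain of swapping cand in for each current reference.
            gains = [
                (pruned_pairs(refs[:i] + [cand] + refs[i + 1:], db, queries, eps) - best, i)
                for i in range(len(refs))
            ]
            gain, i = max(gains)
            if gain > 0:
                refs[i] = cand
                best += gain
                improved = True
    return refs, best


db = ["aaaa", "aaab", "abbb", "bbbb", "cccccccc", "ccccdddd"]
queries = ["aaaa", "cccccccc"]
refs, pruned = select_max_pruning(db, queries, m=2, eps=1)
print(refs, pruned)
```

Each accepted swap strictly increases the pruned count, which is bounded by the number of (query, sequence) pairs, so the loop terminates.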

14
Maximum Pruning
[Figure: Maximum Pruning. Labels: Sequence Database, Sample Query Set, Current Reference Set, Candidate Reference (S1), Gain, Max, Replace.]

15
Maximum Pruning

Time complexity
Sequence distances: $O(N^2)$
PRUNE(): $O(N|Q|)$
Step 2
Number of sequences: $O(N^2)$
Compute gain: $O(m|Q|)$
Time: $O(N^2 m|Q|)$
Overall worst case
N iterations
$O(N^3 m|Q|)$

Algorithm

16
Maximum Pruning

Sampling-Based Optimization
Estimation of gain
Reduce the number of sequences, use subset of
database
Determine accuracy of gain estimate based on
Central Limit Theorem
Iteratively randomly select a sequence to
calculate the gain of a candidate until desired
accuracy is reached
Time complexity: $O(N^2 f m|Q|)$, where f is the sample size
Estimation of largest gain
Reduce the number of candidate references
Ensure the largest gain is at least $\tau G_e$ with probability $\psi$, where $0 \le \tau, \psi \le 1$ and $G_e$ is the largest gain
Use Extreme Value Distribution to estimate $G_e$
From the sample set of candidates, find mean and
standard deviation
Best reference sequence has the expected gain of
where
Sample size
Time complexity: $O(N f h m|Q|)$
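
The "estimation of gain" step can be illustrated with a small Python sketch that samples sequences until a Central Limit Theorem confidence interval for the mean gain is narrow enough. The stopping rule (relative half-width at a 1.96 z-score, minimum of 30 samples) and the name `estimate_mean_gain` are assumptions made for illustration; the slides do not give the exact criterion.

```python
import math
import random
from statistics import mean, stdev


def estimate_mean_gain(per_sequence_gain, db, rel_error=0.1, z=1.96,
                       min_samples=30, seed=0):
    """Estimate the mean gain of a candidate from a random sample of db.

    per_sequence_gain(s) is a caller-supplied (hypothetical) function that
    returns the gain contributed by database sequence s.
    Returns (estimated mean gain, number of sequences sampled).
    """
    rng = random.Random(seed)
    order = list(db)
    rng.shuffle(order)                      # random selection without replacement
    samples = []
    for s in order:
        samples.append(per_sequence_gain(s))
        if len(samples) >= min_samples:
            mu = mean(samples)
            # CLT: the sample mean is roughly normal with std = stdev / sqrt(n)
            half_width = z * stdev(samples) / math.sqrt(len(samples))
            if half_width <= rel_error * abs(mu):
                break                       # desired accuracy reached
    return mean(samples), len(samples)


# Toy usage with a constant gain: the estimate stabilises at the minimum sample size.
est, used = estimate_mean_gain(lambda s: 2, list(range(1000)))
print(est, used)   # 2 30
```

With a noisier gain function the loop keeps sampling, and in the worst case it falls back to scanning the whole database.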

17
Mapping of references

Each sequence has its own set of best references
Based on a sample query set Q
Assign references that prune the sequence for
most queries in Q
Avoid redundant references
Keep a reference only if it can prune a total of more than |Q| sequences

18
Mapping of references
[Figure: Mapping of references. Labels: Sequence database, Sample Query Set, Reference Set, Query prune count (max), Reference Set for S1.]

19
Mapping of references

Time complexity
Distance computation: $O(tm|Q|)$, where a sequence comparison takes t time
Pruning amount calculation: $O(m|Q|)$
Overall time: $O(Nmk|Q|)$
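
One way to read the mapping step is as a greedy choice per sequence: repeatedly assign the reference that prunes the sequence for the most sample queries not already covered, and stop once further references would be redundant. The Python sketch below follows that reading; it is an assumption, not the paper's listing, and `assign_references` and its `dist` parameter are hypothetical names. The toy usage substitutes integers with absolute difference for sequences with edit distance, since any metric works for the illustration.

```python
def assign_references(s, refs, queries, k, eps, dist):
    """Return up to k indices into refs that prune s for the most sample queries."""
    # prunes[j]: indices of sample queries for which reference j alone prunes s
    prunes = [
        {qi for qi, q in enumerate(queries) if abs(dist(q, v) - dist(v, s)) > eps}
        for v in refs
    ]
    chosen, covered = [], set()
    for _ in range(min(k, len(refs))):
        j = max((j for j in range(len(refs)) if j not in chosen),
                key=lambda j: len(prunes[j] - covered))
        if not prunes[j] - covered:
            break                      # remaining references are redundant for s
        chosen.append(j)
        covered |= prunes[j]
    return chosen


# Toy metric space: integers with absolute difference standing in for edit distance.
def dist(a, b):
    return abs(a - b)


print(assign_references(5, [4, 20], [1, 8], k=2, eps=2, dist=dist))   # [1]
```

In the toy run, reference 20 prunes the sequence for both sample queries, so reference 4 adds nothing and is left out.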

Algorithm

20
Search Algorithm

Calculate edit distances between queries and every reference
Compute lower bound LB and upper bound UB
$\epsilon$: query range
By triangle inequality,
If $LB > \epsilon$, prune sequence
If $UB < \epsilon$, accept sequence
Otherwise, perform actual sequence comparison
Memory complexity
z: average sequence size
$(i, ED(s, v_i))$: sequence-reference mapping
N: number of database sequences
m: number of references
k: number of references per database sequence
Overall memory: $(8Nk + mz)$ bytes
Time complexity
Q: query set
L: average sequence length
$C_m$: average candidate set size for Q using m references
Overall time: $O((m + C_m)|Q|L^2 + Nk|Q|)$
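
Putting the pieces together, a range search over an index with per-sequence reference mappings could look like the Python sketch below. The data layout (a list of `(i, ED(s, v_i))` pairs per sequence) follows the mapping described on this slide; the function names `build_index` and `range_search` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_index(db, refs, mapping):
    """mapping[n] lists the reference indices assigned to db[n] (k per sequence)."""
    return [[(i, edit_distance(s, refs[i])) for i in mapping[n]]
            for n, s in enumerate(db)]


def range_search(q, eps, db, refs, index):
    """Return (sequences within eps of q, number of edit-distance computations)."""
    q_to_ref = [edit_distance(q, v) for v in refs]     # m comparisons
    results, comparisons = [], len(refs)
    for s, entries in zip(db, index):
        lb = max(abs(q_to_ref[i] - d) for i, d in entries)
        ub = min(q_to_ref[i] + d for i, d in entries)
        if lb > eps:
            continue                                   # pruned
        if ub < eps:
            results.append(s)                          # accepted without comparison
            continue
        comparisons += 1                               # candidate: verify with DP
        if edit_distance(q, s) <= eps:
            results.append(s)
    return results, comparisons


db = ["kitten", "sitting", "mitten", "bitten", "abcdefgh", "abcdefgx"]
refs = ["kitten", "abcdefgh"]
index = build_index(db, refs, mapping=[[0, 1]] * len(db))
hits, cost = range_search("kitten", 1, db, refs, index)
print(hits, cost)
```

The returned comparison count corresponds to the $(m + C_m)$ factor in the time complexity: m query-reference computations plus one per surviving candidate.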

21
Experimental evaluation

Size of reference set: 200
Datasets
Text: alphabet size of 36, 8000 sequences of length 100
DNA: alphabet size of 4, 20000 sequences
Protein: alphabet size of 20, 4000 sequences of length 500
Comparisons of the selection strategies
MV-S, MV-D: Maximum variance with same and different reference sets
MP-S, MP-D: Maximum pruning with same and different reference sets
Comparisons with existing methods
Omni, FV, M-Tree, DBM-Tree, Slim-Tree, DF-Tree

22
Comparison of selection strategies

Impact of query range

Impact of number of references per sequence

23
Comparisons with existing methods

Impact of query range

Number of sequence comparisons
IC: index construction time
s: second
m: minute
QR: query range
MP-D is sampling-based optimized

24
Comparison with existing methods

Impact of input queries

Number of sequence comparisons
Sample query set in reference selection: E. coli
Actual query set
HM: a butterfly species
MM: a mouse species
DR: a zebrafish species
QR: query range

25
Comparison with existing methods

Scalability of database size and sequence length

26
Conclusion

Similarity search over a large database
Edit distance as the similarity measure
Selection of references
Maximum variance
Maximize spread of database around the references
Maximum pruning
Optimize pruning based on a set of sample queries
Sampling-based optimization
Mapping of references
Each sequence has a different set of references
Experimental evaluation
Outperforms existing strategies including Omni and frequency vectors
MP-D, Maximum pruning with dynamic assignment of reference sequences, performs the best