Reference-Based Indexing of Sequence Databases (VLDB 06)

1
Reference-Based Indexing of Sequence
Databases (VLDB 06)
  • Jayendra Venkateswaran
  • Deepak Lachwani
  • Tamer Kahveci
  • Christopher Jermaine
  • Presented by Angela Siu

2
Content
  • Introduction
  • Related work
  • Reference-Based Methodology
  • Selection of References
  • Mapping of References
  • Search Algorithm
  • Experimental evaluation
  • Conclusion

3
Introduction
  • Many and/or very long sequences
  • Similarity search
  • Genomics, proteomics, dictionary search
  • Edit distance metric
  • Dynamic programming (Expensive)
  • Reference-based indexing
  • Reduce number of comparisons
  • Choice of references
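
The "expensive" dynamic programming step mentioned above can be sketched as follows. This is a minimal textbook Levenshtein implementation with unit costs, given only to show where the quadratic cost comes from; the function name `edit_distance` is an illustrative choice, not something taken from the slides.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost edit distance via the classic O(len(a) * len(b)) dynamic program."""
    # prev[j] is the distance between the already-processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]
```

Every cell of the len(a) by len(b) table is visited once, which is why avoiding even a fraction of these comparisons matters for long sequences.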

4
Related work
  • Index structures
  • k-gram indexing
  • Exact matches of k-gram
  • Extend to find longer alignments with errors
  • E.g., FASTA, BLAST
  • Performance deteriorates quickly as the size of
    the database increases
  • Suffix tree
  • Manage mismatches inefficiently
  • Excessive memory usage: 10 to 37 bytes per letter
  • Vector space indexing
  • SST, frequency vector
  • Store the occurrence of each letter in sequence
  • Lower bound of actual distance
  • Performs poorly as the query range increases
  • Reference-based indexing
  • A variation of vector space indexing
  • VP-tree, MVP-Tree, iDistance, Omni, M-Tree,
    Slim-Tree, DBM-Tree, DF-Tree

5
Reference-Based indexing
  • A sequence database S
  • A set of reference sequences V
  • Pre-compute edit distances ED
  • $ED(s_i, v_j)$ for all $s_i \in S$ and all $v_j \in V$
  • Similarity Search
  • Distance threshold $\epsilon$
  • Triangle inequality
  • Prune sequences that are too close or too far
    away from a reference
  • $LB = \max_{v_j \in V} |ED(q, v_j) - ED(v_j, s_i)|$
  • $UB = \min_{v_j \in V} (ED(q, v_j) + ED(v_j, s_i))$
  • If $\epsilon < LB$, add $s_i$ to the pruned set
  • If $\epsilon > UB$, add $s_i$ to the result set
  • If $LB \le \epsilon \le UB$, add $s_i$ to the candidate set
  • Sequences $s_i$ in the candidate set are compared with
    the query using dynamic programming
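
The three-way decision on this slide can be sketched in a few lines. The sketch assumes the distances to the references are already computed and passed in as parallel lists, and it takes the absolute difference in the lower bound, as the triangle inequality allows; the name `classify` and the string labels are illustrative.

```python
def classify(query_to_ref, seq_to_ref, eps):
    """Decide the fate of one database sequence s for one query q.

    query_to_ref[j] = ED(q, v_j) and seq_to_ref[j] = ED(v_j, s) for the same
    reference v_j. Returns "pruned", "result", or "candidate".
    """
    pairs = list(zip(query_to_ref, seq_to_ref))
    lb = max(abs(dq - ds) for dq, ds in pairs)  # tightest lower bound on ED(q, s)
    ub = min(dq + ds for dq, ds in pairs)       # tightest upper bound on ED(q, s)
    if eps < lb:
        return "pruned"     # ED(q, s) >= LB > eps: cannot be an answer
    if eps > ub:
        return "result"     # ED(q, s) <= UB < eps: is an answer without comparison
    return "candidate"      # bounds are inconclusive: run dynamic programming


# Example with two references and threshold eps = 3.
print(classify([5, 1], [1, 6], 3))  # LB = 5
print(classify([1, 4], [1, 5], 3))  # UB = 2
print(classify([3, 4], [2, 2], 3))  # LB = 2, UB = 5
```

Only the last case costs an edit distance computation, which is the quantity the rest of the deck tries to minimize.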

6
Cost Analysis
  • Memory
  • Main memory: B bytes
  • Number of sequences: N
  • Number of references assigned: k
  • Average size of a sequence: z bytes
  • Sequence-reference mapping of sequence s and
    reference $v_i$: the pair $(i, ED(s, v_i))$
  • B = <storage of references> + <storage of
    pre-computed edit distances>
  • $B = 8kN + zk$
  • Time
  • Query set: Q
  • Time taken for one sequence comparison: t
  • Average size of candidate set: $c_{avg}$
  • Total query time
  • <Query-reference comparison> + <Candidate-query
    comparison>
  • $tk|Q| + tc_{avg}|Q|$
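
To get a feel for the two formulas, the arithmetic can be written out directly. The function names are illustrative, and the example numbers below are assumptions chosen for the sake of the calculation (one byte per letter, so z = 100 for sequences of length 100), not measurements from the presentation.

```python
def index_memory_bytes(n_sequences: int, k_refs: int, avg_seq_bytes: int) -> int:
    """B = 8kN + zk: 8 bytes per stored (i, ED(s, v_i)) pair plus the references."""
    return 8 * k_refs * n_sequences + avg_seq_bytes * k_refs


def total_query_time(t, k_refs, n_queries, avg_candidates):
    """t*k*|Q| for query-reference comparisons plus t*c_avg*|Q| for candidates."""
    return t * k_refs * n_queries + t * avg_candidates * n_queries


# Assumed setting: N = 8000 sequences, k = 200 references, z = 100 bytes each.
print(index_memory_bytes(8000, 200, 100))  # 12,800,000 + 20,000 bytes
```

The distance table dominates the memory budget, and the candidate term dominates the query time unless the references prune well.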

7
Selection of references
  • Omni method
  • Existing approach
  • References near the convex hull of the database
  • Sequences near the hull pruned by multiple,
    redundant references
  • Sequences far away from the hull cannot be pruned
  • Poor pruning rates

8
Proposed methods
  • Goal: choose references that represent all parts
    of the database
  • Two novel strategies
  • Maximum Variance (MV)
  • Maximize the spread of database around the
    references
  • Maximum Pruning (MP)
  • Optimizes pruning based on a set of sample queries

9
Maximum Variance
  • If q is close to reference v
  • Prune sequences far away from v
  • Accept sequences close to v
  • If q is far away from v
  • Prune sequences close to v
  • Select references with high variance of distances
  • Assume queries follow the same distribution as
    the database sequences
  • New reference prunes some part of the database
    not pruned by existing set of references

10
Maximum Variance
  • Measure closeness of sequences
  • L: the length of the longest sequence in S
  • $\mu_i$: mean of distances of $s_i$
  • $\sigma_i$: variance of distances of $s_i$
  • w: a cut-off distance
  • $w = L \cdot perc$, where $0 < perc < 1$
  • $s_j$ is close to $s_i$ if $ED(s_i, s_j) < (\mu_i - w)$
  • $s_j$ is far away from $s_i$ if $ED(s_i, s_j) > (\mu_i + w)$
  • Choose perc = 0.15, derived from experiment
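
The closeness test can be illustrated on a list of precomputed distances. This sketch assumes the population variance is meant (the slide does not say which estimator is used), and the function name `closeness_profile` is hypothetical.

```python
from statistics import mean, pvariance


def closeness_profile(distances, longest_len, perc=0.15):
    """Split the neighbours of one sequence s_i into "close" and "far away".

    distances[j] = ED(s_i, s_j); longest_len is L, the longest sequence length.
    Returns (mean, variance, indices of close s_j, indices of far-away s_j).
    """
    mu = mean(distances)
    var = pvariance(distances)      # assumption: population variance
    w = longest_len * perc          # cut-off distance w = L * perc
    close = [j for j, d in enumerate(distances) if d < mu - w]
    far = [j for j, d in enumerate(distances) if d > mu + w]
    return mu, var, close, far


# With L = 100 and perc = 0.15 the cut-off is 15 around a mean of 50.
print(closeness_profile([10, 50, 52, 48, 90], longest_len=100))
```

A sequence whose distances spread widely has many neighbours in both lists, which is what makes it useful as a reference.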

11
Maximum Variance
[Figure: Maximum Variance selection. Labels: sequence database, random subset, candidate reference set, variance. Steps shown: calculate, sort, remove sequences close to or far away from the reference.]

12
Maximum Variance
  • Time complexity
  • Step 2: $O(NSL^2)$
  • Step 4: $O(N \log N)$
  • Step 5: $O(mN)$
  • Overall time
  • $O(NSL^2 + N \log N + mN)$
  • Algorithm
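
The algorithm itself was shown as an image and is not in the transcript. The following is one possible reading of the steps named on these slides (take a random subset as candidates, compute the variance of each candidate's distances, sort by variance, then repeatedly pick the top candidate and drop candidates that are close to or far away from it). All names, the `sample_size` and `seed` parameters, and the use of population variance are assumptions.

```python
import random
from statistics import mean, pvariance


def edit_distance(a, b):
    """Unit-cost edit distance (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def select_max_variance(database, m, sample_size, perc=0.15, seed=0):
    """Pick up to m references with high variance of distances to the database."""
    rng = random.Random(seed)
    candidates = rng.sample(database, min(sample_size, len(database)))
    w = max(len(s) for s in database) * perc  # cut-off distance w = L * perc
    stats = []
    for c in candidates:
        dists = [edit_distance(c, s) for s in database]
        stats.append((pvariance(dists), mean(dists), c))
    stats.sort(key=lambda item: item[0], reverse=True)  # highest variance first
    references, remaining = [], stats
    while remaining and len(references) < m:
        _, mu, ref = remaining[0]
        references.append(ref)
        # Keep only candidates that are neither close to nor far away from ref.
        remaining = [(v, mu_c, c) for v, mu_c, c in remaining[1:]
                     if mu - w <= edit_distance(ref, c) <= mu + w]
    return references
```

Dropping the close and far-away candidates is what keeps a new reference from pruning the same part of the database as one already chosen.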

13
Maximum Pruning
  • Combinatorially tries to compute the best
    reference sets for a given query distribution
  • Greedy approach
  • Start with an initial reference set
  • Consider each sequence in the database as a
    candidate
  • Iteratively, replace an existing reference with a
    new one if pruning is improved
  • Gain: the amount of improvement in pruning
  • Stop if no further improvement
  • Sampling-based optimization
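
The greedy loop described above can be sketched without the sampling optimization. In this sketch, "pruning" is measured as the number of (query, sequence) pairs that the bounds resolve without a comparison, the initial reference set is simply the first m database sequences, and each round applies the single best replacement; all three choices, and every name in the code, are assumptions made for illustration.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def edit_distance(a, b):
    """Unit-cost edit distance (standard dynamic program), cached per pair."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def pruned_pairs(refs, database, queries, eps):
    """Count (query, sequence) pairs decided by the bounds alone."""
    count = 0
    for q, s in product(queries, database):
        lb = max(abs(edit_distance(q, v) - edit_distance(v, s)) for v in refs)
        ub = min(edit_distance(q, v) + edit_distance(v, s) for v in refs)
        if eps < lb or eps > ub:
            count += 1
    return count


def greedy_max_pruning(database, queries, m, eps):
    """Swap references in and out while the pruning count keeps improving."""
    refs = list(database[:m])                    # assumed initial reference set
    best = pruned_pairs(refs, database, queries, eps)
    while True:
        best_swap = None
        for cand in database:
            if cand in refs:
                continue
            for pos in range(len(refs)):
                trial = refs[:pos] + [cand] + refs[pos + 1:]
                score = pruned_pairs(trial, database, queries, eps)
                if score > best:                 # positive gain
                    best, best_swap = score, trial
        if best_swap is None:                    # no further improvement
            return refs, best
        refs = best_swap
```

Each round evaluates every candidate against every reference slot, which is the source of the steep worst-case cost that motivates the sampling-based optimization.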

14
Maximum Pruning
[Figure: Maximum Pruning. Labels: sequence database, sample query set, current reference set, candidate reference, get gain, max gain, replace.]

15
Maximum Pruning
  • Time complexity
  • Sequence distances: $O(N^2)$
  • PRUNE(): $O(N|Q|)$
  • Step 2
  • Number of sequences: $O(N^2)$
  • Compute gain: $O(m|Q|)$
  • Time: $O(N^2 m|Q|)$
  • Overall worst case
  • N iterations
  • $O(N^3 m|Q|)$
  • Algorithm

16
Maximum Pruning
  • Sampling-Based Optimization
  • Estimation of gain
  • Reduce the number of sequences, use subset of
    database
  • Determine accuracy of gain estimate based on
    Central Limit Theorem
  • Iteratively randomly select a sequence to
    calculate the gain of a candidate until desired
    accuracy is reached
  • Time complexity: $O(N^2 fm|Q|)$, where f is the sample size
  • Estimation of largest gain
  • Reduce the number of candidate references
  • Ensure the largest gain found is at least $\tau G_e$ with a
    given probability, where $\tau$ and the probability lie
    between 0 and 1 and $G_e$ is the largest gain
  • Use Extreme Value Distribution to estimate $G_e$
  • From the sample set of candidates, find mean and
    standard deviation
  • Best reference sequence has the expected gain of
  • where
  • Sample size
  • Time complexity: $O(Nfhm|Q|)$
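
The first idea, estimating a candidate's gain from a growing random sample until the estimate is accurate enough, can be sketched with a normal-approximation confidence interval. The stopping rule, the 1.96 multiplier (roughly a 95% interval), the minimum sample of 30, and all names are assumptions; the extreme-value step for the largest gain is not sketched because its formulas are missing from the transcript.

```python
import math
import random


def estimate_mean_gain(gain_of, population, tol, z=1.96, min_samples=30, seed=0):
    """Estimate the mean of gain_of(x) over population by incremental sampling.

    Draws items in random order and stops once the confidence-interval
    half-width z * sqrt(var / n) drops to tol or below. Returns (mean, n_used).
    """
    order = list(population)
    if not order:
        raise ValueError("population must not be empty")
    random.Random(seed).shuffle(order)
    n, total, total_sq = 0, 0.0, 0.0
    for item in order:
        g = gain_of(item)
        n += 1
        total += g
        total_sq += g * g
        if n >= min_samples:
            mean = total / n
            # Unbiased sample variance, clamped at 0 against rounding error.
            var = max(total_sq / n - mean * mean, 0.0) * n / (n - 1)
            if z * math.sqrt(var / n) <= tol:
                break
    return total / n, n
```

When the per-sequence gains vary little, the loop stops after a few dozen samples instead of scanning the whole database.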

17
Mapping of references
  • Each sequence has its own set of best references
  • Based on a sample query set Q
  • Assign references that prune the sequence for
    most queries in Q
  • Avoid redundant references
  • Keep a reference only if it can prune a total of
    more than $|Q|$ sequences
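
A per-sequence assignment along these lines can be sketched as a greedy cover over the sample queries: pick the reference that prunes the sequence for the most queries, then pick the next one only for queries not yet covered. This is one way to read "avoid redundant references"; it does not model the keep-only-if rule on the last bullet, and the function name and argument layout are hypothetical.

```python
def map_references(seq_to_ref, query_to_ref, k, eps):
    """Choose up to k reference indices for one database sequence s.

    seq_to_ref[j] = ED(s, v_j); query_to_ref[q][j] = ED(q, v_j) for sample
    query q. Reference j prunes s for query q when |ED(q, v_j) - ED(s, v_j)| > eps.
    """
    unpruned = set(range(len(query_to_ref)))  # queries no chosen reference covers
    chosen = []
    for _ in range(k):
        best_j, best_hits = None, set()
        for j, ds in enumerate(seq_to_ref):
            if j in chosen:
                continue
            hits = {q for q in unpruned if abs(query_to_ref[q][j] - ds) > eps}
            if len(hits) > len(best_hits):
                best_j, best_hits = j, hits
        if best_j is None:        # every remaining reference would be redundant
            break
        chosen.append(best_j)
        unpruned -= best_hits
    return chosen


# Three references, three sample queries, eps = 2.
print(map_references([2, 10, 3], [[2, 3, 9], [9, 9, 3], [3, 2, 8]], k=3, eps=2))
```

In the example, reference 2 prunes the same queries as reference 1, so it is never selected even though k would allow a third reference.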

18
Mapping of references
[Figure: Mapping of references. Labels: sequence database, sample query set, reference set, query prune count, max, reference set for S1.]

19
Mapping of references
  • Time complexity
  • Distance computation: $O(tm|Q|)$, where a sequence
    comparison takes t time
  • Pruning amount calculation: $O(m|Q|)$
  • Overall time: $O(Nmk|Q|)$
  • Algorithm

20
Search Algorithm
  • Calculate edit distances between queries and
    every reference
  • Compute lower bound LB and upper bound UB
  • $\epsilon$: query range
  • By triangle inequality,
  • If $LB > \epsilon$, prune the sequence
  • If $UB < \epsilon$, accept the sequence
  • Otherwise, perform actual sequence comparison
  • Memory complexity
  • z: average sequence size
  • $(i, ED(s, v_i))$: sequence-reference mapping
  • N: number of database sequences
  • m: number of references
  • k: number of references per database sequence
  • Overall memory: $(8Nk + mz)$ bytes
  • Time complexity
  • Q: query set
  • L: average sequence length
  • $C_m$: average candidate set size for Q using m
    references
  • Overall time: $O((m + C_m)|Q|L^2 + Nk|Q|)$
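
Put together, a range query over the index might look like the sketch below. It stores, for each database sequence, a list of (i, ED(s, v_i)) pairs as on this slide; the helper `build_mappings` assigns every reference to every sequence purely to keep the example short, and all function names are illustrative.

```python
def edit_distance(a, b):
    """Unit-cost edit distance (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def build_mappings(database, references):
    """Hypothetical helper: map each sequence to all references (k = m)."""
    return [[(i, edit_distance(s, v)) for i, v in enumerate(references)]
            for s in database]


def range_search(query, database, references, mappings, eps):
    """Return sequences within eps of query and the number of comparisons made."""
    q_to_ref = [edit_distance(query, v) for v in references]  # m comparisons
    results, comparisons = [], 0
    for s, pairs in zip(database, mappings):
        lb = max(abs(q_to_ref[i] - d) for i, d in pairs)
        ub = min(q_to_ref[i] + d for i, d in pairs)
        if lb > eps:
            continue                      # pruned by the triangle inequality
        if ub < eps:
            results.append(s)             # accepted without comparison
            continue
        comparisons += 1                  # candidate: run the actual comparison
        if edit_distance(query, s) <= eps:
            results.append(s)
    return results, comparisons
```

The returned comparison count corresponds to the candidate set size for that query, the quantity reported in the experimental slides as the number of sequence comparisons.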

21
Experimental evaluation
  • Size of reference set: 200
  • Datasets
  • Text: alphabet size of 36 and 8000 sequences of
    length 100
  • DNA: alphabet size of 4 and 20000 sequences
  • Protein: alphabet size of 20 and 4000 sequences
    of length 500
  • Comparisons of the selection strategies
  • MV-S, MV-D: Maximum variance with same and
    different reference sets
  • MP-S, MP-D: Maximum pruning with same and
    different reference sets
  • Comparisons with existing methods
  • Omni, FV, M-Tree, DBM-Tree, Slim-Tree, DF-Tree

22
Comparison of selection strategies
  • Impact of query range
  • Impact of number of references per sequence

23
Comparisons with existing methods
  • Impact of query range
  • Number of sequence comparisons
  • IC: index construction time
  • s: seconds
  • m: minutes
  • QR: query range
  • MP-D uses the sampling-based optimization

24
Comparison with existing methods
  • Impact of input queries
  • Number of sequence comparisons
  • Sample query set in reference selection: E. coli
  • Actual query set
  • HM: a butterfly species
  • MM: a mouse species
  • DR: a zebrafish species
  • QR: query range

25
Comparison with existing methods
  • Scalability of database size and sequence length

26
Conclusion
  • Similarity search over a large database
  • Edit distance as the similarity measure
  • Selection of references
  • Maximum variance
  • Maximize spread of database around the references
  • Maximum pruning
  • Optimize pruning based on a set of sample queries
  • Sampling-based optimization
  • Mapping of references
  • Each sequence has a different set of references
  • Experimental evaluation
  • Outperform existing strategies including Omni and
    frequency vectors
  • MP-D, Maximum pruning with dynamic assignment of
    reference sequences, performs the best