Reference-Based Alignment in Large Sequence Databases

Provided by: vass98
1
Reference-Based Alignment in Large Sequence Databases
Speaker: Panagiotis Papapetrou, Boston University
  • Joint work with:
  • Vassilis Athitsos (UTA)
  • George Kollios (BU)
  • Dimitrios Gunopulos (UOA and UCR)

2
General Problem
  • Given:
  • S: a collection of strings.
  • Q: a query string.
  • D: a similarity measure.
  • Find the substring in S that is most similar to
    Q under the similarity measure D.

3
Motivation
  • Spell-checking
  • given some input text the spell-checker consults
    its dictionary to find words of high similarity
    to the text, so as to identify potential typos.
  • Data cleaning
  • data obtained from different sources might
    contain inconsistencies which can be eliminated
    by looking for similar entities (strings) in the
    data.
  • Near homology search in biological sequences
  • given different genomes we want to find regions
    of high similarity that were the result of a
    mutation, etc.

4
Motivation
  • Our focus
  • Near homology search in DNA sequences.
  • Two major requirements
  • Retrieve near-exact matches of long query
    sequences efficiently.

Q: TCTAGGGCA
Database sequence: ACTTAGCTGTAGTCGTTCTATGGCATATGCATGCTGATCTCGTGCGTCATG
5
Motivation
  • Large query sizes
  • Locate genes in large genomes.
  • Find chromosome similarities across different
    organisms.
  • Chromosomes can be relatively large (e.g. Human
    Chromosome 1 is approx. 272 million bases).
  • Near homology search
  • Meaningful for DNA similarity search.
  • Genomes evolve over time due to small mutations.
  • Genomes from different organisms might have high
    similarity.

6
Problem Statement
  • Given:
  • S: a collection of DNA sequences.
  • Q: a DNA query sequence.
  • D: a similarity measure.
  • Find the most similar subsequence in S
  • with a deviation of at most $\delta|Q|$ edit
    operations.
  • $\delta$ is at most 15% (near homology search).

7
The Edit Distance [Levenshtein 1966]
  • Measures how dissimilar two strings are.
  • ED(A, B): the minimum number of operations needed to
    transform A into B.
  • Operations: insertion, deletion, substitution.
  • Example:
  • A = ATC and B = ACTG

A: A - T C
B: A C T G
ED(A, B) = 2 (insert C, then substitute C with G)
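
The definition above can be sketched with the standard dynamic programming recurrence. This is a minimal illustration, not code from the presentation; the function name `edit_distance` is an assumption made for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insertion, deletion, substitution."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (ca != cb),  # match or substitution
                prev[j] + 1,               # deletion from a
                curr[j - 1] + 1,           # insertion into a
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ATC", "ACTG"))  # the slide's example; prints 2
```

Only two rows of the table are kept, so memory grows with the length of B rather than with the product of both lengths.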
15
Smith-Waterman [Smith & Waterman 1981]
  • Similarity measure used for local alignment:
  • The match can be a subsequence of the query sequence.
  • Defines three scoring parameters:
  • match, mismatch, gap.
  • Scoring parameters are defined by the user.
  • Example:
  • A = ATC and B = TATTCG
  • match = 2, mismatch = -1, gap = -1.
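
A minimal sketch of the local alignment score for the example above, assuming a linear gap penalty (the slides give a single gap value, so no separate gap-opening cost is modeled). The function name `smith_waterman` and its defaults are chosen for this illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            score = match if ca == cb else mismatch
            cell = max(
                0,                     # start a new local alignment here
                prev[j - 1] + score,   # align ca with cb
                prev[j] + gap,         # gap in b
                curr[j - 1] + gap,     # gap in a
            )
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(smith_waterman("ATC", "TATTCG"))  # prints 5
```

With these parameters the best local alignment pairs ATC with ATTC using one gap: three matches (6) minus one gap (1) gives 5.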

22
Strategy: Identify Candidate Endpoints
[Figure: query Q, indexing structure, candidate endpoints in database sequence X]
  • Use dynamic programming only to evaluate the
    candidates.

23
RBSA
  • Decompose subsequence matching into two distinct
    problems:
  • Fixed query length:
  • Assumes all queries have the same length.
  • Variable query length:
  • Uses the solution to the fixed query length
    problem.
  • Achieves efficient retrieval for queries of
    arbitrary length.

24
Fixed query length
  • Q: the query.
  • (X, t): position t of database sequence X.
  • Q and (X, t) are each mapped to a number, using:
  • D: the Edit Distance.
  • R: a reference sequence.
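
The transcript does not preserve the exact mapping, so the following is only a sketch of one common reference-based scheme, under the assumption that a string is mapped to its edit distance from R. Because the edit distance is a metric, the triangle inequality gives |ED(R, Q) - ED(R, S)| <= ED(Q, S), which can serve as a cheap lower bound for filtering. The names `embed`, `lower_bound`, and `survives_filter` are hypothetical, and plain strings stand in for database positions (X, t).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]


def embed(s: str, ref: str) -> int:
    """Assumed mapping: a string becomes its edit distance to the reference."""
    return edit_distance(ref, s)


def lower_bound(q: str, s: str, ref: str) -> int:
    """Triangle-inequality lower bound on edit_distance(q, s)."""
    return abs(embed(q, ref) - embed(s, ref))


def survives_filter(q: str, s: str, ref: str, threshold: int) -> bool:
    """A candidate can be pruned safely when the lower bound exceeds the threshold."""
    return lower_bound(q, s, ref) <= threshold


ref = "AAAAAAAA"
print(lower_bound("AAAAAAAA", "TTTTTTTT", ref))                    # bound is tight here
print(survives_filter("TCTAGGGCA", "TCTATGGCA", ref, threshold=1))
```

In an index, the embedding of each database entry would be computed offline, so that only the query's embedding needs to be computed at query time.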

26
Database Embedding
[Figure: database sequence with positions X1, X2, ..., X15]
37
Refine step
  • Refine only those database positions that were
    not pruned by filtering.
  • For refinement we can use either the Edit
    Distance or the Smith-Waterman dynamic
    programming algorithm.
  • For Smith-Waterman, an upper bound can be applied:

$SW(Q, X, t) \le 2|Q| - LB_{ED}(Q, X, t)$
38
Offline selection of reference sequences
  • Goal: represent each database position (X, t)
    using a set of reference sequences $R_t$.
  • Given:
  • $Q_{sample}$: a set of random queries, of size q.
  • R: a set of random reference sequences of size
    q.
  • For each (X, t):
  • Choose the $R_t$ that prunes (X, t) for the largest
    number of queries in $Q_{sample}$.
  • Greedy selection.
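
The slides do not spell out the selection procedure, so the following is a generic greedy max-coverage sketch rather than the authors' method. It assumes that, for one database position, we already know which sample queries each candidate reference prunes; the input format (`prunes`), the budget `k`, and the function name `greedy_select` are all hypothetical.

```python
def greedy_select(prunes: dict[str, set[int]], k: int) -> tuple[list[str], set[int]]:
    """Pick up to k references, each time taking the one that prunes the
    position for the most sample queries not yet covered."""
    chosen: list[str] = []
    covered: set[int] = set()
    remaining = dict(prunes)
    for _ in range(k):
        if not remaining:
            break
        # Reference with the largest marginal gain (ties: first in dict order).
        best = max(remaining, key=lambda r: len(remaining[r] - covered))
        if not remaining[best] - covered:
            break  # no reference adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical data: reference name -> indices of sample queries it prunes.
prunes = {"R1": {0, 1, 2}, "R2": {2, 3}, "R3": {3, 4, 5, 6}, "R4": {0}}
print(greedy_select(prunes, k=2))
```

Greedy selection does not guarantee the best possible set, but it is a standard heuristic for coverage problems of this kind.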

39
Alphabet Reduction
  • Improve the filtering power of RBSA by applying
    alphabet reduction.
  • $\Sigma = \{A, C, G, T\}$.
  • Use four letter-collapsing schemes:
  • Scheme 0: no collapsing.
  • Scheme 1: A, C -> X and G, T -> Y.
  • Scheme 2: A, G -> X and C, T -> Y.
  • Scheme 3: A, T -> X and C, G -> Y.
  • The number of possible reference sequences
    decreases with the alphabet size:
    $4^q = (2^q)^2$ vs. $2^q$
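
The four collapsing schemes can be sketched as simple character translations. The dictionary `SCHEMES` and the function `collapse` are names introduced for this illustration only.

```python
# Scheme number -> letter mapping, as listed on the slide.
SCHEMES = {
    0: {},                                        # no collapsing
    1: {"A": "X", "C": "X", "G": "Y", "T": "Y"},
    2: {"A": "X", "G": "X", "C": "Y", "T": "Y"},
    3: {"A": "X", "T": "X", "C": "Y", "G": "Y"},
}


def collapse(seq: str, scheme: int) -> str:
    """Rewrite a DNA string over the reduced alphabet of the given scheme."""
    return seq.translate(str.maketrans(SCHEMES[scheme]))


for scheme in SCHEMES:
    print(scheme, collapse("ACGT", scheme))

# Number of distinct strings of length q over 4 letters vs. 2 letters.
q = 8
print(4 ** q, 2 ** q)
```

For q = 8 there are 65,536 possible strings over the full alphabet but only 256 over a two-letter alphabet.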

40
Variable Query Length
  • So far we assumed that $|Q_i| = q$ for every $Q_i$.
  • Q can have arbitrary size.
  • For simplicity, assume that $|Q| = aq$.
  • At query time:
  • Break Q into non-overlapping segments of size q.
  • Two versions of RBSA:
  • Exact and approximate.

41
Exact version
  • Let $X_{s:t}$ be a subsequence match for Q within edit
    distance $\delta|Q|$.
  • Then at least one $Q_i$ has a subsequence match within
    $X_{s:t}$ within edit distance $\delta q$.
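
The pigeonhole argument above can be checked on the example sequences from the earlier slide. This sketch uses the standard semi-global (Sellers-style) dynamic program to find the best match of a pattern against any substring of a text; the function names are assumptions made for this illustration.

```python
def best_substring_distance(pattern: str, text: str) -> int:
    """Minimum edit distance between pattern and any substring of text."""
    prev = [0] * (len(text) + 1)  # a match may start anywhere in text at no cost
    for i, pc in enumerate(pattern, start=1):
        curr = [i] + [0] * len(text)
        for j, tc in enumerate(text, start=1):
            curr[j] = min(prev[j - 1] + (pc != tc), prev[j] + 1, curr[j - 1] + 1)
        prev = curr
    return min(prev)  # a match may end anywhere in text


def split_segments(query: str, q: int) -> list[str]:
    """Break the query into non-overlapping segments of length q."""
    return [query[k:k + q] for k in range(0, len(query), q)]


query = "TCTAGGGCA"   # Q from the earlier slide
match = "TCTATGGCA"   # the near-exact match found in the database sequence
segments = split_segments(query, 3)

total = best_substring_distance(query, match)
per_segment = [best_substring_distance(seg, match) for seg in segments]
print(total, per_segment)

# With a segments, the cheapest segment costs at most total / a.
assert min(per_segment) * len(segments) <= total
```

Here the whole query is one edit away from the match, and two of the three segments occur in it exactly, so at least one segment stays within its share of the error budget.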

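[Figure: Q split into a = 3 segments Q1, Q2, Q3, each of length q; the match $X_{s:t}$ spans positions s to t of the database sequence ACTTAGCTGTAGTCGTTCTATGGCATATGCATGCTGATCTCGTGCGTCATG]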
42
Exact version
  • Filter and refine.
  • Break Q into a non-overlapping segments $Q_1, Q_2, \ldots, Q_a$.

[Figure: Q split into segments Q1, Q2, Q3, each of length q]
  • If for some $Q_i$:
  • $ED(Q_i, X_{s:t}) \le \delta q$
  • Take the union of all candidates from all $Q_i$'s.
  • Perform the refinement step.

43
Approximate version
  • Question:
  • Use only one segment $Q_i$ of Q.
  • What is the probability $P(Q_i)$ that the
    subsequence match of Q is included in the
    candidates of $Q_i$?
  • Proposition:
  • $P(Q_i) \ge 50\%$.
  • Using [Hamza et al. 1995].

44
Approximate version
  • By the previous proposition:
  • If a single $Q_i$ is chosen and all candidate
    endpoints are generated,
  • there is at least a 50% probability of finding the
    correct endpoint of the optimal subsequence match.

45
Approximate version
  • By the previous proposition:
  • Assume that the optimal match was not found under
    $Q_i$.
  • $\bar{P}(Q_j)$: the probability of not finding the optimal
    match under $Q_j$, with $\bar{P}(Q_j) \le 0.5$, for $j = 1, \ldots, a$.
  • If we use p segments $Q_1, Q_2, \ldots, Q_p$:
  • $\bar{P}(Q_1, Q_2, \ldots, Q_p) \le (0.5)^p$.
  • Thus, the probability of retrieving the optimal
    match is at least
  • $1 - (0.5)^p$
  • For p = 10, this probability is at least 99.9%.
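
The arithmetic behind the bound can be checked directly. This is a small sketch; the function names are chosen for illustration, and the per-segment failure probability of 0.5 is the worst case allowed by the proposition.

```python
import math


def success_probability(p: int) -> float:
    """Lower bound on the chance of retrieving the optimal match with p segments."""
    return 1.0 - 0.5 ** p


def segments_needed(target: float) -> int:
    """Smallest p with 1 - 0.5**p >= target, for 0 < target < 1."""
    return math.ceil(math.log(1.0 - target) / math.log(0.5))


print(success_probability(10))   # 0.9990234375
print(segments_needed(0.999))    # 10
```

Each additional segment halves the bound on the failure probability, so a handful of segments is enough for high accuracy.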

46
Experimental Setup
  • Datasets
  • Database
  • Human Chromosome 22 (35,059,634 bases).
  • Queries
  • Mouse genome (random chromosomes).
  • Variable size: 40, ..., 10K bases.
  • Similarity to the database varied within 5%, 10%, and 15%.
  • Each dataset contains 200 queries.

47
Performance Measures
  • Accuracy
  • Percentage of queries giving correct results.
  • Efficiency
  • DP cell cost: cost of dynamic programming, as a
    percentage of brute-force search cost.
  • Retrieval runtime cost: CPU time per query, as a
    percentage of brute-force CPU time.
  • Brute force:
  • Full dynamic programming algorithm
  • (Edit Distance or Smith-Waterman).

48
Competitors
  • Competitors for Endpoint Subsequence Matching:
  • Edit Distance.
  • Q-grams [Burkhardt et al. 1999].
  • Competitors for Local Alignment:
  • BLAST [Altschul et al. 1990].
  • BWT-SW [Lam et al. 2008].

49
Study on Q-grams
  • Database
  • First 184,309 bases of Human Chromosome 22.

51
Results on Edit Distance
  • Retrieval Runtime Percentage

53
Results on S-W
  • Retrieval Runtime Percentage

55
Conclusions
  • RBSA identifies subsequence matches in large
    sequence databases.
  • Two versions: exact and approximate.
  • Is designed for near homology search.
  • Can handle large query sizes.

56
Future Work
  • Perform RBSA on larger genomes and other
    datasets.
  • Extend RBSA for remote homology search:
  • Proteins: alphabet size is 20.
  • Improve the reference sequence selection process.
  • Reduce the embedding size:
  • Compression techniques.

57
Questions?