Title: Sequence alignment algorithms
1Sequence alignment algorithms
Presented by Cary Miller, Sastry Akella, Daisuke Yasuda
2Overview
- Biological background / motivation / applications
- Dot matrix / dynamic programming
- FASTA / BLAST
3biology
- Biomolecules are strings over a restricted alphabet
- Alphabet of size 4: DNA
- Alphabet of size 20: protein
- Proteins are the working part
4Proteins
- A protein is a linear sequence over 20 characters (amino acids)
- Proteins do not maintain linearity
- Folding happens
- Folding determines overall 3-D shape
- Shape determines function
5Sequence Structure Function
- Sequence alone does not reveal structure
- Much less function
- A sequence: ARTUVEDYERRWWUHUK
6Structure
7Structure
8function
- Protein A is a constituent of muscle, skin, cartilage, or
- Protein B catalyzes the transformation of glucose to fructose, or
- How do we find proteins with similar function?
9Nature does not solve the same problem twice
(usually)
- A short sequence with a specific function (or shape) is called a domain
- The same domain appears in multiple proteins
- If we find the same domain in multiple proteins, that provides a clue to function and/or structure
10Amino acids
- Each has the same basic chemical configuration but has a functional group that makes it chemically unique
- They occur in families
- Some functional groups are similar
11How biologists study proteins
- Expensive (NMR, x-ray crystallography)
- Discovery of function is difficult
- Few proteins are understood in detail
- Many are known by sequence
- Sequence is easier to get than structure or
function
12A biological scenario
- A biologist discovers the sequence of a new protein with unknown function
- She has no idea of its function
- If the sequence can be associated with a known protein sequence, we have a clue about structure and/or function
- Most proteins have unknown function
13Public databases
- Vast quantities of sequence, structure, and function information are deposited in public databases
- A new sequence should be compared to the database
14Comparing sequences
- Alignment with exact match
  ABCTUVAB
  UV

  ABCTUVAB
  ----UV
15Alignment with inexact match
- Inexact
  GARUIPPRST
  GARVVBUIEEYST

  GAR------UIPPRST
  GARVVBUIEEYST
16Global vs. local alignment
- ABQRTASGGBV
- ABRRRASGVBB
- ABQRTASGGBV
- ABQ------SGGBV
17A real alignment
- Myoglobin
  PDLRKY FKG-A ENFTA DDVQ KSDR
  PDTKAY FPKFG DLSTA AALK SSPK
- Homology: common ancestry
18Real alignment
19Real alignment
20Scoring pairs of amino acids
- For amino acid pairs, assign a score based on frequency of substitution
  ATRGUVXQ
  ATRCVVXT

  ATRGVVEQ
  AT-----VVEQ
21A substitution matrix
22A substitution matrix
23Substitution matrices
- PAM and BLOSUM are standard substitution matrices
- A scoring scheme also includes penalties for
- Gap opening
- Gap extension
24Scoring amino acid strings
- Sum the individual pair scores
- The database is huge
- A spurious match to a random sequence is likely
- Try your name
- E-value is the number of matches with at least a given score that are expected from random sequences
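The "sum the individual pair scores" rule can be sketched in a few lines of Python. This is a minimal illustration, not a real scoring scheme: the entries in `TOY_SCORES`, the fallback of +2/-1 for unlisted pairs, and the `GAP_OPEN` / `GAP_EXTEND` values are all assumptions chosen for the example. A real search would take the pair scores from a PAM or BLOSUM matrix.

```python
# Toy pair scores (assumption): a real search would use a PAM or BLOSUM matrix.
TOY_SCORES = {("A", "A"): 4, ("R", "R"): 5, ("A", "R"): -1}
GAP_OPEN = -5    # hypothetical penalty for the first character of a gap
GAP_EXTEND = -1  # hypothetical penalty for each further character of the same gap


def pair_score(a, b):
    """Symmetric lookup; unlisted pairs fall back to +2 (identical) or -1."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    if (b, a) in TOY_SCORES:
        return TOY_SCORES[(b, a)]
    return 2 if a == b else -1


def alignment_score(row1, row2):
    """Sum the column scores of two gapped rows of equal length."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    gap_row = None  # which row (1 or 2) holds the currently open gap, if any
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            current = 1 if a == "-" else 2
            total += GAP_EXTEND if gap_row == current else GAP_OPEN
            gap_row = current
        else:
            total += pair_score(a, b)
            gap_row = None
    return total
```

For example, `alignment_score("ABQRTASGGBV", "ABQ---SGGBV")` charges one gap opening and two gap extensions for the three-character gap.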
25Alignment algorithms
- Dot matrix
- Dynamic programming
- FASTA
- BLAST
26Dot Matrix and DP
27Dot Matrix
- Locating regions of similarity between two DNA or protein sequences provides a great deal of information about the function and structure of the query sequence.
- Similar structure indicates homology (shared evolutionary origin), which provides critical information about the functions of these sequences.
28Dot Matrix Contd..
- A dot matrix plot is a method of aligning two sequences to provide a picture of the homology between them.
- The dot matrix plot is created by designating one sequence as the subject and placing it on the horizontal axis, and designating the second sequence as the query and placing it on the vertical axis of the matrix.
29Dot Matrix Contd..
- At each position within the matrix, a point is plotted if the horizontal and vertical elements are identical.
- Diagonal lines within the resulting matrix indicate regions of similarity. A simple dot matrix plot is shown in Figure A.
30Dot Matrix (Figure A)
31Dot Matrix with noise reduction
- A certain percentage of the matches between sequence elements can be expected to arise purely by chance. These random matches are considered "noise" and are filtered out to enhance the diagonal lines.
32Dot Matrix
- Noise Reduction
- a) Noise in a dot matrix can be reduced by centering a window (a substring of elements of the query sequence) over each element in the subject sequence and counting the corresponding (matching) elements within this window.
33Dot Matrix
- b) If the number of corresponding elements exceeds a specified threshold, then a point is plotted for the center element. This is demonstrated in Figure B.
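The windowed filter in a) and b) can be sketched as follows. This is an illustrative version under stated assumptions: the function name `dot_matrix`, the text rendering with `*` and `.`, and the choice to plot a point when the match count is at least `threshold` are mine, and `window` is assumed to be odd so that it can be centered. With `window=1, threshold=1` it reduces to the plain dot matrix.

```python
def dot_matrix(subject, query, window=1, threshold=1):
    """Return one string per query position; columns follow the subject.

    A '*' is placed at (row i, column j) when at least `threshold` of the
    `window` positions centered on (i, j) along the diagonal match.
    """
    half = window // 2
    rows = []
    for i in range(len(query)):
        row = []
        for j in range(len(subject)):
            matches = 0
            for k in range(-half, half + 1):
                qi, sj = i + k, j + k
                # Positions that fall off either sequence are skipped.
                if 0 <= qi < len(query) and 0 <= sj < len(subject):
                    if query[qi] == subject[sj]:
                        matches += 1
            row.append("*" if matches >= threshold else ".")
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    for line in dot_matrix("ABCAD", "ABC", window=3, threshold=2):
        print(line)
```

In the small example, the isolated match between the query `A` and the second `A` of the subject disappears once the window is applied, while the diagonal run `ABC` survives.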
34Dot Matrix (Figure B)
35Dot Matrix
- Advantages: readily reveals the presence of insertions/deletions and direct and inverted repeats that are more difficult to find by the other, more automated methods.
- Disadvantages: most dot matrix computer programs do not show an actual alignment, and the method does not return a score to indicate how good a given alignment is.
36Dynamic Programming
- Dynamic programming (DP) algorithms are a general class of algorithms typically applied to optimization problems.
- For DP to be applicable, an optimization problem must have two key ingredients:
- a) Optimal substructure: an optimal solution to the problem contains within it optimal solutions to sub-problems.
- b) Overlapping sub-problems: the same sub-problems recur many times while solving the larger problem, so their answers can be reused.
37Dynamic Programming
- DP works by solving every sub-problem just once and saving its answer in a table, thereby avoiding the work of recomputing the answer every time the sub-problem is encountered. Each intermediate answer is stored with a score, and DP finally chooses the sequence of solutions that yields the highest score.
38Dynamic Programming
39Dynamic Programming
- Both global and local types of alignments may be made by simple changes in the basic DP algorithm.
- Alignments depend on the choice of a scoring system for comparing character pairs and on gap penalty scores (e.g. the PAM and BLOSUM matrices covered before)
- Scoring function example:
- w(match) = +2 or substitution matrix
- w(mismatch) = -1 or substitution matrix
- w(gap) = -3
40Dynamic Programming
- Global Alignment (Needleman-Wunsch)
- a) The general goal is to obtain the optimal global alignment between two sequences, allowing gaps.
- b) We construct a matrix $F$ indexed by $i$ and $j$, one index for each sequence, where the value $F(i,j)$ is the score of the best alignment between the initial segment $x_{1 \ldots i}$ of $x$ up to $x_i$ and the initial segment $y_{1 \ldots j}$ of $y$ up to $y_j$. We begin by initializing $F(0,0) = 0$. We then proceed to fill the matrix from top left to bottom right. If $F(i-1,j-1)$, $F(i-1,j)$ and $F(i,j-1)$ are known, it is possible to calculate $F(i,j)$.
41Dynamic Programming
- $F(i,j) = \max \{\, F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$
- where $s(a,b)$ is the likelihood score that residues $a$ and $b$ occur as an aligned pair, and $d$ is the gap penalty.
- Once you construct the matrix, you trace back the path that leads to $F(n,m)$, which is by definition the best score for an alignment of $x_{1 \ldots n}$ to $y_{1 \ldots m}$.
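A minimal Python sketch of the global recurrence and trace-back follows. It makes several assumptions: the function name `needleman_wunsch` is illustrative, `s(a,b)` is replaced by the simple match/mismatch scores +2/-1 from the earlier scoring example, and the gap penalty is `d = 3` (that is, w(gap) = -3). The border cells are filled with `F(i,0) = -i*d` and `F(0,j) = -j*d`, the usual boundary condition for a linear gap penalty, which the slides do not spell out.

```python
def needleman_wunsch(x, y, match=2, mismatch=-1, d=3):
    """Return (score, aligned_x, aligned_y) for a global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d  # x[0:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = -j * d  # y[0:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Trace back from F(n, m) to F(0, 0), re-deriving which option was taken.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))
```

When several options tie, this trace-back prefers the diagonal move, so it returns one optimal alignment out of possibly many.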
42Dynamic Programming
- Global Dynamic programming matrix
43Dynamic Programming
- Local alignment (Smith-Waterman): two changes from global alignment
- 1. Possibility of taking the value 0 if all other options have value less than 0. This corresponds to starting a new alignment.
- 2. Alignments can end anywhere in the matrix, so instead of taking the value in the bottom right corner, $F(n,m)$, for the best score, we look for the highest value of $F(i,j)$ over the whole matrix and start the trace-back from there.
- $F(i,j) = \max \{\, 0,\; F(i-1,j-1) + s(x_i, y_j),\; F(i-1,j) - d,\; F(i,j-1) - d \,\}$
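The two changes for local alignment can be sketched in Python as below. As assumptions for illustration, the function name `smith_waterman` is mine, `s(a,b)` is again the simple +2/-1 match/mismatch score, and `d = 3`. The trace-back starts at the highest cell and stops at the first cell with value 0.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=3):
    """Return (best_score, local_x, local_y) for the best local alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Change 1: the extra option 0 starts a new alignment.
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Change 2: trace back from the highest cell until a 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

If several cells share the highest value, this sketch keeps the first one found in row order.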
44Dynamic Programming
- Local Dynamic programming matrix
45Dynamic Programming
- Advantages: guaranteed in a mathematical sense to provide the optimal (very best or highest-scoring) alignment for a given set of scoring functions.
- Disadvantages
- a) Slow due to the very large number of computational steps: $O(n^2)$.
- b) Computer memory requirement also increases as the square of the sequence lengths.
- Therefore, it is difficult to use the method for very long sequences.
46FASTA and BLAST
47FASTA - Idea -
- Problem of dynamic programming
- D.P. computes scores over large areas of the matrix that the optimal alignment never uses
- FASTA focuses on the diagonal area
48FASTA - Heuristic -
- Heuristic
- A good local alignment should contain some exactly matching subsequence.
- FASTA focuses on this area
49FASTA - High Level Algorithm -
- High level algorithm
- Let q be a query
- max ← 0
- For each sequence s in DB
  - compare q with s and compute a score y
  - if y > max
    - max ← y
    - bestSequence ← s
- Return bestSequence
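The high level loop translates almost line for line into Python. In this sketch the scoring function is passed in as a parameter. The helper `shared_words`, which counts the length-k words of `s` that also occur in `q`, is a hypothetical stand-in for the real FASTA score and is only there to make the example runnable.

```python
def shared_words(q, s, k=2):
    """Toy score (assumption): number of length-k words of s that occur in q."""
    words = {q[i:i + k] for i in range(len(q) - k + 1)}
    return sum(1 for j in range(len(s) - k + 1) if s[j:j + k] in words)


def best_sequence(query, database, score=shared_words):
    """Return the database sequence with the highest score, or None."""
    best, best_score = None, 0
    for s in database:
        y = score(query, s)
        if y > best_score:
            best_score, best = y, s
    return best


if __name__ == "__main__":
    print(best_sequence("GATTACA", ["CCCCCC", "TTGATTAC", "GAT"]))
```

Because the running maximum starts at 0, a database in which no sequence scores above 0 yields `None`.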
50FASTA - Algorithm -
- Step 1
- Find all hot spots
- // A hot spot is a pair of words of length k that match exactly
(Figure: hot spots between Sequence 1 and Sequence 2)
51FASTA - Algorithm -
- Step 1 in detail
- Use look-up Table
- Query G A A T T C A G T T A
- Sequence G G A T C G A
(Figure: dot matrix and look-up table)
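The look-up table step can be sketched with a dictionary from each length-k word of the query to the positions where it occurs. The function name `find_hot_spots` and the word length `k=2` are assumptions for illustration. The query and sequence are the ones given on this slide.

```python
from collections import defaultdict


def find_hot_spots(query, sequence, k=2):
    """Return (query_pos, sequence_pos) pairs where length-k words match exactly."""
    # Build the look-up table: word -> list of start positions in the query.
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)

    # Scan the sequence once; each word costs one table look-up.
    hot_spots = []
    for j in range(len(sequence) - k + 1):
        for i in table.get(sequence[j:j + k], []):
            hot_spots.append((i, j))
    return hot_spots


if __name__ == "__main__":
    print(find_hot_spots("GAATTCAGTTA", "GGATCGA"))
```

Each hot spot `(i, j)` lies on the diagonal `i - j` of the dot matrix, which is what the next step groups on.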
52FASTA - Algorithm -
- Step 2
- Score the hot spots and locate the ten best diagonal runs.
- // Uses some scoring system, e.g. PAM250
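A simplified sketch of Step 2 groups hot spots by diagonal and keeps the ten best. As a stated simplification, each diagonal is scored here by its number of hot spots rather than with a substitution matrix such as PAM250. The function name `best_diagonals` and the `(query_pos, sequence_pos)` input format are assumptions.

```python
from collections import Counter


def best_diagonals(hot_spots, top=10):
    """Return up to `top` (diagonal, count) pairs, best first.

    hot_spots is a list of (query_pos, sequence_pos) pairs; two hot spots
    lie on the same diagonal when query_pos - sequence_pos is equal.
    """
    counts = Counter(i - j for i, j in hot_spots)
    return counts.most_common(top)


if __name__ == "__main__":
    spots = [(0, 0), (1, 1), (2, 2), (5, 1), (6, 2), (0, 3)]
    print(best_diagonals(spots, top=2))
```

In the example, diagonal 0 holds three hot spots and diagonal 4 holds two, so those are the two runs kept.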
53FASTA - Algorithm -
- Step 3
- Combine sub-alignments into one alignment with gaps
(Figure: sub-alignments joined by gaps into one local alignment)
54FASTA - Algorithm -
- Step 4
- Consider a weighted directed graph.
- Let each node be a sub-alignment found in step 1
- Let u and v be nodes
- Edge (u,v) exists if sub-alignment u comes before v in the sequence.
- Each edge has a gap penalty (negative weight)
- Find the maximum weight path
(Figure labels: Sub-sequence, Edge, One Sequence)
55FASTA - Algorithm -
(Figure: sub-alignments joined by gaps; edge weights -5, -3, -3; max weight path)
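The maximum weight path of Step 4 can be sketched as a small dynamic program over the sub-alignments. The data layout is an assumption made for the example: each sub-alignment is a tuple `(q_start, q_end, s_start, s_end, score)` with exclusive ends and at least one residue, and `gap_penalty` is a hypothetical function returning the negative weight of the edge between two sub-alignments (a flat -3 by default).

```python
def best_chain(runs, gap_penalty=lambda u, v: -3):
    """Return (total_score, chain), where chain lists indices into `runs`.

    An edge u -> v exists when u ends before v starts in both sequences.
    Node weights are the run scores; edge weights are the gap penalties.
    """
    # Sorting by query start guarantees every predecessor is handled first.
    order = sorted(range(len(runs)), key=lambda k: runs[k][0])
    best, prev = {}, {}
    for pos, v in enumerate(order):
        q_start, _, s_start, _, score = runs[v]
        best[v], prev[v] = score, None
        for u in order[:pos]:
            _, u_q_end, _, u_s_end, _ = runs[u]
            if u_q_end <= q_start and u_s_end <= s_start:
                candidate = best[u] + gap_penalty(runs[u], runs[v]) + score
                if candidate > best[v]:
                    best[v], prev[v] = candidate, u
    if not best:
        return 0, []
    end = max(best, key=best.get)
    chain, node = [], end
    while node is not None:
        chain.append(node)
        node = prev[node]
    return best[end], chain[::-1]


if __name__ == "__main__":
    runs = [(0, 5, 0, 5, 10), (7, 12, 8, 13, 8), (3, 9, 20, 26, 6)]
    print(best_chain(runs))
```

With r sub-alignments the double loop examines at most r(r-1)/2 candidate edges, which is where the quadratic cost in r comes from.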
56FASTA - Algorithm -
- Step 5
- Run dynamic programming in a restricted band around the best-scoring alignment to look for an alignment that scores higher than it
- The width of this band is a parameter
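A banded version of local dynamic programming can be sketched as below. It only fills cells whose diagonal `i - j` lies within `half_width` of a chosen center diagonal. The function name, the parameter names `diagonal` and `half_width`, and the +2/-1/3 scores are assumptions for illustration. The sketch returns only the best score, not the alignment.

```python
def banded_local_score(x, y, diagonal=0, half_width=2, match=2, mismatch=-1, d=3):
    """Best local alignment score using only cells near one diagonal.

    A cell (i, j), 1-based, is inside the band when
    abs((i - j) - diagonal) <= half_width.
    """
    n, m = len(x), len(y)
    neg_inf = float("-inf")
    prev = {}  # j -> F(i-1, j) for the in-band cells of the previous row
    best = 0
    for i in range(1, n + 1):
        cur = {}
        lo = max(1, i - diagonal - half_width)
        hi = min(m, i - diagonal + half_width)
        for j in range(lo, hi + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # (i-1, j-1) is on the same diagonal: in the band or on the border (0).
            diag = prev.get(j - 1, 0) + s
            # Neighbours outside the band are treated as unreachable.
            up = prev.get(j, neg_inf) - d
            left = cur.get(j - 1, neg_inf) - d
            cur[j] = max(0, diag, up, left)
            best = max(best, cur[j])
        prev = cur
    return best
```

The work is proportional to the sequence length times the band width instead of the full n by m matrix, at the price of missing any alignment that leaves the band.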
57FASTA - Algorithm -
- Summary of Algorithm
- 1. Find all hot spots
- // A hot spot is a pair of words of length k that match exactly
- 2. Score the hot spots and locate the ten best diagonal runs.
- 3. Combine sub-alignments into one alignment
- 4. Score each alignment with gap penalties and pick the best-scoring alignment
- 5. Run dynamic programming in a restricted band around the best-scoring alignment to look for an alignment that scores higher.
58FASTA - Complexity -
- Complexity
- Steps 1 and 2 // select the ten best diagonal runs
- Let n be the length of a sequence from the DB
- O(n), because Step 1 only looks words up in the table
59FASTA - Complexity -
- Steps 3 and 4 // compute the max weight path
- Let r be the number of sub-alignments (r = 10)
- Let s be the number of edges
- $O(r^2)$
- → about 1% of D.P., because $r^2 = 10^2$ and $mn = 10^4$
(Figure: nodes n1, n2, n3 with positive weights; edge weights -5, -3, -3; max weight path)
60FASTA - Complexity -
- Step 5 // compute partial D.P.
- Depends on the size of the restricted area
- Therefore, FASTA is faster than D.P.
- The width of this band is a parameter
61BLAST - Heuristic -
- Another heuristic algorithm
- Heuristic, but evaluates the result statistically.
- Homologous sequences are likely to contain a short high-scoring word pair, a hit.
- BLAST tries to extend the hit on both sides to get the best alignment.
(Figure: sequence A T T A G . with a short high-scoring word)
62BLAST - Algorithm -
- Step 1: preprocessing the query
- Compile the list of short high-scoring words from the query.
- The length of a query word, w, is 3 for BLOSUM scoring
- Threshold T is 13
63BLAST - Algorithm -
- Step 1-2
- Create neighborhood words for each query word
(Figure: a query word and its neighborhood words)
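Generating neighborhood words can be sketched by enumerating every word of length w and keeping those that score at least T against the query word. The scoring function `toy_score` is a deliberately simple stand-in and NOT a BLOSUM matrix: identical residues score 6, residues in the same hypothetical similarity group score 2, and everything else scores -2. The word length 3 and threshold 13 are the values from the slides.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical similarity groups, used only by the toy score below.
GROUPS = [set("ILV"), set("KR"), set("DE")]


def toy_score(a, b):
    """Toy pair score (assumption): 6 identical, 2 same group, -2 otherwise."""
    if a == b:
        return 6
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -2


def neighborhood(word, threshold=13):
    """Return all words of the same length scoring at least `threshold`."""
    result = []
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        total = sum(toy_score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            result.append("".join(candidate))
    return result


if __name__ == "__main__":
    print(neighborhood("LKA"))
```

For words of length 3 this enumerates 20^3 = 8000 candidates per query word, which is cheap. With the toy scores, a neighbor may differ from the query word by at most one same-group substitution.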
64BLAST - Algorithm -
- Step 2: scanning the DB
- For each word list, identify all exact matches with DB sequences
(Figure: query word, neighborhood word list (Step 1), and sequences in the DB, Sequence 1 and Sequence 2 (Step 2))
- The purpose of Steps 1 and 2 is the same as in FASTA
65BLAST - Algorithm -
- Step 2-2
- Method 1: hash table
- Query: LAALLNKCKTPQGQRLVNQWIKQPLMD
(Figure: hash table and word list)
66BLAST - Algorithm -
- Step 2-3
- Method 2: finite automaton
(Figure: automaton with transitions labeled A,G, L, A, G, A, A, A, I)
67BLAST - Algorithm -
- Step 3 (search for the optimal alignment)
- Let S be the score of a hit word
- For each hit word, extend the ungapped alignment in both directions.
- Step 4 (evaluate the alignment statistically)
- Stop the extension when the E-value (which depends on the score S) becomes less than the threshold. The extended hit is called a High-scoring Segment Pair (HSP). BLAST returns it.
- E-value: the number of HSPs having score S (or higher) expected to occur only by chance.
- → Smaller E-value: more statistically significant
- → Bigger E-value: likely by chance
(Figure: sequence A T T A G . with a hit word)
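Ungapped extension of a hit in both directions can be sketched as follows. Note the swapped-in stopping rule: instead of the E-value test described above, this sketch stops extending once the running score falls more than `x_drop` below the best score seen so far, a drop-off rule commonly used for this step. The function names, the +2/-1 scores, and the default `x_drop=4` are assumptions for illustration.

```python
def extend_one_way(query, subject, q, s, step, x_drop, score):
    """Extend from (q, s) in direction `step`; return (best_gain, best_length)."""
    best = run = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        run += score(query[q], subject[s])
        length += 1
        if run > best:
            best, best_len = run, length
        elif best - run > x_drop:
            break  # score dropped too far below the best seen so far
        q += step
        s += step
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_len=3, x_drop=4,
               match=2, mismatch=-1):
    """Return (score, query_segment, subject_segment) for the extended hit."""
    def score(a, b):
        return match if a == b else mismatch

    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(word_len))
    right, right_len = extend_one_way(
        query, subject, q_pos + word_len, s_pos + word_len, 1, x_drop, score)
    left, left_len = extend_one_way(
        query, subject, q_pos - 1, s_pos - 1, -1, x_drop, score)
    q_start, q_end = q_pos - left_len, q_pos + word_len + right_len
    s_start, s_end = s_pos - left_len, s_pos + word_len + right_len
    return seed + left + right, query[q_start:q_end], subject[s_start:s_end]


if __name__ == "__main__":
    print(extend_hit("TTGATTACAGG", "CCGATTACACC", 4, 4))
```

In the example the seed word `TTA` at position 4 of both sequences grows into the segment pair `GATTACA` / `GATTACA`. The mismatching flanks are trimmed because only the best-scoring extension length is kept.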
68BLAST - Algorithm -
- Step 3-2
- Definition of E-value
- The expected number of HSPs with score at least S is
- $E = K n m e^{-\lambda S}$
- K and $\lambda$ are constants depending on the model
- n and m are the lengths of the query and the sequence
- The probability of finding at least one such HSP is
- $P = 1 - e^{-E}$
- → If a hit arises by chance (the E-value is bigger), P becomes larger, approaching 1.
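The two formulas are easy to evaluate directly. In this sketch the default values `K=0.134` and `lam=0.318` are illustrative placeholders of the magnitude often quoted for ungapped protein scoring. They are assumptions, not values from the slides, and should be replaced with the constants of the scoring model actually in use.

```python
import math


def e_value(score, m, n, K=0.134, lam=0.318):
    """Expected number of HSPs with score >= `score`: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * score)


def p_value(e):
    """Probability of at least one such HSP: P = 1 - exp(-E)."""
    return -math.expm1(-e)  # same as 1 - exp(-e), more accurate for tiny e


if __name__ == "__main__":
    for s in (20, 40, 60):
        e = e_value(s, m=153, n=1_000_000)
        print(s, e, p_value(e))
```

A higher score S shrinks E exponentially. For small E, P is approximately equal to E, and for large E, P approaches 1.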
69BLAST - Running Time -
- Running Time
- The length of Query 153
- DB size 5997 sequences
- PC Pentium 4
- By Dr. Takeshi Kawabata
- Nara Sentan Gijyutu University
70Comparison of Algorithm
- Dynamic Programming
- 1. Most sensitive result
- → D.P. uses all the information in the two sequences
- 2. Running time is slow
- → D.P. computes areas of the matrix that the optimal alignment does not need.
71Comparison of Algorithm
- FASTA
- 1. Less sensitive than D.P. and BLAST
- → FASTA uses partial information to speed up the computation.
- → FASTA does not evaluate the result statistically.
- 2. Running time is faster than D.P.
- → For the same reason as above.
72Comparison of Algorithms
- BLAST
- 1. More sensitive than FASTA
- → BLAST evaluates the result statistically.
- 2. Faster than FASTA
- → Because BLAST evaluates the entire DB with the same statistically based threshold, it eliminates noise and reduces the running time.
73FASTA vs BLAST
- BLAST
- Compares the query with the sequences in the DB
- using the same threshold.
- FASTA
- Compares the query with each sequence one by one
- and then compares the results.
(Figure: Query and DB)
74Conclusion