Title: Sequence Similarity
1Sequence Similarity
2The Viterbi algorithm for alignment
- Compute the following matrices (DP)
- M(i, j) most likely alignment of x1xi with
y1yj ending in state M - I(i, j) most likely alignment of x1xi with
y1yj ending in state I - J(i, j) most likely alignment of x1xi with
y1yj ending in state J - M(i, j) log( Prob(xi, yj) )
- max M(i-1, j-1) log(1-2?),
- I(i-1, j) log(1-?),
- J(i, j-1) log(1-?)
- I(i, j) max M(i-1, j) log ?,
- I(i-1, j) log ?
log(1 2?)
M P(xi, yj)
log Prob(xi, yj)
log(1 ?)
log(1 ?)
log ?
log ?
I P(xi)
J P(yj)
log ?
log ?
3One way to view the state paths State M
y1
yn
x1
xm
4State I
y1
yn
x1
xm
5State J
y1
yn
x1
xm
6Putting it all together
States I(i, j) are connected with states J and M
(i-1, j) States J(i, j) are connected with
states I and M (i-1, j) States M(i, j) are
connected with states J and I (i-1, j-1)
y1
yn
x1
xm
7Putting it all together
States I(i, j) are connected with states J and M
(i-1, j) States J(i, j) are connected with
states I and M (i-1, j) States M(i, j) are
connected with states J and I (i-1, j-1) Optimal
solution is the best scoring path from top-left
to bottom-right corner This gives the likeliest
alignment according to our HMM
y1
yn
x1
xm
8Yet another way to represent this model
Ix
Ix
BEGIN
END
Iy
Iy
Mx1
Mxm
Sequence X
We are aligning, or threading, sequence Y through
sequence X Every time yj lands in state xi, we
get substitution score s(xi, yj) Every time yj
is gapped, or some xi is skipped, we pay gap
penalty
9From this model, we can compute additional
statistics
- P(xi yj x, y) The probability that positions
i, j align, given that sequences x and y align - P(xi yj x, y) ?a alignmentP(a x, y) 1(xi
yj in a) - We will not cover the details, but
- this quantity can also be
- calculated with DP
log(1 2?)
M P(xi, yj)
log Prob(xi, yj)
log(1 ?)
log(1 ?)
log ?
log ?
I P(xi)
J P(yj)
log ?
log ?
10Fast database search BLAST
- (Basic Local Alignment Search Tool)
- Main idea
- Construct a dictionary of all the words in the
query - Initiate a local alignment for each word match
between query and DB - Running Time O(MN)
- However, orders of magnitude faster than
Smith-Waterman
query
DB
11BLAST ? Original Version
- Dictionary
- All words of length k (11 nucl. 4 aa)
- Alignment initiated between words of alignment
score ? T (typically T k) - Alignment
- Ungapped extensions until score below
statistical threshold - Output
- All local alignments with score gt statistical
threshold
query
scan
DB
query
12PSI-BLAST
- Given a sequence query x, and database D
- Find all pairwise alignments of x to sequences in
D - Collect all matches of x to y with some minimum
significance - Construct position specific matrix M
- Each sequence y is given a weight so that many
similar sequences cannot have much influence on a
position (Henikoff Henikoff 1994) - Using the matrix M, search D for more matches
- Iterate 14 until convergence
Profile M
13BLAST Variants
- BLASTN genomic sequences
- BLASTP proteins
- BLASTX translated genome versus proteins
- TBLASTN proteins versus translated genomes
- TBLASTX translated genome versus translated
genome - PSIBLAST iterated BLAST search
- http//www.ncbi.nlm.nih.gov/BLAST
14Multiple Sequence Alignments
15Protein Phylogenies
- Proteins evolve by both duplication and species
divergence
16(No Transcript)
17(No Transcript)
18Definition
- Given N sequences x1, x2,, xN
- Insert gaps (-) in each sequence xi, such that
- All sequences have the same length L
- Score of the global map is maximum
- A faint similarity between two sequences becomes
significant if present in many - Multiple alignments can help improve the pairwise
alignments
19Scoring Function Sum Of Pairs
- Definition Induced pairwise alignment
- A pairwise alignment induced by the multiple
alignment - Example
-
- x AC-GCGG-C
- y AC-GC-GAG
- z GCCGC-GAG
- Induces
- x ACGCGG-C x AC-GCGG-C y AC-GCGAG
- y ACGC-GAC z GCCGC-GAG z GCCGCGAG
20Sum Of Pairs (contd)
- Heuristic way to incorporate evolution tree
Human
Mouse
Duck
Chicken
- Weighted SOP
- S(m) ?kltl wkl s(mk, ml)
- wkl weight decreasing with distance
21A Profile Representation
- A G G C T A T C A C C T G T
A G C T A C C A - - - G C A G
C T A C C A - - - G C A G C
T A T C A C G G C A G C T A
T C G C G G A 1 1
.8 C .6 1 .4 1 .6
.2 G 1 .2 .2 .4
1 T .2 1 .6 .2 - .2
.8 .4 .8 .4
- Given a multiple alignment M m1mn
- Replace each column mi with profile entry pi
- Frequency of each letter in ?
- gaps
- Optional gap openings, extensions, closings
- Can think of this as a likelihood of each
letter in each position
22Multiple Sequence Alignments
23Multidimensional DP
- Generalization of Needleman-Wunsh
- S(m) ?i S(mi)
- (sum of column scores)
- F(i1,i2,,iN) Optimal alignment up to (i1, ,
iN) - F(i1,i2,,iN) max(all neighbors of
cube)(F(nbr)S(nbr))
24Multidimensional DP
- Example in 3D (three sequences)
- 7 neighbors/cell
F(i,j,k) max F(i-1,j-1,k-1)S(xi, xj,
xk), F(i-1,j-1,k )S(xi, xj, -
), F(i-1,j ,k-1)S(xi, -, xk), F(i-1,j
,k )S(xi, -, - ), F(i ,j-1,k-1)S( -,
xj, xk), F(i ,j-1,k )S( -, xj,
xk), F(i ,j ,k-1)S( -, -, xk)
25Multidimensional DP
- Running Time
- Size of matrix LN
- Where L length of each sequence
- N number of sequences
- Neighbors/cell 2N 1
- Therefore O(2N LN)
26Multidimensional DP
- How do gap states generalize?
- VERY badly!
- Require 2N states, one per combination of
gapped/ungapped sequences - Running time O(2N ? 2N ? LN) O(4N LN)
- Running Time
- Size of matrix LN
- Where L length of each sequence
- N number of sequences
- Neighbors/cell 2N 1
- Therefore O(2N LN)
Y
YZ
XY
XYZ
Z
X
XZ
27Progressive Alignment
x
pxy
y
z
pxyzw
pzw
w
- When evolutionary tree is known
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or
profiles px, py, to generate a new alignment with
associated profile presult - Weighted version
- Tree edges have weights, proportional to the
divergence in that edge - New profile is a weighted average of two old
profiles
28Progressive Alignment
x
y
Example Profile (A, C, G, T, -) px (0.8, 0.2,
0, 0, 0) py (0.6, 0, 0, 0, 0.4) s(px, py)
0.80.6s(A, A) 0.20.6s(C, A)
0.80.4s(A, -) 0.20.4s(C, -) Result pxy
(0.7, 0.1, 0, 0, 0.2) s(px, -) 0.81.0s(A, -)
0.21.0s(C, -) Result px- (0.4, 0.1, 0, 0,
0.5)
z
w
- When evolutionary tree is known
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or
profiles px, py, to generate a new alignment with
associated profile presult - Weighted version
- Tree edges have weights, proportional to the
divergence in that edge - New profile is a weighted average of two old
profiles
29Progressive Alignment
x
y
?
z
w
- When evolutionary tree is unknown
- Perform all pairwise alignments
- Define distance matrix D, where D(x, y) is a
measure of evolutionary distance, based on
pairwise alignment - Construct a tree
- Align on the tree
30Heuristics to improve alignments
- Iterative refinement schemes
- A-based search
- Consistency
- Simulated Annealing
31Iterative Refinement
- One problem of progressive alignment
- Initial alignments are frozen even when new
evidence comes - Example
-
- x GAAGTT
- y GAC-TT
- z GAACTG
- w GTACTG
Frozen!
Now clear correct y GA-CTT
32Iterative Refinement
- Algorithm (Barton-Stenberg)
- For j 1 to N,
- Remove xj, and realign to x1xj-1xj1xN
- Repeat 4 until convergence
z
x
y
33Iterative Refinement
- Example align (x,y), (z,w), (xy, zw)
- x GAAGTTA
- y GAC-TTA
- z GAACTGA
- w GTACTGA
- After realigning y
- x GAAGTTA
- y G-ACTTA 3 matches
- z GAACTGA
- w GTACTGA
Variant Refinement on a tree tree
partitioning
34Iterative Refinement
- Example align (x,y), (z,w), (xy, zw)
- x GAAGTTA
- y GAC-TTA
- z GAACTGA
- w GTACTGA
- After realigning y
- x GAAGTTA
- y G-ACTTA 3 matches
- z GAACTGA
- w GTACTGA
35Iterative Refinement
- Example not handled well
- x GAAGTTA
- y1 GAC-TTA
- y2 GAC-TTA
- y3 GAC-TTA
- z GAACTGA
- w GTACTGA
- Realigning any single yi changes nothing
36Some Resources
- http//www.ncbi.nlm.nih.gov/BLAST
- BLAST PSI-BLAST
- http//www.ebi.ac.uk/clustalw/
- CLUSTALW most widely used
- http//phylogenomics.berkeley.edu/cgi-bin/muscle/i
nput_muscle.py - MUSCLE most scalable
- http//probcons.stanford.edu/
- PROBCONS most accurate
37MUSCLE at a glance
- Fast measurement of all pairwise distances
between sequences - DDRAFT(x, y) defined in terms of common k-mers
(k3) O(N2 L logL) time - Build tree TDRAFT based on DDRAFT, with a
hierarchical clustering method (UPGMA) - Progressive alignment over TDRAFT, resulting in
multiple alignment MDRAFT - Measure distances D(x, y) based on MDRAFT
- Build tree T based on D
- Progressive alignment over T, to build M
- Iterative refinement for many rounds, do
- Tree Partitioning Split M on one branch and
realign the two resulting profiles - If new alignment M has better sum-of-pairs score
than previous one, accept
38PROBCONS Probabilistic Consistency-based
Multiple Alignment of Proteins
xi
yj
MATCH
INSERT X
INSERT Y
?
yj
39A pair-HMM model of pairwise alignment
x
ABRACA-DABRA AB-ACARDI---
y
- Parameterizes a probability distribution, P(A),
over all possible alignments of all possible
pairs of sequences - Transition probabilities gap penalties
- Emission probabilities substitution matrix
40Computing Pairwise Alignments
P(a)
aviterbi
P(a x, y)
- The Viterbi algorithm
- conditional distribution P(a x, y) reflects
models uncertainty over the correct alignment
of x and y - identifies highest probability alignment,
aviterbi, in O(L2) time - Caveat the most likely alignment is not the most
accurate - Alternative find the alignment of maximum
expected accuracy
41The Lazy-Teacher Analogy
- 10 students take a 10-question true-false quiz
- How do you make the answer key?
- Approach 1 Use the answer sheet of the best
student! - Approach 2 Weighted majority vote!
42Viterbi vs. Maximum Expected Accuracy (MEA)
- Viterbi
- picks single alignment with highest chance of
being completely correct - mathematically, finds the alignment a that
maximizes - Ea1a a
- Maximum Expected Accuracy
- picks alignment with highest expected number of
correct predictions - mathematically, finds the alignment a that
maximizes - Eaaccuracy(a, a)
4. F
4. T
4. F
4. F
4. F
A-
A
B
A-
A
4. F
4. F
4. F
4. F
4. T
C
B
B
B-
B-