Title: Multiple Sequence Alignments
1Multiple Sequence Alignments
2The Global Alignment problem
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
3Multiple Sequence Alignment
S1AGGTC
S2GTTCG
S3TGAAC
4Motivations
- Protein databases categorized by protein families
- Collection of proteins with similar structure,
function, or evolutionary history - Comparing a new protein with a family requires to
construct a representation of the family and then
compare the new protein with the family
representation - How to score a multiple alignment ?
- Consensus Distance
- Evolutionary Tree Distance
- Sum-of-Pairs Distance
5Definition
- Given N sequences x1, x2,, xN
- Insert gaps (-) in each sequence xi, such that
- All sequences have the same length L
- Score of the global map is maximum
- A faint similarity between two sequences becomes
significant if present in many - Multiple alignments can help improve the pairwise
alignments
6Multiple Sequence Alignment
- Definition Given stings S1, S2, ,Sk a multiple
(global) alignment map them to strings S1, S2,
,Sk that may contain spaces, where - S1 S2 Sk
- The removal of spaces from Si leaves Si
7Scoring Function Sum Of Pairs
- Definition Induced pairwise alignment
- A pairwise alignment induced by the multiple
alignment - Example
-
- x AC-GCGG-C
- y AC-GC-GAG
- z GCCGC-GAG
- Induces
- x ACGCGG-C x AC-GCGG-C y AC-GCGAG
- y ACGC-GAC z GCCGC-GAG z GCCGCGAG
8Multiple Sequence Alignment
Given k strings of length n, there is a
generalization of the dynamic programming
algorithm that finds an optimal SP alignment.
- NP completeness
- Instead of a 2-dimensional table we now have a
k-dimensional table to fill. O(nk) cells to fill - Each dimensions size is n1. Each entry depends
on 2k - 1 adjacent entries.
9Multidimensional Dynamic Programming
- Generalization of Needleman-Wunsh
- S(m) ?i S(mi)
- (sum of column scores)
- F(i1,i2,,iN) Optimal alignment up to (i1, ,
iN) - F(i1,i2,,iN) max(all neighbors of
cube)(F(nbr)S(nbr))
10 Multidimensional Dynamic Programming
- Example in 3D (three sequences)
- 7 neighbors/cells
F(i,j,k) max F(i-1,j-1,k-1)S(xi, xj,
xk), F(i-1,j-1,k )S(xi, xj, -
), F(i-1,j ,k-1)S(xi, -, xk), F(i-1,j
,k )S(xi, -, - ), F(i ,j-1,k-1)S( -,
xj, xk), F(i ,j-1,k )S( -, xj,
xk), F(i ,j ,k-1)S( -, -, xk)
11Multiple alignments
- We use a matrix to represent the alignment of k
sequences, K(x1,...,xk). We assume no columns
consists solely of blanks.
The common scoring functions give a score to each
column, and set score(K) ?i score(column(i))
For k10, a scoring function has 2k -1 gt 1000
entries to specify. The scoring function is
symmetric - the order of arguments need not
matter score(I,_,I,V) score(_,I,I,V).
12SUM OF PAIRS
A common scoring function is SP sum of scores
of the projected pairwise alignments
SPscore(K)?iltj score(xi,xj).
Note that we need to specify the score(-,-)
because a column may have several blanks (as long
as not all entries are blanks).
In order for this score to be written as ?i
score(column(i)), we set score(-,-) 0. Why ?
Because these entries appear in the sum of
columns but not in the sum of projected pairwise
alignments (rows).
13SUM OF PAIRS
14Example
Consider the following alignment
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance and
for , this alignment has a SP
value of
3 4 5 12
15Multidimensional Dynamic Programming
- How do affine gaps generalize?
- VERY badly!
- Require 2N states, one per combination of
gapped/ungapped sequences - Running time O(2N ? 2N ? LN) O(4N LN)
- Running Time
- Size of matrix LN
- Where L length of each sequence
- N number of sequences
- Neighbors/cell 2N 1
- Therefore O(2N LN)
Y
YZ
XY
XYZ
Z
X
XZ
16Multiple Sequence Alignment
- Given k strings of length n, there is a natural
generalization of the dynamic programming
algorithm that finds an alignment that maximizes - SP-score(K) ?iltj
score(xi,xj).
Instead of a 2-dimensional table, we now have a
k-dimensional table to fill.
17The idea via K2
Recall the notation and the following
recurrence for V
Note that the new cell index (i1,j1) differs
from previous indices by one of 2k-1 non-zero
binary vectors (1,1), (1,0), (0,1).
18The idea for arbitrary k
- Order the vectors i(i1,..,ik) by increasing
order of the sum ?ij. Set s(0,..,0)0, and for i
gt (0,...,0)
Where
- The vector b ranges over all non-zero binary
vectors. - The vector i-b is the non-negative difference of
i and b. - The jth entry of column(i,b) equals
cj xj(ij) if bi1, and
cj - otherwise. - (Reflecting that b is 1 at location j if that
location changed in the current comparison).
19Complexity of the DP approach
Number of cells nk. Number of adjacent cells
O(2k). Computation of SP score for each
column(i,b) is O(k2)
Total run time is O(k22knk) which is utterly
unacceptable ! Not much hope for a polynomial
algorithm because the problem has been shown to
be NP complete. Need heuristic to reduce time.
20Time saving heuristics Relevance tests
Heuristic Avoid computing score(i) for
irrelevant vectors.
Let L be a lower bound on the optimal SP score of
a multiple alignment of the k sequences. A lower
bound L can be obtained from an arbitrary
multiple alignment, computed in any way.
Main idea Compute upper bounds H(u,v) for the
optimal score for every two sequences sxu and
txv, 1 ? u lt v ? k. When processing vector
i(..iu,..iv), the relevant cells are such that
in every projection on xu and xv, the optimal
pairwise score is above a value based on H(u,v)
and L.
21Recall the Linear Space algorithm
- Vi,j d(s1..i,t1..j)
- Bi,j d(si1..n,tj1..m)
- Fi,j Bi,j score of best alignment through
(i,j)
These computations done in linear space.
Build such a table for every two sequences sxu
and txv, 1 ? u, v ? k. This entry encodes the
optimum through (iu,iv).
22Time saving heuristicsRelevance test
Let S(u,v) the score of the alignment of xu and
xv in the multiple alignment. Then, we have L
S(u,v) H(u,v) And then S(u,v) L H(u,v)
Now for each pair u,v we want to consider
only the cells Iu and Iv for which the best
pairwise alignment score that can be obtained
through them (that is ) is greater
than the above value
L H(u,v) -
23A Profile Representation of Multiple Alignment
- A G G C T A T C A C C T G T
A G C T A C C A - - - G C A G
C T A C C A - - - G C A G C
T A T C A C G G C A G C T A
T C G C G G A 1 1
.8 C .6 1 .4 1 .6
.2 G 1 .2 .2 .4
1 T .2 1 .6 .2 O .2
.8 .4 .4 E
.4 C .2 .8
.4 .2
- Given a multiple alignment M m1mn
- Replace each column mi with profile entry pi
- Frequency of each letter in ?
- gaps
- Optional gap openings, extensions, closings
24Multiple Alignments With Profile
- Its corresponding profile P is
- C1 C2 C3 C4 C5
- a 75 25 50
- b 75 75
- c 25 25 50 25
- - 25 25 25
- Consider the MSA
- a b c - a
- a b a b a
- a c c b -
- c b - b c
Aligning a string S to a profile P will tell us
how well S fits P. Given the column positions C
of P, the alignment consists of inserting spaces
into S and C(1,2,3,4,5) as in pure string
alignment. For instance, an alignment of aabbc to
P is a a b - b c 1 - 2 3 4 5
25String-to-Profile Alignment
- Scoring a column j is equivalent to aligning Sj
to each character at column j. - s(j) sumover all is(Sj, ij)pij
- pij is frequency of i-th character in column j,
- Score of an alignment sum of all column scores
s(j). - Use Dynamic Programming as before (NW, SW, ) to
do a string-to-profile alignment - Except that you should use this scoring function
defined above. - Profile-to-Profile Alignments?
26Progressive Alignment
x
y
z
w
- When evolutionary tree is known
- Align closest first, in the order of the tree
- In each step, align two sequences x, y, or
profiles px, py, to generate a new alignment with
associated profile presult - Weighted version
- Tree edges have weights, proportional to the
divergence in that edge - New profile is a weighted average of two old
profiles
27Example Profile (A, C, G, T, -) px (0.8, 0.2,
0, 0, 0) py (0.6, 0, 0, 0, 0.4) s(px, py)
0.80.6s(A, A) 0.20.6s(C, A)
0.80.4s(A, -) 0.20.4s(C, -) Result pxy
(0.7, 0.1, 0, 0, 0.2) s(px, -) 0.81.0s(A, -)
0.21.0s(C, -) Result px- (0.4, 0.1, 0, 0,
0.5)
28Progressive Alignment
x
y
?
z
w
- When evolutionary tree is unknown
- Perform all pairwise alignments
- Define distance matrix D, where D(x, y) is a
measure of evolutionary distance, based on
pairwise alignment - Construct a tree (we will describe more in detail
later in the course) - Align on the tree
29CLUSTALW
- (1). Perform pairwise alignments of all sequences
- (2). Use alignment scores to produce a
phylogenetic tree - (3). Align the sequences sequentially by the tree
that is based on genetic distances. - -- The most closely related sequences are aligned
first, then additional sequences or groups are
added according to initial alignments - -- Genetic distance no. of mismatched positions
divided by the total no. of matched positions
(positions opposite a gap are not scored) - -- Sequence contributions to MSA are weighted
according to their relationships on the tree - -- weighting scheme the more distant, the higher
the weight - -- Context (neighbor amino acid) is taken into
account for the gap penality - -- Gap score is adapted to force gaps to open at
the same position.
30Pairwise alignment
S1 sequence S3 sequence
S2 sequence S5 sequence S6 sequence
S1 S3 S2 S5 S6
First align S1 and S3, S5 and S6, then align
(S5,S6) and S2. Finally
31Tree Alignments
- Assume that there is a tree T(V,E) whose leaves
are the sequences. - Associate a sequence in each internal node.
- Tree-score(K) ?(i,j)?Escore(xi,xj).
- Finding the optimal assignment of sequences to
the internal nodes is NP Hard.
We will meet again this problem in the study
of Phylogenetic trees
32Star Alignments
Rather then summing up all pairwise alignments,
select a fixed sequence x0 as a center, and set
Star-score(K) ?jgt0score(x0,xj). The
algorithm to find optimal alignment at each
step, add another sequence aligned with x0,
keeping old gaps and possibly adding new ones.
33Multiple Sequence Alignment Approximation
Algorithm
- Polynomial time algorithm
- assumption the cost function d is a distance
function -
-
-
(triangle inequality) - Let D(S,T) be the value of the minimum global
alignment between S and T.
34Multiple Sequence Alignment Approximation
Algorithm (cont.)
- 2. Call the remaining strings S2, ,Sk.
- 3. Add a string to the multiple alignment that
initially contains only S1 as follows - Suppose S1, ,Si-1 are already aligned as S1,
,Si-1. Add Si by running dynamic programming
algorithm on S1 and Si to produce S1 and Si. - Adjust S2, ,Si-1 by adding spaces to those
columns where spaces were added to get S1 from
S1. - Replace S1 by S1.
35Multiple Sequence Alignment Approximation
Algorithm (cont.)
- Time analysis
- Choosing S1 running dynamic programming
algorithm - times O(k2n2)
- When Si is added to the multiple alignment, the
length of S1 - is at most in, so the time to add all k
strings is
36Multiple Sequence Alignment Approximation
Algorithm (cont.)
- Error analysis
- M - The alignment produced by this algorithm.
- d(i,j) - the distance M induces on the pair
Si,Sj.
37Multiple Sequence Alignment Approximation
Algorithm (cont.)
Triangle inequality
Error analysis
38Iterative Refinement
- Algorithm (Barton-Stenberg)
- Align most similar xi, xj
- Align xk most similar to (xixj)
- Repeat 2 until (x1xN) are aligned
- For j 1 to N,
- Remove xj, and realign to x1xj-1xj1xN
- Repeat 4 until convergence
- Note Guaranteed to converge
39(No Transcript)
40Some Resources
- Genome Resources
- Annotation and alignment genome browser at UCSC
- http//genome.ucsc.edu/cgi-bin/hgGateway
- Specialized VISTA alignment browser at LBNL
- http//pipeline.lbl.gov/cgi-bin/gateway2
- ABCNice Stanford tool for browsing alignments
- http//encode.stanford.edu/asimenos/ABC/
- Protein Multiple Aligners
- http//www.ebi.ac.uk/clustalw/
- CLUSTALW most widely used
- http//phylogenomics.berkeley.edu/cgi-bin/muscle/i
nput_muscle.py - MUSCLE most scalable