Multiple Sequence Alignments - PowerPoint PPT Presentation

1 / 40

About This Presentation

Title:

Multiple Sequence Alignments

Description:

... global map is ... Replace each column mi with profile entry pi. Frequency of each ... tree (we will describe more in detail later in the course) Align on the ... – PowerPoint PPT presentation

Number of Views:32

Avg rating:3.0/5.0

Slides: 41

Provided by: nirfri7

Category:

more less

Transcript and Presenter's Notes

Title: Multiple Sequence Alignments

1
Multiple Sequence Alignments
2
The Global Alignment problem
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
3
Multiple Sequence Alignment
S1AGGTC
S2GTTCG
S3TGAAC
4
Motivations

Protein databases categorized by protein families
Collection of proteins with similar structure,
function, or evolutionary history
Comparing a new protein with a family requires to
construct a representation of the family and then
compare the new protein with the family
representation
How to score a multiple alignment ?
Consensus Distance
Evolutionary Tree Distance
Sum-of-Pairs Distance

5
Definition

Given N sequences x1, x2,, xN
Insert gaps (-) in each sequence xi, such that
All sequences have the same length L
Score of the global map is maximum
A faint similarity between two sequences becomes
significant if present in many
Multiple alignments can help improve the pairwise
alignments

6
Multiple Sequence Alignment

Definition Given stings S1, S2, ,Sk a multiple
(global) alignment map them to strings S1, S2,
,Sk that may contain spaces, where
S1 S2 Sk
The removal of spaces from Si leaves Si

7
Scoring Function Sum Of Pairs

Definition Induced pairwise alignment
A pairwise alignment induced by the multiple
alignment
Example
x AC-GCGG-C
y AC-GC-GAG
z GCCGC-GAG
Induces
x ACGCGG-C x AC-GCGG-C y AC-GCGAG
y ACGC-GAC z GCCGC-GAG z GCCGCGAG

8
Multiple Sequence Alignment
Given k strings of length n, there is a
generalization of the dynamic programming
algorithm that finds an optimal SP alignment.

NP completeness
Instead of a 2-dimensional table we now have a
k-dimensional table to fill. O(nk) cells to fill
Each dimensions size is n1. Each entry depends
on 2k - 1 adjacent entries.

9
Multidimensional Dynamic Programming

Generalization of Needleman-Wunsh
S(m) ?i S(mi)
(sum of column scores)
F(i1,i2,,iN) Optimal alignment up to (i1, ,
iN)
F(i1,i2,,iN) max(all neighbors of
cube)(F(nbr)S(nbr))

10
Multidimensional Dynamic Programming

Example in 3D (three sequences)
7 neighbors/cells

F(i,j,k) max F(i-1,j-1,k-1)S(xi, xj,
xk), F(i-1,j-1,k )S(xi, xj, -
), F(i-1,j ,k-1)S(xi, -, xk), F(i-1,j
,k )S(xi, -, - ), F(i ,j-1,k-1)S( -,
xj, xk), F(i ,j-1,k )S( -, xj,
xk), F(i ,j ,k-1)S( -, -, xk)
11
Multiple alignments

We use a matrix to represent the alignment of k
sequences, K(x1,...,xk). We assume no columns
consists solely of blanks.

The common scoring functions give a score to each
column, and set score(K) ?i score(column(i))
For k10, a scoring function has 2k -1 gt 1000
entries to specify. The scoring function is
symmetric - the order of arguments need not
matter score(I,_,I,V) score(_,I,I,V).
12
SUM OF PAIRS
A common scoring function is SP sum of scores
of the projected pairwise alignments
SPscore(K)?iltj score(xi,xj).
Note that we need to specify the score(-,-)
because a column may have several blanks (as long
as not all entries are blanks).
In order for this score to be written as ?i
score(column(i)), we set score(-,-) 0. Why ?
Because these entries appear in the sum of
columns but not in the sum of projected pairwise
alignments (rows).
13
SUM OF PAIRS
14
Example
Consider the following alignment
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance and
for , this alignment has a SP
value of
3 4 5 12
15
Multidimensional Dynamic Programming

How do affine gaps generalize?
VERY badly!
Require 2N states, one per combination of
gapped/ungapped sequences
Running time O(2N ? 2N ? LN) O(4N LN)

Running Time
Size of matrix LN
Where L length of each sequence
N number of sequences
Neighbors/cell 2N 1
Therefore O(2N LN)

Y
YZ
XY
XYZ
Z
X
XZ
16
Multiple Sequence Alignment

Given k strings of length n, there is a natural
generalization of the dynamic programming
algorithm that finds an alignment that maximizes
SP-score(K) ?iltj
score(xi,xj).

Instead of a 2-dimensional table, we now have a
k-dimensional table to fill.
17
The idea via K2
Recall the notation and the following
recurrence for V
Note that the new cell index (i1,j1) differs
from previous indices by one of 2k-1 non-zero
binary vectors (1,1), (1,0), (0,1).
18
The idea for arbitrary k

Order the vectors i(i1,..,ik) by increasing
order of the sum ?ij. Set s(0,..,0)0, and for i
gt (0,...,0)

Where

The vector b ranges over all non-zero binary
vectors.
The vector i-b is the non-negative difference of
i and b.
The jth entry of column(i,b) equals
cj xj(ij) if bi1, and
cj - otherwise.
(Reflecting that b is 1 at location j if that
location changed in the current comparison).

19
Complexity of the DP approach
Number of cells nk. Number of adjacent cells
O(2k). Computation of SP score for each
column(i,b) is O(k2)
Total run time is O(k22knk) which is utterly
unacceptable ! Not much hope for a polynomial
algorithm because the problem has been shown to
be NP complete. Need heuristic to reduce time.
20
Time saving heuristics Relevance tests
Heuristic Avoid computing score(i) for
irrelevant vectors.
Let L be a lower bound on the optimal SP score of
a multiple alignment of the k sequences. A lower
bound L can be obtained from an arbitrary
multiple alignment, computed in any way.
Main idea Compute upper bounds H(u,v) for the
optimal score for every two sequences sxu and
txv, 1 ? u lt v ? k. When processing vector
i(..iu,..iv), the relevant cells are such that
in every projection on xu and xv, the optimal
pairwise score is above a value based on H(u,v)
and L.
21
Recall the Linear Space algorithm

Vi,j d(s1..i,t1..j)
Bi,j d(si1..n,tj1..m)
Fi,j Bi,j score of best alignment through
(i,j)

These computations done in linear space.
Build such a table for every two sequences sxu
and txv, 1 ? u, v ? k. This entry encodes the
optimum through (iu,iv).
22
Time saving heuristicsRelevance test
Let S(u,v) the score of the alignment of xu and
xv in the multiple alignment. Then, we have L
S(u,v) H(u,v) And then S(u,v) L H(u,v)
Now for each pair u,v we want to consider
only the cells Iu and Iv for which the best
pairwise alignment score that can be obtained
through them (that is ) is greater
than the above value
L H(u,v) -
23
A Profile Representation of Multiple Alignment
- A G G C T A T C A C C T G T
A G C T A C C A - - - G C A G
C T A C C A - - - G C A G C
T A T C A C G G C A G C T A
T C G C G G A 1 1
.8 C .6 1 .4 1 .6
.2 G 1 .2 .2 .4
1 T .2 1 .6 .2 O .2
.8 .4 .4 E
.4 C .2 .8
.4 .2

Given a multiple alignment M m1mn
Replace each column mi with profile entry pi
Frequency of each letter in ?
gaps
Optional gap openings, extensions, closings

24
Multiple Alignments With Profile

Its corresponding profile P is
C1 C2 C3 C4 C5
a 75 25 50
b 75 75
c 25 25 50 25
- 25 25 25

Consider the MSA
a b c - a
a b a b a
a c c b -
c b - b c

Aligning a string S to a profile P will tell us
how well S fits P. Given the column positions C
of P, the alignment consists of inserting spaces
into S and C(1,2,3,4,5) as in pure string
alignment. For instance, an alignment of aabbc to
P is a a b - b c 1 - 2 3 4 5
25
String-to-Profile Alignment

Scoring a column j is equivalent to aligning Sj
to each character at column j.
s(j) sumover all is(Sj, ij)pij
pij is frequency of i-th character in column j,
Score of an alignment sum of all column scores
s(j).
Use Dynamic Programming as before (NW, SW, ) to
do a string-to-profile alignment
Except that you should use this scoring function
defined above.
Profile-to-Profile Alignments?

26
Progressive Alignment
x
y
z
w

When evolutionary tree is known
Align closest first, in the order of the tree
In each step, align two sequences x, y, or
profiles px, py, to generate a new alignment with
associated profile presult
Weighted version
Tree edges have weights, proportional to the
divergence in that edge
New profile is a weighted average of two old
profiles

27
Example Profile (A, C, G, T, -) px (0.8, 0.2,
0, 0, 0) py (0.6, 0, 0, 0, 0.4) s(px, py)
0.80.6s(A, A) 0.20.6s(C, A)
0.80.4s(A, -) 0.20.4s(C, -) Result pxy
(0.7, 0.1, 0, 0, 0.2) s(px, -) 0.81.0s(A, -)
0.21.0s(C, -) Result px- (0.4, 0.1, 0, 0,
0.5)
28
Progressive Alignment
x
y
?
z
w

When evolutionary tree is unknown
Perform all pairwise alignments
Define distance matrix D, where D(x, y) is a
measure of evolutionary distance, based on
pairwise alignment
Construct a tree (we will describe more in detail
later in the course)
Align on the tree

29
CLUSTALW

(1). Perform pairwise alignments of all sequences
(2). Use alignment scores to produce a
phylogenetic tree
(3). Align the sequences sequentially by the tree
that is based on genetic distances.
-- The most closely related sequences are aligned
first, then additional sequences or groups are
added according to initial alignments
-- Genetic distance no. of mismatched positions
divided by the total no. of matched positions
(positions opposite a gap are not scored)
-- Sequence contributions to MSA are weighted
according to their relationships on the tree
-- weighting scheme the more distant, the higher
the weight
-- Context (neighbor amino acid) is taken into
account for the gap penality
-- Gap score is adapted to force gaps to open at
the same position.

30
Pairwise alignment
S1 sequence S3 sequence
S2 sequence S5 sequence S6 sequence
S1 S3 S2 S5 S6
First align S1 and S3, S5 and S6, then align
(S5,S6) and S2. Finally
31
Tree Alignments

Assume that there is a tree T(V,E) whose leaves
are the sequences.
Associate a sequence in each internal node.
Tree-score(K) ?(i,j)?Escore(xi,xj).
Finding the optimal assignment of sequences to
the internal nodes is NP Hard.

We will meet again this problem in the study
of Phylogenetic trees
32
Star Alignments
Rather then summing up all pairwise alignments,
select a fixed sequence x0 as a center, and set
Star-score(K) ?jgt0score(x0,xj). The
algorithm to find optimal alignment at each
step, add another sequence aligned with x0,
keeping old gaps and possibly adding new ones.
33
Multiple Sequence Alignment Approximation
Algorithm

Polynomial time algorithm
assumption the cost function d is a distance
function
(triangle inequality)
Let D(S,T) be the value of the minimum global
alignment between S and T.

34
Multiple Sequence Alignment Approximation
Algorithm (cont.)

2. Call the remaining strings S2, ,Sk.
3. Add a string to the multiple alignment that
initially contains only S1 as follows
Suppose S1, ,Si-1 are already aligned as S1,
,Si-1. Add Si by running dynamic programming
algorithm on S1 and Si to produce S1 and Si.
Adjust S2, ,Si-1 by adding spaces to those
columns where spaces were added to get S1 from
S1.
Replace S1 by S1.

35
Multiple Sequence Alignment Approximation
Algorithm (cont.)