Multiple Sequence Alignments - PowerPoint PPT Presentation

1 / 40
About This Presentation
Title:

Multiple Sequence Alignments

Description:

... global map is ... Replace each column mi with profile entry pi. Frequency of each ... tree (we will describe more in detail later in the course) Align on the ... – PowerPoint PPT presentation

Number of Views:32
Avg rating:3.0/5.0
Slides: 41
Provided by: nirfri7
Category:

less

Transcript and Presenter's Notes

Title: Multiple Sequence Alignments


1
Multiple Sequence Alignments
2
The Global Alignment problem
AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA
AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC
3
Multiple Sequence Alignment
S1AGGTC
S2GTTCG
S3TGAAC
4
Motivations
  • Protein databases categorized by protein families
  • Collection of proteins with similar structure,
    function, or evolutionary history
  • Comparing a new protein with a family requires to
    construct a representation of the family and then
    compare the new protein with the family
    representation
  • How to score a multiple alignment ?
  • Consensus Distance
  • Evolutionary Tree Distance
  • Sum-of-Pairs Distance

5
Definition
  • Given N sequences x1, x2,, xN
  • Insert gaps (-) in each sequence xi, such that
  • All sequences have the same length L
  • Score of the global map is maximum
  • A faint similarity between two sequences becomes
    significant if present in many
  • Multiple alignments can help improve the pairwise
    alignments

6
Multiple Sequence Alignment
  • Definition Given stings S1, S2, ,Sk a multiple
    (global) alignment map them to strings S1, S2,
    ,Sk that may contain spaces, where
  • S1 S2 Sk
  • The removal of spaces from Si leaves Si

7
Scoring Function Sum Of Pairs
  • Definition Induced pairwise alignment
  • A pairwise alignment induced by the multiple
    alignment
  • Example
  • x AC-GCGG-C
  • y AC-GC-GAG
  • z GCCGC-GAG
  • Induces
  • x ACGCGG-C x AC-GCGG-C y AC-GCGAG
  • y ACGC-GAC z GCCGC-GAG z GCCGCGAG

8
Multiple Sequence Alignment
Given k strings of length n, there is a
generalization of the dynamic programming
algorithm that finds an optimal SP alignment.
  • NP completeness
  • Instead of a 2-dimensional table we now have a
    k-dimensional table to fill. O(nk) cells to fill
  • Each dimensions size is n1. Each entry depends
    on 2k - 1 adjacent entries.

9
Multidimensional Dynamic Programming
  • Generalization of Needleman-Wunsh
  • S(m) ?i S(mi)
  • (sum of column scores)
  • F(i1,i2,,iN) Optimal alignment up to (i1, ,
    iN)
  • F(i1,i2,,iN) max(all neighbors of
    cube)(F(nbr)S(nbr))

10
Multidimensional Dynamic Programming
  • Example in 3D (three sequences)
  • 7 neighbors/cells

F(i,j,k) max F(i-1,j-1,k-1)S(xi, xj,
xk), F(i-1,j-1,k )S(xi, xj, -
), F(i-1,j ,k-1)S(xi, -, xk), F(i-1,j
,k )S(xi, -, - ), F(i ,j-1,k-1)S( -,
xj, xk), F(i ,j-1,k )S( -, xj,
xk), F(i ,j ,k-1)S( -, -, xk)
11
Multiple alignments
  • We use a matrix to represent the alignment of k
    sequences, K(x1,...,xk). We assume no columns
    consists solely of blanks.

The common scoring functions give a score to each
column, and set score(K) ?i score(column(i))
For k10, a scoring function has 2k -1 gt 1000
entries to specify. The scoring function is
symmetric - the order of arguments need not
matter score(I,_,I,V) score(_,I,I,V).
12
SUM OF PAIRS
A common scoring function is SP sum of scores
of the projected pairwise alignments
SPscore(K)?iltj score(xi,xj).
Note that we need to specify the score(-,-)
because a column may have several blanks (as long
as not all entries are blanks).
In order for this score to be written as ?i
score(column(i)), we set score(-,-) 0. Why ?
Because these entries appear in the sum of
columns but not in the sum of projected pairwise
alignments (rows).
13
SUM OF PAIRS
14
Example
Consider the following alignment
a c - c d b -
- c - a d b d
a - b c d a d
Using the edit distance and
for , this alignment has a SP
value of
3 4 5 12
15
Multidimensional Dynamic Programming
  • How do affine gaps generalize?
  • VERY badly!
  • Require 2N states, one per combination of
    gapped/ungapped sequences
  • Running time O(2N ? 2N ? LN) O(4N LN)
  • Running Time
  • Size of matrix LN
  • Where L length of each sequence
  • N number of sequences
  • Neighbors/cell 2N 1
  • Therefore O(2N LN)

Y
YZ
XY
XYZ
Z
X
XZ
16
Multiple Sequence Alignment
  • Given k strings of length n, there is a natural
    generalization of the dynamic programming
    algorithm that finds an alignment that maximizes
  • SP-score(K) ?iltj
    score(xi,xj).

Instead of a 2-dimensional table, we now have a
k-dimensional table to fill.
17
The idea via K2
Recall the notation and the following
recurrence for V
Note that the new cell index (i1,j1) differs
from previous indices by one of 2k-1 non-zero
binary vectors (1,1), (1,0), (0,1).
18
The idea for arbitrary k
  • Order the vectors i(i1,..,ik) by increasing
    order of the sum ?ij. Set s(0,..,0)0, and for i
    gt (0,...,0)

Where
  • The vector b ranges over all non-zero binary
    vectors.
  • The vector i-b is the non-negative difference of
    i and b.
  • The jth entry of column(i,b) equals
    cj xj(ij) if bi1, and
    cj - otherwise.
  • (Reflecting that b is 1 at location j if that
    location changed in the current comparison).

19
Complexity of the DP approach
Number of cells nk. Number of adjacent cells
O(2k). Computation of SP score for each
column(i,b) is O(k2)
Total run time is O(k22knk) which is utterly
unacceptable ! Not much hope for a polynomial
algorithm because the problem has been shown to
be NP complete. Need heuristic to reduce time.
20
Time saving heuristics Relevance tests
Heuristic Avoid computing score(i) for
irrelevant vectors.
Let L be a lower bound on the optimal SP score of
a multiple alignment of the k sequences. A lower
bound L can be obtained from an arbitrary
multiple alignment, computed in any way.
Main idea Compute upper bounds H(u,v) for the
optimal score for every two sequences sxu and
txv, 1 ? u lt v ? k. When processing vector
i(..iu,..iv), the relevant cells are such that
in every projection on xu and xv, the optimal
pairwise score is above a value based on H(u,v)
and L.
21
Recall the Linear Space algorithm
  • Vi,j d(s1..i,t1..j)
  • Bi,j d(si1..n,tj1..m)
  • Fi,j Bi,j score of best alignment through
    (i,j)

These computations done in linear space.
Build such a table for every two sequences sxu
and txv, 1 ? u, v ? k. This entry encodes the
optimum through (iu,iv).
22
Time saving heuristicsRelevance test
Let S(u,v) the score of the alignment of xu and
xv in the multiple alignment. Then, we have L
S(u,v) H(u,v) And then S(u,v) L H(u,v)
Now for each pair u,v we want to consider
only the cells Iu and Iv for which the best
pairwise alignment score that can be obtained
through them (that is ) is greater
than the above value
L H(u,v) -
23
A Profile Representation of Multiple Alignment
- A G G C T A T C A C C T G T
A G C T A C C A - - - G C A G
C T A C C A - - - G C A G C
T A T C A C G G C A G C T A
T C G C G G A 1 1
.8 C .6 1 .4 1 .6
.2 G 1 .2 .2 .4
1 T .2 1 .6 .2 O .2
.8 .4 .4 E
.4 C .2 .8
.4 .2
  • Given a multiple alignment M m1mn
  • Replace each column mi with profile entry pi
  • Frequency of each letter in ?
  • gaps
  • Optional gap openings, extensions, closings

24
Multiple Alignments With Profile
  • Its corresponding profile P is
  • C1 C2 C3 C4 C5
  • a 75 25 50
  • b 75 75
  • c 25 25 50 25
  • - 25 25 25
  • Consider the MSA
  • a b c - a
  • a b a b a
  • a c c b -
  • c b - b c

Aligning a string S to a profile P will tell us
how well S fits P. Given the column positions C
of P, the alignment consists of inserting spaces
into S and C(1,2,3,4,5) as in pure string
alignment. For instance, an alignment of aabbc to
P is a a b - b c 1 - 2 3 4 5
25
String-to-Profile Alignment
  • Scoring a column j is equivalent to aligning Sj
    to each character at column j.
  • s(j) sumover all is(Sj, ij)pij
  • pij is frequency of i-th character in column j,
  • Score of an alignment sum of all column scores
    s(j).
  • Use Dynamic Programming as before (NW, SW, ) to
    do a string-to-profile alignment
  • Except that you should use this scoring function
    defined above.
  • Profile-to-Profile Alignments?

26
Progressive Alignment
x
y
z
w
  • When evolutionary tree is known
  • Align closest first, in the order of the tree
  • In each step, align two sequences x, y, or
    profiles px, py, to generate a new alignment with
    associated profile presult
  • Weighted version
  • Tree edges have weights, proportional to the
    divergence in that edge
  • New profile is a weighted average of two old
    profiles

27
Example Profile (A, C, G, T, -) px (0.8, 0.2,
0, 0, 0) py (0.6, 0, 0, 0, 0.4) s(px, py)
0.80.6s(A, A) 0.20.6s(C, A)
0.80.4s(A, -) 0.20.4s(C, -) Result pxy
(0.7, 0.1, 0, 0, 0.2) s(px, -) 0.81.0s(A, -)
0.21.0s(C, -) Result px- (0.4, 0.1, 0, 0,
0.5)
28
Progressive Alignment
x
y
?
z
w
  • When evolutionary tree is unknown
  • Perform all pairwise alignments
  • Define distance matrix D, where D(x, y) is a
    measure of evolutionary distance, based on
    pairwise alignment
  • Construct a tree (we will describe more in detail
    later in the course)
  • Align on the tree

29
CLUSTALW
  • (1). Perform pairwise alignments of all sequences
  • (2). Use alignment scores to produce a
    phylogenetic tree
  • (3). Align the sequences sequentially by the tree
    that is based on genetic distances.
  • -- The most closely related sequences are aligned
    first, then additional sequences or groups are
    added according to initial alignments
  • -- Genetic distance no. of mismatched positions
    divided by the total no. of matched positions
    (positions opposite a gap are not scored)
  • -- Sequence contributions to MSA are weighted
    according to their relationships on the tree
  • -- weighting scheme the more distant, the higher
    the weight
  • -- Context (neighbor amino acid) is taken into
    account for the gap penality
  • -- Gap score is adapted to force gaps to open at
    the same position.

30
Pairwise alignment
S1 sequence S3 sequence
S2 sequence S5 sequence S6 sequence
S1 S3 S2 S5 S6
First align S1 and S3, S5 and S6, then align
(S5,S6) and S2. Finally
31
Tree Alignments
  • Assume that there is a tree T(V,E) whose leaves
    are the sequences.
  • Associate a sequence in each internal node.
  • Tree-score(K) ?(i,j)?Escore(xi,xj).
  • Finding the optimal assignment of sequences to
    the internal nodes is NP Hard.

We will meet again this problem in the study
of Phylogenetic trees
32
Star Alignments
Rather then summing up all pairwise alignments,
select a fixed sequence x0 as a center, and set
Star-score(K) ?jgt0score(x0,xj). The
algorithm to find optimal alignment at each
step, add another sequence aligned with x0,
keeping old gaps and possibly adding new ones.
33
Multiple Sequence Alignment Approximation
Algorithm
  • Polynomial time algorithm
  • assumption the cost function d is a distance
    function

  • (triangle inequality)
  • Let D(S,T) be the value of the minimum global
    alignment between S and T.

34
Multiple Sequence Alignment Approximation
Algorithm (cont.)
  • 2. Call the remaining strings S2, ,Sk.
  • 3. Add a string to the multiple alignment that
    initially contains only S1 as follows
  • Suppose S1, ,Si-1 are already aligned as S1,
    ,Si-1. Add Si by running dynamic programming
    algorithm on S1 and Si to produce S1 and Si.
  • Adjust S2, ,Si-1 by adding spaces to those
    columns where spaces were added to get S1 from
    S1.
  • Replace S1 by S1.

35
Multiple Sequence Alignment Approximation
Algorithm (cont.)
  • Time analysis
  • Choosing S1 running dynamic programming
    algorithm
  • times O(k2n2)
  • When Si is added to the multiple alignment, the
    length of S1
  • is at most in, so the time to add all k
    strings is

36
Multiple Sequence Alignment Approximation
Algorithm (cont.)
  • Error analysis
  • M - The alignment produced by this algorithm.
  • d(i,j) - the distance M induces on the pair
    Si,Sj.
  • M - optimal alignment.

37
Multiple Sequence Alignment Approximation
Algorithm (cont.)
Triangle inequality
Error analysis
38
Iterative Refinement
  • Algorithm (Barton-Stenberg)
  • Align most similar xi, xj
  • Align xk most similar to (xixj)
  • Repeat 2 until (x1xN) are aligned
  • For j 1 to N,
  • Remove xj, and realign to x1xj-1xj1xN
  • Repeat 4 until convergence
  • Note Guaranteed to converge

39
(No Transcript)
40
Some Resources
  • Genome Resources
  • Annotation and alignment genome browser at UCSC
  • http//genome.ucsc.edu/cgi-bin/hgGateway
  • Specialized VISTA alignment browser at LBNL
  • http//pipeline.lbl.gov/cgi-bin/gateway2
  • ABCNice Stanford tool for browsing alignments
  • http//encode.stanford.edu/asimenos/ABC/
  • Protein Multiple Aligners
  • http//www.ebi.ac.uk/clustalw/
  • CLUSTALW most widely used
  • http//phylogenomics.berkeley.edu/cgi-bin/muscle/i
    nput_muscle.py
  • MUSCLE most scalable
Write a Comment
User Comments (0)
About PowerShow.com