Title: Multiple Sequence Alignment
1Multiple Sequence Alignment
- VIBE Education Edition (VIBE-Ed) Initiative
2- The time will come, I believe, though I shall
not live to see it, when we shall have fairly
true genealogical trees of each great kingdom of
Nature. - Charles Darwin
3Overview
- Why Multiple Sequence Alignment
- Scoring Functions
- Multiple Sequence Alignment Methods
- Dynamic Programming
- Progressive Alignments
- Motif Alignments
4Multiple alignment
- Pairwise alignment: infer biological relationships from string similarity
- Multiple alignment: infer string similarity from biological relationships
5Why do we care about multiple sequence alignment?
- Allows us to infer phylogenetic relationships and the evolution of organisms
- Can help us elucidate biological facts about proteins: the most conserved regions are usually biologically significant.
- Formulate and test hypotheses about protein 3-D structure (based on conserved regions)
- Formulate and test hypotheses about protein function (see which regions of a gene, or its derived protein, are susceptible to mutation and which can have one residue replaced by another without changing function)
6Multiple Sequence Alignment (MSA) Defined
- MSA is the simultaneous alignment of N sequences (protein or nucleotide), where $N > 2$.
- Let $S_i$ denote a sequence. Then the global multiple sequence alignment of $N > 2$ sequences
- $S = \{S_1, \ldots, S_N\}$
- is obtained by inserting gaps, denoted by "-".
- The new set of N sequences, denoted by
- $S' = \{S'_1, \ldots, S'_N\}$,
- will all have the same length L.
7Scoring Function
- In order to find an optimal alignment, we need to be able to measure how good an alignment is
- Scoring should take into account:
- 1. Some positions are more conserved than others (position-specific scoring)
- 2. Sequences are not independent, but related by a phylogenetic tree (the alignment should maximize the possibility of finding a common ancestor)
[Figure: phylogenetic tree relating sequences x, y, z, w, and v through an unknown common ancestor (?)]
8Scoring Functions
- Assume columns are statistically independent
- $S(m) = \sum_i S(m_i)$
- $m_i$ is column i of the multiple alignment m
9Scoring Function Definitions
- Define m =
- AC-GCGG-C
- AC-GC-GAG
- GCACC-GAG
- $m_{ij}$ = symbol in column i for sequence j
- $m_{42}$ = G
- $c_{ia}$ = observed count of residue a in column i
- $c_{1A} = 2$, $c_{1C} = 0$, $c_{1G} = 1$, $c_{1T} = 0$, $c_{1-} = 0$
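The definitions above can be checked with a few lines of Python. This is only a sketch: the helper names `symbol` and `column_counts` are made up for illustration, and the alignment is stored as one string per sequence with 1-based indices to match the slide's notation.

```python
from collections import Counter

# Alignment m from the slide: one string per sequence, "-" marks a gap.
ALIGNMENT = ["AC-GCGG-C", "AC-GC-GAG", "GCACC-GAG"]

def symbol(alignment, i, j):
    """Return m_ij: the symbol in column i of sequence j (both 1-based)."""
    return alignment[j - 1][i - 1]

def column_counts(alignment, i):
    """Return the counts c_ia for column i (1-based) as a Counter.

    Symbols that do not occur in the column have count 0.
    """
    return Counter(seq[i - 1] for seq in alignment)

print(symbol(ALIGNMENT, 4, 2))       # G
counts = column_counts(ALIGNMENT, 1)
print(counts["A"], counts["C"], counts["G"], counts["T"], counts["-"])  # 2 0 1 0 0
```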
10Scoring Function: Minimum Entropy
- Probability of column $m_i$:
- $P(m_i) = \prod_a p_{ia}^{c_{ia}}$
- Define the column score as
- $S(m_i) = -\log P(m_i)$
- $= -\log \prod_a p_{ia}^{c_{ia}}$
- $= -\sum_a \log p_{ia}^{c_{ia}}$
- $= -\sum_a c_{ia} \log p_{ia}$
- Measures the variability observed in an aligned column of residues
- Estimate for $p_{ia}$: $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$
- A good alignment minimizes the total entropy $\sum_i S(m_i)$
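11Scoring Function: Minimum Entropy Example
- For alignment m =
- AC-GCGG-C
- AG-GC-GAG
- GAACC-GAG
- $P(m_1) = \prod_a p_{1a}^{c_{1a}} = p_{1A}^{c_{1A}} \, p_{1C}^{c_{1C}} \, p_{1G}^{c_{1G}} \, p_{1T}^{c_{1T}} = p_{1A}^{2} \, p_{1C}^{0} \, p_{1G}^{1} \, p_{1T}^{0} = p_{1A}^{2} \, p_{1G}$
- $p_{1A} = c_{1A} / \sum_a c_{1a} = 2/3$, $p_{1G} = 1/3$
- $S(m_1) = -\sum_a c_{1a} \log p_{1a} = -2 \log(2/3) - \log(1/3) \approx 0.83$
- $S(m_2) = -\sum_a c_{2a} \log p_{2a} = -\log(1/3) - \log(1/3) - \log(1/3) \approx 1.43$
- $S(m_5) = -\sum_a c_{5a} \log p_{5a} = -3 \log(1) = 0$ (column 5, C/C/C, is fully conserved)
- Logarithms are base 10.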
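The worked numbers can be reproduced with a short sketch of the minimum-entropy column score. The function names are illustrative, base-10 logarithms are assumed because they match the slide's values, and a gap is counted like any other symbol (an assumption; the example only scores gap-free columns).

```python
import math
from collections import Counter

def column_score(column):
    """Minimum-entropy score S(m_i) = -sum_a c_ia * log10(p_ia).

    p_ia is estimated as c_ia / sum_a' c_ia'. The code uses the equivalent
    form c * log10(total / c), which avoids returning -0.0.
    Assumption: "-" is counted as an ordinary symbol.
    """
    counts = Counter(column)
    total = sum(counts.values())
    return sum(c * math.log10(total / c) for c in counts.values())

def alignment_score(alignment):
    """Total score: the sum of the column scores (columns assumed independent)."""
    return sum(column_score(col) for col in zip(*alignment))

alignment = ["AC-GCGG-C", "AG-GC-GAG", "GAACC-GAG"]
columns = list(zip(*alignment))
print(round(column_score(columns[0]), 2))  # 0.83  (column 1: A, A, G)
print(round(column_score(columns[1]), 2))  # 1.43  (column 2: C, G, A)
print(round(column_score(columns[4]), 2))  # 0.0   (column 5: C, C, C)
```

A fully conserved column scores 0, and the score grows as the column becomes more variable, which is why a good alignment minimizes the total.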
12Scoring Function: Sum of Pairs
- The sum-of-pairs (SP) score of a multiple alignment m is the sum of the scores of all induced pairwise alignments.
- The SP score for column $m_i$ is
- $S(m_i) = \sum_{k<l} s(m_{ik}, m_{il})$
- $s(a,b)$ is obtained from a substitution matrix
13Notation
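- $\sum_{k,l} (k,l) = \sum_k \sum_l (k,l)$
- $\sum_{k,l=1}^{4} (k,l) = \sum_k \left[ (k,1) + (k,2) + (k,3) + (k,4) \right]$
- $\sum_{k,l=1}^{4} (k,l) = (1,1) + (1,2) + (1,3) + (1,4)$
- $+ (2,1) + (2,2) + (2,3) + (2,4)$
- $+ (3,1) + (3,2) + (3,3) + (3,4)$
- $+ (4,1) + (4,2) + (4,3) + (4,4)$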
14Notation
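- $\sum_{k<l} (k,l) = \sum_k \sum_l (k,l)$ (for all $k < l$)
- Start from the full sum $\sum_{k,l=1}^{4} (k,l) = \sum_k \left[ (k,1) + (k,2) + (k,3) + (k,4) \right]$ and keep only the terms with $k < l$:
- $(1,1) + (1,2) + (1,3) + (1,4)$
- $+ (2,1) + (2,2) + (2,3) + (2,4)$
- $+ (3,1) + (3,2) + (3,3) + (3,4)$
- $+ (4,1) + (4,2) + (4,3) + (4,4)$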
15Notation
- $\sum_{k<l} (k,l) = \sum_k \sum_l (k,l)$ (for all $k < l$)
- $\sum_{1 \le k < l \le 4} (k,l) = (1,2) + (1,3) + (1,4)$
- $+ (2,3) + (2,4)$
- $+ (3,4)$
16Scoring Function: Sum of Pairs Example
- m =
- L-PE
- L-KE
- ASKE
- -SKE
- $S(m_1) = \sum_{k<l} s(m_{1k}, m_{1l})$
- $= s(m_{11},m_{12}) + s(m_{11},m_{13}) + s(m_{11},m_{14})$
- $+ s(m_{12},m_{13}) + s(m_{12},m_{14})$
- $+ s(m_{13},m_{14})$
- $= s(L,L) + s(L,A) + s(L,-)$
- $+ s(L,A) + s(L,-)$
- $+ s(A,-)$
- $= 5 + (-2) + (-8) + (-2) + (-8) + (-8) = -23$
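The column calculation can be sketched in Python using `itertools.combinations` to enumerate the pairs with k < l. The score table holds only the three values that appear on the slide (s(L,L) = 5, s(L,A) = -2, and -8 for a residue against a gap), so it can score the first column only. Scoring a gap against a gap as 0 is an assumption; the slide does not need that case.

```python
from itertools import combinations

GAP = "-"
GAP_SCORE = -8  # s(a, -) as used on the slide
# Only the substitution scores that appear in the slide's example.
SUBST = {("L", "L"): 5, ("A", "L"): -2}

def pair_score(a, b):
    """Score s(a, b) for one pair of symbols from the same column."""
    if a == GAP and b == GAP:
        return 0  # assumption: gap-gap pairs contribute nothing
    if a == GAP or b == GAP:
        return GAP_SCORE
    return SUBST[tuple(sorted((a, b)))]  # table is keyed on sorted pairs

def sp_column_score(column):
    """Sum-of-pairs score: sum of s over all pairs k < l in the column."""
    return sum(pair_score(a, b) for a, b in combinations(column, 2))

alignment = ["L-PE", "L-KE", "ASKE", "-SKE"]
first_column = [seq[0] for seq in alignment]
print(sp_column_score(first_column))  # -23
```

With a complete substitution matrix in `SUBST`, summing `sp_column_score` over all columns would give the SP score of the whole alignment.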
17Multiple Alignment Methods
- Now that we have a scoring scheme, let's consider methods that use it:
- Dynamic Programming (Optimal Solution)
- Heuristic (MSA)
- Progressive
- Progressive with Refinement
- Model (Profile) Alignment
18Dynamic Programming (Optimal Solution)
- Assume N sequences of length k
- Generalization of pairwise alignment ($N = 2$) to multiple dimensions ($N > 2$)
- The dynamic programming array then becomes an N-dimensional hyper-lattice with sides of length $k + 1$ (including initial gaps)
- The entry $F(i_1, \ldots, i_N)$ represents the score of the optimal alignment of the prefixes $s^1_{1..i_1}, \ldots, s^N_{1..i_N}$
19Dynamic Programming (2 sequences)
Complexity: $O(n^2)$
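To make the quadratic cost concrete, here is a minimal sketch of pairwise global alignment scoring (Needleman-Wunsch style). The match, mismatch, and linear gap values are illustrative assumptions, not values from these slides. The two nested loops fill one table cell per pair of prefixes, which is where the $O(n^2)$ comes from.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of x and y.

    F[i][j] holds the best score for aligning x[:i] with y[:j].
    Scoring parameters are illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):      # first column: x[:i] against gaps only
        F[i][0] = i * gap
    for j in range(1, m + 1):      # first row: y[:j] against gaps only
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # align x[i-1] with y[j-1]
                F[i - 1][j] + gap,     # x[i-1] against a gap
                F[i][j - 1] + gap,     # y[j-1] against a gap
            )
    return F[n][m]

print(global_alignment_score("GAAGTT", "GAAGTT"))  # 6
print(global_alignment_score("GAAGTT", "GACTT"))   # 1
```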
20Dynamic Programming (3 sequences)
Complexity: $O(n^3)$
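The three-sequence case can be sketched the same way, now over a cube of cells. This is a minimal illustration, not an optimized implementation: columns are scored with a sum-of-pairs function whose match, mismatch, and gap values are assumptions, and gap-gap pairs score 0. Each cell has 7 possible predecessors, one for every non-empty choice of which sequences advance.

```python
from itertools import product

def sp(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column of three symbols (assumed values)."""
    a, b, c = column
    total = 0
    for p, q in ((a, b), (a, c), (b, c)):
        if p == "-" and q == "-":
            continue  # assumption: gap-gap pairs contribute nothing
        total += gap if "-" in (p, q) else (match if p == q else mismatch)
    return total

def three_way_score(x, y, z):
    """Optimal SP score for globally aligning three sequences."""
    seqs = (x, y, z)
    # A move says which sequences contribute a residue to the next column.
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]  # 7 moves
    F = {(0, 0, 0): 0}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == (0, 0, 0):
            continue
        best = None
        for mv in moves:
            prev = tuple(c - d for c, d in zip(cell, mv))
            if min(prev) < 0:
                continue
            column = [s[c - 1] if d else "-" for s, c, d in zip(seqs, cell, mv)]
            cand = F[prev] + sp(column)
            if best is None or cand > best:
                best = cand
        F[cell] = best
    return F[tuple(len(s) for s in seqs)]

print(three_way_score("AC", "AC", "AC"))  # 6
```

The table has (n+1)^3 cells, so both time and memory grow cubically with sequence length.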
21Dynamic Programming
- Complexity: $O(n^k)$ for k sequences, each n residues long
- Assume sequences of length 300:
- 2 sequences: 300 × 300 comparisons ($9 \times 10^4$)
- 3 sequences: 300 × 300 × 300 comparisons ($2.7 \times 10^7$)
- 4 sequences: $8.1 \times 10^9$
- 5 sequences: $2.4 \times 10^{12}$
- 10 sequences: $5.9 \times 10^{24}$
- 20 sequences: $3.5 \times 10^{49}$
- 30 sequences: $2.1 \times 10^{74}$
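The figures in this list are simply $300^k$, which a few lines of Python can reproduce. The helper name `lattice_cells` is made up for illustration.

```python
def lattice_cells(n, k):
    """Number of comparisons for k sequences of length n: n ** k."""
    return n ** k

for k in (2, 3, 4, 5, 10, 20, 30):
    print(f"{k:2d} sequences: {lattice_cells(300, k):.1e}")
# Prints 9.0e+04, 2.7e+07, 8.1e+09, 2.4e+12, 5.9e+24, 3.5e+49, 2.1e+74
```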
22Optimal Solution Path
23MSA Algorithm (Carrillo-Lipman Bound)
24MSA Algorithm (Carrillo-Lipman, 1988)
- A heuristic for reducing the search space in dynamic programming
- Consider the pairwise alignments of each pair of sequences.
- Create a phylogenetic tree from these scores (best scores paired first)
- Produce a draft multiple sequence alignment built incrementally from the phylogenetic tree.
- The pairwise alignments and the draft MSA circumscribe a solution space within which a full dynamic programming search is performed (computationally intensive)
- Does not guarantee an optimal alignment of all the sequences in the group.
- Does find an optimal alignment within the chosen space.
25Progressive Methods
- First steps are similar to those of the dynamic programming heuristic (MSA):
- Consider the pairwise alignments of each pair of sequences.
- Create a phylogenetic tree from these scores (best scores paired first)
- Produce a draft multiple sequence alignment built incrementally from the phylogenetic tree
- But does NOT refine the draft MSA by doing a full search through the reduced search space.
- Does not guarantee an optimal alignment
26Progressive Methods: Problems
- Highly sensitive to the choice of initial aligned pairs: initial alignments are frozen, even when later steps provide new evidence.
- Example:
- x GAAGTT
- y GAC-TT (frozen!)
- z GAACTG
- w GTACTG
- Once z and w are added, it is clear that the correct alignment of y is GA-CTT
- Choice of scoring matrices and gap penalties is not straightforward
27Progressive Methods: Iterative Refinement
- Attempts to circumvent the problem of error propagation from frozen initial pairwise alignments
- Generate an initial alignment
- Remove one sequence, realign it to the alignment of the remaining sequences, and recalculate the score
- Iterate with different sequences until the alignment does not change (the score does not increase)
- Guaranteed to converge to a local maximum of the score.
28Profile Alignment
- Once an alignment has been produced, it is advantageous to use position-specific information from the group's multiple sequence alignment when aligning a new sequence to it.
- Essentially, performs a pairwise sequence alignment using the profile as a scoring matrix
- HMMs can be used for profiles in progressive or iterative refinement methods
- Many progressive alignment programs (e.g., ClustalW) use pairwise alignment of sequences to profiles, and of profiles to profiles
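As a rough sketch of using a profile as a scoring matrix, the code below builds position-specific log-odds scores from an existing alignment and then runs an ordinary global alignment in which the substitution score is looked up per column. The DNA alphabet, uniform background, pseudocount of 1, and linear gap penalty are all assumptions chosen to keep the example small; real profile methods use more careful estimates.

```python
import math
from collections import Counter

ALPHABET = "ACGT"  # assumption: nucleotide sequences

def build_profile(alignment, pseudocount=1.0):
    """Return one dict per column mapping residue -> log2-odds score.

    Gaps are ignored when counting; the background is uniform (assumptions).
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")
        total = sum(counts.values()) + pseudocount * len(ALPHABET)
        profile.append({
            a: math.log2(((counts[a] + pseudocount) / total) / background)
            for a in ALPHABET
        })
    return profile

def align_to_profile(seq, profile, gap=-2.0):
    """Global alignment score of seq against the profile columns."""
    n, m = len(profile), len(seq)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + profile[i - 1][seq[j - 1]],  # position-specific score
                F[i - 1][j] + gap,    # profile column against a gap
                F[i][j - 1] + gap,    # residue against a gap
            )
    return F[n][m]

profile = build_profile(["GAACTG", "GTACTG", "GAACTG"])
print(align_to_profile("GAACTG", profile) > align_to_profile("CTTGAC", profile))  # True
```

The same residue can score differently at different columns, which is the position-specific information a fixed substitution matrix cannot express.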
29ClustalW
- One of the most widely used multiple sequence alignment programs
- Perform pairwise alignments between all sequences, determine the degree of similarity between each pair, and construct a distance matrix
- Construct a phylogenetic (guide) tree from the distance matrix using the neighbor-joining algorithm.
- Combine the alignments, starting from the most closely related groups and proceeding to the most distantly related groups. The most closely related pairs of sequences are aligned using dynamic programming
- Includes additional heuristics
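The first two steps (pairwise scores, then a join order from the distance matrix) can be sketched as follows. This is a simplified illustration, not ClustalW's own procedure: the distance is an ad hoc "longer length minus alignment score" measure, the scoring values are assumptions, and plain average-linkage clustering stands in for the program's tree-building step.

```python
from itertools import combinations

def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment score, keeping only two rows of the table."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def guide_order(seqs):
    """Return the list of merges (cluster, cluster), closest groups first.

    seqs maps a name to its sequence. Distance between two sequences is
    max(len) - score (an illustrative choice); clusters are compared by
    the average distance between their members.
    """
    dist = {
        frozenset((a, b)): max(len(seqs[a]), len(seqs[b])) - nw_score(seqs[a], seqs[b])
        for a, b in combinations(seqs, 2)
    }

    def cluster_distance(c1, c2):
        pairs = [(a, b) for a in c1 for b in c2]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    clusters = [(name,) for name in seqs]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        merges.append((c1, c2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

seqs = {"x": "GAAGTT", "y": "GACTT", "z": "GAACTG", "w": "GTACTG"}
print(guide_order(seqs)[0])  # (('z',), ('w',))
```

Here z and w differ at a single position, so they are joined first. The later merges in this toy example are ties, and their order depends on iteration order.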
30ClustalW
[Figure: ClustalW workflow: perform all pairwise alignments, build the similarity matrix, run cluster analysis, and produce the dendrogram. From Higgins (1991) and Thompson (1994).]
31Summary
- The scoring scheme is critical (similarity matrix, gap scores)
- Dynamic programming methods: too computationally expensive for even a moderate number of sequences; heuristics can reduce the search space
- Progressive methods: much less computationally intensive, but sensitive to initial alignments
- Iterative refinement: a reasonable way to address this shortcoming of progressive methods, but only guarantees a local maximum of the score
- Profile methods: allow integration of position-specific information and profile-profile alignments
- Most computational methods use a large number of heuristics to approximate the optimal alignment