Title: Corné Hoogendoorn
1Multiple sequence alignment methods
- Corné Hoogendoorn
- Denis Miretskiy
2Overview
- What a multiple alignment means
- Scoring a multiple alignment
- Break
- Multidimensional dynamic programming
- Progressive alignment methods
3What a multiple alignment means
- Homologous residues are aligned in columns
- Structurally homologous
- Evolutionarily homologous
- Similar 3D structural positions
- Diverging from a common ancestral residue
4Multiple alignment - issues
- Identifying unambiguously homologous positions is not possible
- A need to identify which alignment is best
- Protein structures and sequences evolve
- Sequences not entirely superposable
5Multiple alignment - issues
- In principle there is always an unambiguously correct evolutionary alignment
- Common ancestral sequence
- In practice it is nearly impossible to infer the evolutionary history
- Usually easier to construct a structural alignment
6Multiple alignment - issues
- Sequence diverges even faster than structure
- Structurally unalignable protein parts cannot be aligned by sequence either
- Some parts are very well alignable
- Use these parts to align whatever can be aligned
- Disregard the rest to assess alignment quality
- Supposedly meaningless biases are omitted
7Scoring an alignment
- Some positions are more conserved than others
- Position-specific scoring
- Sequences are not independent
- Related to each other by a phylogenetic tree
- Specify a complete probabilistic model of
molecular sequence evolution
8Complete probabilistic model
- Probabilities of all evolutionary events
- Prior probability of root ancestral sequence
- Probabilities of evolutionary change depend on evolutionary time
- Position-specific structural and functional constraints
- We just don't have all the necessary data
9Workable approximations
- Assume that all columns are statistically independent
$$S(m) = G + \sum_i S(m_i)$$
- $S(m)$: score for multiple alignment $m$
- $G$: gap score/penalty
- $S(m_i)$: score for column $i$ in the multiple alignment $m$
10Scoring an alignment
11Minimum entropy: further simplification
- We already assumed independence between columns
- Complex statistical dependence between sequences (within columns) if their phylogenetic tree has many intermediate ancestors
- We assume independence between and within columns
12Minimum entropy
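- Probability of column $m_i$, where $c_{ia}$ is the observed count of residue $a$ in column $i$ and $p_{ia}$ is the probability of residue $a$ in column $i$
$$P(m_i) = \prod_a p_{ia}^{c_{ia}}$$
- Score of column $m_i$ can be defined as the negative logarithm
$$S(m_i) = -\sum_a c_{ia} \log p_{ia}$$
- $p_{ia}$: a regularized probability estimate as used in chapter 5
- $S(m_i)$: an entropy measure directly related to the Shannon entropy (chapter 11)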
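As a rough illustration of the column score, the following sketch computes $-\sum_a c_{ia} \log_2 p_{ia}$ for a single column. The function name `column_entropy_score`, the add-a-constant pseudocount used as the regularizer, and the base-2 logarithm are assumptions made for this example, not choices taken from the slides.

```python
import math

def column_entropy_score(column, alphabet, pseudocount=1.0):
    """Hypothetical helper: minimum-entropy score of one alignment column.

    column      -- string of residues found in the column (gaps are ignored)
    alphabet    -- string of allowed residue symbols
    pseudocount -- assumed regularizer added to every count before normalizing
    """
    counts = {a: column.count(a) for a in alphabet}
    total = sum(counts.values()) + pseudocount * len(alphabet)
    score = 0.0
    for a in alphabet:
        c = counts[a]
        if c == 0:
            continue  # a zero count contributes nothing to the sum
        p = (c + pseudocount) / total  # regularized estimate of p_ia
        score -= c * math.log2(p)
    return score

# With no regularization, a fully conserved column scores 0.
conserved = column_entropy_score("AAAA", "ACGT", pseudocount=0.0)
# With a positive pseudocount, the same column scores above 0.
regularized = column_entropy_score("AAAA", "ACGT", pseudocount=1.0)
# A column split evenly between two residues costs one bit per sequence.
mixed = column_entropy_score("AACC", "ACGT", pseudocount=0.0)
```

Lower scores mean more conserved columns, so a good alignment under this measure minimizes the total over columns.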
13Example (1)
14Example (2)
15Example (3)
Will this ever be 0 in reality? Why (not)?
16Example (4)
17Minimum entropy
- Very near to the HMM formulation
- Choose the sequences carefully
- Usually the sample of sequences is biased
- Weighting schemes as discussed in chapter 5 are necessary
- This partially compensates for the defects of the assumption of sequence independence
18Sum of pairs
- Also assumes statistical independence between columns
- Uses substitution matrices
- For simple linear gap costs, $s(a,-)$, $s(-,a)$ and $s(-,-)$ are defined, with $s(-,-) = 0$
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
- Scores $s(a,b)$ come from substitution matrices like PAM or BLOSUM
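A minimal sketch of the sum-of-pairs column score, assuming a linear gap cost and a substitution table given as a plain dictionary. The function name `sp_column_score` and the gap cost of 8 are illustrative assumptions. The toy table holds only the two BLOSUM50 entries quoted elsewhere in these slides (L-L = 5, G-L = -4).

```python
from itertools import combinations

def sp_column_score(column, subst, gap_cost):
    """Hypothetical helper: sum of s(a, b) over all symbol pairs in a column."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue              # s(-,-) = 0
        if a == "-" or b == "-":
            total -= gap_cost     # s(a,-) = s(-,a) = -gap_cost
        else:
            # The table is symmetric, so look the pair up in either order.
            key = (a, b) if (a, b) in subst else (b, a)
            total += subst[key]
    return total

toy_subst = {("L", "L"): 5, ("G", "L"): -4}  # two BLOSUM50 entries only

all_leucine = sp_column_score("LLL", toy_subst, gap_cost=8)   # 3 L-L pairs
one_glycine = sp_column_score("LLG", toy_subst, gap_cost=8)   # 1 L-L, 2 G-L
one_gap = sp_column_score("LL-", toy_subst, gap_cost=8)       # 1 L-L, 2 gaps
```

In practice the table would be a full PAM or BLOSUM matrix rather than this two-entry toy.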
19Sum of pairs
- Substitution scores are usually log-odds scores for pairwise comparisons
- For three sequences the SP score is $\log(p_{ab}/q_a q_b) + \log(p_{bc}/q_b q_c) + \log(p_{ac}/q_a q_c)$
- rather than the three-way log-odds score $\log(p_{abc}/q_a q_b q_c)$
- Each sequence is scored as if it descended from the N-1 other sequences
- Evolutionary events are over-counted
20Problem with SP scores
- Consider an alignment of N sequences
- All have leucine (L) at position i
$$S(m_i) = 5 \cdot \frac{N(N-1)}{2}$$
- $N(N-1)/2$: number of symbol pairs in the column
- 5: score for an L-L alignment according to the BLOSUM50 matrix
21Problem with SP scores
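- What if one sequence has glycine (G) at i?
- G-L pair scores -4, difference with L-L is 9
- There are N-1 G-L pairs, so the score drops by 9(N-1)
- The score is worse than the all-leucine column by a fraction
$$\frac{9(N-1)}{5N(N-1)/2} = \frac{18}{5N}$$
- This fraction shrinks as N grows, although more leucines should make the glycine look more surprising, not less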
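The arithmetic can be checked with a few lines of code. This sketch uses only the two BLOSUM50 values quoted on the slides (L-L = 5, G-L = -4). The function names are made up for the example.

```python
from fractions import Fraction

def all_leucine_score(n, ll=5):
    """SP score of a column holding n leucines: n(n-1)/2 L-L pairs."""
    return ll * n * (n - 1) // 2

def one_glycine_score(n, ll=5, gl=-4):
    """SP score when one of the n residues is glycine instead.

    There are (n-1) G-L pairs; the other (n-1)(n-2)/2 pairs are still L-L.
    """
    return gl * (n - 1) + ll * (n - 1) * (n - 2) // 2

def relative_drop(n):
    """Fraction by which the one-glycine column scores below all-leucine."""
    full = all_leucine_score(n)
    return Fraction(full - one_glycine_score(n), full)

# The relative penalty for the odd glycine gets smaller as n grows.
drops = {n: relative_drop(n) for n in (2, 4, 10, 100)}
```

For every N the exact fraction matches 18/(5N), so adding more leucine sequences makes the lone glycine count for less under SP scoring.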
22What a multiple alignment means / Scoring a multiple alignment
23Multidimensional dynamic programming
- We assume that columns of an alignment are statistically independent
- Gaps are scored with a linear gap cost, so gap scores are part of the column scores
- Now we can calculate overall score S(m)
$$S(m) = \sum_i S(m_i)$$
- Where $S(m_i)$ is a score for column i
24Calculating the overall score
- Define $\alpha_{i_1, i_2, \ldots, i_N}$ as the maximum score of an alignment up to the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \ldots, x^N_{i_N}$
26Simple notation
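- Introduce $\Delta_i$ which is 0 or 1 and define the product: $\Delta_i \cdot x = x$ if $\Delta_i = 1$, and $\Delta_i \cdot x = -$ (a gap) if $\Delta_i = 0$
- Now recursion can be written as follows
$$\alpha_{i_1, \ldots, i_N} = \max_{\Delta_1 + \cdots + \Delta_N > 0} \left\{ \alpha_{i_1 - \Delta_1, \ldots, i_N - \Delta_N} + S(\Delta_1 \cdot x^1_{i_1}, \ldots, \Delta_N \cdot x^N_{i_N}) \right\}$$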
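The recursion can be sketched directly for a handful of short sequences. The code below is an illustration under stated assumptions, not the implementation from the slides: the column score is a toy sum-of-pairs function (match +1, mismatch -1, linear gap cost 2), and the names `sp_score` and `msa_score` are hypothetical. It returns only the optimal score, with no traceback.

```python
from itertools import combinations, product

def sp_score(column, match=1, mismatch=-1, gap=2):
    """Toy sum-of-pairs column score with a linear gap cost; s(-,-) = 0."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total -= gap
        else:
            total += match if a == b else mismatch
    return total

def msa_score(seqs, column_score=sp_score):
    """Optimal multiple-alignment score by full N-dimensional DP."""
    n = len(seqs)
    # All 2^N - 1 non-zero choices of (Delta_1, ..., Delta_N).
    deltas = [d for d in product((0, 1), repeat=n) if any(d)]
    alpha = {(0,) * n: 0}
    # Lexicographic order guarantees every predecessor cell is filled first.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if not any(cell):
            continue  # origin already initialized
        best = None
        for d in deltas:
            prev = tuple(i - di for i, di in zip(cell, d))
            if min(prev) < 0:
                continue  # would step outside the matrix
            # Delta_k = 1 emits residue x^k_{i_k}; Delta_k = 0 emits a gap.
            column = [s[i - 1] if di else "-"
                      for s, i, di in zip(seqs, cell, d)]
            cand = alpha[prev] + column_score(column)
            if best is None or cand > best:
                best = cand
        alpha[cell] = best
    return alpha[tuple(len(s) for s in seqs)]

identical = msa_score(["AC", "AC", "AC"])   # two fully matching columns
one_short = msa_score(["AC", "AC", "A"])    # one sequence needs a gap
```

The dictionary holds one entry per cell of the N-dimensional matrix, so this is practical only for a few short sequences.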
27Complexity of algorithm
- The algorithm requires the computation of the whole dynamic programming matrix with $L_1 L_2 \cdots L_N$ entries.
- We have to view $2^N - 1$ combinations of gaps in a column.
- Assume all sequences have roughly the same length $L$
- Memory complexity of algorithm is $O(L^N)$
- Time complexity is $O(2^N L^N)$
28MSA
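- Let $a^{kl}$ denote the pairwise alignment between sequences k and l induced by the multiple alignment $a$
- The score of the complete alignment is given by
$$S(a) = \sum_{k<l} S(a^{kl})$$
- Let $\hat{a}^{kl}$ be the optimal pairwise alignment of k, l
- Obviously
$$S(a^{kl}) \le S(\hat{a}^{kl})$$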
29Lower bound
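- Assume that we have a lower bound $\sigma(a)$ on the score of the optimal multiple alignment $a$, so
$$\sigma(a) \le S(a) \le S(a^{kl}) - S(\hat{a}^{kl}) + \sum_{k'<l'} S(\hat{a}^{k'l'})$$
- In other words
$$S(a^{kl}) \ge \beta^{kl}$$
- Where
$$\beta^{kl} = \sigma(a) + S(\hat{a}^{kl}) - \sum_{k'<l'} S(\hat{a}^{k'l'})$$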
30Lower bound
- Now we can look only at pairwise alignments of k and l that score better than $\beta^{kl}$
- We need to obtain $\sigma(a)$, and this can be done by using a progressive alignment algorithm
31Restricted algorithm
- For each pair k, l we can find the complete set $B_{kl}$ of coordinate pairs $(i_k, i_l)$ such that the best alignment of $x^k$ to $x^l$ through $(i_k, i_l)$ scores more than $\beta^{kl}$
- Now we only have to look at cells $(i_1, i_2, \ldots, i_N)$ which meet the following condition
- $(i_k, i_l)$ is in $B_{kl}$ for all k, l
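One way to build such a set for a single pair is to combine a forward and a backward pairwise DP matrix: the best global alignment passing through cell (i, j) scores forward[i][j] + backward[i][j]. The sketch below assumes toy match/mismatch scores and a linear gap cost. The names `forward_matrix`, `allowed_cells` and `pair_bound` are hypothetical. It keeps cells scoring at least the bound (`>=`), a deliberately safe choice so that alignments exactly at the bound are not discarded.

```python
def forward_matrix(x, y, match=1, mismatch=-1, gap=2):
    """F[i][j]: best global score aligning x[:i] with y[:j] (linear gaps)."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        F[i][0] = -gap * i
    for j in range(1, len(y) + 1):
        F[0][j] = -gap * j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,
                          F[i - 1][j] - gap,
                          F[i][j - 1] - gap)
    return F

def allowed_cells(x, y, beta, **params):
    """Cells (i, j) whose best alignment through them scores >= beta."""
    F = forward_matrix(x, y, **params)
    # Backward scores: align the reversed sequences, then flip the indices,
    # so R[n - i][m - j] is the best score for aligning x[i:] with y[j:].
    R = forward_matrix(x[::-1], y[::-1], **params)
    n, m = len(x), len(y)
    return {(i, j)
            for i in range(n + 1) for j in range(m + 1)
            if F[i][j] + R[n - i][m - j] >= beta}

def pair_bound(sigma, optimal_scores, pair):
    """beta^{kl} = sigma + S(a_hat^{kl}) - sum of all optimal pair scores."""
    return sigma + optimal_scores[pair] - sum(optimal_scores.values())

# Made-up optimal pairwise scores and lower bound, for illustration only.
beta_01 = pair_bound(20, {(0, 1): 10, (0, 2): 8, (1, 2): 6}, (0, 1))
tight = allowed_cells("ACGT", "ACGT", beta=4)     # only the main diagonal
loose = allowed_cells("ACGT", "ACGT", beta=-100)  # every cell survives
```

The tighter the lower bound, the fewer cells survive, which is what makes the restricted N-dimensional search feasible.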
33Progressive alignment methods
- The algorithms differ in several ways
- Choice of order to do the alignment
- Whether the progression involves only alignment
of sequences to a single growing alignment or
whether subfamilies are built upon a tree
structure
34Feng-Doolittle progressive multiple alignment
- Calculate a triangular matrix of N(N-1)/2 distances between all pairs of N sequences by standard pairwise alignment
- Construct a guide tree from the distance matrix using the Fitch-Margoliash clustering algorithm
- Starting from the first node added to the tree, align the child nodes
- Repeat for the remaining nodes, in the order they were added to the tree, until all sequences have been aligned.
35Converting scores to distances
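$$D = -\log S_{\text{eff}} = -\log \frac{S_{\text{obs}} - S_{\text{rand}}}{S_{\text{max}} - S_{\text{rand}}}$$
- Where
- $S_{\text{max}}$ is the maximum score
- $S_{\text{obs}}$ is the observed pairwise alignment score
- $S_{\text{rand}}$ is the expected score for aligning two random sequences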
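A small sketch of this conversion, assuming the three scores have already been computed elsewhere. The function name and the choice to return infinity when the observed score is no better than random are assumptions made for this example.

```python
import math

def feng_doolittle_distance(s_obs, s_max, s_rand):
    """Convert a pairwise alignment score into a distance.

    s_obs  -- observed pairwise alignment score
    s_max  -- maximum score (assumed to be supplied by the caller)
    s_rand -- expected score for aligning two random sequences
    """
    s_eff = (s_obs - s_rand) / (s_max - s_rand)  # normalized effective score
    if s_eff <= 0:
        # Assumption: treat "no better than random" as infinitely distant.
        return math.inf
    return -math.log(s_eff)

identical = feng_doolittle_distance(s_obs=100, s_max=100, s_rand=20)
halfway = feng_doolittle_distance(s_obs=60, s_max=100, s_rand=20)
unrelated = feng_doolittle_distance(s_obs=20, s_max=100, s_rand=20)
```

Identical sequences get distance 0, and the distance grows without bound as the observed score approaches the random expectation.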
36Profile alignment
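- Linear gap scores can be included in the SP score by setting $s(a,-) = s(-,a) = -g$ and $s(-,-) = 0$
- Global alignment score, for a profile of sequences $1, \ldots, n$ aligned to a profile of sequences $n+1, \ldots, N$
$$\sum_i S(m_i) = \sum_i \sum_{k<l \le N} s(m_i^k, m_i^l) = \sum_i \sum_{k<l \le n} s(m_i^k, m_i^l) + \sum_i \sum_{n<k<l \le N} s(m_i^k, m_i^l) + \sum_i \sum_{k \le n < l \le N} s(m_i^k, m_i^l)$$
- Only the last (cross) term depends on how the two profiles are aligned to each other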
37CLUSTALW progressive alignment
- Construct a distance matrix of all N(N-1)/2 pairs by pairwise dynamic programming alignment.
- Construct a guide tree by a neighbor-joining clustering algorithm (Saitou & Nei).
- Progressively align at nodes in order of decreasing similarity, using sequence-sequence, sequence-profile and profile-profile alignment.
38CLUSTALW properties
- Sequences are weighted to compensate for biased representation.
- The substitution matrix used to score an alignment is chosen based on the expected similarity of the sequences.
- Position-specific gap-open profile penalties are multiplied by a modifier that is a function of the residues observed at the position.
39CLUSTALW properties
- Gap-open penalties are also decreased if the position is spanned by a consecutive stretch of five or more hydrophilic residues.
- Both gap-open and gap-extend penalties are increased if there are no gaps in a column but gaps occur nearby in the alignment.
- In the progressive alignment stage, if the score of an alignment is low, that alignment is deferred until more profile information has been accumulated.
40Iterative refinement methods: Barton-Sternberg multiple alignment
- Find two sequences with the highest pairwise similarity and align them using standard pairwise dynamic programming alignment.
- Find the sequence that is most similar to a profile of the alignment of the first two and align it to the first two by profile-sequence alignment. Repeat until all sequences have been included in the multiple alignment.
41Iterative refinement methods: Barton-Sternberg multiple alignment
- Remove sequence $x^1$ and realign it to a profile of the other aligned sequences $x^2, \ldots, x^N$ by profile-sequence alignment. Repeat for sequences $x^2, \ldots, x^N$.
- Repeat the previous realignment step a fixed number of times or until the alignment score converges.