Title: Multiple Sequence Alignment
1. Multiple Sequence Alignment
Chapter 4
- Luonan Chen
- Osaka Sangyo University
2. Applications
- Genome sequencing
- Proteins: conserved functional or structural domains
- DNA: structural and functional relationships, promoters, binding sites
- Biotech: specific probes, PCR primers
- Phylogenetic analysis
3. Sum of Pairs (SP)
- An alignment of four amino acid sequences:
  M W P I L L L
  M L R - L L -
  M K - I L L -
  M P P V L I L
- For column 4: SP_4 = SP(I,-,I,V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V)
- SP = Σ_i SP_i over columns i
- (or SP = Σ_{i<j} s(a_i, a_j), where a_i and a_j are sequences i and j)
- s(-,-) = 0
- For n sequences, the no. of pair combinations is n(n-1)/2 for each column
- DP over many sequences suffers from the curse of dimensionality
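As a minimal sketch of column-wise SP scoring (the pair-score function s below is a toy match/mismatch/gap scorer, not a real substitution matrix):

```python
from itertools import combinations

def s(a, b):
    """Toy pair score: match=1, mismatch=0, gap vs residue=-1, gap vs gap=0."""
    if a == "-" and b == "-":
        return 0          # s(-,-) = 0 by convention
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0

def sp_score(alignment):
    """Sum-of-pairs score: for each column, sum s over all n(n-1)/2 pairs."""
    total = 0
    for column in zip(*alignment):
        total += sum(s(a, b) for a, b in combinations(column, 2))
    return total

seqs = ["MWPILLL", "MLR-LL-", "MK-ILL-", "MPPVLIL"]  # the alignment above
print(sp_score(seqs))
```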
4. Multiple Sequence Alignments
- In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences with the same method is almost impossible (e.g., DP costs O(T^N) for N sequences of length T).
- The problem grows exponentially with the number of sequences involved (the product of the sequence lengths).
- Most MSA programs are based on progressive pairwise methods.
5. Optimal Alignment
- For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
- Determining which alignment is best for a given set of sequences is ultimately up to the judgement of the investigator.
6. Progressive Pairwise Methods (global)
- The incremental or progressive method first makes pairwise alignments, then adds new sequences one at a time to these aligned groups.
- An approximate method.
- Typical algorithms: CLUSTALW, PILEUP
- Disadvantages: (1) dependence on the initial pairwise sequence alignments; (2) dependence on scoring matrices and gap penalties.
7. The PILEUP Algorithm
- First, PILEUP calculates approximate pairwise similarity scores (by DP) between all sequences to be aligned, and clusters them into a dendrogram (tree structure).
- Then the most similar pairs of sequences are aligned.
- Averages (similar to consensus sequences) are calculated for the aligned pairs.
- New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram (the control flow is sketched below).
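A minimal sketch of the progressive control flow, not the real PILEUP code: pairwise scores here are crude identity fractions standing in for DP scores, and clusters are merged greedily by single linkage rather than by a proper UPGMA dendrogram.

```python
from itertools import combinations

def pair_score(a, b):
    """Crude similarity: identities over compared positions (stand-in for DP)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def merge_order(seqs):
    """Greedy merge order over clusters, mimicking a dendrogram traversal."""
    clusters = [[i] for i in range(len(seqs))]
    order = []
    while len(clusters) > 1:
        # pick the pair of clusters with the best single-linkage similarity
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: max(pair_score(seqs[a], seqs[b])
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        order.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return order

print(merge_order(["MWPILLL", "MLRLL", "MKILL", "MPPVLIL"]))
```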
8. PILEUP Considerations
- Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.
- PILEUP parameters: two gap penalties (gap insert and gap extend) and an amino acid comparison matrix.
- PILEUP will refuse to align sequences that require too many gaps or mismatches.
- PILEUP will take quite a while to align more than about 10 sequences.
9. CLUSTALW
- (1) Perform pairwise alignments of all sequences.
- (2) Use the alignment scores to produce a phylogenetic tree.
- (3) Align the sequences sequentially, guided by the tree of genetic distances.
- The most closely related sequences are aligned first; additional sequences or groups are then added according to the initial alignments.
- Genetic distance: the no. of mismatched positions divided by the total no. of matched positions (positions opposite a gap are not scored); see the sketch below.
- Sequence contributions to the MSA are weighted according to their relationships on the tree.
- Weighting scheme: the more distant the sequence, the higher the weight.
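A small sketch of the genetic-distance definition above, reading "matched positions" as the positions scored at all, i.e. those where neither sequence has a gap:

```python
def genetic_distance(a, b):
    """Mismatches divided by scored positions; columns opposite a gap
    are skipped, per the CLUSTALW-style definition above."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    mismatches = sum(x != y for x, y in pairs)
    return mismatches / len(pairs)

print(genetic_distance("MK-ILL-", "MPPVLIL"))  # 3 mismatches / 5 scored = 0.6
```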
10Pairwise alignment
S1 sequence S3 sequence
S2 sequence S5 sequence S6 sequence
S1 S3 S2 S5 S6
First align S1 and S3, S5 and S6, then align
(S5,S6) and S2. Finally
11. Iterative Methods for MSA (global)
Genetic algorithms (GA), simulated annealing (SA), or hidden Markov models (HMM).
12Local MSA(Motif)
Used as PAM
Profile Analysis
Unknow AA,gap,extension
Block Analysis
Log odds score
Consensus pattern
13Motif Expression
- Standard expression
- CL-x(2,5)-V-LIM-x(3)-GP
- x(2,5)for any 2-5 residues LIML or I or M
GP any residue except G and P - Weight matrix expression profile
- HMM similar to profile but with delete and
insert prob.
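A simplified sketch of turning the standard (PROSITE-style) syntax above into a regular expression for motif search; it handles only the constructs shown on this slide:

```python
import re

def prosite_to_regex(pattern):
    """Translate PROSITE-style motif syntax into a Python regex:
    [..] = alternatives, {..} = exclusions, x(m,n) = wildcard run."""
    regex = pattern.replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = re.sub(r"x\((\d+),(\d+)\)", r".{\1,\2}", regex)
    regex = re.sub(r"x\((\d+)\)", r".{\1}", regex)
    return regex.replace("x", ".")

pat = "[CL]-x(2,5)-V-[LIM]-x(3)-{GP}"
rx = prosite_to_regex(pat)     # -> "[CL].{2,5}V[LIM].{3}[^GP]"
print(rx, bool(re.search(rx, "ACDEFGVLAAAW")))
```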
14. Statistical Methods
- Expectation Maximization (EM) algorithm (deterministic)
- Gibbs sampler (stochastic)
- Hidden Markov models (stochastic)
- For HMMs:
- Advantages: theoretical grounding, no dependence on sequence ordering, no insertion and deletion penalties, can use prior information.
- Disadvantage: a large number of sequences is needed for training.
15EM Algorithm
- 1. Initial guess and prob. (or odds score)
distribution table - 2. E-step estimate the prob. (likelihood) at any
position in each sequence according to the table - 3. M-step add new counts of bases or amino acids
to each position according the prob., and update
the table. - Goto E-step until converged.
e.g. MEME Multiple EM for Motif Elicitation
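A bare-bones sketch of the E/M loop above for one DNA motif of width w, assuming one occurrence per sequence and a uniform background; this is an illustration of the MEME idea, not the MEME program:

```python
import numpy as np

def em_motif(seqs, w, iters=50, seed=0):
    """Toy EM motif finder: E-step weights every offset by its odds under
    the current motif model; M-step rebuilds the model from those weights."""
    rng = np.random.default_rng(seed)
    alphabet = "ACGT"
    idx = [[alphabet.index(c) for c in s] for s in seqs]
    theta = rng.dirichlet(np.ones(4), size=w)   # motif position probabilities
    bg = np.full(4, 0.25)                       # uniform background (assumed)
    for _ in range(iters):
        counts = np.zeros((w, 4))
        for s in idx:
            starts = range(len(s) - w + 1)
            # E-step: motif/background odds of each start position
            lik = np.array([np.prod([theta[j, s[p + j]] / bg[s[p + j]]
                                     for j in range(w)]) for p in starts])
            post = lik / lik.sum()
            # M-step contribution: fractional counts weighted by the posterior
            for p, z in zip(starts, post):
                for j in range(w):
                    counts[j, s[p + j]] += z
        theta = (counts + 0.1) / (counts + 0.1).sum(axis=1, keepdims=True)
    return theta

seqs = ["ACGTACGTTTAC", "TTTACGTACGAA", "GGACGTACGTGG"]
print(em_motif(seqs, w=4).round(2))
```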
16. Gibbs Sampler
- Goal: the ratio of motif probability to background probability is maximized.
[Figure: motif residue frequencies estimated from, e.g., the 2nd-9th sequences over 20 residues; the background score is calculated without the motif; the 1st sequence is left out]
- Randomly select the sequence to leave out, and repeat steps A to C > 100 times (a toy sketch follows below).
17Markov Hidden Model
By Forward-backward, Viterbi, Baum-Welch (or EM)
algorithms, (1) find Prob. of the observation
sequence given the model (2) find the most
likely hidden state sequence given the model and
observations (3) adjust parameters to maximize
their Prob. given observations.
--Match states are consensus positions
(or gap)
Bases (4) or amino acids (20)
K L A . . . D
States 134
(or gap)
18. Notation
- S = (S_0, ..., S_N): state index
- s = (s_1, ..., s_K): the K symbols emitted in a state
- O = (o_1, ..., o_T): observation sequence
- Q = (q_1, ..., q_T): state sequence
- (o_t takes symbol values s_j and q_t takes state values S_i; e.g., S_i ∈ {1, 2, ..., N}, s ∈ {A, T, G, C})
19. Key Tasks
- (1) Find the probability of the observation sequence:
- -- given the model λ and O, compute P(O|λ)
- (2) Find the most likely hidden state sequence:
- -- given λ and O, find argmax_Q P(Q|O,λ)
- (3) Adjust the parameters to maximize their probability given the observations O = (o_1, ..., o_T):
- -- given O, find argmax_λ P(O|λ)
- -- solved by the forward-backward, Viterbi, and Baum-Welch (or EM) algorithms, respectively. Note Q = (q_1, ..., q_T).
20(1). Forward Algorithm
at(j)P(o1,,ot , qt in Sj ?)
a0(j)1 if j is a START state, 0 otherwise
S2
Compute a recursively. Computation complexity
O(N2T) Task (1) P(O ?) Sj1N aT(j)
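A minimal sketch of the recursion above, assuming matrices A[i][j] = P(S_j|S_i) and B[j][k] = P(symbol k|S_j), with an initial distribution standing in for the slides' explicit START state; the toy numbers are illustrative:

```python
import numpy as np

def forward(A, B, obs, start):
    """Forward recursion: alpha[t][j] = sum_i alpha[t-1][i] * A[i,j] * B[j,o_t].
    P(O|lambda) is the sum of alpha at the final step: O(N^2 T) work."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = start * B[:, obs[0]]            # alpha_1 from the initial distribution
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()              # task (1): P(O | lambda)

A = np.array([[0.7, 0.3], [0.4, 0.6]])         # transition probs P(S_j|S_i)
B = np.array([[0.5, 0.5], [0.1, 0.9]])         # emission probs P(o|S_j)
start = np.array([0.6, 0.4])
alpha, p = forward(A, B, [0, 1, 1], start)
print(p)
```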
21. [Figure: forward-recursion trellis over State 1 and State 2]
22Backward Algorithm
ßt(j)P(ot1,,oT , qt in Sj ?)
ßT(j)1 if j is a END state, 0 otherwise
Compute ß recursively. Computation complexity
O(N2T) P(O ?) Sj1N ß0(j)
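The same sketch conventions as in the forward example (matrices A and B, an initial distribution in place of START/END states); recovering P(O|λ) at t = 0 serves as a consistency check against the forward result:

```python
import numpy as np

def backward(A, B, obs, start):
    """Backward recursion: beta[t][i] = sum_j A[i,j] * B[j,o_{t+1}] * beta[t+1][j].
    Also O(N^2 T) work."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))                     # beta_T = 1 (END convention)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (start * B[:, obs[0]] * beta[0]).sum()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
start = np.array([0.6, 0.4])
beta, p = backward(A, B, [0, 1, 1], start)
print(p)  # matches the forward result P(O | lambda)
```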
24(2). The Viterbi Algorithm
- Replace summation with maximization in Forward
Algorithm, i.e. - Define Vt(i)maxq1,,qt P(o1,,ot , qt in Si ?)
- Recursive computation
- Vt(j)maxi1,,N Vt-1(i)P(SjSi)P(otSj) for tgt0
- Task (2) maxQ P(QO,?) max1jN VT(j)
- (trace back for the maximum)
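A sketch under the same assumed conventions as the forward example, with back-pointers implementing the trace-back step:

```python
import numpy as np

def viterbi(A, B, obs, start):
    """Viterbi: the forward recursion with max in place of sum,
    plus back-pointers to trace the most likely state path."""
    N, T = A.shape[0], len(obs)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = start * B[:, obs[0]]
    for t in range(1, T):
        cand = V[t - 1][:, None] * A           # cand[i,j] = V_{t-1}(i) P(S_j|S_i)
        back[t] = cand.argmax(axis=0)
        V[t] = cand.max(axis=0) * B[:, obs[t]]
    path = [int(V[-1].argmax())]               # trace back for the maximum
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], V[-1].max()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(A, B, [0, 1, 1], np.array([0.6, 0.4])))
```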
25. Find the state sequence Q which maximizes P
26. (3) Estimation of Parameters: Optimization Algorithm
- Probability of a transition from S_i to S_j at time t, given O and λ:
- ξ_t(i,j) = P(q_t in S_i, q_{t+1} in S_j | O, λ) = α_t(i) P(S_j|S_i) P(o_{t+1}|S_j) β_{t+1}(j) / P(O|λ)
[Figure: trellis arc from S_i to S_j, weighted by P(S_j|S_i) P(o_{t+1}|S_j)]
- Computed using the forward-backward algorithm.
27Parameters of HMM
-
expected no. of transition from Si to Sj -
expected no. of transition from Si
aijP(SjSi)
Expected no. of times in state j with symbol
k Expected no. of tiems in state j
bjP(qkSj)
Then the parameters of HMM ? A,B, where
Aaij, Bbj. (e.g.
kA,T,G or C )
28Baum-Welch Algorithm (EM Algorithm)
- 1. initialization of ? A,B
- 2. compute a, ß,F (E-step)
- 3. update ? A,B from F (M-step)
- 4. if not converged, goto 2.
- -- It can be shown P(OF) strictly increasing,
and converging to a local mode. -
(Task 3)
Questions How many states in model, how to
initialize parameters, how to avoid local modes.
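A sketch of one E+M step for a single observation sequence, under the same assumed matrix conventions as the earlier forward/backward examples; the ξ and re-estimation formulas follow slides 26-27:

```python
import numpy as np

def baum_welch_step(A, B, start, obs):
    """One E+M step: run forward and backward, form xi_t(i,j), then
    re-estimate A and B from expected transition/emission counts."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = start * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()
    gamma = alpha * beta / pO                  # P(q_t in S_i | O, lambda)
    # xi_t(i,j) = alpha_t(i) A[i,j] B[j,o_{t+1}] beta_{t+1}(j) / P(O|lambda)
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / pO
                   for t in range(T - 1)])
    newA = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    newB = np.zeros_like(B)
    for k in range(B.shape[1]):                # expected symbol counts per state
        newB[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    newB /= gamma.sum(axis=0)[:, None]
    return newA, newB, pO

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
A, B, p = baum_welch_step(A, B, np.array([0.6, 0.4]), [0, 1, 1, 0])
print(p)
```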
29Application to MSA by HMM
- For a set of given sequences
- (1). Estimate HMM parameters by Baum-Welch
algorthm - (2). Align each sequence to HMM by Viterbi
algorithm - (3). Match states of HMM provide columns of
resulting multiple alignment.
30Application of MSA
- Calculate conservation (index) to identify motif
(e.g. frequency of AA for each site) - Identify the specific properties of sequences to
predict functions and structures (e.g.
hydrophobic) - Calculate variation (variation frequency of AA
for each site) to predict functions and
structures (e.g. surface of globular protein) - Predict secondary structure (a-helix,ß-sheet,
ß-turn, coli) - Predict the interactions by computing covariation
and evolutionary tree - Use correlation (?kpij(a,b)logpij(a,b)/pi(a)pj(b)
) for all sequences k to predict relation for
site-i and site-j
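A small sketch of the covariation formula above (mutual information between two alignment columns), estimated from raw column frequencies without pseudocounts:

```python
import numpy as np
from collections import Counter

def column_mi(alignment, i, j):
    """Mutual information sum_{a,b} p_ij(a,b) log[p_ij(a,b)/(p_i(a) p_j(b))]
    between alignment columns i and j."""
    col_i = [s[i] for s in alignment]
    col_j = [s[j] for s in alignment]
    n = len(alignment)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in p_ij.items():
        pab = c / n
        mi += pab * np.log(pab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

aln = ["MWPILLL", "MLR-LL-", "MK-ILL-", "MPPVLIL"]  # the slide-3 alignment
print(column_mi(aln, 1, 3))   # covariation between sites 1 and 3
```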