Title: Multiple Sequence Alignment
1. Multiple Sequence Alignment
Chapter 4
- Luonan Chen
- Osaka Sangyo University
2. Applications
- Genome sequencing
- Proteins: conserved functional or structural domains
- DNA: structural and functional relationships, promoters, binding sites
- Biotech: specific probes, PCR primers
- Phylogenetic analysis
3. Sum of Pairs (SP)
- An alignment of four amino acid sequences:
  M W P I L L L
  M L R - L L -
  M K - I L L -
  M P P V L I L
- For column 4: SP_4 = SP(I,-,I,V) = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V)
- SP = Σ_i SP_i over columns i
- (or SP = Σ_{i<j} s(a_i, a_j), where a_i and a_j are sequences i and j)
- s(-,-) = 0
- For n sequences, the no. of pair combinations is n(n-1)/2 for each column
- DP over many sequences suffers from the curse of dimensionality
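As a minimal sketch of column-wise SP scoring (the pair-score function s below is a toy match/mismatch/gap scorer, not a real substitution matrix):

```python
from itertools import combinations

def s(a, b):
    """Toy pair score: match=1, mismatch=0, gap vs residue=-1, gap vs gap=0."""
    if a == "-" and b == "-":
        return 0          # s(-,-) = 0 by convention
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else 0

def sp_score(alignment):
    """Sum-of-pairs score: for each column, sum s over all n(n-1)/2 pairs."""
    total = 0
    for column in zip(*alignment):
        total += sum(s(a, b) for a, b in combinations(column, 2))
    return total

seqs = ["MWPILLL", "MLR-LL-", "MK-ILL-", "MPPVLIL"]  # the alignment above
print(sp_score(seqs))
```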
4. Multiple Sequence Alignments
- In theory, making an optimal alignment between two sequences is computationally straightforward (Smith-Waterman algorithm), but aligning a large number of sequences with the same method is almost impossible (e.g., DP costs O(T^N) for N sequences of length T).
- The problem grows exponentially with the number of sequences involved (the product of the sequence lengths).
- Most MSA programs are based on progressive pairwise methods.
5. Optimal Alignment
- For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.
- Determining which alignment is best for a given set of sequences is ultimately up to the judgement of the investigator.
6. Progressive Pairwise Methods (global)
- The incremental or progressive method first makes pairwise alignments, then adds new sequences one at a time to these aligned groups.
- An approximate method.
- Typical algorithms: CLUSTALW, PILEUP
- Disadvantages: (1) dependence on the initial pairwise sequence alignments; (2) dependence on scoring matrices and gap penalties.
7. The PILEUP Algorithm
- First, PILEUP calculates approximate pairwise similarity scores (by DP) between all sequences to be aligned, and clusters them into a dendrogram (tree structure).
- Then the most similar pairs of sequences are aligned.
- Averages (similar to consensus sequences) are calculated for the aligned pairs.
- New sequences and clusters of sequences are added one by one, according to the branching order in the dendrogram (the control flow is sketched below).
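A minimal sketch of the progressive control flow, not the real PILEUP code: pairwise scores here are crude identity fractions standing in for DP scores, and clusters are merged greedily by single linkage rather than by a proper UPGMA dendrogram.

```python
from itertools import combinations

def pair_score(a, b):
    """Crude similarity: identities over compared positions (stand-in for DP)."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def merge_order(seqs):
    """Greedy merge order over clusters, mimicking a dendrogram traversal."""
    clusters = [[i] for i in range(len(seqs))]
    order = []
    while len(clusters) > 1:
        # pick the pair of clusters with the best single-linkage similarity
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: max(pair_score(seqs[a], seqs[b])
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        order.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return order

print(merge_order(["MWPILLL", "MLRLL", "MKILL", "MPPVLIL"]))
```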
8. PILEUP Considerations
- Since the alignment is calculated on a progressive basis, the order of the initial sequences can affect the final alignment.
- PILEUP parameters: two gap penalties (gap insert and gap extend) and an amino acid comparison matrix.
- PILEUP will refuse to align sequences that require too many gaps or mismatches.
- PILEUP will take quite a while to align more than about 10 sequences.
9. CLUSTALW
- (1) Perform pairwise alignments of all sequences.
- (2) Use the alignment scores to produce a phylogenetic tree.
- (3) Align the sequences sequentially, guided by the tree of genetic distances.
- The most closely related sequences are aligned first; additional sequences or groups are then added according to the initial alignments.
- Genetic distance: the no. of mismatched positions divided by the total no. of matched positions (positions opposite a gap are not scored); see the sketch below.
- Sequence contributions to the MSA are weighted according to their relationships on the tree.
- Weighting scheme: the more distant the sequence, the higher the weight.
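A small sketch of the genetic-distance definition above, reading "matched positions" as the positions scored at all, i.e. those where neither sequence has a gap:

```python
def genetic_distance(a, b):
    """Mismatches divided by scored positions; columns opposite a gap
    are skipped, per the CLUSTALW-style definition above."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    mismatches = sum(x != y for x, y in pairs)
    return mismatches / len(pairs)

print(genetic_distance("MK-ILL-", "MPPVLIL"))  # 3 mismatches / 5 scored = 0.6
```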
10Pairwise alignment
S1 sequence S3 sequence
S2 sequence S5 sequence S6 sequence
S1 S3 S2 S5 S6
First align S1 and S3, S5 and S6, then align
(S5,S6) and S2. Finally
11. Iterative Methods for MSA (global)
Genetic algorithms (GA), simulated annealing (SA), or hidden Markov models (HMM).
12Local MSA(Motif)
Used as PAM
Profile Analysis
Unknow AA,gap,extension
Block Analysis
Log odds score
Consensus pattern
13Motif Expression
- Standard expression
- CL-x(2,5)-V-LIM-x(3)-GP
- x(2,5)for any 2-5 residues LIML or I or M
GP any residue except G and P - Weight matrix expression profile
- HMM similar to profile but with delete and
insert prob.
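A simplified sketch of turning the standard (PROSITE-style) syntax above into a regular expression for motif search; it handles only the constructs shown on this slide:

```python
import re

def prosite_to_regex(pattern):
    """Translate PROSITE-style motif syntax into a Python regex:
    [..] = alternatives, {..} = exclusions, x(m,n) = wildcard run."""
    regex = pattern.replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")
    regex = re.sub(r"x\((\d+),(\d+)\)", r".{\1,\2}", regex)
    regex = re.sub(r"x\((\d+)\)", r".{\1}", regex)
    return regex.replace("x", ".")

pat = "[CL]-x(2,5)-V-[LIM]-x(3)-{GP}"
rx = prosite_to_regex(pat)     # -> "[CL].{2,5}V[LIM].{3}[^GP]"
print(rx, bool(re.search(rx, "ACDEFGVLAAAW")))
```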
14. Statistical Methods
- Expectation Maximization (EM) algorithm (deterministic)
- Gibbs sampler (stochastic)
- Hidden Markov models (stochastic)
- For HMMs:
- Advantages: theoretical grounding, no dependence on sequence ordering, no insertion and deletion penalties, can use prior information.
- Disadvantage: a large number of sequences is needed for training.
15EM Algorithm
- 1. Initial guess and prob. (or odds score)
distribution table - 2. E-step estimate the prob. (likelihood) at any
position in each sequence according to the table - 3. M-step add new counts of bases or amino acids
to each position according the prob., and update
the table. - Goto E-step until converged.
e.g. MEME Multiple EM for Motif Elicitation
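A bare-bones sketch of the E/M loop above for one DNA motif of width w, assuming one occurrence per sequence and a uniform background; this is an illustration of the MEME idea, not the MEME program:

```python
import numpy as np

def em_motif(seqs, w, iters=50, seed=0):
    """Toy EM motif finder: E-step weights every offset by its odds under
    the current motif model; M-step rebuilds the model from those weights."""
    rng = np.random.default_rng(seed)
    alphabet = "ACGT"
    idx = [[alphabet.index(c) for c in s] for s in seqs]
    theta = rng.dirichlet(np.ones(4), size=w)   # motif position probabilities
    bg = np.full(4, 0.25)                       # uniform background (assumed)
    for _ in range(iters):
        counts = np.zeros((w, 4))
        for s in idx:
            starts = range(len(s) - w + 1)
            # E-step: motif/background odds of each start position
            lik = np.array([np.prod([theta[j, s[p + j]] / bg[s[p + j]]
                                     for j in range(w)]) for p in starts])
            post = lik / lik.sum()
            # M-step contribution: fractional counts weighted by the posterior
            for p, z in zip(starts, post):
                for j in range(w):
                    counts[j, s[p + j]] += z
        theta = (counts + 0.1) / (counts + 0.1).sum(axis=1, keepdims=True)
    return theta

seqs = ["ACGTACGTTTAC", "TTTACGTACGAA", "GGACGTACGTGG"]
print(em_motif(seqs, w=4).round(2))
```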
16. Gibbs Sampler
- Goal: the ratio of motif probability to background probability is maximized.
[Figure: motif residue frequencies estimated from, e.g., the 2nd-9th sequences over 20 residues; the background score is calculated without the motif; the 1st sequence is left out]
- Randomly select the sequence to leave out, and repeat steps A to C > 100 times (a toy sketch follows below).
17Markov Hidden Model
By Forward-backward, Viterbi, Baum-Welch (or EM)
algorithms, (1) find Prob. of the observation
sequence given the model (2) find the most
likely hidden state sequence given the model and
observations (3) adjust parameters to maximize
their Prob. given observations.
--Match states are consensus positions
(or gap)
Bases (4) or amino acids (20)
K L A . . . D
States 134
(or gap)
18. Notation
- S = (S_0, ..., S_N): state index
- s = (s_1, ..., s_K): the K symbols emitted in a state
- O = (o_1, ..., o_T): observation sequence
- Q = (q_1, ..., q_T): state sequence
- (o_t takes symbol values s_j and q_t takes state values S_i; e.g., S_i ∈ {1, 2, ..., N}, s ∈ {A, T, G, C})
19. Key Tasks
- (1) Find the probability of the observation sequence:
- -- given the model λ and O, compute P(O|λ)
- (2) Find the most likely hidden state sequence:
- -- given λ and O, find argmax_Q P(Q|O,λ)
- (3) Adjust the parameters to maximize their probability given the observations O = (o_1, ..., o_T):
- -- given O, find argmax_λ P(O|λ)
- -- solved by the forward-backward, Viterbi, and Baum-Welch (or EM) algorithms, respectively. Note Q = (q_1, ..., q_T).
20(1). Forward Algorithm
at(j)P(o1,,ot , qt in Sj ?)
a0(j)1 if j is a START state, 0 otherwise
S2
Compute a recursively. Computation complexity
O(N2T) Task (1) P(O ?) Sj1N aT(j)
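A minimal sketch of the recursion above, assuming matrices A[i][j] = P(S_j|S_i) and B[j][k] = P(symbol k|S_j), with an initial distribution standing in for the slides' explicit START state; the toy numbers are illustrative:

```python
import numpy as np

def forward(A, B, obs, start):
    """Forward recursion: alpha[t][j] = sum_i alpha[t-1][i] * A[i,j] * B[j,o_t].
    P(O|lambda) is the sum of alpha at the final step: O(N^2 T) work."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = start * B[:, obs[0]]            # alpha_1 from the initial distribution
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha, alpha[-1].sum()              # task (1): P(O | lambda)

A = np.array([[0.7, 0.3], [0.4, 0.6]])         # transition probs P(S_j|S_i)
B = np.array([[0.5, 0.5], [0.1, 0.9]])         # emission probs P(o|S_j)
start = np.array([0.6, 0.4])
alpha, p = forward(A, B, [0, 1, 1], start)
print(p)
```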
21. [Figure: forward-recursion trellis over State 1 and State 2]
22Backward Algorithm
ßt(j)P(ot1,,oT , qt in Sj ?)
ßT(j)1 if j is a END state, 0 otherwise
Compute ß recursively. Computation complexity
O(N2T) P(O ?) Sj1N ß0(j)
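The same sketch conventions as in the forward example (matrices A and B, an initial distribution in place of START/END states); recovering P(O|λ) at t = 0 serves as a consistency check against the forward result:

```python
import numpy as np

def backward(A, B, obs, start):
    """Backward recursion: beta[t][i] = sum_j A[i,j] * B[j,o_{t+1}] * beta[t+1][j].
    Also O(N^2 T) work."""
    N, T = A.shape[0], len(obs)
    beta = np.ones((T, N))                     # beta_T = 1 (END convention)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta, (start * B[:, obs[0]] * beta[0]).sum()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
start = np.array([0.6, 0.4])
beta, p = backward(A, B, [0, 1, 1], start)
print(p)  # matches the forward result P(O | lambda)
```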
24(2). The Viterbi Algorithm
- Replace summation with maximization in Forward
Algorithm, i.e. - Define Vt(i)maxq1,,qt P(o1,,ot , qt in Si ?)
- Recursive computation
- Vt(j)maxi1,,N Vt-1(i)P(SjSi)P(otSj) for tgt0
- Task (2) maxQ P(QO,?) max1jN VT(j)
- (trace back for the maximum)
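A sketch under the same assumed conventions as the forward example, with back-pointers implementing the trace-back step:

```python
import numpy as np

def viterbi(A, B, obs, start):
    """Viterbi: the forward recursion with max in place of sum,
    plus back-pointers to trace the most likely state path."""
    N, T = A.shape[0], len(obs)
    V = np.zeros((T, N))
    back = np.zeros((T, N), dtype=int)
    V[0] = start * B[:, obs[0]]
    for t in range(1, T):
        cand = V[t - 1][:, None] * A           # cand[i,j] = V_{t-1}(i) P(S_j|S_i)
        back[t] = cand.argmax(axis=0)
        V[t] = cand.max(axis=0) * B[:, obs[t]]
    path = [int(V[-1].argmax())]               # trace back for the maximum
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], V[-1].max()

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi(A, B, [0, 1, 1], np.array([0.6, 0.4])))
```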
25. Find the state sequence Q which maximizes P
26. (3) Estimation of Parameters: Optimization Algorithm
- Probability of a transition from S_i to S_j at time t, given O and λ:
- ξ_t(i,j) = P(q_t in S_i, q_{t+1} in S_j | O, λ) = α_t(i) P(S_j|S_i) P(o_{t+1}|S_j) β_{t+1}(j) / P(O|λ)
[Figure: trellis arc from S_i to S_j, weighted by P(S_j|S_i) P(o_{t+1}|S_j)]
- Computed using the forward-backward algorithm.
27Parameters of HMM
-
expected no. of transition from Si to Sj -
expected no. of transition from Si
aijP(SjSi)
Expected no. of times in state j with symbol
k Expected no. of tiems in state j
bjP(qkSj)
Then the parameters of HMM ? A,B, where
Aaij, Bbj. (e.g.
kA,T,G or C )
28Baum-Welch Algorithm (EM Algorithm)
- 1. initialization of ? A,B
- 2. compute a, ß,F (E-step)
- 3. update ? A,B from F (M-step)
- 4. if not converged, goto 2.
- -- It can be shown P(OF) strictly increasing,
and converging to a local mode. -
(Task 3)
Questions How many states in model, how to
initialize parameters, how to avoid local modes.
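A sketch of one E+M step for a single observation sequence, under the same assumed matrix conventions as the earlier forward/backward examples; the ξ and re-estimation formulas follow slides 26-27:

```python
import numpy as np

def baum_welch_step(A, B, start, obs):
    """One E+M step: run forward and backward, form xi_t(i,j), then
    re-estimate A and B from expected transition/emission counts."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    beta = np.ones((T, N))
    alpha[0] = start * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    pO = alpha[-1].sum()
    gamma = alpha * beta / pO                  # P(q_t in S_i | O, lambda)
    # xi_t(i,j) = alpha_t(i) A[i,j] B[j,o_{t+1}] beta_{t+1}(j) / P(O|lambda)
    xi = np.array([alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / pO
                   for t in range(T - 1)])
    newA = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    newB = np.zeros_like(B)
    for k in range(B.shape[1]):                # expected symbol counts per state
        newB[:, k] = gamma[np.array(obs) == k].sum(axis=0)
    newB /= gamma.sum(axis=0)[:, None]
    return newA, newB, pO

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
A, B, p = baum_welch_step(A, B, np.array([0.6, 0.4]), [0, 1, 1, 0])
print(p)
```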
29Application to MSA by HMM
- For a set of given sequences
- (1). Estimate HMM parameters by Baum-Welch
algorthm - (2). Align each sequence to HMM by Viterbi
algorithm - (3). Match states of HMM provide columns of
resulting multiple alignment.
30Application of MSA
- Calculate conservation (index) to identify motif
(e.g. frequency of AA for each site) - Identify the specific properties of sequences to
predict functions and structures (e.g.
hydrophobic) - Calculate variation (variation frequency of AA
for each site) to predict functions and
structures (e.g. surface of globular protein) - Predict secondary structure (a-helix,ß-sheet,
ß-turn, coli) - Predict the interactions by computing covariation
and evolutionary tree - Use correlation (?kpij(a,b)logpij(a,b)/pi(a)pj(b)
) for all sequences k to predict relation for
site-i and site-j
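A small sketch of the covariation formula above (mutual information between two alignment columns), estimated from raw column frequencies without pseudocounts:

```python
import numpy as np
from collections import Counter

def column_mi(alignment, i, j):
    """Mutual information sum_{a,b} p_ij(a,b) log[p_ij(a,b)/(p_i(a) p_j(b))]
    between alignment columns i and j."""
    col_i = [s[i] for s in alignment]
    col_j = [s[j] for s in alignment]
    n = len(alignment)
    p_i = Counter(col_i)
    p_j = Counter(col_j)
    p_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in p_ij.items():
        pab = c / n
        mi += pab * np.log(pab / ((p_i[a] / n) * (p_j[b] / n)))
    return mi

aln = ["MWPILLL", "MLR-LL-", "MK-ILL-", "MPPVLIL"]  # the slide-3 alignment
print(column_mi(aln, 1, 3))   # covariation between sites 1 and 3
```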