Transcript and Presenter's Notes

Title: Multiple Sequence Alignment


1
Multiple Sequence Alignment
Chapter-4
  • Luonan Chen
  • Osaka Sangyo University

2
Applications
  • Genome sequencing
  • Protein: conserved functional or structural
    domains
  • DNA: structural and functional relationships,
    promoters, binding sites
  • Biotech: specific probes, PCR primers
  • Phylogenetic analysis

3
Sum of Pairs (SP)
  • M W P I L L L
  • M L R - L L -     an alignment of four
  • M K - I L L -     amino acid sequences
  • M P P V L I L
  • SP4 = SP(I,-,I,V)
  •     = s(I,-) + s(I,I) + s(I,V) + s(-,I) + s(-,V) + s(I,V)
  • SP = Σi SPi over all columns i
  • (or SP = Σi Σj<i s(ai, aj), where ai and aj are sequences i
    and j)
  • s(-,-) = 0
  • For n sequences, no. of pair combinations = n(n-1)/2 for
    each column
  • DP: curse of dimensionality (see the sketch below)
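A minimal sketch of the column-wise SP score in Python, assuming toy
scoring values (match = 1, mismatch = 0, gap = -1) in place of a real
substitution matrix:

    from itertools import combinations

    def s(a, b):
        # pairwise score; s(-,-) = 0 as on the slide
        if a == '-' and b == '-':
            return 0
        if a == '-' or b == '-':
            return -1                  # assumed gap penalty
        return 1 if a == b else 0      # assumed match/mismatch scores

    def sp_column(column):
        # n(n-1)/2 pairs per column
        return sum(s(a, b) for a, b in combinations(column, 2))

    def sp_score(alignment):
        # SP = sum of SPi over all columns i
        return sum(sp_column(col) for col in zip(*alignment))

    alignment = ["MWPILLL", "MLR-LL-", "MK-ILL-", "MPPVLIL"]
    print(sp_column("I-IV"))           # SP4 from the slide
    print(sp_score(alignment))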

4
Multiple Sequence Alignments
  • In theory, making an optimal alignment between
    two sequences is computationally straightforward
    (Smith-Waterman algorithm), but aligning a large
    number of sequences by the same method is
    almost impossible (e.g. DP → O(T^N) for N
    sequences of length T).
  • The problem grows exponentially with the
    number of sequences involved
  • (the search space is the product of the sequence lengths).
  • Most MSA methods are based on progressive pairwise methods.

5
Optimal Alignment
  • For a given group of sequences, there is no
    single "correct" alignment, only an alignment
    that is "optimal" according to some set of
    calculations.
  • Determining what alignment is best for a given
    set of sequences is really up to the judgement of
    the investigator.

6
Progressive Pairwise Methods (global)
  • Incremental or progressive methods first make
    pairwise alignments, then add new sequences one
    at a time to these aligned groups.
  • An approximate method.
  • Typical algorithms: CLUSTALW, PILEUP
  • --Disadvantages: (1) dependence on the initial
    pairwise sequence alignments;
  • (2) dependence on scoring matrices and gap
    penalties.

7
The PILEUP Algorithm
  • First, PILEUP calculates approximate pairwise
    similarity scores (by DP) between all sequences to
    be aligned and clusters them into a
    dendrogram (tree structure).
  • Then the most similar pairs of sequences are
    aligned.
  • Averages (similar to consensus sequences) are
    calculated for the aligned pairs.
  • New sequences and clusters of sequences are added
    one by one, according to the branching order in
    the dendrogram.

8
PILEUP Considerations
  • Since the alignment is calculated on a
    progressive basis, the order of the initial
    sequences can affect the final alignment.
  • PILEUP parameters: 2 gap penalties (gap insert
    and gap extend) and an amino acid comparison
    matrix.
  • PILEUP will refuse to align sequences that
    require too many gaps or mismatches.
  • PILEUP will take quite a while to align more than
    about 10 sequences.

9
CLUSTALW
  • (1). Perform pairwise alignments of all sequences.
  • (2). Use the alignment scores to produce a
    phylogenetic tree.
  • (3). Align the sequences sequentially, guided by the
    tree, which is based on genetic distances.
  • --The most closely related sequences are aligned
    first; then additional sequences or groups are
    added, guided by the initial alignments.
  • --Genetic distance = no. of mismatched positions
    divided by the total no. of matched positions
    (positions opposite a gap are not scored); see the
    sketch below.
  • --Sequence contributions to the MSA are weighted
    according to their relationships on the tree.
  • --Weighting scheme: the more distant the sequence,
    the higher the weight.
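A minimal sketch of the genetic-distance calculation described above;
the example rows and the fallback value for non-overlapping pairs are
assumptions for illustration:

    def genetic_distance(seq1, seq2):
        # seq1 and seq2 are two rows of a pairwise alignment (same length)
        scored = mismatched = 0
        for a, b in zip(seq1, seq2):
            if a == '-' or b == '-':
                continue          # positions opposite a gap are not scored
            scored += 1
            if a != b:
                mismatched += 1
        # assume maximal distance if the sequences never overlap
        return mismatched / scored if scored else 1.0

    print(genetic_distance("MWPILLL", "MLR-LL-"))  # 2 mismatches / 5 scored = 0.4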

10
Pairwise alignment
[Figure: guide tree clustering S1 with S3, and S5 with S6,
then (S5,S6) with S2]
First align S1 and S3, and S5 and S6; then align
(S5,S6) with S2. Finally, align the (S1,S3) group with the
((S5,S6),S2) group.
11
Iterative Methods for MSA (global)
GA (genetic algorithms), SA (simulated annealing), or Hidden Markov Models
12
Local MSA (Motif)
  • Profile analysis: used like a PAM table, with
    entries for unknown AA, gap, and extension
  • Block analysis: log-odds score
  • Consensus pattern
13
Motif Expression
  • Standard expression:
  • [CL]-x(2,5)-V-[LIM]-x(3)-{GP}
  • x(2,5): any 2-5 residues; [LIM]: L or I or M;
    {GP}: any residue except G and P (see the
    regex sketch below)
  • Weight matrix expression: profile
  • HMM: similar to a profile but with delete and
    insert probabilities
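A minimal sketch of how the standard expression above maps onto a
regular expression for motif searching; the test sequence is
hypothetical:

    import re

    # [CL]-x(2,5)-V-[LIM]-x(3)-{GP}, translated piece by piece:
    #   [CL]    -> [CL]      C or L
    #   x(2,5)  -> .{2,5}    any 2-5 residues
    #   [LIM]   -> [LIM]     L or I or M
    #   x(3)    -> .{3}      any 3 residues
    #   {GP}    -> [^GP]     any residue except G and P
    motif = re.compile(r"[CL].{2,5}V[LIM].{3}[^GP]")

    seq = "MKCAAPLVVIMAASWK"          # hypothetical test sequence
    m = motif.search(seq)
    if m:
        print(m.start(), m.group())   # 2 CAAPLVVIMAAS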

14
Statistical Methods
  • Expectation Maximization Algorithm
    (deterministic)
  • Gibbs Sampler (stochastic)
  • Hidden Markov Models (stochastic)
  • For HMMs:
  • --Advantages: theoretical foundation, no
    sequence ordering required, no insertion and deletion
    penalties, and prior information can be used.
  • --Disadvantage: a large number of sequences is
    needed for training.

15
EM Algorithm
  • 1. Make an initial guess and a prob. (or odds score)
    distribution table.
  • 2. E-step: estimate the prob. (likelihood) of the motif at any
    position in each sequence according to the table.
  • 3. M-step: add new counts of bases or amino acids
    to each position according to the probs., and update
    the table.
  • 4. Go to the E-step until converged.

e.g. MEME (Multiple EM for Motif Elicitation); a sketch follows.
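A minimal sketch of this E-step/M-step loop for motif discovery over a
DNA alphabet; the motif width, pseudocounts, and one-occurrence-per-
sequence model are simplifying assumptions, not MEME itself:

    ALPHABET = "ACGT"
    W = 3  # assumed motif width

    def em_motif(seqs, n_iter=50):
        # 1. initial guess: uniform probability table (W x 4)
        table = [{a: 0.25 for a in ALPHABET} for _ in range(W)]
        for _ in range(n_iter):
            # 2. E-step: likelihood of the motif starting at each position
            weights = []
            for s in seqs:
                like = []
                for i in range(len(s) - W + 1):
                    p = 1.0
                    for j in range(W):
                        p *= table[j][s[i + j]]
                    like.append(p)
                total = sum(like)
                weights.append([p / total for p in like])
            # 3. M-step: add expected counts and update the table
            counts = [{a: 0.1 for a in ALPHABET} for _ in range(W)]  # pseudocounts
            for s, ws in zip(seqs, weights):
                for i, w in enumerate(ws):
                    for j in range(W):
                        counts[j][s[i + j]] += w
            table = [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]
        return table

    seqs = ["ACGTAG", "TTACGC", "GACGTT"]  # toy sequences sharing ACG
    for row in em_motif(seqs):
        print({a: round(p, 2) for a, p in row.items()})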
16
Gibbs Sampler
Goal: the ratio of the motif prob. to the background prob. is
maximum.
[Figure: residue frequencies are estimated from the 2nd-9th
sequences (20 residues); the score of the background is
calculated without the motif; the left-out 1st sequence is
then scanned.]
Randomly select a sequence to leave out, and repeat
steps A to C > 100 times. A sketch follows.
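A minimal sketch of the leave-one-out sampling loop described above:
hold one sequence out, build the motif table from the others, then
resample the held-out sequence's motif position in proportion to the
motif/background probability ratio. The motif width, pseudocounts,
and uniform background are assumptions for illustration:

    import random

    ALPHABET = "ACGT"
    W = 3  # assumed motif width

    def profile(seqs, starts, skip):
        counts = [{a: 0.5 for a in ALPHABET} for _ in range(W)]  # pseudocounts
        for k, (s, st) in enumerate(zip(seqs, starts)):
            if k == skip:
                continue
            for j in range(W):
                counts[j][s[st + j]] += 1
        return [{a: c[a] / sum(c.values()) for a in ALPHABET} for c in counts]

    def gibbs_motif(seqs, n_iter=200, background=0.25):
        starts = [random.randrange(len(s) - W + 1) for s in seqs]
        for _ in range(n_iter):
            k = random.randrange(len(seqs))     # leave one sequence out
            table = profile(seqs, starts, skip=k)
            s = seqs[k]
            # ratio of motif prob. to background prob. at every position
            ratios = []
            for i in range(len(s) - W + 1):
                r = 1.0
                for j in range(W):
                    r *= table[j][s[i + j]] / background
                ratios.append(r)
            # resample the motif start in proportion to the ratios
            starts[k] = random.choices(range(len(ratios)), weights=ratios)[0]
        return starts

    random.seed(0)
    seqs = ["ACGTAG", "TTACGC", "GACGTT"]
    print(gibbs_motif(seqs))  # motif start position in each sequence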
17
Hidden Markov Model
By the Forward-backward, Viterbi, and Baum-Welch (or EM)
algorithms: (1) find the prob. of the observation
sequence given the model; (2) find the most
likely hidden state sequence given the model and
observations; (3) adjust the parameters to maximize
their prob. given the observations.
--Match states are consensus positions, emitting
bases (4) or amino acids (20).
[Figure: profile HMM over consensus residues K L A ... D,
with insert and delete (gap) states between match states]
18
S = (S0, ..., SN): state index; s = {s1, ..., sK}: the K symbols
emitted in a state; O = (o1, ..., oT): observation sequence;
Q = (q1, ..., qT): state sequence (oi = sj and qj = Sk for some
symbol and state; e.g. Si ∈ {1, 2, ..., N}, s = {A, T, G, C}).



19
Key Tasks
  • (1) Find the prob. of the observation sequence:
  • -- given the model λ and O, compute P(O|λ)
  • (2) Find the most likely hidden state sequence:
  • -- given λ and O, find argmaxQ P(Q|O,λ)
  • (3) Adjust the parameters to maximize their prob.
    given the observations O = (o1, ..., oT):
  • -- given O, find argmaxλ P(O|λ)
  • -- by the Forward-backward, Viterbi, and Baum-Welch (or
    EM) algorithms. Note Q = (q1, ..., qT).

20
(1). Forward Algorithm
αt(j) = P(o1, ..., ot, qt in Sj | λ)
α0(j) = 1 if j is a START state, 0 otherwise
Recursion: αt(j) = [Σi=1..N αt-1(i) P(Sj|Si)] P(ot|Sj)
Compute α recursively. Computation complexity:
O(N²T). Task (1): P(O|λ) = Σj=1..N αT(j).
A sketch follows.
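A minimal sketch of the forward recursion, assuming a hypothetical
two-state DNA model in which a start distribution plays the role of
the START state:

    A = [[0.9, 0.1],       # A[i][j] = P(Sj | Si), transition probs
         [0.2, 0.8]]
    B = [{'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},   # B[j][o] = P(o | Sj)
         {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}]
    start = [0.5, 0.5]     # P(q1 in Sj), i.e. transitions out of START

    def forward(obs):
        N = len(A)
        # alpha_1(j) = start(j) * P(o1 | Sj)
        alpha = [start[j] * B[j][obs[0]] for j in range(N)]
        for o in obs[1:]:
            # alpha_t(j) = [sum_i alpha_{t-1}(i) P(Sj|Si)] * P(o_t|Sj)
            alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                     for j in range(N)]
        return alpha       # alpha_T(j) for each state j

    obs = "GGCACTGAA"
    alpha_T = forward(obs)
    print(sum(alpha_T))    # Task (1): P(O | lambda) = sum_j alpha_T(j)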
21
[Figure: worked forward-recursion trellis for a two-state model
(State 1, State 2)]
22
Backward Algorithm
βt(j) = P(ot+1, ..., oT | qt in Sj, λ)
βT(j) = 1 if j is an END state, 0 otherwise
Compute β recursively. Computation complexity:
O(N²T). P(O|λ) = Σj=1..N β0(j). A sketch follows.
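A minimal sketch of the backward recursion over the same hypothetical
model as the forward sketch; the check at the end confirms that it
reproduces the forward value of P(O|λ):

    A = [[0.9, 0.1],
         [0.2, 0.8]]
    B = [{'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
         {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}]
    start = [0.5, 0.5]

    def backward(obs):
        N = len(A)
        beta = [1.0] * N   # beta_T(j) = 1, assuming any state may end
        for o in reversed(obs[1:]):
            # beta_t(i) = sum_j P(Sj|Si) P(o_{t+1}|Sj) beta_{t+1}(j)
            beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N))
                    for i in range(N)]
        return beta        # beta_1(j) for each state j

    obs = "GGCACTGAA"
    beta_1 = backward(obs)
    # P(O | lambda) = sum_j P(q1 in Sj) P(o1|Sj) beta_1(j), matching forward
    print(sum(start[j] * B[j][obs[0]] * beta_1[j] for j in range(2)))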
23
(No Transcript)
24
(2). The Viterbi Algorithm
  • Replace summation with maximization in the Forward
    Algorithm, i.e.
  • Define Vt(i) = maxq1,...,qt-1 P(o1, ..., ot, qt in Si | λ)
  • Recursive computation:
  • Vt(j) = maxi=1,...,N Vt-1(i) P(Sj|Si) P(ot|Sj) for t > 0
  • Task (2): maxQ P(Q|O,λ) = max1≤j≤N VT(j)
  • (trace back for the maximum; see the sketch below)
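A minimal sketch of the Viterbi recursion with traceback, again over
the same hypothetical two-state model:

    A = [[0.9, 0.1],
         [0.2, 0.8]]
    B = [{'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
         {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}]
    start = [0.5, 0.5]

    def viterbi(obs):
        N = len(A)
        V = [start[j] * B[j][obs[0]] for j in range(N)]  # V_1(j)
        back = []                                        # argmax pointers
        for o in obs[1:]:
            # V_t(j) = max_i V_{t-1}(i) P(Sj|Si) * P(o_t|Sj), for t > 1
            ptr = [max(range(N), key=lambda i: V[i] * A[i][j])
                   for j in range(N)]
            V = [V[ptr[j]] * A[ptr[j]][j] * B[j][o] for j in range(N)]
            back.append(ptr)
        # trace back from the maximizing final state
        q = max(range(N), key=lambda j: V[j])
        path = [q]
        for ptr in reversed(back):
            q = ptr[q]
            path.append(q)
        return list(reversed(path))

    print(viterbi("GGCACTGAA"))  # most likely hidden state sequence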

25
Find the state sequence Q which maximizes P(Q|O,λ).
26
(3). Estimation of Parameters: Optimization
Algorithm
  • Probability of a transition from Si to Sj at time
    t, given O and λ:
  • ξt(i,j) = P(qt in Si, qt+1 in Sj | O, λ)
  •         = αt(i) P(Sj|Si) P(ot+1|Sj) βt+1(j) / P(O|λ)

[Figure: arc from Si to Sj carrying weight P(Sj|Si) P(ot+1|Sj)]
Computed using the Forward-Backward Algorithm.
27
Parameters of HMM

  • aij = P(Sj|Si)
       = expected no. of transitions from Si to Sj
         / expected no. of transitions from Si

  • bj(k) = P(sk|Sj)
       = expected no. of times in state j with symbol k
         / expected no. of times in state j

Then the parameters of the HMM are λ = {A, B}, where
A = {aij} and B = {bj(k)} (e.g. k ∈ {A, T, G, C}).
28
Baum-Welch Algorithm (EM Algorithm)
  • 1. Initialize λ = {A, B}.
  • 2. Compute α, β, ξ (E-step).
  • 3. Update λ = {A, B} from ξ (M-step).
  • 4. If not converged, go to 2.
  • -- It can be shown that P(O|λ) is strictly increasing
    and converges to a local mode. (This solves Task 3;
    a sketch follows.)

Questions: how many states to put in the model, how to
initialize the parameters, and how to avoid local modes.
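A minimal self-contained sketch of the Baum-Welch loop for the same
hypothetical two-state DNA model; numerical scaling is omitted for
brevity, so it is only suitable for short toy sequences:

    SYM = "ACGT"
    A = [[0.7, 0.3], [0.4, 0.6]]                    # initial guess for A
    B = [{'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},  # initial guess for B
         {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}]
    start = [0.5, 0.5]
    N = 2

    def baum_welch(obs, n_iter=10):
        global A, B
        T = len(obs)
        for _ in range(n_iter):
            # E-step: forward and backward tables
            al = [[0.0] * N for _ in range(T)]
            al[0] = [start[j] * B[j][obs[0]] for j in range(N)]
            for t in range(1, T):
                al[t] = [sum(al[t-1][i] * A[i][j] for i in range(N))
                         * B[j][obs[t]] for j in range(N)]
            be = [[0.0] * N for _ in range(T)]
            be[T-1] = [1.0] * N
            for t in range(T - 2, -1, -1):
                be[t] = [sum(A[i][j] * B[j][obs[t+1]] * be[t+1][j]
                             for j in range(N)) for i in range(N)]
            pO = sum(al[T-1])
            # expected counts from xi_t(i,j) and gamma_t(i)
            num = [[0.0] * N for _ in range(N)]
            den = [0.0] * N
            emit = [{s: 1e-6 for s in SYM} for _ in range(N)]  # pseudocounts
            for t in range(T - 1):
                for i in range(N):
                    gamma = al[t][i] * be[t][i] / pO
                    den[i] += gamma
                    emit[i][obs[t]] += gamma
                    for j in range(N):
                        num[i][j] += (al[t][i] * A[i][j] * B[j][obs[t+1]]
                                      * be[t+1][j] / pO)
            for i in range(N):
                emit[i][obs[T-1]] += al[T-1][i] * be[T-1][i] / pO
            # M-step: a_ij = expected transitions i->j / expected from i
            A = [[num[i][j] / den[i] for j in range(N)] for i in range(N)]
            # b_j(k) = expected times in j with symbol k / expected times in j
            B = [{s: emit[j][s] / sum(emit[j].values()) for s in SYM}
                 for j in range(N)]
            print(round(pO, 8))  # P(O | lambda) should be non-decreasing

    baum_welch("ATATCGCGATAT")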
29
Application to MSA by HMM
  • For a set of given sequences:
  • (1). Estimate the HMM parameters by the Baum-Welch
    algorithm.
  • (2). Align each sequence to the HMM by the Viterbi
    algorithm.
  • (3). The match states of the HMM provide the columns
    of the resulting multiple alignment.

30
Application of MSA
  • Calculate conservation (an index) to identify motifs
    (e.g. the frequency of each AA at each site)
  • Identify the specific properties of sequences to
    predict functions and structures (e.g.
    hydrophobicity)
  • Calculate variation (the variation frequency of AA
    at each site) to predict functions and
    structures (e.g. the surface of a globular protein)
  • Predict secondary structure (α-helix, β-sheet,
    β-turn, coil)
  • Predict interactions by computing covariation
    and the evolutionary tree
  • Use the correlation Σa,b pij(a,b) log[pij(a,b) /
    (pi(a) pj(b))], with frequencies estimated over all
    sequences k, to predict the relation between
    site i and site j (see the sketch below)
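A minimal sketch of the covariation score above: the mutual
information between two alignment columns i and j, estimated from the
pair and single-site frequencies over all sequences (the example
alignment reuses the one from slide 3):

    from collections import Counter
    from math import log

    def mutual_information(alignment, i, j):
        n = len(alignment)
        pi = Counter(seq[i] for seq in alignment)             # p_i(a)
        pj = Counter(seq[j] for seq in alignment)             # p_j(b)
        pij = Counter((seq[i], seq[j]) for seq in alignment)  # p_ij(a,b)
        mi = 0.0
        for (a, b), c in pij.items():
            p_ab = c / n
            mi += p_ab * log(p_ab / ((pi[a] / n) * (pj[b] / n)))
        return mi  # high MI suggests correlated (possibly interacting) sites

    alignment = ["MWPILLL", "MLR-LL-", "MK-ILL-", "MPPVLIL"]
    print(mutual_information(alignment, 1, 2))  # covariation of sites 2 and 3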