Title: Recent Progress in Multiple Sequence Alignments: A Survey
1Recent Progress in Multiple Sequence Alignments: A Survey
Cédric Notredame
2Our Scope
What Are the Existing Methods?
How Do They Work?
-Assembly Algorithms
-Weighting Schemes
When Do They Work?
Which Future?
3Outline
-Introduction
-A taxonomy of the existing Packages
-A few algorithms
-Performance Comparison using BaliBase
4Introduction
5What Is A Multiple Sequence Alignment?
An MSA is a MODEL
It Indicates the RELATIONSHIP between residues of different sequences.
It REVEALS
-Similarities
-Inconsistencies
6How Can I Use A Multiple Sequence Alignment?
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
trybr KKDSNAPKRAMTSFMFFSSDFRS----KHSDLS-IVEMSKAAGAAWKELGP
mouse -----KPKRPRSAYNIYVSESFQ----EAKDDS-AQGKLKLVNEAWKNLSP

chite AATAKQNYIRALQEYERNGG-
wheat ANKLKGEYNKAIAAYNKGESA
trybr AEKDKERYKREM---------
mouse AKDDRIRYDNEMKSWEEQMAE

Uses of the alignment:
-Extrapolation
-Motifs/Patterns
-Profiles
-Phylogeny
-Struc. Prediction
Multiple Alignments Are CENTRAL to MOST Bioinformatics Techniques.
7How Can I Use A Multiple Sequence Alignment?
Multiple Alignment Is the Most INTEGRATIVE Method Available Today.
We Need MSA to INCORPORATE existing DATA
8Why Is It Difficult To Compute A Multiple Sequence Alignment?
A CROSSROAD PROBLEM
9Why Is It Difficult To Compute A Multiple Sequence Alignment?
BIOLOGY
COMPUTATION
CIRCULAR PROBLEM: Good Alignment <-> Good Sequences
10A Taxonomy of Multiple Sequence Alignment Methods
11Grouping According to the assembly Algorithm
12Simultaneous As Opposed to Progressive
Simultaneous: they simultaneously use all the information
Exact As Opposed to Heuristic
Heuristics: cut corners, like Blast vs. SW
Heuristics: do not guarantee an optimal solution
Stochastic As Opposed to Deterministic
Stochastic: contain an element of randomness
Stochastic: example of a Monte Carlo surface estimation
Iterative As Opposed to Non-Iterative
Iterative: run the same algorithm many times
Iterative: most stochastic methods are iterative
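The Monte Carlo surface estimation mentioned above can be sketched as follows. This is only an illustration of a stochastic procedure, not taken from the slides: the function name, the unit disc used as the shape, and the fixed seed are all assumptions.

```python
import random

def monte_carlo_area(inside, x_range, y_range, n_points=100_000, seed=0):
    """Estimate the area of a shape by sampling random points in a bounding box."""
    rng = random.Random(seed)  # fixed seed (assumption) so the run is repeatable
    (x0, x1), (y0, y1) = x_range, y_range
    hits = 0
    for _ in range(n_points):
        x = rng.uniform(x0, x1)
        y = rng.uniform(y0, y1)
        if inside(x, y):
            hits += 1
    box_area = (x1 - x0) * (y1 - y0)
    # Fraction of points falling inside the shape, scaled by the box area.
    return box_area * hits / n_points

def unit_disc(x, y):
    """Hypothetical test shape: the disc of radius 1 centred on the origin."""
    return x * x + y * y <= 1.0

estimate = monte_carlo_area(unit_disc, (-1, 1), (-1, 1))
print(estimate)  # close to pi, but never exactly the same for another seed
```

Running it again with another seed gives a slightly different answer, which is the "element of randomness" that separates stochastic from deterministic methods.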
13OMA, Praline, MAFFT
14Stochastic
SAGA
16Grouping According to the Objective Function
17Scoring an Alignment: Evolutionary Based Methods
BIOLOGY: How many events separate my sequences?
Such an evaluation relies on a biological model.
COMPUTATION: Every position must be independent
18REAL Tree
Model: ALL the sequences evolved from the same ancestor
[Figure: tree with nodes labelled A, C, A, A, C; Tree Cost = 1]
PROBLEM: We do not know the true tree
19STAR Tree
Model: ALL the sequences have the same ancestor
[Figure: star tree with nodes labelled A, C, A, A, C, A; Star Tree Cost = 2]
PROBLEM: the star tree is phylogenetically wrong
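A minimal sketch of a star tree cost, assuming the cost of a column is simply the number of leaves that differ from the residue placed at the center. The column A, C, A, A, C and the center A are assumptions read from the figure labels; the function name is hypothetical.

```python
def star_tree_cost(column, center=None):
    """Count the substitutions on a star tree: leaves that differ from the center."""
    residues = [r for r in column if r != "-"]  # gaps are ignored (simplification)
    if not residues:
        return 0
    if center is None:
        # Cheapest choice of center: the most frequent residue in the column.
        center = max(set(residues), key=residues.count)
    return sum(1 for r in residues if r != center)

cost = star_tree_cost("ACAAC", center="A")
print(cost)  # 2: each C is counted as a separate mutation from the center
```

On a real tree where the two C residues share an ancestor, a single mutation would explain the same column, which is why the star tree overcounts.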
20Sums of Pairs
Model: Every sequence is the ancestor of every sequence
PROBLEM
-over-estimation of the mutation costs
-Requires a weighting scheme
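The over-estimation can be seen on a single column. The sketch below assumes a unit cost per mismatching pair and optional pair weights; the function name, the cost values, and the example column are illustrative assumptions, and gaps are treated like any other symbol.

```python
from itertools import combinations

def sum_of_pairs_cost(column, weights=None, mismatch=1.0):
    """Sum the substitution cost over every pair of residues in one column."""
    weights = weights or {}
    total = 0.0
    for k, l in combinations(range(len(column)), 2):
        if column[k] != column[l]:
            # Pairs without an explicit weight count with weight 1.
            total += weights.get((k, l), 1.0) * mismatch
    return total

unweighted = sum_of_pairs_cost("ACAAC")
down_weighted = sum_of_pairs_cost("ACAAC", weights={(0, 1): 0.5})
print(unweighted, down_weighted)  # 6.0 5.5
```

The same column costs 6 under sums of pairs because every A/C pair is charged separately, which motivates the weighting scheme.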
21Sums of Pairs: Some of its limitations (Durbin, p140)
22Sums of Pairs: Some of its limitations (Durbin, p140)
Conclusion: The more Leucine, the cheaper it gets to add a Glycine to the column...
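The conclusion above can be checked numerically. The sketch assumes substitution scores of 5 for Leucine/Leucine and -4 for Glycine/Leucine (BLOSUM50-like values, stated here as assumptions) and compares a column of N Leucines with the same column where one Leucine is replaced by a Glycine.

```python
def sp_column_score(n_leu, n_gly, s_ll=5, s_gl=-4):
    """Sum-of-pairs score of a column with n_leu Leucines and at most one Glycine."""
    assert n_gly in (0, 1), "sketch only handles zero or one Glycine"
    pairs_ll = n_leu * (n_leu - 1) // 2  # Leucine/Leucine pairs
    pairs_gl = n_leu * n_gly             # Glycine/Leucine pairs
    return pairs_ll * s_ll + pairs_gl * s_gl

def relative_penalty(n):
    """Score lost by swapping one Leucine for a Glycine, relative to the all-Leucine score."""
    all_leu = sp_column_score(n, 0)
    one_gly = sp_column_score(n - 1, 1)
    return (all_leu - one_gly) / all_leu

for n in (3, 5, 10, 50):
    print(n, relative_penalty(n))  # shrinks like 18 / (5 * n)
```

The absolute loss is 9(N-1) while the all-Leucine score grows like 5N(N-1)/2, so the relative penalty falls as 18/(5N), whereas one would expect stronger evidence of conservation to make the outlier more costly.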
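23Entropy Based Functions
Model: Minimize the entropy (variety) in each Column
$S_i = -\sum_{a} c_{ia} \log p_{ia}$, with $p_{ia} = c_{ia} / \sum_{a'} c_{ia'}$
$c_{ia}$: number of residues of type a (e.g. Alanine) in column i
$S_i$: Score of column i
$a$: ranges over the alphabet
$p_{ia}$ can incorporate pseudocounts
$S_i = 0$ if the column is conserved
PROBLEM
-requires a simultaneous alignment
-assumes independent sequences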
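A sketch of an entropy based column score, assuming the form S_i = -sum over a of c_ia * log(p_ia), where c_ia is the count of residue a in column i and p_ia its frequency. The exact formula on the slide is not recoverable, so this form, the function name, and the pseudocount handling are assumptions.

```python
import math
from collections import Counter

def column_entropy_score(column, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=0.0):
    """Entropy-style score of one column; 0 means the column is fully conserved."""
    counts = Counter(r for r in column if r in alphabet)  # gaps are skipped
    total = sum(counts.values()) + pseudocount * len(alphabet)
    if total == 0:
        return 0.0
    score = 0.0
    for a, c in counts.items():
        p = (c + pseudocount) / total  # frequency, optionally smoothed
        score -= c * math.log(p)
    return score

print(column_entropy_score("LLLL"))  # 0.0, conserved column
print(column_entropy_score("LLLG"))  # positive, more variety
```

Because each column is scored on its own, the score only makes sense once all sequences are aligned together, which is the "simultaneous alignment" requirement listed above.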
24Consistency Based Functions
Model: Maximise the consistency (agreement) with a list of constraints (alignments)
k and l are sequences, i is a column
A pair of residues scores when the two residues are found aligned in the list of constraints
PROBLEM
-requires a list of constraints
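A minimal sketch of a consistency score, assuming each constraint is a pair of residues ((sequence, residue index), (sequence, residue index)) with 0-based indices and unit weight. The data layout and function name are assumptions made for illustration.

```python
from itertools import combinations

def consistency_score(alignment, constraints):
    """Count aligned residue pairs that also appear in the list of constraints."""
    positions = [0] * len(alignment)  # next residue index in each sequence
    score = 0
    for column in zip(*alignment):
        residues = []
        for k, symbol in enumerate(column):
            if symbol != "-":
                residues.append((k, positions[k]))
                positions[k] += 1
        # residues is ordered by sequence index, so pairs come out as (k, l) with k < l.
        for pair in combinations(residues, 2):
            if pair in constraints:
                score += 1
    return score

constraints = {((0, 0), (1, 0)), ((0, 1), (1, 1))}
print(consistency_score(["AC", "AC"], constraints))    # 2
print(consistency_score(["AC-", "-AC"], constraints))  # 0
```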
26A few Multiple Sequence Alignment Algorithms
27A Few Algorithms
MSA and DCA
POA
ClustalW
MAFFT
Dialign II
Prrp
SAGA
GIBBS Sampler
28Simultaneous MSA and DCA
29Simultaneous Alignments: MSA
1) Set Bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment within the Hyperspace
-Few Small Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.
30MSA: the Carrillo and Lipman Bounds
chite ---ADKPKRPLSAYMLWLNSARESIKRENPDFK-VTEVAKKGGELWRGLKD
wheat --DPNKPKRAPSAFFVFMGEFREEFKQKNPKNKSVAAVGKAAGERWKSLSE
Pairwise projection of sequences k and l
31MSA: the Carrillo and Lipman Bounds
$a(k,l)$: score of the projection of k and l in the optimal MSA
$\sum_{x<y} a(x,y)$: score of the complete multiple alignment
$\hat{a}(k,l)$: score of the optimal pairwise alignment of k and l
32MSA: the Carrillo and Lipman Bounds
LM: a lower bound for the score of the complete MSA
$LM \le \sum_{x<y} \hat{a}(x,y) - (\hat{a}(k,l) - a(k,l))$
$a(k,l) \ge LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y)$
33MSA: the Carrillo and Lipman Bounds
$LM + \hat{a}(k,l) - \sum_{x<y} \hat{a}(x,y) \le a(k,l) \le \hat{a}(k,l)$
LM can be measured on ANY heuristic alignment, with pairwise projections $\ddot{a}(x,y)$:
$LM = \sum_{x<y} \ddot{a}(x,y)$
The better LM, the tighter the bounds
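The bound can be evaluated directly once the pairwise scores are known. The sketch below assumes two dictionaries keyed by sequence pair: the optimal pairwise scores and the projection scores of some heuristic multiple alignment. The function name and the example numbers are made up for illustration.

```python
def carrillo_lipman_bounds(optimal_pairwise, heuristic_projection):
    """Lower bound on the projection score a(k,l) of every pair in the optimal MSA."""
    lm = sum(heuristic_projection.values())        # LM, measured on the heuristic alignment
    total_optimal = sum(optimal_pairwise.values())  # sum of the optimal pairwise scores
    return {
        pair: lm + score - total_optimal
        for pair, score in optimal_pairwise.items()
    }

optimal = {(0, 1): 10, (0, 2): 8, (1, 2): 9}    # hypothetical optimal pairwise scores
heuristic = {(0, 1): 9, (0, 2): 7, (1, 2): 9}   # hypothetical heuristic projections
bounds = carrillo_lipman_bounds(optimal, heuristic)
print(bounds)  # {(0, 1): 8, (0, 2): 6, (1, 2): 7}
```

Each pair may fall at most 2 points below its pairwise optimum here, because the heuristic alignment is only 2 points short of the sum of pairwise optima. A better heuristic raises LM and narrows that margin.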
34MSA: the Carrillo and Lipman Bounds
[Figure: two dynamic programming matrices with axes 0..M and 0..N]
Forward: Best(0-i, 0-j)
Backward: Best(M-i, N-j)
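A sketch of how forward and backward matrices can be combined for one pair of sequences: the best alignment forced through cell (i, j) scores Forward[i][j] + Backward[i][j], and only cells reaching the lower bound need to be kept. The match, mismatch and linear gap scores are assumptions, as are the function names.

```python
def forward_matrix(x, y, match=1, mismatch=-1, gap=-2):
    """F[i][j] = best score for aligning x[:i] with y[:j] (global, linear gaps)."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F

def backward_matrix(x, y, **scores):
    """B[i][j] = best score for aligning x[i:] with y[j:]."""
    # Aligning suffixes is the forward problem on the reversed sequences.
    R = forward_matrix(x[::-1], y[::-1], **scores)
    m, n = len(x), len(y)
    return [[R[m - i][n - j] for j in range(n + 1)] for i in range(m + 1)]

def cells_within_bound(x, y, lower_bound, **scores):
    """Cells (i, j) crossed by at least one alignment scoring lower_bound or more."""
    F = forward_matrix(x, y, **scores)
    B = backward_matrix(x, y, **scores)
    return {
        (i, j)
        for i in range(len(x) + 1)
        for j in range(len(y) + 1)
        if F[i][j] + B[i][j] >= lower_bound
    }

kept = cells_within_bound("ACGT", "ACGT", lower_bound=4)
print(sorted(kept))  # only the main diagonal survives a bound equal to the optimum
```

Lowering the bound widens the band of surviving cells, which is where the memory and CPU cost of the exact approach comes from.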
35Simultaneous Alignments: MSA
1) Set Bounds on each pair of sequences (Carrillo and Lipman)
2) Compute the multiple alignment within the Hyperspace
-Few Small Closely Related Sequences.
-Memory and CPU hungry
-Do Well When They Can Run.
36Simultaneous Alignments DCA
37Simultaneous With a New Sequence Representation: POA, Partial Order Graph
40POA
POA makes it possible to represent complex relationships
-domain deletion
-domain inversions
41Progressive ClustalW
42Progressive Alignment ClustalW
Feng and Doolittle, 1988; Taylor, 198ç
Clustering
43Progressive Alignment ClustalW
44Tree Based Alignment: Recursive Algorithm
Align (Node N)
  if (N->left_child is a Node) A1 = Align (N->left_child)
  else if (N->left_child is a Sequence) A1 = N->left_child
  if (N->right_child is a Node) A2 = Align (N->right_child)
  else if (N->right_child is a Sequence) A2 = N->right_child
  Return dp_alignment (A1, A2)
[Figure: guide tree with leaves A, D, F, G, C, E, B]
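The recursion above can be made runnable with a small profile aligner. This is a sketch under stated assumptions, not ClustalW itself: the guide tree is a nested tuple of strings, dp_alignment is a plain Needleman-Wunsch on alignment columns, and the scores (match 1, mismatch -1, residue against existing gap -1, new gap column -2) are arbitrary choices.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-1):
    """Average pairwise score between two alignment columns."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" or b == "-":
                total += 0 if a == b else gap
            else:
                total += match if a == b else mismatch
    return total / (len(col_a) * len(col_b))

def dp_alignment(A1, A2, gap=-2):
    """Align two alignments (lists of equal-length rows) column against column."""
    cols1, cols2 = list(zip(*A1)), list(zip(*A2))
    n, m = len(cols1), len(cols2)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = i * gap
    for j in range(1, m + 1):
        S[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(S[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]),
                          S[i - 1][j] + gap, S[i][j - 1] + gap)
    out1, out2, i, j = [], [], n, m
    while i > 0 or j > 0:  # traceback from the bottom-right corner
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + column_score(cols1[i - 1], cols2[j - 1]):
            out1.append(cols1[i - 1]); out2.append(cols2[j - 1]); i -= 1; j -= 1
        elif i > 0 and S[i][j] == S[i - 1][j] + gap:
            out1.append(cols1[i - 1]); out2.append(("-",) * len(A2)); i -= 1
        else:
            out1.append(("-",) * len(A1)); out2.append(cols2[j - 1]); j -= 1
    columns = [c1 + c2 for c1, c2 in zip(reversed(out1), reversed(out2))]
    return ["".join(row) for row in zip(*columns)]

def align(node):
    """Post-order walk of the guide tree: leaves are sequences, nodes are (left, right)."""
    if isinstance(node, str):
        return [node]
    left, right = node
    return dp_alignment(align(left), align(right))

print(align((("ACGT", "AGT"), "ACGT")))  # ['ACGT', 'A-GT', 'ACGT']
```

Gaps placed while aligning the first pair are never revisited when the third sequence is added, which is why the result depends on the order imposed by the tree.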
45Progressive Alignment ClustalW
-Depends on the CHOICE of the sequences.
-Depends on the ORDER of the sequences (Tree).
-Depends on the PARAMETERS
- Substitution Matrix.
- Penalties (Gop, Gep).
- Sequence Weight.
- Tree making Algorithm.
46Weighting Within ClustalW
47Position Specific GOP
48ClustalW is the most Popular Method
49Progressive Alignment With a Heuristic DP MAFFT
51Progressive And Consistency Based: Dialign II
52 3) Assemble the alignment according to the segment pairs.
53-May Align Too Few Residues
-No Gap Penalty
-Does well with ESTs
54Progressive And Consistency Based: T-COFFEE
55Mixing Local and Global Alignments
Local Alignment
Global Alignment
Extension
Multiple Sequence Alignment
56What is a library?
Library with 3 sequences:
3
Seq1 anotherseq
Seq2 atsecondone
Seq3 athirdone
1 2
1 1 25
1 3
3 8 70
.
Library with 2 sequences:
2
Seq1 MySeq
Seq2 MyotherSeq
1 2
1 1 25
3 8 70
.
Extension: T-Coffee
Library Based Multiple Sequence Alignment
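A sketch of library extension, assuming the library is a dictionary that maps a pair of residues, each written (sequence number, residue number) as in the listings above, to a weight. The extension rule used here adds, for every third residue aligned with both, the smaller of the two weights; the data layout, the function name, and the example weights are assumptions.

```python
from collections import defaultdict

def extend_library(library):
    """Add to each residue pair the support it receives through a third residue."""
    neighbours = defaultdict(dict)
    for pair, weight in library.items():
        a, b = tuple(pair)
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    extended = dict(library)
    for z, links in neighbours.items():  # z is the intermediate residue
        partners = list(links)
        for idx, x in enumerate(partners):
            for y in partners[idx + 1:]:
                if x[0] == y[0]:
                    continue  # two residues of the same sequence cannot be aligned
                key = frozenset((x, y))
                extended[key] = extended.get(key, 0) + min(links[x], links[y])
    return extended

A, B, C = (1, 1), (2, 1), (3, 1)  # residue 1 of sequences 1, 2 and 3
library = {frozenset((A, B)): 25, frozenset((A, C)): 70, frozenset((B, C)): 40}
extended = extend_library(library)
print(extended[frozenset((A, B))])  # 65 = 25 + min(70, 40)
```

After extension the weight of a pair reflects how well it agrees with the rest of the library, so a pairwise alignment step can use information from all the other sequences.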
57Iterative
58Iterative Methods
7.16.1 Progressive
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
59Iterative Methods Prrp
7.16.2 Prrp
Initial Alignment
Outer Iteration: Tree and weights computation
Inner Iteration: Realign two sub-groups
Alignment converged? NO: return to the Inner Iteration. YES: continue.
Weights converged? NO: return to the Outer Iteration. YES: End.
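The flow above is a doubly nested loop. The skeleton below only shows that control structure: compute_weights, realign and score are hypothetical callbacks supplied by the caller, and the tolerance and iteration caps are assumptions, so this is not the Prrp implementation.

```python
def doubly_nested_refinement(alignment, compute_weights, realign, score,
                             tol=1e-6, max_outer=50, max_inner=50):
    """Inner loop refines the alignment; outer loop refreshes the weights."""
    weights = compute_weights(alignment)
    for _ in range(max_outer):
        current = score(alignment, weights)
        for _ in range(max_inner):
            candidate = realign(alignment, weights)  # e.g. realign two sub-groups
            new = score(candidate, weights)
            if new <= current + tol:
                break  # alignment converged
            alignment, current = candidate, new
        new_weights = compute_weights(alignment)
        if all(abs(a - b) <= tol for a, b in zip(new_weights, weights)):
            break  # weights converged
        weights = new_weights
    return alignment, weights

# Toy stand-ins: the "alignment" is a number that should move towards 10.
result, final_weights = doubly_nested_refinement(
    0.0,
    compute_weights=lambda aln: [1.0],
    realign=lambda aln, w: aln + (10.0 - aln) / 2,
    score=lambda aln, w: -(aln - 10.0) ** 2,
)
print(result)  # close to 10
```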
60Iterative Stochastic: SAGA, The Genetic Algorithm
64Automatic scheduling of the operators
66Weighting Schemes
67The Problem
The sequences Contain Correlated Information
Most scoring Schemes Ignore this Correlation
68Weighting Sequence Pairs with a Tree: Carrillo and Lipman, Rationale I
69QUESTION: Which Weight for a Pair of Sequences?
E = EDGE
P = Evolutionary Path from A to X
E must contribute the same weight to every path P that goes through it.
[Figure: tree with leaves A, G, F, C, B, D, E]
All the weights using E must sum to 1: $\sum_{P} W_{P,E} = 1$.
70USAGE
71PROBLEM: Weight Depends Only on the Tree Topology
72Weighting Sequences with a Tree: Clustal W Weights
73QUESTION: Which Weight for Sequences?
[Figure: tree with leaves A, F, C, B, G, D, E]
74USAGE
75PROBLEM: Overweighting of distant sequences
[Figure: tree with leaves G, F, D, E and a distant leaf C]
-C will dominate the alignment
-C will be very difficult to align
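A sketch of tree based sequence weights in the style usually attributed to Clustal W, assuming each branch length is shared equally among the leaves below it and the shares are summed from the root to each leaf. The tree encoding (a leaf is a name, an internal node is a tuple of (subtree, branch length) pairs), the branch lengths, and the function name are assumptions.

```python
def tree_weights(tree):
    """Weight of each leaf: sum over its branches of length / leaves sharing the branch."""
    weights = {}

    def count_leaves(node):
        if isinstance(node, str):
            return 1
        return sum(count_leaves(child) for child, _ in node)

    def visit(node, inherited):
        if isinstance(node, str):
            weights[node] = inherited
            return
        for child, length in node:
            # The branch above `child` is split between all leaves under it.
            visit(child, inherited + length / count_leaves(child))

    visit(tree, 0.0)
    return weights

close_pair = (("A", 0.1), ("B", 0.1))      # two closely related sequences
tree = ((close_pair, 0.2), ("C", 0.8))     # C sits on a long branch of its own
weights = tree_weights(tree)
print(weights)  # C receives a much larger weight than A or B
```

The distant sequence C shares its long branch with nobody, so it ends up with the largest weight, which is the overweighting problem described above.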
76Performance Comparison Using Collections of Reference Alignments: BaliBase and Ribosomal RNA
77What Is BaliBase
BaliBase
BaliBase is a collection of reference Multiple Alignments
The structures of the sequences are known and were used to assemble the multiple alignment.
Evaluation is carried out by comparing the Structure Based Reference Alignment with its Sequence Based Counterpart
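One common way to carry out such a comparison is a sum-of-pairs score: the fraction of residue pairs aligned in the reference that are also aligned in the alignment under test. The sketch below assumes both alignments list the same sequences in the same order; the function names and the toy alignments are illustrative, and this is not the official BaliBase scoring program.

```python
def aligned_pairs(alignment):
    """Set of residue pairs ((seq, residue), (seq, residue)) sharing a column."""
    pairs = set()
    positions = [0] * len(alignment)  # next residue index in each sequence
    for column in zip(*alignment):
        present = []
        for k, symbol in enumerate(column):
            if symbol != "-":
                present.append((k, positions[k]))
                positions[k] += 1
        for a in range(len(present)):
            for b in range(a + 1, len(present)):
                pairs.add((present[a], present[b]))
    return pairs

def sp_score(test, reference):
    """Fraction of the reference residue pairs recovered by the test alignment."""
    ref_pairs = aligned_pairs(reference)
    if not ref_pairs:
        return 0.0
    return len(ref_pairs & aligned_pairs(test)) / len(ref_pairs)

reference = ["AC-GT", "ACTGT"]  # hypothetical structure based alignment
test = ["ACGT-", "ACTGT"]       # hypothetical sequence based alignment
print(sp_score(test, reference))  # 0.5
```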
78What Is BaliBase
[Figure: the BaliBase reference alignment (DALI, Sap) is compared with the alignment produced by Method X]
79BaliBase
What Is BaliBase
Source: BaliBase, Thompson et al., NAR, 1999
Description
PROBLEM
80Choosing The Right Method
81Choosing The Right Method (POA Evaluation)
82Choosing The Right Method (POA Evaluation)
83Choosing The Right Method (MAFFT evaluation)
84Choosing The Right Method (MAFFT evaluation)
85Choosing The Right Method (MAFFT evaluation)
86Conclusion
87Which Method ?
What Is BaliBase
Source BaliBase, Thompson et al, NAR, 1999,
Strategy
Strategy
PROBLEM
88Methods / Situations
1-Carrillo and Lipman
-MSA, DCA.
-Few Small Closely Related Sequences.
-Do Well When They Can Run.
2-Segment Based
-DIALIGN, MACAW.
-May Align Too Few Residues
-Good For Long Indels
3-Iterative
-HMMs, HMMER, SAM.
-Slow, Sometimes Inaccurate
-Good Profile Generators
4-Progressive
-ClustalW, Pileup, Multalign
-Fast and Sensitive
89Addresses
MAFFT (Progressive): www.biophys.kyoto-u.jp/katoh
POA (Progressive/Simultaneous): www.bioinformatics.ucla.edu/poa
MUSCLE (Progressive/Iterative): www.drive5.com/muscle/