Title: Multiple sequence alignment
1 Multiple sequence alignment: Why?
- It is the most important means to assess relatedness of a set of sequences
- Gain information about the structure/function of a query sequence (conservation patterns)
- Construct a phylogenetic tree
- Put together a set of sequenced fragments (fragment assembly)
- Recognise alternative splice sites
- Many bioinformatics methods depend on it (secondary/tertiary structure)
2 Multiple sequence alignment (MSA) of 12 flavodoxins and cheY
3 Pairwise alignment
- Now we know how to do it
- How do we get a multiple alignment (three or more sequences)?
- Multiple alignment suffers a much greater combinatorial explosion than pairwise alignment
4 Multi-dimensional dynamic programming (Murata et al. 1985)
5 Simultaneous multiple alignment: multi-dimensional dynamic programming
- MSA (Lipman et al., 1989, PNAS 86, 4412)
  - extremely slow and memory intensive
  - up to 8-9 sequences of 250 residues
- DCA (Stoye et al., 1997, CABIOS 13, 625)
  - still very slow
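To make the cost of simultaneous alignment concrete, the sketch below counts the cells of the full N-dimensional dynamic programming lattice and the predecessor cells inspected for each cell. It assumes N sequences of equal length and no pruning of the search space, so the numbers are upper bounds; the function names are illustrative only.

```python
def lattice_cells(n_seqs: int, length: int) -> int:
    """Cells in the full DP lattice: (L + 1)^N, one axis per sequence."""
    return (length + 1) ** n_seqs


def predecessors_per_cell(n_seqs: int) -> int:
    """Each cell can be reached from 2^N - 1 neighbours: every non-empty
    subset of sequences advances by one residue, the others take a gap."""
    return 2 ** n_seqs - 1


# Growth for sequences of 250 residues (the length quoted on the slide).
for n in (2, 3, 5, 8):
    print(f"N={n}: {lattice_cells(n, 250):.2e} cells, "
          f"{predecessors_per_cell(n)} predecessors per cell")
```

Going from two to three sequences already multiplies the number of cells by 251 and more than doubles the work per cell, which is why heuristic (progressive) strategies are used in practice.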
6 Alternative multiple alignment methods
- Biopat (Hogeweg and Hesper 1984, first method ever)
- MULTAL (Taylor 1987)
- DIALIGN (Morgenstern 1996)
- PRRP (Gotoh 1996)
- Clustal (Thompson, Higgins and Gibson 1994)
- Praline (Heringa 1999)
- T-Coffee (Notredame, Higgins and Heringa 2000)
- HMMER (Eddy 1998): hidden Markov model
- SAGA (Notredame and Higgins 1996): genetic algorithm
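7 Progressive multiple alignment: general principles
[Figure: all pairwise alignments of sequences 1-5 (Score 1-2, Score 1-3, ..., Score 4-5) fill a 5 x 5 similarity matrix; the scores are converted to distances, from which a guide tree is built; the multiple alignment follows the guide tree, with possibilities for iteration.]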
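The path from pairwise scores to a guide tree can be sketched as follows. This is a minimal illustration under stated assumptions: similarity scores are taken to be normalised to [0, 1], distances are computed as d = 1 - s, and the tree is built by average linkage (UPGMA) as a simple stand-in for whatever tree method a given program uses. The data layout and function names are hypothetical.

```python
from itertools import combinations


def scores_to_distances(sim):
    """Convert pairwise similarities in [0, 1] to distances d = 1 - s."""
    return {pair: 1.0 - s for pair, s in sim.items()}


def upgma_guide_tree(names, dist):
    """Build a guide tree by average linkage (UPGMA).

    names: list of sequence labels
    dist:  dict mapping frozenset({a, b}) -> distance
    Returns a nested tuple, e.g. (("4", "5"), ("3", ("1", "2"))).
    """
    clusters = {name: (name,) for name in names}  # node -> leaves below it
    d = dict(dist)
    while len(clusters) > 1:
        # Join the two closest clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        merged = (a, b)
        size_a, size_b = len(clusters[a]), len(clusters[b])
        # Distance from the new cluster to every other cluster is the
        # size-weighted average of the two old distances.
        for c in clusters:
            if c in (a, b):
                continue
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        members = clusters[a] + clusters[b]
        del clusters[a], clusters[b]
        clusters[merged] = members
    return next(iter(clusters))


# Toy example: 1-2 and 4-5 are the most similar pairs, 3 is closest to 1 and 2.
names = ["1", "2", "3", "4", "5"]
sim = {frozenset(p): 0.2 for p in combinations(names, 2)}
sim[frozenset(("1", "2"))] = 0.9
sim[frozenset(("4", "5"))] = 0.8
sim[frozenset(("1", "3"))] = 0.6
sim[frozenset(("2", "3"))] = 0.6
print(upgma_guide_tree(names, scores_to_distances(sim)))
```

The nested tuple gives the order in which a progressive method would align: innermost pairs first, then the sub-alignments they belong to.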
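8 General progressive multiple alignment technique (follow generated tree)
[Figure: guide tree with root and branch length d; the sub-alignments grow along the tree from sequences 1 and 3, to 1, 3, 2 and 5, to the full set 1, 3, 2, 5 and 4.]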
9 Progressive multiple alignment
- Problem:
  - Accuracy is very important
  - Errors are propagated into the progressive steps
  - "Once a gap, always a gap" (Feng and Doolittle, 1987)
10 Pair-wise alignment quality versus sequence identity (Vogt et al., JMB 249, 816-831, 1995)
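11 Multiple alignment profiles (Gribskov et al. 1987)
[Figure: a profile stores, for each alignment position i, one value per amino acid type (A, C, D, ..., W, Y), e.g. 0.3, 0.1, 0, ..., 0.3, 0.3, together with position-dependent gap penalties (e.g. 0.5, 1.0).]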
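A minimal sketch of how such a profile could be derived from an alignment, assuming the profile simply holds per-column residue frequencies as in the figure. Gaps are ignored when counting, position-dependent gap penalties are left out, and the function name is illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, alphabet=AMINO_ACIDS):
    """Return one {residue: frequency} dict per alignment column.

    alignment: list of equal-length gapped sequences ('-' marks a gap).
    Frequencies are computed over the non-gap residues of each column.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for i in range(length):
        column = [row[i] for row in alignment if row[i] != "-"]
        counts = Counter(column)
        total = len(column)
        profile.append(
            {aa: (counts[aa] / total if total else 0.0) for aa in alphabet}
        )
    return profile


profile = build_profile(["ACD", "AC-", "WCD"])
print(profile[0]["A"], profile[0]["W"])  # frequencies of A and W in column 1
```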
12 Profile-sequence alignment
[Figure: a sequence (ACDVWY) aligned against a profile.]
13 Profile-profile alignment
[Figure: two profiles (columns A C D . . Y) aligned against each other.]
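Both kinds of alignment reuse ordinary dynamic programming; only the score of a cell changes. The sketch below shows one common choice, assuming profile columns are residue-frequency dictionaries: a residue is scored against a column by the frequency-weighted average substitution score, and two columns are scored by the average over all residue pairs. The toy `identity_score` matrix and the function names are assumptions for illustration.

```python
def identity_score(a, b):
    """Toy substitution matrix (assumption): +1 for a match, -1 otherwise."""
    return 1.0 if a == b else -1.0


def score_residue_vs_column(residue, column, subst=identity_score):
    """Profile-sequence cell score: expected substitution score of
    `residue` against the residues in a profile column."""
    return sum(freq * subst(residue, aa) for aa, freq in column.items())


def score_column_vs_column(col_a, col_b, subst=identity_score):
    """Profile-profile cell score: expected substitution score over all
    residue pairs drawn from the two columns."""
    return sum(
        fa * fb * subst(a, b)
        for a, fa in col_a.items()
        for b, fb in col_b.items()
    )


column = {"A": 0.75, "C": 0.25}
print(score_residue_vs_column("A", column))        # 0.75 - 0.25
print(score_column_vs_column(column, {"A": 1.0}))  # same value
```

In a real setting `subst` would be a lookup into an exchange matrix such as BLOSUM62.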
14 Clustal, ClustalW, ClustalX
- CLUSTAL W/X (Thompson et al., 1994) uses the Neighbour Joining (NJ) algorithm (Saitou and Nei, 1987), widely used in phylogenetic analysis, to construct the guide tree.
- Sequence blocks are represented by profiles, in which the individual sequences are additionally weighted according to the branch lengths in the NJ tree.
- Further carefully crafted heuristics include:
  - (i) local gap penalties
  - (ii) automatic selection of the amino acid substitution matrix
  - (iii) automatic gap penalty adjustment
  - (iv) a mechanism to delay alignment of sequences that appear to be distant at the time they are considered
- CLUSTAL (W/X) does not allow iteration (cf. Hogeweg and Hesper, 1984; Corpet, 1988; Gotoh, 1996; Heringa, 1999, 2002).
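One way to picture branch-length weighting is sketched below. It assumes a rooted tree encoded as nested `(children, branch_length)` pairs, shares each branch length equally among the leaves below it, and sums the shares along the path from root to leaf. The encoding and function name are illustrative, and the actual CLUSTAL W implementation may differ in details such as normalisation.

```python
def sequence_weights(tree):
    """Weight each leaf by the branch lengths on its path to the root,
    each branch divided by the number of leaves that share it.

    tree: leaf = (name, branch_length)
          internal node = ([child, child, ...], branch_length)
    """
    weights = {}

    def leaves(node):
        label, _ = node
        if isinstance(label, str):
            return [label]
        return [leaf for child in label for leaf in leaves(child)]

    def walk(node, inherited):
        label, length = node
        share = inherited + length / len(leaves(node))
        if isinstance(label, str):
            weights[label] = share
        else:
            for child in label:
                walk(child, share)

    walk(tree, 0.0)
    return weights


# A and B are close relatives sharing an internal branch; C is an outlier.
tree = ([([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.4)], 0.0)
print(sequence_weights(tree))
```

Closely related sequences split their shared branch and so receive lower weights than a divergent sequence, which keeps a well-populated subfamily from dominating the profile.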
15 Strategies for multiple sequence alignment
- Profile pre-processing
- Secondary structure-induced alignment
- Globalised local alignment
- Matrix extension
- Objective: try to avoid (early) errors
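16 Pre-profile generation
[Figure: the pairwise alignment scores of sequences 1-5 (Score 1-2, Score 1-3, ..., Score 4-5) are compared with a cut-off; for each sequence, the other sequences are stacked onto it as a pre-alignment (shown for sequences 1, 2 and 5), and each pre-alignment is condensed into a pre-profile (columns A C D . . Y).]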
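The selection step of pre-profile generation can be sketched as below, assuming pairwise scores are available in a dictionary keyed by sequence pair and that a sequence joins another sequence's pre-alignment when their score reaches the cut-off. The data layout, the inclusive threshold and the function name are assumptions for illustration.

```python
def select_pre_alignment_members(names, scores, cutoff):
    """For every sequence (the 'master'), list the other sequences whose
    pairwise score with it is at least `cutoff`.

    scores: dict mapping frozenset({a, b}) -> pairwise alignment score
    """
    members = {}
    for master in names:
        members[master] = [
            other
            for other in names
            if other != master and scores[frozenset((master, other))] >= cutoff
        ]
    return members


names = ["1", "2", "3"]
scores = {
    frozenset(("1", "2")): 120,
    frozenset(("1", "3")): 40,
    frozenset(("2", "3")): 95,
}
print(select_pre_alignment_members(names, scores, cutoff=90))
```

Each master sequence together with its selected partners would then be turned into a pre-profile, and the pre-profiles are aligned instead of the bare sequences.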
17 Pre-profile alignment
[Figure: the pre-profiles of sequences 1-5 (each with columns A C D . . Y) are aligned to give the final alignment of sequences 1-5.]
18 Pre-profile alignment
[Figure: the five pre-alignments, each headed by one of the sequences 1-5 with the other sequences stacked below it, and the final alignment of sequences 1-5 obtained from them.]
20 Protein structure hierarchical levels
[Figure: hierarchical levels of protein structure; surviving label: TERTIARY STRUCTURE (fold).]
21 One of the Molecular Biology Dogmas
- Structure is more conserved than sequence
22 Secondary structure-induced alignment
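23 Using secondary structure for alignment
[Figure: dynamic programming search matrix for two sequences with their secondary structure, MDAGSTVILCFV (HHHCCCEEEEEE) and MDAASTILCGS (HHHHCCEEECC). The amino acid exchange weights matrix is chosen per cell from the secondary-structure states of the two residues: H-H, C-C and E-E pairs use state-specific matrices, all other pairs use the default matrix.]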
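The per-cell matrix choice can be sketched as follows, using the two secondary-structure strings from the slide. The sketch assumes three state-specific matrices (helix, strand, coil) plus a default; here the matrices are represented only by labels, and the function names are hypothetical.

```python
def pick_matrix(ss_a, ss_b, state_matrices, default):
    """Return the exchange matrix for one DP cell: the state-specific
    matrix when both residues share a secondary-structure state,
    otherwise the default matrix."""
    if ss_a == ss_b and ss_a in state_matrices:
        return state_matrices[ss_a]
    return default


def matrix_grid(ss1, ss2, state_matrices, default):
    """Matrix choice for every cell of the DP search matrix."""
    return [
        [pick_matrix(a, b, state_matrices, default) for b in ss2]
        for a in ss1
    ]


# Labels stand in for real exchange matrices (assumption for illustration).
matrices = {"H": "helix", "E": "strand", "C": "coil"}
grid = matrix_grid("HHHCCCEEEEEE", "HHHHCCEEECC", matrices, "default")
for row in grid:
    print(" ".join(label[0] for label in row))  # h, s, c or d per cell
```

The printed grid shows blocks of state-specific cells where helix meets helix, strand meets strand and coil meets coil, with the default matrix everywhere else.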
24 Flavodoxin-cheY: using predicted secondary structure
[Figure: alignment of the flavodoxin sequences and cheY (3chy), shown in two blocks; on the slide each sequence is annotated with its secondary structure (h = helix, e = strand).]

1fx1        -PK-ALIVYGSTTGNTEYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACF
FLAV_DESVH  MPK-ALIVYGSTTGNTEYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLFDS-LEETGAQGRKVACf
FLAV_DESGI  MPK-ALIVYGSTTGNTEGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLYED-LDRAGLKDKKVGVf
FLAV_DESSA  MSK-SLIVYGSTTGNTETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLYDS-LENADLKGKKVSVf
FLAV_DESDE  MSK-VLIVFGSSTGNTESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLFEE-FNRFGLAGRKVAAf
2fcr        --K-IGIFFSTSTGNTTEVADFIGKTLGAK---ADAPIDVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_ANASP  SKK-IGLFYGTQTGKTESVaEIIRDEFGND--VVTL-HDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLYSE-LDDVDFNGKLVAYf
FLAV_ECOLI  -AI-TGIFFGSDTGNTENIaKMIQKQLGKD--VADV-HDIAKSS-KEDLEAYDILLLgIPTWYYGEA--------QCDWDDFFPT-LEEIDFNGKLVALf
FLAV_AZOVI  -AK-IGLFFGSNTGKTRKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFLPK-IEGLDFSGKTVALf
FLAV_ENTAG  MAT-IGIFFGSDTGQTRKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFTNT-LSEADLTGKTVALf
4fxn        ----MKIVYWSGTGNTEKMAELIAKGIIESG-KDVNTINVSDVNIDELLNE-DILILGCSAMGDEVL------E-ESEFEPFIEE-IST-KISGKKVALF
FLAV_MEGEL  M---VEIVYWSGTGNTEAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVASK-DVILLgCPAMGSEEL------E-DSVVEPFFTD-LAP-KLKGKKVGLf
FLAV_CLOAB  M-K-ISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNL-DAVDKKFLQESEGIIFgTPTY-YANI--------SWEMKKWIDE-SSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFNN-VEEAEDGV-DALNKLQAGGYGFVISD---WNMPNM----------DGLELLKTIRADGAMSALPVLMV

1fx1        GCGDS-SY-EYFCGAVDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESVH  GCGDS-SY-EYFCGAVDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
FLAV_DESGI  GCGDS-SY-TYFCGAVDVIEKKAEELgATLVAS---------------------SLKIDGE--P--DSAEVLDwAREVLARV--------
FLAV_DESSA  GCGDS-DY-TYFCGAVDAIEEKLEKMgAVVIGD---------------------SLKIDGD--P--ERDEIVSwGSGIADKI--------
FLAV_DESDE  ASGDQ-EY-EHFCGAVPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
2fcr        GLGDAEGYPDNFCDAIEEIHDCFAKQGAKPVGFSNPDDYDYEESKSVRD-GKFLGLPLDMVNDQIPMEKRVAGWVEAVVSETGV------
FLAV_ANASP  GTGDQIGYADNFQDAIGILEEKISQRgGKTVGYWSTDGYDFNDSKALR-NGKFVGLALDEDNQSDLTDDRIKSwVAQLKSEFGL------
FLAV_ECOLI  GCGDQEDYAEYFCDALGTIRDIIEPRgATIVGHWPTAGYHFEASKGLADDDHFVGLAIDEDRQPELTAERVEKwVKQISEELHLDEILNA
FLAV_AZOVI  GLGDQVGYPENYLDALGELYSFFKDRgAKIVGSWSTDGYEFESSEAVVD-GKFVGLALDLDNQSGKTDERVAAwLAQIAPEFGLS--L--
FLAV_ENTAG  GLGDQLNYSKNFVSAMRILYDLVIARgACVVGNWPREGYKFSFSAALLENNEFVGLPLDQENQYDLTEERIDSwLEKLKPAV-L------
4fxn        G-----SYGWGDGKWMRDFEERMNGYGCVVVET---------------------PLIVQNE--PDEAEQDCIEFGKKIANI---------
FLAV_MEGEL  G-----SYGWGSGEWMDAWKQRTEDTgATVIGT----------------------AIVNEM--PDNAPE-CKElGEAAAKA---------
FLAV_CLOAB  STANSIA-GGSDIALLTILNHLMVK-gMLVYSG----GVAFGKPKTHLG-----YVHINEI--QENEDENARIfGERiANkV--KQIF--
3chy        -----------TAEAKKENIIAAAQAGASGY-------------------------VVK----P-FTAATLEEKLNKIFEKLGM------
26 Globalised local alignment
1. Local (SW) alignment (uses exchange matrix M and gap penalties Po, Pe)
2. Global (NW) alignment (no M or Po, Pe)
Double dynamic programming
27 M = BLOSUM62, Po = 0, Pe = 0
28 M = BLOSUM62, Po = 12, Pe = 1
29 M = BLOSUM62, Po = 60, Pe = 5
31 Matrix extension
- T-Coffee
- Tree-based Consistency Objective Function For alignmEnt Evaluation
- Cedric Notredame
- Des Higgins
- Jaap Heringa
- J. Mol. Biol., 302, 205-217 (2000)
32 Matrix extension: T-Coffee
[Figure: all pairwise alignments of four sequences: 1-2, 1-3, 1-4, 2-3, 2-4 and 3-4.]
33 Integrating alignment methods and alignment information with T-Coffee
- Integrating different pair-wise alignment techniques (NW, SW, ..)
- Combining different multiple alignment methods (consensus multiple alignment)
- Combining sequence alignment methods with structural alignment techniques
- Plug in user knowledge
34 Using different sources of alignment information
[Figure: alignment information from different sources (structure alignments, Clustal, Dialign, Lalign, manual) feeds into T-Coffee.]
35 Search matrix extension
36 T-Coffee
- Combine different alignment techniques by adding scores:
  $W(A(x), B(y)) = \sum S(A(x), B(y))$
  - A(x) is residue x in sequence A
  - summation is over the scores S of the global and local alignments containing the residue pair (A(x), B(y))
  - S is the sequence identity percentage of the associated alignment
- Combine the direct alignment seqA-seqB with each seqA-seqI-seqB:
  $W'(A(x), B(y)) = W(A(x), B(y)) + \sum_{I \neq A,B} \min\big(W(A(x), I(z)),\, W(I(z), B(y))\big)$
  - summation is over all third sequences I other than A or B
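The extension step can be sketched in a few lines. This is an illustrative sketch, not the T-Coffee implementation: it assumes the primary weights W are stored in a dictionary keyed by an unordered pair of residues, each residue written as `(sequence, position)`, and that the primary library only contains pairs of residues from different sequences. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def extend_library(primary):
    """Add to every residue pair (A(x), B(y)) the weight it receives through
    each third residue I(z): min(W(A(x), I(z)), W(I(z), B(y))).

    primary: dict mapping frozenset({(seqA, x), (seqB, y)}) -> weight W
    Returns a dict with the extended weights W'.
    """
    neighbours = defaultdict(dict)
    for pair, w in primary.items():
        r1, r2 = tuple(pair)
        neighbours[r1][r2] = w
        neighbours[r2][r1] = w

    extended = defaultdict(float)
    for pair, w in primary.items():
        extended[pair] += w  # direct term W(A(x), B(y))

    # Every residue in turn plays the role of the intermediate I(z).
    for nbrs in neighbours.values():
        items = list(nbrs.items())
        for i in range(len(items)):
            for j in range(i + 1, len(items)):
                (ra, wa), (rb, wb) = items[i], items[j]
                if ra[0] == rb[0]:
                    continue  # residues of the same sequence are never paired
                extended[frozenset((ra, rb))] += min(wa, wb)
    return dict(extended)


primary = {
    frozenset((("A", 1), ("B", 1))): 88,
    frozenset((("A", 1), ("C", 1))): 77,
    frozenset((("C", 1), ("B", 1))): 77,
}
print(extend_library(primary)[frozenset((("A", 1), ("B", 1)))])  # 88 + min(77, 77)
```

A residue pair that is never aligned directly can still obtain a non-zero extended weight if a third sequence aligns with both residues, which is how consistency information from all pairwise alignments enters each pairwise score.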
37 T-Coffee
[Figure: the direct alignment of two sequences combined with the alignments that run through the other sequences.]
38 Search matrix extension
39 Evaluating multiple alignments
- Conflicting standards of truth
  - evolution
  - structure
  - function
- With orphan sequences, no additional information is available
- Benchmarks depend on reference alignments
- Quality issue of available reference alignment databases
- Different ways to quantify agreement with the reference alignment (sum-of-pairs, column score)
- "Charlie Chaplin" problem
40 Evaluating multiple alignments
- As a standard of truth, a reference alignment based on structural superposition is often taken
41 Evaluation measures
[Figure: a query alignment compared with a reference alignment using the column score and the sum-of-pairs score.]
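Both measures can be sketched as below, under these assumptions: the query and reference alignments contain the same sequences in the same order, the sum-of-pairs score is the fraction of residue pairs aligned in the reference that are also aligned in the query, and the column score is the fraction of reference columns reproduced exactly (including gap positions) in the query. Benchmark suites may define details differently; the function names are illustrative.

```python
from itertools import combinations


def columns(alignment):
    """Describe each column by the residue index of every sequence
    (None where the sequence has a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for i in range(len(alignment[0])):
        col = []
        for s, row in enumerate(alignment):
            if row[i] == "-":
                col.append(None)
            else:
                col.append(counters[s])
                counters[s] += 1
        cols.append(tuple(col))
    return cols


def residue_pairs(alignment):
    """All pairs of residues that share a column."""
    pairs = set()
    for col in columns(alignment):
        for (s1, p1), (s2, p2) in combinations(enumerate(col), 2):
            if p1 is not None and p2 is not None:
                pairs.add((s1, p1, s2, p2))
    return pairs


def sum_of_pairs_score(query, reference):
    ref = residue_pairs(reference)
    return len(ref & residue_pairs(query)) / len(ref)


def column_score(query, reference):
    ref_cols = columns(reference)
    query_cols = set(columns(query))
    return sum(col in query_cols for col in ref_cols) / len(ref_cols)


reference = ["ACD", "A-D", "ACD"]
query = ["ACD", "AD-", "ACD"]
print(sum_of_pairs_score(query, reference), column_score(query, reference))
```

The sum-of-pairs score gives partial credit to a column that is only partly right, whereas the column score is all-or-nothing per column, so the two can rank alignments differently.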
42 Evaluating multiple alignments
[Figure: ΔSP per BAliBASE alignment, annotated with the number of sequences (nseq) and the alignment length (len).]
43 Summary
- Weighting schemes simulating simultaneous multiple alignment
  - Profile pre-processing (global/local)
  - Matrix extension (well balanced scheme)
- Smoothing alignment signals
  - globalised local alignment
- Using additional information
  - secondary structure driven alignment
- Schemes strike a balance between speed and sensitivity
44 References
- Heringa, J. (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput. Chem., 23, 341-364.
- Notredame, C., Higgins, D.G., Heringa, J. (2000) T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol., 302, 205-217.
- Heringa, J. (2002) Local weighting schemes for protein multiple sequence alignment. Comput. Chem., 26(5), 459-477.
45 Where to find this: http://www.ibivu.cs.vu.nl/teaching