Title: Multiple sequence alignment
1 Introduction to bioinformatics, Lecture 7: Multiple sequence alignment (1)
2 Global or Local Pairwise alignment
[Figure: schematic comparing local and global alignment of sequences composed of segments labelled A, B and C]
3 Globin fold: α protein, myoglobin (PDB 1MBN)
Helices are labelled A (blue) to H (red). The D helix can be missing in some globins: what happens with the alignment?
4 β sandwich: β protein, immunoglobulin (PDB 7FAB)
5 TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB 1TIM)
6 Pyruvate kinase (phosphotransferase)
- β barrel regulatory domain
- α/β barrel catalytic substrate binding domain
- α/β nucleotide binding domain
7 What does this mean for alignments?
- Alignments need to be able to skip secondary structural elements to complete domains (i.e. putting gaps opposite these motifs in the shorter sequence).
- Depending on the gap penalties chosen, the algorithm might have difficulty making such long gaps (for example when using high affine gap penalties), resulting in an incorrect alignment.
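To make the cost of a long gap concrete, here is a minimal sketch of the affine penalty used in the worked examples of this lecture (penalty = Pi + gap_length × Pe). The function name, the 20-residue gap and the second parameter set are illustrative assumptions, not values from the slides.

```python
def affine_gap_penalty(gap_length, p_open, p_extend):
    """Penalty = p_open + gap_length * p_extend (0 when there is no gap)."""
    if gap_length <= 0:
        return 0
    return p_open + gap_length * p_extend


# Cost of skipping a hypothetical 20-residue helix under two parameter sets.
for p_open, p_extend in [(10, 2), (10, 0.5)]:
    cost = affine_gap_penalty(20, p_open, p_extend)
    print(f"Pi={p_open}, Pe={p_extend}: one 20-residue gap costs {cost}")
```

With a high extension penalty the single long gap quickly outweighs the substitution scores it would rescue, which is when the algorithm may prefer a wrong but gap-free alignment.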
8 There are three kinds of pairwise alignments
- Global alignment: align all residues in both sequences; all gaps are penalised.
- Semi-global alignment: align all residues in both sequences; end gaps are not penalised (zero end gap penalties).
- Local alignment: align part of each sequence; end gaps are not applicable.
9 Easy global DP recipe for using affine gap penalties (after Gotoh)

Penalty = P_i + gap_length × P_e

$$
S_{i,j} = s_{i,j} + \max
\begin{cases}
\max_{0<x<i-1} \left[ S_{x,\,j-1} - P_i - (i-x-1)\,P_e \right] \\
S_{i-1,\,j-1} \\
\max_{0<y<j-1} \left[ S_{i-1,\,y} - P_i - (j-y-1)\,P_e \right]
\end{cases}
$$

- S_{i,j} is the score of the optimal alignment (highest scoring alignment until i, j); s_{i,j} is the substitution score for cell [i, j].
- At each cell [i, j] in the search matrix, check the Max coming from:
  - any cell in the preceding row until j-2: add the score for cell [i, j] minus the appropriate gap penalties;
  - any cell in the preceding column until i-2: add the score for cell [i, j] minus the appropriate gap penalties;
  - or cell [i-1, j-1]: add the score for cell [i, j].
- Select the highest scoring cell in the bottom row and rightmost column and do the trace-back.
10 Let's do an example: global alignment. Gotoh's DP algorithm with affine gap penalties (PAM250, Pi = 10, Pe = 2)
D W V T A L K
0 -12 -14 -16 -18 -20 -22 -24
T -12 0 -17 -14 -13 -17 -19 -22 -22
D -14 -8 -7 -14 -14 -13 -42
W -16 -21 9 -13 -19 -18
V -18 -18 -20 13 -3 -16
L -20 -22 -18 -1 14 -1 -14
K -22 -20 -21 -12
-24 -42 -41 -18 -16 -14 -12 0
PAM250 scores:
     D    W    V    T    A    L    K
T    0   -5    0    3    1    1    0
D    4   -7   -2    0    0   -4    0
W   -7   17   -6   -5   -6   -2   -3
V   -2   -6    4    0    0    2   -2
L   -4   -2    2    1   -2    6   -3
K    0   -3   -2    0   -1   -3    5
Cell (D2, T4) can alternatively be reached from two cells with the same score (high-road or low-road).
Row and column 0 are filled with 0, -12, -14, -16, ... if global alignment is used (for N-terminal end-gaps); an extra row and column are also added at the end to calculate the score including C-terminal end-gap penalties.
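The recipe can be sketched in a few lines of Python for this example. This is a minimal sketch, not Gotoh's optimised algorithm: it scans the whole preceding row and column for every cell, exactly as the recipe words it. The names `gap` and `fill_global` are illustrative, predecessors in row and column 0 are included (an assumption about the index bounds), and the extra end row and column for C-terminal end gaps are not computed.

```python
SEQ_COLS = "DWVTALK"  # sequence written along the top of the matrix
SEQ_ROWS = "TDWVLK"   # sequence written down the side

# PAM250 values copied from the slide (rows follow SEQ_ROWS, columns SEQ_COLS).
PAM250 = [
    [0, -5, 0, 3, 1, 1, 0],
    [4, -7, -2, 0, 0, -4, 0],
    [-7, 17, -6, -5, -6, -2, -3],
    [-2, -6, 4, 0, 0, 2, -2],
    [-4, -2, 2, 1, -2, 6, -3],
    [0, -3, -2, 0, -1, -3, 5],
]
P_OPEN, P_EXTEND = 10, 2


def gap(length):
    """Affine penalty Pi + gap_length * Pe."""
    return P_OPEN + length * P_EXTEND


def fill_global(scores):
    n_rows, n_cols = len(scores), len(scores[0])
    S = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    # Row and column 0 hold the N-terminal end-gap penalties: 0, -12, -14, ...
    for j in range(1, n_cols + 1):
        S[0][j] = -gap(j)
    for i in range(1, n_rows + 1):
        S[i][0] = -gap(i)
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            candidates = [S[i - 1][j - 1]]  # diagonal, no gap
            # Any cell in the preceding row until j-2 (gap of j-1-y residues).
            candidates += [S[i - 1][y] - gap(j - 1 - y) for y in range(j - 1)]
            # Any cell in the preceding column until i-2 (gap of i-1-x residues).
            candidates += [S[x][j - 1] - gap(i - 1 - x) for x in range(i - 1)]
            S[i][j] = scores[i - 1][j - 1] + max(candidates)
    return S


S = fill_global(PAM250)
for label, row in zip(" " + SEQ_ROWS, S):
    print(label, " ".join(f"{v:4d}" for v in row))
```

In this sketch the cell for D (row 2) against T (column 4) is reached with the same score from the diagonal cell and from a cell two columns further left in the preceding row, which is the high-road/low-road tie mentioned in the slide.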
11 Let's do another example: semi-global alignment. Gotoh's DP algorithm with affine gap penalties (PAM250, Pi = 10, Pe = 2)
PAM250 scores:
     D    W    V    T    A    L    K
T    0   -5    0    3    1    1    0
D    4   -7   -2    0    0   -4    0
W   -7   17   -6   -5   -6   -2   -3
V   -2   -6    4    0    0    2   -2
L   -4   -2    2    1   -2    6   -3
K    0   -3   -2    0   -1   -3    5

Search matrix (partially filled):
     D    W    V    T    A    L    K
T    0   -5    0    3
D    4   -7   -7
W   -7   21  -13
V   -2  -13   25    9
L
K
A starting row and column 0, and an extra column at the right or extra row at the bottom, are not necessary when using semi-global alignment (zero end-gaps). The rest works as under global alignment.
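A minimal sketch of the semi-global variant for the same example follows. The first row and column simply keep their PAM250 scores because end gaps cost nothing, and the end point is taken as the best cell in the bottom row or rightmost column. The function names are illustrative assumptions, and the trace-back itself is left out.

```python
# PAM250 values copied from the slide (rows T D W V L K, columns D W V T A L K).
PAM250 = [
    [0, -5, 0, 3, 1, 1, 0],
    [4, -7, -2, 0, 0, -4, 0],
    [-7, 17, -6, -5, -6, -2, -3],
    [-2, -6, 4, 0, 0, 2, -2],
    [-4, -2, 2, 1, -2, 6, -3],
    [0, -3, -2, 0, -1, -3, 5],
]
P_OPEN, P_EXTEND = 10, 2


def gap(length):
    """Affine penalty Pi + gap_length * Pe."""
    return P_OPEN + length * P_EXTEND


def fill_semi_global(scores):
    n_rows, n_cols = len(scores), len(scores[0])
    # No row/column 0: the first row and column keep the raw scores.
    S = [row[:] for row in scores]
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            candidates = [S[i - 1][j - 1]]
            candidates += [S[i - 1][y] - gap(j - 1 - y) for y in range(j - 1)]
            candidates += [S[x][j - 1] - gap(i - 1 - x) for x in range(i - 1)]
            S[i][j] = scores[i][j] + max(candidates)
    return S


def best_end_cell(S):
    """Highest scoring cell in the bottom row or rightmost column."""
    last_row, last_col = len(S) - 1, len(S[0]) - 1
    cells = [(last_row, j) for j in range(last_col + 1)]
    cells += [(i, last_col) for i in range(last_row)]
    return max(cells, key=lambda rc: S[rc[0]][rc[1]])


S = fill_semi_global(PAM250)
end_i, end_j = best_end_cell(S)
print("trace-back would start at row", end_i, "column", end_j, "score", S[end_i][end_j])
```

The only differences from the global sketch are the initialisation and the choice of end cell; the cell update is unchanged.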
12 Easy local DP recipe for using affine gap penalties (after Gotoh)

Penalty = P_i + gap_length × P_e

$$
S_{i,j} = \max
\begin{cases}
s_{i,j} + \max_{0<x<i-1} \left[ S_{x,\,j-1} - P_i - (i-x-1)\,P_e \right] \\
s_{i,j} + S_{i-1,\,j-1} \\
s_{i,j} + \max_{0<y<j-1} \left[ S_{i-1,\,y} - P_i - (j-y-1)\,P_e \right] \\
0
\end{cases}
$$

- S_{i,j} is the score of the optimal alignment (highest scoring alignment until i, j); s_{i,j} is the substitution score for cell [i, j].
- At each cell [i, j] in the search matrix, check the Max coming from:
  - any cell in the preceding row until j-2: add the score for cell [i, j] minus the appropriate gap penalties;
  - any cell in the preceding column until i-2: add the score for cell [i, j] minus the appropriate gap penalties;
  - or cell [i-1, j-1]: add the score for cell [i, j].
- Select the highest scoring cell anywhere in the matrix and do the trace-back until a zero-valued cell or the start of the sequence(s).
13 Let's do yet another example: local alignment. Gotoh's DP algorithm with affine gap penalties (PAM250, Pi = 10, Pe = 2)
PAM250 scores:
     D    W    V    T    A    L    K
T    0   -5    0    3    1    1    0
D    4   -7   -2    0    0   -4    0
W   -7   17   -6   -5   -6   -2   -3
V   -2   -6    4    0    0    2   -2
L   -4   -2    2    1   -2    6   -3
K    0   -3   -2    0   -1   -3    5

Search matrix (partially filled):
     D    W    V    T    A    L    K
T    0    0    0    3
D    4    0    0    0
W    0   21    0    0
V    0    0   25    9
L    0    0   11
K    0    0
Extra start/end columns/rows are not necessary (no end-gaps). Each negative scoring cell is set to zero. The highest scoring cell may be found anywhere in the search matrix after calculating it. Trace the highest scoring cell back to the first cell with zero value (or the beginning of one or both sequences).
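The local recipe, including the trace-back rule, can be sketched as follows for this example. This is a minimal sketch under the same assumptions as the slide (PAM250 values as printed, Pi = 10, Pe = 2); `fill_local` and `trace_back` are illustrative names, and ties are broken in favour of the diagonal predecessor, which is an arbitrary choice.

```python
SEQ_COLS = "DWVTALK"  # sequence written along the top of the matrix
SEQ_ROWS = "TDWVLK"   # sequence written down the side

PAM250 = [
    [0, -5, 0, 3, 1, 1, 0],
    [4, -7, -2, 0, 0, -4, 0],
    [-7, 17, -6, -5, -6, -2, -3],
    [-2, -6, 4, 0, 0, 2, -2],
    [-4, -2, 2, 1, -2, 6, -3],
    [0, -3, -2, 0, -1, -3, 5],
]
P_OPEN, P_EXTEND = 10, 2


def gap(length):
    """Affine penalty Pi + gap_length * Pe."""
    return P_OPEN + length * P_EXTEND


def fill_local(scores):
    n_rows, n_cols = len(scores), len(scores[0])
    S = [[0] * n_cols for _ in range(n_rows)]
    back = [[None] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            best, source = 0, None  # first row/column: no predecessor
            if i > 0 and j > 0:
                cands = [(S[i - 1][j - 1], (i - 1, j - 1))]
                cands += [(S[i - 1][y] - gap(j - 1 - y), (i - 1, y)) for y in range(j - 1)]
                cands += [(S[x][j - 1] - gap(i - 1 - x), (x, j - 1)) for x in range(i - 1)]
                best, source = max(cands, key=lambda c: c[0])
            value = scores[i][j] + best
            if value > 0:  # negative cells stay at zero
                S[i][j], back[i][j] = value, source
    return S, back


def trace_back(S, back):
    """Follow pointers from the best cell until a zero cell or a sequence start."""
    cells = [(i, j) for i in range(len(S)) for j in range(len(S[0]))]
    cell = max(cells, key=lambda rc: S[rc[0]][rc[1]])
    path = []
    while cell is not None and S[cell[0]][cell[1]] > 0:
        path.append(cell)
        cell = back[cell[0]][cell[1]]
    return path[::-1]


S, back = fill_local(PAM250)
path = trace_back(S, back)
# Prints the matched residue pairs only; residues skipped by gaps are not shown.
print("".join(SEQ_ROWS[i] for i, _ in path))
print("".join(SEQ_COLS[j] for _, j in path))
```

The trace-back here stops at the first column because the path reaches the beginning of one sequence before it reaches a zero-valued cell.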
14 For your first exam D1
Make sure you understand and can carry out Gotoh's algorithm for global, semi-global and local alignment! This is the most general Dynamic Programming (DP) algorithm (and perhaps the easiest to understand).
Gotoh, O. An Improved Algorithm for Matching Biological Sequences. J. Mol. Biol., 162, pp. 705-708, 1982.
15 Pairwise alignment
- Now we know how to do it.
- How do we get a multiple alignment (three or more sequences)?
- Multiple alignment faces a much greater combinatorial explosion than pairwise alignment.
16 Multiple sequence alignment (MSA): Why?
- One of the most important means to find out about:
  - conservation patterns leading to functional clues
  - possible protein structure
- A multiple sequence alignment contains far more information about conservation than a pairwise alignment.
- Many bioinformatics methods use an MSA as input, e.g. secondary structure prediction (later lecture).
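As a small illustration of reading conservation from an MSA, the sketch below scores each column by the fraction of sequences sharing the most common residue. The three rows are the first seven columns of FLAV_DESVH, FLAV_DESSA and FLAV_CLOAB from the CLUSTAL X alignment shown later in this lecture; the scoring rule and the function name are illustrative assumptions, not a method from the slides.

```python
from collections import Counter


def column_conservation(rows):
    """Fraction of rows that share the most common non-gap residue, per column."""
    n_rows = len(rows)
    result = []
    for column in zip(*rows):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        result.append(top / n_rows)
    return result


rows = [
    "MPKALIV",  # FLAV_DESVH
    "MSKSLIV",  # FLAV_DESSA
    "-MKISIL",  # FLAV_CLOAB
]
conservation = column_conservation(rows)
fully_conserved = [i for i, c in enumerate(conservation) if c == 1.0]
print(fully_conserved)
```

A pairwise alignment of any two of these rows could not distinguish a column that is invariant across the family from one that merely happens to match in that pair.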
17 Multiple sequence alignment: Wanted
- Quality
- Quality
- Quality
- Programs need to be fully automatic for genomic pipelines.
- With available genomes (data explosion), speed becomes crucial.
18 Simultaneous multiple alignment: multi-dimensional dynamic programming (Murata et al. 1985)
19 Simultaneous multiple alignment: multi-dimensional dynamic programming
- MSA (Lipman et al., 1989, PNAS 86, 4412)
- extremely slow and memory intensive
- up to 8-9 sequences of 250 residues
- DCA (Stoye et al., 1997, CABIOS 13, 625)
- still very slow
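A back-of-the-envelope sketch shows why simultaneous alignment is so slow and memory intensive. It assumes a naive full N-dimensional lattice for N sequences of equal length and the simplest move set, in which every cell is reached from 2^N - 1 neighbouring cells; the function names are illustrative.

```python
def lattice_cells(n_sequences, length):
    """Cells in a full N-dimensional DP lattice for sequences of equal length."""
    return (length + 1) ** n_sequences


def predecessors_per_cell(n_sequences):
    """Neighbouring cells each cell can be reached from (simplest move set)."""
    return 2 ** n_sequences - 1


for n in (2, 3, 8):
    cells = lattice_cells(n, 250)
    print(f"{n} sequences of 250 residues: {cells:.3e} cells, "
          f"{predecessors_per_cell(n)} predecessors per cell")
```

Going from two to three sequences of 250 residues already multiplies the number of cells by 251, which is why a full lattice quickly becomes impractical.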
20 Alternative multiple alignment methods
- Biopat (first complete MSA method ever; Hogeweg & Hesper 1984)
- MULTAL (Taylor 1987)
- DIALIGN (Morgenstern 1996)
- PRRP (Gotoh 1996)
- Clustal (Thompson, Higgins & Gibson 1994)
- Praline (Heringa 1999)
- T-Coffee (Notredame et al. 2000)
- HMMER (Eddy 1998): Hidden Markov Models
- SAGA (Notredame 1996): Genetic algorithm
- POA (Lee et al. 2002)
- MUSCLE (Edgar 2004)
21 The following three slides are examples of multiple alignments of 13 flavodoxin sequences and 1 cheY sequence (PDB code 3chy). The cheY sequence is a very distant relative of the flavodoxin family, but has the same basic fold.
22 CLUSTAL X (1.64b) multiple sequence alignment: Flavodoxin-cheY
1fx1        -PKALIVYGSTTGNTEYTAETIARQLANAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK
FLAV_DESVH  MPKALIVYGSTTGNTEYTAETIARELADAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK
FLAV_DESGI  MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-M-ETTVVNVADVTAPGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPLYE-DLDRAGLKDKK
FLAV_DESSA  MSKSLIVYGSTTGNTETAAEYVAEAFENKE-I-DVELKNVTDVSVADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPLYD-SLENADLKGKK
FLAV_DESDE  MSKVLIVFGSSTGNTESIAQKLEELIAAGG-H-EVTLLNAADASAENLADGYDAVLFGCSAWGMEDLE------MQDDFLSLFE-EFNRFGLAGRK
FLAV_CLOAB  -MKISILYSSKTGKTERVAKLIEEGVKRSGNI-EVKTMNLDAVDKKFLQE-SEGIIFGTPTYYAN---------ISWEMKKWID-ESSEFNLEGKL
FLAV_MEGEL  --MVEIVYWSGTGNTEAMANEIEAAVKAAG-A-DVESVRFEDTNVDDVAS-KDVILLGCPAMGSE--E------LEDSVVEPFF-TDLAPKLKGKK
4fxn        ---MKIVYWSGTGNTEKMAELIAKGIIESG-K-DVNTINVSDVNIDELLN-EDILILGCSAMGDE--V------LEESEFEPFI-EEISTKISGKK
FLAV_ANASP  SKKIGLFYGTQTGKTESVAEIIRDEFGNDVVT----LHDVSQAEVTDLND-YQYLIIGCPTWNIGELQ---SD-----WEGLYS-ELDDVDFNGKL
FLAV_AZOVI  -AKIGLFFGSNTGKTRKVAKSIKKRFDDETMSD---ALNVNRVSAEDFAQ-YQFLILGTPTLGEGELPGLSSDCENESWEEFLP-KIEGLDFSGKT
2fcr        --KIGIFFSTSTGNTTEVADFIGKTLGAKADAP---IDVDDVTDPQALKD-YDLLFLGAPTWNTGADTERSGT----SWDEFLYDKLPEVDMKDLP
FLAV_ENTAG  MATIGIFFGSDTGQTRKVAKLIHQKLDGIADAP---LDVRRATREQFLS--YPVLLLGTPTLGDGELPGVEAGSQYDSWQEFTN-TLSEADLTGKT
FLAV_ECOLI  -AITGIFFGSDTGNTENIAKMIQKQLGKDVAD----VHDIAKSSKEDLEA-YDILLLGIPTWYYGEAQ-CD-------WDDFFP-TLEEIDFNGKL
3chy        --ADKELKFLVVDDFSTMRRIVRNLLKELG----FNNVEEAEDGVDALN------KLQAGGYGFV--I------SDWNMPNMDG-LELLKTIR---
23 Flavodoxin-cheY Global Pre-processing (prepro?1500)
1fx1        -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE  MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH  MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA  MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI  MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr        --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI  -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG  MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP  SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI  -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn        -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL  MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB  -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM

1fx1        GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESDE  ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH  GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------
24 Flavodoxin-cheY Local Pre-processing (locprepro?300)
1fx1        --PKALIVYGSTTGNTEYTAETIARQLANAGYEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACF
FLAV_DESVH  -MPKALIVYGSTTGNTEYTaETIARELADAGYEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACf
FLAV_DESSA  -MSKSLIVYGSTTGNTETAaEYVAEAFENKEIDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPL--YDSLENADLKGKKVSVf
FLAV_DESGI  -MPKALIVYGSTTGNTEGVaEAIAKTLNSEGMETTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPL--YEDLDRAGLKDKKVGVf
FLAV_DESDE  -MSKVLIVFGSSTGNTESIaQKLEELIAAGGHEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSL--FEEFNRFGLAGRKVAAf
4fxn        --MK--IVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLN-EDILILGCSAMGDEVL------E-ESEFEPF--IEEIS-TKISGKKVALF
FLAV_MEGEL  -MVE--IVYWSGTGNTEAMaNEIEAAVKAAGADVESVRFEDTNVDDVAS-KDVILLgCPAMGSEEL------E-DSVVEPF--FTDLA-PKLKGKKVGLf
2fcr        ---KIGIFFSTSTGNTTEVADFIGKTLGAKADAPI--DVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFL-YDKLPEVDMKDLPVAIF
FLAV_ANASP  -SKKIGLFYGTQTGKTESVaEIIRDEFGNDVVTLH--DVSQAEV-TDLNDYQYLIIgCPTWNIGEL--------QSDWEGL--YSELDDVDFNGKLVAYf
FLAV_AZOVI  --AKIGLFFGSNTGKTRKVaKSIKKRFDDETMSDA-LNVNRVSA-EDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEF--LPKIEGLDFSGKTVALf
FLAV_ENTAG  -MATIGIFFGSDTGQTRKVaKLIHQKLDG--IADAPLDVRRATR-EQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEF--TNTLSEADLTGKTVALf
FLAV_ECOLI  --AITGIFFGSDTGNTENIaKMIQKQLGKDVADVH--DIAKSSK-EDLEAYDILLLgIPTWYYGEA--------QCDWDDF--FPTLEEIDFNGKLVALf
FLAV_CLOAB  --MKISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNLDAVDKKFLQESEGIIFgTPTYYA-----------NISWEMKKWIDESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQ-AGGYGFVI---SDWNMPNM----------DGLEL--LKTIRADGAMSALPVLM

1fx1        GCGDS--SY-EYFCGA-VD--AIEEKLKNLGAEIVQD---------------------GLRID--GDPRAARDDIVGWAHDVRGAI--------
FLAV_DESVH  GCGDS--SY-EYFCGA-VD--AIEEKLKNLgAEIVQD---------------------GLRID--GDPRAARDDIVGwAHDVRGAI--------
FLAV_DESSA  GCGDS--DY-TYFCGA-VD--AIEEKLEKMgAVVIGD---------------------SLKID--GDPE--RDEIVSwGSGIADKI--------
FLAV_DESGI  GCGDS--SY-TYFCGA-VD--VIEKKAEELgATLVAS---------------------SLKID--GEPD--SAEVLDwAREVLARV--------
25 Flavodoxin-cheY Pre-processing (prepro?1500)