Multiple sequence alignment - PowerPoint PPT Presentation

Slides: 26
Provided by: VictorAS

Transcript and Presenter's Notes

1
Introduction to bioinformatics, Lecture 7: Multiple sequence alignment (1)
2
Global or local pairwise alignment
[Figure: sequences composed of segments labelled A, B and C, shown aligned locally and globally]
3
Globin fold: α protein, myoglobin (PDB 1MBN)
Helices are labelled A (blue) to H (red). The D helix can be missing in some globins: what happens with the alignment?
4
β sandwich: β protein, immunoglobulin (PDB 7FAB)
5
TIM barrel: α/β protein, Triose phosphate IsoMerase (PDB 1TIM)
6
Pyruvate kinase (phosphotransferase)
β barrel regulatory domain
α/β barrel catalytic substrate binding domain
α/β nucleotide binding domain
7
What does this mean for alignments?
  • Alignments need to be able to skip anything from secondary
    structural elements to complete domains (i.e. by
    putting gaps opposite these motifs in the shorter
    sequence).
  • Depending on the gap penalties chosen, the algorithm
    may have difficulty making such long gaps
    (for example when using high affine gap
    penalties), resulting in an incorrect alignment.
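
To make the cost of a long gap concrete, here is a minimal Python sketch of the affine penalty used later in this lecture (Penalty = Pi + gap_length × Pe). The function name and the gap lengths 7 and 100 are illustrative assumptions, not values from the slides; Pi = 10 and Pe = 2 are the values used in the worked examples.

```python
def affine_gap_penalty(gap_length, p_open, p_extend):
    """Affine penalty: p_open + gap_length * p_extend (0 when there is no gap)."""
    if gap_length <= 0:
        return 0
    return p_open + gap_length * p_extend


# Hypothetical gap lengths: one residue, a short helix, a whole domain.
for length in (1, 7, 100):
    high = affine_gap_penalty(length, p_open=10, p_extend=2)
    low = affine_gap_penalty(length, p_open=10, p_extend=1)
    print(f"gap of {length:3d}: Pe=2 costs {high}, Pe=1 costs {low}")
```

The extension term dominates for long gaps, so skipping a whole domain only pays off when the residues aligned on either side of the gap score well enough to outweigh it.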

8
There are three kinds of pairwise alignments
  • Global alignment: align all residues in both
    sequences; all gaps are penalised
  • Semi-global alignment: align all residues in
    both sequences; end gaps are not penalised (zero
    end gap penalties)
  • Local alignment: align part of each sequence;
    end gaps are not applicable

9
Easy global DP recipe for using affine gap
penalties (after Gotoh)
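Penalty = Pi + gap_length × Pe

$$
S_{i,j} = s_{i,j} + \max
\begin{cases}
\max_{0 < x < i-1} \left[ S_{x,\,j-1} - P_i - (i-x-1)\,P_e \right] \\
S_{i-1,\,j-1} \\
\max_{0 < y < j-1} \left[ S_{i-1,\,y} - P_i - (j-y-1)\,P_e \right]
\end{cases}
$$

[Figure: search matrix with column j-1 and row i-1 marked]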
  • S(i, j) is the score of the optimal alignment (the
    highest-scoring alignment up to cell i, j)
  • At each cell (i, j) in the search matrix, take the
    maximum coming from:
  • any cell in the preceding row up to column j-2: add the
    score for cell (i, j) minus the appropriate gap penalty
  • any cell in the preceding column up to row i-2: add the
    score for cell (i, j) minus the appropriate gap penalty
  • or cell (i-1, j-1): add the score for cell (i, j)
  • Select the highest-scoring cell in the bottom row or
    rightmost column and trace back from it

10
Let's do an example: global alignment
Gotoh's DP algorithm with affine gap penalties (PAM250, Pi = 10, Pe = 2)
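Search matrix as far as it is filled in on the slide ("." marks a cell left empty there; the final row and the final value in each row belong to the extra end-gap row and column, a placement inferred from the recurrence):

           D    W    V    T    A    L    K
     0   -12  -14  -16  -18  -20  -22  -24
T  -12     0  -17  -14  -13  -17  -19  -22  -22
D  -14    -8   -7  -14  -14  -13    .    .  -42
W  -16   -21    9  -13  -19    .    .    .  -18
V  -18   -18  -20   13   -3    .    .    .  -16
L  -20   -22  -18   -1   14   -1    .    .  -14
K  -22   -20  -21    .    .    .    .    .  -12
   -24   -42  -41  -18  -16  -14  -12    0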
PAM250 exchange values:

      D    W    V    T    A    L    K
T     0   -5    0    3    1    1    0
D     4   -7   -2    0    0   -4    0
W    -7   17   -6   -5   -6   -2   -3
V    -2   -6    4    0    0    2   -2
L    -4   -2    2    1   -2    6   -3
K     0   -3   -2    0   -1   -3    5
Cell (D2, T4) can be reached from two different
cells with the same score (high road or low road).
For global alignment, row and column 0 are filled with 0, -12, -14,
-16, ... (the N-terminal end-gap penalties). An extra row and column
are also added at the end to calculate the score including
C-terminal end-gap penalties.
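
The global recipe and the worked example can be sketched in Python as follows. This is a minimal sketch, not the lecturer's code: the names `PAM`, `gap` and `global_matrix` are illustrative, the exchange values are copied from the slide's PAM250 excerpt, and the gap ranges follow the recipe's index bounds (0 < x < i-1 and 0 < y < j-1). The extra C-terminal row and column and the trace-back are left out.

```python
# Exchange values copied from the slide's PAM250 excerpt
# (rows: T D W V L K; columns: D W V T A L K).
PAM = [
    [0, -5, 0, 3, 1, 1, 0],
    [4, -7, -2, 0, 0, -4, 0],
    [-7, 17, -6, -5, -6, -2, -3],
    [-2, -6, 4, 0, 0, 2, -2],
    [-4, -2, 2, 1, -2, 6, -3],
    [0, -3, -2, 0, -1, -3, 5],
]
P_OPEN, P_EXT = 10, 2  # Pi and Pe from the slide


def gap(length):
    """Affine penalty: Pi + gap_length * Pe."""
    return P_OPEN + length * P_EXT


def global_matrix(s):
    """Fill the global search matrix; row 0 and column 0 hold N-terminal end gaps."""
    n, m = len(s), len(s[0])
    S = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        S[0][j] = -gap(j)  # -12, -14, -16, ...
    for i in range(1, n + 1):
        S[i][0] = -gap(i)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]  # diagonal step: no gap
            for x in range(1, i - 1):  # preceding column, rows up to i-2
                best = max(best, S[x][j - 1] - gap(i - x - 1))
            for y in range(1, j - 1):  # preceding row, columns up to j-2
                best = max(best, S[i - 1][y] - gap(j - y - 1))
            S[i][j] = s[i - 1][j - 1] + best
    return S


S = global_matrix(PAM)
for label, row in zip(" TDWVLK", S):
    print(label, row)
```

For the cells that are filled in on the slide, this sketch gives the same values, including the tie at cell (D2, T4), where the diagonal and a gapped path both contribute -14.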
11
Let's do another example: semi-global alignment
Gotoh's DP algorithm with affine gap penalties (PAM250, Pi = 10, Pe = 2)
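PAM250 exchange values:

      D    W    V    T    A    L    K
T     0   -5    0    3    1    1    0
D     4   -7   -2    0    0   -4    0
W    -7   17   -6   -5   -6   -2   -3
V    -2   -6    4    0    0    2   -2
L    -4   -2    2    1   -2    6   -3
K     0   -3   -2    0   -1   -3    5

Search matrix as far as it is filled in on the slide ("." marks a cell left empty there):

      D    W    V    T    A    L    K
T     0   -5    0    3    .    .    .
D     4   -7   -7    .    .    .    .
W    -7   21  -13    .    .    .    .
V    -2  -13   25    9    .    .    .
L     .    .    .    .    .    .    .
K     .    .    .    .    .    .    .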
A starting row and column 0, and an extra column at the
right or an extra row at the bottom, are not necessary
when using semi-global alignment (zero end gaps).
The rest works as under global alignment.
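
A minimal Python sketch of the semi-global variant follows. It assumes the same recurrence as for global alignment and uses a border of zeros purely as a programming convenience for the free end gaps; `semi_global_matrix` and `best_end_cell` are illustrative names, and the exchange values are copied from the slide's PAM250 excerpt.

```python
# Exchange values copied from the slide's PAM250 excerpt
# (rows: T D W V L K; columns: D W V T A L K).
PAM = [
    [0, -5, 0, 3, 1, 1, 0],
    [4, -7, -2, 0, 0, -4, 0],
    [-7, 17, -6, -5, -6, -2, -3],
    [-2, -6, 4, 0, 0, 2, -2],
    [-4, -2, 2, 1, -2, 6, -3],
    [0, -3, -2, 0, -1, -3, 5],
]
P_OPEN, P_EXT = 10, 2  # Pi and Pe from the slide


def gap(length):
    """Affine penalty: Pi + gap_length * Pe."""
    return P_OPEN + length * P_EXT


def semi_global_matrix(s):
    """Fill the search matrix with zero end-gap penalties."""
    n, m = len(s), len(s[0])
    S = [[0] * (m + 1) for _ in range(n + 1)]  # zero border: end gaps are free
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = S[i - 1][j - 1]  # diagonal step: no gap
            for x in range(1, i - 1):  # preceding column, rows up to i-2
                best = max(best, S[x][j - 1] - gap(i - x - 1))
            for y in range(1, j - 1):  # preceding row, columns up to j-2
                best = max(best, S[i - 1][y] - gap(j - y - 1))
            S[i][j] = s[i - 1][j - 1] + best
    return S


def best_end_cell(S):
    """Highest-scoring cell in the bottom row or rightmost column: (score, i, j)."""
    n, m = len(S) - 1, len(S[0]) - 1
    bottom = [(S[n][j], n, j) for j in range(1, m + 1)]
    right = [(S[i][m], i, m) for i in range(1, n + 1)]
    return max(bottom + right)


S = semi_global_matrix(PAM)
for label, row in zip("TDWVLK", S[1:]):
    print(label, row[1:])
print("trace back from (score, row, column):", best_end_cell(S))
```

With the zero border, the first row and column of the matrix are just the exchange values, which matches the cells filled in on the slide.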
12
Easy local DP recipe for using affine gap
penalties (after Gotoh)
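Penalty = Pi + gap_length × Pe

$$
S_{i,j} = \max
\begin{cases}
s_{i,j} + \max_{0 < x < i-1} \left[ S_{x,\,j-1} - P_i - (i-x-1)\,P_e \right] \\
s_{i,j} + S_{i-1,\,j-1} \\
s_{i,j} + \max_{0 < y < j-1} \left[ S_{i-1,\,y} - P_i - (j-y-1)\,P_e \right] \\
0
\end{cases}
$$

[Figure: search matrix with column j-1 and row i-1 marked]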
  • S(i, j) is the score of the optimal alignment (the
    highest-scoring alignment up to cell i, j)
  • At each cell (i, j) in the search matrix, take the
    maximum coming from:
  • any cell in the preceding row up to column j-2: add the
    score for cell (i, j) minus the appropriate gap penalty
  • any cell in the preceding column up to row i-2: add the
    score for cell (i, j) minus the appropriate gap penalty
  • or cell (i-1, j-1): add the score for cell (i, j)
  • Select the highest-scoring cell anywhere in the matrix
    and trace back until a zero-valued cell or the start
    of the sequence(s)

13
Let's do yet another example: local alignment
Gotoh's DP algorithm with affine gap penalties (PAM250, Pi = 10, Pe = 2)
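PAM250 exchange values:

      D    W    V    T    A    L    K
T     0   -5    0    3    1    1    0
D     4   -7   -2    0    0   -4    0
W    -7   17   -6   -5   -6   -2   -3
V    -2   -6    4    0    0    2   -2
L    -4   -2    2    1   -2    6   -3
K     0   -3   -2    0   -1   -3    5

Search matrix as far as it is filled in on the slide ("." marks a cell left empty there):

      D    W    V    T    A    L    K
T     0    0    0    3    .    .    .
D     4    0    0    0    .    .    .
W     0   21    0    0    .    .    .
V     0    0   25    9    .    .    .
L     0    0   11    .    .    .    .
K     0    0    .    .    .    .    .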
Extra start/end columns and rows are not necessary (no
end gaps). Each negative-scoring cell is set to
zero. The highest-scoring cell may be found anywhere
in the search matrix once it has been calculated. Trace
the highest-scoring cell back to the first cell with a zero
value (or to the beginning of one or both sequences).
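
A minimal Python sketch of the local recipe, including the trace-back, is given below. The names `local_align`, `prev` and `pairs` are illustrative assumptions; the exchange values are copied from the slide's PAM250 excerpt. For brevity the trace-back only lists the aligned residue pairs and does not print the residues skipped by a gap.

```python
ROWS, COLS = "TDWVLK", "DWVTALK"
# Exchange values copied from the slide's PAM250 excerpt (rows x columns).
PAM = [
    [0, -5, 0, 3, 1, 1, 0],
    [4, -7, -2, 0, 0, -4, 0],
    [-7, 17, -6, -5, -6, -2, -3],
    [-2, -6, 4, 0, 0, 2, -2],
    [-4, -2, 2, 1, -2, 6, -3],
    [0, -3, -2, 0, -1, -3, 5],
]
P_OPEN, P_EXT = 10, 2  # Pi and Pe from the slide


def gap(length):
    """Affine penalty: Pi + gap_length * Pe."""
    return P_OPEN + length * P_EXT


def local_align(s, rows, cols):
    """Return (best score, aligned residue pairs, search matrix)."""
    n, m = len(s), len(s[0])
    S = [[0] * (m + 1) for _ in range(n + 1)]
    prev = {}  # predecessor of every cell that scores above zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, src = S[i - 1][j - 1], (i - 1, j - 1)  # diagonal step
            for x in range(1, i - 1):  # preceding column, rows up to i-2
                cand = S[x][j - 1] - gap(i - x - 1)
                if cand > best:
                    best, src = cand, (x, j - 1)
            for y in range(1, j - 1):  # preceding row, columns up to j-2
                cand = S[i - 1][y] - gap(j - y - 1)
                if cand > best:
                    best, src = cand, (i - 1, y)
            score = s[i - 1][j - 1] + best
            if score > 0:  # negative (and zero) cells stay at zero
                S[i][j], prev[(i, j)] = score, src
    top, i, j = max((S[i][j], i, j)
                    for i in range(1, n + 1) for j in range(1, m + 1))
    pairs, cell = [], (i, j)
    while cell in prev:  # stops at a zero-valued cell or the matrix border
        pairs.append((rows[cell[0] - 1], cols[cell[1] - 1]))
        cell = prev[cell]
    return top, pairs[::-1], S


score, pairs, S = local_align(PAM, ROWS, COLS)
print(score, pairs)
```

Under these assumptions the sketch reproduces the cells filled in on the slide (for example 21, 25, 9 and 11) and traces back from the highest-scoring cell along the diagonal.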
14
For your first exam D1: Make sure you
understand and can carry out Gotoh's algorithm
for global, semi-global and local
alignment! This is the most general Dynamic
Programming (DP) algorithm (and perhaps the
easiest to understand).
Gotoh, O. An Improved Algorithm for Matching Biological Sequences. J.
Mol. Biol., 162, pp. 705-708, 1982.
15
Pairwise alignment
  • Now we know how to do it
  • How do we get a multiple alignment (three or more
    sequences)?
  • Multiple alignment: a much greater combinatorial
    explosion than with pairwise alignment

16
Multiple sequence alignment (MSA): why?
  • One of the most important means to find out about:
  • Conservation patterns leading to functional clues
  • Possible protein structure
  • Multiple sequence alignment contains far more
    information about conservation than pairwise
    alignment
  • Many bioinformatics methods use MSA as input
    e.g. secondary structure prediction (later
    lecture)
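
As a small illustration of reading conservation from an MSA, the Python sketch below scores each column by the fraction of sequences that share its most common residue. The three fragments are the first 15 columns of 1fx1, FLAV_DESVH and FLAV_DESSA from the CLUSTAL X alignment later in this lecture; the scoring rule and the function name are illustrative assumptions, not a method from the slides.

```python
from collections import Counter

# First 15 alignment columns of three flavodoxins (from the CLUSTAL X slide).
ALIGNMENT = [
    "-PKALIVYGSTTGNT",  # 1fx1
    "MPKALIVYGSTTGNT",  # FLAV_DESVH
    "MSKSLIVYGSTTGNT",  # FLAV_DESSA
]


def column_conservation(alignment):
    """Per column: share of sequences carrying the most common non-gap residue."""
    scores = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(column))
    return scores


scores = column_conservation(ALIGNMENT)
conserved = [i for i, score in enumerate(scores) if score == 1.0]
print("fully conserved columns (0-based):", conserved)
```

A pairwise alignment of any two of these fragments cannot distinguish a column that is conserved across the whole family from one that merely happens to match in that pair.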

17
Multiple sequence alignment: wanted
  • Quality
  • Quality
  • Quality
  • Programs need to be fully automatic for genomic
    pipelines
  • With available genomes (data explosion), speed
    becomes crucial

18
Simultaneous multiple alignment: multi-dimensional
dynamic programming (Murata et al. 1985)
19
Simultaneous multiple alignment: multi-dimensional
dynamic programming
  • MSA (Lipman et al., 1989, PNAS 86, 4412)
  • extremely slow and memory intensive
  • up to 8-9 sequences of 250 residues
  • DCA (Stoye et al., 1997, CABIOS 13, 625)
  • still very slow
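
To get a feel for why simultaneous alignment is so slow and memory intensive, the Python sketch below counts the cells of a full, unpruned N-dimensional DP lattice and the number of neighbouring cells each one is computed from. The function names are illustrative; the sequence length of 250 is taken from the slide.

```python
def lattice_cells(n_seqs, length):
    """Cells in the full N-dimensional lattice, including the border."""
    return (length + 1) ** n_seqs


def predecessors_per_cell(n_seqs):
    """Each step advances a non-empty subset of the sequences: 2^N - 1 choices."""
    return 2 ** n_seqs - 1


for n in (2, 3, 8):
    print(f"{n} sequences of 250 residues: "
          f"{lattice_cells(n, 250):.3e} cells, "
          f"{predecessors_per_cell(n)} predecessors per cell")
```

Both numbers grow exponentially with the number of sequences, so practical programs have to avoid filling the whole lattice.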

20
Alternative multiple alignment methods
  • Biopat (first complete MSA method ever; Hogeweg &
    Hesper 1984)
  • MULTAL (Taylor 1987)
  • DIALIGN (Morgenstern 1996)
  • PRRP (Gotoh 1996)
  • Clustal (Thompson, Higgins & Gibson 1994)
  • Praline (Heringa 1999)
  • T-Coffee (Notredame et al. 2000)
  • HMMER (Eddy 1998): hidden Markov models
  • SAGA (Notredame 1996): genetic algorithm
  • POA (Lee et al. 2002)
  • MUSCLE (Edgar 2004)

21
The following three slides are examples of
multiple alignments of 13 flavodoxin sequences and 1 cheY
sequence (PDB code 3chy). The cheY sequence is a
very distant relative of the flavodoxin family,
but has the same basic fold.
22
CLUSTAL X (1.64b) multiple sequence alignment: Flavodoxin-cheY

1fx1        -PKALIVYGSTTGNTEYTAETIARQLANAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK
FLAV_DESVH  MPKALIVYGSTTGNTEYTAETIARELADAG-Y-EVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSIE------LQDDFIPLFD-SLEETGAQGRK
FLAV_DESGI  MPKALIVYGSTTGNTEGVAEAIAKTLNSEG-M-ETTVVNVADVTAPGLAEGYDVVLLGCSTWGDDEIE------LQEDFVPLYE-DLDRAGLKDKK
FLAV_DESSA  MSKSLIVYGSTTGNTETAAEYVAEAFENKE-I-DVELKNVTDVSVADLGNGYDIVLFGCSTWGEEEIE------LQDDFIPLYD-SLENADLKGKK
FLAV_DESDE  MSKVLIVFGSSTGNTESIAQKLEELIAAGG-H-EVTLLNAADASAENLADGYDAVLFGCSAWGMEDLE------MQDDFLSLFE-EFNRFGLAGRK
FLAV_CLOAB  -MKISILYSSKTGKTERVAKLIEEGVKRSGNI-EVKTMNLDAVDKKFLQE-SEGIIFGTPTYYAN---------ISWEMKKWID-ESSEFNLEGKL
FLAV_MEGEL  --MVEIVYWSGTGNTEAMANEIEAAVKAAG-A-DVESVRFEDTNVDDVAS-KDVILLGCPAMGSE--E------LEDSVVEPFF-TDLAPKLKGKK
4fxn        ---MKIVYWSGTGNTEKMAELIAKGIIESG-K-DVNTINVSDVNIDELLN-EDILILGCSAMGDE--V------LEESEFEPFI-EEISTKISGKK
FLAV_ANASP  SKKIGLFYGTQTGKTESVAEIIRDEFGNDVVT----LHDVSQAEVTDLND-YQYLIIGCPTWNIGELQ---SD-----WEGLYS-ELDDVDFNGKL
FLAV_AZOVI  -AKIGLFFGSNTGKTRKVAKSIKKRFDDETMSD---ALNVNRVSAEDFAQ-YQFLILGTPTLGEGELPGLSSDCENESWEEFLP-KIEGLDFSGKT
2fcr        --KIGIFFSTSTGNTTEVADFIGKTLGAKADAP---IDVDDVTDPQALKD-YDLLFLGAPTWNTGADTERSGT----SWDEFLYDKLPEVDMKDLP
FLAV_ENTAG  MATIGIFFGSDTGQTRKVAKLIHQKLDGIADAP---LDVRRATREQFLS--YPVLLLGTPTLGDGELPGVEAGSQYDSWQEFTN-TLSEADLTGKT
FLAV_ECOLI  -AITGIFFGSDTGNTENIAKMIQKQLGKDVAD----VHDIAKSSKEDLEA-YDILLLGIPTWYYGEAQ-CD-------WDDFFP-TLEEIDFNGKL
3chy        --ADKELKFLVVDDFSTMRRIVRNLLKELG----FNNVEEAEDGVDALN------KLQAGGYGFV--I------SDWNMPNMDG-LELLKTIR---

Consensus line (column positions lost in extraction): . ... . .

23
Flavodoxin-cheY Global Pre-processing
(prepro?1500)
1fx1        -PKALIVYGSTTGNT-EYTAETIARQLANAG-YEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACF
FLAV_DESDE  MSKVLIVFGSSTGNT-ESIaQKLEELIAAGG-HEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSLF-EEFNRFGLAGRKVAAf
FLAV_DESVH  MPKALIVYGSTTGNT-EYTaETIARELADAG-YEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPLF-DSLEETGAQGRKVACf
FLAV_DESSA  MSKSLIVYGSTTGNT-ETAaEYVAEAFENKE-IDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPLY-DSLENADLKGKKVSVf
FLAV_DESGI  MPKALIVYGSTTGNT-EGVaEAIAKTLNSEG-METTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPLY-EDLDRAGLKDKKVGVf
2fcr        --KIGIFFSTSTGNT-TEVADFIGKTLGA---KADAPIDVDDVTDPQALKDYDLLFLGAPTWNTG----ADTERSGTSWDEFLYDKLPEVDMKDLPVAIF
FLAV_AZOVI  -AKIGLFFGSNTGKT-RKVaKSIKKRFDDET-MSDA-LNVNRVS-AEDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEFL-PKIEGLDFSGKTVALf
FLAV_ENTAG  MATIGIFFGSDTGQT-RKVaKLIHQKLDG---IADAPLDVRRAT-REQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEFT-NTLSEADLTGKTVALf
FLAV_ANASP  SKKIGLFYGTQTGKT-ESVaEIIRDEFGN---DVVTLHDVSQAE-VTDLNDYQYLIIgCPTWNIGEL--------QSDWEGLY-SELDDVDFNGKLVAYf
FLAV_ECOLI  -AITGIFFGSDTGNT-ENIaKMIQKQLGK---DVADVHDIAKSS-KEDLEAYDILLLgIPTWYYGE--------AQCDWDDFF-PTLEEIDFNGKLVALf
4fxn        -MK--IVYWSGTGNT-EKMAELIAKGIIESG-KDVNTINVSDVNIDELL-NEDILILGCSAMGDEVL-------EESEFEPFI-EEIS-TKISGKKVALF
FLAV_MEGEL  MVE--IVYWSGTGNT-EAMaNEIEAAVKAAG-ADVESVRFEDTNVDDVA-SKDVILLgCPAMGSEEL-------EDSVVEPFF-TDLA-PKLKGKKVGLf
FLAV_CLOAB  -MKISILYSSKTGKT-ERVaKLIEEGVKRSGNIEVKTMNLDAVD-KKFLQESEGIIFgTPTYYAN---------ISWEMKKWI-DESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFN--NVEEAEDGVDALNKLQAGGYGFVI---SDWNMPNM----------DGLELL-KTIRADGAMSALPVLM
T

1fx1        GCGDS-SY-EYFCGA-VDAIEEKLKNLGAEIVQD---------------------GLRIDGD--PRAARDDIVGWAHDVRGAI--------
FLAV_DESDE  ASGDQ-EY-EHFCGA-VPAIEERAKELgATIIAE---------------------GLKMEGD--ASNDPEAVASfAEDVLKQL--------
FLAV_DESVH  GCGDS-SY-EYFCGA-VDAIEEKLKNLgAEIVQD---------------------GLRIDGD--PRAARDDIVGwAHDVRGAI--------

24
Flavodoxin-cheY Local Pre-processing (locprepro?300)
1fx1        --PKALIVYGSTTGNTEYTAETIARQLANAGYEVDSRDAASVEAGGLFEGFDLVLLGCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACF
FLAV_DESVH  -MPKALIVYGSTTGNTEYTaETIARELADAGYEVDSRDAASVEAGGLFEGFDLVLLgCSTWGDDSI------ELQDDFIPL--FDSLEETGAQGRKVACf
FLAV_DESSA  -MSKSLIVYGSTTGNTETAaEYVAEAFENKEIDVELKNVTDVSVADLGNGYDIVLFgCSTWGEEEI------ELQDDFIPL--YDSLENADLKGKKVSVf
FLAV_DESGI  -MPKALIVYGSTTGNTEGVaEAIAKTLNSEGMETTVVNVADVTAPGLAEGYDVVLLgCSTWGDDEI------ELQEDFVPL--YEDLDRAGLKDKKVGVf
FLAV_DESDE  -MSKVLIVFGSSTGNTESIaQKLEELIAAGGHEVTLLNAADASAENLADGYDAVLFgCSAWGMEDL------EMQDDFLSL--FEEFNRFGLAGRKVAAf
4fxn        --MK--IVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNIDELLN-EDILILGCSAMGDEVL------E-ESEFEPF--IEEIS-TKISGKKVALF
FLAV_MEGEL  -MVE--IVYWSGTGNTEAMaNEIEAAVKAAGADVESVRFEDTNVDDVAS-KDVILLgCPAMGSEEL------E-DSVVEPF--FTDLA-PKLKGKKVGLf
2fcr        ---KIGIFFSTSTGNTTEVADFIGKTLGAKADAPI--DVDDVTDPQALKDYDLLFLGAPTWNTGAD----TERSGTSWDEFL-YDKLPEVDMKDLPVAIF
FLAV_ANASP  -SKKIGLFYGTQTGKTESVaEIIRDEFGNDVVTLH--DVSQAEV-TDLNDYQYLIIgCPTWNIGEL--------QSDWEGL--YSELDDVDFNGKLVAYf
FLAV_AZOVI  --AKIGLFFGSNTGKTRKVaKSIKKRFDDETMSDA-LNVNRVSA-EDFAQYQFLILgTPTLGEGELPGLSSDCENESWEEF--LPKIEGLDFSGKTVALf
FLAV_ENTAG  -MATIGIFFGSDTGQTRKVaKLIHQKLDG--IADAPLDVRRATR-EQFLSYPVLLLgTPTLGDGELPGVEAGSQYDSWQEF--TNTLSEADLTGKTVALf
FLAV_ECOLI  --AITGIFFGSDTGNTENIaKMIQKQLGKDVADVH--DIAKSSK-EDLEAYDILLLgIPTWYYGEA--------QCDWDDF--FPTLEEIDFNGKLVALf
FLAV_CLOAB  --MKISILYSSKTGKTERVaKLIEEGVKRSGNIEVKTMNLDAVDKKFLQESEGIIFgTPTYYA-----------NISWEMKKWIDESSEFNLEGKLGAAf
3chy        ADKELKFLVVDDFSTMRRIVRNLLKELGFNNVEEAEDGVDALNKLQ-AGGYGFVI---SDWNMPNM----------DGLEL--LKTIRADGAMSALPVLM

1fx1        GCGDS--SY-EYFCGA-VD--AIEEKLKNLGAEIVQD---------------------GLRID--GDPRAARDDIVGWAHDVRGAI--------
FLAV_DESVH  GCGDS--SY-EYFCGA-VD--AIEEKLKNLgAEIVQD---------------------GLRID--GDPRAARDDIVGwAHDVRGAI--------
FLAV_DESSA  GCGDS--DY-TYFCGA-VD--AIEEKLEKMgAVVIGD---------------------SLKID--GDPE--RDEIVSwGSGIADKI--------
FLAV_DESGI  GCGDS--SY-TYFCGA-VD--VIEKKAEELgATLVAS---------------------SLKID--GEPD--SAEVLDwAREVLARV--------

25
Flavodoxin-cheY Pre-processing (prepro?1500)