Title: Multiple alignment
1 Multiple alignment
Peter Højrup, Department of Biochemistry and Molecular Biology, SDU, Odense University, Denmark.
2 Multiple sequence alignment
One amino acid sequence plays coy; a pair of homologous sequences whisper; many aligned sequences shout out loud.
3 Main applications for MA
- Extrapolation
  - Determine family relationship
- Phylogenetic analysis
  - Reconstruct the history of the protein
- Pattern identification
  - Identify structurally/functionally important residues
- Domain identification
  - Construct a pattern to find other family members
- Structure prediction
  - A multiple alignment greatly enhances prediction accuracy
- PCR analysis
  - Find the less degenerate parts of a gene.
4 Definition
Consensus sequence
5 Multiple alignment parameters
- AA substitution matrix
  - Usually one based on global alignment: PAM250 or Gonnet
- Gap parameters
  - Both opening and extension parameters are very important for the optimal alignment
- Alignment order
  - ClustalX initially performs a pairwise comparison; the closest fit is the initial alignment.
- Sequence length
  - Try to align sequences of the same length; you may do an initial dot-plot to find the regions that align.
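To illustrate why both gap parameters matter, here is a minimal sketch of affine gap scoring for two already aligned rows. The function name `affine_score` and all numeric values are hypothetical; a toy match/mismatch score stands in for a real substitution matrix such as PAM250 or Gonnet, and the convention assumed is that the first position of a gap costs `gap_open` and each further position costs `gap_extend`.

```python
def affine_score(row_a, row_b, match=2, mismatch=-1, gap_open=-10, gap_extend=-1):
    """Score two aligned rows with a toy substitution score and affine gaps.

    All parameter values are illustrative assumptions, not Clustal defaults.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap = None  # which row the previous column's gap was in: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column holding only gaps carries no information
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            # continuing a gap in the same row is cheap, starting one is expensive
            score += gap_extend if prev_gap == side else gap_open
            prev_gap = side
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(affine_score("AC--E", "ACDDE"))  # one gap of length 2: -5
print(affine_score("A-C-E", "ADCDE"))  # two gaps of length 1: -14
```

With these assumed values a single two-residue gap scores much better than two separate one-residue gaps, which is the behaviour the opening and extension parameters are meant to control.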
6 Multiple sequence alignment
Rat_CALRETICULIN    ----MLLSVPLLLGLLGLAAAD----------------------------------PAIYFKEQFLDGDAWTNR---------WVESKHKSD--FGKFVL
Human_CALRETICULIN  ----MLLSVPLLLGLLGLAVAE----------------------------------PAVYFKEQFLDGDGWTSR---------WIESKHKSD--FGKFVL
RAT_CALNEXIN        MEGKWLLCLLLVLGTAAIQAHDGHDDDMIDIEDDLDDVIEEVEDSKSKSDTSTPPSPKVTYKAPVPTGEVYFADSFDRGSLSGWILSKAKKDDTDDEIAK
Human_CALNEXIN      MEGKWLLCMLLVLGTAIVEAHDGHDDDVIDIEDDLDDVIEEVEDSKPDT-TAPPSSPKVTYKAPVPTGEVYFADSFDRGTLSGWILSKAKKDDTDDEIAK
Prim.cons.          MEGK2LL2V2L2LG22GLAA2DGHDDD2IDIEDDLDDVIEEVEDSK222DT22P2SP2V22K22222G2V22A2SFDRG2LSGWI2SK2K2DDT222222

Rat_CALRETICULIN    SSGKFYGDQEK------DKGLQTSQDARFYALSARF-EPFSNKGQTLVVQFTVKHEQNIDCGGGYVKLFPGG--LDQKDMHGDSEYNIMFGPDICGPGTK
Human_CALRETICULIN  SSGKFYGDEEK------DKGLQTSQDARFYALSASF-EPFSNKGQTLVVQFTVKHEQNIDCGGGYVKLFPNS--LDQTDMHGDSEYNIMFGPDICGPGTK
RAT_CALNEXIN        YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQYEVNFQNGIECGGAYVKLLSKTSELNLDQFHDKTPYTIMFGPDKCG-EDY
Human_CALNEXIN      YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPLIVQYEVNFQNGIECGGAYVKLLSKTPELNLDQFHDKTPYTIMFGPDKCG-EDY
Prim.cons.          22GK222DE2KE2KLPGDKGL22222A222A2SAK2N2PF222222L2VQ22V22222I2CGG2YVKL22KT2EL22D22H2222Y2IMFGPD2CGP222

Rat_CALRETICULIN    KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTYEVKIDNSQVESGSLEDDWD--FLPPKKIKDPDAAKPEDWDERAKIDDPTD
Human_CALRETICULIN  KVHVIFNYKGKNVLINKDIRCK----------DDEFTHLYTLIVRPDNTYEVKIDNSQVESGSLEDDWD--FLPPKKIKDPDASKPEDWDERAKIDDPTD
RAT_CALNEXIN        KLHFIFRHKNPKTGVYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSFEILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIADPDA
Human_CALNEXIN      KLHFIFRHKNPKTGIYEEKHAKRPDADLKTYFTDKKTHLYTLILNPDNSFEILVDQSVVNSGNLLNDMTPPVNPSREIEDPEDRKPEDWDERPKIPDPEA
Prim.cons.          K2H2IF22K22222I222222KRPDADLKTYF2D22THLYTLI22PDN22E222D2S2V2SG2L22D22PP22P222I2DP22RKPEDWDER2KIDDPT2

Rat_CALRETICULIN    SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDGEWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
Human_CALRETICULIN  SKPEDWDK---------------------PEHIPDPDAKKPEDWDEEMDGEWEP-------------------PVIQNPEYKGEWKPRQIDNPDYKGTWI
RAT_CALNEXIN        VKPDDWDEDAPSKIPDEEATKPEGWLDDEPEYIPDPDAEKPEDWDEDMDGEWEAPQIANPKCESAPGCGVWQRPMIDNPNYKGKWKPPMIDNPNYQGIWK
Human_CALNEXIN      VKPDDWDEDAPAKIPDEEATKPEGWLDDEPEYVPDPDAEKPEDWDEDMDGEWEAPQIANPRCESAPGCGVWQRPVIDNPNYKGKWKPPMIDNPSYQGIWK
Prim.cons.          2KP2DWD2DAP2KIPDEEATKPEGWLDDEPE2IPDPDA2KPEDWDE2MDGEWE2PQIANP2CESAPGCGVWQRPVI2NP2YKG2WKP22IDNPDY2G2W2
7 Nomenclature
Rat_CALRETICULIN    SSGKFYGDQEK------DKGLQTSQDARFYALSARF-EPFSNKGQTL
Human_CALRETICULIN  SSGKFYGDEEK------DKGLQTSQDARFYALSASF-EPFSNKGQTL
RAT_CALNEXIN        YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPL
Human_CALNEXIN      YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPL
Prim.cons.          22GK222DE2KE2KLPGDKGL22222A222A2SAK2N2PF222222L
Consensus sequence
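As a rough illustration of how a consensus row like Prim.cons. can be derived, the sketch below applies a rule inferred from the example on this slide: gaps are ignored, a residue that is strictly the most frequent in a column is written out, and otherwise the number of residues tied for most frequent is written as a digit. This rule is an assumption read off the slide, not the documented behaviour of the program that produced it; the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(rows, gap="-"):
    """Consensus of equal-length aligned rows (rule inferred from the slide)."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    out = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c != gap)  # gaps do not vote
        if not counts:
            out.append(gap)
            continue
        top = max(counts.values())
        tied = [res for res, n in counts.items() if n == top]
        # a unique winner is shown as-is, a tie as the number of tied residues
        out.append(tied[0] if len(tied) == 1 else str(len(tied)))
    return "".join(out)


rows = [
    "SSGKFYGDQEK------DKGLQTSQDARFYALSARF-EPFSNKGQTL",  # Rat_CALRETICULIN
    "SSGKFYGDEEK------DKGLQTSQDARFYALSASF-EPFSNKGQTL",  # Human_CALRETICULIN
    "YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPL",  # RAT_CALNEXIN
    "YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPL",  # Human_CALNEXIN
]
print(consensus(rows))
```

For the four rows on this slide, the inferred rule gives the same string as the Prim.cons. row.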
8 Mind the gap!
                   250       260       270       280       290       300
Papain      DGVRQVQPYNEGALLYSIANQPVSVVLEAAGKDFQLYRGGIFVGPCGNKVDHAVAAVGYG
Staphopain  --------I---AILGSRV-E-----S----------RNGMHAGHAMAVVGN--AKLNNG
Prim.cons.  DGVRQVQP2NEGA2L2S22N2PVSVV2EAAGKDFQLYR2G222G22222V22AVA2222G

Never have islands (widows)
Gaps should be in-frame

Papain      YTTTELSYEEVLNDGDVNIPEYVDWRQKGAVTPVKNQGSCGSCWAFSAVVTIEGIIKIRT
CathLx2     PRKGKVFQEPLFYEA----PRSVDWREKGYVTPVKNQGQCGSCWAFSATGALEGQMFRKT
CathBx3     PPQRVMFTEDLKLPAS--FDAREQWPQCPTIKEIRDQGSCGSCWAFGAVEAISDRICIHT
Staphopain  ---------------------ETQGNN-------------GWCAGYTMSALLN-------
Prim.cons.  P33333F3E3L333A2VN2P34V2WRQKG3VTPVKNQGSCGSCWAFSAV4A2EG3I3I3T
9 Structural inferences
- The most highly conserved regions are likely to correspond to the active site.
- Regions rich in insertions and deletions probably correspond to surface loops.
- A position containing a conserved Gly or Pro probably corresponds to a turn.
- A conserved pattern of hydrophobicity with spacing 2 (that is, every second residue), with the intervening residues more variable and including hydrophilic residues, suggests a β-strand on the surface.
- A conserved pattern of hydrophobicity with spacing 4 suggests a (surface) α-helix.
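The first three rules of thumb above can be turned into a crude column-by-column annotation. The sketch below is only an illustration: the function name `flag_columns`, the labels, and the 50% gap threshold are all assumptions, and the output is a set of hints to inspect by eye, not a structure prediction.

```python
def flag_columns(rows, gap="-", gap_threshold=0.5):
    """Label each alignment column with a rough structural hint.

    "loop?"     : at least gap_threshold of the rows have a gap here
    "turn?"     : every residue is the same and it is Gly or Pro
    "conserved" : every residue is the same (other amino acids)
    ""          : nothing to report
    """
    flags = []
    n_rows = len(rows)
    for column in zip(*rows):
        residues = [c for c in column if c != gap]
        gap_fraction = 1 - len(residues) / n_rows
        if gap_fraction >= gap_threshold:
            flags.append("loop?")
        elif len(set(residues)) == 1:
            flags.append("turn?" if residues[0] in "GP" else "conserved")
        else:
            flags.append("")
    return flags


example = [
    "SSGKFYGDQEK------DKGL",
    "SSGKFYGDEEK------DKGL",
    "YDGKWEVDEMKETKLPGDKGL",
    "YDGKWEVEEMKESKLPGDKGL",
]
for position, flag in enumerate(flag_columns(example), start=1):
    if flag:
        print(position, flag)
```

The hydrophobicity-spacing rules for β-strands and α-helices are not covered by this sketch.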
10 ClustalW/ClustalX
- Multiple alignment takes place in three steps
  - Pairwise alignment of all sequences
  - Calculating a guide tree
  - Progressive alignment.
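The first two steps above can be sketched in a few lines. This is a simplified stand-in, not Clustal's implementation: the similarity ratio from Python's `difflib.SequenceMatcher` replaces a real pairwise alignment score, and simple average-linkage clustering (UPGMA) replaces Clustal's own tree-building method. The function names and the short sequence names are hypothetical. The third step, progressive alignment along the tree, is left out.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_distances(seqs):
    """Step 1 (stand-in): distance = 1 - similarity ratio for every pair."""
    dist = {}
    for a, b in combinations(seqs, 2):
        ratio = SequenceMatcher(None, seqs[a], seqs[b], autojunk=False).ratio()
        dist[frozenset((a, b))] = 1.0 - ratio
    return dist


def guide_tree(seqs):
    """Step 2 (stand-in): average-linkage clustering into a nested tuple."""
    dist = pairwise_distances(seqs)
    # key: subtree (a name or a nested tuple), value: names of its leaves
    clusters = {name: [name] for name in seqs}

    def average_distance(c1, c2):
        pairs = [(x, y) for x in clusters[c1] for y in clusters[c2]]
        return sum(dist[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # join the two closest clusters; they would be aligned first
        a, b = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        clusters[(a, b)] = clusters.pop(a) + clusters.pop(b)
    return next(iter(clusters))


# Ungapped fragments taken from the calreticulin/calnexin slide.
seqs = {
    "Rat_CRT": "SSGKFYGDQEKDKGLQTSQDARFYALSARFEPFSNKGQTL",
    "Human_CRT": "SSGKFYGDEEKDKGLQTSQDARFYALSASFEPFSNKGQTL",
    "Rat_CNX": "YDGKWEVDEMKETKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPL",
    "Human_CNX": "YDGKWEVEEMKESKLPGDKGLVLMSRAKHHAISAKLNKPFLFDTKPL",
}
print(guide_tree(seqs))
```

The nested tuple gives the order in which a progressive aligner would combine sequences: the most similar pairs first, then the resulting groups.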
11 Guide tree of hexokinases
12 Hexokinase alignment
13 3D aspects - thioredoxins
[Figure labels: Loop, α, β, Active site]
14 Thioredoxin
[Figure labels: Loop, α, β, Active site]
15 Sequence logo from alignment
16 Naming conventions in multiple alignments
- ClustalW only uses the first word (i.e. never use white space in a name).
- Do not use special symbols; use underscore (_) to connect words.
- The protein name should be understandable in 15 characters (longer names are truncated).
- Every protein to be aligned needs an individual name.
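These naming rules are easy to check before running an alignment. The helper below is a hypothetical sketch (the name `clustal_safe_names` and the choice to raise an error on clashes are assumptions): it replaces white space and special symbols with underscores, truncates to 15 characters, and refuses names that are no longer unique afterwards.

```python
import re


def clustal_safe_names(names, max_len=15):
    """Return cleaned names, or raise ValueError if two become identical."""
    safe = []
    seen = set()
    for name in names:
        # runs of anything other than letters and digits become one underscore
        cleaned = re.sub(r"[^A-Za-z0-9]+", "_", name.strip()).strip("_")
        cleaned = cleaned[:max_len].rstrip("_")
        if not cleaned:
            raise ValueError(f"name {name!r} is empty after cleaning")
        if cleaned in seen:
            raise ValueError(f"name {name!r} is not unique after truncation: {cleaned!r}")
        seen.add(cleaned)
        safe.append(cleaned)
    return safe


print(clustal_safe_names(["Rat calreticulin", "Human calnexin (CANX)"]))
# ['Rat_calreticuli', 'Human_calnexin']
```

Raising an error instead of silently renaming forces the user to pick names that stay meaningful within 15 characters.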
17 PSI-BLAST
- Position-Specific Iterated BLAST
- For each search round, the aligned results are used as the basis for calculating a new position-specific scoring matrix.
- New iterations can be carried out as long as new hits are found.
- If no results are found in a normal BLAST, PSI-BLAST will not help.
- Check results carefully!
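The idea of recalculating the scoring matrix from the aligned hits of each round can be illustrated with a toy position-specific scoring matrix. This sketch makes strong simplifying assumptions that the real program does not: a uniform background frequency for all 20 amino acids, a fixed pseudocount, and no sequence weighting. The function names `build_pssm` and `score` are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(rows, pseudocount=1.0):
    """One dict of log-odds scores (bits) per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for column in zip(*rows):
        counts = Counter(c for c in column if c in AMINO_ACIDS)  # gaps ignored
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score(pssm, segment):
    """Sum the column scores for an ungapped segment of the same length."""
    return sum(column.get(aa, 0.0) for column, aa in zip(pssm, segment))


# Aligned hits around the conserved motif from the papain-family slide.
hits = ["GSCGSCW", "GQCGSCW", "GSCGSCW"]
pssm = build_pssm(hits)
print(round(score(pssm, "GSCGSCW"), 2), round(score(pssm, "AAAAAAA"), 2))
```

Residues seen often at a position score positively there and unseen residues score negatively, so each new set of hits shifts what the next search round rewards. This is also why a wrong hit accepted in one round can pull later rounds off course, and why the results need careful checking.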
18 PSI-BLAST of HIT protein
19 First search: 103 hits
20 First PSI iteration
21 Second PSI iteration