Title: Scoring multiple sequence alignments
1Multiple Sequence Alignment (MSA)
2Plan
- Introduction to sequence alignments
- Multiple alignment construction
- Traditional approaches
- Alignment parameters
- Alternative approaches
- Multiple alignment main applications
- MACSIMS: Multiple Alignment of Complete Sequences Information Management System
3Local alignment / Global alignment
Sequence A
Sequence B
Optimal global pairwise alignment: Needleman and Wunsch, 1970
Optimal local pairwise alignment: Smith and Waterman, 1981
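The difference between the two optimal pairwise methods can be sketched in a few lines. This is a minimal illustration, not the published algorithms in full: it returns only the score, and the match, mismatch and gap values are arbitrary assumptions standing in for a residue similarity matrix and real gap penalties.

```python
def pairwise_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Optimal alignment score of a and b by dynamic programming.

    local=False: global alignment (Needleman-Wunsch style).
    local=True:  local alignment (Smith-Waterman style): scores never drop
                 below 0 and the best cell anywhere in the table is returned.
    The scoring values are illustrative assumptions.
    """
    prev = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr
    return best if local else prev[-1]
```

With these toy scores, a shared segment flanked by unrelated residues keeps its full score in local mode, while global mode also pays for the flanking regions.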
4Pairwise alignment / Multiple alignment
5What is a multiple alignment?
A representation of a set of sequences, in which
equivalent residues (e.g. functional or
structural) are aligned in columns
Conserved residues
Conservation profile
Secondary structure
6MACS
- Schematic overview of complete alignment
- e.g. domain organisation (Interpro)
Key
CH
SH3
PI-PLC-X
SH2
PI-PLC-Y
rhoGEF
DAG_PE-bind
PH
C2
7Why multiple alignments?
Integration of a sequence in the context of the
protein family
- Applications
- phylogeny
- domain organisation
- functional residue identification
- 2D/3D structure prediction
- transmembrane prediction
8MSA Construction
9Multiple alignment construction
- Traditional approaches
- Optimal multiple alignment
- Progressive multiple alignment
- Alignment parameters
- Residue similarity matrices
- Gap penalties
- Alternative approaches
- Iterative alignment methods
- Combinatorial algorithms
- PipeAlign: a protein family analysis tool
10Traditional Approaches
11Optimal multiple alignment
The direct extension of pairwise dynamic programming to N dimensions (Sankoff, 1975): examine all possible alignments to find the optimal alignment.
Example: alignment of 3 sequences
Problems:
- The mathematically optimal alignment is not necessarily the biologically optimal alignment
- CPU time and memory requirements are prohibitive for practical purposes (the required time is proportional to N^k for k sequences of length N), which limits the method to < 10 sequences
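To see why the N^k growth is prohibitive, the size of the dynamic programming lattice can be computed directly. The function name and the sequence length of 300 are illustrative assumptions; constant factors are ignored.

```python
def dp_lattice_cells(seq_length, n_sequences):
    """Approximate number of cells to fill: N^k for k sequences of length N."""
    return seq_length ** n_sequences


if __name__ == "__main__":
    # Illustrative length N = 300; each extra sequence multiplies the work by N.
    for k in (2, 3, 5, 10):
        print(k, dp_lattice_cells(300, k))
```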
12Progressive multiple alignment
Heuristic algorithm that avoids calculating all possible alignments, but does not guarantee an optimal alignment
Principle: progressively align the sequences (or sequence groups) in pairs
13Progressive multiple alignment
Example: alignment of 7 globins (Hbb_human, Hbb_horse, Hba_human, Hba_horse, Myg_phyca, Glb5_petma and Lgb2_lupla)
Step 1: Pairwise alignment of all sequences
Ex: pairwise alignment of 2 globin sequences
Hbb_human 1  VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hbb_horse 2  VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSN ...

Hbb_human 1  LTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFESFGDLST ...
Hba_human 3  LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLS. ...

Hba_human 3  LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF.DLSH ...
Hbb_horse 2  LSGEEKAAVLALWDKVNEE..EVGGEALGRLLVVYPWTQRFFDSFGDLSN ...
The alignment can be obtained with:
- a global or local method
- dynamic programming or heuristic methods
Example: ClustalX uses global alignments, with a choice between:
- heuristic method (used in the Fasta program): faster
- dynamic programming (Smith Waterman): better
14Progressive multiple alignment
Step 2: Distance matrix construction
In ClustalX:
distance between 2 sequences = 1 - (nb of identical residues / nb of compared residues)
Ex: Hbb_human vs Hbb_horse: 83% identity, 0.17 distance
     1    2    3    4    5    6    7
1    -
2   .17   -
3   .59  .60   -
4   .59  .59  .13   -
5   .77  .77  .75  .75   -
6   .81  .82  .73  .74  .80   -
7   .87  .86  .86  .88  .93  .90   -
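The distance used in Step 2 can be sketched as follows for one aligned pair. This is a minimal illustration, not ClustalX code: the function name is hypothetical, and it assumes that positions with a gap ("-" or ".") in either sequence are not counted as compared residues.

```python
GAP_CHARS = "-."


def identity_distance(row_a, row_b):
    """1 - (identical residues / compared residues) for two aligned rows.

    Assumption: positions where either row has a gap are skipped.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for x, y in zip(row_a, row_b):
        if x in GAP_CHARS or y in GAP_CHARS:
            continue
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 1.0  # nothing to compare: treat as maximally distant
    return 1.0 - identical / compared
```

For a pair with 83 identical residues out of 100 compared, this gives a distance of about 0.17, as in the Hbb_human vs Hbb_horse example.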
15Progressive multiple alignment
Step 3 Sequential branching / Guide tree
construction
Sequential branching
Guide tree
Hba_human
Hba_horse
Hba_human
Hbb_horse
Hba_horse
Hbb_human
Glb5_petma
Myg_phyca
Lgb2_lupla
- Join the 2 closest sequences
- Recalculate distances and join the 2 closest sequences or nodes
- Step 3 is repeated until all sequences are joined
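The joining procedure can be sketched on the globin distance matrix from Step 2. This is a minimal sketch under stated assumptions: distances between nodes are recalculated as the average over all member pairs (UPGMA-style averaging), which is only one of the branching methods listed in these slides, and the assignment of matrix rows 4 to 7 to sequence names follows the order of the example list, which the slides confirm only for rows 1 to 3.

```python
from itertools import combinations

# Rows 1-3 are labelled on the slides; rows 4-7 are ASSUMED to follow
# the order in which the globins are listed in the example.
NAMES = ["Hbb_human", "Hbb_horse", "Hba_human", "Hba_horse",
         "Myg_phyca", "Glb5_petma", "Lgb2_lupla"]

# Lower triangle of the distance matrix (row i holds d(i, j) for j < i).
LOWER = [
    [],
    [0.17],
    [0.59, 0.60],
    [0.59, 0.59, 0.13],
    [0.77, 0.77, 0.75, 0.75],
    [0.81, 0.82, 0.73, 0.74, 0.80],
    [0.87, 0.86, 0.86, 0.88, 0.93, 0.90],
]


def build_guide_tree(names, lower):
    """Repeatedly join the two closest sequences or nodes; return the joins."""
    members = {name: [i] for i, name in enumerate(names)}  # node -> leaf indices

    def leaf_distance(i, j):
        return lower[max(i, j)][min(i, j)]

    def node_distance(a, b):
        # Average distance over all leaf pairs (assumed recalculation rule).
        pairs = [(i, j) for i in members[a] for j in members[b]]
        return sum(leaf_distance(i, j) for i, j in pairs) / len(pairs)

    joins = []
    while len(members) > 1:
        a, b = min(combinations(members, 2),
                   key=lambda pair: node_distance(*pair))
        merged = (a, b)  # a node is the nested tuple of what it joins
        members[merged] = members.pop(a) + members.pop(b)
        joins.append(merged)
    return joins


JOINS = build_guide_tree(NAMES, LOWER)
```

The last element of `JOINS` is the complete tree as nested tuples; the earlier elements give the branching order that the progressive alignment would follow.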
16Progressive multiple alignment
Step 4 Progressive alignment
The progressive multiple alignment follows the
branching order in tree
(Figure: schematic of the progressive alignment; rows of x represent sequences added group by group in guide tree order, with the rows Hba_human and Hba_horse labelled.)
17Progressive multiple alignment
H1
H3
H2
H4
H6
H7
H5
18Progressive multiple alignment methods
Progressive
Global
Local
SB
SBpima
multal
NJ
clustalx
UPGMA
ML
multalign pileup
MLpima
SB - Sequential Branching
UPGMA - Unweighted Pair Grouping Method
ML - Maximum Likelihood
NJ - Neighbor-Joining
19Alignment Parameters
20Residue similarity matrices
- Dynamic programming methods score an alignment
using residue similarity matrices, containing a
score for matching all pairs of residues
- For proteins, a wide variety of matrices exist: Identity, PAM, Blosum, Gonnet, etc.
21Residue similarity matrices
- Matrices are generally constructed by observing
the mutations in large sets of alignments, either
sequence-based or structure-based
- Matrices range from strict ones for comparing
closely related sequences to soft ones for very
divergent sequences.
A single best matrix does not exist!!
ClustalW automatically selects a suitable matrix
depending on the observed pairwise identity.
22Gap penalties
- A gap penalty is a cost for introducing gaps into
the alignment, corresponding to insertions or
deletions in the sequences
SFGDLSNPGAVMG
HF-DLS-----HG
The goal is to introduce gaps in sequence segments corresponding to flexible regions of the protein structure
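As a rough illustration of how a gap penalty enters the score, the sketch below charges each run of gaps an opening cost plus a cost per gapped position. The affine form, the function name and the values 10 and 0.2 are assumptions made for the example; programs differ in the exact convention and in position-specific adjustments.

```python
import re


def gap_cost(aligned_row, gap_open=10.0, gap_extend=0.2):
    """Total penalty for the gaps in one aligned row.

    Assumed convention: each run of '-' costs gap_open + gap_extend * length.
    """
    runs = re.findall(r"-+", aligned_row)
    return sum(gap_open + gap_extend * len(run) for run in runs)
```

For the row HF-DLS-----HG there are two runs (lengths 1 and 5), so under these assumed values the cost is 10.2 + 11.0 = 21.2: one long gap is cheaper than several short ones covering the same positions.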
23Alternative Approaches
24Iterative alignment methods
- Iterative alignment, e.g. PRRP (Gotoh, 1993)
- - refine an initial progressive multiple alignment by iteratively dividing the alignment into 2 profiles and realigning them
- Genetic algorithms, e.g. SAGA (Notredame et al, 1996)
- - iteratively refine an alignment using genetic algorithms (evolves a population of alignments in a quasi-evolutionary manner)
- Segment-to-segment alignment: DIALIGN (Morgenstern et al. 1999)
- - searches for locally conserved motifs in all sequences and compares segments of sequences instead of single residues
- Hidden Markov Models
- - iteratively refine an alignment using HMMs
- - e.g. HMMER (Eddy, 1998)
- - SAM (Karplus et al, 2001)
25Multiple alignment methods
Progressive
Global
Local
SB
SBpima
multal
NJ
clustalx
UPGMA
ML
multalign pileup
MLpima
prrp
Genetic Algo.
HMM
dialign
saga
hmmt
Iterative
26BAliBASE objective evaluation of MACS programs
- High-quality alignments based on 3D structural superpositions and manually verified
- Alignments compared only in reliable core blocks, excluding non-superposable regions
- Separate reference sets specifically designed to address distinct alignment problems
BAliBASE1: Thompson et al. 1999 Bioinformatics
BAliBASE2: Bahr et al, 2001 Nucl Acids Res.
Reference set   Description
1               small number of sequences: divergence, length
2               a family with one to 3 orphans
3               several sub-families
4               long N/C terminal extensions
5               long insertions
6               repeats
7               transmembrane regions
8               circular permutations
27Comparison of multiple alignment methods
Reference alignments are needed to evaluate the alignment programs
- BAliBASE (Thompson et al. Bioinformatics. 1999) benchmark database
- Alignments based on 3D structure superposition
- Alignments must be compared for the superposable regions
- Alignments take into account
- - the effect of the number of sequences
- - the effect of the sequence length
- - the effect of the sequence similarity
- - alignment of an orphan sequence with a sequence family
- - sub-family alignments
- - alignments of sequences with different lengths (insertions, extensions)
28Comparison of multiple alignment methods
> 35% Id: any method
Local / global methods
- Colinear sequences: global methods
- N/C-ter extensions or insertions: local methods
Progressive / iterative methods
- Iterative algorithms usually improve alignment quality
- Problems
- - Can give a bad alignment in the case of orphan sequences
- - The iterative process can be very long!
Example: alignment of 89 histone sequences (66-92 residues)
- ClustalW: 2 mins 41 secs
- PRRP: 3 hours 40 mins
- Dialign: 3 hours 48 mins
To increase the alignment quality, as many sequences as possible have to be integrated!
29DbClustal local and global algorithm coupling
Blast Database Search
Query Sequence
Database Hits
Domain A
Domain B
Domain C
30ClustalW / DbClustal comparison
ClustalW
DbClustal
31Combinatorial algorithms
- T-Coffee (Notredame et al. 2000) http://igs-server.cnrs-mrs.fr/Tcoffee/
- - performs local and global alignments for all pairs of sequences, then combines them in a progressive multiple alignment, similar to ClustalW
- DbClustal (Thompson et al. 2000) http://bips.u-strasbg.fr/PipeAlign/jump_to.cgi?DbClustalnoid
- - designed to align the sequences detected by a database search. Locally conserved motifs are detected using the Ballast program (Plewniak et al. 1999) and are used in the global multiple alignment as anchor points
- MAFFT (Katoh et al. 2002) http://timpani.genome.ad.jp/%7Emafft/server
- - detects locally conserved segments using a Fast Fourier Transform, then uses a restricted global DP and a progressive algorithm
- MUSCLE (Edgar, 2004) http://www.drive5.com/muscle
- - kmer distances and log-expectation scores, progressive and iterative refinement
- PROBCONS (Do et al, 2005) http://probcons.stanford.edu
- - pairwise consistency based on an objective function
32Multiple Alignment Quality
Truncated Alignments
               Ref1      Ref1        Ref2     Ref3       Ref4        Ref5        Time
               V1 (<20)  V2 (20-40)  orphans  subgroups  extensions  insertions  (sec)
ClustalW1.83   0.42      0.78        0.42     0.52       0.41        0.38        902
Dialign2.2.1   0.31      0.71        0.37     0.39       0.45        0.43        5993
Mafft5.32      0.44      0.78        0.49     0.53       0.47        0.48        96
Maffti5.32     0.54      0.83        0.56     0.60       0.49        0.57        327
Muscle3.51     0.52      0.82        0.50     0.58       0.46        0.54        523
Muscle_fast    0.40      0.77        0.43     0.44       0.35        0.49        34
Muscle_med     0.45      0.80        0.50     0.59       0.44        0.51        219
Tcoffee2.66    0.47      0.84        0.50     0.64       0.54        0.58        216133
Probcons1.1    0.63      0.87        0.60     0.65       0.54        0.63        19035
1. Significant improvement in accuracy/efficiency since 2000
2. Twilight zone still exists
3. Probcons scores best in all tests, but is MUCH slower than MAFFT or MUSCLE
4. MAFFTI scores slightly better than MUSCLE in all tests, and is more efficient
muscle_fast: muscle -maxiters 1 -diags1 -sv -distance1 kbit20_3
muscle_medium: muscle -maxiters 2
33Multiple Alignment Quality
Comparison truncated versus full-length sequences
(T = truncated, FL = full-length)
               Ref1 V1 (<20)  Ref1 V2 (20-40)  Ref2 orphans  Ref3 subgroups  Time (sec) for all refs
               T     FL       T     FL         T     FL      T     FL        T       FL
ClustalW1.83   0.42  0.24     0.78  0.72       0.42  0.20    0.52  0.27      902     2227
Dialign2.2.1   0.31  0.26     0.71  0.70       0.37  0.29    0.39  0.31      5993    12595
Mafft5.32      0.44  0.25     0.78  0.75       0.49  0.35    0.53  0.38      96      312
Maffti5.32     0.54  0.35     0.83  0.80       0.56  0.40    0.60  0.50      327     1409
Muscle3.51     0.52  0.34     0.82  0.79       0.50  0.36    0.58  0.39      523     3608
Muscle_fast    0.40  0.28     0.77  0.72       0.43  0.29    0.44  0.33      34      132
Muscle_med     0.45  0.29     0.80  0.74       0.50  0.34    0.59  0.38      219     1601
Tcoffee2.66    0.47  0.35     0.84  0.82       0.50  0.40    0.64  0.49      216133  341578
Probcons1.1    0.63  0.43     0.87  0.86       0.60  0.41    0.65  0.54      19035   58488
- Loss of accuracy is greater in the twilight zone (Ref1 V1, orphans, and subgroups)
- Probcons still scores best in all tests
- MAFFTI still scores better than MUSCLE in all tests
34Multiple alignment quality
Development of objective functions to estimate
multiple alignment quality
- Sum-of-pairs (Carrillo, Lipman, 1988)
- - sums the scores of all pairs of sequences (based on a similarity matrix and gap penalty)
- Relative Entropy
- - uses a normalized log-likelihood ratio to measure the degree of conservation for each column (identical residues only)
- MD
- - (column scores used in ClustalX) uses a comparison matrix (Gonnet) to take into account similar residues
- norMD (Thompson et al, 2001)
- - scores by column using a substitution matrix and gap penalties
- - normalisation according to the sequences to align (their number, length and the similarity between them)
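The sum-of-pairs idea can be sketched as below. This is a toy version under stated assumptions: an identity-style score (match/mismatch) stands in for a real similarity matrix such as Gonnet, a flat per-position gap cost stands in for real gap penalties, and a column pair with two gaps is ignored.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum the scores of every pair of sequences over every column.

    `alignment` is a list of equal-length aligned rows using '-' for gaps.
    The scoring values are illustrative assumptions, not a published matrix.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue          # gap against gap: ignored
            if x == "-" or y == "-":
                total += gap      # residue against gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total
```

Because every pair of rows contributes, the number of terms grows with the square of the number of sequences, which is one reason column-based and normalised scores such as norMD are also used.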
35Evaluation of Objective Functions using BAliBase
36Multiple sequence alignment editors
No automatic method is 100% reliable. Manual verification and refinement are essential!
- SeqLab: GCG Wisconsin Package
- SeaView (Gaultier et al, 1996) http://pbil.univ-lyon1.fr/software/seaview.html
WEB servers:
- GeneAlign (Kurukawa) http://www.gen-info.osaka-u.ac.jp/geneweb2/genealign/
- Jalview (Clamp, 1998) http://www.ebi.ac.uk/michele/jalview/
- CINEMA (Lord et al, 2002) http://www.bioinf.man.ac.uk/dbbrowser/cinema-mx
37FASTA format
>O88763 Phosphatidylinositol 3-kinase.
------MGEAEKFHYIYSCDLDINVQLKIGSLEGKREQKSYKAVLEDPMLKFSGLYQETC
SDLYVTCQVFAEGKPLALPVRTSYKPFSTRWN-WNEWLKLPVKYPDLPRNAQVALTIWD-
-----VYGPG-RAVPVGGTTVSLFGKYGMFRQGMHDLKVWPNVEADGSEPTRTPGRTSST
LSEDQMSRLAKLTKAHRQGHMVKVLDRLTFREIEMINESEKRSS--NFMYLMVEFRCVKC
DDKE-YGIVYYE----
>Q9W1M7 CG5373-PA (GH13170p).
-----MDQPDDHFRYIHSSSLHERVQIKVGTLEGKKRQPDYEKLLEDPILRFSGLYSEEH
PSFQVRLQVFNQGRPYCLPVTSSYKAFGKRWS-WNEWVTLPLQFSDLPRSAMLVLTILD-
-----CSGAG-QTTVIGGTSISMFGKDGMFRQGMYDLRVWLGVEGDGNFPSRTPGK-GKE
SSKSQMQRLGKLAKKHRNGQVQKVLDRLTFREIEVINEREKRMS--DYMFLMIEFPAIVV
DDMYNYAVVYFE----
>Q7PMF0 ENSANGP00000002906 (Fragment).
------------LRYIGSSSLLQKISIKIGTLEGENVGYSYEKLIEQPLLKFSGMYTEKT
PPLKVKLQIFDNGEPVGLPVCTSHKHFTTRWS-WNEWVTLPLRFTDISRTAVLGLTIYD-
-----CAGGREQLTVVGGTSISFFSTNGLFRQGLYDLKVWPQMEPDGACNSITPGK-AIT
TGVHQMQRLSKLAKKHRNGQMEKILDRLTFRELEVINEMEKRNS--QFLYLMVEFPQVYI
HEKL-YSVIHLE----
>Q9TXI7 Related to yeast vacuolar protein sorting factor protein 34
MIPGMRATPTESFSFVYSCDLQTNVQVKVAEFEG-----IFRDVLN-PVRRLNQLFAEIT
VYCNNQQIGYPVCTSFHTPPDSSQLARQKLIQKWNEWLTLPIRYSDLSRDAFLHITIWEH
EDDEIVNNSTFSRRLVAQSKLSMFSKRGILKSGVIDVQMNVSTTPDPFVKQPETWKYSDA
WG-DEIDLLFKQVTRQSRGLVEDVLDPFASRRIEMIRAKYKYSSPDRHVFLVLEMAAIRL
GPTF-YKVVYYEDETK
38MSF format
toto.msf  MSF: 256  Type: P  May 24, 2005 19:34  Check: 3415 ..

 Name: O88763  Len: 256  Check: 9443  Weight: 1.00
 Name: Q9W1M7  Len: 256  Check: 1161  Weight: 1.00
 Name: Q7PMF0  Len: 256  Check: 8095  Weight: 1.00
 Name: Q9TXI7  Len: 256  Check: 4716  Weight: 1.00

//

        1                                                   50
O88763  ......MGEA EKFHYIYSCD LDINVQLKIG SLEGKREQKS YKAVLEDPML
Q9W1M7  .....MDQPD DHFRYIHSSS LHERVQIKVG TLEGKKRQPD YEKLLEDPIL
Q7PMF0  .......... ..LRYIGSSS LLQKISIKIG TLEGENVGYS YEKLIEQPLL
Q9TXI7  MIPGMRATPT ESFSFVYSCD LQTNVQVKVA EFEG.....I FRDVLN.PVR

        51                                                 100
O88763  KFSGLYQETC SDLYVTCQVF AEGKPLALPV RTSYKPFSTR WN.WNEWLKL
Q9W1M7  RFSGLYSEEH PSFQVRLQVF NQGRPYCLPV TSSYKAFGKR WS.WNEWVTL
Q7PMF0  KFSGMYTEKT PPLKVKLQIF DNGEPVGLPV CTSHKHFTTR WS.WNEWVTL
Q9TXI7  RLNQLFAEIT VYCNNQQIGY PVCTSFHTPP DSSQLARQKL IQKWNEWLTL

        101                                                150
O88763  PVKYPDLPRN AQVALTIWD. .....VYGPG .RAVPVGGTT VSLFGKYGMF
Q9W1M7  PLQFSDLPRS AMLVLTILD. .....CSGAG .QTTVIGGTS ISMFGKDGMF
Q7PMF0  PLRFTDISRT AVLGLTIYD. .....CAGGR EQLTVVGGTS ISFFSTNGLF
Q9TXI7  PIRYSDLSRD AFLHITIWEH EDDEIVNNST FSRRLVAQSK LSMFSKRGIL

        151                                                200
O88763  RQGMHDLKVW PNVEADGSEP TRTPGRTSST LSEDQMSRLA KLTKAHRQGH
Q9W1M7  RQGMYDLRVW LGVEGDGNFP SRTPGK.GKE SSKSQMQRLG KLAKKHRNGQ
Q7PMF0  RQGLYDLKVW PQMEPDGACN SITPGK.AIT TGVHQMQRLS KLAKKHRNGQ
Q9TXI7  KSGVIDVQMN VSTTPDPFVK QPETWKYSDA WG.DEIDLLF KQVTRQSRGL

        201                                                250
O88763  MVKVLDRLTF REIEMINESE KRSS..NFMY LMVEFRCVKC DDKE.YGIVY
Q9W1M7  VQKVLDRLTF REIEVINERE KRMS..DYMF LMIEFPAIVV DDMYNYAVVY
Q7PMF0  MEKILDRLTF RELEVINEME KRNS..QFLY LMVEFPQVYI HEKL.YSVIH
Q9TXI7  VEDVLDPFAS RRIEMIRAKY KYSSPDRHVF LVLEMAAIRL GPTF.YKVVY

        251
O88763  YE....
Q9W1M7  FE....
Q7PMF0  LE....
Q9TXI7  YEDETK
Multiple Sequence Format
39With an editor
40PipeAlign protein family analysis tool
http://bips.u-strasbg.fr/PipeAlign/
Plewniak et al, 2003
41PipeAlign
42MSA Main Applications
43MSA central role in biology
MACS
44MACS new landscape
High volume and heterogeneity of sequence data
- Length: from tens of amino acids or nucleotides to thousands or millions (genomes)
- Number: from tens up to thousands of sequences
- Variability: from a small percent identity to almost identical
- Complexity of the sequences to be aligned
- - Family with linear or highly irregular distribution of sequence variability
- - Heterogeneity of length, structure or composition (large insertions or extensions, repeats, circular permutations, transmembrane regions)
- Fidelity: from 15-30% errors (sequence, eucaryotic gene prediction, annotation)
45MACS new concepts
Distinct objectives imply distinct needs and strategies
- Overview of one sequence family to quickly infer and integrate information from a limited number of closely related, well annotated sequences (reliable and efficient)
- Exhaustive analysis of one sequence family (very high quality) for
- - homology modeling
- - phylogenetic studies
- - subfamily-specific features (differentially conserved domains, regions or residues)
- Massive analysis of sets of sequences (reliable/high quality and efficient)
- - phylogenetic distribution, co-presence and co-absence and structural complex
- - genome annotation
- - target characterisation for functional genomics studies (transcriptomics)
46Residue conservation identification
- residues conserved in all sequences in the family
- - structural or functional importance, characteristic motifs
- residues conserved within a sub-group of sequences
- - discriminant residues
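The two kinds of conservation can be separated mechanically once the sub-groups are known. This is a minimal sketch with hypothetical function names: it treats a column as conserved only when every row carries the identical residue (no gaps), and as discriminant when each of two sub-groups is conserved at that column but on different residues. Real analyses usually also accept similar residues.

```python
GAP_CHARS = "-."


def conserved_columns(rows):
    """Indices of columns with the same non-gap residue in every row."""
    return [i for i, column in enumerate(zip(*rows))
            if len(set(column)) == 1 and column[0] not in GAP_CHARS]


def discriminant_columns(group_a, group_b):
    """Columns conserved inside each group but differing between the groups."""
    shared = set(conserved_columns(group_a)) & set(conserved_columns(group_b))
    return [i for i in sorted(shared) if group_a[0][i] != group_b[0][i]]
```

For example, with group A = MKVLD, MKVLE and group B = MRVLD, MRVIA, columns 0 and 2 are conserved in all sequences, while column 1 (K in group A, R in group B) is discriminant.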
47Ordered Alignment analysis of TyrRS
(Figure: ordered alignment of TyrRS sequences in the groups Euc, Arc Euc and Bac, marking Motif I, Motif II, the N-terminal extension, the C-terminal extension, the S4 domain and the EMAP domain; scale bar: 10 aa.)
50Phylogenetic studies
Multiple alignments: basis for the calculation of the levels of similarity between sequences
Multiple alignments: basis for the calculation of the evolutionary distances between sequences
Multiple alignments: basis for the computation of phylogenetic trees
Building a high-quality phylogenetic tree requires high-quality multiple sequence alignments
51Phylogenetic studies
Whole alignment
(Figure: phylogenetic tree computed from the whole alignment, with leaves labelled by species code and clades labelled Eucarya, Archaea and Bacteria / Mitochondria.)
52Phylogenetic studies
N terminus global gap removal
(Figure: phylogenetic tree for the same set of species, with clades labelled Eukarya and Bacteria / Archaea / Mito.; scale bar: 0.1.)
53Schematic alignment of Aspartyl-tRNA synthetases
55Protein sequence validation
Sequencing / frameshift error detection
Estimate: 44% of predicted proteins from genome sequencing projects and 31% of high-throughput cDNA (HTC) contain errors in their intron/exon structure (Bianchetti et al, 2005).
Example: transcription TFIIH complex protein
56Clustered MACS Starter
Multiple alignment of complete sequences
Determination of sequence groups
- Hierarchical clustering of positions
- based on insertion/deletion
- Definition of blocks
- N-terminal region analysis
- Reference position
- Proposed N-terminus: potential start codon closest to the reference position
--------MXXXXXX-XXXXXX-------XXX
-------MXXXX-XXXXXXXXXX------XXX
MXXXXXXMXXXMXXXXX-XXXXX-XXXXXXXX
------MXXXXXXXXXXXXX-XX--XXXXXXX
---------MXXXXX-XXXXXXXXXXXXXXXX
extension
Reference position
- 3000 proteins from B. subtilis with wrong, randomly generated N-termini: 82 predicted
- For the 3828 proteins from the Vibrio cholerae proteome: 817 specific / 1722 valid start codons / 236 wrong (from 1 up to 56 aas)
57Clustered MACS vAlid
Bianchetti et al. (2005) JBCB
58Clustered MACS DbW
- Databases
- - Proteins
- Structures
Automatic update of more than 300 different protein families: 24 AaRS (aminoacyl-tRNA synthetases), nuclear receptors, ribosomal proteins, transcription factors
Prigent et al. (2005) Bioinformatics
59Clustered MACS GOAnno
GOAnno automatically finds a pertinent level and propagates Gene Ontology annotation to an unannotated target protein according to the clustered MACS
Chalmel et al. (2005) Bioinformatics
60Protein 3D structure prediction
Proteins with similar sequences tend to fold into similar structures
- Above 50% identity, a pairwise alignment is enough for an accurate model
- Below 50% identity, a multiple alignment is better
- Basic steps for comparative (homology) modelling
- - Identify a template structure
- - Align the target sequence to the template sequence
- - Copy the backbone coordinates from the template to the matching residues in the target sequence
- - Build the side-chains (copied for identical residues, predicted for non-identical)
- - Model the loop regions
- - Optimise (energy refinement)
Applicable to 60% of proteins from fully sequenced genomes
61Protein functional characterisation
By homology: similar sequences generally share similar structures and often have similar functions
Propagation of information from a known sequence to an unknown one, e.g. domains, active sites, cellular localisation, post-translational modifications
1. Database search for homologues, e.g. BlastP, PSI-Blast
2. Domain databases, e.g. Interpro (EBI), CDD (NCBI)
3. Multiple alignment construction and analysis, e.g. PipeAlign
62MSA applications Summary
Error in ORF definition
Additional domain
Transmembrane region
Phosphorylation site
1st FAMILY
Bacteria
Bacteria
2nd FAMILY
Archaea
Eucarya
NLS
Intra-group conservation
Universal conservation
Differential conservation between the two families
domain organization, structural motifs, key functional residues, ORF definition, localization signals, conservation pattern ...
Functional genomics
Mutagenesis experiments
Evolutionary studies
Structure modeling
Drug design
Lecompte et al Gene. 2001
63MACSIMS
64MAO Multiple Alignment Ontology
http://www-igbmc.u-strasbg.fr/BioInfo/MAO/mao.html
MAO consortium
- RNA analysis (Steve HOLBROOK, Berkeley)
- MACS algorithm (Kazutake KATOH, Kyoto)
- Protein 3D analysis (Patrice KOEHL, Davis)
- Protein 3D structure (Dino MORAS, Strasbourg)
- 3D RNA structure (Eric WESTHOF, Strasbourg)
Also available from the OBO web site: http://obo.sourceforge.net
Thompson et al. (2005) Nucleic Acids Res.
65MACSIMS
- Multiple Alignment of Complete Sequences
Information Management System
Thompson et al BMC Bioinformatics 2006
Structural and functional information is mined
automatically from the public databases
Homologous regions are identified in the MACS
Mined data is evaluated and cross-validated
Mined data is propagated from known to unknown
sequences with the homologous regions
MACSIMS provides a unique environment that
facilitates knowledge extraction and the
presentation of the most pertinent information to
the biologist
66MACSIMS
http://bips.u-strasbg.fr/MACSIMS/
67MACSIMS
- Schematic overview of complete alignment
- e.g. domain organisation (Interpro)
Key
CH
SH3
PI-PLC-X
SH2
PI-PLC-Y
rhoGEF
C2
DAG_PE-bind
PH
68MACSIMS visualisation
JalView II, Coll. G. Barton
69MACSIMS
BAliBASE reference 3 aldehyde dehydrogenase-like
72Summary
- Choice of multiple alignment method
- - traditional progressive method (e.g. clustalw / clustalx)
- - combined local and global method (e.g. mafft, muscle, dbclustal)
- - knowledge-based method (e.g. PipeAlign)
- Web server versus local installation?
WARNING: Automatic alignment methods can make mistakes. Verify alignment quality by automatic methods (e.g. norMD) and visual inspection!
- Multiple alignment applications
- - Traditional applications
- - - phylogeny
- - - conserved residue / motif identification
- - Information in multiple alignments also improves accuracy in
- - - sequence error detection
- - - structure prediction
- - - functional annotation
73Laboratory of Integrative Genomics and Bioinformatics, IGBMC, Strasbourg
74alternative algorithms
Iterative Refinement
PRRP (Gotoh, 1993) refines an initial progressive
multiple alignment by iteratively dividing the
alignment into 2 profiles and realigning them.
75alternative algorithms
Genetic Algorithms
SAGA (Notredame, Higgins, 1996) evolves a
population of alignments in a quasi evolutionary
manner, iteratively improving the fitness of the
population
population n
select a number of individuals to be parents
modify the parents by shuffling gaps, merging 2
alignments etc.
population n+1
evaluation of the fitness using OF (sum-of-pairs
or COFFEE)
END
76alternative algorithms
HMM
- Probabilistic model for sequence profiles, visualized as a finite state machine
- For each column of the alignment, a match state models the distribution of residues allowed
- Insert and delete states at each column allow for insertion or deletion of one or more residues
Original profile HMM (Krogh et al, 1994)
E
AK
Y W
L L
D D
V
AKY-L-D --WVLED
77Multiple Alignment using HMM
generate initial alignment (Baum-Welch
expectation maximization)
HMMER (Eddy, unpublished) SAM-T98 (Hughey, 1996)
produce a model
generate new alignment (Viterbi algorithm or
posterior decoding)
evaluate alignment (expectation maximization)
END
78alternative algorithms
Segment-to-segment Alignment
Dialign (Morgenstern et al. 1996) compares segments of sequences instead of single residues
1. construct dot-plots of all possible pairs of sequences
2. find a maximal set of consistent diagonals in all the sequences
.......aeyVRALFDFngndeedlpfkKGDILRIrdkpeeq...............WWNAedsegkr.GMIPVPYVek..........
........nlFVALYDFvasgdntlsitKGEKLRVlgynhnge..............WCEAqtkngq..GWVPSNYItpvns.......
ieqvpqqptyVQALFDFdpqedgelgfrRGDFIHVmdnsdpn...............WWKGachgqt..GMFPRNYVtpvnrnv.....
gsmstselkkVVALYDYmpmnandlqlrKGDEYFIleesnlp...............WWRArdkngqe.GYIPSNYVteaeds......
.....tagkiFRAMYDYmaadadevsfkDGDAIINvqaideg...............WMYGtvqrtgrtGMLPANYVeai.........
..gsptfkcaVKALFDYkaqredeltfiKSAIIQNvekqegg...............WWRGdyggkkq.LWFPSNYVeemvnpegihrd
.......gyqYRALYDYkkereedidlhLGDILTVnkgslvalgfsdgqearpeeigWLNGynettgerGDFPGTYVeyigrkkisp..
Local alignment - residues between the diagonals
are not aligned