Title: Pr
1La place de la phylogénie en Bio-informatique
Jean-Michel CLAVERIE Structural Genetic
Information Laboratory CNRS UPR2589 Luminy, France
http//www.igs.cnrs-mrs.fr
Journées RNG Phylogénie
2Bio-Informatique des SéquencesUne progression
naturelle
- Une séquence trouver où sont les motifs et les
gènes - Deux séquences les aligner
- Trois et plus faire une arbre
3Gene finding the general principles behind all
current methods
- Gene-encoded peptide sequences have to lead to
foldable, compact structure in a water
environnement, - Real protein are thus made of a balanced mixture
of hydrophilic, hydrophobic, rigid, flexible,
neutral, or charged residues - This constraint induces a recognizable
statistical bias in coding DNA sequences, that we
use for detection. - Around the good looking regions, we look for
facultative additional signals, - The optimal prediction is then the best scoring
one according to a standard gene model.
4Non-coding vs. Protein-coding DNA sequence Why
is there a bias?
64 codons, including 3 STOPs TGA, TAA,
TAG Simple situation like E.coli
(ATGC0.25) Between two successive STOP
codons each of the 61 codons -gt 1.6 expected
relative frequency In contrast, codon
distributions within bona fide genes have to
results into foldable proteins, hence balancing
hydrophylic and hydrophobic, and - charged,
rigid and flexible residues.
Coding segments have to be compatible with the
typical amino acid composition of natural proteins
5 Amino Acid in N codon
expected
proteins at
random Ala 9.46 4 6.4 Arg
5.84 6 9.6 - Asn 3.99
2 3.2 Asp 5.10
2 3.2 Cys 1.20
2 3.2 - Gln 4.57
2 3.2 Glu 5.50
2 3.2 Gly 7.41 4 6.4
His 2.44 2 3.2
- Ile 5.95 3 4.8
Leu 10.5 6 9.6 Lys
4.14 2 3.2 Met 2.72
1 1.6 Phe 4.19
2 3.2 Pro 4.39
4 6.4 - Ser 5.71 6 9.6 - Thr
5.50 4 6.4 - Trp
1.48 1 1.6 Tyr 2.76
2 3.2 Val 7.16 4 6.4
AA with or 20 deviation from random
6A variety of methods take advantage of this bias
Chi2 (obs-rand)
Sliding Window
Sequence
7The Caesar (Julius) Cipher
Plaintext veni, vidi, vici Ciphertext YHQL,
YLGL, YLFL
Plain alphabet abcdefghijklmnopqrstuvwxyz Cipher
alphabet DEFGHIJKLMNOPQRSTUVWXYZABC
The Caesar cipher is based on a cipher alphabet
that is shifted a certain number of places,
relative to the plain alphabet
8Frequency of letters in English
letters letters
a 8.2 n 6.7
b 1.5 o 7.5
c 2.8 p 1.9
d 4.3 q 0.1
e 12.7 r 6.0
f 2.2 s 6.3
g 2.0 t 9.1
h 6.1 u 2.8
i 7.0 v 1.0
j 0.2 w 2.4
k 0.8 x 0.2
l 4.0 y 2.0
m 2.4 z 0.1
From 100,362 characters (Beker Piper, 1950)
9Frequency analysis of an enciphered message
letters letters
A 0.9 N 0.9
B 7.4 O 11.2
C 8.0 P 9.2
D 4.1 Q 0.6
E 1.5 R 1.8
F 0.6 S 2.1
G 0.3 T 0.0
H 0.0 U 1.8
I 3.3 V 5.3
J 5.3 W 0.3
K 7.7 X 10.1
L 7.4 Y 5.6
M 3.3 Z 1.5
10Al-kin-dee-u Al-Kindi
The founding Father of Gene-finding
bioinformatics
Al-Kindi (0850) Decyphering Cryptographic
Messages PMID 0000001
11Homophonic substitution cipher
a b c d e f g h i j k l m n o p q r s t u v w x y
09 48 13 01 14 10 06 23 32 15 04 26 22 18 00 38 94 29 11 17 08 34 60 28 21
12 81 41 03 16 31 25 39 70 37 27 58 05 95 35 19 20 61 89 52
33 62 45 24 50 73 51 66 54 42 76 43
47 79 44 56 83 84 71 72 77 86 49
53 46 65 88 91 90 80 96 69
67 55 68 93 99 75
78 57 85
92 64 97
74
82
87
Frequent letters, many codes
12 Homophonic Substitution in the Genetic Code
Amino Acid in N codon
expected
proteins at
random Leu 10.5 6 9.6 Arg
5.84 6 9.6 - Ser 5.71
6 9.6 - Ala 9.46 4
6.4 Gly 7.41 4 6.4
Val 7.16 4 6.4 Thr 5.50
4 6.4 - Pro 4.39
4 6.4 - Ile 5.95 3 4.8
Glu 5.50 2 3.2 Asp
5.10 2 3.2 Gln 4.57
2 3.2 Phe 4.19
2 3.2 Lys 4.14
2 3.2 Asn 3.99 2 3.2
Tyr 2.76 2 3.2 His
2.44 2 3.2 - Cys 1.20
2 3.2 - Met 2.72
1 1.6 Trp 1.48
1 1.6
13From old to new methods
- Methods
- Codon usage (Staden McLachlan, 1982) profile
- Differential k-tuple frequency (1986) profile
14K-tuple concepts and methods
ATGCTAGCATAGCTGCATGACATGCATGCA ATGC
TGCA TGCT ATGC GCTA
CATG CTAG GCAT
TAGC TGCA AGCA
ATGC
4-tuple (tetramer)
Pre-compute Fcoding (ATGC) , Fncoding (ATGC)
. ,
. ,
Fcoding () , Fncoding
()
15K-tuple profile
0.5
Fc ------- FcFnc
ATGC
Sequence
16Homophonic deciphering looking for 2-tuples
that often, never or rarely occurs je, jx, wa
k-tuple frequency analysis was probably known of
the Arabs (1000) and rediscovered during the
Renaissance
17Blaise de Vigenère (1586)
invented his undecipherable cipher to resist
k-tuple analysis
18Plain abcdefghijklmnopqrstuvwxyz 1
BCDEFGHIJKLMNOPQRSTUVWXYZA 2
CDEFGHIJKLMNOPQRSTUVWXYZAB 3
DEFGHIJKLMNOPQRSTUVWXYZABC 4
EFGHIJKLMNOPQRSTUVWXYZABCD 5
FGHIJKLMNOPQRSTUVWXYZABCDE 6
GHIJKLMNOPQRSTUVWXYZABCDEF 7
HIJKLMNOPQRSTUVWXYZABCDEFG 8
IJKLMNOPQRSTUVWXYZABCDEFGH 9
JKLMNOPQRSTUVWXYZABCDEFGHI 10
KLMNOPQRSTUVWXYZABCDEFGHIJ 11-20
21
VWXYZABCDEFGHIJKLMNOPQRSTU 22
WXYZABCDEFGHIJKLMNOPQRSTUV 23
XYZABCDEFGHIJKLMNOPQRSTUVW 24
YZABCDEFGHIJKLMNOPQRSTUVWX 25
ZABCDEFGHIJKLMNOPQRSTUVWXY 26
ABCDEFGHIJKLMNOPQRSTUVWXYZ
The Vigenère square
19The Vigenère square example
Keyword WHITE
WHITEWHITEWHITEWHI Plaintext diverttroopstoeastr
idge Ciphertext ZPDXVPAZHSLZBHIWZBKMZNM
Plain abcdefghijklmnopqrstuvwxyz 4
EFGHIJKLMNOPQRSTUVWXYZABCD 7
HIJKLMNOPQRSTUVWXYZABCDEFG 8
IJKLMNOPQRSTUVWXYZABCDEFGH 19
TUVWXYZABCDEFGHIJKLMNOPQRS 22
WXYZABCDEFGHIJKLMNOPQRSTUV
WHITE
20Charles Babbage Another pioneer In
bioinformatics
Cryptanalysis of The Vigenère cypher (1854)
PMID 0000002
21The invention of inhomogeneous (and hidden)
Markov Models
- Look for repeats in the cipher text
- Infer the keyword lenght from consistent repeat
distance L - Then analyze the character frequency at each
position independently 1, 2, 3 , ., L - Deduce ( call ) the corresponding Ceasar shift
22From old to new methods
- Methods
- Codon usage (Staden McLachlan, 1982) profile
- Differential k-tuple frequency (1986) profile
- Differential in-phase k-tuples (1988) profile
- Non-homogeous Markov chains (Borodvsky,1986
Tavare Song, 1989) profile/call - Neural-net (Mural Uberbacher, 1991) call
- Hidden Markov chain (Kulp al., 1996) call
23Bioinformatics vs. Biology
- 1951 1st protein sequence (Insulin, Sanger)
- 1960 Sequence-structure relationship (Globins,
Perutz) - 1965 "Evolutionary divergence convergence in
Proteins" Zuckerkandl Pauling - 1967 "Construction of Phylogenetic Trees" Fitch
Margoliash. - 1968 Atlas of Protein Sequences (M. Dayhoff,
Georgetown) - 1970 "A general method applicable to the search
for similaries in amino-acid sequences of two
proteins" Needleman Wunsch - 1973 Genetic engineering (Cohen, Boyer et al.)
- 1974 "Prediction of Protein Conformation" Chou
Fasman - 1977 ADN sequencing (Sanger, Maxam, Gilbert)
- 1977 1st bioinformatic "package" (Staden DB/
assembly, analysis) - 1978 Databases ACNUC, PIR, EMBL, GenBank
- 1980 Database access via telephone lines ( PIR )
- 1981 Los Alamos-GenBank 270 seqs, 370.000 nt
24Bioinformatics vs. Computers
- 1965 First industrial computer IBM/360
- "Evolutionary divergence and convergence in
Proteins" Zuckerkandl Pauling - 1967 "Construction of Phylogenetic Trees" Fitch
Margoliash. - 1968 First mini computer DEC PDP-8 (floor
top) - Atlas of Protein Sequences (M. Dayhoff,
Georgetown) - 1970 "A general method applicable to the search
for similaries " Needleman Wunsch. - 1971 1st work on RNA folding (Ninio)
- 1972 First micro processor Intel 8008
- 1973 Genetic engineering (Cohen et al.)
- 1974 "Prediction of Protein Conformation" Chou
Fasman - 1975 Intel 8080, kit Altair
- 1977 1st bioinformatic package (Staden PDP
8/11) - DEC-VAX Mini-computers
- Micro-computer (Apple, Commodore, Radioshack)
- 1978 Databases ACNUC, PIR, EMBL, GenBank
- 1980 Telephone access to the PIR database
- 1981 Los Alamos-GenBank 270 seqs, 370.000 nt
- IBM-PC (8088), 16-32kb
- 1983 IBM-XT HARD disk (10 Mbytes)
25Perspective historique (suite)
Bioinformatics Genomics more recent past
- 1981 Local Alignement (Smith-Waterman , JMB)
- 1985 "Fasta" (Pearson-Lipman, PNAS)
- 1989 ARPANET --gt INTERNET
- 1990 "Blast" (Altschul et al., JMB)
- 1990 Positional cloning of NF-1
- 1991 "Grail", first practical gene finder
(Mural et al., PNAS) - 1991 "EST" (Venter et al., Matsubara et al.)
- 1992 Complete sequence of yeast chr. 3
- 1995 Complete sequence of H. influenza
- 1996 Complete sequence of S. cerevisiae
- 1997 "Gapped Blast" (Alschul et al.,
NAR)/Genscan (Burge Karlin) - 1997 11 complete bacterial genomes available
- 1998 2 Mbase/day of new public sequence data, C.
elegans - 2000 Human Chr 22, 21, 90draft, Drosophila, 30
bacterial genomes
26Bon workshop!
27Genes Genomes
Gene 1
Gene 2
Gene 3
Gene 4
DNA
transcription
RNA (transcripts)
translation folding
Proteins (or RNAs)
Function 1
Function 2
Function 3
Function 4
28(No Transcript)