Provided by: graf4
Learn more at: https://tcoffee.org
1
The Place of Phylogeny in Bioinformatics
Jean-Michel CLAVERIE, Structural & Genetic
Information Laboratory, CNRS UPR2589, Luminy, France
http://www.igs.cnrs-mrs.fr
Journées RNG: Phylogénie (RNG Days: Phylogeny)
2
Sequence Bioinformatics: a natural progression
  • One sequence: find where the motifs and the
    genes are
  • Two sequences: align them
  • Three or more: build a tree

3
Gene finding: the general principles behind all
current methods
  1. Gene-encoded peptide sequences have to lead to
    a foldable, compact structure in a water
    environment.
  2. Real proteins are thus made of a balanced mixture
    of hydrophilic, hydrophobic, rigid, flexible,
    neutral, or charged residues.
  3. This constraint induces a recognizable
    statistical bias in coding DNA sequences, which we
    use for detection.
  4. Around the good-looking regions, we look for
    optional additional signals.
  5. The optimal prediction is then the best-scoring
    one according to a "standard" gene model.

4
Non-coding vs. protein-coding DNA sequence: why
is there a bias?
64 codons, including 3 STOPs: TGA, TAA, TAG.
Simple situation like E. coli (A = T = G = C = 0.25):
between two successive STOP codons, each of the
61 codons has an expected relative frequency of
about 1.6% (1/61).
In contrast, codon distributions within bona fide
genes have to result in foldable proteins, hence
balancing hydrophilic and hydrophobic, + and -
charged, rigid and flexible residues.
Coding segments have to be compatible with the
typical amino acid composition of natural proteins.
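The "expected at random" figure can be sketched numerically. This is a minimal illustration, assuming all 61 sense codons are equally likely and using the standard genetic code; the names `CODONS_PER_AA` and `expected_random_percent` are hypothetical. Note that the slides round to 1.6% per codon (so 4 codons give 6.4%), whereas the exact value is 100/61, about 1.64% per codon.

```python
# Number of codons per amino acid in the standard genetic code
# (the three STOP codons TAA, TAG, TGA are excluded).
CODONS_PER_AA = {
    "Ala": 4, "Arg": 6, "Asn": 2, "Asp": 2, "Cys": 2,
    "Gln": 2, "Glu": 2, "Gly": 4, "His": 2, "Ile": 3,
    "Leu": 6, "Lys": 2, "Met": 1, "Phe": 2, "Pro": 4,
    "Ser": 6, "Thr": 4, "Trp": 1, "Tyr": 2, "Val": 4,
}
N_SENSE_CODONS = sum(CODONS_PER_AA.values())  # 61 sense codons


def expected_random_percent(aa):
    """Expected % of an amino acid if every sense codon were equally likely."""
    return 100.0 * CODONS_PER_AA[aa] / N_SENSE_CODONS


if __name__ == "__main__":
    for aa in sorted(CODONS_PER_AA):
        print(f"{aa}  {CODONS_PER_AA[aa]}  {expected_random_percent(aa):.1f}")
```

Comparing these values with the observed composition of real proteins is what exposes the coding bias.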
5
Amino acid   % in proteins   N codons   % expected at random   Mark
Ala          9.46            4          6.4
Arg          5.84            6          9.6                    -
Asn          3.99            2          3.2
Asp          5.10            2          3.2
Cys          1.20            2          3.2                    -
Gln          4.57            2          3.2
Glu          5.50            2          3.2
Gly          7.41            4          6.4
His          2.44            2          3.2                    -
Ile          5.95            3          4.8
Leu          10.5            6          9.6
Lys          4.14            2          3.2
Met          2.72            1          1.6
Phe          4.19            2          3.2
Pro          4.39            4          6.4                    -
Ser          5.71            6          9.6                    -
Thr          5.50            4          6.4                    -
Trp          1.48            1          1.6
Tyr          2.76            2          3.2
Val          7.16            4          6.4
Mark: AA with + or - 20% deviation from random
6
A variety of methods take advantage of this bias
[Figure: Chi2 (obs - rand) computed in a sliding
window and plotted along the sequence]
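A sliding-window Chi2 profile of this kind could be sketched as follows. This is only an illustration under stated assumptions: the statistic compares observed sense-codon counts in one reading frame with a uniform expectation, and the names `codons_of`, `chi2_vs_random` and `sliding_chi2`, as well as the window and step sizes, are hypothetical rather than taken from the slides.

```python
from collections import Counter
from itertools import product

STOPS = {"TAA", "TAG", "TGA"}
SENSE_CODONS = ["".join(c) for c in product("ACGT", repeat=3)
                if "".join(c) not in STOPS]
SENSE_SET = set(SENSE_CODONS)


def codons_of(seq):
    """Split a DNA sequence into consecutive codons in reading frame 0."""
    seq = seq.upper()
    return [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]


def chi2_vs_random(codons):
    """Chi2 of observed sense-codon counts against a uniform expectation."""
    counts = Counter(c for c in codons if c in SENSE_SET)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    expected = n / len(SENSE_CODONS)
    return sum((counts[c] - expected) ** 2 / expected for c in SENSE_CODONS)


def sliding_chi2(seq, window=60, step=1):
    """Chi2 profile along the sequence; window and step are counted in codons."""
    codons = codons_of(seq)
    return [chi2_vs_random(codons[i:i + window])
            for i in range(0, len(codons) - window + 1, step)]
```

High values flag windows whose codon composition departs strongly from random; with short windows the statistic is noisy, so the window size is a trade-off between resolution and reliability.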
7
The Caesar (Julius) Cipher
Plaintext:  veni, vidi, vici
Ciphertext: YHQL, YLGL, YLFL
Plain alphabet:  abcdefghijklmnopqrstuvwxyz
Cipher alphabet: DEFGHIJKLMNOPQRSTUVWXYZABC
The Caesar cipher is based on a cipher alphabet
that is shifted a certain number of places
relative to the plain alphabet.
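The shifted-alphabet idea can be written in a few lines. A minimal sketch, assuming a shift of 3 (as in the cipher alphabet shown here), lowercase plaintext and uppercase ciphertext; the function names are hypothetical.

```python
import string


def caesar_encrypt(plaintext, shift=3):
    """Shift each letter `shift` places; non-letters are kept unchanged."""
    out = []
    for ch in plaintext.lower():
        if ch in string.ascii_lowercase:
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("A")))
        else:
            out.append(ch)
    return "".join(out)


def caesar_decrypt(ciphertext, shift=3):
    """Undo caesar_encrypt by shifting each letter back."""
    out = []
    for ch in ciphertext.upper():
        if ch in string.ascii_uppercase:
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("a")))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(caesar_encrypt("veni, vidi, vici"))  # YHQL, YLGL, YLFL
```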
8
Frequency of letters in English (%)
Letter % Letter %
a 8.2 n 6.7
b 1.5 o 7.5
c 2.8 p 1.9
d 4.3 q 0.1
e 12.7 r 6.0
f 2.2 s 6.3
g 2.0 t 9.1
h 6.1 u 2.8
i 7.0 v 1.0
j 0.2 w 2.4
k 0.8 x 0.2
l 4.0 y 2.0
m 2.4 z 0.1
From 100,362 characters (Beker & Piper, 1950)
9
Frequency analysis of an enciphered message
Letter % Letter %
A 0.9 N 0.9
B 7.4 O 11.2
C 8.0 P 9.2
D 4.1 Q 0.6
E 1.5 R 1.8
F 0.6 S 2.1
G 0.3 T 0.0
H 0.0 U 1.8
I 3.3 V 5.3
J 5.3 W 0.3
K 7.7 X 10.1
L 7.4 Y 5.6
M 3.3 Z 1.5
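A frequency table like this one can be computed directly from a ciphertext. The sketch below is a minimal illustration with hypothetical function names; matching the ranked cipher letters against the ranked English letters (e, t, a, ...) then suggests candidate substitutions, which still have to be checked by hand.

```python
from collections import Counter
import string


def letter_frequencies(text):
    """Return the % frequency of each letter A-Z; other characters are ignored."""
    letters = [ch for ch in text.upper() if ch in string.ascii_uppercase]
    counts = Counter(letters)
    total = len(letters)
    return {letter: (100.0 * counts[letter] / total if total else 0.0)
            for letter in string.ascii_uppercase}


def ranked_letters(freqs):
    """Letters sorted from most to least frequent (ties broken alphabetically)."""
    return sorted(freqs, key=lambda letter: (-freqs[letter], letter))


if __name__ == "__main__":
    freqs = letter_frequencies("YHQL, YLGL, YLFL")
    for letter in ranked_letters(freqs)[:3]:
        print(letter, round(freqs[letter], 1))
```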
10
Al-Kindi
The founding father of "gene-finding"
bioinformatics
Al-Kindi (850), Deciphering Cryptographic
Messages, PMID 0000001
11
Homophonic substitution cipher
a b c d e f g h i j k l m n o p q r s t u v w x y
09 48 13 01 14 10 06 23 32 15 04 26 22 18 00 38 94 29 11 17 08 34 60 28 21
12 81 41 03 16 31 25 39 70 37 27 58 05 95 35 19 20 61 89 52
33 62 45 24 50 73 51 66 54 42 76 43
47 79 44 56 83 84 71 72 77 86 49
53 46 65 88 91 90 80 96 69
67 55 68 93 99 75
78 57 85
92 64 97
74
82
87
Frequent letters, many codes
12
"Homophonic" Substitution in the Genetic Code
Amino acid   % in proteins   N codons   % expected at random   Mark
Leu          10.5            6          9.6
Arg          5.84            6          9.6                    -
Ser          5.71            6          9.6                    -
Ala          9.46            4          6.4
Gly          7.41            4          6.4
Val          7.16            4          6.4
Thr          5.50            4          6.4                    -
Pro          4.39            4          6.4                    -
Ile          5.95            3          4.8
Glu          5.50            2          3.2
Asp          5.10            2          3.2
Gln          4.57            2          3.2
Phe          4.19            2          3.2
Lys          4.14            2          3.2
Asn          3.99            2          3.2
Tyr          2.76            2          3.2
His          2.44            2          3.2                    -
Cys          1.20            2          3.2                    -
Met          2.72            1          1.6
Trp          1.48            1          1.6
13
From old to new methods
  • Methods
  • Codon usage (Staden & McLachlan, 1982): profile
  • Differential k-tuple frequency (1986): profile

14
K-tuple concepts and methods
Sequence: ATGCTAGCATAGCTGCATGACATGCATGCA
Overlapping 4-tuples (tetramers):
ATGC, TGCT, GCTA, CTAG, TAGC, AGCA, ...,
ATGC, TGCA, GCAT, CATG, ATGC, TGCA
Pre-compute Fcoding(ATGC), Fncoding(ATGC), ...,
one pair Fcoding(), Fncoding() for every 4-tuple.
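Enumerating overlapping k-tuples and tabulating their frequencies can be sketched as below. The function names are hypothetical; in practice one frequency table would be built from known coding sequences and another from non-coding sequences, which this sketch leaves to the caller.

```python
from collections import Counter


def k_tuples(seq, k=4):
    """All overlapping k-tuples of a sequence, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def k_tuple_frequencies(sequences, k=4):
    """Relative frequency of each k-tuple over a set of training sequences."""
    counts = Counter()
    for s in sequences:
        counts.update(k_tuples(s.upper(), k))
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()} if total else {}


if __name__ == "__main__":
    seq = "ATGCTAGCATAGCTGCATGACATGCATGCA"
    print(k_tuples(seq)[:6])
    print(k_tuple_frequencies([seq])["ATGC"])
```

A sequence of length n yields n - k + 1 overlapping k-tuples, so the 30-nt example gives 27 tetramers.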
15
K-tuple profile
[Figure: Fc / (Fc + Fnc) for each k-tuple (e.g.
ATGC), plotted along the sequence, with a
baseline at 0.5]
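A profile of this kind could be computed as follows. This sketch assumes the plotted quantity is Fc / (Fc + Fnc), with Fc and Fnc the pre-computed frequencies of a k-tuple in coding and non-coding DNA; the smoothing window, the neutral score of 0.5 for unseen tuples, and all names and toy frequencies are illustrative assumptions.

```python
def tuple_score(t, f_coding, f_noncoding):
    """Fc / (Fc + Fnc); 0.5 (neutral) if the tuple was never observed."""
    fc = f_coding.get(t, 0.0)
    fnc = f_noncoding.get(t, 0.0)
    if fc + fnc == 0:
        return 0.5
    return fc / (fc + fnc)


def k_tuple_profile(seq, f_coding, f_noncoding, k=4, window=5):
    """Average tuple score in a sliding window of `window` successive k-tuples."""
    seq = seq.upper()
    scores = [tuple_score(seq[i:i + k], f_coding, f_noncoding)
              for i in range(len(seq) - k + 1)]
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


if __name__ == "__main__":
    # Toy frequency tables (hypothetical values, for illustration only).
    f_c = {"ATGC": 0.03}
    f_nc = {"ATGC": 0.01}
    print(k_tuple_profile("ATGCATGC", f_c, f_nc, k=4, window=1))
```

Values above 0.5 mean the local k-tuples are more typical of coding than of non-coding DNA.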
16
Homophonic deciphering: looking for 2-tuples
that occur often, rarely, or never (je, jx, wa).
k-tuple frequency analysis was probably known to
the Arabs (around 1000) and rediscovered during
the Renaissance.
17
Blaise de Vigenère (1586)
invented his undecipherable cipher to resist
k-tuple analysis
18
The Vigenère square
Plain  abcdefghijklmnopqrstuvwxyz
1      BCDEFGHIJKLMNOPQRSTUVWXYZA
2      CDEFGHIJKLMNOPQRSTUVWXYZAB
3      DEFGHIJKLMNOPQRSTUVWXYZABC
4      EFGHIJKLMNOPQRSTUVWXYZABCD
5      FGHIJKLMNOPQRSTUVWXYZABCDE
6      GHIJKLMNOPQRSTUVWXYZABCDEF
7      HIJKLMNOPQRSTUVWXYZABCDEFG
8      IJKLMNOPQRSTUVWXYZABCDEFGH
9      JKLMNOPQRSTUVWXYZABCDEFGHI
10     KLMNOPQRSTUVWXYZABCDEFGHIJ
11-20  ...
21     VWXYZABCDEFGHIJKLMNOPQRSTU
22     WXYZABCDEFGHIJKLMNOPQRSTUV
23     XYZABCDEFGHIJKLMNOPQRSTUVW
24     YZABCDEFGHIJKLMNOPQRSTUVWX
25     ZABCDEFGHIJKLMNOPQRSTUVWXY
26     ABCDEFGHIJKLMNOPQRSTUVWXYZ
19
The Vigenère square: example
Keyword:    WHITE
Key stream: WHITEWHITEWHITEWHITEWHI
Plaintext:  diverttroopstoeastridge
Ciphertext: ZPDXVPAZHSLZBHIWZBKMZNM
Rows of the square used by the keyword WHITE:
Plain  abcdefghijklmnopqrstuvwxyz
4      EFGHIJKLMNOPQRSTUVWXYZABCD
7      HIJKLMNOPQRSTUVWXYZABCDEFG
8      IJKLMNOPQRSTUVWXYZABCDEFGH
19     TUVWXYZABCDEFGHIJKLMNOPQRS
22     WXYZABCDEFGHIJKLMNOPQRSTUV
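The keyword-driven choice of row can be sketched in code. A minimal illustration, assuming an alphabetic keyword, lowercase plaintext, uppercase ciphertext, and that spaces and punctuation are simply dropped; the function names are hypothetical.

```python
from itertools import cycle


def vigenere_encrypt(plaintext, keyword):
    """Shift each plaintext letter by the row selected by the next key letter."""
    key = cycle(keyword.upper())
    out = []
    for ch in plaintext.lower():
        if "a" <= ch <= "z":
            shift = ord(next(key)) - ord("A")  # W -> 22, H -> 7, ...
            out.append(chr((ord(ch) - ord("a") + shift) % 26 + ord("A")))
    return "".join(out)


def vigenere_decrypt(ciphertext, keyword):
    """Undo vigenere_encrypt with the same keyword."""
    key = cycle(keyword.upper())
    out = []
    for ch in ciphertext.upper():
        if "A" <= ch <= "Z":
            shift = ord(next(key)) - ord("A")
            out.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("a")))
    return "".join(out)


if __name__ == "__main__":
    print(vigenere_encrypt("divert troops to east ridge", "WHITE"))
```

Because the same plaintext letter is enciphered with different rows depending on its position, single-letter and k-tuple frequencies of the whole ciphertext are flattened.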
20
Charles Babbage: another pioneer in
bioinformatics
Cryptanalysis of the Vigenère cipher (1854),
PMID 0000002
21
The invention of inhomogeneous (and hidden)
Markov Models
  • Look for repeats in the ciphertext
  • Infer the keyword length from the consistent repeat
    distance L
  • Then analyze the character frequency at each
    position independently: 1, 2, 3, ..., L
  • Deduce ("call") the corresponding Caesar shift
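The four steps above can be sketched as follows. This is a simplified illustration, not a robust attack: the key length is taken as the greatest common divisor of all repeat distances (which chance repeats can spoil), and each column's shift is "called" by assuming its most frequent letter stands for e. All function names are hypothetical, and the ciphertext is assumed to contain only uppercase letters.

```python
from collections import Counter, defaultdict
from functools import reduce
from math import gcd


def repeat_distances(ciphertext, n=3):
    """Distances between successive occurrences of each repeated n-gram."""
    positions = defaultdict(list)
    for i in range(len(ciphertext) - n + 1):
        positions[ciphertext[i:i + n]].append(i)
    distances = []
    for pos in positions.values():
        distances.extend(b - a for a, b in zip(pos, pos[1:]))
    return distances


def guess_key_length(ciphertext, n=3):
    """GCD of all repeat distances, or None if there are no repeats."""
    distances = repeat_distances(ciphertext, n)
    return reduce(gcd, distances) if distances else None


def columns(ciphertext, key_length):
    """Letters enciphered with the same key position: 1, 2, ..., L."""
    return [ciphertext[i::key_length] for i in range(key_length)]


def guess_shift(column):
    """Caesar shift, assuming the most frequent letter encodes 'e'."""
    counts = Counter(column)
    top = min(counts, key=lambda c: (-counts[c], c))
    return (ord(top) - ord("E")) % 26
```

Each column is a plain Caesar cipher, so the position-specific frequency analysis is what makes the model "inhomogeneous": a different letter distribution is used at each of the L phases.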

22
From old to new methods
  • Methods
  • Codon usage (Staden & McLachlan, 1982): profile
  • Differential k-tuple frequency (1986): profile
  • Differential in-phase k-tuples (1988): profile
  • Non-homogeneous Markov chains (Borodovsky, 1986;
    Tavaré & Song, 1989): profile/call
  • Neural net (Mural & Uberbacher, 1991): call
  • Hidden Markov chain (Kulp et al., 1996): call

23
Bioinformatics vs. Biology
  • 1951 1st protein sequence (insulin, Sanger)
  • 1960 Sequence-structure relationship (globins,
    Perutz)
  • 1965 "Evolutionary divergence and convergence in
    Proteins", Zuckerkandl & Pauling
  • 1967 "Construction of Phylogenetic Trees", Fitch
    & Margoliash
  • 1968 Atlas of Protein Sequences (M. Dayhoff,
    Georgetown)
  • 1970 "A general method applicable to the search
    for similarities in amino-acid sequences of two
    proteins", Needleman & Wunsch
  • 1973 Genetic engineering (Cohen, Boyer et al.)
  • 1974 "Prediction of Protein Conformation", Chou
    & Fasman
  • 1977 DNA sequencing (Sanger, Maxam, Gilbert)
  • 1977 1st bioinformatic "package" (Staden: DB,
    assembly, analysis)
  • 1978 Databases: ACNUC, PIR, EMBL, GenBank
  • 1980 Database access via telephone lines (PIR)
  • 1981 Los Alamos-GenBank: 270 seqs, 370,000 nt

24
Bioinformatics vs. Computers
  • 1965 First "industrial" computer: IBM/360
  • "Evolutionary divergence and convergence in
    Proteins", Zuckerkandl & Pauling
  • 1967 "Construction of Phylogenetic Trees", Fitch
    & Margoliash
  • 1968 First "mini" computer: DEC PDP-8 (floor
    top)
  • Atlas of Protein Sequences (M. Dayhoff,
    Georgetown)
  • 1970 "A general method applicable to the search
    for similarities ...", Needleman & Wunsch
  • 1971 1st work on RNA folding (Ninio)
  • 1972 First "micro" processor: Intel 8008
  • 1973 Genetic engineering (Cohen et al.)
  • 1974 "Prediction of Protein Conformation", Chou
    & Fasman
  • 1975 Intel 8080, Altair kit
  • 1977 1st bioinformatic package (Staden, PDP
    8/11)
  • DEC-VAX mini-computers
  • Micro-computers (Apple, Commodore, Radio Shack)
  • 1978 Databases: ACNUC, PIR, EMBL, GenBank
  • 1980 Telephone access to the PIR database
  • 1981 Los Alamos-GenBank: 270 seqs, 370,000 nt
  • IBM-PC (8088), 16-32 KB
  • 1983 IBM-XT: hard disk (10 Mbytes)

25
Historical perspective (continued)
Bioinformatics & Genomics: the more recent past
  • 1981 Local alignment (Smith-Waterman, JMB)
  • 1985 "Fasta" (Pearson-Lipman, PNAS)
  • 1989 ARPANET -> INTERNET
  • 1990 "Blast" (Altschul et al., JMB)
  • 1990 Positional cloning of NF-1
  • 1991 "Grail", first practical gene finder
    (Mural et al., PNAS)
  • 1991 "EST" (Venter et al., Matsubara et al.)
  • 1992 Complete sequence of yeast chr. 3
  • 1995 Complete sequence of H. influenzae
  • 1996 Complete sequence of S. cerevisiae
  • 1997 "Gapped Blast" (Altschul et al.,
    NAR) / Genscan (Burge & Karlin)
  • 1997 11 complete bacterial genomes available
  • 1998 2 Mbase/day of new public sequence data; C.
    elegans
  • 2000 Human Chr 22, 21, 90% draft; Drosophila; 30
    bacterial genomes

26
Have a good workshop!
27
Genes & Genomes
DNA: Gene 1, Gene 2, Gene 3, Gene 4
  -> transcription -> RNA (transcripts)
  -> translation, folding -> Proteins (or RNAs)
  -> Function 1, Function 2, Function 3, Function 4