Title: No slide title
1 What is Bioinformatics?
Answer: There is NO single answer
2 The biologist's answer: The computer is a TOOL that helps me to make biological discoveries
-> the computer scientist is a SERVICE PROVIDER
3 The computer scientist's answer: Bioinformatics means RESEARCH in computer science, inspired by biological problems
4 ATTGCTTTGATTGCTTTG... ...ATTGATTTGCAAAGCAAT...
Biologist's question: find the repetitions within this sequence
5 ATTGCTTTGATTGCTTTG... ...ATTGATTTGCAAAGCAAT...
[Figure: fragments highlighted on the slide: ATTGCTTT, G, ATTGCTTTG, A, TAACGAAAC, GTTTAGTT]
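The biologist's question (find the repetitions) can be sketched with a simple index of fixed-length words. This is only an illustration: the function name `find_repeated_kmers` and the word length `k` are assumptions, and real tools use far more efficient data structures.

```python
from collections import defaultdict

def find_repeated_kmers(sequence, k):
    """Return {k-mer: [start positions]} for every k-mer seen more than once."""
    positions = defaultdict(list)
    for start in range(len(sequence) - k + 1):
        positions[sequence[start:start + k]].append(start)
    # Keep only the words that occur at least twice.
    return {kmer: starts for kmer, starts in positions.items() if len(starts) > 1}

seq = "ATTGCTTTGATTGCTTTG"  # first fragment shown on the slide
print(find_repeated_kmers(seq, 9))  # {'ATTGCTTTG': [0, 9]}
```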
6 DATA COMPRESSION
"I am stupid and because I am stupid, I can't even tell you that I am stupid"
-> "1 and because 1, I can't even tell you that 1" (where 1 stands for "I am stupid")
7 There exist thousands of algorithms to find:
Common words in a text
Longest common words
Approximate common words
Approximate longest common words
...
But ... "inverted repeats"?
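What makes inverted repeats different from ordinary common words is that the second copy is read on the opposite strand, i.e. as the reverse complement. A minimal sketch, assuming exact matches of a fixed length `k` (the function names are hypothetical):

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Complement every base, then read the result backwards."""
    return dna.translate(COMPLEMENT)[::-1]

def find_inverted_repeats(sequence, k):
    """Return sorted (i, j, kmer): kmer starts at i, its reverse complement at j > i."""
    index = {}
    for start in range(len(sequence) - k + 1):
        index.setdefault(sequence[start:start + k], []).append(start)
    hits = []
    for kmer, starts in index.items():
        partner_starts = index.get(reverse_complement(kmer), [])
        for i in starts:
            for j in partner_starts:
                if j > i:
                    hits.append((i, j, kmer))
    return sorted(hits)

print(reverse_complement("ATTGCTTTG"))  # CAAAGCAAT
print(find_inverted_repeats("ATTGATTTGCAAAGCAAT", 4))
```

A text-oriented "common words" algorithm would miss these pairs, because the two copies share no common substring.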
8 This prompted research by mathematicians/computer scientists:
sequence compression -> less disk space
if a sequence can be compressed, then it contains repetitions
finding new algorithms (approximate matches, inverted repeats)
All this is transparent to the biologist. The algorithm must find repetitions, period.
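The link between compressibility and repetitions can be checked with a general-purpose compressor. The sketch below uses Python's `zlib` as a stand-in (an assumption: it is not a DNA-specific compressor and knows nothing about inverted repeats), and compares a highly repetitive sequence with a pseudo-random one of the same length.

```python
import random
import zlib

def compression_ratio(sequence):
    """Compressed size divided by original size (smaller = more repetitive)."""
    raw = sequence.encode("ascii")
    return len(zlib.compress(raw, 9)) / len(raw)

repetitive = "ATTGCTTTG" * 200
rng = random.Random(0)  # fixed seed so the run is reproducible
random_seq = "".join(rng.choice("ACGT") for _ in range(len(repetitive)))

print(compression_ratio(repetitive))
print(compression_ratio(random_seq))
```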
9 WHY BIOINFORMATICS?
Bioinformatics is not new: population genetics = study of the distribution of, and change in, allele frequencies (allele = variant of a gene) -> modelling, simulation
Today: a consequence of massive data throughput: sequencing, microarrays ... and literature
10 Sequencing a genome consists of determining the precise ORDER of the nucleotides, or "bases" (A, T, G, C), along the chromosome(s)
How many bases? Is it difficult? What's inside a genome sequence?
12 The molecule to be sequenced (e.g. a chromosome) must be cut into small fragments (1000 bases). Each fragment is sequenced.
14 A difficult problem in higher organisms: repeated sequences
16 GENES are the most important portions of the chromosomes
Genes that code for PROTEINS are the most important
Proteins: muscles, hair, nails, ENZYMES
18 A simple way to find genes in bacteria
19 Open Reading Frames > 300 bases in the bacterium Rhizobium meliloti
Stop codons: TAA, TAG and TGA
The genome of the bacterium is GC-rich -> stop codons are rare
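The "simple way" can be sketched as a scan for long stretches without a stop codon. This is a minimal illustration under stated assumptions: forward strand only, no check for a start codon, and a hypothetical function name `open_reading_frames`; the 300-base threshold is the one from the slide.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def open_reading_frames(dna, min_length=300):
    """Yield (start, end, frame) for stop-free stretches of >= min_length bases.

    Simplifications (assumptions): forward strand only, no start-codon check.
    """
    for frame in range(3):
        orf_start = frame
        # End of the last complete codon in this frame.
        frame_end = frame + 3 * ((len(dna) - frame) // 3)
        for pos in range(frame, frame_end, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if pos - orf_start >= min_length:
                    yield (orf_start, pos, frame)
                orf_start = pos + 3
        # Stretch that runs to the end of the sequence without a stop.
        if frame_end - orf_start >= min_length:
            yield (orf_start, frame_end, frame)

dna = "ATG" + "GCC" * 120 + "TAA"  # made-up GC-rich sequence
for orf in open_reading_frames(dna):
    print(orf)
```

On this made-up GC-rich input, all three frames report a long stop-free stretch, since the shifted frames contain no TAA, TAG or TGA either.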
20 A long Open Reading Frame does not necessarily mean a gene ...
21 Q: How to choose between different ORFs?
22 An example of STYLE: codon usage
Not all codons are equal
Most programs that aim to find genes use MARKOV MODELS
23 Codon usage: codon, amino acid, fraction among synonymous codons (* = stop codon)
UUU F 0.57   UCU S 0.15   UAU Y 0.57   UGU C 0.45
UUC F 0.43   UCC S 0.15   UAC Y 0.43   UGC C 0.55
UUA L 0.13   UCA S 0.12   UAA * 0.64   UGA * 0.29
UUG L 0.13   UCG S 0.15   UAG * 0.07   UGG W 1.00
CUU L 0.10   CCU P 0.16   CAU H 0.57   CGU R 0.38
CUC L 0.10   CCC P 0.12   CAC H 0.43   CGC R 0.40
CUA L 0.04   CCA P 0.19   CAA Q 0.35   CGA R 0.06
CUG L 0.50   CCG P 0.52   CAG Q 0.65   CGG R 0.10
AUU I 0.51   ACU T 0.17   AAU N 0.45   AGU S 0.15
AUC I 0.42   ACC T 0.44   AAC N 0.55   AGC S 0.28
AUA I 0.07   ACA T 0.13   AAA K 0.77   AGA R 0.04
AUG M 1.00   ACG T 0.27   AAG K 0.23   AGG R 0.02
GUU V 0.26   GCU A 0.16   GAU D 0.63   GGU G 0.34
GUC V 0.22   GCC A 0.27   GAC D 0.37   GGC G 0.40
GUA V 0.15   GCA A 0.21   GAA E 0.69   GGA G 0.11
GUG V 0.37   GCG A 0.36   GAG E 0.31   GGG G 0.15
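One way to use such a table when choosing between ORFs is to ask whether a candidate prefers the same codons. The sketch below takes only the six leucine rows of the table and compares them with uniform usage (1/6 each). The log-odds score and the name `leucine_log_odds` are illustrative assumptions, far simpler than the Markov models used by actual gene finders.

```python
import math

# Leucine rows copied from the table above (fractions among the six Leu codons).
LEU_USAGE = {"UUA": 0.13, "UUG": 0.13, "CUU": 0.10,
             "CUC": 0.10, "CUA": 0.04, "CUG": 0.50}

def leucine_log_odds(rna):
    """Sum log(table fraction / uniform fraction) over Leu codons read in frame 0."""
    uniform = 1.0 / len(LEU_USAGE)
    score = 0.0
    for pos in range(0, len(rna) - 2, 3):
        codon = rna[pos:pos + 3]
        if codon in LEU_USAGE:  # other amino acids are ignored in this sketch
            score += math.log(LEU_USAGE[codon] / uniform)
    return score

print(leucine_log_odds("CUGCUGCUG"))  # positive: uses the preferred codon CUG
print(leucine_log_odds("CUACUACUA"))  # negative: uses the rare codon CUA
```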
24 [Figure: EXON and INTRON; other labels on the slide: A, TC]
25 [Figure: labels GT and AG]
26 238 genes, 1254 introns
Distribution of nucleotides near the end of an intron
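A distribution of this kind is obtained by aligning the intron ends and counting bases column by column. The sketch below shows the counting step only; the four sequences are made up for illustration and are not taken from the 1254 introns of the slide.

```python
from collections import Counter

def position_frequencies(aligned_sequences):
    """Return one {base: fraction} dict per column of equal-length sequences."""
    length = len(aligned_sequences[0])
    if any(len(s) != length for s in aligned_sequences):
        raise ValueError("sequences must have equal length")
    table = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        table.append({base: counts[base] / len(column) for base in "ACGT"})
    return table

# Made-up intron 3' ends, each one ending with AG (illustration only).
toy_intron_ends = ["TTTCAG", "CTTTAG", "TCCCAG", "TTTTAG"]
for column in position_frequencies(toy_intron_ends):
    print(column)
```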
27 Distribution of exon lengths in three species
28 Distribution of intron lengths in three species
29 About the splicing sites (intron/exon junctions)
The average number of introns in human genes is 5
The gene coding for Titin harbors 233 introns!
Suppose the introns are predicted with 85% accuracy. Take a gene with 8 introns: 0.85^8 ≈ 0.27
Many exons are short (< 50 bases) and separated by long introns (> 1000 bases)
There are mini-exons as short as 3 bases in A. thaliana -> unpredictable
While > 80% of a bacterial chromosome consists of genes, they account for less than 5% of the human genome
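The arithmetic behind the 8-intron example is a product of per-intron success probabilities. The sketch assumes the predictions are independent of each other, which the slide implies but does not state.

```python
def all_introns_correct(per_intron_accuracy, n_introns):
    """Probability that every intron of a gene is predicted correctly,
    assuming independent predictions (an assumption of this sketch)."""
    return per_intron_accuracy ** n_introns

print(round(all_introns_correct(0.85, 8), 2))  # 0.27
print(all_introns_correct(0.85, 233))          # Titin: vanishingly small
```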
30 DNA Chips
31 Gene (or part of gene) no. 1
32 2
37 [Figure labels: A, A, B, B]
Problem: how to identify which bacteria are present in a given sample?
38 How to discriminate between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML)?
38 patients, 27 ALL and 11 AML
chip comprising 6817 human genes
1100 genes seem discriminating
50 genes kept after analysis
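Going from thousands of genes to a short list requires some score of how well each gene separates the two classes. The slide does not say which statistic was used; the sketch below assumes a signal-to-noise score (difference of class means divided by the sum of class standard deviations) and runs on made-up expression values for three hypothetical genes.

```python
from statistics import mean, stdev

def signal_to_noise(values_a, values_b):
    """(mean_a - mean_b) / (sd_a + sd_b); a large absolute value separates the classes."""
    return (mean(values_a) - mean(values_b)) / (stdev(values_a) + stdev(values_b))

def top_genes(expression, labels, n):
    """expression: {gene: [value per patient]}; labels: 'ALL' or 'AML' per patient."""
    scores = {}
    for gene, values in expression.items():
        all_vals = [v for v, lab in zip(values, labels) if lab == "ALL"]
        aml_vals = [v for v, lab in zip(values, labels) if lab == "AML"]
        scores[gene] = signal_to_noise(all_vals, aml_vals)
    # Rank by absolute score: a gene may be high in either class.
    return sorted(scores, key=lambda g: abs(scores[g]), reverse=True)[:n]

# Made-up data: 3 ALL patients followed by 3 AML patients.
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]
toy_expression = {
    "geneA": [9.0, 10.0, 11.0, 1.0, 2.0, 3.0],
    "geneB": [5.0, 6.0, 4.0, 5.5, 4.5, 6.0],
    "geneC": [1.0, 2.0, 1.5, 8.0, 9.0, 8.5],
}
print(top_genes(toy_expression, labels, 2))  # ['geneC', 'geneA']
```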
40 Spot-tracking program
Statistical analysis of the measurements
41 Co-regulated genes -> Interaction network(s)
SYSTEMS BIOLOGY -> modeling of the cell
42 Co-regulation of gene expression: enzymes that are involved in the same metabolic pathway
44 Co-regulation of gene expression: enzymes that are built up from different protein subunits
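In chip data, co-regulated genes show up as expression profiles that rise and fall together across conditions. A minimal sketch using the Pearson correlation coefficient (an assumed choice of similarity measure; the profiles are made up):

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

# Made-up profiles over five conditions.
subunit_1 = [1.0, 2.0, 3.0, 4.0, 5.0]
subunit_2 = [2.0, 4.0, 6.0, 8.0, 10.0]  # same trend: candidate for co-regulation
unrelated = [3.0, 1.0, 4.0, 1.0, 5.0]

print(pearson(subunit_1, subunit_2))
print(pearson(subunit_1, unrelated))
```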
45 The Lactose Operon
46 The Lactose Operon
50 Modeling and Simulation of Genetic Regulatory Systems
J. Comput. Biol. 9 (2002) 67-103
Directed and undirected graphs
Bayesian networks
Boolean networks
Generalized logical networks
Nonlinear ordinary differential equations
Piecewise-linear differential equations
Qualitative differential equations
Partial differential equations
Stochastic master equations
Rule-based formalisms
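Of the formalisms listed, Boolean networks are the easiest to show in a few lines: every component is ON or OFF and is updated from the current state of the others. The sketch below is a toy model of the lactose operon. Its two rules (the repressor is active without lactose; the operon is expressed when the repressor is inactive and glucose is absent) are a textbook simplification assumed for illustration, not rules taken from the slides.

```python
def lac_step(state):
    """One synchronous update of a toy Boolean model of the lactose operon."""
    lactose, glucose = state["lactose"], state["glucose"]
    return {
        "lactose": lactose,  # external inputs are held fixed
        "glucose": glucose,
        "repressor_active": not lactose,
        "operon_on": (not state["repressor_active"]) and (not glucose),
    }

def simulate(state, steps):
    """Return the list of states visited, starting with the initial one."""
    trajectory = [state]
    for _ in range(steps):
        state = lac_step(state)
        trajectory.append(state)
    return trajectory

start = {"lactose": True, "glucose": False,
         "repressor_active": True, "operon_on": False}
for s in simulate(start, 2):
    print(s)
```

With lactose present and glucose absent, the operon switches on after two updates: first the repressor is inactivated, then expression starts.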