Jacques.van.Helden@ulb.ac.be

Laboratoire de Bioinformatique des Génomes et des Réseaux


Provided by: jacquesv8


1
Genome analysis

Bioinformatics

2
Contents

Genome annotation
Comparative genomics
Phylogenetic profiles
Gene fusion analysis
Phylogenetic footprinting

3
From sequences to genomes


4
From sequences to genomes

Before the 1990s, DNA sequencing represented an
important investment in terms of human work. A
PhD student could spend a significant fraction of
a thesis sequencing a single gene.
Genome projects stimulated the development of
automatic sequencing methods, and led to
important technological improvements.
There are currently (2008) several hundred
publicly available fully sequenced genomes.
The NCBI genome distribution
(ftp://ftp.ncbi.nih.gov/genomes/) contains:
>650 prokaryotes (Bacteria and Archaea)
Insects (Drosophila melanogaster, Apis mellifera)
Plants (Arabidopsis thaliana, rice, maize)
A worm (Caenorhabditis elegans)
Some fungi (Saccharomyces cerevisiae,
Schizosaccharomyces pombe, ...)
Some mammals (Homo sapiens, Mus musculus, Rattus
norvegicus)
Other genome centres give access to other genomes.
ENSEMBL (http://www.ensembl.org/) maintains many
vertebrate genomes
UCSC (http://genome.ucsc.edu/) maintains genomes
of metazoan insects
Sanger Institute (http://www.sanger.ac.uk/genbiol/)
Integr8: 800 genomes in 2008.
Many other genomes were sequenced by commercial
companies, and are not available to the public.

5
Gene organization
Source: Mount (2000)
6
Gene function
>PHO4,SPBC428.03C THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR Q01682Q9UU70
Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query:   9 ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
Sbjct:  10 LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query:  69 VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
Sbjct:  63 VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query: 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181
Sbjct: 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query: 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
Sbjct: 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query: 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
Sbjct: 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query: 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
Sbjct: 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query: 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
Sbjct: 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query: 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453
Sbjct: 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463

After having localized genes on the sequence, we
have to predict their function.
Some genes had already been characterized before
the genome project, but these are generally a
minority of those found in the genome.
For the majority of the genes, one tries to
predict function on the basis of similarities
between the sequence of the newly sequenced gene
and some previously known genes (function
assignment by sequence similarity).
Example: yeast genome (1996). There are still
2500 genes (39%) whose function is completely
unknown. However:
Yeast is among the best known model organisms
(genetics, molecular biology).
The full genome has been available since 1996.
When the first draft of the human genome was
published, 60% of the predicted genes were of
unknown function.
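
The similarity report on this slide summarizes an alignment with counts such as identities and gaps over the number of aligned columns. A minimal sketch of how such counts can be derived from two already-aligned sequences is given below. The function name alignment_stats and the toy sequences are assumptions made for illustration; the count of positives is left out because it requires a substitution matrix.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identities and gaps over the columns of a pairwise alignment."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    if not query:
        raise ValueError("alignment is empty")
    columns = len(query)
    # A column is an identity when both residues are equal and not a gap.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A column is a gap column when either sequence has '-'.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "columns": columns,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / columns,
    }


# Toy aligned pair (hypothetical, not taken from the slide).
stats = alignment_stats("ACD-FGHIK", "ACDEFG-LK")
print(stats)
```

Percent identity is computed here over all alignment columns, gaps included, which is the convention used in the report above (138/473).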

7
Some milestones
8
Genes and genome size

In prokaryotes, the number of genes increases
linearly with genome size.
In eukaryotes, this is not the case: the genome
size increases faster than the number of genes.
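
The contrast between prokaryotes and eukaryotes can be illustrated by computing gene density for a few genomes. The sketch below uses approximate figures that are assumptions and do not come from the slides, except the 6400 yeast genes, which is the 1996 prediction quoted in this presentation. The function names are hypothetical.

```python
# Approximate genome sizes (Mb) and gene counts, for illustration only.
# All values are rough assumptions; 6400 is the 1996 yeast prediction
# quoted in this presentation.
genomes = {
    "Escherichia coli": (4.6, 4300),
    "Saccharomyces cerevisiae": (12.1, 6400),
    "Homo sapiens": (3100.0, 20000),
}


def gene_density(size_mb: float, n_genes: int) -> float:
    """Number of genes per megabase."""
    return n_genes / size_mb


def mean_spacing_kb(size_mb: float, n_genes: int) -> float:
    """Average genome length available per gene, in kilobases."""
    return 1000.0 * size_mb / n_genes


for name, (size_mb, n_genes) in genomes.items():
    print(f"{name}: {gene_density(size_mb, n_genes):.0f} genes/Mb, "
          f"{mean_spacing_kb(size_mb, n_genes):.1f} kb per gene")
```

With figures of this order, density drops by orders of magnitude from the bacterium to the mammal, which is what one would expect if genome size grows faster than gene number.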

9
Genes and genome size

Beware: the axes are logarithmic.
This plot represents the same data as the
previous one, but on a logarithmic scale, in order
to show mammals as well.

10
Gene spacing

Gene spacing increases considerably with the
complexity of the organisms.
Note: the X axis is logarithmic, not the Y axis
-> the increase seems roughly exponential.

11
Proportion of intergenic regions

Beware: the X axis is logarithmic.
The proportion of intergenic regions increases
with the complexity of an organism.
In addition (not shown here), introns represent
an increasing fraction of the genome.
For example, the exonic fraction represents <5%
of the human genome.

12
Protein size versus genome size

Protein sequences are shorter in prokaryotes than
in eukaryotes.
Among eukaryotes, the increase in genome size is
not correlated with an increase in protein size:
higher eukaryotes have a much larger genome than
fungi, without any increase in protein size.

13
Genome annotation


14
Gene prediction

Starting from a completely sequenced genome,
predict the positions of genes.
Elements of prediction:
Open Reading Frames
Start and stop codons, separated by a
continuous set of non-stop codons.
Region content
Hexanucleotide composition
Codon adaptation index (CAI).
Signals
In prokaryotes: Shine-Dalgarno boxes.
In eukaryotes: intron/exon boundary elements
(splicing signals).
Similarity with known genes.
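
The first element of prediction, open reading frames, can be sketched in a few lines. The example below is a simplified illustration and not a real gene finder: it scans only the forward strand, accepts only ATG as start codon, and uses an arbitrary minimal length. The function name find_orfs and the test sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end, frame) for ORFs on the forward strand.

    start is the 0-based index of the ATG, end is the index just past
    the stop codon, and min_codons counts the start codon but not the
    stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                # First ATG since the last stop codon in this frame.
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence (hypothetical): ATG AAA TTT GGG TAA in frame 2.
seq = "CCATGAAATTTGGGTAACC"
print(find_orfs(seq))
```

A real annotation pipeline would also scan the reverse complement and combine ORFs with the other elements listed above (region content, signals, similarity).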

15
Gene prediction - limitations

Typical problems
Gene prediction programs are trained for a
specific organism, and can give very bad results
with other organisms (e.g., the first rounds of
annotations of A. thaliana were done with programs
trained for mammals).
Any gene prediction program will unavoidably
predict false genes, and miss some true genes.
The prediction of intron/exon boundaries is
particularly difficult.
For prokaryotes, the predicted start codons are
sometimes imprecise.
Example: genome of the yeast Saccharomyces
cerevisiae
For the yeast genome, the gene detection
protocol used in 1996 was over-predictive.
The program essentially relied on ORFs, and
predicted 6400 genes.
Some researchers estimated that 1,000 ORFs might
be false predictions.
Since 1996, the reality of the predicted genes
has been tested by combining several methods of
functional genomics (expression studies, mutant
phenotypes, comparative genomics between closely
related species, ...).
A few hundred of the initially predicted genes
have been removed from the annotations.

16
Non-coding genes

There are many types of non-coding genes:
tRNA: transfer RNA
rRNA: ribosomal RNA
snRNA: small nuclear RNA (elements of the spliceosome)
snoRNA: small nucleolar RNA (methylation guides)
...
Detection of non-coding RNA
Non-coding RNA genes are generally transcribed by
polymerases I and III, and have different promoters.

17
Annotation of gene function

Once a genomic region has been predicted to
contain a gene, the next step is to predict the
function of this gene.
The translated product is compared with all known
proteins, and a putative function can be assigned
on the basis of high similarity matches.
Problems
Sequence similarity is not always sufficient to
infer the same function.
Where to put the threshold?
Some proteins might have similar functions with
different sequences (convergent evolution).
Once a gene has been assigned some putative
function, this will be used to assign the same
function to other genes -> propagation of errors.
We should thus be aware that gene annotations
must be treated with caution.
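
The threshold problem can be made concrete with a small sketch of annotation transfer. Everything in it is an assumption made for illustration: the Hit record, the function transfer_annotation, and the cutoff values. The single hit reuses the E-value (1e-40) and identity (29%) of the alignment shown on the "Gene function" slide.

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    subject_id: str
    annotation: str
    evalue: float
    percent_identity: float


def transfer_annotation(hits: List[Hit],
                        max_evalue: float = 1e-10,
                        min_identity: float = 30.0) -> Optional[str]:
    """Return the annotation of the best acceptable hit, marked as putative."""
    acceptable = [h for h in hits
                  if h.evalue <= max_evalue
                  and h.percent_identity >= min_identity]
    if not acceptable:
        return None
    best = min(acceptable, key=lambda h: h.evalue)
    # Keep the evidence in the label, so that later transfers can see
    # that this annotation was itself inferred by similarity.
    return f"putative {best.annotation} (by similarity to {best.subject_id})"


hit = Hit("PHO4", "thiamine-repressible acid phosphatase precursor", 1e-40, 29.0)
print(transfer_annotation([hit], min_identity=30.0))
print(transfer_annotation([hit], min_identity=25.0))
```

The same hit is rejected with one arbitrary identity cutoff and accepted with another, which is exactly the difficulty raised above. Recording the evidence source in the label is one possible way to limit the propagation of errors.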

18
Genes with unknown function

When genomes of model organisms were sequenced,
about 40% of the predicted genes could not be
associated with any known function.
These genes are annotated as "hypothetical
proteins".
Note:
In the yeast genome, many of these hypothetical
proteins have been removed from the annotations
since 1996, because they were false predictions.

19
Comparative genomics


20
Phylogenetic footprinting

One of the main reasons for sequencing the mouse
genome was to detect conserved regions between
mouse and human, which reveal exons and
regulatory regions.
The fact that an unknown gene is found in
different genomes gives more confidence in the
existence of this gene.
Another important goal was to detect conserved
regions in non-coding regions.
On the basis of a few known cases, it has been
shown that conserved non-coding regions contain a
high concentration of regulatory elements.
The detection of conserved non-coding sequences
thus gives indications about regions potentially
involved in regulation.
Such conserved regions are called phylogenetic
footprints.
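
A minimal sketch of the idea: slide a window along two sequences that are assumed to be already aligned (same length, '-' for gaps) and report the windows whose identity exceeds a threshold. The function name conserved_windows, the window size and the threshold are assumptions; actual footprinting tools use more elaborate scoring.

```python
def conserved_windows(seq_a: str, seq_b: str,
                      window: int = 10, min_identity: float = 0.9):
    """Return (start, identity) for each alignment window above a threshold."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    results = []
    for start in range(len(seq_a) - window + 1):
        pairs = zip(seq_a[start:start + window], seq_b[start:start + window])
        # Gap columns never count as conserved.
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        identity = matches / window
        if identity >= min_identity:
            results.append((start, identity))
    return results


# Toy alignment (hypothetical): only the central CCCCC block is conserved.
print(conserved_windows("AAAAACCCCCGGGGG", "TTTTTCCCCCTTTTT",
                        window=5, min_identity=1.0))
```

Overlapping windows that pass the threshold would normally be merged into a single candidate footprint.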

21
Phylogenetic profiles

For each gene of the query genome (e.g. E. coli),
orthologs are searched in all the sequenced
genomes.
Each gene is characterized by a profile of
presence/absence in all the sequenced genomes.
Groups of genes having similar phylogenetic
profiles are likely to be functionally related.

Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
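
The profile comparison can be sketched as follows. The gene names and the presence/absence values are hypothetical, and grouping by strictly identical profiles is a simplification chosen for this example; a distance such as the Hamming distance allows similar, non-identical profiles to be compared as well.

```python
from collections import defaultdict

# Hypothetical presence (1) / absence (0) of orthologs in four genomes.
profiles = {
    "geneA": (1, 0, 1, 1),
    "geneB": (1, 0, 1, 1),
    "geneC": (0, 1, 0, 0),
    "geneD": (1, 1, 1, 1),
}


def group_by_profile(profiles: dict) -> dict:
    """Group genes sharing exactly the same presence/absence profile."""
    groups = defaultdict(list)
    for gene, profile in profiles.items():
        groups[tuple(profile)].append(gene)
    # Keep only profiles shared by at least two genes.
    return {p: sorted(g) for p, g in groups.items() if len(g) > 1}


def hamming(p, q) -> int:
    """Number of genomes in which two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


print(group_by_profile(profiles))
print(hamming(profiles["geneA"], profiles["geneD"]))
```

Here geneA and geneB share a profile and would be predicted to be functionally related, while geneD, present everywhere, carries little information.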
22
Gene fusion analysis

It is quite common for two genes of a given
organism to be fused into a single gene in
another organism.
Fusions between more than 2 genes are
occasionally observed.
Fused genes are likely to be functionally related.

References:
Marcotte et al. (1999). Science 285(5428), 751-3.
Marcotte et al. (1999). Nature 402(6757), 83-6.
Enright et al. (1999). Nature 402(6757), 86-90.
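
One simple way to look for candidate fusions, given similarity hits of query genes against the proteins of another genome, is to search for a single protein (the composite) that is matched by two different query genes over non-overlapping regions. The sketch below assumes hits are already available as (query gene, composite protein, start, end) tuples; the function name find_fusions and the data are hypothetical and this is not the procedure of the cited papers.

```python
def find_fusions(hits, max_overlap: int = 0):
    """Return (gene_a, gene_b, composite) candidate fusion triples.

    hits: iterable of (query_gene, composite_id, start, end), where start
    and end are 1-based inclusive positions of the match on the composite.
    """
    by_composite = {}
    for gene, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((gene, start, end))

    fusions = []
    for composite, matches in by_composite.items():
        matches.sort(key=lambda m: m[1])  # order matches along the composite
        for i in range(len(matches)):
            for j in range(i + 1, len(matches)):
                gene_a, start_a, end_a = matches[i]
                gene_b, start_b, end_b = matches[j]
                # Length of the shared region; zero or negative if disjoint.
                overlap = min(end_a, end_b) - max(start_a, start_b) + 1
                if gene_a != gene_b and overlap <= max_overlap:
                    fusions.append((gene_a, gene_b, composite))
    return fusions


# Hypothetical hits: A and B match distinct halves of composite1.
hits = [
    ("A", "composite1", 1, 200),
    ("B", "composite1", 220, 450),
    ("C", "composite2", 1, 300),
]
print(find_fusions(hits))
```

Requiring non-overlapping matches separates a genuine composite from the case where two paralogous query genes simply hit the same domain.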
23
Conclusion


24
The genome challenge

Despite the availability of several hundred
genomes, we are far from understanding the
organization and function of a single genome.
In particular, a lot of work remains to be done
to decipher the genomes of higher organisms.
Genome sequence by itself is far from sufficient
for this.
Since 1997, several high-throughput methods have
been invented to give complementary information
about gene function (see courses on
transcriptome, proteome and interactome).

25
Some milestones
