Title: Jacques.van.Helden@ulb.ac.be
1 Genome analysis
2 Contents
- Genome annotation
- Comparative genomics
- Phylogenetic profiles
- Gene fusion analysis
- Phylogenetic footprinting
3 From sequences to genomes
4 From sequences to genomes
- Before the 1990s, DNA sequencing represented an important investment in terms of human work. A PhD student could spend a significant fraction of a thesis sequencing a single gene.
- Genome projects stimulated the development of automatic sequencing methods, and led to important technological improvements.
- There are currently (2008) several hundred publicly available fully sequenced genomes.
- The NCBI genome distribution (ftp://ftp.ncbi.nih.gov/genomes/) contains
  - >650 prokaryotes (Bacteria and Archaea)
  - Insects (Drosophila melanogaster, Apis mellifera)
  - Plants (Arabidopsis thaliana, rice, maize)
  - A worm (Caenorhabditis elegans)
  - Some fungi (Saccharomyces cerevisiae, Schizosaccharomyces pombe, ...)
  - Some mammals (Homo sapiens, Mus musculus, Rattus norvegicus)
- Other genome centres give access to other genomes.
  - ENSEMBL (http://www.ensembl.org/) maintains many vertebrate genomes
  - UCSC (http://genome.ucsc.edu/) maintains genomes of metazoans and insects
  - Sanger Institute (http://www.sanger.ac.uk/genbiol/)
  - Integr8: 800 genomes in 2008.
- Many other genomes were sequenced by commercial companies, and are not available to the public.
5 Gene organization
Source: Mount (2000)
6 Gene function
>PHO4,SPBC428.03C THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR Q01682 Q9UU70
Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query 9    ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
Sbjct 10   LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query 69   VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
Sbjct 63   VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query 122  NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181
Sbjct 121  VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query 182  YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
Sbjct 174  WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query 234  YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
Sbjct 234  FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query 293  LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
Sbjct 294  LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query 351  IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
Sbjct 354  FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query 405  IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453
Sbjct 411  LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463
- After having localized genes on the sequence, we have to predict their function.
- Some genes had already been characterized before the genome project, but these are generally a minority of those found in the genome.
- For the majority of the genes, one tries to predict function on the basis of similarities between the sequence of the newly sequenced gene and some previously known genes (function assignment by sequence similarity).
- Example: yeast genome (1996). There are still 2500 genes (39%) whose function is completely unknown. However,
  - Yeast is among the best known model organisms (genetics, molecular biology).
  - The full genome has been available since 1996.
- When the first draft of the Human genome was published, 60% of the predicted genes were of unknown function.
7 Some milestones
8 Genes and genome size
- In prokaryotes, the number of genes increases linearly with genome size.
- In eukaryotes, this is not the case: the genome size increases faster than the number of genes.
9 Genes and genome size
- Beware: the axes are logarithmic.
- This plot represents the same data as the previous one, but on a logarithmic scale, in order to show Mammals as well.
10 Gene spacing
- Gene spacing increases considerably with the complexity of the organisms.
- Note: the X axis is logarithmic, not the Y axis -> the increase seems roughly exponential.
11 Proportion of intergenic regions
- Beware: the X axis is logarithmic.
- The proportion of intergenic regions increases with the complexity of an organism.
- In addition (not shown here), introns represent an increasing fraction of the genome.
  - For example, the exonic fraction represents <5% of the human genome.
12 Protein size versus genome size
- Protein sequences are shorter in prokaryotes than in eukaryotes.
- Among eukaryotes, the increase in genome size is not correlated with an increase in protein size
  - higher eukaryotes have a much larger genome than fungi, without an increase in protein size
13 Genome annotation
14 Gene prediction
- Starting from a completely sequenced genome, predict the positions of genes
- Elements of prediction
  - Open Reading Frames
    - Start and stop codons, separated by a continuous set of non-stop codons.
  - Region content
    - Hexanucleotide composition
    - Codon adaptation index (CAI).
  - Signals
    - In prokaryotes: Shine-Dalgarno boxes.
    - In eukaryotes: intron/exon boundary elements (splicing signals).
  - Similarity with known genes.
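The open-reading-frame criterion listed above can be illustrated with a minimal Python sketch. It is only an illustration under simplifying assumptions, not the method of any particular gene prediction program: it scans the forward strand only, uses the standard start codon ATG and the three standard stop codons, and ignores region content, signals and similarity. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=2):
    """Return (start, end, frame) for each ORF on the forward strand.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon. min_codons is the minimal number of codons before the stop
    (the ATG included); it is an arbitrary, illustrative cutoff.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence: ATG AAA TTT TAG, starting at position 2
print(find_orfs("CCATGAAATTTTAGGG"))
```

In practice a much larger minimal length would be used, and the reverse strand would be scanned as well; relying on ORFs alone tends to over-predict genes.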
15 Gene prediction - limitations
- Typical problems
  - Gene prediction programs are trained for a specific organism, and can give very bad results with other organisms (e.g., the first rounds of annotations of A. thaliana were done with programs trained for mammals).
  - Any gene prediction program will unavoidably predict false genes, and miss some true genes.
  - The prediction of intron/exon boundaries is particularly difficult.
  - For prokaryotes, the predicted start codons are sometimes imprecise.
- Example: genome of the yeast Saccharomyces cerevisiae
  - For the yeast genome, the gene detection protocol used in 1996 was over-predictive.
  - The program essentially relied on ORFs, and predicted 6400 genes.
  - Some researchers estimated that 1,000 ORFs might be false predictions.
  - Since 1996, the reality of the predicted genes has been tested by combining several methods of functional genomics (expression studies, mutant phenotypes, comparative genomics between closely related species, ...).
  - A few hundred of the initially predicted genes have been removed from the annotations.
16 Non-coding genes
- There are many types of non-coding genes
  - tRNA: transfer RNA
  - rRNA: ribosomal RNA
  - snRNA: small nuclear RNA (elements of the spliceosome)
  - snoRNA: methylation guides
  - ...
- Detection of non-coding RNA
  - generally transcribed by polymerase I and III, and have different promoters
17 Annotation of gene function
- Once a genomic region has been predicted to contain a gene, the next step is to predict the function of this gene.
- The translated product is compared with all known proteins, and a putative function can be assigned on the basis of high similarity matches.
- Problems
  - Sequence similarity is not always sufficient to confer the same function
    - Where to put the threshold?
  - Some proteins might have similar function with different sequences (convergent evolution).
  - Once a gene has been assigned some putative function, this will be used to assign the same function to other genes -> expansion of errors.
- We should thus be aware that gene annotations have to be taken with caution.
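The threshold question can be made concrete with a small Python sketch. It assumes the two protein sequences have already been aligned (gaps written as "-"), and it counts identities over the full alignment length, as in the "Identities 138/473" line of a BLAST report. The function names and the 40% default cutoff are hypothetical; no particular value is recommended here.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns carrying the same residue in both sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        return 0.0
    # A column counts as an identity only if it is not a gap
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


def transfer_annotation(aligned_query, aligned_hit, hit_function, min_identity=40.0):
    """Copy the function of the hit only if identity reaches the (arbitrary) cutoff."""
    if percent_identity(aligned_query, aligned_hit) >= min_identity:
        return hit_function
    return "hypothetical protein"


# Toy alignment: 5 identical columns out of 7
print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
print(transfer_annotation("ACDE-FG", "ACDQWFG", "acid phosphatase"))
print(transfer_annotation("ACDE-FG", "ACDQWFG", "acid phosphatase", min_identity=80.0))
```

The same pair of sequences is annotated or left as "hypothetical protein" depending only on the cutoff, which is why annotations transferred in this way have to be taken with caution.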
18 Genes with unknown function
- When genomes of model organisms were sequenced, about 40% of the predicted genes could not be associated with any known function
  - These genes are annotated as "hypothetical proteins".
- Note
  - In the yeast genome, many of these hypothetical proteins have been removed from the annotations since 1996, because they were false predictions.
19 Comparative genomics
20 Phylogenetic footprinting
- One of the main reasons for sequencing the mouse genome was to detect conserved regions between mouse and human, which reveal exons and regulatory regions.
- The fact that an unknown gene is found in different genomes gives more confidence in the existence of this gene.
- Another important goal was to detect conserved regions in non-coding regions.
  - On the basis of a few known cases, it has been shown that conserved non-coding regions contain a high concentration of regulatory elements.
  - The detection of conserved non-coding sequences thus gives indications about regions potentially involved in regulation.
  - Such conserved regions are called phylogenetic footprints.
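As a rough illustration of how conserved regions might be located, the following Python sketch slides a window along a pairwise alignment and reports the stretches where the fraction of identical columns reaches a cutoff. It assumes the two sequences (for example orthologous mouse and human non-coding regions) have already been aligned with gaps written as "-". The function name, window size and cutoff are hypothetical choices, and real footprinting tools use more elaborate scoring.

```python
def conserved_windows(aligned_a, aligned_b, window=10, min_identity=0.8):
    """Return merged (start, end) alignment coordinates of conserved windows.

    Coordinates are 0-based, with end exclusive.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # True where both sequences carry the same (non-gap) residue
    match = [a == b and a != "-" for a, b in zip(aligned_a, aligned_b)]
    regions = []
    for start in range(len(match) - window + 1):
        if sum(match[start:start + window]) / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                # Overlapping or adjacent to the previous region: extend it
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: two perfectly conserved blocks separated by divergent blocks
seq_a = "AAAAATTTTTGGGGGCCCCC"
seq_b = "AAAAACCCCCGGGGGAAAAA"
print(conserved_windows(seq_a, seq_b, window=5, min_identity=1.0))
```

Regions reported in non-coding sequence would be candidate footprints, to be checked against known regulatory elements.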
21 Phylogenetic profiles
- For each gene of the query genome (e.g. E. coli), orthologs are searched in all the sequenced genomes
- Each gene is characterized by a profile of presence/absence in all the sequenced genomes
- Groups of genes having similar phylogenetic profiles are likely to be functionally related
Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
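The idea of comparing presence/absence profiles can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of Pellegrini et al.: each profile is a tuple of 0/1 values (one per sequenced genome), similarity is measured with a plain Hamming distance, and the gene names and profiles are invented toy data.

```python
from itertools import combinations


def hamming(profile_a, profile_b):
    """Number of genomes in which one gene is present and the other absent."""
    return sum(x != y for x, y in zip(profile_a, profile_b))


def similar_profile_pairs(profiles, max_distance=0):
    """Return the pairs of genes whose profiles differ in at most max_distance genomes."""
    pairs = []
    for (gene1, p1), (gene2, p2) in combinations(sorted(profiles.items()), 2):
        if hamming(p1, p2) <= max_distance:
            pairs.append((gene1, gene2))
    return pairs


# Hypothetical profiles over 5 genomes (1 = ortholog found, 0 = not found)
profiles = {
    "geneA": (1, 0, 1, 1, 0),
    "geneB": (1, 0, 1, 1, 0),
    "geneC": (0, 1, 0, 0, 1),
    "geneD": (1, 0, 1, 0, 0),
}
print(similar_profile_pairs(profiles))
print(similar_profile_pairs(profiles, max_distance=1))
```

Here geneA and geneB share an identical profile and would be proposed as functionally related; loosening the distance cutoff adds geneD to the group.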
22 Gene fusion analysis
- It is quite frequent to observe that two genes of a given organism are fused into a single gene in another organism.
- Fusions between more than 2 genes are occasionally observed.
- Fused genes are likely to be functionally related.
References
- Marcotte, et al. (1999). Science 285(5428), 751-3.
- Marcotte, et al. (1999). Nature 402(6757), 83-6.
- Enright, et al. (1999). Nature 402(6757), 86-90.
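One simple way to look for candidate fusions, given here as a hedged Python sketch rather than the method of the cited papers, is to start from similarity hits between the proteins of a query genome and those of another genome, then report every protein of the other genome that is matched by two different query proteins on non-overlapping segments. The input format, the function name and the gene names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def fusion_candidates(hits):
    """Detect target proteins covered by two distinct query proteins.

    hits: iterable of (query_gene, target_gene, target_start, target_end),
    where the coordinates give the matched segment on the target protein.
    Returns a list of (query_gene_1, query_gene_2, target_gene).
    """
    by_target = defaultdict(list)
    for query, target, start, end in hits:
        by_target[target].append((query, start, end))
    candidates = []
    for target, matches in sorted(by_target.items()):
        for (q1, s1, e1), (q2, s2, e2) in combinations(sorted(matches), 2):
            # Keep pairs of different genes matching disjoint target segments
            if q1 != q2 and (e1 < s2 or e2 < s1):
                candidates.append((q1, q2, target))
    return candidates


# Hypothetical hits: geneA and geneB match the two halves of fusedX,
# whereas geneC and geneD match the same region of otherY
hits = [
    ("geneA", "fusedX", 1, 200),
    ("geneB", "fusedX", 210, 450),
    ("geneC", "otherY", 1, 300),
    ("geneD", "otherY", 50, 280),
]
print(fusion_candidates(hits))
```

Requiring disjoint segments separates fusions from pairs of paralogs that simply match the same domain of the target protein.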
23 Conclusion
24 The genome challenge
- Despite the availability of several hundred genomes, we are far from understanding the organization and function of a single genome.
- In particular, a lot of work remains to be done to decipher genomes of higher organisms.
- Genome sequence by itself is far from sufficient for this.
- Since 1997, several high-throughput methods have been invented to give complementary information about gene function (see courses on transcriptome, proteome and interactome).
25 Some milestones