Jacques.van.Helden@ulb.ac.be

Description:

Laboratoire de Bioinformatique des Génomes et des Réseaux ... H. pylori 1 composite. A. B. A^B. References. Marcotte, et al. (1999). Science 285(5428), 751-3. ...

Slides: 26
Provided by: jacquesv8

1
Genome analysis
  • Bioinformatics

2
Contents
  • Genome annotation
  • Comparative genomics
  • Phylogenetic profiles
  • Gene fusion analysis
  • Phylogenetic footprinting

3
From sequences to genomes
  • Bioinformatics

4
From sequences to genomes
  • Before the 1990s, DNA sequencing represented a
    major investment of human labour. A PhD student
    could spend a significant fraction of their
    thesis sequencing a single gene.
  • Genome projects stimulated the development of
    automated sequencing methods, and led to
    major technological improvements.
  • There are currently (2008) several hundred
    publicly available fully sequenced genomes.
  • The NCBI genome distribution
    (ftp://ftp.ncbi.nih.gov/genomes/) contains:
  • >650 prokaryotes (Bacteria and Archaea)
  • Insects (Drosophila melanogaster, Apis mellifera)
  • Plants (Arabidopsis thaliana, rice, maize)
  • A worm (Caenorhabditis elegans)
  • Some fungi (Saccharomyces cerevisiae,
    Schizosaccharomyces pombe, ...)
  • Some mammals (Homo sapiens, Mus musculus, Rattus
    norvegicus)
  • Other genome centres give access to other genomes:
  • ENSEMBL (http://www.ensembl.org/) maintains many
    vertebrate genomes
  • UCSC (http://genome.ucsc.edu/) maintains genomes
    of metazoans, including insects
  • Sanger Institute (http://www.sanger.ac.uk/genbiol/)
  • Integr8: 800 genomes in 2008.
  • Many other genomes were sequenced by commercial
    companies, and are not available to the public.

5
Gene organization
Source: Mount (2000)
6
Gene function
>PHO4,SPBC428.03C THIAMINE-REPRESSIBLE ACID PHOSPHATASE PRECURSOR Q01682 Q9UU70
Length = 463
Score = 161 bits (408), Expect = 1e-40
Identities = 138/473 (29%), Positives = 223/473 (46%), Gaps = 47/473 (9%)

Query   9 ILAASLVNAGTIPLGKLSDIDKIGTQTEIFPFLGGSGPYYSFPGDYGISRDLPESCEMKQ 68
Sbjct  10 LLAASIVHAGK------SQFEAFENEFYFKDHLGTISVYHE-PYFNGPTTSFPESCAIKQ 62

Query  69 VQMVGRHGERYPT-------VSKAKSIMTTWYKLSNYTGQFSGALSFLNDDYEFFIRDTK 121
Sbjct  63 VHLLQRHGSRNPTGDDTATDVSSAQYIDIFQNKLLN--GSIPVNFSYPENPLYFVKHWTP 120

Query 122 NLEMETTLANSVNVLNPYTGEMNAKRHARDFLAQYGYMVENQTSFAVFTSNSNRCHDTAQ 181
Sbjct 121 VIKAENADQLSSS------GRIELFDLGRQVFERY-YELFDTDVYDINTAAQERVVDSAE 173

Query 182 YFIDGL-GDKFN--ISLQTISEAESAGANTLSAHHSCPAWDDDVNDDILKK-----YDTK 233
Sbjct 174 WFSYGMFGDDMQNKTNFIVLPEDDSAGANSLAMYYSCPVYEDNNIDENTTEAAHTSWRNV 233

Query 234 YLSGIAKRLNKE-NKGLNLTSSDANTFFAWCAYEINARGYSDICNIFTKDELVRFSYGQD 292
Sbjct 234 FLKPIANRLNKYFDSGYNLTVSDVRSLYYICVYEIALRDNSDFCSLFTPSEFLNFEYDSD 293

Query 293 LETYYQTGPGYDVVRSVGANLFNASVKLLKE--SEVQDQKVWLSFTHDTDILNYLTTIGI 350
Sbjct 294 LDYAYWGGPASEWASTLGGAYVNNLANNLRKGVNNASDRKVFLAFTHDSQIIPVEAALGF 353

Query 351 IDDKNNLTAEH-VPFMENTF----HRSWYVPQGARVYTEKFQCS-NDTYVRYVINDAVVP 404
Sbjct 354 FPD---ITPEHPLPTDKNIFTYSLKTSSFVPFAGNLITELFLCSDNKYYVRHLVNQQVYP 410

Query 405 IETCSTGPGFS----CEINDFYDYAEKRVAGTDFLKVCNVSSVSNSTELTFFW 453
Sbjct 411 LTDCGYGPSGASDGLCELSAYLNSSVRVNSTSNGIANFNSQCQAHSTNVTVYY 463

  • After having localized genes on the sequence, we
    have to predict their function.
  • Some genes had already been characterized before
    the genome project, but these are generally a
    minority of those found in the genome.
  • For the majority of the genes, one tries to
    predict function on the basis of similarities
    between the sequence of the newly sequenced gene
    and some previously known genes (function
    assignment by sequence similarity).
  • Example: in the yeast genome (1996), there are
    still 2500 genes (39%) whose function is
    completely unknown. However:
  • Yeast is among the best known model organisms
    (genetics, molecular biology).
  • The full genome has been available since 1996.
  • When the first draft of the human genome was
    published, 60% of the predicted genes were of
    unknown function.

7
Some milestones
8
Genes and genome size
  • In prokaryotes, the number of genes increases
    linearly with genome size
  • In eukaryotes, this is not the case: the genome
    size increases faster than the number of genes

9
Genes and genome size
  • Beware: the axes are logarithmic.
  • This plot represents the same data as the
    previous one, but in logarithmic scale, in order
    to see Mammals as well.

10
Gene spacing
  • Gene spacing increases considerably with the
    complexity of the organisms.
  • Note: the X axis is logarithmic, not the Y axis
    -> the increase seems roughly exponential.

11
Proportion of intergenic regions
  • Beware: the X axis is logarithmic.
  • The proportion of intergenic regions increases
    with the complexity of an organism.
  • In addition (not shown here), introns represent
    an increasing fraction of the genome.
  • For example, the exonic fraction represents <5%
    of the human genome.

12
Protein size versus genome size
  • Protein sequences are shorter in prokaryotes than
    in eukaryotes.
  • Among eukaryotes, the increase in genome size is
    not correlated with an increase in protein size
  • Higher eukaryotes have a much larger genome than
    fungi, without any increase in protein size

13
Genome annotation
  • Bioinformatics

14
Gene prediction
  • Starting from a completely sequenced genome,
    predict the positions of genes
  • Elements of prediction:
  • Open Reading Frames
  • Start and stop codons, separated by a
    continuous set of non-stop codons.
  • Region content
  • Hexanucleotide composition
  • Codon adaptation index (CAI).
  • Signals
  • In prokaryotes: Shine-Dalgarno boxes.
  • In eukaryotes: intron/exon boundary elements
    (splicing signals).
  • Similarity with known genes.
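
The ORF criterion above (a start codon and a stop codon separated by an uninterrupted run of non-stop codons) can be sketched as follows. This is a minimal illustration, not the method of any particular gene finder: it assumes the standard stop codons, ATG as the only start codon, and the forward strand only. The function name `find_orfs` and the `min_codons` parameter are hypothetical.

```python
# Minimal ORF scan on the forward strand only (illustrative sketch).
# Assumptions: ATG is the only start codon; standard stop codons.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ORF found in the three forward frames.

    Coordinates are 0-based; `end` is exclusive and includes the stop codon.
    `min_codons` counts both the start codon and the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first in-frame ATG since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Toy sequence with one ORF in frame 1 and one in frame 2.
example = "CCATGAAATTTTAGGGATGCCCTGA"
print(find_orfs(example))
```

A real gene finder would also scan the reverse strand and combine ORFs with the region-content and signal criteria listed above. A scan based on ORFs alone tends to over-predict.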

15
Gene prediction - limitations
  • Typical problems
  • Gene prediction programs are trained for a
    specific organism, and can give very bad results
    with other organisms (e.g., the first rounds of
    annotations of A.thaliana were done with programs
    trained for mammals).
  • Any gene prediction program will unavoidably
    predict false genes, and miss some true genes.
  • The prediction of intron/exon boundaries is
    particularly difficult.
  • For prokaryotes, the predicted start codons are
    sometimes imprecise.
  • Example: genome of the yeast Saccharomyces
    cerevisiae
  • For the yeast genome, the gene detection
    protocol used in 1996 was over-predictive.
  • The program essentially relied on ORFs, and
    predicted 6400 genes.
  • Some researchers estimated that 1,000 ORFs might
    be false predictions.
  • Since 1996, the reality of the predicted genes
    has been tested by combining several methods of
    functional genomics (expression studies, mutant
    phenotypes, comparative genomics between closely
    related species, ...).
  • A few hundred of the initially predicted genes
    have been removed from the annotations.

16
Non-coding genes
  • There are many types of non-coding genes:
  • tRNA: transfer RNA
  • rRNA: ribosomal RNA
  • snRNA: small nuclear RNA (elements of the
    spliceosome)
  • snoRNA: methylation guides
  • ...
  • Detection of non-coding RNA
  • Non-coding RNA genes are generally transcribed
    by polymerase I and III, and have different
    promoters

17
Annotation of gene function
  • Once a genomic region has been predicted to
    contain a gene, the next step is to predict the
    function of this gene.
  • The translated product is compared with all known
    proteins, and a putative function can be assigned
    on the basis of high similarity matches.
  • Problems:
  • Sequence similarity is not always sufficient to
    imply the same function
  • Where to put the threshold?
  • Some proteins might have similar function with
    different sequences (convergent evolution).
  • Once a gene has been assigned some putative
    function, this will be used to assign the same
    function to other genes -> propagation of errors.
  • We should thus be aware that gene annotations
    have to be taken with caution.
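
The threshold question can be made concrete with a small sketch that computes percent identity from a pairwise alignment and applies a cut-off before transferring an annotation. The function names and the cut-off values (`min_identity`, `min_length`) are hypothetical placeholders chosen for illustration, not recommended settings.

```python
def alignment_stats(query, subject):
    """Count identities and gap columns in two equal-length aligned strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    length = len(query)
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
    }


def can_transfer_annotation(stats, min_identity=40.0, min_length=100):
    """Decide whether to transfer a function (arbitrary, hypothetical cut-offs)."""
    return stats["percent_identity"] >= min_identity and stats["length"] >= min_length


# Toy alignment: 8 columns, 5 identities, 2 gap columns.
stats = alignment_stats("ACDE-FGH", "ACNEKF-H")
print(stats)
print(can_transfer_annotation(stats))
```

Moving the cut-offs trades missed annotations against wrong ones. Any annotation transferred this way inherits the errors of the protein it was copied from.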

18
Genes with unknown function
  • When genomes of model organisms were sequenced,
    about 40% of the predicted genes could not be
    associated with any known function
  • These genes are annotated as "hypothetical
    proteins".
  • Note:
  • In the yeast genome, many of these hypothetical
    proteins have been removed from the annotations
    since 1996, because they were false predictions.

19
Comparative genomics
  • Bioinformatics

20
Phylogenetic footprinting
  • One of the main reasons for sequencing the mouse
    genome was to detect conserved regions between
    mouse and human, which will reveal exons and
    regulatory regions.
  • The fact that an unknown gene is found in
    different genomes gives more confidence in the
    existence of this gene.
  • Another important goal was to detect conserved
    regions in non-coding regions.
  • On the basis of a few known cases, it has been
    shown that conserved non-coding regions contain a
    high concentration of regulatory elements.
  • The detection of conserved non-coding sequences
    thus gives indications about regions potentially
    involved in regulation.
  • Such conserved regions are called phylogenetic
    footprints.
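
Detecting conserved regions can be sketched as a sliding-window identity scan over two already aligned sequences. This is a simplified illustration under stated assumptions, not a published footprinting method: the function name, the window size, and the identity cut-off are all hypothetical.

```python
def conserved_windows(seq_a, seq_b, window=10, min_identity=0.8):
    """Return merged (start, end) regions whose windows reach min_identity.

    Both inputs must be aligned (same length, '-' for gaps).
    Coordinates are 0-based alignment columns; `end` is exclusive.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    regions = []
    for start in range(len(seq_a) - window + 1):
        end = start + window
        matches = sum(
            1 for a, b in zip(seq_a[start:end], seq_b[start:end])
            if a == b and a != "-"
        )
        if matches / window >= min_identity:
            # Merge with the previous region when the windows overlap or touch.
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions


# Toy alignment: a perfectly conserved 10-column core between divergent flanks.
human = "ACGTACGTAC" + "TTGACATATA" + "GGCCGGCCGG"
mouse = "TGCATGCATG" + "TTGACATATA" + "CCGGCCGGCC"
print(conserved_windows(human, mouse, window=10, min_identity=1.0))
print(conserved_windows(human, mouse, window=10, min_identity=0.8))
```

With a cut-off below 100%, the reported region extends a few columns into the flanks, because windows that straddle the boundary still pass the threshold.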

21
Phylogenetic profiles
  • For each gene of the query genome (e.g. E.coli),
    orthologs are searched in all the sequenced
    genomes
  • Each gene is characterized by a profile of
    presence/absence in all the sequenced genomes
  • Groups of genes having similar phylogenetic
    profiles are likely to be functionally related

Pellegrini et al. (1999). Proc Natl Acad Sci U S A 96(8), 4285-8.
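
The profile comparison can be sketched as below. The gene names and the presence/absence values are invented toy data. The Hamming distance and the `max_distance` parameter are one simple way to compare profiles, assumed here for illustration rather than taken from the cited paper.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) of orthologs in four genomes.
profiles = {
    "geneA": (1, 0, 1, 1),
    "geneB": (1, 0, 1, 1),
    "geneC": (0, 1, 0, 1),
    "geneD": (1, 0, 1, 0),
}


def hamming(p, q):
    """Number of genomes in which the two profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def similar_pairs(profiles, max_distance=0):
    """Return gene pairs whose profiles differ in at most max_distance genomes."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(profiles), 2)
        if hamming(profiles[g1], profiles[g2]) <= max_distance
    ]


print(similar_pairs(profiles))                   # identical profiles only
print(similar_pairs(profiles, max_distance=1))   # allow one mismatch
```

Genes that appear together in the output are candidates for a functional link. With only a few genomes, many profiles coincide by chance, so the signal improves as more genomes are included.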
22
Gene fusion analysis
  • It is quite frequent to observe that two genes of
    a given organism are fused into a single gene in
    another organism.
  • Fusions between more than 2 genes are
    occasionally observed.
  • Fused genes are likely to be functionally related.

References:
Marcotte et al. (1999). Science 285(5428), 751-3.
Marcotte et al. (1999). Nature 402(6757), 83-6.
Enright et al. (1999). Nature 402(6757), 86-90.
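
A fusion event can be sketched from similarity-search results: two distinct genes of one genome that both match the same "composite" protein of another genome, on non-overlapping segments. The hit tuples, gene names, and the `max_overlap` parameter below are hypothetical. The sketch also omits a check used in practice, namely that the two component genes are not homologous to each other.

```python
def find_fusions(hits, max_overlap=0):
    """Find pairs of genes matching disjoint segments of one composite protein.

    hits: list of (gene, composite, start, end), where start/end are the
    coordinates of the match on the composite protein.
    Returns a list of (composite, gene1, gene2).
    """
    by_composite = {}
    for gene, composite, start, end in hits:
        by_composite.setdefault(composite, []).append((gene, start, end))

    fusions = []
    for composite, parts in by_composite.items():
        parts.sort(key=lambda p: p[1])  # order matches along the composite
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                g1, s1, e1 = parts[i]
                g2, s2, e2 = parts[j]
                overlap = min(e1, e2) - max(s1, s2)
                if g1 != g2 and overlap <= max_overlap:
                    fusions.append((composite, g1, g2))
    return fusions


# Toy hits: geneA and geneB cover separate halves of compAB;
# geneC and geneD overlap on protX, so they are not reported.
hits = [
    ("geneA", "compAB", 1, 200),
    ("geneB", "compAB", 210, 400),
    ("geneC", "protX", 1, 150),
    ("geneD", "protX", 100, 300),
]
print(find_fusions(hits))
```

Each reported triple suggests that the two component genes are functionally related.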
23
Conclusion
  • Bioinformatics

24
The genome challenge
  • Despite the availability of several hundred
    genomes, we are far from understanding the
    organization and function of a single genome.
  • In particular, a lot of work remains to be done
    to decipher genomes of higher organisms.
  • Genome sequence by itself is far from sufficient
    for this.
  • Since 1997, several high-throughput methods have
    been invented to give complementary information
    about gene function (see courses on
    transcriptome, proteome and interactome).

25
Some milestones