Annotating genomes using proteomics data

Slides: 23
Provided by: jone74


1
Annotating genomes using proteomics data
  • Andy Jones
  • Department of Preclinical Veterinary Science

2
Overview
  • Genome annotation
  • Current informatics methods
  • Experimental data
  • How good are we at annotating genomes?
  • Proteome data for genome annotation
  • Study on Toxoplasma
  • Challenges
  • Proposed solutions

3
Summary
  • 780 completed genomes
  • 734 draft assemblies
  • 842 in progress
  • Total 2356 (1996 prokaryote, 360 eukaryote)
  • Genome sequencing is just a starting point to
    understanding genes / proteins

4
Annotating eukaryotic genomes
[Figure: genomic DNA and mRNA, with the start codon, stop codon and exons 1-4 labelled]
  • Genome annotation
  • Find start codons / transcriptional initiation
  • Recognise splice acceptor and donor sequences
  • Stop codon
  • Predict alternative splicing...

5
Computational gene prediction
  • De novo prediction: single genome
  • Trained with typical gene structures: learn
    exon-intron signals, translation initiation and
    termination signals, e.g. Markov models
  • Many different predictions scored based on
    training set of known genes
  • Multiple genomes
  • Compare confirmed gene sequences from other
    species
  • Coding regions are more highly conserved, so
    conservation indicates gene position
  • Pattern searching: higher mutation rate of bases
    separated in multiples of three (mutations in the 3rd
    position of codons are often silent)
  • Experimental data also contribute to many genome
    projects
  • New methods weigh evidence from a variety of
    sources
  • Attempting to reproduce how a human annotator
    would work

Brent, Nat Rev Genet. 2008 Jan;9(1):62-73
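
The pattern-searching idea on this slide (substitutions concentrate at bases spaced three apart, because third-codon-position changes are often silent) can be illustrated with a small sketch. The function name and the two toy sequences below are made up for illustration; the sketch assumes two sequences that are already aligned, ungapped and in frame, which real comparative gene finders do not require.

```python
def mismatches_by_codon_position(seq_a: str, seq_b: str) -> list[int]:
    """Count mismatches at codon positions 1, 2 and 3.

    Assumes both sequences are aligned, ungapped and start in frame.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b:
            counts[i % 3] += 1  # i % 3 == 2 is the third codon position
    return counts


# Toy, made-up coding sequences that differ only by silent changes.
species_a = "ATGGCTAAAGGTCTGTAA"
species_b = "ATGGCCAAGGGACTTTAA"
counts = mismatches_by_codon_position(species_a, species_b)
print(counts)
```

A strong excess of mismatches in the third position, as in this toy pair, is the kind of three-base periodicity that suggests a conserved coding region.
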
6
Experimental corroboration of models
  • Expressed Sequence Tags
  • Simple to obtain large volumes of data: sequence
    randomly from cDNA libraries
  • Problems
  • Data sets can contain unprocessed transcripts (do
    not always confirm splicing)
  • Rarely cover the 5' end of the gene
  • Generally low-quality sequences
  • High-throughput sequencing
  • Next-generation sequencers capable of directly
    sequencing mRNA
  • Likely to become more widely used in the future
  • Proteome data (peptide sequence data)

7
How good are gene models?
  • Plasmodium falciparum (causative agent of malaria)
  • Genome sequenced in 2002, has undergone considerable
    curation of gene models
  • Recent article: cDNA study of P. falciparum
  • Suggests 25% of genes in Plasmodium falciparum
    are incorrect (85 genes out of 356 sampled)
  • Majority of errors are in splice junctions
    (intron-exon boundaries)
  • What does this mean for other genomes...?
  • Likely that a high percentage of gene sequences are
    incorrect!
  • BMC Genomics. 2007 Jul 27;8:255.

8
Proteome data for genome annotation
  • Motivation for genome annotation
  • Can rule out that transcripts are
    non-protein-coding
  • Large volumes of proteome data often collected
    for other purposes
  • Certain types of proteome data able to confirm
    the start codon of genes (difficult by other
    methods)
  • Even where considerable ESTs / cDNA sequencing
    has been performed, proteins can be detected with
    no corresponding EST evidence

9
Proteogenomic study of Toxoplasma gondii
  • Proteome study of Toxoplasma gondii using three
    complementary techniques
  • Parasite of clinical significance, related to
    Plasmodium
  • Study aims
  • Identify as many components of the proteome as
    possible
  • Relate peptide sequence data back to genome to
    confirm genes
  • Relate protein expression data to
    transcriptional data (EST / microarray)

10
[Figure: experimental workflow for the three techniques. 1D gel electrophoresis: cut bands, trypsin digestion. 2D gel electrophoresis: cut gel spot, trypsin digestion. Liquid chromatography: trypsin digestion, fractions. Peptides are analysed by mass spectrometry, followed by a sequence database search (compare with theoretical spectra predicted for each peptide in DB).]
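
The sequence database search compares observed spectra with values predicted for each peptide in the database. A minimal sketch of the first part of that prediction is given below: an in-silico trypsin digest followed by a peptide mass calculation. The function names and the toy protein are illustrative assumptions; the cleavage rule (after K or R, not before P) and the monoisotopic residue masses are standard values, but real search engines also model missed cleavages, modifications and fragment ions, none of which is attempted here.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per peptide


def tryptic_digest(protein: str, min_len: int = 1) -> list[str]:
    """Cleave after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        last = i == len(protein) - 1
        if last or (aa in "KR" and protein[i + 1] != "P"):
            peptides.append(protein[start:i + 1])
            start = i + 1
    return [p for p in peptides if len(p) >= min_len]


def peptide_mass(peptide: str) -> float:
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Toy, made-up protein sequence.
peptides = tryptic_digest("MAGKPSTRLLEDKWAVR")
for pep in peptides:
    print(pep, round(peptide_mass(pep), 4))
```
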
11
Database search strategy
[Figure: database search strategy. Official gene models (ToxoDB), alternative gene models predicted by gene finders, and ORFs predicted in a 6 frame translation of the 60MB genome sequence (DNA sequence database) are concatenated into one amino acid sequence database. All spectra are searched against it to identify peptides and proteins, and peptide sequences are aligned back to the corresponding genomic region.]
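
One part of this strategy, predicting ORFs from a 6 frame translation, can be sketched as follows. This is only an illustration under simple assumptions: the standard genetic code, stop-to-stop reading frames, and a made-up minimum length; the function names are hypothetical and this is not the tool used in the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate in frame 0; unknown codons become X, stops become *."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna: str, min_len: int = 5) -> list[tuple[str, int, str]]:
    """Return (strand, frame, peptide) for stop-to-stop ORFs of at least min_len."""
    orfs = []
    for strand, seq in (("+", dna.upper()), ("-", reverse_complement(dna))):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_len:
                    orfs.append((strand, frame, segment))
    return orfs


# Toy, made-up DNA sequence.
orfs = six_frame_orfs("ATGGCTAAAGGTCTGTAA")
print(orfs)
```

Each ORF keeps its strand and frame so that a matched peptide can later be aligned back to the corresponding genomic region.
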
12
  • Five-exon gene: incomplete agreement between
    different gene models
  • Peptide evidence for all 5 exons and 2 introns
    out of 4
  • Note: peptides can only provide positive evidence; no
    peptides matched to the 5' and 3' termini of the gene
    model

13
  • Appears to be an additional exon at the 5' end
  • None of GLEAN, TwinScan or TigrScan algorithms
    appears to have made correct prediction

14
  • All peptides matched to gene models on opposite
    strand

15
Study outcomes
  • Protein evidence for approximately 1/3 of
    predicted genes (2250 proteins)
  • Around 2500 splicing events confirmed
  • Peptides aligned across intron-exon boundaries
  • Around 400 protein IDs appear to match
    alternative gene models
  • Genome database (ToxoDB) hosts peptide sequences
    aligned against gene models
  • Can we use informatics to improve this
    strategy...?
  • Xia et al. (2008) Genome Biology, 9(7), R116

16
Challenges of proteogenomics
  • Main informatics challenge
  • A protein can usually only be identified if the
    gene sequence has been correctly predicted from
    the genome
  • In effect, would like to use MS data directly for
    gene discovery
  • But... searching a six frame genome translation
    is problematic
  • All peptide and protein identifications are
    probabilistic
  • False positive rate is proportional to search
    database size
  • On average only 10-20% of spectra identify a
    peptide
  • Need methods that can exploit the rest of the
    meaningful spectra
  • When gene models change, protein identifications
    are out of date
  • No dynamic interaction between proteome and
    genome data
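
The point about database size can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every candidate peptide has the same small independent chance of matching a spectrum at random; the function name and the numbers are made up. Under that assumption the chance of at least one random match grows almost linearly with the number of candidates while it stays small, which is why a six frame translation (many more candidates than the gene models) is costly.

```python
def prob_random_match(p_single: float, n_candidates: int) -> float:
    """Chance that at least one of n independent candidates matches at random."""
    return 1.0 - (1.0 - p_single) ** n_candidates


# Illustrative numbers only: the same per-candidate chance, growing database.
p_single = 1e-6
for n_candidates in (1_000, 10_000, 100_000):
    print(n_candidates, prob_random_match(p_single, n_candidates))
```
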

17
Automated re-annotation pipeline
  • Planned improvements to the informatics workflow
  • Re-querying pipeline
  • each time gene models change, all mass spectra
    are automatically re-queried
  • Integrate peptide evidence directly into gene
    finding software
  • Maximising the number of informative mass spectra
  • Attempt to optimise algorithms for de novo
    sequencing of peptides
  • N-terminal proteomics
  • Could be used to confirm gene initiation point

18
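[Figure: three-stage pipeline. Stage 1: spectra are searched against the official gene set with multiple database search engines, giving confirmed official models. Stage 2: a gene finder run on the genome sequence produces alternative gene models, which are searched with multiple database search engines to promote alternative models. Stage 3: modified de novo algorithms identify novel ORFs and splice junctions. Proteomic evidence is collected from all three stages.]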
  • Spectra searched in series
  • Peptide evidence confirming official gene,
    alternative model, new ORF
  • Direct flow back to modified gene finder
  • Produce new set of predictions
  • Iteratively improve number of spectra identified
  • In each iteration, fewer spectra flow on to stage
    2 and 3
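
The "searched in series" idea can be sketched as a loop in which only unidentified spectra flow on to the next stage. Everything here is a hypothetical stand-in: the stage names, the `search_in_series` function, and the dictionary lookups that play the role of real search engines and de novo algorithms.

```python
from typing import Callable, Optional

# A search takes a spectrum identifier and returns a peptide, or None.
Search = Callable[[str], Optional[str]]


def search_in_series(
    spectra: list[str], stages: list[tuple[str, Search]]
) -> tuple[dict[str, dict[str, str]], list[str]]:
    """Run each stage only on the spectra left unidentified by earlier stages."""
    evidence: dict[str, dict[str, str]] = {name: {} for name, _ in stages}
    remaining = list(spectra)
    for name, search in stages:
        unmatched = []
        for spectrum in remaining:
            peptide = search(spectrum)
            if peptide is None:
                unmatched.append(spectrum)
            else:
                evidence[name][spectrum] = peptide
        remaining = unmatched  # fewer spectra flow on to the next stage
    return evidence, remaining


# Toy stand-ins for the three stages (made-up identifications).
official = {"s1": "LLEDK"}
alternative = {"s2": "WAVR"}
stages = [
    ("official", official.get),
    ("alternative", alternative.get),
    ("de_novo", lambda spectrum: None),
]
evidence, unidentified = search_in_series(["s1", "s2", "s3"], stages)
print(evidence, unidentified)
```

In a full pipeline the collected evidence would be passed back to the gene finder and the loop repeated with the new predictions.
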

19
  • Combining evidence in gene finders
  • Dynamically checking proposed gene models
    against peptide evidence
  • Combining evidence from different gene finding
    algorithms
  • In this case, probably no single algorithm
    has the correct model

20
Query spectra using different search engines
[Figure: Omssa, X!Tandem and Mascot each produce peptide identifications, which are passed to a rescoring algorithm (FDR) to give a combined list.]
  • Each search engine produces a different
    non-standard score of the quality of a match
  • Developed a search engine independent score,
    based on analysis of false discovery rate
  • Identifications made by more search engines are
    scored more highly
  • Can generate 35% more peptide identifications
    than best single search engine

Jones et al. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. PROTEOMICS, in press (2008)
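
The idea of merging results from several search engines onto one comparable scale can be sketched as below. This is not the rescoring algorithm from the cited paper. It is a deliberately simple stand-in that assumes each engine's native score has already been converted to an estimated false discovery rate, keeps identifications under a threshold, and ranks those supported by more engines first. The function name, threshold and toy values are all assumptions.

```python
from collections import defaultdict

Identification = tuple[str, str]  # (spectrum id, peptide sequence)


def combine_identifications(
    engine_results: dict[str, dict[Identification, float]],
    max_fdr: float = 0.05,
) -> list[tuple[Identification, list[str]]]:
    """Rank identifications by engine agreement, then by best estimated FDR."""
    support: dict[Identification, dict[str, float]] = defaultdict(dict)
    for engine, results in engine_results.items():
        for ident, fdr in results.items():
            if fdr <= max_fdr:
                support[ident][engine] = fdr
    ranked = sorted(
        support.items(),
        key=lambda item: (-len(item[1]), min(item[1].values())),
    )
    return [(ident, sorted(engines)) for ident, engines in ranked]


# Toy, made-up estimated FDR values for three engines.
engine_results = {
    "Mascot": {("s1", "LLEDK"): 0.01, ("s2", "WAVR"): 0.04},
    "Omssa": {("s1", "LLEDK"): 0.02, ("s3", "MAGKPSTR"): 0.20},
    "X!Tandem": {("s1", "LLEDK"): 0.01, ("s2", "WAVR"): 0.03},
}
combined = combine_identifications(engine_results)
print(combined)
```
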
21
Conclusions
  • Proteome data can confirm that gene models are
    correct
  • Data currently under-exploited
  • Challenges: searching mass spec data directly
    against the genome for gene discovery
  • Build re-querying pipeline
  • Iteratively improve gene models
  • Improve capabilities for using multiple search
    engines
  • Integrate peptide evidence directly into gene
    finders

22
Acknowledgments
  • Data from Wastling lab
  • Dong Xia, Sanya Sanderson, Jonathan Wastling
  • ToxoDB at Upenn
  • David Roos, Brian Brunk
  • Email Andrew.jones_at_liv.ac.uk