Annotating genomes using proteomics data

Slides: 23

Provided by: jone74

Transcript and Presenter's Notes

1
Annotating genomes using proteomics data

Andy Jones
Department of Preclinical Veterinary Science

2
Overview

Genome annotation
Current informatics methods
Experimental data
How good are we at annotating genomes?
Proteome data for genome annotation
Study on Toxoplasma
Challenges
Proposed solutions

3
Summary
780 completed genomes
734 draft assembly
842 in progress
Total 2356 (1996 prokaryote, 360 eukaryote)
Genome sequencing is just a starting point to
understanding genes / proteins

4
Annotating eukaryotic genomes
Start codon
Stop codon
Exon 1
Exon 2
Exon 3
Exon 4
Genomic DNA
mRNA

Genome annotation
Find start codons / transcriptional initiation
Recognise splice acceptor and donor sequences
Stop codon
Predict alternative splicing...

5
Computational gene prediction

De novo prediction (single genome)
Trained with typical gene structures: learn
exon-intron signals, translation initiation and
termination signals, e.g. Markov models
Many different predictions scored based on
training set of known genes
Multiple genomes
Compare confirmed gene sequences from other
species
Coding regions are more highly conserved, so
conservation indicates gene position
Pattern searching: higher mutation rate of bases
separated in multiples of three (mutations in the
3rd position of codons are often silent)
Experimental data also contribute to many genome
projects
New methods weigh evidence from a variety of
sources
Attempting to reproduce how a human annotator
would work

Brent, Nat Rev Genet. 2008 Jan;9(1):62-73
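
The pattern-searching idea on this slide (substitutions concentrate at the 3rd codon position because they are often silent) can be illustrated with a short count. This is only a minimal sketch: the function name and the two toy sequences are invented for illustration, and it assumes the sequences are already aligned, in frame and gap-free.

```python
def substitutions_by_codon_position(seq_a, seq_b):
    """Count mismatches at codon positions 1, 2 and 3.

    Assumes two aligned, in-frame, gap-free coding sequences
    of equal length (a simplification for illustration).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = [0, 0, 0]
    for i, (a, b) in enumerate(zip(seq_a.upper(), seq_b.upper())):
        if a != b:
            counts[i % 3] += 1  # i % 3 == 2 is the 3rd codon position
    return counts


# Toy sequences (hypothetical): every difference is a silent
# 3rd-position change (Ala, Leu, Lys, Gly are unchanged).
seq_a = "ATGGCTCTGAAAGGT"
seq_b = "ATGGCCCTCAAGGGA"
print(substitutions_by_codon_position(seq_a, seq_b))  # [0, 0, 4]
```

A strong excess in the third count over a window of sequence is the kind of periodic signal that suggests a coding region.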
6
Experimental corroboration of models

Expressed Sequence Tags
Simple to obtain large volumes of data: sequence
randomly from cDNA libraries
Problems
Data sets can contain unprocessed transcripts (do
not always confirm splicing)
Rarely cover 5' end of gene
Generally low-quality sequences
High-throughput sequencing
Next-generation sequencers capable of directly
sequencing mRNA
Likely to become more widely used in the future
Proteome data (peptide sequence data)

7
How good are gene models?

Plasmodium falciparum (causative agent malaria)
genome sequenced in 2002, undergone considerable
curation of gene models
Recent article: cDNA study of P. falciparum
Suggests about 25% of genes in Plasmodium falciparum
are incorrect (85 genes out of 356 sampled)
Majority of errors are in splice junctions
(intron-exon boundaries)
What does this mean for other genomes...?
Likely that high percentage of gene sequences are
incorrect!
BMC Genomics. 2007 Jul 27;8:255.

8
Proteome data for genome annotation

Motivation for genome annotation
Can confirm that transcripts are protein-coding
(rules out a non-coding transcript)
Large volumes of proteome data often collected
for other purposes
Certain types of proteome data able to confirm
the start codon of genes (difficult by other
methods)
Even where considerable ESTs / cDNA sequencing
has been performed, proteins can be detected with
no corresponding EST evidence

9
Proteogenomic study of Toxoplasma gondii

Proteome study of Toxoplasma gondii using three
complementary techniques
A parasite of clinical significance, related to
Plasmodium

Study aims
Identify as many components of the proteome as
possible
Relate peptide sequence data back to genome to
confirm genes
Relate protein expression data to
transcriptional data (EST / microarray)

10
[Workflow diagram: three complementary techniques]
1D gel electrophoresis: cut bands, trypsin digestion
2D gel electrophoresis: cut gel spot, trypsin digestion
Liquid chromatography: fractions, trypsin digestion
Peptides from each route go to mass spectrometry
Sequence database search (compare with
theoretical spectra predicted for each peptide in
DB)
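
The database search on this slide relies on a list of theoretical peptides for every protein in the database. A minimal sketch of that first step, an in-silico trypsin digestion, is given here. It assumes the commonly used rule (cleave after K or R unless the next residue is P) and no missed cleavages; the function name, the `min_length` parameter and the toy protein are illustrative, not taken from the study.

```python
def tryptic_digest(protein, min_length=6):
    """Return tryptic peptides of a protein sequence.

    Rule assumed here: cleave C-terminal to K or R, except
    when the following residue is P. No missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    # Very short peptides are rarely informative in a search
    return [p for p in peptides if len(p) >= min_length]


# Toy protein (hypothetical); the R before P is not cleaved.
print(tryptic_digest("MASKLTRPEEKGGHR", min_length=1))
# ['MASK', 'LTRPEEK', 'GGHR']
```

A real search engine would go on to predict fragment ion masses for each peptide; that step is omitted here.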
11
Database search strategy
[Diagram] Three sequence sources are concatenated
into one search database:
Official gene models
Alternative gene models predicted by gene finders
ORFs predicted in a 6 frame translation of the
60MB genome sequence
Search all spectra against the concatenated
databases
Identify peptides and proteins
Align peptide sequences back to corresponding
genomic region
Diagram labels: ToxoDB; DNA sequence database;
amino acid sequence database
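
The "ORFs predicted in a 6 frame translation" step can be sketched as follows. This is an assumption-laden illustration rather than the pipeline used in the study: it uses the standard genetic code, defines an ORF simply as a stop-to-stop stretch, and the function names and `min_length` threshold are invented here.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate DNA in frame 0; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_orfs(dna, min_length=30):
    """Yield (strand, frame, peptide) for stop-to-stop ORFs."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand, seq in (("+", dna), ("-", reverse_complement)):
        for frame in range(3):
            for segment in translate(seq[frame:]).split("*"):
                if len(segment) >= min_length:
                    yield strand, frame, segment


# Toy input (hypothetical sequence)
for orf in six_frame_orfs("ATGGCCAAATAG", min_length=3):
    print(orf)
```

Even on a small genome this produces a very large amino acid database, which is part of why the later slides describe six-frame searching as problematic.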
12

Five exon gene: incomplete agreement between
different gene models
Peptide evidence for all 5 exons and 2 introns
out of 4
Note: can only provide positive evidence; no
peptides matched to 5' and 3' termini of gene
model

13
Appears to be additional exon at 5' end
None of GLEAN, TwinScan or TigrScan algorithms
appears to have made correct prediction

14
- All peptides matched to gene models on opposite
strand
15
Study outcomes

Protein evidence for approximately 1/3 of
predicted genes (2250 proteins)
Around 2500 splicing events confirmed
Peptides aligned across intron-exon boundaries
Around 400 protein IDs appear to match
alternative gene models
Genome database (ToxoDB) hosts peptide sequences
aligned against gene models
Can we use informatics to improve this
strategy...?
Xia et al. (2008) Genome Biology, 9(7), R116
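
Confirming a splicing event with a peptide comes down to coordinate arithmetic: a peptide whose codons fall in two exons spans the junction between them. A minimal sketch, assuming a forward-strand gene and coding exon intervals given as 0-based half-open coordinates; the function names and the toy gene model are hypothetical.

```python
def peptide_to_genome(exons, pep_start, pep_length):
    """Map a peptide onto genomic segments.

    exons: (start, end) coding intervals in transcript order,
           0-based half-open, forward strand (assumption).
    pep_start: 0-based index of the peptide's first residue
               within the protein.
    Returns a list of (start, end) genomic segments.
    """
    cds_start = pep_start * 3
    cds_end = cds_start + pep_length * 3
    segments = []
    offset = 0  # coding position of the current exon's first base
    for exon_start, exon_end in exons:
        exon_len = exon_end - exon_start
        lo = max(cds_start, offset)
        hi = min(cds_end, offset + exon_len)
        if lo < hi:
            segments.append((exon_start + lo - offset,
                             exon_start + hi - offset))
        offset += exon_len
    return segments


def confirms_splice_junction(segments):
    """A peptide split over more than one exon spans a junction."""
    return len(segments) > 1


# Toy two-exon gene model (hypothetical coordinates)
exons = [(100, 130), (200, 260)]
hit = peptide_to_genome(exons, pep_start=8, pep_length=5)
print(hit, confirms_splice_junction(hit))
# [(124, 130), (200, 209)] True
```

Reverse-strand genes and codons split by an intron within a reading frame need extra bookkeeping that is left out of this sketch.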

16
Challenges of proteogenomics

Main informatics challenge
A protein can usually only be identified if the
gene sequence has been correctly predicted from
the genome
In effect, would like to use MS data directly for
gene discovery
But... searching a six frame genome translation
is problematic
All peptide and protein identifications are
probabilistic
False positive rate is proportional to search
database size
On average only 10-20% of spectra identify a
peptide
Need methods that can exploit the rest of the
meaningful spectra
When gene models change, protein identifications
are out of date
No dynamic interaction between proteome and
genome data

17
Automated re-annotation pipeline

Planned improvements to the informatics workflow
Re-querying pipeline
each time gene models change, all mass spectra
are automatically re-queried
Integrate peptide evidence directly into gene
finding software
Maximising the number of informative mass spectra
Attempt to optimise algorithms for de novo
sequencing of peptides
N-terminal proteomics
- Could be used to confirm gene initiation point

18
[Pipeline diagram] Spectra are searched in three
stages:
Stage 1: official gene set, multiple database
search engines; outcome: confirmed official model
Stage 2: alternative gene models (gene finder run
on genome sequence), multiple database search
engines; outcome: promote alternative model
Stage 3: modified de novo algorithms; outcome:
novel ORF, splice junction
Output of all stages: proteomic evidence

Spectra searched in series
Peptide evidence confirming official gene,
alternative model, new ORF
Direct flow back to modified gene finder
Produce new set of predictions
Iteratively improve number of spectra identified
In each iteration, fewer spectra flow on to stage
2 and 3
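
The "searched in series" idea can be sketched as a simple cascade in which only unidentified spectra flow on to the next stage. Everything here is a stand-in: the stage names follow the slide, but the `identify` callables are hypothetical placeholders (plain dictionary lookups) for real database search engines or de novo algorithms.

```python
def search_in_series(spectra, stages):
    """Run spectra through search stages in order.

    stages: list of (name, identify) pairs, where identify(spectrum)
            returns a peptide string or None (hypothetical interface).
    Returns (evidence, remaining): evidence maps each identified
    spectrum to (stage name, peptide); remaining lists the spectra
    that no stage identified.
    """
    evidence = {}
    remaining = list(spectra)
    for name, identify in stages:
        unmatched = []
        for spectrum in remaining:
            peptide = identify(spectrum)
            if peptide is None:
                unmatched.append(spectrum)  # flows on to the next stage
            else:
                evidence[spectrum] = (name, peptide)
        remaining = unmatched
    return evidence, remaining


# Toy stand-ins for the three stages (hypothetical data)
official = {"spectrum_1": "PEPTIDEK"}
alternative = {"spectrum_2": "ALTMODELR"}
stages = [
    ("official model", official.get),
    ("alternative model", alternative.get),
    ("de novo", lambda spectrum: None),
]
evidence, remaining = search_in_series(
    ["spectrum_1", "spectrum_2", "spectrum_3"], stages
)
print(evidence)
print(remaining)
```

In the iterative version described on the slide, the evidence would be fed back to the gene finder and the whole cascade re-run on the new predictions.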

19
Combining evidence in gene finders
Dynamically checking proposed gene models
against peptide evidence
Combining evidence from different gene finding
algorithms
In this case, probably no single algorithm
appears to have correct model

20
Query spectra using different search engines
[Diagram] Omssa, X!Tandem and Mascot each produce
a list of peptide identifications
Rescoring algorithm (FDR) merges these into a
combined list of peptides

Each search engine produces a different
non-standard score of the quality of a match
Developed a search engine independent score,
based on analysis of false discovery rate
Identifications made by more search engines are
scored more highly
Can generate 35% more peptide identifications
than best single search engine

Jones et al. Improving sensitivity in proteome
studies by analysis of false discovery rates for
multiple search engines. PROTEOMICS, in press
(2008)
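
A score based on false discovery rate puts the different engines on a common scale. The sketch below is not the rescoring algorithm from the cited paper; it is a generic target-decoy estimate, assuming each engine was also run against a decoy database and that higher scores are better. The function name and the toy scores are illustrative.

```python
def q_values(psms):
    """Estimate q-values from target-decoy search results.

    psms: list of (score, is_decoy) pairs for one search engine,
          higher score meaning a better match (assumption).
    Returns (score, is_decoy, q_value) tuples, best score first.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Estimated FDR when accepting everything down to this score
        fdrs.append(decoys / max(targets, 1))
    # q-value: the lowest FDR at which this match would be accepted
    q = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        q.append(running_min)
    q.reverse()
    return [(s, d, qv) for (s, d), qv in zip(ranked, q)]


# Toy scores (hypothetical): one decoy hit among five matches
results = q_values([(10, False), (9, False), (8, True),
                    (7, False), (6, False)])
print([qv for _s, _d, qv in results])  # [0.0, 0.0, 0.25, 0.25, 0.25]
```

Because a q-value means the same thing whichever engine produced the match, it is one plausible basis for comparing and combining identifications across engines.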
21
Conclusions

Proteome data is able to confirm gene models are
correct
Data currently under-exploited
Challenges remain in searching mass spec data
directly against the genome for gene discovery
Build re-querying pipeline
Iteratively improve gene models
Improve capabilities for using multiple search
engines
Integrate peptide evidence directly into gene
finders

22
Acknowledgments