Title: Annotating genomes using proteomics data
1Annotating genomes using proteomics data
- Andy Jones
- Department of Preclinical Veterinary Science
2Overview
- Genome annotation
- Current informatics methods
- Experimental data
- How good are we at annotating genomes?
- Proteome data for genome annotation
- Study on Toxoplasma
- Challenges
- Proposed solutions
3Summary
- 780 completed genomes
- 734 draft assemblies
- 842 in progress
- Total 2356 (1996 prokaryote, 360 eukaryote)
- Genome sequencing is just a starting point to understanding genes / proteins
4Annotating eukaryotic genomes
[Figure: genomic DNA with start codon, exons 1-4 and stop codon; exons spliced to give mRNA]
- Genome annotation
- Find start codons / transcriptional initiation
- Recognise splice acceptor and donor sequences
- Stop codon
- Predict alternative splicing...
5Computational gene prediction
- De novo prediction: single genome
- Trained with typical gene structures: learn exon-intron signals, translation initiation and termination signals, e.g. Markov models
- Many different predictions scored based on training set of known genes
- Multiple genome
- Compare confirmed gene sequences from other species
- Coding regions more highly conserved, so conservation indicates gene position
- Pattern searching: higher mutation rate of bases separated in multiples of three (mutations in 3rd position of codons are often silent)
- Experimental data also contribute to many genome projects
- New methods weigh evidence from a variety of sources
- Attempting to reproduce how a human annotator would work
Brent, Nat Rev Genet. 2008 Jan;9(1):62-73
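The pattern-searching idea above (differences clustering at every third base) can be illustrated with a minimal sketch. It assumes two already aligned, ungapped sequences of equal length and a known reading frame; the function name and the toy sequences are hypothetical and are not taken from any gene finder.

```python
def mismatches_by_codon_position(seq_a, seq_b, frame=0):
    """Count mismatches at codon positions 1, 2 and 3.

    Assumes seq_a and seq_b are aligned, ungapped and of equal length,
    and that the coding frame starts at index `frame`.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    counts = [0, 0, 0]
    for i in range(frame, len(seq_a)):
        if seq_a[i] != seq_b[i]:
            counts[(i - frame) % 3] += 1
    return counts


# Toy example: every difference is a silent third-position change
# (GCT/GCC = Ala, AAA/AAG = Lys, GGT/GGA = Gly).
species_a = "ATGGCTAAAGGT"
species_b = "ATGGCCAAGGGA"
print(mismatches_by_codon_position(species_a, species_b))
```

An excess of mismatches in one of the three positions is the kind of periodic signal a comparative gene finder could treat as a hint of a coding region; real tools use far richer models.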
6Experimental corroboration of models
- Expressed Sequence Tags
- Simple to obtain large volumes of data: sequence randomly from cDNA libraries
- Problems
- Data sets can contain unprocessed transcripts (do not always confirm splicing)
- Rarely cover 5' end of gene
- Generally low-quality sequences
- High-throughput sequencing
- Next-generation sequencers capable of directly sequencing mRNA
- Likely to become more widely used in the future
- Proteome data (peptide sequence data)
7How good are gene models?
- Plasmodium falciparum (causative agent of malaria)
- Genome sequenced in 2002, has undergone considerable curation of gene models
- Recent article: cDNA study of P. falciparum
- Suggests about 25% of genes in Plasmodium falciparum are incorrect (85 genes out of 356 sampled)
- Majority of errors are in splice junctions (intron-exon boundaries)
- What does this mean for other genomes...?
- Likely that a high percentage of gene sequences are incorrect!
BMC Genomics. 2007 Jul 27;8:255.
8Proteome data for genome annotation
- Motivation for genome annotation
- Can rule out that transcripts are non-protein-coding
- Large volumes of proteome data often collected for other purposes
- Certain types of proteome data able to confirm the start codon of genes (difficult by other methods)
- Even where considerable EST / cDNA sequencing has been performed, proteins can be detected with no corresponding EST evidence
9Proteogenomic study of Toxoplasma gondii
- Proteome study of Toxoplasma gondii using three complementary techniques
- Parasite of clinical significance, related to Plasmodium
- Study aims
- Identify as many components of the proteome as possible
- Relate peptide sequence data back to genome to confirm genes
- Relate protein expression data to transcriptional data (EST / microarray)
10Diagram: three routes to peptides for mass spectrometry
- 1D gel electrophoresis: cut bands, trypsin digestion
- 2D gel electrophoresis: cut gel spot, trypsin digestion
- Liquid chromatography: trypsin digestion, fractions
- Peptides analysed by mass spectrometry
- Sequence database search (compare with theoretical spectra predicted for each peptide in DB)
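As a rough illustration of how a database search derives theoretical peptides from a protein sequence, the sketch below performs an in silico trypsin digest and computes monoisotopic peptide masses. It assumes the common cleavage rule (after K or R, but not before P), no missed cleavages and no modifications; the function names and the toy protein are hypothetical, and real search engines go on to predict fragment ions as well.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water per peptide (free N- and C-termini)


def tryptic_digest(protein):
    """Split a protein after K or R, except when the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])
    return peptides


def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


# Toy protein sequence, for illustration only.
for pep in tryptic_digest("AAKPGRCCKDD"):
    print(pep, round(peptide_mass(pep), 4))
```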
11Database search strategy
- ToxoDB
- 60MB genome sequence (DNA sequence database)
- Amino acid sequence databases: official gene models, alternative gene models predicted by gene finders, ORFs predicted in a 6 frame translation
- Concatenate databases
- Search all spectra
- Identify peptides and proteins
- Align peptide sequences back to corresponding genomic region
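The ORF database in the search strategy above comes from translating the genome in all six reading frames. A minimal sketch of that step follows, assuming the standard genetic code, an unambiguous A/C/G/T sequence and a simple definition of an ORF as a stop-to-stop stretch; the function names and the minimum length are hypothetical choices, not those used in the study.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; '*' marks a stop codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def six_frame_translate(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


def orfs_from_translation(protein, min_len=30):
    """Stop-to-stop stretches of at least min_len residues (assumed cut-off)."""
    return [orf for orf in protein.split("*") if len(orf) >= min_len]


print(six_frame_translate("ATGGCC"))
```

Every stop-to-stop stretch becomes a database entry, which is why a six-frame database is so much larger than a set of gene models for the same genome.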
12- Five-exon gene: incomplete agreement between different gene models
- Peptide evidence for all 5 exons and 2 introns out of 4
- Note: can only provide positive evidence; no peptides matched to 5' and 3' termini of gene model
13- Appears to be additional exon at 5' end
- None of GLEAN, TwinScan or TigrScan algorithms appears to have made correct prediction
14- All peptides matched to gene models on opposite strand
15Study outcomes
- Protein evidence for approximately 1/3 of predicted genes (2250 proteins)
- Around 2500 splicing events confirmed
- Peptides aligned across intron-exon boundaries
- Around 400 protein IDs appear to match alternative gene models
- Genome database (ToxoDB) hosts peptide sequences aligned against gene models
- Can we use informatics to improve this strategy...?
Xia et al. (2008) Genome Biology, 9(7), R116
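A peptide confirms a splicing event when its coding sequence maps to two adjacent exons. The sketch below shows one way to project a peptide from protein coordinates onto the genome. It assumes a forward-strand gene, exons given as 0-based half-open coding intervals in transcript order, and a peptide already located in the protein; all names and coordinates are hypothetical.

```python
def peptide_genomic_segments(exons, pep_start_aa, pep_len_aa):
    """Map a peptide to genomic intervals.

    exons: list of (start, end) coding intervals, 0-based half-open,
           forward strand, in transcript order (assumed input format).
    pep_start_aa: 0-based index of the peptide's first residue in the protein.
    Returns one interval per exon the peptide touches.
    """
    cds_start = pep_start_aa * 3
    cds_end = cds_start + pep_len_aa * 3
    segments = []
    offset = 0  # coding bases consumed by earlier exons
    for g_start, g_end in exons:
        length = g_end - g_start
        lo = max(cds_start, offset)
        hi = min(cds_end, offset + length)
        if lo < hi:
            segments.append((g_start + lo - offset, g_start + hi - offset))
        offset += length
    return segments


def confirms_splice_junction(segments):
    """A peptide split over more than one exon spans at least one intron."""
    return len(segments) > 1


# Toy gene model: two coding exons separated by an intron at 130-200.
gene_model = [(100, 130), (200, 260)]
spanning = peptide_genomic_segments(gene_model, pep_start_aa=8, pep_len_aa=5)
print(spanning, confirms_splice_junction(spanning))
```

Reverse-strand genes and peptides matching several loci would need extra handling that is left out here.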
16Challenges of proteogenomics
- Main informatics challenge
- A protein can usually only be identified if the gene sequence has been correctly predicted from the genome
- In effect, would like to use MS data directly for gene discovery
- But... searching a six frame genome translation is problematic
- All peptide and protein identifications are probabilistic
- False positive rate is proportional to search database size
- On average only 10-20% of spectra identify a peptide
- Need methods that can exploit the rest of the meaningful spectra
- When gene models change, protein identifications are out of date
- No dynamic interaction between proteome and genome data
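The link between database size and false positives can be made concrete with a simple model. Assume each candidate peptide in the database has a small, independent probability p of matching a spectrum by chance; both the independence and the value of p are simplifying assumptions for illustration, not measurements. The expected number of chance matches is then p × N, and the probability of at least one is 1 − (1 − p)^N.

```python
def expected_chance_matches(p_single, n_candidates):
    """Expected number of random matches: grows linearly with database size."""
    return p_single * n_candidates


def prob_any_chance_match(p_single, n_candidates):
    """Probability that at least one candidate matches by chance."""
    return 1.0 - (1.0 - p_single) ** n_candidates


# Hypothetical numbers: the same spectrum searched against a small
# gene-model database and a database ten times larger.
p = 1e-6
for n in (10_000, 100_000):
    print(n, expected_chance_matches(p, n), prob_any_chance_match(p, n))
```

For small p × N the two quantities are nearly equal, which is the sense in which the false positive rate scales with the size of the search space.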
17Automated re-annotation pipeline
- Planned improvements to the informatics workflow
- Re-querying pipeline
- Each time gene models change, all mass spectra are automatically re-queried
- Integrate peptide evidence directly into gene finding software
- Maximising the number of informative mass spectra
- Attempt to optimise algorithms for de novo sequencing of peptides
- N-terminal proteomics
- Could be used to confirm gene initiation point
18Diagram: spectra searched in three stages
- Stage 1: spectra searched against the official gene set with multiple database search engines; outcome: confirmed official model
- Gene finder: genome sequence used to predict alternative gene models
- Stage 2: multiple database search engines; outcome: promote alternative model
- Stage 3: modified de novo algorithms; outcome: novel ORF, splice junction
- Proteomic evidence
- Spectra searched in series
- Peptide evidence confirming official gene, alternative model, new ORF
- Direct flow back to modified gene finder
- Produce new set of predictions
- Iteratively improve number of spectra identified
- In each iteration, fewer spectra flow on to stages 2 and 3
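The serial search described above can be sketched as a loop in which each stage only sees the spectra that earlier stages failed to identify. This is an illustration of the control flow only: the stage names follow the slide, while the search functions are hypothetical stand-ins (simple dictionary lookups) for real search engines and de novo algorithms.

```python
def staged_search(spectra, stages):
    """Run spectra through search stages in series.

    stages: ordered list of (name, search_fn), where search_fn(spectrum)
            returns an identification or None (assumed interface).
    Returns (evidence per stage, spectra still unidentified).
    """
    evidence = {name: [] for name, _ in stages}
    remaining = list(spectra)
    for name, search in stages:
        unidentified = []
        for spectrum in remaining:
            hit = search(spectrum)
            if hit is None:
                unidentified.append(spectrum)
            else:
                evidence[name].append((spectrum, hit))
        remaining = unidentified  # only these flow on to the next stage
    return evidence, remaining


# Toy stand-ins for the three stages.
official = {"s1": "official gene A"}
alternative = {"s2": "alternative model B"}
de_novo = {"s3": "novel ORF"}
stages = [
    ("stage 1: official gene set", official.get),
    ("stage 2: alternative gene models", alternative.get),
    ("stage 3: de novo", de_novo.get),
]
evidence, leftover = staged_search(["s1", "s2", "s3", "s4"], stages)
print(evidence, leftover)
```

In the proposed pipeline the evidence from each stage would feed back into the gene finder, and the whole loop would be rerun against the updated models.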
19- Combining evidence in gene finders
- Dynamically checking proposed gene models against peptide evidence
- Combining evidence from different gene finding algorithms
- In this case, probably no single algorithm appears to have correct model
20Query spectra using different search engines
- Omssa, X!Tandem and Mascot each produce a list of peptide identifications
- Rescoring Algorithm (FDR) merges the three peptide lists into a combined list
- Each search engine produces a different non-standard score of the quality of a match
- Developed a search engine independent score, based on analysis of false discovery rate
- Identifications made by more search engines are scored more highly
- Can generate 35% more peptide identifications than best single search engine
Jones et al. Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines. PROTEOMICS, in press (2008)
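The general idea of putting engine-specific scores on a common false discovery rate scale can be sketched with a target-decoy estimate. This is a deliberately simplified illustration and is not the rescoring algorithm from the cited paper: it assumes each engine reports (peptide, score, is_decoy) tuples with higher scores being better, converts scores to q-values per engine, then keeps peptides that pass a threshold in any engine and ranks them by how many engines agree. All names and data are hypothetical.

```python
def q_values(psms):
    """Target-decoy q-values for one engine.

    psms: list of (peptide, score, is_decoy); higher score = better match.
    Returns {peptide: q-value} for target peptides only.
    """
    ranked = sorted(psms, key=lambda p: p[1], reverse=True)
    fdrs, targets, decoys = [], 0, 0
    for _, _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    q, running_min = {}, float("inf")
    # Walk from worst to best so each hit gets the lowest FDR of any
    # threshold that would still include it.
    for (peptide, _, is_decoy), fdr in zip(reversed(ranked), reversed(fdrs)):
        running_min = min(running_min, fdr)
        if not is_decoy:
            q[peptide] = min(running_min, 1.0)
    return q


def combine_engines(engine_results, max_q=0.05):
    """Peptides passing max_q in any engine, most widely supported first."""
    support = {}
    for engine, psms in engine_results.items():
        for peptide, q in q_values(psms).items():
            if q <= max_q:
                support.setdefault(peptide, set()).add(engine)
    return sorted(support.items(), key=lambda kv: (-len(kv[1]), kv[0]))


# Toy results from two engines with incompatible score scales.
results = {
    "engine_a": [("PEPA", 10, False), ("PEPB", 8, False),
                 ("DEC1", 5, True), ("PEPC", 3, False)],
    "engine_b": [("PEPA", 50, False), ("PEPC", 40, False),
                 ("DEC2", 1, True)],
}
for peptide, engines in combine_engines(results):
    print(peptide, sorted(engines))
```

In this toy case each engine alone accepts two peptides while the combined list contains three, which is the kind of gain the slide reports on real data.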
21Conclusions
- Proteome data is able to confirm gene models are correct
- Currently data under-exploited
- Challenges searching mass spec data directly against the genome for gene discovery
- Build re-querying pipeline
- Iteratively improve gene models
- Improve capabilities for using multiple search engines
- Integrate peptide evidence directly into gene finders
22Acknowledgments
- Data from Wastling lab
- Dong Xia, Sanya Sanderson, Jonathan Wastling
- ToxoDB at UPenn
- David Roos, Brian Brunk
- Email Andrew.jones_at_liv.ac.uk