Title: Novel Peptide Identification using ESTs and Genomic Sequence
1Novel Peptide Identification using ESTs and
Genomic Sequence
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules simultaneously
- High bandwidth
- Mass is an intrinsic property of all (bio)molecules
- No prior knowledge required
3Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously
- ...but not too many, abundance bias
- Mass is an intrinsic property of all (bio)molecules
- ...but need a reference to compare to
4Mass Spectrometry for Proteomics
- Mass spectrometry has been around since the turn of the 20th century...
- ...why is MS Proteomics so new?
- Ionization methods
- MALDI, Electrospray
- Protein chemistry automation
- Chromatography, Gels, Computers
- Protein sequence databases
- A reference for comparison
5Microorganism Identification by MALDI Mass
Spectrometry
- Direct observation of microorganism biomarkers in the field.
- Peaks represent masses of abundant proteins.
- Statistical models assess identification significance.
[Figure: MALDI mass spectrometry, labeled B. anthracis]
6Key Principles
- Protein mass from protein sequence
- No introns, few PTMs
- Specificity of single mass is very weak
- Statistical significance from many peaks
- Not all proteins are equally likely to be observed
- Ribosomal proteins, SASPs
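The first principle can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: it assumes average (not monoisotopic) residue masses, an intact chain with no PTMs, and the function name `protein_average_mass` is hypothetical.

```python
# Average residue masses in Da (standard published values, not from the slides).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one H2O per chain for the free N- and C-termini


def protein_average_mass(sequence):
    """Average mass in Da of an unmodified protein chain (assumes no PTMs)."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real protein.
    print(round(protein_average_mass("MARNK"), 1))
```

Because a single mass is only weakly specific, a computed value like this is useful mainly when many predicted masses are compared against many observed peaks.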
7Rapid Microorganism Identification Database
(www.RMIDb.org)
- Protein Sequences
- 5.3M (1.9M)
- Species
- 15K
- Genbank, RefSeq, CMR, Swiss-Prot, TrEMBL
8Rapid Microorganism Identification Database
(www.RMIDb.org)
9Informatics Issues
- Need good species / strain annotation
- B. anthracis vs B. thuringiensis
- Need correct protein sequence
- B. anthracis Sterne α/β SASP
- RefSeq/Gb: MVMARN... (7442 Da)
- CMR: MARN... (7211 Da)
- Need chemistry based protein classification
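As a rough check on the SASP example, assume the two annotations differ only in the leading residues MV (the rest of each sequence is elided on the slide), and assume average residue masses of about 131.19 Da for Met and 99.13 Da for Val (standard values, not taken from the slide). The quoted masses are then roughly consistent with that difference:

```latex
\Delta m = 7442 - 7211 = 231~\text{Da}
\;\approx\; m_{\text{Met}} + m_{\text{Val}} = 131.19 + 99.13 = 230.32~\text{Da}
```

The remaining discrepancy of under 1 Da is within the rounding of the two quoted masses.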
10Sample Preparation for Peptide Identification
11Single Stage MS
[Figure: MS spectrum; x-axis: m/z]
12Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection; x-axes: m/z]
13Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection and collision induced dissociation (CID), followed by the MS/MS spectrum; x-axes: m/z]
14Peptide Identification
- For each (likely) peptide sequence
- 1. Compute fragment masses
- 2. Compare with spectrum
- 3. Retain those that match well
- Peptide sequences from protein sequence databases
- Swiss-Prot, IPI, NCBI's nr, ...
- Automated, high-throughput peptide identification
in complex mixtures
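The three steps above can be sketched as follows. This is a simplified illustration under stated assumptions, not the scoring used by any real search engine: only singly charged b- and y-ions of unmodified peptides are predicted, the score is a plain count of matched fragments, and the names `fragment_mzs`, `count_matches`, `identify` and the thresholds `tol` and `min_matches` are hypothetical.

```python
# Monoisotopic residue masses in Da (standard values, not from the slides).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON, WATER = 1.007276, 18.010565


def fragment_mzs(peptide):
    """Step 1: singly charged b- and y-ion m/z values of an unmodified peptide."""
    masses = [MONO[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y_ions = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of predicted fragments within tol (m/z units) of a peak."""
    return sum(any(abs(mz - p) <= tol for p in peaks) for mz in fragment_mzs(peptide))


def identify(candidates, peaks, min_matches=4, tol=0.5):
    """Step 3: keep candidates with enough matched fragments, best first."""
    scored = [(count_matches(pep, peaks, tol), pep) for pep in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)


if __name__ == "__main__":
    # Idealized, noise-free spectrum built from a toy peptide.
    peaks = fragment_mzs("PEPTIDE")
    print(identify(["AAAAAAA", "PEPTIDE"], peaks))
```

Note that `identify` can only return peptides present in `candidates`, which is exactly why the choice of sequence database determines what can be found.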
15Why don't we see more novel peptides?
- Tandem mass spectrometry doesn't discriminate against novel peptides...
- ...but protein sequence databases do!
- Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
16What goes missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
17Why should we care?
- Alternative splicing is the norm!
- Only 20-25K human genes
- Each gene makes many proteins
- Proteins have clinical implications
- Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
- Genomic assays, ESTs, mRNA sequence.
- Little hard evidence for translation start site
18Novel Splice Isoform
19Novel Splice Isoform
20Novel Frame
21Novel Frame
22Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
23Novel Mutation
24Searching ESTs
- Proposed long ago
- Yates, Eng, and McCormack, Anal. Chem., 1995.
- Now
- Protein sequences are sufficient for protein identification
- Computationally expensive/infeasible
- Difficult to interpret
- Make EST searching feasible for routine searching to discover novel peptides.
25Searching Expressed Sequence Tags (ESTs)
- Pros
- No introns!
- Primary splicing evidence for annotation pipelines
- Evidence for dbSNP
- Often derived from clinical cancer samples
- Cons
- No frame
- Large (8Gb)
- Untrusted by annotation pipelines
- Highly redundant
- Nucleotide error rate ~1%
26Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
- Six-frame translation
- Eliminate ORFs < 30 amino-acids
- Eliminate amino-acid 30-mers observed once
- Compress to C2 FASTA database
- Complete, Correct for amino-acid 30-mers
- Gene-centric peptide sequence database
- Size < 3% of naïve enumeration, 20,774 FASTA entries
- Running time ~1% of naïve enumeration search
- E-values ~2% of naïve enumeration search results
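The first three construction steps can be sketched as below. This is a minimal illustration under several assumptions, none confirmed by the slides: an "ORF" is taken to be any stop-to-stop stretch of a translated frame, ambiguous codons become `X`, each 30-mer is counted at most once per EST, and the names `six_frame_orfs` and `repeated_kmers` are hypothetical. The UniGene mapping and the final C2 compression step are not modeled.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-to-stop translated stretches of at least min_len residues."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(est_sequences, k=30, min_count=2):
    """Amino-acid k-mers seen in at least min_count ESTs (once per EST)."""
    counts = Counter()
    for est in est_sequences:
        seen = set()
        for orf in six_frame_orfs(est, min_len=k):
            seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)
    return {kmer for kmer, n in counts.items() if n >= min_count}
```

Requiring a 30-mer to be seen more than once is what discards most translations of sequencing errors, since an error at the ~1% nucleotide level rarely recurs identically in a second EST.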
28SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
29Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
30Sequence Databases & CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
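The correspondence between sequences and paths can be illustrated with a small sketch. It assumes the usual SBH (de Bruijn-style) construction in which each k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix; the choice k = 3 and the names `sbh_graph` and `is_path` are assumptions made for illustration, not taken from the slides.

```python
from collections import defaultdict


def sbh_graph(sequences, k):
    """Map each (k-1)-mer node to the set of nodes reachable by one k-mer edge."""
    graph = defaultdict(set)
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def is_path(graph, seq, k):
    """True if every k-mer of seq is an edge of the graph."""
    return all(
        seq[i + 1:i + k] in graph.get(seq[i:i + k - 1], ())
        for i in range(len(seq) - k + 1)
    )


if __name__ == "__main__":
    sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    graph = sbh_graph(sequences, k=3)
    for seq in sequences:
        print(seq, is_path(graph, seq, k=3))
```

With this construction a path such as ACDEFGEFGI also exists although it is not an input sequence; every k-mer on it still comes from the input.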
31Sequence Databases & CSBH-graphs
- All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
32cSBH-graphs
- Quickly determine those that occur twice
[Figure: cSBH-graph with edge counts 2, 2, 1, 2]
33Compressed-SBH-graph
[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2 and sequence ACDEFGI]
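The idea of keeping only repeated k-mers and merging them into longer sequences can be sketched as follows. This is a simplified stand-in, not the author's algorithm: it assumes k = 3, counts every occurrence of a k-mer, merges only simple unbranched chains, and does not treat sequence starts and ends as break points, so it makes no claim about per-edge counts. The names `kmer_counts`, `frequent_kmers` and `compress` are hypothetical.

```python
from collections import Counter, defaultdict


def kmer_counts(sequences, k):
    """Count every occurrence of every k-mer."""
    return Counter(seq[i:i + k] for seq in sequences for i in range(len(seq) - k + 1))


def frequent_kmers(sequences, k, min_count=2):
    """k-mers that occur at least min_count times."""
    return {km for km, n in kmer_counts(sequences, k).items() if n >= min_count}


def compress(kmers):
    """Merge k-mer edges through unbranched nodes into longer edge labels.

    Isolated cycles with no branching node are skipped in this sketch.
    """
    out_edges, in_degree = defaultdict(list), Counter()
    for kmer in sorted(kmers):
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1

    def unbranched(node):
        return in_degree[node] == 1 and len(out_edges.get(node, ())) == 1

    labels = []
    for node in list(out_edges):
        if unbranched(node):
            continue  # interior of a chain; reached from the chain's start
        for kmer in out_edges[node]:
            label, nxt = kmer, kmer[1:]
            while unbranched(nxt):
                kmer = out_edges[nxt][0]
                label += kmer[-1]
                nxt = kmer[1:]
            labels.append(label)
    return labels


if __name__ == "__main__":
    sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
    print(compress(frequent_kmers(sequences, k=3)))
```

Under these assumptions the 3-mers seen at least twice in the three example sequences are ACD, CDE, DEF, EFG and FGI, which chain into the single sequence ACDEFGI.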
34Compressed EST Database
- Gene-centric compressed EST peptide sequence database
- 20,774 sequence entries
- 8 Gb vs 223 Mb
- 35-fold compression
- 22 hours becomes 15 minutes
- E-values improve by similar factor!
- Makes routine EST searching feasible
- Search ESTs instead of IPI?
35Back to the lab...
- Current LC/MS/MS workflows identify a few peptides per protein
- ...not sufficient for protein isoforms
- Need to raise the sequence coverage to (say) 80%
- ...protein separation prior to LC/MS/MS analysis
- Potential for database of splice sites of
(functional) proteins!
36Conclusions
- Good informatics gets the most out of proteomics data
- Proteomics may be useful for genome annotation
- Peptides identify more than just proteins
- Compressed peptide sequence databases make
routine EST searching feasible
37Acknowledgements
- Chau-Wen Tseng, Xue Wu
- UMCP Computer Science
- Catherine Fenselau
- UMCP Biochemistry
- Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: National Cancer Institute