Title: Proteomic Characterization of Alternative Splicing and Coding Polymorphism
1 Proteomic Characterization of Alternative Splicing and Coding Polymorphism
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2 Synopsis
- MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
- Key concepts
  - Spectrum acquisition is unbiased
  - Direct observation of amino-acid sequence
  - Sensitive to minor sequence variation
  - Observed peptides represent folded proteins
3 Synopsis
- MS/MS spectra provide evidence for the amino-acid sequence of functional proteins.
- Applications
  - Cancer biomarkers
  - Genome annotation
4 Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules simultaneously
  - High bandwidth
- Mass is an intrinsic property of all (bio)molecules
  - No prior knowledge required
5 Mass Spectrometer
- Time-Of-Flight (TOF)
- Quadrupole
- Ion-Trap
- MALDI
- Electrospray Ionization (ESI)
6 High Bandwidth
7 Mass is fundamental!
8 Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously
  - ...but not too many, abundance bias
- Mass is an intrinsic property of all (bio)molecules
  - ...but need a reference to compare to
9 Mass Spectrometry for Proteomics
- Mass spectrometry has been around since the turn of the 20th century...
  - ...so why is MS-based proteomics so new?
- Ionization methods
  - MALDI, Electrospray
- Protein chemistry automation
  - Chromatography, Gels, Computers
- Protein sequence databases
  - A reference for comparison
10 Sample Preparation for Peptide Identification
11 Single Stage MS
[Figure: MS spectrum; axis: m/z]
12 Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection; axis: m/z]
13 Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection followed by collision-induced dissociation (CID), giving the MS/MS spectrum; axes: m/z]
14 Peptide Identification
- For each (likely) peptide sequence
  1. Compute fragment masses
  2. Compare with spectrum
  3. Retain those that match well
- Peptide sequences from protein sequence databases
  - Swiss-Prot, IPI, NCBI's nr, ...
- Automated, high-throughput peptide identification in complex mixtures
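The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, a fixed m/z tolerance, and a simple matched-fragment count as the score. The function names (`fragment_mz`, `match_count`, `identify`) and the `tol` and `min_matches` parameters are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions = [sum(masses[:i]) + PROTON for i in range(1, len(peptide))]
    y_ions = [sum(masses[i:]) + WATER + PROTON for i in range(1, len(peptide))]
    return b_ions + y_ions


def match_count(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments within tol of an observed peak."""
    return sum(
        any(abs(mz - peak) <= tol for peak in peaks)
        for mz in fragment_mz(peptide)
    )


def identify(candidates, peaks, min_matches, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(match_count(p, peaks, tol), p) for p in candidates]
    return sorted((s, p) for s, p in scored if s >= min_matches)[::-1]


# Usage: an idealized spectrum built from the true peptide's own fragments.
peaks = fragment_mz("DLATVYVDVLK")
hits = identify(["DLATVYVDVLK", "AVMDDFAAFVEK"], peaks, min_matches=15)
```

Note that the candidate list is the only place a peptide can come from in this scheme: a peptide that is absent from the sequence database is never scored, however good its spectrum.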
15 Why don't we see more novel peptides?
- Tandem mass spectrometry doesn't discriminate against novel peptides...
  - ...but protein sequence databases do!
- Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
16 What goes missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
17 Why should we care?
- Alternative splicing is the norm!
  - Only 20-25K human genes
  - Each gene makes many proteins
- Proteins have clinical implications
  - Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
  - Genomic assays, ESTs, mRNA sequence.
  - Little hard evidence for translation start site
18 Novel Splice Isoform
- Human Jurkat leukemia cell-line
  - Lipid-raft extraction protocol, targeting T cells
  - von Haller, et al. MCP 2003.
- LIME1 gene
  - LCK interacting transmembrane adaptor 1
- LCK gene
  - Leukocyte-specific protein tyrosine kinase
  - Proto-oncogene
  - Chromosomal aberration involving LCK in leukemias.
- Multiple significant peptide identifications
19 Novel Splice Isoform
20 Novel Splice Isoform
21 Novel Mutation
- HUPO Plasma Proteome Project
  - Pooled samples from 10 male and 10 female healthy Chinese subjects
  - Plasma/EDTA sample protocol
  - Li, et al. Proteomics 2005. (Lab 29)
- TTR gene
  - Transthyretin (pre-albumin)
  - Defects in TTR are a cause of amyloidosis.
  - Familial amyloidotic polyneuropathy
    - late-onset, dominant inheritance
22 Novel Mutation
Ala2→Pro associated with familial amyloid polyneuropathy
23 Novel Mutation
24 Translation Start-Site
- Human erythroleukemia K562 cell-line
  - Depth of coverage study
  - Resing et al. Anal. Chem. 2004.
- THOC2 gene
  - Part of the heteromultimeric THO/TREX complex.
- Initially believed to be a novel ORF
  - RefSeq mRNA in Jun 2007, no RefSeq protein
  - TrEMBL entry Feb 2005, no SwissProt entry
  - Genbank mRNA in May 2002 (complete CDS)
  - Plenty of EST support
  - 100,000 bases upstream of other isoforms
25 Translation Start-Site
26 Translation Start-Site
27 Translation Start-Site
28 Translation Start-Site
29 Expressed Sequence Tags (ESTs)
- Cheap, fast, coding
- Single sequencing reads of mRNA
- Sequence from 5' or 3' end
- No assembly
http://www.ncbi.nlm.nih.gov/About/primer/est.html
30 Searching ESTs
- Proposed long ago
  - Yates, Eng, and McCormack, Anal Chem, '95.
- Now
  - Protein sequences are sufficient for protein identification
  - Computationally expensive/infeasible
  - Difficult to interpret
- Make EST searching feasible for routine searching to discover novel peptides.
31 Searching Expressed Sequence Tags (ESTs)
- Pros
  - No introns!
  - Primary splicing evidence for annotation pipelines
  - Evidence for dbSNP
  - Often derived from clinical cancer samples
- Cons
  - No frame
  - Large (8Gb)
  - Untrusted by annotation pipelines
  - Highly redundant
  - Nucleotide error rate 1%
32 Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
  - Six-frame translation
  - Eliminate ORFs < 30 amino-acids
  - Eliminate amino-acid 30-mers observed once
  - Compress to C2 FASTA database
    - Complete, Correct for amino-acid 30-mers
- Gene-centric peptide sequence database
  - Size < 3% of naïve enumeration, 20774 FASTA entries
  - Running time 1% of naïve enumeration search
  - E-values 2% of naïve enumeration search results
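The first three filtering steps listed above (six-frame translation, dropping short ORFs, dropping 30-mers seen only once) can be sketched as follows. This is an illustrative reading of the slide, not the actual pipeline: it assumes the standard genetic code, treats an "ORF" as any stop-free stretch of a translated frame, and counts a 30-mer once per EST. All function names and the `min_len`, `k`, and `min_count` parameters are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def revcomp(dna):
    """Reverse complement of an upper-case DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame(dna):
    """Translate both strands in all three frames ('*' = stop, 'X' = unknown)."""
    frames = []
    for strand in (dna, revcomp(dna)):
        for offset in range(3):
            codons = (strand[i:i + 3] for i in range(offset, len(strand) - 2, 3))
            frames.append("".join(CODON.get(c, "X") for c in codons))
    return frames


def orfs(protein, min_len=30):
    """Stop-free stretches of at least min_len amino acids."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]


def repeated_kmers(ests, k=30, min_len=30, min_count=2):
    """Amino-acid k-mers from long ORFs that occur in at least min_count ESTs."""
    counts = Counter()
    for est in ests:
        seen = set()
        for frame in six_frame(est):
            for orf in orfs(frame, min_len):
                seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)  # each EST contributes at most 1 per k-mer
    return {kmer for kmer, n in counts.items() if n >= min_count}


# Toy usage with k=3 so the example stays readable.
toy_ests = ["ATGGCTGATTAA", "ATGGCTGATTGA", "CCCCCCCCCCCC"]
kept = repeated_kmers(toy_ests, k=3, min_len=3)
```

Requiring a second observation is what discards most single-read sequencing errors; the surviving 30-mers are the input to the compression step.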
34 SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
35 Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
36 Sequence Databases and CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
37 Sequence Databases and CSBH-graphs
- All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edge counts 1, 2, 2, 1, 2]
38 cSBH-graphs
- Quickly determine those that occur twice
[Figure: CSBH-graph with edge counts 2, 2, 1, 2]
39 Correct, Complete (C2) Enumeration
- Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
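The example can be checked mechanically. The sketch below treats each k-mer as an edge of the SBH graph (joining its (k-1)-mer prefix to its (k-1)-mer suffix) and tests whether a compressed sequence set covers exactly the same k-mers as the original: every original k-mer is present (complete) and no new k-mer is introduced (correct). The slides do not state k for this toy example; k = 4 is an assumption here, and the helper names are hypothetical.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def edge_counts(seqs, k):
    """Count how often each k-mer (graph edge) occurs across the sequences."""
    return Counter(km for s in seqs for km in kmers(s, k))


def is_c2(original, compressed, k):
    """True if compressed has exactly the k-mers of original: none lost, none added."""
    return set(edge_counts(original, k)) == set(edge_counts(compressed, k))


K = 4  # assumed k-mer size for the toy example
original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
compressed = ["ACDEFGEFGI", "DEFACG"]

counts = edge_counts(original, K)
print(is_c2(original, compressed, K))
print(sum(map(len, original)), "->", sum(map(len, compressed)), "residues")
```

Any peptide of length at most k that could be found in the original set can still be found in the compressed set, while the number of residues to search drops.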
40 Compressed EST Database
- Gene-centric compressed EST peptide sequence database
  - 20,774 sequence entries
  - 8Gb vs 223 Mb
    - 35 fold compression
  - 22 hours becomes 15 minutes
    - E-values improve by similar factor!
- Makes routine EST searching feasible
  - Search ESTs instead of IPI?
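The figures on this slide can be checked with simple arithmetic. The snippet assumes "8Gb" means roughly 8000 MB; the variable names are illustrative only.

```python
def fold(before, after):
    """Fold improvement when a quantity shrinks from before to after."""
    return before / after


size_fold = fold(8000.0, 223.0)   # database size in MB (8Gb taken as 8000 MB)
time_fold = fold(22 * 60, 15)     # search time in minutes

print(f"size: {size_fold:.1f}-fold smaller, time: {time_fold:.0f}-fold faster")
```

The size ratio lands close to the 35-fold quoted on the slide. If an E-value is taken to be a per-candidate match probability multiplied by the number of candidates scored (an assumption about the search engine, not something the slide states), then shrinking the candidate set shrinks the E-value of a fixed match in proportion, which is one way to read the last sub-bullet.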
41 Significant False Positives
- E-values are not enough!
  - Random guessers are easy to beat.
- Post-translational modifications vs. amino-acid substitution
  - methylation (on I/L, Q, R, C, H, K, S, T, N) +14
  - D → E, G → A, V → I/L, N → Q, S → T +14
- Peptide extension z2 → z3
  - Nonsense AA masses sum to precursor
- Need to ensure
  - fragment ions define novel sequence
  - sequence evidence is strong
  - other plausible explanations can be eliminated
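The +14 ambiguity can be made concrete with monoisotopic masses. The sketch below checks that each substitution listed on the slide shifts the residue mass by exactly one CH2 group, the same shift produced by a single methylation, so mass alone cannot tell the two explanations apart. The `mimics_methylation` helper and its `tol` parameter are hypothetical names for illustration.

```python
# Monoisotopic residue masses (Da) for the residues named on the slide.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}

# Mass added by methylation: one CH2 group (C = 12.0, H = 1.00782503).
CH2 = 12.0 + 2 * 1.00782503

SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"),
                 ("N", "Q"), ("S", "T")]


def delta(old, new):
    """Mass change (Da) when residue old is replaced by residue new."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


def mimics_methylation(old, new, tol=0.001):
    """True if the substitution is indistinguishable from +CH2 within tol."""
    return abs(delta(old, new) - CH2) <= tol


for old, new in SUBSTITUTIONS:
    print(f"{old} -> {new}: {delta(old, new):+.5f} Da, "
          f"mimics methylation: {mimics_methylation(old, new)}")
```

This is why the slide asks that fragment ions pin down the novel residue itself: a precursor mass shift of 14 Da has at least two explanations.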
42 Significant False Positives
- DFLAGGLAAAISK 2.2 x 10^-8
  - 2 ESTs
- DFLAGGIAAAISK 2.2 x 10^-8
  - IPI (2), RefSeq, mRNA, 1400 ESTs
- DFLAGGVAAAISK 3.7 x 10^-8
  - IPI, RefSeq, mRNA, 700 ESTs
- DFLAGGVAAAISKMAVVPI 3.5 x 10^-5
  - Genscan exon
- AISFAKDFLAGGIAAAISK 3.3 x 10^-4
  - Genscan exon
43 Significant False Positives
44 Back to the lab...
- Current LC/MS/MS workflows identify a few peptides per protein
  - ...not sufficient for protein isoforms
- Need to raise the sequence coverage to (say) 80%
  - ...protein separation prior to LC/MS/MS analysis
- Potential for database of splice sites of (functional) proteins!
45 Spectral Matching for Peptide Identification
- Detection vs. identification
  - Increased sensitivity and specificity
  - No novel peptides
- NIST GC/MS Spectral Library
  - Identifies small molecules
  - 100,000s of (consensus) spectra
  - Bundled/Sold with many instruments
  - Dot-product spectral comparison
- Current project: Peptide MS/MS
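A dot-product comparison of two spectra can be sketched as a cosine similarity over binned peak intensities. This is a plain, unweighted version for illustration; library search tools typically apply their own intensity and m/z weightings, which are not reproduced here. The function names and the `width` parameter (bin width in m/z units) are hypothetical.

```python
import math
from collections import defaultdict


def binned(peaks, width=1.0):
    """Sum (m/z, intensity) peaks into m/z bins of the given width."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz // width)] += intensity
    return bins


def dot_product(query, reference, width=1.0):
    """Normalized dot product (cosine) of two spectra; 1.0 means identical shape."""
    q, r = binned(query, width), binned(reference, width)
    numerator = sum(v * r.get(b, 0.0) for b, v in q.items())
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in r.values())))
    return numerator / norm if norm else 0.0


# Usage: a library spectrum, a rescaled copy of it, and an unrelated spectrum.
library = [(175.1, 30.0), (288.2, 100.0), (401.3, 55.0)]
rescaled = [(mz, 0.5 * inten) for mz, inten in library]
unrelated = [(150.4, 80.0), (333.9, 20.0)]

print(dot_product(library, rescaled))
print(dot_product(library, unrelated))
```

Because the score depends on a stored reference spectrum, this approach detects peptides already in the library rather than identifying new ones, which is the "no novel peptides" trade-off noted above.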
46 NIST MS Search Peptides
47 Peptide DLATVYVDVLK
48 Protein Families
49 Protein Families
50 Peptide DLATVYVDVLK
51 Hidden Markov Models for Spectral Matching
- Capture statistical variation and consensus in peak intensity
- Capture semantics of peaks
- Extrapolate model to other peptides
- Good specificity with superior sensitivity for peptide detection
  - Assign 1000s of additional spectra (p-value < 10^-5)
52 Hidden Markov Model
[Figure: HMM with Delete, Insert, and Ion states; an (m/z, int) pair is emitted by ion and insert states]
53 Spectral Matching of Peptide Variants
DFLAGGIAAAISK
DFLAGGVAAAISK
54 Spectral Matching of Peptide Variants
AVMDDFAAFVEK
AVMDDFAAFVEK
55 HMM model extrapolation
56 Conclusions
- Proteomics can inform genome annotation
  - Eukaryotic and prokaryotic
  - Functional vs silencing variants
- Peptides identify more than just proteins
  - Untapped source of disease biomarkers
- Compressed peptide sequence databases make routine EST searching feasible
- Novel spectral matching technique using HMMs looks very promising
57 Acknowledgements
- Catherine Fenselau, Steve Swatkoski
  - UMCP Biochemistry
- Chau-Wen Tseng, Xue Wu
  - UMCP Computer Science
- Cheng Lee
  - Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: NIH/NCI, USDA/ARS