Title: Proteomic Characterization of Alternative Splicing and Coding Polymorphism
1Proteomic Characterization of Alternative
Splicing and Coding Polymorphism
- Nathan Edwards
- Center for Bioinformatics and Computational Biology
- University of Maryland, College Park
2Mass Spectrometry for Proteomics
- Measure mass of many (bio)molecules simultaneously
- High bandwidth
- Mass is an intrinsic property of all (bio)molecules
- No prior knowledge required
3Mass Spectrometry for Proteomics
- Measure mass of many molecules simultaneously
- ...but not too many, abundance bias
- Mass is an intrinsic property of all (bio)molecules
- ...but need a reference to compare to
4High Bandwidth
5Mass is fundamental!
6Mass Spectrometry for Proteomics
- Mass spectrometry has been around since the turn of the 20th century...
- ...why is MS-based proteomics so new?
- Ionization methods
- MALDI, Electrospray
- Protein chemistry automation
- Chromatography, Gels, Computers
- Protein sequence databases
- A reference for comparison
7Sample Preparation for Peptide Identification
8Single Stage MS
[Figure: MS spectrum; x-axis m/z]
9Tandem Mass Spectrometry(MS/MS)
[Figure: MS spectrum with precursor selection; x-axis m/z]
10Tandem Mass Spectrometry(MS/MS)
[Figure: precursor selection and collision induced dissociation (CID) of an MS spectrum, giving an MS/MS spectrum; x-axes m/z]
11Peptide Identification
- For each (likely) peptide sequence
- 1. Compute fragment masses
- 2. Compare with spectrum
- 3. Retain those that match well
- Peptide sequences from protein sequence databases
- Swiss-Prot, IPI, NCBI's nr, ...
- Automated, high-throughput peptide identification
in complex mixtures
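The three steps above can be sketched in a few lines of Python. This is an illustrative sketch only, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions, standard monoisotopic residue masses, and a simple count of matched fragments as the score. The function names and the `tol` and `min_matches` parameters are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # monoisotopic mass of H2O
PROTON = 1.00728   # mass of the charge-carrying proton


def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    total = 0.0
    for m in masses[:-1]:            # N-terminal prefixes
        total += m
        b_ions.append(total + PROTON)
    total = 0.0
    for m in reversed(masses[1:]):   # C-terminal suffixes
        total += m
        y_ions.append(total + WATER + PROTON)
    return b_ions, y_ions


def match_count(peptide, peaks, tol=0.5):
    """Step 2: number of predicted fragments with a peak within tol."""
    b_ions, y_ions = fragment_masses(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b_ions + y_ions)


def identify(candidates, peaks, min_matches, tol=0.5):
    """Step 3: retain candidates that match well, best first."""
    scored = [(match_count(p, peaks, tol), p) for p in candidates]
    return sorted((s for s in scored if s[0] >= min_matches), reverse=True)
```

Only the candidates passed to `identify` can ever be reported, which is why the choice of sequence database matters so much.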
12Why don't we see more novel peptides?
- Tandem mass spectrometry doesn't discriminate against novel peptides...
- ...but protein sequence databases do!
- Searching traditional protein sequence databases biases the results towards well-understood protein isoforms!
13What goes missing?
- Known coding SNPs
- Novel coding mutations
- Alternative splicing isoforms
- Alternative translation start-sites
- Microexons
- Alternative translation frames
14Why should we care?
- Alternative splicing is the norm!
- Only 20-25K human genes
- Each gene makes many proteins
- Proteins have clinical implications
- Biomarker discovery
- Evidence for SNPs and alternative splicing stops with transcription
- Genomic assays, ESTs, mRNA sequence.
- Little hard evidence for translation start site
15Novel Splice Isoform
- Human Jurkat leukemia cell-line
- Lipid-raft extraction protocol, targeting T cells
- von Haller, et al. MCP 2003.
- LIME1 gene
- LCK interacting transmembrane adaptor 1
- LCK gene
- Leukocyte-specific protein tyrosine kinase
- Proto-oncogene
- Chromosomal aberration involving LCK in leukemias.
- Multiple significant peptide identifications
16Novel Splice Isoform
17Novel Splice Isoform
18Novel Frame
19Novel Frame
20Novel Mutation
- HUPO Plasma Proteome Project
- Pooled samples from 10 male and 10 female healthy Chinese subjects
- Plasma/EDTA sample protocol
- Li, et al. Proteomics 2005. (Lab 29)
- TTR gene
- Transthyretin (pre-albumin)
- Defects in TTR are a cause of amyloidosis.
- Familial amyloidotic polyneuropathy
- late-onset, dominant inheritance
21Novel Mutation
Ala2 → Pro associated with familial amyloid polyneuropathy
22Novel Mutation
23Searching ESTs
- Proposed long ago
- Yates, Eng, and McCormack, Anal Chem, 1995.
- Now
- Protein sequences are sufficient for protein identification
- Computationally expensive/infeasible
- Difficult to interpret
- Make EST searching feasible for routine searching
to discover novel peptides.
24Searching Expressed Sequence Tags (ESTs)
- Pros
- No introns!
- Primary splicing evidence for annotation pipelines
- Evidence for dbSNP
- Often derived from clinical cancer samples
- Cons
- No frame
- Large (8Gb)
- Untrusted by annotation pipelines
- Highly redundant
- Nucleotide error rate ~1%
25Compressed EST Peptide Sequence Database
- For all ESTs mapped to a UniGene gene
- Six-frame translation
- Eliminate ORFs < 30 amino-acids
- Eliminate amino-acid 30-mers observed once
- Compress to C2 FASTA database
- Complete, Correct for amino-acid 30-mers
- Gene-centric peptide sequence database
- Size < 3% of naïve enumeration, 20,774 FASTA entries
- Running time ~1% of naïve enumeration search
- E-values ~2% of naïve enumeration search results
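The first steps of this pipeline (six-frame translation, ORF length filter, and removal of 30-mers observed only once) can be sketched as below. This is a minimal sketch under stated assumptions, not the talk's implementation: an "ORF" is taken to be any stop-free stretch between stop codons, codons containing ambiguous bases are translated to `X`, and 30-mer counts are pooled over all ESTs passed in. The function names and parameters are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free translated stretches of at least min_len residues."""
    dna = dna.upper()
    reverse_complement = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse_complement):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            protein = "".join(CODON_TABLE.get(c, "X") for c in codons)
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(ests, k=30, min_len=30, min_count=2):
    """Amino-acid k-mers seen at least min_count times across all ESTs."""
    counts = Counter()
    for est in ests:
        for orf in six_frame_orfs(est, min_len):
            for i in range(len(orf) - k + 1):
                counts[orf[i:i + k]] += 1
    return {kmer for kmer, n in counts.items() if n >= min_count}
```

Requiring a second observation is a cheap guard against sequencing errors: with a nucleotide error rate near 1%, an erroneous 30-mer is unlikely to be reproduced identically in an independent EST.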
27SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
28Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
29Sequence Databases CSBH-graphs
- Original sequences correspond to paths
ACDEFGI, ACDEFACG, DEFGEFGI
30Sequence Databases CSBH-graphs
- All k-mers represented by an edge have the same count
[Figure: CSBH-graph with edges labeled by k-mer count: 1, 2, 2, 1, 2]
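A small sketch of how an SBH-graph and its compressed form might be built for the example sequences. Assumptions, none of which come from the slides: k = 4 (inferred from the example enumeration), nodes are (k-1)-mers and edges are distinct k-mers, and compression merges runs of edges through nodes with exactly one incoming and one outgoing edge. The function names are hypothetical.

```python
from collections import Counter


def sbh_graph(sequences, k):
    """Count each k-mer; every distinct k-mer is one edge of the SBH-graph."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def compress(counts):
    """Merge unbranched runs of edges; map edge label -> its k-mer counts."""
    out_edges, in_degree = {}, Counter()
    for kmer in counts:
        out_edges.setdefault(kmer[:-1], []).append(kmer)
        in_degree[kmer[1:]] += 1

    def is_simple(node):
        return in_degree[node] == 1 and len(out_edges.get(node, [])) == 1

    compressed, seen = {}, set()

    def walk(kmer):
        # Extend the edge through simple nodes, appending one letter per k-mer.
        label, run, node = kmer, [counts[kmer]], kmer[1:]
        seen.add(kmer)
        while is_simple(node) and out_edges[node][0] not in seen:
            nxt = out_edges[node][0]
            seen.add(nxt)
            label += nxt[-1]
            run.append(counts[nxt])
            node = nxt[1:]
        compressed[label] = run

    for node, kmers in out_edges.items():
        if not is_simple(node):          # branch points, sources
            for kmer in kmers:
                walk(kmer)
    for kmer in counts:                  # isolated cycles, if any
        if kmer not in seen:
            walk(kmer)
    return compressed


edges = compress(sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4))
```

For these three sequences the sketch yields five compressed edges (ACDEF, DEFG, EFGI, DEFACG, EFGEFG), and within each edge every k-mer has the same count, so one count per edge is enough to find the k-mers that occur twice.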
31cSBH-graphs
- Quickly determine those that occur twice
[Figure: CSBH-graph with edge counts 2, 2, 1, 2]
32Correct, Complete, Compact (C3) Enumeration
- Set of paths that use each edge exactly once
ACDEFGEFGI, DEFACG
33Correct, Complete (C2) Enumeration
- Set of paths that use each edge at least once
ACDEFGEFGI, DEFACG
34Patching the CSBH-graph
- Use artificial edges to fix unbalanced nodes
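One way to picture the patching step: add an artificial edge from each node with surplus incoming edges to a node with surplus outgoing edges, find an Eulerian circuit in the now balanced graph, and cut the circuit at the artificial edges. The sketch below does this on the uncompressed k-mer graph with Hierholzer's algorithm. It is an illustration under those assumptions, not the talk's implementation, and `c3_enumeration` is a hypothetical name.

```python
from collections import defaultdict


def c3_enumeration(kmers):
    """Spell sequences that together contain each distinct k-mer exactly once."""
    graph = defaultdict(list)     # node -> list of (next_node, is_real_edge)
    balance = defaultdict(int)    # out-degree minus in-degree
    for kmer in sorted(set(kmers)):
        graph[kmer[:-1]].append((kmer[1:], True))
        balance[kmer[:-1]] += 1
        balance[kmer[1:]] -= 1

    # Patch unbalanced nodes with artificial edges (surplus-in -> surplus-out).
    heads = [n for n, b in balance.items() for _ in range(b)]
    tails = [n for n, b in balance.items() for _ in range(-b)]
    for tail, head in zip(tails, heads):
        graph[tail].append((head, False))

    paths = []
    for start in list(graph):
        if not graph[start]:
            continue
        # Hierholzer's algorithm, recording the circuit as a list of edges.
        stack, circuit = [(start, None)], []
        while stack:
            node = stack[-1][0]
            if graph[node]:
                nxt, real = graph[node].pop()
                stack.append((nxt, (node, nxt, real)))
            else:
                edge = stack.pop()[1]
                if edge is not None:
                    circuit.append(edge)
        circuit.reverse()

        # Rotate so the circuit ends on an artificial edge, then cut there.
        cut = next((i for i, e in enumerate(circuit) if not e[2]), None)
        if cut is not None:
            circuit = circuit[cut + 1:] + circuit[:cut + 1]
        current = None
        for tail, head, real in circuit:
            if not real:
                if current:
                    paths.append(current)
                current = None
            elif current is None:
                current = tail + head[-1]
            else:
                current += head[-1]
        if current:
            paths.append(current)
    return paths
```

Eulerian decompositions are not unique, so for the example k-mers this sketch may return a different pair of sequences than the one shown on the slide; any valid result covers each k-mer exactly once.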
35Compressed EST Database
- Gene centric compressed EST peptide sequence database
- 20,774 sequence entries
- 8Gb vs 223 Mb
- 35 fold compression
- 22 hours becomes 15 minutes
- E-values improve by similar factor!
- Makes routine EST searching feasible
- Search ESTs instead of IPI?
36Novel Peptide Computational Infrastructure
- Binaries (C)
- cSBH-graph construction
- Condor grid-enabled
- Eulerian path k-mer enumeration
- Suitable for large graphs
- Data-model for peptide identification
- Spectra (> 5 million)
- Peptide identifications
- Mascot, SEQUEST, X!Tandem, NIST
- Genomic context of peptides
37Novel Peptide Computational Infrastructure
- Condor grid-enabled MS/MS search
- Mascot, X!Tandem, (Inspect, OMSSA)
- TurboGears python web-stack
- SQLObject Object-Relational-Manager
- MVC web-application framework
- Suitable for AJAX web-services too
- Integration with UCSC genome browser
- caBIG compatible web-services
- Java applet for viewing spectra
38Peptide Identification Navigator
39Peptide Identification Navigator
40Spectrum Viewer
41Spectrum Viewer
42Back to the lab...
- Current LC/MS/MS workflows identify a few peptides per protein
- ...not sufficient for protein isoforms
- Need to raise the sequence coverage to (say) 80%
- ...protein separation prior to LC/MS/MS analysis
- Potential for database of splice sites of
(functional) proteins!
43Microorganism Identification by MALDI Mass
Spectrometry
- Direct observation of microorganism biomarkers in the field.
- Peaks represent masses of abundant proteins.
- Statistical models assess identification significance.
[Figure: B. anthracis spores analyzed by MALDI mass spectrometry]
44Key Principles
- Protein mass from protein sequence
- No introns, few PTMs
- Specificity of single mass is very weak
- Statistical significance from many peaks
- Not all proteins are equally likely to be observed
- Ribosomal proteins, SASPs
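A rough sketch of the first three principles: compute an average protein mass from sequence, count how many observed peaks fall near a predicted mass, and ask how likely that many matches would be by chance. The binomial model here is an assumption chosen for illustration and is not claimed to be the statistical model used in the talk; the function names and the `tol` and `p` parameters are hypothetical.

```python
from math import comb

# Average (not monoisotopic) residue masses in Da.
AVG_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVG = 18.01528


def average_mass(protein):
    """Average mass of an unmodified protein (no PTMs assumed)."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in protein) + WATER_AVG


def matched_peaks(peaks, predicted_masses, tol):
    """Number of observed peaks within tol Da of any predicted mass."""
    return sum(any(abs(p - m) <= tol for m in predicted_masses) for p in peaks)


def binomial_tail(n, k, p):
    """Chance that at least k of n peaks match, each with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
```

A single matched peak is weak evidence when `p` is not small, but the tail probability drops quickly as more peaks match the same organism, which is the point of the "many peaks" principle.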
45Rapid Microorganism Identification Database
(www.RMIDb.org)
- Protein Sequences
- 8.1M (2.9M)
- Species
- 18K
- Genbank,
- Microbial, Virus, Plasmid
- RefSeq
- CMR,
- Swiss-Prot
- TrEMBL
46Rapid Microorganism Identification Database
(www.RMIDb.org)
47Informatics Issues
- Need good species / strain annotation
- B. anthracis vs B. thuringiensis
- Need correct protein sequence
- B. anthracis Sterne α/β SASP
- RefSeq/Gb: MVMARN... (7442 Da)
- CMR: MARN... (7211 Da)
- Need chemistry based protein classification
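The two SASP annotations above differ only in whether the sequence begins with an extra N-terminal MV, and a quick calculation shows this accounts for the mass gap. The sketch uses standard average residue masses for Met and Val; the function name is hypothetical.

```python
# Average residue masses (Da) for the two residues in question.
AVG_RESIDUE_MASS = {"M": 131.1926, "V": 99.1326}


def n_terminal_shift(extra_residues):
    """Mass added by extra N-terminal residues (peptide-bonded, so no water)."""
    return sum(AVG_RESIDUE_MASS[aa] for aa in extra_residues)


shift = n_terminal_shift("MV")     # predicted difference between annotations
reported_gap = 7442 - 7211         # difference between the quoted masses
```

The predicted shift of roughly 230.3 Da agrees with the 231 Da gap between the rounded masses quoted for the two annotations, so an observed peak near one mass or the other indicates which start site is correct.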
48Conclusions
- Proteomics can inform genome annotation
- Eukaryotic and prokaryotic
- Functional vs silencing variants
- Peptides identify more than just proteins
- Untapped source of disease biomarkers
- Compressed peptide sequence databases make
routine EST searching feasible
49Future Research Directions
- Identification of protein isoforms
- Optimize proteomics workflow for isoform detection
- Identify splice variants in cancer cell-lines (MCF-7) and clinical brain tumor samples
- Aggressive peptide sequence enumeration
- dbPep for genomic annotation
- Open, flexible informatics infrastructure for
peptide identification
50Future Research Directions
- Proteomics for Microorganism Identification
- Specificity of tandem mass spectra
- Revamp RMIDb prototype
- Incorporate spectral matching
- Primer design
- k-mer sets as FASTA sequence databases
- Uniqueness oracle for exact and inexact match
- Integration with Primer3
- Tiling, multiplexing, pooling, tag arrays
51Acknowledgements
- Chau-Wen Tseng, Xue Wu
- UMCP Computer Science
- Catherine Fenselau, Steve Swatkoski
- UMCP Biochemistry
- Calibrant Biosystems
- PeptideAtlas, HUPO PPP, X!Tandem
- Funding: National Cancer Institute