Proteomic Characterization of Alternative Splicing and Coding Polymorphism

1
Proteomic Characterization of Alternative
Splicing and Coding Polymorphism

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park

2
Mass Spectrometry for Proteomics

Measure mass of many (bio)molecules
simultaneously
High bandwidth
Mass is an intrinsic property of all
(bio)molecules
No prior knowledge required

3
Mass Spectrometry for Proteomics

Measure mass of many molecules simultaneously
...but not too many, abundance bias
Mass is an intrinsic property of all
(bio)molecules
...but need a reference to compare to

4
High Bandwidth
5
Mass is fundamental!
6
Mass Spectrometry for Proteomics

Mass spectrometry has been around since the turn
of the 20th century...
...why is MS-based Proteomics so new?
Ionization methods
MALDI, Electrospray
Protein chemistry automation
Chromatography, Gels, Computers
Protein sequence databases
A reference for comparison

7
Sample Preparation for Peptide Identification
8
Single Stage MS
MS
m/z
9
Tandem Mass Spectrometry (MS/MS)
m/z
Precursor selection
m/z
10
Tandem Mass Spectrometry (MS/MS)
Precursor selection + collision-induced
dissociation (CID)
m/z
MS/MS
m/z
11
Peptide Identification

For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
Peptide sequences from protein sequence databases
Swiss-Prot, IPI, NCBI's nr, ...
Automated, high-throughput peptide identification
in complex mixtures
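
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, unmodified residues, and a fixed mass tolerance. The names fragment_mz and count_matches and the 0.5 Da default tolerance are hypothetical choices for this sketch.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # monoisotopic mass of H2O (Da)


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal piece
        y_ions.append(sum(masses[i:]) + WATER + PROTON)   # C-terminal piece
    return b_ions, y_ions


def count_matches(peptide, observed_mz, tol=0.5):
    """Step 2: count theoretical fragments within tol Da of an observed peak."""
    b_ions, y_ions = fragment_mz(peptide)
    return sum(
        any(abs(theo - obs) <= tol for obs in observed_mz)
        for theo in b_ions + y_ions
    )
```

Step 3 then amounts to ranking candidate peptides by count_matches (or a more careful score) and keeping the best ones.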

12
Why don't we see more novel peptides?

Tandem mass spectrometry doesn't discriminate
against novel peptides...
...but protein sequence databases do!
Searching traditional protein sequence databases
biases the results towards well-understood
protein isoforms!

13
What goes missing?

Known coding SNPs
Novel coding mutations
Alternative splicing isoforms
Alternative translation start-sites
Microexons
Alternative translation frames

14
Why should we care?

Alternative splicing is the norm!
Only 20-25K human genes
Each gene makes many proteins
Proteins have clinical implications
Biomarker discovery
Evidence for SNPs and alternative splicing stops
with transcription
Genomic assays, ESTs, mRNA sequence.
Little hard evidence for translation start site

15
Novel Splice Isoform

Human Jurkat leukemia cell-line
Lipid-raft extraction protocol, targeting T cells
von Haller, et al. MCP 2003.
LIME1 gene
LCK interacting transmembrane adaptor 1
LCK gene
Leukocyte-specific protein tyrosine kinase
Proto-oncogene
Chromosomal aberration involving LCK in
leukemias.
Multiple significant peptide identifications

16
Novel Splice Isoform
17
Novel Splice Isoform
18
Novel Frame
19
Novel Frame
20
Novel Mutation

HUPO Plasma Proteome Project
Pooled samples from 10 male and 10 female healthy
Chinese subjects
Plasma/EDTA sample protocol
Li, et al. Proteomics 2005. (Lab 29)
TTR gene
Transthyretin (pre-albumin)
Defects in TTR are a cause of amyloidosis.
Familial amyloidotic polyneuropathy
late-onset, dominant inheritance

21
Novel Mutation
Ala2→Pro associated with familial amyloid
polyneuropathy
22
Novel Mutation
23
Searching ESTs

Proposed long ago
Yates, Eng, and McCormack, Anal. Chem., 1995.
Now
Protein sequences are sufficient for protein
identification
Computationally expensive/infeasible
Difficult to interpret
Make EST searching feasible for routine searching
to discover novel peptides.

24
Searching Expressed Sequence Tags (ESTs)

Pros
No introns!
Primary splicing evidence for annotation
pipelines
Evidence for dbSNP
Often derived from clinical cancer samples

Cons
No frame
Large (8Gb)
Untrusted by annotation pipelines
Highly redundant
Nucleotide error rate ~1%

25
Compressed EST Peptide Sequence Database

For all ESTs mapped to a UniGene gene
Six-frame translation
Eliminate ORFs < 30 amino acids
Eliminate amino-acid 30-mers observed once
Compress to C2 FASTA database
Complete, Correct for amino-acid 30-mers
Gene-centric peptide sequence database
Size < 3% of naïve enumeration, 20,774 FASTA
entries
Running time ~1% of naïve enumeration search
E-values ~2% of naïve enumeration search results
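
The first filtering steps of this pipeline can be sketched as follows. This is only an illustration under stated assumptions: ORFs are taken as stop-to-stop stretches in the standard genetic code, ambiguous codons become "X", and a 30-mer counts as "observed once" when it appears in only one translated ORF. The function names six_frame_orfs and repeated_kmers are hypothetical, and the compression step is not shown.

```python
from collections import Counter

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def six_frame_orfs(dna, min_len=30):
    """Yield stop-to-stop ORFs of at least min_len residues, all six frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse):
        for frame in range(3):
            protein = "".join(
                CODON.get(strand[i:i + 3], "X")   # "X" for ambiguous codons
                for i in range(frame, len(strand) - 2, 3)
            )
            for orf in protein.split("*"):
                if len(orf) >= min_len:
                    yield orf


def repeated_kmers(orfs, k=30):
    """Return the k-mers that occur in at least two of the given ORFs."""
    counts = Counter()
    for orf in orfs:
        counts.update({orf[i:i + k] for i in range(len(orf) - k + 1)})
    return {kmer for kmer, n in counts.items() if n >= 2}
```

With min_len=30 and k=30 this mirrors the thresholds on the slide; smaller values are convenient for trying the functions on short toy sequences.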

26
Compressed EST Peptide Sequence Database


27
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
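
A minimal sketch of how such a graph can be built from the three example sequences, assuming the usual SBH (de Bruijn) construction in which nodes are (k-1)-mers and each k-mer is an edge. The choice k = 4 is an assumption made for this sketch, and sbh_graph is a hypothetical name.

```python
from collections import Counter, defaultdict


def sbh_graph(sequences, k):
    """Return (edges, counts) for the SBH graph of the given sequences.

    edges maps each (k-1)-mer node to the set of nodes it points to;
    counts records how many times each k-mer (edge) was observed.
    """
    edges = defaultdict(set)
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            counts[kmer] += 1
            edges[kmer[:-1]].add(kmer[1:])   # prefix node -> suffix node
    return edges, counts


edges, counts = sbh_graph(["ACDEFGI", "ACDEFACG", "DEFGEFGI"], k=4)
twice = sorted(kmer for kmer, n in counts.items() if n >= 2)
```

Here twice lists the k-mers seen at least two times, which is the kind of count lookup the graph is meant to make cheap.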
28
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
29
Sequence Databases & CSBH-graphs

Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
30
Sequence Databases & CSBH-graphs

All k-mers represented by an edge have the same
count

[Figure: CSBH-graph with k-mer counts (1 or 2) on its edges]
31
cSBH-graphs

Quickly determine those that occur twice

[Figure: CSBH-graph with edge counts]
32
Correct, Complete, Compact (C3) Enumeration

Set of paths that use each edge exactly once

ACDEFGEFGI, DEFACG
33
Correct, Complete (C2) Enumeration

Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
34
Patching the CSBH-graph

Use artificial edges to fix unbalanced nodes
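
One way to realize the "each edge exactly once" enumeration with artificial edges is sketched below. It is not the presenter's implementation: it assumes an uncompressed graph over distinct k-mers, balances every node by routing artificial edges through a single extra node, runs Hierholzer's algorithm for Eulerian circuits, and cuts the circuit wherever it passes through the artificial node. The name enumerate_paths is hypothetical.

```python
from collections import defaultdict

ARTIFICIAL = None  # extra node used only to balance the graph


def enumerate_paths(kmers):
    """Return strings that together contain each distinct k-mer exactly once."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)               # out-degree minus in-degree
    for kmer in sorted(set(kmers)):
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    # Patch unbalanced nodes with artificial edges to or from ARTIFICIAL.
    for node, b in sorted(balance.items()):
        for _ in range(abs(b)):
            if b > 0:
                out_edges[ARTIFICIAL].append(node)
            else:
                out_edges[node].append(ARTIFICIAL)

    def circuit(start):
        """Hierholzer's algorithm: Eulerian circuit through start."""
        stack, walk = [start], []
        while stack:
            node = stack[-1]
            if out_edges[node]:
                stack.append(out_edges[node].pop())
            else:
                walk.append(stack.pop())
        return walk[::-1]

    paths = []
    starts = [ARTIFICIAL] + sorted(n for n in out_edges if n is not ARTIFICIAL)
    for start in starts:
        if not out_edges[start]:
            continue
        segment = []
        for node in circuit(start) + [ARTIFICIAL]:
            if node is ARTIFICIAL:           # cut the circuit here
                if len(segment) > 1:
                    paths.append(segment[0] + "".join(n[-1] for n in segment[1:]))
                segment = []
            else:
                segment.append(node)
    return paths


seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
kmers = [s[i:i + 4] for s in seqs for i in range(len(s) - 3)]
paths = enumerate_paths(kmers)
```

Several valid path sets usually exist, so the strings returned may differ from the ones shown on the slide while still covering the same k-mers with the same total length.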

35
Compressed EST Database

Gene centric compressed EST peptide sequence
database
20,774 sequence entries
8Gb vs 223 Mb
35 fold compression
22 hours becomes 15 minutes
E-values improve by similar factor!
Makes routine EST searching feasible
Search ESTs instead of IPI?

36
Novel Peptide Computational Infrastructure

Binaries (C)
cSBH-graph construction
Condor grid-enabled
Eulerian path k-mer enumeration
Suitable for large graphs
Data-model for peptide identification
Spectra (> 5 million)
Peptide identifications
Mascot, SEQUEST, X!Tandem, NIST
Genomic context of peptides

37
Novel Peptide Computational Infrastructure

Condor grid-enabled MS/MS search
Mascot, X!Tandem, (Inspect, OMSSA)
TurboGears Python web-stack
SQLObject Object-Relational Mapper
MVC web-application framework
Suitable for AJAX web-services too
Integration with UCSC genome browser
caBIG compatible web-services
Java applet for viewing spectra

38
Peptide Identification Navigator
39
Peptide Identification Navigator
40
Spectrum Viewer
41
Spectrum Viewer
42
Back to the lab...

Current LC/MS/MS workflows identify a few
peptides per protein
...not sufficient for protein isoforms
Need to raise the sequence coverage to (say) 80%
...protein separation prior to LC/MS/MS analysis
Potential for database of splice sites of
(functional) proteins!

43
Microorganism Identification by MALDI Mass
Spectrometry

Direct observation of microorganism biomarkers in
the field.
Peaks represent masses of abundant proteins.
Statistical models assess identification
significance.

B. anthracis spores
MALDI Mass Spectrometry
44
Key Principles

Protein mass from protein sequence
No introns, few PTMs
Specificity of single mass is very weak
Statistical significance from many peaks
Not all proteins are equally likely to be
observed
Ribosomal proteins, SASPs

45
Rapid Microorganism Identification Database
(www.RMIDb.org)

Protein Sequences: 8.1M (2.9M)
Species: 18K
Sources: GenBank; Microbial, Virus, Plasmid RefSeq; CMR; Swiss-Prot; TrEMBL

46
Rapid Microorganism Identification Database
(www.RMIDb.org)
47
Informatics Issues

Need good species / strain annotation
B. anthracis vs B. thuringiensis
Need correct protein sequence
B. anthracis Sterne α/β SASP
RefSeq/Gb: MVMARN... (7442 Da)
CMR: MARN... (7211 Da)
Need chemistry-based protein classification
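
The sensitivity of predicted mass to the annotated start site can be checked with a short calculation. The sketch below assumes average (not monoisotopic) residue masses, an unmodified protein, and no N-terminal processing; protein_mass is a hypothetical helper. Only the visible N-terminal residues are used, since the full SASP sequence is not given here.

```python
# Average residue masses (Da) for the 20 standard amino acids.
AVERAGE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167,
    "V": 99.1326, "T": 101.1051, "C": 103.1388, "L": 113.1594,
    "I": 113.1594, "N": 114.1038, "D": 115.0886, "Q": 128.1307,
    "K": 128.1741, "E": 129.1155, "M": 131.1926, "H": 137.1411,
    "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # average mass of H2O (Da)


def protein_mass(sequence):
    """Average mass of an unmodified protein: residue masses plus one water."""
    return sum(AVERAGE_MASS[aa] for aa in sequence) + WATER


# Two extra N-terminal residues (M and V) shift the predicted mass.
shift = protein_mass("MVMARN") - protein_mass("MARN")
```

The computed shift is about 230.3 Da, close to the 231 Da gap between the two database entries (7442 Da vs 7211 Da), which is consistent with the two entries differing mainly in the annotated start site.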

48
Conclusions

Proteomics can inform genome annotation
Eukaryotic and prokaryotic
Functional vs silencing variants
Peptides identify more than just proteins
Untapped source of disease biomarkers
Compressed peptide sequence databases make
routine EST searching feasible

49
Future Research Directions

Identification of protein isoforms
Optimize proteomics workflow for isoform
detection
Identify splice variants in cancer cell-lines
(MCF-7) and clinical brain tumor samples
Aggressive peptide sequence enumeration
dbPep for genomic annotation
Open, flexible informatics infrastructure for
peptide identification

50
Future Research Directions