Proteomic Characterization of Alternative Splicing and Coding Polymorphism - PowerPoint PPT Presentation

Description:

Title: Faster, More Sensitive Peptide ID by Sequence DB Compression Last modified by: Nathan John Edwards Created Date: 12/6/2004 12:44:14 AM Document presentation format – PowerPoint PPT presentation

Slides: 48

Provided by: edwardsla

Learn more at: http://edwardslab.bmcb.georgetown.edu

Transcript and Presenter's Notes

1
Proteomic Characterization of Alternative
Splicing and Coding Polymorphism

Nathan Edwards
Center for Bioinformatics and Computational
Biology
University of Maryland, College Park

2
Proteomics

Proteins are the machines that drive much of
biology
Genes are merely the recipe
The direct characterization of a sample's
proteins en masse.
What proteins are present?
How much of each protein is present?

3
Systems Biology

Establish relationships by
Choosing related samples,
Global characterization, and
Comparison.

                 Gene / Transcript / Protein
Measurement      Predetermined     Unknown
Discrete (DNA)   Genotyping        Sequencing
Continuous       Gene Expression   Proteomics

4
Samples

Healthy / Diseased
Cancerous / Benign
Drug resistant / Drug susceptible
Progression or Prognosis
Bound / Unbound
Tissue specific
Cellular location specific
Mitochondria, Membrane

5
2D Gel-Electrophoresis

Protein separation
Molecular weight (MW)
Isoelectric point (pI)
Staining
Bird's-eye view of protein abundance

6
2D Gel-Electrophoresis
Bécamel et al., Biol. Proced. Online 2002; 4: 94-104.
7
Paradigm Shift

Traditional protein chemistry assay methods
struggle to establish identity.
Identity requires
Specificity of measurement (Precision)
A reference for comparison

8
Mass Spectrometry for Proteomics

Measure mass of many (bio)molecules
simultaneously
High bandwidth
Mass is an intrinsic property of all
(bio)molecules
No prior knowledge required

9
Mass Spectrometer

Electron Multiplier (EM)

Time-Of-Flight (TOF)
Quadrupole
Ion-Trap

MALDI
Electrospray Ionization (ESI)

10
High Bandwidth
11
Mass is fundamental!
12
Mass Spectrometry for Proteomics

Measure mass of many molecules simultaneously
...but not too many, abundance bias
Mass is an intrinsic property of all
(bio)molecules
...but need a reference to compare to

13
Mass Spectrometry for Proteomics

Mass spectrometry has been around since the turn
of the century...
...why is MS-based proteomics so new?
Ionization methods
MALDI, Electrospray
Protein chemistry automation
Chromatography, Gels, Computers
Protein sequence databases
A reference for comparison

14
Sample Preparation for Peptide Identification
15
Single Stage MS
MS
m/z
16
Tandem Mass Spectrometry (MS/MS)
m/z
Precursor selection
m/z
17
Tandem Mass Spectrometry (MS/MS)
Precursor selection and collision-induced
dissociation (CID)
m/z
MS/MS
m/z
18
Peptide Identification

For each (likely) peptide sequence
1. Compute fragment masses
2. Compare with spectrum
3. Retain those that match well
Peptide sequences from protein sequence databases
Swiss-Prot, IPI, NCBI's nr, ...
Automated, high-throughput peptide identification
in complex mixtures
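
The three steps above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the scoring used by any particular search engine: only singly charged b- and y-ions are computed, matching is a simple count within a fixed tolerance, and the function names (`fragment_mz`, `count_matches`, `best_peptides`) are hypothetical.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(masses)):
        b_ions.append(sum(masses[:i]) + PROTON)          # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: number of theoretical fragments with a peak within tol."""
    return sum(
        any(abs(mz - peak) <= tol for peak in peaks)
        for mz in fragment_mz(peptide)
    )


def best_peptides(candidates, peaks, min_matches, tol=0.5):
    """Step 3: keep candidates that match well, best first."""
    scored = [(count_matches(p, peaks, tol), p) for p in candidates]
    return sorted((sp for sp in scored if sp[0] >= min_matches), reverse=True)
```

A candidate peptide that is absent from the sequence database is never scored at all, which is the bias discussed on the next slide.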

19
Why don't we see more novel peptides?

Tandem mass spectrometry doesn't discriminate
against novel peptides...
...but protein sequence databases do!
Searching traditional protein sequence databases
biases the results towards well-understood
protein isoforms!

20
What goes missing?

Known coding SNPs
Novel coding mutations
Alternative splicing isoforms
Alternative translation start-sites
Microexons
Alternative translation frames

21
Why should we care?

Alternative splicing is the norm!
Only 20-25K human genes
Each gene makes many proteins
Proteins have clinical implications
Biomarker discovery
Evidence for SNPs and alternative splicing stops
with transcription
Genomic assays, ESTs, mRNA sequence.
Little hard evidence for translation start site

22
Novel Splice Isoform

Human Jurkat leukemia cell-line
Lipid-raft extraction protocol, targeting T cells
von Haller, et al. MCP 2003.
LIME1 gene
LCK interacting transmembrane adaptor 1
LCK gene
Leukocyte-specific protein tyrosine kinase
Proto-oncogene
Chromosomal aberration involving LCK in
leukemias.
Multiple significant peptide identifications

23
Novel Splice Isoform
24
Novel Splice Isoform
25
Novel Mutation

HUPO Plasma Proteome Project
Pooled samples from 10 male and 10 female healthy
Chinese subjects
Plasma/EDTA sample protocol
Li, et al. Proteomics 2005. (Lab 29)
TTR gene
Transthyretin (pre-albumin)
Defects in TTR are a cause of amyloidosis.
Familial amyloidotic polyneuropathy
late-onset, dominant inheritance

26
Novel Mutation
Ala2→Pro associated with familial amyloid
polyneuropathy
27
Novel Mutation
28
Expressed Sequence Tags (ESTs)

Cheap, fast, coding
Single sequencing reads of mRNA
Sequence from 5' or 3' end
No assembly

http://www.ncbi.nlm.nih.gov/About/primer/est.html
29
Searching ESTs

Proposed long ago
Yates, Eng, and McCormack, Anal. Chem., 1995.
Now
Protein sequences are sufficient for protein
identification
Computationally expensive/infeasible
Difficult to interpret
Make EST searching feasible for routine searching
to discover novel peptides.

30
Searching Expressed Sequence Tags (ESTs)

Pros
No introns!
Primary splicing evidence for annotation
pipelines
Evidence for dbSNP
Often derived from clinical cancer samples

Cons
No frame
Large (8Gb)
Untrusted by annotation pipelines
Highly redundant
Nucleotide error rate 1%

31
Compressed EST Peptide Sequence Database

For all ESTs mapped to a UniGene gene
Six-frame translation
Eliminate ORFs < 30 amino-acids
Eliminate amino-acid 30-mers observed once
Compress to C2 FASTA database
Complete, Correct for amino-acid 30-mers
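
The filtering steps above can be sketched as follows. This is an illustration under stated assumptions, not the actual pipeline: an ORF is taken to be a stop-to-stop stretch in any of the six frames, "observed once" is read as "seen in only one EST", the function names are hypothetical, and the final compression to a C2 FASTA database is not shown.

```python
from collections import Counter
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def revcomp(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frame_orfs(est, min_len=30):
    """Translate all six frames; keep stop-to-stop ORFs of >= min_len residues."""
    orfs = []
    for strand in (est, revcomp(est)):
        for frame in range(3):
            codons = (strand[i:i + 3] for i in range(frame, len(strand) - 2, 3))
            # Codons with ambiguous bases (e.g. N) become X.
            protein = "".join(CODON.get(c, "X") for c in codons)
            orfs += [orf for orf in protein.split("*") if len(orf) >= min_len]
    return orfs


def repeated_kmers(ests, k=30):
    """Amino-acid k-mers seen in at least two ESTs (singletons are dropped)."""
    counts = Counter()
    for est in ests:
        seen = set()
        for orf in six_frame_orfs(est, k):
            seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)  # count each k-mer at most once per EST
    return {kmer for kmer, n in counts.items() if n >= 2}
```

Requiring a second observation is what discards most sequencing errors: at a low per-base error rate, the same erroneous 30-mer is unlikely to recur in an independent read.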

32
Compressed EST Peptide Sequence Database

For all ESTs mapped to a UniGene gene
Six-frame translation
Eliminate ORFs < 30 amino-acids
Eliminate amino-acid 30-mers observed once
Compress to C2 FASTA database
Complete, Correct for amino-acid 30-mers

33
Compressed EST Database

Gene centric compressed EST peptide sequence
database
20,774 sequence entries
8Gb vs 223 Mb
35-fold compression
22 hours becomes 15 minutes
E-values improve by similar factor!
Makes routine EST searching feasible
Search ESTs instead of IPI?
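
The ratios implied by these figures can be checked with a few lines of arithmetic. This is a rough sketch: it assumes 1 Gb = 1000 Mb, and it assumes that an E-value scales linearly with the number of candidate sequences searched (E = p-value x candidates), which is a common convention rather than something stated on the slide. The helper name `rescaled_evalue` is hypothetical.

```python
full_mb, compressed_mb = 8000, 223     # "8Gb vs 223 Mb", taking 1 Gb = 1000 Mb
full_min, compressed_min = 22 * 60, 15 # "22 hours becomes 15 minutes"

size_ratio = full_mb / compressed_mb   # database shrinkage factor
time_ratio = full_min / compressed_min # search-time speed-up factor


def rescaled_evalue(evalue, ratio):
    """E-value for the same match after the search space shrinks by `ratio`
    (assumes the E-value is proportional to the number of candidates)."""
    return evalue / ratio


print(f"size ratio: {size_ratio:.1f}x, time ratio: {time_ratio:.0f}x")
```

Under that assumption, a match that is borderline against the full EST database becomes clearly significant against the compressed one, since the same score is tested against far fewer candidates.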

34
Back to the lab...

Current LC/MS/MS workflows identify a few
peptides per protein
...not sufficient for protein isoforms
Need to raise the sequence coverage to (say) 80%
...protein separation prior to LC/MS/MS analysis
Potential for database of splice sites of
(functional) proteins!

35
Microorganism Identification by MALDI Mass
Spectrometry

Direct observation of microorganism biomarkers in
the field.
Peaks represent masses of abundant proteins.
Statistical models assess identification
significance.

B. anthracis spores
MALDI Mass Spectrometry
36
Key Principles

Protein mass from protein sequence
No introns, few PTMs
Specificity of single mass is very weak
Statistical significance from many peaks
Not all proteins are equally likely to be
observed
Ribosomal proteins, SASPs
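
Why one matching mass is weak evidence but many are strong can be illustrated with a binomial tail. This is only a sketch: it assumes each observed peak matches some database protein by chance independently with the same probability `p_chance`, and it is not necessarily the statistical model used in this work. The function name is hypothetical.

```python
from math import comb


def binomial_tail(n_peaks, k_matched, p_chance):
    """P(at least k_matched of n_peaks match by chance), X ~ Binomial(n, p)."""
    return sum(
        comb(n_peaks, i) * p_chance**i * (1 - p_chance) ** (n_peaks - i)
        for i in range(k_matched, n_peaks + 1)
    )


# One peak matching, with a 20% chance of a random match, proves little.
single = binomial_tail(1, 1, 0.2)
# Eight of ten peaks matching at the same per-peak chance is far less likely.
many = binomial_tail(10, 8, 0.2)
print(single, many)
```

The model also ignores the last point on the slide: proteins such as ribosomal proteins and SASPs are more likely to be observed, so a more careful treatment would weight candidates accordingly.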

37
Rapid Microorganism Identification Database
(www.RMIDb.org)

Protein Sequences: 8.1M (2.9M)
Species: 18K
Genbank,
Microbial, Virus, Plasmid
RefSeq
CMR,
Swiss-Prot
TrEMBL

38
Rapid Microorganism Identification Database
(www.RMIDb.org)
39
Informatics Issues

Need good species / strain annotation
B. anthracis vs. B. thuringiensis
Need correct protein sequence
B. anthracis Sterne α/β SASP
RefSeq/Gb: MVMARN... (7442 Da)
CMR: MARN... (7211 Da)
Need chemistry based protein classification
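
The gap between the two annotated masses above is what the extra N-terminal residues would predict. A quick check, assuming average (not monoisotopic) residue masses and that the two sequences differ only by the leading MV:

```python
# Average residue masses (Da) for the two extra N-terminal residues.
AVG_RESIDUE_MASS = {"M": 131.19, "V": 99.13}

extra_residues = "MV"  # MVMARN... vs MARN...
predicted_gap = sum(AVG_RESIDUE_MASS[aa] for aa in extra_residues)
annotated_gap = 7442 - 7211  # masses as given, rounded to whole daltons

print(f"predicted {predicted_gap:.2f} Da, annotated {annotated_gap} Da")
```

The two values agree to within the rounding of the quoted masses, so a start-site annotation error alone is enough to move the expected peak by about 230 Da.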

40
Spectral Matching

Detection vs. identification
Increased sensitivity
No novel peptides
NIST GC/MS Spectral Library
Identifies small molecules,
100,000s of (consensus) spectra
Bundled/Sold with many instruments
Dot-product spectral comparison
Current project: Peptide MS/MS
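
A dot-product spectral comparison can be sketched as below. This is a generic normalized dot product over unit-width m/z bins, given as an illustration; the binning, the absence of intensity weighting, and the function names are assumptions, not the NIST library's actual scoring.

```python
from math import sqrt


def binned(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins: {bin index: intensity}."""
    vec = {}
    for mz, intensity in peaks:
        b = int(mz / bin_width)
        vec[b] = vec.get(b, 0.0) + intensity
    return vec


def dot_product_score(query, library, bin_width=1.0):
    """Cosine similarity of two spectra given as (m/z, intensity) pairs."""
    q, lib = binned(query, bin_width), binned(library, bin_width)
    dot = sum(q[b] * lib[b] for b in q.keys() & lib.keys())
    norm = sqrt(sum(v * v for v in q.values())) * sqrt(
        sum(v * v for v in lib.values())
    )
    return dot / norm if norm else 0.0
```

The score is near 1 for spectra with the same peaks in the same proportions and 0 for spectra that share no bins. Because it compares against previously observed spectra, it can detect known peptides sensitively but cannot identify novel ones.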

41
Peptide DLATVYVDVLK
42
Peptide DLATVYVDVLK
43
Hidden Markov Models for Spectral Matching

Capture statistical variation and consensus in
peak intensity
Capture semantics of peaks
Extrapolate model to other peptides
Good specificity with superior sensitivity for
peptide detection

44
Conclusions

Molecular biology and bioinformatics provide a
reference for biotechnologies
Foundation of systems biology
Peptides identify more than just proteins
Untapped source of disease biomarkers
Compressed peptide sequence databases make
routine EST searching feasible

45
Future Research Directions

Identification of protein isoforms
Optimize proteomics workflow for isoform
detection
Identify splice variants in cancer cell-lines
(MCF-7) and clinical brain tumor samples
dbPep for genomic annotation

46
Future Research Directions