Proteomic Characterization of Alternative Splicing and Coding Polymorphism

1
Proteomic Characterization of Alternative
Splicing and Coding Polymorphism
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
Synopsis
  • MS/MS spectra provide evidence for the amino-acid
    sequence of functional proteins.
  • Key concepts
  • Spectrum acquisition is unbiased
  • Direct observation of amino-acid sequence
  • Sensitive to minor sequence variation
  • Observed peptides represent folded proteins

3
Synopsis
  • MS/MS spectra provide evidence for the amino-acid
    sequence of functional proteins.
  • Applications
  • Cancer biomarkers
  • Genome annotation

4
Mass Spectrometry for Proteomics
  • Measure mass of many (bio)molecules
    simultaneously
  • High bandwidth
  • Mass is an intrinsic property of all
    (bio)molecules
  • No prior knowledge required

5
Mass Spectrometer
  • Electron Multiplier (EM)
  • Time-Of-Flight (TOF)
  • Quadrupole
  • Ion-Trap
  • MALDI
  • Electrospray Ionization (ESI)

6
High Bandwidth
7
Mass is fundamental!
8
Mass Spectrometry for Proteomics
  • Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
  • Mass is an intrinsic property of all
    (bio)molecules
  • ...but need a reference to compare to

9
Mass Spectrometry for Proteomics
  • Mass spectrometry has been around since the turn
    of the 20th century...
  • ...why is MS-based proteomics so new?
  • Ionization methods
  • MALDI, Electrospray
  • Protein chemistry automation
  • Chromatography, Gels, Computers
  • Protein sequence databases
  • A reference for comparison

10
Sample Preparation for Peptide Identification
11
Single Stage MS
[Figure: MS spectrum, x-axis m/z]
12
Tandem Mass Spectrometry (MS/MS)
[Figure: MS spectrum with precursor selection, x-axis m/z]
13
Tandem Mass Spectrometry (MS/MS)
Precursor selection and collision-induced dissociation (CID)
[Figure: MS and MS/MS spectra, x-axis m/z]
14
Peptide Identification
  • For each (likely) peptide sequence
  • 1. Compute fragment masses
  • 2. Compare with spectrum
  • 3. Retain those that match well
  • Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI's nr, ...
  • Automated, high-throughput peptide identification
    in complex mixtures
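
The three steps above can be sketched in Python. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, monoisotopic residue masses, and a plain count of matched peaks as the score. The function names and the 0.5 Da tolerance are hypothetical choices made for the example.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_mz(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[i:]) + WATER + PROTON)
    return b_ions + y_ions


def count_matches(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments within tol Da of an observed peak."""
    return sum(any(abs(mz - p) <= tol for p in peaks)
               for mz in fragment_mz(peptide))


def best_candidates(candidates, peaks, keep=1):
    """Step 3: retain the candidate peptides that match the spectrum best."""
    ranked = sorted(candidates, key=lambda pep: count_matches(pep, peaks),
                    reverse=True)
    return ranked[:keep]
```

Because the candidate list is the only source of sequences, a peptide that is absent from the database can never be returned, however good its spectrum is.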

15
Why don't we see more novel peptides?
  • Tandem mass spectrometry doesn't discriminate
    against novel peptides... ...but protein
    sequence databases do!
  • Searching traditional protein sequence databases
    biases the results towards well-understood
    protein isoforms!

16
What goes missing?
  • Known coding SNPs
  • Novel coding mutations
  • Alternative splicing isoforms
  • Alternative translation start-sites
  • Microexons
  • Alternative translation frames

17
Why should we care?
  • Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
  • Proteins have clinical implications
  • Biomarker discovery
  • Evidence for SNPs and alternative splicing stops
    with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • Little hard evidence for translation start site

18
Novel Splice Isoform
  • Human Jurkat leukemia cell-line
  • Lipid-raft extraction protocol, targeting T cells
  • von Haller, et al. MCP 2003.
  • LIME1 gene
  • LCK interacting transmembrane adaptor 1
  • LCK gene
  • Leukocyte-specific protein tyrosine kinase
  • Proto-oncogene
  • Chromosomal aberration involving LCK in
    leukemias.
  • Multiple significant peptide identifications

19
Novel Splice Isoform
20
Novel Splice Isoform
21
Novel Mutation
  • HUPO Plasma Proteome Project
  • Pooled samples from 10 male and 10 female healthy
    Chinese subjects
  • Plasma/EDTA sample protocol
  • Li, et al. Proteomics 2005. (Lab 29)
  • TTR gene
  • Transthyretin (pre-albumin)
  • Defects in TTR are a cause of amyloidosis.
  • Familial amyloidotic polyneuropathy
  • late-onset, dominant inheritance

22
Novel Mutation
Ala2→Pro associated with familial amyloid
polyneuropathy
23
Novel Mutation
24
Translation Start-Site
  • Human erythroleukemia K562 cell-line
  • Depth of coverage study
  • Resing et al. Anal. Chem. 2004.
  • THOC2 gene
  • Part of the heteromultimeric THO/TREX complex.
  • Initially believed to be a novel ORF
  • RefSeq mRNA in Jun 2007, no RefSeq protein
  • TrEMBL entry Feb 2005, no SwissProt entry
  • GenBank mRNA in May 2002 (complete CDS)
  • Plenty of EST support
  • 100,000 bases upstream of other isoforms

25
Translation Start-Site
26
Translation Start-Site
27
Translation Start-Site
28
Translation Start-Site
29
Expressed Sequence Tags (ESTs)
  • Cheap, fast, coding
  • Single sequencing reads of mRNA
  • Sequence from 5' or 3' end
  • No assembly

http://www.ncbi.nlm.nih.gov/About/primer/est.html
30
Searching ESTs
  • Proposed long ago
  • Yates, Eng, and McCormack. Anal. Chem. 1995.
  • Now
  • Protein sequences are sufficient for protein
    identification
  • Computationally expensive/infeasible
  • Difficult to interpret
  • Make EST searching feasible for routine searching
    to discover novel peptides.

31
Searching Expressed Sequence Tags (ESTs)
  • Pros
  • No introns!
  • Primary splicing evidence for annotation
    pipelines
  • Evidence for dbSNP
  • Often derived from clinical cancer samples
  • Cons
  • No frame
  • Large (8Gb)
  • Untrusted by annotation pipelines
  • Highly redundant
  • Nucleotide error rate ~1%

32
Compressed EST Peptide Sequence Database
  • For all ESTs mapped to a UniGene gene
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database
  • Complete, Correct for amino-acid 30-mers
  • Gene-centric peptide sequence database
  • Size < 3% of naïve enumeration, 20,774 FASTA
    entries
  • Running time ~1% of naïve enumeration search
  • E-values ~2% of naïve enumeration search results
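
The first two construction steps (six-frame translation, then dropping short ORFs) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the pipeline used to build the database: it uses the standard genetic code, treats an "ORF" as any stop-free stretch of a translated frame, and maps codons containing ambiguous bases to 'X'. The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; unknown codons (e.g. with N) become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_orfs(dna, min_len=30):
    """Yield stop-free stretches of at least min_len residues, all six frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement strand
    for strand in (dna, reverse):
        for offset in range(3):
            for orf in translate(strand[offset:]).split("*"):
                if len(orf) >= min_len:
                    yield orf
```

For example, `list(six_frame_orfs(est_sequence))` returns the candidate amino-acid sequences for one EST, where `est_sequence` is any nucleotide string. The later steps (30-mer filtering and compression) operate on the pooled output for all ESTs of a gene.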

34
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
35
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
36
Sequence Databases and CSBH-graphs
  • Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
37
Sequence Databases and CSBH-graphs
  • All k-mers represented by an edge have the same
    count

[Figure: compressed SBH-graph with edge counts 1, 2, 2, 1, 2]
38
cSBH-graphs
  • Quickly determine those that occur twice

[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2]
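
Finding the k-mers that occur at least twice can be illustrated on the toy sequences from the earlier slides. Note that this sketch uses a plain hash-table count rather than the compressed SBH-graph shown in the slides, and that k = 4 is an assumption chosen to suit the short example sequences (the database construction works with amino-acid 30-mers).

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer occurrence across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def repeated_kmers(sequences, k, min_count=2):
    """Return the set of k-mers observed at least min_count times."""
    return {kmer for kmer, n in kmer_counts(sequences, k).items()
            if n >= min_count}


sequences = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
print(sorted(repeated_kmers(sequences, 4)))
# prints ['ACDE', 'CDEF', 'DEFG', 'EFGI']
```

The graph representation reaches the same answer without enumerating each k-mer, because every k-mer on a compressed edge shares that edge's count.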
39
Correct, Complete (C2) Enumeration
  • Set of paths that use each edge at least once

ACDEFGEFGI, DEFACG
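
The property can be checked directly on this example. The sketch below assumes that "correct" means the enumeration introduces no k-mer absent from the original sequences and "complete" means it omits none of them, and it assumes k = 4 for the toy sequences; the function names are hypothetical.

```python
def kmer_set(sequences, k):
    """All distinct k-mers contained in a collection of sequences."""
    return {seq[i:i + k] for seq in sequences
            for i in range(len(seq) - k + 1)}


def is_c2(original, enumeration, k):
    """True if the enumeration has exactly the k-mers of the original set."""
    return kmer_set(original, k) == kmer_set(enumeration, k)


original = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
enumeration = ["ACDEFGEFGI", "DEFACG"]

print(is_c2(original, enumeration, 4))
print(sum(map(len, original)), "->", sum(map(len, enumeration)), "residues")
```

The two enumerated paths cover the same 4-mers in fewer residues than the three original sequences, which is the source of the compression. The same check fails for k = 5, since the paths were chosen for 4-mers.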
40
Compressed EST Database
  • Gene-centric compressed EST peptide sequence
    database
  • 20,774 sequence entries
  • 8 Gb vs 223 Mb
  • 35-fold compression
  • 22 hours becomes 15 minutes
  • E-values improve by similar factor!
  • Makes routine EST searching feasible
  • Search ESTs instead of IPI?

41
Significant False Positives
  • E-values are not enough!
  • Random guessers are easy to beat.
  • Post-translational modifications vs. amino-acid
    substitution
  • methylation (on I/L, Q, R, C, H, K, S, T, N): +14 Da
  • D → E, G → A, V → I/L, N → Q, S → T: +14 Da
  • Peptide extension: z=2 → z=3
  • Nonsense: AA masses sum to precursor
  • Need to ensure
  • fragment ions define novel sequence
  • sequence evidence is strong
  • other plausible explanations can be eliminated
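
The ambiguity between methylation and substitution follows from the residue masses. A short sketch, using standard monoisotopic masses, shows that each listed substitution shifts the peptide mass by the same amount as adding CH2; the variable and function names are hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues involved.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "T": 101.04768,
    "V": 99.06841, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "E": 129.04259,
}
METHYL_SHIFT = 12.0 + 2 * 1.007825  # CH2, about 14.01565 Da

SUBSTITUTIONS = [("D", "E"), ("G", "A"), ("V", "I"), ("V", "L"),
                 ("N", "Q"), ("S", "T")]


def substitution_shift(old, new):
    """Mass change (Da) when residue `old` is replaced by residue `new`."""
    return RESIDUE_MASS[new] - RESIDUE_MASS[old]


for old, new in SUBSTITUTIONS:
    print(f"{old} -> {new}: {substitution_shift(old, new):+.5f} Da")
```

Precursor mass alone therefore cannot separate the two explanations; only fragment ions that localize the mass shift to a specific residue can.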

42
Significant False Positives
  • DFLAGGLAAAISK 2.2x10^-8
  • 2 ESTs
  • DFLAGGIAAAISK 2.2x10^-8
  • IPI (2), RefSeq, mRNA, 1400 ESTs
  • DFLAGGVAAAISK 3.7x10^-8
  • IPI, RefSeq, mRNA, 700 ESTs
  • DFLAGGVAAAISKMAVVPI 3.5x10^-5
  • Genscan exon
  • AISFAKDFLAGGIAAAISK 3.3x10^-4
  • Genscan exon
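
One likely reason the first two peptides cannot be told apart is that leucine and isoleucine have identical masses, while the valine variant is lighter by one CH2. The sketch below computes neutral monoisotopic peptide masses to illustrate this; it is an interpretation added for illustration, not an analysis taken from the slides.

```python
# Monoisotopic residue masses (Da) for the residues in these peptides.
RESIDUE_MASS = {
    "D": 115.02694, "F": 147.06841, "L": 113.08406, "I": 113.08406,
    "V": 99.06841, "A": 71.03711, "G": 57.02146, "S": 87.03203,
    "K": 128.09496,
}
WATER = 18.01056  # mass of H2O (Da)


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


for variant in ("DFLAGGLAAAISK", "DFLAGGIAAAISK", "DFLAGGVAAAISK"):
    print(variant, round(peptide_mass(variant), 4))
```

The L and I variants give the same mass, and the same b- and y-ion masses, so the spectrum supports both equally and the choice between them has to rest on other evidence such as the number of supporting ESTs.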

43
Significant False Positives
44
Back to the lab...
  • Current LC/MS/MS workflows identify a few
    peptides per protein
  • ...not sufficient for protein isoforms
  • Need to raise the sequence coverage to (say) 80%
  • ...protein separation prior to LC/MS/MS analysis
  • Potential for database of splice sites of
    (functional) proteins!

45
Spectral Matching for Peptide Identification
  • Detection vs. identification
  • Increased sensitivity and specificity
  • No novel peptides
  • NIST GC/MS Spectral Library
  • Identifies small molecules,
  • 100,000s of (consensus) spectra
  • Bundled/Sold with many instruments
  • Dot-product spectral comparison
  • Current project: Peptide MS/MS
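
A dot-product spectral comparison can be sketched as a cosine similarity between binned spectra. This is a generic illustration and not the scoring implemented in the NIST software: the 1.0 m/z bin width, the absence of intensity or mass weighting, and the function names are all assumptions.

```python
import math
from collections import defaultdict


def binned(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins."""
    bins = defaultdict(float)
    for mz, intensity in peaks:
        bins[int(mz / bin_width)] += intensity
    return bins


def dot_product_score(spectrum_a, spectrum_b, bin_width=1.0):
    """Normalized dot product (cosine) of two binned spectra, in [0, 1]."""
    a = binned(spectrum_a, bin_width)
    b = binned(spectrum_b, bin_width)
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

Each spectrum is a list of `(m/z, intensity)` pairs. A query spectrum is scored against every library spectrum and the highest-scoring entry is reported, which is why the approach can detect known peptides but cannot identify novel ones.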

46
NIST MS Search: Peptides
47
Peptide DLATVYVDVLK
48
Protein Families
49
Protein Families
50
Peptide DLATVYVDVLK
51
Hidden Markov Models for Spectral Matching
  • Capture statistical variation and consensus in
    peak intensity
  • Capture semantics of peaks
  • Extrapolate model to other peptides
  • Good specificity with superior sensitivity for
    peptide detection
  • Assign 1000s of additional spectra
    (p-value < 10^-5)

52
Hidden Markov Model
[Figure: HMM with Delete, Insert, and Ion states]
(m/z, int) pair emitted by ion and insert states
53
Spectral Matching of Peptide Variants
DFLAGGIAAAISK
DFLAGGVAAAISK
54
Spectral Matching of Peptide Variants
AVMDDFAAFVEK
AVMDDFAAFVEK
55
HMM model extrapolation
56
Conclusions
  • Proteomics can inform genome annotation
  • Eukaryotic and prokaryotic
  • Functional vs silencing variants
  • Peptides identify more than just proteins
  • Untapped source of disease biomarkers
  • Compressed peptide sequence databases make
    routine EST searching feasible
  • Novel spectral matching technique using HMMs
    looks very promising

57
Acknowledgements
  • Catherine Fenselau, Steve Swatkoski
  • UMCP Biochemistry
  • Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science
  • Cheng Lee
  • Calibrant Biosystems
  • PeptideAtlas, HUPO PPP, X!Tandem
  • Funding: NIH/NCI, USDA/ARS