Novel Peptide Identification using ESTs and Genomic Sequence - PowerPoint PPT Presentation

1
Novel Peptide Identification using ESTs and
Genomic Sequence
  • Nathan Edwards
  • Center for Bioinformatics and Computational
    Biology
  • University of Maryland, College Park

2
Mass Spectrometry for Proteomics
  • Measure mass of many (bio)molecules
    simultaneously
  • High bandwidth
  • Mass is an intrinsic property of all
    (bio)molecules
  • No prior knowledge required

3
Mass Spectrometry for Proteomics
  • Measure mass of many molecules simultaneously
  • ...but not too many, abundance bias
  • Mass is an intrinsic property of all
    (bio)molecules
  • ...but need a reference to compare to

4
Mass Spectrometry for Proteomics
  • Mass spectrometry has been around since the turn
    of the century...
  • ...why is MS Proteomics so new?
  • Ionization methods
  • MALDI, Electrospray
  • Protein chemistry automation
  • Chromatography, Gels, Computers
  • Protein sequence databases
  • A reference for comparison

5
Microorganism Identification by MALDI Mass
Spectrometry
  • Direct observation of microorganism biomarkers in
    the field.
  • Peaks represent masses of abundant proteins.
  • Statistical models assess identification
    significance.

[Figure: B. anthracis, MALDI Mass Spectrometry]
6
Key Principles
  • Protein mass from protein sequence
  • No introns, few PTMs
  • Specificity of single mass is very weak
  • Statistical significance from many peaks
  • Not all proteins are equally likely to be
    observed
  • Ribosomal proteins, SASPs
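
The point that one mass match is weak evidence while many matches are significant can be illustrated with a small calculation. This is a minimal sketch, assuming each observed peak matches a candidate organism's protein masses by chance with the same independent probability (a simple binomial model). The slides do not say which statistical model is actually used, and the function name and numbers below are illustrative only.

```python
from math import comb

def match_tail_probability(n_peaks, n_matched, p_single):
    """P(at least n_matched of n_peaks match by chance), binomial model.

    p_single is the assumed chance that any one peak matches some
    protein mass of the candidate organism within tolerance.
    """
    return sum(
        comb(n_peaks, j) * p_single**j * (1 - p_single) ** (n_peaks - j)
        for j in range(n_matched, n_peaks + 1)
    )

# Hypothetical numbers: 20 peaks, 10% chance of a random match per peak.
for k in (1, 5, 10):
    print(k, match_tail_probability(20, k, 0.1))
```

Under this model the chance of at least one random match is large, while the chance of ten random matches is very small, which is the sense in which significance comes from many peaks.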

7
Rapid Microorganism Identification Database
(www.RMIDb.org)
  • Protein Sequences: 5.3M (1.9M)
  • Species: 15K
  • GenBank, RefSeq, CMR, Swiss-Prot, TrEMBL

8
Rapid Microorganism Identification Database
(www.RMIDb.org)
9
Informatics Issues
  • Need good species / strain annotation
  • B. anthracis vs B. thuringiensis
  • Need correct protein sequence
  • B. anthracis Sterne α/β SASP
  • RefSeq/Gb: MVMARN... (7442 Da)
  • CMR: MARN... (7211 Da)
  • Need chemistry based protein classification
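
The effect of the annotated start site on the expected mass can be checked with a short calculation. This is a sketch under stated assumptions: it uses standard average residue masses, looks only at the N-terminal prefixes shown on the slide (the rest of the sequence is elided and is not reconstructed here), and ignores any modification such as N-terminal methionine loss.

```python
# Average residue masses (Da) of the 20 standard amino acids.
AVG = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}

def residue_mass_sum(residues):
    """Sum of average residue masses for a string of amino acids."""
    return sum(AVG[a] for a in residues)

refseq_prefix = "MVMARN"  # start of the RefSeq/Gb sequence, from the slide
cmr_prefix = "MARN"       # start of the CMR sequence, from the slide

# Residues present only in the longer annotation.
assert refseq_prefix.endswith(cmr_prefix)
extra = refseq_prefix[: len(refseq_prefix) - len(cmr_prefix)]

predicted_shift = residue_mass_sum(extra)
reported_shift = 7442 - 7211  # masses quoted on the slide

print(extra, round(predicted_shift, 2), reported_shift)
```

The two extra N-terminal residues account for a shift of about 230 Da, close to the difference between the two masses quoted on the slide.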

10
Sample Preparation for Peptide Identification
11
Single Stage MS
[Figure: MS spectrum; axis m/z]
12
Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection from the MS spectrum; axes m/z]
13
Tandem Mass Spectrometry (MS/MS)
[Figure: precursor selection and collision-induced dissociation (CID) produce the MS/MS spectrum; axes m/z]
14
Peptide Identification
  • For each (likely) peptide sequence
  • 1. Compute fragment masses
  • 2. Compare with spectrum
  • 3. Retain those that match well
  • Peptide sequences from protein sequence databases
  • Swiss-Prot, IPI, NCBI's nr, ...
  • Automated, high-throughput peptide identification
    in complex mixtures
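
The three steps above can be sketched in a few lines. This is a minimal illustration, not the scoring used by any particular search engine: it assumes singly charged b- and y-ions only, a fixed mass tolerance, and a score that simply counts matched fragments. The spectrum is synthetic (generated from one of the candidates), and all function names are hypothetical.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728

def fragment_masses(peptide):
    """Step 1: singly charged b- and y-ion m/z values for a peptide."""
    masses = [MONO[a] for a in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return b, y

def score(peptide, peaks, tol=0.5):
    """Step 2: count theoretical fragments with a peak within tol."""
    b, y = fragment_masses(peptide)
    return sum(any(abs(f - p) <= tol for p in peaks) for f in b + y)

def identify(candidates, peaks, tol=0.5):
    """Step 3: retain the candidate that matches best."""
    return max(candidates, key=lambda pep: score(pep, peaks, tol))

# Synthetic spectrum built from the fragments of one candidate.
b_ions, y_ions = fragment_masses("PEPTIDE")
spectrum = sorted(b_ions + y_ions)
candidates = ["PEPTIDE", "PEPSIDE", "PEPTIDK"]
best = identify(candidates, spectrum)
print(best, score(best, spectrum))
```

Because the candidates come only from the sequence database, a peptide that is absent from the database can never be returned, whatever the spectrum contains.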

15
Why don't we see more novel peptides?
  • Tandem mass spectrometry doesn't discriminate
    against novel peptides... but protein
    sequence databases do!
  • Searching traditional protein sequence databases
    biases the results towards well-understood
    protein isoforms!

16
What goes missing?
  • Known coding SNPs
  • Novel coding mutations
  • Alternative splicing isoforms
  • Alternative translation start-sites
  • Microexons
  • Alternative translation frames

17
Why should we care?
  • Alternative splicing is the norm!
  • Only 20-25K human genes
  • Each gene makes many proteins
  • Proteins have clinical implications
  • Biomarker discovery
  • Evidence for SNPs and alternative splicing stops
    with transcription
  • Genomic assays, ESTs, mRNA sequence.
  • Little hard evidence for translation start site

18
Novel Splice Isoform
19
Novel Splice Isoform
20
Novel Frame
21
Novel Frame
22
Novel Mutation
Ala2→Pro, associated with familial amyloid
polyneuropathy
23
Novel Mutation
24
Searching ESTs
  • Proposed long ago
  • Yates, Eng, and McCormack, Anal Chem, '95.
  • Now
  • Protein sequences are sufficient for protein
    identification
  • Computationally expensive/infeasible
  • Difficult to interpret
  • Make EST searching feasible for routine searching
    to discover novel peptides.

25
Searching Expressed Sequence Tags (ESTs)
  • Pros
  • No introns!
  • Primary splicing evidence for annotation
    pipelines
  • Evidence for dbSNP
  • Often derived from clinical cancer samples
  • Cons
  • No frame
  • Large (8Gb)
  • Untrusted by annotation pipelines
  • Highly redundant
  • Nucleotide error rate ~1%

26
Compressed EST Peptide Sequence Database
  • For all ESTs mapped to a UniGene gene
  • Six-frame translation
  • Eliminate ORFs < 30 amino-acids
  • Eliminate amino-acid 30-mers observed once
  • Compress to C2 FASTA database
  • Complete, Correct for amino-acid 30-mers
  • Gene-centric peptide sequence database
  • Size < 3% of naïve enumeration, 20774 FASTA
    entries
  • Running time ~1% of naïve enumeration search
  • E-values ~2% of naïve enumeration search results
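
The first filtering steps on this slide can be sketched as follows. This is a minimal illustration under several assumptions that the slides do not specify: the standard genetic code is used, an "ORF" is taken to be any stop-to-stop stretch of a translated frame, a k-mer counts as "observed" once per EST it appears in, and the final compression into a FASTA database is not shown. The function names are hypothetical, and the toy run uses k = 3 only to keep the example small (the slide uses 30).

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def six_frames(dna):
    """Yield the six translations (3 forward, 3 reverse-complement)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    for strand in (dna, reverse):
        for offset in range(3):
            yield "".join(
                CODON.get(strand[i:i + 3], "X")  # X for ambiguous codons
                for i in range(offset, len(strand) - 2, 3)
            )

def orfs(protein, min_len):
    """Stop-to-stop segments of at least min_len amino acids."""
    return [seg for seg in protein.split("*") if len(seg) >= min_len]

def repeated_kmers(ests, k):
    """Amino-acid k-mers seen in at least two ESTs."""
    counts = Counter()
    for est in ests:
        seen = set()
        for frame in six_frames(est):
            for orf in orfs(frame, k):
                seen.update(orf[i:i + k] for i in range(len(orf) - k + 1))
        counts.update(seen)
    return {kmer for kmer, n in counts.items() if n >= 2}

# Toy ESTs that overlap; k = 3 only for illustration.
toy_ests = ["ATGGCTTGTGAT", "GCTTGTGATGAA"]
shared = repeated_kmers(toy_ests, 3)
print(sorted(shared))
```

Dropping k-mers seen only once is what removes most sequencing-error artifacts, since an error is unlikely to be reproduced in a second EST.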

28
SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
29
Compressed SBH-graph
ACDEFGI, ACDEFACG, DEFGEFGI
30
Sequence Databases & CSBH-graphs
  • Original sequences correspond to paths

ACDEFGI, ACDEFACG, DEFGEFGI
31
Sequence Databases & CSBH-graphs
  • All k-mers represented by an edge have the same
    count

[Figure: CSBH-graph with edges labeled by k-mer count: 1, 2, 2, 1, 2]
32
cSBH-graphs
  • Quickly determine those that occur twice
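
The running example (ACDEFGI, ACDEFACG, DEFGEFGI) can be reproduced with a short sketch. This is an illustration under assumptions the slides do not state: it takes k = 4 (4-mers as edges between 3-mer nodes), counts every occurrence of a k-mer, and compresses by merging edges along non-branching paths. The function names are hypothetical and this is not presented as the author's implementation.

```python
from collections import Counter, defaultdict

def kmer_counts(seqs, k):
    """Count every k-mer occurrence across all sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts

def compress(kmers):
    """Merge k-mer edges along non-branching paths of the SBH-graph.

    Nodes are (k-1)-mers; each k-mer is an edge from its prefix to its
    suffix. Isolated cycles are not handled in this sketch.
    """
    out_edges = defaultdict(list)
    in_degree = Counter()
    for kmer in kmers:
        out_edges[kmer[:-1]].append(kmer)
        in_degree[kmer[1:]] += 1
    nodes = set(out_edges) | set(in_degree)

    def is_internal(node):
        return in_degree[node] == 1 and len(out_edges[node]) == 1

    paths = []
    for node in nodes:
        if is_internal(node):
            continue  # paths start only at branching or terminal nodes
        for kmer in out_edges[node]:
            path, nxt = kmer, kmer[1:]
            while is_internal(nxt):
                edge = out_edges[nxt][0]
                path += edge[-1]
                nxt = edge[1:]
            paths.append(path)
    return sorted(paths)

seqs = ["ACDEFGI", "ACDEFACG", "DEFGEFGI"]
counts = kmer_counts(seqs, 4)
repeated = {kmer for kmer, n in counts.items() if n >= 2}
print(sorted(repeated), compress(repeated))
```

With k = 4 the repeated k-mers chain into the single sequence ACDEFGI, the same string that labels the final compressed graph in the slides.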

[Figure: CSBH-graph edges labeled by k-mer count: 2, 2, 1, 2]
33
Compressed-SBH-graph
[Figure: compressed SBH-graph with edge counts 2, 2, 1, 2; sequence ACDEFGI]
34
Compressed EST Database
  • Gene-centric compressed EST peptide sequence
    database
  • 20,774 sequence entries
  • 8 Gb vs 223 Mb
  • 35-fold compression
  • 22 hours becomes 15 minutes
  • E-values improve by similar factor!
  • Makes routine EST searching feasible
  • Search ESTs instead of IPI?
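
The figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes 1 Gb = 1000 Mb, and it assumes the common model in which an E-value is a per-candidate p-value multiplied by the number of candidates searched. The slides do not state how the E-values are computed, so that model and the function name are illustrative.

```python
# Figures quoted on the slide.
raw_mb = 8 * 1000          # 8 Gb, assuming 1 Gb = 1000 Mb
compressed_mb = 223
raw_minutes = 22 * 60      # 22 hours
compressed_minutes = 15

size_fold = raw_mb / compressed_mb
time_fold = raw_minutes / compressed_minutes

def e_value(p_value, n_candidates):
    """Assumed model: expected number of chance hits at least this good."""
    return p_value * n_candidates

# Hypothetical candidate counts: shrinking the search space by some
# factor shrinks the E-value of the same match by that factor.
before = e_value(1e-6, 3_500_000)
after = e_value(1e-6, 100_000)

print(round(size_fold, 1), time_fold, before / after)
```

Under that model, a smaller database improves E-values simply because the same match competes against fewer candidates.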

35
Back to the lab...
  • Current LC/MS/MS workflows identify a few
    peptides per protein
  • ...not sufficient for protein isoforms
  • Need to raise the sequence coverage to (say) 80%
  • ...protein separation prior to LC/MS/MS analysis
  • Potential for database of splice sites of
    (functional) proteins!

36
Conclusions
  • Good informatics gets the most out of proteomics
    data
  • Proteomics may be useful for genome annotation
  • Peptides identify more than just proteins
  • Compressed peptide sequence databases make
    routine EST searching feasible

37
Acknowledgements
  • Chau-Wen Tseng, Xue Wu
  • UMCP Computer Science
  • Catherine Fenselau
  • UMCP Biochemistry
  • Calibrant Biosystems
  • PeptideAtlas, HUPO PPP, X!Tandem
  • Funding: National Cancer Institute