Title: Protein Sequence Databases, Peptides to Proteins, and Statistical Significance
1Protein Sequence Databases, Peptides to Proteins,
and Statistical Significance
- Nathan Edwards
- Department of Biochemistry and Mol. Cell. Biology
- Georgetown University Medical Center
2Protein Sequence Databases
- Link between mass spectra and proteins
- A protein's amino-acid sequence provides a basis for interpreting:
- Enzymatic digestion
- Separation protocols
- Fragmentation
- Peptide ion masses
- We must interpret database information as
carefully as mass spectra.
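As an illustration of how a database sequence drives the interpretation of enzymatic digestion, here is a minimal sketch of an in silico tryptic digest. It assumes the common simplified rule (cleave after K or R unless the next residue is P); the function name and the `missed_cleavages` parameter are illustrative, not part of the slides.

```python
def tryptic_digest(sequence, missed_cleavages=0):
    """Return tryptic peptides using the simplified rule:
    cleave after K or R, except when the next residue is P."""
    pieces, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal peptide

    # Join up to `missed_cleavages` adjacent pieces to model incomplete digestion.
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(pieces[i:j + 1]))
    return peptides


fully_cleaved = tryptic_digest("AKRPGKMC")
with_one_missed = tryptic_digest("AKRPGKMC", missed_cleavages=1)
```

Real enzymes deviate from this rule, so the peptide list should be read as a model of what the database sequence predicts, not as what the sample must contain.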
3More than sequence
- Protein sequence databases provide much more than sequence
- Names
- Descriptions
- Facts
- Predictions
- Links to other information sources
- Protein databases provide a link to the current
state of our understanding about a protein.
4Much more than sequence
- Names
- Accession, Name, Description
- Biological Source
- Organism, Source, Taxonomy
- Literature
- Function
- Biological process, molecular function, cellular component
- Known and predicted
- Features
- Polymorphism, Isoforms, PTMs, Domains
- Derived Data
- Molecular weight, pI
5Database types
- Curated: Swiss-Prot, UniProt, RefSeq NP
- Translated: TrEMBL, RefSeq XP, ZP
- Omnibus: NCBI's nr, MSDB, IPI
- Other: PDB, HPRD, EST, Genomic
6SwissProt
- From ExPASy
- Expert Protein Analysis System
- Swiss Institute of Bioinformatics
- 515,000 protein sequence entries
- 12,000 species represented
- 20,000 Human proteins
- Highly curated
- Minimal redundancy
- Part of UniProt Consortium
7TrEMBL
- Translated EMBL nucleotide sequences
- European Molecular Biology Laboratory
- European Bioinformatics Institute (EBI)
- Computer annotated
- Only sequences absent from SwissProt
- 10.5 M protein sequence entries
- 230,000 species
- 75,000 Human proteins
- Part of UniProt Consortium
8UniProt
- Universal Protein Resource
- Combination of sequences from
- Swiss-Prot
- TrEMBL
- Mixture of highly curated/reviewed (SwissProt) and computer annotation (TrEMBL)
- Similar sequence clusters are available
- 50%, 90%, 100% sequence similarity
9RefSeq
- Reference Sequence
- From NCBI (National Center for Biotechnology Information), NLM, NIH
- Integrated genomic, transcript, and protein sequences.
- Varying levels of curation
- Reviewed, Validated, ..., Predicted, ...
- 9.7 M protein sequence entries
- 209,000 reviewed, 90,000 validated
- 39,000 Human proteins
10RefSeq
- Particular focus on major research organisms
- Tightly integrated with genome projects.
- Curated entries: NP accessions
- Predicted entries: XP accessions
- Others: YP, ZP, AP
11IPI
- International Protein Index
- From EBI
- For a specific species, combines
- UniProt, RefSeq, Ensembl
- Species-specific databases: H-InvDB, VEGA, TAIR
- 87,000 (from 307,000) human protein sequence entries
- Human, mouse, rat, zebrafish, Arabidopsis, chicken, cow
- Slated for closure November 2010, but still going
12MSDB
- From the Imperial College (London)
- Combines
- PIR, TrEMBL, GenBank, SwissProt
- Distributed with Mascot
- so well integrated with Mascot
- 3.2M protein sequence entries
- Similar sequences suppressed
- 100% sequence similarity
- Not updated since September 2006 (obsolete)
13NCBI's nr
- non-redundant
- Contains
- GenBank CDS translations
- RefSeq Proteins
- Protein Data Bank (PDB)
- SwissProt, TrEMBL, PIR
- Others
- Similar sequences suppressed
- 100% sequence similarity
- 10.5 M protein sequence entries
14Human Sequences
- Number of Human genes is believed to be between
20,000 and 25,000
SwissProt 20,000
RefSeq 39,000
TrEMBL 75,000
IPI-HUMAN 87,000
MSDB 130,000
nr 230,000
15DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
16UCSC Genome Browser
- Shows many sources of protein sequence evidence
in a unified display
17Accessions
- Permanent labels
- Short, machine readable
- Enable precise communication
- Typos render them unusable!
- Each database uses a different format
- Swiss-Prot: P17947
- Ensembl: ENSG00000066336
- PIR: S60367; S60367
- GO: GO:0003700
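Because each database uses its own accession format and a single typo makes an accession unusable, a format check can catch mistakes early. The sketch below is illustrative only: the UniProt pattern is based on the accession format described in UniProt's documentation, while the Ensembl and GO patterns are simplified assumptions (human Ensembl gene IDs as ENSG plus 11 digits, GO terms as GO: plus 7 digits). PIR is left out because no format rule is assumed for it here.

```python
import re

# Illustrative patterns; only a format check, not proof that the entry exists.
ACCESSION_PATTERNS = {
    "UniProt": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),
    "GO term": re.compile(r"GO:[0-9]{7}"),
}


def classify_accession(text):
    """Return the name of the first pattern that matches the whole string, else None."""
    for name, pattern in ACCESSION_PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return None


labels = [classify_accession(a) for a in ("P17947", "ENSG00000066336", "GO:0003700")]
typo = classify_accession("P1794")  # one character dropped
```

A well-formed accession can still point at the wrong entry, so this check complements careful copying rather than replacing it.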
18Names / IDs
- Compact mnemonic labels
- Not guaranteed permanent
- Require careful curation
- Conceptual objects
- ALBU_HUMAN
- Serum Albumin
- RT30_HUMAN
- Mitochondrial 28S ribosomal protein S30
- CP3A7_HUMAN
- Cytochrome P450 3A7
19Description / Name
- Free text description
- Human readable
- Space limited
- Hard for computers to interpret!
- No standard nomenclature or format
- Often abused.
- COX7R_HUMAN
- Cytochrome c oxidase subunit VIIa-related
protein, mitochondrial Precursor
20FASTA Format
- > (starts the header line)
- Accession number
- No uniform format
- Multiple accessions separated by |
- One line of description
- Usually pretty cryptic
- Organism of sequence?
- No uniform format
- Official Latin name not necessarily used
- Amino-acid sequence in single-letter code
- Usually spread over multiple lines.
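The layout described above can be read with a few lines of code. This is a minimal sketch, not a full parser: it assumes the accession field is the first whitespace-delimited token of the header and that multiple accessions are separated by `|`, which, as noted, is not guaranteed across databases. The two records are made up for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may be spread over several lines
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into (list of accessions, free-text description)."""
    parts = header.split(None, 1)
    accessions = parts[0].split("|") if parts else []
    description = parts[1] if len(parts) > 1 else ""
    return accessions, description


# Hypothetical records, for illustration only.
example = [
    ">acc1|acc2 example protein one",
    "MKTAYIAK",
    "QRQISFVK",
    ">acc3 example protein two",
    "MSDNE",
]
records = list(read_fasta(example))
accessions, description = split_header(records[0][0])
```

Anything beyond the accession field (organism, gene name) has to be pulled out with database-specific rules, since the description has no uniform format.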
21FASTA Format
22Organism / Species / Taxonomy
- The protein's organism
- or the source of the biological sample
- The most reliable sequence annotation available
- Useful only to the extent that it is correct
- NCBI's taxonomy is widely used
- Provides a standard of sorts
- Hierarchical
- Other databases don't necessarily keep up
- Organism-specific sequence databases starting to become available.
23Organism / Species / Taxonomy
- Buffalo rat
- Gunn rats
- Norway rat
- Rattus PC12 clone IS
- Rattus norvegicus
- Rattus norvegicus8
- Rattus norwegicus
- Rattus rattiscus
- Rattus sp.
- Rattus sp. strain Wistar
- Sprague-Dawley rat
- Wistar rats
- brown rat
- laboratory rat
- rat
- rats
- zitter rats
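The list above shows why free-text organism fields are hard to use directly. One possible remedy is a curated synonym table that maps known spellings to a single canonical name and refuses to guess otherwise. The sketch below is an assumption-laden illustration: the table is hand-written, deliberately incomplete, and covers only entries whose meaning is unambiguous.

```python
# Hand-curated, deliberately incomplete synonym table (illustrative only).
SYNONYMS = {
    "rattus norvegicus": "Rattus norvegicus",
    "rattus norwegicus": "Rattus norvegicus",  # misspelling from the list above
    "norway rat": "Rattus norvegicus",
    "brown rat": "Rattus norvegicus",
}


def normalize_organism(name):
    """Return the canonical name for a known spelling, or None if unrecognized."""
    key = " ".join(name.lower().split())  # case-fold and collapse whitespace
    return SYNONYMS.get(key)


known = normalize_organism("  Norway   rat ")
unknown = normalize_organism("Rattus sp.")  # left unresolved on purpose
```

Returning None for ambiguous strings such as "Rattus sp." keeps the uncertainty visible instead of silently assigning a species.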
24Controlled Vocabulary
- Middle ground between computers and people
- Provides precision for concepts
- Searching, sorting, browsing
- Concept relationships
- Vocabulary / Ontology must be established
- Human curation
- Link between concept and object
- Manually curated
- Automatic / Predicted
25Gene Ontology
- Hierarchical
- Molecular function
- Biological process
- Cellular component
- Describes the vocabulary only!
- Protein families provide GO association
- Not necessarily any appropriate GO category.
- Not necessarily in all three hierarchies.
- Sometimes general categories are used because
none of the specific categories are correct.
26Gene Ontology
27Protein Families
- Similar sequence implies similar function
- Similar structure implies similar function
- Common domains imply similar function
- Bootstrap up from small sets of proteins/domains with well understood characteristics
- Usually a hybrid manual / automatic approach
28Protein Families
29Protein Families
30Sequence Variants
- Protein sequence can vary due to
- Polymorphism
- Alternative splicing
- Post-translational modification
- Sequence databases typically do not capture all versions of a protein's sequence
31Swiss-Prot Variant Annotations
32Swiss-Prot Variant Annotations
33Omnibus Database Redundancy Elimination
- Source databases often contain the same sequences with different descriptions
- Omnibus databases keep one copy of the sequence, and
- An arbitrary description, or
- All descriptions, or
- Particular description, based on source preference
- Good definitions can be lost, including taxonomy
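The three description policies can be sketched as follows. This is a minimal illustration, not how any particular omnibus database is built: the record layout `(source, description, sequence)`, the policy names, and the placeholder sequence are all assumptions made for the example.

```python
def merge_redundant(records, policy="all", preference=()):
    """Collapse records with identical sequences into one entry each.

    records: iterable of (source, description, sequence) tuples.
    policy:  "first" (arbitrary: whichever was seen first), "all",
             or "preferred" (rank sources by `preference`).
    """
    groups = {}
    for source, description, sequence in records:
        groups.setdefault(sequence, []).append((source, description))

    rank = {source: i for i, source in enumerate(preference)}
    merged = []
    for sequence, entries in groups.items():
        if policy == "all":
            kept = [description for _, description in entries]
        elif policy == "first":
            kept = [entries[0][1]]
        elif policy == "preferred":
            best = min(entries, key=lambda e: rank.get(e[0], len(rank)))
            kept = [best[1]]
        else:
            raise ValueError("unknown policy: " + policy)
        merged.append((sequence, kept))
    return merged


# Placeholder sequence; descriptions differ by source.
records = [
    ("emb", "hypothetical protein", "MPLACEHOLDER"),
    ("ref", "COMM domain containing 4", "MPLACEHOLDER"),
    ("sp", "COMM domain containing protein 4", "MPLACEHOLDER"),
]
arbitrary = merge_redundant(records, policy="first")
preferred = merge_redundant(records, policy="preferred", preference=("sp", "ref"))
```

With the "first" policy the surviving description here is "hypothetical protein", which shows how a good definition can be lost.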
34Description Elimination
- gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
- gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
- gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
- gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
- gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
- gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]
35Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
36Peptides to Proteins
37Peptides to Proteins
- A peptide sequence may occur in many different protein sequences
- Variants, paralogues, protein families
- Separation, digestion and ionization are not well understood
- Proteins in sequence databases are extremely non-random, and very dependent
38Indistinguishable Protein Sequences
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
39Indistinguishable Protein Sequences
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
40Protein Families
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
41Protein Grouping Scenarios
- Parsimony
- Minimum number of proteins
- Weighted
- Choose proteins with the most confident peptides (ProteinProphet)
- Show all
- Mark repeated peptides
- Often no (ideal) resolution is possible!
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
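The parsimony scenario can be illustrated with a greedy set-cover heuristic: repeatedly pick the protein that explains the most still-unexplained peptides. This is a sketch under stated assumptions, not the method of any specific tool; greedy selection is not guaranteed to find the true minimum, and ties are broken alphabetically here purely for reproducibility. The protein and peptide names are hypothetical.

```python
def parsimonious_proteins(protein_to_peptides):
    """Greedy set cover over a dict mapping protein name -> set of observed peptides."""
    uncovered = set().union(*protein_to_peptides.values())
    chosen = []
    while uncovered:
        # max() keeps the first best candidate, so sorting makes ties alphabetical.
        best = max(
            sorted(protein_to_peptides),
            key=lambda name: len(protein_to_peptides[name] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_to_peptides[best]
    return chosen


# Hypothetical example: C and D both explain p4 equally well.
evidence = {
    "A": {"p1", "p2", "p3"},
    "B": {"p2", "p3"},
    "C": {"p4"},
    "D": {"p3", "p4"},
}
selected = parsimonious_proteins(evidence)
```

In this example C is chosen over D only because of the tie-break, which is one concrete case where no ideal resolution exists.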
42High Quality Peptide Identification: E-value < 10^-8
43Moderate quality peptide identification: E-value < 10^-3
44Peptide Identification
- Peptide fragmentation by CID is poorly understood
- MS/MS spectra represent incomplete information about amino-acid sequence
- I/L, K/Q, GG/N, ...
- Correct identifications don't come with a certificate!
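The I/L, K/Q and GG/N ambiguities follow directly from residue masses. The sketch below compares them using standard monoisotopic residue masses rounded to five decimals; only the residues needed for these three cases are included.

```python
# Monoisotopic residue masses in Da, rounded to 5 decimals.
MONOISOTOPIC = {
    "G": 57.02146,
    "I": 113.08406,
    "L": 113.08406,
    "N": 114.04293,
    "K": 128.09496,
    "Q": 128.05858,
}


def residue_mass(residues):
    """Sum of residue masses for a short string of amino acids."""
    return sum(MONOISOTOPIC[aa] for aa in residues)


def mass_difference(a, b):
    """Absolute mass difference in Da between two residue strings."""
    return abs(residue_mass(a) - residue_mass(b))


differences = {
    pair: mass_difference(*pair)
    for pair in (("I", "L"), ("K", "Q"), ("GG", "N"))
}
```

I and L are isomers, and GG and N have the same elemental composition, so mass alone cannot separate them; K and Q differ by about 0.036 Da, which only sufficiently accurate instruments can resolve.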
45Peptide Identification
- High-throughput workflows demand we analyze all spectra, all the time.
- Spectra may not contain enough information to be interpreted correctly
- "bad static on a cell phone"
- Peptides may not match our assumptions
- "it's all Greek to me"
- "Don't know" is an acceptable answer!
46What scores do wrong peptides get?
- Generate random peptide sequences
- Real looking fragment masses
- Empirical distribution
- Require similar precursor mass
- Arbitrary score function can model anything we
like!
47Random Peptide Scores
Fenyo & Beavis, Anal. Chem., 2003
48Random Peptide Scores
Fenyo & Beavis, Anal. Chem., 2003
49Random Peptide Scores
- Truly random peptides don't look much like real peptides
- Just use peptides from the sequence database!
- Assumptions
- IID sampling of score values per spectrum
- Caveats
- Correct peptide (non-random) may be included
- Peptides are not independent
50Extrapolating from the Empirical Distribution
- Often, the empirical shape is consistent with a
theoretical model
Fenyo & Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
51E-values vs p-values
- Need to adjust for the size of the sequence database
- Best false/random score goes up with number of trials
- E-value makes this adjustment
- Expected number of incorrect peptides (with this score) from this sequence database.
- E-value = Trials × p-value (to 1st approx.)
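The first-order relation between E-value, trials and p-value can be made concrete with a small sketch. It assumes higher scores are better and estimates the p-value directly from an empirical list of wrong-peptide scores, without fitting a theoretical model; the scores and trial counts are made up for illustration.

```python
def empirical_p_value(score, wrong_scores):
    """Fraction of wrong-peptide scores that are at least as good as `score`."""
    at_least_as_good = sum(1 for s in wrong_scores if s >= score)
    return at_least_as_good / len(wrong_scores)


def e_value(p_value, trials):
    """Expected number of wrong peptides reaching this score among `trials` candidates."""
    return trials * p_value


# Hypothetical scores of wrong peptides against a single spectrum.
wrong = [12, 15, 9, 30, 22, 18, 11, 14, 25, 16]
p = empirical_p_value(25, wrong)           # 2 of 10 scores are >= 25
e_small_db = e_value(p, trials=100)        # few candidate peptides
e_large_db = e_value(p, trials=10_000)     # many candidate peptides
```

The same score is less convincing against the larger database because more candidates get a chance to reach it at random.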
52False Discovery Rate
- Which peptide IDs to accept?
- E-value only provides a per-spectrum statistic
- With enough spectra, even these can be misleading!
- Decide which spectra (w/ scores) will be accepted
- SEQUEST Xcorr, E-value, Score, etc., plus...
- Threshold on identification criteria
- Control the proportion of incorrect identifications in the result for the entire dataset
53Distribution of scores over all spectra
Brian Searle, Proteome Software
54Distribution of scores over all spectra
[Figure: distribution of scores, with the False and True components labeled]
Brian Searle, Proteome Software
55False Discovery Rate
- $\mathrm{FDR}(\text{score} \ge x) = \dfrac{\#\,\text{false ids with score} \ge x}{\#\,\text{all ids with score} \ge x}$
- Need to estimate numerator!
- Assumes the false (and true) scores, sampled over spectra, are IID
- Not true for some peptide-spectrum scores
- (Mostly) true for E-values
- Can compute the # of false ids using a decoy search
56Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
57Decoy searches
- Shuffle or reverse sequence database
- Same size as original
- Known false identifications
- Estimate False distribution
- Alternatively, merge target + decoy results
- Competition between target and decoy scores
- Assume false target and false decoys each win half the time
- $\mathrm{FDR}(\text{score} \ge x) = \dfrac{2 \times \#\,\text{decoy ids with score} \ge x}{\#\,\text{target ids with score} \ge x}$
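A decoy database and the merged-search estimate can be sketched in a few lines. This is an illustration under stated assumptions: higher scores are better, decoys are built by reversing each sequence, the `DECOY_` prefix is a hypothetical naming choice, and the FDR function follows the expression as written on this slide (other target-decoy conventions use a different denominator or omit the factor of 2). All sequences and scores are made up.

```python
def reverse_decoys(proteins):
    """Build a decoy database of the same size by reversing each sequence.

    proteins: dict mapping accession -> sequence.
    The "DECOY_" prefix is an illustrative convention, not a standard.
    """
    return {"DECOY_" + acc: seq[::-1] for acc, seq in proteins.items()}


def decoy_fdr(target_scores, decoy_scores, x):
    """Estimate FDR at threshold x as 2 * #decoy(score >= x) / #target(score >= x)."""
    targets = sum(1 for s in target_scores if s >= x)
    decoys = sum(1 for s in decoy_scores if s >= x)
    if targets == 0:
        return 0.0  # nothing accepted, so nothing to be wrong about
    return 2 * decoys / targets


decoy_db = reverse_decoys({"acc1": "MKTAYIAK", "acc2": "MSDNE"})

# Hypothetical best-hit scores from a merged target + decoy search.
target_scores = [50, 45, 40, 35, 30, 25, 20]
decoy_scores = [28, 22, 10]
fdr_loose = decoy_fdr(target_scores, decoy_scores, x=20)   # 2 * 2 / 7
fdr_strict = decoy_fdr(target_scores, decoy_scores, x=30)  # no decoys pass
```

Raising the threshold trades accepted identifications for a lower estimated error rate, which is the choice the FDR makes explicit.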
58Summary
- Protein sequence databases have varying characteristics, choose wisely!
- Inferring proteins from peptides can be (very) tricky!
- Statistical significance can help control the proportion of errors in the (peptide-level) results.