Protein Sequence Databases, Peptides to Proteins, and Statistical Significance

Title: Proteomics Technology and Protein Identification Author: Nathan Last modified by: Nathan Created Date: 12/6/2004 12:44:14 AM Document presentation format – PowerPoint PPT presentation

Slides: 59

Provided by: nathan

Learn more at: http://edwardslab.bmcb.georgetown.edu

1
Protein Sequence Databases, Peptides to Proteins,
and Statistical Significance

Nathan Edwards
Department of Biochemistry and Molecular & Cellular Biology
Georgetown University Medical Center

2
Protein Sequence Databases

Link between mass spectra and proteins
A protein's amino-acid sequence provides a basis
for interpreting:
Enzymatic digestion
Separation protocols
Fragmentation
Peptide ion masses
We must interpret database information as
carefully as mass spectra.
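
As a rough illustration of how a database sequence predicts digestion products and peptide ion masses, the sketch below performs an in silico tryptic digest and computes monoisotopic m/z values. The cleavage rule (after K or R, but not before P), the absence of missed cleavages and modifications, and the function names are simplifying assumptions made for this example; they are not taken from the slides.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # one water is added per peptide (N- and C-terminus)
PROTON = 1.00728   # mass of the charge carrier


def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if residue in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def ion_mz(peptide, charge):
    """m/z of the peptide carrying `charge` protons."""
    return (peptide_mass(peptide) + charge * PROTON) / charge


if __name__ == "__main__":
    for pep in tryptic_digest("AKRPGRM"):   # toy sequence, not a real protein
        print(pep, round(ion_mz(pep, 1), 4), round(ion_mz(pep, 2), 4))
```

Every number this produces depends on the sequence in the database, which is why an error in the database entry propagates directly into the interpretation of the spectrum.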

3
More than sequence

Protein sequence databases provide much more than
sequence
Names
Descriptions
Facts
Predictions
Links to other information sources
Protein databases provide a link to the current
state of our understanding about a protein.

4
Much more than sequence

Names
Accession, Name, Description
Biological Source
Organism, Source, Taxonomy
Literature
Function
Biological process, molecular function, cellular
component
Known and predicted
Features
Polymorphism, Isoforms, PTMs, Domains
Derived Data
Molecular weight, pI

5
Database types
Curated: Swiss-Prot, UniProt, RefSeq NP
Translated: TrEMBL, RefSeq XP, ZP
Omnibus: NCBI's nr, MSDB, IPI
Other: PDB, HPRD, EST, Genomic
6
SwissProt

From ExPASy
Expert Protein Analysis System
Swiss Institute of Bioinformatics
515,000 protein sequence entries
12,000 species represented
20,000 Human proteins
Highly curated
Minimal redundancy
Part of UniProt Consortium

7
TrEMBL

Translated EMBL nucleotide sequences
European Molecular Biology Laboratory
European Bioinformatics Institute (EBI)
Computer annotated
Only sequences absent from SwissProt
10.5 M protein sequence entries
230,000 species
75,000 Human proteins
Part of UniProt Consortium

8
UniProt

Universal Protein Resource
Combination of sequences from
Swiss-Prot
TrEMBL
Mixture of highly curated/reviewed (SwissProt)
and computer annotation (TrEMBL)
Similar sequence clusters are available
50%, 90%, 100% sequence similarity

9
RefSeq

Reference Sequence
From NCBI (National Center for Biotechnology
Information), NLM, NIH
Integrated genomic, transcript, and protein
sequences.
Varying levels of curation
Reviewed, Validated, ..., Predicted, ...
9.7 M protein sequence entries
209,000 reviewed, 90,000 validated
39,000 Human proteins

10
RefSeq

Particular focus on major research organisms
Tightly integrated with genome projects.
Curated entries: NP accessions
Predicted entries: XP accessions
Others: YP, ZP, AP

11
IPI

International Protein Index
From EBI
For a specific species, combines:
UniProt, RefSeq, Ensembl
Species-specific databases: HInv-DB, VEGA, TAIR
87,000 (from 307,000) human protein sequence
entries
Human, mouse, rat, zebrafish, Arabidopsis,
chicken, cow
Slated for closure November 2010, but still
going

12
MSDB

From the Imperial College (London)
Combines
PIR, TrEMBL, GenBank, SwissProt
Distributed with Mascot
so well integrated with Mascot
3.2M protein sequence entries
Similar sequences suppressed
100% sequence similarity
Not updated since September 2006 (obsolete)

13
NCBI's nr

non-redundant
Contains
GenBank CDS translations
RefSeq Proteins
Protein Data Bank (PDB)
SwissProt, TrEMBL, PIR
Others
Similar sequences suppressed
100% sequence similarity
10.5 M protein sequence entries

14
Human Sequences

Number of Human genes is believed to be between
20,000 and 25,000

SwissProt 20,000
RefSeq 39,000
TrEMBL 75,000
IPI-HUMAN 87,000
MSDB 130,000
nr 230,000
15
DNA to Protein Sequence
Derived from http://online.itp.ucsb.edu/online/infobio01/burge
16
UCSC Genome Browser

Shows many sources of protein sequence evidence
in a unified display

17
Accessions

Permanent labels
Short, machine readable
Enable precise communication
Typos render them unusable!
Each database uses a different format
Swiss-Prot: P17947
Ensembl: ENSG00000066336
PIR: S60367
GO: GO:0003700
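
Because a single typo makes an accession unusable, a cheap format check before any lookup can catch many errors. The sketch below tests a string against a few patterns. The Swiss-Prot pattern follows the published UniProt accession format; the Ensembl gene and GO patterns are assumptions inferred from the examples above, and `classify_accession` is a hypothetical helper name. A format match does not prove the accession exists.

```python
import re

# Patterns keyed by database. Only the format is checked, not existence.
ACCESSION_PATTERNS = {
    "Swiss-Prot": re.compile(
        r"[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2}"
    ),
    "Ensembl gene": re.compile(r"ENSG[0-9]{11}"),  # assumed: ENSG + 11 digits
    "GO": re.compile(r"GO:[0-9]{7}"),              # assumed: GO: + 7 digits
}


def classify_accession(text):
    """Return the names of all databases whose format `text` fully matches."""
    return [
        name
        for name, pattern in ACCESSION_PATTERNS.items()
        if pattern.fullmatch(text)
    ]


if __name__ == "__main__":
    for candidate in ["P17947", "ENSG00000066336", "GO:0003700", "GO0003700"]:
        print(candidate, classify_accession(candidate) or "no known format")
```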

18
Names / IDs

Compact mnemonic labels
Not guaranteed permanent
Require careful curation
Conceptual objects
ALBU_HUMAN: Serum Albumin
RT30_HUMAN: Mitochondrial 28S ribosomal protein S30
CP3A7_HUMAN: Cytochrome P450 3A7

19
Description / Name

Free text description
Human readable
Space limited
Hard for computers to interpret!
No standard nomenclature or format
Often abused.
COX7R_HUMAN
Cytochrome c oxidase subunit VIIa-related
protein, mitochondrial Precursor

20
FASTA Format

>
Accession number
No uniform format
Multiple accessions separated by |
One line of description
Usually pretty cryptic
Organism of sequence?
No uniform format
Official Latin name not necessarily used
Amino-acid sequence in single-letter code
Usually spread over multiple lines.
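
A minimal sketch of reading this format is shown below. It assumes only what the slide states: a header line starting with ">", an accession field that may contain several identifiers separated by "|", a free-text description after the first space, and sequence spread over multiple lines. The function names and the toy records are invented for illustration; real headers vary by database, so the split is a best guess.

```python
def parse_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)   # sequence may span many lines
    if header is not None:
        yield header, "".join(chunks)


def split_header(header):
    """Split a header into its '|'-separated identifier fields and description."""
    identifiers, _, description = header.partition(" ")
    return identifiers.split("|"), description


if __name__ == "__main__":
    toy = [">db|ACC1|NAME1 toy protein", "MKV", "LAA", ">ACC2", "GG"]
    for header, sequence in parse_fasta(toy):
        print(split_header(header), sequence)
```

Note that nothing in the header reliably identifies the organism; any attempt to extract it has to be written per database.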

21
FASTA Format
22
Organism / Species / Taxonomy

The protein's organism
or the source of the biological sample
The most reliable sequence annotation available
Useful only to the extent that it is correct
NCBI's taxonomy is widely used
Provides a standard of sorts; hierarchical
Other databases don't necessarily keep up
Organism-specific sequence databases are starting to
become available.

23
Organism / Species / Taxonomy

Buffalo rat
Gunn rats
Norway rat
Rattus PC12 clone IS
Rattus norvegicus
Rattus norvegicus8
Rattus norwegicus
Rattus rattiscus
Rattus sp.

Rattus sp. strain Wistar
Sprague-Dawley rat
Wistar rats
brown rat
laboratory rat
rat
rats
zitter rats

24
Controlled Vocabulary

Middle ground between computers and people
Provides precision for concepts
Searching, sorting, browsing
Concept relationships
Vocabulary / Ontology must be established
Human curation
Link between concept and object
Manually curated
Automatic / Predicted

25
Gene Ontology

Hierarchical
Molecular function
Biological process
Cellular component
Describes the vocabulary only!
Protein families provide GO association
Not necessarily any appropriate GO category.
Not necessarily in all three hierarchies.
Sometimes general categories are used because
none of the specific categories are correct.

26
Gene Ontology
27
Protein Families

Similar sequence implies similar function
Similar structure implies similar function
Common domains imply similar function
Bootstrap up from small sets of proteins/domains
with well understood characteristics
Usually a hybrid manual / automatic approach

28
Protein Families
29
Protein Families
30
Sequence Variants

Protein sequence can vary due to
Polymorphism
Alternative splicing
Post-translational modification
Sequence databases typically do not capture all
versions of a protein's sequence

31
Swiss-Prot Variant Annotations
32
Swiss-Prot Variant Annotations
33
Omnibus Database Redundancy Elimination

Source databases often contain the same sequences
with different descriptions
Omnibus databases keep one copy of the sequence,
and
An arbitrary description, or
All descriptions, or
Particular description, based on source
preference
Good definitions can be lost, including taxonomy
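
The three description policies above can be sketched as follows. This is an illustrative toy, not the procedure used by any particular omnibus database: records are assumed to be (source, description, sequence) tuples, redundancy means exact (100%) sequence identity, and `collapse_redundant` and `source_preference` are hypothetical names.

```python
def collapse_redundant(records, source_preference=None):
    """Keep one copy of each distinct sequence.

    records: iterable of (source, description, sequence) tuples.
    source_preference: None keeps all descriptions; otherwise a list of
    source names, most preferred first, and only the description from the
    best-ranked source is kept (ties go to the first record seen).
    Returns {sequence: [descriptions kept]}.
    """
    groups = {}
    for source, description, sequence in records:
        groups.setdefault(sequence, []).append((source, description))

    collapsed = {}
    for sequence, entries in groups.items():
        if source_preference is None:
            collapsed[sequence] = [description for _, description in entries]
        else:
            rank = {name: i for i, name in enumerate(source_preference)}
            best = min(entries, key=lambda e: rank.get(e[0], len(rank)))
            collapsed[sequence] = [best[1]]
    return collapsed


if __name__ == "__main__":
    toy = [  # toy sequences; descriptions echo the COMMD4 example
        ("emb", "hypothetical protein", "MKV"),
        ("sp", "COMM domain containing protein 4", "MKV"),
        ("gb", "other protein", "GGA"),
    ]
    print(collapse_redundant(toy))                  # all descriptions
    print(collapse_redundant(toy, ["sp", "ref"]))   # preferred source only
```

Keeping an arbitrary description corresponds to taking the first entry of each group, which is how "hypothetical protein" can end up as the only label for a well-characterized sequence.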

34
Description Elimination

gi|12053249|emb|CAB66806.1| hypothetical protein [Homo sapiens]
gi|46255828|gb|AAH68998.1| COMMD4 protein [Homo sapiens]
gi|42632621|gb|AAS22242.1| COMMD4 [Homo sapiens]
gi|21361661|ref|NP_060298.2| COMM domain containing 4 [Homo sapiens]
gi|51316094|sp|Q9H0A8|COM4_HUMAN COMM domain containing protein 4
gi|49065330|emb|CAG38483.1| COMMD4 [Homo sapiens]

35
Peptides to Proteins
Nesvizhskii et al., Anal. Chem. 2003
36
Peptides to Proteins
37
Peptides to Proteins

A peptide sequence may occur in many different
protein sequences
Variants, paralogues, protein families
Separation, digestion, and ionization are not well
understood
Proteins in sequence databases are extremely
non-random, and far from independent of one another

38
Indistinguishable Protein Sequences
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
39
Indistinguishable Protein Sequences
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
40
Protein Families
Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
41
Protein Grouping Scenarios

Parsimony
Minimum # of proteins
Weighted
Choose proteins with the most confident
peptides (ProteinProphet)
Show all
Mark repeated peptides
Often no (ideal) resolution is possible!

Nesvizhskii, Aebersold, Mol Cell Proteomics, 2005
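
The parsimony scenario can be illustrated with a greedy set-cover heuristic: repeatedly pick the protein that explains the most still-unexplained peptides. This is only a sketch under stated assumptions. Greedy selection approximates, but does not guarantee, the true minimum; it is not the ProteinProphet weighting; and the function name and toy data are invented here.

```python
def greedy_parsimony(protein_peptides):
    """Greedy approximation to a minimal protein list covering all peptides.

    protein_peptides: dict mapping protein accession -> set of identified
    peptides found in that protein. Ties are broken alphabetically, which
    is arbitrary: indistinguishable proteins cannot be resolved by the data.
    """
    uncovered = set().union(*protein_peptides.values())
    chosen = []
    while uncovered:
        best = max(
            sorted(protein_peptides),
            key=lambda acc: len(protein_peptides[acc] & uncovered),
        )
        chosen.append(best)
        uncovered -= protein_peptides[best]
    return chosen


if __name__ == "__main__":
    toy = {
        "A": {"p1", "p2", "p3"},
        "B": {"p1", "p2"},        # subsumed by A
        "C": {"p3", "p4"},        # needed for p4
    }
    print(greedy_parsimony(toy))
    print(greedy_parsimony({"X": {"p1"}, "Y": {"p1"}}))  # indistinguishable
```

In the second call the choice between X and Y is made by name alone, which is the "no ideal resolution" case: a report should show both as one group rather than silently pick one.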
42
High-quality peptide identification: E-value < 10^-8
43
Moderate-quality peptide identification: E-value < 10^-3
44
Peptide Identification

Peptide fragmentation by CID is poorly understood
MS/MS spectra represent incomplete information
about amino-acid sequence
I/L, K/Q, GG/N, ...
Correct identifications don't come with a
certificate!
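
The ambiguities listed above follow directly from residue masses, as the short sketch below shows. The masses are standard monoisotopic values rounded to 5 decimals; the 0.5 Da default tolerance is an assumption chosen only to represent a low-resolution instrument, and the function names are hypothetical.

```python
# Monoisotopic residue masses in Da for the residues in the slide's examples.
RESIDUE_MASS = {
    "G": 57.02146, "N": 114.04293,
    "I": 113.08406, "L": 113.08406,
    "K": 128.09496, "Q": 128.05858,
}


def mass_difference(a, b):
    """Absolute mass difference between two residue strings, in Da."""
    mass_a = sum(RESIDUE_MASS[r] for r in a)
    mass_b = sum(RESIDUE_MASS[r] for r in b)
    return abs(mass_a - mass_b)


def distinguishable(a, b, tolerance_da=0.5):
    """True if the two residue strings differ by more than the tolerance."""
    return mass_difference(a, b) > tolerance_da


if __name__ == "__main__":
    for a, b in [("I", "L"), ("K", "Q"), ("GG", "N")]:
        print(a, b, round(mass_difference(a, b), 5), distinguishable(a, b))
```

I and L are isomers and GG has the same elemental composition as N, so no mass accuracy separates them; K and Q differ by about 0.036 Da, which only a sufficiently accurate instrument can resolve.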

45
Peptide Identification

High-throughput workflows demand we analyze all
spectra, all the time.
Spectra may not contain enough information to be
interpreted correctly
"bad static on a cell phone"
Peptides may not match our assumptions
"it's all Greek to me"
"Don't know" is an acceptable answer!

46
What scores do wrong peptides get?

Generate random peptide sequences
Real looking fragment masses
Empirical distribution
Require similar precursor mass
Arbitrary score function can model anything we
like!

47
Random Peptide Scores
Fenyo & Beavis, Anal. Chem., 2003
48
Random Peptide Scores
Fenyo & Beavis, Anal. Chem., 2003
49
Random Peptide Scores

Truly random peptides don't look much like real
peptides
Just use peptides from the sequence database!
Assumptions
IID sampling of score values per spectrum
Caveats
Correct peptide (non-random) may be included
Peptides are not independent
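
One way to picture the empirical approach: treat the scores of all other database peptides for a spectrum as samples from the "wrong peptide" distribution, and ask how often they reach the top score. The sketch below assumes higher scores are better and adds a pseudo-count of one so the estimate is never exactly zero; both choices, and the function name, are assumptions made for illustration.

```python
def empirical_p_value(observed_score, null_scores):
    """Estimate P(random score >= observed) from sampled null scores.

    null_scores: scores of database peptides assumed to be incorrect for
    this spectrum. The +1 terms are a pseudo-count: the estimate cannot go
    below 1 / (n + 1), however good the observed score is.
    """
    exceeding = sum(1 for s in null_scores if s >= observed_score)
    return (exceeding + 1) / (len(null_scores) + 1)


if __name__ == "__main__":
    null = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # toy scores, not real data
    print(empirical_p_value(8, null))
    print(empirical_p_value(100, null))
```

The floor of 1 / (n + 1) is the reason for extrapolating from a fitted theoretical model: with a few thousand candidate peptides per spectrum, the empirical distribution alone cannot support very small p-values.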

50
Extrapolating from the Empirical Distribution

Often, the empirical shape is consistent with a
theoretical model

Fenyo & Beavis, Anal. Chem., 2003
Geer et al., J. Proteome Research, 2004
51
E-values vs p-values

Need to adjust for the size of the sequence
database
Best false/random score goes up with number of
trials
E-value makes this adjustment
Expected number of incorrect peptides (with this
score) from this sequence database.
E-value = Trials × p-value (to 1st approx.)
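
The adjustment for database size can be made concrete with a few lines. Here `trials` stands for the number of candidate peptides compared against the spectrum, and the trials are assumed independent, which the earlier caveats say is only approximately true. The function names are hypothetical.

```python
def e_value(p_value, trials):
    """Expected number of random matches at least this good: trials * p."""
    return trials * p_value


def p_at_least_one(p_value, trials):
    """Probability that at least one of `trials` independent random
    matches scores this well: 1 - (1 - p)^trials."""
    return 1.0 - (1.0 - p_value) ** trials


if __name__ == "__main__":
    p = 1e-6
    for trials in (1_000, 10_000, 1_000_000):
        print(trials, e_value(p, trials), p_at_least_one(p, trials))
```

For small values the two quantities nearly coincide, which is the sense of "to 1st approx."; once trials × p approaches 1, the E-value keeps growing while the probability saturates at 1.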

52
False Discovery Rate

Which peptide IDs to accept?
E-value only provides a per-spectrum statistic
With enough spectra, even these can be
misleading!
Decide which spectra (w/ scores) will be
accepted
SEQUEST Xcorr, E-value, Score, etc., plus...
Threshold on identification criteria
Control the proportion of incorrect
identifications in the result for entire dataset

53
Distribution of scores over all spectra
Brian Searle, Proteome Software
54
Distribution of scores over all spectra
False
True
Brian Searle, Proteome Software
55
False Discovery Rate

FDR(score ≥ x) = (# false ids with score ≥ x) / (# all ids with score ≥ x)
Need to estimate numerator!
Assumes the false (and true) scores, sampled over
spectra, are IID
Not true for some peptide-spectrum scores
(Mostly) true for E-values
Can estimate the # of false ids using a decoy search

56
Peptide Prophet
Keller et al., Anal. Chem. 2002
Distribution of spectral scores in the results
57
Decoy searches

Shuffle or reverse sequence database
Same size as original
Known false identifications
Estimate False distribution
Alternatively, merge target + decoy results
Competition between target and decoy scores
Assume false target and false decoys each win
half the time
FDR(score ≥ x) = 2 × (# decoy ids with score ≥ x) / (# target ids with score ≥ x)
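
The decoy procedure can be sketched end to end: build a reversed database, count decoy and target identifications above a threshold, and pick the most permissive threshold that meets an FDR target. The sketch assumes higher scores are better. The "DECOY_" prefix, the function names, and the `factor` parameter are hypothetical: `factor=1.0` corresponds to a separate decoy search of the same size, and `factor=2.0` applies the merged-search formula as written on this slide.

```python
def reverse_decoys(proteins):
    """Build a same-size decoy database by reversing each sequence."""
    return {"DECOY_" + acc: seq[::-1] for acc, seq in proteins.items()}


def decoy_fdr(target_scores, decoy_scores, threshold, factor=1.0):
    """Estimate FDR among target ids scoring at or above `threshold`."""
    targets = sum(1 for s in target_scores if s >= threshold)
    decoys = sum(1 for s in decoy_scores if s >= threshold)
    if targets == 0:
        return 0.0          # nothing accepted, so nothing falsely accepted
    return min(1.0, factor * decoys / targets)


def lowest_threshold_at_fdr(target_scores, decoy_scores, max_fdr, factor=1.0):
    """Most permissive score threshold whose estimated FDR is <= max_fdr.

    Returns None if no threshold among the observed target scores qualifies.
    """
    for threshold in sorted(set(target_scores)):
        if decoy_fdr(target_scores, decoy_scores, threshold, factor) <= max_fdr:
            return threshold
    return None


if __name__ == "__main__":
    print(reverse_decoys({"P1": "MKV"}))   # toy sequence
    targets = [10, 9, 8, 7, 6, 5, 4, 3]    # toy scores, not real data
    decoys = [5, 4, 3, 2, 1, 1, 1, 1]
    print(decoy_fdr(targets, decoys, 4))
    print(lowest_threshold_at_fdr(targets, decoys, 0.2))
```

With few identifications the estimate is coarse, since a single decoy above threshold changes it substantially, so thresholds chosen this way are more trustworthy on large datasets.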

58
Summary

Protein sequence databases have varying
characteristics, choose wisely!
Inferring proteins from peptides can be (very)
tricky!
Statistical significance can help control the
proportion of errors in the (peptide-level)
results.
