Title: Interpreting MS/MS Proteomics Results
1Interpreting MS/MS Proteomics Results
The first thing I should say is that none of the
material presented is original research done at
Proteome Software,
but we do strive to make the tools presented here
available in our software product, Scaffold. With
that caveat aside...
- Brian C. Searle
- Proteome Software Inc.
- Portland, Oregon USA
- Brian.Searle_at_ProteomeSoftware.com
- NPC Progress Meeting
- (February 2nd, 2006)
Illustrated by Toni Boudreault
2Organization
- Identify
- SEQUEST
- X! Tandem/Mascot
- Differ
- Combine
This is first and foremost an introduction, so
we're first going to talk about
how you go about identifying proteins with tandem
mass spectrometry in the first place.
Then we're going to talk about the motivations
behind the development of the first really useful
bioinformatics technique in our field, SEQUEST.
This technique has been extended by two other
tools called X! Tandem and Mascot.
We're also going to talk about how these programs
differ
and how we can use that to our advantage by
considering them simultaneously using
probabilities.
3Start with a protein
[Figure: a protein drawn as a chain of single-letter amino acid residues]
So, this is proteomics, so we're going to use
tandem mass spectrometry to identify proteins,
hopefully many of them, and hopefully very
quickly.
4Cut with an enzyme
[Figure: the protein chain cut into peptides]
And to use this technique you generally have to
digest the protein into peptides about 8 to 20
amino acids in length.
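To make the cutting step concrete, here is a minimal sketch of an in-silico digest. It assumes the enzyme is trypsin and uses the common simplified rule (cut after K or R, but not when the next residue is P). The function name `tryptic_digest` and the example sequence are made up for illustration and are not taken from the slides.

```python
def tryptic_digest(protein):
    """Split a protein into peptides using a simplified trypsin rule.

    Assumption: cleave after K or R unless the next residue is P.
    Missed cleavages are not modeled.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        # Whatever is left after the last cleavage site.
        peptides.append(protein[start:])
    return peptides


# Hypothetical sequence, for illustration only.
print(tryptic_digest("AEPTIRHKKQIGLR"))  # ['AEPTIR', 'HK', 'K', 'QIGLR']
```

A real digest is messier than this (missed cleavages, other enzymes), so treat the rule as a starting point rather than a description of any particular search engine.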
5Select a peptide
[Figure: the peptides from the digest, with one peptide selected]
Look at each peptide individually.
We select the peptide by mass using the first
half of the tandem mass spectrometer.
6Impart energy in collision cell
[Figure: the peptide A-E-P-T-I-R (plus H2O) in the collision cell]
The mass spectrometer imparts energy into the
peptide, causing it to fragment at the peptide
bonds between amino acids.
7Measure mass of daughter ions
[Figure: spectrum, intensity vs. m/z, with peaks for the fragments A (72.0), AE (201.1), AEP (298.1), and AEPT (399.2)]
The masses of these fragment ions are recorded
using the second mass spectrometer.
8B-type Ions
[Figure: B-ion spectrum of AEPTIR, intensity vs. m/z; the spacings between consecutive peaks are labeled 72.0 (A), 129.0 (E), 97.0 (P), 101.0 (T), 113.1 (I), and 174.1 (R plus H2O)]
These ions are commonly called B ions, based on
nomenclature you don't really want to know about.
But the mass difference between the peaks
corresponds directly to the amino acid sequence.
9B-type Ions
[Figure: the same B-ion spectrum, intensity vs. m/z, with each spacing labeled as a difference of consecutive fragments: A - 0, AE - A, AEP - AE, AEPT - AEP, AEPTI - AEPT, AEPTIR - AEPTI]
For example, the A-E peak minus the A peak should
produce the mass of E.
You can build these mass differences up and
derive a sequence for the original peptide.
This is pretty neat and it makes tandem mass
spectrometry one of the best tools out there for
sequencing novel peptides.
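The mass-difference idea can be sketched in a few lines. The example below builds the B-ion ladder for AEPTIR from standard monoisotopic residue masses and then reads the sequence back from the spacings. The function names and the 0.02 tolerance are assumptions made for illustration. Leucine is left out of the table on purpose, since it has the same mass as isoleucine and cannot be told apart this way.

```python
# Monoisotopic residue masses (Da) for the residues in AEPTIR.
RESIDUE_MASS = {
    "A": 71.03711, "E": 129.04259, "P": 97.05276,
    "T": 101.04768, "I": 113.08406, "R": 156.10111,
}
PROTON = 1.00728  # mass added by the single charge


def b_ions(peptide):
    """Singly charged B-ion m/z for every prefix of the peptide."""
    total = PROTON
    ladder = []
    for residue in peptide:
        total += RESIDUE_MASS[residue]
        ladder.append(total)
    return ladder


def read_sequence(ladder, tolerance=0.02):
    """Turn the spacings between consecutive peaks back into residues."""
    sequence = []
    previous = PROTON
    for mz in ladder:
        spacing = mz - previous
        hits = [aa for aa, mass in RESIDUE_MASS.items()
                if abs(mass - spacing) <= tolerance]
        sequence.append(hits[0] if hits else "?")
        previous = mz
    return "".join(sequence)


ladder = b_ions("AEPTIR")
print([round(mz, 1) for mz in ladder[:4]])  # [72.0, 201.1, 298.1, 399.2]
print(read_sequence(ladder))                # AEPTIR
```

The first four values match the peaks drawn on the earlier slide. This only works because the ladder here is complete, clean, and known to be all B ions, which is exactly what a real spectrum does not give you.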
10But there are a couple of confounding factors.
So, it seems pretty easy, doesn't it?
For example,
11B-type Ions
[Figure: B-ion spectrum of AEPTIR, intensity vs. m/z, with a CO loss marked on each fragment]
B ions have a tendency to degrade and lose carbon
monoxide, producing
12A-type Ions
[Figure: A-ion spectrum of AEPTIR, each fragment after the loss of CO, plotted against m/z]
A ions.
Furthermore,
13Y-type Ions
[Figure: Y-ion spectrum of AEPTIR, intensity vs. m/z, read from the other end: R, I, T, P, E, A]
the second halves of the broken peptides are
represented as Y ions, which sequence backwards.
And, unfortunately, this is the real world, so
14Y-type Ions
[Figure: Y-ion spectrum, intensity vs. m/z, with uneven peak heights and missing peaks]
all the peaks have different measured heights
and many peaks can often be missing.
15B-type, A-type, Y-type Ions
[Figure: B-type, A-type, and Y-type ions overlaid in one spectrum, intensity vs. m/z]
All these peaks are seen together simultaneously
and we don't even know
16What type of ion they are, making the
mass-difference approach even more difficult.
Finally, as with all analytical techniques,
[Figure: spectrum with unlabeled peaks, intensity vs. m/z]
17There's noise,
producing a final spectrum that looks like
[Figure: spectrum with noise, intensity vs. m/z]
18This, on a good day.
And so it's actually fairly difficult to
[Figure: spectrum, intensity vs. m/z]
19 compute the mass differences to sequence the
peptide, certainly in a computer-automated way.
[Figure: B-ion spectrum of AEPTIR, intensity vs. m/z, with the spacings 72.0, 129.0, 97.0, 101.0, 113.1, and 174.1]
20So the community needed a new technique.
Now, it wasn't all without hope.
21Known Ion Types
We knew a couple of things about peptide
fragmentation.
- B-type ions
- A-type ions
- Y-type ions
Not only do we know to expect B, A, and Y ions,
but
22Known Ion Types
We also know a couple of other variations on
those ions that come up.
- B-type ions
- A-type ions
- Y-type ions
- B- or Y-type +2H ions
- B- or Y-type -NH3 ions
- B- or Y-type -H2O ions
We even know something about the
23Known Ion Types
likelihood of seeing each type of ion,
- B-type ions
- A-type ions
- Y-type ions
- B- or Y-type +2H ions
- B- or Y-type -NH3 ions
- B- or Y-type -H2O ions
where generally B and Y ions are most prominent.
24If we know the amino acid sequence of a peptide,
we can guess what the spectra should look like!
So it's actually pretty easy to guess what a
spectrum should look like
if we know what the peptide sequence is.
25ELVISLIVESK
Model Spectrum
So as an example, consider the peptide ELVIS
LIVES K
that was synthesized by Rich Johnson in Seattle.
Courtesy of Dr. Richard Johnson, http://www.hairyfatguy.com/
26Model Spectrum
We can create a hypothetical spectrum based on
our rules
27B/Y type ions (100)
B/Y +2H type ions (50)
A type ions, B/Y -NH3/-H2O (20)
Where B and Y ions are estimated at 100,
plus 2 ions are estimated at 50,
and other stragglers are at 20.
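Those rules are simple enough to write down. The sketch below builds a hypothetical spectrum for ELVISLIVESK with the 100 / 50 / 20 intensities from the slide. Two things are assumptions on my part: that the "+2H" ions are the doubly charged versions of the B and Y ions, and that peaks are binned to one decimal place. The function name `model_spectrum` is also made up for illustration.

```python
# Monoisotopic residue masses (Da) for the residues in ELVISLIVESK.
RESIDUE_MASS = {
    "E": 129.04259, "L": 113.08406, "V": 99.06841,
    "I": 113.08406, "S": 87.03203, "K": 128.09496,
}
PROTON, WATER, AMMONIA, CO = 1.00728, 18.01056, 17.02655, 27.99491


def model_spectrum(peptide):
    """Return {m/z rounded to 0.1: intensity} for the hypothetical spectrum."""
    spectrum = {}

    def add(mz, intensity):
        key = round(mz, 1)
        # If two ion types land in the same bin, keep the taller peak.
        spectrum[key] = max(spectrum.get(key, 0), intensity)

    masses = [RESIDUE_MASS[residue] for residue in peptide]
    for cut in range(1, len(peptide)):
        b = sum(masses[:cut]) + PROTON           # N-terminal fragment
        y = sum(masses[cut:]) + WATER + PROTON   # C-terminal fragment
        for ion in (b, y):
            add(ion, 100)                  # B/Y type ions
            add((ion + PROTON) / 2, 50)    # assumed meaning of "+2H"
            add(ion - AMMONIA, 20)         # -NH3
            add(ion - WATER, 20)           # -H2O
        add(b - CO, 20)                    # A type ion
    return spectrum


spectrum = model_spectrum("ELVISLIVESK")
print(len(spectrum), "peaks; tallest =", max(spectrum.values()))
```

Matching against an observed spectrum then comes down to asking how many of these bins also hold a real peak.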
28Model Spectrum
So if we consider the spectrum that was derived
from the ELVIS LIVES K peptide
29Model Spectrum
We can find where the overlap is between the
hypothetical and the actual spectra
30Model Spectrum
And say conclusively based on the evidence that
the spectrum does belong to the ELVIS LIVES K
peptide.
31But who cares?
The more important question is
what about situations where we don't know the
sequence?
32We guess!
33PepSeq
And so this was an approach followed by a program
called PepSeq,
which would guess every combination of amino
acids possible,
build a hypothetical spectrum,
and find the best matching hypothetical.
- AAAAAAAAAA
- AAAAAAAAAC
- AAAAAAAACC
- AAAAAAACCC
- ELVISLIVESK
- WYYYYYYYYY
- YYYYYYYYYY
J. Rozenski et al., Org. Mass Spectrom., 29
(1994) 654-658.
34PepSeq
This was a start,
but it's clearly impossibly hard with larger
peptides,
and there's a lot of room to overfit the data.
- Impossibly hard after 7 or 8 amino acids!
- High false positive rate because you consider so
many options
35PepSeq
So obviously this isn't going to work in the long
run.
Another strategy is needed!
- Impossibly hard after 7 or 8 amino acids!
- High false positive rate because you consider so
many options
36Sequencing Explosion
We needed a new invention to come around,
and that was shotgun Sanger sequencing.
- 1977 Shotgun sequencing invented,
bacteriophage φX174 sequenced.
- 1989 Yeast Genome project announced
- 1990 Human Genome project announced
- 1992 First chromosome (Yeast) sequenced
- 1995 H. influenzae sequenced
- 1996 Yeast Genome sequenced
- 2000 Human Genome draft
In '89 and '90 the Yeast and Human Genome projects
were announced,
followed by the first chromosome in '92,
et cetera, et cetera.
37Sequencing Explosion
- 1977 Shotgun sequencing invented,
bacteriophage φX174 sequenced.
- 1989 Yeast Genome project announced
- 1990 Human Genome project announced
- 1992 First chromosome (Yeast) sequenced
- 1995 H. influenzae sequenced
- 1996 Yeast Genome sequenced
- 2000 Human Genome draft
Eng, J. K.; McCormack, A. L.; Yates, J. R. III
J. Am. Soc. Mass Spectrom. 1994, 5, 976-989.
In 1994 Jimmy Eng and John Yates published a
technique to exploit genome sequencing
for use in tandem mass spectrometry.
And the idea was
38SEQUEST
...instead of searching all possible peptide
sequences,
search only those in genome databases.
Now, in the post-genomic world this seems like a
pretty trivial idea,
but back then there was a lot of assumption
placed on the idea
that we'd actually have a complete Human genome
in a reasonable amount of time.
39SEQUEST
- 2 x 10^14 -- All possible 11mers
- (ELVISLIVESK)
- 2 x 10^10 -- All possible peptides in NR
- 1 x 10^8 -- All tryptic peptides in NR
- 4 x 10^6 -- All Human tryptic peptides in NR
So, in terms of 11 amino acid peptides,
we're talking about a 10 thousand fold
difference between searching every possible 11mer
and those in the current non-redundant protein
database from the NCBI,
and roughly a 50 million fold difference for
searching human tryptic peptides.
So that was huge;
it made hypothetical spectrum matching feasible.
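The arithmetic behind those figures is easy to check. The sketch below counts every possible 11-residue sequence from 20 amino acids and compares it with the database sizes quoted on the slide. The helper name `fold_reduction` is made up for illustration, and the database sizes are simply the slide's rounded values.

```python
def fold_reduction(full_space, restricted_space):
    """How many times smaller the restricted search space is."""
    return full_space / restricted_space


all_11mers = 20 ** 11      # 20 amino acids at each of 11 positions
nr_peptides = 2e10         # all possible peptides in NR (slide value)
human_tryptic = 4e6        # all Human tryptic peptides in NR (slide value)

print(f"{all_11mers:.1e}")                                  # 2.0e+14
print(f"{fold_reduction(all_11mers, nr_peptides):,.0f}")    # 10,240
print(f"{fold_reduction(all_11mers, human_tryptic):,.0f}")  # 51,200,000
```

So restricting the search to database peptides buys about four orders of magnitude, and restricting it further to human tryptic peptides buys between seven and eight.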
40SEQUEST Model Spectrum
SEQUEST made a couple of other interesting
improvements as well.
Jimmy and John noted that there was a
mismatch between the intensities of the
hypothetical spectrum and the actual spectrum.
Instead of trying to make a better model,
they decided just to make the actual spectrum
look like the model with normalization.
41SEQUEST Model Spectrum
Like so.
For a scoring function they decided to use
Cross-Correlation,
which basically sums the peaks that overlap
between the hypothetical and the actual spectra.
42SEQUEST Model Spectrum
And then they shifted the spectra back and
forth so that the peaks shouldn't align.
43SEQUEST Model Spectrum
They used this number, also called the
Auto-Correlation, as their background.
44SEQUEST XCorr
[Figure: correlation score vs. offset (AMU), marking the Cross Correlation (direct comparison) and the Auto Correlation (background)]
This is another representation of the Cross
Correlation and the Auto Correlation.
Gentzel M. et al Proteomics 3 (2003) 1597-1610
45SEQUEST XCorr
[Figure: correlation score vs. offset (AMU), marking the Cross Correlation (direct comparison), the Auto Correlation (background), and the resulting XCorr]
The XCorr score is the Cross Correlation divided
by the average of the auto correlation over a 150
AMU range.
The XCorr is high if the direct comparison is
significantly greater than the background,
which is obviously good for peptide
identification.
Gentzel M. et al Proteomics 3 (2003) 1597-1610
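Here is a minimal sketch of that calculation, following the wording of the talk rather than any particular SEQUEST release. It assumes both spectra have already been binned into lists with one intensity per AMU, and that the 150 AMU range means offsets from -75 to +75 with zero left out. Some descriptions of SEQUEST subtract the background instead of dividing by it, so the function returns the two raw numbers as well as the ratio. All names are made up for illustration.

```python
def correlate(model, observed, offset):
    """Sum of products of overlapping bins, with the model shifted by offset."""
    total = 0.0
    for i, value in enumerate(model):
        j = i + offset
        if 0 <= j < len(observed):
            total += value * observed[j]
    return total


def xcorr(model, observed, window=75):
    """Return (direct comparison, mean background, ratio of the two)."""
    direct = correlate(model, observed, 0)
    shifted = [correlate(model, observed, offset)
               for offset in range(-window, window + 1) if offset != 0]
    background = sum(shifted) / len(shifted)
    ratio = direct / background if background else float("inf")
    return direct, background, ratio


# Toy spectra: three model peaks that line up with three tall observed
# peaks sitting on a flat baseline of 1.0.
model = [0.0] * 400
observed = [1.0] * 400
for peak in (100, 200, 300):
    model[peak] = 100.0
    observed[peak] = 50.0

print(xcorr(model, observed))  # (15000.0, 300.0, 50.0)
```

In the toy case the peaks are 100 AMU apart, so no shift inside the window lines them up again and the background stays at the baseline level.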
46SEQUEST DeltaCn
And this XCorr is actually a pretty robust method
for estimating how accurate the match is,
and so far, there really haven't been any
significant improvements on it.
The DeltaCn is another score that scientists
often use.
It measures how good the XCorr is relative to the
next best match.
As you can see, this is actually a pretty crude
calculation.
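For reference, DeltaCn is usually described as the gap between the best and second-best XCorr, scaled by the best. A minimal sketch, with a made-up function name and made-up scores:

```python
def delta_cn(xcorr_scores):
    """(best - second best) / best, over all candidate peptides for a spectrum."""
    ranked = sorted(xcorr_scores, reverse=True)
    best, second = ranked[0], ranked[1]
    return (best - second) / best


# Hypothetical XCorr values for three candidate peptides.
print(delta_cn([4.0, 3.0, 1.0]))  # 0.25
```

It only ever looks at two candidates, which is why it is such a blunt measure of how much the winner stands out.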
47Here's another representation of that sentiment.
The XCorr is a strong measure of accuracy,
whereas the DeltaCn is a weak measure of relative
goodness.
          Accuracy Score    Relative Score
SEQUEST   Strong (XCorr)    Weak (DeltaCn)
48Obviously, there could be an alternative method
that focuses more on the success of the relative
score.
Mascot and X! Tandem fit that bill.
                   Accuracy Score    Relative Score
SEQUEST            Strong (XCorr)    Weak (DeltaCn)
Alternate Method   Weak              Strong
49X! Tandem Scoring
- by-Score: Sum of intensities of peaks
matching B-type or Y-type ions
- HyperScore: the by-Score with factorial terms
attached
Now the X! Tandem accuracy score is rather crude.
It only considers B and Y ions
and attaches these factorial terms with an
admittedly hand-waving argument.
Fenyo, D.; Beavis, R. C. Anal. Chem., 75 (2003)
768-774
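As a rough sketch of the scoring just described: the by-Score sums the intensities of the matched B and Y peaks, and the HyperScore multiplies that sum by factorial terms. I am assuming here, as the score is commonly described, that the factorials are taken over the number of matched B ions and the number of matched Y ions. The function name and the intensities are made up.

```python
from math import factorial


def hyperscore(matched_b, matched_y):
    """matched_b / matched_y: intensities of observed peaks matching B / Y ions."""
    by_score = sum(matched_b) + sum(matched_y)
    # Assumed form of the factorial terms: one per ion series.
    return by_score * factorial(len(matched_b)) * factorial(len(matched_y))


# Hypothetical match: two B-ion peaks and three Y-ion peaks.
print(hyperscore([10, 20], [5, 5, 5]))  # 540
```

The factorials grow very quickly, so in practice the score is typically reported on a log scale.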
50Distribution of Incorrect Hits
[Figure: histogram of number of matches vs. hyper score, with the second best hit and the best hit marked]
But instead of just comparing the best match to
the second best, it looks at the distribution of
lower scoring hits, assuming that they are all
wrong.
This is somewhat based on ideas pioneered with
the BLAST algorithm.
Here, every bar represents the number of matches
at a given score.
The X! Tandem creators found that the
distribution decays (or slopes down)
exponentially
51Estimate Likelihood (E-Value)
[Figure: log(number of matches) vs. hyper score, with the best hit marked]
and the log of the distribution is relatively
linear because of the exponential decay.
52Estimate Likelihood (E-Value)
[Figure: log(number of matches) vs. hyper score, with a linear fit giving the expected number of random matches and the best hit marked]
If the distribution represents the number of
random matches at any given score,
the linear fit should correspond to the expected
number of random matches.
53Estimate Likelihood (E-Value)
[Figure: log(number of matches) vs. hyper score, with the linear fit extended out to the best hit; a score of 60 has a 1/10 chance of occurring at random]
And from this, you can calculate the likelihood
that the best match is random.
This is called an E-Value, or Expected-Value.
In this case, a score of 60 corresponds with a
log number of matches being -1,
which means the estimated number of random
matches for that score is 0.1.
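The extrapolation can be sketched as an ordinary least-squares line through the log of the histogram. The histogram below is invented so that the numbers come out like the example on the slide; it is not real X! Tandem output, and the function names are made up.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x


def e_value(histogram, best_score):
    """Expected number of random matches at best_score.

    histogram maps score -> number of lower-scoring (assumed wrong) matches.
    """
    scores = sorted(s for s, count in histogram.items() if count > 0)
    logs = [math.log10(histogram[s]) for s in scores]
    slope, intercept = fit_line(scores, logs)
    return 10 ** (slope * best_score + intercept)


# Invented histogram that decays tenfold every 10 score units.
histogram = {10: 10000, 20: 1000, 30: 100, 40: 10}
print(e_value(histogram, 60))  # about 0.1
```

The fitted line reaches a log count of -1 at a score of 60, so the expected number of random matches there is about one in ten.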
54X! Tandem and Mascot
- E-Value: Likelihood that match is incorrect relative to N guesses. Empirical (X! Tandem)
- P-Value: Likelihood that match is incorrect (E = P x N). Theoretical (Mascot)
Now, X! Tandem calculates this E-Value
empirically.
Another search engine, Mascot, tries to get at
the same kind of number using theoretical
calculations,
most likely based on the number of identified
peaks and the likelihood of finding certain amino
acids in the genome database.
They've never explicitly published their
algorithm, so we'll never really know,
but I suspect it's something smart.
I just want to bring up a point that we'll touch
on a little later:
55X! Tandem and Mascot
- E-Value: Likelihood that match is incorrect relative to N guesses. Empirical (X! Tandem)
- P-Value: Likelihood that match is incorrect (E = P x N). Theoretical (Mascot)
- Probability: Likelihood that match is correct. Note (Probability ≠ 1-P)!
the E-Value that X! Tandem calculates
and the P-Value that Mascot calculates are
probabilistically based,
but they can only estimate the likelihood that
the match is wrong.
This is realistically not nearly as useful as
knowing
the probability that a peptide identification is
right,
which is NOT 1 minus the P-Value.
56Now, let's go back and fill in the X! Tandem part
of our accuracy/relativity scoring grid.
57To reiterate, the XCorr is an excellent measure
of accuracy,
58whereas the E-Value is an excellent measure of
how good the best score is relative to the rest.
If we assume that accuracy and relativity scores
are independent measures of goodness,
could we use both SEQUEST's XCorr and X!
Tandem's E-Value together?
5910 Protein Control Sample
[Figure: scatter plot of X! Tandem -log(E-Value) vs. SEQUEST Discriminant Score, one point per spectrum]
And the answer is a resounding yes.
Each point on this graph is a spectrum, where
correct identifications are marked in red, while
incorrect identifications are marked in blue.
We know what's correct and incorrect because this
is a control sample.
Although in general the spectra SEQUEST scores
well are spectra X! Tandem also scores well,
there is considerable scatter between the search
engines.
6010 Protein Control Sample
[Figure: scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
One might wonder, if X! Tandem and Mascot use
similar scoring approaches,
would they benefit as much?
The answer is surprisingly still yes!
Now, why are the scores so different?
61Why So Different?
Well, here are a couple of possible reasons.
- SEQUEST
  - Considers relative intensities
- X! Tandem
  - Considers semi-tryptic peptides
  - Considers only B/Y-type Ions
- Mascot
  - Considers theoretical P-Value relative to search space
SEQUEST is the only method to consider relative
intensities.
62Why So Different?
- SEQUEST
  - Considers relative intensities
- X! Tandem
  - Considers semi-tryptic peptides
  - Considers only B/Y-type Ions
- Mascot
  - Considers theoretical P-Value relative to search space
X! Tandem is the only method to consider peptides
outside the standard search space by default,
such as semi-tryptic peptides.
However, it's the only score that considers only
B and Y ions,
as opposed to a complete model.
63Why So Different?
- SEQUEST
  - Considers relative intensities
- X! Tandem
  - Considers semi-tryptic peptides
  - Considers only B/Y-type Ions
- Mascot
  - Considers theoretical P-Value relative to search space
And Mascot is the only search engine to compute a
completely theoretical P-Value.
64Consider Multiple Algorithms?
[Figure: scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
So we clearly want to consider multiple search
engines simultaneously,
but how?
65How To Compare Search Engines?
- SEQUEST XCorr > 2.5, DeltaCn > 0.1
- Mascot Ion Score - Identity Score > 0
- X! Tandem E-Value < 0.01
You can't use a thresholding system
because it's impossible to find corresponding
thresholds.
For example, a SEQUEST match with an XCorr of 2.5
doesn't mean the same thing
as an X! Tandem match with an E-Value of 0.01.
66How To Compare Search Engines?
- SEQUEST XCorr > 2.5, DeltaCn > 0.1
- Mascot Ion Score - Identity Score > 0
- X! Tandem E-Value < 0.01
- Need to convert scores to probabilities!
The simplest way would be to convert the scores
into probabilities and compare those.
We advocate for Andrew Keller and Alexey
Nesvizhskii's Peptide Prophet approach
because it actually calculates a true
probability, not just a p-value.
6710 Protein Control Sample (Q-ToF) X! Tandem
approach
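[Figure: histogram of number of matches vs. Mascot Ion-Identity Score for a single spectrum; the bulk is labeled "Other Incorrect IDs for Spectrum" and the top match is labeled "Possibly Correct?"]
So if you remember,
X! Tandem considers the best peptide match for a
spectrum against a distribution of incorrect
matches.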
6810 Protein Control Sample (Q-ToF) Peptide Prophet
approach
[Figure: histogram of number of matches vs. Mascot Ion-Identity Score; the best match ("Possibly Correct?") is compared against ALL other best matches]
Well, Peptide Prophet looks across the entire
sample, and not at just one spectrum at a time.
It compares the best match against all of the
other best matches in the sample, which form a
distribution that is clearly bimodal.
Keller, A. et al Anal. Chem. 74, 5383-5392
6910 Protein Control Sample (Q-ToF) Peptide Prophet
approach
[Figure: the same histogram of ALL other best matches, number of matches vs. Mascot Ion-Identity Score, with the top match labeled "Possibly Correct?"]
The low mode represents matches that are most
likely wrong, while the high mode represents
matches that are probably right.
Keller, A. et al Anal. Chem. 74, 5383-5392
7010 Protein Control Sample (Q-ToF) Peptide Prophet
approach
[Figure: the same histogram, number of matches vs. Mascot Ion-Identity Score, with two fitted curves labeled Incorrect and Correct]
Peptide Prophet curve-fits two distributions to
the modes,
following the assumption that the low scoring
distribution is incorrect
and that the higher scoring distribution is
correct.
7110 Protein Control Sample (Q-ToF)
[Figure: Incorrect and Correct curves, number of matches vs. Mascot Ion-Identity Score, with the top match labeled "Possibly Correct?"]
These two distributions can be analyzed using
Bayesian statistics with this formula.
Now that formula looks pretty complex, but
7210 Protein Control Sample (Q-ToF)
it just calculates the height of the correct
distribution at a particular score, divided by
the summed height of both distributions.
7310 Protein Control Sample (Q-ToF)
This is essentially the probability of having
that score and being correct divided by the
probability of just having that score.
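In code, the height-over-total-height idea looks something like the sketch below. The big assumption is the shape of the two curves: I use plain Gaussians with invented means and widths purely for illustration, and the real Peptide Prophet fit may use different distributions. The `fraction_correct` parameter stands for the share of best matches that fall under the correct curve, and all names are made up.

```python
import math


def normal_pdf(x, mean, sd):
    """Height of a Gaussian curve at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def probability_correct(score, correct, incorrect, fraction_correct):
    """Height of the correct curve divided by the height of both curves.

    correct / incorrect are (mean, sd) pairs for the two fitted curves.
    """
    height_correct = fraction_correct * normal_pdf(score, *correct)
    height_incorrect = (1.0 - fraction_correct) * normal_pdf(score, *incorrect)
    return height_correct / (height_correct + height_incorrect)


# Invented curve parameters; the same score under two different samples.
correct, incorrect = (20.0, 10.0), (-5.0, 8.0)
print(probability_correct(10.0, correct, incorrect, fraction_correct=0.5))
print(probability_correct(10.0, correct, incorrect, fraction_correct=0.1))
```

The second call gives a lower probability for exactly the same score, because fewer of the matches in that sample are correct to begin with. A p-value or E-value cannot see that, since it only ever looks at the incorrect curve.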
74This is a neat method because it actually
considers the likelihood of being correct,
unlike X! Tandem and Mascot, which only
calculate the probability of being incorrect.
It's because of this that Peptide Prophet can
produce a true probability,
which is important when the sample
characteristics change.
[Figure: Incorrect and Correct curves, number of matches vs. Mascot Ion-Identity Score, with the top match labeled "Possibly Correct?"]
75Q-ToF
[Figure: Q-ToF data, Incorrect and Correct curves, number of matches vs. Mascot Ion-Identity Score]
For example, the control sample we've been
looking at was derived from Q-ToF data,
which produces pretty high quality results.
76Q-ToF Ion Trap
[Figure: Q-ToF and Ion Trap histograms side by side, number of matches vs. Mascot Ion-Identity Score, each with Incorrect and Correct curves]
If you compare that to the same sample run on
an Ion Trap, the probability of being correct is
greatly diminished.
If you'll note, the Incorrect distribution
doesn't change very much between the two
analyses; however, the likelihood that the
identification is right changes dramatically!
77 Ion Trap
[Figure: Ion Trap data, Incorrect and Correct curves, number of matches vs. Mascot Ion-Identity Score]
As Peptide Prophet considers the correct
distribution, it accounts for fluctuations
between samples.
P-Values and E-Values don't consider this
information, so they can't be compared across
multiple samples, or different examinations of
the same sample,
hence the reason why we need to use Peptide
Prophet for comparing two different search engines.
78Consider Multiple Algorithms?
[Figure: scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score]
So going back to the scatter plot between X!
Tandem and Mascot,
we can use Peptide Prophet to compute the score
threshold that represents a 95% cut-off
79Consider Multiple Algorithms?
[Figure: the same scatter plot of X! Tandem -log(E-Value) vs. Mascot Ion-Identity Score, with the 95% thresholds drawn in]
Like so.
This allows you to fairly consider the answers
from both search engines simultaneously.
The important thing to note is that if you
looked at a different sample, these thresholds
should change depending on the height of the
correct distributions.
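To see why the threshold moves with the sample, here is a small sketch that finds the lowest score whose probability of being correct reaches 95%. As before, the two curves are invented Gaussians (given equal widths so the probability rises steadily with score), and the grid search from -50 to 100 is just a simple way to locate the cut-off. None of this is Peptide Prophet's actual fitting code.

```python
import math

# Invented (mean, sd) for the two fitted curves.
CORRECT = (20.0, 8.0)
INCORRECT = (-5.0, 8.0)


def normal_pdf(x, mean, sd):
    """Height of a Gaussian curve at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def probability_correct(score, fraction_correct):
    """Height of the correct curve over the height of both curves."""
    good = fraction_correct * normal_pdf(score, *CORRECT)
    bad = (1.0 - fraction_correct) * normal_pdf(score, *INCORRECT)
    return good / (good + bad)


def score_threshold(fraction_correct, target=0.95):
    """Lowest score on a 0.1 grid whose probability reaches the target."""
    for tenths in range(-500, 1001):
        score = tenths / 10.0
        if probability_correct(score, fraction_correct) >= target:
            return score
    return None


# A sample where half the matches are correct vs. one where a tenth are.
print(score_threshold(0.5))
print(score_threshold(0.1))
```

The sample with fewer correct matches needs a higher score to reach the same 95% probability, which is the sense in which the thresholds depend on the height of the correct distribution.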
80Conclusion
- All search engines use different criteria,
producing different scores
- Using multiple search engines simultaneously
yields better results
- Peptide Prophet can normalize search engine
results
So in conclusion,
all of the search engines look at different
criteria,
81Conclusion
- All search engines use different criteria,
producing different scores
- Using multiple search engines simultaneously
yields better results
- Peptide Prophet can normalize search engine
results
and we can leverage this to identify more
peptides,
82Conclusion
- All search engines use different criteria,
producing different scores
- Using multiple search engines simultaneously
yields better results
- Peptide Prophet can normalize search engine
results
and Peptide Prophet is a great mechanism for
doing that
because it calculates true probabilities
instead of p-values.
83The End