Title: Computer Analysis of Mass Spectrometry Data
1Computer Analysis of Mass Spectrometry Data
- David Perkins
- Proteomics Section,
- Hammersmith Hospital Campus,
- Imperial College School of Medicine.
2Introduction
- Software and computational techniques for the
identification of proteins and residue
modifications using both MALDI and MS/MS data.
4Peptide Mass Fingerprinting
A mass spectrum of the peptide mixture resulting
from the digestion of a protein by a
proteolytic enzyme
- Choice of Enzyme
- Missed Cleavages
- Search Masses
- Constraining the Protein Molecular Weight
- Which masses to include in a search
- Autolysis products
- Modifications
5Enzymatic Cleavage
Peptide Fragments
Native Protein
Enzyme
6Choice of Enzyme
- Enzymes of low specificity are next to useless as they produce a complex mixture of similar masses
- For MALDI, peptides of masses less than 500 Da should be avoided
7Enzyme Specificity
Enzyme        Cleave At  Don't Cleave  N or C term
Trypsin       KR         P             C
Lys-C         K          P             C
Lys-C/P       K          -             C
Arg-C         R          P             C
V8-E          E          P             C
V8-DE         DE         P             C
Chymotrypsin  FYWLIVM    P             C
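The specificity table can be turned into a small in silico digest. The sketch below is illustrative only: it assumes that "C" means the enzyme cuts on the C-terminal side of the listed residues and that "P" in the third column means no cut is made when the next residue is proline. The names `ENZYMES` and `digest` are made up for this example.

```python
# Rules transcribed from the table: (residues cleaved after, following residue that blocks the cut).
# Interpretation of the columns is an assumption, see the note above.
ENZYMES = {
    "Trypsin": ("KR", "P"),
    "Lys-C": ("K", "P"),
    "Lys-C/P": ("K", ""),
    "Arg-C": ("R", "P"),
    "V8-E": ("E", "P"),
    "V8-DE": ("DE", "P"),
    "Chymotrypsin": ("FYWLIVM", "P"),
}


def digest(sequence, enzyme="Trypsin"):
    """Return the limit peptides (no missed cleavages) in sequence order."""
    cleave_at, blocked_by = ENZYMES[enzyme]
    peptides, start = [], 0
    # The last residue is never a cleavage site, so stop one short of the end.
    for i in range(len(sequence) - 1):
        if sequence[i] not in cleave_at:
            continue
        if blocked_by and sequence[i + 1] in blocked_by:
            continue
        peptides.append(sequence[start:i + 1])
        start = i + 1
    peptides.append(sequence[start:])
    return peptides


print(digest("MKRPAKGR", "Trypsin"))
```

In the example sequence the R followed by P is left uncut, which is why the middle peptide starts with R.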
8Missed Cleavages
- Digests are usually not perfect
- Cleavage sites may be missed by an enzyme
- These partially cleaved peptides are known as partials
- Partials reduce the discrimination of a search
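Partials can be modelled by joining adjacent limit peptides. A minimal sketch, assuming the input is the ordered list of peptides from a perfect digest; the function name and the `max_missed` parameter are hypothetical.

```python
def with_missed_cleavages(peptides, max_missed=1):
    """Return (peptide, missed_sites) pairs for up to max_missed missed cleavage sites."""
    products = []
    for start in range(len(peptides)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(peptides):
                break
            # Joining (missed + 1) neighbouring peptides skips `missed` cleavage sites.
            products.append(("".join(peptides[start:end]), missed))
    return products


for peptide, missed in with_missed_cleavages(["MK", "RPAK", "GR"], max_missed=1):
    print(peptide, missed)
```

Every extra missed cleavage allowed adds more candidate masses per protein, which is why partials reduce the discrimination of a search.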
9Search Masses
- Select masses which are large enough to provide discrimination
- Larger masses are more likely to be partials
- With trypsin, a mass range of 1000 to 3000 Da is good
- Mass tolerance is important in obtaining good discrimination
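The points above can be sketched as three helpers: a monoisotopic peptide mass, a filter for the 1000 to 3000 Da window, and a tolerance test. This is a sketch under stated assumptions: the residue masses are standard monoisotopic values, while the function names and the choice of a ppm tolerance are made for illustration and are not taken from any particular search engine.

```python
# Standard monoisotopic residue masses (Da).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residue masses plus one water for the termini."""
    return sum(MONO[residue] for residue in peptide) + WATER


def select_search_masses(masses, low=1000.0, high=3000.0):
    """Keep only masses inside the window suggested for tryptic digests."""
    return [m for m in masses if low <= m <= high]


def within_ppm(observed, theoretical, ppm):
    """True if the observed mass lies within +/- ppm of the theoretical mass."""
    return abs(observed - theoretical) <= theoretical * ppm / 1e6


print(select_search_masses([450.2, 1234.5, 2999.9, 3500.0]))
print(within_ppm(1000.05, 1000.0, 100), within_ppm(1000.05, 1000.0, 20))
```

Tightening the tolerance from 100 ppm to 20 ppm rejects the same 0.05 Da error, which is the sense in which mass tolerance drives discrimination.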
10Constraining Protein Mass
- To increase discrimination, the mass of the intact protein can be used in a search
- This is dangerous, since the sample may be just a fragment of an entire protein
11Which Masses to Include?
The optimum dataset for a peptide mass fingerprint is all the correct peptides and none of the wrong ones! By correct, we mean that the textbook cleavage rules were followed. In practice, this rarely (if ever) happens.
- Enzymatic cleavage not perfect
- Sequence coverage may be poor
- Noise
12Autolysis Products
- Some digests may be dominated by the autolysis peaks of the enzyme used
- In these cases, the known masses of these products may be filtered out
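Filtering autolysis peaks amounts to dropping any observed mass that falls close to a known value. In the sketch below the two masses are commonly quoted [M+H]+ values for porcine trypsin autolysis peptides and are used only as an illustration; the list for the enzyme actually used should be substituted. The function name and the 0.2 Da tolerance are assumptions.

```python
# Illustrative [M+H]+ values for porcine trypsin autolysis peptides (assumption: replace
# with the masses known for the enzyme and supplier actually used).
TRYPSIN_AUTOLYSIS = (842.51, 2211.10)


def remove_autolysis(peaks, known=TRYPSIN_AUTOLYSIS, tol_da=0.2):
    """Drop peaks lying within tol_da of any known autolysis mass."""
    return [m for m in peaks if all(abs(m - k) > tol_da for k in known)]


print(remove_autolysis([842.50, 1234.56, 2211.15, 1500.70]))
```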
13Residue Modifications
- Some residues may be modified during the sample preparation procedure
- This introduces discrepancies between the expected and observed masses
- For example, Met residues are often oxidised
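Taking the Met oxidation example, each methionine may or may not carry one extra oxygen (+15.99491 Da monoisotopic), so a peptide with n Met residues can appear at n + 1 distinct masses. A minimal sketch; the function name is hypothetical and the base mass is passed in rather than computed.

```python
OXIDATION = 15.99491  # monoisotopic mass of one oxygen atom (Da)


def oxidation_variants(peptide, unmodified_mass):
    """Candidate masses for 0..n oxidised Met residues in the peptide."""
    n_met = peptide.count("M")
    return [unmodified_mass + k * OXIDATION for k in range(n_met + 1)]


print(oxidation_variants("AMKM", 1000.0))
```

A search that allows this as a variable modification has to test every one of these masses against the observed peaks.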
14Sample Preparation for MALDI
- Excise band from gel
- Tryptic Digestion of gel fragment
- Supernatant transferred to fresh eppendorf
- Sample transferred to target plate
15Sample Preparation Robot
16MALDI Mass Spectrometer
- Ions are generated by a LASER firing at the target plate
- The time of firing of the LASER and the arrival time of the ions at the detector are known, so the relative masses can be calculated
- Ions are predominantly singly charged; other types of spectrometer may generate multiply charged ions
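The relation between flight time and mass can be sketched for an idealised linear time-of-flight analyser: an ion of charge z accelerated through a potential V gains kinetic energy zeV, so over a field-free drift length L its flight time t gives m/z = 2eVt²/L². The function names and the example voltage and tube length are assumptions; real instruments are calibrated against known standards rather than relying on this formula directly.

```python
import math

E_CHARGE = 1.602176634e-19   # elementary charge (C)
DALTON = 1.66053906660e-27   # unified atomic mass unit (kg)


def mz_from_flight_time(t_seconds, accel_volts, drift_metres):
    """m/z in Da per unit charge for an ideal linear TOF (no delayed extraction or reflectron)."""
    kg_per_charge = 2.0 * E_CHARGE * accel_volts * t_seconds ** 2 / drift_metres ** 2
    return kg_per_charge / DALTON


def flight_time(mz, accel_volts, drift_metres):
    """Inverse relation: flight time in seconds for a given m/z."""
    return drift_metres * math.sqrt(mz * DALTON / (2.0 * E_CHARGE * accel_volts))


t = flight_time(1500.0, 20000.0, 1.0)
print(t, mz_from_flight_time(t, 20000.0, 1.0))
```

Flight time grows with the square root of m/z, so an ion four times heavier takes twice as long to arrive.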
17MALDI Internals
18Micromass MALDI
19Typical Fingerprint Spectrum
20Isotopic Cluster
21Poorly Resolved Peak
22Database Searching with Peptide Mass fingerprints
- Produce a theoretical digest of all the proteins in a database with a specific enzyme
- Compare these theoretical masses with experimentally observed masses
- Assign a score to matching peptides/proteins
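The three steps above can be sketched end to end on a toy database. This is not the MOWSE or MASCOT scoring scheme: the score here is simply the number of observed masses that match a theoretical tryptic peptide, and the protein names, sequences and tolerance are invented for the example.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def tryptic_peptides(sequence):
    """Cut after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def mh_plus(peptide):
    """Singly protonated monoisotopic mass, as observed in MALDI."""
    return sum(MONO[r] for r in peptide) + WATER + PROTON


def search(observed, database, tol_da=0.2):
    """Rank proteins by the number of observed masses matched (a naive count score)."""
    results = []
    for name, sequence in database.items():
        theoretical = [mh_plus(p) for p in tryptic_peptides(sequence)]
        hits = sum(1 for m in observed if any(abs(m - t) <= tol_da for t in theoretical))
        results.append((hits, name))
    return sorted(results, reverse=True)


toy_db = {"protA": "AAAKGGGRCCCK", "protB": "WWWKYYYR"}  # hypothetical sequences
observed = [mh_plus(p) for p in ("AAAK", "GGGR", "CCCK")]
print(search(observed, toy_db))
```

A plain match count ignores how common each peptide mass is in the database, which is one reason real search engines use more elaborate scores.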
23Problems
- Mixtures and contamination
- Partial cleavage
- Identifying real peaks
- Residue modifications
- Mass accuracy
24MOWSE
- One of the first programs for identifying proteins by peptide mass fingerprinting
- Developed by Darryl Pappin and Alan Bleasby
- Developed alongside the OWL non-redundant protein database
25Problems with MOWSE
- Databases had to be pre-indexed; these indexes are large and slow to build
- Does not handle variable modifications
- Indexing means that databases cannot easily be updated regularly
- Limited functionality
26MASCOT
- Take advantage of multi-processor systems
- Totally web based
- No pre-indexing of databases
- Increased functionality
- Copes with multiple modifications
- Easily expandable
- Increased speed
27Search Speed
Search speed is very important as databases increase in size and automation leads to a high throughput of samples. Also, if the algorithms are efficient, more elaborate searches may be undertaken, for instance with large numbers of variable residue modifications and different mass tolerances, to attempt to make more sense of data derived from mixtures or with contamination.
- Ability to use multiple processors when available
- Very efficient I/O; databases may also be mapped to memory
- Efficient cleavage site and mass calculation
28Thread Models
- Boss/Worker
- Peer
- Pipeline
- MASCOT is based on the Boss/Worker model
29Boss/Worker Model
Output
The Boss accepts input and then distributes the
work to other threads
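The Boss/Worker arrangement can be sketched with a shared work queue. This illustrates the thread model only and is not MASCOT's implementation; the function names are hypothetical. Note that in CPython, threads will not speed up CPU-bound work, so the example shows the structure rather than a performance gain.

```python
import queue
import threading


def run_boss_worker(tasks, handle, n_workers=4):
    """The boss queues every task; worker threads pull tasks and store results."""
    work = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            item = work.get()
            if item is None:  # sentinel from the boss: no more work
                break
            index, task = item
            value = handle(task)
            with lock:
                results[index] = value

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    # Boss: accept the input and distribute it to the workers.
    for index, task in enumerate(tasks):
        work.put((index, task))
    for _ in threads:
        work.put(None)
    for thread in threads:
        thread.join()
    return [results[i] for i in range(len(results))]


print(run_boss_worker([1, 2, 3, 4, 5], lambda x: x * x))
```

Because workers only ever talk to the queue, adding or removing workers does not change the boss, which is the property that makes this model easy to spread over more processors.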
30Peer Model
[Figure: three threads, each with its own Output]
Each Thread is responsible for its own input
31Pipeline Model
[Figure: Input Stream, then Thread A, Thread B and Thread C in turn, then Output]
A single thread accepts input, passing the data
on to the next thread for further processing
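For comparison, a pipeline can be sketched as one thread per stage, connected by queues. The stage functions and names are hypothetical and the example illustrates the thread model only.

```python
import queue
import threading


def run_pipeline(items, stages):
    """Run each stage function in its own thread, passing data from queue to queue."""
    done = object()  # marker that flows through the pipeline after the last item
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def stage_worker(func, inbox, outbox):
        while True:
            item = inbox.get()
            if item is done:
                outbox.put(done)  # tell the next stage to stop as well
                break
            outbox.put(func(item))

    threads = [
        threading.Thread(target=stage_worker, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(stages)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(done)

    output = []
    while True:
        item = queues[-1].get()
        if item is done:
            break
        output.append(item)
    for thread in threads:
        thread.join()
    return output


print(run_pipeline([1, 2, 3], [lambda x: x + 1, lambda x: x * 2, str]))
```

With a single thread per stage and first-in, first-out queues, results come out in the order the inputs went in.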
36Related Search Methods
- Masses may be combined with sequence information, e.g. 1234.5 seq(c-ABCD) seq(EF)
- These searches are very valuable as even small amounts of sequence information may be very discriminating
- Sequence information is derived from the partial interpretation of an MS/MS spectrum
- Known as the sequence tag method
37Composition Queries
- Composition information may also be used with mass information to refine queries
- Chemical or enzymatic analysis, such as N-terminal analysis with Edman degradation, may give composition information
- A typical query would be 1234.5 comp(2H0M)
38MASCOT Queries
- One of the most powerful features of MASCOT is the ability to mix all the types of query in one search
- MASCOT allows the user to specify a particular species to further increase search discrimination
39Databases Searched with Peptide Mass Fingerprint Data
- Non-identical protein databases are the ideal
- EST sequences are too short to contain meaningful information for these searches
- Non-redundant databases may be problematic
- MASCOT translates nucleic acid databases on the fly
40MSDB
- A non-identical protein sequence database designed for mass spectrometry searches
- Additional information, such as multiple species lines, in the textual information
- De-convolution of SWISSPROT and other sequences
- Nightly updates
- Links to source databases
41Is The Protein Identified?
- Most samples are identified using just peptide mass fingerprinting
- With the growth of databases, this trend will continue
- Some samples do not have representatives in any of the databases; sequencing these proteins requires more analysis
42MS/MS Analysis
- Also known as tandem MS
- Individual peptides from the enzymatic digests are fragmented further
- From the resulting ladder of fragments, sequences may be reconstructed
- A much more discriminating search than simple peptide mass fingerprinting
43MS/MS Analysis
- Carried out on nanospray/electrospray mass spectrometers
- Rather than being spotted on a target plate, the sample is introduced through an inlet from a capillary
- Peptides identified by the MALDI analysis are fragmented inside the mass spectrometer and the resultant daughter ions observed
44Stylized Nanospray Mass Spectrometer
45Micromass QTOF
46Finnigan Ion Trap
47Daughter Ions
- Unlike MALDI, ions produced by electrospray/nanospray machines may carry multiple charges
- Various types of ions are produced, categorized by their charge and their direction in the peptide sequence
- Fortunately, the peptides fragment at the peptide bonds
48B and Y Fragment Ions
Y-ions from C to N terminus
B-ions from N to C terminus
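The two series can be computed from residue masses: a singly charged b-ion is the sum of the first i residues plus a proton, and a singly charged y-ion is the sum of the last i residues plus water and a proton. The sketch uses standard monoisotopic masses; the function name is hypothetical, and only singly charged ions without neutral losses are considered.

```python
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[residue] for residue in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)  # C-terminal fragment
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("GAK")
print([round(m, 4) for m in b_ions])
print([round(m, 4) for m in y_ions])
```

Consecutive ions within one series differ by exactly one residue mass, which is what makes the ladder readable as sequence.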
49[Figure: MS/MS spectrum; labelled peaks at 1571.45, 1725.29 and 1814.37; horizontal axis 500 to 2000]
50MASCOT Searches with MS/MS Data
- In a similar fashion to peptide mass fingerprinting, the predicted fragment ion masses from each peptide of a database sequence are calculated
- The calculated and observed ion masses are compared and given a score
- Individual peptide scores are combined to give a protein score
51Problems with MS/MS data
- The range of daughter ion types produced may be large and is dependent on the machine and analytical procedure used
- Searches tend to be used with a no enzyme option, which introduces a large number of calculations
- Residue modifications are far more difficult to handle, the number of mass permutations being very large
62Databases Searched with MS/MS Data
- Non-identical protein databases are the ideal
- EST databases translated in 6 frames are very useful as individual peptides may be identified
- Translated nucleic acid databases
- Non-redundant databases create problems
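Six-frame translation means reading a nucleotide sequence in three offsets on each strand. A minimal sketch using the standard genetic code; the function names are hypothetical, stop codons are written as "*", and codons containing anything other than A, C, G or T are written as "X".

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for first, second and third base.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


print(six_frames("ATGGCC"))
```

Each translated frame can then be digested and searched like any protein sequence, at roughly six times the cost of searching the nucleotide entry once.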
63De-Novo Sequencing
- If the protein is still not identified, the sequence of a peptide has to be reconstructed from the MS/MS data
- Very time consuming and demands a great deal of skill; noisy data is very problematic
- Sequencing is carried out by finding mass differences between peaks that correspond to amino acid masses
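The mass-difference step can be sketched as a lookup against residue masses. The peak values in the example are the first few masses of the ascending ladder shown on slide 66; the function name and the 0.5 Da tolerance are assumptions. Leu and Ile have identical masses and are reported together as Xle, and Gln and Lys differ by only about 0.04 Da, so both are reported when the tolerance cannot separate them.

```python
# Monoisotopic residue masses (Da), ordered by mass; Leu/Ile are combined as Xle.
RESIDUE_MASS = {
    "Gly": 57.02146, "Ala": 71.03711, "Ser": 87.03203, "Pro": 97.05276,
    "Val": 99.06841, "Thr": 101.04768, "Cys": 103.00919, "Xle": 113.08406,
    "Asn": 114.04293, "Asp": 115.02694, "Gln": 128.05858, "Lys": 128.09496,
    "Glu": 129.04259, "Met": 131.04049, "His": 137.05891, "Phe": 147.06841,
    "Arg": 156.10111, "Tyr": 163.06333, "Trp": 186.07931,
}


def read_ladder(peaks, tol_da=0.5):
    """Name the residue(s) matching each gap between consecutive peaks, or '?' if none."""
    ordered = sorted(peaks)
    calls = []
    for low, high in zip(ordered, ordered[1:]):
        delta = high - low
        matches = [name for name, mass in RESIDUE_MASS.items() if abs(delta - mass) <= tol_da]
        calls.append("/".join(matches) if matches else "?")
    return calls


print(read_ladder([256.1, 327.3, 426.2, 555.5, 668.2]))
print(read_ladder([1230.6, 1358.3]))
```

Noise peaks and missing ions break this simple walk, which is why real de novo sequencing demands so much manual skill.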
64Tags
- Easy to find initial masses in ladder
- Tags modify the fragmentation of the peptide
- Reduce isobaric problems
- Neutralise the adverse effects of certain
residues on peptide fragmentation
65Example Tags
66[Figure: annotated MS/MS spectrum with the y-ion and b-ion series marked; labelled peaks at 1571.45, 1725.29 and 1814.37]
y-ion series: 2044.9 1933.7 1862.7 1763.7 1634.6 1521.6 1450.4 1321.4 1220.3 1106.5 959.4 831.3 760.4 689.3 618.2 517.0 446.3 375.1
Sequence: Gln Ala Val Glu Xle Ala Glu Thr Asn Phe Gln Ala Ala Ala Thr Ala Ala Thr Lys
b-ion series: 256.1 327.3 426.2 555.5 668.2 739.2 868.2 969.2 1083.1 1230.6 1358.3 1429.4 1500.4 1571.5 1672.3 1743.5 1814.4 1915.5
68Automation
Automation is critical to maintain a high
throughput of samples. It is essential to produce
closer integration of machine control and data
analysis software
- New generation of mass spectrometers: quadrupole machines with LASER sources
- Laboratory Information Management Systems
- Automated sample preparation
69Laboratory Information Management System
Mass Spectrometer
Data Reduction Peak Processing
Submission into Microarray/Proteomics database
MASCOT Search Engine
Re-search after database updates
Protein Identified
Protein not Identified
Automatic report generation for sample submitter
Via WWW
Results database
70Future of MASCOT
- Homology searching
- Post processing of results for easier interpretation
- Distributed processing on a Linux cluster. MASCOT is based on the Boss/Worker model, so it is easy to port
- Development of a standard API to allow simpler automation and extensions to functionality
71MASCOT Homology Searching
- Identification is dependent on at least some of the peptide sequences being identical to a database sequence
- Homology searching (for instance allowing common substitutions to occur by default) would overcome this limitation
- It would lead to less selectivity and also increased search times
72Post processing of Results
- Allows easier interpretation by, e.g., removing all identical peptide matches from the report page
- Text mining to interpret the results of a search: for instance, are all the proteins identified involved in a particular cellular process?
- Important when dealing with quantitative studies
73Distributed Processing
- Ability to use as much processing power as possible when dealing with high throughput data, for instance the thousands of peptides from LC-MS/MS
- Implemented in MASCOT using an MPI-style mechanism
- Has the ability to dynamically add/remove processors for data processing
74Processing Farm
75Standard API
- A standard interface to MASCOT routines allowing users to, e.g., produce a bespoke interface
- Allows integration with instrument control software (although this is dependent on the goodwill of the manufacturers!)
76MSDB developments
- Inclusion of variable splicing regions from SWISSPROT
- Integration of textual information from all source databases
- Clustering of highly similar sequences into families with extra annotation
- Inclusion of more translations from nucleic acid databases
77Identification of proteins using short peptide sequences
- FASTS is the most commonly used tool at the moment, but it is relatively slow and does not take into account peptide masses and other information
- New functionality for MASCOT based on tri-peptide indices and using mass and residue modification information
78MS/MS Data Mining
- MS/MS data may contain useful information in addition to sequence
- Statistical methods for mining MS/MS data for, e.g., fragmentation efficiency
- Predictive tool for de-novo sequencing
- Understanding of physical/chemical processes involved in fragmentation
79Matrix Science
- Dr. John Cottrell
- Dr. David Creasy
- URL http://www.matrixscience.com
80Imperial College
- Darryl Pappin (now director of research ABI)
- Mike Bartlet-Jones
- URL http://csc-fserve.hh.med.ic.ac.uk