Computer Analysis of Mass Spectrometry Data - PowerPoint PPT Presentation

About This Presentation

Title:

Computer Analysis of Mass Spectrometry Data

Description:

Software and computational techniques for the identification of proteins and ... Supernatant transferred to fresh eppendorf. Sample transferred to target plate ... – PowerPoint PPT presentation

Slides: 81

Provided by: proteinseq

Transcript and Presenter's Notes

Title: Computer Analysis of Mass Spectrometry Data

1
Computer Analysis of Mass Spectrometry Data

David Perkins
Proteomics Section,
Hammersmith Hospital Campus,
Imperial College School of Medicine.

2
Introduction

Software and computational techniques for the
identification of proteins and residue
modifications using both MALDI and MS/MS data.

4
Peptide Mass Fingerprinting
A mass spectrum of the peptide mixture resulting
from the digestion of a protein by a
proteolytic enzyme

Choice of Enzyme
Missed Cleavages
Search Masses
Constraining the Protein Molecular Weight
Which masses to include in a search
Autolysis products
Modifications

5
Enzymatic Cleavage
(Diagram labels: Native Protein, Enzyme, Peptide Fragments)
6
Choice of Enzyme

Enzymes of low specificity are next to useless as
they produce a complex mixture of similar masses
For MALDI, peptides with masses below 500 Da
should be avoided

7
Enzyme Specificity
Enzyme         Cleave At   Don't Cleave   N or C term
Trypsin        KR          P              C
Lys-C          K           P              C
Lys-C/P        K           (none)         C
Arg-C          R           P              C
V8-E           E           P              C
V8-DE          DE          P              C
Chymotrypsin   FYWLIVM     P              C
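
The cleavage rules in this table can be sketched in code. This is a minimal illustration, not MASCOT's implementation. It assumes every enzyme cuts on the C-terminal side of the listed residues and that the "Don't Cleave" column means "do not cut when this residue follows". The names ENZYMES and digest are hypothetical.

```python
# Rules transcribed from the slide: (residues to cut after, residues that
# block the cut when they come next). All enzymes listed cut C-terminally.
ENZYMES = {
    "Trypsin": ("KR", "P"),
    "Lys-C": ("K", "P"),
    "Lys-C/P": ("K", ""),
    "Arg-C": ("R", "P"),
    "V8-E": ("E", "P"),
    "V8-DE": ("DE", "P"),
    "Chymotrypsin": ("FYWLIVM", "P"),
}


def digest(sequence, enzyme="Trypsin"):
    """Return the peptides of a complete digest (no missed cleavages)."""
    cleave_at, blocked_by = ENZYMES[enzyme]
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        # Cut after this residue unless it is the last one or the cut is blocked.
        if residue in cleave_at and next_residue and next_residue not in blocked_by:
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


print(digest("AKRPGKMR"))            # trypsin skips the R-P bond
print(digest("AKPGK", "Lys-C/P"))    # Lys-C/P cuts before proline as well
```

The made-up sequences in the two calls only show the effect of the proline rule.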
8
Missed Cleavages

Digests are usually not perfect
Cleavage sites may be missed by an enzyme
These partially cleaved peptides are known as
partials
Partials reduce the discrimination of a search
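
A search that allows for partials has to consider runs of adjacent fragments as well as the fully cleaved ones. A minimal sketch, assuming the fully cleaved peptides are already available in sequence order; the function name and the max_missed parameter are hypothetical.

```python
def with_missed_cleavages(fragments, max_missed=1):
    """Join up to max_missed + 1 adjacent fragments to model partials."""
    peptides = []
    for start in range(len(fragments)):
        for n_joined in range(1, max_missed + 2):
            if start + n_joined > len(fragments):
                break
            peptides.append("".join(fragments[start:start + n_joined]))
    return peptides


# Three fully cleaved peptides give five candidates with one missed cleavage.
print(with_missed_cleavages(["AK", "RPGK", "MR"], max_missed=1))
```

Each extra missed cleavage enlarges the candidate list, which is one way to see why partials reduce discrimination.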

9
Search Masses

Select masses which are large enough to provide
discrimination
Larger masses are more likely to be partials
With Trypsin, a mass range of 1000 to 3000 Da is
good
Mass tolerance is important in obtaining good
discrimination
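
The mass window and the tolerance can be expressed as two small helpers. This is a sketch only: the 1000 to 3000 Da window comes from the slide, while the function names and the choice of Da or ppm units are assumptions.

```python
def select_search_masses(peaks, low=1000.0, high=3000.0):
    """Keep only the peaks inside the mass window used for the search."""
    return [m for m in peaks if low <= m <= high]


def within_tolerance(observed, theoretical, tol, unit="Da"):
    """True if the two masses agree within tol (absolute Da or relative ppm)."""
    limit = tol if unit == "Da" else theoretical * tol / 1e6
    return abs(observed - theoretical) <= limit


peaks = [450.2, 1234.5, 2999.9, 3500.0]
print(select_search_masses(peaks))
print(within_tolerance(1234.6, 1234.5, 0.2))              # loose tolerance
print(within_tolerance(1234.6, 1234.5, 50, unit="ppm"))   # tight tolerance
```

A tighter tolerance admits fewer chance matches, which is where the gain in discrimination comes from.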

10
Constraining Protein Mass

To increase discrimination, the mass of the
intact protein can be used in a search
This is dangerous, since the sample may be just a
fragment of an entire protein

11
Which Masses to Include?
The optimum dataset for a peptide mass
fingerprint is all the correct peptides and none
of the wrong ones! By correct, we mean that the
textbook cleavage rules were followed. In
practice, this rarely (if ever) happens.

Enzymatic cleavage not perfect
Sequence coverage may be poor
Noise

12
Autolysis Products

Some digests may be dominated by the autolysis
peaks of the enzyme used
In these cases, the known masses of these
products may be filtered out before searching
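
Filtering known autolysis masses is a simple list operation. In this sketch the two trypsin values are approximate, commonly quoted [M+H]+ figures and should be treated as an assumption to check against your own enzyme. The names are hypothetical.

```python
# Approximate [M+H]+ masses often quoted for porcine trypsin autolysis
# peptides. Assumption for illustration only; verify for the enzyme in use.
TRYPSIN_AUTOLYSIS = [842.51, 2211.10]


def remove_autolysis(peaks, known=TRYPSIN_AUTOLYSIS, tol=0.2):
    """Drop every peak that lies within tol Da of a known autolysis mass."""
    return [p for p in peaks if all(abs(p - k) > tol for k in known)]


print(remove_autolysis([842.5, 1234.5, 2211.2]))
```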

13
Residue Modifications

Some residues may be modified during the sample
preparation procedure
This introduces discrepancies in the expected and
observed masses
For example, Met residues are often oxidised
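
A variable modification such as Met oxidation adds a fixed mass for each modified residue, so a peptide with n methionines has n + 1 possible masses. A minimal sketch, assuming the monoisotopic mass of one oxygen atom (15.9949 Da) as the shift; the function name is hypothetical.

```python
OXIDATION = 15.9949  # monoisotopic mass of one added oxygen atom, in Da


def variable_oxidation_masses(peptide, base_mass):
    """All masses the peptide can show with 0..n of its Met residues oxidised."""
    n_met = peptide.count("M")
    return [base_mass + k * OXIDATION for k in range(n_met + 1)]


# Two methionines give three candidate masses for one peptide.
print(variable_oxidation_masses("MAMK", 1000.0))
```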

14
Sample Preparation for MALDI

Excise band from gel
Tryptic Digestion of gel fragment
Supernatant transferred to fresh eppendorf
Sample transferred to target plate

15
Sample Preparation Robot
16
MALDI Mass Spectrometer

Ions are generated by a laser firing at the
target plate
Because the time the laser fires and the arrival
time of the ions at the detector are both known,
the relative masses can be calculated
Only singly charged ions are generated; other
types of spectrometer may generate multiply
charged ions
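
The link between flight time and mass can be sketched for an idealised linear time-of-flight tube, where an ion of charge z accelerated through a voltage V travels a field-free length L. This textbook simplification ignores delayed extraction and reflectrons, and real instruments are calibrated against standards. The voltage and tube length below are illustrative assumptions.

```python
E_CHARGE = 1.602176634e-19    # elementary charge in coulombs
DALTON = 1.66053906660e-27    # one dalton in kilograms


def flight_time(mz, voltage, length):
    """Flight time in seconds, from z*e*V = m*v**2/2 and t = L/v."""
    return length * (mz * DALTON / (2 * E_CHARGE * voltage)) ** 0.5


def mz_from_flight_time(t, voltage, length):
    """Invert the relation above: m/z in Da per unit charge."""
    return 2 * E_CHARGE * voltage * t ** 2 / (length ** 2 * DALTON)


# Illustrative values: 20 kV acceleration and a 1 m flight tube.
t = flight_time(1500.0, 20000.0, 1.0)
print(t, mz_from_flight_time(t, 20000.0, 1.0))
```

Flight time grows with the square root of m/z, so doubling the mass lengthens the flight by a factor of about 1.41.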

17
MALDI Internals
18
Micromass MALDI
19
Typical Fingerprint Spectrum
20
Isotopic Cluster
21
Poorly Resolved Peak
22
Database Searching with Peptide Mass Fingerprints

Produce a theoretical digest of all the proteins
in a database with a specific enzyme
Compare these theoretical masses with
experimentally observed masses
Assign a score to matching peptides/proteins
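
The three steps above can be sketched end to end. The score here is simply the number of observed masses explained by a protein. That is a deliberately naive stand-in and not the MOWSE or MASCOT scoring scheme. Masses are neutral monoisotopic values, and the two database entries are made-up sequences.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056


def peptide_mass(peptide):
    """Neutral monoisotopic mass: residues plus one water."""
    return sum(RESIDUE_MASS[r] for r in peptide) + WATER


def tryptic_digest(sequence):
    """Cut after K or R unless the next residue is P."""
    peptides, start = [], 0
    for i in range(len(sequence) - 1):
        if sequence[i] in "KR" and sequence[i + 1] != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    peptides.append(sequence[start:])
    return peptides


def search(observed, database, tol=0.2):
    """Rank entries by how many observed masses their digest explains."""
    scores = {}
    for name, sequence in database.items():
        theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
        scores[name] = sum(
            any(abs(obs - theo) <= tol for theo in theoretical) for obs in observed
        )
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


database = {"protA": "GASPKVTCLR", "protB": "NDQKEMHFR"}   # hypothetical entries
observed = [peptide_mass("GASPK"), peptide_mass("VTCLR")]
print(search(observed, database))
```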

23
Problems

Mixtures and contamination
Partial cleavage
Identifying real peaks
Residue modifications
Mass accuracy

24
MOWSE

One of the first programs for identifying
proteins by peptide mass fingerprinting
Developed by Darryl Pappin and Alan Bleasby
Developed alongside the OWL non-redundant protein
database

25
Problems with MOWSE

Databases had to be pre-indexed; these indexes
are large and slow to build
Does not handle variable modifications
Indexing means that databases cannot easily be
updated regularly
Limited functionality

26
MASCOT

Takes advantage of multi-processor systems
Totally web based
No pre-indexing of databases
Increased functionality
Copes with multiple modifications
Easily expandable
Increased speed

27
Search Speed
Search speed is very important as databases
increase in size and automation leads to a high
throughput of samples. Also, if the algorithms
are efficient, more elaborate searches may be
undertaken, for instance with large numbers of
variable residue modifications and different
mass tolerances, to make more sense of data
derived from mixtures or contaminated samples.

Ability to use multiple processors when available
Very efficient I/O, databases may also be mapped
to memory
Efficient cleavage site and mass calculation

28
Thread Models

Boss/Worker
Peer
Pipeline
MASCOT is based on the Boss/Worker model

29
Boss/Worker Model
(Diagram label: Output)
The Boss accepts input and then distributes the
work to other threads
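
The Boss/Worker pattern can be sketched with Python's standard threading and queue modules. This shows the general pattern only and says nothing about how MASCOT itself is written. The function name and the squaring task are hypothetical.

```python
import queue
import threading


def boss_worker(tasks, work, n_workers=4):
    """The boss queues tasks; worker threads pull them and report results."""
    todo, results = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = todo.get()
            if item is None:              # sentinel: no more work
                break
            index, task = item
            results.put((index, work(task)))

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for indexed_task in enumerate(tasks):  # the boss distributes the work
        todo.put(indexed_task)
    for _ in threads:                      # one sentinel per worker
        todo.put(None)
    for t in threads:
        t.join()
    collected = [results.get() for _ in range(results.qsize())]
    return [value for _, value in sorted(collected)]


print(boss_worker([1, 2, 3, 4, 5], lambda x: x * x, n_workers=3))
```

Tagging each task with its index lets the boss return results in input order, whichever worker finishes first.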
30
Peer Model
(Diagram labels: Output, shown three times)
Each Thread is responsible for its own input
31
Pipeline Model
(Diagram labels: Input Stream, Thread A, Thread B,
Thread C, Output)
A single thread accepts input, passing the data
on to the next thread for further processing
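
A pipeline of threads can be sketched the same way, with one queue between each pair of stages. This is a generic illustration using Python's standard library; run_pipeline and the stage functions are hypothetical.

```python
import queue
import threading


def run_pipeline(items, stages):
    """Run each stage in its own thread, linked to the next by a queue."""
    done = object()                        # end-of-stream marker
    queues = [queue.Queue() for _ in range(len(stages) + 1)]

    def run_stage(stage, inbox, outbox):
        while True:
            item = inbox.get()
            if item is done:
                outbox.put(done)           # pass the marker downstream
                break
            outbox.put(stage(item))

    threads = [
        threading.Thread(target=run_stage, args=(stage, queues[i], queues[i + 1]))
        for i, stage in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for item in items:                     # the input stream
        queues[0].put(item)
    queues[0].put(done)
    for t in threads:
        t.join()
    output = []
    while True:
        item = queues[-1].get()
        if item is done:
            return output
        output.append(item)


stages = [str.upper, lambda s: s + "!", lambda s: s * 2]
print(run_pipeline(["a", "b"], stages))
```

Each stage has a single thread reading from a FIFO queue, so items leave the pipeline in the order they entered.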
36
Related Search Methods

Masses may be combined with sequence information
1234.5 seq(c-ABCD) seq(EF)
These searches are very valuable as even small
amounts of sequence information may be very
discriminating
Sequence information is derived from the partial
interpretation of a MS/MS spectrum
Known as the sequence tag method
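
A query that combines a mass with sequence fragments can be sketched as a filter over candidate peptides. The reading of the qualifiers is an assumption: a "c-" prefix is taken to anchor the fragment at the C-terminus, "n-" at the N-terminus, and no prefix to mean anywhere in the peptide. The function and its tuple format are hypothetical.

```python
def matches_tag_query(peptide, peptide_mass, query_mass, tags, tol=0.5):
    """Check a peptide against a mass plus a list of (anchor, residues) tags.

    anchor is "n", "c" or None (assumed meanings: N-terminal, C-terminal,
    anywhere in the peptide).
    """
    if abs(peptide_mass - query_mass) > tol:
        return False
    for anchor, residues in tags:
        if anchor == "c":
            ok = peptide.endswith(residues)
        elif anchor == "n":
            ok = peptide.startswith(residues)
        else:
            ok = residues in peptide
        if not ok:
            return False
    return True


# The slide's example, 1234.5 seq(c-ABCD) seq(EF), with placeholder residues.
tags = [("c", "ABCD"), (None, "EF")]
print(matches_tag_query("GEFGABCD", 1234.6, 1234.5, tags))
```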

37
Composition Queries

Composition information may also be used with
mass information to refine queries
Chemical or enzymatic analysis, such as
N-terminal analysis by Edman degradation, may
give composition information
A typical query would be:
1234.5 comp(2H0M)
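
A composition constraint can be checked by counting residues. The sketch reads "2H0M" as exactly two His and no Met, which is an assumption about the notation, since the transcript may have lost punctuation from the original slide. The function names are hypothetical.

```python
import re


def parse_composition(spec):
    """Turn a string such as '2H0M' into {'H': 2, 'M': 0}."""
    return {residue: int(count) for count, residue in re.findall(r"(\d+)([A-Z])", spec)}


def satisfies_composition(peptide, spec):
    """True if the peptide has exactly the stated count of each residue."""
    return all(peptide.count(res) == n for res, n in parse_composition(spec).items())


print(parse_composition("2H0M"))
print(satisfies_composition("AHGHK", "2H0M"))    # two His, no Met
print(satisfies_composition("AHGHMK", "2H0M"))   # contains a Met
```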

38
MASCOT Queries

One of the most powerful features of MASCOT is
the ability to mix all the types of query in one
search
MASCOT allows the user to specify a particular
species to further increase search discrimination

39
Databases Searched with Peptide Mass Fingerprint
Data

Non-identical protein databases are the ideal
EST sequences are too short to contain meaningful
information for these searches
Non-redundant databases may be problematic
MASCOT translates nucleic acid databases on the
fly

40
MSDB

A non-identical protein sequence database
designed for mass spectrometry searches
Additional information, such as multiple species
lines, in the textual information
De-convolution of SWISSPROT and other sequences
Nightly updates
Links to source databases

41
Is the Protein Identified?

Most samples are identified using just peptide
mass fingerprinting
With the growth of databases, this trend will
continue
Some samples do not have representatives in any
of the databases; to sequence these proteins, more
analysis is required

42
MS/MS Analysis

Also known as tandem MS
Individual peptides from the enzymatic digests
are fragmented further
From the resulting fragment ladder, sequences may be reconstructed
Much more discriminating search than simple
peptide mass fingerprinting

43
MS/MS Analysis

Carried out on nanospray/electrospray mass
spectrometers
Rather than being spotted on a target plate, the
sample is introduced through an inlet from a capillary
Peptides identified by the MALDI analysis are
fragmented inside the mass spectrometer and the
resultant daughter ions observed

44
Stylized Nanospray Mass Spectrometer
45
Micromass QTOF
46
Finnigan Ion Trap
47
Daughter Ions

Unlike MALDI, ions produced by
electrospray/nanospray machines may carry
multiple charges
Various types of ions are produced, categorized
by their charge and their direction in the
peptide sequence
Fortunately the peptides fragment at the peptide
bonds

48
B and Y Fragment Ions
Y-ions from C to N terminus
B-ions from N to C terminus
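
The two ion series can be computed directly from residue masses. This sketch covers singly charged ions only: a b-ion is the sum of its N-terminal residues plus a proton, and a y-ion is the sum of its C-terminal residues plus water and a proton. The function name is hypothetical.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [RESIDUE_MASS[r] for r in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)            # N-terminal piece
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)   # C-terminal piece
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("GAK")
print([round(m, 3) for m in b_ions])
print([round(m, 3) for m in y_ions])
```

For a tryptic peptide ending in K, the first y-ion falls near 147.11, a familiar marker in real spectra.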
49

(Figure: mass spectrum; m/z axis 500 to 2000;
labelled peaks at 1571.45, 1725.29 and 1814.37)
50
MASCOT Searches with MS/MS Data

In a similar fashion to peptide mass
fingerprinting, the predicted fragment ion masses
for each peptide of a database sequence are
calculated
The calculated and observed ion masses are
compared and given a score
Individual peptide scores are combined to give a
protein score
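
The compare-and-combine step can be sketched as follows. Counting matched fragments and summing the counts is a naive stand-in chosen for brevity and is not MASCOT's scoring. The function names and the data layout (dictionaries keyed by peptide) are assumptions.

```python
def peptide_score(observed, predicted, tol=0.5):
    """Number of predicted fragment masses with an observed peak within tol."""
    return sum(any(abs(o - p) <= tol for o in observed) for p in predicted)


def protein_score(spectra, predictions, tol=0.5):
    """Sum the peptide scores for every peptide that has a spectrum."""
    return sum(
        peptide_score(spectra[pep], predictions[pep], tol)
        for pep in predictions
        if pep in spectra
    )


predictions = {"GAK": [58.03, 147.11, 218.15], "XYZ": [100.0]}  # predicted ions
spectra = {"GAK": [147.1, 218.2, 500.0]}                        # observed peaks
print(peptide_score(spectra["GAK"], predictions["GAK"]))
print(protein_score(spectra, predictions))
```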

51
Problems with MS/MS data

The number of daughter ion types produced may be
large and depends on the instrument and analytical
procedure used
Searches tend to be run with a no-enzyme
option, which introduces a large number of
calculations
Residue modifications are far more difficult to
handle, as the number of mass permutations is
very large

62
Databases Searched with MS/MS Data

Non-identical protein databases are the ideal
EST databases translated in 6 frames are very
useful as individual peptides may be identified
Translated nucleic acid databases
Non-redundant databases create problems
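
Six-frame translation of a nucleic acid entry can be sketched with the standard genetic code. This illustrates the idea only and is not MASCOT's translator. Stop codons are written as "*" and unrecognised codons as "X"; the function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate one reading frame; '*' marks a stop, 'X' an unknown codon."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:]) for strand in (dna, reverse) for offset in range(3)]


print(translate("ATGGCCTAA"))
print(six_frame_translation("ATGGCC"))
```

Short EST reads rarely cover a whole protein, but any frame that contains the observed peptide is enough for an MS/MS match.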

63
De-Novo Sequencing

If the protein is still not identified, the
sequence of a peptide has to be reconstructed
from the MS/MS data
Very time-consuming and demands a great deal of
skill; noisy data are very problematic
Sequencing is carried out by finding mass
differences between peaks that correspond to
amino acid masses
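
Reading a ladder by mass differences can be sketched as below. The five peaks are taken from the y-ion series listed on slide 66 of this deck. The 0.3 Da tolerance and the function name are assumptions. Leu and Ile have the same mass, so they are reported together.

```python
# Monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def read_ladder(peaks, tol=0.3):
    """Label each gap between neighbouring peaks with the residues that fit."""
    ordered = sorted(peaks, reverse=True)
    calls = []
    for high, low in zip(ordered, ordered[1:]):
        gap = high - low
        fits = [res for res, mass in RESIDUE_MASS.items() if abs(gap - mass) <= tol]
        calls.append("/".join(fits) or "?")   # "?" when no residue fits the gap
    return calls


print(read_ladder([1933.7, 1862.7, 1763.7, 1634.6, 1521.6]))
```

With noisy data, spurious peaks create gaps that match no residue or the wrong one, which is why manual interpretation takes skill.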

64
Tags

Easy to find initial masses in ladder
Tags modify the fragmentation of the peptide
Reduce isobaric problems
Neutralise the adverse effects of certain
residues on peptide fragmentation

65
Example Tags
66

y-ion series:
2044.9 1933.7 1862.7 1763.7 1634.6 1521.6
1450.4 1321.4 1220.3 1106.5 959.4 831.3
760.4 689.3 618.2 517.0 446.3 375.1

Sequence:
Gln Ala Val Glu Xle
Ala Glu Thr Asn Phe
Gln Ala Ala Ala Thr
Ala Ala Thr Lys

b-ion series:
256.1 327.3 426.2 555.5 668.2
739.2 868.2 969.2 1083.1 1230.6 1358.3
1429.4 1500.4 1571.5 1672.3 1743.5 1814.4 1915.5

(Figure: spectrum annotated with the y-ion and
b-ion series; m/z axis 500 to 2000; labelled
peaks at 1571.45, 1725.29 and 1814.37)
68
Automation
Automation is critical to maintain a high
throughput of samples. It is essential to produce
closer integration of machine control and data
analysis software

New generation of Mass Spectrometers, quadrupole
machines with LASER sources
Laboratory Information Management Systems
Automated sample preparation

69
Laboratory Information Management System
Mass Spectrometer
Data Reduction Peak Processing
Submission into Microarray/Proteomics database
MASCOT Search Engine
Re-search after database updates
Protein Identified
Protein not Identified
Automatic report generation for sample submitter
Via WWW
Results database
70
Future of MASCOT

Homology searching
Post processing of results for easier
interpretation
Distributed processing - Linux cluster. MASCOT is
based on the Boss/Worker model so is easy to port
Development of a standard API to allow simpler
automation and extensions to functionality

71
MASCOT Homology Searching

Identification is dependent on at least some of
the peptide sequences being identical to a
database sequence
Homology searching (for instance allowing common
substitutions to occur by default) would overcome
this limitation
It would lead to lower selectivity and increased
search times

72
Post processing of Results

Allows easier interpretation by, e.g. removing
all identical peptide matches from the report
page
Text mining to interpret the results of a search;
for instance, are all the proteins identified
involved in a particular cellular process?
Important when dealing with quantitative studies

73
Distributed Processing

Ability to use as much processing power as
possible when dealing with high throughput data,
for instance the thousands of peptides from LC
MS/MS
Implemented in MASCOT using an MPI-style
mechanism, with the ability to dynamically add or
remove processors for data processing

74
Processing Farm
75
Standard API

A standard interface to MASCOT routines, allowing
users to, e.g., produce a bespoke interface
Allows integration with instrument control
software (although this is dependent on the
goodwill of the manufacturers!)

76
MSDB developments

Inclusion of variable splicing regions from
SWISSPROT
Integration of textual information from all
source databases
Clustering of highly similar sequences into
families with extra annotation
Inclusion of more translations from nucleic acid
databases

77
Identification of proteins using short peptide
sequences

FASTS is the most commonly used tool at the moment,
but it is relatively slow and doesn't take into
account peptide masses and other information
New functionality for MASCOT based on tri-peptide
indices and using mass and residue modification
information
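
The slide gives no detail on the planned tri-peptide indices, so the following is only a guess at the general idea: index every overlapping three-residue word, then keep the entries that share all the words of a short query peptide. The names and the two database entries are hypothetical, and the mass and modification information mentioned above is not modelled.

```python
from collections import defaultdict


def build_tripeptide_index(database):
    """Map each overlapping three-residue word to the entries containing it."""
    index = defaultdict(set)
    for name, sequence in database.items():
        for i in range(len(sequence) - 2):
            index[sequence[i:i + 3]].add(name)
    return index


def candidates(index, peptide):
    """Entries that contain every tripeptide of the query (a pre-filter only)."""
    words = [peptide[i:i + 3] for i in range(len(peptide) - 2)]
    if not words:
        return set()
    return set.intersection(*(index.get(word, set()) for word in words))


database = {"p1": "MKTAYIAK", "p2": "GGTAYLL"}   # made-up sequences
index = build_tripeptide_index(database)
print(candidates(index, "TAYI"))
print(candidates(index, "TAY"))
```

Sharing every word does not guarantee the full peptide is present, so the surviving entries would still need an exact check.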

78
MS/MS Data Mining