Computational Simulations for Mass Spectrometry based identification of Biological Weapons

Slides: 31

Provided by: rzw

Transcript and Presenter's Notes

1
Computational Simulations for Mass Spectrometry
based identification of Biological Weapons
2
Introduction to Mass Spectrometry

Mass spectrometry (MS) is an analytical technique
that accurately measures the masses of
biological molecules.
The mass of a protein with a known sequence can easily
be calculated theoretically by adding the masses of
its individual amino acids.
Proteins are identified by comparing
the theoretical masses in a
database with the experimentally measured masses.
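
The mass calculation described above can be sketched in a few lines of Python. This is only an illustration, not the presentation's code: the table holds standard monoisotopic residue masses, the function name `protein_mass` is hypothetical, and one water is added for the free termini, a detail the slide leaves implicit.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O, accounts for the free N- and C-termini


def protein_mass(sequence):
    """Theoretical neutral mass: sum of residue masses plus one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


print(round(protein_mass("MRIMVR"), 4))
```

A database of theoretical masses is then just this function applied to every known sequence.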

3
Organism Identification

Problem
Identify an organism in a complex background
mixture by mass measurements.
Is it possible?
What approaches should be used?
What accuracy is required?

4
Simulation Preparation

Select a set of proteins that can be reliably
measured by mass spectrometry (high intensity,
observed in nature, no expected
modifications).
Select a suitable background for the simulation
that is a good representation of the complexity
of real-world mixtures.

5
Sample

Test organism: E. coli
376 E. coli proteins were selected as
representative of the organism based on their
abundance as detected in two other microbial MS
studies.
The background set includes 13
organisms and 83,777 proteins:
A. thaliana, B. anthracis (genes at NCBI
predicted by GeneMark), B. fungorum, D.
radiodurans, G. metallireducens, N. europaea, P.
aeruginosa, R. palustris, S. putrefaciens, S.
cerevisiae, Y. pestis CO92, Y. pestis KIM

6
Bottom up approach
Example protein sequence:
MRIMVRTLRGDRVALDVDGATTT
VAQVKGMVMARERIAVAMQRLFFAGRCLDDDHRTLADYGVRHDSVVF
LSLRLATDAYQTEMHNVRLMQPETATAKQEMHQQQQQQLHVHVAADDEEK
AIKRKPVSRRALRKILSRLQ
Example digest: MR IMVR TLR GDR VALDVDGATTTVAQVK
GMVMAR ER IAVAMQR LFFAGR CLDDDHR TLADYGVR HDSVVFLSLR

The proteins in the representative E. coli set
and the proteins in the background set are broken
into pieces in silico (e.g. cut at K and R amino
acids). Up to 4 missed cleavages are allowed.
There are 70,000 unique peptides in the
representative set and 125 million unique
peptides in the background set.
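
The in silico digestion described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `digest` is hypothetical, cuts are made after every K or R exactly as the slide says (real trypsin rules, such as the proline exception, are ignored), and a peptide with m missed cleavages is modeled as m + 1 adjacent pieces joined together.

```python
def digest(sequence, max_missed=4):
    """Cut after every K or R, then join up to max_missed + 1 adjacent pieces."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in "KR":
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal piece with no K/R at its end

    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed, len(pieces) - 1)
        for j in range(i, last + 1):
            peptides.add("".join(pieces[i:j + 1]))
    return peptides


print(sorted(digest("MRIMVRTLRGDR", max_missed=1)))
```

Returning a set keeps only unique peptides, which is the quantity the slide counts for the representative and background sets.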

7
Bottom up identification
As with the proteins, this slide shows the percentage of unique E.
coli tryptic peptides as a function of
experimental error. Peptides are significantly
smaller than proteins and have less
discriminative power. With a perfect measurement,
only 24% of peptides are unique to the representative
set in the given background,
and only 8% are unique at
±0.01 Da accuracy. As the
measurement uncertainty increases further, the number of
unique peptides diminishes and the size of the
remaining unique peptides increases (shorter peptides can no
longer be told apart at low accuracy).
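
The uniqueness measurement discussed above can be sketched as a tolerance search over sorted masses. This is a hedged illustration, not the authors' simulation: `fraction_unique` is a hypothetical name, and a representative mass is treated as unique when no background mass lies within plus or minus the tolerance.

```python
from bisect import bisect_left, bisect_right


def fraction_unique(rep_masses, background_masses, tol):
    """Fraction of representative masses with no background mass within +/- tol."""
    background = sorted(background_masses)
    unique = 0
    for m in rep_masses:
        lo = bisect_left(background, m - tol)
        hi = bisect_right(background, m + tol)
        if lo == hi:  # no background mass falls in [m - tol, m + tol]
            unique += 1
    return unique / len(rep_masses)


# Toy masses, invented purely for illustration.
rep = [500.00, 750.30, 1200.55]
background = [500.004, 750.80, 1500.0]
for tol in (0.001, 0.01, 1.0):
    print(tol, fraction_unique(rep, background, tol))
```

Sorting the background once and using binary search keeps each lookup logarithmic, which matters when the background holds on the order of 125 million peptides.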
8
Sequence information

Mass spectrometers can also fragment
molecules, which yields sequence information
(MS/MS, or tandem MS).
This sequence information can be an invaluable
source of additional discrimination.
Adding MS/MS information to the analysis
can greatly improve the discriminative power of the
bottom-up approach.

9
Peptide: MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER
b-ion: MRIMVRTLRGDRVALDVD
y-ion: GATTTVAQVKGMVMARER
Assumption: the peptide breaks between every two
amino acids, providing a unique sequence pattern.
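
Under the stated assumption that every bond between adjacent amino acids breaks, the fragment masses can be enumerated as below. This is a sketch using the conventional definitions (a b-ion is the N-terminal piece plus a proton, a y-ion is the C-terminal piece plus water and a proton, both singly charged); the name `fragment_ions` is hypothetical and the residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in daltons (standard values, 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions): singly charged m/z for every cut position."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for cut in range(1, len(peptide)):
        b_ions.append(sum(masses[:cut]) + PROTON)           # N-terminal part
        y_ions.append(sum(masses[cut:]) + WATER + PROTON)   # C-terminal part
    return b_ions, y_ions


b, y = fragment_ions("GMVMAR")
print(len(b), len(y))
```

A peptide of length n yields n - 1 b-ions and n - 1 y-ions, and the two lists together form the sequence-specific pattern the slide refers to.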
10
Experiment

Given:
The background dataset
The representative dataset
Accuracy: ±10 Da
Bottom-up experiment
Can we:
Identify the absence of the representative dataset in the
background dataset?

11
Definitions

PM(sequence) = \sum_{i=1}^{n} aa_i, where n is
the length of the protein sequence and aa_i is the
mass of the i-th amino acid (parent mass).
TM(sequence): a list of b-ions and y-ions
(tandem MS).
F(t),
where t is a displacement in the range -10..10.
Score(f_pattern1, f_pattern2)

12
Sequential Algorithm

Select a peptide a from the representative list.
Find the set (b0 .. bn) from the background list
such that
|PM(a) - PM(bi)| < Error, for i = 0..n.
For all bi, compute
score(TM(a), TM(bi)).
Return the average score and the highest score,
where a != bi.
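
The sequential algorithm can be sketched as a single function. This is an illustration under loud assumptions: `pm` and `tm` are passed in as callables standing for PM and TM, and because the slide's Score formula is not given in the transcript, `shared_peak_score` is a placeholder (a simple count of matching peaks), not the authors' scoring function.

```python
def shared_peak_score(peaks_a, peaks_b, tol=0.5):
    """Placeholder score (an assumption): peaks of a matched by some peak of b."""
    return sum(any(abs(p - q) <= tol for q in peaks_b) for p in peaks_a)


def compare_to_background(a, background, pm, tm, error, score=shared_peak_score):
    """Score peptide a against every background peptide of similar parent mass.

    Returns (average score, highest score), or None when no candidate exists.
    """
    candidates = [b for b in background
                  if b != a and abs(pm(a) - pm(b)) < error]
    if not candidates:
        return None
    scores = [score(tm(a), tm(b)) for b in candidates]
    return sum(scores) / len(scores), max(scores)


# Toy parent masses and fragment lists, invented for illustration only.
toy_pm = {"AAA": 100.0, "AAB": 100.5, "CCC": 300.0}
toy_tm = {"AAA": [10, 20, 30], "AAB": [10, 20.2, 50], "CCC": [5, 6, 7]}
print(compare_to_background("AAA", list(toy_pm), toy_pm.get, toy_tm.get, error=10))
```

A high best score against the background means some background peptide resembles a closely, so a is a poor discriminator for the organism.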

13
(No Transcript)
14
(No Transcript)
15
Run time estimation
16
Parallel Algorithm Considerations

Possible parallelizations:
Level 1: data parallel over the representative set.
Level 2: parallelization while working on a set
(b0 .. bn).
Level 3: parallelization within the Score loop.
Is there a best parallelization technique?
Which one should be chosen?

17
Level1 parallelization
18
Level2 parallelization
19
Level3 parallelization
for (i = -10; i < 10; i++) Score
-10 -9 -8 -7 -6 -5 ... 5 6 7 8 9 10
processor 1 ... processor N
20
Parallelization

Main problems to consider
Synchronization
Amount of communication between processors
Load balancing
Slow/fast processors
Fault tolerance

21
Decision

Create a client/server system:
Parent -> server
Children -> clients
Split the representative set into N subsets (so that
N >> number of processors).
The amount of data in a subset can be constant, but
small.
The number of subsets should not be too large, to minimize the
communication between parent and children.
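
The splitting step can be sketched as follows. This is a minimal illustration, with the hypothetical helper `split_into_subsets` and an arbitrary subset size; the presentation does not state the size it used.

```python
def split_into_subsets(items, subset_size):
    """Split items into consecutive subsets of a constant, small size."""
    return [items[i:i + subset_size]
            for i in range(0, len(items), subset_size)]


peptides = ["MR", "IMVR", "TLR", "GDR", "GMVMAR", "ER", "LFFAGR"]
for subset in split_into_subsets(peptides, 3):
    print(subset)
```

Choosing the subset size is the trade-off the slide describes: smaller subsets balance load better across slow and fast processors, while larger subsets reduce the number of messages between parent and children.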

22
Parent
Spawn children
Init children
If starting from the beginning: create Task-list
If resuming from Mid Point: read Task-list
While Task-list > 0 (true):
1. Listen for work request
2. If (request received)
a. Send work to child
c. Decrement Task-list
d. Record child's id
3. Listen for result message
4. Print out unfinished tasks to Mid Point
5. If (child_died)
a. Add child's last task back to Task-list
If (result received) print out result
When Task-list is empty (false):
Broadcast quit signal
Exit
23
Child
Receive initialization
While not quit (true):
Send request for work
Listen
If task received (true):
1. Do computation
2. Send results to the parent
If quit signal received: quit
When quit (false):
Exit
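
The parent and child loops can be sketched in a single process. This is a simplified illustration, not the presentation's implementation: threads and in-memory queues stand in for whatever message-passing layer was actually used, `None` plays the role of the quit signal, and the mid-point checkpoint and dead-child recovery steps are omitted. The names `child` and `run_parent` are hypothetical.

```python
import queue
import threading


def child(tasks, results, work):
    """Child loop: take a task, compute, send the result back; stop on quit."""
    while True:
        task = tasks.get()        # blocks until work or a quit signal arrives
        if task is None:          # None plays the role of the quit signal
            break
        results.put((task, work(task)))


def run_parent(task_list, work, n_children=4):
    """Parent: hand out tasks, collect one result per task, broadcast quit."""
    tasks, results = queue.Queue(), queue.Queue()
    children = [threading.Thread(target=child, args=(tasks, results, work))
                for _ in range(n_children)]
    for c in children:
        c.start()
    for task in task_list:
        tasks.put(task)
    collected = dict(results.get() for _ in task_list)
    for _ in children:
        tasks.put(None)           # quit signal, one per child
    for c in children:
        c.join()
    return collected


print(run_parent([1, 2, 3, 4, 5], lambda x: x * x, n_children=3))
```

Because each child pulls a new task only when it has finished the previous one, faster children naturally process more tasks, which addresses the load-balancing concern without any explicit scheduling.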

24
Measuring runtime improvement

Speedup and efficiency are tools for analyzing
parallel program performance.
Speedup(n) = T(1)/T(n), the ratio of the runtime with 1
processor to the runtime with n processors.
Efficiency(n) = Speedup(n)/n, the average speedup per
processor.
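
The two definitions translate directly into code. The sketch below uses invented runtimes purely as a worked example; the function names are hypothetical.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Efficiency(n) = Speedup(n) / n."""
    return speedup(t1, tn) / n


# Invented example: 100 s on 1 processor, 25 s on 8 processors.
print(speedup(100.0, 25.0))        # 4.0
print(efficiency(100.0, 25.0, 8))  # 0.5
```

An efficiency of 0.5 here means that, on average, each of the 8 processors contributes half of what a perfectly scaling processor would.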

25
Speedup1
26
Speedup2
27
Efficiency
28
Results
29
Results
30
Conclusion

In terms of parallelization:
This problem is data parallel.
The approach taken provides a reasonable
speedup.
In terms of the data obtained:
With good measurement accuracy, the organism
can be identified in this background.
Further statistical studies are necessary to
assess the reliability of identification based
on the number of identified peptides.