Title: Computational Simulations for Mass Spectrometry based identification of Biological Weapons
1Computational Simulations for Mass Spectrometry
based identification of Biological Weapons
2Introduction to Mass Spectrometry
- Mass Spectrometry (MS) is an analytical technique that accurately measures the masses of biological molecules.
- Masses of proteins with known sequence can easily be calculated theoretically by adding the masses of the individual amino acids.
- Protein identification is performed by comparing theoretical masses in a database with experimentally measured masses.
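The mass calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes monoisotopic residue masses (an amino acid minus one water) plus one water for the free termini, which is a common convention but is not stated on the slides. The name `theoretical_mass` is hypothetical.

```python
# Monoisotopic residue masses in daltons (standard values). A residue is an
# amino acid minus one water, which is lost when the peptide bond forms.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # assumption: one water is added for the free termini


def theoretical_mass(sequence):
    """Sum the residue masses and add one water for the termini."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    # Mass of the dipeptide MR, the first tryptic peptide of the example
    # protein used later in the slides.
    print(round(theoretical_mass("MR"), 4))
```

Note that leucine (L) and isoleucine (I) have identical masses, so a mass-only comparison cannot tell them apart.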
3Organism Identification
- Problem:
  - Identify an organism in a complex background mixture by mass measurements.
- Is it possible?
- What approaches should be used?
- What accuracy is required?
4Simulation Preparation
- Select a set of proteins that can be reliably measured by mass spectrometry (high intensities, observed in nature, no expected modifications).
- Select a suitable background for the simulation, one that is a good representation of the complexity of real-world mixtures.
5Sample
- Test organism: E. coli
- 376 E. coli proteins were selected as representative of the organism due to their abundance, as observed in two other microbial MS detections.
- The background selection includes 13 organisms and 83777 proteins:
  - A. thaliana, B. anthracis (genes at NCBI predicted by GeneMark), B. fungorum, D. radiodurans, G. metallireducens, N. europaea, P. aeruginosa, R. palustris, S. putrefaciens, S. cerevisiae, Y. pestis CO92, Y. pestis KIM
6Bottom up approach
Example protein sequence: MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARERIAVAMQRLFFAGRCLDDDHRTLADYGVRHDSVVFLSLRLATDAYQTEMHNVRLMQPETATAKQEMHQQQQQQLHVHVAADDEEKAIKRKPVSRRALRKILSRLQ
Example digest: MR IMVR TLR GDR VALDVDGATTTVAQVK GMVMAR ER IAVAMQR LFFAGR CLDDDHR TLADYGVR HDSVVFLSLR
- The proteins in the representative E. coli set and the proteins in the background set are broken into pieces in silico (e.g. cut at K and R amino acids). Up to 4 missed cleavages are allowed.
- There are 70,000 unique peptides in the representative set and 125 million unique peptides in the background set.
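The in silico digestion can be sketched as below. This is an illustrative sketch, assuming a plain "cut after every K or R" rule as stated on the slide (no exceptions such as proline are modeled) and treating a missed cleavage as two neighboring pieces left joined. The function names `cleave` and `digest` are hypothetical.

```python
def cleave(sequence, sites="KR"):
    """Cut after every K or R and return the pieces in order."""
    pieces, start = [], 0
    for i, aa in enumerate(sequence):
        if aa in sites:
            pieces.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        pieces.append(sequence[start:])  # C-terminal remainder
    return pieces


def digest(sequence, max_missed=4):
    """Return every peptide spanning 1 to max_missed + 1 consecutive pieces."""
    pieces = cleave(sequence)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + max_missed + 1, len(pieces))
        for j in range(i + 1, last + 1):
            peptides.add("".join(pieces[i:j]))
    return peptides


if __name__ == "__main__":
    prefix = "MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER"
    print(cleave(prefix))
    print(len(digest(prefix, max_missed=4)))
```

Allowing missed cleavages multiplies the number of candidate peptides, which is one reason the background set grows to a very large number of unique peptides.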
7Bottom up identification
As with the proteins, the percentage of unique E. coli tryptic peptides is plotted as a function of experimental error. Peptides are significantly smaller than proteins and have less discriminative power. Only 24% of the peptides are unique to the representative set in the given background with a perfect measurement, and only 8% remain unique at ±0.01 Da accuracy. As the measurement uncertainty increases further, the number of unique peptides diminishes and the size of the peptides that remain unique increases (smaller peptides cannot be identified at low accuracy).
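One way such a uniqueness percentage could be computed is sketched below. It assumes a peptide counts as unique when no background mass falls within the measurement error of its mass; how the slides actually counted uniqueness is not stated, and `unique_fraction` is a hypothetical name.

```python
from bisect import bisect_left, bisect_right


def unique_fraction(rep_masses, bg_masses, tol):
    """Fraction of representative masses with no background mass within tol."""
    bg = sorted(bg_masses)  # sort once so each lookup is a binary search
    unique = 0
    for m in rep_masses:
        lo = bisect_left(bg, m - tol)
        hi = bisect_right(bg, m + tol)
        if lo == hi:  # no background mass inside [m - tol, m + tol]
            unique += 1
    return unique / len(rep_masses)


if __name__ == "__main__":
    rep = [100.0, 200.0, 300.0]      # toy values, not real data
    bg = [100.005, 250.0, 300.5]     # toy values, not real data
    for tol in (0.0, 0.01, 1.0):
        print(tol, unique_fraction(rep, bg, tol))
```

Sorting the background once keeps each lookup logarithmic, which matters when the background holds on the order of a hundred million masses.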
8Sequence information
- Mass spectrometry can fragment molecules, which gives sequence information (MS/MS or Tandem MS).
- The sequence information can be an invaluable source of greater discrimination.
- With the addition of MS/MS information to the analysis, the discriminative power of the bottom-up approach can be greatly improved.
9MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER
b-ion: MRIMVRTLRGDRVALDVD
y-ion: GATTTVAQVKGMVMARER
Assumption: the peptide will break between every two amino acids, providing a unique sequence pattern.
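Under the assumption above, each break between two neighboring amino acids yields one b-ion (the N-terminal part) and one y-ion (the C-terminal part). The sketch below lists their masses. It assumes singly charged ions, monoisotopic residue masses, and the usual conventions that a b-ion carries one extra proton and a y-ion carries one water plus one proton; none of these details appear on the slides, and `fragment_ions` is a hypothetical name.

```python
# Monoisotopic residue masses in daltons (standard values).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728  # assumption: singly charged fragment ions


def fragment_ions(peptide):
    """Return (b_ions, y_ions), one pair per break between neighbors."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    b_ions, y_ions = [], []
    prefix = 0.0
    for m in masses[:-1]:  # a peptide of length n has n - 1 breaks
        prefix += m
        b_ions.append(prefix + PROTON)
        y_ions.append(total - prefix + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER")
    # Index 17 is the break after the 18th residue, the split drawn on the slide.
    print(round(b[17], 4), round(y[17], 4))
```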
10Experiment
- Given:
  - The background dataset
  - The representative dataset
  - Accuracy: ±10 Da
  - Bottom-up experiment
- Can we:
  - Identify absence of the representative dataset in the background dataset?
11Definitions
- $PM(\text{sequence}) = \sum_{i=1}^{n} aa_i$, where $n$ is the length of the protein sequence and $aa_i$ is the mass of the $i$-th amino acid. (Parent mass)
- TM(sequence): a list of b-ions and y-ions (Tandem MS)
- F(t), where t is the displacement, -10..10
- Score(f_pattern1, f_pattern2)
12Sequential Algorithm
- Select peptide a from the representative list
- Find a set (b0 .. bn) from the background list such that $|PM(a) - PM(b_i)| < \text{Error}$, for i = 0..n
- For all bi:
  - score(TM(a), TM(bi))
- Return the average score and the highest score, where a != bi.
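The steps above can be sketched as a single function. The slides do not preserve the actual Score definition, so `shared_peak_score` below is a hypothetical stand-in (the fraction of a's fragments that have a match in b), and `pm` and `tm` are caller-supplied functions standing in for PM and TM. Treat this as an outline of the control flow, not the authors' implementation.

```python
from bisect import bisect_left, bisect_right
from statistics import mean


def shared_peak_score(frags_a, frags_b, tol=0.5):
    """Hypothetical score: fraction of a's fragments matched in b within tol."""
    b = sorted(frags_b)
    hits = sum(
        1 for f in frags_a
        if bisect_left(b, f - tol) < bisect_right(b, f + tol)
    )
    return hits / len(frags_a)


def compare_to_background(a, background, pm, tm, error, score=shared_peak_score):
    """Score peptide a against background peptides of similar parent mass.

    Returns (average score, highest score), or None when no background
    peptide other than a itself lies within the parent-mass error.
    """
    mass_a, frags_a = pm(a), tm(a)
    candidates = [
        b for b in background
        if b != a and abs(mass_a - pm(b)) < error
    ]
    if not candidates:
        return None
    scores = [score(frags_a, tm(b)) for b in candidates]
    return mean(scores), max(scores)
```

A high maximum score means some background peptide looks very similar to a even after fragmentation, so a would be a poor marker for the organism.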
15Run time estimation
16Parallel Algorithm Considerations
- Possible parallelizations:
  - Level1: Data parallel on the representative set.
  - Level2: Parallelization while working on a set (b0 .. bn)
  - Level3: Parallelization within the Score loop
- Is there a best parallelization technique?
- What to choose?
17Level1 parallelization
18Level2 parallelization
19Level3 parallelization
for (i = -10; i < 10; i++) Score
-10 -9 -8 -7 -6 -5 ... 5 6 7 8 9 10
processor1 ... processorN
20Parallelization
- Main problems to consider:
  - Synchronization
  - Amount of communication between processors
  - Load balancing
  - Slow/fast processors
  - Fault tolerance
21Decision
- Create a client/server system
  - Parent -> server
  - Children -> clients
- Split the representative set into N subsets (so that N >> number of processors).
- The amount of data in a subset can be constant, but small.
- The number of subsets should not be too large, to minimize the communication between parent and children.
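Splitting the representative set into constant-size subsets can be sketched as follows. The function name `split_into_tasks` and the subset size are illustrative assumptions; the slides do not state the size that was used.

```python
def split_into_tasks(items, task_size):
    """Split items into consecutive subsets of at most task_size elements."""
    if task_size < 1:
        raise ValueError("task_size must be at least 1")
    return [items[i:i + task_size] for i in range(0, len(items), task_size)]


if __name__ == "__main__":
    peptides = ["MR", "IMVR", "TLR", "GDR", "GMVMAR", "ER", "LFFAGR"]
    for task in split_into_tasks(peptides, 3):
        print(task)
```

The subset size is the tuning knob: smaller subsets balance the load better across slow and fast processors, while larger subsets reduce the number of messages between parent and children.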
22Parent
- Spawn children
- Init children
- If starting from the Beginning: create Task-list; if resuming from a Mid Point: read Task-list
- While Task-list > 0:
  1. Listen for work request
  2. If (request received):
     a. Send work to child
     c. Decrement Task-list
     d. Record child's id
  3. Listen for result message
  4. Print out unfinished tasks to Mid Point
  5. If (child_died):
     a. Add the child's last task back to Task-list
  If (result received): print out result
- Broadcast quit signal
- Exit
23Child
- Receive initialization
- While not quit:
  - Send request for work
  - Listen
  - If task received:
    1. Do computation
    2. Send results to the parent
  - If quit signal received: leave the loop
- Exit
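The parent/child protocol can be sketched in a single process with threads and queues. This is only an illustration of the request/work/result/quit message flow: the slides do not say which communication library was used, threads stand in for separate processors, and the Mid Point checkpoint and child-failure handling are not modeled. All names here are hypothetical.

```python
import queue
import threading

QUIT = None  # sentinel standing in for the quit signal


def child(child_id, requests, inbox, results, compute):
    """Ask for work, compute, send the result; stop on the quit signal."""
    while True:
        requests.put(child_id)            # send request for work
        message = inbox.get()             # listen for a task or quit
        if message is QUIT:
            break
        index, task = message
        results.put((index, compute(task)))  # send result to the parent


def parent(tasks, n_children, compute):
    """Hand out tasks on request and return the results in task order."""
    tasks = list(tasks)
    requests, results = queue.Queue(), queue.Queue()
    inboxes = [queue.Queue() for _ in range(n_children)]
    workers = [
        threading.Thread(
            target=child, args=(i, requests, inboxes[i], results, compute)
        )
        for i in range(n_children)
    ]
    for w in workers:
        w.start()

    pending = list(enumerate(tasks))      # the task list
    last_task = {}                        # child id -> last task sent to it
    while pending:
        child_id = requests.get()         # listen for a work request
        message = pending.pop()           # decrement the task list
        inboxes[child_id].put(message)    # send work to that child
        last_task[child_id] = message     # record the child's id

    # Every child ends up with exactly one unanswered request: answer with quit.
    for _ in range(n_children):
        inboxes[requests.get()].put(QUIT)
    for w in workers:
        w.join()

    output = [None] * len(tasks)
    while not results.empty():
        index, value = results.get()
        output[index] = value
    return output


if __name__ == "__main__":
    print(parent([1, 2, 3, 4, 5], n_children=3, compute=lambda x: x * x))
```

Because children pull work when they are ready, a fast child simply asks more often, which gives load balancing without the parent having to know processor speeds.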
24Measuring runtime improvement
- Speedup and efficiency are tools for analyzing parallel program performance.
- Speedup = T(1)/T(n), the ratio of the runtime with 1 processor to the runtime with n processors.
- Efficiency = Speedup(n)/n, the average speedup per processor.
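The two definitions translate directly into code. The runtimes in the usage example are made-up numbers for illustration, not measurements from this study.

```python
def speedup(t1, tn):
    """Speedup = T(1) / T(n)."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Efficiency = Speedup(n) / n, the average speedup per processor."""
    return speedup(t1, tn) / n


if __name__ == "__main__":
    # Hypothetical runtimes: 120 s on 1 processor, 20 s on 8 processors.
    print(speedup(120.0, 20.0), efficiency(120.0, 20.0, 8))
```

An efficiency close to 1 means the processors are almost fully used; values well below 1 usually point to communication overhead or load imbalance.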
25Speedup1
26Speedup2
27Efficiency
28Results
29Results
30Conclusion
- In terms of parallelization:
  - This problem is data parallel.
  - The approach taken provides a reasonable speedup.
- In terms of data received:
  - With good measurement accuracy, the organism can be identified in this background.
  - Further statistical studies are necessary to assess the reliability of identification based on the number of identified peptides.