Computational Simulations for Mass Spectrometry based identification of Biological Weapons

Slides: 31
Provided by: rzw
Transcript and Presenter's Notes

1
Computational Simulations for Mass Spectrometry
based identification of Biological Weapons
2
Introduction to Mass Spectrometry
  • Mass Spectrometry (MS) is an analytical technique
    that accurately measures the masses of
    biological molecules.
  • Masses of proteins with known sequence can easily
    be calculated theoretically by adding the masses
    of the individual amino acid residues (plus one
    water molecule for the free termini).
  • Protein identification is performed through
    comparison between theoretical masses in a
    database and experimentally measured masses.
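As a minimal sketch of the theoretical mass calculation described on this slide, the Python snippet below sums monoisotopic residue masses and adds one water molecule. The table holds standard monoisotopic residue masses; the name `protein_mass` is illustrative and does not come from the presentation.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # H2O, accounts for the free N- and C-termini


def protein_mass(sequence):
    """Theoretical monoisotopic mass of an unmodified protein or peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


if __name__ == "__main__":
    # "MR" is the first tryptic peptide of the example sequence on slide 6.
    print(round(protein_mass("MR"), 4))
```

Identification then amounts to looking up an experimentally measured mass among values computed this way for every database entry.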

3
Organism Identification
  • Problem
  • Identify an organism in a complex background
    mixture by mass measurements.
  • Is it possible?
  • What approaches should be used?
  • What accuracy is required?

4
Simulation Preparation
  • Select a set of proteins that can be reliably
    measured by mass spectrometry (proteins with high
    intensities that have been observed in nature
    and have no expected modifications).
  • Select a suitable background for the simulation
    that is a good representation of the complexity
    of real-world mixtures.

5
Sample
  • Test organism: E. coli
  • 376 E. coli proteins were selected as
    representative of the organism because of their
    abundance, as detected in two other microbial MS
    experiments.
  • The background selection includes 13
    organisms and 83777 proteins:
  • A. thaliana, B. anthracis (genes at NCBI
    predicted by GeneMark), B. fungorum, D.
    radiodurans, G. metallireducens, N. europaea, P.
    aeruginosa, R. palustris, S. putrefaciens, S.
    cerevisiae, Y. pestis CO92, Y. pestis KIM

6
Bottom up approach
Example protein sequence:
MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARERIAVAMQRLFFAGRCLDDDHRTLADYGVRHDSVVF
LSLRLATDAYQTEMHNVRLMQPETATAKQEMHQQQQQQLHVHVAADDEEKAIKRKPVSRRALRKILSRLQ
Example digest (first peptides):
MR IMVR TLR GDR VALDVDGATTTVAQVK GMVMAR ER IAVAMQR LFFAGR CLDDDHR TLADYGVR
HDSVVFLSLR
  • The proteins in the representative E. coli set
    and the proteins in the background set are broken
    into pieces in silico (e.g. cleaved after K and R
    residues, as trypsin does). Up to 4 missed
    cleavages are allowed.
  • There are 70,000 unique peptides in the
    representative set and 125 million unique
    peptides in the background set.
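The in silico digestion can be sketched as follows. This is a simplified illustration, assuming cleavage after every K or R with no exceptions (for example, the common "not before proline" rule is ignored); the function name `digest` and its `max_missed` parameter are hypothetical.

```python
import re


def digest(sequence, max_missed=4):
    """Return the unique peptides obtained by cleaving after K or R,
    allowing up to max_missed missed cleavages."""
    # Fully cleaved fragments: each ends in K or R, except possibly the last.
    fragments = re.findall(r"[^KR]*[KR]|[^KR]+$", sequence)
    peptides = set()
    for start in range(len(fragments)):
        for missed in range(max_missed + 1):
            end = start + missed + 1
            if end > len(fragments):
                break
            # Joining adjacent fragments models missed cleavage sites.
            peptides.add("".join(fragments[start:end]))
    return peptides


if __name__ == "__main__":
    # Prefix of the example sequence shown on this slide.
    for peptide in sorted(digest("MRIMVRTLR", max_missed=1), key=len):
        print(peptide)
```

Allowing missed cleavages multiplies the number of candidate peptides, which is one reason the background set grows to the size quoted above.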

7
Bottom up identification
As with the proteins, the percentage of unique E.
coli tryptic peptides is shown as a function of
experimental error. Peptides are significantly
smaller than proteins and have less
discriminative power. Only 24% of the peptides are
unique to the representative set in the given
background with a perfect measurement, and 8% are
unique at ±0.01 Da accuracy. As the measurement
uncertainty increases further, the number of
unique peptides diminishes and the size of the
unique peptides increases (smaller peptides cannot
be identified with low accuracy).
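The uniqueness measure discussed on this slide can be sketched as below: a representative peptide counts as unique if no background peptide mass lies within the measurement tolerance. The function name and the toy masses in the usage example are illustrative assumptions, not data from the presentation.

```python
from bisect import bisect_left, bisect_right


def fraction_unique(representative_masses, background_masses, tolerance):
    """Fraction of representative masses with no background mass
    within +/- tolerance (in Da)."""
    background = sorted(background_masses)
    unique = 0
    for mass in representative_masses:
        lo = bisect_left(background, mass - tolerance)
        hi = bisect_right(background, mass + tolerance)
        if lo == hi:  # empty window: nothing in the background is this close
            unique += 1
    return unique / len(representative_masses)


if __name__ == "__main__":
    representative = [500.0, 600.0, 700.0]   # toy values
    background = [500.005, 650.0, 700.2]     # toy values
    for tol in (0.0, 0.01, 0.5):
        print(tol, fraction_unique(representative, background, tol))
```

Sorting the background once and using binary search keeps each lookup logarithmic, which matters when the background holds on the order of 125 million peptides.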
8
Sequence information
  • Mass spectrometers can also fragment molecules,
    which yields sequence information
    (MS/MS, or tandem MS).
  • The sequence information can be an invaluable
    source of greater discrimination.
  • With MS/MS information added to the analysis,
    the discriminative power of the bottom-up approach
    can be greatly improved.

9
Example: MRIMVRTLRGDRVALDVDGATTTVAQVKGMVMARER
b-ion: MRIMVRTLRGDRVALDVD
y-ion: GATTTVAQVKGMVMARER
Assumption: the peptide will break between every two
adjacent amino acids, providing a unique sequence pattern.
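Under the assumption stated on this slide (one break between every pair of adjacent residues), the fragment ladder can be sketched as follows. The sketch assumes singly protonated ions and standard monoisotopic masses; whether the original simulation included the water and proton terms is not stated in the slides, and the name `fragment_ions` is illustrative.

```python
# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as singly charged m/z values,
    one pair for each of the len(peptide) - 1 backbone breaks."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for k in range(1, len(peptide)):
        # b-ion: N-terminal fragment of k residues.
        b_ions.append(sum(masses[:k]) + PROTON)
        # y-ion: C-terminal fragment, which keeps the terminal water.
        y_ions.append(sum(masses[k:]) + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("GMVMAR")
    print([round(m, 3) for m in b])
    print([round(m, 3) for m in y])
```

Two peptides with the same parent mass but different residue order produce different ladders, which is the extra discrimination MS/MS provides.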
10
Experiment
  • Given:
  • The background dataset
  • The representative dataset
  • Accuracy: ±10 Da
  • Bottom-up experiment
  • Can we:
  • Identify the absence of the representative
    dataset in the background dataset?

11
Definitions
  • $PM(\text{sequence}) = \sum_{i=1}^{n} aa_i$, where n is
    the length of the protein sequence and $aa_i$ is the
    mass of the i-th amino acid. (Parent mass)
  • TM(sequence): the list of b-ion and y-ion masses
    of the sequence. (Tandem MS)
  • F(t), where t is the displacement, $-10 \le t \le 10$
  • Score(f_pattern1, f_pattern2)
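The formulas for F(t) and Score are not preserved in this transcript. The sketch below is therefore only an assumption: it treats F(t) as a cross-correlation between two binned fragment patterns at displacement t, and Score as F(0) minus the mean of F(t) over the nonzero displacements, in the spirit of commonly used cross-correlation scoring. All names and the 1 Da bin width are hypothetical.

```python
def bin_spectrum(masses, bin_width=1.0):
    """Map fragment masses to a set of integer bin indices (assumed binning)."""
    return {int(m / bin_width) for m in masses}


def cross_correlation(bins_a, bins_b, t):
    """Assumed F(t): number of bins of A that coincide with bins of B
    after shifting A by t bins."""
    return sum(1 for b in bins_a if (b + t) in bins_b)


def score(bins_a, bins_b, max_shift=10):
    """Assumed Score: F(0) minus the mean of F(t) for t != 0,
    with t running from -max_shift to +max_shift."""
    background = [
        cross_correlation(bins_a, bins_b, t)
        for t in range(-max_shift, max_shift + 1)
        if t != 0
    ]
    return cross_correlation(bins_a, bins_b, 0) - sum(background) / len(background)


if __name__ == "__main__":
    pattern1 = bin_spectrum([100.2, 200.4, 300.1])
    pattern2 = bin_spectrum([105.3, 205.2, 305.7])
    print(score(pattern1, pattern1), score(pattern1, pattern2))
```

Subtracting the off-center average penalizes patterns that match equally well at any displacement, so only a genuine alignment at t = 0 scores highly.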

12
Sequential Algorithm
  • Select peptide a from the representative list
  • Find a set (b0 .. bn) from the background list
    such that
  • |PM(a) - PM(bi)| < Error, for i = 0..n
  • For all bi
  • score(TM(a), TM(bi))
  • Return the average score and the highest score,
    where a != bi.
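The steps above can be sketched as follows. The parent-mass function `pm`, the fragment function `tm`, and the `score` function are passed in as parameters because their exact definitions are not preserved in the slides; `build_index` and `match_peptide` are hypothetical names.

```python
from bisect import bisect_left, bisect_right


def build_index(background, pm):
    """Sort the unique background peptides by parent mass once, up front."""
    pairs = sorted((pm(b), b) for b in set(background))
    return [m for m, _ in pairs], [b for _, b in pairs]


def match_peptide(a, index, pm, tm, score, error):
    """Score peptide a against every background peptide b with
    |PM(a) - PM(b)| < error and a != b.
    Returns (average score, highest score), or None if there is no candidate."""
    masses, peptides = index
    target = pm(a)
    lo = bisect_left(masses, target - error)
    hi = bisect_right(masses, target + error)
    pattern = tm(a)
    scores = [
        score(pattern, tm(b))
        for m, b in zip(masses[lo:hi], peptides[lo:hi])
        if b != a and abs(target - m) < error
    ]
    if not scores:
        return None
    return sum(scores) / len(scores), max(scores)
```

A high best score for some background peptide means that a representative peptide could be confused with it even when MS/MS information is used.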

13
(No Transcript)
14
(No Transcript)
15
Run time estimation
16
Parallel Algorithm Considerations
  • Possible parallelizations
  • Level 1: data parallel on the representative set.
  • Level 2: parallelization while working on a set
    (b0 .. bn)
  • Level 3: parallelization within the Score loop
  • Is there a best parallelization technique?
  • What to choose?

17
Level 1 parallelization
18
Level 2 parallelization
19
Level 3 parallelization
for (i = -10; i <= 10; i++) Score
-10 -9 -8 -7 -6 -5 ... 5 6 7 8 9 10
processor 1 ... processor N
20
Parallelization
  • Main problems to consider:
  • Synchronization
  • Amount of communication between processors
  • Load balancing
  • Slow/fast processors
  • Fault tolerance

21
Decision
  • Create a client/server system
  • Parent -> server
  • Children -> clients
  • Split the representative set into N subsets (so
    that N >> number of processors).
  • The amount of data in a subset can be constant,
    but small.
  • The number of subsets should not be too large, to
    minimize the communication between parent and
    children.
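The splitting step can be illustrated with a short sketch. The function name `make_tasks` and the parameter `subset_size` are hypothetical; the slides do not state the subset size that was actually used.

```python
def make_tasks(representative, subset_size):
    """Split the representative list into consecutive subsets of constant
    size (the last subset may be shorter)."""
    return [
        representative[i:i + subset_size]
        for i in range(0, len(representative), subset_size)
    ]


if __name__ == "__main__":
    tasks = make_tasks(list(range(10)), subset_size=3)
    print(len(tasks), tasks)
```

Small subsets help load balancing, since a slow processor only ever holds a small unit of work, while keeping the subset count moderate limits the number of parent/child messages.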

22
Parent
Spawn children
Init children
If starting from the Beginning: Create Task-list
If resuming from a Mid Point: Read Task-list
While Task-list > 0 (true):
  1. Listen for work request
  2. If (request received)
     a. Send work to child
     c. Decrement Task-list
     d. Record child's id
  3. Listen for result message
  4. Print out unfinished tasks to Mid Point
  5. If (child_died)
     a. Add child's last task back to Task-list
  If (result received) print out result
When the loop condition is false: Broadcast quit signal
exit
23
Child
Receive Initialization
While Not quit (true):
  Send request for work
  Listen
  If task received (true):
    1. Do computation
    2. Send results to the parent
  If quit signal received (false): quit
When the loop condition is false: Exit
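The parent/child protocol can be sketched in a few lines. This is only an illustration under stated assumptions: threads and in-process queues stand in for the spawned client processes and their messages (the slides do not name the message-passing library used), and the checkpointing to a "Mid Point" and the re-queuing of a dead child's task are not modeled. The names `parent`, `child`, and `compute` are hypothetical.

```python
import queue
import threading


def child(task_queue, result_queue, compute):
    """Child loop: request work, compute, send the result, until told to quit."""
    while True:
        task = task_queue.get()   # pulling a task plays the role of a work request
        if task is None:          # None is used here as the quit signal
            break
        result_queue.put((task, compute(task)))


def parent(tasks, compute, n_children=4):
    """Parent: spawn children, hand out tasks on demand, collect all results."""
    task_queue, result_queue = queue.Queue(), queue.Queue()
    children = [
        threading.Thread(target=child, args=(task_queue, result_queue, compute))
        for _ in range(n_children)
    ]
    for c in children:
        c.start()
    for task in tasks:
        task_queue.put(task)
    # One result is expected per task; arrival order depends on scheduling.
    results = [result_queue.get() for _ in tasks]
    for _ in children:
        task_queue.put(None)      # broadcast the quit signal
    for c in children:
        c.join()
    return results


if __name__ == "__main__":
    print(sorted(parent([1, 2, 3, 4, 5], lambda x: x * x, n_children=3)))
```

Because children pull work when they are free, faster processors naturally take more tasks, which addresses the load-balancing and slow/fast processor concerns without any explicit scheduling.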

24
Measuring runtime improvement
  • Speedup and efficiency are tools for analyzing
    parallel program performance.
  • Speedup(n) = T(1)/T(n), the ratio of the runtime
    with 1 processor to the runtime with n processors.
  • Efficiency(n) = Speedup(n)/n, the average speedup
    per processor.
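Both measures are simple ratios, as the short sketch below shows. The timings in the usage example are made-up values for illustration, not measurements from the presentation.

```python
def speedup(t1, tn):
    """Speedup(n) = T(1) / T(n)."""
    return t1 / tn


def efficiency(t1, tn, n):
    """Efficiency(n) = Speedup(n) / n."""
    return speedup(t1, tn) / n


if __name__ == "__main__":
    # Hypothetical runtimes: 100 s on 1 processor, 25 s on 8 processors.
    print(speedup(100.0, 25.0), efficiency(100.0, 25.0, 8))
```

An efficiency of 1.0 corresponds to ideal linear speedup; values below that reflect communication, synchronization, and load-imbalance overhead.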

25
Speedup1
26
Speedup2
27
Efficiency
28
Results
29
Results
30
Conclusion
  • In terms of parallelization:
  • This problem is data parallel.
  • The approach taken provides a reasonable
    speedup.
  • In terms of the data obtained:
  • With good measurement accuracy, the organism
    can be identified in this background.
  • Further statistical studies are necessary to
    assess the reliability of identification based
    on the number of identified peptides.