Title: Scalable Molecular Dynamics for Large Biomolecular Systems
1 Scalable Molecular Dynamics for Large Biomolecular Systems
- Robert Brunner
- James C. Phillips
- Laxmikant Kale
2 Overview
- Context, approach, and methodology
- Molecular dynamics for biomolecules
- Our program, NAMD
- Basic parallelization strategy
- NAMD performance optimizations
- Techniques
- Results
- Conclusions: summary, lessons, and future work
3 The context
- Objective: Enhance performance and productivity in parallel programming
- For complex, dynamic applications
- Scalable to thousands of processors
- Theme
- Adaptive techniques for handling dynamic behavior
- Look for an optimal division of labor between the human programmer and the system
- Let the programmer specify what to do in parallel
- Let the system decide when and where to run the subcomputations
- Data-driven objects as the substrate
4 (No Transcript)
5 Data driven execution
(Diagram: per-processor schedulers with message queues)
6 Charm++
- Parallel C++ with data-driven objects
- Object arrays and collections
- Asynchronous method invocation (a minimal sketch follows below)
- Object groups
- a global object with a representative on each PE
- Prioritized scheduling
- Mature, robust, portable
- http://charm.cs.uiuc.edu
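The following is a minimal sketch of what asynchronous method invocation on a Charm++ chare array looks like; the module, class, and entry-method names (hello, Hello, greet) are illustrative, build details are omitted, and error handling is elided.

    // hello.ci -- Charm++ interface file (illustrative sketch)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] Hello {
        entry Hello();
        entry void greet(int from);    // asynchronous entry method
      };
    };

    // hello.C
    #include "hello.decl.h"

    static const int N = 4;            // number of array elements (illustrative)
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(N);  // create a chare array
        arr[0].greet(-1);              // invocation returns immediately; runs when scheduled
      }
      void done() { CkExit(); }
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      Hello(CkMigrateMessage*) {}      // migration constructor (objects can migrate)
      void greet(int from) {
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
        if (thisIndex + 1 < N) thisProxy[thisIndex + 1].greet(thisIndex);
        else mainProxy.done();
      }
    };

    #include "hello.def.h"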
7 Multi-partition decomposition
8 Load balancing
- Based on migratable objects
- Collect timing data for several cycles (a sketch of this cycle follows below)
- Run a heuristic load balancer
- Several alternative strategies are available
- Re-map and migrate objects accordingly
- Registration mechanisms facilitate migration
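A minimal sketch, in plain C++, of the measure-then-decide cycle described above; it is not the actual Charm++ load balancing framework API, and the tolerance value is an illustrative assumption.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    struct MigratableObject {
        std::vector<double> stepTimes;   // per-step times measured over recent cycles
        double averageLoad() const {
            if (stepTimes.empty()) return 0.0;
            return std::accumulate(stepTimes.begin(), stepTimes.end(), 0.0)
                   / stepTimes.size();
        }
    };

    // Decide whether to invoke the heuristic balancer: compare the most loaded
    // processor against the average, relying on measured loads persisting.
    bool needsRebalancing(const std::vector<double>& perProcessorLoad,
                          double tolerance = 1.10) {
        double maxLoad = *std::max_element(perProcessorLoad.begin(),
                                           perProcessorLoad.end());
        double avgLoad = std::accumulate(perProcessorLoad.begin(),
                                         perProcessorLoad.end(), 0.0)
                         / perProcessorLoad.size();
        return maxLoad > tolerance * avgLoad;
    }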
9 Measurement based load balancing
- Application-induced imbalances:
- Abrupt but infrequent, or
- Slow and cumulative
- Rarely: frequent, large changes
- Principle of persistence
- An extension of the principle of locality
- The behavior of objects, including their computational load and communication patterns, tends to persist over time
- We have implemented strategies that exploit this automatically
10 Molecular Dynamics
11 Molecular dynamics and NAMD
- MD is used to understand the structure and function of biomolecules
- proteins, DNA, membranes
- NAMD is a production-quality MD program
- In active use by biophysicists (science publications)
- 50,000 lines of C++ code
- 1000 registered users
- Features and accessories such as:
- VMD visualization
- BioCoRE collaboratory
- Steered and Interactive Molecular Dynamics
12 NAMD Contributors
- PIs
- Laxmikant Kale, Klaus Schulten, Robert Skeel
- NAMD 1
- Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
- NAMD 2
- M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...
13 Molecular Dynamics
- A collection of charged atoms, with bonds
- Newtonian mechanics
- At each time-step (a simplified sketch follows below):
- Calculate forces on each atom
- bonded forces
- non-bonded: electrostatic and van der Waals
- Calculate velocities and advance positions
- 1 femtosecond time-step, millions of steps needed!
- Thousands of atoms (1,000 - 100,000)
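A minimal sketch of such a time-step in plain C++; the data layout, units, and the trivial computeForces placeholder are assumptions for illustration, not NAMD's actual structure.

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };
    struct Atom { Vec3 pos, vel, force; double mass = 1.0; };

    // Placeholder force evaluation: a real code adds bonded terms and the
    // cut-off based non-bonded terms sketched on the next slide.
    void computeForces(std::vector<Atom>& atoms) {
        for (Atom& a : atoms) a.force = Vec3{};
        // ... accumulate bonded and non-bonded forces here ...
    }

    // One MD time-step: forces, then velocities, then positions (dt ~ 1 fs).
    void integrateStep(std::vector<Atom>& atoms, double dt) {
        computeForces(atoms);
        for (Atom& a : atoms) {
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
    }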
14 Cut-off radius
- Use of a cut-off radius to reduce work (a sketch follows below)
- 8 - 14 Å
- Faraway charges are ignored!
- 80-95% of the work is non-bonded force computation
- Some simulations need faraway contributions
- Periodic systems: Ewald, Particle-Mesh Ewald
- Aperiodic systems: FMA
- Even so, cut-off based computations are important
- near-atom calculations are part of the above
- multiple time-stepping is used: k cut-off steps, 1 PME/FMA step
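A sketch of cut-off based non-bonded (electrostatic) force evaluation; this illustrative version uses a plain O(N^2) pair loop and simplified units, whereas a production code only considers pairs in nearby patches.

    #include <cmath>
    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };
    struct Atom { Vec3 pos, force; double charge = 0.0; };

    void nonbondedForces(std::vector<Atom>& atoms, double cutoff /* e.g. 8-14 Angstroms */) {
        const double cutoff2 = cutoff * cutoff;
        for (size_t i = 0; i < atoms.size(); ++i) {
            for (size_t j = i + 1; j < atoms.size(); ++j) {
                double dx = atoms[i].pos.x - atoms[j].pos.x;
                double dy = atoms[i].pos.y - atoms[j].pos.y;
                double dz = atoms[i].pos.z - atoms[j].pos.z;
                double r2 = dx * dx + dy * dy + dz * dz;
                if (r2 > cutoff2) continue;                  // faraway charges ignored
                double r = std::sqrt(r2);
                double f = atoms[i].charge * atoms[j].charge / (r2 * r);  // Coulomb, simplified units
                atoms[i].force.x += f * dx;  atoms[j].force.x -= f * dx;  // Newton's 3rd law
                atoms[i].force.y += f * dy;  atoms[j].force.y -= f * dy;
                atoms[i].force.z += f * dz;  atoms[j].force.z -= f * dz;
            }
        }
    }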
15 Scalability
- The program should scale up to use a large number of processors.
- But what does that mean?
- An individual simulation isn't truly scalable
- A better definition of scalability:
- If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
16 Isoefficiency
- Quantifies scalability
- (Work of Vipin Kumar, U. Minnesota)
- How much increase in problem size is needed to retain the same efficiency on a larger machine?
- Efficiency = sequential time / (P × parallel time) (see the formulas below)
- parallel time = computation + communication + idle time
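The same definitions written out in LaTeX, with the isoefficiency relation in the form used by Kumar and colleagues; here W denotes problem size and T_o(W, P) the total overhead (communication plus idle time), which is a notational assumption rather than something stated on the slide.

    E = \frac{T_{\mathrm{seq}}}{P \, T_{\mathrm{par}}},
    \qquad
    T_{\mathrm{par}} = t_{\mathrm{comp}} + t_{\mathrm{comm}} + t_{\mathrm{idle}}

    % Isoefficiency: to keep E fixed as P grows, the problem size must grow as
    W = \frac{E}{1 - E} \; T_o(W, P)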
17 Atom decomposition
- Partition the atoms array across processors
- Nearby atoms may not be on the same processor
- Communication: O(N) per processor
- Communication-to-computation ratio: O(N) / (N/P) = O(P)
- Again, not scalable by our definition
18 Force Decomposition
- Distribute the force matrix to processors
- Matrix is sparse and non-uniform
- Each processor has one block
- Communication: O(N/√P) per processor
- Communication-to-computation ratio: O(√P)
- Better scalability in practice
- (can use 100 processors)
- Plimpton
- Hwang, Saltz, et al:
- speedup of 6 on 32 PEs, 36 on 128 processors
- Yet not scalable in the sense defined here!
19 Spatial Decomposition
- Allocate nearby atoms to the same processor
- Three variations are possible:
- Partitioning into P boxes, 1 per processor
- Good scalability, but hard to implement
- Partitioning into fixed-size boxes, each a little larger than the cutoff distance (a sketch follows below)
- Partitioning into smaller boxes
- Communication: O(N/P)
- so, scalable in principle
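A sketch of the cutoff-sized box variant: atoms map to a patch whose edge is at least the cutoff, so interacting pairs always lie in the same or a neighboring patch. Function and type names here are illustrative.

    #include <cmath>
    #include <cstdlib>

    struct PatchIndex { int ix, iy, iz; };

    PatchIndex patchOf(double x, double y, double z,
                       double originX, double originY, double originZ,
                       double patchEdge /* >= cutoff */) {
        return { static_cast<int>(std::floor((x - originX) / patchEdge)),
                 static_cast<int>(std::floor((y - originY) / patchEdge)),
                 static_cast<int>(std::floor((z - originZ) / patchEdge)) };
    }

    // Two patches can contain interacting atoms only if they differ by at most
    // one box in each dimension (26 neighbors plus the patch itself).
    bool mayInteract(const PatchIndex& a, const PatchIndex& b) {
        return std::abs(a.ix - b.ix) <= 1 &&
               std::abs(a.iy - b.iy) <= 1 &&
               std::abs(a.iz - b.iz) <= 1;
    }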
20 Spatial Decomposition in NAMD
- NAMD 1 used spatial decomposition
- Good theoretical isoefficiency, but load balancing problems for a fixed-size system
- For midsize systems, we got good speedups up to 16 processors
- Use the symmetry of Newton's 3rd law to facilitate load balancing
21 Spatial Decomposition
But the load balancing problems are still severe
22 (No Transcript)
23 FD + SD
- Now, we have many more objects to load balance
- Each diamond can be assigned to any processor
- Number of diamonds (3D) = 14 × number of patches
24 Bond Forces
- Multiple types of forces:
- Bonds (2 atoms), Angles (3), Dihedrals (4), ...
- Luckily, each involves atoms in neighboring patches only
- Straightforward implementation:
- Send a message to all neighbors,
- receive forces from them
- 26 × 2 messages per patch!
25 Bonded Forces
- Assume one patch per processor
- An angle force involving atoms in patches
- (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
- is calculated in patch (max xi, max yi, max zi) (see the sketch below)
(Diagram: patches A, B, C)
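A minimal sketch of that ownership rule in plain C++: the patch with the coordinate-wise maximum index computes the bonded term, so the term is computed exactly once; the type and function names are illustrative.

    #include <algorithm>
    #include <vector>

    struct PatchIndex { int ix, iy, iz; };

    PatchIndex owningPatch(const std::vector<PatchIndex>& patchesOfAtoms) {
        PatchIndex owner = patchesOfAtoms.front();
        for (const PatchIndex& p : patchesOfAtoms) {
            owner.ix = std::max(owner.ix, p.ix);
            owner.iy = std::max(owner.iy, p.iy);
            owner.iz = std::max(owner.iz, p.iz);
        }
        // e.g. (max xi, max yi, max zi) for an angle spanning up to three patches
        return owner;
    }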
26 Implementation
- Multiple objects per processor
- Different types: patches, pairwise force computes, bonded force computes, ...
- Each may have its data ready at a different time
- Need the ability to map and remap them
- Need prioritized scheduling
- Charm++ supports all of these
27 Load Balancing
- Is a major challenge for this application
- especially for a large number of processors
- Unpredictable workloads
- Each diamond (force object) and patch encapsulates a variable amount of work
- Static estimates are inaccurate
- Measurement-based load balancing framework
- Robert Brunner's recent Ph.D. thesis
- Very slow variations across timesteps
28 Bipartite graph balancing
- Background load:
- Patches (integration, ...) and bond-related forces
- Migratable load:
- Non-bonded forces
- Bipartite communication graph
- between migratable and non-migratable objects
- Challenge:
- Balance load while minimizing communication
29 Load balancing strategy

Greedy variant (simplified):
  Sort compute objects (diamonds)
  Repeat (until all are assigned):
    S = set of all processors that
        -- are not overloaded
        -- generate the least new communication
    P = least loaded processor in S
    Assign the heaviest remaining compute to P

Refinement:
  Repeat
    Pick a compute from the most overloaded PE
    Assign it to a suitable underloaded PE
  Until (no movement)

(Diagram: bipartite graph of cells and compute objects; a C++ sketch of the greedy variant follows below)
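A plain C++ sketch of the greedy variant, with the communication-aware candidate filtering reduced to a comment; object ids are assumed to be 0..n-1, and all names are illustrative.

    #include <algorithm>
    #include <vector>

    struct Compute { int id; double load; };   // measured load per compute object

    std::vector<int> greedyAssign(std::vector<Compute> computes, int numProcs) {
        // Heaviest objects first.
        std::sort(computes.begin(), computes.end(),
                  [](const Compute& a, const Compute& b) { return a.load > b.load; });
        std::vector<double> procLoad(numProcs, 0.0);
        std::vector<int> assignment(computes.size(), -1);
        for (const Compute& c : computes) {
            // In the real strategy, candidates are restricted to processors that
            // are not overloaded and that add the least new communication; here
            // we simply pick the least loaded processor.
            int best = static_cast<int>(std::min_element(procLoad.begin(), procLoad.end())
                                        - procLoad.begin());
            assignment[c.id] = best;      // ids assumed to be 0..n-1
            procLoad[best] += c.load;
        }
        return assignment;                // a refinement pass would then move computes
                                          // off the most overloaded PEs
    }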
30 (No Transcript)
31 Speedups in 1998
32 Initial Speedup Results: ASCI Red
33 BC1 complex: 200k atoms
34 Optimizations
- A series of optimizations
- Examples to be covered here:
- Grainsize distributions (bimodal)
- Integration: message sending overheads
35 Grainsize and Amdahl's law
- A variant of Amdahl's law, for objects, would be:
- The fastest time can be no shorter than the time for the biggest single object!
- How did it apply to us?
- Sequential step time was 57 seconds
- To run on 2k processors, no object should take more than 28 msecs (see the arithmetic below)
- It should be even shorter in practice
- Grainsize analysis via Projections showed that this was not so...
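Taking "2k processors" to mean 2048 (an assumption), the per-processor budget per step works out to roughly the 28 ms quoted above:

    \frac{57\ \text{s}}{2048\ \text{processors}} \approx 27.8\ \text{ms per processor per step}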
36Grainsize analysis
Solution Split compute objects that may have
too much work using a heuristics based on number
of interacting atoms
Problem
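A plain C++ sketch of such a splitting heuristic; the cost proxy (product of atom counts), the threshold value, and the piece layout are illustrative assumptions, not NAMD's actual heuristic.

    #include <algorithm>
    #include <vector>

    struct ComputePiece { int patchA, patchB, firstAtomOfA, lastAtomOfA; };

    std::vector<ComputePiece> maybeSplit(int patchA, int patchB,
                                         int atomsInA, int atomsInB,
                                         int maxPairsPerObject = 20000) {
        std::vector<ComputePiece> pieces;
        long pairs = static_cast<long>(atomsInA) * atomsInB;     // estimated work
        int numPieces = static_cast<int>(pairs / maxPairsPerObject) + 1;
        int chunk = (atomsInA + numPieces - 1) / numPieces;      // split along patch A's atoms
        for (int first = 0; first < atomsInA; first += chunk)
            pieces.push_back({patchA, patchB, first,
                              std::min(first + chunk, atomsInA) - 1});
        return pieces;
    }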
37 Grainsize reduced
38 Performance audit
39 Performance audit
- Through the optimization process, an audit was kept to decide where to look to improve performance
Component     Ideal    Actual
Total         57.04    86
nonBonded     52.44    49.77
Bonds          3.16     3.9
Integration    1.44     3.05
Overhead       0        7.97
Imbalance      0       10.45
Idle           0        9.25
Receives       0        1.61
Integration time doubled
40 Integration overhead analysis
Problem: integration time had doubled compared with the sequential run
41 Integration overhead example
- The Projections views showed that the overhead was associated with sending messages.
- Many cells were sending 30-40 messages.
- The overhead was too large compared with the cost of the messages themselves.
- Code analysis: memory allocations!
- An identical message was being sent to 30 processors.
- Simple multicast support was added to Charm++ (a sketch follows below)
- It mainly eliminates memory allocations (and some copying)
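A plain C++ sketch contrasting per-destination copies with a multicast-style reuse of one buffer; sendTo is a stand-in stub, not the actual Charm++ messaging or multicast API.

    #include <cstring>
    #include <vector>

    // Assumed low-level send primitive (stubbed here); in the real code this
    // role is played by the Charm++ messaging layer.
    void sendTo(int destProcessor, const char* buf, std::size_t len) {
        (void)destProcessor; (void)buf; (void)len;
    }

    // Before: an identical message is allocated and copied once per destination.
    void sendCopies(const std::vector<int>& dests, const char* data, std::size_t len) {
        for (int d : dests) {
            char* copy = new char[len];
            std::memcpy(copy, data, len);
            sendTo(d, copy, len);
            delete[] copy;               // safe only because sendTo here is synchronous
        }
    }

    // After: a multicast-style send reuses one buffer for all 30-40 destinations,
    // eliminating the per-destination allocations (and some copying).
    void multicastTo(const std::vector<int>& dests, const char* data, std::size_t len) {
        for (int d : dests)
            sendTo(d, data, len);
    }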
42 Integration overhead: After multicast
43 Improved Performance Data
44 Recent Results on Origin 2000
45 Results on Linux Cluster
46 Performance of Apo-A1 on ASCI Red
47 Performance of Apo-A1 on O2k and T3E
48 Lessons learned
- Need to downsize objects!
- Choose the smallest possible grainsize that amortizes overhead
- One of the biggest challenges was getting time for performance-tuning runs on parallel machines
49 Future and planned work
- Speedup on small molecules!
- Interactive molecular dynamics
- Increased speedups on 2k-10k processors
- Smaller grainsizes
- New algorithms for reducing communication impact
- New load balancing strategies
- Further performance improvements for PME/FMA
- With multiple timestepping
- Needs multi-phase load balancing
50 Steered MD example picture
Image and simulation by the Theoretical Biophysics Group, Beckman Institute, UIUC
51 More information
- Charm++ and associated frameworks:
- http://charm.cs.uiuc.edu
- NAMD and associated biophysics tools:
- http://www.ks.uiuc.edu
- Both include downloadable software
52 Performance vs. size of system
Performance data on Cray T3E
53 Performance on various machines