Title: Scalable Molecular Dynamics for Large Biomolecular Systems
1 Scalable Molecular Dynamics for Large Biomolecular Systems
- Robert Brunner
- James C. Phillips
- Laxmikant Kale
2 Overview
- Context, approach, and methodology
- Molecular dynamics for biomolecules
- Our program, NAMD
- Basic parallelization strategy
- NAMD performance optimizations
- Techniques
- Results
- Conclusions: summary, lessons, and future work
3 The context
- Objective: Enhance performance and productivity in parallel programming
- For complex, dynamic applications
- Scalable to thousands of processors
- Theme
- Adaptive techniques for handling dynamic behavior
- Look for an optimal division of labor between the human programmer and the system
- Let the programmer specify what to do in parallel
- Let the system decide when and where to run the subcomputations
- Data-driven objects as the substrate
4 (No Transcript)
5 Data driven execution
(Diagram: per-processor schedulers with message queues)
6 Charm++
- Parallel C++ with data-driven objects
- Object arrays and collections
- Asynchronous method invocation (a minimal sketch follows below)
- Object groups
- a global object with a representative on each PE
- Prioritized scheduling
- Mature, robust, portable
- http://charm.cs.uiuc.edu
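The following is a minimal sketch of what asynchronous method invocation on a Charm++ chare array looks like; the module, class, and entry-method names (hello, Hello, greet) are illustrative, build details are omitted, and error handling is elided.

    // hello.ci -- Charm++ interface file (illustrative sketch)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg* m);
        entry void done();
      };
      array [1D] Hello {
        entry Hello();
        entry void greet(int from);    // asynchronous entry method
      };
    };

    // hello.C
    #include "hello.decl.h"

    static const int N = 4;            // number of array elements (illustrative)
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        delete m;
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(N);  // create a chare array
        arr[0].greet(-1);              // invocation returns immediately; runs when scheduled
      }
      void done() { CkExit(); }
    };

    class Hello : public CBase_Hello {
     public:
      Hello() {}
      Hello(CkMigrateMessage*) {}      // migration constructor (objects can migrate)
      void greet(int from) {
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
        if (thisIndex + 1 < N) thisProxy[thisIndex + 1].greet(thisIndex);
        else mainProxy.done();
      }
    };

    #include "hello.def.h"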
7 Multi-partition decomposition
8 Load balancing
- Based on migratable objects
- Collect timing data for several cycles (a sketch of this cycle follows below)
- Run a heuristic load balancer
- Several alternative strategies are available
- Re-map and migrate objects accordingly
- Registration mechanisms facilitate migration
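A minimal sketch, in plain C++, of the measure-then-decide cycle described above; it is not the actual Charm++ load balancing framework API, and the tolerance value is an illustrative assumption.

    #include <algorithm>
    #include <numeric>
    #include <vector>

    struct MigratableObject {
        std::vector<double> stepTimes;   // per-step times measured over recent cycles
        double averageLoad() const {
            if (stepTimes.empty()) return 0.0;
            return std::accumulate(stepTimes.begin(), stepTimes.end(), 0.0)
                   / stepTimes.size();
        }
    };

    // Decide whether to invoke the heuristic balancer: compare the most loaded
    // processor against the average, relying on measured loads persisting.
    bool needsRebalancing(const std::vector<double>& perProcessorLoad,
                          double tolerance = 1.10) {
        double maxLoad = *std::max_element(perProcessorLoad.begin(),
                                           perProcessorLoad.end());
        double avgLoad = std::accumulate(perProcessorLoad.begin(),
                                         perProcessorLoad.end(), 0.0)
                         / perProcessorLoad.size();
        return maxLoad > tolerance * avgLoad;
    }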
9 Measurement based load balancing
- Application-induced imbalances:
- Abrupt but infrequent, or
- Slow and cumulative
- Rarely: frequent, large changes
- Principle of persistence
- An extension of the principle of locality
- The behavior of objects, including their computational load and communication patterns, tends to persist over time
- We have implemented strategies that exploit this automatically
10 Molecular Dynamics
11 Molecular dynamics and NAMD
- MD is used to understand the structure and function of biomolecules
- proteins, DNA, membranes
- NAMD is a production-quality MD program
- In active use by biophysicists (science publications)
- 50,000 lines of C++ code
- 1000 registered users
- Features and accessories such as:
- VMD visualization
- BioCoRE collaboratory
- Steered and Interactive Molecular Dynamics
12 NAMD Contributors
- PIs
- Laxmikant Kale, Klaus Schulten, Robert Skeel
- NAMD 1
- Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
- NAMD 2
- M. Bhandarkar, R. Brunner, A. Gursoy, J. Phillips, N. Krawetz, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...
13 Molecular Dynamics
- A collection of charged atoms, with bonds
- Newtonian mechanics
- At each time-step (a simplified sketch follows below):
- Calculate forces on each atom
- bonded forces
- non-bonded: electrostatic and van der Waals
- Calculate velocities and advance positions
- 1 femtosecond time-step, millions of steps needed!
- Thousands of atoms (1,000 - 100,000)
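A minimal sketch of such a time-step in plain C++; the data layout, units, and the trivial computeForces placeholder are assumptions for illustration, not NAMD's actual structure.

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };
    struct Atom { Vec3 pos, vel, force; double mass = 1.0; };

    // Placeholder force evaluation: a real code adds bonded terms and the
    // cut-off based non-bonded terms sketched on the next slide.
    void computeForces(std::vector<Atom>& atoms) {
        for (Atom& a : atoms) a.force = Vec3{};
        // ... accumulate bonded and non-bonded forces here ...
    }

    // One MD time-step: forces, then velocities, then positions (dt ~ 1 fs).
    void integrateStep(std::vector<Atom>& atoms, double dt) {
        computeForces(atoms);
        for (Atom& a : atoms) {
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
    }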
14 Cut-off radius
- Use of a cut-off radius to reduce work (a sketch follows below)
- 8 - 14 Å
- Faraway charges are ignored!
- 80-95% of the work is non-bonded force computation
- Some simulations need faraway contributions
- Periodic systems: Ewald, Particle-Mesh Ewald
- Aperiodic systems: FMA
- Even so, cut-off based computations are important
- near-atom calculations are part of the above
- multiple time-stepping is used: k cut-off steps, 1 PME/FMA step
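A sketch of cut-off based non-bonded (electrostatic) force evaluation; this illustrative version uses a plain O(N^2) pair loop and simplified units, whereas a production code only considers pairs in nearby patches.

    #include <cmath>
    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };
    struct Atom { Vec3 pos, force; double charge = 0.0; };

    void nonbondedForces(std::vector<Atom>& atoms, double cutoff /* e.g. 8-14 Angstroms */) {
        const double cutoff2 = cutoff * cutoff;
        for (size_t i = 0; i < atoms.size(); ++i) {
            for (size_t j = i + 1; j < atoms.size(); ++j) {
                double dx = atoms[i].pos.x - atoms[j].pos.x;
                double dy = atoms[i].pos.y - atoms[j].pos.y;
                double dz = atoms[i].pos.z - atoms[j].pos.z;
                double r2 = dx * dx + dy * dy + dz * dz;
                if (r2 > cutoff2) continue;                  // faraway charges ignored
                double r = std::sqrt(r2);
                double f = atoms[i].charge * atoms[j].charge / (r2 * r);  // Coulomb, simplified units
                atoms[i].force.x += f * dx;  atoms[j].force.x -= f * dx;  // Newton's 3rd law
                atoms[i].force.y += f * dy;  atoms[j].force.y -= f * dy;
                atoms[i].force.z += f * dz;  atoms[j].force.z -= f * dz;
            }
        }
    }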
15 Scalability
- The program should scale up to use a large number of processors.
- But what does that mean?
- An individual simulation isn't truly scalable
- A better definition of scalability:
- If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
16 Isoefficiency
- Quantifies scalability
- (Work of Vipin Kumar, U. Minnesota)
- How much increase in problem size is needed to retain the same efficiency on a larger machine?
- Efficiency = sequential time / (P × parallel time) (see the formulas below)
- parallel time = computation + communication + idle time
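The same definitions written out in LaTeX, with the isoefficiency relation in the form used by Kumar and colleagues; here W denotes problem size and T_o(W, P) the total overhead (communication plus idle time), which is a notational assumption rather than something stated on the slide.

    E = \frac{T_{\mathrm{seq}}}{P \, T_{\mathrm{par}}},
    \qquad
    T_{\mathrm{par}} = t_{\mathrm{comp}} + t_{\mathrm{comm}} + t_{\mathrm{idle}}

    % Isoefficiency: to keep E fixed as P grows, the problem size must grow as
    W = \frac{E}{1 - E} \; T_o(W, P)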
17 Atom decomposition
- Partition the atoms array across processors
- Nearby atoms may not be on the same processor
- Communication: O(N) per processor
- Communication-to-computation ratio: O(N) / (N/P) = O(P)
- Again, not scalable by our definition
18 Force Decomposition
- Distribute the force matrix to processors
- Matrix is sparse and non-uniform
- Each processor has one block
- Communication: O(N/√P) per processor
- Communication-to-computation ratio: O(√P)
- Better scalability in practice
- (can use 100 processors)
- Plimpton
- Hwang, Saltz, et al:
- speedup of 6 on 32 PEs, 36 on 128 processors
- Yet not scalable in the sense defined here!
19 Spatial Decomposition
- Allocate nearby atoms to the same processor
- Three variations are possible:
- Partitioning into P boxes, 1 per processor
- Good scalability, but hard to implement
- Partitioning into fixed-size boxes, each a little larger than the cutoff distance (a sketch follows below)
- Partitioning into smaller boxes
- Communication: O(N/P)
- so, scalable in principle
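A sketch of the cutoff-sized box variant: atoms map to a patch whose edge is at least the cutoff, so interacting pairs always lie in the same or a neighboring patch. Function and type names here are illustrative.

    #include <cmath>
    #include <cstdlib>

    struct PatchIndex { int ix, iy, iz; };

    PatchIndex patchOf(double x, double y, double z,
                       double originX, double originY, double originZ,
                       double patchEdge /* >= cutoff */) {
        return { static_cast<int>(std::floor((x - originX) / patchEdge)),
                 static_cast<int>(std::floor((y - originY) / patchEdge)),
                 static_cast<int>(std::floor((z - originZ) / patchEdge)) };
    }

    // Two patches can contain interacting atoms only if they differ by at most
    // one box in each dimension (26 neighbors plus the patch itself).
    bool mayInteract(const PatchIndex& a, const PatchIndex& b) {
        return std::abs(a.ix - b.ix) <= 1 &&
               std::abs(a.iy - b.iy) <= 1 &&
               std::abs(a.iz - b.iz) <= 1;
    }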
20 Spatial Decomposition in NAMD
- NAMD 1 used spatial decomposition
- Good theoretical isoefficiency, but load balancing problems for a fixed-size system
- For midsize systems, we got good speedups up to 16 processors
- Use the symmetry of Newton's 3rd law to facilitate load balancing
21 Spatial Decomposition
But the load balancing problems are still severe
22 (No Transcript)
23 FD + SD
- Now, we have many more objects to load balance
- Each diamond can be assigned to any processor
- Number of diamonds (3D) = 14 × number of patches
24 Bond Forces
- Multiple types of forces:
- Bonds (2 atoms), Angles (3), Dihedrals (4), ...
- Luckily, each involves atoms in neighboring patches only
- Straightforward implementation:
- Send a message to all neighbors,
- receive forces from them
- 26 × 2 messages per patch!
25 Bonded Forces
- Assume one patch per processor
- An angle force involving atoms in patches
- (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
- is calculated in patch (max xi, max yi, max zi) (see the sketch below)
(Diagram: patches A, B, C)
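A minimal sketch of that ownership rule in plain C++: the patch with the coordinate-wise maximum index computes the bonded term, so the term is computed exactly once; the type and function names are illustrative.

    #include <algorithm>
    #include <vector>

    struct PatchIndex { int ix, iy, iz; };

    PatchIndex owningPatch(const std::vector<PatchIndex>& patchesOfAtoms) {
        PatchIndex owner = patchesOfAtoms.front();
        for (const PatchIndex& p : patchesOfAtoms) {
            owner.ix = std::max(owner.ix, p.ix);
            owner.iy = std::max(owner.iy, p.iy);
            owner.iz = std::max(owner.iz, p.iz);
        }
        // e.g. (max xi, max yi, max zi) for an angle spanning up to three patches
        return owner;
    }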
26 Implementation
- Multiple objects per processor
- Different types: patches, pairwise force computes, bonded force computes, ...
- Each may have its data ready at a different time
- Need the ability to map and remap them
- Need prioritized scheduling
- Charm++ supports all of these
27 Load Balancing
- Is a major challenge for this application
- especially for a large number of processors
- Unpredictable workloads
- Each diamond (force object) and patch encapsulates a variable amount of work
- Static estimates are inaccurate
- Measurement-based load balancing framework
- Robert Brunner's recent Ph.D. thesis
- Very slow variations across timesteps
28 Bipartite graph balancing
- Background load:
- Patches (integration, ...) and bond-related forces
- Migratable load:
- Non-bonded forces
- Bipartite communication graph
- between migratable and non-migratable objects
- Challenge:
- Balance load while minimizing communication
29 Load balancing strategy

Greedy variant (simplified):
  Sort compute objects (diamonds)
  Repeat (until all are assigned):
    S = set of all processors that
        -- are not overloaded
        -- generate the least new communication
    P = least loaded processor in S
    Assign the heaviest remaining compute to P

Refinement:
  Repeat
    Pick a compute from the most overloaded PE
    Assign it to a suitable underloaded PE
  Until (no movement)

(Diagram: bipartite graph of cells and compute objects; a C++ sketch of the greedy variant follows below)
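A plain C++ sketch of the greedy variant, with the communication-aware candidate filtering reduced to a comment; object ids are assumed to be 0..n-1, and all names are illustrative.

    #include <algorithm>
    #include <vector>

    struct Compute { int id; double load; };   // measured load per compute object

    std::vector<int> greedyAssign(std::vector<Compute> computes, int numProcs) {
        // Heaviest objects first.
        std::sort(computes.begin(), computes.end(),
                  [](const Compute& a, const Compute& b) { return a.load > b.load; });
        std::vector<double> procLoad(numProcs, 0.0);
        std::vector<int> assignment(computes.size(), -1);
        for (const Compute& c : computes) {
            // In the real strategy, candidates are restricted to processors that
            // are not overloaded and that add the least new communication; here
            // we simply pick the least loaded processor.
            int best = static_cast<int>(std::min_element(procLoad.begin(), procLoad.end())
                                        - procLoad.begin());
            assignment[c.id] = best;      // ids assumed to be 0..n-1
            procLoad[best] += c.load;
        }
        return assignment;                // a refinement pass would then move computes
                                          // off the most overloaded PEs
    }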
30 (No Transcript)
31 Speedups in 1998
32 Initial Speedup Results: ASCI Red
33 BC1 complex: 200k atoms
34 Optimizations
- A series of optimizations
- Examples to be covered here:
- Grainsize distributions (bimodal)
- Integration: message sending overheads
35 Grainsize and Amdahl's law
- A variant of Amdahl's law, for objects, would be:
- The fastest time can be no shorter than the time for the biggest single object!
- How did it apply to us?
- Sequential step time was 57 seconds
- To run on 2k processors, no object should take more than 28 msecs (see the arithmetic below)
- It should be even shorter in practice
- Grainsize analysis via Projections showed that this was not so...
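Taking "2k processors" to mean 2048 (an assumption), the per-processor budget per step works out to roughly the 28 ms quoted above:

    \frac{57\ \text{s}}{2048\ \text{processors}} \approx 27.8\ \text{ms per processor per step}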
36Grainsize analysis
Solution Split compute objects that may have
too much work using a heuristics based on number
of interacting atoms
Problem
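A plain C++ sketch of such a splitting heuristic; the cost proxy (product of atom counts), the threshold value, and the piece layout are illustrative assumptions, not NAMD's actual heuristic.

    #include <algorithm>
    #include <vector>

    struct ComputePiece { int patchA, patchB, firstAtomOfA, lastAtomOfA; };

    std::vector<ComputePiece> maybeSplit(int patchA, int patchB,
                                         int atomsInA, int atomsInB,
                                         int maxPairsPerObject = 20000) {
        std::vector<ComputePiece> pieces;
        long pairs = static_cast<long>(atomsInA) * atomsInB;     // estimated work
        int numPieces = static_cast<int>(pairs / maxPairsPerObject) + 1;
        int chunk = (atomsInA + numPieces - 1) / numPieces;      // split along patch A's atoms
        for (int first = 0; first < atomsInA; first += chunk)
            pieces.push_back({patchA, patchB, first,
                              std::min(first + chunk, atomsInA) - 1});
        return pieces;
    }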
37 Grainsize reduced
38 Performance audit
39 Performance audit
- Through the optimization process, an audit was kept to decide where to look to improve performance
Component     Ideal    Actual
Total         57.04    86
nonBonded     52.44    49.77
Bonds          3.16     3.9
Integration    1.44     3.05
Overhead       0        7.97
Imbalance      0       10.45
Idle           0        9.25
Receives       0        1.61
Integration time doubled
40 Integration overhead analysis
Problem: integration time had doubled compared with the sequential run
41 Integration overhead example
- The Projections views showed that the overhead was associated with sending messages.
- Many cells were sending 30-40 messages.
- The overhead was too large compared with the cost of the messages themselves.
- Code analysis: memory allocations!
- An identical message was being sent to 30 processors.
- Simple multicast support was added to Charm++ (a sketch follows below)
- It mainly eliminates memory allocations (and some copying)
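A plain C++ sketch contrasting per-destination copies with a multicast-style reuse of one buffer; sendTo is a stand-in stub, not the actual Charm++ messaging or multicast API.

    #include <cstring>
    #include <vector>

    // Assumed low-level send primitive (stubbed here); in the real code this
    // role is played by the Charm++ messaging layer.
    void sendTo(int destProcessor, const char* buf, std::size_t len) {
        (void)destProcessor; (void)buf; (void)len;
    }

    // Before: an identical message is allocated and copied once per destination.
    void sendCopies(const std::vector<int>& dests, const char* data, std::size_t len) {
        for (int d : dests) {
            char* copy = new char[len];
            std::memcpy(copy, data, len);
            sendTo(d, copy, len);
            delete[] copy;               // safe only because sendTo here is synchronous
        }
    }

    // After: a multicast-style send reuses one buffer for all 30-40 destinations,
    // eliminating the per-destination allocations (and some copying).
    void multicastTo(const std::vector<int>& dests, const char* data, std::size_t len) {
        for (int d : dests)
            sendTo(d, data, len);
    }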
42 Integration overhead: After multicast
43 Improved Performance Data
44 Recent Results on Origin 2000
45 Results on Linux Cluster
46 Performance of Apo-A1 on ASCI Red
47 Performance of Apo-A1 on O2k and T3E
48 Lessons learned
- Need to downsize objects!
- Choose the smallest possible grainsize that amortizes overhead
- One of the biggest challenges was getting time for performance-tuning runs on parallel machines
49 Future and planned work
- Speedup on small molecules!
- Interactive molecular dynamics
- Increased speedups on 2k-10k processors
- Smaller grainsizes
- New algorithms for reducing communication impact
- New load balancing strategies
- Further performance improvements for PME/FMA
- With multiple timestepping
- Needs multi-phase load balancing
50 Steered MD example picture
Image and simulation by the Theoretical Biophysics Group, Beckman Institute, UIUC
51 More information
- Charm++ and associated frameworks:
- http://charm.cs.uiuc.edu
- NAMD and associated biophysics tools:
- http://www.ks.uiuc.edu
- Both include downloadable software
52 Performance vs. size of system
Performance data on Cray T3E
53 Performance on various machines