Scalable Molecular Dynamics for Large Biomolecular Systems - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Scalable Molecular Dynamics for Large Biomolecular Systems


1
Scalable Molecular Dynamicsfor Large
Biomolecular Systems
  • Robert Brunner
  • James C Phillips
  • Laxmikant Kale

2
Overview
  • Context: approach and methodology
  • Molecular dynamics for biomolecules
  • Our program: NAMD
  • Basic parallelization strategy
  • NAMD performance optimizations
  • Techniques
  • Results
  • Conclusions: summary, lessons and future work

3
The context
  • Objective: Enhance performance and productivity
    in parallel programming
  • For complex, dynamic applications
  • Scalable to thousands of processors
  • Theme:
  • Adaptive techniques for handling dynamic behavior
  • Look for the optimal division of labor between the
    human programmer and the system
  • Let the programmer specify what to do in parallel
  • Let the system decide when and where to run the
    subcomputations
  • Data-driven objects as the substrate

4
(Figure: a parallel computation expressed as numbered objects, 1-13, mapped across processors)
5
Data driven execution
(Figure: on each processor, a scheduler with a message queue dispatches incoming method invocations to its local objects)
6
Charm++
  • Parallel C++ with data-driven objects
  • Object arrays and collections
  • Asynchronous method invocation (see the sketch below)
  • Object groups:
  • a global object with a representative on each PE
  • Prioritized scheduling
  • Mature, robust, portable
  • http://charm.cs.uiuc.edu

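To make object arrays and asynchronous method invocation concrete, here is a minimal Charm++-style sketch. The names (Patch, addForces) are illustrative, not NAMD's actual classes; the interface file is shown in comments. Invoking an entry method through the array proxy sends an asynchronous message that the destination processor's scheduler delivers when it picks the message off its queue.

    // patch.ci -- Charm++ interface file (illustrative)
    //   mainmodule patch {
    //     array [1D] Patch {
    //       entry Patch();
    //       entry void addForces(int n, double forces[n]);
    //     };
    //   };

    // patch.C -- implementation (illustrative)
    #include "patch.decl.h"   // generated from patch.ci

    class Patch : public CBase_Patch {
    public:
      Patch() {}
      // Entry method: runs whenever the scheduler delivers a matching message
      void addForces(int n, double *forces) {
        // accumulate forces for the atoms owned by this array element
      }
    };

    #include "patch.def.h"

    // Usage from another chare (asynchronous; the call returns immediately):
    //   CProxy_Patch patches = CProxy_Patch::ckNew(numPatches);
    //   patches[i].addForces(n, f);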
7
Multi-partition decomposition
8
Load balancing
  • Based on migratable objects
  • Collect timing data for several cycles
  • Run a heuristic load balancer
  • Several alternative strategies are available
  • Re-map and migrate objects accordingly
  • Registration mechanisms facilitate migration

9
Measurement based load balancing
  • Application-induced imbalances:
  • abrupt, but infrequent, or
  • slow and cumulative
  • rarely: frequent, large changes
  • Principle of persistence
  • An extension of the principle of locality
  • The behavior of objects, including their computational
    load and communication patterns, tends to persist
    over time
  • We have implemented strategies that exploit this
    automatically

10
Molecular Dynamics
11
Molecular dynamics and NAMD
  • MD is used to understand the structure and function of
    biomolecules
  • proteins, DNA, membranes
  • NAMD is a production-quality MD program
  • Active use by biophysicists (science
    publications)
  • 50,000 lines of C++ code
  • 1000 registered users
  • Features and accessories such as:
  • VMD - visualization
  • BioCoRE - collaboratory
  • Steered and Interactive Molecular Dynamics

12
NAMD Contributors
  • PIs:
  • Laxmikant Kale, Klaus Schulten, Robert Skeel
  • NAMD 1:
  • Robert Brunner, Andrew Dalke, Attila Gursoy,
    Bill Humphrey, Mark Nelson
  • NAMD 2:
  • M. Bhandarkar, R. Brunner, A. Gursoy, J.
    Phillips, N. Krawetz, A. Shinozaki, K.
    Varadarajan, Gengbin Zheng, ...

13
Molecular Dynamics
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step:
  • calculate forces on each atom
  • bonds
  • non-bonded: electrostatic and van der Waals
  • calculate velocities and advance positions (see the sketch below)
  • 1 femtosecond time-step; millions of steps needed!
  • Thousands of atoms (1,000 - 100,000)

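A minimal sketch of one such step, assuming the forces for the current positions have already been computed. The integrator shown (update velocities, then advance positions) is schematic, not NAMD's actual code.

    #include <cstddef>
    #include <vector>

    struct Vec3 { double x, y, z; };

    // One schematic integration step of length dt (e.g. 1 fs in the code's units).
    // 'force' must hold the forces computed for the current positions.
    void integrateStep(std::vector<Vec3>& pos, std::vector<Vec3>& vel,
                       const std::vector<Vec3>& force,
                       const std::vector<double>& mass, double dt) {
      for (size_t i = 0; i < pos.size(); ++i) {
        // update velocity from the force on atom i ...
        vel[i].x += dt * force[i].x / mass[i];
        vel[i].y += dt * force[i].y / mass[i];
        vel[i].z += dt * force[i].z / mass[i];
        // ... then advance its position
        pos[i].x += dt * vel[i].x;
        pos[i].y += dt * vel[i].y;
        pos[i].z += dt * vel[i].z;
      }
    }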
14
Cut-off radius
  • Use of a cut-off radius to reduce work (see the sketch below)
  • 8 - 14 Å
  • Far-away charges ignored!
  • 80-95% of the work is non-bonded force computation
  • Some simulations need the far-away contributions:
  • periodic systems: Ewald, Particle-Mesh Ewald (PME)
  • aperiodic systems: FMA
  • Even so, cut-off based computations are
    important:
  • near-atom calculations are part of the above
  • multiple time-stepping is used: k cut-off steps,
    then 1 PME/FMA step

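As an illustration of the cut-off idea (not NAMD's actual kernel), a naive pairwise loop that skips every pair farther apart than the cut-off; a real code would use a pair list or spatial grid instead of the O(N^2) double loop.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    struct Vec3 { double x, y, z; };

    // Accumulate pairwise Coulomb forces (reduced units), ignoring pairs beyond 'cutoff'.
    // 'force' must be pre-sized to pos.size() and zeroed by the caller.
    void cutoffForces(const std::vector<Vec3>& pos, const std::vector<double>& charge,
                      double cutoff, std::vector<Vec3>& force) {
      double cut2 = cutoff * cutoff;
      for (size_t i = 0; i < pos.size(); ++i) {
        for (size_t j = i + 1; j < pos.size(); ++j) {
          double dx = pos[i].x - pos[j].x;
          double dy = pos[i].y - pos[j].y;
          double dz = pos[i].z - pos[j].z;
          double r2 = dx * dx + dy * dy + dz * dz;
          if (r2 > cut2) continue;                        // far-away pair: skipped
          double r = std::sqrt(r2);
          double f = charge[i] * charge[j] / (r2 * r);    // q_i q_j / r^3 times the displacement
          force[i].x += f * dx; force[i].y += f * dy; force[i].z += f * dz;
          force[j].x -= f * dx; force[j].y -= f * dy; force[j].z -= f * dz;
        }
      }
    }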
15
Scalability
  • The program should scale up to use a large number
    of processors.
  • But what does that mean?
  • An individual simulation isn't truly scalable
  • A better definition of scalability:
  • if I double the number of processors, I should
    be able to retain parallel efficiency by
    increasing the problem size

16
Isoefficiency
  • Quantifies scalability
  • (work of Vipin Kumar, U. Minnesota)
  • How much increase in problem size is needed to
    retain the same efficiency on a larger machine?
  • Efficiency = Seq. time / (P × Parallel time)
  • Parallel time =
  • computation + communication + idle (see the formal statement below)

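Stated formally (the standard isoefficiency formulation of Kumar and co-authors, restated here for reference; W is the problem size, T_1 the sequential time, T_o(W,P) the total overhead, i.e. communication plus idle time):

    E = \frac{T_1}{P \, T_P}, \qquad
    T_P = \frac{T_1 + T_o(W, P)}{P}
    \quad\Longrightarrow\quad
    E = \frac{1}{1 + T_o(W, P)/T_1}

Keeping E constant as P grows therefore requires T_1 = (E / (1 - E)) T_o(W, P); the isoefficiency function is the rate at which the problem size W must grow with P to satisfy this.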
17
Atom decomposition
  • Partition the atoms array across processors
  • Nearby atoms may not be on the same processor
  • Communication: O(N) per processor
  • Communication/Computation: O(N)/(N/P) = O(P)
  • Again, not scalable by our definition

18
Force Decomposition
  • Distribute the force matrix to processors
  • The matrix is sparse and non-uniform
  • Each processor has one block
  • Communication: O(N/√P) per processor
  • Communication/Computation ratio: O(√P)
  • Better scalability in practice
  • (can use 100 processors)
  • Plimpton
  • Hwang, Saltz, et al.
  • speedup of 6 on 32 PEs, 36 on 128 processors
  • Yet not scalable in the sense defined here!

19
Spatial Decomposition
  • Allocate close-by atoms to the same processor
  • Three variations possible:
  • Partitioning into P boxes, one per processor
  • Good scalability, but hard to implement
  • Partitioning into fixed-size boxes, each a little
    larger than the cutoff distance (see the sketch below)
  • Partitioning into smaller boxes
  • Communication: O(N/P)
  • so, scalable in principle

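A minimal sketch of the fixed-size-box idea, assuming an orthogonal simulation box; with the box edge at least the cut-off distance, every pair of atoms within the cut-off lies in the same or a neighboring box. The names here are illustrative, not NAMD's patch code.

    #include <cmath>
    #include <cstdlib>

    struct PatchIndex { int ix, iy, iz; };

    // Map an atom position to the index of the fixed-size box ("patch") containing it.
    // boxSize is chosen slightly larger than the cut-off radius.
    PatchIndex patchOf(double x, double y, double z, double boxSize) {
      PatchIndex p;
      p.ix = (int)std::floor(x / boxSize);
      p.iy = (int)std::floor(y / boxSize);
      p.iz = (int)std::floor(z / boxSize);
      return p;
    }

    // Two patches can contain interacting atoms only if they are neighbors,
    // i.e. their indices differ by at most 1 in each dimension.
    bool mayInteract(const PatchIndex& a, const PatchIndex& b) {
      return std::abs(a.ix - b.ix) <= 1 &&
             std::abs(a.iy - b.iy) <= 1 &&
             std::abs(a.iz - b.iz) <= 1;
    }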
20
Spatial Decomposition in NAMD
  • NAMD 1 used spatial decomposition
  • Good theoretical isoefficiency, but load balancing
    problems for a fixed-size system
  • For midsize systems, it got good speedups up to 16
    processors.
  • Uses the symmetry of Newton's 3rd law to
    facilitate load balancing

21
Spatial Decomposition
But the load balancing problems are still severe
22

23
FD + SD
  • Now we have many more objects to load balance:
  • Each diamond can be assigned to any processor
  • Number of diamonds (3D) =
  • 14 × Number of patches

24
Bond Forces
  • Multiple types of forces:
  • bonds (2 atoms), angles (3), dihedrals (4), ...
  • Luckily, each involves atoms in neighboring
    patches only
  • Straightforward implementation:
  • send a message to all neighbors,
  • receive forces from them
  • 26 × 2 messages per patch!

25
Bonded Forces
  • Assume one patch per processor
  • An angle force involving atoms in patches
  • (x1,y1,z1), (x2,y2,z2), (x3,y3,z3)
  • is calculated in patch (max xi, max yi,
    max zi) (see the sketch below)

(Figure: three neighboring patches labeled A, B, C)
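A small sketch of this assignment rule (illustrative only): the patch whose index is the componentwise maximum of the participating patches' indices computes the term, so each bonded term is computed exactly once.

    #include <algorithm>

    struct PatchIndex { int ix, iy, iz; };

    // The patch that owns (computes) a bonded term spanning three patches:
    // the componentwise maximum of the patch indices involved.
    PatchIndex ownerPatch(const PatchIndex& a, const PatchIndex& b, const PatchIndex& c) {
      PatchIndex o;
      o.ix = std::max(a.ix, std::max(b.ix, c.ix));
      o.iy = std::max(a.iy, std::max(b.iy, c.iy));
      o.iz = std::max(a.iz, std::max(b.iz, c.iz));
      return o;
    }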
26
Implementation
  • Multiple objects per processor
  • Different types: patches, pairwise forces, bonded
    forces, ...
  • Each may have its data ready at different times
  • Need the ability to map and remap them
  • Need prioritized scheduling
  • Charm++ supports all of these

27
Load Balancing
  • A major challenge for this application
  • especially for a large number of processors
  • Unpredictable workloads:
  • each diamond (force object) and patch encapsulates
    a variable amount of work
  • static estimates are inaccurate
  • Measurement-based load balancing framework
  • Robert Brunner's recent Ph.D. thesis
  • Very slow variations across timesteps

28
Bipartite graph balancing
  • Background load:
  • patches (integration, ...) and bond-related
    forces
  • Migratable load:
  • non-bonded forces
  • Bipartite communication graph
  • between migratable and non-migratable objects
  • Challenge:
  • balance load while minimizing communication

29
Load balancing strategy
Greedy variant (simplified):
    Sort compute objects (diamonds)
    Repeat (until all assigned):
        S = set of all processors that
            -- are not overloaded
            -- generate the least new communication
        P = least loaded processor in S
        Assign the heaviest compute to P
(A C++ sketch of the greedy variant follows.)

Refinement:
    Repeat
        - Pick a compute from the most overloaded PE
        - Assign it to a suitable underloaded PE
    Until (no movement)

(Figure: bipartite graph of cells (patches) and compute objects)
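A compact sketch of the greedy variant under simplifying assumptions: the communication term is ignored and the candidate set is simply the least-loaded processor. The types and names are illustrative, not the Charm++ load balancing framework's API.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    struct Compute { double load; };   // measured load of one compute object (diamond)

    // Greedy assignment: heaviest compute first, placed on the least-loaded processor.
    // Returns, for each compute, the index of the processor it was assigned to.
    std::vector<int> greedyAssign(const std::vector<Compute>& computes, int numProcs) {
      // Order compute objects by decreasing load (heaviest first).
      std::vector<size_t> order(computes.size());
      for (size_t i = 0; i < order.size(); ++i) order[i] = i;
      std::sort(order.begin(), order.end(),
                [&](size_t a, size_t b) { return computes[a].load > computes[b].load; });

      std::vector<double> procLoad(numProcs, 0.0);
      std::vector<int> assignment(computes.size(), -1);
      for (size_t k = 0; k < order.size(); ++k) {
        // The full strategy restricts candidates to processors that are not overloaded
        // and would add the least new communication; this sketch just takes the
        // least-loaded processor.
        int best = 0;
        for (int p = 1; p < numProcs; ++p)
          if (procLoad[p] < procLoad[best]) best = p;
        procLoad[best] += computes[order[k]].load;
        assignment[order[k]] = best;
      }
      return assignment;
    }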
30
(No Transcript)
31
Speedups in 1998
32
Initial Speedup Results on ASCI Red
33
BC1 complex: 200k atoms
34
Optimizations
  • Series of optimizations
  • Examples to be covered here:
  • Grainsize distributions (bimodal)
  • Integration: message-sending overheads

35
Grainsize and Amdahl's law
  • A variant of Amdahl's law, for objects, would be:
  • the fastest time can be no shorter than the time
    for the biggest single object!
  • How did it apply to us?
  • Sequential step time was 57 seconds
  • To run on 2k processors, no object should take more
    than 28 msecs (57 s / 2048 ≈ 28 ms; see the bound below)
  • It should be even shorter
  • Grainsize analysis via Projections showed that this
    was not so.

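The bullets above stated as a bound (a restatement, writing g_max for the execution time of the largest single object):

    T_{\text{step}} \;\ge\; \max\!\left(\frac{T_{\text{seq}}}{P},\; g_{\max}\right)
    \quad\Longrightarrow\quad
    g_{\max} \;\le\; \frac{57\ \text{s}}{2048} \;\approx\; 28\ \text{ms}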
36
Grainsize analysis
Problem: (histogram of compute-object grainsizes)
Solution: split compute objects that may have too much
work, using a heuristic based on the number of
interacting atoms
37
Grainsize reduced
38
Performance audit
39
Performance audit
  • Throughout the optimization process, an audit was
    kept to decide where to look to improve performance

                 Ideal    Actual
  Total          57.04    86
  nonBonded      52.44    49.77
  Bonds           3.16     3.9
  Integration     1.44     3.05
  Overhead        0        7.97
  Imbalance       0       10.45
  Idle            0        9.25
  Receives        0        1.61

Integration time doubled
40
Integration overhead analysis
(Figure: integration)
Problem: integration time had doubled relative to the
sequential run
41
Integration overhead example
  • The Projections pictures showed that the overhead was
    associated with sending messages.
  • Many cells were sending 30-40 messages.
  • The overhead was too large compared with the
    cost of the messages themselves.
  • Code analysis: memory allocations!
  • An identical message was being sent to 30
    processors.
  • Simple multicast support was added to Charm++
    (see the sketch below)
  • It mainly eliminates memory allocations (and some
    copying)

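To illustrate why multicast helps (a schematic sketch, not the actual Charm++ multicast API): sending the same data to k destinations one at a time allocates and fills k message buffers, whereas a multicast builds the message once and lets the runtime fan it out.

    #include <cstddef>
    #include <vector>

    struct Message { std::vector<char> payload; };

    // Stand-ins for the runtime's send primitives (illustrative stubs).
    void send(int /*dest*/, const Message& /*m*/) {}
    void multicast(const std::vector<int>& /*dests*/, const Message& /*m*/) {}

    // Before: the same data is packed into a fresh message for every destination,
    // costing one allocation and one copy per processor.
    void sendToEach(const std::vector<int>& dests, const char* data, size_t n) {
      for (int d : dests) {
        Message m;
        m.payload.assign(data, data + n);
        send(d, m);
      }
    }

    // After (multicast idea): the message is built once and the runtime delivers it
    // to every destination, eliminating the per-destination allocations and copies.
    void sendOnce(const std::vector<int>& dests, const char* data, size_t n) {
      Message m;
      m.payload.assign(data, data + n);
      multicast(dests, m);
    }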
42
Integration overhead After multicast
43
Improved Performance Data
44
Recent Results on Origin 2000
45
Results on Linux Cluster
46
Performance of Apo-A1 on ASCI Red
47
Performance of Apo-A1 on O2k and T3E
48
Lessons learned
  • Need to downsize objects!
  • Choose the smallest grainsize that amortizes the
    overhead
  • One of the biggest challenges
  • was getting time for performance-tuning runs on
    parallel machines

49
Future and Planned work
  • Speedup on small molecules!
  • Interactive molecular dynamics
  • Increased speedups on 2k-10k processors
  • Smaller grainsizes
  • New algorithms for reducing communication impact
  • New load balancing strategies
  • Further performance improvements for PME/FMA
  • With multiple timestepping
  • Needs multi-phase load balancing

50
Steered MD example picture
Image and simulation by the Theoretical Biophysics
Group, Beckman Institute, UIUC
51
More information
  • Charm++ and associated framework:
  • http://charm.cs.uiuc.edu
  • NAMD and associated biophysics tools:
  • http://www.ks.uiuc.edu
  • Both include downloadable software

52
Performance vs. size of system
Performance data on Cray T3E
53
Performance on various machines