Title: Scalable Molecular Dynamics for Large Biomolecular Systems
1Scalable Molecular Dynamics for Large Biomolecular Systems
- Robert Brunner
- James C. Phillips
- Laxmikant Kale
- Department of Computer Science
- and
- Theoretical Biophysics Group
- University of Illinois at Urbana-Champaign
2Parallel Computing with Data-driven Objects
- Laxmikant (Sanjay) Kale
- Parallel Programming Laboratory
- Department of Computer Science
- http://charm.cs.uiuc.edu
3Overview
- Context: approach and methodology
- Molecular dynamics for biomolecules
- Our program, NAMD
- Basic parallelization strategy
- NAMD performance optimizations
- Techniques
- Results
- Conclusions: summary, lessons, and future work
4Parallel Programming Laboratory
- Objective: enhance performance and productivity in parallel programming
- For complex, dynamic applications
- Scalable to thousands of processors
- Theme
- Adaptive techniques for handling dynamic behavior
- Strategy: look for the optimal division of labor between the human programmer and the system
- Let the programmer specify what to do in parallel
- Let the system decide when and where to run them
- Data-driven objects as the substrate: Charm++
5System Mapped Objects
(Figure: the same set of numbered objects shown twice, as created by the programmer and as mapped onto processors by the system.)
6Data Driven Execution
(Diagram: on each processor, a scheduler picks the next message from its queue and invokes the corresponding object's method.)
7Charm++
- Parallel C++ with data-driven objects
- Object Arrays and collections
- Asynchronous method invocation
- Object Groups
- global object with a representative on each PE
- Prioritized scheduling
- Mature, robust, portable
- http://charm.cs.uiuc.edu
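To make the asynchronous-invocation idea concrete, here is a minimal Charm++ sketch (a classic hello-style example, not taken from NAMD): a 1D chare array whose elements invoke each other's entry methods asynchronously. It assumes the usual headers generated from the .ci interface file shown in the comment.

  // hello.ci (interface file):
  //   mainmodule hello {
  //     readonly CProxy_Main mainProxy;
  //     mainchare Main   { entry Main(CkArgMsg *m); entry void done(); };
  //     array [1D] Hello { entry Hello(); entry void sayHi(int from); };
  //   };
  #include "hello.decl.h"

  /*readonly*/ CProxy_Main mainProxy;

  class Main : public CBase_Main {
   public:
    Main(CkArgMsg *m) {
      delete m;
      mainProxy = thisProxy;
      CProxy_Hello arr = CProxy_Hello::ckNew(16);  // 16 elements, placed by the runtime
      arr[0].sayHi(-1);                            // asynchronous: returns immediately
    }
    void done() { CkExit(); }
  };

  class Hello : public CBase_Hello {
   public:
    Hello() {}
    Hello(CkMigrateMessage *m) {}
    void sayHi(int from) {
      CkPrintf("Element %d (prev %d) on PE %d\n", thisIndex, from, CkMyPe());
      if (thisIndex < 15) thisProxy[thisIndex + 1].sayHi(thisIndex);  // message-driven chain
      else mainProxy.done();
    }
  };

  #include "hello.def.h"

Each sayHi call is just a message deposited in the destination processor's queue; the scheduler there delivers it whenever it is selected, which is exactly the data-driven execution pictured on the previous slide.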
8Multi-partition Decomposition
- Writing applications with Charm++
- Decompose the problem into a large number of chunks
- Implement chunks as objects
- Or, now, as MPI threads (AMPI on top of Charm++)
- Let Charm++ map and remap objects
- Allow for migration of objects (see the sketch below)
- If desired, specify potential migration points
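A hedged sketch of what "allowing for migration" looks like in Charm++ code: the chunk provides a pup (pack/unpack) routine so the runtime can serialize and move it, and marks a safe migration point. This is a fragment, assuming the usual generated Chunk declarations; the data members are illustrative, not NAMD's.

  class Chunk : public CBase_Chunk {
    std::vector<double> atoms;   // this chunk's share of the data (illustrative)
    int step;
   public:
    Chunk() : step(0) { usesAtSync = true; }   // participate in load balancing
    Chunk(CkMigrateMessage *m) {}
    void pup(PUP::er &p) {                     // called on pack, unpack, and checkpoint
      p | atoms;
      p | step;
    }
    void doStep() {
      // ... compute on local data, send results asynchronously ...
      ++step;
      if (step % 20 == 0) AtSync();            // a potential migration point; otherwise the
                                               // next step is triggered by incoming messages
    }
    void ResumeFromSync() { /* next step resumes here after the load balancer runs */ }
  };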
9Load Balancing Mechanisms
- Re-map and migrate objects
- Registration mechanisms facilitate migration
- Efficient message delivery strategies
- Efficient global operations
- Such as reductions and broadcasts
- Several classes of load balancing strategies provided
- Incremental
- Centralized as well as distributed
- Measurement based
10Principle of Persistence
- An observation about CSE applications
- Extension of principle of locality
- Behavior of objects, including computational load and communication patterns, tends to persist over time
- Application-induced imbalance
- Abrupt, but infrequent, or
- Slow, cumulative
- Rarely frequent, large changes
- Our framework still deals with this case as well
- Measurement based strategies
11Measurement-Based Load Balancing Strategies
- Collect timing data for several cycles
- Run heuristic load balancer
- Several alternative ones
- Robert Brunner's recent Ph.D. thesis
- Instrumentation framework
- Strategies
- Performance comparisons
12Molecular Dynamics
ApoA-I 92k Atoms
13Molecular Dynamics and NAMD
- MD is used to understand the structure and function of biomolecules
- Proteins, DNA, membranes
- NAMD is a production-quality MD program
- Active use by biophysicists (published science)
- 50,000 lines of C++ code
- 1000 registered users
- Features include
- CHARMM and XPLOR compatibility
- PME electrostatics and multiple timestepping
- Steered and interactive simulation via VMD
14NAMD Contributors
- PIs
- Laxmikant Kale, Klaus Schulten, Robert Skeel
- NAMD Version 1
- Robert Brunner, Andrew Dalke, Attila Gursoy, Bill Humphrey, Mark Nelson
- NAMD2
- M. Bhandarkar, R. Brunner, Justin Gullingsrud, A. Gursoy, N. Krawetz, J. Phillips, A. Shinozaki, K. Varadarajan, Gengbin Zheng, ...
Theoretical Biophysics Group, supported by NIH
15Molecular Dynamics
- Collection of charged atoms, with bonds
- Newtonian mechanics
- At each time-step
- Calculate forces on each atom
- Bonds
- Non-bonded electrostatic and van der Waals
- Calculate velocities and advance positions
- 1 femtosecond time-step, millions needed!
- Thousands of atoms (1,000 - 100,000)
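For readers who want the time-step loop spelled out, here is a minimal sequential sketch in the velocity-Verlet style; it is illustrative only (NAMD's actual integrator and force terms are more involved), and computeForces() is an assumed helper that fills in the bonded and non-bonded forces.

  #include <vector>

  struct Atom { double r[3], v[3], f[3], mass; };

  void computeForces(std::vector<Atom>& atoms);   // assumed: bonds + electrostatics + van der Waals

  void timeStep(std::vector<Atom>& atoms, double dt /* ~1 femtosecond */) {
    for (Atom& a : atoms)
      for (int k = 0; k < 3; ++k) {
        a.v[k] += 0.5 * dt * a.f[k] / a.mass;     // half kick with current forces
        a.r[k] += dt * a.v[k];                    // advance positions
      }
    computeForces(atoms);                         // forces at the new positions
    for (Atom& a : atoms)
      for (int k = 0; k < 3; ++k)
        a.v[k] += 0.5 * dt * a.f[k] / a.mass;     // second half kick
  }

Millions of such steps are needed for a meaningful trajectory, which is why the per-step cost and its parallelization matter so much.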
16Cut-off Radius
- Use of cut-off radius to reduce work
- 8 - 14 Å
- Far away atoms ignored! (screening effects)
- 80-95% of the work is non-bonded force computation
- Some simulations need faraway contributions
- Particle-Mesh Ewald (PME)
- Even so, cut-off based computations are important
- Near-atom calculations constitute the bulk of the above
- Multiple time-stepping is used: k cut-off steps, then 1 PME step (sketched below)
- So, (k-1) of every k steps do just cut-off based simulation
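A small sketch of that multiple time-stepping loop, reusing the Atom type from the earlier sketch; the force routines here are illustrative placeholders, not NAMD's API.

  void computeBondedForces(std::vector<Atom>& atoms);        // assumed helper
  void computeCutoffNonbonded(std::vector<Atom>& atoms);     // within the 8-14 A radius
  void addLongRangeElectrostatics(std::vector<Atom>& atoms); // e.g., the PME contribution
  void integrate(std::vector<Atom>& atoms, double dt);       // advance positions and velocities

  void runMTS(std::vector<Atom>& atoms, int nSteps, int k, double dt) {
    for (int step = 0; step < nSteps; ++step) {
      computeBondedForces(atoms);
      computeCutoffNonbonded(atoms);          // done every step
      if (step % k == 0)
        addLongRangeElectrostatics(atoms);    // done once every k steps
      integrate(atoms, dt);
    }
  }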
17Scalability
- The program should scale up to use a large number of processors
- But what does that mean?
- An individual simulation isn't truly scalable
- Better definition of scalability:
- If I double the number of processors, I should be able to retain parallel efficiency by increasing the problem size
18Isoefficiency
- Quantify scalability
- (Work of Vipin Kumar, U. Minnesota)
- How much increase in problem size is needed to retain the same efficiency on a larger machine?
- Efficiency = sequential-time / (P × parallel-time)
- Parallel-time = computation + communication + idle
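For concreteness, the definitions above can be written out as follows; the final relation is Kumar's standard isoefficiency condition, added here for reference.

  \[
  E = \frac{T_{\mathrm{seq}}}{P\,T_{\mathrm{par}}}, \qquad
  T_{\mathrm{par}} = T_{\mathrm{comp}} + T_{\mathrm{comm}} + T_{\mathrm{idle}}
  \]

Holding E fixed as P grows gives the isoefficiency relation

  \[
  W = \frac{E}{1-E}\, T_o(W, P), \qquad T_o = P\,T_{\mathrm{par}} - T_{\mathrm{seq}},
  \]

where W is the problem size (sequential work) and T_o is the total overhead: the faster W must grow with P to satisfy this, the less scalable the algorithm.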
19Early methods
- Atom replication
- Each processor has data for all atoms
- Force calculations parallelized
- Collection of forces: O(N log P) communication
- Computation: O(N/P)
- Communication/computation ratio: O(P log P). Not scalable
- Atom Decomposition
- Partition the atoms array across processors
- Nearby atoms may not be on the same processor
- Communication: O(N) per processor
- Ratio: O(N) / (N/P) = O(P). Not scalable
20Force Decomposition
- Distribute force matrix to processors
- Matrix is sparse, non-uniform
- Each processor has one block
- Communication: O(N/√P) per processor
- Ratio: O(√P)
- Better scalability in practice
- Can use 100 processors
- Plimpton
- Hwang, Saltz, et al:
- Speedup of 6 on 32 processors
- Speedup of 36 on 128 processors
- Yet not scalable in the sense defined here!
21Spatial Decomposition
- Allocate close-by atoms to the same processor
- Three variations possible
- Partitioning into P boxes, 1 per processor
- Good scalability, but hard to implement
- Partitioning into fixed-size boxes, each a little larger than the cut-off distance
- Partitioning into smaller boxes
- Communication: O(N/P)
- Communication/computation ratio: O(1)
- So, scalable in principle
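Putting the last three slides side by side, the per-processor communication-to-computation ratios are summarized below; the force-decomposition figures follow the standard analysis (e.g., Plimpton's) and are added here for reference.

  \[
  \begin{aligned}
  \text{Atom replication:}      &\quad O(N\log P)\,/\,O(N/P) = O(P\log P)\\
  \text{Atom decomposition:}    &\quad O(N)\,/\,O(N/P) = O(P)\\
  \text{Force decomposition:}   &\quad O(N/\sqrt{P})\,/\,O(N/P) = O(\sqrt{P})\\
  \text{Spatial decomposition:} &\quad O(N/P)\,/\,O(N/P) = O(1)
  \end{aligned}
  \]

Only the last ratio is bounded independently of P, which is what "scalable in principle" means here.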
22Ongoing work
- Plimpton, Hendrickson
- new spatial decomposition
- NWChem (PNL)
- Peter Kollman, Yong Duan et al
- microsecond simulation
- AMBER version (SANDER)
23Spatial Decomposition in NAMD
But the load balancing problems are still severe
24Hybrid Decomposition
25FD + SD
- Now, we have many more objects to load balance
- Each diamond can be assigned to any processor
- Number of diamonds (3D): 14 × Number of Patches
26Bond Forces
- Multiple types of forces
- Bonds(2), angles(3), dihedrals (4), ..
- Luckily, each involves atoms in neighboring patches only
- Straightforward implementation:
- Send message to all neighbors,
- receive forces from them
- 26 × 2 messages per patch!
27Bond Forces
- Assume one patch per processor
- An angle force involving atoms in patches (x1,y1,z1), (x2,y2,z2), (x3,y3,z3) is calculated in the patch (max xi, max yi, max zi), as in the sketch below
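A tiny sketch of that ownership rule: take the element-wise maximum of the patch coordinates of the atoms involved. Every patch holding one of the atoms computes the same answer, so no negotiation is needed. PatchID is an assumed helper type, not NAMD's.

  #include <algorithm>
  #include <vector>

  struct PatchID { int x, y, z; };

  PatchID owningPatch(const std::vector<PatchID>& patchesOfAtoms) {  // 2-4 entries per bonded term
    PatchID owner = patchesOfAtoms.front();
    for (const PatchID& p : patchesOfAtoms) {
      owner.x = std::max(owner.x, p.x);
      owner.y = std::max(owner.y, p.y);
      owner.z = std::max(owner.z, p.z);
    }
    return owner;   // the "upstream" patch computes this bonded force
  }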
28NAMD Implementation
- Multiple objects per processor
- Different types: patches, pairwise forces, bonded forces
- Each may have its data ready at different times
- Need ability to map and remap them
- Need prioritized scheduling
- Charm++ supports all of these
29Load Balancing
- Is a major challenge for this application
- Especially for a large number of processors
- Unpredictable workloads
- Each diamond (force compute object) and patch encapsulates a variable amount of work
- Static estimates are inaccurate
- Very slow variations across timesteps
- Measurement-based load balancing framework
(Diagram: a compute object linked to the two cells/patches whose atoms it reads.)
30Bipartite Graph Balancing
- Background load
- Patches (integration, etc.) and bond-related forces
- Migratable load
- Non-bonded forces
- Bipartite communication graph
- Between migratable and non-migratable objects
- Challenge
- Balance load while minimizing communication
31Load Balancing Strategy
Greedy variant (simplified):
  Sort compute objects (diamonds) by load
  Repeat (until all are assigned):
    S = set of all processors that
        -- are not overloaded
        -- generate the least new communication
    P = least loaded processor in S
    Assign the heaviest remaining compute to P

Refinement:
  Repeat
    Pick a compute from the most overloaded PE
    Assign it to a suitable underloaded PE
  Until (no movement)

(A code sketch of the greedy step follows below.)
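Here is a simplified C++ sketch of just the greedy step, ignoring the communication term (the real strategy also restricts the candidate processors to those adding the least new communication, and then runs the refinement pass). Types and field names are illustrative.

  #include <algorithm>
  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  struct Compute { int id; double load; };   // ids assumed to be 0..n-1

  std::vector<int> greedyAssign(std::vector<Compute> computes, int nProcs) {
    // Heaviest compute objects first.
    std::sort(computes.begin(), computes.end(),
              [](const Compute& a, const Compute& b) { return a.load > b.load; });

    // Min-heap of (current load, processor id).
    using ProcLoad = std::pair<double, int>;
    std::priority_queue<ProcLoad, std::vector<ProcLoad>, std::greater<ProcLoad>> procs;
    for (int p = 0; p < nProcs; ++p) procs.push({0.0, p});

    std::vector<int> assignment(computes.size(), -1);
    for (const Compute& c : computes) {
      auto [load, p] = procs.top();           // least-loaded processor so far
      procs.pop();
      assignment[c.id] = p;
      procs.push({load + c.load, p});
    }
    return assignment;
  }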
32(No Transcript)
33Speedups in 1998
ApoA-I 92k atoms
34Optimizations
- Series of optimizations
- Examples discussed here
- Grainsize distributions (bimodal)
- Integration: message-sending overheads
- Several other optimizations
- Separation of bond/angle/dihedral objects
- Inter-patch and intra-patch
- Prioritization
- Local synchronization to avoid interference
across steps
35Grainsize and Amdahl's Law
- A variant of Amdahl's law, for objects, would be:
- The fastest time can be no shorter than the time for the biggest single object!
- How did it apply to us?
- Sequential step time was 57 seconds
- To run on 2k processors, no object should take more than 28 msecs
- Should be even shorter
- Grainsize analysis via Projections showed that this was not so
36Grainsize Analysis
Problem: a few compute objects had far too much work (see the grainsize histogram).
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms (sketched below).
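A hedged sketch of such a splitting heuristic: if a pairwise compute between two patches would examine too many atom pairs, split it into several pieces, each covering a slice of one patch's atoms. The threshold and field names are illustrative, not NAMD's.

  #include <vector>

  struct PairCompute {
    int patchA, patchB;
    int firstAtomA, lastAtomA;   // half-open range of patch A's atoms handled by this piece
  };

  std::vector<PairCompute> maybeSplit(const PairCompute& c,
                                      int atomsA, int atomsB,
                                      long maxPairs = 20000) {
    long pairs = static_cast<long>(atomsA) * atomsB;
    int nPieces = static_cast<int>((pairs + maxPairs - 1) / maxPairs);  // ceil(pairs / maxPairs)
    std::vector<PairCompute> pieces;
    for (int i = 0; i < nPieces; ++i) {
      PairCompute piece = c;
      piece.firstAtomA = atomsA * i / nPieces;         // slice patch A's atom range
      piece.lastAtomA  = atomsA * (i + 1) / nPieces;
      pieces.push_back(piece);
    }
    return pieces;
  }

Each piece becomes its own migratable object, so the load balancer never has to place one indivisible oversized object.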
37Grainsize Reduced
38Performance Audit
- Through the optimization process, an audit was kept to decide where to look to improve performance
Component      Ideal    Actual
Total          57.04    86
nonBonded      52.44    49.77
Bonds           3.16     3.9
Integration     1.44     3.05
Overhead        0        7.97
Imbalance       0       10.45
Idle            0        9.25
Receives        0        1.61
Integration time doubled
39Integration Overhead Analysis
(Projections timeline view, integration highlighted)
Problem: integration time had doubled relative to the sequential run
40Integration Overhead Example
- The Projections pictures showed the overhead was associated with sending messages
- Many cells were sending 30-40 messages
- The overhead was still too high compared with the cost of the messages
- Code analysis: memory allocations!
- An identical message was being sent to 30 processors
- Simple multicast support was added to Charm++ (see the sketch below)
- Mainly eliminates memory allocations (and some copying)
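An illustrative sketch (not Charm++'s actual multicast implementation) of why this helps: instead of allocating and filling an identical buffer once per destination, build the message once and let every send share it.

  #include <memory>
  #include <vector>

  struct Message { std::vector<double> coords; /* ... */ };

  void sendTo(int pe, std::shared_ptr<const Message> m);   // assumed send primitive

  // Before: one allocation and copy per destination (30-40 per cell per step).
  void sendSeparately(const Message& m, const std::vector<int>& pes) {
    for (int pe : pes) sendTo(pe, std::make_shared<Message>(m));
  }

  // After: one allocation, shared by every destination.
  void multicast(const Message& m, const std::vector<int>& pes) {
    auto shared = std::make_shared<const Message>(m);
    for (int pe : pes) sendTo(pe, shared);
  }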
41Integration Overhead After Multicast
42ApoA-I on ASCI Red
57 ms/step
43ApoA-I on Origin 2000
44ApoA-I on Linux Cluster
45ApoA-I on O2K and T3E
46ApoA-I on T3E
47BC1 complex 200k atoms
48BC1 on ASCI Red
58.4 GFlops
49Lessons Learned
- Need to downsize objects!
- Choose the smallest grainsize that amortizes the overhead
- One of the biggest challenges
- was getting time for performance tuning runs on parallel machines
50ApoA-I with PME on T3E
51Future and Planned Work
- Increased speedups on 2k-10k processors
- Smaller grainsizes
- Parallelizing integration further
- New algorithms for reducing communication impact
- New load balancing strategies
- Further performance improvements for PME
- With multiple timestepping
- Needs multi-phase load balancing
- Speedup on small molecules!
- Interactive molecular dynamics
52More Information
- Charm++ and associated framework
- http://charm.cs.uiuc.edu
- NAMD and associated biophysics tools
- http://www.ks.uiuc.edu
- Both include downloadable software
53Parallel Programming Laboratory
- Funding
- Dept of Energy (via Rocket center)
- National Science Foundation
- National Institutes of Health
- Group Members
Affiliated (NIH/Biophysics): Jim Phillips, Kirby Vandivoort
Joshua Unger, Gengbin Zheng, Jay Desouza, Sameer Kumar, Chee Wai Lee
Milind Bhandarkar, Terry Wilmarth, Orion Lawlor, Neelam Saboo, Arun Singla, Karthikeyan Mahesh
54The Parallel Programming Problem
- Is there one?
- We can all write MPI programs, right?
- Several Large Machines in use
- But
- New complex apps with dynamic and irregular structure
- Should all application scientists also be experts in parallel computing?
55What makes it difficult?
- Multiple objectives
- Correctness, Sequential efficiency, speedups
- Nondeterminacy affects correctness
- Several obstacles to speedup
- communication costs
- Load imbalances
- Long critical paths
56Parallel Programming
- Decomposition
- Decide what to do in parallel
- Tasks (loop iterations, functions, ...) that can be done in parallel
- Mapping
- Which processor does each task
- Scheduling (sequencing)
- On each processor
- Machine-dependent expression
- Express the above decisions for the particular parallel machine
57Spectrum of parallel Languages
(Diagram: parallelizing Fortran compilers, Charm++, and MPI placed along a spectrum of what is automated (decomposition, mapping, scheduling/sequencing, machine-dependent expression) versus the level of specialization.)
58Charm++
- Data Driven Objects
- Asynchronous method invocation
- Prioritized scheduling
- Object Arrays
- Object Groups
- global object with a representative on each PE
- Information sharing abstractions
- readonly data
- accumulators
- distributed tables
59Data Driven Execution
(Diagram: objects driven by per-processor schedulers, each selecting messages from its queue.)
60Group Mission and Approach
- To enhance performance and productivity in programming complex parallel applications
- Approach: application-oriented yet CS-centered research
- Develop enabling technology, for many apps
- Develop, use, and test it in the context of real applications
- Theme
- Adaptive techniques for irregular and dynamic applications
- Optimal division of labor between system and programmer
- Decomposition done by the programmer, everything else automated
- Develop a standard library of reusable components for parallel programming
61Active Projects
- Charm++ / Converse parallel infrastructure
- Scientific/Engineering apps
- Molecular Dynamics
- Rocket Simulation
- Finite Element Framework
- Web-based interaction and monitoring
- Faucets: anonymous compute power
- Parallel
- Operations research, discrete event simulation, combinatorial search
62Charm++ Parallel C++ with Data Driven Objects
- Chares: dynamically balanced objects
- Object Groups
- global object with a representative on each PE
- Object Arrays/ Object Collections
- User-defined indexing (1D, 2D, ..., quad- and oct-trees, ...)
- System supports remapping and forwarding
- Asynchronous method invocation
- Prioritized scheduling
- Mature, robust, portable
- http://charm.cs.uiuc.edu
Data driven Execution
63Multi-partition Decomposition
- Idea: divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than number of processors
- Let the system map entities to processors
64Converse
- Portable parallel run-time system that allows interoperability among parallel languages
- Rich features to allow quick and efficient implementation of new parallel languages
- Based on message-driven execution, which allows co-existence of different control regimes
- Support for debugging and performance analysis of parallel programs
- Support for building parallel servers
65Converse
- Languages and paradigms implemented
- Charm++, a parallel object-oriented language
- Thread-safe MPI and PVM
- Parallel Java, message-driven Perl, pC++
- Platforms supported
- SGI Origin2000, IBM SP, ASCI Red, Cray T3E, Convex Ex.
- Workstation clusters (Solaris, HP-UX, AIX, Linux, etc.)
- Windows NT clusters
(Diagram: paradigms, languages, and libraries layered on top of Converse, which runs on the parallel machines.)
66Adaptive MPI
- A bridge between legacy MPI codes and the dynamic load balancing capabilities of Charm++
- AMPI = MPI + dynamic load balancing
- Based on Charm++ object arrays and Converse's migratable threads
- Minimal modification needed to convert existing MPI programs (to be automated in the future); see the sketch below
- Bindings for C, C++, and Fortran90
- Currently supports most of the MPI 1.1 standard
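To illustrate how little an MPI code has to change, here is a hedged sketch: an ordinary MPI program that, when compiled against AMPI, runs each rank in a migratable user-level thread. The only source addition is an occasional hint that migration is safe; it is written here as AMPI_Migrate(), but the exact call name and arguments depend on the AMPI version, so treat it as an assumption.

  #include <mpi.h>

  int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (int step = 0; step < 1000; ++step) {
      // ... compute on this rank's chunk, exchange halos with MPI_Send/MPI_Recv ...
      if (step % 50 == 0) {
        // AMPI_Migrate();   // load-balancing point (AMPI extension; name/signature vary by version)
      }
    }

    MPI_Finalize();
    return 0;
  }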
67Converse Use in NAMD
68Molecular Dynamics
- Collection of charged atoms, with bonds
- Newtonian mechanics
- At each time-step
- Calculate forces on each atom
- bonds
- non-bonded electrostatic and van der Waals
- Calculate velocities and advance positions
- 1 femtosecond time-step, millions needed!
- Thousands of atoms (1,000 - 200,000)
Collaboration with Klaus Schulten, Robert Skeel
69Spatial Decomposition in NAMD
- Space divided into cubes
- Forces between atoms in neighboring cubes computed by individual compute objects
- Compute objects are remapped by the load balancer
70NAMD a Production-quality MD Program
- NAMD is used by biophysicists routinely, with several published results
- NIH-funded collaborative effort with Profs. K. Schulten and R. Skeel
- Supports full-range electrostatics
- Parallel Particle-Mesh Ewald for periodic systems and fast multipole for aperiodic systems
- Implemented in C++/Charm++
- Supports visualization (via VMD), interactive MD, and a haptic interface
- See http://www.ks.uiuc.edu
- Part of the Biophysics Collaboratory
Apolipoprotein A-I
71NAMD Scalable Performance
Sequential performance of NAMD (a C++ program) is comparable to or better than that of contemporary MD programs written in Fortran.
Gordon Bell award finalist, SC2000
Speedup of 1250 on 2048 processors of ASCI Red, simulating BC1 with about 200k atoms (compare with the best speedups on production-quality MD codes by others: 170 on 256 processors). Around 10,000 varying-size objects mapped by the load balancer.
72Rocket Simulation
- Rocket behavior (and therefore its simulation) is irregular and dynamic
- We need to deal with dynamic variations adaptively
- Dynamic behavior arises from
- Combustion: moving boundaries
- Crack propagation
- Evolution of the system
73Rocket Simulation
- Our Approach
- Multi-partition decomposition
- Data-driven objects (Charm++)
- Automatic load balancing framework
- AMPI: migration path for existing MPI/Fortran90 codes
- ROCFLO, ROCSOLID, and ROCFACE
74FEM Framework
- Objective: to make it easy to parallelize existing Finite Element Method (FEM) applications and to quickly build new parallel FEM applications, including those with irregular and dynamic behavior
- Hides the details of parallelism: the developer provides only sequential callback routines
- Embedded mesh partitioning algorithms split the mesh into chunks that are mapped to processors (many-to-one)
- The developer's callbacks are executed in migratable threads, monitored by the run-time system
- Migration of chunks to correct load imbalance
- Examples
- Pressure-driven crack propagation
- 3-D dendritic growth
75FEM Framework Responsibilities
FEM Application (initialize, registration of nodal attributes, loops over elements, finalize)
FEM Framework (update of nodal properties, reductions over nodes or partitions), with Partitioner and Combiner components (using METIS) and I/O
Charm++ (dynamic load balancing, communication)
76Crack Propagation
- Explicit FEM code
- Zero-volume cohesive elements inserted near the crack
- As the crack propagates, more cohesive elements are added near the crack, which leads to severe load imbalance
- Framework handles:
- Partitioning elements into chunks
- Communication between chunks
- Load balancing
Decomposition into 16 chunks (left) and 128 chunks, 8 per PE (right). The middle area contains cohesive elements. Pictures: S. Breitenfeld and P. Geubelle
77Dendritic Growth
- Studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid
- Adaptive refinement and coarsening of the grid involves re-partitioning
Work by Prof. Jon Dantzig and Jun-ho Jeong
78(No Transcript)
79Anonymous Compute Power
What is needed to make this metaphor work? Timeshared parallel machines in the background; effective resource management; quality-of-computational-service contracts/guarantees; front ends that allow agents to submit jobs on users' behalf.
Computational Faucets
80Computational Faucets
- What does a Computational faucet do?
- Submit requests to the grid
- Evaluate bids and decide whom to assign work
- Monitor applications (for performance and correctness)
- Provide an interface to users
- Interacting with jobs, and monitoring behavior
- What does it look like?
A browser!
81Faucets QoS
- User specifies desired job parameters such as program executable name, executable platform, min PEs, max PEs, estimated CPU-seconds (for various PE counts), priority, etc.
- User does not specify a machine. The Faucet software contacts a central server and obtains a list of available workstation clusters, then negotiates with the clusters and chooses one to submit the job to.
- User can view the status of clusters.
- Planned: file transfer, user authentication, and a merge with Appspector for job monitoring.
(Diagram: a Faucet client and web browser talking to a central server, which brokers among several workstation clusters.)
82Timeshared Parallel Machines
- Need resource management
- Shrink and expand individual jobs to the available set of processors
- Example: a machine with 100 processors
- Job1 arrives, can use 20-150 processors
- Assign 100 processors to it
- Job2 arrives, can use 30-70 processors,
- and will pay more if we meet its deadline
- Make resource allocation decisions
83Time-shared Parallel Machines
- To bid effectively (profitably) in such an environment, a parallel machine must be able to run well-paying (important) jobs even when it is already running others.
- A suitably written Charm++/Converse program running on a workstation cluster can dynamically change the number of CPUs it is running on, in response to a network (CCS) request.
- Works in coordination with a Cluster Manager to give a job as many CPUs as are available when there are no other jobs, while providing the flexibility to accept new jobs and scale down.
84Appspector
- Appspector provides a web interface for submitting and monitoring parallel jobs.
- Submission: the user specifies machine, login, password, and program name (which must already be available on the target machine).
- Jobs can be monitored from any computer with a web browser. Advanced program information can be shown on the monitoring screen using CCS.