Title: Techniques for Developing Efficient Petascale Applications
1Techniques for Developing Efficient Petascale
Applications
- Laxmikant (Sanjay) Kale
- http://charm.cs.uiuc.edu
- Parallel Programming Laboratory
- Department of Computer Science
- University of Illinois at Urbana-Champaign
2Summarizing the state of the art
- Petascale
- Very powerful parallel computers are being built
- Application domains exist that need that kind of power
- New generation of applications
- Use sophisticated algorithms
- Dynamic adaptive refinements
- Multi-scale, multi-physics
- Challenge
- Parallel programming is more complex than sequential
- Difficulty in achieving performance that scales to PetaFLOP/S and beyond
- Difficulty in getting correct behavior from programs
11/16/2009
CharmWorkshop2007
2
3Our Guiding Principles
- No magic
- Parallelizing compilers have achieved close to technical perfection, but are not enough
- Sequential programs obscure too much information
- Seek an optimal division of labor between the system and the programmer
- Design abstractions based solidly on use-cases
- Application-oriented yet computer-science centered approach
4Application Oriented Parallel Abstractions
Synergy between Computer Science Research and
Apps has been beneficial to both
LeanCP
Space-time meshing
Other Applications
Issues
NAMD
Charm
Techniques libraries
Rocket Simulation
ChaNGa
6Enabling CS technology of parallel objects and
intelligent runtime systems (Charm and AMPI)
has led to several collaborative applications in
CSE
7Outline
- Techniques for Petascale Programming
- My biases: isoefficiency analysis, overdecomposition, automated resource management, sophisticated perf. analysis tools
- BigSim
- Early development of Applications on future machines
- Dealing with Multicore processors and SMP nodes
- Novel parallel programming models
8Techniques For Scaling to Multi-PetaFLOP/s
91.Analyze Scalability of Algorithms (say via the
iso-efficiency metric)
10Isoefficiency Analysis
Vipin Kumar (circa 1990)
- An algorithm is scalable if
- When the number of processors available to it is doubled, it can retain the same parallel efficiency by increasing the size of the problem by some amount
- Not all algorithms are scalable in this sense
- Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency
Parallel efficiency = T1 / (Tp * P), where T1 is the time on one processor and Tp is the time on P processors
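A minimal sketch of what the efficiency definition implies, under an assumed cost model that is not from the talk: summing n numbers on P processors, with T1 = n and Tp = n/P + 2 log2(P). The function names and the cost model are illustrative assumptions only. Holding efficiency at 0.8 forces n to grow roughly as P log P, which is the isoefficiency function for this toy model.

```python
import math

def efficiency(n, p):
    """Parallel efficiency E = T1 / (Tp * P) under the assumed cost model."""
    t1 = n                              # assumed: one time unit per addition
    tp = n / p + 2 * math.log2(p)       # assumed: local work plus a tree reduction
    return t1 / (tp * p)

def problem_size_for(target_e, p):
    """Problem size n that yields target_e on p processors (closed form)."""
    # E = n / (n + 2 p log2 p)  =>  n = E / (1 - E) * 2 p log2 p
    return target_e / (1 - target_e) * 2 * p * math.log2(p)

for p in (4, 16, 64, 256):
    n = problem_size_for(0.8, p)
    print(p, round(n), round(efficiency(n, p), 3))
```

Keeping n fixed while P grows makes the efficiency drop, which is the behavior the isoefficiency metric is meant to expose.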
11NAMD A Production MD program
- NAMD
- Fully featured program
- NIH-funded development
- Distributed free of charge (20,000 registered users)
- Binaries and source code
- Installed at NSF centers
- User training and support
- Large published simulations
12Molecular Dynamics in NAMD
- Collection of charged atoms, with bonds
- Newtonian mechanics
- Thousands of atoms (10,000 to 5,000,000)
- At each time-step
- Calculate forces on each atom
- Bonds
- Non-bonded electrostatic and van der Waals
- Short-distance every timestep
- Long-distance using PME (3D FFT)
- Multiple time stepping: PME every 4 timesteps
- Calculate velocities and advance positions
- Challenge: femtosecond time-step, millions needed!
Collaboration with K. Schulten, R. Skeel, and
coworkers
13Traditional Approaches: non isoefficient
In 1996-2002
- Replicated Data
- All atom coordinates stored on each processor
- Communication/Computation ratio: O(P log P)
- Not scalable
- Partition the Atoms array across processors
- Nearby atoms may not be on the same processor
- C/C ratio: O(P)
- Not scalable
- Distribute force matrix to processors
- Matrix is sparse, non uniform
- C/C ratio: O(sqrt(P))
- Not scalable
14Spatial Decomposition Via Charm
- Atoms distributed to cubes based on their location
- Size of each cube
- Just a bit larger than cut-off radius
- Communicate only with neighbors
- Work for each pair of nbr objects
- C/C ratio O(1)
- However
- Load Imbalance
- Limited Parallelism
Charm is useful to handle this
Cells, Cubes or Patches
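A rough sketch of the binning step described on this slide, assuming a non-periodic rectangular box. The names build_patches, box and cutoff are hypothetical and are not taken from NAMD. Each patch side is chosen to be at least the cutoff, so an atom only needs data from the neighboring patches.

```python
from collections import defaultdict

def build_patches(positions, box, cutoff):
    """Bin atoms into cubic patches whose side is at least the cutoff radius."""
    # As many patches per dimension as fit while keeping side >= cutoff.
    dims = [max(1, int(box[d] // cutoff)) for d in range(3)]
    side = [box[d] / dims[d] for d in range(3)]
    patches = defaultdict(list)
    for atom_id, pos in enumerate(positions):
        # Clamp so atoms sitting exactly on the upper boundary stay in range.
        key = tuple(min(int(pos[d] // side[d]), dims[d] - 1) for d in range(3))
        patches[key].append(atom_id)
    return dims, patches

dims, patches = build_patches(
    [(0.5, 0.5, 0.5), (11.9, 0.1, 23.9), (35.9, 35.9, 35.9)],
    box=(36.0, 36.0, 36.0),
    cutoff=12.0,
)
print(dims, dict(patches))
```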
15Object Based Parallelization for MD: Force Decomposition + Spatial Decomposition
- Now, we have many objects to load balance
- Each diamond can be assigned to any proc.
- Number of diamonds (3D): 14 * Number of Patches
- 2-away variation
- Half-size cubes
- 5x5x5 interactions
- 3-away interactions: 7x7x7
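The factor of 14 can be checked with a short count, assuming a large or periodic system so that boundary effects can be ignored (that assumption and the function name are mine, not the slide's). Each patch sees (2k+1)^3 cells within k steps; every pair of distinct patches is shared by two patches, while the self-interaction is not.

```python
def computes_per_patch(k):
    """Pairwise compute objects per patch when patches within k cells interact."""
    neighborhood = (2 * k + 1) ** 3      # cells within k steps, including the patch itself
    # Pairs of distinct patches are counted once; add one for the self-interaction.
    return (neighborhood - 1) // 2 + 1

for k in (1, 2, 3):
    print(k, (2 * k + 1) ** 3, computes_per_patch(k))
```

Under the same counting, the 2-away (5x5x5) and 3-away (7x7x7) variations give 63 and 172 compute objects per patch, which is where the extra parallelism comes from.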
16Performance of NAMD: STMV
[Figure: STMV (1 million atoms); time (ms per step) vs. number of cores]
172. Decouple decomposition from Physical
Processors
18Migratable Objects (aka Processor Virtualization)
- Programmer: Overdecomposition into virtual processors
- Runtime: Assigns VPs to processors
- Enables adaptive runtime strategies
- Implementations: Charm, AMPI
19Migratable Objects (aka Processor Virtualization)
Benefits
- Software engineering
- Number of virtual processors can be independently controlled
- Separate VPs for different modules
- Message driven execution
- Adaptive overlap of communication
- Predictability
- Automatic out-of-core
- Prefetch to local stores
- Asynchronous reductions
- Dynamic mapping
- Heterogeneous clusters
- Vacate, adjust to speed, share
- Automatic checkpointing
- Change set of processors used
- Automatic dynamic load balancing
- Communication optimization
MPI processes
Virtual Processors (user-level migratable threads)
20Parallel Decomposition and Processors
- MPI-style encourages
- Decomposition into P pieces, where P is the number of physical processors available
- If your natural decomposition is a cube, then the number of processors must be a cube
- Charm/AMPI style virtual processors
- Decompose into natural objects of the application
- Let the runtime map them to processors
- Decouple decomposition from load balancing
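A toy illustration of the contrast above, in plain Python rather than any runtime's API: the application picks a natural 8 x 8 x 8 decomposition, and a separate mapping step places the objects on 48 processors, which is not a cube. The block_map policy is a hypothetical stand-in for whatever the runtime would actually choose.

```python
def decompose(k):
    """Natural decomposition: a k x k x k grid of objects (virtual processors)."""
    return [(x, y, z) for x in range(k) for y in range(k) for z in range(k)]

def block_map(objects, num_procs):
    """Hypothetical runtime-side policy: contiguous blocks of objects per processor."""
    n = len(objects)
    return {obj: i * num_procs // n for i, obj in enumerate(objects)}

objects = decompose(8)               # 512 objects, chosen independently of machine size
placement = block_map(objects, 48)   # 48 is not a cube, and nothing breaks

counts = [0] * 48
for proc in placement.values():
    counts[proc] += 1
print(min(counts), max(counts))
```

Because the mapping is a separate step, it can be replaced by a load-aware policy without touching the decomposition.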
21Decomposition independent of numCores
- Rocket simulation example under traditional MPI vs. Charm/AMPI framework
- Benefit: load balance, communication optimizations, modularity
LSU PetaScale
22Fault Tolerance
- Automatic Checkpointing
- Migrate objects to disk
- In-memory checkpointing as an option
- Automatic fault detection and restart
- Impending Fault Response
- Migrate objects to other processors
- Adjust processor-level parallel data structures
- Scalable fault tolerance
- When a processor out of 100,000 fails, all 99,999 shouldn't have to run back to their checkpoints!
- Sender-side message logging
- Latency tolerance helps mitigate costs
- Restart can be sped up by spreading out objects from failed processor
233. Use Dynamic Load Balancing Based on the Principle of Persistence
Principle of persistence: Computational loads and communication patterns tend to persist, even in dynamic computations. So, the recent past is a good predictor of the near future.
24Two step balancing
- Computational entities (nodes, structured grid points, particles) are partitioned into objects
- Say, a few thousand FEM elements per chunk, on average
- Objects are migrated across processors for balancing load
- Much smaller problem than repartitioning entire mesh, e.g.
- Occasional repartitioning for some problems
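A sketch of the second step (migrating objects), assuming per-object loads measured over recent timesteps are used as predictions, in the spirit of the principle of persistence. The greedy heaviest-first heuristic below is a generic textbook strategy chosen for illustration; it is not claimed to be the balancer used in the talk, and all names are hypothetical.

```python
import heapq

def greedy_balance(measured_loads, num_procs):
    """Place objects, heaviest first, on the currently least-loaded processor."""
    heap = [(0.0, p) for p in range(num_procs)]   # (current load, processor id)
    heapq.heapify(heap)
    assignment = {}
    for obj, load in sorted(measured_loads.items(), key=lambda kv: -kv[1]):
        proc_load, proc = heapq.heappop(heap)
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return assignment

def max_proc_load(measured_loads, assignment, num_procs):
    """Load of the most heavily loaded processor under an assignment."""
    totals = [0.0] * num_procs
    for obj, proc in assignment.items():
        totals[proc] += measured_loads[obj]
    return max(totals)

# Hypothetical loads measured during instrumented timesteps.
loads = {"a": 5.0, "b": 4.0, "c": 3.0, "d": 3.0, "e": 2.0, "f": 1.0}
plan = greedy_balance(loads, 2)
print(plan, max_proc_load(loads, plan, 2))
```

The balancer only moves whole objects, so the problem it solves is far smaller than repartitioning the underlying mesh.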
25Load Balancing Steps
Regular Timesteps
Detailed, aggressive Load Balancing
Instrumented Timesteps
Refinement Load Balancing
26ChaNGa Parallel Gravity
- Collaborative project (NSF ITR)
- with Prof. Tom Quinn, Univ. of Washington
- Components: gravity, gas dynamics
- Barnes-Hut tree codes
- Oct tree is natural decomposition
- Geometry has better aspect ratios, and so you open fewer nodes up.
- But is not used because it leads to bad load balance
- Assumption: one-to-one map between sub-trees and processors
- Binary trees are considered better load balanced
- With Charm: Use Oct-Tree, and let Charm map subtrees to processors
28ChaNGa Load balancing
Simple balancers worsened performance!
dwarf 5M on 1,024 BlueGene/L processors
29Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors
Need sophisticated balancers, and ability to
choose the right ones automatically
30ChaNGa Preliminary Performance
ChaNGa Parallel Gravity Code Developed in
Collaboration with Tom Quinn (Univ. Washington)
using Charm
31Load balancing for large machines I
- Centralized balancers achieve best balance
- Collect object-communication graph on one processor
- But won't scale beyond tens of thousands of nodes
- Fully distributed load balancers
- Avoid bottleneck but achieve poor load balance
- Not adequately agile
- Hierarchical load balancers
- Careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers
32Load balancing for large machines II
- Interconnection topology starts to matter again
- Was hidden due to wormhole routing etc.
- Latency variation is still small
- But bandwidth occupancy is a problem
- Topology aware load balancers
- Some general heuristics have shown good performance
- But may require too much compute power
- Also, special-purpose heuristics work fine when applicable
- Still, many open challenges
33OpenAtom: Car-Parrinello ab initio MD (NSF ITR 2001-2007, IBM)
G. Martyna, M. Tuckerman, L. Kale
34Goal: The accurate treatment of complex heterogeneous nanoscale systems
Courtesy: G. Martyna
read head
bit
write head
soft under-layer
recording medium
35New OpenAtom Collaboration (DOE)
- Principal Investigators
- G. Martyna (IBM TJ Watson)
- M. Tuckerman (NYU)
- L.V. Kale (UIUC)
- K. Schulten (UIUC)
- J. Dongarra (UTK/ORNL)
- Current effort focus
- QMMM via integration with NAMD2
- ORNL Cray XT4 Jaguar (31,328 cores)
- A unique parallel decomposition of the Car-Parrinello method.
- Using Charm virtualization we can efficiently scale small (32 molecule) systems to thousands of processors
36Decomposition and Computation Flow
37OpenAtom Performance
[Figure: performance plots for BlueGene/L, BlueGene/P, and Cray XT3]
38Topology aware mapping of objects
39Improvements wrought by network topology aware mapping of computational work to processors
The simulation in the left panel maps computational work to processors taking the network connectivity into account, while the simulation in the right panel does not. The black (idle) time that processors spend waiting for computational work to arrive is significantly reduced at left. (256 waters, 70R, on BG/L, 4096 cores)
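One way to compare two mappings without running the application is a hop-bytes style score: bytes sent times the number of network hops travelled. The slide does not name a metric, so treating hop-bytes as the objective is an assumption here, as are the function names and the toy 4 x 4 x 4 torus.

```python
def torus_hops(a, b, dims):
    """Shortest hop count between two coordinates on a wraparound (torus) network."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def hop_bytes(messages, placement, dims):
    """Sum over messages of bytes * hops; a lower value suggests less link contention."""
    return sum(nbytes * torus_hops(placement[src], placement[dst], dims)
               for src, dst, nbytes in messages)

dims = (4, 4, 4)
messages = [("A", "B", 1000), ("B", "C", 1000), ("C", "A", 1000)]
near = {"A": (0, 0, 0), "B": (1, 0, 0), "C": (0, 1, 0)}   # communicating objects placed close
far = {"A": (0, 0, 0), "B": (2, 2, 2), "C": (0, 2, 0)}    # topology-oblivious placement

print(hop_bytes(messages, near, dims), hop_bytes(messages, far, dims))
```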
40OpenAtom Topology Aware Mapping on BG/L
41OpenAtom Topology Aware Mapping on Cray XT/3
424. Analyze Performance with Sophisticated Tools
43Exploit sophisticated Performance analysis tools
- We use a tool called Projections
- Many other tools exist
- Need for scalable analysis
- A recent example
- Trying to identify the next performance obstacle for NAMD
- Running on 8192 processors, with 92,000 atom simulation
- Test scenario: without PME
- Time is 3 ms per step, but lower bound is 1.6 ms or so
475. Model and Predict Performance Via Simulation
48How to tune performance for a future machine?
- For example, Blue Waters will arrive in 2011
- But need to prepare applications for it starting now
- Even for extant machines
- Full size machine may not be available as often as needed for tuning runs
- A simulation-based approach is needed
- Our Approach: BigSim
- Full scale Emulation
- Trace driven Simulation
49BigSim Emulation
- Emulation
- Run an existing, full-scale MPI, AMPI or Charm application
- Using an emulation layer that pretends to be (say) 100k cores
- Leverages Charm's object-based virtualization
- Generates traces
- Characteristics of SEBs (Sequential Execution Blocks)
- Dependences between SEBs and messages
50BigSim Trace-Driven Simulation
- Trace Driven Parallel Simulation
- Typically on tens to hundreds of processors
- Multiple resolution simulation of sequential execution
- from simple scaling to
- cycle-accurate modeling
- Multiple resolution simulation of the network
- from simple latency/bw model to
- a detailed packet and switching port level simulation
- Generates timing traces just as a real app on full scale machine
- Phase 3: Analyze performance
- Identify bottlenecks, even w/o predicting exact performance
- Carry out various what-if analyses
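At the coarsest resolution mentioned above, the two models reduce to one-line formulas. The sketch below is only an illustration of "simple scaling" and a "simple latency/bw model"; the function names and parameter values are hypothetical and do not reflect BigSim's actual interface.

```python
def message_time(nbytes, latency_s, bandwidth_bytes_per_s):
    """Simple latency/bandwidth network model: fixed startup cost plus transfer time."""
    return latency_s + nbytes / bandwidth_bytes_per_s

def scale_seb(duration_s, speed_ratio):
    """Simple scaling: rescale a measured sequential block for a faster or slower core."""
    return duration_s / speed_ratio

def predicted_step(seb_durations_s, speed_ratio, nbytes, latency_s, bandwidth_bytes_per_s):
    """Predicted time for a chain of sequential blocks followed by one message."""
    compute = sum(scale_seb(d, speed_ratio) for d in seb_durations_s)
    return compute + message_time(nbytes, latency_s, bandwidth_bytes_per_s)

# What-if: the same trace on a target core assumed to be 2x faster, with 5 us latency and 1 GB/s links.
print(predicted_step([0.002, 0.004], 2.0, 1_000_000, 5e-6, 1e9))
```

Changing the latency or bandwidth arguments and re-running is the simplest form of the what-if analysis listed above.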
51Projections Performance visualization
52BigSim Challenges
- This simple picture hides many complexities
- Emulation
- Automatic Out-of-core support for large memory footprint apps
- Simulation
- Accuracy vs cost tradeoffs
- Interpolation mechanisms
- Memory optimizations
- Active area of research
- But early versions will be available for application developers for Blue Waters next month
53Validation
54Challenges of Multicore Chips And SMP nodes
55Challenges of Multicore Chips
- SMP nodes have been around for a while
- Typically, users just run separate MPI processes on each core
- Seen to be faster (OpenMP + MPI has not succeeded much)
- Yet, you cannot ignore the shared memory
- E.g., for gravity calculations (cosmology), a processor may request a set of particles from remote nodes
- With shared memory, one can maintain a node-level cache, minimizing communication
- Need model without pitfalls of Pthreads/OpenMP
- Costly locks, false sharing, no respect for locality
- E.g., Charm avoids these pitfalls, IF its RTS can avoid them
56Charm Experience
57Raising the Level of Abstraction
58Raising the Level of Abstraction
- Clearly (?) we need new programming models
- Many opinions and ideas exist
- Two metapoints
- Need for Exploration (don't standardize too soon)
- Interoperability
- Allows a beachhead for novel paradigms
- Long-term: Allow each module to be coded using the paradigm best suited for it
- Interoperability requires concurrent composition
- This may require message-driven execution
59Simplifying Parallel Programming
- By giving up completeness!
- A paradigm may be simple, but
- not suitable for expressing all parallel patterns
- Yet, if it can cover a significant class of patterns (applications, modules), it is useful
- A collection of incomplete models, backed by a few complete ones, will do the trick
- Our own examples: both outlaw non-determinism
- Multiphase Shared Arrays (MSA): restricted shared memory
- LCPC 04
- Charisma: Static data flow among collections of objects
- LCR04, HPDC 07
60MSA
- In the simple model
- A program consists of
- A collection of Charm threads, and
- Multiple collections of data-arrays
- Partitioned into pages (user-specified)
- Each array is in one mode at a time
- But its mode may change from phase to phase
- Modes
- Write-once
- Read-only
- Accumulate
- Owner-computes
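A single-threaded toy that mimics the "one mode per phase" rule, to show why it rules out read/write races by construction. This is not the MSA API: the class name, the sync method and the error behavior are all invented for illustration, and the owner-computes mode is left out.

```python
class PhasedArray:
    """Toy array that is in exactly one access mode per phase (illustrative only)."""
    MODES = {"write-once", "read-only", "accumulate"}

    def __init__(self, size):
        self.data = [0] * size
        self.mode = "write-once"
        self._written = set()

    def sync(self, new_mode):
        """End the current phase and switch the array to a new mode."""
        if new_mode not in self.MODES:
            raise ValueError(new_mode)
        self.mode = new_mode
        self._written.clear()

    def write(self, i, value):
        # Each element may be written at most once per write-once phase.
        if self.mode != "write-once" or i in self._written:
            raise RuntimeError("write not allowed in this phase")
        self._written.add(i)
        self.data[i] = value

    def read(self, i):
        if self.mode != "read-only":
            raise RuntimeError("read not allowed in this phase")
        return self.data[i]

    def accumulate(self, i, value):
        # Additions commute, so the order of contributions does not matter.
        if self.mode != "accumulate":
            raise RuntimeError("accumulate not allowed in this phase")
        self.data[i] += value

a = PhasedArray(2)
a.write(0, 3)
a.sync("accumulate")
a.accumulate(0, 4)
a.sync("read-only")
print(a.read(0))
```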
61A View of an Interoperable Future
62Summary
- Petascale Applications will require a large toolbox. My pet tools/techniques:
- Scalable Algorithms with isoefficiency analysis
- Object-based decomposition
- Dynamic Load balancing
- Scalable Performance analysis
- Early performance tuning via BigSim
- Multicore chips, SMP nodes, accelerators
- Require extra programming effort with locality-respecting models
- Raise the level of abstraction
- Via Incomplete yet simple paradigms, and
- domain-specific frameworks
More Info: http://charm.cs.uiuc.edu/