Techniques for Developing Efficient Petascale Applications


Slides: 61

Provided by: san7196

Learn more at: http://charm.cs.uiuc.edu


Transcript and Presenter's Notes

Title: Techniques for Developing Efficient Petascale Applications

1
Techniques for Developing Efficient Petascale
Applications

Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign

2
Summarizing the state of the art

Petascale
Very powerful parallel computers are being built
Application domains exist that need that kind of
power
New generation of applications
Use sophisticated algorithms
Dynamic adaptive refinements
Multi-scale, multi-physics
Challenge
Parallel programming is more complex than
sequential
Difficulty in achieving performance that scales
to PetaFLOP/S and beyond
Difficulty in getting correct behavior from
programs

11/16/2009
CharmWorkshop2007
3
Our Guiding Principles

No magic
Parallelizing compilers have achieved close to
technical perfection, but are not enough
Sequential programs obscure too much information
Seek an optimal division of labor between the
system and the programmer
Design abstractions based solidly on use-cases
Application-oriented yet computer-science
centered approach

4
Application Oriented Parallel Abstractions
Synergy between Computer Science Research and
Apps has been beneficial to both
LeanCP
Space-time meshing
Other Applications
Issues
NAMD
Charm
Techniques libraries
Rocket Simulation
ChaNGa
6
Enabling CS technology of parallel objects and
intelligent runtime systems (Charm++ and AMPI)
has led to several collaborative applications in
CSE
7
Outline

Techniques for Petascale Programming
My biases: isoefficiency analysis,
overdecomposition, automated resource management,
sophisticated performance analysis tools
BigSim
Early development of Applications on future
machines
Dealing with Multicore processors and SMP nodes
Novel parallel programming models

8
Techniques For Scaling to Multi-PetaFLOP/s
9
1. Analyze Scalability of Algorithms (say via the
iso-efficiency metric)
10
Isoefficiency Analysis
Vipin Kumar (circa 1990)

An algorithm is scalable if, when you double the number of processors
available to it, it can retain the same parallel efficiency by increasing
the size of the problem by some amount
Not all algorithms are scalable in this sense.
Isoefficiency is the rate at which the problem size must be increased, as a
function of the number of processors, to keep the same efficiency

Parallel efficiency: $E = T_1 / (P \cdot T_p)$
$T_1$: time on one processor
$T_p$: time on $P$ processors
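
The definitions above can be tried out numerically. This is a minimal sketch under an assumed cost model that is not in the slides: parallel time is T1/P plus a per-step overhead of c * log2(P), as a tree reduction might cost. The function names and the overhead model are hypothetical. Under this model the work needed to hold a fixed efficiency grows as P log P.

```python
import math

def log_overhead(p, c=1.0):
    # Assumed overhead model (illustrative only): c * log2(P) per step.
    return c * math.log2(p) if p > 1 else 0.0

def efficiency(t1, p, overhead):
    """Parallel efficiency E = T1 / (P * Tp), with Tp = T1/P + overhead(P)."""
    tp = t1 / p + overhead(p)
    return t1 / (p * tp)

def work_for_efficiency(target_e, p, overhead):
    """Sequential work T1 needed to hold efficiency at target_e on p processors.

    From E = T1 / (T1 + P * overhead(P)) it follows that
    T1 = E / (1 - E) * P * overhead(P).
    """
    return target_e / (1.0 - target_e) * p * overhead(p)

if __name__ == "__main__":
    # Fixed problem size: efficiency drops as processors are added.
    for p in (16, 64, 256, 1024):
        print("fixed work, P =", p, "E =", round(efficiency(10_000.0, p, log_overhead), 3))
    # Isoefficiency: how much work is needed to stay at E = 0.8.
    for p in (16, 64, 256, 1024):
        print("P =", p, "work for E = 0.8:", round(work_for_efficiency(0.8, p, log_overhead), 1))
```

Swapping in a different `overhead` function (for example one proportional to P) shows how quickly the required problem size can grow for less scalable algorithms.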
11
NAMD: A Production MD Program

NAMD
Fully featured program
NIH-funded development
Distributed free of charge (20,000 registered
users)
Binaries and source code
Installed at NSF centers
User training and support
Large published simulations

12
Molecular Dynamics in NAMD

Collection of charged atoms, with bonds
Newtonian mechanics
Thousands of atoms (10,000 to 5,000,000)
At each time-step
Calculate forces on each atom
Bonds
Non-bonded electrostatic and van der Waals
Short-distance every timestep
Long-distance using PME (3D FFT)
Multiple time stepping: PME every 4 timesteps
Calculate velocities and advance positions
Challenge: femtosecond time-step, millions needed!

Collaboration with K. Schulten, R. Skeel, and
coworkers
13
Traditional Approaches: Non-Isoefficient
In 1996-2002

Replicated data
All atom coordinates stored on each processor
Communication/computation ratio: P log P (not scalable)
Partition the atoms array across processors
Nearby atoms may not be on the same processor
C/C ratio: O(P) (not scalable)
Distribute force matrix to processors
Matrix is sparse, non-uniform
C/C ratio: sqrt(P) (not scalable)
14
Spatial Decomposition via Charm++

Atoms distributed to cubes based on their
location
Size of each cube
Just a bit larger than cut-off radius
Communicate only with neighbors
Work for each pair of neighboring objects
C/C ratio: O(1)
However:
Load imbalance
Limited parallelism

Charm++ is useful to handle this
Cells, Cubes, or Patches
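
The spatial decomposition on this slide can be sketched in a few lines. This is an illustrative, sequential sketch and not NAMD code: the function names, the 5% margin on the cube size, and the non-periodic box are all assumptions. It bins atoms into cubes slightly larger than the cutoff and then finds interacting pairs by looking only at a cube and its 26 neighbors.

```python
import math
from collections import defaultdict
from itertools import product

def assign_to_patches(positions, cutoff, margin=0.05):
    """Bin atoms into cubic patches whose side is a bit larger than the cutoff."""
    side = cutoff * (1.0 + margin)  # assumed margin, chosen for illustration
    patches = defaultdict(list)
    for idx, (x, y, z) in enumerate(positions):
        key = (math.floor(x / side), math.floor(y / side), math.floor(z / side))
        patches[key].append(idx)
    return patches

def neighbor_keys(key):
    """Keys of the 26 adjacent patches (no periodic wrap-around in this sketch)."""
    kx, ky, kz = key
    return [(kx + dx, ky + dy, kz + dz)
            for dx, dy, dz in product((-1, 0, 1), repeat=3)
            if (dx, dy, dz) != (0, 0, 0)]

def pairs_within_cutoff(positions, cutoff):
    """Find all atom pairs within the cutoff, touching only neighboring patches."""
    patches = assign_to_patches(positions, cutoff)
    found = set()
    for key, atoms in patches.items():
        candidates = list(atoms)
        for nk in neighbor_keys(key):
            candidates.extend(patches.get(nk, []))
        for i in atoms:
            for j in candidates:
                if i < j and math.dist(positions[i], positions[j]) <= cutoff:
                    found.add((i, j))
    return found
```

Because the cube side exceeds the cutoff, no interacting pair can be more than one cube apart, which is why communication stays local to neighbors.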
15

Object-Based Parallelization for MD: Force
Decomposition + Spatial Decomposition

Now we have many objects to load balance
Each diamond can be assigned to any processor
Number of diamonds (3D):
14 × number of patches
2-away variation:
Half-size cubes
5x5x5 interactions
3-away interactions: 7x7x7
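
The factor of 14 can be checked by counting. A minimal sketch, assuming periodic boundaries (so every patch has a full neighborhood) and one compute object per unordered pair of interacting patches plus one self-interaction object per patch; the function names are hypothetical.

```python
def pair_objects_per_patch(away=1):
    """Compute objects owned per patch for a (2*away+1)^3 interaction neighborhood.

    Each of the (2*away+1)^3 - 1 neighbor pairs is shared by two patches,
    so each patch accounts for half of them, plus one self-interaction.
    """
    width = 2 * away + 1
    neighbors = width ** 3 - 1
    return neighbors // 2 + 1

def total_pair_objects(num_patches, away=1):
    # Assumes periodic boundaries and at least (2*away+1) patches per dimension.
    return num_patches * pair_objects_per_patch(away)

if __name__ == "__main__":
    print(pair_objects_per_patch(1))  # 3x3x3 neighborhood -> 14
    print(pair_objects_per_patch(2))  # 5x5x5 neighborhood -> 63
    print(pair_objects_per_patch(3))  # 7x7x7 neighborhood -> 172
```

The 2-away and 3-away variations therefore multiply the number of objects available to the load balancer, at the cost of smaller work units.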

16
Performance of NAMD: STMV
[Figure: STMV, 1 million atoms; time (ms per step) vs. number of cores]
17
2. Decouple decomposition from Physical
Processors
18
Migratable Objects (aka Processor Virtualization)
Programmer: over-decomposition into virtual processors
Runtime: assigns VPs to processors
Enables adaptive runtime strategies
Implementations: Charm++, AMPI
19
Migratable Objects (aka Processor Virtualization)
Benefits

Software engineering
Number of virtual processors can be independently
controlled
Separate VPs for different modules
Message driven execution
Adaptive overlap of communication
Predictability
Automatic out-of-core
Prefetch to local stores
Asynchronous reductions
Dynamic mapping
Heterogeneous clusters
Vacate, adjust to speed, share
Automatic checkpointing
Change set of processors used
Automatic dynamic load balancing
Communication optimization

MPI processes
Virtual Processors (user-level migratable threads)
20
Parallel Decomposition and Processors

MPI-style encourages
Decomposition into P pieces, where P is the
number of physical processors available
If your natural decomposition is a cube, then the
number of processors must be a cube
Charm++/AMPI style: virtual processors
Decompose into natural objects of the application
Let the runtime map them to processors
Decouple decomposition from load balancing
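
A small sketch of the contrast drawn above, with hypothetical names and no relation to the actual Charm++ or AMPI APIs: the application picks a natural cube of blocks, and a stand-in for the runtime deals those blocks out to however many processors exist, cube or not.

```python
from itertools import product

def decompose(blocks_per_dim):
    """Natural decomposition: a cube of blocks, sized to suit the application."""
    return list(product(range(blocks_per_dim), repeat=3))

def map_objects(objects, num_procs):
    """Stand-in for the runtime: assign contiguous runs of objects to processors.

    A real adaptive runtime would also use measured load and communication;
    this sketch only shows that the object count need not equal num_procs.
    """
    placement = {p: [] for p in range(num_procs)}
    for i, obj in enumerate(objects):
        placement[i * num_procs // len(objects)].append(obj)
    return placement

if __name__ == "__main__":
    objects = decompose(8)                 # 512 blocks, a natural cube
    for procs in (7, 100, 512):            # processor counts need not be cubes
        sizes = [len(v) for v in map_objects(objects, procs).values()]
        print(procs, "processors: min", min(sizes), "max", max(sizes), "objects each")
```

With many more objects than processors, the per-processor counts differ by at most one here, and a load balancer has room to move work around later.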

21
Decomposition independent of numCores

Rocket simulation example under traditional MPI
vs. the Charm++/AMPI framework
Benefits: load balance, communication
optimizations, modularity

11/16/2009
LSU PetaScale
22
Fault Tolerance

Automatic Checkpointing
Migrate objects to disk
In-memory checkpointing as an option
Automatic fault detection and restart
Impending Fault Response
Migrate objects to other processors
Adjust processor-level parallel data structures

Scalable fault tolerance
When a processor out of 100,000 fails, all 99,999
shouldn't have to run back to their checkpoints!
Sender-side message logging
Latency tolerance helps mitigate costs
Restart can be sped up by spreading out
objects from the failed processor

23
3. Use Dynamic Load Balancing Based on
the Principle of Persistence
Principle of persistence: computational loads and
communication patterns tend to persist, even in
dynamic computations. So, the recent past is a good
predictor of the near future
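
One way to act on the principle of persistence is to treat the load measured for each object over recent steps as its predicted load and rebalance from that. The following is a minimal sketch of a greedy strategy (heaviest object first, onto the currently least-loaded processor). The function names and data layout are assumptions for illustration, not the Charm++ load balancing interface.

```python
import heapq

def greedy_balance(measured_load, num_procs):
    """Assign objects to processors from measured loads.

    measured_load: dict of object id -> time measured over recent steps,
    used here as the prediction for upcoming steps.
    """
    heap = [(0.0, p) for p in range(num_procs)]  # (current load, processor)
    heapq.heapify(heap)
    assignment = {}
    by_weight = sorted(measured_load.items(), key=lambda kv: kv[1], reverse=True)
    for obj, load in by_weight:
        proc_load, proc = heapq.heappop(heap)    # least-loaded processor so far
        assignment[obj] = proc
        heapq.heappush(heap, (proc_load + load, proc))
    return assignment

def proc_loads(assignment, load, num_procs):
    """Total load per processor under a given assignment."""
    totals = [0.0] * num_procs
    for obj, proc in assignment.items():
        totals[proc] += load[obj]
    return totals
```

A strategy like this ignores where objects currently live, so it can migrate many of them. That suits an occasional, aggressive rebalancing step more than a frequent one.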
24
Two step balancing

Computational entities (nodes, structured grid
points, particles) are partitioned into objects
Say, a few thousand FEM elements per chunk, on
average
Objects are migrated across processors for
balancing load
Much smaller problem than repartitioning entire
mesh, e.g.
Occasional repartitioning for some problems

25
Load Balancing Steps
Regular Timesteps
Detailed, aggressive Load Balancing
Instrumented Timesteps
Refinement Load Balancing
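
Refinement load balancing can be sketched as follows: start from the existing assignment and move objects only off overloaded processors, so that few migrations are needed. This is an illustrative sketch under assumed names and an assumed 5% tolerance, not the implementation used in Charm++.

```python
def refine_balance(assignment, load, num_procs, tolerance=1.05):
    """Refine an existing assignment with as few migrations as practical.

    Repeatedly moves one object from the most loaded processor to the least
    loaded one until the maximum load is within tolerance * average, or no
    helpful move remains. Returns (new_assignment, number_of_migrations).
    """
    assignment = dict(assignment)          # leave the caller's mapping untouched
    totals = [0.0] * num_procs
    for obj, proc in assignment.items():
        totals[proc] += load[obj]
    threshold = tolerance * sum(totals) / num_procs
    migrations = 0
    while True:
        heavy = max(range(num_procs), key=totals.__getitem__)
        light = min(range(num_procs), key=totals.__getitem__)
        if totals[heavy] <= threshold:
            break
        # Only consider moves that leave the receiver below the sender's load.
        movable = [o for o, p in assignment.items()
                   if p == heavy and load[o] > 0
                   and totals[light] + load[o] < totals[heavy]]
        if not movable:
            break
        obj = max(movable, key=load.__getitem__)
        assignment[obj] = light
        totals[heavy] -= load[obj]
        totals[light] += load[obj]
        migrations += 1
    return assignment, migrations
```

Because each move strictly reduces the imbalance between the two processors involved, the loop terminates, and a nearly balanced input returns with zero migrations.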
26
ChaNGa: Parallel Gravity

Collaborative project (NSF ITR)
with Prof. Tom Quinn, Univ. of Washington
Components: gravity, gas dynamics
Barnes-Hut tree codes
Oct-tree is the natural decomposition
Its geometry has better aspect ratios, so you
open fewer nodes
But it is not used because it leads to bad load
balance
Assumption: one-to-one map between sub-trees and
processors
Binary trees are considered better load balanced
With Charm++: use the oct-tree, and let Charm++ map
subtrees to processors

28
ChaNGa Load balancing
Simple balancers worsened performance!
dwarf 5M on 1,024 BlueGene/L processors
29
Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors
Need sophisticated balancers, and ability to
choose the right ones automatically
30
ChaNGa Preliminary Performance
ChaNGa: parallel gravity code developed in
collaboration with Tom Quinn (Univ. of Washington)
using Charm++
31
Load balancing for large machines I

Centralized balancers achieve best balance
Collect object-communication graph on one
processor
But won't scale beyond tens of thousands of nodes
Fully distributed load balancers
Avoid the bottleneck, but achieve poor load balance
Not adequately agile
Hierarchical load balancers
Careful control of what information goes up and
down the hierarchy can lead to fast, high-quality
balancers

32
Load balancing for large machines II

Interconnection topology starts to matter again
Was hidden due to wormhole routing etc.
Latency variation is still small
But bandwidth occupancy is a problem
Topology-aware load balancers
Some general heuristics have shown good
performance
But may require too much compute power
Also, special-purpose heuristics work fine when
applicable
Still, many open challenges
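
Bandwidth occupancy under a given mapping is often summarized by a hop-bytes style metric: bytes sent multiplied by the number of network hops they travel. The sketch below computes it on a torus. The metric choice, function names, and data layout are assumptions for illustration and are not taken from the slides.

```python
from itertools import product

def torus_hops(a, b, dims):
    """Shortest hop count between two coordinates on a torus of the given dims."""
    return sum(min(abs(x - y), d - abs(x - y)) for x, y, d in zip(a, b, dims))

def hop_bytes(comm, placement, dims):
    """Sum over all messages of bytes * hops.

    comm: dict of (src_object, dst_object) -> bytes exchanged
    placement: dict of object -> torus coordinate of its processor
    """
    return sum(nbytes * torus_hops(placement[s], placement[d], dims)
               for (s, d), nbytes in comm.items())

if __name__ == "__main__":
    dims = (4, 4, 1)
    nodes = list(product(range(4), range(4), range(1)))
    # A chain of 16 objects, each sending 100 bytes to the next one.
    comm = {(i, i + 1): 100 for i in range(15)}
    # Topology-aware: snake through the torus so chain neighbors are adjacent.
    snake = {i: (i // 4, i % 4 if (i // 4) % 2 == 0 else 3 - i % 4, 0)
             for i in range(16)}
    # Topology-oblivious: an arbitrary fixed permutation of the nodes.
    strided = {i: nodes[(i * 5) % 16] for i in range(16)}
    print("aware:", hop_bytes(comm, snake, dims))
    print("oblivious:", hop_bytes(comm, strided, dims))
```

A topology-aware balancer would search for placements that lower this number while keeping processor loads even, which is where the compute cost of general heuristics comes from.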

33
OpenAtom: Car-Parrinello ab initio MD
NSF ITR 2001-2007, IBM
G. Martyna, M. Tuckerman, L. Kale
34
Goal: the accurate treatment of complex heterogeneous nanoscale systems
[Figure, courtesy G. Martyna; labels: read head, bit, write head, soft under-layer, recording medium]
35
New OpenAtom Collaboration (DOE)

Principal Investigators
G. Martyna (IBM TJ Watson)
M. Tuckerman (NYU)
L.V. Kale (UIUC)
K. Schulten (UIUC)
J. Dongarra (UTK/ORNL)
Current effort focus
QM/MM via integration with NAMD2
ORNL Cray XT4 Jaguar (31,328 cores)

A unique parallel decomposition of the
Car-Parrinello method.
Using Charm++ virtualization we can efficiently
scale small (32 molecule) systems to thousands of
processors

36
Decomposition and Computation Flow
37

OpenAtom Performance
[Figure panels: BlueGene/L, BlueGene/P, Cray XT3]
38
Topology aware mapping of objects
39
Improvements wrought by network-topology-aware
mapping of computational work to processors
The simulation in the left panel maps
computational work to processors taking the
network connectivity into account, while the
simulation in the right panel does not. The black
regions (idle time processors spent waiting for
computational work to arrive) are significantly
reduced at left. (256 waters, 70R, on BG/L, 4096
cores)
40
OpenAtom Topology Aware Mapping on BG/L
41
OpenAtom Topology Aware Mapping on Cray XT3
42
4. Analyze Performance with Sophisticated Tools
43
Exploit sophisticated Performance analysis tools

We use a tool called Projections
Many other tools exist
Need for scalable analysis
A recent example
Trying to identify the next performance obstacle
for NAMD
Running on 8192 processors, with a 92,000-atom
simulation
Test scenario: without PME
Time is 3 ms per step, but the lower bound is 1.6 ms
or so

47
5. Model and Predict Performance Via Simulation
48
How to tune performance for a future machine?

For example, Blue Waters will arrive in 2011
But we need to prepare applications for it starting
now
Even for extant machines:
Full-size machine may not be available as often
as needed for tuning runs
A simulation-based approach is needed
Our approach: BigSim
Full-scale emulation
Trace-driven simulation

49
BigSim Emulation

Emulation
Run an existing, full-scale MPI, AMPI, or Charm++
application
Using an emulation layer that pretends to be
(say) 100k cores
Leverages Charm++'s object-based virtualization
Generates traces:
Characteristics of SEBs (Sequential Execution
Blocks)
Dependences between SEBs and messages

50
BigSim Trace-Driven Simulation

Trace-driven parallel simulation
Typically on tens to hundreds of processors
Multiple-resolution simulation of sequential
execution:
from simple scaling to
cycle-accurate modeling
Multiple-resolution simulation of the network:
from a simple latency/bandwidth model to
a detailed packet and switching-port level
simulation
Generates timing traces just as a real application
on a full-scale machine would
Phase 3: Analyze performance
Identify bottlenecks, even without predicting exact
performance
Carry out various what-if analyses
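
At the simplest resolution described above (scaled sequential times plus a latency/bandwidth network model), trace replay can be sketched in a few lines. The `SEB` record, its fields, and the `replay` function are hypothetical and chosen for illustration; they are not BigSim's trace format or API. The sketch assumes the trace lists every block after its predecessors and that blocks on one processor run in trace order.

```python
from dataclasses import dataclass, field

@dataclass
class SEB:
    """One sequential execution block recorded during emulation (hypothetical layout)."""
    name: str
    proc: int                 # target-machine processor that runs the block
    duration: float           # seconds measured on the emulating host
    deps: list = field(default_factory=list)  # [(predecessor name, message bytes)]

def replay(trace, latency, bandwidth, cpu_scale=1.0):
    """Predict the end time of each block on the target machine.

    latency (s) and bandwidth (bytes/s) form the simple network model;
    cpu_scale rescales measured times to the target processor speed.
    """
    end, proc_of, proc_free = {}, {}, {}
    for seb in trace:
        ready = proc_free.get(seb.proc, 0.0)
        for pred, nbytes in seb.deps:
            arrival = end[pred]
            if proc_of[pred] != seb.proc:       # only remote messages cost time
                arrival += latency + nbytes / bandwidth
            ready = max(ready, arrival)
        end[seb.name] = ready + seb.duration * cpu_scale
        proc_of[seb.name] = seb.proc
        proc_free[seb.proc] = end[seb.name]
    return end
```

What-if analysis then amounts to replaying the same trace with different parameters, for example halving `latency` or changing `cpu_scale`, and comparing the predicted end times to see which resource is the bottleneck.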

51
Projections Performance visualization
52
BigSim Challenges

This simple picture hides many complexities
Emulation
Automatic Out-of-core support for large memory
footprint apps
Simulation
Accuracy vs cost tradeoffs
Interpolation mechanisms
Memory optimizations
Active area of research
But early versions will be available for
application developers for Blue Waters next month

53
Validation
54
Challenges of Multicore Chips And SMP nodes
55
Challenges of Multicore Chips

SMP nodes have been around for a while
Typically, users just run separate MPI processes
on each core
Seen to be faster (OpenMP + MPI has not succeeded
much)
Yet, you cannot ignore the shared memory
E.g., for gravity calculations (cosmology), a
processor may request a set of particles from
remote nodes
With shared memory, one can maintain a node-level
cache, minimizing communication
Need a model without the pitfalls of Pthreads/OpenMP
Costly locks, false sharing, no respect for
locality
E.g., Charm++ avoids these pitfalls, IF its RTS
can avoid them
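
The node-level cache idea can be illustrated with a small sketch. The class name, the `fetch_remote` callable, and the single coarse lock are assumptions made for brevity. A production runtime would avoid holding a lock across the remote fetch, which is exactly the kind of cost the slide warns about.

```python
import threading

class NodeCache:
    """Cache shared by all worker threads on one SMP node (illustrative sketch)."""

    def __init__(self, fetch_remote):
        self._fetch_remote = fetch_remote  # hypothetical callable: key -> remote data
        self._lock = threading.Lock()      # coarse lock, chosen for simplicity
        self._entries = {}
        self.remote_requests = 0           # how many requests left the node

    def get(self, key):
        """Return the data for key, fetching from a remote node at most once."""
        with self._lock:
            if key not in self._entries:
                self.remote_requests += 1
                self._entries[key] = self._fetch_remote(key)
            return self._entries[key]
```

If several cores on a node need the same remote particles, only the first request crosses the network and the others are served from shared memory.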

56
Charm++ Experience

Pingpong performance

57
Raising the Level of Abstraction
58
Raising the Level of Abstraction

Clearly (?) we need new programming models
Many opinions and ideas exist
Two metapoints:
Need for exploration (don't standardize too soon)
Interoperability
Allows a beachhead for novel paradigms
Long-term: allow each module to be coded using
the paradigm best suited for it
Interoperability requires concurrent composition
This may require message-driven execution

59
Simplifying Parallel Programming

By giving up completeness!
A paradigm may be simple, but
not suitable for expressing all parallel patterns
Yet, if it can cover a significant class of
patterns (applications, modules), it is useful
A collection of incomplete models, backed by a
few complete ones, will do the trick
Our own examples: both outlaw non-determinism
Multiphase Shared Arrays (MSA): restricted shared
memory
LCPC 04
Charisma: static data flow among collections of
objects
LCR 04, HPDC 07