Techniques for Developing Efficient Petascale Applications - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Techniques for Developing Efficient Petascale Applications


1
Techniques for Developing Efficient Petascale
Applications
  • Laxmikant (Sanjay) Kale
  • http://charm.cs.uiuc.edu
  • Parallel Programming Laboratory
  • Department of Computer Science
  • University of Illinois at Urbana Champaign

2
Summarizing the state of the art
  • Petascale
  • Very powerful parallel computers are being built
  • Application domains exist that need that kind of
    power
  • New generation of applications
  • Use sophisticated algorithms
  • Dynamic adaptive refinements
  • Multi-scale, multi-physics
  • Challenge
  • Parallel programming is more complex than
    sequential
  • Difficulty in achieving performance that scales
    to PetaFLOP/S and beyond
  • Difficulty in getting correct behavior from
    programs

3
Our Guiding Principles
  • No magic
  • Parallelizing compilers have achieved close to
    technical perfection, but are not enough
  • Sequential programs obscure too much information
  • Seek an optimal division of labor between the
    system and the programmer
  • Design abstractions based solidly on use-cases
  • Application-oriented yet computer-science
    centered approach

4
Application Oriented Parallel Abstractions
Synergy between Computer Science Research and
Apps has been beneficial to both
(Diagram: Charm++ and its technique libraries at the center, linked to applications and issues: NAMD, LeanCP, ChaNGa, Rocket Simulation, Space-time meshing, and other applications)
5
(Same diagram as the previous slide)
6
Enabling CS technology of parallel objects and
intelligent runtime systems (Charm++ and AMPI)
has led to several collaborative applications in
CSE
7
Outline
  • Techniques for Petascale Programming
  • My biases: isoefficiency analysis,
    overdecomposition, automated resource management,
    sophisticated performance analysis tools
  • BigSim
  • Early development of Applications on future
    machines
  • Dealing with Multicore processors and SMP nodes
  • Novel parallel programming models

11/16/2009
7
8
Techniques For Scaling to Multi-PetaFLOP/s
9
1. Analyze Scalability of Algorithms (say, via the
iso-efficiency metric)
10
Isoefficiency Analysis
Vipin Kumar (circa 1990)
  • An algorithm is scalable if
  • If you double the number of processors available
    to it, it can retain the same parallel efficiency
    by increasing the size of the problem by some
    amount
  • Not all algorithms are scalable in this sense.
  • Isoefficiency is the rate at which the problem
    size must be increased, as a function of the
    number of processors, to keep the same efficiency

Parallel efficiency: E = T1 / (P * Tp), where T1 = time on one
processor and Tp = time on P processors
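
A small worked example (a standard textbook case, not taken from the
slides) makes the definition concrete: summing n numbers on P
processors with a tree reduction, assuming unit work per number and a
per-level communication cost c.

  % Illustrative textbook example (assumed), not from the slides.
  \[
    T_1 = n, \qquad
    T_P = \frac{n}{P} + c \log P, \qquad
    E = \frac{T_1}{P\,T_P} = \frac{n}{n + c\,P \log P}
  \]
  \[
    \text{Keeping } E \text{ fixed requires } n = \Theta(P \log P),
    \text{ the isoefficiency function of this algorithm.}
  \]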
11
NAMD: A Production MD program
  • NAMD
  • Fully featured program
  • NIH-funded development
  • Distributed free of charge (20,000 registered
    users)
  • Binaries and source code
  • Installed at NSF centers
  • User training and support
  • Large published simulations

11/16/2009
11
12
Molecular Dynamics in NAMD
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • Thousands of atoms (10,000 to 5,000,000)
  • At each time-step (see the sketch below)
  • Calculate forces on each atom
  • Bonds
  • Non-bonded: electrostatic and van der Waals
  • Short-distance: every timestep
  • Long-distance: using PME (3D FFT)
  • Multiple Time Stepping: PME every 4 timesteps
  • Calculate velocities and advance positions
  • Challenge: femtosecond time-step, millions needed!

Collaboration with K. Schulten, R. Skeel, and
coworkers
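
To make the per-step structure concrete, here is a minimal sketch of
one cutoff-only timestep in C++. It is an illustration, not NAMD
code: the all-pairs loop, the Coulomb-only pairForce, and the Atom
fields are assumptions, and the O(N^2) loop is exactly what the
decomposition techniques on the following slides replace.

  // Minimal sketch of one MD timestep (illustration only; not NAMD's code).
  #include <cmath>
  #include <vector>

  struct Atom { double x[3], v[3], f[3], q, m; };

  // Short-range pair force between atoms a and b, ignored beyond the cutoff.
  void pairForce(Atom& a, Atom& b, double cutoff) {
      double d[3], r2 = 0.0;
      for (int k = 0; k < 3; ++k) { d[k] = a.x[k] - b.x[k]; r2 += d[k] * d[k]; }
      if (r2 == 0.0 || r2 > cutoff * cutoff) return;
      double coef = a.q * b.q / (r2 * std::sqrt(r2));  // q_a*q_b/r^2 along d/r
      for (int k = 0; k < 3; ++k) { a.f[k] += coef * d[k]; b.f[k] -= coef * d[k]; }
  }

  void timestep(std::vector<Atom>& atoms, double dt, double cutoff) {
      for (auto& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;
      // All-pairs force loop: O(N^2); spatial decomposition avoids this.
      for (size_t i = 0; i < atoms.size(); ++i)
          for (size_t j = i + 1; j < atoms.size(); ++j)
              pairForce(atoms[i], atoms[j], cutoff);
      // Advance velocities and positions.
      for (auto& a : atoms)
          for (int k = 0; k < 3; ++k) {
              a.v[k] += dt * a.f[k] / a.m;
              a.x[k] += dt * a.v[k];
          }
  }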
13
Traditional Approaches: not isoefficient
In 1996-2002
  • Replicated Data
  • All atom coordinates stored on each processor
  • Communication/computation ratio: P log P
  • Partition the Atoms array across processors
  • Nearby atoms may not be on the same processor
  • C/C ratio: O(P)
  • Distribute force matrix to processors
  • Matrix is sparse, non-uniform
  • C/C ratio: sqrt(P)

Not scalable (in all three cases)
14
Spatial Decomposition Via Charm++
  • Atoms distributed to cubes based on their
    location
  • Size of each cube
  • Just a bit larger than the cut-off radius
  • Communicate only with neighbors
  • Work for each pair of neighboring objects
  • C/C ratio: O(1)
  • However
  • Load imbalance
  • Limited parallelism

Charm++ is useful to handle this (see the sketch below)
Cells, Cubes or Patches
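
A minimal Charm++-flavored sketch of this decomposition (not NAMD
source; the patch.ci interface, the Atom layout, and the method names
are assumptions): each cube of space becomes one element of a 3D
chare array, and only neighboring elements exchange coordinates.

  // Sketch only, not NAMD source. Assumed interface file patch.ci:
  //   module patch {
  //     array [3D] Patch {
  //       entry Patch();
  //       entry void recvCoords(int n, Atom atoms[n]);
  //     };
  //   };
  #include <vector>

  struct Atom { double x[3], f[3], q; };

  #include "patch.decl.h"   // generated from the assumed patch.ci

  class Patch : public CBase_Patch {
    std::vector<Atom> myAtoms;  // atoms whose coordinates fall inside this cube
  public:
    Patch() {}                  // cube edge chosen a bit larger than the cutoff
    Patch(CkMigrateMessage*) {}
    void recvCoords(int n, Atom* nbrAtoms) {
      // Short-range forces between local atoms and one neighbor patch's atoms.
      // Each patch talks only to its 26 neighbors, so the C/C ratio is O(1).
    }
  };
  #include "patch.def.h"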
15

Object Based Parallelization for MD: Force
Decomposition + Spatial Decomposition
  • Now, we have many objects to load balance
  • Each diamond can be assigned to any proc.
  • Number of diamonds (3D)
  • 14 × Number of Patches
  • 2-away variation
  • Half-size cubes
  • 5x5x5 interactions
  • 3-away variation: 7x7x7 interactions

16
Performance of NAMD: STMV
STMV: 1 million atoms
(Plot: time in ms per step vs. number of cores)
17
2. Decouple decomposition from Physical
Processors
18
Migratable Objects (aka Processor Virtualization)
Programmer: Over-decomposition into virtual processors
Runtime: Assigns VPs to processors
Enables adaptive runtime strategies
Implementations: Charm++, AMPI
19
Migratable Objects (aka Processor Virtualization)
Benefits
Programmer: Over-decomposition into virtual processors
Runtime: Assigns VPs to processors
Enables adaptive runtime strategies
Implementations: Charm++, AMPI
  • Software engineering
  • Number of virtual processors can be independently
    controlled
  • Separate VPs for different modules
  • Message driven execution
  • Adaptive overlap of communication
  • Predictability
  • Automatic out-of-core
  • Prefetch to local stores
  • Asynchronous reductions
  • Dynamic mapping
  • Heterogeneous clusters
  • Vacate, adjust to speed, share
  • Automatic checkpointing
  • Change set of processors used
  • Automatic dynamic load balancing
  • Communication optimization

(Figure: MPI processes implemented as virtual processors, i.e., user-level migratable threads)
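
A minimal sketch of what such a migratable virtual processor looks
like in Charm++ (illustrative; the Piece class, its piece.ci
interface, and its data member are assumptions): the runtime can move
the object because pup() tells it how to serialize its state, and
usesAtSync opts it in to measurement-based load balancing.

  // Sketch of a migratable Charm++ array element (illustrative).
  // Assumed interface file piece.ci:
  //   module piece { array [1D] Piece { entry Piece(); entry void iterate(); }; };
  #include "piece.decl.h"   // generated from the assumed piece.ci
  #include <vector>

  class Piece : public CBase_Piece {
    std::vector<double> data;      // this object's share of the problem
  public:
    Piece() : data(1000, 0.0) { usesAtSync = true; }
    Piece(CkMigrateMessage*) {}    // migration constructor
    void pup(PUP::er& p) {         // serialize state so the RTS can migrate it
      CBase_Piece::pup(p);
      p | data;
    }
    void iterate() {
      // Compute on data, exchange with neighbors, and periodically call
      // AtSync() so the runtime may rebalance and migrate this object.
    }
  };
  #include "piece.def.h"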
20
Parallel Decomposition and Processors
  • MPI-style encourages
  • Decomposition into P pieces, where P is the
    number of physical processors available
  • If your natural decomposition is a cube, then the
    number of processors must be a cube
  • Charm++/AMPI style: virtual processors
  • Decompose into natural objects of the application
  • Let the runtime map them to processors
  • Decouple decomposition from load balancing

21
Decomposition independent of numCores
  • Rocket simulation example under traditional MPI
    vs. Charm++/AMPI framework
  • Benefit: load balance, communication
    optimizations, modularity

22
Fault Tolerance
  • Automatic Checkpointing
  • Migrate objects to disk
  • In-memory checkpointing as an option
  • Automatic fault detection and restart
  • Impending Fault Response
  • Migrate objects to other processors
  • Adjust processor-level parallel data structures
  • Scalable fault tolerance
  • When a processor out of 100,000 fails, all 99,999
    shouldn't have to run back to their checkpoints!
  • Sender-side message logging
  • Latency tolerance helps mitigate costs
  • Restart can be sped up by spreading out the
    objects from the failed processor
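
A small sketch of how an application asks for these checkpoints,
using commonly documented Charm++ calls; the Main class, mainProxy,
and the resumeFromCheckpoint entry method are assumptions, not from
the slides.

  // Sketch only: trigger Charm++ checkpointing from the main chare.
  void Main::startCheckpoint() {
    // Entry method to resume in after the checkpoint (or after a restart).
    CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
    CkStartCheckpoint("ckpt", cb);   // disk-based: objects written under "ckpt/"
    // In-memory alternative (double checkpoint): CkStartMemCheckpoint(cb);
  }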

23
3. Use Dynamic Load Balancing Based on
the Principle of Persistence
Principle of persistence: Computational loads and
communication patterns tend to persist, even in
dynamic computations. So, the recent past is a good
predictor of the near future.
24
Two-step balancing
  • Computational entities (nodes, structured grid
    points, particles) are partitioned into objects
  • Say, a few thousand FEM elements per chunk, on
    average
  • Objects are migrated across processors for
    balancing load
  • Much smaller problem than repartitioning entire
    mesh, e.g.
  • Occasional repartitioning for some problems
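
Because of the principle of persistence, each object's measured load
in recent steps can stand in for its future load. Below is a minimal
sketch of the second step, assigning measured object loads to
processors greedily; it is an illustration, not one of Charm++'s
actual load-balancing strategies.

  // Illustrative greedy balancer (not a Charm++ strategy).
  #include <algorithm>
  #include <functional>
  #include <queue>
  #include <utility>
  #include <vector>

  // objLoad[i] = measured load of object i; returns object-to-processor map.
  std::vector<int> greedyAssign(const std::vector<double>& objLoad, int numProcs) {
      // Consider objects in order of decreasing measured load.
      std::vector<int> order(objLoad.size());
      for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return objLoad[a] > objLoad[b]; });
      // Min-heap of (accumulated load, processor).
      std::priority_queue<std::pair<double, int>,
                          std::vector<std::pair<double, int>>,
                          std::greater<>> procs;
      for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});
      std::vector<int> assignment(objLoad.size());
      for (int obj : order) {
          auto [load, p] = procs.top();           // least-loaded processor so far
          procs.pop();
          assignment[obj] = p;
          procs.push({load + objLoad[obj], p});
      }
      return assignment;
  }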

25
Load Balancing Steps
(Timeline: regular timesteps, then instrumented timesteps, a detailed and aggressive load balancing step, followed by refinement load balancing)
26
ChaNGa: Parallel Gravity
  • Collaborative project (NSF ITR)
  • with Prof. Tom Quinn, Univ. of Washington
  • Components: gravity, gas dynamics
  • Barnes-Hut tree codes
  • Oct-tree is the natural decomposition
  • Geometry has better aspect ratios, and so you
    open fewer nodes up.
  • But it is not used, because it leads to bad load
    balance
  • Assumption: one-to-one map between sub-trees and
    processors
  • Binary trees are considered better load balanced
  • With Charm++: use the Oct-tree, and let Charm++ map
    subtrees to processors

27
(No Transcript)
28
ChaNGa: Load balancing
Simple balancers worsened performance!
dwarf 5M on 1,024 BlueGene/L processors
29
Load balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors
Need sophisticated balancers, and ability to
choose the right ones automatically
30
ChaNGa: Preliminary Performance
ChaNGa (Parallel Gravity Code), developed in
collaboration with Tom Quinn (Univ. Washington),
using Charm++
31
Load balancing for large machines I
  • Centralized balancers achieve best balance
  • Collect object-communication graph on one
    processor
  • But won't scale beyond tens of thousands of nodes
  • Fully distributed load balancers
  • Avoid the bottleneck, but achieve poor load balance
  • Not adequately agile
  • Hierarchical load balancers
  • Careful control of what information goes up and
    down the hierarchy can lead to fast, high-quality
    balancers

32
Load balancing for large machines II
  • Interconnection topology starts to matter again
  • Was hidden due to wormhole routing etc.
  • Latency variation is still small
  • But bandwidth occupancy is a problem
  • Topology-aware load balancers
  • Some general heuristics have shown good
    performance
  • But may require too much compute power
  • Also, special-purpose heuristics work fine when
    applicable
  • Still, many open challenges
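
A minimal sketch of the basic idea behind topology-aware mapping (an
illustration under assumed grid and torus dimensions; it is neither
Charm++'s TopoManager nor OpenAtom's actual maps): block a 3D grid of
communicating objects onto a 3D torus of processors so that neighbors
land on nearby nodes and per-link bandwidth occupancy drops.

  // Illustrative blocked mapping (assumed dimensions).
  #include <array>

  struct Dims { int x, y, z; };   // object-grid or torus dimensions (assumed)

  // Map object (i,j,k) of an objs.x * objs.y * objs.z grid onto a rank of a
  // torus.x * torus.y * torus.z torus by giving each node a contiguous block.
  int mapToProc(std::array<int, 3> obj, Dims objs, Dims torus) {
      int bx = (objs.x + torus.x - 1) / torus.x;
      int by = (objs.y + torus.y - 1) / torus.y;
      int bz = (objs.z + torus.z - 1) / torus.z;
      int px = obj[0] / bx, py = obj[1] / by, pz = obj[2] / bz;
      // Linearize torus coordinates into a processor rank.
      return (pz * torus.y + py) * torus.x + px;
  }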

33
OpenAtom: Car-Parrinello ab initio MD (NSF ITR
2001-2007, IBM)
G. Martyna, M. Tuckerman, L. Kale
34
Goal: The accurate treatment of complex heterogeneous
nanoscale systems
Courtesy G. Martyna
(Figure: magnetic recording schematic showing read head, write head, bit, soft under-layer, and recording medium)
35
New OpenAtom Collaboration (DOE)
  • Principal Investigators
  • G. Martyna (IBM TJ Watson)
  • M. Tuckerman (NYU)
  • L.V. Kale (UIUC)
  • K. Schulten (UIUC)
  • J. Dongarra (UTK/ORNL)
  • Current effort focus
  • QM/MM via integration with NAMD2
  • ORNL Cray XT4 Jaguar (31,328 cores)
  • A unique parallel decomposition of the
    Car-Parrinello method
  • Using Charm++ virtualization we can efficiently
    scale small (32 molecule) systems to thousands of
    processors

36
Decomposition and Computation Flow
37

OpenAtom Performance
(Plots: scaling on BlueGene/L, BlueGene/P, and Cray XT3)
38
Topology aware mapping of objects
39
Improvements wrought by network-topology-aware
mapping of computational work to processors
The simulation in the left panel maps computational
work to processors taking the network connectivity
into account, while the right-panel simulation does
not. The black (idle) time processors spend waiting
for computational work to arrive is significantly
reduced at left. (256 waters, 70R, on BG/L, 4096
cores)
40
OpenAtom Topology Aware Mapping on BG/L
41
OpenAtom Topology Aware Mapping on Cray XT/3
42
4. Analyze Performance with Sophisticated Tools
43
Exploit sophisticated Performance analysis tools
  • We use a tool called Projections
  • Many other tools exist
  • Need for scalable analysis
  • A recent example
  • Trying to identify the next performance obstacle
    for NAMD
  • Running on 8192 processors, with a 92,000-atom
    simulation
  • Test scenario: without PME
  • Time is 3 ms per step, but the lower bound is
    1.6 ms or so

44
(No Transcript)
45
(No Transcript)
46
(No Transcript)
47
5. Model and Predict Performance Via Simulation
48
How to tune performance for a future machine?
  • For example, Blue Waters will arrive in 2011
  • But we need to prepare applications for it
    starting now
  • Even for extant machines
  • The full-size machine may not be available as
    often as needed for tuning runs
  • A simulation-based approach is needed
  • Our approach: BigSim
  • Full-scale emulation
  • Trace-driven simulation

49
BigSim Emulation
  • Emulation
  • Run an existing, full-scale MPI, AMPI or Charm++
    application
  • Using an emulation layer that pretends to be
    (say) 100k cores
  • Leverages Charm++'s object-based virtualization
  • Generates traces
  • Characteristics of SEBs (Sequential Execution
    Blocks)
  • Dependences between SEBs and messages (see the
    sketch below)
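
A hypothetical illustration (not BigSim's actual trace format) of the
information such a trace entry has to carry so the simulator can
replay both timing and dependences:

  // Hypothetical trace record (not BigSim's format).
  #include <cstdint>
  #include <vector>

  struct MessageRecord {
      int destPe;                        // emulated destination processor
      std::uint64_t bytes;               // payload size, for the network model
  };

  struct SEBRecord {                     // one Sequential Execution Block
      double duration;                   // measured or modeled sequential time
      std::vector<int> dependsOn;        // SEBs/messages that must complete first
      std::vector<MessageRecord> sends;  // messages this block generates
  };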

50
BigSim Trace-Driven Simulation
  • Trace-driven parallel simulation
  • Typically on tens to hundreds of processors
  • Multiple-resolution simulation of sequential
    execution
  • from simple scaling to
  • cycle-accurate modeling
  • Multiple-resolution simulation of the network
  • from a simple latency/bandwidth model to
  • a detailed packet and switching-port level
    simulation
  • Generates timing traces just as a real app on the
    full-scale machine would
  • Phase 3: Analyze performance
  • Identify bottlenecks, even w/o predicting exact
    performance
  • Carry out various what-if analyses

51
Projections: Performance visualization
52
BigSim Challenges
  • This simple picture hides many complexities
  • Emulation
  • Automatic Out-of-core support for large memory
    footprint apps
  • Simulation
  • Accuracy vs cost tradeoffs
  • Interpolation mechanisms
  • Memory optimizations
  • Active area of research
  • But early versions will be available for
    application developers for Blue Waters next month

53
Validation
54
Challenges of Multicore Chips And SMP nodes
55
Challenges of Multicore Chips
  • SMP nodes have been around for a while
  • Typically, users just run separate MPI processes
    on each core
  • Seen to be faster (OpenMP + MPI has not succeeded
    much)
  • Yet, you cannot ignore the shared memory
  • E.g., for gravity calculations (cosmology), a
    processor may request a set of particles from
    remote nodes
  • With shared memory, one can maintain a node-level
    cache, minimizing communication (see the sketch
    below)
  • Need a model without the pitfalls of
    Pthreads/OpenMP
  • Costly locks, false sharing, no respect for
    locality
  • E.g., Charm++ avoids these pitfalls, IF its RTS
    can avoid them
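
A minimal sketch of such a node-level cache (an illustration of the
idea, not ChaNGa's or Charm++'s implementation; the Particle layout
and bucket keys are assumptions): all cores on an SMP node consult
one shared table before requesting remote particles over the network.

  // Illustrative node-level software cache (not ChaNGa's implementation).
  #include <memory>
  #include <mutex>
  #include <unordered_map>
  #include <vector>

  struct Particle { double x[3], mass; };

  class NodeCache {
    std::mutex m;
    std::unordered_map<long, std::shared_ptr<std::vector<Particle>>> entries;
  public:
    // Returns the cached bucket if present; otherwise nullptr, and the caller
    // issues one network request and stores the reply via insert().
    std::shared_ptr<std::vector<Particle>> lookup(long bucketId) {
      std::lock_guard<std::mutex> g(m);
      auto it = entries.find(bucketId);
      if (it == entries.end()) return nullptr;
      return it->second;
    }
    void insert(long bucketId, std::shared_ptr<std::vector<Particle>> bucket) {
      std::lock_guard<std::mutex> g(m);
      entries.emplace(bucketId, std::move(bucket));
    }
  };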

56
Charm++ Experience
  • Pingpong performance

57
Raising the Level of Abstraction
58
Raising the Level of Abstraction
  • Clearly (?) we need new programming models
  • Many opinions and ideas exist
  • Two meta-points
  • Need for exploration (don't standardize too soon)
  • Interoperability
  • Allows a beachhead for novel paradigms
  • Long-term: Allow each module to be coded using
    the paradigm best suited for it
  • Interoperability requires concurrent composition
  • This may require message-driven execution

59
Simplifying Parallel Programming
  • By giving up completeness!
  • A paradigm may be simple, but
  • not suitable for expressing all parallel patterns
  • Yet, if it can cover a significant class of
    patterns (applications, modules), it is useful
  • A collection of incomplete models, backed by a
    few complete ones, will do the trick
  • Our own examples (both outlaw non-determinism)
  • Multiphase Shared Arrays (MSA): restricted shared
    memory
  • LCPC 04
  • Charisma: static data flow among collections of
    objects
  • LCR 04, HPDC 07

60
MSA
(Figure: several threads, labeled C, sharing MSA data arrays A and B)
  • In the simple model
  • A program consists of
  • A collection of Charm++ threads, and
  • Multiple collections of data-arrays
  • Partitioned into pages (user-specified)
  • Each array is in one mode at a time
  • But its mode may change from phase to phase
  • Modes
  • Write-once
  • Read-only
  • Accumulate
  • Owner-computes

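
A conceptual sketch of this phased, mode-based discipline
(hypothetical code, not the actual MSA library API): each array is in
one access mode per phase, and a synchronization point switches
modes, so a read-only phase needs no locks at all.

  // Hypothetical API illustrating MSA-style phases (not the MSA library).
  #include <cassert>
  #include <vector>

  enum class Mode { WriteOnce, ReadOnly, Accumulate };

  template <typename T>
  class PhasedArray {                      // stand-in for an MSA data array
    std::vector<T> data;
    Mode mode;
  public:
    PhasedArray(size_t n, Mode m) : data(n, T{}), mode(m) {}
    void sync(Mode next) { mode = next; }  // phase boundary: all threads switch mode
    void accumulate(size_t i, T v) { assert(mode == Mode::Accumulate); data[i] += v; }
    T read(size_t i) const { assert(mode == Mode::ReadOnly); return data[i]; }
  };

  // Usage idea: in one phase every thread accumulates partial results; after
  // sync(Mode::ReadOnly), the next phase reads the array without locking.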
61
A View of an Interoperable Future
62
Summary
  • Petascale applications will require a large
    toolbox. My pet tools/techniques:
  • Scalable Algorithms with isoefficiency analysis
  • Object-based decomposition
  • Dynamic Load balancing
  • Scalable Performance analysis
  • Early performance tuning via BigSim
  • Multicore chips, SMP nodes, accelerators
  • Require extra programming effort with
    locality-respecting models
  • Raise the level of abstraction
  • Via Incomplete yet simple paradigms, and
  • domain-specific frameworks

More Info: http://charm.cs.uiuc.edu/