Title: Runtime Optimizations via Processor Virtualization
1. Runtime Optimizations via Processor Virtualization
- Laxmikant Kale
- Parallel Programming Laboratory
- Dept. of Computer Science
- University of Illinois at Urbana Champaign
- http://charm.cs.uiuc.edu
2. Outline
- Where we stand
  - Predictable behavior is very important for performance
- What is virtualization?
  - Charm++, AMPI
- Consequences
  - Software engineering
    - Natural unit of parallelism
    - Data vs code abstraction
    - Cohesion/coupling: MPI vs objects
  - Message-driven execution
    - Adaptive overlap
    - Handling jitter
    - Out-of-core execution
    - Caches / controllable SRAM
  - Flexible and dynamic mapping
    - Vacating processors
    - Accommodating speed variation
    - Shrink/expand
- Principle of persistence
  - Measurement-based load balancing
  - Communication optimizations
  - Learning algorithms
- New uses of compiler analysis
  - Apparently simple
  - Threads, global variables
  - Minimizing state at migration
  - Border fusion
3. Technical Approach
- Seek optimal division of labor between system and programmer
- Decomposition done by programmer, everything else automated
- Develop standard library of reusable parallel components
4. Object-based Decomposition
- Idea
  - Divide the computation into a large number of pieces
    - Independent of the number of processors
    - Typically larger than the number of processors
  - Let the system map objects to processors
- This is virtualization
- Language and runtime support for virtualization
- Exploitation of virtualization to the hilt (a minimal sketch follows)
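A minimal Charm++ sketch of the idea (module and entity names are hypothetical, and the 8x over-decomposition ratio is illustrative): the program creates many more pieces than processors and leaves placement to the runtime.

    // decomp.ci -- Charm++ interface file (hypothetical names)
    mainmodule decomp {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      array [1D] Piece {
        entry Piece();
        entry void work();
      };
    };

    // decomp.C
    #include "decomp.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        // Over-decompose: many more pieces than processors; the
        // runtime, not the user, maps the pieces onto PEs.
        int nPieces = 8 * CkNumPes();               // illustrative ratio
        CProxy_Piece pieces = CProxy_Piece::ckNew(nPieces);
        pieces.work();                              // broadcast to all elements
      }
    };

    class Piece : public CBase_Piece {
    public:
      Piece() {}
      Piece(CkMigrateMessage *m) {}                 // enables migratability
      void work() {
        CkPrintf("Piece %d running on PE %d\n", thisIndex, CkMyPe());
        // (termination logic omitted in this sketch)
      }
    };

    #include "decomp.def.h"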
5. Object-based Parallelization
The user is concerned only with the interaction between objects; the system implementation maps those objects onto processors. (Figure: user view of interacting objects vs. system implementation across processors.)
6. Realizations: Charm++ and AMPI
- Charm++
  - Parallel C++ with data-driven objects (chares)
  - Object arrays / object collections
  - Object groups
    - Global object with a representative on each PE
  - Asynchronous method invocation
  - Prioritized scheduling
  - Information-sharing abstractions: read-only data, tables, ...
  - Mature, robust, portable (http://charm.cs.uiuc.edu)
- AMPI
  - Adaptive MPI, with many MPI threads on each processor
7. Chare Arrays
- Elements are data-driven objects
- Elements are indexed by a user-defined data type -- sparse 1D, 2D, 3D, tree, ...
- Send messages to an index; receive messages at an element. Reductions and broadcasts across the array (see the sketch below)
- Dynamic insertion, deletion, migration -- and everything still has to work!
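A hedged sketch of these operations (hypothetical module, dense 1D indices for brevity): point-to-point sends go to an index, broadcasts go to the whole array proxy, and a reduction is expressed with contribute().

    // hello.ci -- Charm++ interface file (hypothetical names)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry [reductiontarget] void done(double sum);
      };
      array [1D] Hello {
        entry Hello();
        entry void greet(int from);
        entry void compute();
      };
    };

    // hello.C
    #include "hello.decl.h"
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(100);  // 100 elements
        arr[0].greet(7);   // send to an index (point-to-point)
        arr.compute();     // broadcast across the array
      }
      void done(double sum) {                  // reduction arrives here
        CkPrintf("sum = %f\n", sum);
        CkExit();
      }
    };

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}
      void greet(int from) {
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
      }
      void compute() {
        double val = 1.0;
        // reduction over all elements, delivered to Main::done
        contribute(sizeof(double), &val, CkReduction::sum_double,
                   CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };
    #include "hello.def.h"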
8. Object Arrays
- A collection of data-driven objects (aka chares),
- with a single global name for the collection, and
- each member addressed by an index
- Mapping of element objects to processors handled by the system
(Figure, user's view: elements A[0], A[1], A[2], A[3], ...)
9. Object Arrays (contd.)
- A collection of chares,
- with a single global name for the collection, and
- each member addressed by an index
- Mapping of element objects to processors handled by the system
(Figure: the same array in the user's view, A[0]..A[..], and in the system view, where elements such as A[3] and A[0] reside on particular processors.)
11. Comparison with MPI
- Advantage: Charm++
  - Modules/abstractions are centered on application data structures,
  - not processors
  - Several others
- Advantage: MPI
  - Highly popular, widely available, industry standard
  - Anthropomorphic view of the processor
    - Many developers find this intuitive
- But mostly
  - There is no hope of weaning people away from MPI
  - There is no need to choose between them!
12. Adaptive MPI
- A migration path for legacy MPI codes
- Gives them the dynamic load balancing capabilities of Charm++
- AMPI = MPI + dynamic load balancing
- Uses Charm++ object arrays and migratable threads
- Minimal modifications needed to convert existing MPI programs
  - Automated via AMPizer
- Bindings for C, C++, and Fortran90 (see the sketch below)
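A minimal sketch of an MPI program as it might run under AMPI. The MPI calls are standard; the periodic migration call is an AMPI extension, shown commented out under its historical name MPI_Migrate (the exact spelling varies by AMPI version, so treat it as an assumption):

    // Ordinary MPI code; under AMPI each "process" is a migratable
    // user-level thread, many of which share one physical processor.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);  // number of VIRTUAL processors

      for (int step = 0; step < 1000; ++step) {
        // ... compute and exchange boundary data as usual ...

        // AMPI extension: let the runtime migrate this virtual
        // processor for load balance (commented so the sketch also
        // builds with plain MPI):
        // if (step % 100 == 0) MPI_Migrate();
      }

      std::printf("rank %d of %d done\n", rank, size);
      MPI_Finalize();
      return 0;
    }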
13. AMPI
MPI processes are implemented as virtual processors (user-level migratable threads). (Figure: many MPI "processes" multiplexed onto each real processor.)
14. Part II: Consequences of Virtualization
- Better software engineering
- Message-driven execution
- Flexible and dynamic mapping to processors
15. Modularization
- Number of processors decoupled from logical units
  - E.g. oct-tree nodes for particle data
- No artificial restriction on the number of processors
  - (such as a cube of a power of 2)
- Modularity
  - Software engineering: cohesion and coupling
  - MPI's "are on the same processor" is a bad coupling principle
  - Objects liberate you from that
    - E.g. solid and fluid modules in a rocket simulation
16. Rocket Simulation
- Large collaboration headed by Mike Heath
  - DOE-supported center
- Challenge
  - Multi-component code, with modules from independent researchers
  - MPI was the common base
- AMPI: new wine in an old bottle
  - Easier to convert
  - Can still run the original codes on MPI, unchanged
17. Rocket Simulation via Virtual Processors
18. AMPI and Roc Communication
(Figure: communication among Rocflo module instances running on virtual processors.)
19. Message-Driven Execution
(Figure: each processor runs a scheduler that picks the next message from its message queue and invokes the corresponding object's method.)
20. Adaptive Overlap via Data-Driven Objects
- Problem
  - Processors wait too long at receive statements
- Routine communication optimizations in MPI
  - Move sends up and receives down
  - Sometimes use irecvs, but be careful
- With data-driven objects
  - Adaptive overlap of computation and communication
  - No object or thread holds up the processor
  - No need to guess which message is likely to arrive first (see the sketch below)
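For contrast, a small MPI sketch of the usual workaround (buffer sizes and tags are illustrative): post irecvs for both possible messages and process whichever completes first. Data-driven objects give this behavior by default, across all objects and modules on the processor, with no posting order to tune.

    #include <mpi.h>

    // Consume data from two neighbors in whatever order it arrives,
    // instead of blocking on a fixed receive order.
    void exchange(int left, int right) {
      double bufL[1024], bufR[1024];           // illustrative sizes
      MPI_Request reqs[2];
      MPI_Irecv(bufL, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(bufR, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

      for (int done = 0; done < 2; ++done) {
        int which;
        MPI_Waitany(2, reqs, &which, MPI_STATUS_IGNORE);
        // compute on bufL or bufR as soon as it is available
      }
      // A message-driven object instead has an entry method invoked per
      // arriving message; the scheduler interleaves ALL ready objects,
      // so the overlap extends across modules, not just two buffers.
    }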
21. Adaptive Overlap and Modules
SPMD and message-driven modules (from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994).
22. Modularity and Adaptive Overlap
"Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data." (Ian Foster, Compositional parallel programming languages, ACM Transactions on Programming Languages and Systems, 1996)
23. Handling OS Jitter via MDE
- MDE encourages asynchrony
  - Asynchronous reductions, for example
  - Only data dependence should force synchronization
- One benefit
  - Consider an algorithm with N steps
  - Each step has a different load balance
  - Loose dependence between steps
    - (on neighbors, for example)
  - Sum-of-max (MPI) vs max-of-sums (MDE); see the worked example below
- OS jitter
  - Causes random processors to add delays in each step
  - Handled automatically by MDE
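To make the sum-of-max vs max-of-sums contrast concrete (numbers illustrative): let t_{p,s} be the time processor p spends in step s. With a barrier after every step, each step costs as much as its slowest processor; with only data dependences, delays on different processors in different steps can cancel:

    T_sync  = \sum_{s=1}^{N} \max_p \, t_{p,s}
    T_async \approx \max_p \sum_{s=1}^{N} t_{p,s} \le T_sync

For two processors over two steps with times (1, 2) on one and (2, 1) on the other: T_sync = 2 + 2 = 4, while T_async = max(1+2, 2+1) = 3. Random per-step OS jitter behaves like these imbalances, so MDE absorbs it automatically.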
25. Virtualization/MDE Leads to Predictability
- Ability to predict
  - Which data is going to be needed, and
  - Which code will execute
- Based on the ready queue of object method invocations
- So, we can
  - Prefetch data accurately
  - Prefetch code if needed
- Out-of-core execution
- Caches vs controllable SRAM
26. Flexible, Dynamic Mapping to Processors
- The system can migrate objects between processors (see the PUP sketch below)
- Vacate a workstation used by a parallel program
  - Deals with extraneous loads on shared workstations
- Shrink and expand the set of processors used by an app
  - Adaptive job scheduling
  - Better system utilization
- Adapt to speed differences between processors
  - E.g. a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing
  - Checkpointing = migrate to disk!
  - Restart on a different number of processors
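All of these rely on objects being able to serialize their state. In Charm++ that is the PUP framework; a minimal sketch with hypothetical member state (pup_stl.h supplies pup operators for std:: containers):

    #include "pup_stl.h"
    #include <vector>

    class Block : public CBase_Block {
      int n;                           // hypothetical state
      std::vector<double> data;
    public:
      Block() : n(0) {}
      Block(CkMigrateMessage *m) {}    // migration constructor
      void pup(PUP::er &p) {
        CBase_Block::pup(p);           // forward to the base class
        p | n;
        p | data;
        // One routine serves both migration between processors and
        // checkpointing ("checkpointing = migrate to disk");
        // keep the state minimal, since all of it travels.
      }
    };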
27. Load Balancing with AMPI/Charm++
(Figure; the Turing cluster has processors with different speeds.)
28. Principle of Persistence
- The parallel analog of the principle of locality
  - A heuristic: holds for most CSE applications
- Once the application is expressed in terms of interacting objects
  - Object communication patterns and computational loads tend to persist over time
  - In spite of dynamic behavior
    - Abrupt but infrequent changes
    - Slow and small changes
29. Utility of the Principle of Persistence
- Learning / adaptive algorithms
- Adaptive communication libraries (toy sketch below)
  - E.g. each-to-all individualized sends
  - Performance depends on many runtime characteristics
  - The library switches between different algorithms
- Measurement-based load balancing
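A toy sketch of the switching idea (entirely hypothetical thresholds and names; real adaptive libraries choose among, e.g., direct sends and staged mesh/hypercube schemes based on measured characteristics):

    // Hypothetical adaptive all-to-all: pick a strategy from measurements.
    #include <cstddef>

    enum class Strategy { Direct, Mesh, Hypercube };

    Strategy choose(int nPes, std::size_t bytesPerMsg) {
      // Many small messages pay per-message overhead, so combine them
      // via staged schemes; large messages favor direct sends.
      if (bytesPerMsg < 1024 && nPes > 64) return Strategy::Hypercube;
      if (bytesPerMsg < 8192) return Strategy::Mesh;
      return Strategy::Direct;
    }

Because communication is observed per object rather than per processor, such a library can re-decide the strategy as the measured pattern drifts, exactly as the principle of persistence predicts it will only do slowly or infrequently.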
30. Dynamic Load Balancing for CSE
- Many CSE applications exhibit dynamic behavior
  - Adaptive mesh refinement
  - Atom migration
  - Particle migration in cosmology codes
- Objects allow the RTS to remap them to balance load
  - "Just" move objects away from overloaded processors to underloaded processors
- Just??
31. Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation
  - Measures communication volume and computation time
- Measurement-based load balancers
  - Use the instrumented database periodically to make new decisions (see the sketch below)
  - Many alternative strategies can use the database
    - Centralized vs distributed
    - Greedy improvements vs complete reassignments
    - Taking communication into account
    - Taking dependences into account (more complex)
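In Charm++, an array element opts into this machinery roughly as follows (a minimal sketch: the runtime instruments the work done between AtSync() calls, may migrate elements, then calls ResumeFromSync()):

    class Worker : public CBase_Worker {
    public:
      Worker() {
        usesAtSync = true;              // enable measurement-based LB
      }
      Worker(CkMigrateMessage *m) {}

      void step() {
        // ... one unit of (instrumented) work ...
        AtSync();                       // hand control to the balancer
      }
      void ResumeFromSync() {           // called after migrations finish
        thisProxy[thisIndex].step();    // continue with the next step
      }
      void pup(PUP::er &p) { /* serialize state for migration */ }
    };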
32. Load Balancer in Action
(Figure: automatic load balancing in crack propagation -- 1. elements added; 2. load balancer invoked; 3. chunks migrated.)
33. Overhead of Multipartitioning
34. Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
  - Communication is from/to objects, not processors
- Load balancers can use this to optimize object placement
- Communication libraries can optimize
  - By substituting the most suitable algorithm for each operation
  - Learning at runtime
(V. Krishnan, M.S. thesis, 1996)
35. Example: All-to-All on Lemieux
36. The Other Side: Pipelining
- A sends a large message to B, whereupon B computes
  - Problem: B is idle for a long time while the message gets there
  - Solution: pipelining
- Send the message in multiple pieces, triggering a computation on each
  - Objects make this easy to do (see the sketch below)
- Example
  - Ab initio computations using the Car-Parrinello method
  - Multiple 3D FFT kernels
Recent collaboration with R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
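A minimal sketch of the pattern with two chares, A and B (all names hypothetical): A sends its data as chunks rather than one large message, and B computes on each chunk as it arrives, overlapping computation with the rest of the transfer.

    // In the .ci file (hypothetical):
    //   entry void recvChunk(int c, int nChunks, int len, double piece[len]);
    #include <algorithm>
    #include <vector>

    // Sender: split one large buffer into pieces; each piece is a
    // separate asynchronous entry-method invocation on B.
    void A::sendAll(const std::vector<double>& big, CProxy_B b) {
      const int CHUNK = 4096;                         // illustrative size
      int nChunks = (big.size() + CHUNK - 1) / CHUNK;
      for (int c = 0; c < nChunks; ++c) {
        int begin = c * CHUNK;
        int len = std::min<int>(CHUNK, big.size() - begin);
        b.recvChunk(c, nChunks, len, &big[begin]);
      }
    }

    // Receiver: compute on each piece as soon as it arrives.
    void B::recvChunk(int c, int nChunks, int len, const double* piece) {
      computeOn(piece, len);          // hypothetical per-piece work
      if (++received == nChunks)      // 'received' is a member of B
        finish();                     // hypothetical completion hook
    }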
38. Effect of Pipelining
(Figure: multiple concurrent 3D FFTs on 64 processors of Lemieux. V. Ramkumar, PPL.)
39. Control Points: Learning and Tuning
- The RTS can automatically optimize the degree of pipelining
  - If it is given a control point (knob) to tune
  - By the application
Controlling pipelining between a pair of objects: S. Krishnan, Ph.D. thesis, 1994. Controlling the degree of virtualization: Orchestration Framework, ongoing Ph.D. thesis.
40. Example: Molecular Dynamics in NAMD
- Collection of charged atoms, with bonds
  - Newtonian mechanics
- At each time-step
  - Calculate forces on each atom
    - Bonds
    - Non-bonded: electrostatic and van der Waals
  - Calculate velocities and advance positions (sketch below)
- 1 femtosecond time-step, millions needed!
- Thousands of atoms (1,000 to 500,000)
Collaboration with K. Schulten, R. Skeel, and coworkers
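The per-step structure above, as a small plain-C++ sketch (not NAMD's actual code; the force routines are assumed and the integrator is deliberately simplified):

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };

    struct Atom {
      Vec3 pos, vel, force;
      double mass = 1.0;               // illustrative
    };

    // Assumed force routines: bonds, and non-bonded
    // (electrostatic + van der Waals) interactions.
    void computeBondedForces(std::vector<Atom>& atoms);
    void computeNonBonded(std::vector<Atom>& atoms);

    // One MD time step (dt ~ 1 femtosecond; millions of steps needed):
    // compute forces, then update velocities and advance positions.
    void timeStep(std::vector<Atom>& atoms, double dt) {
      computeBondedForces(atoms);
      computeNonBonded(atoms);
      for (Atom& a : atoms) {
        a.vel.x += a.force.x / a.mass * dt;   // v += (F/m) dt
        a.vel.y += a.force.y / a.mass * dt;
        a.vel.z += a.force.z / a.mass * dt;
        a.pos.x += a.vel.x * dt;              // x += v dt
        a.pos.y += a.vel.y * dt;
        a.pos.z += a.vel.z * dt;
      }
    }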
41. NAMD Performance Using Virtualization
- Written in Charm++
- Uses measurement-based load balancing
- Object-level performance feedback
  - Using the Projections tool for Charm++
  - Identifies problems at the source level easily
  - Almost suggests fixes
- Attained unprecedented performance
42. Grainsize Analysis
Problem (shown in figure): some compute objects have too much work.
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.
43. Integration Overhead Analysis
Problem (shown in figure): integration time had doubled relative to the sequential run.
44. Improved Performance Data
Published in SC2000; Gordon Bell Award finalist.
46. Role of Compilers
- New uses of compiler analysis
  - Apparently simple, but then again, data-flow analysis must have seemed simple
- Supporting threads
  - Shades of global variables
- Minimizing state at migration
- Border fusion
- Split-phase semantics (UPC)
- Components (separately compiled)
- Compiler-RTS collaboration needed!
47. Summary
- Virtualization as a magic bullet
  - Logical decomposition, better software engineering
  - Flexible and dynamic mapping to processors
- Message-driven execution
  - Adaptive overlap, modularity, predictability
- Principle of persistence
  - Measurement-based load balancing
  - Adaptive communication libraries
- Future
  - Compiler support
  - Realize the potential
    - Strategies and applications
More info: http://charm.cs.uiuc.edu
48. Component Frameworks
- Seek optimal division of labor between system and programmer
- Decomposition done by programmer, everything else automated
- Develop standard library of reusable parallel components
- Domain-specific frameworks