Title: Runtime Optimizations via Processor Virtualization
1. Runtime Optimizations via Processor Virtualization
- Laxmikant Kale
- Parallel Programming Laboratory
- Dept. of Computer Science
- University of Illinois at Urbana Champaign
- http://charm.cs.uiuc.edu
2. Outline
- Where we stand
  - Predictable behavior is very important for performance
- What is virtualization?
  - Charm++, AMPI
- Consequences
  - Software engineering
    - Natural unit of parallelism
    - Data vs code abstraction
    - Cohesion/coupling: MPI vs objects
  - Message-driven execution
    - Adaptive overlap
    - Handling jitter
    - Out-of-core execution
    - Caches / controllable SRAM
  - Flexible and dynamic mapping
    - Vacating processors
    - Accommodating speed variation
    - Shrink/expand
- Principle of persistence
  - Measurement-based load balancing
  - Communication optimizations
  - Learning algorithms
- New uses of compiler analysis
  - Apparently simple
  - Threads, global variables
  - Minimizing state at migration
  - Border fusion
3. Technical Approach
- Seek optimal division of labor between system and programmer
- Decomposition done by programmer, everything else automated
- Develop standard library of reusable parallel components
4. Object-based Decomposition
- Idea
  - Divide the computation into a large number of pieces
    - Independent of the number of processors
    - Typically larger than the number of processors
  - Let the system map objects to processors
- This is virtualization
- Language and runtime support for virtualization
- Exploitation of virtualization to the hilt (a minimal sketch follows)
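A minimal Charm++ sketch of the idea (module and entity names are hypothetical, and the 8x over-decomposition ratio is illustrative): the program creates many more pieces than processors and leaves placement to the runtime.

    // decomp.ci -- Charm++ interface file (hypothetical names)
    mainmodule decomp {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      array [1D] Piece {
        entry Piece();
        entry void work();
      };
    };

    // decomp.C
    #include "decomp.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        // Over-decompose: many more pieces than processors; the
        // runtime, not the user, maps the pieces onto PEs.
        int nPieces = 8 * CkNumPes();               // illustrative ratio
        CProxy_Piece pieces = CProxy_Piece::ckNew(nPieces);
        pieces.work();                              // broadcast to all elements
      }
    };

    class Piece : public CBase_Piece {
    public:
      Piece() {}
      Piece(CkMigrateMessage *m) {}                 // enables migratability
      void work() {
        CkPrintf("Piece %d running on PE %d\n", thisIndex, CkMyPe());
        // (termination logic omitted in this sketch)
      }
    };

    #include "decomp.def.h"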
5. Object-based Parallelization
The user is concerned only with the interaction between objects; the system implementation maps those objects onto processors. (Figure: user view of interacting objects vs. system implementation across processors.)
6. Realizations: Charm++ and AMPI
- Charm++
  - Parallel C++ with data-driven objects (chares)
  - Object arrays / object collections
  - Object groups
    - Global object with a representative on each PE
  - Asynchronous method invocation
  - Prioritized scheduling
  - Information-sharing abstractions: read-only data, tables, ...
  - Mature, robust, portable (http://charm.cs.uiuc.edu)
- AMPI
  - Adaptive MPI, with many MPI threads on each processor
7. Chare Arrays
- Elements are data-driven objects
- Elements are indexed by a user-defined data type -- sparse 1D, 2D, 3D, tree, ...
- Send messages to an index; receive messages at an element. Reductions and broadcasts across the array (see the sketch below)
- Dynamic insertion, deletion, migration -- and everything still has to work!
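A hedged sketch of these operations (hypothetical module, dense 1D indices for brevity): point-to-point sends go to an index, broadcasts go to the whole array proxy, and a reduction is expressed with contribute().

    // hello.ci -- Charm++ interface file (hypothetical names)
    mainmodule hello {
      readonly CProxy_Main mainProxy;
      mainchare Main {
        entry Main(CkArgMsg *m);
        entry [reductiontarget] void done(double sum);
      };
      array [1D] Hello {
        entry Hello();
        entry void greet(int from);
        entry void compute();
      };
    };

    // hello.C
    #include "hello.decl.h"
    CProxy_Main mainProxy;

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        mainProxy = thisProxy;
        CProxy_Hello arr = CProxy_Hello::ckNew(100);  // 100 elements
        arr[0].greet(7);   // send to an index (point-to-point)
        arr.compute();     // broadcast across the array
      }
      void done(double sum) {                  // reduction arrives here
        CkPrintf("sum = %f\n", sum);
        CkExit();
      }
    };

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}
      void greet(int from) {
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
      }
      void compute() {
        double val = 1.0;
        // reduction over all elements, delivered to Main::done
        contribute(sizeof(double), &val, CkReduction::sum_double,
                   CkCallback(CkReductionTarget(Main, done), mainProxy));
      }
    };
    #include "hello.def.h"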
8. Object Arrays
- A collection of data-driven objects (aka chares),
- with a single global name for the collection, and
- each member addressed by an index
- Mapping of element objects to processors handled by the system
(Figure, user's view: elements A[0], A[1], A[2], A[3], ...)
9. Object Arrays (contd.)
- A collection of chares,
- with a single global name for the collection, and
- each member addressed by an index
- Mapping of element objects to processors handled by the system
(Figure: the same array in the user's view, A[0]..A[..], and in the system view, where elements such as A[3] and A[0] reside on particular processors.)
11. Comparison with MPI
- Advantage: Charm++
  - Modules/abstractions are centered on application data structures,
  - not processors
  - Several others
- Advantage: MPI
  - Highly popular, widely available, industry standard
  - Anthropomorphic view of the processor
    - Many developers find this intuitive
- But mostly
  - There is no hope of weaning people away from MPI
  - There is no need to choose between them!
12. Adaptive MPI
- A migration path for legacy MPI codes
- Gives them the dynamic load balancing capabilities of Charm++
- AMPI = MPI + dynamic load balancing
- Uses Charm++ object arrays and migratable threads
- Minimal modifications needed to convert existing MPI programs
  - Automated via AMPizer
- Bindings for C, C++, and Fortran90 (see the sketch below)
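A minimal sketch of an MPI program as it might run under AMPI. The MPI calls are standard; the periodic migration call is an AMPI extension, shown commented out under its historical name MPI_Migrate (the exact spelling varies by AMPI version, so treat it as an assumption):

    // Ordinary MPI code; under AMPI each "process" is a migratable
    // user-level thread, many of which share one physical processor.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);  // number of VIRTUAL processors

      for (int step = 0; step < 1000; ++step) {
        // ... compute and exchange boundary data as usual ...

        // AMPI extension: let the runtime migrate this virtual
        // processor for load balance (commented so the sketch also
        // builds with plain MPI):
        // if (step % 100 == 0) MPI_Migrate();
      }

      std::printf("rank %d of %d done\n", rank, size);
      MPI_Finalize();
      return 0;
    }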
13. AMPI
MPI processes are implemented as virtual processors (user-level migratable threads). (Figure: many MPI "processes" multiplexed onto each real processor.)
14. Part II: Consequences of Virtualization
- Better software engineering
- Message-driven execution
- Flexible and dynamic mapping to processors
15. Modularization
- Number of processors decoupled from logical units
  - E.g. oct-tree nodes for particle data
- No artificial restriction on the number of processors
  - (such as a cube of a power of 2)
- Modularity
  - Software engineering: cohesion and coupling
  - MPI's "are on the same processor" is a bad coupling principle
  - Objects liberate you from that
    - E.g. solid and fluid modules in a rocket simulation
16. Rocket Simulation
- Large collaboration headed by Mike Heath
  - DOE-supported center
- Challenge
  - Multi-component code, with modules from independent researchers
  - MPI was the common base
- AMPI: new wine in an old bottle
  - Easier to convert
  - Can still run the original codes on MPI, unchanged
17. Rocket Simulation via Virtual Processors
18. AMPI and Roc Communication
(Figure: communication among Rocflo module instances running on virtual processors.)
19. Message-Driven Execution
(Figure: each processor runs a scheduler that picks the next message from its message queue and invokes the corresponding object's method.)
20. Adaptive Overlap via Data-Driven Objects
- Problem
  - Processors wait too long at receive statements
- Routine communication optimizations in MPI
  - Move sends up and receives down
  - Sometimes use irecvs, but be careful
- With data-driven objects
  - Adaptive overlap of computation and communication
  - No object or thread holds up the processor
  - No need to guess which message is likely to arrive first (see the sketch below)
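For contrast, a small MPI sketch of the usual workaround (buffer sizes and tags are illustrative): post irecvs for both possible messages and process whichever completes first. Data-driven objects give this behavior by default, across all objects and modules on the processor, with no posting order to tune.

    #include <mpi.h>

    // Consume data from two neighbors in whatever order it arrives,
    // instead of blocking on a fixed receive order.
    void exchange(int left, int right) {
      double bufL[1024], bufR[1024];           // illustrative sizes
      MPI_Request reqs[2];
      MPI_Irecv(bufL, 1024, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
      MPI_Irecv(bufR, 1024, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

      for (int done = 0; done < 2; ++done) {
        int which;
        MPI_Waitany(2, reqs, &which, MPI_STATUS_IGNORE);
        // compute on bufL or bufR as soon as it is available
      }
      // A message-driven object instead has an entry method invoked per
      // arriving message; the scheduler interleaves ALL ready objects,
      // so the overlap extends across modules, not just two buffers.
    }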
21. Adaptive Overlap and Modules
SPMD and message-driven modules (from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994).
22. Modularity and Adaptive Overlap
"Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data." (Ian Foster, Compositional parallel programming languages, ACM Transactions on Programming Languages and Systems, 1996)
23. Handling OS Jitter via MDE
- MDE encourages asynchrony
  - Asynchronous reductions, for example
  - Only data dependence should force synchronization
- One benefit
  - Consider an algorithm with N steps
  - Each step has a different load balance
  - Loose dependence between steps
    - (on neighbors, for example)
  - Sum-of-max (MPI) vs max-of-sums (MDE); see the worked example below
- OS jitter
  - Causes random processors to add delays in each step
  - Handled automatically by MDE
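To make the sum-of-max vs max-of-sums contrast concrete (numbers illustrative): let t_{p,s} be the time processor p spends in step s. With a barrier after every step, each step costs as much as its slowest processor; with only data dependences, delays on different processors in different steps can cancel:

    T_sync  = \sum_{s=1}^{N} \max_p \, t_{p,s}
    T_async \approx \max_p \sum_{s=1}^{N} t_{p,s} \le T_sync

For two processors over two steps with times (1, 2) on one and (2, 1) on the other: T_sync = 2 + 2 = 4, while T_async = max(1+2, 2+1) = 3. Random per-step OS jitter behaves like these imbalances, so MDE absorbs it automatically.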
25. Virtualization/MDE Leads to Predictability
- Ability to predict
  - Which data is going to be needed, and
  - Which code will execute
- Based on the ready queue of object method invocations
- So, we can
  - Prefetch data accurately
  - Prefetch code if needed
- Out-of-core execution
- Caches vs controllable SRAM
26. Flexible, Dynamic Mapping to Processors
- The system can migrate objects between processors (see the PUP sketch below)
- Vacate a workstation used by a parallel program
  - Deals with extraneous loads on shared workstations
- Shrink and expand the set of processors used by an app
  - Adaptive job scheduling
  - Better system utilization
- Adapt to speed differences between processors
  - E.g. a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing
  - Checkpointing = migrate to disk!
  - Restart on a different number of processors
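All of these rely on objects being able to serialize their state. In Charm++ that is the PUP framework; a minimal sketch with hypothetical member state (pup_stl.h supplies pup operators for std:: containers):

    #include "pup_stl.h"
    #include <vector>

    class Block : public CBase_Block {
      int n;                           // hypothetical state
      std::vector<double> data;
    public:
      Block() : n(0) {}
      Block(CkMigrateMessage *m) {}    // migration constructor
      void pup(PUP::er &p) {
        CBase_Block::pup(p);           // forward to the base class
        p | n;
        p | data;
        // One routine serves both migration between processors and
        // checkpointing ("checkpointing = migrate to disk");
        // keep the state minimal, since all of it travels.
      }
    };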
27. Load Balancing with AMPI/Charm++
(Figure; the Turing cluster has processors with different speeds.)
28. Principle of Persistence
- The parallel analog of the principle of locality
  - A heuristic: holds for most CSE applications
- Once the application is expressed in terms of interacting objects
  - Object communication patterns and computational loads tend to persist over time
  - In spite of dynamic behavior
    - Abrupt but infrequent changes
    - Slow and small changes
29. Utility of the Principle of Persistence
- Learning / adaptive algorithms
- Adaptive communication libraries (toy sketch below)
  - E.g. each-to-all individualized sends
  - Performance depends on many runtime characteristics
  - The library switches between different algorithms
- Measurement-based load balancing
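A toy sketch of the switching idea (entirely hypothetical thresholds and names; real adaptive libraries choose among, e.g., direct sends and staged mesh/hypercube schemes based on measured characteristics):

    // Hypothetical adaptive all-to-all: pick a strategy from measurements.
    #include <cstddef>

    enum class Strategy { Direct, Mesh, Hypercube };

    Strategy choose(int nPes, std::size_t bytesPerMsg) {
      // Many small messages pay per-message overhead, so combine them
      // via staged schemes; large messages favor direct sends.
      if (bytesPerMsg < 1024 && nPes > 64) return Strategy::Hypercube;
      if (bytesPerMsg < 8192) return Strategy::Mesh;
      return Strategy::Direct;
    }

Because communication is observed per object rather than per processor, such a library can re-decide the strategy as the measured pattern drifts, exactly as the principle of persistence predicts it will only do slowly or infrequently.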
30. Dynamic Load Balancing for CSE
- Many CSE applications exhibit dynamic behavior
  - Adaptive mesh refinement
  - Atom migration
  - Particle migration in cosmology codes
- Objects allow the RTS to remap them to balance load
  - "Just" move objects away from overloaded processors to underloaded processors
- Just??
31. Measurement-Based Load Balancing
- Based on the principle of persistence
- Runtime instrumentation
  - Measures communication volume and computation time
- Measurement-based load balancers
  - Use the instrumented database periodically to make new decisions (see the sketch below)
  - Many alternative strategies can use the database
    - Centralized vs distributed
    - Greedy improvements vs complete reassignments
    - Taking communication into account
    - Taking dependences into account (more complex)
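In Charm++, an array element opts into this machinery roughly as follows (a minimal sketch: the runtime instruments the work done between AtSync() calls, may migrate elements, then calls ResumeFromSync()):

    class Worker : public CBase_Worker {
    public:
      Worker() {
        usesAtSync = true;              // enable measurement-based LB
      }
      Worker(CkMigrateMessage *m) {}

      void step() {
        // ... one unit of (instrumented) work ...
        AtSync();                       // hand control to the balancer
      }
      void ResumeFromSync() {           // called after migrations finish
        thisProxy[thisIndex].step();    // continue with the next step
      }
      void pup(PUP::er &p) { /* serialize state for migration */ }
    };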
32. Load Balancer in Action
(Figure: automatic load balancing in crack propagation -- 1. elements added; 2. load balancer invoked; 3. chunks migrated.)
33. Overhead of Multipartitioning
34. Optimizing for Communication Patterns
- The parallel-objects runtime system can observe, instrument, and measure communication patterns
  - Communication is from/to objects, not processors
- Load balancers can use this to optimize object placement
- Communication libraries can optimize
  - By substituting the most suitable algorithm for each operation
  - Learning at runtime
(V. Krishnan, M.S. thesis, 1996)
35. Example: All-to-All on Lemieux
36. The Other Side: Pipelining
- A sends a large message to B, whereupon B computes
  - Problem: B is idle for a long time while the message gets there
  - Solution: pipelining
- Send the message in multiple pieces, triggering a computation on each
  - Objects make this easy to do (see the sketch below)
- Example
  - Ab initio computations using the Car-Parrinello method
  - Multiple 3D FFT kernels
Recent collaboration with R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
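A minimal sketch of the pattern with two chares, A and B (all names hypothetical): A sends its data as chunks rather than one large message, and B computes on each chunk as it arrives, overlapping computation with the rest of the transfer.

    // In the .ci file (hypothetical):
    //   entry void recvChunk(int c, int nChunks, int len, double piece[len]);
    #include <algorithm>
    #include <vector>

    // Sender: split one large buffer into pieces; each piece is a
    // separate asynchronous entry-method invocation on B.
    void A::sendAll(const std::vector<double>& big, CProxy_B b) {
      const int CHUNK = 4096;                         // illustrative size
      int nChunks = (big.size() + CHUNK - 1) / CHUNK;
      for (int c = 0; c < nChunks; ++c) {
        int begin = c * CHUNK;
        int len = std::min<int>(CHUNK, big.size() - begin);
        b.recvChunk(c, nChunks, len, &big[begin]);
      }
    }

    // Receiver: compute on each piece as soon as it arrives.
    void B::recvChunk(int c, int nChunks, int len, const double* piece) {
      computeOn(piece, len);          // hypothetical per-piece work
      if (++received == nChunks)      // 'received' is a member of B
        finish();                     // hypothetical completion hook
    }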
38. Effect of Pipelining
(Figure: multiple concurrent 3D FFTs on 64 processors of Lemieux. V. Ramkumar, PPL.)
39. Control Points: Learning and Tuning
- The RTS can automatically optimize the degree of pipelining
  - If it is given a control point (knob) to tune
  - By the application
Controlling pipelining between a pair of objects: S. Krishnan, Ph.D. thesis, 1994. Controlling the degree of virtualization: Orchestration Framework, ongoing Ph.D. thesis.
40. Example: Molecular Dynamics in NAMD
- Collection of charged atoms, with bonds
  - Newtonian mechanics
- At each time-step
  - Calculate forces on each atom
    - Bonds
    - Non-bonded: electrostatic and van der Waals
  - Calculate velocities and advance positions (sketch below)
- 1 femtosecond time-step, millions needed!
- Thousands of atoms (1,000 to 500,000)
Collaboration with K. Schulten, R. Skeel, and coworkers
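The per-step structure above, as a small plain-C++ sketch (not NAMD's actual code; the force routines are assumed and the integrator is deliberately simplified):

    #include <vector>

    struct Vec3 { double x = 0, y = 0, z = 0; };

    struct Atom {
      Vec3 pos, vel, force;
      double mass = 1.0;               // illustrative
    };

    // Assumed force routines: bonds, and non-bonded
    // (electrostatic + van der Waals) interactions.
    void computeBondedForces(std::vector<Atom>& atoms);
    void computeNonBonded(std::vector<Atom>& atoms);

    // One MD time step (dt ~ 1 femtosecond; millions of steps needed):
    // compute forces, then update velocities and advance positions.
    void timeStep(std::vector<Atom>& atoms, double dt) {
      computeBondedForces(atoms);
      computeNonBonded(atoms);
      for (Atom& a : atoms) {
        a.vel.x += a.force.x / a.mass * dt;   // v += (F/m) dt
        a.vel.y += a.force.y / a.mass * dt;
        a.vel.z += a.force.z / a.mass * dt;
        a.pos.x += a.vel.x * dt;              // x += v dt
        a.pos.y += a.vel.y * dt;
        a.pos.z += a.vel.z * dt;
      }
    }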
41. NAMD Performance Using Virtualization
- Written in Charm++
- Uses measurement-based load balancing
- Object-level performance feedback
  - Using the Projections tool for Charm++
  - Identifies problems at the source level easily
  - Almost suggests fixes
- Attained unprecedented performance
42. Grainsize Analysis
Problem (shown in figure): some compute objects have too much work.
Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.
43. Integration Overhead Analysis
Problem (shown in figure): integration time had doubled relative to the sequential run.
44. Improved Performance Data
Published in SC2000; Gordon Bell Award finalist.
46. Role of Compilers
- New uses of compiler analysis
  - Apparently simple, but then again, data-flow analysis must have seemed simple
- Supporting threads
  - Shades of global variables
- Minimizing state at migration
- Border fusion
- Split-phase semantics (UPC)
- Components (separately compiled)
- Compiler-RTS collaboration needed!
47. Summary
- Virtualization as a magic bullet
  - Logical decomposition, better software engineering
  - Flexible and dynamic mapping to processors
- Message-driven execution
  - Adaptive overlap, modularity, predictability
- Principle of persistence
  - Measurement-based load balancing
  - Adaptive communication libraries
- Future
  - Compiler support
  - Realize the potential
    - Strategies and applications
More info: http://charm.cs.uiuc.edu
48. Component Frameworks
- Seek optimal division of labor between system and programmer
- Decomposition done by programmer, everything else automated
- Develop standard library of reusable parallel components
- Domain-specific frameworks