1
Runtime Optimizations via Processor Virtualization
  • Laxmikant Kale
  • Parallel Programming Laboratory
  • Dept. of Computer Science
  • University of Illinois at Urbana Champaign
  • http://charm.cs.uiuc.edu

2
Outline
  • Where we stand
  • Predictable behavior is very important for
    performance
  • What is virtualization?
  • Charm++, AMPI
  • Consequences
  • Software engineering
  • Natural unit of parallelism
  • Data vs. code abstraction
  • Cohesion/coupling: MPI vs. objects
  • Message-driven execution
  • Adaptive overlap
  • Handling jitter
  • Out-of-core execution
  • Caches / controllable SRAM
  • Flexible and dynamic mapping
  • Vacating workstations
  • Accommodating speed variation
  • Shrink/expand
  • Principle of persistence
  • Measurement-based load balancing
  • Communication optimizations
  • Learning algorithms
  • New uses of compiler analysis
  • Apparently simple
  • Threads, global variables
  • Minimizing state at migration
  • Border fusion

3
Technical Approach
  • Seek optimal division of labor between system
    and programmer
  • Decomposition done by programmer, everything else
    automated
  • Develop standard library of reusable parallel
    components

4
Object-based Decomposition
  • Idea
  • Divide the computation into a large number of
    pieces
  • Independent of number of processors
  • Typically larger than number of processors
  • Let the system map objects to processors
  • This is virtualization
  • Language and runtime support for virtualization
  • Exploitation of virtualization to the hilt

5
Object-based Parallelization
The user is only concerned with the interaction between objects.
[Figure: user view of interacting objects; the system implementation maps them onto processors]
6
Realizations: Charm++ and AMPI
  • Charm++
  • Parallel C++ with data-driven objects (chares)
  • Object arrays / object collections
  • Object groups
  • A global object with a representative on each PE
  • Asynchronous method invocation
  • Prioritized scheduling
  • Information-sharing abstractions: read-only data,
    tables, ...
  • Mature, robust, portable (http://charm.cs.uiuc.edu)
  • AMPI
  • Adaptive MPI, with many MPI threads on each
    processor

7
Chare Arrays
  • Elements are data-driven objects
  • Elements are indexed by a user-defined data
    type: sparse 1D, 2D, 3D, tree, ...
  • Send messages to an index, receive messages at an
    element. Reductions and broadcasts across the
    array
  • Dynamic insertion, deletion, migration, and
    everything still has to work! (A usage sketch
    follows this list.)
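As a usage sketch (names such as Piece and doWork are illustrative assumptions, not from the talk; a complete program also needs a mainchare and the generated .decl.h/.def.h headers), a 1D chare array might be declared and used roughly like this:

// piece.ci -- Charm++ interface file (sketch)
module piece {
  array [1D] Piece {
    entry Piece();
    entry void doWork(int iteration);
  };
};

// piece.C -- implementation (sketch)
class Piece : public CBase_Piece {
public:
  Piece() {}
  Piece(CkMigrateMessage* m) {}          // migration constructor
  void doWork(int iteration) {
    // ... compute on this element's piece of the data ...
    // participate in an array-wide sum reduction over all elements
    int value = thisIndex;               // thisIndex = this element's index
    contribute(sizeof(int), &value, CkReduction::sum_int,
               CkCallback(CkCallback::ckExit));
  }
};

// Creation (e.g., from a mainchare): far more elements than processors
//   CProxy_Piece pieces = CProxy_Piece::ckNew(8 * CkNumPes());
//   pieces.doWork(0);        // broadcast to every element
//   pieces[17].doWork(0);    // or send to a single element by index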

8
Object Arrays
  • A collection of data-driven objects (aka chares),
  • With a single global name for the collection, and
  • Each member addressed by an index
  • Mapping of element objects to processors handled
    by the system

User's view
[Figure: array elements A[0], A[1], A[2], A[3], ...]
9
Object Arrays
  • A collection of chares,
  • with a single global name for the collection, and
  • each member addressed by an index
  • Mapping of element objects to processors handled
    by the system

User's view
[Figure: array elements A[0], A[1], A[2], A[3], ...]
System view
[Figure: the same elements placed on processors by the runtime]
11
Comparison with MPI
  • Advantages of Charm++:
  • Modules/abstractions are centered on application
    data structures,
  • not processors
  • Several others
  • Advantages of MPI:
  • Highly popular, widely available, industry
    standard
  • Anthropomorphic view of the processor
  • Many developers find this intuitive
  • But mostly:
  • There is no hope of weaning people away from MPI
  • There is no need to choose between them!

12
Adaptive MPI
  • A migration path for legacy MPI codes
  • Gives them the dynamic load-balancing capabilities
    of Charm++
  • AMPI = MPI + dynamic load balancing
  • Uses Charm++ object arrays and migratable threads
  • Minimal modifications needed to convert existing
    MPI programs (see the sketch below)
  • Automated via AMPizer
  • Bindings for C, C++, and Fortran 90
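To make the conversion path concrete, here is a sketch of an ordinary MPI program that should also run under AMPI, where each rank becomes a migratable user-level thread. The compile/run commands and the +vp (virtual processors) option follow AMPI's usual documentation, but treat the exact invocation and migration call as assumptions.

/* ordinary MPI code; under AMPI each rank is a migratable thread */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks); /* = number of virtual processors under AMPI */

  double local = rank * 1.0, total = 0.0;
  for (int step = 0; step < 100; ++step) {
    /* ... compute on this rank's piece of the problem ... */
    MPI_Allreduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    /* With AMPI, a migration/load-balancing call (e.g. AMPI_Migrate;
       the exact name and arguments vary by version) could go here. */
  }
  if (rank == 0) printf("total = %g\n", total);
  MPI_Finalize();
  return 0;
}

Assumed invocation: compile with AMPI's compiler wrapper (ampicc) and run with more virtual processors than physical ones, e.g. ./charmrun +p8 ./a.out +vp64.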

13
AMPI
MPI processes are implemented as virtual processors
(user-level migratable threads)
14
II: Consequences of Virtualization
  • Better Software Engineering
  • Message Driven Execution
  • Flexible and Dynamic mapping to processors

15
Modularization
  • Number of processors decoupled from logical
    units
  • E.g., oct-tree nodes for particle data
  • No artificial restriction on the number of
    processors
  • (such as a cube of a power of 2)
  • Modularity
  • Software engineering: cohesion and coupling
  • MPI's "on the same processor" is a bad
    coupling principle
  • Objects liberate you from that
  • E.g., solid and fluid modules in a rocket
    simulation

16
Rocket Simulation
  • Large collaboration headed by Mike Heath
  • DOE-supported center
  • Challenge:
  • Multi-component code, with modules from
    independent researchers
  • MPI was the common base
  • AMPI: new wine in an old bottle
  • Easier to convert
  • Can still run the original codes on MPI, unchanged

17
Rocket simulation via virtual processors
18
AMPI and Roc Communication
[Figure: multiple Rocflo partitions communicating through AMPI virtual processors]
19
Message Driven Execution
[Figure: each processor runs a scheduler that picks the next message from its message queue and invokes the target object's method]
20
Adaptive Overlap via Data-driven Objects
  • Problem:
  • Processors wait too long at receive
    statements
  • Routine communication optimizations in MPI:
  • Move sends up and receives down
  • Sometimes use irecvs, but be careful (see the
    sketch below)
  • With data-driven objects:
  • Adaptive overlap of computation and communication
  • No object or thread holds up the processor
  • No need to guess which message is likely to arrive first
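For contrast, here is a sketch of the manual MPI recipe the slide refers to (post the irecvs early, do independent work, wait late); the halo-exchange setting and variable names are illustrative.

#include <mpi.h>
#include <vector>

// Exchange halos with left/right neighbors while computing on the interior.
void exchangeAndCompute(int left, int right,
                        std::vector<double>& haloL, std::vector<double>& haloR,
                        std::vector<double>& interior) {
  MPI_Request reqs[2];
  // "Move receives down" in execution order means: post them as early as possible.
  MPI_Irecv(haloL.data(), (int)haloL.size(), MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Irecv(haloR.data(), (int)haloR.size(), MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);
  // Send our boundary values to the neighbors.
  MPI_Send(interior.data(), (int)haloL.size(), MPI_DOUBLE, left,  0, MPI_COMM_WORLD);
  MPI_Send(interior.data() + interior.size() - haloR.size(), (int)haloR.size(),
           MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
  // Overlap: compute on interior points that need no remote data.
  for (double& x : interior) x *= 0.5;              // placeholder interior work
  // Block only when the halos are actually needed.
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
  // ... boundary computation using haloL / haloR would follow ...
}

With data-driven objects none of this ordering matters: whichever boundary message arrives first triggers its handler.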

21
Adaptive overlap and modules
SPMD and Message-Driven Modules (from A. Gursoy,
"Simplified expression of message-driven programs
and quantification of their impact on
performance," Ph.D. thesis, Apr. 1994)
22
Modularity and Adaptive Overlap
Parallel Composition Principle: "For effective
composition of parallel components, a
compositional programming language should allow
concurrent interleaving of component execution,
with the order of execution constrained only by
availability of data." (Ian Foster,
"Compositional parallel programming languages," ACM
Transactions on Programming Languages and
Systems, 1996)
23
Handling OS Jitter via MDE
  • MDE encourages asynchrony
  • Asynchronous reductions, for example
  • Only data dependences should force synchronization
  • One benefit:
  • Consider an algorithm with N steps
  • Each step has a different load balance
  • Loose dependence between steps
  • (on neighbors, for example)
  • Sum-of-max (MPI) vs. max-of-sums (MDE); see the
    formulation below
  • OS jitter:
  • Causes random processors to add delays in each
    step
  • Handled automatically by MDE
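To make the sum-of-max vs. max-of-sums point precise, let t_{p,s} be the time processor p spends in step s of the N-step algorithm. With a lockstep, barrier-per-step structure the run time is the first expression below; with loose, data-driven dependences it approaches the second, which is never larger:

\[
T_{\text{lockstep}} \;=\; \sum_{s=1}^{N} \max_{p}\, t_{p,s}
\;\;\ge\;\;
\max_{p} \sum_{s=1}^{N} t_{p,s} \;\approx\; T_{\text{MDE}}
\]

For example, if OS jitter adds a 1 ms delay to one random processor in each of 1000 steps, the lockstep run absorbs close to the full 1000 ms, while with message-driven execution each of the P processors accumulates only roughly 1000/P ms of delay.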

24
Virtualization/MDE leads to predictability
  • Ability to predict
  • Which data is going to be needed and
  • Which code will execute
  • Based on the ready queue of object method
    invocations

25
Virtualization/MDE leads to predictability
  • Ability to predict
  • Which data is going to be needed and
  • Which code will execute
  • Based on the ready queue of object method
    invocations
  • So, we can (see the sketch below)
  • Prefetch data accurately
  • Prefetch code if needed
  • Out-of-core execution
  • Caches vs controllable SRAM
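A generic sketch of the idea (illustrative only, not Charm++ internals; names like Invocation and fetchData are assumptions): because the scheduler already holds a queue of ready method invocations, it can look a few entries ahead and start bringing in each target object's data before that invocation runs, which is exactly what out-of-core execution or software-managed SRAM needs.

#include <deque>

struct Invocation { int objectId; /* method id, message, ... */ };

struct Object {
  virtual void fetchData() = 0;                 // e.g., start an async read from disk/DRAM
  virtual void execute(const Invocation&) = 0;
  virtual ~Object() {}
};

void schedulerLoop(std::deque<Invocation>& readyQ, Object* objects[]) {
  const int LOOKAHEAD = 2;                      // assumed prefetch depth
  while (!readyQ.empty()) {
    // Prefetch data for the next few invocations already sitting in the queue.
    for (int i = 1; i <= LOOKAHEAD && i < (int)readyQ.size(); i++)
      objects[readyQ[i].objectId]->fetchData();
    Invocation inv = readyQ.front();
    readyQ.pop_front();
    objects[inv.objectId]->execute(inv);        // data should already be resident
  }
}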

26
Flexible Dynamic Mapping to Processors
  • The system can migrate objects between processors
    (a serialization sketch follows this list)
  • Vacate a workstation used by a parallel program
  • Dealing with extraneous loads on shared
    workstations
  • Shrink and expand the set of processors used by
    an app
  • Adaptive job scheduling
  • Better system utilization
  • Adapt to speed differences between processors
  • E.g., a cluster with 500 MHz and 1 GHz processors
  • Automatic checkpointing
  • Checkpointing = migrate to disk!
  • Restart on a different number of processors
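Migration (and checkpointing, described above as "migrate to disk") hinges on each object being able to serialize itself. In Charm++ that is the pup routine; the sketch below assumes the usual generated headers and the .ci declaration, uses illustrative member names, and shows the common trick of shipping only persistent state.

#include "pup_stl.h"                     // PUP support for STL containers
#include <vector>

class Piece : public CBase_Piece {
  std::vector<double> state;             // persistent data that must travel with the object
  double* scratch;                       // temporary buffer: rebuilt after migration, not shipped
public:
  Piece() : scratch(nullptr) {}
  Piece(CkMigrateMessage* m) : scratch(nullptr) {}   // migration constructor
  void pup(PUP::er& p) {
    CBase_Piece::pup(p);                 // serialize the superclass first
    p | state;                           // keep the migrated state small: only what is needed
    if (p.isUnpacking()) scratch = nullptr;  // recreate scratch lazily on the new processor
  }
};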

27
Load Balancing with AMPI/Charm++
The Turing cluster has processors with different
speeds.
28
Principle of Persistence
  • Parallel analog of the principle of locality
  • A heuristic:
  • Holds for most CSE applications
  • Once the application is expressed in terms of
    interacting objects:
  • Object communication patterns and computational
    loads tend to persist over time
  • In spite of dynamic behavior
  • Abrupt but infrequent changes
  • Slow and small changes

29
Utility of the principle of persistence
  • Learning / adaptive algorithms
  • Adaptive communication libraries
  • Each-to-all individualized sends
  • Performance depends on many runtime
    characteristics
  • The library switches between different algorithms
    (see the sketch below)
  • Measurement-based load balancing
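A sketch of what "switching between algorithms" can look like (purely illustrative; this is not the Charm++ communication library's API): measure each candidate strategy once, then, relying on persistence, keep using whichever one was faster.

#include <chrono>

enum Strategy { DIRECT = 0, MESH_COMBINING = 1 };

struct AdaptiveAllToAll {
  double measured[2] = {-1.0, -1.0};     // seconds per strategy; -1 = not yet tried

  void run(int bytesPerMessage) {
    Strategy s;
    if (measured[DIRECT] < 0)              s = DIRECT;          // explore
    else if (measured[MESH_COMBINING] < 0) s = MESH_COMBINING;  // explore
    else s = (measured[DIRECT] <= measured[MESH_COMBINING])     // exploit the faster one
               ? DIRECT : MESH_COMBINING;

    auto t0 = std::chrono::steady_clock::now();
    if (s == DIRECT) directAllToAll(bytesPerMessage);
    else             meshAllToAll(bytesPerMessage);
    auto t1 = std::chrono::steady_clock::now();
    measured[s] = std::chrono::duration<double>(t1 - t0).count();
  }

  void directAllToAll(int /*bytes*/) { /* one message to every destination */ }
  void meshAllToAll(int /*bytes*/)   { /* combine small messages along a virtual 2D mesh */ }
};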

30
Dynamic Load Balancing For CSE
  • Many CSE applications exhibit dynamic behavior
  • Adaptive mesh refinement
  • Atom migration
  • Particle migration in cosmology codes
  • Objects allow RTS to remap them to balance load
  • Just move objects away from overloaded processors
    to underloaded processors

Just??
31
Measurement Based Load Balancing
  • Based on the principle of persistence
  • Runtime instrumentation
  • Measures communication volume and computation
    time
  • Measurement-based load balancers
  • Use the instrumented database periodically to
    make new decisions
  • Many alternative strategies can use the database
  • Centralized vs. distributed
  • Greedy improvements vs. complete reassignments (a
    greedy sketch follows this list)
  • Taking communication into account
  • Taking dependences into account (more complex)
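A minimal sketch of the greedy centralized strategy mentioned above (function and variable names are assumptions): given the measured load of every object from runtime instrumentation, repeatedly place the heaviest remaining object on the currently least-loaded processor. Communication- and dependence-aware strategies refine this basic idea.

#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> greedyAssign(const std::vector<double>& objLoad, int numProcs) {
  // order objects by decreasing measured load
  std::vector<int> order(objLoad.size());
  for (size_t i = 0; i < order.size(); i++) order[i] = (int)i;
  std::sort(order.begin(), order.end(),
            [&](int a, int b) { return objLoad[a] > objLoad[b]; });

  // min-heap of (current load, processor)
  using PL = std::pair<double, int>;
  std::priority_queue<PL, std::vector<PL>, std::greater<PL>> procs;
  for (int p = 0; p < numProcs; p++) procs.push({0.0, p});

  std::vector<int> assignment(objLoad.size());
  for (int obj : order) {
    PL least = procs.top(); procs.pop();
    assignment[obj] = least.second;        // object -> least-loaded processor
    least.first += objLoad[obj];
    procs.push(least);
  }
  return assignment;                        // new object-to-processor mapping
}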

32
Load balancer in action
Automatic Load Balancing in Crack Propagation
[Figure: 1. elements added; 2. load balancer invoked; 3. chunks migrated]
33
Overhead of Multipartitioning
34
Optimizing for Communication Patterns
  • The parallel-objects Runtime System can observe,
    instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers can use this to optimize object
    placement
  • Communication libraries can optimize
  • By substituting most suitable algorithm for each
    operation
  • Learning at runtime

V. Krishnan, MS Thesis, 1996
35
Example: All-to-all on Lemieux
36
The Other Side: Pipelining
  • A sends a large message to B, whereupon B
    computes
  • Problem: B is idle for a long time while the
    message gets there
  • Solution: pipelining
  • Send the message in multiple pieces, triggering a
    computation on each (see the sketch below)
  • Objects make this easy to do
  • Example:
  • Ab initio computations using the Car-Parrinello
    method
  • Multiple concurrent 3D FFT kernels

Recent collaboration with R. Car, M. Klein, G.
Martyna, M. Tuckerman, N. Nystrom, J. Torrellas
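A sketch of the pipelining idea in plain MPI terms (names and chunking are illustrative): instead of one large send followed by one large computation on B, A sends the data in chunks and B starts computing on each chunk as it arrives. With objects, each chunk simply triggers an entry method.

#include <mpi.h>
#include <vector>

static void computeOnChunk(double* chunk, int n) {
  // per-chunk computation that B can start as soon as the chunk arrives
  for (int i = 0; i < n; i++) chunk[i] *= 2.0;   // placeholder work
}

void sender(std::vector<double>& data, int dest, int nChunks) {
  int chunk = (int)data.size() / nChunks;
  for (int c = 0; c < nChunks; c++)
    MPI_Send(data.data() + c * chunk, chunk, MPI_DOUBLE, dest, c, MPI_COMM_WORLD);
}

void receiver(std::vector<double>& data, int src, int nChunks) {
  int chunk = (int)data.size() / nChunks;
  for (int c = 0; c < nChunks; c++) {
    MPI_Recv(data.data() + c * chunk, chunk, MPI_DOUBLE, src, c,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    computeOnChunk(data.data() + c * chunk, chunk);  // overlap: work on chunk c
  }                                                  // while chunk c+1 is in flight
}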
37
(No Transcript)
38
Effect of Pipelining
Multiple Concurrent 3D FFTs, on 64 Processors of
Lemieux
V. Ramkumar (PPL)
39
Control Points: learning and tuning
  • The RTS can automatically optimize the degree of
    pipelining
  • If it is given a control point (a knob) to tune
  • By the application

Controlling pipelining between a pair of
objects: S. Krishnan, PhD thesis, 1994
Controlling the degree of virtualization:
Orchestration Framework, ongoing PhD thesis
40
Example: Molecular Dynamics in NAMD
  • Collection of charged atoms, with bonds
  • Newtonian mechanics
  • At each time-step:
  • Calculate forces on each atom
  • Bonded forces
  • Non-bonded: electrostatic and van der Waals
  • Calculate velocities and advance positions (a
    sketch of the step follows)
  • 1-femtosecond time-step, millions needed!
  • Thousands of atoms (1,000 - 500,000)

Collaboration with K. Schulten, R. Skeel, and
coworkers
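A sketch of the per-timestep loop described above, using a simple generic integrator (this is not NAMD's actual code; the data layout and names are assumptions):

#include <vector>

struct Atom { double x[3], v[3], f[3], mass; };

void computeForces(std::vector<Atom>& atoms) {
  for (Atom& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;
  // ... add bonded forces (bonds, angles, dihedrals) ...
  // ... add non-bonded electrostatic and van der Waals forces ...
}

void timestep(std::vector<Atom>& atoms, double dt) {   // dt = 1 fs
  computeForces(atoms);
  for (Atom& a : atoms)
    for (int d = 0; d < 3; d++) {
      a.v[d] += dt * a.f[d] / a.mass;   // advance velocity
      a.x[d] += dt * a.v[d];            // advance position
    }
}

// Millions of such 1-femtosecond steps are needed for a meaningful trajectory.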
41
NAMD performance using virtualization
  • Written in Charm++
  • Uses measurement-based load balancing
  • Object-level performance feedback
  • using the Projections performance analysis tool for Charm++
  • Identifies problems at the source level easily
  • Almost suggests fixes
  • Attained unprecedented performance

42
Grainsize analysis
Problem: some compute objects have too much work.
Solution: split compute objects that may have too
much work, using a heuristic based on the number of
interacting atoms.
43
Integration overhead analysis
Problem: integration time had doubled compared with
the sequential run.
44
Improved Performance Data
Published in SC2000; Gordon Bell Award finalist
45
(No Transcript)
46
Role of compilers
  • New uses of compiler analysis
  • Apparently simple, but then again, data-flow
    analysis must have seemed simple once
  • Supporting threads
  • Shades of global variables
  • Minimizing state at migration
  • Border fusion
  • Split-phase semantics (as in UPC)
  • Components (separately compiled)
  • Compiler-RTS collaboration needed!

47
Summary
  • Virtualization as a magic bullet
  • Logical decomposition, better software engineering
  • Flexible and dynamic mapping to processors
  • Message-driven execution
  • Adaptive overlap, modularity, predictability
  • Principle of persistence
  • Measurement-based load balancing
  • Adaptive communication libraries
  • Future:
  • Compiler support
  • Realize the potential
  • Strategies and applications

More info: http://charm.cs.uiuc.edu
48
Component Frameworks
  • Seek optimal division of labor between system
    and programmer
  • Decomposition done by programmer, everything else
    automated
  • Develop standard library of reusable parallel
    components

Domain specific frameworks