Title: Adaptive MPI
1. Adaptive MPI
- Milind A. Bhandarkar
- (milind_at_cs.uiuc.edu)
2. Motivation
- Many CSE applications exhibit dynamic behavior
- Adaptive mesh refinement (AMR)
- Pressure-driven crack propagation in solid propellants
- Also, non-dedicated supercomputing platforms such as clusters affect processor availability
- These factors cause severe load imbalance
- Can the parallel language / runtime system help?
3. Load Imbalance in Crack-Propagation Application
(More and more cohesive elements are activated between iterations 35 and 40.)
4. Load Imbalance in an AMR Application
(Mesh is refined at the 25th time step.)
5. Multi-partition Decomposition
- Basic idea
- Decompose the problem into a number of partitions, independent of the number of processors (see the sketch below)
- Partitions > processors
- The system maps partitions to processors
- The system maps and re-maps objects as needed
- Re-mapping strategies help adapt to dynamic variations
- To make this work, we need
- A load balancing framework and runtime support
- But isn't there a high overhead to multi-partitioning?
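A minimal sketch of the idea, not taken from the slides: the program is written for nparts partitions, and the partition count is chosen at launch time independently of the physical processor count (under AMPI this is a launch option, whose exact spelling is an assumption here). The 70000-element figure echoes the crack-propagation example; all names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    #define NTOTAL 70000   /* e.g. number of mesh elements (illustrative) */

    int main(int argc, char **argv)
    {
        int part, nparts;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &part);    /* index of this partition      */
        MPI_Comm_size(MPI_COMM_WORLD, &nparts);  /* total partitions, not CPUs   */

        /* Block decomposition: each partition owns a contiguous range of
         * elements; the runtime decides which physical processor runs it. */
        int lo = (int)((long)NTOTAL * part / nparts);
        int hi = (int)((long)NTOTAL * (part + 1) / nparts);
        printf("partition %d of %d owns elements [%d, %d)\n", part, nparts, lo, hi);

        MPI_Finalize();
        return 0;
    }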
6. Overhead of Multi-partition Decomposition
(Crack propagation code with 70k elements.)
7. Charm++
- Supports data-driven objects
- Singleton objects, object arrays, groups, ...
- Many objects per processor, with method execution scheduled by availability of data
- Supports object migration, with automatic forwarding
- Excellent results with highly irregular, dynamic applications
- Molecular dynamics application NAMD: speedup of 1250 on 2000 processors of ASCI Red
- Brunner, Phillips, and Kale, "Scalable Molecular Dynamics", Gordon Bell finalist, SC2000
8. Charm++: System-Mapped Objects
9. Load Balancing Framework
10. However...
- Many CSE applications are written in Fortran and MPI
- Conversion to a parallel object-based language such as Charm++ is cumbersome
- The message-driven style requires split-phase transactions
- It often results in a complete rewrite
- How can existing MPI applications be converted without an extensive rewrite?
11. Solution
- Each partition is implemented as a user-level thread associated with a message-driven object
- The communication library for these threads has the same syntax and semantics as MPI (see the sketch below)
- But what about the overheads associated with threads?
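As a small illustration of this point, the sketch below is an ordinary MPI ring exchange in C; under the model described on this slide each rank is simply a user-level thread, and the code itself contains nothing AMPI-specific. The program and its names are illustrative, not taken from the slides.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, token = 42;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* here: which thread/partition */
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Pass a token around a ring; each rank is one partition/thread. */
        if (rank == 0) {
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, (rank + size - 1) % size, 0,
                     MPI_COMM_WORLD, &status);
            printf("token returned to rank 0 after %d hops\n", size);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, &status);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }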
12. AMPI Threads vs. MD Objects (1D Decomposition)
13. AMPI Threads vs. MD Objects (3D Decomposition)
14. Thread Migration
- Thread stacks may contain references to local variables
- These references may not be valid upon migration to a different address space
- Solution: thread stacks should span the same virtual address space on any processor to which they may migrate (isomalloc; see the sketch below)
- Split the virtual address space into per-processor allocation pools
- Scalability issues
- Not important on 64-bit processors
- Constrained load balancing (limit the threads' migratability to fewer processors)
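A minimal sketch of the isomalloc idea, assuming Linux mmap(); the pool base address, slice size, and function name are illustrative assumptions, not the Charm++ implementation.

    #include <sys/mman.h>
    #include <stdint.h>
    #include <stddef.h>

    #define POOL_BASE  ((uintptr_t)0x100000000000ULL)  /* base of the global pool       */
    #define POOL_SLICE ((size_t)1 << 32)               /* one 4 GB slice per processor  */

    /* Reserve 'len' bytes for a thread stack at a fixed virtual address inside
     * the allocating processor's slice.  Because other processors never touch
     * that slice for their own allocations, the destination of a migration can
     * map the stack at exactly the same address, keeping stack-internal
     * pointers valid.  'offset' (page-aligned) is the thread's slot. */
    void *iso_alloc_stack(int home_pe, size_t offset, size_t len)
    {
        void *addr = (void *)(POOL_BASE + (uintptr_t)home_pe * POOL_SLICE + offset);
        return mmap(addr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    }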
15. AMPI Issues: Thread Safety
- Multiple threads are mapped to each processor
- Process-level data has to be localized
- Make it instance variables of a class
- All subroutines become instance methods of this class
- AMPIzer: a source-to-source translator
- Based on the Polaris front-end
- Recognizes all global variables
- Puts them in a thread-private area (see the sketch below)
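A hand-written illustration of the transformation AMPIzer automates, shown in C for brevity; the struct and function names are assumptions.

    /* Before: a process-level global, unsafe when several AMPI threads
     * share one process:
     *     double residual;
     *     void update_residual(double r) { residual = r; }
     */

    /* After: former globals collected into one per-thread structure that the
     * runtime (or the caller) hands to every subroutine. */
    typedef struct {
        double residual;
        int    iteration;
    } ThreadGlobals;

    void update_residual(ThreadGlobals *g, double r)
    {
        g->residual = r;   /* was: residual = r; */
    }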
16. AMPI Issues: Data Migration
- Thread-private data needs to be migrated with the thread
- The developer has to write subroutines for packing and unpacking data
- Writing separate pack and unpack subroutines is error-prone
- PUPpers (pup = pack/unpack)
- A single subroutine that shows the data to the runtime system (see the sketch below)
- Fortran90 generic procedures make writing the pupper easy
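A minimal sketch of a pupper in C: one routine that shows the thread's data to the runtime, which then uses it for sizing, packing, and unpacking. The pup_er type and pup_* calls follow my recollection of the Charm++ C PUP interface and should be treated as assumptions; the Fortran90 version mentioned on the slide would look analogous, and registration of the routine with the runtime is not shown.

    #include "pup_c.h"   /* assumed Charm++ C PUP header */
    #include <stdlib.h>

    typedef struct {
        int     n;        /* number of mesh points   */
        double *values;   /* solution on this chunk  */
    } ChunkData;

    void chunk_pup(pup_er p, void *data)
    {
        ChunkData *c = (ChunkData *)data;

        pup_int(p, &c->n);                 /* scalar field */

        if (pup_isUnpacking(p))            /* allocate before filling in */
            c->values = malloc(c->n * sizeof(double));

        pup_doubles(p, c->values, c->n);   /* array field */
    }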
17. AMPI: Other Features
- Automatic checkpoint and restart
- Restart on a different number of processors
- The number of chunks remains the same, but they can be mapped onto a different number of processors
- No additional work is needed
- The same pupper used for migration is also used for checkpointing and restart
18. Adaptive Multi-MPI
- Integration of multiple MPI-based modules
- Example: integrated rocket simulation
- ROCFLO, ROCSOLID, ROCBURN, ROCFACE
- Each module gets its own MPI_COMM_WORLD
- All COMM_WORLDs form MPI_COMM_UNIVERSE
- Point-to-point communication between different MPI_COMM_WORLDs uses the same AMPI functions (see the sketch below)
- Communication across modules is also considered while balancing load
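A minimal sketch of a cross-module exchange under stated assumptions: the communicator spanning modules (derived from MPI_COMM_UNIVERSE) is passed in as a parameter, since the slide does not show how it is obtained, and the function name, tag, and rank numbering are illustrative. The point is that the call itself is ordinary MPI.

    #include <mpi.h>

    /* Exchange boundary data with a partner rank in another module. */
    void exchange_boundary(MPI_Comm cross_module_comm, int peer_rank,
                           double *out, double *in, int n)
    {
        MPI_Status status;

        /* Same AMPI functions as intra-module communication, just a
         * different communicator. */
        MPI_Sendrecv(out, n, MPI_DOUBLE, peer_rank, 99,
                     in,  n, MPI_DOUBLE, peer_rank, 99,
                     cross_module_comm, &status);
    }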
19. Experimental Results
20. AMR Application with Load Balancing
(Load balancer is activated at time steps 20, 40, 60, and 80.)
21. AMPI Load Balancing on Heterogeneous Clusters
(Experiment carried out on a cluster of Linux workstations.)
22. AMPI vs. MPI
(This is a scaled problem.)
23. AMPI Overhead
24. AMPI Status
- Over 70 commonly used functions from MPI 1.1
- All point-to-point communication functions
- All collective communication functions
- User-defined MPI data types
- C, C++, and Fortran (77/90) bindings
- Tested on Origin 2000, IBM SP, and Linux and Solaris clusters
- Should run on any platform supported by Charm++ that has mmap