Title: Kaoutar El Maghraoui, elmagk_at_cs.rpi.edu
1. An Architecture for Reconfiguring MPI
Applications in Dynamic and Heterogeneous
Environments
12th SIAM Conference on Parallel Processing for
Scientific Computing
- Kaoutar El Maghraoui, elmagk_at_cs.rpi.edu
- Department of Computer Science
- Rensselaer Polytechnic Institute
- http://wcl.cs.rpi.edu/ios/
- In Collaboration with
- Dr. Carlos Varela (Thesis Advisor)
- Dr. Boleslaw Szymanski
- Travis Desell
- February 22, 2006
2. Today's Grid Environments
- Infrastructure
- Complex, large-scale, dynamic, and prone to high fault rates
- Applications
- Complex development and deployment
- Challenges
- High-level application development interface
- Designing and constructing applications for adaptability
- Late mapping of applications to Grid resources
- Monitoring and control of performance
3. MPI Challenges on Dynamic Grids
- Tailored for tightly coupled systems
- Dynamic reconfiguration
- Process mobility
- Scale up to accommodate new resources
- Shrink to accommodate leaving or slow resources
- Transparent performance monitoring and application adaptability
- Currently handled by the programmer
- Goal
- Extending MPI with dynamic reconfiguration and
adaptability to dynamic computational grids
4. Approach
- Separation of concerns between the application and the middleware
- Middleware level
- When and how to reconfigure applications?
- Application level
- Problem solving
- Support for migration and/or malleability
- Gap-bridging software
- High-level APIs
- Library support to integrate applications and middleware
5. IOS Overview
- The Internet Operating System (IOS) is a decentralized middleware framework that provides
- Opportunistic load balancing capabilities
- Resource-level profiling
- Application-level profiling
- Goal
- Automatic reconfiguration of applications in dynamic environments (e.g., computational Grids)
- Scalability to worldwide execution environments
- Modular architecture enabling evaluation of different load balancing and resource profiling strategies
- Generic interfaces to interoperate with various programming models
6. IOS Architecture
- Distributed middleware agents
- Encapsulate modules for resource profiling and reconfiguration policies
- Capable of interconnecting in various virtual topologies (e.g., hierarchical or P2P)
- Interface with high-level applications
- Interfacing with IOS agents
- Applications implement specific APIs to interface with IOS agents
- Applications need to support component migration/malleability
7. IOS Architecture
[Figure: an IOS-enabled node. Components: an application component, the IOS API, and an IOS agent comprising a profiling module (interfaces to resource profilers: CPU, memory, and network monitors), a decision module (evaluates the gain of a potential reconfiguration), and a protocol module (sends and receives steal requests). Labeled flows: message passing, application profiling, communication profiles, list of profiles, available processing, inter-delay info, steal requests, and reconfiguration requests (migrate/split/merge/replicate).]
8. IOS Load Balancing Strategies
- Modularity for customizable load balancing and profiling strategies, e.g.
- Random work-stealing (RS)
- Based on Cilk's work-stealing approach
- Lightly loaded nodes send work-steal packets to heavily loaded nodes
- Application topology-sensitive work-stealing (ATS)
- Extension to RS
- Collocates processes that communicate frequently
- Network topology-sensitive work-stealing (NTS)
- Extension to ATS
- Considers network topology
- Minimizes WAN latencies
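The random work-stealing (RS) strategy listed above can be illustrated with a minimal C sketch. It assumes that a node whose load falls below a threshold picks a victim uniformly at random among the other nodes, as in Cilk-style stealing. The type `rs_node`, the functions `rs_should_steal` and `rs_pick_victim`, the load threshold, and the small linear congruential generator are hypothetical names and choices made for illustration; none of them is part of IOS.

```c
#include <assert.h>

/* Hypothetical per-node state; field names are illustrative, not from IOS. */
typedef struct {
    int self;          /* index of this node */
    int num_nodes;     /* number of nodes known to this agent */
    double load;       /* current load estimate, 0.0 means idle */
    unsigned int rng;  /* state of a small linear congruential generator */
} rs_node;

/* Advance the generator and return a value in [0, bound). */
static int rs_next(rs_node *node, int bound)
{
    assert(bound > 0);
    node->rng = node->rng * 1103515245u + 12345u;
    return (int)((node->rng >> 16) % (unsigned int)bound);
}

/* A node tries to steal only when its load is below the threshold. */
int rs_should_steal(const rs_node *node, double threshold)
{
    return node->load < threshold;
}

/* Pick a random victim other than this node; returns -1 if there is none. */
int rs_pick_victim(rs_node *node)
{
    if (node->num_nodes < 2)
        return -1;
    /* Draw from the n-1 other nodes, then skip over our own index. */
    int victim = rs_next(node, node->num_nodes - 1);
    return victim >= node->self ? victim + 1 : victim;
}
```

ATS and NTS would presumably replace the uniform draw with a choice informed by communication or network profiles; that part is not sketched here.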
9. Reconfiguring MPI Applications with IOS
- Extending MPI
- Semi-transparent checkpointing
- Process migration support
- Integration with IOS
- Currently for iterative applications
10. The MPI/IOS Runtime Architecture
- Instrumented MPI applications
- Process Checkpointing and Migration (PCM) library
- Wrappers for some MPI native calls
- The MPI library
- The IOS runtime components
11. MPI/IOS Interactions
12. MPI Process Migration
- Implemented at the user level
- Relies on MPI communicator rearrangements and the MPI-2 spawning feature
- Instrumentation of programs with PCM calls
- Benefit: portability
- Limitation: semi-transparency
13. Migration Example
[Figure: processes with ranks 0 to 5 in MPI_COMM_WORLD]
14. Migration Example
MPI_Intercomm_merge merges the two communicators
[Figure: processes with ranks 0 to 6 in MPI_COMM_WORLD]
15. Migration Example
MPI_Comm_create creates a new communicator
[Figure: processes with ranks 0 to 5 in MPI_COMM_WORLD, with rank 3 appearing twice]
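The three migration slides above appear to show six processes, a seventh process joining as rank 6 after MPI_Intercomm_merge, and a final communicator in which a new process holds rank 3. A minimal sketch of the rank bookkeeping behind such a rearrangement follows. It only computes the rank list that could be handed to MPI_Group_incl before MPI_Comm_create (in MPI_Group_incl, the process listed at position i receives rank i in the new group). The function name `build_rank_order` is hypothetical, and whether PCM builds its communicator exactly this way is an assumption.

```c
#include <assert.h>

/*
 * Fill order[0..n-1] with the ranks, in the merged communicator, that should
 * make up the new communicator. The migrating process is left out and the
 * spawned process (rank spawned_rank after the merge) takes its position.
 */
void build_rank_order(int n, int migrating_rank, int spawned_rank, int *order)
{
    assert(n > 0 && migrating_rank >= 0 && migrating_rank < n);
    for (int i = 0; i < n; i++)
        order[i] = (i == migrating_rank) ? spawned_rank : i;
}
```

For the example on the slides, `build_rank_order(6, 3, 6, order)` yields the list 0, 1, 2, 6, 4, 5, so the spawned process would take over rank 3 while every other process keeps its rank.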
16. Profiling MPI Applications
- The profiling library is based on the MPI profiling interface
- Transparent interception of all MPI calls
- Goal: profile MPI applications' communication patterns
17. How to Instrument MPI Programs with PCM? (Initialization Phase)
#include <mpi.h>
#include "pcm.h"

MPI_Comm PCM_COMM_WORLD;

int main(int argc, char **argv)
{
    int rank, n, spawnrank;

    MPI_Init(&argc, &argv);
    PCM_COMM_WORLD = MPI_COMM_WORLD;
    PCM_Init(PCM_COMM_WORLD);
    MPI_Comm_rank(PCM_COMM_WORLD, &rank);
    MPI_Comm_size(PCM_COMM_WORLD, &n);

    spawnrank = PCM_Process_Status();
    if (spawnrank > 0) {
        /* load any checkpointed data */
        PCM_Load();
    }
18. How to Instrument MPI Programs with PCM? (Iterations Phase)
    for (several iterations) {
        pcm_status = PCM_Status(PCM_COMM_WORLD);
        if (pcm_status == PCM_MIGRATE) {
            /* checkpoint data */
            PCM_Store();
            PCM_COMM_WORLD = PCM_Reconfigure();
        }
        else if (pcm_status == PCM_RECONFIGURE) {
            PCM_COMM_WORLD = PCM_Reconfigure();
            MPI_Comm_rank(PCM_COMM_WORLD, &rank);
        }
        /* Data computation. */
        /* Exchange of computed data with neighboring processes: */
        /* MPI_Send() / MPI_Recv() */
    }
    PCM_Finalize(PCM_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
19. A Reconfiguration Scenario
20. A Reconfiguration Scenario
21. Case Study: Heat Diffusion Problem
- A problem that models heat transfer in a solid
- A two-dimensional mesh is used to represent the problem data space
- An iterative application
- Highly synchronized
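As an illustration of the kind of iterative, highly synchronized computation described above, the following sketch performs one Jacobi relaxation sweep over a two-dimensional mesh. The slides do not state the numerical scheme used in the case study, so the four-neighbor averaging, the fixed boundary values, the row-major layout, and the function name `jacobi_sweep` are all assumptions.

```c
#include <assert.h>

/*
 * One Jacobi sweep over a rows x cols mesh stored row-major.
 * Reads from cur, writes to next, copies boundary cells unchanged, and
 * returns the largest absolute change made to any interior cell.
 */
double jacobi_sweep(const double *cur, double *next, int rows, int cols)
{
    assert(rows >= 3 && cols >= 3);
    double max_change = 0.0;
    for (int i = 0; i < rows; i++) {
        for (int j = 0; j < cols; j++) {
            int k = i * cols + j;
            if (i == 0 || j == 0 || i == rows - 1 || j == cols - 1) {
                next[k] = cur[k];  /* boundary temperature stays fixed */
                continue;
            }
            /* New value is the average of the four neighbors. */
            next[k] = 0.25 * (cur[k - cols] + cur[k + cols] +
                              cur[k - 1] + cur[k + 1]);
            double change = next[k] - cur[k];
            if (change < 0.0)
                change = -change;
            if (change > max_change)
                max_change = change;
        }
    }
    return max_change;
}
```

In a parallel version, each process would typically own a strip of the mesh and exchange its edge rows with neighboring processes after every sweep, which is why a single slow processor can hold back the whole computation.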
22. Adaptation Experiments
23. Adaptation Experiments (2)
Adaptation through removing a slow processor
24. Adaptation Experiments (3)
Adaptation through migration to a better cluster
25. Empirical Results: Overhead of the PCM Library
26. Reconfiguration Overhead
27. Breakdown of Reconfiguration Cost
28. Ongoing/Future Work
- Splitting and merging MPI application processes
- New reconfiguration policies on dynamic environments
- More realistic load characteristics and network latencies
- Interoperability with MPICH-G2
- Improving the PCM API
- Non-iterative applications
29. Related Work
- MPICH-G2
- Grid-enabled implementation of MPI
- http://www3.niu.edu/mpi/
- Adaptive MPI (AMPI)
- MPI implementation with light threads for process migration [Huang03]
- MPI Process Swapping
- Initial over-allocation of processors and selection of the best executing nodes [Sievert04]
- Extensions to MPI with checkpointing and restart
- SRS library [Vadhiyar03]: application stop and restart
- CoCheck [Stellner96] and StarFish [Agbaria99]: fault tolerance
- MPICH-V [Bouteiller05]: fault tolerance
30. Questions?
31. Backup Slides
32. Resource-Sensitive Model
- Decision components use a resource-sensitive model to decide, based on the profiled applications, how to balance resource consumption
- Reconfiguration decisions
- Where to migrate
- When to migrate
- How many entities to migrate
33. A General Model for Weighted Resource-Sensitive Work-Stealing (WRS)
- Given
- A set of resources, $R = \{r_0, \ldots, r_n\}$
- A set of actors, $A = \{a_0, \ldots, a_n\}$
- $w$ is a weight, based on the importance of the resource $r$ to the performance of a set of actors $A$
- $0 \le w(r,A) \le 1$
- $\sum_{\text{all } r} w(r,A) = 1$
- $a(r,f)$ is the amount of resource $r$ available at foreign node $f$
- $u(r,l,A)$ is the amount of resource $r$ used by actors $A$ at local node $l$
- $M(A,l,f)$ is the estimated cost of migration of actors $A$ from $l$ to $f$
- $L(A)$ is the average life expectancy of the set of actors $A$
- The predicted increase in overall performance $G$ gained by migrating $A$ from $l$ to $f$, where $G \le 1$
- $D(r,l,f,A) = \dfrac{a(r,f) - u(r,l,A)}{a(r,f) + u(r,l,A)}$
- $G = \sum_{\text{all } r} \bigl(w(r,A) \cdot D(r,l,f,A)\bigr) - \dfrac{M(A,l,f)}{10 + \log L(A)}$
- When work is requested by $f$, migrate the actor(s) $A$ with the greatest predicted increase in overall performance, if positive.
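The gain estimate above can be sketched in C under one reading of the slide's formulas: D(r,l,f,A) = (a(r,f) - u(r,l,A)) / (a(r,f) + u(r,l,A)), and G = sum over all r of w(r,A) * D(r,l,f,A), minus M(A,l,f) / (10 + log L(A)). The operators in that reading are an assumption, since they are not legible in the slide text. The slide also does not state the base of the logarithm, so the sketch takes log L(A) as an input computed by the caller. The names `wrs_resource`, `wrs_delta`, and `wrs_gain` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Per-resource inputs to the gain estimate; field names are illustrative. */
typedef struct {
    double weight;     /* w(r,A), in [0,1]; weights sum to 1 over all r */
    double available;  /* a(r,f): amount available at foreign node f */
    double used;       /* u(r,l,A): amount used by A at local node l */
} wrs_resource;

/* D(r,l,f,A) = (a - u) / (a + u); defined here as 0 when a + u is 0. */
double wrs_delta(double available, double used)
{
    double sum = available + used;
    return sum > 0.0 ? (available - used) / sum : 0.0;
}

/*
 * G = sum_r w(r,A) * D(r,l,f,A) - M(A,l,f) / (10 + log L(A)).
 * log_life is log L(A), precomputed by the caller (base is an assumption).
 */
double wrs_gain(const wrs_resource *res, size_t n,
                double migration_cost, double log_life)
{
    assert(10.0 + log_life > 0.0);
    double gain = 0.0;
    for (size_t i = 0; i < n; i++)
        gain += res[i].weight * wrs_delta(res[i].available, res[i].used);
    return gain - migration_cost / (10.0 + log_life);
}
```

Under this reading, a node answering a steal request would evaluate `wrs_gain` for each candidate set of actors and migrate the one with the largest value, provided that value is positive.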
34. IOS API
- The following methods notify the profiling agent of actors entering and exiting the theater due to migration and binding
- public void addProfile(UAN uan)
- public void removeProfile(UAN uan)
- public void migrateProfile(UAN uan, UAL target)
- The profiling agent updates its actor profiles based on message sending with this method
- public void msgSend(UAN uan, Msg_INFO msgInfo)
- The profiling agent updates its actor profiles based on message reception with this method
- public void msgReceive(UAN uan, UAL targetUAL, Msg_INFO msgInfo)
- The following methods notify the profiling agent of the start of a message being processed and the end of a message being processed, with a UAN or UAL to identify the sending actor
- public void beginProcessing(UAN uan, Msg_INFO msgInfo)
- public void endProcessing(UAN uan, Msg_INFO msgInfo)
35. Virtual Topologies of IOS Agents
- Agents organize themselves in various network-sensitive virtual topologies to sense the underlying physical environments
- Peer-to-peer topology: agents form a P2P network to exchange profiled information.
- Cluster-to-cluster topology: agents organize themselves in groups of clusters. Cluster managers form a P2P network.
36. C2C vs. P2P Topologies
37. How to Instrument an MPI Program?
- The PCM API
- Process Checkpointing and Migration API
- Registers variables with a checkpoint handler
- Stores data locally or remotely in a PCM Daemon
- Restores previously checkpointed data
- Periodically probes the status of an MPI application or MPI process
- The PCM Daemon
- Loaded on every participating node
- Communicates with IOS agents and the MPI profiling library
- Handles process migration