Title: AMPI, Charisma and MSA
1. AMPI, Charisma and MSA
Celso Mendes, Pritish Jetley
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2. Motivation
- Challenges
  - New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
  - Typical MPI implementations are not naturally suitable for dynamic applications
  - Set of available processors may not match the natural expression of the algorithm
- AMPI: Adaptive MPI
  - MPI with virtualization: VP (Virtual Processors)
3. AMPI Status
- Compliance to MPI-1.1 Standard
  - Missing: error handling, profiling interface
- Partial MPI-2 support
  - One-sided communication
  - ROMIO integrated for parallel I/O
  - Missing: dynamic process management, language bindings
4. AMPI: MPI with Virtualization
- Each virtual process is implemented as a user-level thread embedded in a Charm++ object
5. Comparison with Native MPI
- Performance
  - Slightly worse w/o optimization
  - Can be improved via Charm++
  - Run-time parameters for better performance
6. Comparison with Native MPI
- Flexibility
  - Big runs on any number of processors
  - Fits the nature of algorithms
- Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. AMPI runs on any # of PEs (e.g. 19, 33, 105). Native MPI needs P = K^3.
7. Building Charm++/AMPI
- Download website:
  - http://charm.cs.uiuc.edu/download/
  - Please register for better support
- Build Charm++/AMPI:
  - > ./build <target> <version> <options> [charmc-options]
  - See README file for details
- To build AMPI:
  - > ./build AMPI net-linux -g (-O3)
8. How to write AMPI programs
- Avoid using global variables
- Global variables are dangerous in multithreaded programs
- Global variables are shared by all the threads on a processor and can be changed by any of the threads
- Example (figure): threads interleaved in time on one processor; if count is a global, an incorrect value is read!
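
A minimal C sketch of this hazard (not from the original slides; the counter and the point where the thread switch happens are illustrative):

  #include <mpi.h>

  int count = 0;   /* global: shared by all AMPI threads on this processor */

  void do_work(void)
  {
      int mine = count;              /* thread A reads count ...                        */
      MPI_Barrier(MPI_COMM_WORLD);   /* ... a blocking call may switch to thread B,     */
                                     /* which also updates count ...                    */
      count = mine + 1;              /* ... so thread A silently overwrites that update */
  }

The fix is to make count local to the thread, or to pack it into a per-thread structure as shown in the conversion slides below.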
9. How to run AMPI programs (1)
- We can run multithreaded on each processor
- Running with many virtual processors:
  - +p command-line option: # of physical processors
  - +vp command-line option: # of virtual processors
- Multiple initial processor mappings are possible:
  - > charmrun hello +p3 +vp6 +mapping <map>
- Available mappings at program initialization:
  - RR_MAP: Round-Robin (cyclic)                  (0,3)(1,4)(2,5)
  - BLOCK_MAP: Block (default)                    (0,1)(2,3)(4,5)
  - PROP_MAP: Proportional to processors' speeds  (0,1,2,3)(4)(5)
10. How to run AMPI programs (2)
- Specify stack size for each thread
  - Set smaller/larger stack sizes
  - Note that each thread's stack space is unique across processors
  - Specify the stack size with the +tcharm_stacksize command-line option:
    > charmrun hello +p2 +vp8 +tcharm_stacksize 8000000
  - Default stack size is 1 MByte for each thread
11. How to convert an MPI program
- Remove global variables if possible
- If not possible, privatize global variables
  - Pack them into a struct/TYPE or class
  - Allocate the struct/type in heap or stack

Original code:

  MODULE shareddata
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END MODULE
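
For C codes, a hedged sketch of the same privatization idea (the struct and field names mirror the Fortran example and are illustrative):

  #include <stdlib.h>

  /* Before: int myrank; double xyz[100];  -- globals, unsafe under AMPI */

  /* After: pack the former globals into one structure ... */
  typedef struct {
      int    myrank;
      double xyz[100];
  } chunk;

  /* ... and allocate it per thread, on the heap or on the stack */
  void thread_body(void)
  {
      chunk *c = malloc(sizeof(chunk));
      /* pass c (or its fields) to every routine that used the globals */
      free(c);
  }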
12. How to convert an MPI program
- Fortran program entry point: MPI_Main

    program pgm             -->    subroutine MPI_Main
      ...                            ...
    end program                    end subroutine

- C program entry point is handled automatically, via mpi.h
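
As a small illustration (a standard MPI hello-world, nothing AMPI-specific in the source), a C program keeps its usual main():

  #include <stdio.h>
  #include <mpi.h>

  /* main() is used as-is: mpi.h takes care of the AMPI entry point */
  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Hello from VP %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }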
13. AMPI Extensions
- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
- Handling global and static variables
14. Automatic Load Balancing
- Automatic load balancing: MPI_Migrate()
  - Collective call informing the load balancer that the thread is ready to be migrated, if needed
- Link-time flag -memory isomalloc makes heap-data migration transparent
  - Special memory allocation mode, giving allocated memory the same virtual address on all processors
  - Ideal on 64-bit machines
  - But granularity: 16 KB per allocation
- Alternative: PUPer
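
A hedged sketch of where MPI_Migrate() typically goes in an iterative code (the loop and the LB_PERIOD constant are illustrative, not from the slides):

  #include <mpi.h>

  #define LB_PERIOD 20   /* illustrative: offer migration every 20 steps */

  void timestep_loop(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          /* ... compute and communicate for this timestep ... */

          if (step % LB_PERIOD == 0)
              MPI_Migrate();   /* collective: the load balancer may move this thread */
      }
  }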
15. Asynchronous Collectives
- Our implementation is asynchronous:
  - Collective operation is posted
  - Test/wait for its completion
  - Meanwhile, useful computation can utilize the CPU

    MPI_Ialltoall( ... , &req);
    /* other computation */
    MPI_Wait(&req, &status);
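
A self-contained sketch of the overlap pattern above, using the MPI-3-style signature; buffer names and sizes are illustrative:

  #include <stdlib.h>
  #include <mpi.h>

  void overlap_alltoall(int n, int nprocs)
  {
      double *sendbuf = malloc((size_t)n * nprocs * sizeof(double));
      double *recvbuf = malloc((size_t)n * nprocs * sizeof(double));
      MPI_Request req;

      /* ... fill sendbuf ... */

      /* post the collective, then keep the CPU busy while it progresses */
      MPI_Ialltoall(sendbuf, n, MPI_DOUBLE,
                    recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD, &req);

      /* ... useful computation that does not touch the buffers ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);   /* block only when results are needed */

      free(sendbuf);
      free(recvbuf);
  }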
16. Checkpoint/Restart Mechanism
- Large-scale machines may suffer from failures
- Checkpoint/restart mechanism:
  - State of applications checkpointed to disk files
  - Capable of restarting on a different # of PEs
  - Facilitates future efforts on fault tolerance
- In-disk: MPI_Checkpoint(DIRNAME)
- In-memory: MPI_MemCheckpoint(void)
- Checkpoint/restart and load balancing have been recently applied in CSAR's Rocstar
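
A hedged sketch of periodic in-disk checkpointing (the directory name and the interval are illustrative):

  #include <mpi.h>

  #define CKPT_PERIOD 100   /* illustrative: checkpoint every 100 steps */

  void run(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          /* ... one timestep of computation ... */

          if (step > 0 && step % CKPT_PERIOD == 0)
              MPI_Checkpoint("ckpt_dir");   /* collective: state written to disk files */
      }
  }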
17. Interoperability with Charm++
- Charm++ has a collection of support libraries
- We can make use of them by running Charm++ code in AMPI programs
- Also, we can run AMPI code in Charm++ programs
18. Handling Global & Static Variables
- Global and static variables are not thread-safe
  - Can we switch those variables when we switch threads?
- Globals: Executable and Linking Format (ELF)
  - Executable has a Global Offset Table (GOT) containing global data
  - GOT pointer is stored in the ebx register
  - Switch this pointer when switching between threads
  - Supported on Linux, Solaris 2.x, and more
  - Integrated in Charm++/AMPI
  - Invoked by the compile-time option -swapglobals
- Statics in C codes: __thread privatizes them
  - Requires linking to the pthreads library
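
A minimal C sketch of the __thread route for statics (a GCC extension; link with the pthreads library; the counter is illustrative):

  /* Before: a plain static shared by all threads in the process      */
  /* static int call_count = 0;                                       */

  /* After: __thread gives each thread its own copy of the variable   */
  static __thread int call_count = 0;

  int count_calls(void)
  {
      return ++call_count;   /* each thread updates only its own copy */
  }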
19. Programmer Productivity
- Take away generality
- Program in small, uncomplicated languages
- Incomplete
- Fit for specific types of programs
- Inter-operable Modules
- Common ARTS
20. Charisma
- Static data flow
  - Suffices for a number of applications
  - Molecular dynamics, FEM, PDEs, etc.
- Global data and control flow explicit
  - Unlike Charm++
21. The Charisma Programming Model
- Arrays of objects
- Global parameter space (PS)
- Objects read from and write into PS
- Clean division between
- Parallel (orchestration) code
- Sequential methods
22. Example: Stencil Computation

  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
  end-foreach
23. Language Features
- Communication patterns
- P2P, Multicast, Scatter, Gather
- Determinism
- Methods invoked on objects in Program Order
- Support for libraries
- Use external
- Create your own
24. However...
- Charisma is no panacea
  - Only static dataflow
  - Data-dependent dataflow cannot be expressed
25. Shared Address Spaces
- Shared-memory programming
  - Easier to program in (sometimes)
  - Cache coherence (hurts performance)
  - Consistency, data races (productivity suffers)
- MSA
  - SM programs access data differently in various phases
26. Matrix-Matrix Multiply
(Figure: input matrices A and B and product matrix C)
27. MSA Access Modes
- SAS is part of programming model
- Support some access modes efficiently
- Read-only
- Write
- Accumulate
28. MSA MMM Code

  typedef MSA2D<double, MSA_NullA<double>, MSA_ROW_MAJOR> MSA2DRowMjr;
  typedef MSA2D<double, MSA_SumA<double>, MSA_COL_MAJOR> MSA2DColMjr;

  MSA2DRowMjr arr1(ROWS1, COLS1, NUMWORKERS, cacheSize1);   // row major
  MSA2DColMjr arr2(ROWS2, COLS2, NUMWORKERS, cacheSize2);   // column major
  MSA2DRowMjr prod(ROWS1, COLS2, NUMWORKERS, cacheSize3);   // product matrix

  for (unsigned int c = 0; c < COLS2; c++) {
      // Each thread computes a subset of rows of the product matrix
      for (unsigned int r = rowStart; r <= rowEnd; r++) {
          double result = 0.0;
          for (unsigned int k = 0; k < COLS1; k++)
              result += arr1[r][k] * arr2[k][c];
          prod[r][c] = result;
      }
  }
  prod.sync();
29. Future Work
- Compiler-directed optimizations
- Performance and Productivity analyses
- Cross-module data exchange