Title: AMPI, Charisma and MSA
1. AMPI, Charisma and MSA
Celso Mendes, Pritish Jetley
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2. Motivation
- Challenges
  - New-generation parallel applications are dynamically varying: load shifting, adaptive refinement
  - Typical MPI implementations are not naturally suitable for dynamic applications
  - Set of available processors may not match the natural expression of the algorithm
- AMPI: Adaptive MPI
  - MPI with virtualization: VP (Virtual Processors)
3. AMPI Status
- Compliance to MPI-1.1 Standard
  - Missing: error handling, profiling interface
- Partial MPI-2 support
  - One-sided communication
  - ROMIO integrated for parallel I/O
  - Missing: dynamic process management, language bindings
4. AMPI: MPI with Virtualization
- Each virtual process is implemented as a user-level thread embedded in a Charm++ object
5. Comparison with Native MPI
- Performance
  - Slightly worse w/o optimization
  - Can be improved via Charm++
  - Run-time parameters for better performance
6. Comparison with Native MPI
- Flexibility
  - Big runs on any number of processors
  - Fits the nature of algorithms
- Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. AMPI runs on any # of PEs (e.g. 19, 33, 105). Native MPI needs P = K^3.
7. Building Charm++/AMPI
- Download website:
  - http://charm.cs.uiuc.edu/download/
  - Please register for better support
- Build Charm++/AMPI:
  - > ./build <target> <version> <options> [charmc-options]
  - See README file for details
- To build AMPI:
  - > ./build AMPI net-linux -g (-O3)
8. How to write AMPI programs
- Avoid using global variables
- Global variables are dangerous in multithreaded programs
- Global variables are shared by all the threads on a processor and can be changed by any of the threads
- Example (figure): threads interleaved in time on one processor; if count is a global, an incorrect value is read!
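
A minimal C sketch of this hazard (not from the original slides; the counter and the point where the thread switch happens are illustrative):

  #include <mpi.h>

  int count = 0;   /* global: shared by all AMPI threads on this processor */

  void do_work(void)
  {
      int mine = count;              /* thread A reads count ...                        */
      MPI_Barrier(MPI_COMM_WORLD);   /* ... a blocking call may switch to thread B,     */
                                     /* which also updates count ...                    */
      count = mine + 1;              /* ... so thread A silently overwrites that update */
  }

The fix is to make count local to the thread, or to pack it into a per-thread structure as shown in the conversion slides below.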
9. How to run AMPI programs (1)
- We can run multithreaded on each processor
- Running with many virtual processors:
  - +p command-line option: # of physical processors
  - +vp command-line option: # of virtual processors
- Multiple initial processor mappings are possible:
  - > charmrun hello +p3 +vp6 +mapping <map>
- Available mappings at program initialization:
  - RR_MAP: Round-Robin (cyclic)                  (0,3)(1,4)(2,5)
  - BLOCK_MAP: Block (default)                    (0,1)(2,3)(4,5)
  - PROP_MAP: Proportional to processors' speeds  (0,1,2,3)(4)(5)
10. How to run AMPI programs (2)
- Specify stack size for each thread
  - Set smaller/larger stack sizes
  - Note that each thread's stack space is unique across processors
  - Specify the stack size with the +tcharm_stacksize command-line option:
    > charmrun hello +p2 +vp8 +tcharm_stacksize 8000000
  - Default stack size is 1 MByte for each thread
11. How to convert an MPI program
- Remove global variables if possible
- If not possible, privatize global variables
  - Pack them into a struct/TYPE or class
  - Allocate the struct/type in heap or stack

Original code:

  MODULE shareddata
    INTEGER :: myrank
    DOUBLE PRECISION :: xyz(100)
  END MODULE
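
For C codes, a hedged sketch of the same privatization idea (the struct and field names mirror the Fortran example and are illustrative):

  #include <stdlib.h>

  /* Before: int myrank; double xyz[100];  -- globals, unsafe under AMPI */

  /* After: pack the former globals into one structure ... */
  typedef struct {
      int    myrank;
      double xyz[100];
  } chunk;

  /* ... and allocate it per thread, on the heap or on the stack */
  void thread_body(void)
  {
      chunk *c = malloc(sizeof(chunk));
      /* pass c (or its fields) to every routine that used the globals */
      free(c);
  }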
12. How to convert an MPI program
- Fortran program entry point: MPI_Main

    program pgm             -->    subroutine MPI_Main
      ...                            ...
    end program                    end subroutine

- C program entry point is handled automatically, via mpi.h
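
As a small illustration (a standard MPI hello-world, nothing AMPI-specific in the source), a C program keeps its usual main():

  #include <stdio.h>
  #include <mpi.h>

  /* main() is used as-is: mpi.h takes care of the AMPI entry point */
  int main(int argc, char **argv)
  {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      printf("Hello from VP %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
  }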
13. AMPI Extensions
- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
- Handling global and static variables
14. Automatic Load Balancing
- Automatic load balancing: MPI_Migrate()
  - Collective call informing the load balancer that the thread is ready to be migrated, if needed
- Link-time flag -memory isomalloc makes heap-data migration transparent
  - Special memory allocation mode, giving allocated memory the same virtual address on all processors
  - Ideal on 64-bit machines
  - But granularity: 16 KB per allocation
- Alternative: PUPer
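
A hedged sketch of where MPI_Migrate() typically goes in an iterative code (the loop and the LB_PERIOD constant are illustrative, not from the slides):

  #include <mpi.h>

  #define LB_PERIOD 20   /* illustrative: offer migration every 20 steps */

  void timestep_loop(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          /* ... compute and communicate for this timestep ... */

          if (step % LB_PERIOD == 0)
              MPI_Migrate();   /* collective: the load balancer may move this thread */
      }
  }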
15. Asynchronous Collectives
- Our implementation is asynchronous:
  - Collective operation is posted
  - Test/wait for its completion
  - Meanwhile, useful computation can utilize the CPU

    MPI_Ialltoall( ... , &req);
    /* other computation */
    MPI_Wait(&req, &status);
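
A self-contained sketch of the overlap pattern above, using the MPI-3-style signature; buffer names and sizes are illustrative:

  #include <stdlib.h>
  #include <mpi.h>

  void overlap_alltoall(int n, int nprocs)
  {
      double *sendbuf = malloc((size_t)n * nprocs * sizeof(double));
      double *recvbuf = malloc((size_t)n * nprocs * sizeof(double));
      MPI_Request req;

      /* ... fill sendbuf ... */

      /* post the collective, then keep the CPU busy while it progresses */
      MPI_Ialltoall(sendbuf, n, MPI_DOUBLE,
                    recvbuf, n, MPI_DOUBLE, MPI_COMM_WORLD, &req);

      /* ... useful computation that does not touch the buffers ... */

      MPI_Wait(&req, MPI_STATUS_IGNORE);   /* block only when results are needed */

      free(sendbuf);
      free(recvbuf);
  }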
16. Checkpoint/Restart Mechanism
- Large-scale machines may suffer from failures
- Checkpoint/restart mechanism:
  - State of applications checkpointed to disk files
  - Capable of restarting on a different # of PEs
  - Facilitates future efforts on fault tolerance
- In-disk: MPI_Checkpoint(DIRNAME)
- In-memory: MPI_MemCheckpoint(void)
- Checkpoint/restart and load balancing have been recently applied in CSAR's Rocstar
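
A hedged sketch of periodic in-disk checkpointing (the directory name and the interval are illustrative):

  #include <mpi.h>

  #define CKPT_PERIOD 100   /* illustrative: checkpoint every 100 steps */

  void run(int nsteps)
  {
      for (int step = 0; step < nsteps; step++) {
          /* ... one timestep of computation ... */

          if (step > 0 && step % CKPT_PERIOD == 0)
              MPI_Checkpoint("ckpt_dir");   /* collective: state written to disk files */
      }
  }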
17. Interoperability with Charm++
- Charm++ has a collection of support libraries
- We can make use of them by running Charm++ code in AMPI programs
- Also, we can run AMPI code in Charm++ programs
18. Handling Global & Static Variables
- Global and static variables are not thread-safe
  - Can we switch those variables when we switch threads?
- Globals: Executable and Linking Format (ELF)
  - Executable has a Global Offset Table (GOT) containing global data
  - GOT pointer is stored in the ebx register
  - Switch this pointer when switching between threads
  - Supported on Linux, Solaris 2.x, and more
  - Integrated in Charm++/AMPI
  - Invoked by the compile-time option -swapglobals
- Statics in C codes: __thread privatizes them
  - Requires linking to the pthreads library
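
A minimal C sketch of the __thread route for statics (a GCC extension; link with the pthreads library; the counter is illustrative):

  /* Before: a plain static shared by all threads in the process      */
  /* static int call_count = 0;                                       */

  /* After: __thread gives each thread its own copy of the variable   */
  static __thread int call_count = 0;

  int count_calls(void)
  {
      return ++call_count;   /* each thread updates only its own copy */
  }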
19. Programmer Productivity
- Take away generality
- Program in small, uncomplicated languages
- Incomplete
- Fit for specific types of programs
- Inter-operable Modules
- Common ARTS
20. Charisma
- Static data flow
  - Suffices for a number of applications
  - Molecular dynamics, FEM, PDEs, etc.
- Global data and control flow explicit
  - Unlike Charm++
21. The Charisma Programming Model
- Arrays of objects
- Global parameter space (PS)
- Objects read from and write into PS
- Clean division between
- Parallel (orchestration) code
- Sequential methods
22. Example: Stencil Computation

  foreach x,y in workers
    (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
  end-foreach
23. Language Features
- Communication patterns
- P2P, Multicast, Scatter, Gather
- Determinism
- Methods invoked on objects in Program Order
- Support for libraries
- Use external
- Create your own
24. However...
- Charisma is no panacea
  - Only static dataflow
  - Data-dependent dataflow cannot be expressed
25. Shared Address Spaces
- Shared-memory programming
  - Easier to program in (sometimes)
  - Cache coherence (hurts performance)
  - Consistency, data races (productivity suffers)
- MSA
  - SM programs access data differently in various phases
26. Matrix-Matrix Multiply
(Figure: input matrices A and B and product matrix C)
27. MSA Access Modes
- SAS is part of programming model
- Support some access modes efficiently
- Read-only
- Write
- Accumulate
28. MSA MMM Code

  typedef MSA2D<double, MSA_NullA<double>, MSA_ROW_MAJOR> MSA2DRowMjr;
  typedef MSA2D<double, MSA_SumA<double>, MSA_COL_MAJOR> MSA2DColMjr;

  MSA2DRowMjr arr1(ROWS1, COLS1, NUMWORKERS, cacheSize1);   // row major
  MSA2DColMjr arr2(ROWS2, COLS2, NUMWORKERS, cacheSize2);   // column major
  MSA2DRowMjr prod(ROWS1, COLS2, NUMWORKERS, cacheSize3);   // product matrix

  for (unsigned int c = 0; c < COLS2; c++) {
      // Each thread computes a subset of rows of the product matrix
      for (unsigned int r = rowStart; r <= rowEnd; r++) {
          double result = 0.0;
          for (unsigned int k = 0; k < COLS1; k++)
              result += arr1[r][k] * arr2[k][c];
          prod[r][c] = result;
      }
  }
  prod.sync();
29. Future Work
- Compiler-directed optimizations
- Performance and Productivity analyses
- Cross-module data exchange