Transcript and Presenter's Notes

Title: AMPI, Charisma and MSA


1
AMPI, Charisma and MSA
Celso Mendes, Pritish Jetley
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign
2
Motivation
  • Challenges
  • New generation parallel applications are
  • Dynamically varying: load shifting, adaptive
    refinement
  • Typical MPI implementations are
  • Not naturally suitable for dynamic applications
  • Set of available processors may not match the
    natural expression of the algorithm
  • AMPI: Adaptive MPI
  • MPI with virtualization: VP (Virtual Processors)

3
AMPI Status
  • Compliance with the MPI-1.1 standard
  • Missing: error handling, profiling interface
  • Partial MPI-2 support
  • One-sided communication
  • ROMIO integrated for parallel I/O
  • Missing: dynamic process management, language
    bindings

4
AMPI: MPI with Virtualization
  • Each virtual process is implemented as a user-level
    thread embedded in a Charm++ object

5
Comparison with Native MPI
  • Performance
  • Slightly worse without optimization
  • Can be improved via Charm++
  • Run-time parameters for better performance

6
Comparison with Native MPI
  • Flexibility
  • Big runs on any number of processors
  • Fits the nature of algorithms

Problem setup: 3D stencil calculation of size
240^3 run on Lemieux. AMPI runs on any # of PEs
(e.g., 19, 33, 105). Native MPI needs P = K^3.
7
Building Charm++ / AMPI
  • Download website:
  • http://charm.cs.uiuc.edu/download/
  • Please register for better support
  • Build Charm++/AMPI:
  • > ./build <target> <version> <options>
    [charmc-options]
  • See README file for details
  • To build AMPI:
  • > ./build AMPI net-linux -g (-O3)

8
How to write AMPI programs
  • Avoid using global variables
  • Global variables are dangerous in multithreaded
    programs
  • Global variables are shared by all the threads on
    a processor and can be changed by any of the
    threads
  • Example (see the code sketch below)

(Timeline diagram: two threads on one processor interleave
over time; if count is a global variable, an incorrect value
is read!)
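Below is a minimal C++/MPI sketch (not from the slides; the names count, local_count and unsafe_increment are illustrative) contrasting a file-scope global, which all virtual processors on a physical processor share under AMPI, with a stack variable, which is private to each thread.

#include <mpi.h>
#include <cstdio>

int count = 0;               // UNSAFE under AMPI: shared by every VP on this processor

void unsafe_increment() {
    // Another VP on the same processor may update 'count' at a thread
    // switch (e.g., inside a blocking MPI call), so the value read here
    // can be incorrect.
    count++;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local_count = 0;     // SAFE: lives on this thread's private stack
    local_count++;
    unsafe_increment();

    std::printf("VP %d: local_count=%d, global count=%d\n",
                rank, local_count, count);
    MPI_Finalize();
    return 0;
}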
9
How to run AMPI programs (1)
  • We can run multiple threads on each processor
  • Running with many virtual processors
  • +p: command-line option: # of physical processors
  • +vp: command-line option: # of virtual processors
  • Multiple initial processor mappings are possible
  • > charmrun hello +p3 +vp6 +mapping <map>
  • Available mappings at program initialization
  • RR_MAP: Round-Robin (cyclic)
    (0,3)(1,4)(2,5)
  • BLOCK_MAP: Block (default)
    (0,1)(2,3)(4,5)
  • PROP_MAP: Proportional to processors' speeds
    (0,1,2,3)(4)(5)

10
How to run AMPI programs (2)
  • Specify stack size for each thread
  • Set smaller/larger stack sizes
  • Note that each thread's stack space is unique
    across processors
  • Specify the stack size for each thread with the
    +tcharm_stacksize command-line option
  • > charmrun hello +p2 +vp8 +tcharm_stacksize
    8000000
  • Default stack size is 1 MByte for each thread

11
How to convert an MPI program
  • Remove global variables if possible
  • If not possible, privatize global variables
  • Pack them into a struct/TYPE or class
  • Allocate the struct/type on the heap or stack
    (a sketch of this pattern follows the code below)

Original Code:
MODULE shareddata
  INTEGER myrank
  DOUBLE PRECISION xyz(100)
END MODULE
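The same idea as a minimal C++ sketch (not from the slides; SharedData and compute are illustrative names): the former globals are packed into a struct that is allocated on the thread's stack (or heap) and passed explicitly.

#include <mpi.h>
#include <vector>

// Former globals packed into one structure (mirrors the Fortran MODULE above).
struct SharedData {
    int myrank;
    std::vector<double> xyz;   // was: DOUBLE PRECISION xyz(100)
};

void compute(SharedData& d) {
    // Access the privatized data through the struct instead of globals.
    d.xyz[0] = static_cast<double>(d.myrank);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    SharedData data;                 // allocated on this thread's own stack
    data.xyz.resize(100);
    MPI_Comm_rank(MPI_COMM_WORLD, &data.myrank);
    compute(data);                   // pass the struct explicitly
    MPI_Finalize();
    return 0;
}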
12
How to convert an MPI program
  • Fortran program entry point is MPI_Main
  •   program pgm       →   subroutine MPI_Main
  •     ...                     ...
  •   end program           end subroutine
  • C program entry point is handled automatically,
    via mpi.h

13
AMPI Extensions
  • Automatic load balancing
  • Non-blocking collectives
  • Checkpoint/restart mechanism
  • Multi-module programming
  • Handling global and static variables

14
Automatic Load Balancing
  • Automatic load balancing: MPI_Migrate()
  • Collective call informing the load balancer that
    the thread is ready to be migrated, if needed
    (see the sketch below)
  • Link-time flag -memory isomalloc makes heap-data
    migration transparent
  • Special memory allocation mode, giving allocated
    memory the same virtual address on all processors
  • Ideal on 64-bit machines
  • But granularity: 16 KB per allocation
  • Alternative: PUPer
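A minimal sketch of the usage pattern, assuming an AMPI build (MPI_Migrate() is an AMPI extension, not standard MPI); do_work and the step counts are illustrative. Link with -memory isomalloc so heap data migrates transparently.

#include <mpi.h>

void do_work(int step) { /* per-step computation (placeholder) */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    for (int step = 0; step < 1000; ++step) {
        do_work(step);
        // Collective AMPI call: every 100 steps, tell the load balancer
        // this thread is ready to be migrated if needed.
        if (step % 100 == 0)
            MPI_Migrate();
    }
    MPI_Finalize();
    return 0;
}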

15
Asynchronous Collectives
  • Our implementation is asynchronous
  • Collective operation posted
  • Test/wait for its completion
  • Meanwhile, useful computation can utilize the CPU
    (see the sketch below)
  • MPI_Ialltoall(..., &req);
  • /* other computation */
  • MPI_Wait(&req, &status);
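Expanded into a compilable C++ sketch, assuming the AMPI extension takes the same argument list as the (later) standard MPI_Ialltoall; the buffer sizes and counts are illustrative.

#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int size;
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::vector<int> sendbuf(size), recvbuf(size);
    MPI_Request req;
    MPI_Status status;

    // Post the non-blocking all-to-all and return immediately.
    MPI_Ialltoall(sendbuf.data(), 1, MPI_INT,
                  recvbuf.data(), 1, MPI_INT,
                  MPI_COMM_WORLD, &req);

    /* other computation that does not touch the buffers */

    MPI_Wait(&req, &status);   // block only when the result is needed
    MPI_Finalize();
    return 0;
}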

16
Checkpoint/Restart Mechanism
  • Large-scale machines may suffer from failures
  • Checkpoint/restart mechanism
  • State of applications checkpointed to disk files
  • Capable of restarting on a different # of PEs
  • Facilitates future efforts on fault tolerance
  • In-disk: MPI_Checkpoint(DIRNAME) (see the sketch
    below)
  • In-memory: MPI_MemCheckpoint(void)
  • Checkpoint/restart and load-balancing have been
    recently applied in CSAR's Rocstar
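A minimal sketch of periodic in-disk checkpointing, assuming an AMPI build; compute_step, the interval, and the directory name are illustrative, and the char* argument form is inferred from MPI_Checkpoint(DIRNAME) above.

#include <mpi.h>

void compute_step(int step) { /* application work (placeholder) */ }

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    for (int step = 0; step < 10000; ++step) {
        compute_step(step);
        if (step % 500 == 0) {
            // AMPI extension: collectively write the application state to
            // disk; a later run (possibly on a different # of PEs) can
            // restart from this checkpoint.
            MPI_Checkpoint("ckpt_dir");
        }
    }
    MPI_Finalize();
    return 0;
}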

17
Interoperability with Charm++
  • Charm++ has a collection of support libraries
  • We can make use of them by running Charm++ code
    in AMPI programs
  • Also, we can run AMPI code in Charm++ programs

18
Handling Global and Static Variables
  • Global and static variables are not thread-safe
  • Can we switch those variables when we switch
    threads?
  • Globals: Executable and Linking Format (ELF)
  • Executable has a Global Offset Table containing
    global data
  • GOT pointer stored in the ebx register
  • Switch this pointer when switching between
    threads
  • Supported on Linux, Solaris 2.x, and more
  • Integrated in Charm++/AMPI
  • Invoked by the compile-time option -swapglobals
  • Statics in C codes: __thread privatizes them
    (see the sketch below)
  • Requires linking with the pthreads library
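A short sketch of the __thread approach for statics (a GCC/Clang extension; the variable and function names are illustrative). Compile and link with pthreads, as noted above.

#include <cstdio>

// Each thread gets its own copy of this former static variable.
static __thread int call_count = 0;   // was: static int call_count = 0;

void record_call() {
    ++call_count;                      // updates this thread's private copy
    std::printf("calls so far in this thread: %d\n", call_count);
}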

19
Programmer Productivity
  • Take away generality
  • Program in small, uncomplicated languages
  • Incomplete
  • Fit for specific types of programs
  • Inter-operable Modules
  • Common ARTS (Adaptive Run-Time System)

20
Charisma
  • Static data flow
  • Suffices for a number of applications
  • Molecular dynamics, FEM, PDEs, etc.
  • Global data and control flow are explicit
  • Unlike Charm++

21
The Charisma Programming Model
  • Arrays of objects
  • Global parameter space (PS)
  • Objects read from and write into the PS
  • Clean division between
  • Parallel (orchestration) code
  • Sequential methods

22
Example Stencil Computation
foreach x,y in workers
  (lb[x,y], rb[x,y], ub[x,y], db[x,y]) <- workers[x,y].produceBorders();
end-foreach
23
Language Features
  • Communication patterns
  • P2P, Multicast, Scatter, Gather
  • Determinism
  • Methods invoked on objects in Program Order
  • Support for libraries
  • Use external libraries
  • Create your own

24
However...
  • Charisma is no panacea
  • Only static dataflow
  • Data-dependent dataflow cannot be expressed

25
Shared Address Spaces
  • Shared Memory programming
  • Easier to program in (sometimes)
  • Cache coherence (hurts performance)
  • Consistency, data races (productivity suffers)
  • MSA: Multiphase Shared Arrays
  • SM programs access data differently in various
    phases

26
Matrix-Matrix Multiply
(Diagram: matrix-matrix multiply with matrices A, B and
product C, accessed with different patterns in different
phases.)
27
MSA Access modes
  • SAS is part of the programming model
  • Support some access modes efficiently
  • Read-only
  • Write
  • Accumulate

28
MSA MMM Code
typedef MSA2D<double, MSA_NullA<double>, MSA_ROW_MAJOR> MSA2DRowMjr;
typedef MSA2D<double, MSA_SumA<double>, MSA_COL_MAJOR> MSA2DColMjr;

MSA2DRowMjr arr1(ROWS1, COLS1, NUMWORKERS, cacheSize1);   // row major
MSA2DColMjr arr2(ROWS2, COLS2, NUMWORKERS, cacheSize2);   // column major
MSA2DRowMjr prod(ROWS1, COLS2, NUMWORKERS, cacheSize3);   // product matrix

for (unsigned int c = 0; c < COLS2; c++) {
  // Each thread computes a subset of rows of the product matrix
  for (unsigned int r = rowStart; r < rowEnd; r++) {
    double result = 0.0;
    for (unsigned int k = 0; k < COLS1; k++)
      result += arr1[r][k] * arr2[k][c];
    prod[r][c] = result;
  }
}
prod.sync();
29
Future Work
  • Compiler-directed optimizations
  • Performance and Productivity analyses
  • Cross-module data exchange