Title: SHMEM Programming Model
1 SHMEM Programming Model
- Hung-Hsun Su
- UPC Group, HCS lab
- 1/23/2004
2 Outline
- Background
- Nuts and Bolts
- GPSHMEM
- Performance
- Conclusion
- Reference
3 Background: What is SHMEM?
- SHared MEMory library
- Based on SPMD model
- Available for C / Fortran
- Hybrid Message Passing / Shared Memory Programming Model
  - Message passing like
    - Explicit communication, replication and synchronization
    - Specification of remote data location (processor id) is required
  - Shared memory like
    - Provides logically shared memory system view
    - Communication requires a processor on one side only
- Allows any processing element (PE) to access memory in a remote PE without involving the microprocessor on the remote PE (put / get)
- Non-blocking data transfer
4 Background: What is SHMEM?
- Must know the address of a variable on the remote processor for transfer; it is the same on all PEs
- Remotely accessible data objects (symmetric variables)
  - Global variables
  - Local static variables
  - Variables in common blocks
  - Fortran variables modified by a !DIR$ SYMMETRIC directive
  - C variables modified by a #pragma symmetric directive
5 Background: Why program in SHMEM?
- Easier to program in than MPI / PVM
- Low latency, high bandwidth data transfer
  - Puts
  - Gets
- Provides efficient collective communication
  - Gather / Scatter
  - All-to-all
  - Broadcast
  - Reductions
- Provides mechanisms to implement mutual exclusion
  - Atomic swap
  - Locking
- Provides synchronization mechanisms
6 Background: Supported Platforms
- SHMEM
  - Cray T3D, T3E, PVP
  - SGI Irix, Origin
  - Compaq SC
  - IBM SP
  - Quadrics Linux Cluster
  - SCI (?)
- GPSHMEM (Version 1.0)
  - IBM SP
  - SGI Origin
  - Cray J90, T3E
  - Unix/Linux
  - Windows NT
  - Myrinet (?)
7 Nuts and Bolts: Initialization
- Include header shmem.h / shmem.fh to access the library
- shmem_init(): initializes SHMEM
- my_pe(): gets the PE ID of the local processor
- num_pes(): gets the total number of PEs in the system

    #include <stdio.h>
    #include <stdlib.h>
    #include "shmem.h"

    int main(int argc, char **argv)
    {
        int me, npes;

        shmem_init();        /* must come before any other SHMEM call */
        me = my_pe();        /* ID of this PE */
        npes = num_pes();    /* total number of PEs */
        printf("Hello World from process %d of %d\n", me, npes);
        exit(0);
    }
8 Nuts and Bolts: Data Transfer
- Put
  - Specific variable
    - void shmem_TYPE_p(TYPE *addr, TYPE value, int pe)
    - TYPE: double, float, int, long, short
  - Contiguous object
    - void shmem_put(void *target, const void *source, size_t len, int pe)
    - void shmem_TYPE_put(TYPE *target, const TYPE *source, size_t len, int pe)
    - TYPE: double, float, int, long, longdouble, longlong, short
    - void shmem_putSS(void *target, const void *source, size_t len, int pe)
    - Storage Size (SS): 32, 64 (default), 128, mem (any size)
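The put calls above can be sketched as a ring exchange in which every PE writes into its right-hand neighbour. This is a minimal sketch under stated assumptions, not a definitive implementation: it assumes an OpenSHMEM-style library (shmem_my_pe, shmem_n_pes and shmem_finalize are not named on the slides), the USE_SHMEM macro and the function ring_put_demo are hypothetical, and the single-PE stand-ins exist only so the sketch builds without a SHMEM installation.

```c
#include <stdio.h>
#include <string.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins so the sketch builds without SHMEM. */
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_barrier_all(void) {}
static void shmem_long_put(long *dest, const long *src, size_t n, int pe)
{ (void)pe; memmove(dest, src, n * sizeof *dest); }
static void shmem_int_p(int *addr, int value, int pe)
{ (void)pe; *addr = value; }
#endif

#define N 4

/* File-scope objects are symmetric: they exist at the same address on every PE. */
static long inbox[N];
static int  sender = -1;

/* Each PE writes N values plus its own ID into its right-hand neighbour,
   then returns the sum of the values it received. */
long ring_put_demo(void)
{
    long outbox[N];   /* the source of a put may be ordinary local memory */

    shmem_init();
    int me = shmem_my_pe();
    int right = (me + 1) % shmem_n_pes();

    for (int i = 0; i < N; i++)
        outbox[i] = 100L * me + i;

    shmem_long_put(inbox, outbox, N, right);  /* contiguous object */
    shmem_int_p(&sender, me, right);          /* single variable */

    shmem_barrier_all();  /* all puts are complete once the barrier returns */

    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += inbox[i];
    printf("PE %d received from PE %d, sum = %ld\n", me, sender, sum);

    shmem_finalize();
    return sum;
}
```

Only the target of a put has to be symmetric; the remote PE makes no matching call, which is the one-sided property described earlier.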
9 Nuts and Bolts: Data Transfer
- Get
  - Specific variable
    - TYPE shmem_TYPE_g(TYPE *addr, int pe)
    - TYPE: double, float, int, long, short
  - Contiguous object
    - void shmem_get(void *target, const void *source, size_t len, int pe)
    - void shmem_TYPE_get(TYPE *target, const TYPE *source, size_t len, int pe)
    - TYPE: double, float, int, long, longdouble, longlong, short
    - void shmem_getSS(void *target, const void *source, size_t len, int pe)
    - Storage Size (SS): 32, 64 (default), 128, mem (any size)
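A get is the mirror image of a put: the calling PE reads from a symmetric object on a remote PE. The sketch below is illustrative only and assumes an OpenSHMEM-style library; the USE_SHMEM macro, the single-PE stand-ins and the function ring_get_demo are hypothetical names introduced for the example.

```c
#include <stdio.h>
#include <string.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins so the sketch builds without SHMEM. */
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_barrier_all(void) {}
static void shmem_long_get(long *dest, const long *src, size_t n, int pe)
{ (void)pe; memmove(dest, src, n * sizeof *dest); }
static long shmem_long_g(const long *addr, int pe)
{ (void)pe; return *addr; }
#endif

#define N 4

static long table[N];   /* symmetric: other PEs read it remotely */

/* Each PE fills its own table, then reads the table of its left-hand
   neighbour. Returns the sum of the values fetched. */
long ring_get_demo(void)
{
    long copy[N];       /* local destination; need not be symmetric */

    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();
    int left = (me + npes - 1) % npes;

    for (int i = 0; i < N; i++)
        table[i] = 10L * (me + 1) + i;
    shmem_barrier_all();   /* every table is filled before anyone reads it */

    shmem_long_get(copy, table, N, left);            /* contiguous object */
    long last = shmem_long_g(&table[N - 1], left);   /* single variable */

    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += copy[i];
    printf("PE %d read from PE %d: sum = %ld, last = %ld\n", me, left, sum, last);

    shmem_finalize();
    return sum;
}
```

The barrier before the reads matters: without it a PE could fetch a neighbour's table before that neighbour has filled it.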
10 Nuts and Bolts: Collective Communication
- Broadcast
  - void shmem_broadcast(void *target, void *source, int nlong, int PE_root, int PE_start, int PE_group, int PE_size, long *pSync)
  - One-to-all communication
- Collection
  - void shmem_collect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)
  - void shmem_fcollect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)
  - Concatenates data items from the source array into the target array over the defined set of PEs. The resultant target array consists of the contribution from the 1st PE, followed by that of the 2nd PE, etc.
pSync: symmetric work array. Every element of this array must be initialized with the value _SHMEM_SYNC_VALUE before any of the PEs in the active set enter the routine. Used to prevent overlapping collective communication.
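The pSync requirement is easiest to see in a complete call. The following is a minimal sketch, assuming an OpenSHMEM-style library in which the 64-bit broadcast is spelled shmem_broadcast64 and the constants are SHMEM_BCAST_SYNC_SIZE and SHMEM_SYNC_VALUE (older Cray headers prefix them with an underscore). The USE_SHMEM macro, the stand-ins with their placeholder constant values, and the function broadcast_demo are hypothetical.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins; the constant values are placeholders. */
#define SHMEM_BCAST_SYNC_SIZE 1
#define SHMEM_SYNC_VALUE      (-1L)
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_barrier_all(void) {}
/* With one PE there are no non-root PEs, so nothing is delivered. */
static void shmem_broadcast64(void *dest, const void *src, size_t n, int root,
                              int start, int log_stride, int size, long *psync)
{
    (void)dest; (void)src; (void)n; (void)root;
    (void)start; (void)log_stride; (void)size; (void)psync;
}
#endif

#define N 4

static long    pSync[SHMEM_BCAST_SYNC_SIZE];   /* symmetric work array */
static int64_t src[N], dst[N];                 /* symmetric source and target */

/* PE 0 broadcasts N 64-bit values to all PEs; returns the sum seen locally. */
int64_t broadcast_demo(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < SHMEM_BCAST_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;
    if (me == 0)
        for (int i = 0; i < N; i++)
            src[i] = 7 + i;
    shmem_barrier_all();   /* pSync is initialised on every PE before first use */

    /* Root is PE 0; the active set is all PEs (start 0, stride 2^0, npes members). */
    shmem_broadcast64(dst, src, N, 0, 0, 0, npes, pSync);
    if (me == 0)
        memcpy(dst, src, sizeof dst);   /* the root's own target is not written */

    int64_t sum = 0;
    for (int i = 0; i < N; i++)
        sum += dst[i];
    printf("PE %d: sum = %lld\n", me, (long long)sum);

    shmem_finalize();
    return sum;
}
```

The slide's PE_group argument corresponds to the log2 stride between members of the active set, so 0 selects consecutive PEs.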
11 Nuts and Bolts: Synchronization
- Barrier
  - void shmem_barrier_all(void)
    - Suspends all operations until all PEs call this function
  - void shmem_barrier(int PE_start, int PE_group, int PE_size, long *pSync)
    - Barrier operation on a subset of PEs
- Wait
  - Suspends until a remote PE writes a value NOT equal to the one specified
  - void shmem_wait(long *var, long value)
  - void shmem_TYPE_wait(TYPE *var, TYPE value)
  - TYPE: int, long, longlong, short
- Conditional wait
  - Same as wait except the comparison can now be >=, >, =, !=, <, <=
  - void shmem_wait_until(long *var, int cond, long value)
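Point-to-point synchronization with a wait can be sketched as a flag that a neighbour sets remotely. This is an illustrative sketch, assuming an OpenSHMEM-style library where the typed conditional wait is shmem_long_wait_until and the comparison constant is SHMEM_CMP_NE. The USE_SHMEM macro, the stand-ins (which handle only the one comparison used here) and the function signal_demo are hypothetical.

```c
#include <stdio.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins; the constant value is a placeholder. */
#define SHMEM_CMP_NE 1
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_long_p(long *addr, long value, int pe)
{ (void)pe; *addr = value; }
/* Handles only the "not equal" comparison used below. */
static void shmem_long_wait_until(long *var, int cmp, long value)
{ (void)cmp; while (*(volatile long *)var == value) { } }
#endif

static long flag = 0;   /* symmetric; 0 means "nothing received yet" */

/* Each PE signals its right-hand neighbour, then blocks until its own flag
   has been changed by its left-hand neighbour. Returns the value received. */
long signal_demo(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int right = (me + 1) % shmem_n_pes();

    shmem_long_p(&flag, me + 1, right);             /* remote write of the flag */
    shmem_long_wait_until(&flag, SHMEM_CMP_NE, 0);  /* wait until flag != 0 */

    long got = flag;
    printf("PE %d was signalled with value %ld\n", me, got);

    shmem_finalize();
    return got;
}
```

Because the write is one-sided, every PE can signal first and wait second without risk of deadlock.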
12 Nuts and Bolts: Synchronization
- Fence
  - All put operations issued to a particular PE prior to the call are guaranteed to be delivered before any subsequent remote write operation to the same PE that follows the call
  - Ensures ordering of remote write (put) operations
- Quiet
  - Waits for completion of all outstanding remote writes initiated from the calling PE
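A common use of fence is to send data followed by a "ready" flag, so that the receiver never sees the flag before the data. The sketch below shows that pattern under stated assumptions: an OpenSHMEM-style library (shmem_fence, shmem_quiet, shmem_long_wait_until, SHMEM_CMP_EQ), a hypothetical USE_SHMEM macro with single-PE stand-ins, and a hypothetical function name ordered_put_demo.

```c
#include <stdio.h>
#include <string.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins; the constant value is a placeholder. */
#define SHMEM_CMP_EQ 0
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_fence(void) {}
static void shmem_quiet(void) {}
static void shmem_long_put(long *dest, const long *src, size_t n, int pe)
{ (void)pe; memmove(dest, src, n * sizeof *dest); }
static void shmem_long_p(long *addr, long value, int pe)
{ (void)pe; *addr = value; }
/* Handles only the "equal" comparison used below. */
static void shmem_long_wait_until(long *var, int cmp, long value)
{ (void)cmp; while (*(volatile long *)var != value) { } }
#endif

#define N 4

static long payload[N];   /* symmetric data buffer */
static long ready = 0;    /* symmetric flag: 1 once payload is valid */

/* Each PE sends data and then a flag to its right-hand neighbour, and reads
   its own payload only after its own flag is set. Returns the payload sum. */
long ordered_put_demo(void)
{
    long data[N];

    shmem_init();
    int me = shmem_my_pe();
    int right = (me + 1) % shmem_n_pes();

    for (int i = 0; i < N; i++)
        data[i] = me + i + 1;

    shmem_long_put(payload, data, N, right);  /* 1: the data */
    shmem_fence();                            /* data is delivered before the flag */
    shmem_long_p(&ready, 1, right);           /* 2: the flag */
    shmem_quiet();                            /* both remote writes have completed */

    shmem_long_wait_until(&ready, SHMEM_CMP_EQ, 1);
    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += payload[i];
    printf("PE %d: payload sum = %ld\n", me, sum);

    shmem_finalize();
    return sum;
}
```

Fence only orders writes to the same target PE; quiet is the stronger call, and is the one to use when the sender itself needs to know that everything has arrived.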
13 Nuts and Bolts: Atomic Operations
- Atomic swap
  - Unconditional
    - long shmem_swap(long *target, long value, int pe)
  - Conditional
    - int shmem_int_cswap(int *target, int cond, int value, int pe)
- Arithmetic
  - add, increment
  - int shmem_int_fadd(int *target, int value, int pe)
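The atomic calls can be sketched as a ticket counter and a first-come leader election, both hosted on PE 0. This is a sketch under stated assumptions rather than a reference implementation: it uses the shmem_int_fadd and shmem_int_cswap names from the slide (newer OpenSHMEM releases deprecate them in favour of differently named routines), while the USE_SHMEM macro, the single-PE stand-ins and the function ticket_demo are hypothetical.

```c
#include <stdio.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins so the sketch builds without SHMEM. */
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static void shmem_barrier_all(void) {}
static int shmem_int_fadd(int *target, int value, int pe)
{ (void)pe; int old = *target; *target += value; return old; }
static int shmem_int_cswap(int *target, int cond, int value, int pe)
{ (void)pe; int old = *target; if (old == cond) *target = value; return old; }
#endif

static int next_ticket = 0;   /* symmetric; the copy on PE 0 is the shared counter */
static int leader = -1;       /* symmetric; -1 means "not yet claimed" */

/* Every PE draws a unique ticket from PE 0 and tries to become leader.
   Returns the ticket drawn by the calling PE. */
int ticket_demo(void)
{
    shmem_init();
    int me = shmem_my_pe();

    /* Fetch-and-add: returns the old value, so no two PEs get the same ticket. */
    int ticket = shmem_int_fadd(&next_ticket, 1, 0);

    /* Compare-and-swap: only the first PE to find -1 installs its own ID. */
    int previous = shmem_int_cswap(&leader, -1, me, 0);
    int won = (previous == -1);

    shmem_barrier_all();   /* all PEs have finished their atomic updates */
    printf("PE %d: ticket %d%s\n", me, ticket, won ? " (leader)" : "");

    shmem_finalize();
    return ticket;
}
```

The conditional swap is the building block for the mutual exclusion mentioned earlier: a lock is acquired by swapping in an owner ID only if the slot is still free.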
14 Nuts and Bolts: Collective Reduction
- Collective logical operations
  - and, or, xor
  - void shmem_int_and_to_all(int *target, int *source, int nreduce, int PE_start, int PE_group, int PE_size, int *pWrk, long *pSync)
- Collective comparison operations
  - max, min
  - void shmem_double_max_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)
- Collective arithmetic operations
  - product, sum
  - void shmem_double_prod_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)
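All reductions share the same calling pattern, shown here with an integer sum. The sketch assumes an OpenSHMEM-style library that provides shmem_int_sum_to_all together with the constants SHMEM_REDUCE_SYNC_SIZE, SHMEM_REDUCE_MIN_WRKDATA_SIZE and SHMEM_SYNC_VALUE. The USE_SHMEM macro, the stand-ins with their placeholder constant values, and the function sum_demo are hypothetical.

```c
#include <stdio.h>
#include <string.h>

#ifdef USE_SHMEM
#include <shmem.h>   /* real library; OpenSHMEM-style names are an assumption */
#else
/* Hypothetical single-PE stand-ins; the constant values are placeholders. */
#define SHMEM_REDUCE_SYNC_SIZE        1
#define SHMEM_REDUCE_MIN_WRKDATA_SIZE 1
#define SHMEM_SYNC_VALUE              (-1L)
static void shmem_init(void) {}
static void shmem_finalize(void) {}
static int  shmem_my_pe(void) { return 0; }
static int  shmem_n_pes(void) { return 1; }
static void shmem_barrier_all(void) {}
/* With one PE the sum over all PEs is just the local contribution. */
static void shmem_int_sum_to_all(int *dest, const int *src, int nreduce,
                                 int start, int log_stride, int size,
                                 int *pwrk, long *psync)
{
    (void)start; (void)log_stride; (void)size; (void)pwrk; (void)psync;
    memmove(dest, src, (size_t)nreduce * sizeof *dest);
}
#endif

#define N 2

static int  contrib[N], total[N];                     /* symmetric source and target */
static int  pWrk[SHMEM_REDUCE_MIN_WRKDATA_SIZE + N];  /* symmetric scratch space */
static long pSync[SHMEM_REDUCE_SYNC_SIZE];            /* symmetric sync array */

/* Element-wise sum over all PEs; every PE receives the result in total[].
   Returns total[0], which equals 1 + 2 + ... + npes. */
int sum_demo(void)
{
    shmem_init();
    int me = shmem_my_pe();
    int npes = shmem_n_pes();

    for (int i = 0; i < SHMEM_REDUCE_SYNC_SIZE; i++)
        pSync[i] = SHMEM_SYNC_VALUE;
    contrib[0] = me + 1;
    contrib[1] = 10;
    shmem_barrier_all();   /* pSync and contrib are ready on every PE */

    /* Active set: all PEs (start 0, stride 2^0, npes members). */
    shmem_int_sum_to_all(total, contrib, N, 0, 0, npes, pWrk, pSync);

    printf("PE %d: total = {%d, %d}\n", me, total[0], total[1]);

    shmem_finalize();
    return total[0];
}
```

Sizing pWrk as the minimum work-array constant plus the element count is a deliberately generous choice for this sketch; the library's documentation gives the exact minimum.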
15 Nuts and Bolts: Other
- Address manipulation
  - shmem_ptr: returns a pointer to a data object on a remote PE
- Cache control
  - shmem_clear_cache_inv: disables automatic cache coherency mode
  - shmem_set_cache_inv: enables automatic cache coherency mode
  - shmem_set_cache_line_inv: enables automatic line cache coherency mode
  - shmem_udcflush: makes the entire user data cache coherent
  - shmem_udcflush_line: makes a cache line coherent
16 Nuts and Bolts: Example (Array copy)

    #include <stdio.h>
    #include <mpp/shmem.h>
    #include <intrinsics.h>

    int me, npes, i;
    int source[8], dest[8];   /* global, therefore symmetric */

    int main(void)
    {
        /* Get PE information */
        me = _my_pe();
        npes = _num_pes();

        /* Initialize and send on PE 1 */
        if (me == 1) {
            for (i = 0; i < 8; i++)
                source[i] = i + 1;
            /* len is counted in 64-bit words */
            shmem_put64(dest, source, 8 * sizeof(dest[0]) / 8, 0);
        }

        /* Make sure the transfer is complete */
        shmem_barrier_all();

        /* Print from the receiving PE */
        if (me == 0) {
            _shmem_udcflush();
            printf(" DEST ON PE 0:");
            for (i = 0; i < 8; i++)
                printf(" %d%c", dest[i], (i < 7) ? ',' : '\n');
        }
        return 0;
    }
17 GPSHMEM
- Ames Lab / Pacific Northwest National Lab collaborative project
- Communication library like the SHMEM library, but tries to achieve full portability
- Mostly the T3D components with some extensions of functionality
- Research quality at this point
ARMCI: A Portable Remote Memory Copy Library for Distributed Array Libraries and Compiler Run-time Systems
18 Performance: Latency (Origin 2000)
19 Performance: Latency (T3E 600)
20 Performance: Bandwidth
Taken from http://infm.cineca.it/documenti/incontro_infm/comunicazio/sld015.htm
21 Performance: Bandwidth
22 Performance: Broadcast
23 Performance: All-to-all
24 Performance: Ocean
On SGI Origin 2000
25 Performance: Radix
On SGI Origin 2000
26 Conclusion
- Hybrid MP / Shared Memory programming model
- Compared to MP
  - Pro
    - Easier to use
    - Lower latency, higher bandwidth communication
    - More scalable (within limits)
    - Remote CPU not interrupted during transfer
  - Con
    - Limited platform support (as of now)
27 Reference
- Ricky A. Kendall et al., GPSHMEM and other Parallel Programming Models, PowerPoint presentation
- Hongzhang Shan and Jaswinder Pal Singh, A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000, http://citeseer.nj.nec.com/rd/484183212C2963482C12C0.252CDownload/http://citeseer.nj.nec.com/cache/papers/cs/14068/httpzSzzSzwww.cs.princeton.eduzSz7EshzzSzpaperszSzics99.pdf/a-comparison-of-mpi.pdf
- Quadrics SHMEM Programming Manual, http://www.psc.edu/oneal/compaq/ShmemMan.pdf
- Karl Feind, Shared Memory Access (SHMEM) Routines
- Glenn Luecke et al., The Performance and Scalability of SHMEM and MPI-2 One-Sided Routines on a SGI Origin 2000 and a Cray T3E-600, http://dsg.port.ac.uk/Journals/PEMCS/papers/paper19.pdf
- Patrick H. Worley, CCSM Component Performance Benchmarking and Status of the CRAY X1 at ORNL, http://www.csm.ornl.gov/worley/talks/index.html