SHMEM Programming Model - PowerPoint PPT Presentation

About This Presentation

Title:

SHMEM Programming Model

Description:

Nuts & Bolts. Collective Communication. Broadcast ... A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming ... – PowerPoint PPT presentation

Slides: 28

Provided by: burtg

Transcript and Presenter's Notes

1
SHMEM Programming Model

Hung-Hsun Su
UPC Group, HCS lab
1/23/2004

2
Outline

Background
Nuts and Bolts
GPSHMEM
Performance
Conclusion
Reference

3
Background: What is SHMEM?

SHared MEMory library
Based on SPMD model
Available for C / Fortran
Hybrid Message Passing / Shared Memory
Programming Model
Message Passing Like
Explicit communication, replication and
synchronization
Specification of remote data location (processor
id) is required
Shared Memory like
Provides a logically shared memory system view
Communication requires a processor on one side only
Allows any processing element (PE) to access
memory in a remote PE without involving the
microprocessor on the remote PE (put / get)
Non-blocking data transfer

4
Background: What is SHMEM?

Must know the address of a variable on the remote
processor for a transfer
The address is the same on all PEs
Remotely accessible data objects (Symmetric
Vars.)
Global variables
Local static variables
Variables in common blocks
Fortran variables modified by a !DIR$ SYMMETRIC
directive
C variables modified by a #pragma symmetric
directive

5
Background: Why program in SHMEM?

Easier to program in than MPI / PVM
Low latency, high bandwidth data transfer
Puts
Gets
Provide efficient collective communication
Gather / Scatter
All-to-all
Broadcast
Reductions
Provide mechanisms to implement mutual exclusion
Atomic swap
Locking
Provide synchronization mechanisms
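
As a rough illustration of how an atomic swap can provide mutual exclusion, the sketch below builds a try-lock from C11 atomic_exchange. It is a shared-memory analogy running in one process, not SHMEM code: the swap_lock type and the swap_lock_* names are hypothetical, and a real SHMEM program would swap on a symmetric variable owned by a chosen PE or use the library's own locking routines.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical lock word for this sketch: 0 = free, 1 = held. */
typedef struct { atomic_long word; } swap_lock;

/* Try once to take the lock. Returns 1 on success, 0 if it was held. */
int swap_lock_try(swap_lock *l)
{
    assert(l);
    /* Atomically write 1 and inspect the value that was there before. */
    return atomic_exchange(&l->word, 1L) == 0L;
}

/* Spin until the lock is acquired. */
void swap_lock_acquire(swap_lock *l)
{
    while (!swap_lock_try(l)) {
        /* busy-wait until the holder releases the lock */
    }
}

/* Release the lock by storing 0 back into the lock word. */
void swap_lock_release(swap_lock *l)
{
    assert(l);
    atomic_store(&l->word, 0L);
}
```

Only the caller that observes the old value 0 enters the critical section; every other caller sees 1 and retries.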

6
Background: Supported Platforms

SHMEM
Cray T3D, T3E, PVP
SGI Irix, Origin
Compaq SC
IBM SP
Quadrics Linux Cluster
SCI (?)
GPSHMEM (Version 1.0)
IBM SP
SGI Origin
Cray J90, T3E
Unix/Linux
Windows NT
Myrinet (?)

7
Nuts & Bolts: Initialization

Include header shmem.h / shmem.fh to access the
library
shmem_init(): initializes SHMEM
my_pe(): gets the PE ID of the local processor
num_pes(): gets the total number of PEs in the
system

#include <stdio.h>
#include <stdlib.h>
#include "shmem.h"

int main(int argc, char **argv)
{
    /* Named me / npes so they do not shadow my_pe() and num_pes() */
    int me, npes;

    shmem_init();
    me = my_pe();
    npes = num_pes();
    printf("Hello World from process %d of %d\n", me, npes);
    exit(0);
}

8
Nuts & Bolts: Data Transfer

Put
Specific Variable
void shmem_TYPE_p(TYPE *addr, TYPE value, int pe)
TYPE: double, float, int, long, short
Contiguous Object
void shmem_put(void *target, const void *source, size_t len, int pe)
void shmem_TYPE_put(TYPE *target, const TYPE *source, size_t len, int pe)
TYPE: double, float, int, long, longdouble,
longlong, short
void shmem_putSS(void *target, const void *source, size_t len, int pe)
Storage Size (SS): 32, 64 (default), 128, mem
(any size)

9
Nuts & Bolts: Data Transfer

Get
Specific Variable
TYPE shmem_TYPE_g(TYPE *addr, int pe)
TYPE: double, float, int, long, short
Contiguous Object
void shmem_get(void *target, const void *source, size_t len, int pe)
void shmem_TYPE_get(TYPE *target, const TYPE *source, size_t len, int pe)
TYPE: double, float, int, long, longdouble,
longlong, short
void shmem_getSS(void *target, const void *source, size_t len, int pe)
Storage Size (SS): 32, 64 (default), 128, mem
(any size)
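
The one-sided put / get semantics can be pictured with a small single-process model. This is only a sketch under stated assumptions: each PE's copy of a symmetric array is represented by one row of a 2-D array, and model_put, model_get, MODEL_NPES and MODEL_LEN are hypothetical names, not part of the SHMEM library. The point it illustrates is that the same symmetric offset names a different copy on each PE, and that only the calling side takes part in the transfer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NPES 4   /* hypothetical number of PEs in the model */
#define MODEL_LEN  8   /* elements in the modelled symmetric array */

/* One copy of the "symmetric" array per PE; the offset is the same everywhere. */
static long model_sym[MODEL_NPES][MODEL_LEN];

/* Model of a put: copy len elements from the caller's local buffer into
   the symmetric array on PE `pe`, starting at element `offset`. */
void model_put(size_t offset, const long *source, size_t len, int pe)
{
    assert(pe >= 0 && pe < MODEL_NPES && offset + len <= MODEL_LEN);
    memcpy(&model_sym[pe][offset], source, len * sizeof(long));
}

/* Model of a get: copy len elements from the symmetric array on PE `pe`
   into the caller's local buffer. */
void model_get(long *target, size_t offset, size_t len, int pe)
{
    assert(pe >= 0 && pe < MODEL_NPES && offset + len <= MODEL_LEN);
    memcpy(target, &model_sym[pe][offset], len * sizeof(long));
}
```

In this model, a put to PE 1 followed by a get from PE 1 returns the written data, while a get of the same offset from PE 0 still returns PE 0's own, unchanged copy.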

10
Nuts & Bolts: Collective Communication

Broadcast
void shmem_broadcast(void *target, void *source, int nlong, int PE_root, int PE_start, int PE_group, int PE_size, long *pSync)
One-to-all communication
Collection
void shmem_collect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)
void shmem_fcollect(void *target, void *source, int nlong, int PE_start, int PE_group, int PE_size, long *pSync)
Concatenates data items from the source array
into the target array over the defined set of
PEs. The resultant target array consists of the
contribution from the 1st PE, followed by that of
the 2nd PE, etc.
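
The layout of the collected target array can be sketched as a plain concatenation in PE order. The function below is a single-process model with hypothetical names (model_collect, source[], counts[]); it does not call the library. It assumes each PE may contribute a different number of elements, as with the collect variant. For the fixed-size variant every entry of counts would be the same.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Concatenate contributions in PE order.
   source[p] points to the counts[p] elements contributed by PE p.
   Returns the total number of elements written to target. */
size_t model_collect(int *target, const int *source[],
                     const size_t counts[], int npes)
{
    size_t written = 0;

    assert(npes >= 0);
    for (int p = 0; p < npes; p++) {
        /* PE p's data lands right after the data of PEs 0..p-1. */
        memcpy(target + written, source[p], counts[p] * sizeof(int));
        written += counts[p];
    }
    return written;
}
```

With three PEs contributing {1, 2}, {3} and {4, 5, 6}, the target holds 1, 2, 3, 4, 5, 6.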

pSync: symmetric work array. Every element of
this array must be initialized with the value
_SHMEM_SYNC_VALUE before any of the PEs in the
active set enter the routine. Used to prevent
overlapping collective communication.

11
Nuts & Bolts: Synchronization

Barrier
void shmem_barrier_all(void)
Suspends execution until all PEs have called this
function
void shmem_barrier(int PE_start, int PE_group, int PE_size, long *pSync)
Barrier operation on a subset of PEs
Wait
Suspends until a remote PE writes a value NOT
equal to the one specified
void shmem_wait(long *var, long value)
void shmem_TYPE_wait(TYPE *var, TYPE value)
TYPE: int, long, longlong, short
Conditional Wait
Same as wait, except the comparison can now be >=,
>, ==, !=, <, or <=
void shmem_wait_until(long *var, int cond, long value)
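
The difference between the plain wait and the conditional wait comes down to which comparison ends the wait. The sketch below models only that test. The model_cmp codes and function names are hypothetical; the real library defines its own comparison constants, and a real wait polls a symmetric variable that a remote PE updates.

```c
#include <assert.h>

/* Hypothetical comparison codes for this sketch only. */
enum model_cmp { MODEL_EQ, MODEL_NE, MODEL_GT, MODEL_GE, MODEL_LT, MODEL_LE };

/* Returns 1 when a conditional wait on `var` would stop waiting,
   i.e. when "var cond value" is true. */
int model_cond_holds(long var, enum model_cmp cond, long value)
{
    switch (cond) {
    case MODEL_EQ: return var == value;
    case MODEL_NE: return var != value;
    case MODEL_GT: return var >  value;
    case MODEL_GE: return var >= value;
    case MODEL_LT: return var <  value;
    case MODEL_LE: return var <= value;
    default:
        assert(0 && "unknown comparison code");
        return 0;
    }
}

/* The plain wait is the special case "stop once var != value". */
int model_wait_done(long var, long value)
{
    return model_cond_holds(var, MODEL_NE, value);
}
```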

12
Nuts & Bolts: Synchronization

Fence
All put operations issued to a particular PE
prior to call are guaranteed to be delivered
before any subsequent remote write operation to
the same PE which follows the call
Ensures ordering of remote write (put) operations
Quiet
Waits for completion of all outstanding remote
writes initiated from the calling PE

13
Nuts & Bolts: Atomic Operations

Atomic Swap
Unconditional
long shmem_swap(long *target, long value, int pe)
Conditional
int shmem_int_cswap(int *target, int cond, int value, int pe)
Arithmetic
add, increment
int shmem_int_fadd(int *target, int value, int pe)
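
The return-value conventions of these operations (each returns the previous contents of the target) can be illustrated with their C11 shared-memory counterparts. This is an analogy, not SHMEM code: the model_* names are hypothetical, there is no pe argument, and the target is an ordinary atomic object in the calling process rather than a symmetric variable on a remote PE.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Unconditional swap: store value, return the previous contents. */
long model_swap(atomic_long *target, long value)
{
    assert(target != NULL);
    return atomic_exchange(target, value);
}

/* Conditional swap: store value only if the target currently equals cond.
   The previous contents are returned in either case. */
int model_int_cswap(atomic_int *target, int cond, int value)
{
    int expected = cond;

    assert(target != NULL);
    /* On failure, `expected` is overwritten with the observed value;
       on success it already equals the old value (cond). */
    atomic_compare_exchange_strong(target, &expected, value);
    return expected;
}

/* Fetch-and-add: add value to the target, return the previous contents. */
int model_int_fadd(atomic_int *target, int value)
{
    assert(target != NULL);
    return atomic_fetch_add(target, value);
}
```

A caller can tell whether a conditional swap took effect by comparing the returned value with cond.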

14
Nuts & Bolts: Collective Reduction

Collective logical operations
and, or, xor
void shmem_int_and_to_all(int *target, int *source, int nreduce, int PE_start, int PE_group, int PE_size, int *pWrk, long *pSync)
Collective comparison operations
max, min
void shmem_double_max_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)
Collective arithmetic operations
product, sum
void shmem_double_prod_to_all(double *target, double *source, int nreduce, int PE_start, int PE_group, int PE_size, double *pWrk, long *pSync)
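
The reductions operate element by element: element i of the target combines element i of the source array from every PE in the active set. The sketch below models that in one process, with the PEs' source arrays stored as rows of a flat array. The names model_sum_to_all and model_max_to_all are hypothetical, and the active-set, pWrk and pSync arguments of the real routines are left out.

```c
#include <assert.h>
#include <stddef.h>

/* source holds npes rows of nreduce elements; row p is PE p's source array.
   target receives the nreduce element-wise sums. */
void model_sum_to_all(double *target, const double *source,
                      int nreduce, int npes)
{
    assert(nreduce >= 0 && npes > 0);
    for (int i = 0; i < nreduce; i++) {
        double acc = 0.0;
        for (int p = 0; p < npes; p++)
            acc += source[(size_t)p * nreduce + i];
        target[i] = acc;
    }
}

/* Same layout, element-wise maximum instead of sum. */
void model_max_to_all(double *target, const double *source,
                      int nreduce, int npes)
{
    assert(nreduce >= 0 && npes > 0);
    for (int i = 0; i < nreduce; i++) {
        double best = source[i];            /* PE 0's element i */
        for (int p = 1; p < npes; p++) {
            double v = source[(size_t)p * nreduce + i];
            if (v > best)
                best = v;
        }
        target[i] = best;
    }
}
```

In the real routines every PE in the active set ends up with the same target contents; the model computes that common result once.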

15
Nuts & Bolts: Other

Address Manipulation
shmem_ptr - Returns a pointer to a data object on
a remote PE
Cache Control
shmem_clear_cache_inv - Disables automatic cache
coherency mode
shmem_set_cache_inv - Enables automatic cache
coherency mode
shmem_set_cache_line_inv - Enables automatic line
cache coherency mode
shmem_udcflush - Makes the entire user data cache
coherent
shmem_udcflush_line - Makes coherent a cache line

16
Nuts & Bolts: Example (Array copy)

#include <stdio.h>
#include <mpp/shmem.h>
#include <intrinsics.h>

int me, npes, i;
int source[8], dest[8];

int main(void)
{
    /* Get PE information */
    me = _my_pe();
    npes = _num_pes();

    /* Initialize and send on PE 1 */
    if (me == 1) {
        for (i = 0; i < 8; i++)
            source[i] = i + 1;
        /* len is counted in 64-bit words */
        shmem_put64(dest, source, 8 * sizeof(dest[0]) / 8, 0);
    }

    /* Make sure the transfer is complete */
    shmem_barrier_all();

    /* Print from the receiving PE */
    if (me == 0) {
        _shmem_udcflush();
        printf(" DEST ON PE 0:");
        for (i = 0; i < 8; i++)
            printf(" %d%c", dest[i], (i < 7) ? ',' : '\n');
    }
    return 0;
}
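
The len argument in the array-copy example is a count of 64-bit words, not bytes, which is why the byte size of the array is divided by 8. The helper below is a hypothetical illustration of that arithmetic for the sized put / get variants. The name model_len_words is not a library function, and it assumes the buffer size is an exact multiple of the word size.

```c
#include <assert.h>
#include <stddef.h>

/* Number of ss_bits-wide words that make up nbytes bytes.
   ss_bits is the storage size (32, 64 or 128 bits). */
size_t model_len_words(size_t nbytes, unsigned ss_bits)
{
    size_t word_bytes = ss_bits / 8;

    assert(ss_bits == 32 || ss_bits == 64 || ss_bits == 128);
    assert(nbytes % word_bytes == 0);   /* whole words only */
    return nbytes / word_bytes;
}
```

For eight 8-byte elements (64 bytes) and a 64-bit storage size this gives a length of 8. The same eight elements stored as 4-byte values (32 bytes) would need a length of 4 with the 64-bit variant, or 8 with the 32-bit variant.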
17
GPSHMEM

Ames Lab / Pacific Northwest National Lab
collaborative project
Communication library like the SHMEM library, but
tries to achieve full portability
Mostly the T3D components with some extensions
of functionality
Research quality at this point

ARMCI: A Portable Remote Memory Copy Library for
Distributed Array Libraries and Compiler Run-time
Systems
18
Performance: Latency (Origin 2000)
19
Performance: Latency (T3E 600)
20
Performance: Bandwidth
Taken from http://infm.cineca.it/documenti/incontro_infm/comunicazio/sld015.htm
21
Performance: Bandwidth
22
Performance: Broadcast
23
Performance: All-to-all
24
Performance: Ocean
On SGI Origin 2000
25
Performance: Radix
On SGI Origin 2000
26
Conclusion

Hybrid MP / Shared Memory programming model
Compared to MP
Pro.
Easier to use
Lower latency, higher bandwidth communication
More scalable (within limit)
Remote CPU not interrupted during transfer
Con.
Limited platform support (as of now)

27
Reference

Ricky A. Kendall et al., GPSHMEM and other Parallel Programming Models (PowerPoint presentation)
Hongzhang Shan and Jaswinder Pal Singh, A Comparison of MPI, SHMEM and Cache-coherent Shared Address Space Programming Models on the SGI Origin2000, http://citeseer.nj.nec.com/rd/484183212C2963482C12C0.252CDownload/http://citeseer.nj.nec.com/cache/papers/cs/14068/httpzSzzSzwww.cs.princeton.eduzSz7EshzzSzpaperszSzics99.pdf/a-comparison-of-mpi.pdf
Quadrics SHMEM Programming Manual, http://www.psc.edu/oneal/compaq/ShmemMan.pdf
Karl Feind, Shared Memory Access (SHMEM) Routines
Glenn Luecke et al., The Performance and Scalability of SHMEM and MPI-2 One-Sided Routines on an SGI Origin 2000 and a Cray T3E-600, http://dsg.port.ac.uk/Journals/PEMCS/papers/paper19.pdf
Patrick H. Worley, CCSM Component Performance Benchmarking and Status of the CRAY X1 at ORNL, http://www.csm.ornl.gov/worley/talks/index.html