Title: Experiences with Sweep3D Implementations in Coarray Fortran
1. Experiences with Sweep3D Implementations in Co-array Fortran
- Cristian Coarfa, Yuri Dotsenko, John Mellor-Crummey
- Department of Computer Science
- Rice University
- Houston, TX, USA
2. Motivation
- Parallel Programming Models
- MPI: de facto standard
- difficult to program
- OpenMP: inefficient to map onto distributed-memory platforms
- lack of locality control
- HPF: hard to obtain high performance
- heroic compilers needed!
- An appealing middle ground: global address space languages
- CAF, Titanium, UPC
Evaluate CAF for an application with sophisticated parallelization: Sweep3D
3. Co-Array Fortran
- Global address space programming model
- one-sided communication (GET/PUT)
- Programmer has control over performance-critical factors
- data distribution
- computation partitioning
- communication placement
- Data movement and synchronization as language primitives
- amenable to compiler-based communication optimization
4. CAF Programming Model Features
- SPMD process images
- fixed number of images during execution
- images operate asynchronously
- Both private and shared data
- real :: x(20, 20) - a private 20x20 array in each image
- real :: y(20, 20)[*] - a shared 20x20 array in each image
- Simple one-sided shared-memory communication (sketched below)
- x(:, j:j+2) = y(:, p:p+2)[r] - copy columns from image r into local columns
- Synchronization intrinsic functions
- sync_all - a barrier and a memory fence
- sync_mem - a memory fence
- sync_team(team members to notify, team members to wait for)
- Pointers and (perhaps asymmetric) dynamic allocation
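A minimal sketch tying these features together, using the pre-Fortran-2008 CAF syntax above (the array sizes and the image index r are illustrative assumptions, not code from the Sweep3D implementation):

    program caf_features_sketch
      real :: x(20, 20)        ! private: each image has its own independent copy
      real :: y(20, 20)[*]     ! co-array: every image's copy is remotely addressable
      integer :: r, j, p

      y = real(this_image())   ! each image fills its shared array
      call sync_all()          ! barrier + memory fence: all images have written y

      r = 1; j = 1; p = 1                ! illustrative indices
      if (this_image() /= r) then
        x(:, j:j+2) = y(:, p:p+2)[r]     ! one-sided GET of three columns from image r
      end if
      call sync_all()
    end program caf_features_sketch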
5. One-sided Communication with Co-Arrays
[Figure: images 1, 2, ..., N each hold a co-array h; each image copies h from its left neighbor]
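As a concrete reading of the figure (the co-array name h and its size are assumptions), each image except the first could pull data from its left neighbor with a single co-array assignment:

    integer, parameter :: n = 100
    real :: h(n)[*]                     ! one instance of the co-array h per image
    ! ... each image initializes its own h ...
    call sync_all()                     ! ensure every neighbor has produced its h
    if (this_image() > 1) then
      h(:) = h(:)[this_image() - 1]     ! one-sided GET: copy h from the left neighbor
    end if
    call sync_all()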
6. Outline
- CAF programming model
- cafc
- Sweep3D implementations in CAF
- Experimental evaluation
- Conclusions
7. Rice Co-Array Fortran Compiler (cafc)
- First CAF multi-platform compiler
- previous compiler only for Cray shared-memory systems
- Implements the core of the language
- currently lacks support for derived-type and dynamic co-arrays
- core sufficient for non-trivial codes
- Performance comparable to that of hand-tuned MPI codes
- Open source
8. cafc Implementation Strategy
- Goals
- portability
- high performance on a wide range of platforms
- Source-to-source compilation of CAF codes
- uses the Open64/SL Fortran 90 infrastructure
- translates CAF into Fortran 90 plus communication operations
- Communication
- ARMCI library for one-sided communication on clusters (PNNL)
- load/store communication on shared-memory platforms
9. Synchronization
- Original CAF specification: team synchronization only
- sync_all, sync_team
- limits performance on loosely-coupled architectures
- Point-to-point extensions
- sync_notify(q)
- sync_wait(p)
- Point-to-point synchronization semantics (see the sketch below)
- delivery of a notify to q from p implies that all communication from p to q issued before the notify has been delivered to q
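A hedged sketch of how a producer-consumer pair might use these extensions (the image indices p and q, the buffer name, and the process() routine are illustrative placeholders, not the Sweep3D code):

    ! on image p (producer): q is the consumer's image index
    buf(1:n)[q] = local_data(1:n)   ! one-sided PUT into q's co-array buffer
    call sync_notify(q)             ! per the semantics above, the PUT is delivered
                                    ! before the notify reaches q

    ! on image q (consumer): p is the producer's image index
    call sync_wait(p)               ! block until p's notify (and thus its data) arrives
    call process(buf(1:n))          ! safe to read the buffer now (process() is a placeholder)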
10. CAF Compiler Targets (Oct 2004)
- Processors
- Pentium, Alpha, Itanium2, MIPS
- Interconnects
- Quadrics, Myrinet, Gigabit Ethernet, shared memory
- Operating systems
- Linux, Tru64, IRIX
11. Outline
- CAF programming model
- cafc
- Sweep3D implementations
- Original MPI implementation
- CAF versions
- Communication microbenchmark
- Experimental evaluation
- Conclusions
12. Sweep3D
- Core of an ASCI application
- Solves a
- one-group
- time-independent
- discrete ordinates (Sn)
- 3D Cartesian (XYZ) geometry
- neutron transport problem
- Deterministic particle transport accounts for 50-80% of the execution time of many realistic DOE simulations
13. Sweep3D Parallelization
2D spatial domain decomposition onto a 2D processor array
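For illustration only (a 4x4 processor array is an assumption), an image can derive its row and column in the 2D decomposition from its image index:

    integer :: npe_i, npe_j, me, my_row, my_col
    npe_i = 4                          ! assumed number of process rows
    npe_j = 4                          ! assumed number of process columns
    me     = this_image()              ! 1-based image index
    my_row = (me - 1) / npe_j + 1      ! this image's row in the processor array
    my_col = mod(me - 1, npe_j) + 1    ! this image's column in the processor array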
14. Sweep3D Parallelization
Wavefront parallelism
15-24. Sweep3D Parallelization (animation: the wavefront advances across the 2D processor array)
25. Sweep3D Kernel Pseudocode

    do iq = 1, 8
      do mo = 1, mmo
        do kk = 1, kb
          recv e/w into Phiib
          recv n/s into Phijb
          ... ! heavy computation with use/update of Phiib and Phijb
          send e/w Phiib
          send n/s Phijb
        enddo
      enddo
    enddo
26-28. Sweep3D Kernel Pseudocode (slide 25 repeated as animation frames)
29. Initial Sweep3D CAF Implementation
- Based on the MPI implementation
- Maintain the original computation
- Convert communication buffers into co-arrays
- Fundamental issue: converting two-sided communication into one-sided communication
30-37. Two-sided vs. One-sided Communication (animation)
[Figure: in the two-sided exchange, the sender calls MPI_Send and the receiver calls a matching MPI_Recv; in the one-sided exchange, the receiver issues sync_notify when its buffer is available, the sender issues sync_wait, performs a PUT into the receiver's buffer, and then issues sync_notify, for which the receiver issues sync_wait]
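To make the contrast concrete, here is a hedged sketch of one boundary exchange in each style (the buffer names, neighbor indices, and the notify protocol details are assumptions, not the actual Sweep3D source):

    ! two-sided (MPI): the sender and the receiver each make a matching call
    call MPI_Send(phiib, n, MPI_DOUBLE_PRECISION, east, tag, comm, ierr)          ! on the sender
    call MPI_Recv(phiib, n, MPI_DOUBLE_PRECISION, west, tag, comm, status, ierr)  ! on the receiver

    ! one-sided (CAF): only the sender moves data; notify/wait pair the two images
    call sync_wait(east)                ! sender: wait until east says its buffer is free
    phiib_buf(1:n)[east] = phiib(1:n)   ! sender: PUT into the neighbor's co-array buffer
    call sync_notify(east)              ! sender: signal that the data has arrived
    ! on image east: call sync_wait(west) before reading phiib_buf,
    !                and call sync_notify(west) once the buffer may be overwritten again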
38. CAF Implementation Issues
- Synchronization necessary to avoid data races may lead to inefficiency
- Using multiple communication buffers enables overlap of synchronization with computation
39. One- vs. Two-buffer Communication
[Figure: timelines of one-buffer vs. two-buffer communication between a source image and a destination image]
40. Asynchrony-tolerant CAF Implementation of Sweep3D
- Multiple-versioned communication buffers (see the buffer-rotation sketch below)
- Benefits
- overlap of the PUT with computation on the destination
- overlap of synchronization with computation on the source
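A minimal sketch of the buffer-rotation idea (the number of versions, the names, and the acknowledgment protocol are assumptions): the sender cycles through buffer versions and only waits for an acknowledgment when it is about to reuse one.

    integer, parameter :: n = 1000, NV = 3, nsteps = 100   ! NV buffer versions (assumed)
    real :: phiib(n)                     ! local data produced each step
    real :: phiib_buf(n, NV)[*]          ! multi-version co-array buffer on every image
    integer :: step, v, east             ! east = destination image index (assumed known)

    do step = 1, nsteps
      v = mod(step - 1, NV) + 1              ! rotate to the next buffer version
      if (step > NV) call sync_wait(east)    ! wait for the ack that freed version v
      phiib_buf(:, v)[east] = phiib(:)       ! PUT into version v on the destination
      call sync_notify(east)                 ! tell the destination version v is ready
      ! ... keep computing locally; the destination consumes version v meanwhile,
      !     then sends sync_notify back as its acknowledgment
    end do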
41. Three-buffer Communication
42. Communication Throughput Microbenchmark
- MPI implementation: blocking send and receive
- CAF: one-version buffer
- CAF: multi-version buffers
- ARMCI implementation: one buffer
43. Outline
- CAF programming model
- cafc
- Sweep3D implementations
- Experimental evaluation
- Conclusions
44. Experimental Evaluation
- Platforms
- Itanium2 + Quadrics QSNet II (Elan4)
- SGI Altix 3000
- Itanium2 + Myrinet 2000
- Alpha + Quadrics QSNet (Elan3)
- Problem sizes
- 50x50x50
- 150x150x150
- 300x300x300
45. Itanium2 + Quadrics, Size 50x50x50
46. Itanium2 + Quadrics, Size 150x150x150
47. Itanium2 + Quadrics, Size 300x300x300
- multi-version buffers improve performance of CAF codes by 15%
- imperative to use non-blocking notifies
48. Itanium2 + Quadrics, Communication Throughput Microbenchmark
- multi-version buffers improve throughput
- by 30% for messages up to 8KB
- by 10% for messages larger than 8KB
- overhead of the CAF translation is acceptable
49. SGI Altix 3000, Size 50x50x50
50. SGI Altix 3000, Size 150x150x150
- multi-version buffers are effective for asynchrony tolerance
51. SGI Altix 3000, Size 300x300x300
- both CAF implementations outperform MPI
52. SGI Altix 3000, Communication Throughput Microbenchmark
Warm cache
- the ARMCI library effectively exploits the hardware support for efficient data movement
- MPI performs extra data copies
53. Summary of results
- MPI buffering for small messages helps latency and asynchrony tolerance
- CAF multi-version buffers improve the performance of one-sided communication for wavefront computations
- enable the PUT and the receiver's computation to overlap
- provide asynchrony tolerance between sender and receiver
- Non-blocking notifies are important for performance
- enable synchronization to overlap with computation
- Platform results
- CAF outperforms MPI for large problem sizes by 10% on Itanium2+Quadrics, Itanium2+Myrinet, and Altix
- CAF is 16% slower on Alpha+Quadrics (Elan3)
- ARMCI lacks non-blocking notifies on Elan3
54. Enhancing CAF Usability
- CAF vs. MPI usability
- easier to use than MPI for simple parallel programs
- as difficult for carefully-tuned parallel codes
- Improving CAF ease of use
- compiler support for managing multi-version communication buffers
- vectorizing fine-grain communication to best support the X1 and cluster platforms with the same code
http://www.hipersoft.rice.edu/caf
56. Implementing Communication
- x(1:n) = a(1:n)[p]
- Use a temporary buffer to hold off-processor data (sketched below)
- allocate buffer
- perform GET to fill buffer
- perform computation: x(1:n) = buffer(1:n)
- deallocate buffer
- Optimizations
- no temporary storage for co-array-to-co-array copies
- load/store communication on shared-memory systems
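A hedged sketch of the translation scheme above for the read x(1:n) = a(1:n)[p]; the temporary's name is an assumption, and the GET line stands in for whatever runtime call cafc actually emits:

    ! source line:  x(1:n) = a(1:n)[p]
    real, allocatable :: tmp(:)
    allocate(tmp(n))            ! 1. allocate a temporary buffer
    tmp(1:n) = a(1:n)[p]        ! 2. one-sided GET from image p fills the buffer
                                !    (placeholder for the runtime call cafc generates)
    x(1:n) = tmp(1:n)           ! 3. the local computation reads the buffered copy
    deallocate(tmp)             ! 4. free the temporary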
57. Detailed Results
- Itanium2 + Quadrics (Elan4)
- similar for 50³, 9% better for 150³ and 300³
- Alpha + Quadrics (Elan3)
- 8% better for 50³, 16% lower for 150³, and similar for 300³
- ARMCI lacks non-blocking notifies on Elan3
- SGI Altix 3000
- comparable for 50³ and 150³, 10% better for 300³
- Itanium2 + Myrinet
- similar for 50³, 12% better for 150³, and 9% better for 300³
58. SGI Altix 3000, Communication Throughput Microbenchmark
[Graphs: communication throughput with warm cache and with cold cache]
59. One- vs. Two-buffer Communication (backup slide; same figure as slide 39)
60-63. Asynchrony-tolerant CAF Implementation (animation)
[Figure: exchange of sync_notify messages between neighboring images]