Experiences with Sweep3D Implementations in Coarray Fortran

Transcript and Presenter's Notes
1
Experiences with Sweep3D Implementations in
Co-array Fortran
  • Cristian Coarfa, Yuri Dotsenko
  • John Mellor-Crummey
  • Department of Computer Science
  • Rice University
  • Houston, TX USA

2
Motivation
  • Parallel Programming Models
  • MPI: de facto standard
  • difficult to program
  • OpenMP: inefficient to map on distributed memory
    platforms
  • lack of locality control
  • HPF: hard to obtain high performance
  • heroic compilers needed!
  • An appealing middle ground: global address space
    languages
  • CAF, Titanium, UPC

Evaluate CAF for an application with
sophisticated parallelization: Sweep3D
3
Co-Array Fortran
  • Global address space programming model
  • one-sided communication (GET/PUT)
  • Programmer has control over performance-critical
    factors
  • data distribution
  • computation partitioning
  • communication placement
  • Data movement and synchronization as language
    primitives
  • amenable to compiler-based communication
    optimization

4
CAF Programming Model Features
  • SPMD process images
  • fixed number of images during execution
  • images operate asynchronously
  • Both private and shared data
  • real :: x(20, 20) - a private 20x20 array in
    each image
  • real :: y(20, 20)[*] - a shared 20x20 co-array in
    each image
  • Simple one-sided shared-memory communication
  • x(:, j:j+2) = y(:, p:p+2)[r] - copy columns from
    image r into local columns (see the sketch after
    this list)
  • Synchronization intrinsic functions
  • sync_all - a barrier and a memory fence
  • sync_mem - a memory fence
  • sync_team(team members to notify, team members
    to wait for)
  • Pointers and (perhaps asymmetric) dynamic
    allocation
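Putting these pieces together, a minimal sketch of the model (the
neighbor computation and array names are illustrative, not from the
original slides):

    real :: x(20, 20)           ! private: an independent copy per image
    real :: y(20, 20)[*]        ! co-array: remotely accessible from any image
    integer :: r

    r = this_image() - 1        ! illustrative left-neighbor image index
    if (r >= 1) then
      x(:, 1:3) = y(:, 1:3)[r]  ! one-sided GET of three columns from image r
    end if
    call sync_all()             ! barrier and memory fence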

5
One-sided Communication with Co-Arrays
[figure: images 1 through N each hold a co-array h; each image copies h
from its left neighbor]
6
Outline
  • CAF programming model
  • cafc
  • Sweep3D implementations in CAF
  • Experimental evaluation
  • Conclusions

7
Rice Co-Array Fortran Compiler (cafc)
  • First CAF multi-platform compiler
  • previous compiler only for Cray shared memory
    systems
  • Implements core of the language
  • currently lacks support for derived type and
    dynamic co-arrays
  • Core sufficient for non-trivial codes
  • Performance comparable to that of hand-tuned MPI
    codes
  • Open source

8
cafc Implementation Strategy
  • Goals
  • portability
  • high-performance on a wide range of platforms
  • Source-to-source compilation of CAF codes
  • uses Open64/SL Fortran 90 infrastructure
  • CAF is translated to Fortran 90 plus communication
    operations
  • Communication
  • ARMCI library for one-sided communication on
    clusters (PNNL)
  • load/store communication on shared-memory
    platforms

9
Synchronization
  • Original CAF specification: team synchronization
    only
  • sync_all, sync_team
  • Limits performance on loosely-coupled
    architectures
  • Point-to-point extensions
  • sync_notify(q)
  • sync_wait(p)
  • Point-to-point synchronization semantics
  • delivery of a notify to q from p implies that all
    communication from p to q issued before the notify
    has been delivered to q (see the sketch after this
    list)
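A sketch of the notify/wait handshake these semantics enable (image
indices p and q and the buffer names are illustrative):

    ! on image p (producer):
    inbox(1:n)[q] = src(1:n)  ! one-sided PUT into image q's co-array
    call sync_notify(q)       ! delivery implies the PUT is complete at q

    ! on image q (consumer):
    call sync_wait(p)         ! block until p's notify arrives
    dst(1:n) = inbox(1:n)     ! safe: the PUT is guaranteed visible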

10
CAF Compiler Targets (Oct 2004)
  • Processors
  • Pentium, Alpha, Itanium2, MIPS
  • Interconnects
  • Quadrics, Myrinet, Gigabit Ethernet, shared
    memory
  • Operating systems
  • Linux, Tru64, IRIX

11
Outline
  • CAF programming model
  • cafc
  • Sweep3D implementations
  • Original MPI implementation
  • CAF versions
  • Communication microbenchmark
  • Experimental evaluation
  • Conclusions

12
Sweep3D
  • Core of an ASCI application
  • Solves a
  • one-group
  • time-independent
  • discrete ordinates (Sn)
  • 3D Cartesian (XYZ) geometry
  • neutron transport problem
  • Deterministic particle transport accounts for
    50-80% of execution time in many realistic DOE
    simulations

13
Sweep3D Parallelization
2D spatial domain decomposition onto a 2D
processor array
14-24
Sweep3D Parallelization
Wavefront parallelism
[animation frames: the wavefront advances step by step across the 2D
processor array]
25-28
Sweep3D Kernel Pseudocode
do iq = 1, 8
  do mo = 1, mmo
    do kk = 1, kb
      recv e/w into Phiib
      recv n/s into Phijb
      ...
      ! heavy computation with use/update
      ! of Phiib and Phijb
      ...
      send e/w Phiib
      send n/s Phijb
    enddo
  enddo
enddo
29
Initial Sweep3D CAF Implementation
  • Based on the MPI implementation
  • Maintain original computation
  • Convert communication buffers into co-arrays
  • Fundamental issue: converting from two-sided
    communication into one-sided communication

30-37
2-sided vs 1-sided Communication
[animation frames building up the two protocols]
2-sided comm: MPI_Send on the sender is matched by MPI_Recv on the
receiver.
1-sided comm: the receiver issues sync_notify to signal that its buffer
is free; the sender issues sync_wait, performs the PUT, then issues
sync_notify; the receiver issues sync_wait before using the data (see
the code sketch below).
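In code, the two idioms might look roughly like this (a hedged sketch:
variable names are illustrative, images p and q map to MPI ranks p-1
and q-1, and tag/status/ierr declarations are omitted):

    ! 2-sided (MPI): matching calls on both sides
    call MPI_Send(src, n, MPI_REAL, q-1, tag, MPI_COMM_WORLD, ierr)         ! on p
    call MPI_Recv(dst, n, MPI_REAL, p-1, tag, MPI_COMM_WORLD, status, ierr) ! on q

    ! 1-sided (CAF): only the sender moves data
    ! on image q:   call sync_notify(p)       ! "my buffer is free"
    ! on image p:
    call sync_wait(q)                         ! wait for q's buffer-free notify
    buf(1:n)[q] = src(1:n)                    ! PUT into q's co-array buffer
    call sync_notify(q)                       ! signal delivery
    ! on image q:   call sync_wait(p)         ! then consume buf(1:n)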
38
CAF Implementation Issues
  • Synchronization needed to avoid data races can
    lead to inefficiency
  • Using multiple communication buffers enables
    overlap of synchronization with computation

39
One- vs. Two-buffer Communication
[figure: communication from source to dest staged through one buffer
vs. through two buffers]
40
Asynchrony-tolerant CAF Implementation of Sweep3D
  • Multi-version communication buffers
  • Benefits
  • Overlap PUT with computation on destination
  • Overlap of synchronization with computation on
    source

41
Three-buffer Communication
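A minimal sketch of the buffer-rotation idea behind the three-buffer
scheme (the version count, names, and rotation logic are assumptions
for illustration, not the authors' code):

    integer, parameter :: NV = 3      ! assumed number of buffer versions
    real :: buf(buflen, NV)[*]        ! multi-version co-array buffer
    integer :: v                      ! current version, cycling 1..NV

    ! sender side, once per sweep step:
    call sync_wait(q)                 ! version v on image q is free again
    buf(:, v)[q] = out(:)             ! PUT into version v on the neighbor
    call sync_notify(q)               ! announce delivery of version v
    v = mod(v, NV) + 1                ! rotate; spare versions let PUTs and
                                      ! notifies overlap with computation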
42
Communication Throughput Microbenchmark
  • MPI implementation: blocking send and receive
  • CAF: one-version buffer
  • CAF: multi-version buffers
  • ARMCI implementation: one buffer

43
Outline
  • CAF programming model
  • cafc
  • Sweep3D implementations
  • Experimental evaluation
  • Conclusions

44
Experimental Evaluation
  • Platforms
  • Itanium2 + Quadrics QSNet II (Elan4)
  • SGI Altix 3000
  • Itanium2 + Myrinet 2000
  • Alpha + Quadrics QSNet (Elan3)
  • Problem sizes
  • 50x50x50
  • 150x150x150
  • 300x300x300

45
Itanium2 + Quadrics, Size 50x50x50
46
Itanium2 + Quadrics, Size 150x150x150
47
Itanium2 + Quadrics, Size 300x300x300
  • multi-version buffers improve performance of
    CAF codes by 15%
  • imperative to use non-blocking notifies

48
Itanium2 + Quadrics, Communication Throughput
Microbenchmark
  • multi-version buffers improve throughput
  • by 30% for messages up to 8KB
  • by 10% for messages larger than 8KB
  • overhead of the CAF translation is acceptable

49
SGI Altix 3000, Size 50x50x50
50
SGI Altix 3000, Size 150x150x150
  • multi-version buffers are effective for
    asynchrony tolerance

51
SGI Altix 3000, Size 300x300x300
  • both CAF implementations outperform MPI

52
SGI Altix 3000, Communication Throughput
Microbenchmark
[chart: warm cache]
  • the ARMCI library effectively exploits the
    hardware support for efficient data movement
  • MPI performs extra data copies

53
Summary of results
  • MPI buffering for small messages helps latency and
    asynchrony tolerance
  • CAF multi-version buffers improve performance of
    one-sided communication for wavefront
    computations
  • enable PUTs to overlap with the receiver's
    computation
  • provide asynchrony tolerance between sender and
    receiver
  • Non-blocking notifies are important for
    performance
  • enable synchronization to overlap with
    computation
  • Platform results
  • CAF outperforms MPI for large problem sizes by
    10% on Itanium2 + Quadrics, Itanium2 + Myrinet, and
    SGI Altix
  • CAF is 16% slower on Alpha + Quadrics (Elan3)
  • ARMCI lacks non-blocking notifies on Elan3

54
Enhancing CAF Usability
  • CAF vs MPI usability
  • easier to use than MPI for simple parallel
    programs
  • as difficult for carefully-tuned parallel codes
  • Improving CAF ease of use
  • compiler support for managing multi-version
    communication buffers
  • vectorizing fine-grain communication to best
    support the X1 and cluster platforms with the same
    code

http://www.hipersoft.rice.edu/caf
55
(No Transcript)
56
Implementing Communication
  • x(1:n) = a(1:n)[p]
  • Use a temporary buffer to hold off-processor data
  • allocate buffer
  • perform GET to fill buffer
  • perform computation: x(1:n) = buffer(1:n)
  • deallocate buffer
  • Optimizations
  • no temporary storage for co-array to co-array
    copies
  • load/store communication on shared-memory systems
    (see the sketch after this list)
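A sketch of the code cafc might generate for this assignment
(caf_runtime_get is a hypothetical stand-in for the emitted runtime
call, e.g. an ARMCI GET, and a_desc is an assumed co-array descriptor):

    real, allocatable :: tmp(:)
    allocate(tmp(n))                         ! temporary for off-processor data
    call caf_runtime_get(tmp, a_desc, p, n)  ! hypothetical GET from image p
    x(1:n) = tmp(1:n)                        ! finish the assignment locally
    deallocate(tmp)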

57
Detailed Results
  • Itanium2 + Quadrics (Elan4)
  • similar for 50³, 9% better for 150³ and 300³
  • Alpha + Quadrics (Elan3)
  • 8% better for 50³, 16% lower for 150³, and similar
    for 300³
  • ARMCI lacks non-blocking notifies on Elan3
  • SGI Altix 3000
  • comparable for 50³ and 150³, 10% better for 300³
  • Itanium2 + Myrinet
  • similar for 50³, 12% better for 150³ and 9%
    better for 300³

58
SGI Altix 3000, Communication Throughput
Microbenchmark
[charts: warm cache and cold cache]
59
One- vs. Two-buffer Communication
[figure: repeat of slide 39's one- vs. two-buffer diagram]
60-63
Asynchrony-tolerant CAF Implementation
[animation frames: paired sync_notify calls between sender and receiver
as buffer versions are produced and consumed]
64
(No Transcript)