Presentation at the 4th PMEO-PDS Workshop

1
Presentation at the 4th PMEO-PDS Workshop
Benchmark Measurements of Current UPC Platforms
Zhang Zhang and Steve Seidel
Michigan Technological University
Denver, Colorado
3/22/2005
2
Presentation Outline
  • Background
    • Unified Parallel C, implementations, and users
    • Previous UPC performance studies
  • Experiments
    • Available UPC platforms
    • Benchmarks
    • Performance measurements
  • Conclusions

3
UPC Overview
  • UPC is an extension of C for partitioned shared
    memory parallel programming.
  • A special case of the shared memory programming
    model.
  • Similar languages: Co-Array Fortran, Titanium.
  • UPC homepage: http://www.upc.gwu.edu
  • Platforms supported:
    • Cray X1, Cray T3E, SGI Origin, HP AlphaServer,
      HP-UX, Linux clusters, IBM SP.
  • UPC compilers:
    • Open source: MuPC, Berkeley UPC, Intrepid UPC
    • Commercial: HP UPC, Cray UPC
  • Users: LBNL, IDA, AHPCRC, ...

4
Related UPC Performance Studies
  • Performance benchmark suites:
    • UPC_Bench (GWU)
      • Synthetic microbenchmarks based on the STREAM
        benchmark.
      • Application benchmarks: Sobel edge detection,
        matrix multiplication, the N-Queens problem.
    • UPC NAS Parallel Benchmarks (GWU)
  • Performance monitoring:
    • Performance analysis of the HP UPC compiler (GWU)
    • Performance of Berkeley UPC on the HP AlphaServer
      (Berkeley)
    • Performance of Intrepid UPC on the SGI Origin (GWU)

5
Benchmarking UPC Systems
  • Extended the shared memory bandwidth microbenchmarks
    to cover various reference patterns:
    • Scalar references: 11 access patterns
    • Block memory operations: 9 access patterns
  • Benchmarked six combinations of available UPC
    compilers and platforms using both the UPC STREAM
    benchmark (MTU code) and the UPC NAS Parallel
    Benchmarks (GWU code).
    • Compilers: MuPC, HP UPC, Berkeley UPC, and
      Intrepid UPC
    • Platforms: Myrinet Linux cluster, HP AlphaServer
      SC, and Cray T3E
  • The first performance comparison of currently
    available UPC implementations.
  • The first report on MuPC performance.

6
Benchmarks
  • Synthetic benchmarks:
    • The STREAM microbenchmark was rewritten in UPC
      with a wider variety of shared memory access
      patterns:
      • Local shared read / write
      • Unit-stride shared read / write / copy
      • Random shared read / write / copy
      • Stride-n shared read / write / copy
      • Block transfers with varying source and sink
        affinities.
  • NAS Parallel Benchmark Suite v2.4:
    • The UPC version was developed at GWU.
    • Five kernels: CG, EP, FT, IS, and MG.
    • Two variations: a naïve version and a hand-tuned
      version.
    • Input size: Class A workload.

7
Local Shared References
  • Intrepid UPC performance is poor on local shared
    accesses.
  • The HP UPC cache state has a significant effect on
    local shared accesses.

8
Remote Shared References
  • HP UPC and MuPC caches help unit stride remote
    shared accesses.
  • Intrepid UPC performs best on remote shared
    accesses.

9
Block Memory Operations
  • HP UPC performance is poor on certain string
    functions.
  • Intrepid UPC performs poorly in all categories.

10
NPB CG
  • The only case that scales well: Berkeley UPC
    running the hand-tuned code.

11
NPB EP
12
NPB FT
  • HP, Berkeley and MuPC performance is comparable.

13
NPB IS
  • HP, Berkeley and MuPC performance is comparable.

14
NPB MG
  • MG performance is very inconsistent.

15
Conclusions
  • STREAM benchmarking:
    • UPC language overhead reduces the performance of
      local shared references.
    • Remote reference caching helps stride-1 accesses.
    • Copying between two locations with the same
      affinity to a remote thread needs optimization.
  • NPB benchmarking:
    • Some implementations failed on some benchmarks;
      more stable and reliable implementations are
      needed.
    • Hand-tuning techniques (e.g., prefetching) are
      critical to performance.
    • Berkeley UPC is the best at handling
      unstructured, fine-grained references.
    • The MuPC experience shows that optimizing remote
      shared references is more rewarding than
      improving network interconnects.

16
  • Thank you!
  • For more information:
    http://www.upc.mtu.edu