Presentation at the 4th PMEO-PDS Workshop

1
Presentation at the 4th PMEO-PDS Workshop
Benchmark Measurements of Current UPC Platforms
Zhang Zhang and Steve Seidel
Michigan Technological University
Denver, Colorado
3/22/2005
2
Presentation Outline
  • Background
    • Unified Parallel C, implementations, and users
    • Previous UPC performance studies
  • Experiments
    • Available UPC platforms
    • Benchmarks
    • Performance measurements
  • Conclusions

3
UPC Overview
  • UPC is an extension of C for partitioned shared
    memory parallel programming.
  • A special case of the shared memory programming
    model.
  • Similar languages: Co-Array Fortran, Titanium.
  • UPC homepage: http://www.upc.gwu.edu
  • Platforms supported:
    • Cray X1, Cray T3E, SGI Origin, HP AlphaServer,
      HP-UX, Linux clusters, IBM SP.
  • UPC compilers:
    • Open source: MuPC, Berkeley UPC, Intrepid UPC
    • Commercial: HP UPC, Cray UPC
  • Users: LBNL, IDA, AHPCRC, ...

4
Related UPC Performance Studies
  • Performance benchmark suites:
    • UPC_Bench (GWU)
      • Synthetic microbenchmarks based on the STREAM
        benchmark.
      • Application benchmarks: Sobel edge detection,
        matrix multiplication, the N-Queens problem.
    • UPC NAS Parallel Benchmarks (GWU)
  • Performance monitoring:
    • Performance analysis of the HP UPC compiler (GWU)
    • Performance of Berkeley UPC on the HP AlphaServer
      (Berkeley)
    • Performance of Intrepid UPC on the SGI Origin (GWU)

5
Benchmarking UPC Systems
  • Extended the shared memory bandwidth microbenchmarks
    to cover various reference patterns:
    • Scalar references: 11 access patterns
    • Block memory operations: 9 access patterns
  • Benchmarked six combinations of available UPC
    compilers and platforms using both the UPC STREAM
    benchmark (MTU code) and the UPC NAS Parallel
    Benchmarks (GWU code).
    • Compilers: MuPC, HP UPC, Berkeley UPC, and
      Intrepid UPC
    • Platforms: Myrinet Linux cluster, HP AlphaServer
      SC, and Cray T3E
  • The first performance comparison of currently
    available UPC implementations.
  • The first report on MuPC performance.

6
Benchmarks
  • Synthetic benchmarks:
    • The STREAM microbenchmark was rewritten in UPC
      with a wider variety of shared memory access
      patterns:
      • Local shared read / write
      • Unit-stride shared read / write / copy
      • Random shared read / write / copy
      • Stride-n shared read / write / copy
      • Block transfers with varying source and sink
        affinities.
  • NAS Parallel Benchmark Suite v2.4:
    • The UPC version was developed at GWU.
    • Five kernels: CG, EP, FT, IS, and MG.
    • Two variations: a naïve version and a hand-tuned
      version.
    • Input size: Class A workload.

7
Local Shared References
  • Intrepid UPC performance is poor on local shared
    accesses.
  • The HP UPC cache state has a significant effect on
    local shared accesses.

8
Remote Shared References
  • HP UPC and MuPC caches help unit stride remote
    shared accesses.
  • Intrepid UPC performs best on remote shared
    accesses.

9
Block Memory Operations
  • HP UPC performance is poor on certain string
    functions.
  • Intrepid UPC performs poorly in all categories.

10
NPB CG
  • The only case that scales well: Berkeley UPC
    running the hand-tuned code.

11
NPB EP
12
NPB FT
  • HP, Berkeley and MuPC performance is comparable.

13
NPB IS
  • HP, Berkeley and MuPC performance is comparable.

14
NPB MG
  • MG performance is very inconsistent.

15
Conclusions
  • STREAM benchmarking:
    • UPC language overhead reduces the performance of
      local shared references.
    • Remote reference caching helps stride-1 accesses.
    • Copying between two locations with the same
      affinity to a remote thread needs optimization.
  • NPB benchmarking:
    • Some implementations failed on some benchmarks;
      more stable and reliable implementations are
      needed.
    • Hand-tuning techniques (e.g., prefetching) are
      critical to performance.
    • Berkeley UPC is the best at handling
      unstructured, fine-grained references.
    • The MuPC experience shows that optimizing remote
      shared references is more rewarding than
      improving network interconnects.

16
  • Thank you!
  • For more information:
    http://www.upc.mtu.edu