1
Unified Parallel C (UPC)
  • Kathy Yelick
  • UC Berkeley and LBNL

2
UPC Projects
  • GWU: http://upc.gwu.edu
  • Benchmarking, language design
  • MTU: http://www.upc.mtu.edu
  • Language, benchmarking, MPI runtime for HP
    compiler
  • UFL: http://www.hcs.ufl.edu/proj/upc
  • Communication runtime (GASNet)
  • UMD: http://www.cs.umd.edu/tseng/
  • Benchmarks
  • IDA: http://www.super.org
  • Language, compiler for T3E
  • Other companies (Intel, Sun, etc.) and labs

3
UPC Compiler Efforts
  • HP: http://www.hp.com/go/upc
  • Compiler, tests, language
  • Etnus: http://www.etnus.com
  • Debugger
  • Intrepid: http://www.intrepid.com/upc
  • Compiler based on gcc
  • UCB/LBNL: http://upc.lbl.gov
  • Compiler, runtime, applications
  • IBM: http://www.ibm.com
  • Compiler under development for SP line
  • Cray: http://www.cray.com
  • Compiler product for X1

4
Comparison to MPI
  • One-sided vs. two-sided communication models
  • Programmability
  • Two-sided works reasonably well for regular
    computation
  • When computation is irregular/asynchronous,
    issuing receives can be difficult
  • To simplify programming, communication is grouped
    into a phase, which limits overlap
  • Performance
  • Some hardware does one-sided communication
  • RDMA support is increasingly common

5
Communication Support Today
  • Potential performance advantage for fine-grained,
    one-sided programs
  • Potential productivity advantage for irregular
    applications

6
MPI vs. PGAS Languages
  • GASNet - portable, high-performance communication
    layer
  • compilation target for both UPC and Titanium
  • reference implementation over MPI 1.1
    (AMMPI-based)
  • direct implementation over many vendor network
    APIs
  • IBM LAPI, Quadrics Elan, Myrinet GM, InfiniBand
    VAPI, Dolphin SCI, others on the way
  • Applications: NAS parallel benchmarks (CG & MG)
  • Standard benchmarks written in UPC by GWU
  • Compiled using Berkeley UPC compiler
  • Difference is the GASNet backend: MPI 1.1 vs. vendor
    API
  • Also used HP/Compaq UPC compiler where available
  • Caveats
  • Not a comparison of MPI as a programming model

7
Performance Difference Translates to Applications
  • Bulk-synchronous NAS MG and CG codes in UPC
  • Elan-based layer beats MPI
  • Performance and scaling
  • The only difference in the Berkeley lines is the
    network API!
  • Machine: Alpha + Quadrics (Lemieux)
  • Source: Bonachea and Duell

8
Performance Difference Translates to Applications
  • Apps on GM-based layer beat apps on MPI-based
    layer by about 20%
  • The only difference is the network API!
  • Machine: Pentium 3 + Myrinet
  • (NERSC Alvarez cluster)

9
Performance Difference Translates to Applications
App on LAPI-based layer provides significantly
better absolute performance and scaling than the
same app on MPI-based layer. The only difference
is the network API! Machine: IBM SP (Seaborg at NERSC)
10
Productivity
  • Productivity is hard to measure
  • Lines (or characters) are easy to measure
  • May not reflect programmability, but if the same
    algorithms are used, it can reveal some
    differences
  • Fast fine-grained communication is useful
  • Incremental program development
  • Inherently fine-grained applications
  • Compare performance of these fine-grained
    versions

11
Productivity Study (El-Ghazawi et al., GWU)
All the line counts are the number of real code
lines (no comments, no blocks). (1) The sequential
code is coded in C, except for NAS-EP and FT, which
are coded in Fortran. (2) The sequential code is
always in C.
12
Fine-Grained Applications have Larger Spread
  • Machine
  • HP Alpha + Quadrics (Lemieux)
  • Benchmark
  • Naïve CG with fine-grained remote accesses
  • For comparison purposes
  • All versions scale poorly due to naïve algorithm,
    as expected
  • Absolute performance: Elan version is more than
    4x faster!
  • Means more work for application programmers in
    MPI
  • Elan-based layer more suitable for
  • incremental application development and
    fine-grained algorithms

13
A Brief Look at the Past
  • Conjugate Gradient dominated by sparse
    matrix-vector multiply
  • Longer term, identify large application
  • Same fine-grained version as used in the previous
    comparison
  • Shows advantage of the T3E network model and UPC
  • Will we get a machine like this again?

14
Goals of the Berkeley UPC Project
  • Make UPC Ubiquitous
  • Parallel machines
  • Workstations and PCs for development
  • A portable compiler for future machines too
  • Research in compiler optimizations for parallel
    languages
  • Demonstration of UPC on real applications
  • Ongoing language development with the UPC
    Consortium
  • Collaboration between LBNL and UCB

15
Example Berkeley UPC Compiler
  • Compiler based on Open64
  • Multiple front-ends, including gcc
  • Intermediate form called WHIRL
  • Current focus on C backend
  • IA64 possible in future
  • UPC Runtime
  • Pointer representation
  • Shared/distributed memory
  • Communication in GASNet
  • Portable
  • Language-independent

[Diagram: UPC source → Higher WHIRL → optimizing
transformations → Lower WHIRL → generated C + Runtime,
or Assembly (IA64, MIPS, ...) + Runtime]
16
Portability Strategy in UPC Compiler
Runtime Layers
  • Generation of C code from translator
  • Layered approach to runtime
  • Core GASNet API
  • Most basic required primitives, as narrow and
    general as possible
  • Implemented directly on each platform
  • Based heavily on active messages paradigm
  • Extended API
  • Wider interface that includes more complicated
    operations
  • Reference implementation provided in terms of
    core
  • Implementers may tune for network
  • UPC Runtime
  • pointer representation (specific to UPC, possibly
    to machine)
  • thread implementation

[Layered stack: Compiler-generated code → Language-specific
runtime → GASNet Extended API → GASNet Core API →
Network Hardware]
17
Portability of Berkeley UPC Compiler
  • Make UPC Ubiquitous
  • Current and future parallel machines
  • Workstations and PCs for development
  • Ports of Berkeley UPC Compiler
  • OS: Linux, FreeBSD, Tru64, AIX, IRIX, HP-UX,
    Solaris, MS Windows (Cygwin), Mac OS X, Unicos,
    SuperUX
  • CPU: x86, Itanium, Alpha, PowerPC, PA-RISC
  • Supercomputers: Cray T3E, Cray X1, IBM SP, NEC
    SX-6, Cluster X (Big Mac), SGI Altix 3000
  • Recently added a net-compile option
  • Only install runtime system locally
  • Runtime ported to Posix Threads (direct
    load/store)
  • Run on SGI Altix as well as SMPs
  • GASNet tuned to vendor-supplied communication
    layer
  • Myrinet GM, Quadrics Elan, Mellanox Infiniband
    VAPI, IBM LAPI, Cray X1, Cray/SGI SHMEM

18
Pointer-to-Shared Phases
  • UPC has three different kinds of pointers
  • Block-cyclic:
    shared [4] double a[n];
  • Cyclic:
    shared double a[n];
  • Indefinite (always local):
    shared [0] double *a = (shared [0] double *) upc_alloc(n);
  • A pointer needs a phase to keep track of its
    relative position within a block
  • Source of overhead for updating and dereferencing
  • Special case for phaseless pointers
  • Cyclic pointers always have phase 0
  • Indefinite blocked pointers only have one block
  • Don't need to keep phase for cyclic and
    indefinite
  • Don't need to update thread id for indefinite
19
Accessing Shared Memory in UPC
[Figure: a pointer-to-shared holds a phase, a thread
number, and an address; together with the block size
they locate an element of a shared array object spread
across Thread 0 … Thread N-1 of the shared memory]
20
Pointer-to-Shared Representation
  • Shared pointer representation trade-offs
  • Use of scalar types (long) rather than a struct
    may improve backend code quality
  • Faster pointer manipulation, e.g., ptr + int and
    dereferencing
  • Important in C, because array reference based on
    pointers
  • Pointer size is important to performance
  • Use of smaller types, 64 bits, rather than 128
    bits may allow pointers to reside in a single
    register
  • But very large machines may require a longer
    pointer type
  • Consider two different machines
  • 2048-processor machine with 16 GB/processor → 128
    bits
  • 64-processor machine with 2 GB/processor → 64
    bits
  • 6 bits for thread, 31 bits of address, 27 bits
    for phase → 64 bits (one possible packing is
    sketched below)
  • Portability and performance balance in UPC
    compiler
  • The pointer representation is hidden in the
    runtime layer
  • Can easily switch at compiler installation time
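As an illustration only, here is a minimal sketch of how a 64-bit packed
pointer-to-shared could be laid out, using the 6/27/31-bit split from the
example above. The type and helper names are invented for this sketch and
are not the actual representation used by any particular UPC runtime.

    #include <stdint.h>

    /* Hypothetical 64-bit packing: 6-bit thread, 27-bit phase, 31-bit address. */
    typedef uint64_t sptr64_t;

    #define SPTR_THREAD_BITS 6
    #define SPTR_PHASE_BITS  27
    #define SPTR_ADDR_BITS   31

    static inline sptr64_t sptr_pack(uint64_t thread, uint64_t phase, uint64_t addr) {
        return (thread << (SPTR_PHASE_BITS + SPTR_ADDR_BITS))
             | (phase  << SPTR_ADDR_BITS)
             | addr;
    }
    static inline uint64_t sptr_thread(sptr64_t p) {
        return p >> (SPTR_PHASE_BITS + SPTR_ADDR_BITS);
    }
    static inline uint64_t sptr_phase(sptr64_t p) {
        return (p >> SPTR_ADDR_BITS) & ((1ULL << SPTR_PHASE_BITS) - 1);
    }
    static inline uint64_t sptr_addr(sptr64_t p) {
        return p & ((1ULL << SPTR_ADDR_BITS) - 1);
    }

Because the whole pointer fits in one 64-bit word, manipulating it stays
within a single register, which is the performance argument made above.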

21
Performance of Shared Pointer Arithmetic
1 cycle = 1.5 ns
  • Phaseless pointer an important optimization
  • Indefinite pointers almost as fast as regular C
    pointers
  • Packing also helps, especially for pointer and
    int addition

22
Comparison with HP UPC v1.7
1 cycle = 1.5 ns
  • HP a little faster, due to it generating
    assembly code
  • Gap for addition likely smaller with further
    optimizations

23
Cost of Shared Memory Access
  • Local accesses somewhat slower than private
    accesses
  • Remote accesses significantly worse, as expected

24
Optimizing Explicitly Parallel Code
  • Compiler optimizations for parallel languages
  • Enabled optimizations in Open64 base
  • Static analyses for parallel code
  • Problem is to understand when code motion is
    legal without changing the views from other
    processors
  • Extended cycle detection to arrays with three
    different algorithms [LCPC '03]
  • Message strip-mining
  • Packing messages is good, but it can go too far
  • Use performance model to strip-mine messages into
    smaller chunks to optimize overlap [VECPAR '04]
  • Automatic message vectorization (packing) underway

25
Performance Example
  • Performance of the Berkeley MG UPC code
  • HP (Lemieux, left) includes MPI comparison

26
Berkeley UPC on the X1
48x
  • Translator generated C code usually vectorizes as
    well as original C code
  • Source-to-source translation a reasonable
    strategy
  • Work needed for 3D arrays

27
GASNet/X1 Performance
Puts
Gets
  • GASNet/X1 improves small message performance
    over shmem and MPI
  • GASNet/X1 communication can be integrated
    seamlessly into long computation loops and is
    vectorizable
  • GASNet/X1 operates directly on global pointers

28
NAS CG OpenMP style vs. MPI style
  • GAS language outperforms MPI+Fortran (flat is
    good!)
  • Fine-grained (OpenMP style) version still slower
  • shared memory programming style leads to more
    overhead (redundant boundary computation)
  • GAS languages can support both programming styles

29
EP on Alpha/Quadrics (GWU Bench)
30
IS on Alpha/Quadrics (GWU Bench)
31
MG on Alpha/Quadrics (Berkeley version)
32
Multigrid on Cray X1
  • Performance similar to MPI
  • Cray C does not automatically
    vectorize/multistream (requires addition of pragmas)
  • 4 SSP slightly better than 1 MSP, 2 MSP much
    better than 8 SSP (cache conflict caused by
    layout of private data)

33
Integer Sort
  • Benchmark written in bulk synchronous style
  • Performance is similar to MPI
  • Code does not vectorize; even the best performer
    is much slower than a cache-based superscalar
    architecture

34
Fine-grained Irregular Accesses UPC GUPS
  • Hard to control vectorization of fine-grained
    accesses
  • temporary variables, casts, etc.
  • Communication libraries may help

35
Recent Progress on Applications
  • Application demonstration of UPC
  • NAS PB-size problems
  • Berkeley NAS MG avoids most global barriers and
    relies on UPC relaxed memory model
  • Berkeley NAS CG has several versions, including
    simpler, fine-grained communication
  • Algorithms that are challenging in MPI
  • 2D Delaunay Triangulation [SIAM PP '04]
  • AMR in UPC: Chombo (non-adaptive) Poisson solver

36
Progress in Language
  • Group is active in UPC Consortium meetings,
    mailing list, SC booth, etc.
  • Recent language level work
  • Specification of UPC memory model in progress
  • Joint with MTU
  • Behavioral spec [Dagstuhl '03]
  • UPC I/O nearly finalized
  • Joint with GWU and ANL
  • UPC Collectives V 1.0 finalized
  • Effort led by MTU
  • Improvements/updates to UPC Language Spec
  • Led by IDA

37
Center Overview
  • Broad collaboration between three groups
  • Library efforts: MPI, ARMCI, GA, OpenMP
  • Language efforts: UPC, CAF, Titanium
  • New model investigations: multi-threading, memory
    consistency models
  • Led by Rusty Lusk at ANL
  • Major focus is common runtime system
  • GASNet for UPC, Titanium and (soon) CAF
  • Also common compiler
  • CAF, UPC, and OpenMP work based on Open64

38
Progress on UPC Runtime
  • Cross-language support Berkeley UPC and MPI
  • Calling MPI from UPC
  • Calling UPC from MPI
  • Runtime for gcc-based UPC compiler by Intrepid
  • Interface UPC compiler to parallel collectives
    libraries (end of FY04)
  • Reference implementation just released by HP/MTU
  • Thread version of the Berkeley UPC runtime layer
  • Evaluating performance on hybrid GASNet systems

39
Progress on GASNet
  • GASNet: Myrinet GM, Quadrics Elan-3, IBM LAPI,
    UDP, MPI, InfiniBand
  • Ongoing: SCI (with UFL), Cray X1, SGI SHMEM, and
    reviewing future Myrinet and latest Elan-4
  • Extension to GASNet to support strided and
    scatter/gather communication
  • Also proposed support for UPC bulk copies
  • Analysis of MPI one-sided for GAS languages
  • Problems with synchronization model
  • Multiple protocols for managing pinned memory
    in Direct Memory Addressing systems [CAC '03]
  • Depends on language usage as well as network
    architecture

40
Future Plans
  • Architecture-specific GASNet for scatter-gather
    and strided hardware support.
  • Need for CAF and for UPC with message
    vectorization
  • Optimized collective communication library
  • Spec agreed on in 2003
  • New reference implementation
  • Developing GASNet extension for building
    optimized collectives
  • Application- and architecture- driven
    optimization
  • Interface to the UPC I/O library
  • Evaluate GASNet on machines with non-cache
    coherent shared memory
  • BlueGene/L and NEC SX6

41
Try It Out
  • Download from the Berkeley UPC web page
  • http://upc.lbl.gov
  • May just get runtime system (includes GASNet)
  • Netcompile is default
  • Runtime is easier to install
  • New release planned for this summer
  • Not quite open development model
  • We publicize a latest stable version that is
    not fully tested
  • Let us know what happens (good and bad)
  • Mail upc@lbl.gov

42
UPC Outline
  • Background and Philosophy
  • UPC Execution Model
  • UPC Memory Model
  • Data and Pointers
  • Dynamic Memory Management
  • Programming Examples

8. Synchronization 9. Performance Tuning and
Early Results 10. Concluding Remarks
43
Context
  • Most parallel programs are written using either
  • Message passing with a SPMD model
  • Usually for scientific applications with
    C/Fortran
  • Scales easily
  • Shared memory with threads in OpenMP,
    Threads+C/C++/Fortran, or Java
  • Usually for non-scientific applications
  • Easier to program, but less scalable performance
  • Global Address Space (GAS) Languages take the
    best of both
  • global address space like threads
    (programmability)
  • SPMD parallelism like MPI (performance)
  • local/global distinction, i.e., layout matters
    (performance)

44
Partitioned Global Address Space Languages
  • Explicitly-parallel programming model with SPMD
    parallelism
  • Fixed at program start-up, typically 1 thread per
    processor
  • Global address space model of memory
  • Allows programmer to directly represent
    distributed data structures
  • Address space is logically partitioned
  • Local vs. remote memory (two-level hierarchy)
  • Programmer control over performance critical
    decisions
  • Data layout and communication
  • Performance transparency and tunability are goals
  • Initial implementation can use fine-grained
    shared memory
  • Base languages differ UPC (C), CAF (Fortran),
    Titanium (Java)

45
Global Address Space Eases Programming
[Figure: partitioned global address space — each of
Thread0 … Threadn owns a shared partition (holding x0,
x1, …, xP) plus a private space; private pointers (ptr)
can reference shared data]
  • The languages share the global address space
    abstraction
  • Shared memory is partitioned by processors
  • Remote memory may stay remote: no automatic
    caching implied
  • One-sided communication through reads/writes of
    shared variables
  • Both individual and bulk memory copies
  • Differ on details
  • Some models have a separate private memory area
  • Distributed array generality and how they are
    constructed

46
One-Sided Communication Is Sometimes Faster
  • Potential performance advantage for fine-grained,
    one-sided programs
  • Potential productivity advantage for irregular
    applications

47
Current Implementations
  • A successful language/library must run everywhere
  • UPC
  • Commercial compilers available on Cray, SGI, HP
    machines
  • Open source compiler from LBNL/UCB (and another
    from MTU)
  • CAF
  • Commercial compiler available on Cray machines
  • Open source compiler available from Rice
  • Titanium (Friday)
  • Open source compiler from UCB runs on most
    machines
  • Common tools
  • Open64 open source research compiler
    infrastructure
  • ARMCI, GASNet for distributed memory
    implementations
  • Pthreads, System V shared memory

48
UPC Overview and Design Philosophy
  • Unified Parallel C (UPC) is
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Sometimes called a GAS language
  • Similar to the C language philosophy
  • Programmers are clever and careful, and may need
    to get close to hardware
  • to get performance, but
  • can get in trouble
  • Concise and efficient syntax
  • Common and familiar syntax and semantics for
    parallel C with simple extensions to ANSI C
  • Based on ideas in Split-C, AC, and PCP

49
UPC Execution Model
50
UPC Execution Model
  • A number of threads working independently in a
    SPMD fashion
  • Number of threads specified at compile-time or
    run-time; available as the program variable THREADS
  • MYTHREAD specifies thread index (0..THREADS-1)
  • upc_barrier is a global synchronization: all wait
  • There is a form of parallel loop that we will see
    later
  • There are two compilation modes
  • Static Threads mode
  • THREADS is specified at compile time by the user
  • The program may use THREADS as a compile-time
    constant
  • Dynamic threads mode
  • Compiled code may be run with varying numbers of
    threads

51
Hello World in UPC
  • Any legal C program is also a legal UPC program
  • If you compile and run it as UPC with P threads,
    it will run P copies of the program.
  • Using this fact, plus the identifiers from the
    previous slides, we can write a parallel hello world:

    #include <upc.h>    /* needed for UPC extensions */
    #include <stdio.h>

    main() {
      printf("Thread %d of %d: hello UPC world\n",
             MYTHREAD, THREADS);
    }
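As a usage sketch (assuming the Berkeley UPC toolchain; other compilers use
different drivers): compile with upcc hello.c -o hello and launch with
upcrun -n 4 hello, which prints one greeting line per thread.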

52
Example: Monte Carlo Pi Calculation
  • Estimate Pi by throwing darts at a unit square
  • Calculate percentage that fall in the unit circle
  • Area of square: r² = 1
  • Area of circle quadrant: ¼ π r² = π/4
  • Randomly throw darts at (x, y) positions
  • If x² + y² < 1, then the point is inside the circle
  • Compute the ratio:
  • points inside / points total
  • π ≈ 4 × ratio

53
Pi in UPC
  • Independent estimates of pi:

    main(int argc, char **argv) {
      int i, hits = 0, trials = 0;
      double pi;
      if (argc != 2) trials = 1000000;
      else trials = atoi(argv[1]);
      srand(MYTHREAD*17);
      for (i = 0; i < trials; i++) hits += hit();
      pi = 4.0*hits/trials;
      printf("PI estimated to %f.", pi);
    }

54
Helper Code for Pi in UPC
  • Required includes:
    #include <stdio.h>
    #include <math.h>
    #include <upc.h>
  • Function to throw dart and calculate where it
    hits:
    int hit() {
      int const rand_max = 0xFFFFFF;
      double x = ((double) (rand() % rand_max)) / rand_max;
      double y = ((double) (rand() % rand_max)) / rand_max;
      if ((x*x + y*y) < 1.0) return 1;
      else return 0;
    }

Hidden slide
55
UPC Memory Model
  • Scalar Variables
  • Distributed Arrays
  • Pointers to shared data

56
Private vs. Shared Variables in UPC
  • Normal C variables and objects are allocated in
    the private memory space for each thread.
  • Shared variables are allocated only once, with
    thread 0:
    shared int ours;
    int mine;
  • Simple shared variables of this kind may not
    be declared within a function definition
[Figure: ours lives in the shared portion of the global
address space with affinity to thread 0; each of
Thread0 … Threadn has its own private copy of mine]
57
Pi in UPC (Cooperative Version)
  • Parallel computation of pi, but with a race
    condition:

    shared int hits;
    main(int argc, char **argv) {
      int i, my_hits = 0;
      int trials = atoi(argv[1]);
      int my_trials = (trials + THREADS - 1
                       - MYTHREAD)/THREADS;
      srand(MYTHREAD*17);
      for (i = 0; i < my_trials; i++)
        hits += hit();
      upc_barrier;
      if (MYTHREAD == 0)
        printf("PI estimated to %f.",
               4.0*hits/trials);
    }

shared variable to record hits
divide work up evenly
accumulate hits
58
Pi in UPC (Cooperative Version)
  • The race condition can be fixed in several ways
  • Add a lock around the hits increment (later)
  • Have each thread update a separate counter
  • Have one thread compute the sum
  • Use a collective to compute the sum (recently added
    to UPC)

    shared int all_hits[THREADS];
    main(int argc, char **argv) {
      // declarations and initialization code omitted
      for (i = 0; i < my_trials; i++)
        all_hits[MYTHREAD] += hit();
      upc_barrier;
      if (MYTHREAD == 0) {
        for (i = 0; i < THREADS; i++)
          hits += all_hits[i];
        printf("PI estimated to %f.",
               4.0*hits/trials);
      }
    }

all_hits is shared by all processors, just as
hits was
Where does it live?
59
Shared Arrays Are Cyclic By Default
  • Shared array elements are spread across the
    threads
    shared int x[THREADS];       /* 1 element per thread */
    shared int y[3][THREADS];    /* 3 elements per thread */
    shared int z[3*THREADS];     /* 3 elements per thread, cyclic */
  • In the pictures below
  • Assume THREADS = 4
  • Elements with affinity to processor 0 are red

[Figure: layouts of x, y, and z across the 4 threads; as
a 2D array, y is logically blocked by columns]
60
Example Vector Addition
  • Questions about parallel vector additions
  • How to layout data (here it is cyclic)
  • Which processor does what (here it is owner
    computes)
    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];
    void main() {
      int i;
      for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
          sum[i] = v1[i] + v2[i];
    }
cyclic layout
owner computes
61
Vector Addition with upc_forall
  • The loop in vadd is common, so there is
    upc_forall:
  • 4th argument is an int expression that gives the
    affinity
  • Iteration executes when
  • affinity % THREADS is MYTHREAD

    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];
    void main() {
      int i;
      upc_forall(i = 0; i < N; i++; i)
        sum[i] = v1[i] + v2[i];
    }

62
Work Sharing with upc_forall()
  • Iterations are independent
  • Each thread gets a bunch of iterations
  • Simple C-like syntax and semantics
  • upc_forall(init; test; loop; affinity)
        statement
  • Cyclic (round robin) distribution
  • Blocked (chunks of iterations) distribution
  • Semantics are undefined if there are dependencies
    between iterations executed by different threads
  • Programmer has indicated iterations are
    independent

63
UPC Matrix Vector Multiplication Code
  • Here is one possible matrix-vector multiplication

    #include <upc_relaxed.h>
    shared int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];
    void main (void) {
      int i, j, l;
      upc_forall(i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (l = 0; l < THREADS; l++)
          c[i] += a[i][l]*b[l];
      }
    }
64
Data Distribution
[Figure: default data distribution — the elements of A,
B, and C are spread cyclically across Thread 0, Thread 1,
Thread 2, …]
65
A Better Data Distribution
[Figure: a better data distribution — each thread
(Th. 0, Th. 1, Th. 2, …) holds one row of A together
with the corresponding elements of B and C]
66
Layouts in General
  • All non-array shared variables have affinity with
    thread zero.
  • Array layouts are controlled by layout
    specifiers:
    shared [b] double x[n];
  • Groups of b elements are wrapped around the
    threads
  • An empty specifier gives a cyclic layout of the
    data in a 1D view
  • The layout specifier is an integer expression:
    [integer_expression]
  • The affinity of an array element is defined in
    terms of the block size, a compile-time constant,
    and THREADS, a runtime constant.
  • Element i has affinity with thread
    (i / block_size) mod THREADS
  • For example, with block_size = 3 and THREADS = 4,
    element 7 is in block 7/3 = 2 and so has affinity
    with thread 2.

67
Layout Terminology
  • Notation is from HPF, but the terminology is
    language-independent
  • Assume there are 4 processors

[Figure: six example layouts of a 2D array over 4
processors — (Block, *), (*, Block), (Block, Block),
(Cyclic, *), (Cyclic, Block), (Cyclic, Cyclic)]
68
2D Array Layouts in UPC
  • Array a1 has a row layout and array a2 has a
    block row layout:
    shared [m] int a1[n][m];
    shared [k*m] int a2[n][m];
  • If (k+m) % THREADS == 0 then a3 has a row
    layout:
    shared int a3[n][m+k];
  • To get more general HPF and ScaLAPACK style 2D
    blocked layouts, one needs to add dimensions.
  • Assume r*c = THREADS:
    shared [b1][b2] int a5[m][n][r][c][b1][b2];
  • or equivalently
    shared [b1*b2] int a5[m][n][r][c][b1][b2];

69
UPC Matrix Vector Multiplication Code
  • Matrix-vector multiplication with better layout

    #include <upc_relaxed.h>
    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];
    void main (void) {
      int i, j, l;
      upc_forall(i = 0; i < THREADS; i++; i) {
        c[i] = 0;
        for (l = 0; l < THREADS; l++)
          c[i] += a[i][l]*b[l];
      }
    }
70
Example Matrix Multiplication in UPC
  • Given two integer matrices A(NxP) and B(PxM)
  • Compute C = A × B.
  • Entries C[i][j] in C are computed by the formula:
    C[i][j] = Σ (l = 0 .. P-1) A[i][l] · B[l][j]
71
Matrix Multiply in C
    #include <stdlib.h>
    #include <time.h>
    #define N 4
    #define P 4
    #define M 4
    int a[N][P], c[N][M];
    int b[P][M];
    void main (void) {
      int i, j, l;
      for (i = 0; i < N; i++)
        for (j = 0; j < M; j++) {
          c[i][j] = 0;
          for (l = 0; l < P; l++)
            c[i][j] += a[i][l]*b[l][j];
        }
    }

72
Domain Decomposition for UPC
  • Exploits locality in matrix multiplication
  • A (N × P) is decomposed row-wise into blocks of
    size (N × P) / THREADS as shown below
  • B (P × M) is decomposed column-wise into blocks of
    M/THREADS columns as shown below

[Figure: A (N rows × P cols) split row-wise — Thread 0
holds elements 0 .. (N*P/THREADS)-1, Thread 1 holds
(N*P/THREADS) .. (2*N*P/THREADS)-1, …, Thread THREADS-1
holds ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1;
B (P × M) split column-wise — Thread 0 holds columns
0 .. (M/THREADS)-1, …, Thread THREADS-1 holds columns
((THREADS-1)*M)/THREADS .. (M-1)]
  • Note: N and M are assumed to be multiples of
    THREADS
73
UPC Matrix Multiplication Code
/* mat_mult_1.c */
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P], c[N][M];
// a and c are row-wise blocked shared matrices
shared [M/THREADS] int b[P][M];  // column-wise blocking
void main (void) {
  int i, j, l;  // private variables
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];
    }
  }
}
74
Notes on the Matrix Multiplication Example
  • The UPC code for the matrix multiplication is
    almost the same size as the sequential code
  • Shared variable declarations include the keyword
    shared
  • Making a private copy of matrix B in each thread
    might result in better performance since many
    remote memory operations can be avoided
  • Can be done with the help of upc_memget

75
Pointers to Shared vs. Arrays
  • In the C tradition, arrays can be accessed through
    pointers
  • Here is the vector addition example using pointers:

    #include <upc_relaxed.h>
    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];
    void main() {
      int i;
      shared int *p1, *p2;
      p1 = v1; p2 = v2;
      for (i = 0; i < N; i++, p1++, p2++)
        if (i % THREADS == MYTHREAD)
          sum[i] = *p1 + *p2;
    }

[Figure: p1 points to successive elements of the shared
array v1]
76
UPC Pointers
Where does the pointer reside?
Where does it point?
int *p1;               /* private pointer to local memory */
shared int *p2;        /* private pointer to shared space */
int *shared p3;        /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to shared space */

Shared to private is not recommended.
77
UPC Pointers
[Figure: p3 and p4 reside in the shared space (affinity
to thread 0); each of Thread0 … Threadn has private
copies of p1 and p2; p2 and p4 point into the shared
space, p1 and p3 point into local memory]

int *p1;               /* private pointer to local memory */
shared int *p2;        /* private pointer to shared space */
int *shared p3;        /* shared pointer to local memory */
shared int *shared p4; /* shared pointer to shared space */

Pointers to shared often require more storage and
are more costly to dereference; they may refer to
local or remote memory.
78
Common Uses for UPC Pointer Types
  • int *p1
  • These pointers are fast
  • Use to access private data in part of code
    performing local work
  • Often cast a pointer-to-shared to one of these to
    get faster access to shared data that is local
    (see the sketch below)
  • shared int *p2
  • Use to refer to remote data
  • Larger and slower due to test-for-local plus
    possible communication
  • int *shared p3
  • Not recommended
  • shared int *shared p4
  • Use to build shared linked structures, e.g., a
    linked list
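A minimal sketch of that cast (the array x, its block size, and the helper
name are invented for this example; a blocked layout is assumed so each
thread's elements are contiguous):

    #include <upc.h>

    shared [10] int x[10*THREADS];   /* blocked: each thread owns one block of 10 */

    void zero_my_block(void) {
      int i;
      /* legal cast: x[10*MYTHREAD] has affinity to this thread */
      int *lp = (int *) &x[10*MYTHREAD];
      for (i = 0; i < 10; i++)
        lp[i] = 0;                   /* plain local stores, no pointer-to-shared overhead */
    }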

79
UPC Pointers
  • In UPC pointers to shared objects have three
    fields
  • thread number
  • local address of block
  • phase (specifies position in the block)
  • Example Cray T3E implementation

[Figure: example Cray T3E representation — a 64-bit word
with the phase in bits 63–49, the thread in bits 48–38,
and the virtual address in bits 37–0]
80
UPC Pointers
  • Pointer arithmetic supports blocked and
    non-blocked array distributions
  • Casting of shared to private pointers is allowed
    but not vice versa !
  • When casting a pointer to shared to a private
    pointer, the thread number of the pointer to
    shared may be lost
  • Casting of shared to private is well defined only
    if the object pointed to by the pointer to shared
    has affinity with the thread performing the cast

81
Special Functions
  • size_t upc_threadof(shared void *ptr): returns
    the thread number that has affinity to the
    pointer-to-shared
  • size_t upc_phaseof(shared void *ptr): returns the
    index (position within the block) field of the
    pointer-to-shared
  • size_t upc_addrfield(shared void *ptr): returns
    the address of the block which is pointed at by
    the pointer-to-shared
  • shared void *upc_resetphase(shared void *ptr):
    resets the phase to zero
  • (a small example of the first two calls follows)
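A small illustration of these queries. The array and block size are invented
for this example, and it assumes THREADS is at least 3:

    #include <stdio.h>
    #include <upc.h>

    shared [3] int A[4*THREADS];

    int main(void) {
      if (MYTHREAD == 0) {
        /* element 7 sits in block 7/3 = 2, at position 7%3 = 1 within the block */
        printf("A[7]: thread=%d phase=%d\n",
               (int) upc_threadof(&A[7]),
               (int) upc_phaseof(&A[7]));
        /* prints thread=2 phase=1 when THREADS >= 3 */
      }
      return 0;
    }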

82
Synchronization
  • No implicit synchronization among the threads
  • UPC provides several synchronization mechanisms:
  • Barriers (blocking)
  • upc_barrier
  • Split-phase barriers (non-blocking; see the sketch
    below)
  • upc_notify
  • upc_wait
  • An optional integer label allows matching of
    barriers
  • Locks
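A sketch of how the split-phase pair can hide latency (the two work
functions are placeholders invented for this example, not part of UPC):

    #include <upc.h>

    void exchange_boundaries(void);  /* placeholder: issues writes to remote shared data */
    void do_local_work(void);        /* placeholder: touches only private data */

    void step(void) {
      exchange_boundaries();
      upc_notify;        /* signal arrival at the barrier without blocking */
      do_local_work();   /* overlap independent local computation */
      upc_wait;          /* block until every thread has notified */
    }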

83
Synchronization - Locks
  • In UPC, shared data can be protected against
    multiple writers:
  • void upc_lock(upc_lock_t *l)
  • int upc_lock_attempt(upc_lock_t *l) // returns 1
    on success and 0 on failure
  • void upc_unlock(upc_lock_t *l)
  • Locks can be allocated dynamically. Dynamically
    allocated locks can be freed
  • Dynamic locks are properly initialized and static
    locks need initialization

84
Corrected Version: Pi Example
  • Parallel computation of pi, without the race
    condition:

    shared int hits;
    main(int argc, char **argv) {
      int i, my_hits = 0;
      upc_lock_t *hit_lock = upc_all_lock_alloc();
      /* ... initialization of trials, my_trials,
         srand code omitted ... */
      for (i = 0; i < my_trials; i++)
        my_hits += hit();
      upc_lock(hit_lock);
      hits += my_hits;
      upc_unlock(hit_lock);
      upc_barrier;
      if (MYTHREAD == 0)
        printf("PI estimated to %f.",
               4.0*hits/trials);
      upc_lock_free(hit_lock);
    }

all threads collectively allocate lock
update in critical region
85
Memory Consistency in UPC
  • The consistency model of shared memory accesses
    is controlled by designating accesses as strict,
    relaxed, or unqualified (the default).
  • There are several ways of designating the
    ordering type.
  • A type qualifier, strict or relaxed, can be used
    to affect all variables of that type.
  • Labels strict or relaxed can be used to control
    the accesses within a statement:
  • strict : { x = y; z = y + 1; }
  • A strict or relaxed cast can be used to override
    the current label or type qualifier.

86
Synchronization- Fence
  • UPC provides a fence construct
  • Equivalent to a null strict reference, and has
    the syntax
  • upc_fence;
  • UPC ensures that all shared references issued
    before the upc_fence are complete
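For instance, a minimal producer sketch (the variable names are invented;
with relaxed accesses as the file default, a consumer would need its own
fence, or a strict read, after observing the flag):

    #include <upc_relaxed.h>

    shared int data;   /* relaxed shared scalar */
    shared int flag;   /* also relaxed here */

    void producer(void) {
      data = 42;       /* relaxed shared write */
      upc_fence;       /* all earlier shared accesses complete before anything after */
      flag = 1;        /* publish only after data is visible */
    }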

87
Matrix Multiplication with Blocked Matrices
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P], c[N][M];
shared [M/THREADS] int b[P][M];
int b_local[P][M];
void main (void) {
  int i, j, l;  // private variables
  upc_memget(b_local, b, P*M*sizeof(int));
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b_local[l][j];
    }
  }
}
88
Shared and Private Data
  • Assume THREADS = 4
  • shared [3] int A[4][THREADS];
  • will result in the following data layout:

    Thread 0: A[0][0], A[0][1], A[0][2], A[3][0], A[3][1], A[3][2]
    Thread 1: A[0][3], A[1][0], A[1][1], A[3][3]
    Thread 2: A[1][2], A[1][3], A[2][0]
    Thread 3: A[2][1], A[2][2], A[2][3]
89
UPC Pointers
[Figure: pointer arithmetic on a pointer-to-shared dp
over a shared array X[0..15] distributed across
Threads 0..3 — the successive values dp+1 … dp+9 step
through the elements according to the block size and
phase, ending at dp1]
90
UPC Pointers
[Figure: the same pointer-to-shared arithmetic, dp+1 …
dp+9, shown for a different blocking of X[0..15] across
Threads 0..3]
91
Bulk Copy Operations in UPC
  • UPC provides standard library functions to move
    data to/from shared memory
  • Can be used to move chunks in the shared space or
    between shared and private spaces
  • Equivalent of memcpy:
  • upc_memcpy(dst, src, size): copy from shared to
    shared
  • upc_memput(dst, src, size): copy from private to
    shared
  • upc_memget(dst, src, size): copy from shared to
    private (a put/get sketch follows below)
  • Equivalent of memset:
  • upc_memset(dst, char, size): initialize shared
    memory with a character
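A small sketch of the put/get pair. The array, block size, and helper
function are invented for this example:

    #include <upc.h>

    #define CHUNK 512
    shared [CHUNK] double grid[CHUNK*THREADS];  /* one block of CHUNK doubles per thread */
    double work[CHUNK];                          /* private scratch buffer */

    void scale_peer_block(int peer) {
      int i;
      /* pull the peer's block into private memory, process locally, push it back */
      upc_memget(work, &grid[peer*CHUNK], CHUNK * sizeof(double));
      for (i = 0; i < CHUNK; i++)
        work[i] *= 2.0;
      upc_memput(&grid[peer*CHUNK], work, CHUNK * sizeof(double));
    }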

92
Worksharing with upc_forall
  • Distributes independent iterations across threads
    in the way you wish, typically to boost locality
    exploitation
  • Simple C-like syntax and semantics
  • upc_forall(init; test; loop; expression)
        statement
  • Expression could be an integer expression or a
    reference to (address of) a shared object

93
Work Sharing upc_forall()
  • Example 1: exploiting locality
    shared int a[100], b[100], c[101];
    int i;
    upc_forall (i = 0; i < 100; i++; &a[i])
      a[i] = b[i] * c[i+1];
  • Example 2: distribution in a round-robin fashion
    shared int a[100], b[100], c[101];
    int i;
    upc_forall (i = 0; i < 100; i++; i)
      a[i] = b[i] * c[i+1];
  • Note: Examples 1 and 2 happen to result in the
    same distribution

94
Work Sharing upc_forall()
  • Example 3: distribution by chunks
    shared int a[100], b[100], c[101];
    int i;
    upc_forall (i = 0; i < 100; i++; (i*THREADS)/100)
      a[i] = b[i] * c[i+1];

95
UPC Outline
  • Background and Philosophy
  • UPC Execution Model
  • UPC Memory Model
  • UPC A Quick Intro
  • Data and Pointers
  • Dynamic Memory Management
  • Programming Examples

8. Synchronization 9. Performance Tuning and
Early Results 10. Concluding Remarks
96
Dynamic Memory Allocation in UPC
  • Dynamic memory allocation of shared memory is
    available in UPC
  • Functions can be collective or not
  • A collective function has to be called by every
    thread and will return the same value to all of
    them

97
Global Memory Allocation
  • shared void *upc_global_alloc(size_t nblocks,
    size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • Non-collective; expected to be called by one
    thread
  • The calling thread allocates a contiguous memory
    space in the shared space
  • If called by more than one thread, multiple
    regions are allocated and each thread which makes
    the call gets a different pointer
  • Space allocated per calling thread is equivalent
    to: shared [nbytes] char[nblocks * nbytes]
  • (Not yet implemented on Cray)

98
Collective Global Memory Allocation
  • shared void *upc_all_alloc(size_t nblocks, size_t
    nbytes);
  • nblocks: number of blocks; nbytes: block size
  • This function has the same result as
    upc_global_alloc, but this is a collective
    function, which is expected to be called by all
    threads
  • All the threads will get the same pointer
  • Equivalent to: shared [nbytes] char[nblocks *
    nbytes] (see the example below)
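For example, a sketch that builds one block per thread and frees it at the
end (the block size BLK and the variable names are invented for this
example):

    #include <upc.h>

    #define BLK 1024

    int main(void) {
      int i;
      /* collective: every thread makes the call and receives the same pointer */
      shared [BLK] double *v =
        (shared [BLK] double *) upc_all_alloc(THREADS, BLK * sizeof(double));

      /* each thread initializes the block it has affinity to, via a private pointer */
      double *mine = (double *) &v[MYTHREAD * BLK];
      for (i = 0; i < BLK; i++)
        mine[i] = 0.0;

      upc_barrier;
      if (MYTHREAD == 0)
        upc_free((shared void *) v);   /* upc_free is not collective */
      return 0;
    }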

99
Memory Freeing
  • void upc_free(shared void *ptr);
  • The upc_free function frees the dynamically
    allocated shared memory pointed to by ptr
  • upc_free is not collective

100
UPC Outline
  • Background and Philosophy
  • UPC Execution Model
  • UPC Memory Model
  • UPC A Quick Intro
  • Data and Pointers
  • Dynamic Memory Management
  • Programming Examples

8. Synchronization 9. Performance Tuning and
Early Results 10. Concluding Remarks
101
Example Matrix Multiplication in UPC
  • Given two integer matrices A(NxP) and B(PxM), we
    want to compute C = A × B.
  • Entries c[i][j] in C are computed by the formula:
    c[i][j] = Σ (l = 0 .. P-1) a[i][l] · b[l][j]

102
Doing it in C
    #include <stdlib.h>
    #include <time.h>
    #define N 4
    #define P 4
    #define M 4
    int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16},
        c[N][M];
    int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
    void main (void) {
      int i, j, l;
      for (i = 0; i < N; i++)
        for (j = 0; j < M; j++) {
          c[i][j] = 0;
          for (l = 0; l < P; l++)
            c[i][j] += a[i][l]*b[l][j];
        }
    }

Note: some compilers do not yet support the
initialization in declaration statements
103
Domain Decomposition for UPC
  • Exploits locality in matrix multiplication
  • A (N × P) is decomposed row-wise into blocks of
    size (N × P) / THREADS as shown below
  • B (P × M) is decomposed column-wise into blocks of
    M/THREADS columns as shown below

[Figure: A (N rows × P cols) split row-wise — Thread 0
holds elements 0 .. (N*P/THREADS)-1, Thread 1 holds
(N*P/THREADS) .. (2*N*P/THREADS)-1, …, Thread THREADS-1
holds ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1;
B (P × M) split column-wise — Thread 0 holds columns
0 .. (M/THREADS)-1, …, Thread THREADS-1 holds columns
((THREADS-1)*M)/THREADS .. (M-1)]
  • Note: N and M are assumed to be multiples of
    THREADS
104
UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P] =
  {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
// a and c are blocked shared matrices,
// initialization is not currently implemented
shared [M/THREADS] int b[P][M] =
  {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void) {
  int i, j, l;  // private variables
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b[l][j];
    }
  }
}
105
UPC Matrix Multiplication Code with block copy
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P], c[N][M];
// a and c are blocked shared matrices,
// initialization is not currently implemented
shared [M/THREADS] int b[P][M];
int b_local[P][M];
void main (void) {
  int i, j, l;  // private variables
  upc_memget(b_local, b, P*M*sizeof(int));
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l]*b_local[l][j];
    }
  }
}
106
UPC Outline
  • Background and Philosophy
  • UPC Execution Model
  • UPC Memory Model
  • UPC A Quick Intro
  • Data and Pointers
  • Dynamic Memory Management
  • Programming Examples

8. Synchronization 9. Performance Tuning and
Early Results 10. Concluding Remarks
107
Memory Consistency Models
  • Has to do with the ordering of shared operations
  • Under the relaxed consistency model, the shared
    operations can be reordered by the compiler /
    runtime system
  • The strict consistency model enforces sequential
    ordering of shared operations. (no shared
    operation can begin before the previously
    specified one is done)

108
Memory Consistency Models
  • User specifies the memory model through
  • declarations
  • pragmas for a particular statement or sequence of
    statements
  • use of barriers, and global operations
  • Consistency can be strict or relaxed
  • Programmers responsible for using correct
    consistency model

109
Memory Consistency
  • Default behavior can be controlled by the
    programmer
  • Use strict memory consistency:
  • #include <upc_strict.h>
  • Use relaxed memory consistency:
  • #include <upc_relaxed.h>

110
Memory Consistency
  • Default behavior can be altered for a variable
    definition using:
  • Type qualifiers: strict and relaxed
  • Default behavior can be altered for a statement
    or a block of statements using:
  • #pragma upc strict
  • #pragma upc relaxed
  • (illustrated in the sketch below)
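A minimal sketch of the statement-level pragma (the variable names are
invented; the file-level default is relaxed via upc_relaxed.h):

    #include <upc_relaxed.h>

    shared int data, flag;

    void publish(void) {
    #pragma upc strict      /* accesses in the rest of this block are strict */
      data = 1;
      flag = 1;
    }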

111
UPC Outline
  • Background and Philosophy
  • UPC Execution Model
  • UPC Memory Model
  • UPC A Quick Intro
  • Data and Pointers
  • Dynamic Memory Management
  • Programming Examples

8. Synchronization 9. Performance Tuning and
Early Results 10. Concluding Remarks
112
How to Exploit the Opportunities for Performance
Enhancement?
  • Compiler optimizations
  • Run-time system
  • Hand tuning

113
List of Possible Optimizations for UPC Codes
  • Space privatization: use private pointers instead
    of pointers-to-shared when dealing with local
    shared data (through casting and assignments)
  • Block moves: use block copy instead of copying
    elements one by one with a loop, through string
    operations or structures
  • Latency hiding: for example, overlap remote
    accesses with local processing using split-phase
    barriers
  • Vendors can also help by decreasing the cost of
    address translation and providing optimized
    standard libraries

114
Performance of Shared vs. Private Accesses (Old
COMPAQ Measurement)
Recent compiler developments have improved some
of that
115
Using Local Pointers Instead of pointer to shared
    int *pa = (int *) &A[i][0];
    int *pc = (int *) &C[i][0];
    upc_forall(i = 0; i < N; i++; &A[i][0]) {
      for (j = 0; j < P; j++)
        pa[j] += pc[j];
    }
  • Pointer arithmetic is faster using local pointers
    than pointers-to-shared
  • The pointer dereference can be one order of
    magnitude faster

116
Performance of UPC
  • UPC benchmarking results
  • Nqueens Problem
  • Matrix Multiplication
  • Sobel Edge detection
  • Stream and GUPS
  • NPB
  • Splash-2
  • Compaq AlphaServer SC and Origin 2000/3000
  • Check the web site for new measurements

117
Shared vs. Private Accesses (Recent SGI Origin
3000 Measurement)
STREAM BENCHMARK
118
Execution Time over SGI Origin 2000: NAS-EP Class A
119
Performance of Edge detection on the Origin 2000
Execution Time
Speedup
120
Execution Time over SGI Origin 2000: NAS-FT Class A
121
Execution Time over SGI Origin 2000: NAS-CG Class A
122
Execution Time over SGI Origin 2000: NAS-EP Class A
123
Execution Time over SGI Origin 2000: NAS-FT Class A
124
Execution Time over SGI Origin 2000: NAS-CG Class A
125
Execution Time over SGI Origin 2000: NAS-MG Class A
126
UPC Outline
  • Background and Philosophy
  • UPC Execution Model
  • UPC Memory Model
  • UPC A Quick Intro
  • Data and Pointers
  • Dynamic Memory Management
  • Programming Examples

8. Synchronization 9. Performance Tuning and
Early Results 10. Concluding Remarks
127
UPC Time-To-Solution = UPC Programming Time +
UPC Execution Time
Conclusions
  • Simple and Familiar View
  • Domain decomposition maintains global application
    view
  • No function calls
  • Concise Syntax
  • Remote writes with assignment to shared
  • Remote reads with expressions involving shared
  • Domain decomposition (mainly) implied in
    declarations (logical place!)
  • Data locality exploitation
  • No calls
  • One-sided communications
  • Low overhead for short accesses

128
Conclusions
  • UPC is easy to program in for C writers,
    significantly easier than alternative paradigms
    at times
  • UPC exhibits very little overhead when compared
    with MPI for problems that are embarrassingly
    parallel. No tuning is necessary.
  • For other problems compiler optimizations are
    happening but not fully there
  • With hand-tuning, UPC performance compared
    favorably with MPI
  • Hand tuned code, with block moves, is still
    substantially simpler than message passing code

129
Conclusions
  • Automatic compiler optimizations should focus on
  • Inexpensive address translation
  • Space Privatization for local shared accesses
  • Prefetching and aggregation of remote accesses,
    prediction is easier under the UPC model
  • More performance help is expected from optimized
    standard library implementations, especially
    collectives and I/O

130
References
  • The official UPC website: http://upc.gwu.edu
  • T. A. El-Ghazawi, W. W. Carlson, J. M. Draper. UPC
    Language Specifications V1.1 (http://upc.gwu.edu),
    May 2003.
  • François Cantonnet, Yiyi Yao, Smita Annareddy,
    Ahmed S. Mohamed, Tarek A. El-Ghazawi. Performance
    Monitoring and Evaluation of a UPC Implementation
    on a NUMA Architecture, International Parallel
    and Distributed Processing Symposium (IPDPS'03),
    Nice Acropolis Convention Center, Nice, France,
    2003.
  • Wei-Yu Chen, Dan Bonachea, Jason Duell, Parry
    Husbands, Costin Iancu, Katherine Yelick. A
    Performance Analysis of the Berkeley UPC
    Compiler, Proceedings of the 17th Annual
    International Conference on Supercomputing (ICS
    2003), San Francisco, CA, USA.
  • Tarek A. El-Ghazawi, François Cantonnet. UPC
    Performance and Potential: A NPB Experimental
    Study, SuperComputing 2002 (SC2002), IEEE,
    Baltimore, MD, USA, 2002.
  • Tarek A. El-Ghazawi, Sébastien Chauvin. UPC
    Benchmarking Issues, Proceedings of the
    International Conference on Parallel Processing
    (ICPP'01), IEEE CS Press, Valencia, Spain,
    September 2001.

131
CS267 Final Projects
  • Project proposal
  • Teams of 3 students, typically across departments
  • Interesting parallel application or system
  • Conference-quality paper
  • High performance is key
  • Understanding performance, tuning, scaling, etc.
  • More important than the difficulty of the problem
  • Leverage
  • Projects in other classes (but discuss with me
    first)
  • Research projects

132
Project Ideas
  • Applications
  • Implement existing sequential or shared memory
    program on distributed memory
  • Investigate SMP trade-offs (using only MPI versus
    MPI and thread based parallelism)
  • Tools and Systems
  • Effects of reordering on sparse matrix factoring
    and solves
  • Numerical algorithms
  • Improved solver for immersed boundary method
  • Use of multiple vectors (blocked algorithms) in
    iterative solvers

133
Project Ideas
  • Novel computational platforms
  • Exploiting hierarchy of SMP-clusters in
    benchmarks
  • Computing aggregate operations on ad hoc networks
    (Culler)
  • Push/explore limits of computing on the grid
  • Performance under failures
  • Detailed benchmarking and performance analysis,
    including identification of optimization
    opportunities
  • Titanium
  • UPC
  • IBM SP (Blue Horizon)

134
Hardware Limits to Software Innovation
  • Software send overhead for 8-byte messages over
    time.
  • Not improving much over time (even in absolute
    terms)