Title: Unified Parallel C (UPC)
1Unified Parallel C (UPC)
- Kathy Yelick
- UC Berkeley and LBNL
2UPC Projects
- GWU: http://upc.gwu.edu
- Benchmarking, language design
- MTU: http://www.upc.mtu.edu
- Language, benchmarking, MPI runtime for HP compiler
- UFL: http://www.hcs.ufl.edu/proj/upc
- Communication runtime (GASNet)
- UMD: http://www.cs.umd.edu/tseng/
- Benchmarks
- IDA: http://www.super.org
- Language, compiler for T3E
- Other companies (Intel, Sun, ...) and labs
3UPC Compiler Efforts
- HP: http://www.hp.com/go/upc
- Compiler, tests, language
- Etnus: http://www.etnus.com
- Debugger
- Intrepid: http://www.intrepid.com/upc
- Compiler based on gcc
- UCB/LBNL: http://upc.lbl.gov
- Compiler, runtime, applications
- IBM: http://www.ibm.com
- Compiler under development for SP line
- Cray: http://www.cray.com
- Compiler product for X1
4Comparison to MPI
- One-sided vs. two-sided communication models
- Programmability
- Two-sided works reasonably well for regular computation
- When computation is irregular/asynchronous, issuing receives can be difficult
- To simplify programming, communication is grouped into a phase, which limits overlap
- Performance
- Some hardware does one-sided communication
- RDMA support is increasingly common
5Communication Support Today
- Potential performance advantage for fine-grained, one-sided programs
- Potential productivity advantage for irregular applications
6MPI vs. PGAS Languages
- GASNet: portable, high-performance communication layer
- Compilation target for both UPC and Titanium
- Reference implementation over MPI 1.1 (AMMPI-based)
- Direct implementation over many vendor network APIs
- IBM LAPI, Quadrics Elan, Myrinet GM, Infiniband VAPI, Dolphin SCI, others on the way
- Applications: NAS parallel benchmarks (CG, MG)
- Standard benchmarks written in UPC by GWU
- Compiled using Berkeley UPC compiler
- Difference is the GASNet backend: MPI 1.1 vs. vendor API
- Also used HP/Compaq UPC compiler where available
- Caveats
- Not a comparison of MPI as a programming model
7Performance Difference Translates to Applications
- Bulk-synchronous NAS MG and CG codes in UPC
- Elan-based layer beats MPI
- Performance and scaling
- The only difference in the Berkeley lines is the network API!
- Machine: Alpha/Quadrics (Lemieux)
- Source: Bonachea and Duell
8Performance Difference Translates to Applications
- Apps on GM-based layer beat apps on MPI-based layer by 20%
- The only difference is the network API!
- Machine
- Pentium 3 / Myrinet
- NERSC Alvarez cluster
9Performance Difference Translates to Applications
- App on LAPI-based layer provides significantly better absolute performance and scaling than the same app on MPI-based layer
- The only difference is the network API!
- Machine: IBM SP (Seaborg at NERSC)
10Productivity
- Productivity is hard to measure
- Lines (or characters) of code are easy to measure
- May not reflect programmability, but if the same algorithms are used, it can reveal some differences
- Fast fine-grained communication is useful
- Incremental program development
- Inherently fine-grained applications
- Compare performance of these fine-grained versions
11Productivity Study (El-Ghazawi et al., GWU)
All line counts are the number of real code lines (no comments, no blanks). (1) The sequential code is coded in C, except for NAS-EP and FT, which are coded in Fortran. (2) The sequential code is always in C.
12Fine-Grained Applications have Larger Spread
- Machine
- HP Alpha/Quadrics (Lemieux)
- Benchmark
- Naïve CG with fine-grained remote accesses
- For comparison purposes
- All versions scale poorly due to the naïve algorithm, as expected
- Absolute performance: the Elan version is more than 4x faster!
- Means more work for application programmers in MPI
- Elan-based layer more suitable for
- incremental application development and fine-grained algorithms
13A Brief Look at the Past
- Conjugate Gradient dominated by sparse matrix-vector multiply
- Longer term, identify large application
- Same fine-grained version used previously
- Shows advantage of the T3E network model and UPC
- Will we get a machine like this again?
14Goals of the Berkeley UPC Project
- Make UPC Ubiquitous
- Parallel machines
- Workstations and PCs for development
- A portable compiler for future machines too
- Research in compiler optimizations for parallel languages
- Demonstration of UPC on real applications
- Ongoing language development with the UPC Consortium
- Collaboration between LBNL and UCB
15Example Berkeley UPC Compiler
- Compiler based on Open64
- Multiple front-ends, including gcc
- Intermediate form called WHIRL
- Current focus on C backend
- IA64 possible in future
- UPC Runtime
- Pointer representation
- Shared/distribute memory
- Communication in GASNet
- Portable
- Language-independent
[Diagram: compiler flow — UPC source -> Higher WHIRL -> optimizing transformations -> Lower WHIRL -> either C + Runtime or Assembly (IA64, MIPS, ...) + Runtime]
16Portability Strategy in UPC Compiler
Runtime Layers
- Generation of C code from translator
- Layered approach to runtime
- Core GASNet API
- Most basic required primitives, as narrow and general as possible
- Implemented directly on each platform
- Based heavily on the active messages paradigm
- Extended API
- Wider interface that includes more complicated operations
- Reference implementation provided in terms of core
- Implementers may tune for network
- UPC Runtime
- Pointer representation (specific to UPC, possibly to machine)
- Thread implementation
[Diagram: layered stack — Compiler-generated code / Language-specific runtime / GASNet Extended API / GASNet Core API / Network Hardware]
17Portability of Berkeley UPC Compiler
- Make UPC Ubiquitous
- Current and future parallel machines
- Workstations and PCs for development
- Ports of Berkeley UPC Compiler
- OS: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS Windows (Cygwin), Mac OS X, Unicos, SuperUX
- CPU: x86, Itanium, Alpha, PowerPC, PA-RISC
- Supercomputers: Cray T3E, Cray X-1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000
- Recently added a net-compile option
- Only install the runtime system locally
- Runtime ported to POSIX threads (direct load/store)
- Runs on SGI Altix as well as SMPs
- GASNet tuned to vendor-supplied communication layer
- Myrinet GM, Quadrics Elan, Mellanox Infiniband VAPI, IBM LAPI, Cray X1, Cray/SGI SHMEM
18Pointer-to-Shared Phases
- UPC has three different kinds of pointers
- Block-cyclic
- shared [4] double a[n];
- Cyclic
- shared double a[n];
- Indefinite (always local)
- shared [0] double *a = (shared [0] double *) upc_alloc(n);
- A pointer needs a phase to keep track of its relative position within a block
- Source of overhead for updating and dereferencing
- Special case for phaseless pointers
- Cyclic pointers always have phase 0
- Indefinite blocked pointers only have one block
- Don't need to keep a phase for cyclic and indefinite ("phaseless") pointers
- Don't need to update the thread id for indefinite pointers
19Accessing Shared Memory in UPC
[Diagram: a pointer-to-shared locates an element in shared memory via the start address of the array object, the block size, and the phase (offset within the current block), across Thread 0 .. Thread N-1.]
20Pointer-to-Shared Representation
- Shared pointer representation trade-offs
- Use of scalar types (long) rather than a struct may improve backend code quality
- Faster pointer manipulation, e.g., pointer-plus-int and dereferencing
- Important in C, because array references are based on pointers
- Pointer size is important to performance
- Use of smaller types (64 bits rather than 128 bits) may allow pointers to reside in a single register
- But very large machines may require a longer pointer type
- Consider two different machines
- 2048-processor machine with 16 GB/processor -> 128 bits
- 64-processor machine with 2 GB/processor -> 64 bits
- 6 bits for thread, 31 bits of address, 27 bits for phase -> 64 bits
- Portability and performance balance in the UPC compiler
- The pointer representation is hidden in the runtime layer
- Can easily switch at compiler installation time
21Performance of Shared Pointer Arithmetic
1 cycle = 1.5 ns
- Phaseless pointers are an important optimization
- Indefinite pointers almost as fast as regular C pointers
- Packing also helps, especially for pointer-plus-int addition
22Comparison with HP UPC v1.7
1 cycle = 1.5 ns
- HP a little faster, due to it generating assembly code
- Gap for addition likely smaller with further optimizations
23Cost of Shared Memory Access
- Local accesses somewhat slower than private accesses
- Remote accesses significantly worse, as expected
24Optimizing Explicitly Parallel Code
- Compiler optimizations for parallel languages
- Enabled optimizations in Open64 base
- Static analyses for parallel code
- Problem is to understand when code motion is legal without changing the views from other processors
- Extended cycle detection to arrays with three different algorithms [LCPC '03]
- Message strip-mining
- Packing messages is good, but it can go too far
- Use a performance model to strip-mine messages into smaller chunks to optimize overlap [VECPAR '04]
- Automatic message vectorization (packing) underway
25Performance Example
- Performance of the Berkeley MG UPC code
- HP (Lemieux, left) includes MPI comparison
26Berkeley UPC on the X1
48x
- Translator-generated C code usually vectorizes as well as the original C code
- Source-to-source translation a reasonable strategy
- Work needed for 3D arrays
27GASNet/X1 Performance
Puts
Gets
- GASNet/X1 improves small message performance over shmem and MPI
- GASNet/X1 communication can be integrated seamlessly into long computation loops and is vectorizable
- GASNet/X1 operates directly on global pointers
28NAS CG OpenMP style vs. MPI style
- GAS language outperforms MPI/Fortran (flat is good!)
- Fine-grained (OpenMP-style) version still slower
- Shared memory programming style leads to more overhead (redundant boundary computation)
- GAS languages can support both programming styles
29EP on Alpha/Quadrics (GWU Bench)
30IS on Alpha/Quadrics (GWU Bench)
31MG on Alpha/Quadrics (Berkeley version)
32Multigrid on Cray X1
- Performance similar to MPI
- Cray C does not automatically vectorize/multistream (requires addition of pragmas)
- 4 SSP slightly better than 1 MSP; 2 MSP much better than 8 SSP (cache conflict caused by layout of private data)
33Integer Sort
- Benchmark written in bulk-synchronous style
- Performance is similar to MPI
- Code does not vectorize; even the best performer is much slower than a cache-based superscalar architecture
34Fine-grained Irregular Accesses UPC GUPS
- Hard to control vectorization of fine-grained accesses
- Temporary variables, casts, etc.
- Communication libraries may help
35Recent Progress on Applications
- Application demonstration of UPC
- NAS PB-size problems
- Berkeley NAS MG avoids most global barriers and relies on the UPC relaxed memory model
- Berkeley NAS CG has several versions, including simpler, fine-grained communication
- Algorithms that are challenging in MPI
- 2D Delaunay Triangulation [SIAM PP '04]
- AMR in UPC: Chombo (non-adaptive) Poisson solver
36Progress in Language
- Group is active in UPC Consortium meetings, mailing list, SC booth, etc.
- Recent language-level work
- Specification of the UPC memory model in progress
- Joint with MTU
- Behavioral spec [Dagstuhl '03]
- UPC I/O nearly finalized
- Joint with GWU and ANL
- UPC Collectives v1.0 finalized
- Effort led by MTU
- Improvements/updates to the UPC Language Spec
- Led by IDA
37Center Overview
- Broad collaboration between three groups
- Library efforts: MPI, ARMCI, GA, OpenMP
- Language efforts: UPC, CAF, Titanium
- New model investigations: multi-threading, memory consistency models
- Led by Rusty Lusk at ANL
- Major focus is a common runtime system
- GASNet for UPC, Titanium, and (soon) CAF
- Also common compiler
- CAF, UPC, and OpenMP work based on Open64
38Progress on UPC Runtime
- Cross-language support: Berkeley UPC and MPI
- Calling MPI from UPC
- Calling UPC from MPI
- Runtime for gcc-based UPC compiler by Intrepid
- Interface UPC compiler to parallel collectives libraries (end of FY04)
- Reference implementation just released by HP/MTU
- Thread version of the Berkeley UPC runtime layer
- Evaluating performance on hybrid GASNet systems
39Progress on GASNet
- GASNet: Myrinet GM, Quadrics Elan-3, IBM LAPI, UDP, MPI, Infiniband
- Ongoing: SCI (with UFL), Cray X1 / SGI Shmem, and reviewing future Myrinet and latest Elan-4
- Extension to GASNet to support strided and scatter/gather communication
- Also proposed support for UPC bulk copy
- Analysis of MPI one-sided for GAS languages
- Problems with the synchronization model
- Multiple protocols for managing pinned memory in Direct Memory Addressing systems [CAC '03]
- Depends on language usage as well as network architecture
40Future Plans
- Architecture-specific GASNet for scatter-gather and strided hardware support
- Needed for CAF and for UPC with message vectorization
- Optimized collective communication library
- Spec agreed on in 2003
- New reference implementation
- Developing GASNet extension for building optimized collectives
- Application- and architecture-driven optimization
- Interface to the UPC I/O library
- Evaluate GASNet on machines with non-cache-coherent shared memory
- BlueGene/L and NEC SX-6
41Try It Out
- Download from the Berkeley UPC web page
- http://upc.lbl.gov
- May just get the runtime system (includes GASNet)
- Net-compile is the default
- Runtime is easier to install
- New release planned for this summer
- Not quite an open development model
- We publicize a latest stable version that is not fully tested
- Let us know what happens (good and bad)
- Mail upc_at_lbl.gov
42UPC Outline
- Background and Philosophy
- UPC Execution Model
- UPC Memory Model
- Data and Pointers
- Dynamic Memory Management
- Programming Examples
- Synchronization
- Performance Tuning and Early Results
- Concluding Remarks
43Context
- Most parallel programs are written using either
- Message passing with a SPMD model
- Usually for scientific applications with C/Fortran
- Scales easily
- Shared memory with threads in OpenMP, Threads+C/C++/Fortran, or Java
- Usually for non-scientific applications
- Easier to program, but less scalable performance
- Global Address Space (GAS) languages take the best of both
- Global address space like threads (programmability)
- SPMD parallelism like MPI (performance)
- Local/global distinction, i.e., layout matters (performance)
44Partitioned Global Address Space Languages
- Explicitly-parallel programming model with SPMD parallelism
- Fixed at program start-up, typically 1 thread per processor
- Global address space model of memory
- Allows programmer to directly represent distributed data structures
- Address space is logically partitioned
- Local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions
- Data layout and communication
- Performance transparency and tunability are goals
- Initial implementation can use fine-grained shared memory
- Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)
45Global Address Space Eases Programming
[Diagram: Thread 0 .. Thread n share a partitioned global address space holding X[0], X[1], ..., X[P]; each thread also has a private area, and private pointers (ptr) may reference shared data.]
- The languages share the global address space abstraction
- Shared memory is partitioned by processors
- Remote memory may stay remote: no automatic caching implied
- One-sided communication through reads/writes of shared variables
- Both individual and bulk memory copies
- Differ on details
- Some models have a separate private memory area
- Distributed array generality and how they are constructed
46One-Sided Communication Is Sometimes Faster
- Potential performance advantage for fine-grained, one-sided programs
- Potential productivity advantage for irregular applications
47Current Implementations
- A successful language/library must run everywhere
- UPC
- Commercial compilers available on Cray, SGI, HP machines
- Open source compiler from LBNL/UCB (and another from MTU)
- CAF
- Commercial compiler available on Cray machines
- Open source compiler available from Rice
- Titanium (Friday)
- Open source compiler from UCB runs on most machines
- Common tools
- Open64: open source research compiler infrastructure
- ARMCI, GASNet for distributed memory implementations
- Pthreads, System V shared memory
48UPC Overview and Design Philosophy
- Unified Parallel C (UPC) is
- An explicit parallel extension of ANSI C
- A partitioned global address space language
- Sometimes called a GAS language
- Similar to the C language philosophy
- Programmers are clever and careful, and may need to get close to hardware
- to get performance, but
- can get in trouble
- Concise and efficient syntax
- Common and familiar syntax and semantics for parallel C, with simple extensions to ANSI C
- Based on ideas in Split-C, AC, and PCP
49UPC Execution Model
50UPC Execution Model
- A number of threads working independently in a SPMD fashion
- Number of threads specified at compile-time or run-time; available as the program variable THREADS
- MYTHREAD specifies thread index (0..THREADS-1)
- upc_barrier is a global synchronization: all wait
- There is a form of parallel loop that we will see later
- There are two compilation modes
- Static Threads mode
- THREADS is specified at compile time by the user
- The program may use THREADS as a compile-time constant
- Dynamic threads mode
- Compiled code may be run with varying numbers of threads
51Hello World in UPC
- Any legal C program is also a legal UPC program
- If you compile and run it as UPC with P threads, it will run P copies of the program.
- Using this fact, plus the identifiers from the previous slides, we can write a parallel hello world:

  #include <upc.h>   /* needed for UPC extensions */
  #include <stdio.h>

  main() {
    printf("Thread %d of %d: hello UPC world\n",
           MYTHREAD, THREADS);
  }
52Example Monte Carlo Pi Calculation
- Estimate pi by throwing darts at a unit square
- Calculate the percentage that fall in the unit circle
- Area of square: r^2 = 1
- Area of circle quadrant: (1/4) * pi * r^2 = pi/4
- Randomly throw darts at (x,y) positions
- If x^2 + y^2 < 1, then the point is inside the circle
- Compute the ratio:
- points inside / points total
- pi = 4 * ratio
53Pi in UPC
- Independent estimates of pi:

  main(int argc, char **argv) {
    int i, hits, trials = 0;
    double pi;
    if (argc != 2) trials = 1000000;
    else trials = atoi(argv[1]);
    srand(MYTHREAD*17);
    for (i=0; i < trials; i++) hits += hit();
    pi = 4.0*hits/trials;
    printf("PI estimated to %f.", pi);
  }
54Helper Code for Pi in UPC
- Required includes:

  #include <stdio.h>
  #include <math.h>
  #include <upc.h>

- Function to throw a dart and calculate where it hits:

  int hit() {
    int const rand_max = 0xFFFFFF;
    double x = ((double) (rand() % rand_max)) / rand_max;
    double y = ((double) (rand() % rand_max)) / rand_max;
    if ((x*x + y*y) < 1.0) return 1;
    else return 0;
  }
55UPC Memory Model
- Scalar Variables
- Distributed Arrays
- Pointers to shared data
56Private vs. Shared Variables in UPC
- Normal C variables and objects are allocated in the private memory space for each thread.
- Shared variables are allocated only once, with thread 0:
- shared int ours;
- int mine;
- Simple shared variables of this kind may not occur within a function definition

[Diagram: ours lives once in the shared space, with affinity to thread 0; each of Thread 0 .. Thread n holds its own private mine.]
57Pi in UPC (Cooperative Version)
- Parallel computation of pi, but with a race condition:

  shared int hits;                      /* shared variable to record hits */
  main(int argc, char **argv) {
    int i, my_hits = 0;
    int trials = atoi(argv[1]);
    my_trials = (trials + THREADS - 1
                 - MYTHREAD)/THREADS;   /* divide work up evenly */
    srand(MYTHREAD*17);
    for (i=0; i < my_trials; i++)
      hits += hit();                    /* accumulate hits */
    upc_barrier;
    if (MYTHREAD == 0)
      printf("PI estimated to %f.", 4.0*hits/trials);
  }
58Pi in UPC (Cooperative Version)
- The race condition can be fixed in several ways
- Add a lock around the hits increment (later)
- Have each thread update a separate counter
- Have one thread compute the sum
- Use a collective to compute the sum (recently added to UPC)

  shared int all_hits[THREADS];  /* shared by all threads, just as hits was */
  main(int argc, char **argv) {
    /* declarations and initialization code omitted */
    for (i=0; i < my_trials; i++)
      all_hits[MYTHREAD] += hit();
    upc_barrier;
    if (MYTHREAD == 0) {
      for (i=0; i < THREADS; i++) hits += all_hits[i];
      printf("PI estimated to %f.", 4.0*hits/trials);
    }
  }

- Where does all_hits live?
59Shared Arrays Are Cyclic By Default
- Shared array elements are spread across the threads:
- shared int x[THREADS];     /* 1 element per thread */
- shared int y[3][THREADS];  /* 3 elements per thread */
- shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
- In the pictures below
- Assume THREADS = 4
- Elements with affinity to processor 0 are red
- As a 2D array, y is logically blocked by columns

[Figure: layouts of x, y, and z across 4 threads]
60Example Vector Addition
- Questions about parallel vector addition
- How to lay out the data (here it is cyclic)
- Which processor does what (here it is "owner computes")

  /* vadd.c */
  #include <upc_relaxed.h>
  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];   /* cyclic layout */

  void main() {
    int i;
    for (i=0; i<N; i++)
      if (MYTHREAD == i%THREADS)     /* owner computes */
        sum[i] = v1[i] + v2[i];
  }
61Vector Addition with upc_forall
- The loop in vadd is common, so there is upc_forall
- 4th argument is an int expression that gives the affinity
- Iteration executes when affinity%THREADS is MYTHREAD

  /* vadd.c */
  #include <upc_relaxed.h>
  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];

  void main() {
    int i;
    upc_forall(i=0; i<N; i++; i)
      sum[i] = v1[i] + v2[i];
  }
62Work Sharing with upc_forall()
- Iterations are independent
- Each thread gets a bunch of iterations
- Simple C-like syntax and semantics
- upc_forall(init; test; loop; affinity)
-     statement;
- Affinity field used to distribute the work
- Cyclic (round-robin) distribution
- Blocked (chunks of iterations) distribution
- Semantics are undefined if there are dependencies between iterations executed by different threads
- Programmer has indicated iterations are independent
63UPC Matrix Vector Multiplication Code
- Here is one possible matrix-vector multiplication:

  #include <upc_relaxed.h>
  shared int a[THREADS][THREADS];
  shared int b[THREADS], c[THREADS];

  void main (void) {
    int i, j, l;
    upc_forall(i = 0; i < THREADS; i++; i) {
      c[i] = 0;
      for (l = 0; l < THREADS; l++)
        c[i] += a[i][l]*b[l];
    }
  }
64Data Distribution
[Diagram: default cyclic layout — rows of A and elements of B and C spread round-robin across Thread 0, Thread 1, Thread 2.]
65A Better Data Distribution
[Diagram: improved layout — contiguous blocks of A's rows assigned to Th. 0, Th. 1, Th. 2, with the matching elements of B and C on the same threads.]
66Layouts in General
- All non-array shared variables have affinity with thread zero.
- Array layouts are controlled by layout specifiers:
- shared [b] double x[n];
- Groups of b elements are wrapped around
- An empty [] gives a cyclic layout of the data in a 1D view
- layout specifier: [integer_expression]
- The affinity of an array element is defined in terms of the block size (a compile-time constant) and THREADS (a runtime constant).
- Element i has affinity with thread (i / block_size) % THREADS.
67Layout Terminology
- Notation is from HPF, but the terminology is language-independent
- Assume there are 4 processors
(Block, )
(, Block)
(Block, Block)
(Cyclic, )
(Cyclic, Block)
(Cyclic, Cyclic)
682D Array Layouts in UPC
- Array a1 has a row layout and array a2 has a block row layout:
- shared [m] int a1[n][m];
- shared [k*m] int a2[n][m];
- If (k+m) % THREADS == 0 then a3 has a row layout:
- shared int a3[n][m+k];
- To get more general HPF- and ScaLAPACK-style 2D blocked layouts, one needs to add dimensions.
- Assume r*c = THREADS:
- shared [b1][b2] int a5[m][n][r][c][b1][b2];
- or equivalently
- shared [b1*b2] int a5[m][n][r][c][b1][b2];
69UPC Matrix Vector Multiplication Code
- Matrix-vector multiplication with a better layout:

  #include <upc_relaxed.h>
  shared [THREADS] int a[THREADS][THREADS];
  shared int b[THREADS], c[THREADS];

  void main (void) {
    int i, j, l;
    upc_forall(i = 0; i < THREADS; i++; i) {
      c[i] = 0;
      for (l = 0; l < THREADS; l++)
        c[i] += a[i][l]*b[l];
    }
  }
70Example Matrix Multiplication in UPC
- Given two integer matrices A (NxP) and B (PxM)
- Compute C = A x B.
- Entries C[i][j] in C are computed by the formula C[i][j] = sum over l of A[i][l] * B[l][j]
71Matrix Multiply in C
  #include <stdlib.h>
  #include <time.h>
  #define N 4
  #define P 4
  #define M 4

  int a[N][P], c[N][M];
  int b[P][M];

  void main (void) {
    int i, j, l;
    for (i = 0; i < N; i++)
      for (j = 0; j < M; j++) {
        c[i][j] = 0;
        for (l = 0; l < P; l++) c[i][j] += a[i][l]*b[l][j];
      }
  }
72Domain Decomposition for UPC
- Exploits locality in matrix multiplication
- A (N x P) is decomposed row-wise into blocks of size (N x P) / THREADS as shown below
- B (P x M) is decomposed column-wise into M/THREADS blocks as shown below

Row-wise blocking of A:
  Thread 0: elements 0 .. (N*P/THREADS)-1
  Thread 1: elements (N*P/THREADS) .. (2*N*P/THREADS)-1
  ...
  Thread THREADS-1: elements ((THREADS-1)*N*P)/THREADS .. (THREADS*N*P/THREADS)-1
Column-wise blocking of B:
  Thread 0: columns 0 .. (M/THREADS)-1
  ...
  Thread THREADS-1: columns ((THREADS-1)*M)/THREADS .. M-1

- Note: N and M are assumed to be multiples of THREADS
73UPC Matrix Multiplication Code
  /* mat_mult_1.c */
  #include <upc_relaxed.h>
  #define N 4
  #define P 4
  #define M 4

  shared [N*P/THREADS] int a[N][P], c[N][M];
  /* a and c are row-wise blocked shared matrices */
  shared [M/THREADS] int b[P][M];  /* column-wise blocking */

  void main (void) {
    int i, j, l;  /* private variables */
    upc_forall(i = 0; i < N; i++; &c[i][0]) {
      for (j = 0; j < M; j++) {
        c[i][j] = 0;
        for (l = 0; l < P; l++) c[i][j] += a[i][l]*b[l][j];
      }
    }
  }
74Notes on the Matrix Multiplication Example
- The UPC code for the matrix multiplication is almost the same size as the sequential code
- Shared variable declarations include the keyword shared
- Making a private copy of matrix B in each thread might result in better performance, since many remote memory operations can be avoided
- Can be done with the help of upc_memget
75Pointers to Shared vs. Arrays
- In the C tradition, arrays can be accessed through pointers
- Here is the vector addition example using pointers:

  #include <upc_relaxed.h>
  #define N 100*THREADS
  shared int v1[N], v2[N], sum[N];

  void main() {
    int i;
    shared int *p1, *p2;
    p1 = v1; p2 = v2;          /* p1 points into v1, p2 into v2 */
    for (i=0; i<N; i++, p1++, p2++)
      if (i % THREADS == MYTHREAD)
        sum[i] = *p1 + *p2;
  }
76UPC Pointers
Where does the pointer reside? Where does it point?

  int *p1;               /* private pointer to local memory */
  shared int *p2;        /* private pointer to shared space */
  int *shared p3;        /* shared pointer to local memory */
  shared int *shared p4; /* shared pointer to shared space */

Shared-to-private (p3) is not recommended.
77UPC Pointers
[Diagram: p3 and p4 live in the shared space; each of Thread 0 .. Thread n holds private copies of p1 and p2.]

  int *p1;               /* private pointer to local memory */
  shared int *p2;        /* private pointer to shared space */
  int *shared p3;        /* shared pointer to local memory */
  shared int *shared p4; /* shared pointer to shared space */

Pointers to shared often require more storage and are more costly to dereference; they may refer to local or remote memory.
78Common Uses for UPC Pointer Types
- int *p1;
- These pointers are fast
- Use to access private data in parts of the code performing local work
- Often cast a pointer-to-shared to one of these to get faster access to shared data that is local
- shared int *p2;
- Use to refer to remote data
- Larger and slower due to the test-for-local; possible communication
- int *shared p3;
- Not recommended
- shared int *shared p4;
- Use to build shared linked structures, e.g., a linked list
79UPC Pointers
- In UPC, pointers to shared objects have three fields
- thread number
- local address of block
- phase (specifies position in the block)
- Example: Cray T3E implementation

  Phase: bits 63-49 | Thread: bits 48-38 | Virtual address: bits 37-0
80UPC Pointers
- Pointer arithmetic supports blocked and non-blocked array distributions
- Casting of shared to private pointers is allowed, but not vice versa!
- When casting a pointer-to-shared to a private pointer, the thread number of the pointer-to-shared may be lost
- Casting of shared to private is well defined only if the object pointed to by the pointer-to-shared has affinity with the thread performing the cast
81Special Functions
- size_t upc_threadof(shared void *ptr): returns the number of the thread that has affinity to the pointer-to-shared
- size_t upc_phaseof(shared void *ptr): returns the index (position within the block) field of the pointer-to-shared
- size_t upc_addrfield(shared void *ptr): returns the address of the block which is pointed at by the pointer-to-shared
- shared void *upc_resetphase(shared void *ptr): resets the phase to zero
82Synchronization
- No implicit synchronization among the threads
- UPC provides many synchronization mechanisms
- Barriers (blocking)
- upc_barrier
- Split-phase barriers (non-blocking)
- upc_notify
- upc_wait
- An optional label allows matching of barriers
- Locks
83Synchronization - Locks
- In UPC, shared data can be protected against multiple writers:
- void upc_lock(upc_lock_t *l)
- int upc_lock_attempt(upc_lock_t *l)  /* returns 1 on success and 0 on failure */
- void upc_unlock(upc_lock_t *l)
- Locks can be allocated dynamically; dynamically allocated locks can be freed
- Dynamic locks come properly initialized; static locks need initialization
84Corrected version Pi Example
- Parallel computation of pi, without the race condition:

  shared int hits;
  main(int argc, char **argv) {
    int i, my_hits = 0;
    upc_lock_t *hit_lock = upc_all_lock_alloc();  /* all threads collectively
                                                     allocate the lock */
    /* ... initialization of trials, my_trials, srand code omitted ... */
    for (i=0; i < my_trials; i++)
      my_hits += hit();
    upc_lock(hit_lock);
    hits += my_hits;        /* update in critical region */
    upc_unlock(hit_lock);
    upc_barrier;
    if (MYTHREAD == 0)
      printf("PI estimated to %f.", 4.0*hits/trials);
    upc_lock_free(hit_lock);
  }
85Memory Consistency in UPC
- The consistency model of shared memory accesses is controlled by designating accesses as strict, relaxed, or unqualified (the default).
- There are several ways of designating the ordering type.
- A type qualifier, strict or relaxed, can be used to affect all variables of that type.
- Labels strict or relaxed can be used to control the accesses within a statement:
- strict: { x = y; z = y+1; }
- A strict or relaxed cast can be used to override the current label or type qualifier.
86Synchronization- Fence
- UPC provides a fence construct
- Equivalent to a null strict reference, with the syntax:
- upc_fence;
- UPC ensures that all shared references issued before the upc_fence are complete
87Matrix Multiplication with Blocked Matrices
  #include <upc_relaxed.h>
  shared [N*P/THREADS] int a[N][P], c[N][M];
  shared [M/THREADS] int b[P][M];
  int b_local[P][M];

  void main (void) {
    int i, j, l;  /* private variables */
    upc_memget(b_local, b, P*M*sizeof(int));
    upc_forall(i = 0; i < N; i++; &c[i][0]) {
      for (j = 0; j < M; j++) {
        c[i][j] = 0;
        for (l = 0; l < P; l++)
          c[i][j] += a[i][l]*b_local[l][j];
      }
    }
  }
88Shared and Private Data
- Assume THREADS = 4
- shared [3] int A[4][THREADS];
- will result in the following data layout:

  Thread 0: A[0][0], A[0][1], A[0][2], A[3][0], A[3][1], A[3][2]
  Thread 1: A[0][3], A[1][0], A[1][1], A[3][3]
  Thread 2: A[1][2], A[1][3], A[2][0]
  Thread 3: A[2][1], A[2][2], A[2][3]
89UPC Pointers
[Figure: pointer arithmetic on a shared array — a pointer-to-shared dp and the elements reached by dp+1 .. dp+9, advancing through X[0]..X[15] and wrapping from thread to thread as each block is exhausted.]
90UPC Pointers
[Figure: a second pointer-arithmetic example over X[6]..X[15], showing dp+1 .. dp+9 from a different starting element.]
91Bulk Copy Operations in UPC
- UPC provides standard library functions to move data to/from shared memory
- Can be used to move chunks in the shared space or between shared and private spaces
- Equivalent of memcpy:
- upc_memcpy(dst, src, size): copy from shared to shared
- upc_memput(dst, src, size): copy from private to shared
- upc_memget(dst, src, size): copy from shared to private
- Equivalent of memset:
- upc_memset(dst, char, size): initializes shared memory with a character
92Worksharing with upc_forall
- Distributes independent iterations across threads as you wish, typically to boost locality exploitation
- Simple C-like syntax and semantics
- upc_forall(init; test; loop; expression)
-     statement;
- Expression could be an integer expression or a reference to (address of) a shared object
93Work Sharing upc_forall()
- Example 1: Exploiting locality

  shared int a[100], b[100], c[101];
  int i;
  upc_forall (i=0; i<100; i++; &a[i])
    a[i] = b[i] * c[i+1];

- Example 2: distribution in a round-robin fashion

  shared int a[100], b[100], c[101];
  int i;
  upc_forall (i=0; i<100; i++; i)
    a[i] = b[i] * c[i+1];

- Note: Examples 1 and 2 happen to result in the same distribution
94Work Sharing upc_forall()
- Example 3: distribution by chunks

  shared int a[100], b[100], c[101];
  int i;
  upc_forall (i=0; i<100; i++; (i*THREADS)/100)
    a[i] = b[i] * c[i+1];
95UPC Outline
- Background and Philosophy
- UPC Execution Model
- UPC Memory Model
- UPC A Quick Intro
- Data and Pointers
- Dynamic Memory Management
- Programming Examples
- Synchronization
- Performance Tuning and Early Results
- Concluding Remarks
96Dynamic Memory Allocation in UPC
- Dynamic memory allocation of shared memory is available in UPC
- Functions can be collective or not
- A collective function has to be called by every thread and will return the same value to all of them
97Global Memory Allocation
- shared void *upc_global_alloc(size_t nblocks, size_t nbytes)
- nblocks: number of blocks; nbytes: block size
- Non-collective; expected to be called by one thread
- The calling thread allocates a contiguous memory space in the shared space
- If called by more than one thread, multiple regions are allocated and each calling thread gets a different pointer
- Space allocated per calling thread is equivalent to shared [nbytes] char[nblocks * nbytes]
- (Not yet implemented on Cray)
98Collective Global Memory Allocation
- shared void *upc_all_alloc(size_t nblocks, size_t nbytes)
- nblocks: number of blocks; nbytes: block size
- This function allocates the same amount of memory as upc_global_alloc, but it is a collective function, expected to be called by all threads
- All the threads will get the same pointer
- Equivalent to shared [nbytes] char[nblocks * nbytes]
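The difference between the two allocators can be modeled in plain C (a sketch with illustrative helper names; real UPC returns pointers-to-shared): T threads each calling upc_global_alloc produce T separate regions, while one collective upc_all_alloc call produces a single region all threads point at.

```c
#include <assert.h>
#include <stddef.h>

/* Both allocators size a region as nblocks * nbytes bytes,
 * i.e. shared [nbytes] char[nblocks * nbytes]. */
size_t region_bytes(size_t nblocks, size_t nbytes) {
    return nblocks * nbytes;
}

/* upc_global_alloc is non-collective: each calling thread gets its
 * own region, so T callers reserve T * nblocks * nbytes in total. */
size_t global_alloc_total(size_t callers, size_t nblocks, size_t nbytes) {
    return callers * region_bytes(nblocks, nbytes);
}

/* upc_all_alloc is collective: every thread receives the same
 * pointer, so exactly one region exists regardless of thread count. */
size_t all_alloc_total(size_t nblocks, size_t nbytes) {
    return region_bytes(nblocks, nbytes);
}
```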
99Memory Freeing
- void upc_free(shared void *ptr)
- The upc_free function frees the dynamically allocated shared memory pointed to by ptr
- upc_free is not collective
100UPC Outline
- Background and Philosophy
- UPC Execution Model
- UPC Memory Model
- UPC A Quick Intro
- Data and Pointers
- Dynamic Memory Management
- Programming Examples
- Synchronization
- Performance Tuning and Early Results
- Concluding Remarks
101Example Matrix Multiplication in UPC
- Given two integer matrices A(N×P) and B(P×M), we want to compute C = A × B
- Entries c_ij in C are computed by the formula c_ij = Σ (l = 1..P) a_il * b_lj
102Doing it in C
#include <stdlib.h>
#include <time.h>
#define N 4
#define P 4
#define M 4
int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void)
{
  int i, j, l;
  for (i = 0; i < N; i++)
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l] * b[l][j];
    }
}
Note: some compilers do not yet support initialization in declaration statements
103Domain Decomposition for UPC
- Exploits locality in matrix multiplication
- A (N × P) is decomposed row-wise into blocks of size (N × P)/THREADS as shown below
- B (P × M) is decomposed column-wise into blocks of M/THREADS columns as shown below
[Figure: A's elements assigned per thread: Thread 0 gets 0..(N*P/THREADS)-1, Thread 1 gets (N*P/THREADS)..(2*N*P/THREADS)-1, ..., Thread THREADS-1 gets ((THREADS-1)*N*P)/THREADS..(THREADS*N*P/THREADS)-1; B's columns assigned per thread: Thread 0 gets columns 0..(M/THREADS)-1, ..., Thread THREADS-1 gets columns ((THREADS-1)*M)/THREADS..(M-1)]
- Note: N and M are assumed to be multiples of THREADS
104UPC Matrix Multiplication Code
#include <upc_relaxed.h>
#define N 4
#define P 4
#define M 4
shared [N*P/THREADS] int a[N][P] = {1,2,3,4,5,6,7,8,9,10,11,12,14,14,15,16}, c[N][M];
// a and c are blocked shared matrices, initialization is not currently implemented
shared [M/THREADS] int b[P][M] = {0,1,0,1,0,1,0,1,0,1,0,1,0,1,0,1};
void main (void)
{
  int i, j, l; // private variables
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l] * b[l][j];
    }
  }
}
105UPC Matrix Multiplication Code with block copy
#include <upc_relaxed.h>
shared [N*P/THREADS] int a[N][P], c[N][M];
// a and c are blocked shared matrices, initialization is not currently implemented
shared [M/THREADS] int b[P][M];
int b_local[P][M];
void main (void)
{
  int i, j, l; // private variables
  upc_memget(b_local, b, P * M * sizeof(int));
  upc_forall(i = 0; i < N; i++; &c[i][0]) {
    for (j = 0; j < M; j++) {
      c[i][j] = 0;
      for (l = 0; l < P; l++)
        c[i][j] += a[i][l] * b_local[l][j];
    }
  }
}
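A plain-C rendering of the same computation (serial, so it runs without a UPC compiler; `matmul` is an illustrative wrapper, and the 14,14 entry follows the slide's initializer) with b staged into b_local before the loop, matching the block-copy version:

```c
#include <assert.h>
#include <string.h>

#define N 4
#define P 4
#define M 4

int a[N][P] = {{1,2,3,4},{5,6,7,8},{9,10,11,12},{14,14,15,16}};
int b[P][M] = {{0,1,0,1},{0,1,0,1},{0,1,0,1},{0,1,0,1}};
int b_local[P][M], c[N][M];

void matmul(void) {
    int i, j, l;
    /* Stands in for upc_memget(b_local, b, P*M*sizeof(int)):
     * fetch b once into private memory instead of per-element. */
    memcpy(b_local, b, P * M * sizeof(int));
    for (i = 0; i < N; i++)          /* upc_forall over rows of c */
        for (j = 0; j < M; j++) {
            c[i][j] = 0;
            for (l = 0; l < P; l++)
                c[i][j] += a[i][l] * b_local[l][j];
        }
}
```

With b's columns alternating 0 and 1, even-numbered columns of c are zero and odd-numbered columns hold the row sums of a.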
106UPC Outline
- Background and Philosophy
- UPC Execution Model
- UPC Memory Model
- UPC A Quick Intro
- Data and Pointers
- Dynamic Memory Management
- Programming Examples
- Synchronization
- Performance Tuning and Early Results
- Concluding Remarks
107Memory Consistency Models
- Has to do with the ordering of shared operations
- Under the relaxed consistency model, shared operations can be reordered by the compiler / runtime system
- The strict consistency model enforces sequential ordering of shared operations (no shared operation can begin before the previously issued one is done)
108Memory Consistency Models
- User specifies the memory model through
- declarations
- pragmas for a particular statement or sequence of statements
- use of barriers and global operations
- Consistency can be strict or relaxed
- Programmers are responsible for using the correct consistency model
109Memory Consistency
- Default behavior can be controlled by the programmer
- Use strict memory consistency
- #include <upc_strict.h>
- Use relaxed memory consistency
- #include <upc_relaxed.h>
110Memory Consistency
- Default behavior can be altered for a variable definition using
- Type qualifiers: strict and relaxed
- Default behavior can be altered for a statement or a block of statements using
- #pragma upc strict
- #pragma upc relaxed
111UPC Outline
- Background and Philosophy
- UPC Execution Model
- UPC Memory Model
- UPC A Quick Intro
- Data and Pointers
- Dynamic Memory Management
- Programming Examples
- Synchronization
- Performance Tuning and Early Results
- Concluding Remarks
112How to Exploit the Opportunities for Performance
Enhancement?
- Compiler optimizations
- Run-time system
- Hand tuning
113List of Possible Optimizations for UPC Codes
- Space privatization: use private pointers instead of pointers-to-shared when dealing with local shared data (through casting and assignments)
- Block moves: use block copy instead of copying elements one by one with a loop, through string operations or structures
- Latency hiding: for example, overlap remote accesses with local processing using split-phase barriers
- Vendors can also help by decreasing the cost of address translation and providing optimized standard libraries
114Performance of Shared vs. Private Accesses (Old
COMPAQ Measurement)
Recent compiler developments have improved some
of that
115Using Local Pointers Instead of Pointers-to-Shared
- int *pa = (int*) &A[i][0];
- int *pc = (int*) &C[i][0];
- upc_forall(i = 0; i < N; i++; &A[i][0])
-   for(j = 0; j < P; j++) pa[j] = pc[j];
- Pointer arithmetic is faster using local pointers than pointers-to-shared
- The pointer dereference can be one order of magnitude faster
116Performance of UPC
- UPC benchmarking results
- N-Queens Problem
- Matrix Multiplication
- Sobel Edge detection
- Stream and GUPS
- NPB
- Splash-2
- Compaq AlphaServer SC and Origin 2000/3000
- Check the web site for new measurements
117Shared vs. Private Accesses (Recent SGI Origin
3000 Measurement)
STREAM BENCHMARK
118Execution Time over SGI Origin 2k NAS-EP Class A
119Performance of Edge Detection on the Origin 2000
[Charts: Execution Time and Speedup]
120Execution Time over SGI Origin 2k NAS-FT Class A
121Execution Time over SGI Origin 2k NAS-CG Class A
122Execution Time over SGI Origin 2k NAS-EP Class A
123Execution Time over SGI Origin 2k NAS-FT Class A
124Execution Time over SGI Origin 2k NAS-CG Class A
125Execution Time over SGI Origin 2k NAS-MG Class A
126UPC Outline
- Background and Philosophy
- UPC Execution Model
- UPC Memory Model
- UPC A Quick Intro
- Data and Pointers
- Dynamic Memory Management
- Programming Examples
- Synchronization
- Performance Tuning and Early Results
- Concluding Remarks
127Conclusions
- UPC Time-To-Solution = UPC Programming Time + UPC Execution Time
- Simple and Familiar View
- Domain decomposition maintains global application view
- No function calls
- Concise Syntax
- Remote writes with assignment to shared variables
- Remote reads with expressions involving shared variables
- Domain decomposition (mainly) implied in declarations (logical place!)
- Data locality exploitation
- No calls
- One-sided communications
- Low overhead for short accesses
128Conclusions
- UPC is easy to program in for C writers, at times significantly easier than alternative paradigms
- UPC exhibits very little overhead compared with MPI for problems that are embarrassingly parallel; no tuning is necessary
- For other problems, compiler optimizations are happening but are not fully there yet
- With hand-tuning, UPC performance compares favorably with MPI
- Hand-tuned code, with block moves, is still substantially simpler than message-passing code
129Conclusions
- Automatic compiler optimizations should focus on
- Inexpensive address translation
- Space privatization for local shared accesses
- Prefetching and aggregation of remote accesses; prediction is easier under the UPC model
- More performance help is expected from optimized standard library implementations, especially collectives and I/O
130References
- The official UPC website, http://upc.gwu.edu
- T. A. El-Ghazawi, W. W. Carlson, J. M. Draper. UPC Language Specifications V1.1 (http://upc.gwu.edu), May 2003.
- François Cantonnet, Yiyi Yao, Smita Annareddy, Ahmed S. Mohamed, Tarek A. El-Ghazawi. Performance Monitoring and Evaluation of a UPC Implementation on a NUMA Architecture. International Parallel and Distributed Processing Symposium (IPDPS'03), Nice Acropolis Convention Center, Nice, France, 2003.
- Wei-Yu Chen, Dan Bonachea, Jason Duell, Parry Husbands, Costin Iancu, Katherine Yelick. A Performance Analysis of the Berkeley UPC Compiler. Proceedings of the 17th Annual International Conference on Supercomputing (ICS 2003), San Francisco, CA, USA.
- Tarek A. El-Ghazawi, François Cantonnet. UPC Performance and Potential: A NPB Experimental Study. SuperComputing 2002 (SC2002), IEEE, Baltimore, MD, USA, 2002.
- Tarek A. El-Ghazawi, Sébastien Chauvin. UPC Benchmarking Issues. Proceedings of the International Conference on Parallel Processing (ICPP'01), IEEE CS Press, Valencia, Spain, September 2001.
131CS267 Final Projects
- Project proposal
- Teams of 3 students, typically across departments
- Interesting parallel application or system
- Conference-quality paper
- High performance is key
- Understanding performance, tuning, scaling, etc.
- More important than the difficulty of the problem
- Leverage
- Projects in other classes (but discuss with me first)
- Research projects
132Project Ideas
- Applications
- Implement existing sequential or shared memory program on distributed memory
- Investigate SMP trade-offs (using only MPI versus MPI and thread-based parallelism)
- Tools and Systems
- Effects of reordering on sparse matrix factoring and solves
- Numerical algorithms
- Improved solver for immersed boundary method
- Use of multiple vectors (blocked algorithms) in iterative solvers
133Project Ideas
- Novel computational platforms
- Exploiting hierarchy of SMP-clusters in benchmarks
- Computing aggregate operations on ad hoc networks (Culler)
- Push/explore limits of computing on the grid
- Performance under failures
- Detailed benchmarking and performance analysis, including identification of optimization opportunities
- Titanium
- UPC
- IBM SP (Blue Horizon)
134Hardware Limits to Software Innovation
- Software send overhead for 8-byte messages over time
- Not improving much over time (even in absolute terms)