Title: Unified Parallel C
1. Unified Parallel C
- Kathy Yelick
- EECS, U.C. Berkeley and NERSC/LBNL
- NERSC Team: Dan Bonachea, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome, Christian Bell
2. Outline
- Global Address Space Languages in General
- Programming models
- Overview of Unified Parallel C (UPC)
- Programmability advantage
- Performance opportunity
- Status
- Next step
- Related projects
3. Programming Model 1: Shared Memory
- Program is a collection of threads of control.
- Many languages allow threads to be created dynamically.
- Each thread has a set of private variables, e.g. local variables on the stack.
- Along with these there is a collective set of shared variables, e.g., static variables, shared common blocks, global heap.
- Threads communicate implicitly by writing/reading shared variables.
- Threads coordinate using synchronization operations on shared variables (see the threads sketch below).
[Figure: threads P0 ... Pn each have private variables (e.g., y); x resides in the shared memory region accessible to all.]
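To make the model concrete, here is a minimal POSIX-threads sketch (an added illustration, not from the original slides): x is a shared variable, i is private to each thread, and a mutex provides the synchronization operation.

#include <pthread.h>
#include <stdio.h>

static int x = 0;                               /* shared variable */
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    int i = *(int *)arg;                        /* private variable */
    pthread_mutex_lock(&m);                     /* coordinate via a lock */
    x += i;                                     /* communicate implicitly through x */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    int ids[4] = {0, 1, 2, 3};
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker, &ids[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("x = %d\n", x);                      /* 0 + 1 + 2 + 3 = 6 */
    return 0;
}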
4. Programming Model 2: Message Passing
- Program consists of a collection of named processes.
- Usually fixed at program startup time.
- Thread of control plus local address space -- NO shared data.
- Logically shared data is partitioned over local processes.
- Processes communicate by explicit send/receive pairs.
- Coordination is implicit in every communication event.
- MPI is the most common example (see the send/receive sketch below).
[Figure: processes P0 ... Pn, each with its own address space; data moves only through matched operations such as send(P0, X) and recv(Pn, Y).]
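The corresponding message-passing sketch in MPI (an added illustration; the names X and Y follow the figure): there is no shared data, so the value of X reaches P0 only through an explicit, matched send/receive pair.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double X = 3.14, Y = 0.0;                   /* each process has its own copies */
    if (size > 1 && rank == size - 1) {
        MPI_Send(&X, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);          /* send X to P0 */
    } else if (size > 1 && rank == 0) {
        MPI_Recv(&Y, 1, MPI_DOUBLE, size - 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                                /* recv into Y */
        printf("P0 received Y = %g\n", Y);
    }
    MPI_Finalize();
    return 0;
}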
5. Tradeoffs Between the Models
- Shared memory
- Programming is easier
- Can build large shared data structures
- Machines don't scale
- SMPs typically < 16 processors (Sun, DEC, Intel, IBM)
- Distributed shared memory < 128 (SGI)
- Performance is hard to predict and control
- Message passing
- Machines are easier to build from commodity parts
- Can scale (given a sufficient network)
- Programming is harder
- Distributed data structures exist only in the programmer's mind
- Tedious packing/unpacking of irregular data structures
6. Global Address Space Programming
- Intermediate point between message passing and shared memory
- Program consists of a collection of processes
- Fixed at program startup time, like MPI
- Local and shared data, as in the shared memory model
- But shared data is partitioned over the local processes
- Remote data stays remote on distributed memory machines
- Processes communicate by reads/writes to shared variables
- Examples are UPC, Titanium, CAF, Split-C
- Note: these are not data-parallel languages; heroic compilers are not required
7. GAS Languages on Clusters of SMPs
- Clusters of SMPs (CLUMPs)
- IBM SP: 16-way SMP nodes
- Berkeley Millennium: 2-way and 4-way nodes
- What is an appropriate programming model?
- Use message passing throughout
- Most common model
- Unnecessary packing/unpacking overhead
- Hybrid models
- Write 2 parallel programs (MPI + OpenMP or threads)
- Global address space
- Only adds a test (on/off node) before each local read/write (see the sketch below)
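A sketch of that last point (hypothetical code, not the actual UPC or GASNet interface): a global-address-space access first tests whether the target is on-node; if so it is an ordinary load, otherwise it becomes a small remote get.

#include <stddef.h>

typedef struct { int node; void *addr; } global_ptr_t;   /* simplified (node, address) pair */

extern int my_node;                                      /* this process's node id (hypothetical) */
extern void network_get(void *dst, global_ptr_t src, size_t nbytes);  /* hypothetical small get */

static inline int shared_read_int(global_ptr_t p) {
    if (p.node == my_node)
        return *(int *)p.addr;            /* on-node: ordinary load */
    int tmp;
    network_get(&tmp, p, sizeof tmp);     /* off-node: one small remote read */
    return tmp;
}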
8. Support for GAS Languages
- Unified Parallel C (UPC)
- Funded by the NSA
- Compaq compiler for Alpha/Quadrics
- HP, Sun and Cray compilers under development
- Gcc-based compiler for SGI (Intrepid)
- Gcc-based compiler (SRC) for Cray T3E
- MTU and Compaq effort for MPI-based compiler
- LBNL compiler based on Open64
- Co-Array Fortran (CAF)
- Cray compiler
- Rice and UMN effort based on Open64
- SPMD Java (Titanium)
- UCB compiler available for most machines
9. Parallelism Model in UPC
- UPC uses an SPMD model of parallelism
- A set of THREADS threads working independently
- Two compilation models:
- THREADS may be fixed at compile time, or
- Dynamically set at program startup time
- MYTHREAD specifies the thread index (0..THREADS-1)
- Basic synchronization mechanisms
- Barriers (normal and split-phase), locks
- What UPC does not do automatically:
- Determine data layout
- Load balance (move computations)
- Caching (move data)
- These are intentionally left to the programmer (a minimal SPMD sketch follows below)
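A minimal UPC SPMD sketch (an added illustration, assuming a working UPC compiler): every thread executes main, THREADS and MYTHREAD are built in, and upc_barrier is the basic synchronization.

#include <upc.h>
#include <stdio.h>

shared int hits[THREADS];                 /* one element with affinity to each thread */

int main(void) {
    hits[MYTHREAD] = MYTHREAD;            /* each thread writes its own element */
    upc_barrier;                          /* wait until every thread has written */
    if (MYTHREAD == 0) {                  /* thread 0 reads the whole shared array */
        for (int i = 0; i < THREADS; i++)
            printf("hits[%d] = %d\n", i, hits[i]);
    }
    return 0;
}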
10. UPC Pointers
- Pointers may point to shared or private variables
- Same syntax for use, just add the qualifier:
- shared int *sp;
- int *lp;
- sp is a pointer to an integer residing in the shared memory space
- sp is called a shared pointer (somewhat sloppy terminology)
[Figure: global address space with shared and private regions; x (= 3) lives in the shared region and pointers sp in several threads refer to it.]
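A short sketch of the two pointer kinds (an added illustration, assuming a UPC compiler): sp can reference data anywhere in the shared space, possibly remote, while lp is an ordinary C pointer into the thread's private memory.

#include <upc.h>

shared int x;                        /* a shared scalar; it has affinity to thread 0 */

int main(void) {
    shared int *sp = &x;             /* pointer-to-shared: may reference remote data */
    int local = 0;
    int *lp = &local;                /* private pointer: local data only */

    if (MYTHREAD == THREADS - 1)
        *sp = 3;                     /* a remote write whenever THREADS > 1 */
    upc_barrier;
    *lp = *sp;                       /* read through the shared pointer, store locally */
    return 0;
}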
11. Shared Arrays in UPC
- Shared array elements are spread across the threads:
- shared int x[THREADS];    /* One element per thread */
- shared int y[3][THREADS]; /* 3 elements per thread */
- shared int z[3*THREADS];  /* 3 elements per thread, cyclic */
- In the pictures below:
- Assume THREADS = 4
- Elements with affinity to processor 0 are red
- Of course, this is really a 2D array
[Figure: layouts of x, y (blocked), and z (cyclic) across the 4 threads.]
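A sketch of how these arrays are typically initialized (an added illustration, assuming a UPC compiler): the affinity expression of upc_forall makes each thread execute only the iterations that touch its own elements.

#include <upc.h>

shared int x[THREADS];               /* one element per thread */
shared int y[3][THREADS];            /* 3 elements per thread (blocked view) */
shared int z[3*THREADS];             /* 3 elements per thread, cyclic layout */

int main(void) {
    /* pointer affinity: iteration i runs on the thread that owns z[i] */
    upc_forall (int i = 0; i < 3*THREADS; i++; &z[i])
        z[i] = i;

    /* integer affinity: iteration i runs on thread i % THREADS, which owns x[i] */
    upc_forall (int i = 0; i < THREADS; i++; i)
        x[i] = i;

    upc_barrier;
    return 0;
}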
12. Overlapping Communication in UPC
- Programs with fine-grained communication require overlap for performance
- The UPC compiler does this automatically for relaxed accesses
- Accesses may be designated as strict, relaxed, or unqualified (the default)
- There are several ways of designating the ordering type:
- A type qualifier, strict or relaxed, can be used to affect all variables of that type
- Labels strict or relaxed can be used to control the accesses within a statement:
- strict: { x = y; z = y + 1; }
- A strict or relaxed cast can be used to override the current label or type qualifier (a sketch using the type qualifiers follows below)
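A sketch of the type-qualifier mechanism (an added illustration, assuming a UPC compiler): relaxed accesses may be reordered and overlapped by the compiler, while a strict access acts as a synchronization point, so the reader below is guaranteed to see the data once it sees the flag.

#include <upc_relaxed.h>

relaxed shared int data[THREADS];    /* relaxed: accesses may be reordered/overlapped */
strict  shared int flag;             /* strict: acts as a synchronization point */

int main(void) {
    if (MYTHREAD == 0) {
        data[0] = 42;                /* relaxed write: may be overlapped with other work */
        flag = 1;                    /* strict write: earlier writes complete first */
    } else if (MYTHREAD == THREADS - 1 && THREADS > 1) {
        while (flag == 0)            /* strict reads: spin until thread 0 signals */
            ;
        int v = data[0];             /* guaranteed to observe 42 */
        (void)v;
    }
    return 0;
}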
13. Performance of UPC
- Reasons why UPC may be slower than MPI:
- Shared array indexing is expensive
- Small messages are encouraged by the model
- Reasons why UPC may be faster than MPI:
- MPI encourages synchrony
- Buffering is required for many MPI calls
- A remote read/write of a single word may require very little overhead
- Cray T3E, Quadrics interconnect (next version)
- Assuming overlapped communication, the real issue is overhead: how much time does it take to issue a remote read/write?
14. UPC vs. MPI: Sparse MatVec Multiply
- Short term goal: evaluate the language and compilers using small applications
- Longer term: identify a large application
- Show the advantage of the T3E network model and UPC
- Performance on the Compaq machine is worse:
- Serial code
- Communication performance
- New compiler just released
15. UPC versus MPI for Edge Detection
[Figure: (a) execution time, (b) scalability]
- Performance from the Cray T3E
- Benchmark developed by El Ghazawi's group at GWU
16. UPC versus MPI for Matrix Multiplication
[Figure: (a) execution time, (b) scalability]
- Performance from the Cray T3E
- Benchmark developed by El Ghazawi's group at GWU
17. Implementing UPC
- The UPC extensions to C are small:
- < 1 person-year to implement in an existing compiler
- Simplest approach:
- Reads and writes of shared pointers become small message puts/gets
- UPC has the relaxed keyword for nonblocking communication
- Small message performance is key
- Advanced optimizations include conversion to bulk communication by either:
- The application programmer (see the sketch below)
- The compiler
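A sketch of the programmer-level bulk conversion (an added illustration, assuming a UPC compiler): the array below lives entirely on thread 0, so from any other thread the element-by-element loop issues many small remote reads, while upc_memget fetches the whole block in one transfer and then works locally.

#include <upc.h>

#define N 1024
shared [] double a[N];               /* indefinite block size: all of a is on thread 0 */

double sum_fine(void) {
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += a[i];                   /* on a remote thread: N small remote gets */
    return s;
}

double sum_bulk(void) {
    double buf[N];
    upc_memget(buf, a, N * sizeof(double));   /* one bulk transfer into private memory */
    double s = 0.0;
    for (int i = 0; i < N; i++)
        s += buf[i];                 /* purely local computation afterwards */
    return s;
}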
18. Overview of the NERSC Compiler
- Compiler
- Portable compiler infrastructure (UPC-to-C)
- Explore optimizations: communication, shared pointers
- Based on Open64; plan to release sources
- Runtime systems for multiple compilers
- Allow use by other languages (Titanium and CAF)
- And in other UPC compilers, e.g., Intrepid
- Performance of small message put/get is key
- Designed to be easily ported, then tuned
- Also designed for low overhead (macros, inline functions)
19. Compiler and Runtime Status
- Basic parsing and type-checking complete
- Generates code for small serial kernels
- Still testing and debugging
- Needs runtime for complete testing
- UPC runtime layer
- Initial implementation should be done this month
- Based on processes (not threads) on GASNet
- GASNet
- Initial specification complete
- Reference implementation done on MPI
- Working on Quadrics and IBM (LAPI)
20. Benchmarks for GAS Languages
- EEL: end-to-end latency, or the time spent sending a short message between two processes
- BW: large message network bandwidth
- Parameters of the LogP model (cost estimates sketched below):
- L: latency, or time spent on the network
- During this time, the processor can be doing other work
- o: overhead, or processor busy time on the sending or receiving side
- During this time, the processor cannot be doing other work
- We distinguish between send and receive overhead
- g: gap, the minimum time between consecutive messages (its inverse is the rate at which messages can be pushed onto the network)
- P: the number of processors
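For reference, the standard LogP-style cost estimates implied by these parameters (model formulas, not measurements from this talk), for one short message and for n back-to-back messages to a single destination:

\[ T_1 = o_{\mathrm{send}} + L + o_{\mathrm{recv}} \]
\[ T_n = o_{\mathrm{send}} + (n-1)\,g + L + o_{\mathrm{recv}} \qquad (\text{assuming } g \ge o) \]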
21. LogP Parameters: Overhead and Latency
- Send and recv overhead can overlap
[Figure: timelines for P0 and P1 showing o_send on the sender and o_recv on the receiver]
EEL = o_send + L + o_recv (when the overheads do not overlap)
EEL = f(o_send, L, o_recv) (in general, when they can overlap)
22. Benchmarks
- Designed to measure the network parameters
- Also provide gap as a function of queue depth
- Measured for the best case in general
- Implemented once in MPI
- For portability and comparison to the target-specific layer (a minimal MPI ping-pong sketch follows below)
- Implemented again in the target-specific communication layer:
- LAPI
- ELAN
- GM
- SHMEM
- VIPL
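A minimal MPI ping-pong sketch of the EEL measurement idea (an added illustration, not the actual benchmark suite from the talk): EEL is estimated as half the average round-trip time of a short message.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    const int iters = 10000;             /* run with at least 2 processes */
    char byte = 0;
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)                       /* half the average round trip, in microseconds */
        printf("EEL estimate: %g us\n", 1e6 * (t1 - t0) / (2.0 * iters));
    MPI_Finalize();
    return 0;
}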
23. Results: EEL and Overhead
24. Results: Gap and Overhead
25. Send Overhead Over Time
- Overhead has not improved significantly; the T3D was best
- Lack of integration; lack of attention in software
26. Summary
- Global address space languages offer an alternative to MPI for large machines
- Easier to use shared data structures
- Recover users left behind on shared memory?
- Performance tuning is still possible
- Implementation
- Small compiler effort given lightweight communication
- Portable communication layer: GASNet
- Difficulty with small message performance on the IBM SP platform