1
Unified Parallel C at NERSC
  • Kathy Yelick
  • EECS, U.C. Berkeley and NERSC/LBNL
  • UPC Team Dan Bonachea, Jason Duell, Paul
    Hargrove, Parry Husbands, Costin Iancu, Mike
    Welcome, Christian Bell

2
Outline
  • Motivation for a new class of languages
  • Programming models
  • Architectural trends
  • Overview of Unified Parallel C (UPC)
  • Programmability advantage
  • Performance opportunity
  • Status
  • Next step
  • Related projects

3
Programming Model 1: Shared Memory
  • Program is a collection of threads of control.
  • Many languages allow threads to be created
    dynamically.
  • Each thread has a set of private variables, e.g.
    local variables on the stack.
  • Collectively with a set of shared variables,
    e.g., static variables, shared common blocks,
    global heap.
  • Threads communicate implicitly by writing/reading
    shared variables.
  • Threads coordinate using synchronization
    operations on shared variables

(Figure: threads P0 ... Pn, each with private memory, reading and writing a variable x held in shared memory.)
4
Programming Model 2: Message Passing
  • Program consists of a collection of named
    processes.
  • Usually fixed at program startup time
  • Thread of control plus local address space -- NO
    shared data.
  • Logically shared data is partitioned over local
    processes.
  • Processes communicate by explicit send/receive
    pairs
  • Coordination is implicit in every communication
    event.
  • MPI is the most common example

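(Figure: processes P0 ... Pn with separate address spaces holding X and Y, exchanging data through a matching pair labeled "send P0,X" and "recv Pn,Y".)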
5
Advantages/Disadvantages of Each Model
  • Shared memory
  • Programming is easier
  • Can build large shared data structures
  • Machines don't scale
  • SMPs typically < 16 processors (Sun, DEC, Intel,
    IBM)
  • Distributed shared memory < 128 (SGI)
  • Performance is hard to predict and control
  • Message passing
  • Machines easier to build from commodity parts
  • Can scale (given sufficient network)
  • Programming is harder
  • Distributed data structures exist only in the
    programmer's mind
  • Tedious packing/unpacking of irregular data
    structures

6
Global Address Space Programming
  • Intermediate point between message passing and
    shared memory
  • Program consists of a collection of processes.
  • Fixed at program startup time, like MPI
  • Local and shared data, as in shared memory model
  • But, shared data is partitioned over local
    processes
  • Remote data stays remote on distributed memory
    machines
  • Processes communicate by reads/writes to shared
    variables
  • Examples are UPC, Titanium, CAF, Split-C
  • Note: These are not data-parallel languages
  • heroic compilers not required

7
GAS Languages on Clusters of SMPs
  • SMPs are the fastest commodity machine, so used
    as a node in large-scale clusters
  • Common names
  • CLUMP: Cluster of SMPs
  • Hierarchical machines, constellations
  • Most modern machines look like this
  • Millennium, IBM SPs, (not the t3e)...
  • What is an appropriate programming model?
  • Use message passing throughout
  • Unnecessary packing/unpacking overhead
  • Hybrid models
  • Write 2 parallel programs (MPI + OpenMP or
    Threads)
  • Global address space
  • Only adds test (on/off node) before local
    read/write

8
Top 500 Supercomputers
  • Listing of the 500 most powerful computers in the
    world
  • Yardstick: Rmax from the LINPACK MPP benchmark
  • Ax = b, dense problem
  • Dense LU factorization (dominated by matrix
    multiply)
  • Updated twice a year: SC'xy in the States in
    November
  • Meeting in Mannheim, Germany in June
  • All data (and slides) available from
    www.top500.org
  • Also measures N_{1/2} (problem size required to
    get half speed)

(Figure: performance rate plotted against problem size.)
9
(No Transcript)
10
(No Transcript)
11
Outline
  • Motivation for a new class of languages
  • Programming models
  • Architectural trends
  • Overview of Unified Parallel C (UPC)
  • Programmability advantage
  • Performance opportunity
  • Status
  • Next step
  • Related projects

12
Parallelism Model in UPC
  • UPC uses an SPMD model of parallelism
  • A set of THREADS threads working independently
  • Two compilation models
  • THREADS may be fixed at compile time or
  • Dynamically set at program startup time
  • MYTHREAD specifies thread index (0..THREADS-1)
  • Basic synchronization mechanisms
  • Barriers (normal and split-phase), locks
  • What UPC does not do automatically
  • Determine data layout
  • Load balance: move computations
  • Caching: move data
  • These are intentionally left to the programmer

13
Shared and Private Variables in UPC
  • A shared variable has one instance, shared by all
    threads.
  • Affinity to thread 0 by default (allocated in
    processor 0's memory)
  • A private variable has an instance per thread
  • Example
  • int x;         // private copy for each processor
  • shared int y;  // one copy on P0, shared by all others
  • x = 0; y = 0;
  • x += 1; y += 1;
  • After executing this code
  • x will be 1 in all threads; y will be between 1
    and THREADS
  • Shared scalar variables are somewhat rare because
    they cannot be automatic (declared in a function).
    (Why not?)

14
UPC Pointers
  • Pointers may point to shared or private variables
  • Same syntax for use, just add qualifier
  • shared int *sp;
  • int *lp;
  • sp is a pointer to an integer residing in the
    shared memory space.
  • sp is called a shared pointer (somewhat sloppy).

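(Figure: global address space split into shared and private regions; a shared variable x (labeled 3) sits in the shared region, and each thread's private pointer sp points to it.)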
15
UPC Pointers
  • May also have a pointer variable that is shared.
  • shared int *shared sps;
  • int *shared spl;  // does this make sense?
  • The most common case is a private variable that
    points to a shared object (called a shared
    pointer)

(Figure: global address space split into shared and private regions; the pointer sps itself resides in the shared region.)
16
Shared and Private Rules
  • Default: types that are neither shared-qualified
    nor private-qualified are considered private.
  • This makes porting uniprocessor libraries easy
  • Makes porting shared memory code somewhat harder
  • Casting pointers
  • A pointer to a private variable may not be cast
    to a shared type.
  • If a pointer to a shared variable is cast to a
    pointer to a private object
  • If the object has affinity with the casting
    thread, this is fine.
  • If not, attempts to de-reference that private
    pointer are undefined. (Some compilers may give
    better errors than others.)
  • Why?

17
Shared Arrays in UPC
  • Shared array elements are spread across the
    threads
  • shared int x[THREADS];     /* One element per thread */
  • shared int y[3][THREADS];  /* 3 elements per thread */
  • shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
  • In the pictures below
  • Assume THREADS = 4
  • Elements with affinity to processor 0 are red

(Figure: layouts of x, y, and z across 4 threads; y is labeled "blocked" and z "cyclic", with the note that y is really a 2D array.)
18
Example Vector Addition
  • Questions about parallel vector additions
  • How to layout data (here it is cyclic)
  • Which processor does what (here it is owner
    computes)
      /* vadd.c */
      #include <upc_relaxed.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], sum[N];
      void main() {
          int i;
          for (i = 0; i < N; i++)
              if (MYTHREAD == i%THREADS)
                  sum[i] = v1[i] + v2[i];
      }

cyclic layout
owner computes
19
Shared Pointers
  • In the C tradition, arrays can be accessed through
    pointers
  • Here is the vector addition example using pointers
      #include <upc_relaxed.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], sum[N];
      void main() {
          int i;
          shared int *p1, *p2;
          p1 = v1; p2 = v2;
          for (i = 0; i < N; i++, p1++, p2++)
              if (i % THREADS == MYTHREAD)
                  sum[i] = *p1 + *p2;
      }

v1
p1
20
Work Sharing with upc_forall()
  • Iterations are independent
  • Each thread gets a bunch of iterations
  • Simple C-like syntax and semantics
  • upc_forall(init; test; loop; affinity)
      statement;
  • Affinity field to distribute the work
  • Round robin
  • Chunks of iterations
  • Semantics are undefined if there are dependencies
    between iterations
  • Programmer has indicated iterations are
    independent

21
Vector Addition with upc_forall
  • The loop in vadd is common, so there is
    upc_forall
  • 4th argument is int expression that gives
    affinity
  • Iteration executes when
  • affinity%THREADS is MYTHREAD
      /* vadd.c */
      #include <upc_relaxed.h>
      #define N 100*THREADS
      shared int v1[N], v2[N], sum[N];
      void main() {
          int i;
          upc_forall(i = 0; i < N; i++; i)
              sum[i] = v1[i] + v2[i];
      }

22
UPC Vector Matrix Multiplication Code
  • Here is one possible matrix-vector multiplication

    // vect_mat_mult.c
    #include <upc_relaxed.h>
    shared int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];
    void main (void) {
        int i, j, l;
        upc_forall(i = 0; i < THREADS; i++; i) {
            c[i] = 0;
            for (l = 0; l < THREADS; l++)
                c[i] += a[i][l]*b[l];
        }
    }
23
Data Distribution
(Figure: distribution of A, B, and C across Thread 0, Thread 1, and Thread 2.)
24
A Better Data Distribution
(Figure: distribution of A, B, and C across Thread 0, Thread 1, and Thread 2 under the better layout.)
25
Layouts in General
  • All non-array objects have affinity with thread
    zero.
  • Array layouts are controlled by layout
    specifiers.
  • layout_specifier ::=
      null
      layout_specifier [ integer_expression ]
  • The affinity of an array element is defined in
    terms of the block size, a compile-time constant,
    and THREADS, a runtime constant.
  • Element i has affinity with thread
      (i / block_size) % THREADS

26
Layout Terminology
  • Notation is HPF, but terminology is
    language-independent
  • Assume there are 4 processors

(Block, *)
(*, Block)
(Block, Block)
(Cyclic, *)
(Cyclic, Block)
(Cyclic, Cyclic)
27
2D Array Layouts in UPC
  • Array a1 has a row layout and array a2 has a
    block row layout.
  • shared [m] int a1 [n][m];
  • shared [k*m] int a2 [n][m];
  • If (k m) % THREADS == 0 then a3 has a row
    layout
  • shared int a3 [n][m][k];
  • To get more general HPF and ScaLAPACK style 2D
    blocked layouts, one needs to add dimensions.
  • Assume r*c = THREADS
  • shared [b1][b2] int a5 [m][n][r][c][b1][b2];
  • or equivalently
  • shared [b1*b2] int a5 [m][n][r][c][b1][b2];

28
UPC Vector Matrix Multiplication Code
  • Matrix-vector multiplication with better layout

    // vect_mat_mult.c
    #include <upc_relaxed.h>
    shared [THREADS] int a[THREADS][THREADS];
    shared int b[THREADS], c[THREADS];
    void main (void) {
        int i, j, l;
        upc_forall(i = 0; i < THREADS; i++; i) {
            c[i] = 0;
            for (l = 0; l < THREADS; l++)
                c[i] += a[i][l]*b[l];
        }
    }
29
Example Matrix Multiplication in UPC
  • Given two integer matrices A (N x P) and B (P x M)
  • Compute C = A x B.
  • Entries c_ij in C are computed by the formula
      $c_{ij} = \sum_{l=1}^{P} a_{il} \, b_{lj}$

30
Matrix Multiply in C
      #include <stdlib.h>
      #include <time.h>
      #define N 4
      #define P 4
      #define M 4
      int a[N][P], c[N][M];
      int b[P][M];
      void main (void) {
          int i, j, l;
          for (i = 0; i < N; i++) {
              for (j = 0; j < M; j++) {
                  c[i][j] = 0;
                  for (l = 0; l < P; l++)
                      c[i][j] += a[i][l]*b[l][j];
              }
          }
      }

31
Domain Decomposition for UPC
  • Exploits locality in matrix multiplication
  • A (N x P) is decomposed row-wise into blocks of
    size (N x P) / THREADS as shown below
  • B (P x M) is decomposed column-wise into M /
    THREADS blocks as shown below
  • Note: N and M are assumed to be multiples of
    THREADS

(Figure: A, N rows by P columns, split row-wise: Thread 0 holds elements 0 .. (N*P / THREADS)-1, Thread 1 holds (N*P / THREADS) .. (2*N*P / THREADS)-1, and Thread THREADS-1 holds ((THREADS-1)*N*P) / THREADS .. (THREADS*N*P / THREADS)-1. B, P rows by M columns, split column-wise: Thread 0 holds columns 0 .. (M/THREADS)-1, and Thread THREADS-1 holds columns ((THREADS-1)*M) / THREADS .. (M-1).)
32
UPC Matrix Multiplication Code
    /* mat_mult_1.c */
    #include <upc_relaxed.h>
    shared [N*P/THREADS] int a[N][P], c[N][M];
    // a and c are row-wise blocked shared matrices
    shared [M/THREADS] int b[P][M];  // column-wise blocking
    void main (void) {
        int i, j, l;  // private variables
        upc_forall(i = 0; i < N; i++; &c[i][0]) {
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (l = 0; l < P; l++)
                    c[i][j] += a[i][l]*b[l][j];
            }
        }
    }
33
Notes on the Matrix Multiplication Example
  • The UPC code for the matrix multiplication is
    almost the same size as the sequential code
  • Shared variable declarations include the keyword
    shared
  • Making a private copy of matrix B in each thread
    might result in better performance since many
    remote memory operations can be avoided
  • Can be done with the help of upc_memget

34
Overlapping Communication in UPC
  • Programs with fine-grained communication require
    overlap for performance
  • UPC compiler does this automatically for
    relaxed accesses.
  • Accesses may be designated as strict, relaxed, or
    unqualified (the default).
  • There are several ways of designating the
    ordering type.
  • A type qualifier, strict or relaxed can be used
    to affect all variables of that type.
  • Labels strict or relaxed can be used to control
    the accesses within a statement.
  • strict : { x = y; z = y+1; }
  • A strict or relaxed cast can be used to override
    the current label or type qualifier.

35
Performance of UPC
  • Reasons why UPC may be slower than MPI
  • Shared array indexing is expensive
  • Small messages encouraged by model
  • Reasons why UPC may be faster than MPI
  • MPI encourages synchrony
  • Buffering required for many MPI calls
  • Remote read/write of a single word may require
    very little overhead
  • Cray T3E, Quadrics interconnect (next version)
  • Assuming overlapped communication, the real
    issue is overhead: how much time does it take to
    issue a remote read/write?

36
UPC versus MPI for Edge Detection
(Figure: two panels, a. Execution time and b. Scalability.)
  • Performance from Cray T3E
  • Benchmark developed by El-Ghazawi's group at GWU

37
UPC versus MPI for Matrix Multiplication
(Figure: two panels, a. Execution time and b. Scalability.)
  • Performance from Cray T3E
  • Benchmark developed by El-Ghazawi's group at GWU

38
UPC vs. MPI for Sparse Matrix-Vector Multiply
  • Short term goal
  • Evaluate language and compilers using small
    applications
  • Longer term, identify large application
  • Show advantage of t3e network model and UPC
  • Performance on Compaq machine worse
  • Serial code
  • Communication performance
  • New compiler just released

39
Particle/Grid Methods in UPC?
  • Experience so far in a related language
  • Titanium, Java-based GAS language
  • Immersed boundary method
  • Most time in communication between mesh and
    particles
  • Currently uses bulk communication
  • May benefit from SPMV trick

40
EM3D Performance in Split-C Language on CM-5
  • Maxwell's Equations on an Unstructured 3D Mesh:
    Explicit Method
  • Irregular bipartite graph of varying degree
    (about 20) with weighted edges
  • Basic operation is to subtract the weighted sum
    of neighboring values, for all E nodes and for
    all H nodes

(Figure: bipartite graph of E and H nodes; labels v1, v2, w1, w2, H, E, B, D.)
41
Split-C Performance Tuning on the CM5
  • Tuning affects application performance

42
Outline
  • Motivation for a new class of languages
  • Programming models
  • Architectural trends
  • Overview of Unified Parallel C (UPC)
  • Programmability advantage
  • Performance opportunity
  • Status
  • Next step
  • Related projects

43
UPC Implementation Effort
  • UPC efforts elsewhere
  • IDA t3e implementation based on old gcc
  • GMU (documentation) and UMC (benchmarking)
  • Compaq (Alpha cluster and CMPI compiler (with
    MTU))
  • Cray, Sun, and HP (implementations)
  • Intrepid (SGI compiler and t3e compiler)
  • UPC Book
  • T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
  • Three components of NERSC effort
  • Compilers (SP and PC clusters) optimization
    (DOE)
  • Runtime systems for multiple compilers (DOE
    NSA)
  • Applications and benchmarks
    (DOE)

44
Compiler Status
  • NERSC compiler (Costin Iancu)
  • Based on Open64 compiler for C
  • Parses and type-checks UPC
  • Code generation for SMPs underway
  • Generate C on most machines, possibly IA64 later
  • Investigating optimization opportunities
  • Focus of this compiler is high level
    optimizations
  • Intrepid compiler
  • Based on gcc (3.x)
  • Will target our runtime layer on most machines
  • Initial focus is t3e, then Pentium clusters

45
Runtime System
  • Characterizing network performance
  • Low latency (low overhead) -> programmability
  • Optimizations depend on network characteristics
  • T3e was ideal
  • Quadrics reports very low overhead coming
  • Difficult to access low level SP and Myrinet

46
Next Step
  • Undertake larger application effort
  • What type of application?
  • Challenging to write in MPI (e.g., sparse direct
    solvers)
  • Irregular communication (e.g., PIC)
  • Well-understood algorithm

47
Outline
  • Motivation for a new class of languages
  • Programming models
  • Architectural trends
  • Overview of Unified Parallel C (UPC)
  • Programmability advantage
  • Performance opportunity
  • Status
  • Next step
  • Related projects

48
3 Related Projects on Campus
  • Titanium
  • High performance Java dialect
  • Collaboration with Phil Colella and Charlie
    Peskin
  • BeBOP: Berkeley Benchmarking and Optimization
  • Self-tuning numerical kernels
  • Sparse matrix operations
  • Pyramid mesh generator (Jonathan Shewchuk)

49
Locality and Parallelism
  • Large memories are slow, fast memories are small.
  • Storage hierarchies are large and fast on
    average.
  • Parallel processors, collectively, have large,
    fast memories -- the slow accesses to remote
    data we call communication.
  • Algorithm should do most work on local data.

50
Tuning pays off: ATLAS (Dongarra, Whaley)
  • Extends applicability of PHIPAC
  • Incorporated in Matlab (with rest of LAPACK)
51
Speedups on SPMV from Sparsity on Sun Ultra 1/170: 1 RHS
52
Speedups on SPMV from Sparsity on Sun Ultra 1/170: 9 RHS
53
Future Work
  • Exploit Itanium Architecture
  • 128 (82-bit) floating point registers
  • 9 HW formats 24/8(v), 24/15, 24/17, 53/11,
    53/15, 53/17, 64/15, 64/17
  • Many few load/store instructions
  • fused multiply-add instruction
  • predicated instructions
  • rotating registers for software pipelining
  • prefetch instructions
  • three levels of cache
  • Tune current and wider set of kernels
  • Improve heuristics, e.g., choice of r x c
  • Incorporate into
  • SUGAR
  • Information Retrieval
  • Further automate performance tuning
  • Generation of algorithm space generators