An Introduction to Unified Parallel C (UPC) - PowerPoint PPT Presentation (Transcript)

1
An Introduction to Unified Parallel C (UPC)
  • James Dinan
  • PhD Intern from Ohio State
  • MCS Seminar at Argonne May 4, 2009
  • Adapted from slides by
  • Kathy Yelick (LBNL/UCB) and Tarek El-Ghazawi (GWU)

2
UPC Outline
  • Background
  • UPC Programming Model
  • Memory Consistency and Synchronization
  • Work Distribution
  • Distributed Shared Arrays
  • Pointers and Dynamic Memory Management
  • Performance Results

3
Context
  • Most parallel programs are written using either
  • SPMD parallel message passing (e.g., MPI)
  • Many scientific applications
  • Good scaling
  • Requires attention to data distribution and communication
  • Shared memory (OpenMP, Pthreads)
  • Easier to program, but less scalable performance
  • Few scientific applications
  • Global Address Space Languages take the best of
    both
  • Shared memory like threads (programmability)
  • SPMD parallelism like MPI (performance)
  • Adds Locality to bridge the gap

4
Partitioned Global Address Space Models
Thread0 Thread1
Threadn
X0
X1
XP
Shared
head
node
node
Global address space
Private
ptr
ptr
ptr
  • Explicitly-parallel SPMD programming model
  • Global Address Space model of memory
  • Address space is logically partitioned
  • Local vs. remote memory
  • Local shared vs local private
  • Enables creation of distributed shared data
    structures
  • Programmer control over data layout and locality
  • Multiple PGAS models: UPC (C), CAF (Fortran),
    Titanium (Java), Global Arrays (library)

5
UPC Overview
  • Unified Parallel C (UPC) is
  • An explicit parallel extension of ANSI C
  • A partitioned global address space language
  • Similar to the C language philosophy
  • Programmers are clever and careful, and may need
    to get close to the hardware to get performance,
    but can get themselves into trouble
  • Concise and efficient syntax
  • Tunable approach to performance
  • High level: Sequential C → shared memory
  • Medium level: locality, data distribution,
    consistency
  • Low level: explicit one-sided communication
  • Based on ideas from Split-C, AC, and PCP

6
Who is UPC
  • UPC is an open standard; the latest is v1.2 from
    May 2005
  • Academic and Government Institutions
  • George Washington University
  • Lawrence Berkeley National Laboratory
  • University of California, Berkeley
  • University of Florida
  • Michigan Technological University
  • U.S. Department of Energy
  • Army High Performance Computing Research Center
  • Commercial Institutions
  • Hewlett-Packard (HP)
  • Cray, Inc
  • Intrepid Technology, Inc.
  • IBM
  • Etnus, LLC (Totalview)

7
UPC Programming Model
8
UPC Execution Model
  • A number of threads (i.e. processes) working
    independently in an SPMD fashion
  • Number of threads: THREADS
  • MYTHREAD specifies the thread index (0..THREADS-1)
  • There are two compilation modes
  • Static Threads mode
  • THREADS is specified at compile time by the user
  • The program may use THREADS as a compile-time
    constant
  • Dynamic Threads mode
  • Number of threads chosen when app is launched

9
Hello World in UPC
  • Any legal C program is also a legal UPC program
  • If you compile and run it as UPC with N threads,
    it will run N copies of the program.
    #include <upc.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        printf("Thread %d of %d: hello UPC world\n",
               MYTHREAD, THREADS);
        return 0;
    }

10
Private vs. Shared Variables in UPC
  • Normal C variables and objects are allocated in
    the private memory space for each thread (stack
    is private)
  • Shared variables are allocated only once, by
    thread 0
    shared int ours;   // use sparingly: performance
    int mine;
  • Shared variables may not have dynamic lifetime:
    they may not occur in a function definition,
    except as static. Why?
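A tiny sketch (mine, not from the slides) of what the rule allows and forbids:

    shared int counter;            /* OK: file scope, static storage duration */

    void f(void) {
        static shared int ncalls;  /* OK: static, so not on the stack         */
        /* shared int tmp; */      /* illegal: automatic variables live on
                                      each thread's private stack             */
    }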

[Figure: ours is allocated once, in the shared region of the global address
space; each thread has its own private copy of mine.]
11
Memory Consistency and Synchronization
12
Memory Consistency in UPC
  • The consistency model defines the order in which
    one thread may see another thread's accesses to
    memory
  • If you write a program with unsynchronized
    accesses, what happens?
  • Does this work?
      // Thread 1: data = ...;    // Thread 2: while (!flag) { }
      //           flag = 1;      //           mydata = data;
  • UPC has two types of accesses
  • Strict: sequential consistency; all threads see
    the same ordering
  • Relaxed: accesses may appear out of order to other
    threads, allowing more concurrency
  • Can be combined in the same program; default is
    strict
  • There are several ways of specifying the
    consistency model
      #include <upc_relaxed.h>   /* whole file      */
      #pragma upc strict         /* block of code   */
      strict shared int flag;    /* single variable */
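A minimal sketch (mine, not from the slides) of how declaring the flag strict
makes the handoff above safe under otherwise relaxed accesses:

    #include <upc_relaxed.h>

    shared int data;
    strict shared int flag;   /* strict: ordered w.r.t. other shared accesses */

    /* Thread 1 (producer)          Thread 2 (consumer)                     */
    /*   data = compute();            while (!flag) ;     // spin           */
    /*   flag = 1;                    mydata = data;      // sees new data  */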

13
Synchronization - Fence
  • upc_fence
  • Non-collective
  • UPC ensures that all shared references issued
    before the upc_fence are complete
  • Allows you to force an ordering between shared
    accesses
  • Important when using relaxed semantics
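A short sketch (mine, not from the slides) of a fence ordering two relaxed
writes on the producing thread:

    #include <upc_relaxed.h>

    shared int result;
    shared int ready;

    void produce(void) {
        result = 42;   /* relaxed write                                 */
        upc_fence;     /* all earlier shared accesses complete here ... */
        ready = 1;     /* ... before this signal is issued              */
    }
    /* a consumer should also fence (or use strict accesses) before reading result */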

14
UPC Global Synchronization
  • UPC has two basic forms of barriers
  • Barrier block until all other threads arrive
  • upc_barrier
  • Split-phase barriers
  • upc_notify this thread is ready for barrier
  • do computation unrelated to barrier
  • upc_wait wait for others to be ready
  • Optional labels allow for debugging
    #define MERGE_BARRIER 12
    if (MYTHREAD % 2 == 0) {
        ...
        upc_barrier MERGE_BARRIER;
    } else {
        ...
        upc_barrier MERGE_BARRIER;
    }
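A short sketch (mine; do_local_work is a hypothetical helper) of the
split-phase form listed above:

    void exchange_phase(void) {
        upc_notify;        /* signal: this thread has reached the barrier    */
        do_local_work();   /* computation that does not depend on others     */
        upc_wait;          /* block until every thread has issued upc_notify */
    }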

15
Synchronization - Locks
  • Locks in UPC are represented by an opaque type
  • upc_lock_t
  • Locks must be allocated before use
  • Collective: returns the same pointer to all threads
  • upc_lock_t *upc_all_lock_alloc(void);
  • Non-collective: returns a different pointer per call
  • upc_lock_t *upc_global_lock_alloc(void);
  • To use a lock
  • void upc_lock(upc_lock_t *l);
  • void upc_unlock(upc_lock_t *l);
  • Locks can be freed when not in use
  • void upc_lock_free(upc_lock_t *ptr);

16
Example Monte Carlo Pi Calculation
  • Estimate Pi by throwing darts at a unit square
  • Calculate percentage that fall in the unit circle
  • Area of square: r² = 1
  • Area of circle quadrant: ¼ π r² = π/4
  • Randomly throw darts at (x, y) positions
  • If x² + y² < 1, then the point is inside the circle
  • Compute the ratio
  • Area of quadrant = points inside / points total
  • π = 4 × area

17
Helper Code for Pi in UPC
  • Function to throw dart and calculate where it
    hits
    int hit() {
        double x = ((double) rand()) / RAND_MAX;
        double y = ((double) rand()) / RAND_MAX;
        if ((x*x + y*y) < 1.0)
            return 1;
        else
            return 0;
    }

18
Pi in UPC Shared Memory Style
    shared int hits;

    int main(int argc, char **argv) {
        int i, my_hits = 0, my_trials = 0;
        upc_lock_t *hit_lock = upc_all_lock_alloc();   /* create a lock */
        int trials = atoi(argv[1]);
        my_trials = (trials + THREADS - 1) / THREADS;
        srand(MYTHREAD * 17);
        for (i = 0; i < my_trials; i++)
            my_hits += hit();                    /* accumulate hits locally */
        upc_lock(hit_lock);
        hits += my_hits;                         /* accumulate across threads */
        upc_unlock(hit_lock);
        upc_barrier;
        if (MYTHREAD == 0)
            printf("PI: %f\n", 4.0 * hits / trials);
    }
19
Work Distribution Using upc_forall
20
Shared Arrays Are Cyclic By Default
  • Shared scalars always live in thread 0
  • Shared arrays are spread over threads cyclically
    shared int x[THREADS];        /* 1 element per thread       */
    shared int y[3][THREADS];     /* 3 elements per thread      */
    shared int z[3][3];           /* 2 or 3 elements per thread */
  • In the figure below, assume THREADS = 4 (the original
    slide highlights the elements with affinity to thread 0)

[Figure: think of the linearized C array mapped round-robin over threads.
Viewed as a 2D array, y is logically blocked by columns; z is not.]
21
Example Vector Addition
  • Questions about parallel vector addition
  • How to lay out the data (here: cyclic)
  • Which processor does what (here: owner computes)
    /* vadd.c */
    #include <upc_relaxed.h>
    #define N 100*THREADS

    shared int v1[N], v2[N], sum[N];           /* cyclic layout */

    void main() {
        int i;
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS)       /* owner computes */
                sum[i] = v1[i] + v2[i];
    }
22
Work Sharing with upc_forall()
  • The idiom in the previous slide is very common
  • Loop over all elements; work on those owned by this
    processor
  • UPC adds a special type of loop
  • upc_forall(init; test; loop; affinity)
  • Programmer indicates the iterations are
    independent
  • Behavior undefined if there are dependencies
    across threads
  • Affinity expression indicates which iterations to
    run on each thread. It may have one of two
    types
  • Integer: affinity % THREADS == MYTHREAD
  • Pointer: upc_threadof(affinity) == MYTHREAD
  • Syntactic sugar for the loop on the previous slide
  • Some compilers may do better than this, e.g.,
        for (i = MYTHREAD; i < N; i += THREADS)
  • rather than having all threads iterate N times:
        for (i = 0; i < N; i++)
            if (MYTHREAD == i % THREADS) ...

23
Vector Addition with upc_forall
  • The vadd example can be rewritten as follows
  • Equivalent code could use &sum[i] for the affinity
  • The code would be correct but slow if the
    affinity expression were i+1 rather than i.

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
        int i;
        upc_forall (i = 0; i < N; i++; i)
            sum[i] = v1[i] + v2[i];
    }

The cyclic data distribution may perform poorly
on some machines. Cache effects!
24
Distributed Arrays in UPC
25
Blocked Layouts in UPC
  • The cyclic layout is typically stored in one of
    two ways
  • Distributed memory: each processor has a chunk of
    memory
  • Thread 0 would have 0, THREADS, THREADS*2, ... in a
    chunk
  • Shared memory machine: each thread has a logical
    chunk
  • Shared memory would have 0, 1, 2, ..., THREADS,
    THREADS+1, ...
  • What performance problem is there with the
    latter?
  • The vector addition example can be rewritten as
    follows with a blocked layout

    #define N 100*THREADS
    shared [*] int v1[N], v2[N], sum[N];   /* [*] gives a blocked layout */

    void main() {
        int i;
        upc_forall (i = 0; i < N; i++; &sum[i])
            sum[i] = v1[i] + v2[i];
    }
26
Layouts in General
  • All non-array objects have affinity with thread
    zero.
  • Array layouts are controlled by layout
    specifiers
  • None (cyclic layout, i.e. block size of 1)
  • [*] (blocked layout)
  • [0] or [] (indefinite layout, all on one thread)
  • [b] or [b1][b2]...[bn] = [b1*b2*...*bn] (fixed block
    size)
  • The affinity of an array element is defined in
    terms of
  • block size, a compile-time constant
  • and THREADS.
  • Element i has affinity with thread
  • (i / block_size) % THREADS
  • In 2D and higher, linearize the elements as in a
    C row major representation, and then use above
    mapping
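A small illustrative sketch (array names are mine), assuming the static
THREADS compilation mode with THREADS = 4:

    shared        int a[16];   /* cyclic: block size 1                 */
    shared [*]    int b[16];   /* blocked: block size 16/THREADS = 4   */
    shared [0]    int c[16];   /* indefinite: everything on one thread */
    shared [2]    int d[16];   /* fixed block size 2                   */

    /* affinity of d[5]:  (5 / 2) % THREADS  =  2 % 4  =  thread 2 */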

27
2D Array Layouts in UPC
    shared [m]   int a1[n][m];
    shared [k*m] int a2[n][m];
    shared       int a3[n][m*THREADS];
  • a1 has a row layout
  • a2 has a block row layout.
  • a3 has a column layout
  • To get more general HPF and ScaLAPACK style 2D
    blocked layouts, one needs to add dimensions.
  • Assume r*c = THREADS
  • shared [b1][b2] int a5[m][n][r][c][b1][b2];
  • or equivalently
  • shared [b1*b2] int a5[m][n][r][c][b1][b2];
28
UPC Matrix Multiplication Code
    #include <upc_relaxed.h>
    #define N 4
    #define P 4
    #define M 4

    /* a and c are row-wise blocked shared matrices; b is column-wise blocked */
    shared [N*P/THREADS] int a[N][P], c[N][M];
    shared [M/THREADS]   int b[P][M];

    void main (void) {
        int i, j, k;   /* private variables */
        upc_forall (i = 0; i < N; i++; &c[i][0]) {
            for (j = 0; j < M; j++) {
                c[i][j] = 0;
                for (k = 0; k < P; k++)
                    c[i][j] += a[i][k] * b[k][j];
            }
        }
    }
29
Domain Decomposition for UPC
  • Exploits locality in matrix multiplication
  • A (N × P) is decomposed row-wise into blocks of
    size (N × P) / THREADS as shown below
  • B (P × M) is decomposed column-wise into M /
    THREADS blocks as shown below

[Figure: A's N×P elements are split row-wise into THREADS contiguous chunks:
thread 0 owns elements 0 .. (N*P/THREADS)-1, thread 1 owns (N*P/THREADS) ..
(2*N*P/THREADS)-1, ..., thread THREADS-1 owns ((THREADS-1)*N*P)/THREADS ..
(THREADS*N*P/THREADS)-1. B's M columns are split into THREADS groups: thread 0
owns columns 0 .. (M/THREADS)-1, ..., the last thread owns columns
((THREADS-1)*M)/THREADS .. M-1.]
Note: N and M are assumed to be multiples of THREADS.
30
Observations on Matrix Multiplication Code
  • The UPC code is almost the same size as the sequential code
  • Convert sequential C to parallel UPC code by
    adding shared to matrices and work sharing loop
  • Distributions are an incremental optimization
    that allow us to improve locality
  • Would still get correct result without
    distributions
  • Further Improvement
  • We may not have all needed elements of B locally
  • Making a private copy of B in each thread might
    result in better performance
  • Can be done with the help of upc_memget
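A hedged sketch of that optimization (the buffer b_local and the loop are
mine, not from the slides), assuming M is a multiple of THREADS:

    /* inside main(), before the multiply loop */
    int b_local[P][M];                 /* private copy of b */
    int k, t, blk = M / THREADS;

    for (k = 0; k < P; k++)
        for (t = 0; t < THREADS; t++)
            /* each (row k, block t) chunk of b is contiguous and has
               affinity to thread t, so one bulk get fetches it */
            upc_memget(&b_local[k][t * blk], &b[k][t * blk],
                       blk * sizeof(int));

    /* ... then read b_local[k][j] instead of b[k][j] in the inner product */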

31
Pointers and Dynamic Memory Management
32
Pointers to Shared vs. Arrays
  • In the C tradition, arrays can be accessed through
    pointers
  • Here is the vector addition example using pointers

    #define N 100*THREADS
    shared int v1[N], v2[N], sum[N];

    void main() {
        int i;
        shared int *p1, *p2;
        p1 = v1; p2 = v2;
        upc_forall (i = 0; i < N; i++, p1++, p2++; i)
            sum[i] = *p1 + *p2;
    }

33
UPC Pointers
Where does the pointer reside, and where does it point?

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int * shared p3;         /* shared pointer to local memory  */
    shared int * shared p4;  /* shared pointer to shared space  */

Shared pointers to private memory (p3) are not recommended.
34
UPC Pointers
[Figure: p1 and p2 exist once per thread in private memory; p3 and p4 live in
the shared region of the global address space.]

    int *p1;                 /* private pointer to local memory */
    shared int *p2;          /* private pointer to shared space */
    int * shared p3;         /* shared pointer to local memory  */
    shared int * shared p4;  /* shared pointer to shared space  */

Pointers to shared often require more storage and are more costly to
dereference; they may refer to local or remote memory.
35
UPC Pointers
  • In UPC pointers to shared objects have three
    fields
  • thread number
  • local address of block
  • phase (specifies position in the block)
  • < Thread, Phase, Local Address >
  • Example: the Cray T3E implementation

[Figure: a 64-bit pointer-to-shared on the T3E packs the three fields into
bit ranges 0-37, 38-48, and 49-63.]
36
UPC Pointers
  • Pointer arithmetic supports blocked and
    non-blocked array distributions
  • Casting pointers
  • shared to local is allowed
  • private to shared is not allowed
  • Casting of shared to local is well defined only
    if the object pointed to by the pointer to shared
    has affinity with the thread performing the cast
  • In general, a cast without affinity will result in
    an error
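A small sketch (mine, not from the slides) of a well-defined shared-to-local
cast:

    shared int x[THREADS];

    void touch_mine(void) {
        int *p = (int *)&x[MYTHREAD];  /* legal: x[MYTHREAD] has affinity
                                          to the casting thread           */
        *p = 42;                       /* fast local access through an
                                          ordinary C pointer              */
    }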

37
Special Functions
  • size_t upc_threadof(shared void *ptr);
  • Returns the id of the thread that the data at ptr
    has affinity to
  • size_t upc_phaseof(shared void *ptr);
  • Returns the phase (position within the block) field
    of the pointer to shared
  • shared void *upc_resetphase(shared void *ptr);
  • Resets the phase to zero
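A small illustrative sketch (the array name is mine), assuming the static
THREADS mode with THREADS = 4:

    shared [2] int a[16];                 /* fixed block size 2 */

    void query(void) {
        size_t t = upc_threadof(&a[5]);   /* (5 / 2) % THREADS = 2        */
        size_t p = upc_phaseof(&a[5]);    /* 5 % 2 = 1, position in block */
        shared void *q = upc_resetphase(&a[5]);  /* same block, phase 0,
                                                    i.e. it points at a[4] */
    }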

38
UPC Memory Allocation
  • shared void *upc_alloc(size_t nbytes);
  • Allocate nbytes of shared memory with affinity to
    the calling thread
  • upc_alloc is not collective
  • void upc_free(shared void *ptr);
  • Frees the dynamically allocated shared memory
    pointed to by ptr
  • upc_free is not collective

39
Global Memory Allocation
  • shared void *upc_global_alloc(size_t nblocks,
    size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • Non-collective: called by one thread
  • Allocates memory in the shared space
  • If called by more than one thread, multiple
    regions are allocated and each thread which makes
    the call gets a different pointer
  • Space allocated per calling thread is equivalent to
    shared [nbytes] char [nblocks * nbytes]

40
Collective Global Memory Allocation
  • shared void *upc_all_alloc(size_t nblocks,
    size_t nbytes);
  • nblocks: number of blocks; nbytes: block size
  • This function has the same result as
    upc_global_alloc, but it is a collective
    function
  • All the threads will get the same pointer
  • Equivalent to
    shared [nbytes] char [nblocks * nbytes]
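A minimal sketch (sizes, names, and casts are mine) contrasting the three
allocation calls; the statements would sit inside main():

    shared [10] int *all, *one = NULL;
    shared int *mine;

    /* collective: every thread makes the call and gets the same pointer */
    all = (shared [10] int *) upc_all_alloc(THREADS, 10 * sizeof(int));

    /* non-collective: here only thread 0 allocates, so it would have to
       publish the pointer (e.g. through a shared variable) for others   */
    if (MYTHREAD == 0)
        one = (shared [10] int *) upc_global_alloc(THREADS, 10 * sizeof(int));

    /* per-thread: nbytes with affinity to the calling thread (cf. slide 42) */
    mine = (shared int *) upc_alloc(10 * sizeof(int));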

41
One-Sided Communication
  • Semantics similar to libc memcpy(), but operating
    on the global address space
  • Allow programmer to explicitly manage
    communication
  • upc_memget(void *dst, shared void *src, size_t size)
  • One-sided get: copies data from shared to local
  • upc_memput(shared void *dst, void *src, size_t size)
  • One-sided put: copies data from local to shared
  • upc_memcpy(shared void *dst, shared void *src, size_t size)
  • One-sided copy: copies data from shared to shared
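A short sketch (names and sizes are mine) of a bulk one-sided put into a
block-distributed array:

    #define CHUNK 256
    shared [CHUNK] double out[CHUNK * THREADS];
    double local[CHUNK];

    void publish(void) {
        /* ... fill local[] ... */
        /* one bulk put: this thread's CHUNK values land in its own block */
        upc_memput(&out[MYTHREAD * CHUNK], local, CHUNK * sizeof(double));
        upc_barrier;   /* make the data visible before others read it */
    }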

42
Distributed Arrays Directory Style
  • Some high performance UPC programmers avoid the
    UPC style arrays
  • Instead, build directories of distributed objects
  • Also more general
    typedef shared [] double *sdblptr;
    shared sdblptr directory[THREADS];

    directory[i] = upc_alloc(local_size * sizeof(double));  /* i == MYTHREAD */
    upc_barrier;

43
UPC Performance
Study conducted by Kathy Yelick, Chris Bell, Rajesh Nishtala, and Dan
Bonachea at UC Berkeley
44
One-Sided vs Two-Sided Messaging
[Figure: a one-sided put message carries a destination address plus the data
payload, so the network interface can deposit it directly into memory; a
two-sided message carries only a message id plus the data payload, so the
receiving host CPU must match it against a posted receive to find the
destination address.]
  • A one-sided put/get message can be handled
    directly by a network interface with RDMA support
  • Avoid interrupting the CPU or storing data from
    CPU (preposts)
  • A two-sided message needs to be matched with a
    receive to identify the memory address to put the data
  • Offloaded to Network Interface in networks like
    Quadrics
  • Need to download match tables to interface (from
    host)
  • Ordering requirements on messages can also hinder
    bandwidth

45
Performance Advantage of One-Sided Communication
  • Opteron/InfiniBand (Jacquard at NERSC)
  • GASNet's vapi-conduit and OSU MPI 0.9.5 (MVAPICH)
  • This is a very good MPI implementation; it is
    limited by the semantics of message matching,
    ordering, etc.
  • The half-power point (N½) differs by an order of
    magnitude

Kathy Yelick with Paul Hargrove and Dan Bonachea
46
Case Study 2 NAS FT
  • Performance of Exchange (All-to-all) is critical
  • 1D FFTs in each dimension, 3 phases
  • Transpose after first 2 for locality
  • Bisection bandwidth-limited
  • This becomes a problem as the number of processors grows
  • Three approaches to transpose
  • Exchange
  • wait for 2nd dim FFTs to finish, send 1 message
    per processor pair
  • Slab
  • wait for chunk of rows destined for 1 proc, send
    when ready
  • Pencil
  • send each row as it completes

Kathy Yelick, Chris Bell, Rajesh Nishtala, Dan
Bonachea
47
Overlapping Communication
  • Goal: make use of all the wires, all the time
  • Schedule communication to avoid network backup
  • Trade-off: overhead vs. overlap
  • Exchange has the fewest messages and the least
    message overhead
  • Slabs and pencils have more overlap; pencils the
    most
  • Example: Class D problem on 256 processors

Kathy Yelick with Chris Bell, Rajesh Nishtala,
Dan Bonachea
48
NAS FT Variants Performance Summary
0.5 Tflops
  • Slab is always best for MPI: the small-message cost
    is too high
  • Pencil is always best for UPC: more overlap

Kathy Yelick with Chris Bell, Rajesh Nishtala,
Dan Bonachea
49
Summary
  • UPC extends C with explicit parallel constructs
  • UPC is a PGAS language
  • Defines a global address space
  • Every shared object has affinity to a UPC
    thread
  • UPC promises high productivity and high
    performance
  • Tunable approach to performance
  • High level: Sequential C → shared memory
  • Medium level: locality, data distribution,
    consistency
  • Low level: explicit one-sided communication
  • Gives high performance on modern capability
    systems
  • NAS FT Benchmark on IB Cluster
  • 16k Processors on

50
UPC Resources
  • UPC Language Spec, user's guide, resources
  • http://upc.gwu.edu
  • Berkeley UPC (RDMA clusters, shared memory, ...)
  • Source-to-source UPC → C translator, GASNet runtime
  • http://upc.lbl.gov
  • http://upc-wiki.lbl.gov
  • Intrepid UPC/GCC-UPC (shared memory or BUPCR)
  • Direct UPC → binary; can also use the BUPC runtime
  • http://www.intrepid.com
  • MuPC from Michigan Tech (Linux clusters)
  • EDG UPC → C translator, MPI runtime, reference
    implementation
  • http://www.upc.mtu.edu
  • Commercial UPC: IBM, SGI, Cray, HP

51
Backup Slides
52
Proposed Extensions to UPC
  • UPC Collectives
  • Thread groups/teams
  • UPC I/O
  • Nonblocking communication
  • Point-to-point synchronization
  • Atomic memory operations
  • Variable blocksize pointers
  • High performance timers
  • Hierarchical thread layout query and control
  • Non-contiguous data transfer
  • UPC
  • Active Messages

53
UPC Distributed Shared Linked List
    #include <upc.h>

    struct node_s {
        int value;
        shared struct node_s *next;
    };
    typedef struct node_s node_t;

    shared node_t *shared head = NULL;
    shared int turn = 0;

    int main(int argc, char **argv) {
        shared node_t *cur;
        shared node_t *mynode;

        mynode = upc_alloc(sizeof(node_t));
        mynode->value = MYTHREAD;
        mynode->next = NULL;
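        /* Hedged continuation (mine, not from the slides): link this
           thread's node in, one thread at a time, using the shared turn
           counter; a real code might use a lock or strict variables.     */
        while (turn != MYTHREAD)
            upc_fence;             /* spin; the fence forces fresh reads     */
        mynode->next = head;
        head = mynode;
        upc_fence;                 /* publish the new head before ...        */
        turn = turn + 1;           /* ... handing the turn to the next thread */
        upc_barrier;

        /* any thread can now traverse the list through pointers to shared */
        for (cur = head; cur != NULL; cur = cur->next)
            ;                      /* e.g. inspect cur->value */
        return 0;
    }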

54
UPC Collectives in General
  • Collectives are a proposed extension
  • Implemented in MuPC, Berkeley UPC and others
  • The UPC collectives interface is available from
  • http://www.gwu.edu/upc/docs/
  • It contains the typical functions
  • Data movement: broadcast, scatter, gather, ...
  • Computational: reduce, prefix, ...
  • Interface has synchronization modes
  • Avoid over-synchronizing (barrier before/after is
    simplest semantics, but may be unnecessary)
  • Data being collected may be read/written by any
    thread simultaneously

55
Pi in UPC Data Parallel Style
  • The previous version of Pi works, but is not
    scalable
  • On a large number of threads, the locked region
    will be a bottleneck
  • Use a reduction for better scalability
    #include <bupc_collectivev.h>   /* Berkeley collectives */

    /* shared int hits;  -- no shared variables needed any more */
    int main(int argc, char **argv) {
        ...
        for (i = 0; i < my_trials; i++)
            my_hits += hit();
        /* reduce: type, input, destination thread, operation */
        my_hits = bupc_allv_reduce(int, my_hits, 0, UPC_ADD);
        /* upc_barrier;  -- barrier implied by the collective */
        if (MYTHREAD == 0)
            printf("PI: %f\n", 4.0 * my_hits / trials);
    }
56
GASNet Portability and High-Performance
GASNet better for latency across machines
Kathy Yelick with UPC Group GASNet design by Dan
Bonachea
57
GASNet Portability and High-Performance
GASNet at least as high (comparable) for large
messages
Kathy Yelick with UPC Group GASNet design by Dan
Bonachea
58
GASNet Portability and High-Performance
GASNet excels at mid-range sizes important for
overlap
Kathy Yelick with UPC Group GASNet design by Dan
Bonachea