Title: Towards Optimized UPC Implementations
1. Towards Optimized UPC Implementations
- Tarek A. El-Ghazawi
- The George Washington University tarek_at_gwu.edu
2. Agenda
- Background
- UPC Language Overview
- Productivity
- Performance Issues
- Automatic Optimizations
- Conclusions
3. Parallel Programming Models
- What is a programming model?
- An abstract machine which outlines the view of data and execution perceived by the programmer
- Where architecture and applications meet
- A non-binding contract between the programmer and the compiler/system
- Good programming models should
  - Allow efficient mapping onto different architectures
  - Keep programming easy
- Benefits
- Application - independence from architecture
- Architecture - independence from applications
4. Programming Models
[Figure: how processes/threads relate to the address space in each model]
- Message Passing (e.g. MPI)
- Shared Memory (e.g. OpenMP)
- DSM/PGAS (e.g. UPC)
5. Programming Paradigms Expressivity
Classified by whether PARALLELISM and LOCALITY are implicit or explicit:
- Implicit parallelism, implicit locality: Sequential (e.g. C, Fortran, Java)
- Implicit parallelism, explicit locality: Data Parallel (e.g. HPF, C*)
- Explicit parallelism, implicit locality: Shared Memory (e.g. OpenMP)
- Explicit parallelism, explicit locality: Distributed Shared Memory/PGAS (e.g. UPC, CAF, and Titanium)
6. What is UPC?
- Unified Parallel C
- An explicit parallel extension of ISO C
- A distributed shared memory/PGAS parallel
programming language
7. Why not message passing?
- Performance
- High-penalty for short transactions
- Cost of calls
- Two sided
- Excessive buffering
- Ease-of-use
- Explicit data transfers
- Domain decomposition does not maintain the
original global application view
- More code and conceptual difficulty
8. Why DSM/PGAS?
- Performance
- No calls
- Efficient short transfers
- Locality
- Ease-of-use
- Implicit transfers
- Consistent global application view
- Less code and conceptual difficulty
9. Why DSM/PGAS: New Opportunities for Compiler Optimizations
[Figure: Sobel operator applied to an image divided into horizontal strips across Thread 0..Thread 3, with ghost zones at the strip boundaries]
- The DSM programming model exposes sequential remote accesses at compile time
- Opportunity for compiler-directed prefetching
10. History
- Initial Tech. Report from IDA in collaboration with LLNL and UCB in May 1999
- UPC consortium of government, academia, and HPC vendors coordinated by GWU, IDA, and DoD
- The participants currently are IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, ...
11. Status
- Specification v1.0 completed February of 2001, v1.1.1 in October of 2003; v1.2 will add collectives and UPC-IO
- Benchmarking suites: Stream, GUPS, RandomAccess, NPB suite, Splash-2, and others
- Testing suite v1.0, v1.1
- Short courses and tutorials in the US and abroad
- Research Exhibits at SC 2000-2004
- UPC web site upc.gwu.edu
- UPC Book by mid 2005 from John Wiley and Sons
- Manual(s)
12. Hardware Platforms
- UPC implementations are available for
  - SGI Origin 2000/3000
  - Intrepid 32- and 64-bit GCC
  - UCB 32-bit GCC
  - Cray T3D/E
  - Cray X-1
  - HP AlphaServer SC, Superdome
  - UPC Berkeley Compiler: Myrinet, Quadrics, and InfiniBand clusters
  - Beowulf Reference Implementation (MPI-based, MTU)
- New ongoing efforts by IBM and Sun
13. UPC Execution Model
- A number of threads working independently in a SPMD fashion
- MYTHREAD specifies thread index (0..THREADS-1)
- Number of threads specified at compile time or run time
- Process and data synchronization when needed
- Barriers and split phase barriers
- Locks and arrays of locks
- Fence
- Memory consistency control
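- A minimal SPMD sketch of this model (the messages are illustrative): every thread runs the same program, identifies itself through MYTHREAD/THREADS, and synchronizes at a barrier.

    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* Every thread executes the same program (SPMD) */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;                     /* all threads synchronize here */

        if (MYTHREAD == 0)
            printf("All %d threads passed the barrier\n", THREADS);
        return 0;
    }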
14. UPC Memory Model
[Figure: each of Thread 0 .. Thread THREADS-1 owns a partition of the shared space plus its own private space (Private 0 .. Private THREADS-1)]
- Shared space with thread affinity, plus private spaces
- A pointer-to-shared can reference all locations in the shared space
- A private pointer may reference only addresses in its private space or addresses in its portion of the shared space
- Static and dynamic memory allocations are supported for both shared and private memory
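- A small sketch of static and dynamic shared allocation under this model; the names counters, a, and buf are illustrative, and the standard upc.h allocation calls are assumed.

    #include <upc.h>

    shared int counters[THREADS];     /* one element with affinity to each thread */
    shared [4] double a[4*THREADS];   /* block size 4: 4 consecutive elements per thread */

    int main(void)
    {
        /* Collective allocation: THREADS blocks of 128 ints, one block per thread */
        shared int *buf = (shared int *) upc_all_alloc(THREADS, 128 * sizeof(int));

        counters[MYTHREAD] = MYTHREAD;   /* each thread writes the element it has affinity to */

        upc_barrier;
        if (MYTHREAD == 0)
            upc_free(buf);               /* release the dynamically allocated shared space */
        return 0;
    }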
15. UPC Pointers
- How to declare them?
  - int *p1;                 /* private pointer pointing locally */
  - shared int *p2;          /* private pointer pointing into the shared space */
  - int *shared p3;          /* shared pointer pointing locally */
  - shared int *shared p4;   /* shared pointer pointing into the shared space */
- You may find many using "shared pointer" to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well.
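- A brief usage sketch of the pointer flavors above (array is an illustrative shared array); note that a pointer-to-shared may be cast to a private pointer only for data with affinity to the calling thread.

    #include <upc.h>

    shared int array[100*THREADS];

    int main(void)
    {
        shared int *p2 = &array[0];      /* private pointer-to-shared: may reference any element */

        /* array[MYTHREAD] has affinity to this thread under the default cyclic
           layout, so casting its address to a private pointer is legal. */
        int *p1 = (int *) &array[MYTHREAD];
        *p1 = 7;                         /* local access, no shared-address translation */

        p2 += MYTHREAD;                  /* pointer-to-shared arithmetic follows the shared layout */
        return 0;
    }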
16. UPC Pointers
[Figure: where p2, p3, and p4 reside (shared vs. private space of Thread 0) and where each of them may point]
17. Synchronization - Barriers
- No implicit synchronization among the threads
- UPC provides the following synchronization mechanisms
- Barriers
- Locks
- Memory Consistency Control
- Fence
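- A lock sketch (shared_lock and total are illustrative names): all threads update a shared counter inside a critical section and then meet at a barrier.

    #include <upc.h>

    shared int total;                /* shared counter, affinity to thread 0 */
    upc_lock_t *shared_lock;         /* every thread gets a pointer to the same lock */

    int main(void)
    {
        shared_lock = upc_all_lock_alloc();   /* collective: one lock shared by all threads */

        if (MYTHREAD == 0)
            total = 0;
        upc_barrier;                  /* make the initialization visible to all threads */

        upc_lock(shared_lock);
        total += 1;                   /* critical section on shared data */
        upc_unlock(shared_lock);

        upc_barrier;
        if (MYTHREAD == 0)
            upc_lock_free(shared_lock);
        return 0;
    }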
18. Memory Consistency Models
- Has to do with the ordering of shared operations, and when a change to a shared object made by one thread becomes visible to others
- Consistency can be strict or relaxed
- Under the relaxed consistency model, shared operations can be reordered by the compiler / runtime system
- The strict consistency model enforces sequential ordering of shared operations (no operation on shared data can begin before the previous ones are done, and changes become visible immediately)
19. Memory Consistency Models
- User specifies the memory model through
  - declarations
  - pragmas for a particular statement or sequence of statements
  - use of barriers and global operations
- Programmers are responsible for using the correct consistency model
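- A sketch of these mechanisms (data and ready are illustrative names): the file defaults to relaxed through upc_relaxed.h, while a strict flag orders the producer's writes with respect to the consumer's reads.

    #include <upc.h>
    #include <upc_relaxed.h>        /* default all shared accesses in this file to relaxed */

    shared [] int data[100];        /* relaxed data, all with affinity to thread 0 */
    strict shared int ready;        /* strict flag; static shared objects start at 0 */

    int main(void)
    {
        if (MYTHREAD == 0) {
            for (int i = 0; i < 100; i++)
                data[i] = i;        /* relaxed writes: may be reordered or aggregated */
            ready = 1;              /* strict write: issued only after prior writes complete */
        } else if (MYTHREAD == 1) {
            while (!ready)          /* strict reads: poll until thread 0 signals */
                ;
            /* the contents of data[] are now guaranteed to be visible */
        }
        return 0;
    }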
20. UPC and Productivity
- Metrics
  - Lines of useful code
    - indicates the development time as well as the maintenance cost
  - Number of useful characters
    - an alternative way to measure development and maintenance efforts
  - Conceptual complexity
    - function level, keyword usage, number of tokens, max loop depth, ...
21. Manual Effort: NPB Example
22. Manual Effort: More Examples
23. Conceptual Complexity - HIST
24. Conceptual Complexity - GUPS
25. UPC Optimization Issues
- Particular Challenges
- Avoiding Address Translation
- Cost of Address Translation
- Special Opportunities
- Locality-driven compiler-directed prefetching
- Aggregation
- General
- Low-level optimized libraries, e.g. collective
- Backend optimizations
- Overlapping of remote accesses and
synchronization with other work
26. Showing Potential Optimizations Through Emulated Hand-Tuning
- Different hand-tuning levels
  - Unoptimized UPC code: referred to as UPC.O0
  - Privatized UPC code: referred to as UPC.O1 (see the sketch below)
  - Prefetched UPC code: hand-optimized variant using block get/put to mimic the effect of prefetching; referred to as UPC.O2
  - Fully hand-tuned UPC code: hand-optimized variant integrating privatization and aggregation of remote accesses as well as prefetching; referred to as UPC.O3
- T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues," Proceedings of the 2001 International Conference on Parallel Processing (ICPP'01), pp. 365-372
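- A privatization sketch of the UPC.O1 level above (grid, the block size of 1024, and scale_local are illustrative): the local block of a shared array is accessed through an ordinary C pointer, so no shared-address translation is paid inside the loop.

    #include <upc.h>

    shared [1024] double grid[1024*THREADS];   /* block-distributed: 1024 elements per thread */

    void scale_local(double factor)
    {
        /* grid[MYTHREAD*1024] is the first element with affinity to this thread,
           so casting its address to a private pointer is legal. */
        double *mine = (double *) &grid[MYTHREAD * 1024];

        for (int i = 0; i < 1024; i++)
            mine[i] *= factor;                 /* purely local accesses, no translation cost */
    }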
27. Address Translation Cost and Local Space Privatization - Cluster
MB/s Put Get Scale Sum
CC N/A N/A 1565.04 5409.3
UPC Private N/A N/A 1687.63 1776.81
UPC Local 1196.51 1082.89 54.22 82.7
UPC Remote 241.43 237.51 0.09 0.16
MB/s Copy (arr) Copy (ptr) Memcpy Memset
CC 1340.99 1488.02 1223.86 2401.26
UPC Private 1383.57 433.45 1252.47 2352.71
UPC Local 47.2 90.67 1202.8 2398.9
UPC Remote 0.09 0.20 1197.22 2360.59
STREAM BENCHMARK
Results gathered on a Myrinet Cluster
28. Address Translation and Local Space Privatization - DSM Architecture
Bulk operations
Element-by-Element operations
MB/s Memory copy Block Get Block Put Array Set Array Copy Sum Scale
GCC 127 N/A N/A 175 106 223 108
UPC Private 127 N/A N/A 173 106 215 107
UPC Local Shared 139 140 136 26 14 31 13
UPC Remote Shared (within SMP node) 130 129 136 26 13 30 13
UPC Remote Shared (beyond SMP node) 112 117 136 24 12 28 12
STREAM BENCHMARK MB/S
29. Aggregation and Overlapping of Remote Shared Memory Accesses
UPC N-Queens Execution Time
UPC Sobel Edge Execution Time
- The benefit of hand-optimizations is greatly application dependent
- N-Queens does not perform any better, mainly because it is an embarrassingly parallel program
- The Sobel Edge Detector gets a speedup of one order of magnitude after hand-optimizing, and scales perfectly linearly
- SGI Origin 2000
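- A sketch of the block-transfer idea behind UPC.O2 as applied to Sobel's ghost zones (the image layout, COLS, and ROWS_PER_THREAD are illustrative): each thread fetches the boundary rows it needs from its neighbors with one upc_memget per row instead of many single-element remote reads.

    #include <upc.h>

    #define COLS            512
    #define ROWS_PER_THREAD 128

    /* Each thread owns one contiguous block of ROWS_PER_THREAD image rows */
    shared [ROWS_PER_THREAD*COLS] unsigned char img[THREADS*ROWS_PER_THREAD*COLS];

    void fetch_ghost_rows(unsigned char upper[COLS], unsigned char lower[COLS])
    {
        /* Last row of the previous thread's block, fetched with a single bulk get */
        if (MYTHREAD > 0)
            upc_memget(upper, &img[(MYTHREAD * ROWS_PER_THREAD - 1) * COLS], COLS);

        /* First row of the next thread's block */
        if (MYTHREAD < THREADS - 1)
            upc_memget(lower, &img[(MYTHREAD + 1) * ROWS_PER_THREAD * COLS], COLS);
    }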
30. Impact of Hand-Optimizations on NPB.CG Class A on SGI Origin 2000
31. Shared Address Translation Overhead
- Address translation overhead is quite significant
  - More than 70% of the work for a local-shared memory access
- Demonstrates the real need for optimization
Overhead Present in Local-Shared Memory Accesses (SGI Origin 2000, GCC-UPC)
Quantification of the Address Translation Overheads
32. Shared Address Translation Overheads for Sobel Edge Detection
UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notations from T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues," Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001
33. Reducing Address Translation Overheads via Translation Look-Aside Buffers
- F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers," IPDPS'05, Denver, CO, April 2005
- Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
- Two alternative methods proposed to create and use MMTBs
  - FT: basic method using direct addressing
  - RT: advanced method using indexed addressing
- Was prototyped as a compiler-enabled optimization
  - no modifications to actual UPC codes are needed
34. Different Strategies: Full Table (FT)
- Pros
- Direct mapping
- No address calculation
- Cons
- Large memory required
- Can lead to competition over caches and main
memory
Consider: shared [B] int array[8]
To initialize FT: ∀i ∈ [0,7], FT[i] = _get_vaddr(&array[i])
To access array: ∀i ∈ [0,7], array[i] = _get_value_at(FT[i])
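- A minimal C sketch of the FT scheme above, assuming _get_vaddr()/_get_value_at() are the runtime helpers named on the slide and a static THREADS environment: the table is filled once, and each later access is a single direct load.

    #include <upc.h>

    #define N 8
    shared [1] int array[N];     /* the shared [B] int array[8] of the example */

    static void *FT[N];          /* per-thread private full table: one entry per element */

    void build_ft(void)
    {
        for (int i = 0; i < N; i++)
            FT[i] = _get_vaddr(&array[i]);   /* one translation per element, paid once */
    }

    int read_elem(int i)
    {
        return _get_value_at(FT[i]);         /* no address calculation on the access path */
    }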
35. Different Strategies: Reduced Table (RT), Infinite Blocksize
- RT Strategy
- Only one table entry in this case
- Address calculation step is simple in that case
BLOCKSIZE = infinite: only the address of the first element of the array needs to be saved, since all array data is contiguous.
Consider: shared [] int array[4]
To initialize RT: RT[0] = _get_vaddr(&array[0])
To access array: ∀i ∈ [0,3], array[i] = _get_value_at(RT[0] + i)
[Figure: with an infinite blocksize, array[0..3] all reside on THREAD 0; every thread's single RT[0] entry points at array[0]]
36. Different Strategies: Reduced Table (RT), Default Blocksize
BLOCKSIZE = 1: only the address of the first element on each thread needs to be saved, since each thread's portion of the array data is stored contiguously.
Consider: shared [1] int array[16]
To initialize RT: ∀i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i])
To access array: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS))
- RT Strategy
- Less memory required than FT; the MMTB buffer has THREADS entries
- Address calculation step is a bit costly, but much cheaper than in current implementations
[Figure: with the default blocksize, array[0..15] is distributed cyclically across THREAD 0..3; the RT entries each point at the first element held by one thread]
37. Different Strategies: Reduced Table (RT), Arbitrary Blocksize
- RT Strategy
- Less memory required than for FT, but more than in the previous cases
- Address calculation step more costly than in the previous cases
ARBITRARY BLOCKSIZE: only the address of the first element of each block needs to be saved, since all data within a block is contiguous.
Consider: shared [2] int array[16]
To initialize RT: ∀i ∈ [0,7], RT[i] = _get_vaddr(&array[i * blocksize(array)])
To access array: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))
[Figure: blocksize-2 distribution of array[0..15] across THREAD 0..3; the eight entries RT[0..7] each point at the first element of one block]
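- A minimal C sketch of the RT scheme for the arbitrary-blocksize case above, again assuming the _get_vaddr()/_get_value_at() helpers named on the slides and a static THREADS environment: one table entry per block, plus a small index calculation on every access.

    #include <upc.h>

    #define N 16
    #define B 2                          /* blocksize of the example */
    shared [B] int array[N];

    static void *RT[N / B];              /* reduced table: one entry per block */

    void build_rt(void)
    {
        for (int b = 0; b < N / B; b++)
            RT[b] = _get_vaddr(&array[b * B]);   /* first element of each block */
    }

    int read_elem(int i)
    {
        /* base address of the block plus the offset within the block */
        return _get_value_at((int *) RT[i / B] + (i % B));
    }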
38. Performance Impact of the MMTB: Sobel Edge
Performance of Sobel Edge Detection using the new MMTB strategies (with and without O0)
- FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
- The RT strategy is slower than FT since the address calculation (arbitrary blocksize case) becomes more complex
- FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)
39. Performance Impact of the MMTB: Matrix Multiplication
Performance and Hardware Profiling of Matrix
Multiplication using new MMTB strategies
- FT strategy: increase in L1 data cache misses due to the large table size
- RT strategy: L1 misses kept low, but an increase in the number of loads and stores is observed, showing increased computation (arbitrary blocksize used)
40. Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel
For a shared array of N elements with B as blocksize, the comparison covers storage requirements per shared array, number of memory accesses per shared memory access, and number of arithmetic operations per shared memory access (E = element size in bytes, P = pointer size in bytes).
Memory accesses / arithmetic operations per shared memory access:
- UPC.O0: more than 25 / more than 5
- UPC.O0.FT: 1 / 0
- UPC.O0.RT: 1 / up to 3
Comparison among Optimizations of Storage, Memory Accesses and Computation Requirements
- Number of loads and stores can increase with
arithmetic operators
41. UPC Work-sharing Construct Optimizations
- By thread/index number (upc_forall integer)
    upc_forall(i = 0; i < N; i++; i)
        loop body;
- By the address of a shared variable (upc_forall address)
    upc_forall(i = 0; i < N; i++; &shared_var[i])
        loop body;
- By thread/index number (for optimized)
    for(i = MYTHREAD; i < N; i += THREADS)
        loop body;
- By thread/index number (for integer)
    for(i = 0; i < N; i++)
    {
        if(MYTHREAD == i % THREADS)
            loop body;
    }
- By the address of a shared variable (for address)
    for(i = 0; i < N; i++)
    {
        if(upc_threadof(&shared_var[i]) == MYTHREAD)
            loop body;
    }
42. Performance of Equivalent upc_forall and for Loops
43. Performance Limitations Imposed by Sequential C Compilers -- STREAM
All values in MB/s; memcpy, memset, and Struct cp are bulk operations, the rest are element-by-element.
              memcpy   memset   Struct cp  Copy (arr)  Copy (ptr)  Set      Sum      Scale   Add     Triad
NUMA    F     291.21   163.90   N/A        291.59      N/A         159.68   135.37   246.3   235.1   303.82
NUMA    C     231.20   214.62   158.86     120.57      152.77      147.70   298.38   133.4   13.86   20.71
Vector  F     14423    11051    N/A        14407       N/A         11015    17837    14423   10715   16053
Vector  C     18850    5307     7882       7972        7969        10576    18260    7865    3874    5824
44. Loopmark: SET/ADD Operations
All values in MB/s on the vector system; memcpy, memset, and Struct cp are bulk operations, the rest are element-by-element.
        memcpy   memset   Struct cp  Copy (arr)  Copy (ptr)  Set      Sum      Scale   Add     Triad
F       14423    11051    N/A        14407       N/A         11015    17837    14423   10715   16053
C       18850    5307     7882       7972        7969        10576    18260    7865    3874    5824
Let us compare loopmarks for each F / C operation
45. Loopmark: SET/ADD Operations
Fortran
- MEMSET (bulk set)
    146. 1            t = mysecond(tflag)
    147. 1  V M--<><> a(1:n) = 1.0d0
    148. 1            t = mysecond(tflag) - t
    149. 1            times(2,k) = t
- SET
    158. 1            arrsum = 2.0d0
    159. 1            t = mysecond(tflag)
    160. 1  MV------< DO i = 1,n
    161. 1  MV          c(i) = arrsum
    162. 1  MV          arrsum = arrsum + 1
    163. 1  MV------> END DO
    164. 1            t = mysecond(tflag) - t
    165. 1            times(4,k) = t
- ADD
    180. 1            t = mysecond(tflag)
    181. 1  V M--<><> c(1:n) = a(1:n) + b(1:n)
C
- MEMSET (bulk set)
    163. 1            times[1][k] = mysecond_()
    164. 1            memset(a, 1, NDIM*sizeof(elem_t))
    165. 1            times[1][k] = mysecond_() - times[1][k]
- SET
    217. 1            set = 2
    220. 1            times[5][k] = mysecond_()
    222. 1  MV--<     for (i = 0; i < NDIM; i++)
    223. 1  MV
    224. 1  MV          c[i] = (set);
    225. 1  MV-->
    227. 1            times[5][k] = mysecond_() - times[5][k]
- ADD
    283. 1            times[10][k] = mysecond_()
    285. 1  Vp--<     for (j = 0; j < NDIM; j++)
    286. 1  Vp
    287. 1  Vp          c[j] = a[j] + b[j]
Legend: V = Vectorized, M = Multistreamed, p = conditional, partial and/or computed
46. UPC vs. CAF Using the NPB Workloads
- In general, UPC is slower than CAF, mainly due to:
  - Point-to-point vs. barrier synchronization
    - Better scalability with proper collective operations
    - Program writers can do point-to-point synchronization using current constructs
  - Scalar performance of source-to-source translated code
    - Alias analysis (C pointers)
      - Can highlight the need for explicitly using restrict to help several compiler backends (see the sketch after this list)
    - Lack of support for multi-dimensional arrays in C
      - Can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
    - Need for exhaustive C compiler analysis
      - A failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF
      - A failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
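- A sketch of the restrict point above (daxpy_local is an illustrative routine operating on already-privatized local blocks): restrict-qualified pointers tell the C backend the arrays do not alias, which helps vectorization and software pipelining.

    /* C99: the restrict qualifiers assert that x and y do not overlap */
    void daxpy_local(int n, double alpha, double *restrict x, double *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];    /* backend can now pipeline/vectorize freely */
    }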
47. Conclusions
- UPC is a locality-aware parallel programming language
- With proper optimizations, UPC can outperform MPI in random short accesses and can otherwise perform as well as MPI
- UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
- UPC compiler optimizations are still lagging, in spite of the fact that substantial progress has been made
- For future architectures, UPC has the unique opportunity of having very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
48. Conclusions
- In general, four types of optimizations:
  - Optimizations to exploit the locality consciousness and other unique features of UPC
  - Optimizations to keep the overhead of UPC low
  - Optimizations to exploit architectural features
  - Standard optimizations that are applicable to all systems' compilers
49. Conclusions
- Optimizations possible at three levels:
  - Source-to-source translation acting during the compilation phase and incorporating most UPC-specific optimizations
  - C backend compilers that can compete with Fortran
  - A strong run-time system that can work effectively with the operating system
50. Selected Publications
- T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming. John Wiley & Sons, Inc., New York, 2005. ISBN 0-471-22048-5. (June 2005)
- T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study," Journal of Future Generation Computer Systems, North-Holland (accepted)
51. Selected Publications
- T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues," Proceedings of the 2001 International Conference on Parallel Processing (ICPP'01), pp. 365-372
- T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study," Supercomputing 2002 (SC2002), Baltimore, November 2002
- F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers," IPDPS'05, Denver, CO, April 2005
- CUG and PPoPP