Title: Towards Optimized UPC Implementations
1. Towards Optimized UPC Implementations
- Tarek A. El-Ghazawi
- The George Washington University tarek_at_gwu.edu
2. Agenda
- Background
- UPC Language Overview
- Productivity
- Performance Issues
- Automatic Optimizations
- Conclusions
3. Parallel Programming Models
- What is a programming model?
- An abstract machine which outlines the view of data and execution perceived by the programmer
- Where architecture and applications meet
- A non-binding contract between the programmer and the compiler/system
- Good programming models should
  - Allow efficient mapping onto different architectures
  - Keep programming easy
- Benefits
- Application - independence from architecture
- Architecture - independence from applications
4. Programming Models
[Figure: how processes/threads relate to the address space in each model]
- Message Passing (e.g. MPI)
- Shared Memory (e.g. OpenMP)
- DSM/PGAS (e.g. UPC)
5. Programming Paradigms Expressivity
Classified by whether PARALLELISM and LOCALITY are implicit or explicit:
- Implicit parallelism, implicit locality: Sequential (e.g. C, Fortran, Java)
- Implicit parallelism, explicit locality: Data Parallel (e.g. HPF, C*)
- Explicit parallelism, implicit locality: Shared Memory (e.g. OpenMP)
- Explicit parallelism, explicit locality: Distributed Shared Memory/PGAS (e.g. UPC, CAF, and Titanium)
6. What is UPC?
- Unified Parallel C
- An explicit parallel extension of ISO C
- A distributed shared memory/PGAS parallel
programming language
7. Why not message passing?
- Performance
- High-penalty for short transactions
- Cost of calls
- Two sided
- Excessive buffering
- Ease-of-use
- Explicit data transfers
- Domain decomposition does not maintain the
original global application view
- More code and conceptual difficulty
8. Why DSM/PGAS?
- Performance
- No calls
- Efficient short transfers
- Locality
- Ease-of-use
- Implicit transfers
- Consistent global application view
- Less code and conceptual difficulty
9. Why DSM/PGAS: New Opportunities for Compiler Optimizations
[Figure: Sobel operator applied to an image divided into horizontal strips across Thread 0..Thread 3, with ghost zones at the strip boundaries]
- The DSM programming model exposes sequential remote accesses at compile time
- Opportunity for compiler-directed prefetching
10. History
- Initial Tech. Report from IDA in collaboration with LLNL and UCB in May 1999
- UPC consortium of government, academia, and HPC vendors coordinated by GWU, IDA, and DoD
- The participants currently are IDA CCS, GWU, UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL, LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid, Etnus, ...
11. Status
- Specification v1.0 completed February of 2001, v1.1.1 in October of 2003; v1.2 will add collectives and UPC-IO
- Benchmarking suites: Stream, GUPS, RandomAccess, NPB suite, Splash-2, and others
- Testing suite v1.0, v1.1
- Short courses and tutorials in the US and abroad
- Research Exhibits at SC 2000-2004
- UPC web site upc.gwu.edu
- UPC Book by mid 2005 from John Wiley and Sons
- Manual(s)
12. Hardware Platforms
- UPC implementations are available for
  - SGI Origin 2000/3000
  - Intrepid 32- and 64-bit GCC
  - UCB 32-bit GCC
  - Cray T3D/E
  - Cray X-1
  - HP AlphaServer SC, Superdome
  - UPC Berkeley Compiler: Myrinet, Quadrics, and InfiniBand clusters
  - Beowulf Reference Implementation (MPI-based, MTU)
- New ongoing efforts by IBM and Sun
13. UPC Execution Model
- A number of threads working independently in a SPMD fashion
- MYTHREAD specifies thread index (0..THREADS-1)
- Number of threads specified at compile time or run time
- Process and data synchronization when needed
- Barriers and split phase barriers
- Locks and arrays of locks
- Fence
- Memory consistency control
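- A minimal SPMD sketch of this model (the messages are illustrative): every thread runs the same program, identifies itself through MYTHREAD/THREADS, and synchronizes at a barrier.

    #include <upc.h>
    #include <stdio.h>

    int main(void)
    {
        /* Every thread executes the same program (SPMD) */
        printf("Hello from thread %d of %d\n", MYTHREAD, THREADS);

        upc_barrier;                     /* all threads synchronize here */

        if (MYTHREAD == 0)
            printf("All %d threads passed the barrier\n", THREADS);
        return 0;
    }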
14. UPC Memory Model
[Figure: each of Thread 0 .. Thread THREADS-1 owns a partition of the shared space plus its own private space (Private 0 .. Private THREADS-1)]
- Shared space with thread affinity, plus private spaces
- A pointer-to-shared can reference all locations in the shared space
- A private pointer may reference only addresses in its private space or addresses in its portion of the shared space
- Static and dynamic memory allocations are supported for both shared and private memory
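- A small sketch of static and dynamic shared allocation under this model; the names counters, a, and buf are illustrative, and the standard upc.h allocation calls are assumed.

    #include <upc.h>

    shared int counters[THREADS];     /* one element with affinity to each thread */
    shared [4] double a[4*THREADS];   /* block size 4: 4 consecutive elements per thread */

    int main(void)
    {
        /* Collective allocation: THREADS blocks of 128 ints, one block per thread */
        shared int *buf = (shared int *) upc_all_alloc(THREADS, 128 * sizeof(int));

        counters[MYTHREAD] = MYTHREAD;   /* each thread writes the element it has affinity to */

        upc_barrier;
        if (MYTHREAD == 0)
            upc_free(buf);               /* release the dynamically allocated shared space */
        return 0;
    }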
15. UPC Pointers
- How to declare them?
  - int *p1;                 /* private pointer pointing locally */
  - shared int *p2;          /* private pointer pointing into the shared space */
  - int *shared p3;          /* shared pointer pointing locally */
  - shared int *shared p4;   /* shared pointer pointing into the shared space */
- You may find many using "shared pointer" to mean a pointer pointing to a shared object, e.g. equivalent to p2, but it could be p4 as well.
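- A brief usage sketch of the pointer flavors above (array is an illustrative shared array); note that a pointer-to-shared may be cast to a private pointer only for data with affinity to the calling thread.

    #include <upc.h>

    shared int array[100*THREADS];

    int main(void)
    {
        shared int *p2 = &array[0];      /* private pointer-to-shared: may reference any element */

        /* array[MYTHREAD] has affinity to this thread under the default cyclic
           layout, so casting its address to a private pointer is legal. */
        int *p1 = (int *) &array[MYTHREAD];
        *p1 = 7;                         /* local access, no shared-address translation */

        p2 += MYTHREAD;                  /* pointer-to-shared arithmetic follows the shared layout */
        return 0;
    }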
16. UPC Pointers
[Figure: where p2, p3, and p4 reside (shared vs. private space of Thread 0) and where each of them may point]
17. Synchronization - Barriers
- No implicit synchronization among the threads
- UPC provides the following synchronization mechanisms
- Barriers
- Locks
- Memory Consistency Control
- Fence
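- A lock sketch (shared_lock and total are illustrative names): all threads update a shared counter inside a critical section and then meet at a barrier.

    #include <upc.h>

    shared int total;                /* shared counter, affinity to thread 0 */
    upc_lock_t *shared_lock;         /* every thread gets a pointer to the same lock */

    int main(void)
    {
        shared_lock = upc_all_lock_alloc();   /* collective: one lock shared by all threads */

        if (MYTHREAD == 0)
            total = 0;
        upc_barrier;                  /* make the initialization visible to all threads */

        upc_lock(shared_lock);
        total += 1;                   /* critical section on shared data */
        upc_unlock(shared_lock);

        upc_barrier;
        if (MYTHREAD == 0)
            upc_lock_free(shared_lock);
        return 0;
    }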
18. Memory Consistency Models
- Has to do with the ordering of shared operations, and when a change to a shared object made by one thread becomes visible to others
- Consistency can be strict or relaxed
- Under the relaxed consistency model, shared operations can be reordered by the compiler / runtime system
- The strict consistency model enforces sequential ordering of shared operations (no operation on shared data can begin before the previous ones are done, and changes become visible immediately)
19. Memory Consistency Models
- User specifies the memory model through
  - declarations
  - pragmas for a particular statement or sequence of statements
  - use of barriers and global operations
- Programmers are responsible for using the correct consistency model
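- A sketch of these mechanisms (data and ready are illustrative names): the file defaults to relaxed through upc_relaxed.h, while a strict flag orders the producer's writes with respect to the consumer's reads.

    #include <upc.h>
    #include <upc_relaxed.h>        /* default all shared accesses in this file to relaxed */

    shared [] int data[100];        /* relaxed data, all with affinity to thread 0 */
    strict shared int ready;        /* strict flag; static shared objects start at 0 */

    int main(void)
    {
        if (MYTHREAD == 0) {
            for (int i = 0; i < 100; i++)
                data[i] = i;        /* relaxed writes: may be reordered or aggregated */
            ready = 1;              /* strict write: issued only after prior writes complete */
        } else if (MYTHREAD == 1) {
            while (!ready)          /* strict reads: poll until thread 0 signals */
                ;
            /* the contents of data[] are now guaranteed to be visible */
        }
        return 0;
    }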
20. UPC and Productivity
- Metrics
  - Lines of useful code
    - indicates the development time as well as the maintenance cost
  - Number of useful characters
    - an alternative way to measure development and maintenance efforts
  - Conceptual complexity
    - function level, keyword usage, number of tokens, max loop depth, ...
21. Manual Effort: NPB Example
22. Manual Effort: More Examples
23. Conceptual Complexity - HIST
24. Conceptual Complexity - GUPS
25. UPC Optimization Issues
- Particular Challenges
- Avoiding Address Translation
- Cost of Address Translation
- Special Opportunities
- Locality-driven compiler-directed prefetching
- Aggregation
- General
- Low-level optimized libraries, e.g. collective
- Backend optimizations
- Overlapping of remote accesses and
synchronization with other work
26. Showing Potential Optimizations Through Emulated Hand-Tuning
- Different hand-tuning levels
  - Unoptimized UPC code: referred to as UPC.O0
  - Privatized UPC code: referred to as UPC.O1 (see the sketch below)
  - Prefetched UPC code: hand-optimized variant using block get/put to mimic the effect of prefetching; referred to as UPC.O2
  - Fully hand-tuned UPC code: hand-optimized variant integrating privatization and aggregation of remote accesses as well as prefetching; referred to as UPC.O3
- T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues," Proceedings of the 2001 International Conference on Parallel Processing (ICPP'01), pp. 365-372
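- A privatization sketch of the UPC.O1 level above (grid, the block size of 1024, and scale_local are illustrative): the local block of a shared array is accessed through an ordinary C pointer, so no shared-address translation is paid inside the loop.

    #include <upc.h>

    shared [1024] double grid[1024*THREADS];   /* block-distributed: 1024 elements per thread */

    void scale_local(double factor)
    {
        /* grid[MYTHREAD*1024] is the first element with affinity to this thread,
           so casting its address to a private pointer is legal. */
        double *mine = (double *) &grid[MYTHREAD * 1024];

        for (int i = 0; i < 1024; i++)
            mine[i] *= factor;                 /* purely local accesses, no translation cost */
    }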
27. Address Translation Cost and Local Space Privatization - Cluster
MB/s Put Get Scale Sum
CC N/A N/A 1565.04 5409.3
UPC Private N/A N/A 1687.63 1776.81
UPC Local 1196.51 1082.89 54.22 82.7
UPC Remote 241.43 237.51 0.09 0.16
MB/s Copy (arr) Copy (ptr) Memcpy Memset
CC 1340.99 1488.02 1223.86 2401.26
UPC Private 1383.57 433.45 1252.47 2352.71
UPC Local 47.2 90.67 1202.8 2398.9
UPC Remote 0.09 0.20 1197.22 2360.59
STREAM BENCHMARK
Results gathered on a Myrinet Cluster
28. Address Translation and Local Space Privatization - DSM Architecture
Bulk operations
Element-by-Element operations
MB/s Memory copy Block Get Block Put Array Set Array Copy Sum Scale
GCC 127 N/A N/A 175 106 223 108
UPC Private 127 N/A N/A 173 106 215 107
UPC Local Shared 139 140 136 26 14 31 13
UPC Remote Shared (within SMP node) 130 129 136 26 13 30 13
UPC Remote Shared (beyond SMP node) 112 117 136 24 12 28 12
STREAM BENCHMARK MB/S
29. Aggregation and Overlapping of Remote Shared Memory Accesses
UPC N-Queens Execution Time
UPC Sobel Edge Execution Time
- The benefit of hand-optimizations is greatly application dependent
- N-Queens does not perform any better, mainly because it is an embarrassingly parallel program
- The Sobel Edge Detector gets a speedup of one order of magnitude after hand-optimizing, and scales perfectly linearly
- SGI Origin 2000
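- A sketch of the block-transfer idea behind UPC.O2 as applied to Sobel's ghost zones (the image layout, COLS, and ROWS_PER_THREAD are illustrative): each thread fetches the boundary rows it needs from its neighbors with one upc_memget per row instead of many single-element remote reads.

    #include <upc.h>

    #define COLS            512
    #define ROWS_PER_THREAD 128

    /* Each thread owns one contiguous block of ROWS_PER_THREAD image rows */
    shared [ROWS_PER_THREAD*COLS] unsigned char img[THREADS*ROWS_PER_THREAD*COLS];

    void fetch_ghost_rows(unsigned char upper[COLS], unsigned char lower[COLS])
    {
        /* Last row of the previous thread's block, fetched with a single bulk get */
        if (MYTHREAD > 0)
            upc_memget(upper, &img[(MYTHREAD * ROWS_PER_THREAD - 1) * COLS], COLS);

        /* First row of the next thread's block */
        if (MYTHREAD < THREADS - 1)
            upc_memget(lower, &img[(MYTHREAD + 1) * ROWS_PER_THREAD * COLS], COLS);
    }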
30. Impact of Hand-Optimizations on NPB.CG Class A on SGI Origin 2000
31. Shared Address Translation Overhead
- Address translation overhead is quite significant
  - More than 70% of the work for a local-shared memory access
- Demonstrates the real need for optimization
Overhead Present in Local-Shared Memory Accesses (SGI Origin 2000, GCC-UPC)
Quantification of the Address Translation Overheads
32. Shared Address Translation Overheads for Sobel Edge Detection
UPC.O0: unoptimized UPC code; UPC.O3: hand-optimized UPC code. Ox notations from T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues," Proceedings of the 2001 International Conference on Parallel Processing, Valencia, September 2001
33. Reducing Address Translation Overheads via Translation Look-Aside Buffers
- F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers," IPDPS'05, Denver, CO, April 2005
- Use look-up Memory Model Translation Buffers (MMTB) to perform fast translations
- Two alternative methods proposed to create and use MMTBs
  - FT: basic method using direct addressing
  - RT: advanced method using indexed addressing
- Was prototyped as a compiler-enabled optimization
  - no modifications to actual UPC codes are needed
34. Different Strategies: Full Table (FT)
- Pros
- Direct mapping
- No address calculation
- Cons
- Large memory required
- Can lead to competition over caches and main
memory
Consider: shared [B] int array[8]
To initialize FT: ∀i ∈ [0,7], FT[i] = _get_vaddr(&array[i])
To access array: ∀i ∈ [0,7], array[i] = _get_value_at(FT[i])
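- A minimal C sketch of the FT scheme above, assuming _get_vaddr()/_get_value_at() are the runtime helpers named on the slide and a static THREADS environment: the table is filled once, and each later access is a single direct load.

    #include <upc.h>

    #define N 8
    shared [1] int array[N];     /* the shared [B] int array[8] of the example */

    static void *FT[N];          /* per-thread private full table: one entry per element */

    void build_ft(void)
    {
        for (int i = 0; i < N; i++)
            FT[i] = _get_vaddr(&array[i]);   /* one translation per element, paid once */
    }

    int read_elem(int i)
    {
        return _get_value_at(FT[i]);         /* no address calculation on the access path */
    }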
35. Different Strategies: Reduced Table (RT), Infinite Blocksize
- RT Strategy
- Only one table entry in this case
- Address calculation step is simple in that case
BLOCKSIZE = infinite: only the address of the first element of the array needs to be saved, since all array data is contiguous.
Consider: shared [] int array[4]
To initialize RT: RT[0] = _get_vaddr(&array[0])
To access array: ∀i ∈ [0,3], array[i] = _get_value_at(RT[0] + i)
[Figure: with an infinite blocksize, array[0..3] all reside on THREAD 0; every thread's single RT[0] entry points at array[0]]
36. Different Strategies: Reduced Table (RT), Default Blocksize
BLOCKSIZE = 1: only the address of the first element on each thread needs to be saved, since each thread's portion of the array data is stored contiguously.
Consider: shared [1] int array[16]
To initialize RT: ∀i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i])
To access array: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i / THREADS))
- RT Strategy
- Less memory required than FT; the MMTB buffer has THREADS entries
- Address calculation step is a bit costly, but much cheaper than in current implementations
[Figure: with the default blocksize, array[0..15] is distributed cyclically across THREAD 0..3; the RT entries each point at the first element held by one thread]
37. Different Strategies: Reduced Table (RT), Arbitrary Blocksize
- RT Strategy
- Less memory required than for FT, but more than in the previous cases
- Address calculation step more costly than in the previous cases
ARBITRARY BLOCKSIZE: only the address of the first element of each block needs to be saved, since all data within a block is contiguous.
Consider: shared [2] int array[16]
To initialize RT: ∀i ∈ [0,7], RT[i] = _get_vaddr(&array[i * blocksize(array)])
To access array: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))
[Figure: blocksize-2 distribution of array[0..15] across THREAD 0..3; the eight entries RT[0..7] each point at the first element of one block]
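- A minimal C sketch of the RT scheme for the arbitrary-blocksize case above, again assuming the _get_vaddr()/_get_value_at() helpers named on the slides and a static THREADS environment: one table entry per block, plus a small index calculation on every access.

    #include <upc.h>

    #define N 16
    #define B 2                          /* blocksize of the example */
    shared [B] int array[N];

    static void *RT[N / B];              /* reduced table: one entry per block */

    void build_rt(void)
    {
        for (int b = 0; b < N / B; b++)
            RT[b] = _get_vaddr(&array[b * B]);   /* first element of each block */
    }

    int read_elem(int i)
    {
        /* base address of the block plus the offset within the block */
        return _get_value_at((int *) RT[i / B] + (i % B));
    }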
38. Performance Impact of the MMTB: Sobel Edge
Performance of Sobel Edge Detection using the new MMTB strategies (with and without O0)
- FT and RT perform around 6 to 8 times better than the regular basic UPC version (O0)
- The RT strategy is slower than FT since the address calculation (arbitrary blocksize case) becomes more complex
- FT, on the other hand, performs almost as well as the hand-tuned versions (O3 and MPI)
39. Performance Impact of the MMTB: Matrix Multiplication
Performance and Hardware Profiling of Matrix
Multiplication using new MMTB strategies
- FT strategy: increase in L1 data cache misses due to the large table size
- RT strategy: L1 misses kept low, but an increase in the number of loads and stores is observed, showing increased computation (arbitrary blocksize used)
40. Time and Storage Requirements of the Address Translation Methods for the Matrix Multiply Microkernel
For a shared array of N elements with B as blocksize, the comparison covers storage requirements per shared array, number of memory accesses per shared memory access, and number of arithmetic operations per shared memory access (E = element size in bytes, P = pointer size in bytes).
Memory accesses / arithmetic operations per shared memory access:
- UPC.O0: more than 25 / more than 5
- UPC.O0.FT: 1 / 0
- UPC.O0.RT: 1 / up to 3
Comparison among Optimizations of Storage, Memory Accesses and Computation Requirements
- Number of loads and stores can increase with
arithmetic operators
41. UPC Work-sharing Construct Optimizations
- By thread/index number (upc_forall integer)
    upc_forall(i = 0; i < N; i++; i)
        loop body;
- By the address of a shared variable (upc_forall address)
    upc_forall(i = 0; i < N; i++; &shared_var[i])
        loop body;
- By thread/index number (for optimized)
    for(i = MYTHREAD; i < N; i += THREADS)
        loop body;
- By thread/index number (for integer)
    for(i = 0; i < N; i++)
    {
        if(MYTHREAD == i % THREADS)
            loop body;
    }
- By the address of a shared variable (for address)
    for(i = 0; i < N; i++)
    {
        if(upc_threadof(&shared_var[i]) == MYTHREAD)
            loop body;
    }
42. Performance of Equivalent upc_forall and for Loops
43. Performance Limitations Imposed by Sequential C Compilers -- STREAM
All values in MB/s; memcpy, memset, and Struct cp are bulk operations, the rest are element-by-element.
              memcpy   memset   Struct cp  Copy (arr)  Copy (ptr)  Set      Sum      Scale   Add     Triad
NUMA    F     291.21   163.90   N/A        291.59      N/A         159.68   135.37   246.3   235.1   303.82
NUMA    C     231.20   214.62   158.86     120.57      152.77      147.70   298.38   133.4   13.86   20.71
Vector  F     14423    11051    N/A        14407       N/A         11015    17837    14423   10715   16053
Vector  C     18850    5307     7882       7972        7969        10576    18260    7865    3874    5824
44. Loopmark: SET/ADD Operations
All values in MB/s on the vector system; memcpy, memset, and Struct cp are bulk operations, the rest are element-by-element.
        memcpy   memset   Struct cp  Copy (arr)  Copy (ptr)  Set      Sum      Scale   Add     Triad
F       14423    11051    N/A        14407       N/A         11015    17837    14423   10715   16053
C       18850    5307     7882       7972        7969        10576    18260    7865    3874    5824
Let us compare loopmarks for each F / C operation
45. Loopmark: SET/ADD Operations
Fortran
- MEMSET (bulk set)
    146. 1            t = mysecond(tflag)
    147. 1  V M--<><> a(1:n) = 1.0d0
    148. 1            t = mysecond(tflag) - t
    149. 1            times(2,k) = t
- SET
    158. 1            arrsum = 2.0d0
    159. 1            t = mysecond(tflag)
    160. 1  MV------< DO i = 1,n
    161. 1  MV          c(i) = arrsum
    162. 1  MV          arrsum = arrsum + 1
    163. 1  MV------> END DO
    164. 1            t = mysecond(tflag) - t
    165. 1            times(4,k) = t
- ADD
    180. 1            t = mysecond(tflag)
    181. 1  V M--<><> c(1:n) = a(1:n) + b(1:n)
C
- MEMSET (bulk set)
    163. 1            times[1][k] = mysecond_()
    164. 1            memset(a, 1, NDIM*sizeof(elem_t))
    165. 1            times[1][k] = mysecond_() - times[1][k]
- SET
    217. 1            set = 2
    220. 1            times[5][k] = mysecond_()
    222. 1  MV--<     for (i = 0; i < NDIM; i++)
    223. 1  MV
    224. 1  MV          c[i] = (set);
    225. 1  MV-->
    227. 1            times[5][k] = mysecond_() - times[5][k]
- ADD
    283. 1            times[10][k] = mysecond_()
    285. 1  Vp--<     for (j = 0; j < NDIM; j++)
    286. 1  Vp
    287. 1  Vp          c[j] = a[j] + b[j]
Legend: V = Vectorized, M = Multistreamed, p = conditional, partial and/or computed
46. UPC vs. CAF Using the NPB Workloads
- In general, UPC is slower than CAF, mainly due to:
  - Point-to-point vs. barrier synchronization
    - Better scalability with proper collective operations
    - Program writers can do point-to-point synchronization using current constructs
  - Scalar performance of source-to-source translated code
    - Alias analysis (C pointers)
      - Can highlight the need for explicitly using restrict to help several compiler backends (see the sketch after this list)
    - Lack of support for multi-dimensional arrays in C
      - Can prevent high-level loop transformations and software pipelining, causing a 2x slowdown in SP for UPC
    - Need for exhaustive C compiler analysis
      - A failure to perform proper loop fusion and alignment in the critical section of MG can lead to 51% more loads for UPC than CAF
      - A failure to adequately unroll the sparse matrix-vector multiplication in CG can lead to more cycles in UPC
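- A sketch of the restrict point above (daxpy_local is an illustrative routine operating on already-privatized local blocks): restrict-qualified pointers tell the C backend the arrays do not alias, which helps vectorization and software pipelining.

    /* C99: the restrict qualifiers assert that x and y do not overlap */
    void daxpy_local(int n, double alpha, double *restrict x, double *restrict y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];    /* backend can now pipeline/vectorize freely */
    }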
47. Conclusions
- UPC is a locality-aware parallel programming language
- With proper optimizations, UPC can outperform MPI in random short accesses and can otherwise perform as well as MPI
- UPC is very productive, and UPC applications result in much smaller and more readable code than MPI
- UPC compiler optimizations are still lagging, in spite of the fact that substantial progress has been made
- For future architectures, UPC has the unique opportunity of having very efficient implementations, as most of the pitfalls and obstacles have been revealed along with adequate solutions
48. Conclusions
- In general, four types of optimizations:
  - Optimizations to exploit the locality consciousness and other unique features of UPC
  - Optimizations to keep the overhead of UPC low
  - Optimizations to exploit architectural features
  - Standard optimizations that are applicable to all systems' compilers
49. Conclusions
- Optimizations possible at three levels:
  - Source-to-source translation acting during the compilation phase and incorporating most UPC-specific optimizations
  - C backend compilers that can compete with Fortran
  - A strong run-time system that can work effectively with the operating system
50. Selected Publications
- T. El-Ghazawi, W. Carlson, T. Sterling, and K. Yelick, UPC: Distributed Shared Memory Programming. John Wiley & Sons, Inc., New York, 2005. ISBN 0-471-22048-5. (June 2005)
- T. El-Ghazawi, F. Cantonnet, Y. Yao, S. Annareddy, A. Mohamed, "Benchmarking Parallel Compilers for Distributed Shared Memory Languages: A UPC Case Study," Journal of Future Generation Computer Systems, North-Holland (accepted)
51. Selected Publications
- T. El-Ghazawi and S. Chauvin, "UPC Benchmarking Issues," Proceedings of the 2001 International Conference on Parallel Processing (ICPP'01), pp. 365-372
- T. El-Ghazawi and F. Cantonnet, "UPC Performance and Potential: A NPB Experimental Study," Supercomputing 2002 (SC2002), Baltimore, November 2002
- F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber, "Fast Address Translation Techniques for Distributed Shared Memory Compilers," IPDPS'05, Denver, CO, April 2005
- CUG and PPoPP