1
Towards Optimized UPC Implementations
  • Tarek A. El-Ghazawi
  • The George Washington University, tarek@gwu.edu

2
Agenda
  • Background
  • UPC Language Overview
  • Productivity
  • Performance Issues
  • Automatic Optimizations
  • Conclusions

3
Parallel Programming Models
  • What is a programming model?
  • An abstract machine that outlines the programmer's
    view of data and execution
  • Where architecture and applications meet
  • A non-binding contract between the programmer and
    the compiler/system
  • Good programming models should
  • Allow efficient mapping onto different
    architectures
  • Keep programming easy
  • Benefits
  • Applications become independent of the architecture
  • Architectures become independent of the applications

4
Programming Models
Figure: programming models arranged by process/thread and address-space
organization - Message Passing (MPI), Shared Memory (OpenMP), and
DSM/PGAS (UPC)
5
Programming Paradigms: Expressivity
Figure: paradigms arranged by whether parallelism and locality are
implicit or explicit -
Sequential (e.g. C, Fortran, Java)
Data Parallel (e.g. HPF, C*)
Shared Memory (e.g. OpenMP)
Distributed Shared Memory/PGAS (e.g. UPC, CAF,
and Titanium)
6
What is UPC?
  • Unified Parallel C
  • An explicit parallel extension of ISO C
  • A distributed shared memory/PGAS parallel
    programming language

7
Why not message passing?
  • Performance
  • High penalty for short transactions
  • Cost of calls
  • Two-sided communication
  • Excessive buffering
  • Ease-of-use
  • Explicit data transfers
  • Domain decomposition does not maintain the
    original global application view
  • More code and conceptual difficulty

8
Why DSM/PGAS?
  • Performance
  • No calls
  • Efficient short transfers
  • Locality
  • Ease-of-use
  • Implicit transfers
  • Consistent global application view
  • Less code and conceptual difficulty

9
Why DSM/PGAS? New Opportunities for Compiler
Optimizations
Figure: an image partitioned row-wise across Thread 0 - Thread 3 for
the Sobel operator, with ghost zones at the partition boundaries
  • The DSM programming model exposes sequential remote
    accesses at compile time
  • Opportunity for compiler-directed prefetching (see
    the sketch below)
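A minimal sketch (not the presenter's code), assuming a square N x N
image distributed one row per thread: the stencil's accesses to rows
i-1 and i+1 may land on neighboring threads (the ghost zones), and the
access pattern is visible at compile time, so a compiler can prefetch
those rows.

#include <upc_relaxed.h>

#define N 512
shared [N] unsigned char img[N][N];   /* row i has affinity to thread i % THREADS */
shared [N] unsigned char edge[N][N];

void sobel(void)
{
    int i, j;
    for (i = 1; i < N - 1; i++) {
        if (i % THREADS != MYTHREAD) continue;   /* each thread handles its own rows */
        for (j = 1; j < N - 1; j++) {
            /* rows i-1 and i+1 may be remote: the ghost-zone accesses */
            int gx = -img[i-1][j-1] + img[i-1][j+1]
                     - 2*img[i][j-1] + 2*img[i][j+1]
                     - img[i+1][j-1] + img[i+1][j+1];
            int gy = -img[i-1][j-1] - 2*img[i-1][j] - img[i-1][j+1]
                     + img[i+1][j-1] + 2*img[i+1][j] + img[i+1][j+1];
            int mag = (gx < 0 ? -gx : gx) + (gy < 0 ? -gy : gy);
            edge[i][j] = (unsigned char)(mag > 255 ? 255 : mag);
        }
    }
    upc_barrier;
}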


10
History
  • Initial Tech. Report from IDA in collaboration
    with LLNL and UCB in May 1999
  • UPC consortium of government, academia, and HPC
    vendors coordinated by GWU, IDA, and DoD
  • The participants currently are IDA CCS, GWU,
    UCB, MTU, UMN, ARSC, UMCP, U. Florida, ANL, LBNL,
    LLNL, DoD, DoE, HP, Cray, IBM, Sun, Intrepid,
    Etnus,

11
Status
  • Specification v1.0 completed in February 2001,
    v1.1.1 in October 2003; v1.2 will add
    collectives and UPC-IO
  • Benchmarking suites: STREAM, GUPS, RandomAccess,
    NPB suite, Splash-2, and others
  • Testing suite v1.0, v1.1
  • Short courses and tutorials in the US and abroad
  • Research exhibits at SC 2000-2004
  • UPC web site: upc.gwu.edu
  • UPC book by mid-2005 from John Wiley and Sons
  • Manual(s)

12
Hardware Platforms
  • UPC implementations are available for
  • SGI Origin 2000/3000
  • Intrepid 32- and 64-bit GCC
  • UCB 32-bit GCC
  • Cray T3D/E
  • Cray X-1
  • HP AlphaServer SC, Superdome
  • UPC Berkeley Compiler: Myrinet, Quadrics, and
    InfiniBand clusters
  • Beowulf Reference Implementation (MPI-based, MTU)
  • New ongoing efforts by IBM and Sun

13
UPC Execution Model
  • A number of threads working independently in a
    SPMD fashion
  • MYTHREAD specifies thread index (0..THREADS-1)
  • Number of threads specified at compile-time or
    run-time
  • Process and Data Synchronization when needed
  • Barriers and split phase barriers
  • Locks and arrays of locks
  • Fence
  • Memory consistency control
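A minimal SPMD sketch (not from the slides) showing MYTHREAD, THREADS,
and a barrier; any UPC 1.x compiler should accept it:

#include <upc.h>
#include <stdio.h>

int main(void)
{
    /* every thread executes main(); MYTHREAD identifies this thread */
    printf("hello from thread %d of %d\n", MYTHREAD, THREADS);
    upc_barrier;              /* all threads synchronize here */
    return 0;
}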

14
UPC Memory Model
Figure: the UPC memory model - a partitioned shared space with
per-thread affinity (Thread 0 .. Thread THREADS-1) sitting above the
per-thread private spaces (Private 0 .. Private THREADS-1)
  • Shared space with thread affinity, plus private
    spaces
  • A pointer-to-shared can reference all locations
    in the shared space
  • A private pointer may reference only addresses in
    its private space or addresses in its portion of
    the shared space
  • Static and dynamic memory allocations are
    supported for both shared and private memory
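A short sketch of shared vs. private data plus a dynamic shared
allocation; the names and the size N are illustrative, not from the
slides:

#include <upc.h>

#define N 16
shared int a[N];        /* shared array: a[i] has affinity to thread i % THREADS */
int mine;               /* private: each thread has its own copy */
shared [N] int *work;   /* private pointer into the shared space */

void setup(void)
{
    /* collective allocation of THREADS blocks of N ints in shared space */
    work = (shared [N] int *)upc_all_alloc(THREADS, N * sizeof(int));
}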

15
UPC Pointers
  • How to declare them?
  • int *p1;                /* private pointer pointing locally */
  • shared int *p2;         /* private pointer pointing into
    the shared space */
  • int *shared p3;         /* shared pointer pointing
    locally */
  • shared int *shared p4;  /* shared pointer
    pointing into the shared space */
  • You may find many using "shared pointer" to mean
    a pointer pointing to a shared object, e.g.
    equivalent to p2, but it could be p4 as well

16
UPC Pointers
Figure: location of p1-p4 within Thread 0's private and shared spaces
and the regions each one references (as declared on the previous slide)
17
Synchronization - Barriers
  • No implicit synchronization among the threads
  • UPC provides the following synchronization
    mechanisms
  • Barriers
  • Locks
  • Memory Consistency Control
  • Fence
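A hedged sketch of the mechanisms listed above (barrier, split-phase
barrier, lock, fence); hist and the lock name are illustrative:

#include <upc.h>

shared int hist[THREADS];
upc_lock_t *lk;

void sync_demo(void)
{
    upc_barrier;                  /* blocking barrier */

    upc_notify;                   /* split-phase barrier: signal arrival */
    /* ... purely local work can overlap with other threads here ... */
    upc_wait;                     /* complete the barrier */

    lk = upc_all_lock_alloc();    /* collective lock allocation */
    upc_lock(lk);
    hist[0] += 1;                 /* update shared data under the lock */
    upc_unlock(lk);

    upc_fence;                    /* complete this thread's outstanding shared accesses */
}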

18
Memory Consistency Models
  • Has to do with ordering of shared operations, and
    when a change of a shared object by a thread
    becomes visible to others
  • Consistency can be strict or relaxed
  • Under the relaxed consistency model, the shared
    operations can be reordered by the compiler /
    runtime system
  • The strict consistency model enforces sequential
    ordering of shared operations (no operation on
    shared data can begin before the previous ones are
    done, and changes become visible immediately)

19
Memory Consistency Models
  • User specifies the memory model through
  • declarations
  • pragmas for a particular statement or sequence of
    statements
  • use of barriers, and global operations
  • Programmers are responsible for using the correct
    consistency model (see the sketch below)
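A hedged sketch of the three mechanisms (include files, type
qualifiers, pragmas); the variable names are illustrative:

#include <upc_relaxed.h>        /* file-wide default: relaxed (<upc_strict.h> for strict) */

strict  shared int flag;        /* per-object: accesses to flag are always strict  */
relaxed shared int data[100];   /* per-object: accesses to data may be reordered   */

void producer(void)
{
    data[0] = 42;               /* relaxed write */
    flag = 1;                   /* strict write: not reordered past the data write */
}

void consumer(void)
{
    {
#pragma upc strict              /* per-block: strict shared references */
        while (flag == 0) { }   /* spin until the producer sets the flag */
    }
    {
#pragma upc relaxed
        int x = data[0];        /* safe to read once the strict flag is seen */
        (void)x;
    }
}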

20
UPC and Productivity
  • Metrics
  • Lines of useful Code
  • indicates the development time as well as the
    maintenance cost
  • Number of useful Characters
  • alternative way to measure development and
    maintenance efforts
  • Conceptual Complexity
  • function level
  • keyword usage
  • number of tokens
  • max loop depth

21
Manual Effort: NPB Example
22
Manual Effort: More Examples
23
Conceptual Complexity - HIST
24
Conceptual Complexity - GUPS
25
UPC Optimization Issues
  • Particular Challenges
  • Avoiding Address Translation
  • Cost of Address Translation
  • Special Opportunities
  • Locality-driven compiler-directed prefetching
  • Aggregation
  • General
  • Low-level optimized libraries, e.g. collective
  • Backend optimizations
  • Overlapping of remote accesses and
    synchronization with other work

26
Showing Potential Optimizations Through Emulated
Hand-Tunings
  • Different Hand-tuning levels
  • Unoptimized UPC code
  • referred to as UPC.O0
  • Privatized UPC code
  • referred to as UPC.O1
  • Prefetched UPC code
  • hand-optimized variant using block get/put to
    mimic the effect of prefetching
  • referred to as UPC.O2
  • Fully hand-tuned UPC code
  • hand-optimized variant integrating privatization,
    aggregation of remote accesses, and prefetching
  • referred to as UPC.O3
  • T. El-Ghazawi and S. Chauvin, "UPC Benchmarking
    Issues," 30th International Conference on Parallel
    Processing (ICPP'01), 2001, pp. 365-372
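A hedged sketch (the array layout, CHUNK, and the function names are
illustrative) of what the O1 and O2 variants do: O1 casts the block
with local affinity to an ordinary C pointer, and O2 pulls remote
blocks in bulk with upc_memget to mimic prefetching:

#include <upc_relaxed.h>

#define CHUNK 1024
shared [CHUNK] double src[CHUNK*THREADS];

double sum_O1_privatized(void)
{
    /* UPC.O1: access the locally-affine block through a private
       pointer, skipping shared-address translation */
    double *local = (double *)&src[MYTHREAD * CHUNK];
    double s = 0.0;
    for (int i = 0; i < CHUNK; i++) s += local[i];
    return s;
}

double sum_O2_prefetched(void)
{
    /* UPC.O2: fetch each block in one bulk transfer into a private
       buffer, then operate on it locally */
    static double buf[CHUNK];
    double s = 0.0;
    for (int t = 0; t < THREADS; t++) {
        upc_memget(buf, &src[t * CHUNK], CHUNK * sizeof(double));
        for (int i = 0; i < CHUNK; i++) s += buf[i];
    }
    return s;
}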

27
Address Translation Cost and Local Space
Privatization - Cluster
MB/s Put Get Scale Sum
CC N/A N/A 1565.04 5409.3
UPC Private N/A N/A 1687.63 1776.81
UPC Local 1196.51 1082.89 54.22 82.7
UPC Remote 241.43 237.51 0.09 0.16
MB/s Copy (arr) Copy (ptr) Memcpy Memset
CC 1340.99 1488.02 1223.86 2401.26
UPC Private 1383.57 433.45 1252.47 2352.71
UPC Local 47.2 90.67 1202.8 2398.9
UPC Remote 0.09 0.20 1197.22 2360.59
STREAM BENCHMARK
Results gathered on a Myrinet Cluster
28
Address Translation and Local Space
Privatization - DSM Architecture
Bulk operations
Element-by-Element operations
MB/Sec Memory copy Block Get Block Put Array Set Array Copy Sum Scale
GCC 127 N/A N/A 175 106 223 108
UPC Private 127 N/A N/A 173 106 215 107
UPC Local Shared 139 140 136 26 14 31 13
UPC Remote Shared (within SMP node) 130 129 136 26 13 30 13
UPC Remote Shared (beyond SMP node) 112 117 136 24 12 28 12
STREAM BENCHMARK MB/S
29
Aggregation and Overlapping of Remote Shared
Memory Accesses
UPC N-Queens Execution Time
UPC Sobel Edge Execution Time
  • The benefit of hand-optimizations is highly
    application dependent
  • N-Queens does not perform any better, mainly
    because it is an embarrassingly parallel program
  • The Sobel Edge Detector gains a speedup of one
    order of magnitude after hand-optimization, and
    scales almost perfectly linearly
  • Results on SGI Origin 2000

30
Impact of Hand-Optimizations on NPB.CG
Class A on SGI Origin 2000
31
Shared Address Translation Overhead
  • Address translation overhead is quite significant
  • More than 70% of the work for a local-shared memory
    access
  • Demonstrates the real need for optimization (a
    sketch of the arithmetic involved follows the
    charts below)

Overhead Present in Local-Shared Memory Accesses
(SGI Origin 2000, GCC-UPC)
Quantification of the Address Translation
Overheads
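For intuition only (this is not the compiler's actual code): for a
blocked array declared shared [B] T A[N], element A[i] has affinity to
thread (i/B) % THREADS and sits at local offset (i/(B*THREADS))*B + i%B
within that thread's chunk. A generic pointer-to-shared access must
redo this arithmetic, plus a virtual-address lookup, on every
reference; that is the overhead quantified above, and the translation
buffers described on the following slides precompute it.

#define B 4                                  /* example block size */

int  owner_thread(long i, int threads) { return (int)((i / B) % threads); }
long local_offset(long i, int threads) { return (i / ((long)B * threads)) * B + i % B; }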
32
Shared Address Translation Overheads for Sobel
Edge Detection
UPC.O0: unoptimized UPC code; UPC.O3:
hand-optimized UPC code. O.x notations from T.
El-Ghazawi and S. Chauvin, "UPC Benchmarking
Issues," Proceedings of the 2001 International
Conference on Parallel Processing, Valencia,
September 2001
33
Reducing Address Translation Overheads via
Translation Look-Aside Buffers
  • F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber,
    Fast Address Translation Techniques for
    Distributed Shared Memory Compilers, IPDPS05,
    Denver CO, April 2005
  • Use Look-up Memory Model Translation Buffers
    (MMTB) to perform fast translations
  • Two alternative methods proposed to create and
    use MMTBs
  • FT: a basic method using direct addressing
  • RT: an advanced method using indexed addressing
  • Was prototyped as a compiler-enabled optimization
  • no modifications to actual UPC codes are needed

34
Different Strategies: Full-Table
  • Pros
  • Direct mapping
  • No address calculation
  • Cons
  • Large memory required
  • Can lead to competition over caches and main
    memory

Consider shared [B] int array[8]
To initialize FT: ∀i ∈ [0,7], FT[i] = _get_vaddr(&array[i])
To access array: ∀i ∈ [0,7], array[i] = _get_value_at(FT[i])
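A sketch of the Full-Table idea in plain C; the helpers _get_vaddr()
and _get_value_at() are the runtime-internal names used on the slide
(assumed here to resolve a pointer-to-shared into a raw virtual address
and to load through such an address), not standard UPC:

#include <upc_relaxed.h>

extern void *_get_vaddr(shared void *p);   /* assumed runtime helpers */
extern int   _get_value_at(void *vaddr);

#define B 2
#define N 8
shared [B] int array[N];

static void *FT[N];                        /* one private table entry per element */

void ft_init(void)
{
    for (int i = 0; i < N; i++)
        FT[i] = _get_vaddr(&array[i]);     /* translate each element once, up front */
}

int ft_read(int i)
{
    return _get_value_at(FT[i]);           /* per access: one lookup, no arithmetic */
}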
35
Different Strategies: Reduced-Table, Infinite
Blocksize
  • RT Strategy
  • Only one table entry in this case
  • Address calculation step is simple in that case

BLOCKSIZE = infinite: only the first address of the array needs to be
saved, since all of the array data is contiguous
Consider shared [] int array[4]
To initialize RT: RT[0] = _get_vaddr(&array[0])
To access array: ∀i ∈ [0,3], array[i] = _get_value_at(RT[0] + i)
Figure: with an infinite blocksize the whole of array[0..3] resides on
THREAD0, and the single entry RT[0] on each thread points to array[0]
36
Different Strategies: Reduced-Table, Default
Blocksize
BLOCKSIZE = 1 (default): only the first address of the elements on
each thread is saved, since each thread's portion of the array data is
contiguous
Consider shared [1] int array[16]
To initialize RT: ∀i ∈ [0,THREADS-1], RT[i] = _get_vaddr(&array[i])
To access array: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i mod THREADS] + (i/THREADS))
  • RT Strategy
  • Less memory required than FT; the MMTB buffer has
    THREADS entries
  • The address calculation step is somewhat costly, but
    much cheaper than in current implementations

Figure: layout of shared [1] int array[16] across THREAD0-THREAD3,
with one RT entry per thread holding the base address of that thread's
elements
37
Different Strategies: Reduced-Table, Arbitrary
Blocksize
  • RT Strategy
  • Less memory required than for FT, but more than
    previous cases
  • Address calculation step more costly than
    previous cases

ARBITRARY BLOCK SIZES: only the first address of each block is saved,
since the data within a block is contiguous
Consider shared [2] int array[16]
To initialize RT: ∀i ∈ [0,7], RT[i] = _get_vaddr(&array[i*blocksize(array)])
To access array: ∀i ∈ [0,15], array[i] = _get_value_at(RT[i / blocksize(array)] + (i mod blocksize(array)))
Figure: layout of shared [2] int array[16] across THREAD0-THREAD3,
with one RT entry (RT[0]-RT[7]) per two-element block
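A companion sketch for the Reduced-Table variant with an arbitrary
block size: one table entry per block instead of per element, at the
cost of a divide and a modulo on each access. The helper names follow
the slide as above; upc_blocksizeof is standard UPC:

#include <upc_relaxed.h>
#include <stddef.h>

extern void *_get_vaddr(shared void *p);   /* assumed runtime helpers, as above */
extern int   _get_value_at(void *vaddr);

#define N 16
shared [2] int array[N];
#define NBLOCKS (N / 2)

static void *RT[NBLOCKS];                  /* one private table entry per block */

void rt_init(void)
{
    size_t b = upc_blocksizeof(array);     /* 2 in this example */
    for (int i = 0; i < NBLOCKS; i++)
        RT[i] = _get_vaddr(&array[i * b]); /* base address of each block */
}

int rt_read(int i)
{
    size_t b = upc_blocksizeof(array);
    /* locate the block's base, then add the offset within the block */
    return _get_value_at((int *)RT[i / b] + (i % b));
}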
38
Performance Impact of the MMTB: Sobel Edge
Performance of Sobel-Edge Detection using new
MMTB strategies (with and without O0)
  • FT and RT perform around 6 to 8 times better
    than the regular, unoptimized UPC version (O0)
  • The RT strategy is slower than FT since the address
    calculation (arbitrary block size case) becomes
    more complex
  • FT, on the other hand, performs almost as well
    as the hand-tuned versions (O3 and MPI)

39
Performance Impact of the MMTB: Matrix
Multiplication
Performance and Hardware Profiling of Matrix
Multiplication using new MMTB strategies
  • FT strategy: an increase in L1 data cache misses due
    to the large table size
  • RT strategy: L1 misses are kept low, but an increase
    in the number of loads and stores is observed,
    reflecting extra computation (arbitrary blocksize used)

40
Time and storage requirements of the Address
Translation Methods for the Matrix Multiply
Microkernel
For a shared array of N elements with B as blocksize (storage
requirement per shared array is expressed in terms of E = element size
in bytes and P = pointer size in bytes)
            # of memory accesses          # of arithmetic operations
            per shared memory access      per shared memory access
UPC.O0      More than 25                  More than 5
UPC.O0.FT   1                             0
UPC.O0.RT   1                             Up to 3
Comparison among Optimizations of Storage, Memory
Accesses and Computation Requirements
  • Number of loads and stores can increase with
    arithmetic operators

41
UPC Work-sharing Construct Optimizations
  • By thread/index number (upc_forall integer)
  • upc_forall(i=0; i<N; i++; i)
  •   loop body
  • By the address of a shared variable (upc_forall
    address)
  • upc_forall(i=0; i<N; i++; &shared_var[i])
  •   loop body
  • By thread/index number (for optimized)
  • for(i=MYTHREAD; i<N; i+=THREADS)
  •   loop body
  • By thread/index number (for integer)
  • for(i=0; i<N; i++)
  •   if(MYTHREAD == i%THREADS)
  •     loop body
  • By the address of a shared variable (for address)
  • for(i=0; i<N; i++)
  •   if(upc_threadof(&shared_var[i]) == MYTHREAD)
  •     loop body
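A runnable sketch (N and the array are assumed) contrasting the first
and third variants above, which distribute iterations identically:

#include <upc_relaxed.h>

#define N 1024
shared int v[N];

void scale_forall(int k)
{
    int i;
    upc_forall (i = 0; i < N; i++; i)        /* iteration i runs on thread i % THREADS */
        v[i] *= k;
}

void scale_for(int k)
{
    int i;
    for (i = MYTHREAD; i < N; i += THREADS)  /* same distribution, plain for loop */
        v[i] *= k;
}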

42
Performance of Equivalent upc_forall and for Loops
43
Performance Limitations Imposed by Sequential C
Compilers -- STREAM
NUMA (MB/s)   Bulk: memcpy, memset, Struct cp | Element-by-element: Copy (arr), Copy (ptr), Set, Sum, Scale, Add, Triad
F             291.21  163.90  N/A      | 291.59  N/A     159.68  135.37  246.3   235.1   303.82
C             231.20  214.62  158.86   | 120.57  152.77  147.70  298.38  133.4   13.86   20.71
Vector (MB/s) Bulk: memcpy, memset, Struct cp | Element-by-element: Copy (arr), Copy (ptr), Set, Sum, Scale, Add, Triad
F             14423   11051   N/A      | 14407   N/A     11015   17837   14423   10715   16053
C             18850   5307    7882     | 7972    7969    10576   18260   7865    3874    5824
(F = Fortran, C = C)
44
Loopmark SET/ADD Operations
Vector        Bulk: memcpy, memset, Struct cp | Element-by-element: Copy (arr), Copy (ptr), Set, Sum, Scale, Add, Triad
F             14423   11051   N/A      | 14407   N/A     11015   17837   14423   10715   16053
C             18850   5307    7882     | 7972    7969    10576   18260   7865    3874    5824
Let us compare loopmarks for each F / C operation
45
Loopmark SET/ADD Operations
Fortran
C
  • MEMSET (bulk set)
  • 146. 1            t = mysecond(tflag)
  • 147. 1 V M--<><>  a(1:n) = 1.0d0
  • 148. 1            t = mysecond(tflag) - t
  • 149. 1            times(2,k) = t
  • SET
  • 158. 1            arrsum = 2.0d0
  • 159. 1            t = mysecond(tflag)
  • 160. 1 MV------<  DO i = 1,n
  • 161. 1 MV           c(i) = arrsum
  • 162. 1 MV           arrsum = arrsum + 1
  • 163. 1 MV------>  END DO
  • 164. 1            t = mysecond(tflag) - t
  • 165. 1            times(4,k) = t
  • ADD
  • 180. 1            t = mysecond(tflag)
  • 181. 1 V M--<><>  c(1:n) = a(1:n) + b(1:n)
  • MEMSET (bulk set)
  • 163. 1            times[1][k] = mysecond_()
  • 164. 1            memset(a, 1, NDIM*sizeof(elem_t))
  • 165. 1            times[1][k] = mysecond_() - times[1][k]
  • SET
  • 217. 1            set = 2
  • 220. 1            times[5][k] = mysecond_()
  • 222. 1 MV--<      for (i=0; i<NDIM; i++)
  • 223. 1 MV         {
  • 224. 1 MV           c[i] = (set++)
  • 225. 1 MV-->      }
  • 227. 1            times[5][k] = mysecond_() - times[5][k]
  • ADD
  • 283. 1            times[10][k] = mysecond_()
  • 285. 1 Vp--<      for (j=0; j<NDIM; j++)
  • 286. 1 Vp         {
  • 287. 1 Vp           c[j] = a[j] + b[j]
Legend: V = Vectorized, M = Multistreamed, p =
conditional, partial and/or computed
46
UPC vs CAF using the NPB workloads
  • In general, UPC is slower than CAF, mainly due to
  • Point-to-point vs. barrier synchronization
  • Better scalability with proper collective
    operations
  • Program writers can do point-to-point
    synchronization using current constructs
  • Scalar performance of source-to-source translated
    code
  • Alias analysis (C pointers)
  • Can highlight the need for explicitly using
    restrict to help several compiler backends (see
    the sketch below)
  • Lack of support for multi-dimensional arrays in C
  • Can prevent high-level loop transformations and
    software pipelining, causing a 2x slowdown
    in SP for UPC
  • Need for exhaustive C compiler analysis
  • A failure to perform proper loop fusion and
    alignment in the critical section of MG can lead
    to 51% more loads for UPC than CAF
  • A failure to adequately unroll the sparse
    matrix-vector multiplication in CG can lead to
    more cycles in UPC
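A hedged illustration of the alias-analysis point above: if the
generated C (or the programmer) marks the pointers restrict, the
backend is free to vectorize and software-pipeline the loop much as a
Fortran compiler would. The function name is illustrative:

void add_vectors(double * restrict c,
                 const double * restrict a,
                 const double * restrict b,
                 long n)
{
    for (long j = 0; j < n; j++)
        c[j] = a[j] + b[j];     /* no aliasing assumed: safe to vectorize */
}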

47
Conclusions
  • UPC is a locality-aware parallel programming
    language
  • With proper optimizations, UPC can outperform MPI
    on random short accesses and can otherwise
    perform as well as MPI
  • UPC is very productive, and UPC applications
    result in much smaller and more readable code
    than MPI
  • UPC compiler optimizations are still lagging, in
    spite of the fact that substantial progress has
    been made
  • For future architectures, UPC has the unique
    opportunity of having very efficient
    implementations, as most of the pitfalls and
    obstacles have been revealed along with adequate
    solutions

48
Conclusions
  • In general, four types of optimizations
  • Optimizations to Exploit the Locality
    Consciousness and other Unique Features of UPC
  • Optimizations to Keep the Overhead of UPC low
  • Optimizations to Exploit Architectural Features
  • Standard Optimizations that are Applicable to all
    Systems' Compilers

49
Conclusions
  • Optimizations possible at three levels
  • Source-to-source translation acting during the
    compilation phase and incorporating most
    UPC-specific optimizations
  • C backend compilers that can compete with Fortran
  • A strong run-time system that can work effectively
    with the operating system

50
Selected Publications
  • T. El-Ghazawi, W. Carlson, T. Sterling, and K.
    Yelick, UPC: Distributed Shared Memory
    Programming. John Wiley & Sons Inc., New York,
    2005. ISBN 0-471-22048-5. (June 2005)
  • T. El-Ghazawi, F. Cantonnet, Y. Yao, S.
    Annareddy, A. Mohamed, "Benchmarking Parallel
    Compilers for Distributed Shared Memory
    Languages: A UPC Case Study," Journal of Future
    Generation Computer Systems, North-Holland
    (accepted)

51
Selected Publications
  • T. El-Ghazawi and S. Chauvin, "UPC Benchmarking
    Issues," 30th International Conference on Parallel
    Processing (ICPP'01), 2001, pp. 365-372
  • T. El-Ghazawi and F. Cantonnet, "UPC Performance
    and Potential: A NPB Experimental Study,"
    Supercomputing 2002 (SC2002), Baltimore, November
    2002
  • F. Cantonnet, T. El-Ghazawi, P. Lorenz, J. Gaber,
    "Fast Address Translation Techniques for
    Distributed Shared Memory Compilers," IPDPS'05,
    Denver, CO, April 2005
  • Additional papers at CUG and PPoPP