Title: Ernest Orlando Lawrence Berkeley National Laboratory
1. Compilation Technology for Computational Science
Kathy Yelick, Lawrence Berkeley National Laboratory and UC Berkeley
Joint work with the Titanium Group (S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen) and the Berkeley UPC Group (C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome)
2. Outline
- Computer architecture trends
- Software trends
- Scientific computing expertise in parallelism
  - Performance is as important as parallelism
  - Resource management is key to performance
  - Open question: how much of the machine to virtualize?
- Parallel language problems and PGAS solutions
  - Virtualize the global address space
  - Not shared virtual memory, not a virtual processor space
- Parallel compiler problems/solutions
3. Parallelism Everywhere
- The single-processor Moore's Law effect is ending
  - Power density limitations; device physics below 90nm
- Multicore is becoming the norm
  - AMD, IBM, Intel, and Sun are all offering multicore chips
  - The number of cores per chip is likely to increase with density
- Fundamental software change
  - Parallelism is exposed to software
  - Performance is no longer solely a hardware problem
- What has the HPC community learned?
  - Caveat: scale and applications differ
4. High-End Simulation in the Physical Sciences: 7 Methods
Phillip Colella's "Seven Dwarfs":
- Structured grids (including adaptive mesh refinement)
- Unstructured grids
- Spectral methods (FFTs, etc.)
- Dense linear algebra
- Sparse linear algebra
- Particles
- Monte Carlo simulation
Add 4 more for embedded to cover all 41 EEMBC benchmarks:
- 8. Search/sort
- 9. Filter
- 10. Combinational logic
- 11. Finite state machine
Note: data sizes (8-bit to 32-bit) and types (integer, character) differ, but the algorithms are the same. Games/entertainment are close to scientific computing.
Slide source: Phillip Colella, 2004, and Dave Patterson, 2006
5. Parallel Programming Models
- Parallel software is still an unsolved problem!
- Most parallel programs are written using either:
  - Message passing with an SPMD model: used for scientific applications; scales easily
  - Shared memory with threads in OpenMP, Threads, or Java: used for non-scientific applications; easier to program
- Partitioned Global Address Space (PGAS) languages aim to offer all three of:
  - Productivity: easy to understand and use
  - Performance: the primary requirement in HPC
  - Portability: must run everywhere
6. Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local (near) or global (possibly far); the programmer controls layout
- By default:
  - Object heaps are shared
  - Program stacks are private
[Figure: the global address space spans threads p0, p1, ..., pn; each thread holds its own private x and shared y, with local (l) and global (g) pointers reaching into the partitioned space]
- 3 current languages: UPC, CAF, and Titanium
- Emphasis in this talk on UPC and Titanium (based on Java)
7. PGAS Language Overview
- Many common concepts, although specifics differ
  - Consistent with the base language
- Both private and shared data
  - int x[10]; and shared int y[10];
- Support for distributed data structures
  - Distributed arrays; local and global pointers/references
- One-sided shared-memory communication
  - Simple assignment statements: x[i] = y[i]; or t = *p;
  - Bulk operations: memcpy in UPC, array ops in Titanium and CAF
- Synchronization
  - Global barriers, locks, memory fences
- Collective communication, I/O libraries, etc.
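To make the model concrete, here is a minimal UPC sketch (illustrative, not from the talk) of private vs. shared data and a one-sided read; x is private to each thread, while element i of y has affinity to thread i:

  #include <upc.h>

  int x[10];                 /* private: one instance per thread */
  shared int y[THREADS];     /* shared: element i lives with thread i */

  int main(void) {
      y[MYTHREAD] = MYTHREAD;             /* write the element I own */
      upc_barrier;                        /* global synchronization */
      x[0] = y[(MYTHREAD + 1) % THREADS]; /* one-sided read from a neighbor */
      upc_barrier;
      return 0;
  }

The neighbor read is an ordinary assignment, but the compiler and runtime turn it into communication when the element is remote.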
8. Example: Titanium Arrays
- Ti arrays are created using Domains and indexed using Points:
    double [3d] gridA = new double [[0,0,0]:[10,10,10]];
- foreach eliminates some loop-bound errors:
    foreach (p in gridA.domain())
      gridA[p] = gridA[p+c] + gridB[p];
- A rich domain calculus allows slicing, subarrays, transpose, and other operations without data copies
- Array copy operations automatically work on the intersection:
    data[neighborPos].copy(mydata);
[Figure: mydata and data[neighborPos] overlap; the copy fills ghost cells from the neighbor's restrict-ed (non-ghost) cells over the intersection (copied area)]
9. Productivity: Line-Count Comparison
- Comparison of the NAS Parallel Benchmarks
  - The UPC version takes modest programming effort relative to C
  - Titanium is even more compact, especially for MG, which uses multi-d arrays
  - Caveat: Titanium FT has a user-defined Complex type and uses cross-language support to call FFTW for the serial 1D FFTs
UPC results from Tarek El-Ghazawi et al.; CAF from Chamberlain et al.; Titanium joint with Kaushik Datta and Dan Bonachea
10. Case Study 1: Block-Structured AMR
- Adaptive Mesh Refinement (AMR) is challenging
  - Irregular data accesses and control from boundaries
  - A mixed global/local view is useful
Titanium AMR benchmarks are available
AMR Titanium work by Tong Wen and Phillip Colella
11. AMR in Titanium
- C/Fortran/MPI AMR
  - Chombo package from LBNL
  - Bulk-synchronous communication: pack boundary data between procs
- Titanium AMR
  - Written entirely in Titanium
  - Finer-grained communication: no explicit pack/unpack code; automated in the runtime system

Code size in lines:

                        C/Fortran/MPI   Titanium
  AMR data structures       35000         2000
  AMR operations             6500         1200
  Elliptic PDE solver        4200         1500

10X reduction in lines of code! (The Chombo code has somewhat more functionality in the PDE part.)
Work by Tong Wen and Phillip Colella; communication optimizations joint with Jimmy Su
12. Performance of Titanium AMR
Comparable performance:
- Serial: Titanium is within a few percent of C/F, and sometimes faster!
- Parallel: Titanium scaling is comparable with generic optimizations
  - Additional optimizations (namely overlap) not yet implemented
13. Immersed Boundary Simulation in Titanium
- Models elastic structures in an incompressible fluid
  - Blood flow in the heart, blood clotting, the inner ear, embryo growth, and many more
- Complicated parallelization
  - Particle/mesh method
  - Particles connected into materials

Code size in lines: Fortran 8000 vs. Titanium 4000

Joint work with Ed Givelberg and Armando Solar-Lezama
14. High Performance
- Strategy for acceptance of a new language
  - Within HPC: make it run faster than anything else
- Approaches to high performance
  - Language support for performance:
    - Allow programmers sufficient control over resources for tuning
    - Non-blocking data transfers, cross-language calls, etc.
    - Control over layout, load balancing, and synchronization
  - Compiler optimizations reduce the need for hand tuning:
    - Automate non-blocking memory operations, relaxed memory consistency, etc.
    - Productivity gains through parallel analysis and optimizations
  - Runtime support exposes the best possible performance:
    - Berkeley UPC and Titanium use the GASNet communication layer
    - Dynamic optimizations based on runtime information
15. One-Sided vs. Two-Sided Communication
[Figure: a one-sided put message carries (address, data payload) and is deposited directly into memory by the network interface; a two-sided message carries (message id, data payload) and must be matched by the host CPU]
- A one-sided put/get message can be handled directly by a network interface with RDMA support
  - Avoids interrupting the CPU or storing data from the CPU (preposts)
- A two-sided message needs to be matched with a receive to identify the memory address where the data goes
  - Matching can be offloaded to the network interface in networks like Quadrics
  - But the match tables must be downloaded to the interface (from the host)
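A minimal sketch of the one-sided model in UPC (assumed example; the layout and buffer names are illustrative). The two-sided MPI equivalent would pair an MPI_Send on the sender with a matching MPI_Recv on the receiver just to supply the destination address:

  #include <upc.h>

  #define N 1024
  shared [N] double dst[THREADS][N];  /* one block of N doubles per thread */
  double src[N];                      /* private source buffer */

  int main(void) {
      if (THREADS > 1 && MYTHREAD == 0) {
          /* The address (&dst[1][0]) travels with the data: thread 1's CPU
             never posts a receive or matches a tag. */
          upc_memput(&dst[1][0], src, N * sizeof(double));
      }
      upc_barrier;   /* make the put visible before anyone reads dst */
      return 0;
  }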
16. Performance Advantage of One-Sided Communication: GASNet vs. MPI
- Opteron/InfiniBand (Jacquard at NERSC)
  - GASNet's vapi-conduit and OSU MPI 0.9.5 (MVAPICH)
- The half-power point (N½) differs by an order of magnitude
Joint work with Paul Hargrove and Dan Bonachea
17. GASNet Portability and High Performance
GASNet has better latency across machines
Joint work with UPC Group GASNet design by Dan
Bonachea
18. GASNet Portability and High Performance
GASNet performance is at least as high (comparable) for large messages
Joint work with UPC Group GASNet design by Dan
Bonachea
19. GASNet Portability and High Performance
GASNet excels at mid-range sizes, which are important for overlap
Joint work with UPC Group GASNet design by Dan
Bonachea
20. Case Study 2: NAS FT
- Performance of the exchange (all-to-all) is critical
  - 1D FFTs in each dimension, 3 phases
  - Transpose after the first 2 dimensions for locality
  - Bisection bandwidth-limited: increasingly a problem as the number of procs grows
- Three approaches:
  - Exchange: wait for the 2nd-dimension FFTs to finish, send 1 message per processor pair
  - Slab: wait for a chunk of rows destined for 1 proc, send when ready
  - Pencil: send each row as it completes
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
21. Overlapping Communication
- Goal: make use of all the wires, all the time
  - Schedule communication to avoid network backup
- Trade-off: overhead vs. overlap
  - Exchange has the fewest messages and the least message overhead
  - Slabs and pencils have more overlap; pencils the most
- Example: message sizes for a Class D problem on 256 processors:

  Exchange (all data at once)                      512 KB
  Slabs (contiguous rows that go to 1 processor)    64 KB
  Pencils (single row)                              16 KB
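A sketch of the pencil strategy under stated assumptions: fft_1d, dest_ptr, and the size constants are hypothetical, and the non-blocking puts use the Berkeley UPC extensions (bupc_memput_async/bupc_waitsync, mentioned later in this talk; the header name is assumed):

  #include <upc.h>
  #include <bupc_extensions.h>   /* header name assumed */

  #define NROWS 256
  #define ROWBYTES 4096
  extern shared void *dest_ptr(int row);  /* hypothetical: remote slot for row */
  extern void fft_1d(double *row);        /* hypothetical serial 1D FFT */
  double pencil[NROWS][ROWBYTES / sizeof(double)];

  void fft_phase(void) {
      bupc_handle_t h[NROWS];
      for (int r = 0; r < NROWS; r++) {
          fft_1d(pencil[r]);                     /* compute one row...        */
          h[r] = bupc_memput_async(dest_ptr(r),  /* ...and send it right away */
                                   pencil[r], ROWBYTES);
      }
      for (int r = 0; r < NROWS; r++)
          bupc_waitsync(h[r]);                   /* drain outstanding puts */
      upc_barrier;
  }

Each row's transfer overlaps with the FFTs of the rows that follow it, spreading traffic across the whole computation instead of concentrating it in one exchange.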
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
22. NAS FT Variants: Performance Summary
[Chart: performance across variants, peaking at 0.5 TFlops]
- Slab is always best for MPI; the small-message cost is too high
- Pencil is always best for UPC; more overlap
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
23. Case Study 3: LU Factorization
- Direct methods have complicated dependencies
  - Especially with pivoting (unpredictable communication)
  - Especially for sparse matrices (dependence graphs with holes)
- LU factorization in UPC
  - Uses overlap ideas and multithreading to mask latency
  - Multithreaded: UPC threads + user threads + threaded BLAS
    - Panel factorization (including pivoting)
    - Update to a block of U
    - Trailing submatrix updates
- Status
  - Dense LU done: HPL-compliant
  - Sparse version underway
Joint work with Parry Husbands
24. UPC HPL Performance
- MPI HPL numbers from the HPCC database
- Large-scale results:
  - 2.2 TFlops on 512p, 4.4 TFlops on 1024p (Thunder)
- Comparison to ScaLAPACK on an Altix, 2 x 4 process grid:
  - ScaLAPACK: 25.25 GFlop/s (block size 64; tried several block sizes)
  - UPC LU: 33.60 GFlop/s (block size 256), 26.47 GFlop/s (block size 64)
- n = 32000 on a 4 x 4 process grid:
  - ScaLAPACK: 43.34 GFlop/s (block size 64)
  - UPC: 70.26 GFlop/s (block size 200)
Joint work with Parry Husbands
25. Automating Support for Optimizations
- The previous examples were hand-optimized
  - Non-blocking put/get on distributed memory
  - Relaxed memory consistency on shared memory
- What analyses are needed to optimize parallel codes?
  - Concurrency analysis: determine which blocks of code could run in parallel
  - Alias analysis: determine which variables could access the same location
  - Synchronization analysis: align matching barriers, locks
  - Locality analysis: determine when a general (global) pointer is used only locally, so it can be converted to a cheaper local pointer
Joint work with Amir Kamil and Jimmy Su
26. Reordering in Parallel Programs
In parallel programs, a reordering can change the semantics even when no local dependencies exist.
Initially, flag = data = 0.

  T1 (original)    T2               T1 (reordered)
  data = 1;        f = flag;        flag = 1;
  flag = 1;        d = data;        data = 1;

After reordering, f == 1, d == 0 is possible — an outcome not allowed by the original program.
The compiler, runtime, and hardware can all produce such reorderings.
Joint work with Amir Kamil and Jimmy Su
27. Memory Models
- Sequential consistency: a reordering is illegal if it can be observed by another thread
- Relaxed consistency: reordering may be observed, but local dependencies and synchronization are preserved (roughly)
- Titanium, Java, and UPC are not sequentially consistent
  - The perceived cost of enforcing it is too high
  - For Titanium and UPC, network latency is the cost
  - For Java, shared-memory fences and inhibited code transformations are the cost
Joint work with Amir Kamil and Jimmy Su
28. Software and Hardware Reordering
- The compiler can reorder accesses as part of an optimization
  - Example: copy propagation
  - Logical fences are inserted where reordering is illegal; optimizations respect these fences
- The hardware can reorder accesses
  - Examples: out-of-order execution, remote accesses
  - Fence instructions are inserted into the generated code; a fence waits until all prior memory operations have completed
  - A fence can cost a complete round-trip time due to remote accesses
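For readers more familiar with mainstream languages, a small C11 sketch (not from the talk) of the same fence idea: a seq_cst store/load pair on flag forbids the reordering shown on the previous slide:

  #include <assert.h>
  #include <stdatomic.h>

  int data = 0;
  atomic_int flag;          /* zero-initialized at file scope */

  void t1(void) {           /* writer thread */
      data = 1;
      /* the seq_cst store acts as a fence: data = 1 may not move below it */
      atomic_store_explicit(&flag, 1, memory_order_seq_cst);
  }

  void t2(void) {           /* reader thread */
      if (atomic_load_explicit(&flag, memory_order_seq_cst))
          /* the store/load ordering makes (f == 1, d == 0) impossible */
          assert(data == 1);
  }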
Joint work with Amir Kamil and Jimmy Su
29. Conflicts
- A reordering of an access is observable only if it conflicts with some other access:
  - The accesses can be to the same memory location
  - At least one access is a write
  - The accesses can run concurrently
- Fences (compiler and hardware) need to be inserted around accesses that conflict

  T1           T2
  data = 1;    f = flag;
  flag = 1;    d = data;

  (data = 1 conflicts with d = data; flag = 1 conflicts with f = flag)
Joint work with Amir Kamil and Jimmy Su
30. Sequential Consistency in Titanium
- Goal: minimize the number of fences, allowing the same optimizations as the relaxed model
- Concurrency analysis identifies concurrent accesses
  - Relies on Titanium's textual barriers and single-valued expressions
- Alias analysis identifies accesses to the same location
  - Relies on the SPMD nature of Titanium
Joint work with Amir Kamil and Jimmy Su
31. Barrier Alignment
- Many parallel languages make no attempt to ensure that barriers line up
- Example: code that is legal (in such languages) but will deadlock:

    if (Ti.thisProc() % 2 == 0)
      Ti.barrier();  // even ID threads
    else
      ;              // odd ID threads: no barrier
Joint work with Amir Kamil and Jimmy Su
32. Structural Correctness
- Aiken and Gay introduced structural correctness (POPL '98)
  - Ensures that every thread executes the same number of barriers
- Example of structurally correct code:

    if (Ti.thisProc() % 2 == 0)
      Ti.barrier();  // even ID threads
    else
      Ti.barrier();  // odd ID threads
Joint work with Amir Kamil and Jimmy Su
33. Textual Barrier Alignment
- Titanium has textual barriers: all threads must execute the same textual sequence of barriers
  - A stronger guarantee than structural correctness; the example below is illegal in Titanium:

    if (Ti.thisProc() % 2 == 0)
      Ti.barrier();  // even ID threads
    else
      Ti.barrier();  // odd ID threads
- Single-valued expressions are used to enforce textual barriers
Joint work with Amir Kamil and Jimmy Su
34. Single-Valued Expressions
- A single-valued expression has the same value on all threads when evaluated
  - Example: Ti.numProcs() > 1
- All threads are guaranteed to take the same branch of a conditional guarded by a single-valued expression
- Only single-valued conditionals may contain barriers
- Example of legal barrier use:

    if (Ti.numProcs() > 1)
      Ti.barrier();  // multiple threads
    else
      ;              // only one thread total
Joint work with Amir Kamil and Jimmy Su
35. Concurrency Analysis
- A graph is generated from the program as follows:
  - A node is added for each code segment between barriers and single-valued conditionals
  - Edges are added to represent control flow between segments

    // code segment 1
    if (single)
      // code segment 2
    else
      // code segment 3
    // code segment 4
    Ti.barrier()
    // code segment 5

[Graph: 1 -> 2 and 1 -> 3; 2 -> 4 and 3 -> 4; 4 -> barrier -> 5]
Joint work with Amir Kamil and Jimmy Su
36. Concurrency Analysis (II)
- Two accesses can run concurrently if:
  - they are in the same node, or
  - one access's node is reachable from the other access's node without hitting a barrier
- Algorithm: remove barrier edges, then do DFS

  Concurrent segments (X = may run concurrently):

        1  2  3  4  5
    1   X  X  X  X
    2   X  X     X
    3   X     X  X
    4   X  X  X  X
    5               X

  (Segments 2 and 3 are never concurrent: the conditional is single-valued, so all threads take the same branch; segment 5 is separated from the rest by the barrier.)
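A compact sketch of this algorithm (assumed implementation; names are illustrative): barrier edges are simply omitted from the successor lists, so plain DFS reachability answers the concurrency question:

  #include <stdbool.h>

  #define MAXN 64
  int  nsucc[MAXN];        /* out-degree per segment                    */
  int  succ[MAXN][MAXN];   /* successor segments, barrier edges removed */
  bool seen[MAXN];

  static void dfs(int n) {
      if (seen[n]) return;
      seen[n] = true;
      for (int i = 0; i < nsucc[n]; i++)
          dfs(succ[n][i]);
  }

  /* true if segments a and b may execute concurrently */
  bool may_run_concurrently(int a, int b, int nnodes) {
      if (a == b) return true;   /* same node, different threads */
      for (int i = 0; i < nnodes; i++) seen[i] = false;
      dfs(a);
      if (seen[b]) return true;  /* b reachable from a without a barrier */
      for (int i = 0; i < nnodes; i++) seen[i] = false;
      dfs(b);
      return seen[a];            /* or a reachable from b */
  }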
Joint work with Amir Kamil and Jimmy Su
37. Alias Analysis
- Allocation sites correspond to abstract locations (a-locs)
- All explicit and implicit program variables have points-to sets
- A-locs are typed and have points-to sets for each field of the corresponding type
- Arrays have a single points-to set for all indices
- The analysis is flow- and context-insensitive
  - An experimental call-site-sensitive version doesn't seem to help much
Joint work with Amir Kamil and Jimmy Su
38. Thread-Aware Alias Analysis
- Two types of abstract locations: local and remote
  - Local locations reside in the local thread's memory
  - Remote locations reside on another thread
- Exploits the SPMD property
  - Results are a summary over all threads
  - Independent of the number of threads at runtime
Joint work with Amir Kamil and Jimmy Su
39. Alias Analysis: Allocation
- Allocation creates a new local abstract location
- The result of an allocation must reside in local memory

    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }

  A-locs: 1, 2
  Points-to sets: a = {}, b = {}, c = {}
Joint work with Amir Kamil and Jimmy Su
40. Alias Analysis: Assignment
- Assignment copies the source's abstract locations into the points-to set of the target

    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }

  A-locs: 1, 2
  Points-to sets: a = {1}, b = {}, c = {1}, 1.z = {2}
Joint work with Amir Kamil and Jimmy Su
41. Alias Analysis: Broadcast
- Broadcast produces both local and remote versions of the source abstract location
  - The remote a-loc points to the remote analog of what the local a-loc points to

    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }

  A-locs: 1, 2, 1r
  Points-to sets: a = {1}, b = {1, 1r}, c = {1}, 1.z = {2}, 1r.z = {2r}
Joint work with Amir Kamil and Jimmy Su
42. Aliasing Results
- Two variables A and B may alias if:
    ∃x. x ∈ pointsTo(A) and x ∈ pointsTo(B)
- Two variables A and B may alias across threads if:
    ∃x. x ∈ pointsTo(A) and R(x) ∈ pointsTo(B)
    (where R(x) is the remote counterpart of x)

  Points-to sets: a = {1}, b = {1, 1r}, c = {1}

  Variable   Aliases   Aliases across threads
  a          b, c      b
  b          a, c      a, c
  c          a, b      b
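A toy sketch (assumed, not the actual implementation) of these two checks using bit-set points-to sets, with bit i standing for a-loc i and bit i+16 for its remote counterpart R(i):

  #include <stdbool.h>
  #include <stdint.h>

  /* bit i (i < 16) = local a-loc i; bit i+16 = its remote counterpart R(i) */
  typedef uint32_t ptset;
  #define R(s) ((ptset)((s) & 0xFFFFu) << 16)  /* remote analogs of local bits */

  /* exists x with x in pointsTo(A) and x in pointsTo(B) */
  bool may_alias(ptset a, ptset b) { return (a & b) != 0; }

  /* exists x with x in pointsTo(A) and R(x) in pointsTo(B), checked in
     either direction so the relation is symmetric */
  bool may_alias_across_threads(ptset a, ptset b) {
      return ((R(a) & b) | (R(b) & a)) != 0;
  }

On the slide's example (a = c = {1} = bit 0, b = {1, 1r} = bits 0 and 16), this reproduces the table: a and c alias b across threads, and b aliases both a and c.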
Joint work with Amir Kamil and Jimmy Su
43. Benchmarks

  Benchmark     Lines(1)  Description
  pi                56    Monte Carlo integration
  demv             122    Dense matrix-vector multiply
  sample-sort      321    Parallel sort
  lu-fact          420    Dense linear algebra
  3d-fft           614    Fourier transform
  gsrb            1090    Computational fluid dynamics kernel
  gsrb*           1099    Slightly modified version of gsrb
  spmv            1493    Sparse matrix-vector multiply
  gas             8841    Hyperbolic solver for gas dynamics

  (1) Line counts do not include the reachable portion of the ~37,000-line Titanium/Java 1.0 libraries.
Joint work with Amir Kamil and Jimmy Su
44. Analysis Levels
- We tested analyses of varying levels of precision:

  Analysis          Description
  naïve             All heap accesses fenced
  sharing           All shared accesses fenced
  concur            Concurrency analysis + type-based AA
  concur/saa        Concurrency analysis + sequential AA
  concur/taa        Concurrency analysis + thread-aware AA
  concur/taa/cycle  Concurrency analysis + thread-aware AA + cycle detection
Joint work with Amir Kamil and Jimmy Su
45. Static (Logical) Fences
[Chart: static fence counts per benchmark and analysis level; fewer is better. Percentages give the reduction in static fences relative to naïve.]
Joint work with Amir Kamil and Jimmy Su
46. Dynamic (Executed) Fences
[Chart: dynamic fence counts per benchmark and analysis level; fewer is better. Percentages give the reduction in dynamic fences relative to naïve.]
Joint work with Amir Kamil and Jimmy Su
47. Dynamic Fences: gsrb
- gsrb relies on dynamic locality checks
  - A slight modification to remove the checks (gsrb*) greatly increases the precision of the analysis
[Chart: dynamic fence counts for gsrb vs. gsrb*; fewer is better]
Joint work with Amir Kamil and Jimmy Su
48. Two Example Optimizations
- Consider two optimizations for GAS languages:
  - Overlap of bulk memory copies
  - Communication aggregation for irregular array accesses (i.e., a[b[i]])
- Both optimizations reorder accesses, so sequential consistency can inhibit them
- Both address network performance, so the potential payoff is high
Joint work with Amir Kamil and Jimmy Su
49. Array Copies in Titanium
- Array copy operations are commonly used:
    dst.copy(src)
- Content in the domain intersection of the two arrays is copied from src to dst
- Communication (possibly with packing) is required if the arrays reside on different threads
- The processor blocks until the operation is complete
[Figure: overlapping src and dst arrays; the domain intersection is the copied region]
Joint work with Amir Kamil and Jimmy Su
50. Non-Blocking Array Copy Optimization
- Automatically convert blocking array copies into non-blocking array copies
- Push the sync as far down the instruction stream as possible to allow overlap with computation
- Interprocedural: syncs can be moved across method boundaries
- The optimization reorders memory accesses and may be illegal under sequential consistency (see the sketch below)
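An illustrative before/after of the transformation (all names are hypothetical; Titanium's actual runtime interface differs):

  /* hypothetical API, for illustration only */
  typedef struct copy_handle copy_handle;
  extern void array_copy(double *dst, const double *src);        /* blocking */
  extern copy_handle *array_copy_async(double *dst, const double *src);
  extern void sync_copy(copy_handle *h);
  extern void unrelated_compute(void);
  extern void consume(double *dst);

  void before(double *dst, const double *src) {
      array_copy(dst, src);       /* blocks until the transfer completes */
      unrelated_compute();        /* cannot start until the copy is done */
      consume(dst);
  }

  void after(double *dst, const double *src) {
      copy_handle *h = array_copy_async(dst, src);  /* start the copy */
      unrelated_compute();                          /* overlapped work */
      sync_copy(h);         /* sync pushed down to the first use of dst */
      consume(dst);
  }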
Joint work with Amir Kamil and Jimmy Su
51. Communication Aggregation on Irregular Array Accesses (Inspector/Executor)
- A loop containing indirect array accesses is split into phases:
  - The inspector examines the loop and computes the reference targets
  - The required remote data is gathered in a bulk operation
  - The executor uses the gathered data to perform the actual computation
- Can be illegal under sequential consistency

    // original loop
    for (...) {
      a[i] = remote[b[i]];
      // other accesses
    }

    // transformed
    schd = inspect(remote, b);
    tmp = get(remote, schd);
    for (...) {
      a[i] = tmp[i];
      // other accesses
    }
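A minimal local sketch of the split (assumed code; in the real optimization the gather is a bulk one-sided get over the network, while here it is a local loop for illustration):

  /* inspector/executor over an indirect access pattern a[i] = remote[b[i]] */
  void executor(double *a, const double *remote, const int *b, int n) {
      int    schd[n];        /* inspector: the reference targets   */
      double tmp[n];         /* gathered remote values             */
      for (int i = 0; i < n; i++) schd[i] = b[i];           /* inspect  */
      for (int i = 0; i < n; i++) tmp[i] = remote[schd[i]]; /* bulk get */
      for (int i = 0; i < n; i++) a[i] = tmp[i];            /* execute  */
  }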
Joint work with Amir Kamil and Jimmy Su
52. Relaxed + SC with 3 Analyses
- We tested performance using analyses of varying levels of precision:

  Name              Description
  relaxed           Uses Titanium's relaxed memory model
  naïve             Sequential consistency; fences around every heap access
  sharing           Sequential consistency; fences around every shared heap access
  concur/taa/cycle  Sequential consistency; uses our most aggressive analysis
Joint work with Amir Kamil and Jimmy Su
53. Dense Matrix-Vector Multiply
- Non-blocking array copy optimization applied
- The strongest analysis is necessary; the other SC implementations suffer relative to relaxed
Joint work with Amir Kamil and Jimmy Su
54. Sparse Matrix-Vector Multiply
- Inspector/executor optimization applied
- The strongest analysis is again necessary, and sufficient
Joint work with Amir Kamil and Jimmy Su
55. Portability of Titanium and UPC
- Titanium and the Berkeley UPC translator use a similar model:
  - Source-to-source translator (generates ISO C)
  - Runtime layer implements global pointers, etc.
  - Common communication layer (GASNet)
- Both run on most PCs, SMPs, clusters, and supercomputers
- Supported operating systems:
  - Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX
  - The UPC translator is somewhat less portable; we provide an HTTP-based compile server
- Supported CPUs:
  - x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron
- GASNet communication:
  - Myrinet GM, Quadrics Elan, Mellanox InfiniBand VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI SHMEM, and (for portability) MPI and UDP
- Specific supercomputer platforms:
  - HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000
  - Underway: Cray XT3, BG/L (both run over MPI)
- Can be mixed with MPI, C/C++, and Fortran
GASNet is also used by gcc/upc
Joint work with Titanium and UPC groups
56. Portability of PGAS Languages
- Other compilers also exist for PGAS languages
- UPC:
  - GCC/UPC by Intrepid: runs on GASNet
  - HP UPC: for AlphaServers, clusters, ...
  - MTU UPC: uses the HP compiler on MPI (source to source)
  - Cray UPC
- Co-Array Fortran:
  - Cray CAF compiler: X1, X1E
  - Rice CAF compiler (on ARMCI or GASNet), John Mellor-Crummey
    - Source to source
    - Processors: Pentium, Itanium2, Alpha, MIPS
    - Networks: Myrinet, Quadrics, Altix, Origin, Ethernet
    - OS: Linux32 RedHat, IRIX, Tru64
- NB: source-to-source requires cooperation from the backend compilers
57. Summary
- PGAS languages offer a productivity advantage
  - An order of magnitude in line counts for grid-based code in Titanium
  - Push decisions about packing (or not) into the runtime for portability (an advantage of a language with a translator vs. a library approach)
  - Significant work in the compiler can make programming easier
- PGAS languages offer performance advantages
  - Good match to RDMA support in networks
  - Smaller messages may be faster:
    - they make better use of the network and postpone bisection-bandwidth pain
    - they can also prevent cache thrashing from packing
  - Locality advantages that may help even on SMPs
- Source-to-source translation
  - The way to ubiquity
  - Complements highly tuned machine-specific compilers
58. End of Slides
59. Productizing BUPC
- Recent Berkeley UPC release
  - Supports the full 1.2 language spec
  - Supports collectives (tuning ongoing) and memory model compliance
  - Supports UPC I/O (naïve reference implementation)
- Large effort in quality assurance and robustness
  - Test suite: 600+ tests run nightly on ~20 platform configs
    - Tests correct compilation and execution of UPC and GASNet
    - >30,000 UPC compilations and >20,000 UPC test runs per night
    - Online reporting of results, hooked up with the bug database
  - Test suite infrastructure extended to support any UPC compiler
    - Now running nightly with GCC/UPC + UPCR
    - Also supports HP-UPC, Cray UPC, ...
  - Online bug-reporting database
    - Over 1100 reports since Jan '03
    - >90% fixed (excluding enhancement requests)
60. NAS FT UPC Non-Blocking MFlops
- The Berkeley UPC compiler supports non-blocking UPC extensions
- These produce a 15-45% speedup over the best blocking UPC version
- The non-blocking version requires about 30 extra lines of UPC code
61. Benchmarking
- The next few UPC and MPI application benchmarks use the following systems:
  - Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2 GHz
  - InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2 GHz
  - Elan3: Quadrics QsNet1, Alpha 1 GHz
  - Elan4: Quadrics QsNet2, Itanium2 1.4 GHz
62. PGAS Languages: Key to High Performance
- One way to gain acceptance of a new language:
  - Make it run faster than anything else
- Keys to high performance:
  - Parallelism: scaling the number of processors
  - Maximize single-node performance
    - Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  - Avoid (unnecessary) communication cost
    - Latency, bandwidth, overhead
  - Avoid unnecessary delays due to dependencies
    - Load balance
    - Pipeline algorithmic dependencies
63. Hardware Latency
- Network latency is not expected to improve significantly
- Overlapping communication automatically (Chen)
- Overlapping manually in the UPC applications (Husbands, Welcome, Bell, Nishtala)
- Language support for overlap (Bonachea)
64. Effective Latency
- Communication wait time also comes from other factors:
  - Algorithmic dependencies
    - Use finer-grained parallelism, pipeline tasks (Husbands)
  - Communication bandwidth bottleneck
    - Message time is Latency + Size/Bandwidth
    - Too much aggregation hurts: you end up waiting on the bandwidth term
    - De-aggregation optimization: automatic (Iancu)
  - Bisection bandwidth bottlenecks
    - Spread communication throughout the computation (Bell)
65. Fine-Grained UPC vs. Bulk-Synchronous MPI
- How to waste money on supercomputers:
  - Pack all communication into a single message (spend memory bandwidth)
  - Save all communication until the last piece is ready (add effective latency)
  - Send it all at once (spend bisection bandwidth)
- Or, to use what you have efficiently:
  - Avoid long wait times: send early and often
  - Use all the wires, all the time
  - This requires having low overhead!
66. What You Won't Hear Much About
- Compiler/runtime/GASNet bug fixes, performance tuning, testing, ...
  - >13,000 e-mail messages regarding CVS check-ins
- Nightly regression testing
  - 25 platforms, 3 compilers (head, opt-branch, gcc-upc), ...
- Bug reporting
  - 1177 bug reports, 1027 fixed
- Release scheduled for later this summer
  - Beta is available
  - Process significantly streamlined
67. Take-Home Messages
- Titanium offers tremendous gains in productivity
  - High-level, domain-specific array abstractions
  - Titanium is being used for real applications, not just toy problems
- Titanium and UPC are both highly portable
  - Run on essentially any machine
  - Rigorously tested and supported
- PGAS languages are faster than two-sided MPI
  - Better match to most HPC networks
- Berkeley UPC and Titanium benchmarks
  - Designed from scratch with the one-sided PGAS model
  - Focus on 2 scalability challenges: AMR and sparse LU
68. Titanium Background
- Based on Java, a cleaner C++
  - Classes, automatic memory management, etc.
  - Compiled to C and then to machine code; no JVM
- Same parallelism model as UPC and CAF
  - SPMD parallelism
  - Dynamic Java threads are not supported
- Optimizing compiler
  - Analyzes global synchronization
  - Optimizes pointers, communication, and memory
69. Do These Features Yield Productivity?
Joint work with Kaushik Datta, Dan Bonachea
70. GASNet/X1 Performance
[Charts: single-word get and single-word put latencies]
- GASNet/X1 improves small-message performance over shmem and MPI
- Leverages global pointers on the X1
- Highlights the advantage of a language vs. a library approach
Joint work with Christian Bell, Wei Chen, and Dan Bonachea
71. High-Level Optimizations in Titanium
- Irregular communication can be expensive
  - The best strategy differs by data size/distribution and machine parameters
  - E.g., packing, sending bounding boxes, or fine-grained communication
- Use of runtime optimizations
  - Inspector/executor
- Performance on sparse matrix-vector multiply
  - Results: the best strategy differs even within one machine on a single matrix (up to ~50% better)
[Chart: speedup relative to MPI code (Aztec library); average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors]
Joint work with Jimmy Su
72. Source-to-Source Strategy
- Source-to-source translation strategy
  - Tremendous portability advantage
  - Still can perform significant optimizations
- Relies on high-quality back-end compilers and some coaxing in code generation:
  [Chart annotation: 48x]
  - Use of restrict pointers in C
  - Understanding of multi-D array indexing (an Intel/Itanium issue)
  - Support for pragmas like IVDEP
  - Robust vectorizers (X1, SSE, NEC, ...)
- On machines with integrated shared-memory hardware, we need access to shared-memory operations
Joint work with Jimmy Su