Title: Ernest Orlando Lawrence Berkeley National Laboratory
1. Compilation Technology for Computational Science
Kathy Yelick, Lawrence Berkeley National Laboratory and UC Berkeley
Joint work with the Titanium Group (S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen) and the Berkeley UPC Group (C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome)
2. Outline
- Computer architecture trends
- Software trends
- Scientific computing expertise in parallelism
  - Performance is as important as parallelism
  - Resource management is key to performance
  - Open question: how much of the machine to virtualize?
- Parallel language problems and PGAS solutions
  - Virtualize the global address space
  - Not shared virtual memory, not a virtual processor space
- Parallel compiler problems/solutions
3. Parallelism Everywhere
- The single-processor Moore's Law effect is ending
  - Power density limitations; device physics below 90nm
- Multicore is becoming the norm
  - AMD, IBM, Intel, and Sun are all offering multicore chips
  - The number of cores per chip is likely to increase with density
- Fundamental software change
  - Parallelism is exposed to software
  - Performance is no longer solely a hardware problem
- What has the HPC community learned?
  - Caveat: scale and applications differ
4. High-End Simulation in the Physical Sciences: 7 Methods
Phillip Colella's "Seven Dwarfs":
- Structured grids (including adaptive mesh refinement)
- Unstructured grids
- Spectral methods (FFTs, etc.)
- Dense linear algebra
- Sparse linear algebra
- Particles
- Monte Carlo simulation
Add 4 more for embedded to cover all 41 EEMBC benchmarks:
- 8. Search/sort
- 9. Filter
- 10. Combinational logic
- 11. Finite state machine
Note: data sizes (8-bit to 32-bit) and types (integer, character) differ, but the algorithms are the same. Games/entertainment are close to scientific computing.
Slide source: Phillip Colella, 2004, and Dave Patterson, 2006
5. Parallel Programming Models
- Parallel software is still an unsolved problem!
- Most parallel programs are written using either:
  - Message passing with an SPMD model: used for scientific applications; scales easily
  - Shared memory with threads in OpenMP, Threads, or Java: used for non-scientific applications; easier to program
- Partitioned Global Address Space (PGAS) languages aim to offer all three of:
  - Productivity: easy to understand and use
  - Performance: the primary requirement in HPC
  - Portability: must run everywhere
6. Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local (near) or global (possibly far); the programmer controls layout
- By default:
  - Object heaps are shared
  - Program stacks are private
[Figure: the global address space spans threads p0, p1, ..., pn; each thread holds its own private x and shared y, with local (l) and global (g) pointers reaching into the partitioned space]
- 3 current languages: UPC, CAF, and Titanium
- Emphasis in this talk on UPC and Titanium (based on Java)
7. PGAS Language Overview
- Many common concepts, although specifics differ
  - Consistent with the base language
- Both private and shared data
  - int x[10]; and shared int y[10];
- Support for distributed data structures
  - Distributed arrays; local and global pointers/references
- One-sided shared-memory communication
  - Simple assignment statements: x[i] = y[i]; or t = *p;
  - Bulk operations: memcpy in UPC, array ops in Titanium and CAF
- Synchronization
  - Global barriers, locks, memory fences
- Collective communication, I/O libraries, etc.
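To make the model concrete, here is a minimal UPC sketch (illustrative, not from the talk) of private vs. shared data and a one-sided read; x is private to each thread, while element i of y has affinity to thread i:

  #include <upc.h>

  int x[10];                 /* private: one instance per thread */
  shared int y[THREADS];     /* shared: element i lives with thread i */

  int main(void) {
      y[MYTHREAD] = MYTHREAD;             /* write the element I own */
      upc_barrier;                        /* global synchronization */
      x[0] = y[(MYTHREAD + 1) % THREADS]; /* one-sided read from a neighbor */
      upc_barrier;
      return 0;
  }

The neighbor read is an ordinary assignment, but the compiler and runtime turn it into communication when the element is remote.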
8. Example: Titanium Arrays
- Ti arrays are created using Domains and indexed using Points:
    double [3d] gridA = new double [[0,0,0]:[10,10,10]];
- foreach eliminates some loop-bound errors:
    foreach (p in gridA.domain())
      gridA[p] = gridA[p+c] + gridB[p];
- A rich domain calculus allows slicing, subarrays, transpose, and other operations without data copies
- Array copy operations automatically work on the intersection:
    data[neighborPos].copy(mydata);
[Figure: mydata and data[neighborPos] overlap; the copy fills ghost cells from the neighbor's restrict-ed (non-ghost) cells over the intersection (copied area)]
9. Productivity: Line-Count Comparison
- Comparison of the NAS Parallel Benchmarks
  - The UPC version takes modest programming effort relative to C
  - Titanium is even more compact, especially for MG, which uses multi-d arrays
  - Caveat: Titanium FT has a user-defined Complex type and uses cross-language support to call FFTW for the serial 1D FFTs
UPC results from Tarek El-Ghazawi et al.; CAF from Chamberlain et al.; Titanium joint with Kaushik Datta and Dan Bonachea
10. Case Study 1: Block-Structured AMR
- Adaptive Mesh Refinement (AMR) is challenging
  - Irregular data accesses and control from boundaries
  - A mixed global/local view is useful
Titanium AMR benchmarks are available
AMR Titanium work by Tong Wen and Phillip Colella
11. AMR in Titanium
- C/Fortran/MPI AMR
  - Chombo package from LBNL
  - Bulk-synchronous communication: pack boundary data between procs
- Titanium AMR
  - Written entirely in Titanium
  - Finer-grained communication: no explicit pack/unpack code; automated in the runtime system

Code size in lines:

                        C/Fortran/MPI   Titanium
  AMR data structures       35000         2000
  AMR operations             6500         1200
  Elliptic PDE solver        4200         1500

10X reduction in lines of code! (The Chombo code has somewhat more functionality in the PDE part.)
Work by Tong Wen and Phillip Colella; communication optimizations joint with Jimmy Su
12. Performance of Titanium AMR
Comparable performance:
- Serial: Titanium is within a few percent of C/F, and sometimes faster!
- Parallel: Titanium scaling is comparable with generic optimizations
  - Additional optimizations (namely overlap) not yet implemented
13. Immersed Boundary Simulation in Titanium
- Models elastic structures in an incompressible fluid
  - Blood flow in the heart, blood clotting, the inner ear, embryo growth, and many more
- Complicated parallelization
  - Particle/mesh method
  - Particles connected into materials

Code size in lines: Fortran 8000 vs. Titanium 4000

Joint work with Ed Givelberg and Armando Solar-Lezama
14. High Performance
- Strategy for acceptance of a new language
  - Within HPC: make it run faster than anything else
- Approaches to high performance
  - Language support for performance:
    - Allow programmers sufficient control over resources for tuning
    - Non-blocking data transfers, cross-language calls, etc.
    - Control over layout, load balancing, and synchronization
  - Compiler optimizations reduce the need for hand tuning:
    - Automate non-blocking memory operations, relaxed memory consistency, etc.
    - Productivity gains through parallel analysis and optimizations
  - Runtime support exposes the best possible performance:
    - Berkeley UPC and Titanium use the GASNet communication layer
    - Dynamic optimizations based on runtime information
15. One-Sided vs. Two-Sided Communication
[Figure: a one-sided put message carries (address, data payload) and is deposited directly into memory by the network interface; a two-sided message carries (message id, data payload) and must be matched by the host CPU]
- A one-sided put/get message can be handled directly by a network interface with RDMA support
  - Avoids interrupting the CPU or storing data from the CPU (preposts)
- A two-sided message needs to be matched with a receive to identify the memory address where the data goes
  - Matching can be offloaded to the network interface in networks like Quadrics
  - But the match tables must be downloaded to the interface (from the host)
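A minimal sketch of the one-sided model in UPC (assumed example; the layout and buffer names are illustrative). The two-sided MPI equivalent would pair an MPI_Send on the sender with a matching MPI_Recv on the receiver just to supply the destination address:

  #include <upc.h>

  #define N 1024
  shared [N] double dst[THREADS][N];  /* one block of N doubles per thread */
  double src[N];                      /* private source buffer */

  int main(void) {
      if (THREADS > 1 && MYTHREAD == 0) {
          /* The address (&dst[1][0]) travels with the data: thread 1's CPU
             never posts a receive or matches a tag. */
          upc_memput(&dst[1][0], src, N * sizeof(double));
      }
      upc_barrier;   /* make the put visible before anyone reads dst */
      return 0;
  }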
16. Performance Advantage of One-Sided Communication: GASNet vs. MPI
- Opteron/InfiniBand (Jacquard at NERSC)
  - GASNet's vapi-conduit and OSU MPI 0.9.5 (MVAPICH)
- The half-power point (N½) differs by an order of magnitude
Joint work with Paul Hargrove and Dan Bonachea
17. GASNet Portability and High Performance
GASNet has better latency across machines
Joint work with UPC Group GASNet design by Dan
Bonachea
18. GASNet Portability and High Performance
GASNet performance is at least as high (comparable) for large messages
Joint work with UPC Group GASNet design by Dan
Bonachea
19. GASNet Portability and High Performance
GASNet excels at mid-range sizes, which are important for overlap
Joint work with UPC Group GASNet design by Dan
Bonachea
20. Case Study 2: NAS FT
- Performance of the exchange (all-to-all) is critical
  - 1D FFTs in each dimension, 3 phases
  - Transpose after the first 2 dimensions for locality
  - Bisection bandwidth-limited: increasingly a problem as the number of procs grows
- Three approaches:
  - Exchange: wait for the 2nd-dimension FFTs to finish, send 1 message per processor pair
  - Slab: wait for a chunk of rows destined for 1 proc, send when ready
  - Pencil: send each row as it completes
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
21. Overlapping Communication
- Goal: make use of all the wires, all the time
  - Schedule communication to avoid network backup
- Trade-off: overhead vs. overlap
  - Exchange has the fewest messages and the least message overhead
  - Slabs and pencils have more overlap; pencils the most
- Example: message sizes for a Class D problem on 256 processors:

  Exchange (all data at once)                      512 KB
  Slabs (contiguous rows that go to 1 processor)    64 KB
  Pencils (single row)                              16 KB
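A sketch of the pencil strategy under stated assumptions: fft_1d, dest_ptr, and the size constants are hypothetical, and the non-blocking puts use the Berkeley UPC extensions (bupc_memput_async/bupc_waitsync, mentioned later in this talk; the header name is assumed):

  #include <upc.h>
  #include <bupc_extensions.h>   /* header name assumed */

  #define NROWS 256
  #define ROWBYTES 4096
  extern shared void *dest_ptr(int row);  /* hypothetical: remote slot for row */
  extern void fft_1d(double *row);        /* hypothetical serial 1D FFT */
  double pencil[NROWS][ROWBYTES / sizeof(double)];

  void fft_phase(void) {
      bupc_handle_t h[NROWS];
      for (int r = 0; r < NROWS; r++) {
          fft_1d(pencil[r]);                     /* compute one row...        */
          h[r] = bupc_memput_async(dest_ptr(r),  /* ...and send it right away */
                                   pencil[r], ROWBYTES);
      }
      for (int r = 0; r < NROWS; r++)
          bupc_waitsync(h[r]);                   /* drain outstanding puts */
      upc_barrier;
  }

Each row's transfer overlaps with the FFTs of the rows that follow it, spreading traffic across the whole computation instead of concentrating it in one exchange.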
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
22. NAS FT Variants: Performance Summary
[Chart: performance across variants, peaking at 0.5 TFlops]
- Slab is always best for MPI; the small-message cost is too high
- Pencil is always best for UPC; more overlap
Joint work with Chris Bell, Rajesh Nishtala, Dan
Bonachea
23. Case Study 3: LU Factorization
- Direct methods have complicated dependencies
  - Especially with pivoting (unpredictable communication)
  - Especially for sparse matrices (dependence graphs with holes)
- LU factorization in UPC
  - Uses overlap ideas and multithreading to mask latency
  - Multithreaded: UPC threads + user threads + threaded BLAS
    - Panel factorization (including pivoting)
    - Update to a block of U
    - Trailing submatrix updates
- Status
  - Dense LU done: HPL-compliant
  - Sparse version underway
Joint work with Parry Husbands
24. UPC HPL Performance
- MPI HPL numbers from the HPCC database
- Large-scale results:
  - 2.2 TFlops on 512p, 4.4 TFlops on 1024p (Thunder)
- Comparison to ScaLAPACK on an Altix, 2 x 4 process grid:
  - ScaLAPACK: 25.25 GFlop/s (block size 64; tried several block sizes)
  - UPC LU: 33.60 GFlop/s (block size 256), 26.47 GFlop/s (block size 64)
- n = 32000 on a 4 x 4 process grid:
  - ScaLAPACK: 43.34 GFlop/s (block size 64)
  - UPC: 70.26 GFlop/s (block size 200)
Joint work with Parry Husbands
25. Automating Support for Optimizations
- The previous examples were hand-optimized
  - Non-blocking put/get on distributed memory
  - Relaxed memory consistency on shared memory
- What analyses are needed to optimize parallel codes?
  - Concurrency analysis: determine which blocks of code could run in parallel
  - Alias analysis: determine which variables could access the same location
  - Synchronization analysis: align matching barriers, locks
  - Locality analysis: determine when a general (global) pointer is used only locally, so it can be converted to a cheaper local pointer
Joint work with Amir Kamil and Jimmy Su
26. Reordering in Parallel Programs
In parallel programs, a reordering can change the semantics even when no local dependencies exist.
Initially, flag = data = 0.

  T1 (original)    T2               T1 (reordered)
  data = 1;        f = flag;        flag = 1;
  flag = 1;        d = data;        data = 1;

After reordering, f == 1, d == 0 is possible — an outcome not allowed by the original program.
The compiler, runtime, and hardware can all produce such reorderings.
Joint work with Amir Kamil and Jimmy Su
27. Memory Models
- Sequential consistency: a reordering is illegal if it can be observed by another thread
- Relaxed consistency: reordering may be observed, but local dependencies and synchronization are preserved (roughly)
- Titanium, Java, and UPC are not sequentially consistent
  - The perceived cost of enforcing it is too high
  - For Titanium and UPC, network latency is the cost
  - For Java, shared-memory fences and inhibited code transformations are the cost
Joint work with Amir Kamil and Jimmy Su
28. Software and Hardware Reordering
- The compiler can reorder accesses as part of an optimization
  - Example: copy propagation
  - Logical fences are inserted where reordering is illegal; optimizations respect these fences
- The hardware can reorder accesses
  - Examples: out-of-order execution, remote accesses
  - Fence instructions are inserted into the generated code; a fence waits until all prior memory operations have completed
  - A fence can cost a complete round-trip time due to remote accesses
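For readers more familiar with mainstream languages, a small C11 sketch (not from the talk) of the same fence idea: a seq_cst store/load pair on flag forbids the reordering shown on the previous slide:

  #include <assert.h>
  #include <stdatomic.h>

  int data = 0;
  atomic_int flag;          /* zero-initialized at file scope */

  void t1(void) {           /* writer thread */
      data = 1;
      /* the seq_cst store acts as a fence: data = 1 may not move below it */
      atomic_store_explicit(&flag, 1, memory_order_seq_cst);
  }

  void t2(void) {           /* reader thread */
      if (atomic_load_explicit(&flag, memory_order_seq_cst))
          /* the store/load ordering makes (f == 1, d == 0) impossible */
          assert(data == 1);
  }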
Joint work with Amir Kamil and Jimmy Su
29. Conflicts
- A reordering of an access is observable only if it conflicts with some other access:
  - The accesses can be to the same memory location
  - At least one access is a write
  - The accesses can run concurrently
- Fences (compiler and hardware) need to be inserted around accesses that conflict

  T1           T2
  data = 1;    f = flag;
  flag = 1;    d = data;

  (data = 1 conflicts with d = data; flag = 1 conflicts with f = flag)
Joint work with Amir Kamil and Jimmy Su
30. Sequential Consistency in Titanium
- Goal: minimize the number of fences, allowing the same optimizations as the relaxed model
- Concurrency analysis identifies concurrent accesses
  - Relies on Titanium's textual barriers and single-valued expressions
- Alias analysis identifies accesses to the same location
  - Relies on the SPMD nature of Titanium
Joint work with Amir Kamil and Jimmy Su
31. Barrier Alignment
- Many parallel languages make no attempt to ensure that barriers line up
- Example: code that is legal (in such languages) but will deadlock:

    if (Ti.thisProc() % 2 == 0)
      Ti.barrier();  // even ID threads
    else
      ;              // odd ID threads: no barrier
Joint work with Amir Kamil and Jimmy Su
32. Structural Correctness
- Aiken and Gay introduced structural correctness (POPL '98)
  - Ensures that every thread executes the same number of barriers
- Example of structurally correct code:

    if (Ti.thisProc() % 2 == 0)
      Ti.barrier();  // even ID threads
    else
      Ti.barrier();  // odd ID threads
Joint work with Amir Kamil and Jimmy Su
33. Textual Barrier Alignment
- Titanium has textual barriers: all threads must execute the same textual sequence of barriers
  - A stronger guarantee than structural correctness; the example below is illegal in Titanium:

    if (Ti.thisProc() % 2 == 0)
      Ti.barrier();  // even ID threads
    else
      Ti.barrier();  // odd ID threads
- Single-valued expressions are used to enforce textual barriers
Joint work with Amir Kamil and Jimmy Su
34. Single-Valued Expressions
- A single-valued expression has the same value on all threads when evaluated
  - Example: Ti.numProcs() > 1
- All threads are guaranteed to take the same branch of a conditional guarded by a single-valued expression
- Only single-valued conditionals may contain barriers
- Example of legal barrier use:

    if (Ti.numProcs() > 1)
      Ti.barrier();  // multiple threads
    else
      ;              // only one thread total
Joint work with Amir Kamil and Jimmy Su
35. Concurrency Analysis
- A graph is generated from the program as follows:
  - A node is added for each code segment between barriers and single-valued conditionals
  - Edges are added to represent control flow between segments

    // code segment 1
    if (single)
      // code segment 2
    else
      // code segment 3
    // code segment 4
    Ti.barrier()
    // code segment 5

[Graph: 1 -> 2 and 1 -> 3; 2 -> 4 and 3 -> 4; 4 -> barrier -> 5]
Joint work with Amir Kamil and Jimmy Su
36. Concurrency Analysis (II)
- Two accesses can run concurrently if:
  - they are in the same node, or
  - one access's node is reachable from the other access's node without hitting a barrier
- Algorithm: remove barrier edges, then do DFS

  Concurrent segments (X = may run concurrently):

        1  2  3  4  5
    1   X  X  X  X
    2   X  X     X
    3   X     X  X
    4   X  X  X  X
    5               X

  (Segments 2 and 3 are never concurrent: the conditional is single-valued, so all threads take the same branch; segment 5 is separated from the rest by the barrier.)
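A compact sketch of this algorithm (assumed implementation; names are illustrative): barrier edges are simply omitted from the successor lists, so plain DFS reachability answers the concurrency question:

  #include <stdbool.h>

  #define MAXN 64
  int  nsucc[MAXN];        /* out-degree per segment                    */
  int  succ[MAXN][MAXN];   /* successor segments, barrier edges removed */
  bool seen[MAXN];

  static void dfs(int n) {
      if (seen[n]) return;
      seen[n] = true;
      for (int i = 0; i < nsucc[n]; i++)
          dfs(succ[n][i]);
  }

  /* true if segments a and b may execute concurrently */
  bool may_run_concurrently(int a, int b, int nnodes) {
      if (a == b) return true;   /* same node, different threads */
      for (int i = 0; i < nnodes; i++) seen[i] = false;
      dfs(a);
      if (seen[b]) return true;  /* b reachable from a without a barrier */
      for (int i = 0; i < nnodes; i++) seen[i] = false;
      dfs(b);
      return seen[a];            /* or a reachable from b */
  }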
Joint work with Amir Kamil and Jimmy Su
37. Alias Analysis
- Allocation sites correspond to abstract locations (a-locs)
- All explicit and implicit program variables have points-to sets
- A-locs are typed and have points-to sets for each field of the corresponding type
- Arrays have a single points-to set for all indices
- The analysis is flow- and context-insensitive
  - An experimental call-site-sensitive version doesn't seem to help much
Joint work with Amir Kamil and Jimmy Su
38. Thread-Aware Alias Analysis
- Two types of abstract locations: local and remote
  - Local locations reside in the local thread's memory
  - Remote locations reside on another thread
- Exploits the SPMD property
  - Results are a summary over all threads
  - Independent of the number of threads at runtime
Joint work with Amir Kamil and Jimmy Su
39. Alias Analysis: Allocation
- Allocation creates a new local abstract location
- The result of an allocation must reside in local memory

    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }

  A-locs: 1, 2
  Points-to sets: a = {}, b = {}, c = {}
Joint work with Amir Kamil and Jimmy Su
40. Alias Analysis: Assignment
- Assignment copies the source's abstract locations into the points-to set of the target

    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }

  A-locs: 1, 2
  Points-to sets: a = {1}, b = {}, c = {1}, 1.z = {2}
Joint work with Amir Kamil and Jimmy Su
41. Alias Analysis: Broadcast
- Broadcast produces both local and remote versions of the source abstract location
  - The remote a-loc points to the remote analog of what the local a-loc points to

    class Foo { Object z; }
    static void bar() {
      L1: Foo a = new Foo();
          Foo b = broadcast a from 0;
          Foo c = a;
      L2: a.z = new Object();
    }

  A-locs: 1, 2, 1r
  Points-to sets: a = {1}, b = {1, 1r}, c = {1}, 1.z = {2}, 1r.z = {2r}
Joint work with Amir Kamil and Jimmy Su
42. Aliasing Results
- Two variables A and B may alias if:
    ∃x. x ∈ pointsTo(A) and x ∈ pointsTo(B)
- Two variables A and B may alias across threads if:
    ∃x. x ∈ pointsTo(A) and R(x) ∈ pointsTo(B)
    (where R(x) is the remote counterpart of x)

  Points-to sets: a = {1}, b = {1, 1r}, c = {1}

  Variable   Aliases   Aliases across threads
  a          b, c      b
  b          a, c      a, c
  c          a, b      b
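A toy sketch (assumed, not the actual implementation) of these two checks using bit-set points-to sets, with bit i standing for a-loc i and bit i+16 for its remote counterpart R(i):

  #include <stdbool.h>
  #include <stdint.h>

  /* bit i (i < 16) = local a-loc i; bit i+16 = its remote counterpart R(i) */
  typedef uint32_t ptset;
  #define R(s) ((ptset)((s) & 0xFFFFu) << 16)  /* remote analogs of local bits */

  /* exists x with x in pointsTo(A) and x in pointsTo(B) */
  bool may_alias(ptset a, ptset b) { return (a & b) != 0; }

  /* exists x with x in pointsTo(A) and R(x) in pointsTo(B), checked in
     either direction so the relation is symmetric */
  bool may_alias_across_threads(ptset a, ptset b) {
      return ((R(a) & b) | (R(b) & a)) != 0;
  }

On the slide's example (a = c = {1} = bit 0, b = {1, 1r} = bits 0 and 16), this reproduces the table: a and c alias b across threads, and b aliases both a and c.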
Joint work with Amir Kamil and Jimmy Su
43. Benchmarks

  Benchmark     Lines(1)  Description
  pi                56    Monte Carlo integration
  demv             122    Dense matrix-vector multiply
  sample-sort      321    Parallel sort
  lu-fact          420    Dense linear algebra
  3d-fft           614    Fourier transform
  gsrb            1090    Computational fluid dynamics kernel
  gsrb*           1099    Slightly modified version of gsrb
  spmv            1493    Sparse matrix-vector multiply
  gas             8841    Hyperbolic solver for gas dynamics

  (1) Line counts do not include the reachable portion of the ~37,000-line Titanium/Java 1.0 libraries.
Joint work with Amir Kamil and Jimmy Su
44. Analysis Levels
- We tested analyses of varying levels of precision:

  Analysis          Description
  naïve             All heap accesses fenced
  sharing           All shared accesses fenced
  concur            Concurrency analysis + type-based AA
  concur/saa        Concurrency analysis + sequential AA
  concur/taa        Concurrency analysis + thread-aware AA
  concur/taa/cycle  Concurrency analysis + thread-aware AA + cycle detection
Joint work with Amir Kamil and Jimmy Su
45. Static (Logical) Fences
[Chart: static fence counts per benchmark and analysis level; fewer is better. Percentages give the reduction in static fences relative to naïve.]
Joint work with Amir Kamil and Jimmy Su
46. Dynamic (Executed) Fences
[Chart: dynamic fence counts per benchmark and analysis level; fewer is better. Percentages give the reduction in dynamic fences relative to naïve.]
Joint work with Amir Kamil and Jimmy Su
47. Dynamic Fences: gsrb
- gsrb relies on dynamic locality checks
  - A slight modification to remove the checks (gsrb*) greatly increases the precision of the analysis
[Chart: dynamic fence counts for gsrb vs. gsrb*; fewer is better]
Joint work with Amir Kamil and Jimmy Su
48. Two Example Optimizations
- Consider two optimizations for GAS languages:
  - Overlap of bulk memory copies
  - Communication aggregation for irregular array accesses (i.e., a[b[i]])
- Both optimizations reorder accesses, so sequential consistency can inhibit them
- Both address network performance, so the potential payoff is high
Joint work with Amir Kamil and Jimmy Su
49. Array Copies in Titanium
- Array copy operations are commonly used:
    dst.copy(src)
- Content in the domain intersection of the two arrays is copied from src to dst
- Communication (possibly with packing) is required if the arrays reside on different threads
- The processor blocks until the operation is complete
[Figure: overlapping src and dst arrays; the domain intersection is the copied region]
Joint work with Amir Kamil and Jimmy Su
50. Non-Blocking Array Copy Optimization
- Automatically convert blocking array copies into non-blocking array copies
- Push the sync as far down the instruction stream as possible to allow overlap with computation
- Interprocedural: syncs can be moved across method boundaries
- The optimization reorders memory accesses and may be illegal under sequential consistency (see the sketch below)
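An illustrative before/after of the transformation (all names are hypothetical; Titanium's actual runtime interface differs):

  /* hypothetical API, for illustration only */
  typedef struct copy_handle copy_handle;
  extern void array_copy(double *dst, const double *src);        /* blocking */
  extern copy_handle *array_copy_async(double *dst, const double *src);
  extern void sync_copy(copy_handle *h);
  extern void unrelated_compute(void);
  extern void consume(double *dst);

  void before(double *dst, const double *src) {
      array_copy(dst, src);       /* blocks until the transfer completes */
      unrelated_compute();        /* cannot start until the copy is done */
      consume(dst);
  }

  void after(double *dst, const double *src) {
      copy_handle *h = array_copy_async(dst, src);  /* start the copy */
      unrelated_compute();                          /* overlapped work */
      sync_copy(h);         /* sync pushed down to the first use of dst */
      consume(dst);
  }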
Joint work with Amir Kamil and Jimmy Su
51. Communication Aggregation on Irregular Array Accesses (Inspector/Executor)
- A loop containing indirect array accesses is split into phases:
  - The inspector examines the loop and computes the reference targets
  - The required remote data is gathered in a bulk operation
  - The executor uses the gathered data to perform the actual computation
- Can be illegal under sequential consistency

    // original loop
    for (...) {
      a[i] = remote[b[i]];
      // other accesses
    }

    // transformed
    schd = inspect(remote, b);
    tmp = get(remote, schd);
    for (...) {
      a[i] = tmp[i];
      // other accesses
    }
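A minimal local sketch of the split (assumed code; in the real optimization the gather is a bulk one-sided get over the network, while here it is a local loop for illustration):

  /* inspector/executor over an indirect access pattern a[i] = remote[b[i]] */
  void executor(double *a, const double *remote, const int *b, int n) {
      int    schd[n];        /* inspector: the reference targets   */
      double tmp[n];         /* gathered remote values             */
      for (int i = 0; i < n; i++) schd[i] = b[i];           /* inspect  */
      for (int i = 0; i < n; i++) tmp[i] = remote[schd[i]]; /* bulk get */
      for (int i = 0; i < n; i++) a[i] = tmp[i];            /* execute  */
  }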
Joint work with Amir Kamil and Jimmy Su
52. Relaxed + SC with 3 Analyses
- We tested performance using analyses of varying levels of precision:

  Name              Description
  relaxed           Uses Titanium's relaxed memory model
  naïve             Sequential consistency; fences around every heap access
  sharing           Sequential consistency; fences around every shared heap access
  concur/taa/cycle  Sequential consistency; uses our most aggressive analysis
Joint work with Amir Kamil and Jimmy Su
53. Dense Matrix-Vector Multiply
- Non-blocking array copy optimization applied
- The strongest analysis is necessary; the other SC implementations suffer relative to relaxed
Joint work with Amir Kamil and Jimmy Su
54. Sparse Matrix-Vector Multiply
- Inspector/executor optimization applied
- The strongest analysis is again necessary, and sufficient
Joint work with Amir Kamil and Jimmy Su
55. Portability of Titanium and UPC
- Titanium and the Berkeley UPC translator use a similar model:
  - Source-to-source translator (generates ISO C)
  - Runtime layer implements global pointers, etc.
  - Common communication layer (GASNet)
- Both run on most PCs, SMPs, clusters, and supercomputers
- Supported operating systems:
  - Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX
  - The UPC translator is somewhat less portable; we provide an HTTP-based compile server
- Supported CPUs:
  - x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron
- GASNet communication:
  - Myrinet GM, Quadrics Elan, Mellanox InfiniBand VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI SHMEM, and (for portability) MPI and UDP
- Specific supercomputer platforms:
  - HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000
  - Underway: Cray XT3, BG/L (both run over MPI)
- Can be mixed with MPI, C/C++, and Fortran
GASNet is also used by gcc/upc
Joint work with Titanium and UPC groups
56. Portability of PGAS Languages
- Other compilers also exist for PGAS languages
- UPC:
  - GCC/UPC by Intrepid: runs on GASNet
  - HP UPC: for AlphaServers, clusters, ...
  - MTU UPC: uses the HP compiler on MPI (source to source)
  - Cray UPC
- Co-Array Fortran:
  - Cray CAF compiler: X1, X1E
  - Rice CAF compiler (on ARMCI or GASNet), John Mellor-Crummey
    - Source to source
    - Processors: Pentium, Itanium2, Alpha, MIPS
    - Networks: Myrinet, Quadrics, Altix, Origin, Ethernet
    - OS: Linux32 RedHat, IRIX, Tru64
- NB: source-to-source requires cooperation from the backend compilers
57. Summary
- PGAS languages offer a productivity advantage
  - An order of magnitude in line counts for grid-based code in Titanium
  - Push decisions about packing (or not) into the runtime for portability (an advantage of a language with a translator vs. a library approach)
  - Significant work in the compiler can make programming easier
- PGAS languages offer performance advantages
  - Good match to RDMA support in networks
  - Smaller messages may be faster:
    - they make better use of the network and postpone bisection-bandwidth pain
    - they can also prevent cache thrashing from packing
  - Locality advantages that may help even on SMPs
- Source-to-source translation
  - The way to ubiquity
  - Complements highly tuned machine-specific compilers
58. End of Slides
59. Productizing BUPC
- Recent Berkeley UPC release
  - Supports the full 1.2 language spec
  - Supports collectives (tuning ongoing) and memory model compliance
  - Supports UPC I/O (naïve reference implementation)
- Large effort in quality assurance and robustness
  - Test suite: 600+ tests run nightly on ~20 platform configs
    - Tests correct compilation and execution of UPC and GASNet
    - >30,000 UPC compilations and >20,000 UPC test runs per night
    - Online reporting of results, hooked up with the bug database
  - Test suite infrastructure extended to support any UPC compiler
    - Now running nightly with GCC/UPC + UPCR
    - Also supports HP-UPC, Cray UPC, ...
  - Online bug-reporting database
    - Over 1100 reports since Jan '03
    - >90% fixed (excluding enhancement requests)
60. NAS FT UPC Non-Blocking MFlops
- The Berkeley UPC compiler supports non-blocking UPC extensions
- These produce a 15-45% speedup over the best blocking UPC version
- The non-blocking version requires about 30 extra lines of UPC code
61. Benchmarking
- The next few UPC and MPI application benchmarks use the following systems:
  - Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2 GHz
  - InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2 GHz
  - Elan3: Quadrics QsNet1, Alpha 1 GHz
  - Elan4: Quadrics QsNet2, Itanium2 1.4 GHz
62. PGAS Languages: Key to High Performance
- One way to gain acceptance of a new language:
  - Make it run faster than anything else
- Keys to high performance:
  - Parallelism: scaling the number of processors
  - Maximize single-node performance
    - Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  - Avoid (unnecessary) communication cost
    - Latency, bandwidth, overhead
  - Avoid unnecessary delays due to dependencies
    - Load balance
    - Pipeline algorithmic dependencies
63. Hardware Latency
- Network latency is not expected to improve significantly
- Overlapping communication automatically (Chen)
- Overlapping manually in the UPC applications (Husbands, Welcome, Bell, Nishtala)
- Language support for overlap (Bonachea)
64. Effective Latency
- Communication wait time also comes from other factors:
  - Algorithmic dependencies
    - Use finer-grained parallelism, pipeline tasks (Husbands)
  - Communication bandwidth bottleneck
    - Message time is Latency + Size/Bandwidth
    - Too much aggregation hurts: you end up waiting on the bandwidth term
    - De-aggregation optimization: automatic (Iancu)
  - Bisection bandwidth bottlenecks
    - Spread communication throughout the computation (Bell)
65. Fine-Grained UPC vs. Bulk-Synchronous MPI
- How to waste money on supercomputers:
  - Pack all communication into a single message (spend memory bandwidth)
  - Save all communication until the last piece is ready (add effective latency)
  - Send it all at once (spend bisection bandwidth)
- Or, to use what you have efficiently:
  - Avoid long wait times: send early and often
  - Use all the wires, all the time
  - This requires having low overhead!
66. What You Won't Hear Much About
- Compiler/runtime/GASNet bug fixes, performance tuning, testing, ...
  - >13,000 e-mail messages regarding CVS check-ins
- Nightly regression testing
  - 25 platforms, 3 compilers (head, opt-branch, gcc-upc), ...
- Bug reporting
  - 1177 bug reports, 1027 fixed
- Release scheduled for later this summer
  - Beta is available
  - Process significantly streamlined
67. Take-Home Messages
- Titanium offers tremendous gains in productivity
  - High-level, domain-specific array abstractions
  - Titanium is being used for real applications, not just toy problems
- Titanium and UPC are both highly portable
  - Run on essentially any machine
  - Rigorously tested and supported
- PGAS languages are faster than two-sided MPI
  - Better match to most HPC networks
- Berkeley UPC and Titanium benchmarks
  - Designed from scratch with the one-sided PGAS model
  - Focus on 2 scalability challenges: AMR and sparse LU
68. Titanium Background
- Based on Java, a cleaner C++
  - Classes, automatic memory management, etc.
  - Compiled to C and then to machine code; no JVM
- Same parallelism model as UPC and CAF
  - SPMD parallelism
  - Dynamic Java threads are not supported
- Optimizing compiler
  - Analyzes global synchronization
  - Optimizes pointers, communication, and memory
69. Do These Features Yield Productivity?
Joint work with Kaushik Datta, Dan Bonachea
70. GASNet/X1 Performance
[Charts: single-word get and single-word put latencies]
- GASNet/X1 improves small-message performance over shmem and MPI
- Leverages global pointers on the X1
- Highlights the advantage of a language vs. a library approach
Joint work with Christian Bell, Wei Chen, and Dan Bonachea
71. High-Level Optimizations in Titanium
- Irregular communication can be expensive
  - The best strategy differs by data size/distribution and machine parameters
  - E.g., packing, sending bounding boxes, or fine-grained communication
- Use of runtime optimizations
  - Inspector/executor
- Performance on sparse matrix-vector multiply
  - Results: the best strategy differs even within one machine on a single matrix (up to ~50% better)
[Chart: speedup relative to MPI code (Aztec library); average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors]
Joint work with Jimmy Su
72. Source-to-Source Strategy
- Source-to-source translation strategy
  - Tremendous portability advantage
  - Still can perform significant optimizations
- Relies on high-quality back-end compilers and some coaxing in code generation:
  [Chart annotation: 48x]
  - Use of restrict pointers in C
  - Understanding of multi-D array indexing (an Intel/Itanium issue)
  - Support for pragmas like IVDEP
  - Robust vectorizers (X1, SSE, NEC, ...)
- On machines with integrated shared-memory hardware, we need access to shared-memory operations
Joint work with Jimmy Su