Title: Auto-tuning Memory Intensive Kernels for Multicore

Slide 1: Auto-tuning Memory Intensive Kernels for Multicore
- Sam Williams
- SWWilliams@lbl.gov
Slide 2: Outline
- Challenges arising from Optimizing Single Thread Performance
- New Challenges Arising when Optimizing Multicore SMP Performance
- Performance Modeling and Little's Law
- Multicore SMPs of Interest
- Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
- Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
- Summary
Slide 3: Challenges arising from Optimizing Single Thread Performance
Slide 4: Instruction-Level Parallelism
- On modern pipelined architectures, operations (like floating-point addition) have a latency of 4-6 cycles (until the result is ready).
- However, independent adds can be pipelined one after another.
- Although this increases the peak flop rate, one can only achieve peak flops on the condition that on any given cycle the program has >4 independent adds ready to execute.
- Failing to do so will result in a >4x drop in performance.
- The problem is exacerbated by superscalar or VLIW architectures like POWER or Itanium.
- One must often reorganize kernels to express more instruction-level parallelism.
Slide 5: ILP Example (1x1 BCSR)
- Consider the core of SpMV:

    for (all rows) {
      y0 = 0.0;
      for (all tiles in this row) {
        y0 += V[i] * X[C[i]];
      }
      y[r] = y0;
    }

- No ILP in the inner loop
- OOO can't accelerate serial FMAs
[Figure: 1x1 register block: the FMAs form one serial dependence chain]
Slide 6: ILP Example (1x4 BCSR)
- What about 1x4 BCSR?

    for (all rows) {
      y0 = 0.0;
      for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]  ];
        y0 += V[i+1] * X[C[i]+1];
        y0 += V[i+2] * X[C[i]+2];
        y0 += V[i+3] * X[C[i]+3];
      }
      y[r] = y0;
    }

- Still no ILP in the inner loop
- FMAs are still dependent on each other
[Figure: 1x4 register block: still one serial FMA chain]
Slide 7: ILP Example (4x1 BCSR)
- What about 4x1 BCSR?

    for (all rows) {
      y0 = 0.0; y1 = 0.0;
      y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]];
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r  ] = y0; y[r+1] = y1;
      y[r+2] = y2; y[r+3] = y3;
    }

- Updating 4 different rows
- The 4 FMAs are independent
- Thus they can be pipelined.
[Figure: 4x1 register block: four independent FMA chains that can be pipelined]
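For concreteness, here is a minimal, self-contained C sketch of the 4x1 BCSR idea; the function and array names (bcsr_spmv_4x1, row_start) and the exact block layout are illustrative assumptions, not the original tuner's code:

    #include <stddef.h>

    /* Hypothetical 4x1 BCSR SpMV: each block holds 4 values from 4
     * consecutive rows sharing one column index.  The four
     * accumulators are independent, so the FMAs can be pipelined. */
    void bcsr_spmv_4x1(size_t n_block_rows,
                       const size_t *row_start, /* block-row pointers   */
                       const size_t *C,         /* column index / block */
                       const double *V,         /* 4 values / block     */
                       const double *x, double *y)
    {
        for (size_t br = 0; br < n_block_rows; br++) {
            double y0 = 0.0, y1 = 0.0, y2 = 0.0, y3 = 0.0;
            for (size_t b = row_start[br]; b < row_start[br + 1]; b++) {
                double xv = x[C[b]];        /* one source-vector load */
                y0 += V[4*b + 0] * xv;      /* four independent MACs  */
                y1 += V[4*b + 1] * xv;
                y2 += V[4*b + 2] * xv;
                y3 += V[4*b + 3] * xv;
            }
            y[4*br + 0] = y0; y[4*br + 1] = y1;
            y[4*br + 2] = y2; y[4*br + 3] = y3;
        }
    }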
Slide 8: Data-Level Parallelism
- DLP = apply the same operation to multiple independent operands.
- Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance. (SSE, Double Hummer, AltiVec, Cell, GPUs, etc.)
- Typically these instructions operate on four single precision (or two double precision) numbers at a time.
- However, some are wider: GPUs (32), Larrabee (16), and AVX (8).
- Failing to use these instructions may cause a 2-32x drop in performance.
- Unfortunately, most compilers utterly fail to generate these instructions.
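As a hedged illustration of writing such instructions by hand, a minimal SSE2 intrinsics sketch that adds two double-precision arrays two elements at a time; the function name and the assumptions (n even, 16-byte-aligned pointers) are illustrative:

    #include <emmintrin.h>  /* SSE2 */

    /* c[i] = a[i] + b[i], two doubles per instruction.
     * Assumes n is even and all pointers are 16-byte aligned. */
    void vadd_sse2(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_store_pd(&c[i], _mm_add_pd(va, vb));
        }
    }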
Slide 9: Memory-Level Parallelism (1)
- Although caches may filter many memory requests, in HPC many memory references will still go all the way to DRAM.
- Memory latency (as measured in core cycles) grew by an order of magnitude in the 90s.
- Today, the latency of a memory operation can exceed 200 cycles (1 double every 80ns is unacceptably slow).
- Like ILP, we wish to pipeline requests to DRAM.
- Several solutions exist today:
  - HW stream prefetchers
  - HW multithreading (e.g. hyperthreading)
  - SW line prefetch
  - DMA
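As a sketch of the "SW line prefetch" option using GCC's __builtin_prefetch; the distance PF_DIST is a hypothetical tuning parameter, and prefetching slightly past the end of the array is harmless since prefetches do not fault:

    /* Stream through a[], prefetching well ahead of use.
     * PF_DIST (in elements) must be tuned per machine. */
    #define PF_DIST 64

    double sum_with_prefetch(const double *a, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 0 /* low temporal locality */);
            s += a[i];
        }
        return s;
    }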
Slide 10: Memory-Level Parallelism (2)
- HW stream prefetchers are by far the easiest to implement and exploit.
- They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer.
- To effectively exploit a HW prefetcher, ensure your array references access 100s of consecutive addresses,
- e.g. read A[i]...A[i+255] without any jumps or discontinuities.
- This constraint limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed
  - A[(j+0)*N+i]...A[(j+0)*N+i+B], jump
  - A[(j+1)*N+i]...A[(j+1)*N+i+B], jump
  - A[(j+2)*N+i]...A[(j+2)*N+i+B], jump
  - ...
Slide 11: Branch Misprediction
- A mispredicted branch can stall subsequent instructions by 10 cycles.
- Select a loop structure that maximizes the loop length (keeps mispredicted branches per instruction to a minimum).
- Some architectures support predication, either in hardware or software, to eliminate branches (transforms control dependencies into data dependencies).
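A tiny sketch of software predication in C; compilers typically (though it is not guaranteed) lower the ternary form to a conditional move, removing the branch entirely:

    /* Branchy clamp: each iteration risks a misprediction. */
    int clamp_branchy(int x) { if (x < 0) x = 0; return x; }

    /* Predicated clamp: the control dependency becomes a data
     * dependency, which x86 compilers usually emit as cmov. */
    int clamp_predicated(int x) { return (x < 0) ? 0 : x; }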
Slide 12: Cache Subtleties
- Set associative caches have a limited number of sets (S) and ways (W), the product of which is the capacity (in cache lines).
- As seen in HW1, it can be beneficial to reorganize kernels to reduce the working set size and eliminate capacity misses.
- Conflict misses can severely impair performance and be very challenging to identify and eliminate.
- A given address may only be placed in W different locations in the cache.
- Poor access patterns or (roughly) power-of-two problem sizes can be especially bad: too many addresses are mapped to the same set, so not all of them can be kept in the cache and some will have to be evicted.
- Padding arrays (problem sizes) or skewing the access pattern can eliminate conflict misses.
Slide 13: Array Padding Example
- Padding changes the data layout.
- Consider a large matrix with a power-of-two number of rows:

    double A[N][M]; // column major with M = 2^k

- A[i][j] and A[i+1][j] will likely be mapped to the same set.
- We can pad each column with a couple extra rows:

    double A[N][M+pad];

- Such techniques are applicable in many other domains (stencils, lattice-Boltzmann methods, etc.)
Slide 14: New Challenges Arising when Optimizing Multicore SMP Performance
Slide 15: What are SMPs?
- SMP = shared memory parallel.
- Multiple chips (typically < 32 threads) can address any location in a large shared memory through a network or bus.
- Caches are almost universally coherent.
- You can still run MPI on an SMP, but
  - you trade "free" (you always pay for it) cache-coherency traffic for additional memory traffic (for explicit communication), and
  - you trade user-level function calls for system calls.
- Alternately, you use a SPMD threading model (pthreads, OpenMP, UPC).
- If communication between cores or threads is significant, then threaded implementations win out.
- As the computation:communication ratio increases, MPI asymptotically approaches threaded implementations.
Slide 16: What is multicore? What are multicore SMPs?
- Today, multiple cores are integrated on the same chip.
- Almost universally this is done in an SMP fashion.
- For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs. (easy transition)
- Multiple cores can share:
  - memory controllers
  - caches
  - occasionally FPUs
- Although there was a graceful transition from multiple sockets to multiple cores from the point of view of correctness, achieving good performance can be incredibly challenging.
Slide 17: Affinity
- We may wish one pair of threads to share a cache, but be disjoint from another pair of threads.
- We can control the mapping of threads to Linux processors via #include <sched.h> and sched_set/getaffinity().
- But the mapping of Linux processors to physical cores/sockets is machine/OS dependent.
- Inspect /proc/cpuinfo or use PLPA.
Slide 18: NUMA Challenges
- Recent multicore SMPs have integrated the memory controllers on chip.
- As a result, memory access is non-uniform (NUMA).
- That is, the bandwidth to read a given address varies dramatically between cores.
- Exploit NUMA (affinity + first touch) when you malloc/init data.
- The concept is similar to data decomposition for distributed memory.
Slide 19: Implicit allocation for NUMA
- Consider an OpenMP example of implicit NUMA initialization:

    #pragma omp parallel for
    for (j = 0; j < N; j++) {
      a[j] = 1.0;
      b[j] = 2.0;
      c[j] = 0.0;
    }

- The first accesses to the array (read or write) must be parallelized. DO NOT TOUCH BETWEEN MALLOC AND INIT.
- When the for loop is parallelized, each thread initializes a range of j.
- Exploits the OS's first-touch policy.
- Relies on the assumption that OpenMP maps threads correctly.
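A self-contained sketch of the whole malloc-then-first-touch pattern (N and the array name are illustrative):

    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long N = 1L << 26;
        /* malloc reserves virtual pages only; physical pages are
         * bound to a NUMA node on first touch. */
        double *a = malloc(N * sizeof *a);

        /* Do NOT touch a[] here (e.g. no memset) -- that would
         * place every page on the master thread's node. */

        #pragma omp parallel for schedule(static)
        for (long j = 0; j < N; j++)
            a[j] = 1.0;   /* first touch: pages land near each thread */

        /* ... later parallel loops should use the same schedule so
         * each thread works on the data it placed ... */
        free(a);
        return 0;
    }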
Slide 20: New Cache Challenges
- Shared caches + SPMD programming models can exacerbate conflict misses.
- Individually, threads may produce significant cache associativity pressure based on access pattern. (power-of-two problem sizes)
- Collectively, threads may produce excessive cache associativity pressure. (power-of-two problem sizes decomposed with a power-of-two number of threads)
- This can be much harder to diagnose and correct.
- This problem arises whether using MPI or a threaded model.
Slide 21: New Memory Challenges
- The number of memory controllers and the bandwidth on multicore SMPs are growing much more slowly than the number of cores.
- Codes are becoming increasingly memory-bound, as a fraction of the cores can saturate a socket's memory bandwidth.
- Multicore has traded bit- or word-parallelism for thread-parallelism.
- However, main memory is still built from bit-parallel devices (DIMMs).
- Memory-intensive apps must be restructured to match the bit-parallel nature of DIMMs (sequential access).
Slide 22: Synchronization
- Using multiple concurrent threads can create ordering and race errors.
- Locks are one solution; one must balance granularity and frequency.
- In a SPMD programming model, barriers are often a better/simpler solution.
- Spin barriers can be orders of magnitude faster than pthread library barriers. (Rajesh Nishtala, HotPar'09)
Slide 23: Performance Modeling and Little's Law
Slide 24: System Abstraction
- Abstractly describe any system (or subsystem) as a combination of black-boxed storage, computational units, and the bandwidth between them.
- These can be hierarchically composed.
- A volume of data must be transferred from the storage component, processed, and another volume of data must be returned.
- Consider the basic parameters governing performance of the channel: Bandwidth, Latency, and Concurrency.
  - Bandwidth can be measured in GB/s, Gflop/s, MIPS, etc.
  - Latency can be measured in seconds, cycles, etc.
  - Concurrency is the volume in flight across the channel, and can be measured in bytes, cache lines, operations, instructions, etc.
Slide 25: Little's Law
- Little's Law relates concurrency, bandwidth, and latency.
- To achieve peak bandwidth, one must satisfy:

    Concurrency = Latency × Bandwidth

- For example, a memory controller with 20GB/s of bandwidth and 100ns of latency requires the CPU to express 2KB of concurrency (memory-level parallelism).
- Similarly, given an expressed concurrency, one can bound attained performance.
- That is, as more concurrency is injected, we get progressively better performance.
- Note, this assumes continual, pipelined accesses.
Slide 26: Where's the bottleneck?
- We've described bandwidths:
  - DRAM ↔ CPU
  - Cache ↔ Core
  - Register File ↔ Functional units
- But in an application, one of these may be a performance-limiting bottleneck.
- We can take any pair and compare how quickly data can be transferred to how quickly it can be processed to determine the bottleneck.
Slide 27: Arithmetic Intensity
[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
- Consider the first case (DRAM-CPU).
- True Arithmetic Intensity (AI) = Total Flops / Total DRAM Bytes
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), but it remains constant for others.
- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.
Slide 28: Kernel Arithmetic Intensity and Architecture
- For a given architecture, one may calculate its flop:byte ratio.
- For a 2.3GHz Quad Core Opteron:
  - 1 SIMD add + 1 SIMD multiply per cycle per core (4 DP flops/cycle/core, i.e. 2.3 GHz × 4 cores × 4 = 36.8 GFlop/s)
  - 12.8GB/s of DRAM bandwidth
  - 36.8 / 12.8 ≈ 2.9 flops per byte
- When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring data will take longer than computing on it → memory-bound.
- When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation will take longer than data transfers → compute-bound.
Slide 29: Memory Traffic Definition
- Total bytes to/from DRAM.
- Can categorize into:
  - Compulsory misses
  - Capacity misses
  - Conflict misses
  - Write allocations
  - ...
- Oblivious of a lack of sub-cache-line spatial locality.
Slide 30: Roofline Model: Basic Concept
- Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis:

    AttainablePerformance(i,j) = min( InCoreGFlops(i), StreamBW(j) × ArithmeticIntensity )

  where optimization i can be SIMDize, or unroll, or SW prefetch, etc.
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
- Moreover, it provides insights as to which optimizations will potentially be beneficial.
Slide 31: Roofline Model: Basic Concept
- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated.
- We will bound performance as we eliminate specific forms of in-core parallelism.
[Figure: roofline for the Opteron 2356 (Barcelona): attainable GFlop/s vs. actual flop:byte ratio, bounded by peak DP and Stream Bandwidth]
Slide 32: Roofline Model: computational ceilings
- Opterons have dedicated multipliers and adders.
- If the code is dominated by adds, then attainable performance is half of peak.
- We call these "Ceilings".
- They act like constraints on performance.
[Figure: Barcelona roofline with a "mul/add imbalance" ceiling below peak DP]
Slide 33: Roofline Model: computational ceilings
- Opterons have 128-bit datapaths.
- If instructions aren't SIMDized, attainable performance will be halved.
[Figure: Barcelona roofline adding a "w/out SIMD" ceiling]
Slide 34: Roofline Model: computational ceilings
- On Opterons, floating-point instructions have a 4-cycle latency.
- If we don't express 4-way ILP, performance will drop by as much as 4x.
[Figure: Barcelona roofline adding a "w/out ILP" ceiling]
Slide 35: Roofline Model: communication ceilings
- We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Figure: Barcelona roofline with the Stream Bandwidth diagonal]
Slide 36: Roofline Model: communication ceilings
- Explicit software prefetch instructions are required to achieve peak bandwidth.
[Figure: Barcelona roofline adding a "w/out SW prefetch" bandwidth ceiling]
Slide 37: Roofline Model: communication ceilings
- Opterons are NUMA.
- As such, memory traffic must be correctly balanced among the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.
[Figure: Barcelona roofline adding a "w/out NUMA" bandwidth ceiling]
Slide 38: Roofline Model: computation + communication ceilings
- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Figure: Barcelona roofline with all ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; Stream Bandwidth, w/out SW prefetch, w/out NUMA]
Slide 39: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / Compulsory Misses

[Figure: Barcelona roofline with a vertical "only compulsory miss traffic" wall]
Slide 40: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: Barcelona roofline adding a "+write allocation traffic" wall]
Slide 41: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / (Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline adding a "+capacity miss traffic" wall]
Slide 42: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline adding a "+conflict miss traffic" wall]
Slide 43: Optimization Categorization
- Maximizing (attained) In-core Performance
- Minimizing (total) Memory Traffic
- Maximizing (attained) Memory Bandwidth
Slides 44-48: Optimization Categorization (the three columns are filled in one slide at a time)

    Maximizing In-core Performance | Maximizing Memory Bandwidth | Minimizing Memory Traffic
    -------------------------------+-----------------------------+--------------------------
    Exploit in-core parallelism    | Exploit NUMA                | Eliminate:
      (ILP, DLP, etc.)             | Hide memory latency         |   capacity misses
    Good (enough)                  | Satisfy Little's Law        |   conflict misses
      floating-point balance       |                             |   compulsory misses
                                   |                             |   write allocate behavior
Slides 49-52: Roofline Model: locality walls
- Optimizations remove these walls and ceilings, which act to constrain performance.
[Figure sequence: starting from the Barcelona roofline with every ceiling and wall, successive slides remove the conflict, capacity, and write-allocation walls, then the ILP/SIMD/mul-add ceilings, then the SW prefetch and NUMA bandwidth ceilings, leaving only peak DP, Stream Bandwidth, and compulsory miss traffic]
Slide 53: Optimization Categorization
- Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance.
- Maximizing Memory Bandwidth: exploit NUMA; hide memory latency; satisfy Little's Law.
- Minimizing Memory Traffic: eliminate capacity misses, conflict misses, compulsory misses, and write allocate behavior.
- Each optimization has a large parameter space. What are the optimal parameters?
Slide 54: Auto-tuning?
- Provides performance portability across the existing breadth and evolution of microprocessors.
- The one-time, up-front productivity cost is amortized by the number of machines it's used on.
- Auto-tuning does not invent new optimizations.
- Auto-tuning automates the code generation and the exploration of the optimization and parameter space.
- Two components (a search harness is sketched below):
  - parameterized code generator (we wrote ours in Perl)
  - auto-tuning exploration benchmark (combination of heuristics and exhaustive search)
- Can be extended with ISA-specific optimizations (e.g. DMA, SIMD).
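A minimal C sketch of the exploration-benchmark half; the generator's output is modeled as an array of kernel function pointers, pick_best and kernel_fn are hypothetical names, and a real tuner would run each variant many times on representative data:

    #include <time.h>

    typedef void (*kernel_fn)(void);

    /* Time each generated variant once and return the index of
     * the fastest (exhaustive-search sketch). */
    int pick_best(kernel_fn *variants, int n)
    {
        int best = 0;
        double best_t = 1e300;
        for (int i = 0; i < n; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            variants[i]();                      /* run candidate kernel */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dt = (t1.tv_sec - t0.tv_sec)
                      + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            if (dt < best_t) { best_t = dt; best = i; }
        }
        return best;
    }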
Slide 55: Multicore SMPs of Interest
- (used throughout the rest of the talk)
Slide 56: Multicore SMPs Used
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 57: Multicore SMPs Used (conventional cache-based memory hierarchy)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 58: Multicore SMPs Used (local store-based memory hierarchy)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 59: Multicore SMPs Used (CMT = Chip-MultiThreading)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 60: Multicore SMPs Used (threads)
- AMD Opteron 2356 (Barcelona): 8 threads
- Intel Xeon E5345 (Clovertown): 8 threads
- IBM QS20 Cell Blade: 16 threads (SPEs only)
- Sun T2+ T5140 (Victoria Falls): 128 threads
Slide 61: Multicore SMPs Used (peak double precision flops)
- AMD Opteron 2356 (Barcelona): 74 GFlop/s
- Intel Xeon E5345 (Clovertown): 75 GFlop/s
- IBM QS20 Cell Blade: 29 GFlop/s (SPEs only)
- Sun T2+ T5140 (Victoria Falls): 19 GFlop/s
Slide 62: Multicore SMPs Used (total DRAM bandwidth)
- AMD Opteron 2356 (Barcelona): 21 GB/s
- Intel Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
- IBM QS20 Cell Blade: 51 GB/s (SPEs only)
- Sun T2+ T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
Slide 63: Multicore SMPs Used (Non-Uniform Memory Access - NUMA)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade (SPEs only)
- Sun T2+ T5140 (Victoria Falls)
Slide 64: Roofline Model for these multicore SMPs
- Note, the multithreaded Niagara is limited by the instruction mix rather than a lack of expressed in-core parallelism.
- Clearly, some architectures are more dependent on bandwidth optimizations while others depend more on in-core optimizations.
Slide 65: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Slide 66: Sparse Matrix-Vector Multiplication
- What's a Sparse Matrix?
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant meta data to reconstruct the matrix structure
- What's SpMV?
  - Evaluate y = Ax
  - A is a sparse matrix; x and y are dense vectors
- Challenges:
  - Very low arithmetic intensity (often < 0.166 flops/byte)
  - Difficult to exploit ILP (bad for pipelined or superscalar)
  - Difficult to exploit DLP (bad for SIMD)
Slide 67: The Dataset (matrices)
- Unlike dense BLAS, performance is dictated by sparsity.
- Suite of 14 matrices, all bigger than the caches of our SMPs.
- We'll also include a median performance number.
- Dense: 2K x 2K dense matrix stored in sparse format.
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology.
- Poorly structured hodgepodge: FEM/Accelerator, Circuit, webbase.
- Extreme aspect ratio (linear programming): LP.
Slide 68: SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices.
- Scalability isn't great.
- Is this performance good?
[Figure: bars per machine; legend: Naïve Pthreads, Naïve]
Slide 69: NUMA for SpMV
- On NUMA architectures, all large arrays should be partitioned either
  - explicitly (multiple malloc()s + affinity), or
  - implicitly (parallelize initialization and rely on first touch).
- You cannot partition on granularities less than the page size:
  - 512 elements on x86
  - 2M elements on Niagara
- For SpMV, partition the matrix and perform multiple malloc()s.
- Pin submatrices so they are co-located with the cores tasked to process them.
Slide 70: Prefetch for SpMV
- SW prefetch injects more MLP into the memory subsystem.
- Can try to prefetch the
  - values
  - indices
  - source vector
  - or any combination thereof.
- In general, one should only insert one prefetch per cache line (works best on unrolled code):

    for (all rows) {
      y0 = 0.0; y1 = 0.0;
      y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        PREFETCH(V + i + PFDistance);
        y0 += V[i  ] * X[C[i]];
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r  ] = y0; y[r+1] = y1;
      y[r+2] = y2; y[r+3] = y3;
    }
Slide 71: SpMV Performance (NUMA and Software Prefetching)
- NUMA-aware allocation is essential on memory-bound NUMA SMPs.
- Explicit software prefetching can boost bandwidth and change cache replacement policies.
- Cell PPEs are likely latency-limited.
- (used exhaustive search for the prefetch distance)
Slide 72: ILP/DLP vs Bandwidth
- In the multicore era, which is the bigger issue?
  - a lack of ILP/DLP (a major advantage of BCSR), or
  - insufficient memory bandwidth per core?
- On many architectures, when running low arithmetic intensity kernels there is so little available memory bandwidth (per core) that you won't notice a complete lack of ILP.
- Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP.
- Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint.
Slide 73: SpMV Performance (Matrix Compression)
- After maximizing memory bandwidth, the only hope is to minimize memory traffic.
- Exploit:
  - register blocking
  - other formats
  - smaller indices
- Use a traffic minimization heuristic rather than search.
- The benefit is clearly matrix-dependent.
- Register blocking enables efficient software prefetching (one per cache line).
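A sketch of the "smaller indices" idea: within a cache block whose column span fits in 16 bits, the usual 32-bit column indices can be stored as 16-bit offsets from the block's first column, halving index traffic. The names below are illustrative, not the tuner's:

    #include <stdint.h>
    #include <stddef.h>

    /* CSR-like row kernel with 16-bit column offsets relative to
     * col_base; valid when the block spans < 65536 columns. */
    double spmv_row_u16(const double *V, const uint16_t *C16,
                        size_t nnz, const double *x, size_t col_base)
    {
        double y0 = 0.0;
        for (size_t k = 0; k < nnz; k++)
            y0 += V[k] * x[col_base + C16[k]];
        return y0;
    }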
Slide 74: Cache blocking for SpMV
- Cache-blocking sparse matrices is very different from cache-blocking dense matrices.
- Rather than changing loop bounds, store entire submatrices contiguously.
- The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e. touch the same number of unique source vector cache lines.
- TLB blocking is a similar concept, but instead of 8-byte granularities it uses 4KB granularities.
Slide 75: Auto-tuned SpMV Performance (cache and TLB blocking)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Why do some optimizations work better on some architectures?
[Figure: stacked bars; legend: Cache/LS/TLB Blocking, Matrix Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve]
Slide 76: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Included SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
[Figure: stacked bars; same legend as slide 75]
Slide 77: Auto-tuned SpMV Performance (max speedup)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Included SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- Maximum speedups over the naïve implementation: 2.7x, 4.0x, 2.9x, and 35x across the four machines.
[Figure: stacked bars; same legend as slide 75]
Slide 78: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Included SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- Auto-tuning resulted in better performance, but did it result in good performance?
[Figure: stacked bars; same legend as slide 75]
Slide 79: Roofline model for SpMV
- Double precision roofline models.
- In-core optimizations 1..i.
- DRAM optimizations 1..j.
- FMA is inherent in SpMV (place at the bottom).
[Figure: rooflines for the four machines, attainable Gflop/s (1-128, log scale) vs. flop:DRAM byte ratio (1/16-8). In-core ceilings include peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, and the 25%/12% FP instruction-mix ceilings on Victoria Falls; bandwidth ceilings include w/out SW prefetch, w/out NUMA, bank conflicts, and "dataset fits in snoop filter" on Clovertown]
Slide 80: Roofline model for SpMV (overlay arithmetic intensity)
- Two unit-stride streams.
- Inherent FMA.
- No ILP.
- No DLP.
- FP is 12-25% of the instruction mix.
- Naïve compulsory flop:byte < 0.166.
- No naïve SPE implementation.
[Figure: the four rooflines of slide 79 with the naïve arithmetic intensity overlaid]
Slide 81: Roofline model for SpMV (out-of-the-box parallel)
- Two unit-stride streams.
- Inherent FMA.
- No ILP.
- No DLP.
- FP is 12-25% of the instruction mix.
- Naïve compulsory flop:byte < 0.166.
- For simplicity: dense matrix in sparse format.
- No naïve SPE implementation.
[Figure: the four rooflines with out-of-the-box parallel performance plotted]
Slide 82: Roofline model for SpMV (NUMA + SW prefetch)
- compulsory flop:byte = 0.166
- utilize all memory channels
[Figure: the four rooflines; performance rises toward the bandwidth ceilings]
Slide 83: Roofline model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves the ILP, DLP, flop:byte ratio, and FP fraction of instructions.
[Figure: the four rooflines with compressed-matrix performance plotted]
Slide 84: Roofline model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves the ILP, DLP, flop:byte ratio, and FP fraction of instructions.
- Performance is bandwidth limited.
[Figure: the four rooflines; the tuned points sit on the bandwidth diagonals]
Slide 85: SpMV Performance (summary)
- Median SpMV performance.
- (aside: unlike LBMHD, SSE was unnecessary to achieve performance)
- Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
- Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.
Slide 86: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
- Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008.
- Best Paper, Application Track.
Slide 87: LBMHD
- Plasma turbulence simulation via the Lattice Boltzmann Method.
- Two distributions:
  - momentum distribution (27 scalar components)
  - magnetic distribution (15 vector components)
- Three macroscopic quantities:
  - Density
  - Momentum (vector)
  - Magnetic Field (vector)
- Arithmetic intensity:
  - Must read 73 doubles and update 79 doubles per lattice update (1216 bytes).
  - Requires about 1300 floating point operations per lattice update.
  - Just over 1.0 flops/byte (ideal).
  - Cache capacity requirements are independent of problem size.
- Two problem sizes, with periodic boundary conditions:
  - 64^3 (0.3 GB)
  - 128^3 (2.5 GB)
Slide 88: LBMHD Performance (reference implementation)
- Generally, scalability looks good.
- Scalability is good, but is performance good?
[Figure: Naïve+NUMA bars; collision() only]
Slide 89: LBMHD Performance (lattice-aware array padding)
- LBMHD touches >150 arrays.
- Most caches have limited associativity.
- Conflict misses are likely.
- Apply a heuristic to pad arrays.
[Figure: bars adding Padding on top of Naïve+NUMA]
Slide 90: Vectorization
- Two phases within a lattice method's collision() operator:
  - reconstruction of macroscopic variables
  - updating discretized velocities
- Normally this is done one point at a time.
- Change to do a vector's worth at a time (loop interchange + tuning), as sketched below.
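A minimal sketch of the restructuring; the helpers reconstruct_macroscopics() and update_velocities() are hypothetical stand-ins for the two collision() phases, VL is the tuned vector length, and npoints is assumed divisible by VL:

    #define VL 128  /* tuned vector length */

    extern double reconstruct_macroscopics(long p);      /* phase 1 (hypothetical) */
    extern void   update_velocities(long p, double rho); /* phase 2 (hypothetical) */

    void collision_vectorized(long npoints)
    {
        double rho[VL];
        for (long p = 0; p < npoints; p += VL) {
            /* phase 1 for a vector's worth of lattice points */
            for (long v = 0; v < VL; v++)
                rho[v] = reconstruct_macroscopics(p + v);
            /* phase 2 for the same points; temporaries stay in cache
             * and each array is accessed in long unit-stride runs */
            for (long v = 0; v < VL; v++)
                update_velocities(p + v, rho[v]);
        }
    }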
Slide 91: LBMHD Performance (vectorization)
- Restructure loops to attain good TLB page locality and streaming accesses.
[Figure: bars; legend: Vectorization, Padding, Naïve+NUMA; collision() only]
Slide 92: LBMHD Performance (architecture specific optimizations)
- Add unrolling and reordering of the inner loop.
- Additionally, exploit SIMD where the compiler doesn't.
- Include a SPE/Local Store optimized version.
[Figure: bars; legend: small pages, Explicit SIMDization, SW Prefetching, Unrolling, Vectorization, Padding, Naïve+NUMA; collision() only]
Slide 93: LBMHD Performance (architecture specific optimizations)
- Add unrolling and reordering of the inner loop.
- Additionally, exploit SIMD where the compiler doesn't.
- Include a SPE/Local Store optimized version.
- Speedups over the reference: 1.6x, 4x, 3x, and 130x across the four machines.
[Figure: bars; same legend as slide 92; collision() only]
Slide 94: Roofline model for LBMHD
- Far more adds than multiplies (imbalance).
- Huge data sets.
[Figure: rooflines for the four machines, attainable Gflop/s vs. flop:DRAM byte ratio, with the same families of in-core and bandwidth ceilings as slide 79]
Slide 95: Roofline model for LBMHD (overlay arithmetic intensity)
- Far more adds than multiplies (imbalance).
- Essentially random access to memory.
- Flop:byte ratio ≈ 0.7.
- NUMA allocation/access.
- Little ILP.
- No DLP.
- High conflict misses.
- No naïve SPE implementation.
[Figure: the four rooflines with the naïve LBMHD arithmetic intensity overlaid]
Slide 96: Roofline model for LBMHD (out-of-the-box parallel performance)
- Far more adds than multiplies (imbalance).
- Essentially random access to memory.
- Flop:byte ratio ≈ 0.7.
- NUMA allocation/access.
- Little ILP.
- No DLP.
- High conflict misses.
- Peak VF performance with 64 threads (out of 128) due to high conflict misses.
- No naïve SPE implementation.
[Figure: the four rooflines with out-of-the-box parallel performance plotted]
Slide 97: Roofline model for LBMHD (Padding, Vectorization, Unrolling, Reordering, ...)
- Vectorize the code to eliminate TLB capacity misses.
- Ensures unit-stride access (bottom bandwidth ceiling).
- Tune for the optimal VL.
- Clovertown is pinned to the lower BW ceiling.
[Figure: the four rooflines with tuned performance approaching the bandwidth ceilings]
Slide 98: Roofline model for LBMHD (SIMDization + cache bypass)
- Make SIMDization explicit.
- Technically, this swaps the ILP and SIMD ceilings.
- Use the cache bypass instruction movntpd.
- Increases the flop:byte ratio to 1.0 on x86/Cell.
[Figure: the four rooflines with the w/out ILP and w/out SIMD ceilings swapped]
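A minimal sketch of the cache-bypass idea with SSE2 intrinsics; the intrinsic _mm_stream_pd compiles to movntpd, and the function name plus the alignment/even-n assumptions are illustrative:

    #include <emmintrin.h>

    /* Non-temporal ("cache bypass") stores: the result stream is
     * written straight to DRAM, avoiding the write-allocate read
     * and raising the kernel's flop:byte ratio.
     * Assumes 16-byte-aligned dst and even n. */
    void stream_copy(double *dst, const double *src, long n)
    {
        for (long i = 0; i < n; i += 2)
            _mm_stream_pd(&dst[i], _mm_load_pd(&src[i]));
        _mm_sfence();  /* make the NT stores globally visible */
    }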