Title: Auto-tuning Memory Intensive Kernels for Multicore

Slide 1: Auto-tuning Memory Intensive Kernels for Multicore
- Sam Williams
- SWWilliams@lbl.gov
Slide 2: Outline
- Challenges arising from Optimizing Single Thread Performance
- New Challenges Arising when Optimizing Multicore SMP Performance
- Performance Modeling and Little's Law
- Multicore SMPs of Interest
- Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
- Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
- Summary
Slide 3: Challenges arising from Optimizing Single Thread Performance
Slide 4: Instruction-Level Parallelism
- On modern pipelined architectures, operations (like floating-point addition) have a latency of 4-6 cycles (until the result is ready).
- However, independent adds can be pipelined one after another.
- Although this increases the peak flop rate, one can only achieve peak flops on the condition that on any given cycle the program has >4 independent adds ready to execute.
- Failing to do so will result in a >4x drop in performance.
- The problem is exacerbated by superscalar or VLIW architectures like POWER or Itanium.
- One must often reorganize kernels to express more instruction-level parallelism.
Slide 5: ILP Example (1x1 BCSR)
- Consider the core of SpMV:

    for (all rows) {
      y0 = 0.0;
      for (all tiles in this row) {
        y0 += V[i] * X[C[i]];
      }
      y[r] = y0;
    }

- No ILP in the inner loop
- OOO can't accelerate serial FMAs
[Figure: 1x1 register block: the FMAs form one serial dependence chain]
Slide 6: ILP Example (1x4 BCSR)
- What about 1x4 BCSR?

    for (all rows) {
      y0 = 0.0;
      for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]  ];
        y0 += V[i+1] * X[C[i]+1];
        y0 += V[i+2] * X[C[i]+2];
        y0 += V[i+3] * X[C[i]+3];
      }
      y[r] = y0;
    }

- Still no ILP in the inner loop
- FMAs are still dependent on each other
[Figure: 1x4 register block: still one serial FMA chain]
Slide 7: ILP Example (4x1 BCSR)
- What about 4x1 BCSR?

    for (all rows) {
      y0 = 0.0; y1 = 0.0;
      y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        y0 += V[i  ] * X[C[i]];
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r  ] = y0; y[r+1] = y1;
      y[r+2] = y2; y[r+3] = y3;
    }

- Updating 4 different rows
- The 4 FMAs are independent
- Thus they can be pipelined.
[Figure: 4x1 register block: four independent FMA chains that can be pipelined]
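For concreteness, here is a minimal, self-contained C sketch of the 4x1 BCSR idea; the function and array names (bcsr_spmv_4x1, row_start) and the exact block layout are illustrative assumptions, not the original tuner's code:

    #include <stddef.h>

    /* Hypothetical 4x1 BCSR SpMV: each block holds 4 values from 4
     * consecutive rows sharing one column index.  The four
     * accumulators are independent, so the FMAs can be pipelined. */
    void bcsr_spmv_4x1(size_t n_block_rows,
                       const size_t *row_start, /* block-row pointers   */
                       const size_t *C,         /* column index / block */
                       const double *V,         /* 4 values / block     */
                       const double *x, double *y)
    {
        for (size_t br = 0; br < n_block_rows; br++) {
            double y0 = 0.0, y1 = 0.0, y2 = 0.0, y3 = 0.0;
            for (size_t b = row_start[br]; b < row_start[br + 1]; b++) {
                double xv = x[C[b]];        /* one source-vector load */
                y0 += V[4*b + 0] * xv;      /* four independent MACs  */
                y1 += V[4*b + 1] * xv;
                y2 += V[4*b + 2] * xv;
                y3 += V[4*b + 3] * xv;
            }
            y[4*br + 0] = y0; y[4*br + 1] = y1;
            y[4*br + 2] = y2; y[4*br + 3] = y3;
        }
    }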
Slide 8: Data-Level Parallelism
- DLP = apply the same operation to multiple independent operands.
- Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance. (SSE, Double Hummer, AltiVec, Cell, GPUs, etc.)
- Typically these instructions operate on four single precision (or two double precision) numbers at a time.
- However, some are wider: GPUs (32), Larrabee (16), and AVX (8).
- Failing to use these instructions may cause a 2-32x drop in performance.
- Unfortunately, most compilers utterly fail to generate these instructions.
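As a hedged illustration of writing such instructions by hand, a minimal SSE2 intrinsics sketch that adds two double-precision arrays two elements at a time; the function name and the assumptions (n even, 16-byte-aligned pointers) are illustrative:

    #include <emmintrin.h>  /* SSE2 */

    /* c[i] = a[i] + b[i], two doubles per instruction.
     * Assumes n is even and all pointers are 16-byte aligned. */
    void vadd_sse2(const double *a, const double *b, double *c, int n)
    {
        for (int i = 0; i < n; i += 2) {
            __m128d va = _mm_load_pd(&a[i]);
            __m128d vb = _mm_load_pd(&b[i]);
            _mm_store_pd(&c[i], _mm_add_pd(va, vb));
        }
    }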
Slide 9: Memory-Level Parallelism (1)
- Although caches may filter many memory requests, in HPC many memory references will still go all the way to DRAM.
- Memory latency (as measured in core cycles) grew by an order of magnitude in the 90s.
- Today, the latency of a memory operation can exceed 200 cycles (1 double every 80ns is unacceptably slow).
- Like ILP, we wish to pipeline requests to DRAM.
- Several solutions exist today:
  - HW stream prefetchers
  - HW multithreading (e.g. hyperthreading)
  - SW line prefetch
  - DMA
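As a sketch of the "SW line prefetch" option using GCC's __builtin_prefetch; the distance PF_DIST is a hypothetical tuning parameter, and prefetching slightly past the end of the array is harmless since prefetches do not fault:

    /* Stream through a[], prefetching well ahead of use.
     * PF_DIST (in elements) must be tuned per machine. */
    #define PF_DIST 64

    double sum_with_prefetch(const double *a, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 0 /* low temporal locality */);
            s += a[i];
        }
        return s;
    }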
Slide 10: Memory-Level Parallelism (2)
- HW stream prefetchers are by far the easiest to implement and exploit.
- They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer.
- To effectively exploit a HW prefetcher, ensure your array references access 100s of consecutive addresses,
- e.g. read A[i]...A[i+255] without any jumps or discontinuities.
- This constraint limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed
  - A[(j+0)*N+i]...A[(j+0)*N+i+B], jump
  - A[(j+1)*N+i]...A[(j+1)*N+i+B], jump
  - A[(j+2)*N+i]...A[(j+2)*N+i+B], jump
  - ...
Slide 11: Branch Misprediction
- A mispredicted branch can stall subsequent instructions by 10 cycles.
- Select a loop structure that maximizes the loop length (keeps mispredicted branches per instruction to a minimum).
- Some architectures support predication, either in hardware or software, to eliminate branches (transforms control dependencies into data dependencies).
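A tiny sketch of software predication in C; compilers typically (though it is not guaranteed) lower the ternary form to a conditional move, removing the branch entirely:

    /* Branchy clamp: each iteration risks a misprediction. */
    int clamp_branchy(int x) { if (x < 0) x = 0; return x; }

    /* Predicated clamp: the control dependency becomes a data
     * dependency, which x86 compilers usually emit as cmov. */
    int clamp_predicated(int x) { return (x < 0) ? 0 : x; }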
Slide 12: Cache Subtleties
- Set associative caches have a limited number of sets (S) and ways (W), the product of which is the capacity (in cache lines).
- As seen in HW1, it can be beneficial to reorganize kernels to reduce the working set size and eliminate capacity misses.
- Conflict misses can severely impair performance and be very challenging to identify and eliminate.
- A given address may only be placed in W different locations in the cache.
- Poor access patterns or (roughly) power-of-two problem sizes can be especially bad: too many addresses are mapped to the same set, so not all of them can be kept in the cache and some will have to be evicted.
- Padding arrays (problem sizes) or skewing the access pattern can eliminate conflict misses.
Slide 13: Array Padding Example
- Padding changes the data layout.
- Consider a large matrix with a power-of-two number of rows:

    double A[N][M]; // column major with M = 2^k

- A[i][j] and A[i+1][j] will likely be mapped to the same set.
- We can pad each column with a couple extra rows:

    double A[N][M+pad];

- Such techniques are applicable in many other domains (stencils, lattice-Boltzmann methods, etc.)
Slide 14: New Challenges Arising when Optimizing Multicore SMP Performance
Slide 15: What are SMPs?
- SMP = shared memory parallel.
- Multiple chips (typically < 32 threads) can address any location in a large shared memory through a network or bus.
- Caches are almost universally coherent.
- You can still run MPI on an SMP, but
  - you trade "free" (you always pay for it) cache-coherency traffic for additional memory traffic (for explicit communication), and
  - you trade user-level function calls for system calls.
- Alternately, you use a SPMD threading model (pthreads, OpenMP, UPC).
- If communication between cores or threads is significant, then threaded implementations win out.
- As the computation:communication ratio increases, MPI asymptotically approaches threaded implementations.
Slide 16: What is multicore? What are multicore SMPs?
- Today, multiple cores are integrated on the same chip.
- Almost universally this is done in an SMP fashion.
- For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs. (easy transition)
- Multiple cores can share:
  - memory controllers
  - caches
  - occasionally FPUs
- Although there was a graceful transition from multiple sockets to multiple cores from the point of view of correctness, achieving good performance can be incredibly challenging.
Slide 17: Affinity
- We may wish one pair of threads to share a cache, but be disjoint from another pair of threads.
- We can control the mapping of threads to Linux processors via #include <sched.h> and sched_set/getaffinity().
- But the mapping of Linux processors to physical cores/sockets is machine/OS dependent.
- Inspect /proc/cpuinfo or use PLPA.
Slide 18: NUMA Challenges
- Recent multicore SMPs have integrated the memory controllers on chip.
- As a result, memory access is non-uniform (NUMA).
- That is, the bandwidth to read a given address varies dramatically between cores.
- Exploit NUMA (affinity + first touch) when you malloc/init data.
- The concept is similar to data decomposition for distributed memory.
Slide 19: Implicit allocation for NUMA
- Consider an OpenMP example of implicit NUMA initialization:

    #pragma omp parallel for
    for (j = 0; j < N; j++) {
      a[j] = 1.0;
      b[j] = 2.0;
      c[j] = 0.0;
    }

- The first accesses to the array (read or write) must be parallelized. DO NOT TOUCH BETWEEN MALLOC AND INIT.
- When the for loop is parallelized, each thread initializes a range of j.
- Exploits the OS's first-touch policy.
- Relies on the assumption that OpenMP maps threads correctly.
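A self-contained sketch of the whole malloc-then-first-touch pattern (N and the array name are illustrative):

    #include <stdlib.h>
    #include <omp.h>

    int main(void)
    {
        const long N = 1L << 26;
        /* malloc reserves virtual pages only; physical pages are
         * bound to a NUMA node on first touch. */
        double *a = malloc(N * sizeof *a);

        /* Do NOT touch a[] here (e.g. no memset) -- that would
         * place every page on the master thread's node. */

        #pragma omp parallel for schedule(static)
        for (long j = 0; j < N; j++)
            a[j] = 1.0;   /* first touch: pages land near each thread */

        /* ... later parallel loops should use the same schedule so
         * each thread works on the data it placed ... */
        free(a);
        return 0;
    }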
Slide 20: New Cache Challenges
- Shared caches + SPMD programming models can exacerbate conflict misses.
- Individually, threads may produce significant cache associativity pressure based on access pattern. (power-of-two problem sizes)
- Collectively, threads may produce excessive cache associativity pressure. (power-of-two problem sizes decomposed with a power-of-two number of threads)
- This can be much harder to diagnose and correct.
- This problem arises whether using MPI or a threaded model.
Slide 21: New Memory Challenges
- The number of memory controllers and the bandwidth on multicore SMPs are growing much more slowly than the number of cores.
- Codes are becoming increasingly memory-bound, as a fraction of the cores can saturate a socket's memory bandwidth.
- Multicore has traded bit- or word-parallelism for thread-parallelism.
- However, main memory is still built from bit-parallel devices (DIMMs).
- Memory-intensive apps must be restructured to match the bit-parallel nature of DIMMs (sequential access).
Slide 22: Synchronization
- Using multiple concurrent threads can create ordering and race errors.
- Locks are one solution; one must balance granularity and frequency.
- In a SPMD programming model, barriers are often a better/simpler solution.
- Spin barriers can be orders of magnitude faster than pthread library barriers. (Rajesh Nishtala, HotPar'09)
Slide 23: Performance Modeling and Little's Law
Slide 24: System Abstraction
- Abstractly describe any system (or subsystem) as a combination of black-boxed storage, computational units, and the bandwidth between them.
- These can be hierarchically composed.
- A volume of data must be transferred from the storage component, processed, and another volume of data must be returned.
- Consider the basic parameters governing performance of the channel: Bandwidth, Latency, and Concurrency.
  - Bandwidth can be measured in GB/s, Gflop/s, MIPS, etc.
  - Latency can be measured in seconds, cycles, etc.
  - Concurrency is the volume in flight across the channel, and can be measured in bytes, cache lines, operations, instructions, etc.
Slide 25: Little's Law
- Little's Law relates concurrency, bandwidth, and latency.
- To achieve peak bandwidth, one must satisfy:

    Concurrency = Latency × Bandwidth

- For example, a memory controller with 20GB/s of bandwidth and 100ns of latency requires the CPU to express 2KB of concurrency (memory-level parallelism).
- Similarly, given an expressed concurrency, one can bound attained performance.
- That is, as more concurrency is injected, we get progressively better performance.
- Note, this assumes continual, pipelined accesses.
Slide 26: Where's the bottleneck?
- We've described bandwidths:
  - DRAM ↔ CPU
  - Cache ↔ Core
  - Register File ↔ Functional units
- But in an application, one of these may be a performance-limiting bottleneck.
- We can take any pair and compare how quickly data can be transferred to how quickly it can be processed to determine the bottleneck.
Slide 27: Arithmetic Intensity
[Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1,2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods]
- Consider the first case (DRAM-CPU).
- True Arithmetic Intensity (AI) = Total Flops / Total DRAM Bytes
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), but it remains constant for others.
- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.
Slide 28: Kernel Arithmetic Intensity and Architecture
- For a given architecture, one may calculate its flop:byte ratio.
- For a 2.3GHz Quad Core Opteron:
  - 1 SIMD add + 1 SIMD multiply per cycle per core (4 DP flops/cycle/core, i.e. 2.3 GHz × 4 cores × 4 = 36.8 GFlop/s)
  - 12.8GB/s of DRAM bandwidth
  - 36.8 / 12.8 ≈ 2.9 flops per byte
- When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring data will take longer than computing on it → memory-bound.
- When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation will take longer than data transfers → compute-bound.
Slide 29: Memory Traffic Definition
- Total bytes to/from DRAM.
- Can categorize into:
  - Compulsory misses
  - Capacity misses
  - Conflict misses
  - Write allocations
  - ...
- Oblivious of a lack of sub-cache-line spatial locality.
Slide 30: Roofline Model: Basic Concept
- Synthesize communication, computation, and locality into a single visually-intuitive performance figure using bound and bottleneck analysis:

    AttainablePerformance(i,j) = min( InCoreGFlops(i), StreamBW(j) × ArithmeticIntensity )

  where optimization i can be SIMDize, or unroll, or SW prefetch, etc.
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
- Moreover, it provides insights as to which optimizations will potentially be beneficial.
Slide 31: Roofline Model: Basic Concept
- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated.
- We will bound performance as we eliminate specific forms of in-core parallelism.
[Figure: roofline for the Opteron 2356 (Barcelona): attainable GFlop/s vs. actual flop:byte ratio, bounded by peak DP and Stream Bandwidth]
Slide 32: Roofline Model: computational ceilings
- Opterons have dedicated multipliers and adders.
- If the code is dominated by adds, then attainable performance is half of peak.
- We call these "Ceilings".
- They act like constraints on performance.
[Figure: Barcelona roofline with a "mul/add imbalance" ceiling below peak DP]
Slide 33: Roofline Model: computational ceilings
- Opterons have 128-bit datapaths.
- If instructions aren't SIMDized, attainable performance will be halved.
[Figure: Barcelona roofline adding a "w/out SIMD" ceiling]
Slide 34: Roofline Model: computational ceilings
- On Opterons, floating-point instructions have a 4-cycle latency.
- If we don't express 4-way ILP, performance will drop by as much as 4x.
[Figure: Barcelona roofline adding a "w/out ILP" ceiling]
Slide 35: Roofline Model: communication ceilings
- We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Figure: Barcelona roofline with the Stream Bandwidth diagonal]
Slide 36: Roofline Model: communication ceilings
- Explicit software prefetch instructions are required to achieve peak bandwidth.
[Figure: Barcelona roofline adding a "w/out SW prefetch" bandwidth ceiling]
Slide 37: Roofline Model: communication ceilings
- Opterons are NUMA.
- As such, memory traffic must be correctly balanced among the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.
[Figure: Barcelona roofline adding a "w/out NUMA" bandwidth ceiling]
Slide 38: Roofline Model: computation + communication ceilings
- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Figure: Barcelona roofline with all ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; Stream Bandwidth, w/out SW prefetch, w/out NUMA]
Slide 39: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / Compulsory Misses

[Figure: Barcelona roofline with a vertical "only compulsory miss traffic" wall]
Slide 40: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / (Allocations + Compulsory Misses)

[Figure: Barcelona roofline adding a "+write allocation traffic" wall]
Slide 41: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / (Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline adding a "+capacity miss traffic" wall]
Slide 42: Roofline Model: locality walls
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.

    AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)

[Figure: Barcelona roofline adding a "+conflict miss traffic" wall]
Slide 43: Optimization Categorization
- Maximizing (attained) In-core Performance
- Minimizing (total) Memory Traffic
- Maximizing (attained) Memory Bandwidth
Slides 44-48: Optimization Categorization (the three columns are filled in one slide at a time)

    Maximizing In-core Performance | Maximizing Memory Bandwidth | Minimizing Memory Traffic
    -------------------------------+-----------------------------+--------------------------
    Exploit in-core parallelism    | Exploit NUMA                | Eliminate:
      (ILP, DLP, etc.)             | Hide memory latency         |   capacity misses
    Good (enough)                  | Satisfy Little's Law        |   conflict misses
      floating-point balance       |                             |   compulsory misses
                                   |                             |   write allocate behavior
Slides 49-52: Roofline Model: locality walls
- Optimizations remove these walls and ceilings, which act to constrain performance.
[Figure sequence: starting from the Barcelona roofline with every ceiling and wall, successive slides remove the conflict, capacity, and write-allocation walls, then the ILP/SIMD/mul-add ceilings, then the SW prefetch and NUMA bandwidth ceilings, leaving only peak DP, Stream Bandwidth, and compulsory miss traffic]
Slide 53: Optimization Categorization
- Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance.
- Maximizing Memory Bandwidth: exploit NUMA; hide memory latency; satisfy Little's Law.
- Minimizing Memory Traffic: eliminate capacity misses, conflict misses, compulsory misses, and write allocate behavior.
- Each optimization has a large parameter space. What are the optimal parameters?
Slide 54: Auto-tuning?
- Provides performance portability across the existing breadth and evolution of microprocessors.
- The one-time, up-front productivity cost is amortized by the number of machines it's used on.
- Auto-tuning does not invent new optimizations.
- Auto-tuning automates the code generation and the exploration of the optimization and parameter space.
- Two components (a search harness is sketched below):
  - parameterized code generator (we wrote ours in Perl)
  - auto-tuning exploration benchmark (combination of heuristics and exhaustive search)
- Can be extended with ISA-specific optimizations (e.g. DMA, SIMD).
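A minimal C sketch of the exploration-benchmark half; the generator's output is modeled as an array of kernel function pointers, pick_best and kernel_fn are hypothetical names, and a real tuner would run each variant many times on representative data:

    #include <time.h>

    typedef void (*kernel_fn)(void);

    /* Time each generated variant once and return the index of
     * the fastest (exhaustive-search sketch). */
    int pick_best(kernel_fn *variants, int n)
    {
        int best = 0;
        double best_t = 1e300;
        for (int i = 0; i < n; i++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            variants[i]();                      /* run candidate kernel */
            clock_gettime(CLOCK_MONOTONIC, &t1);
            double dt = (t1.tv_sec - t0.tv_sec)
                      + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
            if (dt < best_t) { best_t = dt; best = i; }
        }
        return best;
    }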
Slide 55: Multicore SMPs of Interest
- (used throughout the rest of the talk)
Slide 56: Multicore SMPs Used
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 57: Multicore SMPs Used (conventional cache-based memory hierarchy)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 58: Multicore SMPs Used (local store-based memory hierarchy)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 59: Multicore SMPs Used (CMT = Chip-MultiThreading)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade
- Sun T2+ T5140 (Victoria Falls)
Slide 60: Multicore SMPs Used (threads)
- AMD Opteron 2356 (Barcelona): 8 threads
- Intel Xeon E5345 (Clovertown): 8 threads
- IBM QS20 Cell Blade: 16 threads (SPEs only)
- Sun T2+ T5140 (Victoria Falls): 128 threads
Slide 61: Multicore SMPs Used (peak double precision flops)
- AMD Opteron 2356 (Barcelona): 74 GFlop/s
- Intel Xeon E5345 (Clovertown): 75 GFlop/s
- IBM QS20 Cell Blade: 29 GFlop/s (SPEs only)
- Sun T2+ T5140 (Victoria Falls): 19 GFlop/s
Slide 62: Multicore SMPs Used (total DRAM bandwidth)
- AMD Opteron 2356 (Barcelona): 21 GB/s
- Intel Xeon E5345 (Clovertown): 21 GB/s (read), 10 GB/s (write)
- IBM QS20 Cell Blade: 51 GB/s (SPEs only)
- Sun T2+ T5140 (Victoria Falls): 42 GB/s (read), 21 GB/s (write)
Slide 63: Multicore SMPs Used (Non-Uniform Memory Access - NUMA)
- AMD Opteron 2356 (Barcelona)
- Intel Xeon E5345 (Clovertown)
- IBM QS20 Cell Blade (SPEs only)
- Sun T2+ T5140 (Victoria Falls)
Slide 64: Roofline Model for these multicore SMPs
- Note, the multithreaded Niagara is limited by the instruction mix rather than a lack of expressed in-core parallelism.
- Clearly, some architectures are more dependent on bandwidth optimizations while others depend more on in-core optimizations.
Slide 65: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Slide 66: Sparse Matrix-Vector Multiplication
- What's a Sparse Matrix?
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant meta data to reconstruct the matrix structure
- What's SpMV?
  - Evaluate y = Ax
  - A is a sparse matrix; x and y are dense vectors
- Challenges:
  - Very low arithmetic intensity (often < 0.166 flops/byte)
  - Difficult to exploit ILP (bad for pipelined or superscalar)
  - Difficult to exploit DLP (bad for SIMD)
Slide 67: The Dataset (matrices)
- Unlike dense BLAS, performance is dictated by sparsity.
- Suite of 14 matrices, all bigger than the caches of our SMPs.
- We'll also include a median performance number.
- Dense: 2K x 2K dense matrix stored in sparse format.
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology.
- Poorly structured hodgepodge: FEM/Accelerator, Circuit, webbase.
- Extreme aspect ratio (linear programming): LP.
Slide 68: SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices.
- Scalability isn't great.
- Is this performance good?
[Figure: bars per machine; legend: Naïve Pthreads, Naïve]
Slide 69: NUMA for SpMV
- On NUMA architectures, all large arrays should be partitioned either
  - explicitly (multiple malloc()s + affinity), or
  - implicitly (parallelize initialization and rely on first touch).
- You cannot partition on granularities less than the page size:
  - 512 elements on x86
  - 2M elements on Niagara
- For SpMV, partition the matrix and perform multiple malloc()s.
- Pin submatrices so they are co-located with the cores tasked to process them.
Slide 70: Prefetch for SpMV
- SW prefetch injects more MLP into the memory subsystem.
- Can try to prefetch the
  - values
  - indices
  - source vector
  - or any combination thereof.
- In general, one should only insert one prefetch per cache line (works best on unrolled code):

    for (all rows) {
      y0 = 0.0; y1 = 0.0;
      y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        PREFETCH(V + i + PFDistance);
        y0 += V[i  ] * X[C[i]];
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r  ] = y0; y[r+1] = y1;
      y[r+2] = y2; y[r+3] = y3;
    }
Slide 71: SpMV Performance (NUMA and Software Prefetching)
- NUMA-aware allocation is essential on memory-bound NUMA SMPs.
- Explicit software prefetching can boost bandwidth and change cache replacement policies.
- Cell PPEs are likely latency-limited.
- (used exhaustive search for the prefetch distance)
Slide 72: ILP/DLP vs Bandwidth
- In the multicore era, which is the bigger issue?
  - a lack of ILP/DLP (a major advantage of BCSR), or
  - insufficient memory bandwidth per core?
- On many architectures, when running low arithmetic intensity kernels there is so little available memory bandwidth (per core) that you won't notice a complete lack of ILP.
- Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP.
- Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint.
Slide 73: SpMV Performance (Matrix Compression)
- After maximizing memory bandwidth, the only hope is to minimize memory traffic.
- Exploit:
  - register blocking
  - other formats
  - smaller indices
- Use a traffic minimization heuristic rather than search.
- The benefit is clearly matrix-dependent.
- Register blocking enables efficient software prefetching (one per cache line).
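A sketch of the "smaller indices" idea: within a cache block whose column span fits in 16 bits, the usual 32-bit column indices can be stored as 16-bit offsets from the block's first column, halving index traffic. The names below are illustrative, not the tuner's:

    #include <stdint.h>
    #include <stddef.h>

    /* CSR-like row kernel with 16-bit column offsets relative to
     * col_base; valid when the block spans < 65536 columns. */
    double spmv_row_u16(const double *V, const uint16_t *C16,
                        size_t nnz, const double *x, size_t col_base)
    {
        double y0 = 0.0;
        for (size_t k = 0; k < nnz; k++)
            y0 += V[k] * x[col_base + C16[k]];
        return y0;
    }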
Slide 74: Cache blocking for SpMV
- Cache-blocking sparse matrices is very different from cache-blocking dense matrices.
- Rather than changing loop bounds, store entire submatrices contiguously.
- The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e. touch the same number of unique source vector cache lines.
- TLB blocking is a similar concept, but instead of 8-byte granularities it uses 4KB granularities.
Slide 75: Auto-tuned SpMV Performance (cache and TLB blocking)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Why do some optimizations work better on some architectures?
[Figure: stacked bars; legend: Cache/LS/TLB Blocking, Matrix Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve]
Slide 76: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Included SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
[Figure: stacked bars; same legend as slide 75]
Slide 77: Auto-tuned SpMV Performance (max speedup)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Included SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- Maximum speedups over the naïve implementation: 2.7x, 4.0x, 2.9x, and 35x across the four machines.
[Figure: stacked bars; same legend as slide 75]
Slide 78: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Included SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- Auto-tuning resulted in better performance, but did it result in good performance?
[Figure: stacked bars; same legend as slide 75]
Slide 79: Roofline model for SpMV
- Double precision roofline models.
- In-core optimizations 1..i.
- DRAM optimizations 1..j.
- FMA is inherent in SpMV (place at the bottom).
[Figure: rooflines for the four machines, attainable Gflop/s (1-128, log scale) vs. flop:DRAM byte ratio (1/16-8). In-core ceilings include peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, and the 25%/12% FP instruction-mix ceilings on Victoria Falls; bandwidth ceilings include w/out SW prefetch, w/out NUMA, bank conflicts, and "dataset fits in snoop filter" on Clovertown]
Slide 80: Roofline model for SpMV (overlay arithmetic intensity)
- Two unit-stride streams.
- Inherent FMA.
- No ILP.
- No DLP.
- FP is 12-25% of the instruction mix.
- Naïve compulsory flop:byte < 0.166.
- No naïve SPE implementation.
[Figure: the four rooflines of slide 79 with the naïve arithmetic intensity overlaid]
Slide 81: Roofline model for SpMV (out-of-the-box parallel)
- Two unit-stride streams.
- Inherent FMA.
- No ILP.
- No DLP.
- FP is 12-25% of the instruction mix.
- Naïve compulsory flop:byte < 0.166.
- For simplicity: dense matrix in sparse format.
- No naïve SPE implementation.
[Figure: the four rooflines with out-of-the-box parallel performance plotted]
Slide 82: Roofline model for SpMV (NUMA + SW prefetch)
- compulsory flop:byte = 0.166
- utilize all memory channels
[Figure: the four rooflines; performance rises toward the bandwidth ceilings]
Slide 83: Roofline model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves the ILP, DLP, flop:byte ratio, and FP fraction of instructions.
[Figure: the four rooflines with compressed-matrix performance plotted]
Slide 84: Roofline model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves the ILP, DLP, flop:byte ratio, and FP fraction of instructions.
- Performance is bandwidth limited.
[Figure: the four rooflines; the tuned points sit on the bandwidth diagonals]
Slide 85: SpMV Performance (summary)
- Median SpMV performance.
- (aside: unlike LBMHD, SSE was unnecessary to achieve performance)
- Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
- Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.
Slide 86: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
- Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008.
- Best Paper, Application Track.
Slide 87: LBMHD
- Plasma turbulence simulation via the Lattice Boltzmann Method.
- Two distributions:
  - momentum distribution (27 scalar components)
  - magnetic distribution (15 vector components)
- Three macroscopic quantities:
  - Density
  - Momentum (vector)
  - Magnetic Field (vector)
- Arithmetic intensity:
  - Must read 73 doubles and update 79 doubles per lattice update (1216 bytes).
  - Requires about 1300 floating point operations per lattice update.
  - Just over 1.0 flops/byte (ideal).
  - Cache capacity requirements are independent of problem size.
- Two problem sizes, with periodic boundary conditions:
  - 64^3 (0.3 GB)
  - 128^3 (2.5 GB)
Slide 88: LBMHD Performance (reference implementation)
- Generally, scalability looks good.
- Scalability is good, but is performance good?
[Figure: Naïve+NUMA bars; collision() only]
Slide 89: LBMHD Performance (lattice-aware array padding)
- LBMHD touches >150 arrays.
- Most caches have limited associativity.
- Conflict misses are likely.
- Apply a heuristic to pad arrays.
[Figure: bars adding Padding on top of Naïve+NUMA]
Slide 90: Vectorization
- Two phases within a lattice method's collision() operator:
  - reconstruction of macroscopic variables
  - updating discretized velocities
- Normally this is done one point at a time.
- Change to do a vector's worth at a time (loop interchange + tuning), as sketched below.
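A minimal sketch of the restructuring; the helpers reconstruct_macroscopics() and update_velocities() are hypothetical stand-ins for the two collision() phases, VL is the tuned vector length, and npoints is assumed divisible by VL:

    #define VL 128  /* tuned vector length */

    extern double reconstruct_macroscopics(long p);      /* phase 1 (hypothetical) */
    extern void   update_velocities(long p, double rho); /* phase 2 (hypothetical) */

    void collision_vectorized(long npoints)
    {
        double rho[VL];
        for (long p = 0; p < npoints; p += VL) {
            /* phase 1 for a vector's worth of lattice points */
            for (long v = 0; v < VL; v++)
                rho[v] = reconstruct_macroscopics(p + v);
            /* phase 2 for the same points; temporaries stay in cache
             * and each array is accessed in long unit-stride runs */
            for (long v = 0; v < VL; v++)
                update_velocities(p + v, rho[v]);
        }
    }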
Slide 91: LBMHD Performance (vectorization)
- Restructure loops to attain good TLB page locality and streaming accesses.
[Figure: bars; legend: Vectorization, Padding, Naïve+NUMA; collision() only]
Slide 92: LBMHD Performance (architecture specific optimizations)
- Add unrolling and reordering of the inner loop.
- Additionally, exploit SIMD where the compiler doesn't.
- Include a SPE/Local Store optimized version.
[Figure: bars; legend: small pages, Explicit SIMDization, SW Prefetching, Unrolling, Vectorization, Padding, Naïve+NUMA; collision() only]
Slide 93: LBMHD Performance (architecture specific optimizations)
- Add unrolling and reordering of the inner loop.
- Additionally, exploit SIMD where the compiler doesn't.
- Include a SPE/Local Store optimized version.
- Speedups over the reference: 1.6x, 4x, 3x, and 130x across the four machines.
[Figure: bars; same legend as slide 92; collision() only]
Slide 94: Roofline model for LBMHD
- Far more adds than multiplies (imbalance).
- Huge data sets.
[Figure: rooflines for the four machines, attainable Gflop/s vs. flop:DRAM byte ratio, with the same families of in-core and bandwidth ceilings as slide 79]
Slide 95: Roofline model for LBMHD (overlay arithmetic intensity)
- Far more adds than multiplies (imbalance).
- Essentially random access to memory.
- Flop:byte ratio ≈ 0.7.
- NUMA allocation/access.
- Little ILP.
- No DLP.
- High conflict misses.
- No naïve SPE implementation.
[Figure: the four rooflines with the naïve LBMHD arithmetic intensity overlaid]
Slide 96: Roofline model for LBMHD (out-of-the-box parallel performance)
- Far more adds than multiplies (imbalance).
- Essentially random access to memory.
- Flop:byte ratio ≈ 0.7.
- NUMA allocation/access.
- Little ILP.
- No DLP.
- High conflict misses.
- Peak VF performance with 64 threads (out of 128) due to high conflict misses.
- No naïve SPE implementation.
[Figure: the four rooflines with out-of-the-box parallel performance plotted]
Slide 97: Roofline model for LBMHD (Padding, Vectorization, Unrolling, Reordering, ...)
- Vectorize the code to eliminate TLB capacity misses.
- Ensures unit-stride access (bottom bandwidth ceiling).
- Tune for the optimal VL.
- Clovertown is pinned to the lower BW ceiling.
[Figure: the four rooflines with tuned performance approaching the bandwidth ceilings]
Slide 98: Roofline model for LBMHD (SIMDization + cache bypass)
- Make SIMDization explicit.
- Technically, this swaps the ILP and SIMD ceilings.
- Use the cache bypass instruction movntpd.
- Increases the flop:byte ratio to 1.0 on x86/Cell.
[Figure: the four rooflines with the w/out ILP and w/out SIMD ceilings swapped]
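A minimal sketch of the cache-bypass idea with SSE2 intrinsics; the intrinsic _mm_stream_pd compiles to movntpd, and the function name plus the alignment/even-n assumptions are illustrative:

    #include <emmintrin.h>

    /* Non-temporal ("cache bypass") stores: the result stream is
     * written straight to DRAM, avoiding the write-allocate read
     * and raising the kernel's flop:byte ratio.
     * Assumes 16-byte-aligned dst and even n. */
    void stream_copy(double *dst, const double *src, long n)
    {
        for (long i = 0; i < n; i += 2)
            _mm_stream_pd(&dst[i], _mm_load_pd(&src[i]));
        _mm_sfence();  /* make the NT stores globally visible */
    }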