Autotuning Memory Intensive Kernels for Multicore

1
Auto-tuning Memory Intensive Kernels for Multicore
  • Sam Williams
  • SWWilliams_at_lbl.gov

2
Outline
  • Challenges arising from Optimizing Single Thread
    Performance
  • New Challenges Arising when Optimizing Multicore
    SMP Performance
  • Performance Modeling and Little's Law
  • Multicore SMPs of Interest
  • Auto-tuning Sparse Matrix-Vector Multiplication
    (SpMV)
  • Auto-tuning Lattice-Boltzmann
    Magneto-Hydrodynamics (LBMHD)
  • Summary

3
Challenges arising from Optimizing Single Thread
Performance
4
Instruction-Level Parallelism
  • On modern pipelined architectures, operations
    (like floating-point addition) have a latency of
    4-6 cycles (until the result is ready).
  • However, independent adds can be pipelined one
    after another.
  • Although this increases the peak flop rate, one
    can only achieve peak flops on the condition that
    on any given cycle the program has >4 independent
    adds ready to execute.
  • Failing to do so will result in a >4x drop in
    performance.
  • The problem is exacerbated by superscalar or VLIW
    architectures like POWER or Itanium.
  • One must often reorganize kernels to express more
    instruction-level parallelism

5
ILP Example (1x1 BCSR)
  for(all rows){
    y0 = 0.0;
    for(all tiles in this row)
      y0 += V[i]*X[C[i]];
    y[r] = y0;
  }
  • Consider the core of SpMV
  • No ILP in the inner loop
  • OOO can't accelerate serial FMAs

[Figure: 1x1 Register Block, a serial chain of dependent FMAs]
6
ILP Example (1x4 BCSR)
  for(all rows){
    y0 = 0.0;
    for(all tiles in this row){
      y0 += V[i  ]*X[C[i]  ];
      y0 += V[i+1]*X[C[i]+1];
      y0 += V[i+2]*X[C[i]+2];
      y0 += V[i+3]*X[C[i]+3];
    }
    y[r] = y0;
  }
  • What about 1x4 BCSR ?
  • Still no ILP in the inner loop
  • FMAs are still dependent on each other

[Figure: 1x4 Register Block, still a serial chain of dependent FMAs]
7
ILP Example (4x1 BCSR)
  for(all rows){
    y0 = 0.0; y1 = 0.0;
    y2 = 0.0; y3 = 0.0;
    for(all tiles in this row){
      y0 += V[i  ]*X[C[i]];
      y1 += V[i+1]*X[C[i]];
      y2 += V[i+2]*X[C[i]];
      y3 += V[i+3]*X[C[i]];
    }
    y[r  ] = y0; y[r+1] = y1;
    y[r+2] = y2; y[r+3] = y3;
  }
  • What about 4x1 BCSR ?
  • Updating 4 different rows
  • The 4 FMAs are independent, thus they can be
    pipelined.

[Figure: 4x1 Register Block, four independent FMA chains that can be pipelined]
8
Data-level Parallelism
  • DLP = apply the same operation to multiple
    independent operands.
  • Today, rather than relying on superscalar issue,
    many architectures have adopted SIMD as an
    efficient means of boosting peak performance.
    (SSE, Double Hummer, AltiVec, Cell, GPUs, etc.)
  • Typically these instructions operate on four
    single precision (or two double precision)
    numbers at a time.
  • However, some are wider: GPUs (32), Larrabee (16),
    and AVX (8).
  • Failing to use these instructions may cause a
    2-32x drop in performance.
  • Unfortunately, most compilers utterly fail to
    generate these instructions (see the sketch
    below).





9
Memory-Level Parallelism (1)
  • Although caches may filter many memory requests,
    in HPC many memory references will still go all
    the way to DRAM.
  • Memory latency (as measured in core cycles) grew
    by an order of magnitude in the 90s
  • Today, the latency of a memory operation can
    exceed 200 cycles (1 double every 80ns is
    unacceptably slow).
  • Like ILP, we wish to pipeline requests to DRAM
  • Several solutions exist today
  • HW stream prefetchers
  • HW Multithreading (e.g. hyperthreading)
  • SW line prefetch
  • DMA

10
Memory-Level Parallelism (2)
  • HW stream prefetchers are by far the easiest to
    implement and exploit.
  • They detect a series of consecutive cache misses
    and speculate that the next addresses in the
    series will be needed. They then prefetch that
    data into the cache or a dedicated buffer.
  • To effectively exploit a HW prefetcher, ensure
    your array references access 100s of consecutive
    addresses,
  • e.g. read A[i]...A[i+255] without any jumps or
    discontinuities
  • This limits the effectiveness (shape) of the
    cache blocking you implemented in HW1, as you
    accessed
  • A[(j+0)*N+i]...A[(j+0)*N+i+B], jump
  • A[(j+1)*N+i]...A[(j+1)*N+i+B], jump
  • A[(j+2)*N+i]...A[(j+2)*N+i+B], jump

11
Branch Misprediction
  • A mispredicted branch can stall subsequent
    instructions by 10 cycles.
  • Select a loop structure that maximizes the loop
    length
  • (keeps mispredicted branches per instruction to
    a minimum)
  • Some architectures support predication either in
    hardware or software to eliminate branches
    (transforms control dependencies into data
    dependencies)

12
Cache Subtleties
  • Set associative caches have a limited number of
    sets (S) and ways (W), the product of which is
    the capacity (in cache lines).
  • As seen in HW1, it can be beneficial to
    reorganize kernels to reduce the working set size
    and eliminate capacity misses.
  • Conflict misses can severely impair performance
    and can be very challenging to identify and
    eliminate.
  • A given address may only be placed in W different
    locations in the cache.
  • Poor access patterns or (roughly) power of two
    problem sizes can be especially bad.
  • Results in too many addresses mapped to the same
    set. Not all of them can be kept in the cache
    and some will have to be evicted.
  • Padding arrays (problem sizes) or skewing access
    pattern can eliminate conflict misses.

13
Array padding Example
  • Padding changes the data layout
  • Consider a large matrix with a power of two
    number of rows
  • double A[N][M]; // column major with M = pow2
  • A[i][j] and A[i+1][j] will likely be mapped to
    the same set.
  • We can pad each column with a couple extra rows
  • double A[N][M+pad];
  • Such techniques are applicable in many other
    domains (stencils, lattice-Boltzmann methods,
    etc.)
    etc)

14
New Challenges Arising when Optimizing Multicore
SMP Performance
15
What are SMPs ?
  • SMP = shared memory parallel.
  • Multiple chips (typically < 32 threads) can
    address any location in a large shared memory
    through a network or bus
  • Caches are almost universally coherent
  • You can still run MPI on an SMP, but
  • you trade "free" cache-coherency traffic (you
    always pay for it) for additional memory traffic
    (for explicit communication)
  • you trade user-level function calls for system
    calls
  • Alternately, you use a SPMD threading model
    (pthreads, OpenMP, UPC)
  • If communication between cores or threads is
    significant, then threaded implementations win
    out.
  • As the computation:communication ratio increases,
    MPI asymptotically approaches threaded
    implementations.

16
What is multicore ? What are multicore SMPs ?
  • Today, multiple cores are integrated on the same
    chip
  • Almost universally this is done in an SMP fashion
  • For convenience, programming multicore SMPs is
    indistinguishable from programming multi-socket
    SMPs. (easy transition)
  • Multiple cores can share
  • memory controllers
  • caches
  • occasionally FPUs
  • Although there was a graceful transition from
    multiple sockets to multiple cores from the point
    of view of correctness, achieving good performance
    can be incredibly challenging.

17
Affinity
  • We may wish one pair of threads to share a cache,
    but be disjoint from another pair of threads.
  • We can control the mapping of threads to Linux
    processors via
  • #include <sched.h>  sched_setaffinity() /
    sched_getaffinity()
  • But, the mapping of Linux processors to physical
    cores/sockets is machine/OS dependent.
  • Inspect /proc/cpuinfo or use PLPA (see the sketch
    below)
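A minimal Linux sketch (not from the original slides) of pinning the calling thread to one logical processor with sched_setaffinity(); the cpu argument is a Linux processor number and must still be mapped to a physical core/socket via /proc/cpuinfo or PLPA as noted above.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to Linux processor 'cpu' (0-based). */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {  /* pid 0 = caller */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}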

18
NUMA Challenges
  • Recent multicore SMPs have integrated the memory
    controllers on chip.
  • As a result, memory access is non-uniform (NUMA)
  • That is, the bandwidth to read a given address
    varies dramatically between cores
  • Exploit NUMA (affinity + first touch) when you
    malloc/init data.
  • Concept is similar to data decomposition for
    distributed memory

19
Implicit allocation for NUMA
  • Consider an OpenMP example of implicit NUMA
    initialization
  #pragma omp parallel for
  for (j = 0; j < N; j++) {
    a[j] = 1.0;
    b[j] = 2.0;
    c[j] = 0.0;
  }
  • The first accesses to the array (read or write)
    must be parallelized. DO NOT TOUCH BETWEEN
    MALLOC AND INIT.
  • When the for loop is parallelized, each thread
    initializes a range of j
  • Exploits the OS's first touch policy.
  • Relies on the assumption that OpenMP maps threads
    correctly

20
New Cache Challenges
  • Shared caches + SPMD programming models can
    exacerbate conflict misses.
  • Individually, threads may produce significant
    cache associativity pressure based on access
    pattern. (power of 2 problem sizes)
  • Collectively, threads may produce excessive cache
    associativity pressure. (power of 2 problem sizes
    decomposed with a power of two number of threads)
  • This can be much harder to diagnose and correct
  • This problem arises whether using MPI or a
    threaded model.

21
New Memory Challenges
  • The number of memory controllers and bandwidth on
    multicore SMPs is growing much slower than the
    number of cores.
  • Codes are becoming increasingly memory-bound, as
    a fraction of the cores can saturate a socket's
    memory bandwidth
  • Multicore has traded bit- or word-level
    parallelism for thread-level parallelism.
  • However, main memory is still built from
    bit-parallel devices (DIMMs)
  • Must restructure memory-intensive apps to match
    the bit-parallel nature of DIMMs (sequential
    access)

22
Synchronization
  • Using multiple concurrent threads can create
    ordering and race errors.
  • Locks are one solution. Must balance granularity
    and frequency.
  • In an SPMD programming model, barriers are often a
    better/simpler solution.
  • Spin barriers can be orders of magnitude faster
    than pthread library barriers (Rajesh Nishtala,
    HotPar'09); a sketch follows.
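Not the implementation from the cited work: a minimal sense-reversing spin barrier sketch using C11 atomics, just to illustrate why a user-level busy-wait barrier avoids library/system-call overhead. The type and function names are hypothetical; each thread passes its own local_sense, initialized to 1.

#include <stdatomic.h>

typedef struct {
    atomic_int count;    /* threads arrived so far            */
    atomic_int sense;    /* flips each time the barrier opens */
    int nthreads;
} spin_barrier_t;

void spin_barrier_init(spin_barrier_t *b, int nthreads)
{
    atomic_init(&b->count, 0);
    atomic_init(&b->sense, 0);
    b->nthreads = nthreads;
}

void spin_barrier_wait(spin_barrier_t *b, int *local_sense)
{
    int my_sense = *local_sense;
    if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
        atomic_store(&b->count, 0);        /* last arrival resets ...    */
        atomic_store(&b->sense, my_sense); /* ... and releases everyone  */
    } else {
        while (atomic_load(&b->sense) != my_sense)
            ;                              /* spin */
    }
    *local_sense = 1 - my_sense;           /* reverse sense for next use */
}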

23
Performance Modeling and Little's Law
24
System Abstraction
  • Abstractly describe any system (or subsystem) as
    a combination of black-boxed storage,
    computational units, and the bandwidth between
    them.
  • These can be hierarchically composed.
  • A volume of data must be transferred from the
    storage component, processed, and another volume
    of data must be returned.
  • Consider the basic parameters governing
    performance of the channel: Bandwidth, Latency,
    and Concurrency.
  • Bandwidth can be measured in GB/s, Gflop/s,
    MIPS, etc.
  • Latency can be measured in seconds, cycles, etc.
  • Concurrency is the volume in flight across the
    channel, and can be measured in bytes, cache
    lines, operations, instructions, etc.

25
Little's Law
  • Little's Law relates concurrency, bandwidth, and
    latency.
  • To achieve peak bandwidth, one must satisfy
  • Concurrency = Latency × Bandwidth
  • For example, a memory controller with 20GB/s of
    bandwidth and 100ns of latency requires the CPU
    to express 2KB of concurrency
  • (memory-level parallelism)
  • Similarly, given an expressed concurrency, one
    can bound attained performance.
  • That is, as more concurrency is injected, we get
    progressively better performance.
  • Note, this assumes continual, pipelined accesses.
    (worked example below)
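Not on the original slide: a tiny C check of the 20 GB/s x 100 ns example above (the 64-byte cache line size is an assumption).

#include <stdio.h>

int main(void)
{
    double bandwidth   = 20e9;    /* bytes per second */
    double latency     = 100e-9;  /* seconds          */
    double concurrency = bandwidth * latency;  /* Little's Law: bytes in flight */
    printf("%.0f bytes in flight (~%.0f 64-byte cache lines)\n",
           concurrency, concurrency / 64.0);
    return 0;
}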

26
Where's the bottleneck?
  • We've described bandwidths
  • DRAM ↔ CPU
  • Cache ↔ Core
  • Register File ↔ Functional units
  • But in an application, one of these may be a
    performance-limiting bottleneck.
  • We can take any pair and compare how quickly data
    can be transferred to how quickly it can be
    processed to determine the bottleneck.

27
Arithmetic Intensity
[Figure: arithmetic intensity spectrum, from O(1) (SpMV, BLAS1/2, stencils (PDEs), lattice methods) through O(log N) (FFTs) to O(N) (dense linear algebra (BLAS3), particle methods)]
  • Consider the first case (DRAM-CPU)
  • True Arithmetic Intensity (AI) = Total Flops /
    Total DRAM Bytes
  • Some HPC kernels have an arithmetic intensity
    that scales with problem size (increased temporal
    locality), but it remains constant for others.
  • Arithmetic intensity is ultimately limited by
    compulsory traffic
  • Arithmetic intensity is diminished by conflict or
    capacity misses.

28
Kernel Arithmetic Intensity and Architecture
  • For a given architecture, one may calculate its
    flop:byte ratio.
  • For a 2.3GHz Quad Core Opteron,
  • 1 SIMD add + 1 SIMD multiply per cycle per core
  • 12.8GB/s of DRAM bandwidth
  • 36.8 Gflop/s / 12.8 GB/s = 2.9 flops per byte
  • When a kernel's arithmetic intensity is
    substantially less than the architecture's
    flop:byte ratio, transferring data will take
    longer than computing on it
  • ⇒ memory-bound
  • When a kernel's arithmetic intensity is
    substantially greater than the architecture's
    flop:byte ratio, computation will take longer
    than data transfers
  • ⇒ compute-bound

29
Memory Traffic Definition
  • Total bytes to/from DRAM
  • Can categorize into
  • Compulsory misses
  • Capacity misses
  • Conflict misses
  • Write allocations
  • Oblivious to any lack of sub-cache-line spatial
    locality

30
Roofline Model: Basic Concept
  • Synthesize communication, computation, and
    locality into a single visually-intuitive
    performance figure using bound and bottleneck
    analysis.
  • Attainable Gflop/s = min( peak Gflop/s with
    optimizations 1..i applied, Stream bandwidth with
    optimizations 1..j applied × arithmetic intensity )
  • where optimization i can be SIMDize, unroll, SW
    prefetch, etc.
  • Given a kernel's arithmetic intensity (based on
    DRAM traffic after being filtered by the cache),
    programmers can inspect the figure and bound
    performance.
  • Moreover, it provides insights as to which
    optimizations will potentially be beneficial.

31
Roofline Model: Basic Concept
  • Plot on log-log scale
  • Given AI, we can easily bound performance
  • But architectures are much more complicated
  • We will bound performance as we eliminate
    specific forms of in-core parallelism

[Roofline figure: Opteron 2356 (Barcelona); log-log plot of attainable GFLOP/s vs. actual flop:byte ratio, bounded by peak DP and Stream Bandwidth]
32
Roofline Model: computational ceilings
  • Opterons have dedicated multipliers and adders.
  • If the code is dominated by adds, then attainable
    performance is half of peak.
  • We call these Ceilings
  • They act like constraints on performance

[Roofline figure: ceilings: peak DP, mul/add imbalance; Stream Bandwidth]
33
Roofline Model: computational ceilings
  • Opterons have 128-bit datapaths.
  • If instructions aren't SIMDized, attainable
    performance will be halved.

[Roofline figure: ceilings: peak DP, mul/add imbalance, w/out SIMD; Stream Bandwidth]
34
Roofline Model: computational ceilings
  • On Opterons, floating-point instructions have a 4
    cycle latency.
  • If we don't express 4-way ILP, performance will
    drop by as much as 4x.

[Roofline figure: ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; Stream Bandwidth]
35
Roofline Model: communication ceilings
  • We can perform a similar exercise taking away
    parallelism from the memory subsystem

[Roofline figure: Opteron 2356 (Barcelona); peak DP; Stream Bandwidth]
36
Roofline Model: communication ceilings
  • Explicit software prefetch instructions are
    required to achieve peak bandwidth

[Roofline figure: bandwidth ceilings: Stream Bandwidth, w/out SW prefetch]
37
Roofline Model: communication ceilings
  • Opterons are NUMA
  • As such, memory traffic must be correctly
    balanced between the two sockets to achieve good
    Stream bandwidth.
  • We could continue this by examining strided or
    random memory access patterns

[Roofline figure: bandwidth ceilings: Stream Bandwidth, w/out SW prefetch, w/out NUMA]
38
Roofline Model: computation + communication ceilings
  • We may bound performance based on the combination
    of expressed in-core parallelism and attained
    bandwidth.

[Roofline figure: in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA)]
39
Roofline Model: locality walls
  • Remember, memory traffic includes more than just
    compulsory misses.
  • As such, actual arithmetic intensity may be
    substantially lower.
  • Walls are unique to the architecture-kernel
    combination

[Roofline figure: ceilings as before, plus a locality wall at AI = FLOPs / Compulsory Misses (only compulsory miss traffic)]
40
Roofline Model: locality walls
  • Remember, memory traffic includes more than just
    compulsory misses.
  • As such, actual arithmetic intensity may be
    substantially lower.
  • Walls are unique to the architecture-kernel
    combination

[Roofline figure: locality wall moves left, AI = FLOPs / (write Allocations + Compulsory Misses)]
41
Roofline Model: locality walls
  • Remember, memory traffic includes more than just
    compulsory misses.
  • As such, actual arithmetic intensity may be
    substantially lower.
  • Walls are unique to the architecture-kernel
    combination

[Roofline figure: AI = FLOPs / (Capacity + Allocations + Compulsory)]
42
Roofline Model: locality walls
  • Remember, memory traffic includes more than just
    compulsory misses.
  • As such, actual arithmetic intensity may be
    substantially lower.
  • Walls are unique to the architecture-kernel
    combination

[Roofline figure: AI = FLOPs / (Conflict + Capacity + Allocations + Compulsory)]
43
Optimization Categorization
Maximizing (attained) In-core Performance
Minimizing (total) Memory Traffic
Maximizing (attained) Memory Bandwidth
44
Optimization Categorization
Minimizing Memory Traffic
Maximizing Memory Bandwidth
Maximizing In-core Performance
  • Exploit in-core parallelism
  • (ILP, DLP, etc)
  • Good (enough)
  • floating-point balance

45
Optimization Categorization
Minimizing Memory Traffic
Maximizing Memory Bandwidth
Maximizing In-core Performance
  • Exploit in-core parallelism
  • (ILP, DLP, etc)
  • Good (enough)
  • floating-point balance

46
Optimization Categorization
Maximizing In-core Performance
Minimizing Memory Traffic
Maximizing Memory Bandwidth
  • Exploit in-core parallelism
  • (ILP, DLP, etc)
  • Good (enough)
  • floating-point balance
  • Exploit NUMA
  • Hide memory latency
  • Satisfy Little's Law

47
Optimization Categorization
Maximizing In-core Performance
Maximizing Memory Bandwidth
Minimizing Memory Traffic
  • Exploit in-core parallelism
  • (ILP, DLP, etc)
  • Good (enough)
  • floating-point balance
  • Exploit NUMA
  • Hide memory latency
  • Satisfy Little's Law
  • Eliminate
  • Capacity misses
  • Conflict misses
  • Compulsory misses
  • Write allocate behavior

48
Optimization Categorization
Maximizing In-core Performance
Minimizing Memory Traffic
Maximizing Memory Bandwidth
  • Exploit in-core parallelism
  • (ILP, DLP, etc)
  • Good (enough)
  • floating-point balance
  • Exploit NUMA
  • Hide memory latency
  • Satisfy Little's Law
  • Eliminate
  • Capacity misses
  • Conflict misses
  • Compulsory misses
  • Write allocate behavior

49
Roofline Model: locality walls
  • Optimizations remove these walls and ceilings
    which act to constrain performance.

[Roofline figure: all ceilings and locality walls (compulsory, write allocation, capacity, conflict) shown]
50
Roofline Model: locality walls
  • Optimizations remove these walls and ceilings
    which act to constrain performance.

[Roofline figure: conflict, capacity, and write-allocation walls removed; only the compulsory miss wall remains]
51
Roofline Model: locality walls
  • Optimizations remove these walls and ceilings
    which act to constrain performance.

[Roofline figure: in-core ceilings removed; peak DP, Stream Bandwidth, w/out SW prefetch, w/out NUMA, compulsory miss wall]
52
Roofline Model: locality walls
  • Optimizations remove these walls and ceilings
    which act to constrain performance.

[Roofline figure: only peak DP, Stream Bandwidth, and the compulsory miss wall remain]
53
Optimization Categorization
Maximizing In-core Performance
Minimizing Memory Traffic
Maximizing Memory Bandwidth
  • Exploit in-core parallelism
  • (ILP, DLP, etc)
  • Good (enough)
  • floating-point balance
  • Exploit NUMA
  • Hide memory latency
  • Satisfy Little's Law
  • Eliminate
  • Capacity misses
  • Conflict misses
  • Compulsory misses
  • Write allocate behavior

Each optimization has a large parameter space.
What are the optimal parameters?
54
Auto-tuning?
  • Provides performance portability across the
    existing breadth and evolution of microprocessors
  • A one-time up-front productivity cost is
    amortized by the number of machines it's used on
  • Auto-tuning does not invent new optimizations
  • Auto-tuning automates the code generation and
    exploration of the optimization and parameter
    space
  • Two components
  • parameterized code generator (we wrote ours in
    Perl)
  • Auto-tuning exploration benchmark
  • (combination of heuristics and exhaustive
    search)
  • Can be extended with ISA specific optimizations
    (e.g. DMA, SIMD)

55
Multicore SMPs of Interest
  • (used throughout the rest of the talk)

56
Multicore SMPs Used
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
57
Multicore SMPs Used (conventional cache-based
memory hierarchy)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
58
Multicore SMPs Used (local store-based memory
hierarchy)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
59
Multicore SMPs Used (CMT = Chip-MultiThreading)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
60
Multicore SMPs Used (threads)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
8 threads
8 threads
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
16 threads
128 threads
SPEs only
61
Multicore SMPs Used (peak double precision flops)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
75 GFlop/s
74 Gflop/s
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
29 GFlop/s
19 GFlop/s
SPEs only
62
Multicore SMPs Used (total DRAM bandwidth)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
21 GB/s (read) 10 GB/s (write)
21 GB/s
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
51 GB/s
42 GB/s (read) 21 GB/s (write)
SPEs only
63
Multicore SMPs Used (Non-Uniform Memory Access,
NUMA)
AMD Opteron 2356 (Barcelona)
Intel Xeon E5345 (Clovertown)
IBM QS20 Cell Blade
Sun T2 T5140 (Victoria Falls)
SPEs only
64
Roofline Model for these multicore SMPs
  • Note, the multithreaded Niagara is limited by the
    instruction mix rather than a lack of expressed
    in-core parallelism
  • Clearly some architectures are more dependent on
    bandwidth optimizations while others on in-core
    optimizations.

65
Auto-tuning Sparse Matrix-Vector Multiplication
(SpMV)
  • Samuel Williams, Leonid Oliker, Richard Vuduc,
    John Shalf, Katherine Yelick, James Demmel,
    "Optimization of Sparse Matrix-Vector
    Multiplication on Emerging Multicore Platforms",
    Supercomputing (SC), 2007.

66
Sparse Matrix-Vector Multiplication
  • What's a Sparse Matrix ?
  • Most entries are 0.0
  • Performance advantage in only storing/operating
    on the nonzeros
  • Requires significant meta data to reconstruct the
    matrix structure
  • What's SpMV ?
  • Evaluate y = Ax
  • A is a sparse matrix; x and y are dense vectors
  • Challenges
  • Very low arithmetic intensity (often < 0.166
    flops/byte; see the note below)
  • Difficult to exploit ILP (bad for pipelined or
    superscalar),
  • Difficult to exploit DLP (bad for SIMD)
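Note (not on the original slide): the 0.166 bound follows from the storage format. Each nonzero contributes one multiply and one add (2 flops) but requires at least an 8-byte value and a 4-byte column index (12 bytes), so AI <= 2/12, roughly 0.166 flops per byte, before even counting vector traffic.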

67
The Dataset (matrices)
  • Unlike dense BLAS, performance is dictated by
    sparsity
  • Suite of 14 matrices
  • All bigger than the caches of our SMPs
  • We'll also include a median performance number

[Figure: spy plots of the 14-matrix suite]
Dense: 2K x 2K dense matrix stored in sparse format
Well Structured (sorted by nonzeros/row): Protein,
FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor,
QCD, FEM/Ship, Economics, Epidemiology
Poorly Structured (hodgepodge): FEM/Accelerator,
Circuit, webbase
Extreme Aspect Ratio (linear programming): LP
68
SpMV Performance (simple parallelization)
  • Out-of-the-box SpMV performance on a suite of 14
    matrices
  • Scalability isn't great
  • Is this performance good?

[Bar chart legend: Naïve Pthreads, Naïve]
69
NUMA for SpMV
  • On NUMA architectures, all large arrays should be
    partitioned either
  • explicitly (multiple malloc()s + affinity), or
  • implicitly (parallelize initialization and rely
    on first touch)
  • You cannot partition on granularities less than
    the page size
  • 512 elements on x86
  • 2M elements on Niagara
  • For SpMV, partition the matrix and perform
    multiple malloc()s
  • Pin submatrices so they are co-located with the
    cores tasked to process them (sketch below)
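A minimal sketch (not the authors' code) of the implicit variant applied to SpMV data: each OpenMP thread first-touches its own contiguous slice of the nonzero array so the backing pages are allocated in that thread's NUMA domain. The names values and nnz are assumptions, and threads are assumed to be pinned as on the earlier affinity slide.

#include <omp.h>

/* First-touch the nonzero values so each thread's slice of pages
   lands in its local NUMA domain (OS first-touch policy). */
void numa_first_touch(double *values, long nnz)
{
    #pragma omp parallel
    {
        long tid = omp_get_thread_num();
        long nth = omp_get_num_threads();
        long lo  = nnz *  tid      / nth;
        long hi  = nnz * (tid + 1) / nth;
        for (long i = lo; i < hi; i++)
            values[i] = 0.0;   /* touch before filling the matrix */
    }
}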

70
Prefetch for SpMV
  • SW prefetch injects more MLP into the memory
    subsystem.
  • Can try to prefetch the
  • values
  • indices
  • source vector
  • or any combination thereof
  • In general, should only insert one prefetch per
    cache line (works best on unrolled code)

for(all rows){
  y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
  for(all tiles in this row){
    PREFETCH(&V[i+PFDistance]);
    y0 += V[i  ]*X[C[i]];
    y1 += V[i+1]*X[C[i]];
    y2 += V[i+2]*X[C[i]];
    y3 += V[i+3]*X[C[i]];
  }
  y[r  ] = y0; y[r+1] = y1;
  y[r+2] = y2; y[r+3] = y3;
}
71
SpMV Performance (NUMA and Software Prefetching)
  • NUMA-aware allocation is essential on
    memory-bound NUMA SMPs.
  • Explicit software prefetching can boost bandwidth
    and change cache replacement policies
  • Cell PPEs are likely latency-limited.
  • used exhaustive search

72
ILP/DLP vs Bandwidth
  • In the multicore era, which is the bigger issue?
  • a lack of ILP/DLP (a major advantage of BCSR)
  • insufficient memory bandwidth per core
  • On many architectures, when running low arithmetic
    intensity kernels, there is so little available
    memory bandwidth (per core) that you won't notice
    a complete lack of ILP
  • Perhaps we should concentrate on minimizing
    memory traffic rather than maximizing ILP/DLP
  • Rather than benchmarking every combination, just
    select the register blocking that minimizes the
    matrix footprint (sketch below).
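A sketch (not the authors' code) of that heuristic: enumerate candidate r x c register blockings and keep the one whose BCSR footprint (block values + column indices + row pointers) is smallest. count_blocks() is a hypothetical helper that returns how many r x c blocks, including explicit zero fill, are needed to cover the matrix.

/* Hypothetical helper: number of r x c blocks (with fill) for this matrix. */
long count_blocks(int r, int c);

/* Pick the register blocking that minimizes the BCSR footprint. */
void pick_blocking(long nrows, int *best_r, int *best_c)
{
    long best_bytes = -1;
    for (int r = 1; r <= 4; r++) {
        for (int c = 1; c <= 4; c++) {
            long b = count_blocks(r, c);
            long bytes = b * (long)(r * c) * 8   /* block values   */
                       + b * 4                   /* column indices */
                       + (nrows / r + 1) * 4;    /* row pointers   */
            if (best_bytes < 0 || bytes < best_bytes) {
                best_bytes = bytes;
                *best_r = r;
                *best_c = c;
            }
        }
    }
}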

73
SpMV Performance (Matrix Compression)
  • After maximizing memory bandwidth, the only hope
    is to minimize memory traffic.
  • exploit
  • register blocking
  • other formats
  • smaller indices
  • Use a traffic minimization heuristic rather than
    search
  • Benefit is clearly matrix-dependent.
  • Register blocking enables efficient software
    prefetching (one prefetch per cache line)

74
Cache blocking for SpMV
  • Cache-blocking sparse matrices is very different
    than cache-blocking dense matrices.
  • Rather than changing loop bounds, store entire
    submatrices contiguously.
  • The columns spanned by each cache block are
    selected so that all submatrices place the same
    pressure on the cache,
  • i.e. touch the same number of unique source
    vector cache lines
  • TLB blocking is a similar concept, but instead of
    8-byte granularities it uses 4KB granularities

75
Auto-tuned SpMV Performance (cache and TLB
blocking)
  • Fully auto-tuned SpMV performance across the
    suite of matrices
  • Why do some optimizations work better on some
    architectures?

[Stacked bar legend: Cache/LS/TLB Blocking, Matrix Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve]
76
Auto-tuned SpMV Performance (architecture specific
optimizations)
  • Fully auto-tuned SpMV performance across the
    suite of matrices
  • Included SPE/local store optimized version
  • Why do some optimizations work better on some
    architectures?

[Stacked bar legend: Cache/LS/TLB Blocking, Matrix Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve]
77
Auto-tuned SpMV Performance (max speedup)
  • Fully auto-tuned SpMV performance across the
    suite of matrices
  • Included SPE/local store optimized version
  • Why do some optimizations work better on some
    architectures?

[Chart annotations: maximum auto-tuning speedups of 2.7x, 4.0x, 2.9x, and 35x; stacked bar legend: Cache/LS/TLB Blocking, Matrix Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve]
78
Auto-tuned SpMV Performance (architecture specific
optimizations)
  • Fully auto-tuned SpMV performance across the
    suite of matrices
  • Included SPE/local store optimized version
  • Why do some optimizations work better on some
    architectures?
  • Performance is better,
  • but is performance good?

Auto-tuning resulted in better performance,
but did it result in good performance?
[Stacked bar legend: Cache/LS/TLB Blocking, Matrix Compression, SW Prefetching, NUMA/Affinity, Naïve Pthreads, Naïve]
79
Roofline model for SpMV
  • Double precision roofline models
  • In-core optimizations 1..i
  • DRAM optimizations 1..j
  • FMA is inherent in SpMV (place at bottom)

[Four-panel roofline figure: attainable Gflop/s (1 to 128, log scale) vs. flop:DRAM byte ratio (1/16 to 8) for Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), Sun T2 T5140 (Victoria Falls), and IBM QS20 Cell Blade, with in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP) and bandwidth ceilings (w/out SW prefetch, w/out NUMA, bank conflicts, dataset fits in snoop filter)]
80
Roofline model for SpMV (overlay arithmetic
intensity)
  • Two unit stride streams
  • Inherent FMA
  • No ILP
  • No DLP
  • FP is 12-25% of instructions
  • Naïve compulsory flop:byte < 0.166

[Four-panel roofline figure as above, with the naïve SpMV arithmetic intensity overlaid; the Cell panel notes: No naïve SPE implementation]
81
Roofline model for SpMV (out-of-the-box parallel)
  • Two unit stride streams
  • Inherent FMA
  • No ILP
  • No DLP
  • FP is 12-25% of instructions
  • Naïve compulsory flop:byte < 0.166
  • For simplicity: dense matrix in sparse format

[Four-panel roofline figure with out-of-the-box parallel performance plotted at the naïve arithmetic intensity]
82
Roofline model for SpMV (NUMA + SW prefetch)
  • compulsory flop:byte ≈ 0.166
  • utilize all memory channels

[Four-panel roofline figure with performance after NUMA and software prefetch optimizations plotted]
83
Roofline model for SpMV (matrix compression)
  • Inherent FMA
  • Register blocking improves ILP, DLP, the
    flop:byte ratio, and the FP fraction of
    instructions

[Four-panel roofline figure with performance after matrix compression plotted]
84
Roofline model for SpMV (matrix compression)
  • Inherent FMA
  • Register blocking improves ILP, DLP, the
    flop:byte ratio, and the FP fraction of
    instructions

[Four-panel roofline figure; annotation: performance is bandwidth limited]
85
SpMV Performance (summary)
  • Median SpMV performance
  • As an aside, unlike LBMHD, SSE was unnecessary to
    achieve good performance
  • Cell still requires a non-portable, ISA-specific
    implementation to achieve good performance.
  • Novel SpMV implementations may require
    ISA-specific (SSE) code to achieve better
    performance.

86
Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics
(LBMHD)
  • Samuel Williams, Jonathan Carter, Leonid Oliker,
    John Shalf, Katherine Yelick, "Lattice Boltzmann
    Simulation Optimization on Leading Multicore
    Platforms", International Parallel & Distributed
    Processing Symposium (IPDPS), 2008.
  • Best Paper, Application Track

87
LBMHD
  • Plasma turbulence simulation via Lattice
    Boltzmann Method
  • Two distributions
  • momentum distribution (27 scalar components)
  • magnetic distribution (15 vector components)
  • Three macroscopic quantities
  • Density
  • Momentum (vector)
  • Magnetic Field (vector)
  • Arithmetic Intensity
  • Must read 73 doubles and update 79 doubles per
    lattice update (1216 bytes)
  • Requires about 1300 floating-point operations per
    lattice update
  • Just over 1.0 flops/byte (ideal); see the check
    below
  • Cache capacity requirements are independent of
    problem size
  • Two problem sizes
  • 64³ (0.3 GB)
  • 128³ (2.5 GB)
  • periodic boundary conditions
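Check (not on the original slide): 73 + 79 = 152 doubles per lattice update is 152 x 8 = 1216 bytes, and roughly 1300 flops / 1216 bytes ≈ 1.07 flops/byte, i.e. just over 1.0.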

88
LBMHD Performance (reference implementation)
  • Generally, scalability looks good
  • Scalability is good, but is performance good?

[Bar chart legend: Naïve+NUMA; collision() only]
89
LBMHD Performance (lattice-aware array padding)
  • LBMHD touches >150 arrays.
  • Most caches have limited associativity
  • Conflict misses are likely
  • Apply a heuristic to pad arrays

[Bar chart legend: Padding; Naïve+NUMA]
90
Vectorization
  • Two phases within a lattice method's collision()
    operator
  • reconstruction of macroscopic variables
  • updating discretized velocities
  • Normally this is done one point at a time.
  • Change to do a vector's worth at a time (loop
    interchange + tuning); see the sketch below
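A schematic sketch (not the authors' code) of that loop interchange: process a block of VL consecutive lattice points one phase at a time, so each phase becomes a long unit-stride loop over a small scratch buffer. VL, npoints, scratch, reconstruct_macroscopics(), and update_velocities() are hypothetical names.

#define VL 128   /* "vector length", a tuning parameter */

/* Hypothetical per-point phase routines. */
void reconstruct_macroscopics(int point, double *scratch);
void update_velocities(int point, double *scratch);

void collision(int npoints, double *scratch)
{
    for (int base = 0; base < npoints; base += VL) {
        int n = (npoints - base < VL) ? npoints - base : VL;
        for (int i = 0; i < n; i++)            /* phase 1 */
            reconstruct_macroscopics(base + i, scratch);
        for (int i = 0; i < n; i++)            /* phase 2 */
            update_velocities(base + i, scratch);
    }
}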

91
LBMHD Performance (vectorization)
  • Restructure loops to attain good TLB page
    locality and streaming accesses

[Bar chart legend: Vectorization; Padding; Naïve+NUMA; collision() only]
92
LBMHD Performance (architecture specific
optimizations)
  • Add unrolling and reordering of the inner loop
  • Additionally, exploit SIMD where the compiler
    doesn't
  • Include a SPE/Local Store optimized version

[Stacked bar legend: small pages, Explicit SIMDization, SW Prefetching, Unrolling, Vectorization, Padding, Naïve+NUMA; collision() only]
93
LBMHD Performance (architecture specific
optimizations)
  • Add unrolling and reordering of the inner loop
  • Additionally, exploit SIMD where the compiler
    doesn't
  • Include a SPE/Local Store optimized version

[Chart annotations: maximum auto-tuning speedups of 1.6x, 4x, 3x, and 130x; stacked bar legend: small pages, Explicit SIMDization, SW Prefetching, Unrolling, Vectorization, Padding, Naïve+NUMA; collision() only]
94
Roofline model for LBMHD
  • Far more adds than multiplies (imbalance)
  • Huge data sets

[Four-panel roofline figure for LBMHD: attainable Gflop/s vs. flop:DRAM byte ratio on Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), Sun T2 T5140 (Victoria Falls), and IBM QS20 Cell Blade; in-core ceilings: peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP; bandwidth ceilings: w/out SW prefetch, w/out NUMA, bank conflicts]
95
Roofline model for LBMHD (overlay arithmetic
intensity)
  • Far more adds than multiplies (imbalance)
  • Essentially random access to memory
  • Flop:byte ratio ≈ 0.7
  • NUMA allocation/access
  • Little ILP
  • No DLP
  • High conflict misses

[Four-panel roofline figure as above, with the LBMHD arithmetic intensity overlaid; the Cell panel notes: No naïve SPE implementation]
96
Roofline model for LBMHD (out-of-the-box parallel
performance)
  • Far more adds than multiplies (imbalance)
  • Essentially random access to memory
  • Flop:byte ratio ≈ 0.7
  • NUMA allocation/access
  • Little ILP
  • No DLP
  • High conflict misses
  • Peak VF performance with 64 threads (out of 128)
    - high conflict misses

[Four-panel roofline figure with out-of-the-box parallel LBMHD performance plotted]
97
Roofline model for LBMHD (Padding, Vectorization,
Unrolling, Reordering, ...)
  • Vectorize the code to eliminate TLB capacity
    misses
  • Ensures unit stride access (bottom bandwidth
    ceiling)
  • Tune for optimal VL
  • Clovertown pinned to lower BW ceiling

[Four-panel roofline figure with performance after padding, vectorization, unrolling, and reordering plotted]
98
Roofline model for LBMHD (SIMDization + cache
bypass)
  • Make SIMDization explicit
  • Technically, this swaps ILP and SIMD ceilings
  • Use the cache-bypass store instruction movntpd
    (sketch below)
  • Increases the flop:byte ratio to 1.0 on x86/Cell

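A minimal sketch (not the authors' code) of a cache-bypassing store with SSE2 intrinsics: _mm_stream_pd compiles to movntpd and writes directly to memory without allocating the destination line in the cache, which is what removes the write-allocate traffic. stream_copy is a hypothetical name; dst is assumed 16-byte aligned and n even.

#include <emmintrin.h>   /* SSE2 */

void stream_copy(double *dst, const double *src, long n)
{
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_load_pd(&src[i]);
        _mm_stream_pd(&dst[i], v);   /* non-temporal store (movntpd) */
    }
    _mm_sfence();                    /* order the streaming stores */
}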
[Four-panel roofline figure with fully optimized LBMHD performance plotted]