Autotuning Structured Grid and Sparse Matrix Kernels
Transcript and Presenter's Notes
1
Auto-tuning Structured Grid and Sparse Matrix
Kernels
  • Samuel Williams¹,²
  • Jonathan Carter², Richard Vuduc³, Leonid Oliker¹,², John Shalf²,
  • Katherine Yelick¹,², James Demmel¹,², David Patterson¹,²
  • ¹University of California, Berkeley  ²Lawrence Berkeley National Laboratory  ³Georgia Institute of Technology
  • samw@cs.berkeley.edu

2
Motivation
  • Multicore is the de facto solution for improving
    peak performance for the next decade
  • How do we ensure this applies to sustained
    performance as well?
  • Processor architectures are extremely diverse, and
    compilers can rarely fully exploit them
  • We require a HW/SW solution that guarantees
    performance without completely sacrificing
    productivity

3
Overview
  • Examined:
  • the Lattice-Boltzmann Magneto-hydrodynamic (LBMHD)
    application
  • the Sparse Matrix-Vector Multiplication (SpMV) kernel
  • Present and analyze two threaded & auto-tuned
    implementations for each
  • Benchmarked performance across 4 diverse
    multicore architectures
  • Intel Xeon (Clovertown)
  • AMD Opteron
  • Sun Niagara2 (Huron)
  • IBM QS20 Cell Blade
  • Introduce a performance model for the
    architectures of interest
  • We show
  • Auto-tuning can significantly improve performance
  • Cell consistently delivers good performance and
    efficiency
  • Niagara2 delivers good performance and
    productivity

4
Multicore SMPs used
(Outline: Multicore SMPs · Perf. Modeling · Auto-tuning · LBMHD · SpMV · Summary)
5
Multicore SMP Systems
6
Multicore SMP Systems(memory hierarchy)
Conventional Cache-based Memory Hierarchy
7
Multicore SMP Systems(memory hierarchy)
Conventional Cache-based Memory Hierarchy
Disjoint Local Store Memory Hierarchy
8
Multicore SMP Systems(memory hierarchy)
Cache-based hierarchy → Pthreads implementations
Local-store hierarchy → libspe implementations
9
Multicore SMP Systems(peak flops)
Intel Clovertown: 75 Gflop/s
AMD Opteron: 17 Gflop/s
IBM Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s
Sun Niagara2: 11 Gflop/s
10
Multicore SMP Systems(peak DRAM bandwidth)
Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
AMD Opteron: 21 GB/s
IBM Cell Blade: 51 GB/s
Sun Niagara2: 42 GB/s (read), 21 GB/s (write)
11
Multicore SMP Systems
Non-Uniform Memory Access (Opteron, Cell blade)
Uniform Memory Access (Clovertown, Niagara2)
12
Roofline Performance Model
13
Arithmetic Intensity
(Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods)
  • Arithmetic Intensity = Total Flops / Total DRAM
    Bytes
  • Some HPC kernels have an arithmetic intensity
    that scales with problem size (increasing
    temporal locality)
  • But there are many important and interesting
    kernels that don't (worked examples below)
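Two worked examples (not from the slides) make the scaling point concrete: a double-precision dot product performs 2N flops on at least 16N bytes of compulsory traffic, while dense matrix-matrix multiply performs 2N^3 flops on roughly 24N^2 bytes, so

    $$\mathrm{AI}_{\mathrm{dot}} = \frac{2N}{16N} = \frac{1}{8}, \qquad \mathrm{AI}_{\mathrm{dgemm}} = \frac{2N^{3}}{24N^{2}} = \frac{N}{12} = O(N)$$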

14
Naïve Performance Modeling
  • Traditionally, architectures are often labeled as
    either
  • memory bound, or
  • compute bound
  • with the determining factor being the relationship
    between a kernel's flop:byte ratio and the
    architecture's flop:byte ratio
  • On modern architectures this is wholly
    insufficient, as performance is heavily dependent
    on architecture, algorithm, implementation, and
    compiler.
  • The goal: give programmers realistic performance and
    productivity expectations

15
Requisite Optimizations
  • Separate bandwidth from computation:
  • optimizations are required to achieve either peak
    memory bandwidth or peak computational rate
  • These can be combined to estimate the performance of
    a kernel (with sufficient MLP) using its flop:byte
    ratio
  • Bandwidth requirements:
  • long unit-stride streams
  • small number of streams
  • NUMA allocation
  • software prefetching
  • optimal balance between reads and writes (FBDIMM)
  • Computation requirements:
  • multicore parallelization
  • amortized loop overhead
  • ILP (instruction latency)
  • DLP (SIMD)
  • FMA or mul/add balance

16
Roofline model for Opteron
AMD Opteron
  • Performance "cartoon" for a 2 GHz processor
  • Peak performance is attainable only if it is
  • inherent in the algorithm,
  • expressed in the implementation, and
  • exploited by the architecture

(Roofline figure: attainable Gflop/s vs. actual flop:byte ratio on a log-log plot; horizontal ceiling at peak DP, diagonal ceiling at peak stream bandwidth)
17
Roofline model for Opteron
AMD Opteron
  • The Opteron has separate add and multiply
    datapaths

(Roofline figure: adds a "mul/add imbalance" ceiling below peak DP)
18
Roofline model for Opteron
AMD Opteron
  • On the rev.F Opteron, SIMD operations are
    serialized

(Roofline figure: adds a "w/out ILP or SIMD" ceiling)
19
Roofline model for Opteron
AMD Opteron
  • The Opteron's HW prefetchers are insufficient

(Roofline figure: adds a "w/out SW prefetching" bandwidth ceiling)
20
Roofline model for Opteron
AMD Opteron
  • The 2P Opteron has strong NUMA effects.
  • Data must be correctly partitioned and allocated

(Roofline figure: adds a "w/out NUMA optimizations" bandwidth ceiling)
21
Roofline model for Opteron
AMD Opteron
  • The envelope defines a region of expected
    performance.
  • It guides the order in which optimizations should be
    explored (a numeric sketch follows the figure)

(Roofline figure: the complete envelope, with ceilings for mul/add imbalance, ILP/SIMD, SW prefetching, and NUMA)
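A minimal numeric sketch (not from the slides) of how the bound in these figures is computed: attainable performance is the minimum of an in-core ceiling and bandwidth times arithmetic intensity, and each missing optimization lowers one of the two ceilings. The peak numbers and the derating factors below are illustrative assumptions, not measurements.

    #include <stdio.h>

    /* Roofline bound: attainable Gflop/s = min(in-core ceiling, bandwidth ceiling * AI). */
    static double roofline(double ai, double gflops_ceiling, double gbs_ceiling)
    {
        double mem_bound = gbs_ceiling * ai;
        return gflops_ceiling < mem_bound ? gflops_ceiling : mem_bound;
    }

    int main(void)
    {
        const double peak_gflops = 17.6;   /* assumed 2P Opteron double-precision peak */
        const double peak_gbs    = 21.3;   /* assumed peak stream bandwidth            */

        for (double ai = 0.125; ai <= 16.0; ai *= 2.0) {
            double full    = roofline(ai, peak_gflops,       peak_gbs);
            double no_simd = roofline(ai, peak_gflops / 4.0, peak_gbs);        /* w/out ILP or SIMD      */
            double no_sw   = roofline(ai, peak_gflops / 4.0, peak_gbs / 4.0);  /* also w/out SW prefetch */
            printf("AI %6.3f : %6.2f  %6.2f  %6.2f Gflop/s\n", ai, full, no_simd, no_sw);
        }
        return 0;
    }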
22
Roofline model for Opteron
AMD Opteron
  • The envelope defines three optimization regions

(Roofline figure: the envelope divided into the three optimization regions)
23
Roofline model for Opteron
AMD Opteron
  • 3 types of optimizations
  • Traffic (cache blocking)
  • Bandwidth
  • Computational

Compulsory flop:byte ratio
26
Roofline model for SMPs
Intel Clovertown
AMD Opteron
  • Roofline models for the four SMPs used in this work,
  • based on micro-benchmarks, experience, and
    manuals
  • Note:
  • log-log scale
  • optimization order is arbitrary

Sun Niagara2 (Huron)
IBM Cell Blade (SPEs)
IBM Cell Blade (PPEs)
27
Auto-tuning
28
Auto-tuning
  • Hand-optimizing each architecture/dataset
    combination is not feasible
  • Our auto-tuning approach finds a good-performance
    solution by a combination of heuristics and
    exhaustive search:
  • a Perl script generates many possible kernels
    (including SIMD-optimized kernels)
  • an auto-tuning benchmark examines the kernels and
    reports back the best one for the current
    architecture/dataset/compiler/... (sketched below)
  • Performance depends on the optimizations
    generated
  • Heuristics are often desirable when the search
    space isn't tractable
  • Proven value in dense linear algebra (ATLAS),
    spectral methods (FFTW, SPIRAL), and sparse
    methods (OSKI)
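A minimal sketch of the benchmark-and-report half of this approach, assuming the generator has already emitted a few variants (here, hypothetical unrolling depths of a toy kernel); the real auto-tuners explore far larger spaces generated by the Perl scripts.

    #include <stdio.h>
    #include <time.h>

    #define N (1 << 20)
    static double a[N], b[N];

    /* Hypothetical variants a generator might emit: the same kernel unrolled by 1, 2, and 4. */
    static void axpy_u1(int n) { for (int i = 0; i < n; i++)      a[i] += 1.1 * b[i]; }
    static void axpy_u2(int n) { for (int i = 0; i < n; i += 2) { a[i]   += 1.1 * b[i];
                                                                  a[i+1] += 1.1 * b[i+1]; } }
    static void axpy_u4(int n) { for (int i = 0; i < n; i += 4) { a[i]   += 1.1 * b[i];
                                                                  a[i+1] += 1.1 * b[i+1];
                                                                  a[i+2] += 1.1 * b[i+2];
                                                                  a[i+3] += 1.1 * b[i+3]; } }
    typedef void (*variant_t)(int);

    static double time_variant(variant_t v)        /* wall-clock seconds for 10 sweeps */
    {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int trial = 0; trial < 10; trial++) v(N);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }

    int main(void)
    {
        variant_t variants[] = { axpy_u1, axpy_u2, axpy_u4 };
        const char *names[]  = { "unroll x1", "unroll x2", "unroll x4" };
        int best = 0;
        double best_t = time_variant(variants[0]);
        for (int i = 1; i < 3; i++) {
            double t = time_variant(variants[i]);
            if (t < best_t) { best_t = t; best = i; }
        }
        printf("best variant on this machine: %s (%.3f s)\n", names[best], best_t);
        return 0;
    }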

29
Auto-tuning
  • Auto-tuning differentiates itself from
    compilation through the combination of search and
    data introspection
  • Either training data or real data can be used offline
    to optimize
  • This creates heuristics for use at runtime
  • Offline tuning is worth it if fewer aggregate offline
    cycles are consumed than runtime cycles

(Figure: compilers vs. auto-tuners, contrasted by their use of search and data introspection)
30
Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
  • Samuel Williams, Jonathan Carter, Leonid Oliker,
    John Shalf, Katherine Yelick, "Lattice Boltzmann
    Simulation Optimization on Leading Multicore
    Platforms", International Parallel & Distributed
    Processing Symposium (IPDPS), 2008.
  • Best Paper, Application Track

31
Introduction to Lattice Methods
  • Structured grid code, with a series of time steps
  • Popular in CFD
  • Allows for complex boundary conditions
  • No temporal locality between points in space
    within one time step
  • Higher dimensional phase space
  • Simplified kinetic model that maintains the
    macroscopic quantities
  • Distribution functions (e.g. 5-27 velocities per
    point in space) are used to reconstruct
    macroscopic quantities
  • Significant memory capacity requirements

32
Stencil for Lattice Methods
  • Very different from the canonical heat-equation
    stencil:
  • there are multiple read and write arrays
  • there is no reuse

(Figure: read_lattice → write_lattice)
33
LBMHD(general characteristics)
  • Plasma turbulence simulation
  • Two distributions
  • momentum distribution (27 scalar components)
  • magnetic distribution (15 vector components)
  • Three macroscopic quantities
  • Density
  • Momentum (vector)
  • Magnetic Field (vector)
  • Must read 73 doubles, and update 79 doubles per
    point in space
  • Requires about 1300 floating point operations per
    point in space
  • Just over 1.0 flops/byte (ideal; arithmetic below)
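That figure follows from the counts above; the naïve value of roughly 0.7 quoted on the LBMHD roofline slide additionally charges the 79 written doubles a second time for write-allocate (cache-fill) traffic, the traffic that non-temporal stores later eliminate:

    $$\frac{1300\ \text{flops}}{(73+79)\times 8\ \text{bytes}} \approx 1.07, \qquad \frac{1300\ \text{flops}}{(73+2\times 79)\times 8\ \text{bytes}} \approx 0.70$$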

34
LBMHD(implementation details)
  • Data structure choices (sketched below):
  • Array of Structures: no spatial locality, strided
    access
  • Structure of Arrays: huge number of memory
    streams per thread, but guarantees spatial
    locality and unit stride, and vectorizes well
  • Parallelization:
  • the Fortran version used MPI to communicate between
    nodes
  • a bad match for multicore
  • the version in this work uses pthreads for
    multicore, and MPI for inter-node
  • MPI is not used when auto-tuning
  • Two problem sizes:
  • 64³ (330 MB)
  • 128³ (2.5 GB)
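A minimal sketch of the two layouts, with an illustrative component count and a deliberately tiny grid; all names here are hypothetical.

    #include <stdio.h>

    #define NV   27              /* momentum components per lattice point (illustrative) */
    #define NPTS (16 * 16 * 16)  /* tiny grid so the sketch stays small */

    /* Array of Structures: one point's components are contiguous, so sweeping a
     * single component across the grid is a stride-NV access.                   */
    struct point_aos { double f[NV]; };
    static struct point_aos grid_aos[NPTS];

    /* Structure of Arrays: every component is its own unit-stride stream (many
     * streams per thread, but guaranteed spatial locality; easy to vectorize).  */
    static double grid_soa[NV][NPTS];

    int main(void)
    {
        double sum = 0.0;
        for (int i = 0; i < NPTS; i++) sum += grid_aos[i].f[5];  /* strided     */
        for (int i = 0; i < NPTS; i++) sum += grid_soa[5][i];    /* unit stride */
        printf("%f\n", sum);
        return 0;
    }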

35
Roofline model for LBMHD
Intel Clovertown
AMD Opteron
  • The mul/add imbalance in LBMHD inhibits FMA use
  • Inherent ILP > 1
  • Naïve flop:byte ≈ 0.7

(Roofline panels for Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs): attainable Gflop/s vs. actual flop:byte ratio, with algorithmic peak DP, w/out SIMD, and w/out ILP ceilings)
36
Note on Performance Graphs
  • We'll step through performance as
    optimizations/features are enabled within the
    auto-tuner
  • This allows us to compare architecture
    performance while keeping programmer
    effort (productivity) constant

37
Pthread Implementation
  • Not naïve
  • fully unrolled loops
  • NUMA-aware
  • 1D parallelization
  • Always used 8 threads per core on Niagara2

38
Pthread Implementation
(Chart annotations, per architecture: 4.8% of peak flops / 16% of bandwidth; 14% of peak flops / 17% of bandwidth; 54% of peak flops / 14% of bandwidth; 1% of peak flops / 3% of bandwidth)
39
Auto-tuned Performance (Stencil-aware Padding)
  • This lattice method is essentially 79
    simultaneous 72-point stencils
  • This can cause conflict misses even with highly
    associative L1 caches (not to mention the Opteron's
    2-way L1)
  • Solution: pad each component so that, when
    accessed with the corresponding stencil (spatial)
    offset, the components are uniformly distributed
    in the cache (see the sketch below)

(Stacked-bar chart: Naïve+NUMA, + Padding)
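A minimal sketch of that padding idea, assuming a 64-byte cache line and a simple one-line-per-component pad; the real auto-tuner searches over pad amounts, and every name and size below is illustrative.

    #include <stdlib.h>

    #define NCOMP 79             /* components updated per lattice point (from the slide) */
    #define NPTS  (32 * 32 * 32) /* reduced grid so the sketch stays small                */
    #define LINE  64             /* assumed cache line size in bytes                      */

    static double *base[NCOMP];  /* as returned by malloc (kept so it can be freed)  */
    static double *comp[NCOMP];  /* shifted pointers the kernel would actually index */

    /* Give every component a different starting offset so that, when all NCOMP
     * streams are walked at their stencil offsets, they fall in different cache
     * sets instead of colliding in a low-associativity L1.                       */
    static int alloc_padded(void)
    {
        for (int c = 0; c < NCOMP; c++) {
            size_t pad = (size_t)c * (LINE / sizeof(double));   /* c cache lines of pad */
            base[c] = malloc((NPTS + pad) * sizeof(double));
            if (base[c] == NULL) return -1;
            comp[c] = base[c] + pad;
        }
        return 0;
    }

    int main(void)
    {
        if (alloc_padded() != 0) return 1;
        comp[5][0] = 1.0;                      /* a kernel would index comp[c][i] */
        for (int c = 0; c < NCOMP; c++) free(base[c]);
        return 0;
    }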
40
Auto-tuned Performance (Vectorization)
  • Each update requires touching 150 components,
    each likely to be on a different page
  • TLB misses can significantly impact performance
  • Solution: vectorization
  • fuse spatial loops,
  • strip-mine into vectors of size VL, and
    interchange with the phase-dimension loops
    (see the sketch below)
  • Auto-tune: search for the optimal vector length
  • Significant benefit on some architectures
  • Becomes irrelevant when bandwidth dominates
    performance

(Stacked-bar chart: Naïve+NUMA, + Padding, + Vectorization)
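A minimal sketch of the loop restructuring described above, with an illustrative VL and a placeholder update standing in for the real LBMHD collision; the point is only the structure: fused spatial loop, strip-mined into VL-point chunks, component loop interchanged inside, so only VL points of each component (and hence far fewer pages) are live at once.

    #include <stdio.h>

    #define NPTS  (16 * 16 * 16)
    #define NCOMP 79
    #define VL    256            /* vector length: the auto-tuner searches over this */

    static double src[NCOMP][NPTS], dst[NCOMP][NPTS];

    void sweep(void)
    {
        for (int v = 0; v < NPTS; v += VL) {             /* fused, strip-mined spatial loop */
            int end = (v + VL < NPTS) ? v + VL : NPTS;
            for (int c = 0; c < NCOMP; c++)              /* phase-space loop moved inside   */
                for (int i = v; i < end; i++)
                    dst[c][i] = 0.5 * src[c][i];         /* stand-in for the real update    */
        }
    }

    int main(void)
    {
        sweep();
        printf("%f\n", dst[0][0]);
        return 0;
    }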
41
Auto-tuned Performance (Explicit Unrolling/Reordering)
  • Give the compilers a helping hand with the complex
    loops
  • Code generator: a Perl script generates all power-of-2
    possibilities
  • Auto-tune: search for the best unrolling and
    expression of data-level parallelism
  • This is essential when using SIMD intrinsics

(Stacked-bar chart: Naïve+NUMA, + Padding, + Vectorization, + Unrolling)
42
Auto-tuned Performance (Software Prefetching)
  • Expanded the code generator to insert software
    prefetches in case the compiler doesn't
    (see the sketch below)
  • Auto-tune over:
  • no prefetch
  • prefetch 1 line ahead
  • prefetch 1 vector ahead
  • Relatively little benefit, for relatively little
    work

(Stacked-bar chart: Naïve+NUMA, + Padding, + Vectorization, + Unrolling, + SW Prefetching)
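A minimal sketch of the kind of prefetch the generator inserts, using the GCC __builtin_prefetch builtin on a toy streaming loop; the distance is the tuned parameter and everything else is illustrative.

    #define PF_DIST 64           /* prefetch distance in doubles: the tuned parameter */

    void scale_stream(const double *restrict in, double *restrict out, int n)
    {
        for (int i = 0; i < n; i++) {
            /* read prefetch, low temporal locality; prefetching past the end is harmless */
            __builtin_prefetch(&in[i + PF_DIST], 0, 0);
            out[i] = 2.0 * in[i];
        }
    }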
43
Roofline model for LBMHD
Intel Clovertown
AMD Opteron
  • The memory access pattern is far from a unit-stride
    stream, resulting in low memory performance
  • The Opteron and Niagara2 achieved near
    algorithmic peak.

(Roofline panels for Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs): attainable Gflop/s vs. actual flop:byte ratio)
44
Auto-tuned Performance (SIMDization, including non-temporal stores)
  • The compilers (gcc & icc) failed at exploiting SIMD
  • Expanded the code generator to use SIMD
    intrinsics (see the sketch below)
  • Explicit unrolling/reordering was extremely
    valuable here
  • Exploited movntpd to minimize memory traffic
    (the only hope if memory bound)
  • Significant benefit, for significant work

(Stacked-bar chart: Naïve+NUMA, + Padding, + Vectorization, + Unrolling, + SW Prefetching, + SIMDization)
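A minimal sketch of the SSE2 idiom involved: _mm_stream_pd emits movntpd, storing a pair of doubles around the cache and so avoiding the write-allocate fill. The loop is a toy, not the generated LBMHD kernel; 16-byte alignment and an even trip count are assumed.

    #include <emmintrin.h>       /* SSE2: __m128d, _mm_stream_pd */

    /* y[i] = a * x[i] using cache-bypassing (non-temporal) stores. */
    void scale_nt(const double *x, double *y, double a, int n)
    {
        __m128d va = _mm_set1_pd(a);
        for (int i = 0; i < n; i += 2) {
            __m128d vx = _mm_load_pd(&x[i]);
            _mm_stream_pd(&y[i], _mm_mul_pd(va, vx));   /* movntpd */
        }
        _mm_sfence();            /* make the streaming stores globally visible */
    }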
45
Roofline model for LBMHD
Intel Clovertown
AMD Opteron
  • Elimination of the write-fill (e.g. via movntpd) can
    improve the flop:byte ratio to about 1.0
  • This slightly improved Clovertown (memory wall) and
    Opteron (algorithmic peak)

(Roofline panels for Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs): attainable Gflop/s vs. actual flop:byte ratio)
46
Auto-tuned Performance (SIMDization, including non-temporal stores)
(Chart callouts, speedup from auto-tuning per architecture: 4.3x, 1.6x, 1.5x, 10x)
47
Auto-tuned Performance (Local Store Implementation)
  • First attempt at a Cell implementation
  • VL, unrolling, and reordering fixed
  • No NUMA
  • Exploits DMA and double buffering to load vectors
  • Straight to SIMD intrinsics
  • Despite the relative performance, Cell's DP
    implementation severely impairs performance

(Stacked-bar chart as before; the Cell bar is collision() only)
48
Auto-tuned Performance (Local Store Implementation)
(Chart annotations, per architecture: 42% of peak flops / 35% of bandwidth; 7.5% of peak flops / 17% of bandwidth; 57% of peak flops / 33% of bandwidth; 59% of peak flops / 15% of bandwidth; the Cell bar is collision() only)
49
Roofline model for LBMHD
Intel Clovertown
AMD Opteron
  • Cell achieves near algorithmic peak.
  • Far from being memory bound.

(Roofline panels, including Sun Niagara2 (Huron): attainable Gflop/s vs. actual flop:byte ratio)
50
Speedup from Heterogeneity
(Chart callouts: 4.3x, 1.6x, 1.5x, and 13x over the auto-tuned PPE)
51
Overall Speedup
(Chart callouts, overall speedup: 4.3x, 1.6x, 1.5x, 130x)
52
Sparse Matrix-Vector Multiplication (SpMV)
  • Samuel Williams, Leonid Oliker, Richard Vuduc,
    John Shalf, Katherine Yelick, James Demmel,
    "Optimization of Sparse Matrix-Vector
    Multiplication on Emerging Multicore Platforms",
    Supercomputing (SC), 2007.

53
Sparse Matrix-Vector Multiplication
  • Sparse matrix:
  • most entries are 0.0
  • performance advantage in only
    storing/operating on the nonzeros
  • requires significant meta data
  • Evaluate y = Ax (a reference CSR kernel is sketched
    below)
  • A is a sparse matrix
  • x and y are dense vectors
  • Challenges:
  • difficult to exploit ILP (bad for superscalar)
  • difficult to exploit DLP (bad for SIMD)
  • irregular memory access to the source vector
  • difficult to load balance
  • very low computational intensity (often >6
    bytes/flop)
  • likely memory bound

(Figure: y ← A·x)
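For reference, a plain CSR kernel of the kind the naïve implementations are built around; the array names are the conventional ones, not necessarily those of the actual code.

    /* y = A*x with A in CSR: rowptr[i] .. rowptr[i+1]-1 index the nonzeros of
     * row i, and col[k] / val[k] give their column and value.                 */
    void spmv_csr(int nrows, const int *rowptr, const int *col,
                  const double *val, const double *x, double *y)
    {
        for (int i = 0; i < nrows; i++) {
            double sum = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                sum += val[k] * x[col[k]];
            y[i] = sum;
        }
    }

With 8-byte values and 4-byte column indices, each nonzero moves at least 12 bytes for its 2 flops, which is where the ">6 bytes/flop" figure above comes from.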
54
Dataset (Matrices)
  • Dense: 2K x 2K dense matrix stored in sparse format
  • Well structured (sorted by nonzeros/row): Protein,
    FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor,
    QCD, FEM/Ship, Economics, Epidemiology
  • Poorly structured (hodgepodge): FEM/Accelerator,
    Circuit, webbase
  • Extreme aspect ratio (linear programming): LP
  • Pruned the original SPARSITY suite down to 14 matrices
  • none should fit in cache
  • subdivided into the 4 categories above
  • rank ranges from 2K to 1M

55
Roofline model for SpMV
Intel Clovertown
AMD Opteron
  • Inherent FMA
  • Low ILP
  • Even more difficult to SIMDize
  • Naïve flop:byte < 0.166

(Roofline panels for Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs): attainable Gflop/s vs. actual flop:byte ratio)
56
Naïve Serial Implementation
  • Vanilla C implementation
  • Matrix stored in CSR (compressed sparse row) format
  • Explored compiler options; only the best is
    presented here
  • An x86 core delivers >10x the performance of a
    Niagara2 thread

57
Naïve Parallel Implementation
  • SPMD style
  • Partition by rows
  • Load balance by nonzeros (see the sketch below)
  • Niagara2 delivers 2.5x the performance of an x86
    machine

(Bar chart: Naïve, Naïve Pthreads)
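A minimal sketch of the load balancing described above: rows are assigned in contiguous ranges so each thread owns roughly an equal share of nonzeros. rowptr is the CSR row pointer from the earlier sketch; the function name and the exact tie-breaking are hypothetical.

    /* Fill start[0..nthreads] so that thread t owns rows [start[t], start[t+1])
     * and each thread gets roughly nnz/nthreads nonzeros (rowptr[nrows] == nnz). */
    void partition_by_nnz(int nrows, const int *rowptr, int nthreads, int *start)
    {
        long long nnz = rowptr[nrows];
        int t = 0;
        start[0] = 0;
        for (int i = 0; i < nrows && t + 1 < nthreads; i++) {
            /* close thread t's range once it has reached its share of the nonzeros */
            if (rowptr[i + 1] >= (t + 1) * nnz / nthreads)
                start[++t] = i + 1;
        }
        while (t < nthreads)
            start[++t] = nrows;   /* any ranges not yet closed end at the last row */
    }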
58
Naïve Parallel Implementation
(Chart annotations: 8x cores → 1.9x performance; 4x cores → 1.5x performance; 64x threads → 41x performance; 4x threads → 3.4x performance)
59
Naïve Parallel Implementation
(Chart annotations, per architecture: 1.4% of peak flops / 29% of bandwidth; 4% of peak flops / 20% of bandwidth; 25% of peak flops / 39% of bandwidth; 2.7% of peak flops / 4% of bandwidth)
60
Auto-tuned Performance (NUMA & SW Prefetching)
  • Use first touch or libnuma to exploit NUMA
    (see the sketch below)
  • Also includes process affinity
  • Tag prefetches with temporal-locality hints
  • Auto-tune: search for the optimal prefetch
    distances

(Bar chart: Naïve, Naïve Pthreads, + NUMA/Affinity, + SW Prefetching)
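A minimal sketch of the first-touch and affinity side of this (Linux-specific, error handling omitted): each thread is pinned to a core and then initializes its own partition, so those pages are mapped by the local memory controller. The helper name and arguments are hypothetical.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to one core, then first-touch its slice of a vector
     * so the pages it will later stream over live in this socket's memory.       */
    void pin_and_first_touch(int core, double *x, long lo, long hi)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        for (long i = lo; i < hi; i++)
            x[i] = 0.0;                    /* first touch maps the page locally */
    }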
61
Roofline model for SpMV
Intel Clovertown
AMD Opteron
  • Limited by bandwidth
  • SW prefetches need to be used sparingly to be
    effective

(Roofline panels for Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs): attainable Gflop/s vs. actual flop:byte ratio)
62
Auto-tuned Performance (Matrix Compression)
  • If memory bound, the only hope is minimizing memory
    traffic
  • Heuristically compress the parallelized matrix to
    minimize it (register blocking is sketched below)
  • Implemented with SSE
  • The benefit of prefetching is hidden by the
    requirement of register blocking
  • Options: register blocking, index size, format,
    etc.

(Bar chart: Naïve, Naïve Pthreads, + NUMA/Affinity, + SW Prefetching, + Compression)
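A minimal sketch of the register-blocking part of the compression, assuming 2x2 BCSR blocks stored row-major with 16-bit block-column indices; one index now covers four values, which is how the meta data shrinks and the flop:byte ratio improves. The real tuner also searches over other block sizes, index sizes, and formats.

    /* y = A*x with A in 2x2 BCSR: brow_ptr indexes block rows, bcol holds the
     * block column of each 2x2 block, and val stores each block row-major.    */
    void spmv_bcsr_2x2(int nbrows, const int *brow_ptr, const unsigned short *bcol,
                       const double *val, const double *x, double *y)
    {
        for (int ib = 0; ib < nbrows; ib++) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = brow_ptr[ib]; k < brow_ptr[ib + 1]; k++) {
                const double *b = &val[4 * k];
                double x0 = x[2 * bcol[k]], x1 = x[2 * bcol[k] + 1];
                y0 += b[0] * x0 + b[1] * x1;   /* both destination rows stay in registers */
                y1 += b[2] * x0 + b[3] * x1;
            }
            y[2 * ib]     = y0;
            y[2 * ib + 1] = y1;
        }
    }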
63
Auto-tuned Performance (Cache/TLB Blocking)
  • Reorganize the matrix to maximize locality of
    source-vector accesses

(Bar chart: previous optimizations + Cache/TLB Blocking)
64
Auto-tuned Performance (DIMMs, Firmware, Padding)
  • Clovertown was already fully populated with DIMMs
  • Gave the Opteron as many DIMMs as the Clovertown
  • Firmware update for Niagara2
  • Array padding to avoid inter-thread conflict
    misses
  • The PPEs use 1/3 of the Cell chip area

(Bar chart: previous optimizations + more DIMMs (Opteron), firmware fix and array padding (Niagara2), etc.)
65
Auto-tuned Performance (DIMMs, Firmware, Padding)
(Chart annotations, per architecture: 4% of peak flops / 52% of bandwidth; 20% of peak flops / 65% of bandwidth; 54% of peak flops / 57% of bandwidth; 10% of peak flops / 10% of bandwidth)
66
Roofline model for SpMV (Matrix Compression)
Intel Clovertown
AMD Opteron
  • Register blocking can eliminate the meta data and
    improve the flop:byte ratio to 0.25
  • Despite being far from machine peak, auto-tuning
    brought us close to model peak.

(Roofline panels for Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and IBM Cell Blade (PPEs): attainable Gflop/s vs. actual flop:byte ratio)
67
Speedup from Auto-tuning: Median (max)
(Chart callouts, median (max) speedup from auto-tuning per architecture: 3.9x (4.4x), 1.6x (2.7x), 1.3x (2.9x), 1.5x (3.8x))
68
Auto-tuned Performance (Cell/SPE version)
  • Wrote a double-precision Cell/SPE version
  • DMA, local-store blocked, NUMA aware, etc.
  • Only 2x1 and larger BCOO blocks
  • Only the SpMV-proper routine changed
  • About 12x faster (median) than using the PPEs
    alone

(Bar chart: full optimization stack, as in the previous slides)
69
Auto-tuned Performance (Cell/SPE version)
(Chart annotations: 4% of peak flops / 52% of bandwidth; 20% of peak flops / 65% of bandwidth; 54% of peak flops / 57% of bandwidth; and, for the Cell SPE version, 40% of peak flops / 92% of bandwidth)
70
Roofline model for SpMV
Intel Clovertown
AMD Opteron
  • The Cell SPE implementation achieved essentially
    model peak.

(Roofline panels, including Sun Niagara2 (Huron): attainable Gflop/s vs. actual flop:byte ratio)
71
Auto-tuned Performance (how much did double precision
and 2x1 blocking hurt?)
  • Model faster cores by commenting out the inner
    kernel calls, but still performing all DMAs
  • Enabled 1x1 BCOO
  • About a 16% improvement

(Bar chart: previous stack + "better Cell implementation")
72
Speedup from Heterogeneity: Median (max)
(Chart callouts, median (max): 3.9x (4.4x), 1.6x (2.7x), 1.3x (2.9x), 18x (16x))
73
Overall Speedup: Median (max)
(Chart callouts, overall median (max) speedup: 3.9x (4.4x), 1.6x (2.7x), 1.3x (2.9x), 26x (34x))
74
Summary
75
Aggregate Performance (Fully optimized)
  • Cell consistently delivers the best full-system
    performance
  • although Niagara2 delivers nearly comparable
    per-socket performance
  • The dual-core Opteron delivers far better performance
    (bandwidth) than Clovertown, but as the flop:byte
    ratio increases its performance advantage
    decreases
  • Clovertown has far too little effective FSB
    bandwidth
  • Huron has far more bandwidth than it can exploit
    (too much latency, too few cores)

76
Parallel Efficiency (average performance per thread,
fully optimized)
  • Aggregate Mflop/s divided by the number of cores
  • Niagara2 & Cell showed very good multicore
    scaling
  • Clovertown showed very poor multicore scaling on
    both applications
  • For SpMV, Opteron and Clovertown showed good
    multisocket scaling

77
Power Efficiency (Fully Optimized)
  • Used a digital power meter to measure sustained
    power under load
  • Power efficiency is calculated as
    sustained performance / sustained power
  • All cache-based machines delivered similar power
    efficiency
  • FBDIMMs (about 12 W each) drive up sustained power:
  • 8 DIMMs on Clovertown (total of 330 W)
  • 16 DIMMs on the N2 machine (total of 450 W)

78
Productivity
  • Niagara2 required significantly less work to
    deliver good performance.
  • For LBMHD, Clovertown, Opteron, and Cell all
    required SIMD (hampers productivity) for best
    performance.
  • Virtually every optimization was required (sooner
    or later) for Opteron and Cell.
  • Cache-based machines required search for some
    optimizations, while Cell relied solely on
    heuristics (less time to tune)

79
Multi-core arms race
80
New Multicores
2.2GHz Opteron (rev.F)
1.40GHz Niagara2
81
New Multicores
  • Barcelona is a quad-core Opteron
  • Victoria Falls is a dual-socket (128-thread)
    Niagara2
  • Both have the same total bandwidth

(Stacked-bar chart: Naïve+NUMA, Padding, Vectorization, Unrolling, SW Prefetching, SIMDization, Smaller pages)
82
Speedup from multicore/socket
(Chart callouts: 1.9x (1.8x frequency-normalized) and 1.6x (1.9x frequency-normalized))
83
Speedup from Auto-tuning
(Chart callouts: 3.9x, 4.3x, 1.5x, 16x)
84
Summary
  • Paradoxically, the most complex/advanced
    architectures required the most tuning, and
    delivered the lowest performance.
  • Niagara2 delivered both very good performance and
    productivity
  • Cell delivered very good performance and
    efficiency (processor and power)
  • Our multicore auto-tuned LBMHD implementation
    significantly outperformed the already-optimized
    serial implementation
  • Our multicore-specific auto-tuned SpMV
    implementation significantly outperformed
    existing parallelization strategies, including an
    auto-tuned MPI implementation (as discussed
    at SC07)
  • Sustainable memory bandwidth is essential even on
    kernels with moderate computational intensity
    (flop:byte ratio)
  • Architectural transparency is invaluable in
    optimizing code

85
Acknowledgements
  • UC Berkeley
  • RADLab cluster (Opterons)
  • PSI cluster (Clovertowns)
  • Sun Microsystems
  • Niagara2 donations
  • Forschungszentrum Jülich
  • Cell blade cluster access

86
Questions?