1
What does it take to produce near-peak
Matrix-Matrix Multiply
  • Kamen Yotov
  • Department of Computer Science
  • Cornell University
  • CS612 Lecture, 02/08/05

2
Approaches to fast MMM
  • Traditional approach
  • Hand-optimized code (e.g., most native BLAS)
  • Problem: costly and tedious to develop
  • Alternatives
  • Restructuring compilers
  • General-purpose: generate code from high-level
    specifications
  • Use architectural models to determine
    optimization parameters
  • Performance of optimized code is not satisfactory
  • Library generators
  • Problem-specific: e.g., ATLAS for BLAS, FFTW for
    FFT
  • Use empirical optimization to determine
    optimization parameters
  • Believed to produce optimal code

3
Outline
  • ATLAS
  • Hardware parameters
  • Transformations
  • Optimization parameters
  • Modeling
  • Experimental Results

4
Architecture
5
Architecture
6
Measure Machine Parameters
  • Empirical Optimization
  • L1Size: similar to the Saavedra benchmark
  • NR, FMA, Lh
  • Not exact, used to bound the search space
  • Model-driven
  • Needs exact values
  • Uses X-Ray: http://iss.cs.cornell.edu/Software/X-Ray.aspx

7
Architecture
8
Code Generation: Compiler Perspective
  • ATLAS is not a compiler
  • Code Generator output equivalent to
  • Cache-level blocking (tiling)
  • NB
  • Register-level blocking
  • MU, NU, KU
  • Scheduling: software pipelining
  • FMA, LS, FF, IF, NF
  • Versioning
  • Back-end compiler optimizations

9
Cache-level blocking (tiling)
  • Tiling in ATLAS
  • Only square tiles (NBxNBxNB)
  • Working set of tile fits in L1
  • Tiles are usually copied to contiguous storage
  • Special clean-up code generated for boundaries
  • Mini-MMM
  • for (int j = 0; j < NB; j++)
  •   for (int i = 0; i < NB; i++)
  •     for (int k = 0; k < NB; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • NB: optimization parameter
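The tiling scheme above can be sketched in C. This is an illustrative sketch, not ATLAS's generated code: ATLAS copies tiles to contiguous buffers and uses column-major order, while this version works in place on row-major arrays, and the function name and min-based boundary clean-up are assumptions.

```c
/* Illustrative sketch of ATLAS-style cache tiling: the MMM over N x N
 * matrices is decomposed into NB x NB x NB mini-MMMs so that each
 * tile's working set fits in L1. */
void mmm_tiled(int N, int NB, const double *A, const double *B, double *C)
{
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB) {
                /* clean-up for tiles that stick out past the boundary */
                int jmax = (jj + NB < N) ? jj + NB : N;
                int imax = (ii + NB < N) ? ii + NB : N;
                int kmax = (kk + NB < N) ? kk + NB : N;
                /* one mini-MMM on the current tile */
                for (int j = jj; j < jmax; j++)
                    for (int i = ii; i < imax; i++)
                        for (int k = kk; k < kmax; k++)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
            }
}
```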

10
Register-level blocking
  • Register blocking is to mini-MMM what cache
    blocking is to MMM
  • Very important for performance: the highest level
    of the memory hierarchy
  • Register file is a form of cache
  • Fully associative
  • Unit line size
  • Optimal replacement
  • Register file is under software control
  • It is not transparent to software, unlike normal
    caches
  • Compilers allocate variables to registers,
    perform spills, etc.
  • Compilers seldom allocate array elements to
    registers
  • To block for the register file, one needs to:
  • Unroll the corresponding loops
  • Scalarize array elements into variables

11
Register-level blocking (cont.)
  • Micro-MMM
  • A: MU x 1
  • B: 1 x NU
  • C: MU x NU
  • MU x NU + MU + NU registers
  • Unroll loops by MU, NU, and KU
  • Mini-MMM with Micro-MMM inside
  • for (int j = 0; j < NB; j += NU)
  •   for (int i = 0; i < NB; i += MU)
  •     load C[i..i+MU-1, j..j+NU-1] into registers
  •     for (int k = 0; k < NB; k++)
  •       load A[i..i+MU-1, k] into registers
  •       load B[k, j..j+NU-1] into registers
  •       multiply A's and B's and add to C's
  •     store C[i..i+MU-1, j..j+NU-1]
  • Special clean-up code required if
  • NB is not a multiple of MU, NU, KU
  • MU, NU, KU: optimization parameters
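A minimal C sketch of the unroll-and-scalarize idea above, assuming MU = NU = 2, column-major NB x NB operands, and NB a multiple of 2 (so no clean-up code is needed); the function name and fixed unroll factors are illustrative, not what ATLAS actually emits.

```c
/* Mini-MMM with a 2x2 (MU = 2, NU = 2) register-blocked micro-MMM:
 * the i and j loops are unrolled and the tile elements scalarized so
 * the back-end compiler can keep them in registers.
 * Layout is column-major: element (r, c) lives at index r + c * NB. */
void mini_mmm_2x2(int NB, const double *A, const double *B, double *C)
{
    for (int j = 0; j < NB; j += 2)          /* NU = 2 */
        for (int i = 0; i < NB; i += 2) {    /* MU = 2 */
            /* load C[i..i+1, j..j+1] into scalars ("registers") */
            double c00 = C[i   + j * NB], c01 = C[i   + (j + 1) * NB];
            double c10 = C[i+1 + j * NB], c11 = C[i+1 + (j + 1) * NB];
            for (int k = 0; k < NB; k++) {
                /* MU x 1 sliver of A, 1 x NU sliver of B */
                double a0 = A[i   + k * NB], a1 = A[i+1 + k * NB];
                double b0 = B[k + j * NB],   b1 = B[k + (j + 1) * NB];
                c00 += a0 * b0;  c01 += a0 * b1;
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store C[i..i+1, j..j+1] back */
            C[i   + j * NB] = c00;  C[i   + (j + 1) * NB] = c01;
            C[i+1 + j * NB] = c10;  C[i+1 + (j + 1) * NB] = c11;
        }
}
```

The 2x2 tile uses MU x NU + MU + NU = 8 scalars per iteration, matching the register count given above.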

12
Scheduling
  • Interleaves computation with memory operations
    (loads L1, L2, L3, …, LMU+NU)
  • FMA present?
  • Schedule computation
  • Using LS
  • Schedule memory operations
  • Using IF, NF, FF
  • LS, IF, NF, FF: optimization parameters
13
Code Generation: Optimization Parameters
  • Optimization parameters
  • NB: constrained by size of L1 cache
  • MU, NU: constrained by NR
  • KU: constrained by size of I-cache
  • FF, IF, NF: constrained by number of outstanding
    loads
  • FMA, LS: related to hardware parameters
  • Sensitive vs. insensitive parameters
  • Similar parameters used by compilers

14
Architecture
15
ATLAS Search Strategy
  • Multi-dimensional optimization problem
  • Independent parameters: NB, MU, NU, KU, …
  • Dependent variable: MFLOPS
  • The function from parameters to the variable
  • Is given implicitly
  • Can be evaluated repeatedly
  • One optimization strategy orthogonal line search
  • Optimize along one dimension at a time
  • Use reference values for parameters not yet
    optimized
  • Not guaranteed to find optimal point, but might
    come close
  • Specification of orthogonal line search
  • Order in which dimensions are optimized
  • Reference values for un-optimized dimensions at
    any step
  • Interval in which line search is done for each
    dimension
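The orthogonal line search described above can be sketched as follows. Everything here is hypothetical: the two-parameter struct, the ranges, and the concave stand-in for the measurement, since the real dependent variable is the measured MFLOPS of generated code.

```c
/* Orthogonal line search over two hypothetical parameters (NB, MU):
 * optimize one dimension at a time, holding the others at reference
 * values, then reuse the best value found so far. */
typedef struct { int nb, mu; } Params;

/* Made-up stand-in for timing a generated mini-MMM (peak at 60, 4). */
static double measure(Params p)
{
    double dnb = p.nb - 60, dmu = p.mu - 4;
    return 1000.0 - dnb * dnb - 25.0 * dmu * dmu;
}

Params orthogonal_line_search(void)
{
    Params best = { 16, 1 };             /* reference values */
    double best_f = measure(best);
    /* dimension 1: NB, with MU held at its reference value */
    for (int nb = 16; nb <= 80; nb += 4) {
        Params p = best; p.nb = nb;
        double f = measure(p);
        if (f > best_f) { best_f = f; best.nb = nb; }
    }
    /* dimension 2: MU, using the best NB found so far */
    for (int mu = 1; mu <= 8; mu++) {
        Params p = best; p.mu = mu;
        double f = measure(p);
        if (f > best_f) { best_f = f; best.mu = mu; }
    }
    return best;   /* not guaranteed optimal, but often close */
}
```

On this separable stand-in the search finds the true peak; on a real performance surface with parameter interactions it may not, which is exactly the caveat above.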

16
Search for best NB
  • Search in the following range:
  • 16 ≤ NB
  • NB² ≤ L1 cache size
  • In this search, use simple estimates for other
    parameters
  • e.g., for KU, test each NB candidate for
  • Full K unrolling (KU = NB)
  • No K unrolling (KU = 1)

17
Search for other parameters
  • MU and NU
  • Constraints
  • 1 ≤ MU, NU
  • MU × NU + MU + NU ≤ NR
  • Use best found NB from previous step
  • KU
  • KU = 1, 4, 8, …, NB/2, and NB
  • LS
  • 1 ≤ LS
  • FF, IF, and NF
  • FF = 0, 1
  • 2 ≤ IF
  • 1 ≤ NF

18
Architecture
19
Models for Optimization Parameters
  • Optimization parameters
  • NB
  • Hierarchy of Models (follows)
  • MU and NU
  • MU × NU + MU + NU + LS ≤ NR
  • KU
  • Maximize KU subject to the size of the L1
    instruction cache
  • LS
  • LS = (Lh × ALU + 1) / 2
  • FF, IF, NF
  • FF = 0, IF = 2, NF = 2
  • FMA
  • hardware parameter

20
Models for NB: No capacity / conflict misses
  • Tiles copied to contiguous memory
  • Simplest model
  • 3 × NB² ≤ C (C = L1 capacity)

21
Models for NB: No capacity misses / some conflicts
  • MMM
  • for (int j = 0; j < NB; j++)
  •   for (int i = 0; i < NB; i++)
  •     for (int k = 0; k < NB; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • Cache model
  • Fully associative
  • Line size: 1 word
  • Optimal replacement
  • Bottom line
  • NB² + NB + 1 ≤ C
  • One full matrix
  • One row / column
  • One element

22
Models for NB: Extension for Line Size
  • Line size > 1
  • Spatial locality
  • Array layout in memory matters
  • Bottom line: ⌈NB²/B⌉ + ⌈NB/B⌉ + 1 ≤ C/B, where B
    is the line size in words
  • This is for the JIK loop order, column-major
    layout
  • ATLAS uses this order / layout for Mini-MMM

23
Models for NB: Extension for LRU Replacement
  • Intuition
  • Need more storage than with optimal replacement
  • Data needs to "cool down" in cache before being
    replaced
  • This means smaller tiles
  • MMM sample
  • for (int j = 0; j < NB; j++)
  •   for (int i = 0; i < NB; i++)
  •     for (int k = 0; k < NB; k++)
  •       C[i][j] += A[i][k] * B[k][j];
  • Bottom line

24
Models for NB: Accounting for inter-level
interactions
  • Mini-MMM is not made of scalar operations
    underneath
  • It is a sequence of Micro-MMMs, operating on
    register tiles
  • So whenever we talked about rows and columns, we
    should consider vertical and horizontal panels of
    register tiles instead!
  • We can refine the model for NB as follows

25
Models for NB: Summary
  • Models of increasing complexity
  • 3 × NB² ≤ C
  • Whole working set fits in L1
  • NB² + NB + 1 ≤ C
  • Fully associative
  • Optimal replacement
  • Line size: 1 word
  • Line size > 1 word
  • LRU replacement
  • Multi-level interactions
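To contrast the first two models (3 × NB² ≤ C, the whole working set in L1, versus NB² + NB + 1 ≤ C, one matrix plus a row/column plus an element), a small hypothetical helper can compute the largest NB each one admits for a given L1 capacity in words:

```c
/* Hypothetical helper: largest NB whose working set fits in an L1
 * cache of cap words, under two of the models above.
 * Model 0: whole working set fits        -> 3 * NB^2 <= cap
 * Model 1: matrix + row/column + element -> NB^2 + NB + 1 <= cap */
int largest_nb(int cap, int model)
{
    int nb = 0;
    for (long t = 1; ; t++) {
        long ws = (model == 0) ? 3 * t * t : t * t + t + 1;
        if (ws > cap)
            break;          /* working set no longer fits */
        nb = (int)t;
    }
    return nb;
}
```

For an 8 KB L1 holding doubles (1024 words), the first model allows NB up to 18 while the second allows up to 31: the more precise model admits noticeably larger tiles.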

26
Models: Summary
  • This is the complete model
  • It contains a small refinement for MU and NU on
    x86

27
Experimental Results
  • Ten modern architectures
  • Model did well on
  • RISC architectures
  • UltraSPARC did better
  • Model did not do as well on
  • Itanium 2
  • x86 CISC architectures
  • Sometimes substantial gap between
  • ATLAS CGw/S
  • ATLAS Unleashed

28
Sensitivity Analysis
  • Sensitivity graphs are invaluable for tracking
    down performance problems
  • Such graphs are useful for figuring out
    parameters in your homework!
  • Example: sensitivity to NB
  • Use ATLAS values for all parameters other than NB
  • Vary NB and see how performance is affected
  • Gives a 2D slice of a high-dimensional
    performance surface
  • Similar for other parameters
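The sweep described above might look like this in C; run_mmm() is a made-up stand-in for compiling and timing the generated mini-MMM at one NB value, and the ranges are illustrative.

```c
/* Sensitivity sweep for NB: hold every other parameter at its chosen
 * value and record performance at each NB along the line, yielding one
 * 2-D slice of the high-dimensional performance surface. */
static double run_mmm(int nb)        /* hypothetical MFLOPS at this NB */
{
    double d = nb - 48;
    return 2000.0 - d * d;
}

/* Fill mflops[] with the slice and return the NB at its peak. */
int sweep_nb(double *mflops, int lo, int hi, int step)
{
    int n = 0, best = lo;
    for (int nb = lo; nb <= hi; nb += step) {
        mflops[n] = run_mmm(nb);     /* one point of the slice */
        if (mflops[n] > run_mmm(best))
            best = nb;
        n++;
    }
    return best;
}
```

Plotting mflops[] against NB gives exactly the kind of sensitivity graph shown on the next slide; a sharp peak means the parameter is sensitive, a plateau means it is not.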

29
Example Sensitivity Graphs
30
Complete MMM Results
Intel Itanium 2
AMD Opteron 240
SGI R12K
Sun UltraSPARC IIIi