Title: What does it take to produce near-peak Matrix-Matrix Multiply
1. What does it take to produce near-peak Matrix-Matrix Multiply
- Kamen Yotov
- Department of Computer Science
- Cornell University
- CS612 Lecture, 02/08/05
2. Approaches to fast MMM
- Traditional approach
- Hand-optimized code (e.g., most native BLAS)
- Problem: costly and tedious to develop
- Alternatives
- Restructuring compilers
- General-purpose: generate code from high-level specifications
- Use architectural models to determine optimization parameters
- Performance of optimized code is not satisfactory
- Library generators
- Problem-specific (e.g., ATLAS for BLAS, FFTW for FFT)
- Use empirical optimization to determine optimization parameters
- Believed to produce optimal code
3. Outline
- ATLAS
- Hardware parameters
- Transformations
- Optimization parameters
- Modeling
- Experimental Results
4. Architecture
5. Architecture
6. Measure Machine Parameters
- Empirical optimization
- L1 cache size: similar to the Saavedra benchmark
- NR, FMA, Lh
- Not exact; used to bound the search space
- Model-driven optimization
- Needs exact values
- Uses X-Ray: http://iss.cs.cornell.edu/Software/X-Ray.aspx
7. Architecture
8. Code Generation: Compiler Perspective
- ATLAS is not a compiler
- Code Generator output is equivalent to:
- Cache-level blocking (tiling)
- NB
- Register-level blocking
- MU, NU, KU
- Scheduling / software pipelining
- FMA, LS, FF, IF, NF
- Versioning
- Back-end compiler optimizations
9. Cache-level blocking (tiling)
- Tiling in ATLAS
- Only square tiles (NB x NB x NB)
- Working set of a tile fits in L1
- Tiles are usually copied to contiguous storage
- Special clean-up code generated for boundaries
- Mini-MMM
- for (int j = 0; j < NB; j++)
-   for (int i = 0; i < NB; i++)
-     for (int k = 0; k < NB; k++)
-       C[i][j] += A[i][k] * B[k][j];
- NB: optimization parameter
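The tiling described above can be sketched in C as follows. This is a minimal illustration, not ATLAS's generated code: the name `mmm_tiled` and the sizes `N = 8`, `NB = 4` are assumptions chosen so the tiles divide the matrices evenly; ATLAS instead copies tiles to contiguous buffers and generates clean-up code for ragged edges.

```c
#define N 8    /* problem size for the example; assumed a multiple of NB */
#define NB 4   /* tile size: the optimization parameter from this slide  */

/* Cache-level tiling sketch: C += A * B, computed as a sequence of
 * NB x NB x NB mini-MMMs so each tile's working set fits in L1. */
static void mmm_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB)
                /* mini-MMM on one tile, in the slide's JIK loop order */
                for (int j = jj; j < jj + NB; j++)
                    for (int i = ii; i < ii + NB; i++)
                        for (int k = kk; k < kk + NB; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Because the innermost k loop still runs in ascending order within each tile, the result is identical (bit for bit) to the untiled triple loop; only the memory access order changes.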
10. Register-level blocking
- Register blocking is to mini-MMM what cache blocking is to MMM
- Very important for performance: the highest level of the memory hierarchy
- Register file is a form of cache
- Fully associative
- Unit line size
- Optimal replacement
- Register file is under software control
- It is not transparent to software like normal caches
- Compilers allocate variables to registers, perform spills, etc.
- Compilers seldom allocate array elements to registers
- To block for the register file one needs to:
- Unroll the corresponding loops
- Scalarize array elements into variables
11. Register-level blocking (cont.)
- Micro-MMM
- A: MU x 1
- B: 1 x NU
- C: MU x NU
- Uses MU*NU + MU + NU registers
- Unroll loops by MU, NU, and KU
- Mini-MMM with micro-MMM inside
- for (int j = 0; j < NB; j += NU)
-   for (int i = 0; i < NB; i += MU)
-     load C[i..i+MU-1, j..j+NU-1] into registers
-     for (int k = 0; k < NB; k++)
-       load A[i..i+MU-1, k] into registers
-       load B[k, j..j+NU-1] into registers
-       multiply A's and B's and add to C's
-     store C[i..i+MU-1, j..j+NU-1]
- Special clean-up code required if NB is not a multiple of MU, NU, KU
- MU, NU, KU: optimization parameters
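A concrete instance of the pseudocode above, as a minimal sketch with MU = NU = 2 (so the register tile uses MU*NU + MU + NU = 8 values); the function name `mini_mmm` is illustrative, and real generated code would additionally unroll k by KU:

```c
#define NB 4   /* tile size; assumed here to be a multiple of MU and NU */
#define MU 2   /* register-tile rows    */
#define NU 2   /* register-tile columns */

/* Register-blocked mini-MMM: each micro-MMM multiplies an MU x 1 sliver
 * of A by a 1 x NU sliver of B and accumulates into an MU x NU register
 * tile of C, scalarized into local variables. */
static void mini_mmm(double A[NB][NB], double B[NB][NB], double C[NB][NB]) {
    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU) {
            /* load the C register tile */
            double c00 = C[i][j],   c01 = C[i][j+1];
            double c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < NB; k++) {
                double a0 = A[i][k], a1 = A[i+1][k];  /* A: MU x 1 */
                double b0 = B[k][j], b1 = B[k][j+1];  /* B: 1 x NU */
                c00 += a0 * b0;  c01 += a0 * b1;      /* micro-MMM */
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the C register tile */
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}
```

The scalarization of C, A, and B into locals is what lets the back-end compiler keep them in registers; indexing array elements directly in the inner loop usually does not.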
12. Scheduling
- (Figure: interleaving of computation with memory operations; loads L1 ... L(MU+NU))
- FMA present?
- Schedule computation
- Using LS
- Schedule memory operations
- Using IF, NF, FF
- LS, IF, NF, FF: optimization parameters
13. Code Generation: Optimization Parameters
- Optimization parameters
- NB: constrained by the size of the L1 cache
- MU, NU: constrained by NR
- KU: constrained by the size of the I-cache
- FF, IF, NF: constrained by the number of outstanding loads
- FMA, LS: related to hardware parameters
- Sensitive vs. insensitive parameters
- Similar parameters are used by compilers
14. Architecture
15. ATLAS Search Strategy
- Multi-dimensional optimization problem
- Independent parameters: NB, MU, NU, KU, ...
- Dependent variable: MFLOPS
- Function from parameters to variable
- Is given implicitly
- Can be evaluated repeatedly
- One optimization strategy: orthogonal line search
- Optimize along one dimension at a time
- Use reference values for parameters not yet optimized
- Not guaranteed to find the optimal point, but might come close
- Specification of an orthogonal line search
- Order in which dimensions are optimized
- Reference values for un-optimized dimensions at any step
- Interval in which the line search is done for each dimension
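The strategy above can be sketched in a few lines of C. Everything here is illustrative: `perf` stands in for the implicitly given "parameters to MFLOPS" function (a made-up surrogate with its maximum at (5, 7), not a real timing run), and the search intervals and reference values are arbitrary.

```c
/* Surrogate for the implicit performance function; NOT a real measurement. */
static double perf(int x, int y) {
    return 100.0 - (x - 5) * (x - 5) - (y - 7) * (y - 7);
}

static int best_x = 1;   /* reference value until x is optimized */
static int best_y = 1;   /* reference value until y is optimized */

/* Orthogonal line search: optimize one dimension at a time, freezing the
 * others at their reference (or already-optimized) values. */
static void line_search(void) {
    double best;
    /* Dimension 1: optimize x, holding y at its reference value. */
    best = perf(best_x, best_y);
    for (int x = 1; x <= 10; x++)
        if (perf(x, best_y) > best) { best = perf(x, best_y); best_x = x; }
    /* Dimension 2: optimize y, using the x just found. */
    best = perf(best_x, best_y);
    for (int y = 1; y <= 10; y++)
        if (perf(best_x, y) > best) { best = perf(best_x, y); best_y = y; }
}
```

Because each dimension is searched with the others frozen, the result is not guaranteed optimal in general; it is exact here only because this surrogate happens to be separable in x and y.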
16. Search for best NB
- Search in the following range
- 16 <= NB <= 80
- NB^2 <= L1 cache size
- In this search, use simple estimates for the other parameters
- e.g., for KU, test each NB candidate with
- Full K unrolling (KU = NB)
- No K unrolling (KU = 1)
17. Search for other parameters
- MU and NU
- Constraints
- 1 <= MU, NU
- MU*NU + MU + NU <= NR
- Use the best NB found in the previous step
- KU
- KU = 1, 4, 8, ..., NB/2, and NB
- LS
- LS >= 1
- FF, IF, and NF
- FF = 0, 1
- IF >= 2
- NF >= 1
18. Architecture
19. Models for Optimization Parameters
- Optimization parameters
- NB
- Hierarchy of models (follows)
- MU and NU
- MU*NU + MU + NU + LS <= NR
- KU
- Maximize, subject to the L1 instruction cache
- LS
- LS = (Lh * ALU + 1) / 2
- FF, IF, NF
- FF = 0, IF = 2, NF = 2
- FMA
- Hardware parameter
20. Models for NB: No capacity / conflict misses
- Tiles copied to contiguous memory
- Simplest model
- 3 * NB^2 <= C
21. Models for NB: No capacity misses / some conflicts
- MMM
- for (int j = 0; j < N; j++)
-   for (int i = 0; i < N; i++)
-     for (int k = 0; k < N; k++)
-       c[i][j] += a[i][k] * b[k][j];
- Cache model
- Fully associative
- Line size: 1 word
- Optimal replacement
- Bottom line
- NB^2 + NB + 1 <= C
- One full matrix
- One row / column
- One element
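The two models so far can be turned into tiny tile-size calculators. A minimal sketch, with illustrative function names; `cap` is the L1 data cache capacity in words (e.g., 32 KB of doubles gives cap = 4096):

```c
/* Largest NB with the whole working set in L1: 3 * NB^2 <= C. */
static int nb_whole_workset(int cap) {
    int nb = 0;
    while (3 * (nb + 1) * (nb + 1) <= cap) nb++;
    return nb;
}

/* Largest NB with one full matrix, one row/column, and one element
 * resident: NB^2 + NB + 1 <= C. */
static int nb_one_matrix(int cap) {
    int nb = 0;
    while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= cap) nb++;
    return nb;
}
```

For cap = 4096 words, the conservative model admits NB = 36 while the refined model admits NB = 63, which is why the refinement matters: a larger tile amortizes more work per cache miss.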
22. Models for NB: Extension for Line Size
- Line size B > 1
- Spatial locality
- Array layout in memory matters
- Bottom line: ceil(NB^2 / B) + ceil(NB / B) + 1 <= C / B
- This is for the JIK loop order and column-major layout
- ATLAS uses this order / layout for the mini-MMM
23. Models for NB: Extension for LRU Replacement
- Intuition
- LRU needs more storage than optimal replacement
- Data needs to "cool down" in the cache before it can be replaced
- This means smaller tiles
- MMM sample
- for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j];
- Bottom line: a stricter bound, hence a smaller NB, than under optimal replacement
24. Models for NB: Accounting for intra-level interactions
- Mini-MMM is not scalar operations underneath
- It is a sequence of micro-MMMs, operating on register tiles
- So whenever we talked about rows and columns, we should consider vertical and horizontal panels of register tiles instead!
- We can refine the model for NB accordingly
25. Models for NB: Summary
- Models of increasing complexity
- 3 * NB^2 <= C
- Whole working set fits in L1
- NB^2 + NB + 1 <= C
- Fully associative
- Optimal replacement
- Line size: 1 word
- ceil(NB^2 / B) + ceil(NB / B) + 1 <= C / B
- Line size: B > 1 words
- A tighter bound for LRU replacement
- A further refinement for multi-level interactions
26. Models: Summary
- This is the complete model
- It contains a small refinement for MU and NU on x86
27. Experimental Results
- Ten modern architectures
- Model did well on
- RISC architectures
- UltraSPARC did better
- Model did not do as well on
- Itanium 2
- x86 CISC architectures
- Sometimes a substantial gap between
- ATLAS CGw/S
- ATLAS Unleashed
28. Sensitivity Analysis
- Sensitivity graphs are invaluable for tracking down performance problems
- Such graphs are useful for figuring out parameters in your homework!
- Example: sensitivity to NB
- Use ATLAS values for all parameters other than NB
- Vary NB and see how performance is affected
- Gives a 2-D slice of a high-dimensional performance surface
- Similar for other parameters
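One point of such a sensitivity curve can be measured with a rough sketch like the one below. The name `mflops_at`, the size `N = 128`, and the use of `clock()` are all illustrative simplifications; ATLAS's own timers are far more careful (repetitions, cache flushing, wall-clock timers).

```c
#include <time.h>

#define N 128   /* problem size for the sweep */

static double A[N][N], B[N][N], C[N][N];

/* Time a tiled MMM at the given NB and return MFLOPS (2*N^3 flops). */
static double mflops_at(int nb) {
    clock_t t0 = clock();
    for (int jj = 0; jj < N; jj += nb)
        for (int ii = 0; ii < N; ii += nb)
            for (int kk = 0; kk < N; kk += nb)
                for (int j = jj; j < jj + nb && j < N; j++)
                    for (int i = ii; i < ii + nb && i < N; i++)
                        for (int k = kk; k < kk + nb && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? 2.0 * N * N * N / secs / 1e6 : 0.0;
}
```

Sweeping nb over, say, 16, 32, 64, 128 while every other parameter stays fixed traces exactly the 2-D slice described above; plotting the (nb, MFLOPS) pairs gives the sensitivity graph.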
29. Example Sensitivity Graphs
30. Complete MMM Results
- Intel Itanium 2
- AMD Opteron 240
- SGI R12K
- Sun UltraSPARC IIIi