Title: What does it take to produce near-peak Matrix-Matrix Multiply
1. What does it take to produce near-peak Matrix-Matrix Multiply
- Kamen Yotov
- Department of Computer Science
- Cornell University
- CS612 Lecture, 02/08/05
2. Approaches to fast MMM
- Traditional approach
- Hand-optimized code (e.g., most native BLAS)
- Problem: costly and tedious to develop
- Alternatives
- Restructuring compilers
- General-purpose: generate code from high-level specifications
- Use architectural models to determine optimization parameters
- Performance of optimized code is not satisfactory
- Library generators
- Problem-specific (e.g., ATLAS for BLAS, FFTW for FFT)
- Use empirical optimization to determine optimization parameters
- Believed to produce optimal code
3. Outline
- ATLAS
- Hardware parameters
- Transformations
- Optimization parameters
- Modeling
- Experimental Results
4. Architecture
5. Architecture
6. Measure Machine Parameters
- Empirical optimization
- L1 cache size: similar to the Saavedra benchmark
- NR, FMA, Lh
- Not exact; used to bound the search space
- Model-driven optimization
- Needs exact values
- Uses X-Ray: http://iss.cs.cornell.edu/Software/X-Ray.aspx
7. Architecture
8. Code Generation: Compiler Perspective
- ATLAS is not a compiler
- Code Generator output is equivalent to:
- Cache-level blocking (tiling)
- NB
- Register-level blocking
- MU, NU, KU
- Scheduling / software pipelining
- FMA, LS, FF, IF, NF
- Versioning
- Back-end compiler optimizations
9. Cache-level blocking (tiling)
- Tiling in ATLAS
- Only square tiles (NB x NB x NB)
- Working set of a tile fits in L1
- Tiles are usually copied to contiguous storage
- Special clean-up code generated for boundaries
- Mini-MMM
- for (int j = 0; j < NB; j++)
-   for (int i = 0; i < NB; i++)
-     for (int k = 0; k < NB; k++)
-       C[i][j] += A[i][k] * B[k][j];
- NB: optimization parameter
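The tiling described above can be sketched in C as follows. This is a minimal illustration, not ATLAS's generated code: the name `mmm_tiled` and the sizes `N = 8`, `NB = 4` are assumptions chosen so the tiles divide the matrices evenly; ATLAS instead copies tiles to contiguous buffers and generates clean-up code for ragged edges.

```c
#define N 8    /* problem size for the example; assumed a multiple of NB */
#define NB 4   /* tile size: the optimization parameter from this slide  */

/* Cache-level tiling sketch: C += A * B, computed as a sequence of
 * NB x NB x NB mini-MMMs so each tile's working set fits in L1. */
static void mmm_tiled(double A[N][N], double B[N][N], double C[N][N]) {
    for (int jj = 0; jj < N; jj += NB)
        for (int ii = 0; ii < N; ii += NB)
            for (int kk = 0; kk < N; kk += NB)
                /* mini-MMM on one tile, in the slide's JIK loop order */
                for (int j = jj; j < jj + NB; j++)
                    for (int i = ii; i < ii + NB; i++)
                        for (int k = kk; k < kk + NB; k++)
                            C[i][j] += A[i][k] * B[k][j];
}
```

Because the innermost k loop still runs in ascending order within each tile, the result is identical (bit for bit) to the untiled triple loop; only the memory access order changes.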
10. Register-level blocking
- Register blocking is to mini-MMM what cache blocking is to MMM
- Very important for performance: the highest level of the memory hierarchy
- Register file is a form of cache
- Fully associative
- Unit line size
- Optimal replacement
- Register file is under software control
- It is not transparent to software like normal caches
- Compilers allocate variables to registers, perform spills, etc.
- Compilers seldom allocate array elements to registers
- To block for the register file one needs to:
- Unroll the corresponding loops
- Scalarize array elements into variables
11. Register-level blocking (cont.)
- Micro-MMM
- A: MU x 1
- B: 1 x NU
- C: MU x NU
- Uses MU*NU + MU + NU registers
- Unroll loops by MU, NU, and KU
- Mini-MMM with micro-MMM inside
- for (int j = 0; j < NB; j += NU)
-   for (int i = 0; i < NB; i += MU)
-     load C[i..i+MU-1, j..j+NU-1] into registers
-     for (int k = 0; k < NB; k++)
-       load A[i..i+MU-1, k] into registers
-       load B[k, j..j+NU-1] into registers
-       multiply A's and B's and add to C's
-     store C[i..i+MU-1, j..j+NU-1]
- Special clean-up code required if NB is not a multiple of MU, NU, KU
- MU, NU, KU: optimization parameters
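A concrete instance of the pseudocode above, as a minimal sketch with MU = NU = 2 (so the register tile uses MU*NU + MU + NU = 8 values); the function name `mini_mmm` is illustrative, and real generated code would additionally unroll k by KU:

```c
#define NB 4   /* tile size; assumed here to be a multiple of MU and NU */
#define MU 2   /* register-tile rows    */
#define NU 2   /* register-tile columns */

/* Register-blocked mini-MMM: each micro-MMM multiplies an MU x 1 sliver
 * of A by a 1 x NU sliver of B and accumulates into an MU x NU register
 * tile of C, scalarized into local variables. */
static void mini_mmm(double A[NB][NB], double B[NB][NB], double C[NB][NB]) {
    for (int j = 0; j < NB; j += NU)
        for (int i = 0; i < NB; i += MU) {
            /* load the C register tile */
            double c00 = C[i][j],   c01 = C[i][j+1];
            double c10 = C[i+1][j], c11 = C[i+1][j+1];
            for (int k = 0; k < NB; k++) {
                double a0 = A[i][k], a1 = A[i+1][k];  /* A: MU x 1 */
                double b0 = B[k][j], b1 = B[k][j+1];  /* B: 1 x NU */
                c00 += a0 * b0;  c01 += a0 * b1;      /* micro-MMM */
                c10 += a1 * b0;  c11 += a1 * b1;
            }
            /* store the C register tile */
            C[i][j]   = c00;  C[i][j+1]   = c01;
            C[i+1][j] = c10;  C[i+1][j+1] = c11;
        }
}
```

The scalarization of C, A, and B into locals is what lets the back-end compiler keep them in registers; indexing array elements directly in the inner loop usually does not.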
12. Scheduling
- (Figure: interleaving of computation with memory operations; loads L1 ... L(MU+NU))
- FMA present?
- Schedule computation
- Using LS
- Schedule memory operations
- Using IF, NF, FF
- LS, IF, NF, FF: optimization parameters
13. Code Generation: Optimization Parameters
- Optimization parameters
- NB: constrained by the size of the L1 cache
- MU, NU: constrained by NR
- KU: constrained by the size of the I-cache
- FF, IF, NF: constrained by the number of outstanding loads
- FMA, LS: related to hardware parameters
- Sensitive vs. insensitive parameters
- Similar parameters are used by compilers
14. Architecture
15. ATLAS Search Strategy
- Multi-dimensional optimization problem
- Independent parameters: NB, MU, NU, KU, ...
- Dependent variable: MFLOPS
- Function from parameters to variable
- Is given implicitly
- Can be evaluated repeatedly
- One optimization strategy: orthogonal line search
- Optimize along one dimension at a time
- Use reference values for parameters not yet optimized
- Not guaranteed to find the optimal point, but might come close
- Specification of an orthogonal line search
- Order in which dimensions are optimized
- Reference values for un-optimized dimensions at any step
- Interval in which the line search is done for each dimension
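The strategy above can be sketched in a few lines of C. Everything here is illustrative: `perf` stands in for the implicitly given "parameters to MFLOPS" function (a made-up surrogate with its maximum at (5, 7), not a real timing run), and the search intervals and reference values are arbitrary.

```c
/* Surrogate for the implicit performance function; NOT a real measurement. */
static double perf(int x, int y) {
    return 100.0 - (x - 5) * (x - 5) - (y - 7) * (y - 7);
}

static int best_x = 1;   /* reference value until x is optimized */
static int best_y = 1;   /* reference value until y is optimized */

/* Orthogonal line search: optimize one dimension at a time, freezing the
 * others at their reference (or already-optimized) values. */
static void line_search(void) {
    double best;
    /* Dimension 1: optimize x, holding y at its reference value. */
    best = perf(best_x, best_y);
    for (int x = 1; x <= 10; x++)
        if (perf(x, best_y) > best) { best = perf(x, best_y); best_x = x; }
    /* Dimension 2: optimize y, using the x just found. */
    best = perf(best_x, best_y);
    for (int y = 1; y <= 10; y++)
        if (perf(best_x, y) > best) { best = perf(best_x, y); best_y = y; }
}
```

Because each dimension is searched with the others frozen, the result is not guaranteed optimal in general; it is exact here only because this surrogate happens to be separable in x and y.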
16. Search for best NB
- Search in the following range
- 16 <= NB <= 80
- NB^2 <= L1 cache size
- In this search, use simple estimates for the other parameters
- e.g., for KU, test each NB candidate with
- Full K unrolling (KU = NB)
- No K unrolling (KU = 1)
17. Search for other parameters
- MU and NU
- Constraints
- 1 <= MU, NU
- MU*NU + MU + NU <= NR
- Use the best NB found in the previous step
- KU
- KU = 1, 4, 8, ..., NB/2, and NB
- LS
- LS >= 1
- FF, IF, and NF
- FF = 0, 1
- IF >= 2
- NF >= 1
18. Architecture
19. Models for Optimization Parameters
- Optimization parameters
- NB
- Hierarchy of models (follows)
- MU and NU
- MU*NU + MU + NU + LS <= NR
- KU
- Maximize, subject to the L1 instruction cache
- LS
- LS = (Lh * ALU + 1) / 2
- FF, IF, NF
- FF = 0, IF = 2, NF = 2
- FMA
- Hardware parameter
20. Models for NB: No capacity / conflict misses
- Tiles copied to contiguous memory
- Simplest model
- 3 * NB^2 <= C
21. Models for NB: No capacity misses / some conflicts
- MMM
- for (int j = 0; j < N; j++)
-   for (int i = 0; i < N; i++)
-     for (int k = 0; k < N; k++)
-       c[i][j] += a[i][k] * b[k][j];
- Cache model
- Fully associative
- Line size: 1 word
- Optimal replacement
- Bottom line
- NB^2 + NB + 1 <= C
- One full matrix
- One row / column
- One element
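The two models so far can be turned into tiny tile-size calculators. A minimal sketch, with illustrative function names; `cap` is the L1 data cache capacity in words (e.g., 32 KB of doubles gives cap = 4096):

```c
/* Largest NB with the whole working set in L1: 3 * NB^2 <= C. */
static int nb_whole_workset(int cap) {
    int nb = 0;
    while (3 * (nb + 1) * (nb + 1) <= cap) nb++;
    return nb;
}

/* Largest NB with one full matrix, one row/column, and one element
 * resident: NB^2 + NB + 1 <= C. */
static int nb_one_matrix(int cap) {
    int nb = 0;
    while ((nb + 1) * (nb + 1) + (nb + 1) + 1 <= cap) nb++;
    return nb;
}
```

For cap = 4096 words, the conservative model admits NB = 36 while the refined model admits NB = 63, which is why the refinement matters: a larger tile amortizes more work per cache miss.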
22. Models for NB: Extension for Line Size
- Line size B > 1
- Spatial locality
- Array layout in memory matters
- Bottom line: ceil(NB^2 / B) + ceil(NB / B) + 1 <= C / B
- This is for the JIK loop order and column-major layout
- ATLAS uses this order / layout for the mini-MMM
23. Models for NB: Extension for LRU Replacement
- Intuition
- LRU needs more storage than optimal replacement
- Data needs to "cool down" in the cache before it can be replaced
- This means smaller tiles
- MMM sample
- for (int j = 0; j < N; j++) for (int i = 0; i < N; i++) for (int k = 0; k < N; k++) c[i][j] += a[i][k] * b[k][j];
- Bottom line: a stricter bound, hence a smaller NB, than under optimal replacement
24. Models for NB: Accounting for intra-level interactions
- Mini-MMM is not scalar operations underneath
- It is a sequence of micro-MMMs, operating on register tiles
- So whenever we talked about rows and columns, we should consider vertical and horizontal panels of register tiles instead!
- We can refine the model for NB accordingly
25. Models for NB: Summary
- Models of increasing complexity
- 3 * NB^2 <= C
- Whole working set fits in L1
- NB^2 + NB + 1 <= C
- Fully associative
- Optimal replacement
- Line size: 1 word
- ceil(NB^2 / B) + ceil(NB / B) + 1 <= C / B
- Line size: B > 1 words
- A tighter bound for LRU replacement
- A further refinement for multi-level interactions
26. Models: Summary
- This is the complete model
- It contains a small refinement for MU and NU on x86
27. Experimental Results
- Ten modern architectures
- Model did well on
- RISC architectures
- UltraSPARC did better
- Model did not do as well on
- Itanium 2
- x86 CISC architectures
- Sometimes a substantial gap between
- ATLAS CGw/S
- ATLAS Unleashed
28. Sensitivity Analysis
- Sensitivity graphs are invaluable for tracking down performance problems
- Such graphs are useful for figuring out parameters in your homework!
- Example: sensitivity to NB
- Use ATLAS values for all parameters other than NB
- Vary NB and see how performance is affected
- Gives a 2-D slice of a high-dimensional performance surface
- Similar for other parameters
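One point of such a sensitivity curve can be measured with a rough sketch like the one below. The name `mflops_at`, the size `N = 128`, and the use of `clock()` are all illustrative simplifications; ATLAS's own timers are far more careful (repetitions, cache flushing, wall-clock timers).

```c
#include <time.h>

#define N 128   /* problem size for the sweep */

static double A[N][N], B[N][N], C[N][N];

/* Time a tiled MMM at the given NB and return MFLOPS (2*N^3 flops). */
static double mflops_at(int nb) {
    clock_t t0 = clock();
    for (int jj = 0; jj < N; jj += nb)
        for (int ii = 0; ii < N; ii += nb)
            for (int kk = 0; kk < N; kk += nb)
                for (int j = jj; j < jj + nb && j < N; j++)
                    for (int i = ii; i < ii + nb && i < N; i++)
                        for (int k = kk; k < kk + nb && k < N; k++)
                            C[i][j] += A[i][k] * B[k][j];
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    return secs > 0.0 ? 2.0 * N * N * N / secs / 1e6 : 0.0;
}
```

Sweeping nb over, say, 16, 32, 64, 128 while every other parameter stays fixed traces exactly the 2-D slice described above; plotting the (nb, MFLOPS) pairs gives the sensitivity graph.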
29. Example Sensitivity Graphs
30. Complete MMM Results
- Intel Itanium 2
- AMD Opteron 240
- SGI R12K
- Sun UltraSPARC IIIi