Title: Wavefront Code
1. Wavefront Code
2. Machine
- Itanium-2 (sequoia at Cornell)
- L1 cache is not used for floating-point data (doubles)
- L2 cache characteristics
  - capacity: 256 KB = 32K doubles
  - line size: 128 bytes = 16 doubles
3. Original code

for I = 2, N
  for J = 1, N-1
    A(I,J) = A(I,J) + A(I-1,J+1)

(figure: the iteration space of the loop nest)
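A runnable sketch of this kernel may help; here a Python dict keyed by 1-based (I,J) stands in for the Fortran-style array (the `+` update operator is consistent with the three accesses per iteration counted on the later slides):

```python
def make_A(N, val=1.0):
    # 1-based N x N array, modeled as a dict keyed by (I, J).
    return {(i, j): val for i in range(1, N + 1) for j in range(1, N + 1)}

def wavefront(A, N):
    # Original wavefront kernel from the slide:
    #   for I = 2, N
    #     for J = 1, N-1
    #       A(I,J) = A(I,J) + A(I-1,J+1)
    for I in range(2, N + 1):
        for J in range(1, N):
            A[(I, J)] = A[(I, J)] + A[(I - 1, J + 1)]
    return A
```

Note the flow dependence: iteration (I,J) reads the value that iteration (I-1,J+1) wrote, which is what makes the inner loop sequential along anti-diagonals.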
4. Original code: large cache model

(figure: the N×N array A, 1..N in each dimension)

Large cache model: the number of elements touched is N·N - 2. Each miss brings in b elements (16 for the Itanium), so the number of misses is (N·N - 2)/b. The miss ratio is therefore
(N·N - 2) / (3b(N-1)(N-1)) ≈ (1/(3b))(1 + 2/N) ≈ 0.021(1 + 2/N) for the Itanium.
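The arithmetic on this slide can be checked directly; the functions below evaluate the exact large-cache miss ratio and the slide's (1/(3b))(1 + 2/N) approximation:

```python
def large_cache_miss_ratio(N, b):
    # Exact ratio: (N*N - 2) elements touched, one miss per b elements,
    # 3 accesses per iteration over an (N-1) x (N-1) iteration space.
    return (N * N - 2) / (3 * b * (N - 1) * (N - 1))

def approx_ratio(N, b):
    # Slide's approximation: (1/(3b)) * (1 + 2/N).
    return (1.0 / (3 * b)) * (1 + 2.0 / N)
```

For b = 16 the leading constant 1/(3b) = 1/48 ≈ 0.021.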
5. Original code: small cache model

(figure: the N×N array A, with cache lines running along the columns)

Small cache model:
- to compute A(I,J), we bring in a line of the Jth column (holding A(I,J)) and a line of the (J+1)th column (holding A(I-1,J+1))
- so when computing A(I,J+1), we get a hit on the accesses to A(I,J+1) and a miss for A(I-1,J+2)
So the small cache miss ratio is 1/3. The miss ratio is independent of b: we are unable to exploit spatial locality.
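This model can be sanity-checked with a small cache simulator. A sketch, assuming a fully associative LRU cache (the real L2 is set-associative) and column-major storage, run over the original loop order:

```python
from collections import OrderedDict

def simulate(N, b, cap_lines):
    # Fully associative LRU cache of cap_lines lines; column-major layout,
    # so element (I,J) lives at linear address (J-1)*N + (I-1).
    cache = OrderedDict()
    misses = accesses = 0

    def touch(i, j):
        nonlocal misses, accesses
        line = ((j - 1) * N + (i - 1)) // b
        accesses += 1
        if line in cache:
            cache.move_to_end(line)      # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cap_lines:
                cache.popitem(last=False)  # evict least recently used

    for I in range(2, N + 1):
        for J in range(1, N):
            touch(I, J)          # read A(I,J)
            touch(I - 1, J + 1)  # read A(I-1,J+1)
            touch(I, J)          # write A(I,J)
    return misses / accesses
```

With a tiny cache the ratio sits near 1/3 (one miss per three accesses), and with an effectively unbounded cache it drops to roughly (1/(3b))(1 + 2/N), matching the two models.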
6. Original code: transition point

(figure: the N×N array A, with cache lines running along the columns)

Transition out of the large cache model happens when you start getting capacity misses. This happens when bN > C. For the Itanium, this means 16N > 32K, i.e., N > 2K.
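The transition-point arithmetic is easy to check with the slide's Itanium-2 numbers:

```python
# Slide's Itanium-2 parameters, expressed in doubles:
b = 16          # line size: 128 bytes / 8 bytes per double
C = 32 * 1024   # L2 capacity: 256 KB / 8 bytes per double

# The large cache model breaks down once b*N exceeds C:
N_transition = C // b
```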
7. Miss ratios
8. Height reduction

(figure: the N×N array A, walked along anti-diagonals)

After height reduction, we walk the array along anti-diagonals. Between successive touches to elements on the same cache line, the reuse distance is O(N). However, we can exploit temporal locality, so the reference A(I-1,J+1) on the RHS will be a hit, giving a miss ratio ≈ 1/3. The transition to the small cache model happens at N ≈ C/b. However, the miss ratio increases more gradually than in the original code, because capacity misses occur on more and more diagonals as the size of the matrix grows.
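One plausible form of the height-reduced loop nest (an interpretation of the slide: outer loop over anti-diagonals D = I+J, inner walk in increasing I so that A(I-1,J+1) was produced on the previous inner step), checked against the original order:

```python
def wavefront_original(A, N):
    # Reference order from the earlier slide.
    for I in range(2, N + 1):
        for J in range(1, N):
            A[(I, J)] += A[(I - 1, J + 1)]
    return A

def wavefront_skewed(A, N):
    # Walk anti-diagonals D = I + J. The value A(I-1,J+1) read here was
    # updated on the previous inner iteration (same D, I one smaller),
    # so that RHS reference is a temporal-locality hit.
    for D in range(3, 2 * N):                              # D = I + J
        for I in range(max(2, D - N + 1), min(N, D - 1) + 1):
            J = D - I
            A[(I, J)] += A[(I - 1, J + 1)]
    return A
```

Both orders respect the dependence (it stays within one anti-diagonal), so they compute identical results.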
9. Miss ratios
10. Height reduction + data transformation

(figure: the N×N array A, stored by anti-diagonals)

Here we can exploit both temporal and spatial locality, so the small cache miss ratio is 1/(3b). However, the data layout is more complex, and more address arithmetic needs to be done.
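To make the "more address arithmetic" concrete, here is one hypothetical diagonal layout (the slide does not specify the exact mapping): elements with equal D = I+J are stored contiguously, ordered by increasing I. Even this simple variant needs per-diagonal start offsets:

```python
def diag_offset(I, J, N):
    # Hypothetical anti-diagonal layout for an N x N array, 1-based indices.
    # Elements on diagonal D = I + J are contiguous, ordered by increasing I.
    # The actual layout used in the experiments may differ.
    D = I + J
    lo = max(1, D - N)   # smallest row index appearing on diagonal D
    # total number of elements on diagonals 2 .. D-1 (the start offset):
    start = sum(min(d - 1, N) - max(1, d - N) + 1 for d in range(2, D))
    return start + (I - lo)
```

Walking a diagonal is stride-1 in this layout (hence the 1/(3b) small cache miss ratio), but computing an element's address involves D, the diagonal start, and the diagonal's first row, versus a single multiply-add for row-major order.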
11. Miss ratios
12. Row-major storage order

(figure: the N×N array A, with cache lines running along the rows)

Here, successive reads and successive writes are stride-1 accesses, so the small cache miss ratio is 1/(3b). The address arithmetic is simple.
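The same kind of simulation as before (again a fully associative LRU sketch, an assumption) run with a row-major layout shows the stride-1 effect: roughly one miss per b-element line over three accesses per iteration, i.e., about 1/(3b):

```python
from collections import OrderedDict

def simulate_row_major(N, b, cap_lines):
    # Row-major layout: element (I,J) lives at linear address (I-1)*N + (J-1).
    cache = OrderedDict()
    misses = accesses = 0

    def touch(i, j):
        nonlocal misses, accesses
        line = ((i - 1) * N + (j - 1)) // b
        accesses += 1
        if line in cache:
            cache.move_to_end(line)       # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cap_lines:
                cache.popitem(last=False)  # evict least recently used

    for I in range(2, N + 1):
        for J in range(1, N):
            touch(I, J)          # read A(I,J)   -- stride 1 along row I
            touch(I - 1, J + 1)  # read A(I-1,J+1) -- stride 1 along row I-1
            touch(I, J)          # write A(I,J)
    return misses / accesses
```

As long as the cache holds a couple of rows, the measured ratio sits near 1/48 ≈ 0.021 for b = 16.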
13. Miss ratios
14. Execution times

L3 cache misses cause the PAPI counters to overflow, producing weird results.
15. Question

- Why would you do the unimodular transformation + diagonal data transformation rather than use a simple data transformation (row-major order)?
- Answer: on a multicore, you will need much less synchronization
  - in the original code with row-major storage order, there is no parallel loop
  - in the code after the unimodular transformation + diagonal data transformation, the outer loop is parallel and you have good uniprocessor locality
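The parallelism claim can be sanity-checked: the dependence from A(I-1,J+1) to A(I,J) leaves I+J unchanged, so every dependence stays within one anti-diagonal and the diagonals can be processed in any order. A sketch that shuffles the diagonal order and checks the result is unchanged:

```python
import random

def wavefront_by_diagonals(A, N, order=None):
    # Process anti-diagonals D = I + J in an arbitrary order; within a
    # diagonal, I must still increase (the only dependence is intra-diagonal).
    diags = list(range(3, 2 * N)) if order is None else order
    for D in diags:
        for I in range(max(2, D - N + 1), min(N, D - 1) + 1):
            A[(I, D - I)] += A[(I - 1, D - I + 1)]
    return A
```

Since diagonals share no array elements with each other under this update, a parallel-for over D needs no synchronization beyond the end-of-loop barrier.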