Title: Wavefront Code
1. Wavefront Code
2. Machine
- Itanium-2 (sequoia at Cornell)
- L1 cache is not used for floating-point data (doubles)
- L2 cache characteristics
  - capacity: 256 KB = 32K doubles
  - line size: 128 bytes = 16 doubles
3. Original code

for I = 2, N
  for J = 1, N-1
    A(I,J) = A(I,J) + A(I-1,J+1)

(figure: the iteration space of the loop nest)
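A runnable sketch of this kernel may help; here a Python dict keyed by 1-based (I,J) stands in for the Fortran-style array (the `+` update operator is consistent with the three accesses per iteration counted on the later slides):

```python
def make_A(N, val=1.0):
    # 1-based N x N array, modeled as a dict keyed by (I, J).
    return {(i, j): val for i in range(1, N + 1) for j in range(1, N + 1)}

def wavefront(A, N):
    # Original wavefront kernel from the slide:
    #   for I = 2, N
    #     for J = 1, N-1
    #       A(I,J) = A(I,J) + A(I-1,J+1)
    for I in range(2, N + 1):
        for J in range(1, N):
            A[(I, J)] = A[(I, J)] + A[(I - 1, J + 1)]
    return A
```

Note the flow dependence: iteration (I,J) reads the value that iteration (I-1,J+1) wrote, which is what makes the inner loop sequential along anti-diagonals.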
4. Original code: large cache model

(figure: the N×N array A, 1..N in each dimension)

Large cache model: the number of elements touched is N·N - 2. Each miss brings in b elements (16 for the Itanium), so the number of misses is (N·N - 2)/b. The miss ratio is therefore
(N·N - 2) / (3b(N-1)(N-1)) ≈ (1/(3b))(1 + 2/N) ≈ 0.021(1 + 2/N) for the Itanium.
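The arithmetic on this slide can be checked directly; the functions below evaluate the exact large-cache miss ratio and the slide's (1/(3b))(1 + 2/N) approximation:

```python
def large_cache_miss_ratio(N, b):
    # Exact ratio: (N*N - 2) elements touched, one miss per b elements,
    # 3 accesses per iteration over an (N-1) x (N-1) iteration space.
    return (N * N - 2) / (3 * b * (N - 1) * (N - 1))

def approx_ratio(N, b):
    # Slide's approximation: (1/(3b)) * (1 + 2/N).
    return (1.0 / (3 * b)) * (1 + 2.0 / N)
```

For b = 16 the leading constant 1/(3b) = 1/48 ≈ 0.021.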
5. Original code: small cache model

(figure: the N×N array A, with cache lines running along the columns)

Small cache model:
- to compute A(I,J), we bring in a line of the Jth column (holding A(I,J)) and a line of the (J+1)th column (holding A(I-1,J+1))
- so when computing A(I,J+1), we get a hit on the accesses to A(I,J+1) and a miss for A(I-1,J+2)
So the small cache miss ratio is 1/3. The miss ratio is independent of b: we are unable to exploit spatial locality.
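This model can be sanity-checked with a small cache simulator. A sketch, assuming a fully associative LRU cache (the real L2 is set-associative) and column-major storage, run over the original loop order:

```python
from collections import OrderedDict

def simulate(N, b, cap_lines):
    # Fully associative LRU cache of cap_lines lines; column-major layout,
    # so element (I,J) lives at linear address (J-1)*N + (I-1).
    cache = OrderedDict()
    misses = accesses = 0

    def touch(i, j):
        nonlocal misses, accesses
        line = ((j - 1) * N + (i - 1)) // b
        accesses += 1
        if line in cache:
            cache.move_to_end(line)      # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cap_lines:
                cache.popitem(last=False)  # evict least recently used

    for I in range(2, N + 1):
        for J in range(1, N):
            touch(I, J)          # read A(I,J)
            touch(I - 1, J + 1)  # read A(I-1,J+1)
            touch(I, J)          # write A(I,J)
    return misses / accesses
```

With a tiny cache the ratio sits near 1/3 (one miss per three accesses), and with an effectively unbounded cache it drops to roughly (1/(3b))(1 + 2/N), matching the two models.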
6. Original code: transition point

(figure: the N×N array A, with cache lines running along the columns)

Transition out of the large cache model happens when you start getting capacity misses. This happens when bN > C. For the Itanium, this means 16N > 32K, i.e., N > 2K.
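The transition-point arithmetic is easy to check with the slide's Itanium-2 numbers:

```python
# Slide's Itanium-2 parameters, expressed in doubles:
b = 16          # line size: 128 bytes / 8 bytes per double
C = 32 * 1024   # L2 capacity: 256 KB / 8 bytes per double

# The large cache model breaks down once b*N exceeds C:
N_transition = C // b
```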
7. Miss ratios
8. Height reduction

(figure: the N×N array A, walked along anti-diagonals)

After height reduction, we walk the array along anti-diagonals. Between successive touches to elements on the same cache line, the reuse distance is O(N). However, we can exploit temporal locality, so the reference A(I-1,J+1) on the RHS will be a hit, giving a miss ratio ≈ 1/3. The transition to the small cache model happens at N ≈ C/b. However, the miss ratio increases more gradually than in the original code, because capacity misses occur on more and more diagonals as the size of the matrix grows.
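One plausible form of the height-reduced loop nest (an interpretation of the slide: outer loop over anti-diagonals D = I+J, inner walk in increasing I so that A(I-1,J+1) was produced on the previous inner step), checked against the original order:

```python
def wavefront_original(A, N):
    # Reference order from the earlier slide.
    for I in range(2, N + 1):
        for J in range(1, N):
            A[(I, J)] += A[(I - 1, J + 1)]
    return A

def wavefront_skewed(A, N):
    # Walk anti-diagonals D = I + J. The value A(I-1,J+1) read here was
    # updated on the previous inner iteration (same D, I one smaller),
    # so that RHS reference is a temporal-locality hit.
    for D in range(3, 2 * N):                              # D = I + J
        for I in range(max(2, D - N + 1), min(N, D - 1) + 1):
            J = D - I
            A[(I, J)] += A[(I - 1, J + 1)]
    return A
```

Both orders respect the dependence (it stays within one anti-diagonal), so they compute identical results.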
9. Miss ratios
10. Height reduction + data transformation

(figure: the N×N array A, stored by anti-diagonals)

Here we can exploit both temporal and spatial locality, so the small cache miss ratio is 1/(3b). However, the data layout is more complex, and more address arithmetic needs to be done.
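To make the "more address arithmetic" concrete, here is one hypothetical diagonal layout (the slide does not specify the exact mapping): elements with equal D = I+J are stored contiguously, ordered by increasing I. Even this simple variant needs per-diagonal start offsets:

```python
def diag_offset(I, J, N):
    # Hypothetical anti-diagonal layout for an N x N array, 1-based indices.
    # Elements on diagonal D = I + J are contiguous, ordered by increasing I.
    # The actual layout used in the experiments may differ.
    D = I + J
    lo = max(1, D - N)   # smallest row index appearing on diagonal D
    # total number of elements on diagonals 2 .. D-1 (the start offset):
    start = sum(min(d - 1, N) - max(1, d - N) + 1 for d in range(2, D))
    return start + (I - lo)
```

Walking a diagonal is stride-1 in this layout (hence the 1/(3b) small cache miss ratio), but computing an element's address involves D, the diagonal start, and the diagonal's first row, versus a single multiply-add for row-major order.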
11. Miss ratios
12. Row-major storage order

(figure: the N×N array A, with cache lines running along the rows)

Here, successive reads and successive writes are stride-1 accesses, so the small cache miss ratio is 1/(3b). The address arithmetic is simple.
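The same kind of simulation as before (again a fully associative LRU sketch, an assumption) run with a row-major layout shows the stride-1 effect: roughly one miss per b-element line over three accesses per iteration, i.e., about 1/(3b):

```python
from collections import OrderedDict

def simulate_row_major(N, b, cap_lines):
    # Row-major layout: element (I,J) lives at linear address (I-1)*N + (J-1).
    cache = OrderedDict()
    misses = accesses = 0

    def touch(i, j):
        nonlocal misses, accesses
        line = ((i - 1) * N + (j - 1)) // b
        accesses += 1
        if line in cache:
            cache.move_to_end(line)       # refresh LRU position
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cap_lines:
                cache.popitem(last=False)  # evict least recently used

    for I in range(2, N + 1):
        for J in range(1, N):
            touch(I, J)          # read A(I,J)   -- stride 1 along row I
            touch(I - 1, J + 1)  # read A(I-1,J+1) -- stride 1 along row I-1
            touch(I, J)          # write A(I,J)
    return misses / accesses
```

As long as the cache holds a couple of rows, the measured ratio sits near 1/48 ≈ 0.021 for b = 16.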
13. Miss ratios
14. Execution times

L3 cache misses cause the PAPI counters to overflow, producing weird results.
15. Question

- Why would you do the unimodular transformation + diagonal data transformation rather than use a simple data transformation (row-major order)?
- Answer: on a multicore, you will need much less synchronization
  - in the original code with row-major storage order, there is no parallel loop
  - in the code after the unimodular transformation + diagonal data transformation, the outer loop is parallel and you have good uniprocessor locality
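The parallelism claim can be sanity-checked: the dependence from A(I-1,J+1) to A(I,J) leaves I+J unchanged, so every dependence stays within one anti-diagonal and the diagonals can be processed in any order. A sketch that shuffles the diagonal order and checks the result is unchanged:

```python
import random

def wavefront_by_diagonals(A, N, order=None):
    # Process anti-diagonals D = I + J in an arbitrary order; within a
    # diagonal, I must still increase (the only dependence is intra-diagonal).
    diags = list(range(3, 2 * N)) if order is None else order
    for D in diags:
        for I in range(max(2, D - N + 1), min(N, D - 1) + 1):
            A[(I, D - I)] += A[(I - 1, D - I + 1)]
    return A
```

Since diagonals share no array elements with each other under this update, a parallel-for over D needs no synchronization beyond the end-of-loop barrier.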