Wavefront Code - PowerPoint PPT Presentation

1
Wavefront Code
2
Machine
  • Itanium-2 (sequoia at Cornell)
  • the L1 cache is invisible to floating-point data: floats and doubles bypass it and are cached starting at L2
  • L2 cache characteristics:
    • capacity: 256 KB = 32K doubles
    • line size: 128 bytes = 16 doubles

3
Original code
for I = 2, N
  for J = 1, N-1
    A(I,J) = A(I,J) + A(I-1,J+1)
[figure: iteration space]
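A runnable version of the loop above (a Python sketch; the 1-indexed padding and the initial values 10*I + J are my own choices, picked so the updates are easy to follow):

```python
N = 3

# 1-indexed (N+1) x (N+1) array; row 0 and column 0 are unused padding.
A = [[10 * i + j for j in range(N + 1)] for i in range(N + 1)]

# Original wavefront loop: A(I,J) = A(I,J) + A(I-1,J+1)
for I in range(2, N + 1):
    for J in range(1, N):
        A[I][J] += A[I - 1][J + 1]
```

Note the dependence chain: A(3,1) adds the already-updated A(2,2), which is what makes this a wavefront computation.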
4
Original code: large cache model
[figure: the array A, 1..N x 1..N]
Large cache model: number of elements touched = N^2 - 2. Each miss brings in b elements (= 16 for the Itanium), so the number of misses is about (N^2 - 2)/b. With 3 references per iteration over the (N-1) x (N-1) iteration space, the miss ratio is
(N^2 - 2) / (3b(N-1)^2) ≈ (1/3b)(1 + 2/N) ≈ 0.021(1 + 2/N) for the Itanium
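The algebra can be checked numerically; a small sketch (the function name and the choice of N are mine):

```python
def large_cache_miss_ratio(N, b):
    # N^2 - 2 distinct elements touched, each miss loads b elements,
    # 3 references per iteration over an (N-1) x (N-1) iteration space.
    return (N * N - 2) / (3 * b * (N - 1) ** 2)

b = 16        # doubles per L2 line on the Itanium-2
N = 10_000
exact = large_cache_miss_ratio(N, b)
approx = (1 / (3 * b)) * (1 + 2 / N)   # the slide's approximation
```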
5
Original code: small cache model
[figure: the array A, with cache lines running down the columns]
Small cache model: to compute A(I,J), we bring in a line of the J-th column (for A(I,J)) and a line of the (J+1)-th column (for A(I-1,J+1)). So when computing A(I,J+1), we get a hit on the accesses to A(I,J+1) and a miss for A(I-1,J+2). So the small cache miss ratio = 1/3. The miss ratio is independent of b: we are unable to exploit spatial locality.
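The 1/3 figure can be sanity-checked by simulation. The sketch below is my own construction (an idealized fully-associative LRU cache, not the Itanium's actual L2): it replays the three references per iteration over column-major addresses and counts misses. Line-boundary effects push the measured ratio slightly above 1/3.

```python
from collections import OrderedDict

def simulate(N, b, cache_lines):
    """Miss ratio of the original loop under column-major storage,
    with an idealized fully-associative LRU cache of cache_lines lines."""
    cache = OrderedDict()
    accesses = misses = 0

    def touch(i, j):  # one access to element A(i,j), 1-indexed
        nonlocal accesses, misses
        accesses += 1
        line = ((j - 1) * N + (i - 1)) // b  # column-major line number
        if line in cache:
            cache.move_to_end(line)
        else:
            misses += 1
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)    # evict the LRU line

    for I in range(2, N + 1):
        for J in range(1, N):
            touch(I, J)          # read  A(I,J)
            touch(I - 1, J + 1)  # read  A(I-1,J+1)
            touch(I, J)          # write A(I,J)
    return misses / accesses

# Small cache regime: b*N = 2048 doubles of live data vs. a 512-double cache.
ratio = simulate(N=128, b=16, cache_lines=32)
```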
6
Original code: transition point
[figure: the array A, with cache lines running down the columns]
Transition out of the large cache model happens when you start getting capacity misses. This happens when bN > C. For the Itanium, this means 16N > 32K, so N > 2K.
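The slide's arithmetic as a quick check (all capacities in doubles):

```python
C = 256 * 1024 // 8   # L2 capacity: 256 KB = 32K doubles
b = 16                # line size: 128 bytes = 16 doubles
# Capacity misses begin once b*N > C, i.e. once N exceeds C/b:
N_transition = C // b
```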
7
Miss ratios
8
Height reduction
[figure: the array A traversed along anti-diagonals]
After height reduction, we walk the array along anti-diagonals. Between successive touches to elements on the same cache line, the reuse distance is O(N). However, we can exploit temporal locality, so the reference A(I-1,J+1) on the RHS will be a hit, and the miss ratio ≈ 1/3. The transition to the small cache model will happen at N ≈ C/b. However, the miss ratio will increase more gradually than in the original code, because capacity misses happen for more and more diagonals as the size of the matrix increases.
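The anti-diagonal walk is a loop skewing, and it can be checked against the original loop. The bounds below are my own derivation (not spelled out on the slide): the dependence (I-1, J+1) -> (I, J) keeps c = I + J constant, so each diagonal is a sequential chain traversed in increasing I, and the two loops compute identical results.

```python
import random

N = 50

def original(A):
    for I in range(2, N + 1):
        for J in range(1, N):
            A[I][J] += A[I - 1][J + 1]

def height_reduced(A):
    # Walk anti-diagonals c = I + J; within a diagonal, increasing I
    # respects the dependence (I-1, J+1) -> (I, J).
    for c in range(3, 2 * N):
        for I in range(max(2, c - (N - 1)), min(N, c - 1) + 1):
            J = c - I
            A[I][J] += A[I - 1][J + 1]

base = [[random.random() for _ in range(N + 1)] for _ in range(N + 1)]
A1 = [row[:] for row in base]
A2 = [row[:] for row in base]
original(A1)
height_reduced(A2)
```

Because every element is written at most once and each read sees the same (already final) value under both schedules, the results match bitwise, not just approximately.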
9
Miss ratios
10
Height reduction + data transformation
[figure: the array stored by anti-diagonals]
Here we can exploit both temporal and spatial locality: the small cache miss ratio = 1/(3b). However, the data layout is more complex and there is more address arithmetic that needs to be done.
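The slides do not give the exact diagonal layout; one plausible mapping (my assumption) stores each anti-diagonal c = I + J contiguously, in increasing c and, within a diagonal, increasing I:

```python
N = 8

def diag_addr(I, J, N):
    """Linearized address of A(I,J), 1 <= I,J <= N: anti-diagonals
    c = I+J stored contiguously, each in order of increasing I."""
    c = I + J
    # Total size of diagonals 2 .. c-1 (diagonal d holds
    # min(d-1, 2N+1-d) elements):
    start = sum(min(d - 1, 2 * N + 1 - d) for d in range(2, c))
    I_min = max(1, c - N)
    return start + (I - I_min)

addrs = {diag_addr(i, j, N)
         for i in range(1, N + 1) for j in range(1, N + 1)}
```

Under this layout the read A(I-1,J+1) sits at the address immediately before A(I,J), so both reference streams of the anti-diagonal walk are stride-1, which is what lets the transformed code exploit spatial locality. The address arithmetic (the per-diagonal start offsets) is visibly heavier than row- or column-major indexing.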
11
Miss ratios
12
Row-major storage order
[figure: the array A, with cache lines running along the rows]
Here, successive reads and successive writes have stride-1 access, so the small cache miss ratio = 1/(3b). The address arithmetic is simple.
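A quick check of the stride claim, using the row-major address map (1-indexed; the function name and sample indices are mine):

```python
N = 1024

def rm_addr(I, J):
    # Row-major, 1-indexed: elements of a row are contiguous.
    return (I - 1) * N + (J - 1)

# The inner loop increments J, so both the A(I,J) stream and the
# A(I-1,J+1) stream advance by exactly one element per iteration.
stride_write = rm_addr(3, 6) - rm_addr(3, 5)
stride_read  = rm_addr(2, 7) - rm_addr(2, 6)
```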
13
Miss ratios
14
Execution times
Note: L3 cache misses cause PAPI counters to overflow, producing weird results.
15
Question
  • So why would you use the unimodular transformation +
    diagonal data transformation rather than a
    simple data transformation (row-major order)?
  • Answer: on a multicore, you will need much less
    synchronization:
  • in the original code with row-major storage order, there is
    no parallel loop
  • in the code after the unimodular transformation +
    diagonal data transformation, the outer loop is
    parallel and you have good uniprocessor locality
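The parallelism claim can be sanity-checked without threads: the dependence (I-1, J+1) -> (I, J) never crosses anti-diagonals, so the diagonals can run in any order. The sketch below (loop bounds are my derivation) processes them forward and in reverse, standing in for an arbitrary parallel schedule, and gets identical results:

```python
import random

N = 40
base = [[random.random() for _ in range(N + 1)] for _ in range(N + 1)]

def run(A, diagonals):
    # Each anti-diagonal c = I+J is an independent sequential chain.
    for c in diagonals:
        for I in range(max(2, c - (N - 1)), min(N, c - 1) + 1):
            J = c - I
            A[I][J] += A[I - 1][J + 1]

A_fwd = [row[:] for row in base]
A_rev = [row[:] for row in base]
run(A_fwd, range(3, 2 * N))
run(A_rev, reversed(range(3, 2 * N)))
```

Only the inner (within-diagonal) loop must stay sequential, so the outer loop needs just one synchronization point per diagonal on a multicore, versus no parallel loop at all in the original row-major code.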