Title: Loop Tiling for Iterative Stencil Computations
1. Loop Tiling for Iterative Stencil Computations
2. What is an Iterative Stencil Computation?
Matrix A
DO K = 1, NITER        /* time-step loop */
  do J ...
    do I ...
      A(I,J), A(I+1,J), ...
    enddo
  enddo
  wrapped-around computations
ENDDO
- Iterative stencil computations (ISC) are often performed in PDE, GM, and IP applications
- swim, tomcatv, mgrid (from the SPEC95 benchmarks)
- Jacobi
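To make the loop structure above concrete, here is a minimal Python sketch of a Jacobi iteration. It assumes a 4-point averaging stencil and fixed boundary values; the function name `jacobi` and the list-of-lists grid are illustrative choices, not taken from the slides.

```python
def jacobi(grid, niter):
    """Run `niter` Jacobi sweeps over a 2D grid (list of lists).

    Assumption: boundary cells stay fixed, and each interior cell becomes
    the average of its four neighbours from the previous time step.
    """
    n, m = len(grid), len(grid[0])
    cur = [row[:] for row in grid]
    for _ in range(niter):                 # time-step loop (DO K)
        nxt = [row[:] for row in cur]      # second array holds the new values
        for j in range(1, n - 1):          # do J
            for i in range(1, m - 1):      # do I
                nxt[j][i] = 0.25 * (cur[j - 1][i] + cur[j + 1][i]
                                    + cur[j][i - 1] + cur[j][i + 1])
        cur = nxt
    return cur
```

Every time step sweeps the whole matrix, so for a matrix larger than the cache nothing is reused between consecutive values of K. That is the locality problem tiling tries to address.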
3. Loop Tiling
- Loop Tiling
  - divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
  - can be applied hierarchically (Multilevel Tiling)
- Current algorithms for Loop Tiling are limited to loops that
  - are perfectly nested
  - are fully permutable
  - define a rectangular IS
- However, in iterative stencil computations, loops are
  - NOT perfectly nested
  - NOT fully permutable
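For reference, the classic case that existing tiling algorithms handle (a perfectly nested, fully permutable, rectangular loop nest) can be sketched as follows. The helper name `tiled_iteration_order` and the tile-size parameters `tj`, `ti` are hypothetical; the sketch only records the order in which iterations are visited.

```python
def tiled_iteration_order(n, m, tj, ti):
    """Visit the rectangular iteration space [0,n) x [0,m) tile by tile."""
    order = []
    for jj in range(0, n, tj):                       # loop over tiles in J
        for ii in range(0, m, ti):                   # loop over tiles in I
            for j in range(jj, min(jj + tj, n)):     # iterations inside a tile
                for i in range(ii, min(ii + ti, m)):
                    order.append((j, i))
    return order
```

The tiled nest visits exactly the same iterations as the original one, only in a different order. That reordering is legal only when the loops are fully permutable, which is why iterative stencil computations need extra work.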
4. Today's talk
- Show how Loop Tiling can be applied to iterative stencil computations
- based on Song & Li's paper (PLDI'99)
- define a Program Model
- 1 level of 1D-Tiling (cache)
- program example: SWIM
- 2 levels of Tiling
- 2D-Tiling at the cache level
- 1D-Tiling at the register level (based on Jiménez et al., ICS'98 and HPCA'98)
- Performance Results
- Loop Tiling on EV5 and EV6
5. Steps
- Step 1: Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
- Step 2: Perform 2D-Tiling for the Cache Level
- Step 3: Perform 1D-Tiling for the Register Level
6. 1st Step: achieve the desired program model
DO K = 1, NITER        /* time-step loop */
  do J1 = LJ1, UJ1
    do I1 = LI1, UI1
      A(I,J), A(I+1,J), ...
    enddo
  enddo
  . . .
  do Jm = LJm, UJm
    do Im = LIm, UIm
      A(I,J), A(I+1,J), ...
    enddo
  enddo
ENDDO
- Usually, programs are NOT directly written in this form
- We must apply a set of transformations to achieve this program model
7. SWIM original code
SUBROUTINE CALCX
  do J = 1, N
    do I = 1, M
      ...
    enddo
  enddo
c wrapped-around computations
  do J = 1, N
    ...
  enddo
  do I = 1, M
    ...
  enddo
  ...

   ... initializations ...
90 NCYCLE = NCYCLE + 1
   CALL CALC1
   CALL CALC2
   IF (NCYCLE >= ITMAX) STOP
   IF (NCYCLE <= 1) THEN
     CALL CALC3Z
   ELSE
     CALL CALC3
   ENDIF
   GO TO 90
- Transformations
  - Inline subroutines
  - Convert GO TO into DO-loop
  - Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
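The effect of converting the GO TO into a DO-loop and peeling can be checked with a small Python sketch that records only the sequence of calls. It assumes the two tests are `NCYCLE >= ITMAX` and `NCYCLE <= 1` (which matches a peeled loop running from K = 2 to ITMAX-1) and that ITMAX is at least 2; both function names are hypothetical.

```python
def original_control_flow(itmax):
    """Call trace of the GO TO version (assumes itmax >= 1)."""
    trace = []
    ncycle = 0
    while True:                          # label 90 ... GO TO 90
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:              # assumed comparison, see lead-in
            return trace                 # STOP
        trace.append("CALC3Z" if ncycle <= 1 else "CALC3")


def peeled_do_loop(itmax):
    """Same trace with the first and last time steps peeled (assumes itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]     # peeled first step, K = 1
    for _k in range(2, itmax):               # DO K = 2, ITMAX-1: no IFs left
        trace += ["CALC1", "CALC2", "CALC3"]
    trace += ["CALC1", "CALC2"]              # peeled last step, K = ITMAX
    return trace
```

After peeling, the body of the remaining DO-loop is a straight sequence of loop nests, which is the shape the program model requires.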
8. Wrapped-around Computations
DO K = 2, ITMAX-1
  do J = 1, N
    do I = 1, M
      ...
    enddo
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  do I = 1, M
    ...
  enddo
  ...
  do J = 1, N
    do I = 1, M
      ...
    enddo
  enddo
  ...
  ...
ENDDO
[Figure: iteration spaces of CALC1, CALC2, and CALC3 on the J and I axes]
9. Wrapped-around Computations
- Projection along direction I
DO K = 2, ITMAX-1
  do J = 1, N
    ...
  enddo
c wrapped-around comp
  do J = 1, N
    ...
  enddo
  do J = 1, N
    ...
  enddo
c wrapped-around comp
  do J = 1, N
    ...
  enddo
  ...
ENDDO
[Figure: iteration space along the J axis]
- Another way of dealing with the wrapped-around computations is to perform code sinking
10. 1st Step: achieved program model
- Flow dependencies in the iteration space of SWIM (projection along direction I)
DO K = 2, ITMAX-1
  do J = 1, N
    ...
  enddo
  wrapped-around
  do J = 1, N
    ...
  enddo
  wrapped-around
  do J = 1, N
    ...
  enddo
  wrapped-around
ENDDO
[Figure: flow dependencies among the CALC1, CALC2, and CALC3 loops over J = 1..N, for time steps K=2 and K=3 of the K-loop (time)]
11. Steps
- Step 1: Apply a set of transformations to the original program to achieve the program model defined by Song & Li
- Step 2: Perform 2D-Tiling for the Cache Level
- Step 3: Perform 1D-Tiling for the Register Level
12. 1D-Tiling
[Figure: two diagrams of the J iteration space (1..N) over time steps K=2, K=3, K=4, annotated with SLOPE and OFFSET-i]
- Dependencies are violated
- Tiling parameters: SLOPE, OFFSET-i
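The role of SLOPE can be illustrated on a much simpler program than SWIM. The Python sketch below tiles a 1D three-point Jacobi iteration (a single loop nest inside the time-step loop, two ping-pong arrays) with tiles that shift left by `slope` positions at every time step. The stencil, the function names, and the absence of per-loop offsets are simplifying assumptions and are not the SWIM code; for this stencil a slope of at least 1 is enough to respect the dependencies.

```python
def jacobi_reference(a, steps):
    """Untiled 1D Jacobi with fixed end points."""
    cur = list(a)
    n = len(cur)
    for _ in range(steps):
        nxt = list(cur)
        for j in range(1, n - 1):
            nxt[j] = (cur[j - 1] + cur[j] + cur[j + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_skewed_tiles(a, steps, tile, slope=1):
    """Same computation, tiled along J with tiles skewed by `slope` per step."""
    n = len(a)
    buf = [list(a), list(a)]             # ping-pong arrays; end points stay fixed
    # Skewed coordinate: jx = j + slope*(t-1), for t = 1..steps and j = 1..n-2.
    max_jx = (n - 2) + slope * (steps - 1)
    for jj in range(1, max_jx + 1, tile):            # loop over tiles
        for t in range(1, steps + 1):                # time steps inside one tile
            shift = slope * (t - 1)
            lo = max(1, jj - shift)
            hi = min(n - 2, jj + tile - 1 - shift)
            src, dst = buf[(t - 1) % 2], buf[t % 2]
            for j in range(lo, hi + 1):
                dst[j] = (src[j - 1] + src[j] + src[j + 1]) / 3.0
    return buf[steps % 2]
```

With `slope=0` the tiles would be rectangular in the (K, J) plane and a tile would read neighbours that have not yet been computed, which is one way of reading "dependencies are violated".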
13. 2D-Tiling
[Figure: tiled J (1..N) and I (1..M) iteration spaces along the K (time-step) loop]
- Tiling parameters: SLOPE and OFFSET-i for each tiled dimension (J and I)
- Computed using the JI-loop distance subgraph
14. JI-loop Distance Subgraph
[Figure: distance subgraph with nodes JI1-loop, JI2-loop, and JI3-loop; edges are flow dependencies and anti-dependencies labeled with distance vectors such as (0,0,0), (1,0,0), (1,-1,0), (1,0,-1), and (1,-1,-1)]
- Each node represents a JI-loop nest
- Each edge represents a dependence (distance vector)
15. Wrapped-around Computations
- SWIM: projection along direction I
DO K = 2, ITMAX-1
  do J = 1, N
    ...
  enddo
  wrapped-around
  do J = 1, N
    ...
  enddo
  wrapped-around
  do J = 1, N
    ...
  enddo
  wrapped-around
ENDDO
[Figure: dependencies over J = 1..N for time steps K=2 and K=3 of the K-loop (time)]
- Backward dependencies with large distances make tiling unprofitable
- Apply Circular Loop Skewing (CLS) to shorten backward dependencies
16. Circular Loop Skewing
- Shortens backward dependencies by changing the iteration order
[Figure: J iteration space (1..N) at time steps K=2 and K=3, annotated with BETA-i and DELTA]
- CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)
17. Circular Loop Skewing
DO K = 2, ITMAX-1
  do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
ENDDO
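A small Python sketch shows what the skewed loop header does to the iteration order. It assumes the bounds are 1+BETA+DELTA*(K-2) and N+BETA+DELTA*(K-2), with J recovered as MOD(JX-1, N)+1; the helper name `cls_order` is hypothetical.

```python
def cls_order(n, beta, delta, k):
    """Order in which J = 1..n is visited at time step k under CLS."""
    start = 1 + beta + delta * (k - 2)               # lower bound of the JX loop
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]


# With N = 8, BETA = 1, DELTA = 2, time step K = 3 starts at J = 4 and wraps.
print(cls_order(8, 1, 2, 3))   # [4, 5, 6, 7, 8, 1, 2, 3]
```

Each time step still executes every J exactly once; only the starting point rotates, which is what brings the source and sink of a wrapped-around dependence close together in the new order.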
18. 2nd Step: 2D-Tiling for the cache level
- SWIM: projection along direction I
- CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
- Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0
DO JJ ... DO II ... DO K ...
  if (first tile) then
    do JX ...
      offsets iter.
    enddo
  endif
  do JX ...
    Iter. inside tile
  enddo
  do JX ...
    Iter. inside tile
  enddo
  do JX ...
    Iter. inside tile
  enddo
ENDDO
[Figure: J iteration space (1..N) of loops 1, 2, and 3 at time steps K=2, K=3, and K=4]
19. Steps
- Step 1: Apply a set of transformations to the original program to achieve the program model defined by Song & Li
- Step 2: Perform 2D-Tiling for the Cache Level
- Step 3: Perform 1D-Tiling for the Register Level
20. 3rd Step: 1D-Tiling for the register level
DO JJ ... DO II ... DO K ...
  ...
  do JX = LJ, UJ
    J = MOD(JX-1, N) + 1
    do IX = LI, UI
      I = MOD(IX-1, M) + 1
      loop body (I,J)
    enddo
  enddo
  ...
ENDDO
[Figure: J (1..N) by I (1..M) iteration space with the unrolled iterations marked]
- The MOD operation introduced by CLS prevents us from fully unrolling the loop
- First apply Index Set Splitting (ISS) to loop J
21. Index Set Splitting
- ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space
DO JJ ... DO II ... DO K ...
  ...
  do JX = LJ, min(N,UJ)
    J = JX
    do IX ...
    enddo
  enddo
  do JX = max(N+1,LJ), UJ
    J = JX-N
    do IX ...
    enddo
  enddo
  ...
ENDDO
[Figure: J (1..N) by I (1..M) iteration space split by ISS]
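The equivalence between the MOD form and the split form can be checked with a short Python sketch. It assumes 1 <= LJ and UJ <= 2N, so that JX wraps around at most once; both function names are hypothetical.

```python
def j_values_with_mod(lj, uj, n):
    """J values produced by the single loop with J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def j_values_with_iss(lj, uj, n):
    """J values produced by the two loops obtained by index set splitting."""
    values = []
    for jx in range(lj, min(n, uj) + 1):         # first loop: J = JX
        values.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):     # second loop: J = JX - N
        values.append(jx - n)
    return values
```

Neither of the two new loops contains a MOD, so J advances by exactly one per iteration and each loop can be unrolled in the usual way.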
22. 3rd Step: 1D-Tiling for the register level
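Register-level tiling of loop J amounts to unrolling J and fusing the copies inside the I loop, so that values shared by neighbouring J iterations are loaded once. The Python sketch below shows only the loop structure: it uses a tile size of 2 instead of the TS = 4 reported later, a made-up three-point stencil along J, and local variables `r0`..`r3` standing in for registers, all of which are assumptions for illustration.

```python
def column_stencil(a):
    """Reference: b[j][i] = a[j-1][i] + a[j][i] + a[j+1][i] for interior j."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    for j in range(1, n - 1):
        for i in range(m):
            b[j][i] = a[j - 1][i] + a[j][i] + a[j + 1][i]
    return b


def column_stencil_register_tiled(a):
    """Same computation with loop J tiled by 2 and fused into loop I."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n)]
    j = 1
    while j + 1 <= n - 2:                        # full tiles of two J iterations
        for i in range(m):
            r0, r1, r2, r3 = a[j - 1][i], a[j][i], a[j + 1][i], a[j + 2][i]
            b[j][i] = r0 + r1 + r2               # iteration J
            b[j + 1][i] = r1 + r2 + r3           # iteration J+1 reuses r1 and r2
        j += 2
    for jr in range(j, n - 1):                   # remainder iteration, if any
        for i in range(m):
            b[jr][i] = a[jr - 1][i] + a[jr][i] + a[jr + 1][i]
    return b
```

Four loads now serve two output elements instead of six, which is the kind of reuse the register level is meant to capture.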
23. Code Transformations Summary
- Step 1: Apply a set of transformations to the original program to achieve the program model defined by Song & Li
  - Inline subroutines
  - Convert GO TO into DO-loop
  - Peel iterations of the time-step loop to eliminate IF-statements
- Step 2: Perform 2D-Tiling for the Cache Level
  - Construct the JI-loop distance subgraph
  - Compute DELTA and the BETAs and apply CLS to shorten backward dependencies
  - Update the JI-loop distance subgraph
  - Compute the OFFSETs and SLOPE and tile the IS
- Step 3: Perform 1D-Tiling for the Register Level
  - Index Set Splitting
  - Tiling in a straightforward manner
24. Performance Results (SWIM)
- Architecture: EV56 (500 MHz, L1 = 8 KB, L2 = 96 KB), EV6 (500 MHz, L1 = 64 KB, L2 = 4 MB)
- Compiler invocation
  - f77 -O5 -arch ev56 (EV5)
  - kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
- Programs
  - 1D-Tiling for the cache level: loop J, TS = 4 (EV5), TS = 8 (EV6)
  - 2D-Tiling for the cache level: TS IxJ = 32x16 (EV5), TS IxJ = 40x12 (EV6)
  - 1D-Tiling for the register level: loop J, TS = 4 (EV5 and EV6)
[Figure: execution time and speedup charts for ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]
- Execution times, EV5: 1519s, 1533s, 1023s, 999s, 1009s, 677s
- Execution times, EV6: 439s, 658s, 294s, 371s, 578s, 296s
25. Performance Results EV5 (SWIM)
- Architecture: EV56 (500 MHz, L1 = 8 KB, L2 = 96 KB)
- Compiler invocations
  - base: kf77 -O5 -arch ev56
  - no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch ..
[Figure: speedup over ORI (base) for ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]
26. Performance Results EV6 (SWIM)
- Architecture: EV6 (500 MHz, L1 = 64 KB, L2 = 4 MB)
- Compiler invocations
  - base: f77 -O5 -arch ev6
  - no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch ..
[Figure: speedup over ORI (base) for ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]
27. Code for Result Verification
DO K = 2, ITMAX-1
  ...
  do J = 1, N
    ...
  enddo
c result verification
  IF (MOD(K,MPRINT).eq.0) THEN
    do I
      do J
        UCHECK = UCHECK + UNEW(I,J)
      enddo
      UNEW(I,I) = . . .
    enddo
    PRINTS
  ENDIF
  do J = 1, N
    ...
  enddo
ENDDO
- NEW in SPEC2000!!
- Apply strip-mining to loop K (only useful if MPRINT is large)
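One way to picture the strip-mining of loop K is the following Python sketch, which compares the order of events in the original loop and in a strip-mined version. As a simplification it places the verification at the end of a time step rather than between two loop nests, and the event names and function names are hypothetical.

```python
def original_schedule(itmax, mprint):
    """Events of the original K loop: a step, then a check when MOD(K,MPRINT) = 0."""
    events = []
    for k in range(2, itmax):                    # DO K = 2, ITMAX-1
        events.append(("step", k))
        if k % mprint == 0:
            events.append(("verify", k))
    return events


def strip_mined_schedule(itmax, mprint):
    """Same events, with K split into strips that end at multiples of MPRINT."""
    events = []
    k_lo = 2
    while k_lo <= itmax - 1:
        next_multiple = -(-k_lo // mprint) * mprint   # smallest multiple >= k_lo
        k_hi = min(itmax - 1, next_multiple)
        for k in range(k_lo, k_hi + 1):          # inner strip contains no IF
            events.append(("step", k))
        if k_hi % mprint == 0:                   # verification between strips
            events.append(("verify", k_hi))
        k_lo = k_hi + 1
    return events
```

Each inner strip is free of the verification code and could be tiled as before, but a strip spans at most MPRINT time steps, which is why the transformation only pays off when MPRINT is large.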