Loop Tiling for Iterative Stencil Computations

Slides: 28

Provided by: Jime86

Learn more at: https://research.ac.upc.edu

1
Loop Tiling for Iterative Stencil Computations

Marta Jiménez

2
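What is an Iterative Stencil Computation?

Matrix A

DO K = 1, NITER                /* time-step loop */
  do J ...
    do I ...
      ... A(I,J), A(I+1,J), ...
    enddo
  enddo
  wrapped-around computations
ENDDO

Iterative stencil computations (ISC) are often performed for partial differential equations (PDE), geometric modeling (GM), and image processing (IP)
Examples: swim, tomcatv, mgrid (from the SPEC95 benchmark), Jacobi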
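
To make the pattern concrete, here is a minimal Python sketch of a Jacobi-style iterative stencil computation. The four-point average, the fixed boundary, and the name `jacobi` are illustrative assumptions, not taken from the slides.

```python
def jacobi(a, niter):
    """Run `niter` time steps of a 4-point Jacobi stencil on a 2D grid.

    Boundary cells are left unchanged; only interior cells are updated.
    """
    n, m = len(a), len(a[0])
    for _ in range(niter):                    # time-step loop (K)
        new = [row[:] for row in a]           # copy keeps the boundary values
        for j in range(1, n - 1):             # J loop
            for i in range(1, m - 1):         # I loop
                new[j][i] = 0.25 * (a[j - 1][i] + a[j + 1][i]
                                    + a[j][i - 1] + a[j][i + 1])
        a = new
    return a
```

Every time step sweeps the whole array again, which is why the working set of one sweep has to fit in cache before any reuse across time steps can be exploited.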

3
Loop Tiling

Loop Tiling
  divides the iteration space (IS) into regular tiles so that the working set fits in the memory level being exploited
  can be applied hierarchically (Multilevel Tiling)
Current algorithms for Loop Tiling are limited to loops that
  are perfectly nested
  are fully permutable
  define a rectangular IS
However, in iterative stencil computations, loops are
  NOT perfectly nested
  NOT fully permutable
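
For contrast, the following sketch shows classic tiling on a loop nest that does satisfy all three conditions (perfectly nested, fully permutable, rectangular). The matrix transpose and the name `transpose_tiled` are assumptions chosen for illustration only.

```python
def transpose_tiled(a, tile):
    """Transpose an n x m matrix using tile x tile blocks."""
    n, m = len(a), len(a[0])
    b = [[None] * n for _ in range(m)]
    for jj in range(0, n, tile):                      # tile loop over J
        for ii in range(0, m, tile):                  # tile loop over I
            for j in range(jj, min(jj + tile, n)):    # iterations inside the tile
                for i in range(ii, min(ii + tile, m)):
                    b[i][j] = a[j][i]
    return b
```

Because the iterations are independent, any tile size gives the same result; an iterative stencil computation does not offer that freedom.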

4
Today's talk

Show how Loop Tiling can be applied to iterative stencil computations
  based on Song & Li's paper (PLDI'99)
    define a Program Model
    1 level of 1D-Tiling (cache)
    program example: SWIM
  2 levels of Tiling
    2D-Tiling at the cache level
    1D-Tiling at the register level (based on Jiménez et al., ICS'98 & HPCA'98)
Performance Results
  Loop Tiling on EV5 & EV6

5
Steps

1- Apply a set of transformations to the original program to achieve the desired program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level

6
1st Step: achieve desired program model

Program Model

DO K = 1, NITER                /* time-step loop */
  do J1 = LJ1, UJ1
    do I1 = LI1, UI1
      ... A(I,J), A(I+1,J), ...
    enddo
  enddo
  . . .
  do Jm = LJm, UJm
    do Im = LIm, UIm
      ... A(I,J), A(I+1,J), ...
    enddo
  enddo
ENDDO

Usually, programs are NOT written directly in this form
We must apply a set of transformations to achieve this program model

7
SWIM original code

      SUBROUTINE CALCX
      do J = 1,N
        do I = 1,M
          ...
        enddo
      enddo
c     wrapped-around computations
      do J = 1, N
        ...
      enddo
      do I = 1, M
        ...
      enddo
      ...

      initializations
 90   NCYCLE = NCYCLE + 1
      CALL CALC1
      CALL CALC2
      IF (NCYCLE .GT. ITMAX) STOP
      IF (NCYCLE .LT. 1) THEN
        CALL CALC3Z
      ELSE
        CALL CALC3
      ENDIF
      GO TO 90

Transformations
  Inline subroutines
  Convert GO TO into DO-loop
  Peel iterations of the time-step loop to eliminate IF-statements guarded by NCYCLE
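
The effect of converting the GO TO into a DO-loop and peeling iterations can be sketched as follows. The code only records which routines are called. The exact comparison operators are an assumption: they are chosen so that the first cycle takes the CALC3Z branch and the run stops after CALC2 of cycle ITMAX, which leaves a time-step loop over K = 2..ITMAX-1. The function names are hypothetical, and ITMAX >= 2 is assumed.

```python
def goto_version(itmax):
    """Call trace of the GO TO-style driver loop (assumed stop/branch tests)."""
    trace, ncycle = [], 0
    while True:                                   # label 90 ... GO TO 90
        ncycle += 1
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:                       # assumed stop test
            break
        trace.append("CALC3Z" if ncycle <= 1 else "CALC3")
    return trace


def peeled_version(itmax):
    """Same trace after peeling the first and last cycles (itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]          # peeled first cycle
    for _k in range(2, itmax):                    # DO K = 2, ITMAX-1
        trace += ["CALC1", "CALC2", "CALC3"]      # no IF left in the loop body
    trace += ["CALC1", "CALC2"]                   # peeled last cycle
    return trace
```

After peeling, the body of the time-step loop is branch-free, which is what the program model requires.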

8
Wrapped-around Computations

DO K = 2, ITMAX-1
  do J = 1,N
    do I = 1,M
      ...
    enddo
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  do I = 1, M
    ...
  enddo
  ...
  do J = 1,N
    do I = 1,M
      ...
    enddo
  enddo
  ...
  ...
ENDDO

[Figure: J-I iteration spaces for CALC1, CALC2 and CALC3]

9
Wrapped-around Computations

Projection along direction I

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  do J = 1,N
    ...
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  ...
ENDDO

Another way of dealing with the wrapped-around computations is to perform code sinking
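
A minimal sketch of code sinking, under the assumption that the wrapped-around computation is a single boundary copy executed after the loop. The array contents and function names are hypothetical; the point is only that the statement moves inside the loop behind a guard.

```python
def separate(n):
    """Loop followed by a wrapped-around boundary copy."""
    a = [0] * (n + 1)
    for j in range(n):
        a[j] = j * j + 1
    a[n] = a[0]                    # wrapped-around computation after the loop
    return a


def sunk(n):
    """Same computation with the boundary copy sunk into the loop (n >= 1)."""
    a = [0] * (n + 1)
    for j in range(n):
        a[j] = j * j + 1
        if j == n - 1:             # guard: run only in the last iteration
            a[n] = a[0]
    return a
```

The guard adds a test to every iteration, which is the usual cost of sinking.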

10
1st Step: achieved program model

Flow dependencies in the iteration space for SWIM (projection along direction I)

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO

[Figure: flow dependencies between CALC1, CALC2 and CALC3 over J = 1..N, for time steps K=2 and K=3 of the K-loop (time)]

11
Steps

1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level

12
1D-Tiling

[Figure: 1D-Tiling of the J dimension (1..N) across time steps K=2, K=3 and K=4, showing the tiling parameters OFFSET-i and SLOPE]

Dependencies are violated

Tiling parameters: SLOPE, OFFSET-i
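
The role of SLOPE can be illustrated with a much simpler stand-in than SWIM: an in-place three-point stencil on a 1D array, with a single loop nest per time step (so no OFFSET-i is needed). This is a sketch under those assumptions, not Song & Li's algorithm; the function names are hypothetical. Tiles are taken over the skewed index JX = J + SLOPE*K, so each tile shifts by SLOPE positions per time step. With `slope=0` the tiles are rectangular and the result changes, because an iteration then reads a neighbor that has not yet been updated for the previous time step.

```python
def sweep_untiled(a, niter):
    """Reference: niter in-place sweeps of a 3-point stencil."""
    a = list(a)
    n = len(a)
    for _k in range(niter):
        for j in range(1, n - 1):
            a[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
    return a


def sweep_tiled(a, niter, tile, slope=1):
    """Same sweeps, tiled over the skewed index jx = j + slope * k."""
    a = list(a)
    n = len(a)
    jx_hi = (n - 2) + slope * (niter - 1)          # largest skewed index
    for jj in range(1, jx_hi + 1, tile):           # tile loop
        for k in range(niter):                     # time steps inside the tile
            lo = max(jj, 1 + slope * k)
            hi = min(jj + tile - 1, (n - 2) + slope * k)
            for jx in range(lo, hi + 1):
                j = jx - slope * k                 # undo the skew
                a[j] = (a[j - 1] + a[j] + a[j + 1]) / 3.0
    return a
```

For this stencil the backward dependence has distance (1,-1) in (K,J), so any `slope >= 1` keeps every dependence pointing forward in the tiled order.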

13
2D-Tiling

[Figure: 2D-Tiling of the J (1..N) and I (1..M) dimensions across the K (time-step) loop]

Tiling parameters: SLOPE and OFFSET-i for each tiled dimension (J and I)
Computed using the JI-loop distance subgraph

14
JI-loop Distance Subgraph

[Figure: distance subgraph with nodes JI1-loop, JI2-loop and JI3-loop; edges are labeled with the distance vectors (1,-1,-1), (0,0,0), (1,0,0), (1,-1,0) and (1,0,-1); legend: flow dependencies, anti-dependencies]

Each node represents a JI-loop nest
Each edge represents a dependence (distance vector)
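
One possible in-memory form of such a subgraph is an edge list of (source nest, target nest, distance vector) triples. In the sketch below the distance vectors are the ones that appear on the slide, but which nests each edge connects is a hypothetical assignment, made only to have a concrete graph. The helper simply picks out the edges that point backward in a given dimension.

```python
# Distance vectors are ordered (K, J, I).
# Edge endpoints below are hypothetical; only the vectors come from the slide.
EDGES = [
    ("JI1", "JI2", (0, 0, 0)),
    ("JI2", "JI3", (0, 0, 0)),
    ("JI3", "JI1", (1, -1, -1)),
    ("JI1", "JI1", (1, 0, 0)),
    ("JI2", "JI1", (1, -1, 0)),
    ("JI3", "JI2", (1, 0, -1)),
]


def backward_edges(edges, dim):
    """Edges whose distance is negative in dimension `dim` (1 = J, 2 = I)."""
    return [e for e in edges if e[2][dim] < 0]
```

Edges with a negative J or I component are the ones that rectangular tiles would break, so they are the ones the tiling parameters have to accommodate.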

15
Wrapped-around Computations

SWIM: projection along direction I

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO

[Figure: dependencies over J = 1..N for time steps K=2 and K=3 of the K-loop (time)]

Backward dependencies with large distances make Tiling unprofitable
  apply Circular Loop Skewing to shorten backward dependencies

16
Circular Loop Skewing

Shortens backward dependencies by changing the iteration order

[Figure: iteration order over J = 1..N at time steps K=2 and K=3, showing the CLS parameters BETA-i and DELTA]

CLS parameters: BETA-i, DELTA (computed using the JI-loop distance subgraph)

17
Circular Loop Skewing

DO K = 2, ITMAX-1
  do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
ENDDO
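
The index mapping used by Circular Loop Skewing can be checked in isolation. The sketch below lists the J values that one skewed loop visits at time step K; the function name `cls_order` and the small N are assumptions for illustration.

```python
def cls_order(n, beta, delta, k):
    """J values visited at time step k (k >= 2) by a circularly skewed loop.

    Mirrors: do JX = 1+BETA+DELTA*(K-2), N+BETA+DELTA*(K-2)
               J = MOD(JX-1, N) + 1
    """
    start = 1 + beta + delta * (k - 2)
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]
```

Each loop still visits every J in 1..N exactly once per time step; only the starting point rotates, by BETA-i between loop nests and by DELTA between time steps.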
18
2nd Step: 2D-Tiling for cache level

SWIM: projection along direction I
CLS parameters: DELTA=2, BETA1=0, BETA2=1, BETA3=2
Tiling parameters: SLOPE=2, OFFSET1=1, OFFSET2=OFFSET3=0

DO JJ = ...
 DO II = ...
  DO K = ...
    if (first tile) then
      do JX = ...
        offsets iter.
      enddo
    endif
    do JX = ...
      Iter. inside tile
    enddo
    do JX = ...
      Iter. inside tile
    enddo
    do JX = ...
      Iter. inside tile
    enddo
  ENDDO
 ENDDO
ENDDO

[Figure: tiled iteration order over J = 1..N for the three loop nests (1, 2, 3) at time steps K=2, K=3 and K=4]

19
Steps

1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
2- Perform 2D-Tiling for the Cache Level
3- Perform 1D-Tiling for the Register Level

20
3rd Step: 1D-Tiling for register level

DO JJ = ...
 DO II = ...
  DO K = ...
    ...
    do JX = LJ, UJ
      J = MOD(JX-1, N) + 1
      do IX = LI, UI
        I = MOD(IX-1, M) + 1
        loop body (I,J)
      enddo
    enddo
    ...
  ENDDO
 ENDDO
ENDDO

[Figure: J (1..N) by I (1..M) iteration space with the unrolled region marked]

The MOD operation introduced by CLS prevents us from fully unrolling the loop
  First apply Index Set Splitting to loop J

21
Index Set Splitting

ISS splits a loop into two new loops that iterate over non-intersecting portions of the iteration space

DO JJ = ...
 DO II = ...
  DO K = ...
    ...
    do JX = LJ, min(N,UJ)
      J = JX
      do IX ...
      enddo
    enddo
    do JX = max(N+1,LJ), UJ
      J = JX-N
      do IX ...
      enddo
    enddo
    ...
  ENDDO
 ENDDO
ENDDO

[Figure: J (1..N) by I (1..M) iteration space after ISS]
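
That the two split loops cover the same J values as the MOD loop can be checked with a small sketch. It assumes 1 <= LJ <= UJ <= 2N, so that the skewed index wraps around at most once; the function names are hypothetical.

```python
def j_with_mod(lj, uj, n):
    """J sequence of the original loop: J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def j_with_iss(lj, uj, n):
    """J sequence after Index Set Splitting (assumes 1 <= lj <= uj <= 2n)."""
    js = []
    for jx in range(lj, min(n, uj) + 1):          # first loop: J = JX
        js.append(jx)
    for jx in range(max(n + 1, lj), uj + 1):      # second loop: J = JX - N
        js.append(jx - n)
    return js
```

Neither split loop contains a MOD, so J is an affine function of JX in each and the loop can be unrolled.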
22
3rd Step: 1D-Tiling for register level
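
Tiling loop J for the register level amounts to strip-mining J by the tile size and fully unrolling the resulting inner loop, so that values shared by neighboring J iterations can stay in registers. The sketch below uses a hypothetical two-point column stencil and a fixed tile size of 4; the local variables `r0`..`r4` stand in for registers. It illustrates the general technique only and is not the code generated for SWIM.

```python
def rolled(a):
    """out[j][i] = a[j][i] + a[j+1][i], plain loop nest."""
    n, m = len(a), len(a[0])
    out = [[0] * m for _ in range(n - 1)]
    for j in range(n - 1):
        for i in range(m):
            out[j][i] = a[j][i] + a[j + 1][i]
    return out


def register_tiled(a):
    """Same computation with loop J tiled by 4 and the tile fully unrolled."""
    n, m = len(a), len(a[0])
    out = [[0] * m for _ in range(n - 1)]
    main = (n - 1) - (n - 1) % 4
    for j in range(0, main, 4):                   # tile loop over J
        for i in range(m):
            # five loads serve four iterations; r1..r3 are each reused once
            r0, r1, r2 = a[j][i], a[j + 1][i], a[j + 2][i]
            r3, r4 = a[j + 3][i], a[j + 4][i]
            out[j][i] = r0 + r1
            out[j + 1][i] = r1 + r2
            out[j + 2][i] = r2 + r3
            out[j + 3][i] = r3 + r4
    for j in range(main, n - 1):                  # remainder iterations
        for i in range(m):
            out[j][i] = a[j][i] + a[j + 1][i]
    return out
```

The unrolled body needs J to be an affine function of the loop index, which is why the MOD has to be removed before this step.
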
23
Code Transformations Summary

1- Apply a set of transformations to the original program to achieve the program model defined by Song & Li
  Inline subroutines
  Convert GO TO into DO-loop
  Peel iterations of the time-step loop to eliminate IF-statements
2- Perform 2D-Tiling for the Cache Level
  Construct JI-loop distance subgraph
  Compute DELTA and BETAs and apply CLS to shorten backward dependencies
  Update JI-loop distance subgraph
  Compute OFFSETs and SLOPE and tile the IS
3- Perform 1D-Tiling for the Register Level
  Index Set Splitting
  Tiling in a straightforward manner

24
Performance Results (SWIM)

Architecture: EV56 (500 MHz, L1 = 8 KB, L2 = 96 KB), EV6 (500 MHz, L1 = 64 KB, L2 = 4 MB)
Compiler Invocation
  f77 -O5 -arch ev56 (EV5)
  kf77 -O5 -arch ev6 -notransform_loop -unroll 1 (EV6)
Programs
  1D-Tiling for the Cache Level: loop J, TS = 4 (EV5), TS = 8 (EV6)
  2D-Tiling for the Cache Level: TS(I x J) = 32x16 (EV5), TS(I x J) = 40x12 (EV6)
  1D-Tiling for the register level: loop J, TS = 4 (EV5 & EV6)

[Figure: speedup chart for EV5 and EV6 with bars ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]
Execution times annotated on the chart:
  EV5: 1519s, 1533s, 1023s, 999s, 1009s, 677s
  EV6: 439s, 658s, 294s, 371s, 578s, 296s

25
Performance Results EV5 (SWIM)

Architecture: EV56 (500 MHz, L1 = 8 KB, L2 = 96 KB)
Compiler invocations
  base: kf77 -O5 -arch ev56
  no_prefetch: kf77 -O5 -arch ev56 -switch nolu_prefetch_fetch ..

[Figure: speedup over ORI (base) for ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]

26
Performance Results EV6 (SWIM)

Architecture: EV6 (500 MHz, L1 = 64 KB, L2 = 4 MB)
Compiler invocations
  base: f77 -O5 -arch ev6
  no_prefetch: f77 -O5 -arch ev6 -switch nolu_prefetch_fetch ..

[Figure: speedup over ORI (base) for ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]

27
Code for Result Verification

DO K = 2, ITMAX-1
  ...
  do J = 1,N
    ...
  enddo
  result verification
  IF (MOD(K,MPRINT).eq.0) THEN
    do I
      do J
        UCHECK = UCHECK + UNEW(I,J)
      enddo
      UNEW(I,I) = . . .
    enddo
    PRINTS
  ENDIF
  do J = 1,N
    ...
  enddo
ENDDO

NEW in SPEC2000!!