Loop Tiling for Iterative Stencil Computations

1
Loop Tiling for Iterative Stencil Computations
  • Marta Jiménez

2
What is an Iterative Stencil Computation?
Matrix A
DO K = 1, NITER          /* time-step loop */
  do J = ...
    do I = ...
      A(I,J), A(I+1,J), ...
    enddo
  enddo
  wrapped-around computations
ENDDO
  • ISC often performed for PDE, GM, IP
  • swim, tomcatv, mgrid (from SPEC95 benchmark)
  • Jacobi
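
As an illustration of the loop structure above, here is a minimal Jacobi sketch in Python. The function name `jacobi`, the 4-point averaging body and the fixed boundary are assumptions chosen for the example; the slides only name Jacobi as an instance of an iterative stencil computation.

```python
def jacobi(grid, niter):
    """Run `niter` Jacobi sweeps on a 2D list-of-lists grid (boundary kept fixed)."""
    rows, cols = len(grid), len(grid[0])
    cur = [row[:] for row in grid]
    for _ in range(niter):                      # time-step loop (K)
        new = [row[:] for row in cur]
        for j in range(1, rows - 1):            # J loop
            for i in range(1, cols - 1):        # I loop
                # each point is rebuilt from its four neighbours (the stencil)
                new[j][i] = 0.25 * (cur[j][i - 1] + cur[j][i + 1]
                                    + cur[j - 1][i] + cur[j + 1][i])
        cur = new
    return cur
```

Every time step sweeps the whole array, so data touched early in one sweep has usually left the cache by the time the next sweep needs it; this is the reuse that tiling across the time-step loop tries to recover.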

3
Loop Tiling
  • Loop Tiling
      • divides the iteration space (IS) into regular tiles so that the
        working set fits in the memory level being exploited
      • can be applied hierarchically (Multilevel Tiling)
  • Current algorithms for Loop Tiling are limited to loops that
      • are perfectly nested
      • are fully permutable
      • define a rectangular IS
  • However, in iterative stencil computations, loops are
      • NOT perfectly nested
      • NOT fully permutable
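
For reference, conventional tiling of a perfectly nested, fully permutable, rectangular loop nest can be sketched as follows. The generator name `tiled_iterations` and its parameters are hypothetical; it only shows the iteration order that tiling produces.

```python
def tiled_iterations(n, m, tj, ti):
    """Yield (j, i) over a rectangular n x m iteration space in tile order."""
    for jj in range(0, n, tj):                      # tile-controlling loops
        for ii in range(0, m, ti):
            for j in range(jj, min(jj + tj, n)):    # intra-tile loops
                for i in range(ii, min(ii + ti, m)):
                    yield (j, i)
```

The tiled order visits exactly the same iterations as the original nest, only grouped tile by tile. That reordering is legal only when the loops can be freely permuted, which is the property iterative stencil computations lack.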

4
Today's talk
  • Show how Loop Tiling can be applied to iterative
    stencil computations
  • based on Song & Li's paper (PLDI'99)
  • define a Program Model
  • 1 Level of 1D-Tiling (cache)
  • program example: SWIM
  • 2 levels of Tiling
  • 2D-Tiling at the cache level
  • 1D-Tiling at the register level (based on Jiménez
    et al., ICS'98 and HPCA'98)
  • Performance Results
  • Loop Tiling on EV5 and EV6

5
Steps
  • 1- Apply a set of transformations to the original
    program to achieve the desired program model
    defined by Song & Li
  • 2- Perform 2D-Tiling for the Cache Level
  • 3- Perform 1D-Tiling for the Register Level

6
1st Step: achieve desired program model
  • Program Model

DO K = 1, NITER          /* time-step loop */
  do J1 = LJ1, UJ1
    do I1 = LI1, UI1
      A(I,J), A(I+1,J), ...
    enddo
  enddo
  . . .
  do Jm = LJm, UJm
    do Im = LIm, UIm
      A(I,J), A(I+1,J), ...
    enddo
  enddo
ENDDO
  • Usually, programs are NOT directly written in
    this form
  • We must apply a set of transformations to
    achieve this program model

7
SWIM original code
      SUBROUTINE CALCX
        do J = 1,N
          do I = 1,M
            ...
          enddo
        enddo
c       wrapped-around computations
        do J = 1, N
          ...
        enddo
        do I = 1, M
          ...
        enddo

      ...
      initializations
 90   NCYCLE = NCYCLE + 1
      CALL CALC1
      CALL CALC2
      IF (NCYCLE >= ITMAX) STOP
      IF (NCYCLE <= 1) THEN
        CALL CALC3Z
      ELSE
        CALL CALC3
      ENDIF
      GO TO 90
  • Transformations
  • Inline subroutines
  • Convert GO TO into DO-loop
  • Peel iterations of the time-step loop to
    eliminate IF-statements guarded by NCYCLE
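
The effect of the last two transformations can be sketched by comparing call traces. This is a sketch under two assumptions: the exit test is NCYCLE >= ITMAX and the CALC3Z branch is taken when NCYCLE <= 1. The function names `swim_goto` and `swim_peeled` are hypothetical.

```python
def swim_goto(itmax):
    """Call trace of the original control flow (GO TO loop emulated with while)."""
    trace, ncycle = [], 0
    while True:
        ncycle += 1                              # label 90
        trace += ["CALC1", "CALC2"]
        if ncycle >= itmax:                      # assumed exit test
            return trace                         # STOP
        trace.append("CALC3Z" if ncycle <= 1 else "CALC3")


def swim_peeled(itmax):
    """Same trace with the first and last iterations peeled (assumes itmax >= 2)."""
    trace = ["CALC1", "CALC2", "CALC3Z"]         # peeled iteration K = 1
    for _k in range(2, itmax):                   # DO K = 2, ITMAX-1
        trace += ["CALC1", "CALC2", "CALC3"]     # no IF left in the loop body
    trace += ["CALC1", "CALC2"]                  # peeled iteration K = ITMAX
    return trace
```

After peeling, the remaining DO loop runs from 2 to ITMAX-1 and its body is free of IF-statements guarded by NCYCLE.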

8
Wrapped-around Computations
DO K = 2, ITMAX-1
  do J = 1,N
    do I = 1,M
      ...
    enddo
  enddo
  wrapped-around comp
  do J = 1, N
    ...
  enddo
  do I = 1, M
    ...
  enddo
  ...
  do J = 1,N
    do I = 1,M
      ...
    enddo
  enddo
  ...
  ...
ENDDO
[Figure: iteration-space diagrams with axes I and J, labeled CALC1, CALC2, CALC3]

9
Wrapped-around Computations
  • Projection along direction I

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
c wrapped-around comp
  do J = 1, N
    ...
  enddo
  do J = 1,N
    ...
  enddo
c wrapped-around comp
  do J = 1, N
    ...
  enddo
  ...
ENDDO
[Figure: projection of the iteration space onto the J axis]
  • Another way of dealing with the wrapped-around
    computations is to perform code sinking

10
1st Step: achieved program model
  • Flow dependencies in the iteration space of SWIM
    (projection along direction I)

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO
[Figure: flow dependencies between the J loops (J = 1..N) of CALC1, CALC2 and CALC3 along the K-loop (time), shown for K=2 and K=3]

11
Steps
  • 1- Apply a set of transformations to the original
    program to achieve the program model defined by
    Song & Li
  • 2- Perform 2D-Tiling for the Cache Level
  • 3- Perform 1D-Tiling for the Register Level

12
1D-Tiling
[Figure: J loops (J = 1..N) at time steps K=2, K=3 and K=4, with tile boundaries annotated OFFSET-i and SLOPE]
  • Dependencies are violated
  • Tiling parameters: SLOPE, OFFSET-i
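
The role of a slope can be sketched with a small schedule checker. It assumes a single J loop per time step and a 3-point stencil, so iteration (k, j) depends on (k-1, j-1), (k-1, j) and (k-1, j+1). The names `tile_order` and `violations` are hypothetical, and the `slope` argument mirrors the slide's SLOPE only loosely; this is not Song & Li's algorithm.

```python
def tile_order(n, steps, tile, slope):
    """Return {(k, j): position} for a 1D tiling of the J loop across K steps.

    Iteration (k, j) is assigned to tile (j + slope * k) // tile. Tiles run in
    increasing order; inside a tile, K is the outer loop and J the inner one.
    """
    points = [(k, j) for k in range(steps) for j in range(n)]
    points.sort(key=lambda p: ((p[1] + slope * p[0]) // tile, p[0], p[1]))
    return {p: pos for pos, p in enumerate(points)}


def violations(n, steps, tile, slope):
    """Count flow dependences (k-1, j+d) -> (k, j), d in {-1, 0, 1}, run backwards."""
    order = tile_order(n, steps, tile, slope)
    bad = 0
    for k in range(1, steps):
        for j in range(n):
            for d in (-1, 0, 1):
                src = (k - 1, j + d)
                if src in order and order[src] > order[(k, j)]:
                    bad += 1                     # source would execute after its sink
    return bad
```

With `slope=0` (rectangular tiles) some sources land in a later tile than their sinks, so the count is positive. With `slope=1` each tile boundary shifts by one J position per time step and the count drops to zero for this stencil.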

13
2D-Tiling
[Figure: I-J iteration spaces (I = 1..M, J = 1..N) of the loop nests at successive iterations of K (time-step loop), divided into 2D tiles]
  • Tiling parameters: SLOPE, OFFSET-i for each
    tiled dimension (J and I)
  • Computed using the JI-loop distance subgraph

14
JI-loop Distance Subgraph
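[Figure: JI-loop distance subgraph with nodes JI1-loop, JI2-loop and JI3-loop. Edge labels (distance vectors), in extraction order: (1,-1,-1); (0,0,0); (1,0,0); (1,0,0); (1,-1,0), (1,0,-1); (1,-1,0), (1,0,-1); (1,0,0); (1,0,0); (1,0,-1), (1,-1,0); (0,0,0). Legend: flow dependencies, anti-dependencies]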
  • Each node represents a JI-loop nest
  • Each edge represents a dependence (distance
    vector)
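
As a small sketch, the edge labels can be handled as (K, J, I) distance vectors. The list below contains only the distinct vectors that appear on the slide, without their edge endpoints, and the helper names `lex_nonnegative` and `max_backward` are hypothetical. This does not reproduce how Song & Li derive the tiling or CLS parameters; it only shows two basic queries on such vectors.

```python
def lex_nonnegative(vec):
    """True if the distance vector is lexicographically >= 0 (a valid dependence)."""
    for d in vec:
        if d != 0:
            return d > 0
    return True


# Distinct distance vectors (K, J, I) appearing as edge labels on the slide.
SLIDE_VECTORS = [(1, -1, -1), (0, 0, 0), (1, 0, 0), (1, -1, 0), (1, 0, -1)]


def max_backward(vectors, dim):
    """Largest backward (negative) distance along `dim`, returned as a positive number."""
    return max((-v[dim] for v in vectors if v[dim] < 0), default=0)
```

For these vectors the largest backward distance is 1 in both J (`dim=1`) and I (`dim=2`); backward components of this kind are what force tiles to be skewed rather than rectangular.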

15
Wrapped-around Computations
  • SWIM Projection along direction I

DO K = 2, ITMAX-1
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
  do J = 1,N
    ...
  enddo
  wrapped-around
ENDDO
[Figure: J loops (J = 1..N) along the K-loop (time), shown for K=2 and K=3]
  • Backward dependencies with large distances make
    tiling unprofitable
  • apply Circular Loop Skewing (CLS) to shorten backward
    dependencies

16
Circular Loop Skewing
  • Shortens backward dependencies by changing the
    iteration order

[Figure: two J-axis panels (J = 1..N) at K=2 and K=3 with numbered iterations, annotated BETA-i and DELTA]
  • CLS parameters: BETA-i, DELTA (computed using
    the JI-loop distance subgraph)

17
Circular Loop Skewing
DO K = 2, ITMAX-1
  do JX = 1+BETA1+DELTA*(K-2), N+BETA1+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA2+DELTA*(K-2), N+BETA2+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
  do JX = 1+BETA3+DELTA*(K-2), N+BETA3+DELTA*(K-2)
    J = MOD(JX-1, N) + 1
    ...
  enddo
  wrapped-around
ENDDO

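
The iteration order produced by one such JX loop can be sketched as below. It assumes the loop bounds read 1+BETA+DELTA*(K-2) and N+BETA+DELTA*(K-2), with J recovered as MOD(JX-1, N)+1; the function name `cls_order` is hypothetical.

```python
def cls_order(n, beta, delta, k):
    """J values visited by one circularly skewed loop at time step k (k >= 2)."""
    start = 1 + beta + delta * (k - 2)           # assumed lower bound of JX
    # JX runs over n consecutive values; MOD wraps them back into 1..n
    return [(jx - 1) % n + 1 for jx in range(start, start + n)]
```

For example, `cls_order(5, 0, 2, 3)` gives `[3, 4, 5, 1, 2]`. Every time step still visits each J exactly once; only the starting point rotates, by BETA-i per loop and by DELTA per time step.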
18
2nd Step: 2D-Tiling for cache level
  • SWIM projection along direction I
  • CLS parameters: DELTA=2, BETA1=0, BETA2=1,
    BETA3=2
  • Tiling parameters: SLOPE=2, OFFSET1=1,
    OFFSET2=OFFSET3=0

DO JJ ...
DO II ...
DO K ...
  if (first tile) then
    do JX ...
      offsets iter.
    enddo
  endif
  do JX ...
    Iter. inside tile
  enddo
  do JX ...
    Iter. inside tile
  enddo
  do JX ...
    Iter. inside tile
  enddo
ENDDO
[Figure: tiled J iteration space (up to N) at K=2, K=3 and K=4, with regions numbered 1 to 3]

19
Steps
  • 1- Apply a set of transformations to the original
    program to achieve the program model defined by
    Song & Li
  • 2- Perform 2D-Tiling for the Cache Level
  • 3- Perform 1D-Tiling for the Register Level

20
3rd Step: 1D-Tiling for register level
DO JJ ...
DO II ...
DO K ...
  ...
  do JX = LJ, UJ
    J = MOD(JX-1, N) + 1
    do IX = LI, UI
      I = MOD(IX-1, M) + 1
      loop body (I,J)
    enddo
  enddo
  ...
ENDDO
[Figure: tile in the I-J plane; J axis ticks N, 1, N-1, 2, N-2; I axis ticks M-2, M-1, M, 1, 2; label: unrolled]
  • The MOD operation introduced by CLS prevents us
    from fully unrolling the loop
  • First apply Index Set Splitting (ISS) to loop J
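
Register tiling of loop J amounts to strip-mining J and fully unrolling the strip inside the I loop. The sketch below uses a strip of 2 rather than the 4 reported later, a made-up loop body that adds vertically adjacent elements, and hypothetical function names.

```python
def sweep_plain(a):
    """Reference: b[j][i] = a[j][i] + a[j+1][i]."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n - 1)]
    for j in range(n - 1):
        for i in range(m):
            b[j][i] = a[j][i] + a[j + 1][i]
    return b


def sweep_register_tiled(a):
    """Same result with J strip-mined by 2 and the strip unrolled inside the I loop."""
    n, m = len(a), len(a[0])
    b = [[0] * m for _ in range(n - 1)]
    rows = n - 1
    main = rows - rows % 2                       # rows covered by full strips
    for j in range(0, main, 2):
        for i in range(m):
            mid = a[j + 1][i]                    # loaded once, reused by both rows
            b[j][i] = a[j][i] + mid
            b[j + 1][i] = mid + a[j + 2][i]
    for j in range(main, rows):                  # remainder iterations
        for i in range(m):
            b[j][i] = a[j][i] + a[j + 1][i]
    return b
```

The unrolled body relies on rows j, j+1 and j+2 being consecutive. With J = MOD(JX-1, N) + 1 that no longer holds where JX wraps from N back to 1, which is why the loop is split first.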

21
Index Set Splitting
  • ISS splits a loop into two new loops that iterate
    over non-intersecting portions of the iteration
    space

DO JJ ...
DO II ...
DO K ...
  ...
  do JX = LJ, min(N,UJ)
    J = JX
    do IX ...
    enddo
  enddo
  do JX = max(N+1,LJ), UJ
    J = JX-N
    do IX ...
    enddo
  enddo
  ...
ENDDO
[Figure: tile in the I-J plane after ISS; J axis ticks N, 1, N-1, 2, N-2; I axis ticks M-2, M-1, M, 1, 2]
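
The equivalence between the MOD form and the two split loops can be checked with a short sketch. It assumes 1 <= LJ and UJ <= 2*N, so that JX wraps at most once; the function names are hypothetical.

```python
def mod_loop(lj, uj, n):
    """J values produced by the single loop with J = MOD(JX-1, N) + 1."""
    return [(jx - 1) % n + 1 for jx in range(lj, uj + 1)]


def split_loops(lj, uj, n):
    """J values produced after Index Set Splitting (assumes 1 <= lj, uj <= 2*n)."""
    first = [jx for jx in range(lj, min(n, uj) + 1)]           # J = JX
    second = [jx - n for jx in range(max(n + 1, lj), uj + 1)]  # J = JX - N
    return first + second
```

For instance, with N = 6 the range LJ = 5, UJ = 8 yields J = 5, 6, 1, 2 in both versions. Neither split loop contains a MOD, so J is a plain affine index in each and the loops can be unrolled.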
22
3rd Step: 1D-Tiling for register level

23
Code Transformations Summary
  • 1- Apply a set of transformations to the original
    program to achieve the program model defined by
    Song & Li
      • Inline subroutines
      • Convert GO TO into DO-loop
      • Peel iterations of the time-step loop to
        eliminate IF-statements
  • 2- Perform 2D-Tiling for the Cache Level
      • Construct JI-loop distance subgraph
      • Compute DELTA and BETAs and apply CLS to shorten
        backward dependencies
      • Update JI-loop distance subgraph
      • Compute OFFSETs and SLOPE and tile the IS
  • 3- Perform 1D-Tiling for the Register Level
      • Index Set Splitting
      • Tiling in a straightforward manner

24
Performance Results (SWIM)
  • Architecture: EV56 (500 MHz, L1 = 8 KB, L2 = 96 KB),
    EV6 (500 MHz, L1 = 64 KB, L2 = 4 MB)
  • Compiler Invocation
      • f77 -O5 -arch ev56 (EV5)
      • kf77 -O5 -arch ev6 -notransform_loop -unroll 1
        (EV6)
  • Programs
      • 1D-Tiling for the Cache Level: loop J, TS = 4
        (EV5), TS = 8 (EV6)
      • 2D-Tiling for the Cache Level: TS IxJ = 32x16
        (EV5), TS IxJ = 40x12 (EV6)
      • 1D-Tiling for the register level: loop J, TS = 4
        (EV5 and EV6)

[Figure: speedup bar chart for the versions ORI, ORI RT, 1D, 1D RT, 2D, 2D RT. Execution-time labels in extraction order: 1519s, 1533s, 1023s, 999s, 1009s, 677s (EV5); 439s, 658s, 294s, 371s, 578s, 296s (EV6)]

25
Performance Results EV5 (SWIM)
  • Architecture: EV56 (500 MHz, L1 = 8 KB, L2 = 96 KB)
  • Compiler invocations
      • base: kf77 -O5 -arch ev56
      • no_prefetch: kf77 -O5 -arch ev56 -switch
        nolu_prefetch_fetch ..

[Figure: speedup over ORI (base) for the versions ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]

26
Performance Results EV6 (SWIM)
  • Architecture: EV6 (500 MHz, L1 = 64 KB, L2 = 4 MB)
  • Compiler invocations
      • base: f77 -O5 -arch ev6
      • no_prefetch: f77 -O5 -arch ev6 -switch
        nolu_prefetch_fetch ..

[Figure: speedup over ORI (base) for the versions ORI, ORI RT, 1D, 1D RT, 2D, 2D RT]

27
Code for Result Verification
DO K = 2, ITMAX-1
  ...
  do J = 1,N
    ...
  enddo
c result verification
  IF (MOD(K,MPRINT).eq.0) THEN
    do I
      do J
        UCHECK = UCHECK + UNEW(I,J)
      enddo
      UNEW(I,I) = . . .
    enddo
    PRINTS
  ENDIF
  do J = 1,N
    ...
  enddo
ENDDO
NEW in SPEC2000!!
  • Apply strip-mining to loop K (only useful if
    MPRINT is large)
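
Strip-mining loop K around the verification test can be sketched as follows. The event-trace functions are hypothetical and the loop body is reduced to a single "compute" event per time step; the point is only that the IF moves out of the inner time-step loop.

```python
def events_original(itmax, mprint):
    """Event trace of DO K = 2, ITMAX-1 with the verification IF inside the loop."""
    out = []
    for k in range(2, itmax):
        out.append(("compute", k))
        if k % mprint == 0:
            out.append(("verify", k))
    return out


def events_strip_mined(itmax, mprint):
    """Same trace with K strip-mined so that each strip ends at a verification point."""
    out = []
    k = 2
    while k <= itmax - 1:
        # a strip ends at the next multiple of mprint, or at the last iteration
        end = min(-(-k // mprint) * mprint, itmax - 1)
        for kk in range(k, end + 1):             # IF-free inner time-step loop
            out.append(("compute", kk))
        if end % mprint == 0:
            out.append(("verify", end))
        k = end + 1
    return out
```

Each inner strip spans at most MPRINT time steps and contains no IF, so it fits the program model and can be tiled. The number of time steps available for reuse is bounded by MPRINT, which is why the transformation only pays off when MPRINT is large.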