CMPUT680 - Winter 2006 - PowerPoint PPT Presentation

Description:

CMPUT 680 - Compiler Design and Optimization. Topic E: Software Pipelining.

Slides: 35
Provided by: josenels
Transcript and Presenter's Notes

Title: CMPUT680 - Winter 2006


1
CMPUT680 - Winter 2006
  • Topic E Software Pipelining
  • José Nelson Amaral
  • http://www.cs.ualberta.ca/amaral/courses/680

2
Reading List
  • Tiger book chapter 20
  • Other papers, such as GovindAltmanGao97 and
    RutenbergEtAl97

3
Software Pipeline
Software pipelining is a technique that reduces the
execution time of important loops by interweaving
operations from many iterations to optimize the
use of resources.
[Figure: time axis, cycles 0 to 16]
4
Software Pipeline
  • What limits the speed of a loop?
  • Data dependencies: recurrence initiation
    interval (rec_mii)
  • Processor resources: resource initiation
    interval (res_mii)
  • Memory accesses: memory initiation interval
    (mem_mii)

5
Problem Formulation (I)
  • Given a weighted dependence graph, derive a
    schedule which is time-optimal under a machine
    model M.
  • Def: A schedule S of a loop L is time-optimal if,
    among all legal schedules of L, no schedule is
    faster than S.
  • Note: There may be more than one time-optimal
    schedule.

6
Example: The Inner Product
Source loop:
    Q = 0.0
    DO k = 1, N
      Q = Q + Z(k) * X(k)
    ENDDO
The same loop in DSA form:
    z_0 ← Z(1)
    x_0 ← X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}
      v_k ← load x_{k-1}
      w_k ← u_k * v_k
      q_k ← q_{k-1} + w_k
      z_k ← z_{k-1} + 4
      x_k ← x_{k-1} + 4
    END DO
Dynamic Single Assignment (DSA): uses an
expanded virtual register (EVR), which is an
infinite, linearly ordered set of virtual
registers.
A program in DSA has no anti-dependencies and no
output dependencies.
(Dehnert, J. and Towle, R. A., Compiling for
the Cydra 5)
7
Machine Model and Resource Constraints
Which unit does each operation in the loop use?
    z_0 ← Z(1)
    x_0 ← X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}         MEM
      v_k ← load x_{k-1}         MEM
      w_k ← u_k * v_k            FMULT
      q_k ← q_{k-1} + w_k        FADD
      z_k ← z_{k-1} + 4          ADDR
      x_k ← x_{k-1} + 4          ADDR
    END DO
Machine Model
    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2
Without instruction-level parallelism, how long
does the loop take to execute?
(6 + 6 + 2 + 2 + 1 + 1) × N = 18N cycles
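The count above can be checked with a short sketch. The names `LATENCY`, `LOOP_BODY` and `sequential_cycles` are illustrative and do not come from the slides; the latencies are those of the machine model.

```python
# Latency (in cycles) of the unit used by each kind of operation,
# taken from the machine model on this slide.
LATENCY = {"MEM": 6, "FMULT": 2, "FADD": 2, "ADDR": 1}

# Units used by the six operations of one iteration, in program order.
LOOP_BODY = ["MEM", "MEM", "FMULT", "FADD", "ADDR", "ADDR"]


def sequential_cycles(n_iterations: int) -> int:
    """Cycles needed when each operation waits for the previous one to finish."""
    per_iteration = sum(LATENCY[unit] for unit in LOOP_BODY)
    return per_iteration * n_iterations


print(sequential_cycles(1))    # 18
print(sequential_cycles(100))  # 1800
```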
8
Resource Minimum Initiation Interval (ResMII)
Each processor resource defines a minimum
initiation interval for the execution of the loop.
For instance, in the machine model of the
previous example, a loop that requires the
computation of 6 addresses has
ResMII(ADDR) = (6 × 1) / 2 = 3.
The Resource Minimum Initiation Interval of a
loop is the largest of these per-resource bounds:
ResMII = max over all resources r of ResMII(r).
9
ResMII
    z_0 ← Z(1)
    x_0 ← X(1)
    q_0 ← 0.0
    DO k = 1, N
      u_k ← load z_{k-1}         MEM
      v_k ← load x_{k-1}         MEM
      w_k ← u_k * v_k            FMULT
      q_k ← q_{k-1} + w_k        FADD
      z_k ← z_{k-1} + 4          ADDR
      x_k ← x_{k-1} + 4          ADDR
    END DO
Machine Model
    Unit     Latency
    MEM1     6
    MEM2     6
    ADDR1    1
    ADDR2    1
    FMULT    2
    FADD     2
There are enough units to schedule all the
instructions of the loop in the same cycle.
Therefore ResMII = 1. Can we execute the loop in
N + C cycles (C a small constant)?
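A minimal sketch of the ResMII computation follows. It assumes every unit is fully pipelined, so each operation keeps its unit busy for one cycle per iteration; the slides give latencies only, so this occupancy is an assumption. The function and variable names are illustrative.

```python
from math import ceil


def res_mii(uses: dict[str, int], units: dict[str, int]) -> int:
    """Largest per-resource bound: ceil(busy cycles per iteration / number of units)."""
    return max(ceil(uses[r] / units[r]) for r in uses)


# Inner-product loop: busy cycles per iteration for each kind of unit
# (assumed: one busy cycle per operation on a pipelined unit).
loop_uses = {"MEM": 2, "FMULT": 1, "FADD": 1, "ADDR": 2}
machine = {"MEM": 2, "FMULT": 1, "FADD": 1, "ADDR": 2}
print(res_mii(loop_uses, machine))  # 1

# The six-address example: 6 one-cycle address computations on 2 ADDR units.
print(res_mii({"ADDR": 6}, {"ADDR": 2}))  # 3
```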
10
Recurrence Minimum Initiation Interval (RecMII)
    z_0 ← Z(1)
    x_0 ← X(1)
    q_0 ← 0.0
    DO k = 1, N
      (a) u_k ← load z_{k-1}
      (b) v_k ← load x_{k-1}
      (c) w_k ← u_k * v_k
      (d) q_k ← q_{k-1} + w_k
      (e) z_k ← z_{k-1} + 4
      (f) x_k ← x_{k-1} + 4
    END DO
11
Recurrence Minimum Initiation Interval (RecMII)
    z_0 ← Z(1)
    x_0 ← X(1)
    q_0 ← 0.0
    DO k = 1, N                      Unit    Lat.
      (a) u_k ← load z_{k-1}         MEM     (6)
      (b) v_k ← load x_{k-1}         MEM     (6)
      (c) w_k ← u_k * v_k            FMULT   (2)
      (d) q_k ← q_{k-1} + w_k        FADD    (2)
      (e) z_k ← z_{k-1} + 4          ADDR    (1)
      (f) x_k ← x_{k-1} + 4          ADDR    (1)
    END DO
[Figure: dependence graph of the loop; edges are labeled (dist, lat). Surviving labels: (1,2), (1,1), (1,1), (1,1), (1,1)]
12
Recurrence Minimum Initiation Interval (RecMII)
[Figure: dependence graph of the loop; edges are labeled (dist, lat). Surviving labels: (1,2), (1,1), (1,1), (1,1), (1,1)]
Quiz: What is the rec_mii for the example?
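One way to work the quiz is sketched below. It uses the usual bound for a dependence cycle, total latency divided by total distance, rounded up. The list of cycles is an assumption read off the loop body (the self-dependences of d, e and f); the figure itself did not survive in this transcript.

```python
from math import ceil


def rec_mii(cycles):
    """Max over dependence cycles of ceil(total latency / total distance).

    Each cycle is a list of (dist, lat) edge labels whose distances sum to
    a positive number.
    """
    return max(
        ceil(sum(lat for _, lat in edges) / sum(dist for dist, _ in edges))
        for edges in cycles
    )


# Assumed recurrences of the inner-product loop:
# q_k needs q_{k-1} (FADD, latency 2); z_k and x_k need z_{k-1} and
# x_{k-1} (ADDR, latency 1). All have distance 1.
inner_product_cycles = [[(1, 2)], [(1, 1)], [(1, 1)]]
print(rec_mii(inner_product_cycles))  # 2
```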
13
Minimum Initiation Interval
In our example we have MII = max(1, 2) = 2.
Therefore the best that we can do without
transforming the loop is to execute it in
2N + C cycles.
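As a rough worked comparison (a sketch, with C standing for the small constant mentioned above), this best case can be set against the 18N cycles of the purely sequential execution:

```latex
T_{\text{seq}}(N) = 18N, \qquad
T_{\text{pipelined}}(N) = \mathrm{MII} \cdot N + C = 2N + C, \qquad
\lim_{N \to \infty} \frac{18N}{2N + C} = 9
```

For long loops the pipelined schedule therefore approaches a ninefold reduction in execution time on this machine model.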
14
Modulo Schedule
In modulo scheduling, we
  (1) start with the first instruction;
  (2) schedule as many instructions as we can in
      every cycle, limited only by the resources
      available and by the dependences.
When a pattern emerges, we adopt the pattern
as our modulo schedule.
Instructions before this pattern form the loop
prologue.
Instructions after this pattern form the loop
epilogue.
15
Recurrence Minimum Initiation Interval (RecMII)
    z_0 ← Z(1)
    x_0 ← X(1)
    q_0 ← 0.0
    DO k = 1, N                      Lat.
      (a) u_k ← load z_{k-1}         (6)
      (b) v_k ← load x_{k-1}         (6)
      (c) w_k ← u_k * v_k            (2)
      (d) q_k ← q_{k-1} + w_k        (2)
      (e) z_k ← z_{k-1} + 4          (1)
      (f) x_k ← x_{k-1} + 4          (1)
    END DO
16
Why an eager scheduler fails in our example
[Figure: eager schedule of operations b, c and d; axes: Iterations (1 to 18) and Cycles (0 to 23)]
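The sketch below is one reading of why eager issue does not settle into a pattern; the explanation is an interpretation, not a statement from the slides. It assumes the loads of iteration k issue at a fixed interval, that c issues as soon as both loads complete, and that d waits for both c of its own iteration and d of the previous one. The function name and parameters are illustrative.

```python
# Latencies from the machine model: load 6, FMULT 2, FADD 2.
LOAD_LAT, MUL_LAT, ADD_LAT = 6, 2, 2


def issue_times(n_iterations: int, load_interval: int):
    """Return (c, d) issue-cycle lists when loads of successive iterations
    issue every `load_interval` cycles and c, d issue as early as possible."""
    c_times, d_times = [], []
    for k in range(n_iterations):
        load = k * load_interval               # loads of iteration k + 1
        c = load + LOAD_LAT                    # c waits for both loads
        d = c + MUL_LAT                        # d waits for c ...
        if d_times:
            d = max(d, d_times[-1] + ADD_LAT)  # ... and for the previous d
        c_times.append(c)
        d_times.append(d)
    return c_times, d_times


for interval in (1, 2):
    c, d = issue_times(8, interval)
    gaps = [dk - ck for ck, dk in zip(c, d)]
    print(interval, gaps)
# 1 [2, 3, 4, 5, 6, 7, 8, 9]
# 2 [2, 2, 2, 2, 2, 2, 2, 2]
```

With one new iteration per cycle the distance between c and d of the same iteration keeps growing, so no repeating pattern appears; issuing one iteration every 2 cycles, the MII, keeps the distance constant.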
17
Why an eager scheduler fails in our example
[Figure: schedule of operations b, c and d with one new iteration every two cycles; axes: Iterations (1 to 18) and Cycles (0 to 23)]
Therefore we can do it in 2N + 9 cycles.
18
Collision vectors
Given the reservation tables for two operations A
and B, the set of forbidden intervals, i.e., the
issue distances at which A and B cannot both be
issued, is called the collision vector for the
reservation tables.
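The definition can be sketched as follows. The representation of a reservation table as a map from resource name to the set of occupied cycle offsets is an assumption made for this example, and the two tables are hypothetical.

```python
def forbidden_intervals(table_a, table_b):
    """Intervals d such that issuing B d cycles after A makes both
    operations need the same resource in the same cycle.

    A reservation table maps a resource name to the set of cycle offsets
    (relative to issue) at which the operation occupies that resource.
    A negative d means B is issued before A.
    """
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for ta in cycles_a:
            for tb in table_b.get(resource, ()):
                # A occupies the resource at ta, B at d + tb: clash when d == ta - tb.
                forbidden.add(ta - tb)
    return forbidden


# Hypothetical tables: a non-pipelined 2-cycle add, and a multiply-add
# that uses the multiplier for two cycles and then the adder.
add = {"adder": {0, 1}}
mul_add = {"mult": {0, 1}, "adder": {2}}
print(sorted(forbidden_intervals(add, mul_add)))  # [-2, -1]
print(sorted(forbidden_intervals(add, add)))      # [-1, 0, 1]
```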
19
A Simplistic Modulo Scheduling Algorithm
1. Compute MII as discussed.
2. Use a modified list scheduling algorithm to
   generate a modulo schedule.
The scheduling algorithm must obey the following
restriction: if an operation P is scheduled at
time t, no other operation that needs the same
resource can be scheduled at any time t ± k × II,
for any k ≥ 0.
The Modulo Reservation Table has II rows,
representing the cycles of the initiation
interval, and as many columns as the resources
that it needs to keep track of.
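A modulo reservation table can be sketched as a small class. The class and method names are illustrative, and the example machine (a single ADDR unit with II = 2) is hypothetical, chosen only to show a collision.

```python
class ModuloReservationTable:
    """II rows (cycle mod II), one column per resource."""

    def __init__(self, ii: int, resources: list[str]):
        self.ii = ii
        self.table = {r: [None] * ii for r in resources}

    def is_free(self, resource: str, time: int) -> bool:
        return self.table[resource][time % self.ii] is None

    def reserve(self, op: str, resource: str, time: int) -> bool:
        """Place `op` at `time` if row (time mod II) is free for `resource`."""
        if not self.is_free(resource, time):
            return False
        self.table[resource][time % self.ii] = op
        return True


mrt = ModuloReservationTable(ii=2, resources=["ADDR"])
print(mrt.reserve("e", "ADDR", 0))  # True
print(mrt.reserve("f", "ADDR", 4))  # False: 4 mod 2 == 0 collides with e
print(mrt.reserve("f", "ADDR", 5))  # True
```

Reducing every time modulo II is what enforces the restriction above: a slot taken at time t is also taken at every time that differs from t by a multiple of II.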
20
Heuristic Method for Modulo Scheduling
Problem: Generate a modulo schedule of a loop by
scheduling instructions until a pattern emerges.
  • Why might a simple variant of list scheduling
    not work?

21
Counter Example I: List Scheduling May Fail
[Figure: dependence graph (node labels B and D survive) with edge labels (0,4), (0,2), (0,2), (1,2)]
Therefore, in a machine with infinite
resources, we must be able to schedule the loop
in 4 cycles.
22
Counter Example I: List Scheduling May Fail
List scheduling: a greedy algorithm that
schedules each operation at its earliest
possible time.
[Figure: dependence graph over A, B, C and D, and a partial schedule for cycles 0 to 3 holding A, C and D]
B must be scheduled after the A of the current
iteration and before the C of the
next iteration. We are deadlocked!
23
Counter Example I: List Scheduling May Fail
[Figure: schedule over cycles 0 to 7 showing operations A, B, C and D, divided into prologue, kernel and epilogue; dependence-graph edge labels (0,4), (0,2), (0,2), (1,2)]
The solution is to create a kernel with
operations from different iterations, and use a
prologue and an epilogue.
24
Counter Example II: List Scheduling May Fail
A1, A3, and A4 are non-pipelined adds that take
two cycles at the adder. M5 and M6 are
non-pipelined multiply operations that take
three cycles each on the multiplier. C2 is a copy
operation that uses the bus for one cycle. What
is the ResMII for these operations in a machine
that has one adder, one multiplier and one bus?
[Figure: dependence graph over A1, C2, A3, A4, M5 and M6 with edge labels (0,2), (0,3), (0,1), (0,2), (0,2), (0,3)]
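A worked answer, as a sketch: since the operations are non-pipelined, each one occupies its unit for its full duration, so the three adds keep the single adder busy for 3 × 2 cycles, the two multiplies keep the multiplier busy for 2 × 3 cycles, and the copy uses the bus for 1 cycle.

```latex
\mathrm{ResMII} = \max\left(
  \left\lceil \frac{3 \times 2}{1} \right\rceil,
  \left\lceil \frac{2 \times 3}{1} \right\rceil,
  \left\lceil \frac{1 \times 1}{1} \right\rceil
\right) = \max(6, 6, 1) = 6
```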
25
Counter Example II: List Scheduling May Fail
[Figure: schedule attempt placing A1, M6, C2, A3 and A4]
26
Counter Example II: List Scheduling May Fail
[Figure: schedule placing A1, M6, C2, A3, A4 and M5]
Although it seems counter-intuitive, we obtain a
modulo schedule with MII = 6 if we initially
schedule both M6 and A3 one cycle later than the
earliest possible time for these operations.
27
Complex Reservation Tables
Consider three independent operations with the
reservation tables shown below.
What is the MII for a loop formed by these three
operations?
28
Is MII = 2 Feasible?
[Figure: modulo reservation table holding A1 and M2]
Deadlocked. Cannot allocate MA3. Even though
MII = max(ResMII, RecMII) = 2, MII = 2 is not
feasible!
29
Does Increasing MII to 3 Help?
[Figure: modulo reservation table with columns Adder, Mult and Bus and rows 0 to 2, holding A1, M2 and MA3]
We find a modulo schedule with MII = 3!
30
Interaction Between Recurrence Constraints and
Resource Constraints
[Figure: dependence graph over A1, A2, A3 and A4 with edge labels (0,2), (0,2), (0,2) and (2,2)]
What is the RecMII for this loop?
RecMII = (2 + 2 + 2 + 2) / 2 = 4
What is the ResMII for the loop?
Therefore MII = max(ResMII, RecMII) = 4
31
Is MII = 4 feasible?
[Figure: partial schedule showing A1 and A2]
In order to finish A4 in time to produce the
result for two iterations later, A3 must
be scheduled at time 4. But 4 modulo 4 = 0,
which conflicts with A1. Therefore there is no
feasible schedule with MII = 4.
32
Scheduling Strategy
An exhaustive search will eventually reveal that
the MII calculated is not feasible, but it might
take too long.
In practice, we compute the MII and spend a
pre-allocated budget of time trying to find
a schedule with the MII. If we don't find one,
we increase the MII.
In some commercial compilers, the search for the
smallest feasible II is a binary search, where
the II is doubled at each step until a feasible
one is found, at which point a linear search
between the last infeasible II and the feasible
one is conducted.
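Both search strategies can be sketched as follows. The callback `try_schedule(ii)` is hypothetical: it stands for one budgeted attempt to build a modulo schedule at a given II and returns True on success. The `max_ii` cutoff is also an assumption added so that the search always terminates.

```python
def search_increment(mii, try_schedule, max_ii):
    """Try II = MII, MII + 1, ... and return the first II that schedules."""
    for ii in range(mii, max_ii + 1):
        if try_schedule(ii):
            return ii
    return None


def search_doubling(mii, try_schedule, max_ii):
    """Double the II until a schedule is found, then scan linearly between
    the last infeasible II and the feasible one."""
    last_bad, ii = mii - 1, mii
    while not try_schedule(ii):
        if ii >= max_ii:
            return None                    # gave up: nothing found up to max_ii
        last_bad, ii = ii, min(ii * 2, max_ii)
    for candidate in range(last_bad + 1, ii):
        if try_schedule(candidate):
            return candidate
    return ii


# Toy stand-in for a scheduler: pretend every II of 5 or more is feasible.
feasible = lambda ii: ii >= 5
print(search_increment(4, feasible, 64))  # 5
print(search_doubling(4, feasible, 64))   # 5
```

The doubling variant makes fewer attempts when the smallest feasible II is far above the MII, at the cost of a few extra attempts in the final linear scan.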
33
Previous Approaches
  • Approach I (Operational)
    • Emulate the loop execution under the machine
      model; a pattern will eventually occur
    • AikenNic88, EbciogluNic89, GaoEtAl91
  • Approach II (Periodic scheduling)
    • Formulate the scheduling problem as a periodic
      scheduling problem and find an optimal solution
    • Lam88, RauEtAl81, GovindAltmanGao94
