Title: CMPUT680 - Winter 2006
1CMPUT680 - Winter 2006
- Topic E Software Pipelining
- José Nelson Amaral
- http://www.cs.ualberta.ca/amaral/courses/680
2Reading List
- Tiger book chapter 20
- Other papers such as GovindAltmanGao97,
RutenbergAtAl97
3Software Pipeline
Software pipelining is a technique that reduces the
execution time of important loops by interleaving
operations from several iterations to make better
use of machine resources.
[Figure: timeline of overlapped loop iterations; time axis from 0 to 16.]
4Software Pipeline
- What limits the speed of a loop?
  - Data dependencies: recurrence initiation interval (rec_mii)
  - Processor resources: resource initiation interval (res_mii)
  - Memory accesses: memory initiation interval (mem_mii)
5Problem Formulation (I)
- Given a weighted dependence graph, derive a
schedule that is time-optimal under a machine
model M.
- Def: A schedule S of a loop L is time-optimal if,
among all legal schedules of L, no schedule is
faster than S.
- Note: There may be more than one time-optimal
schedule.
6Example: The Inner Product
Q = 0.0
DO k = 1, N
  Q = Q + Z(k) * X(k)
ENDDO

z[0] ← Z(1)
x[0] ← X(1)
q[0] ← 0.0
DO k = 1, N
  u[k] ← load z[k-1]
  v[k] ← load x[k-1]
  w[k] ← u[k] * v[k]
  q[k] ← q[k-1] + w[k]
  z[k] ← z[k-1] + 4
  x[k] ← x[k-1] + 4
END DO
Dynamic Single Assignment (DSA): uses an
expanded virtual register (EVR), which is an
infinite, linearly ordered set of virtual
registers.
A program in DSA form has no anti-dependencies and
no output dependencies.
(Dehnert, J. and Towle, R. A., Compiling for
the Cydra 5)
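To make the single-assignment property concrete, here is a minimal Python sketch of the inner product written in DSA style. Each Python list stands in for one expanded virtual register. The function name `inner_product_dsa`, the `assign` helper, and the use of list indices in place of byte addresses (stepping by 1 instead of 4) are assumptions of this sketch, not part of the original example.

```python
def inner_product_dsa(Z, X):
    """Inner product in which every EVR element is written exactly once."""
    n = len(Z)
    # One list per expanded virtual register; None marks "not yet written".
    z = [None] * (n + 1)
    x = [None] * (n + 1)
    q = [None] * (n + 1)
    u = [None] * (n + 1)
    v = [None] * (n + 1)
    w = [None] * (n + 1)

    def assign(evr, k, value):
        # Single assignment: writing the same element twice is an error,
        # so no anti- or output dependence can arise.
        assert evr[k] is None, "element already assigned"
        evr[k] = value

    assign(z, 0, 0)    # assumption: an index into Z models the address
    assign(x, 0, 0)
    assign(q, 0, 0.0)
    for k in range(1, n + 1):
        assign(u, k, Z[z[k - 1]])       # u[k] <- load z[k-1]
        assign(v, k, X[x[k - 1]])       # v[k] <- load x[k-1]
        assign(w, k, u[k] * v[k])       # w[k] <- u[k] * v[k]
        assign(q, k, q[k - 1] + w[k])   # q[k] <- q[k-1] + w[k]
        assign(z, k, z[k - 1] + 1)      # +1 element instead of +4 bytes
        assign(x, k, x[k - 1] + 1)
    return q[n]
```

Because no element is ever overwritten, the only dependences left are true (flow) dependences, which is what the dependence graph of the loop records.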
7Machine Model and Resource Constraints
Which unit does each operation in the loop use?

z[0] ← Z(1)
x[0] ← X(1)
q[0] ← 0.0
DO k = 1, N
  u[k] ← load z[k-1]        MEM
  v[k] ← load x[k-1]        MEM
  w[k] ← u[k] * v[k]        FMULT
  q[k] ← q[k-1] + w[k]      FADD
  z[k] ← z[k-1] + 4         ADDR
  x[k] ← x[k-1] + 4         ADDR
END DO

Machine Model
Unit     Latency
MEM1     6
MEM2     6
ADDR1    1
ADDR2    1
FMULT    2
FADD     2

Without instruction-level parallelism, how long
does the loop take to execute?
(6 + 6 + 2 + 2 + 1 + 1) × N = 18N cycles
8Resource Minimum Initiation Interval (resMII)
Each processor resource defines a minimum
initiation interval for the execution of the loop.
For instance, in the machine model of the
previous example, a loop that requires the
computation of 6 addresses has
ResMII(ADDR) = 6 × 1 / 2 = 3.
The Resource Minimum Initiation Interval of a
loop is the largest of these per-resource bounds.
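The per-resource bound can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `res_mii`, the dictionary layout, the default occupancy of one cycle per operation (fully pipelined units), and rounding up with a ceiling are choices of this sketch rather than something the slides spell out.

```python
from math import ceil

def res_mii(uses, units, occupancy=None):
    """Return (overall ResMII, per-resource bounds).

    uses:      resource class -> operations per iteration that need it
    units:     resource class -> number of identical units available
    occupancy: resource class -> cycles one operation keeps a unit busy
               (assumed to be 1 when not given)
    """
    occupancy = occupancy or {}
    per_resource = {
        r: ceil(n * occupancy.get(r, 1) / units[r])
        for r, n in uses.items()
    }
    return max(per_resource.values()), per_resource

# Six address computations, one cycle each, on two ADDR units.
overall, bounds = res_mii({"ADDR": 6}, {"ADDR": 2})
print(overall, bounds)
```

For the six-address case this reproduces the bound 6 × 1 / 2 = 3 worked out above.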
9ResMII
z[0] ← Z(1)
x[0] ← X(1)
q[0] ← 0.0
DO k = 1, N
  u[k] ← load z[k-1]        MEM
  v[k] ← load x[k-1]        MEM
  w[k] ← u[k] * v[k]        FMULT
  q[k] ← q[k-1] + w[k]      FADD
  z[k] ← z[k-1] + 4         ADDR
  x[k] ← x[k-1] + 4         ADDR
END DO

Machine Model
Unit     Latency
MEM1     6
MEM2     6
ADDR1    1
ADDR2    1
FMULT    2
FADD     2

There are enough units to schedule all the
instructions of the loop in the same cycle.
Therefore ResMII = 1. Can we execute the loop in
N + C cycles (C a small constant)?
10Recurrence Minimum Initiation Interval (RecMII)
z[0] ← Z(1)
x[0] ← X(1)
q[0] ← 0.0
DO k = 1, N
  (a) u[k] ← load z[k-1]
  (b) v[k] ← load x[k-1]
  (c) w[k] ← u[k] * v[k]
  (d) q[k] ← q[k-1] + w[k]
  (e) z[k] ← z[k-1] + 4
  (f) x[k] ← x[k-1] + 4
END DO
11Recurrence Minimum Initiation Interval (RecMII)
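z[0] ← Z(1)
x[0] ← X(1)
q[0] ← 0.0
DO k = 1, N                       Unit     Lat.
  (a) u[k] ← load z[k-1]          MEM      (6)
  (b) v[k] ← load x[k-1]          MEM      (6)
  (c) w[k] ← u[k] * v[k]          FMULT    (2)
  (d) q[k] ← q[k-1] + w[k]        FADD     (2)
  (e) z[k] ← z[k-1] + 4           ADDR     (1)
  (f) x[k] ← x[k-1] + 4           ADDR     (1)
END DO
[Figure: dependence graph of operations (a) to (f). The edge labels that survive, in (dist, lat) form, are (1,2), (1,1), (1,1), (1,1), (1,1).]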
12Recurrence Minimum Initiation Interval (RecMII)
Quiz: What is the rec_mii for the example?
[Figure: dependence graph with (dist, lat) edge labels (1,2), (1,1), (1,1), (1,1), (1,1).]
13Minimum Initiation Interval
In our example we have
MII = max(ResMII, RecMII) = max(1, 2) = 2.
Therefore the best that we can do without
transforming the loop is to execute it in
2N + C cycles.
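The recurrence bound can be checked with a short Python sketch. It assumes the usual definition, in which each dependence cycle contributes the sum of its latencies divided by the sum of its distances, rounded up. The list of cycles below is a reading of the example loop (self-dependences of (d), (e), and (f)), not something recovered from the figure, and the names `rec_mii`, `mii`, and `example_cycles` are hypothetical.

```python
from math import ceil

def rec_mii(cycles):
    """cycles: list of dependence cycles, each a list of (dist, lat) edges."""
    bound = 0
    for edges in cycles:
        total_dist = sum(dist for dist, _ in edges)
        total_lat = sum(lat for _, lat in edges)
        if total_dist == 0:
            raise ValueError("a cycle with zero total distance is not schedulable")
        bound = max(bound, ceil(total_lat / total_dist))
    return bound

def mii(res, rec):
    """Minimum initiation interval: the larger of the two lower bounds."""
    return max(res, rec)

# Assumed cycles of the inner-product loop, all of distance 1:
# (d) q[k] depends on q[k-1] with latency 2,
# (e) z[k] on z[k-1] and (f) x[k] on x[k-1] with latency 1.
example_cycles = [[(1, 2)], [(1, 1)], [(1, 1)]]
print(rec_mii(example_cycles), mii(1, rec_mii(example_cycles)))
```

Under these assumptions the accumulation into q is the critical recurrence, which is why a new iteration cannot start more often than every 2 cycles.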
14Modulo Schedule
In modulo scheduling, we
(1) start with the first instruction;
(2) schedule as many instructions as we can in
every cycle, limited only by the resources
available and by the dependences.
When a pattern emerges, we adopt the pattern
as our modulo schedule.
Instructions before this pattern form the loop
prologue.
Instructions after this pattern form the loop
epilogue.
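The roles of prologue, repeating pattern, and epilogue can be illustrated with a hand-pipelined inner product in Python. This is a sketch under assumptions: the split into three stages (load, multiply, add) and the name `pipelined_inner_product` are chosen for illustration, and Python statements run one after another, so the code shows only which iteration each operation belongs to, not real cycle-level overlap.

```python
def pipelined_inner_product(Z, X):
    n = len(Z)
    if n < 2:
        # Too short to fill the pipeline; fall back to the plain loop.
        return sum(z * x for z, x in zip(Z, X))
    q = 0.0
    # Prologue: start iterations 0 and 1 without finishing either.
    u, v = Z[0], X[0]          # load for iteration 0
    w = u * v                  # multiply for iteration 0
    u, v = Z[1], X[1]          # load for iteration 1
    # Repeating pattern: each pass mixes three different iterations.
    for k in range(2, n):
        q = q + w              # add for iteration k-2
        w = u * v              # multiply for iteration k-1
        u, v = Z[k], X[k]      # load for iteration k
    # Epilogue: drain the two iterations still in flight.
    q = q + w                  # add for iteration n-2
    w = u * v                  # multiply for iteration n-1
    q = q + w                  # add for iteration n-1
    return q
```

The three statements inside the `for` loop are mutually independent within one pass, which is what lets a machine with enough units issue them together.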
15Recurrence Minimum Initiation Interval (RecMII)
z[0] ← Z(1)
x[0] ← X(1)
q[0] ← 0.0
DO k = 1, N                       Lat.
  (a) u[k] ← load z[k-1]          (6)
  (b) v[k] ← load x[k-1]          (6)
  (c) w[k] ← u[k] * v[k]          (2)
  (d) q[k] ← q[k-1] + w[k]        (2)
  (e) z[k] ← z[k-1] + 4           (1)
  (f) x[k] ← x[k-1] + 4           (1)
END DO
16Why an eager scheduler fails in our example
[Figure: eager schedule. Columns are iterations 1 to 18, rows are cycles 0 to 23. The loads b1, b2, ... issue one per cycle starting at cycle 0 and the multiplies c1, c2, ... one per cycle starting at cycle 6, but the adds d1, d2, ... issue only every second cycle starting at cycle 8.]
17Why an eager scheduler fails in our example
[Figure: schedule that starts one iteration every 2 cycles. Columns are iterations 1 to 18, rows are cycles 0 to 23. The loads b1, b2, ... issue at cycles 0, 2, 4, ..., the multiplies c1, c2, ... at cycles 6, 8, ..., and the adds d1, d2, ... at cycles 8, 10, ....]
Therefore we can do it in 2N + 9 cycles.
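The difference between the two schedules can be reproduced with a small Python model of the issue times. It is a sketch under assumptions: only the load (b), multiply (c), and add (d) of each iteration are modeled, resources are taken to be unlimited, and the names `issue_times` and `load_to_add_gap` are hypothetical.

```python
LOAD_LAT, MULT_LAT, ADD_LAT = 6, 2, 2   # latencies from the machine model

def issue_times(n, load_interval):
    """Issue cycles (b, c, d) for n iterations.

    load_interval=1 models the eager schedule; load_interval=2 starts a
    new iteration only every 2 cycles.
    """
    times = []
    prev_d = None
    for i in range(n):
        b = i * load_interval
        c = b + LOAD_LAT                   # multiply waits for the load
        d = c + MULT_LAT                   # add waits for the multiply ...
        if prev_d is not None:
            d = max(d, prev_d + ADD_LAT)   # ... and for the previous add
        times.append((b, c, d))
        prev_d = d
    return times

def load_to_add_gap(times):
    """Cycles between each iteration's load and the add that consumes it."""
    return [d - b for b, _, d in times]

print(load_to_add_gap(issue_times(5, 1)))   # eager
print(load_to_add_gap(issue_times(5, 2)))   # one iteration every 2 cycles
```

In this model the eager schedule lets the gap between a load and its add grow by one cycle per iteration, so no fixed pattern ever repeats, while spacing the iterations 2 cycles apart keeps the gap constant.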
18Collision vectors
Given the reservation tables for two operations A
and B, the set of forbidden intervals, i.e., the
issue distances at which A and B cannot both be
issued without a resource conflict, is called the
collision vector for the reservation tables.
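As an illustration, the forbidden intervals can be computed directly from two reservation tables. The sketch below assumes each table is a dictionary from a resource name to the set of cycles, counted from the issue cycle, in which the operation occupies that resource, and it considers only the case where B is issued at or after A. The name `forbidden_intervals` and this representation are assumptions of the sketch.

```python
def forbidden_intervals(table_a, table_b):
    """Intervals t >= 0 at which issuing B t cycles after A collides."""
    forbidden = set()
    for resource, cycles_a in table_a.items():
        for ca in cycles_a:
            for cb in table_b.get(resource, ()):
                # A (issued at 0) holds the resource at cycle ca;
                # B (issued at t) holds it at cycle t + cb.
                t = ca - cb
                if t >= 0:
                    forbidden.add(t)
    return sorted(forbidden)

# A non-pipelined two-cycle add against another add on the same adder.
add = {"adder": {0, 1}}
print(forbidden_intervals(add, add))
```

Two operations that share no resource have an empty set of forbidden intervals and can be issued at any distance.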
19A Simplistic Modulo Scheduling Algorithm
1. Compute the MII as discussed.
2. Use a modified list scheduling algorithm to
generate a modulo schedule.
The scheduling algorithm must obey the following
restriction: if an operation P is scheduled at
time t, no operation that conflicts with P for a
resource can be scheduled at any time t ± k × II,
for any k ≥ 0.
The Modulo Reservation Table has II rows,
representing the cycles of the initiation
interval, and as many columns as there are
resources to keep track of.
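A modulo reservation table is straightforward to sketch in Python. The class below is a minimal illustration under assumptions: the class and method names are hypothetical, and an operation's resource usage is given as a list of (resource, offset) pairs, where the offset is counted from the operation's issue time.

```python
class ModuloReservationTable:
    def __init__(self, ii, resources):
        self.ii = ii
        # One row per cycle of the initiation interval,
        # one column per resource being tracked.
        self.rows = [{r: None for r in resources} for _ in range(ii)]

    def is_free(self, time, usage):
        """True if an operation issued at `time` finds every slot free."""
        slots = [((time + off) % self.ii, res) for res, off in usage]
        # An operation that needs the same slot twice cannot fit at this II.
        if len(set(slots)) != len(slots):
            return False
        return all(self.rows[row][res] is None for row, res in slots)

    def reserve(self, op, time, usage):
        """Record the operation and return True, or return False on conflict."""
        if not self.is_free(time, usage):
            return False
        for res, off in usage:
            self.rows[(time + off) % self.ii][res] = op
        return True

mrt = ModuloReservationTable(4, ["adder"])
print(mrt.reserve("P", 0, [("adder", 0)]))   # first use of row 0
print(mrt.reserve("Q", 4, [("adder", 0)]))   # 4 mod 4 = 0: same row as P
```

Folding every time into a row with `% self.ii` is what enforces the restriction: a slot taken at time t is also taken at every time that differs from t by a multiple of II.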
20Heuristic Method for Modulo Scheduling
Problem: Generate a modulo schedule of a loop by
scheduling instructions until a pattern emerges.
- Why might a simple variant of list scheduling
not work?
21Counter Example I: List Scheduling May Fail
[Figure: dependence graph over operations A, B, C, and D with (dist, lat) edge labels (0,4), (0,2), (0,2), (1,2).]
Therefore, in a machine with infinite
resources, we must be able to schedule the loop
in 4 cycles.
22Counter Example I: List Scheduling May Fail
List scheduling: a greedy algorithm that
schedules each operation at its earliest
possible time.
[Figure: dependence graph over A, B, C, and D, and a partial schedule for cycles 0 to 3 containing A, C, and D.]
B must be scheduled after the A of the current
iteration and before the C of the
next iteration. We are deadlocked!
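23Counter Example I: List Scheduling May Fail
[Figure: the dependence graph, with (dist, lat) edge labels (0,4), (0,2), (0,2), (1,2), next to a schedule over cycles 0 to 7 divided into prologue, kernel, and epilogue. A and C appear with cycles 0 to 3, B and D with cycles 4 to 7.]
The solution is to create a kernel with
operations from different iterations, and use a
prologue and an epilogue.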
24Counter Example II: List Scheduling May Fail
A1, A3, and A4 are non-pipelined adds that take
two cycles each on the adder.
M5 and M6 are non-pipelined multiply operations
that take three cycles each on the multiplier.
C2 is a copy operation that uses the bus for one
cycle.
What is the ResMII for these operations in a
machine that has one adder, one multiplier, and
one bus?
[Figure: dependence graph over the six operations, with (dist, lat) labels A1 (0,2), M6 (0,3), C2 (0,1), A3 (0,2), A4 (0,2), M5 (0,3).]
25Counter Example II: List Scheduling May Fail
[Figure: schedule and reservation table showing operations A1, M6, C2, A3, and A4.]
26Counter Example II: List Scheduling May Fail
[Figure: schedule and reservation table showing operations A1, M6, C2, A3, A4, and M5.]
Although it seems counter-intuitive, we obtain a
modulo schedule with MII = 6 if we initially
schedule both M6 and A3 one cycle later than the
earliest possible time for these operations.
27Complex Reservation Tables
Consider three independent operations with the
reservation tables shown below.
What is the MII for a loop formed by these three
operations?
28Is the MII = 2 Feasible?
[Figure: modulo reservation table with A1 and M2 placed.]
Deadlocked: MA3 cannot be allocated. Even though
MII = max(ResMII, RecMII) = 2, MII = 2 is not
feasible!
29Does Increasing the MII to 3 Help?
[Figure: modulo reservation table with columns Adder, Mult, and Bus and rows 0, 1, 2, holding A1, M2, and MA3.]
We find a modulo schedule with MII = 3!
30Interaction Between Recurrence Constraints and
Resource Constraints
[Figure: dependence graph over A1, A2, A3, and A4 with (dist, lat) edge labels (0,2), (0,2), (0,2), and (2,2).]
What is the RecMII for this loop?
RecMII = (2 + 2 + 2 + 2) / 2 = 4
What is the ResMII for the loop?
Therefore MII = max(ResMII, RecMII) = 4
31Is the MII 4 feasible?
[Figure: partial schedule showing A1 and A2, three times each.]
In order to finish A4 in time to produce the
result needed two iterations later, A3 must
be scheduled at time 4. But 4 mod 4 = 0,
which conflicts with A1. Therefore there is no
feasible schedule with MII = 4.
32Scheduling Strategy
An exhaustive search will eventually reveal that
the MII calculated is not feasible, but it might
take too long.
In practice, we compute the MII and spend a
pre-allocated budget of time trying to find
a schedule with the MII. If we don't find one,
we increase the II.
In some commercial compilers, the search for the
smallest feasible II is a binary search, where
the II is doubled at each step until a feasible
one is found, at which point a linear search
between the last infeasible II and the feasible
one is conducted.
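Both search strategies can be sketched in Python. The sketch assumes a hypothetical callback `try_schedule(ii)` that returns True when the scheduler finds a schedule at that II within its time budget; the function names and the `max_ii` cut-off are also assumptions, and `mii` is taken to be at least 1.

```python
def find_ii_linear(mii, try_schedule, max_ii):
    """Try II = mii, mii + 1, ... and return the first one that schedules."""
    for ii in range(mii, max_ii + 1):
        if try_schedule(ii):
            return ii
    return None

def find_ii_doubling(mii, try_schedule, max_ii):
    """Double the II until it schedules, then scan upward from the last failure."""
    ii = mii
    last_failed = mii - 1
    while not try_schedule(ii):
        last_failed = ii
        if ii >= max_ii:
            return None                  # nothing schedules up to max_ii
        ii = min(ii * 2, max_ii)
    # ii schedules; look for a smaller II between the last failure and ii.
    for candidate in range(last_failed + 1, ii):
        if try_schedule(candidate):
            return candidate
    return ii

# Toy scheduler for illustration only: pretend every II >= 7 schedules.
print(find_ii_doubling(4, lambda ii: ii >= 7, 64))
```

The doubling variant reaches a feasible II in few attempts when the MII is far too optimistic, at the cost of a second pass over the intermediate values.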
33Previous Approaches
- Approach I (Operational)
  - Emulate the loop execution under the machine
  model until a pattern eventually occurs
  - AikenNic88, EbciogluNic89, GaoEtAl91
- Approach II (Periodic scheduling)
  - Formulate the scheduling problem as a periodic
  scheduling problem and find an optimal solution
  - Lam88, RauEtAl81, GovindAltmanGao94