Title: Software Optimisation
1- Chapter 12
- Software Optimisation
2Software Optimisation Chapter
- This chapter consists of three parts:
- Part 1: Optimisation Methods.
- Part 2: Software Pipelining.
- Part 3: Multi-cycle Loop Pipelining.
3- Chapter 12
- Software Optimisation
- Part 1 - Optimisation Methods
4Objectives
- Introduction to optimisation and the optimisation procedure.
- Optimisation of C code using the code generation tools.
- Optimisation of assembly code.
5Introduction
- Software optimisation is the process of manipulating software code to achieve two main goals:
- Faster execution time.
- Smaller code size.
- Note: It will be shown that in general there is a trade-off between faster execution time and smaller code size.
6Introduction
- To implement efficient software, the programmer must be familiar with:
- Processor architecture.
- Programming language (C, assembly or linear assembly).
- The code generation tools (compiler, assembler and linker).
7Code Optimisation Procedure
9Optimising C Compiler Options
- The C6x optimising C compiler takes ANSI C source code and can currently reach up to about 80% of the performance of hand-scheduled assembly.
- However, to achieve this level of optimisation, knowledge of the different levels of optimisation is essential. Optimisation is performed at different stages and levels.
10Assembly Optimisation
- To develop an appreciation of how to optimise code, let us optimise an FIR filter:
$$ y = \sum_{i=0}^{N-1} h_i x_i \qquad (1) $$
11Assembly Optimisation
- To implement Equation 1, we need to perform the following steps:
- (1) Load the sample xi.
- (2) Load the coefficient hi.
- (3) Multiply xi and hi.
- (4) Add (xi × hi) to the content of an accumulator.
- (5) Repeat steps 1 to 4 N-1 times.
- (6) Store the value in the accumulator to y.
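The six steps above can be sketched in C as follows. This is only an illustration: the function name `fir_sum`, the 16-bit sample type and the 32-bit accumulator are assumptions, not taken from the slides.

```c
#include <stdint.h>

/* Sum of products following steps (1) to (6).
 * fir_sum, x, h and n are illustrative names (assumptions). */
int32_t fir_sum(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;                      /* accumulator */
    for (int i = 0; i < n; i++) {         /* (5) repeat for every tap */
        int16_t xi = x[i];                /* (1) load the sample */
        int16_t hi = h[i];                /* (2) load the coefficient */
        int32_t prod = (int32_t)xi * hi;  /* (3) multiply */
        acc += prod;                      /* (4) add to the accumulator */
    }
    return acc;                           /* (6) the caller stores this in y */
}
```

Each line of the loop body corresponds to one instruction class in the assembly version (LDH, LDH, MPY, ADD), while the `for` statement hides the SUB and B loop overhead.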
12Assembly Optimisation
- Steps 1 to 6 can be translated into the following
C6x assembly code
13Assembly Optimisation
- In order to optimise the code, we need to:
- (1) Use instructions in parallel.
- (2) Remove the NOPs.
- (3) Remove the loop overhead (remove SUB and B by loop unrolling).
- (4) Use word access or double-word access instead of byte or half-word access.
14Step 1 - Using Parallel Instructions
ldh
ldh
nop
nop
nop
nop
mpy
nop
add
sub
b
nop
nop
nop
nop
nop
15Step 1 - Using Parallel Instructions
ldh
ldh
nop
nop
nop
nop
mpy
nop
add
sub
b
nop
nop
nop
nop
nop
Note: Not all instructions can be put in parallel, since the result of one unit is used as an input to the following unit.
16Step 2 - Removing the NOPs
ldh
ldh
sub
b
nop
nop
mpy
nop
add
17Step 3 - Loop Unrolling
- The SUB and B instructions consume at least two
extra cycles per iteration (this is known as
branch overhead).
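As a rough C-level illustration of removing branch overhead, the sketch below unrolls a sum-of-products loop by four, so that one decrement-and-branch is paid per four taps instead of per tap. The name `dotp_unrolled4` and the requirement that `n` is a multiple of 4 are assumptions made to keep the example short.

```c
#include <stdint.h>

/* Sum of products unrolled by four.
 * Assumes n is a multiple of 4 (an assumption of this sketch). */
int32_t dotp_unrolled4(const int16_t *x, const int16_t *h, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i += 4) {       /* one loop test per four taps */
        acc += (int32_t)x[i]     * h[i];
        acc += (int32_t)x[i + 1] * h[i + 1];
        acc += (int32_t)x[i + 2] * h[i + 2];
        acc += (int32_t)x[i + 3] * h[i + 3];
    }
    return acc;
}
```

The loop body is four times longer, which is the code-size cost of unrolling.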
18Step 4 - Word or Double Word Access
- The C6711 has two 64-bit data buses for data memory access, and therefore up to two 64-bit values can be loaded into the registers at any time (see Chapter 2).
- In addition, the C6711 devices have variants of the multiplication instruction to support different operations (see Chapter 2).
- Note: A store can only be up to 32 bits wide.
19Step 4 - Word or Double Word Access
- Using word access, MPY and MPYH, the previous code can be written as
- Note: By loading words and using the MPY and MPYH instructions, the execution time has been halved, since two 16x16-bit multiplications are performed in each iteration.
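The idea can be modelled in portable C. In this sketch each 32-bit word holds two 16-bit samples; `mpy_lo` and `mpy_hi` imitate what MPY (low halves) and MPYH (high halves) compute. All function names, and the packing of the first sample into the low half, are assumptions for illustration.

```c
#include <stdint.h>

/* Signed 16x16 multiply of the low halves, in the manner of MPY.
 * The unsigned-to-int16_t casts rely on the usual two's-complement wrap. */
static int32_t mpy_lo(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int16_t)(b & 0xFFFFu);
}

/* Signed 16x16 multiply of the high halves, in the manner of MPYH. */
static int32_t mpy_hi(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int16_t)(b >> 16);
}

/* Pack two 16-bit samples into one word (low half first). */
static uint32_t pack2(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* Two multiplications per word loaded, accumulated in two running sums. */
int32_t dotp_words(const uint32_t *xw, const uint32_t *hw, int nwords)
{
    int32_t sum = 0, sumh = 0;
    for (int i = 0; i < nwords; i++) {
        sum  += mpy_lo(xw[i], hw[i]);
        sumh += mpy_hi(xw[i], hw[i]);
    }
    return sum + sumh;   /* combined once, outside the loop */
}
```

The loop runs half as many times as the half-word version for the same number of samples.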
20Optimisation Summary
- It has been shown that there are four complementary methods for code optimisation:
- Using instructions in parallel.
- Filling the delay slots with useful code.
- Using word or double-word loads.
These three increase performance and reduce code size.
- Loop unrolling.
This increases performance but increases code size.
22- Chapter 12
- Software Optimisation
- Part 2 - Software Pipelining
23Objectives
- Why use software pipelining (SP)?
- Understand software pipelining concepts.
- Use the software pipelining procedure.
- Code the word-wide software pipelined dot-product routine.
- Determine if your pipelined code is more efficient with or without prolog and epilog.
24Why use Software Pipelining (SP)?
- SP creates highly optimized loop code by:
- Putting several instructions in parallel.
- Filling delay slots with useful code.
- Maximizing the use of functional units.
- SP is implemented by simply using the tools:
- Compiler options -o2 or -o3.
- Assembly Optimizer for a .sa file.
25Software Pipeline concept
To explain the concept of software pipelining, we
will assume that all instructions execute in one
cycle.
26Software Pipeline Example
Loop body: LDH || LDH, then MPY, then ADD (three cycles per iteration).
How many cycles would it take to perform this loop 5 times? (Disregard delay slots.)
5 x 3 = 15 cycles
Let's examine hardware (functional unit) usage ...
27Non-Pipelined Code
.D1
.D2
mpy
add
28Pipelining Code
Pipelining these instructions took 1/2 the cycles!
29Pipelining Code
Pipelining these instructions takes only 7 cycles!
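The two cycle counts quoted on these slides (15 and 7) follow from simple arithmetic, sketched below. The function names are assumptions, and the model assumes every stage takes one cycle and that enough functional units exist to overlap the stages.

```c
/* Cycles for n iterations when each iteration runs its stages back to back. */
int cycles_sequential(int n, int stages)
{
    return n * stages;
}

/* Cycles when a new iteration starts every cycle: the first result needs
 * `stages` cycles, and each further iteration adds one more cycle. */
int cycles_pipelined(int n, int stages)
{
    return n + stages - 1;
}
```

With 5 iterations of the 3-stage loop this gives 15 cycles without pipelining and 7 with it, roughly half.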
30Pipelining Code
Prolog: staging for the loop.
Epilog: completing the final operations.
31Pipelined Code
32Software Pipelining Procedure
1. Write the algorithm in C and verify it.
2. Write C6x linear assembly code.
3. Create a dependency graph (4 steps).
4. Allocate registers.
5. Create a scheduling table.
6. Translate the scheduling table to C6x code.
33Software Pipelining Example (Step 1)
35Write code in Linear Assembly (Step 2)
    for (i = 0; i < count; i++) {
        prod = m[i] * n[i];
        sum += prod;
    }

loop:   ldh   *p_m++, m
        ldh   *p_n++, n
        mpy   m, n, prod
        add   prod, sum, sum
[count] sub   count, 1, count
[count] b     loop
- 1. No NOPs required.
- 2. No parallel instructions required.
- 3. You do not have to specify:
- Functional units, or
- Registers.
37Dependency Graph Terminology
38Dependency Graph Steps
- (a) Draw the algorithm nodes and paths.
- (b) Write the number of cycles it takes for each instruction to complete execution.
- (c) Assign the required functional units to each node.
- (d) Partition the nodes into A and B sides and assign sides to all functional units.
39Dependency Graph (Step a)
- In this step each instruction is represented by a node.
- The node is represented by a circle, where:
- Outside: write the instruction.
- Inside: write the register where the result is written.
- Nodes are then connected by paths showing the data flow.
- Note: Conditional paths are represented by dashed lines.
40Dependency Graph (Step a)
46Dependency Graph (Step b)
- In this step the number of cycles it takes for each instruction to complete execution is added to the dependency graph.
- It is written along the associated data path.
47Dependency Graph (Step b)
48Dependency Graph (Step c)
- In this step functional units are assigned to each node.
- It is advantageous to start allocating units to instructions which require a specific unit:
- Load/Store.
- Branch.
- We do not need to be concerned with the multiply, as this is the only operation that the .M unit performs.
- Note: The side is not allocated at this stage.
49Dependency Graph (Step c)
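[Figure: dependency graph with functional units assigned (.D, .D, .M, .S) and path latencies in cycles (5, 5, 2, 1, 1, 1, 6).]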
50Dependency Graph (Step d)
- The data path is partitioned into side A and side B at this stage.
- To optimise the code we need to ensure that a maximum number of units are used with a minimum number of cross paths.
- To make the partition visible on the dependency graph a line is used.
- The side can then be added to the functional units associated with each instruction or node.
51Dependency Graph (Step d)
[Figure: the same dependency graph partitioned into an A side and a B side, with units .D, .D, .M, .S and path latencies in cycles (5, 5, 2, 1, 1, 1, 6).]
54Step 4 - Allocate Functional Units
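Unit   A side       Unit   B side
.L1    sum          .L2    count
.M1    prod         .M2    -
.D1    m            .D2    n
.S1    -            .S2    loop
x1     .M1x         x2     -
Do we have enough functional units to code this algorithm in a single-cycle loop?
[Figure: dependency graph with units and sides assigned (.D1, .D2, .M1x, .L1, .L2, .S2) and path latencies in cycles (5, 5, 2, 1, 1, 1, 6).]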
55Step 4 - Allocate Registers
57Step 5 - Create Scheduling Table
[Table: empty scheduling table. Rows: functional units .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2. Columns: cycles 1 to 8.]
How do we know the loop ends up in cycle 8?
58Length of Prolog
- Answer:
- Count up the length of the longest path (LDH, MPY, ADD). In this case we have:
- 5 + 2 + 1 = 8 cycles
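The answer above is a sum of latencies along the longest data path. A minimal C sketch of that arithmetic follows; the function names are assumptions, and the latencies (5, 2, 1) are the ones quoted on the slide.

```c
/* Sum the latencies (in cycles) along one path of the dependency graph. */
int path_length(const int *latency, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += latency[i];
    return total;
}

/* The loop kernel first runs in the cycle equal to the longest path length,
 * so the prolog occupies all the cycles before it. */
int prolog_cycles(int longest_path)
{
    return longest_path - 1;
}
```

For the path LDH (5), MPY (2), ADD (1) the kernel starts in cycle 8 and the prolog takes 7 cycles.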
60Scheduling Table
add
Where do we want to branch?
61Scheduling Table
63Translate Scheduling Table to C6x Code
PROLOG
C1:          ldh  .D1   *A1++,A2
     ||      ldh  .D2   *B1++,B2
C2:          ldh  .D1   *A1++,A2
     ||      ldh  .D2   *B1++,B2
     || [B0] sub  .L2   B0,1,B0
C3, C4, C5 (three identical cycles):
             ldh  .D1   *A1++,A2
     ||      ldh  .D2   *B1++,B2
     || [B0] sub  .L2   B0,1,B0
     || [B0] B    .S2   loop
C6, C7 (two identical cycles):
             ldh  .D1   *A1++,A2
     ||      ldh  .D2   *B1++,B2
     || [B0] sub  .L2   B0,1,B0
     || [B0] B    .S2   loop
     ||      mpy  .M1x  A2,B2,A3
LOOP (single-cycle loop)
loop:        ldh  .D1   *A1++,A2
     ||      ldh  .D2   *B1++,B2
     || [B0] sub  .L2   B0,1,B0
     || [B0] B    .S2   loop
     ||      mpy  .M1x  A2,B2,A3
     ||      add  .L1   A4,A3,A4
See Chapter 14 for practical examples
71Translate Scheduling Table to C6x Code
- With this method we have only created the prolog and the loop.
- Therefore, if the filter has 100 taps, we need to repeat the loop 100 times, as we need 100 adds.
- This means that we are performing 107 loads. These 7 extra loads may lead to some illegal memory accesses.
72Solution The Epilog
We only created the Prolog and the Loop. What about the Epilog?
The Epilog can be extracted from your results as described below.
See the example in the next slide.
79Dot-Product with Epilog
Prolog
p1:   ldh || ldh
p2:   ldh || ldh || sub
p3:   ldh || ldh || sub || b
p4:   ldh || ldh || sub || b
p5:   ldh || ldh || sub || b
p6:   ldh || ldh || mpy || sub || b
p7:   ldh || ldh || mpy || sub || b
Loop
loop: ldh || ldh || mpy || add || sub || b
Epilog
e1:   mpy || add
e2:   mpy || add
e3:   mpy || add
e4:   mpy || add
e5:   mpy || add
e6:   add
e7:   add
80Scheduling Table Prolog, Loop and Epilog
81Loop only!
- Can the code be written as a loop only (i.e. no
prolog or epilog)?
Yes!
85Loop only!
(i) Remove all instructions except the branch.
(ii) Zero the input registers, accumulator and product registers.
(iii) Adjust the number of subtractions.
[Table: scheduling table in which the prolog keeps only the branches, the zeroing of sum, a, b and prod, and one sub, ahead of the loop.]
86Loop Only - Final Code
; Overhead
        b     loop
        b     loop
        b     loop
   ||   zero  m         ; input register
   ||   zero  n         ; input register
        b     loop
   ||   zero  prod      ; product register
   ||   zero  sum       ; accumulator
        b     loop
   ||   sub             ; modify count register
; Loop
loop:   ldh
   ||   ldh
   ||   mpy
   ||   add
   ||   sub
   ||   b     loop
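Why zeroed registers make a loop-only version work can be seen in a small C model. This is a sketch, not cycle-accurate C6x behaviour: loads and multiplies are modelled as simple delay lines (5 and 2 cycles), the kernel runs count + 7 times, and loads past the end of the arrays are replaced by zeros instead of touching memory. The function name and all of these modelling choices are assumptions.

```c
#include <stdint.h>

enum { LOAD_LAT = 5, MPY_LAT = 2 };   /* cycles from issue to usable result */

/* Dot product computed by running only the kernel (no prolog, no epilog). */
int32_t dotp_loop_only(const int16_t *m, const int16_t *n, int count)
{
    int16_t m_pipe[LOAD_LAT] = {0};   /* zeroed input registers */
    int16_t n_pipe[LOAD_LAT] = {0};
    int32_t p_pipe[MPY_LAT] = {0};    /* zeroed product register */
    int32_t sum = 0;                  /* zeroed accumulator */

    /* The 7 extra passes drain the loads and multiplies still in flight. */
    for (int t = 0; t < count + LOAD_LAT + MPY_LAT; t++) {
        sum += p_pipe[MPY_LAT - 1];                       /* add */
        p_pipe[1] = p_pipe[0];
        p_pipe[0] = (int32_t)m_pipe[LOAD_LAT - 1]
                  * n_pipe[LOAD_LAT - 1];                 /* mpy */
        for (int k = LOAD_LAT - 1; k > 0; k--) {
            m_pipe[k] = m_pipe[k - 1];
            n_pipe[k] = n_pipe[k - 1];
        }
        /* ldh: beyond the arrays this sketch loads zero (assumption). */
        m_pipe[0] = (t < count) ? m[t] : 0;
        n_pipe[0] = (t < count) ? n[t] : 0;
    }
    return sum;
}
```

During the first seven passes the multiply and add operate on zeros, so they leave the accumulator untouched, which is the role the prolog played before.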
87Laboratory exercise
- Software pipeline the LDW version of the dot-product routine:
- (1) Write linear assembly.
- (2) Create dependency graph.
- (3) Complete scheduling table.
- (4) Transfer table to C6000 code.
- To Epilog or Not to Epilog?
- Determine if your pipelined code is more efficient with or without prolog and epilog.
88Lab Solution Step 1 - Linear Assembly
    for (i = 0; i < count; i++) {
        prod = m[i] * n[i];
        sum += prod;
    }
    ; count becomes 20

loop:   ldw   *p_m++, m
        ldw   *p_n++, n
        mpy   m, n, prod
        mpyh  m, n, prodh
        add   prod, sum, sum
        add   prodh, sumh, sumh
[count] sub   count, 1, count
[count] b     loop
; Outside of loop
        add   sum, sumh, sum
89 Step 2 - Dependency Graph
90Step 2 - Functional Units
Unit   A side       Unit   B side
.L1    sum          .L2    sumh
.M1    prod         .M2    prodh
.D1    m            .D2    n
.S1    loop         .S2    count
x1     .M1x         x2     .M2x
Do we still have enough functional units to code this algorithm in a single-cycle loop? Yes!
91Step 2 - Registers
Register File A              Register File B
A0                           B0   count
A1                           B1
A2                           B2
A3                           B3   return address
A4   a/ret value             B4   x
A5   a                       B5   x
A6   count/prod              B6   prodh
A7   sum                     B7   sumh
92Step 3 - Schedule Algorithm
add
add
93Step 4 - C6000 Code
- The complete code is available in the following location:
- \Links\DotP LDW.pdf
94Why Conditional Subtract?
loop:   ldh   *p_m++, m
        ldh   *p_n++, n
        mpy   m, n, prod
        add   prod, sum, sum
[count] sub   count, 1, count
[count] b     loop

In the table below, (B) is a branch that is taken and (B) X is a branch that is not taken because count is zero.

Without cond. subtract        With cond. subtract
loop (count = 1)  (B)         loop (count = 1) (B)
loop (count = 0)  (B) X       loop (count = 0) (B) X
loop (count = -1) (B)         loop (count = 0) (B) X
loop (count = -2) (B)         loop (count = 0) (B) X
loop (count = -3) (B)         loop (count = 0) (B) X
loop (count = -4) (B)         loop (count = 0) (B) X
Loop never ends               Loop ends
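The table can be reproduced with a few lines of C. The sketch below counts how many branches are taken over a number of kernel passes, with and without the `[count]` guard on the subtract. The function name is an assumption, and the model simply lets the branch and the subtract read the same value of count in each pass.

```c
/* Count the branches taken over `passes` kernel passes, starting from
 * `count`. conditional_sub selects "[count] sub" (1) or a plain "sub" (0). */
int branches_taken(int count, int passes, int conditional_sub)
{
    int taken = 0;
    for (int p = 0; p < passes; p++) {
        int old = count;              /* both instructions read count together */
        if (old != 0)
            taken++;                  /* [count] b loop */
        if (!conditional_sub || old != 0)
            count = old - 1;          /* sub count, 1, count */
    }
    return taken;
}
```

Starting from count = 1 and running six passes, the guarded version takes one branch and then stops, whereas the unguarded version lets count go negative and keeps branching.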
95- Chapter 12
- Software Optimisation
- Part 3 - Pipelining Multi-cycle Loops
96Objectives
- Software pipeline the weighted vector sum algorithm.
- Describe four iteration interval constraints.
- Calculate minimum iteration interval.
- Convert and optimize the dot-product code to floating point code.
97What Requires Multi-Cycle Loops?
- Resource Limitations:
- Running out of resources (functional units, registers, bus accesses).
- The weighted vector sum example requires three .D units.
- Live Too Long:
- Minimum iteration interval defined by the length of time a variable is required to exist.
- Loop Carry Path:
- Latency required between loop iterations.
- The IIR example and the SP floating-point dot-product example are demonstrated.
- Functional Unit Latency > 1:
- A few C67x instructions require functional units for 2 or 4 cycles rather than one. This defines a minimum iteration interval.
98What Requires Multi-Cycle Loops?
Use these four constraints to determine the
smallest Iteration Interval (Minimum Iteration
Interval or MII).
99Resource Limitation Weighted Vector Sum
Step 1 - C Code
a, b: input arrays; c: output array; n: length of arrays; r: weighting factor.
Each iteration needs a .D unit for every memory access:
Load .D
Load .D
Store .D
Requires 3 .D units.
100Software Pipelining Procedure
101Step 2 - C6x Linear Code
c[i] = a[i] + ((r * b[i]) >> 15)

loop:   LDH   *a++, ai
        LDH   *b++, bi
        MPY   r, bi, prod
        SHR   prod, 15, sum
        ADD   ai, sum, ci
        STH   ci, *c++
[i]     SUB   i, 1, i
[i]     B     loop
- The full code is available here: \Links\Wvs.sa
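For reference, a C version of the weighted vector sum might look like the sketch below. The function name `wvs` and the exact types are assumptions; the body mirrors the linear assembly one instruction per line.

```c
#include <stdint.h>

/* c[i] = a[i] + ((r * b[i]) >> 15) for n elements.
 * The right shift of a negative product is assumed to be arithmetic,
 * as it is on common compilers. */
void wvs(const int16_t *a, const int16_t *b, int16_t *c, int16_t r, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t prod = (int32_t)r * b[i];   /* MPY */
        int32_t sum  = prod >> 15;          /* SHR */
        c[i] = (int16_t)(a[i] + sum);       /* ADD, then STH */
    }
}
```

Each iteration reads a[i] and b[i] and writes c[i], which is where the three .D unit accesses come from.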
102Step 3 - Dependency Graph
103Step 4 -Allocate Functional Units
- This requires 3 .D units, therefore it cannot fit into a single-cycle loop.
- This may fit into a 2-cycle loop if there are no other constraints.
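The reasoning above is a ceiling division: three .D operations shared between two .D units need at least two cycles. A hedged C sketch of that bound, and of taking the largest of several bounds as the minimum iteration interval, follows; the function names are assumptions.

```c
/* Resource bound: the smallest number of cycles in which `ops` operations
 * can be issued on `units` identical functional units. */
int resource_bound(int ops, int units)
{
    return (ops + units - 1) / units;   /* ceiling of ops / units */
}

/* The minimum iteration interval is the largest of the individual bounds. */
int min_iteration_interval(const int *bounds, int n)
{
    int mii = 1;
    for (int i = 0; i < n; i++)
        if (bounds[i] > mii)
            mii = bounds[i];
    return mii;
}
```

For the weighted vector sum, `resource_bound(3, 2)` gives 2, so a 2-cycle loop is the best case if no other constraint is larger.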
104 2-Cycle Loop
Cycle 1
Cycle 2
Iteration Interval (II): the number of cycles per loop iteration.
105Multi-Cycle Loop Iterations
          loop 1             loop 2             loop 3
Unit      cycle 1  cycle 2   cycle 3  cycle 4   cycle 5  cycle 6
.D1       ldh      sth       ldh      sth       ldh      sth
.D2       ldh                ldh                ldh
.S1       b                  b                  b
.S2       shr                shr                shr
.M1       mpy                mpy                mpy
.M2
.L1       add                add                add
.L2       sub                sub                sub
111How long is the Prolog?
What is the length of the longest path? 10
How many cycles per loop? 2
112Step 5 - Create Scheduling Chart (0)
113Step 5 - Create Scheduling Chart
[Table: empty scheduling chart. Rows: functional units .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2. Columns: even cycles (0, 2, 4, 6, 8) in the upper half and odd cycles (1, 3, 5, 7, 9) in the lower half.]
116Step 5 - Create Scheduling Chart
[Table: the same chart with ADD ci entered on .L1 in the even-cycle half.]
117Step 5 - Create Scheduling Chart
118Step 5 - Create Scheduling Chart
119Step 5 - Create Scheduling Chart
Conflict
120Conflict Solution
Here are two possibilities ...
Which is better? Move the LDH to cycle 2, so that you do not have to go back and recheck the cross paths.
122Step 5 - Create Scheduling Chart
[Table: scheduling chart with ADD ci on .L1 in the even-cycle half and STH ci on .D1 in the odd-cycle half.]
125Step 5 - Create Scheduling Chart
[Table: completed scheduling chart. Even cycles: ADD ci, [i] B, LDH ai, LDH bi. Odd cycles: [i] SUB i, SHR sum, MPY mi, LDH ai, STH ci.]
126 2-Cycle Loop Kernel
[Table: the same scheduling chart, shown as the 2-cycle loop kernel.]
127What Requires Multi-Cycle Loops?
131Live Too Long - Example
[Figure: timing diagram over cycles 0 to 6 for the LDH, SHR and ADD instructions, showing when a0, a1 and x0 are valid.]
Oops, rather than adding a0 + x0 we got a1 + x0.
Let's look at one solution ...
132Live Too Long - 2 Cycle Solution
Notice that a0 and x0 are both valid for 2 cycles, which is the length of the Iteration Interval.
Adding them works! But what is the drawback?
A 2-cycle loop is slower.
Here is a better solution ...
135Live Too Long - 1 Cycle Solution
Using a temporary register solves this problem without increasing the Minimum Iteration Interval.
136What Requires Multi-Cycle Loops?
137Loop Carry Path
- The loop carry path is a path which feeds one variable from part of the algorithm back to another.
e.g. Loop carry path = 3.
Note: The loop carry path is not the code loop.
138Loop Carry Path, e.g. IIR Filter
IIR Filter Example: y0 = a1*x1 + b1*y1
139IIR.SA
IIR:    ldh   *a_1, A1
        ldh   *x1, A3
        ldh   *b_1, B1
        ldh   *y0, B0          ; y1 is previous y0
        mpy   A1, A3, prod1
        mpy   B1, B0, prod2
        add   prod1, prod2, prod2
        sth   prod2, *y0
140Loop Carry Path - IIR Example
IIR Filter Loop: y0 = a1*x1 + b1*y1
Min Iteration Interval:
Resource = 2 (need 3 .D units)
Loop Carry Path = 9 (9 = 5 + 2 + 1 + 1)
therefore, MII = 9
The result carries over from one iteration of the loop to the next.
Can it be minimized?
141Loop Carry Path - IIR Example (Solution)
IIR Filter Loop: y0 = a1*x1 + b1*y1
Min Iteration Interval:
Resource = 2 (need 3 .D units)
New Loop Carry Path = 3 (3 = 2 + 1)
therefore, MII = 3
Since y0 is stored in a CPU register, it can be used directly by MPY (after the first loop iteration).
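In C terms, the improvement amounts to keeping the previous output in a local variable instead of reloading it from memory on every iteration. The sketch below illustrates this for the recursion y0 = a1*x1 + b1*y1; the function name, the zero initial state and the absence of any fixed-point scaling (mirroring the listing on the IIR.SA slide) are assumptions.

```c
#include <stdint.h>

/* First-order recursion with the previous output held in a register.
 * No scaling is applied, so inputs must be small enough not to overflow. */
void iir1(const int16_t *x, int16_t *y, int n, int16_t a1, int16_t b1)
{
    int16_t y_prev = 0;                        /* assumed initial state */
    for (int i = 0; i < n; i++) {
        int32_t acc = (int32_t)a1 * x[i]       /* mpy */
                    + (int32_t)b1 * y_prev;    /* mpy on the register copy */
        y_prev = (int16_t)acc;                 /* stays in a register */
        y[i] = y_prev;                         /* sth */
    }
}
```

The store still happens every iteration, but the next multiply no longer waits for a load of the value that was just stored.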
142Reminder Fixed-Point Dot-Product Example
Is there a loop carry path in this example?
Yes, but it is only 1.
Min Iteration Interval: Resource = 1, Loop Carry Path = 1, therefore MII = 1
For the fixed-point implementation, the Loop Carry Path was not taken into account because it is equal to 1.
143Loop Carry Path
- IIR Example.
- Enhancing the IIR.
- Fixed-Point Dot-Product Example.
- Floating-Point Dot Product Example.
144Loop Carry Path due to FUL > 1: Floating-Point Dot-Product Example
Min Iteration Interval: Resource = 1, Loop Carry Path = 4, therefore MII = 4
145Unrolling the Loop
If the MII must be four cycles long, then use all
of them to calculate four results.
146ADDSP Pipeline (Staggered Results)
Cycle   Instruction            Result
0       ADDSP x0, sum, sum     sum = 0
1       ADDSP x1, sum, sum     sum = 0
2       ADDSP x2, sum, sum     sum = 0
3       ADDSP x3, sum, sum     sum = 0
4       ADDSP x4, sum, sum     sum = x0
5       ADDSP x5, sum, sum     sum = x1
6       ADDSP x6, sum, sum     sum = x2
7       ADDSP x7, sum, sum     sum = x3
8       ADDSP x8, sum, sum     sum = x0 + x4
9                              sum = x1 + x5
10                             sum = x2 + x6
11                             sum = x3 + x7
12                             sum = x0 + x4 + x8
- ADDSP takes 4 cycles (three delay slots) to produce the result.
154ADDSP Pipeline (Staggered Results)
- After the last ADDSP (cycle 8), four NOPs (cycles 9 to 12) let the remaining results emerge: sum = x1 + x5, then x2 + x6, then x3 + x7, then x0 + x4 + x8.
- There are effectively four running sums.
- These need to be combined after the last addition is complete...
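At the C level the same effect can be expressed explicitly by keeping four independent accumulators and combining them once after the loop. This is a minimal sketch: the function name and the requirement that `n` is a multiple of 4 are assumptions, and the final combination order is only one possible choice.

```c
/* Single-precision dot product with four running sums.
 * Assumes n is a multiple of 4 (an assumption of this sketch). */
float dotp_sp4(const float *a, const float *b, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += a[i]     * b[i];       /* each sum depends only on itself, */
        s1 += a[i + 1] * b[i + 1];   /* so four additions can be in      */
        s2 += a[i + 2] * b[i + 2];   /* flight at the same time          */
        s3 += a[i + 3] * b[i + 3];
    }
    return (s1 + s2) + (s0 + s3);    /* combine after the last addition */
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a single running sum; the test values below are chosen so that every intermediate value is exact.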
162ADDSP Pipeline (Combining Results)
Cycle   Instruction               Result
0       ADDSP x0, sum, sum        sum = 0
1       ADDSP x1, sum, sum        sum = 0
2       ADDSP x2, sum, sum        sum = 0
3       ADDSP x3, sum, sum        sum = 0
4       ADDSP x4, sum, sum        sum = x0
5       ADDSP x5, sum, sum        sum = x1
6       ADDSP x6, sum, sum        sum = x2
7       ADDSP x7, sum, sum        sum = x3
8       ADDSP x8, sum, sum        sum = x0 + x4
9       MV sum, temp              sum = x1 + x5
10      ADDSP sum, temp, sum1     sum = x2 + x6, temp = x1 + x5
11      MV sum, temp              sum = x3 + x7
12      ADDSP sum, temp, sum2     sum = x0 + x4 + x8, temp = x3 + x7
13      NOP
14      NOP                       sum1 = x1 + x2 + x5 + x6
15      NOP
16      ADDSP sum1, sum2, sum     sum2 = x0 + x3 + x4 + x7 + x8
17      NOP
18      NOP
19      NOP
20      NOP                       sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
163What Requires Multi-Cycle Loops?
164Simple FUL Example
[Figure: dependency graph node MPYDP producing prod, annotated 10 (4.9), with input paths annotated 3 and 5.]
165A Better Way to Diagram this ...
166What Requires Multi-Cycle Loops?
- 1. Resource Limitations.
- 2. Live Too Long.
- 3. Loop Carry Path.
- 4. Double Precision (FUL > 1).
Lab: Converting your dot-product code to Single-Precision Floating-Point.
167- Chapter 12
- Software Optimisation
- - End -