Software Optimisation

Slides: 168

Provided by: NaimDa9

1

Chapter 12
Software Optimisation

2
Software Optimisation Chapter

This chapter consists of three parts:
Part 1: Optimisation Methods.
Part 2: Software Pipelining.
Part 3: Multi-cycle Loop Pipelining.

Chapter 12
Software Optimisation
Part 1 - Optimisation Methods

4
Objectives

Introduction to optimisation and optimisation
procedure.
Optimisation of C code using the code generation
tools.
Optimisation of assembly code.

5
Introduction

Software optimisation is the process of
manipulating software code to achieve two main
goals:
Faster execution time.
Smaller code size.
Note: It will be shown that in general there is
a trade-off between faster execution time and
smaller code size.

6
Introduction

To implement efficient software, the programmer
must be familiar with
Processor architecture.
Programming language (C, assembly or linear
assembly).
The code generation tools (compiler, assembler
and linker).

7
Code Optimisation Procedure
8
Code Optimisation Procedure
9
Optimising C Compiler Options

The C6x optimising C compiler takes ANSI C
source code and can currently reach up to about
80% of the performance of hand-scheduled
assembly.
However, to achieve this level of optimisation,
knowledge of different levels of optimisation is
essential. Optimisation is performed at different
stages and levels.

10
Assembly Optimisation

To develop an appreciation of how to optimise
code, let us optimise an FIR filter:

$$y[n] = \sum_{k=0}^{N-1} h[k]\,x[n-k]$$

For simplicity we write:

$$y = \sum_{i=0}^{N-1} h_i\,x_i \qquad (1)$$
11
Assembly Optimisation

To implement Equation 1, we need to perform the
following steps:
(1) Load the sample xi.
(2) Load the coefficient hi.
(3) Multiply xi and hi.
(4) Add (xi * hi) to the content of an
accumulator.
(5) Repeat steps 1 to 4 N-1 times.
(6) Store the value in the accumulator to y.
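
The six steps above can be sketched in C as a baseline before any optimisation. This is only an illustration: the function name fir_sum and the 16-bit sample type are assumptions, not taken from the slides.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline sum of products, one step per statement (hypothetical name). */
int32_t fir_sum(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;                               /* accumulator          */
    for (size_t i = 0; i < n; i++) {               /* (5) repeat           */
        int16_t xi = x[i];                         /* (1) load sample      */
        int16_t hi = h[i];                         /* (2) load coefficient */
        int32_t prod = (int32_t)xi * (int32_t)hi;  /* (3) multiply         */
        acc += prod;                               /* (4) accumulate       */
    }
    return acc;                                    /* (6) caller stores to y */
}
```

Each line of the loop body corresponds to one instruction (load, load, multiply, add) in the assembly version discussed on the following slides.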

12
Assembly Optimisation

Steps 1 to 6 can be translated into the following
C6x assembly code

13
Assembly Optimisation

In order to optimise the code, we need to:
(1) Use instructions in parallel.
(2) Remove the NOPs.
(3) Remove the loop overhead (remove SUB and B:
loop unrolling).
(4) Use word access or double-word access instead
of byte or half-word access.

14
Step 1 - Using Parallel Instructions
Instruction sequence (operands not shown):
ldh, ldh, nop (x4), mpy, nop, add, sub, b, nop (x5)
15
Step 1 - Using Parallel Instructions
Instruction sequence (operands not shown):
ldh, ldh, nop (x4), mpy, nop, add, sub, b, nop (x5)
Note: Not all instructions can be put in parallel,
since the result of one unit is used as an input
to the following unit.
16
Step 2 - Removing the NOPs
Instruction sequence (operands not shown):
ldh, ldh, sub, b, nop, nop, mpy, nop, add
17
Step 3 - Loop Unrolling

The SUB and B instructions consume at least two
extra cycles per iteration (this is known as
branch overhead).
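
A minimal C sketch of loop unrolling, assuming an unroll factor of four (the factor and the name dotp_unrolled4 are illustrative choices, not from the slides). The counter update and branch are paid once per four multiply-accumulates instead of once per one, at the cost of a larger loop body.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of products with the loop body unrolled four times. */
int32_t dotp_unrolled4(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    size_t i = 0;

    /* Main loop: one counter update and one branch per four products. */
    for (; i + 4 <= n; i += 4) {
        acc += (int32_t)x[i]     * h[i];
        acc += (int32_t)x[i + 1] * h[i + 1];
        acc += (int32_t)x[i + 2] * h[i + 2];
        acc += (int32_t)x[i + 3] * h[i + 3];
    }

    /* Remainder loop for lengths that are not a multiple of four. */
    for (; i < n; i++) {
        acc += (int32_t)x[i] * h[i];
    }
    return acc;
}
```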

18
Step 4 - Word or Double Word Access

The C6711 has two 64-bit data buses for data
memory access, and therefore up to two 64-bit
values can be loaded into the registers at any
time (see Chapter 2).
In addition, the C6711 devices have variants of
the multiplication instruction to support
different operations (see Chapter 2).
Note: Stores can only be up to 32 bits wide.

19
Step 4 - Word or Double Word Access

Using word access, MPY and MPYH, the previous
code can be rewritten.

Note: By loading words and using the MPY and MPYH
instructions, the execution time has been halved,
since in each iteration two 16x16-bit
multiplications are performed.
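
The effect of word access can be sketched in C by packing two 16-bit values into each 32-bit word. The helpers mpy_lo and mpy_hi below only mimic what MPY (low halves) and MPYH (high halves) compute; they are hypothetical names, not compiler intrinsics, and the packing order is an assumption.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack two 16-bit values into one 32-bit word (low half first). */
uint32_t pack16(int16_t lo, int16_t hi)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Signed 16x16 multiply of the low halves, as MPY would do.
 * The narrowing casts assume two's complement wrap-around. */
static int32_t mpy_lo(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a & 0xFFFFu) * (int32_t)(int16_t)(b & 0xFFFFu);
}

/* Signed 16x16 multiply of the high halves, as MPYH would do. */
static int32_t mpy_hi(uint32_t a, uint32_t b)
{
    return (int32_t)(int16_t)(a >> 16) * (int32_t)(int16_t)(b >> 16);
}

/* One word load per array per iteration yields two products. */
int32_t dotp_words(const uint32_t *x, const uint32_t *h, size_t nwords)
{
    int32_t sum = 0, sumh = 0;
    for (size_t i = 0; i < nwords; i++) {
        sum  += mpy_lo(x[i], h[i]);
        sumh += mpy_hi(x[i], h[i]);
    }
    return sum + sumh;   /* combine the two partial sums after the loop */
}
```

For the same number of samples the loop runs half as many iterations, which is where the halved execution time comes from.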

20
Optimisation Summary

It has been shown that there are four
complementary methods for code optimisation:
Using instructions in parallel.
Filling the delay slots with useful code.
Using word or double-word loads.
Loop unrolling.

The first three increase performance and reduce code size.
21
Optimisation Summary

Loop unrolling, by contrast, increases
performance but also increases code size.
22

Chapter 12
Software Optimisation
Part 2 - Software Pipelining

23
Objectives

Why use software pipelining (SP)?
Understand software pipelining concepts.
Use software pipelining procedure.
Code the word-wide software pipelined dot-product
routine.
Determine if your pipelined code is more
efficient with or without prolog and epilog.

24
Why use software pipelining (SP)?

SP creates highly optimized loop code by:
Putting several instructions in parallel.
Filling delay slots with useful code.
Maximizing the use of the functional units.
SP is implemented by simply using the tools:
Compiler options -o2 or -o3.
Assembly Optimizer, if the source is a .sa file.

25
Software Pipeline concept
To explain the concept of software pipelining, we
will assume that all instructions execute in one
cycle.
26
Software Pipeline Example
Loop body: LDH and LDH (in parallel), then MPY, then ADD.
How many cycles would it take to perform this
loop 5 times? (Disregard delay slots.)
Answer: 5 x 3 = 15 cycles.
Let's examine hardware (functional unit) usage
...
27
Non-Pipelined Code
[Figure: functional unit usage per cycle for the non-pipelined code (.D1, .D2, mpy, add).]
28
Pipelining Code
Pipelining these instructions took 1/2 the cycles!
29
Pipelining Code
Pipelining these instructions takes only 7 cycles!
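
The two cycle counts can be checked with a short C sketch. It assumes, as the slides do, that every instruction executes in one cycle and that a new iteration can start every cycle; the function names are illustrative.

```c
/* Cycles when each iteration runs to completion before the next starts. */
unsigned cycles_sequential(unsigned iterations, unsigned stages)
{
    return iterations * stages;
}

/* Cycles when a new iteration starts every cycle: the first iteration
 * needs `stages` cycles, each later one finishes one cycle after the
 * previous iteration. */
unsigned cycles_pipelined(unsigned iterations, unsigned stages)
{
    if (iterations == 0) {
        return 0;
    }
    return iterations + stages - 1;
}
```

With 5 iterations of a 3-stage loop (load, multiply, add) this gives 15 cycles without pipelining and 7 with it, roughly half.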
30
Pipelining Code
Prolog: staging for the loop.
Epilog: completing the final operations.
31
Pipelined Code
32
Software Pipelining Procedure
1. Write the algorithm in C code and verify it.
2. Write C6x linear assembly code.
3. Create a dependency graph.
4. Allocate registers.
5. Create a scheduling table.
6. Translate the scheduling table to C6x code.
33
Software Pipelining Example (Step 1)
35
Write code in Linear Assembly (Step 2)
for (i = 0; i < count; i++) {
    prod = m[i] * n[i];
    sum += prod;
}

loop:    ldh   *p_m++, m
         ldh   *p_n++, n
         mpy   m, n, prod
         add   prod, sum, sum
[count]  sub   count, 1, count
[count]  b     loop

1. No NOPs required.
2. No parallel instructions required.
3. You don't have to specify:
Functional units, or
Registers.

37
Dependency Graph Terminology
38
Dependency Graph Steps

(a) Draw the algorithm nodes and paths.
(b) Write the number of cycles it takes for each
instruction to complete execution.
(c) Assign the required functional units to each
node.
(d) Partition the nodes into A and B sides and
assign sides to all functional units.

39
Dependency Graph (Step a)

In this step each instruction is represented by a
node.
The node is represented by a circle, where:
Outside: write the instruction.
Inside: write the register where the result is written.
Nodes are then connected by paths showing the
data flow.
Note: Conditional paths are represented by
dashed lines.

40
Dependency Graph (Step a)
46
Dependency Graph (Step b)

In this step the number of cycles it takes for
each instruction to complete execution is added
to the dependency graph.
It is written along the associated data path.

47
Dependency Graph (Step b)
48
Dependency Graph (Step c)

In this step functional units are assigned to
each node.
It is advantageous to start allocating units to
instructions which require a specific unit:
Load/Store.
Branch.
We do not need to be concerned with the multiply,
as this is the only operation that the .M unit
performs.
Note: The side is not allocated at this stage.

49
Dependency Graph (Step c)
[Figure: dependency graph with functional units assigned (.D, .D, .M, .S) and path latencies 5, 5, 2, 1, 1, 1 and 6.]
50
Dependency Graph (Step d)

The data path is partitioned into side A and B at
this stage.
To optimise code we need to ensure that a maximum
number of units are used with a minimum number of
cross paths.
To make the partition visible on the dependency
graph a line is used.
The side can then be added to the functional
units associated with each instruction or node.

51
Dependency Graph (Step d)
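[Figure: dependency graph partitioned into A side and B side, with functional units (.D, .D, .M, .S) and path latencies 5, 5, 2, 1, 1, 1 and 6.]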
52
Dependency Graph (Step d)
54
Step 4 - Allocate Functional Units

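Side A:   .L1 sum     .M1 prod    .D1 m    .S1 (unused)   x1 .M1x
Side B:   .L2 count   .M2 (unused)   .D2 n    .S2 loop    x2 (unused)

Do we have enough functional units to code this
algorithm in a single-cycle loop?

[Figure: dependency graph with sides and units assigned: .D1 and .D2 (loads, 5), .M1x (multiply, 2), .L1 (add, 1), .L2 (sub, 1), .S2 (branch, 6).]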
55
Step 4 - Allocate Registers
57
Step 5 - Create Scheduling Table
[Table: empty scheduling table. Columns: cycles 1 to 8. Rows: .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2.]
How do we know the loop ends up in cycle 8?
58
Length of Prolog

Answer:
Count up the length of the longest path; in this
case we have
5 + 2 + 1 = 8 cycles
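
The arithmetic above can be written as a small C helper. This is a sketch under the assumption that the longest path is simply the sum of the latencies written along it (load 5, multiply 2, add 1); the function names are hypothetical.

```c
#include <stddef.h>

/* Sum of the latencies along one path of the dependency graph. */
unsigned path_length(const unsigned *latency, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += latency[i];
    }
    return total;
}

/* The loop kernel sits in the cycle equal to the longest path,
 * so the prolog occupies every cycle before it. */
unsigned prolog_cycles(unsigned longest_path)
{
    return longest_path > 0 ? longest_path - 1 : 0;
}
```

For the dot product this gives a longest path of 8, so the loop ends up in cycle 8 and the prolog fills cycles 1 to 7.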
59
Scheduling Table
[Table: scheduling table. Columns: cycles 1 to 8. Rows: .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2.]
60
Scheduling Table
add
Where do we want to branch?
61
Scheduling Table
63
Translate Scheduling Table to C6x Code
Prolog:

C1:          ldh   .D1   *A1++, A2
         ||  ldh   .D2   *B1++, B2

C2:          ldh   .D1   *A1++, A2
         ||  ldh   .D2   *B1++, B2
         ||  [B0] sub   .L2   B0, 1, B0

C3, C4, C5 (three identical cycles):
             ldh   .D1   *A1++, A2
         ||  ldh   .D2   *B1++, B2
         ||  [B0] sub   .L2   B0, 1, B0
         ||  [B0] B     .S2   loop

C6, C7 (two identical cycles):
             ldh   .D1   *A1++, A2
         ||  ldh   .D2   *B1++, B2
         ||  [B0] sub   .L2   B0, 1, B0
         ||  [B0] B     .S2   loop
         ||  mpy   .M1x  A2, B2, A3

Single-cycle loop:

loop:        ldh   .D1   *A1++, A2
         ||  ldh   .D2   *B1++, B2
         ||  [B0] sub   .L2   B0, 1, B0
         ||  [B0] B     .S2   loop
         ||  mpy   .M1x  A2, B2, A3
         ||  add   .L1   A4, A3, A4
See Chapter 14 for practical examples
71
Translate Scheduling Table to C6x Code

With this method we have only created the prolog
and the loop.
Therefore, if the filter has 100 taps, then we
need to repeat the loop 100 times, as we need 100
adds.
This means that we are performing 107 loads.
These 7 extra loads may lead to some illegal
memory accesses.
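
The load count above follows from the fact that a load is issued in every prolog cycle as well as in every loop iteration. A minimal sketch in C, with hypothetical function names:

```c
/* Loads issued per input array: one in each prolog cycle and one in
 * each pass through the loop kernel. */
unsigned loads_issued(unsigned loop_iterations, unsigned prolog_cycles)
{
    return loop_iterations + prolog_cycles;
}

/* Loads that read past the end of an array holding `taps` elements. */
unsigned extra_loads(unsigned loads, unsigned taps)
{
    return loads > taps ? loads - taps : 0;
}
```

With 100 loop iterations and a 7-cycle prolog, 107 loads are issued, 7 of which fall beyond the 100 valid elements.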

72
Solution: The Epilog
We only created the prolog and the loop. What
about the epilog?
The epilog can be extracted from your results as
described below.
See the example in the next slide.
79
Dot-Product with Epilog
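Prolog:
p1:    ldh || ldh
p2:    ldh || ldh || sub
p3:    ldh || ldh || sub || b
p4:    ldh || ldh || sub || b
p5:    ldh || ldh || sub || b
p6:    ldh || ldh || mpy || sub || b
p7:    ldh || ldh || mpy || sub || b

Loop:
loop:  ldh || ldh || mpy || add || sub || b

Epilog:
e1:    mpy || add
e2:    mpy || add
e3:    mpy || add
e4:    mpy || add
e5:    mpy || add
e6:    add
e7:    add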
80
Scheduling Table Prolog, Loop and Epilog
81
Loop only!

Can the code be written as a loop only (i.e. no
prolog or epilog)?

Yes!
85
Loop only!
(i) Remove all instructions except the branch.
(ii) Zero the input registers, accumulator and
product registers.
(iii) Adjust the number of subtractions.
[Figure: scheduling table showing the remaining branches, the added zero instructions (sum, prod, a, b) and the adjusted sub.]
86
Loop Only - Final Code
Overhead:
         b     loop
         b     loop
         b     loop
      || zero  m          ; input register
      || zero  n          ; input register
         b     loop
      || zero  prod       ; product register
      || zero  sum        ; accumulator
         b     loop
      || sub              ; modify count register

Loop:
loop:    ldh || ldh || mpy || add || sub || b loop
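
Why zeroing the registers makes the loop-only version correct can be seen in a small cycle-by-cycle model in C. This is a simulation sketch, not C6x code: it assumes a loaded value becomes readable 5 cycles after issue, a product 2 cycles after issue and a sum 1 cycle after issue, and it makes loads past the end of the arrays read as 0 purely to keep the model within bounds. All names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

#define LDH_DELAY_SLOTS 4  /* loaded value is readable 5 cycles after issue */

/* Every cycle issues ldh || ldh || mpy || add, with no prolog or epilog. */
int32_t dotp_loop_only(const int16_t *x, const int16_t *h, size_t count)
{
    int16_t m = 0, n = 0;                  /* (ii) zeroed input registers   */
    int32_t prod = 0, sum = 0;             /* (ii) zeroed product and sum   */
    int16_t m_fly[LDH_DELAY_SLOTS] = {0};  /* loads still in flight         */
    int16_t n_fly[LDH_DELAY_SLOTS] = {0};
    int32_t p_fly = 0;                     /* product in its one delay slot */

    /* (iii) 7 extra passes (longest path 8, minus 1) drain the pipeline. */
    size_t cycles = count + 7;

    for (size_t c = 0; c < cycles; c++) {
        /* Read phase: all units see the registers as they stand now. */
        int32_t add_res = sum + prod;
        int32_t mpy_res = (int32_t)m * n;
        /* Simulation convenience: loads past the end read as 0. */
        int16_t ld_m = (c < count) ? x[c] : 0;
        int16_t ld_n = (c < count) ? h[c] : 0;

        /* Write-back phase: each result advances one stage per cycle. */
        sum = add_res;
        prod = p_fly;
        p_fly = mpy_res;
        m = m_fly[0];
        n = n_fly[0];
        for (int k = 0; k < LDH_DELAY_SLOTS - 1; k++) {
            m_fly[k] = m_fly[k + 1];
            n_fly[k] = n_fly[k + 1];
        }
        m_fly[LDH_DELAY_SLOTS - 1] = ld_m;
        n_fly[LDH_DELAY_SLOTS - 1] = ld_n;
    }
    return sum;
}
```

During the first passes the multiply and add operate on zeros, so they contribute nothing; products of the loads issued in the last 7 passes never reach the accumulator before the loop stops.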
87
Laboratory exercise

Software pipeline the LDW version of the
dot-product routine:
(1) Write linear assembly.
(2) Create dependency graph.
(3) Complete scheduling table.
(4) Transfer table to C6000 code.
To Epilog or Not to Epilog?
Determine if your pipelined code is more
efficient with or without prolog and epilog.

88
Lab Solution Step 1 - Linear Assembly
for (i = 0; i < count; i++) {
    prod = m[i] * n[i];
    sum += prod;
}
; count becomes 20

loop:    ldw   *p_m++, m
         ldw   *p_n++, n
         mpy   m, n, prod
         mpyh  m, n, prodh
         add   prod, sum, sum
         add   prodh, sumh, sumh
[count]  sub   count, 1, count
[count]  b     loop

; Outside of loop
         add   sum, sumh, sum
89
Step 2 - Dependency Graph
90
Step 2 - Functional Units
Side A:   .L1 sum    .M1 prod    .D1 m   .S1 loop    x1 .M1x
Side B:   .L2 sumh   .M2 prodh   .D2 n   .S2 count   x2 .M2x

Do we still have enough functional units to code
this algorithm in a single-cycle loop? Yes!

91
Step 2 - Registers
Register File A             Register File B
A0                          B0   count
A1                          B1
A2                          B2
A3                          B3   return address
A4   a / ret value          B4   x
A5   a                      B5   x
A6   count / prod           B6   prodh
A7   sum                    B7   sumh
92
Step 3 - Schedule Algorithm
[Table: scheduling table for the LDW dot product, with two add instructions.]
93
Step 4 - C6000 Code

The complete code is available in the following
location
\Links\DotP LDW.pdf

94
Why Conditional Subtract?
loop:    ldh   *p_m++, m
         ldh   *p_n++, n
         mpy   m, n, prod
         add   prod, sum, sum
[count]  sub   count, 1, count
[count]  b     loop

(B) is the branch issued in that pass; X marks a branch that is not taken.

Without cond. subtract         With cond. subtract
loop (count = 1)  (B)          loop (count = 1) (B)
loop (count = 0)  (B) X        loop (count = 0) (B) X
loop (count = -1) (B)          loop (count = 0) (B) X
loop (count = -2) (B)          loop (count = 0) (B) X
loop (count = -3) (B)          loop (count = 0) (B) X
loop (count = -4) (B)          loop (count = 0) (B) X
Loop never ends                Loop ends
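
The difference between the two columns can be reproduced with a short C sketch. It models only the counter: the loop body keeps executing while earlier branches are still in their delay slots, so the subtract runs several more times after count first reaches zero. The function names are assumptions made for this illustration.

```c
/* Value of count after `executions` passes through the sub instruction.
 * conditional != 0 models [count] sub, which stops at zero;
 * conditional == 0 models a plain sub, which keeps decrementing. */
int count_after(int count, unsigned executions, int conditional)
{
    for (unsigned k = 0; k < executions; k++) {
        if (!conditional || count != 0) {
            count -= 1;
        }
    }
    return count;
}

/* [count] b loop is taken whenever count is non-zero. */
int branch_taken(int count)
{
    return count != 0;
}
```

Starting from count = 1, five more passes leave count at -4 without the condition, so the branch is taken again and the loop never ends; with the condition, count stays at 0 and no further branch is taken.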
95

Chapter 12
Software Optimisation
Part 3 - Pipelining Multi-cycle Loops

96
Objectives

Software pipeline the weighted vector sum
algorithm.
Describe four iteration interval constraints.
Calculate minimum iteration interval.
Convert and optimize the dot-product code to
floating point code.

97
What Requires Multi-Cycle Loops?

Resource Limitations:
  Running out of resources (functional units,
  registers, bus accesses).
  The weighted vector sum example requires three
  .D units.
Live Too Long:
  Minimum iteration interval defined by the length
  of time a variable is required to exist.
Loop Carry Path:
  Latency required between loop iterations.
  FIR example and SP floating-point dot-product
  examples are demonstrated.
Functional Unit Latency > 1:
  A few C67x instructions require functional units
  for 2 or 4 cycles rather than one. This defines a
  minimum iteration interval.

98
What Requires Multi-Cycle Loops?
Use these four constraints to determine the
smallest Iteration Interval (Minimum Iteration
Interval or MII).
99
Resource Limitation Weighted Vector Sum
Step 1 - C Code
a, b: input arrays; c: output array; n: length
of arrays; r: weighting factor.
Store: .D
Load: .D
Load: .D
Requires 3 .D units.
100
Software Pipelining Procedure
101
Step 2 - C6x Linear Code
c[i] = a[i] + ((r * b[i]) >> 15)

loop:    LDH   *a++, ai
         LDH   *b++, bi
         MPY   r, bi, prod
         SHR   prod, 15, sum
         ADD   ai, sum, ci
         STH   ci, *c++
  [i]    SUB   i, 1, i
  [i]    B     loop

The full code is available here \Links\Wvs.sa
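
For reference, the weighted vector sum can be sketched in C as follows. The signature is an assumption (the slide with the original C code is not preserved here); the body follows the linear assembly: multiply, shift right by 15, add, store.

```c
#include <stddef.h>
#include <stdint.h>

/* c[i] = a[i] + ((r * b[i]) >> 15), with r treated as a Q15 weight.
 * Right-shifting a negative value is implementation-defined in C;
 * an arithmetic shift is assumed here, as with the SHR instruction. */
void weighted_vector_sum(const int16_t *a, const int16_t *b, int16_t *c,
                         int16_t r, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t prod = (int32_t)r * b[i];   /* MPY */
        int32_t sum  = prod >> 15;          /* SHR */
        c[i] = (int16_t)(a[i] + sum);       /* ADD, then STH */
    }
}
```

Each iteration performs two loads and one store, which is why three .D operations are needed per iteration.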

102
Step 3 - Dependency Graph
103
Step 4 -Allocate Functional Units

This requires 3 .D units therefore it cannot fit
into a single cycle loop.
This may fit into a 2 cycle loop if there are no
other constraints.
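
The resource bound used above is a ceiling division: the number of operations that need a unit type, divided by the number of units of that type. A minimal C sketch, with a hypothetical function name:

```c
/* Smallest iteration interval allowed by one unit type:
 * ceil(uses_per_iteration / units_available). */
unsigned resource_bound(unsigned uses_per_iteration, unsigned units_available)
{
    if (units_available == 0) {
        return 0;  /* no such unit: the bound is undefined, report 0 */
    }
    return (uses_per_iteration + units_available - 1) / units_available;
}
```

Three .D operations on two .D units give a bound of 2 cycles, while the dot product (two loads on two .D units) fits in 1 cycle.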

104
2 Cycle Loop
[Figure: 2-cycle loop, showing cycle 1 and cycle 2.]
Iteration Interval (II): cycles per loop
iteration.
105
Multi-Cycle Loop Iterations
First cycle of each iteration:

Unit    loop 1     loop 2     loop 3
        cycle 1    cycle 3    cycle 5
.D1     ldh        ldh        ldh
.D2     ldh        ldh        ldh
.S2     shr        shr        shr
.M1     mpy        mpy        mpy
.M2
.L1     add        add        add
.L2     sub        sub        sub
.S1     b          b          b

Second cycle of each iteration:

Unit    loop 1     loop 2     loop 3
        cycle 2    cycle 4    cycle 6
.D1     sth        sth        sth
.D2
.S1
.S2
.M1
.M2
.L1
.L2
111
How long is the prolog?
What is the length of the longest path? 10
How many cycles per loop? 2
112
Step 5 - Create Scheduling Chart (0)
113
Step 5 - Create Scheduling Chart
[Table: empty scheduling chart. Rows: .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2. Upper half columns: cycles 0, 2, 4, 6, 8. Lower half columns: cycles 1, 3, 5, 7, 9.]
116
Step 5 - Create Scheduling Chart
[Table: scheduling chart (even cycles 0 to 8, odd cycles 1 to 9) with ADD ci placed on .L1 in the even-cycle half.]
117
Step 5 - Create Scheduling Chart
118
Step 5 - Create Scheduling Chart
119
Step 5 - Create Scheduling Chart
Conflict
120
Conflict Solution
Here are two possibilities ...
Which is better?
121
Conflict Solution
Here are two possibilities ...
Which is better? Move the LDH to cycle 2 (so
you don't have to go back and recheck crosspaths).
122
Step 5 - Create Scheduling Chart
[Table: scheduling chart with ADD ci on .L1 in the even-cycle half (0 to 8) and STH ci on .D1 in the odd-cycle half (1 to 9).]
125
Step 5 - Create Scheduling Chart
[Table: completed scheduling chart. Even cycles (0, 2, 4, 6, 8): ADD ci, [i] B, LDH ai, LDH bi. Odd cycles (1, 3, 5, 7, 9): [i] SUB i, SHR sum, MPY mi, LDH ai, STH ci.]
126
2 Cycle Loop Kernel
[Table: 2-cycle loop kernel. Even cycles (0, 2, 4, 6, 8): ADD ci, [i] B, LDH ai, LDH bi. Odd cycles (1, 3, 5, 7, 9): [i] SUB i, SHR sum, MPY mi, LDH ai, STH ci.]
127
What Requires Multi-Cycle Loops?
128
Live Too Long - Example
129
Live Too Long - Example
130
Live Too Long - Example
[Figure: timing diagram over cycles 0 to 6 showing LDH ai, a0 valid, a1, SHR and x0 valid.]
131
Live Too Long - Example
[Figure: timing diagram over cycles 0 to 6 showing LDH ai, a0 valid, a1, SHR, x0 valid and ADD.]
Oops: rather than adding a0 + x0 we got a1 + x0.
Let's look at one solution ...
132
Live Too Long - 2 Cycle Solution
133
Live Too Long - 2 Cycle Solution
Notice that a0 and x0 are both valid for 2
cycles, which is the length of the Iteration
Interval.
Adding them ...
134
Live Too Long - 2 Cycle Solution
Works! But what's the drawback?
A 2-cycle loop is slower.
Here's a better solution ...
135
Live Too Long - 1 Cycle Solution
Using a temporary register solves this problem
without increasing the Minimum Iteration Interval.
136
What Requires Multi-Cycle Loops?
137
Loop Carry Path

The loop carry path is a path which feeds one
variable from part of the algorithm back to
another.

e.g. Loop carry path = 3.
Note: The loop carry path is not the code loop.
138
Loop Carry Path, e.g. IIR Filter
IIR Filter Example: $y_0 = a_0 x_0 + b_1 y_1$
139
IIR.SA
IIR Filter Example: $y_0 = a_0 x_0 + b_1 y_1$

IIR:     ldh   *a_1, A1
         ldh   *x1, A3
         ldh   *b_1, B1
         ldh   *y0, B0            ; y1 is previous y0
         mpy   A1, A3, prod1
         mpy   B1, B0, prod2
         add   prod1, prod2, prod2
         sth   prod2, *y0
140
Loop Carry Path - IIR Example
IIR Filter Loop: $y_0 = a_1 x_1 + b_1 y_1$
Min Iteration Interval:
  Resource = 2 (need 3 .D units)
  Loop Carry Path = 9 (9 = 5 + 2 + 1 + 1)
  therefore, MII = 9
The result carries over from one iteration of the
loop to the next.
Can it be minimized?
141
Loop Carry Path - IIR Example (Solution)
IIR Filter Loop: $y_0 = a_1 x_1 + b_1 y_1$
Min Iteration Interval:
  Resource = 2 (need 3 .D units)
  New Loop Carry Path = 3 (3 = 2 + 1)
  therefore, MII = 3
Since y0 is stored in a CPU register, it can be
used directly by MPY (after the first loop
iteration).
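
In both versions of the IIR loop the minimum iteration interval is simply the larger of the two bounds. A small C sketch, considering only the resource and loop-carry constraints shown on these slides (the function name is illustrative):

```c
/* MII is the largest of the individual lower bounds; only the resource
 * bound and the loop carry path are considered in this sketch. */
unsigned min_iteration_interval(unsigned resource_bound,
                                unsigned loop_carry_path)
{
    return resource_bound > loop_carry_path ? resource_bound
                                            : loop_carry_path;
}
```

With the result reloaded from memory the bounds are 2 and 9, giving MII = 9; keeping y0 in a register shortens the loop carry path to 3, giving MII = 3.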
142
Reminder Fixed-Point Dot-Product Example
Is there a loop carry path in this example?
Yes, but it's only 1.
Min Iteration Interval: Resource = 1, Loop Carry
Path = 1, therefore MII = 1.
For the fixed-point implementation, the Loop
Carry Path was not taken into account because it
is equal to 1.
143
Loop Carry Path

IIR Example.
Enhancing the IIR.
Fixed-Point Dot-Product Example.
Floating-Point Dot Product Example.

144
Loop Carry Path due to FUL > 1: Floating-Point
Dot-Product Example
Min Iteration Interval: Resource = 1, Loop Carry
Path = 4, therefore MII = 4.
145
Unrolling the Loop
If the MII must be four cycles long, then use all
of them to calculate four results.
146
ADDSP Pipeline (Staggered Results)
Cycle   Instruction             Result
0       ADDSP x0, sum, sum      sum = 0
1       ADDSP x1, sum, sum      sum = 0
2       ADDSP x2, sum, sum      sum = 0
3       ADDSP x3, sum, sum      sum = 0
4       ADDSP x4, sum, sum      sum = x0
5       ADDSP x5, sum, sum      sum = x1
6       ADDSP x6, sum, sum      sum = x2
7       ADDSP x7, sum, sum      sum = x3
8       ADDSP x8, sum, sum      sum = x0 + x4
9                               sum = x1 + x5
10                              sum = x2 + x6
11                              sum = x3 + x7
12                              sum = x0 + x4 + x8

ADDSP takes 4 cycles (three delay slots) to
produce the result.

154
ADDSP Pipeline (Staggered Results)
Cycle   Instruction             Result
0       ADDSP x0, sum, sum      sum = 0
1       ADDSP x1, sum, sum      sum = 0
2       ADDSP x2, sum, sum      sum = 0
3       ADDSP x3, sum, sum      sum = 0
4       ADDSP x4, sum, sum      sum = x0
5       ADDSP x5, sum, sum      sum = x1
6       ADDSP x6, sum, sum      sum = x2
7       ADDSP x7, sum, sum      sum = x3
8       ADDSP x8, sum, sum      sum = x0 + x4
9       NOP                     sum = x1 + x5
10      NOP                     sum = x2 + x6
11      NOP                     sum = x3 + x7
12      NOP                     sum = x0 + x4 + x8

There are effectively four running sums.

155
ADDSP Pipeline (Staggered Results)

There are effectively four running sums

These need to be combined after the last addition
is complete...
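
The four running sums and their final combination can be sketched in C. This models only the arithmetic, not the ADDSP timing: element i lands in lane i mod 4, and the lanes are then combined in pairs. The function name and the pairing order (lanes 1 and 2, then lanes 0 and 3) are chosen to mirror the sum1 and sum2 values used in the combining slides.

```c
#include <stddef.h>

/* Accumulate into four interleaved partial sums, then combine them. */
float sum_four_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    /* With a 4-cycle add, consecutive inputs feed different sums. */
    for (size_t i = 0; i < n; i++) {
        lane[i % 4] += x[i];
    }

    /* Combine after the last addition is complete. */
    float sum1 = lane[1] + lane[2];
    float sum2 = lane[0] + lane[3];
    return sum1 + sum2;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a strictly sequential sum; for the small integers in a quick check it is exact.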

162
ADDSP Pipeline (Combining Results)
Cycle   Instruction              Result
0       ADDSP x0, sum, sum       sum = 0
1       ADDSP x1, sum, sum       sum = 0
2       ADDSP x2, sum, sum       sum = 0
3       ADDSP x3, sum, sum       sum = 0
4       ADDSP x4, sum, sum       sum = x0
5       ADDSP x5, sum, sum       sum = x1
6       ADDSP x6, sum, sum       sum = x2
7       ADDSP x7, sum, sum       sum = x3
8       ADDSP x8, sum, sum       sum = x0 + x4
9       MV sum, temp             sum = x1 + x5
10      ADDSP sum, temp, sum1    sum = x2 + x6, temp = x1 + x5
11      MV sum, temp             sum = x3 + x7
12      ADDSP sum, temp, sum2    sum = x0 + x4 + x8, temp = x3 + x7
13      NOP
14      NOP                      sum1 = x1 + x2 + x5 + x6
15      NOP
16      ADDSP sum1, sum2, sum    sum2 = x0 + x3 + x4 + x7 + x8
17      NOP
18      NOP
19      NOP
20      NOP                      sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
163
What Requires Multi-Cycle Loops?
164
Simple FUL Example
[Figure: dependency graph node MPYDP writing prod, annotated 10 (4.9), with path values 3 and 5.]
165
A Better Way to Diagram this ...
166
What Requires Multi-Cycle Loops?