Software Optimisation

Slides: 168
Provided by: NaimDa9

Transcript and Presenter's Notes



1
  • Chapter 12
  • Software Optimisation

2
Software Optimisation Chapter
  • This chapter consists of three parts
  • Part 1 Optimisation Methods.
  • Part 2 Software Pipelining.
  • Part 3 Multi-cycle Loop Pipelining.

3
  • Chapter 12
  • Software Optimisation
  • Part 1 - Optimisation Methods

4
Objectives
  • Introduction to optimisation and optimisation
    procedure.
  • Optimisation of C code using the code generation
    tools.
  • Optimisation of assembly code.

5
Introduction
  • Software optimisation is the process of
    manipulating software code to achieve two main
    goals:
  • Faster execution time.
  • Smaller code size.
  • Note: It will be shown that in general there is
    a trade-off between faster execution time and
    smaller code size.

6
Introduction
  • To implement efficient software, the programmer
    must be familiar with
  • Processor architecture.
  • Programming language (C, assembly or linear
    assembly).
  • The code generation tools (compiler, assembler
    and linker).

7
Code Optimisation Procedure
8
Code Optimisation Procedure
9
Optimising C Compiler Options
  • The C6x optimising C compiler takes ANSI C
    source code and can currently reach up to about
    80% of the performance of hand-scheduled
    assembly.
  • However, to achieve this level of optimisation,
    knowledge of different levels of optimisation is
    essential. Optimisation is performed at different
    stages and levels.

10
Assembly Optimisation
  • To develop an appreciation of how to optimise
    code, let us optimise an FIR filter.
  • For simplicity we write the output as a single
    sum of products:

    y = \sum_{i=0}^{N-1} x_i h_i        (1)
11
Assembly Optimisation
  • To implement Equation 1, we need to perform the
    following steps:
  • (1) Load the sample x_i.
  • (2) Load the coefficient h_i.
  • (3) Multiply x_i and h_i.
  • (4) Add the product (x_i * h_i) to the content of
    the accumulator.
  • (5) Repeat steps 1 to 4 N-1 times.
  • (6) Store the value in the accumulator to y.
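The six steps can be sketched in a few lines of Python. This is only an illustrative model of the arithmetic, not C6x code; the function name `fir_output` and the sample data are assumptions made for the example.

```python
def fir_output(x, h):
    """Sum of products y = sum(x[i] * h[i]), following steps (1) to (6)."""
    if len(x) != len(h):
        raise ValueError("x and h must have the same length")
    acc = 0                      # accumulator
    for i in range(len(x)):      # step 5: repeat for every tap
        xi = x[i]                # step 1: load the sample
        hi = h[i]                # step 2: load the coefficient
        prod = xi * hi           # step 3: multiply
        acc += prod              # step 4: add the product to the accumulator
    return acc                   # step 6: the accumulator is stored to y


y = fir_output([1, 2, 3, 4], [5, 6, 7, 8])
print(y)
```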

12
Assembly Optimisation
  • Steps 1 to 6 can be translated into the following
    C6x assembly code

13
Assembly Optimisation
  • In order to optimise the code, we need to
  • (1) Use instructions in parallel.
  • (2) Remove the NOPs.
  • (3) Remove the loop overhead (remove SUB and B by
    loop unrolling).
  • (4) Use word access or double-word access instead
    of byte or half-word access.

14
Step 1 - Using Parallel Instructions
ldh
ldh
nop
nop
nop
nop
mpy
nop
add
sub
b
nop
nop
nop
nop
nop
15
Step 1 - Using Parallel Instructions
ldh
ldh
nop
nop
nop
nop
mpy
nop
add
sub
b
nop
nop
nop
nop
nop
Note: Not all instructions can be put in parallel,
since the result of one unit is used as an input
to the following unit.
16
Step 2 - Removing the NOPs
ldh
ldh
sub
b
nop
nop
mpy
nop
add
17
Step 3 - Loop Unrolling
  • The SUB and B instructions consume at least two
    extra cycles per iteration (this is known as
    branch overhead).
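A small Python model can illustrate the idea of unrolling: the unrolled loop produces the same result while executing the loop-control pair (the SUB and B of the assembly version) half as often. The function names and the `branches` counter are hypothetical and exist only for this sketch.

```python
def dot_rolled(x, h):
    """One multiply-accumulate per iteration, one loop-control pair each."""
    acc = 0
    branches = 0
    for i in range(len(x)):
        acc += x[i] * h[i]
        branches += 1            # models one SUB/B pair per iteration
    return acc, branches


def dot_unrolled_by_2(x, h):
    """Two multiply-accumulates per iteration, so half the SUB/B pairs."""
    if len(x) % 2:
        raise ValueError("this sketch assumes an even number of taps")
    acc = 0
    branches = 0
    for i in range(0, len(x), 2):
        acc += x[i] * h[i]
        acc += x[i + 1] * h[i + 1]
        branches += 1            # one SUB/B pair now covers two taps
    return acc, branches


print(dot_rolled([1, 2, 3, 4], [5, 6, 7, 8]))
print(dot_unrolled_by_2([1, 2, 3, 4], [5, 6, 7, 8]))
```

The body of the unrolled loop is longer, which is the code-size cost of the technique.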

18
Step 4 - Word or Double Word Access
  • The C6711 has two 64-bit data buses for data
    memory access, and therefore up to two 64-bit
    values can be loaded into the registers at any
    time (see Chapter 2).
  • In addition, the C6711 devices have variants of
    the multiplication instruction to support
    different operations (see Chapter 2).
  • Note: A store can only be up to 32 bits wide.

19
Step 4 - Word or Double Word Access
  • Using word access, MPY and MPYH, the previous code
    can be written as
  • Note: By loading words and using the MPY and MPYH
    instructions, the execution time has been halved,
    since two 16x16-bit multiplications are performed
    in each iteration.
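The effect of word access can be modelled in Python, assuming that MPY multiplies the signed lower 16 bits of its operands and MPYH the signed upper 16 bits. The helper names (`pack_halfwords`, `mpy`, `mpyh`, `dot_word_access`) are made up for this sketch; packing two halfwords into an integer stands in for a single LDW.

```python
def to_signed16(v):
    """Interpret the low 16 bits of v as a signed halfword."""
    v &= 0xFFFF
    return v - 0x10000 if v & 0x8000 else v


def pack_halfwords(lo, hi):
    """Model one 32-bit word holding two 16-bit samples."""
    return ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)


def mpy(a, b):
    """Signed 16 LSB x 16 LSB multiply."""
    return to_signed16(a) * to_signed16(b)


def mpyh(a, b):
    """Signed 16 MSB x 16 MSB multiply."""
    return to_signed16(a >> 16) * to_signed16(b >> 16)


def dot_word_access(x, h):
    """Dot product using one word load per pair of samples."""
    if len(x) % 2 or len(x) != len(h):
        raise ValueError("this sketch assumes equal, even lengths")
    acc_lo = acc_hi = 0
    for i in range(0, len(x), 2):          # half as many iterations
        xw = pack_halfwords(x[i], x[i + 1])
        hw = pack_halfwords(h[i], h[i + 1])
        acc_lo += mpy(xw, hw)              # even-indexed products
        acc_hi += mpyh(xw, hw)             # odd-indexed products
    return acc_lo + acc_hi                 # combine outside the loop


print(dot_word_access([1, -2, 3, -4], [5, 6, -7, 8]))
```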

20
Optimisation Summary
  • It has been shown that there are four
    complementary methods for code optimisation
  • Using instructions in parallel.
  • Filling the delay slots with useful code.
  • Using word or double word load.
  • Loop unrolling.

The first three methods increase performance and reduce code size.
21
Optimisation Summary
  • It has been shown that there are four
    complementary methods for code optimisation
  • Using instructions in parallel.
  • Filling the delay slots with useful code.
  • Using word or double word load.
  • Loop unrolling.

Loop unrolling increases performance but increases
code size.
22
  • Chapter 12
  • Software Optimisation
  • Part 2 - Software Pipelining

23
Objectives
  • Why use Software Pipelining (SP)?
  • Understand software pipelining concepts.
  • Use software pipelining procedure.
  • Code the word-wide software pipelined dot-product
    routine.
  • Determine if your pipelined code is more
    efficient with or without prolog and epilog.

24
Why use Software Pipelining (SP)?
  • SP creates highly optimized loop code by
  • Putting several instructions in parallel.
  • Filling delay slots with useful code.
  • Maximizing functional unit usage.
  • SP is implemented by simply using the tools
  • Compiler options -o2 or -o3.
  • Assembly Optimizer, if the source is a .sa file.

25
Software Pipeline concept
To explain the concept of software pipelining, we
will assume that all instructions execute in one
cycle.
26
Software Pipeline Example
LDH || LDH
MPY
ADD
How many cycles would it take to perform this
loop 5 times? (Disregard delay slots.)
5 x 3 = 15 cycles
Let's examine hardware (functional unit) usage
...
27
Non-Pipelined Code
.D1
.D2
mpy
add
28
Pipelining Code
Pipelining these instructions took 1/2 the cycles!
29
Pipelining Code
Pipelining these instructions takes only 7 cycles!
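The 15-cycle and 7-cycle figures follow from a simple count, sketched here in Python under the slide's assumption that every instruction takes one cycle. The function names are illustrative; `stages` is the number of cycles one iteration needs (load, multiply, add).

```python
def cycles_non_pipelined(iterations, stages):
    """Each iteration finishes before the next one starts."""
    return iterations * stages


def cycles_pipelined(iterations, stages):
    """A new iteration starts every cycle; the last one still needs
    `stages` cycles to drain."""
    if iterations == 0:
        return 0
    return iterations + stages - 1


print(cycles_non_pipelined(5, 3))
print(cycles_pipelined(5, 3))
```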
30
Pipelining Code
Prolog: staging for the loop.
Epilog: completing the final operations.
31
Pipelined Code
32
Software Pipelining Procedure
1. Write algorithm in C code and verify.
2. Write C6x Linear Assembly code.
3. Create dependency graph.
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to C6x code.
33
Software Pipelining Example (Step 1)
35
Write code in Linear Assembly (Step 2)
C code:

    for (i = 0; i < count; i++) {
        prod = m[i] * n[i];
        sum += prod;
    }

Linear assembly:

    loop:     ldh   *p_m++, m
              ldh   *p_n++, n
              mpy   m, n, prod
              add   prod, sum, sum
      [count] sub   count, 1, count
      [count] b     loop

  • 1. No NOPs required.
  • 2. No parallel instructions required.
  • 3. You don't have to specify
  • Functional units, or
  • Registers.

36
Software Pipelining Procedure
1. Write algorithm in C code and verify.
2. Write C6x Linear Assembly code.
3. Create a dependency graph (4 steps).
4. Allocate registers.
5. Create scheduling table.
6. Translate scheduling table to C6x code.
37
Dependency Graph Terminology
38
Dependency Graph Steps
  • (a) Draw the algorithm nodes and paths.
  • (b) Write the number of cycles it takes for each
    instruction to complete execution.
  • (c) Assign the required functional units to each
    node.
  • (d) Partition the nodes to A and B sides and
    assign sides to all functional units.

39
Dependency Graph (Step a)
  • In this step each instruction is represented by a
    node.
  • The node is represented by a circle, where
  • Outside: write the instruction.
  • Inside: write the register where the result is
    written.
  • Nodes are then connected by paths showing the
    data flow.
  • Note: Conditional paths are represented by
    dashed lines.

40
Dependency Graph (Step a)
46
Dependency Graph (Step b)
  • In this step the number of cycles it takes for
    each instruction to complete execution is added
    to the dependency graph.
  • It is written along the associated data path.

47
Dependency Graph (Step b)
48
Dependency Graph (Step c)
  • In this step functional units are assigned to
    each node.
  • It is advantageous to start allocating units to
    instructions which require a specific unit
  • Load/Store.
  • Branch.
  • We do not need to be concerned with multiply as
    this is the only operation that the .M unit
    performs.
  • Note The side is not allocated at this stage.

49
Dependency Graph (Step c)
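[Figure: dependency graph with a functional unit type assigned to each node. Recoverable labels: two .D units (5 cycles each), one .M unit (2 cycles), one .S unit (6 cycles); the remaining paths are 1 cycle each.]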
50
Dependency Graph (Step d)
  • The data path is partitioned into side A and B at
    this stage.
  • To optimise code we need to ensure that a maximum
    number of units are used with a minimum number of
    cross paths.
  • To make the partition visible on the dependency
    graph a line is used.
  • The side can then be added to the functional
    units associated with each instruction or node.

51
Dependency Graph (Step d)
[Figure: the same dependency graph partitioned by a line into A side and B side.]
52
Dependency Graph (Step d)
54
Step 4 - Allocate Functional Units

  A side               B side
  .L1   sum            .L2   count
  .M1   prod           .M2   (unused)
  .D1   m              .D2   n
  .S1   (unused)       .S2   loop
  x1    .M1x           x2    (unused)

Do we have enough functional units to code this
algorithm in a single-cycle loop?

[Figure: dependency graph annotated with the allocated units .D1, .D2, .M1x, .L1, .L2 and .S2.]
55
Step 4 - Allocate Registers
57
Step 5 - Create Scheduling Table
[Table: empty scheduling table with one row per functional unit (.L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2) and one column per cycle, 1 to 8.]
How do we know the loop ends up in cycle 8?
58
Length of Prolog
  • Answer
  • Count up the length of the longest path; in this
    case we have
  • 5 + 2 + 1 = 8 cycles
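Counting the longest path can also be done mechanically. The Python sketch below uses the latencies shown on the dependency graph (5 for each load, 2 for the multiply, 1 for the add and the subtract, 6 for the branch); the node names and the edge list are my reading of the graph and should be treated as assumptions.

```python
from functools import lru_cache

# Cycles each instruction needs before its result can be used.
LATENCY = {"ldh_m": 5, "ldh_n": 5, "mpy": 2, "add": 1, "sub": 1, "b": 6}

# Which instruction feeds which (assumed from the dependency graph).
FEEDS = {
    "ldh_m": ["mpy"],
    "ldh_n": ["mpy"],
    "mpy": ["add"],
    "add": [],
    "sub": ["b"],
    "b": [],
}


def longest_path(latency, feeds):
    """Length in cycles of the longest chain of dependent instructions."""

    @lru_cache(maxsize=None)
    def length_from(node):
        tail = max((length_from(n) for n in feeds[node]), default=0)
        return latency[node] + tail

    return max(length_from(node) for node in latency)


print(longest_path(LATENCY, FEEDS))
```

The load, multiply, add chain gives 5 + 2 + 1 = 8, while the subtract and branch chain gives only 1 + 6 = 7, so the data path sets the length.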
59
Scheduling Table
[Table: the same empty scheduling table, cycles 1 to 8, one row per functional unit.]
60
Scheduling Table
add
Where do we want to branch?
61
Scheduling Table
63
Translate Scheduling Table to C6x Code
Prolog (slides 63 to 69 build this up one cycle at a time):

C1:             ldh  .D1   *A1++,A2
     ||         ldh  .D2   *B1++,B2

C2:             ldh  .D1   *A1++,A2
     ||         ldh  .D2   *B1++,B2
     || [B0]    sub  .L2   B0,1,B0

C3, C4, C5:     ldh  .D1   *A1++,A2
     ||         ldh  .D2   *B1++,B2
     || [B0]    sub  .L2   B0,1,B0
     || [B0]    B    .S2   loop

C6, C7:         ldh  .D1   *A1++,A2
     ||         ldh  .D2   *B1++,B2
     || [B0]    sub  .L2   B0,1,B0
     || [B0]    B    .S2   loop
     ||         mpy  .M1x  A2,B2,A3

Single-cycle loop (slide 70):

loop:           ldh  .D1   *A1++,A2
     ||         ldh  .D2   *B1++,B2
     || [B0]    sub  .L2   B0,1,B0
     || [B0]    B    .S2   loop
     ||         mpy  .M1x  A2,B2,A3
     ||         add  .L1   A4,A3,A4
See Chapter 14 for practical examples
71
Translate Scheduling Table to C6x Code
  • With this method we have only created the prolog
    and the loop.
  • Therefore if the filter has 100 taps, then we
    need to repeat the loop 100 times as we need 100
    adds.
  • This means that we are performing 107 loads.
    These 7 extra loads may lead to some illegal
    memory accesses.
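The load count can be checked with a short Python sketch. It assumes the schedule used in this part of the deck: a 7-cycle prolog in which every cycle issues loads, multiplies starting in prolog cycles 6 and 7, and an epilog of 7 cycles (five with a multiply, all seven with an add). The function names are hypothetical.

```python
PROLOG_CYCLES = 7   # p1 to p7, each issuing a pair of loads


def loads_without_epilog(taps):
    """Prolog plus a loop that runs `taps` times to perform all the adds."""
    return PROLOG_CYCLES + taps


def counts_with_epilog(taps):
    """Loads, multiplies and adds when the last 7 cycles become an epilog."""
    if taps < PROLOG_CYCLES:
        raise ValueError("this sketch assumes at least 7 taps")
    loop_runs = taps - PROLOG_CYCLES
    loads = PROLOG_CYCLES + loop_runs   # no loads in the epilog
    mpys = 2 + loop_runs + 5            # p6, p7 + loop + e1 to e5
    adds = loop_runs + 7                # loop + e1 to e7
    return loads, mpys, adds


print(loads_without_epilog(100))
print(counts_with_epilog(100))
```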

72
Solution: The Epilog
We only created the Prolog and the Loop. What
about the Epilog?
The Epilog can be extracted from your results as
described below.
See the example in the next slide.
73
Dot-Product with Epilog
e1 mpy add
74
Dot-Product with Epilog
e2 mpy add
75
Dot-Product with Epilog
e3 mpy add
76
Dot-Product with Epilog
e4 mpy add
77
Dot-Product with Epilog
e5 mpy add
78
Dot-Product with Epilog
e6 add
79
Dot-Product with Epilog
Prolog
  p1:    ldh || ldh
  p2:    ldh || ldh || sub
  p3:    ldh || ldh || sub || b
  p4:    ldh || ldh || sub || b
  p5:    ldh || ldh || sub || b
  p6:    ldh || ldh || mpy || sub || b
  p7:    ldh || ldh || mpy || sub || b
Loop
  loop:  ldh || ldh || mpy || add || sub || b
Epilog
  e1:    mpy || add
  e2:    mpy || add
  e3:    mpy || add
  e4:    mpy || add
  e5:    mpy || add
  e6:    add
  e7:    add
80
Scheduling Table Prolog, Loop and Epilog
81
Loop only!
  • Can the code be written as a loop only (i.e. no
    prolog or epilog)?

Yes!
82
Loop only!
(i) Remove all instructions except the branch.
83
Loop only!
(i) Remove all instructions except the branch.
add
B
mpy
ldh m
84
Loop only!
(i) Remove all instructions except the branch.
(ii) Zero input registers, accumulator
and product registers.
add
zero sum
zero a
zero prod
zero b
B
mpy
ldh m
85
Loop only!
(i) Remove all instructions except the branch.
(ii) Zero input registers, accumulator
and product registers.
(iii) Adjust the number of subtractions.
add
zero sum
zero a
sub
zero prod
zero b
B
mpy
ldh m
86
Loop Only - Final Code
Overhead:
         b     loop
         b     loop
         b     loop
      || zero  m        ; input register
      || zero  n        ; input register
         b     loop
      || zero  prod     ; product register
      || zero  sum      ; accumulator
         b     loop
      || sub            ; modify count register
Loop:
  loop:  ldh
      || ldh
      || mpy
      || add
      || sub
      || b     loop
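Why zeroed registers make the loop-only version give the right answer can be seen in a small cycle-by-cycle model. This Python sketch assumes a load result is usable 5 cycles after issue and a product 2 cycles after issue, and that the loop body runs count + 7 times; that iteration count is my interpretation of "adjust the number of subtractions". The name `dot_loop_only` and the `overread_value` parameter are hypothetical.

```python
def dot_loop_only(m, n, overread_value=999):
    """Run only the single-cycle kernel, starting from zeroed registers."""
    load_latency, mpy_latency = 5, 2
    count = len(m)
    m_reg = n_reg = prod = acc = 0          # step (ii): zeroed registers
    pending_loads = {}                      # arrival cycle -> (m, n)
    pending_prods = {}                      # arrival cycle -> product
    loads_issued = 0
    for cycle in range(count + load_latency + mpy_latency):
        # Results that land this cycle become visible first.
        if cycle in pending_loads:
            m_reg, n_reg = pending_loads.pop(cycle)
        if cycle in pending_prods:
            prod = pending_prods.pop(cycle)
        # The kernel: add, mpy and both loads issue in parallel.
        acc += prod
        pending_prods[cycle + mpy_latency] = m_reg * n_reg
        if cycle < count:
            pending_loads[cycle + load_latency] = (m[cycle], n[cycle])
        else:
            # Loads past the end of the arrays: values never reach the sum.
            pending_loads[cycle + load_latency] = (overread_value,
                                                   overread_value)
        loads_issued += 1
    return acc, loads_issued


print(dot_loop_only([1, 2, 3, 4], [5, 6, 7, 8]))
```

In this model the first seven adds accumulate zeros, and the seven extra loads are issued but never added, which matches the extra-load concern raised earlier for the version without an epilog.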
87
Laboratory exercise
  • Software pipeline using the LDW version of the
    Dot-Product routine
  • (1) Write linear assembly.
  • (2) Create dependency graph.
  • (3) Complete scheduling table.
  • (4) Transfer table to C6000 code.
  • To Epilogue or Not to Epilog?
  • Determine if your pipelined code is more
    efficient with or without prolog and epilog.

88
Lab Solution Step 1 - Linear Assembly
C code:

    for (i = 0; i < count; i++) {
        prod = m[i] * n[i];
        sum += prod;
    }

Linear assembly (count becomes 20):

    loop:     ldw   *p_m++, m
              ldw   *p_n++, n
              mpy   m, n, prod
              mpyh  m, n, prodh
              add   prod, sum, sum
              add   prodh, sumh, sumh
      [count] sub   count, 1, count
      [count] b     loop
    ; Outside of loop
              add   sum, sumh, sum
89
Step 2 - Dependency Graph
90
Step 2 - Functional Units
  A side               B side
  .L1   sum            .L2   sumh
  .M1   prod           .M2   prodh
  .D1   m              .D2   n
  .S1   loop           .S2   count
  x1    .M1x           x2    .M2x

Do we still have enough functional units to code
this algorithm in a single-cycle loop? Yes!

91
Step 2 - Registers
  Register File A           Register File B
  A0                        B0   count
  A1                        B1
  A2                        B2
  A3                        B3   return address
  A4   a / ret value        B4   x
  A5   a                    B5   x
  A6   count / prod         B6   prodh
  A7   sum                  B7   sumh
92
Step 3 - Schedule Algorithm
add
add
93
Step 4 - C6000 Code
  • The complete code is available in the following
    location
  • \Links\DotP LDW.pdf

94
Why Conditional Subtract?
loop:     ldh   *p_m++, m
          ldh   *p_n++, n
          mpy   m, n, prod
          add   prod, sum, sum
  [count] sub   count, 1, count
  [count] b     loop

(X marks a branch that is not taken.)

Without cond. subtract        With cond. subtract
loop (count = 1)   (B)        loop (count = 1)  (B)
loop (count = 0)   (B) X      loop (count = 0)  (B) X
loop (count = -1)  (B)        loop (count = 0)  (B) X
loop (count = -2)  (B)        loop (count = 0)  (B) X
loop (count = -3)  (B)        loop (count = 0)  (B) X
loop (count = -4)  (B)        loop (count = 0)  (B) X
Loop never ends               Loop ends
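The two columns can be reproduced with a few lines of Python. The sketch assumes that the kernel keeps executing for several cycles after the count reaches zero (my understanding of why the extra rows exist: branches already issued are still in their delay slots) and that `[count]` means "execute only if count is non-zero". The function name is hypothetical.

```python
def branch_decisions(start_count, cycles, conditional_subtract):
    """Return, per cycle, whether the [count] b loop branch is taken."""
    count = start_count
    taken = []
    for _ in range(cycles):
        taken.append(count != 0)                 # [count] b loop
        if not conditional_subtract or count != 0:
            count -= 1                           # sub count, 1, count
    return taken


print(branch_decisions(1, 6, conditional_subtract=False))
print(branch_decisions(1, 6, conditional_subtract=True))
```

Without the condition the count passes through zero and becomes negative, so later branches are taken again; with the condition it stays at zero.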
95
  • Chapter 12
  • Software Optimisation
  • Part 3 - Pipelining Multi-cycle Loops

96
Objectives
  • Software pipeline the weighted vector sum
    algorithm.
  • Describe four iteration interval constraints.
  • Calculate minimum iteration interval.
  • Convert and optimize the dot-product code to
    floating point code.

97
What Requires Multi-Cycle Loops?
  • Resource Limitations
      • Running out of resources (functional units,
        registers, bus accesses).
      • The Weighted Vector Sum example requires three
        .D units.
  • Live Too Long
      • Minimum iteration interval defined by the
        length of time a variable is required to exist.
  • Loop Carry Path
      • Latency required between loop iterations.
      • FIR example and SP floating-point dot product
        examples are demonstrated.
  • Functional Unit Latency > 1
      • A few C67x instructions require functional
        units for 2 or 4 cycles rather than one. This
        defines a minimum iteration interval.

98
What Requires Multi-Cycle Loops?
Use these four constraints to determine the
smallest Iteration Interval (Minimum Iteration
Interval or MII).
99
Resource Limitation Weighted Vector Sum
Step 1 - C Code
a, b: input arrays; c: output array; n: length
of arrays; r: weighting factor.
Load .D
Load .D
Store .D
Requires 3 .D units
100
Software Pipelining Procedure
101
Step 2 - C6x Linear Code
c[i] = a[i] + ((r * b[i]) >> 15)

loop:   LDH   *a++, ai
        LDH   *b++, bi
        MPY   r, bi, prod
        SHR   prod, 15, sum
        ADD   ai, sum, ci
        STH   ci, *c++
  [i]   SUB   i, 1, i
  [i]   B     loop
  • The full code is available here \Links\Wvs.sa
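As a reference for checking the assembly, the weighted vector sum can be written in Python. The shift by 15 is consistent with `r` being a Q15 fixed-point weight, but that interpretation is an assumption, as are the function name and the test data. Python's `>>` on negative integers is an arithmetic shift, which matches a signed SHR.

```python
def weighted_vector_sum(a, b, r):
    """c[i] = a[i] + ((r * b[i]) >> 15), mirroring MPY, SHR, ADD."""
    if len(a) != len(b):
        raise ValueError("a and b must have the same length")
    c = []
    for ai, bi in zip(a, b):
        prod = r * bi          # MPY r, bi, prod
        scaled = prod >> 15    # SHR prod, 15, sum
        c.append(ai + scaled)  # ADD ai, sum, ci (then STH)
    return c


# 0x4000 is 0.5 when read as Q15.
print(weighted_vector_sum([100, 200], [10, -20], 0x4000))
```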

102
Step 3 - Dependency Graph
103
Step 4 - Allocate Functional Units
  • This requires 3 .D units, therefore it cannot fit
    into a single-cycle loop.
  • This may fit into a 2-cycle loop if there are no
    other constraints.
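The resource bound can be computed directly: for each unit type, divide the number of operations that need it by the number of such units and round up. The Python sketch below is illustrative; the dictionary names and the per-type operation counts for the weighted vector sum (3 on .D, 1 on .M, 2 on .S, 2 on .L) are my tally from the linear assembly, and two units of each type are assumed.

```python
import math


def resource_mii(ops_needed, units_available):
    """Smallest iteration interval allowed by functional-unit counts."""
    return max(
        math.ceil(n / units_available[kind])
        for kind, n in ops_needed.items()
    )


C6X_UNITS = {".D": 2, ".M": 2, ".S": 2, ".L": 2}

# Weighted vector sum: 2 loads + 1 store, 1 multiply, SHR + B, ADD + SUB.
WVS_OPS = {".D": 3, ".M": 1, ".S": 2, ".L": 2}

# 16-bit dot product: 2 loads, 1 multiply, 1 branch, ADD + SUB.
DOTP_OPS = {".D": 2, ".M": 1, ".S": 1, ".L": 2}

print(resource_mii(WVS_OPS, C6X_UNITS))
print(resource_mii(DOTP_OPS, C6X_UNITS))
```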

104
2 Cycle Loop
Cycle 1
Cycle 2
Iteration Interval (II) = cycles per loop
iteration.
105
Multi-Cycle Loop Iterations
            loop 1     loop 2     loop 3
  Unit      cycle 1    cycle 3    cycle 5
  .D1       ldh        ldh        ldh
  .D2       ldh        ldh        ldh
  .S2       shr        shr        shr
  .M1       mpy        mpy        mpy
  .M2
  .L1       add        add        add
  .L2       sub        sub        sub
  .S1       b          b          b

  Unit      cycle 2    cycle 4    cycle 6
  .D1       sth        sth        sth
  .D2
  .S1
  .S2
  .M1
  .M2
  .L1
  .L2
111
How long is the Prolog?
What is the length of the longest path?
10
How many cycles per loop?
2
112
Step 5 - Create Scheduling Chart (0)
113
Step 5 - Create Scheduling Chart
[Table: empty scheduling chart. Rows: .L1, .L2, .S1, .S2, .M1, .M2, .D1, .D2. Upper half: even cycles 0, 2, 4, 6, 8; lower half: odd cycles 1, 3, 5, 7, 9.]
116
Step 5 - Create Scheduling Chart
[Scheduling chart: ADD ci entered on .L1.]
117
Step 5 - Create Scheduling Chart
118
Step 5 - Create Scheduling Chart
119
Step 5 - Create Scheduling Chart
Conflict
120
Conflict Solution
Here are two possibilities ...
Which is better?
121
Conflict Solution
Here are two possibilities ...
Which is better? Move the LDH to cycle 2 (so
you don't have to go back and recheck cross paths).
122
Step 5 - Create Scheduling Chart
[Scheduling chart: ADD ci on .L1 and STH ci on .D1 (odd-cycle half).]
125
Step 5 - Create Scheduling Chart
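[Scheduling chart; the cycle columns are not recoverable, so only the unit placement is listed.]

Even cycles (0, 2, 4, 6, 8)
  .L1   ADD ci
  .S1   [i] B
  .D1   LDH ai
  .D2   LDH bi

Odd cycles (1, 3, 5, 7, 9)
  .L2   [i] SUB i
  .S2   SHR sum
  .M    MPY
  .D1   STH ci

(LDH ai also appears against .D1 in the odd-cycle half of the original chart.)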
126
2 Cycle Loop Kernel
[Scheduling chart as on the previous slide, shown as the 2-cycle loop kernel.]
127
What Requires Multi-Cycle Loops?
128
Live Too Long - Example
129
Live Too Long - Example
130
Live Too Long - Example
[Figure: timing diagram (cycles 0 to 6) showing when a0 and a1 (from LDH ai) and x0 (from the LDH and SHR path) become valid.]
131
Live Too Long - Example
[Figure: the same timing diagram with the ADD placed after x0 becomes valid.]
Oops, rather than adding a0 + x0 we got a1 + x0.
Let's look at one solution ...
132
Live Too Long - 2 Cycle Solution
133
Live Too Long - 2 Cycle Solution
Notice, a0 and x0 are both valid for 2
cycles, which is the length of the Iteration
Interval.
Adding them ...
134
Live Too Long - 2 Cycle Solution
Works! But what's the drawback?
A 2-cycle loop is slower.
Here's a better solution ...
135
Live Too Long - 1 Cycle Solution
Using a temporary register solves this problem
without increasing the Minimum Iteration Interval.
136
What Requires Multi-Cycle Loops?
137
Loop Carry Path
  • The loop carry path is a path which feeds one
    variable from part of the algorithm back to
    another.

e.g. Loop carry path = 3.
Note: The loop carry path is not the code loop.
138
Loop Carry Path, e.g. IIR Filter
IIR Filter Example: y0 = a0*x0 + b1*y1
139
IIR.SA
IIR Filter Example: y0 = a0*x0 + b1*y1

IIR:    ldh   *a_1, A1
        ldh   *x1, A3
        ldh   *b_1, B1
        ldh   *y0, B0          ; y1 is previous y0
        mpy   A1, A3, prod1
        mpy   B1, B0, prod2
        add   prod1, prod2, prod2
        sth   prod2, *y0
140
Loop Carry Path - IIR Example
IIR Filter Loop: y0 = a1*x1 + b1*y1
Min Iteration Interval:
  Resource = 2 (need 3 .D units)
  Loop Carry Path = 9 (9 = 5 + 2 + 1 + 1)
  therefore, MII = 9
Result carries over from one iteration of the
loop to the next.
Can it be minimized?
141
Loop Carry Path - IIR Example (Solution)
IIR Filter Loop: y0 = a1*x1 + b1*y1
Min Iteration Interval:
  Resource = 2 (need 3 .D units)
  New Loop Carry Path = 3 (3 = 2 + 1)
  therefore, MII = 3
Since y0 is stored in a CPU register, it can be
used directly by MPY (after the first loop
iteration).
142
Reminder Fixed-Point Dot-Product Example
Is there a loop carry path in this example?
Yes, but it's only 1.
Min Iteration Interval: Resource = 1, Loop Carry
Path = 1, so MII = 1.
For the fixed-point implementation, the Loop
Carry Path was not taken into account because it
is equal to 1.
143
Loop Carry Path
  • IIR Example.
  • Enhancing the IIR.
  • Fixed-Point Dot-Product Example.
  • Floating-Point Dot Product Example.

144
Loop Carry Path due to FUL > 1: Floating-Point
Dot-Product Example
Min Iteration Interval: Resource = 1, Loop Carry
Path = 4, so MII = 4.
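The examples above all combine the constraints the same way: the minimum iteration interval is the largest of the individual bounds, and a loop carry bound is the sum of the latencies along the carried path. A minimal Python sketch, with hypothetical function and parameter names and with the live-too-long and unit-latency bounds defaulting to 1:

```python
def loop_carry_bound(latencies):
    """Cycles needed for a result to travel round the loop carry path."""
    return sum(latencies)


def minimum_iteration_interval(resource, loop_carry,
                               live_too_long=1, unit_latency=1):
    """MII is set by whichever constraint is the most restrictive."""
    return max(resource, loop_carry, live_too_long, unit_latency)


# IIR through memory: load 5 + multiply 2 + add 1 + store 1.
iir_memory = loop_carry_bound([5, 2, 1, 1])
# IIR with y0 kept in a register: multiply 2 + add 1.
iir_register = loop_carry_bound([2, 1])

print(minimum_iteration_interval(2, iir_memory))
print(minimum_iteration_interval(2, iir_register))
print(minimum_iteration_interval(1, 1))   # fixed-point dot product
print(minimum_iteration_interval(1, 4))   # floating-point dot product
```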
145
Unrolling the Loop
If the MII must be four cycles long, then use all
of them to calculate four results.
146
ADDSP Pipeline (Staggered Results)
Cycle  Instruction           Result
0      ADDSP x0, sum, sum    sum = 0
1      ADDSP x1, sum, sum    sum = 0
2      ADDSP x2, sum, sum    sum = 0
3      ADDSP x3, sum, sum    sum = 0
4      ADDSP x4, sum, sum    sum = x0
5      ADDSP x5, sum, sum    sum = x1
6      ADDSP x6, sum, sum    sum = x2
7      ADDSP x7, sum, sum    sum = x3
8      ADDSP x8, sum, sum    sum = x0 + x4
9                            sum = x1 + x5
10                           sum = x2 + x6
11                           sum = x3 + x7
12                           sum = x0 + x4 + x8
  • ADDSP takes 4 cycles (three delay slots) to
    produce the result.

154
ADDSP Pipeline (Staggered Results)
Cycle  Instruction           Result
0      ADDSP x0, sum, sum    sum = 0
1      ADDSP x1, sum, sum    sum = 0
2      ADDSP x2, sum, sum    sum = 0
3      ADDSP x3, sum, sum    sum = 0
4      ADDSP x4, sum, sum    sum = x0
5      ADDSP x5, sum, sum    sum = x1
6      ADDSP x6, sum, sum    sum = x2
7      ADDSP x7, sum, sum    sum = x3
8      ADDSP x8, sum, sum    sum = x0 + x4
9      NOP                   sum = x1 + x5
10     NOP                   sum = x2 + x6
11     NOP                   sum = x3 + x7
12     NOP                   sum = x0 + x4 + x8
  • There are effectively four running sums

155
ADDSP Pipeline (Staggered Results)
  • There are effectively four running sums
  • These need to be combined after the last addition
    is complete...
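The staggered results can be reproduced with a short simulation. This Python sketch models an add whose result becomes visible 4 cycles after issue and issues one add per cycle into the same `sum` register; integers are used instead of single-precision floats so the values are exact. The function name `addsp_stream` is hypothetical.

```python
def addsp_stream(xs, latency=4):
    """Return the value visible in `sum` at each cycle."""
    sum_reg = 0
    pending = {}                 # arrival cycle -> value written to sum
    visible = []
    for cycle in range(len(xs) + latency):
        if cycle in pending:     # a result issued `latency` cycles ago lands
            sum_reg = pending.pop(cycle)
        visible.append(sum_reg)
        if cycle < len(xs):      # ADDSP x[cycle], sum, sum
            pending[cycle + latency] = xs[cycle] + sum_reg
    return visible


xs = list(range(1, 10))          # stands in for x0 to x8
history = addsp_stream(xs)
running_sums = history[-4:]      # the four interleaved partial sums
print(running_sums)
print(sum(running_sums))         # combining them gives the full total
```

With x0 to x8 set to 1 to 9, the last four visible values correspond to x1 + x5, x2 + x6, x3 + x7 and x0 + x4 + x8, as in the table.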

162
ADDSP Pipeline (Combining Results)
Cycle  Instruction             Result
0      ADDSP x0, sum, sum      sum = 0
1      ADDSP x1, sum, sum      sum = 0
2      ADDSP x2, sum, sum      sum = 0
3      ADDSP x3, sum, sum      sum = 0
4      ADDSP x4, sum, sum      sum = x0
5      ADDSP x5, sum, sum      sum = x1
6      ADDSP x6, sum, sum      sum = x2
7      ADDSP x7, sum, sum      sum = x3
8      ADDSP x8, sum, sum      sum = x0 + x4
9      MV sum, temp            sum = x1 + x5
10     ADDSP sum, temp, sum1   sum = x2 + x6, temp = x1 + x5
11     MV sum, temp            sum = x3 + x7
12     ADDSP sum, temp, sum2   sum = x0 + x4 + x8, temp = x3 + x7
13     NOP
14     NOP                     sum1 = x1 + x2 + x5 + x6
15     NOP
16     ADDSP sum1, sum2, sum   sum2 = x0 + x3 + x4 + x7 + x8
17     NOP
18     NOP
19     NOP
20     NOP                     sum = x0 + x1 + x2 + x3 + x4 + x5 + x6 + x7 + x8
163
What Requires Multi-Cycle Loops?
164
Simple FUL Example
3
5
MPYDP
prod
10 (4.9)
165
A Better Way to Diagram this ...
166
What Requires Multi-Cycle Loops?
  • 1. Resource Limitations.
  • 2. Live Too Long.
  • 3. Loop Carry Path.
  • 4. Double Precision (FUL > 1).

Lab: Converting your dot-product code to
Single-Precision Floating-Point.
167
  • Chapter 12
  • Software Optimisation
  • - End -