Title: OMSE 510: Computing Foundations 4: The CPU
1 OMSE 510 Computing Foundations 4: The CPU!
- Chris Gilmore <grimjack@cs.pdx.edu>
- Systems Software Lab
- Portland State University/OMSE
2 Today
- Caches
- DLX Assembly
- CPU Overview
3 Introduction to RISC
- Reduced Instruction Set Computer
- 1975: John Cocke, IBM 801
  - IBM started working on a RISC-type computer in 1975 without calling it by this name
  - Used as an I/O processor for IBM mainframes
- Patterson and Hennessy
  - RISC was first introduced by Patterson and Ditzel in 1980
  - Produced the first RISC chips in the early 1980s
  - RISC I and RISC II from Berkeley, and MIPS from Stanford
4 RISC Chips
- RISC II
  - Had 39 instructions, 2 addressing modes, and 3 data types: 234 combinations
  - Compared to the VAX: 304 instructions, 16 addressing modes, 14 data types: 68,096 combinations
- Found that
  - Compiled programs were ~30% larger than on CISC (VAX 11/780)
  - Ran up to 5 times faster than the 68000
  - Assembler-to-compiler ratio (execution time of the assembler program divided by the execution time of the compiled version): ratio < 50% for CISC, ~90% for RISC
5 RISC Definition
- 1. Single cycle operation
- 2. Load / store design
- 3. Hardwired control unit
- 4. Few instructions and addressing modes
- 5. Fixed instruction format
- 6. More compile time effort to avoid pipeline
penalties
6 Disadvantages of CISC
- Large, complicated, and time-consuming instruction set
- Complex CU to decode and execute
- Not necessarily faster than a sequence of several RISC instructions
- Complexity of the CISC CU
  - A large number of design errors
  - Longer design time
- Too large a choice for the compiler
  - Very difficult to design the optimal compiler
  - Does not always yield the most efficient code
- Specialized to fit certain HLL instructions
  - May be redundant for another HLL
- Relatively low cost/benefit factor
7 The Advantage of RISC
- RISC and VLSI realization
  - Relatively small and simple CU hardware: control occupies 6% of the chip in RISC I and 10% in RISC II, versus 68% in the MC68020
  - Higher chance of fitting other features on a chip
  - Can fit a large number of CPU registers
    - Enhances throughput and HLL support
  - Increases the regularization factor
8 The Advantage of RISC
- RISC and computing speed
  - Faster decoding: small instruction set, few addressing modes, fixed instruction format
  - Reduced memory accesses: a large number of CPU registers permits register-to-register operations
  - Faster parameter passing: register windows in RISC I and RISC II
  - Streamlined instruction handling
    - All instructions have the same length
    - All execute in one cycle
    - Suitable for pipelined implementation
9 The Advantage of RISC
- RISC and design costs and reliability
  - Shorter design time and reduced overall design costs
  - Reduces the probability that the end product will be obsolete
  - Reduced number of design errors
  - Virtual memory management enhancement: instructions do not cross word boundaries and cannot wind up on two separate pages
10 The Advantage of RISC
- RISC and HLL support
  - Shorter and simpler compiler: usually only a single choice rather than several choices as in CISC
  - Large number of CPU registers: more efficient code optimization
  - Fast parameter passing between procedures (register windows)
  - Reduced burden on the compiler writer
11 The Disadvantage and Criticism of RISC (80s)
- RISC code tends to be longer
  - Extra burden on the machine and assembly language programmer
  - Several instructions required per single CISC instruction
  - More memory locations for their storage
- Floating-point support and VMM support were weak points
12 RISC Characteristics
- Pipelined operation
- Compiler responsible for pipeline conflict resolution
  - Delayed branch
  - Delayed load
13 Question 1: Why do microcoding?
- If simple instructions could execute at a very high clock rate
- If you could even write compilers to produce microinstructions
- If most programs use simple instructions and addressing modes
- If microcode is kept in RAM instead of ROM so as to fix bugs
- If the same memory used for control memory could be used instead as a cache for macroinstructions
- Then why not skip instruction interpretation by a microprogram and simply compile directly into the lowest language of the machine? (Microprogramming is overkill when the ISA matches the datapath 1-1.)
14 Pipelining is Natural!
- Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
15 Sequential Laundry
(Figure: timeline from 6 PM to midnight; each load uses the washer for 30 minutes, the dryer for 40, and the folder for 20, one load strictly after another.)
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
16 Pipelined Laundry: Start work ASAP
(Figure: same timeline, but the loads overlap: load 2 enters the washer while load 1 is in the dryer, and so on.)
- Pipelined laundry takes 3.5 hours for 4 loads
17 Pipelining Lessons
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously using different resources
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
- Stall for dependences
18 Execution Cycle
Obtain instruction from program storage
Determine required actions and instruction size
Locate and obtain operand data
Compute result value or status
Deposit results in storage for later use
Determine successor instruction
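Those six steps are exactly the fetch-decode-execute loop. As a minimal sketch, here is a software interpreter for a made-up toy ISA in C (every opcode and encoding below is invented purely for illustration):

#include <stdint.h>
#include <stdio.h>

enum { OP_ADD = 0, OP_LOAD = 1, OP_HALT = 2 };   /* hypothetical toy ISA */

int main(void) {
    uint16_t mem[256] = {0};
    uint16_t reg[8]   = {0};
    uint16_t pc = 0;

    /* Toy program: r1 = mem[100]; r2 = r1 + r1; halt */
    mem[100] = 21;
    mem[0] = (uint16_t)((OP_LOAD << 12) | (1 << 8) | 100);          /* load r1,[100] */
    mem[1] = (uint16_t)((OP_ADD  << 12) | (2 << 8) | (1 << 4) | 1); /* add r2,r1,r1  */
    mem[2] = (uint16_t)(OP_HALT << 12);

    for (;;) {
        uint16_t ir = mem[pc];                   /* 1. obtain instruction          */
        uint16_t op = ir >> 12;                  /* 2. determine required actions  */
        uint16_t rd = (ir >> 8) & 7;
        if (op == OP_HALT) break;
        if (op == OP_LOAD) {
            uint16_t addr = ir & 0xFF;           /* 3. locate and obtain operand   */
            reg[rd] = mem[addr];                 /* 4-5. compute and deposit       */
        } else {                                 /* OP_ADD */
            uint16_t rs = (ir >> 4) & 7, rt = ir & 7;
            reg[rd] = (uint16_t)(reg[rs] + reg[rt]);
        }
        pc++;                                    /* 6. determine successor instr   */
    }
    printf("r2 = %u\n", reg[2]);                 /* prints 42 */
    return 0;
}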
19 The Five Stages of Load
(Figure: a Load occupies one pipeline stage per cycle, cycles 1 through 5.)
- Ifetch (Instruction Fetch): fetch the instruction from the instruction memory
- Reg/Dec: register fetch and instruction decode
- Exec: calculate the memory address
- Mem: read the data from the data memory
- Wr: write the data back to the register file
20 Note: These 5 stages were there all along!
Fetch
Decode
Execute
Memory
Write-back
21 Pipelining
- Improve performance by increasing throughput
- Ideal speedup is the number of stages in the pipeline. Do we achieve this?
22 Basic Idea
- What do we need to add to split the datapath into stages?
23 Graphically Representing Pipelines
- Can help with answering questions like:
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
24 Conventional Pipelined Execution Representation
(Figure: pipeline stages versus time, one row per instruction in program order.)
25 Single Cycle, Multiple Cycle, vs. Pipeline
(Figure: clock timing for the three implementations. Single-cycle: one long cycle per instruction, sized for the slowest instruction (Load), so a Store wastes part of the cycle. Multiple-cycle: Load, Store, and R-type each take several short cycles (cycles 1-10 shown). Pipeline: Load, Store, and R-type overlap, with a new instruction starting every cycle.)
26 Why Pipeline?
- Suppose we execute 100 instructions
- Single-cycle machine
  - 45 ns/cycle x 1 CPI x 100 inst = 4500 ns
- Multicycle machine
  - 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns
- Ideal pipelined machine
  - 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns
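A quick check of that arithmetic in C (the numbers are the slide's own; nothing else is assumed):

#include <stdio.h>

int main(void) {
    int n = 100;                              /* instructions */
    double single = 45.0 * 1.0 * n;           /* one 45 ns cycle per instruction   */
    double multi  = 10.0 * 4.6 * n;           /* 4.6 cycles of 10 ns on average    */
    double piped  = 10.0 * (1.0 * n + 4);     /* 1 CPI plus 4 cycles to drain      */
    printf("single=%.0f ns  multi=%.0f ns  pipelined=%.0f ns\n",
           single, multi, piped);             /* 4500, 4600, 1040 */
    return 0;
}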
27 Why Pipeline? Because we can!
(Figure: Inst 0 through Inst 4 shown against clock cycles, each instruction starting one cycle after the previous one.)
28 Can pipelining get us into trouble?
- Yes: pipeline hazards
  - Structural hazards: attempt to use the same resource two different ways at the same time
    - E.g., a combined washer/dryer would be a structural hazard, as would the folder being busy doing something else (watching TV)
  - Control hazards: attempt to make a decision before the condition is evaluated
    - E.g., washing football uniforms and needing to get the proper detergent level: we must see the result after the dryer before the next load can go in
    - Branch instructions
  - Data hazards: attempt to use an item before it is ready
    - E.g., one sock of a pair is in the dryer and one is in the washer: we can't fold until the sock gets from the washer through the dryer
    - An instruction depends on the result of a prior instruction still in the pipeline
- Can always resolve hazards by waiting
  - Pipeline control must detect the hazard
  - And take action (or delay action) to resolve it
29 Single Memory is a Structural Hazard
(Figure: Load followed by Instr 1-4 in the pipeline; in one cycle the Load's Mem stage and a later instruction's fetch both need the single memory.)
- Detection is easy in this case! (Right-half highlight means read, left-half means write.)
30 Structural Hazards limit performance
- Example: if 1.3 memory accesses per instruction and only one memory access per cycle, then
  - average CPI >= 1.3
  - otherwise the memory resource is more than 100% utilized
31 Control Hazard Solution 1: Stall
- Stall: wait until the decision is clear
- Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow
- Move the decision to the end of decode
  - Saves 1 cycle per branch
32 Control Hazard Solution 2: Predict
- Predict: guess one direction, then back up if wrong
- Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ~50% of the time)
- Need to "squash" and restart the following instruction if wrong
- Produces a CPI on branches of (1 x 0.5 + 2 x 0.5) = 1.5
- Total CPI might then be 1.5 x 0.2 + 1 x 0.8 = 1.1 (20% branches)
- More dynamic scheme: keep a history per branch (~90% right)
33 Control Hazard Solution 3: Delayed Branch
- Delayed branch: redefine branch behavior (the branch takes place after the next instruction)
- Impact: 0 clock cycles per branch instruction if we can find an instruction to put in the slot (~50% of the time)
- As we launch more instructions per clock cycle, less useful
34 Delayed/Predicted Branch
- Where to get instructions to fill the branch delay slot?
  - From before the branch instruction
  - From the target address: only valuable when the branch is taken
  - From the fall-through: only valuable when the branch is not taken
  - Cancelling branches allow more slots to be filled
- Compiler effectiveness for a single branch delay slot
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - So about 50% (60% x 80%) of slots are usefully filled
- Delayed-branch downside: 7-8 stage pipelines and multiple instructions issued per clock (superscalar)
35 Data Hazard on r1
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
36 Data Hazard on r1
- Dependences backwards in time are hazards
(Figure: add r1,r2,r3 writes r1 in WB; sub r4,r1,r3, and r6,r1,r7, and or r8,r1,r9 read r1 in ID during earlier cycles, so their reads happen before the write; xor r10,r1,r11 reads r1 after the write and is safe. Stages shown: IF, ID/RF, EX, MEM, WB.)
37 Data Hazard Solution
- Forward the result from one stage to another
- The "or" is OK if we define register read/write properly (write in the first half of the cycle, read in the second half)
(Figure: same code as the previous slide, with forwarding paths carrying add's result from its EX/MEM stages straight into the ALU inputs of sub and and.)
38 Forwarding (or Bypassing): What about Loads?
- Dependences backwards in time are hazards
- Can't solve these with forwarding alone
- Must delay/stall the instruction dependent on the load
(Figure: lw r1,0(r2) produces r1 at the end of its MEM stage, but sub r4,r1,r3 needs it at the start of its EX stage one cycle earlier.)
39 Forwarding (or Bypassing): What about Loads?
- Dependences backwards in time are hazards
- Can't solve these with forwarding alone
- Must delay/stall the instruction dependent on the load
(Figure: the same code with a one-cycle stall inserted between lw r1,0(r2) and sub r4,r1,r3, after which forwarding covers the remaining distance.)
40 Detecting Control Signals
41 Conflicts/Problems
- The I-cache and D-cache are accessed in the same cycle: it helps to implement them separately
- Registers are read and written in the same cycle: easy to deal with if register read/write time equals cycle time/2 (else, use bypassing)
- The branch target changes only at the end of the second stage: what do you do in the meantime?
- Data between stages get latched into registers (overhead that increases latency per instruction)
42 Control Hazards
- Simple techniques to handle control hazard stalls:
  - For every branch, introduce a stall cycle (note: every 6th instruction is a branch!)
  - Assume the branch is not taken and start fetching the next instruction; if the branch is taken, we need hardware to cancel the effect of the wrong-path instruction
  - Fetch the next instruction (branch delay slot) and execute it anyway; if the instruction turns out to be on the correct path, useful work was done; if the instruction turns out to be on the wrong path, hopefully program state is not lost
43 Slowdowns from Stalls
- Perfect pipelining with no hazards: an instruction completes every cycle (total cycles = num instructions) => speedup = increase in clock speed = num pipeline stages
- With hazards and stalls, some cycles (= stall time) go by during which no instruction completes, and then the stalled instruction completes
- Total cycles = number of instructions + stall cycles
- Slowdown because of stalls = 1 / (1 + stall cycles per instr); a small numeric sketch follows
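A minimal C illustration of those two formulas; the 20%-branches-with-one-stall-each mix is an assumed example, not a number from the slides:

#include <stdio.h>

int main(void) {
    double instructions = 100.0;
    double stalls_per_instr = 0.2;   /* assumed: 20% branches, 1 stall cycle each */
    double total_cycles = instructions + stalls_per_instr * instructions;
    double fraction_of_ideal = 1.0 / (1.0 + stalls_per_instr);
    printf("total cycles = %.0f, speed relative to ideal = %.2f\n",
           total_cycles, fraction_of_ideal);   /* 120 cycles, 0.83 */
    return 0;
}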
44 Control and Datapath: Split state diagram into 5 pieces
- Fetch: IR <- Mem[PC]; PC <- PC + 4
- Decode: A <- R[rs]; B <- R[rt]
- Execute: S <- A + B, or S <- A + SX, or S <- A or ZX; for branches: if Cond, PC <- PC + SX
- Memory: M <- Mem[S], or Mem[S] <- B
- Write-back: R[rd] <- S, or R[rd] <- M, or R[rt] <- S
(Figure: datapath with PC, Next PC, Inst. Mem, IR, Reg. File, Exec, Mem Access, Data Mem, and Equal blocks.)
45 Three Generic Data Hazards
- InstrI followed by InstrJ
- Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it
46 Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Read (WAR): InstrJ tries to write an operand before InstrI reads it
  - InstrI gets the wrong operand
- Can't happen in the DLX 5-stage pipeline because
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
47 Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Write (WAW): InstrJ tries to write an operand before InstrI writes it
  - Leaves the wrong result (InstrI's, not InstrJ's)
- Can't happen in the DLX 5-stage pipeline because
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Can have WAR and WAW in more complicated pipes
48 Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd

In the fast version no load is immediately followed by a use of its result, so the load delay slots are filled with useful work instead of stalls.
49 Summary: Pipelining
- Reduce CPI by overlapping many instructions
  - Average throughput of approximately 1 CPI with a fast clock
- Utilize capabilities of the datapath
  - Start the next instruction while working on the current one
  - Limited by the length of the longest stage (plus fill/flush)
  - Detect and resolve hazards
- What makes it easy?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
50 Some Issues for your consideration
- Won't be tested
- We'll talk about modern processors and what's really hard
  - Exception handling
  - Trying to improve performance with out-of-order execution, etc.
  - Trying to get CPI < 1 (superscalar execution)
51 Superscalar Execution
- Throwing more hardware at the problem
- Instruction level parallelism (ILP)
- Multiple functional units
  - E.g., multiple ALUs
  - Add a, b, c
  - Add d, e, f
  - The two adds are independent, so they can issue together
- Can get CPI < 1!
52 Out-of-order execution
- Idea: it's best if we keep all functional units busy
- Can sometimes reorder computations to take advantage of functional units that are otherwise idle
- Automatically do reordering like we did 4 slides ago!
53 Register Renaming
- Internally rename registers to allow for better ILP
  - Add a, b, c
  - Add b, c, d
  - The second add writes b while the first still needs to read it (a WAR hazard); renaming the second b removes the dependence (a C sketch follows)
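A minimal C analogue of that renaming, with a, b, c, d standing in for registers (the variable names are illustrative only):

#include <stdio.h>

int a, b = 1, c = 2, d = 3;

/* Before renaming: WAR hazard on b between the two statements. */
void before(void) {
    a = b + c;    /* reads b             */
    b = c + d;    /* writes b: WAR on b  */
}

/* After renaming: the second writer gets a fresh name, b2, so the
   two statements no longer conflict and may execute in parallel. */
int after(void) {
    int b2;
    a = b + c;
    b2 = c + d;   /* later consumers of "b" are pointed at b2 */
    return b2;
}

int main(void) {
    printf("%d\n", after());  /* prints 5 */
    return 0;
}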
54 Hyperthreading/Multicore
- Hyperthreading
  - >1 virtual CPU per physical CPU
- Multi-core
  - >1 actual CPU per die
55 Integrated Circuits Costs
IC cost = (Die cost + Testing cost + Packaging cost) / Final test yield
Die cost = Wafer cost / (Dies per wafer x Die yield)
Dies per wafer = pi x (Wafer_diam / 2)^2 / Die_Area - pi x Wafer_diam / sqrt(2 x Die_Area) - Test dies
Die yield = Wafer yield x (1 + Defects_per_unit_area x Die_Area / alpha)^(-alpha)
- Die cost goes roughly with (die area)^4; a sketch of the model follows
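A small C sketch of that cost model. The wafer diameter (15 cm) and the clustering parameter alpha = 3 are assumptions used for illustration; they are not stated on the slide, but they roughly reproduce the 386DX row of the next slide:

#include <math.h>
#include <stdio.h>

static const double PI = 3.14159265358979;

double dies_per_wafer(double wafer_diam_cm, double die_area_cm2, double test_dies) {
    double r = wafer_diam_cm / 2.0;
    return PI * r * r / die_area_cm2
         - PI * wafer_diam_cm / sqrt(2.0 * die_area_cm2)
         - test_dies;
}

double die_yield(double wafer_yield, double defects_per_cm2,
                 double die_area_cm2, double alpha) {
    return wafer_yield * pow(1.0 + defects_per_cm2 * die_area_cm2 / alpha, -alpha);
}

int main(void) {
    /* 386DX: 0.43 cm2 die, 1.0 defects/cm2, $900 wafer (see next slide) */
    double dpw  = dies_per_wafer(15.0, 0.43, 0.0);   /* ~360  */
    double dy   = die_yield(1.0, 1.0, 0.43, 3.0);    /* ~0.67 */
    double cost = 900.0 / (dpw * dy);                /* ~$3.7, i.e. ~$4 */
    printf("dies/wafer=%.0f yield=%.2f die cost=$%.2f\n", dpw, dy, cost);
    return 0;
}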
56 Real World Examples
Chip          Metal   Line width  Wafer     Defects  Area    Dies/  Yield  Die
              layers  (um)        cost ($)  /cm2     (mm2)   wafer  (%)    cost ($)
386DX           2       0.90        900      1.0       43     360    71       4
486DX2          3       0.80       1200      1.0       81     181    54      12
PowerPC 601     4       0.80       1700      1.3      121     115    28      53
HP PA 7100      3       0.80       1300      1.0      196      66    27      73
DEC Alpha       3       0.70       1500      1.2      234      53    19     149
SuperSPARC      3       0.70       1700      1.6      256      48    13     272
Pentium         3       0.80       1500      1.5      296      40     9     417
- From "Estimating IC Manufacturing Costs," by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
57 Midterm Questions
- Examples
  - List and describe 3 types of DRAM
  - What are the relative advantages/disadvantages of RISC vs. CISC?
  - Why do we have a memory hierarchy?
  - Using your choice of assembly, write a (commented) routine that computes the nth Fibonacci number.
  - Why do CPUs have registers?
  - Describe how a 3-disk RAID-5 system works
58 Midterm Questions
- More examples
  - What are the differences between a synchronous and an asynchronous bus? What are the relative advantages/disadvantages?
  - List and describe techniques to improve cache miss rate, reduce cache miss penalty, and reduce cache hit time
59 Topics for further study
- The following slides will not be covered in class or on tests.
60 Multicycle Instructions
61 Effects of Multicycle Instructions
- Structural hazards if the unit is not fully pipelined (divider)
- Frequent RAW hazard stalls
- Potentially multiple writes to the register file in a cycle
- WAW hazards because of out-of-order instr completion
- Imprecise exceptions because of o-o-o instr completion
62 Precise Exceptions
- On an exception:
  - Must save the PC of the instruction where the program must resume
  - All instructions after that PC that might be in the pipeline must be converted to NOPs (other instructions continue to execute and may raise exceptions of their own)
  - Temporary program state not in memory (in other words, registers) has to be stored in memory
  - Potential problems if a later instruction has already modified memory or registers
- A processor that fulfils all the above conditions is said to provide precise exceptions (useful for debugging and, of course, correctness)
63 Dealing with these Effects
- Multiple writes to the register file: increase the number of ports, stall one of the writers during ID, or stall one of the writers during WB (the stall will propagate)
- WAW hazards: detect the hazard during ID and stall the later instruction
- Imprecise exceptions: buffer the results if they complete early, or save more pipeline state so that you can return to exactly the same state that you left at
64 ILP
- Instruction-level parallelism: overlap among instructions: pipelining or multiple instruction execution
- What determines the degree of ILP?
  - Dependences: property of the program
  - Hazards: property of the pipeline
65 Types of Dependences
- Data dependences: an instr produces a result for another (true dependence; results in RAW hazards in a pipeline)
- Name dependences: two instrs use the same names (anti- and output dependences; result in WAR and WAW hazards in a pipeline)
- Control dependences: an instruction's execution depends on the result of a branch; re-ordering should preserve exception behavior and dataflow
- (A C illustration of the three classes follows)
66 An Out-of-Order Processor Implementation
(Figure: branch prediction and instr fetch feed an instr fetch queue; decode & rename sends Instr 1-6 into the reorder buffer (ROB) with temporaries T1-T6 alongside the register file R1-R32; renamed instructions wait in the issue queue (IQ) for one of three ALUs; results are written to the ROB and tags broadcast to the IQ.)

Original code:        Renamed code:
R1 <- R1 + R2         T1 <- R1 + R2
R2 <- R1 + R3         T2 <- T1 + R3
BEQZ R2               BEQZ T2
R3 <- R1 + R2         T4 <- T1 + T2
R1 <- R3 + R2         T5 <- T4 + T2
67 Design Details - I
- Instructions enter the pipeline in order
- No need for branch delay slots if prediction happens in time
- Instructions leave the pipeline in order: all instructions that enter also get placed in the ROB; the process of an instruction leaving the ROB (in order) is called commit; an instruction commits only if it and all instructions before it have completed successfully (without an exception)
- To preserve precise exceptions, a result is written into the register file only when the instruction commits; until then, the result is saved in a temporary register in the ROB
68 Design Details - II
- Instructions get renamed and placed in the issue queue; some operands are available (T1-T6; R1-R32), while others are being produced by instructions in flight (T1-T6)
- As instructions finish, they write results into the ROB (T1-T6) and broadcast the operand tag (T1-T6) to the issue queue; instructions now know if their operands are ready
- When a ready instruction issues, it reads its operands from T1-T6 and R1-R32 and executes (out-of-order execution)
- Can you have WAW or WAR hazards? By using more names (T1-T6), name dependences can be avoided
69 Design Details - III
- If instr-3 raises an exception, wait until it reaches the top of the ROB; at this point, R1-R32 contain results for all instructions up to instr-3; save registers, save the PC of instr-3, and service the exception
- If a branch is a mispredict, flush all instructions after the branch and start on the correct path; mispredicted instrs will not have updated registers (the branch cannot commit until it has completed, and the flush happens as soon as the branch completes)
- Potential problems?
70 Managing Register Names
Temporary values are stored in the register file and not the ROB
- Logical registers: R1-R32
- Physical registers: P1-P64
- At the start, R1-R32 can be found in P1-P32; instructions stop entering the pipeline when P64 is assigned

R1 <- R1 + R2         P33 <- P1 + P2
R2 <- R1 + R3         P34 <- P33 + P3
BEQZ R2               BEQZ P34
R3 <- R1 + R2         P35 <- P33 + P34

What happens on commit?
71 The Commit Process
- On commit, no copy is required
- The register map table is updated: the committed value of R1 is now in P33 and not P1; on an exception, P33 is copied to memory and not P1
- An instruction in the issue queue need not modify its input operand when the producer commits
- When instruction-1 commits, we no longer have any use for P1; it is put in a free pool and a new instruction can now enter the pipeline => for every instr that commits, a new instr can enter the pipeline => the number of in-flight instrs is a constant = the number of extra (rename) registers
72 The Alpha 21264 Out-of-Order Implementation
(Figure: same organization as the earlier out-of-order pipeline, but with a register map table (R1 -> P1, R2 -> P2, ...) at decode & rename and a physical register file P1-P64; results are written to the regfile and tags broadcast to the IQ.)

Original code:        Renamed code:
R1 <- R1 + R2         P33 <- P1 + P2
R2 <- R1 + R3         P34 <- P33 + P3
BEQZ R2               BEQZ P34
R3 <- R1 + R2         P35 <- P33 + P34
R1 <- R3 + R2         P36 <- P35 + P34
73 Lecture 11: Advanced Static ILP
- Topics: loop unrolling, software pipelining (Section 4.4)
74 Loop Dependences
- If a loop only has dependences within an iteration, the loop is considered parallel => multiple iterations can be executed together so long as the order within an iteration is preserved
- If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as loop-carried
- Not all loop-carried dependences imply a lack of parallelism
- Parallel loops are especially desirable in a multiprocessor system
75 Examples

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

No dependences

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

S2 depends on S1 in the same iteration; S1 depends on S1 from the prev iteration; S2 depends on S2 from the prev iteration

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the prev iteration

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i-3] + s;         /* S1 */

S1 depends on S1 from 3 prev iterations; referred to as a recursion; dependence distance 3 => limited parallelism
76 Finding Dependences: the GCD Test
- Do A[a*i + b] and A[c*i + d] refer to the same element?
- Restrict ourselves to affine array indices (expressible as a*i + b, where i is the loop index and a and b are constants); example of a non-affine index: x[y[i]]
- For a dependence to exist, there must be two indices j and k, both within the loop bounds, such that
  - a*j + b = c*k + d
  - a*j - c*k = d - b
  - Let G = GCD(a,c)
  - Then (a*j/G - c*k/G) = (d-b)/G
- If (d-b)/G is not an integer, the initial equality cannot be true (a C sketch follows)
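A minimal sketch of the test in C; like the test itself, it ignores the loop bounds and is conservative (it can only prove independence, never dependence):

#include <stdio.h>

/* Greatest common divisor by Euclid's algorithm. */
static int gcd(int a, int b) {
    while (b != 0) { int t = a % b; a = b; b = t; }
    return a;
}

/* GCD test for A[a*i+b] vs A[c*i+d]: returns 0 when a dependence is
   provably impossible, 1 when one may exist. */
static int may_depend(int a, int b, int c, int d) {
    return (d - b) % gcd(a, c) == 0;
}

int main(void) {
    printf("%d\n", may_depend(2, 3, 2, 0)); /* A[2i+3] vs A[2i]: prints 0   */
    printf("%d\n", may_depend(4, 1, 2, 3)); /* A[4i+1] vs A[2i+3]: prints 1 */
    return 0;
}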
77 Static vs. Dynamic ILP

Original loop:
Loop:  L.D    F0, 0(R1)      ; F0 = array element
       ADD.D  F4, F0, F2     ; add scalar in F2
       S.D    F4, 0(R1)      ; store result
       DADDUI R1, R1, -8     ; decrement address pointer
       BNE    R1, R2, Loop   ; branch if R1 != R2

Statically unrolled loop:
Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       ADD.D  F4, F0, F2
       ADD.D  F8, F6, F2
       ADD.D  F12, F10, F2
       ADD.D  F16, F14, F2
       S.D    F4, 0(R1)
       S.D    F8, -8(R1)
       DADDUI R1, R1, -32
       S.D    F12, 16(R1)
       BNE    R1, R2, Loop
       S.D    F16, 8(R1)

Large-window dynamic out-of-order processor (the original loop, as fetched repeatedly):
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R1, R1, -8
BNE    R1, R2, Loop
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R1, R1, -8
BNE    R1, R2, Loop
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
..
78 Dynamic ILP

Instruction stream as fetched (the same five instructions repeat; four iterations shown):
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R1, R1, -8
BNE    R1, R2, Loop
(... three more identical iterations ...)

Renamed:
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
DADDUI R3, R1, -8
BNE    R3, R2, Loop
L.D    F6, 0(R3)
ADD.D  F8, F6, F2
S.D    F8, 0(R3)
DADDUI R4, R3, -8
BNE    R4, R2, Loop
L.D    F10, 0(R4)
ADD.D  F12, F10, F2
S.D    F12, 0(R4)
DADDUI R5, R4, -8
BNE    R5, R2, Loop
L.D    F14, 0(R5)
ADD.D  F16, F14, F2
S.D    F16, 0(R5)
DADDUI R6, R5, -8
BNE    R6, R2, Loop
79 Dynamic ILP
(Same renamed instruction stream as the previous slide.)

Cycle of issue for each instruction, per iteration (L.D, ADD.D, S.D, DADDUI, BNE):
Iteration 1:  1  3  6  1  3
Iteration 2:  2  4  7  2  4
Iteration 3:  3  5  8  3  5
Iteration 4:  4  6  9  4  6
80 Loop Pipeline
(Figure: the five instructions of each iteration (L.D, ADD.D, S.D, DADDUI, BNE) overlapped across iterations, each iteration starting one cycle after the previous one, forming a pipeline of loop iterations.)
81 Statically Unrolled Loop
Loop:  L.D    F0, 0(R1)
       L.D    F6, -8(R1)
       L.D    F10, -16(R1)
       L.D    F14, -24(R1)
       L.D    F18, -32(R1)
       ADD.D  F4, F0, F2
       L.D    F22, -40(R1)
       ADD.D  F8, F6, F2
       L.D    F26, -48(R1)
       ADD.D  F12, F10, F2
       L.D    F30, -56(R1)
       ADD.D  F16, F14, F2
       L.D    F34, -64(R1)
       ADD.D  F20, F18, F2
       S.D    F4, 0(R1)
       L.D    F38, -72(R1)
       ADD.D  F24, F22, F2
       S.D    F8, -8(R1)
       S.D    F12, 16(R1)
       S.D    F16, 8(R1)
       DADDUI R1, R1, -32
       S.D
       BNE    R1, R2, Loop
       S.D
82 Static vs. Dynamic
(Figure: two plots of new iterations completed over time. Dynamic ILP: after a ramp-up, one new iteration completes every cycle. Static ILP: within each unrolled loop body an iteration completes every cycle for a while, with start-up and wind-down gaps between bodies.)
- What if I doubled the number of resources in each processor?
- What if I unrolled the loop and executed it on a dynamic ILP processor?
83 Static vs. Dynamic
- Dynamic: because of the loop index, at most one iteration can start every cycle; even fewer if there are resource constraints; in other words, we have a pipeline that has a throughput of one iteration per cycle!
- Static: by eliminating the loop index, each iteration is independent => as many loops can start in a cycle as there are resources; however, after a while, we don't start any more iterations; thus, loop unrolling provides a brief steady state, where an iteration starts/finishes every cycle, and the rest is start-up/wind-down for each unrolled loop
84 Software Pipeline?!
(Figure: the same overlapped schedule as the Loop Pipeline slide; reading one cycle vertically, an L.D, an ADD.D, an S.D, a DADDUI, and a BNE from different iterations execute together: the loop body itself can be rewritten to contain exactly that mix.)
85 Software Pipelining

Original loop:
Loop:  L.D    F0, 0(R1)
       ADD.D  F4, F0, F2
       S.D    F4, 0(R1)
       DADDUI R1, R1, -8
       BNE    R1, R2, Loop

Software-pipelined loop:
Loop:  S.D    F4, 16(R1)   ; store for an earlier iteration
       ADD.D  F4, F0, F2   ; add for the next element
       L.D    F0, 0(R1)    ; load for a later iteration
       DADDUI R1, R1, -8
       BNE    R1, R2, Loop

- Advantages: achieves nearly the same effect as loop unrolling, but without the code expansion; an unrolled loop may have inefficiencies at the start and end of each iteration, while a sw-pipelined loop is almost always in steady state; a sw-pipelined loop can also be unrolled to reduce loop overhead
- Disadvantages: does not reduce loop overhead; may require more registers
86 Loop Dependences
- If a loop only has dependences within an iteration, the loop is considered parallel => multiple iterations can be executed together so long as the order within an iteration is preserved
- If a loop has dependences across iterations, it is not parallel, and these dependences are referred to as loop-carried
- Not all loop-carried dependences imply a lack of parallelism
- Parallel loops are especially desirable in a multiprocessor system
87 Examples

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i] + s;

No dependences

for (i = 1; i <= 100; i = i + 1) {
    A[i+1] = A[i] + C[i];      /* S1 */
    B[i+1] = B[i] + A[i+1];    /* S2 */
}

S2 depends on S1 in the same iteration; S1 depends on S1 from the prev iteration; S2 depends on S2 from the prev iteration

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the prev iteration

for (i = 1000; i > 0; i = i - 1)
    x[i] = x[i-3] + s;         /* S1 */

S1 depends on S1 from 3 prev iterations; referred to as a recursion; dependence distance 3 => limited parallelism
88 Constructing Parallel Loops
If loop-carried dependences are not cyclic (S1 depending on S1 is cyclic), loops can be restructured to be parallel

for (i = 1; i <= 100; i = i + 1) {
    A[i] = A[i] + B[i];        /* S1 */
    B[i+1] = C[i] + D[i];      /* S2 */
}

S1 depends on S2 from the prev iteration

A[1] = A[1] + B[1];
for (i = 1; i <= 99; i = i + 1) {
    B[i+1] = C[i] + D[i];      /* S3 */
    A[i+1] = A[i+1] + B[i+1];  /* S4 */
}
B[101] = C[100] + D[100];

S4 depends on S3 of the same iteration
89 Finding Dependences: the GCD Test
- Do A[a*i + b] and A[c*i + d] refer to the same element?
- Restrict ourselves to affine array indices (expressible as a*i + b, where i is the loop index and a and b are constants); example of a non-affine index: x[y[i]]
- For a dependence to exist, there must be two indices j and k, both within the loop bounds, such that
  - a*j + b = c*k + d
  - a*j - c*k = d - b
  - Let G = GCD(a,c)
  - Then (a*j/G - c*k/G) = (d-b)/G
- If (d-b)/G is not an integer, the initial equality cannot be true
90 Predication
- A branch within a loop can be problematic to schedule
- Control dependences are a problem because of the need to re-fetch on a mispredict
- For short loop bodies, control dependences can be converted to data dependences by using predicated/conditional instructions
91 Predicated or Conditional Instructions
- The instruction has an additional operand that determines whether the instr completes or gets converted into a no-op
- Example: lwc R1, 0(R2), R3 (load-word-conditional) will load the word at address (R2) into R1 if R3 is non-zero; if R3 is zero, the instruction becomes a no-op
- Replaces a control dependence with a data dependence (branches disappear); may need register copies for the condition or for values used by both directions (a C sketch follows)

if (R1 == 0)              R7 = !R1 ; R8 = R2
    R2 = R2 + R4          R2 = R2 + R4  (predicated on R7)
else                      R6 = R3 + R5  (predicated on R1)
    R6 = R3 + R5          R4 = R8 + R3  (predicated on R1)
R4 = R2 + R3
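The same if-conversion written out in C with explicit predicates; r1..r8 are plain variables standing in for registers, and the final line for the R7 path is implied by the original code but not shown on the slide:

int r1, r2, r3, r4, r5, r6;

void if_convert(void) {
    int r7 = !r1;              /* predicate for the then-path           */
    int r8 = r2;               /* copy of the old r2 for the else-path  */
    if (r7) r2 = r2 + r4;      /* R2 = R2 + R4 (predicated on R7)       */
    if (r1) r6 = r3 + r5;      /* R6 = R3 + R5 (predicated on R1)       */
    if (r1) r4 = r8 + r3;      /* R4 = R8 + R3 (predicated on R1)       */
    if (r7) r4 = r2 + r3;      /* then-path R4; implied, not on slide   */
}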
92 Complications
- Each instruction has one more input operand: more register ports/bypassing
- If the branch condition is not known, the instruction stalls (remember, these are in-order processors)
- Some implementations allow the instruction to continue without the branch condition and squash/complete later in the pipeline: wasted work
- Increases register pressure and activity on the functional units
- Does not help if the br-condition takes a while to evaluate
93 Support for Speculation
- In general, when we re-order instructions, register renaming can ensure we do not violate register data dependences
- However, we need hardware support
  - to ensure that an exception is raised at the correct point
  - to ensure that we do not violate memory dependences
(Figure: a sequence st ... br ... ld, where the ld may be hoisted above the br and the st.)
94 Detecting Exceptions
- Some exceptions require that the program be terminated (memory protection violation), while other exceptions require execution to resume (page faults)
- For a speculative instruction, in the latter case, servicing the exception only implies potential performance loss
- In the former case, you want to defer servicing the exception until you are sure the instruction is not speculative
- Note that a speculative instruction needs a special opcode to indicate that it is speculative
95 Program-Terminate Exceptions
- When a speculative instruction experiences an exception, instead of servicing it, it writes a special NotAThing value (NAT) in the destination register
- If a non-speculative instruction reads a NAT, it flags the exception and the program terminates (it may not be desirable that the error is caused by an array access, but the core-dump happens two procedures later)
- Alternatively, an instruction (the sentinel) in the speculative instruction's original location checks the register value and initiates recovery
96 Memory Dependence Detection
- If a load is moved before a preceding store, we must ensure that the store writes to a non-conflicting address; else, the load has to re-execute
- When the speculative load issues, it stores its address in a table (the Advanced Load Address Table, ALAT, in the IA-64)
- If a store finds its address in the ALAT, it indicates that a violation occurred for that address
- A special instruction (the sentinel) in the load's original location checks to see if the address had a violation and re-executes the load if necessary (a toy sketch follows)
97 Dynamic vs. Static ILP
- Static ILP:
  - (+) The compiler finds parallelism => no scoreboarding => higher clock speeds and lower power
  - (+) The compiler knows what is next => better global schedule
  - (-) The compiler can not react to dynamic events (cache misses)
  - (-) Can not re-order instructions unless you provide hardware and extra instructions to detect violations (eats into the low complexity/power argument)
  - (-) Static branch prediction is poor => even statically scheduled processors use hardware branch predictors
  - (-) Building an optimizing compiler is easier said than done
- A comparison of the Alpha, Pentium 4, and Itanium (statically scheduled IA-64 architecture) shows that the Itanium is not much better in terms of performance, clock speed, or power
98 Summary
- Topics: scheduling, loop unrolling, software pipelining, predication, violations while re-ordering instructions
- Static ILP is a great approach for handling embedded domains
- For the high-performance domain, designers have added many frills, bells, and whistles to eke out additional performance, while compromising power/complexity