Review of Instruction Sets, Pipelines, and Caches - PowerPoint PPT Presentation

About This Presentation

Title:

Review of Instruction Sets, Pipelines, and Caches

Description:

CSE 7381/5381. Review of Instruction Sets, ... Sequential laundry takes 6 hours for 4 loads ... 'Squash' instructions in pipeline if branch actually taken ... – PowerPoint PPT presentation

Slides: 37

Provided by: Rand245

Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes

Title: Review of Instruction Sets, Pipelines, and Caches

1
Review of Instruction Sets, Pipelines, and Caches
2
Pipelining: It's Natural!

Laundry Example
Ann, Brian, Cathy, Dave each have one load of
clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

3
Sequential Laundry
[Figure: timeline from 6 PM to midnight, in task order; each of the four loads takes 30, 40, and 20 minutes, one load after another]

Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would
laundry take?

4
Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight, in task order]

Pipelined laundry takes 3.5 hours for 4 loads
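
The two totals can be checked with a small simulation. This is a minimal sketch, assuming one washer, one dryer, and one folder, and that a finished load can wait between stages; the function names are hypothetical and not part of the slides.

```python
def sequential_minutes(stage_minutes, loads):
    """Total time when each load finishes every stage before the next starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Finish time when each load enters a stage as soon as both the load
    and that stage's machine are free (assumption: one machine per stage)."""
    stage_free = [0] * len(stage_minutes)  # time at which each machine is free
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free[i])
            ready = start + minutes
            stage_free[i] = ready
        finish = ready
    return finish


stages = [30, 40, 20]  # washer, dryer, folder, in minutes
print(sequential_minutes(stages, 4) / 60)  # 6.0
print(pipelined_minutes(stages, 4) / 60)   # 3.5
```

The pipelined total works out to 30 + 4 x 40 + 20 = 210 minutes: the 40-minute dryer is the slowest stage and sets the rate.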

5
Pipelining Lessons

Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to fill the pipeline and time to drain it reduce speedup

[Figure: pipelined laundry timeline, time vs. task order]
6
Computer Pipelines

Execute billions of instructions, so throughput
is what matters
DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

7
5 Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: datapath stages in order: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled registers: IR, LMD]
8
Pipelined DLX Datapath (Figure 3.4, page 137)
[Figure: pipeline stages in order: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]

Data stationary control: local decode for each instruction phase / pipeline stage

9
Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: instructions in order vs. time (clock cycles)]
10
It's Not That Easy for Computers

Limits to pipelining: Hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Control hazards: Pipelining of branches and other instructions that change the PC. Stall the pipeline until the hazard is resolved, leaving "bubbles" in the pipeline

11
One Memory Port/Structural Hazards (Figure 3.6, Page 142)
[Figure: instruction order (Load, Instr 1, Instr 2, Instr 3, Instr 4) vs. time (clock cycles)]
12
One Memory Port/Structural Hazards (Figure 3.7, Page 143)
[Figure: instruction order (Load, Instr 1, Instr 2, stall, Instr 3) vs. time (clock cycles)]
13
Speed Up Equation for Pipelining

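\[ \text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr} \]

\[ \text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}} \]

\[ \text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}} \]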
14
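Example: Dual-port vs. Single-port

Machine A: Dual ported memory
Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both
Loads are 40% of instructions executed

\[ \text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth} \]

\[ \text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth} \]

\[ \frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33 \]

Machine A is 1.33 times faster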
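
The comparison above can be reproduced numerically. This is a rough sketch: the function name and the pipeline depth of 5 are assumptions made for illustration, and the ratio does not depend on the depth chosen.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio


depth = 5  # assumed for illustration; it cancels in the ratio

# Machine A: dual-ported memory, so no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% loads, each stalling 1 cycle, with a 1.05x faster clock.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))  # 1.33
```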

15
Data Hazard on R1 (Figure 3.9, page 147)
[Figure: pipeline stages IF, ID/RF, EX, MEM, WB over time (clock cycles) for the instructions below, in order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
16
Three Generic Data Hazards

InstrI followed by InstrJ
Read After Write (RAW): InstrJ tries to read operand before InstrI writes it

17
Three Generic Data Hazards

InstrI followed by InstrJ
Write After Read (WAR): InstrJ tries to write operand before InstrI reads it
Gets wrong operand
Can't happen in DLX 5 stage pipeline because:
All instructions take 5 stages, and
Reads are always in stage 2, and
Writes are always in stage 5

18
Three Generic Data Hazards

InstrI followed by InstrJ
Write After Write (WAW): InstrJ tries to write operand before InstrI writes it
Leaves wrong result (InstrI not InstrJ)
Can't happen in DLX 5 stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
Will see WAR and WAW in later, more complicated pipes
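
The three cases can be told apart by comparing which registers each instruction reads and writes. The sketch below is illustrative only: the (dest, sources) tuple encoding and the function name are hypothetical, and it reports register dependences that could become hazards; whether they actually do depends on the pipeline.

```python
def data_hazards(instr_i, instr_j):
    """Return the dependence kinds between instr_i and a later instr_j.

    Each instruction is a (dest, sources) pair: one destination register
    name (or None) and a tuple of source register names.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")  # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")  # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")  # J writes what I writes
    return hazards


add = ("r1", ("r2", "r3"))  # add r1,r2,r3
sub = ("r4", ("r1", "r3"))  # sub r4,r1,r3
print(data_hazards(add, sub))  # {'RAW'}
```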

19
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Figure: instructions in order vs. time (clock cycles)]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
20
HW Change for Forwarding (Figure 3.20, Page 161)
21
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: instructions in order vs. time (clock cycles)]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
22
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: instructions in order vs. time (clock cycles)]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
23
Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd

Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd
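
One way to see why the reordered version is faster is to count load-use stalls in each listing. The sketch below is a simplified model, not the DLX hardware: it assumes full forwarding, charges one stall cycle whenever an instruction reads a register loaded by the immediately preceding LW, and uses hypothetical helper names.

```python
def parse(line):
    """Split 'OP a,b,c' into (op, [operands])."""
    op, operands = line.split(None, 1)
    return op.upper(), [x.strip() for x in operands.split(",")]


def load_use_stalls(program):
    """Count one-cycle stalls where an instruction reads the register that
    the immediately preceding LW loads (assumption: forwarding covers
    every other dependence)."""
    stalls = 0
    prev_load_dest = None
    for line in program:
        op, operands = parse(line)
        if op == "SW":
            sources = [operands[1]]  # SW addr,Rsrc
        elif op == "LW":
            sources = []             # symbolic address; no register read modelled
        else:
            sources = operands[1:]   # OP Rdest,Rsrc1,Rsrc2
        if prev_load_dest in sources:
            stalls += 1
        prev_load_dest = operands[0] if op == "LW" else None
    return stalls


slow = ["LW Rb,b", "LW Rc,c", "ADD Ra,Rb,Rc", "SW a,Ra",
        "LW Re,e", "LW Rf,f", "SUB Rd,Re,Rf", "SW d,Rd"]
fast = ["LW Rb,b", "LW Rc,c", "LW Re,e", "ADD Ra,Rb,Rc",
        "LW Rf,f", "SW a,Ra", "SUB Rd,Re,Rf", "SW d,Rd"]

print(load_use_stalls(slow))  # 2
print(load_use_stalls(fast))  # 0
```

Under this model the slow code stalls at ADD (waiting on Rc) and at SUB (waiting on Rf); the fast code places an independent instruction after each of those loads.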

24
Control Hazard on Branches: Three Stage Stall
25
Branch Stall Impact

If CPI = 1, 30% branch, Stall 3 cycles => new CPI = 1.9!
Two part solution:
Determine branch taken or not sooner, AND
Compute taken branch address earlier
DLX branch tests if register = 0 or ≠ 0
DLX Solution:
Move Zero test to ID/RF stage
Adder to calculate new PC in ID/RF stage
1 clock cycle penalty for branch versus 3
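
As a worked check of these figures (a sketch, assuming a base CPI of 1 and that every branch pays the full penalty), the stall cycles per instruction are the branch frequency times the penalty:

\[ \text{CPI}_{\text{3-cycle stall}} = 1 + 0.30 \times 3 = 1.9 \]

\[ \text{CPI}_{\text{1-cycle penalty}} = 1 + 0.30 \times 1 = 1.3 \]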

26
Pipelined DLX Datapath (Figure 3.22, page 163)
[Figure: pipeline stages in order: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
This is the correct 1 cycle latency implementation!
27
Four Branch Hazard Alternatives

#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
Execute successor instructions in sequence
'Squash' instructions in pipeline if branch actually taken
Advantage of late pipeline state update
47% of DLX branches not taken on average
PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
53% of DLX branches taken on average
But haven't calculated branch target address in DLX
DLX still incurs 1 cycle branch penalty
Other machines: branch target known before outcome

28
Four Branch Hazard Alternatives

#4: Delayed Branch
Define branch to take place AFTER a following instruction
    branch instruction
    sequential successor 1
    sequential successor 2
    ........
    sequential successor n    (branch delay of length n)
    branch target if taken
1 slot delay allows proper decision and branch target address in 5 stage pipeline
DLX uses this
29
Delayed Branch

Where to get instructions to fill branch delay slot?
Before branch instruction
From the target address: only valuable when branch taken
From fall through: only valuable when branch not taken
Canceling branches allow more slots to be filled
Compiler effectiveness for single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots useful in computation
About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

30
Evaluating Branch Alternatives

Scheduling          Branch    CPI     Speedup v.     Speedup v.
scheme              penalty           unpipelined    stall
Stall pipeline      3         1.42    3.5            1.0
Predict taken       1         1.14    4.4            1.26
Predict not taken   1         1.09    4.5            1.29
Delayed branch      0.5       1.07    4.6            1.31

Conditional & Unconditional = 14%, 65% change PC
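
The CPI column can be reproduced from the branch statistics. This is a sketch under stated assumptions: base CPI is 1, and for predict-not-taken only the 65% of branches that change the PC pay the penalty (one reading of the slide, not something it states outright). The function and constant names are hypothetical.

```python
def branch_cpi(branch_fraction, penalty, penalized_fraction=1.0, base_cpi=1.0):
    """CPI when a fraction of branches each pay `penalty` stall cycles."""
    return base_cpi + branch_fraction * penalized_fraction * penalty


BRANCH_FRACTION = 0.14  # conditional and unconditional branches
TAKEN_FRACTION = 0.65   # fraction of those that change the PC

schemes = {
    "Stall pipeline": branch_cpi(BRANCH_FRACTION, 3),
    "Predict taken": branch_cpi(BRANCH_FRACTION, 1),
    "Predict not taken": branch_cpi(BRANCH_FRACTION, 1, TAKEN_FRACTION),
    "Delayed branch": branch_cpi(BRANCH_FRACTION, 0.5),
}

for name, cpi in schemes.items():
    print(f"{name:18s} CPI = {cpi:.2f}")
```

Under these assumptions the printed CPIs are 1.42, 1.14, 1.09, and 1.07, matching the CPI column.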

31
Improvements in Delayed Branches

Cancelling branches:
if branch behaves as predicted, normal delayed branch
otherwise, turn the delay slot into a NO-OP
Helps the compiler in rescheduling instructions without restrictions
Deeper pipes with longer branch delays make delayed branching less attractive
Newer RISC machines use a combination of ordinary and delayed branches, sometimes only ordinary branches with better prediction

32
Prediction Techniques

Taken and non-taken predictions
Separating the forward and backward branches
Profile-based predictions:
behavior of branches highly biased towards taken or non-taken
changing the input has minimal effect on the branch behavior

33
Handling Exceptions

Turn off all writes for the faulting instruction
and for all the instructions that follow in the
pipe
Save PC of the faulting instruction
For delayed branch, needs multiple PCs: no. of delay slots + 1
Precise exceptions - instructions just before the
fault are completed and those after can be
restarted from scratch
slower mode

34
Out-of-order Exceptions