Title: Review of Instruction Sets, Pipelines, and Caches
1Review of Instruction Sets, Pipelines, and Caches
2Pipelining: It's Natural!
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
- Washer takes 30 minutes
- Dryer takes 40 minutes
- Folder takes 20 minutes
3Sequential Laundry
[Figure: timeline from 6 PM to midnight; four loads in task order, each taking 30 + 40 + 20 minutes, run one after another; axes: time, task order]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would
laundry take?
4Pipelined Laundry: Start Work ASAP
[Figure: timeline from 6 PM to midnight with the four loads overlapped; axes: time, task order]
- Pipelined laundry takes 3.5 hours for 4 loads
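The 6 hour and 3.5 hour figures can be checked with a small simulation. This is a minimal sketch, not part of the original slides: the function names are hypothetical, and it assumes a load may wait between stages until the next machine is free.

```python
def pipeline_finish_time(stage_minutes, loads):
    """Minutes until the last load leaves the last stage.

    Each stage handles one load at a time. A load enters a stage once it
    has left the previous stage and the stage itself is free.
    """
    stage_free = [0] * len(stage_minutes)  # time at which each stage becomes free
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for s, minutes in enumerate(stage_minutes):
            start = max(t, stage_free[s])
            t = start + minutes
            stage_free[s] = t
        finish = t
    return finish


def sequential_finish_time(stage_minutes, loads):
    """Minutes needed when each load finishes all stages before the next starts."""
    return loads * sum(stage_minutes)


stages = [30, 40, 20]  # washer, dryer, folder
sequential = sequential_finish_time(stages, 4)
pipelined = pipeline_finish_time(stages, 4)
print(sequential / 60, pipelined / 60)  # 6.0 3.5
```

The pipelined time is 30 + 4 x 40 + 20 = 210 minutes: the 40 minute dryer, the slowest stage, sets the rate.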
5Pipelining Lessons
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to fill the pipeline and time to drain it reduce speedup
[Figure: pipelined laundry timeline, 6 PM to 9 PM; axes: time, task order]
6Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores
7Five Steps of DLX Datapath (Figure 3.1, Page 130)
[Figure: datapath stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled registers: IR, LMD]
8Pipelined DLX Datapath (Figure 3.4, Page 137)
[Figure: pipelined datapath stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
- Data stationary control
  - local decode for each instruction phase / pipeline stage
9Visualizing Pipelining (Figure 3.3, Page 133)
[Figure: pipeline diagram; axes: time (clock cycles), instruction order]
10It's Not That Easy for Computers
- Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
  - Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
  - Data hazards: instruction depends on result of a prior instruction still in the pipeline (missing sock)
  - Control hazards: pipelining of branches and other instructions
- Stall the pipeline until the hazard is resolved, inserting bubbles in the pipeline
11One Memory Port / Structural Hazards (Figure 3.6, Page 142)
[Figure: pipeline diagram of Load followed by Instr 1, Instr 2, Instr 3, Instr 4; axes: time (clock cycles), instruction order]
12One Memory Port / Structural Hazards (Figure 3.7, Page 143)
[Figure: pipeline diagram of Load, Instr 1, Instr 2, stall, Instr 3; axes: time (clock cycles), instruction order]
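The stall in the figure follows from the load's Memory Access stage and a later instruction's Instruction Fetch needing the single memory port in the same cycle. A minimal sketch of that bookkeeping, assuming a 5 stage pipeline with no other stalls (the function name and the list encoding are hypothetical):

```python
def schedule_fetches(uses_memory):
    """Return the fetch cycle of each instruction on a one-port memory.

    uses_memory[i] is True when instruction i is a load or store.
    A fetch is delayed while the port is busy with a data access.
    """
    busy = set()  # cycles in which a Memory Access stage owns the port
    fetch_cycles = []
    cycle = 1
    for is_mem in uses_memory:
        while cycle in busy:  # port taken by an earlier load/store: insert a bubble
            cycle += 1
        fetch_cycles.append(cycle)
        if is_mem:
            busy.add(cycle + 3)  # MEM is the 4th stage: IF, ID, EX, MEM
        cycle += 1
    return fetch_cycles


# Load followed by four instructions that do not access data memory
print(schedule_fetches([True, False, False, False, False]))  # [1, 2, 3, 5, 6]
```

Instr 3 would have been fetched in cycle 4, the cycle in which the load uses the port, so it slips to cycle 5.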
13Speed Up Equation for Pipelining
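$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr}$$

$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

With Ideal CPI = 1:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$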
14Example: Dual-port vs. Single-port
- Machine A: dual ported memory
- Machine B: single ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed

$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$

$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 1} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{1.4} \times 1.05 = 0.75 \times \text{Pipeline Depth}$$

$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.75 \times \text{Pipeline Depth}} = 1.33$$

- Machine A is 1.33 times faster
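The comparison can be reproduced numerically from the speedup equation with ideal CPI = 1. This is a sketch only: the function name is hypothetical, and the depth of 5 is an arbitrary assumption that cancels in the ratio.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over the unpipelined machine, assuming ideal CPI = 1.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return depth / (1.0 + stall_cpi) * clock_ratio


depth = 5  # assumed value; it cancels when the two machines are compared

# Machine A: dual ported memory, so loads cause no structural stalls
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% loads, 1 stall cycle each, but a 1.05x faster clock
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))  # 1.33
```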
15Data Hazard on R1 (Figure 3.9, Page 147)
[Figure: pipeline diagram with stages IF, ID/RF, EX, MEM, WB; axes: time (clock cycles), instruction order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
16Three Generic Data Hazards
- InstrI followed by InstrJ
- Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
17Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Read (WAR): InstrJ tries to write operand before InstrI reads it
  - Gets wrong operand
- Can't happen in DLX 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5
18Three Generic Data Hazards
- InstrI followed by InstrJ
- Write After Write (WAW): InstrJ tries to write operand before InstrI writes it
  - Leaves wrong result (InstrI not InstrJ)
- Can't happen in DLX 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
- Will see WAR and WAW in later, more complicated pipes
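The three hazard types reduce to comparing the registers InstrI and InstrJ read and write. The sketch below uses a hypothetical tuple encoding (destination register, list of source registers) and reports the dependences that could become hazards; whether they actually stall depends on the pipeline, as noted above for DLX.

```python
def classify_hazards(instr_i, instr_j):
    """Name the potential data hazards when instr_i precedes instr_j.

    Each instruction is (dest_register, [source_registers]);
    dest_register is None for instructions that write no register.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i is not None and dest_i in srcs_j:
        hazards.append("RAW")  # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.append("WAR")  # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        hazards.append("WAW")  # J writes what I writes
    return hazards


add = ("r1", ["r2", "r3"])  # add r1,r2,r3
sub = ("r4", ["r1", "r3"])  # sub r4,r1,r3
print(classify_hazards(add, sub))  # ['RAW']
```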
19Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
[Figure: pipeline diagram with forwarding paths; axes: time (clock cycles), instruction order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
20HW Change for Forwarding (Figure 3.20, Page 161)
21Data Hazard Even with Forwarding (Figure 3.12, Page 153)
[Figure: pipeline diagram; axes: time (clock cycles), instruction order]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
22Data Hazard Even with Forwarding (Figure 3.13, Page 154)
[Figure: pipeline diagram; axes: time (clock cycles), instruction order]
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
23Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
- Slow code
- LW Rb,b
- LW Rc,c
- ADD Ra,Rb,Rc
- SW a,Ra
- LW Re,e
- LW Rf,f
- SUB Rd,Re,Rf
- SW d,Rd
- Fast code
- LW Rb,b
- LW Rc,c
- LW Re,e
- ADD Ra,Rb,Rc
- LW Rf,f
- SW a,Ra
- SUB Rd,Re,Rf
- SW d,Rd
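The difference between the two schedules can be counted directly. The sketch below assumes full forwarding and a one cycle load delay, so the only stall modeled is an instruction that reads a register loaded by the LW immediately before it; the tuple encoding and function name are hypothetical.

```python
def count_load_use_stalls(program):
    """Count one stall for each instruction that reads a register
    loaded by the immediately preceding LW.

    program is a list of (opcode, dest_register, [source_registers]).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "LW" and prev_dest in cur_srcs:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", []), ("LW", "Rc", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]
fast = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
    ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]
print(count_load_use_stalls(slow), count_load_use_stalls(fast))  # 2 0
```

In the slow code, ADD and SUB each follow the load of one of their operands; the fast code moves an independent instruction into each of those positions.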
24Control Hazard on Branches: Three Stage Stall
25Branch Stall Impact
- If CPI = 1, 30% branch, stall 3 cycles => new CPI = 1.9!
- Two part solution:
  - Determine branch taken or not sooner, AND
  - Compute taken branch address earlier
- DLX branch tests if register = 0 or ≠ 0
- DLX Solution:
  - Move zero test to ID/RF stage
  - Adder to calculate new PC in ID/RF stage
  - 1 clock cycle penalty for branch versus 3
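The 1.9 figure comes from adding the branch stall cycles per instruction to the ideal CPI. As a worked step, using the same 30% branch frequency for both cases:

$$\text{CPI} = \text{Ideal CPI} + \text{branch frequency} \times \text{branch penalty} = 1 + 0.30 \times 3 = 1.9$$

With the 1 cycle penalty of the DLX solution, the same expression gives

$$\text{CPI} = 1 + 0.30 \times 1 = 1.3$$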
26Pipelined DLX Datapath (Figure 3.22, Page 163)
[Figure: pipelined datapath stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
This is the correct 1 cycle latency implementation!
27Four Branch Hazard Alternatives
- #1: Stall until branch direction is clear
- #2: Predict Branch Not Taken
  - Execute successor instructions in sequence
  - Squash instructions in pipeline if branch actually taken
  - Advantage of late pipeline state update
  - 47% of DLX branches not taken on average
  - PC+4 already calculated, so use it to get next instruction
- #3: Predict Branch Taken
  - 53% of DLX branches taken on average
  - But haven't calculated branch target address in DLX
  - DLX still incurs 1 cycle branch penalty
  - Other machines: branch target known before outcome
28Four Branch Hazard Alternatives
- #4: Delayed Branch
  - Define branch to take place AFTER a following instruction
      branch instruction
      sequential successor 1
      sequential successor 2
      ........
      sequential successor n      (branch delay of length n)
      branch target if taken
  - 1 slot delay allows proper decision and branch target address in 5 stage pipeline
  - DLX uses this
29Delayed Branch
- Where to get instructions to fill branch delay slot?
  - Before branch instruction
  - From the target address: only valuable when branch taken
  - From fall through: only valuable when branch not taken
  - Canceling branches allow more slots to be filled
- Compiler effectiveness for single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots useful in computation
  - About 50% (60% x 80%) of slots usefully filled
- Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)
30Evaluating Branch Alternatives
Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
Stall pipeline       3                1.42   3.5                      1.0
Predict taken        1                1.14   4.4                      1.26
Predict not taken    1                1.09   4.5                      1.29
Delayed branch       0.5              1.07   4.6                      1.31
- Conditional and unconditional branches = 14% of instructions; 65% of them change the PC
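The CPI column can be reproduced from the 14% branch frequency. This is a sketch under one stated assumption: that the predict-not-taken penalty is paid only by the 65% of branches that change the PC, which is how the 1.09 entry appears to arise. The names are hypothetical.

```python
BRANCH_FREQUENCY = 0.14  # conditional + unconditional branches
TAKEN_FRACTION = 0.65    # fraction of branches that change the PC


def branch_cpi(penalty_per_branch, ideal_cpi=1.0):
    """CPI = ideal CPI + branch frequency x average penalty per branch."""
    return ideal_cpi + BRANCH_FREQUENCY * penalty_per_branch


cpi = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    # assumption: only branches that change the PC pay the 1 cycle penalty
    "Predict not taken": branch_cpi(1 * TAKEN_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}
for scheme, value in cpi.items():
    print(f"{scheme:18s} {value:.2f}")
```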
31Improvements in Delayed Branches
- Cancelling branches
  - If branch behaves as predicted, normal delayed branch
  - Otherwise, turn the delay slot into a NO-OP
  - Helps the compiler in rescheduling instructions without restrictions
- Deeper pipes with longer branch delays make delayed branching less attractive
- Newer RISC machines use a combination of ordinary and delayed branches, sometimes only ordinary branches with better prediction
32Prediction Techniques
- Taken and not-taken predictions
- Separating the forward and backward branches
- Profile-based predictions
  - Behavior of branches is highly biased towards taken or not taken
  - Changing the input has minimal effect on the branch behavior
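The static schemes listed above can be compared on a branch trace. The sketch below is illustrative only: the trace is made up (one backward loop branch and one forward branch), the function names are hypothetical, and the profile is trained and evaluated on the same trace, which leans on the observation that changing the input has little effect.

```python
from collections import defaultdict


def always_taken(pc, target):
    return True


def always_not_taken(pc, target):
    return False


def backward_taken(pc, target):
    """Forward/backward split: predict taken only for backward branches."""
    return target < pc


def profile_predictor(training_trace):
    """Predict each branch's majority direction seen in a profiling run."""
    counts = defaultdict(lambda: [0, 0])  # pc -> [not-taken count, taken count]
    for pc, _target, taken in training_trace:
        counts[pc][int(taken)] += 1

    def predict(pc, target):
        not_taken, taken = counts[pc]
        return taken > not_taken

    return predict


def accuracy(predict, trace):
    hits = sum(predict(pc, target) == taken for pc, target, taken in trace)
    return hits / len(trace)


# Hypothetical trace of (branch pc, target pc, taken?) records
trace = ([(40, 16, True)] * 9 + [(40, 16, False)]      # loop branch, backward
         + [(24, 36, True)] * 2 + [(24, 36, False)] * 8)  # forward branch

results = {
    "taken": accuracy(always_taken, trace),
    "not taken": accuracy(always_not_taken, trace),
    "backward taken": accuracy(backward_taken, trace),
    "profile": accuracy(profile_predictor(trace), trace),
}
print(results)
```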
33Handling Exceptions
- Turn off all writes for the faulting instruction and for all the instructions that follow in the pipe
- Save PC of the faulting instruction
  - For delayed branch, needs multiple PCs
  - Number of delay slots + 1
- Precise exceptions: instructions just before the fault are completed and those after can be restarted from scratch
  - Slower mode
34Out-of-order Exceptions
- (I+1)th instruction may cause an exception before the Ith does
- Handled by using exception status vectors
- Disable the side effects as soon as an exception is found
- Exception handling happens at WB, in the un-pipelined order
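A small simulation can show why deferring the trap to WB restores program order. This is a simplified sketch with hypothetical names: one instruction enters IF per cycle, there are no stalls, each instruction faults in at most one stage, and the status vector is modeled as a per-instruction list of flagged stages.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def simulate(faults, n_instructions):
    """Return (detection order, first instruction to trap).

    faults maps an instruction index to the stage in which it faults.
    Instruction i occupies stage s in cycle i + 1 + s.
    """
    status = [[] for _ in range(n_instructions)]  # exception status vectors
    detected = []
    last_cycle = n_instructions + len(STAGES) - 1
    for cycle in range(1, last_cycle + 1):
        for i in range(n_instructions):
            s = cycle - (i + 1)  # stage index of instruction i this cycle
            if not 0 <= s < len(STAGES):
                continue
            if faults.get(i) == STAGES[s]:
                status[i].append(STAGES[s])  # record only; do not trap yet
                detected.append(i)
            if STAGES[s] == "WB" and status[i]:
                return detected, i  # trap taken at WB, in program order
    return detected, None


# Instruction 1 faults in IF (cycle 2) before instruction 0 faults in MEM (cycle 4)
detected, trapped = simulate({0: "MEM", 1: "IF"}, n_instructions=2)
print(detected, trapped)  # [1, 0] 0
```

The fault of the later instruction is detected first, but the earlier instruction reaches WB first, so its exception is the one handled.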
35Multi-Cycle Operations
- Impractical to require the FP operations to complete in 1 or 2 clock cycles
  - either slow down the clock
  - or complex FP hardware
- Instead, allow the FP pipeline a longer latency
- May cause more hazards
  - Divide unit not fully pipelined: structural hazard
  - WAW, since the instructions reach WB out of order
- Causes additional problems with exceptions
  - Out of order precise exceptions
  - Either serialize the FP operations or buffer the results of operations
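The WAW problem can be seen by computing when each instruction reaches WB once EX takes a variable number of cycles. In this sketch the latencies (24 cycles for a divide, 4 for an add) and the function names are illustrative assumptions, one instruction issues per cycle, and no stalls are modeled.

```python
def write_back_cycles(program):
    """program: list of (name, dest_register, ex_latency_in_cycles).

    An instruction fetched in cycle c spends 1 cycle in IF, 1 in ID,
    ex_latency cycles in EX, 1 in MEM, and writes back in the next cycle.
    """
    timed = []
    for fetch_cycle, (name, dest, ex_latency) in enumerate(program, start=1):
        wb_cycle = fetch_cycle + 3 + ex_latency
        timed.append((name, dest, wb_cycle))
    return timed


def waw_violations(program):
    """Pairs where a later instruction writes the same register
    no later than an earlier one does."""
    timed = write_back_cycles(program)
    violations = []
    for a in range(len(timed)):
        for b in range(a + 1, len(timed)):
            name_a, dest_a, wb_a = timed[a]
            name_b, dest_b, wb_b = timed[b]
            if dest_a == dest_b and wb_b <= wb_a:
                violations.append((name_a, name_b, dest_a))
    return violations


fp_program = [("DIVD", "F0", 24), ("ADDD", "F0", 4)]  # assumed latencies
print(write_back_cycles(fp_program))  # [('DIVD', 'F0', 28), ('ADDD', 'F0', 9)]
print(waw_violations(fp_program))     # [('DIVD', 'ADDD', 'F0')]
```

With every EX latency equal to 1, as in the 5 stage integer pipeline, write-backs stay in program order and the list of violations is empty.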
36Pipelining Introduction Summary
- Just overlap tasks, and easy if tasks are independent
- Speed Up ≤ Pipeline Depth; if ideal CPI is 1, then:

$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle Unpipelined}}{\text{Clock Cycle Pipelined}}$$

- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  - Control: delayed branch, prediction