Review of Instruction Sets, Pipelines, and Caches - PowerPoint PPT Presentation

About This Presentation
Title: Review of Instruction Sets, Pipelines, and Caches
Description: CSE 7381/5381. Review of Instruction Sets, Pipelines, and Caches.
Slides: 37
Provided by: Rand245
Learn more at: https://s2.smu.edu

Transcript and Presenter's Notes


1
Review of Instruction Sets, Pipelines, and Caches
2
Pipelining: It's Natural!
  • Laundry Example
  • Ann, Brian, Cathy, Dave each have one load of
    clothes to wash, dry, and fold
  • Washer takes 30 minutes
  • Dryer takes 40 minutes
  • Folder takes 20 minutes

3
Sequential Laundry
(Figure: sequential laundry timeline from 6 PM to midnight; each load
runs washer 30 min, dryer 40 min, folder 20 min back-to-back.)
  • Sequential laundry takes 6 hours for 4 loads
  • If they learned pipelining, how long would
    laundry take?
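The answer to the question above can be checked with a short calculation (stage times taken from the slide; the variable names are my own):

```python
# Sequential vs. pipelined laundry timing.
# Stage times in minutes: washer 30, dryer 40, folder 20; four loads.
stages = [30, 40, 20]
loads = 4

# Sequential: each load finishes all stages before the next starts.
sequential = loads * sum(stages)                      # 4 * 90 = 360 min

# Pipelined: the first load fills the pipe, then every remaining load
# is limited by the slowest stage (the 40-minute dryer).
pipelined = sum(stages) + (loads - 1) * max(stages)   # 90 + 3*40 = 210 min

print(sequential / 60, pipelined / 60)                # 6.0 3.5 (hours)
```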

4
Pipelined Laundry: Start work ASAP
(Figure: pipelined laundry timeline from 6 PM; loads overlap, with each
new load entering the washer as soon as it is free.)
  • Pipelined laundry takes 3.5 hours for 4 loads

5
Pipelining Lessons
  • Pipelining doesn't help latency of a single task,
    it helps throughput of the entire workload
  • Pipeline rate limited by slowest pipeline stage
  • Multiple tasks operating simultaneously
  • Potential speedup = number of pipe stages
  • Unbalanced lengths of pipe stages reduce speedup
  • Time to fill pipeline and time to drain it
    reduce speedup

6
Computer Pipelines
  • Execute billions of instructions, so throughput
    is what matters
  • DLX desirable features: all instructions the same
    length, registers located in the same place in the
    instruction format, memory operands only in loads
    or stores

7
5 Steps of DLX Datapath (Figure 3.1, Page 130)
(Figure 3.1: the five stages, Instruction Fetch, Instr. Decode/Reg.
Fetch, Execute/Addr. Calc, Memory Access, and Write Back, with pipeline
registers such as IR and LMD carrying values between stages.)
8
Pipelined DLX Datapath (Figure 3.4, Page 137)
(Figure 3.4: Instruction Fetch, Instr. Decode/Reg. Fetch, Execute/Addr.
Calc., Memory Access, and Write Back stages separated by pipeline
registers.)
  • Data stationary control: local decode for each
    instruction phase / pipeline stage

9
Visualizing Pipelining (Figure 3.3, Page 133)
(Figure 3.3: overlapped instructions; the x-axis is time in clock
cycles, the y-axis is instruction order.)
10
It's Not That Easy for Computers
  • Limits to pipelining: hazards prevent the next
    instruction from executing during its designated
    clock cycle
  • Structural hazards: HW cannot support this
    combination of instructions (a single person to
    fold and put clothes away)
  • Data hazards: instruction depends on the result of a
    prior instruction still in the pipeline (missing
    sock)
  • Control hazards: pipelining of branches and other
    instructions that change the PC
  • Common solution: stall the pipeline until the
    hazard is resolved, inserting "bubbles" into the
    pipeline

11
One Memory Port / Structural Hazards (Figure 3.6, Page 142)
(Figure 3.6: a Load followed by Instr 1-4; the Load's Memory Access
stage and a later instruction's Instruction Fetch both need the single
memory port in the same cycle.)
12
One Memory Port / Structural Hazards (Figure 3.7, Page 143)
(Figure 3.7: the same sequence with a stall (bubble) inserted before
Instr 3's fetch, resolving the memory-port conflict.)
13
Speed Up Equation for Pipelining
  • CPI_pipelined = Ideal CPI + Pipeline stall clock
    cycles per instruction
  • Speedup = (Ideal CPI x Pipeline depth) /
    (Ideal CPI + Pipeline stall CPI)
    x (Clock Cycle_unpipelined / Clock Cycle_pipelined)
  • With Ideal CPI = 1:
    Speedup = Pipeline depth / (1 + Pipeline stall CPI)
    x (Clock Cycle_unpipelined / Clock Cycle_pipelined)
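The speedup equation can be sketched as a small function; the name `pipeline_speedup` and its default arguments are my own, not from the slides:

```python
# Sketch of the pipelining speedup equation, assuming the form
# Speedup = (Ideal CPI x depth) / (Ideal CPI + stall CPI)
#           x (Clock Cycle_unpipelined / Clock Cycle_pipelined).
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    return (ideal_cpi * depth) / (ideal_cpi + stall_cpi) * clock_ratio

# With ideal CPI = 1 and no stalls, speedup equals the pipeline depth:
print(pipeline_speedup(depth=5, stall_cpi=0.0))  # 5.0
```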
14
Example: Dual-port vs. Single-port
  • Machine A: dual-ported memory
  • Machine B: single-ported memory, but its pipelined
    implementation has a 1.05 times faster clock rate
  • Ideal CPI = 1 for both
  • Loads are 40% of instructions executed
  • SpeedUp_A = Pipeline Depth / (1 + 0)
    x (clock_unpipe / clock_pipe)
    = Pipeline Depth
  • SpeedUp_B = Pipeline Depth / (1 + 0.4 x 1)
    x (clock_unpipe / (clock_unpipe / 1.05))
    = (Pipeline Depth / 1.4) x 1.05
    = 0.75 x Pipeline Depth
  • SpeedUp_A / SpeedUp_B =
    Pipeline Depth / (0.75 x Pipeline Depth) = 1.33
  • Machine A is 1.33 times faster
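A quick numeric check of this example (variable names are my own; the pipeline depth cancels in the ratio, so it is set to 1):

```python
D = 1.0                                  # pipeline depth (cancels in the ratio)
speedup_a = D / (1 + 0)                  # dual-ported memory, no load stalls
speedup_b = D / (1 + 0.4 * 1) * 1.05     # 40% loads stall 1 cycle, 1.05x clock
print(speedup_a / speedup_b)             # ~1.33: Machine A is 1.33x faster
```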

15
Data Hazard on R1 (Figure 3.9, Page 147)
(Figure 3.9: pipeline diagram with stages IF, ID/RF, EX, MEM, WB; r1
is written by the add in WB but read by the following instructions in
earlier cycles.)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
16
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Read After Write (RAW): InstrJ tries to read an
    operand before InstrI writes it

17
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Write After Read (WAR): InstrJ tries to write an
    operand before InstrI reads it
  • Gets wrong operand
  • Can't happen in DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Reads are always in stage 2, and
  • Writes are always in stage 5

18
Three Generic Data Hazards
  • InstrI followed by InstrJ
  • Write After Write (WAW): InstrJ tries to write an
    operand before InstrI writes it
  • Leaves wrong result (InstrI's, not InstrJ's)
  • Can't happen in DLX 5-stage pipeline because:
  • All instructions take 5 stages, and
  • Writes are always in stage 5
  • Will see WAR and WAW in later, more complicated
    pipes
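The three hazard classes can be sketched as a small classifier over register read/write sets; the function name and encoding are my own illustration, not part of DLX:

```python
# Classify the dependence between two instructions i and j (j issued
# after i) from their read and write register sets.
def hazards(i_reads, i_writes, j_reads, j_writes):
    out = []
    if i_writes & j_reads:
        out.append("RAW")   # j reads a value i has not written yet
    if i_reads & j_writes:
        out.append("WAR")   # j overwrites an operand before i reads it
    if i_writes & j_writes:
        out.append("WAW")   # i's slower write would clobber j's result
    return out

# add r1,r2,r3 followed by sub r4,r1,r3 -> RAW on r1
print(hazards({"r2", "r3"}, {"r1"}, {"r1", "r3"}, {"r4"}))  # ['RAW']
```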

19
Forwarding to Avoid Data Hazard (Figure 3.10, Page 149)
(Figure 3.10: results are forwarded from the pipeline registers back
to the ALU inputs of the following instructions.)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
20
HW Change for Forwarding (Figure 3.20, Page 161)
21
Data Hazard Even with Forwarding (Figure 3.12, Page 153)
(Figure 3.12: the value loaded by lw is not available until after its
MEM stage, too late to forward to the dependent sub's EX stage.)
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
22
Data Hazard Even with Forwarding (Figure 3.13, Page 154)
(Figure 3.13: a one-cycle stall (bubble) inserted after the lw lets
forwarding supply r1 to the sub.)
lw r1, 0(r2)
sub r4,r1,r6
and r6,r1,r7
or r8,r1,r9
23
Software Scheduling to Avoid Load Hazards
Try producing fast code for a = b + c, d = e - f,
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW Rb,b
  LW Rc,c
  ADD Ra,Rb,Rc
  SW a,Ra
  LW Re,e
  LW Rf,f
  SUB Rd,Re,Rf
  SW d,Rd
  • Fast code
  • LW Rb,b
  • LW Rc,c
  • LW Re,e
  • ADD Ra,Rb,Rc
  • LW Rf,f
  • SW a,Ra
  • SUB Rd,Re,Rf
  • SW d,Rd
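One way to see why the fast ordering wins: count load-use pairs, assuming the one-cycle load delay of the 5-stage DLX pipe (the tuple encoding `(op, dest, sources)` and the function name are my own):

```python
# Count stalls caused by an instruction using a value in the cycle
# right after the load that produces it.
def load_use_stalls(code):
    stalls = 0
    for k in range(1, len(code)):
        prev_op, prev_dest, _ = code[k - 1]
        _, _, srcs = code[k]
        if prev_op == "LW" and prev_dest in srcs:
            stalls += 1
    return stalls

slow = [("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
        ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]
fast = [("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
        ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
        ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
        ("SW", None, ("Rd",))]
print(load_use_stalls(slow), load_use_stalls(fast))  # 2 0
```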

24
Control Hazard on Branches: Three-Stage Stall
25
Branch Stall Impact
  • If CPI = 1, 30% branch, stall 3 cycles => new CPI
    = 1.9!
  • Two-part solution:
  • Determine branch taken or not sooner, AND
  • Compute taken-branch address earlier
  • DLX branch tests if register = 0 or != 0
  • DLX solution:
  • Move zero test to ID/RF stage
  • Adder to calculate new PC in ID/RF stage
  • 1 clock cycle penalty for branch versus 3
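The slide's arithmetic, checked directly (variable names are my own):

```python
# 30% of instructions are branches; each stalls 3 cycles before the fix,
# 1 cycle after the zero test moves to the ID/RF stage.
base_cpi = 1.0
branch_freq = 0.30
new_cpi = base_cpi + branch_freq * 3       # stalled pipeline
cpi_with_fix = base_cpi + branch_freq * 1  # 1-cycle branch penalty
print(round(new_cpi, 2), round(cpi_with_fix, 2))  # 1.9 1.3
```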

26
Pipelined DLX Datapath (Figure 3.22, Page 163)
(Figure 3.22: the five stages with the zero test and the branch-target
adder moved into the Instr. Decode/Reg. Fetch stage.)
This is the correct 1-cycle-latency implementation!
27
Four Branch Hazard Alternatives
  • 1: Stall until branch direction is clear
  • 2: Predict Branch Not Taken
  • Execute successor instructions in sequence
  • "Squash" instructions in pipeline if branch
    actually taken
  • Advantage of late pipeline state update
  • 47% of DLX branches not taken on average
  • PC+4 already calculated, so use it to get next
    instruction
  • 3: Predict Branch Taken
  • 53% of DLX branches taken on average
  • But haven't calculated branch target address in
    DLX
  • DLX still incurs 1-cycle branch penalty
  • Other machines: branch target known before outcome

28
Four Branch Hazard Alternatives
  • 4: Delayed Branch
  • Define branch to take place AFTER a following
    instruction
  • branch instruction
    sequential successor_1
    sequential successor_2
    ........
    sequential successor_n
    branch target if taken
  • 1-slot delay allows proper decision and branch
    target address in 5-stage pipeline
  • DLX uses this

Branch delay of length n
29
Delayed Branch
  • Where to get instructions to fill branch delay
    slot?
  • Before branch instruction
  • From the target address: only valuable when
    branch taken
  • From fall-through: only valuable when branch not
    taken
  • Canceling branches allow more slots to be filled
  • Compiler effectiveness for single branch delay
    slot:
  • Fills about 60% of branch delay slots
  • About 80% of instructions executed in branch
    delay slots useful in computation
  • About 50% (60% x 80%) of slots usefully filled
  • Delayed-branch downside: 7-8 stage pipelines,
    multiple instructions issued per clock
    (superscalar)

30
Evaluating Branch Alternatives
  Scheduling scheme    Branch penalty   CPI    Speedup v. unpipelined   Speedup v. stall
  Stall pipeline            3           1.42           3.5                   1.0
  Predict taken             1           1.14           4.4                   1.26
  Predict not taken         1           1.09           4.5                   1.29
  Delayed branch            0.5         1.07           4.6                   1.31

  • Conditional and unconditional branches = 14% of
    instructions; 65% of them change the PC
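The CPI column above can be recomputed from the frequencies in the last bullet (14% branches, 65% of them taken). This sketch assumes the not-taken predictor pays its 1-cycle penalty only on taken branches:

```python
branch_freq, taken_frac = 0.14, 0.65
cpi = {
    "stall":             1 + branch_freq * 3,             # 1.42
    "predict taken":     1 + branch_freq * 1,             # 1.14
    # not-taken prediction only pays when the branch is actually taken:
    "predict not taken": 1 + branch_freq * taken_frac,    # ~1.09
    "delayed":           1 + branch_freq * 0.5,           # 1.07
}
print(cpi)
```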

31
Improvements in Delayed Branches
  • Cancelling branches:
  • if branch behaves as predicted, normal delayed
    branch
  • otherwise, turn the delay slot into a no-op
  • Helps the compiler reschedule instructions
    without restrictions
  • Deeper pipes with longer branch delays make
    delayed branching less attractive
  • Newer RISC machines use a combination of ordinary
    and delayed branches, sometimes only ordinary
    branches with better prediction

32
Prediction Techniques
  • Taken and non-taken predictions
  • Separating the forward and backward branches
  • Profile-based predictions
  • behavior of branches is highly biased towards taken
    or non-taken
  • changing the input has minimal effect on the
    branch behavior

33
Handling Exceptions
  • Turn off all writes for the faulting instruction
    and for all the instructions that follow it in the
    pipe
  • Save PC of the faulting instruction
  • For delayed branches, multiple PCs are needed:
  • number of delay slots + 1
  • Precise exceptions: instructions just before the
    fault are completed and those after can be
    restarted from scratch
  • slower mode

34
Out-of-order Exceptions
  • The (i+1)-th instruction may cause an exception
    before instruction i does
  • Handled by using exception status vectors
  • Disable the side effects as soon as an exception
    is found
  • Exception handling happens at WB, in the
    un-pipelined order

35
Multi-Cycle Operations
  • Impractical to require FP operations to complete
    in 1 or 2 clock cycles:
  • either slow down the clock
  • or use complex FP hardware
  • Instead, allow the FP pipeline a longer latency
  • May cause more hazards:
  • Divide unit not fully pipelined - structural
    hazard
  • WAW, since instructions reach WB out of order
  • Causes additional problems with exceptions:
  • Out-of-order precise exceptions
  • Either serialize the FP operations or buffer the
    results of operations

36
Pipelining Introduction Summary
  • Just overlap tasks; easy if tasks are
    independent
  • Speedup <= Pipeline Depth; if ideal CPI is 1,
    then:
    Speedup = Pipeline Depth / (1 + Pipeline stall CPI)
    x (Clock Cycle Unpipelined / Clock Cycle Pipelined)
  • Hazards limit performance on computers:
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler
    scheduling
  • Control: delayed branch, prediction