Title: CS152
1 CS152 Computer Architecture and Engineering
Lecture 12: Pipeline Wrap up: Control Hazards, RAW/WAR/WAW
2004-10-07
John Lazzaro (www.cs.berkeley.edu/lazzaro)
Dave Patterson (www.cs.berkeley.edu/patterson)
www-inst.eecs.berkeley.edu/cs152/
2 Pipelining Review
- What makes it easy
  - all instructions are the same length
  - just a few instruction formats
  - memory operands appear only in loads and stores
- Hazards limit performance
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
- Data hazards must be handled carefully
- MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load)
3 Outline
- Pipelined Control
- Control Hazards
- RAW, WAR, WAW
- Brainstorm on pipeline bugs
4 MIPS Pipeline Data / Control Paths A (fast)
[Figure: pipelined MIPS datapath with control. Blocks: PC, Instruction Memory, Register File, Sign Extend (16 to 32), ALU, ALU cntrl, Data Memory, two Adds, Shift left 2. Pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB carry the EX, MEM, and WB control fields. Control signals: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg.]
5 MIPS Pipeline Data / Control Paths (debug)
[Figure: pipelined MIPS datapath with control, debug version. The ID/EX, EX/MEM, and MEM/WB pipeline registers each carry an Instr field alongside the EX, MEM, and WB control fields; two Control blocks are shown. Control signals: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg.]
6 MIPS Pipeline Control (pipelined debug)
[Figure: pipelined MIPS datapath with control, pipelined debug version. The ID/EX, EX/MEM, and MEM/WB pipeline registers each carry an Instr field; three Control blocks are shown, associated with the EX, MEM, and WB fields. Control signals: PCSrc, Branch, RegWrite, ALUSrc, ALUOp, RegDst, MemRead, MemWrite, MemtoReg.]
7 Control Hazards
- When the flow of instruction addresses is not what the pipeline expects; incurred by change-of-flow instructions
  - Conditional branches (beq, bne)
  - Unconditional branches (j)
- Possible solutions
  - Stall
  - Move decision point earlier in the pipeline
  - Predict
  - Delay decision (requires compiler support)
- Control hazards occur less frequently than data hazards, but there is nothing as effective against control hazards as forwarding is for data hazards
8 Datapath Branch and Jump Hardware
9 Datapath Branch and Jump Hardware
10 Administrivia
- Finish Lab 3; meet with TA Friday
- Midterm Tue Oct 12, 5:30 - 8:30 in 101 Morgan
  - Northwest corner of campus, near Arch and Hearst
- Midterm review Sunday Oct 10, 7 PM, 306 Soda
- Bring 1 page, handwritten notes, both sides
- Nothing electronic: no calculators, cell phones, pagers
- Meet at LaVal's Northside afterwards for pizza
11 Jumps Incur One Stall
- Jumps not decoded until ID, so one stall is needed
[Figure: pipeline timing diagram, instructions in program order: j, lw, and]
- Fortunately, jumps are very infrequent: only 2% of the SPECint instruction mix
12 Review: Branches Incur Three Stalls
[Figure: pipeline timing diagram for beq, instructions in program order]
Can fix branch hazard by waiting (stalling), but this affects throughput
13 Moving Branch Decisions Earlier in Pipe
- Move the branch decision hardware back to the EX stage
  - Reduces the number of stall cycles to two
  - Adds an and gate and a 2x1 mux to the EX timing path
- Add hardware to compute the branch target address and evaluate the branch decision to the ID stage
  - Reduces the number of stall cycles to one (like with jumps)
  - Computing branch target address can be done in parallel with RegFile read (done for all instructions, only used when needed)
  - Comparing the registers can't be done until after RegFile read, so comparing and updating the PC adds a comparator, an and gate, and a 3x1 mux to the ID timing path
  - Need forwarding hardware in ID stage
- For longer pipelines, decision points are later in the pipeline, incurring more stalls, so we need a better solution
14 Early Branch Forwarding Issues
- Bypass of source operands from the EX/MEM
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRs))
      ForwardC = 1
    if (IDcontrol.Branch
        and (EX/MEM.RegisterRd != 0)
        and (EX/MEM.RegisterRd == IF/ID.RegisterRt))
      ForwardD = 1
  Forwards the result from the second previous instr. to either input of the Compare
- MEM/WB dependency also needs to be forwarded
- If the instruction 2 before the branch is a load, then a stall will be required since the MEM stage memory access is occurring at the same time as the ID stage branch compare operation
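The two forwarding conditions above can be written as a small executable model. This is only a sketch: the class and field names (`BranchForwardInputs`, `ex_mem_rd`, and so on) are hypothetical stand-ins for the pipeline register fields named on the slide, and it models only the EX/MEM bypass, not the MEM/WB case or the load stall.

```python
from dataclasses import dataclass


@dataclass
class BranchForwardInputs:
    """Hypothetical snapshot of the signals the slide's conditions read."""
    branch: bool      # IDcontrol.Branch
    ex_mem_rd: int    # EX/MEM.RegisterRd
    if_id_rs: int     # IF/ID.RegisterRs
    if_id_rt: int     # IF/ID.RegisterRt


def branch_forward(sig: BranchForwardInputs) -> tuple[int, int]:
    """Return (ForwardC, ForwardD) for the ID-stage compare inputs."""
    # Register 0 is never forwarded: it always reads as zero.
    candidate = sig.branch and sig.ex_mem_rd != 0
    forward_c = int(candidate and sig.ex_mem_rd == sig.if_id_rs)
    forward_d = int(candidate and sig.ex_mem_rd == sig.if_id_rt)
    return forward_c, forward_d


if __name__ == "__main__":
    # Branch compares $1 and $2; the instruction two earlier wrote $1.
    print(branch_forward(BranchForwardInputs(True, 1, 1, 2)))
```

A fuller design would presumably also check that the instruction in EX/MEM actually writes a register; the slide's conditions do not show that term, so it is left out here.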
15 Branch Prediction
- Resolve branch hazards by assuming a given outcome and proceeding without waiting to see the actual branch outcome
- Predict not taken: always predict branches will not be taken, continue to fetch from the sequential instruction stream, only when branch is taken does the pipeline stall
  - If taken, flush instructions in the pipeline after the branch
    - in IF, ID, and EX if branch logic in MEM: three stalls
    - in IF if branch logic in ID: one stall
  - ensure that those flushed instructions haven't changed machine state: automatic in the MIPS pipeline since machine state changing operations are at the tail end of the pipeline (MemWrite or RegWrite)
  - restart the pipeline at the branch destination
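To see what predict-not-taken buys, the flush counts above (three with branch logic in MEM, one in ID) can be plugged into a simple CPI estimate. This is a rough sketch: the function name is hypothetical, and the branch frequency (0.2) and taken fraction (0.5) are illustrative assumptions, not figures from the lecture.

```python
def effective_cpi(base_cpi: float, branch_freq: float,
                  taken_fraction: float, flush_cycles: int) -> float:
    """CPI when only taken branches pay the flush penalty.

    Setting taken_fraction to 1.0 models a pipeline that always stalls
    on a branch instead of predicting not taken.
    """
    return base_cpi + branch_freq * taken_fraction * flush_cycles


if __name__ == "__main__":
    # Assumed workload: 20% branches, half of them taken (illustrative only).
    for where, flush in (("MEM", 3), ("ID", 1)):
        stall = effective_cpi(1.0, 0.2, 1.0, flush)
        predict = effective_cpi(1.0, 0.2, 0.5, flush)
        print(f"branch logic in {where}: always stall {stall:.2f}, "
              f"predict not taken {predict:.2f}")
```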
16 Flushing with Misprediction (Not Taken)
4 beq $1,$2,2
8 sub $4,$1,$5
- To flush the IF stage instruction, add an IF.Flush control line that zeros the instruction field of the IF/ID pipeline register (transforming it into a noop)
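Zeroing the instruction field works because the all-zero MIPS word decodes as `sll $0,$0,0`, which changes no state. A minimal sketch of that idea, assuming the IF/ID register is modeled as a dictionary with hypothetical keys `pc_plus_4` and `instruction`:

```python
def decode_r_type(word: int) -> dict:
    """Split a 32-bit MIPS word into its R-format fields."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "rd": (word >> 11) & 0x1F,
        "shamt": (word >> 6) & 0x1F,
        "funct": word & 0x3F,
    }


def if_flush(if_id: dict) -> dict:
    """Model IF.Flush: return a copy of IF/ID with the instruction zeroed."""
    flushed = dict(if_id)
    flushed["instruction"] = 0
    return flushed


if __name__ == "__main__":
    # 0x00252022 encodes sub $4,$1,$5, the instruction fetched after the beq.
    if_id = {"pc_plus_4": 12, "instruction": 0x00252022}
    print(decode_r_type(if_id["instruction"]))
    print(decode_r_type(if_flush(if_id)["instruction"]))
```

After the flush every field is zero, so the destination is register 0 and the instruction has no architectural effect.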
18 Branch Prediction, cont
- Resolve branch hazards by statically assuming a given outcome and proceeding
- Predict taken: always predict branches will be taken
  - Predict taken always incurs a stall (if branch destination hardware has been moved to the ID stage)
- As the branch penalty increases (for deeper pipelines), a simple static prediction scheme will hurt performance
- With more hardware, possible to try to predict branch behavior dynamically during program execution
- Dynamic branch prediction: predict branches at run-time using run-time information
19 Dynamic Branch Prediction
- A branch prediction buffer (aka branch history table (BHT)) in the IF stage, addressed by the lower bits of the PC, contains a bit that tells whether the branch was taken the last time it was executed
  - Bit may predict incorrectly (may be from a different branch with the same low order PC bits, or may be a wrong prediction for this branch) but this doesn't affect correctness, just performance
  - If the prediction is wrong, flush the incorrect instructions in pipeline, restart the pipeline with the right instructions, and invert the prediction bit
- The BHT predicts when a branch is taken, but does not tell where it's taken to!
  - A branch target buffer (BTB) in the IF stage can cache the branch target address (or even the branch target instruction) so that a stall can be avoided
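A 1-bit BHT indexed by low-order PC bits can be modeled in a few lines. This is a sketch under stated assumptions: the class name, the 4-bit index width, and dropping the two byte-offset bits of the PC are illustrative choices, not details from the slide.

```python
class BranchHistoryTable:
    """1-bit predictor table indexed by low-order PC bits (no tags)."""

    def __init__(self, index_bits: int = 4):
        self.mask = (1 << index_bits) - 1
        self.bits = [0] * (1 << index_bits)  # 0 = predict not taken

    def _index(self, pc: int) -> int:
        # Assumption: instructions are 4 bytes, so skip the 2 offset bits.
        return (pc >> 2) & self.mask

    def predict(self, pc: int) -> bool:
        return self.bits[self._index(pc)] == 1

    def update(self, pc: int, taken: bool) -> None:
        # Storing the last outcome is the same as inverting on a mispredict.
        self.bits[self._index(pc)] = 1 if taken else 0


if __name__ == "__main__":
    bht = BranchHistoryTable(index_bits=4)
    bht.update(0x40, taken=True)
    print(bht.predict(0x40))  # same branch
    print(bht.predict(0x80))  # different branch, same low-order bits
    print(bht.predict(0x44))  # different entry
```

Because the table has no tags, the branch at 0x80 shares the entry written by the branch at 0x40. That can cost a misprediction but never produces a wrong result, as the slide notes.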
20 1-bit Prediction Accuracy
- 1-bit predictor in loop is incorrect twice when not taken
  - Assume predict_bit = 0 to start (indicating branch not taken) and loop control is at the bottom of the loop code
  - First time through the loop, the predictor mispredicts the branch since the branch is taken back to the top of the loop; invert prediction bit (predict_bit = 1)
  - As long as branch is taken (looping), prediction is correct
  - Exiting the loop, the predictor again mispredicts the branch since this time the branch is not taken, falling out of the loop; invert prediction bit (predict_bit = 0)
Loop: 1st loop instr
      2nd loop instr
      . . .
      last loop instr
      bne $1,$2,Loop
      fall out instr
- For 10 times through the loop we have an 80% prediction accuracy for a branch that is taken 90% of the time
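The 80% figure can be checked by stepping a 1-bit predictor through the ten executions of the loop-closing bne: nine taken, then one not taken. The function name below is a hypothetical helper; the starting bit of 0 follows the slide's assumption.

```python
def run_one_bit(outcomes, predict_bit=0):
    """Return (correct predictions, final predict_bit) for a 1-bit predictor."""
    correct = 0
    for taken in outcomes:
        if (predict_bit == 1) == taken:
            correct += 1
        else:
            predict_bit ^= 1  # invert the prediction bit on a mispredict
    return correct, predict_bit


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]  # taken 9 times, then falls out
    correct, final_bit = run_one_bit(loop_branch)
    print(f"{correct}/{len(loop_branch)} correct, final predict_bit = {final_bit}")
```

The two misses are the first iteration (the bit starts at 0) and the loop exit. The bit ends at 0 again, so the next execution of the loop repeats the same pattern.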
21 2-bit Predictors
- A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed.
Loop: 1st loop instr
      2nd loop instr
      . . .
      last loop instr
      bne $1,$2,Loop
      fall out instr
[Figure: 2-bit predictor state diagram with two Predict Taken states and two Predict Not Taken states, connected by Taken / Not taken transitions]
22 2-bit Predictors
- A 2-bit scheme can give 90% accuracy since a prediction must be wrong twice before the prediction bit is changed
Loop: 1st loop instr
      2nd loop instr
      . . .
      last loop instr
      bne $1,$2,Loop
      fall out instr
Annotations: right 9 times; wrong on loop fall out; right on 1st iteration
[Figure: 2-bit predictor state diagram with two Predict Taken states (prediction bit 1) and two Predict Not Taken states (prediction bit 0), connected by Taken / Not taken transitions]
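The 90% figure can be reproduced with a small simulation of the loop-closing branch. This sketch assumes the common saturating-counter form of the 2-bit scheme (states 0 to 3, predict taken when the state is 2 or 3) and a starting state of 3. The exact transitions of the slide's state diagram are not recoverable from the text, so treat both choices as assumptions.

```python
def run_two_bit(outcomes, state=3):
    """Return (correct predictions, final state) for a 2-bit saturating counter.

    States 2 and 3 predict taken; states 0 and 1 predict not taken.
    """
    correct = 0
    for taken in outcomes:
        if (state >= 2) == taken:
            correct += 1
        # Move one step toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct, state


if __name__ == "__main__":
    loop_branch = [True] * 9 + [False]
    correct, state = run_two_bit(loop_branch)
    print(f"first pass: {correct}/10 correct, state = {state}")
    correct, state = run_two_bit(loop_branch, state)
    print(f"second pass: {correct}/10 correct, state = {state}")
```

The single miss per pass is the loop fall out. The counter only drops to the weaker taken state, so the first iteration of the next pass is still predicted correctly.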
23 Delayed Decision
- First, move the branch decision hardware and target address calculation to the ID pipeline stage
- A delayed branch always executes the next sequential instruction; the branch takes effect after that next instruction
  - MIPS software moves an instruction to immediately after the branch that is not affected by the branch (a safe instruction), thereby hiding the branch delay
- As processors go to deeper pipelines and multiple issue, the branch delay grows and needs more than one delay slot
  - Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
  - Growth in available transistors has made dynamic approaches relatively cheaper
24 Scheduling Branch Delay Slots
A. From before branch
     add $1,$2,$3
     if $2=0 then
       delay slot
B. From branch target
     sub $4,$5,$6
     add $1,$2,$3
     if $1=0 then
       delay slot
C. From fall through
     add $1,$2,$3
     if $1=0 then
       delay slot
     sub $4,$5,$6
- A is the best choice: fills delay slot and reduces instruction count (IC)
- In B, the sub instruction may need to be copied, increasing IC
- In B and C, must be okay to execute sub when branch fails
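Whether choice A is available depends on whether the instruction before the branch feeds the branch condition. A minimal sketch of that check follows; the function name and the representation of registers as plain integers are hypothetical, and a real scheduler would check more than this one dependence.

```python
def can_fill_from_before(instr_dest: int, branch_sources: set[int]) -> bool:
    """True if an instruction ahead of the branch may move into the delay slot.

    The move is unsafe when the branch condition reads the register the
    instruction writes, since the branch would then see a stale value.
    """
    return instr_dest not in branch_sources


if __name__ == "__main__":
    # Case A: add writes $1, branch tests $2, so the add can fill the slot.
    print(can_fill_from_before(1, {2}))
    # Cases B and C: branch tests $1, which the add writes, so look elsewhere.
    print(can_fill_from_before(1, {1}))
```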
25 3 Generic Data Hazards: RAW, WAR, WAW
- Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
- Caused by a Dependence (in compiler nomenclature). This hazard results from an actual need for communication.
- Forwarding handles many, but not all, RAW dependencies in 5 stage MIPS pipeline
I: add r1,r2,r3
J: sub r4,r1,r3
26 3 Generic Data Hazards: RAW, WAR, WAW
- Write After Read (WAR): InstrJ writes operand before InstrI reads it
- Called an anti-dependence by compiler writers. This results from reuse of the name r1.
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Register Writes must be in stage 5
27 3 Generic Data Hazards: RAW, WAR, WAW
- Write After Write (WAW): InstrJ writes operand before InstrI writes it
- Called an output dependence by compiler writers. This also results from the reuse of name r1.
- Can't happen in MIPS 5 stage pipeline because:
  - All instructions take 5 stages, and
  - Register Writes must be in stage 5
- Can see WAR and WAW in more complicated pipes
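The three cases come down to comparing the destination and source registers of an earlier instruction I and a later instruction J. The sketch below classifies the dependences between one such pair; the function name and argument layout are hypothetical. It reports dependences only: whether one becomes a hazard depends on the pipeline, as the slides note for the 5 stage MIPS.

```python
def classify_dependences(i_dest, i_srcs, j_dest, j_srcs):
    """List the dependences from earlier instruction I to later instruction J.

    Destinations are register names or None; sources are collections of names.
    """
    found = []
    if i_dest is not None and i_dest in j_srcs:
        found.append("RAW")  # J reads what I writes (true dependence)
    if j_dest is not None and j_dest in i_srcs:
        found.append("WAR")  # J writes what I reads (anti-dependence)
    if i_dest is not None and i_dest == j_dest:
        found.append("WAW")  # both write the same name (output dependence)
    return found


if __name__ == "__main__":
    # I: add r1,r2,r3   J: sub r4,r1,r3
    print(classify_dependences("r1", ["r2", "r3"], "r4", ["r1", "r3"]))
    # I: sub r4,r1,r3   J: add r1,r2,r3
    print(classify_dependences("r4", ["r1", "r3"], "r1", ["r2", "r3"]))
```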
28 Supporting ID Stage Branches
[Figure: pipelined MIPS datapath with the branch resolved in ID. A Compare unit on the RegFile read outputs and the branch target Add (with Shift left 2) sit in the ID stage and drive Branch and PCSrc. Also shown: Hazard Unit, two Forward Units, Control, the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, ALU, ALU cntrl, Sign Extend (16 to 32), Instruction Memory, Data Memory.]
29 Brainstorm on pipeline bugs
- Where are bugs likely to hide in a pipelined processor?
  -
  -
- How can you write tests to uncover these likely bugs?
  -
  -
- Once it passes a test, never need to run it again in the design process?
30 Brainstorm on pipeline bugs
- Depending on branch solution (move to ID, delayed, static prediction, dynamic prediction), where are bugs likely to hide?
  -
  -
- How can you write tests to uncover these likely bugs?
  -
  -
- Once it passes a test, don't need to run it again?
31 Peer Instruction
[Figure: pipeline timing diagram, Cycle 1 to Cycle 7, against Clock: 1st add (ends in Mem/Wr), 2nd lw, 3rd add (ends in Mem/Wr)]
- Suppose we use a 4 stage pipeline that combines memory access and write back stages for all instructions but load, stalling when there are structural hazards. Impact?
  1. The branch delay slot is now 0 instructions
  2. Most loads cause stall since often a structural hazard on reg. writes
  3. Most stores cause stall since they have a structural hazard
  4. Both 2 & 3: most loads & stores cause stall due to structural hazards
  5. Most loads cause stall, but there is no load-use hazard anymore
  6. Both 2 & 3, but there is no load-use hazard anymore
  7. None of the above
32 Peer Instruction
Q: Why not say every load stalls?
A: Not all next instructions write in Wr stage
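The register write port conflict behind this question can be sketched with a toy timing model. Everything here is an assumption made for illustration: the instruction kinds, the single write port, and the stage numbers (loads write in stage 5, other register-writing instructions in the combined stage 4). It ignores every other hazard.

```python
# Stage in which each kind writes the register file; absent kinds never write.
WRITE_STAGE = {"load": 5, "alu": 4}


def count_write_port_stalls(kinds):
    """Count stall cycles needed so no two instructions write in the same cycle."""
    used_cycles = set()
    stalls = 0
    issue = 0
    for kind in kinds:
        issue += 1  # in-order issue, one instruction per cycle at best
        stage = WRITE_STAGE.get(kind)
        if stage is None:
            continue  # stores and branches do not use the write port
        while issue + stage - 1 in used_cycles:
            issue += 1  # delay this instruction by one cycle
            stalls += 1
        used_cycles.add(issue + stage - 1)
    return stalls


if __name__ == "__main__":
    print(count_write_port_stalls(["alu", "load", "alu"]))    # add, lw, add
    print(count_write_port_stalls(["alu", "load", "store"]))  # add, lw, sw
```

In the add, lw, add sequence the third instruction would write in the same cycle as the load, so it slips one cycle and finishes in cycle 7. A store after the load writes no register and does not stall, which is the point of the answer above.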
33 Summary: Designing a Pipelined Processor
- Go back and examine your data path and control diagram
- Associate resources with states
  - Be sure there are no structural hazards: one use / clock cycle
- Add pipeline registers between stages to balance clock cycle
  - Amdahl's Law suggests splitting longest stage
- Resolve all data and control dependencies
  - If backwards in time in pipeline drawing to registers => data hazard: forward or stall to resolve them
  - If backwards in time in pipeline drawing to PC => control hazard (we'll see next time)
  - 5 stage pipeline with reads early in same stage, writes later in same stage, avoids WAR/WAW hazards
- Assert control in appropriate stage
- Develop test instruction sequences likely to uncover pipeline bugs (If you don't test it, it won't work)