Title: ECE369 Chapter 4
1 ECE369 Chapter 4
2 State Elements
- Unclocked vs. clocked
- Clocks are used in synchronous logic
- Clocks are needed in sequential logic to decide when an element that contains state should be updated.
3 Latches and Flip-flops
4 Latches and Flip-flops
5 Latches and Flip-flops
- Latches change state whenever the inputs change and the clock is asserted; flip-flop state changes only on a clock edge (edge-triggered methodology).
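A minimal behavioral sketch in Python (my own illustration, not from the slides) contrasting the two: the latch output follows D whenever the clock is asserted, while the flip-flop samples D only on a rising clock edge.

    class Latch:
        """Level-sensitive: output follows D whenever clk is 1."""
        def __init__(self):
            self.q = 0
        def step(self, clk, d):
            if clk == 1:              # transparent while the clock is asserted
                self.q = d
            return self.q

    class FlipFlop:
        """Edge-triggered: captures D only on a rising clock edge."""
        def __init__(self):
            self.q = 0
            self.prev_clk = 0
        def step(self, clk, d):
            if self.prev_clk == 0 and clk == 1:   # rising edge
                self.q = d
            self.prev_clk = clk
            return self.q

    latch, ff = Latch(), FlipFlop()
    for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]:
        print(clk, d, latch.step(clk, d), ff.step(clk, d))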
6 SRAM
7 SRAM vs. DRAM
- Which one has a better memory density?
  - In static RAM (SRAM), the value stored in a cell is kept on a pair of inverting gates; in dynamic RAM (DRAM), the value kept in a cell is stored as a charge on a capacitor. DRAMs use only a single transistor per bit of storage; by comparison, SRAMs require four to six transistors per bit.
- Which one is faster?
  - In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must periodically be refreshed (hence "dynamic").
- Synchronous RAMs?
  - The key capability is the ability to transfer a burst of data from a series of sequential addresses within an array or row.
8 Datapath and control design
- We will design a simplified MIPS processor
- The instructions supported are
  - Memory-reference instructions: lw, sw
  - Arithmetic-logical instructions: add, sub, and, or, slt
  - Control flow instructions: beq, j
- Generic implementation
  - Use the program counter (PC) to supply the instruction address
  - Get the instruction from memory
  - Read registers
  - Use the instruction to decide exactly what to do
- All instructions use the ALU after reading the registers. Why? (memory-reference? arithmetic? control flow?)
9 ALU Control
- The ALU's operation is based on the instruction type and function code
- Examples:
  - add $t1, $s7, $s8
  - lw $t0, 32($s2)
- ALU control lines:
  - 000 AND
  - 001 OR
  - 010 Add
  - 110 Subtract
  - 111 Set-on-less-than
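As a quick illustration (my own, not from the slides), both example instructions drive the ALU with the same 3-bit control value, since a load uses the ALU to add the base register and the sign-extended offset:

    # 3-bit ALU control encodings from the slide
    ALU_OPS = {"000": "AND", "001": "OR", "010": "add",
               "110": "subtract", "111": "set-on-less-than"}

    # add $t1, $s7, $s8 -> R-type add          -> control 010
    # lw  $t0, 32($s2)  -> address = $s2 + 32  -> control 010 as well
    for instr, ctrl in [("add $t1, $s7, $s8", "010"), ("lw $t0, 32($s2)", "010")]:
        print(f"{instr:22s} ALU control = {ctrl} ({ALU_OPS[ctrl]})")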
10 Summary of Instruction Types
11 Building blocks
- Why do we need each of these?
12 Fetching instructions
13 Reading registers
14 Load/Store memory access
15 Branch target
16 Combining the datapath for memory and R-type instructions
17 Appending instruction fetch
18 Now insert branch
19 The simple datapath
20 Control
- For each instruction
  - Select the registers to be read (always read two)
  - Select the 2nd ALU input
  - Select the operation to be performed by the ALU
  - Select whether data memory is to be read or written
  - Select what is written, and where, in the register file
  - Select what goes in the PC
- The information comes from the 32 bits of the instruction
21 Adding control to the datapath
22 Adding control to the datapath
23 ALU Control
- ALUOp, given the instruction type: 00 for lw/sw, 01 for beq, 10 for arithmetic
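A sketch of the resulting two-level decode (the funct encodings are the standard MIPS ones, shown here as an assumption rather than taken from the slides): the main control produces ALUOp, and the ALU control unit combines it with the funct field.

    def alu_control(alu_op: str, funct: str) -> str:
        if alu_op == "00":        # lw/sw: address calculation, so add
            return "010"
        if alu_op == "01":        # beq: compare by subtracting
            return "110"
        # alu_op == "10": R-type, decode the funct field
        return {"100000": "010",  # add
                "100010": "110",  # sub
                "100100": "000",  # and
                "100101": "001",  # or
                "101010": "111",  # slt
                }[funct]

    print(alu_control("10", "100010"))   # R-type sub -> 110
    print(alu_control("00", "------"))   # lw/sw      -> 010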
24 Control (Reading Assignment: Appendix C.2)
- Simple combinational logic (truth tables)
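For reference, a sketch of the kind of truth table the main control unit implements (signal names follow the single-cycle datapath in the text; the values below are the standard ones and should be checked against Appendix C.2):

    MAIN_CONTROL = {
        # opcode: (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)
        "000000": (1, 0, 0, 1, 0, 0, 0, "10"),  # R-type
        "100011": (0, 1, 1, 1, 1, 0, 0, "00"),  # lw
        "101011": (0, 1, 0, 0, 0, 1, 0, "00"),  # sw  (RegDst, MemtoReg are don't-cares)
        "000100": (0, 0, 0, 0, 0, 0, 1, "01"),  # beq (RegDst, MemtoReg are don't-cares)
    }

    print(MAIN_CONTROL["100011"])   # control word asserted for a load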
25
26 Datapath in Operation for R-Type Instruction
27 Datapath in Operation for Load Instruction
28 Datapath in Operation for Branch Equal Instruction
29 Datapath with control for the Jump instruction
- J-type instructions use 6 bits for the opcode and 26 bits for the immediate value (called the target).
- newPC <- PC[31:28] || IR[25:0] || 00
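A minimal sketch of that address computation (the helper name and the example values are mine, for illustration): keep the upper four bits of the incremented PC, append the 26-bit target, then shift in two zero bits.

    def jump_address(pc_plus_4: int, target_26: int) -> int:
        return (pc_plus_4 & 0xF0000000) | (target_26 << 2)

    # jumping from PC+4 = 0x40000004 with a 26-bit target of 0x0100000
    print(hex(jump_address(0x40000004, 0x0100000)))   # -> 0x40400000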
30 Timing: single-cycle implementation
- Calculate the cycle time assuming negligible delays except:
  - memory (2ns), ALU and adders (2ns), register file access (1ns)
31 Why is Single Cycle not GOOD???
- Memory: 2ns
- ALU: 2ns, Adder: 2ns
- Reg: 1ns
- What if we had floating point instructions to handle?
32 1 clock cycle: fixed vs. variable for each instruction
- Memory: 2ns
- ALU: 2ns, Adder: 2ns
- Reg: 1ns
- Instruction mix: loads 24%, stores 12%, R-type 44%, branches 18%, jumps 2%
33 1 clock cycle: fixed vs. variable for each instruction
- Memory: 2ns
- ALU: 2ns, Adder: 2ns
- Reg: 1ns
- Instruction mix: loads 24%, stores 12%, R-type 44%, branches 18%, jumps 2%
- Per-class instruction times: lw 8ns, sw 7ns, R-type 6ns, branch 5ns, jump 2ns
- CPU time = IC × CPI × CC
- With a variable-length clock cycle, the average time per instruction is
  8(0.24) + 7(0.12) + 6(0.44) + 5(0.18) + 2(0.02) = 6.3ns
  (a fixed clock must be set by the slowest instruction, lw, at 8ns)
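A quick check of the numbers above (delays in ns; the per-class times come from adding the component delays each instruction class actually uses):

    times = {"lw": 8, "sw": 7, "R-type": 6, "branch": 5, "jump": 2}   # ns per class
    mix   = {"lw": 0.24, "sw": 0.12, "R-type": 0.44, "branch": 0.18, "jump": 0.02}

    variable_clock = sum(times[i] * mix[i] for i in times)   # average over the mix
    fixed_clock = max(times.values())                        # set by the slowest (lw)
    print(f"variable: {variable_clock:.1f} ns, fixed: {fixed_clock} ns")
    # -> variable: 6.3 ns, fixed: 8 ns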
34 Single Cycle Problems
- Wasteful of area
- Each unit is used once per clock cycle
- Clock cycle equal to the worst-case scenario
- Will reducing the delay of the common case help?
35 Pipelining
36 Pipelining
- Improve performance by increasing instruction throughput
- The ideal speedup is the number of stages in the pipeline. Do we achieve this?
37 Pipelining
- What makes it easy?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
- We'll talk about modern processors and what really makes it hard
  - Exception handling
  - Trying to improve performance with out-of-order execution, etc.
38 Representation
39 Hazards
40 Hazards
41 Hazards
42 Basic Idea
- What do we need to add to actually split the datapath into stages?
43 Pipelined datapath
44 Five Stages (lw)
- Memory and registers: left half write, right half read
45 Five Stages (lw)
46 Five Stages (lw)
47 What is wrong with this datapath?
48 Store Instruction
49 Store Instruction
50 Graphically representing pipelines
- Can help with answering questions like
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
51 Pipeline operation
- In a pipeline, one operation begins in every cycle
- Also, one operation completes in each cycle
- Each instruction takes 5 clock cycles
  - k cycles in general, where k is the pipeline depth
- When a stage is not used, no control needs to be applied
- In one clock cycle, several instructions are active
  - Different stages are executing different instructions
  - How to generate control signals for them is an issue
52 Pipeline control
- We have 5 stages. What needs to be controlled in each stage?
  - Instruction Fetch and PC Increment
  - Instruction Decode / Register Fetch
  - Execution
  - Memory Stage
  - Write Back
- How would control be handled in an automobile plant?
  - A fancy control center telling everyone what to do?
  - Should we use a finite state machine?
53 Pipeline control
54 Pipeline control
55 Datapath with control
56 Dependencies
- Problem with starting the next instruction before the first is finished
- Dependencies that go backward in time are data hazards
57 Forwarding
- Use temporary results; don't wait for them to be written
  - Register file forwarding to handle read/write to the same register
  - ALU forwarding
58 Forwarding
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
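A sketch of the forwarding-unit condition for the ALU's first input in this example (field names mirror the pipeline registers; ForwardB, for the second input, is analogous):

    def forward_a(ex_mem, mem_wb, id_ex_rs):
        # EX hazard: the instruction one ahead is about to write the register we need
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
            return "10"   # take the ALU result from the EX/MEM register
        # MEM hazard: the instruction two ahead wrote it (and no closer forward applies)
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
            return "01"   # take the value from the MEM/WB register
        return "00"       # no hazard: use the register-file value

    # and $12, $2, $5 right after sub $2, $1, $3: $2 is forwarded from EX/MEM
    print(forward_a({"RegWrite": 1, "Rd": 2}, {"RegWrite": 0, "Rd": 0}, 2))   # -> "10"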
59 Forwarding
60 Can't always forward
- A load word can still cause a hazard:
  - an instruction tries to read a register following a load instruction that writes to the same register.
61 Stalling
- Hardware detection and no-op insertion is called stalling
- Stall the pipeline by keeping the instruction in the same stage
62 Example
63 (No Transcript)
64 Stall logic
- Stall condition:
  - If (ID/EX.MemRead)            // a load word instruction
    AND ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt))
- Insert a no-op (no-operation)
  - by deasserting all control signals
- Stall the following instruction
  - by not writing the program counter (PCWrite)
  - and not writing the IF/ID register
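A minimal sketch of that hazard-detection logic (signal and field names mirror the pipeline registers above):

    def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
        return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

    # lw that writes $2, immediately followed by an instruction that reads $2
    if must_stall(True, 2, 2, 5):
        pc_write = False          # freeze the PC
        if_id_write = False       # freeze the IF/ID register
        control = 0               # zero the control signals: the ID-stage
                                  # instruction becomes a bubble (no-op)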
65 Pipeline with hazard detection
66 Assume that the register file is written in the first half and read in the second half of the clock cycle.

  LOAD1: r2 <- mem(r1 + 0)
  ADD:   r3 <- r3 + r2
  LOAD2: r4 <- mem(r2 + r3)
  SUB:   r4 <- r5 - r3

  Pipeline diagram (S = stall cycle):
  LOAD1: IF ID EX ME WB
  ADD:   IF ID S  S  EX ME WB
  LOAD2: IF S  S  ID EX ME WB
  SUB:   IF ID S  EX ME WB
67 Summary
68 Forwarding Case Summary
69 Multi-cycle
70 Multi-cycle
71 Multi-cycle Pipeline
72 Branch Hazards
73 Branch hazards
- When we decide to branch, other instructions are in the pipeline!
- We are predicting branch not taken
  - Need to add hardware for flushing instructions if we are wrong
74 Solutions to control hazards
- Branch prediction
  - We are predicting branch not taken
  - Need to add hardware for flushing instructions if we are wrong
- Reduce the branch penalty
  - By advancing the branch decision to the ID stage
  - Compare the data read from the two registers read in the ID stage
  - Comparison for equality is a simpler design! (Why?)
  - Still need to flush the instruction in the IF stage
- Make the hazard into a feature!
  - Delayed branch slot: always execute the instruction following the branch
75 Branch detection in ID stage
76 Dynamic branch prediction
- Use the lower part of the instruction address to index a prediction table
- Use one bit to denote branch taken or not taken
  - Disadvantage: poor performance in loops
- Dynamic branch prediction with two bits
  - Use two bits instead of one
  - A branch must go the same way twice before the prediction changes
- More sophisticated schemes
  - Count the number of times the branch is taken
- (Figure: 2-bit branch prediction state diagram)
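A minimal sketch of the 2-bit scheme (my own illustration): each table entry is a saturating counter indexed by the low bits of the branch address, so a single mispredict does not flip the prediction.

    class TwoBitPredictor:
        def __init__(self, index_bits=10):
            self.table = [1] * (1 << index_bits)   # counters 0..3; >= 2 predicts taken
            self.mask = (1 << index_bits) - 1

        def predict(self, pc):
            return self.table[(pc >> 2) & self.mask] >= 2

        def update(self, pc, taken):
            i = (pc >> 2) & self.mask
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

    bp = TwoBitPredictor()
    for outcome in [True, True, False, True]:    # a loop branch: mostly taken
        print(bp.predict(0x400100), outcome)     # prediction vs. actual outcome
        bp.update(0x400100, outcome)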
77 Correlating Branches
- Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
- In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  - The old 2-bit BHT is then a (0,2) predictor
- Example code with correlated branches:
  if (aa == 2)
      aa = 0;
  if (bb == 2)
      bb = 0;
  if (aa != bb)
      // do something
78 Correlating Branches
- (2,2) predictor
  - The behavior of the two most recent branches selects between, say, four predictions for the next branch, updating just that prediction
- (Figure: the branch address and the 2-bit global branch history together index 2-bit per-branch predictors to produce the prediction)
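A sketch of a (2,2) predictor along those lines (an illustration, not the exact organization in the figure): a 2-bit global history of the last two branch outcomes selects one of four tables of 2-bit counters.

    class CorrelatingPredictor:
        def __init__(self, index_bits=10, m=2):
            self.m = m
            self.history = 0                              # last m outcomes as bits
            self.tables = [[1] * (1 << index_bits) for _ in range(1 << m)]
            self.mask = (1 << index_bits) - 1

        def predict(self, pc):
            return self.tables[self.history][(pc >> 2) & self.mask] >= 2

        def update(self, pc, taken):
            t, i = self.tables[self.history], (pc >> 2) & self.mask
            t[i] = min(3, t[i] + 1) if taken else max(0, t[i] - 1)
            self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    cp = CorrelatingPredictor()
    print(cp.predict(0x400100))   # initial prediction for some branch address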
79 Accuracy of Different Schemes
- (Chart: frequency of mispredictions, from 0% to 18%, comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT)
80 Branch Prediction
- Sophisticated techniques:
  - A branch target buffer to help us look up the destination
  - Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  - Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  - A branch delay slot, which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
- Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
- Modern processors predict correctly 95% of the time!
81 Branch Target Buffer
- Branch Target Buffer (BTB): the address of the branch indexes the buffer to get both the prediction AND the branch target address (if taken)
- Note: must check for a branch match now, since we can't use the wrong branch's address
- Return-instruction addresses are predicted with a stack
- (Figure: the BTB lookup yields a taken/not-taken prediction and the predicted PC)
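A minimal sketch of a BTB lookup (keyed by the full branch PC here, which stands in for the index-plus-tag check mentioned above):

    btb = {}   # branch PC -> (predicted_taken, predicted_target)

    def btb_lookup(pc):
        return btb.get(pc)        # None: not known to be a branch, so fetch PC + 4

    def btb_update(pc, taken, target):
        btb[pc] = (taken, target)

    btb_update(0x400100, True, 0x400080)
    print(btb_lookup(0x400100))   # (True, 0x400080): redirect fetch to the target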
82 Scheduling in delayed branching
83 Other issues in pipelines
- Exceptions
  - Errors in the ALU for arithmetic instructions
  - Memory non-availability
- Exceptions lead to a jump in a program
  - However, the current PC value must be saved so that the program can return to it for recoverable errors
- Multiple exceptions can occur in a pipeline
- Preciseness of the exception location is important in some cases
- I/O exceptions are handled in the same manner
84 Exceptions
85 Improving Performance
- Try to avoid stalls! E.g., reorder these instructions:
  - lw $t0, 0($t1)
  - lw $t2, 4($t1)
  - sw $t2, 0($t1)
  - sw $t0, 4($t1)
- Dynamic pipeline scheduling
  - Hardware chooses which instructions to execute next
  - Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  - Speculates on branches and keeps the pipeline full (may need to roll back if the prediction is incorrect)
- Trying to exploit instruction-level parallelism
86 Advanced Pipelining
- Increase the depth of the pipeline
- Start more than one instruction each cycle (multiple issue)
- Loop unrolling to expose more ILP (better scheduling)
- Superscalar processors
  - DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
  - All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different pipes)
- VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
- This class has given you the background you need to learn more!
87
- Source:
  for (i = 1; i <= 1000; i = i + 1)
      x[i] = x[i] + y;          /* y is a scalar kept in F2 */
- Direct translation:
  Loop: LD    F0, 0(R1)         ; R1 points to x[1000]
        ADDD  F4, F0, F2        ; F2 holds the scalar value
        SD    0(R1), F4
        SUBI  R1, R1, 8
        BNEZ  R1, Loop          ; x[0] is at address 0
88 Reducing stalls
- Pipeline implementation (stall cycles shown explicitly):
  Loop: LD    F0, 0(R1)
        stall
        ADDD  F4, F0, F2
        stall
        stall
        SD    0(R1), F4
        SUBI  R1, R1, 8
        stall
        BNEZ  R1, Loop
        stall
- Scheduled version:
  Loop: LD    F0, 0(R1)
        stall
        ADDD  F4, F0, F2
        SUBI  R1, R1, 8
        BNEZ  R1, Loop
        SD    8(R1), F4         ; fills the branch delay slot; offset is 8 because SUBI has already decremented R1
89 Loop Unrolling
  Loop: LD    F0, 0(R1)
        ADDD  F4, F0, F2
        SD    0(R1), F4         ; drop SUBI & BNEZ
        LD    F6, -8(R1)
        ADDD  F8, F6, F2
        SD    -8(R1), F8        ; drop SUBI & BNEZ
        LD    F10, -16(R1)
        ADDD  F12, F10, F2
        SD    -16(R1), F12      ; drop SUBI & BNEZ
        LD    F14, -24(R1)
        ADDD  F16, F14, F2
        SD    -24(R1), F16
        SUBI  R1, R1, 32
        BNEZ  R1, Loop
90
  Loop: LD    F0, 0(R1)
        LD    F6, -8(R1)
        LD    F10, -16(R1)
        LD    F14, -24(R1)
        ADDD  F4, F0, F2
        ADDD  F8, F6, F2
        ADDD  F12, F10, F2
        ADDD  F16, F14, F2
        SD    0(R1), F4
        SD    -8(R1), F8
        SD    -16(R1), F12
        SUBI  R1, R1, 32
        BNEZ  R1, Loop
        SD    8(R1), F16        ; 8 - 32 = -24
- 14 instructions (3.5 instructions per iteration vs. 6)
91 Superscalar architecture -- two instructions executed in parallel
92 Dynamically scheduled pipeline
93 Motorola G4e
94 Intel Pentium 4
95 IBM PowerPC 970
96 Important facts to remember
- Pipelined processors divide execution into multiple steps
- However, pipeline hazards reduce performance
  - Structural, data, and control hazards
- Data forwarding helps resolve data hazards
  - But not all hazards can be resolved
  - Some data hazards require bubble or no-op insertion
- Effects of control hazards are reduced by branch prediction
  - Predict always taken, delayed slots, branch prediction table
- Structural hazards are resolved by duplicating resources
- Time to execute n instructions depends on
  - Number of stages (k)
  - Number of control hazards and the penalty of each
  - Number of data hazards and the penalty of each
- Time = n + k - 1 + (number of load hazards × load hazard penalty) + (number of branches × branch penalty)
- Load hazard penalty is 1 or 0 cycles, depending on data use with forwarding
- Branch penalty is 3, 2, 1, or 0 cycles, depending on the scheme
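A quick sketch of that cycle count as a function (the penalties are parameters since they depend on the forwarding and branch scheme used):

    def pipeline_cycles(n, k, load_hazards, load_penalty, branches, branch_penalty):
        # n instructions on a k-stage pipeline, plus stall cycles from hazards
        return n + (k - 1) + load_hazards * load_penalty + branches * branch_penalty

    # e.g., 100 instructions, 5 stages, 10 load-use hazards at 1 cycle each
    # (with forwarding), and 15 branches at a 1-cycle penalty:
    print(pipeline_cycles(100, 5, 10, 1, 15, 1))   # -> 129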
97 Design and performance issues with pipelining
- Pipelined processors are not EASY to design
- Technology affects implementation
- Instruction set design affects performance
  - e.g., beq, bne
- More stages do not lead to higher performance!