Title: ECE369 Chapter 4
1 ECE369 Chapter 4
2 State Elements
- Unclocked vs. clocked
- Clocks are used in synchronous logic
- Clocks are needed in sequential logic to decide when an element that contains state should be updated.
3 Latches and Flip-flops
4 Latches and Flip-flops
5 Latches and Flip-flops
- Latches change state whenever the inputs change and the clock is asserted; flip-flop state changes only on a clock edge (edge-triggered methodology).
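A minimal behavioral sketch in Python (my own illustration, not from the slides) contrasting the two: the latch output follows D whenever the clock is asserted, while the flip-flop samples D only on a rising clock edge.

    class Latch:
        """Level-sensitive: output follows D whenever clk is 1."""
        def __init__(self):
            self.q = 0
        def step(self, clk, d):
            if clk == 1:              # transparent while the clock is asserted
                self.q = d
            return self.q

    class FlipFlop:
        """Edge-triggered: captures D only on a rising clock edge."""
        def __init__(self):
            self.q = 0
            self.prev_clk = 0
        def step(self, clk, d):
            if self.prev_clk == 0 and clk == 1:   # rising edge
                self.q = d
            self.prev_clk = clk
            return self.q

    latch, ff = Latch(), FlipFlop()
    for clk, d in [(0, 1), (1, 1), (1, 0), (0, 0), (1, 1)]:
        print(clk, d, latch.step(clk, d), ff.step(clk, d))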
6 SRAM
7 SRAM vs. DRAM
- Which one has a better memory density?
  - In static RAM (SRAM), the value stored in a cell is kept on a pair of inverting gates; in dynamic RAM (DRAM), the value kept in a cell is stored as a charge on a capacitor. DRAMs use only a single transistor per bit of storage; by comparison, SRAMs require four to six transistors per bit.
- Which one is faster?
  - In DRAMs, the charge is stored on a capacitor, so it cannot be kept indefinitely and must periodically be refreshed (hence "dynamic").
- Synchronous RAMs?
  - The key capability is the ability to transfer a burst of data from a series of sequential addresses within an array or row.
8 Datapath and control design
- We will design a simplified MIPS processor
- The instructions supported are
  - Memory-reference instructions: lw, sw
  - Arithmetic-logical instructions: add, sub, and, or, slt
  - Control flow instructions: beq, j
- Generic implementation
  - Use the program counter (PC) to supply the instruction address
  - Get the instruction from memory
  - Read registers
  - Use the instruction to decide exactly what to do
- All instructions use the ALU after reading the registers. Why? (memory-reference? arithmetic? control flow?)
9 ALU Control
- The ALU's operation is based on the instruction type and function code
- Examples:
  - add $t1, $s7, $s8
  - lw $t0, 32($s2)
- ALU control lines:
  - 000 AND
  - 001 OR
  - 010 Add
  - 110 Subtract
  - 111 Set-on-less-than
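As a quick illustration (my own, not from the slides), both example instructions drive the ALU with the same 3-bit control value, since a load uses the ALU to add the base register and the sign-extended offset:

    # 3-bit ALU control encodings from the slide
    ALU_OPS = {"000": "AND", "001": "OR", "010": "add",
               "110": "subtract", "111": "set-on-less-than"}

    # add $t1, $s7, $s8 -> R-type add          -> control 010
    # lw  $t0, 32($s2)  -> address = $s2 + 32  -> control 010 as well
    for instr, ctrl in [("add $t1, $s7, $s8", "010"), ("lw $t0, 32($s2)", "010")]:
        print(f"{instr:22s} ALU control = {ctrl} ({ALU_OPS[ctrl]})")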
10 Summary of Instruction Types
11 Building blocks
- Why do we need each of these?
12 Fetching instructions
13 Reading registers
14 Load/Store memory access
15 Branch target
16 Combining the datapath for memory and R-type instructions
17 Appending instruction fetch
18 Now insert branch
19 The simple datapath
20 Control
- For each instruction
  - Select the registers to be read (always read two)
  - Select the 2nd ALU input
  - Select the operation to be performed by the ALU
  - Select whether data memory is to be read or written
  - Select what is written, and where, in the register file
  - Select what goes in the PC
- The information comes from the 32 bits of the instruction
21 Adding control to the datapath
22 Adding control to the datapath
23 ALU Control
- ALUOp, given the instruction type: 00 for lw/sw, 01 for beq, 10 for arithmetic
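A sketch of the resulting two-level decode (the funct encodings are the standard MIPS ones, shown here as an assumption rather than taken from the slides): the main control produces ALUOp, and the ALU control unit combines it with the funct field.

    def alu_control(alu_op: str, funct: str) -> str:
        if alu_op == "00":        # lw/sw: address calculation, so add
            return "010"
        if alu_op == "01":        # beq: compare by subtracting
            return "110"
        # alu_op == "10": R-type, decode the funct field
        return {"100000": "010",  # add
                "100010": "110",  # sub
                "100100": "000",  # and
                "100101": "001",  # or
                "101010": "111",  # slt
                }[funct]

    print(alu_control("10", "100010"))   # R-type sub -> 110
    print(alu_control("00", "------"))   # lw/sw      -> 010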
24 Control (Reading Assignment: Appendix C.2)
- Simple combinational logic (truth tables)
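For reference, a sketch of the kind of truth table the main control unit implements (signal names follow the single-cycle datapath in the text; the values below are the standard ones and should be checked against Appendix C.2):

    MAIN_CONTROL = {
        # opcode: (RegDst, ALUSrc, MemtoReg, RegWrite, MemRead, MemWrite, Branch, ALUOp)
        "000000": (1, 0, 0, 1, 0, 0, 0, "10"),  # R-type
        "100011": (0, 1, 1, 1, 1, 0, 0, "00"),  # lw
        "101011": (0, 1, 0, 0, 0, 1, 0, "00"),  # sw  (RegDst, MemtoReg are don't-cares)
        "000100": (0, 0, 0, 0, 0, 0, 1, "01"),  # beq (RegDst, MemtoReg are don't-cares)
    }

    print(MAIN_CONTROL["100011"])   # control word asserted for a load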
25
26 Datapath in Operation for R-Type Instruction
27 Datapath in Operation for Load Instruction
28 Datapath in Operation for Branch Equal Instruction
29 Datapath with control for the Jump instruction
- J-type instructions use 6 bits for the opcode and 26 bits for the immediate value (called the target).
- newPC <- PC[31:28] || IR[25:0] || 00
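A minimal sketch of that address computation (the helper name and the example values are mine, for illustration): keep the upper four bits of the incremented PC, append the 26-bit target, then shift in two zero bits.

    def jump_address(pc_plus_4: int, target_26: int) -> int:
        return (pc_plus_4 & 0xF0000000) | (target_26 << 2)

    # jumping from PC+4 = 0x40000004 with a 26-bit target of 0x0100000
    print(hex(jump_address(0x40000004, 0x0100000)))   # -> 0x40400000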
30 Timing: single-cycle implementation
- Calculate the cycle time assuming negligible delays except:
  - memory (2ns), ALU and adders (2ns), register file access (1ns)
31 Why is Single Cycle not GOOD???
- Memory: 2ns
- ALU: 2ns, Adder: 2ns
- Reg: 1ns
- What if we had floating point instructions to handle?
32 1 clock cycle: fixed vs. variable for each instruction
- Memory: 2ns
- ALU: 2ns, Adder: 2ns
- Reg: 1ns
- Instruction mix: loads 24%, stores 12%, R-type 44%, branches 18%, jumps 2%
33 1 clock cycle: fixed vs. variable for each instruction
- Memory: 2ns
- ALU: 2ns, Adder: 2ns
- Reg: 1ns
- Instruction mix: loads 24%, stores 12%, R-type 44%, branches 18%, jumps 2%
- Per-class instruction times: lw 8ns, sw 7ns, R-type 6ns, branch 5ns, jump 2ns
- CPU time = IC × CPI × CC
- With a variable-length clock cycle, the average time per instruction is
  8(0.24) + 7(0.12) + 6(0.44) + 5(0.18) + 2(0.02) = 6.3ns
  (a fixed clock must be set by the slowest instruction, lw, at 8ns)
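A quick check of the numbers above (delays in ns; the per-class times come from adding the component delays each instruction class actually uses):

    times = {"lw": 8, "sw": 7, "R-type": 6, "branch": 5, "jump": 2}   # ns per class
    mix   = {"lw": 0.24, "sw": 0.12, "R-type": 0.44, "branch": 0.18, "jump": 0.02}

    variable_clock = sum(times[i] * mix[i] for i in times)   # average over the mix
    fixed_clock = max(times.values())                        # set by the slowest (lw)
    print(f"variable: {variable_clock:.1f} ns, fixed: {fixed_clock} ns")
    # -> variable: 6.3 ns, fixed: 8 ns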
34 Single Cycle Problems
- Wasteful of area
- Each unit is used once per clock cycle
- Clock cycle equal to the worst-case scenario
- Will reducing the delay of the common case help?
35 Pipelining
36 Pipelining
- Improve performance by increasing instruction throughput
- The ideal speedup is the number of stages in the pipeline. Do we achieve this?
37 Pipelining
- What makes it easy?
  - All instructions are the same length
  - Just a few instruction formats
  - Memory operands appear only in loads and stores
- What makes it hard?
  - Structural hazards: suppose we had only one memory
  - Control hazards: need to worry about branch instructions
  - Data hazards: an instruction depends on a previous instruction
- We'll build a simple pipeline and look at these issues
- We'll talk about modern processors and what really makes it hard
  - Exception handling
  - Trying to improve performance with out-of-order execution, etc.
38 Representation
39 Hazards
40 Hazards
41 Hazards
42 Basic Idea
- What do we need to add to actually split the datapath into stages?
43 Pipelined datapath
44 Five Stages (lw)
- Memory and registers: left half write, right half read
45 Five Stages (lw)
46 Five Stages (lw)
47 What is wrong with this datapath?
48 Store Instruction
49 Store Instruction
50 Graphically representing pipelines
- Can help with answering questions like
  - How many cycles does it take to execute this code?
  - What is the ALU doing during cycle 4?
- Use this representation to help understand datapaths
51 Pipeline operation
- In a pipeline, one operation begins in every cycle
- Also, one operation completes in each cycle
- Each instruction takes 5 clock cycles
  - k cycles in general, where k is the pipeline depth
- When a stage is not used, no control needs to be applied
- In one clock cycle, several instructions are active
  - Different stages are executing different instructions
  - How to generate control signals for them is an issue
52 Pipeline control
- We have 5 stages. What needs to be controlled in each stage?
  - Instruction Fetch and PC Increment
  - Instruction Decode / Register Fetch
  - Execution
  - Memory Stage
  - Write Back
- How would control be handled in an automobile plant?
  - A fancy control center telling everyone what to do?
  - Should we use a finite state machine?
53 Pipeline control
54 Pipeline control
55 Datapath with control
56 Dependencies
- Problem with starting the next instruction before the first is finished
- Dependencies that go backward in time are data hazards
57 Forwarding
- Use temporary results; don't wait for them to be written
  - Register file forwarding to handle read/write to the same register
  - ALU forwarding
58 Forwarding
  sub $2, $1, $3
  and $12, $2, $5
  or  $13, $6, $2
  add $14, $2, $2
  sw  $15, 100($2)
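A sketch of the forwarding-unit condition for the ALU's first input in this example (field names mirror the pipeline registers; ForwardB, for the second input, is analogous):

    def forward_a(ex_mem, mem_wb, id_ex_rs):
        # EX hazard: the instruction one ahead is about to write the register we need
        if ex_mem["RegWrite"] and ex_mem["Rd"] != 0 and ex_mem["Rd"] == id_ex_rs:
            return "10"   # take the ALU result from the EX/MEM register
        # MEM hazard: the instruction two ahead wrote it (and no closer forward applies)
        if mem_wb["RegWrite"] and mem_wb["Rd"] != 0 and mem_wb["Rd"] == id_ex_rs:
            return "01"   # take the value from the MEM/WB register
        return "00"       # no hazard: use the register-file value

    # and $12, $2, $5 right after sub $2, $1, $3: $2 is forwarded from EX/MEM
    print(forward_a({"RegWrite": 1, "Rd": 2}, {"RegWrite": 0, "Rd": 0}, 2))   # -> "10"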
59 Forwarding
60 Can't always forward
- A load word can still cause a hazard:
  - an instruction tries to read a register following a load instruction that writes to the same register.
61 Stalling
- Hardware detection and no-op insertion is called stalling
- Stall the pipeline by keeping the instruction in the same stage
62 Example
63 (No Transcript)
64 Stall logic
- Stall condition:
  - If (ID/EX.MemRead)            // a load word instruction
    AND ((ID/EX.Rt == IF/ID.Rs) or (ID/EX.Rt == IF/ID.Rt))
- Insert a no-op (no-operation)
  - by deasserting all control signals
- Stall the following instruction
  - by not writing the program counter (PCWrite)
  - and not writing the IF/ID register
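A minimal sketch of that hazard-detection logic (signal and field names mirror the pipeline registers above):

    def must_stall(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
        return id_ex_mem_read and id_ex_rt in (if_id_rs, if_id_rt)

    # lw that writes $2, immediately followed by an instruction that reads $2
    if must_stall(True, 2, 2, 5):
        pc_write = False          # freeze the PC
        if_id_write = False       # freeze the IF/ID register
        control = 0               # zero the control signals: the ID-stage
                                  # instruction becomes a bubble (no-op)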
65 Pipeline with hazard detection
66 Assume that the register file is written in the first half and read in the second half of the clock cycle.

  LOAD1: r2 <- mem(r1 + 0)
  ADD:   r3 <- r3 + r2
  LOAD2: r4 <- mem(r2 + r3)
  SUB:   r4 <- r5 - r3

  Pipeline diagram (S = stall cycle):
  LOAD1: IF ID EX ME WB
  ADD:   IF ID S  S  EX ME WB
  LOAD2: IF S  S  ID EX ME WB
  SUB:   IF ID S  EX ME WB
67 Summary
68 Forwarding Case Summary
69 Multi-cycle
70 Multi-cycle
71 Multi-cycle Pipeline
72 Branch Hazards
73 Branch hazards
- When we decide to branch, other instructions are in the pipeline!
- We are predicting branch not taken
  - Need to add hardware for flushing instructions if we are wrong
74 Solutions to control hazards
- Branch prediction
  - We are predicting branch not taken
  - Need to add hardware for flushing instructions if we are wrong
- Reduce the branch penalty
  - By advancing the branch decision to the ID stage
  - Compare the data read from the two registers read in the ID stage
  - Comparison for equality is a simpler design! (Why?)
  - Still need to flush the instruction in the IF stage
- Make the hazard into a feature!
  - Delayed branch slot: always execute the instruction following the branch
75 Branch detection in ID stage
76 Dynamic branch prediction
- Use the lower part of the instruction address to index a prediction table
- Use one bit to denote branch taken or not taken
  - Disadvantage: poor performance in loops
- Dynamic branch prediction with two bits
  - Use two bits instead of one
  - A branch must go the same way twice before the prediction changes
- More sophisticated schemes
  - Count the number of times the branch is taken
- (Figure: 2-bit branch prediction state diagram)
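A minimal sketch of the 2-bit scheme (my own illustration): each table entry is a saturating counter indexed by the low bits of the branch address, so a single mispredict does not flip the prediction.

    class TwoBitPredictor:
        def __init__(self, index_bits=10):
            self.table = [1] * (1 << index_bits)   # counters 0..3; >= 2 predicts taken
            self.mask = (1 << index_bits) - 1

        def predict(self, pc):
            return self.table[(pc >> 2) & self.mask] >= 2

        def update(self, pc, taken):
            i = (pc >> 2) & self.mask
            self.table[i] = min(3, self.table[i] + 1) if taken else max(0, self.table[i] - 1)

    bp = TwoBitPredictor()
    for outcome in [True, True, False, True]:    # a loop branch: mostly taken
        print(bp.predict(0x400100), outcome)     # prediction vs. actual outcome
        bp.update(0x400100, outcome)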
77 Correlating Branches
- Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
- Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table
- In general, an (m,n) predictor records the last m branches to select between 2^m history tables, each with n-bit counters
  - The old 2-bit BHT is then a (0,2) predictor
- Example code with correlated branches:
  if (aa == 2)
      aa = 0;
  if (bb == 2)
      bb = 0;
  if (aa != bb)
      // do something
78 Correlating Branches
- (2,2) predictor
  - The behavior of the two most recent branches selects between, say, four predictions for the next branch, updating just that prediction
- (Figure: the branch address and the 2-bit global branch history together index 2-bit per-branch predictors to produce the prediction)
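A sketch of a (2,2) predictor along those lines (an illustration, not the exact organization in the figure): a 2-bit global history of the last two branch outcomes selects one of four tables of 2-bit counters.

    class CorrelatingPredictor:
        def __init__(self, index_bits=10, m=2):
            self.m = m
            self.history = 0                              # last m outcomes as bits
            self.tables = [[1] * (1 << index_bits) for _ in range(1 << m)]
            self.mask = (1 << index_bits) - 1

        def predict(self, pc):
            return self.tables[self.history][(pc >> 2) & self.mask] >= 2

        def update(self, pc, taken):
            t, i = self.tables[self.history], (pc >> 2) & self.mask
            t[i] = min(3, t[i] + 1) if taken else max(0, t[i] - 1)
            self.history = ((self.history << 1) | int(taken)) & ((1 << self.m) - 1)

    cp = CorrelatingPredictor()
    print(cp.predict(0x400100))   # initial prediction for some branch address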
79 Accuracy of Different Schemes
- (Chart: frequency of mispredictions, from 0% to 18%, comparing a 4096-entry 2-bit BHT, an unlimited-entry 2-bit BHT, and a 1024-entry (2,2) BHT)
80 Branch Prediction
- Sophisticated techniques:
  - A branch target buffer to help us look up the destination
  - Correlating predictors that base prediction on global behavior and recently executed branches (e.g., prediction for a specific branch instruction based on what happened in previous branches)
  - Tournament predictors that use different types of prediction strategies and keep track of which one is performing best
  - A branch delay slot, which the compiler tries to fill with a useful instruction (make the one-cycle delay part of the ISA)
- Branch prediction is especially important because it enables other, more advanced pipelining techniques to be effective!
- Modern processors predict correctly 95% of the time!
81 Branch Target Buffer
- Branch Target Buffer (BTB): the address of the branch indexes the buffer to get both the prediction AND the branch target address (if taken)
- Note: must check for a branch match now, since we can't use the wrong branch's address
- Return-instruction addresses are predicted with a stack
- (Figure: the BTB lookup yields a taken/not-taken prediction and the predicted PC)
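A minimal sketch of a BTB lookup (keyed by the full branch PC here, which stands in for the index-plus-tag check mentioned above):

    btb = {}   # branch PC -> (predicted_taken, predicted_target)

    def btb_lookup(pc):
        return btb.get(pc)        # None: not known to be a branch, so fetch PC + 4

    def btb_update(pc, taken, target):
        btb[pc] = (taken, target)

    btb_update(0x400100, True, 0x400080)
    print(btb_lookup(0x400100))   # (True, 0x400080): redirect fetch to the target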
82 Scheduling in delayed branching
83 Other issues in pipelines
- Exceptions
  - Errors in the ALU for arithmetic instructions
  - Memory non-availability
- Exceptions lead to a jump in a program
  - However, the current PC value must be saved so that the program can return to it for recoverable errors
- Multiple exceptions can occur in a pipeline
- Preciseness of the exception location is important in some cases
- I/O exceptions are handled in the same manner
84 Exceptions
85 Improving Performance
- Try to avoid stalls! E.g., reorder these instructions:
  - lw $t0, 0($t1)
  - lw $t2, 4($t1)
  - sw $t2, 0($t1)
  - sw $t0, 4($t1)
- Dynamic pipeline scheduling
  - Hardware chooses which instructions to execute next
  - Will execute instructions out of order (e.g., doesn't wait for a dependency to be resolved, but rather keeps going!)
  - Speculates on branches and keeps the pipeline full (may need to roll back if the prediction is incorrect)
- Trying to exploit instruction-level parallelism
86 Advanced Pipelining
- Increase the depth of the pipeline
- Start more than one instruction each cycle (multiple issue)
- Loop unrolling to expose more ILP (better scheduling)
- Superscalar processors
  - DEC Alpha 21264: 9-stage pipeline, 6-instruction issue
  - All modern processors are superscalar and issue multiple instructions, usually with some limitations (e.g., different pipes)
- VLIW: very long instruction word, static multiple issue (relies more on compiler technology)
- This class has given you the background you need to learn more!
87
- Source:
  for (i = 1; i <= 1000; i = i + 1)
      x[i] = x[i] + y;          /* y is a scalar kept in F2 */
- Direct translation:
  Loop: LD    F0, 0(R1)         ; R1 points to x[1000]
        ADDD  F4, F0, F2        ; F2 holds the scalar value
        SD    0(R1), F4
        SUBI  R1, R1, 8
        BNEZ  R1, Loop          ; x[0] is at address 0
88 Reducing stalls
- Pipeline implementation (stall cycles shown explicitly):
  Loop: LD    F0, 0(R1)
        stall
        ADDD  F4, F0, F2
        stall
        stall
        SD    0(R1), F4
        SUBI  R1, R1, 8
        stall
        BNEZ  R1, Loop
        stall
- Scheduled version:
  Loop: LD    F0, 0(R1)
        stall
        ADDD  F4, F0, F2
        SUBI  R1, R1, 8
        BNEZ  R1, Loop
        SD    8(R1), F4         ; fills the branch delay slot; offset is 8 because SUBI has already decremented R1
89 Loop Unrolling
  Loop: LD    F0, 0(R1)
        ADDD  F4, F0, F2
        SD    0(R1), F4         ; drop SUBI & BNEZ
        LD    F6, -8(R1)
        ADDD  F8, F6, F2
        SD    -8(R1), F8        ; drop SUBI & BNEZ
        LD    F10, -16(R1)
        ADDD  F12, F10, F2
        SD    -16(R1), F12      ; drop SUBI & BNEZ
        LD    F14, -24(R1)
        ADDD  F16, F14, F2
        SD    -24(R1), F16
        SUBI  R1, R1, 32
        BNEZ  R1, Loop
90
  Loop: LD    F0, 0(R1)
        LD    F6, -8(R1)
        LD    F10, -16(R1)
        LD    F14, -24(R1)
        ADDD  F4, F0, F2
        ADDD  F8, F6, F2
        ADDD  F12, F10, F2
        ADDD  F16, F14, F2
        SD    0(R1), F4
        SD    -8(R1), F8
        SD    -16(R1), F12
        SUBI  R1, R1, 32
        BNEZ  R1, Loop
        SD    8(R1), F16        ; 8 - 32 = -24
- 14 instructions (3.5 instructions per iteration vs. 6)
91 Superscalar architecture -- two instructions executed in parallel
92 Dynamically scheduled pipeline
93 Motorola G4e
94 Intel Pentium 4
95 IBM PowerPC 970
96 Important facts to remember
- Pipelined processors divide execution into multiple steps
- However, pipeline hazards reduce performance
  - Structural, data, and control hazards
- Data forwarding helps resolve data hazards
  - But not all hazards can be resolved
  - Some data hazards require bubble or no-op insertion
- Effects of control hazards are reduced by branch prediction
  - Predict always taken, delayed slots, branch prediction table
- Structural hazards are resolved by duplicating resources
- Time to execute n instructions depends on
  - Number of stages (k)
  - Number of control hazards and the penalty of each
  - Number of data hazards and the penalty of each
- Time = n + k - 1 + (number of load hazards × load hazard penalty) + (number of branches × branch penalty)
- Load hazard penalty is 1 or 0 cycles, depending on data use with forwarding
- Branch penalty is 3, 2, 1, or 0 cycles, depending on the scheme
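A quick sketch of that cycle count as a function (the penalties are parameters since they depend on the forwarding and branch scheme used):

    def pipeline_cycles(n, k, load_hazards, load_penalty, branches, branch_penalty):
        # n instructions on a k-stage pipeline, plus stall cycles from hazards
        return n + (k - 1) + load_hazards * load_penalty + branches * branch_penalty

    # e.g., 100 instructions, 5 stages, 10 load-use hazards at 1 cycle each
    # (with forwarding), and 15 branches at a 1-cycle penalty:
    print(pipeline_cycles(100, 5, 10, 1, 15, 1))   # -> 129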
97 Design and performance issues with pipelining
- Pipelined processors are not EASY to design
- Technology affects implementation
- Instruction set design affects performance
  - e.g., beq, bne
- More stages do not lead to higher performance!