CS 4290/6290 Lecture 04: MIPS, Dataflow Design, Pipelining

Slides: 67

Provided by: michaelt8

1
CS 4290/6290 Lecture 04: MIPS, Dataflow Design, Pipelining

(Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, Michael Niemier,
and Milos Prvulovic)

2
The organization of a computer

Von Neumann Model
Stored-program machine: instructions are represented as numbers
Programs can be stored in memory to be read/written just like numbers.

[Figure: organization of a computer. Labels: Compiler, Processor (Control, Datapath), Memory, Input, Output]
3
Functions of Each Component

Datapath: performs data manipulation operations
arithmetic logic unit (ALU)
floating point unit (FPU)
Control: directs operation of other components
finite state machines
micro-programming
Memory: stores instructions and data
random access vs. sequential access
volatile vs. non-volatile
RAMs (SRAM, DRAM), ROMs (PROM, EEPROM), disk
tradeoff between speed and cost/bit
Input/Output: I/O devices interface to the environment
mouse, keyboard, display, device drivers

4
The Performance Perspective

Performance of a machine is determined by:
instruction count, clock cycles per instruction, clock cycle time
Processor design (datapath and control) determines:
clock cycles per instruction
clock cycle time
We will discuss two implementations.
Single-Cycle Implementation (a = bx + cx^2 example)
Advantage: one clock cycle per instruction
Disadvantage: less flexible
Multiple-Cycle Implementation (bus based)
Advantage: shorter clock cycle times, different numbers of cycles for different instructions, functional unit sharing

5
MIPS Instruction Formats

All MIPS instructions are 32 bits (4 bytes) long.
R-type
I-Type
J-type
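
The slide's format diagrams did not survive in this transcript. As a reminder, here is a minimal Python sketch of how a 32-bit word splits into fields, assuming the standard MIPS layout (opcode 6 bits, rs/rt/rd/shamt 5 bits each, funct 6 bits, 16-bit immediate, 26-bit jump target). The function name decode and the dictionary keys are illustrative, not from the lecture.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into every field it could contain.

    Which fields are meaningful depends on the format:
      R-type: opcode, rs, rt, rd, shamt, funct
      I-type: opcode, rs, rt, imm16
      J-type: opcode, target
    """
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31..26
        "rs":     (word >> 21) & 0x1F,   # bits 25..21
        "rt":     (word >> 16) & 0x1F,   # bits 20..16
        "rd":     (word >> 11) & 0x1F,   # bits 15..11
        "shamt":  (word >> 6) & 0x1F,    # bits 10..6
        "funct":  word & 0x3F,           # bits 5..0
        "imm16":  word & 0xFFFF,         # bits 15..0
        "target": word & 0x3FFFFFF,      # bits 25..0
    }


# Encoding of the R-type instruction add $5, $6, $7 (rd=5, rs=6, rt=7).
fields = decode(0x00C72820)
print(fields["opcode"], fields["rs"], fields["rt"], fields["rd"], hex(fields["funct"]))
```

The same word always yields all eight fields; the opcode tells the control logic which of them to use.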

6
The MIPS Subset

Consider a subset of instructions:
memory-reference: lw, sw
arithmetic-logical: add, sub, and, or, slt
branching: beq, j
Organizational overview:
fetch an instruction based on the content of PC
decode the instruction
fetch operands (read one or two registers)
execute (effective address calculation / arithmetic-logical operations / comparison)
store result (write to memory / write to register / update PC)

At the simplest level, this is how the Von Neumann, RISC model works
7
Implementation Overview
Simplest view of a Von Neumann, RISC µP

Abstract / Simplified View
2 types of signals: data and control
Clocking strategy: all storage elements are clocked by the same clock edge.

[Figure: abstract datapath. Labels: PC, Instruction Memory (Address, Instruction), Register File (Ra, Rb, Rw), ALU, Data Memory (Address, Data)]
8
Review of Design Steps

Instruction Set Architecture => RTL representation
RTL representation =>
Datapath components
Datapath interconnects
Datapath components => Control signals
Control signals => Control logic
Writing RTL: how many states (cycles) should an instruction take?
CPI
Datapath component sharing

RTL examples: PC <- PC + 4 (or $4 <- $3 + $2)
9
Single Cycle Implementation

Each instruction takes one cycle to complete.
We wait for everything to settle down and for the right thing to be done.
The ALU might not produce the right answer right away (why?)
We use write signals along with the clock to determine when to write.
Cycle time is determined by the length of the longest path.

Referring to 2 slides ago, what instruction takes the longest?
10
An exercise in dataflow design

OK, as a class exercise, we're going to design a simple MIPS dataflow.
FYI, the slides that describe this are in Appendix A,
but let's do this together first
and think about ways to make it better along the way.
We'll use the instruction formats to help.

11
Let's start with a few instructions

For example:
Add $5, $6, $7
SW 0($9), $10
Sub $1, $2, $3
LW $11, 0($12)
We want to execute these instructions in order.
What's the first thing we have to do?

12
Let's say we want to fetch an R-type instruction (arithmetic)

Instruction format
RTL
Instruction fetch: mem[PC]
ALU operation: reg[rd] <- reg[rs] op reg[rt]
Go to next instruction: PC <- PC + 4
Ra, Rb, and Rw come from the instruction's rs, rt, and rd fields.
The actual ALU operation and register write should occur after decoding the instruction.

13
Let's say we want to fetch an I-Type Arithmetic/Logic Instruction

Instruction format
RTL for arithmetic operations, e.g., ADDI
Instruction fetch: mem[PC]
Add operation: reg[rt] <- reg[rs] + SignExt(imm16)
Go to next instruction: PC <- PC + 4
Also, immediate instructions

14
Let's say we want to fetch an I-Type Load/Store Instruction

Instruction format
RTL for load/store operations, e.g., LW
Instruction fetch: mem[PC]
Compute memory address: Addr <- reg[rs] + SignExt(imm16)
Load data into register: reg[rt] <- mem[Addr]
Go to next instruction: PC <- PC + 4
How about store?

Same thing, but the 3rd step becomes a memory write (mem[Addr] <- reg[rt])
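
The RTL for the R-type, immediate, and load/store cases can be tried out in a few lines of Python. This is only a sketch under simplifying assumptions: instructions arrive already decoded as dictionaries, memory is a dictionary keyed by byte address, and register $0 is not hardwired to zero. The names step, sign_ext16, and the state layout are illustrative, not from the lecture.

```python
MASK32 = 0xFFFFFFFF


def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def step(state, instr):
    """Apply one instruction's RTL to state = {'pc', 'reg', 'mem'}."""
    reg, mem = state["reg"], state["mem"]
    op = instr["op"]
    if op in ("add", "sub"):
        # reg[rd] <- reg[rs] op reg[rt]
        a, b = reg[instr["rs"]], reg[instr["rt"]]
        reg[instr["rd"]] = (a + b if op == "add" else a - b) & MASK32
    elif op == "addi":
        # reg[rt] <- reg[rs] + SignExt(imm16)
        reg[instr["rt"]] = (reg[instr["rs"]] + sign_ext16(instr["imm16"])) & MASK32
    elif op in ("lw", "sw"):
        # Addr <- reg[rs] + SignExt(imm16)
        addr = (reg[instr["rs"]] + sign_ext16(instr["imm16"])) & MASK32
        if op == "lw":
            reg[instr["rt"]] = mem.get(addr, 0)   # reg[rt] <- mem[Addr]
        else:
            mem[addr] = reg[instr["rt"]]          # mem[Addr] <- reg[rt]
    else:
        raise ValueError(f"unsupported op: {op}")
    state["pc"] = (state["pc"] + 4) & MASK32      # PC <- PC + 4
    return state


state = {"pc": 0, "reg": [0] * 32, "mem": {0x100: 42}}
state["reg"][12] = 0x100
step(state, {"op": "lw", "rt": 11, "rs": 12, "imm16": 0})         # $11 = mem[0x100]
step(state, {"op": "addi", "rt": 9, "rs": 12, "imm16": 0xFFFC})   # $9 = 0x100 - 4
step(state, {"op": "sw", "rt": 11, "rs": 9, "imm16": 0})          # mem[0xFC] = $11
print(state["pc"], state["reg"][11], hex(state["reg"][9]), state["mem"][0xFC])
```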
15
Let's say we want to fetch an I-Type Branch Instruction

Instruction format
RTL for branch operations, e.g., BEQ
Instruction fetch: mem[PC]
Compute condition: Cond <- reg[rs] - reg[rt]
Calculate the next instruction's address:
if (Cond eq 0) then
PC <- PC + 4 + (SignExt(imm16) x 4)
else ?

16
Let's say we want to fetch a J-Type Jump Instruction

Instruction format
RTL for jump operations, e.g., J
Instruction fetch: mem[PC]
Set up PC: PC <- (PC + 4)<31:28> CONCAT (target<25:0> x 4)
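
The next-PC rules for BEQ and J can be checked with a short Python sketch. It assumes 32-bit addresses and that a branch which is not taken simply falls through to PC + 4, which is one answer to the "else ?" on the branch slide. The function names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def next_pc_beq(pc, rs_val, rt_val, imm16):
    """BEQ: Cond <- reg[rs] - reg[rt]; taken when Cond is zero."""
    cond = (rs_val - rt_val) & MASK32
    if cond == 0:
        # PC <- PC + 4 + (SignExt(imm16) x 4)
        return (pc + 4 + (sign_ext16(imm16) << 2)) & MASK32
    return (pc + 4) & MASK32  # not taken: fall through


def next_pc_jump(pc, target26):
    """J: upper 4 bits of PC + 4, concatenated with target x 4 (28 bits)."""
    return ((pc + 4) & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


print(hex(next_pc_beq(0x1000, 5, 5, 3)))       # taken, forward by 3 words
print(hex(next_pc_beq(0x1000, 5, 5, 0xFFFF)))  # taken, back by 1 word
print(hex(next_pc_beq(0x1000, 5, 6, 3)))       # not taken
print(hex(next_pc_jump(0x10001000, 0x100)))
```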

17
What do we get? A Single Cycle Datapath
[Figure: single-cycle datapath. Labels: PCSrc, Add, 4, ALUctr (3 bits), MemWrite, ALUSrc, MemtoReg, Zero, ALU, Read address, ALU result, Mux, Data memory, Write data, RegWrite, Sign extend, MemRead]
If you don't understand this, take a look at Appendix A
18
Control Logic
19
The HW needed, plus control
Single cycle MIPS machine
When we talk about control, we talk about these
blocks
20
Implementing Control

Implementation Steps Review
Identify control inputs and control outputs
Make a control signal table for each cycle
Derive control logic from the control table
As you've seen (and as we'll review), this logic can take on many forms: combinational logic, ROMs, microcode, or combinations

21
Single Cycle Control Input/Output

Control Inputs
Opcode (6 bits)
How about R-type instructions?
Control Outputs
RegDst
ALUSrc
MemtoReg
RegWrite
MemRead
MemWrite
Branch
Jump
ALUctr

Step 2: Make a control signal table for each cycle
22
Control Signal Table
[Table: control signal table mapping inputs (instruction type, e.g., R-type) to outputs (control signals)]
23
The HW needed, plus control
Single cycle MIPS machine
24
Main control, ALU control
[Figure: Main Control takes the 6-bit OP (opcode) field and produces the 2-bit ALUOp plus the other control signals; ALU Control takes the 6-bit Func field and ALUOp and produces the 3-bit ALUctr for the ALU]

Use the OP field to generate ALUOp (encoding)
Control signal fed to the ALU control block
Use the Func field and ALUOp to generate ALUctr (decoding)
Specifically, this sets 3 ALU control signals:
B-Invert, Carry-in, operation

25
Main control, ALU control
Or in other words:
00: ALU performs add
01: ALU performs sub
10: ALU does what the function code says
(see p. 284 for more)
26
Generating ALUctr

We want these outputs:

and - 00
or - 01
adder - 10
less - 11
(the low two bits select the ALU output through the mux)
ALUctr<2>: B-negate (C-in and B-invert)
ALUctr<1>: Select ALU Output
ALUctr<0>: Select ALU Output
Invert B and C-in must be 1 for subtract
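
The two-level decode (ALUOp from the main control, then ALUctr from ALUOp and the function code) can be sketched in Python. The ALUctr values follow the encoding above: bit 2 is B-negate and the low two bits select and/or/adder/less. The funct codes are the standard MIPS ones for add, sub, and, or, and slt; the function and constant names are illustrative assumptions.

```python
# 3-bit ALUctr values: bit 2 = B-negate, bits 1..0 = output select.
ALU_AND = 0b000
ALU_OR = 0b001
ALU_ADD = 0b010
ALU_SUB = 0b110   # adder with B inverted and carry-in = 1
ALU_SLT = 0b111   # "less" output, which needs the subtraction

# Standard MIPS funct codes for the R-type subset in this lecture.
FUNCT_TO_ALUCTR = {
    0b100000: ALU_ADD,  # add
    0b100010: ALU_SUB,  # sub
    0b100100: ALU_AND,  # and
    0b100101: ALU_OR,   # or
    0b101010: ALU_SLT,  # slt
}


def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and 6-bit funct field to the 3-bit ALUctr."""
    if alu_op == 0b00:   # lw / sw: add for the address
        return ALU_ADD
    if alu_op == 0b01:   # beq: subtract for the comparison
        return ALU_SUB
    if alu_op == 0b10:   # R-type: do what the function code says
        return FUNCT_TO_ALUCTR[funct]
    raise ValueError(f"unused ALUOp encoding: {alu_op:02b}")


print(format(alu_control(0b00, 0), "03b"))         # load/store address add
print(format(alu_control(0b01, 0), "03b"))         # branch compare
print(format(alu_control(0b10, 0b101010), "03b"))  # slt
```

For ALUOp values 00 and 01 the funct argument is a don't-care, which is why the hardware table has so many X entries.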
27
The Logic
This table is used to generate the actual Boolean logic gates that produce ALUctr.
The gates could be generated by hand, but this is often done with software.
[Figure: gate-level logic producing ALUctr<2>, ALUctr<1>, and ALUctr<0> from ALUOp1, ALUOp0, and func<5:0> (F3, F2, F1, F0). Example shown: ALUctr<2> (SUB/BEQ)]
28
Recall
Single cycle MIPS machine
Recall, for MIPS, we have to build a Main Control
Block and an ALU Control Block
29
Well, here's what we did
Single cycle MIPS machine
We came up with the information to generate this
logic which would fit here in the datapath.
30
Single cycle versus multi-cycle
31
Single Cycle Implementation

Calculate cycle time assuming negligible delays except:
memory (2 ns), ALU and adders (2 ns), register file access (1 ns)
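
One way to work this exercise is to add up the unit delays along each instruction class's path. The sketch below uses the delays given on the slide; the list of units each class passes through is an assumption based on the usual single-cycle datapath (for example, lw uses instruction memory, register read, ALU, data memory, register write).

```python
# Unit delays from the slide, in ns.
DELAY_NS = {"mem": 2, "alu": 2, "reg": 1}

# Assumed sequence of units on each instruction class's critical path.
PATHS = {
    "R-type": ["mem", "reg", "alu", "reg"],         # fetch, read, ALU, write back
    "lw":     ["mem", "reg", "alu", "mem", "reg"],  # fetch, read, addr, load, write back
    "sw":     ["mem", "reg", "alu", "mem"],         # fetch, read, addr, store
    "beq":    ["mem", "reg", "alu"],                # fetch, read, compare
    "j":      ["mem"],                              # fetch only
}


def path_delay(units):
    """Total delay in ns along one path."""
    return sum(DELAY_NS[u] for u in units)


delays = {name: path_delay(units) for name, units in PATHS.items()}
cycle_time = max(delays.values())  # the clock must fit the slowest instruction

for name, ns in delays.items():
    print(f"{name:7s} {ns} ns")
print("single-cycle clock period:", cycle_time, "ns")
```

Under these assumptions the load sets the clock period, since it is the only class that uses every unit.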

32
Single-Cycle Implementation (Cont'd)

Single-cycle, fixed-length clock
CPI = 1
Clock cycle = propagation delay of the longest datapath operations among all instruction types
Easy to implement
Single-cycle, variable-length clock
CPI = 1
Clock cycle = sum over i of (fraction of type-i instructions x propagation delay of the type-i instruction datapath operations)
Better than the previous, but impractical to implement
Disadvantages
What if we have floating-point operations?
How about component usage?
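
To compare the two clocking schemes numerically, here is a small sketch. The per-class delays (in ns) assume the unit delays from the previous slide; the instruction mix is hypothetical and chosen only for illustration, so the resulting numbers are not from the lecture.

```python
# Per-class propagation delays in ns (assumed from the unit delays on the
# previous slide: memory 2 ns, ALU 2 ns, register file 1 ns).
DELAY_NS = {"lw": 8, "sw": 7, "R-type": 6, "beq": 5, "j": 2}

# Hypothetical instruction mix (fractions must sum to 1).
MIX = {"lw": 0.25, "sw": 0.10, "R-type": 0.45, "beq": 0.15, "j": 0.05}


def fixed_clock(delays):
    """Fixed-length clock: the period is the slowest instruction's delay."""
    return max(delays.values())


def variable_clock(delays, mix):
    """Variable-length clock: average period weighted by instruction mix."""
    return sum(mix[name] * delays[name] for name in mix)


fixed = fixed_clock(DELAY_NS)
variable = variable_clock(DELAY_NS, MIX)
print(f"fixed clock:    {fixed:.2f} ns per instruction")
print(f"variable clock: {variable:.2f} ns per instruction")
print(f"ratio:          {fixed / variable:.2f}x")
```

Both schemes have CPI = 1, so the ratio of the two periods is also the ratio of execution times for this mix.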

33
Multiple Cycle Alternative

Break an instruction into smaller steps
Execute each step in one cycle.
Execution sequence:
Balance the amount of work to be done
Restrict each cycle to use only one major functional unit
At the end of a cycle:
Store values for use in later cycles (why?)
Introduce additional internal registers
The advantages:
Cycle time is much shorter
Different instructions take different numbers of cycles to complete
A functional unit can be used more than once per instruction

34
Step 1: Instruction Fetch

Use PC to get instruction, put it in IR.
Increment PC by 4, put the result back in PC.
Can you write this using the RTL notation?
IR <- Memory[PC]; PC <- PC + 4
What is the advantage of updating the PC now?

35
Step 2: I-Decode and Register Fetch

Read registers rs and rt in case we need them
Compute branch address in case the instruction is a branch
RTL: A <- Reg[IR[25-21]]
B <- Reg[IR[20-16]]
ALUOut <- PC + (sign-extend(IR[15-0]) << 2)
Did we set any control lines based on the instruction type? (We are busy "decoding" it in our control logic.)

(These RTL operations happen in parallel.)
36
Step 3 (Instruction dependent)

ALU is performing 1 of 3 functions, based on instruction type:
Memory Reference: ALUOut <- A + sign-extend(IR[15-0])
R-type: ALUOut <- A op B
Branch: if (A == B) then PC <- ALUOut

37
Step 4 (R-type or memory-access)

Loads and stores access memory: MDR <- Memory[ALUOut] or Memory[ALUOut] <- B
R-type instructions finish: Reg[IR[15-11]] <- ALUOut
When does the write actually take place?
At the end of the cycle, on the clock edge.

38
Step 5: Write-Back

Reg[IR[20-16]] <- MDR
What about all the other instructions?
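
Since each instruction class finishes at a different step, the multi-cycle CPI depends on the instruction mix. The sketch below takes the cycle counts from the five steps above (branches finish in step 3, stores and R-type in step 4, loads in step 5). The mix is hypothetical, chosen only for illustration.

```python
# Last step used by each instruction class, per the five steps above.
CYCLES = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3}

# Hypothetical instruction mix (fractions must sum to 1).
MIX = {"lw": 0.25, "sw": 0.10, "R-type": 0.50, "beq": 0.15}


def average_cpi(cycles, mix):
    """Average clock cycles per instruction, weighted by the mix."""
    return sum(mix[name] * cycles[name] for name in mix)


def execution_time_ns(instruction_count, cpi, cycle_ns):
    """Time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_ns


cpi = average_cpi(CYCLES, MIX)
print(f"average CPI: {cpi:.2f}")
# Assumed 2 ns cycle: each step uses at most one major functional unit.
print(f"1000 instructions: {execution_time_ns(1000, cpi, 2):.0f} ns")
```

The 2 ns cycle is an assumption: it is the slowest single unit from the earlier delay exercise, ignoring the overhead of the added internal registers.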

39
Single cycle
40
Multiple Cycle Design

Break up instructions into steps, each step takes
1 cycle
balance work to be done
restrict each cycle to use only 1 major
functional unit
At the end of a cycle
store values for use in later cycles (easiest
thing to do)
introduce additional internal registers

41
Execution Sequence Summary
IR <- Memory[PC]
PC <- PC + 4
A <- Reg[IR(25:21)]
B <- Reg[IR(20:16)]
ALUOut <- PC + (SignExt(IR(15:0)) << 2)
42
Control Signals
New (multi-cycle):

PC: PCWrite, PCWriteCond, PCSource
Memory: IorD, MemRead, MemWrite
IR: IRWrite
Reg. File: RegWrite, MemtoReg, RegDst
ALU: ALUSrcA, ALUSrcB, ALUOp, ALUCnt.

Old (single-cycle):
RegDst, MemToReg, RegWrite, MemRead, MemWrite, Branch, ALUSrc, ALUOp, ALUCnt.
43
Implementing the Control

The value of the control signals depends upon:
what instruction is being executed
which step is being performed
Use the accumulated information to specify a finite state machine:
use a state diagram, or
use microprogramming
The implementation can be derived from the specification

44
Graphical Specification of FSM
[Figure: state diagram, listed here as states and the control signals asserted in each]
State 0 (start), Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
State 1, Instruction decode / Register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
State 3, Memory access (load): MemRead, IorD = 1
State 4, Memory read completion: RegDst = 0, RegWrite, MemToReg = 1
State 5, Memory access (store): MemWrite, IorD = 1
State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
State 7, R-type completion: RegDst = 1, RegWrite, MemToReg = 0
State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
State 9, Jump completion: PCWrite, PCSource = 10
Tells us what values are needed and during what step
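
The state diagram can be written down as a next-state function. The sketch below uses the state numbers from the diagram (0 to 9); the transition arrows did not survive in this transcript, so the transitions here are an assumption based on the usual textbook version of this FSM. The names next_state and trace are illustrative.

```python
STATE_NAMES = {
    0: "Instruction fetch",
    1: "Instruction decode / Register fetch",
    2: "Memory address computation",
    3: "Memory access (load)",
    4: "Memory read completion",
    5: "Memory access (store)",
    6: "Execution",
    7: "R-type completion",
    8: "Branch completion",
    9: "Jump completion",
}

# Assumed dispatch out of the decode state, by instruction class.
AFTER_DECODE = {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}


def next_state(state, op):
    """Return the state that follows `state` while executing `op`."""
    if state == 0:
        return 1
    if state == 1:
        return AFTER_DECODE[op]
    if state == 2:
        return 3 if op == "lw" else 5
    if state == 3:
        return 4
    if state == 6:
        return 7
    return 0  # states 4, 5, 7, 8, 9 finish the instruction


def trace(op):
    """List the states visited by one instruction, starting at fetch."""
    states = [0]
    while True:
        nxt = next_state(states[-1], op)
        if nxt == 0:
            return states
        states.append(nxt)


for op in ("lw", "sw", "R-type", "beq", "j"):
    print(f"{op:7s} {trace(op)}  ({len(trace(op))} cycles)")
```

The length of each trace is the cycle count for that instruction class, which ties the diagram back to the five execution steps.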
45
Finite State Machine for Control
[Figure: control logic box with inputs, outputs, and state feedback]
Control logic is inside this box (it could be implemented in many different ways: ROM, logic, etc.)
The outputs that we want are now also dependent on the current state.
The inputs now also include the previous state.
(We still might need the ALU control logic, and hence the function code, developed earlier.)
46
Microprogramming

For our example, state diagrams and combinational logic are more than adequate
But we're dealing with a small subset of the MIPS processor
The full MIPS instruction set has over 100 instructions
In one implementation, instructions take from 1 to 20 clock cycles
Control would be much more complex for this case
Another alternative: microcoding
Think of the control signals that must be asserted in a state as an instruction to be executed by the datapath
Call these microinstructions

47
The entire microprogram
48
Sample Microinstruction

Ifetch: IR <- Mem[PC]; PC <- PC + 4

Microinstruction: 1d011ddd000100d11
49
Pipelining
50
Pipelining: It's Natural!

Laundry Example
Ann, Brian, Cathy, Dave each have one load of
clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
Folder takes 20 minutes

51
Sequential Laundry
[Figure: timeline from 6 PM to midnight; the four loads run one after another in task order, each taking 30 + 40 + 20 minutes]

Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?

52
Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; the four loads overlap in task order]
Note: more time to go out later that night

Pipelined laundry takes 3.5 hours for 4 loads
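
The 6-hour and 3.5-hour figures can be reproduced with a short simulation. This is a sketch that assumes each load may wait between stages (in a hamper, say) until the next machine is free; the function names are illustrative.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, folder


def sequential_minutes(loads):
    """Each load finishes all three stages before the next one starts."""
    return loads * sum(STAGE_MINUTES)


def pipelined_minutes(loads):
    """Each load enters a stage as soon as both it and the stage are ready."""
    stage_free = [0] * len(STAGE_MINUTES)  # time at which each stage frees up
    finish = 0
    for _ in range(loads):
        t = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(STAGE_MINUTES):
            start = max(t, stage_free[i])
            t = start + duration
            stage_free[i] = t
        finish = t
    return finish


print(sequential_minutes(4) / 60, "hours sequential")
print(pipelined_minutes(4) / 60, "hours pipelined")
```

After the first load, a new load finishes every 40 minutes: the dryer, the slowest stage, sets the rate.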

53
Pipelining Lessons

Multiple tasks operating simultaneously
Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Also, time is needed to fill and drain the pipeline.

54
Pipelining: Some terms

If you're doing laundry or implementing a µP, each stage where something is done is called a pipe stage
In the laundry example, the washer, dryer, and folding table are pipe stages; clothes enter at one end and exit at the other
In a µP, instructions enter at one end and have been executed when they leave
Another example: auto assembly line
Throughput is how often stuff comes out of a pipeline

55
More technical detail

If times for all S stages are equal to T:
Time for one initiation to complete is still S x T
Time between 2 initiations is T, not S x T
Initiations per second = 1/T
Pipelining: overlap multiple executions of the same sequence
Improves THROUGHPUT, not the time to perform a single operation
Other examples:
Automobile assembly plant, chemical factory, garden hose, cooking

56
More technical detail

Book's approach to drawing pipeline timing diagrams:
Time runs left-to-right, in units of stage time
Each row below corresponds to a distinct initiation
The boundary between 2 column entries is a pipeline register (i.e., hamper)
Must look at column contents to see what stage is doing what

Time for N initiations to complete: N x T + (S-1) x T
Throughput: time per initiation = T + (S-1) x T / N -> T!
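
A timing diagram in this style can be generated programmatically, which also makes the N x T + (S-1) x T count visible as the number of columns. The stage names IF, ID, EX, MEM, WB are an assumption taken from the usual five-stage MIPS pipeline; the function names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # assumed five-stage pipeline


def timing_diagram(num_initiations, stages):
    """One row per initiation, one column per stage time."""
    total_cycles = num_initiations + len(stages) - 1
    rows = []
    for i in range(num_initiations):
        row = []
        for cycle in range(total_cycles):
            k = cycle - i  # initiation i enters the pipeline at cycle i
            row.append(stages[k] if 0 <= k < len(stages) else "")
        rows.append(row)
    return rows


def total_time(n, s, t):
    """Time for n initiations through s stages of length t."""
    return n * t + (s - 1) * t


for i, row in enumerate(timing_diagram(3, STAGES)):
    print(f"init {i}: " + " ".join(f"{cell:>3}" for cell in row))
print("total stage times:", total_time(3, len(STAGES), 1))
```

Reading down any one column shows which initiation occupies which stage during that stage time.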
57
Ideal digital system pipeline speedup
Unpipelined
[Figure: four blocks of combinational logic in series, each with delay t, with latches at the ends]
delay for 1 piece of data = 4t + latch setup (assume small)
approximate delay for 1000 pieces of data = 4000t
Pipelined
[Figure: the same four blocks of combinational logic, each with delay t, separated by latches]
delay for 1 piece of data = 4(t + latch setup)
approximate delay for 1000 pieces of data = 3t + 1000t
speedup for 1000 pieces of data = 4000/1003, approximately 4
Ideal speedup = number of pipeline stages
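
The speedup arithmetic generalizes to any number of stages and data items. The sketch below ignores latch setup time, as the slide's approximation does; the function names are illustrative.

```python
def unpipelined_time(n, stages, t):
    """Each item passes through all stages before the next one starts."""
    return n * stages * t


def pipelined_time(n, stages, t):
    """(stages - 1) stage times to fill the pipe, then one item per stage time."""
    return (stages - 1) * t + n * t


def speedup(n, stages, t=1.0):
    """Ratio of unpipelined to pipelined time for n items."""
    return unpipelined_time(n, stages, t) / pipelined_time(n, stages, t)


for n in (1, 10, 1000, 1_000_000):
    print(f"n = {n:>7}: speedup = {speedup(n, 4):.3f}")
```

For a single item there is no speedup at all; the ratio approaches the number of stages only as the number of items grows.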
58
The new look dataflow
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB. Labels: PC, ADD (+4), Mux, Inst. Memory, Register File (IR6...10, IR11..15), Sign Extend (16 to 32), Comp. (Branch taken), ALU, Data Mem., MEM/WB.IR]
Data must be stored from one stage to the next in pipeline registers/latches. These hold temporary values between clocks, plus the information needed for execution.
59
Another way to look at it
[Figure: pipeline diagram. Horizontal axis: time (clock number); vertical axis: program execution order (in instructions)]
60
So, what about the details?

In each cycle, a new instruction is fetched and begins its 5-cycle execution
In a perfect world (pipeline), performance improves 5 times over!
So, that's it, huh? Hardly!!!
What else do we have to worry about?
Must know what's going on in every cycle of the machine
What if 2 instructions try to use the same resource at the same time?
(LOTS more on this later)
Separate instruction/data memories, multiple register ports, etc. help avoid this

61
Limits, limits, limits

So, now that the ideal stuff is out of the way, let's look at how a pipeline REALLY works
Pipelines are slowed because of:
Pipeline latency
Imbalance of pipeline stages
(Think: a chain is only as strong as its weakest link)
Well, a pipeline is only as fast as its slowest stage
Pipeline overhead (from where?)
Register delay from pipe stage latches
Clock skew
Once a clock cycle is as small as the sum of the clock skew and latch overhead, you can't get any work done
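
The effect of pipeline overhead on speedup can be illustrated with a simple model. The numbers are hypothetical: 10 ns of total logic split evenly across the stages, plus a fixed 0.5 ns of latch delay and clock skew per stage. The model also assumes the unpipelined design pays no such overhead.

```python
def pipelined_cycle_ns(total_logic_ns, stages, overhead_ns):
    """Cycle time = one stage's share of the logic + latch delay and skew."""
    return total_logic_ns / stages + overhead_ns


def speedup(total_logic_ns, stages, overhead_ns):
    """Steady-state speedup over the unpipelined design (one result per cycle)."""
    return total_logic_ns / pipelined_cycle_ns(total_logic_ns, stages, overhead_ns)


TOTAL_LOGIC_NS = 10.0   # hypothetical
OVERHEAD_NS = 0.5       # hypothetical latch delay + clock skew

for stages in (1, 5, 10, 20, 100):
    s = speedup(TOTAL_LOGIC_NS, stages, OVERHEAD_NS)
    print(f"{stages:>3} stages: speedup {s:.2f} (ideal {stages})")
print("upper bound:", TOTAL_LOGIC_NS / OVERHEAD_NS)
```

However deep the pipeline, the speedup in this model never exceeds total logic delay divided by overhead, because the cycle cannot shrink below the overhead itself.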

62
Note