Title: CS 4290/6290 Lecture 04: MIPS, Dataflow Design, Pipelining
1 CS 4290/6290 Lecture 04: MIPS, Dataflow Design, Pipelining
- (Lectures based on the work of Jay Brockman,
Sharon Hu, Randy Katz, Peter Kogge, Bill Leahy,
Ken MacKenzie, Richard Murphy, Michael Niemier,
and Milos Prvulovic)
2 The organization of a computer
- Von Neumann Model
- Stored-program machine: instructions are represented as numbers
- Programs can be stored in memory to be read/written just like numbers.
[Figure: a compiler feeds the computer, which consists of the processor (control and datapath), memory, input, and output.]
3 Functions of Each Component
- Datapath: performs data manipulation operations
  - arithmetic logic unit (ALU)
  - floating point unit (FPU)
- Control: directs operation of other components
  - finite state machines
  - micro-programming
- Memory: stores instructions and data
  - random access vs. sequential access
  - volatile vs. non-volatile
  - RAMs (SRAM, DRAM), ROMs (PROM, EEPROM), disk
  - tradeoff between speed and cost/bit
- Input/Output: I/O devices interface to the environment
  - mouse, keyboard, display, device drivers
4 The Performance Perspective
- Performance of a machine is determined by
  - instruction count, clock cycles per instruction, clock cycle time
- Processor design (datapath and control) determines
  - clock cycles per instruction
  - clock cycle time
- We will discuss two implementations.
  - Single-cycle implementation (a bx cx2 example)
    - Advantage: one clock cycle per instruction
    - Disadvantage: less flexible
  - Multiple-cycle implementation (bus based)
    - Advantage: shorter clock cycle times, a different number of cycles for different instructions, functional unit sharing
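The three factors above multiply together to give execution time. A minimal sketch, assuming the usual relation (CPU time = instruction count x CPI x clock cycle time); the function name and the sample numbers are illustrative and do not come from the lecture.

```python
def cpu_time_ns(instruction_count, cpi, cycle_time_ns):
    """Execution time in ns: instruction count x cycles per instruction x cycle time."""
    return instruction_count * cpi * cycle_time_ns

# Hypothetical 1000-instruction program (numbers chosen only for illustration).
single_cycle = cpu_time_ns(1000, cpi=1.0, cycle_time_ns=8.0)  # CPI = 1, long cycle
multi_cycle = cpu_time_ns(1000, cpi=3.5, cycle_time_ns=2.0)   # CPI > 1, short cycle
print(single_cycle, multi_cycle)
```

Neither design wins automatically: the multi-cycle machine only comes out ahead if its shorter clock outweighs its higher CPI for the instruction mix at hand.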
5 MIPS Instruction Formats
- All MIPS instructions are 32 bits (4 bytes) long.
- R-type
- I-type
- J-type
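The three formats share one 32-bit word and differ only in how the bits below the opcode are divided. A small sketch of the standard MIPS field layout; the `decode` helper and its dictionary keys are illustrative names, not part of the lecture.

```python
def decode(word):
    """Split a 32-bit MIPS instruction word into the fields of all three formats."""
    return {
        "op":     (word >> 26) & 0x3F,   # bits 31-26, every format
        "rs":     (word >> 21) & 0x1F,   # bits 25-21, R- and I-type
        "rt":     (word >> 16) & 0x1F,   # bits 20-16, R- and I-type
        "rd":     (word >> 11) & 0x1F,   # bits 15-11, R-type
        "shamt":  (word >> 6) & 0x1F,    # bits 10-6, R-type
        "funct":  word & 0x3F,           # bits 5-0, R-type
        "imm16":  word & 0xFFFF,         # bits 15-0, I-type
        "target": word & 0x3FFFFFF,      # bits 25-0, J-type
    }

# add $5, $6, $7 encodes as op=0, rs=6, rt=7, rd=5, shamt=0, funct=0x20.
fields = decode(0x00C72820)
print(fields)
```

Every field is extracted unconditionally; which ones are meaningful depends on the opcode, which is what the control logic later decides.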
6 The MIPS Subset
- Consider a subset of instructions
  - memory-reference: lw, sw
  - arithmetic-logical: add, sub, and, or, slt
  - branching: beq, j
- Organizational overview
  - fetch an instruction based on the content of PC
  - decode the instruction
  - fetch operands
    - (read one or two registers)
  - execute
    - (effective address calculation / arithmetic-logical operations / comparison)
  - store result
    - (write to memory / write to register / update PC)
At the simplest level, this is how the Von Neumann, RISC model works.
7 Implementation Overview
Simplest view of a Von Neumann, RISC µP
- Abstract / simplified view
- 2 types of signals: data and control
- Clocking strategy: all storage elements are clocked by the same clock edge.
[Figure: the PC supplies the address to instruction memory; the instruction supplies Ra, Rb, and Rw to the register file; the register file feeds the ALU; the ALU supplies the address to data memory, which has data in and data out.]
8 Review of Design Steps
- Instruction set architecture => RTL representation
- RTL representation =>
  - datapath components
  - datapath interconnects
- Datapath components => control signals
- Control signals => control logic
- Writing RTL: how many states (cycles) should an instruction take?
  - CPI
  - datapath component sharing
RTL example: PC <- PC + 4 (or $4 <- $3 + $2)
[Callout repeated between the steps: "need these to do"]
9 Single Cycle Implementation
- Each instruction takes one cycle to complete.
- We wait for everything to settle down, and for the right thing to be done
  - the ALU might not produce the right answer right away (why?)
  - we use write signals along with the clock to determine when to write
- Cycle time is determined by the length of the longest path
Referring to 2 slides ago, what instruction takes the longest?
10 An exercise in dataflow design
- OK, as a class exercise, we're going to design a simple MIPS dataflow.
- FYI, the slides that describe this are in Appendix A
  - but let's do this together first
  - and think about ways to make it better along the way
- We'll use the instruction formats to help
11 Let's start with a few instructions
- For example
  - Add $5, $6, $7
  - SW 0($9), $10
  - Sub $1, $2, $3
  - LW $11, 0($12)
- We want to execute these instructions in order.
- What's the first thing we have to do?
12 Let's say we want to fetch an R-type instruction (arithmetic)
- Instruction format
- RTL
  - Instruction fetch: mem[PC]
  - ALU operation: reg[rd] <- reg[rs] op reg[rt]
  - Go to next instruction: PC <- PC + 4
- Ra, Rb and Rw are from the instruction's rs, rt, rd fields.
- Actual ALU operation and register write should occur after decoding the instruction.
13 Let's say we want to fetch an I-type arithmetic/logic instruction
- Instruction format
- RTL for arithmetic operations, e.g., ADDI
  - Instruction fetch: mem[PC]
  - Add operation: reg[rt] <- reg[rs] + SignExt(imm16)
  - Go to next instruction: PC <- PC + 4
- Also, immediate instructions
14 Let's say we want to fetch an I-type load/store instruction
- Instruction format
- RTL for load/store operations, e.g., LW
  - Instruction fetch: mem[PC]
  - Compute memory address: Addr <- reg[rs] + SignExt(imm16)
  - Load data into register: reg[rt] <- mem[Addr]
  - Go to next instruction: PC <- PC + 4
- How about store?
  - Same thing, but replace the 3rd step with a memory write (mem[Addr] <- reg[rt])
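The load and store RTL can be exercised directly. A minimal sketch, assuming a word-addressed dictionary as a stand-in for data memory; `sign_ext16`, `lw`, and `sw` are illustrative helpers, and the PC update is left out. In MIPS the value a store writes comes from the register named by the rt field.

```python
def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def lw(reg, mem, rs, rt, imm16):
    addr = (reg[rs] + sign_ext16(imm16)) & 0xFFFFFFFF  # Addr <- reg[rs] + SignExt(imm16)
    reg[rt] = mem[addr]                                # reg[rt] <- mem[Addr]

def sw(reg, mem, rs, rt, imm16):
    addr = (reg[rs] + sign_ext16(imm16)) & 0xFFFFFFFF  # same address computation as lw
    mem[addr] = reg[rt]                                # mem[Addr] <- reg[rt]

reg = [0] * 32
mem = {}                                  # hypothetical memory: address -> value
reg[9], reg[10] = 100, 42
sw(reg, mem, rs=9, rt=10, imm16=0xFFFC)   # offset -4, so address 96
lw(reg, mem, rs=9, rt=11, imm16=0xFFFC)   # reads the value back into $11
print(mem, reg[11])
```

Both instructions use the ALU for the same addition, which is why one address path in the datapath serves loads and stores alike.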
15 Let's say we want to fetch an I-type branch instruction
- Instruction format
- RTL for branch operations, e.g., BEQ
  - Instruction fetch: mem[PC]
  - Compute condition: Cond <- reg[rs] - reg[rt]
  - Calculate the next instruction's address
    - if (Cond eq 0) then
      - PC <- PC + 4 + (SignExt(imm16) x 4)
    - else ?
16 Let's say we want to fetch a J-type jump instruction
- Instruction format
- RTL operations, e.g., J
  - Instruction fetch: mem[PC]
  - Set up PC: PC <- (PC + 4)<31:28> CONCAT (target<25:0> x 4)
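The next-PC rules for branches and jumps can be written as two small functions. This is a sketch under the standard MIPS definitions (a 26-bit target shifted left by 2 fills bits 27:0, so the upper 4 bits come from PC + 4); the function names are illustrative.

```python
MASK32 = 0xFFFFFFFF

def sign_ext16(imm16):
    """Sign-extend a 16-bit immediate to a Python int."""
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def next_pc_beq(pc, a, b, imm16):
    """Taken: PC + 4 + (SignExt(imm16) x 4). Not taken: PC + 4."""
    pc4 = (pc + 4) & MASK32
    if a - b == 0:                                   # Cond <- reg[rs] - reg[rt]
        return (pc4 + (sign_ext16(imm16) << 2)) & MASK32
    return pc4

def next_pc_jump(pc, target26):
    """Upper 4 bits of PC + 4 concatenated with target x 4."""
    pc4 = (pc + 4) & MASK32
    return (pc4 & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)

print(hex(next_pc_beq(0x1000, 5, 5, 3)), hex(next_pc_jump(0x40001000, 0x100)))
```

The branch offset is relative to PC + 4 rather than PC, which is why the datapath can reuse the incremented PC for both the fall-through and the taken case.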
17 What do we get? A Single Cycle Datapath
[Figure: single-cycle datapath. Recoverable labels: PC, PCSrc, Add, 4, shift left 2, instruction memory, registers, sign extend, ALU, ALUctr (3 bits), Zero, ALUSrc, data memory (read address, write data), MemRead, MemWrite, MemtoReg, RegWrite, Mux.]
If you don't understand this, take a look at Appendix A.
18 Control Logic
19 The HW needed, plus control
Single cycle MIPS machine
When we talk about control, we talk about these blocks.
20 Implementing Control
- Implementation steps (review)
  - Identify control inputs and control outputs
  - Make a control signal table for each cycle
  - Derive control logic from the control table
- As you've seen (and as we'll review), this logic can take on many forms: combinational logic, ROMs, microcode, or combinations
21 Single Cycle Control Input/Output
- Control Inputs
  - Opcode (6 bits)
  - How about R-type instructions?
- Control Outputs
  - RegDst
  - ALUSrc
  - MemtoReg
  - RegWrite
  - MemRead
  - MemWrite
  - Branch
  - Jump
  - ALUctr
Step 2: Make a control signal table for each cycle
22 Control Signal Table
[Table: instruction types as inputs (R-type, ...) against the control outputs.]
23 The HW needed, plus control
Single cycle MIPS machine
24 Main control, ALU control
[Figure: Main Control takes the OP field (opcode, 6 bits) and produces ALUOp (2 bits) plus the other control signals; ALU Control takes ALUOp and the Func field (6 bits) and produces ALUctr (3 bits) for the ALU.]
- Use OP field to generate ALUOp (encoding)
  - Control signal fed to ALU control block
- Use Func field and ALUOp to generate ALUctr (decoding)
  - Specifically sets 3 ALU control signals
    - B-Invert, Carry-in, operation
25 Main control, ALU control
Or in other words, ALUOp means:
- 00: ALU performs add
- 01: ALU performs sub
- 10: ALU does what the function code says
(see p. 284 for more)
26 Generating ALUctr
[Figure: the ALU output mux selects and (00), or (01), adder (10), or less (11).]
- ALUctr<2>: B-negate (C-in and B-invert)
- ALUctr<1>: select ALU output
- ALUctr<0>: select ALU output
B-invert and C-in must both be 1 for subtract.
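The two-level decode (ALUOp first, then the function code) can be sketched as a lookup. The funct values below are the standard MIPS encodings for this instruction subset, and the 3-bit results follow the bit assignment above (bit 2 is B-negate, bits 1:0 select the mux input); `alu_control` and `FUNCT_TO_ALUCTR` are illustrative names.

```python
# Standard MIPS funct codes for the R-type subset, mapped to the 3-bit ALUctr.
FUNCT_TO_ALUCTR = {
    0b100000: 0b010,  # add: adder output
    0b100010: 0b110,  # sub: B-negate + adder output
    0b100100: 0b000,  # and
    0b100101: 0b001,  # or
    0b101010: 0b111,  # slt: B-negate + "less" input
}

def alu_control(alu_op, funct):
    """Map the 2-bit ALUOp and the 6-bit funct field to the 3-bit ALUctr."""
    if alu_op == 0b00:          # lw/sw: add for the address calculation
        return 0b010
    if alu_op == 0b01:          # beq: subtract to compare
        return 0b110
    if alu_op == 0b10:          # R-type: the function code decides
        return FUNCT_TO_ALUCTR[funct]
    raise ValueError("unused ALUOp encoding")

print(format(alu_control(0b10, 0b101010), "03b"))
```

For ALUOp 00 and 01 the funct argument is ignored, which corresponds to the don't-care entries in the truth table used to derive the gates.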
27 The Logic
This table is used to generate the actual Boolean logic gates that produce ALUctr.
Gates could be generated by hand, but this is often done w/SW.
[Table and figure: inputs ALUOp1, ALUOp0 and the func<5:0> bits F3, F2, F1, F0; outputs ALUctr<2>, ALUctr<1>, ALUctr<0>. Example shown: ALUctr<2> (SUB/BEQ).]
28 Recall
Single cycle MIPS machine
Recall, for MIPS, we have to build a Main Control Block and an ALU Control Block.
29 Well, here's what we did
Single cycle MIPS machine
We came up with the information to generate this logic, which would fit here in the datapath.
30 Single cycle versus multi-cycle
31 Single Cycle Implementation
- Calculate cycle time assuming negligible delays except
  - memory (2ns), ALU and adders (2ns), register file access (1ns)
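One way to work this calculation: add up the delays of the units each instruction class passes through and take the maximum. The delays are the ones given above; the list of units per class is an assumption based on the usual single-cycle datapath, and `PATHS` is an illustrative name.

```python
MEM, ALU, REG = 2, 2, 1   # delays in ns, as given above

# Units on each class's critical path (assumed, following the usual single-cycle datapath).
PATHS = {
    "R-type": [MEM, REG, ALU, REG],       # fetch, register read, ALU, register write
    "lw":     [MEM, REG, ALU, MEM, REG],  # fetch, register read, address, data read, register write
    "sw":     [MEM, REG, ALU, MEM],       # fetch, register read, address, data write
    "beq":    [MEM, REG, ALU],            # fetch, register read, compare
    "j":      [MEM],                      # fetch only
}

delay_ns = {name: sum(path) for name, path in PATHS.items()}
cycle_time_ns = max(delay_ns.values())    # the clock must fit the slowest instruction
print(delay_ns, cycle_time_ns)
```

Under these assumptions the load sets the clock period, so every other instruction class spends part of each cycle idle.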
32 Single-Cycle Implementation (Cont'd)
- Single-cycle, fixed-length clock
  - CPI = 1
  - Clock cycle = propagation delay of the longest datapath operations among all instruction types
  - Easy to implement
- Single-cycle, variable-length clock
  - CPI = 1
  - Clock cycle = sum over i of ((% of type-i instructions) x propagation delay of the type-i instruction datapath operations)
  - Better than the previous, but impractical to implement
- Disadvantages
  - What if we have floating-point operations?
  - How about component usage?
33 Multiple Cycle Alternative
- Break an instruction into smaller steps
- Execute each step in one cycle.
- Execution sequence
  - Balance amount of work to be done
  - Restrict each cycle to use only one major functional unit
- At the end of a cycle
  - Store values for use in later cycles, why?
  - Introduce additional internal registers
- The advantages
  - Cycle time much shorter
  - Different instructions take different numbers of cycles to complete
  - Functional unit used more than once per instruction
34 Step 1: Instruction Fetch
- Use PC to get instruction, put it in IR.
- Increment PC by 4, put the result back in PC.
- Can you write this using the RTL notation?
  - IR <- Memory[PC], PC <- PC + 4
- What is the advantage of updating the PC now?
35 Step 2: I-Decode and Register Fetch
- Read registers rs and rt in case we need them
- Compute branch address in case instruction is a branch
- RTL
  - A <- Reg[IR[25-21]]
  - B <- Reg[IR[20-16]]
  - ALUOut <- PC + (sign-extend(IR[15-0]) << 2)
- Did we set any control lines based on the instruction type? (we are busy "decoding" it in our control logic)
[Callout on the RTL: means "in parallel"]
36 Step 3 (Instruction dependent)
- ALU is performing 1 of 3 functions, based on instruction type
  - Memory reference: ALUOut <- A + sign-extend(IR[15-0])
  - R-type: ALUOut <- A op B
  - Branch: if (A == B) then (PC <- ALUOut)
37 Step 4 (R-type or memory-access)
- Loads and stores access memory: MDR <- Memory[ALUOut] or Memory[ALUOut] <- B
- R-type instructions finish: Reg[IR[15-11]] <- ALUOut
- When does the write actually take place?
  - At the end of the cycle, on the edge.
38 Step 5: Write-Back
- Reg[IR[20-16]] <- MDR
- What about all the other instructions?
39 Single cycle
40 Multiple Cycle Design
- Break up instructions into steps, each step takes 1 cycle
  - balance work to be done
  - restrict each cycle to use only 1 major functional unit
- At the end of a cycle
  - store values for use in later cycles (easiest thing to do)
  - introduce additional internal registers
41 Execution Sequence Summary
IR <- Memory[PC]
PC <- PC + 4
A <- Reg[IR(25:21)]
B <- Reg[IR(20:16)]
ALUOut <- PC + (SignEx(IR(15:0)) << 2)
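Because each instruction class stops after a different step, the multi-cycle CPI depends on the instruction mix. A rough sketch: the step counts follow the five steps described earlier (loads use all five, stores and R-type finish in four, branches and jumps in three), while the mix is hypothetical and chosen only for illustration.

```python
# Cycles per instruction class in the multi-cycle design.
STEPS = {"lw": 5, "sw": 4, "R-type": 4, "beq": 3, "j": 3}

# Hypothetical instruction mix (fractions sum to 1); not taken from the lecture.
MIX = {"lw": 0.25, "sw": 0.10, "R-type": 0.45, "beq": 0.15, "j": 0.05}

# Average CPI is the mix-weighted sum of the per-class cycle counts.
cpi = sum(MIX[name] * STEPS[name] for name in STEPS)
print(round(cpi, 2))
```

With this mix the average is a little over 4 cycles per instruction, so the multi-cycle design pays off only if its clock is more than about 4 times faster than the single-cycle clock.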
42 Control Signals
New:
- PC: PCWrite, PCWriteCond, PCSource
- Memory: IorD, MemRead, MemWrite
- IR: IRWrite
- Reg. file: RegWrite, MemtoReg, RegDst
- ALU: ALUSrcA, ALUSrcB, ALUOp, ALUCnt.
Old:
- RegDst, MemToReg, RegWrite, MemRead, MemWrite, Branch, ALUSrc, ALUOp, ALUCnt.
43 Implementing the Control
- Value of control signals is dependent upon
  - what instruction is being executed
  - which step is being performed
- Use accumulated information to specify a finite state machine
  - use a state diagram, or
  - use microprogramming
- Implementation can be derived from specification
44 Graphical Specification of FSM
[Figure: state diagram; the states and the signals each one asserts are listed here.]
- State 0 (start), Instruction fetch: MemRead, ALUSrcA = 0, IorD = 0, IRWrite, ALUSrcB = 01, ALUOp = 00, PCWrite, PCSource = 00
- State 1, Instruction decode / register fetch: ALUSrcA = 0, ALUSrcB = 11, ALUOp = 00
- State 2, Memory address computation: ALUSrcA = 1, ALUSrcB = 10, ALUOp = 00
- State 3, Memory access (load): MemRead, IorD = 1
- State 4, Memory read completion: RegDst = 0, RegWrite, MemToReg = 1
- State 5, Memory access (store): MemWrite, IorD = 1
- State 6, Execution: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 10
- State 7, R-type completion: RegDst = 1, RegWrite, MemToReg = 0
- State 8, Branch completion: ALUSrcA = 1, ALUSrcB = 00, ALUOp = 01, PCWriteCond, PCSource = 01
- State 9, Jump completion: PCWrite, PCSource = 10
Tells us what values are needed and during what step.
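The state diagram can also be written as a next-state function. This is a sketch: the state numbers match the diagram, but the transitions are an assumption taken from the usual textbook form of this multi-cycle FSM, and the instruction-class strings are illustrative.

```python
STATE_NAMES = {
    0: "Instruction fetch", 1: "Instruction decode / register fetch",
    2: "Memory address computation", 3: "Memory access (load)",
    4: "Memory read completion", 5: "Memory access (store)",
    6: "Execution", 7: "R-type completion",
    8: "Branch completion", 9: "Jump completion",
}

def next_state(state, op):
    """Next state for an instruction class op in {'lw', 'sw', 'R-type', 'beq', 'j'}."""
    if state == 0:
        return 1                                   # every instruction is decoded next
    if state == 1:                                 # dispatch on the instruction class
        return {"lw": 2, "sw": 2, "R-type": 6, "beq": 8, "j": 9}[op]
    if state == 2:
        return 3 if op == "lw" else 5              # load reads memory, store writes it
    if state == 3:
        return 4
    if state == 6:
        return 7
    return 0                                       # states 4, 5, 7, 8, 9 finish the instruction

def trace(op):
    """States visited by one instruction, starting at fetch."""
    states, s = [0], next_state(0, op)
    while s != 0:
        states.append(s)
        s = next_state(s, op)
    return states

for op in ("lw", "sw", "R-type", "beq", "j"):
    print(op, trace(op))
```

The length of each trace is the number of cycles that instruction class takes, which ties the FSM back to the CPI discussion.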
45 Finite State Machine for Control
- Control logic is inside this box (could be implemented in many different ways: ROM, logic, etc.)
- The outputs that we want are now also dependent on the current state.
- Inputs (which now also include the previous state)
- (Still might need ALU control logic and hence function code developed earlier)
46 Microprogramming
- For our example, state diagrams and combinational logic are more than adequate
- But we're dealing with a small subset of the MIPS processor
  - Full MIPS instruction set has over 100 instructions
  - In 1 implementation, instructions take from 1 to 20 clock cycles
  - Control would be much more complex for this case
- Another alternative: microcoding
  - Think of control signals that must be asserted in a state as an instruction to be executed by the datapath
  - Call these microinstructions
47 The entire microprogram
48 Sample Microinstruction
- Ifetch: IR <- Mem[PC]; PC <- PC + 4
Microinstruction: 1d011ddd000100d11
49 Pipelining
50 Pipelining: It's Natural!
- Laundry Example
- Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
  - Washer takes 30 minutes
  - Dryer takes 40 minutes
  - Folder takes 20 minutes
51 Sequential Laundry
[Figure: timeline from 6 PM to midnight; the four loads, in task order, run back to back, each taking 30, 40, and 20 minutes.]
- Sequential laundry takes 6 hours for 4 loads
- If they learned pipelining, how long would laundry take?
52 Pipelined Laundry: Start work ASAP
[Figure: timeline from 6 PM to midnight; the four loads, in task order, overlap.]
Note: more time to go out later that night
- Pipelined laundry takes 3.5 hours for 4 loads
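Both laundry totals can be checked with a short simulation. A minimal sketch, assuming each machine handles one load at a time and a finished load can wait (in a hamper) until the next machine is free; the function names are illustrative.

```python
STAGES = [30, 40, 20]   # washer, dryer, folder, in minutes
LOADS = 4

def sequential_minutes(stages, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)

def pipelined_minutes(stages, loads):
    """Each load starts a stage as soon as both it and the stage are free."""
    finish = [0] * len(stages)       # finish[s]: when stage s completed its latest load
    for _ in range(loads):
        ready = 0                    # when the current load can enter the next stage
        for s, duration in enumerate(stages):
            start = max(ready, finish[s])
            finish[s] = start + duration
            ready = finish[s]
    return finish[-1]

print(sequential_minutes(STAGES, LOADS) / 60, pipelined_minutes(STAGES, LOADS) / 60)
```

The pipelined total works out to 30 + 4 x 40 + 20 minutes: once the pipeline fills, the 40-minute dryer sets the pace.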
53 Pipelining Lessons
- Multiple tasks operating simultaneously
- Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Also, need time to fill and drain the pipeline.
[Figure: pipelined laundry timeline, 6 PM to 9 PM, loads in task order.]
54 Pipelining: Some terms
- If you're doing laundry or implementing a µP, each stage where something is done is called a pipe stage
  - In the laundry example, washer, dryer, and folding table are pipe stages; clothes enter at one end, exit at the other
  - In a µP, instructions enter at one end and have been executed when they leave
  - Another example: auto assembly line
- Throughput is how often stuff comes out of a pipeline
55 More technical detail
- If times for all S stages are equal to T
  - Time for one initiation to complete is still ST
  - Time between 2 initiations = T, not ST
  - Initiations per second = 1/T
- Pipelining: overlap multiple executions of the same sequence
  - Improves THROUGHPUT, not the time to perform a single operation
- Other examples
  - Automobile assembly plant, chemical factory, garden hose, cooking
56 More technical detail
- Book's approach to drawing pipeline timing diagrams
  - Time runs left-to-right, in units of stage time
  - Each row below corresponds to a distinct initiation
  - Boundary b/t 2 column entries: pipeline register
    - (i.e. hamper)
  - Must look at column contents to see what stage is doing what
Time for N initiations to complete = NT + (S-1)T
Throughput: time per initiation = T + (S-1)T/N -> T!
57 Ideal digital system pipeline speedup
Unpipelined: four blocks of combinational logic, each with delay t, between an input latch and an output latch
- delay for 1 piece of data = 4t + latch setup (assume small)
- approximate delay for 1000 pieces of data = 4000t
Pipelined: the same four blocks, with a latch after each block
- delay for 1 piece of data = 4(t + latch setup)
- approximate delay for 1000 pieces of data = 3t + 1000t
- speedup for 1000 pieces of data = 4000/1003, or about 4
Ideal speedup = number of pipeline stages
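The speedup figure follows from the general expression for N initiations through S equal stages, N*T + (S-1)*T. A small sketch that ignores latch setup time, as the slide assumes; the function names are illustrative.

```python
from fractions import Fraction

def unpipelined_time(n, stages, t=1):
    """Every item passes through all stages before the next one starts."""
    return n * stages * t

def pipelined_time(n, stages, t=1):
    """(stages - 1) cycles to fill the pipeline, then one item completes per cycle."""
    return n * t + (stages - 1) * t

def speedup(n, stages):
    """Exact ratio of unpipelined to pipelined time for n items."""
    return Fraction(unpipelined_time(n, stages), pipelined_time(n, stages))

s = speedup(1000, 4)
print(s, float(s))
```

As n grows, the fill time (S-1)*T is amortized and the ratio approaches the number of stages, which is the ideal speedup.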
58 The new look dataflow
[Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB. Recoverable labels: PC, ADD, 4, Inst. Memory, IR6...10, IR11..15, Register File, Sign Extend (16 to 32), Comp., Branch taken, ALU, Data Mem., MEM/WB.IR, Mux.]
Data must be stored from one stage to the next in pipeline registers/latches. These hold temporary values between clocks and the info. needed for execution.
59 Another way to look at it
[Figure: pipeline diagram; time (clock number) runs along one axis, program execution order (in instructions) along the other.]
60 So, what about the details?
- In each cycle, a new instruction is fetched and begins its 5 cycle execution
- In a perfect world (pipeline), performance is improved 5 times over!
- So, that's it, huh? Hardly!!!
- What else do we have to worry about?
  - Must know what's going on in every cycle of the machine
  - What if 2 instructions try to use the same resource at the same time?
    - (LOTS more on this later)
    - Separate instruction/data memories, multiple register ports, etc. help avoid this
61 Limits, limits, limits
- So, now that the ideal stuff is out of the way, let's look at how a pipeline REALLY works
- Pipelines are slowed b/c of
  - Pipeline latency
  - Imbalance of pipeline stages
    - (Think: a chain is only as strong as its weakest link)
    - Well, a pipeline is only as fast as its slowest stage
  - Pipeline overhead (from where?)
    - Register delay from pipe stage latches
    - Clock skew: once a clock cycle is as small as the sum of the clock skew and latch overhead, you can't get any work done
62 Note
- See Appendix B in the supplementary materials for more detail and examples.
63 Control Signals in a Pipeline
64 Questions about control signals
- Following discussion is relevant to a single instruction
- Q: Are all control signals active at the same time?
  - A: ?
- Q: Can we generate all these signals at the same time?
  - A: ?
65 Passing control w/pipe registers
- Analogy: send instruction with car on assembly line
  - "Install Corinthian leather interior on car 6 @ stage 3"
66 Pipelined datapath w/control signals