Title: PROCESSOR ARCHITECTURE
1PROCESSOR ARCHITECTURE
- Jehan-François Pâris
- jparis_at_uh.edu
2Chapter Organization
- Logic design conventions
- Implementation of a "toy" CPU
- Pipelining
- Pipelining hazards
- Data hazards
- Control hazards
- Exceptions
- Parallelism
IMPORTANT
3LOGIC DESIGN CONVENTIONS
4Combinational/state elements
- Combinational elements
- Outputs only depend on current inputs
- Stateless
- Adders and, more generally, arithmetic logic unit
(ALU)
5Combinational/state elements
- State elements
- Have a memory holding a state
- Output depends on current inputs and state of element
- State reflects past inputs
- Flip-flops,
6Judicial analogy
- In our legal system
- Guilty/not guilty decision is stateless
- Good reasons
- Sentencing decision is not
- "Three strikes and you are out" laws
- Good reasons
7Clocking methodology
- We will assume an edge-triggered clocking technology
- Edge is short enough to prevent data propagation in state elements
- Can read current state of a memory element at the same time we update it
8Clocking convention
- Omit write control signal if state element is
updated at every active clock edge
9A "TOY" CPU
10Motivation
- "Toy" CPU will implement a subset of MIPS
instruction set
- Subset will be
- Self-sufficient
- Simpler to implement
- Complex enough to allow a serious discussion of CPU architecture
11The subset
- Will include
- Load and store instructions: lw (load word) and sw (store word)
- Arithmetic-logic instructions: add, sub, and, or and slt (set less than)
- Branch instructions: beq (branch if equal) and j (jump)
12Load and store instructions
- Format I
- Three operands
- Two registers r1 and r2
- One displacement d
- lw r1, d(r2) loads into register r1 main memory word at address contents(r2) + d
- sw r1, d(r2) stores contents of register r1 into main memory word at address contents(r2) + d
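The address rule for lw and sw can be sketched in a few lines of Python. This is only an illustrative model: the `regs` list, the `memory` dictionary and the register numbers used below are assumptions made for the example, not part of the slides.

```python
# Toy model (assumed for illustration): 32 registers and a data memory
# represented as a dictionary from byte address to word.
regs = [0] * 32
memory = {}

def sw(r1, d, r2):
    """sw r1, d(r2): store contents of r1 at address contents(r2) + d."""
    memory[regs[r2] + d] = regs[r1]

def lw(r1, d, r2):
    """lw r1, d(r2): load into r1 the word at address contents(r2) + d."""
    regs[r1] = memory.get(regs[r2] + d, 0)

regs[2] = 1000   # base address held in register 2
regs[1] = 42     # value to store
sw(1, 8, 2)      # writes the word at address 1000 + 8
lw(3, 8, 2)      # reads it back into register 3
```

Both instructions compute the same effective address, which is why the datapath can reuse the ALU for the addition.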
13Arithmetic-logic instructions
- Format R
- Three operands
- Three registers r1, r2 and r3
- Store into register r1 result of r2 <op> r3, where <op> can be add, subtract, and, or, as well as set if less than
14Branch instruction
- Format I
- Three operands
- Two registers r1 and r2
- One displacement d
- beq r1, r2, d sets value of PC to PC + 4 + 4×d iff r1 = r2
15The simplest data path
- Assume CPU will do nothing but
- Increment its program counter and
- Deliver the next instruction
16The simplest data path
[Figure: PC, Instruction Memory (Read address input, Instruction output), and an Add unit with constant 4]
17Implementing R2R instructions
- The ALU takes two 32-bit inputs
- Returns
- A 32-bit output
- A 1-bit signal if the result is zero
18The register file
- Two read outputs that are always available
- One write input activated by a RegWrite signal
- Three register selectors
19The register file
[Figure: register file with inputs Read select 1, Read select 2, Write select and Write data, and outputs Read data 1 and Read data 2; selector width marked as 5 bits]
RegWrite enables register writes
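A minimal Python sketch of this register file, assuming 32 registers of 32 bits (consistent with 5-bit selectors). The class and method names are hypothetical and chosen only for the example.

```python
class RegisterFile:
    """Toy register file: two read ports, one write port gated by RegWrite."""

    def __init__(self):
        self.regs = [0] * 32          # 5-bit selectors address 32 registers

    def read(self, read_select_1, read_select_2):
        # Read outputs are always available (combinational read).
        return self.regs[read_select_1], self.regs[read_select_2]

    def clock_edge(self, write_select, write_data, reg_write):
        # The write only takes effect when RegWrite is asserted.
        if reg_write:
            self.regs[write_select] = write_data & 0xFFFFFFFF

rf = RegisterFile()
rf.clock_edge(write_select=9, write_data=7, reg_write=True)
rf.clock_edge(write_select=10, write_data=5, reg_write=False)  # ignored
values = rf.read(9, 10)
```

Separating `read` from `clock_edge` mirrors the edge-triggered convention: reads see the current state, while writes happen only at the clock edge.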
20Implementing R2R instructions
[Figure: register file feeding the ALU; the ALU Result is written back to the register file]
RegWrite is enabled
21Implementing load and store
- Require
- An address calculation
- contents(r2) + d
- An access to data memory
- Before doing the address calculation, we must
transform 16-bit displacement d into a 32-bit
value using sign extension
22The data memory
- One address selector
- One write data input
- One read data output
- Two controls
- MemWrite
- MemRead
23Sign extension (I)
- If 16-bit number has a zero as MSB
- It is positive
- Must add 16 zero bits
0110 1010 1010 0100
24Sign extension (II)
- If 16-bit number has a one as MSB
- It is negative
- Must add 16 one bits
1110 1010 1010 0100
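The two cases can be checked with a short Python sketch. The function name is an assumption made for the example; the two bit patterns are the ones shown on the slides.

```python
def sign_extend_16_to_32(value):
    """Extend a 16-bit value to 32 bits by replicating its MSB."""
    value &= 0xFFFF
    if value & 0x8000:                # MSB is one: negative number
        return value | 0xFFFF0000     # add 16 one bits
    return value                      # MSB is zero: add 16 zero bits

positive = 0b0110_1010_1010_0100      # MSB is zero
negative = 0b1110_1010_1010_0100      # MSB is one

extended_positive = sign_extend_16_to_32(positive)
extended_negative = sign_extend_16_to_32(negative)
```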
25The data memory
MemWrite enables memory writes
Memory address Read data Write data
MemRead enables memory reads
26Implementing the store instruction
[Figure: register file, ALU, sign-extension unit (SE) producing the sign-extended d field, and data memory (Address, Read, Write)]
27Implementing the load instruction
[Figure: register file, ALU, sign-extension unit (SE) applied to the d field, and data memory (Address, Read, Write)]
28Implementing conditional branch
- Target Address
- Sign-extend 16-bit immediate part of instruction
- Shift left 2
- Add to PC
- Branch Control Logic
- Perform test operation on two registers
- Check result
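The target address computation can be sketched as follows. This is a hedged illustration: the function name is hypothetical, and the sketch adds the shifted displacement to PC + 4, as in the beq definition given earlier.

```python
def branch_target(pc, imm16):
    """Sign-extend the 16-bit immediate, shift left 2, add to PC + 4."""
    imm16 &= 0xFFFF
    # Interpret the 16-bit field as a signed number (sign extension).
    offset = imm16 - 0x10000 if imm16 & 0x8000 else imm16
    # Shifting left by 2 converts a word displacement into bytes.
    return (pc + 4 + (offset << 2)) & 0xFFFFFFFF

forward = branch_target(0x1000, 3)        # three words past PC + 4
backward = branch_target(0x1000, 0xFFFF)  # displacement of -1 word
```

A displacement of -1 lands back on the branch itself, since PC + 4 - 4 = PC.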
29Implementing conditional branch
[Figure: the d field of the instruction is sign-extended, shifted left 2 and added to PC + 4 to produce the branch destination; the register file feeds the ALU, whose result goes to the branch control logic]
30Note
- Arithmetic-logic operations only use
- Register file and ALU
- Load and store use
- ALU for computing memory address
- Data memory
31Implementing other instructions
32Combining everything
33Left to be done
- All control signals
- Two multiplexers ALUSrc and MemtoReg
- RegWrite, MemRead and MemWrite switches
- ALU controls (4 bits)
34ALU control signals
ALU control lines Function
0000 and
0001 or
0010 add
0110 subtract
0111 set on less than
1100 nor (not in "toy" subset)
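A possible Python model of an ALU driven by these control lines, returning the 32-bit result together with the 1-bit zero signal. The function names are assumptions, and treating set on less than as a signed comparison follows the usual MIPS slt convention rather than anything stated on the slide.

```python
MASK32 = 0xFFFFFFFF

def to_signed(x):
    """Interpret a 32-bit pattern as a two's complement number."""
    return x - (1 << 32) if x & 0x80000000 else x

def alu(control, a, b):
    """Return (result, zero) for the 4-bit ALU control value."""
    if control == 0b0000:
        result = a & b                                      # and
    elif control == 0b0001:
        result = a | b                                      # or
    elif control == 0b0010:
        result = (a + b) & MASK32                           # add
    elif control == 0b0110:
        result = (a - b) & MASK32                           # subtract
    elif control == 0b0111:
        result = 1 if to_signed(a) < to_signed(b) else 0    # set on less than
    elif control == 0b1100:
        result = ~(a | b) & MASK32                          # nor
    else:
        raise ValueError("unsupported ALU control value")
    return result, int(result == 0)

sum_result = alu(0b0010, 2, 3)
equal_test = alu(0b0110, 7, 7)   # subtract: zero signal is 1 when operands are equal
```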
35Controlling the ALU
- Recall that all R-format instructions have same opcode
- Operation performed by ALU is specified in the function field (bits <0:5>)
36Controlling the ALU
- ALU control inputs generated by two-step process
- Construct two ALUOp control bits from opcode
- Construct four ALU control bits using
- Two ALUOp bits
- Six bits from function field when they are needed
37Dependence table
Opcode ALUOp Operation Function Action ALU Ctl
lw 00 lw - add 0010
sw 00 sw - add 0010
beq 01 beq - subtract 0110
R-type 10 add 100000 add 0010
R-type 10 subtract 100010 subtract 0110
R-type 10 and 100100 and 0000
R-type 10 or 100101 or 0001
R-type 10 slt 101010 slt 0111
38Notes
- Two-step process simplifies combinational logic
- Many don't care conditions in truth table
39Truth table
ALUOp1 ALUOp0 F5 F4 F3 F2 F1 F0 ALU Control bits
0 0 X X X X X X 0010
0 1 X X X X X X 0110
1 0 X X 0 0 0 0 0010
1 X X X 0 0 1 0 0110
1 0 X X 0 1 0 0 0000
1 0 X X 0 1 0 1 0001
1 X X X 1 0 1 0 0111
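The truth table can be written as a small Python function. This is a sketch under the assumption that ALUOp is passed as a 2-bit integer and the function field as a 6-bit integer; the function name is hypothetical.

```python
def alu_control(alu_op, funct):
    """Second step of ALU control: map ALUOp and function field to 4 bits."""
    if alu_op == 0b00:
        return 0b0010            # lw and sw: add, function field is don't care
    if alu_op == 0b01:
        return 0b0110            # beq: subtract, function field is don't care
    # R-type (ALUOp = 10): only bits F3..F0 of the function field are examined.
    low_bits_to_control = {
        0b0000: 0b0010,          # add
        0b0010: 0b0110,          # subtract
        0b0100: 0b0000,          # and
        0b0101: 0b0001,          # or
        0b1010: 0b0111,          # set on less than
    }
    return low_bits_to_control[funct & 0b1111]

add_bits = alu_control(0b10, 0b100000)
slt_bits = alu_control(0b10, 0b101010)
```

Masking with `0b1111` is how the sketch expresses the don't care entries for F5 and F4.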
40Note
- Bits 4 and 5 of function field are not used
- ALUOp bits only have three possible values: 00, 01 and 10
- Introduces don't care conditions
- All R instructions use same data paths
- Other control bits depend only on opcode
41Control signal effects
Signal | When deasserted | When asserted
Regdest | Destination register comes from rt field (bits 20:16) | Destination register comes from rd field (bits 15:11)
RegWrite | None | Enables write into destination register
ALUSrc | Second ALU operand comes from second register output | Second ALU operand comes from sign-extended displacement (bits 15:0)
42Control signal effects
Signal | When deasserted | When asserted
PCSrc | PC is incremented by 4 | PC set to branch target value
MemRead | None | Enables memory read output
MemWrite | None | Enables memory write
MemtoReg | Value fed to destination register comes from ALU | Value fed to destination register comes from memory
43Note
- PCSrc is asserted when
- Instruction is a branch
- and
- ALU Zero result bit is asserted
- We will introduce a Branch control line
44Control line settings
Instruction Regdest ALUSrc MemtoReg RegWrite
R-format 1 0 0 1
lw 0 1 1 1
sw X 1 X 0
beq X 0 X 0
45Control line settings
Instruction MemRead MemWrite Branch ALUOp1 ALUOp0
R-format 0 0 0 1 0
lw 1 0 0 0 0
sw 0 1 0 0 0
beq 0 0 1 0 1
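The two control tables can be captured as a lookup table in Python, together with the rule that PCSrc is asserted only when the instruction is a branch and the ALU Zero bit is set. The dictionary layout and function name are assumptions made for this sketch; `None` stands for a don't care (X) entry.

```python
# Control line settings per instruction class; None marks a don't care (X).
CONTROL = {
    "R-format": dict(Regdest=1, ALUSrc=0, MemtoReg=0, RegWrite=1,
                     MemRead=0, MemWrite=0, Branch=0, ALUOp=0b10),
    "lw":       dict(Regdest=0, ALUSrc=1, MemtoReg=1, RegWrite=1,
                     MemRead=1, MemWrite=0, Branch=0, ALUOp=0b00),
    "sw":       dict(Regdest=None, ALUSrc=1, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=1, Branch=0, ALUOp=0b00),
    "beq":      dict(Regdest=None, ALUSrc=0, MemtoReg=None, RegWrite=0,
                     MemRead=0, MemWrite=0, Branch=1, ALUOp=0b01),
}

def pc_src(instruction, alu_zero):
    """PCSrc = Branch AND Zero."""
    return int(CONTROL[instruction]["Branch"] == 1 and alu_zero == 1)

taken = pc_src("beq", alu_zero=1)       # registers were equal
not_taken = pc_src("beq", alu_zero=0)   # registers differed
```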
46Active datapaths for a R instruction
47Active datapaths for a load instruction
48Active datapaths for a beq instruction
49The "weird" jump instruction
- Uses J format
- Single 26-bit operand
- Implements an unconditional jump
- New value of PC is obtained as follows
- Bits 1:0 are zero (address is multiple of 4)
- Bits 27:2 come from jump operand
- Bits 31:28 come from PC + 4
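A short Python sketch of how the new PC is assembled, assuming the standard MIPS layout: the upper four bits come from PC + 4, the next 26 bits are the jump operand, and the two low bits are zero. The function name is hypothetical.

```python
def jump_target(pc, operand26):
    """Build the jump address from PC + 4 and the 26-bit jump operand."""
    upper_bits = (pc + 4) & 0xF0000000          # bits taken from PC + 4
    middle_bits = (operand26 & 0x03FFFFFF) << 2 # operand shifted left 2
    return upper_bits | middle_bits             # low two bits remain zero

target = jump_target(0x10000000, 0x10)
```

Because the upper bits are copied from PC + 4, a jump can only reach addresses in the same 256 MB region as the instruction that follows it.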
50Implementing the jump instruction
51Limitations of single-cycle design
- If we want all instructions to be executed in one cycle
- Clock cycle must be long enough to accommodate instruction taking the most time
- Floating-point multiply or divide
- Does not work for CPUs that have a rich instruction set
52PIPELINING
53An analogy (I)
- Washing your clothes
- Four steps
- Putting in the washer
- Putting in the dryer
- Folding/ironing
- Putting them away
54An analogy (II)
- Most people
- Start second wash load as soon as first wash load is in dryer
- Put second wash load in dryer and start a third wash load while they are folding/ironing the first wash load
55Purely sequential approach
Time     6 pm   6:30   7 pm   7:30   8 pm   8:30   9 pm   9:30
Load 1   Wash   Dry    Fold   Store
Load 2                               Wash   Dry    Fold   Store
56Smart approach
Time     6 pm   6:30   7 pm   7:30   8 pm   8:30   9 pm   9:30
Load 1   Wash   Dry    Fold   Store
Load 2          Wash   Dry    Fold   Store
Load 3                 Wash   Dry    Fold   Store
Load 4                        Wash   Dry    Fold   Store
Solution assumes that a housemate puts folded/ironed clothes away for us
57Main advantage
- Can do much more in much less time
58Limitation
- Slowed down by time taken by longest step
- Could be washing/drying/ironing
59Instruction steps (I)
- Good candidates for pipelining steps
- Fetch instruction from memory
- Decode instruction
- Read registers
- Execute register to register operation or calculate address
- Access operand in memory
- Write results into a register
60Instruction steps (II)
- Since MIPS instruction set has fixed fields, we can combine steps 2 and 3
- Fetch instruction from memory
- Read registers while decoding instruction
- Execute register to register operation or calculate address
- Access operand in memory
- Write results into a register
61Sample step timings
Instruction class | Instruction fetch | Register read | ALU operation | Data access | Register write | Total time
Load word (lw) | 200 ps | 100 ps | 200 ps | 200 ps | 100 ps | 800 ps
Store word (sw) | 200 ps | 100 ps | 200 ps | 200 ps | --- | 700 ps
R format instruction | 200 ps | 100 ps | 200 ps | --- | 100 ps | 600 ps
Branch (beq) | 200 ps | 100 ps | 200 ps | --- | --- | 500 ps
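These timings can be used to compare a single-cycle design with a pipelined one. The sketch below is an idealized estimate: it assumes a five-stage pipeline with no hazards, a clock set by the slowest step, and hypothetical function names.

```python
# Step timings in picoseconds: fetch, register read, ALU, data access, register write.
STEP_PS = {
    "lw":       [200, 100, 200, 200, 100],
    "sw":       [200, 100, 200, 200, 0],
    "R-format": [200, 100, 200, 0, 100],
    "beq":      [200, 100, 200, 0, 0],
}

totals = {name: sum(steps) for name, steps in STEP_PS.items()}

# Single-cycle clock must fit the slowest instruction.
single_cycle_clock = max(totals.values())
# Pipelined clock must fit the slowest individual step.
pipelined_clock = max(max(steps) for steps in STEP_PS.values())

def single_cycle_time(n):
    """Time to run n instructions, one full cycle each."""
    return n * single_cycle_clock

def pipelined_time(n, stages=5):
    """Time to run n instructions in an ideal pipeline without stalls."""
    return (stages + n - 1) * pipelined_clock

three_loads_single = single_cycle_time(3)
three_loads_pipelined = pipelined_time(3)
```

For long instruction sequences the ratio approaches `single_cycle_clock / pipelined_clock`, which is 4 with these numbers rather than 5, because the steps are not of equal length.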
62Step 1 Fetch and decode
63Step 2 Read registers
64Step 3 Use the ALU
65Step 4 Access operand in memory
66Step 5 Store result in register
67Observations
- Most R format instructions operate on three registers and skip step 4
- Same for most I format instructions with an immediate operand
- Store operations skip step 5
- Load register instructions go through all five steps
68Pipelining limitations
- Some instructions that skip a step will still have to wait until preceding instruction is done.
- Hazards
- An instruction cannot proceed because
- Hardware cannot support the combination of instructions (structural hazards)
- Data are not ready (data hazards)
- Control/branch hazards
69Structural hazards
- Combinations of instructions that prevent
pipelining
70A bad MIPS instruction (I)
- Recall that the IBM instruction set had instructions that add to a register the contents of a memory location
- RX format
71A bad MIPS instruction (II)
- We could think of a MIPS instruction with three register operands
- ADDX r1, r2, r3
- adding to r1 the contents of the word at address contents of r2 + contents of r3
- We would have r1 = r1 + Mem[r2 + r3]
72A bad MIPS instruction (III)
- It would be great for accessing arrays
- r2 will have starting address of array
- r3 would contain the array index multiplied by 4
(incremented after each step)
[Figure: array in memory addressed by r2 (fixed value) plus r3]
73A bad MIPS instruction (IV)
- Adding this instruction would be a very bad idea
- Why?
74Answer
- Instruction would require two steps using the ALU
- Adding r2 and r3 to compute the address of the memory operand (step 4)
- Adding the memory operand to r1
- New step would introduce a structural hazard by preventing any other instruction from accessing the ALU
75My comment
- Careful design of the MIPS CPU and instruction set should be noted
- Not true for older instruction sets
- IBM 360, DEC VAX,
- Not true for X86 instruction sets
- CPU is designed to be compatible with an existing instruction set
76Designing instruction sets for pipelining (I)
- All instructions should have the same length
- Can fetch future instructions before the current one is decoded
- Have few instruction formats with register fields always in the same position
- Can combine instruction decode and register read steps
77Designing instruction sets for pipelining (II)
- Memory operands should only appear in load and store instructions
- No instruction can use the ALU twice!
- Operands must be properly aligned in memory
- Can always access them in a single memory cycle
78Data hazards (I)
- Assume we have
- add $s0, $t0, $t1
- sub $t2, $s0, $t3
- or
- $s0 = $t0 + $t1
- $t2 = $s0 - $t3
- Need result of add before proceeding with sub
instruction
79Detail of steps
Cycle  1    2      3      4      5      6
add    IF   ID/RR  ALU    RW
sub         IF     stall  stall  ID/RR  ALU
- Second instruction must wait until first instruction has updated $s0 in cycle 4 before reading its value in cycle 5
80Data hazards (II)
- New value of $s0 computed by the add instruction is not stored in $s0 until its step 5 has completed
- New instruction must wait until add instruction has performed its step 5 before performing its step 2 (register read)
81Data hazards (III)
[Figure: pipeline diagram for the add and sub instructions]
82Data hazards (IV)
- We lose two cycles during which nothing can be done
- Cannot trust compiler to remove all data hazards
- Observe that new value of $s0 becomes available at the end of step 3 of add instruction
- Add special circuitry to provide this value at the end of step 2 of sub instruction
- Forwarding or bypassing
83After forwarding
84Detail of steps
Cycle  1    2      3      4      5      6
add    IF   ID/RR  ALU    RW
sub         IF     ID/RR  ALU    RW
- Second instruction now gets updated value at the end of cycle 3 just in time to use it in cycle 4
- No stall cycles
85Limitations (I)
- Forwarding worked very well because output of step 3 of add was forwarded to be input of step 3 of sub
- Would not work as well if the output of an instruction step is needed as input of an earlier step of the next instruction
- Will still have one or more pipeline stalls (bubbles)
86Limitations (II)
- Assume we have
- lw $s0, 20($t1)
- sub $t2, $s0, $t3
- or
- $s0 = Mem[$t1 + 20]
- $t2 = $s0 - $t3
- Need new value of $s0 before proceeding with sub
instruction
87Limitations (III)
88Detail of steps
Cycle  1    2      3      4      5      6
lw     IF   ID/RR  ALU    MEM    RW
sub         IF     ID/RR  stall  ALU    RW
- Even with forwarding, second instruction must wait until completion of memory access of first instruction in cycle 4 before performing its ALU step in cycle 5
- One stall cycle
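The stall counts in the three timing diagrams can be reproduced with a small Python model. It is a simplified sketch, not a general hazard detector: it assumes the consumer immediately follows the producer, uses the cycle numbering of the diagrams (producer fetched in cycle 1), and all names are hypothetical.

```python
# Cycle at whose end the producer's result becomes available.
# Key: (producer instruction, forwarding enabled)
RESULT_READY = {
    ("add", False): 4,   # written into the register in its RW step
    ("add", True): 3,    # forwarded from the end of its ALU step
    ("lw", False): 5,    # written into the register in its RW step
    ("lw", True): 4,     # forwarded from the end of its MEM step
}

def stall_cycles(producer, forwarding):
    """Stalls inserted before a dependent instruction fetched in cycle 2."""
    ready = RESULT_READY[(producer, forwarding)]
    # Without forwarding the value is needed at register read (cycle 3);
    # with forwarding it is needed at the ALU step (cycle 4).
    needed = 4 if forwarding else 3
    return max(0, ready + 1 - needed)

add_without_forwarding = stall_cycles("add", forwarding=False)
add_with_forwarding = stall_cycles("add", forwarding=True)
load_with_forwarding = stall_cycles("lw", forwarding=True)
```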
89A last word
- In many architectures, the floating point unit is a significant source of structural hazards
- Less well adapted to pipelining
- The MIPS architecture assumes that we have separate memories for instructions and data
- Having a single memory for both would result in many more hazards
90Control / jump hazards
- Happen whenever we have a conditional jump
- Consider the instructions
- add $4, $5, $6
- beq $1, $2, 40
- or $7, $8, $9
- Need result of conditional branch (beq) before deciding whether to execute next instruction (or)
91Control hazards (II)
92Pipelined datapath
93Datapaths for pipelined organization
- Define five steps
- Fetch instruction from memory (IF)
- Instruction decode and register reads (ID)
- Execute arithmetic-logic operation on ALU (EX)
- Access operand in memory (MEM)
- Write back results into a register (WB)
94Datapaths for pipelined organization
- Insert registers to save outputs of each step before they get updated by the next step
- IF/ID registers
- ID/EX registers
- EX/MEM registers
- MEM/WB registers
95A first try
[Figure: datapath with the new pipeline registers inserted between steps]
96Comments
- This first try is not correct
- Load instruction will not be implemented correctly
- Address of destination register will be lost as soon as new instruction is fetched
- Must save it at each step
97The almost correct datapaths
Register address follows instruction
98The almost correct datapaths
99More problems
- Address of destination register is not always at the same place in all instructions
- Could be instruction bits (20-16)
- For all I-format instructions that write into a register
- Could be instruction bits (15-11)
- In R format instructions
100Why?
- In R format instructions
- In I format instructions
constant/address
101The solution
- Add a multiplexer at stage EX
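What this multiplexer selects can be sketched in Python. The function name and the two sample instruction encodings are assumptions for the example; the field positions are the bit ranges 20-16 and 15-11 mentioned above.

```python
def destination_register(instruction, reg_dst):
    """Pick the destination register number from a 32-bit instruction word."""
    rt = (instruction >> 16) & 0x1F   # bits 20-16, used by I-format
    rd = (instruction >> 11) & 0x1F   # bits 15-11, used by R-format
    return rd if reg_dst else rt      # the multiplexer

# R-format example: add with rs=9, rt=10, rd=8 (opcode 0, function 100000)
add_word = (9 << 21) | (10 << 16) | (8 << 11) | 0b100000
# I-format example: lw with rs=9, rt=8, displacement 4 (opcode 35)
lw_word = (35 << 26) | (9 << 21) | (8 << 16) | 4

add_dest = destination_register(add_word, reg_dst=1)
lw_dest = destination_register(lw_word, reg_dst=0)
```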
102More about data hazards
- Consider
- sub $2, $1, $3
- and $12, $2, $5
- or $13, $6, $2
- add $14, $2, $2
- sw $15, 100($2)
- Last four instructions depend on result of sub
103More about data hazards
- $2 is updated at the end of last cycle of sub
- First instruction that would get the correct value of $2 would be the add
104More about data hazards
sub   IF     IDReg  EX     MEM    WB
and          IF     IDReg  EX     MEM    WB
or                  IF     IDReg  EX     MEM
add                        IF     IDReg  EX
sw                                IF     IDReg
105Adding a forwarding unit
106More data hazards
- We can forward the results of sub instruction at the end of its EX step
- In time for all four following instructions
- To do that we need special forwarding unit
- Not all data hazards can be avoided
- lw followed by any instruction accessing the loaded word
107Why?
- lw loads word from RAM into a register
- Goes through IF, IDReg, EX, MEM and WB steps
- Register value is updated at the end of WB step
- Must delay any following instruction that wants
to access the contents of the register
108Data hazard detection unit
- Detects hazards that cannot be avoided
- Inserts no operation instructions (nop)
- They do nothing!
109More about control hazards
- Outcome of conditional branch is not known until end of step EX
- beq and bne use arithmetic unit to evaluate the branch condition
- If branch is taken, we must abort the two following instructions
- Easy because they have not yet updated anything
110More about control hazards
beq   IF     IDReg  EX     MEM    WB
next         IF     IDReg  ABORT
next                IF     ABORT
dest                       IF     IDReg  EX
111More about control hazards
beq   IF     IDReg  EX     MEM    WB
next         IF     ABORT
dest                IF     IDReg  EX     MEM
112Better implementation of beq/bne
113MIPS Optimization
- Move comparison ahead to reduce the number of aborted instructions
- Add a simple EQUAL/NOT EQUAL comparison hardware that tests outputs of register file
- Bitwise XOR then ORing the results
- Will return zero if the register contents are identical
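The comparator described above can be modelled in Python as a bitwise XOR followed by an OR over all result bits. The function name is an assumption, and the loop stands in for what would be a tree of OR gates in hardware.

```python
def not_equal_signal(a, b):
    """Return 0 if the two 32-bit register values are identical, else 1."""
    difference = (a ^ b) & 0xFFFFFFFF   # bitwise XOR: 1 wherever bits differ
    signal = 0
    for bit in range(32):               # OR all 32 bits together
        signal |= (difference >> bit) & 1
    return signal

same = not_equal_signal(0x1234ABCD, 0x1234ABCD)
different = not_equal_signal(0x1234ABCD, 0x1234ABCC)
```

No carry chain is involved, which is why this test is cheap enough to duplicate outside the ALU.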
114Explanations
- Moving the jump address calculation one step ahead means that we will always do the calculation even when it is not needed.
- Simple comparator duplicates one ALU function
115New problem
- We now need the correct values of the input registers in step ID
- More data hazards
- add $t0, $t2, $t3
- beq $t0, $s0, 400
- Data forwarding can reduce the number of nops but not eliminate them
116New data hazards
add   IF     IDReg  EX     MEM    WB
nop
nop
beq                        IF     IDReg  EX     MEM
117EXCEPTIONS AND INTERRUPTS
118Interrupts (I)
- Request to interrupt the flow of execution of the CPU
- Detected by the CPU hardware
- After it has executed the current instruction
- Before it starts the next instruction.
119Interrupts (II)
- When an interrupt occurs
- The current state of the CPU (program counter, program status word, contents of registers, and so forth) is saved, normally on the top of a stack
- A new CPU state is fetched
120Interrupts (III)
- New state includes a new hardware-defined value for the program counter
- Cannot hijack interrupts
- Process is totally transparent to the task being interrupted
- A process never knows whether it has been interrupted or not
121Types of interrupts (I)
- I/O completion interrupts
- Notify the OS that an I/O operation has completed
- Timer interrupts
- Notify the OS that a task has exceeded its quantum of CPU time
122Types of interrupts (II)
- Traps
- Notify the OS of a program error (division by zero, illegal op code, illegal operand address, ...) or a hardware failure
- System calls
- Notify OS that the running task wants to submit a request to the OS
- Notification of another event
123A surprising discovery
- Programs do interrupt themselves!
124Context switches
- Each interrupt will result in two context switches
- One when the running task is interrupted
- Another when it regains the CPU
- Context switches are not cheap
- The overhead of any simple system call is two context switches
Remember that for 4330!
125Prioritizing interrupts (I)
- Interrupt requests may occur while the system is processing another interrupt
- Not all interrupts are equally urgent (as is also the case in real life)
- Some are more urgent than others
126Prioritizing interrupts (II)
- The best solution is to prioritize interrupts and assign to each source of interrupts a priority level
- New interrupt requests will be allowed to interrupt lower-priority interrupts but will have to wait for the completion of all other interrupts
- Solution is known as vectorized interrupts.
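The priority rule can be stated as a one-line test. This is only a sketch: the function name is hypothetical, and it assumes that a larger number means a more urgent interrupt.

```python
def can_interrupt(new_level, running_level):
    """A new request preempts only a strictly lower-priority interrupt."""
    return new_level > running_level

preempts_lower = can_interrupt(new_level=3, running_level=1)
waits_for_equal = can_interrupt(new_level=2, running_level=2)
waits_for_higher = can_interrupt(new_level=1, running_level=3)
```

The strict comparison means that a request never interrupts another request of the same priority level.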
127Example from real life
- Let us try to prioritize
- Phone is ringing
- Washer signals end of cycle
- Dark smoke is coming out of the kitchen
-
- With vectorized interrupts, a phone call will
never interrupt another phone call
128The solution
129MIPS Implementation (I)
- Interrupts are a special case of a branch
- Use same techniques for handling control hazards
- Almost all MIPS interrupts jump to the same hardware address (0x80000180)
- MIPS uses a special register to pass along the type of interrupt to the interrupt handler
- The Cause register
130MIPS Implementation (II)
- MIPS also saves the address + 4 of the affected instruction in a special register
- EPC register
- A STATUS register allows selective disabling of interrupts
- Useful for handling short critical sections in single-threaded kernel
131Issues (I)
- Interrupted instruction may have to be restarted
- Typical for I/O completion interrupts
- Must then maintain precise exceptions that accurately identify the instruction being interrupted
- Not true for hardware interrupts
132Issues (II)
- Must be able to restart instruction at the exact point it was interrupted
- Not always easy on many architectures
- MIPS solution is to roll back everything and restart instruction as if nothing had happened
- Easier on MIPS since register/memory update is always the last step of any instruction
- Must still ensure that we can restore the original values of all registers
133Branch prediction
- CPU will try to predict whether a branch will be taken or not
- Important for loops
- Branch is taken at every iteration but last one
See speculative execution
134Parallelism
- Instruction-level parallelism (ILP)
- Two ways
- Increasing the depth of the pipeline
- More steps can be executed in parallel
- Multiple issue
- We duplicate some units (ALU)
- Two or more units can be at the same pipeline
stage
135An example
- Could modify the toy MIPS architecture by adding a second ALU
- Would allow RR instructions to be executed in parallel with load and store instructions
- Would also need extra ports in the register bank
- Faster but much more complex
136Hazards
- Become an even bigger issue
- Some architectures assume that the compiler will take care of all data hazards
- Will never issue sequence of instructions with unsatisfied dependencies
- Other architectures check for problems
137Speculative execution (I)
- Can speculate that
- A branch will not be taken (think of loops)
- A store that precedes a load will not store at the address the load will use
- and execute the instruction(s) hoping for the best
- If speculation is wrong, we must undo what we have done
138Speculative execution (II)
- Any speculation mechanism must include methods to
- Check if the speculation was correct
- Undo the effect of the speculated instructions
- Quite complex
- Can be done by the compiler or the hardware
139Fallacies
- Pipelining is easy
- Pipelining ideas can be implemented independently
of technologies
140Pitfalls
- Instruction set must be pipelining friendly