Title: Computer Architecture
1Computer Architecture
- Processor Design-Advanced Topics
2Chapter Outline
- 5.1 Pipelining
- A pipelined design of SRC
- Pipeline hazards
- 5.2 Instruction-Level Parallelism
- Superscalar processors
- Very Long Instruction Word (VLIW) machines
- 5.3 Microprogramming
- Control store and micro-branching
- Horizontal and vertical microprogramming
3The Pipeline and the Assembly Line
- Executing Machine Instructions versus
Manufacturing Small Parts
[Figure: instruction interpretation and execution compared with part manufacture. The stages Fetch Instruction, Fetch Operands, ALU Operation, Memory Access, and Register Write are set beside the manufacturing steps Select part, Drill part, Cut part, Polish part, and Package part. Panel (a): without pipelining/assembly line, showing add r4, r3, r2. Panel (b): with pipelining/assembly line, showing ld r2, addr2; st r4, addr1; add r4, r3, r2; sub r2, r5, 1; and shr r3, r3, 2 in stages 1 through 5.]
4The Pipeline Stages
- 5 pipeline stages are shown
- 1. Fetch instruction
- 2. Fetch operands
- 3. ALU operation
- 4. Memory access
- 5. Register write
- Example of 5 instructions executing at different stages in the pipeline
- shr r3, r3, 2: storing result into r3
- sub r2, r5, 1: idle, no memory access needed
- add r4, r3, r2: performing addition in ALU
- st r4, addr1: accessing r4 and addr1
- ld r2, addr2: fetching instruction
5Pipelining Instruction Processing
- Pipeline stages are shown top to bottom in the order traversed by one instruction
- Instructions are listed in the order they are fetched
- The order of instructions in the pipeline is the reverse of the order listed
- If each stage takes 1 clock:
- every instruction takes 5 clocks to complete
- some instruction completes every clock tick
- Two performance issues: instruction latency and instruction bandwidth
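The latency and bandwidth figures above can be checked with a little arithmetic. The sketch below is illustrative only: the function names are hypothetical, and it assumes an ideal pipeline with one clock per stage and no stalls.

```python
def unpipelined_clocks(n_instructions: int, n_stages: int = 5) -> int:
    """Clocks needed when each instruction finishes all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_clocks(n_instructions: int, n_stages: int = 5) -> int:
    """Clocks needed in an ideal pipeline (assumption: no stalls).

    The first instruction takes n_stages clocks (the latency); after that,
    one instruction completes on every clock tick (the bandwidth).
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, unpipelined_clocks(n), pipelined_clocks(n))
```

Latency per instruction stays at 5 clocks in both cases; only the completion rate changes.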
6Dependence Among Instructions
- Execution of some instructions can depend on the completion of others in the pipeline
- Pipeline stalls: one solution is to stall the pipeline
- early stages stop while later ones complete processing
- Data forwarding: dependences involving registers can be detected and the data forwarded to the instruction needing it, without waiting for the register write
- Dependence involving memory is harder and is sometimes addressed by restricting the way the instruction set is used
- Delayed load: by decree, values loaded from memory into the register file cannot be accessed until two instructions later.
- The branch delay slot is another example of such a restriction: branch targets cannot be computed before the instruction following the branch instruction has entered the pipeline. Hardware detects the dependence and stalls the pipeline.
7Branch and Load Delay Examples
Branch Delay
brz r2, r3
add r6, r7, r8      This instruction is always executed
st r6, addr1        Only done if r2 ≠ 0
Load Delay
ld r2, addr
add r5, r1, r2      This instruction gets the old value of r2
shr r1, r1, 4
sub r6, r8, r2      This instruction gets the r2 value loaded from addr
- The working of individual instructions is not changed, but
- the way they work together is changed.
8Characteristics of Pipelined Processor Design
- The instruction set is unchanged
- Instructions should execute and provide the same results on all architectures, regardless of the pipeline structure.
- Main memory must operate in one cycle
- This can be accomplished by expensive memory, but
- it is usually done with cache, to be discussed in Chap. 7
- Instruction and data memory must appear separate
- Harvard architecture has separate instruction and data memories
- Again, this is usually done with separate caches
9Characteristics of Pipelined Processor Design
- 3-port register file
- A pipelined architecture requires a 3-port register file to allow the reading of two operands and the writing of a third in a single clock cycle.
- Modification to buses and the data path
- Few buses are used, since
- most connections are point to point
- Some few-way multiplexers are used
- Data is latched (stored in temporary registers) at each pipeline stage; these registers are called pipeline registers
- ALU operations take only 1 clock (esp. shift)
10Adapting Instructions to Pipelined Execution
- All instructions must fit into a common pipeline stage structure
- We use a 5-stage pipeline for the SRC
- (1) Instruction fetch
- (2) Decode and operand access
- (3) ALU operations
- (4) Data memory access
- (5) Register write
- We must fit load/store, ALU, and branch instructions into this pattern
11Control Signals
- Need to specify the signals that will control the flow through the pipeline
- Grouping opcodes into sets of instructions with similar properties is useful in generating the signals that control the register transfers through the pipeline.
- Example: Figure 5.3 (next slide)
12Logic Expressions Defining Pipeline Stage Activity
branch := br ∨ brl
cond := (IR2⟨2..0⟩ = 1) ∨ ((IR2⟨2..1⟩ = 1) ∧ (IR2⟨0⟩ ⊕ (R[rc] = 0))) ∨ ((IR2⟨2..1⟩ = 2) ∧ ¬(IR2⟨0⟩ ⊕ R[rc]⟨31⟩))
sh := shr ∨ shra ∨ shl ∨ shc
alu := add ∨ addi ∨ sub ∨ neg ∨ and ∨ andi ∨ or ∨ ori ∨ not ∨ sh
imm := addi ∨ andi ∨ ori ∨ (sh ∧ (IR2⟨4..0⟩ ≠ 0))
load := ld ∨ ldr
ladr := la ∨ lar
store := st ∨ str
l-s := load ∨ ladr ∨ store
regwrite := load ∨ ladr ∨ brl ∨ alu      Instructions that write to the register file
dsp := ld ∨ st ∨ la                      Instructions that use disp addressing
rl := ldr ∨ str ∨ lar                    Instructions that use rel addressing
- Notes
- cond and imm are used only in stage 2 ⇒ IR2 (the instruction register for stage 2) is the register from which these signals are generated.
- Other signals in the example will be required in several different stages. A number is appended to the signal name to show which stage generates it (e.g., branch2 is generated in stage 2 by testing the opcode field in IR2)
13Notes on the Equations and Different Stages
- The logic equations are based on the instruction in the stage where they are used
- When necessary, we append a digit to a logic signal name to specify that it is computed from values in that stage
- Thus regwrite5 is true when the opcode in stage 5 satisfies load5 ∨ ladr5 ∨ brl5 ∨ alu5, all of which are determined from op5
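The grouping of opcodes into classes, with a stage digit appended to each signal name, can be sketched in software. This is only an illustration of the idea: the function name stage_signals and the use of mnemonic strings in place of opcode bits are assumptions, not part of the SRC design.

```python
# Opcode classes as listed in the logic expressions (mnemonics stand in for opcode bits).
SH = {"shr", "shra", "shl", "shc"}
ALU = {"add", "addi", "sub", "neg", "and", "andi", "or", "ori", "not"} | SH
LOAD = {"ld", "ldr"}
LADR = {"la", "lar"}
STORE = {"st", "str"}
BRANCH = {"br", "brl"}


def stage_signals(op: str, stage: int) -> dict:
    """Return the class signals for the opcode held in a given stage.

    Signal names get the stage digit appended, e.g. regwrite5.
    (Hypothetical helper for illustration.)
    """
    alu = op in ALU
    load = op in LOAD
    ladr = op in LADR
    store = op in STORE
    signals = {
        "branch": op in BRANCH,
        "alu": alu,
        "load": load,
        "ladr": ladr,
        "store": store,
        # Instructions that write to the register file.
        "regwrite": load or ladr or op == "brl" or alu,
    }
    return {f"{name}{stage}": value for name, value in signals.items()}


if __name__ == "__main__":
    print(stage_signals("ld", 5))
```

For example, stage_signals("ld", 5) reports regwrite5 as true, while stage_signals("st", 5) reports it as false.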
14ALU Instructions
- ALU instructions
- Instructions fit into 5 stages
- Stage 1, instruction fetch: the instruction pointed to by PC is fetched from instruction memory (separate from data) and PC is incremented.
- Stage 2, instruction decode/operand access: the instruction is read from IR2 and decoded. Recall that all ALU or shift operations are of the form
- R[ra] ← R[rb] op R[rc]
- R[ra] ← R[rb] op c2⟨16..0⟩
- The second ALU operand comes either from a register or from the instruction register c2 field (see next slide)
- Y3 ← (imm → c2: ¬imm → R[rc])
- Stage 3, ALU operation: the opcode must be available in stage 3 to tell the ALU what to do.
- Stage 4, memory access: since ALU instructions perform no memory access ⇒ NOP.
- Stage 5, register write: the result register, ra, is written in stage 5 with the ALU result (the Z4 value, passed unchanged through stage 4 into Z5). The regwrite signal is set to true to enable the write into the register.
15The Memory Access Instructions ld, ldr, st,
and str
- RTN of memory access instructions
- ld (:= op = 1) → R[ra] ← M[disp]:
- ldr (:= op = 2) → R[ra] ← M[rel]:
- st (:= op = 3) → M[disp] ← R[ra]:
- str (:= op = 4) → M[rel] ← R[ra]:
- la (:= op = 5) → R[ra] ← disp:
- lar (:= op = 6) → R[ra] ← rel:
- disp⟨31..0⟩ := ((rb = 0) → c2⟨16..0⟩ {sign extend}:
- (rb ≠ 0) → R[rb] + c2⟨16..0⟩ {sign extend, 2's comp.}):
- rel⟨31..0⟩ := PC⟨31..0⟩ + c1⟨21..0⟩ {sign extend, 2's comp.}:
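The disp and rel address definitions can be tried out numerically. The following is a minimal sketch, assuming registers are modeled as a Python list of 32-bit values; the helper names sign_extend, disp, and rel are chosen here for illustration.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def disp(regs: list, rb: int, c2: int) -> int:
    """Displacement address: c2 (17 bits, sign extended), plus R[rb] unless rb = 0."""
    base = 0 if rb == 0 else regs[rb]
    return (base + sign_extend(c2, 17)) & MASK32


def rel(pc: int, c1: int) -> int:
    """Relative address: PC plus c1 (22 bits, sign extended)."""
    return (pc + sign_extend(c1, 22)) & MASK32


if __name__ == "__main__":
    regs = [0] * 32
    regs[5] = 16
    print(disp(regs, 5, 128))  # R[5] + 128
    print(rel(112, 32))        # PC + 32
```

Note that rb = 0 selects the constant alone rather than the contents of R[0], which is what lets ld and st use absolute addresses.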
16The Memory Access Instructions ld, ldr, st,
and str
- Stage 1, instruction fetch and PC increment. Note that the incremented value of PC is recorded in PC2.
- Stage 2, operand fetch.
- 1st address computation operand
- X3 ← (rl → PC2: dsp → R[rb])
- 2nd address computation operand
- Y3 ← (rl → c1: dsp → c2)
- Stage 3, ALU operation: the relative or displacement address is computed by adding X3 and Y3. The result is stored in Z4.
- Stage 4, memory access: for ld or ldr, data memory at the address in Z4 is copied into Z5. For la or lar, the address in Z4 is copied directly into Z5. Store instructions have the value in MD4 written into data memory at the address stored in Z4.
- Stage 5, register write: for a load instruction, regwrite is true and the value stored in Z5 is written into the register file at the register address given by the ra field, which has been propagated from IR2 down to IR5.
17Branch Instructions
cond := (c3⟨2..0⟩ = 0 → 0:                never
         c3⟨2..0⟩ = 1 → 1:                always
         c3⟨2..0⟩ = 2 → R[rc] = 0:        if register is zero
         c3⟨2..0⟩ = 3 → R[rc] ≠ 0:        if register is nonzero
         c3⟨2..0⟩ = 4 → R[rc]⟨31⟩ = 0:    if positive or zero
         c3⟨2..0⟩ = 5 → R[rc]⟨31⟩ = 1 ):  if negative
br (:= op = 8) → (cond → (PC ← R[rb])):                  Conditional branch
brl (:= op = 9) → (R[ra] ← PC: cond → (PC ← R[rb])):     Branch and link
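The branch condition can be written as a small function for experimentation. This is a sketch under stated assumptions: the function name cond mirrors the RTN name, the register value is passed as an unsigned 32-bit integer, and raising an error for the codes 6 and 7 (which the definition above does not list) is a choice made here, not SRC behavior.

```python
def cond(c3: int, r_rc: int) -> bool:
    """Evaluate the branch condition selected by c3<2..0> on the value of R[rc]."""
    code = c3 & 0b111
    r_rc &= 0xFFFFFFFF
    sign = (r_rc >> 31) & 1  # bit 31 is the 2's complement sign bit
    if code == 0:
        return False          # never
    if code == 1:
        return True           # always
    if code == 2:
        return r_rc == 0      # if register is zero
    if code == 3:
        return r_rc != 0      # if register is nonzero
    if code == 4:
        return sign == 0      # if positive or zero
    if code == 5:
        return sign == 1      # if negative
    # Assumption: codes 6 and 7 are not defined in the table above.
    raise ValueError(f"undefined condition code {code}")


if __name__ == "__main__":
    print(cond(2, 0), cond(5, 0x80000000))
```

Codes 2/3 and 4/5 differ only in the low bit of c3, which is why the hardware expression for cond can test c3⟨2..1⟩ and use c3⟨0⟩ to invert the test.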
18Branch Instructions
- The new program counter value is known in stage 2, but not in stage 1.
- If the branch is taken, the PC receives the new branch address.
- Only branch and link (brl) performs a register write in stage 5
- The incremented value of the old PC is stored in PC2, to be written into R[ra] (the link register) in stage 5 regardless of whether the branch is taken or not.
- There is no ALU or memory operation
- Mp1 is controlled according to
- cond(IR2, R[rc]) → (PC ← R[rb]): X3 ← PC2
19Designing the Pipelined Data Path
- All information pertaining to the instruction that will be used in subsequent stages of execution (data and instruction) must be propagated along the pipeline in so-called pipeline registers.
- Global state
- Register file,
- Data memory,
- Instruction memory
- The SRC pipeline registers and RTN specification
- Hardware and control to support pipelining
- Requires
- examination of previous figures and
- determination of which information needs to be propagated to the next stage.
20Designing the Pipelined Data Path
- RTN and the pipeline design
- Figure 5.6 (next slide) depicts all of the pipeline registers and the RTN descriptions of the flow of all the instructions through the pipeline.
- It combines
- all the data path (pipeline) registers, and
- the actions specified for the different instruction classes (as described previously).
21Pipeline Registers and RTN Specification
- Control signals are labeled with the stage from which they are computed. Example:
- PC ← (¬branch2 → PC + 4:
- branch2 → (cond(IR2, R[rc]) → R[rb]:
- ¬cond(IR2, R[rc]) → PC + 4))
- Propagation of the IR register content across the pipeline is necessary.
- Stages 3, 4, and 5 require only the op field and the ra field.
- In stage 3, the ALU instructions require the opcode to determine which operation to perform.
- Stage 4 requires the opcode to supply the load and store instructions with the information they need to control data memory access.
- In stage 5, ra tells the instruction which register in the register file to write its value into; the opcode also determines whether a register write is required.
- Z4 (ALU output register) holds
- a memory address (load, store)
- a memory value if the instruction is ld or ldr
- the incremented PC (brl)
- the ALU result for an ALU instruction
22Global State of the Pipelined SRC
- PC, the general registers, instruction memory, and data memory represent the global machine state
- PC is accessed in stage 1 (and stage 2 on a branch)
- Instruction memory is accessed in stage 1
- General registers are read in stage 2 and written in stage 5
- Data memory is only accessed in stage 4
23Restrictions on Access to Global State by Pipeline
- We can see why separate instruction and data memories (or caches) are needed
- When a load or store accesses data memory in stage 4, stage 1 is accessing an instruction
- Thus two memory accesses occur simultaneously
- Two operands may be needed from registers in stage 2 while another instruction is writing a result register in stage 5.
- Thus, as far as the registers are concerned, 2 reads and a write happen simultaneously
- The increment of PC in stage 1 must be overridden by a successful branch in stage 2
24Control Signals Pipeline Data Path
- The pipeline data path with selected control signals
- Most control signals are shown and given values
- Multiplexer control is stressed in the next figure
- Notation change on the inputs/outputs of the register file
- Address inputs are labeled a1, a2, and a3.
- The figure on the next slide indicates which register field from the instruction (ra, rb, or rc) is sent to which address input (a1, a2, or a3).
- Data inputs/outputs are labeled R1, R2, and R3.
25Control Signals Pipeline Data Path
[Figure: the SRC pipeline data path with selected control signals, stages 1 to 5. Stage 1: instruction memory, PC, Inc4, and multiplexer Mp1. Stage 2: IR2, PC2, and the register file (address inputs a1, a2, a3; data ports R1, R2, R3; write strobe W3) with multiplexers Mp2, Mp3, Mp4 and gate signals G1, GA1, G2. Stage 3: IR3, X3, Y3, MD3, and the ALU. Stage 4: IR4, Z4, MD4, data memory, and multiplexer Mp5. Stage 5: IR5, Z5, and register write.]
- GA1 plays the role of BAout: it gates all 0s if R0 is selected as part of the disp calculation
- Mp1-Mp5 allow the pipeline registers to have multiple input sources
- Mp2: (¬store → rc): (store → ra)
- Mp3: (rl ∨ branch → PC2): (dsp ∨ alu → R1)
- Mp4: (rl → c1): (dsp ∨ imm → c2): (alu ∧ ¬imm → R2)
- Mp5: (¬load → Z4): (load → mem data)
- Register write: load ∨ ladr ∨ brl ∨ alu
26Control Signals Pipeline Data Path
- Example
- Mp4 ← (rl → c1:                        {ldr, str, lar: rel instr.}
- (dsp ∨ imm) → c2:                      {addi, andi, ori: imm instr.; or ld, st, la: disp instr.}
- ((alu ∧ ¬imm) ∨ sh) → R2)              {alu and not imm., or shift instruction}
- Register operand access in stage 2
- For all instructions with the exception of the store instructions:
- rb and rc specify the source operands to be accessed in stage 2
- ra specifies the register into which the result is to be stored in stage 5.
- Store instructions:
- R[ra] contains the value of the operand to be fetched out of the register file in stage 2
- Multiplexer Mp2 is used to route ra instead of rc to register read address port a2.
- The fetched value R[ra] is copied into MD3 to be stored in memory in stage 4.
27Generating the control Signals
- In the pipeline architecture
- Control signals are generated at each stage from the op field in IRx ⇒
- control signals are distributed throughout the stages of the pipeline.
- Most control signals that are generated (see previous figure) in a given stage are also used in that stage.
- There are a few specific exceptions, e.g., PC control.
- Note that each register must have a strobe signal that controls reading/writing of data from/to it.
- From the figure, all the paths are point-to-point ⇒
- no gating signals are required except at the multiplexers.
- The RTN presented in the previous figure provides sufficient information to generate all of the gate and strobe signals in the data path.
- Special cases of pipeline hazards require special solutions (covered later).
28Propagating an Instruction Sequence Through the
Pipeline
100  add r4, r6, r8        R[4] ← R[6] + R[8]
104  ld r7, 128(r5)        R[7] ← M[R[5] + 128]
108  brl r9, r11, 001      PC ← R[11]: R[9] ← PC
112  str r12, 32           M[PC + 32] ← R[12]
. . .
512  sub ...               next instr. ...
- PC is initialized to PC = 100.
- R[11] = 512 when the brl instruction is executed.
- R[6] = 4 and R[8] = 5 are the add operands
- R[5] = 16 for the ld and R[12] = 23 for the str
29First Clock Cycle add Enters Stage 1 of Pipeline
- Program counter is incremented to 104
[Figure: the instruction listing (addresses 100 to 512) beside the pipeline. Stage 1: add r4, r6, r8.]
30Second Clock Cycle
- add Enters Stage 2
- ld is Being Fetched at Stage 1
- add operands are fetched in stage 2
[Figure: the instruction listing beside the pipeline. Stage 1: ld r7, 128(r5). Stage 2: add r4, r6, r8.]
31Third Clock Cycle brl Enters the Pipeline
- add performs its arithmetic in stage 3
- ld moves to stage 2
[Figure: the instruction listing beside the pipeline. Stage 1: brl r9, r11, 001. Stage 2: ld r7, 128(r5). Stage 3: add r4, r6, r8.]
32Fourth Clock Cycle str Enters the Pipeline
- add is idle in stage 4
- Success of brl changes program counter to 512
[Figure: the instruction listing beside the pipeline. Stage 1: str r12, 32. Stage 2: brl r9, r11, 001. Stage 3: ld r7, 128(r5). Stage 4: add r4, r6, r8.]
33Fifth Clock Cycle
- add Completes
- sub Enters the Pipeline
- add completes in stage 5
- sub is fetched from location 512 after successful
brl
[Figure: the instruction listing beside the pipeline. Stage 1: sub. Stage 2: str r12, 32. Stage 3: brl r9, r11, 001. Stage 4: ld r7, 128(r5). Stage 5: add r4, r6, r8.]
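The five clock cycles traced on the preceding slides can be reproduced with a few lines of code. This is a sketch, not a simulator: it assumes one clock per stage and no stalls, and it takes the fetch order as given (the sub at 512 follows the str because the successful brl changes the PC in stage 2, while the str is already being fetched). The names FETCH_ORDER and occupancy are illustrative.

```python
# Instructions in the order they are fetched in this example.
FETCH_ORDER = [
    "add r4, r6, r8",
    "ld r7, 128(r5)",
    "brl r9, r11, 001",
    "str r12, 32",
    "sub ...",
]


def occupancy(fetch_order: list, cycle: int, n_stages: int = 5) -> dict:
    """Return {stage: instruction} for a given clock cycle (both 1-based).

    Assumption: every instruction advances one stage per clock, so the
    instruction fetched in cycle k is in stage s during cycle k + s - 1.
    """
    stages = {}
    for stage in range(1, n_stages + 1):
        index = cycle - stage  # 0-based position in the fetch order
        if 0 <= index < len(fetch_order):
            stages[stage] = fetch_order[index]
    return stages


if __name__ == "__main__":
    for cycle in range(1, 6):
        print(f"cycle {cycle}:", occupancy(FETCH_ORDER, cycle))
```

In cycle 4 the str sits in stage 1 while brl is in stage 2, which is the branch delay slot at work.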
34Functions of the SRC Pipeline Stages
- Stage 1 fetches the instruction
- PC is incremented, or replaced by a successful branch in stage 2
- Stage 2 decodes the instruction and gets operands
- Load or store gets operands for address computation
- Store gets the register value to be stored as a 3rd operand
- ALU operation gets 2 registers, or a register and a constant
- Stage 3 performs the ALU operation
- Calculates the effective address or does arithmetic/logic
- May pass through the link PC or the value to be stored in memory
35Functions of the SRC Pipeline Stages
- Stage 4 accesses data memory
- Passes Z4 to Z5 unchanged for non-memory instructions
- Load fills Z5 from memory
- Store uses the address from Z4 and the data from MD4 (no longer needed)
- Stage 5 writes the result register
- Z5 contains the value to be written, which can be an ALU result, effective address, PC link value, or fetched data
- The ra field always specifies the result register in SRC
36Functions of the Pipeline Registers in SRC
- Registers between stages 1 and 2
- IR2 holds the full instruction, including any register fields and constant
- PC2 holds the incremented PC from instruction fetch
- Registers between stages 2 and 3
- IR3 holds the opcode and ra (needed in stage 5)
- X3 holds PC or a register value (for link or 1st ALU operand)
- Y3 holds c1 or c2 or a register value as the 2nd ALU operand
- MD3 is used for a register value to be stored in memory
37Functions of the Pipeline Registers in SRC
- Registers between stages 3 and 4
- IR4 has the opcode and ra
- Z4 has a memory address or a result register value
- MD4 has the value to be stored in data memory
- Registers between stages 4 and 5
- IR5 has the opcode and the destination register number, ra
- Z5 has the value to be stored in the destination register: an ALU result, PC link value, or fetched data
38Pipeline Hazard
- Entirely predictable, deterministic events.
- Occur as side effects of having instructions in the pipeline that depend upon the results of instructions ahead of them that have not yet exited the pipeline.
- The element of hazard comes only from not taking the pipeline's behavior into account:
- Rogue compiler
- Assembly language programmer
- The compiler must perform hazard analysis based on static conditions (as opposed to run-time dynamic conditions) and thus take the worst-case scenario into account.
39Dependence Between Instructions in Pipe
Pipeline Hazards
- Instructions that occupy the pipeline together are being executed in parallel
- This leads to the problem of instruction dependence, well known in parallel processing
- The basic problem is that an instruction depends on the result of a previously issued instruction that is not yet complete
- Two categories of hazards
- Data hazards: an instruction initiates modification of the data in a register that is needed by one of the next instructions in the pipeline.
- Branch hazards: fetch of the wrong instruction on a change in PC
40Data Hazard
- Classification of data hazards
- A read after write (RAW) hazard arises from a flow dependence, where an instruction uses data produced by a previous one
- A write after read (WAR) hazard comes from an anti-dependence, where an instruction writes a new value over one that is still needed by a previous instruction
- A write after write (WAW) hazard comes from an output dependence, where two parallel instructions write to the same register and must do it in the order in which they were issued
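The three dependence types can be made concrete by comparing the registers two instructions read and write. The sketch below is an illustration with hypothetical names (Instr, classify); it only names the dependence and says nothing about whether a given pipeline actually turns it into a hazard.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]        # register written, if any
    sources: Tuple[str, ...]   # registers read


def classify(first: Instr, second: Instr) -> Set[str]:
    """Name the dependences of `second` on the earlier instruction `first`."""
    found = set()
    if first.dest is not None and first.dest in second.sources:
        found.add("RAW")  # flow dependence: second reads what first writes
    if second.dest is not None and second.dest in first.sources:
        found.add("WAR")  # anti-dependence: second overwrites what first reads
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")  # output dependence: both write the same register
    return found


if __name__ == "__main__":
    add = Instr("add r0, r2, r4", "r0", ("r2", "r4"))
    sub = Instr("sub r3, r0, r1", "r3", ("r0", "r1"))
    print(classify(add, sub))
```

For the add/sub pair used in the SRC example, the only dependence found is RAW.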
41Data Hazards in SRC
- Since all data memory access occurs in stage 4, memory writes and reads are sequential and give rise to no hazards
- Since all registers are written in the last stage, WAW and WAR hazards do not occur
- Two writes always occur in the order issued, and a write always follows a previously issued read
- SRC hazards on register data are limited to RAW hazards coming from flow dependence
- Values are written into registers at the end of stage 5 but may be needed by a following instruction at the beginning of stage 2
42Example of Pipeline Data Hazard in SRC
- 100 add r0, r2, r4 (the add instruction writes into register r0 in stage 5)
- 104 sub r3, r0, r1 (the sub instruction reads register r0 in stage 2)
- How to prevent this kind of hazard
- When instruction
- add is in stage 5,
- sub must be no closer than stage 1
- ⇒ separation of at least 4 instructions!
- Note that the result operand of the add instruction is actually available in register Z4 when the sub instruction requires it.
- Data forwarding: forwarding hardware can be designed to detect this particular hazard and to forward the value to register Y3 in time for the sub instruction to use it.
43Possible Solutions to the Register Data Hazard
Problem
- Detection
- The machine manual could list rules specifying that a dependent instruction cannot be issued less than a given number of steps after the one on which it depends
- This is usually too restrictive
- Since the operation and operands are known at each stage, dependence on a following stage can be detected
- Correction
- The dependent instruction can be stalled and those ahead of it in the pipeline allowed to complete
- The result can be forwarded to a following instruction in a previous stage without waiting to be written into its register
- The preferred SRC design will use detection and forwarding, and will stall only when unavoidable
44Detecting Hazards and Dependence Distance
- To detect hazards, each pair of instructions must be considered
- Data is normally available after being written to a register
- It can be made available for forwarding as early as the stage where it is produced
- Stage 3 output for ALU results, stage 4 for memory fetch
- Operands are normally needed in stage 2
- They can be received from forwarding as late as the stage in which they are used
- Stage 3 for ALU operands and address modifiers, stage 4 for a stored register, stage 2 for a branch target
45Data Hazards in SRC
- Determining all possible data hazards in a given pipeline structure and instruction set means considering all possible interactions between all instructions at all stages in the pipeline.
- ALU instructions
- All ALU instructions write and read data.
- Potential for a data hazard (previous example) because they
- read data in stage 2 (Normally Read/Needed)
- write data in stage 5 ⇒ data becomes available at stage 6 (Normally Written/Available)
- ⇒ ALU instructions that use the data from previous ALU instructions must be separated by at least four instructions.
- Written data is actually available in Z4 when the ALU instruction is in stage 4 (Earliest Available).
- Read data, normally fetched in stage 2, does not get used until stage 3, when the ALU operation is performed (Latest Needed).
- Data forwarding: one hardware solution that would detect the hazard between the two instructions
- Forward the data from Z4 to the proper ALU input in the previous stage.
46Data Hazards - ALU Instructions
- Data forwarding implementation requires further analysis
- Associate with ALU instructions a pair of numbers:
- the stage where the data is Normally Available
- the stage where the data is Earliest Available.
- For ALU instructions that write data, this Normally Available/Earliest Available pair is 6/4.
- For ALU instructions that read data, a similar pair of stages is associated with the register-reading requirements: Normally Required and Latest Required.
- For an ALU reader, this Normally Required/Latest Required pair is 2/3.
- Taking the pair-wise difference gives the minimum spacing for the ALU pair: (6 - 2)/(4 - 3) = 4/1
- Instructions must be separated by at least 4 stages unless there is a forwarding scheme.
- With a forwarding scheme, they need to be separated by only one stage.
47Instruction Pair Hazard Interaction
- Data dependence: instruction separation needed for no hazard, Normal/Forwarded
                                        Write to register file
                                        (data available Normal/Earliest, stage)
Read from register file                 alu     load    ladr    brl
(Normal/Latest, stage)                  6/4     6/5     6/4     6/2
alu          2/3                        4/1     4/2     4/1     4/1
load         2/3                        4/1     4/2     4/1     4/1
ladr         2/3                        4/1     4/2     4/1     4/1
store (rb)   2/3                        4/1     4/2     4/1     4/1
store (ra)   2/4                        4/1     4/1     4/1     4/1
branch       2/2                        4/2     4/3     4/2     4/1
- Callouts: for the alu/alu entry, 6 - 2 = 4 and 4 - 3 = 1; the branch row is marked as the branch hazard.
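The entries of the pair table follow from the stage pairs by subtraction, as the 6 - 2 = 4 and 4 - 3 = 1 callouts show. The sketch below recomputes them; the names WRITERS, READERS, and min_spacing are illustrative, and clamping the forwarded distance to a minimum of 1 is an assumption made here so that the result matches entries such as brl/alu (4/1), where the plain difference would be zero or negative.

```python
# (normally available, earliest available) stage for each register-writing class.
WRITERS = {"alu": (6, 4), "load": (6, 5), "ladr": (6, 4), "brl": (6, 2)}

# (normally needed, latest needed) stage for each register-reading class.
READERS = {
    "alu": (2, 3),
    "load": (2, 3),
    "ladr": (2, 3),
    "store (rb)": (2, 3),
    "store (ra)": (2, 4),
    "branch": (2, 2),
}


def min_spacing(writer: str, reader: str) -> tuple:
    """Return (normal, forwarded) minimum separation for a writer/reader pair."""
    normal_available, earliest_available = WRITERS[writer]
    normal_needed, latest_needed = READERS[reader]
    normal = normal_available - normal_needed
    # Assumption: successive instructions are always at least 1 apart.
    forwarded = max(earliest_available - latest_needed, 1)
    return normal, forwarded


if __name__ == "__main__":
    for reader in READERS:
        row = ["%d/%d" % min_spacing(writer, reader) for writer in WRITERS]
        print(f"{reader:12s}", *row)
```

The load column is the one where forwarding cannot bring the distance down to 1, which is the source of the load delay restriction.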
48Instruction Pair Hazard Interaction
- The previous table considers only register writes and subsequent register reads.
- It also covers hazards due to loads from data memory to the register file.
- What are the data hazards associated with stores?
- The only possible hazard after a store is a subsequent load,
- since only load instructions read data memory, and
- since both load and store access memory in stage 4
- ⇒ there is no hazard.
- Branch delay
- Branch delay is unavoidable
- Work around the problem, as is the case with delayed loads.
- Techniques that try to predict the outcome of a branch in the fetch stage ⇒ beyond the scope of this class.
49Delays Unavoidable by Forwarding
- In the load column of the table, we see that the value loaded cannot be available to the next instruction, even with forwarding
- The compiler can be restricted not to put a dependent instruction in the next position after a load (next 2 positions if the dependent instruction is a branch)
- The target register cannot be forwarded to a branch from the immediately preceding instruction
- Code is restricted so that the branch target must not be changed by the instruction preceding the branch (previous 2 instructions if loaded from memory)
- Do not confuse this with the branch delay slot, which is a dependence of instruction fetch on the branch, not a dependence of the branch on something else
50Hazard Detection in the Compiler
- Hazards resolved by the compiler
- The burden of hazard detection and elimination is on the compiler
- Analyze the code sequence and either
- rearrange the instructions to remove the hazard, or
- if no rearrangement is possible, insert nops between the instructions that form a hazard: a software bubble in the pipeline.
- Problems with this approach
- Additional burden on the compiler writer to develop a compiler that is both correct and efficient.
- Leads to a more expensive and longer development cycle.
- May lead to a buggy compiler when maximum optimization of code is required.
- Without hardware detection there can be no data forwarding ⇒ no reduction in the no-hazard distance of four instructions.
- The compiler can only perform static analysis of code ⇒ it must assume the most pessimistic scenario.
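The nop-insertion option can be sketched as a single pass over the instruction list. This is a simplified illustration, not a real compiler pass: it assumes straight-line code, no forwarding, a uniform no-hazard distance of four instruction positions for every register dependence, and it never tries to rearrange instructions. The names Instr, NOP, and insert_nops are hypothetical.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]        # register written, if any
    sources: Tuple[str, ...]   # registers read


NOP = Instr("nop", None, ())


def insert_nops(program: List[Instr], min_distance: int = 4) -> List[Instr]:
    """Pad with nops so every reader is at least min_distance positions after its writer."""
    scheduled: List[Instr] = []
    for instr in program:
        window = min_distance - 1
        recent = scheduled[-window:] if window > 0 else []
        needed = 0
        # back = 1 is the instruction immediately ahead in the pipeline.
        for back, earlier in enumerate(reversed(recent), start=1):
            if earlier.dest is not None and earlier.dest in instr.sources:
                needed = max(needed, min_distance - back)
        scheduled.extend([NOP] * needed)  # the software bubble
        scheduled.append(instr)
    return scheduled


if __name__ == "__main__":
    program = [
        Instr("add r0, r2, r4", "r0", ("r2", "r4")),
        Instr("sub r3, r0, r1", "r3", ("r0", "r1")),
    ]
    for instr in insert_nops(program):
        print(instr.text)
```

For the add/sub pair this emits three nops between the two instructions; an independent instruction placed between them would reduce the padding by one, which is the rearrangement the compiler would prefer.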
51Hazard Detection by Hardware
- Hazards resolved by pipeline stalls
- Focus on the hardware solution of data forwarding.
- Test for hazards at each place where they can occur (as described in the table).
- Illustrate the process of detecting data hazards with 2-operand ALU-ALU instruction pairs.
- Approach 1
- Remove the hazard by inserting bubbles in the pipeline.
52Hazard Detection by Hardware
- Pipeline bubble insertion
- Facts we need to take into account:
- The minimum spacing between data-dependent instructions.
- The dependent instruction (the "stallee") must be paused at stage 2 until the hazard has been resolved. Note: an instruction cannot complete its operand fetch until the operand has been written to the register file.
- The instruction behind it in stage 1 must also be held as long as the pipeline is stalled.
- The two dependent instructions may be 1, 2, or 3 instructions apart; hazard detection hardware must detect all three of these cases.
- The "staller" and any intermediate instructions between staller and stallee must be allowed to finish and exit the pipeline.
53Example of Detecting ALU Hazards and Stalling
Pipeline
- The following expression detects hazards between ALU instructions in stages 2 and 3 and stalls the pipeline
- (alu3 ∧ alu2 ∧ ((ra3 = rb2) ∨ (ra3 = rc2) ∧ ¬imm2)) → (pause2: pause1: op3 ← 0):
- After such a stall, the hazard will be between stages 2 and 4, detected by
- (alu4 ∧ alu2 ∧ ((ra4 = rb2) ∨ (ra4 = rc2) ∧ ¬imm2)) → (pause2: pause1: op3 ← 0):
- Hazards between stages 2 and 5 require
- (alu5 ∧ alu2 ∧ ((ra5 = rb2) ∨ (ra5 = rc2) ∧ ¬imm2)) → (pause2: pause1: op3 ← 0):
- Reading the first expression: if the opcodes in stages 2 and 3 are both alu, and if ra in stage 3 equals rb or rc in stage 2 (unless the stage 2 instruction is an immediate instruction, in which case there is no rc), there is a hazard between the instructions in stages 2 and 3. Signals are emitted to pause pipeline stages 1 and 2 (pause1 and pause2), and a bubble is inserted in the pipeline between the staller in stage 3 and the stallee in stage 2 (op3 ← 0).
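The three detection expressions share one pattern, which can be checked in software. The sketch below is an illustration only: StageInstr, alu_hazard, and must_stall are hypothetical names, an empty stage (bubble) is modeled as None, and only ALU-ALU pairs are covered, as in the expressions above.

```python
from typing import NamedTuple, Optional


class StageInstr(NamedTuple):
    alu: bool   # true if the opcode in this stage is an ALU instruction
    ra: int     # result register field
    rb: int     # first source register field
    rc: int     # second source register field (unused when imm is true)
    imm: bool   # true if the second operand is the c2 constant


def alu_hazard(ahead: Optional[StageInstr], stage2: Optional[StageInstr]) -> bool:
    """aluN and alu2 and ((raN = rb2) or ((raN = rc2) and not imm2))."""
    if ahead is None or stage2 is None:
        return False
    return (
        ahead.alu
        and stage2.alu
        and (ahead.ra == stage2.rb or (ahead.ra == stage2.rc and not stage2.imm))
    )


def must_stall(stage2, stage3, stage4, stage5) -> bool:
    """True when pause1 and pause2 should be raised and a bubble put in stage 3."""
    return any(alu_hazard(ahead, stage2) for ahead in (stage3, stage4, stage5))


if __name__ == "__main__":
    add = StageInstr(alu=True, ra=0, rb=2, rc=4, imm=False)  # add r0, r2, r4
    sub = StageInstr(alu=True, ra=3, rb=0, rc=1, imm=False)  # sub r3, r0, r1
    print(must_stall(sub, add, None, None))
```

Because the stallee stays in stage 2 while the staller moves on, the same pair is detected again against stage 4 and then stage 5 on the following clocks.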
54Data Dependence - Stalling
- Stall Due to a Data Dependence Between Two ALU
Instructions
55Data Forwarding
- Example of data forwarding from an ALU instruction to another ALU instruction
- The pair table for data dependencies says that if forwarding is done, dependent ALU instructions can be adjacent, not 4 apart
- For this to work, dependences must be detected and data sent from where it is available directly to the X or Y input of the ALU
- For a dependence of an ALU instruction in stage 3 on an ALU instruction in stage 5 (or 4), the equation is
- alu5 ∧ alu3 → ((ra5 = rb3) → X ← Z5:
- (ra5 = rc3) ∧ ¬imm3 → Y ← Z5):
- alu4 ∧ alu3 → ((ra4 = rb3) → X ← Z4:
- (ra4 = rc3) ∧ ¬imm3 → Y ← Z4):
56Hazard Detection and Forwarding
- Forwarding can be from either Z4 or Z5 to either the X or Y input of the ALU
- rb and rc are needed in stage 3 for detection
57Data Forwarding ALU to ALU Instruction (contd)
- For an ALU instruction in stage 3 depending on one in stage 4 (or 5), the equation is
- alu4 ∧ alu3 → ((ra4 = rb3) → X ← Z4:
- (ra4 = rc3) ∧ ¬imm3 → Y ← Z4):
- alu5 ∧ alu3 → ((ra5 = rb3) → X ← Z5:
- (ra5 = rc3) ∧ ¬imm3 → Y ← Z5):
- We can see that the rb and rc fields must be available in stage 3 for hazard detection
- Multiplexers must be put on the X and Y inputs to the ALU so that Z4 or Z5 can replace either X3 or Y3 as inputs
58Example
- add r5, r1, r1: instr. C, issued 3rd, in stage 3
- add r1, r4, r1: instr. B, issued 2nd, in stage 4
- add r1, r3, r2: instr. A, issued 1st, in stage 5
- The hazard detection units in both stages 4 and 5 will detect the hazard
- However, only the hazard detection unit in stage 4 should forward its result, which is in Z4, to both X3 and Y3.
- This implies that the dependences between stages 3 and 4 should take precedence over the dependences between stages 3 and 5: the X or Y value set by a 3-5 dependence is replaced by the value from a 3-4 dependence.
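The precedence rule in this example can be modeled by applying the 3-5 forwarding first and letting the 3-4 forwarding overwrite it. The sketch below is illustrative: the names StageInstr and forward_operands and the sample register values are assumptions, and only ALU-to-ALU forwarding is covered.

```python
from typing import NamedTuple, Optional


class StageInstr(NamedTuple):
    alu: bool
    ra: int
    rb: int
    rc: int
    imm: bool


def forward_operands(stage3: StageInstr,
                     stage4: Optional[StageInstr],
                     stage5: Optional[StageInstr],
                     x3: int, y3: int, z4: int, z5: int) -> tuple:
    """Return the (X, Y) ALU inputs for stage 3 after forwarding."""
    x, y = x3, y3
    # Stage 5 is applied first so that a stage 4 match (the more recent
    # result) replaces it: 3-4 dependences take precedence over 3-5.
    for producer, z in ((stage5, z5), (stage4, z4)):
        if producer is None or not (producer.alu and stage3.alu):
            continue
        if producer.ra == stage3.rb:
            x = z
        if producer.ra == stage3.rc and not stage3.imm:
            y = z
    return x, y


if __name__ == "__main__":
    c = StageInstr(alu=True, ra=5, rb=1, rc=1, imm=False)  # add r5, r1, r1
    b = StageInstr(alu=True, ra=1, rb=4, rc=1, imm=False)  # add r1, r4, r1
    a = StageInstr(alu=True, ra=1, rb=3, rc=2, imm=False)  # add r1, r3, r2
    # Hypothetical values: stale r1 = 10 in X3 and Y3, Z4 = 40, Z5 = 50.
    print(forward_operands(c, b, a, x3=10, y3=10, z4=40, z5=50))
```

With instruction B removed from stage 4, the same call would forward Z5 instead, since instruction A would then hold the most recent value of r1.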
59Exceptions and the Pipeline
- Internal and external exceptions must be handled.
- Imprecise exception
- Precise exception
- Instructions ahead of the one that caused the exception (e.g., divide by zero) should continue.
- Ideally, the contents of the registers in the faulting instruction's stage would be saved for later analysis, and the instruction aborted and replaced with a nop.
- Instructions behind the faulty one may be restarted after the exception handler has completed.
60Restrictions Left If Forwarding Done Wherever
Possible
- (1) Branch delay slot
- The instruction after a branch is always executed, whether the branch succeeds or not.
- Example: br r4; add . . .
- (2) Load delay slot
- A register loaded from memory cannot be used as an operand in the next instruction.
- Example: ld r4, 4(r5); nop; neg r6, r4
- A register loaded from memory cannot be used as a branch target for the next two instructions.
- Example: ld r0, 1000; nop; nop; br r0
- (3) Branch target
- The result register of an ALU or ladr instruction cannot be used as a branch target by the next instruction.
- Example: not r0, r1; nop; br r0
61Performance and Design
- Notation
- IC - Instruction Count
- CPI - Clock Cycles per Instruction
- τ - Clock Period
- Assumptions
- Clock period of pipelined architecture same as non-pipelined
- Average CPI for a 1-bus non-pipelined design is 5, and the pipelined design can issue and complete one instruction per clock
- Assume that there is one pipeline stall for every four instructions → 5 clocks for 4 instructions, or 5/4 = 1.25 CPI.
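As a rough worked example of these assumptions, execution time can be taken as IC × CPI × clock period. The instruction count and clock period below are arbitrary illustrative values, not figures from the slides; they cancel out of the speedup.

```python
def execution_time(ic: int, cpi: float, clock_period: float) -> float:
    """Execution time = instruction count x cycles per instruction x clock period."""
    return ic * cpi * clock_period


IC = 1_000_000        # assumed instruction count (illustrative only)
CLOCK_PERIOD = 1.0    # same clock period for both designs, per the assumptions

CPI_NON_PIPELINED = 5.0      # 1-bus non-pipelined design
CPI_PIPELINED = 5 / 4        # one stall per four instructions -> 1.25

t_non_pipelined = execution_time(IC, CPI_NON_PIPELINED, CLOCK_PERIOD)
t_pipelined = execution_time(IC, CPI_PIPELINED, CLOCK_PERIOD)
speedup = t_non_pipelined / t_pipelined
print(speedup)  # 4.0
```

Under these assumptions the pipelined design is about 4 times faster, not 5 times, because of the stalls.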
62Instruction-Level Parallelism
- Two fundamental approaches to increasing a processor's instruction execution rate
- Increasing Clock Speed (IC technology dependent)
- Instruction-level pipelining (Computer Architect and Logic Designer domain).
- Efficient Sequential Execution.
- Instruction-level parallelism
- Increasing number of instructions executed simultaneously. If there are multiple function units and multiple instructions have been fetched, then it is possible to start several at once
- Two approaches are
- Superscalar
- Dynamically issue as many prefetched instructions to idle function units as possible
- Very Long Instruction Word (VLIW)
- Statically compile long instruction words with many operations in a word, each for a different function unit
63Superscalar Architectures
- There may be different types of function units, one for each type of instruction in the instruction set
- Floating-point (FPUs)
- Integer (IUs)
- Branch Prediction (BPUs)
- There can be more than one of the same type
- Each function unit is itself pipelined
- How they work
- Fetch instruction into FIFO queue
- Partial Decode to determine type
- Dispatch instruction to appropriate unit (IU, FPU or BPU) according to type.
- Branches become more of a problem
- There are fewer clock cycles between branches
- Branch units try to predict branch direction
- Instructions at branch target may be prefetched, and even executed speculatively, in hopes the branch goes that way
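The fetch, partial decode, and dispatch steps can be sketched as follows. This is a toy model under stated assumptions: the opcode-to-unit table and the `dispatch` function are hypothetical, instructions issue strictly in order, and data dependences and speculation are ignored.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

# Hypothetical mapping from opcode to function unit type (partial decode).
UNIT_FOR_OPCODE = {
    "add": "IU", "sub": "IU",
    "fadd": "FPU", "fmul": "FPU",
    "br": "BPU",
}


def dispatch(queue: Deque[str], free_units: Dict[str, int]) -> List[Tuple[str, str]]:
    """Issue instructions from the head of the FIFO while a matching unit is idle."""
    issued = []
    while queue:
        opcode = queue[0].split()[0]       # partial decode: look at the opcode only
        unit = UNIT_FOR_OPCODE[opcode]
        if free_units.get(unit, 0) == 0:
            break                          # no idle unit of this type: stop issuing
        free_units[unit] -= 1
        issued.append((queue.popleft(), unit))
    return issued


fifo = deque(["add r1, r2, r3", "fadd f1, f2, f3", "sub r4, r5, r6", "br r7"])
units = {"IU": 1, "FPU": 1, "BPU": 1}      # one idle unit of each type
print(dispatch(fifo, units))
```

With one idle unit of each type, the first two instructions start together and the third waits for the integer unit.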
64VLIW Architectures
- 64-128 bit instruction word
- Each word contains fields to control the routing of data to multiple register files and execution units.
- More info at
- http://www.research.ibm.com/vliw/
65Microprogramming
- Alternate approach to control unit design.
- SRC: hardwired approach to control unit design.
- SRC-MP: control signals are stored as control words in a microcode memory.
- MP is transparent with respect to the
- Program
- The rest of the architecture.
- The control signals emanating from the microcode control unit to the data path will remain unchanged.
- Micro-programmed architectures popular in the 60s-80s
- Hardwired architectures popular in the 90s.
66Microprogramming-Basic Idea
- Recall control sequence for 1-bus SRC
Step  Concrete RTN            Control Sequence
T0    MA ← PC: C ← PC + 4     PCout, MAin, INC4, Cin, Read
T1    MD ← M[MA]: PC ← C      Cout, PCin, Wait
T2    IR ← MD                 MDout, IRin
T3    A ← R[rb]               Grb, Rout, Ain
T4    C ← A + R[rc]           Grc, Rout, ADD, Cin
T5    R[ra] ← C               Cout, Gra, Rin, End
- Control unit job is to generate the sequence of control signals
- How about building a special computer to do this?
67Microprogramming Concept
- The Microcode Engine
- The microcode control unit is itself a small stored program computer.
- Micro PC ↔ PC
- Microprogram memory ↔ Memory
- Microinstruction word ↔ Instruction Word
- A computer to generate control signals is much simpler than an ordinary computer
- At the simplest, it just reads the control signals in order from a read-only memory
- The memory is called the control store
- A control store word, or microinstruction, contains a bit pattern telling which control signals are true in a specific step
- The major issue is determining the order in which microinstructions are read
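The simplest case, reading control signals in order from a read-only memory, might look like the sketch below. The control store holds the 1-bus SRC add sequence recalled earlier, one set of asserted signals per word. Representing a word as a Python set, and stopping when End is asserted, are simplifications made for this illustration.

```python
from typing import List, Set

# One control store word per step; each word lists the signals that are true.
CONTROL_STORE: List[Set[str]] = [
    {"PCout", "MAin", "INC4", "Cin", "Read"},   # T0
    {"Cout", "PCin", "Wait"},                   # T1
    {"MDout", "IRin"},                          # T2
    {"Grb", "Rout", "Ain"},                     # T3
    {"Grc", "Rout", "ADD", "Cin"},              # T4
    {"Cout", "Gra", "Rin", "End"},              # T5
]


def run_microprogram(store: List[Set[str]], start: int = 0) -> List[List[str]]:
    """Step a micro PC through the store, emitting each word's signals."""
    upc = start                     # the micro PC
    trace = []
    while True:
        signals = store[upc]
        trace.append(sorted(signals))
        if "End" in signals:        # End marks the last step of the instruction
            break
        upc += 1                    # no micro-branching: just increment
    return trace


for step, signals in enumerate(run_microprogram(CONTROL_STORE)):
    print(f"T{step}: {', '.join(signals)}")
```

Anything beyond this straight-line case needs a way to load the micro PC from somewhere other than the incrementer, which is the sequencing problem the last bullet refers to.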
68Block Diagram of Microcoded Control Unit
- Microinstruction has
- branch control,
- branch address, and
- control signal fields
- Microprogram counter can be set from several
sources to do the required sequencing
69Parts of the Microprogrammed Control Unit
- Since the control signals are just read from memory, the main function of the µCU is sequencing
- This is reflected in the several ways the µPC can be loaded
- Output of incrementer: µPC + 1
- PLA output: start address for a macroinstruction
- Branch address from µinstruction
- External source: say, for exception or reset
- Micro conditional branches can depend on condition codes, data path state, external signals, etc.
70Contents of a Microinstruction
[Figure: microinstruction format, with control signals, branch control, and branch address fields; control signals shown include PCout, MAin, PCin, Cout, Ain, and End]
- Main component is list of 1/0 control signal values
- There is a branch address in the control store
- There are branch control bits to determine when to use the branch address and when to use µPC + 1
71The Control Store
- Faster than main memory
- 70 bits or more wide
- 2-4 K control words
- $B = (k + c + n) \times 2^n$: capacity of control store in bits.
- Common instruction fetch sequence
- Separate sequences for each (macro) instruction
- Wide words
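A quick arithmetic check of the capacity figure, reading the formula as B = (k + c + n) × 2^n. The interpretation of the symbols is an assumption here: k branch control bits, c control signal bits, and n branch address bits, giving 2^n words. The specific values are illustrative, chosen to give a 70-bit word and 2 K words.

```python
def control_store_bits(k: int, c: int, n: int) -> int:
    """Capacity in bits: word width (k + c + n) times number of words (2**n)."""
    return (k + c + n) * 2 ** n


# Assumed example values, not taken from a specific machine:
k = 7     # branch control bits
c = 52    # control signal bits
n = 11    # branch address bits -> 2**11 = 2048 control words

print(k + c + n)                    # 70 (bits per word)
print(control_store_bits(k, c, n))  # 143360 (bits in total)
```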
72Control Signals for the add Instruction
- Addresses 100-102 are the instruction fetch
- Addresses 200-202 do the add
- Change of µcontrol from 102 to 200 uses a kind of µbranch
73Uses for µbranching in the Microprogrammed Control Unit
- (1) Branch to start of µcode for a specific inst.
- (2) Conditional control signals, e.g. CON → PCin
- (3) Looping on conditions, e.g. n ≠ 0 → ... Goto6
- Those constructs can be implemented by conditional branches specified in the µcode word instead of using AND gates to control conditional branches
- Conditions will control µbranches instead of being AND-ed with control signals
- Microbranches are frequent and control store addresses are short, so it is reasonable to have a µbranch address field in every µinstruction
74Illustration of µbranching Control Logic
- We illustrate a µbranching control scheme by a machine having condition code bits N and Z
- Branch control has 2 parts
- (1) selecting the input applied to the µPC and
- (2) specifying whether this input or µPC + 1 is used
- 4 possible inputs to µPC are allowed
- The incremented value µPC + 1
- The PLA lookup table for the start of a macroinstruction
- An externally supplied address
- The branch address field in the µinstruction word
75Branching Controls in the Microcoded Control Unit
- 5 branch conditions
- NotN
- N
- NotZ
- Z
- Unconditional
- To 1 of 4 places
- Next µinstruction
- PLA
- External address
- Branch address
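The two-part branch control can be sketched as a next-address function: one field selects which of the 4 inputs is offered to the µPC, and the condition decides whether that input or µPC + 1 is used. The function and argument names are hypothetical, and the extra "none" condition (never branch) is an assumption added so that ordinary sequential steps can be expressed.

```python
def next_upc(upc: int, select: str, cond: str, n: bool, z: bool,
             pla_addr: int = 0, ext_addr: int = 0, branch_addr: int = 0) -> int:
    """Compute the next micro PC value from the branch control fields."""
    # Part 2: does the branch condition hold for the current N and Z bits?
    taken = {
        "none": False,        # assumed encoding for "no branch"
        "Unconditional": True,
        "N": n,
        "NotN": not n,
        "Z": z,
        "NotZ": not z,
    }[cond]
    if not taken:
        return upc + 1
    # Part 1: which input is applied to the micro PC when the branch is taken.
    return {
        "next": upc + 1,
        "PLA": pla_addr,
        "external": ext_addr,
        "branch": branch_addr,
    }[select]


# A microinstruction at 203 that branches to 300 if N is set, else continues.
print(next_upc(203, "branch", "N", n=True, z=False, branch_addr=300))   # 300
print(next_upc(203, "branch", "N", n=False, z=False, branch_addr=300))  # 204
```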
76µbranches Examples
Address  Branch control  Control Signals  Branch Address  Branching action
200      00 00000        ...              XXX             None (next instruction, 201)
201      01 10000        ...              XXX             Branch to output of PLA
202      10 00100        ...              XXX             Br if Z to Extern. Addr.
203      11 00001        ...              300             Br if N to 300 (else next, 204)
204      11 00010        0...0            206             Br if NotN to 206 (else next, 205)