Computer Architecture

About This Presentation

Title:

Computer Architecture

Description:

Computer Architecture Processor Design-Advanced Topics – PowerPoint PPT presentation

Slides: 86

Provided by: vkepuska

Transcript and Presenter's Notes

Title: Computer Architecture

1
Computer Architecture

Processor Design-Advanced Topics

2
Chapter Outline

5.1 Pipelining
A pipelined design of SRC
Pipeline hazards
5.2 Instruction-Level Parallelism
Superscalar processors
Very Long Instruction Word (VLIW) machines
5.3 Microprogramming
Control store and micro-branching
Horizontal and vertical microprogramming

3
The Pipeline and the Assembly Line

Executing Machine Instructions versus
Manufacturing Small Parts

[Figure: the pipeline compared with an assembly line. Panel (a): without pipelining/assembly line. Panel (b): with pipelining/assembly line.]
Instruction interpretation and execution | Part manufacture
Fetch instruction | Select part
Fetch operands | Drill part
ALU operation | Cut part
Memory access | Polish part
Register write | Package part
(a) One item is processed at a time: add r4, r3, r2 / make end plate
(b) Items in the five stages at once, top to bottom: ld r2, addr2 / cover plate; st r4, addr1 / end plate; add r4, r3, r2 / top plate; sub r2, r5, 1 / bottom plate; shr r3, r3, 2 / center plate
4
The Pipeline Stages

5 pipeline stages are shown
1. Fetch instruction
2. Fetch operands
3. ALU operation
4. Memory access
5. Register write
Example of 5 instructions executing at different stages in the pipeline:
shr r3, r3, 2    Storing result into r3
sub r2, r5, 1    Idle, no memory access needed
add r4, r3, r2   Performing addition in ALU
st r4, addr1     Accessing r4 and addr1
ld r2, addr2     Fetching instruction

5
Pipelining Instruction Processing

Pipeline stages are shown top to bottom in order
traversed by one instruction
Instructions listed in order they are fetched
Order of instructions in pipeline is reverse of
listed
If each stage takes 1 clock:
every instruction takes 5 clocks to complete
some instruction completes every clock tick
Two performance issues: instruction latency and instruction bandwidth
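
The difference between latency and bandwidth can be made concrete with a small calculation. The sketch below assumes an ideal 5-stage pipeline with one clock per stage and no stalls; the function names are illustrative and not part of the slides.

```python
def unpipelined_clocks(n_instructions: int, n_stages: int = 5) -> int:
    """Clocks needed when each instruction finishes before the next starts."""
    return n_instructions * n_stages


def pipelined_clocks(n_instructions: int, n_stages: int = 5) -> int:
    """Clocks needed in an ideal pipeline (assumption: no stalls)."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages clocks (latency); every later
    # instruction completes one clock after its predecessor (bandwidth).
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    for n in (1, 5, 100):
        print(n, unpipelined_clocks(n), pipelined_clocks(n))
```

Latency per instruction stays at 5 clocks in both cases; only the rate at which instructions complete changes.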

6
Dependence Among Instructions

Execution of some instructions can depend on the
completion of others in the pipeline
Pipeline stalls: one solution is to stall the pipeline
Early stages stop while later ones complete processing
Data forwarding: dependences involving registers can be detected and the data forwarded to the instruction needing it, without waiting for the register write
Dependence involving memory is harder and is sometimes addressed by restricting the way the instruction set is used
Delayed load: by decree, a value loaded from memory into the register file cannot be accessed until two instructions later.
The branch delay slot is another example of such a restriction: the branch target cannot be computed before the instruction following the branch instruction has entered the pipeline.
Hardware detects the dependence and stalls the pipeline.

7
Branch and Load Delay Examples

Branch Delay
brz r2, r3
add r6, r7, r8    This instruction is always executed
st r6, addr1      Only done if r2 ≠ 0
Load Delay
ld r2, addr
add r5, r1, r2    This instruction gets the old value of r2
shr r1, r1, 4
sub r6, r8, r2    This instruction gets the r2 value loaded from addr

The working of individual instructions is not changed, but the way they work together is changed.
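
A minimal sketch of the load delay example, assuming a machine with no interlock in which a loaded value becomes visible only after the instruction that follows the load has read its operands. The tuple encoding of instructions and the name `run_with_load_delay` are hypothetical, chosen only for illustration.

```python
def run_with_load_delay(program, regs, memory):
    """Execute (op, dst, src1, src2) tuples with a one-instruction load delay."""
    regs = dict(regs)
    pending = None  # (register, value) from a load issued by the previous instruction

    def value(operand):
        # Register names are strings; small constants are plain integers.
        return regs[operand] if isinstance(operand, str) else operand

    for op, dst, src1, src2 in program:
        # Operands are read before the previous load's result reaches the register.
        if op == "ld":
            result = memory[src1]
        elif op == "add":
            result = value(src1) + value(src2)
        elif op == "sub":
            result = value(src1) - value(src2)
        elif op == "shr":
            result = value(src1) >> value(src2)
        else:
            raise ValueError(f"unsupported op: {op}")
        if pending is not None:
            regs[pending[0]] = pending[1]  # the delayed load value lands now
            pending = None
        if op == "ld":
            pending = (dst, result)
        else:
            regs[dst] = result
    if pending is not None:
        regs[pending[0]] = pending[1]
    return regs


PROGRAM = [
    ("ld", "r2", "addr", None),
    ("add", "r5", "r1", "r2"),   # sees the old r2
    ("shr", "r1", "r1", 4),
    ("sub", "r6", "r8", "r2"),   # sees the loaded r2
]
```

With r1 = 16, r2 = 1, r8 = 100 and memory["addr"] = 50, the add uses the old r2 while the sub uses the loaded value.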

8
Characteristics of Pipelined Processor Design

The instruction set is unchanged
Instructions should execute and provide the same
results on all architectures regardless of the
pipeline structure.
Main memory must operate in one cycle
This can be accomplished by expensive memory, but
It is usually done with cache, to be discussed in
Chap. 7
Instruction and data memory must appear separate
Harvard architecture has separate instruction and
data memories
Again, this is usually done with separate caches

9
Characteristics of Pipelined Processor Design

3-Port Register File
Pipelined architectures require a 3-port register file to allow the reading of two operands and the writing of a third in a single clock cycle.
Modification to Buses and the Data Path
Few buses are used since
Most connections are point to point
Some few-way multiplexers are used
Data is latched (stored in temporary registers) at each pipeline stage; these are called pipeline registers
ALU operations take only 1 clock (esp. shift)

10
Adapting Instructions to Pipelined Execution

All instructions must fit into a common pipeline
stage structure
We use a 5-stage pipeline for the SRC
(1) Instruction fetch
(2) Decode and operand access
(3) ALU operations
(4) Data memory access
(5) Register write
We must fit load/store, ALU, and branch
instructions into this pattern

11
Control Signals

Need to specify signals that will control the
flow to the pipeline
Grouping of opcodes into a set of instructions
with similar properties is useful in generating
signals that control the register transfers
through the pipeline.
Example of Figure 5.3 (next slide)

12
Logic Expressions Defining Pipeline Stage Activity
branch := br ∨ brl
cond := (IR2<2..0> = 1) ∨ ((IR2<2..1> = 1) ∧ (IR2<0> ⊕ R[rb] = 0)) ∨ ((IR2<2..1> = 2) ∧ (IR2<0> ⊕ R[rb]<31>))
sh := shr ∨ shra ∨ shl ∨ shc
alu := add ∨ addi ∨ sub ∨ neg ∨ and ∨ andi ∨ or ∨ ori ∨ not ∨ sh
imm := addi ∨ andi ∨ ori ∨ (sh ∧ (IR2<4..0> ≠ 0))
load := ld ∨ ldr
ladr := la ∨ lar
store := st ∨ str
l-s := load ∨ ladr ∨ store
regwrite := load ∨ ladr ∨ brl ∨ alu    Instructions that write to the register file
dsp := ld ∨ st ∨ la                     Instructions that use disp addressing
rl := ldr ∨ str ∨ lar                   Instructions that use rel addressing

Notes
cond and imm are used only in stage 2, so IR2 (the instruction register for stage 2) is used as the register from which their signals are generated.
Other signals in the example will be required in several different stages. A number is appended to the signal name to show which stage generates it (e.g., branch2 is generated in stage 2 by testing the opcode field in IR2)
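
The opcode groupings can be sketched as set membership tests. This is an illustration only: it works on mnemonic strings rather than on the opcode bits of IR2, takes the disp group to be ld, st, la, and uses a hypothetical `shift_count` parameter to stand for the IR2<4..0> field.

```python
SH = {"shr", "shra", "shl", "shc"}
ALU = {"add", "addi", "sub", "neg", "and", "andi", "or", "ori", "not"} | SH
LOAD = {"ld", "ldr"}
LADR = {"la", "lar"}
STORE = {"st", "str"}


def stage_signals(mnemonic: str, shift_count: int = 0) -> dict:
    """Return the instruction-class signals for one mnemonic."""
    sh = mnemonic in SH
    alu = mnemonic in ALU
    load = mnemonic in LOAD
    ladr = mnemonic in LADR
    store = mnemonic in STORE
    return {
        "branch": mnemonic in {"br", "brl"},
        "sh": sh,
        "alu": alu,
        # A shift is immediate when its count field is nonzero.
        "imm": mnemonic in {"addi", "andi", "ori"} or (sh and shift_count != 0),
        "load": load,
        "ladr": ladr,
        "store": store,
        "l-s": load or ladr or store,
        "regwrite": load or ladr or mnemonic == "brl" or alu,
        "dsp": mnemonic in {"ld", "st", "la"},
        "rl": mnemonic in {"ldr", "str", "lar"},
    }
```

A per-stage signal such as regwrite5 would then be `stage_signals(op5)["regwrite"]`, evaluated on the opcode held in that stage's instruction register.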

13
Notes on the Equations and Different Stages

The logic equations are based on the instruction
in the stage where they are used
When necessary, we append a digit to a logic
signal name to specify it is computed from values
in that stage
Thus regwrite5 is true when the opcode in stage 5 satisfies load5 ∨ ladr5 ∨ brl5 ∨ alu5, all of which are determined from op5

14
ALU Instructions

ALU Instructions
Instructions fit into 5 stages
Stage 1, Instruction Fetch: the instruction pointed to by PC is fetched from instruction memory (separate from data) and PC is incremented.
Stage 2, Instruction Decode/Operand Access: the instruction is read from IR2 and decoded. Recall that all ALU or shift operations are of the form
R[ra] ← R[rb] op R[rc]
R[ra] ← R[rb] op c2<16..0>
The second ALU operand comes either from a register or from the c2 field of the instruction register (see next slide)
Y3 ← (imm → c2 : ¬imm → R[rc])
Stage 3, ALU Operation: the opcode must be available in stage 3 to tell the ALU what to do.
Stage 4, Memory Access: ALU instructions perform no memory access, so this stage is a NOP.
Stage 5, Register Write: the result register, ra, is written in stage 5 from Z5, to which stage 4 passes Z4 unchanged. The regwrite signal is set to true to enable the write into the register.

15
The Memory Access Instructions ld, ldr, st,
and str

RTN of Memory Access Instructions
ld (:= op = 1) → R[ra] ← M[disp]
ldr (:= op = 2) → R[ra] ← M[rel]
st (:= op = 3) → M[disp] ← R[ra]
str (:= op = 4) → M[rel] ← R[ra]
la (:= op = 5) → R[ra] ← disp
lar (:= op = 6) → R[ra] ← rel
disp<31..0> := ((rb = 0) → c2<16..0> {sign extend} :
(rb ≠ 0) → R[rb] + c2<16..0> {sign extend, 2's comp.})
rel<31..0> := PC<31..0> + c1<21..0> {sign extend, 2's comp.}
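
The two address calculations can be sketched as follows, assuming a 17-bit c2 field, a 22-bit c1 field and 32-bit wraparound arithmetic. The helper names and the dictionary used as a register file are illustrative.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def disp(rb: int, c2: int, regs: dict) -> int:
    """Displacement address: c2 alone if rb = 0, else R[rb] + c2."""
    base = 0 if rb == 0 else regs[rb]
    return (base + sign_extend(c2, 17)) & MASK32


def rel(pc: int, c1: int) -> int:
    """Relative address: PC + c1."""
    return (pc + sign_extend(c1, 22)) & MASK32
```

For instance, `disp(5, 128, {5: 16})` gives 144, and a c1 field of 0x3FFFFC is read as -4.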

16
The Memory Access Instructions ld, ldr, st,
and str

Stage 1, Instruction Fetch and PC Increment: note that the incremented value of PC is recorded in PC2.
Stage 2, Operand Fetch:
1st address computation
X3 ← (rl → PC2 : dsp → R[rb])
2nd address computation
Y3 ← (rl → c1 : dsp → c2)
Stage 3, ALU Operation: the relative or displacement address is computed by adding X3 and Y3. The result is stored in Z4.
Stage 4, Memory Access: for ld or ldr, the data memory contents at the address in Z4 are copied into Z5. For la or lar, the address in Z4 is copied directly into Z5. For store instructions, the value in MD4 (passed on from MD3) is written into data memory at the address stored in Z4.
Stage 5, Register Write: if the instruction is a load, regwrite is true and the value stored in Z5 is written into the register file at the register address given by the ra field, which has been propagated from IR2 to IR5.

17
Branch Instructions

cond := ( c3<2..0> = 0 → 0 :        never
c3<2..0> = 1 → 1 :                  always
c3<2..0> = 2 → R[rc] = 0 :          if register is zero
c3<2..0> = 3 → R[rc] ≠ 0 :          if register is nonzero
c3<2..0> = 4 → R[rc]<31> = 0 :      if positive or zero
c3<2..0> = 5 → R[rc]<31> = 1 )      if negative
br (:= op = 8) → (cond → (PC ← R[rb]))                    Conditional branch
brl (:= op = 9) → (R[ra] ← PC : cond → (PC ← R[rb]))      Branch and link
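
A minimal sketch of the branch condition, written as a Python function over the 3-bit c3 field and the 32-bit value of R[rc]. Codes 6 and 7 are not defined in the RTN; treating them as "never" is an assumption made here only so that the function is total.

```python
def cond(c3: int, r_rc: int) -> bool:
    """Evaluate the branch condition for condition field c3 and register value R[rc]."""
    code = c3 & 0b111
    r_rc &= 0xFFFFFFFF
    sign = (r_rc >> 31) & 1
    if code == 0:
        return False          # never
    if code == 1:
        return True           # always
    if code == 2:
        return r_rc == 0      # if register is zero
    if code == 3:
        return r_rc != 0      # if register is nonzero
    if code == 4:
        return sign == 0      # if positive or zero
    if code == 5:
        return sign == 1      # if negative
    return False              # assumption: undefined codes never branch


def next_pc(c3: int, regs: dict, rb: int, rc: int, incremented_pc: int) -> int:
    """PC after a br or brl: R[rb] if the condition holds, else the incremented PC."""
    return regs[rb] if cond(c3, regs[rc]) else incremented_pc
```
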
18
Branch Instructions

The new program counter value is known in stage 2 but not in stage 1.
If the branch is taken, the PC receives the new branch address.
Only branch and link (brl) performs a register write in stage 5
The value of the old PC is incremented and stored in PC2 to be written into R[ra] (link register) in stage 5 regardless of whether the branch is taken or not.
There is no ALU or memory operation
Mp1 is controlled according to
cond(IR2, R[rc]) → PC ← R[rb]
X3 ← PC2

19
Designing the Pipelined Data Path

All information pertaining to the instruction that will be used in subsequent stages of execution (data and instruction) must be propagated along the pipeline in so-called pipeline registers.
Global State
Register file,
Data Memory,
Instruction Memory,
The SRC Pipeline Registers and RTN Specification
Hardware and Control to Support Pipelining
Requires
Examination of previous figures and
Determination of which information needs to be
propagated to the next stage.

20
Designing the Pipelined Data Path

RTN and the Pipeline Design
Figure 5.6 (next slide) depicts all of the
pipeline registers and the RTN descriptions of
the flow of all the instructions through the
pipeline.
It combines
All the data path (pipeline) registers, and
The actions specified for different instruction
classes (as described previously).

21
Pipeline Registers and RTN Specification

Control signals are labeled with the stage from which they are computed. Example:
PC ← (¬branch2 → PC + 4 :
branch2 → (cond(IR2, R[rc]) → R[rb] :
¬cond(IR2, R[rc]) → PC + 4))
Propagation of the IR register contents across the pipeline is necessary.
Stages 3, 4, and 5 require only the op field and the ra field.
Stage 3: ALU instructions require the opcode to determine which operation to perform.
Stage 4: the opcode supplies load and store instructions with the information they need to control data memory access.
Stage 5: ra tells the instruction which register in the register file to write its value into, and the opcode determines whether a register write is required.
Z4 (ALU output register)
Memory address (load store)
A memory value if the instruction is ld or ldr.
Incremented PC (brl)
ALU results if ALU instruction

22
Global State of the Pipelined SRC

PC, the general registers, instruction memory,
and data memory represent the global machine
state
PC is accessed in stage 1 (and stage 2 on branch)
Instruction memory is accessed in stage 1
General registers are read in stage 2 and written
in stage 5
Data memory is only accessed in stage 4

23
Restrictions on Access to Global State by Pipeline

Can see why separate instruction and data
memories (or caches) are needed
When a load or store accesses data memory in
stage 4, stage 1 is accessing an instruction
Thus two memory accesses occur simultaneously
Two operands may be needed from registers in
stage 2 while another instruction is writing a
result register in stage 5.
Thus, as far as the registers are concerned, 2
reads and a write happen simultaneously
Increment of PC in stage 1 must be overridden by
a successful branch in stage 2

24
Control Signals Pipeline Data Path

The Pipeline Data Path with Selected Control
Signals
Most control signals are shown and given values
Multiplexer control is stressed in the next
figure
Notation change on the inputs/outputs of the
register file
Address inputs are labeled a1,a2, and a3.
Figure in the next slide indicates which register
field from the instruction ra, rb, or rc, is
sent to which address input a1, a2, or a3.
Data inputs/outputs are labeled as R1, R2, and R3.

25
Control Signals Pipeline Data Path
[Figure: the SRC pipeline data path with selected control signals.]
Stages shown: 1. Instruction fetch; 2. Decode and operand read; 3. ALU operation; 4. Memory access; 5. Register write
Pipeline registers shown: IR2, PC2; IR3, X3, Y3, MD3; IR4, Z4, MD4; IR5, Z5
Register file ports: a1/R1, a2/R2, a3/R3, W3
Gate signals: G1, GA1, G2
GA1 plays the role of BAout: it gates all 0s if R0 is selected as part of the disp calculation
Mp1-Mp5 allow the pipeline registers to have multiple input sources
Mp1 ← (¬(branch2 ∧ cond) → Inc4 : (branch2 ∧ cond) → R[rb])
Mp2 ← (¬store → rc : store → ra)
Mp3 ← ((rl ∨ branch) → PC2 : (dsp ∨ alu) → R1)
Mp4 ← (rl → c1 : (dsp ∨ imm) → c2 : (alu ∧ ¬imm) → R2)
Mp5 ← (¬load → Z4 : load → mem data)
Register write in stage 5 is enabled by load ∨ ladr ∨ brl ∨ alu
26
Control Signals Pipeline Data Path

Example
Mp4 ← (rl → c1 :                    ldr, str, lar: rel instructions
(dsp ∨ imm) → c2 :                  addi, andi, ori: imm instructions, or ld, st, la: disp instructions
((alu ∧ ¬imm) ∨ sh) → R2)           alu and not imm, or shift instruction
Register operand access in stage 2
All instructions with exception of the store
instructions
rb and rc - specify source operands to be
accessed in stage 2
ra specifies the register into which the result
is to be stored in stage 5.
Store instructions
Rra contains the value of the operand to be
fetched out of the register file in stage 2
Multiplexer Mp2 is used to route ra instead of rc
to register read address port a2.
Fetched value Rra is copied into MD3 to be
stored in memory in stage 4.

27
Generating the control Signals

In the pipeline architecture
Control signals are generated at each stage from the op field in IRx, so control signals are distributed throughout the stages of the pipeline.
Most control signals that are generated (see previous figure) in a given stage are also used in that stage.
There are a few specific exceptions, e.g., PC control.
Note that each register must have a strobe signal that controls the reading/writing of data from/to it.
From the figure, all the paths are point-to-point, so no gating signals are required except at the multiplexers.
The RTN presented in the previous figure provides sufficient information to generate all of the gate and strobe signals in the data path.
Special cases of pipeline hazards require special solutions (covered later).

28
Propagating an Instruction Sequence Through the
Pipeline

Example

100 add r4, r6, r8      R[4] ← R[6] + R[8]
104 ld r7, 128(r5)      R[7] ← M[R[5] + 128]
108 brl r9, r11, 001    PC ← R[11] : R[9] ← PC
112 str r12, 32         M[PC + 32] ← R[12]
. . .
512 sub ...             next instr. ...

PC is initialized to 100.
R[11] = 512 when the brl instruction is executed.
R[6] = 4 and R[8] = 5 are the add operands
R[5] = 16 for the ld and R[12] = 23 for the str

29
First Clock Cycle add Enters Stage 1 of Pipeline

Program counter is incremented to 104

Stage 1: add r4, r6, r8
30
Second Clock Cycle

add Enters Stage 2
ld is Being Fetched at Stage 1

add operands are fetched in stage 2

Stage 1: ld r7, 128(r5)
Stage 2: add r4, r6, r8
31
Third Clock Cycle brl Enters the Pipeline

add performs its arithmetic in stage 3 ld moves
to stage 2.
Stage 1: brl r9, r11, 001
Stage 2: ld r7, 128(r5)
Stage 3: add r4, r6, r8
32
Fourth Clock Cycle str Enters the Pipeline

add is idle in stage 4
Success of brl changes program counter to 512

Stage 1: str r12, 32
Stage 2: brl r9, r11, 001
Stage 3: ld r7, 128(r5)
Stage 4: add r4, r6, r8
33
Fifth Clock Cycle

add Completes
sub Enters the Pipeline

add completes in stage 5
sub is fetched from location 512 after successful
brl

Stage 1: sub ...
Stage 2: str r12, 32
Stage 3: brl r9, r11, 001
Stage 4: ld r7, 128(r5)
Stage 5: add r4, r6, r8
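
The five clock cycles walked through above can be reproduced with a short sketch. It assumes no stalls and takes the fetch order as given: because brl resolves in stage 2, the str in its delay slot is fetched before sub at address 512. The names `FETCH_ORDER` and `stage_occupancy` are illustrative.

```python
FETCH_ORDER = ["add", "ld", "brl", "str", "sub"]


def stage_occupancy(fetched, cycle, n_stages=5):
    """Return {stage: instruction} for a 1-based clock cycle, assuming no stalls."""
    occupancy = {}
    for stage in range(1, n_stages + 1):
        # The instruction fetched in cycle c is in stage (cycle - c + 1).
        index = cycle - stage
        if 0 <= index < len(fetched):
            occupancy[stage] = fetched[index]
    return occupancy


if __name__ == "__main__":
    for cycle in range(1, 6):
        print(cycle, stage_occupancy(FETCH_ORDER, cycle))
```
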
34
Functions of the SRC Pipeline Stages

Stage 1 fetches instruction
PC incremented or replaced by successful branch
in stage 2
Stage 2 decodes instruction and gets operands
Load or store gets operands for address
computation
Store gets register value to be stored as 3rd
operand
ALU operation gets 2 registers or register and
constant
Stage 3 performs ALU operation
Calculates effective address or does
arithmetic/logic
May pass through link PC or value to be stored in
memory

35
Functions of the SRC Pipeline Stages

Stage 4 accesses data memory
Passes Z4 to Z5 unchanged for non-memory
instructions
Load fills Z5 from memory
Store uses address from Z4 and data from MD4 (no
longer needed)
Stage 5 writes result register
Z5 contains value to be written, which can be ALU
result, effective address, PC link value, or
fetched data
ra field always specifies result register in SRC

36
Functions of the Pipeline Registers in SRC

Registers between stages 1 and 2
IR2 holds full instruction including any register
fields and constant
PC2 holds the incremented PC from instruction
fetch
Registers between stages 2 and 3
IR3 holds opcode and ra (needed in stage 5)
X3 holds PC or a register value (for link or 1st
ALU operand)
Y3 holds c1 or c2 or a register value as 2nd ALU
operand
MD3 is used for a register value to be stored in
memory

37
Functions of the Pipeline Registers in SRC

Registers between stages 3 and 4
IR4 has the opcode and ra
Z4 has memory address or result register value
MD4 has value to be stored in data memory
Registers between stages 4 and 5
IR5 has the opcode and destination register number, ra
Z5 has the value to be stored in the destination register: ALU result, PC link value, or fetched data

38
Pipeline Hazard

Entirely predictable, deterministic events.
Occur as side effects of having instructions in
the pipeline that depend upon the results of
instructions ahead of them that have not exited
the pipeline.
The element of hazard comes only from not taking into account the pipeline's behavior
Rogue compiler
Assembly language programmer
The compiler must perform hazard analysis based on static conditions (as opposed to run-time dynamic conditions) and thus take into account the worst-case scenario.

39
Dependence Between Instructions in Pipe
Pipeline Hazards

Instructions that occupy the pipeline together
are being executed in parallel
This leads to the problem of instruction
dependence, well known in parallel processing
The basic problem is that an instruction depends
on the result of a previously issued instruction
that is not yet complete
Two categories of hazards
Data hazards: an instruction initiates modification of the data in a register that is needed by one of the next instructions in the pipeline.
Branch hazards: fetch of the wrong instruction on a change in PC

40
Data Hazard

Classification of Data Hazards
A read after write hazard (RAW) arises from a
flow dependence, where an instruction uses data
produced by a previous one
A write after read hazard (WAR) comes from an
anti-dependence, where an instruction writes a
new value over one that is still needed by a
previous instruction
A write after write hazard (WAW) comes from an output dependence, where two parallel instructions write to the same register and must do it in the order in which they were issued
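
The three classes can be told apart by comparing the destination and source registers of two instructions in issue order. The sketch below is illustrative; representing an instruction as a (destination, sources) pair is an assumption made for the example.

```python
def classify_dependences(first, second):
    """Classify register dependences of `second` on the earlier `first`.

    Each instruction is a (dest, sources) pair; dest may be None.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    kinds = set()
    if dest1 is not None and dest1 in sources2:
        kinds.add("RAW")   # flow dependence: second reads what first writes
    if dest2 is not None and dest2 in sources1:
        kinds.add("WAR")   # anti-dependence: second overwrites what first reads
    if dest1 is not None and dest1 == dest2:
        kinds.add("WAW")   # output dependence: both write the same register
    return kinds
```

For `add r0, r2, r4` followed by `sub r3, r0, r1`, only the RAW case applies.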

41
Data Hazards in SRC

Since all data memory access occurs in stage 4,
memory writes and reads are sequential and give
rise to no hazards
Since all registers are written in the last
stage, WAW and WAR hazards do not occur
Two writes always occur in the order issued, and
a write always follows a previously issued read
SRC hazards on register data are limited to RAW
hazards coming from flow dependence
Values are written into registers at the end of
stage 5 but may be needed by a following
instruction at the beginning of stage 2

42
Example of Pipeline Data Hazard in SRC
add instruction writes into register r0 in Stage
5

100 add r0, r2, r4
104 sub r3, r0, r1
How to prevent this kind of hazard
When the add instruction is in stage 5, the sub instruction must be no closer than stage 1
⇒ separation of at least 4 instructions!
Note that the result operand of the add instruction is actually available in register Z4 when the sub instruction requires it.
Data forwarding: forwarding hardware can be designed to detect this particular hazard and to forward the value to register Y3 in time for the sub instruction to use it.

sub instruction reads register r0 in Stage 2
43
Possible Solutions to the Register Data Hazard
Problem

Detection
The machine manual could list rules specifying
that a dependent instruction cannot be issued
less than a given number of steps after the one
on which it depends
This is usually too restrictive
Since the operation and operands are known at
each stage, dependence on a following stage can
be detected
Correction
The dependent instruction can be stalled and
those ahead of it in the pipeline allowed to
complete
Result can be forwarded to a following inst. in
a previous stage without waiting to be written
into its register
Preferred SRC design will use detection,
forwarding and stalling only when unavoidable

44
Detecting Hazards and Dependence Distance

To detect hazards, each pair of instructions must
be considered
Data is normally available after being written to
register
Can be made available for forwarding as early as
the stage where it is produced
Stage 3 output for ALU results, stage 4 for
memory fetch
Operands normally needed in stage 2
Can be received from forwarding as late as the
stage in which they are used
Stage 3 for ALU operands and address modifiers,
stage 4 for stored register, stage 2 for branch
target

45
Data Hazards in SRC

The task of determining all possible data hazards in a given pipeline structure and instruction set is to consider all possible interactions between all instructions at all stages in the pipeline.
ALU Instructions
All ALU instructions write and read data.
Potential for data hazard (previous example) because
Data is read in stage 2 (Normally Read-Needed)
Data is written in stage 5 ⇒ data becomes available at stage 6 (Normally Written-Available)
⇒ ALU instructions that use the data from previous ALU instructions must be separated by at least four instructions.
The written data is actually available in Z4 when the ALU instruction is in stage 4 (Earliest Available).
Data read in stage 2 does not get used until stage 3, when the ALU operation is performed (Latest Needed).
Data forwarding: one hardware solution that detects the hazard between the two instructions and forwards the data from Z4 to the proper ALU input in the previous stage.

46
Data Hazards - ALU Instructions

Data forwarding implementation requires further analysis
Associate with ALU instructions a pair of numbers:
The stage where the data is Normally Available
The stage where the data is Earliest Available.
For ALU instructions that write data, this Normally Available/Earliest Available pair is 6/4.
For ALU instructions that read data, a similar pair of stages is associated with the register-reading requirements: Normally Required and Latest Required.
For an ALU reader this Normally Required/Latest Required pair is 2/3.
By taking the pair-wise difference, the minimum spacing for the ALU pair is obtained as (6-2)/(4-3) = 4/1
Instructions must be separated by at least 4 stages unless there is a forwarding scheme.
With a forwarding scheme they need to be separated by only one stage.

47
Instruction Pair Hazard Interaction

Data Dependence for the Modifier Instructions

Table entries give the minimum spacing between the writing and the reading instruction as Normal/Forwarded.

                                       Write to register file
                                       (data available Normal/Earliest, stage)
Read from register file                alu     load    ladr    brl
(Normal/Latest, stage)                 6/4     6/5     6/4     6/2
alu          2/3                       4/1     4/2     4/1     4/1
load         2/3                       4/1     4/2     4/1     4/1
ladr         2/3                       4/1     4/2     4/1     4/1
store (rb)   2/3                       4/1     4/2     4/1     4/1
store (ra)   2/4                       4/1     4/1     4/1     4/1
branch       2/2                       4/2     4/3     4/2     4/1

Callouts in the figure: 6 - 2 = 4 and 4 - 3 = 1 (the alu/alu entry); branch hazard; no hazard
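
The table entries follow from the pair-wise differences of the writer and reader stage pairs. The sketch below recomputes them; clamping the forwarded spacing to a minimum of 1 is an inference from the table (for example, brl to alu is listed as 4/1 although 2 - 3 is negative), not something the slides state.

```python
# (normally available, earliest available) stage for each writer class
WRITERS = {"alu": (6, 4), "load": (6, 5), "ladr": (6, 4), "brl": (6, 2)}

# (normally required, latest required) stage for each reader class
READERS = {
    "alu": (2, 3),
    "load": (2, 3),
    "ladr": (2, 3),
    "store (rb)": (2, 3),
    "store (ra)": (2, 4),
    "branch": (2, 2),
}


def min_spacing(writer: str, reader: str) -> tuple:
    """Return (normal, forwarded) minimum instruction spacing for a writer/reader pair."""
    normal_available, earliest_available = WRITERS[writer]
    normal_required, latest_required = READERS[reader]
    normal = normal_available - normal_required
    # Assumption: adjacent instructions (spacing 1) are the closest possible pair.
    forwarded = max(1, earliest_available - latest_required)
    return normal, forwarded


if __name__ == "__main__":
    for reader in READERS:
        print(reader, [min_spacing(writer, reader) for writer in WRITERS])
```
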
48
Instruction Pair Hazard Interaction

Previous table considers only register writes and
subsequent register reads.
It also covers hazards due to loads from data
memory to the register file.
What are the data hazards associated with stores?
The only possible hazard after a store is a subsequent load
Since only load instructions read data memory and
Since both load and store access it in stage 4
⇒ there is no hazard.
Branch delay
Branch delay is unavoidable
Work around the problem, as is the case with delayed loads.
Techniques that try to predict the outcome of a branch in the fetch stage are beyond the scope of this class.

49
Delays Unavoidable by Forwarding

In the load column of the table, we see that the value loaded cannot be available to the next instruction, even with forwarding
Can restrict compiler not to put a dependent
instruction in the next position after a load
(next 2 positions if the dependent instruction is
a branch)
Target register cannot be forwarded to branch
from the immediately preceding instruction
Code is restricted so that branch target must not
be changed by instruction preceding branch
(previous 2 instructions if loaded from memory)
Do not confuse this with the branch delay slot,
which is a dependence of instruction fetch on
branch, not a dependence of branch on something
else

50
Hazard Detection in the Compiler

Hazard Resolved by Compiler
Burden of hazard detection and elimination on the
compiler
Analyze the code sequence and either
Rearrange the instructions to remove the hazard, or
If no rearrangement is possible, insert a nop between the instructions that form a hazard: a software bubble in the pipeline.
Problems with this approach
Additional burden on the compiler writer to develop a compiler that is both correct and efficient.
Leads to a more expensive and longer development cycle.
May lead to a buggy compiler when maximum optimization of code is required.
Without hardware detection there can be no data forwarding ⇒ no reduction in the no-hazard distance of four instructions.
The compiler can only perform static analysis of code ⇒ it must assume the most pessimistic scenario.
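
A minimal sketch of the nop-insertion fallback, assuming no forwarding, so a reader must be at least four instructions after the writer of any register it uses. It does not attempt the rearrangement option, and the (text, dest, sources) encoding and the name `insert_nops` are hypothetical.

```python
def insert_nops(program, min_distance=4):
    """Pad a straight-line program with nops so RAW-dependent instructions
    are at least min_distance instructions apart.

    program: list of (text, dest, sources) tuples; dest may be None.
    Returns the list of instruction texts including inserted nops.
    """
    output = []
    last_write = {}  # register -> position in output of its most recent writer
    for text, dest, sources in program:
        needed = 0
        for reg in sources:
            if reg in last_write:
                gap = len(output) - last_write[reg]  # distance if emitted right now
                needed = max(needed, min_distance - gap)
        output.extend(["nop"] * needed)
        if dest is not None:
            last_write[dest] = len(output)
        output.append(text)
    return output
```

For `add r0, r2, r4` directly followed by `sub r3, r0, r1`, three nops are inserted; an independent instruction between them reduces that to two.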

51
Hazard Detection by Hardware

Hazard Resolved by Pipeline Stalls
Focus on hardware solution of data forwarding.
Test for hazards at each place where they can
occur (as described in Table).
Illustrate process of detecting data hazard with
2-operand ALU-ALU instruction pairs.
Approach 1
Remove the hazard by inserting bubbles in the
pipeline.

52
Hazard Detection by Hardware

Pipeline Bubble Insertion
Facts we need to take into account
The minimum spacing between data-dependent instructions.
The dependent instruction, the "stallee", must be paused at stage 2 until the hazard has been resolved.
Note: the instruction cannot complete its operand fetch until the operand has been written to the register file.
The instruction behind it in stage 1 must also be held as long as the pipeline is stalled.
The two dependent instructions may be 1, 2, or 3 instructions apart: hazard detection hardware must detect all three of these cases.
The "staller" and the intermediate instructions between staller and stallee must be allowed to finish and exit the pipeline.

53
Example of Detecting ALU Hazards and Stalling
Pipeline

The following expression detects hazards between ALU instructions in stages 2 and 3 and stalls the pipeline
(alu3 ∧ alu2 ∧ ((ra3 = rb2) ∨ (ra3 = rc2) ∧ ¬imm2)) → (pause2 : pause1 : op3 ← 0)
After such a stall, the hazard will be between stages 2 and 4, detected by
(alu4 ∧ alu2 ∧ ((ra4 = rb2) ∨ (ra4 = rc2) ∧ ¬imm2)) → (pause2 : pause1 : op3 ← 0)
Hazards between stages 2 and 5 require
(alu5 ∧ alu2 ∧ ((ra5 = rb2) ∨ (ra5 = rc2) ∧ ¬imm2)) → (pause2 : pause1 : op3 ← 0)

Reading the first expression:
If the opcodes in stages 2 and 3 are both alu, and
if ra in stage 3 equals rb or rc in stage 2 (unless the stage 2 instruction is an immediate instruction, in which case there is no rc),
then there is a hazard between the instructions in stages 2 and 3.
Emit signals to pause pipeline stages 1 and 2 (pause1 and pause2).
Insert a bubble in the pipeline between the staller in stage 3 and the stallee in stage 2: op3 ← 0.
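
The stall conditions for stages 3, 4 and 5 share one shape, so they can be sketched as a single predicate applied to each later stage. The `Slot` record and the function name are illustrative, and only ALU-to-ALU hazards are covered, as in the expressions above.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Slot:
    """The fields of one pipeline stage's instruction that hazard detection needs."""
    is_alu: bool
    ra: int
    rb: int
    rc: int
    imm: bool = False


def stall_needed(stage2: Slot, later_stages: Iterable[Optional[Slot]]) -> bool:
    """True if the ALU instruction in stage 2 reads a register that an ALU
    instruction in stage 3, 4 or 5 has not yet written back."""
    if not stage2.is_alu:
        return False
    for producer in later_stages:
        if producer is None or not producer.is_alu:
            continue  # bubble or non-ALU instruction: not covered by this check
        reads_rb = producer.ra == stage2.rb
        reads_rc = producer.ra == stage2.rc and not stage2.imm
        if reads_rb or reads_rc:
            return True  # assert pause1 and pause2, and set op3 to 0
    return False
```
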
54
Data Dependence - Stalling

Stall Due to a Data Dependence Between Two ALU
Instructions

55
Data Forwarding

Example of Data Forwarding from an ALU
Instruction to another ALU Instruction
The pair table for data dependencies says that if
forwarding is done, dependent ALU instructions
can be adjacent, not 4 apart
For this to work, dependences must be detected
and data sent from where it is available directly
to X or Y input of ALU
For a dependence of an ALU instruction in stage 3 on an ALU instruction in stage 5 (or 4), the equation is
alu5 ∧ alu3 → ((ra5 = rb3) → X ← Z5 :
(ra5 = rc3) ∧ ¬imm3 → Y ← Z5)
alu4 ∧ alu3 → ((ra4 = rb3) → X ← Z4 :
(ra4 = rc3) ∧ ¬imm3 → Y ← Z4)

56
Hazard Detection and Forwarding

Can be from either Z4 or Z5 to either X or Y
input to ALU
rb and rc needed in stage 3 for detection

57
Data Forwarding ALU to ALU Instruction (contd)

For an ALU instruction in stage 3 depending on one in stage 4 (or 5), the equation is
alu4 ∧ alu3 → ((ra4 = rb3) → X ← Z4 :
(ra4 = rc3) ∧ ¬imm3 → Y ← Z4)
alu5 ∧ alu3 → ((ra5 = rb3) → X ← Z5 :
(ra5 = rc3) ∧ ¬imm3 → Y ← Z5)
We can see that the rb and rc fields must be
available in stage 3 for hazard detection
Multiplexers must be put on the X and Y inputs to
the ALU so that Z4 or Z5 can replace either X3 or
Y3 as inputs

58
Example

add r5, r1, r1   instr C, issued 3rd, in stage 3
add r1, r4, r1   instr B, issued 2nd, in stage 4
add r1, r3, r2   instr A, issued 1st, in stage 5
Hazard detection units in both stages 4 and 5 will detect the hazard
However, only the hazard detection unit in stage 4 should forward its result, which is in Z4, to both X3 and Y3.
This implies that dependences between stages 3 and 4 should take precedence over dependences between stages 3 and 5: the X or Y value set by a 3-5 dependence is replaced by the value from a 3-4 dependence.
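
The precedence rule can be sketched as a forwarding selector that applies the stage 5 match first and then lets a stage 4 match override it. The `Slot` record, the function name and the numeric values in the usage note are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Slot:
    """The fields of one pipeline stage's instruction that forwarding needs."""
    is_alu: bool
    ra: int
    rb: int
    rc: int
    imm: bool = False


def forward_operands(stage3: Slot, stage4: Optional[Slot], stage5: Optional[Slot],
                     x3: int, y3: int, z4: int, z5: int) -> tuple:
    """Select the X and Y inputs of the ALU for the instruction in stage 3."""
    x, y = x3, y3
    # Stage 5 is checked first so that a stage 4 match, the more recent
    # result, replaces it.
    for producer, z in ((stage5, z5), (stage4, z4)):
        if producer is None or not (producer.is_alu and stage3.is_alu):
            continue
        if producer.ra == stage3.rb:
            x = z
        if producer.ra == stage3.rc and not stage3.imm:
            y = z
    return x, y
```

With instruction C (`add r5, r1, r1`) in stage 3, B in stage 4 with Z4 = 40 and A in stage 5 with Z5 = 7, both ALU inputs receive 40.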

59
Exceptions and the Pipeline

Internal and external exceptions must be handled.
Imprecise exception
Precise exception
Instructions ahead of one that caused the
exception (e.g., divide by zero) should continue.
Ideally the contents of the registers in its
stage would be saved for later analysis and the
instruction aborted and replaced with nop.
Instructions behind the faulty one may be
restarted after the exception handler has
completed.

60
Restrictions Left If Forwarding Done Wherever
Possible
(1) Branch delay slot
The instruction after a branch is always executed, whether the branch succeeds or not.
br r4
add . . .
(2) Load delay slot
A register loaded from memory cannot be used as an operand in the next instruction.
ld r4, 4(r5)
nop
neg r6, r4
A register loaded from memory cannot be used as a branch target for the next two instructions.
ld r0, 1000
nop
nop
br r0
(3) Branch target
The result register of an ALU or ladr instruction cannot be used as a branch target by the next instruction.
not r0, r1
nop
br r0

61
Performance and Design

Notation
IC - Instruction Count
CPI - Clock Cycles per Instruction
τ - Clock Period
Assumptions
Clock period of the pipelined architecture is the
same as the non-pipelined one
Average CPI for a 1-bus non-pipelined design is
5, and the pipelined design can issue and
complete one instruction per clock
Assume that there is one pipeline stall for every
four instructions → 5 clocks for 4 instructions,
or 5/4 = 1.25 CPI.
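
As a rough check on these numbers, the sketch below applies the usual relation execution time = IC × CPI × τ to both designs. The instruction count and clock period are hypothetical values picked for illustration; only the two CPI figures come from the assumptions above.

```python
def execution_time(ic, cpi, tau):
    """Execution time = instruction count x clocks per instruction x clock period."""
    return ic * cpi * tau


IC = 1_000_000   # hypothetical instruction count
TAU_NS = 10      # hypothetical clock period in nanoseconds, same for both designs

non_pipelined = execution_time(IC, 5, TAU_NS)        # CPI of 5 for the 1-bus design
pipelined_cpi = 5 / 4                                # one stall per four instructions
pipelined = execution_time(IC, pipelined_cpi, TAU_NS)
speedup = non_pipelined / pipelined
```

Because IC and τ are the same in both cases, they cancel, and the speedup is just the CPI ratio 5 / 1.25 = 4.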

62
Instruction-Level Parallelism

Two fundamental approaches to increasing a
processor's instruction execution rate
Increasing Clock Speed (IC technology dependent)
Instruction-level pipelining (Computer Architect
and Logic Designer domain).
Efficient Sequential Execution.
Instruction-level parallelism
Increasing number of instructions executed
simultaneously. If there are multiple function
units and multiple instructions have been
fetched, then it is possible to start several at
once
Two approaches are
Superscalar
Dynamically issue as many prefetched instructions
to idle function units as possible
Very Long Instruction Word (VLIW)
Statically compile long instruction words with
many operations in a word, each for a different
function unit

63
Superscalar Architectures

There may be a different type of function unit
for each type of instruction in the instruction set
Floating-point (FPUs)
Integer (IUs)
Branch Prediction (BPUs)
There can be more than one of the same type
Each function unit is itself pipelined
How they work
Fetch instruction into FIFO queue
Partial Decode to determine type
Dispatch instruction to appropriate unit (IU,FPU
or BPU) according to type.
Branches become more of a problem
There are fewer clock cycles between branches
Branch units try to predict branch direction
Instructions at branch target may be prefetched,
and even executed speculatively, in hopes the
branch goes that way
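
The fetch, partial decode, and dispatch steps can be sketched as follows. This is a toy model under stated assumptions: the opcode-to-unit table is hypothetical (fadd and fmul are not SRC instructions), and issue is strictly in order, stopping at the first instruction whose unit type is busy. Real superscalar issue logic also checks data dependences, which this sketch ignores.

```python
from collections import deque

# Hypothetical partial decode: map an opcode to the type of unit that executes it.
UNIT_FOR_OPCODE = {
    "add": "IU", "sub": "IU",
    "fadd": "FPU", "fmul": "FPU",
    "br": "BPU",
}


def dispatch_cycle(queue, idle_units):
    """Issue instructions from the head of the FIFO queue to idle function units.

    idle_units maps a unit type to the number of idle units of that type.
    Issue stops at the first instruction that cannot be dispatched.
    """
    issued = []
    while queue:
        opcode = queue[0].split()[0]          # partial decode: look at the opcode only
        unit = UNIT_FOR_OPCODE[opcode]
        if idle_units.get(unit, 0) == 0:
            break                             # no idle unit of this type
        idle_units[unit] -= 1
        issued.append((unit, queue.popleft()))
    return issued


fifo = deque(["add r1, r2, r3", "fadd f1, f2, f3", "sub r4, r5, 1", "br r6"])
issued = dispatch_cycle(fifo, {"IU": 1, "FPU": 1, "BPU": 1})
```

With one unit of each type, the add and fadd issue together, while the sub waits for the integer unit and blocks the branch behind it.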

64
VLIW Architectures

64-128 bit instruction word
Each word contains fields to control the routing
of data to multiple register files and execution
units.
More info at
http://www.research.ibm.com/vliw/

65
Microprogramming

Alternate approach to control unit design.
SRC: hardwired approach to control unit design.
SRC-MP: control signals are stored as control
words in a microcode memory.
MP is transparent with respect to the
Program
The rest of the architecture.
The control signals emanating from the microcode
control unit to the data path will remain
unchanged.
Micro-programmed architectures popular in the
60s-80s
Hardwired architectures popular in the 90s.

66
Microprogramming-Basic Idea

Recall control sequence for 1-bus SRC

Step  Concrete RTN           Control Sequence
T0    MA ← PC: C ← PC + 4    PCout, MAin, INC4, Cin, Read
T1    MD ← M[MA]: PC ← C     Cout, PCin, Wait
T2    IR ← MD                MDout, IRin
T3    A ← R[rb]              Grb, Rout, Ain
T4    C ← A + R[rc]          Grc, Rout, ADD, Cin
T5    R[ra] ← C              Cout, Gra, Rin, End

Control unit job is to generate the sequence of
control signals
How about building a special computer to do this?

67
Microprogramming Concept

The Microcode Engine
The microcode control unit is itself a small
stored program computer.
Micro PC ↔ PC
Microprogram memory ↔ Memory
Microinstruction word ↔ Instruction Word
A computer to generate control signals is much
simpler than an ordinary computer
At the simplest, it just reads the control
signals in order from a read-only memory
The memory is called the control store
A control store word, or microinstruction,
contains a bit pattern telling which control
signals are true in a specific step
The major issue is determining the order in which
microinstructions are read
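
The simplest case, reading control signals in order from a read-only memory, might look like the sketch below. The control store holds the six steps of the 1-bus SRC add control sequence; the list-of-sets layout and the function name are assumptions made for illustration, and sequencing is a plain increment with no microbranching.

```python
# One set of asserted control signals per microinstruction (steps T0 to T5 of add).
CONTROL_STORE = [
    {"PCout", "MAin", "INC4", "Cin", "Read"},  # T0
    {"Cout", "PCin", "Wait"},                  # T1
    {"MDout", "IRin"},                         # T2
    {"Grb", "Rout", "Ain"},                    # T3
    {"Grc", "Rout", "ADD", "Cin"},             # T4
    {"Cout", "Gra", "Rin", "End"},             # T5
]


def run_microprogram(store, start=0):
    """Step the micro PC through the store until a word asserts End."""
    upc = start
    trace = []
    while True:
        signals = store[upc]
        trace.append(sorted(signals))   # the signals driven to the data path this step
        if "End" in signals:
            break
        upc += 1                        # simplest sequencing: next word in order
    return trace


trace = run_microprogram(CONTROL_STORE)
```

Each entry of trace lists the control signals that are true in one step, which is exactly the information a microinstruction's control field carries.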

68
Block Diagram of Microcoded Control Unit

Microinstruction has
branch control,
branch address, and
control signal fields
Microprogram counter can be set from several
sources to do the required sequencing

69
Parts of the Microprogrammed Control Unit

Since the control signals are just read from
memory, the main function of the µCU is sequencing
This is reflected in the several ways the µPC can
be loaded
Output of incrementer: µPC + 1
PLA output: start address for a macroinstruction
Branch address from µinstruction
External source: say, for exception or reset
Micro conditional branches can depend on
condition codes, data path state, external
signals, etc.

70
Contents of a Microinstruction

Microinstruction format
Control signals
Branch control
Branch address

Ain
Cout
End
PCin
MAin
PCout

Main component is list of 1/0 control signal
values
There is a branch address in the control store
There are branch control bits to determine when
to use the branch address and when to use µPC + 1

71
The Control Store

Faster than main memory
70 bits or more wide
2-4 K control words
B = (k + c + n) × 2^n: capacity of control store in bits.
Common instruction fetch sequence
Separate sequences for each (macro) instruction
Wide words
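
A quick calculation shows how the capacity formula plays out. The slide does not define k, c, and n, so this sketch assumes k is the number of control signal bits, c the number of branch control bits, and n the number of branch address bits (giving 2^n words). The specific values are illustrative, chosen to give a 70-bit word and 2 K words.

```python
def control_store_bits(k, c, n):
    """Capacity in bits: word width (k + c + n) times number of words (2**n).

    Assumed meanings: k control signal bits, c branch control bits,
    n branch address bits.
    """
    return (k + c + n) * 2 ** n


# Illustrative values only: 52 + 7 + 11 = 70-bit words, 2**11 = 2048 words.
bits = control_store_bits(k=52, c=7, n=11)
```

Under these assumptions the control store holds 70 × 2048 = 143,360 bits, or about 17.5 KB.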

72
Control Signals for the add Instruction

Addresses 100-102 are the instruction fetch
Addresses 200-202 do the add
Change of µcontrol from 102 to 200 uses a kind of
µbranch

73
Uses for µbranching in the Microprogrammed
Control Unit

(1) Branch to start of µcode for a specific inst.
(2) Conditional control signals, e.g. CON → PCin
(3) Looping on conditions, e.g. n ≠ 0 → ... Goto6
Those constructs can be implemented by
conditional branches specified in the µcode word
instead of using AND gates to control conditional
branches
Conditions will control µbranches instead of
being AND-ed with control signals
Microbranches are frequent and control store
addresses are short, so it is reasonable to have
a µbranch address field in every µinstruction

74
Illustration of µbranching Control Logic

We illustrate a µbranching control scheme by a
machine having condition code bits N and Z
Branch control has 2 parts
(1) selecting the input applied to the µPC and
(2) specifying whether this input or µPC + 1 is
used
4 possible inputs to µPC are allowed
The incremented value µPC + 1
The PLA lookup table for the start of a
macroinstruction
An externally supplied address
The branch address field in the µinstruction word

75
Branching Controls in the Microcoded Control Unit

5 branch conditions
NotN
N
NotZ
Z
Unconditional
To 1 of 4 places
Next µinstruction
PLA
External address
Branch address
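
The two-part branch control can be sketched as a small next-address function. The string names for the sources and conditions are hypothetical stand-ins for the actual bit encodings, which are not spelled out here; None is used to mean that no branch condition is selected.

```python
def next_upc(upc, source, condition, n, z, pla=0, external=0, branch=0):
    """Compute the next micro PC value.

    source    - which input is selected: "next", "PLA", "external", or "branch"
    condition - None, "NotN", "N", "NotZ", "Z", or "Unconditional"
    n, z      - current condition code bits
    """
    condition_met = {
        None: False,
        "NotN": not n,
        "N": n,
        "NotZ": not z,
        "Z": z,
        "Unconditional": True,
    }[condition]
    if not condition_met:
        return upc + 1                  # fall through to the next microinstruction
    targets = {"next": upc + 1, "PLA": pla, "external": external, "branch": branch}
    return targets[source]


taken = next_upc(203, "branch", "N", n=True, z=False, branch=300)
not_taken = next_upc(203, "branch", "N", n=False, z=False, branch=300)
```

With N set, the microinstruction at 203 transfers control to 300; with N clear, sequencing continues at 204.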

76
µbranches Examples

Address  Branch control  Control signals  Branch address  Branching action
200      00 00000        ...              XXX             None (201 is the next instruction)
201      01 10000        ...              XXX             Branch to output of PLA
202      10 00100        ...              XXX             Br if Z to Extern. Addr.
203      11 00001        ...              300             Br if N to 300 (else 204)
204      11 00010        0 ... 0          206             Br if not N to 206 (else 205)
