Computer Architecture - PowerPoint PPT Presentation

Provided by: vkepuska

Title: Computer Architecture


1
Computer Architecture
  • Processor Design-Advanced Topics

2
Chapter Outline
  • 5.1 Pipelining
  • A pipelined design of SRC
  • Pipeline hazards
  • 5.2 Instruction-Level Parallelism
  • Superscalar processors
  • Very Long Instruction Word (VLIW) machines
  • 5.3 Microprogramming
  • Control store and micro-branching
  • Horizontal and vertical microprogramming

3
The Pipeline and the Assembly Line
  • Executing Machine Instructions versus
    Manufacturing Small Parts

[Figure: instruction interpretation and execution compared with part manufacture. The five instruction stages (fetch instruction, fetch operands, ALU operation, memory access, register write) are set beside five manufacturing steps (select part, drill part, cut part, polish part, package part), shown (a) without pipelining/assembly line and (b) with pipelining/assembly line. In (b) the five stages hold ld r2, addr2; st r4, addr1; add r4, r3, r2; sub r2, r5, 1; and shr r3, r3, 2, respectively.]

4
The Pipeline Stages
  • 5 pipeline stages are shown
  • 1. Fetch instruction
  • 2. Fetch operands
  • 3. ALU operation
  • 4. Memory access
  • 5. Register write
  • Example of 5 instructions executing at different
    stages in pipeline
  • shr r3, r3, 2: storing result into r3
  • sub r2, r5, 1: idle, no memory access needed
  • add r4, r3, r2: performing addition in ALU
  • st r4, addr1: accessing r4 and addr1
  • ld r2, addr2: fetching instruction

5
Pipelining Instruction Processing
  • Pipeline stages are shown top to bottom in order
    traversed by one instruction
  • Instructions listed in order they are fetched
  • Order of instructions in pipeline is reverse of
    listed
  • If each stage takes 1 clock
  • every instruction takes 5 clocks to complete
  • some instruction completes every clock tick
  • Two performance issues: instruction latency and
    instruction bandwidth
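
The difference between latency and bandwidth can be checked with a short calculation. This is a minimal sketch, assuming an ideal pipeline with one clock per stage and no stalls; the function names are illustrative and do not come from the slides.

```python
def unpipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    """Clocks needed when each instruction finishes before the next starts."""
    return n_instr * n_stages


def pipelined_cycles(n_instr: int, n_stages: int = 5) -> int:
    """Clocks needed in an ideal pipeline: fill once, then one result per clock."""
    if n_instr == 0:
        return 0
    return n_stages + (n_instr - 1)


# Latency of a single instruction is 5 clocks either way,
# but the time for 100 instructions differs greatly.
print(unpipelined_cycles(100))
print(pipelined_cycles(100))
```

Each instruction still takes 5 clocks from fetch to register write (latency), while the pipelined machine completes close to one instruction per clock once it is full (bandwidth).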

6
Dependence Among Instructions
  • Execution of some instructions can depend on the
    completion of others in the pipeline
  • Pipeline stalls: one solution is to stall the
    pipeline
  • Early stages stop while later ones complete
    processing
  • Data forwarding: dependences involving registers
    can be detected and the data forwarded to the
    instruction needing it, without waiting for the
    register write
  • Dependence involving memory is harder and is
    sometimes addressed by restricting the way the
    instruction set is used
  • Delayed load: by decree, values loaded from memory
    into the register file cannot be accessed until
    two instructions later.
  • The branch delay slot is another example of such a
    restriction: branch targets cannot be computed
    before the instruction following the branch
    instruction has entered the pipeline. Hardware
    detects the dependence and stalls the pipeline.

7
Branch and Load Delay Examples

Branch Delay
  brz r2, r3
  add r6, r7, r8      This instruction is always executed
  st r6, addr1        Only done if r2 ≠ 0

Load Delay
  ld r2, addr
  add r5, r1, r2      This instruction gets the old value of r2
  shr r1, r1, 4
  sub r6, r8, r2      This instruction gets the r2 value loaded from addr

  • The working of the instructions is not changed, but
  • The way they work together is changed.

8
Characteristics of Pipelined Processor Design
  • The instruction set is unchanged
  • Instructions should execute and provide the same
    results on all architectures regardless of the
    pipeline structure.
  • Main memory must operate in one cycle
  • This can be accomplished by expensive memory, but
  • It is usually done with cache, to be discussed in
    Chap. 7
  • Instruction and data memory must appear separate
  • Harvard architecture has separate instruction and
    data memories
  • Again, this is usually done with separate caches

9
Characteristics of Pipelined Processor Design
  • 3-Port Register File
  • Pipelined architectures require a 3-port register
    file to allow the reading of two operands
    and the writing of a third in a single clock
    cycle.
  • Modification to Buses and the Data Path
  • Few buses are used since
  • Most connections are point to point
  • Some few-way multiplexers are used
  • Data is latched (stored in temporary registers)
    at each pipeline stage, in so-called pipeline
    registers
  • ALU operations take only 1 clock (esp. shift)

10
Adapting Instructions to Pipelined Execution
  • All instructions must fit into a common pipeline
    stage structure
  • We use a 5-stage pipeline for the SRC
  • (1) Instruction fetch
  • (2) Decode and operand access
  • (3) ALU operations
  • (4) Data memory access
  • (5) Register write
  • We must fit load/store, ALU, and branch
    instructions into this pattern

11
Control Signals
  • Need to specify signals that will control the
    flow to the pipeline
  • Grouping of opcodes into a set of instructions
    with similar properties is useful in generating
    signals that control the register transfers
    through the pipeline.
  • Example of Figure 5.3 (next slide)

12
Logic Expressions Defining Pipeline Stage Activity
branch   := br ∨ brl
cond     := (IR2⟨2..0⟩ = 1) ∨
            ((IR2⟨2..1⟩ = 1) ∧ (IR2⟨0⟩ ⊕ (R[rc] = 0))) ∨
            ((IR2⟨2..1⟩ = 2) ∧ (¬IR2⟨0⟩ ⊕ R[rc]⟨31⟩))
sh       := shr ∨ shra ∨ shl ∨ shc
alu      := add ∨ addi ∨ sub ∨ neg ∨ and ∨ andi ∨ or ∨ ori ∨ not ∨ sh
imm      := addi ∨ andi ∨ ori ∨ (sh ∧ (IR2⟨4..0⟩ ≠ 0))
load     := ld ∨ ldr
ladr     := la ∨ lar
store    := st ∨ str
l-s      := load ∨ ladr ∨ store
regwrite := load ∨ ladr ∨ brl ∨ alu    Instructions that write to the register file
dsp      := ld ∨ st ∨ la               Instructions that use disp addressing
rl       := ldr ∨ str ∨ lar            Instructions that use rel addressing
  • Notes
  • cond and imm are used only in stage 2 ⇒
  • IR2 (the instruction register for stage 2) is the
    register from which their signals are
    generated.
  • Other signals in the example will be required in
    several different stages. The stage number is appended
    to the signal name to show which stage generates
    it (e.g., branch2 is generated in stage 2 from
    IR2 by testing the opcode field in IR2)

13
Notes on the Equations and Different Stages
  • The logic equations are based on the instruction
    in the stage where they are used
  • When necessary, we append a digit to a logic
    signal name to specify it is computed from values
    in that stage
  • Thus regwrite5 is true when the opcode in stage 5
    is load5 ∨ ladr5 ∨ brl5 ∨ alu5, all of which are
    determined from op5
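
The grouping of opcodes into instruction classes can be sketched in software. This is an illustration only, assuming instructions are identified by mnemonic strings rather than by numeric opcode fields; the function name stage_signals is hypothetical. Calling it with the opcode held in stage 5 gives the regwrite5 signal described above.

```python
SHIFTS = {"shr", "shra", "shl", "shc"}
ALU_OPS = {"add", "addi", "sub", "neg", "and", "andi", "or", "ori", "not"} | SHIFTS


def stage_signals(op: str) -> dict:
    """Derive the class signals for the opcode held in one pipeline stage."""
    load = op in {"ld", "ldr"}
    ladr = op in {"la", "lar"}
    store = op in {"st", "str"}
    alu = op in ALU_OPS
    return {
        "branch": op in {"br", "brl"},
        "sh": op in SHIFTS,
        "alu": alu,
        "load": load,
        "ladr": ladr,
        "store": store,
        "l-s": load or ladr or store,
        # Instructions that write to the register file.
        "regwrite": load or ladr or op == "brl" or alu,
        # Displacement and relative addressing classes.
        "dsp": op in {"ld", "st", "la"},
        "rl": op in {"ldr", "str", "lar"},
    }


print(stage_signals("ld")["regwrite"], stage_signals("st")["regwrite"])
```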

14
ALU Instructions
  • ALU Instructions
  • Instructions fit into 5 stages
  • Stage 1, Instruction Fetch: the instruction pointed to
    by PC is fetched from instruction memory
    (separate from data memory) and PC is incremented.
  • Stage 2, Instruction Decode/Operand Access: the
    instruction is read from IR2 and decoded. Recall
    that all ALU or shift operations are of the form
  • R[ra] ← R[rb] op R[rc]
  • R[ra] ← R[rb] op c2⟨16..0⟩
  • The second ALU operand comes either from a register
    or from the instruction register c2 field (see next
    slide)
  • Y3 ← (imm → c2 : ¬imm → R[rc])
  • Stage 3, ALU Operation: the opcode must be available
    in stage 3 to tell the ALU what to do.
  • Stage 4, Memory Access: an ALU instruction performs
    no memory access ⇒ NOP.
  • Stage 5, Register Write: the result register, ra, is
    written in stage 5 from Z4. The regwrite signal is
    set to true to enable the write into the register.

15
The Memory Access Instructions ld, ldr, st,
and str
  • RTN of Memory Access Instructions
  • ld (:= op = 1) → R[ra] ← M[disp]
  • ldr (:= op = 2) → R[ra] ← M[rel]
  • st (:= op = 3) → M[disp] ← R[ra]
  • str (:= op = 4) → M[rel] ← R[ra]
  • la (:= op = 5) → R[ra] ← disp
  • lar (:= op = 6) → R[ra] ← rel
  • disp⟨31..0⟩ :=
  • ((rb = 0) → c2⟨16..0⟩ {sign extend} :
  • (rb ≠ 0) → R[rb] + c2⟨16..0⟩ {sign extend, 2's
    comp.})
  • rel⟨31..0⟩ := PC⟨31..0⟩ + c1⟨21..0⟩ {sign extend,
    2's comp.}
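
The two address forms can be sketched as follows. This is a minimal illustration, assuming a 17-bit c2 field, a 22-bit c1 field, 32-bit wraparound arithmetic, and a register file modelled as a Python list; the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a 2's complement number."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def disp(regs: list, rb: int, c2: int) -> int:
    """Displacement address: c2 alone if rb = 0, otherwise R[rb] + c2."""
    base = 0 if rb == 0 else regs[rb]
    return (base + sign_extend(c2, 17)) & MASK32


def rel(pc: int, c1: int) -> int:
    """Relative address: PC + c1."""
    return (pc + sign_extend(c1, 22)) & MASK32


regs = [0] * 32
regs[5] = 16
print(disp(regs, 5, 128))
```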

16
The Memory Access Instructions ld, ldr, st,
and str
  • Stage 1, Instruction Fetch and PC Increment.
    Note that the incremented value of PC is recorded in PC2.
  • Stage 2, Operand Fetch.
  • 1st address operand
  • X3 ← (rl → PC2 : dsp → R[rb])
  • 2nd address operand
  • Y3 ← (rl → c1 : dsp → c2)
  • Stage 3, ALU Operation: the relative or displacement
    address is computed by adding X3 and Y3. The result
    is stored in Z4.
  • Stage 4, Memory Access: for ld or ldr, data memory at
    the address in Z4 is copied into Z5. For la or lar, the
    address in Z4 is copied directly into Z5. Store
    instructions have the value in MD4 written into data
    memory at the address stored in Z4.
  • Stage 5, Register Write: if the instruction is a load,
    regwrite will be true and the value stored in Z5
    will be written into the register file at the
    register address given by the ra field carried down
    the pipeline from IR2.

17
Branch Instructions

cond := ( c3⟨2..0⟩ = 0 → 0 :               never
          c3⟨2..0⟩ = 1 → 1 :               always
          c3⟨2..0⟩ = 2 → R[rc] = 0 :       if register is zero
          c3⟨2..0⟩ = 3 → R[rc] ≠ 0 :       if register is nonzero
          c3⟨2..0⟩ = 4 → R[rc]⟨31⟩ = 0 :   if positive or zero
          c3⟨2..0⟩ = 5 → R[rc]⟨31⟩ = 1 )   if negative
br (:= op = 8) → (cond → (PC ← R[rb]))                  Conditional branch
brl (:= op = 9) → (R[ra] ← PC : cond → (PC ← R[rb]))    Branch and link
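
The branch condition and the two branch instructions can be sketched in a few lines. This is an illustration under stated assumptions: register values are unsigned 32-bit integers, the PC passed in has already been incremented, and c3 codes 6 and 7 (not defined on the slide) are treated as "never". The function names are hypothetical.

```python
def cond(c3: int, r_rc: int) -> bool:
    """Evaluate the branch condition selected by c3<2..0> against R[rc]."""
    code = c3 & 0b111
    negative = ((r_rc >> 31) & 1) == 1
    if code == 1:
        return True            # always
    if code == 2:
        return r_rc == 0       # register is zero
    if code == 3:
        return r_rc != 0       # register is nonzero
    if code == 4:
        return not negative    # positive or zero
    if code == 5:
        return negative        # negative
    return False               # code 0 is "never"; 6 and 7 assumed "never"


def branch_step(op: str, c3: int, regs: list, ra: int, rb: int, rc: int, pc: int) -> int:
    """Return the next PC for br or brl; brl also writes the link register."""
    if op == "brl":
        regs[ra] = pc          # link value is saved whether or not the branch is taken
    return regs[rb] if cond(c3, regs[rc]) else pc


regs = [0] * 32
regs[11] = 512
print(branch_step("brl", 1, regs, 9, 11, 0, 112), regs[9])
```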
18
Branch Instructions
  • The new program counter value is known in stage
    2 but not in stage 1.
  • If the branch is taken, the PC receives the new
    branch address.
  • Only branch and link (brl) performs a register
    write in stage 5
  • The value of the old PC is incremented and stored
    in PC2, to be written into R[ra] (the link register)
    in stage 5 regardless of whether the branch is
    taken or not.
  • There is no ALU or memory operation
  • Mp1 is controlled according to
  • cond(IR2, R[rc]) → PC ← R[rb]
  • X3 ← PC2

19
Designing the Pipelined Data Path
  • All information pertaining to the instruction
    that will be used in subsequent stages of
    execution (data and instruction) must be
    propagated along the pipeline in so-called
    pipeline registers.
  • Global State
  • Register file,
  • Data Memory,
  • Instruction Memory,
  • The SRC Pipeline Registers and RTN Specification
  • Hardware and Control to Support Pipelining
  • Requires
  • Examination of previous figures and
  • Determination of which information needs to be
    propagated to the next stage.

20
Designing the Pipelined Data Path
  • RTN and the Pipeline Design
  • Figure 5.6 (next slide) depicts all of the
    pipeline registers and the RTN descriptions of
    the flow of all the instructions through the
    pipeline.
  • It combines
  • All the data path (pipeline) registers, and
  • The actions specified for different instruction
    classes (as described previously).

21
Pipeline Registers and RTN Specification
  • Control signals are labeled with the stage from
    which they are computed. Example:
  • PC ←
  • (¬branch2 → PC + 4 :
  • branch2 → (cond(IR2, R[rc]) → R[rb] :
  • ¬cond(IR2, R[rc]) → PC + 4))
  • Propagation of the IR register content is necessary
    across the pipeline.
  • Stages 3, 4, and 5 require only the op field and
    the ra field.
  • In Stage 3 the ALU instructions require opcode to
    determine which operation to perform.
  • Stage 4 requires the opcode to supply the load
    and store instruction with the information they
    will need to control data memory access.
  • Stage 5 ra is used to tell its instruction which
    register in the register file to write its value
    into, also opcode determines whether a register
    write is required.
  • Z4 (ALU output register)
  • Memory address (load store)
  • A memory value if the instruction is ld or ldr.
  • Incremented PC (brl)
  • ALU results if ALU instruction

22
Global State of the Pipelined SRC
  • PC, the general registers, instruction memory,
    and data memory represent the global machine
    state
  • PC is accessed in stage 1 (and stage 2 on branch)
  • Instruction memory is accessed in stage 1
  • General registers are read in stage 2 and written
    in stage 5
  • Data memory is only accessed in stage 4

23
Restrictions on Access to Global State by Pipeline
  • Can see why separate instruction and data
    memories (or caches) are needed
  • When a load or store accesses data memory in
    stage 4, stage 1 is accessing an instruction
  • Thus two memory accesses occur simultaneously
  • Two operands may be needed from registers in
    stage 2 while another instruction is writing a
    result register in stage 5.
  • Thus, as far as the registers are concerned, 2
    reads and a write happen simultaneously
  • Increment of PC in stage 1 must be overridden by
    a successful branch in stage 2

24
Control Signals Pipeline Data Path
  • The Pipeline Data Path with Selected Control
    Signals
  • Most control signals are shown and given values
  • Multiplexer control is stressed in the next
    figure
  • Notation change on the inputs/outputs of the
    register file
  • Address inputs are labeled a1,a2, and a3.
  • Figure in the next slide indicates which register
    field from the instruction ra, rb, or rc, is
    sent to which address input a1, a2, or a3.
  • Data inputs/outputs are labeled as R1, R2, and R3.

25
Control Signals Pipeline Data Path
[Figure: the SRC pipeline data path with selected control signals, from the PC and instruction memory in stage 1, through the register file in stage 2, the ALU in stage 3, and data memory in stage 4, to the register write in stage 5. Multiplexers Mp1 to Mp5 allow the pipeline registers to have multiple input sources. GA1 plays the role of BAout: it gates all 0s if R0 is selected as part of the disp calculation. The register write in stage 5 is enabled by load ∨ ladr ∨ brl ∨ alu.]

26
Control Signals Pipeline Data Path
  • Example
  • Mp4 ← (rl → c1 :                 ldr, str, lar: rel instr.
    dsp ∨ imm → c2 :                 addi, andi, ori: imm instr., or ld, st, la: disp instr.
    (alu ∧ ¬imm) ∨ sh → R2)          alu and not imm, or shift instruction
  • Register operand access in stage 2
  • All instructions with exception of the store
    instructions
  • rb and rc - specify source operands to be
    accessed in stage 2
  • ra specifies the register into which the result
    is to be stored in stage 5.
  • Store instructions
  • R[ra] contains the value of the operand to be
    fetched out of the register file in stage 2
  • Multiplexer Mp2 is used to route ra instead of rc
    to register read address port a2.
  • The fetched value R[ra] is copied into MD3 to be
    stored in memory in stage 4.

27
Generating the control Signals
  • In the pipeline architecture
  • Control signals are generated at each stage from
    the op field in IRx ⇒
  • Control signals are distributed throughout the
    stages of the pipeline.
  • Most control signals that are generated (see
    previous figure) in a given stage are also used
    in that stage.
  • There are a few specific exceptions, e.g., PC
    control.
  • Note that each register must have a strobe signal
    that controls reading/writing of data from/to it.
  • From the figure, all the paths are
    point-to-point ⇒
  • No gating signals are required except at the
    multiplexers.
  • RTN presented in previous figure provides
    sufficient information to generate all of the
    gate and strobe signals in the data path.
  • Special cases of pipeline hazard require special
    solution (covered later).

28
Propagating an Instruction Sequence Through the
Pipeline
  • Example

100  add r4, r6, r8        R[4] ← R[6] + R[8]
104  ld r7, 128(r5)        R[7] ← M[R[5] + 128]
108  brl r9, r11, 001      PC ← R[11] : R[9] ← PC
112  str r12, 32           M[PC + 32] ← R[12]
...
512  sub ...               next instr. ...
  • PC is initialized to 100.
  • R[11] = 512 when the brl instruction is executed.
  • R[6] = 4 and R[8] = 5 are the add operands
  • R[5] = 16 for the ld and R[12] = 23 for the str

29
First Clock Cycle add Enters Stage 1 of Pipeline
  • Program counter is incremented to 104

[Figure: pipeline contents in clock cycle 1, with add in stage 1.]
30
Second Clock Cycle
  • add Enters Stage 2
  • ld is Being Fetched at Stage 1
  • add operands are fetched in stage 2

[Figure: pipeline contents in clock cycle 2, with ld in stage 1 and add in stage 2.]
31
Third Clock Cycle brl Enters the Pipeline

  • add performs its arithmetic in stage 3
  • ld moves to stage 2

[Figure: pipeline contents in clock cycle 3, with brl in stage 1, ld in stage 2, and add in stage 3.]
32
Fourth Clock Cycle str Enters the Pipeline
  • add is idle in stage 4
  • Success of brl changes program counter to 512

[Figure: pipeline contents in clock cycle 4, with str in stage 1, brl in stage 2, ld in stage 3, and add in stage 4.]
33
Fifth Clock Cycle
  • add Completes
  • sub Enters the Pipeline
  • add completes in stage 5
  • sub is fetched from location 512 after successful
    brl

[Figure: pipeline contents in clock cycle 5, with sub in stage 1 and add in stage 5.]
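
The five clock cycles walked through above can be reproduced with a small occupancy model. This is a sketch only, assuming no stalls and identifying each instruction by its mnemonic; the names FETCH_ORDER and pipeline_snapshot are illustrative. The str is fetched before the brl in stage 2 redirects the PC, so sub is the fifth instruction fetched.

```python
# Order in which instructions are fetched in the example:
# str occupies the slot after brl; sub is at the branch target (address 512).
FETCH_ORDER = ["add", "ld", "brl", "str", "sub"]


def pipeline_snapshot(cycle: int, fetch_order=FETCH_ORDER, n_stages: int = 5) -> dict:
    """Return {stage number: instruction} for a 1-based clock cycle, no stalls."""
    snapshot = {}
    for index, instr in enumerate(fetch_order):
        stage = cycle - index      # instruction `index` enters stage 1 in cycle index + 1
        if 1 <= stage <= n_stages:
            snapshot[stage] = instr
    return snapshot


for cycle in range(1, 6):
    print(cycle, pipeline_snapshot(cycle))
```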
34
Functions of the SRC Pipeline Stages
  • Stage 1 fetches instruction
  • PC incremented or replaced by successful branch
    in stage 2
  • Stage 2 decodes instruction and gets operands
  • Load or store gets operands for address
    computation
  • Store gets register value to be stored as 3rd
    operand
  • ALU operation gets 2 registers or register and
    constant
  • Stage 3 performs ALU operation
  • Calculates effective address or does
    arithmetic/logic
  • May pass through link PC or value to be stored in
    memory

35
Functions of the SRC Pipeline Stages
  • Stage 4 accesses data memory
  • Passes Z4 to Z5 unchanged for non-memory
    instructions
  • Load fills Z5 from memory
  • Store uses address from Z4 and data from MD4 (no
    longer needed)
  • Stage 5 writes result register
  • Z5 contains value to be written, which can be ALU
    result, effective address, PC link value, or
    fetched data
  • ra field always specifies result register in SRC

36
Functions of the Pipeline Registers in SRC
  • Registers between stages 1 and 2
  • IR2 holds full instruction including any register
    fields and constant
  • PC2 holds the incremented PC from instruction
    fetch
  • Registers between stages 2 and 3
  • IR3 holds opcode and ra (needed in stage 5)
  • X3 holds PC or a register value (for link or 1st
    ALU operand)
  • Y3 holds c1 or c2 or a register value as 2nd ALU
    operand
  • MD3 is used for a register value to be stored in
    memory

37
Functions of the Pipeline Registers in SRC
  • Registers between stages 3 and 4
  • I4 has op code and ra
  • Z4 has memory address or result register value
  • MD4 has value to be stored in data memory
  • Registers between stages 4 and 5
  • I5 has opcode and destination register number, ra
  • Z5 has value to be stored in destination
    register from ALU result, PC link value, or
    fetched data

38
Pipeline Hazard
  • Entirely predictable, deterministic events.
  • Occur as side effects of having instructions in
    the pipeline that depend upon the results of
    instructions ahead of them that have not exited
    the pipeline.
  • The element of hazard comes only from not taking
    into account the pipeline's behavior
  • Rogue compiler
  • Assembly language programmer
  • The compiler must perform hazard analysis based on
    static conditions (as opposed to run-time dynamic
    conditions) and thus take into account the worst
    case scenario.

39
Dependence Between Instructions in Pipe
Pipeline Hazards
  • Instructions that occupy the pipeline together
    are being executed in parallel
  • This leads to the problem of instruction
    dependence, well known in parallel processing
  • The basic problem is that an instruction depends
    on the result of a previously issued instruction
    that is not yet complete
  • Two categories of hazards
  • Data hazards: an instruction initiates
    modification of the data in a register that is
    needed by one of the next instructions in the
    pipeline.
  • Branch hazards: fetch of the wrong instruction on a
    change in PC

40
Data Hazard
  • Classification of Data Hazards
  • A read after write hazard (RAW) arises from a
    flow dependence, where an instruction uses data
    produced by a previous one
  • A write after read hazard (WAR) comes from an
    anti-dependence, where an instruction writes a
    new value over one that is still needed by a
    previous instruction
  • A write after write hazard (WAW) comes from an
    output dependence, where two parallel
    instructions write to the same register and must
    do it in the order in which they were issued
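
The three dependence types can be identified mechanically from the registers each instruction reads and writes. This is a minimal sketch, assuming each instruction has at most one destination register; the Instr type and the dependences function are hypothetical names.

```python
from typing import NamedTuple, Optional, Set, Tuple


class Instr(NamedTuple):
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def dependences(first: Instr, second: Instr) -> Set[str]:
    """Classify how `second` depends on the earlier instruction `first`."""
    found = set()
    if first.dest is not None and first.dest in second.srcs:
        found.add("RAW")       # flow dependence
    if second.dest is not None and second.dest in first.srcs:
        found.add("WAR")       # anti-dependence
    if first.dest is not None and first.dest == second.dest:
        found.add("WAW")       # output dependence
    return found


add = Instr("r0", ("r2", "r4"))    # add r0, r2, r4
sub = Instr("r3", ("r0", "r1"))    # sub r3, r0, r1
print(dependences(add, sub))
```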

41
Data Hazards in SRC
  • Since all data memory access occurs in stage 4,
    memory writes and reads are sequential and give
    rise to no hazards
  • Since all registers are written in the last
    stage, WAW and WAR hazards do not occur
  • Two writes always occur in the order issued, and
    a write always follows a previously issued read
  • SRC hazards on register data are limited to RAW
    hazards coming from flow dependence
  • Values are written into registers at the end of
    stage 5 but may be needed by a following
    instruction at the beginning of stage 2

42
Example of Pipeline Data Hazard in SRC
add instruction writes into register r0 in Stage
5
  • 100 add r0, r2, r4
  • 104 sub r3, r0, r1
  • How to prevent this kind of hazard
  • When the instruction
  • add is in stage 5,
  • sub must be no closer than stage 1
  • ⇒ separation of at least 4 instructions!
  • Note that the result operand of the add instruction
    is actually available in register Z4 when the sub
    instruction requires it.
  • Data forwarding: forwarding hardware can be
    designed to detect this particular hazard and to
    forward the value to register Y3 in time for the
    sub instruction to use it.

sub instruction reads register r0 in Stage 2
43
Possible Solutions to the Register Data Hazard
Problem
  • Detection
  • The machine manual could list rules specifying
    that a dependent instruction cannot be issued
    less than a given number of steps after the one
    on which it depends
  • This is usually too restrictive
  • Since the operation and operands are known at
    each stage, dependence on a following stage can
    be detected
  • Correction
  • The dependent instruction can be stalled and
    those ahead of it in the pipeline allowed to
    complete
  • Result can be forwarded to a following inst. in
    a previous stage without waiting to be written
    into its register
  • Preferred: the SRC design will use detection and
    forwarding, stalling only when unavoidable

44
Detecting Hazards and Dependence Distance
  • To detect hazards, each pair of instructions must
    be considered
  • Data is normally available after being written to
    register
  • Can be made available for forwarding as early as
    the stage where it is produced
  • Stage 3 output for ALU results, stage 4 for
    memory fetch
  • Operands normally needed in stage 2
  • Can be received from forwarding as late as the
    stage in which they are used
  • Stage 3 for ALU operands and address modifiers,
    stage 4 for stored register, stage 2 for branch
    target

45
Data Hazards in SRC
  • The task of determining all possible data hazards
    in a given pipeline structure and instruction set
    is to consider all possible interactions between
    all instructions at all stages in the pipeline.
  • ALU Instructions
  • All ALU instructions write and read data.
  • Potential for a data hazard (previous example)
    because they
  • Read data in stage 2 (Normally Read-Needed)
  • Write data in stage 5 ⇒ data becomes
    available at stage 6 (Normally
    Written-Available)
  • ⇒ ALU instructions that use the data from
    previous ALU instructions must be separated by at
    least four instructions.
  • Written data is actually available in Z4 when the ALU
    instruction is in stage 4 (Earliest Available).
  • Data read in stage 2 does not get
    used until stage 3, when the ALU operation is
    performed (Latest Needed).
  • Data forwarding: one hardware solution is to
    detect the hazard between the two
    instructions and
  • Forward the data from Z4 to the proper ALU input
    in the previous stage.

46
Data Hazards - ALU Instructions
  • Data forwarding: implementation requires further
    analysis
  • Associate with ALU instructions a pair of
    numbers
  • The stage where the data is Normally Available
  • The stage where the data is Earliest Available.
  • For ALU instructions that write data, this
    Normally Available/Earliest Available pair is
    6/4.
  • For ALU instructions that read data, a similar
    pair of stages is associated with the
    register-reading requirements: Normally Required
    and Latest Required.
  • For an ALU reader this Normally Required/Latest
    Required pair is 2/3.
  • Taking the pair-wise difference gives the minimum spacing
    for the ALU pair: (6 − 2)/(4 − 3) = 4/1
  • Instructions must be separated by at least 4
    stages unless there is a forwarding scheme,
  • With a forwarding scheme they need to be separated
    by only one stage.

47
Instruction Pair Hazard Interaction
  • Data Dependence for the Modifier Instructions

No-hazard distance between a register writer and a dependent reader, given as Normal/Forwarded.

                                     Write to register file
                                     (data available: normal/earliest stage)
Read from register file              alu     load    ladr    brl
(normal/latest stage)                6/4     6/5     6/4     6/2
alu          2/3                     4/1     4/2     4/1     4/1
load         2/3                     4/1     4/2     4/1     4/1
ladr         2/3                     4/1     4/2     4/1     4/1
store (rb)   2/3                     4/1     4/2     4/1     4/1
store (ra)   2/4                     4/1     4/1     4/1     4/1
branch       2/2                     4/2     4/3     4/2     4/1

For the alu/alu entry: 6 − 2 = 4 (normal) and 4 − 3 = 1 (forwarded). The branch row is the branch hazard.
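
Every entry of the table follows from the two stage pairs. The sketch below recomputes them; it assumes that a forwarded distance can never fall below 1 (an instruction cannot depend on itself), which is an inference from the table values rather than something the slides state. The names WRITERS, READERS, and min_spacing are illustrative.

```python
# (normally available, earliest available) stage for each writer class
WRITERS = {"alu": (6, 4), "load": (6, 5), "ladr": (6, 4), "brl": (6, 2)}
# (normally needed, latest needed) stage for each reader class
READERS = {
    "alu": (2, 3), "load": (2, 3), "ladr": (2, 3),
    "store (rb)": (2, 3), "store (ra)": (2, 4), "branch": (2, 2),
}


def min_spacing(writer: str, reader: str) -> tuple:
    """Return (normal, forwarded) no-hazard distance for a writer/reader pair."""
    avail_normal, avail_earliest = WRITERS[writer]
    need_normal, need_latest = READERS[reader]
    normal = avail_normal - need_normal
    forwarded = max(1, avail_earliest - need_latest)   # assumed floor of 1
    return normal, forwarded


for reader in READERS:
    print(reader, [min_spacing(writer, reader) for writer in WRITERS])
```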
48
Instruction Pair Hazard Interaction
  • Previous table considers only register writes and
    subsequent register reads.
  • It also covers hazards due to loads from data
    memory to the register file.
  • What are the data hazards associated with stores?
  • The only possible hazard after a store is a subsequent
    load
  • Since only load and store instructions access data memory
    and
  • Since both load and store occur in stage 4
  • ⇒ there is no hazard.
  • Branch delay
  • Branch delay is unavoidable
  • Work around the problem, as is the case with
    delayed loads.
  • Techniques that try to predict the outcome of a
    branch in the fetch stage are beyond the scope of
    this class.

49
Delays Unavoidable by Forwarding
  • In the Table Load column, we see the value
    loaded cannot be available to the next
    instruction, even with forwarding
  • Can restrict compiler not to put a dependent
    instruction in the next position after a load
    (next 2 positions if the dependent instruction is
    a branch)
  • Target register cannot be forwarded to branch
    from the immediately preceding instruction
  • Code is restricted so that branch target must not
    be changed by instruction preceding branch
    (previous 2 instructions if loaded from memory)
  • Do not confuse this with the branch delay slot,
    which is a dependence of instruction fetch on
    branch, not a dependence of branch on something
    else

50
Hazard Detection in the Compiler
  • Hazard Resolved by Compiler
  • Burden of hazard detection and elimination on the
    compiler
  • Analyze the code sequence and either
  • Rearrange the instructions to remove the hazard,
    or
  • If no rearrangement is possible, insert nops
    between the instructions that form the hazard
    (a software bubble in the pipeline).
  • Problems with this approach
  • Additional burden on the compiler writer to
    develop a compiler that is both correct and
    efficient.
  • Lead to more expensive and longer development
    cycle.
  • May lead to buggy compiler when maximum
    optimization of code is required.
  • Without hardware detection there can be no data
    forwarding ⇒ no reduction in the no-hazard
    distance of four instructions.
  • The compiler can only perform static analysis of code
    ⇒ must assume the most pessimistic scenario.
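
The nop-insertion fallback can be sketched as a single pass over straight-line code. This is an illustration only, assuming each instruction is described by a (text, destination, sources) tuple, that distance is counted as the difference in issue position, and that no rearrangement is attempted; insert_nops is a hypothetical name, not a real compiler pass.

```python
NOP = ("nop", None, ())


def insert_nops(program: list, min_distance: int = 4) -> list:
    """Pad with nops so every reader is at least min_distance after its writer."""
    out = []
    for instr in program:
        srcs = instr[2]
        needed = 0
        # Walk back over already emitted instructions, nearest first.
        for back, (_, dest, _) in enumerate(reversed(out), start=1):
            if back >= min_distance:
                break                          # far enough away: no hazard
            if dest is not None and dest in srcs:
                needed = min_distance - back   # nearest writer needs the most padding
                break
        out.extend([NOP] * needed)
        out.append(instr)
    return out


program = [
    ("add r0, r2, r4", "r0", ("r2", "r4")),
    ("sub r3, r0, r1", "r3", ("r0", "r1")),
]
for text, _, _ in insert_nops(program):
    print(text)
```

With hardware forwarding for ALU pairs, the same routine called with min_distance=1 would insert nothing, which is the reduction the compiler-only approach cannot obtain.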

51
Hazard Detection by Hardware
  • Hazard Resolved by Pipeline Stalls
  • Focus on hardware solution of data forwarding.
  • Test for hazards at each place where they can
    occur (as described in Table).
  • Illustrate process of detecting data hazard with
    2-operand ALU-ALU instruction pairs.
  • Approach 1
  • Remove the hazard by inserting bubbles in the
    pipeline.

52
Hazard Detection by Hardware
  • Pipeline Bubble Insertion
  • Facts we need to take into account
  • The minimum spacing between data-dependent
    instructions.
  • The dependent instruction (the stallee) must be
    paused at stage 2 until the hazard has been resolved.
    Note:
  • The instruction cannot complete its operand fetch
    until the operand has been written to the
    register file.
  • The instruction behind it in stage 1 must also be
    held as long as the pipeline is stalled.
  • The two dependent instructions may be 1, 2, or 3
    instructions apart: hazard detection hardware
    must detect all three of these cases.
  • The staller, and any intermediate instructions
    between staller and stallee, must be allowed
    to finish and exit the pipeline.

53
Example of Detecting ALU Hazards and Stalling
Pipeline
  • The following expression detects hazards between
    ALU instructions in stages 2 and 3 and stalls the
    pipeline
  • ( alu3 ∧ alu2 ∧ ((ra3 = rb2) ∨ (ra3 = rc2) ∧ ¬imm2) )
    → ( pause2 : pause1 : op3 ← 0 )
  • After such a stall, the hazard will be between
    stages 2 and 4, detected by
  • ( alu4 ∧ alu2 ∧ ((ra4 = rb2) ∨ (ra4 = rc2) ∧ ¬imm2) )
    → ( pause2 : pause1 : op3 ← 0 )
  • Hazards between stages 2 and 5 require
  • ( alu5 ∧ alu2 ∧ ((ra5 = rb2) ∨ (ra5 = rc2) ∧ ¬imm2) )
    → ( pause2 : pause1 : op3 ← 0 )

Reading the first expression: if the opcodes in stages 2 and 3 are both alu, and ra in stage 3 equals rb or rc in stage 2 (unless the stage 2 instruction is an immediate instruction, in which case there is no rc), then there is a hazard between the instructions in stages 2 and 3. Emit signals to pause pipeline stages 1 and 2 (pause1 and pause2), and insert a bubble in the pipeline between the staller in stage 3 and the stallee in stage 2 (op3 ← 0).
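
The three detection expressions share one pattern, which can be sketched as a single predicate applied to stages 3, 4, and 5. This is a software illustration of the logic, not the hardware itself; the StageInfo fields and function names are hypothetical, and a bubble is modelled as a stage whose alu flag is false.

```python
from dataclasses import dataclass


@dataclass
class StageInfo:
    alu: bool = False    # opcode in this stage is an ALU instruction
    ra: int = 0          # destination register field
    rb: int = 0          # first source register field
    rc: int = 0          # second source register field
    imm: bool = False    # immediate form: rc is not a register source


def alu_hazard(stage2: StageInfo, later: StageInfo) -> bool:
    """True if the ALU instruction in stage 2 reads what `later` will write."""
    if not (stage2.alu and later.alu):
        return False
    return later.ra == stage2.rb or (later.ra == stage2.rc and not stage2.imm)


def must_stall(stage2: StageInfo, stage3: StageInfo,
               stage4: StageInfo, stage5: StageInfo) -> bool:
    """Assert pause1 and pause2 (and set op3 to 0) while any hazard remains."""
    return any(alu_hazard(stage2, s) for s in (stage3, stage4, stage5))


bubble = StageInfo()
add = StageInfo(alu=True, ra=0, rb=2, rc=4)    # add r0, r2, r4 in stage 3
sub = StageInfo(alu=True, ra=3, rb=0, rc=1)    # sub r3, r0, r1 in stage 2
print(must_stall(sub, add, bubble, bubble))
```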
54
Data Dependence - Stalling
  • Stall Due to a Data Dependence Between Two ALU
    Instructions

55
Data Forwarding
  • Example of Data Forwarding from an ALU
    Instruction to another ALU Instruction
  • The pair table for data dependencies says that if
    forwarding is done, dependent ALU instructions
    can be adjacent, not 4 apart
  • For this to work, dependences must be detected
    and data sent from where it is available directly
    to the X or Y input of the ALU
  • For a dependence of an ALU instruction in stage 3
    on an ALU instruction in stage 5 (or stage 4),
    the equations are
  • alu5 ∧ alu3 → ( (ra5 = rb3) → X ← Z5:
    (ra5 = rc3) ∧ ¬imm3 → Y ← Z5 )
  • alu4 ∧ alu3 → ( (ra4 = rb3) → X ← Z4:
    (ra4 = rc3) ∧ ¬imm3 → Y ← Z4 )

56
Hazard Detection and Forwarding
  • Can be from either Z4 or Z5 to either X or Y
    input to ALU
  • rb and rc needed in stage 3 for detection

57
Data Forwarding: ALU to ALU Instruction (cont'd)
  • For an ALU instruction in stage 3 depending on
    one in stage 4 (or stage 5), the equations are
  • alu4 ∧ alu3 → ( (ra4 = rb3) → X ← Z4:
    (ra4 = rc3) ∧ ¬imm3 → Y ← Z4 )
  • alu5 ∧ alu3 → ( (ra5 = rb3) → X ← Z5:
    (ra5 = rc3) ∧ ¬imm3 → Y ← Z5 )
  • We can see that the rb and rc fields must be
    available in stage 3 for hazard detection
  • Multiplexers must be put on the X and Y inputs to
    the ALU so that Z4 or Z5 can replace either X3 or
    Y3 as inputs

58
Example
  • add r5, r1, r1 (instr C, issued 3rd, in stage 3)
  • add r1, r4, r1 (instr B, issued 2nd, in stage 4)
  • add r1, r3, r2 (instr A, issued 1st, in stage 5)
  • Hazard detection units in both stages 4 and 5
    will detect the hazard
  • However, only the hazard detection unit in stage
    4 should forward its result, which is in Z4, to
    both X3 and Y3.
  • This implies that the dependences between 3 and 4
    should take precedence over the dependences
    between 3 and 5: the X or Y value set by a
    3-5 dependence is replaced by the value from a 3-4
    dependence.
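
A small Python sketch of this precedence rule, using the three add instructions from the example. The `AluInstr` tuple, the `forward` function, and the numeric values for Z4, Z5, and the stale register contents are all hypothetical; every instruction is assumed to be an ALU instruction, so the alu4 ∧ alu3 and alu5 ∧ alu3 terms are taken as true.

```python
from typing import NamedTuple, Optional, Tuple


class AluInstr(NamedTuple):
    ra: int            # destination register number
    rb: int            # first source register number
    rc: int            # second source register number
    imm: bool = False  # True if the second operand is a constant


def forward(instr3: AluInstr, x3: int, y3: int,
            instr4: Optional[AluInstr] = None, z4: int = 0,
            instr5: Optional[AluInstr] = None, z5: int = 0) -> Tuple[int, int]:
    """Return the (X, Y) ALU inputs for stage 3 after forwarding."""
    x, y = x3, y3
    # Apply the 3-5 dependence first, then the 3-4 dependence, so that
    # a value forwarded from Z4 replaces one forwarded from Z5.
    for producer, z in ((instr5, z5), (instr4, z4)):
        if producer is None:
            continue
        if producer.ra == instr3.rb:
            x = z
        if producer.ra == instr3.rc and not instr3.imm:
            y = z
    return x, y


instr_c = AluInstr(ra=5, rb=1, rc=1)   # add r5, r1, r1 (stage 3)
instr_b = AluInstr(ra=1, rb=4, rc=1)   # add r1, r4, r1 (stage 4)
instr_a = AluInstr(ra=1, rb=3, rc=2)   # add r1, r3, r2 (stage 5)
x, y = forward(instr_c, x3=7, y3=7, instr4=instr_b, z4=40,
               instr5=instr_a, z5=50)
```

With both producers present, X and Y both receive the stage 4 value (40 here), which is the newer definition of r1.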

59
Exceptions and the Pipeline
  • Internal and external exceptions must be handled.
  • Imprecise exception
  • Precise exception
  • Instructions ahead of one that caused the
    exception (e.g., divide by zero) should continue.
  • Ideally the contents of the registers in its
    stage would be saved for later analysis and the
    instruction aborted and replaced with nop.
  • Instructions behind the faulty one may be
    restarted after the exception handler has
    completed.

60
Restrictions Left If Forwarding Done Wherever
Possible
  • (1) Branch delay slot
  • The instruction after a branch is always
    executed, whether the branch succeeds or not.
      br r4
      add . . .
  • (2) Load delay slot
  • A register loaded from memory cannot be used as
    an operand in the next instruction.
      ld r4, 4(r5)
      nop
      neg r6, r4
  • A register loaded from memory cannot be used as a
    branch target for the next two instructions.
      ld r0, 1000
      nop
      nop
      br r0
  • (3) Branch target
  • Result register of an ALU or ladr instruction
    cannot be used as branch target by the next
    instruction.
      not r0, r1
      nop
      br r0

61
Performance and Design
  • Notation
  • IC: Instruction Count
  • CPI: Clock Cycles per Instruction
  • τ: Clock Period
  • Assumptions
  • Clock period of the pipelined architecture is the
    same as that of the non-pipelined one
  • Average CPI for a 1-bus non-pipelined design is
    5, and the pipelined design can issue and
    complete one instruction per clock
  • Assume that there is one pipeline stall for every
    four instructions ⇒ 5 clocks for 4 instructions,
    or 5/4 = 1.25 CPI.
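
The arithmetic above can be checked with a few lines of Python, using the standard relation execution time = IC × CPI × clock period. The function names and the values chosen for IC and the clock period are hypothetical; only the CPI figures come from the slide.

```python
def execution_time(ic: int, cpi: float, tau: float) -> float:
    """Execution time = instruction count * cycles per instruction * clock period."""
    return ic * cpi * tau


def cpi_with_stalls(instructions_per_stall: int) -> float:
    """One stall cycle for every `instructions_per_stall` instructions."""
    return (instructions_per_stall + 1) / instructions_per_stall


ic, tau = 1000, 1.0                       # hypothetical program size and period
t_nonpipelined = execution_time(ic, 5.0, tau)
t_pipelined = execution_time(ic, cpi_with_stalls(4), tau)
speedup = t_nonpipelined / t_pipelined    # 5 / 1.25
```

Because IC and the clock period are the same for both designs, they cancel, and the speedup under these assumptions is simply the ratio of the two CPI values, 5 / 1.25 = 4.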

62
Instruction-Level Parallelism
  • Two fundamental approaches to increasing a
    processor's instruction execution rate
  • Increasing clock speed (IC technology dependent)
  • Instruction-level pipelining (Computer Architect
    and Logic Designer domain).
  • Efficient Sequential Execution.
  • Instruction-level parallelism
  • Increasing the number of instructions executed
    simultaneously: if there are multiple function
    units and multiple instructions have been
    fetched, then it is possible to start several at
    once
  • Two approaches are
  • Superscalar
  • Dynamically issue as many prefetched instructions
    to idle function units as possible
  • Very Long Instruction Word (VLIW)
  • Statically compile long instruction words with
    many operations in a word, each for a different
    function unit

63
Superscalar Architectures
  • There may be a different type of function unit
    for each class of instruction in the instruction set
  • Floating-point (FPUs)
  • Integer (IUs)
  • Branch Prediction (BPUs)
  • There can be more than one of the same type
  • Each function unit is itself pipelined
  • How they work
  • Fetch instruction into FIFO queue
  • Partial Decode to determine type
  • Dispatch instruction to the appropriate unit (IU,
    FPU, or BPU) according to type.
  • Branches become more of a problem
  • There are fewer clock cycles between branches
  • Branch units try to predict branch direction
  • Instructions at branch target may be prefetched,
    and even executed speculatively, in hopes the
    branch goes that way
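
The fetch, partial decode, and dispatch steps can be sketched in Python. This is a simplified model under stated assumptions: the opcode-to-unit table is hypothetical, instructions are plain strings, issue is in order (dispatch stops at the first instruction whose unit is busy), and data dependences are ignored.

```python
from collections import deque

# Hypothetical mapping from opcode to function unit type.
UNIT_FOR_OPCODE = {"add": "IU", "sub": "IU",
                   "fadd": "FPU", "fmul": "FPU",
                   "br": "BPU"}


def dispatch(fetch_queue: deque, idle_units: dict) -> list:
    """Issue queued instructions, in order, to idle function units."""
    issued = []
    while fetch_queue:
        opcode = fetch_queue[0].split()[0]   # partial decode: opcode only
        unit = UNIT_FOR_OPCODE[opcode]
        if idle_units.get(unit, 0) == 0:     # no idle unit of that type
            break                            # in-order issue: stop here
        idle_units[unit] -= 1
        issued.append((fetch_queue.popleft(), unit))
    return issued


queue = deque(["add r1, r2, r3", "fadd f1, f2, f3",
               "sub r4, r5, r6", "br r7"])
idle = {"IU": 1, "FPU": 1, "BPU": 1}
issued = dispatch(queue, idle)
```

With one idle unit of each type, the add and fadd issue together, and the sub waits because the only integer unit is taken. A real superscalar design would also check operand dependences before issuing.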

64
VLIW Architectures
  • 64-128 bit instruction word
  • Each word contains fields to control the routing
    of data to multiple register files and execution
    units.
  • More info at
  • http://www.research.ibm.com/vliw/

65
Microprogramming
  • Alternate approach to control unit design.
  • SRC: hardwired approach to control unit design.
  • SRC-MP: control signals are stored as control
    words in a microcode memory.
  • MP is transparent with respect to
  • the program
  • the rest of the architecture.
  • The control signals emanating from the microcode
    control unit to the data path will remain
    unchanged.
  • Micro-programmed architectures were popular in
    the 1960s-1980s
  • Hardwired architectures were popular in the 1990s.

66
Microprogramming-Basic Idea
  • Recall control sequence for 1-bus SRC

Step  Concrete RTN            Control Sequence
T0    MA ← PC: C ← PC + 4     PCout, MAin, INC4, Cin, Read
T1    MD ← M[MA]: PC ← C      Cout, PCin, Wait
T2    IR ← MD                 MDout, IRin
T3    A ← R[rb]               Grb, Rout, Ain
T4    C ← A + R[rc]           Grc, Rout, ADD, Cin
T5    R[ra] ← C               Cout, Gra, Rin, End
  • The control unit's job is to generate the
    sequence of control signals
  • How about building a special computer to do this?

67
Microprogramming Concept
  • The Microcode Engine
  • The microcode control unit is itself a small
    stored program computer.
  • Micro PC ↔ PC
  • Microprogram memory ↔ Memory
  • Microinstruction word ↔ Instruction Word
  • A computer to generate control signals is much
    simpler than an ordinary computer
  • At the simplest, it just reads the control
    signals in order from a read-only memory
  • The memory is called the control store
  • A control store word, or microinstruction,
    contains a bit pattern telling which control
    signals are true in a specific step
  • The major issue is determining the order in which
    microinstructions are read
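
The simplest case, reading control signals in order from a read-only memory, can be sketched in Python. The control words are the six steps of the 1-bus SRC control sequence recalled on the previous slide; representing each word as a set of signal names and stopping on `End` are modeling assumptions, and `run_microprogram` is a hypothetical name.

```python
# Control store: one word per step, each word listing the signals
# that are true in that step.
CONTROL_STORE = [
    {"PCout", "MAin", "INC4", "Cin", "Read"},   # T0
    {"Cout", "PCin", "Wait"},                   # T1
    {"MDout", "IRin"},                          # T2
    {"Grb", "Rout", "Ain"},                     # T3
    {"Grc", "Rout", "ADD", "Cin"},              # T4
    {"Cout", "Gra", "Rin", "End"},              # T5
]


def run_microprogram(store, start=0):
    """Step the micro PC through the store until a word asserts End."""
    upc = start
    trace = []
    while True:
        if upc >= len(store):
            raise IndexError("micro PC ran past the end of the control store")
        word = store[upc]
        trace.append((upc, sorted(word)))   # signals asserted in this step
        if "End" in word:
            return trace
        upc += 1                            # purely sequential: no branching


trace = run_microprogram(CONTROL_STORE)
```

This model only increments the micro PC. Sharing one fetch sequence among many instructions requires the branching mechanisms described on the following slides.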

68
Block Diagram of Microcoded Control Unit
  • Microinstruction has
  • branch control,
  • branch address, and
  • control signal fields
  • Microprogram counter can be set from several
    sources to do the required sequencing

69
Parts of the Microprogrammed Control Unit
  • Since the control signals are just read from
    memory, the main function of the μCU is sequencing
  • This is reflected in the several ways the μPC can
    be loaded
  • Output of incrementer: μPC + 1
  • PLA output: start address for a macroinstruction
  • Branch address from μinstruction
  • External source: say, for exception or reset
  • Micro conditional branches can depend on
    condition codes, data path state, external
    signals, etc.

70
Contents of a Microinstruction

[Figure: microinstruction format, with fields for
control signals (e.g., PCout, MAin, PCin, Cout, Ain,
End), branch control, and branch address]
  • Main component is a list of 1/0 control signal
    values
  • There is a branch address in the control store
  • There are branch control bits to determine when
    to use the branch address and when to use μPC + 1

71
The Control Store
  • Faster than main memory
  • 70 or more bits wide
  • 2-4 K control words
  • B = (k + c + n) × 2^n: capacity of control store in bits.
  • Common instruction fetch sequence
  • Separate sequences for each (macro) instruction
  • Wide words
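
A quick Python check of the capacity figure, reading the formula as B = (k + c + n) × 2^n. The interpretation of the symbols is an assumption not stated on the slide: k branch control bits, c control signal bits, and n branch address bits, giving 2^n words. The example values are hypothetical, chosen to give a 70-bit word and 2 K words.

```python
def control_store_bits(k: int, c: int, n: int) -> int:
    """Capacity in bits: word width (k + c + n) times 2**n words."""
    return (k + c + n) * 2 ** n


# Hypothetical sizes: 7 branch control bits, 52 control signals,
# 11 address bits (2048 words), so each word is 70 bits wide.
word_width = 7 + 52 + 11
capacity = control_store_bits(k=7, c=52, n=11)
```

Under these assumed sizes the store holds 143,360 bits, which shows why wide words make the control store costly even with only a few thousand entries.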

72
Control Signals for the add Instruction
  • Addresses 100-102 are the instruction fetch
  • Addresses 200-202 do the add
  • Change of μcontrol from 102 to 200 uses a kind of
    μbranch

73
Uses for μbranching in the Microprogrammed
Control Unit
  • (1) Branch to start of μcode for a specific inst.
  • (2) Conditional control signals, e.g. CON → PCin
  • (3) Looping on conditions, e.g. n ≠ 0 → ... Goto6
  • These constructs can be implemented by
    conditional branches specified in the μcode word
    instead of using AND gates to control conditional
    branches
  • Conditions will control μbranches instead of
    being AND-ed with control signals
  • Microbranches are frequent and control store
    addresses are short, so it is reasonable to have
    a μbranch address field in every μinstruction

74
Illustration of μbranching Control Logic
  • We illustrate a μbranching control scheme by a
    machine having condition code bits N and Z
  • Branch control has 2 parts
  • (1) selecting the input applied to the μPC and
  • (2) specifying whether this input or μPC + 1 is
    used
  • 4 possible inputs to μPC are allowed
  • The incremented value μPC + 1
  • The PLA lookup table for the start of a
    macroinstruction
  • An externally supplied address
  • The branch address field in the μinstruction word

75
Branching Controls in the Microcoded Control Unit
  • 5 branch conditions
  • NotN
  • N
  • NotZ
  • Z
  • Unconditional
  • To 1 of 4 places
  • Next μinstruction
  • PLA
  • External address
  • Branch address
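
The choice of the next μPC value can be sketched in Python from the five conditions and four sources listed above. The `Source` enum, the condition strings, and the function and parameter names are hypothetical; the actual bit encoding of the branch control field is not modeled.

```python
from enum import Enum, auto


class Source(Enum):
    NEXT = auto()          # incremented micro PC
    PLA = auto()           # start address from the PLA lookup
    EXTERNAL = auto()      # externally supplied address
    BRANCH_ADDR = auto()   # branch address field of the microinstruction


def condition_met(cond: str, n: bool, z: bool) -> bool:
    """Evaluate one of the five branch conditions against flags N and Z."""
    return {"NotN": not n, "N": n, "NotZ": not z, "Z": z,
            "Unconditional": True}[cond]


def next_upc(upc, source, cond, n, z, pla=0, external=0, branch_addr=0):
    """Return the next micro PC: the selected source if the condition
    holds, otherwise the incremented micro PC."""
    if source is Source.NEXT or not condition_met(cond, n, z):
        return upc + 1
    return {Source.PLA: pla,
            Source.EXTERNAL: external,
            Source.BRANCH_ADDR: branch_addr}[source]


taken = next_upc(203, Source.BRANCH_ADDR, "N", n=True, z=False, branch_addr=300)
not_taken = next_upc(203, Source.BRANCH_ADDR, "N", n=False, z=False, branch_addr=300)
```

A microinstruction at address 203 that branches on N to 300 goes to 300 when N is set and falls through to 204 otherwise.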

76
μbranch Examples

Address  Branch    Control  Branch   Branching action
         control   signals  address
200      00 00000  ...      XXX      None: next instruction (201)
201      01 10000  ...      XXX      Branch to output of PLA
202      10 00100  ...      XXX      Br if Z to Extern. Addr.
203      11 00001  ...      300      Br if N to 300 (else next 204)
204      11 00010  0 ... 0  206      Br if not N to 206 (else next 205)