Computer Architecture
Chapter 3: Instruction-Level Parallelism I
Prof. Jerry Breecher
CSCI 240
Fall 2003
Chapter Overview
- 3.1 Instruction-Level Parallelism: Concepts and Challenges
- 3.2 Overcoming Data Hazards with Dynamic Scheduling
- 3.3 Dynamic Scheduling: Examples and the Algorithm
- 3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
- 3.5 High-Performance Instruction Delivery
- 3.6 Taking Advantage of More ILP with Multiple Issue
- 3.7 Hardware-Based Speculation
- 3.8 Studies of the Limitations of ILP
- 3.10 The Pentium 4
Ideas To Reduce Stalls
(A table of stall-reducing techniques, split into Chapter 3 and Chapter 4 columns, did not survive the text export.)
Instruction-Level Parallelism
- ILP is the observation that many instructions in code do not depend on each other. That means it is possible to execute those instructions in parallel.
- This is easier said than done.
- Issues include:
  - building compilers to analyze the code, and
  - building hardware to be even smarter than that code.
- This section looks at some of the problems to be solved.
Terminology
Instruction Level Parallelism
- Basic Block: the set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.
- Loop-Level Parallelism: parallelism that exists within a loop. Such parallelism can cross loop iterations.
- Loop Unrolling: either the compiler or the hardware exploits the parallelism inherent in the loop.
Terminology
- Basic-block (BB) ILP is quite small.
  - A BB is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
  - Average dynamic branch frequency is 15% to 25%, so only 4 to 7 instructions execute between a pair of branches.
  - Plus, instructions in a BB are likely to depend on each other.
- To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
- The simplest to exploit is loop-level parallelism: parallelism among iterations of a loop.
  - Vector processing is one way.
  - If not vector, then either dynamically via branch prediction or statically via loop unrolling by the compiler.
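Loop unrolling is easiest to see in code. Below is a minimal sketch in Python (the language choice is ours, not the slides'): the unrolled version keeps four independent accumulators, so the four adds in each iteration carry no dependence on one another and could issue in parallel on a machine with multiple adders.

```python
def sum_rolled(a):
    # Straightforward loop: one add per iteration, each dependent
    # on the previous value of `total` (a loop-carried dependence).
    total = 0.0
    for x in a:
        total += x
    return total

def sum_unrolled_by_4(a):
    # Unrolled by 4 with four independent accumulators: the four adds
    # in each iteration have no dependence on one another.
    t0 = t1 = t2 = t3 = 0.0
    i, n = 0, len(a)
    while i + 4 <= n:
        t0 += a[i]
        t1 += a[i + 1]
        t2 += a[i + 2]
        t3 += a[i + 3]
        i += 4
    total = t0 + t1 + t2 + t3
    while i < n:            # clean-up loop for leftover elements
        total += a[i]
        i += 1
    return total
```

A real compiler does this on the machine-code level, but the dependence structure is the same: the rolled loop serializes on one accumulator, while the unrolled loop exposes four parallel chains.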
Instruction Level Parallelism
Data Dependence and Hazards
- InstrJ is data dependent on InstrI if InstrJ tries to read an operand before InstrI writes it, or if InstrJ is data dependent on InstrK, which is dependent on InstrI.
- Caused by a True Dependence (compiler term).
- If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.

  I: add r1,r2,r3
  J: sub r4,r1,r3
Data Dependence and Hazards
- Dependences are a property of programs.
- The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
- Importance of data dependences:
  - 1) indicates the possibility of a hazard
  - 2) determines the order in which results must be calculated
  - 3) sets an upper bound on how much parallelism can possibly be exploited
- Today we look at HW schemes to avoid hazards.
Name Dependence 1: Anti-dependence
- A name dependence occurs when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
- InstrJ writes an operand before InstrI reads it. This is called an anti-dependence by compiler writers; it results from reuse of the name r1.
- If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence 2: Output Dependence
- InstrJ writes an operand before InstrI writes it. This is called an output dependence by compiler writers; it also results from reuse of the name r1.
- If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
ILP and Data Hazards
- HW/SW must preserve program order: the order in which instructions would execute if executed sequentially, one at a time, as determined by the original source program.
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program.
- Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict.
- Register renaming resolves name dependences for registers, either by the compiler or by HW.
Control Dependencies
- Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order.

  if p1 {
    S1
  }
  if p2 {
    S2
  }

- S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependence Ignored
- Control dependence need not be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
- Instead, 2 properties critical to program correctness are exception behavior and data flow.
Exception Behavior
- Preserving exception behavior means any changes in instruction execution order must not change how exceptions are raised in the program (i.e., no new exceptions).
- Example:

  DADDU R2,R3,R4
  BEQZ  R2,L1
  LW    R1,0(R2)
  L1:

- Problem with moving LW before BEQZ? If the branch is taken, the original program never executes the LW; hoisting the load above the branch could raise a memory exception the program would never have raised.
Data Flow
- Data flow: the actual flow of data values among instructions that produce results and those that consume them.
- Branches make the flow dynamic; they determine which instruction is the supplier of data.
- Example:

  DADDU R1,R2,R3
  BEQZ  R4,L
  DSUBU R1,R5,R6
  L:
  OR    R7,R1,R8

- Does OR depend on DADDU or DSUBU? It depends on the branch outcome, so we must preserve data flow during execution.
Dynamic Scheduling
Advantages of Dynamic Scheduling
- Handles cases where dependences are unknown at compile time (e.g., because they may involve a memory reference).
- It simplifies the compiler.
- Allows code that was compiled for one pipeline to run efficiently on a different pipeline.
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling.
Dynamic Scheduling
Logistics
- Sections 3.2 and 3.3 of the text use, as an example of dynamic scheduling, an algorithm due to Tomasulo.
- We instead use a scoreboarding technique, which is discussed in Appendix A.8.
Dynamic Scheduling
The Idea
HW Schemes: Instruction Parallelism
- Why is this in hardware at run time?
  - It works when we can't know the real dependences at compile time.
  - The compiler is simpler.
  - Code for one machine runs well on another.
- Key Idea: Allow instructions behind a stall to proceed.
- Key Idea: Instructions execute in parallel. There are multiple execution units, so use them.

  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14

- Enables out-of-order execution, and hence out-of-order completion.
- Even though ADDD stalls, the SUBD has no dependencies and can run.
Dynamic Scheduling
The Idea (continued)
- Out-of-order execution divides the ID stage:
  - 1. Issue: decode instructions, check for structural hazards.
  - 2. Read operands: wait until no data hazards, then read operands.
- Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.
- A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
- We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
- First used in the CDC 6600. Our example is modified here for MIPS.
  - The CDC had 4 FP units, 5 memory-reference units, 7 integer units.
  - MIPS has 2 FP multipliers, 1 FP adder, 1 FP divider, 1 integer unit.
Scoreboard Implications
Dynamic Scheduling
Using A Scoreboard
- Out-of-order completion introduces the possibility of WAR and WAW hazards.
- Solutions for WAR:
  - Queue both the operation and copies of its operands.
  - Read registers only during the Read Operands stage.
- For WAW, must detect the hazard and stall until the other instruction completes.
- Need to have multiple instructions in the execution phase, so we need multiple execution units or pipelined execution units.
- The scoreboard keeps track of dependencies and the state of operations.
- The scoreboard replaces ID, EX, WB with 4 stages.
Four Stages of Scoreboard Control
- 1. Issue: decode instructions and check for structural hazards (ID1).
  - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
  - If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
Four Stages of Scoreboard Control
- 2. Read operands: wait until no data hazards, then read operands (ID2).
  - A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
  - When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Four Stages of Scoreboard Control
- 3. Execution: operate on operands (EX).
  - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
- 4. Write result: finish execution (WB).
  - Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes the result. If there is a WAR hazard, it stalls the instruction.
- Example:

  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F8,F8,F14

- The scoreboard would stall SUBD until ADDD reads its operands.
Three Parts of the Scoreboard
- 1. Instruction status: which of the 4 steps the instruction is in.
- 2. Functional unit status: indicates the state of the functional unit (FU). There are 9 fields for each functional unit:
  - Busy: indicates whether the unit is busy or not.
  - Op: operation to perform in the unit (e.g., + or -).
  - Fi: destination register.
  - Fj, Fk: source-register numbers.
  - Qj, Qk: functional units producing source registers Fj, Fk.
  - Rj, Rk: flags indicating when Fj, Fk are ready.
- 3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Detailed Scoreboard Pipeline Control
- Issue. Wait until: not Busy(FU) and not Result(D). Bookkeeping: Busy(FU)←yes; Op(FU)←op; Fi(FU)←D; Fj(FU)←S1; Fk(FU)←S2; Qj←Result(S1); Qk←Result(S2); Rj←not Qj; Rk←not Qk; Result(D)←FU.
- Read operands. Wait until: Rj and Rk. Bookkeeping: Rj←No; Rk←No.
- Execution complete. Wait until: functional unit done. Bookkeeping: none.
- Write result. Wait until: ∀f ((Fj(f)≠Fi(FU) or Rj(f)=No) and (Fk(f)≠Fi(FU) or Rk(f)=No)). Bookkeeping: ∀f (if Qj(f)=FU then Rj(f)←Yes); ∀f (if Qk(f)=FU then Rk(f)←Yes); Result(Fi(FU))←0; Busy(FU)←No.
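The wait conditions above translate almost mechanically into code. Below is a sketch of the three hazard checks in Python (the dict-per-functional-unit encoding is our own; this is not the full bookkeeping, just the tests). The Rj/Rk convention follows the scoreboard: True means "operand ready but not yet read".

```python
def can_issue(fu, register_result, dest):
    # Issue: FU not busy, and no active instruction already has this
    # destination register (the WAW check via Result(D)).
    return not fu["Busy"] and dest not in register_result

def can_read_operands(fu):
    # Read operands: both sources ready (RAW hazards resolved).
    return fu["Rj"] and fu["Rk"]

def can_write_result(fu, all_fus):
    # Write result: stall while any other busy unit still needs to read
    # the old value of our destination register (the WAR check).
    d = fu["Fi"]
    for other in all_fus:
        if other is fu or not other["Busy"]:
            continue
        if (other["Fj"] == d and other["Rj"]) or (other["Fk"] == d and other["Rk"]):
            return False
    return True

# The cycle-17 situation from the example that follows: ADDD wants to
# write F6, but DIVD has F6 ready yet unread, so ADDD must wait (WAR).
addd = {"Busy": True, "Fi": "F6",  "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False}
divd = {"Busy": True, "Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True}
print(can_write_result(addd, [addd, divd]))  # False: ADDD stalls
```

Once DIVD reads its operands (its Rk drops to False), the same check lets ADDD write, which is exactly what the cycle-by-cycle walkthrough shows.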
Dynamic Scheduling Examples
- In this section we look at an example of how dynamic scheduling actually works. It's all about accounting!
Scoreboard Example
This is the sample code we'll be working with in the example:

  LD   F6, 34(R2)
  LD   F2, 45(R3)
  MULT F0, F2, F4
  SUBD F8, F6, F2
  DIVD F10, F0, F6
  ADDD F6, F8, F2

What are the hazards in this code?

Latencies (clock cycles): LD 1, MULT 10, SUBD 2, DIVD 40, ADDD 2.
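One way to answer the "what are the hazards?" question is to enumerate register dependences by brute force. The sketch below uses our own (op, dest, sources) encoding of the six instructions and lists every later instruction that reads, anti-writes, or re-writes a register:

```python
# Brute-force dependence listing for the sample code above.
# The (op, dest, sources) encoding is ours; the loads' base registers
# R2/R3 are included as sources.
code = [
    ("LD1",  "F6",  ["R2"]),
    ("LD2",  "F2",  ["R3"]),
    ("MULT", "F0",  ["F2", "F4"]),
    ("SUBD", "F8",  ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6",  ["F8", "F2"]),
]

hazards = []
for i, (op_i, d_i, s_i) in enumerate(code):
    for op_j, d_j, s_j in code[i + 1:]:
        if d_i in s_j:
            hazards.append(("RAW", d_i, op_i, op_j))  # j reads i's result
        if d_j in s_i:
            hazards.append(("WAR", d_j, op_i, op_j))  # j overwrites i's source
        if d_j == d_i:
            hazards.append(("WAW", d_i, op_i, op_j))  # j overwrites i's dest

for h in hazards:
    print(h)
```

This finds seven RAW dependences (F2 feeds MULT, SUBD, ADDD; F6 feeds SUBD, DIVD; F0 feeds DIVD; F8 feeds ADDD), two WAR dependences on F6 (ADDD overwrites a source of both SUBD and DIVD), and one WAW on F6 (ADDD re-writes LD1's destination).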
Scoreboard Example: Initial State
(The per-cycle scoreboard tables on this and the following slides are figures that did not survive the text export; only the captions remain.)
Scoreboard Example: Cycle 1
Issue LD 1. The table shows in which cycle each operation occurred.
Scoreboard Example: Cycle 2
LD 2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.
Scoreboard Example: Cycle 3

Scoreboard Example: Cycle 4
Scoreboard Example: Cycle 5
Issue LD 2, since the integer unit is now free.
Scoreboard Example: Cycle 6
Issue MULT.
Scoreboard Example: Cycle 7
MULT can't read its operands (F2) because LD 2 hasn't finished.
Scoreboard Example: Cycle 8a
DIVD issues. MULT and SUBD are both waiting for F2.
Scoreboard Example: Cycle 8b
LD 2 writes F2.
Scoreboard Example: Cycle 9
Now MULT and SUBD can both read F2. How can both instructions do this at the same time?
Scoreboard Example: Cycle 11
ADDD can't start because the add unit is busy.
Scoreboard Example: Cycle 12
SUBD finishes. DIVD is waiting for F0.
Scoreboard Example: Cycle 13
ADDD issues.
Scoreboard Example: Cycle 14

Scoreboard Example: Cycle 15

Scoreboard Example: Cycle 16
Scoreboard Example: Cycle 17
ADDD can't write because of DIVD, which has not yet read F6. WAR!
Scoreboard Example: Cycle 18
Nothing happens!
Scoreboard Example: Cycle 19
MULT completes execution.
Scoreboard Example: Cycle 20
MULT writes.
Scoreboard Example: Cycle 21
DIVD reads its operands.
Scoreboard Example: Cycle 22
Now ADDD can write, since the WAR hazard is removed.
Scoreboard Example: Cycle 61
DIVD completes execution.
Scoreboard Example: Cycle 62
DONE!
Another Dynamic Algorithm: The Tomasulo Algorithm
Dynamic Scheduling
Tomasulo Algorithm
- Designed for the IBM 360/91, about 3 years after the CDC 6600 (1966).
- Goal: high performance without special compilers.
- Differences between the IBM 360 and CDC 6600 ISAs:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600.
  - IBM has 4 FP registers vs. 8 in the CDC 6600.
- Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
- This is the example that the text uses in Sections 3.2 and 3.3.
Tomasulo Algorithm vs. Scoreboard
- Control and buffers are distributed with the functional units (FUs) vs. centralized in the scoreboard.
  - FU buffers, called reservation stations, hold pending operands.
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming.
  - It avoids WAR and WAW hazards.
  - There are more reservation stations than registers, so it can do optimizations compilers can't.
- Results go from the RSs to the FUs, not through registers, over a Common Data Bus that broadcasts results to all FUs.
- Loads and stores are treated as FUs with RSs as well.
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.
Tomasulo Organization
(Block diagram, not reproduced in this export. Roughly: an FP op queue and the FP registers feed load buffers (Load1 through Load6), store buffers, and reservation stations (Add1 through Add3 in front of the FP adders, Mult1 and Mult2 in front of the FP multipliers); memory connects through the load and store buffers, and the Common Data Bus (CDB) broadcasts results to all units.)
Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -).
- Vj, Vk: values of the source operands.
  - Store buffers have a V field: the result to be stored.
- Qj, Qk: reservation stations producing the source registers (the value to be written).
  - Note: no ready flags as in the scoreboard; Qj,Qk = 0 means ready.
  - Store buffers have only a Qi field, for the RS producing the result.
- Busy: indicates the reservation station or FU is busy.
- Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Three Stages of the Tomasulo Algorithm
- 1. Issue: get an instruction from the FP op queue.
  - If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming registers).
- 2. Execution: operate on operands (EX).
  - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
- 3. Write result: finish execution (WB).
  - Write the result on the Common Data Bus to all awaiting units and mark the reservation station available.
- A normal data bus carries data and a destination (a "go to" bus). The Common Data Bus carries data and a source (a "come from" bus):
  - 64 bits of data plus 4 bits of functional-unit source address.
  - A unit captures the value if the source matches the functional unit it is waiting on (the one producing the result).
  - The CDB does the broadcast.
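The renaming-at-issue and CDB broadcast steps can be sketched in a few lines of Python. This is our own minimal model (a reservation station is a dict, and the register status table maps a register to the name of the RS that will produce it), not the actual 360/91 logic:

```python
def issue(op, src1, src2, rs, reg_value, reg_status):
    """Issue into reservation station `rs`, renaming the sources:
    copy the value if the register is ready, otherwise record the
    name of the reservation station that will produce it (Qj/Qk)."""
    rs["busy"], rs["op"] = True, op
    rs["Qj"] = reg_status.get(src1)          # None means value available now
    rs["Vj"] = None if rs["Qj"] else reg_value[src1]
    rs["Qk"] = reg_status.get(src2)
    rs["Vk"] = None if rs["Qk"] else reg_value[src2]

def broadcast(producer, value, stations):
    """Common Data Bus: every waiting station snoops the bus and captures
    the value tagged with the producing station's name."""
    for rs in stations:
        if rs.get("Qj") == producer:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == producer:
            rs["Vk"], rs["Qk"] = value, None

# Hypothetical sequence: DIVD F0,F2,F4 is executing in Mult1, so F0 is
# not ready when ADDD F10,F0,F8 issues into Add1.
reg_value = {"F2": 6.0, "F4": 2.0, "F8": 1.0}
reg_status = {"F0": "Mult1"}                 # Mult1 will produce F0
add1 = {}
issue("ADDD", "F0", "F8", add1, reg_value, reg_status)
print(add1["Qj"])                            # Mult1: waiting on the CDB
broadcast("Mult1", 3.0, [add1])
print(add1["Vj"], add1["Qj"])                # 3.0 None: operand captured
```

Because the consumer waits on a station name rather than a register name, a later write to F0 cannot create a WAR or WAW conflict with this instruction: the register has effectively been renamed to "Mult1".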
Tomasulo Example: Cycle 0
(The Tomasulo tracking-table figure did not survive the text export.)
Review: Tomasulo
- Prevents the register file from becoming a bottleneck.
- Avoids the WAR and WAW hazards of the scoreboard.
- Allows loop unrolling in HW.
- Not limited to basic blocks (provided there is branch prediction).
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- The 360/91's descendants include the PowerPC 604 and 620, MIPS R10000, HP-PA 8000, and Intel Pentium Pro.
Summary
- 3.1 Instruction-Level Parallelism: Concepts and Challenges
- 3.2 Overcoming Data Hazards with Dynamic Scheduling
- 3.3 Dynamic Scheduling: Examples and the Algorithm
- 3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
- 3.5 High-Performance Instruction Delivery
- 3.6 Taking Advantage of More ILP with Multiple Issue
- 3.7 Hardware-Based Speculation
- 3.8 Studies of the Limitations of ILP
- 3.10 The Pentium 4