Computer Architecture
Chapter 3: Instruction-Level Parallelism I
Prof. Jerry Breecher
CSCI 240
Fall 2003
Chapter Overview
- 3.1 Instruction-Level Parallelism: Concepts and Challenges
- 3.2 Overcoming Data Hazards with Dynamic Scheduling
- 3.3 Dynamic Scheduling: Examples and the Algorithm
- 3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
- 3.5 High-Performance Instruction Delivery
- 3.6 Taking Advantage of More ILP with Multiple Issue
- 3.7 Hardware-Based Speculation
- 3.8 Studies of the Limitations of ILP
- 3.10 The Pentium 4
Ideas To Reduce Stalls
(A table of stall-reducing techniques, split into Chapter 3 and Chapter 4 columns, did not survive the text export.)
Instruction-Level Parallelism
- ILP is the observation that many instructions in code do not depend on each other. That means it is possible to execute those instructions in parallel.
- This is easier said than done.
- Issues include:
  - building compilers to analyze the code, and
  - building hardware to be even smarter than that code.
- This section looks at some of the problems to be solved.
Terminology
Instruction Level Parallelism
- Basic Block: the set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.
- Loop-Level Parallelism: parallelism that exists within a loop. Such parallelism can cross loop iterations.
- Loop Unrolling: either the compiler or the hardware exploits the parallelism inherent in the loop.
Terminology
- Basic-block (BB) ILP is quite small.
  - A BB is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit.
  - Average dynamic branch frequency is 15% to 25%, so only 4 to 7 instructions execute between a pair of branches.
  - Plus, instructions in a BB are likely to depend on each other.
- To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks.
- The simplest to exploit is loop-level parallelism: parallelism among iterations of a loop.
  - Vector processing is one way.
  - If not vector, then either dynamically via branch prediction or statically via loop unrolling by the compiler.
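Loop unrolling is easiest to see in code. Below is a minimal sketch in Python (the language choice is ours, not the slides'): the unrolled version keeps four independent accumulators, so the four adds in each iteration carry no dependence on one another and could issue in parallel on a machine with multiple adders.

```python
def sum_rolled(a):
    # Straightforward loop: one add per iteration, each dependent
    # on the previous value of `total` (a loop-carried dependence).
    total = 0.0
    for x in a:
        total += x
    return total

def sum_unrolled_by_4(a):
    # Unrolled by 4 with four independent accumulators: the four adds
    # in each iteration have no dependence on one another.
    t0 = t1 = t2 = t3 = 0.0
    i, n = 0, len(a)
    while i + 4 <= n:
        t0 += a[i]
        t1 += a[i + 1]
        t2 += a[i + 2]
        t3 += a[i + 3]
        i += 4
    total = t0 + t1 + t2 + t3
    while i < n:            # clean-up loop for leftover elements
        total += a[i]
        i += 1
    return total
```

A real compiler does this on the machine-code level, but the dependence structure is the same: the rolled loop serializes on one accumulator, while the unrolled loop exposes four parallel chains.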
Instruction Level Parallelism
Data Dependence and Hazards
- InstrJ is data dependent on InstrI if InstrJ tries to read an operand before InstrI writes it, or if InstrJ is data dependent on InstrK, which is dependent on InstrI.
- Caused by a True Dependence (compiler term).
- If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard.

  I: add r1,r2,r3
  J: sub r4,r1,r3
Data Dependence and Hazards
- Dependences are a property of programs.
- The presence of a dependence indicates the potential for a hazard, but the actual hazard and the length of any stall are properties of the pipeline.
- Importance of data dependences:
  - 1) indicates the possibility of a hazard
  - 2) determines the order in which results must be calculated
  - 3) sets an upper bound on how much parallelism can possibly be exploited
- Today we look at HW schemes to avoid hazards.
Name Dependence 1: Anti-dependence
- A name dependence occurs when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
- InstrJ writes an operand before InstrI reads it. This is called an anti-dependence by compiler writers; it results from reuse of the name r1.
- If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
Name Dependence 2: Output Dependence
- InstrJ writes an operand before InstrI writes it. This is called an output dependence by compiler writers; it also results from reuse of the name r1.
- If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
ILP and Data Hazards
- HW/SW must preserve program order: the order in which instructions would execute if executed sequentially, one at a time, as determined by the original source program.
- HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program.
- Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict.
- Register renaming resolves name dependences for registers, either by the compiler or by HW.
Control Dependencies
- Every instruction is control dependent on some set of branches, and, in general, these control dependencies must be preserved to preserve program order.

  if p1 {
    S1
  }
  if p2 {
    S2
  }

- S1 is control dependent on p1, and S2 is control dependent on p2 but not on p1.
Control Dependence Ignored
- Control dependence need not be preserved: we are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program.
- Instead, 2 properties critical to program correctness are exception behavior and data flow.
Exception Behavior
- Preserving exception behavior means any changes in instruction execution order must not change how exceptions are raised in the program (i.e., no new exceptions).
- Example:

  DADDU R2,R3,R4
  BEQZ  R2,L1
  LW    R1,0(R2)
  L1:

- Problem with moving LW before BEQZ? If the branch is taken, the original program never executes the LW; hoisting the load above the branch could raise a memory exception the program would never have raised.
Data Flow
- Data flow: the actual flow of data values among instructions that produce results and those that consume them.
- Branches make the flow dynamic; they determine which instruction is the supplier of data.
- Example:

  DADDU R1,R2,R3
  BEQZ  R4,L
  DSUBU R1,R5,R6
  L:
  OR    R7,R1,R8

- Does OR depend on DADDU or DSUBU? It depends on the branch outcome, so we must preserve data flow during execution.
Dynamic Scheduling
Advantages of Dynamic Scheduling
- Handles cases where dependences are unknown at compile time (e.g., because they may involve a memory reference).
- It simplifies the compiler.
- Allows code that was compiled for one pipeline to run efficiently on a different pipeline.
- Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling.
Dynamic Scheduling
Logistics
- Sections 3.2 and 3.3 of the text use, as an example of dynamic scheduling, an algorithm due to Tomasulo.
- We instead use a scoreboarding technique, which is discussed in Appendix A.8.
Dynamic Scheduling
The Idea
HW Schemes: Instruction Parallelism
- Why is this in hardware at run time?
  - It works when we can't know the real dependences at compile time.
  - The compiler is simpler.
  - Code for one machine runs well on another.
- Key Idea: Allow instructions behind a stall to proceed.
- Key Idea: Instructions execute in parallel. There are multiple execution units, so use them.

  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F12,F8,F14

- Enables out-of-order execution, and hence out-of-order completion.
- Even though ADDD stalls, the SUBD has no dependencies and can run.
Dynamic Scheduling
The Idea (continued)
- Out-of-order execution divides the ID stage:
  - 1. Issue: decode instructions, check for structural hazards.
  - 2. Read operands: wait until no data hazards, then read operands.
- Scoreboards allow an instruction to execute whenever 1 & 2 hold, not waiting for prior instructions.
- A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
- We will use in-order issue, out-of-order execution, out-of-order commit (also called completion).
- First used in the CDC 6600. Our example is modified here for MIPS.
  - The CDC had 4 FP units, 5 memory-reference units, 7 integer units.
  - MIPS has 2 FP multipliers, 1 FP adder, 1 FP divider, 1 integer unit.
Scoreboard Implications
Dynamic Scheduling
Using A Scoreboard
- Out-of-order completion introduces the possibility of WAR and WAW hazards.
- Solutions for WAR:
  - Queue both the operation and copies of its operands.
  - Read registers only during the Read Operands stage.
- For WAW, must detect the hazard and stall until the other instruction completes.
- Need to have multiple instructions in the execution phase, so we need multiple execution units or pipelined execution units.
- The scoreboard keeps track of dependencies and the state of operations.
- The scoreboard replaces ID, EX, WB with 4 stages.
Four Stages of Scoreboard Control
- 1. Issue: decode instructions and check for structural hazards (ID1).
  - If a functional unit for the instruction is free and no other active instruction has the same destination register (WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
  - If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
Four Stages of Scoreboard Control
- 2. Read operands: wait until no data hazards, then read operands (ID2).
  - A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit.
  - When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution. The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.
Four Stages of Scoreboard Control
- 3. Execution: operate on operands (EX).
  - The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
- 4. Write result: finish execution (WB).
  - Once the scoreboard is aware that the functional unit has completed execution, the scoreboard checks for WAR hazards. If none, it writes the result. If there is a WAR hazard, it stalls the instruction.
- Example:

  DIVD F0,F2,F4
  ADDD F10,F0,F8
  SUBD F8,F8,F14

- The scoreboard would stall SUBD until ADDD reads its operands.
Three Parts of the Scoreboard
- 1. Instruction status: which of the 4 steps the instruction is in.
- 2. Functional unit status: indicates the state of the functional unit (FU). There are 9 fields for each functional unit:
  - Busy: indicates whether the unit is busy or not.
  - Op: operation to perform in the unit (e.g., + or -).
  - Fi: destination register.
  - Fj, Fk: source-register numbers.
  - Qj, Qk: functional units producing source registers Fj, Fk.
  - Rj, Rk: flags indicating when Fj, Fk are ready.
- 3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Detailed Scoreboard Pipeline Control
- Issue. Wait until: not Busy(FU) and not Result(D). Bookkeeping: Busy(FU)←yes; Op(FU)←op; Fi(FU)←D; Fj(FU)←S1; Fk(FU)←S2; Qj←Result(S1); Qk←Result(S2); Rj←not Qj; Rk←not Qk; Result(D)←FU.
- Read operands. Wait until: Rj and Rk. Bookkeeping: Rj←No; Rk←No.
- Execution complete. Wait until: functional unit done. Bookkeeping: none.
- Write result. Wait until: ∀f ((Fj(f)≠Fi(FU) or Rj(f)=No) and (Fk(f)≠Fi(FU) or Rk(f)=No)). Bookkeeping: ∀f (if Qj(f)=FU then Rj(f)←Yes); ∀f (if Qk(f)=FU then Rk(f)←Yes); Result(Fi(FU))←0; Busy(FU)←No.
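The wait conditions above translate almost mechanically into code. Below is a sketch of the three hazard checks in Python (the dict-per-functional-unit encoding is our own; this is not the full bookkeeping, just the tests). The Rj/Rk convention follows the scoreboard: True means "operand ready but not yet read".

```python
def can_issue(fu, register_result, dest):
    # Issue: FU not busy, and no active instruction already has this
    # destination register (the WAW check via Result(D)).
    return not fu["Busy"] and dest not in register_result

def can_read_operands(fu):
    # Read operands: both sources ready (RAW hazards resolved).
    return fu["Rj"] and fu["Rk"]

def can_write_result(fu, all_fus):
    # Write result: stall while any other busy unit still needs to read
    # the old value of our destination register (the WAR check).
    d = fu["Fi"]
    for other in all_fus:
        if other is fu or not other["Busy"]:
            continue
        if (other["Fj"] == d and other["Rj"]) or (other["Fk"] == d and other["Rk"]):
            return False
    return True

# The cycle-17 situation from the example that follows: ADDD wants to
# write F6, but DIVD has F6 ready yet unread, so ADDD must wait (WAR).
addd = {"Busy": True, "Fi": "F6",  "Fj": "F8", "Fk": "F2", "Rj": False, "Rk": False}
divd = {"Busy": True, "Fi": "F10", "Fj": "F0", "Fk": "F6", "Rj": False, "Rk": True}
print(can_write_result(addd, [addd, divd]))  # False: ADDD stalls
```

Once DIVD reads its operands (its Rk drops to False), the same check lets ADDD write, which is exactly what the cycle-by-cycle walkthrough shows.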
Dynamic Scheduling Examples
- In this section we look at an example of how dynamic scheduling actually works. It's all about accounting!
Scoreboard Example
This is the sample code we'll be working with in the example:

  LD   F6, 34(R2)
  LD   F2, 45(R3)
  MULT F0, F2, F4
  SUBD F8, F6, F2
  DIVD F10, F0, F6
  ADDD F6, F8, F2

What are the hazards in this code?

Latencies (clock cycles): LD 1, MULT 10, SUBD 2, DIVD 40, ADDD 2.
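One way to answer the "what are the hazards?" question is to enumerate register dependences by brute force. The sketch below uses our own (op, dest, sources) encoding of the six instructions and lists every later instruction that reads, anti-writes, or re-writes a register:

```python
# Brute-force dependence listing for the sample code above.
# The (op, dest, sources) encoding is ours; the loads' base registers
# R2/R3 are included as sources.
code = [
    ("LD1",  "F6",  ["R2"]),
    ("LD2",  "F2",  ["R3"]),
    ("MULT", "F0",  ["F2", "F4"]),
    ("SUBD", "F8",  ["F6", "F2"]),
    ("DIVD", "F10", ["F0", "F6"]),
    ("ADDD", "F6",  ["F8", "F2"]),
]

hazards = []
for i, (op_i, d_i, s_i) in enumerate(code):
    for op_j, d_j, s_j in code[i + 1:]:
        if d_i in s_j:
            hazards.append(("RAW", d_i, op_i, op_j))  # j reads i's result
        if d_j in s_i:
            hazards.append(("WAR", d_j, op_i, op_j))  # j overwrites i's source
        if d_j == d_i:
            hazards.append(("WAW", d_i, op_i, op_j))  # j overwrites i's dest

for h in hazards:
    print(h)
```

This finds seven RAW dependences (F2 feeds MULT, SUBD, ADDD; F6 feeds SUBD, DIVD; F0 feeds DIVD; F8 feeds ADDD), two WAR dependences on F6 (ADDD overwrites a source of both SUBD and DIVD), and one WAW on F6 (ADDD re-writes LD1's destination).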
Scoreboard Example: Initial State
(The per-cycle scoreboard tables on this and the following slides are figures that did not survive the text export; only the captions remain.)
Scoreboard Example: Cycle 1
Issue LD 1. The table shows in which cycle each operation occurred.
Scoreboard Example: Cycle 2
LD 2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.
Scoreboard Example: Cycle 3

Scoreboard Example: Cycle 4
Scoreboard Example: Cycle 5
Issue LD 2, since the integer unit is now free.
Scoreboard Example: Cycle 6
Issue MULT.
Scoreboard Example: Cycle 7
MULT can't read its operands (F2) because LD 2 hasn't finished.
Scoreboard Example: Cycle 8a
DIVD issues. MULT and SUBD are both waiting for F2.
Scoreboard Example: Cycle 8b
LD 2 writes F2.
Scoreboard Example: Cycle 9
Now MULT and SUBD can both read F2. How can both instructions do this at the same time?
Scoreboard Example: Cycle 11
ADDD can't start because the add unit is busy.
Scoreboard Example: Cycle 12
SUBD finishes. DIVD is waiting for F0.
Scoreboard Example: Cycle 13
ADDD issues.
Scoreboard Example: Cycle 14

Scoreboard Example: Cycle 15

Scoreboard Example: Cycle 16
Scoreboard Example: Cycle 17
ADDD can't write because of DIVD, which has not yet read F6. WAR!
Scoreboard Example: Cycle 18
Nothing happens!
Scoreboard Example: Cycle 19
MULT completes execution.
Scoreboard Example: Cycle 20
MULT writes.
Scoreboard Example: Cycle 21
DIVD reads its operands.
Scoreboard Example: Cycle 22
Now ADDD can write, since the WAR hazard is removed.
Scoreboard Example: Cycle 61
DIVD completes execution.
Scoreboard Example: Cycle 62
DONE!
Another Dynamic Algorithm: The Tomasulo Algorithm
Dynamic Scheduling
Tomasulo Algorithm
- Designed for the IBM 360/91, about 3 years after the CDC 6600 (1966).
- Goal: high performance without special compilers.
- Differences between the IBM 360 and CDC 6600 ISAs:
  - IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600.
  - IBM has 4 FP registers vs. 8 in the CDC 6600.
- Why study it? It led to the Alpha 21264, HP 8000, MIPS 10000, Pentium II, PowerPC 604, ...
- This is the example that the text uses in Sections 3.2 and 3.3.
Tomasulo Algorithm vs. Scoreboard
- Control and buffers are distributed with the functional units (FUs) vs. centralized in the scoreboard.
  - FU buffers, called reservation stations, hold pending operands.
- Registers in instructions are replaced by values or by pointers to reservation stations (RS); this is called register renaming.
  - It avoids WAR and WAW hazards.
  - There are more reservation stations than registers, so it can do optimizations compilers can't.
- Results go from the RSs to the FUs, not through registers, over a Common Data Bus that broadcasts results to all FUs.
- Loads and stores are treated as FUs with RSs as well.
- Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue.
Tomasulo Organization
(Block diagram, not reproduced in this export. Roughly: an FP op queue and the FP registers feed load buffers (Load1 through Load6), store buffers, and reservation stations (Add1 through Add3 in front of the FP adders, Mult1 and Mult2 in front of the FP multipliers); memory connects through the load and store buffers, and the Common Data Bus (CDB) broadcasts results to all units.)
Reservation Station Components
- Op: operation to perform in the unit (e.g., + or -).
- Vj, Vk: values of the source operands.
  - Store buffers have a V field: the result to be stored.
- Qj, Qk: reservation stations producing the source registers (the value to be written).
  - Note: no ready flags as in the scoreboard; Qj,Qk = 0 means ready.
  - Store buffers have only a Qi field, for the RS producing the result.
- Busy: indicates the reservation station or FU is busy.
- Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
Three Stages of the Tomasulo Algorithm
- 1. Issue: get an instruction from the FP op queue.
  - If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming registers).
- 2. Execution: operate on operands (EX).
  - When both operands are ready, execute; if not ready, watch the Common Data Bus for the result.
- 3. Write result: finish execution (WB).
  - Write the result on the Common Data Bus to all awaiting units and mark the reservation station available.
- A normal data bus carries data and a destination (a "go to" bus). The Common Data Bus carries data and a source (a "come from" bus):
  - 64 bits of data plus 4 bits of functional-unit source address.
  - A unit captures the value if the source matches the functional unit it is waiting on (the one producing the result).
  - The CDB does the broadcast.
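The renaming-at-issue and CDB broadcast steps can be sketched in a few lines of Python. This is our own minimal model (a reservation station is a dict, and the register status table maps a register to the name of the RS that will produce it), not the actual 360/91 logic:

```python
def issue(op, src1, src2, rs, reg_value, reg_status):
    """Issue into reservation station `rs`, renaming the sources:
    copy the value if the register is ready, otherwise record the
    name of the reservation station that will produce it (Qj/Qk)."""
    rs["busy"], rs["op"] = True, op
    rs["Qj"] = reg_status.get(src1)          # None means value available now
    rs["Vj"] = None if rs["Qj"] else reg_value[src1]
    rs["Qk"] = reg_status.get(src2)
    rs["Vk"] = None if rs["Qk"] else reg_value[src2]

def broadcast(producer, value, stations):
    """Common Data Bus: every waiting station snoops the bus and captures
    the value tagged with the producing station's name."""
    for rs in stations:
        if rs.get("Qj") == producer:
            rs["Vj"], rs["Qj"] = value, None
        if rs.get("Qk") == producer:
            rs["Vk"], rs["Qk"] = value, None

# Hypothetical sequence: DIVD F0,F2,F4 is executing in Mult1, so F0 is
# not ready when ADDD F10,F0,F8 issues into Add1.
reg_value = {"F2": 6.0, "F4": 2.0, "F8": 1.0}
reg_status = {"F0": "Mult1"}                 # Mult1 will produce F0
add1 = {}
issue("ADDD", "F0", "F8", add1, reg_value, reg_status)
print(add1["Qj"])                            # Mult1: waiting on the CDB
broadcast("Mult1", 3.0, [add1])
print(add1["Vj"], add1["Qj"])                # 3.0 None: operand captured
```

Because the consumer waits on a station name rather than a register name, a later write to F0 cannot create a WAR or WAW conflict with this instruction: the register has effectively been renamed to "Mult1".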
Tomasulo Example: Cycle 0
(The Tomasulo tracking-table figure did not survive the text export.)
Review: Tomasulo
- Prevents the register file from becoming a bottleneck.
- Avoids the WAR and WAW hazards of the scoreboard.
- Allows loop unrolling in HW.
- Not limited to basic blocks (provided there is branch prediction).
- Lasting contributions:
  - Dynamic scheduling
  - Register renaming
  - Load/store disambiguation
- The 360/91's descendants include the PowerPC 604 and 620, MIPS R10000, HP-PA 8000, and Intel Pentium Pro.
Summary
- 3.1 Instruction-Level Parallelism: Concepts and Challenges
- 3.2 Overcoming Data Hazards with Dynamic Scheduling
- 3.3 Dynamic Scheduling: Examples and the Algorithm
- 3.4 Reducing Branch Penalties with Dynamic Hardware Prediction
- 3.5 High-Performance Instruction Delivery
- 3.6 Taking Advantage of More ILP with Multiple Issue
- 3.7 Hardware-Based Speculation
- 3.8 Studies of the Limitations of ILP
- 3.10 The Pentium 4