1
Computer Architecture
  • Chapter 3
  • Instruction-Level Parallelism I
  • Prof. Jerry Breecher
  • CSCI 240
  • Fall 2003

2
Chapter Overview
  • 3.1 Instruction Level Parallelism Concepts and
    Challenges
  • 3.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 3.3 Dynamic Scheduling Examples: The Algorithm
  • 3.4 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 3.5 High Performance Instruction Delivery
  • 3.6 Taking Advantage of More ILP with Multiple
    Issue
  • 3.7 Hardware-based Speculation
  • 3.8 Studies of The Limitations of ILP
  • 3.10 The Pentium 4

3
Ideas To Reduce Stalls
[Table: ideas for reducing stalls, grouped by whether they are covered in Chapter 3 or in Chapter 4.]
4
Instruction Level Parallelism
  • ILP is the principle that there are many instructions in code that don't depend on each other. That means it's possible to execute those instructions in parallel.
  • This is easier said than done.
  • Issues include:
  • Building compilers to analyze the code,
  • Building hardware to be even smarter than that code.
  • This section looks at some of the problems to be solved.

5
Terminology
Instruction Level Parallelism
  • Basic Block - The set of instructions between entry points and between branches. A basic block has only one entry and one exit. Typically it is about 6 instructions long.
  • Loop Level Parallelism - Parallelism that exists within a loop. Such parallelism can cross loop iterations.
  • Loop Unrolling - Replicating the loop body so that either the compiler or the hardware is able to exploit the parallelism inherent in the loop.

6
Terminology
Instruction Level Parallelism
  • Basic Block (BB) ILP is quite small
  • BB: a straight-line code sequence with no branches in except to the entry and no branches out except at the exit
  • average dynamic branch frequency of 15% to 25% => only 4 to 7 instructions execute between a pair of branches
  • Plus, instructions in a BB are likely to depend on each other
  • To obtain substantial performance enhancements, we must exploit ILP across multiple basic blocks
  • Simplest: loop-level parallelism - exploit parallelism among iterations of a loop
  • Vector is one way
  • If not vector, then either dynamic via branch prediction or static via loop unrolling by the compiler (a sketch of unrolling follows below)
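
As an illustration (a hedged sketch in the deck's MIPS style, not taken from these slides), consider a loop that adds a scalar held in F2 to each element of an array. Unrolling it by two exposes two independent iterations whose instructions can be scheduled in parallel; the register names R1 (array pointer), R2 (loop bound), and F0-F8 are assumptions for this sketch:

    Loop: LD     F0, 0(R1)       # load x[i]
          ADDD   F4, F0, F2      # x[i] + s
          SD     F4, 0(R1)       # store x[i]
          DADDUI R1, R1, #-8     # advance pointer (8 bytes per element)
          BNE    R1, R2, Loop

    Loop: LD     F0, 0(R1)       # unrolled by two; the second copy uses fresh registers
          ADDD   F4, F0, F2
          SD     F4, 0(R1)
          LD     F6, -8(R1)      # second iteration is independent of the first
          ADDD   F8, F6, F2
          SD     F8, -8(R1)
          DADDUI R1, R1, #-16
          BNE    R1, R2, Loop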

7
Instruction Level Parallelism
Data Dependence and Hazards
  • InstrJ is data dependent on InstrI: InstrJ tries to read an operand before InstrI writes it
  • or InstrJ is data dependent on InstrK, which is dependent on InstrI
  • Caused by a True Dependence (compiler term)
  • If a true dependence causes a hazard in the pipeline, it is called a Read After Write (RAW) hazard

I: add r1, r2, r3
J: sub r4, r1, r3
8
Data Dependence and Hazards
Instruction Level Parallelism
  • Dependences are a property of programs
  • Presence of dependence indicates potential for a
    hazard, but actual hazard and length of any stall
    is a property of the pipeline
  • Importance of the data dependences:
  • 1) indicates the possibility of a hazard
  • 2) determines the order in which results must be calculated
  • 3) sets an upper bound on how much parallelism can possibly be exploited
  • Today we look at HW schemes to avoid hazards

9
Name Dependence #1: Anti-dependence
Instruction Level Parallelism
  • Name dependence: when 2 instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name. There are 2 versions of name dependence.
  • InstrJ writes an operand before InstrI reads it. Called an anti-dependence by compiler writers. This results from reuse of the name r1 (see the sketch below).
  • If an anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard
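
A minimal sketch of the r1 reuse referred to above (the slide's original graphic is not in the transcript, so this is an assumed example in the same register style):

    I: sub r4, r1, r3    # I reads r1
    J: add r1, r2, r3    # J writes r1 -> anti-dependence on the name r1 (WAR hazard if J's write beats I's read)
    K: mul r6, r1, r7    # K reads the value J produced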

10
Name Dependence #2: Output Dependence
Instruction Level Parallelism
  • InstrJ writes an operand before InstrI writes it.
  • Called an output dependence by compiler writers. This also results from the reuse of the name r1 (see the sketch below).
  • If an output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
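
Again a minimal assumed sketch of the r1 reuse:

    I: sub r1, r4, r3    # I writes r1
    J: add r1, r2, r3    # J also writes r1 -> output dependence (WAW hazard if the writes complete out of order)
    K: mul r6, r1, r7    # K must see J's value of r1, not I's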

11
ILP and Data Hazards
Instruction Level Parallelism
  • HW/SW must preserve program order: the order instructions would execute in if executed sequentially, one at a time, as determined by the original source program
  • HW/SW goal: exploit parallelism by preserving program order only where it affects the outcome of the program
  • Instructions involved in a name dependence can execute simultaneously if the name used in the instructions is changed so the instructions do not conflict
  • Register renaming resolves name dependences for registers (see the sketch below)
  • Either by compiler or by HW
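
A hedged sketch of register renaming applied to the name-dependence examples above: giving J a fresh destination register removes the WAR and WAW conflicts on r1 without changing the data flow. The choice of r5 as the fresh register is an assumption for this sketch.

    I: sub r4, r1, r3    # unchanged - still reads the old value in r1
    J: add r5, r2, r3    # destination renamed from r1 to r5 (an otherwise unused register)
    K: mul r6, r5, r7    # later readers of J's result now name r5; no conflict on r1 remains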

12
Control Dependencies
Instruction Level Parallelism
  • Every instruction is control dependent on some
    set of branches, and, in general, these control
    dependencies must be preserved to preserve
    program order
  • if p1 { S1; }
  • if p2 { S2; }
  • S1 is control dependent on p1, and S2 is control
    dependent on p2 but not on p1.

13
Control Dependence Ignored
Instruction Level Parallelism
  • Control dependence need not be preserved
  • We are willing to execute instructions that should not have been executed, thereby violating the control dependences, if we can do so without affecting the correctness of the program
  • Instead, the 2 properties critical to program correctness are exception behavior and data flow

14
Exception Behavior
Instruction Level Parallelism
  • Preserving exception behavior => any changes in instruction execution order must not change how exceptions are raised in the program (=> no new exceptions)
  • Example:
        DADDU R2, R3, R4
        BEQZ  R2, L1
        LW    R1, 0(R2)
    L1:
  • Problem with moving LW before BEQZ? If the branch is taken, the load should not execute; hoisting it above the branch could raise a memory exception that the original program never raises.

15
Data Flow
Instruction Level Parallelism
  • Data flow: the actual flow of data values among instructions that produce results and those that consume them
  • branches make the flow dynamic, since they determine which instruction is the supplier of data
  • Example:
        DADDU R1, R2, R3
        BEQZ  R4, L
        DSUBU R1, R5, R6
    L:  OR    R7, R1, R8
  • Does OR depend on DADDU or DSUBU? It depends on whether the branch is taken, so the data flow must be preserved during execution.

16
Dynamic Scheduling
Advantages of Dynamic Scheduling
  • Handles cases when dependences are unknown at compile time
  • (e.g., because they may involve a memory reference)
  • It simplifies the compiler
  • Allows code that was compiled for one pipeline to run efficiently on a different pipeline
  • Hardware speculation, a technique with significant performance advantages, builds on dynamic scheduling

17
Dynamic Scheduling
Logistics
  • Sections 3.2 and 3.3 of the text use, as an example of Dynamic Scheduling, an algorithm due to Tomasulo.
  • We instead use another technique, scoreboarding, which is discussed in Appendix A.8.

18
Dynamic Scheduling
The idea
HW Schemes Instruction Parallelism
  • Why is this done in hardware at run time?
  • Works when real dependences can't be known at compile time
  • Compiler is simpler
  • Code for one machine runs well on another
  • Key Idea: Allow instructions behind a stall to proceed.
  • Key Idea: Execute instructions in parallel. There are multiple execution units, so use them.
  • DIVD F0, F2, F4
  • ADDD F10, F0, F8
  • SUBD F12, F8, F14
  • Enables out-of-order execution => out-of-order completion

Even though ADDD stalls, the SUBD has no
dependencies and can run.
19
Dynamic Scheduling
The idea
HW Schemes Instruction Parallelism
  • Out-of-order execution divides the ID stage into:
  • 1. Issue: decode instructions, check for structural hazards
  • 2. Read operands: wait until no data hazards, then read operands
  • Scoreboards allow an instruction to execute whenever 1 and 2 hold, not waiting for prior instructions.
  • A scoreboard is a data structure that provides the information necessary for all pieces of the processor to work together.
  • We will use: in-order issue, out-of-order execution, out-of-order commit (also called completion)
  • First used in the CDC 6600. Our example is modified here for MIPS.
  • The CDC 6600 had 4 FP units, 5 memory reference units, 7 integer units.
  • Our MIPS example has 2 FP multipliers, 1 FP adder, 1 FP divider, 1 integer unit.

20
Scoreboard Implications
Dynamic Scheduling
Using A Scoreboard
  • Out-of-order completion => WAR, WAW hazards?
  • Solutions for WAR:
  • Queue both the operation and copies of its operands
  • Read registers only during the Read Operands stage
  • For WAW, must detect the hazard and stall until the other instruction completes
  • Need to have multiple instructions in the execution phase => multiple execution units or pipelined execution units
  • Scoreboard keeps track of dependencies and the state of operations
  • Scoreboard replaces ID, EX, WB with 4 stages

21
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 1. Issue: decode instructions and check for structural hazards (ID1)
  • If a functional unit for the instruction is free
    and no other active instruction has the same
    destination register (WAW), the scoreboard issues
    the instruction to the functional unit and
    updates its internal data structure.
  • If a structural or WAW hazard exists, then the
    instruction issue stalls, and no further
    instructions will issue until these hazards are
    cleared.

22
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 2. Read operands: wait until no data hazards, then read operands (ID2)
  • A source operand is available if no earlier issued active instruction is going to write it (i.e., no currently active functional unit will write that register).
  • When the source operands are available, the
    scoreboard tells the functional unit to proceed
    to read the operands from the registers and begin
    execution. The scoreboard resolves RAW hazards
    dynamically in this step, and instructions may be
    sent into execution out of order.

23
Four Stages of Scoreboard Control
Dynamic Scheduling
Using A Scoreboard
  • 3. Execution: operate on operands (EX)
  • The functional unit begins execution upon
    receiving operands. When the result is ready, it
    notifies the scoreboard that it has completed
    execution.
  • 4. Write result: finish execution (WB)
  • Once the scoreboard is aware that the
    functional unit has completed execution, the
    scoreboard checks for WAR hazards. If none, it
    writes results. If WAR, then it stalls the
    instruction.
  • Example:
  • DIVD F0, F2, F4
  • ADDD F10, F0, F8
  • SUBD F8, F8, F14
  • The scoreboard would stall SUBD's write until ADDD has read its operands (annotated below)
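
The same example, annotated (a hedged reading, consistent with the WAR discussion above):

    DIVD F0, F2, F4      # long-latency divide; produces F0
    ADDD F10, F0, F8     # reads F8 (and waits on F0: RAW)
    SUBD F8, F8, F14     # writes F8 -> WAR with ADDD's read; its write must wait until ADDD has read F8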

24
Three Parts of the Scoreboard
Dynamic Scheduling
Using A Scoreboard
  • 1. Instruction status: which of the 4 steps the instruction is in
  • 2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., add or subtract)
  • Fi: destination register
  • Fj, Fk: source-register numbers
  • Qj, Qk: functional units producing source registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
  • 3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

25
Detailed Scoreboard Pipeline Control
Dynamic Scheduling
Using A Scoreboard
Each stage waits for a condition and then performs its bookkeeping:
  • Issue - Wait until: not Busy(FU) and not Result(D). Bookkeeping: Busy(FU)←Yes; Op(FU)←op; Fi(FU)←D; Fj(FU)←S1; Fk(FU)←S2; Qj←Result(S1); Qk←Result(S2); Rj←not Qj; Rk←not Qk; Result(D)←FU
  • Read operands - Wait until: Rj and Rk. Bookkeeping: Rj←No; Rk←No
  • Execution complete - Wait until: functional unit done. Bookkeeping: (none)
  • Write result - Wait until: ∀f((Fj(f)≠Fi(FU) or Rj(f)=No) and (Fk(f)≠Fi(FU) or Rk(f)=No)). Bookkeeping: ∀f(if Qj(f)=FU then Rj(f)←Yes); ∀f(if Qk(f)=FU then Rk(f)←Yes); Result(Fi(FU))←0; Busy(FU)←No
26
Dynamic Scheduling Examples

In this section we look at an example of how Dynamic Scheduling actually works. It's all about accounting!
27
Scoreboard Example
Dynamic Scheduling
Using A Scoreboard
This is the sample code we'll be working with in the example:
    LD   F6, 34(R2)
    LD   F2, 45(R3)
    MULT F0, F2, F4
    SUBD F8, F6, F2
    DIVD F10, F0, F6
    ADDD F6, F8, F2
What are the hazards in this code? (An annotated reading follows below.)
Latencies (clock cycles): LD 1, MULT 10, SUBD 2, DIVD 40, ADDD 2
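
As a hedged preview of the cycle-by-cycle walkthrough that follows, the hazards can be annotated directly on the code:

    LD   F6, 34(R2)
    LD   F2, 45(R3)      # structural hazard: both loads need the single integer unit
    MULT F0, F2, F4      # RAW on F2 (from the second LD)
    SUBD F8, F6, F2      # RAW on F6 (first LD) and on F2 (second LD)
    DIVD F10, F0, F6     # RAW on F0 (from MULT); its read of F6 -> WAR with ADDD's write of F6
    ADDD F6, F8, F2      # RAW on F8 (from SUBD) and on F2; must not write F6 until DIVD has read it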
28
Scoreboard Example
Dynamic Scheduling
Using A Scoreboard
29
Scoreboard Example Cycle 1
Dynamic Scheduling
Using A Scoreboard
Issue LD 1
Shows in which cycle the operation occurred.
30
Scoreboard Example Cycle 2
Dynamic Scheduling
Using A Scoreboard
LD 2 can't issue since the integer unit is busy. MULT can't issue because we require in-order issue.
31
Scoreboard Example Cycle 3
Dynamic Scheduling
Using A Scoreboard
32
Scoreboard Example Cycle 4
Dynamic Scheduling
Using A Scoreboard
33
Scoreboard Example Cycle 5
Dynamic Scheduling
Using A Scoreboard
Issue LD 2 since integer unit is now free.
34
Scoreboard Example Cycle 6
Dynamic Scheduling
Using A Scoreboard
Issue MULT.
35
Scoreboard Example Cycle 7
Dynamic Scheduling
Using A Scoreboard
MULT can't read its operands (F2) because LD 2 hasn't finished.
36
Scoreboard Example Cycle 8a
Dynamic Scheduling
Using A Scoreboard
DIVD issues. MULT and SUBD both waiting for F2.
37
Scoreboard Example Cycle 8b
Dynamic Scheduling
Using A Scoreboard
LD 2 writes F2.
38
Scoreboard Example Cycle 9
Dynamic Scheduling
Using A Scoreboard
Now MULT and SUBD can both read F2. How can both
instructions do this at the same time??
39
Scoreboard Example Cycle 11
Dynamic Scheduling
Using A Scoreboard
ADDD can't start because the add unit is busy.
40
Scoreboard Example Cycle 12
Dynamic Scheduling
Using A Scoreboard
SUBD finishes. DIVD waiting for F0.
41
Scoreboard Example Cycle 13
Dynamic Scheduling
Using A Scoreboard
ADDD issues.
42
Scoreboard Example Cycle 14
Dynamic Scheduling
Using A Scoreboard
43
Scoreboard Example Cycle 15
Dynamic Scheduling
Using A Scoreboard
44
Scoreboard Example Cycle 16
Dynamic Scheduling
Using A Scoreboard
45
Scoreboard Example Cycle 17
Dynamic Scheduling
Using A Scoreboard
ADDD can't write because DIVD has not yet read F6. WAR!
46
Scoreboard Example Cycle 18
Dynamic Scheduling
Using A Scoreboard
Nothing Happens!!
47
Scoreboard Example Cycle 19
Dynamic Scheduling
Using A Scoreboard
MULT completes execution.
48
Scoreboard Example Cycle 20
Dynamic Scheduling
Using A Scoreboard
MULT writes.
49
Scoreboard Example Cycle 21
Dynamic Scheduling
Using A Scoreboard
DIVD loads operands
50
Scoreboard Example Cycle 22
Dynamic Scheduling
Using A Scoreboard
Now ADDD can write since WAR removed.
51
Scoreboard Example Cycle 61
Dynamic Scheduling
Using A Scoreboard
DIVD completes execution
52
Scoreboard Example Cycle 62
Dynamic Scheduling
Using A Scoreboard
DONE!!
53
Another Dynamic Algorithm: Tomasulo's Algorithm
Dynamic Scheduling
Tomasulo Algorithm
  • For the IBM 360/91, about 3 years after the CDC 6600 (1966)
  • Goal: high performance without special compilers
  • Differences between the IBM 360 and CDC 6600 ISAs:
  • IBM has only 2 register specifiers per instruction vs. 3 in the CDC 6600
  • IBM has 4 FP registers vs. 8 in the CDC 6600
  • Why study? It led to the Alpha 21264, HP 8000, MIPS R10000, Pentium II, PowerPC 604, ...

This is the example that the text uses in Sections 3.2 and 3.3.
54
Tomasulo Algorithm vs. Scoreboard
Dynamic Scheduling
Tomasulo Algorithm
  • Control buffers are distributed with the Functional Units (FU) vs. centralized in the scoreboard
  • FU buffers, called reservation stations, hold pending operands
  • Registers in instructions are replaced by values or pointers to reservation stations (RS); this is called register renaming
  • avoids WAR, WAW hazards
  • More reservation stations than registers, so can do optimizations compilers can't
  • Results go to FUs from RSs, not through registers, over a Common Data Bus that broadcasts results to all FUs
  • Loads and stores are treated as FUs with RSs as well
  • Integer instructions can go past branches, allowing FP ops beyond the basic block in the FP queue

55
Tomasulo Organization
Dynamic Scheduling
[Block diagram: instructions arrive from the FP Op Queue; load buffers (Load1-Load6) bring operands from memory and store buffers send results to memory; reservation stations (Add1-Add3 feeding the FP adders, Mult1-Mult2 feeding the FP multipliers) hold pending operations; the FP registers, buffers, and functional units are connected by the Common Data Bus (CDB), which broadcasts results.]
56
Reservation Station Components
Dynamic Scheduling
Tomasulo Algorithm
  • Op: operation to perform in the unit (e.g., add or subtract)
  • Vj, Vk: values of the source operands
  • Store buffers have a V field: the result to be stored
  • Qj, Qk: reservation stations producing the source operands (the value to be written)
  • Note: no ready flags as in the scoreboard; Qj, Qk = 0 => ready
  • Store buffers only have Qi, for the RS producing the result
  • Busy: indicates the reservation station or FU is busy
  • Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.

57
Three Stages of Tomasulo Algorithm
Dynamic Scheduling
Tomasulo Algorithm
  • 1. Issue: get instruction from the FP Op Queue
  • If a reservation station is free (no structural hazard), control issues the instruction and sends the operands (renaming registers).
  • 2. Execution: operate on operands (EX)
  • When both operands are ready, execute; if not ready, watch the Common Data Bus for the result
  • 3. Write result: finish execution (WB)
  • Write on the Common Data Bus to all awaiting units; mark the reservation station available
  • Normal data bus: data + destination ("go to" bus)
  • Common data bus: data + source ("come from" bus)
  • 64 bits of data + 4 bits of functional-unit source address
  • Written if it matches the expected functional unit (the one that produces the result)
  • Does the broadcast

58
Tomasulo Example Cycle 0
Dynamic Scheduling
Tomasulo Algorithm
59
Review Tomasulo
Dynamic Scheduling
Tomasulo Algorithm
  • Prevents registers from becoming the bottleneck
  • Avoids the WAR, WAW hazards of the scoreboard
  • Allows loop unrolling in HW
  • Not limited to basic blocks (provided there is branch prediction)
  • Lasting contributions:
  • Dynamic scheduling
  • Register renaming
  • Load/store disambiguation
  • 360/91 descendants include the PowerPC 604 and 620, MIPS R10000, HP PA-8000, and Intel Pentium Pro

60
Summary
  • 3.1 Instruction Level Parallelism Concepts and
    Challenges
  • 3.2 Overcoming Data Hazards with Dynamic
    Scheduling
  • 3.3 Dynamic Scheduling Examples: The Algorithm
  • 3.4 Reducing Branch Penalties with Dynamic
    Hardware Prediction
  • 3.5 High Performance Instruction Delivery
  • 3.6 Taking Advantage of More ILP with Multiple
    Issue
  • 3.7 Hardware-based Speculation
  • 3.8 Studies of The Limitations of ILP
  • 3.10 The Pentium 4