Images from Patterson-Hennessy Book - PowerPoint PPT Presentation

Title: Lecture 11 Author: Montek Singh Last modified by: Montek Singh Created Date: 3/13/2000 2:52:39 AM Document presentation format: Letter Paper (8.5x11 in) – PowerPoint PPT presentation

Slides: 71
Provided by: Monte156
Learn more at: http://www.cs.unc.edu

1
Machines that introduced pipelining and
instruction-level parallelism. Clockwise from
top: IBM Stretch, IBM 360/91, and CDC 6600
  • Images from Patterson-Hennessy Book

2
COMP 740: Computer Architecture and Implementation
  • Montek Singh
  • Thu, Feb 12, 2009
  • Topic: Instruction-Level Parallelism I
  • (Dynamic Scheduling: Scoreboarding)

3
Outline
  • A more complex pipeline, the MIPS R4000
  • Look at the effects of memory with longer latency
  • Also long floating point instructions
  • Dynamic scheduling
  • Scoreboarding

4
R4000 Pipeline
  • From early 90s
  • Just before SGI bought MIPS
  • Superpipelined
  • Approx. 2 instructions per cycle
  • Caches were pipelined
  • Which is what most of the book's discussion is
    about
  • R4000: 100 MHz, 1.3M transistors, 2 levels of
    cache
  • R4400: up to 250 MHz, larger caches

5
Block Diagram
6
Pipeline Diagram
Decode
Address calculation, branching
  • Same logic as before, but now multiple cycles for
    memory access
  • Deeper pipeline will lead to more hazards
  • More forwarding
  • Longer branch delays

7
Forwarding, 2 cycle delay
8
Or a 2 cycle stall
  • ADD stalled for R1
  • SUB uses forwarded value, OR from reg

9
Branch Delay 3 Cycles
10
Predicted not Taken
  • If branch taken, need to stall for 2 cycles
    beyond delay slot

11
8 Stages in FP pipeline
  • Stages are used one or more times, depending on
    instruction (next)

12
Some FP Instructions
  • Note latencies and initiation intervals
  • Individual stages may result in structural hazards

13
Structural Hazard Example 1
  • Units needed at same time highlighted

14
Structural Hazard Example 2
  • The shorter ADD instruction clears the pipeline
    fast, so it doesn't stall MUL

15
Structural Hazard Example 3
  • Notice how these long instructions can have
    long-lasting effects
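
The structural-hazard check in the three examples above can be sketched in a few lines: two FP instructions conflict when they need the same stage in the same cycle. The per-cycle stage patterns below (U, E, M, N, S, A, R) are illustrative assumptions, since the transcript does not reproduce the figure, and `structural_conflicts` is a hypothetical helper name.

```python
# Stage usage per cycle for two FP operations (assumed, illustrative patterns).
MUL = [{"U"}, {"E", "M"}, {"M"}, {"M"}, {"M"}, {"N"}, {"N", "A"}, {"R"}]
ADD = [{"U"}, {"S", "A"}, {"A", "R"}, {"R", "S"}]


def structural_conflicts(first, second, gap):
    """Return (cycle, stage) pairs where both instructions need the same stage.

    `second` is issued `gap` cycles after `first`; cycles are counted from
    the issue of `first`.
    """
    conflicts = []
    for k, stages in enumerate(second):
        t = k + gap                      # absolute cycle of second's k-th step
        if t < len(first):
            shared = stages & first[t]   # stages both want in cycle t
            conflicts += [(t, s) for s in sorted(shared)]
    return conflicts


# An ADD issued right behind the MUL never collides with it,
# while one issued 4 cycles later needs stages the MUL is still using.
for gap in (1, 4, 6):
    print(gap, structural_conflicts(MUL, ADD, gap))
```

With these assumed patterns, a short instruction issued just after a long one slips through, while one issued a few cycles later runs into the long instruction's final stages.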

16
Performance
  • CPI for base case (1.0), and with stalls
  • Left 4 programs are integer programs
  • Cache effects not included
  • Load stalls 2 cycles now
  • Branch stalls now more expensive
  • FP result is a RAW hazard
  • Structural not a big problem
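
The CPI breakdown on this slide is simple arithmetic: the base CPI of 1.0 plus the average stall cycles per instruction from each source. A minimal sketch follows; the stall values are hypothetical placeholders, not the measured numbers from the slide's chart.

```python
def pipeline_cpi(base_cpi, stalls_per_instruction):
    """CPI = base CPI + sum of average stall cycles per instruction."""
    return base_cpi + sum(stalls_per_instruction.values())


# Hypothetical stall contributions (cycles per instruction), for illustration only.
stalls = {
    "load": 0.25,           # load stalls (2 cycles each on this pipeline)
    "branch": 0.5,          # branch stalls
    "fp_result": 0.0,       # FP result (RAW) stalls
    "fp_structural": 0.0,   # FP structural stalls
}
cpi = pipeline_cpi(1.0, stalls)
print(cpi)
```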

17
What Do We Have So Far?
  • Multiple instructions in flight at one time
  • If data hazard, no new instructions issue until
    hazard cleared (stall)
  • Could minimize stalls by reordering instructions
  • static scheduling
  • a smart compiler could reorder instructions to
    minimize stalls
  • using a detailed description of the architecture
  • dynamic scheduling (next topic)
  • or, add hardware to do this at run time

18
Out of Order Execution
  • With dynamic scheduling, we can do out of order
    execution
  • Execute instructions with no dependencies
  • Implies out of order completion
  • Today: discuss one method, scoreboarding
  • So far, instructions issued in order
  • Later we'll look at out of order issue

19
Decode Stage
  • Split the ID stage into 2 stages
  • 1st: issue stage
  • decode and check for structural hazards
  • 2nd: read operand stage
  • wait until operands available, read and proceed

20
Scoreboarding
  • Use a new hardware unit called the scoreboard
  • hardware data structure
  • Keeps track of dependencies, and executes out of
    order
  • as operands become available
  • First used on CDC 6600
  • 16 functional units

21
MIPS with Scoreboard
  • Complex EX stage
  • Each functional unit has
  • 2 inputs
  • 1 output

22
What is a Scoreboard?
  • A Scoreboard is a table maintained by the
    hardware
  • keeps track of instructions being fetched,
    issued, executed etc.
  • keeps track of the resources (functional units
    and operands) they use/need
  • keeps track of which instructions modify which
    registers
  • uses this information to dynamically schedule
    instructions
  • very similar to a pen and paper calculation
  • simple step-by-step procedure easily implemented
    in hardware

23
Dynamic Scheduling with a Scoreboard
  • Original development in CDC 6600
  • Simplified example in HP4 for MIPS FP operations
  • Using neither renaming nor forwarding
  • Values always move from registers to function
    units, and from function units back to registers
  • However, write-back of results happens as soon as
    possible, not in a statically scheduled slot
  • Out-of-order completion can give rise to WAR and
    WAW hazards
  • Remember machine knows original program order
    (needed for hazard detection)
  • Machine model
  • 2 FP multipliers (10 cycles), 1 FP adder (2
    cycles), 1 FP divider (40 cycles), all
    non-pipelined
  • 1 integer unit for everything else (incl. memory
    references)

24
New Worry: WAR Hazards
  • Didn't exist before, because read occurred early
  • Example
  • DIV.D F0, F2, F4
  • ADD.D F10, F0, F8
  • SUB.D F8, F8, F14
  • ADD could easily stall for DIV's F0
  • If SUB allowed to execute, then ADD might use
    wrong value for F8
  • SUB has a WAR hazard with ADD through register F8!
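
The hazards in this example can be classified mechanically by comparing register names. The sketch below is only an illustration: each instruction is written as a `(dest, src1, src2)` tuple, and `hazards` is a hypothetical helper that assumes `first` precedes `second` in program order.

```python
def hazards(first, second):
    """List data hazards that `second` has with the earlier `first`."""
    d1, *s1 = first
    d2, *s2 = second
    found = []
    if d1 in s2:
        found.append(("RAW", d1))   # second reads what first writes
    if d2 in s1:
        found.append(("WAR", d2))   # second overwrites what first still reads
    if d1 == d2:
        found.append(("WAW", d1))   # both write the same register
    return found


DIV = ("F0", "F2", "F4")     # DIV.D F0, F2, F4
ADD = ("F10", "F0", "F8")    # ADD.D F10, F0, F8
SUB = ("F8", "F8", "F14")    # SUB.D F8, F8, F14

print(hazards(DIV, ADD))  # ADD waits for DIV's F0
print(hazards(ADD, SUB))  # SUB must not overwrite F8 before ADD reads it
```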

25
Scoreboard Implications
  • Out-of-order completion ⇒ WAW, WAR hazards?
  • for WAW: stall in Issue until previous write
    completes
  • for WAR: stall in Write Result until previous
    read completes
  • Need to have multiple instructions in execution
    phase
  • ⇒ multiple execution units or pipelined execution
    units
  • Scoreboard keeps track of dependences, state of
    operations
  • Scoreboard replaces ID, EX, WB with 4 stages

26
New Stages
  • The fetch stage is the same; the others have changed.
  • Let's look at them one by one

27
Issue
  • If
  • the required functional unit is available, and
  • no other unit is pending a write to same register
  • Then an instruction is issued
  • Moves to read operands stage
  • The register restriction prevents WAW hazards

28
Read Operands
  • By now, the functional unit is assigned
  • If operands are available, allows functional unit
    to read operands from register file
  • This design has no forwarding
  • So one extra cycle of latency

29
EX
  • Has more functional units
  • Notifies scoreboard when done

30
Write Result
  • Prevent WAR hazards
  • In this case
  • DIV.D F0, F2, F4
  • ADD.D F10, F0, F8
  • SUB.D F8, F8, F14
  • Will stall the WB of the SUB.D until ADD.D reads
    F8

31
Components of Scoreboard
  • Hardware data structure
  • Look at pieces, one by one
  • Instructions (in order) listed on top left

32
Instruction Status
  • All but last issued (ADD is waiting in Issue
    stage)
  • First LD complete
  • MUL, SUB waiting for register F2 (LD)
  • DIV waiting for F0 (result of MUL)

33
Status of Each Functional Unit
  • Fi is the destination; Fj, Fk are the sources
  • Q lists producers of inputs
  • R column indicates that input registers are
    ready, but not yet read (set to No after read)

34
Register Result
  • Shows which unit is producing which register
  • Needed by Issue stage

35
Later in Execution
  • LD and SUB (fast ops) have completed
  • ADD and MUL in process
  • DIV waiting for MUL to write F0

36
Almost Done
  • DIV about ready to write
  • Most everything complete and pipeline almost
    flushed

37
Cost of Extra Performance
  • Scoreboard hardware
  • Extra functional units
  • Extra buses
  • Which may result in structural hazard
  • Hardware needs to assign buses
  • Performance depends on
  • Amount of parallelism in code sequence
  • Window size of the scoreboard
  • Size of basic block (i.e., code without
    branches), next

38
Status: Our Pipeline Now
  • Can execute instructions out of order
  • Have not discussed out of order issue
  • Could extend our scoreboarding to do this
  • Still, the opportunities in basic block limited
  • Basic blocks tend to be short
  • Would like to issue past branches

39
Next
  • We'll first look at techniques to increase issue
    potential
  • Compiler techniques
  • Then look at branch prediction
  • Look at Tomasulo's algorithm for dynamic
    scheduling
  • Begin reading Chapter 2 of HP

40
Self-Study Material
  • Summary of scoreboarding algorithm
  • One long scoreboarding example
  • Formal logic equations for scoreboarding logic

41
Four Stages of Scoreboard Control
  • Issue: decode instruction, check for structural
    hazards (ID1)
  • If functional unit is free and no WAW hazard with
    other active instruction
  • scoreboard issues the instruction to the
    functional unit and updates its internal data
    structure.
  • If a structural or WAW hazard exists
  • instruction issue stalls
  • unless there is buffering between fetch and
    issue, no further instructions can issue until
    these hazards are cleared.
  • Read operands: wait until no data hazards, then
    read (ID2)
  • A source operand is available if no earlier
    issued active instruction is going to write it.
  • When all source operands are available
  • scoreboard tells the functional unit to proceed
    to read the operands from registers and begin
    execution.
  • Thus, scoreboard resolves RAW hazards dynamically
    in this step
  • instructions may be sent into execution out of
    order

42
Four Stages of Scoreboard Control (cont.)
  • Execution: operate on operands
  • The functional unit begins execution upon
    receiving operands
  • When result is ready, it notifies the scoreboard
  • Write Result: finish execution (WB)
  • Once scoreboard is aware that functional unit has
    completed execution, scoreboard checks for WAR
    hazards.
  • If no WAR hazard
  • it writes results
  • If WAR hazard
  • it stalls the completing instruction
  • Example
  • DIV.D F0,F2,F4
  • ADD.D F10,F0,F8
  • SUB.D F8,F8,F14
  • CDC 6600 scoreboard would stall SUB.D until ADD.D
    reads ops

43
Three Parts of the Scoreboard
  • Instruction status: which of the 4 steps the
    instruction is in
  • Functional unit (FU) status: indicates the state
    of the FU
  • Nine fields for each functional unit
  • Busy: indicates whether the unit is busy or not
  • Op: operation to perform in the unit (e.g., + or
    -)
  • Fi: destination register
  • Fj, Fk: source registers
  • Qj, Qk: functional units producing source
    registers Fj, Fk
  • Rj, Rk: flags indicating when Fj, Fk are ready
  • Register result status: indicates which
    functional unit will write each register, if any
  • blank when no pending instructions will write
    that register
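
The three parts listed above map naturally onto three small tables. The sketch below models them with Python dataclasses; the class names, the lower-case field names, and the sample Mult1 entry are illustrative assumptions rather than the deck's notation.

```python
from dataclasses import dataclass, field, fields
from typing import Dict, Optional


@dataclass
class FUStatus:
    """The nine per-unit fields: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    qj: Optional[str] = None   # unit producing Fj (None if no producer pending)
    qk: Optional[str] = None   # unit producing Fk
    rj: bool = False           # Fj ready and not yet read
    rk: bool = False           # Fk ready and not yet read


@dataclass
class Scoreboard:
    instruction_status: Dict[str, str] = field(default_factory=dict)  # instr -> step
    fu_status: Dict[str, FUStatus] = field(default_factory=dict)      # unit -> fields
    register_result: Dict[str, str] = field(default_factory=dict)     # reg -> unit


# Example entry: a multiply waiting for F2, which the Integer unit (a load) produces.
sb = Scoreboard()
sb.instruction_status["MUL.D F0,F2,F4"] = "Issue"
sb.fu_status["Mult1"] = FUStatus(busy=True, op="MUL.D", fi="F0", fj="F2",
                                 fk="F4", qj="Integer", rj=False, rk=True)
sb.register_result["F0"] = "Mult1"
print([f.name for f in fields(FUStatus)])
```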

44
Scoreboard Example Cycle 0
45
Scoreboard Example Cycle 1
First LD issues
46
Scoreboard Example Cycle 2
Structural hazard on Integer unit; second LD
stalls in IF stage
47
Scoreboard Example Cycle 3
Second LD is still stalled
48
Scoreboard Example Cycle 4
Second LD still stalled; first LD done
49
Scoreboard Example Cycle 5
Second LD issues as the structural hazard on
Integer unit has cleared
50
Scoreboard Example Cycle 6
MULT issues
51
Scoreboard Example Cycle 7
SUBD issues; MULT stalled on LD
52
Scoreboard Example Cycle 8a
DIVD issues; SUBD stalled on LD
53
Scoreboard Example Cycle 8b
LD writes F2; MULT and SUBD enabled
54
Scoreboard Example Cycle 9
MULT and SUBD read operands and enter execution
55
Scoreboard Example Cycle 10
Structural hazard on Add unit stalls the final
ADDD
56
Scoreboard Example Cycle 11
SUBD and MULT are still in execution
57
Scoreboard Example Cycle 12
SUBD writes results; Add unit is free, so the
structural hazard is resolved
58
Scoreboard Example Cycle 13
Note WAR hazard between DIVD and ADDD
59
Scoreboard Example Cycle 14
MULT still executing; DIVD stalled on F0 (RAW
hazard)
60
Scoreboard Example Cycle 15
MULT still executing
61
Scoreboard Example Cycle 16
ADDD completes execution, ready to write result
into F6
62
Scoreboard Example Cycle 17
WAR hazard: ADDD stalls in Write Result stage
63
Scoreboard Example Cycle 18
DIVD stalled (RAW hazard on F0), ADDD stalled
(WAR hazard on F6)
64
Scoreboard Example Cycle 19
MULT completes execution
65
Scoreboard Example Cycle 20
MULT writes result; DIVD can proceed to read
operands at next cycle
66
Scoreboard Example Cycle 21
DIVD reads operands; WAR hazard on F6 is resolved
67
Scoreboard Example Cycle 22
40 cycle Divide!
ADDD completes writing of result
68
Scoreboard Example Cycle 61
DIVD completes execution; ready to write result
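
The cycle-by-cycle walk-through above can be approximated with a small timing simulation. This is a simplified sketch, not the CDC 6600 logic: it assumes the usual textbook six-instruction sequence (the transcript does not reproduce the figure), a 1-cycle load latency, untracked integer base registers, one issue per cycle, and that a result written in cycle c is visible from cycle c+1. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Machine model from the slides (unit counts, latencies); load latency is assumed.
UNITS = {"Integer": 1, "Mult": 2, "Add": 1, "Divide": 1}
LATENCY = {"Integer": 1, "Mult": 10, "Add": 2, "Divide": 40}


@dataclass
class Instr:
    text: str
    unit: str
    dest: str
    srcs: Tuple[str, ...]
    issue: Optional[int] = None   # cycle of each of the four steps
    read: Optional[int] = None
    done: Optional[int] = None
    write: Optional[int] = None


def pending(stamp, cycle):
    """True if an event has not happened strictly before `cycle`."""
    return stamp is None or stamp >= cycle


def simulate(prog, max_cycles=1000):
    for cycle in range(1, max_cycles + 1):
        for n, ins in enumerate(prog):
            earlier = prog[:n]
            if ins.issue is None:
                # Issue: needs a free unit and no pending write to dest (WAW).
                active = [e for e in earlier if pending(e.write, cycle)]
                unit_busy = sum(e.unit == ins.unit for e in active) >= UNITS[ins.unit]
                waw = any(e.dest == ins.dest for e in active)
                if not unit_busy and not waw:
                    ins.issue = cycle
                break  # in-order issue: later instructions must wait
            if ins.read is None:
                # Read operands: wait for earlier producers to write (RAW).
                raw = any(e.dest in ins.srcs and pending(e.write, cycle) for e in earlier)
                if not raw:
                    ins.read = cycle
                    ins.done = cycle + LATENCY[ins.unit]
            elif ins.write is None and ins.done < cycle:
                # Write result: wait until earlier readers of dest have read (WAR).
                war = any(ins.dest in e.srcs and pending(e.read, cycle) for e in earlier)
                if not war:
                    ins.write = cycle
        if all(i.write is not None for i in prog):
            return cycle
    raise RuntimeError("program did not finish")


prog = [
    Instr("L.D   F6,34(R2)", "Integer", "F6", ()),
    Instr("L.D   F2,45(R3)", "Integer", "F2", ()),
    Instr("MUL.D F0,F2,F4", "Mult", "F0", ("F2", "F4")),
    Instr("SUB.D F8,F6,F2", "Add", "F8", ("F6", "F2")),
    Instr("DIV.D F10,F0,F6", "Divide", "F10", ("F0", "F6")),
    Instr("ADD.D F6,F8,F2", "Add", "F6", ("F8", "F2")),
]
total = simulate(prog)
for i in prog:
    print(f"{i.text:<16} issue={i.issue} read={i.read} done={i.done} write={i.write}")
```

Under these assumptions the sketch matches the slides' key events: the second LD issues in cycle 5, MULT completes in cycle 19, ADDD waits until cycle 22 to write F6, and DIVD completes execution in cycle 61.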
69
Scoreboard Summary
  • CDC designers measured performance improvement of
    1.7 for compiled FORTRAN code, 2.5 for assembly
  • No pipeline scheduling in software
  • Slow memory (no cache)
  • Limitations of 6600 scoreboard
  • No forwarding
  • Limited to instructions in basic block (small
    issue window)
  • Number of functional units (structural hazards)
  • Wait for WAR hazards
  • Prevent WAW hazards

70
Scoreboard Bookkeeping Actions
Instruction Status | Wait Until | Bookkeeping
Issue | not Busy[FU] and not Result[D] | Busy[FU] ← yes; Op[FU] ← op; Fi[FU] ← D; Fj[FU] ← S1; Fk[FU] ← S2; Qj ← Result[S1]; Qk ← Result[S2]; Rj ← not Qj; Rk ← not Qk; Result[D] ← FU
Read Operands | Rj and Rk | Rj ← No; Rk ← No; Qj ← 0; Qk ← 0
Execution Complete | Functional unit done | (none)
Write Result | ∀f ((Fj[f] ≠ Fi[FU] or Rj[f] = No) and (Fk[f] ≠ Fi[FU] or Rk[f] = No)) | ∀f (if Qj[f] = FU then Rj[f] ← Yes); ∀f (if Qk[f] = FU then Rk[f] ← Yes); Result[Fi[FU]] ← 0; Busy[FU] ← No
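
The bookkeeping actions above translate almost line for line into code. The sketch below keeps the FU status and register result tables as plain dictionaries; the function names and the use of `None` in place of 0 or blank are assumptions made for illustration.

```python
EMPTY = dict(busy=False, op=None, Fi=None, Fj=None, Fk=None,
             Qj=None, Qk=None, Rj=False, Rk=False)


def make_scoreboard(unit_names):
    """Return (FU status table, register result table)."""
    return {u: dict(EMPTY) for u in unit_names}, {}


def can_issue(fu, result, unit, dest):
    # Wait until: not Busy[FU] and not Result[D]
    return not fu[unit]["busy"] and result.get(dest) is None


def issue(fu, result, unit, op, dest, s1, s2):
    qj, qk = result.get(s1), result.get(s2)
    fu[unit].update(busy=True, op=op, Fi=dest, Fj=s1, Fk=s2,
                    Qj=qj, Qk=qk, Rj=qj is None, Rk=qk is None)
    result[dest] = unit


def can_read(fu, unit):
    # Wait until: Rj and Rk
    return fu[unit]["Rj"] and fu[unit]["Rk"]


def read_operands(fu, unit):
    fu[unit].update(Rj=False, Rk=False, Qj=None, Qk=None)


def can_write(fu, unit):
    # Wait until no unit still has to read the old value of Fi (WAR check).
    fi = fu[unit]["Fi"]
    return all((f["Fj"] != fi or not f["Rj"]) and (f["Fk"] != fi or not f["Rk"])
               for f in fu.values())


def write_result(fu, result, unit):
    for f in fu.values():          # wake up consumers waiting on this unit
        if f["Qj"] == unit:
            f["Rj"] = True
        if f["Qk"] == unit:
            f["Rk"] = True
    result[fu[unit]["Fi"]] = None
    fu[unit]["busy"] = False


# Usage: DIV.D waits on MUL.D's F0, so ADD.D may not overwrite F6 yet.
fu, result = make_scoreboard(["Integer", "Mult1", "Mult2", "Add", "Divide"])
issue(fu, result, "Mult1", "MUL.D", "F0", "F2", "F4")
issue(fu, result, "Divide", "DIV.D", "F10", "F0", "F6")
issue(fu, result, "Add", "ADD.D", "F6", "F8", "F2")
read_operands(fu, "Mult1")
read_operands(fu, "Add")
print(can_read(fu, "Divide"), can_write(fu, "Add"))
```

In this usage, ADD.D is held in Write Result until DIV.D has read F6, which in turn happens only after MUL.D writes F0.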