CS5222 Advanced Computer Architecture Part 4: Superscalar Processors

Slides: 124
Provided by: chich8
Transcript and Presenter's Notes

1
CS5222 Advanced Computer Architecture Part 4
Superscalar Processors
  • Fall Term, 2004/2005
  • Chi Chi Hung (email chich_at_comp.nus.edu.sg)
  • Building S/17, Rm 5-13
  • Phone 874-2832

2
Part A
  • Emergence of Superscalar Processors

3
Introduction
  • Three phases
  • Idea (in early 70s)
  • Architecture proposals and prototype machines
  • Commercial products

4
Proposed/Prototypes
5
Appearance of Superscalar Processors (I)
  • Early 90s was the time when VLSI technology
    started to accelerate.

6
Appearance of Superscalar Processors (II)
  • Origins
  • Converting from an existing (scalar) RISC line.
    E.g. Intel 960, MC 88000, HP PA, Sun SPARC, MIPS
    R AMD.
  • Conceiving a new architecture. E.g. Power 1,
    Alpha.
  • CISC superscalar processors appeared later than
    RISC ones because of
  • Complexity of decoding multiple variable-length
    instr.
  • Complexity of handling memory architecture.
  • Relatively lower issue rate for CISC processors
    (Note that for the same performance, RISC
    superscalar proc. needs a higher issue rate).

7
Commercial Superscalar Processors
8
Part B
  • Tasks of Superscalar Processing

9
Specific Tasks of Superscalar Processing (I)
10
Specific Tasks of Superscalar Processing (II)
  • Five aspects
  • Parallel decoding
  • Sophisticated hardware with high issue rate.
  • Length of decoding stage: multiple cycles?
    Predecoding?
  • Superscalar instruction issue
  • High issue rate implies a smaller gap between 2
    sequential instr.
  • Amplifies restrictive effects of control and data
    dependencies.
  • Sample solutions: shelving, register renaming,
    speculative branch processing.
  • Parallel instruction execution
  • Preserving sequential consistency of execution
  • Retain logical consistency of program execution
    despite out-of-order execution.
  • Preserving sequential consistency of exception
    execution

11
Part C
  • Parallel Decoding

12
Sequential Decoding vs. Parallel Decoding
13
Basic Ideas of Parallel Decoding
  • Parallel decoding: decoding multiple instr. /
    cycle
  • Hardware complexity increases with issue rate.
  • Check dependencies w.r.t.
  • Instructions currently being executed.
  • Instruction candidates to be issued next.
  • Multiple instructions decoding in a clock cycle
  • Decode-issue path becomes critical for clock
    frequencies.
  • Solutions
  • Multiple pipeline cycles for decoding
  • E.g. PowerPC 601/604, UltraSPARC: 2 cycles; Alpha
    21064: 3 cycles; Pentium Pro: 4.5 cycles
  • Predecoding

14
Principle of Pre-Decoding
  • Part of the decode task is done while instructions
    are loaded into the on-chip instruction cache.
  • Shortens overall decoding time or reduces the no. of
    cycles for decoding and instruction issue.
  • Append a number of decode bits to each instr.
  • Instruction class
  • Type of resources required for execution
  • Calculation of branch address (for some
    processors)
  • CISC processors require more bits for information
    such as variable instruction length (e.g.
    starting/ending of I).
  • Extra space is required. E.g. K5 adds 5 extra
    bits to each byte.
  • Common to most predominant processor lines.

15
Example of Pre-Decoding
16
Pre-decode Bits
17
Superscalar Processing with Pre-Decoding
18
Part D
Superscalar Instruction Issue
19
Design Space
  • Issue policy specifies how dependencies are
    handled during issue process.
  • Issue rate specifies the max. no. of instructions
    a superscalar processor is able to issue in each
    cycle.

20
Design Space of Issue Policies (I)
21
Design Space of Issue Policies (II)
  • Four main aspects
  • False data dependencies
  • E.g. WAR, WAW (note that this is just for
    registers, not mem.)
  • Solution: register renaming, i.e. renaming the
    destination reg. That is, the result is written
    into a dynamically allocated spare register
    instead of the specified register.
  • Unresolved control dependencies
  • Solution: speculative branch processing. A guess
    about the outcome of the unresolved conditional
    branch is made.
  • Use of shelving
  • Separate issue/dispatch into two stages.
  • Handling blockages either directly (with issue
    window) or by decoupling (no dependency checking
    on issue).
  • Handling of issue blockages
  • Preserving issue order: in-order vs. out-of-order
  • Alignment of issue: aligned vs. unaligned issue
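
The register dependencies named on this slide can be told apart with a few comparisons. The following is a minimal sketch, assuming each instruction has one destination register and a tuple of source registers; `Instr` and `classify` are illustrative names, not from the slides.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str     # destination register
    srcs: tuple   # source registers


def classify(first: Instr, second: Instr) -> set:
    """Register dependencies of `second` on the earlier instruction `first`."""
    deps = set()
    if first.dest in second.srcs:
        deps.add("RAW")   # true dependency: second reads what first writes
    if second.dest in first.srcs:
        deps.add("WAR")   # false dependency: second overwrites a source of first
    if second.dest == first.dest:
        deps.add("WAW")   # false dependency: both write the same register
    return deps


i1 = Instr("r1", ("r2", "r3"))   # r1 = r2 op r3
i2 = Instr("r2", ("r4", "r5"))   # overwrites r2, which i1 reads
i3 = Instr("r1", ("r6", "r7"))   # overwrites r1, which i1 also writes
i4 = Instr("r8", ("r1", "r9"))   # reads r1, which i1 writes
```

Only the "RAW" case is a true dependency; the "WAR" and "WAW" cases disappear once the second instruction's destination is renamed to a spare register.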

22
Principle of Blocking Issue Mode
23
Principle of Shelving
24
Design Aspects Related to Handling of Blockages
25
Issue Order of Instructions (I)
26
Issue Order of Instructions (II)
  • In-order
  • A dependent instruction will block the issue of
    all subsequent instructions until the dependency
    is resolved.
  • Out-of-order
  • An independent instruction can be issued even if
    a dependent instruction is still in the issue
    window.
  • Some processors allow partial out-of-order issue.
    E.g. PowerPC 601 issues branches and FP out-of-order;
    MC 88100 does so only for FP instructions.
  • Not many processors employ out-of-order issue because
  • Preserving sequential consistency requires much
    more effort.
  • Shelving reduces the need for out-of-order.

27
Aligned Issue of Instructions (I)
28
Aligned Issue of Instructions (II)
  • Aligned issue
  • No instructions of the next window will be
    considered as candidates for issue until all
    instructions in the current window have been
    issued.
  • Unaligned issue
  • A gliding window whose width equals the issue
    rate is employed.
  • In every cycle, all instructions in the window
    are checked for dependencies. Those independent
    ones are issued either as in-order or
    out-of-order. Then the window will be refilled.
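
The two alignment schemes can be compared with a small simulation. This is a simplified sketch under stated assumptions, not a model of any real processor: issue is in-order, `ready[i]` is a hypothetical value giving the first cycle in which instruction i is free of dependencies, and `issue_cycles` is an illustrative name.

```python
def issue_cycles(ready, width, aligned):
    """Count the cycles needed to issue all instructions in order.

    ready[i] -- first cycle in which instruction i has no pending dependency
    width    -- issue rate, which is also the window width
    aligned  -- True: refill the window only when it is empty
                False: gliding window, refilled every cycle
    """
    n = len(ready)
    next_i = 0                    # oldest instruction not yet issued
    window_end = min(width, n)    # exclusive end of the current window
    cycle = 0
    while next_i < n:
        if aligned:
            if next_i == window_end:          # window fully issued
                window_end = min(next_i + width, n)
        else:
            window_end = min(next_i + width, n)
        # In-order issue: stop at the first blocked instruction.
        while next_i < window_end and ready[next_i] <= cycle:
            next_i += 1
        cycle += 1
    return cycle


ready = [0, 1, 0, 0, 0]   # instruction 1 is blocked during cycle 0
aligned_cycles = issue_cycles(ready, width=2, aligned=True)
unaligned_cycles = issue_cycles(ready, width=2, aligned=False)
```

In this example the aligned scheme needs 4 cycles and the unaligned scheme 3, because the gliding window lets instruction 2 issue together with instruction 1.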

29
Most Frequently Used Issue Policies of Scalar
Processors
30
Most Frequently Used Issue Policies of
SuperScalar Proc.
31
Trend in Instruction Issue Policies
32
Issue Rate (I)
  • Issue rate (or superscalarity) refers to the
    maximum number of instructions a superscalar
    processor can issue in one cycle.
  • Higher issue rate potentially offers higher
    performance. The cost is more complex
    circuitry, so the two must be balanced.

33
Issue Rate (II)
34
Part E
Superscalar Instruction Issue: Shelving
35
Introduction
  • Eliminate issue blockages due to dependencies.
  • Makes use of dedicated instruction buffers, called
    shelving buffers, in front of the EU(s).
  • Shelving decouples dependency checking from
    instruction issue, and defers it to instr.
    dispatch.
  • Decoded instructions are issued to the shelving
    buffers without any checks for data or control
    dependencies or for busy EU(s).
  • Processors with shelving usually employ in-order,
    aligned issue policies, together with register
    renaming and speculative conditional branch
    execution (only true dependencies can block
    instruction execution). (Why in-order, aligned
    issue?)
  • Dependency checks are done during the instruction
    dispatch phase (from shelving buffer to EU).
    Dependency-free instructions, with their operands
    available, are eligible for execution
    (dataflow principle of operation).
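
The decoupling described above can be sketched in a few lines. This is an illustrative model only: the class and method names are assumptions, issue never blocks (the shelf merely records which source registers are still pending), and dispatch picks the oldest entries whose operands are all available.

```python
class ShelvingBuffer:
    def __init__(self):
        self.entries = []   # shelved instructions, oldest first

    def issue(self, name, srcs, ready_regs):
        # Issue never blocks on dependencies: the instruction is shelved
        # together with the set of source registers it still waits for.
        waiting = {s for s in srcs if s not in ready_regs}
        self.entries.append({"name": name, "waiting": waiting})

    def result_available(self, reg):
        # A finished instruction makes `reg` available to all waiters.
        for entry in self.entries:
            entry["waiting"].discard(reg)

    def dispatch(self, max_per_cycle):
        # Dataflow principle: an entry is executable once it waits for nothing.
        chosen, rest = [], []
        for entry in self.entries:
            if not entry["waiting"] and len(chosen) < max_per_cycle:
                chosen.append(entry)
            else:
                rest.append(entry)
        self.entries = rest
        return [entry["name"] for entry in chosen]


shelf = ShelvingBuffer()
available = {"r2", "r3"}
shelf.issue("mul", ("r2", "r3"), available)
shelf.issue("add", ("r1", "r2"), available)   # r1 is still being computed
shelf.issue("sub", ("r3", "r2"), available)
first_cycle = shelf.dispatch(max_per_cycle=2)
shelf.result_available("r1")
second_cycle = shelf.dispatch(max_per_cycle=2)
```

Here "add" does not hold up "sub": the blocked instruction waits in the buffer while younger, executable ones are dispatched.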

36
Principle of Straightforward Issue Policy
37
Principle of Shelving
38
Design Space of Shelving
39
Part E-1
  • Design Space Topic of Shelving
  • Scope of Shelving

40
Scope of Shelving
  • Scope of shelving specifies whether shelving is
    restricted to a few instruction types or is
    performed for all instructions.

41
Part E-2
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers

42
Layout of Shelving Buffers
43
Part E-2-1
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers
  • Type of Buffers

44
Type of Shelving Buffers (I)
  • Standalone buffers are buffers which are used
    exclusively for shelving.
  • Combined buffers are those with multiple
    functionalities.

45
Type of Shelving Buffers (II)
  • Standalone using reservation station (RS)
  • Individual
  • Earliest to be adopted
  • In front of each EU
  • Size usually small (2-4)
  • Group
  • Hold instructions for a group of EUs that execute
    inst. of the same type
  • More reliable
  • Large in size (8-16)
  • Shelving or dispatching more than one instruction
    per cycle

46
Type of Shelving Buffers (III)
  • Standalone using reservation station (RS)
    (Contd)
  • Central
  • Most flexible
  • Disadvantages
  • Need a word length equal to the longest possible
    data word
  • Much more complex
  • Size about 20
  • Combined buffers (reorder buffer, ROB) for
    shelving, renaming and reordering.
  • Expected to be the future trend

47
Type of Shelving Buffers (IV)
48
Combined Buffer for Shelving, Renaming and
Reordering
49
Part E-2-2
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers
  • Number of Buffer Entries

50
Shelving Buffer Entries in Superscalar Processors
What types of RSs should be expected?
51
Part E-2-3
  • Design Space Topic of Shelving
  • Layout of Shelving Buffers
  • Number of Read/Write Ports

52
Number of Read/Write Ports for Shelving Buffers
  • Individual reservation stations only need to
    forward a single instruction per cycle.
  • Group/Central reservation stations need to
    deliver multiple instructions per cycle, ideally
    as many as the number of EU(s) connected.
  • Study the relationship between read/write ports
    and no. of shelving buffer entries

53
Part E-3
Design Space Topic of Shelving: Operand Fetch
Policy
54
Types of Operand Fetch Policies (I)
  • Two types
  • Issue bound
  • Operands fetched during instruction issue.
  • Shelving buffers provide entries long enough to
    hold source operands.
  • Dispatch bound
  • Operands fetched during instruction dispatch.
  • Shelving buffers contain short register
    identifiers.

55
Types of Operand Fetch Policies (II)
56
Operand Fetch During Instr. Issue w/ Single
Register File
57
Operand Fetch During Instr. Dispatch w/ Single
Register File
58
Comparison of Operand Fetch Policies
  • Policy comparison
  • Issue bound
  • Register file supplies all operands for all
    issued instructions.
  • Need twice as many read ports in the register
    file as the max. issue rate.
  • Size of RS is relatively larger.
  • Dispatch bound
  • No. of read ports should equal twice the
    dispatch rate (note that the max. dispatch rate is
    usually higher than the max. issue rate, why?).
  • Critical decode/issue path is shorter.
  • Shelving buffers are relatively less complex.
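
As a back-of-the-envelope illustration of the port counts above, assuming two source operands per instruction. The rates used are made-up values, not figures for any real processor, and `read_ports_needed` is an illustrative name.

```python
def read_ports_needed(rate, src_operands=2):
    """Register-file read ports needed to supply every source operand."""
    return rate * src_operands


# Hypothetical machine: issues up to 4 instr./cycle, dispatches up to 6.
issue_bound_ports = read_ports_needed(4)      # operands fetched at issue
dispatch_bound_ports = read_ports_needed(6)   # operands fetched at dispatch
```

With these assumed rates the issue-bound policy needs 8 read ports and the dispatch-bound policy 12.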

59
Issue Bound Operand Fetch with Multiple Register
Files
60
Dispatch Bound Operand Fetch with Multiple
Register Files
61
MFU Shelving Buffer Types and Operand Fetch
Policies
62
Part E-4
Design Space Topic of Shelving: Instruction
Dispatch Scheme
63
Design Space of Inst. Dispatch
  • Instruction dispatch involves two basic tasks:
    scheduling the instructions held in a particular
    RS for execution and disseminating the scheduled
    instruction(s) to the allocated EU(s).

Instruction dispatch scheme
64
Part E-4-1
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Dispatch policy

65
Design Space of Dispatch Policy
Dispatch policy
66
Consideration of Dispatch Policy (I)
  • Dispatch policy specifies how instructions are
    selected for execution and how dispatch blockages
    are handled.
  • Selection rule
  • Specify when instructions are considered as
    executable.
  • Arbitration rule
  • Choose a subset of instructions when more
    instructions are eligible for execution than can
    be disseminated in the next cycle.
  • Usually, older instructions are preferred over
    younger ones.

67
Consideration of Dispatch Policy (II)
  • Dispatch policy (Contd)
  • Dispatch order
  • Will a non-executable instruction block all
    subsequent instructions from being dispatched?
  • Three types
  • In-order: simple (only last inst. to be
    inspected)
  • Partially out-of-order (for certain instr. types)
  • Out-of-order
  • Complex
  • Need to check all instructions in shelving buffer
    for executable instructions.
  • Expected to be used in group or central RS.
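
The selection rule, arbitration rule and dispatch order can be combined in one short routine. This is a sketch under simplifying assumptions: each shelved instruction is a hypothetical `(age, name, executable)` tuple, the arbitration rule is oldest-first, and `select_for_dispatch` is an illustrative name.

```python
def select_for_dispatch(shelf, rate, in_order):
    """Pick up to `rate` instructions to dispatch this cycle.

    shelf    -- list of (age, name, executable); lower age means older
    rate     -- dispatch rate of this reservation station
    in_order -- True: a non-executable instruction blocks all younger ones
                False: non-executable instructions are skipped
    """
    chosen = []
    for age, name, executable in sorted(shelf):   # arbitration: oldest first
        if len(chosen) == rate:
            break
        if executable:                            # selection rule
            chosen.append(name)
        elif in_order:
            break                                 # dispatch blockage
    return chosen


shelf = [(2, "c", True), (0, "a", True), (1, "b", False), (3, "d", True)]
in_order_pick = select_for_dispatch(shelf, rate=2, in_order=True)
out_of_order_pick = select_for_dispatch(shelf, rate=2, in_order=False)
```

In-order dispatch stops at the non-executable "b", while out-of-order dispatch has to scan the whole buffer, which is where the extra hardware complexity comes from.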

68
Dispatch Order
69
Part E-4-2
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Dispatch rate

70
Considerations of Dispatch Rate
  • Dispatch rate is defined as the no. of
    instructions that can be dispatched from each
    reservation station per cycle.
  • Ideal dispatch rate is one instruction per EU.
  • Easier to achieve in individual and group RS.
  • Future dispatch rate is expected to get higher
    because of fewer restrictions imposed on data
    path, ports, and transistor count.
  • Note that very often, max. issue rate is less
    than max. dispatch rate.

71
Multiplicity of Dispatched Instructions
72
Max. Issue and Dispatch Rates of Superscalar Proc.
  • Study relationship between issue rate and
    dispatch rate.

73
Part E-4-3
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Checking for Operand Availability

74
Intro. to Checking for Operand Availability
  • Availability checking is done
  • when operands are fetched from the register file,
    and
  • (during dispatch) to determine whether operands of
    instructions in the shelving buffers are available.
  • Solution: scoreboard
  • Direct check of the scoreboard bits
  • RS does not hold any explicit status information
    indicating if source operands are available.
  • Employed when operands are fetched during inst.
    dispatch.
  • Check of explicit status bit
  • Availability is indicated in RS through status
    bits.
  • Employed if operands are fetched during inst.
    issue.
  • Additional associative search needed for value
    updating in RS.
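
The direct check of scoreboard bits can be sketched as follows, assuming one valid bit per register that is cleared when an instruction writing that register is issued and set again when the result is written back. The class and method names are illustrative.

```python
class Scoreboard:
    def __init__(self, num_regs):
        self.valid = [True] * num_regs   # one status bit per register

    def issue(self, dest):
        self.valid[dest] = False         # result of `dest` is now pending

    def write_back(self, dest):
        self.valid[dest] = True          # result has been produced

    def operands_available(self, srcs):
        # Direct check: look up the bit of every source register.
        return all(self.valid[s] for s in srcs)


sb = Scoreboard(num_regs=8)
sb.issue(dest=1)                                # some instruction will write r1
before = sb.operands_available((1, 2))          # a reader of r1 must wait
sb.write_back(dest=1)
after = sb.operands_available((1, 2))
```

With this scheme the reservation station stores only register identifiers; with explicit status bits, the same information would instead be kept per operand inside each RS entry and updated by an associative search.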

75
Principle of Scoreboarding
76
Scheme for Checking Operand Availability
77
Use of Multiple Buses for Updating Multiple RSs
  • If multiple RSs exist, their updating must be
    done globally.

78
Updating RSs in case of Multiple Register Files
79
Internal Data Paths of PowerPC604
80
Part E-4-4
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Treatment of Empty Reservation Station

81
Treatment of Empty Reservation Station
82
Part E-4-5
  • Design Space Topic of Shelving
  • Instruction Dispatch Scheme
  • Typical Dispatch Schemes

83
Typical Approaches in Dispatching (I)
  • Assumptions for typical solutions
  • Register renaming and speculative execution are
    usually employed.
  • If operands are fetched during instruction
    dispatch, use direct checking method.
  • If operands are fetched during instruction issue,
    use explicit status bits to maintain and check
    operand availability
  • Empty RS is usually bypassed.

84
Typical Approaches in Dispatching (II)
85
Part F
Superscalar Instruction Issue: Register Renaming
86
Introduction to Register Renaming
  • Standard technique for removing false data
    dependencies (i.e. WAR, WAW).
  • Always turn instructions into three-operand form by
    renaming the destination operand.
  • Two implementations
  • Static
  • Done by the compiler.
  • Dynamic
  • Take place in hardware during execution time.
  • Require extra circuitry for suppl. register
    space, additional data paths and logic.

87
Implementation of Register Renaming
88
Chronology of Renaming in Commercial Processors
89
Design Space of Register Renaming
90
Part F-1
  • Design Space Topic: Register Renaming
  • Scope of Register Renaming

91
Scope of Renaming
92
Part F-2
Design Space Topic: Register Renaming, Layout of
Rename Buffers
93
Layout of Rename Buffers
94
Types of Rename Buffers
95
Architecture of Rename Buffers
  • For merged arch. and rename register file
  • A free physical register is allocated to each
    destination register specified in an instruction.
  • A mapping table is used to track all allocated
    reg. pairs.
  • A scheme is required to reclaim physical registers
    no longer in use.
  • For all three other cases, intermediate results
    are held in respective rename buffer until their
    retirement. During retirement, content of rename
    buffer will be written back to architectural
    register file.
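
Renaming with a merged register file can be sketched with a mapping table and a free list. This is a minimal illustration, not the scheme of any particular processor: `Renamer` and its methods are assumed names, and reclaiming is left to the caller, who returns the previous physical register once it is known to be unused.

```python
class Renamer:
    def __init__(self, arch_regs, num_physical):
        # Initially architectural register i lives in physical register i.
        self.mapping = {reg: i for i, reg in enumerate(arch_regs)}
        self.free = list(range(len(arch_regs), num_physical))

    def rename(self, dest, srcs):
        """Return (new physical dest, physical sources, old physical dest)."""
        phys_srcs = [self.mapping[s] for s in srcs]   # read mappings first
        if not self.free:
            raise RuntimeError("no free physical register: renaming stalls")
        old = self.mapping[dest]
        new = self.free.pop(0)
        self.mapping[dest] = new                      # record the new pair
        return new, phys_srcs, old

    def reclaim(self, phys):
        self.free.append(phys)                        # register reusable again


renamer = Renamer(["r1", "r2", "r3"], num_physical=6)
first = renamer.rename("r1", ["r2", "r3"])    # r1 = r2 op r3
second = renamer.rename("r1", ["r2", "r2"])   # r1 = r2 op r2 (WAW with first)
renamer.reclaim(first[2])                     # old copy of r1 no longer needed
```

The two writes to r1 land in different physical registers (3 and 4), so the WAW dependency between them is gone.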

96
Example of Renaming Architecture Register (I)
97
Example of Renaming Architecture Register (II)
98
Number of Rename Buffers
99
Access Mechanism of Rename Buffers (I)
  • Rename buffers are accessed to
  • Fetch operands
  • Update rename registers
  • Deallocate rename registers
  • Two distinct mechanisms
  • Associative mechanism
  • Indexed access mechanism

100
Access Mechanism of Rename Buffers (II)
101
Part F-3
Design Space Topic: Register Renaming, Operand
Fetch Policy
102
Operand Fetch Policies of Rename Buffers
  • Two policies
  • Rename bound
  • Fetch referenced operands during renaming
  • Dispatch bound
  • Defer operand fetch until dispatching

103
Part F-4
Design Space Topic: Register Renaming, Rename Rate
104
Rename Rate
  • Rename rate is the max. number of renames per
    cycle that a processor is able to perform.
  • To avoid bottlenecks, the rename rate should equal
    the issue rate.
  • HW requirements: a large number of ports at
    register files and the mapping tables.

105
Part F-5
Design Space Topic: Register Renaming, Most
Frequently Used Renaming
106
Most Frequently Used Basic Renaming
107
Part G
Parallel Execution
108
Concept of Parallel Execution
  • Independent of whether instructions are issued or
    dispatched in-order or out-of-order, they will
    generally finish out of program order.
  • Three terms
  • to finish: operation is completed except for
    writing back the result into the architectural
    register or memory (and status bits).
  • to complete: the last action of instruction
    execution (i.e. write back to arch. registers) is
    finished.
  • to retire: write back to arch. registers and
    delete the completed instruction from the ROB
    (Reorder Buffer).

109
Part H
Preserving Sequential Consistency of Instruction
Execution
110
Sequential Consistency (I)
  • Two aspects
  • Order in which instructions are completed.
  • Order in which memory is accessed due to LD/ST.
  • Processor consistency indicates the consistency
    of instruction completion with sequential
    instruction execution.
  • Two possible processor consistencies
  • Weak: instructions may complete out of order,
    provided that no data dependencies are
    sacrificed.
  • Strong: instructions are forced to complete in
    strict program order. Usually achieved with a ROB.
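
How a ROB enforces strong processor consistency can be sketched with a queue. This is a simplified illustration with assumed names: entries are allocated in program order, may be marked finished in any order, and leave the buffer only from the head.

```python
from collections import deque


class ReorderBuffer:
    def __init__(self):
        self.entries = deque()   # program order, oldest at the left

    def allocate(self, name):
        self.entries.append({"name": name, "finished": False})

    def finish(self, name):
        # Execution may finish in any order.
        for entry in self.entries:
            if entry["name"] == name:
                entry["finished"] = True
                break

    def retire(self):
        # Only the oldest entries may leave, and only once finished.
        retired = []
        while self.entries and self.entries[0]["finished"]:
            retired.append(self.entries.popleft()["name"])
        return retired


rob = ReorderBuffer()
for name in ("i1", "i2", "i3"):
    rob.allocate(name)
rob.finish("i3")
step1 = rob.retire()   # i3 finished first but cannot pass i1 and i2
rob.finish("i1")
step2 = rob.retire()
rob.finish("i2")
step3 = rob.retire()
```

Although i3 finishes first, the retirement order is i1, i2, i3, which is the program order.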

111
Sequential Consistency (II)
  • Memory consistency indicates whether memory
    accesses are performed in the same order as in a
    sequential processor.
  • Two possible memory access consistencies
  • Weak: memory accesses may be out of order
    compared with a strict sequential program
    execution, provided that data dependencies are
    not violated.
  • Strong: memory accesses occur strictly in program
    order.
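
Under weak memory consistency, the condition for letting a load pass earlier stores can be sketched as an address comparison. This is a conservative illustration, not a description of any specific processor: a pending store whose address is not yet known (`None`) is assumed to conflict, and `load_may_bypass` is an illustrative name.

```python
def load_may_bypass(load_addr, pending_store_addrs):
    """True if a load may be performed before all earlier pending stores.

    pending_store_addrs -- addresses of earlier, not yet performed stores;
                           None means the address is not yet computed
    """
    return all(addr is not None and addr != load_addr
               for addr in pending_store_addrs)


no_conflict = load_may_bypass(0x100, [0x200, 0x300])
same_address = load_may_bypass(0x100, [0x200, 0x100])
unknown_address = load_may_bypass(0x100, [None])
```

A design that speculates on unresolved store addresses would allow the third case as well and would then need a way to undo the load if the addresses turn out to match.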

112
Sequential Consistency (III)
113
Sequential Consistency Model
114
Concept of Load/Store Reordering
115
Principle of Reorder Buffer
116
Use of Reorder Buffer in Commercial Processors
117
Design Space of Reorder Buffers
118
Basic Layout of Reorder Buffers
119
Sample Implementation of Reorder Buffers
120
Comparison of Shelves and Reorder Buffer Entries
121
Part I
Preserving Sequential Consistency of Exception
Processing
122
Sequential Consistency of Exception Processing
123
  • END