Title: CS5222 Advanced Computer Architecture Part 4: Superscalar Processors
1CS5222 Advanced Computer Architecture, Part 4
Superscalar Processors
- Fall Term, 2004/2005
- Chi Chi Hung (email chich_at_comp.nus.edu.sg)
- Building S/17, Rm 5-13
- Phone 874-2832
2Part A
- Emergence of Superscalar Processors
3Introduction
- Three phases
- Idea (in early 70s)
- Architecture proposals and prototype machines
- Commercial products
4Proposed/Prototypes
5Appearance of Superscalar Processors (I)
- The early 90s were when VLSI technology started to accelerate.
6Appearance of Superscalar Processors (II)
- Come from
- Converting from an existing (scalar) RISC line, e.g. Intel 960, MC 88000, HP PA, Sun SPARC, MIPS R, AMD.
- Conceiving a new architecture, e.g. Power 1, Alpha.
- CISC superscalar processors appeared later than RISC ones because of
- Complexity of decoding multiple variable-length instructions.
- Complexity of handling the memory architecture.
- Relatively lower issue rate for CISC processors (note that for the same performance, a RISC superscalar processor needs a higher issue rate).
7Commercial Superscalar Processors
8Part B
- Tasks of Superscalar Processing
9Specific Tasks of Superscalar Processing (I)
10Specific Tasks of Superscalar Processing (II)
- Five aspects
- Parallel decoding
- Sophisticated hardware with a high issue rate.
- Length of the decoding stage: multiple cycles? Predecoding?
- Superscalar instruction issue
- A high issue rate implies a smaller gap between two sequential instructions.
- Amplifies the restrictive effects of control and data dependencies.
- Sample solutions: shelving, register renaming, speculative branch processing.
- Parallel instruction execution
- Preserving sequential consistency of execution
- Retain logical consistency of program execution despite out-of-order execution.
- Preserving sequential consistency of exception processing
11Part C
12Sequential Decoding vs. Parallel Decoding
13Basic Ideas of Parallel Decoding
- Parallel decoding: decoding multiple instructions per cycle.
- Hardware complexity increases with issue rate.
- Check dependencies w.r.t.
- Instructions currently being executed.
- Instruction candidates to be issued next.
- Multiple instructions decoded in a clock cycle
- The decode-issue path becomes critical for clock frequency.
- Solutions
- Multiple pipeline cycles for decoding
- E.g. PowerPC 601/604 and UltraSPARC: 2 cycles; Alpha 21064: 3 cycles; Pentium Pro: 4.5 cycles
- Predecoding
14Principle of Pre-Decoding
- Part of the decode task is performed in the loading phase of the on-chip instruction cache.
- Shortens the overall decoding time, or reduces the number of cycles needed for decoding and instruction issue.
- Appends a number of decode bits to each instruction, indicating e.g.
- Instruction class
- Type of resources required for execution
- Calculation of the branch address (for some processors)
- CISC processors require more bits, for information such as variable instruction length (e.g. the start/end of an instruction).
- Extra space is required; e.g. the K5 adds 5 extra bits to each byte.
- Common to most predominant processor lines.
15Example of Pre-Decoding
16Pre-decode Bits
17Superscalar Processing with Pre-Decoding
18Part D
Superscalar Instruction Issue
19Design Space
- Issue policy specifies how dependencies are handled during the issue process.
- Issue rate specifies the maximum number of instructions a superscalar processor is able to issue in each cycle.
20Design Space of Issue Policies (I)
21Design Space of Issue Policies (II)
- Four main aspects
- False data dependencies
- E.g. WAR, WAW (note that this applies only to registers, not memory).
- Solution: register renaming, i.e. renaming the destination register. The result is written into a dynamically allocated spare register instead of the specified register.
- Unresolved control dependencies
- Solution: speculative branch processing. A guess about the outcome of the unresolved conditional branch is made.
- Use of shelving
- Separates issue/dispatch into two stages.
- Handles blockages either directly (with an issue window) or by decoupling (no dependency checking on issue).
- Handling of issue blockages
- Preserving issue order: in-order vs. out-of-order
- Alignment of issue: aligned vs. unaligned issue
22Principle of Blocking Issue Mode
23Principle of Shelving Shelving
24Design Aspects Related to Handling of Blockages
25Issue Order of Instructions (I)
26Issue Order of Instructions (II)
- In-order
- A dependent instruction blocks the issue of all subsequent instructions until the dependency is resolved.
- Out-of-order
- An independent instruction can be issued even if a dependent instruction is still in the issue window.
- Some processors allow partial out-of-order issue; e.g. the PowerPC 601 issues branches and FP instructions out-of-order, while the MC 88100 does so only for FP instructions.
- Not many processors employ out-of-order issue because
- Preserving sequential consistency requires much more effort.
- Shelving reduces the need for out-of-order issue.
27Aligned Issue of Instructions (I)
28Aligned Issue of Instructions (II)
- Aligned issue
- No instructions of the next window are considered as candidates for issue until all instructions in the current window have been issued.
- Unaligned issue
- A gliding window whose width equals the issue rate is employed.
- In every cycle, all instructions in the window are checked for dependencies. The independent ones are issued, either in-order or out-of-order, and the window is then refilled.
29Most Frequently Used Issue Policies of Scalar
Processors
30Most Frequently Used Issue Policies of
SuperScalar Proc.
31Trend in Instruction Issue Policies
32 Issue Rate (I)
- Issue rate (or superscalarity) refers to the maximum number of instructions a superscalar processor can issue in one cycle.
- A higher issue rate potentially offers higher performance; the cost is more complex circuitry. A balance is needed between the two.
33Issue Rate (II)
34Part E
Superscalar Instruction Issue Shelving
35Introduction
- Eliminates issue blockages due to dependencies.
- Makes use of dedicated instruction buffers, called shelving buffers, in front of the EU(s).
- Shelving decouples dependency checking from instruction issue and defers it to instruction dispatch.
- Decoded instructions are issued to the shelving buffers without any checks for data or control dependencies or for busy EU(s).
- Processors with shelving usually employ in-order, aligned issue policies, together with register renaming and speculative conditional branch execution, so that only true dependencies can block instruction execution. (Why in-order, aligned issue?)
- The dependency check is done during the instruction dispatch phase (from shelving buffer to EU). Dependency-free instructions whose operands are available become eligible for execution: the dataflow principle of operation.
36Principle of Straightforward Issue Policy
37Principle of Shelving
38Design Space of Shelving
39Part E-1
- Design Space Topic of Shelving
- Scope of Shelving
40Scope of Shelving
- Scope of shelving specifies whether shelving is
restricted to a few instruction types or is
performed for all instructions.
41Part E-2
- Design Space Topic of Shelving
- Layout of Shelving Buffers
42Layout of Shelving Buffers
43Part E-2-1
- Design Space Topic of Shelving
- Layout of Shelving Buffers
- Type of Buffers
44Type of Shelving Buffers (I)
- Standalone buffers are buffers used exclusively for shelving.
- Combined buffers are those with multiple functionalities.
45Type of Shelving Buffers (II)
- Standalone using reservation station (RS)
- Individual
- Earliest to be adopted
- In front of each EU
- Size usually small (2-4)
- Group
- Hold instructions for a group of EUs that execute instructions of the same type
- More reliable
- Large in size (8-16)
- Shelving or dispatching more than one instruction
per cycle
46Type of Shelving Buffers (III)
- Standalone using reservation station (RS) (cont'd)
- Central
- Most flexible
- Disadvantages
- Needs a word length equal to the longest possible data word
- Much more complex
- Size: about 20
- Combined buffers (reorder buffer, ROB) for shelving, renaming and reordering
- Expected to be the future trend
47Type of Shelving Buffers (IV)
48Combined Buffer for Shelving, Renaming and
Reordering
49Part E-2-2
- Design Space Topic of Shelving
- Layout of Shelving Buffers
- Number of Buffer Entries
50Shelving Buffer Entries in Superscalar Processors
What types of RSs should be expected?
51Part E-2-3
- Design Space Topic of Shelving
- Layout of Shelving Buffers
- Number of Read/Write Ports
52Number of Read/Write Ports for Shelving Buffers
- Individual reservation stations only need to forward a single instruction per cycle.
- Group/central reservation stations need to deliver multiple instructions per cycle, ideally as many as the number of EUs connected.
- Study the relationship between read/write ports and the number of shelving buffer entries.
53Part E-3
Design Space Topic of Shelving Operand Fetch
Policy
54Types of Operand Fetch Policies (I)
- Two types
- Issue bound
- Operands are fetched during instruction issue.
- Shelving buffers provide entries long enough to hold the source operands.
- Dispatch bound
- Operands are fetched during instruction dispatch.
- Shelving buffers contain short register identifiers.
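The difference in what an RS entry must hold under the two policies can be made concrete. The register file contents and instruction are illustrative.

```python
# Hedged sketch of issue-bound vs. dispatch-bound operand fetch.
# The register values and instruction are invented for the example.
regfile = {"r1": 10, "r2": 20}

def shelve_issue_bound(instr):
    """Issue bound: operand values are read at issue time, so the RS
    entry must be wide enough to hold full source operand values."""
    return {"op": instr["op"],
            "src_values": [regfile[r] for r in instr["src"]]}

def shelve_dispatch_bound(instr):
    """Dispatch bound: the RS entry stores only short register
    identifiers; values are read from the register file at dispatch."""
    return {"op": instr["op"], "src_regs": list(instr["src"])}

instr = {"op": "add", "src": ["r1", "r2"]}
ib_entry = shelve_issue_bound(instr)     # carries the values 10 and 20
db_entry = shelve_dispatch_bound(instr)  # carries only "r1" and "r2"
```

This is exactly why issue-bound shelving buffers are larger and dispatch-bound ones are less complex, as the comparison slide notes.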
55Types of Operand Fetch Policies (II)
56Operand Fetch During Instr. Issue w/ Single
Register File
57Operand Fetch During Instr. Dispatch w/ Single
Register File
58Policies Comparison of Operand Fetch
- Policy comparison
- Issue bound
- The register file supplies all operands for all issued instructions.
- Needs twice as many read ports in the register file as the maximum issue rate.
- The RS entries are relatively larger.
- Dispatch bound
- The number of read ports should equal twice the dispatch rate (note that the maximum dispatch rate is usually higher than the maximum issue rate; why?).
- The critical decode/issue path is shorter.
- Shelving buffers are relatively less complex.
59Issue Bound Operand Fetch with Multiple Register
Files
60Dispatch Bound Operand Fetch with Multiple
Register Files
61MFU Shelving Buffer Types Operand Fetch
Policies
62Part E-4
Design Space Topic of Shelving Instruction
Dispatch Scheme
63Design Space of Inst. Dispatch
- Instruction dispatch involves two basic tasks: scheduling the instructions held in a particular RS for execution, and disseminating the scheduled instruction(s) to the allocated EU(s).
Instruction dispatch scheme
64Part E-4-1
- Design Space Topic of Shelving
- Instruction Dispatch Scheme
- Dispatch policy
65Design Space of Dispatch Policy
Dispatch policy
66Consideration of Dispatch Policy (I)
- Dispatch policy specifies how instructions are selected for execution and how dispatch blockages are handled.
- Selection rule
- Specifies when instructions are considered executable.
- Arbitration rule
- Chooses a subset of instructions when more instructions are eligible for execution than can be disseminated in the next cycle.
- Usually, older instructions are preferred over younger ones.
67Consideration of Dispatch Policy (II)
- Dispatch policy (cont'd)
- Dispatch order
- Will a non-executable instruction block all subsequent instructions from being dispatched?
- Three types
- In-order: simple (only the last instruction needs to be inspected)
- Partially out-of-order (for certain instruction types)
- Out-of-order
- Complex
- Needs to check all instructions in the shelving buffer for executable ones.
- Expected to be used in group or central RSs.
68Dispatch Order
69Part E-4-2
- Design Space Topic of Shelving
- Instruction Dispatch Scheme
- Dispatch rate
70Considerations of Dispatch Rate
- Dispatch rate is defined as the number of instructions that can be dispatched from each reservation station per cycle.
- The ideal dispatch rate is one instruction per EU.
- Easier to achieve with individual and group RSs.
- Future dispatch rates are expected to get higher because of fewer restrictions imposed on data paths, ports, and transistor count.
- Note that very often, the maximum issue rate is less than the maximum dispatch rate.
71Multiplicity of Dispatched Instructions
72Max. Issue and Dispatch Rates of Superscalar Proc.
- Study relationship between issue rate and
dispatch rate.
73Part E-4-3
- Design Space Topic of Shelving
- Instruction Dispatch Scheme
- Checking for Operand Availability
74Intro. to Checking for Operand Availability
- Availability checking is done
- when operands are fetched from the register file, and
- (during dispatch) when checking whether the operands of instructions in the shelving buffers are available.
- Solution: scoreboard
- Direct check of the scoreboard bits
- The RS does not hold any explicit status information indicating whether source operands are available.
- Employed when operands are fetched during instruction dispatch.
- Check of explicit status bits
- Availability is indicated in the RS through status bits.
- Employed if operands are fetched during instruction issue.
- An additional associative search is needed for value updating in the RS.
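The direct-check variant can be sketched as a lookup against per-register valid bits. The scoreboard contents and register names are illustrative.

```python
# Hedged sketch of direct scoreboard checking (dispatch-bound operand
# fetch): the RS holds only register identifiers, and dispatch consults
# the scoreboard bit of each source register. Register names and the
# scoreboard state are invented for the example.
scoreboard = {"r1": True, "r2": False, "r3": True}  # True = value valid

def operands_available(entry):
    """Direct check: there are no status bits in the RS entry itself;
    look up the scoreboard bit of every source register at dispatch."""
    return all(scoreboard[r] for r in entry["src"])

ok = operands_available({"op": "add", "src": ["r1", "r3"]})       # both valid
stalled = operands_available({"op": "sub", "src": ["r1", "r2"]})  # r2 pending
```

Under the explicit-status-bit scheme the same information would live inside each RS entry instead, which is what forces the associative search when a result bus broadcasts a new value.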
75Principle of Scoreboarding
76Scheme for Checking Operand Availability
77Use of Multiple Buses for Updating Multiple RSs
- If multiple RSs exist, their updating must be done globally.
78Updating RSs in case of Multiple Register Files
79Internal Data Paths of PowerPC604
80Part E-4-4
- Design Space Topic of Shelving
- Instruction Dispatch Scheme
- Treatment of Empty Reservation Station
81Treatment of Empty Reservation Station
82Part E-4-5
- Design Space Topic of Shelving
- Instruction Dispatch Scheme
- Typical Dispatch Schemes
83Typical Approaches in Dispatching (I)
- Assumptions for typical solutions
- Register renaming and speculative execution are usually employed.
- If operands are fetched during instruction dispatch, use the direct checking method.
- If operands are fetched during instruction issue, use explicit status bits to maintain and check operand availability.
- An empty RS is usually bypassed.
84Typical Approaches in Dispatching (II)
85Part F
Superscalar Instruction Issue Register Renaming
86Introduction to Register Renaming
- Standard technique for removing false data dependencies (i.e. WAR, WAW).
- Instructions are always turned into three-operand form by renaming the destination operand.
- Two implementations
- Static
- Done by the compiler.
- Dynamic
- Takes place in hardware during execution.
- Requires extra circuitry for supplementary register space, additional data paths, and logic.
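The core of dynamic renaming is a map from architectural to rename registers, consulted for sources and refreshed for every destination. The register names and the `p0, p1, ...` rename-register names below are invented for the sketch.

```python
# Hedged sketch of dynamic register renaming: every destination
# register gets a fresh rename register, so two writes to the same
# architectural register (a WAW hazard) become independent writes.
rename_map = {}                 # architectural reg -> current rename reg
next_free = iter(range(100))    # unbounded free supply, for illustration

def rename(instr):
    """Sources read the latest mapping; the destination is given a
    newly allocated rename register."""
    srcs = [rename_map.get(r, r) for r in instr["src"]]
    dst = f"p{next(next_free)}"
    rename_map[instr["dst"]] = dst
    return {"op": instr["op"], "src": srcs, "dst": dst}

# Two writes to r1 now target different rename registers, and a later
# reader of r1 automatically picks up the newest one.
i1 = rename({"op": "add", "src": ["r2", "r3"], "dst": "r1"})
i2 = rename({"op": "mul", "src": ["r4", "r5"], "dst": "r1"})
i3 = rename({"op": "sub", "src": ["r1"], "dst": "r6"})
```

Real hardware adds what this sketch omits: a bounded rename register file and a reclamation scheme for registers whose mappings have been superseded.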
87Implementation of Register Renaming
88Chronology of Renaming in Commercial Processors
89Design Space of Register Renaming
90Part F-1
- Design Space Topic Register Renaming
- Scope of Register Renaming
91Scope of Renaming
92Part F-2
Design Space Topic Register Renaming Layout of
Rename Buffers
93Layout of Rename Buffers
94Types of Rename Buffers
95Architecture of Rename Buffers
- For a merged architectural and rename register file
- A free physical register is allocated to each destination register specified in an instruction.
- A mapping table is used to track all allocated register pairs.
- A scheme is required to reclaim physical registers no longer in use.
- In all three other cases, intermediate results are held in the respective rename buffer until retirement. During retirement, the content of the rename buffer is written back to the architectural register file.
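The three ingredients of the merged scheme (free list, mapping table, reclamation) fit together as sketched below. The register names and list sizes are illustrative only.

```python
# Hedged sketch of a merged architectural/rename register file: a free
# list supplies physical registers, a mapping table tracks the current
# allocations, and a superseded physical register is reclaimed once the
# newer mapping retires. All names and sizes are invented.
free_list = ["p4", "p5", "p6"]
mapping = {"r1": "p0", "r2": "p1"}   # architectural -> physical

def allocate(arch_reg):
    """Allocate a free physical register to a destination register,
    remembering the old mapping so it can be reclaimed at retirement."""
    old = mapping.get(arch_reg)
    phys = free_list.pop(0)
    mapping[arch_reg] = phys
    return phys, old

def retire(old_phys):
    """Reclaim the superseded physical register."""
    if old_phys is not None:
        free_list.append(old_phys)

phys, old = allocate("r1")   # r1 now maps to p4; p0 awaits reclamation
retire(old)                  # p0 returns to the free list
```

Reclamation is what distinguishes this scheme from the rename-buffer designs: there is no copy back to an architectural file, only a change of which physical register is "architectural".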
96Example of Renaming Architecture Register (I)
97Example of Renaming Architecture Register (II)
98Number of Rename Buffers
99Access Mechanism of Rename Buffers (I)
- Rename buffers need to be accessed in order to
- Fetch operands
- Update rename registers
- Deallocate rename registers
- Two distinct mechanisms
- Associative mechanism
- Indexed access mechanism
100Access Mechanism of Rename Buffers (II)
101Part F-3
Design Space Topic Register Renaming Operand
Fetch Policy
102Operand Fetch Policies of Rename Buffers
- Two policies
- Rename bound
- Fetch referenced operands during renaming
- Dispatch bound
- Defer operand fetch until dispatching
103Part F-4
Design Space Topic Register Renaming Rename Rate
104Rename Rate
- Rename rate is the maximum number of renames per cycle that a processor is able to perform.
- To avoid bottlenecks, the rename rate should equal the issue rate.
- HW requirements: a large number of ports on the register files and the mapping tables.
105Part F-5
Design Space Topic Register Renaming Most
Frequently Used Renaming
106Most Frequently Used Basic Renaming
107Part G
Parallel Execution
108Concept of Parallel Execution
- Independent of whether instructions are issued or dispatched in-order or out-of-order, they will generally be finished out of program order.
- Three terms
- To finish: the operation is completed, except for writing back the result into the architectural register or memory (and status bits).
- To complete: the last action of instruction execution (i.e. the write back to architectural registers) is finished.
- To retire: write back to the architectural registers and delete the completed instruction from the ROB (reorder buffer).
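The finish/retire distinction is easiest to see with a tiny reorder buffer. The two-entry ROB, register names, and values below are invented for the illustration.

```python
# Hedged sketch of finish vs. retire with a reorder buffer:
# instructions may finish out of order, but results are written back
# to the architectural registers strictly in program order.
rob = [{"dst": "r1", "value": None},   # head = oldest instruction
       {"dst": "r2", "value": None}]
arch_regs = {"r1": 0, "r2": 0}

def finish(index, value):
    """Finish: the EU produced a result, but it is recorded only in
    the ROB entry, not yet in the architectural state."""
    rob[index]["value"] = value

def retire():
    """Retire: write back ROB entries in program order, stopping at
    the first unfinished instruction."""
    while rob and rob[0]["value"] is not None:
        entry = rob.pop(0)
        arch_regs[entry["dst"]] = entry["value"]

finish(1, 20)                 # the younger instruction finishes first
retire()                      # nothing retires: the head is unfinished
head_blocked = dict(arch_regs)
finish(0, 10)
retire()                      # now both retire, in program order
```

This in-order write-back is exactly the mechanism behind "strong" processor consistency in the next slides.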
109Part H
Preserving Sequential Consistency of Instruction
Execution
110Sequential Consistency (I)
- Two aspects
- The order in which instructions are completed.
- The order in which memory is accessed due to loads and stores.
- Processor consistency indicates the consistency of instruction completion with sequential instruction execution.
- Two possible processor consistencies
- Weak: instructions may be completed out of order, provided that no data dependencies are violated.
- Strong: instructions are forced to complete in strict program order. Usually achieved with a ROB.
111Sequential Consistency (II)
- Memory consistency indicates whether memory accesses are performed in the same order as on a sequential processor.
- Two possible memory access consistencies
- Weak: memory accesses may be out of order compared with strict sequential program execution, provided that data dependencies are not violated.
- Strong: memory accesses occur strictly in program order.
112Sequential Consistency (III)
113Sequential Consistency Model
114Concept of Load/Store Reordering
115Principle of Reorder Buffer
116Use of Reorder Buffer in Commercial Processors
117Design Space of Reorder Buffers
118Basic Layout of Reorder Buffers
119Sample Implementation of Reorder Buffers
120Comparison of Shelves and Reorder Buffer Entries
121Part I
Preserving Sequential Consistency of Exception
Processing
122Sequential Consistency of Exception Processing