Title: Out-of-Order Speculative Execution
1 Out-of-Order Speculative Execution
- Designing a Configurable Simulator for an OOO Microprocessor
By Mustafa Imran Ali ID 230203
2 Presentation Outline
- Introduction
- Examples - Representative Micro-architectures
- Some Issues - Limitations and Other Approaches
- Simulator Details
3 Out-of-order Speculative Execution: Maximizing ILP
- In-order Execution
- Pipelining: exploiting temporal parallelism through overlap
- Superscalar: more parallelism by allowing multiple instructions to issue
- Problem: Pipeline Stalls
- Data dependencies allow limited ILP
- Large latency functions cause structural hazards
- Data loads: cache miss stalls
4 Out-of-order Speculative Execution
- instructions execute as soon as possible and in parallel with other nondependent work
- results in faster execution because critical-path computations start and complete quickly
- speculatively fetch and execute instructions even though the processor may not know immediately whether the instructions will be on the final execution path
- Multilevel branch prediction to avoid waiting for the outcome of multiple branches
5 OOO Speculative Execution - Benefits
- Reduced reliance on compilers
- Compilers cannot examine runtime dependencies
- No need for recompilation
- Source code access not always possible
- Binary compatibility with existing code
6 OOO Speculative Execution - Problems and Issues
- Overcoming WAW and WAR hazards: register renaming
- More branches per cycle: accurate branch prediction
- Register renaming: dependency-checking mechanism (large number of comparisons)
- Data forwarding from producers to consumers: use of a tagging and broadcast mechanism
- Exceptions: committing instructions in program order
7 Compaq Alpha 21264 (1998)
- OOO superscalar with speculative execution
- Fetches 4 instructions/cycle
- Dynamically issues up to 6 instructions/cycle: 4 integer and 2 floating point
- Can speculate through up to 20 branches
- 64 architectural registers
- 41 integer and 41 floating-point rename registers
- Up to 80 instructions in flight; 32 in-flight loads; 32 in-flight stores
- 20-entry integer queue: issues 4 instructions
- 15-entry floating-point queue: issues 2 instructions
- Can retire at most 11 instructions/cycle and can sustain a rate of 8/cycle (over short periods)
8 Stages in Instruction Pipeline
- All pipeline stages subsequent to the register map stage operate on internal registers rather than user-visible registers
- Dynamically selects from up to 6 instructions
- Issue reordering takes place
- Provides 4 instructions/cycle
- Maps virtual registers to physical registers
9 Register Renaming Process
- assigns a unique storage location to each write-reference to a register
- speculatively allocates a register to each instruction with a register result
- the register only becomes part of the user-visible (architectural) register state when the instruction retires/commits
- allows an instruction to speculatively issue and deposit its result into the register file before the instruction retires
10 Register Renaming Process (continued)
- processor maintains storage with each internal register indicating the user-visible register that is currently associated with the given internal register (if any)
- register renaming is a content-addressable memory (CAM) operation for register sources together with a register allocation for the destination register
- register mapper stores the register map state for each in-flight instruction so that the machine architectural state can be restored in case a misspeculation occurs
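The renaming steps on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the Alpha 21264 mapper: it uses a table indexed by architectural register instead of a CAM, and the class name RenameMap, its methods, and the register counts are all hypothetical.

```python
class RenameMap:
    """Toy rename table: architectural register -> internal (physical) register."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts out in physical register i.
        self.table = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))  # unallocated physical registers

    def rename(self, srcs, dest):
        """Rename one instruction.

        Returns (phys_srcs, phys_dest, stale_phys, checkpoint).
        """
        checkpoint = list(self.table)              # map state saved for this instruction
        phys_srcs = [self.table[s] for s in srcs]  # look up current source mappings
        stale = self.table[dest]                   # old mapping, freed at retirement
        phys_dest = self.free.pop(0)               # unique location for this write
        self.table[dest] = phys_dest
        return phys_srcs, phys_dest, stale, checkpoint

    def restore(self, checkpoint, squashed_dests):
        """Undo a misspeculation: restore the map, reclaim squashed registers."""
        self.table = list(checkpoint)
        self.free.extend(squashed_dests)


rm = RenameMap(num_arch=4, num_phys=8)
# I1: r1 = r2 + r3 ; I2: r1 = r1 + r2 (WAW on r1, true dependency through r1)
s1, d1, stale1, cp1 = rm.rename([2, 3], 1)
s2, d2, stale2, cp2 = rm.rename([1, 2], 1)
print(s1, d1, s2, d2)   # I2 reads I1's new register, and writes a different one
rm.restore(cp2, [d2])   # squash I2 only
```

Each write to r1 receives its own internal register, so the WAW hazard disappears, and restoring the checkpoint taken before I2 puts r1 back on the register allocated by I1.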
11 Map (register rename) and Queue Stages
- The map stage renames programmer-visible register numbers to internal register numbers; structures are duplicated for integer and floating-point execution
- The queue stage stores instructions until they
are ready to issue
12 Out-of-order Issue Queues
- issue queue logic maintains 2 lists of pending instructions in separate integer and floating-point queues
- scoreboards maintain the status of the internal registers by tracking the progress of single-cycle, multiple-cycle, and variable-cycle (memory load) instructions
- the scoreboard unit notifies all instructions in the queue that require the register value when functional-unit or load-data results become available
13 Out-of-order Execution
- Each queue/arbiter selects the oldest operand-ready and functional-unit-ready instructions for execution each cycle
- queues are collapsible: an entry becomes immediately available once the instruction issues or is squashed due to misspeculation
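Oldest-first selection from a collapsible queue can be sketched as follows. This is a simplified model under stated assumptions: the queue is a Python list ordered oldest first, the scoreboard is a set of ready register numbers, and the function name, dictionary keys, and unit names are hypothetical.

```python
def select_and_collapse(queue, ready_regs, free_units, width):
    """Pick up to `width` of the oldest ready instructions.

    queue:      list of dicts, oldest first, with 'srcs' (register numbers) and 'unit'
    ready_regs: set of registers whose values are available (the scoreboard)
    free_units: dict mapping unit name -> number of free functional units this cycle
    Returns (issued, remaining); `remaining` is the collapsed queue.
    """
    issued, remaining = [], []
    units = dict(free_units)  # local copy so the caller's dict is not modified
    for insn in queue:        # iterate oldest first
        operands_ready = all(s in ready_regs for s in insn["srcs"])
        unit_ready = units.get(insn["unit"], 0) > 0
        if len(issued) < width and operands_ready and unit_ready:
            units[insn["unit"]] -= 1
            issued.append(insn)
        else:
            remaining.append(insn)  # younger entries move up: the queue collapses
    return issued, remaining


queue = [
    {"name": "i0", "srcs": [5], "unit": "alu"},     # waits for register 5
    {"name": "i1", "srcs": [1], "unit": "alu"},
    {"name": "i2", "srcs": [2], "unit": "mul"},
    {"name": "i3", "srcs": [1, 2], "unit": "alu"},  # ready, but no ALU left
]
issued, queue = select_and_collapse(queue, {1, 2}, {"alu": 1, "mul": 1}, width=4)
print([i["name"] for i in issued], [i["name"] for i in queue])
```

Here i1 and i2 issue ahead of the older i0, which is still waiting for an operand, and the freed entries are available to new instructions in the next cycle.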
14 Retire Mechanism
- assigns each mapped instruction a slot in a circular in-flight window (in fetch order)
- tracks the internal register usage for all in-flight instructions
- each entry in the mechanism contains storage indicating the internal register that held the old contents of the destination register for the corresponding instruction
- this (stale) register can be freed for other use after the instruction retires
15 Exception Handling
- an exception causes all younger instructions in the in-flight window to be squashed and removed from all queues in the system
- register map is backed up to the state before the last squashed instruction using the saved map state
- registers allocated by the squashed instructions become immediately available
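The retire and exception behavior described on the last two slides can be sketched together. This is a minimal model, assuming a deque as the in-flight window and a full copy of the map saved per instruction; the class InFlightWindow and its method names are hypothetical and not taken from any real design.

```python
from collections import deque


class InFlightWindow:
    """In-order window of renamed instructions (illustrative structure only)."""

    def __init__(self, arch_map, free_regs):
        self.map = dict(arch_map)    # architectural -> internal register
        self.free = list(free_regs)  # internal registers available for allocation
        self.window = deque()        # entries in fetch order, oldest on the left

    def map_insn(self, name, dest):
        saved = dict(self.map)       # map state before this instruction
        stale = self.map[dest]       # register holding the old contents of dest
        new = self.free.pop(0)
        self.map[dest] = new
        self.window.append({"name": name, "new": new, "stale": stale, "saved": saved})

    def retire_oldest(self):
        entry = self.window.popleft()
        self.free.append(entry["stale"])  # stale register is reusable after retirement
        return entry["name"]

    def squash_younger_than(self, name):
        """Squash every instruction younger than `name` (the excepting instruction)."""
        squashed = []
        while self.window and self.window[-1]["name"] != name:
            squashed.append(self.window.pop())            # youngest first
        if squashed:
            self.map = squashed[-1]["saved"]              # state before the oldest squashed one
            self.free.extend(e["new"] for e in squashed)  # their registers are free again
        return [e["name"] for e in reversed(squashed)]


w = InFlightWindow({"r1": 1, "r2": 2}, [3, 4, 5])
w.map_insn("A", "r1")
w.map_insn("B", "r2")
w.map_insn("C", "r1")
squashed = w.squash_younger_than("A")  # exception raised by A
retired = w.retire_oldest()
print(squashed, retired, w.map, sorted(w.free))
```

After the squash, the map again points r1 at the register allocated by A, and the registers taken by B and C return to the free list without waiting for retirement.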
16 HP PA-RISC 8000
17 ROB Size Performance Effect
18 AMD K-5 ROB Entry
19 AMD K-5 Reservation Station Entry
20 Approaches for Billion Transistor Architectures
- Advanced superscalar processors
- scale up from current designs to issue 16 or 32 instructions per cycle
- Superspeculative processors
- enhance wide-issue superscalar performance by speculating aggressively at every point in the processor pipeline
21 SPARC64 V9
22 Pentium III and 4 Register Renaming and ROB
23 One Billion Transistors, One Uniprocessor, One Chip?
24 Superspeculative Architecture
25 Area Issues
- Large circuitry is required to feed the processor with a continuous instruction stream
- Dynamic execution requires a large number of comparisons for dependency checking
- The sizes of the reorder buffer and the reservation stations/rename registers increase accordingly
26 Limitations
- Larger-issue machines have high peak-to-sustained rate ratios (Intel Pentium Pro architecture approach)
- Beyond issue widths of 8, the inherently limited ILP in a single thread gives diminishing returns; more architectures are switching to Simultaneous Multithreading
27 Alternate Approaches
28 OOO Speculative Execution Processor - Simulator Design
- Tracking all the activities of the pipelined machine in each clock cycle
- Issue Unit design that solves structural and data hazards
- Dependency checking mechanisms
- Strategy for sending data from producers to consumers
29 Data Structures
- Instruction Queue
- Execution Tracking Hardware Structure
- Register File Producer Table
- Reservation Stations
- The Reorder Buffer
- Functional Units State Structure
30 Service Functions
- Issue
- Dispatch
- Completion
- CDB Snooping
- Retirement and Writeback
31 Overall Structure
32 Producer Table
- Each register is extended by a tag and a valid flag
- Valid = true iff the register contains appropriate data
- Otherwise, the tag points to the instruction producing the data
33 Reservation Stations
- Full bit is set if the entry is occupied
- Tag points to ROB tag of the instruction
- op1 and op2 hold the source references
34 The Reorder Buffer
- Realized as a FIFO with ROBhead and ROBtail
- New instructions are put at ROBtail, and the instruction is tagged in its RS with this index
- Each cycle the valid bit of the entry at ROBhead is checked for instruction completion
35 Issue Protocol
if (there is a free RS and a free ROB entry)
    RS.full := 1
    RS.tag := ROBtail
    for all operands x of Ii with address r
        if R[r].valid = 1
            RS.opx := R[r]
        else if CDB.tag = R[r].tag and CDB.valid
            RS.opx := CDB
        else
            RS.opx := ROB[R[r].tag]
    if (Ii has a destination register r)
        R[r].tag := ROBtail
        R[r].valid := 0
        ROB[ROBtail].dest := r
    else
        ROB[ROBtail].dest := none
    ROBtail := ROBtail + 1
36 Dispatch Protocol
if there is an RS with RS.opx.valid = 1 for all operands x
        and the function unit is not stalled
    Pass instruction, operands, and tag to FU
    RS.full := 0
37 Completion Protocol
if FU has result and got CDB acknowledge
    CDB.valid := 1
    CDB.data := result from FU
    CDB.tag := tag from FU
    ROB[CDB.tag].valid := 1
    ROB[CDB.tag].data := CDB.data
38 CDB Snooping
for all operands x
    if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag
        RS.opx := CDB
39 Retirement/Writeback Protocol
if ROB not empty and ROB[ROBhead].valid = 1
    if instruction in ROB[ROBhead] requires writeback
        x := ROB[ROBhead].dest
        R[x].data := ROB[ROBhead].data
        if ROBhead = R[x].tag
            R[x].valid := 1
    ROBhead := ROBhead + 1
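The five protocols above (issue, dispatch, completion, CDB snooping, retirement) can be sketched together as one small Python model. It is an illustration under several assumptions that are not in the slides: a single CDB, one functional unit that is always free and simply adds its operands, reservation stations kept as a list of occupied entries, and a bus that is cleared after snooping. The names Machine and Operand are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operand:
    valid: bool = True
    data: int = 0
    tag: int = -1  # ROB index of the producer while not valid


class Machine:
    def __init__(self, nregs=4, nrob=4, nrs=2):
        self.R = [Operand() for _ in range(nregs)]  # register file with producer table
        self.ROB = [Operand(valid=False) for _ in range(nrob)]
        self.dest = [None] * nrob                   # ROB[i].dest
        self.head = self.tail = self.count = 0
        self.RS, self.nrs = [], nrs                 # only occupied stations are stored
        self.cdb = None                             # (tag, data) while the bus is driven

    def issue(self, srcs, dest):
        if len(self.RS) == self.nrs or self.count == len(self.ROB):
            return False                            # no free RS or ROB entry: stall
        ops = []
        for r in srcs:                              # read sources before renaming dest
            reg = self.R[r]
            if reg.valid:
                ops.append(Operand(True, reg.data))
            elif self.cdb and self.cdb[0] == reg.tag:
                ops.append(Operand(True, self.cdb[1]))
            else:
                e = self.ROB[reg.tag]
                ops.append(Operand(e.valid, e.data, reg.tag))
        self.RS.append({"tag": self.tail, "ops": ops})
        self.ROB[self.tail] = Operand(valid=False)
        self.dest[self.tail] = dest
        if dest is not None:
            self.R[dest].valid, self.R[dest].tag = False, self.tail
        self.tail = (self.tail + 1) % len(self.ROB)
        self.count += 1
        return True

    def dispatch(self):
        for i, rs in enumerate(self.RS):
            if all(o.valid for o in rs["ops"]):
                del self.RS[i]                      # RS.full := 0
                return rs["tag"], sum(o.data for o in rs["ops"])  # FU modeled as an adder
        return None

    def complete(self, tag, result):
        self.cdb = (tag, result)
        self.ROB[tag].valid, self.ROB[tag].data = True, result

    def snoop(self):
        if self.cdb is None:
            return
        tag, data = self.cdb
        for rs in self.RS:
            for o in rs["ops"]:
                if not o.valid and o.tag == tag:
                    o.valid, o.data = True, data
        self.cdb = None                             # assumption: bus driven for one cycle

    def retire(self):
        if self.count == 0 or not self.ROB[self.head].valid:
            return False
        x = self.dest[self.head]
        if x is not None:
            self.R[x].data = self.ROB[self.head].data
            if self.R[x].tag == self.head:          # no younger producer of R[x]
                self.R[x].valid = True
        self.head = (self.head + 1) % len(self.ROB)
        self.count -= 1
        return True


m = Machine()
m.R[1].data, m.R[2].data = 5, 7
m.issue([1, 2], 3)  # I0: R3 = R1 + R2
m.issue([3, 1], 3)  # I1: R3 = R3 + R1, waits on I0 through its ROB tag
while m.count:
    out = m.dispatch()
    if out is not None:
        m.complete(*out)
    m.snoop()
    m.retire()
print(m.R[3])
```

When I0 retires, R3 stays invalid because its tag already names I1; only the retirement of I1 marks the register valid, which is the tag comparison in the retirement protocol.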
40 Configurable Parameters
- Probability of memory misses
- Probability of correct branch prediction
- Branch mis-prediction penalty
- Cache miss penalty
- Window Size for instruction issue
- Number of Issues/cycle
- Number of Functional Units (FUs)
- Pipeline Depth/Latency of each FU
- Number of CDBs
- Size of reservation stations/rename registers (RS)
- Operand matching mechanism in each RS
- Size of re-order buffer
- Branch Prediction Mechanisms (optional)
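One way to hold the parameters listed above is a single immutable configuration object. The sketch below is only a suggestion: the class name SimConfig, the field names, and every default value are placeholders chosen for illustration, not values from the presentation.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class SimConfig:
    # All defaults are illustrative placeholders.
    cache_miss_prob: float = 0.05       # probability of a memory miss
    branch_predict_prob: float = 0.90   # probability of a correct branch prediction
    branch_mispredict_penalty: int = 10 # cycles
    cache_miss_penalty: int = 20        # cycles
    window_size: int = 16               # window size for instruction issue
    issue_width: int = 4                # number of issues per cycle
    fu_count: dict = field(default_factory=lambda: {"int": 2, "fp": 1})
    fu_latency: dict = field(default_factory=lambda: {"int": 1, "fp": 4})
    num_cdbs: int = 1
    rs_size: int = 8
    rob_size: int = 32

    def __post_init__(self):
        for name in ("cache_miss_prob", "branch_predict_prob"):
            p = getattr(self, name)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {p}")
        for name in ("window_size", "issue_width", "num_cdbs", "rs_size", "rob_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")


base = SimConfig(issue_width=6, fu_count={"int": 4, "fp": 2})
wide = replace(base, rob_size=80)  # vary one parameter, keep the rest
print(base.rob_size, wide.rob_size)
```

Keeping the object frozen means each simulation run is tied to exactly one parameter set, and dataclasses.replace gives a cheap way to sweep a single parameter.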
41 Performance Metrics
- Number of clock cycles on an instruction trace
- Number of stalls (various types)
- Effect on hardware costs
- Peak vs. sustained rates (actual issues vs. maximum possible)
- Percentage resource utilization
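Several of these metrics can be derived from per-cycle counters that a simulator would record. The sketch below assumes two such counters, the number of instructions issued in each cycle and the number of busy cycles per functional unit; the function name and the dictionary keys are hypothetical, and a cycle with zero issues is counted as a stall without distinguishing stall types.

```python
def summarize(issued_per_cycle, issue_width, busy_cycles_per_fu):
    """Compute basic metrics from per-cycle simulator counters."""
    cycles = len(issued_per_cycle)
    total = sum(issued_per_cycle)
    sustained = total / cycles  # actual issues per cycle
    return {
        "cycles": cycles,
        "sustained_rate": sustained,
        # Peak rate is the issue width; the ratio grows as the machine idles more.
        "peak_to_sustained": issue_width / sustained if sustained else float("inf"),
        "stall_cycles": sum(1 for n in issued_per_cycle if n == 0),
        "fu_utilization": {fu: busy / cycles for fu, busy in busy_cycles_per_fu.items()},
    }


stats = summarize([4, 2, 0, 2], issue_width=4, busy_cycles_per_fu={"alu": 3, "mul": 1})
print(stats)
```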
42 OOO Speculative Micro-architecture Simulators
- SimpleScalar
- University of Wisconsin-Madison
- www.simplescalar.com
- KScalar
- Universidad Autónoma de Barcelona
- www.caos.uab.es/kscalar
43 SimpleScalar v3.0
- tool set includes sample simulators ranging from a fast functional simulator to a detailed, dynamically scheduled processor model that supports non-blocking caches, speculative execution, and state-of-the-art branch prediction
- includes performance visualization tools, statistical analysis resources, and debug and verification infrastructure
- includes a machine definition infrastructure that permits most architectural details to be separated from simulator implementations
44 KScalar
- allows analyzing the performance behavior of a wide range of processor microarchitectures, from a very simple in-order, scalar pipeline to a detailed out-of-order, superscalar pipeline with non-blocking caches, speculative execution, and complex branch prediction
- The simulator interprets executables for the Alpha AXP instruction set, from very short program fragments to large applications
- The object program's execution may be simulated at varying levels of detail: either cycle-by-cycle, observing all the pipeline events that determine processor performance, or millions of cycles at once, taking statistics of the main performance issues
45 Study Direction
- Modeling and comparison of representative micro-architectures
- Parameters modeling a commercial micro-architecture's OOO speculative execution core
- SPEC benchmark instruction traces
- analysis of the relative importance of supporting assumptions
46 Study Direction (continued)
- Modeling resource utilization of a Simultaneous Multithreaded workload
- Comparison of resource utilization and performance metrics of single-thread vs. SMT execution
- Use of instruction traces that model a multi-thread workload (e.g. modeling Hyperthreading in Pentium 4)