Out-of-Order Speculative Execution - PowerPoint PPT Presentation

About This Presentation

Title:

Out-of-Order Speculative Execution

Description:

COE 501 Presentation by Mustafa Imran Ali.

Slides: 47

Provided by: MustafaI7

Transcript and Presenter's Notes

Title: Out-of-Order Speculative Execution

1
Out-of-Order Speculative Execution

Designing a Configurable Simulator for an OOO
Microprocessor

By Mustafa Imran Ali ID 230203
2
Presentation Outline

Introduction
Examples - Representative Micro-architectures
Some Issues - Limitations and Other Approaches
Simulator Details

3
Out-of-order Speculative Execution: Maximizing
ILP

In-order Execution
Pipelining: exploiting temporal parallelism
through overlap
Superscalar: more parallelism by allowing
multiple instructions to issue
Problem: Pipeline Stalls
Data dependencies allow limited ILP
Large latency functions cause structural hazards
Data loads: cache miss stalls

4
Out-of-order Speculative Execution

instructions execute as soon as possible and in
parallel with other nondependent work
results in faster execution because critical-path
computations start and complete quickly
speculatively fetches and executes instructions
even though the processor may not know immediately
whether the instructions will be on the final
execution path
Multilevel branch prediction avoids waiting for
the outcome of multiple branches

5
OOO Speculative Execution - Benefits

Reduced reliance on compilers
Compilers cannot examine runtime dependencies
No need for recompilation
Source code access not always possible
Binary compatibility with existing code

6
OOO Speculative Execution - Problems and Issues

Overcoming WAW and WAR hazards: Register
Renaming
More branches/cycle: accurate branch prediction
Register Renaming: Dependency checking mechanism
(large comparisons)
Data forwarding from producers to consumers: use
of tagging and broadcast mechanism
Exceptions: committing instructions in program
order

7
Compaq Alpha 21264 (1998)

OOO superscalar with speculative execution
Fetches 4 instructions/cycle
Dynamically issues up to 6 instructions/cycle: 4
integer and 2 floating point
Can speculate through up to 20 branches
64 architectural registers
41 integer and 41 floating-point rename registers
Up to 80 instructions in flight: 32 in-flight
loads, 32 in-flight stores
20-entry integer queue -> issues 4 instructions
15-entry floating-point queue -> issues 2
instructions
Can retire at most 11 instructions/cycle; can
sustain a rate of 8/cycle (over short periods)

8
Stages in Instruction Pipeline
All pipeline stages subsequent to the register
map stage operate on internal registers rather
than user-visible registers
Dynamically selects from up to 6 instructions
Issue reordering takes place
Provides 4 instructions/cycle
Maps virtual registers to physical registers
9
Register Renaming Process

assigns a unique storage location to each
write-reference to a register
speculatively allocates a register to each
instruction with a register result
register only becomes part of the user-visible
(architectural) register state when the
instruction retires/commits
allows instruction to speculatively issue and
deposit its result into the register file before
the instruction retires

10
Register Renaming Process (continued)

processor maintains storage with each internal
register indicating the user-visible register
that is currently associated with the given
internal register (if any)
register renaming is a content-addressable memory
(CAM) operation for register sources together
with a register allocation for the destination
register
register mapper stores the register map state for
each in-flight instruction so that the machine
architectural state can be restored in case a
misspeculation occurs
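
The renaming steps on the two slides above can be sketched in a few lines of Python. This is a minimal illustration, not the 21264's mapper: it uses a plain table lookup instead of a CAM, and the names RenameMap, rename, and restore are hypothetical. restore only rolls the map back; returning the squashed registers to the free list is left out.

```python
class RenameMap:
    """Toy register mapper: architectural -> internal (physical) registers."""

    def __init__(self, num_arch, num_phys):
        # Assumption: architectural register i starts in internal register i.
        self.map = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))
        self.checkpoints = {}  # instruction id -> map state before it

    def rename(self, inst_id, srcs, dest=None):
        """Return (internal sources, new dest register, stale dest register)."""
        if dest is not None and not self.free:
            raise RuntimeError("no free internal register: stall")
        self.checkpoints[inst_id] = list(self.map)   # saved map state
        phys_srcs = [self.map[s] for s in srcs]      # source lookup
        new = stale = None
        if dest is not None:
            new = self.free.pop(0)                   # fresh location per write
            stale = self.map[dest]                   # old contents of dest
            self.map[dest] = new
        return phys_srcs, new, stale

    def restore(self, inst_id):
        """Roll the map back to the state before instruction inst_id."""
        self.map = list(self.checkpoints[inst_id])


rm = RenameMap(num_arch=4, num_phys=8)
first = rm.rename(0, srcs=[2, 3], dest=1)    # r1 = r2 op r3
second = rm.rename(1, srcs=[1, 2], dest=1)   # r1 = r1 op r2 (WAW on r1)
rm.restore(1)                                # squash the second instruction
```

The two writes to r1 land in different internal registers, so the WAW hazard disappears, and the second instruction reads r1 from the register the first one allocated.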

11
Map (register rename) and Queue Stages

The map stage renames programmer-visible
register numbers to internal register numbers

structures are duplicated for integer and
floating point execution

The queue stage stores instructions until they
are ready to issue

12
Out-of-order Issue Queues

issue queue logic maintains 2 lists of pending
instructions in separate integer and
floating-point queues
scoreboards maintain status of the internal
registers by tracking the progress of
single-cycle, multiple-cycle, and variable-cycle
(memory load) instructions
the scoreboard unit notifies all instructions in
the queue that require the register value when
functional unit or load-data results become
available

13
Out-of-order Execution

Each queue/arbiter selects the oldest
operand-ready and functional-unit-ready
instructions for execution each cycle
queues are collapsible: an entry becomes
immediately available once the instruction issues
or is squashed due to misspeculation
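
A small Python sketch of the select-and-collapse behaviour described on the last two slides. It is an assumption-laden toy: the class IssueQueue and its methods are hypothetical names, functional-unit readiness is ignored, and only operand readiness decides what can issue.

```python
class IssueQueue:
    """Toy collapsible issue queue; entries are kept oldest first."""

    def __init__(self, issue_width):
        self.issue_width = issue_width
        self.entries = []

    def insert(self, name, src_regs):
        # "waiting" holds the internal registers whose values are not ready yet.
        self.entries.append({"name": name, "waiting": set(src_regs)})

    def wakeup(self, reg):
        """Scoreboard notification: internal register reg now has its value."""
        for e in self.entries:
            e["waiting"].discard(reg)

    def select(self):
        """Issue up to issue_width of the oldest operand-ready instructions."""
        ready = [e for e in self.entries if not e["waiting"]][: self.issue_width]
        # Collapse: issued entries leave the queue and younger ones move up.
        self.entries = [e for e in self.entries
                        if not any(e is r for r in ready)]
        return [e["name"] for e in ready]


q = IssueQueue(issue_width=2)
q.insert("load", [])
q.insert("add", [5])     # waits for internal register 5
q.insert("mul", [])
q.insert("sub", [])
cycle1 = q.select()      # add is skipped, so mul issues ahead of it
q.wakeup(5)
cycle2 = q.select()
```

Here the mul issues before the older add because the add is still waiting on an operand, which is the reordering the queue exists to allow.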

14
Retire Mechanism

assigns each mapped instruction a slot in a
circular in-flight window (in fetch order)
tracks the internal register usage for all
in-flight instructions
each entry in the mechanism contains storage
indicating the internal register that held the
old contents of the destination register for the
corresponding instruction
this (stale) register can be freed for other use
after the instruction retires

15
Exception Handling

exception causes all younger instructions in the
in-flight window to be squashed and removed
from all queues in the system
register map is backed up to the state before the
last squashed instruction using the saved map
state
registers allocated by the squashed instructions
become immediately available
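
The retire and squash bookkeeping from the two slides above, sketched in Python under simplifying assumptions. InFlightWindow and its method names are hypothetical, every instruction is assumed to write a register, and each window entry keeps a full copy of the map (real hardware would store this far more compactly).

```python
from collections import deque


class InFlightWindow:
    """Toy in-flight window tracking internal register usage."""

    def __init__(self, num_arch, num_phys):
        self.map = list(range(num_arch))
        self.free = deque(range(num_arch, num_phys))
        self.window = deque()   # fetch order, oldest on the left

    def allocate(self, dest):
        saved_map = list(self.map)     # map state before this instruction
        new = self.free.popleft()
        stale = self.map[dest]         # register holding dest's old contents
        self.map[dest] = new
        self.window.append({"new": new, "stale": stale, "saved_map": saved_map})

    def retire(self):
        """Retire the oldest instruction; its stale register can be reused."""
        entry = self.window.popleft()
        self.free.append(entry["stale"])

    def squash_after(self, index):
        """Squash every instruction younger than window position index."""
        restored = None
        while len(self.window) > index + 1:
            entry = self.window.pop()          # youngest first
            self.free.append(entry["new"])     # freed immediately
            restored = entry["saved_map"]
        if restored is not None:
            self.map = restored                # state before oldest squashed


w = InFlightWindow(num_arch=4, num_phys=8)
w.allocate(dest=1)     # the excepting instruction
w.allocate(dest=1)     # younger
w.allocate(dest=2)     # younger
w.squash_after(0)      # exception at position 0
w.retire()
```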

16
HP PA-RISC 8000
17
ROB Size Performance Effect
18
AMD K-5 ROB Entry
19
AMD K-5 Reservation Station Entry
20
Approaches for Billion Transistor Architectures

Advanced superscalar processors
scale up from current designs to issue 16 or 32
instructions per cycle
Superspeculative processors
enhance wide-issue superscalar performance by
speculating aggressively at every point in the
processor pipeline

21
SPARC64 V9
22
Pentium III and 4 Register Renaming and ROB
23
One Billion Transistors, One Uniprocessor, One
Chip?
24
Superspeculative Architecture
25
Area Issues

Large circuitry is required to feed the processor
with a continuous instruction stream
Dynamic execution requires a large number of
comparisons for dependency checking
The sizes of the reorder buffer and reservation
stations/rename registers increase accordingly

26
Limitations

Larger issue machines have high peak-to-sustained
rate ratios: Intel Pentium Pro architecture
approach
Beyond issue widths of 8, the inherently limited
ILP in a single thread gives diminishing returns
More architectures are switching to Simultaneous
Multithreading

27
Alternate Approaches
28
OOO Speculative Execution Processor - Simulator
Design

Tracking all the activities of the pipelined
machine in each clock cycle
Issue Unit design that solves structural and data
hazards
Dependency checking Mechanisms
Strategy for sending data from producers to
consumers

29
Data Structures

Instruction Queue
Execution Tracking Hardware Structure
Register File Producer Table
Reservation Stations
The Reorder Buffer
Functional Units State Structure

30
Service Functions

Issue
Dispatch
Completion
CDB Snooping
Retirement and Writeback

31
Overall Structure
32
Producer Table

Each register is extended by a tag and valid flag
Valid = true iff register contains appropriate data
Otherwise, tag points to the instruction producing the data

33
Reservation Stations

Full bit is set if entry occupied
Tag points to ROB tag of the instruction
op1 and op2 hold the source references

34
The Reorder Buffer

Realized as a FIFO with ROBhead and ROBtail
New instructions are put at ROBtail, and the
instruction is tagged in the RS with this index.
Each cycle the ROBhead valid entry is checked for
instruction completion
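
A minimal Python sketch of the reorder buffer as a circular FIFO, as described above. The class and method names are hypothetical, and an explicit count field is an added assumption used to tell a full buffer from an empty one.

```python
class ReorderBuffer:
    """Toy ROB realized as a circular FIFO."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0    # ROBhead: oldest in-flight instruction
        self.tail = 0    # ROBtail: next free slot
        self.count = 0

    def allocate(self, dest):
        """Put a new instruction at ROBtail; the slot index is its tag."""
        if self.count == len(self.entries):
            raise RuntimeError("ROB full: issue must stall")
        tag = self.tail
        self.entries[tag] = {"dest": dest, "valid": False, "data": None}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def complete(self, tag, data):
        """Record a finished result for the instruction with this tag."""
        self.entries[tag].update(valid=True, data=data)

    def retire(self):
        """Return the head entry if it has completed, otherwise None."""
        if self.count == 0 or not self.entries[self.head]["valid"]:
            return None
        entry = self.entries[self.head]
        self.head = (self.head + 1) % len(self.entries)
        self.count -= 1
        return entry


rob = ReorderBuffer(size=4)
t0 = rob.allocate(dest=1)
t1 = rob.allocate(dest=2)
rob.complete(t1, 7)        # the younger instruction finishes first
blocked = rob.retire()     # head has not completed yet
rob.complete(t0, 3)
first = rob.retire()
second = rob.retire()
```

Even though the younger instruction completes first, nothing retires until the head entry is valid, which keeps retirement in program order.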

35
Issue Protocol
if (there is a free RS and a free ROB entry)
    RS.full := 1; RS.tag := ROBtail
    for all operands x of Ii with address r
        if R[r].valid = 1
            RS.opx := R[r]
        else if CDB.tag = R[r].tag and CDB.valid
            RS.opx := CDB
        else
            RS.opx := ROB[R[r].tag]
    if (Ii has a destination register r)
        R[r].tag := ROBtail; R[r].valid := 0
        ROB[ROBtail].dest := r
    else
        ROB[ROBtail].dest := none
    ROBtail := ROBtail + 1
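
The issue protocol above could be written in Python roughly as follows. This is a sketch under assumptions, not the presenter's simulator: the machine state is a plain dictionary, and make_machine, issue, and all field names are hypothetical.

```python
def make_machine(num_regs=4, num_rs=2, rob_size=4):
    """Build an empty toy machine state (all field names are assumptions)."""
    return {
        "R": [{"valid": True, "data": 0, "tag": None} for _ in range(num_regs)],
        "RS": [{"full": False} for _ in range(num_rs)],
        "ROB": [{"valid": False, "data": None, "dest": None}
                for _ in range(rob_size)],
        "tail": 0, "count": 0,
        "CDB": {"valid": False, "tag": None, "data": None},
    }


def issue(m, op, srcs, dest=None):
    """Try to issue one instruction; return False when it must stall."""
    rs = next((s for s in m["RS"] if not s["full"]), None)
    if rs is None or m["count"] == len(m["ROB"]):
        return False
    tail, cdb, ops = m["tail"], m["CDB"], []
    for r in srcs:
        reg = m["R"][r]
        if reg["valid"]:                                  # register file
            ops.append({"valid": True, "data": reg["data"], "tag": None})
        elif cdb["valid"] and cdb["tag"] == reg["tag"]:   # on the CDB now
            ops.append({"valid": True, "data": cdb["data"], "tag": None})
        else:                                             # producer's ROB entry
            e = m["ROB"][reg["tag"]]
            ops.append({"valid": e["valid"], "data": e["data"],
                        "tag": reg["tag"]})
    rs.update(full=True, tag=tail, op=op, ops=ops)
    m["ROB"][tail] = {"valid": False, "data": None, "dest": dest}
    if dest is not None:
        m["R"][dest].update(valid=False, tag=tail)        # tail is the producer
    m["tail"] = (tail + 1) % len(m["ROB"])
    m["count"] += 1
    return True


m = make_machine()
m["R"][2]["data"] = 5
ok1 = issue(m, "add", srcs=[2, 3], dest=1)   # r1 = r2 + r3
ok2 = issue(m, "mul", srcs=[1, 2], dest=3)   # r3 = r1 * r2, waits on tag 0
ok3 = issue(m, "sub", srcs=[0], dest=2)      # no free RS: stall
```

Source operands are read before the destination register is re-tagged, so an instruction that reads and writes the same register still sees the older producer.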
36
Dispatch Protocol
if there is an RS with RS.opx.valid = 1 for all
operands x and the function unit is not stalled
    Pass instruction, operands, and tag to FU
    RS.full := 0
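
A short Python sketch of the dispatch step above. It assumes a single functional unit and reservation stations stored as dictionaries with hypothetical fields full, op, tag, and ops; none of these names come from the slides.

```python
def dispatch(stations, fu_stalled):
    """Send one ready instruction to the functional unit, if possible.

    Returns (op, operand values, tag), or None when nothing can be sent.
    """
    if fu_stalled:
        return None
    for rs in stations:
        if rs["full"] and all(o["valid"] for o in rs["ops"]):
            rs["full"] = False     # the station is free again
            return rs["op"], [o["data"] for o in rs["ops"]], rs["tag"]
    return None


stations = [
    {"full": True, "op": "mul", "tag": 1,
     "ops": [{"valid": False, "data": None, "tag": 0},
             {"valid": True, "data": 5, "tag": None}]},
    {"full": True, "op": "add", "tag": 0,
     "ops": [{"valid": True, "data": 5, "tag": None},
             {"valid": True, "data": 0, "tag": None}]},
]
stalled = dispatch(stations, fu_stalled=True)
sent = dispatch(stations, fu_stalled=False)
nothing = dispatch(stations, fu_stalled=False)   # mul still waits on tag 0
```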
37
Completion Protocol
if FU has result and got CDB acknowledge
    CDB.valid := 1; CDB.data := result from FU
    CDB.tag := tag from FU
    ROB[CDB.tag].valid := 1
    ROB[CDB.tag].data := CDB.data
38
CDB Snooping
For all operands x
    if RS.full = 1 and RS.opx.valid = 0 and RS.opx.tag = CDB.tag
        RS.opx := CDB
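
Completion and CDB snooping fit naturally into one Python function, sketched below. The function name complete and the dictionary fields are hypothetical, CDB arbitration (the acknowledge signal) is assumed to have been granted already, and a single CDB is assumed.

```python
def complete(cdb, rob, stations, tag, result):
    """Broadcast an FU result on the CDB, update the ROB, let stations snoop."""
    # Completion: drive the bus and mark the ROB entry as finished.
    cdb.update(valid=True, tag=tag, data=result)
    rob[tag].update(valid=True, data=result)
    # Snooping: every occupied station compares its pending operand tags.
    for rs in stations:
        if not rs["full"]:
            continue
        for o in rs["ops"]:
            if not o["valid"] and o["tag"] == cdb["tag"]:
                o.update(valid=True, data=cdb["data"])


cdb = {"valid": False, "tag": None, "data": None}
rob = [{"valid": False, "data": None, "dest": 1},
       {"valid": False, "data": None, "dest": 3}]
stations = [
    {"full": True, "op": "mul", "tag": 1,
     "ops": [{"valid": False, "data": None, "tag": 0},
             {"valid": True, "data": 5, "tag": None}]},
]
complete(cdb, rob, stations, tag=0, result=12)
```

After the broadcast, the waiting mul has both operands and becomes eligible for dispatch in a later cycle.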
39
Retirement/Writeback Protocol
if ROB not empty and ROB[ROBhead].valid = 1
    if instruction in ROB[ROBhead] requires
    writeback
        x := ROB[ROBhead].dest
        R[x].data := ROB[ROBhead].data
        if ROBhead = R[x].tag
            R[x].valid := 1
    ROBhead := ROBhead + 1
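
The retirement step can be sketched in Python as below. The function retire and the explicit head and count arguments are assumptions made to keep the example self-contained; the register and ROB entries are plain dictionaries with hypothetical field names.

```python
def retire(regs, rob, head, count):
    """Retire the ROB head if it has completed; return the new (head, count)."""
    if count == 0 or not rob[head]["valid"]:
        return head, count
    x = rob[head]["dest"]
    if x is not None:                      # instruction requires writeback
        regs[x]["data"] = rob[head]["data"]
        if regs[x]["tag"] == head:         # no younger producer of x in flight
            regs[x]["valid"] = True
    return (head + 1) % len(rob), count - 1


# Register 0 was written by ROB entry 0 and then again by ROB entry 1.
regs = [{"valid": False, "data": 0, "tag": 1}]
rob = [{"valid": True, "data": 7, "dest": 0},
       {"valid": True, "data": 9, "dest": 0}]
head, count = retire(regs, rob, head=0, count=2)
after_first = dict(regs[0])
head, count = retire(regs, rob, head, count)
```

The tag comparison is what keeps the register marked invalid after the first retirement: a younger instruction is still due to overwrite it.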
40
Configurable Parameters

Probability of memory misses
Probability of correct branch prediction
Branch mis-prediction penalty
Cache miss penalty
Window Size for instruction issue
Number of Issues/cycle
Number of Functional Units (FUs)
Pipeline Depth/Latency of each FU
Number of CDBs
Size of reservation stations/rename registers
(RS)
Operand matching mechanism in each RS
Size of re-order buffer
Branch Prediction Mechanisms (optional)
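
One way to hold these parameters in a simulator is a single configuration object, sketched here in Python. Every field name and default value below is a placeholder assumption chosen for illustration; none of the numbers come from the presentation.

```python
from dataclasses import dataclass, field


@dataclass
class SimConfig:
    """Hypothetical parameter set for a configurable OOO simulator."""
    memory_miss_prob: float = 0.05        # probability of a memory miss
    branch_correct_prob: float = 0.90     # probability of a correct prediction
    branch_mispredict_penalty: int = 7    # cycles
    cache_miss_penalty: int = 20          # cycles
    window_size: int = 16                 # instructions considered for issue
    issue_width: int = 4                  # issues per cycle
    num_cdbs: int = 1
    rs_size: int = 8                      # reservation station entries
    rob_size: int = 32                    # reorder buffer entries
    # Functional units: count and pipeline depth/latency per unit type.
    functional_units: dict = field(default_factory=lambda: {
        "int_alu": {"count": 2, "latency": 1},
        "fp_mul": {"count": 1, "latency": 4},
        "load_store": {"count": 1, "latency": 2},
    })

    def validate(self):
        """Reject values that cannot describe a machine; return self."""
        for name in ("memory_miss_prob", "branch_correct_prob"):
            p = getattr(self, name)
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"{name} must be in [0, 1], got {p}")
        for name in ("window_size", "issue_width", "num_cdbs",
                     "rs_size", "rob_size"):
            if getattr(self, name) < 1:
                raise ValueError(f"{name} must be at least 1")
        return self


cfg = SimConfig(issue_width=8, rob_size=64).validate()
```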

41
Performance Metrics

Number of Clock cycles on an instruction trace
Number of Stalls (Various Types)
Effect on Hardware costs
Peak vs. Sustained Rates (actual issues vs.
maximum possible)
Percentage Resource Utilization
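
Several of these metrics follow directly from a few counters. The Python sketch below shows one possible definition; the function name summarize, its arguments, and the exact formulas (for example, utilization as busy cycles over available unit-cycles) are assumptions rather than definitions given in the slides.

```python
def summarize(cycles, issued, issue_width, fu_busy_cycles, num_fus):
    """Derive rate and utilization figures from raw simulation counters."""
    sustained = issued / cycles                 # actual issues per cycle
    return {
        "sustained_issue_rate": sustained,
        "peak_to_sustained": issue_width / sustained,
        "issue_utilization_pct": 100 * sustained / issue_width,
        "fu_utilization_pct": 100 * fu_busy_cycles / (cycles * num_fus),
    }


stats = summarize(cycles=1000, issued=2000, issue_width=4,
                  fu_busy_cycles=3000, num_fus=6)
```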

42
OOO Speculative Micro-architecture Simulators

SimpleScalar
University of Wisconsin in Madison
www.simplescalar.com
KScalar
Universidad Autónoma de Barcelona
www.caos.uab.es/kscalar

43
SimpleScalar v3.0

tool set includes sample simulators ranging from
a fast functional simulator to a detailed,
dynamically scheduled processor model that
supports non-blocking caches, speculative
execution, and state-of-the-art branch prediction
includes performance visualization tools,
statistical analysis resources, and debug and
verification infrastructure
includes a machine definition infrastructure that
permits most architectural details to be
separated from simulator implementations

44
KScalar

allows analyzing the performance behavior of a
wide range of processor microarchitectures from
a very simple in-order, scalar pipeline, to a
detailed out-of-order, superscalar pipeline with
non-blocking caches, speculative execution, and
complex branch prediction
The simulator interprets executables for the
Alpha AXP instruction set from very short
program fragments to large applications
The object program's execution may be simulated
in varying levels of detail: either
cycle-by-cycle, observing all the pipeline events
that determine processor performance,
or millions of cycles at once, taking statistics of
the main performance issues

45
Study Direction

Modeling and comparison of representative
Micro-architectures
Parameters modeling commercial micro-architectures'
OOO speculative execution cores
SPEC benchmark instruction traces
analysis of relative importance of supporting
assumptions

46
Study Direction (continued)

Modeling Resource Utilization of Simultaneous
Multithreaded Workload
Comparison of resource utilization and
performance metrics of single-thread vs. SMT
execution
Use of instruction traces that model multi-thread
workload (e.g. modeling Hyperthreading in Pentium
4)
