Out-of-Order Speculative Execution - PowerPoint PPT Presentation

COE 501 presentation by Mustafa Imran Ali.

1
Out-of-Order Speculative Execution
  • Designing a Configurable Simulator for an OOO
    Microprocessor

By Mustafa Imran Ali ID 230203
2
Presentation Outline
  • Introduction
  • Examples - Representative Micro-architectures
  • Some Issues - Limitations and Other Approaches
  • Simulator Details

3
Out-of-order Speculative Execution: Maximizing ILP
  • In-order Execution
  • Pipelining: exploiting temporal parallelism
    through overlap
  • Superscalar: more parallelism by allowing
    multiple instructions to issue
  • Problem: Pipeline Stalls
  • Data dependencies allow limited ILP
  • Large latency functions cause structural hazards
  • Data loads: Cache miss stalls

4
Out-of-order Speculative Execution
  • instructions execute as soon as possible and in
    parallel with other nondependent work
  • results in faster execution because critical-path
    computations start and complete quickly
  • the processor speculatively fetches and executes
    instructions even though it may not know
    immediately whether they will be on the final
    execution path
  • Multilevel Branch prediction to avoid waiting for
    outcome of multiple branches
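
The benefit of starting independent work early can be illustrated with a small timing model. This is only a sketch under stated assumptions: the three-instruction trace, its latencies, and both function names are hypothetical, and the out-of-order model assumes unlimited issue width and functional units.

```python
# Hypothetical trace: (name, names of producer instructions, latency in cycles).
TRACE = [
    ("load", [], 4),         # long-latency load, e.g. a cache miss
    ("add", ["load"], 1),    # consumes the load result
    ("mul", [], 2),          # independent of both instructions above
]

def in_order_cycles(trace):
    """One instruction starts per cycle, strictly in program order."""
    done, prev_start = {}, -1
    for name, producers, latency in trace:
        # Wait for the previous instruction to start and for all producers.
        start = max([prev_start + 1] + [done[p] for p in producers])
        done[name] = start + latency
        prev_start = start
    return max(done.values())

def out_of_order_cycles(trace):
    """Each instruction starts as soon as its producers have finished."""
    done = {}
    for name, producers, latency in trace:
        start = max([0] + [done[p] for p in producers])
        done[name] = start + latency
    return max(done.values())
```

On this trace the in-order model needs 7 cycles and the out-of-order model 5, because `mul` no longer waits behind the stalled `add`.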

5
OOO Speculative Execution - Benefits
  • Reduced reliance on compilers
  • Compilers cannot examine runtime dependencies
  • No need for recompilation
  • Source code access not always possible
  • Binary compatibility with existing code

6
OOO Speculative Execution - Problems and Issues
  • Overcoming WAW and WAR hazards: Register
    Renaming
  • More branches/cycle: accurate branch prediction
  • Register Renaming: Dependency checking mechanism
    (Large comparisons)
  • Data forwarding from producers to consumers: use
    of tagging and broadcast mechanism
  • Exceptions: Committing instructions in program
    order

7
Compaq Alpha 21264 (1998)
  • OOO superscalar with speculative execution
  • Fetches 4 instructions/cycle
  • Dynamically issues up to 6 instructions/cycle:
    4 integer and 2 floating point
  • Can speculate through up to 20 branches
  • 64 architectural registers
  • 41 integer and 41 floating-point rename
    registers
  • Up to 80 instructions in flight, including up to
    32 in-flight loads and 32 in-flight stores
  • 20-entry integer queue: issues 4 instructions
  • 15-entry floating-point queue: issues 2
    instructions
  • Can retire at most 11 instructions/cycle, can
    sustain a rate of 8/cycle (over short periods)

8
Stages in Instruction Pipeline
(Pipeline diagram; callouts recovered from the figure:)
  • All pipeline stages subsequent to the register
    map stage operate on internal registers rather
    than user-visible registers
  • Dynamically selects from up to 6 instructions
  • Issue reordering takes place
  • Provides 4 instructions/cycle
  • Maps virtual registers to physical registers
9
Register Renaming Process
  • assigns a unique storage location to each
    write-reference to a register
  • speculatively allocates a register to each
    instruction with a register result
  • register only becomes part of the user-visible
    (architectural) register state when the
    instruction retires/commits
  • allows instruction to speculatively issue and
    deposit its result into the register file before
    the instruction retires

10
Register Renaming Process (continued)
  • processor maintains storage for each internal
    register indicating the user-visible register
    that is currently associated with the given
    internal register (if any)
  • register renaming is a content-addressable memory
    (CAM) operation for register sources together
    with a register allocation for the destination
    register
  • register mapper stores the register map state for
    each in-flight instruction so that the machine
    architectural state can be restored in case a
    misspeculation occurs
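
The renaming step can be sketched as follows. This is a minimal illustration, not the 21264 mechanism: it uses a plain lookup table indexed by architectural register instead of the CAM described above, and the class name `Renamer`, its fields, and the register counts in the usage note are all hypothetical.

```python
class Renamer:
    """Maps architectural registers to internal (physical) registers."""

    def __init__(self, num_arch, num_phys):
        # Initially architectural register i lives in internal register i.
        self.map = list(range(num_arch))
        # Remaining internal registers are free for speculative results.
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest, srcs):
        """Rename one instruction; returns (new_dest, phys_srcs, old_dest)."""
        # Sources are looked up first so that an instruction reading and
        # writing the same register sees the previous mapping.
        phys_srcs = [self.map[s] for s in srcs]
        old_dest = self.map[dest]      # stale register, freed at retirement
        new_dest = self.free.pop(0)    # unique location for this write
        self.map[dest] = new_dest
        return new_dest, phys_srcs, old_dest
```

With `Renamer(8, 12)`, two back-to-back writes to `r1` receive internal registers 8 and 9, so the WAW hazard disappears, and a later reader of `r1` is pointed at register 9.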

11
Map (Register Rename) and Queue Stages
  • The map stage renames programmer-visible
    register numbers to internal register numbers
  • Structures are duplicated for integer and
    floating-point execution
  • The queue stage stores instructions until they
    are ready to issue

12
Out-of-order Issue Queues
  • issue queue logic maintains 2 lists of pending
    instructions in separate integer and
    floating-point queues
  • scoreboards maintain status of the internal
    registers by tracking the progress of
    single-cycle, multiple-cycle, and variable-cycle
    (memory load) instructions
  • the scoreboard unit notifies all instructions in
    the queue that require the register value when
    functional unit or load-data results become
    available

13
Out-of-order Execution
  • Each queue/arbiter selects the oldest
    operand-ready and functional-unit-ready
    instructions for execution each cycle
  • queues are collapsible: an entry becomes
    immediately available once the instruction issues
    or is squashed due to misspeculation
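
Oldest-first selection with a collapsing queue can be sketched like this. The function name, the tuple layout of a queue entry, and the `ready` set of available internal registers are assumptions for illustration; functional-unit readiness is not modeled.

```python
def select_and_collapse(queue, ready, width):
    """Issue up to `width` of the oldest operand-ready entries.

    `queue` is a list of (name, source_registers) ordered oldest first.
    It is updated in place so that issued entries free their slots.
    """
    issued, remaining = [], []
    for entry in queue:
        name, srcs = entry
        if len(issued) < width and all(s in ready for s in srcs):
            issued.append(name)        # oldest ready entries win
        else:
            remaining.append(entry)    # stays queued, order preserved
    queue[:] = remaining               # collapse: no holes are left behind
    return issued
```

Because the scan runs from the oldest entry forward, a younger ready instruction can bypass an older one that is still waiting for an operand.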

14
Retire Mechanism
  • assigns each mapped instruction a slot in a
    circular in-flight window (in fetch order)
  • tracks the internal register usage for all
    in-flight instructions
  • each entry in the mechanism contains storage
    indicating the internal register that held the
    old contents of the destination register for the
    corresponding instruction
  • this (stale) register can be freed for other use
    after the instruction retires
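
A minimal sketch of the in-flight window follows, assuming each entry records the stale internal register alongside the new one. The class `RetireWindow`, its method names, and the dictionary keys are hypothetical.

```python
from collections import deque

class RetireWindow:
    """In-flight window kept in fetch order, oldest entry on the left."""

    def __init__(self, free_list):
        self.window = deque()
        self.free_list = free_list     # shared list of free internal registers

    def allocate(self, name, new_phys, old_phys):
        # old_phys held the previous contents of the destination register.
        self.window.append(
            {"name": name, "new": new_phys, "old": old_phys, "done": False})

    def mark_done(self, name):
        for entry in self.window:
            if entry["name"] == name:
                entry["done"] = True

    def retire(self):
        """Retire completed instructions from the head, in program order."""
        retired = []
        while self.window and self.window[0]["done"]:
            entry = self.window.popleft()
            self.free_list.append(entry["old"])   # stale register is reusable
            retired.append(entry["name"])
        return retired
```

An instruction that finishes early still waits in the window until every older instruction has retired, which is what keeps the architectural state in program order.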

15
Exception Handling
  • an exception causes all younger instructions in
    the in-flight window to be squashed and removed
    from all queues in the system
  • register map is backed up to the state before the
    last squashed instruction using the saved map
    state
  • registers allocated by the squashed instructions
    become immediately available

16
HP PA-RISC 8000
17
ROB Size Performance Effect
18
AMD K-5 ROB Entry
19
AMD K-5 Reservation Station Entry
20
Approaches for Billion Transistor Architectures
  • Advanced superscalar processors: scale up from
    current designs to issue 16 or 32 instructions
    per cycle
  • Superspeculative processors: enhance wide-issue
    superscalar performance by speculating
    aggressively at every point in the processor
    pipeline

21
SPARC64 V9
22
Pentium III and 4 Register Renaming and ROB
23
One Billion Transistors, One Uniprocessor, One
Chip?
24
Superspeculative Architecture
25
Area Issues
  • Large circuitry is required to feed the processor
    with a continuous instruction stream
  • Dynamic execution requires a large number of
    comparisons for dependency checking
  • The sizes of the reorder buffer and reservation
    stations/rename registers increase accordingly

26
Limitations
  • Larger-issue machines have high peak-to-sustained
    rate ratios: Intel Pentium Pro architecture
    approach
  • Beyond issue widths of 8, the inherently limited
    ILP of a single thread gives diminishing returns:
    more architectures are switching to Simultaneous
    Multithreading

27
Alternate Approaches
28
OOO Speculative Execution Processor - Simulator
Design
  • Tracking all the activities of the pipelined
    machine in each clock cycle
  • Issue Unit design that solves structural and data
    hazards
  • Dependency checking Mechanisms
  • Strategy for sending data from producers to
    consumers

29
Data Structures
  • Instruction Queue
  • Execution Tracking Hardware Structure
  • Register File Producer Table
  • Reservation Stations
  • The Reorder Buffer
  • Functional Units State Structure

30
Service Functions
  • Issue
  • Dispatch
  • Completion
  • CDB Snooping
  • Retirement and Writeback

31
Overall Structure
32
Producer Table
  • Each register is extended by a tag and valid flag
  • Valid = true iff register contains appropriate
    data
  • Otherwise, tag points to the instruction producing
    the data
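
A producer-table entry could be modeled as below. This is a sketch only: the names `RegEntry` and `read_operand`, the field defaults, and the use of -1 for "no producer" are assumptions.

```python
from dataclasses import dataclass

@dataclass
class RegEntry:
    data: int = 0
    valid: bool = True   # True iff `data` holds the appropriate value
    tag: int = -1        # ROB tag of the producing instruction when not valid

def read_operand(regs, r):
    """Return ('value', data) if register r is valid, else ('tag', producer)."""
    entry = regs[r]
    if entry.valid:
        return ("value", entry.data)
    return ("tag", entry.tag)
```

A consumer that receives a tag instead of a value knows which in-flight instruction it must wait for.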

33
Reservation Stations
  • Full bit is set if entry occupied
  • Tag points to ROB tag of the instruction
  • op1 and op2 hold the source references
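
One possible layout for a reservation station entry is shown here. The `Operand` type with its own valid, data, and tag fields is an assumption inferred from how the later protocol slides use `RS.opx.valid` and `RS.opx.tag`; the `ready` helper is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Operand:
    valid: bool = False   # True once the value has been captured
    data: int = 0
    tag: int = -1         # ROB tag of the producer while not valid

@dataclass
class ReservationStation:
    full: bool = False    # set if the entry is occupied
    tag: int = -1         # ROB tag of the instruction held here
    op1: Operand = field(default_factory=Operand)
    op2: Operand = field(default_factory=Operand)

    def ready(self):
        """An occupied entry may dispatch once both operands are valid."""
        return self.full and self.op1.valid and self.op2.valid
```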

34
The Reorder Buffer
  • Realized as a FIFO with ROBhead and ROBtail
  • New instructions are placed at ROBtail, and the
    instruction is tagged in the RS with this index
  • Each cycle the ROBhead valid entry is checked for
    instruction completion
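
The FIFO organization can be sketched as a circular buffer. The class and method names are hypothetical, and the explicit `count` field is one of several ways to tell a full buffer from an empty one.

```python
class ReorderBuffer:
    """Circular FIFO; the index of an entry doubles as the instruction tag."""

    def __init__(self, size):
        self.entries = [None] * size
        self.head = 0     # oldest in-flight instruction
        self.tail = 0     # slot for the next issued instruction
        self.count = 0

    def allocate(self, dest):
        """Reserve the tail entry and return its tag, or None if full."""
        if self.count == len(self.entries):
            return None                      # issue must stall
        tag = self.tail
        self.entries[tag] = {"valid": False, "data": None, "dest": dest}
        self.tail = (self.tail + 1) % len(self.entries)
        self.count += 1
        return tag

    def head_complete(self):
        """Checked each cycle: has the oldest instruction finished?"""
        return self.count > 0 and self.entries[self.head]["valid"]
```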

35
Issue Protocol
if (there is a free RS and a free ROB entry)
    RS.full = 1
    RS.tag = ROBtail
    for all operands x of Ii with address r
        if R[r].valid = 1
            RS.opx = R[r]
        else if CDB.tag = R[r].tag and CDB.valid
            RS.opx = CDB
        else
            RS.opx = ROB[R[r].tag]
    if (Ii has a destination register r)
        R[r].tag = ROBtail
        R[r].valid = 0
        ROB[ROBtail].dest = r
    else
        ROB[ROBtail].dest = none
    ROBtail = ROBtail + 1
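
The issue protocol on this slide can be sketched in Python as follows. The dictionary layouts, the use of `None` to mark a free ROB slot, the wrap-around of the tail, and the function signature are assumptions made for illustration.

```python
def issue(instr, regs, rs_list, rob, rob_tail, cdb):
    """Issue one instruction; returns the new ROB tail, or None on a stall.

    instr is {"srcs": [register numbers], "dest": register number or None}.
    """
    rs = next((s for s in rs_list if not s["full"]), None)
    if rs is None or rob[rob_tail] is not None:
        return None                       # no free RS or no free ROB entry
    rs["full"], rs["tag"], rs["ops"] = True, rob_tail, []
    for r in instr["srcs"]:
        reg = regs[r]
        if reg["valid"]:                  # value is in the register file
            op = {"valid": True, "data": reg["data"], "tag": None}
        elif cdb["valid"] and cdb["tag"] == reg["tag"]:
            op = {"valid": True, "data": cdb["data"], "tag": None}
        else:                             # take whatever the ROB entry holds
            entry = rob[reg["tag"]]
            op = {"valid": entry["valid"], "data": entry["data"],
                  "tag": reg["tag"]}
        rs["ops"].append(op)
    dest = instr["dest"]
    if dest is not None:                  # this instruction is the new producer
        regs[dest]["tag"], regs[dest]["valid"] = rob_tail, False
    rob[rob_tail] = {"valid": False, "data": None, "dest": dest}
    return (rob_tail + 1) % len(rob)
```

Sources are read before the destination is re-tagged, so an instruction that reads and writes the same register still sees the older producer.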
36
Dispatch Protocol
if there is a RS with RS.opx.valid = 1 for all
operands x and the function unit is not stalled
    Pass instruction, operands, and tag to FU
    RS.full = 0
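
A sketch of the dispatch step, assuming a single functional unit and reservation stations stored as dictionaries. The `opcode` field and the shape of the returned record are hypothetical.

```python
def dispatch(rs_list, fu_stalled):
    """Send the first ready reservation station entry to the functional unit.

    Returns the work handed to the FU, or None if nothing was dispatched.
    """
    if fu_stalled:
        return None
    for rs in rs_list:
        if rs["full"] and all(op["valid"] for op in rs["ops"]):
            rs["full"] = False            # the entry is free again
            return {"opcode": rs["opcode"],
                    "operands": [op["data"] for op in rs["ops"]],
                    "tag": rs["tag"]}
    return None
```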
37
Completion Protocol
if FU has result and got CDB acknowledge
    CDB.valid = 1
    CDB.data = result from FU
    CDB.tag = tag from FU
    ROB[CDB.tag].valid = 1
    ROB[CDB.tag].data = CDB.data
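
The completion step might look like this in Python. Clearing `CDB.valid` when nothing is broadcast is an assumption that the slide does not state, and the argument names are hypothetical.

```python
def complete(fu_result, cdb_granted, cdb, rob):
    """Broadcast a finished result on the CDB and record it in the ROB.

    fu_result is None or {"tag": rob_index, "data": value}.
    Returns True if a result was broadcast this cycle.
    """
    if fu_result is None or not cdb_granted:
        cdb["valid"] = False              # assumption: idle bus is marked invalid
        return False
    cdb["valid"] = True
    cdb["data"] = fu_result["data"]
    cdb["tag"] = fu_result["tag"]
    rob[cdb["tag"]]["valid"] = True       # instruction is now complete
    rob[cdb["tag"]]["data"] = cdb["data"]
    return True
```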
38
CDB Snooping
For all operands x
    if RS.full = 1 and RS.opx.valid = 0 and
       RS.opx.tag = CDB.tag
        RS.opx = CDB
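
Snooping can be sketched as a scan over all occupied reservation stations. The extra check on `cdb["valid"]` is an assumption added so that an idle bus is ignored; the slide's condition does not mention it.

```python
def snoop(rs_list, cdb):
    """Capture the CDB value in every waiting operand with a matching tag.

    Returns the number of operands that were filled.
    """
    if not cdb["valid"]:                  # assumption: ignore an idle bus
        return 0
    filled = 0
    for rs in rs_list:
        if not rs["full"]:
            continue
        for op in rs["ops"]:
            if not op["valid"] and op["tag"] == cdb["tag"]:
                op["valid"], op["data"] = True, cdb["data"]
                filled += 1
    return filled
```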
39
Retirement/Writeback Protocol
if ROB not empty and ROB[ROBhead].valid = 1
    if instruction in ROB[ROBhead] requires writeback
        x = ROB[ROBhead].dest
        R[x].data = ROB[ROBhead].data
        if ROBhead = R[x].tag
            R[x].valid = 1
    ROBhead = ROBhead + 1
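
A sketch of in-order retirement under the same assumed data layout: free ROB slots are `None`, the head wraps around, and a destination of `None` means no writeback is required.

```python
def retire(rob, rob_head, regs):
    """Retire the head instruction if it has completed; returns the new head."""
    entry = rob[rob_head]
    if entry is None or not entry["valid"]:
        return rob_head                   # ROB empty or head not finished
    x = entry["dest"]
    if x is not None:                     # instruction requires writeback
        regs[x]["data"] = entry["data"]
        # The register becomes valid only if no younger instruction
        # has since been issued as its producer.
        if regs[x]["tag"] == rob_head:
            regs[x]["valid"] = True
    rob[rob_head] = None                  # free the slot
    return (rob_head + 1) % len(rob)
```

The tag comparison keeps a register invalid when a younger in-flight instruction will overwrite it, so consumers keep waiting for the newest producer.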
40
Configurable Parameters
  • Probability of memory misses
  • Probability of correct branch prediction
  • Branch mis-prediction penalty
  • Cache miss penalty
  • Window Size for instruction issue
  • Number of Issues/cycle
  • Number of Functional Units (FUs)
  • Pipeline Depth/Latency of each FU
  • Number of CDBs
  • Size of reservation stations/rename registers
    (RS)
  • Operand matching mechanism in each RS
  • Size of re-order buffer
  • Branch Prediction Mechanisms (optional)
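
These parameters could be gathered in one configuration object. Every field name and default value below is hypothetical, chosen only to show the shape of such a structure; per-FU latencies and the operand matching mechanism are left out for brevity.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SimConfig:
    mem_miss_prob: float = 0.05     # probability of a memory miss
    branch_hit_prob: float = 0.90   # probability of a correct prediction
    branch_penalty: int = 10        # cycles lost per misprediction
    miss_penalty: int = 50          # cycles lost per cache miss
    window_size: int = 16           # instruction issue window
    issue_width: int = 4            # issues per cycle
    num_fus: int = 4
    fu_latency: int = 1
    num_cdbs: int = 2
    rs_size: int = 16
    rob_size: int = 32

    def validate(self):
        """Return the names of fields holding out-of-range values."""
        bad = [n for n in ("mem_miss_prob", "branch_hit_prob")
               if not 0.0 <= getattr(self, n) <= 1.0]
        bad += [n for n in ("window_size", "issue_width", "num_fus",
                            "fu_latency", "num_cdbs", "rs_size", "rob_size")
                if getattr(self, n) < 1]
        return bad
```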

41
Performance Metrics
  • Number of Clock cycles on an instruction trace
  • Number of Stalls (Various Types)
  • Effect on Hardware costs
  • Peak vs. Sustained Rates (actual issues vs.
    maximum possible)
  • Percentage Resource Utilization
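
Several of these metrics can be derived from a per-cycle issue log. The function below is a sketch: its name, its inputs, and the choice to count a cycle with zero issues as a stall are assumptions.

```python
def summarize(issues_per_cycle, issue_width, fu_busy_cycles, num_fus):
    """Compute basic metrics from a list of per-cycle issue counts."""
    cycles = len(issues_per_cycle)
    issued = sum(issues_per_cycle)
    return {
        "cycles": cycles,
        "peak_rate": issue_width,                     # maximum possible issues/cycle
        "sustained_rate": issued / cycles,            # actual issues/cycle
        "issue_efficiency": issued / (cycles * issue_width),
        "fu_utilization": fu_busy_cycles / (cycles * num_fus),
        "stall_cycles": sum(1 for n in issues_per_cycle if n == 0),
    }
```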

42
OOO Speculative Micro-architecture Simulators
  • SimpleScalar: University of Wisconsin in Madison,
    www.simplescalar.com
  • KScalar: Universidad Autónoma de Barcelona,
    www.caos.uab.es/kscalar

43
SimpleScalar v3.0
  • tool set includes sample simulators ranging from
    a fast functional simulator to a detailed,
    dynamically scheduled processor model that
    supports non-blocking caches, speculative
    execution, and state-of-the-art branch prediction
  • includes performance visualization tools,
    statistical analysis resources, and debug and
    verification infrastructure
  • includes a machine definition infrastructure that
    permits most architectural details to be
    separated from simulator implementations

44
KScalar
  • allows analyzing the performance behavior of a
    wide range of processor microarchitectures from
    a very simple in-order, scalar pipeline, to a
    detailed out-of-order, superscalar pipeline with
    non-blocking caches, speculative execution, and
    complex branch prediction
  • The simulator interprets executables for the
    Alpha AXP instruction set from very short
    program fragments to large applications
  • The object program's execution may be simulated
    in varying levels of detail: either
    cycle-by-cycle, observing all the pipeline events
    that determine processor performance, or millions
    of cycles at once, taking statistics of the main
    performance issues

45
Study Direction
  • Modeling and comparison of representative
    Micro-architectures
  • Parameters modeling commercial
    micro-architectures' OOO speculative execution
    cores
  • SPEC benchmarks instruction traces
  • analysis of relative importance of supporting
    assumptions

46
Study Direction (continued)
  • Modeling Resource Utilization of Simultaneous
    Multithreaded Workload
  • Comparison of resource utilization and
    performance metrics of single-thread vs. SMT
    execution
  • Use of instruction traces that model multi-thread
    workload (e.g. modeling Hyperthreading in Pentium
    4)