Transcript and Presenter's Notes

Title: EECS/CS 370


1
EECS/CS 370
  • Advanced Issues in Pipelining
  • Lecture 21

2
Four lectures on pipelining
  • Data hazards
  • Control hazards
  • Other issues
  • Advanced topics
  • Super pipelined execution
  • Superscalar execution
  • Out-of-order execution
  • Wave pipelining

3
Super pipelining
  • Processor implementations with pipelines of more
    than 5 stages are superpipelined
  • Superpipelining enables the clock frequency to be
    increased (i.e., the cycle time goes down); a
    small worked example follows this list
  • Superpipelining exacerbates the problems caused by
    hazards.
  • Where you add the extra stages is important
  • Frontend (before register reads)
  • Middle (during execution, before the result is known)
  • Backend (after results are calculated, before
    completion)
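A small back-of-the-envelope sketch in Python (the delay numbers are assumed for illustration, not taken from the lecture): splitting a fixed amount of logic across more stages shortens the cycle time, but the per-stage register overhead limits how much frequency can actually be gained.

  # Illustrative only: assumed delays, not the lecture's numbers.
  total_logic_delay_ns = 10.0   # combinational delay of the whole datapath
  register_overhead_ns = 0.5    # setup + clock-to-Q cost of each pipeline register

  for stages in (5, 10, 20):
      cycle_time = total_logic_delay_ns / stages + register_overhead_ns
      print(f"{stages:2d} stages: cycle = {cycle_time:.2f} ns, "
            f"freq = {1000 / cycle_time:.0f} MHz")

Doubling the stage count from 10 to 20 here buys much less than a 2x frequency gain because the register overhead does not shrink, which is one reason very deep pipelines see diminishing returns.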

4
Superscalar Machines
  • A processor implementation with multiple
    pipelines (dependent pipelines) is said to be
    executing superscalar (more than scalar)
  • Superscalar implementations improve CPI by
    enabling more than one instruction to be in each
    pipeline stage
  • Superscalar implementations must still manage
    pipeline hazards.
  • This increases the complexity of the processor
  • It is also more difficult to avoid hazards
    for much the same reasons that superpipelining does

5
Scheduling Issues
  • In order to execute 2 instructions at the same
    time we must still avoid hazards.
  • Detection (a small sketch follows this list)
  • Must compare source operands with all previous
    destinations in flight in either pipeline
  • Must also compare the sources of one instruction
    in decode with the destinations of the other.
  • Management
  • More forwarding locations (why?)
  • More stalls (why?)
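A minimal detection sketch in Python, assuming a 2-wide issue and made-up register names (this is not the lecture's datapath): each source of the pair is compared against every destination still in flight, and the younger instruction's sources are also compared against the older one's destination.

  def has_hazard(older, younger, in_flight_dests):
      # each instruction is a dict with "srcs" and "dests" as sets of register names
      if (older["srcs"] | younger["srcs"]) & in_flight_dests:
          return True                  # RAW against an instruction already in a pipeline
      if younger["srcs"] & older["dests"]:
          return True                  # RAW between the two co-issued instructions
      return False

  older   = {"srcs": {"r1", "r2"}, "dests": {"r3"}}   # add r1, r2 -> r3
  younger = {"srcs": {"r3", "r4"}, "dests": {"r5"}}   # sub r3, r4 -> r5
  print(has_hazard(older, younger, in_flight_dests={"r7"}))   # True: r3 produced and used in the same pair

With more pipelines and more instructions in flight, the number of comparisons grows quickly, which is part of why superscalar hazard management adds complexity.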

6
Out-of-order (OoO) execution
  • Some instructions take a long time to complete
    (e.g., a load instruction).
  • OoO execution allows the following instructions
    to execute as long as they don't need the result
    of the slow instruction.
  • OoO execution reduces stalls in the pipeline by
    filling them with future instructions, as long as
    that doesn't violate the program semantics (a toy
    issue-loop sketch follows this list).
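A toy issue loop in Python (purely illustrative; the instruction mix and timing are assumed): each cycle, the oldest instruction whose sources are ready is issued, so an independent instruction slips past a slow load instead of stalling behind it.

  ready_regs = {"r1", "r2", "r4", "r6", "r7"}       # values already available
  program = [("load", ["r1"],       "r3"),          # slow: result arrives late
             ("add",  ["r3", "r4"], "r5"),          # needs the load's result
             ("sub",  ["r6", "r7"], "r8")]          # independent of the load

  pending, cycle = list(program), 0
  while pending:
      cycle += 1
      for instr in pending:
          op, srcs, dest = instr
          if all(s in ready_regs for s in srcs):    # oldest ready instruction wins
              print(f"cycle {cycle}: issue {op} -> {dest}")
              pending.remove(instr)
              if op != "load":
                  ready_regs.add(dest)              # fast ops complete immediately (simplification)
              break
      else:
          ready_regs.add("r3")                      # model the slow load finally completing
  # prints: load in cycle 1, sub in cycle 2 (out of program order), add in cycle 4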

7
Scheduling in OoO machines
  • Out-of-order execution creates additional
    problems in pipeline scheduling.
  • When is reordering possible?
  • How is data forwarding accomplished?
  • What about control hazards?
  • What about exceptions?

8
Register renaming
  • Sometimes it is OK to reorder instructions that
    reference the same register.

div  r1, r2  → r3
sub  r3, r4  → r5
add  r6, r7  → r3
mult r3, r8  → r9

You can move the add and mult ahead of the
div/sub if you are careful!

div  r1, r2  → p3
sub  p3, r4  → r5
add  r6, r7  → p10
mult p10, r8 → r9

Register renaming remaps architected registers
to physical registers to avoid anti-dependencies
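A minimal renaming sketch in Python, assuming a simple "fresh physical register per write" scheme (the physical register numbers it produces differ from the slide's p3/p10 labels, which are just examples): reads go through the current mapping and every write allocates a new physical register, so the reuse of r3 no longer creates an anti-dependency.

  def rename(program, num_arch_regs=16):
      mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}   # initial identity map
      next_phys = num_arch_regs
      out = []
      for srcs, dest in program:                  # each instruction: ([sources], destination)
          new_srcs = [mapping[s] for s in srcs]   # read through the current mapping
          mapping[dest] = f"p{next_phys}"         # a write gets a fresh physical register
          next_phys += 1
          out.append((new_srcs, mapping[dest]))
      return out

  prog = [(["r1", "r2"], "r3"),    # div  r1, r2  -> r3
          (["r3", "r4"], "r5"),    # sub  r3, r4  -> r5
          (["r6", "r7"], "r3"),    # add  r6, r7  -> r3 (reuses r3)
          (["r3", "r8"], "r9")]    # mult r3, r8  -> r9
  for srcs, dest in rename(prog):
      print(srcs, "->", dest)

After renaming, the add/mult pair reads and writes different physical registers than the div/sub pair, so the two pairs can be scheduled independently.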

9
Pentium Pro/II/III Pipeline
  • 11 stages, 7 phases
  • Instruction Fetch
  • Decode
  • Register Access
  • Reordering
  • Dispatch
  • Execution
  • Retirement

10
Instruction Fetch
  • There are 3 stages in this phase: IFU1, IFU2,
    IFU3. IFU stands for Instruction Fetch Unit.
  • IFU1: Fetches a 32-byte line from the L1 code
    cache. The line is stored in a buffer in the CPU.
  • IFU2: Marks the boundaries of the IA instructions
    in each 32-byte line. If an instruction is found
    to be a branch instruction, it is also forwarded
    to the BTB (branch target buffer) for dynamic
    branch prediction.
  • IFU3: Aligns instructions for delivery to the
    instruction decoders. This step is required,
    since an instruction can start anywhere in the
    32-byte line.

11
Instruction Decode
  • There are 3 decoders in the CPU. Across both
    decode stages, decoding an instruction takes a
    total of 2 1/2 clock cycles.
  • DEC1: Translates IA instructions into uops
    (where possible). Up to 3 instructions can be
    decoded simultaneously (one per decoder). These 3
    decoders only handle instructions up to 7 bytes
    in length that can be converted into 4 uops or
    fewer. The 3 decoders consist of 1 complex
    decoder and 2 simple decoders. Simple decoders
    can only convert IA instructions that map to a
    single uop. Luckily, most IA instructions are
    simple.
  • For instructions longer than 7 bytes, or that
    require more than 4 uops, the IA instruction is
    sent to the Micro-Instruction Sequencer (MIS).
    The job of the MIS is to convert these more
    complicated instructions into uops. It does this
    using ROM (read-only memory) microcode and sends
    the uops it produces to the ID Queue.
  • DEC2: Moves uops to the ID Queue, bringing
    together the results of the 3 decoders (a rough
    steering sketch follows this list).
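A rough steering sketch in Python (the 7-byte, 4-uop, and single-uop thresholds come from this slide; the example instructions and their uop counts are assumed for illustration): instructions that fit the limits go to the decoders, with single-uop instructions eligible for the simple decoders, and everything else goes to the MIS.

  def steer(length_bytes, uops):
      if length_bytes > 7 or uops > 4:
          return "MIS (microcode ROM)"           # too long or too many uops for the decoders
      if uops == 1:
          return "simple or complex decoder"     # single-uop instructions fit anywhere
      return "complex decoder only"              # 2-4 uops: only the complex decoder

  for name, length_bytes, uops in (("reg-reg add", 2, 1),
                                   ("read-modify-write add", 4, 4),
                                   ("string move", 3, 12)):
      print(name, "->", steer(length_bytes, uops))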

12
Register Access
  • The Pentium Pro has 40 hidden registers (hidden
    from programmers). The register allocation table
    uses them by rewriting uop references to the
    standard 16 IA architecture registers so that
    they refer to the 40 hidden registers instead.
    This allows for increased parallelism, since more
    registers can be allocated to instructions than
    are architecturally available.
  • The renamed uops are sent to the ROB.

13
Reorder Buffer (ROB)
  • The ROB contains 40 entries for uops. The uops
    are added to the ROB in program order (the order
    of the original IA instructions). The ROB is
    essentially a pool of instructions that are
    available for execution.
  • After a uop executes, its results are stored in
    the ROB entry for that uop.
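A tiny reorder-buffer sketch in Python (the entry fields are assumed; only the 40-entry size and the in-order allocate/retire behavior come from these slides): uops are appended in program order, results are recorded when a uop finishes, and entries can only leave from the head.

  from collections import deque

  rob = deque(maxlen=40)                    # 40 entries, as on the Pentium Pro

  def dispatch(uop):
      rob.append({"uop": uop, "done": False, "result": None})   # allocated in program order
      return rob[-1]

  def retire():
      while rob and rob[0]["done"]:         # only the oldest completed uop may retire
          entry = rob.popleft()
          print("retire", entry["uop"], "=", entry["result"])

  a = dispatch("load -> r1")
  b = dispatch("add  -> r2")
  b.update(done=True, result=7)             # the younger uop finishes first...
  retire()                                  # ...but nothing retires yet
  a.update(done=True, result=3)
  retire()                                  # now both retire, in program order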

14
Dispatch
  • The Dispatcher copies a uop from the ROB to the
    Reservation Station (RS) and allocates a specific
    execution unit to execute the uop. The RS is a
    buffer for the execution units.

15
Execution
  • There are 5 execution units in a Pentium Pro:
    Store Data, Store Address, Load Address, Simple
    Integer, and Floating Point/Complex Integer. All 5
    of these execution units can operate simultaneously.

16
Retirement
  • The Retirement phase has the job of mapping the
    uop results back to the original IA
    instructions and registers.
  • RET1: Marks a uop for retirement, after it has
    executed, only if all conditional branches
    earlier in the code stream have also been
    executed.
  • Why is this a problem? Since the Pentium Pro
    performs branch prediction, it is possible to
    execute code after a predicted branch before the
    real branch evaluation takes place. Thus, code
    executed after a branch is like a transaction: we
    don't want the results to be visible until the
    CPU has "committed" that the predicted branch is
    the correct one. We can't make the results of the
    processing available outside the CPU until this
    commitment has been made.
  • RET2: Only retires uops marked for retirement
    when the previous IA instruction has been retired
    and all uops associated with the next IA
    instruction have completed execution.
  • Retirement consists of putting the results into
    the set of 16 IA registers called for by the
    original IA instruction.

17
(No Transcript)
18
AMD Hammer Microarchitecture
  • 12 Stage pipeline
  • Pre-decoded instruction memory
  • With ID bits to identify branch instructions and
    the first byte of each instruction
  • Partitioned Register file
  • Bigger data cache memory

19
AMD Hammer Architectural Extensions (64 bit)
20
Classical Pipelining
  • Synchronous digital circuit
  • Partition combinational logic into stages
  • Insert pipeline registers between stages

(Figure: combinational logic stages separated by pipeline registers)
21
Classical Pipelining - Problems
  • For max performance, all stages must be busy all
    the time.
  • How many LC2K1 instructions do something useful
    in each stage?
  • Logic must be divided equally so all computations
    finish at exactly the same time (see the sketch
    after this list).
  • How long does it take to complete the LC2K1
    decode stage?
  • Very deep pipelines have a lot of overhead
    writing to the pipeline registers.
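A small sketch in Python of the balance problem (the stage delays are assumed, not measured from LC2K1): the clock has to accommodate the slowest stage plus the register overhead, so any stage that finishes early simply sits idle for the rest of the cycle.

  stage_delays_ns = {"fetch": 2.0, "decode": 0.8, "execute": 2.6,
                     "memory": 2.0, "writeback": 1.0}     # assumed values
  register_overhead_ns = 0.5

  cycle = max(stage_delays_ns.values()) + register_overhead_ns
  idle = {s: cycle - register_overhead_ns - d for s, d in stage_delays_ns.items()}
  print(f"cycle time = {cycle:.1f} ns (set by the slowest stage)")
  for stage, slack in idle.items():
      print(f"  {stage:9s} idle for {slack:.1f} ns each cycle")

A fast stage such as decode wastes most of its cycle, which is why the logic should be divided as evenly as possible.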

22
Wave Pipelining
  • Also referred to as maximal rate pipelining
  • Allows multiple data waves simultaneously between
    successive storage elements (registers or
    pipeline registers).
  • So intermediate pipeline registers are not needed.
  • Uses a clock period that is less than the max
    propagation delay between the registers.

23
Wave Pipelining (Cont.)
  • Data at input is changed before previous data has
    completely propagated through to output.
  • Picture a water slide

24
Wave Pipelining Example
  • Min delay of 16, max delay of 20

25
Wave Pipelining Maximizing Clock Rate
  • Minimum cycle time is limited by the difference
    between the min and max input-to-output delays
    (and device switching speed); see the check after
    this list.
  • For max clock rate, all path delays from input to
    output must be equalized.
  • Factors
  • Topological path differences.
  • Process/temperature/power variations.
  • Data-dependent delay variations.
  • Intentional clock skew?
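A back-of-the-envelope check in Python using slide 24's example delays (the latch/skew allowance is assumed): the clock period must cover at least the spread between the slowest and fastest paths, or a new wave of data would catch up with the wave ahead of it.

  t_min, t_max = 16, 20          # min and max path delays from the example (arbitrary units)
  overhead = 1                   # register setup / clock skew allowance (assumed)

  cycle_min = (t_max - t_min) + overhead      # shortest safe clock period
  waves = t_max // cycle_min                  # distinct data waves in flight at once
  print(f"minimum cycle time ~ {cycle_min}, about {waves} waves in flight")

Equalizing the path delays shrinks t_max - t_min, which is exactly why the factors listed above determine the achievable clock rate.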

26
Wave Pipelining - Problems
  • Operating speed constrained to narrow range of
    frequencies for given degree of wave pipelining.
  • New fabrication process requires significant
    redesign
  • No effective mechanism for starting/stopping
  • Pipeline stalls, low speed testing?
  • In general, very hard to do circuit analysis.