Title: EECS/CS 370
1EECS/CS 370
- Advanced Issues in Pipelining
- Lecture 21
2Four lectures on pipelining
- Data hazards
- Control hazards
- Other issues
- Advanced topics
- Super pipelined execution
- Superscalar execution
- Out-of-order execution
- Wave pipelining
3Super pipelining
- Processor implementations with pipelines deeper
than 5 stages are said to be superpipelined.
- Superpipelining enables the clock frequency to be
increased (i.e., the cycle time goes down).
- Superpipelining exacerbates the problems caused by
hazards.
- Where you add the extra stages is important
- Frontend (before register reads)
- Middle (during execution before result known)
- Backend (after results calculated, before
completion)
4Superscalar Machines
- A processor implementation with multiple
pipelines (dependent pipelines) is said to be
executing superscalar (more than scalar).
- Superscalar implementations improve CPI by
enabling more than one instruction to be in each
pipeline stage.
- Superscalar implementations must still manage
pipeline hazards.
- This increases the complexity of the processor.
- It is also more difficult to avoid hazards,
for much the same reasons as with superpipelining.
5Scheduling Issues
- In order to execute 2 instructions at the same
time we must still avoid hazards.
- Detection
- Must compare source operands with all previous
destinations in flight on either pipeline.
- Must also compare the sources of one instruction
in decode with the destination of the other.
- Management
- More forwarding locations (why?)
- More stalls (why?)
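
The two detection checks above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Instr` tuple, the function names, and the register names are hypothetical, and the sketch conservatively treats every detected dependence as blocking (a real pipeline would resolve many of them by forwarding).

```python
from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    """Hypothetical instruction record used only for this sketch."""
    op: str
    srcs: Tuple[str, ...]   # source register names
    dst: Optional[str]      # destination register, or None (e.g., a store)

def raw_hazards(instr, in_flight):
    """Return the older in-flight instructions whose destination instr reads."""
    return [older for older in in_flight
            if older.dst is not None and older.dst in instr.srcs]

def can_dual_issue(first, second, in_flight):
    """Decide whether two instructions in decode may issue in the same cycle.

    Assumption: no forwarding, so any dependence blocks issue.
    """
    # Check 1: sources of both instructions vs. all older destinations
    # in flight on either pipeline.
    if raw_hazards(first, in_flight) or raw_hazards(second, in_flight):
        return False
    # Check 2: the second instruction must not read what the first writes.
    if first.dst is not None and first.dst in second.srcs:
        return False
    return True

in_flight = [Instr("lw", ("r1",), "r2")]
a = Instr("add", ("r3", "r4"), "r5")
b = Instr("sub", ("r5", "r6"), "r7")   # reads r5, which a writes
c = Instr("nor", ("r3", "r6"), "r7")   # independent of a
print(can_dual_issue(a, b, in_flight))  # False
print(can_dual_issue(a, c, in_flight))  # True
```

Note that every extra pipeline adds both more destinations to compare against and more places a result might need to be forwarded to, which is one way to approach the two "why?" questions.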
6Out-of-order (OoO) execution
- Some instructions take a long time to complete
(e.g., a load instruction).
- OoO execution allows the following instructions
to execute as long as they don't need the result
of the slow instruction.
- OoO execution reduces stalls in the pipeline by
filling them with future instructions as long as
that doesn't violate the program semantics.
7Scheduling in OoO machines
- Out-of-order execution creates additional
problems in pipeline scheduling.
- When is reordering possible?
- How is data forwarding accomplished?
- What about control hazards?
- What about exceptions?
8Register renaming
- Sometimes it is OK to reorder instructions that
reference the same register.
    div  r1, r2 -> r3
    sub  r3, r4 -> r5
    add  r6, r7 -> r3
    mult r3, r8 -> r9
- You can move the add and mult ahead of the
div/sub if you are careful!
    div  r1, r2 -> p3
    sub  p3, r4 -> r5
    add  r6, r7 -> p10
    mult p10, r8 -> r9
- Register renaming remaps architected registers
to physical registers to avoid anti-dependencies
(here, the add overwriting r3 before the sub has
read it).
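
A minimal sketch of the renaming idea, assuming a simple scheme in which every write gets a fresh physical register and architected register rN starts out in physical register pN. The `rename` function, the tuple layout, and the numbering of physical registers are assumptions for illustration; the physical register numbers it picks differ from the p3/p10 used on the slide, and no free-list recycling is modeled.

```python
def rename(instrs, num_arch_regs=16):
    """Rename registers in a list of (op, src1, src2, dst) tuples.

    Hypothetical sketch: names look like "r3" on input and "p3" on output.
    """
    # Initially architected register rN lives in physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs  # number of the first unused physical register
    renamed = []
    for op, src1, src2, dst in instrs:
        # Sources read whichever physical register currently holds the value.
        p1, p2 = mapping[src1], mapping[src2]
        # Each write gets a fresh physical register, so a later write to
        # the same architected name cannot clobber a value still needed.
        pd = f"p{next_free}"
        next_free += 1
        mapping[dst] = pd
        renamed.append((op, p1, p2, pd))
    return renamed

program = [
    ("div",  "r1", "r2", "r3"),
    ("sub",  "r3", "r4", "r5"),
    ("add",  "r6", "r7", "r3"),
    ("mult", "r3", "r8", "r9"),
]
renamed = rename(program)
for instr in renamed:
    print(instr)
```

After renaming, the div and the add write different physical registers, and the mult reads the add's register rather than the div's, so the add/mult pair no longer conflicts with the div/sub pair.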
9Pentium Pro/II/III Pipeline
- 11 stages, 7 phases
- Instruction Fetch
- Decode
- Register Access
- Reordering
- Dispatch
- Execution
- Retirement
10Instruction Fetch
- There are 3 stages in this phase: IFU1, IFU2,
IFU3. IFU stands for Instruction Fetch Unit.
- IFU1: Fetches a 32-byte line from the L1 code
cache. The line is stored in a buffer in the CPU.
- IFU2: Marks the boundaries of the IA instructions
in each 32-byte line. If an instruction is found
to be a branch instruction, it is also forwarded
to the BTB (branch target buffer) for dynamic
branch prediction.
- IFU3: Aligns instructions for delivery to the
instruction decoders. This step is required,
since an instruction can be anywhere in the
32-byte stream.
11Instruction Decode
- There are 3 decoders in the CPU. The two decode
stages together take 2 1/2 clock cycles to decode
an instruction.
- DEC1: Translates the IA instructions into uops
(where possible). Up to 3 instructions can be
decoded simultaneously (one per decoder). These 3
decoders only handle instructions that are up to
7 bytes in length and that can be converted into
4 uops or fewer. The 3 decoders consist of 1
complex decoder and 2 simple decoders. Simple
decoders can only convert IA instructions that
map to a single uop. Luckily, most IA
instructions are simple.
- For instructions longer than 7 bytes, or that
require more than 4 uops, the IA instruction is
sent to the Micro-Instruction Sequencer (MIS).
The job of the MIS is to convert these more
complicated instructions into uops. It does this
using microcode stored in ROM (read-only memory)
and sends the uops it produces to the ID Queue.
- DEC2: Moves uops to the ID Queue. It brings
together the results of the 3 decoders.
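
The routing rules described on this slide can be summarized as a small decision function. This is only a sketch of the rules as stated here; the function name and the returned labels are hypothetical, and it ignores how the hardware actually assigns instructions to decoder slots.

```python
def route_instruction(length_bytes, num_uops):
    """Pick a decode path from instruction length and uop count.

    Thresholds (7 bytes, 4 uops, 1 uop) are the ones quoted on the slide.
    """
    if length_bytes > 7 or num_uops > 4:
        # Too long or too complex for the decoders: use microcode.
        return "MIS"
    if num_uops == 1:
        # Single-uop instructions fit a simple decoder (the complex
        # decoder could also take them).
        return "simple or complex decoder"
    # 2 to 4 uops: only the complex decoder can handle it.
    return "complex decoder"

print(route_instruction(2, 1))   # simple or complex decoder
print(route_instruction(3, 4))   # complex decoder
print(route_instruction(9, 1))   # MIS
print(route_instruction(3, 6))   # MIS
```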
12Register Access
- The Pentium Pro has 40 hidden registers (hidden
from programmers). These registers are used by
the register allocation table to modify uop
references to the standard 16 IA architecture
registers so that they use the 40 registers
instead. This allows for increased parallelism,
since more registers can be allocated to the
instructions than were originally available.
- The modified uops are sent to the ROB.
13Reorder Buffer (ROB)
- The ROB contains 40 entries for uops. The uops
are added to the ROB in program order (the order
of the original IA instructions). The ROB is
essentially a pool of instructions that are
available for execution.
- After a uop executes, its results are stored in
the ROB entry for that uop.
14Dispatch
- The Dispatcher copies a uop from the ROB to the
Reservation Station (RS) and allocates a specific
execution unit to execute the uop. The RS is a
buffer for the execution units.
15Execution
- There are 5 execution units in a Pentium Pro:
Store Data, Store Address, Load Address, Simple
Integer, and Floating Point/Complex Integer. All
5 of these execution units can operate
simultaneously.
16Retirement
- The Retirement phase has the job of mapping the
uop results back onto the original IA
instructions and registers.
- RET1: Marks a uop for retirement, after it has
executed, only if all conditional branches
earlier in the code stream have also been
executed.
- Why is this a problem? Since the Pentium Pro
performs branch prediction, it is possible to
execute code after a predicted branch before the
real branch evaluation takes place. Thus, code
executed after a branch is like a transaction: we
don't want the results to be available until the
CPU has "committed" that the predicted branch is
the correct one. We can't make the results of the
processing available outside the CPU until this
commitment has been made.
- RET2: Only retires uops marked for retirement
when the previous IA instruction has been retired
and all uops associated with the next IA
instruction have completed execution.
- Retirement consists of putting the results into
the set of 16 IA registers called for by the
original IA instruction.
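
The interplay between out-of-order completion and in-order retirement can be illustrated with a toy reorder buffer. This is a simplified sketch, not the Pentium Pro's actual mechanism: the `ReorderBuffer` class and its method names are hypothetical, uops are identified by name only, and branch misprediction recovery is not modeled. The default capacity of 40 is taken from the ROB slide.

```python
from collections import deque

class ReorderBuffer:
    """Toy ROB: allocate in program order, complete in any order,
    retire only from the head (oldest entry)."""

    def __init__(self, capacity=40):
        self.capacity = capacity
        self.entries = deque()  # oldest uop at the left

    def allocate(self, name):
        """Add a uop in program order; return False if the ROB is full."""
        if len(self.entries) >= self.capacity:
            return False
        self.entries.append({"name": name, "done": False, "result": None})
        return True

    def complete(self, name, result):
        """Record the result of an executed uop in its ROB entry."""
        for entry in self.entries:
            if entry["name"] == name:
                entry["done"] = True
                entry["result"] = result
                return
        raise KeyError(name)

    def retire(self):
        """Retire finished uops from the head, stopping at the first
        uop that has not finished executing."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            retired.append((entry["name"], entry["result"]))
        return retired

rob = ReorderBuffer()
rob.allocate("div")
rob.allocate("add")
rob.complete("add", 7)      # the younger uop finishes first
print(rob.retire())         # [] because div is still executing
rob.complete("div", 3)
print(rob.retire())         # [('div', 3), ('add', 7)]
```

The add's result sits in its ROB entry until the older div is done, which is what keeps results from becoming visible out of program order.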
18AMD Hammer Microarchitecture
- 12-stage pipeline
- Pre-decoded instruction memory
- With ID bits to identify branch instructions and
the first byte of all instructions
- Partitioned register file
- Bigger data cache memory
19AMD Hammer Architectural Extensions (64 bit)
20Classical Pipelining
- Synchronous digital circuit
- Partition combinational logic into stages
- Insert pipeline registers between stages
21Classical Pipelining - Problems
- For max performance, all stages must be busy all
the time.
- How many LC2K1 instructions do something useful
in each stage?
- Logic must be divided equally so that all
computations finish at exactly the same time.
- How long does it take to complete the LC2K1
decode stage?
- Very deep pipelines have a lot of overhead
writing to the pipeline registers.
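
The register overhead point can be made concrete with a simple model: if the logic is split perfectly evenly, the cycle time is the logic delay per stage plus a fixed cost for writing the pipeline register. The function name and the numbers below (100 units of logic, 2 units of register overhead) are made up for illustration only.

```python
def pipelined_cycle_time(total_logic_delay, num_stages, register_overhead):
    """Cycle time assuming logic is divided perfectly evenly across stages."""
    return total_logic_delay / num_stages + register_overhead

# Hypothetical numbers: 100 units of logic, 2 units per register write.
for stages in (1, 5, 10, 20, 50):
    cycle = pipelined_cycle_time(100.0, stages, 2.0)
    overhead_fraction = 2.0 / cycle
    print(stages, cycle, round(overhead_fraction, 2))
```

As the stage count grows, the useful logic per stage shrinks while the register cost stays constant, so an increasing share of every cycle goes to overhead.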
22Wave Pipelining
- Also referred to as maximal rate pipelining
- Allows multiple data waves simultaneously between
successive storage elements (registers or
pipeline registers).
- So pipeline registers are not needed.
- Uses a clock period that is less than the max
propagation delay between the registers.
23Wave Pipelining (Cont.)
- Data at input is changed before previous data has
completely propagated through to output.
- Picture a water slide
24Wave Pipelining Example
- Min delay of 16, max delay of 20
25Wave Pipelining Maximizing Clock Rate
- Minimum cycle time is limited by the difference
between min and max input-output delays (and
device switching speed).
- For max clock rate, must equalize all path
delays from input to output.
- Factors
- Topological path differences.
- Process/temperature/power variations.
- Data-dependent delay variations.
- Intentional clock skew?
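
Using the delays from the example slide (min delay 16, max delay 20), the cycle-time limits can be worked out with a small sketch. The function name and the lumped `overhead` parameter (standing in for setup/hold time, clock skew, and switching speed) are assumptions, and the bounds are idealized.

```python
def min_cycle_times(d_min, d_max, overhead=0.0):
    """Return (conventional, wave) lower bounds on the clock period.

    conventional: the clock must wait for the slowest path.
    wave: the clock is limited by the spread between slowest and
          fastest paths, so successive waves never overlap at the output.
    """
    conventional = d_max + overhead
    wave = (d_max - d_min) + overhead
    return conventional, wave

conv, wave = min_cycle_times(16, 20)
print(conv, wave)     # 20.0 4.0
print(conv / wave)    # 5.0, the ideal ratio between the two clock periods
```

The closer the min and max delays are to each other, the smaller the wave-pipelined bound becomes, which is why equalizing all path delays matters for the clock rate.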
26Wave Pipelining - Problems
- Operating speed is constrained to a narrow range
of frequencies for a given degree of wave
pipelining.
- A new fabrication process requires significant
redesign.
- No effective mechanism for starting/stopping
- Pipeline stalls, low speed testing?
- In general, very hard to do circuit analysis.