EECS/CS 370 - PowerPoint PPT Presentation


Slides: 27

Provided by: garyt


Transcript and Presenter's Notes

1
EECS/CS 370

Advanced Issues in Pipelining
Lecture 21

2
Four lectures on pipelining

Data hazards
Control hazards
Other issues
Advanced topics
Super pipelined execution
Superscalar execution
Out-of-order execution
Wave pipelining

3
Super pipelining

Processor implementations with pipelines deeper
than 5 stages are superpipelined
Superpipelining enables the clock frequency to be
increased (i.e., the cycle time goes down)
Superpipelining exacerbates the problems caused by
hazards.
Where you add the extra stages is important
Frontend (before register reads)
Middle (during execution before result known)
Backend (after results calculated, before
completion)
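
The tradeoff on this slide can be sketched with a toy model: more stages shorten the cycle time, but a deeper frontend raises the cost of each mispredicted branch, so CPI goes up. Every number below (logic delay, register overhead, branch frequency, misprediction rate, frontend fraction) is an invented assumption for illustration, not a measurement of any real processor.

```python
def time_per_instruction(stages, logic_delay=50.0, latch_overhead=1.0,
                         branch_freq=0.2, mispredict_rate=0.1,
                         frontend_fraction=0.5):
    """Toy model of average time per instruction for a given pipeline depth.

    All default parameter values are illustrative assumptions.
    """
    # Cycle time: logic split evenly across the stages, plus register overhead.
    cycle_time = logic_delay / stages + latch_overhead
    # Assume a mispredicted branch flushes the frontend part of the pipeline.
    flush_penalty = stages * frontend_fraction
    # CPI grows with the flush penalty, so hazards hurt more as depth grows.
    cpi = 1.0 + branch_freq * mispredict_rate * flush_penalty
    return cycle_time * cpi


for depth in (5, 10, 20, 40):
    print(depth, round(time_per_instruction(depth), 2))
```

In this model the cycle time keeps falling as stages are added, but the register overhead and the growing flush penalty eat into the gain.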

4
Superscalar Machines

A processor implementation with multiple
(dependent) pipelines is said to be
superscalar (more than scalar)
Superscalar implementations improve CPI by
enabling more than one instruction to be in each
pipeline stage
Superscalar implementations must still manage
pipeline hazards.
This increases the complexity of the processor
Hazards are also more difficult to avoid,
for much the same reasons as with superpipelining

5
Scheduling Issues

In order to execute 2 instructions at the same
time we must still avoid hazards.
Detection
Must compare source operands with all previous
destinations in flight on either pipeline
Must also compare the sources of one instruction
in decode with the destination of the other.
Management
More forwarding locations (why?)
More stalls (why?)
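
The detection rules above can be sketched as a check on two instructions sitting in decode. This is a minimal sketch that only looks at read-after-write conflicts and ignores forwarding, so any match means a stall. The `Instr` record and the function names are hypothetical, invented for this example.

```python
from collections import namedtuple

# Hypothetical instruction record: one destination register, a tuple of sources.
Instr = namedtuple("Instr", ["op", "dest", "srcs"])


def raw_hazards(instr, in_flight):
    """Return the in-flight instructions whose destination this instruction reads."""
    return [older for older in in_flight
            if older.dest is not None and older.dest in instr.srcs]


def can_dual_issue(first, second, in_flight):
    """Check whether two instructions in decode may issue in the same cycle.

    `in_flight` holds older instructions from BOTH pipelines that have not
    yet written back. Forwarding is ignored, so any match blocks the pair.
    """
    # Compare source operands with all previous destinations in flight.
    if raw_hazards(first, in_flight) or raw_hazards(second, in_flight):
        return False
    # Compare the second instruction's sources with the first's destination.
    if first.dest is not None and first.dest in second.srcs:
        return False
    return True


load = Instr("lw", 3, (1,))
add = Instr("add", 5, (3, 4))   # reads r3, which the load writes
nor = Instr("nor", 6, (1, 2))   # independent of the load
print(can_dual_issue(load, add, []))
print(can_dual_issue(load, nor, []))
```

With two pipelines, the number of comparisons roughly doubles per instruction, which is one way to see why both the forwarding paths and the stall cases multiply.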

6
Out-of-order (OoO) execution

Some instructions take a long time to complete
(e.g., a load instruction).
OoO execution allows the following instructions
to execute as long as they don't need the result
of the slow instruction.
OoO execution reduces stalls in the pipeline by
filling them with future instructions as long as
that doesn't violate the program semantics.
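
A small simulation can make the idea concrete. The sketch below issues at most one instruction per cycle, either strictly in order or by picking the oldest instruction whose sources are ready. It assumes each register is written at most once and is read only after its producer in program order, so only true dependencies matter. The program and its latencies are made up for illustration.

```python
def run(program, out_of_order):
    """Issue one instruction per cycle; return (issue order, completion cycle).

    program: list of (name, dest, srcs, latency) tuples.
    """
    # A register produced by a not-yet-issued instruction is never ready.
    ready_at = {dest: float("inf") for _, dest, _, _ in program}
    pending = list(program)
    issue_order = []
    cycle = 0
    while pending:
        # In-order issue may only look at the oldest pending instruction.
        window = pending if out_of_order else pending[:1]
        for instr in window:
            name, dest, srcs, latency = instr
            if all(ready_at.get(s, 0) <= cycle for s in srcs):
                ready_at[dest] = cycle + latency
                issue_order.append(name)
                pending.remove(instr)
                break
        cycle += 1
    return issue_order, max(ready_at.values())


program = [
    ("lw",  "r1", ("r0",), 3),   # slow load
    ("add", "r2", ("r1",), 1),   # needs the load result
    ("sub", "r3", ("r4",), 1),   # independent of the load
    ("nor", "r5", ("r3",), 1),   # depends only on the sub
]
print(run(program, out_of_order=False))
print(run(program, out_of_order=True))
```

In the out-of-order run, the sub and nor fill the cycles in which the add would otherwise stall waiting for the load.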

7
Scheduling in OoO machines

Out-of-order execution creates additional
problems in pipeline scheduling.
When is reordering possible?
How is data forwarding accomplished?
What about control hazards?
What about exceptions?

8
Register renaming

Sometimes it is OK to reorder instructions that
reference the same register.

  div  r1, r2 -> r3
  sub  r3, r4 -> r5
  add  r6, r7 -> r3
  mult r3, r8 -> r9
You can move the add and mult ahead of the
div/sub if you are careful!
  div  r1, r2 -> p3
  sub  p3, r4 -> r5
  add  r6, r7 -> p10
  mult p10, r8 -> r9
Register renaming remaps architected registers
to physical registers to remove anti-dependencies
(write-after-read) and output dependencies
(write-after-write)
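
One simple renaming policy is to give every destination a fresh physical register and have each source read whichever physical register currently holds that architected value. The sketch below assumes an unlimited supply of physical registers that are never freed, and its numbering (p16 and up for new values) is an arbitrary choice, so the names differ from the p3/p10 used on the slide.

```python
def rename(program, num_arch_regs=16):
    """Rename so that every write gets a fresh physical register.

    program: list of (op, src1, src2, dest) with architected names like "r3".
    Returns the same program using physical names like "p16".
    """
    # Initially architected register rN lives in physical register pN.
    mapping = {f"r{i}": f"p{i}" for i in range(num_arch_regs)}
    next_free = num_arch_regs
    renamed = []
    for op, src1, src2, dest in program:
        # Sources read the physical register holding the latest value.
        p1, p2 = mapping[src1], mapping[src2]
        # The destination gets a new physical register (never reused here).
        mapping[dest] = f"p{next_free}"
        next_free += 1
        renamed.append((op, p1, p2, mapping[dest]))
    return renamed


slide_example = [
    ("div",  "r1", "r2", "r3"),
    ("sub",  "r3", "r4", "r5"),
    ("add",  "r6", "r7", "r3"),
    ("mult", "r3", "r8", "r9"),
]
for instr in rename(slide_example):
    print(instr)
```

After renaming, the add no longer writes the register that the sub reads, so the add/mult pair is free to move ahead of the div/sub pair.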

9
Pentium Pro/II/III Pipeline

11 stages, 7 phases:
Instruction Fetch
Decode
Register Access
Reordering
Dispatch
Execution
Retirement

10
Instruction Fetch

There are 3 stages in this phase: IFU1, IFU2, and
IFU3. IFU stands for Instruction Fetch Unit.
IFU1: Fetches a 32-byte line from the L1 code
cache. The line is stored in a buffer in the CPU.
IFU2: Marks the boundaries of the IA instructions
in each 32-byte line. If an instruction is found
to be a branch instruction, it is also forwarded
to the BTB (branch target buffer) for dynamic
branch prediction.
IFU3: Aligns instructions for delivery to the
instruction decoders. This step is required,
since an instruction can start anywhere in the
32-byte stream.

11
Instruction Decode

There are 3 decoders in the CPU. The two decode
stages together take 2 1/2 clock cycles to decode
an instruction.
DEC1: Translates the IA instructions into uops
(where possible). Up to 3 instructions can be
decoded simultaneously (one per decoder). These 3
decoders only handle instructions that are up to
7 bytes long and that convert into 4 uops or
fewer. The 3 decoders consist of 1 complex
decoder and 2 simple decoders. Simple decoders
can only convert IA instructions that map to a
single uop. Luckily, most IA instructions are
simple.
Instructions longer than 7 bytes, or that
require more than 4 uops, are sent to the
Micro-Instruction Sequencer (MIS). The job
of the MIS is to convert these more complicated
instructions into uops. It does this using
microcode stored in ROM (read-only memory) and
sends the uops it produces to the ID Queue.
DEC2: Moves uops to the ID Queue. It brings
together the results of the 3 decoders.
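
The routing rules described on this slide can be written down as a small classifier. This is only a restatement of the limits given above (7 bytes, 4 uops, 1 uop for simple decoders). The function name and the returned labels are hypothetical.

```python
def route_instruction(length_bytes, num_uops):
    """Pick the decode resource for an IA instruction, per the limits above."""
    if length_bytes > 7 or num_uops > 4:
        return "MIS"        # too long or too complex: microcode sequencer
    if num_uops == 1:
        return "simple"     # fits a simple decoder (or the complex one)
    return "complex"        # 2 to 4 uops: only the complex decoder


print(route_instruction(length_bytes=2, num_uops=1))
print(route_instruction(length_bytes=6, num_uops=3))
print(route_instruction(length_bytes=9, num_uops=2))
```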

12
Register Access

The Pentium Pro has 40 hidden registers (hidden
from programmers). The register allocation table
rewrites uop references to the standard 16 IA
architecture registers so that they use these 40
registers instead. This
allows for increased parallelism since more
registers can be allocated to the instructions
than originally available.
The modified uops are sent to the ROB.

13
Reorder Buffer (ROB)

The ROB contains 40 entries for uops. The uops
are added to the ROB in program order (the order
of the original IA instructions). The ROB is
essentially a pool of instructions that are
available for execution.
After a uop executes, its results are stored in
the ROB entry for that uop.

14
Dispatch

The Dispatcher copies a uop from the ROB to the
Reservation Station (RS) and allocates a specific
execution unit to execute the uop. The RS is a
buffer for the execution units.

15
Execution

There are 5 execution units in a Pentium Pro:
Store Data, Store Address, Load Address, Simple
Integer, and Floating Point/Complex Integer. All 5
of these execution units can operate simultaneously.

16
Retirement

The Retirement phase has the job of mapping the
uop results back onto the original IA
instructions and registers.
RET1: Marks a uop for retirement, after it has
executed, only if all conditional branches
earlier in the code stream have also been
executed.
Why is this a problem? Since the Pentium Pro
performs branch prediction, it is possible to
execute code after a predicted branch, before the
real branch evaluation takes place. Thus, code
executed after a branch is like a transaction. We
don't want the results to be available until the
CPU has "committed" that the predicted branch is
the correct one. We can't make the results of the
processing available outside the CPU until this
commitment has been made.
RET2: Only retires uops marked for retirement
when the previous IA instruction has been retired
and all uops associated with the next IA
instruction have completed execution.
Retirement consists of putting the results into
the set of 16 IA registers called for by the
original IA instruction.
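
The key property of the ROB and retirement phases is that uops may finish in any order but leave the buffer in program order. The sketch below shows only that property. It leaves out the branch-commit check in RET1 and the grouping of uops by IA instruction in RET2, and the class and method names are invented for this example.

```python
from collections import deque


class ReorderBuffer:
    """Simplified ROB: entries are added in program order, retired in order."""

    def __init__(self, capacity=40):
        self.capacity = capacity
        self.entries = deque()   # oldest uop at the left

    def add(self, uop_name):
        if len(self.entries) >= self.capacity:
            raise RuntimeError("ROB full: the frontend must stall")
        entry = {"uop": uop_name, "done": False, "result": None}
        self.entries.append(entry)
        return entry

    def complete(self, entry, result):
        # Execution may finish out of order; the result waits in the entry.
        entry["done"] = True
        entry["result"] = result

    def retire(self):
        # Retire from the head only, stopping at the first unfinished uop.
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["uop"])
        return retired


rob = ReorderBuffer()
a, b, c = rob.add("a"), rob.add("b"), rob.add("c")
rob.complete(c, 3)
print(rob.retire())   # c is done, but a and b block it
rob.complete(a, 1)
print(rob.retire())
rob.complete(b, 2)
print(rob.retire())
```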

17
(No Transcript)

18
AMD Hammer Microarchitecture

12-stage pipeline
Pre-decoded instruction memory
With ID bits to identify branch instructions and
the first byte of each instruction
Partitioned Register file
Bigger data cache memory

19
AMD Hammer Architectural Extensions (64 bit)

20
Classical Pipelining

Synchronous digital circuit
Partition combinational logic into stages
Insert pipeline registers between stages

(Figure label: Pipeline register)

21
Classical Pipelining - Problems

For max performance, all stages must be busy all
the time.
How many LC2K1 instructions do something useful
in each stage?
Logic must be divided equally so that all
computations finish at exactly the same time.
How long does it take to complete the LC2K1
decode stage?
Very deep pipelines have a lot of overhead
writing to the pipeline registers.
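
The balance and overhead problems can be illustrated with a short calculation: the clock period is set by the slowest stage plus the pipeline register overhead, so unbalanced stages leave the faster ones idle for part of every cycle. The stage delays below are invented numbers, not LC2K1 measurements.

```python
def clock_period(stage_delays, register_overhead):
    """The clock must accommodate the slowest stage plus the register write."""
    return max(stage_delays) + register_overhead


def useful_fraction(stage_delays, register_overhead):
    """Share of total stage time spent on logic rather than waiting or latching."""
    period = clock_period(stage_delays, register_overhead)
    return sum(stage_delays) / (len(stage_delays) * period)


balanced = [10, 10, 10, 10, 10]    # same total logic delay (50) in both cases
unbalanced = [5, 8, 20, 10, 7]
for delays in (balanced, unbalanced):
    print(clock_period(delays, 1), round(useful_fraction(delays, 1), 2))
```

Both pipelines contain the same amount of logic, but the unbalanced one needs a clock period almost twice as long.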

22
Wave Pipelining

Also referred to as maximal rate pipelining
Allows multiple data waves simultaneously between
successive storage elements (registers or
pipeline registers).
So pipeline registers are not needed.
Uses clock period that is less than max
propagation delay between the registers.

23
Wave Pipelining (Cont.)

Data at input is changed before previous data has
completely propagated through to output.
Picture a water slide

(Figure label: Cycle time)

24
Wave Pipelining Example

Min delay of 16, max delay of 20

25
Wave Pipelining - Maximizing Clock Rate

Minimum cycle time limited by difference between
min and max Input-Output delays (and device
switching speed).
For max clock rate - must equalize all path
delays from input to output.
Factors
Topological path differences.
Process/temperature/power variations.
Data-dependent delay variations.
Intentional clock skew?
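
Using the delays from the earlier example (min 16, max 20), the limit can be worked through numerically. This is an idealized sketch: the `overhead` parameter is an assumed lump term for setup/hold time and clock skew, set to zero here, and the wave count is only a rough estimate of how many data waves are in the logic at once.

```python
import math


def min_wave_period(d_min, d_max, overhead=0.0):
    """Idealized lower bound on the clock period for wave pipelining.

    The period must cover the spread between the fastest and slowest paths.
    `overhead` (setup/hold, skew) is an assumed lump term.
    """
    return (d_max - d_min) + overhead


def waves_in_flight(d_max, period):
    """Rough count of data waves between the registers at one time."""
    return math.ceil(d_max / period)


period = min_wave_period(d_min=16, d_max=20)
print(period)                        # spread between slowest and fastest path
print(waves_in_flight(20, period))   # versus 1 wave with a period of 20
```

Equalizing the path delays shrinks the spread, which is why it is the main lever for raising the clock rate.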

26
Wave Pipelining - Problems

Operating speed constrained to narrow range of
frequencies for given degree of wave pipelining.
New fabrication process requires significant
redesign
No effective mechanism for starting/stopping
Pipeline stalls, low speed testing?
In general, very hard to do circuit analysis.
