9th Lecture

Slides: 44

Provided by: unge

Transcript and Presenter's Notes

1
9th Lecture

Branch prediction (rest)
Predication
Intel Pentium II/III
Intel Pentium 4

2
Hybrid Predictors

The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.
Two or more predictors and a predictor selection mechanism are necessary in a combining or hybrid predictor.
  McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
  Young and Smith: a compiler-based static branch prediction combined with a two-level adaptive type,
  and many more combinations!
Hybrid predictors are often better than single-type predictors.
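
A combining predictor of the McFarling kind can be sketched as follows. This is a minimal illustration, not the original design: the table size, the initial counter values, the chooser update rule and all names (HybridPredictor, _bump) are assumptions made for the example.

```python
def _bump(counter, up):
    """Saturating 2-bit counter update (values 0..3)."""
    return min(3, counter + 1) if up else max(0, counter - 1)


class HybridPredictor:
    """Two-bit (bimodal) predictor plus gshare, with a 2-bit chooser per entry.

    All sizes are illustrative assumptions.
    """

    def __init__(self, index_bits=10):
        self.size = 1 << index_bits
        self.mask = self.size - 1
        self.bimodal = [1] * self.size   # 2-bit counters, start weakly not taken
        self.gshare = [1] * self.size
        self.chooser = [1] * self.size   # >= 2 means "trust gshare"
        self.history = 0                 # global branch history register

    def _indices(self, pc):
        # bimodal: branch address only; gshare: address XOR global history
        return pc & self.mask, (pc ^ self.history) & self.mask

    def predict(self, pc):
        b, g = self._indices(pc)
        if self.chooser[b] >= 2:
            return self.gshare[g] >= 2
        return self.bimodal[b] >= 2

    def update(self, pc, taken):
        b, g = self._indices(pc)
        bimodal_ok = (self.bimodal[b] >= 2) == taken
        gshare_ok = (self.gshare[g] >= 2) == taken
        # Train the chooser only when the two component predictors disagree.
        if bimodal_ok != gshare_ok:
            self.chooser[b] = _bump(self.chooser[b], gshare_ok)
        self.bimodal[b] = _bump(self.bimodal[b], taken)
        self.gshare[g] = _bump(self.gshare[g], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask
```

On a strictly alternating branch (taken, not taken, taken, ...) the two-bit component keeps mispredicting, while the gshare component learns the pattern from the history, so the chooser drifts toward gshare.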

3
Simulations of Grunwald 1998
Table 1.1. SAg, gshare and McFarling's combining predictor
4
Results

Simulations by Keeton et al. (1998) using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.
Two different conclusions may be drawn from these simulation results:
  Branch predictors should be further improved,
  and/or branch prediction is only effective if the branch is predictable.
If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior. → Question: What is the confidence of a branch prediction?
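
The figures on this slide can be combined into a rough back-of-the-envelope estimate. The sketch below assumes that 14 and 21 are percentages (misprediction rate per branch and branch frequency per instruction); the variable names are chosen for the example only.

```python
branch_frequency = 0.21     # fraction of instructions that are branches (assumed to be 21%)
misprediction_rate = 0.14   # fraction of branches that are mispredicted (assumed to be 14%)
speculation_factor = 1.4    # instructions decoded / instructions committed

# Mispredictions per executed instruction and the average distance between them
mispredictions_per_instruction = branch_frequency * misprediction_rate
instructions_between_mispredictions = 1 / mispredictions_per_instruction

# Share of decoded instructions that never commit
discarded_fraction = (speculation_factor - 1) / speculation_factor

print(round(mispredictions_per_instruction, 4))     # 0.0294
print(round(instructions_between_mispredictions))   # 34
print(round(discarded_fraction, 3))                 # 0.286
```

Under these assumptions a misprediction occurs roughly every 34 instructions, and a little more than a quarter of all decoded instructions are thrown away.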

5
4.3.4 Predicated Instructions and Multipath Execution: Confidence Estimation

Confidence estimation is a technique for
assessing the quality of a particular prediction.
Applied to branch prediction, a confidence
estimator attempts to assess the prediction made
by a branch predictor.
A low confidence branch is a branch which
frequently changes its branch direction in an
irregular way making its outcome hard to predict
or even unpredictable.
Four classes are possible:
  correctly predicted with high confidence C(HC),
  correctly predicted with low confidence C(LC),
  incorrectly predicted with high confidence I(HC), and
  incorrectly predicted with low confidence I(LC).

6
Implementation of a confidence estimator

Information from the branch prediction tables is used:
  Use of saturation counter information to construct a confidence estimator → speculate more aggressively when the confidence level is higher.
  Use of a miss distance counter (MDC) table → Each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise.
  A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.
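
The miss distance counter scheme can be sketched as below. The slide only states the threshold comparison; that the counter counts correct predictions since the last misprediction and is reset on a misprediction is the usual reading and is assumed here, as are the table size, threshold, saturation value and all names.

```python
from collections import Counter


class MissDistanceCounterTable:
    """Per-branch counters of correct predictions since the last misprediction."""

    def __init__(self, entries=1024, threshold=3, max_count=15):
        self.entries = entries
        self.threshold = threshold
        self.max_count = max_count
        self.table = [0] * entries

    def high_confidence(self, pc):
        # High confidence only if the counter is above the threshold.
        return self.table[pc % self.entries] > self.threshold

    def update(self, pc, prediction_correct):
        i = pc % self.entries
        if prediction_correct:
            self.table[i] = min(self.max_count, self.table[i] + 1)
        else:
            self.table[i] = 0   # a misprediction resets the distance


def classify(mdc, pc, outcomes):
    """Tally the four classes C(HC), C(LC), I(HC), I(LC) for one branch.

    `outcomes` is a sequence of booleans: was the prediction correct?
    """
    tally = Counter()
    for correct in outcomes:
        confidence = "HC" if mdc.high_confidence(pc) else "LC"
        tally[("C" if correct else "I") + "(" + confidence + ")"] += 1
        mdc.update(pc, correct)
    return tally
```

A good estimator keeps I(HC) small: speculation control or multipath execution would then be triggered mainly for the low-confidence branches.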

7
Predicated Instructions

Provide predicated or conditional instructions and one or more predicate registers.
Predicated instructions use a predicate register as an additional input operand.
The Boolean result of a condition test is recorded in a (one-bit) predicate register.
Predicated instructions are fetched, decoded and placed in the instruction window like non-predicated instructions.
How far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved depends on the processor architecture:
  A predicated instruction executes only if its predicate is true, otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
  Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but commits only if the predicate is true, otherwise the result is discarded.

8
Predication Example

if (x == 0) {                /* branch b1 */
  a = b + c;
  d = e - f;
}
g = h * i;                   /* instruction independent of branch b1 */

(Pred = (x == 0))            /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c;      /* The operations are only performed */
if Pred then d = e - f;      /* if Pred is set to true */
g = h * i;
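
The two versions of the example can be compared in a small executable sketch. It assumes that the operators lost in the transcript are + and * and that the second predicated statement assigns to d; the function names and default values are hypothetical. The predicated variant models the "execute always, commit only if the predicate is true" behavior described for IA-64.

```python
def with_branch(x, b, c, e, f, h, i, a=0, d=0):
    """Original code: the two assignments are guarded by branch b1."""
    if x == 0:
        a = b + c
        d = e - f
    g = h * i
    return a, d, g


def with_predication(x, b, c, e, f, h, i, a=0, d=0):
    """If-converted code: no control transfer around the guarded operations."""
    pred = (x == 0)              # predicate register
    a_result = b + c             # executed regardless of pred
    d_result = e - f             # executed regardless of pred
    a = a_result if pred else a  # result committed only if pred is true
    d = d_result if pred else d
    g = h * i
    return a, d, g
```

Both functions return the same values for every input; the difference is that the predicated form spends execution resources on b + c and e - f even when the predicate is false.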

9
Predication

Able to eliminate a branch and therefore the associated branch prediction → increases the distance between mispredictions.
The run length of a code block is increased → better compiler scheduling.
Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
Predicated instructions that are discarded still consume processor resources, especially the fetch bandwidth.
Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.

10
Eager (Multipath) Execution

Execution proceeds down both paths of a branch, and no prediction is made.
When a branch resolves, all operations on the non-taken path are discarded.
Oracle execution: eager execution with unlimited resources
  gives the same theoretical maximum performance as a perfect branch prediction.
With limited resources, the eager execution strategy must be employed carefully.
A mechanism is required that decides when to employ prediction and when to employ eager execution, e.g. a confidence estimator.
Rarely implemented (IBM mainframes), but there are some research projects:
  Dansoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution

11
[Figure: (a) single path speculative execution, (b) full eager execution, (c) disjoint eager execution]
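
The idea behind disjoint eager execution, as it is commonly described, is to spend limited resources on the path segments with the highest cumulative probability of being executed. The sketch below illustrates that selection under a strong simplification: every branch is assumed to have the same prediction accuracy. The function name and the "P"/"N" path encoding are made up for the example.

```python
import heapq


def disjoint_eager_paths(accuracy, resources):
    """Choose `resources` path segments, most likely first.

    A path is a string of 'P' (predicted direction) and 'N' (non-predicted
    direction). Its cumulative probability is the product of the
    probabilities of its steps.
    """
    # heapq is a min-heap, so probabilities are stored negated.
    heap = [(-accuracy, "P"), (-(1 - accuracy), "N")]
    heapq.heapify(heap)
    chosen = []
    while heap and len(chosen) < resources:
        neg_prob, path = heapq.heappop(heap)
        chosen.append(path)
        prob = -neg_prob
        # The two successors of the chosen segment become candidates.
        heapq.heappush(heap, (-(prob * accuracy), path + "P"))
        heapq.heappush(heap, (-(prob * (1 - accuracy)), path + "N"))
    return chosen


print(disjoint_eager_paths(0.7, 5))   # ['P', 'PP', 'PPP', 'N', 'PPPP']
```

With an accuracy of 0.7 the predicted path is followed three branches deep (0.7, 0.49, 0.343) before the first non-predicted segment (0.3) becomes the best remaining candidate. Single path execution would never pick 'N'; full eager execution would pick 'P' and 'N' before going deeper.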
12
4.3.5 Prediction of Indirect Branches

Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
Indirect branches occur frequently in machine code compiled from object-oriented programs such as C++ and Java programs.
One simple solution is to update the PHT to include the branch target addresses.
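
A pattern history table that stores target addresses instead of taken/not-taken counters might look as follows. This is only a sketch: the indexing function (branch address XOR a history built from recent target bits), the table size and the class name are assumptions, not something the slide specifies.

```python
class IndirectTargetPredictor:
    """PHT whose entries hold the last target seen for a (pc, history) pair."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.targets = [None] * (1 << index_bits)
        self.history = 0   # built from bits of recent targets (assumption)

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        """Return the predicted target address, or None if nothing is known."""
        return self.targets[self._index(pc)]

    def update(self, pc, actual_target):
        self.targets[self._index(pc)] = actual_target
        # Shift some bits of the actual target into the history.
        self.history = ((self.history << 2) ^ (actual_target >> 2)) & self.mask
```

For a virtual call that alternates between two targets, a table indexed by the branch address alone would always predict the previous target and always be wrong; with the history in the index, each of the two situations gets its own entry.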

13
Branch handling techniques and implementations

Technique                               Implementation examples
No branch prediction                    Intel 8086
Static prediction
  always not taken                      Intel i486
  always taken                          Sun SuperSPARC
  backward taken, forward not taken     HP PA-7x00
  semistatic with profiling             early PowerPCs
Dynamic prediction
  1-bit                                 DEC Alpha 21064, AMD K5
  2-bit                                 PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  two-level adaptive                    Intel PentiumPro, Pentium II, AMD K6, Athlon
Hybrid prediction                       DEC Alpha 21264
Predication                             Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201 and many others
Eager execution (limited)               IBM mainframes: IBM 360/91, IBM 3090
Disjoint eager execution                none yet

14
High-Bandwidth Branch Prediction

Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle,
  e.g. a GAg predictor is independent of the branch address.
When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
  Possible solution: trace cache in combination with next trace prediction.
Most likely a combination of branch handling techniques will be applied,
  e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.

15
The Intel P5 and P6 family
[Slide graphic: overview of the P5, P6 and NetBurst processor generations; one label reads "including L2 cache"]
16
Micro-Dataflow in PentiumPro 1995

"... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ..."
R.P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.

17
PentiumPro and Pentium II/III

The Pentium II/III processors use the same
dynamic execution microarchitecture as the other
members of P6 family.
This three-way superscalar, pipelined
micro-architecture features a decoupled,
multi-stage superpipeline, which trades less work
per pipestage for more stages.
The Pentium II/III processor has twelve stages
with a pipestage time 33 percent less than the
Pentium processor, which helps achieve a higher
clock rate on any given manufacturing process.
A wide instruction window using an instruction
pool.
Optimized scheduling requires the fundamental
execute phase to be replaced by decoupled
issue/execute and retire phases. This allows
instructions to be started in any order but
always be retired in the original program order.
Processors in the P6 family may be thought of as
three independent engines coupled with an
instruction pool.

18
Pentium Pro Processor and Pentium II/III
Microarchitecture
19
Pentium II/III
20
Pentium II/III The In-Order Section

The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit.
The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs.
Branch prediction:
  two-level adaptive scheme of Yeh and Patt,
  BTB contains 512 entries, maintains branch history information and the predicted branch target address.
  Branch misprediction penalty: at least 11 cycles, on average 15 cycles
The instruction decoder unit (IDU) is composed of three separate decoders.
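
The BTB-based two-level scheme described on this slide can be illustrated with a small model. Only the 512 entries come from the slide; the 4-bit per-branch history, the per-entry pattern table of 2-bit counters, the direct-mapped organization and all names are assumptions for the sake of a runnable sketch.

```python
BTB_ENTRIES = 512    # from the slide
HISTORY_BITS = 4     # assumption for illustration


class BTBEntry:
    def __init__(self, tag, target):
        self.tag = tag
        self.target = target
        self.history = 0                            # first level: branch history
        self.counters = [1] * (1 << HISTORY_BITS)   # second level: 2-bit counters


class BranchTargetBuffer:
    def __init__(self):
        self.entries = [None] * BTB_ENTRIES

    def predict(self, pc):
        """Return the predicted target, or None for 'not taken' or a BTB miss."""
        entry = self.entries[pc % BTB_ENTRIES]
        if entry is None or entry.tag != pc:
            return None
        taken = entry.counters[entry.history] >= 2
        return entry.target if taken else None

    def update(self, pc, taken, target):
        i = pc % BTB_ENTRIES
        entry = self.entries[i]
        if entry is None or entry.tag != pc:
            if not taken:
                return                  # allocate entries only for taken branches
            entry = self.entries[i] = BTBEntry(pc, target)
        c = entry.counters[entry.history]
        entry.counters[entry.history] = min(3, c + 1) if taken else max(0, c - 1)
        entry.history = ((entry.history << 1) | int(taken)) & ((1 << HISTORY_BITS) - 1)
        if taken:
            entry.target = target
```

A loop branch that is taken three times and then falls through is predicted without error once the four history patterns have been seen, which a plain 2-bit counter cannot achieve.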

21
Pentium II/III The In-Order Section (Continued)

A decoder breaks the IA-32 instruction down into µops, each consisting of an opcode, two source operands and one destination operand. These µops are of fixed length.
  Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
  some instructions are decoded into one to four µops (by the general decoder),
  more complex instructions are used as indices into the microcode instruction sequencer (MIS), which will generate the appropriate stream of µops.
The µops are sent to the register alias table (RAT) where register renaming is performed, i.e., the logical IA-32 based register references are converted into references to physical registers.
Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).
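
Register renaming through an alias table can be sketched as follows. The sketch assumes a simple free list of 40 physical registers (matching the 40 ROB entries mentioned above) and a tuple representation of µops; both, like the class name, are illustrative choices, and freeing registers at retirement is left out.

```python
class RegisterAliasTable:
    """Maps logical IA-32 register names to physical register names."""

    def __init__(self, logical_regs, num_physical=40):
        # Initially every logical register refers to its committed value.
        self.mapping = {r: r for r in logical_regs}
        self.free = ["p%d" % n for n in range(num_physical)]

    def rename(self, opcode, dst, src1, src2):
        """Rename one µop (opcode, destination, two sources)."""
        # Sources are looked up first, so they see the previous producer.
        s1, s2 = self.mapping[src1], self.mapping[src2]
        # A real pipeline would stall here if no physical register were free.
        phys = self.free.pop(0)
        self.mapping[dst] = phys
        return (opcode, phys, s1, s2)


rat = RegisterAliasTable(["eax", "ebx", "ecx"])
print(rat.rename("add", "eax", "eax", "ebx"))   # ('add', 'p0', 'eax', 'ebx')
print(rat.rename("sub", "ecx", "eax", "ebx"))   # ('sub', 'p1', 'p0', 'ebx')
print(rat.rename("add", "eax", "ecx", "eax"))   # ('add', 'p2', 'p1', 'p0')
```

The first and third µop both write eax but receive different physical registers (p0 and p2), so only the true data dependences remain.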

22
The Fetch/Decode Unit
[Figure: (a) in-order section with the labels IA-32 instructions, Instruction Fetch Unit (Next_IP, I-cache, Alignment), Branch Target Buffer, Instruction Decode Unit, Register Alias Table; (b) instruction decoder unit (IDU) with the labels Simple Decoder, Simple Decoder, General Decoder, Microcode Instruction Sequencer, op1, op2, op3]
23
The Out-of-Order Execute Section

When the µops flow into the ROB, they effectively take a place in program order.
µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
After completion the result goes to two different places, RSU and ROB.
The RSU has five ports and can issue at a peak rate of 5 µops each cycle.
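
Issue according to dataflow constraints and port availability can be modelled in a few lines. The sketch assumes that each µop is bound to one of the five ports, that the reservation stations are scanned oldest first, and that at most one µop per port is issued per cycle; the MicroOp fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MicroOp:
    name: str
    sources: tuple   # physical registers the µop reads
    dest: str        # physical register the µop writes
    port: int        # issue port 0..4 it needs


def select_for_issue(stations, ready, num_ports=5):
    """Return the µops that can issue this cycle.

    `stations` is ordered oldest first; `ready` is the set of physical
    registers whose values are available.
    """
    issued, busy_ports = [], set()
    for uop in stations:
        if uop.port >= num_ports or uop.port in busy_ports:
            continue                     # resource constraint
        if all(src in ready for src in uop.sources):
            issued.append(uop)           # dataflow constraint satisfied
            busy_ports.add(uop.port)
    return issued


stations = [
    MicroOp("load", ("p9",), "p1", port=2),      # waits for p9
    MicroOp("add", ("p1", "p2"), "p5", port=0),  # waits for the load
    MicroOp("mul", ("p3", "p4"), "p6", port=0),  # ready
    MicroOp("sub", ("p3",), "p7", port=0),       # ready, but port 0 is taken
    MicroOp("or", ("p4",), "p8", port=1),        # ready
]
print([u.name for u in select_for_issue(stations, {"p2", "p3", "p4"})])  # ['mul', 'or']
```

The younger mul and or µops overtake the older load and add, which is exactly the "without regard to the original ordering" behavior described above.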

24
Latencies and throughput for Pentium II/III FUs
25
Issue/Execute Unit
26
The In-Order Retire Section.

A µop can be retired
  if its execution is completed,
  if it is its turn in program order,
  and if no interrupt, trap, or misprediction occurred.
Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
Three µops per clock cycle can be retired.
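
The retirement rules can be sketched with the ROB as a queue in program order. The dictionary layout of a ROB entry and the function name are assumptions; what a real processor does on an exception or misprediction (flushing younger µops) is not modelled, retirement simply stops at that entry.

```python
from collections import deque


def retire(rob, rrf, width=3):
    """Retire up to `width` µops from the head of the ROB into the RRF.

    `rob` is a deque of dicts in program order; `rrf` maps register names
    to committed values.
    """
    retired = []
    while rob and len(retired) < width:
        head = rob[0]
        # Stop at the first µop that has not completed or that raised an
        # exception: younger µops must not retire before it.
        if not head["done"] or head.get("exception"):
            break
        rob.popleft()
        rrf[head["dest"]] = head["value"]   # speculative result becomes permanent
        retired.append(head["name"])
    return retired
```

If the third µop in the ROB has not finished, only the first two retire in that cycle, even when the fourth is already complete.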

27
Retire Unit
28
The Pentium II/III Pipeline
[Figure: the Pentium II/III pipeline in three parts (a), (b) and (c). Stage labels: BTB0, BTB1 (BTB access); IFU0, IFU1, IFU2 (fetch and predecode, I-cache access); IDU0, IDU1 (decode); RAT (register renaming); ROB read (reorder buffer read); RSU (reservation station); issue; Port 0 to Port 4 (execution and completion); ROB write (reorder buffer write-back); ROB read (reorder buffer read); RRF (retirement)]
29
Pentium Pro Processor Basic Execution Environment
[Figure: address space from 0 to 2^32 - 1; general purpose registers: eight 32-bit registers; segment registers: six 16-bit registers; EFLAGS register: 32 bits; EIP (instruction pointer register): 32 bits]
The address space can be flat or segmented.
30
Application Programming Registers
31
Pentium III
32
Pentium II/III summary and offspring

Pentium III in 1999, initially at 450 MHz (0.25
micron technology), former name Katmai
two 32 kB caches, faster floating-point
performance
Coppermine is a shrink of Pentium III down to
0.18 micron.

33
Pentium 4

Was announced for mid-2000 under the code name
Willamette
native IA-32 processor with Pentium III processor
core
running at 1.5 GHz
42 million transistors
0.18 µm
20 pipeline stages (integer pipeline), IF and ID
not included
trace execution cache (TEC) for the decoded µOps
NetBurst micro-architecture

34
Pentium 4 Features

Rapid Execution Engine
  Intel: Arithmetic Logic Units (ALUs) run at twice the processor frequency.
  Fact: two ALUs, running at processor frequency, connected with a multiplexer running at twice the processor frequency.
Hyper Pipelined Technology
  Twenty-stage pipeline to enable high clock rates
  Frequency headroom and performance scalability

35
Advanced Dynamic Execution

Very deep, out-of-order, speculative execution engine
  Up to 126 instructions in flight (3 times more than the Pentium III processor)
  Up to 48 loads and 24 stores in the pipeline (2 times more than the Pentium III processor)
Branch prediction
  based on µops
  4K-entry branch target array (8 times larger than in the Pentium III processor)
  new algorithm (not specified), reduces mispredictions by about one third compared to the gshare predictor of the P6 generation

36
First level caches

12k µOP Execution Trace Cache (100 k)
Execution Trace Cache that removes decoder
latency from main execution loops
Execution Trace Cache integrates path of program
execution flow into a single line
Low latency 8 kByte data cache with 2 cycle
latency

37
Second level caches

Included on the die
size 256 kB
Full-speed, unified 8-way 2nd-level on-die
Advance Transfer Cache
256-bit data bus to the level 2 cache
Delivers 45 GB/s data throughput (at 1.4 GHz
processor frequency)
Bandwidth and performance increase with processor frequency
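
The quoted throughput follows from the bus width and the clock rate, assuming one 256-bit transfer per processor clock (an assumption; the slide does not state the transfer rate explicitly):

```python
l2_bus_width_bits = 256     # from the slide
processor_clock_mhz = 1400  # 1.4 GHz, from the slide

bytes_per_transfer = l2_bus_width_bits // 8                     # 32 bytes
l2_bandwidth_mb_per_s = bytes_per_transfer * processor_clock_mhz
print(l2_bandwidth_mb_per_s / 1000)                             # 44.8 (GB/s)
```

The result of 44.8 GB/s is what the slide rounds to 45 GB/s, and it scales linearly with the processor frequency.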

38
NetBurst Micro-Architecture
39
Streaming SIMD Extensions 2 (SSE2) Technology

SSE2 extends MMX and SSE technology with the addition of 144 new instructions, which include support for
  128-bit SIMD integer arithmetic operations,
  128-bit SIMD double-precision floating-point operations,
  cache and memory management operations.
Further enhances and accelerates video, speech, encryption, image and photo processing.

40
400 MHz Intel NetBurst micro-architecture system
bus

Provides 3.2 GB/s throughput (3 times faster than
the Pentium III processor).
Quad-pumped 100 MHz scalable bus clock to achieve 400 MHz effective speed.
Split-transaction, deeply pipelined.
128-byte lines with 64-byte accesses.
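
The figures on this slide are consistent with each other, as a short calculation shows. The width of the data bus is not given on the slide; it is derived here from the stated throughput and transfer rate.

```python
bus_clock_mhz = 100          # from the slide
transfers_per_clock = 4      # "quad-pumped", from the slide
throughput_mb_per_s = 3200   # 3.2 GB/s, from the slide

effective_mhz = bus_clock_mhz * transfers_per_clock        # 400
bytes_per_transfer = throughput_mb_per_s // effective_mhz  # 8
print(effective_mhz, bytes_per_transfer * 8)               # 400 64
```

A throughput of 3.2 GB/s at 400 million transfers per second corresponds to 8 bytes, i.e. 64 bits, per transfer.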

41
Pentium 4 data types
42
Pentium 4
43
Pentium 4 offspring

Foster
  Pentium 4 with external L3 cache and DDR-SDRAM support
  intended for servers
  clock rate 1.7 - 2 GHz
  to be launched in Q2/2001
Northwood
  0.13 µm technology
  new 478-pin socket
