Title: 9th Lecture
- Branch prediction (rest)
- Predication
- Intel Pentium II/III
- Intel Pentium 4
Hybrid Predictors
- The second strategy of McFarling is to combine multiple separate branch predictors, each tuned to a different class of branches.
- A combining (hybrid) predictor needs two or more component predictors and a predictor selection mechanism.
- McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
- Young and Smith: a compiler-based static branch prediction combined with a two-level adaptive predictor,
- and many more combinations!
- Hybrid predictors are often better than single-type predictors.
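The combining idea can be sketched in a few lines of Python. This is a minimal illustration, not McFarling's exact design: the table size, the initial counter values, and the names `HybridPredictor` and `run` are assumptions made for the example. A chooser table of two-bit counters selects, per branch, between a two-bit (bimodal) predictor and a gshare predictor.

```python
class HybridPredictor:
    """Sketch of a combining predictor: bimodal + gshare + chooser."""

    def __init__(self, index_bits=10):
        self.size = 1 << index_bits
        self.mask = self.size - 1
        # All tables hold 2-bit saturating counters (values 0..3).
        self.bimodal = [1] * self.size   # indexed by branch address
        self.gshare = [1] * self.size    # indexed by address XOR global history
        self.chooser = [1] * self.size   # 0,1: use bimodal; 2,3: use gshare
        self.history = 0                 # global branch history register

    @staticmethod
    def _update(counter, up):
        # Saturating increment or decrement of a 2-bit counter.
        return min(counter + 1, 3) if up else max(counter - 1, 0)

    def predict(self, pc):
        b = pc & self.mask
        g = (pc ^ self.history) & self.mask
        p_bimodal = self.bimodal[b] >= 2
        p_gshare = self.gshare[g] >= 2
        use_gshare = self.chooser[b] >= 2
        return (p_gshare if use_gshare else p_bimodal), p_bimodal, p_gshare

    def update(self, pc, taken):
        b = pc & self.mask
        g = (pc ^ self.history) & self.mask
        _, p_bimodal, p_gshare = self.predict(pc)
        # Train the chooser only when the two components disagree.
        if p_bimodal != p_gshare:
            self.chooser[b] = self._update(self.chooser[b], p_gshare == taken)
        self.bimodal[b] = self._update(self.bimodal[b], taken)
        self.gshare[g] = self._update(self.gshare[g], taken)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def run(predictor, trace):
    """Count correct predictions over a trace of (pc, taken) pairs."""
    correct = 0
    for pc, taken in trace:
        prediction, _, _ = predictor.predict(pc)
        correct += prediction == taken
        predictor.update(pc, taken)
    return correct
```

A strictly alternating branch is a useful test case: the two-bit counter alone keeps mispredicting it, while gshare learns the pattern from the history, so the chooser drifts towards gshare for that branch.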
Simulations of Grunwald 1998
Table 1.1. SAg, gshare and McFarling's combining predictor
Results
- A simulation by Keeton et al. (1998) using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
- The speculative execution factor, given by the number of instructions decoded divided by the number of instructions committed, is 1.4 for the database programs.
- Two different conclusions may be drawn from these simulation results:
- Branch predictors should be further improved,
- and/or branch prediction is only effective if the branch is predictable.
- If a branch outcome depends on irregular data inputs, the branch often shows irregular behavior.
-> Question: how confident is a branch prediction?
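To get a feeling for these numbers, the following back-of-the-envelope calculation takes the misprediction rate as 14 percent and the branch frequency as 21 percent. The derived quantities are not from the cited study; they follow from simple arithmetic, under the assumption that both rates refer to the same instruction stream.

```python
branch_frequency = 0.21      # fraction of instructions that are branches
misprediction_rate = 0.14    # fraction of branches that are mispredicted
speculation_factor = 1.4     # instructions decoded / instructions committed

# Expected mispredictions per instruction and the average distance between them.
mispredictions_per_instruction = branch_frequency * misprediction_rate
instructions_between_mispredictions = 1 / mispredictions_per_instruction

# Share of decoded instructions that never commit.
discarded_fraction = 1 - 1 / speculation_factor

print(f"{mispredictions_per_instruction:.4f} mispredictions per instruction")
print(f"one misprediction every {instructions_between_mispredictions:.0f} instructions")
print(f"{discarded_fraction:.1%} of decoded instructions are discarded")
```

Under these assumptions a misprediction occurs roughly every 34 instructions, and a bit more than a quarter of all decoded instructions are thrown away.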
4.3.4 Predicated Instructions and Multipath Execution: Confidence Estimation
- Confidence estimation is a technique for assessing the quality of a particular prediction.
- Applied to branch prediction, a confidence estimator attempts to assess the prediction made by a branch predictor.
- A low-confidence branch is a branch that frequently changes its direction in an irregular way, making its outcome hard to predict or even unpredictable.
- Four classes are possible:
- correctly predicted with high confidence C(HC),
- correctly predicted with low confidence C(LC),
- incorrectly predicted with high confidence I(HC), and
- incorrectly predicted with low confidence I(LC).
Implementation of a confidence estimator
- Information from the branch prediction tables is used.
- Use of saturation counter information to construct a confidence estimator -> speculate more aggressively when the confidence level is higher.
- Use of a miss distance counter (MDC) table -> each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise.
- A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
- Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.
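The miss distance counter scheme can be sketched as follows. The sketch assumes that an MDC entry counts the correct predictions since the last misprediction of the branches mapped to it and is reset on a misprediction; the table size, threshold, saturation limit, and the names `MissDistanceConfidence` and `classify` are illustrative choices, not values from the lecture.

```python
class MissDistanceConfidence:
    """Confidence estimator based on a miss distance counter (MDC) table."""

    def __init__(self, index_bits=8, threshold=4, max_count=15):
        self.mask = (1 << index_bits) - 1
        self.threshold = threshold
        self.max_count = max_count
        self.mdc = [0] * (1 << index_bits)

    def is_high_confidence(self, pc):
        # High confidence only if the counter is above the threshold.
        return self.mdc[pc & self.mask] > self.threshold

    def update(self, pc, prediction_correct):
        i = pc & self.mask
        if prediction_correct:
            self.mdc[i] = min(self.mdc[i] + 1, self.max_count)  # saturate
        else:
            self.mdc[i] = 0  # a misprediction resets the miss distance


def classify(high_confidence, correct):
    """Map one prediction to one of the classes C(HC), C(LC), I(HC), I(LC)."""
    return ("C" if correct else "I") + ("(HC)" if high_confidence else "(LC)")
```

A good estimator puts most mispredictions into I(LC) and keeps I(HC) small, because I(HC) cases are the ones where aggressive speculation is wasted.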
Predicated Instructions
- Provide predicated or conditional instructions and one or more predicate registers.
- Predicated instructions use a predicate register as an additional input operand.
- The Boolean result of a condition test is recorded in a (one-bit) predicate register.
- Predicated instructions are fetched, decoded, and placed in the instruction window like non-predicated instructions.
- How far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved depends on the processor architecture:
- A predicated instruction executes only if its predicate is true; otherwise the instruction is discarded. In this case predicated instructions are not executed before the predicate is resolved.
- Alternatively, as reported for Intel's IA-64 ISA, the predicated instruction may be executed, but it commits only if the predicate is true; otherwise the result is discarded.
Predication Example
if (x == 0) {    /* branch b1 */
  a = b + c;
  d = e - f;
}
g = h * i;       /* instruction independent of branch b1 */

(Pred = (x == 0))          /* branch b1: Pred is set to true if x equals 0 */
if Pred then a = b + c;    /* The operations are only performed */
if Pred then d = e - f;    /* if Pred is set to true */
g = h * i;
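The equivalence of the two code versions can be checked with a small Python model. It is only a sketch: the concrete operators (+, -, *) and the helper `select` are chosen for illustration, and `select` models the variant in which the predicated operation is always executed but its result is kept only if the predicate is true.

```python
def with_branch(x, b, c, e, f, h, i, a=0, d=0):
    """Original code: the two assignments are guarded by branch b1."""
    if x == 0:          # branch b1
        a = b + c
        d = e - f
    g = h * i           # independent of branch b1
    return a, d, g


def select(pred, new, old):
    """Branch-free select: new if pred is 1, old if pred is 0."""
    return pred * new + (1 - pred) * old


def with_predication(x, b, c, e, f, h, i, a=0, d=0):
    """Predicated code: no branch, results commit only if pred is set."""
    pred = int(x == 0)              # condition test writes the predicate
    a = select(pred, b + c, a)      # executed always, kept only if pred
    d = select(pred, e - f, d)
    g = h * i
    return a, d, g
```

Both functions return the same values for every input; the predicated version has a single straight-line code block, which is what gives the compiler more scheduling freedom.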
Predication
- Able to eliminate a branch and therefore the associated branch prediction -> increases the distance between mispredictions.
- The run length of a code block is increased -> better compiler scheduling.
- Predication affects the instruction set, adds a port to the register file, and complicates instruction execution.
- Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.
- Predication is most effective when control dependences can be completely eliminated, such as in an if-then with a small then body.
- The use of predicated instructions is limited when the control flow involves more than a simple alternative sequence.
Eager (Multipath) Execution
- Execution proceeds down both paths of a branch, and no prediction is made.
- When a branch resolves, all operations on the non-taken path are discarded.
- Oracle execution: eager execution with unlimited resources
- gives the same theoretical maximum performance as a perfect branch prediction.
- With limited resources, the eager execution strategy must be employed carefully.
- A mechanism is required that decides when to employ prediction and when eager execution, e.g. a confidence estimator.
- Rarely implemented (IBM mainframes), but there are some research projects:
- DanSoft processor, Polypath architecture, selective dual path execution, simultaneous speculation scheduling, disjoint eager execution
Figure: (a) single-path speculative execution, (b) full eager execution, (c) disjoint eager execution
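The difference between the three strategies is the order in which limited execution resources are assigned to branch paths. The sketch below illustrates the idea behind disjoint eager execution: resources go to the path segments with the highest cumulative probability of being on the correct path. It assumes, purely for illustration, that every branch is predicted correctly with the same probability `p`; the function name and the path encoding are hypothetical.

```python
import heapq


def disjoint_eager_paths(p, resources):
    """Pick `resources` path segments in order of cumulative probability.

    A path is a string of decisions starting at the oldest unresolved branch:
    'P' follows the predicted direction, 'N' the non-predicted direction.
    """
    # heapq is a min-heap, so probabilities are stored negated.
    frontier = [(-p, "P"), (-(1 - p), "N")]
    heapq.heapify(frontier)
    chosen = []
    while frontier and len(chosen) < resources:
        neg_prob, path = heapq.heappop(frontier)
        chosen.append((path, -neg_prob))
        # Both successors of the chosen segment become candidates.
        heapq.heappush(frontier, (neg_prob * p, path + "P"))
        heapq.heappush(frontier, (neg_prob * (1 - p), path + "N"))
    return chosen
```

With `p = 0.7` and four resources the predicted path is followed three branches deep (0.7, 0.49, 0.343) before the non-predicted side of the first branch (0.3) is started. With `p = 1.0` the selection degenerates to single-path speculation, and full eager execution corresponds to expanding the tree level by level regardless of probability.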
4.3.5 Prediction of Indirect Branches
- Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
- Indirect branches occur frequently in machine code compiled from object-oriented programs such as C++ and Java programs.
- One simple solution is to update the PHT so that it includes the branch target addresses.
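One way to picture a history table that stores targets instead of taken/not-taken bits is the following sketch. The indexing function (branch address XOR a hash of recent targets), the table size, and the class name are assumptions for the example and do not describe a particular processor.

```python
class IndirectTargetPredictor:
    """History-indexed table whose entries hold predicted target addresses."""

    def __init__(self, index_bits=8):
        self.mask = (1 << index_bits) - 1
        self.targets = {}    # table index -> last target seen for that index
        self.history = 0     # hash of recently taken targets

    def _index(self, pc):
        return (pc ^ self.history) & self.mask

    def predict(self, pc):
        # Returns None if the entry has not been trained yet.
        return self.targets.get(self._index(pc))

    def update(self, pc, target):
        self.targets[self._index(pc)] = target
        # Shift the new target into the history hash.
        self.history = ((self.history << 2) ^ target) & self.mask
```

A call site that alternates between two targets (for example a virtual method call on objects of two classes) defeats a table that only remembers the last target per branch, but it becomes predictable once the history is part of the index.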
Branch handling techniques and implementations
- Technique: implementation examples
- No branch prediction: Intel 8086
- Static prediction
  - always not taken: Intel i486
  - always taken: Sun SuperSPARC
  - backward taken, forward not taken: HP PA-7x00
  - semistatic with profiling: early PowerPCs
- Dynamic prediction
  - 1-bit: DEC Alpha 21064, AMD K5
  - 2-bit: PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  - two-level adaptive: Intel PentiumPro, Pentium II, AMD K6, Athlon
- Hybrid prediction: DEC Alpha 21264
- Predication: Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201, and many others
- Eager execution (limited): IBM mainframes IBM 360/91, IBM 3090
- Disjoint eager execution: none yet
High-Bandwidth Branch Prediction
- Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle,
- e.g. the GAg predictor is independent of the branch address.
- When multiple branches are predicted per cycle, instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
- Possible solution: trace cache in combination with next trace prediction.
- Most likely a combination of branch handling techniques will be applied,
- e.g. a multi-hybrid branch predictor combined with support for context switching, indirect jumps, and interference handling.
The Intel P5 and P6 family
[Table: processors of the P5, P6, and NetBurst families; footnote: including L2 cache]
Micro-Dataflow in PentiumPro 1995
- "... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ..."
- R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.
PentiumPro and Pentium II/III
- The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family.
- This three-way superscalar, pipelined micro-architecture features a decoupled, multi-stage superpipeline, which trades less work per pipestage for more stages.
- The Pentium II/III processor has twelve stages with a pipestage time 33 percent less than the Pentium processor, which helps achieve a higher clock rate on any given manufacturing process.
- A wide instruction window using an instruction pool.
- Optimized scheduling requires the fundamental execute phase to be replaced by decoupled issue/execute and retire phases. This allows instructions to be started in any order but always be retired in the original program order.
- Processors in the P6 family may be thought of as three independent engines coupled with an instruction pool.
Pentium Pro Processor and Pentium II/III Microarchitecture
Pentium II/III
Pentium II/III: The In-Order Section
- The instruction fetch unit (IFU) accesses a non-blocking I-cache and contains the Next IP unit.
- The Next IP unit provides the I-cache index (based on inputs from the BTB), trap/interrupt status, and branch-misprediction indications from the integer FUs.
- Branch prediction:
- two-level adaptive scheme of Yeh and Patt,
- the BTB contains 512 entries and maintains branch history information and the predicted branch target address.
- Branch misprediction penalty: at least 11 cycles, on average 15 cycles.
- The instruction decoder unit (IDU) is composed of three separate decoders.
Pentium II/III: The In-Order Section (Continued)
- A decoder breaks the IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length.
- Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
- some instructions are decoded into one to four µops (by the general decoder),
- more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
- The µops are sent to the register alias table (RAT), where register renaming is performed, i.e., the logical IA-32 based register references are converted into references to physical registers.
- Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).
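The renaming step can be illustrated with a strongly simplified model. It assumes that every µop is allocated the next ROB entry as its physical destination and that a source without a pending producer is read from the retirement register file; the class name, the string labels, and the omission of ROB-full handling and entry recycling are simplifications for the example.

```python
class RegisterAliasTable:
    """Maps logical IA-32 registers to the ROB entry holding the newest value."""

    def __init__(self, rob_entries=40):
        self.rob_entries = rob_entries
        self.next_entry = 0
        self.alias = {}   # logical register -> physical location

    def rename(self, dest, src1, src2):
        # Sources use the current mapping; unmapped registers are
        # assumed to live in the retirement register file (RRF).
        sources = tuple(self.alias.get(s, "RRF." + s) for s in (src1, src2))
        # The destination gets a fresh ROB entry (allocated in program order).
        entry = f"ROB{self.next_entry}"
        self.next_entry = (self.next_entry + 1) % self.rob_entries
        self.alias[dest] = entry
        return entry, sources
```

Renaming a sequence that writes `eax` twice yields two different ROB entries for the two writes, so the second write no longer has to wait for readers of the first value.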
The Fetch/Decode Unit
[Figure: (a) in-order section, (b) instruction decoder unit (IDU). Labels: IA-32 instructions, Instruction Fetch Unit, Next_IP, Alignment, I-cache, Branch Target Buffer, two Simple Decoders, General Decoder, Microcode Instruction Sequencer, Instruction Decode Unit, Register Alias Table, op1, op2, op3]
The Out-of-Order Execute Section
- When the µops flow into the ROB, they effectively take a place in program order.
- µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
- µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program.
- After completion, the result goes to two different places, the RSU and the ROB.
- The RSU has five ports and can issue at a peak rate of 5 µops each cycle.
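Dataflow-order issue can be sketched as a selection over the reservation stations. The model is deliberately simplified and partly hypothetical: a µop is a dictionary with a list of source registers, every port is assumed to accept every µop, and selection order is simply oldest first.

```python
def issue_cycle(reservation_stations, ready_registers, ports=5):
    """Issue up to `ports` µops whose source operands are all available."""
    issued = []
    for uop in list(reservation_stations):   # iterate over a copy
        if len(issued) == ports:
            break                             # all issue ports are used
        if all(src in ready_registers for src in uop["sources"]):
            issued.append(uop)
            reservation_stations.remove(uop)  # free the reservation station
    return issued
```

A µop that waits for an operand stays in its reservation station while younger µops with ready operands pass it, which is exactly the departure from program order described above.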
Latencies and throughput for Pentium II/III FUs
Issue/Execute Unit
The In-Order Retire Section
- A µop can be retired
- if its execution is completed,
- if it is its turn in program order,
- and if no interrupt, trap, or misprediction occurred.
- Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
- Three µops per clock cycle can be retired.
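The three retirement conditions translate into a short loop over the head of the reorder buffer. This is a sketch under assumptions: the ROB is modelled as a `deque` of dictionaries in program order, and the field names `done`, `exception`, `dest`, and `value` are invented for the example.

```python
from collections import deque


def retire_cycle(rob, rrf, width=3):
    """Retire up to `width` completed µops from the ROB head, in program order."""
    retired = 0
    while rob and retired < width:
        head = rob[0]
        if not head["done"] or head["exception"]:
            break   # oldest µop is unfinished or needs special handling
        rrf[head["dest"]] = head["value"]   # commit speculative result to the RRF
        rob.popleft()
        retired += 1
    return retired
```

A completed µop behind an unfinished older one is not retired in this model, even though its result is already available.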
Retire Unit
The Pentium II/III Pipeline
[Figure with panels (a), (b), (c). Stage labels: BTB0, BTB1 (BTB access); IFU0, IFU1, IFU2 (fetch and predecode, I-cache access); IDU0, IDU1 (decode); RAT (register renaming); ROB read (reorder buffer read); RSU (reservation station, issue); Port 0 to Port 4 (execution and completion); ROB write (reorder buffer write-back); ROB read (reorder buffer read); RRF (retirement)]
Pentium Pro Processor Basic Execution Environment
[Figure: general purpose registers (eight 32-bit registers), segment registers (six 16-bit registers), EFLAGS register (32 bits), EIP (instruction pointer register, 32 bits), address space from 0 to 2^32 - 1]
The address space can be flat or segmented
Application Programming Registers
Pentium III
Pentium II/III summary and offspring
- Pentium III in 1999, initially at 450 MHz (0.25 micron technology), former name Katmai
- two 32 kB caches, faster floating-point performance
- Coppermine is a shrink of Pentium III down to 0.18 micron.
Pentium 4
- Was announced for mid-2000 under the code name Willamette
- native IA-32 processor with Pentium III processor core
- running at 1.5 GHz
- 42 million transistors
- 0.18 µm
- 20 pipeline stages (integer pipeline), IF and ID not included
- trace execution cache (TEC) for the decoded µops
- NetBurst micro-architecture
Pentium 4 Features
- Rapid Execution Engine
- Intel: Arithmetic Logic Units (ALUs) run at twice the processor frequency
- Fact: two ALUs, running at processor frequency, connected with a multiplexer running at twice the processor frequency
- Hyper Pipelined Technology
- Twenty-stage pipeline to enable high clock rates
- Frequency headroom and performance scalability
Advanced Dynamic Execution
- Very deep, out-of-order, speculative execution engine
- Up to 126 instructions in flight (3 times larger than the Pentium III processor)
- Up to 48 loads and 24 stores in the pipeline (2 times larger than the Pentium III processor)
- branch prediction
- based on µops
- 4K entry branch target array (8 times larger than the Pentium III processor)
- new algorithm (not specified), reduces mispredictions by about one third compared to the gshare of the P6 generation
First level caches
- 12k µop Execution Trace Cache (100 k)
- Execution Trace Cache removes decoder latency from main execution loops
- Execution Trace Cache integrates the path of program execution flow into a single line
- Low-latency 8 kByte data cache with 2-cycle latency
Second level caches
- Included on the die
- size: 256 kB
- Full-speed, unified, 8-way second-level on-die Advanced Transfer Cache
- 256-bit data bus to the level 2 cache
- Delivers 45 GB/s data throughput (at 1.4 GHz processor frequency)
- Bandwidth and performance increase with processor frequency
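The quoted throughput follows from the bus width and the clock rate. The calculation below assumes one 256-bit transfer per processor clock cycle, which is what "full-speed" suggests; the variable names are only for illustration.

```python
bus_width_bits = 256     # width of the data bus to the level 2 cache
core_clock_hz = 1.4e9    # processor frequency

bytes_per_cycle = bus_width_bits // 8                       # 32 bytes per transfer
bandwidth_gb_per_s = bytes_per_cycle * core_clock_hz / 1e9  # peak throughput

print(f"{bandwidth_gb_per_s:.1f} GB/s")
```

The result, 44.8 GB/s, is the figure that the slide rounds to 45 GB/s, and it scales linearly with the processor frequency.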
NetBurst Micro-Architecture
Streaming SIMD Extensions 2 (SSE2) Technology
- SSE2 extends MMX and SSE technology with the addition of 144 new instructions, which include support for
- 128-bit SIMD integer arithmetic operations.
- 128-bit SIMD double-precision floating-point operations.
- Cache and memory management operations.
- Further enhances and accelerates video, speech, encryption, image, and photo processing.
400 MHz Intel NetBurst micro-architecture system bus
- Provides 3.2 GB/s throughput (3 times faster than the Pentium III processor).
- Quad-pumped 100 MHz scalable bus clock to achieve 400 MHz effective speed.
- Split-transaction, deeply pipelined.
- 128-byte lines with 64-byte accesses.
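The bus figures can be reproduced with a short calculation. The 64-bit (8-byte) data bus width is an assumption added here; it is not stated on the slide.

```python
bus_clock_hz = 100e6        # physical bus clock
transfers_per_clock = 4     # quad-pumped: four data transfers per clock
data_bus_bytes = 8          # assumption: 64-bit data bus

effective_rate_hz = bus_clock_hz * transfers_per_clock
throughput_gb_per_s = effective_rate_hz * data_bus_bytes / 1e9

print(f"{effective_rate_hz / 1e6:.0f} MHz effective, {throughput_gb_per_s:.1f} GB/s")
```

Under this assumption the quad-pumped 100 MHz clock gives the 400 MHz effective rate and the 3.2 GB/s throughput quoted above.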
Pentium 4 data types
Pentium 4
Pentium 4 offspring
- Foster
- Pentium 4 with external L3 cache and DDR-SDRAM support
- intended for servers
- clock rate 1.7 - 2 GHz
- to be launched in Q2/2001
- Northwood
- 0.13 µm technology
- new 478-pin socket