1
9th Lecture
  • Branch prediction (rest)
  • Predication
  • Intel Pentium II/III
  • Intel Pentium 4

2
Hybrid Predictors
  • The second strategy of McFarling is to combine
    multiple separate branch predictors, each tuned
    to a different class of branches.
  • Two or more predictors and a predictor selection
    mechanism are necessary in a combining or hybrid
    predictor.
  • McFarling: combination of a two-bit predictor and a gshare two-level adaptive predictor,
  • Young and Smith: compiler-based static branch prediction combined with a two-level adaptive predictor,
  • and many more combinations!
  • Hybrid predictors are often better than single-type predictors (see the sketch below).
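As an illustration (not on the original slide): a minimal C sketch of a McFarling-style combining predictor, with a bimodal 2-bit predictor, a gshare predictor, and a per-entry 2-bit chooser counter. Table sizes and history length are assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define ENTRIES 4096                /* assumed size of all three tables   */

    static uint8_t bimodal[ENTRIES];    /* per-address 2-bit counters         */
    static uint8_t gshare[ENTRIES];     /* history-XOR-address 2-bit counters */
    static uint8_t chooser[ENTRIES];    /* 2-bit selector: >= 2 means gshare  */
    static uint32_t ghr;                /* global branch history register     */

    static bool taken(uint8_t c) { return c >= 2; }
    static void inc(uint8_t *c)  { if (*c < 3) (*c)++; }
    static void dec(uint8_t *c)  { if (*c > 0) (*c)--; }

    /* Make a prediction for the branch at 'pc', then train the predictor
     * with the actual outcome; returns the prediction that was made.      */
    bool predict_and_update(uint32_t pc, bool outcome)
    {
        uint32_t bi = (pc >> 2) % ENTRIES;
        uint32_t gi = ((pc >> 2) ^ ghr) % ENTRIES;

        bool p_bim = taken(bimodal[bi]);
        bool p_gsh = taken(gshare[gi]);
        bool pred  = taken(chooser[bi]) ? p_gsh : p_bim;

        /* The chooser is trained only when the two components disagree. */
        if (p_bim != p_gsh) {
            if (p_gsh == outcome) inc(&chooser[bi]);
            else                  dec(&chooser[bi]);
        }
        /* Both component predictors are always trained. */
        if (outcome) { inc(&bimodal[bi]); inc(&gshare[gi]); }
        else         { dec(&bimodal[bi]); dec(&gshare[gi]); }

        ghr = ((ghr << 1) | (outcome ? 1u : 0u)) % ENTRIES;  /* 12-bit history */
        return pred;
    }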

3
Simulations of Grunwald 1998
Table 1.1: SAg, gshare, and McFarling's combining predictor
4
Results
  • Simulation of Keeton et al. 1998 using an OLTP (online transaction processing) workload on a PentiumPro multiprocessor reported a misprediction rate of 14% with a branch instruction frequency of about 21%.
  • The speculative execution factor, given by the
    number of instructions decoded divided by the
    number of instructions committed, is 1.4 for the
    database programs.
  • Two different conclusions may be drawn from these simulation results:
  • Branch predictors should be further improved
  • and/or branch prediction is only effective if the
    branch is predictable.
  • If a branch outcome is dependent on irregular data inputs, the branch often shows irregular behavior. → Question: confidence of a branch prediction?

5
4.3.4 Predicated Instructions and Multipath Execution - Confidence Estimation
  • Confidence estimation is a technique for
    assessing the quality of a particular prediction.
  • Applied to branch prediction, a confidence
    estimator attempts to assess the prediction made
    by a branch predictor.
  • A low confidence branch is a branch which frequently changes its branch direction in an irregular way, making its outcome hard to predict or even unpredictable.
  • Four classes are possible:
  • correctly predicted with high confidence C(HC),
  • correctly predicted with low confidence C(LC),
  • incorrectly predicted with high confidence I(HC),
    and
  • incorrectly predicted with low confidence I(LC).

6
Implementation of a confidence estimator
  • Information from the branch prediction tables is used:
  • Use of saturation counter information to construct a confidence estimator → speculate more aggressively when the confidence level is higher.
  • Use of a miss distance counter table (MDC) → each time a branch is predicted, the value in the MDC is compared to a threshold. If the value is above the threshold, the branch is considered to have high confidence, and low confidence otherwise (see the sketch below).
  • A small number of branch history patterns typically leads to correct predictions in a PAs predictor scheme. The confidence estimator assigns high confidence to a fixed set of patterns and low confidence to all others.
  • Confidence estimation can be used for speculation control, thread switching in multithreaded processors, or multipath execution.
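A minimal sketch of the miss distance counter idea described above (table size, counter width, and threshold are illustrative assumptions): each entry counts correct predictions since the last misprediction of that branch, and a prediction counts as high confidence only when the counter has reached the threshold.

    #include <stdbool.h>
    #include <stdint.h>

    #define MDC_ENTRIES   1024       /* assumed table size            */
    #define MDC_THRESHOLD 8          /* assumed confidence threshold  */
    #define MDC_MAX       15         /* 4-bit saturating counter      */

    static uint8_t mdc[MDC_ENTRIES]; /* miss distance counter table   */

    /* High confidence if the branch has been predicted correctly at
     * least MDC_THRESHOLD times in a row since its last misprediction. */
    bool high_confidence(uint32_t pc)
    {
        return mdc[(pc >> 2) % MDC_ENTRIES] >= MDC_THRESHOLD;
    }

    /* Update after the branch resolves: reset on a misprediction,
     * saturate upwards on a correct prediction.                        */
    void mdc_update(uint32_t pc, bool prediction_was_correct)
    {
        uint8_t *c = &mdc[(pc >> 2) % MDC_ENTRIES];
        if (!prediction_was_correct)
            *c = 0;
        else if (*c < MDC_MAX)
            (*c)++;
    }

Speculation control could then, for example, throttle speculation or switch threads on low-confidence branches instead of speculating down a doubtful path.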

7
Predicated Instructions
  • Provide predicated or conditional instructions
    and one or more predicate registers.
  • Predicated instructions use a predicate register
    as additional input operand.
  • The Boolean result of a condition testing is
    recorded in a (one-bit) predicate register.
  • Predicated instructions are fetched, decoded and placed in the instruction window like non-predicated instructions.
  • It depends on the processor architecture how far a predicated instruction proceeds speculatively in the pipeline before its predicate is resolved.
  • A predicated instruction executes only if its
    predicate is true, otherwise the instruction is
    discarded. In this case predicated instructions
    are not executed before the predicate is
    resolved.
  • Alternatively, as reported for Intel's IA64 ISA,
    the predicated instruction may be executed, but
    commits only if the predicate is true, otherwise
    the result is discarded.

8
Predication Example
  • if (x == 0) {            /* branch b1 */
  •   a = b + c;
  •   d = e - f; }
  • g = h * i;               /* instruction independent of branch b1 */
  • (Pred = (x == 0))        /* branch b1: Pred is set to true if x equals 0 */
  • if Pred then a = b + c   /* the operations are only performed */
  • if Pred then d = e - f   /* if Pred is set to true */
  • g = h * i
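The same transformation can be rendered branch-free in C (an illustrative sketch only; real predication happens at the instruction level, but a compiler may lower the conditional expressions below to conditional-move or predicated instructions):

    /* Original control flow: branch b1 guards the two assignments. */
    void with_branch(int x, int *a, int *d, int *g,
                     int b, int c, int e, int f, int h, int i)
    {
        if (x == 0) {        /* branch b1 */
            *a = b + c;
            *d = e - f;
        }
        *g = h * i;          /* independent of branch b1 */
    }

    /* Predicated form: both guarded assignments are evaluated, but their
     * results are committed only when the predicate is true.             */
    void predicated(int x, int *a, int *d, int *g,
                    int b, int c, int e, int f, int h, int i)
    {
        int pred = (x == 0);         /* Pred = (x == 0)          */
        *a = pred ? b + c : *a;      /* if Pred then a = b + c   */
        *d = pred ? e - f : *d;      /* if Pred then d = e - f   */
        *g = h * i;                  /* always executed          */
    }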

9
Predication
  • Able to eliminate a branch and therefore the associated branch prediction → increasing the distance between mispredictions.
  • The run length of a code block is increased → better compiler scheduling.
  • Predication affects the instruction set, adds a
    port to the register file, and complicates
    instruction execution.
  • Predicated instructions that are discarded still consume processor resources, especially fetch bandwidth.
  • Predication is most effective when control
    dependences can be completely eliminated, such as
    in an if-then with a small then body.
  • The use of predicated instructions is limited
    when the control flow involves more than a simple
    alternative sequence.

10
Eager (Multipath) Execution
  • Execution proceeds down both paths of a branch,
    and no prediction is made.
  • When a branch resolves, all operations on the
    non-taken path are discarded.
  • Oracle execution: eager execution with unlimited resources
  • gives the same theoretical maximum performance as
    a perfect branch prediction
  • With limited resources, the eager execution
    strategy must be employed carefully.
  • A mechanism is required that decides when to employ prediction and when to employ eager execution, e.g. a confidence estimator.
  • Rarely implemented (IBM mainframes), but pursued in several research projects:
  • Dansoft processor, Polypath architecture,
    selective dual path execution, simultaneous
    speculation scheduling, disjoint eager execution

11
Figure: (a) single path speculative execution, (b) full eager execution, (c) disjoint eager execution
12
4.3.5 Prediction of Indirect Branches
  • Indirect branches, which transfer control to an address stored in a register, are harder to predict accurately.
  • Indirect branches occur frequently in machine code compiled from object-oriented programs, e.g. C++ and Java programs.
  • One simple solution is to update the PHT to include the branch target addresses (see the sketch below).
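One simple way to picture this (an assumption for illustration, not the mechanism of any specific processor): a small table indexed by the branch address that caches the last observed target; a real design may additionally index it with branch history, as in a PHT.

    #include <stdint.h>

    #define TGT_ENTRIES 512          /* assumed table size */

    /* Each entry caches the most recent target of an indirect branch. */
    static struct { uint32_t tag; uint32_t target; } tgt_cache[TGT_ENTRIES];

    /* Predict the target address; 0 means no prediction available. */
    uint32_t predict_indirect(uint32_t pc)
    {
        uint32_t idx = (pc >> 2) % TGT_ENTRIES;
        return (tgt_cache[idx].tag == pc) ? tgt_cache[idx].target : 0;
    }

    /* After the branch resolves, remember the actual target. */
    void update_indirect(uint32_t pc, uint32_t actual_target)
    {
        uint32_t idx = (pc >> 2) % TGT_ENTRIES;
        tgt_cache[idx].tag    = pc;
        tgt_cache[idx].target = actual_target;
    }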

13
Branch handling techniques and implementations
  • Technique: Implementation examples
  • No branch prediction: Intel 8086
  • Static prediction
  • always not taken: Intel i486
  • always taken: Sun SuperSPARC
  • backward taken, forward not taken: HP PA-7x00
  • semistatic with profiling: early PowerPCs
  • Dynamic prediction
  • 1-bit: DEC Alpha 21064, AMD K5
  • 2-bit: PowerPC 604, MIPS R10000, Cyrix 6x86 and M2, NexGen 586
  • two-level adaptive: Intel PentiumPro, Pentium II, AMD K6, Athlon
  • Hybrid prediction: DEC Alpha 21264
  • Predication: Intel/HP Merced and most signal processors, e.g. ARM processors, TI TMS320C6201 and many others
  • Eager execution (limited): IBM mainframes IBM 360/91, IBM 3090
  • Disjoint eager execution: none yet

14
High-Bandwidth Branch Prediction
  • Future microprocessors will require more than one prediction per cycle, starting speculation over multiple branches in a single cycle,
  • e.g. a GAg predictor is independent of the branch address.
  • When multiple branches are predicted per cycle, then instructions must be fetched from multiple target addresses per cycle, complicating I-cache access.
  • Possible solution: trace cache in combination with next-trace prediction.
  • Most likely a combination of branch handling
    techniques will be applied,
  • e.g. a multi-hybrid branch predictor combined
    with support for context switching, indirect
    jumps, and interference handling.

15
The Intel P5 and P6 family
Figure: the P5, P6, and NetBurst processor families (including L2 cache)
16
Micro-Dataflow in PentiumPro 1995
  • "... The flow of the Intel Architecture instructions is predicted and these instructions are decoded into micro-operations (µops), or series of µops, and these µops are register-renamed, placed into an out-of-order speculative pool of pending operations, executed in dataflow order (when operands are ready), and retired to permanent machine state in source program order. ..."
  • R. P. Colwell, R. L. Steck: A 0.6 µm BiCMOS Processor with Dynamic Execution, International Solid State Circuits Conference, Feb. 1995.

17
PentiumPro and Pentium II/III
  • The Pentium II/III processors use the same dynamic execution microarchitecture as the other members of the P6 family.
  • This three-way superscalar, pipelined
    micro-architecture features a decoupled,
    multi-stage superpipeline, which trades less work
    per pipestage for more stages.
  • The Pentium II/III processor has twelve stages
    with a pipestage time 33 percent less than the
    Pentium processor, which helps achieve a higher
    clock rate on any given manufacturing process.
  • A wide instruction window using an instruction
    pool.
  • Optimized scheduling requires the fundamental
    execute phase to be replaced by decoupled
    issue/execute and retire phases. This allows
    instructions to be started in any order but
    always be retired in the original program order.
  • Processors in the P6 family may be thought of as
    three independent engines coupled with an
    instruction pool.

18
Pentium Pro Processor and Pentium II/III
Microarchitecture
19
Pentium II/III
20
Pentium II/III The In-Order Section
  • The instruction fetch unit (IFU) accesses a non-blocking I-cache; it contains the Next IP unit.
  • The Next IP unit provides the I-cache index
    (based on inputs from the BTB), trap/interrupt
    status, and branch-misprediction indications from
    the integer FUs.
  • Branch prediction:
  • two-level adaptive scheme of Yeh and Patt,
  • BTB contains 512 entries, maintains branch
    history information and the predicted branch
    target address.
  • Branch misprediction penalty: at least 11 cycles, on average 15 cycles.
  • The instruction decoder unit (IDU) is composed of three separate decoders.

21
Pentium II/III The In-Order Section (Continued)
  • A decoder breaks the IA-32 instruction down into µops, each comprising an opcode, two source operands, and one destination operand. These µops are of fixed length.
  • Most IA-32 instructions are converted directly into single µops (by any of the three decoders),
  • some instructions are decoded into one to four µops (by the general decoder),
  • more complex instructions are used as indices into the microcode instruction sequencer (MIS), which generates the appropriate stream of µops.
  • The µops are sent to the register alias table (RAT), where register renaming is performed (see the sketch below), i.e., the logical IA-32 register references are converted into references to physical registers.
  • Then, with added status information, µops continue to the reorder buffer (ROB, 40 entries) and to the reservation station unit (RSU, 20 entries).
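A much simplified sketch of that renaming step (illustrative assumptions: eight logical registers, one fresh physical register per renamed destination, no free-list reclamation):

    #include <stdint.h>

    #define NUM_LOGICAL   8       /* EAX ... EDI                      */
    #define NUM_PHYSICAL 40       /* illustrative: one per ROB entry  */

    static int rat[NUM_LOGICAL] = {0, 1, 2, 3, 4, 5, 6, 7};  /* logical -> physical */
    static int next_phys = NUM_LOGICAL;                      /* naive allocator     */

    /* Rename one µop "dst <- op(src1, src2)": sources are translated through
     * the current mapping, the destination gets a fresh physical register,
     * and the mapping is updated so later µops see the renamed value.       */
    void rename_uop(int src1, int src2, int dst,
                    int *psrc1, int *psrc2, int *pdst)
    {
        *psrc1 = rat[src1];
        *psrc2 = rat[src2];

        *pdst = next_phys;
        next_phys = (next_phys + 1) % NUM_PHYSICAL;  /* reclamation omitted */
        rat[dst] = *pdst;
    }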

22
The Fetch/Decode Unit
Figure: (a) in-order section: IA-32 instructions pass through the instruction fetch unit (Next_IP, alignment, I-cache, branch target buffer) into the instruction decode unit and the register alias table; (b) instruction decoder unit (IDU): two simple decoders, one general decoder, and the microcode instruction sequencer, producing µop1, µop2, µop3.
23
The Out-of-Order Execute Section
  • When the µops flow into the ROB, they effectively take a place in program order.
  • µops also go to the RSU, which forms a central instruction window with 20 reservation stations (RS), each capable of hosting one µop.
  • µops are issued to the FUs according to dataflow constraints and resource availability, without regard to the original ordering of the program (sketched below).
  • After completion the result goes to two different places, RSU and ROB.
  • The RSU has five ports and can issue at a peak rate of 5 µops each cycle.
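A simplified sketch of dataflow-order issue from such a window (assumptions: one µop per reservation station, operand readiness tracked as two flags, port assignment and result-tag matching omitted):

    #include <stdbool.h>

    #define RS_ENTRIES 20                  /* central window of 20 stations */

    typedef struct {
        bool valid;                        /* station currently holds a µop */
        bool src1_ready, src2_ready;       /* operand availability          */
        int  uop_id;
    } rs_entry;

    static rs_entry rsu[RS_ENTRIES];

    /* Issue any µop whose operands are ready, regardless of program order;
     * returns its id, or -1 if nothing can issue this cycle.               */
    int issue_one(void)
    {
        for (int i = 0; i < RS_ENTRIES; i++) {
            if (rsu[i].valid && rsu[i].src1_ready && rsu[i].src2_ready) {
                rsu[i].valid = false;      /* dispatched to a functional unit */
                return rsu[i].uop_id;
            }
        }
        return -1;
    }

    /* Completion: a returning result wakes up a waiting operand.
     * Real hardware broadcasts a register tag to all stations.             */
    void wakeup(int station, int which_src)
    {
        if (which_src == 1) rsu[station].src1_ready = true;
        else                rsu[station].src2_ready = true;
    }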

24
Latencies and throughput for Pentium II/III FUs
25
Issue/Execute Unit
26
The In-Order Retire Section
  • A µop can be retired
  • if its execution is completed,
  • if it is its turn in program order,
  • and if no interrupt, trap, or misprediction occurred.
  • Retirement means taking data that was speculatively created and writing it into the retirement register file (RRF).
  • Three µops per clock cycle can be retired.

27
Retire Unit
28
The Pentium II/III Pipeline
Figure: the Pentium II/III pipeline. (a) In-order section: BTB0/BTB1 (BTB access), IFU0-IFU2 (fetch, predecode, I-cache access), IDU0/IDU1 (decode), RAT (register renaming), ROB read (reorder buffer read). (b) Out-of-order core: RSU (reservation station, issue over ports 0-4), execution and completion, ROB write (reorder buffer write-back). (c) Retirement: ROB read (reorder buffer read), RRF (retirement register file).
29
Pentium Pro Processor Basic Execution Environment
Figure: eight 32-bit general purpose registers, six 16-bit segment registers, the 32-bit EFLAGS register, the 32-bit EIP (instruction pointer register), and an address space from 0 to 2^32-1. The address space can be flat or segmented.
30
Application Programming Registers
31
Pentium III
32
Pentium II/III summary and offspring
  • Pentium III in 1999, initially at 450 MHz (0.25 micron technology), code name Katmai
  • two 32 kB caches, faster floating-point
    performance
  • Coppermine is a shrink of Pentium III down to
    0.18 micron.

33
Pentium 4
  • Was announced for mid-2000 under the code name
    Willamette
  • native IA-32 processor with Pentium III processor
    core
  • running at 1.5 GHz
  • 42 million transistors
  • 0.18 µm
  • 20 pipeline stages (integer pipeline), IF and ID
    not included
  • trace execution cache (TEC) for the decoded µOps
  • NetBurst micro-architecture

34
Pentium 4 Features
  • Rapid Execution Engine
  • Intel Arithmetic Logic Units (ALUs) run at
    twice the processor frequency
  • Fact: two ALUs running at processor frequency, connected with a multiplexer running at twice the processor frequency
  • Hyper Pipelined Technology
  • Twenty-stage pipeline to enable high clock rates
  • Frequency headroom and performance scalability

35
Advanced Dynamic Execution
  • Very deep, out-of-order, speculative execution
    engine
  • Up to 126 instructions in flight (3 times larger
    than the Pentium III processor)
  • Up to 48 loads and 24 stores in pipeline (2 times
    larger than the Pentium III processor)
  • Branch prediction
  • based on µOPs
  • 4K-entry branch target array (8 times larger than the Pentium III processor)
  • new algorithm (not specified) that reduces mispredictions by about one third compared to the G-Share of the P6 generation

36
First level caches
  • 12k µOP Execution Trace Cache (100 k)
  • Execution Trace Cache that removes decoder
    latency from main execution loops
  • Execution Trace Cache integrates path of program
    execution flow into a single line
  • Low latency 8 kByte data cache with 2 cycle
    latency

37
Second level caches
  • Included on the die
  • size: 256 kB
  • Full-speed, unified, 8-way 2nd-level on-die Advanced Transfer Cache
  • 256-bit data bus to the level 2 cache
  • Delivers 45 GB/s data throughput (at 1.4 GHz processor frequency)
  • Bandwidth and performance increase with processor frequency
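A quick consistency check of the quoted figure (assuming one 256-bit transfer per core clock): 1.4 GHz × 32 bytes ≈ 44.8 GB/s, i.e. roughly the stated 45 GB/s, which is why the bandwidth scales directly with the clock frequency.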

38
NetBurst Micro-Architecture
39
Streaming SIMD Extensions 2 (SSE2) Technology
  • SSE2 extends MMX and SSE technology with the addition of 144 new instructions, which include support for:
  • 128-bit SIMD integer arithmetic operations.
  • 128-bit SIMD double precision floating point
    operations.
  • Cache and memory management operations.
  • Further enhances and accelerates video, speech,
    encryption, image and photo processing.
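As a small usage illustration (a compiler-level example, not from the slides; requires an SSE2-capable compiler, e.g. gcc -msse2), the intrinsics in <emmintrin.h> expose the 128-bit SIMD integer and double precision operations:

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void)
    {
        /* 128-bit SIMD double precision: add two vectors of 2 doubles. */
        __m128d a   = _mm_set_pd(3.0, 1.0);
        __m128d b   = _mm_set_pd(4.0, 2.0);
        __m128d sum = _mm_add_pd(a, b);

        double out[2];
        _mm_storeu_pd(out, sum);
        printf("%f %f\n", out[0], out[1]);    /* prints 3.000000 7.000000 */

        /* 128-bit SIMD integer arithmetic: add 4 packed 32-bit integers. */
        __m128i x = _mm_set_epi32(4, 3, 2, 1);
        __m128i y = _mm_set_epi32(40, 30, 20, 10);
        __m128i s = _mm_add_epi32(x, y);

        int iout[4];
        _mm_storeu_si128((__m128i *)iout, s);
        printf("%d %d %d %d\n", iout[0], iout[1], iout[2], iout[3]);  /* 11 22 33 44 */

        return 0;
    }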

40
400 MHz Intel NetBurst micro-architecture system
bus
  • Provides 3.2 GB/s throughput (3 times faster than
    the Pentium III processor).
  • Quad-pumped 100 MHz scalable bus clock to achieve 400 MHz effective speed.
  • Split-transaction, deeply pipelined.
  • 128-byte lines with 64-byte accesses.
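A quick consistency check (assuming the 64-bit wide data path of the Pentium 4 system bus): 100 MHz × 4 transfers per bus clock × 8 bytes = 3.2 GB/s, matching the stated throughput.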

41
Pentium 4 data types
42
Pentium 4
43
Pentium 4 offspring
  • Foster
  • Pentium 4 with external L3 cache and DDR-SDRAM support
  • intended for servers
  • clock rate 1.7 - 2 GHz
  • to be launched in Q2/2001
  • Northwood
  • 0.13 µm technology
  • new 478-pin socket