Pentium Architecture - PowerPoint PPT Presentation

About This Presentation
Title:

Pentium Architecture


Slides: 35
Provided by: rfox

Transcript and Presenter's Notes

Title: Pentium Architecture


1
Pentium Architecture
  • Recall our examination of the Intel 486 pipeline
  • variable-length instructions, variable
    complexity of operations, memory-register ALU
    operations, etc. led to poor performance
  • In order to improve performance using RISC
    features, the Pentium architects had to rethink
    things: they were stuck with their CISC
    instruction set (for backward compatibility)
  • in CISC architectures, a machine instruction is
    first translated into a sequence of
    microinstructions
  • each microinstruction is a lengthy string of 1s
    and 0s, each of which refers to one control signal
    in the machine
  • there needs to be a process to translate each
    machine instruction into microinstructions and
    execute each microinstruction; this is done by
    collecting machine instructions and their
    associated microinstructions into microprograms

2
Why Microinstructions?
  • First, since the Pentium architecture uses a
    microprogrammed control unit, there is already a
    necessary step of decoding a machine instruction
    into microcode
  • Now, consider each microinstruction
  • each is equal length
  • each executes in the same amount of time
  • unless there is a stall, such as one caused by a
    cache miss
  • branches are at the microinstruction level and
    are more predictable than machine language level
    branching
  • In a RISC architecture, each machine instruction
    is carried out directly in hardware because each
    instruction is simple and takes roughly 1 cycle
    to execute
  • to more efficiently pipeline a CISC architecture,
    we can pipeline the microinstructions (instead of
    machine instructions) to keep a pipeline running
    efficiently

3
Control and Micro-Operations
  • An example architecture is shown to the right
  • Each of the various connections is controlled by
    a particular control signal
  • for instance, to send the MBR value to the AC, we
    would signal C11
  • note that this figure is incomplete
  • a microprogram is a sequence of micro-operations
  • each micro-operation is one or more control
    signals sent out in a clock cycle to move
    information from one location to another

this is not an x86 architecture!
4
Example
  • Consider a CISC instruction such as Add R1, X
  • this requires that X be moved into the MAR and a
    read signaled
  • the datum returned will be placed into the MBR
  • the adder is then sent the value in R1 and MBR,
    adding the two and storing the result back into
    R1
  • this sequence can be written in terms of
    micro-operations as
  • t1: MAR ← (IR(address))
  • t2: MBR ← Memory
  • t3: R1 ← (R1) + (MBR)
  • There may be other sequences needed as well, for
    instance, if register results are stored in an
    accumulator temporarily, then we must change the
    above to include
  • t3: Acc ← (R1) + (MBR)
  • t4: R1 ← (Acc)
  • we can then convert these into the actual control
    signals (for instance, MBR ← Memory is C5 in the
    previous figure)

the values t1, t2, etc. denote separate clock
cycles
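The micro-operation sequence for Add R1, X can be traced with a small simulation. This is a minimal sketch in Python: the register names follow the slide, while the memory contents, the address chosen for X, and the helper name run_add are assumptions made only for illustration.

```python
# Illustrative trace of the micro-operations for "Add R1, X".
# Memory contents, the address of X, and run_add are hypothetical.

def run_add(r1_value, x_address, memory):
    """Execute the three micro-operations, one per clock cycle."""
    regs = {"R1": r1_value, "MAR": 0, "MBR": 0}
    trace = []

    # t1: MAR <- (IR(address)); the address field of the IR holds X
    regs["MAR"] = x_address
    trace.append(("t1", dict(regs)))

    # t2: MBR <- Memory; read the word addressed by MAR
    regs["MBR"] = memory[regs["MAR"]]
    trace.append(("t2", dict(regs)))

    # t3: R1 <- (R1) + (MBR)
    regs["R1"] = regs["R1"] + regs["MBR"]
    trace.append(("t3", dict(regs)))

    return regs, trace


memory = {100: 7}  # assume X lives at address 100 and holds 7
final, trace = run_add(5, 100, memory)
for cycle, state in trace:
    print(cycle, state)
```

Each entry in the trace is a snapshot of the registers after one clock cycle, which mirrors how t1, t2 and t3 are separated in the slide.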
5
Control Memory
Each micro-program consists of one or more
micro-instructions, each stored in a separate
entry of the control memory. The control memory
itself is firmware, a program stored in ROM, that
is placed inside of the control unit.
Control memory layout (from the figure):
  • Fetch cycle routine: ... Jump to Indirect or Execute
  • Indirect cycle routine: ... Jump to Execute
  • Interrupt cycle routine: ... Jump to Fetch
  • Execute cycle begin: Jump to Op code routine
  • AND routine: ... Jump to Fetch or Interrupt
  • ADD routine: ... Jump to Fetch or Interrupt
Note: each micro-program ends with a branch to
the Fetch, Interrupt, Indirect or Execute
micro-program
6
Example of Three Micro-Programs
  • Fetch
    t1: MAR ← (PC)                    C2
    t2: MBR ← Memory                  C0, C5, CR
        PC ← (PC) + 1                 C
    t3: IR ← (MBR)                    C4
  • Indirect
    t1: MAR ← (IR(address))           C8
    t2: MBR ← Memory                  C0, C5, CR
    t3: IR(address) ← (MBR(address))  C4
  • Interrupt
    t1: MBR ← (PC)                    C1
    t2: MAR ← save address            C
        PC ← routine address          C
    t3: Memory ← (MBR)                C12, CW
  • CR: read control to system bus
  • CW: write control to system bus
  • C0 to C12 refer to the previous figure
  • a bare C marks signals not shown in the figure

7
Horizontal vs. Vertical Micro-Instructions
Vertical micro-instruction fields: Micro-instruction
Address, Function Codes, Jump Condition
Horizontal micro-instruction fields: Internal CPU
Control Signals, Micro-instruction Address, Jump
Condition, System Bus Control Signals
The micro-instruction address points to a branch
target in the control memory; the branch is taken
if the condition bit is true
Horizontal micro-instructions contain 1 bit for
every control signal controlled by the control
unit
Vertical micro-instructions use function codes
that need additional decoding
Because a horizontal micro-instruction requires 1
bit for every control line, it is longer than the
vertical micro-instruction and therefore takes
more space to store, but it does not require
additional time to decode by the control unit
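The space versus decoding trade-off can be made concrete with a toy encoding. This is a minimal sketch, assuming a machine with 16 control signals and a vertical format in which one function code names a single signal. The signal count and all three helper functions are hypothetical, and real vertical formats usually group signals into several fields.

```python
import math

NUM_SIGNALS = 16  # assumed number of control lines (hypothetical)


def horizontal_word(active):
    """One bit per control signal: bit i is set if signal Ci is asserted."""
    word = 0
    for i in active:
        word |= 1 << i
    return word


def vertical_code_bits(num_signals):
    """Bits needed for a function code that selects a single signal."""
    return math.ceil(math.log2(num_signals))


def decode_vertical(code):
    """The extra decoding step: expand a function code into signal lines."""
    return 1 << code


# Horizontal: C0 and C5 asserted together in one 16-bit word, no decoding.
h = horizontal_word([0, 5])
print(f"horizontal word: {h:016b} ({NUM_SIGNALS} bits)")

# Vertical: a short code per signal, expanded by a decoder before use.
print("vertical code bits:", vertical_code_bits(NUM_SIGNALS))
print(f"decoded C5:      {decode_vertical(5):016b}")
```

The horizontal word is wider but drives the control lines directly, while the vertical code is compact and pays for that with the call to decode_vertical.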
8
Micro-programmed Control Unit
  • Decoder analyzes IR
  • delivers starting address of the op code's
    micro-program in control store
  • address is placed in a micro-program counter
    (here, called a Control Address Register)
  • Loop on the following
  • sequencer signals read of control memory using
    address in microPC
  • item in control memory moved to control buffer
    register
  • contents of control buffer register generate
    control signals and next address information
  • if the micro-instructions are vertical, decoding
    is required here
  • sequencer moves next address to control address
    register; the next address is one of
  • next instruction (add 1 to current)
  • jump to new part of this microprogram
  • jump to new machine routine
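The loop above can be sketched as a tiny interpreter. This is only an illustration under stated assumptions: the control store addresses, the ADD routine's signal names (ADD_1, ADD_2), and the mode names "next", "jump" and "dispatch" are made up for the example and are not taken from any real Pentium microcode. Only the fetch signals are borrowed from the earlier Fetch micro-program.

```python
# Toy micro-programmed control unit. Addresses, the ADD routine's
# signals, and the mode names are hypothetical.

# Each entry: (control signals, next-address mode, jump target)
CONTROL_STORE = {
    0: (["C2"], "next", None),              # fetch t1
    1: (["C0", "C5", "CR"], "next", None),  # fetch t2
    2: (["C4"], "dispatch", None),          # fetch t3, then op code routine
    10: (["ADD_1"], "next", None),          # ADD routine, first step
    11: (["ADD_2"], "jump", 0),             # ADD routine, back to fetch
}

OPCODE_START = {"ADD": 10}  # decoder: op code -> routine start address


def run(opcode, max_steps=10):
    """Run micro-instructions for one machine instruction."""
    car = 0  # control address register (the micro-PC)
    issued = []
    for _ in range(max_steps):
        # Read control memory into the control buffer register.
        signals, mode, target = CONTROL_STORE[car]
        # Buffer contents generate the control signals.
        issued.append((car, signals))
        # Sequencer picks the next address.
        if mode == "next":          # next instruction (add 1 to current)
            car += 1
        elif mode == "dispatch":    # jump to new machine routine
            car = OPCODE_START[opcode]
        elif mode == "jump":        # jump within the microprogram
            car = target
            if car == 0:
                break  # stop once control returns to fetch
    return issued


for address, signals in run("ADD"):
    print(address, signals)
```

The three branches of the if statement correspond to the three ways the sequencer can choose the next address.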

9
Pentium IV RISC features
  • All RISC features are implemented on the
    execution of microinstructions instead of machine
    instructions
  • microinstruction-level pipeline with dynamically
    scheduled microoperations
  • fetch machine instruction (3 stages)
  • decode machine instruction into microinstructions
    (2 stages)
  • superscalar issues multiple microinstructions (2
    stages, register renaming occurs here, up to 3
    microinstructions can be issued per cycle)
  • execution of microinstructions (1 stage, units are
    pipelined and can take from 1 to many cycles (up
    to 32?) to execute)
  • write back (3 stages)
  • commit (3 stages, up to 3 microinstructions can
    commit in any cycle)
  • reservation stations (128 registers available)
    and multiple functional units (7 of them)
  • branch speculation used (control of speculation
    is given to reservation stations rather than a
    reorder buffer, commit still occurs, controlled
    by reservation stations)
  • trace cache used

10
Pentium IV Architecture
11
Specifications
  • There are 7 functional units
  • 2 simple ALUs (for simple integer operations like
    add and compare)
  • 1 complex ALU (for integer multiplication and
    integer division)
  • 1 load unit
  • 1 store unit
  • 1 floating point move (register to register move
    and convert)
  • 1 floating point unit (addition, subtraction,
    multiplication, division)
  • the simple ALU units execute in half a clock
    cycle so each can accommodate up to two
    microoperations per cycle reducing latency
  • the load and store units have their own address
    calculation components so that the memory address
    can be computed first and then the memory access
    performed, along with aggressive data cache to
    lower load latencies
  • floating point and complex ALU take more than 1
    cycle so are pipelined
  • floating point units can handle up to 2 FP
    operations at a time allowing for some SIMD
    execution and improving overall FP performance
  • There are 128 registers for renaming
  • reservation stations are used rather than a
    re-order buffer (which was used in older versions
    of the Pentium pipeline)
  • this means that instructions must wait in
    reservation stations longer than in Tomasulo's
    version, waiting for speculation results

12
Pentium IV Pipeline
  • Pentium III (Pentium Pro) pipeline was 10 stages
    deep
  • taking a minimum of 10 clock cycles to complete
    the shortest instructions with a clock rate of
    1.1 GHz or less
  • the figure below shows the Pentium III pipeline
  • For the Pentium IV
  • pipeline depth was lengthened to 21 stages
    (minimum) in order to accommodate a faster clock
    rate of 1.5 GHz
  • by 2004, the pipeline was lengthened to 31 stages
    (minimum) and the clock rate up to 3.2 GHz
  • The lengthening of the pipeline allowed for the
    faster clock rates
  • the clock rate is now so fast that it takes 2
    complete cycles for an instruction or data to
    cross the chip so that at least 2 stages in the
    pipeline are needed for certain operations like
    data movement!
  • With the 128 reservation stations, 128
    instructions could be in some state of operation
    simultaneously (as opposed to 40 in the Pentium
    III)

13
Trace Cache and Branch Prediction
  • We talk about the trace cache in chapter 5
  • for now, consider it to be an instruction cache
    that stores instructions not by address but in the
    order they are being executed
  • in this way, branches do not necessarily cost us
    cache misses just because the instruction being
    branched to is not in the same cache block
  • The trace cache stores microinstructions (not
    machine instructions)
  • repeated decoding is avoided: once a machine
    instruction has been decoded, the decoded version
    is placed in the trace cache, which greatly
    reduces the time needed for instruction decoding
  • A branch target buffer is used to store
    microinstruction branches (not machine
    instruction branches) within the trace cache
  • the target buffer uses a 2-level predictor to
    select between local and global histories
  • target buffer is 8 times the size of the target
    buffer used in the Pentium III
  • the misprediction rate for the target buffer is
    below .15%!
  • The trace cache and branch target buffer combined
    mean that
  • microinstruction fetch and microinstruction
    decoding are rarely needed because, once fetched
    and decoded, the items are often found in the
    cache and because predictions rarely cause wrong
    instructions to be fetched

14
Source of Stalls
  • This architecture is very complex and relies on
    being able to fetch and decode instructions
    quickly
  • the process breaks down when
  • fewer than 3 instructions can be fetched in 1
    cycle
  • trace cache causes a miss, or branches are
    mispredicted
  • fewer than 3 instructions can be issued because
    instructions have different numbers of
    microoperations
  • e.g., one instruction has 4 and another has 1,
    staggering when each instruction issues and
    executes
  • limitation of reservation stations
  • data dependencies cause a functional unit to
    stall
  • data cache access results in a miss
  • in some of these cases, the issue stage must
    stall, in others the commit stage must stall
  • misprediction rates are very low, about .8% for
    integer benchmarks and .1% for floating point
    benchmarks (these are misprediction rates at the
    machine level of instructions, not
    microinstructions)
  • trace cache has nearly a 0% miss rate, the L1 and
    L2 data caches have miss rates of around 6% and
    .5% respectively
  • the machine's effective CPI is around 2.2

15
Pentium IV Comparison
  • Comparing the Pentium IV to the Pentium III
  • P4 has over twice the performance in many SPEC
    benchmarks in spite of a clock speed that isn't
    twice as fast (this info is not in this text
    edition)
  • The text provides a comparison between the P4 and
    the AMD Opteron
  • the Opteron uses dynamic scheduling, speculation,
    a shallower pipeline, issue and commit of up to 3
    instructions per cycle, and a 2-level cache; the
    chip has a similar transistor count, although it
    runs at only 2.8 GHz
  • the Opteron also implements the x86 instruction
    set and, like the P4, translates machine
    instructions into RISC-like operations internally
  • P4 has a higher CPI on all benchmarks except mcf
    (in which the AMD's CPI is more than twice the P4's)
  • so in most cases, instructions take fewer clock
    cycles on the AMD than on the P4, but the P4 has
    a slightly faster clock
  • The text provides a briefer comparison between
    the P4 and the IBM Power5
  • the Power5 is only 1.9 GHz
  • the Power5 is significantly better on most
    floating point benchmarks and slightly worse on
    most integer benchmarks, with a clock speed about
    half that of the P4
  • see figures 2.28 to 2.34 for specific comparisons

16
A Balancing Act
  • Improving one aspect of our processor does not
    necessarily improve performance
  • in fact, it might harm performance
  • consider lengthening the pipeline depth and
    increasing clock speed (as with the P4) but
    without adding reservation stations or using the
    trace cache
  • Modern processor design takes a lot of effort to
    balance out the factors
  • without accurate branch prediction and
    speculation hardware, stalls from mispredicted
    branches will drop performance greatly
  • as clock speeds increase, stalls from cache
    misses create a bigger impact on CPI, so larger
    caches and cache optimization techniques are
    needed (we cover the latter in chapter 5)
  • to support multiple issue of instructions, we
    need a larger cache-to-processor bandwidth, which
    can take up valuable space
  • as we increase the number of instructions that
    can be issued, we need to increase the number of
    reservation stations and reorder buffer size
  • For even greater improvement, we might need to
    turn to software approaches instead of or in
    addition to hardware enhancements; in appendix
    G, we will visit several compiler-based ideas

17
Sample Problem 1
  • We see how complex an architecture can become in
    the case of the Pentium IV
  • assume that we have additional space on the CPU
    and want to enhance some element(s), what should
    we pick and why?
  • choices are to
  • add more reservation stations
  • add more ALU functional units
  • add another FP functional unit
  • add more load/store units
  • add a larger branch target buffer (either more
    entries, or more prediction bits)
  • attempt to speed up the system clock and lengthen
    the pipeline (the additional space will be used
    for pipeline latches, control logic, etc)
  • add more memory to the trace cache
  • add more memory to the L1 cache
  • increase the microoperation queue size to store
    more microoperations at any time

18
Solution
  • Let's consider each not from the perspective of
    how useful it might be but how much that
    particular hardware is limiting instruction issue
    and CPI
  • add more reservation stations: because we can
    issue no more than 3 microoperations per cycle,
    and assuming that the average microoperation
    executes for under 10 cycles, the 128 registers
    should be sufficient
  • add more ALU/FP functional units: since these
    are pipelined, additional units are not necessary
  • add more load/store units: limiting the number
    of loads may be a source of data dependencies,
    and so an additional load unit might help; an
    additional store unit is probably not necessary
  • add a larger branch target buffer (either more
    entries, or more prediction bits): prediction
    accuracy is extremely high, so more entries or
    bits are not needed

19
Solution (Continued)
  • attempt to speed up the system clock and lengthen
    the pipeline (the additional space will be used
    for pipeline latches, control logic, etc.): there
    is little that we can do to further lengthen the
    pipeline, so this may not be feasible
  • add more memory to the trace cache: similar to
    the branch target buffer, this will probably have
    very little impact because of the low miss rate
    of the current trace cache
  • add more L1 cache: this can make a significant
    impact since the miss rate is currently fairly
    high; this would be my top choice
  • increase the microoperation queue size to store
    more microoperations at any time: although it is
    unclear how many stalls arise from running out of
    microoperations, because of the trace cache's
    performance, this is probably not necessary
  • Top choices: increase L1 cache and add another
    load unit

20
Sample Problem 2
  • Two fallacies cited in the chapter are
  • Processors with lower CPI will always be faster
  • Processors with faster clock rates will always
    be faster
  • Why are these not necessarily true?
  • recall our CPU time formula: CPU Time =
    IC × CPI × CCT
  • if CPI is lower, the CPU Time is lower and thus
    the processor is faster
  • if clock rate is higher, then CCT is lower and
    CPU Time is lower, thus the processor is faster
  • BUT, we see from our examination of various
    processors that
  • deeper pipelines can have a larger impact than
    faster clock rates
  • multiple issue superscalars have a significant
    impact on CPI but only if supported by
    reservation stations, reorder buffers, and
    accurate branch speculation
  • in the Pentium IV, the CPI might be lower than
    other machines but its IC can be higher because,
    in this case, IC is at the microinstruction level
  • additionally, a very low CPI with a slow clock
    rate may not outperform a higher CPI with a
    faster clock rate
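A quick calculation shows why neither fallacy holds. The sketch below applies CPU Time = IC × CPI × CCT to three made-up processors. Every number in it is a hypothetical value chosen for illustration, not a measurement of any real chip.

```python
def cpu_time(ic, cpi, clock_ghz):
    """CPU time in seconds: instruction count x CPI x clock cycle time."""
    cct = 1.0 / (clock_ghz * 1e9)  # clock cycle time in seconds
    return ic * cpi * cct


IC = 1_000_000_000  # assumed instruction count, the same for all three

# Hypothetical processor A: low CPI, slow clock.
a = cpu_time(IC, cpi=0.5, clock_ghz=1.0)
# Hypothetical processor B: higher CPI than A, but a faster clock.
b = cpu_time(IC, cpi=1.0, clock_ghz=3.0)
# Hypothetical processor C: fastest clock of all, but a poor CPI.
c = cpu_time(IC, cpi=2.5, clock_ghz=4.0)

print(f"A: {a:.3f} s  B: {b:.3f} s  C: {c:.3f} s")
print("lower CPI (A) slower than B:", a > b)
print("faster clock (C) slower than A:", c > a)
```

Processor A has the lowest CPI yet loses to B, and processor C has the fastest clock yet loses to A, so neither CPI nor clock rate alone decides which machine is faster.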

21
Limitations on ILP (Chapter 3)
  • From mid 80s through 2000, architects focused on
    promoting ILP
  • deeper pipelines
  • multiple instruction issue
  • dynamic scheduling
  • Speculation
  • Hardware needs increased
  • multiple function units
  • cost grows linearly with the number of units
  • increase (possibly very large) in memory
    bandwidth
  • more register-file bandwidth
  • which might take up significant space on the chip
    and may require larger system bus sizes which
    turns into more pins
  • more complex memory system
  • possibly independent memory banks

22
Limitations
  • By 2000, architects found limitations in just how
    much ILP there is to exploit
  • inherent limitations to multiple-issue are the
    limited amount of ILP of a program
  • how many instructions are independent of each
    other?
  • how much distance is available between loading an
    operand and using it? between using and saving
    it?
  • multi-cycle latency for certain types of
    operations that cause inconsistencies in the
    amount of issuing that can be simultaneous
  • Architects more recently have concentrated
  • on further optimizations of current architectures
  • and achieving higher clock rates without
    increasing issue rates

23
Limitations on Issue Size
  • Ideally, we would like to issue as many
    independent instructions simultaneously as
    possible, but this is not practical because we
    would have to
  • look arbitrarily far ahead to find an instruction
    to issue
  • rename all registers when needed to avoid WAR/WAW
  • determine all register and memory dependences
  • predict all branches (conditional, unconditional,
    returns)
  • provide enough functional units to ensure all
    ready instructions can be issued
  • What is a possible maximum window size?
  • to determine register dependences over n
    instructions requires n^2 - n comparisons
  • 2000 instructions → about 4,000,000 comparisons
  • 50 instructions → 2450 comparisons
  • window sizes have ranged between 4 and 32 with
    some recent machines having sizes of 2-8
  • a machine with window size of 32 achieves about
    1/5 of the ideal speedup for most benchmarks (see
    figure on next slide)
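The growth in comparison count is easy to tabulate. A minimal sketch, assuming the n^2 - n formula from the slide; the function name and the list of window sizes are chosen only for illustration.

```python
def register_comparisons(n):
    """Comparisons needed to check register dependences among n instructions."""
    return n * n - n


for window in (8, 32, 50, 2000):
    print(f"window {window:>4}: {register_comparisons(window):>9,} comparisons")
```

Because the count grows quadratically, moving from a window of 32 to a window of 2000 multiplies the work by roughly four thousand, which is why practical windows stay small.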

24
Window Size Impact on Instruction Issue
25
Realistic Branch Prediction
  • Types of predictions
  • Perfect branch prediction
  • impossible to achieve, so we won't bother with
    this
  • Selective history prediction using
  • correlating two-bit predictor
  • non-correlating two-bit predictor
  • selector between them
  • Standard two-bit predictor with 512 two-bit
    entries
  • Static predictor
  • uses program profile history
  • None

Predictor    Misprediction Rate    Issue Rate
Selective    3                     24
Standard     17                    20
Static       10                    21
(see the figures on the next slide for details)
  • Experimental results shown to the right
  • notice that issue rate is not significantly
    different and that the static predictor is the
    easiest so might be a reasonable approach

26
Branch Predictor Performance
27
Effects of Finite Registers
  • With infinite registers, register renaming can
    eliminate all WAW and WAR hazards
  • with Tomasulo's approach, the reservation
    stations offer virtual registers
  • Power 5 has 88 additional FP and 88 additional
    integer registers for reservation stations
  • surprisingly though, the number of registers does
    not have a dramatic impact as long as there are
    at least 64 + 64 registers available

28
Alias Analysis
  • Aside from register renaming, we have name
    dependencies on memory references
  • Three models are
  • global/stack perfect (perfect analysis of all
    global variables and all stack references)
  • inspection (examine accesses for interference at
    compile time)
  • none (assume all references conflict)

29
A Realizable Processor
  • The authors describe an ambitious but realistic
    processor that could be available with today's
    technology
  • issue up to 64 instructions / cycle with no
    restrictions on what instructions can be issued
    in the same cycle
  • tournament branch predictor with 1K entries and
    16 entry return predictor
  • perfect memory reference disambiguation performed
    dynamically
  • register renaming with 64 int and 64 FP registers
  • with a 64 instruction / cycle issue capability,
    the average number of instructions issued per
    cycle is estimated to be around 20
  • if there are no stalls for limited hardware,
    cache misses and misspeculation, this would
    result in a CPI of .05!
  • we might question whether a 64 instruction window
    is reasonable given the complexity needed in
    comparing up to 64 instructions together in each
    cycle, today we find most computers limit window
    sizes to 8 at most

30
Example
  • Let's compare three hypothetical processors and
    determine their MIPS rating for the gcc benchmark
  • processor 1: simple MIPS 2-issue superscalar
    pipeline with clock rate of 4 GHz, CPI of 0.8,
    cache system with .005 misses per instruction
  • processor 2: deeply pipelined MIPS with a clock
    rate of 5 GHz, CPI of 1.0, smaller cache yielding
    .0055 misses per instruction
  • processor 3: speculative superscalar with
    64-entry window that achieves 50% of its ideal
    issue rate (see figure 3.7) with a clock rate of
    2.5 GHz, a small cache yielding .01 misses per
    instruction (although 25% of the miss penalty is
    not visible due to dynamic scheduling)
  • assume memory access time (miss penalty) is 50 ns
  • to solve this problem, we have to determine each
    processor's overall CPI, which is a combination of
    processor CPI and the impact of memory (cache
    misses)

31
Solution
  • Processor 1
  • 4 GHz clock = .25 ns per clock cycle
  • memory access of 50 ns, so miss penalty = 50 / .25
    = 200 cycles
  • cache penalty = .005 × 200 = 1.0 cycles per
    instruction
  • overall CPI = 0.8 + 1.0 = 1.8
  • MIPS = 4 GHz / 1.8 = 2222 MIPS
  • Processor 2
  • 5 GHz clock = .2 ns per clock cycle
  • miss penalty = 50 / .2 = 250 cycles
  • cache penalty = .0055 × 250 ≈ 1.4 cycles per
    instruction
  • overall CPI = 1.0 + 1.4 = 2.4
  • MIPS = 5 GHz / 2.4 = 2083 MIPS
  • Processor 3
  • 2.5 GHz clock = .4 ns per clock cycle
  • miss penalty takes effect only 75% of the time,
    so miss penalty = .75 × 50 / .4 ≈ 94 cycles
  • cache penalty = .01 × 94 = 0.94
  • CPU portion of the CPI is based on half the ideal
    issue rate of a 64-entry window, which is 1 / (9
    / 2) = 0.22
  • overall CPI = 0.94 + 0.22 = 1.16
  • MIPS = 2.5 GHz / 1.16 = 2155 MIPS
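The same calculation can be scripted so that the inputs are easy to vary. This is a minimal sketch: the function name mips_rating and its parameter names are hypothetical, and the inputs are the values given in the example (including the ideal issue rate of 9 for the 64-entry window). Because the code carries unrounded intermediate values, processor 2 comes out near 2105 MIPS rather than 2083, since the slide rounds the cache penalty of 1.375 up to 1.4. The ranking of the three processors does not change.

```python
def mips_rating(clock_ghz, base_cpi, misses_per_instr,
                miss_penalty_ns=50.0, visible_fraction=1.0):
    """MIPS rating from clock rate, processor CPI, and cache behavior.

    visible_fraction is the share of the miss penalty that actually
    stalls the processor (1.0 unless dynamic scheduling hides some).
    """
    cycle_ns = 1.0 / clock_ghz
    penalty_cycles = visible_fraction * miss_penalty_ns / cycle_ns
    overall_cpi = base_cpi + misses_per_instr * penalty_cycles
    return clock_ghz * 1000.0 / overall_cpi  # GHz -> MHz, then / CPI


p1 = mips_rating(4.0, 0.8, 0.005)
p2 = mips_rating(5.0, 1.0, 0.0055)
# Processor 3: half of the ideal issue rate of 9, 75% of penalty visible.
p3 = mips_rating(2.5, 1.0 / (9 / 2), 0.01, visible_fraction=0.75)

for name, value in (("processor 1", p1), ("processor 2", p2), ("processor 3", p3)):
    print(f"{name}: {value:.0f} MIPS")
```

Changing miss_penalty_ns or clock_ghz in the calls makes it easy to explore how faster memory or a faster clock would shift each rating.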

32
Sample Problem 1
  • For the li benchmark
  • compare a perfect processor to one that has a
    128-entry window, tournament branch predictor, 64
    integer and 64 FP renaming registers and
    inspection alias analysis
  • The perfect processor can issue 18 instructions
    per cycle
  • but the branch prediction only permits up to 16
    instructions per cycle and an infinite number of
    registers and perfect alias analysis can only
    accommodate 12 instructions per cycle
  • so the perfect processor can achieve an issue
    rate of 12 instructions per cycle, or a CPI = 1 /
    12 = .083
  • The more realistic processor is most limited by
    alias analysis (4 instructions per cycle), so a
    CPI = .25
  • the perfect machine is then .25 / .083 = 3 times
    faster on this benchmark

33
Sample Problem 2
  • Architects are considering one of three
    enhancements to the next generation of computer
  • more on-chip cache to reduce the impact of memory
    access
  • faster memories
  • faster clock rates
  • Explain, using the example on pages 167-169, how
    each of these would impact the three hypothetical
    processors
  • more on-chip cache: lowers cache CPI; depending
    on the current miss rate, this might be useful,
    but for processors 1 and 2, the miss rates are
    already less than .1
  • faster memory: reduces cache CPI (it decreases the
    number of cycles needed for any cache miss);
    since all three processors' CPIs are roughly half
    from cache misses and half from processor
    performance, this could have a significant impact
  • faster clock rates: increases cache CPI and
    possibly has no effect on execution CPI; by merely
    increasing the clock rate, the stalls for memory
    accesses will increase; however, if this increase
    is coupled with a longer pipeline, then execution
    CPI might decrease and so overall performance
    might improve

34
Sample Problem 3
  • Consider a speculative superscalar with a window
    size of 32
  • with proper hardware support, the superscalar can
    issue 70% of the expected issue rate (see figure
    3.2)
  • the processor has a 3.33 GHz clock rate
  • the processor stalls when all functional units
    are busy (which arises once in every 12 cycles)
  • when there is a misprediction, the processor
    requires 6 complete cycles to flush the reorder
    buffer and begin again (profile-based prediction
    is used)
  • memory accesses take 40 ns, 40% of the
    instructions are loads or stores, the
    instruction cache has a miss rate of .5%, and the
    data cache has a miss rate of .03%
  • determine this machine's MIPS rating for the
    doduc benchmark
  • Solution
  • clock cycle time = 1 / 3.33 GHz = .3 ns, so cache
    miss penalty = 40 ns / .3 ns ≈ 133 cycles
  • memory CPI = .005 × 133 + .40 × .0003 × 133 =
    .681
  • CPU CPI = 1 / 6.3 + 1 / 12 + 6 × .05 = .542
  • CPI = .681 + .542 = 1.223
  • MIPS rating = 3.33 GHz / 1.223 = 2723 MIPS