1
EECS 470
  • Superscalar Architectures and the Pentium 4
  • Lecture 12

2
Optimizing CPU Performance
  • Golden Rule: tCPU = Ninst × CPI × tCLK (a worked sketch follows this slide's bullets)
  • Given this, what are our options?
  • Reduce the number of instructions executed
  • Reduce the cycles to execute an instruction
  • Reduce the clock period
  • Our next focus: further reducing CPI
  • Approach: superscalar execution
  • Capable of initiating multiple instructions per
    cycle
  • Possible to implement for in-order or
    out-of-order pipelines
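
A minimal numeric sketch of the golden rule above, in C. The instruction
count, CPI values, and the 1.5 GHz clock below are illustrative assumptions,
not measured Pentium 4 figures.

    #include <stdio.h>

    /* Iron law of performance: t_CPU = N_inst * CPI * t_CLK */
    int main(void) {
        double n_inst = 1e9;          /* instructions executed          */
        double cpi    = 1.2;          /* average cycles per instruction */
        double t_clk  = 1.0 / 1.5e9;  /* clock period at 1.5 GHz        */
        printf("CPU time = %.3f s\n", n_inst * cpi * t_clk);

        /* Superscalar execution attacks the CPI term: issuing several
         * instructions per cycle can push effective CPI below 1.      */
        printf("CPU time at CPI 0.8 = %.3f s\n", n_inst * 0.8 * t_clk);
        return 0;
    }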

3
Why Superscalar?
(Diagram: scalar pipelining vs. superscalar pipelining)
  • Optimization results in more complexity
  • Longer wires, more logic → higher tCLK and tCPU
  • Architects must strike a balance with reductions
    in CPI

4
Implications of Superscalar Execution
  • Instruction fetch?
  • Taken branches, multiple branches, partial cache
    lines
  • Instruction decode?
  • Simple for fixed length ISA, much harder for
    variable length
  • Renaming?
  • Multi-ported RAT; inter-instruction dependencies must be recognized
  • Dynamic Scheduling?
  • Requires multiple results buses, smarter
    selection logic
  • Execution?
  • Multiple functional units, multiple result buses
  • Commit?
  • Multiple ROB/ARF ports, dependencies must be
    recognized

5
P4 Overview
  • Latest iA32 processor from Intel
  • Equipped with the full set of iA32 SIMD
    operations
  • First flagship architecture since the P6
    microarchitecture
  • Pentium 4 ISA = Pentium III ISA + SSE2
  • SSE2 (Streaming SIMD Extensions 2) provides 128-bit SIMD integer and
    floating-point operations, plus prefetch support (example below)
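
A short illustration of the 128-bit SSE2 operations mentioned above, using
the standard <emmintrin.h> intrinsics; the specific values are arbitrary.

    #include <emmintrin.h>   /* SSE2 intrinsics */
    #include <stdio.h>

    int main(void) {
        /* 128-bit integer add: four 32-bit lanes at once */
        __m128i ia = _mm_set_epi32(4, 3, 2, 1);
        __m128i ib = _mm_set_epi32(40, 30, 20, 10);
        int iout[4];
        _mm_storeu_si128((__m128i *)iout, _mm_add_epi32(ia, ib));

        /* 128-bit double-precision add: two 64-bit lanes at once */
        __m128d da = _mm_set_pd(2.0, 1.0);
        __m128d db = _mm_set_pd(0.25, 0.5);
        double dout[2];
        _mm_storeu_pd(dout, _mm_add_pd(da, db));

        printf("%d %d %d %d | %.2f %.2f\n",
               iout[0], iout[1], iout[2], iout[3], dout[0], dout[1]);
        return 0;
    }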

6
Comparison Between Pentium III and Pentium 4
7
Execution Pipeline
8
Front End
  • Predicts branches
  • Fetches/decodes code into trace cache
  • Generates µops for complex instructions
  • Prefetches instructions that are likely to be
    executed

9
Branch Prediction
  • Dynamically predict the direction and target of
    branches based on PC using BTB
  • If no dynamic prediction available, statically
    predict
  • Taken for backwards looping branches
  • Not taken for forward branches
  • Implemented at decode
  • Traces built across (predicted) taken branches to
    avoid taken branch penalties
  • Also includes a 16-entry return address stack
    predictor
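
A minimal sketch in C of the static fallback rule on this slide (backward
branches predicted taken, forward branches predicted not taken) layered
under a BTB lookup; the structure and function names are hypothetical.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct { bool valid; bool taken; uint32_t target; } btb_entry_t;

    /* Use the BTB prediction when one exists for this PC; otherwise fall
     * back to the static rule: backward (looping) branches are predicted
     * taken, forward branches not taken. */
    static bool predict_taken(const btb_entry_t *btb_hit,
                              uint32_t branch_pc, uint32_t target_pc) {
        if (btb_hit && btb_hit->valid)
            return btb_hit->taken;
        return target_pc < branch_pc;
    }

    int main(void) {
        printf("%d\n", predict_taken(NULL, 0x1000, 0x0ff0));  /* backward -> 1 */
        printf("%d\n", predict_taken(NULL, 0x1000, 0x1040));  /* forward  -> 0 */
        return 0;
    }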

10
Decoder
  • Single decoder available
  • Operates at a maximum of 1 instruction per cycle
  • Receives instructions from L2 cache 64 bits at a
    time
  • Some complex instructions must enlist the
    micro-ROM
  • Used for very complex iA32 instructions (> 4 µops)
  • After the microcode ROM finishes, the front end resumes fetching µops
    from the Trace Cache

11
Execution Pipeline
12
Trace Cache
  • Primary instruction cache in P4 architecture
  • Stores 12K decoded µops
  • On a miss, instructions are fetched from L2
  • Trace predictor connects traces
  • Trace cache removes
  • Decode latency after mispredictions
  • Decode power for all pre-decoded instructions
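
A structural sketch in C of what one trace-cache line might hold; the field
names and the six-µop line size are illustrative assumptions, not the actual
Pentium 4 arrays.

    #include <stdbool.h>
    #include <stdint.h>

    /* A trace line stores already-decoded uops, so a hit bypasses the x86
     * decoder entirely; the next-trace field lets the trace predictor chain
     * lines across predicted-taken branches. */
    typedef struct {
        uint32_t start_pc;      /* x86 address the trace begins at */
        uint64_t uops[6];       /* a few decoded uops per line     */
        uint32_t next_trace;    /* predicted successor trace       */
        bool     valid;
    } trace_line_t;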

13
Branch Hints
  • P4 software can provide hints to branch
    prediction and trace cache
  • Specify the likely direction of a branch
  • Implemented with conditional branch prefixes
  • Used for decode-stage predictions and trace
    building
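
The hint itself is a prefix on the conditional branch instruction, but at the
source level the closest everyday analogue is a compiler hint such as
GCC/Clang's __builtin_expect, shown below; whether a given compiler turns it
into a Pentium 4 branch-hint prefix is toolchain-dependent, so treat this
purely as an illustration.

    #include <stdio.h>

    /* Tell the compiler which way the branch is expected to go. */
    #define likely(x)   __builtin_expect(!!(x), 1)
    #define unlikely(x) __builtin_expect(!!(x), 0)

    static int checksum(const int *buf, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            if (unlikely(buf[i] < 0))   /* error path: hinted rarely taken */
                return -1;
            sum += buf[i];
        }
        return sum;
    }

    int main(void) {
        int data[4] = {1, 2, 3, 4};
        printf("%d\n", checksum(data, 4));
        return 0;
    }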

14
Execution Pipeline
15
Execution Pipeline
16
Execution
  • 126 µops can be in flight at once
  • Up to 48 loads / 24 stores
  • Can dispatch up to 6 µops per cycle
  • 2× the trace cache and retirement µop bandwidth
  • Provides additional bandwidth for scheduling around misspeculation

17
Execution Units
18
Register Renaming
19
Register Renaming
  • 8-entry architectural register file
  • 128-entry physical register file
  • 2 RATs (Front-end RAT and Retirement RAT)
  • Retirement RAT eliminates register writes into
    ARF
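
A minimal sketch in C of renaming with two RATs, in the spirit of this slide;
the 8 architectural and 128 physical registers match the bullets above, but
the function names and the omitted free-list bookkeeping are illustrative
assumptions.

    #include <stdint.h>
    #include <stdio.h>

    #define N_ARCH 8
    #define N_PHYS 128

    static uint8_t frontend_rat[N_ARCH];    /* speculative arch -> phys map */
    static uint8_t retirement_rat[N_ARCH];  /* committed arch -> phys map   */

    /* Rename one uop: read source mappings, point the destination at a
     * freshly allocated physical register (free-list handling omitted). */
    static void rename_uop(int src1, int src2, int dst, uint8_t new_phys) {
        uint8_t p1 = frontend_rat[src1], p2 = frontend_rat[src2];
        frontend_rat[dst] = new_phys;
        printf("srcs p%d, p%d -> dst p%d\n", p1, p2, new_phys);
    }

    /* At retirement the mapping becomes architectural: the retirement RAT
     * records which physical register holds the committed value, so no
     * copy into a separate ARF is needed. */
    static void retire_uop(int dst, uint8_t phys) {
        retirement_rat[dst] = phys;
    }

    int main(void) {
        rename_uop(1, 2, 3, 17);   /* e.g. r3 = r1 + r2 */
        retire_uop(3, 17);
        printf("r3: speculative p%d, committed p%d\n",
               frontend_rat[3], retirement_rat[3]);
        return 0;
    }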

20
Store and Load Scheduling
  • Out-of-order store and load operations
  • Stores are always in program order
  • 48 loads and 24 stores can be in flight
  • Store/load buffers are allocated at the allocation stage
  • 24 store buffers and 48 load buffers in total (sketched below)
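
A small sketch of that allocation step in C: each load or store claims an
entry from a fixed pool at the allocation stage and stalls when the pool is
exhausted. The counter-based bookkeeping and function names are illustrative
assumptions.

    #include <stdbool.h>
    #include <stdio.h>

    #define N_LOAD_BUF  48   /* load buffer entries  */
    #define N_STORE_BUF 24   /* store buffer entries */

    static int loads_in_flight, stores_in_flight;

    /* Returns false when no buffer is free, i.e. allocation must stall. */
    static bool allocate_load(void) {
        if (loads_in_flight == N_LOAD_BUF) return false;
        loads_in_flight++;
        return true;
    }

    static bool allocate_store(void) {
        if (stores_in_flight == N_STORE_BUF) return false;
        stores_in_flight++;
        return true;
    }

    int main(void) {
        if (allocate_store())
            printf("store buffer allocated\n");
        for (int i = 0; i < 50; i++)
            if (!allocate_load()) {
                printf("load allocation stalls at uop %d\n", i);
                break;
            }
        return 0;
    }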

21
Execution Pipeline
22
Retirement
  • Can retire 3 µops per cycle
  • Implements precise exceptions
  • Reorder buffer used to organize completed µops
  • Also keeps track of branches and sends updated
    branch information to the BTB
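
A minimal sketch in C of that in-order retirement loop: up to three completed
µops leave the head of the reorder buffer each cycle, and retiring branches
is where updated outcome information would be fed back to the BTB. The
126-entry size matches the earlier execution slide; everything else is an
illustrative assumption.

    #include <stdbool.h>
    #include <stdio.h>

    #define ROB_SIZE 126
    #define RETIRE_W   3

    typedef struct { bool completed; bool is_branch; } rob_entry_t;

    static rob_entry_t rob[ROB_SIZE];
    static int head, count;

    /* Retire in order from the ROB head; stopping at the first incomplete
     * uop is what keeps exceptions precise. */
    static int retire_cycle(void) {
        int retired = 0;
        while (retired < RETIRE_W && count > 0 && rob[head].completed) {
            if (rob[head].is_branch) {
                /* real hardware would send the resolved outcome/target
                 * back to the BTB here */
            }
            head = (head + 1) % ROB_SIZE;
            count--;
            retired++;
        }
        return retired;
    }

    int main(void) {
        for (int i = 0; i < 10; i++) { rob[i].completed = true; count++; }
        printf("retired %d uops this cycle\n", retire_cycle());
        return 0;
    }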

23
Data Stream of Pentium 4 Processor
24
On-chip Caches
  • L1 instruction cache (Trace Cache)
  • L1 data cache
  • L2 unified cache
  • All caches use a pseudo-LRU replacement algorithm
  • Parameters

25
L1 Data Cache
  • Non-blocking
  • Supports up to 4 outstanding load misses
  • Load latency
  • 2-clock for integer
  • 6-clock for floating-point
  • 1 Load and 1 Store per clock
  • Load speculation
  • Assume the access will hit the cache
  • Replay the dependent instructions when a miss is detected
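
A toy sketch of that hit-speculation policy in C: dependents are scheduled
assuming the 2-cycle hit latency, and a detected miss triggers a replay. The
fake l1_hit() predicate and all names are illustrative assumptions.

    #include <stdbool.h>
    #include <stdio.h>

    /* Stand-in for a real tag check: pretend addresses with bit 6 set miss. */
    static bool l1_hit(unsigned addr) { return (addr & 0x40) == 0; }

    static void issue_load(unsigned addr) {
        /* dependents are scheduled speculatively, assuming a hit */
        printf("load 0x%x: dependents scheduled for 2-cycle hit latency\n", addr);
        if (!l1_hit(addr)) {
            /* miss detected after dependents already issued: replay them
             * once the data actually arrives */
            printf("load 0x%x missed: replaying dependent uops\n", addr);
        }
    }

    int main(void) {
        issue_load(0x100);   /* hit  */
        issue_load(0x140);   /* miss */
        return 0;
    }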

26
L2 Cache
  • Non-blocking
  • Load latency
  • Net load access latency of 7 cycles
  • Bandwidth
  • 1 load and 1 store in one cycle
  • New cache operations may begin every 2 cycles
  • 256-bit wide bus between L1 and L2
  • 48 Gbytes per second @ 1.5 GHz (worked out below)
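
The 48 GB/s figure follows directly from the bus width and clock; a one-line
check in C, assuming the 256-bit bus transfers once per 1.5 GHz core clock
(so a 64-byte line occupies it for the two cycles mentioned above):

    #include <stdio.h>

    int main(void) {
        double bytes_per_clk = 256 / 8;   /* 256-bit bus -> 32 bytes per clock */
        double clock_hz      = 1.5e9;     /* 1.5 GHz core clock                */
        printf("peak L1<->L2 bandwidth = %.0f GB/s\n",
               bytes_per_clk * clock_hz / 1e9);
        return 0;
    }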

27
L2 Cache Data Prefetcher
  • Hardware prefetcher monitors the reference
    patterns
  • Brings in cache lines automatically
  • Attempts to fetch 256 bytes ahead of the current access
  • Prefetches for up to 8 simultaneous independent streams

28
System Bus
  • Delivers data at 3.2 Gbytes/s (worked out below)
  • 64-bit wide bus
  • Four data phases per clock cycle (quad pumped)
  • 100MHz clocked system bus
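
The 3.2 GB/s figure is just the product of the three bullets above; a quick
check in C:

    #include <stdio.h>

    int main(void) {
        double bytes_per_transfer = 64 / 8;   /* 64-bit wide bus     */
        double transfers_per_clk  = 4;        /* quad pumped         */
        double bus_clock_hz       = 100e6;    /* 100 MHz bus clock   */
        printf("peak system bus bandwidth = %.1f GB/s\n",
               bytes_per_transfer * transfers_per_clk * bus_clock_hz / 1e9);
        return 0;
    }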

29
Execution on MPEG4 Benchmarks @ 1 GHz
30
Performance Trends
(Chart: Moore's Law speedup vs. application demand, e.g. real-time speech at roughly 10k SPECint2000, illustrating the performance gap)
31
Power Trends
(Chart: chip power density approaching that of a hot plate, a nuclear reactor, and a rocket nozzle, versus a roughly 500 mW budget for real-time speech, illustrating the power gap)