EECS 470 - PowerPoint PPT Presentation

About This Presentation

Title:

EECS 470

Description:

Reduce the number of instructions executed. Reduce the cycles to execute an instruction ... First flagship architecture since the P6 microarchitecture ... – PowerPoint PPT presentation

Slides: 32

Provided by: todda7

Transcript and Presenter's Notes

1
EECS 470

Superscalar Architectures and the Pentium 4
Lecture 12

2
Optimizing CPU Performance

Golden Rule: tCPU = Ninst × CPI × tCLK
Given this, what are our options?
Reduce the number of instructions executed
Reduce the cycles to execute an instruction
Reduce the clock period
Our next focus: further reducing CPI
Approach: superscalar execution
Capable of initiating multiple instructions per
cycle
Possible to implement for in-order or
out-of-order pipelines
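
The Golden Rule on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the instruction count, the 1.5 GHz clock, and the two CPI values are hypothetical numbers chosen for the example, not figures from the lecture.

```python
def cpu_time(n_inst: int, cpi: float, t_clk: float) -> float:
    """Execution time in seconds: tCPU = Ninst * CPI * tCLK."""
    return n_inst * cpi * t_clk

# Hypothetical workload: 1 billion instructions on a 1.5 GHz clock.
N_INST = 1_000_000_000
T_CLK = 1 / 1.5e9  # clock period in seconds

scalar = cpu_time(N_INST, cpi=1.0, t_clk=T_CLK)       # ideal scalar pipeline
superscalar = cpu_time(N_INST, cpi=0.5, t_clk=T_CLK)  # ideal 2-wide issue

print(f"scalar:      {scalar:.3f} s")
print(f"superscalar: {superscalar:.3f} s")
print(f"speedup:     {scalar / superscalar:.2f}x")
```

With Ninst and tCLK held fixed, halving CPI halves tCPU, which is the motivation for initiating more than one instruction per cycle.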

3
Why Superscalar?
Pipelining
Superscalar Pipelining

Optimization results in more complexity
Longer wires, more logic → higher tCLK and tCPU
Architects must strike a balance with reductions
in CPI
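
The balance described above can be made concrete: for a fixed instruction count, a wider design only pays off if the CPI reduction outweighs the clock period increase. The numbers below are hypothetical and serve only to show both outcomes.

```python
def speedup(cpi_old: float, tclk_old: float, cpi_new: float, tclk_new: float) -> float:
    """Speedup of the new design over the old one for the same instruction count."""
    return (cpi_old * tclk_old) / (cpi_new * tclk_new)

# Hypothetical case 1: CPI drops from 1.0 to 0.6, clock period grows by 20%.
print(f"{speedup(1.0, 1.0, 0.6, 1.2):.2f}x")  # net win

# Hypothetical case 2: same CPI gain, but clock period grows by 80%.
print(f"{speedup(1.0, 1.0, 0.6, 1.8):.2f}x")  # net loss
```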

4
Implications of Superscalar Execution

Instruction fetch?
Taken branches, multiple branches, partial cache
lines
Instruction decode?
Simple for fixed length ISA, much harder for
variable length
Renaming?
Multi-port RT, inter-inst dependencies must be
recognized
Dynamic Scheduling?
Requires multiple results buses, smarter
selection logic
Execution?
Multiple functional units, multiple result buses
Commit?
Multiple ROB/ARF ports, dependencies must be
recognized
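
The renaming bullet above (inter-instruction dependencies within a group) can be sketched in software. This is a sequential model under stated assumptions: real hardware resolves the same dependencies in parallel with comparators, and the physical register names (p0, p1, ...) and the `rename_group` helper are hypothetical.

```python
from itertools import count

def rename_group(group, rat, free_regs):
    """Rename a group of instructions that enter rename in the same cycle.

    group: list of (dest, src1, src2) architectural register names.
    rat: dict mapping architectural -> physical register (updated in place).
    free_regs: iterator yielding unused physical register names.
    """
    renamed = []
    for dest, src1, src2 in group:
        # Sources see the table as updated by earlier instructions in the
        # same group, which models the inter-instruction dependency check.
        p_src1, p_src2 = rat[src1], rat[src2]
        p_dest = next(free_regs)
        rat[dest] = p_dest
        renamed.append((p_dest, p_src1, p_src2))
    return renamed

rat = {"eax": "p0", "ebx": "p1", "ecx": "p2"}
free_regs = (f"p{i}" for i in count(3))
group = [("eax", "ebx", "ecx"),   # eax = ebx op ecx
         ("ebx", "eax", "ecx")]   # reads the eax written just above
out = rename_group(group, rat, free_regs)
print(out)  # [('p3', 'p1', 'p2'), ('p4', 'p3', 'p2')]
```

The second instruction reads p3 rather than the stale p0, which is exactly the dependency a multi-ported rename table has to recognize within one cycle.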

5
P4 Overview

Latest iA32 processor from Intel
Equipped with the full set of iA32 SIMD
operations
First flagship architecture since the P6
microarchitecture
Pentium 4 ISA = Pentium III ISA + SSE2
SSE2 (Streaming SIMD Extensions 2) provides
128-bit SIMD integer and floating point
operations + prefetch

6
Comparison Between Pentium III and Pentium 4
7
Execution Pipeline
8
Front End

Predicts branches
Fetches/decodes code into trace cache
Generates µops for complex instructions
Prefetches instructions that are likely to be
executed

9
Branch Prediction

Dynamically predict the direction and target of
branches based on PC using BTB
If no dynamic prediction available, statically
predict
Taken for backwards looping branches
Not taken for forward branches
Implemented at decode
Traces built across (predicted) taken branches to
avoid taken branch penalties
Also includes a 16-entry return address stack
predictor
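
The prediction policy on this slide can be sketched as follows. This is a simplified model, not the Pentium 4 design: the BTB is assumed to be an unbounded table that remembers only the last outcome per branch, the overflow policy of the return address stack is an assumption, and the class and method names are hypothetical.

```python
class BranchPredictor:
    def __init__(self, ras_depth=16):
        self.btb = {}              # branch PC -> (taken, target), last outcome only
        self.ras = []              # return address stack
        self.ras_depth = ras_depth

    def predict(self, pc, decoded_target):
        """Return (predicted_taken, predicted_target) for a conditional branch."""
        if pc in self.btb:
            return self.btb[pc]    # dynamic prediction available
        # Static fallback at decode: backward branches taken, forward not taken.
        return (decoded_target < pc, decoded_target)

    def update(self, pc, taken, target):
        self.btb[pc] = (taken, target)

    def call(self, return_addr):
        if len(self.ras) == self.ras_depth:
            self.ras.pop(0)        # assumed policy: drop the oldest entry
        self.ras.append(return_addr)

    def ret(self):
        return self.ras.pop() if self.ras else None

bp = BranchPredictor()
print(bp.predict(0x100, 0x080))   # backward branch, no BTB entry: predicted taken
print(bp.predict(0x100, 0x180))   # forward branch, no BTB entry: predicted not taken
```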

10
Decoder

Single decoder available
Operates at a maximum of 1 instruction per cycle
Receives instructions from L2 cache 64 bits at a
time
Some complex instructions must enlist the
micro-ROM
Used for very complex iA32 instructions (> 4
µops)
After the microcode ROM finishes, the front-end
resumes fetching µops from the Trace Cache

11
Execution Pipeline
12
Trace Cache

Primary instruction cache in P4 architecture
Stores 12k decoded µops
On a miss, instructions are fetched from L2
Trace predictor connects traces
Trace cache removes
Decode latency after mispredictions
Decode power for all pre-decoded instructions

13
Branch Hints

P4 software can provide hints to branch
prediction and trace cache
Specify the likely direction of a branch
Implemented with conditional branch prefixes
Used for decode-stage predictions and trace
building

14
Execution Pipeline
15
Execution Pipeline
16
Execution

126 µops can be in flight at once
Up to 48 loads / 24 stores
Can dispatch up to 6 µops per cycle
2x the trace cache and retirement µop bandwidth
Provides additional B/W for scheduling
misspeculation

17
Execution Units
18
Register Renaming
19
Register Renaming

8-entry architectural register file
128-entry physical register file
2 RATs (Front-end RAT and Retirement RAT)
Retirement RAT eliminates register writes into
ARF

20
Store and Load Scheduling

Out of order store and load operations
Stores are always in program order
48 loads and 24 stores could be in flight
Store/load buffers are allocated at the
allocation stage
Total 24 store buffers and 48 load buffers

21
Execution Pipeline
22
Retirement

Can retire 3 µops per cycle
Implements precise exceptions
Reorder buffer used to organize completed µops
Also keeps track of branches and sends updated
branch information to the BTB
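
In-order retirement from a reorder buffer can be sketched as below. The width of 3 comes from the slide. The ROB entry layout (a dict with `id` and `done` fields) and the `retire_cycle` helper are hypothetical simplifications.

```python
from collections import deque

RETIRE_WIDTH = 3  # µops retired per cycle, from the slide

def retire_cycle(rob):
    """Retire up to RETIRE_WIDTH completed µops from the head of the ROB.

    Retirement stops at the first µop that has not completed, which keeps
    architectural state updates in program order (precise exceptions).
    """
    retired = []
    while rob and len(retired) < RETIRE_WIDTH and rob[0]["done"]:
        retired.append(rob.popleft()["id"])
    return retired

rob = deque([
    {"id": 0, "done": True},
    {"id": 1, "done": True},
    {"id": 2, "done": False},  # still executing: blocks younger µops
    {"id": 3, "done": True},
])
first = retire_cycle(rob)
second = retire_cycle(rob)
print(first, second)  # [0, 1] []
```

µop 3 has finished executing but cannot retire until µop 2 does, so the second cycle retires nothing.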

23
Data Stream of Pentium 4 Processor
24
On-chip Caches

L1 instruction cache (Trace Cache)
L1 data cache
L2 unified cache
All caches use a pseudo-LRU replacement algorithm
Parameters

25
L1 Data Cache

Non-blocking
Support up to 4 outstanding load misses
Load latency
2-clock for integer
6-clock for floating-point
1 Load and 1 Store per clock
Load speculation
Assume the access will hit the cache
Replay the dependent instructions when miss
detected

26
L2 Cache

Non-blocking
Load latency
Net load access latency of 7 cycles
Bandwidth
1 load and 1 store in one cycle
New cache operations may begin every 2 cycles
256-bit wide bus between L1 and L2
48 Gbytes per second @ 1.5 GHz
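
The bandwidth figure follows from the bus width and the clock rate. The short check below assumes one full 256-bit transfer per core clock, which is the assumption under which the slide's number works out.

```python
BUS_WIDTH_BITS = 256
CLOCK_HZ = 1_500_000_000  # 1.5 GHz

bytes_per_transfer = BUS_WIDTH_BITS // 8   # 32 bytes
peak = bytes_per_transfer * CLOCK_HZ       # bytes per second

print(peak / 1e9, "Gbytes/s")  # 48.0 Gbytes/s
```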

27
L2 Cache Data Prefetcher

Hardware prefetcher monitors the reference
patterns
Brings cache lines in automatically
Attempts to fetch 256 bytes ahead of current
access
Prefetch for up to 8 simultaneous independent
streams
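
A sequential stream prefetcher along these lines can be sketched as follows. The 256-byte distance and the limit of 8 streams come from the slide. Everything else is an assumption for illustration: the 64-byte line size, the ascending-only stream detection, the oldest-first replacement of stream entries, and the class itself. A real prefetcher would also skip lines that are already cached.

```python
LINE = 64          # assumed cache line size in bytes (not given in the slides)
AHEAD = 256        # prefetch distance in bytes, from the slide
MAX_STREAMS = 8    # simultaneous streams, from the slide

class StreamPrefetcher:
    def __init__(self):
        self.streams = []  # expected next line address per stream, oldest first

    def access(self, addr):
        """Return the line addresses to prefetch for this access."""
        line = addr - addr % LINE
        nxt = line + LINE
        if line in self.streams:
            # The access continues a tracked stream: advance it and run ahead.
            self.streams.remove(line)
            if nxt not in self.streams:
                self.streams.append(nxt)
            return [line + LINE * k for k in range(1, AHEAD // LINE + 1)]
        if nxt not in self.streams:
            # New stream: start tracking, evicting the oldest if the table is full.
            if len(self.streams) == MAX_STREAMS:
                self.streams.pop(0)
            self.streams.append(nxt)
        return []

pf = StreamPrefetcher()
print(pf.access(0))    # first touch: only starts tracking a stream
print(pf.access(64))   # next sequential line: prefetch 256 bytes ahead
```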

28
System Bus

Delivers data at 3.2 Gbytes/s
64-bit wide bus
Four data phases per clock cycle (quad pumped)
100 MHz clocked system bus
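
The system bus figure is the product of the three parameters listed on this slide. The helper name below is hypothetical.

```python
def bus_bandwidth_bytes_per_s(width_bits, clock_hz, data_phases_per_clock):
    """Peak bandwidth = bytes per transfer * clock rate * transfers per clock."""
    return (width_bits // 8) * clock_hz * data_phases_per_clock

bw = bus_bandwidth_bytes_per_s(64, 100_000_000, 4)  # quad-pumped 100 MHz bus
print(bw / 1e9, "Gbytes/s")  # 3.2 Gbytes/s
```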

29
Execution on MPEG4 Benchmarks @ 1 GHz
30
Performance Trends
[Figure labels: Moore's Law speedup; performance gap; real-time speech at 10k SPECInt2000]
31
Power Trends
[Figure labels: rocket nozzle; nuclear reactor; hot plate; power gap; real-time speech at 500 mW power]
