Title: ECE4100/6100 Guest Lecture: P6 & NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee, School of ECE, Georgia Institute of Technology
February 11, 2003
2. Why study P6 from the last millennium?
- A paradigm shift from Pentium
- A RISC core disguised as a CISC
- Huge market success
- Microarchitecture
- And stock price
- Architected by former VLIW and RISC folks
- Multiflow (pioneer in VLIW architecture for super-minicomputers)
- Intel i960 (Intel's RISC for graphics and embedded controllers)
- NetBurst (P4's microarchitecture) is based on P6
3. P6 Basics
- One implementation of the IA32 architecture
- Super-pipelined processor
- 3-way superscalar
- In-order front-end and back-end
- Dynamic execution engine (restricted dataflow)
- Speculative execution
- P6 microarchitecture family processors include
- Pentium Pro
- Pentium II (PPro + MMX, 2x caches: 16KB I / 16KB D)
- Pentium III (P-II + SSE, enhanced MMX, e.g. PSAD)
- Celeron (without MP support)
- Later P-II/P-III/Celeron parts all have on-die L2 cache
4. x86 Platform Architecture
[Diagram: host processor with the P6 core, L1 cache (SRAM), and on-die or on-package L2 cache (SRAM) on a back-side bus; a front-side bus connects to the chipset, which links system memory (DRAM), the graphics processor (GPU) with its local frame buffer over AGP, and PCI/USB I/O.]
5. Pentium III Die Map
- EBL/BBL - External/Back-side Bus Logic
- MOB - Memory Order Buffer
- Packed FPU - Floating-Point Unit for SSE
- IEU - Integer Execution Unit
- FAU - Floating-Point Arithmetic Unit
- MIU - Memory Interface Unit
- DCU - Data Cache Unit (L1)
- PMH - Page Miss Handler
- DTLB - Data TLB
- BAC - Branch Address Calculator
- RAT - Register Alias Table
- SIMD - Packed Floating-Point Unit
- RS - Reservation Station
- BTB - Branch Target Buffer
- TAP - Test Access Port
- IFU - Instruction Fetch Unit and L1 I-Cache
- ID - Instruction Decode
- ROB - Reorder Buffer
- MS - Micro-instruction Sequencer
6. ISA Enhancements (on top of Pentium)
- CMOVcc / FCMOVcc r, r/m
- Conditional move (predicated move) instructions
- Based on the condition code (cc)
- FCOMI/P: compare FP stack and set integer flags
- RDPMC/RDTSC instructions
- Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory
- MMX in Pentium II
- SIMD integer operations
- SSE in Pentium III
- Prefetches (non-temporal: nta; temporal: t0, t1, t2), sfence
- SIMD single-precision FP operations
7. P6 Pipelining
8. P6 Microarchitecture
[Diagram: chip boundary with a bus cluster (bus interface unit, external bus) and a memory cluster; an instruction fetch cluster (Instruction Fetch Unit, BTB/BAC) handles control flow in order; the issue cluster feeds an out-of-order cluster containing the Reservation Station, IEU/JEU execution units ((restricted) data flow), and ROB/Retire RF.]
9. Instruction Fetch Unit
[Diagram: a Next-PC mux selects among fetch requests to form the linear address; the instruction TLB supplies the physical address (P.Addr) to the instruction cache, victim cache, and streaming buffer; the ILD adds length marks and the BTB adds prediction marks; an instruction rotator fills the instruction buffer according to the bytes consumed by ID.]
- IFU1: Initiate fetch, requesting 16 bytes at a time
- IFU2: Instruction length decoder (ILD) marks instruction boundaries; the BTB makes its prediction
- IFU3: Align instructions to the 3 decoders in 4-1-1 format
10. Dynamic Branch Prediction
[Diagram: 512-entry, 4-way (W0-W3) BTB.]
- Similar to a 2-level PAs design
- Local branch history associated with each BTB entry
- With a 16-entry Return Stack Buffer
- 4 branch predictions per cycle (due to 16-byte fetch per cycle)
- Static prediction provided by the Branch Address Calculator when the BTB misses (see slide 11)
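A 2-level predictor of this style can be sketched as below. The sizes (4-bit local history, a shared table of 2-bit saturating counters) are illustrative assumptions for a PAs-like scheme, not Intel's disclosed parameters.

```python
# Sketch of a 2-level PAs-style predictor: per-branch local history (as kept
# in a BTB entry) indexes a table of 2-bit saturating counters.
class TwoLevelPredictor:
    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.histories = {}                        # per-branch local history
        self.counters = [2] * (1 << history_bits)  # 2-bit counters, weakly taken

    def predict(self, pc):
        hist = self.histories.get(pc, 0)
        return self.counters[hist] >= 2            # upper half of counter = taken

    def update(self, pc, taken):
        hist = self.histories.get(pc, 0)
        c = self.counters[hist]
        self.counters[hist] = min(3, c + 1) if taken else max(0, c - 1)
        mask = (1 << self.history_bits) - 1
        self.histories[pc] = ((hist << 1) | int(taken)) & mask

p = TwoLevelPredictor()
# A loop branch taken 3 times then not taken, repeated: the local history
# lets the predictor learn the loop exit after one warm-up miss.
outcomes = [True, True, True, False] * 8
hits = 0
for taken in outcomes:
    hits += (p.predict(0x400) == taken)
    p.update(0x400, taken)
# hits == 31: only the very first loop exit is mispredicted
```

The point of the local history is visible here: a simple 2-bit counter alone would mispredict every loop exit, while the history-indexed table predicts it once the TTTF pattern has been seen.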
11. Static Branch Prediction
[Flowchart, summarized:]
- BTB hit: use the BTB's decision
- BTB miss:
  - Return: predict taken
  - Indirect jump (not PC-relative): predict taken
  - Unconditional PC-relative: predict taken
  - Conditional PC-relative, backward target: predict taken
  - Conditional PC-relative, forward target: predict not taken
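The flowchart's decision tree can be written out as a small function; the boolean branch attributes are passed in explicitly, and the parameter names are illustrative.

```python
# Sketch of the BAC's static prediction policy summarized above.
def static_predict(btb_hit, btb_taken, is_return, is_conditional,
                   is_pc_relative, target_pc=0, branch_pc=0):
    if btb_hit:
        return btb_taken              # BTB hit: use the dynamic prediction
    if is_return:
        return True                   # returns: taken (target from RSB)
    if not is_pc_relative:
        return True                   # indirect jumps: taken
    if not is_conditional:
        return True                   # unconditional PC-relative: taken
    return target_pc < branch_pc      # conditional: backward taken, forward not

# A backward conditional branch (a loop) on a BTB miss is predicted taken:
loop = static_predict(False, False, False, True, True,
                      target_pc=0x1000, branch_pc=0x1040)
# A forward conditional branch on a BTB miss is predicted not taken:
fwd = static_predict(False, False, False, True, True,
                     target_pc=0x2000, branch_pc=0x1040)
```

The backward-taken/forward-not-taken heuristic works because backward conditional branches are overwhelmingly loop back-edges.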
12. x86 Instruction Decode
[Diagram: IFU3 feeds three decoders - one complex (1-4 µops) and two simple (1 µop each) - plus the micro-instruction sequencer (MS); decoded µops enter a 6-µop instruction decoder queue.]

Next 3 inst | Inst decoded this cycle
S,S,S | all 3
S,S,C | first 2
S,C,S | first 1
S,C,C | first 1
C,S,S | all 3
C,S,C | first 2
C,C,S | first 1
C,C,C | first 1
(S = simple, C = complex)

- 4-1-1 decoder
- Decode rate depends on instruction alignment
- DEC1: translate x86 instructions into micro-operations (µops)
- DEC2: move decoded µops to the ID queue
- MS performs translations either way:
- Generate the entire µop sequence from microcode ROM
- Receive 4 µops from the complex decoder, and the rest from microcode ROM
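The alignment table above follows from one rule: only decoder 0 can take a complex instruction. A minimal sketch that reproduces the whole table:

```python
# Sketch of the 4-1-1 alignment rule: decoder 0 handles any instruction,
# decoders 1 and 2 handle only simple (1-µop) instructions, so a complex
# instruction that is not in the first slot ends the decode group.
def decode_group(window):
    """window: list of 'S' (simple) / 'C' (complex); returns # decoded this cycle."""
    issued = 0
    for slot, inst in enumerate(window[:3]):
        if inst == 'C' and slot > 0:   # complex must go to decoder 0
            break
        issued += 1
    return issued

table = {pat: decode_group(list(pat)) for pat in
         ('SSS', 'SSC', 'SCS', 'SCC', 'CSS', 'CSC', 'CCS', 'CCC')}
```

This is why compilers scheduling for P6 try to emit complex instructions first in each group of three: C,S,S decodes 3 per cycle while S,C,S decodes only 1.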
13. Allocator
- The interface between the in-order and out-of-order pipelines
- Allocates
- 3-or-none µops per cycle into RS, ROB
- all-or-none into MOB (LB and SB)
- Generates the physical destination (Pdst) from the ROB and passes it to the Register Alias Table (RAT)
- Stalls upon shortage of resources
14. Register Alias Table (RAT)
[Diagram: the allocator supplies physical ROB pointers; logical sources index the integer and FP RAT arrays (with FP TOS adjust); integer and FP overrides produce the final RAT PSrcs into the in-order queue.]
- Register renaming for the 8 integer registers, 8 floating-point (stack) registers, and flags; 3 µops per cycle
- 40 80-bit physical registers embedded in the ROB (thereby, 6 bits to specify a PSrc)
- RAT looks up physical ROB locations for renamed sources based on the RRF bit
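The RAT lookup can be sketched as a map from each architectural register to either the Retirement Register File (committed value) or a ROB entry (in-flight value). The dict-based structure below is a simplification of the real RAT arrays, and the register set is the integer side only.

```python
# Sketch of RAT renaming: each architectural register points either to the
# RRF (committed state) or to a ROB entry (the 6-bit Pdst of the most
# recent in-flight writer).
class RAT:
    def __init__(self):
        # initially every register's committed value lives in the RRF
        self.table = {r: ('RRF', r) for r in
                      ('EAX', 'EBX', 'ECX', 'EDX', 'ESI', 'EDI', 'EBP', 'ESP')}

    def rename(self, srcs, dst, pdst):
        """Look up physical sources, then repoint dst at the new ROB entry."""
        psrcs = [self.table[s] for s in srcs]
        self.table[dst] = ('ROB', pdst)
        return psrcs

rat = RAT()
# ADD EAX, EBX -> reads EAX and EBX from the RRF, writes EAX into ROB slot 0
s1 = rat.rename(['EAX', 'EBX'], 'EAX', 0)
# ADD EAX, ECX -> the EAX source is now renamed to ROB slot 0
s2 = rat.rename(['EAX', 'ECX'], 'EAX', 1)
```

The second µop picking up `('ROB', 0)` as its source is the renaming in action: the false dependence on architectural EAX becomes a true dependence on the first µop's ROB entry.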
15. Partial Register Width Renaming
[Diagram: the same RAT structure as the prior slide, with the example sequence µop0: MOV AL, (a); µop1: MOV AH, (b); µop2: ADD AL, (c); µop3: ADD AH, (d).]
- 32/16-bit accesses
- Read from the low bank
- Write to both banks
- 8-bit RAT accesses depend on which bank is being written
16. Partial Stalls due to RAT
- Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a larger (e.g. 32-bit) read
- Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches
17. Reservation Stations
[Diagram: the RS dispatches through 5 ports - Port 0 and Port 1 to the execution units (with writeback buses 0 and 1), Port 2 for the load address (LDA) to the MOB/DCU, Port 3 for the store address (STA), Port 4 for the store data (STD); retired data goes to the ROB.]
- Gateway to execution: dispatches up to 5 µops per cycle, one to each port
- 20-µop entry buffer bridging the in-order and out-of-order engines
- RS fields include µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
- Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
18. ReOrder Buffer
- A 40-entry circular buffer
- Similar to that described in Smith & Pleszkun '85
- 157 bits wide
- Provides 40 alias physical registers
- Out-of-order completion
- Exceptions deposited in each entry
- Retirement (or de-allocation)
- After resolving prior speculation
- Handles exceptions through the MS
- Clears OOO state when a mis-predicted branch or exception is detected
- 3 µops per cycle in program order
- For multi-µop x86 instructions: none or all (atomic)
[Diagram: ROB entries; an entry marked with an exception triggers a µcode assist.]
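The circular-buffer discipline above - allocate at the tail in program order, complete out of order, retire from the head in order - can be sketched as follows (the 3-per-cycle retirement limit is from the slide; entry contents are simplified).

```python
# Minimal sketch of a 40-entry circular ROB: µops allocate at the tail in
# program order and retire from the head only once done, so completion can
# be out of order while retirement stays strictly in order.
from collections import deque

class ROB:
    SIZE = 40

    def __init__(self):
        self.entries = deque()            # in program order, head = oldest

    def allocate(self, uop):
        if len(self.entries) == self.SIZE:
            return None                   # full: the allocator stalls
        self.entries.append({'uop': uop, 'done': False})
        return len(self.entries) - 1

    def complete(self, idx):
        self.entries[idx]['done'] = True  # out-of-order writeback

    def retire(self, max_per_cycle=3):
        retired = []
        while (self.entries and self.entries[0]['done']
               and len(retired) < max_per_cycle):
            retired.append(self.entries.popleft()['uop'])
        return retired

rob = ROB()
for u in ('a', 'b', 'c'):
    rob.allocate(u)
rob.complete(1)          # 'b' finishes first...
first = rob.retire()     # ...but nothing retires: 'a' is still pending
rob.complete(0)
rob.complete(2)
second = rob.retire()    # now all three retire in program order
```

The empty first retirement is exactly the in-order-retirement guarantee: a completed younger µop waits behind an incomplete older one.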
19. Memory Execution Cluster
[Diagram: the RS/ROB issue LD and STA/STD µops to the load buffer and store buffer; the DTLB and DCU (with fill buffers, FB) service the accesses; the EBL handles external bus requests.]
- Manages data memory accesses
- Address translation
- Detects violations of access ordering
- Fill buffers in the DCU (similar to MSHRs, Kroft '81) for handling cache misses (non-blocking)
20. Memory Order Buffer (MOB)
- Allocated by ALLOC
- A second-order RS for memory operations
- 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
- MOB
- 16-entry load buffer (LB)
- 12-entry store address buffer (SAB)
- SAB works in unison with
- the Store Data Buffer (SDB) in the MIU
- the Physical Address Buffer (PAB) in the DCU
- Store Buffer (SB) = SAB + SDB + PAB
- Senior stores
- Upon STD/STA retiring from the ROB, the SB marks the store senior
- Senior stores are committed back to memory in program order when the bus is idle or the SB is full
- Prefetch instructions in P-III have senior-load behavior, due to having no explicit architectural destination
21. Store Coloring

x86 Instruction | µops | store color
mov (0x1220), ebx | std (ebx); sta 0x1220 | 2
mov (0x1110), eax | std (eax); sta 0x1100 | 3
mov ecx, (0x1220) | ld | 3
mov edx, (0x1280) | ld | 3
mov (0x1400), edx | std (edx); sta 0x1400 | 4
mov edx, (0x1380) | ld | 4

- ALLOC assigns a Store Buffer ID (SBID) in program order
- ALLOC tags each load with the most recent SBID
- Loads are checked against stores with equal or older SBIDs for potential address conflicts
- SDB forwards the data if a conflict is detected
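The coloring check can be sketched as below: a load's color is the SBID of the most recent prior store, so only stores with an equal-or-older SBID (i.e., stores that precede the load in program order) need an address comparison. The addresses reuse the slide's example; the full-address equality test is a simplification of the real comparison.

```python
# Sketch of the store-coloring conflict check: a load tagged with color N
# only compares against buffered stores whose SBID <= N.
def conflicting_stores(load_addr, load_color, store_buffer):
    """store_buffer: list of (sbid, addr). Returns SBIDs the load conflicts with."""
    return [sbid for sbid, addr in store_buffer
            if sbid <= load_color and addr == load_addr]

stores = [(2, 0x1220), (3, 0x1100)]
# ld from 0x1220 tagged with color 3: conflicts with store SBID 2 (same address),
# so the SDB would forward that store's data.
hit = conflicting_stores(0x1220, 3, stores)
# ld from 0x1280 with color 3: no address match, proceeds to the DCU.
miss = conflicting_stores(0x1280, 3, stores)
```

Restricting the comparison by color keeps younger stores (which the load must not observe) out of the check entirely.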
22. Memory Type Range Registers (MTRR)
- Control registers written by the system (OS)
- Supported memory types
- UnCacheable (UC)
- Uncacheable Speculative Write-Combining (USWC or WC)
- Uses a fill buffer entry as the WC buffer
- WriteBack (WB)
- Write-Through (WT)
- Write-Protected (WP)
- E.g. supports copy-on-write in UNIX: save memory by letting child processes share pages with their parents, creating new pages only when a child attempts to write
- Page Miss Handler (PMH)
- Looks up the MTRRs while supplying physical addresses
- Returns the memory type and physical address to the DTLB
23. Intel NetBurst Microarchitecture
- Pentium 4's microarchitecture, a post-P6 new generation
- Original target market: graphics workstations, but the major competitor screwed up
- Design goals
- Performance, performance, performance, ...
- Unprecedented multimedia/floating-point performance
- Streaming SIMD Extensions 2 (SSE2)
- Reduced CPI
- Low-latency instructions
- High-bandwidth instruction fetching
- Rapid execution of arithmetic/logic operations
- Reduced clock period
- New pipeline designed for scalability
24. Innovations Beyond P6
- Hyperpipelined technology
- Streaming SIMD Extension 2
- Enhanced branch predictor
- Execution trace cache
- Rapid execution engine
- Advanced Transfer Cache
- Hyper-threading Technology (in Xeon and Xeon MP)
25. Pentium 4 Fact Sheet
- IA-32 fully backward compatible
- Available at speeds ranging from 1.3 to 3 GHz
- Hyperpipelined (20 stages)
- 42 million transistors
- 0.18µm for 1.7 to 1.9GHz; 0.13µm for 1.8 to 2.8GHz
- Die size of 217 mm²
- Consumes 55 watts of power at 1.5GHz
- 400MHz (850) and 533MHz (850E) system bus
- 512KB or 256KB 8-way full-speed on-die L2 Advanced Transfer Cache (up to 89.6 GB/s @2.8GHz to L1)
- 1MB or 512KB L3 cache (in Xeon MP)
- 144 new 128-bit SIMD instructions (SSE2)
- HyperThreading Technology (only enabled in Xeon and Xeon MP)
26. Recent Intel IA-32 Processors
27. Building Blocks of NetBurst
[Diagram: the front-end (fetch/decode, ETC µROM, BTB/branch prediction) feeds an out-of-order engine (OOO logic, retire); execution units (INT and FP) sit beside the memory subsystem (L1 data cache, L2 cache, bus unit to the system bus); branch history updates flow from retirement back to the BTB.]
28. Pentium 4 Microarchitecture
[Diagram: quad-pumped 400M/533MHz, 64-bit system bus (3.2/4.3 GB/sec) through the BIU; the front end has a 4K-entry BTB, I-TLB/prefetcher, IA32 decoder, and µcode ROM feeding the Execution Trace Cache (with its own 512-entry trace BTB) and µop queue; the allocator/register renamer feeds INT/FP and memory µop queues; memory, fast, slow/general FP, and simple FP schedulers dispatch to the INT register file/bypass network (2x double-pumped ALUs for simple instructions, a slow ALU for complex instructions, load/store AGUs) and the FP RF/bypass network (FP move; FP MMX SSE/SSE2); 8KB 4-way L1 data cache (64-byte line, WT, 1 rd + 1 wr port); 256KB 8-way unified L2 (128-byte line, WB, 256-bit path, 48 GB/s @1.5GHz).]
29. Pipeline Depth Evolution
30. Execution Trace Cache
- Primary first-level I-cache, replacing a conventional L1
- Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
- Branch misprediction penalty is horrible
- 20 pipeline stages lost, vs. 10 stages in P6
- Advantages
- Caches post-decode µops
- High-bandwidth instruction fetching
- Eliminates x86 decoding overheads
- Reduces branch recovery time on a TC hit
- Holds up to 12,000 µops
- 6 µops per trace line
- Many (?) trace lines in a single trace
31. Execution Trace Cache (cont.)
- Delivers 3 µops per cycle to the OOO engine
- x86 instructions are read from L2 when the TC misses (7-cycle latency)
- TC hit rate is comparable to an 8KB to 16KB conventional I-cache
- Simplified x86 decoder
- Only one complex instruction per cycle
- Instructions > 4 µops are executed from the micro-code ROM (P6's MS)
- Branch prediction performed in the TC
- 512-entry BTB + 16-entry RAS
- Together with the BP in the x86 IFU, mispredictions are reduced by 1/3 compared to P6
- Intel did not disclose details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)
32. Out-Of-Order Engine
- Similar design philosophy to P6; uses
- Allocator
- Register Alias Table
- 128 physical registers
- 126-entry ReOrder Buffer
- 48-entry load buffer
- 24-entry store buffer
33. Register Renaming Schemes
[Diagram: P6 register renaming - a 40-entry ROB allocated sequentially, each entry holding data and status, with a separate RRF for committed state.]
34. Micro-op Scheduling
- µop FIFO queues
- Memory queue for loads and stores
- Non-memory queue
- µop schedulers
- Several schedulers fire instructions to execution (P6's RS)
- 4 distinct dispatch ports
- Maximum dispatch: 6 µops per cycle (2 fast ALU ops each from ports 0 and 1, plus 1 each from the ld/st ports)
35. Data Memory Accesses
- 8KB 4-way L1; 256KB 8-way L2 (with a HW prefetcher)
- Load-to-use speculation
- Dependent instructions dispatched before the load finishes
- Due to the high frequency and deep pipeline
- The scheduler assumes loads always hit L1
- On an L1 miss, dependent instructions that have left the scheduler temporarily receive incorrect data: a mis-speculation
- Replay logic re-executes the load when mis-speculated
- Independent instructions are allowed to proceed
- Up to 4 outstanding load misses (= 4 fill buffers, as in the original P6)
- Store-to-load forwarding buffer
- 24 entries
- Forwarding requires the same starting physical address
- and load data size <= store data size
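The two forwarding conditions above reduce to a one-line predicate; the sketch below illustrates it (addresses and sizes are made-up examples).

```python
# Sketch of the store-to-load forwarding legality check: the load must start
# at the same physical address as the buffered store and read no more bytes
# than the store wrote.
def can_forward(store_addr, store_size, load_addr, load_size):
    return load_addr == store_addr and load_size <= store_size

ok = can_forward(0x1000, 4, 0x1000, 2)          # 2-byte load inside a 4-byte store
exact = can_forward(0x1000, 4, 0x1000, 4)       # exact-size match forwards too
offset = can_forward(0x1000, 4, 0x1002, 2)      # covered bytes, but different
                                                 # start address: no forwarding
```

Note the third case: the load's bytes are entirely inside the store, yet forwarding fails because the starting addresses differ - the buffer only matches on the start address, so such loads must wait for the store to commit.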
36. Streaming SIMD Extension 2
- P-III SSE (Katmai New Instructions, KNI)
- Eight 128-bit wide xmm registers (new architectural state)
- Single-precision 128-bit SIMD FP
- Four 32-bit FP operations in one instruction
- Broken down into 2 µops for execution (only 80-bit data in the ROB)
- 64-bit SIMD MMX (uses 8 mm registers mapped onto the FP stack)
- Prefetch (nta, t0, t1, t2) and sfence
- P4 SSE2 (Willamette New Instructions, WNI)
- Supports double-precision 128-bit SIMD FP
- Two 64-bit FP operations in one instruction
- Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD, 69 cycles, non-pipelined)
- Enhanced 128-bit SIMD MMX using xmm registers
37. Examples of Using SSE
[Diagram: Packed SP FP operation (e.g. ADDPS xmm1, xmm2): each of the four lanes X3..X0 of xmm1 is combined with the corresponding lane Y3..Y0 of xmm2, giving X3 op Y3 .. X0 op Y0. Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): only the low lane computes X0 op Y0; the upper lanes X3..X1 pass through unchanged.]
38. Examples of Using SSE and SSE2
[Diagram: the SSE packed (ADDPS) and scalar (ADDSS) SP FP operations as on the prior slide. SSE2: Packed DP FP operation (e.g. ADDPD xmm1, xmm2) combines both 64-bit lanes X1, X0 with Y1, Y0; Scalar DP FP operation (e.g. ADDSD xmm1, xmm2) computes X0 op Y0 in the low lane only; Shuffle FP operation (e.g. SHUFPS xmm1, xmm2, imm8) and Shuffle DP operation (e.g. SHUFPD xmm1, xmm2, imm2, 2-bit immediate) select result lanes from the sources (e.g. X1 or X0 into the low lane, Y1 or Y0 into the high lane).]
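The lane semantics in these diagrams can be modeled by treating an xmm register as a list of lanes (index 0 = low lane, 4 x SP or 2 x DP values); this is a behavioral sketch of the instruction semantics, not the hardware.

```python
# Sketch of SSE/SSE2 lane behavior: packed ops work on every lane, scalar
# ops only on the low lane, and shuffles select lanes by immediate bits.
import operator

def packed(op, x, y):
    """ADDPS/ADDPD-style: apply op to every lane pair."""
    return [op(a, b) for a, b in zip(x, y)]

def scalar(op, x, y):
    """ADDSS/ADDSD-style: op on lane 0 only; upper lanes pass through from x."""
    return [op(x[0], y[0])] + x[1:]

def shufpd(x, y, imm2):
    """SHUFPD-style (2 DP lanes): bit 0 picks x's lane for the low result,
    bit 1 picks y's lane for the high result."""
    return [x[imm2 & 1], y[(imm2 >> 1) & 1]]

xmm1 = [1.0, 2.0, 3.0, 4.0]       # X0..X3, low lane first
xmm2 = [10.0, 20.0, 30.0, 40.0]   # Y0..Y3
p = packed(operator.add, xmm1, xmm2)   # ADDPS: all four lanes added
s = scalar(operator.add, xmm1, xmm2)   # ADDSS: only lane 0 added
sh = shufpd([1.0, 2.0], [10.0, 20.0], 0b01)  # low = X1, high = Y0
```

The scalar result keeping lanes 1-3 from xmm1 is exactly the pass-through shown in the diagram; it is why scalar SSE code must track which register holds the "live" upper lanes.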
39. HyperThreading
- In the Intel Xeon and Intel Xeon MP processors
- Enables Simultaneous Multi-Threading (SMT)
- Exploits ILP through TLP (Thread-Level Parallelism)
- Issues and executes multiple threads in the same snapshot
- A single P4 Xeon appears to be 2 logical processors
- Sharing the same execution resources
- Architectural state is duplicated in hardware
40. Multithreading (MT) Paradigms
41. More SMT Commercial Processors
- Intel Xeon HyperThreading
- Supports 2 replicated hardware contexts: PC (or IP) and architectural registers
- New directions of usage
- Helper (or assisted) threads (e.g. speculative precomputation)
- Speculative multithreading
- Clearwater (once called Xtream logic): 8-context SMT network processor designed by the DISC architect (company no longer exists)
- SUN: 4-SMT-processor CMP?
42. Speculative Multithreading
- SMT can justify a wider-than-ILP datapath
- But the datapath is only fully utilized by multiple threads
- How to speed up a single-threaded program by utilizing multiple threads? What to do with spare resources?
- Execute both sides of hard-to-predict branches
- Eager execution or polypath execution
- Dynamic predication
- Send another thread to scout ahead to warm up the caches and BTB
- Speculative precomputation
- Early branch resolution
- Speculatively execute future work
- Multiscalar or dynamic multithreading
- e.g. start several loop iterations concurrently as different threads; if a data dependence is detected, redo the work
- Run a dynamic compiler/optimizer on the side
- Dynamic verification
- DIVA or Slipstream Processor