ECE 4100/6100 Advanced Computer Architecture Lecture 12 P6 and NetBurst Microarchitecture

1
ECE 4100/6100 Advanced Computer Architecture
Lecture 12 P6 and NetBurst Microarchitecture
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
2
P6 System Architecture
[Diagram: host processor with P6 core, L1 cache (SRAM), and L2 cache (SRAM) on the back-side bus; front-side bus to the chipset; graphics processor with local frame buffer on AGP/PCI Express; system memory (DRAM); PCI, USB, and I/O off the chipset]
3
P6 Microarchitecture
[Diagram: chip boundary with the external bus and bus interface unit (Bus Cluster); Instruction Fetch Cluster with the Instruction Fetch Unit and BTB/BAC (control flow); Issue Cluster; Out-of-order Cluster with the Reservation Station, IEU/JEU ((restricted) data flow), ROB, and Retire RF; Memory Cluster]
4
Pentium III Die Map
  • EBL/BBL - External/Backside Bus Logic
  • MOB - Memory Order Buffer
  • Packed FPU - Floating Point Unit for SSE
  • IEU - Integer Execution Unit
  • FAU - Floating Point Arithmetic Unit
  • MIU - Memory Interface Unit
  • DCU - Data Cache Unit (L1)
  • PMH - Page Miss Handler
  • DTLB - Data TLB
  • BAC - Branch Address Calculator
  • RAT - Register Alias Table
  • SIMD - Packed Floating Point Unit
  • RS - Reservation Station
  • BTB - Branch Target Buffer
  • TAP - Test Access Port
  • IFU - Instruction Fetch Unit and L1 I-Cache
  • ID - Instruction Decode
  • ROB - Reorder Buffer
  • MS - Micro-instruction Sequencer

5
P6 Basics
  • One implementation of the IA32 architecture
  • Deeply pipelined processor
  • In-order front-end and back-end
  • Dynamic execution engine (restricted dataflow)
  • Speculative execution
  • P6 microarchitecture family processors include
  • Pentium Pro
  • Pentium II (PPro + MMX + 2x caches)
  • Pentium III (P-II + SSE, enhanced MMX, e.g. PSAD)
  • Pentium 4 (Not P6, will be discussed separately)
  • Pentium M (SSE2, SSE3, µop fusion)
  • Core (PM + SSSE3, SSE4, Intel 64 (EM64T), MacroOp fusion, 4-µop retirement rate vs. 3 in previous generations)

6
P6 Pipelining
7
Instruction Fetching Unit
[Diagram: the Next PC mux produces a linear address into the instruction TLB and instruction cache (backed by a victim cache and streaming buffer, with other fetch requests through a select mux); the ILD adds length marks and the Branch Target Buffer adds prediction marks; the instruction rotator fills the instruction buffer, which drains by the bytes consumed by ID]
  • IFU1: Initiate fetch, requesting 16 bytes at a time
  • IFU2: Instruction length decoder marks instruction boundaries; BTB makes a prediction (2 cycles)
  • IFU3: Align instructions to the 3 decoders in 4-1-1 format

8
Static Branch Prediction (stage 17, Br. Dec, of the pipeline on slide 6)
The flowchart reduces to this decision procedure:
  • BTB miss?
    • No → use the BTB dynamic predictor's decision
    • Yes → Unconditional PC-relative?
      • Yes → Taken
      • No → PC-relative?
        • No → Return?
          • Yes → Taken
          • No (indirect jump) → Taken
        • Yes → Conditional?
          • No → Taken
          • Yes → Backwards?
            • Yes → Taken
            • No → Not Taken
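The static prediction rules can be sketched as a small function. This is an illustrative model of the decision tree, not Intel's exact BAC logic; the flag arguments are assumptions standing in for attributes the hardware decodes from the instruction bytes:

```python
# Hypothetical model of P6 static branch prediction on a BTB miss.
# In real hardware the Branch Address Calculator (BAC) decodes these
# attributes from the instruction itself.

def static_predict(pc_relative, conditional, is_return, target=None, pc=None):
    """Return 'taken' or 'not-taken' for a branch that missed the BTB."""
    if is_return:
        return "taken"            # returns predicted taken
    if not pc_relative:
        return "taken"            # indirect jumps predicted taken
    if not conditional:
        return "taken"            # unconditional PC-relative: taken
    # Conditional PC-relative: backward branches (loop closers) taken,
    # forward branches not taken
    return "taken" if target < pc else "not-taken"
```

Backward conditional branches are assumed taken because they usually close loops, which is the rationale the flowchart encodes.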
9
Dynamic Branch Prediction
[Diagram: 512-entry, 4-way BTB (ways W0–W3)]
  • Similar to a 2-level PAs design
  • Branch history associated with each BTB entry
  • With a 16-entry Return Stack Buffer
  • 4 branch predictions per cycle (due to the 16-byte fetch per cycle)
  • Speculative update (2 copies of the BHR)
  • Static prediction provided by the Branch Address Calculator when the BTB misses (see prior slide)
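A minimal sketch of the 2-level PAs idea: a per-branch history register (here kept in a dictionary, standing in for the history stored with each BTB entry) indexes a table of 2-bit counters, and the history is updated speculatively at prediction time. Sizes and indexing are illustrative, not P6's actual dimensions:

```python
# Sketch of a 2-level PAs-style predictor with speculative history update.
HIST_BITS = 4

class PAsPredictor:
    def __init__(self):
        self.bhr = {}                      # per-branch history register
        self.pht = [2] * (1 << HIST_BITS)  # 2-bit counters, weakly taken

    def predict(self, pc):
        hist = self.bhr.get(pc, 0)
        return self.pht[hist] >= 2         # counter MSB decides taken

    def update(self, pc, taken):
        hist = self.bhr.get(pc, 0)
        ctr = self.pht[hist]
        self.pht[hist] = min(3, ctr + 1) if taken else max(0, ctr - 1)
        # speculative-style update: shift the outcome in immediately
        self.bhr[pc] = ((hist << 1) | int(taken)) & ((1 << HIST_BITS) - 1)
```

After training on a regular pattern (e.g. alternating taken/not-taken), the per-branch history lets the predictor follow the pattern, which a simple 2-bit counter cannot.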

10
X86 Instruction Decode
[Diagram: 16-byte instruction buffer feeding one complex decoder (1-4 µops), two simple decoders (1 µop each), and the micro-instruction sequencer (MS); decoded µops enter the instruction decoder queue (6 µops), then pass to RAT/ALLOC]

Decode throughput by alignment:

Next 3 inst   Inst to dec
S,S,S         3
S,S,C         First 2
S,C,S         First 1
S,C,C         First 1
C,S,S         3
C,S,C         First 2
C,C,S         First 1
C,C,C         First 1
  • 4-1-1 decoder
  • Decode rate depends on instruction alignment
  • DEC1: translate x86 instructions into micro-operations (µops)
  • DEC2: move decoded µops to the ID queue
  • MS performs translations either way:
  • Generate the entire µop sequence from the microcode ROM
  • Receive 4 µops from the complex decoder, and the rest from the microcode ROM
  • Instructions following the one that needs the MS are flushed

S = Simple, C = Complex
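The alignment table above follows from one rule: a complex instruction can only go to decoder 0, so a complex instruction in slot 1 or 2 stalls until the next cycle. A toy model, reproducing the table:

```python
# Toy model of P6 4-1-1 decode alignment: decoder 0 takes simple or
# complex x86 instructions; decoders 1 and 2 take simple ones only.

def decoded_this_cycle(window):
    """window: up to 3 chars, each 'S' (simple) or 'C' (complex).
    Returns how many of them decode this cycle."""
    n = 0
    for i, inst in enumerate(window[:3]):
        if inst == 'C' and i > 0:
            break                 # complex inst must wait for decoder 0
        n += 1
    return n
```

For example, `decoded_this_cycle("SSC")` returns 2: the trailing complex instruction waits so it can occupy decoder 0 next cycle.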
11
Register Alias Table (RAT)
Renaming Example
[Diagram: the integer RAT array maps each logical source to an RRF bit plus an array physical source (PSrc) — e.g. EAX → ROB 25, EBX → ROB 2, ECX → RRF ECX, EDX → ROB 15; int and FP overrides apply before the RAT PSrcs leave the in-order queue; FP TOS adjust feeds the FP RAT array; the allocator supplies physical ROB pointers; retired values live in the RRF]
  • Register renaming for 8 integer registers, 8 floating-point (stack) registers, and flags; 3 µops per cycle
  • 40 80-bit physical registers embedded in the ROB (hence 6 bits to specify a PSrc)
  • The RAT looks up physical ROB locations for renamed sources based on the RRF bit
  • Override logic handles dependent µops decoded in the same cycle
  • A misprediction reverts all pointers to the Retirement Register File (RRF)
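A minimal renaming sketch under these assumptions: each RAT entry points either at the RRF (value retired) or at a ROB entry, and ROB entries are handed out sequentially. Overrides for same-cycle dependences are approximated by renaming µops one at a time; class and field names are illustrative:

```python
# Minimal sketch of RAT renaming into ROB-embedded physical registers.

class RAT:
    def __init__(self):
        # entry: ('RRF', reg) if the value is retired, else ('ROB', index)
        self.table = {r: ('RRF', r) for r in
                      ['EAX', 'EBX', 'ECX', 'EDX', 'ESI', 'EDI', 'EBP', 'ESP']}
        self.next_rob = 0

    def rename(self, dst, srcs):
        """Rename one uop: look up sources, allocate a ROB slot for dst."""
        psrcs = [self.table[s] for s in srcs]   # physical sources
        pdst = ('ROB', self.next_rob % 40)      # 40 physical regs in the ROB
        self.next_rob += 1
        self.table[dst] = pdst                  # later readers see the ROB slot
        return pdst, psrcs
```

Renaming `EAX = EAX + EBX` and then `ECX = EAX` shows the effect: the second µop's source resolves to the first µop's ROB entry, not to the RRF.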

12
Partial Stalls due to RAT
  • Partial register stalls: occur when a write to a smaller (e.g. 8/16-bit) register is followed by a read of a larger (e.g. 32-bit) one
  • The read would need to combine partial pieces from multiple physical registers!
  • Partial flags stalls: occur when a subsequent instruction reads more flags than a prior unretired instruction touches

13
Partial Register Width Renaming
[Diagram: integer RAT array with separate low/high banks for partial registers, int and FP overrides, FP TOS adjust, and physical ROB pointers from the allocator; example sequence: µop0 MOV AL, (a); µop1 MOV AH, (b); µop2 ADD AL, (c); µop3 ADD AH, (d)]
  • 32/16-bit accesses
  • Read from the low bank (AL/BL/CL/DL, AX/BX/CX/DX, EAX/EBX/ECX/EDX/EDI/ESI/EBP/ESP)
  • Write to both banks (high bank: AH/BH/CH/DH)
  • 8-bit RAT accesses depend on which bank is being written and only update that particular bank

14
Allocator (ALLOC)
  • The interface between the in-order and out-of-order pipelines
  • Allocates into the ROB, MOB, and RS
  • 3-or-none µops per cycle into the ROB and RS
  • Must have 3 free ROB entries, or no allocation
  • All-or-none policy for the MOB
  • Stall allocation when not all the valid MOB µops can be allocated
  • Generates the physical destination token (Pdst) from the ROB and passes it to the Register Alias Table (RAT) and RS
  • Stalls upon shortage of resources

15
Reservation Stations (RS)
[Diagram: RS dispatch ports — ports 0 and 1 feed execution units whose results return on writeback buses 0 and 1; port 2 issues load addresses (LDA) to the MOB/DCU, which return loaded data; port 3 issues store addresses (STA); port 4 issues store data (STD); retired data flows out through the ROB]
  • Gateway to execution: binds up to 5 µops per cycle, at most one to each port
  • Port binding at dispatch time (certain µops can only be bound to one port)
  • 20-entry µop buffer bridging the in-order and out-of-order engines (32 entries in Core)
  • RS fields include the µop opcode, data valid bits, Pdst, Psrc, source data, BrPred, etc.
  • Oldest-first FIFO scheduling when multiple µops are ready in the same cycle
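The oldest-first policy can be sketched as follows: walking RS entries in allocation order and taking the first ready µop per port naturally selects the oldest. Entry fields and the port model are simplifications, not the actual RS format:

```python
# Sketch of oldest-first dispatch from a reservation station.

def dispatch(rs_entries, ports):
    """rs_entries: list of dicts in allocation (age) order, each with
    'port' (bound at allocation) and 'ready' (all sources valid).
    Returns {port: entry}, choosing the oldest ready uop per port."""
    chosen = {}
    for entry in rs_entries:          # allocation order = oldest first
        port = entry['port']
        if entry['ready'] and port in ports and port not in chosen:
            chosen[port] = entry
    return chosen
```

With two ready ALU µops bound to port 0, the older one wins; the younger waits for a later cycle even though it is ready.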

16
ReOrder Buffer (ROB)
  • A 40-entry circular buffer (96 entries in Core)
  • 157 bits wide
  • Provides the 40 aliased physical registers
  • Out-of-order completion
  • Deposits exceptions in each entry
  • Retirement (or de-allocation)
  • After resolving prior speculation
  • Handles exceptions through the MS
  • Clears OOO state when a mispredicted branch or exception is detected
  • 3 µops per cycle in program order
  • For multi-µop x86 instructions: none or all (atomic)

[Diagram: ROB entries; an exception deposited in an entry triggers a µcode assist at retirement]
17
Memory Execution Cluster
[Diagram: the RS/ROB issue LD and STA/STD µops into the memory cluster — load buffer, store buffer, DTLB, and DCU with fill buffers (FB), connected to the EBL. Example: movl ecx, edi / addl ecx, 8 / movl -4(edi), ebx / movl eax, 4(ecx) — the load address may alias the earlier store's address, but the RS cannot detect this and could dispatch them in the same cycle]
  • Manages data memory accesses
  • Address translation
  • Detects violations of access ordering
  • Fill buffers (FB) in the DCU, similar to MSHRs, for non-blocking cache support

18
Memory Order Buffer (MOB)
  • Allocated by ALLOC
  • A second-order RS for memory operations
  • 1 µop for a load; 2 µops for a store: Store Address (STA) and Store Data (STD)
  • MOB
  • 16-entry load buffer (LB) (32-entry in Core, 64 in Sandy Bridge)
  • 12-entry store address buffer (SAB) (20-entry in Core, 36 in Sandy Bridge)
  • SAB works in unison with
  • the store data buffer (SDB) in the MIU
  • the Physical Address Buffer (PAB) in the DCU
  • Store Buffer (SB) = SAB + SDB + PAB
  • Senior stores
  • Upon STD/STA retiring from the ROB
  • the SB marks the store senior
  • Senior stores are committed back to memory in program order when the bus is idle or the SB is full
  • Prefetch instructions in P-III
  • exhibit senior-load behavior
  • due to having no explicit architectural destination
  • New memory dependency predictor in Core to predict store-to-load dependencies

19
Store Coloring
x86 Instructions      µops         Store color
mov (0x1220), ebx     std ebx      2
                      sta 0x1220   2
mov (0x1110), eax     std eax      3
                      sta 0x1100   3
mov ecx, (0x1220)     ld 0x1220    3
mov edx, (0x1280)     ld 0x1280    3
mov (0x1400), edx     std edx      4
                      sta 0x1400   4
mov edx, (0x1380)     ld 0x1380    4
  • ALLOC assigns Store Buffer IDs (SBIDs) in program order
  • ALLOC tags each load with the most recent SBID (its store color)
  • Loads are checked against stores up to and including their color for potential address conflicts
  • The SDB forwards data if a conflict is detected
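A sketch of the store-coloring check: a load carries the SBID of the most recent prior store, and only stores at or before that color are candidates for a conflict; the youngest matching one forwards its data. This models full-address, full-width matches only, which is a simplification:

```python
# Sketch of a store-color conflict check with store-to-load forwarding.

def check_load(load_addr, load_color, store_buffer):
    """store_buffer: list of (sbid, addr, data) for allocated stores.
    Returns forwarded data, or None if no prior store conflicts."""
    best = None
    for sbid, addr, data in store_buffer:
        if sbid <= load_color and addr == load_addr:
            if best is None or sbid > best[0]:
                best = (sbid, data)   # youngest matching prior store wins
    return best[1] if best else None
```

Using the table above: the load of 0x1220 (color 3) hits the earlier std/sta pair with SBID 2 and gets its data forwarded, while the load of 0x1280 finds no match and goes to the cache.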

20
Memory Type Range Registers (MTRR)
  • Control registers written by the system (OS)
  • Supported memory types:
  • UnCacheable (UC)
  • Uncacheable Speculative Write-Combining (USWC or WC)
  • Uses a fill buffer entry as the WC buffer
  • WriteBack (WB)
  • Write-Through (WT)
  • Write-Protected (WP)
  • E.g. supports copy-on-write in UNIX, saving memory by letting child processes share pages with their parents; new pages are created only when a child process attempts a write
  • Page Miss Handler (PMH)
  • Looks up the MTRRs while supplying physical addresses
  • Returns memory types and physical addresses to the DTLB

21
Intel NetBurst Microarchitecture
  • Pentium 4's microarchitecture
  • Original target market: graphics workstations, but…
  • Design goals:
  • Performance, performance, performance, …
  • Unprecedented multimedia/floating-point performance
  • Streaming SIMD Extensions 2 (SSE2)
  • SSE3 introduced in Prescott Pentium 4 (90nm)
  • Reduced CPI
  • Low-latency instructions
  • High-bandwidth instruction fetching
  • Rapid execution of arithmetic/logic operations
  • Reduced clock period
  • New pipeline designed for scalability

22
Innovations Beyond P6
  • Hyperpipelined technology
  • Streaming SIMD Extensions 2
  • Hyper-threading Technology (HT)
  • Execution trace cache
  • Rapid execution engine
  • Staggered adder unit
  • Enhanced branch predictor
  • Indirect branch predictor (also in Banias Pentium
    M)
  • Load speculation and replay

23
Pentium 4 Fact Sheet
  • IA-32 fully backward compatible
  • Available at speeds ranging from 1.3 to 3.8 GHz
  • Hyperpipelined (20 stages)
  • 125 million transistors in Prescott (1.328 billion in the 16MB on-die L3 Tulsa, 65nm)
  • 0.18µm for 1.3-2GHz; 0.13µm for 1.8-3.4GHz; 90nm for 2.8-3.6GHz
  • Die size of 122mm2 (Prescott, 90nm), 435mm2 (Tulsa, 65nm)
  • Consumes 115 watts of power at 3.6GHz
  • 1066MHz system bus
  • Prescott L1: 16KB 8-way vs. previous P4s' 8KB 4-way
  • 1MB, 512KB, or 256KB 8-way full-speed on-die L2 (bandwidth example: 89.6 GB/s at 2.8GHz to L1)
  • 2MB L3 cache (in the P4 HT Extreme Edition, 0.13µm only), 16MB in Tulsa
  • 144 new 128-bit SIMD instructions (SSE2), and the 13 SSE3 instructions in Prescott
  • HyperThreading Technology (not in all versions)

24
Building Blocks of Netburst
[Diagram: NetBurst building blocks — front-end (fetch/decode, ETC, µROM, BTB/branch prediction with branch history update), out-of-order engine (OOO logic, retire), execution units (INT and FP units with the L1 data cache), memory subsystem (L2 cache), and the bus unit on the system bus]
25
Pentium 4 Microarchitecture (Prescott)
[Diagram: front-end BTB (4K entries) and I-TLB/prefetcher fed over the 64-bit, quad-pumped 800MHz (6.4 GB/s) system bus through the BIU; the IA32 decoder and µCode ROM fill the Execution Trace Cache (12K µops), which has its own trace cache BTB (2K entries); µop queue into the allocator/register renamer; INT/FP and memory µop queues feed the memory, fast, and slow/general FP schedulers; the INT register file/bypass network drives 2x double-pumped ALUs (simple instructions), a slow ALU (complex instructions), and AGUs for load/store addresses; the FP RF/bypass network drives FP move and FP/MMX/SSE/2/3 units; L1 data cache (16KB, 8-way, 64-byte lines, WT, 1 read + 1 write port) with a 256-bit path to the unified L2 (1MB, 8-way, 128B lines, WB, 108 GB/s)]
26
Pipeline Depth Evolution
27
Execution Trace Cache
  • Primary first-level I-cache, replacing a conventional L1
  • Decoding several x86 instructions at high frequency is difficult and takes several pipeline stages
  • Branch misprediction penalty is considerable
  • Advantages:
  • Caches post-decode µops (think of a fill unit)
  • High-bandwidth instruction fetching
  • Eliminates x86 decoding overheads
  • Reduces branch recovery time on a TC hit
  • Holds up to 12,000 µops
  • 6 µops per trace line
  • Many (?) trace lines in a single trace

28
Execution Trace Cache
  • Delivers 3 µops per cycle to the OOO engine when branch prediction is good
  • x86 instructions are read from L2 when the TC misses (7-cycle latency)
  • TC hit rate comparable to an 8KB to 16KB conventional I-cache
  • Simplified x86 decoder
  • Only one complex instruction decoded per cycle
  • Instructions of more than 4 µops are executed from the microcode ROM (P6's MS)
  • Branch prediction performed in the TC
  • 512-entry BTB + 16-entry RAS
  • Together with the BP in the x86 IFU, reduces mispredictions by 33% compared to P6
  • Intel did not disclose the details of the BP algorithms used in the TC and the x86 IFU (dynamic + static)

29
Out-Of-Order Engine
  • Similar design philosophy to P6; uses:
  • Allocator
  • Register Alias Table
  • 128 physical registers
  • 126-entry ReOrder Buffer
  • 48-entry load buffer
  • 24-entry store buffer

30
Register Renaming Schemes
[Diagram: P6 register renaming — a 40-entry ROB allocated sequentially, each entry holding data and status, retiring into the RRF]
31
Micro-op Scheduling
  • µop FIFO queues
  • Memory queue for loads and stores
  • Non-memory queue
  • µop schedulers
  • Several schedulers fire instructions from the 2 µop queues to execution (P6's RS)
  • 4 distinct dispatch ports
  • Maximum dispatch of 6 µops per cycle (2 each from the fast ALUs on ports 0 and 1 per cycle, plus 1 each from the load/store ports)

32
Data Memory Accesses
  • Prescott: 16KB 8-way L1; 1MB 8-way L2 (with a HW prefetcher), 128B lines
  • Load-to-use speculation
  • Dependent instructions dispatched before the load finishes
  • Due to the high frequency and deep pipeline:
  • the path from the load scheduler to execution is longer than the execution itself
  • The scheduler assumes loads always hit L1
  • If L1 misses, dependent instructions that left the scheduler temporarily receive incorrect data: a mis-speculation
  • Replay logic
  • Re-executes the load when mis-speculated
  • Mis-speculated operations are placed into a replay queue to be re-dispatched
  • All trailing independent instructions are allowed to proceed
  • Tornado breaker
  • Up to 4 outstanding load misses (= 4 fill buffers, as in the original P6)
  • Store-to-load forwarding buffer
  • 24 entries
  • Load and store must have the same starting physical address
  • Load data size ≤ store data size
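The load-hit speculation and replay above can be sketched as an event timeline: dependents dispatch assuming a hit; on a miss they are diverted into a replay queue and re-dispatched after the refill. The event names and sequencing are illustrative only, not P4's actual replay pipeline:

```python
# Sketch of load-hit speculation with replay on an L1 miss.

def run_load(l1_hits, dependents):
    """Returns the order in which uops produce valid results."""
    done, replay_queue = [], []
    done.append('load')                  # load dispatched optimistically
    for d in dependents:
        if l1_hits:
            done.append(d)               # consumed correct data
        else:
            replay_queue.append(d)       # got bogus data: must replay
    if not l1_hits:
        done.append('load-refill')       # miss serviced from L2
        done.extend(replay_queue)        # replay dependents in order
    return done
```

Note that only the load's dependents replay; trailing independent instructions (not modeled here) proceed normally, which is the point of the "tornado breaker".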

33
Fast Staggered ALU
  • For frequent ALU instructions (no multiply, shift, rotate, or branch processing)
  • Double-pumped clocks
  • Each operation finishes in 3 fast cycles:
  • lower-order 16 bits and bypass
  • higher-order 16 bits and bypass
  • ALU flags generation
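The staggered add can be sketched in software: the low 16 bits complete first, their carry feeds the high 16 bits, and the flags come last. The cycle assignment in the comments follows the three fast cycles listed above and is a model, not circuit-level truth:

```python
# Sketch of a staggered 32-bit add in 16-bit halves.

def staggered_add(a, b):
    lo = (a & 0xFFFF) + (b & 0xFFFF)                          # fast cycle 1
    carry = lo >> 16
    hi = ((a >> 16) & 0xFFFF) + ((b >> 16) & 0xFFFF) + carry  # fast cycle 2
    result = ((hi & 0xFFFF) << 16) | (lo & 0xFFFF)
    flags = {'zf': result == 0, 'cf': (hi >> 16) != 0}        # fast cycle 3
    return result, flags
```

The payoff is that a dependent add needing only the low half can start as soon as the low 16 bits (and their bypass) are done, i.e. every half fast cycle.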

34
Branch Predictor
  • P4 uses the same hybrid predictor as the Pentium M

[Diagram: hybrid predictor — bimodal, local, and global predictors; a mux selects the local prediction (Pred_L) over the bimodal one (Pred_B) on L_hit, and a second mux selects the global prediction (Pred_G) on G_hit]
35
Indirect Branch Predictor
  • In Pentium M and Prescott Pentium 4
  • Prediction based on global history

36
New Instructions over Pentium
  • CMOVcc / FCMOVcc r, r/m
  • Conditional move (predicated move) instructions
  • Based on the condition code (cc)
  • FCOMI/P: compare FP stack top and set integer flags
  • RDPMC/RDTSC instructions
  • PMCs: P6 has 2, NetBurst (P4) has 18
  • Uncacheable Speculative Write-Combining (USWC): weakly ordered memory type for graphics memory

37
New Instructions
  • SSE2 in Pentium 4 (not in the P6 microarchitecture)
  • Double-precision SIMD FP
  • SSSE3 in Core 2
  • Supplemental instructions for shuffle, align, add, subtract
  • Intel 64 (EM64T)
  • 64-bit support, new registers (8 more on top of 8)
  • In Celeron D, Core 2 (and P4 Prescott, Pentium D)
  • Almost compatible with AMD64
  • AMD's NX bit (Intel's XD bit) for preventing buffer overflow attacks

38
Streaming SIMD Extension 2
  • P-III SSE (Katmai New Instructions, KNI)
  • Eight 128-bit wide xmm registers (new architectural state)
  • Single-precision 128-bit SIMD FP
  • Four 32-bit FP operations in one instruction
  • Broken down into 2 µops for execution (only 80-bit data in the ROB)
  • 64-bit SIMD MMX (uses 8 mm registers mapped onto the FP stack)
  • Prefetch (nta, t0, t1, t2) and sfence
  • P4 SSE2 (Willamette New Instructions, WNI)
  • Supports double-precision 128-bit SIMD FP
  • Two 64-bit FP operations in one instruction
  • Throughput: 2 cycles for most SSE2 operations (exceptions: DIVPD and SQRTPD, 69 cycles, non-pipelined)
  • Enhanced 128-bit SIMD MMX using xmm registers

39
Examples of Using SSE
[Diagram: Packed SP FP operation (e.g. ADDPS xmm1, xmm2): each element Xi of xmm1 is combined with Yi of xmm2, leaving X3 op Y3 … X0 op Y0 in xmm1. Scalar SP FP operation (e.g. ADDSS xmm1, xmm2): only the low element is computed (X0 op Y0); X3…X1 pass through unchanged]
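The packed-vs-scalar distinction can be modeled on 4-element lists (index 0 being the low element, X0): packed ops apply elementwise, scalar ops touch only element 0 and pass the destination's upper elements through. This models the data movement only, not x86 encodings or FP semantics:

```python
# Model of packed (ADDPS-style) vs. scalar (ADDSS-style) SSE semantics.

def packed_op(x, y, op):
    """All four lanes computed elementwise."""
    return [op(a, b) for a, b in zip(x, y)]

def scalar_op(x, y, op):
    """Only the low lane computed; upper lanes of the destination kept."""
    return [op(x[0], y[0])] + x[1:]

add = lambda a, b: a + b
```

So `packed_op([1, 2, 3, 4], [10, 20, 30, 40], add)` updates every lane, while the scalar form changes only lane 0, exactly as the diagram shows the X3…X1 values flowing through.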
40
Examples of Using SSE and SSE2
[Diagram: SSE — packed (e.g. ADDPS xmm1, xmm2) and scalar (e.g. ADDSS xmm1, xmm2) SP FP operations as on the previous slide. SSE2 — Packed DP FP operation (e.g. ADDPD xmm1, xmm2): xmm1 becomes X1 op Y1, X0 op Y0. Scalar DP FP operation (e.g. ADDSD xmm1, xmm2): xmm1 becomes X1, X0 op Y0. Shuffle FP operation (e.g. SHUFPS xmm1, xmm2, imm8) and shuffle DP operation with a 2-bit immediate (e.g. SHUFPD xmm1, xmm2, imm2): xmm1 becomes a selection of Y1-or-Y0 in the high half and X1-or-X0 in the low half]
41
HyperThreading
  • Intel Xeon Processor and Intel Xeon MP Processor
  • Enables Simultaneous Multi-Threading (SMT)
  • Exploits ILP resources through TLP (Thread-Level Parallelism)
  • Issues and executes multiple threads at the same time
  • A single P4 with HT appears as 2 logical processors
  • They share the same execution resources
  • dTLB shared, with a logical processor ID
  • Some other shared resources are partitioned (next slide)
  • Architectural state and some microarchitectural state are duplicated:
  • IPs, iTLB, streaming buffer
  • Architectural register file
  • Return stack buffer
  • Branch history buffer
  • Register Alias Table

42
Multithreading (MT) Paradigms
43
HyperThreading Resource Partitioning
  • The TC (or µROM) is accessed in alternate cycles by each logical processor unless one is stalled on a TC miss
  • µop queue (split in half) after fetch from the TC
  • ROB (126/2)
  • LB (48/2)
  • SB (24/2) (32/2 for Prescott)
  • General µop queue and memory µop queue (1/2)
  • TLB (½?), as there is no PID
  • Retirement alternates between the 2 logical processors