Title: Intel Core 2 Duo
1Intel Core 2 Duo
- CS 6354
- by WeiKeng Qin, Jian Xiang, Ren Xu
- December 8, 2009
2Introduction
- Motivation
- A Multi-Core on our desks
- A new microarchitecture to replace Netburst
- Intel Core 2 Duo
- A dual-core CPU
- ISA with SIMD Extension
- Intel Core microarchitecture
- Memory Hierarchy System
3Instruction Set Architecture
- Base X86-64
- No VLIW (Itanium)
- SIMD Extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1
- Wolfdale, SSE4.1, Sep 2006
- Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
- Prescott, SSE3, 2004: DSP-oriented math, process management
- Pentium 4, SSE2, 2001: double precision, 128-bit register support
- Pentium III, SSE, 1999: 8 new registers, floating-point operations
- Pentium MMX, 1996: 8 new registers, packed data type, integer operations
4Streaming SIMD Extension (SSE) 4.1
- Introduced with the 45 nm processors
- 47 instructions that improve performance of media data manipulation
- e.g. fast and efficient bit-width conversions
- Convert single-byte values to word (16-bit) values
5SSE2 Code
- MOVDQU XMM0, M64
- PXOR XMM1, XMM1
- PUNPCKLBW XMM0, XMM1
6SSE4.1 Code
- PMOVZXBW XMM0, M64
- DEST[15:0] <-- ZeroExtend(SRC[7:0])
- DEST[31:16] <-- ZeroExtend(SRC[15:8])
- DEST[47:32] <-- ZeroExtend(SRC[23:16])
- DEST[63:48] <-- ZeroExtend(SRC[31:24])
- DEST[79:64] <-- ZeroExtend(SRC[39:32])
- DEST[95:80] <-- ZeroExtend(SRC[47:40])
- DEST[111:96] <-- ZeroExtend(SRC[55:48])
- DEST[127:112] <-- ZeroExtend(SRC[63:56])
- Benefits
- Fewer instructions (3 → 1)
- Better performance (40% speedup per loop)
- Reduced register pressure (2 → 1)
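The equivalence of the two code sequences can be checked with a small software model. This is only a sketch: registers are modeled as 16-byte little-endian byte strings, the helper names (load64, pxor, punpcklbw, pmovzxbw, sse2_widen) are made up for illustration, and the load is assumed to bring in just the 8 source bytes.

```python
import struct

def load64(mem):
    # Assumed load: 8 source bytes into the low half, upper half zeroed.
    return bytes(mem[:8]) + bytes(8)

def pxor(a, b):
    # Bytewise XOR of two 16-byte registers.
    return bytes(x ^ y for x, y in zip(a, b))

def punpcklbw(dest, src):
    # Interleave the low 8 bytes: dest0, src0, dest1, src1, ...
    out = bytearray()
    for d, s in zip(dest[:8], src[:8]):
        out += bytes([d, s])
    return bytes(out)

def sse2_widen(mem):
    # Three-instruction SSE2 sequence from the slide.
    xmm0 = load64(mem)
    xmm1 = bytes(16)
    xmm1 = pxor(xmm1, xmm1)          # XMM1 becomes all zeros
    return punpcklbw(xmm0, xmm1)     # each byte is followed by a zero byte

def pmovzxbw(mem):
    # Single SSE4.1 instruction: zero-extend 8 bytes to 8 words.
    out = bytearray()
    for b in mem[:8]:
        out += b.to_bytes(2, "little")
    return bytes(out)

mem = bytes([1, 2, 0xFF, 0x80, 5, 6, 7, 8])
words = struct.unpack("<8H", pmovzxbw(mem))
```

Both paths produce the same 128-bit result; the SSE4.1 path needs no zeroed scratch register, which is where the reduced register pressure comes from.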
7Microarchitecture
- The Cores
- Single die (107 mm²)
- Two identical cores (64 KB L1 cache each)
- Shared 6 MB L2 cache
- No Hyper-Threading, no L3 cache
- Keeps the front-side bus
- Larger L2 cache
8Microarchitecture
- 14-stage Pipeline
- 4-wide decode
- 4-wide retire
- Macro-fusion
- Enhanced ALUs
- Deeper Buffers
9Another View
10Decode Hardware
- 128-bit fetch bandwidth
- 18-entry instruction queue (IQ)
- Complex decode: produces 1-4 micro-ops
- Microcode sequencer
11Macro-fusion
- New Micro-op
- Represent instruction pair as single micro-op
- Enhanced ALUs
- To execute the new compare-and-jump (CMPJCC) micro-op in one clock
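The pairing step can be pictured as a pass over the decoded instruction stream. The sketch below is a simplification under stated assumptions: instructions are (mnemonic, operands) tuples, the function name macro_fuse is hypothetical, and the sets of fusable compares and jumps are illustrative rather than the exact hardware rules.

```python
# Illustrative sets only; real fusion rules have more conditions.
COMPARE_OPS = {"cmp", "test"}
FUSABLE_JUMPS = {"je", "jne", "ja", "jae", "jb", "jbe"}

def macro_fuse(instructions):
    """Replace each adjacent compare + conditional-jump pair with one fused op."""
    fused = []
    i = 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if op[0] in COMPARE_OPS and nxt is not None and nxt[0] in FUSABLE_JUMPS:
            # One micro-op stands for the whole pair.
            fused.append(("cmpjcc", op, nxt))
            i += 2
        else:
            fused.append(op)
            i += 1
    return fused

program = [
    ("mov", "eax, 1"),
    ("cmp", "eax, ebx"),
    ("jne", "loop"),
    ("add", "eax, 2"),
]
fused_program = macro_fuse(program)  # 4 instructions become 3 entries
```

Fewer entries flow through the rest of the pipeline, which is the point of representing the pair as a single micro-op.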
12Out of Order Execution
- 96-entry reorder buffer (ROB)
- 32-entry reservation station (RS)
13Execution Units
- 6 dispatch ports (1 load, 2 store, 3 universal ports)
- 3 integer ALUs, 2 floating-point ALUs
14Branch Predictor
- Loop detector
- Tracks the number of loop iterations for future reference
- The branch prediction unit (BPU) selects among the following for every branch:
- Bimodal predictor
- Global predictor
- Loop detector
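The loop detector idea (remember how many times a loop branch was taken, then predict the exit on the next run) can be sketched as follows. The class name, the per-branch dictionaries, and the unbounded storage are assumptions for illustration; the slides do not describe the real table layout, and the selection among the three predictors is not modeled here.

```python
class LoopDetector:
    """Toy loop detector: learns a loop branch's trip count from one full run."""

    def __init__(self):
        self.trip_count = {}  # branch IP -> taken outcomes seen before the last exit
        self.current = {}     # branch IP -> taken outcomes in the current run

    def predict(self, ip):
        """Return True (taken), False (not taken), or None if nothing is learned."""
        if ip not in self.trip_count:
            return None
        return self.current.get(ip, 0) < self.trip_count[ip]

    def update(self, ip, taken):
        if taken:
            self.current[ip] = self.current.get(ip, 0) + 1
        else:
            # Loop exit: remember the iteration count for future reference.
            self.trip_count[ip] = self.current.get(ip, 0)
            self.current[ip] = 0

detector = LoopDetector()
for outcome in [True, True, True, False]:   # first run trains the detector
    detector.update(0x10, outcome)
```

After one training run, a loop with the same trip count is predicted correctly on every iteration, including the final not-taken exit that a simple bimodal counter would typically miss.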
15Cache Organization
- Private L1 DCache and ICache: 32 KB each per core, 8-way, 64 B line size, write-back (directory-based coherence)
- Shared L2 cache: 8-way, 64 B line size (E8xxx)
- Pros: potentially less bus traffic
- Cons: longer access latency than a private L2 cache; potential conflicts between threads
- FSB: 1333 MHz (E8xxx)
- Memory disambiguation
- Aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
- Watchdog mechanism
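The L1 geometry quoted on this slide (32 KB, 8-way, 64 B lines) fixes how an address splits into tag, set index, and line offset. The helper names below are hypothetical, and the sketch assumes power-of-two sizes and a plain modulo set index.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (number of sets, offset bits, index bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits

def split_address(addr, size_bytes, ways, line_bytes):
    """Split an address into (tag, set index, line offset)."""
    sets, offset_bits, index_bits = cache_geometry(size_bytes, ways, line_bytes)
    offset = addr & (line_bytes - 1)
    index = (addr >> offset_bits) & (sets - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset

# L1 data cache from the slide: 32 KB, 8-way, 64 B lines.
print(cache_geometry(32 * 1024, 8, 64))          # (64, 6, 6)
print(split_address(0x12345, 32 * 1024, 8, 64))  # (18, 13, 5)
```

With 64 sets and 64 B lines, the index and offset together cover 12 address bits.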
16Prediction Implementation
- History table indexed by instruction pointer
- Each entry in the history array has a saturating counter
- Once the counter saturates, disambiguation is possible on this load (takes effect from the next iteration): the load is allowed to go even if it meets unknown store addresses
- When a particular load fails disambiguation, reset its counter
- Each time a particular load is correctly disambiguated, increment its counter
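The counter scheme above can be sketched in a few lines. The table size (256 entries), the counter ceiling (15), and the modulo hash are assumptions chosen for illustration; the slides do not give the real values.

```python
class DisambiguationPredictor:
    """Toy history table of saturating counters indexed by load IP."""

    def __init__(self, entries=256, max_count=15):
        self.entries = entries      # assumed table size
        self.max_count = max_count  # assumed saturation value
        self.counters = [0] * entries

    def _index(self, ip):
        # Assumed hash: low bits of the instruction pointer.
        return ip % self.entries

    def may_bypass(self, ip):
        """A load may pass unknown store addresses only once its counter saturates."""
        return self.counters[self._index(ip)] == self.max_count

    def update(self, ip, conflicted):
        i = self._index(ip)
        if conflicted:
            self.counters[i] = 0  # failed disambiguation: reset
        else:
            self.counters[i] = min(self.counters[i] + 1, self.max_count)

predictor = DisambiguationPredictor()
for _ in range(15):
    predictor.update(0x401000, conflicted=False)
# predictor.may_bypass(0x401000) is now True; one conflict resets it to False.
```

Resetting to zero on a single failure makes the predictor conservative: a load has to behave well many times in a row before it is allowed to speculate again.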
17Predictor Lookup
Load dispatch
- When a load is sent from the RS, set its disambiguation bit
- If it meets an older unknown store address, set the "update" bit
- If the prediction is "go", dispatch the load and set "done"
- Otherwise the load is blocked
Prediction verification
- A store scans all previous loads in the load buffer; if a match is found, the "reset" bit is set
- When the load commits, update the history
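The dispatch and verification steps on this slide can be traced with a small model of the per-load bits. Everything here is a sketch under assumptions: the saturation value, the dictionary of counters keyed by IP, and the function names are illustrative, and the pipeline flush that a real mispredicted load would need is not modeled.

```python
from dataclasses import dataclass

SATURATED = 15  # assumed counter ceiling

@dataclass
class Load:
    ip: int
    address: int
    update: bool = False  # met an older unknown store address
    done: bool = False    # dispatched
    reset: bool = False   # a store to the same address was found

def dispatch(load, counters, older_store_unknown):
    """Return True if the load dispatches now, False if it is blocked."""
    if older_store_unknown:
        load.update = True
        if counters.get(load.ip, 0) < SATURATED:
            return False  # prediction is not "go": wait for the store address
    load.done = True
    return True

def store_resolved(store_address, load_buffer):
    """When a store address becomes known, mark loads that match it."""
    for load in load_buffer:
        if load.address == store_address:
            load.reset = True

def commit(load, counters):
    """Update the history when the load commits."""
    if not load.update:
        return
    if load.reset:
        counters[load.ip] = 0
    else:
        counters[load.ip] = min(counters.get(load.ip, 0) + 1, SATURATED)

counters = {}
ld = Load(ip=0x40, address=0x1000)
blocked = not dispatch(ld, counters, older_store_unknown=True)
store_resolved(0x2000, [ld])   # store went to a different address
commit(ld, counters)           # the counter for this load moves toward "go"
```

Note that a blocked load still trains the counter at commit, so a load that never actually conflicts eventually earns a "go" prediction.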
18Execute Disable Bit Support
- Comparable features: AMD Enhanced Virus Protection, ARM eXecute Never
- Helps prevent buffer overflow attacks
- No need for software patches against buffer overflow attacks
- Segregates memory into regions that store either code or data
- The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)
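The effect of the bit can be illustrated with a toy page table in which each page carries an execute-disable flag. The class and method names are hypothetical, the 4 KB page size is an assumption, and real hardware raises a page fault handled by the OS rather than a Python exception.

```python
PAGE_SIZE = 4096  # assumed page size

class ExecuteDisableFault(Exception):
    """Raised when an instruction fetch targets a page marked no-execute."""

class PageTable:
    def __init__(self):
        self.execute_disable = {}  # page number -> execute-disable flag

    def map_page(self, address, execute_disable):
        self.execute_disable[address // PAGE_SIZE] = execute_disable

    def fetch(self, address):
        """Model an instruction fetch; data pages refuse to supply code."""
        page = address // PAGE_SIZE
        if self.execute_disable[page]:
            raise ExecuteDisableFault(hex(address))
        return address

table = PageTable()
table.map_page(0x1000, execute_disable=False)  # code page
table.map_page(0x2000, execute_disable=True)   # data page (stack, heap, buffers)
table.fetch(0x1004)                            # allowed
```

Code injected into a data buffer sits on a page like the second one, so a jump into it faults instead of executing.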
19Instruction Pointer Based Prefetcher
- L1 DCache: 2 IP prefetchers per core
- L1 ICache: 1 traditional prefetcher
- L2 cache: 2 IP prefetchers
- Predict which memory addresses will be used and deliver the data in time
- Record every load's history using the instruction pointer
- IP history array
- Parameters for prefetch traffic control, fine-tuned for different platforms
- Prefetch monitor
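One common way to use a per-load history indexed by instruction pointer is stride detection, sketched below. The slides only mention an IP history array, so the stride rule, the table size, and the class name are assumptions for illustration, not a description of Intel's implementation.

```python
class IPStridePrefetcher:
    """Toy prefetcher: per-IP history of the last address and last stride."""

    def __init__(self, entries=256):
        self.entries = entries  # assumed history array size
        self.table = {}         # index -> (last address, last stride)

    def access(self, ip, address):
        """Record a load and return an address to prefetch, or None."""
        idx = ip % self.entries
        prefetch = None
        if idx in self.table:
            last_addr, last_stride = self.table[idx]
            stride = address - last_addr
            # Prefetch only after the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prefetch = address + stride
            self.table[idx] = (address, stride)
        else:
            self.table[idx] = (address, 0)
        return prefetch

prefetcher = IPStridePrefetcher()
hints = [prefetcher.access(0x400, a) for a in (0x1000, 0x1040, 0x1080, 0x10C0)]
# hints == [None, None, 0x10C0, 0x1100]
```

Requiring a repeated stride before issuing a prefetch is a simple form of the traffic control the slide mentions: loads with irregular addresses generate no extra bus traffic.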
22Questions?