Intel Core 2 Duo - PowerPoint PPT Presentation

Provided by: jx5c

Learn more at: https://www.cs.virginia.edu

1
Intel Core 2 Duo

CS 6354
by WeiKeng Qin, Jian Xiang, Ren Xu
December 8, 2009

2
Introduction

Motivation
A Multi-Core on our desks
A new microarchitecture to replace NetBurst
Intel Core 2 Duo
A dual-core CPU
ISA with SIMD Extension
Intel Core microarchitecture
Memory Hierarchy System

3
Instruction Set Architecture

Base x86-64
No VLIW (Itanium)
SIMD Extensions: MMX, SSE, SSE2, SSE3, SSSE3, SSE4.1

Wolfdale, SSE4.1, Sep 2006
Core 2, SSSE3, July 2006: e.g. permuting bytes in a word
Prescott, SSE3, 2004: DSP-oriented math, process management
Pentium 4, SSE2, 2001: double precision, 128-bit register support
Pentium III, SSE, 1999: 8 new registers, floating-point operations
Pentium MMX, 1996: 8 new registers, packed data types, integer operations

4
Streaming SIMD Extension (SSE) 4.1

Beginning with the 45 nm processors
47 instructions that improve performance of media
data manipulation
e.g. Fast and efficient bit width conversions
Convert single byte values to word (16-bit)
values.

5
SSE2 Code

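MOVQ XMM0, M64        ; load 8 source bytes into the low half of XMM0
PXOR XMM1, XMM1       ; XMM1 = 0
PUNPCKLBW XMM0, XMM1  ; interleave with zero bytes: 8 bytes become 8 words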

6
SSE4.1 Code

PMOVZXBW XMM0, M64
DEST[15:0] <- ZeroExtend(SRC[7:0])
DEST[31:16] <- ZeroExtend(SRC[15:8])
DEST[47:32] <- ZeroExtend(SRC[23:16])
DEST[63:48] <- ZeroExtend(SRC[31:24])
DEST[79:64] <- ZeroExtend(SRC[39:32])
DEST[95:80] <- ZeroExtend(SRC[47:40])
DEST[111:96] <- ZeroExtend(SRC[55:48])
DEST[127:112] <- ZeroExtend(SRC[63:56])
Benefits
Fewer instructions (3 -> 1)
Better performance (40% speedup per loop)
Reduced register pressure (2 -> 1 registers)
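
The equivalence of the two sequences above can be checked with a small software model. This is only a sketch: the function names are hypothetical, registers are modeled as Python lists of 16-bit words, and nothing here measures performance.

```python
def pmovzxbw(src_bytes):
    """Model of PMOVZXBW: zero-extend the low 8 bytes into eight 16-bit words."""
    if len(src_bytes) < 8:
        raise ValueError("need at least 8 source bytes")
    # DEST[16*i+15 : 16*i] <- ZeroExtend(SRC[8*i+7 : 8*i]) for i = 0..7
    return [b & 0xFF for b in src_bytes[:8]]


def punpcklbw_with_zero(src_bytes):
    """Model of the SSE2 sequence: interleave the low bytes with a zeroed register."""
    zero = bytes(8)  # stands in for PXOR XMM1, XMM1
    words = []
    for data_byte, zero_byte in zip(src_bytes[:8], zero):
        # Each word holds the data byte in its low half and a zero byte on top.
        words.append(data_byte | (zero_byte << 8))
    return words


def pack_words(words):
    """Pack eight 16-bit words into one 128-bit integer, word 0 in the lowest bits."""
    value = 0
    for i, word in enumerate(words):
        value |= (word & 0xFFFF) << (16 * i)
    return value
```

Both models return the same eight words for any 8-byte input, which is why the single SSE4.1 instruction can replace the three-instruction SSE2 sequence and frees the register that held zero.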

7
Microarchitecture

The Cores
Single die (107 mm²)
Two identical cores (L1 cache 64 KB x 2)
Shared 6 MB L2 cache
No Hyper-Threading, no L3 cache
Keeps the front-side bus
Larger L2 cache

8
Microarchitecture

14-stage Pipeline
4-wide decode
4-wide retire
Macro-fusion
Enhanced ALUs
Deeper Buffers

9
Another View

10
Decode Hardware

128-bit fetch bandwidth
18-entry IQ
Complex Decode: produces 1-4 micro-ops
Microcode sequencer

11
Macro-fusion

New Micro-op
Represent instruction pair as single micro-op
Enhanced ALUs
To execute new compare and jump (CMPJCC)
micro-op in one clock
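
As a rough illustration of the pairing described above, the sketch below walks an instruction list and merges each CMP that is immediately followed by a conditional jump into a single entry. The tuple representation, the function names, and the rule "any J* mnemonic except JMP is conditional" are simplifying assumptions; the real decoder has additional restrictions that are not modeled here.

```python
def is_conditional_jump(mnemonic):
    """Treat any J* mnemonic except the unconditional JMP as a Jcc (simplification)."""
    return mnemonic.startswith("J") and mnemonic != "JMP"


def fuse_compare_and_jump(instructions):
    """Return a micro-op list where each CMP followed by a Jcc becomes one entry."""
    micro_ops = []
    i = 0
    while i < len(instructions):
        mnemonic, operands = instructions[i]
        has_next = i + 1 < len(instructions)
        if mnemonic == "CMP" and has_next and is_conditional_jump(instructions[i + 1][0]):
            jcc, target = instructions[i + 1]
            micro_ops.append(("CMPJCC", f"{operands} ; {jcc} {target}"))
            i += 2  # both x86 instructions were consumed by one micro-op
        else:
            micro_ops.append((mnemonic, operands))
            i += 1
    return micro_ops
```

For a five-instruction loop body containing one CMP/JNE pair, the function returns four entries, so the pair occupies one slot instead of two in every later pipeline stage.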

12
Out of Order Execution

96-entry ROB
32-entry Reservation Station

13
Execution Units

6 dispatch ports (1 load, 2 store, 3 universal ports)
3 integer ALUs, 2 floating-point ALUs

14
Branch Predictor

Loop Detector
- Tracks the number of loop iterations for future reference
For every branch, the branch prediction unit (BPU) selects among:
- bimodal predictor
- global predictor
- loop detector
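
The idea behind the loop detector can be sketched in a few lines: remember how many times a loop branch was taken before it last fell through, and predict the exit at the same count next time. The class below is a minimal software model under that assumption; the names, the unbounded dictionaries, and the "predict taken when no history exists" default are illustrative choices, not the hardware design.

```python
class LoopDetector:
    def __init__(self):
        self.current = {}  # branch address -> taken count in the current loop run
        self.learned = {}  # branch address -> taken count seen before the last exit

    def predict(self, pc):
        """Return True for predicted taken, False for predicted not taken."""
        limit = self.learned.get(pc)
        if limit is None:
            return True  # no history yet: assume the loop branch is taken
        return self.current.get(pc, 0) < limit

    def update(self, pc, taken):
        """Record the actual outcome of the branch at address pc."""
        if taken:
            self.current[pc] = self.current.get(pc, 0) + 1
        else:
            # Loop exit: remember the trip count and start counting again.
            self.learned[pc] = self.current.get(pc, 0)
            self.current[pc] = 0


def run_loop(detector, pc, taken_count):
    """Replay one loop run (taken_count takens, then one exit); return mispredictions."""
    misses = 0
    for taken in [True] * taken_count + [False]:
        if detector.predict(pc) != taken:
            misses += 1
        detector.update(pc, taken)
    return misses
```

On the first run of a loop the exit is mispredicted once; on later runs with the same trip count the model predicts the exit correctly, which is the case a bimodal counter cannot capture.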

Cache Organization
Private L1 DCache and ICache: 32 KB/core, 8-way, 64 B line size, write-back (directory-based coherence)
Shared L2 cache: 8-way, 64 B line size (E8xxx)
Pros: potentially less bus traffic
Cons: longer access latency than a private L2 cache, potential conflicts between threads
FSB: 1333 MHz (E8xxx)
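
As a worked example of the L1 parameters listed above, the sketch below derives the number of sets and the address split for a 32 KB, 8-way cache with 64 B lines. It assumes the 32 KB figure applies to a single L1 cache and that all sizes are powers of two; the function names are hypothetical.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a set-associative cache."""
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1  # log2 for powers of two
    index_bits = sets.bit_length() - 1
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


sets, offset_bits, index_bits = cache_geometry(32 * 1024, 8, 64)
print(sets, offset_bits, index_bits)                    # 64 6 6
print(split_address(0x12345, offset_bits, index_bits))  # (18, 13, 5)
```

With 64 sets and 64 B lines, the low 12 address bits select the set and the byte within the line, and the remaining upper bits form the tag.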
Memory disambiguation
Aggressive memory dependence speculation based on a hash table indexed by the load's EIP address
Watchdog mechanism

Prediction Implementation
History table indexed by instruction pointer
Each entry in the history array has a saturating counter
Once the counter saturates, disambiguation is possible for this load (taking effect from the next iteration)
- the load is allowed to proceed even if it meets unknown store addresses
When a particular load fails disambiguation, reset its counter
Each time a particular load is correctly disambiguated, increment its counter

17
Predictor Lookup

When a load is sent from the RS, set its disambiguation bit
If it meets an older unknown store address, set "update"
If the prediction is "go", dispatch and set "done"
Else the load is blocked
A store scans all previous loads in the Load Buffer; if a match is found, the "reset" bit is set
When the load commits, update the history

Load Dispatch
Prediction Verification
18

Execute Disable Bit Support
Counterparts: AMD Enhanced Virus Protection, ARM eXecute Never
Helps prevent buffer overflow attacks
No need for software patches against buffer overflow attacks
Segregates memory by use: storage of either code or data
The processor disables code execution when malicious worms try to insert code into data buffers (with OS support)

Instruction Pointer Based Prefetcher
L1 DCache: 2 IP prefetchers/core
L1 ICache: 1 traditional prefetcher
L2 Cache: 2 IP prefetchers
Predicts which memory addresses will be used and delivers the data in time
Records every load's history, indexed by instruction pointer
IP history array
Parameters for prefetch traffic control
Fine-tuned for different platforms
Prefetch monitor
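
One common way to use a per-load history is stride detection: if the same load instruction touches addresses a constant distance apart, fetch the next one early. The sketch below assumes that scheme; the class name, the table size, and the "two equal strides in a row" trigger are illustrative and are not taken from the slides.

```python
class IPStridePrefetcher:
    def __init__(self, entries=256):
        self.entries = entries  # assumed size of the IP history array
        self.table = {}         # index -> (last_address, last_stride)

    def access(self, load_ip, address):
        """Record a load and return an address to prefetch, or None."""
        index = load_ip % self.entries
        previous = self.table.get(index)
        prefetch = None
        stride = 0
        if previous is not None:
            last_address, last_stride = previous
            stride = address - last_address
            # Prefetch only after the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prefetch = address + stride
        self.table[index] = (address, stride)
        return prefetch
```

For a load that walks an array in 64-byte steps, the first two accesses only train the entry and the third access returns the address of the next line.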


22
Questions?
