Title: Pentium Architecture Studying
Member / Topic
- Jian-Jang Chen: Micro-architecture / Pipeline
- I-Hwei Yen: Instruction Set Architecture
- Hung-Jen Huang: Cache / Memory Architecture
- Min Li: Code Optimization (Branch prediction)
Micro-Architecture/Pipeline
Pentium Architecture
Micro-Architecture/Pipeline
[Figure: fetch/decode block diagram. Blocks shown, starting from the bus interface unit: Icache, Next_IP, branch target buffer, instruction decoder (x3), microcode instruction sequencer, register alias table / allocate, and the path to the instruction pool (reorder buffer).]
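[Figure: dispatch/execute block diagram. The reservation station (R.S.) connects to / from the instruction pool (reorder buffer) and reaches the execution units through its ports.]
- Port 0: MMX Exe Unit, FP Exe Unit, Int. Exe. Unit
- Port 1: MMX Exe Unit, Jmp Exe Unit, Int. Exe. Unit
- Port 2: Load Unit (loads)
- Port 3,4: Store Unit (stores)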
[Figure: retirement block diagram. Blocks shown: reservation station, memory interface unit (to / from Dcache), retirement register file, and the instruction pool (from / to).]
- An operand in an IA instruction can be located in the instruction itself, a register, a memory location, or an I/O port. Operands are classified as follows:
- Immediate Operands
- Register Operands
- Memory Operands
- I/O Port Addressing
Memory Operands
- Operands in memory are referenced by a segment selector and an offset
- Segment selector: 16 bits (bits 15 to 0)
- Offset: 32 bits (bits 31 to 0)
- The segment selector can be specified implicitly or explicitly
- Rules
- The offset can be any combination of the factors below:
- Offset = Base + (Index * Scale) + Displacement
- Direct (static address): Offset = Displacement
- Indirect (dynamic): Offset = Base
- Examples
- (Index * Scale) + Displacement can address the elements of an array. The displacement locates the beginning of the array, the index selects the element to be fetched, and the scale accounts for the element size of different data types.
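The offset formula can be illustrated with a minimal sketch. The function name, the 32-bit wrap, and the array start address are assumptions made for illustration; the restriction of the scale to 1, 2, 4, or 8 follows the IA-32 addressing modes.

```python
def effective_offset(base=0, index=0, scale=1, displacement=0):
    """Offset = Base + (Index * Scale) + Displacement, wrapped to 32 bits."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (base + index * scale + displacement) & 0xFFFFFFFF


# Hypothetical array of 4-byte elements that starts at offset 0x1000.
ARRAY_START = 0x1000

# (Index * Scale) + Displacement: element 3 of the array.
offset_of_element_3 = effective_offset(index=3, scale=4, displacement=ARRAY_START)

# Direct (static) addressing: the offset is just the displacement.
direct = effective_offset(displacement=0x2000)

# Indirect (dynamic) addressing: the offset is just the base register value.
indirect = effective_offset(base=0x500)
```

Changing only `scale` is enough to walk arrays of 1-, 2-, 4-, or 8-byte elements with the same index register.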
- The general instruction format is as follows: optional prefixes, opcode, ModR/M, SIB, displacement, immediate
- L1 instruction/data: 16K, 4-way, 32 bytes/block
- L2: unified, 512K, 32 bytes/block
- Write buffer: 32 bytes, 4 in Pentium III
- L2 has a separate cache bus
- No partially filled cache lines
- Snoop ability for multiprocessors.
[Figure: cache/memory block diagram. Blocks shown: physical memory, system bus, bus interface unit, L2 cache with its cache bus, instruction TLB, fetch unit and L1 instruction cache, data TLB, data cache unit, write buffer.]
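The L1 geometry above (16K, 4-way, 32 bytes/block) fixes how many sets the cache has. The sketch below works that out and splits an address into tag, set index, and byte offset. It assumes the set is chosen by the low-order address bits just above the byte offset, which is the usual arrangement but is not stated on the slide.

```python
CACHE_BYTES = 16 * 1024   # L1 size from the slide
WAYS = 4                  # 4-way set associative
LINE_BYTES = 32           # 32 bytes per block

# 16384 / (4 * 32) = 128 sets
NUM_SETS = CACHE_BYTES // (WAYS * LINE_BYTES)


def split_address(addr):
    """Return (tag, set_index, byte_offset) for a physical address."""
    byte_offset = addr % LINE_BYTES
    set_index = (addr // LINE_BYTES) % NUM_SETS
    tag = addr // (LINE_BYTES * NUM_SETS)
    return tag, set_index, byte_offset


tag, set_index, byte_offset = split_address(0x12345)
```

With these numbers the byte offset takes 5 bits and the set index takes 7 bits, so two addresses that differ by a multiple of 4096 compete for the same set.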
- Allow any area of system memory to be cached in L1 or L2
- Allow the type of caching, i.e., the memory type, to be specified by a variety of flags and registers. Five types of memory are defined.
- UC (uncacheable): 1) in-order accesses 2) useful for memory-mapped I/O.
- WC (write combining): system memory locations are not cached; writes can be delayed and combined in the write buffer until the buffer is full or a serialization occurs.
- WT (write through): reads and writes are cached; all writes go through both a cache line and the system memory.
- WB (write back): all reads and writes occur in cache when possible.
- WP (write protected): writes cause corresponding cache lines on all processors on the bus to be invalidated
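The practical difference between WT and WB can be seen in a toy model that counts writes reaching system memory. This is only a sketch: it tracks a single cache line, assumes every write hits that line, and the class and function names are hypothetical.

```python
class OneLineCache:
    """Toy model of one cache line under a WT or WB policy."""

    def __init__(self, policy):
        assert policy in ("WT", "WB")
        self.policy = policy
        self.dirty = False
        self.memory_writes = 0

    def write(self):
        if self.policy == "WT":
            # Write through: the write updates the line and system memory.
            self.memory_writes += 1
        else:
            # Write back: only the cache line is updated for now.
            self.dirty = True

    def evict(self):
        if self.policy == "WB" and self.dirty:
            # The modified line is written back once when it leaves the cache.
            self.memory_writes += 1
            self.dirty = False


def memory_writes_for(policy, n_writes):
    cache = OneLineCache(policy)
    for _ in range(n_writes):
        cache.write()
    cache.evict()
    return cache.memory_writes
```

Ten writes to the same line cost ten memory writes under WT and one under WB, which is why WB is preferred when the memory behind it has no side effects.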
- MESI maintains consistency between the caches of different processors.
- The L1 instruction cache only has SI control, because it is not writable.
- Each cache line can be in one of the following four states: Modified, Exclusive, Shared, Invalid
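The four MESI states can be illustrated with a textbook transition function. This is a simplified sketch, not the exact P6 protocol: the event names and the `others_have_copy` parameter are assumptions, and bus transactions and write-backs are reduced to comments.

```python
def next_state(state, event, others_have_copy=False):
    """Textbook MESI transition for one cache line.

    state: "M", "E", "S", or "I"
    event: "local_read", "local_write", "snoop_read", "snoop_write"
    """
    if event == "local_read":
        if state == "I":
            # Line fill: Shared if another cache holds it, else Exclusive.
            return "S" if others_have_copy else "E"
        return state
    if event == "local_write":
        # Any local write leaves the line Modified (other copies are invalidated).
        return "M"
    if event == "snoop_read":
        # Another processor reads: a valid line drops to Shared
        # (a Modified line is written back first).
        return "I" if state == "I" else "S"
    if event == "snoop_write":
        # Another processor writes: the local copy becomes Invalid.
        return "I"
    raise ValueError(f"unknown event: {event}")
```

An instruction cache that is never written only ever sees the S and I rows of this table, which matches the SI-only control mentioned above.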
- Two levels of cache control: global and page.
- Control register CR0 flag CD turns caching of the whole system memory on/off (L2, L1)
- Control flag NW in CR0 controls the write policy of the whole system memory.
- Each page table or page directory entry has two similar flags to control caching at page level: 1) PCD: enable/disable caching 2) PWT: clear for WB, set for WT
- Global page: page entries stay resident in the TLB unless a special operation occurs.
- Precedence: the global flag overrules the page-level caching control flags
- Precedence: uncaching is selected when a conflict occurs
- Precedence: WC takes precedence over WT, which takes precedence over WB
- Invalidate: some instructions can invalidate the cache when caching is disabled
- TLB or write buffer may be drained or invalidated under special operations.
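The precedence rules above can be sketched as two small functions. This is a simplified reading of the slide, not the full hardware behavior: the NW flag and the MTRRs are left out, the function names are hypothetical, and a disabled cache is simply reported as "UC".

```python
def page_caching_policy(cr0_cd, pcd, pwt):
    """Effective policy for one page from the global and page-level flags."""
    if cr0_cd:
        return "UC"   # global flag overrules the page-level flags
    if pcd:
        return "UC"   # caching disabled for this page
    return "WT" if pwt else "WB"   # PWT set: write through, clear: write back


# Strongest first: uncaching wins, then WC over WT over WB.
PRECEDENCE = ["UC", "WC", "WT", "WB"]


def resolve(type_a, type_b):
    """Pick the memory type that wins when two settings conflict."""
    return min(type_a, type_b, key=PRECEDENCE.index)
```

For example, `resolve(page_caching_policy(False, False, True), "WB")` yields "WT", since the page asked for write through and WT takes precedence over WB.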
- MTRR (memory type range register): associates a memory type with a physical address range
- Allows 96 memory ranges to be defined in physical memory.
- In a multiprocessor system, different processors must use identical MTRR maps.
- In general, the BIOS configures these MTRRs, and the operating system remaps them.
- The MTRRcap register records: 1) the number of variable ranges that can be implemented 2) whether fixed ranges are supported 3) whether WC is supported
- The MTRRdefType register is used to: 1) define the default memory type 2) turn MTRRs on/off 3) enable/disable fixed-range MTRRs
- If fixed-range MTRRs are enabled, they take priority over variable-range MTRRs.
- 11 fixed-range registers, each in charge of the memory types of 8 fixed memory ranges.
- Allows a maximum of 8 variable ranges to be defined by 16 MTRRs.
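The lookup order described above (fixed ranges first, then variable ranges, then the default type) can be sketched as follows. The data layout, function name, and example ranges are hypothetical. The base/mask comparison mirrors how a variable-range register pair matches an address, and overlapping variable ranges are not handled.

```python
def mtrr_memory_type(addr, fixed_ranges, variable_ranges, default_type,
                     fixed_enabled=True):
    """Return the memory type for a physical address.

    fixed_ranges:    list of (start, end, type), end exclusive
    variable_ranges: list of (base, mask, type)
    """
    if fixed_enabled:
        for start, end, mem_type in fixed_ranges:
            if start <= addr < end:
                return mem_type          # fixed ranges take priority
    for base, mask, mem_type in variable_ranges:
        if (addr & mask) == (base & mask):
            return mem_type
    return default_type                  # as set through MTRRdefType


# Hypothetical map: a UC window in low memory, the first 64 MB as WB.
FIXED = [(0xA0000, 0xC0000, "UC")]
VARIABLE = [(0x00000000, 0xFC000000, "WB")]

in_fixed_window = mtrr_memory_type(0xA8000, FIXED, VARIABLE, "UC")
in_variable_range = mtrr_memory_type(0x100000, FIXED, VARIABLE, "UC")
outside_all_ranges = mtrr_memory_type(0x08000000, FIXED, VARIABLE, "UC")
```

Address 0xA8000 falls inside both the fixed window and the variable range, and the fixed window decides its type.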
- If the instruction address is not in the BTB, execution is predicted to continue without branching (fall through)
- Predicted taken branches have a 1 clock delay
- The BTB stores a four-bit history of branch predictions
- The BTB pattern-matches on the direction of the last four branches to dynamically predict whether a branch will be taken
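A history-based predictor of this kind can be sketched with a toy model. The four-bit history per branch and the fall-through default come from the slide. The 2-bit saturating counters indexed by that history, the allocation policy, and all names are assumptions chosen to make the sketch concrete.

```python
class ToyBTB:
    """Toy branch target buffer with a 4-bit history per branch."""

    def __init__(self):
        # branch address -> [history, counters]; one 2-bit counter per pattern
        self.entries = {}

    def predict(self, addr):
        entry = self.entries.get(addr)
        if entry is None:
            return False                 # not in the BTB: predict fall through
        history, counters = entry
        return counters[history] >= 2    # counter in the upper half: taken

    def update(self, addr, taken):
        entry = self.entries.setdefault(addr, [0, [1] * 16])
        history, counters = entry
        if taken:
            counters[history] = min(3, counters[history] + 1)
        else:
            counters[history] = max(0, counters[history] - 1)
        # Shift the outcome into the history, keeping the last four branches.
        entry[0] = ((history << 1) | int(taken)) & 0xF


def run_pattern(pattern, repeats, addr=0x400):
    """Replay a repeating taken/not-taken pattern; return per-branch hits."""
    btb = ToyBTB()
    correct = []
    for _ in range(repeats):
        for taken in pattern:
            correct.append(btb.predict(addr) == taken)
            btb.update(addr, taken)
    return correct


# A loop branch that is taken three times and then falls through.
results = run_pattern([True, True, True, False], repeats=10)
```

After a short warm-up, each four-branch history maps to a single outcome for this pattern, so the toy predictor stops mispredicting, including on the loop exit.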
BRANCH PREDICTION OPTIMIZATION
- Optimize Branch Predictions in Code
- Reduce or eliminate branches
- Ensure that each CALL instruction has a matching RET instruction
- Do not intermingle data with instructions in a code segment
- Unroll all very short loops
- Write code to follow the static prediction algorithm
BRANCH PREDICTION OPTIMIZATION
- Static Prediction Algorithm
When branches don't have a history in the BTB:
- Predicts unconditional branches to be taken
- JMP
- Predicts backward conditional branches to be taken. This rule is suitable for loops
- loop <condition>
- Predicts forward conditional branches to be NOT taken
- if <condition>
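The three static rules fit in one small function. This is a sketch with hypothetical parameter names; a branch counts as backward when its target address is lower than its own address.

```python
def static_prediction(is_conditional, branch_addr, target_addr):
    """Return True when the static algorithm predicts the branch as taken."""
    if not is_conditional:
        return True                      # unconditional branch (e.g. JMP): taken
    # Backward conditional (loop): taken. Forward conditional (if): not taken.
    return target_addr < branch_addr
```

Code written to follow this algorithm puts the likely path on the fall-through side of a forward branch and closes loops with a backward branch.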
BRANCH PREDICTION OPTIMIZATION
- Eliminating and Reducing the Number of Branches
WHY
- Removing the possibility of branch mispredictions
- Reducing the number of BTB entries required
HOW
- Using replacement instructions instead of branch instructions
- SETcc
- CMOVcc or FCMOVcc
- Combine JNE (JGE, etc.) and MOV instructions into one
BRANCH PREDICTION OPTIMIZATION
1. X = (A < B) ? C1 : C2
Example
2.
Example
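The selection X = (A < B) ? C1 : C2 can be computed without a branch. The sketch below emulates, on 32-bit values, one well-known SETcc idiom (compare, SETGE, DEC, AND, ADD). Whether the slide showed this exact sequence is an assumption, and the function names are hypothetical.

```python
MASK32 = 0xFFFFFFFF


def select_with_branch(a, b, c1, c2):
    """Reference version: a conditional branch picks the result."""
    return c1 if a < b else c2


def select_branchless(a, b, c1, c2):
    """Emulate the SETcc sequence on a 32-bit register."""
    ebx = int(a >= b)            # CMP + SETGE: 1 when A >= B, else 0
    ebx = (ebx - 1) & MASK32     # DEC: 0 when A >= B, all ones when A < B
    ebx &= (c1 - c2) & MASK32    # AND (C1 - C2): difference kept only if A < B
    return (ebx + c2) & MASK32   # ADD C2: C1 when A < B, else C2
```

Both functions return the same value for constants that fit in 32 bits, but the second has no conditional jump, so it needs no BTB entry and cannot be mispredicted.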