Title: P6 and IA-64
1. P6 and IA-64
- The 8086 was released in 1978
- The Pentium was released in 1993
- The 8086 line was upgraded with pipelining, superscalar execution, higher clock frequencies, caches, and so on
- But the 8086 design has limits, and its efficiency is hard to improve further
- Intel released a new technology called P6
C. Vongchumyen 1 / 2004
Computer Organization and Assembly Language 8
2. P6
- Pentium's L2 cache problem
- 256 to 512 KB
- The Pentium reaches both the cache and main memory over the external bus
- This causes pipeline stalls when
- 1. the prefetch unit reads from the cache
- and 2. the execution unit reads data from main memory
- because both use the same bus
3. P6
- P6 moves the L2 cache into the same package as the CPU
- Pentium Pro and Pentium II: 512 KB cache on a die separate from the CPU
- (including the Pentium III in Slot 1)
- Celeron integrates a 128 KB cache on die
- Pentium III (Coppermine) integrates a 256 KB cache on die
- (28 million transistors on the Pentium III die)
- Xeon integrates a 2 MB cache on die
4. DIB
- DIB (Dual Independent Bus): FSB and BSB
- Cache bus (Back Side Bus): 64 bits
- 256 bits in Pentium III (Coppermine)
- BSB speed is higher than the mainboard's bus speed
- Pentium II (including the Pentium III in Slot 1)
- BSB speed = ½ CPU speed
- Pentium Pro, Celeron and Pentium III (Coppermine)
- BSB speed = CPU speed
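To see why the bus width and the BSB clock both matter, peak bandwidth can be estimated as bytes per transfer times the bus clock. The sketch below is only an illustration: it assumes one transfer per clock, a 64-bit FSB (the slides do not state the FSB width), and sample clock values picked for the example.

```python
def peak_bandwidth_mb_s(width_bits, clock_mhz):
    """Peak rate in MB/s, assuming one transfer per bus clock."""
    return (width_bits // 8) * clock_mhz

# 64-bit BSB at half of a 450 MHz CPU clock (Pentium II style)
bsb_half = peak_bandwidth_mb_s(64, 450 / 2)

# Assumption: 64-bit FSB running at 100 MHz
fsb = peak_bandwidth_mb_s(64, 100)

# 256-bit BSB at full CPU speed; 600 MHz is an example clock for Coppermine
bsb_coppermine = peak_bandwidth_mb_s(256, 600)

print(bsb_half, fsb, bsb_coppermine)
```

Even at half the CPU clock, the dedicated cache bus in this example moves more data per second than the mainboard bus.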
5. BSB and FSB
- A 128 KB cache has a hit rate > 90%
- The BSB frees the CPU by giving the cache a dedicated bus,
- so the CPU clock is independent of the mainboard clock
- (September 2000) CPU speed is 1.13 GHz,
- but the mainboard clock is 133 MHz
- FSB (Front Side Bus): the bus on the mainboard
- It interfaces the CPU with I/O and main memory
- FSB speeds: 66, 100 and 133 MHz
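A hit rate above 90% matters because most accesses then run at cache speed instead of main-memory speed. The sketch below uses a simple weighted-average model. The 10 ns and 60 ns latencies are hypothetical values chosen for illustration, not figures from these slides.

```python
def effective_access_ns(hit_rate, cache_ns, memory_ns):
    """Weighted average: hits cost cache_ns, misses cost memory_ns."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

CACHE_NS = 10.0    # hypothetical L2 access time
MEMORY_NS = 60.0   # hypothetical main memory access time

with_90 = effective_access_ns(0.90, CACHE_NS, MEMORY_NS)
with_50 = effective_access_ns(0.50, CACHE_NS, MEMORY_NS)

print(f"90% hit rate: {with_90:.1f} ns, 50% hit rate: {with_50:.1f} ns")
```

Under these assumed latencies, raising the hit rate from 50% to 90% cuts the average access time by more than half.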
6. P6 Architecture
- Separate caches
- L1 to L2 via the BSB
- L1 to memory via the FSB
- L1 cache: 32 KB
- - Instruction cache: 16 KB
- - Data cache: 16 KB
7. P6 Architecture
- Dynamic Execution Microarchitecture
- Fetch / Decode Unit
- Dispatch / Execute Unit
- Retire Unit
- Instruction Pool
- Dynamic Execution
- Multiple Branch Prediction
- Dynamic Data Flow Analysis
- Speculative Execution
8. P6 Architecture
- Multiple Branch Prediction
- Concept borrowed from mainframes
- Uses multiple pipelines for call or return instructions
- The Fetch/Decode unit is used to find branch instructions
- Dynamic Data Flow Analysis
- Analyzes instructions and searches for ones that can run out of order
- The Dispatch/Execute unit scans and sorts instructions
- to maximize usage of the execution units
9. P6 Architecture
- Speculative Execution
- The Dispatch/Execute unit is used to analyze instructions
- It executes instructions ahead of time and sends them to the instruction pool
- Results are kept in temporary registers
- The Retire unit finds instructions that were executed out of order (no branch pending),
- commits and confirms their results in the registers,
- then deletes them from the pool
- Together, these 3 techniques make the P6 a non-sequential CPU
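The interplay of data flow analysis, out-of-order execution, and in-order retirement can be sketched with a toy simulation. This is a minimal model, not the real P6 logic: it assumes unlimited execution units, no branches, and hypothetical instruction names and latencies.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: str       # register written
    srcs: tuple     # registers read
    latency: int    # cycles until the result is ready (hypothetical)

def simulate(program):
    """Return (execution start order, retirement order) for a toy pool."""
    # Data flow analysis: find which earlier instruction produces each source.
    last_writer, deps = {}, []
    for i, ins in enumerate(program):
        deps.append([last_writer[s] for s in ins.srcs if s in last_writer])
        last_writer[ins.dest] = i

    done_at = {}                      # index -> cycle its result is ready
    exec_order, retire_order = [], []
    next_retire, cycle = 0, 0
    never = float("inf")
    while next_retire < len(program):
        # Dispatch/Execute: start any instruction whose producers finished.
        for i, ins in enumerate(program):
            ready = all(done_at.get(d, never) <= cycle for d in deps[i])
            if i not in done_at and ready:
                done_at[i] = cycle + ins.latency
                exec_order.append(ins.name)
        # Retire: commit finished instructions strictly in program order.
        while (next_retire < len(program)
               and done_at.get(next_retire, never) <= cycle):
            retire_order.append(program[next_retire].name)
            next_retire += 1
        cycle += 1
    return exec_order, retire_order

program = [
    Instr("load", "r1", (), 3),            # slow memory read
    Instr("add", "r2", ("r1",), 1),        # must wait for the load
    Instr("mul", "r3", ("r4", "r5"), 1),   # independent, can run early
]
exec_order, retire_order = simulate(program)
print("executed:", exec_order)
print("retired: ", retire_order)
```

The independent `mul` starts before the `add` that precedes it in program order, yet results are still committed in the original order.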
10. Pentium Pro
11. P6 Architecture
- P6 is the next evolution of Intel's CPUs
- No more 80x86 core
- The P6 core is RISC
- All instructions are redesigned on the RISC core
- Backward compatible by mapping 80x86 instructions to RISC commands
- Improved branch prediction
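The mapping from 80x86 instructions to RISC commands can be pictured as a decoder that splits one complex instruction into several simple operations. The sketch below is a simplified illustration: the tuple format, the `tmp0`/`tmp1` register names, and the micro-op names are all hypothetical, and the real P6 decoder is considerably more detailed.

```python
def decode(instruction):
    """Split one x86-style (op, dst, src) tuple into simple micro-ops."""
    op, dst, src = instruction
    uops = []
    if src.startswith("["):
        # Memory source: load it into a temporary register first.
        uops.append(("load", "tmp0", src))
        src = "tmp0"
    if dst.startswith("["):
        # Memory destination: read, modify, then write back.
        uops.append(("load", "tmp1", dst))
        uops.append((op, "tmp1", src))
        uops.append(("store", dst, "tmp1"))
    else:
        # Register destination: a single micro-op is enough.
        uops.append((op, dst, src))
    return uops

reg_form = decode(("add", "eax", "ebx"))
mem_form = decode(("add", "[mem]", "eax"))
print(reg_form)
print(mem_form)
```

A register-to-register add stays a single operation, while an add to memory becomes a load, an add, and a store that the RISC-style core can schedule separately.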
12. P6
- Pentium Pro: the first P6 architecture CPU
- Short life cycle, only a few series of Pentium Pro
- Speeds: 150, 166, 180 and 200 MHz
- L1 cache (8 + 8 KB)
- L2 cache: 256 and 512 KB in the same package
- L2 cache: 1 MB at 200 MHz
Pentium Pro
Pentium Pro 1 MB L2 Cache
13. Pentium II
- Pentium II = Pentium Pro + MMX
- Speeds: 233, 266, 300 and 333 MHz
- Package: S.E.C.C. (Slot 1)
- FSB 66 MHz
- L1 cache 16 + 16 KB, L2 cache 512 KB
- FSB 100 MHz: speeds 350, 400 and 450 MHz
- With a 2 MB L2 cache it is named Pentium II Xeon (cache speed = CPU speed)
- Package: S.E.C.C. 2 (Slot 2)
14. Celeron
- Celeron = Pentium II, but with lower throughput (same core)
- Speeds: 266 and 300 MHz
- No L2 cache
- L1 cache 16 + 16 KB
- FSB 66 MHz
- L2 cache 128 KB (cache speed = CPU speed)
- Speeds: 300A, 333, 366, 400, 433, 466 and 500 MHz
- FSB 66 MHz
15. Celeron
- Package: PPGA (Plastic Pin Grid Array), 370 pins
- Package FC-PGA ( SSE )
- Change to 0.18 micron
- Core 1.5 VDC
16. SSE
- 3D speed upgraded by adding new instructions
- Streaming SIMD Extensions (SSE)
- Can bypass the L2 cache
- Processor Serial Number
- Pentium III
- L1 cache 16 + 16 KB
- L2 cache 512 KB
- (Coppermine cache 256 KB)
17. Pentium 4
- Within the P6 architecture
- Speed was upgraded from 150 MHz to 1.13 GHz
- Process technology changed from 0.5 to 0.25 and 0.13 micron
- VCC went from 3.3 to 2.2 and 1.5 V
- Pentium 4
- Same core as the Pentium III
- But many things have changed
18. Pentium 4
- 133 MHz bus raised to 200 MHz and 400 MHz DDR (Double Data Rate)
- Double clock speed in the integer ALU (< 1 clock / instruction)
- Adds an execution trace cache (keeps translated micro-ops)
- Upgraded pipeline and branch prediction from P6
- SSE Extension 2 (144 new instructions)
- 128-bit floating point
- Dynamic Execution enlarges the instruction pool
- from holding 40 micro-ops to 100 micro-ops
- Execution Trace Cache + Dynamic Execution
- All loops work inside the instruction pool
19. AMD K5
20. AMD K5
- 5-stage pipeline
- Superscalar technique
- Branch prediction
- Dynamic execution
- Architecture is the same as the Pentium's
- But the Pentium's pipeline is better
21. K6-III
The P6 architecture is better than the K6-III
22. K7 (Athlon)
23. Crusoe
- Intel and AMD structure
- RISC core inside an 80x86 shell
- Mapping 80x86 instructions to RISC core instructions
- Crusoe
- CPU from Transmeta
- Uses software to help the hardware work
- Translates instructions in software instead of hardware (Code Morphing)
24. Crusoe
25. Crusoe
- Software Code Morphing
- 128-bit VLIW (Very Long Instruction Word), 4 instructions
- 4 execution units
- Integer, floating point, load/store and branch
- Crusoe TM5400
- 64 registers
- Instruction cache 64 KB
- Data cache 64 KB
- L2 cache 256 KB
- Speed 266 to 533 MHz
- Low power consumption: 1/3 of the Pentium III
26. Crusoe
27. CPU Compare
28. IA-64 (Intel and HP)
- CISC and RISC processors
- RISC core + CISC
- RISC processors: PowerPC, Alpha, SPARC, MIPS
- CPU problems
- Jumps: branch prediction
- Memory reads: cache and prefetch queue
- > 1 instruction/clock: superscalar
29. Merced (Itanium)
- EPIC (Explicitly Parallel Instruction Computing)
- 128 general registers
- 128 floating-point registers
- Parallel processing units
- VLIW (Very Long Instruction Word): 128 bits (41 x 3 + 5)
- Compiler optimization
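The 128-bit figure is three 41-bit instruction slots plus a 5-bit template. A minimal sketch of packing and unpacking such a bundle follows. It places the template in the lowest 5 bits and the slots above it, which is the IA-64 bundle layout. The function names and sample values are illustrative only.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1

def pack_bundle(template, slot0, slot1, slot2):
    """Pack a 5-bit template and three 41-bit slots into one 128-bit value."""
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Recover (template, slot0, slot1, slot2) from a 128-bit bundle."""
    template = bundle & TEMPLATE_MASK
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & SLOT_MASK
        for index in range(3)
    )
    return (template,) + slots

bundle = pack_bundle(0x10, 0x123, 0x456, 0x789)   # arbitrary sample values
print(bundle.bit_length() <= 128, unpack_bundle(bundle))
```

The template field tells the hardware which execution unit type each slot needs, which is how the compiler states the parallelism explicitly.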
30. Branch Removal
cmp.ne p1,p2=a,r0    // p1 <- a != 0
cmp.ne p3,p4=e,r0    // p3 <- e != 0
(p1) add b=c,d       // If a != 0 then add
(p3) sub h=i,j       // If e != 0 then sub
if (a) b = c + d
if (e) h = i - j
- Predicate registers (64)
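The idea behind branch removal can be mimicked in a high-level language: compute the predicates, run both operations unconditionally, and let each predicate decide whether the result is written. This is a rough sketch with hypothetical function names. The conditional expression stands in for the hardware's predicated write and is not meant to model real timing.

```python
def with_branches(a, e, b, c, d, h, i, j):
    """Reference version: two ordinary conditional branches."""
    if a:
        b = c + d
    if e:
        h = i - j
    return b, h

def predicated(a, e, b, c, d, h, i, j):
    """Branch-free style: predicates gate the write, not the control flow."""
    p1 = a != 0              # cmp.ne p1,p2=a,r0
    p3 = e != 0              # cmp.ne p3,p4=e,r0
    add_result = c + d       # always computed
    sub_result = i - j       # always computed
    b = add_result if p1 else b   # (p1) add b=c,d
    h = sub_result if p3 else h   # (p3) sub h=i,j
    return b, h

print(predicated(1, 0, b=0, c=2, d=3, h=9, i=7, j=4))
```

Both versions return the same values for every input. The predicated form simply leaves no jump for the branch predictor to guess.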
31. IA-64 Technique
- 64-bit processor, improved from the P6 architecture
- VLIW
- Compiler optimization
- Speculation feature (reduces memory wait time)
- 6 GFLOPS FPU
- 128 + 128 registers
- Supported by many software providers
- IA-32 compatible (Virtual 8086 Mode)
- IA-32 to IA-64 by a hardware translation mechanism
32. Itanium
- March 2000
- 800 MHz
- 20 instructions / clock
- 3-level cache, 4 MB
- 320 million transistors
- 25 million for the CPU
- 295 million for the L3 cache
33. Itanium