Title: Itanium Family Architecture
1 Itanium Family Architecture
- Marty Nicholes for EEC272
- UC Davis
2 Topics
- Architectural Overview
- Registers and ISA
- Implementation
- Instruction Flow
- Data Flow
- Benchmarks
- Futures
- References
3 Architectural Overview
- What is exposed to the compiler
4 Architectural Overview - IPF Register Set
5 Architectural Overview - Rotating Register Set
- The first 32 general/FP registers are available to all procedures
- Each procedure uses the ALLOC instruction to gain access to up to 96 more registers.
- If more logical registers are allocated than physical registers exist, the Register Stack Engine must spill older physical registers to make room for newly allocated registers.
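The spill behaviour above can be pictured with a toy model. This is only a sketch: it assumes 96 physical stacked registers, counts registers rather than tracking their contents, and ignores procedure return and fill. The class and method names are hypothetical, and the real Register Stack Engine is considerably more involved.

```python
# Toy model of register-stack overflow; NOT the real RSE algorithm.
PHYSICAL_STACKED = 96  # stacked registers available beyond the first 32


class RegisterStack:
    def __init__(self, physical=PHYSICAL_STACKED):
        self.physical = physical
        self.resident = 0   # stacked registers currently held in hardware
        self.spilled = 0    # registers the modelled RSE has written to memory

    def alloc(self, size):
        """Model a procedure allocating `size` stacked registers.

        Returns how many older registers had to be spilled to make room.
        """
        if not 0 <= size <= self.physical:
            raise ValueError("frame size must be between 0 and the physical count")
        self.resident += size
        overflow = max(self.resident - self.physical, 0)
        # Oldest registers are spilled so the new frame fits in hardware.
        self.spilled += overflow
        self.resident -= overflow
        return overflow
```

In this model, three nested calls that each allocate 40 registers fit twice without spilling, and the third forces 24 registers out to memory.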
6 Architectural Overview - Rotating Register Set
7 Architectural Overview - Instruction Set Architecture
- Instructions organized into groups
- No dependencies or interlock needed
- Can execute concurrently
- Arbitrary number of instructions
- Groups delimited by stop
- Instructions come in bundles (128 bits)
- Contains 3 instructions
- Contains a bundle template
8 Architectural Overview - Instruction Set Architecture
- Bundle
- Three 41-bit instructions
- 5-bit bundle type (the template)
- Bundle type communicates stop location
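The bundle layout above (5 + 3 x 41 = 128 bits) can be sketched as bit-field packing. The slide does not give the field order; the sketch assumes the template occupies the low 5 bits with the three instruction slots following in ascending order. The function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly 3 instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        # Assumed layout: slot i starts right after the template and earlier slots.
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    if not 0 <= bundle < (1 << 128):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & ((1 << SLOT_BITS) - 1)
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots
```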
9 Architectural Overview - Instruction Set Architecture
- Template helps to steer each instruction to a functional unit
- Implementation affects which bundle combos can fully issue
10 Architectural Overview
11 Itanium Implementation
- 3 processor designs announced
- Merced (Itanium)
- McKinley (Itanium 2)
- Madison (1.5 GHz Itanium 2)
12 Implementation (Itanium Die)
13 Itanium Block Diagram
14 Itanium Pipeline
15 Itanium 2 Die
16 Itanium 2 Pipeline
17 Itanium 2 Block Diagram
18 1.5 GHz Itanium 2 Die
19 Instruction Flow
- How to feed the hungry execution units
20 I-Cache (Itanium)
- 16 KB (32B line size)
- 4-way set associative
- Fully pipelined, single clock
- 64-entry I-TLB
- Fully associative
- On-chip page walker
- Both have extra port for miss checking
21 Instruction Fetch (Itanium)
22 Branch Prediction (Itanium)
- 90% accurate 2-level BPT: 512 entries (128 sets X 4 ways); each entry is 4 bits, and this value indexes into the pattern table for the set (128 pattern tables); each PT has 16 entries; each entry is a 2-bit saturating counter.
- Also a 64-entry, multi-way branch prediction table (MBPT) that is similar but has 3 history registers per bundle entry. Find-first-taken algorithm.
- Branch prediction penalties
- IP-relative branch w/correct prediction - 1 cycle
- IP-relative branch w/wrong target - 1 cycle
- Return branch w/correct prediction - 1 cycle
- Last branch in counted loop prediction - 2 cycles
- Branch misprediction - 9 cycles
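The 2-level scheme described for the BPT (a 4-bit history selecting one of 16 two-bit saturating counters in a per-set pattern table) can be sketched as below. This is a simplified model, not the Itanium hardware: tags and the 4 ways are not modelled, histories are kept in a dictionary keyed by branch address, and indexing the set by 16-byte bundle address is an assumption. All names are hypothetical.

```python
NUM_SETS = 128
HISTORY_BITS = 4
PT_ENTRIES = 1 << HISTORY_BITS  # 16 counters per pattern table


class TwoLevelPredictor:
    def __init__(self):
        # Level 1: one 4-bit taken/not-taken history per branch.
        self.history = {}
        # Level 2: one pattern table per set; counters start weakly not-taken.
        self.pattern = [[1] * PT_ENTRIES for _ in range(NUM_SETS)]

    def _set_index(self, addr):
        # Assumption: sets are selected by 16-byte bundle address.
        return (addr >> 4) % NUM_SETS

    def predict(self, addr):
        """Predict taken when the selected 2-bit counter is 2 or 3."""
        h = self.history.get(addr, 0)
        return self.pattern[self._set_index(addr)][h] >= 2

    def update(self, addr, taken):
        """Train the selected counter, then shift the outcome into the history."""
        s = self._set_index(addr)
        h = self.history.get(addr, 0)
        c = self.pattern[s][h]
        self.pattern[s][h] = min(c + 1, 3) if taken else max(c - 1, 0)
        self.history[addr] = ((h << 1) | int(taken)) & (PT_ENTRIES - 1)
```

Because the history selects the counter, a branch that strictly alternates between taken and not-taken becomes fully predictable after a short warm-up, which a single 2-bit counter alone cannot achieve.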
23 I-Cache (Itanium 2)
- 16KB (4 way, 64B line size)
- LRU replacement algorithm
- 32 GB/sec bandwidth
- 1 cycle load to use
- 2 level I-TLB
- ITC (32 entry, fully assoc., 0.5 clock)
- ITLB (128 entry, fully assoc., 1 clock)
- Supports 4K to 4GB page sizes
- Supports 64 ITRs
- HW page walker starts on miss
24 Branch Prediction (Itanium 2)
- Zero clock branch prediction
- 2 level branch prediction hierarchy
- L1IBR - Level 1 Branch Cache
- Part of the L1 I-cache
- 1K trigger predictions, 0.5K target addresses
- L2B - Level 2 Branch Cache (12K histories)
- PHT - Pattern History Table (16K counters)
- Reduced prediction penalties
- IP-relative branch w/correct prediction - 0 cycle
- IP-relative branch w/wrong target - 1 cycle
- Return branch w/correct prediction - 1 cycle
- Last branch in counted loop prediction - 0 cycle
- Branch misprediction - 6 cycles
25 Instruction Dispersal (Itanium)
- Stop bits eliminate dependency checking
- Templates simplify routing
- Map instructions to first available of 9 issue ports. Keep issuing until
- stop bit is hit
- required issue port is unavailable
- Re-map virtual register to physical register
- New bundles presented as bundles fully issue
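The "keep issuing until" rule above can be sketched as a simple loop. This is a minimal model under stated assumptions: the 9 ports are taken to be 2 memory, 2 integer, 2 floating-point and 3 branch, each instruction needs exactly one port of its own type, and register remapping is left out. The names are hypothetical.

```python
# Assumed split of the 9 issue ports: memory, integer, floating point, branch.
ITANIUM_PORTS = {"M": 2, "I": 2, "F": 2, "B": 3}


def disperse(instructions, ports=ITANIUM_PORTS):
    """Return how many instructions issue in one cycle.

    `instructions` is a list of (unit, stop) pairs in program order, where
    `unit` names the port type needed and `stop` is True when a stop bit
    follows that instruction.
    """
    free = dict(ports)
    issued = 0
    for unit, stop in instructions:
        if free.get(unit, 0) == 0:
            break  # required issue port is unavailable
        free[unit] -= 1
        issued += 1
        if stop:
            break  # stop bit ends the instruction group
    return issued
```

For example, two bundles shaped like M-I-I and M-F-B issue fully in this model, while a third memory instruction in the same cycle would stall dispersal.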
26 Instruction Dispersal (Itanium)
27 Instruction Dispersal (Itanium 2)
28 Register Remapping (Itanium)
- One 7-bit adder for each register specifier
- In total 98 7-bit adders, and 42 MUXs
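What each 7-bit adder computes can be sketched as adding a base offset to the virtual register number. The slide gives only the adder count, so the details here are assumptions: the first 32 registers pass through unchanged, the remaining 96 wrap around modulo 96, and `base` stands in for whatever offset the hardware supplies. The function name is hypothetical.

```python
STATIC_REGS = 32    # registers visible to all procedures, not renamed
STACKED_REGS = 96   # registers reached through the remapping adder


def remap(virtual, base):
    """Map a 7-bit virtual register number to a physical register number."""
    if not 0 <= virtual < STATIC_REGS + STACKED_REGS:
        raise ValueError("register specifiers are 7 bits (0..127)")
    if virtual < STATIC_REGS:
        return virtual
    # The add wraps within the stacked region so the result stays in 32..127.
    offset = (virtual - STATIC_REGS + base) % STACKED_REGS
    return STATIC_REGS + offset
```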
29 Instruction Dispersal (Itanium 2)
- Itanium 2 implements 11 issue ports
- 4 Mem/ALU/Multi-Media
- 2 Integer/ALU/Multi-Media
- 2 FMAC
- 3 branch
30 Execution (Itanium and Itanium 2)
- 17 execution units fed by 9 issue ports (Itanium)
- 20 execution units fed by 11 issue ports (Itanium 2)
- Scoreboard based, stall-on-use
- Enhanced to support predication
- Hazard evaluation in REG stage
- Hazards can proceed into EXE stage
- Stall occurs in EXE stage (deferred stall)
- Obtaining operands in the EXE stage
- Stalled instructions snoop for data values
- Utilize register bypass hardware from REG
31 Data Flow
- Getting operands into the core
- Getting results stored
32 Data Flow (Itanium)
- 128 (64 bit) integer register file
- 8 read and 6 write ports
- Supports 2 MEM and 2 ALU instructions
- 2 write ports for pending loads
- 128 (82 bit) floating point register file
- 8 read and 4 write ports
- Predicate register file (1 bit X 64)
- 15 read and 11 write ports to single registers
- Broadside read and write capability
33 Data Speculation
- Control speculation
- On exception, NaT bit set, or NaTVal
- On consumption, exception is reported
- Special load issued early
- Address, size, and destination saved in ALAT
- ALAT used to check for overlapping stores
- If a match
- the load is invalidated
- must be reissued later when the data is to be used
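The ALAT bookkeeping above can be sketched as a small table keyed by destination register. This is only an illustration: it assumes an unbounded table with full-address comparison, whereas real hardware has a fixed number of entries and may compare partial addresses. The class and method names are hypothetical.

```python
class ALAT:
    """Toy advanced load address table: destination -> (address, size)."""

    def __init__(self):
        self.entries = {}

    def advanced_load(self, dest, addr, size):
        """Record an early load so later stores can be checked against it."""
        self.entries[dest] = (addr, size)

    def store(self, addr, size):
        """Invalidate every entry whose byte range overlaps this store."""
        for dest, (a, s) in list(self.entries.items()):
            if a < addr + size and addr < a + s:
                del self.entries[dest]

    def check(self, dest):
        """True if the early load is still valid; False means reissue the load."""
        return dest in self.entries
```

A store to an unrelated address leaves the entry in place, so the early load result can be used; an overlapping store removes it, and the check then signals that the load must be reissued.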
34 ALAT Structure
35 Data Flow - FMAC Unit (Itanium)
36 Data Flow (Itanium 2 deltas)
- Integer register file
- 12 read and 8 write ports
- Floating point register file
- 8 read and 6 write ports
- FPU to L2D cache
- 4 82-bit read ports (6 cycle latency)
- 2 82-bit write ports
37 Data Flow - Caches (Itanium)
38 Data Flow - Caches (Itanium 2)
39 Data Flow (Itanium 2 L1D)
- L1D - 16 kB of data (4 ways, 4 kB/way)
- Prevalidated tags for fast loads
- 4 ports (2 load ports, 2 store ports)
- Load ports are independent.
- Each store port has a 1 in 8 chance of conflicting with each other valid port.
- True single cycle load access, including
- address translation and data delivery
- data read and integer unit data bypass
- Physically addressed (50 physical address bits)
- Write-through policy
- Each way is 64B line X 64 indices
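The L1D geometry above can be checked with a little arithmetic: 64 B lines times 64 indices gives 4 kB per way, and 4 ways give 16 kB. The sketch below also splits an address into tag, index and offset fields. Taking the offset from the low 6 bits and the index from the next 6 is the conventional layout and is an assumption here, as is the function name.

```python
LINE_BYTES = 64
INDICES = 64
WAYS = 4

OFFSET_BITS = LINE_BYTES.bit_length() - 1  # log2(64) = 6
INDEX_BITS = INDICES.bit_length() - 1      # log2(64) = 6

WAY_BYTES = LINE_BYTES * INDICES           # bytes held by one way
TOTAL_BYTES = WAY_BYTES * WAYS             # bytes held by the whole L1D


def l1d_fields(addr):
    """Split an address into (tag, index, offset) for this geometry."""
    offset = addr & (LINE_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (INDICES - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset
```

With this layout the index and offset together use 12 address bits, the same span as a 4 kB way.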
40 Data Flow (Itanium 2)
41 L1D Addressing (Itanium 2)
42 L1D Addressing (Itanium 2)
43 Addressing
- Itanium
- 44 bit physical addressing
- 50 bit virtual addressing
- Maximum page size of 256MB
- Itanium 2
- 50 bit physical addressing
- 64 bit virtual addressing
- Maximum page size of 4GB
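To put the address widths above in perspective, the short sketch below converts a bit count into the size of the address space it can reach. The helper names are hypothetical, and binary units (KiB, MiB, ...) are used throughout.

```python
UNITS = ["B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB"]


def address_space_bytes(bits):
    """Number of distinct byte addresses reachable with `bits` address bits."""
    return 1 << bits


def human(n):
    """Format a power-of-two byte count using binary units."""
    i = 0
    while n >= 1024 and i < len(UNITS) - 1:
        n //= 1024
        i += 1
    return f"{n} {UNITS[i]}"
```

Under this conversion, 44 physical address bits reach 16 TiB, 50 bits reach 1 PiB, and 64 virtual address bits reach 16 EiB.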
44 Benchmarks
- SPECfp2000
- hp server rx5670 (1P, 1000 MHz, Itanium 2)
- 1431 (Dec-2002, rank 2)
- Rank 1: Alpha at 1482
- SPECint2000
- hp server rx2600 (1P, 900 MHz, Itanium 2)
- 674 (Dec-2002; rank 1: P4 3 GHz at 1200)
- TPC-C (non-clustered)
- HP Superdome (64P, 1.5 GHz Itanium 2)
- 658,278 (Apr-2003, rank 2)
- Rank 1: IBM at 680,613, 32P Power4 1.7 GHz
45 Itanium Family Futures
- IA-32 Execution Layer, expected to debut in 2003. A 1.5 GHz Itanium 2 is expected to run 32-bit code about as fast as a 1.5 GHz Xeon MP chip
- HP mx2 dual processor module using Intel Itanium 2 processors
- Combines two future Itanium 2 processors and a 32-MB L4 cache onto a single daughter card module that is pin-compatible with existing Madison Itanium 2 processor sockets.
- 2004 model of Madison is expected to be faster and carry 9MB on chip L3 cache on .13u process
- Deerfield: between 70 and 80 watts, and lower cost.
- Montecito processor in 2005: 90-nanometer, dual-core technology, 18MB L3 cache, each core will have its own L3 cache
- Multi-threading?
46 References
- Naffziger et al., The Implementation of the Itanium 2 Microprocessor, IEEE Journal, November 2002
- Naffziger, Hammond, The Implementation of the Next-Generation 64b Itanium Microprocessor, ISSCC, 2002
- Fetzer, Orton, A Fully-Bypassed 6-Issue Integer Datapath and Register File on an Itanium Microprocessor, ISSCC, 2002
- Lyon, Delano, Data Cache Design Considerations for the Itanium 2 Processor, Proc. of the 2002 IEEE Intl. Conf. on Computer Design, 2002
- Shankland, CNET News.com, September 10, 2002
- Singer, Intel Adds More Itanium 2 to its Future, siliconvalley.internet.com, January 16, 2003
- Swoyer, Intel Promises Improved Performance, Enterprise Systems, April 2003
- www.intel.com
- www.hp.com
- www.spec.org
- Sharangpani, Arora, Itanium Processor Microarchitecture, IEEE, 2000
- Bradley, Mahoney, Stackhouse, The 16kB Single-Cycle Read-Access Cache on a Next-Generation 64b Itanium Microprocessor
- Stinson, Rusu, The 16kB Single-Cycle Read-Access Cache on a Next-Generation 64b Itanium Microprocessor
- Naffziger, Hammond, Next Generation Itanium Processor Overview, Intel Devel. Forum 2001
- Barcella, Sankaranarayanan, Pai, ITANIUM, An EPIC Architecture
- Intel Itanium 2 Processor Reference Manual, Intel Corp., June 2002
- Parmenter, Levy, The Intel Itanium
- K. Diefendorff, HP, Intel Complete IA-64 Rollout, Microprocessor Report, MicroDesign Resources, Sunnyvale, CA, April 10, 2000
- McNairy, Soltis, Itanium 2 Microarchitecture, IEEE Micro, March-April 2003