The Intel Itanium ISA Synopsis In Brief

Slides: 77

Provided by: Chris1671

1
The Intel Itanium (ISA) Synopsis In Brief

Paginated and edited, with some development, by
Charles Pickman
(Original presentation by Scott Parmenter & Chris
Levy, Fall 2002)

22-Dec-02
2
History

1994 - Intel & Hewlett Packard begin work on
Itanium (codename Merced).
July 1999 - Company officials stress that only
prototypes are to be released in mid-1999 and
that volume production is still on schedule for
mid-2000.
July 2000 - Chip rollout delayed until 1st half of
2001. By now, 15,000 units have been delivered
for evaluation purposes.
August 2000 - Intel publicly demonstrates a
four-CPU Itanium at LinuxWorld and at the Intel
Developer Forum. However, another revision is
needed before rollout.
McKinley (Itanium 2) is due to arrive in the
second half of 2001, making Merced largely
irrelevant.
January 2001 - Merced delayed again, until Q2.
December 2001 - After the 1st full quarter of
Itanium sales:
500 units were sold
2,000 were shipped for demonstration purposes.
January 2002 - Intel denies speculation about
secretly developing Yamhill Technology, a 64-bit
chip based on the x86 design, as a backup in case
Itanium fails.
Reported development cost of Itanium to date:
over $1 billion.
October 2002 - A federal court finds that Intel
used patented Intergraph technology in
Itanium.

3
Itanium Physical Concept

Designed to take complexity away from the
processor, making the work of the programmer,
compiler and assembler more complex.
3 x 5 inch cartridge
CPU plus L3 cache
130 W power
420 mm^2
Transistors:
CPU 25 million
L3 cache 300 million

4
Itanium CPU Layout
5
System Data Bus

Increased bus efficiency through enhanced
deferred transactions
266 MHz
Throughput up to 2.1GByte/sec
64-bit wide bus
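
The throughput figure follows from the other two numbers on this slide. A minimal sketch of the arithmetic, assuming one full-width transfer per cycle at the quoted 266 MHz rate (the variable names are illustrative, not from the presentation):

```python
# Peak bus bandwidth = bus width in bytes * transfers per second.
# Assumption: one 64-bit transfer per cycle at the quoted 266 MHz rate.
BUS_WIDTH_BITS = 64
TRANSFERS_PER_SECOND = 266_000_000

bytes_per_transfer = BUS_WIDTH_BITS // 8
peak_bytes_per_second = bytes_per_transfer * TRANSFERS_PER_SECOND
peak_gbytes_per_second = peak_bytes_per_second / 1e9

print(f"{peak_gbytes_per_second:.2f} GByte/sec")
```

The result rounds to the "up to 2.1 GByte/sec" quoted above.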

6
Addressing Modes

The Itanium has only one simple addressing mode,
register indirect.
This reduces the amount of overhead per clock
cycle, since it does not have to deal with the
address-generation units required for multiple
addressing modes.

7
Operating Modes

IA-64 system environment. All 64-bit features are
fully available. Supports application
environments of IA-32 real mode, IA-32 V86 mode,
IA-32 protected mode, and IA-64 instruction set
modes.
IA-32 application environment in IA-64 system
environment mode. Full IA-32 instruction and
register set compatibility. Can execute a mixture
of IA-64 and IA-32 code by special branching
instructions.
IA-32 system environment. Fully IA-32 compatible.
This mode is used to run 32-bit operating
systems. No 64-bit extensions are available.
Supports real-mode, V86, and protected modes.

8
Instruction Word

At its heart is a VLIW core.
Itanium has added features to the VLIW design to
enhance instruction grouping and scalability, and
to permit wider parallel instruction issue.
Intel named this approach EPIC (Explicitly
Parallel Instruction Computing).

9
Parallelism
10
Parallelism
11
Instruction Group

Instructions are divided into collections that
have no dependencies or interlocks.
These groups can be executed concurrently.
Groups can be of arbitrary length and are ended
by stop bits
Instructions on opposite sides of a group stop
may depend on each other
12
Instruction Bundle

Instructions are delivered to the processor in
bundles
2 bundles are dispatched at once (CECS440
supported?)
Instructions are packed three to a bundle, one
per slot. Each instruction is 41 bits long.
Each instruction bundle is 128 bits long.
The remaining 5 bits designate a template (bundle
type).
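
The bit budget (3 x 41 + 5 = 128) can be checked with a short sketch. The layout used here, with the template in the lowest five bits followed by slots 0 to 2, is an assumption not stated on the slide, and the function names are hypothetical.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
BUNDLE_BITS = TEMPLATE_BITS + 3 * SLOT_BITS  # 5 + 3 * 41 = 128


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit slots into a 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for index, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        # Assumed layout: slot 0 sits directly above the template.
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Return (template, [slot0, slot1, slot2]) from a 128-bit integer."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + index * SLOT_BITS)) & slot_mask
        for index in range(3)
    ]
    return template, slots
```

For example, `unpack_bundle(pack_bundle(0x10, [1, 2, 3]))` returns the same template and slots that went in.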

13
Instruction Bundle

Bundles are ordered from lowest to highest memory
address
Bundles in lower memory addresses have precedence
over bundles in higher memory addresses
Byte order of each bundle in memory is
little-endian
Within a bundle, instructions are ordered from
slot 0 to slot 2

14
Templates

Templates help direct which instructions go to
which type of execution unit
The 5-bit template field has 32 encodings: 24
specify combinations of instruction types and 8
are reserved
Determine the mapping between instruction slot
and execution unit
Group stops are specified by the template

15
Templates

MII/MBB and MIB/MIB Templates allow
6 instructions or 8 parallel operations per clock
2 Load/Store operations
2 general purpose ALU operations
2 post-increment ALU operations
2 Branch operations
MFI/MFI Template allows
6 instructions or 12 parallel operations per
clock
4 double-precision operand Loads
4 double-precision floating point operations
2 integer ALU operations
2 post-increment ALU operations

16
VLIW Core

The Itanium is a six-issue processor, meaning it
can profitably handle six instructions
simultaneously.

18
Execution Units

4 integer units
4 multi-media units
2 load/store units
3 branch units
2 extended precision FP units
2 single precision FP units

19
Execution Ports

Feeds instructions to Execution Units
2 Integer Ports
2 Integer/Load-Store Ports
2 Floating Point Ports (CECS440 supported?)
3 Branch Ports

20
Floating Point Units (CECS440 supported?)

4 pipelined FMAC units (Floating-point
Multiply-Accumulate).
The primary two are each capable of processing two
single-precision, two double-precision, or two
double-extended-precision floating-point
operations per clock, at up to 3.2 GFLOPS.
Additional FMACs for 3D applications. Each is
capable of processing up to two single-precision
floating-point operations per clock, at up to
3.2 GFLOPS.
Itanium has a theoretical maximum of 6.4 GFLOPS of
single-precision floating-point calculation.

21
Floating Point Units (CECS440 supported?)
22
Execution Unit Constraints

FP Units are optimized for multiply-accumulate
operations (CECS440 supported?)
Must add 0.0 for simple floating point
multiplication operations (CECS440 supported?)
Must multiply by 1.0 for all simple floating
point addition operations (CECS440 supported?)
Integer Units do not multiply
Integer pair must be transferred from
general-purpose registers to floating point
registers for a multiply-accumulate operation
(CECS440 supported?)
Integer Result is transferred back to a
general-purpose register
Noticeable performance penalty
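
The constraints above can be illustrated with a toy model in which every floating-point operation is a multiply-accumulate. This is only a sketch: the function names are hypothetical, Python floats stand in for the hardware formats, and the integer path models the register transfers as plain assignments.

```python
F0 = 0.0  # stands in for the register that always reads 0.0
F1 = 1.0  # stands in for the register that always reads 1.0
MASK64 = (1 << 64) - 1


def fma(a, b, c):
    """Multiply-accumulate: a * b + c (Python rounds twice, hardware once)."""
    return a * b + c


def fmul(a, b):
    """Plain multiply expressed as a multiply-accumulate that adds 0.0."""
    return fma(a, b, F0)


def fadd(a, b):
    """Plain add expressed as a multiply-accumulate that multiplies by 1.0."""
    return fma(a, F1, b)


def integer_multiply(x, y):
    """Integer multiply routed through the floating-point unit."""
    fx, fy = x, y            # transfer the pair to floating-point registers
    product = fx * fy + 0    # multiply-accumulate on the significands
    return product & MASK64  # transfer the low 64 bits back
```

The two transfers in `integer_multiply` are where the performance penalty noted above would come from.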

23
Integer Register File

128 entries
8 read ports
6 write ports
Post-increment performed by idle ALU and write
ports
2 clock cycles are needed to access a register
due to number of registers

24
Floating Point Register File (CECS440 supported?)

128 entries
8 read ports
4 write ports, separated into odd and even banks
Supports double extended-precision arithmetic

25
Predicate Register file

64 1-bit entries
15 read ports
11 write ports

26
Registers

128 General registers
128 Floating-point registers (CECS440 supported?)
64 Predicate registers
8 Branch registers
128 Application registers
Instruction Pointer (IP) register
Registers are referred to by a mnemonic denoting
the register type and a number. For example,
general register 32 is named r32.

27
IA-32 Register Set

Itanium is fully backward compatible with Intel's
x86 family
The x86 register set has been superimposed onto
the Itanium's 64-bit register set.

28
General registers

General Registers
Itanium provides 128 64-bit general purpose
registers for all integer and multimedia
computation.
Register gr0 is a read-only register and is
always zero (0).
32 registers are static and global to the
process.
96 registers are stacked. These registers are for
argument passing and local register stack frame.
Each register has an associated NaT bit,
indicating whether the value stored in the
register is valid.

29
Floating Point Registers (CECS440 supported?)

128 82-bit floating-point registers, for
floating-point computations.
32 static floating-point registers
96 rotating floating-point registers, for
software pipelining
The first two registers (fr0 and fr1) are
read-only
fr0 is read as 0.0
fr1 is read as 1.0.
Each register contains three fields
64-bit significand field
17-bit exponent field
1-bit sign field.
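
The three fields add up to 82 bits (64 + 17 + 1). A hedged sketch of decoding such a value follows. It assumes the sign is the top bit, the exponent sits below it, and the exponent is biased by 0xFFFF with an explicit integer bit in the significand; these details are not given on the slide. Special values (infinities, NaNs) are not handled.

```python
from fractions import Fraction

SIGNIFICAND_BITS = 64
EXPONENT_BITS = 17
EXPONENT_BIAS = 0xFFFF  # assumption: not stated in the presentation


def encode_fr(sign, exponent, significand):
    """Assemble an 82-bit register image from its three fields."""
    return (sign << 81) | (exponent << SIGNIFICAND_BITS) | significand


def decode_fr(raw):
    """Return the exact value of a finite 82-bit register image."""
    significand = raw & ((1 << SIGNIFICAND_BITS) - 1)
    exponent = (raw >> SIGNIFICAND_BITS) & ((1 << EXPONENT_BITS) - 1)
    sign = (raw >> 81) & 1
    # Bit 63 of the significand is the integer part, so divide by 2**63.
    magnitude = Fraction(significand, 1 << 63) * Fraction(2) ** (
        exponent - EXPONENT_BIAS
    )
    return -magnitude if sign else magnitude


one = decode_fr(encode_fr(0, 0xFFFF, 1 << 63))
```

Under these assumptions the image with exponent 0xFFFF and only the top significand bit set decodes to 1.0, the value read from fr1.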

30
Predicate Registers

64 one-bit predicate registers control the
execution of instructions. When the value of
a predicate register is true (1), the instruction
is executed. The predicate registers enable
validating/invalidating instructions,
eliminating branches in if/then/else logic blocks.
If the predicate is false, the instruction
may still execute, but there is no write back.
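
A minimal sketch of how predication replaces an if/then/else. Both arms are evaluated, but only the arm whose predicate is true writes its result back. The tiny interpreter and its names are illustrative, not a model of the real pipeline.

```python
def run_predicated(program, registers, predicates):
    """Run (predicate index, destination, compute) triples.

    Every instruction is evaluated, but its result is written back
    only when its qualifying predicate is true.
    """
    for qp, dest, compute in program:
        result = compute(registers)
        if predicates[qp]:
            registers[dest] = result
    return registers


def absolute_difference(a, b):
    """Branch-free |a - b| using two complementary predicates."""
    registers = {"r1": a, "r2": b, "r3": 0}
    predicates = [False] * 64
    # A compare sets a pair of complementary predicates.
    predicates[1] = a >= b
    predicates[2] = not predicates[1]
    program = [
        (1, "r3", lambda r: r["r1"] - r["r2"]),  # the "then" arm
        (2, "r3", lambda r: r["r2"] - r["r1"]),  # the "else" arm
    ]
    return run_predicated(program, registers, predicates)["r3"]
```

Calling `absolute_difference(3, 7)` evaluates both subtractions and keeps only the one guarded by the true predicate.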

31
Branch Registers

Eight 64-bit branch registers are used to specify
branch target addresses.
The branch registers streamline call/return
branching

32
Application registers

128 special purpose registers are used for
various functions.
The first eight K0-K7 are the kernel registers,
used to communicate info between the kernel and
an application.

Instruction Pointer

The 64-bit instruction pointer holds the address
of the bundle of the currently executing
instruction.
The IP can be read directly but cannot be written
directly; it increments as instructions are
executed.
Branch instructions set the IP to a new value.

33
NaT and NaTVal Registers

There are 128 single-bit NaT (Not a Thing) bits,
one for every GP register.
FP registers have no separate bit; a special
NaTVal (Not a Thing Value) encoding held in the
register itself plays the same role
(CECS440 supported?)
Both are used under speculative execution
to indicate deferred exceptions.

34
Register Stacking

Reduces function call and return overhead by
ensuring that all procedural input and output
parameters are in specific register locations,
without requiring the compiler to perform
register-register or memory-register moves

35
Register Stack

The general register file is divided into two
subsets:
Static: The first 32 physical registers (r0-r31)
are permanent registers, visible to all
procedures (global).
Stacked: The other 96 physical registers
(r32-r127) behave like a stack. The procedure
code allocates up to 96 input and output
registers for a procedure frame.

36
RSE

What happens when you run out of registers?
The Itanium uses a hardware mechanism called the
Register Stack Engine (RSE), which operates
transparently in the background. It pushes and
pops registers for you once you have reached the
physical limit, giving an application an
apparently unlimited number of physical
registers.

37
Register Frames

Pushing and popping Itanium's large register set
every time there is a function call would be time
consuming.
The alloc instruction shifts the apparent
arrangement of the general-purpose registers so
that it appears that parameters are being passed
from one function to another through shared
physical registers.
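
The renaming idea can be sketched with a toy model in which a call slides the frame base so that the caller's output registers become the callee's first stacked registers. The class and method names are hypothetical, and the model ignores the 96-register limit and the RSE.

```python
FIRST_STACKED = 32  # r32 is the first stacked register name


class RegisterStack:
    """Toy model of stacked-register renaming across calls."""

    def __init__(self):
        self.physical = {}   # physical slot -> value
        self.base = 0        # physical slot that r32 currently maps to
        self.frame_size = 0  # registers visible in the current frame
        self.outputs = 0     # how many of them are output registers
        self.saved = []      # caller frames, restored on return

    def alloc(self, inputs, local, outputs):
        """Resize the current frame, in the spirit of alloc."""
        self.frame_size = inputs + local + outputs
        self.outputs = outputs

    def _slot(self, number):
        offset = number - FIRST_STACKED
        if not 0 <= offset < self.frame_size:
            raise IndexError(f"r{number} is outside the current frame")
        return self.base + offset

    def write(self, number, value):
        self.physical[self._slot(number)] = value

    def read(self, number):
        return self.physical.get(self._slot(number), 0)

    def call(self):
        """Enter a callee: its r32 is the caller's first output register."""
        self.saved.append((self.base, self.frame_size, self.outputs))
        self.base += self.frame_size - self.outputs
        self.frame_size, self.outputs = self.outputs, 0

    def ret(self):
        """Return to the caller and restore its view of the registers."""
        self.base, self.frame_size, self.outputs = self.saved.pop()


stack = RegisterStack()
stack.alloc(inputs=0, local=2, outputs=2)  # caller: r32-r33 local, r34-r35 out
stack.write(34, 10)
stack.write(35, 20)
stack.call()
callee_args = (stack.read(32), stack.read(33))  # same physical registers
stack.ret()
```

No values are copied at the call: the callee reads the caller's r34 and r35 under the names r32 and r33.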

38
Register Frames
39
Memory Organization

The Itanium architecture defines a single,
uniform, linear address space of 2^64 bytes (only
2^50 used).
All code is stored in little-endian byte order in
memory. Data is typically stored in little-endian
byte order. The Itanium architecture also
provides support for big-endian operating
systems.
Memory access is provided solely through explicit
register load and store instructions.
The large number of registers in the Itanium
architecture enables multiple computations to be
performed without having to store temporary data
in memory. This reduces the number of memory
accesses.

40
Virtual Memory Support (CECS440 supported?)

64-bit flat addressing space, indexed by general
registers
IA-32 32-bit addresses are zero-extended into the
64-bit virtual address space
Divided into 8 virtual regions of 2^61 bytes each
Each virtual region has an associated region
register that specifies a 24-bit region identifier
(unique address space number)
8 of the possible 2^24 address spaces are
concurrently available via region registers
Regions can be combined to form larger regions
(62-, 63-, or 64-bit regions)
All IA-32 memory references are through region 0
Page size ranges from 4 KB to 4 GB
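
Eight regions of 2^61 bytes each account for the full 64-bit space (3 + 61 = 64 bits). A short sketch of splitting an address follows, assuming the region is selected by the top three address bits; the function names are illustrative.

```python
REGION_BITS = 3
OFFSET_BITS = 61  # each region spans 2**61 bytes


def split_virtual_address(address):
    """Return (region number, offset within region) for a 64-bit address."""
    if not 0 <= address < (1 << 64):
        raise ValueError("address must fit in 64 bits")
    region = address >> OFFSET_BITS  # assumed: top three bits pick the region
    offset = address & ((1 << OFFSET_BITS) - 1)
    return region, offset


def zero_extend_ia32(address32):
    """Widen a 32-bit IA-32 address; the upper 32 bits are zero."""
    if not 0 <= address32 < (1 << 32):
        raise ValueError("address must fit in 32 bits")
    return address32


region, offset = split_virtual_address(0xE000000000001234)
ia32_region, _ = split_virtual_address(zero_extend_ia32(0xFFFFFFFF))
```

Because a zero-extended 32-bit address always has its top three bits clear, every IA-32 reference lands in region 0.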

41
Region Register Format

42
Physical Memory Addressing

Objects in memory and I/O occupy common 63-bit
byte-accessible physical address space
Physical memory addresses performed via virtual
addresses mapped to the 63-bit physical address
space or by direct physical addressing
Current page table formats allow for mapping
virtual addresses into 50 bits of physical
address space (future extensions to allow for
larger mappings)

43
Physical Memory Addressing

44
Cache Hierarchy (1 of 3)

Split 32 KB L1 on die
Separate instruction and data caches (16 KB each)
4-way set-associative, fully pipelined with
32-byte cache line size
Can sustain 2 loads, 2 stores, or 1 load and 1
store per clock
Write-through with no write allocation
Data cache only used for Integer Unit (not FP
Unit)
Integer load hits have a 2-cycle latency
Load from addresses stored within last 3 cycles
(Store to load forwarding) bypasses cache

45
Cache Hierarchy (2 of 3)

Unified 96 KB L2 on die
6-way set-associative
Fully pipelined with 64-byte cache line size
Write-back with write-allocate policy
4 Loads per clock cycle
Can sustain 2 memory operations per clock
Integer load hits have a 6-cycle latency
FP load hits have a 9-cycle latency

46
Cache Hierarchy (3 of 3)

Unified 2 MB or 4 MB L3 on cartridge (separate die)
Runs at full processor frequency
4-way set-associative
Fully pipelined with 64-byte cache line size to
provide 12.8 GB/sec using a 128-bit wide cache
bus
Integer load hits have a 21-cycle latency
FP load hits have a 24-cycle latency (CECS440
supported?)
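
The L3 bandwidth and bus width together imply a clock rate. A minimal sketch of that arithmetic, reading the quoted 12.8 figure as gigabytes per second and assuming one 128-bit transfer per cycle (variable names are illustrative):

```python
# Implied transfer rate = quoted bandwidth / bytes moved per transfer.
CACHE_BUS_WIDTH_BITS = 128
QUOTED_BYTES_PER_SECOND = 12.8e9  # assumption: 12.8 means gigabytes/sec

bytes_per_transfer = CACHE_BUS_WIDTH_BITS // 8
implied_transfers_per_second = QUOTED_BYTES_PER_SECOND / bytes_per_transfer
implied_clock_mhz = implied_transfers_per_second / 1e6

print(f"{implied_clock_mhz:.0f} MHz")
```

Since the slide says the L3 runs at full processor frequency, the implied figure would also be the core clock under these assumptions.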

48
Branch Prediction

Only one form of conditional branch, but it can
be based on 1 of 64 predicate bits (located in
predicate registers)
Branches can be static or dynamic
Dynamic branches can be overridden by CPU
Can specify how far ahead of time to fetch the
predicted branch

49
Speculation

Itanium will try to conform to programmer-specified
loads if the system bus may be busy
Data speculation loads data in advance, allowing
the memory bus to be fully utilized
Code speculation executes instructions in advance,
and results are held until it is safe to store
them in memory

50
Speculation
51
Predication

All conditional instructions are predicated
Avoids short branches that inject bubbles into
the pipeline
Executes both branch paths simultaneously
Discards irrelevant path as predicate is
evaluated

52
Predication
53
Predication
54
Pipeline

Itanium has a 10-stage pipeline
256 bits wide, handling 2 bundles at a time
Compared with:
Pentium III: 12 stages
Pentium 4: 20 stages
Itanium 2 (McKinley): 8 stages

64
Instructions

A typical Itanium instruction is a three-operand
instruction, with the following syntax:
Simple instruction:
add r1 = r2, r3

66
Instruction Syntax (qp) mnemonic.comp1.comp2
dests = srcs

(qp): A qualifying predicate is a predicate
register indicating whether or not the
instruction is executed.
mnemonic: A unique name identifying the
instruction.
comp1, comp2: Some instructions may include one
or more completers. Completers indicate optional
variations on the basic mnemonic.
dests, srcs: Destination and source registers.
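
The syntax can be made concrete with a small parser sketch. It assumes destinations and sources are separated by "=" and operands by commas; the pattern, the function name, and the returned dictionary keys are all illustrative, and real assemblers accept more forms than this handles.

```python
import re

INSTRUCTION_PATTERN = re.compile(
    r"^\s*(?:\((?P<qp>p\d+)\)\s*)?"   # optional qualifying predicate, e.g. (p1)
    r"(?P<opcode>[A-Za-z0-9_.]+)"     # mnemonic plus dot-separated completers
    r"(?:\s+(?P<operands>.*?))?\s*$"  # optional operand text
)


def _split_operands(text):
    return [item.strip() for item in text.split(",") if item.strip()]


def parse_instruction(text):
    """Split one instruction into predicate, mnemonic, completers, operands."""
    match = INSTRUCTION_PATTERN.match(text)
    if match is None:
        raise ValueError(f"cannot parse instruction: {text!r}")
    mnemonic, *completers = match.group("opcode").split(".")
    operands = match.group("operands") or ""
    if "=" in operands:
        dest_text, src_text = operands.split("=", 1)
    else:
        dest_text, src_text = "", operands
    return {
        "qp": match.group("qp"),
        "mnemonic": mnemonic,
        "completers": completers,
        "dests": _split_operands(dest_text),
        "srcs": _split_operands(src_text),
    }


parsed = parse_instruction("(p1) cmp.eq.unc p2, p3 = r4, r5")
```

In this example `parsed["mnemonic"]` is "cmp", the completers are "eq" and "unc", and the compare writes two predicate destinations.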

67
IA-64 Instruction Set

ALU Instructions (A-Unit)
Integer ALU: add, sub, xor, or, and, andcm, shl,
shladd
Integer Compare: cmp.crel.ctype <source and
destination operands>
crel is one of eq, ne, lt, le, gt, ge, ltu,
leu, gtu, geu
ctype is one of none, unc, or, and, or.andcm,
orcm, andcm, and.orcm
operands are registers or imm8
Multimedia: pshladd2, padd1, psub1, pavg1,
pcmp4.eq

68
IA-64 Instruction Set

Integer Instructions (I-Unit)
Multimedia and Variable Shifts: unpack1.h,
pshl4, mux2, unpack4.l, pmpy2.r, mix1.r
Integer Shifts: shrp, extr.u, dep.z
Test Bit: tbit.z, tbit.z.unc, tbit.nz.or.andcm,
tnat.z, tnat.nz.or.andcm
Miscellaneous: break.i, nop.i, mov, mov.i,
czx1.r

69
IA-64 Instruction Set

Memory Instructions (M-Unit)
Load/Store: ld1.ldhint, ld8.bias.ldhint,
ld8.c.clr.acq.ldhint
Line Prefetch: lfetch.fault.excl.lfhint
Semaphores and miscellaneous: chk.s.m, break.m,
invala, srlz.d, flushrs, loadrs,
cmpxchg8.rel.ldhint

70
IA-64 Instruction Set

Branch and Call Instructions (B-Unit)
Branches: br.btype.bwh.ph.dh target
btype is one of the branch types none, cond, call,
ret, ia, cloop, ctop, cexit, wtop, wexit
bwh is one of the branch whether hints spnk, sptk,
dpnk, dptk
ph is one of the sequential prefetch hints none,
few, many
dh is one of the branch cache deallocation hints
none, clr
Branch Predict and Nop: brp.ipwh.ih,
brp.ret.indwh.ih
Miscellaneous: break.b, nop.b, cover, clrrrb,
bsw.1, epc

71
IA-64 Instruction Set

Floating-Point Instructions (F-Unit) (CECS440
supported?)
Arithmetic: fma, fnma, xma.l
Parallel Floating-point Select: fselect
Compare and Classify: fcmp, fclass
Approximation: fpsqrta, frcpa
Minimum/Maximum and Parallel Compare: fmin,
fmax, fpcmp.eq
Merge and Logical: fmerge, fmix, fpack, fand,
for, fxor
Conversion: fcvt.fx, fpcvt.fxu.trunc
Status Field Manipulation: fchkf.sf, fclrf.sf,
fsetc.sf
Miscellaneous: break.f, nop.f

72
IA-64 Instruction Set

Extended and Long Instructions (X-Unit)
Miscellaneous: break.x, nop.x
Move Long Immediate64: movl
Long Branches: brl

73
Future Generations

Itanium 2 line (incompatible with Itanium/Merced)
McKinley (2002) - 180 nm process, on-die L3 cache
(3 MB), additional execution units, faster system
bus
Madison (2003) - around 1.5 GHz, 130 nm process,
copper technology
Deerfield (2003) - low-end version of Madison
Montecito (2004) - adds capabilities including
multi-threading and multi-core (may be deferred
for Chivano)
Chivano (Shavano?) (2005-6) - 95 nanometer
process

74
Summary