The Intel Itanium ISA Synopsis In Brief

Slides: 77
Provided by: Chris1671

1
The Intel Itanium ISA: Synopsis In Brief
  • Paginated and edited, with some development, by
    Charles Pickman
  • (Original presentation by Scott Parmenter & Chris
    Levy, Fall 2002)

22-Dec-02
2
History
  • 1994 - Intel & Hewlett-Packard begin work on
    Itanium (codename Merced).
  • July 1999 - Company officials stress that only
    prototypes are to be released in mid-1999 and
    that volume production is still on schedule for
    mid-2000.
  • July 2000 - Chip rollout delayed until the 1st
    half of 2001. By now, 15,000 units have been
    delivered for evaluation purposes.
  • August 2000 - Intel publicly demonstrates a
    four-CPU Itanium at LinuxWorld and at the Intel
    Developer Forum. However, another revision is
    needed before rollout.
  • McKinley (Itanium II) is due to arrive in the
    second half of 2001; Merced is now irrelevant.
  • January 2000 - Merced delayed again, until Q2.
  • December 2000 - After the 1st full quarter of
    Itanium sales:
    • 500 units were sold
    • 2,000 were shipped for demonstration purposes
  • January 2002 - Intel denies speculation about
    secretly developing Yamhill Technology, a 64-bit
    chip based on the x86 design, as a backup in case
    Itanium fails.
  • Reported development cost of Itanium to date:
    over $1 billion.
  • October 2002 - A federal court rules that Intel
    used patented Intergraph technology in Itanium.

3
Itanium Physical Concept
  • Designed to take complexity away from the
    processor, making the programmer's job, the
    compiler, and the assembler more complex.
  • 3 x 5 inch cartridge
  • CPU and L3 cache
  • 130 W power
  • 420 mm²
  • Transistors:
    • CPU: 25 million
    • L3 cache: 300 million

4
Itanium CPU Layout
5
System Data Bus
  • Increased bus efficiency through enhanced
    deferred transactions
  • 266 MHz
  • Throughput up to 2.1GByte/sec
  • 64-bit wide bus
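
The throughput figure follows from the bus width and clock listed above. A minimal sketch of the arithmetic, assuming one 64-bit transfer per 266 MHz cycle (the function name is hypothetical, chosen for illustration):

```python
def peak_bus_bandwidth(width_bits: int, transfers_per_second: int) -> int:
    """Peak bandwidth in bytes/s = bytes per transfer x transfers per second."""
    return (width_bits // 8) * transfers_per_second


# Figures from the slide: 64-bit wide bus at 266 MHz.
bandwidth = peak_bus_bandwidth(64, 266_000_000)
print(f"{bandwidth / 1e9:.3f} GB/s")  # 2.128 GB/s, quoted on the slide as 2.1
```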

6
Addressing Modes
  • The Itanium has only one simple addressing mode,
    register indirect.
  • This reduces the amount of overhead per clock
    cycle, since it does not have to deal with the
    address-generation units required for multiple
    addressing modes.

7
Operating Modes
  • IA-64 system environment. All 64-bit features are
    fully available. Supports application
    environments of IA-32 real mode, IA-32 V86 mode,
    IA-32 protected mode, and IA-64 instruction set
    modes.
  • IA-32 application environment in IA-64 system
    environment mode. Full IA-32 instruction and
    register set compatibility. Can execute a mixture
    of IA-64 and IA-32 code by special branching
    instructions.
  • IA-32 system environment. Fully IA-32 compatible.
    This mode is used to run 32-bit operating
    systems. No 64-bit extensions are available.
    Supports real-mode, V86, and protected modes.

8
Instruction Word
  • At its heart is a VLIW core.
  • Itanium has added features to the VLIW design to
    enhance instruction grouping and scalability, and
    to permit wider parallel instruction issue.
  • Intel named this format EPIC (Explicitly
    Parallel Instruction Computing).

9
Parallelism
10
Parallelism
11
Instruction Group
  • Instructions are divided into collections that
    have no dependencies or interlocks.
  • These groups can be executed concurrently.
  • Groups can be of arbitrary length ended by stop
    bits
  • Instructions before a group stop may have
    resource dependencies

12
Instruction Bundle
  • Instructions are delivered to the processor in
    bundles
  • 2 bundles are dispatched at once (CECS440
    supported?)
  • Every instruction is packed into a bundle of
    three instructions. These instructions are 41
    bits each.
  • Each instruction bundle is 128 bits long
  • The remaining 5 bits designate a template (bundle
    type).
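
The bit budget above (3 x 41 + 5 = 128) can be sketched as a pack/unpack pair. The slide does not say where each field sits; the layout below (template in the low 5 bits, then slot 0, slot 1, slot 2) is an assumption, and the function names are hypothetical.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOTS_PER_BUNDLE = 3
BUNDLE_BITS = SLOTS_PER_BUNDLE * SLOT_BITS + TEMPLATE_BITS  # 3*41 + 5 = 128


def pack_bundle(template: int, slots: tuple) -> int:
    """Pack a 5-bit template and three 41-bit slots into one 128-bit integer.

    Assumed layout: template in bits 0-4, slot i starting at bit 5 + 41*i.
    """
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle: int):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    if not 0 <= bundle < (1 << BUNDLE_BITS):
        raise ValueError("bundle must fit in 128 bits")
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slot_mask = (1 << SLOT_BITS) - 1
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & slot_mask
        for i in range(SLOTS_PER_BUNDLE)
    )
    return template, slots
```

Packing and then unpacking returns the original fields, which is a quick check that the three slots and the template do not overlap.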

13
Instruction Bundle
  • Bundles are ordered from lowest to highest memory
    address
  • Bundles in lower memory addresses have precedence
    over bundles in higher memory addresses
  • Byte order of each bundle in memory is
    little-endian
  • Within a bundle, instructions are ordered from
    slot 0 to slot 2

14
Templates
  • Templates help direct which instructions go to
    which type of execution unit
  • The 5-bit template field has 32 encodings: 24
    specify combinations of instruction types and 8
    are reserved
  • Determine the mapping between instruction slot
    and execution unit
  • Group stops are specified within a bundle

15
Templates
  • MII/MBB and MIB/MIB templates allow 6
    instructions or 8 parallel operations per clock:
    • 2 load/store operations
    • 2 general-purpose ALU operations
    • 2 post-increment ALU operations
    • 2 branch operations
  • MFI/MFI template allows 6 instructions or 12
    parallel operations per clock:
    • 4 double-precision operand loads
    • 4 double-precision floating-point operations
    • 2 integer ALU operations
    • 2 post-increment ALU operations

16
VLIW Core
  • The Itanium is a six-issue processor, meaning it
    can profitably handle six instructions
    simultaneously.

18
Execution Units
  • 4 integer units
  • 4 multi-media units
  • 2 load/store units
  • 3 branch units
  • 2 extended precision FP units
  • 2 single precision FP units

19
Execution Ports
  • Feeds instructions to Execution Units
  • 2 Integer Ports
  • 2 Integer/Load-Store Ports
  • 2 Floating Point Ports (CECS440 supported?)
  • 3 Branch Ports

20
Floating Point Units (CECS440 supported?)
  • 4 pipelined FMAC (floating-point
    multiply-accumulate) units.
  • The primary two are each capable of processing
    two single-precision, two double-precision, or two
    double-extended-precision floating-point
    operations per clock, at up to 3.2 GFLOPS.
  • Additional FMACs for 3D applications, each
    capable of processing up to two single-precision
    floating-point operations per clock, at up to
    3.2 GFLOPS.
  • Itanium has a theoretical maximum of 6.4 GFLOPS
    of single-precision floating-point calculation.
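
One reading of these figures, sketched as arithmetic. The 800 MHz core clock is an assumption (the slides do not state a clock speed), and so is treating each 3.2 GFLOPS figure as the combined rate of a pair of FMAC units; under those assumptions the numbers are consistent.

```python
CLOCK_HZ = 800_000_000  # ASSUMPTION: core clock, not stated in the slides


def peak_flops(units: int, ops_per_unit_per_clock: int, clock_hz: int = CLOCK_HZ) -> int:
    """Peak rate = units x operations per unit per clock x clocks per second."""
    return units * ops_per_unit_per_clock * clock_hz


primary = peak_flops(units=2, ops_per_unit_per_clock=2)     # two primary FMACs
additional = peak_flops(units=2, ops_per_unit_per_clock=2)  # two additional FMACs
total_single_precision = primary + additional

print(primary / 1e9, additional / 1e9, total_single_precision / 1e9)
```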

21
Floating Point Units (CECS440 supported?)
22
Execution Unit Constraints
  • FP Units are optimized for multiply-accumulate
    operations (CECS440 supported?)
  • Must add 0.0 for simple floating point
    multiplication operations (CECS440 supported?)
  • Must multiply by 1.0 for all simple floating
    point addition operations (CECS440 supported?)
  • Integer Units do not multiply
  • Integer pair must be transferred from
    general-purpose registers to floating point
    registers for a multiply-accumulate operation
    (CECS440 supported?)
  • Integer Result is transferred back to a
    general-purpose register
  • Noticeable performance penalty
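
A minimal sketch of how plain multiplication and addition fall out of a single multiply-accumulate operation, as the first three bullets describe. The `fma` helper is a hypothetical model: it rounds twice in Python, whereas a hardware FMAC would round once. The constants 0.0 and 1.0 are the values the Floating Point Registers slide gives for fr0 and fr1.

```python
FR0 = 0.0  # read-only register that always reads as 0.0
FR1 = 1.0  # read-only register that always reads as 1.0


def fma(a: float, b: float, c: float) -> float:
    """Model of multiply-accumulate: a * b + c (not truly fused in Python)."""
    return a * b + c


def fmul(a: float, b: float) -> float:
    """Simple multiplication: multiply, then add 0.0."""
    return fma(a, b, FR0)


def fadd(a: float, b: float) -> float:
    """Simple addition: multiply the first operand by 1.0, then add."""
    return fma(a, FR1, b)
```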

23
Integer Register File
  • 128 entries
  • 8 read ports
  • 6 write ports
  • Post-increment performed by idle ALU and write
    ports
  • 2 clock cycles are needed to access a register
    due to number of registers

24
Floating Point Register File (CECS440 supported?)
  • 128 entries
  • 8 read ports
  • 4 write ports, separated into odd and even banks
  • Supports double extended-precision arithmetic

25
Predicate Register file
  • 64 1-bit entries
  • 15 read ports
  • 11 write ports

26
Registers
  • 128 General registers
  • 128 Floating-point registers (CECS440 supported?)
  • 64 Predicate registers
  • 8 Branch registers
  • 128 Application registers
  • Instruction Pointer (IP) register
  • Registers are referred to by a mnemonic denoting
    the register type and a number. For example,
    general register 32 is named r32.

27
IA-32 Register Set
  • Itanium is fully backward compatible with Intel's
    x86 family
  • The x86 register set has been superimposed onto
    the Itanium's 64-bit register sets.

28
General registers
  • General Registers
  • Itanium provides 128 64-bit general purpose
    registers for all integer and multimedia
    computation.
  • Register gr0 is a read-only register and is
    always zero (0).
  • 32 registers are static and global to the
    process.
  • 96 registers are stacked. These registers are for
    argument passing and local register stack frame.
  • Each register has an associated NaT bit,
    indicating whether the value stored in the
    register is valid.

29
Floating Point Registers (CECS440 supported?)
  • 128 82-bit floating-point registers, for
    floating-point computations.
  • 32 static floating-point registers
  • 96 rotating floating-point registers, for
    software pipelining
  • The first two registers (fr0 and fr1) are
    read-only
  • fr0 is read as 0.0
  • fr1 is read as 1.0.
  • Each register contains three fields
  • 64-bit significand field
  • 17-bit exponent field
  • 1-bit sign field.
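
The three fields add up to the 82-bit register width (64 + 17 + 1). A minimal sketch of packing and decoding them follows. The field order (sign on top, then exponent, then significand), the exponent bias of 0xFFFF, and the explicit integer bit in the significand are assumptions not stated on the slide; special encodings such as NaN, infinity, and NaTVal are not handled.

```python
import math

SIGNIFICAND_BITS = 64
EXPONENT_BITS = 17
SIGN_BITS = 1
REGISTER_BITS = SIGNIFICAND_BITS + EXPONENT_BITS + SIGN_BITS  # 82
EXPONENT_BIAS = 0xFFFF  # ASSUMPTION: bias for the 17-bit exponent


def pack_fr(sign: int, exponent: int, significand: int) -> int:
    """Pack the three fields into one 82-bit integer (assumed field order)."""
    if sign not in (0, 1):
        raise ValueError("sign is a single bit")
    if not 0 <= exponent < (1 << EXPONENT_BITS):
        raise ValueError("exponent must fit in 17 bits")
    if not 0 <= significand < (1 << SIGNIFICAND_BITS):
        raise ValueError("significand must fit in 64 bits")
    return (sign << 81) | (exponent << SIGNIFICAND_BITS) | significand


def unpack_fr(raw: int):
    """Return (sign, exponent, significand) from an 82-bit register image."""
    significand = raw & ((1 << SIGNIFICAND_BITS) - 1)
    exponent = (raw >> SIGNIFICAND_BITS) & ((1 << EXPONENT_BITS) - 1)
    sign = (raw >> 81) & 1
    return sign, exponent, significand


def fr_to_float(raw: int) -> float:
    """Decode a finite value: (-1)^sign * significand * 2^(exponent - bias - 63)."""
    sign, exponent, significand = unpack_fr(raw)
    magnitude = math.ldexp(significand, exponent - EXPONENT_BIAS - 63)
    return -magnitude if sign else magnitude


FR1_IMAGE = pack_fr(0, EXPONENT_BIAS, 1 << 63)  # decodes to 1.0 under these assumptions
```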

30
Predicate Registers
  • 64 one-bit predicate registers control the
    execution of instructions. When the value of
    a predicate register is true (1), the instruction
    is executed. The predicate registers enable:
    • validating/invalidating instructions
    • eliminating branches in if/then/else logic blocks
  • If the predicate is false, the instruction may
    still execute, but there is no write-back.
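
A minimal sketch of how predication removes the branch from an if/then/else block. This is a software model, not Itanium code: the register names, the `predicated_write` helper, and the example function are hypothetical. Both arms are computed, and the predicate only decides which result is written back.

```python
def predicated_write(predicate: bool, regs: dict, dest: str, value: int) -> None:
    """Write-back stage: the result is kept only if the predicate is true."""
    if predicate:
        regs[dest] = value


def abs_diff(a: int, b: int) -> int:
    """Branch-free |a - b|: both arms are computed, one result survives."""
    regs = {"r1": a, "r2": b, "r3": 0}
    # A compare sets a pair of complementary predicates.
    p1 = regs["r1"] > regs["r2"]
    p2 = not p1
    # "then" arm, qualified by p1
    predicated_write(p1, regs, "r3", regs["r1"] - regs["r2"])
    # "else" arm, qualified by p2
    predicated_write(p2, regs, "r3", regs["r2"] - regs["r1"])
    return regs["r3"]
```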

31
Branch Registers
  • Eight 64-bit branch registers are used to specify
    branch target addresses.
  • The branch registers streamline call/return
    branching

32
Application registers
  • 128 special purpose registers are used for
    various functions.
  • The first eight K0-K7 are the kernel registers,
    used to communicate info between the kernel and
    an application.

Instruction Pointer
  • The 64-bit instruction pointer holds the address
    of the bundle of the currently executing
    instruction.
  • The IP cannot be directly read or written, it
    increments as instructions are executed
  • Branch instructions set the IP to a new value.

33
NaT and NaTVal Registers
  • There are 128 single-bit Not-a-Thing (NaT)
    registers and 128 single-bit Not-a-Thing-Value
    (NaTVal) registers
  • There is one NaT bit for every GP register
  • There is one NaTVal bit for every FP register
    (CECS440 supported?)
  • These bits are used under speculative execution
    to indicate deferred exceptions.
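
A minimal software model of a deferred exception, assuming a speculative load that marks its destination NaT instead of faulting, arithmetic that propagates the NaT bit, and a later check that runs recovery code. The class and function names are hypothetical; memory is modeled as a plain dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Reg:
    """A general register value together with its NaT bit."""
    value: int = 0
    nat: bool = False


def speculative_load(memory: Dict[int, int], address: int) -> Reg:
    """Load early; on a bad address, defer the exception by setting NaT."""
    if address in memory:
        return Reg(memory[address])
    return Reg(0, nat=True)


def add(a: Reg, b: Reg) -> Reg:
    """The NaT bit propagates through dependent computation."""
    return Reg(a.value + b.value, nat=a.nat or b.nat)


def check(reg: Reg, recover: Callable[[], Reg]) -> Reg:
    """At the point the value is really needed, run recovery if NaT is set."""
    return recover() if reg.nat else reg
```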

34
Register Stacking
  • Reduces function call and return overhead by
    ensuring that all procedural input and output
    parameters are in specific register locations,
    without requiring the compiler to perform
    register-register or memory-register moves

35
Register Stack
  • The general register stack is divided into two
    subsets:
  • Static: the first 32 physical registers (r0-r31)
    are permanent registers, visible to all
    procedures (global).
  • Stacked: the other 96 physical registers
    (r32-r127) behave like a stack. The procedure
    code allocates up to 96 input and output
    registers for a procedure frame.

36
RSE
  • What happens when you run out of registers?
  • The Itanium uses a hardware mechanism called the
    Register Stack Engine (RSE), which operates
    transparently in the background. Once the
    physical limit has been reached, it pushes and
    pops registers for you, giving an application an
    apparently unlimited number of physical
    registers.

37
Register Frames
  • Pushing and popping Itanium's large registers
    every time there is a function call would be
    time-consuming.
  • The ALLOC instruction shifts the apparent
    arrangement of the general-purpose registers so
    that parameters appear to be passed from one
    function to another through shared physical
    registers.
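
A minimal sketch of the renaming idea, assuming that a call moves the frame base so the caller's output registers become the callee's first stacked registers (r32 onward). The class and method names are hypothetical, and the RSE's spilling to memory when the 96 physical registers run out is not modeled.

```python
NUM_STACKED = 96    # physical stacked registers, named r32-r127
FIRST_STACKED = 32


class RegisterStack:
    def __init__(self) -> None:
        self.physical = [0] * NUM_STACKED
        self.base = 0          # physical index that the current r32 maps to
        self.frame_size = 0    # inputs + locals + outputs
        self.locals_size = 0   # inputs + locals
        self.saved = []        # caller frames, restored on return

    def alloc(self, inputs: int, locals_: int, outputs: int) -> None:
        """Resize the current frame (model of the ALLOC instruction)."""
        size = inputs + locals_ + outputs
        if size > NUM_STACKED:
            raise ValueError("a frame holds at most 96 registers")
        self.frame_size = size
        self.locals_size = inputs + locals_

    def _index(self, reg: int) -> int:
        offset = reg - FIRST_STACKED
        if not 0 <= offset < self.frame_size:
            raise IndexError(f"r{reg} is outside the current frame")
        return (self.base + offset) % NUM_STACKED

    def read(self, reg: int) -> int:
        return self.physical[self._index(reg)]

    def write(self, reg: int, value: int) -> None:
        self.physical[self._index(reg)] = value

    def call(self) -> None:
        """Caller's outputs become the callee's registers starting at r32."""
        self.saved.append((self.base, self.frame_size, self.locals_size))
        self.base = (self.base + self.locals_size) % NUM_STACKED
        self.frame_size -= self.locals_size
        self.locals_size = 0

    def ret(self) -> None:
        """Restore the caller's view of the registers."""
        self.base, self.frame_size, self.locals_size = self.saved.pop()
```

In this model no register contents are copied on a call or return; only the mapping from register names to physical registers changes.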

38
Register Frames
39
Memory Organization
  • The Itanium architecture defines a single,
    uniform, linear address space of 2^64 bytes (only
    2^50 used).
  • All code is stored in little-endian byte order in
    memory. Data is typically stored in little-endian
    byte order. The Itanium architecture also
    provides support for big-endian operating
    systems.
  • Memory access is provided solely through explicit
    register load and store instructions.
  • The large number of registers in the Itanium
    architecture enables multiple computations to be
    performed without having to store temporary data
    in memory. This reduces the number of memory
    accesses.

40
Virtual Memory Support (CECS440 supported?)
  • 64-bit flat addressing space, indexed by general
    registers
  • IA-32 32-bit addresses are zero-extended into the
    64-bit virtual address space
  • Divided into 8 virtual regions of 2^61 bytes each
  • Each virtual region has an associated region
    register that specifies a 24-bit region identifier
    (unique address space number)
  • 8 of the possible 2^24 address spaces are
    concurrently available via region registers
  • Regions can be combined to form larger regions
    (62-, 63-, or 64-bit regions)
  • All IA-32 memory references are through region 0
  • Page size ranges from 4 KB to 4 GB
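
Eight regions of 2^61 bytes each mean that the top 3 bits of a 64-bit virtual address select the region and the low 61 bits are the offset within it. A minimal sketch of that split and of the region-register lookup; the function names and the sample region identifiers are hypothetical.

```python
REGION_NUMBER_BITS = 3
OFFSET_BITS = 61
NUM_REGIONS = 1 << REGION_NUMBER_BITS   # 8
REGION_SIZE = 1 << OFFSET_BITS          # 2^61 bytes
REGION_ID_BITS = 24


def split_virtual_address(va: int):
    """Return (virtual region number, offset within the region)."""
    if not 0 <= va < (1 << 64):
        raise ValueError("virtual addresses are 64 bits")
    return va >> OFFSET_BITS, va & (REGION_SIZE - 1)


def lookup_region_id(va: int, region_registers: list) -> int:
    """Select the 24-bit region identifier from one of the 8 region registers."""
    if len(region_registers) != NUM_REGIONS:
        raise ValueError("there are exactly 8 region registers")
    region_number, _ = split_virtual_address(va)
    region_id = region_registers[region_number]
    if not 0 <= region_id < (1 << REGION_ID_BITS):
        raise ValueError("region identifiers are 24 bits")
    return region_id
```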

41
Region Register Format


42
Physical Memory Addressing
  • Objects in memory and I/O occupy common 63-bit
    byte-accessible physical address space
  • Physical memory addresses performed via virtual
    addresses mapped to the 63-bit physical address
    space or by direct physical addressing
  • Current page table formats allow for mapping
    virtual addresses into 50 bits of physical
    address space (future extensions to allow for
    larger mappings)

43
Physical Memory Addressing

44
Cache Hierarchy (1 of 3)
  • Split 32 KB L1 on die
  • Separate instruction and data caches (16 KB each)
  • 4-way set-associative, fully pipelined with
    32-byte cache line size
  • Can sustain 2 loads, 2 stores, or 1 load and 1
    store per clock
  • Write-through with no write allocation
  • Data cache only used for Integer Unit (not FP
    Unit)
  • Integer load hits have a 2-cycle latency
  • Load from addresses stored within last 3 cycles
    (Store to load forwarding) bypasses cache

45
Cache Hierarchy (2 of 3)
  • Unified 96 KB L2 on die
  • 6-way set-associative
  • Fully pipelined with 64-byte cache line size
  • Write-back with write-allocate policy
  • 4 Loads per clock cycle
  • Can sustain 2 memory operations per clock
  • Integer load hits have a 6-cycle latency
  • FP load hits have a 9-cycle latency

46
Cache Hierarchy (3 of 3)
  • Unified 2 MB or 4 MB L3 on cartridge (separate
    die)
  • Runs at full processor frequency
  • 4-way set-associative
  • Fully pipelined with 64-byte cache line size to
    provide 12.8 GB/s using a 128-bit wide cache
    bus
  • Integer load hits have a 21-cycle latency
  • FP load hits have a 24-cycle latency (CECS440
    supported?)
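
Because the L3 runs at full processor frequency, the bandwidth and bus width above imply a clock rate. A minimal sketch of that back-calculation, assuming one 128-bit transfer per clock and reading the quoted bandwidth as 12.8 gigabytes per second (the function name is hypothetical):

```python
def implied_clock_hz(bandwidth_bytes_per_s: int, bus_width_bits: int) -> int:
    """Clock = bandwidth / bytes per transfer, assuming one transfer per clock."""
    bytes_per_transfer = bus_width_bits // 8
    return bandwidth_bytes_per_s // bytes_per_transfer


clock = implied_clock_hz(12_800_000_000, 128)
print(f"{clock / 1e6:.0f} MHz")  # 800 MHz
```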

48
Branch Prediction
  • Only one form of conditional branch, but it can
    be based on 1 of 64 predicate bits (located in
    predicate registers)
  • Branches can be static or dynamic
  • Dynamic branches can be overridden by CPU
  • Can specify how far ahead of time to fetch the
    predicted branch

49
Speculation
  • Itanium will try to conform to
    programmer-specified loads if the system bus may
    be busy
  • Data speculation: loads data in advance, allowing
    the memory bus to be fully utilized
  • Code speculation: executes instructions in advance;
    results are held until it is safe to store them in
    memory

50
Speculation
51
Predication
  • All conditional instructions are predicated
  • Avoids short branches that inject bubbles into
    the pipeline
  • Executes both branch paths simultaneously
  • Discards irrelevant path as predicate is
    evaluated

52
Predication
53
Predication
54
Pipeline
  • Itanium has a 10-stage pipeline
  • 256 bits wide, handling 2 bundles at a time
  • Compared with:
    • Pentium III: 12 stages
    • Pentium 4: 20 stages
    • Itanium 2 (McKinley): 8 stages

64
Instructions
  • A typical Itanium instruction is a three-operand
    instruction, with the following syntax:
  • Simple instruction:
  • add r1 = r2, r3

66
Instruction Syntax: (qp) mnemonic[.comp1][.comp2]
dests = srcs
  • (qp): A qualifying predicate is a predicate
    register indicating whether or not the
    instruction is executed.
  • mnemonic: A unique name identifying the
    instruction.
  • [.comp1][.comp2]: Some instructions may include
    one or more completers. Completers indicate
    optional variations on the basic mnemonic.
  • dests, srcs: Destination and source registers
67
IA-64 Instruction Set
  • ALU Instructions (A-Unit)
  • Integer ALU: add, sub, xor, or, and, andcm, shl,
    shladd
  • Integer Compare: cmp.crel.ctype <source and
    destination operands>
  • crel is one of: eq, ne, lt, le, gt, ge, ltu,
    leu, gtu, geu
  • ctype is one of: none, unc, or, and, or.andcm,
    orcm, andcm, and.orcm
  • Operands are registers or imm8
  • Multimedia: pshladd2, padd1, psub1, pavg1,
    pcmp4.eq

68
IA-64 Instruction Set
  • Integer Instructions (I-Unit)
  • Multimedia and Variable Shifts: unpack1.h,
    pshl4, mux2, unpack4.l, pmpy2.r, mix1.r
  • Integer Shifts: shrp, extr.u, dep.z
  • Test Bit: tbit.z, tbit.z.unc, tbit.nz.or.andcm,
    tnat.z, tnat.nz.or.andcm
  • Miscellaneous: break.i, nop.i, mov, mov.i,
    czx1.r

69
IA-64 Instruction Set
  • Memory Instructions (M-Unit)
  • Load/Store: ld1.ldhint, ld8.bias.ldhint,
    ld8.c.clr.acq.ldhint
  • Line Prefetch: lfetch.fault.excl.lfhint
  • Semaphores: chk.s.m, break.m, invala, srlz.d,
    flushrs, loadrs, cmpxchg8.rel.ldhint

70
IA-64 Instruction Set
  • Branch and Call Instructions (B-Unit)
  • Branches: br.btype.bwh.ph.dh target
  • btype is one of the branch types: none, cond,
    call, ret, ia, cloop, ctop, cexit, wtop, wexit
  • bwh is one of the branch whether hints: spnk,
    sptk, dpnk, dptk
  • ph is one of the sequential prefetch hints: none,
    few, many
  • dh is one of the branch cache deallocation hints:
    none, clr
  • Branch Predict and Nop: brp.ipwh.ih,
    brp.ret.indwh.ih
  • Miscellaneous: break.b, nop.b, cover, clrrrb,
    bsw.1, epc

71
IA-64 Instruction Set
  • Floating-Point Instructions (F-Unit) (CECS440
    supported?)
  • Arithmetic: fma, fnma, xma.l
  • Parallel Floating-point Select: fselect
  • Compare and Classify: fcmp, fclass
  • Approximation: fpsqrta, frcpa
  • Minimum/Maximum and Parallel Compare: fmin,
    fmax, fpcmp.eq
  • Merge and Logical: fmerge, fmix, fpack, fand,
    for, fxor
  • Conversion: fcvt.fx, fpcvt.fxu.trunc
  • Status Field Manipulation: fchkf.sf, fclrf.sf,
    fsetc.sf
  • Miscellaneous: break.f, nop.f

72
IA-64 Instruction Set
  • Extended and Long Instructions (X-Unit)
  • Miscellaneous: break.x, nop.x
  • Move Long Immediate (64-bit): movl
  • Long Branches: brl

73
Future Generations
  • Itanium 2 line (incompatible with Itanium/Merced)
  • McKinley (2002): 180 nm process, on-die L3 cache
    (3 MB), additional execution units, faster system
    bus
  • Madison (2003): around 1.5 GHz, 130 nm etching,
    copper technology
  • Deerfield (2003): low-end version of Madison
  • Montecito (2004): adds capabilities including
    multi-threading and multi-core (may be deferred
    for Chivano)
  • Chivano (Shavano?) (2005-6): 95 nanometer
    process

74
Summary
  • Ambitious chip with many sophisticated features
  • Many factors outside Intel's control
  • Will vendors sell it? (Dell, etc.)
  • Will new compilers take advantage of its features?

75
Will applications be written/ported for Itanium?
  • How will Sun and AMD do?
  • Always at least niche market
  • Will IT departments risk upgrading the needed HW & SW?
  • San Diego Supercomputer Center

76
Links to Source Material
  • Itanium Architecture Manuals
  • http://www.intel.com/design/itanium/manuals/
  • http://devrsrc1.external.hp.com/devresource/Docs/Refs/IA64ISA/
  • Official Presentations
  • http://www.hpdutchworld.nl/presentaties/Images/Itanium20processor.htm
  • http://www.cp.eng.chula.ac.th/faculty/pjw/teaching/ads/ia64/ia64-microarch.pdf
  • http://www.intel.com/design/Itanium/microarch_ovw/
  • 3rd Party Overviews
  • http://www.geek.com/procspec/features/itanium/
  • http://www.extremetech.com/article2/0,3973,231,00.asp
  • News
  • http://news.com.com/2100-1001-276880.html
  • http://www.geek.com/news/geeknews/2002Sep/gee20020930016552.htm
  • http://www.eweek.com/article2/0,3959,147688,00.asp