Title: IA-64 Architecture (Think Intel Itanium)
1. IA-64 Architecture (Think Intel Itanium)
Also known as EPIC (Explicitly Parallel Instruction Computing), a new kind of superscalar computer
HW 5 - Due 12/4
Please clean up boards in lab by Dec 3:
- Put good wires in the box
- Take chips off of the board using chip puller
- Put parts away in the proper bins.
THANKS!
2. Superpipelined and Superscalar Machines
- Superpipelined machine
  - Superpipelined machines overlap pipe stages
  - Relies on stages being able to begin operations before the last is complete.
- Superscalar machine
  - A superscalar machine employs multiple independent pipelines to execute multiple independent instructions in parallel.
  - Particularly common instructions (arithmetic, load/store, conditional branch) can be executed independently.
3. Why a New Architecture Direction?
- Processor designers' obvious choices for use of the increasing number of transistors on chip and extra speed:
  - Bigger caches → diminishing returns
  - Increase degree of superscaling by adding more execution units → complexity wall: more logic, need improved branch prediction, more renaming registers, more complicated dependencies.
  - Multiple processors → challenge to use them effectively in general computing
  - Longer pipelines → greater penalty for misprediction
4. IA-64 Background
- Explicitly Parallel Instruction Computing (EPIC)
- Jointly developed by Intel and Hewlett-Packard (HP)
- New 64-bit architecture
  - Not extension of x86 series
  - Not adaptation of HP 64-bit RISC architecture
- To exploit increasing chip transistors and increasing speeds
- Utilizes systematic parallelism
- Departure from superscalar trend
- Note: Became the architecture of the Intel Itanium
5. Basic Concepts for IA-64
- Instruction level parallelism
  - EXPLICIT in machine instruction, rather than determined at run time by processor
- Long or very long instruction words (LIW/VLIW)
  - Fetch bigger chunks already preprocessed
- Predicated execution
  - Marking groups of instructions for a late decision on execution.
- Control speculation
  - Go ahead and fetch and decode instructions, but keep track of them so the decision to issue them, or not, can be practically made later
- Data speculation (or speculative loading)
  - Go ahead and load data early so it is ready when needed, and have a practical way to recover if speculation proved wrong
- Software pipelining
  - Multiple iterations of a loop can be executed in parallel
6. General Organization
7. Predicate Registers
- Used as a flag for instructions that may or may not be executed.
- A set of instructions is assigned a predicate register when it is uncertain whether the instruction sequence will actually be executed (think branch).
- Only instructions with a predicate value of true are executed.
- When it is known that the instruction is going to be executed, its predicate is set. All instructions with that predicate true can now be completed.
- Those instructions with predicate false are now candidates for cleanup.
8. Predication
9. Speculative Loading
10. General Organization
11. IA-64 Key Hardware Features
- Large number of registers
  - IA-64 instruction format assumes 256 registers
    - 128 64-bit integer, logical, general purpose
    - 128 82-bit floating point and graphic
  - 64 1-bit predicated execution registers
  - (To support high degree of parallelism)
- Multiple execution units
  - Probably pipelined
  - 8 or more?
12. IA-64 Register Set
13. Relationship between Instruction Type and Execution Unit
14. IA-64 Execution Units
- I-Unit
  - Integer arithmetic
  - Shift and add
  - Logical
  - Compare
  - Integer multimedia ops
- M-Unit
  - Load and store
    - Between register and memory
  - Some integer ALU operations
- B-Unit
  - Branch instructions
- F-Unit
  - Floating point instructions
15. Instruction Format Diagram
16. Instruction Format
- 128-bit bundles
  - Can fetch one or more bundles at a time
  - Bundle holds three instructions plus template
- Instructions are usually 41 bits long
  - Have associated predicated execution registers
- Template contains info on which instructions can be executed in parallel
  - Not confined to single bundle
  - e.g. a stream of 8 instructions may be executed in parallel
  - Compiler will have re-ordered instructions to form contiguous bundles
  - Can mix dependent and independent instructions in same bundle
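The bundle layout described above can be sketched in Python. This is a minimal illustration, not the hardware definition: it assumes the template is 5 bits wide and sits in the low bits (which is what makes 3 x 41 + 5 = 128 work out), and the names `pack_bundle` and `unpack_bundle` are hypothetical helpers invented for this example.

```python
TEMPLATE_BITS = 5   # assumption: template occupies the low 5 bits of the bundle
SLOT_BITS = 41      # each of the three instruction slots is 41 bits


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    assert 0 <= template < (1 << TEMPLATE_BITS)
    assert len(slots) == 3
    value = template
    for i, slot in enumerate(slots):
        assert 0 <= slot < (1 << SLOT_BITS)
        value |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return value


def unpack_bundle(value):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = value & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(value >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask for i in range(3)]
    return template, slots


# Arbitrary placeholder encodings, not real IA-64 opcodes.
bundle = pack_bundle(0x10, [0x123, 0x456, 0x789])
```

Calling `unpack_bundle(bundle)` recovers the template and the three slot values, which is roughly what the fetch stage has to do before it can route each slot to an execution unit.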
17. Field Encoding and Instruction Set Mapping
Note: A bar indicates a stop: possible dependencies with instructions after the stop
18. Assembly Language Format
- [qp] mnemonic[.comp] dest = srcs ;; // comment
- qp - predicate register
  - 1 at execution → execute and commit result to hardware
  - 0 → result is discarded
- mnemonic - name of instruction
- comp - one or more instruction completers used to qualify mnemonic
- dest - one or more destination operands
- srcs - one or more source operands
- ;; - instruction group stop (when appropriate)
  - A group is a sequence without read after write or write after write
  - Do not need hardware register dependency checks
- // - comment follows
19. Assembly Example: Register Dependency
- ld8 r1 = [r5] ;; //first group
- add r3 = r1, r4 //second group
- Second instruction depends on value in r1
  - Changed by first instruction
  - Can not be in same group for parallel execution
- Note: ;; ends the group of instructions that can be executed in parallel
20. Assembly Example: Multiple Register Dependencies
- ld8 r1 = [r5] //first group
- sub r6 = r8, r9 ;; //first group
- add r3 = r1, r4 //second group
- st8 [r6] = r12 //second group
- Last instruction stores in the memory location whose address is in r6, which is established in the second instruction
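The grouping rule in the two examples above (no read-after-write or write-after-write on a register inside one group) can be sketched as a small Python routine that decides where a stop (written `;;` in IA-64 assembly) has to go. This is a simplified illustration only: `split_into_groups` is a hypothetical helper, instructions are modelled as `(name, destination registers, source registers)` tuples, and memory dependencies are ignored.

```python
def split_into_groups(instructions):
    """Start a new group before any instruction that reads or writes a
    register already written in the current group (RAW or WAW hazard)."""
    groups, current, written = [], [], set()
    for name, dests, srcs in instructions:
        if written & (set(dests) | set(srcs)):
            groups.append(current)          # a stop goes here
            current, written = [], set()
        current.append(name)
        written |= set(dests)
    if current:
        groups.append(current)
    return groups


program = [
    ("ld8", ["r1"], ["r5"]),         # r1 = [r5]
    ("sub", ["r6"], ["r8", "r9"]),   # r6 = r8 - r9
    ("add", ["r3"], ["r1", "r4"]),   # r3 = r1 + r4, reads r1
    ("st8", [], ["r6", "r12"]),      # [r6] = r12, reads r6 as the address
]
groups = split_into_groups(program)
```

For this program the routine puts `ld8` and `sub` in the first group and `add` and `st8` in the second, matching the group comments on the slide. In IA-64 this analysis is the compiler's job, which is why the hardware does not need register dependency checks within a group.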
21. Assembly Example: Predicated Code
Consider the following program with branches
- if (a&&b)
-   j = j + 1;
- else
-   if (c)
-     k = k + 1;
-   else
-     k = k - 1;
- i = i + 1;
22. Assembly Example: Predicated Code
Pentium Assembly Code
-     cmp a, 0   ; compare with 0
-     je L1      ; branch to L1 if a = 0
-     cmp b, 0
-     je L1
-     add j, 1   ; j = j + 1
-     jmp L3
- L1: cmp c, 0
-     je L2
-     add k, 1   ; k = k + 1
-     jmp L3
- L2: sub k, 1   ; k = k - 1
- L3: add i, 1   ; i = i + 1
Source Code
- if (a&&b)
-   j = j + 1;
- else
-   if (c)
-     k = k + 1;
-   else
-     k = k - 1;
- i = i + 1;
23. Assembly Example: Predicated Code
Pentium Code
-     cmp a, 0
-     je L1
-     cmp b, 0
-     je L1
-     add j, 1
-     jmp L3
- L1: cmp c, 0
-     je L2
-     add k, 1
-     jmp L3
- L2: sub k, 1
- L3: add i, 1
IA-64 Code
-      cmp.eq p1, p2 = 0, a ;;
- (p2) cmp.eq p1, p3 = 0, b
- (p3) add j = 1, j
- (p1) cmp.ne p4, p5 = 0, c
- (p4) add k = 1, k
- (p5) add k = -1, k
-      add i = 1, i
Source Code
- if (a&&b)
-   j = j + 1;
- else
-   if (c)
-     k = k + 1;
-   else
-     k = k - 1;
- i = i + 1;
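To check that the predicated IA-64 sequence computes the same result as the branching source code, both versions can be modelled in Python. This is a rough sketch under stated assumptions: predicate registers p1 to p5 are assumed to start cleared, a compare whose qualifying predicate is false is assumed to leave its target predicates unchanged, and the function names are invented for this example.

```python
def branchy(a, b, c, i, j, k):
    """The source program, written with ordinary branches."""
    if a and b:
        j += 1
    elif c:
        k += 1
    else:
        k -= 1
    i += 1
    return i, j, k


def predicated(a, b, c, i, j, k):
    """The IA-64 version: straight-line code, each step guarded by a predicate."""
    p = {n: False for n in range(1, 6)}    # p1..p5, assumed initially cleared
    p[1], p[2] = (a == 0), (a != 0)        #      cmp.eq p1, p2 = 0, a
    if p[2]:                               # (p2) cmp.eq p1, p3 = 0, b
        p[1], p[3] = (b == 0), (b != 0)
    if p[3]:                               # (p3) add j = 1, j
        j += 1
    if p[1]:                               # (p1) cmp.ne p4, p5 = 0, c
        p[4], p[5] = (c != 0), (c == 0)
    if p[4]:                               # (p4) add k = 1, k
        k += 1
    if p[5]:                               # (p5) add k = -1, k
        k -= 1
    i += 1                                 #      add i = 1, i
    return i, j, k
```

The Python `if` statements stand in for the qualifying predicate: on the real machine every instruction is fetched and issued in sequence, and the predicate only decides whether its result is committed.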
24. Example of Predication
25. Data Speculation
- Load data from memory before needed
- What might go wrong?
  - Load moved before store that might alter memory location
  - Need subsequent check in value
26. Assembly Example: Data Speculation
Consider the following program
- (p1) br some_label // cycle 0
- ld8 r1 = [r5] ;; // cycle 0 (indirect memory op, 2 cycles)
- add r1 = r1, r3 // cycle 2
27. Assembly Example: Data Speculation
Consider the following program
Original code
- (p1) br some_label //cycle 0
- ld8 r1 = [r5] ;; //cycle 0
- add r1 = r1, r3 //cycle 2
Speculated code
- ld8.s r1 = [r5] ;; //cycle -2
- // other instructions
- (p1) br some_label //cycle 0
- chk.s r1, recovery //cycle 0
- add r2 = r1, r3 //cycle 0
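The speculated sequence above hoists the load over the branch, so a fault on the load must not be reported unless the load result is actually needed. A minimal Python sketch of that idea follows. It is a simplified model, not the hardware behaviour: IA-64 records the deferred fault in a NaT ("not a thing") bit attached to the target register, modelled here as a boolean, and `MEMORY`, `Fault`, and `run` are hypothetical names used only for illustration.

```python
class Fault(Exception):
    """Stands in for a memory exception such as a page fault."""


MEMORY = {0x100: 7}   # hypothetical memory: only address 0x100 is mapped


def load(addr):
    """Non-speculative load: faults immediately on an unmapped address."""
    if addr not in MEMORY:
        raise Fault(hex(addr))
    return MEMORY[addr]


def speculative_load(addr):
    """Model of ld8.s: return (value, nat); a fault is deferred by setting nat."""
    try:
        return load(addr), False
    except Fault:
        return 0, True


def run(p1, r5, r3):
    r1, nat = speculative_load(r5)     # ld8.s r1 = [r5], hoisted above the branch
    if p1:                             # (p1) br some_label
        return "branched", None        # load result never used, fault never raised
    if nat:                            # chk.s r1, recovery
        r1 = load(r5)                  # recovery code: redo the load for real
    return "fell through", r1 + r3     # add r2 = r1, r3
```

With `p1` true and an unmapped address the program branches away without ever seeing the fault; with `p1` false the check triggers the recovery load, which raises the exception at the point where the original program would have.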
28. Assembly Example: Data Speculation
Consider the following program
- st8 [r4] = r12 //cycle 0
- ld8 r6 = [r8] ;; //cycle 0 (indirect memory op, 2 cycles)
- add r5 = r6, r7 ;; //cycle 2
- st8 [r18] = r5 //cycle 3
What if r4 and r8 point to the same address?
29. Assembly Example: Data Speculation
Consider the following program
With data speculation
- ld8.a r6 = [r8] ;; //cycle -2, adv
- // other instructions
- st8 [r4] = r12 //cycle 0
- ld8.c r6 = [r8] //cycle 0, check
- add r5 = r6, r7 ;; //cycle 0
- st8 [r18] = r5 //cycle 1
Without data speculation
- st8 [r4] = r12 //cycle 0
- ld8 r6 = [r8] ;; //cycle 0
- add r5 = r6, r7 ;; //cycle 2
- st8 [r18] = r5 //cycle 3
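The advanced load and its check can be sketched in Python. On IA-64 the bookkeeping is done by a hardware structure called the Advanced Load Address Table (ALAT); the class below is a much simplified model of it, and the memory contents and the `run` helper are hypothetical values chosen for this example.

```python
class ALAT:
    """Toy model: remembers which address each advanced load read from."""

    def __init__(self):
        self.entries = {}                        # register name -> address

    def advanced_load(self, mem, reg, addr):     # ld8.a reg = [addr]
        self.entries[reg] = addr
        return mem[addr]

    def store(self, mem, addr, value):           # st8 [addr] = value
        mem[addr] = value
        # Any advanced load from the same address is no longer trustworthy.
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check_load(self, mem, reg, addr, current):   # ld8.c reg = [addr]
        if reg in self.entries:
            return current                       # speculation held, keep value
        return mem[addr]                         # entry was removed, reload


def run(r4, r8, r12, r7):
    mem = {0x10: 1, 0x20: 2}                     # hypothetical memory contents
    alat = ALAT()
    r6 = alat.advanced_load(mem, "r6", r8)       # ld8.a r6 = [r8]
    alat.store(mem, r4, r12)                     # st8 [r4] = r12
    r6 = alat.check_load(mem, "r6", r8, r6)      # ld8.c r6 = [r8]
    return r6 + r7                               # add r5 = r6, r7
```

When `r4` and `r8` hold different addresses the early-loaded value is used as is; when they hold the same address the store removes the table entry and the check load picks up the newly stored value, so the result matches the unspeculated program in both cases.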
30. Assembly Example: Data Speculation, Data Dependencies
Speculation
- ld8.a r6 = [r8] ;; //cycle -2
- // other instructions
- st8 [r4] = r12 //cycle 0
- ld8.c r6 = [r8] //cycle 0
- add r5 = r6, r7 ;; //cycle 0
- st8 [r18] = r5 //cycle 1
Speculation with data dependency
- ld8.a r6 = [r8] ;; //cycle -3, adv ld
- // other instructions
- add r5 = r6, r7 //cycle -1, uses r6
- // other instructions
- st8 [r4] = r12 //cycle 0
- chk.a r6, recover //cycle 0, check
- back: //return pt
- st8 [r18] = r5 //cycle 0
- recover:
- ld8 r6 = [r8] ;; //get r6 from [r8]
- add r5 = r6, r7 ;; //re-execute
- br back //jump back
31. Software Pipelining
- // y[i] = x[i] + c
- L1: ld4 r4 = [r5], 4 ;; //cycle 0 load postinc 4
- add r7 = r4, r9 ;; //cycle 2
- st4 [r6] = r7, 4 //cycle 3 store postinc 4
- br.cloop L1 ;; //cycle 3
- Adds constant to one vector and stores result in another
- No opportunity for instruction level parallelism in one iteration
- Instructions in iteration x all executed before iteration x+1 begins
- If no address conflicts between loads and stores, can move independent instructions from loop x+1 to loop x
32. Pipeline - Unrolled Loop, Pipeline Display
Original loop
- L1: ld4 r4 = [r5], 4 ;; //cycle 0 load postinc 4
- add r7 = r4, r9 ;; //cycle 2
- st4 [r6] = r7, 4 //cycle 3 store postinc 4
- br.cloop L1 ;; //cycle 3
Unrolled loop
- ld4 r32 = [r5], 4 ;; //cycle 0
- ld4 r33 = [r5], 4 ;; //cycle 1
- ld4 r34 = [r5], 4 //cycle 2
- add r36 = r32, r9 ;; //cycle 2
- ld4 r35 = [r5], 4 //cycle 3
- add r37 = r33, r9 //cycle 3
- st4 [r6] = r36, 4 ;; //cycle 3
- ld4 r36 = [r5], 4 //cycle 4
- add r38 = r34, r9 //cycle 4
- st4 [r6] = r37, 4 ;; //cycle 4
- add r39 = r35, r9 //cycle 5
- st4 [r6] = r38, 4 ;; //cycle 5
- add r40 = r36, r9 //cycle 6
- st4 [r6] = r39, 4 ;; //cycle 6
- st4 [r6] = r40, 4 ;; //cycle 7
Pipeline Display
33. Unrolled Loop Observations
- Completes 5 iterations in 7 cycles
  - Compared with 20 cycles in original code
- Assumes two memory ports
  - Load and store can be done in parallel
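The cycle counts quoted above can be reproduced with a short scheduling sketch in Python. It assumes the latencies implied by the cycle comments on the slides (the add issues two cycles after its load, the store one cycle after the add, and a new iteration starts every cycle) and that each iteration of the original loop occupies four cycles (0 through 3). The function names are hypothetical.

```python
LOAD_TO_ADD = 2    # cycles between a ld4 and the add that uses it (from the slides)
ADD_TO_STORE = 1   # cycles between an add and the st4 that uses it


def pipelined_schedule(iterations):
    """Map each cycle to the operations issued in it, one new iteration per cycle."""
    schedule = {}
    for i in range(iterations):
        schedule.setdefault(i, []).append(f"ld4 #{i}")
        schedule.setdefault(i + LOAD_TO_ADD, []).append(f"add #{i}")
        schedule.setdefault(i + LOAD_TO_ADD + ADD_TO_STORE, []).append(f"st4 #{i}")
    return schedule


def sequential_cycles(iterations, cycles_per_iteration=4):
    """Cost of the original loop, where iterations do not overlap."""
    return iterations * cycles_per_iteration


schedule = pipelined_schedule(5)
last_cycle = max(schedule)          # cycle in which the final store issues
```

For five iterations the last store issues in cycle 7, against 20 cycles for the non-overlapped loop. No cycle in the schedule contains more than one load and one store, which is where the two-memory-port assumption comes from.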
34. Support for Software Pipelining
- Automatic register renaming
  - Fixed-size area of predicate and fp register files (p16-p63, fr32-fr127) and programmable-size area of gp register file (max r32-r127) capable of rotation
  - Loop using r32 on first iteration automatically uses r33 on second
- Predication
  - Each instruction in loop predicated on rotating predicate register
  - Determines whether pipeline is in prolog, kernel, or epilog
- Special loop termination instructions
  - Branch instructions that cause registers to rotate and loop counter to decrement
35. Intel's Itanium Implements the IA-64
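Register rotation can be pictured as a moving offset between the register number an instruction names and the physical register it reaches. The Python sketch below is a simplified model with invented names (`RotatingFile`, `rrb` for a rotating register base); it assumes the maximum rotating area r32-r127 and that each rotation decrements the base by one, so a value written under one name shows up under the next higher name after the loop branch.

```python
class RotatingFile:
    """Toy model of a rotating register area (here r32-r127)."""

    def __init__(self, base=32, size=96):
        self.base, self.size = base, size
        self.rrb = 0                    # rotating register base (offset)
        self.phys = [0] * size          # physical registers

    def _index(self, logical):
        return (logical - self.base + self.rrb) % self.size

    def write(self, logical, value):
        self.phys[self._index(logical)] = value

    def read(self, logical):
        return self.phys[self._index(logical)]

    def rotate(self):
        """What a loop-closing branch does to the register names."""
        self.rrb = (self.rrb - 1) % self.size


regs = RotatingFile()
regs.write(32, 111)    # iteration 1 loads into r32
regs.rotate()          # loop branch rotates the registers
regs.write(32, 222)    # iteration 2 also names r32, but hits a new physical register
```

After the rotation the first iteration's value is read as r33 while the second iteration's value sits in r32, so the same loop body can keep several iterations in flight without the compiler unrolling it by hand.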