Title: Christopher Foster
1- Christopher Foster
- Scott Thibaudeau
- Brian Cleary
2- Itanium IA-64 Overview.
- Development of the Parallel Processor
- Success and Failure (Problems and Solutions)
- Multiple Parallel Pipelines on a Single Die
- Itanium is born!
- Execution of Parallel Processing in IA-64
- 10 deep pipeline execution
- 9 Parallel distribution sites
- Current and future IA-64 code Development
- The Memory Requirements and Specifications
- Hierarchy: Registers, L1,2,3 Cache, Main Memory, HD
- L1 Data, L2 Unified, L3 Off-Chip, Fully Associative
- Latency Times
- Full Memory Block Diagram Overview
- System Management Bus (SM Bus)
- Thermal System
- EEPROM, PIROM
- System Bus (IA-64 Bus Architecture)
- Bandwidth
- Parallel Processors in Parallel
3- History of Microprocessors: A Very Abridged Tour
- Beginning of time: Circa 1980 and before
- CISC and RISC Computers are all that exist.
- MOS Technology 6502 lives in every house (Nintendo).
- Ronald Reagan in office.
- Middle Ages: Circa 1990
- Parallel Processing exists in white-papers.
- IA-32 is in almost every desktop.
- Vanilla Ice hits it big.
- Current Day: Circa 2000
- Beowulf Clusters (Distributed Parallel Processing Networks)
- Pentium breaks the GHz mark with IA-32.
- Intel develops the IA-64 Architecture to support parallelism on die.
4- So what's so good about Parallelism?
At its most efficient, each doubling of the parallel paths cuts the execution time IN HALF!
- This leads to incredible gains
- Productivity (Reduced Latency)
- Wait times for compile/execute
- Increased functionality in real-time processes
- Reliability (Redundancy)
- Multiple modules for each functional unit
- Security (Locality)
- All processors in one place (physically)
- Encryption power increased
- Scalability (Modular reuse)
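The best-case claim on this slide can be checked with a little arithmetic. The sketch below is only an illustration: `ideal_time` assumes the work splits perfectly evenly, and `limited_time` adds an assumed `parallel_fraction` parameter (not from the slides) to show how any serial portion eats into the gain.

```python
def ideal_time(serial_time, paths):
    """Best-case run time when work splits evenly across `paths` units."""
    if paths < 1:
        raise ValueError("paths must be at least 1")
    return serial_time / paths


def limited_time(serial_time, paths, parallel_fraction):
    """Run time when only `parallel_fraction` of the work can be split.

    `parallel_fraction` is an assumed input between 0.0 and 1.0.
    """
    serial_part = serial_time * (1.0 - parallel_fraction)
    parallel_part = serial_time * parallel_fraction / paths
    return serial_part + parallel_part


if __name__ == "__main__":
    for paths in (1, 2, 4, 8):
        print(paths, ideal_time(8.0, paths), limited_time(8.0, paths, 0.5))
```

With two paths the ideal time is half the serial time, but if only half the work is parallel the same two paths save just a quarter.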
5- But are there any disadvantages?
- YES
- Memory Size/Latency
- Branch Prediction
- Independent Instructions
6- IA-64 Solves All of these problems
- Memory Size: 64-bit addressing, huge register file
- Memory Latency: multiple layers of cache
- Branch Prediction: hardware solution
- Independent Instructions: new code classes
And with these problems out of the way...
7- The way is prepared for Multiple Parallel Processes on a Single Die: Explicitly Parallel Instruction Computing (EPIC)
- With resources made available, the Itanium is able to use multiple functional units for each process required.
- This results in an incredible number of separate pipelined execution paths:
- Integer Function Units (2)
- Memory Units (2)
- Branch Units (3)
- Floating Point Units (2)
- Total 9 separate execution paths!
Note: Though the focus is not on pipelining here, each unit has a 10-stage-deep pipeline.
8- Overall Architecture
9- The Full Pipeline Procedure
10- Fetch/Distribution Procedures
3 instructions per bundle x 2 bundles per clock = a full 6 instructions per clock.
M0, M1, I0, I1, F0, F1, B0, B1, B2: these are all execution pipelines.
M = Memory Units, F = Floating Point Units, I = Integer Units, B = Branch Units
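The fetch and distribution numbers above can be sketched as a toy model. This is not Intel's actual dispersal logic: the first-free-port rule and the bundle strings such as "MII" are simplifying assumptions used only to show how two three-slot bundles map onto the nine pipelines.

```python
# Execution pipelines named on the slide, grouped by unit type.
PORTS = {
    "M": ["M0", "M1"],
    "I": ["I0", "I1"],
    "F": ["F0", "F1"],
    "B": ["B0", "B1", "B2"],
}
INSTRUCTIONS_PER_BUNDLE = 3
BUNDLES_PER_CLOCK = 2


def peak_issue_width():
    """Maximum instructions issued per clock: 3 per bundle x 2 bundles."""
    return INSTRUCTIONS_PER_BUNDLE * BUNDLES_PER_CLOCK


def disperse(bundles):
    """Assign each slot to the first free pipeline of its type.

    `bundles` is a list of strings such as "MII", one letter per slot.
    Issue stops at the first slot whose unit type has no free pipeline
    (a toy rule, assumed for illustration).
    """
    free = {kind: list(ports) for kind, ports in PORTS.items()}
    issued = []
    for bundle in bundles:
        for kind in bundle:
            if not free[kind]:
                return issued
            issued.append((kind, free[kind].pop(0)))
    return issued


if __name__ == "__main__":
    print(peak_issue_width())
    print(disperse(["MII", "MFB"]))  # all six slots find a pipeline
    print(disperse(["MII", "MII"]))  # the third integer slot has to wait
```

In this model a pair of bundles only reaches the full six instructions per clock when its mix of M, I, F, and B slots fits the available pipelines.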
11- How do we write code for the Itanium?
- NEW Code Classes
- Allow the programmer to specify function units for loads, arithmetic, branch ops, and logic operations
- Enable users to specify INDEPENDENT INSTRUCTIONS
- Interpretation at OS Level
- Windows 64 (to be released as Windows XP64)
- Linux-64, HP-UX, Modesto
- PAL Level interpretation
- Possibility of Virtual Machine interface.
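The idea of marking INDEPENDENT INSTRUCTIONS can be illustrated with a small sketch. It is written in Python rather than IA-64 assembly, and the tuple format `(dest, sources)` and the grouping rule are assumptions for illustration: a new group starts whenever an instruction reads or writes a register already written in the current group.

```python
def split_into_groups(instructions):
    """Split a list of (dest, sources) tuples into independent groups.

    Instructions inside one group do not depend on each other's results,
    so in principle they could issue in the same clock.
    """
    groups = []
    current = []
    written = set()  # registers written so far in the current group
    for dest, sources in instructions:
        depends = dest in written or any(src in written for src in sources)
        if depends:
            groups.append(current)
            current, written = [], set()
        current.append((dest, sources))
        written.add(dest)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        ("r1", ("r2", "r3")),  # r1 = r2 + r3
        ("r4", ("r5", "r6")),  # r4 = r5 + r6, independent of the first
        ("r7", ("r1", "r4")),  # needs both results, so it starts a new group
    ]
    for number, group in enumerate(split_into_groups(program)):
        print(number, group)
```

Here the first two instructions land in one group and the third in a second group, which is the kind of boundary the programmer or compiler would mark explicitly.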
12- And what does this code look like?
13- Is Itanium Fully Developed?
No.
- Some registers yet to be named and used.
- Windows 64 not yet available.
- Cost of processor/memory production still too
high.
And they haven't written any books on the subject yet either.
Moore's Law: if we keep doubling, then we can expect IA-64 to be around half as long as IA-32. That's about 5-7 years. That gives us at least 3 more.
14- Register File
- 256 general and floating point registers (128 of each)
- General registers are 64 bits wide; floating point registers are 82 bits wide
- Rotating registers
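Rotating registers are easier to picture with a small model. The sketch below is a simplification under stated assumptions: the class name, the eight-register window, the starting register number 32, and the decrementing rename base are illustrative choices, not details taken from the slides.

```python
class RotatingRegisterFile:
    """Toy model of a rotating register window.

    Logical register numbers are renamed through a rotating base, so a
    value written to one register name shows up under the next name
    after each rotation.
    """

    def __init__(self, size=8, first_rotating=32):
        self.size = size
        self.first = first_rotating
        self.rename_base = 0
        self.physical = [0] * size

    def _index(self, logical):
        return (logical - self.first + self.rename_base) % self.size

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def read(self, logical):
        return self.physical[self._index(logical)]

    def rotate(self):
        # Decrementing the base makes every value appear one name higher.
        self.rename_base = (self.rename_base - 1) % self.size


if __name__ == "__main__":
    regs = RotatingRegisterFile()
    regs.write(32, 7)
    regs.rotate()
    print(regs.read(33))  # the value written to r32 is now visible as r33
```

This renaming is what lets a loop body reuse the same register names on every iteration while each iteration still sees its own values.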
15- Memory Hierarchy
- Level 1 Data Cache (L1-D)
- Level 1 Instruction Cache (L1-I)
- 16 KB, 4-way set associative with 32-byte lines
- Level 2 Unified Cache (L2)
- Level 3 Cache (L3)
- Main Memory Bus (FSB)
- Maximum bandwidth of 2.1 GB/s.
- Level 1 and Level 2 Data Translation Lookaside Buffers (L1/L2-DTLB)
- Instruction Translation Cache (ITLB)
16- Level 1 Data Cache (L1-D)
- 16 KB, 4-way set associative, write through, no write allocate, with 32-byte lines
- Integer loads have 2-cycle latency
- Floating Point loads bypass L1 Data cache
17- Level 2 Unified Cache (L2)
- 96 KB, 6-way set associative, write back and write allocate, with 64-byte lines
- Integer loads have 6-cycle latency
- Floating Point loads have 9-cycle latency
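The size, associativity, and line-size figures for L1-D and L2 determine how many sets each cache has. The sketch below works that out, assuming 1 KB = 1024 bytes and a simple modulo set-index scheme (the indexing scheme is an assumption, not something the slides state).

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return the number of lines and sets for a set-associative cache."""
    lines = size_bytes // line_bytes
    sets = lines // ways
    return {"lines": lines, "sets": sets}


def set_index(address, sets, line_bytes):
    """Pick the set for an address using plain modulo indexing (assumed)."""
    return (address // line_bytes) % sets


if __name__ == "__main__":
    l1d = cache_geometry(16 * 1024, ways=4, line_bytes=32)
    l2 = cache_geometry(96 * 1024, ways=6, line_bytes=64)
    print("L1-D:", l1d)
    print("L2:", l2)
    print("L1-D set for 0x1234:", set_index(0x1234, l1d["sets"], 32))
```

The unusual 96 KB, 6-way L2 still yields a power-of-two number of sets, which keeps the set index a simple bit field.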
18- L3 Cache (L3)???
- Off-chip
- 2 MB or 4 MB package
- Maximum bandwidth from L3 to L2 is 16 bytes times the core frequency
- Integer loads have 21-cycle latency
- Floating Point loads have 24-cycle latency
So what?
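One way to answer that question is to combine the integer load latencies from the cache slides (2, 6, and 21 cycles) into an average load latency. In the sketch below the hit rates, the main memory latency, and the 800 MHz core frequency are made-up inputs chosen only for illustration; the slides do not give them.

```python
# Integer load latencies in cycles, as listed on the cache slides.
INTEGER_LOAD_LATENCY = {"L1": 2, "L2": 6, "L3": 21}


def average_load_latency(hit_rates, memory_latency):
    """Average cycles per integer load.

    `hit_rates` maps each level to the fraction of loads reaching that
    level that hit there; `memory_latency` is an assumed cycle count
    for loads that miss every cache.
    """
    remaining = 1.0  # fraction of loads not yet served
    total = 0.0
    for level in ("L1", "L2", "L3"):
        served = remaining * hit_rates[level]
        total += served * INTEGER_LOAD_LATENCY[level]
        remaining -= served
    return total + remaining * memory_latency


def l3_to_l2_bandwidth(core_hz):
    """Peak L3-to-L2 bandwidth: 16 bytes times the core frequency."""
    return 16 * core_hz


if __name__ == "__main__":
    rates = {"L1": 0.5, "L2": 0.5, "L3": 1.0}  # hypothetical hit rates
    print(average_load_latency(rates, memory_latency=100))
    print(l3_to_l2_bandwidth(800_000_000))  # assumed 800 MHz core
```

Even with pessimistic hit rates the average stays far below the cost of going to main memory, which is the point of stacking three levels of cache.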
20- L1 and L2 Data Translation Lookaside Buffers
- 32 and 96 entries, respectively
- Both fully associative
- Both support page sizes of 4k, 8k, 16k, 64k, 256k, 1M, 4M, 16M, 64M, and 256M
- Purges supported include all page sizes and 4G
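The entry counts and page sizes above set how much memory each DTLB can map at once. A rough sketch follows; the function names are illustrative, and it assumes every entry maps a page of the same size, with k and M meaning powers of two.

```python
KIB = 1024
MIB = 1024 * KIB


def tlb_reach(entries, page_bytes):
    """Bytes of memory covered when every entry maps one page."""
    return entries * page_bytes


def split_address(virtual_address, page_bytes):
    """Split an address into (virtual page number, offset within page)."""
    return divmod(virtual_address, page_bytes)


if __name__ == "__main__":
    for name, entries in (("L1-DTLB", 32), ("L2-DTLB", 96)):
        smallest = tlb_reach(entries, 4 * KIB)
        largest = tlb_reach(entries, 256 * MIB)
        print(name, smallest, largest)
    print(split_address(0x12345, 4 * KIB))
```

With 4k pages the 32-entry L1-DTLB covers only 128 KB, which is why the large page sizes matter for big data sets.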
21- Instruction Translation Cache
- Single-level instruction
- 64 entries
- Fully associative
22- Overall Architecture
23- IA-64 Thermal Specifications
- What are the components? How does it work?
- Internal thermal circuit w/ thermal sensing diode
- How does it protect itself from overheating?
- Comparison to T_HIGH
- What happens when overheating occurs?
- Thermal Alert Register tripped
- To restore
- What exactly are the heat tolerances? What should be calculated? Any equations?
- According to Intel
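The comparison-and-trip behavior described on this slide can be sketched in a few lines. This is only a conceptual model: the class name, the 85-degree threshold, and the choice to keep the alert latched are assumptions, and the restore step is left out because the slide does not spell it out.

```python
class ThermalMonitor:
    """Toy model: compare each reading to a high-temperature threshold."""

    def __init__(self, t_high_celsius):
        self.t_high = t_high_celsius  # hypothetical threshold value
        self.alert = False            # stands in for the thermal alert flag

    def sample(self, reading_celsius):
        """Record one sensor reading; trip the alert if it exceeds t_high.

        The alert stays set once tripped (an assumption of this sketch).
        """
        if reading_celsius > self.t_high:
            self.alert = True
        return self.alert


if __name__ == "__main__":
    monitor = ThermalMonitor(t_high_celsius=85.0)
    for reading in (70.0, 90.0, 70.0):
        print(reading, monitor.sample(reading))
```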
24- IA-64 Thermal Specifications
25- IA-64 Thermal Specifications: Dimensions of Thermal Sensor
26- IA-64 Thermal Specifications: The Processors
- What about the AMD/P4/P3?
- P4: Application Slows Down (Itanium inherits fundamental heat protection)
- P3: Application Freezes
- As for the AMD
- Video displaying above characteristics at end of
presentation
27- IA-64 Thermal Specifications: Location of Thermal Sensor
28- IA-64 System Management Bus (w/ Thermal Sensor)
- Why do we care about the PIROM and EEPROM?
- EEPROM is a read/write memory block that enables vendors to specify methods/standards as to how data is transferred on the data bus.
- PIROM contains write-protected information regarding certain characteristics of the processor (frequency, speed).
- As for the thermal sensor, in conjunction with the above components, accurate temperature checking/regulation is achieved.
29- IA-64 System Management Bus: Data/Addressing Management
- Packet Types (Read/Write)
- Memory Units: current address read, random access read, sequential read, byte write, page write
- Thermal Unit: write byte, read byte, send byte, receive byte, ARA
- Addressing
- Memory Units: 1010XXY2b
- Thermal Unit: 0011XXXZb, 1001XXXZb, 0101XXXZb
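The thermal unit address patterns above can be read as bit masks with don't-care positions. The sketch below checks an 8-bit address byte against them. Treating X and Z as wildcards is an assumption (on an SMBus-style address byte the last bit is typically the read/write bit), and only the thermal unit patterns are used.

```python
# Thermal unit address patterns from the slide, most significant bit first.
THERMAL_PATTERNS = ("0011XXXZ", "1001XXXZ", "0101XXXZ")


def matches(pattern, address_byte):
    """True if the 8-bit value fits the pattern; letters match any bit."""
    bits = format(address_byte & 0xFF, "08b")
    return all(p not in "01" or p == b for p, b in zip(pattern, bits))


def is_thermal_address(address_byte):
    """True if the byte fits any of the thermal unit patterns."""
    return any(matches(p, address_byte) for p in THERMAL_PATTERNS)


if __name__ == "__main__":
    for value in (0b00110101, 0b10011110, 0b10100000):
        print(format(value, "08b"), is_thermal_address(value))
```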
30- IA-64 System Management Bus: Memory Unit Packet Types
31- IA-64 System Management Bus: Thermal Unit Packet Types
32- IA-64 Bus Architecture: SMBus Timing Diagrams
33- IA-64 Main Bus Architecture: Overview
34- IA-64 Main Bus Architecture: Specifications
- 64-bit bus running at 2.1 GB/s
- Up to 4 Itaniums can be connected in parallel to the same bus (running at 266 MHz)
- SAC: System Address Controller
- SDC: System Data Controller
- The above controllers route Address or Data information from the Itanium(s) to the memory unit (from multiple processors to a single bus line and vice versa)
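The 2.1 GB/s figure follows from the bus width and the 266 MHz rate. A quick check, assuming one 64-bit transfer per 266 MHz cycle and 1 GB = 10^9 bytes (the function names are illustrative):

```python
def bus_bandwidth(width_bits, transfers_per_second):
    """Peak bandwidth in bytes per second for a parallel bus."""
    return (width_bits // 8) * transfers_per_second


def share_per_processor(total_bytes_per_second, processors):
    """Even split of peak bandwidth among processors on one bus."""
    return total_bytes_per_second / processors


if __name__ == "__main__":
    peak = bus_bandwidth(64, 266_000_000)
    print(peak / 1e9, "GB/s peak")
    print(share_per_processor(peak, 4) / 1e9, "GB/s each with 4 Itaniums")
```

Because up to four processors share the same bus, the peak figure is an upper bound on what any one of them sees.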
35- IA-64 Customer Feedback
- What are journalists, customers saying?
- The heat generated from the Itanium can be compared to an EZ-Bake Oven. Intel is losing its foothold in the processor industry by relying on the archaic x86 architecture.
- Upgrading a mission critical system is a daunting task, especially since there exist reliable 64-bit Unix machines. Then there's the code conversion problem.