Title: Hardware Support for Compiler Speculation
2 Hardware Support for Compiler Speculation
- Compiler needs to move instructions before a branch, possibly before the condition
- Requirements
- Instructions that can be moved without disrupting data flow
- Exceptions that can be ignored until the outcome is known
- Ability to speculatively access memory with potential address conflicts
3 Exception Support
- Four methods
- Hardware and OS cooperate to ignore exceptions for speculative instructions
- Speculative instructions never raise exceptions; explicit checks must be made
- Poison bits used to mark registers with invalid results; use causes an exception
- Speculative results are buffered until it is certain the instruction is no longer speculative
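The poison-bit method can be sketched in a few lines of Python. This is only an illustrative model, not any real ISA: the names `Reg`, `spec_load`, `spec_add`, `use` and `DeferredFault` are made up for the example, and a missing dictionary key stands in for a memory fault.

```python
from dataclasses import dataclass


class DeferredFault(Exception):
    """Raised when a poisoned register is consumed non-speculatively."""


@dataclass
class Reg:
    value: int = 0
    poison: bool = False  # set when a speculative instruction faulted


def spec_load(memory, addr):
    """Speculative load: a fault poisons the result instead of trapping."""
    if addr not in memory:  # assumption: a missing key models a memory fault
        return Reg(poison=True)
    return Reg(memory[addr])


def spec_add(a, b):
    """Speculative ALU op: poison propagates from sources to destination."""
    if a.poison or b.poison:
        return Reg(poison=True)
    return Reg(a.value + b.value)


def use(reg):
    """Non-speculative use: consuming a poisoned register raises the exception."""
    if reg.poison:
        raise DeferredFault("a speculative instruction faulted earlier")
    return reg.value


memory = {0x10: 5}
good = spec_add(spec_load(memory, 0x10), Reg(1))  # no fault on this path
bad = spec_add(spec_load(memory, 0x99), Reg(1))   # fault is deferred, not raised
```

If the speculation turns out to be wrong, `bad` is simply never used and no exception is ever seen.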
4 Exception Handling
- Nonterminating exceptions can be handled normally (e.g. page fault)
- May cause serious performance loss
5 Memory Reference Speculation
- Moving loads across stores is only safe if the addresses do not conflict
- Special instructions check for address conflicts
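A rough sketch of the idea, assuming a small table that remembers the address of each hoisted load: stores invalidate matching entries, and a check at the load's original position redoes the load if its entry is gone. The class and method names are hypothetical.

```python
class SpeculativeLoads:
    """Tracks addresses of loads that were hoisted above stores."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = {}  # destination name -> address of the hoisted load

    def load_speculative(self, dest, addr):
        self.pending[dest] = addr
        return self.memory[addr]

    def store(self, addr, value):
        self.memory[addr] = value
        # A store to a tracked address invalidates the speculative load.
        self.pending = {d: a for d, a in self.pending.items() if a != addr}

    def check(self, dest, addr, value):
        """Return value if the entry survived; otherwise redo the load."""
        if self.pending.pop(dest, None) == addr:
            return value
        return self.memory[addr]


mem = {100: 1, 200: 2}
unit = SpeculativeLoads(mem)

r1 = unit.load_speculative("r1", 100)  # load moved above the store below
unit.store(100, 42)                    # conflicting store
r1 = unit.check("r1", 100, r1)         # conflict detected, load is redone

r2 = unit.load_speculative("r2", 200)
unit.store(100, 7)                     # different address, no conflict
r2 = unit.check("r2", 200, r2)         # speculative value is kept
```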
6 4.6. Crosscutting Issues: Hardware vs Software Speculation
- A number of trade-offs and limitations
- Disambiguating memory references is hard for a compiler
- Hardware branch prediction is usually better
- Precise exceptions easier in hardware
- Hardware does not require housekeeping code
- Compilers can look further
- Hardware techniques are more portable
7 Hardware/Software Speculation
- Major disadvantage of hardware speculation: complexity!
- Some architectures combine hardware and software approaches
8 4.7. Putting It All Together: IA-64 and Itanium
- IA-64
- RISC-style
- Register-register
- Emphasis on software-based optimisations
- Features
- 128 65-bit integer registers
- 128 82-bit FP registers
- 64 predicate registers
- 8 branch registers
9 Registers
- Integer registers
- Use windowing mechanism
- Registers 0-31 always visible
- Remainder arranged in overlapping windows
- Local and out areas (variable size)
- Hardware for over-/underflow
- Int and FP registers support register rotation
- Supports software pipelining
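Register rotation can be pictured as renaming through a rotating base. The sketch below is a simplification with an assumed rotating region of eight registers and made-up names (`RotatingFile`, `rrb`); it only shows why a value written to one register name becomes visible under the next name after a rotation, which is what lets each stage of a software-pipelined loop reuse the same instruction text.

```python
NUM_STATIC = 32    # registers 0-31 are always visible and never rotate
NUM_ROTATING = 8   # assumed size of the rotating region in this sketch


class RotatingFile:
    def __init__(self):
        self.phys = [0] * (NUM_STATIC + NUM_ROTATING)
        self.rrb = 0  # rotating register base

    def _index(self, reg):
        """Map a register name to a physical slot."""
        if reg < NUM_STATIC:
            return reg
        return NUM_STATIC + (reg - NUM_STATIC + self.rrb) % NUM_ROTATING

    def read(self, reg):
        return self.phys[self._index(reg)]

    def write(self, reg, value):
        self.phys[self._index(reg)] = value

    def rotate(self):
        """One rotation: the value named r becomes visible as r + 1."""
        self.rrb = (self.rrb - 1) % NUM_ROTATING


rf = RotatingFile()
rf.write(32, 111)  # produced by one loop iteration
rf.write(5, 9)     # static register, unaffected by rotation
rf.rotate()        # next iteration begins
```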
10 Instruction Format and VLIW
- Compiler schedules parallel instructions and flags dependences
- Instruction group
- Sequence of (register) independent instructions
- Compiler marks boundaries between groups (stop)
- Bundle
- 128 bits: 5-bit template + 3 × 41-bit instructions
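The bundle arithmetic (5 + 3 × 41 = 128) can be checked with a small pack/unpack sketch. The field order used here (template in the lowest bits, then the three slots) is an assumption made for the example, and the function names are illustrative.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1
TEMPLATE_MASK = (1 << TEMPLATE_BITS) - 1


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into 128 bits."""
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    if not 0 <= template <= TEMPLATE_MASK:
        raise ValueError("template does not fit in 5 bits")
    if any(not 0 <= s <= SLOT_MASK for s in slots):
        raise ValueError("instruction does not fit in 41 bits")
    word = template
    for i, slot in enumerate(slots):
        word |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return word


def unpack_bundle(word):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = word & TEMPLATE_MASK
    slots = [
        (word >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    ]
    return template, slots
```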
11 Instruction Bundle
- Template specifies stops and execution unit
- I-unit (int, special, multimedia, etc.)
- M-unit (int, memory access)
- F-unit (FP)
- B-unit (branches)
- L+X (extended instructions)
12 Example
for (int k = 0; k < 1000; k++) x[k] = x[k] + s;
- Unrolled seven times
- Optimised for size
- 9 bundles, 15% nops
- 21 cycles (3 per calculation)
- Optimised for performance
- 11 bundles, 30% nops
- 12 cycles (1.7 per calculation)
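To make the unrolling and the per-calculation figures concrete, here is a sketch in Python, assuming the loop adds a scalar s to each of the 1000 elements of x. The clean-up loop is an addition for this example, needed because 1000 is not a multiple of 7; the function names are illustrative.

```python
UNROLL = 7


def add_scalar_unrolled(x, s):
    """x[k] = x[k] + s for every k, with the loop body unrolled seven times."""
    n = len(x)
    k = 0
    while k + UNROLL <= n:
        x[k] += s
        x[k + 1] += s
        x[k + 2] += s
        x[k + 3] += s
        x[k + 4] += s
        x[k + 5] += s
        x[k + 6] += s
        k += UNROLL
    while k < n:  # clean-up iterations for the leftover elements
        x[k] += s
        k += 1
    return x


def cycles_per_calculation(cycles, unroll=UNROLL):
    """Cycles for one unrolled body divided by the calculations it performs."""
    return cycles / unroll
```

With these definitions, 21 cycles for seven calculations gives 3 cycles each, and 12 cycles gives about 1.7.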
13 Instructions
- 41 bits long
- 4-bit opcode (+ template bits)
- 6-bit predicate register specifier
- Predication
- Almost all instructions can be predicated
- Branch is jump with predicate check!
- Complex comparisons set two predicate registers
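Predication can be modelled as follows: a compare produces a predicate and its complement, and each predicated instruction takes effect only when its predicate is set. This is a conceptual sketch with hypothetical helper names, not IA-64 syntax.

```python
def compare(a, b):
    """A compare sets two predicates: the condition and its complement."""
    p_true = a < b
    return p_true, not p_true


def predicated(pred, regs, dest, value):
    """Perform 'dest = value' only if pred is set; otherwise act as a no-op."""
    if pred:
        regs[dest] = value


def abs_diff(a, b):
    """Branch-free form of: if a < b: r = b - a, else: r = a - b."""
    regs = {"r": None}
    p1, p2 = compare(a, b)
    predicated(p1, regs, "r", b - a)  # takes effect only when a < b
    predicated(p2, regs, "r", a - b)  # takes effect only when a >= b
    return regs["r"]
```

Both arms are issued, and exactly one of them writes its result, so the conditional branch disappears from the instruction stream.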
14 Speculation
- Exceptions can be deferred
- Uses poison bits (65-bit registers)
- Nonspeculative and chk instructions raise the exception
- Speculative loads
- Called advanced load (ld.a)
- Stores check addresses
15 Itanium
- First implementation of IA-64
- Issues up to six instructions per cycle (two bundles)
- Nine functional units
- 2 I, 2 M, 3 B, 2 F
- 10-stage pipeline
- Multilevel dynamic branch predictor
16 Itanium
- Complex hardware with many features of dynamically scheduled pipelines!
- Branch prediction
- Register renaming
- Scoreboarding
- Deep pipeline
- etc.
17 Itanium Performance
- SPECint not too impressive
- 85% of Alpha 21264 (older, more power-efficient processor!)
- FP better
- Faster, even with slower clock!
- But skewed by one benchmark for Pentium
- Alpha compilers need improvement
18 4.8. Another View: ILP in Embedded Processors
- Trimedia (see chapter 2)
- Classic VLIW
- Hardware decompression of code
- Crusoe
- Software translation of 80x86 to VLIW
- Low power
19 Trimedia TM32 Architecture
- VLIW
- Instruction specifies five operations
- Static scheduling
- No hardware hazard detection
- 23 functional units (11 types)
20 Transmeta Crusoe
- Low power design
- Emulates 80x86
- VLIW
- 64-bit (2 op) and 128-bit (4 op) instructions
- Five types of operations
- ALU (int, register-register)
- Compute (int ALU, FP, multimedia)
- Memory
- Branch
- Immediate
21 Crusoe
- Simple, in-order pipeline
- Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
- FP: 10-stage (5 EX stages)
22 Crusoe
- Software interpretation of 80x86 code
- Basic blocks cached
- Exception handling complicated
- Crusoe has good support for speculative reordering
- Memory writes buffered and committed only when safe
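The buffering of memory writes can be sketched as a store buffer with explicit commit and rollback. This is a simplified model under assumed names (`GatedStoreBuffer`, `commit`, `rollback`) and does not describe the actual Crusoe hardware interface.

```python
class GatedStoreBuffer:
    """Holds speculative writes until they are known to be safe."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = []  # (addr, value) writes not yet committed

    def store(self, addr, value):
        self.pending.append((addr, value))

    def load(self, addr):
        # The newest buffered write wins; otherwise read committed memory.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def commit(self):
        """The speculated block finished without an exception: make writes visible."""
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()

    def rollback(self):
        """An exception occurred: discard every uncommitted write."""
        self.pending.clear()
```

After a rollback, memory is exactly as it was at the last commit, so the original code can be re-executed without the speculative reordering.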
23 Crusoe Performance
- Hard to measure accurately
- Power consumption is low (? of Pentium)
24 4.9. Fallacies and Pitfalls
- Fallacy: There is a simple approach to multiple-issue processors (high performance with low complexity)
- Big gap between peak and sustained performance for multiple-issue processors
- Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
- Sophisticated compilers are required
25 4.10. Concluding Comments
- Hardware techniques migrating to software and vice versa
- Multiprocessors may be important in future
26 Chapter 5: Memory Hierarchy Design
27 Memory Hierarchies
- Not a new idea!
- Takes advantage of the principle of locality
- Temporal
- Spatial
- Small, fast memories close to processor
28 Memory Hierarchies
29 Introduction
- Usually includes responsibility for memory protection
- Performance is a major problem
30 Figure 5.2
31 Characterising Levels of the Memory Hierarchy
- Four questions
- Where can a block be placed? (placement)
- How is a block found? (identification)
- Which block should be replaced on a miss? (replacement)
- What happens on a write? (write strategy)
32 Example
- The Alpha 21264 is used as an example throughout
33 Caches
- Where is a block placed in a cache?
- Three possible answers, giving three different types of cache
34 Cache Categories
- Set associative
- n-way set associative, where n is number of blocks in set
- Commonly, n = 2 or n = 4
- Direct-mapped
- 1-way set associative
- Fully associative
- m-way set associative (m is total number of blocks in cache)
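The three categories differ only in the number of ways, which a small simulator can show. The sketch below picks the set as block address modulo the number of sets and assumes LRU replacement (the slides do not fix a replacement policy here); all names are illustrative.

```python
from collections import OrderedDict


class SetAssociativeCache:
    def __init__(self, num_blocks, ways):
        if num_blocks % ways != 0:
            raise ValueError("ways must divide the number of blocks")
        self.ways = ways
        self.num_sets = num_blocks // ways
        self.sets = [OrderedDict() for _ in range(self.num_sets)]

    def access(self, block_addr):
        """Return True on a hit, False on a miss (LRU replacement assumed)."""
        s = self.sets[block_addr % self.num_sets]
        if block_addr in s:
            s.move_to_end(block_addr)  # mark as most recently used
            return True
        if len(s) == self.ways:
            s.popitem(last=False)      # evict the least recently used block
        s[block_addr] = True
        return False


def direct_mapped(num_blocks):
    """Direct-mapped is the 1-way case."""
    return SetAssociativeCache(num_blocks, ways=1)


def fully_associative(num_blocks):
    """Fully associative is the m-way case, with m = total blocks."""
    return SetAssociativeCache(num_blocks, ways=num_blocks)
```

For an 8-block cache and the block sequence 0, 8, 0, the direct-mapped version misses all three times because blocks 0 and 8 compete for the same line, while the 2-way and fully associative versions hit on the third access.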