Hardware Support for Compiler Speculation

Slides: 36

Provided by: csgw

Transcript and Presenter's Notes

2
Hardware Support for Compiler Speculation

Compiler needs to move instructions before a branch, possibly before the condition is evaluated
Requirements:
Instructions that can be moved without disrupting data flow
Exceptions that can be ignored until the outcome is known
Ability to speculatively access memory with potential address conflicts

3
Exception Support

Four methods:
Hardware and OS cooperate to ignore exceptions for speculative instructions
Speculative instructions never raise exceptions; explicit checks must be made
Poison bits mark registers holding invalid results; using such a register causes an exception
Speculative results are buffered until it is certain the instruction is no longer speculative
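
The poison-bit method can be illustrated with a small simulation. This is only a sketch under stated assumptions: the `RegisterFile` class, its method names, and the dictionary used as memory are hypothetical and do not model any real instruction set.

```python
class SpeculativeFault(Exception):
    """Raised when a nonspeculative instruction consumes a poisoned value."""


class RegisterFile:
    """Hypothetical register file with one poison bit per register."""

    def __init__(self, size=8):
        self.values = [0] * size
        self.poison = [False] * size

    def spec_load(self, dest, memory, address):
        # Speculative load: a faulting access sets the poison bit instead of trapping.
        if address in memory:
            self.values[dest] = memory[address]
            self.poison[dest] = False
        else:
            self.values[dest] = 0
            self.poison[dest] = True

    def spec_add(self, dest, src1, src2):
        # Poison propagates from either source to the destination.
        self.values[dest] = self.values[src1] + self.values[src2]
        self.poison[dest] = self.poison[src1] or self.poison[src2]

    def use(self, src):
        # Nonspeculative use: a poisoned source raises the deferred exception.
        if self.poison[src]:
            raise SpeculativeFault(f"register {src} holds a poisoned value")
        return self.values[src]


memory = {100: 7}                # address 999 is deliberately unmapped
regs = RegisterFile()
regs.spec_load(1, memory, 100)   # succeeds
regs.spec_load(2, memory, 999)   # would fault; r2 is poisoned, no trap yet
regs.spec_add(3, 1, 2)           # r3 inherits the poison from r2
```

Calling `regs.use(1)` returns the loaded value, while `regs.use(3)` raises `SpeculativeFault`. The exception appears only if the speculated result is actually needed.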

4
Exception Handling

Nonterminating exceptions (e.g. page fault) can be handled normally
Handling one for a speculative instruction that turns out to be unneeded may cause serious performance loss

5
Memory Reference Speculation

Moving loads across stores is only safe if the addresses do not conflict
Special instructions check for address conflicts
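
One way to picture the address check is a table that remembers the address of every load hoisted above a store. The following is a minimal sketch; the `AdvancedLoadTable` class and its methods are hypothetical names chosen for illustration, not a description of real hardware.

```python
class AdvancedLoadTable:
    """Hypothetical table tracking loads that were moved above stores."""

    def __init__(self):
        self.entries = {}  # destination register -> address it was loaded from

    def advanced_load(self, regs, dest, memory, address):
        # Perform the load early and remember which address it read.
        regs[dest] = memory.get(address, 0)
        self.entries[dest] = address

    def store(self, memory, address, value):
        memory[address] = value
        # A store to a tracked address invalidates the matching entries.
        for reg in [r for r, a in self.entries.items() if a == address]:
            del self.entries[reg]

    def check_load(self, regs, dest, memory, address):
        # At the load's original position: redo the load only on a conflict.
        if dest not in self.entries:
            regs[dest] = memory.get(address, 0)
            return True   # conflict detected, load repeated
        del self.entries[dest]
        return False      # speculation was safe


memory = {0x10: 1, 0x20: 2}
regs = {}
table = AdvancedLoadTable()

table.advanced_load(regs, "r1", memory, 0x10)
table.store(memory, 0x20, 5)                            # different address
no_conflict = table.check_load(regs, "r1", memory, 0x10)

table.advanced_load(regs, "r2", memory, 0x20)
table.store(memory, 0x20, 9)                            # same address
conflict = table.check_load(regs, "r2", memory, 0x20)
```

In the second case the check finds its entry gone, so it repeats the load and `r2` ends up holding the stored value.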

6
4.6. Crosscutting Issues: Hardware vs Software Speculation

A number of trade-offs and limitations
Disambiguating memory references is hard for a compiler
Hardware branch prediction is usually better
Precise exceptions easier in hardware
Hardware does not require housekeeping code
Compilers can look further
Hardware techniques are more portable

7
Hardware/Software Speculation

Major disadvantage of hardware: complexity!
Some architectures combine hardware and software approaches

8
4.7. Putting It All Together: IA-64 and Itanium

IA-64
RISC-style
Register-register
Emphasis on software-based optimisations
Features
128 65-bit integer registers
128 82-bit FP registers
64 predicate registers
8 branch registers

9
Registers

Integer registers
Use windowing mechanism
Registers 0-31 always visible
Remainder arranged in overlapping windows
Local and out areas (variable size)
Hardware for over-/underflow
Int and FP registers support register rotation
Supports software pipelining
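
Register rotation helps software pipelining because a value written under one register name in an iteration is visible under the next name in the following iteration, so overlapping iterations need no copy instructions. The sketch below is a deliberately simplified model: the `RotatingRegisters` class, the stage guards, and the three-stage loop that adds a scalar to each element are assumptions for illustration, and the real IA-64 mechanism differs in detail.

```python
class RotatingRegisters:
    """Simplified rotating register file (hypothetical model)."""

    def __init__(self, size):
        self.physical = [None] * size
        self.base = 0  # rotating base, decremented on each rotation

    def _index(self, logical):
        return (logical + self.base) % len(self.physical)

    def write(self, logical, value):
        self.physical[self._index(logical)] = value

    def read(self, logical):
        return self.physical[self._index(logical)]

    def rotate(self):
        # After rotation, a value written as logical r is read as logical r + 1.
        self.base -= 1


def add_scalar_pipelined(x, s):
    """Software-pipelined x[k] + s with load, add, and store stages overlapped."""
    regs = RotatingRegisters(4)
    out = list(x)
    n = len(x)
    for it in range(n + 2):            # two extra iterations drain the pipeline
        if 0 <= it - 2 < n:            # stage 3: store the sum from the previous iteration
            out[it - 2] = regs.read(2)
        if 0 <= it - 1 < n:            # stage 2: add to the value loaded one iteration ago
            regs.write(1, regs.read(1) + s)
        if it < n:                     # stage 1: load a new element
            regs.write(0, x[it])
        regs.rotate()
    return out


result = add_scalar_pipelined([1, 2, 3, 4, 5], 10)
```

The `if` guards stand in for the predicates that would switch stages on and off while the pipeline fills and drains.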

10
Instruction Format and VLIW

Compiler schedules parallel instructions and flags dependences
Instruction group
Sequence of (register) independent instructions
Compiler marks boundaries between groups (stop)
Bundle
128 bits: 5-bit template and three 41-bit instructions
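
The bundle arithmetic (5 + 3 x 41 = 128 bits) can be checked with a short packing routine. This is a sketch only: the function names are hypothetical, and placing the template in the lowest five bits followed by the three slots is an assumption of this example.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a template and three instruction slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for i, instruction in enumerate(slots):
        if not 0 <= instruction < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= instruction << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Recover the template and the three instruction slots."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


bundle = pack_bundle(0x10, [1, 2, 3])
template, slots = unpack_bundle(bundle)
```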

11
Instruction Bundle

Template specifies stops and execution unit
I-unit (integer, special: multimedia, etc.)
M-unit (integer, memory access)
F-unit (FP)
B-unit (branches)
L+X (extended instructions)

12
Example
for (int k = 0; k < 1000; k++) x[k] = x[k] + s;

Unrolled seven times
Optimised for size
9 bundles, 15 nops
21 cycles (3 per calculation)
Optimised for performance
11 bundles, 30 nops
12 cycles (1.7 per calculation)
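
The per-calculation figures follow from dividing the cycle count by the seven unrolled loop bodies. A minimal sketch of that arithmetic, using a hypothetical helper name:

```python
UNROLL_FACTOR = 7  # the loop body is unrolled seven times


def cycles_per_calculation(total_cycles, calculations=UNROLL_FACTOR):
    """Average cycles spent on each unrolled loop body."""
    return total_cycles / calculations


size_optimised = cycles_per_calculation(21)
speed_optimised = cycles_per_calculation(12)
print(f"size: {size_optimised:.1f}, performance: {speed_optimised:.1f}")
# prints "size: 3.0, performance: 1.7"
```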

13
Instructions

41 bits long
4-bit opcode (+ template bits)
6-bit predicate register specifier
Predication
Almost all instructions can be predicated
Branch is jump with predicate check!
Complex comparisons set two predicate registers
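
Predication can replace a branch with a compare that sets a predicate and its complement, followed by instructions guarded by those predicates. The sketch below imitates this in software; the function names are hypothetical, and the Python conditional expression merely stands in for the hardware suppressing the write of an instruction whose predicate is false.

```python
def compare_gt(a, b):
    """Compare that sets two predicates: the condition and its complement."""
    p = a > b
    return p, not p


def predicated(pred, old_value, new_value):
    """A predicated instruction updates its destination only if pred is true."""
    return new_value if pred else old_value


def max_if_converted(a, b):
    """Branch-free form of: if (a > b) result = a; else result = b;"""
    p1, p2 = compare_gt(a, b)
    result = 0
    result = predicated(p1, result, a)  # (p1) result = a
    result = predicated(p2, result, b)  # (p2) result = b
    return result


largest = max_if_converted(3, 5)
```

Both guarded instructions are always issued; exactly one of them takes effect, so no branch is needed.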

14
Speculation

Exceptions can be deferred
Uses poison bits (65-bit registers)
Nonspeculative and chk instructions raise the deferred exception
Speculative loads
Called advanced load (ld.a)
Stores check addresses

15
Itanium

First implementation of IA-64
Issues up to six instructions per cycle (two bundles)
Nine functional units
2 I, 2 M, 3 B, 2 F
10-stage pipeline
Multilevel dynamic branch predictor

16
Itanium

Complex hardware with many features of dynamically scheduled pipelines!
Branch prediction
Register renaming
Scoreboarding
Deep pipeline
etc.

17
Itanium Performance

SPECint not too impressive
85% of Alpha 21264 (older, more power-efficient processor!)
FP better
Faster, even with slower clock!
But skewed by one benchmark for Pentium
Alpha compilers need improvement

18
4.8. Another View: ILP in Embedded Processors

Trimedia (see chapter 2)
Classic VLIW
Hardware decompression of code
Crusoe
Software translation of 80x86 to VLIW
Low power

19
Trimedia TM32 Architecture

VLIW
Instruction specifies five operations
Static scheduling
No hardware hazard detection
23 functional units (11 types)

20
Transmeta Crusoe

Low power design
Emulates 80x86
VLIW
64-bit (2 op) and 128-bit (4 op) instructions
Five types of operations
ALU (int, register-register)
Compute (int ALU, FP, multimedia)
Memory
Branch
Immediate

21
Crusoe

Simple, in-order pipeline
Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
FP: 10-stage (5 EX stages)

22
Crusoe

Software interpretation of 80x86 code
Basic blocks cached
Exception handling complicated
Crusoe has good support for speculative reordering
Memory writes buffered and committed only when safe
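
Buffering writes until they are known to be safe can be sketched as a store buffer with commit and rollback operations. The `GatedStoreBuffer` class below is a hypothetical illustration of the idea, not a description of Crusoe's actual hardware.

```python
class GatedStoreBuffer:
    """Hypothetical buffer that holds speculative stores until commit."""

    def __init__(self, memory):
        self.memory = memory
        self.pending = []  # (address, value) pairs in program order

    def store(self, address, value):
        # Speculative stores are held back from memory.
        self.pending.append((address, value))

    def load(self, address):
        # Loads see the most recent pending store to the same address.
        for pending_address, value in reversed(self.pending):
            if pending_address == address:
                return value
        return self.memory.get(address, 0)

    def commit(self):
        # The speculated region completed safely: make the writes visible.
        for address, value in self.pending:
            self.memory[address] = value
        self.pending.clear()

    def rollback(self):
        # An exception occurred: discard every buffered write.
        self.pending.clear()


memory = {0: 1}
buffer = GatedStoreBuffer(memory)

buffer.store(0, 42)
seen_while_speculating = buffer.load(0)
buffer.rollback()
after_rollback = memory[0]

buffer.store(0, 99)
buffer.commit()
after_commit = memory[0]
```

After a rollback memory is untouched, so the translated code can be re-executed more conservatively from the last committed state.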

23
Crusoe Performance

Hard to measure accurately
Power consumption is low (? of Pentium)

24
4.9. Fallacies and Pitfalls

Fallacy: There is a simple approach to multiple-issue processors (high performance with low complexity)
Big gap between peak and sustained performance for multiple-issue processors
Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
Sophisticated compilers are required

25
4.10. Concluding Comments