Hardware Support for Compiler Speculation

Slides: 36
Provided by: csgw


2
Hardware Support for Compiler Speculation
  • Compiler needs to move instructions before
    branch, possibly before condition
  • Requirements
  • Instructions that can be moved without disrupting
    data flow
  • Exceptions that can be ignored until outcome is
    known
  • Ability to speculatively access memory with
    potential address conflicts

3
Exception Support
  • Four methods
  • Hardware and OS cooperate to ignore exceptions
    for speculative instructions
  • Speculative instructions never raise exceptions;
    explicit checks must be made
  • Poison bits used to mark registers with invalid
    results; use causes exception
  • Speculative results are buffered until certain
    that the instruction is no longer speculative
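The poison-bit method can be sketched in a few lines of Python. This is a toy model rather than any real ISA: the class and method names (RegisterFile, spec_load, spec_add, store) are hypothetical, and an "invalid address" is simply one missing from a dictionary.

```python
class SpeculativeFault(Exception):
    """Deferred exception, raised only when a poisoned value is really used."""


class RegisterFile:
    """Hypothetical register file with one poison bit per register."""

    def __init__(self, n=8):
        self.value = [0] * n
        self.poison = [False] * n

    def spec_load(self, dst, memory, addr):
        """Speculative load: a faulting address poisons dst instead of trapping."""
        if addr in memory:
            self.value[dst] = memory[addr]
            self.poison[dst] = False
        else:
            self.value[dst] = 0
            self.poison[dst] = True

    def spec_add(self, dst, a, b):
        """Speculative add: poison propagates from either source to dst."""
        self.value[dst] = self.value[a] + self.value[b]
        self.poison[dst] = self.poison[a] or self.poison[b]

    def store(self, src, memory, addr):
        """Non-speculative use: a poisoned source raises the deferred fault."""
        if self.poison[src]:
            raise SpeculativeFault(f"r{src} holds a poisoned value")
        memory[addr] = self.value[src]


memory = {0x10: 5}
regs = RegisterFile()
regs.spec_load(1, memory, 0x10)   # valid address
regs.spec_load(2, memory, 0x99)   # invalid address: r2 is poisoned, no trap
regs.spec_add(3, 1, 2)            # poison propagates to r3
regs.store(1, memory, 0x20)       # fine: r1 is clean
try:
    regs.store(3, memory, 0x24)   # first real use of the bad result
except SpeculativeFault as exc:
    print("deferred exception:", exc)
```

The fault surfaces only at the store, so a speculative load that turns out to be on the wrong path costs nothing beyond the wasted work.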

4
Exception Handling
  • Nonterminating exceptions can be handled normally
    (e.g. page fault)
  • May cause serious performance loss

5
Memory Reference Speculation
  • Moving loads across stores is only safe if the
    addresses do not conflict
  • Special instructions check for address conflicts
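One way to picture the conflict check is a small table that remembers the address of each load moved above a store. The sketch below is illustrative only; AddressCheckTable and its methods are hypothetical names, not a description of any particular processor.

```python
class AddressCheckTable:
    """Hypothetical table tracking loads that were hoisted above stores."""

    def __init__(self):
        self.entries = {}  # destination register -> speculatively loaded address

    def advanced_load(self, reg, memory, addr):
        """Perform the load early and remember which address it read."""
        self.entries[reg] = addr
        return memory[addr]

    def store(self, memory, addr, value):
        """Perform the store and drop every entry with a conflicting address."""
        memory[addr] = value
        self.entries = {r: a for r, a in self.entries.items() if a != addr}

    def check(self, reg):
        """Return True if the early load is still valid (no conflicting store)."""
        return reg in self.entries


memory = {100: 1, 200: 2}
table = AddressCheckTable()
early = table.advanced_load("r1", memory, 100)   # load hoisted above the stores
table.store(memory, 200, 9)                      # different address: no conflict
ok_after_unrelated = table.check("r1")
table.store(memory, 100, 7)                      # same address: conflict
ok_after_conflict = table.check("r1")
value = early if ok_after_conflict else memory[100]   # reload on conflict
```

When the check fails, the program falls back to reloading the value (or running fix-up code), so correctness never depends on the speculation succeeding.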

6
4.6. Crosscutting Issues: Hardware vs Software Speculation
  • A number of trade-offs and limitations
  • Disambiguating memory references is hard for a
    compiler
  • Hardware branch prediction is usually better
  • Precise exceptions easier in hardware
  • Hardware does not require housekeeping code
  • Compilers can look further
  • Hardware techniques are more portable

7
Hardware/Software Speculation
  • Major disadvantage of hardware: complexity!
  • Some architectures combine hardware and software
    approaches

8
4.7. Putting It All Together: IA-64 and Itanium
  • IA-64
  • RISC-style
  • Register-register
  • Emphasis on software-based optimisations
  • Features
  • 128 65-bit integer registers
  • 128 82-bit FP registers
  • 64 predicate registers, 8 branch registers

9
Registers
  • Integer registers
  • Use windowing mechanism
  • Registers 0-31 always visible
  • Remainder arranged in overlapping windows
  • Local and out areas (variable size)
  • Hardware for over-/underflow
  • Int and FP registers support register rotation
  • Supports software pipelining

10
Instruction Format and VLIW
  • Compiler schedules parallel instructions and flags
    dependences
  • Instruction group
  • Sequence of (register) independent instructions
  • Compiler marks boundaries between groups (stop)
  • Bundle
  • 128 bits: 5-bit template + 3 × 41-bit instructions
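The bundle arithmetic (5 + 3 × 41 = 128) can be checked with a small packing routine. This is a sketch; the function names are hypothetical, and placing the template in the low-order bits is an assumption made for illustration.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOTS_PER_BUNDLE = 3


def pack_bundle(template, slots):
    """Pack a 5-bit template and three 41-bit instruction slots into one integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != SLOTS_PER_BUNDLE:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template  # assumption: template occupies the lowest bits
    for i, slot in enumerate(slots):
        if not 0 <= slot < (1 << SLOT_BITS):
            raise ValueError("each instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a packed bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & mask
        for i in range(SLOTS_PER_BUNDLE)
    ]
    return template, slots


packed = pack_bundle(0b10000, [1, 2, 3])
template, slots = unpack_bundle(packed)
```

A bundle with every field at its maximum value uses exactly 128 bits, which confirms that the three fields fill the bundle with no padding.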

11
Instruction Bundle
  • Template specifies stops and execution unit
  • I-unit (int, special: multimedia, etc.)
  • M-unit (int, memory access)
  • F-unit (FP)
  • B-unit (branches)
  • L+X (extended instructions)

12
Example
for (int k = 0; k < 1000; k++) x[k] = x[k] + s;
  • Unrolled seven times
  • Optimised for size
  • 9 bundles, 15% nops
  • 21 cycles (3 per calculation)
  • Optimised for performance
  • 11 bundles, 30% nops
  • 12 cycles (1.7 per calculation)
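The figures for the two schedules can be reproduced with simple arithmetic. The sketch below assumes the unrolled loop body contains 23 instructions (7 loads, 7 adds, 7 stores, one index update and one branch); that count is an assumption, not something stated on the slide. Under it, the nop figures read naturally as percentages of the available issue slots.

```python
SLOTS_PER_BUNDLE = 3
UNROLL = 7
# Assumption: 7 loads + 7 adds + 7 stores + 1 index update + 1 branch
INSTRUCTIONS = 3 * UNROLL + 2


def schedule_stats(bundles, cycles):
    """Summarise one schedule of the unrolled loop."""
    slots = bundles * SLOTS_PER_BUNDLE
    nops = slots - INSTRUCTIONS
    return {
        "slots": slots,
        "nops": nops,
        "nop_percent": round(100 * nops / slots),
        "cycles_per_calc": round(cycles / UNROLL, 1),
    }


size_optimised = schedule_stats(bundles=9, cycles=21)
speed_optimised = schedule_stats(bundles=11, cycles=12)
print(size_optimised)
print(speed_optimised)
```

The trade-off is visible in the numbers: the faster schedule spends two extra bundles, mostly on nops, to cut the cycle count from 21 to 12.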

13
Instructions
  • 41-bits long
  • 4-bit opcode (+ template bits)
  • 6-bit predicate register specifier
  • Predication
  • Almost all instructions can be predicated
  • Branch is jump with predicate check!
  • Complex comparisons set two predicate registers
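Predication replaces a branch with a compare that sets a predicate and its complement, followed by instructions guarded by those predicates. The following Python sketch models the idea; cmp_eq and predicated are hypothetical helpers, not IA-64 mnemonics.

```python
def cmp_eq(a, b):
    """Compare that writes two predicates: the result and its complement."""
    p = a == b
    return p, not p


def predicated(pred, op, old):
    """Run op only if pred is true; otherwise the old value is kept."""
    return op() if pred else old


def branchy(a, b, x, y):
    """Original code with a conditional branch."""
    if a == b:
        r = x + y
    else:
        r = x - y
    return r


def if_converted(a, b, x, y):
    """Same computation with the branch removed by predication."""
    p_true, p_false = cmp_eq(a, b)
    r = 0
    r = predicated(p_true, lambda: x + y, r)    # (p_true)  r = x + y
    r = predicated(p_false, lambda: x - y, r)   # (p_false) r = x - y
    return r
```

Both guarded instructions are issued, but only the one whose predicate is true takes effect, so there is no branch to mispredict.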

14
Speculation
  • Exceptions can be deferred
  • Uses poison bits (65-bit registers)
  • Nonspeculative and chk instructions raise
    exception
  • Speculative loads
  • Called advanced load (ld.a)
  • Stores check addresses

15
Itanium
  • First implementation of IA-64
  • Issues up to six instructions per cycle (two
    bundles)
  • Nine functional units
  • 2 I, 2 M, 3 B, 2 F
  • 10-stage pipeline
  • Multilevel dynamic branch predictor

16
Itanium
  • Complex hardware with many features of
    dynamically scheduled pipelines!
  • Branch prediction
  • Register renaming
  • Scoreboarding
  • Deep pipeline
  • etc.

17
Itanium Performance
  • SPECint not too impressive
  • 85% of Alpha 21264 (older, more power-efficient
    processor!)
  • FP better
  • Faster, even with slower clock!
  • But skewed by one benchmark for Pentium
  • Alpha compilers need improvement

18
4.8. Another View: ILP in Embedded Processors
  • Trimedia (see chapter 2)
  • Classic VLIW
  • Hardware decompression of code
  • Crusoe
  • Software translation of 80x86 to VLIW
  • Low power

19
Trimedia TM32 Architecture
  • VLIW
  • Instruction specifies five operations
  • Static scheduling
  • No hardware hazard detection
  • 23 functional units (11 types)

20
Transmeta Crusoe
  • Low power design
  • Emulates 80x86
  • VLIW
  • 64-bit (2 op) and 128-bit (4 op) instructions
  • Five types of operations
  • ALU (int, register-register)
  • Compute (int ALU, FP, multimedia)
  • Memory
  • Branch
  • Immediate

21
Crusoe
  • Simple, in-order pipeline
  • Integer 6-stage (IF1, IF2, DEC, OP, EX, WB)
  • FP 10-stage (5 EX stages)

22
Crusoe
  • Software interpretation of 80x86 code
  • Basic blocks cached
  • Exception handling complicated
  • Crusoe has good support for speculative
    reordering
  • Memory writes buffered and committed only when
    safe

23
Crusoe Performance
  • Hard to measure accurately
  • Power consumption is low (? of Pentium)

24
4.9. Fallacies and Pitfalls
  • Fallacy: There is a simple approach to
    multiple-issue (high performance with low
    complexity)
  • Big gap between peak and sustained performance
    for multiple issue processors
  • Need dynamic scheduling, speculation support,
    branch prediction, sophisticated prefetch, etc.
  • Sophisticated compilers are required

25
4.10. Concluding Comments
  • Hardware techniques migrating to software and
    vice versa
  • Multiprocessors may be important in future

26
Chapter 5: Memory Hierarchy Design
27
Memory Hierarchies
  • Not a new idea!
  • Takes advantage of the principle of locality
  • Temporal
  • Spatial
  • Small, fast memories close to processor

28
Memory Hierarchies
29
Introduction
  • Usually includes responsibility for memory
    protection
  • Performance is a major problem

30
Figure 5.2
31
Characterising Levels of the Memory Hierarchy
  • Four questions
  • Where can a block be placed? (placement)
  • How is a block found? (identification)
  • Which block should be replaced on a miss?
    (replacement)
  • What happens on a write? (write strategy)

32
Example
  • The Alpha 21264 is used as an example throughout

33
Caches
  • Where is a block placed in a cache?
  • Three possible answers, hence three different types

34
Cache Categories
  • Set associative
  • n-way set associative, where n is number of
    blocks in set
  • Commonly, n = 2 or n = 4
  • Direct-mapped
  • 1-way set associative
  • Fully associative
  • m-way set associative (m is total number of
    blocks in cache)
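Since all three categories are special cases of n-way set associativity, a single placement rule covers them. The function below is a minimal sketch (candidate_frames is a hypothetical name); it assumes the usual mapping, set index = block address mod number of sets.

```python
def candidate_frames(block_address, num_blocks, ways):
    """Return the cache block frames in which a memory block may be placed."""
    if ways < 1 or num_blocks % ways != 0:
        raise ValueError("ways must divide the number of blocks")
    num_sets = num_blocks // ways
    set_index = block_address % num_sets     # which set the block maps to
    first = set_index * ways                 # first frame of that set
    return list(range(first, first + ways))  # any frame within the set


# Memory block 12 in a cache with 8 block frames:
direct_mapped = candidate_frames(12, 8, ways=1)      # 1-way
two_way = candidate_frames(12, 8, ways=2)            # 4 sets of 2
fully_associative = candidate_frames(12, 8, ways=8)  # 1 set of 8
print(direct_mapped, two_way, fully_associative)
```

Higher associativity gives the block more candidate frames, at the cost of having more tags to compare on every lookup.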
