Title: How to Fake 1000 Registers
1. by Martin Labrecque
- How to Fake 1000 Registers
- Oehmke, Binkert, Mudge, Reinhardt
- to appear in Nov. at MICRO 2005
2. Outline
- Motivation
- Observations on registers
- Idea
- Virtual Context Architecture
- Evaluation in 2 types of applications
3. Some definitions
- Activation record
- Data structure
- variables belonging to one particular scope (e.g. a procedure body)
- links to other activation records
- Synonyms: "data frame", "stack frame"
- Context
- Activation record of a thread of execution
A register is only meaningful to the current activation record
4. Key observation
- Virtual Memory
- From the ISA standpoint, each process has an 'infinite' amount of memory available
- Memory is managed in caches, RAM and disk
- Memory is context free
- This is not true for registers
- Limited resource
Need to virtualize registers
5. How registers are used
Source code: variables
Compiler
IR: virtual registers
Register allocation
Binary: logical registers
Decode/Rename
Data path: physical registers
Pipeline
6. Registers are useful
- Can't get rid of registers
- Efficient address encoding in instructions
- Unambiguous data dependences
- Efficient integration in the micro-architecture
7. Dawn of a New Idea
Attach a memory address to the content of the register!
8. Virtualizing registers
9. Mapping registers to memory
- Registers are virtualized because they hold the content of a memory location
- 2 options
- At register allocation, map compiler virtual registers to memory
- Memory to memory operations
- Doesn't make use of ISA registers
- Map ISA registers to memory
- Key Idea of the Virtual Context Architecture
10. Programming the VCA
- Where are the registers mapped in memory?
- The Stack Pointer is the Reference
- Allows memory to be 'allocated' dynamically
- Efficient way of passing parameters to a function
- Need some architectural support to address with offsets to the stack pointer
11. Renaming
- To get the register memory address, combine
- the source/destination register index of the binary program
- base pointer (stack pointer)
- ISA register index -> register memory address -> physical register
12. Register memory address -> physical reg.
- The address = base pointer + offset
- Exploit locality of the addresses to compress the number of bits in the conversion, low probability of capacity miss
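The two conversion steps on slides 11 and 12 can be sketched in a few lines. This is only an illustration under stated assumptions, not the paper's hardware: the 8-byte register slot, the names `register_address` and `RenameTable`, and the dictionary used as a rename table are all hypothetical.

```python
REG_BYTES = 8  # assumed size of one register slot in memory


def register_address(base_pointer: int, reg_index: int) -> int:
    """Memory address backing a logical register: base pointer + offset."""
    return base_pointer + reg_index * REG_BYTES


class RenameTable:
    """Toy rename table keyed by register memory address (hypothetical)."""

    def __init__(self) -> None:
        self.addr_to_phys = {}
        self.next_free = 0

    def lookup(self, base_pointer: int, reg_index: int) -> int:
        addr = register_address(base_pointer, reg_index)
        if addr not in self.addr_to_phys:
            # Miss: hand out the next unused physical register.
            self.addr_to_phys[addr] = self.next_free
            self.next_free += 1
        return self.addr_to_phys[addr]


table = RenameTable()
caller_r3 = table.lookup(0x7000, 3)   # register 3 in the caller's context
callee_r3 = table.lookup(0x6F00, 3)   # same index, different base pointer
caller_again = table.lookup(0x7000, 3)
```

The same ISA register index resolves to different physical registers once the base pointer changes, which is what lets several contexts share one physical register file.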
13. Register File is a Cache
- Hardware controlled cache
- An instruction requires its source operands and destination register to execute
What happens on a cache miss? We need some hardware control!
14. Some additional HW
- Each register has 3 new attributes
- A reference count
- Incremented when an instruction using it goes through rename
- Decremented when the instruction is committed
- Non-zero value means that the register cannot be reallocated to other logical registers
- Guarantees correct instruction execution
15. Some additional HW (cont'd)
- A 'committed' bit
- Valid, non-speculative value
- A 'dirty' bit
- Value more up-to-date than memory
- Using those attributes, a state machine controls which registers are available or not
- Branch recovery works by having a duplicate renaming table containing the committed architectural state
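The three per-register attributes can be modelled as a small record. The sketch below is an assumption-laden illustration: the class and method names are hypothetical, and the rule that a committed, dirty register must be written back before reuse is borrowed from ordinary write-back caches rather than taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    ref_count: int = 0       # in-flight instructions that use this register
    committed: bool = False  # holds a valid, non-speculative value
    dirty: bool = False      # value is newer than the copy in memory

    def rename(self) -> None:
        """An instruction using this register passes through rename."""
        self.ref_count += 1

    def commit(self, is_destination: bool) -> None:
        """The instruction commits; a destination becomes committed and dirty."""
        self.ref_count -= 1
        if is_destination:
            self.committed = True
            self.dirty = True

    def reallocatable(self) -> bool:
        """Only a register with no in-flight users may be remapped."""
        return self.ref_count == 0

    def needs_spill(self) -> bool:
        """Assumed rule: write back before reuse if committed and dirty."""
        return self.committed and self.dirty


reg = PhysReg()
reg.rename()
free_while_in_flight = reg.reallocatable()
reg.commit(is_destination=True)
```

While the instruction is in flight the register is pinned; after commit it can be reclaimed, but its value has to reach memory first.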
16. Source operand to physical register conversion
17. Destination logical register to physical register conversion
18. Allocation of an entry for destination register
- Replacement policy in rename table
19. Pipeline modifications
- Changes in the renaming
- ATSQ: architectural state transfer queue
- Adds to the queue upon fills and spills
- Has priority over the instruction to execute
- Addresses for fills and spills are pre-calculated
- No memory disambiguation required
- No data dependences
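How misses in the register cache turn into ATSQ traffic can be sketched as below. The slides only say that a replacement policy exists; LRU is an assumption here, as are the names `RegisterCache` and `access`, and the choice to skip the fill when a register is only written. Reference counts are left out to keep the sketch short.

```python
from collections import OrderedDict, deque


class RegisterCache:
    """Toy register file acting as a cache of memory-mapped registers."""

    def __init__(self, num_phys: int) -> None:
        self.num_phys = num_phys
        self.entries = OrderedDict()  # register address -> dirty flag, LRU first
        self.atsq = deque()           # pending ("fill" | "spill", address)

    def access(self, addr: int, is_write: bool) -> None:
        if addr in self.entries:
            self.entries.move_to_end(addr)  # hit: mark most recently used
        else:
            if len(self.entries) == self.num_phys:
                victim, dirty = self.entries.popitem(last=False)
                if dirty:
                    self.atsq.append(("spill", victim))
            if not is_write:
                # A source operand must be fetched from memory first.
                self.atsq.append(("fill", addr))
            self.entries[addr] = False
        if is_write:
            self.entries[addr] = True


rc = RegisterCache(num_phys=2)
rc.access(0x100, is_write=True)   # destination: allocated, no fill needed
rc.access(0x108, is_write=False)  # source miss: fill
rc.access(0x110, is_write=False)  # file full: spill dirty 0x100, then fill
transfers = list(rc.atsq)
```

Each queued transfer already carries its address, which matches the point that fills and spills need no address calculation or disambiguation at issue time.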
20. Outline
- Motivation
- Observations on registers
- Idea
- Virtual Context Architecture
- Evaluation in 2 types of applications
- Baseline Methodology
- Register windows w/ results
- SMT w/ results
- Combined register windows + SMT
21. Baseline machine
22. More on methodology
- Uses SimPoints to find representative simulation intervals
- SPEC CPU 2000
- Baseline doesn't have register windows
- (Alpha's register remapping with issue queues)
- Window overflow/underflow: 10 cycles
23. Applications
- Register windows
- Multithreading
http://www.sics.se/psm/sparcstack.html
http://en.wikipedia.org/wiki/Register_window
24. Register Windows
- Global register allocation
- How many registers should we reserve for the current procedure versus the rest of the program?
- SPARC example
- usually contains as many as 128 GPRs
- At any point only 32 are available
- 8 global, 8 params in, 8 params out, 8 local values
- Up to 32 windows
- Windows changed by an instruction, usually along with 'call' and 'return'
- Partial overlap: 'params out' of caller are 'params in' of callee
- Also used in Itanium (variable sized window)
- Alternative is e.g. renaming with reservation stations
Save some memory (stack) traffic on function calls
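The overlap between the caller's 'params out' and the callee's 'params in' can be illustrated with a small index calculation. The register numbering (r0-r7 global, r8-r15 out, r16-r23 local, r24-r31 in) follows the SPARC convention; the function name, the physical layout, and the direction in which the window number moves on a call are simplifying assumptions.

```python
def physical_index(window: int, r: int, nwindows: int) -> int:
    """Map logical register r (0..31) in a given window to a physical index."""
    if r < 8:
        return r                              # globals are shared by all windows
    windowed = nwindows * 16                  # each window adds 8 ins + 8 locals
    if r < 16:
        slot = (window + 1) * 16 + (r - 8)    # outs alias the next window's ins
    elif r < 24:
        slot = window * 16 + 8 + (r - 16)     # locals are private to the window
    else:
        slot = window * 16 + (r - 24)         # ins
    return 8 + slot % windowed                # wrap around the circular file


caller_out0 = physical_index(window=0, r=8, nwindows=8)
callee_in0 = physical_index(window=1, r=24, nwindows=8)
caller_local0 = physical_index(window=0, r=16, nwindows=8)
callee_local0 = physical_index(window=1, r=16, nwindows=8)
```

Because the caller's outs and the callee's ins land on the same physical registers, parameters are passed without any stack traffic.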
25. Register Windows Caveats
- Problem
- Overflow of windows: call depth too deep
- Underflow of window: need to restore a window from memory
- Solution
- Operating system handler
- typical scheme saves and restores windows
- VCA handles registers individually
Performance Advantage of the Register Stack in Intel Itanium Processors
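A rough way to see when a conventional windowed machine overflows or underflows is to replay a call/return trace against a fixed number of windows. The function below is a hypothetical sketch: it ignores any window a real implementation reserves for trap handlers, and it reads the 10-cycle figure from the methodology slide as a per-trap cost.

```python
TRAP_CYCLES = 10  # per overflow/underflow, as read from the methodology slide


def window_traps(trace, nwindows):
    """Count overflow and underflow events for a call/return trace."""
    resident, spilled = 1, 0  # windows held in registers / saved to memory
    overflows = underflows = 0
    for event in trace:
        if event == "call":
            if resident == nwindows:
                spilled += 1      # oldest window is saved to make room
                overflows += 1
            else:
                resident += 1
        elif event == "return":
            if resident > 1:
                resident -= 1
            elif spilled > 0:
                spilled -= 1      # caller's window is restored from memory
                underflows += 1
    return overflows, underflows


trace = ["call"] * 5 + ["return"] * 5
overflows, underflows = window_traps(trace, nwindows=4)
penalty = (overflows + underflows) * TRAP_CYCLES
```

Each trap in this model moves a whole window, whereas VCA spills and fills individual registers only when they are actually needed.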
26. Register windows evaluation
- Ideal: fills and spills are free
- VCA is especially good with few registers
- Close to ideal at 256 registers
- VCA 4% faster than baseline @ 256 regs
- Fewer registers means fewer in-flight instructions and fewer branch mispredictions -> increase
- For others -> decrease
27. Single data cache port experiment
- Normalized to 2-port baseline
- 7% faster than baseline @ 256 regs
- 0.5% slower than ideal @ 256 regs
28. 2nd App: multi-threading
29. SMT: simultaneous multi-threading
- Lots of replicated resources (larger register file)
- VCA renaming table is not replicated, only base thread pointer
- VCA
- # of in-flight instructions determines number of registers required
- not # of threads
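The sizing argument can be checked with a line of arithmetic. The constants below are assumptions, not stated on the slides: 64 logical registers per thread (32 integer plus 32 floating-point, as on Alpha) and 192 registers for in-flight instructions. They are chosen because they reproduce the 256, 320 and 448-register baselines quoted in the SMT results.

```python
LOGICAL_REGS_PER_THREAD = 64  # assumption: 32 integer + 32 floating-point
IN_FLIGHT_REGS = 192          # assumption: registers for in-flight instructions


def conventional_smt_registers(threads: int) -> int:
    """Physical registers when every thread's full context stays resident."""
    return threads * LOGICAL_REGS_PER_THREAD + IN_FLIGHT_REGS


sizes = {t: conventional_smt_registers(t) for t in (1, 2, 4)}
vca_share_4t = IN_FLIGHT_REGS / sizes[4]  # 192-register VCA vs 4-thread baseline
```

Under these assumptions the conventional file grows with the thread count, while a 192-register VCA configuration stays below half the size of the 4-thread baseline.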
30. SMT: 2 and 4 threads
- Normalized to single thread baseline 256 regs (not shown)
- @ 192 regs, VCA 2T is 97% of baseline @ 320 regs (baseline is at 88%)
- @ 192 regs, VCA 4T is at 98.7% of baseline @ 448 regs
31. Combined: SMT w/ register windows
- Normalized to single thread baseline @ 256 regs
- VCA 4T: 98% of peak performance @ 192 regs
32. SMT + register windows
- Register windows reduce cache accesses while SMT increases them
- VCA 4T non-windowed @ 192 regs reaches 98% of baseline performance but still has 24% more cache accesses; adding windows brings cache accesses 5% below baseline
33. VCA summarized
- unifies support for both multiple independent threads and register windowing within each thread
- backwards compatible with existing ISAs at the application level for multithreaded contexts
- requires only minimal ISA changes for register windowing
- requires no changes to the physical register file design and the performance-critical schedule/execute/writeback loop
- builds on existing rename logic to map logical registers to physical registers and handles register cache misses in the decode/rename stages
34. VCA summarized (cont'd)
- completely decouples physical register file size from the number of logical registers by using memory as a backing store, rather than another larger register file
- does not involve speculation or prediction, avoiding the need for recovery mechanisms.
35. Conclusions
- A VCA-based implementation of register windows in an out-of-order processor reduces execution time by 4% while reducing data cache accesses by nearly 20% compared to a non-windowed machine, with an even larger performance advantage over a conventional register-window implementation.
- VCA's data cache traffic reduction is large enough that it can achieve the same performance with one cache port as an otherwise similar conventional machine would with two cache ports.
36. Conclusions (cont'd)
- VCA is also able to manage thread contexts efficiently, enabling effective implementation of simultaneous multithreading (SMT) using as few as half the registers of a standard architecture.
- VCA allows SMT to be combined with register windows with no additional physical registers.
- A 4-thread VCA machine with 192 registers can achieve higher performance than a conventional non-windowed SMT machine with twice as many registers.