Title: Mikko Lipasti
1Value Prediction: Are(n't) We Done Yet?
- Mikko Lipasti
- University of Wisconsin-Madison
2Definition
- What is value prediction? Broadly, three salient attributes:
- Generate a speculative value (predict)
- Consume speculative value (execute)
- Verify speculative value (compare/recover)
- This subsumes branch prediction
- Focus here on operand values
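A minimal sketch of the predict / execute / verify loop, assuming a simple last-value predictor indexed by instruction address. The class name, the table organization, and the trace format are illustrative assumptions, not a design taken from the talk.

```python
class LastValuePredictor:
    """Toy last-value predictor indexed by instruction address (hypothetical)."""

    def __init__(self):
        self.table = {}  # pc -> last value produced by that instruction

    def predict(self, pc):
        # Generate: return a speculative value, or None if there is no history.
        return self.table.get(pc)

    def verify(self, pc, actual):
        # Verify: compare against the real result, then train the table.
        predicted = self.table.get(pc)
        self.table[pc] = actual
        return predicted == actual


def run(trace):
    """trace: list of (pc, actual_value) pairs.

    Returns (correct, mispredicted, no_prediction) counts.
    """
    lvp = LastValuePredictor()
    correct = wrong = none = 0
    for pc, actual in trace:
        guess = lvp.predict(pc)
        # Consume: dependent instructions would execute with `guess` here.
        ok = lvp.verify(pc, actual)
        if guess is None:
            none += 1
        elif ok:
            correct += 1
        else:
            wrong += 1  # a real machine would recover (squash/re-execute) here
    return correct, wrong, none


# Made-up trace: one instruction keeps producing 0, another changes its value.
counts = run([(0x40, 0), (0x40, 0), (0x40, 0), (0x44, 7), (0x44, 8)])
```

On this made-up trace the first instruction is predicted correctly once it has history, while the second mispredicts because its value changed.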
3Some History
- Classical value prediction
- Independently invented by 4 groups in 1995-1996
- AMD (NexGen): L. Widigen and E. Sowadsky, patent filed March 1996, inv. March 1995
- Technion: F. Gabbay and A. Mendelson, inv. sometime 1995, TR 11/96, US patent Sep 1997
- CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct. 1995, ASPLOS paper submitted March 1996
- Wisconsin: Y. Sazeides, J. Smith, Summer 1996
4Why?
- Possible explanations
- Natural evolution from branch prediction
- Natural evolution from memoization
- Natural evolution from rampant speculation
- Cache hit speculation
- Memory independence speculation
- Speculative address generation
- Improvements in tracing/simulation technology
- "There's a lot of zeroes out there." (C. Wilkerson)
- Values, not just instruction addresses
- TRIP6000 A. Martin-de-Nicolas, IBM
5Publications by Year
- Excludes journals, workshops, compiler conferences
6What Happened?
- Tremendous academic interest
- Dozens of research groups, papers, proposals
- No industry uptake
- No present or planned CPU with value prediction
- Why?
- Meager performance benefit (< 10%)
- Power consumption
- Dynamic power for extra activity
- Static power (area) for prediction tables
- Complexity and correctness
- Subtle memory ordering issues MICRO 01
- Misprediction recovery HPCA 04
7Performance?
- Relationship between timely fetch and value prediction benefit (Gabbay, ISCA)
- Value prediction doesn't help when the result can be computed before the consumer instruction is fetched
- High-bandwidth fetch helps
- Wide trace caches studied in late 1990s
- But, these have several negative attributes
- Recent designs focus on frequency, not ILP
- High-bandwidth fetch is a red herring
- More important to fetch the right instructions
8Future Adoption?
- Classical value prediction will only make it in the context of a very different microarchitecture
- One that explicitly and aggressively exposes ILP
- Promising trends
- Deep pipelining craze appears to be over
- Can't manage the design complexity
- High frequency mania appears to be over
- Can't afford the power
- Architects are pursuing ILP once again
- Value prediction has another opportunity
9What Value Prediction Begat
- Value prediction catalyzed a new focus on values in computation
- This had not been studied before
- A whole new realm of research
- Value-Aware Microarchitecture
- Spans numerous subdisciplines
- Significant industrial impact already
- Also, developments in supporting technologies
10Value-Aware Microarchitecture
- Memory Hierarchy
- Register File Compression several
- Cache Compression Gupta, Alameldeen
- Memory Compression e.g. IBM MXT
- Bandwidth compression
- Address and data bus encoding Rudolph
- Initialization Traffic Lewis
- Load/Store Processing
- Load value prediction numerous
- Fast address calculation Austin
- Value-aware alias prediction Onder
- Memory consistency Cain
- Execution Core
- Value Prediction
- Operand Significance
- Low Power Canal
- Execution bandwidth Loh
- Bit-slicing Pentium 4, Mestan
- Instruction reuse Sodani
- Carry prediction Circuit-level Speculation
- Cache Coherence
- Producer-side
- Silent stores, temporally silent stores Lepak
- Speculative lock elision Rajwar; Wisc, UIUC
- Consumer side
- Load value prediction using stale lines Lepak
- Coherence decoupling Burger, Sohi; ASPLOS 04
11Supporting Technologies
- Value prediction presented some unique challenges
- Relatively low correct prediction rate (initially 40-50%)
- Nontrivial misprediction rate with avoidable misprediction cost
- These drove study of
- Confidence prediction/estimation
- First microarchitectural application of confidence estimation, though not widely credited or cited as such
- Since studied for numerous applications, e.g. gating control speculation
- Selective recovery (Sazeides Ph.D., Kim HPCA 04)
- Numerous challenges in extending recovery to entire window
- Both have proved to be fruitful research areas
- Also stimulated development of software technology
- Value profiling
- Value-based compiler optimizations
- Run-time code specialization
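Confidence estimation can be sketched with a small saturating counter that gates whether a prediction is used. The resetting policy, counter width, and threshold below are assumptions chosen for illustration; the talk does not specify a particular scheme.

```python
class ConfidenceCounter:
    """Resetting saturating counter (hypothetical parameters)."""

    def __init__(self, max_count=3, threshold=2):
        self.count = 0
        self.max_count = max_count
        self.threshold = threshold

    def confident(self):
        # Only speculate on a predicted value once enough recent
        # predictions for this entry have been correct.
        return self.count >= self.threshold

    def update(self, was_correct):
        if was_correct:
            self.count = min(self.count + 1, self.max_count)
        else:
            self.count = 0  # a misprediction resets confidence entirely


c = ConfidenceCounter()
history = []
for outcome in [True, True, True, False, True]:
    history.append(c.confident())  # decision made before the outcome is known
    c.update(outcome)
```

The point of the gate is to avoid the avoidable misprediction cost: a low-confidence prediction is simply not consumed, so it cannot trigger a recovery.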
12Outline
- Some History
- Industry Trends
- Value-Aware Microarchitecture
- Case study: Memory Consistency (Trey Cain, ISCA 2004)
- Conventional load queue microarchitecture
- Value-based memory ordering
- Replay-reduction heuristics
- Performance evaluation
- Conclusions
13Value-based Memory Consistency
- High ILP => Large instruction windows
- Larger physical register file
- Larger scheduler
- Larger load/store queues
- Result in increased access latency
- Value-based Replay
- If load queue scalability is a problem... who needs one!
- Instead, re-execute load instructions a 2nd time in program order
- Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average
14Enforcing RAW dependences
Figure: three memory operations listed in program order, with execution order in parentheses: store A (1), store ? (3), load A (2).
- Load queue contains load addresses
- Memory independence speculation
- Hoist load above unknown store assuming it is to a different address
- Check correctness at store retirement
- One search per store address calculation
- If address matches, the load is squashed
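The load queue search described above can be sketched as follows. The entry layout and function name are hypothetical; a real load queue performs this match associatively (as a CAM) rather than with a loop.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    age: int        # program-order sequence number (larger = younger)
    address: int
    issued: bool    # True once the load has already read the cache


def loads_to_squash(load_queue, store_age, store_address):
    """Search performed when a store's address becomes known.

    Any younger load that already issued to the same address was hoisted
    above this store incorrectly and must be squashed.
    """
    return [
        ld for ld in load_queue
        if ld.issued and ld.age > store_age and ld.address == store_address
    ]


# Hypothetical scenario: load A (age 3) issued before the unknown store
# (age 2) computed its address.
queue = [LoadEntry(age=3, address=0xA0, issued=True)]
conflict = loads_to_squash(queue, store_age=2, store_address=0xA0)
no_conflict = loads_to_squash(queue, store_age=2, store_address=0xB0)
```

If the unknown store turns out to write A, the hoisted load is returned for squashing; if it writes a different address, the speculation was safe.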
15Enforcing memory consistency
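Figure: processor p1 issues two loads to A. In program order, with execution order in parentheses: load A (3), load A (1). Edge labels in the figure: raw, war.
- Two approaches
- Snooping: Search per incoming invalidate
- Insulated: Search per load address calculation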
16Load queue implementation
Figure: load queue block diagram. Labeled blocks: queue management, squash determination. Labeled signals: external request, external address, store address, store age, load address, load age.
- # of write ports = load address calc width
- # of read ports = load + store address calc width (+1)
- Current generation designs (32-48 entries, 2 write ports, 2 (3) read ports)
17Load queue scaling
- Larger instruction window => larger load queue
- Increases access latency
- Increases energy consumption
- Wider issue width => more read/write ports
- Also increases latency and energy
18Related work MICRO 2003
- Park et al., Purdue
- Extra structure dedicated to enforcing memory consistency
- Increase capacity through segmentation
- Sethumadhavan et al., UT-Austin
- Add set of filters summarizing contents of load queue
19Keep it simple
- Throw more hardware at the problem?
- Need to design/implement/verify
- Execution core is already complicated
- Load queue checks for rare errors
- Why not move error checking away from exe?
20Value-based Consistency
Figure: pipeline diagram. Stage labels include IF1, IF2, D, R, Q, S, EX, WB, REP (replay), CMP (compare), and C (commit).
- Replay: access the cache a second time, cheaply!
- Almost always cache hit
- Reuse address calculation and translation
- Share cache port used by stores in commit stage
- Compare: compares new value to original value
- Squash if the values differ
- This is value prediction!
- Predict: access cache prematurely
- Execute: as usual
- Verify: replay load, compare value, recover if necessary
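The replay and compare stages can be sketched in a few lines, modeling the cache as a dictionary. The function name and the two scenarios are illustrative assumptions; the sketch only shows that the check is on values, not addresses.

```python
def replay_check(cache, address, premature_value):
    """Commit-time check: re-read the cache and compare.

    Returns True when the load (and younger instructions) must be squashed.
    """
    replay_value = cache[address]            # second, in-order cache access
    return replay_value != premature_value   # squash only if the value changed


# Hypothetical scenario 1: a load to A issues before an older store to A,
# and the store writes a different value.
cache = {0xA0: 1}
premature = cache[0xA0]      # hoisted load reads the old value
cache[0xA0] = 2              # older store then writes a new value
squash_changed = replay_check(cache, 0xA0, premature)

# Hypothetical scenario 2: same reordering, but the store is silent
# (it writes the value that was already there).
cache = {0xA0: 1}
premature = cache[0xA0]
cache[0xA0] = 1
squash_silent = replay_check(cache, 0xA0, premature)
```

In the second scenario an address-based load queue would squash the load, while the value-based check lets it commit because the premature value is still correct.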
21Rules of replay
- All prior stores must have written data to the cache
- No store-to-load forwarding
- Loads must replay in program order
- If a load is squashed, it should not be replayed a second time
- Ensures forward progress
22Replay reduction
- Replay costs
- Consumes cache bandwidth (and power)
- Increases reorder buffer occupancy
- Can we avoid these penalties?
- Infer correctness of certain operations
- Four replay filters
- These are used to avoid checking our value prediction when in fact no value prediction occurred (loaded value is known to be correct)
- Similar to constant prediction in initial work
23No-Reorder filter
- Avoid replay if load isn't reordered wrt other memory operations
- Can we do better?
24Enforcing single-thread RAW dependencies
- No-Unresolved Store Address Filter
- Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
- Works for intra-processor RAW dependences
- Doesn't enforce memory consistency
25Enforcing MP consistency
- No-Recent-Miss Filter
- Avoid replay if there have been no cache line fills (to any address) while load in instruction window
- No-Recent-Snoop Filter
- Avoid replay if there have been no external invalidates (to any address) while load in instruction window
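One way to read the filters is as a per-load decision made at commit: replay unless every applicable filter says the premature value is known to be correct. The sketch below combines the no-unresolved-store filter (single-thread RAW) with the no-recent-snoop filter (multiprocessor consistency). The field names are hypothetical, and how hardware tracks these two conditions is not modeled.

```python
from dataclasses import dataclass


@dataclass
class LoadState:
    # True if some older store had an unresolved address when the load issued.
    unresolved_prior_store_at_issue: bool
    # True if an external invalidate (to any address) arrived while the
    # load was in the instruction window.
    snoop_while_in_window: bool


def needs_replay(load):
    """Replay only when one of the filters cannot vouch for the load."""
    raw_risk = load.unresolved_prior_store_at_issue
    consistency_risk = load.snoop_while_in_window
    return raw_risk or consistency_risk


loads = [
    LoadState(False, False),  # neither risk: replay filtered out
    LoadState(True, False),   # possible RAW violation
    LoadState(False, True),   # possible consistency violation
]
decisions = [needs_replay(ld) for ld in loads]
```

Swapping `snoop_while_in_window` for a "cache line fill while in window" flag would give the no-recent-miss variant under the same structure.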
26Constraint graph
- Defined for sequential consistency by Landin et al., ISCA-18
- Directed graph represents a multithreaded execution
- Nodes represent dynamic instruction instances
- Edges represent their transitive orders (program order, RAW, WAW, WAR)
- If the constraint graph is acyclic, then the execution is correct
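The acyclicity test can be sketched as a depth-first search over the ordering edges. The node names and the two example executions below are made up for illustration and are not taken from the talk's figures.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited, on DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:        # back edge: found a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[n] == WHITE and visit(n) for n in graph)


# Hypothetical execution: each processor's load reads the old value, so it
# is ordered (WAR) before the other processor's store.
bad = [
    ("P1:ST A", "P1:LD B"),   # program order
    ("P1:LD B", "P2:ST B"),   # WAR
    ("P2:ST B", "P2:LD A"),   # program order
    ("P2:LD A", "P1:ST A"),   # WAR
]
# Same operations, but P2's load observes P1's store (RAW) instead.
good = bad[:3] + [("P1:ST A", "P2:LD A")]

bad_is_cyclic = has_cycle(bad)
good_is_cyclic = has_cycle(good)
```

The first edge list contains a cycle and so would not be a sequentially consistent execution; the second is acyclic.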
27Constraint graph example - SC
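Figure: constraint graph for a two-processor execution. The extracted labels are Proc 1, Proc 2, ST A, LD B, ST B, LD A, program-order edges, a RAW edge, and a WAR edge, with operations numbered 1 to 4.
Cycle indicates that execution is incorrect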
28Anatomy of a cycle
Figure: the same two-processor constraint graph (Proc 1, Proc 2; ST A, LD B, ST B, LD A; program-order, RAW, and WAR edges), annotated with the events "Incoming invalidate" and "Cache miss".
30Filter Summary
Filter configurations, ordered from conservative to aggressive:
- Replay all committed loads
- No-Reorder Filter
- No-Unresolved Store / No-Recent-Snoop Filter
- No-Unresolved Store / No-Recent-Miss Filter
32Base machine model
- Simulator: PHARMsim, a PowerPC execute-at-execute simulator with OoO cores and an aggressive split-transaction snooping coherence protocol
- Out-of-order execution core: 5 GHz, 15-stage, 8-wide pipeline; 256-entry reorder buffer; 128-entry load/store queue; 32-entry issue queue
- Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
- Front-end: combined bimodal (16k entry) / gshare (16k entry) branch predictor with 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB
- Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s BW); stride-based prefetcher modeled after Power4
33L1 DCache bandwidth increase
Figure: L1 D-cache bandwidth increase for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads. Configurations: (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter.
On average, 3.4% bandwidth overhead using no-recent-snoop filter
34Value-based replay performance (relative to constrained load queue)
Figure: value-based replay performance for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads.
Value-based replay 8% faster on avg than baseline using 16-entry ld queue
35Does value locality help?
- Not much
- Value locality does avoid memory ordering violations
- 59% of single-thread violations avoided
- 95% of consistency violations avoided
- But these violations rarely occur
- 1 single-thread violation per 100 million instr
- 4 consistency violations per 10,000 instr
36What About Power?
- Simple power model
- Empirically 0.02 replay loads per committed instruction
- If load queue CAM energy/insn > 0.02 × energy expenditure of a cache access and comparison
- value-based implementation saves power!
\[ \Delta\mathrm{Energy} = \text{replays} \times \left( E_{\text{per cache access}} + E_{\text{per word comparison}} \right) + \text{replay overhead} - \left( E_{\text{per ldq search}} \times \text{ldq searches} \right) \]
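A worked example of the break-even argument, reading the slide's energy expression as replay energy plus replay overhead minus the load queue search energy that is saved. Only the 0.02 replays per instruction comes from the slide. Every energy value and the search rate below are made-up placeholders in arbitrary units, chosen only to show the arithmetic.

```python
def delta_energy(replays, e_cache_access, e_word_compare,
                 replay_overhead, e_ldq_search, ldq_searches):
    """Energy change per committed instruction when replay replaces the CAM.

    Negative result: the value-based scheme uses less energy.
    This form of the expression is an assumed reading of the slide.
    """
    replay_cost = replays * (e_cache_access + e_word_compare)
    search_savings = e_ldq_search * ldq_searches
    return replay_cost + replay_overhead - search_savings


delta = delta_energy(
    replays=0.02,          # from the slide: replay loads per committed instr
    e_cache_access=1.0,    # hypothetical, arbitrary energy units
    e_word_compare=0.05,   # hypothetical
    replay_overhead=0.0,   # hypothetical: extra pipeline latches ignored
    e_ldq_search=0.1,      # hypothetical
    ldq_searches=0.44,     # hypothetical searches per committed instr
)
saves_power = delta < 0
```

With these placeholder numbers the replay cost is 0.021 units per instruction against 0.044 units of avoided search energy, so the difference is negative. Real conclusions depend on measured energies.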
37Value-based replay Pros/Cons
- Eliminates associative lookup hardware
- Load queue becomes simple FIFO
- Negligible IPC or L1D bandwidth impact
- Can be used to fix value prediction
- Enforces dependence order consistency constraint (MICRO 01)
- Requires additional pipeline stages
- Requires additional cache datapath for loads
38Conclusions
- Value prediction
- Continues to generate lots of academic interest
- Little industry uptake so far
- Historical trends (narrow deep pipelines) minimized benefit
- Sea-change underway on this front
- Value prediction will be revisited in quest for ILP
- Power consumption is key!
- Value-Aware Microarchitecture
- Multiple fertile areas of research
- Some has found its way into products
- Are we done yet? No!
- Questions?
39Backups
40Caveat Memory Dependence Prediction
- Some predictors train using the conflicting store
- (e.g. store-set predictor)
- Replay mechanism is unable to pinpoint conflicting store
- Fair comparison
- Baseline machine: store-set predictor w/ 4k entry SSIT and 128 entry LFST
- Experimental machine: simple 21264-style dependence predictor w/ 4k entry history table
41Load queue search energy
Based on 0.09 micron process technology using
Cacti v. 3.2
42Load queue search latency
Based on 0.09 micron process technology using
Cacti v. 3.2
43Benchmarks
- MP (16-way)
- Commercial workloads (SPECweb, TPC-H)
- SPLASH2 scientific application (ocean)
- Error bars signify 95% statistical confidence
- UP
- 3 from SPECfp2000
- Selected due to high reorder buffer utilization
- apsi, art, wupwise
- 3 commercial
- SPECjbb2000, TPC-B, TPC-H
- A few from SPECint2000
44Life cycle of a load
Figure: animation of loads and stores (LD ?, ST ?, LD A, ST A) moving through the machine. Extracted labels: "OoO Execution Window", "Load queue", "Blam!".
45Performance relative to unconstrained load queue
Good news: Replay w/ no-recent-snoop filter only 1% slower on average
46Reorder-Buffer Utilization
47Why focus on load queue?
- Load queue has different constraints than store queue
- More loads than stores (30% vs 14% of dynamic instructions)
- Load queue searched more frequently (consuming more power)
- Store-forwarding logic performance critical
- Many non-scalable structures in OoO processor
- Scheduler
- Physical register file
- Register map
48Prior work formal memory model representations
- Local, WRT, global performance of memory ops (Dubois et al., ISCA-13)
- Acyclic graph representation (Landin et al., ISCA-18)
- Modeling memory operation as a series of sub-operations (Collier, RAPA)
- Acyclic graph sub-operations (Adve, thesis)
- Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)
49Some History
From: Larry.Widigen_at_amd.com (Larry Widigen)
Received: by charlie (4.1) id AA00850; Wed, 14 Aug 96 10:33:12 PDT
Date: Wed, 14 Aug 96 10:33:12 PDT
Message-Id: <9608141733.AA00850_at_charlie>
To: Mikko_H._Lipasti_at_cmu.edu
Subject: www location of paper
Status: RO
X-Status:
X-Keywords:
X-UID: 1

I would like to review your forthcoming paper, "Value Locality and Load Value Prediction." Could you provide a www address where it resides? I am curious as to its contents since its title suggests that it may discuss an area where I have done some work.

Cordially,
Larry Widigen
Manager of Processor Development