Mikko Lipasti

Title: Efficient Memory Barrier Implementation Author: cain Last modified by: Mikko H. Lipasti Created Date: 10/9/2002 2:59:59 PM Document presentation format – PowerPoint PPT presentation

Slides: 50

Provided by: CAIN59

Learn more at: https://users.oden.utexas.edu

Transcript and Presenter's Notes

1
Value Prediction: Are(n't) We Done Yet?

Mikko Lipasti
University of Wisconsin-Madison

2
Definition

What is value prediction? Broadly, three salient attributes:
Generate a speculative value (predict)
Consume the speculative value (execute)
Verify the speculative value (compare/recover)
This subsumes branch prediction
Focus here on operand values
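
The predict/execute/verify loop above can be sketched in a few lines. This is a minimal illustration only: it assumes a last-value predictor indexed by instruction address, and the names `LastValuePredictor` and `run` are hypothetical, not taken from the slides.

```python
class LastValuePredictor:
    """Toy last-value predictor indexed by instruction address (pc).

    Assumption for illustration: each instruction is predicted to
    produce the same value it produced last time.
    """

    def __init__(self):
        self.table = {}  # pc -> last value produced by that instruction

    def predict(self, pc):
        # Generate a speculative value, or None if this pc has no history.
        return self.table.get(pc)

    def verify(self, pc, actual):
        # Compare the prediction with the real result, then train the table.
        predicted = self.table.get(pc)
        self.table[pc] = actual
        return predicted is not None and predicted == actual


def run(trace):
    """trace: list of (pc, actual_value) pairs.

    Returns (correct, mispredicted, unpredicted) counts.
    """
    lvp = LastValuePredictor()
    correct = wrong = unpredicted = 0
    for pc, actual in trace:
        guess = lvp.predict(pc)        # predict
        # execute: dependent instructions would consume `guess` here
        ok = lvp.verify(pc, actual)    # verify
        if guess is None:
            unpredicted += 1
        elif ok:
            correct += 1
        else:
            wrong += 1                 # a real core would recover here
    return correct, wrong, unpredicted
```

For example, `run([(0x10, 0), (0x10, 0), (0x10, 0), (0x20, 5), (0x20, 7)])` counts two correct predictions, one misprediction, and two first-time instructions with no prediction.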

3
Some History

Classical value prediction
Independently invented by 4 groups in 1995-1996
AMD (NexGen): L. Widigen and E. Sowadsky, patent filed March 1996, invented March 1995
Technion: F. Gabbay and A. Mendelson, invented sometime in 1995, TR 11/96, US patent Sep 1997
CMU: M. Lipasti, C. Wilkerson, J. Shen, invented Oct. 1995, ASPLOS paper submitted March 1996
Wisconsin: Y. Sazeides, J. Smith, Summer 1996

4
Why?

Possible explanations
Natural evolution from branch prediction
Natural evolution from memoization
Natural evolution from rampant speculation
Cache hit speculation
Memory independence speculation
Speculative address generation
Improvements in tracing/simulation technology
"There's a lot of zeroes out there." (C. Wilkerson)
Values, not just instruction addresses
TRIP6000 (A. Martin-de-Nicolas, IBM)

5
Publications by Year

Excludes journals, workshops, compiler conferences

6
What Happened?

Tremendous academic interest
Dozens of research groups, papers, proposals
No industry uptake
No present or planned CPU with value prediction
Why?
Meager performance benefit (< 10%)
Power consumption
Dynamic power for extra activity
Static power (area) for prediction tables
Complexity and correctness
Subtle memory ordering issues [MICRO '01]
Misprediction recovery [HPCA '04]

7
Performance?

Relationship between timely fetch and value prediction benefit [Gabbay, ISCA]
Value prediction doesn't help when the result can be computed before the consumer instruction is fetched
High-bandwidth fetch helps
Wide trace caches studied in late 1990s
But, these have several negative attributes
Recent designs focus on frequency, not ILP
High-bandwidth fetch is a red herring
More important to fetch the right instructions

8
Future Adoption?

Classical value prediction will only make it in the context of a very different microarchitecture
One that explicitly and aggressively exposes ILP
Promising trends
Deep pipelining craze appears to be over
Can't manage the design complexity
High frequency mania appears to be over
Can't afford the power
Architects are pursuing ILP once again
Value prediction has another opportunity

9
What Value Prediction Begat

Value prediction catalyzed a new focus on values
in computation
This had not been studied before
A whole new realm of research
Value-Aware Microarchitecture
Spans numerous subdisciplines
Significant industrial impact already
Also, developments in supporting technologies

10
Value-Aware Microarchitecture

Memory Hierarchy
Register File Compression [several]
Cache Compression [Gupta, Alameldeen]
Memory Compression [e.g. IBM MXT]
Bandwidth compression
Address and data bus encoding [Rudolph]
Initialization Traffic [Lewis]

Load/Store Processing
Load value prediction [numerous]
Fast address calculation [Austin]
Value-aware alias prediction [Onder]
Memory consistency [Cain]

Execution Core
Value Prediction
Operand Significance
Low Power [Canal]
Execution bandwidth [Loh]
Bit-slicing [Pentium 4, Mestan]
Instruction reuse [Sodani]
Carry prediction [Circuit-level Speculation]

Cache Coherence
Producer side
Silent stores, temporally silent stores [Lepak]
Speculative lock elision [Rajwar; Wisc, UIUC]
Consumer side
Load value prediction using stale lines [Lepak]
Coherence decoupling [Burger, Sohi; ASPLOS '04]

11
Supporting Technologies

Value prediction presented some unique challenges
Relatively low correct prediction rate (initially 40-50%)
Nontrivial misprediction rate with avoidable misprediction cost
These drove study of
Confidence prediction/estimation
First microarchitectural application of confidence estimation, though not widely credited or cited as such
Since studied for numerous applications, e.g. gating control speculation
Selective recovery [Sazeides Ph.D.; Kim, HPCA '04]
Numerous challenges in extending recovery to entire window
Both have proved to be fruitful research areas
Also stimulated development of software technology
Value profiling
Value-based compiler optimizations
Run-time code specialization
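
Confidence estimation, as mentioned above, is commonly built from small saturating counters. The following is a minimal sketch under that assumption; the class name, counter width, threshold, and reset-on-miss policy are all illustrative choices, not details from the slides.

```python
class ConfidenceEstimator:
    """Per-instruction saturating counters that gate speculation.

    Assumed policy (hypothetical): count up on a correct prediction,
    reset to zero on a misprediction, and only use a prediction once
    the counter reaches `threshold`.
    """

    def __init__(self, max_count=3, threshold=2):
        self.max_count = max_count
        self.threshold = threshold
        self.counters = {}  # pc -> current confidence count

    def is_confident(self, pc):
        # Speculate only when this instruction has a good recent record.
        return self.counters.get(pc, 0) >= self.threshold

    def update(self, pc, was_correct):
        count = self.counters.get(pc, 0)
        if was_correct:
            self.counters[pc] = min(count + 1, self.max_count)
        else:
            self.counters[pc] = 0  # avoid repeating a costly misprediction
```

With the default parameters, an instruction must be predicted correctly twice in a row before its predictions are consumed, which trades some coverage for fewer avoidable mispredictions.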

12
Outline

Some History
Industry Trends
Value-Aware Microarchitecture
Case study: Memory Consistency [Trey Cain, ISCA 2004]
Conventional load queue microarchitecture
Value-based memory ordering
Replay-reduction heuristics
Performance evaluation
Conclusions

13
Value-based Memory Consistency

High ILP => large instruction windows
Larger physical register file
Larger scheduler
Larger load/store queues
All result in increased access latency
Value-based Replay
If load queue scalability is a problem, who needs one!
Instead, re-execute load instructions a 2nd time in program order
Filter replays: heuristics reduce extra cache bandwidth to 3.5% on average

14
Enforcing RAW dependences
Program order, with execution order in parentheses:
store A (1)
store ? (3)
load A (2)

Load queue contains load addresses
Memory independence speculation
Hoist load above unknown store, assuming it is to a different address
Check correctness at store retirement
One search per store address calculation
If address matches, the load is squashed

15
Enforcing memory consistency

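Processor p1 (program order, execution order in parentheses):
1. load A (3)
2. load A (1)

Processor p2:
store A (2)

RAW: store A on p2 -> first load A on p1
WAR: second load A on p1 -> store A on p2

Two approaches
Snooping: search per incoming invalidate
Insulated: search per load address calculation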

16
Load queue implementation
[Figure: load queue with queue management and squash determination logic; inputs are external request, external address, store address, store age, load address, load age]

# of write ports = load address calc width
# of read ports = load + store address calc width (+1)
Current generation designs: 32-48 entries, 2 write ports, 2 (3) read ports

17
Load queue scaling

Larger instruction window => larger load queue
Increases access latency
Increases energy consumption
Wider issue width => more read/write ports
Also increases latency and energy

18
Related work: MICRO 2003

Park et al., Purdue
Extra structure dedicated to enforcing memory
consistency
Increase capacity through segmentation
Sethumadhavan et al., UT-Austin
Add set of filters summarizing contents of load
queue

19
Keep it simple

Throw more hardware at the problem?
Need to design/implement/verify
Execution core is already complicated
Load queue checks for rare errors
Why not move error checking away from exe?

20
Value-based Consistency

[Figure: pipeline diagram with stage labels IF1, IF2, D, R, Q, S, EX, WB, C, plus the added REP (replay) and CMP (compare) stages]

Replay: access the cache a second time, cheaply!
Almost always cache hit
Reuse address calculation and translation
Share cache port used by stores in commit stage
Compare: compares new value to original value
Squash if the values differ
This is value prediction!
Predict: access cache prematurely
Execute: as usual
Verify: replay load, compare value, recover if necessary
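
The replay and compare steps above amount to a value comparison at commit. Here is a minimal sketch, assuming memory is modeled as a plain dictionary and using the hypothetical function name `commit_load`:

```python
def commit_load(memory, address, premature_value):
    """Replay a load in program order and compare against its early value.

    `memory` stands in for the cache (an assumption of this sketch).
    Returns (squash, replayed_value): squash is True only when the value
    seen at replay differs from the value the load used when it issued.
    """
    replayed_value = memory[address]            # second cache access
    squash = replayed_value != premature_value  # compare stage
    return squash, replayed_value
```

Note that the check is on values, not addresses: if an intervening store wrote the same value the load already saw, the comparison passes and no squash is needed.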

21
Rules of replay

All prior stores must have written data to the cache
No store-to-load forwarding
Loads must replay in program order
If a load is squashed, it should not be replayed a second time
Ensures forward progress

22
Replay reduction

Replay costs
Consumes cache bandwidth (and power)
Increases reorder buffer occupancy
Can we avoid these penalties?
Infer correctness of certain operations
Four replay filters
These are used to avoid checking our value
prediction when in fact no value prediction
occurred (loaded value is known to be correct)
Similar to constant prediction in initial work

23
No-Reorder filter

Avoid replay if load isn't reordered w.r.t. other memory operations
Can we do better?

24
Enforcing single-thread RAW dependencies

No-Unresolved Store Address Filter
Load instruction i is replayed if there are prior stores with unresolved addresses when i issues
Works for intra-processor RAW dependences
Doesn't enforce memory consistency

25
Enforcing MP consistency

No-Recent-Miss Filter
Avoid replay if there have been no cache line fills (to any address) while the load is in the instruction window
No-Recent-Snoop Filter
Avoid replay if there have been no external invalidates (to any address) while the load is in the instruction window

26
Constraint graph

Defined for sequential consistency by Landin et al., ISCA-18
Directed graph represents a multithreaded execution
Nodes represent dynamic instruction instances
Edges represent their transitive orders (program order, RAW, WAW, WAR)
If the constraint graph is acyclic, then the execution is correct
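
Checking a constraint graph for correctness reduces to cycle detection on a directed graph. The sketch below uses a standard depth-first search; the function name `has_cycle` and the edge-list representation are assumptions made for illustration.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (src, dst) pairs has a cycle.

    Nodes are dynamic instruction instances; edges are program-order,
    RAW, WAW, or WAR orderings.
    """
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])

    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the DFS stack, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:  # back edge closes a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)
```

As a usage example, take two stores on one processor (ST A then ST B) and two loads on another (LD B then LD A). If LD B reads the new B (RAW edge ST B -> LD B) but LD A reads the old A (WAR edge LD A -> ST A), the four edges form a cycle and `has_cycle` returns True, so the execution is not sequentially consistent.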

27
Constraint graph example - SC
[Figure: constraint graph for two processors]
Proc 1 (program order): ST A, then ST B
Proc 2 (program order): LD B, then LD A
RAW edge: ST B -> LD B
WAR edge: LD A -> ST A
Cycle indicates that execution is incorrect
28
Anatomy of a cycle
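[Figure: the same constraint graph, annotated with the events that create each inter-processor edge]
Proc 1 (program order): ST A, then ST B
Proc 2 (program order): LD B, then LD A
WAR edge (LD A -> ST A): incoming invalidate
RAW edge (ST B -> LD B): cache miss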
29
Enforcing MP consistency

No-Recent-Miss Filter
Avoid replay if there have been no cache line fills (to any address) while the load is in the instruction window
No-Recent-Snoop Filter
Avoid replay if there have been no external invalidates (to any address) while the load is in the instruction window

30
Filter Summary
Conservative
Replay all committed loads
No-Reorder Filter
No-Unresolved-Store / No-Recent-Snoop Filter
No-Unresolved-Store / No-Recent-Miss Filter
Aggressive
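
The filter summary above can be read as a per-load replay decision. The sketch below is a simplified interpretation of the slides, not the authors' hardware: `LoadStatus`, `must_replay`, and the policy strings are hypothetical names, and each flag is assumed to be tracked while the load sits in the instruction window.

```python
from dataclasses import dataclass


@dataclass
class LoadStatus:
    reordered: bool               # issued out of order w.r.t. other memory ops
    unresolved_prior_store: bool  # an older store had an unknown address at issue
    recent_miss: bool             # a cache line fill occurred while in the window
    recent_snoop: bool            # an external invalidate arrived while in the window


def must_replay(load, policy):
    """Decide whether a committing load needs a replay under a given filter."""
    if policy == "replay-all":
        return True
    if policy == "no-reorder":
        return load.reordered
    if policy == "no-recent-snoop":
        return load.unresolved_prior_store or load.recent_snoop
    if policy == "no-recent-miss":
        return load.unresolved_prior_store or load.recent_miss
    raise ValueError(f"unknown policy: {policy}")
```

Under this reading, the two combined filters pair the single-thread check (unresolved store addresses) with one multiprocessor check (snoops or misses), so a load is skipped only when both sources of incorrectness are ruled out.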
31
Outline

Some History
Industry Trends
Value-Aware Microarchitecture
Case study: Memory Consistency [Cain, ISCA]
Conventional load queue microarchitecture
Value-based memory ordering
Replay-reduction heuristics
Performance evaluation
Conclusions

32
Base machine model
PHARMsim: PowerPC execute-at-execute simulator with OoO cores and aggressive split-transaction snooping coherence protocol
Out-of-order execution core: 5 GHz, 15-stage, 8-wide pipeline; 256-entry reorder buffer; 128-entry load/store queue; 32-entry issue queue
Functional units (latency): 8 Int ALUs (1); 3 Int MULT/DIV (3/12); 4 FP ALUs (4); 4 FP MULT/DIV (4, 4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
Front-end: combined bimodal (16k-entry) / gshare (16k-entry) branch predictor with 16k-entry selection table; 64-entry RAS; 8k-entry 4-way BTB
Memory system (latency): 32k DM L1 icache (1); 32k DM L1 dcache (1); 256K 8-way L2 (7); 8MB 8-way L3 (15); 64-byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s BW); stride-based prefetcher modeled after Power4
33
L1 DCache bandwidth increase
[Figure: bar charts for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads; bars: (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter]

On average, 3.4% bandwidth overhead using no-recent-snoop filter
34
Value-based replay performance (relative to constrained load queue)
[Figure: results for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]
Value-based replay 8% faster on average than baseline using 16-entry load queue
35
Does value locality help?

Not much
Value locality does avoid memory ordering violations
59% of single-thread violations avoided
95% of consistency violations avoided
But these violations rarely occur
1 single-thread violation per 100 million instructions
4 consistency violations per 10,000 instructions

36
What About Power?

Simple power model
Empirically, 0.02 replay loads per committed instruction
If load queue CAM energy/insn > 0.02 × energy expenditure of a cache access and comparison, the value-based implementation saves power!

$$\Delta E = N_{\text{replays}} \left( E_{\text{per cache access}} + E_{\text{per word comparison}} \right) + E_{\text{replay overhead}} - E_{\text{per ldq search}} \, N_{\text{ldq searches}}$$
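
The energy comparison above can be evaluated with a small helper. This sketch reads the expression as replay energy plus replay overhead minus the load queue search energy that is no longer spent; the signs are an assumption recovered from context, and the function and parameter names are hypothetical.

```python
def delta_energy(replays, e_cache_access, e_word_compare,
                 e_replay_overhead, ldq_searches, e_ldq_search):
    """Net energy change of value-based replay vs. an associative load queue.

    All energies are in the same arbitrary unit. A negative result means
    the value-based scheme saves energy under the assumed model.
    """
    replay_cost = replays * (e_cache_access + e_word_compare)
    search_cost_removed = ldq_searches * e_ldq_search
    return replay_cost + e_replay_overhead - search_cost_removed
```

With made-up numbers, `delta_energy(2, 10, 1, 3, 50, 1)` gives -25: a few replays are cheap when they replace many associative searches.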
37
Value-based replay Pros/Cons

Eliminates associative lookup hardware
Load queue becomes simple FIFO
Negligible IPC or L1D bandwidth impact
Can be used to fix value prediction
Enforces dependence-order consistency constraint [MICRO '01]
Requires additional pipeline stages
Requires additional cache datapath for loads

38
Conclusions

Value prediction
Continues to generate lots of academic interest
Little industry uptake so far
Historical trends (narrow deep pipelines) minimized benefit
Sea-change underway on this front
Value prediction will be revisited in quest for ILP
Power consumption is key!
Value-Aware Microarchitecture
Multiple fertile areas of research
Some has found its way into products
Are we done yet? No!
Questions?

39
Backups
40
Caveat: Memory Dependence Prediction

Some predictors train using the conflicting store (e.g. store-set predictor)
Replay mechanism is unable to pinpoint conflicting store
Fair comparison
Baseline machine: store-set predictor w/ 4k-entry SSIT and 128-entry LFST
Experimental machine: simple 21264-style dependence predictor w/ 4k-entry history table

41
Load queue search energy
Based on 0.09 micron process technology using
Cacti v. 3.2
42
Load queue search latency
Based on 0.09 micron process technology using
Cacti v. 3.2
43
Benchmarks

MP (16-way)
Commercial workloads (SPECweb, TPC-H)
SPLASH2 scientific application (ocean)
Error bars signify 95% statistical confidence
UP
3 from SPECfp2000
Selected due to high reorder buffer utilization
apsi, art, wupwise
3 commercial
SPECjbb2000, TPC-B, TPC-H
A few from SPECint2000

44
Life cycle of a load
[Figure: animation of load and store instructions (LD ?, ST ?, LD A, ST A) moving through the OoO execution window and the load queue, with a "Blam!" callout]
45
Performance relative to unconstrained load queue
Good news: replay w/ no-recent-snoop filter only 1% slower on average
46
Reorder-Buffer Utilization
47
Why focus on load queue?

Load queue has different constraints than store queue
More loads than stores (30% vs. 14% of dynamic instructions)
Load queue searched more frequently (consuming more power)
Store-forwarding logic performance critical
Many non-scalable structures in OoO processor
Scheduler
Physical register file
Register map

48
Prior work: formal memory model representations

Local, WRT, global performance of memory ops (Dubois et al., ISCA-13)
Acyclic graph representation (Landin et al., ISCA-18)
Modeling memory operation as a series of sub-operations (Collier, RAPA)
Acyclic graph + sub-operations (Adve, thesis)
Initiation event, for modeling early store-to-load forwarding (Gharachorloo, thesis)

49
Some History
From: Larry.Widigen_at_amd.com (Larry Widigen)
Received: by charlie (4.1) id AA00850; Wed, 14 Aug 96 10:33:12 PDT
Date: Wed, 14 Aug 96 10:33:12 PDT
Message-Id: <9608141733.AA00850_at_charlie>
To: Mikko_H._Lipasti_at_cmu.edu
Subject: www location of paper
Status: RO
X-Status:
X-Keywords:
X-UID: 1

I would like to review your forthcoming paper, "Value Locality and Load Value Prediction." Could you provide a www address where it resides? I am curious as to its contents since its title suggests that it may discuss an area where I have done some work.

Cordially,
Larry Widigen
Manager of Processor Development
