Mikko Lipasti - PowerPoint PPT Presentation

About This Presentation
Title:

Mikko Lipasti

Description:

Title: Efficient Memory Barrier Implementation Author: cain Last modified by: Mikko H. Lipasti Created Date: 10/9/2002 2:59:59 PM Document presentation format – PowerPoint PPT presentation


Transcript and Presenter's Notes

Title: Mikko Lipasti


1
Value Prediction: Are(n't) We Done Yet?
  • Mikko Lipasti
  • University of Wisconsin-Madison

2
Definition
  • What is value prediction? Broadly, three salient
    attributes:
  • Generate a speculative value (predict)
  • Consume speculative value (execute)
  • Verify speculative value (compare/recover)
  • This subsumes branch prediction
  • Focus here on operand values
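
The three attributes can be illustrated with a small sketch. It assumes a last-value predictor indexed by instruction address; the class, function, and variable names are hypothetical and are not taken from the talk.

```python
class LastValuePredictor:
    """Toy last-value predictor (an assumption made for illustration):
    predicts that an instruction produces the value it produced last time."""

    def __init__(self):
        self.table = {}  # instruction address -> last observed value

    def predict(self, pc):
        # Generate a speculative value; None means "no prediction available".
        return self.table.get(pc)

    def verify(self, pc, predicted, actual):
        # Compare the speculative value with the real result, then train.
        self.table[pc] = actual
        return predicted is not None and predicted == actual


def run(trace):
    """trace: list of (pc, actual_value) pairs in execution order.
    Returns (correct_predictions, mispredictions)."""
    predictor = LastValuePredictor()
    correct = mispredicted = 0
    for pc, actual in trace:
        guess = predictor.predict(pc)        # predict
        # Dependent instructions would consume `guess` here (execute).
        ok = predictor.verify(pc, guess, actual)  # verify
        if guess is None:
            continue                         # nothing was speculated
        if ok:
            correct += 1
        else:
            mispredicted += 1                # recovery would be triggered here
    return correct, mispredicted
```

For example, `run([(0x10, 0), (0x10, 0), (0x10, 0), (0x10, 5), (0x14, 7)])` counts two correct predictions and one misprediction.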

3
Some History
  • Classical value prediction
  • Independently invented by 4 groups in 1995-1996
  • AMD (Nexgen): L. Widigen and E. Sowadsky, patent
    filed March 1996, inv. March 1995
  • Technion: F. Gabbay and A. Mendelson, inv.
    sometime 1995, TR 11/96, US patent Sep 1997
  • CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct.
    1995, ASPLOS paper submitted March 1996
  • Wisconsin: Y. Sazeides, J. Smith, Summer 1996

4
Why?
  • Possible explanations
  • Natural evolution from branch prediction
  • Natural evolution from memoization
  • Natural evolution from rampant speculation
  • Cache hit speculation
  • Memory independence speculation
  • Speculative address generation
  • Improvements in tracing/simulation technology
  • "There's a lot of zeroes out there." (C.
    Wilkerson)
  • Values, not just instruction addresses
  • TRIP6000: A. Martin-de-Nicolas, IBM

5
Publications by Year
  • Excludes journals, workshops, compiler conferences

6
What Happened?
  • Tremendous academic interest
  • Dozens of research groups, papers, proposals
  • No industry uptake
  • No present or planned CPU with value prediction
  • Why?
  • Meager performance benefit (< 10%)
  • Power consumption
  • Dynamic power for extra activity
  • Static power (area) for prediction tables
  • Complexity and correctness
  • Subtle memory ordering issues [MICRO '01]
  • Misprediction recovery [HPCA '04]

7
Performance?
  • Relationship between timely fetch and value
    prediction benefit [Gabbay, ISCA]
  • Value prediction doesn't help when the result can
    be computed before the consumer instruction is
    fetched
  • High-bandwidth fetch helps
  • Wide trace caches studied in late 1990s
  • But, these have several negative attributes
  • Recent designs focus on frequency, not ILP
  • High-bandwidth fetch is a red herring
  • More important to fetch the right instructions

8
Future Adoption?
  • Classical value prediction will only make it in
    the context of a very different microarchitecture
  • One that explicitly and aggressively exposes ILP
  • Promising trends
  • Deep pipelining craze appears to be over
  • Can't manage the design complexity
  • High frequency mania appears to be over
  • Can't afford the power
  • Architects are pursuing ILP once again
  • Value prediction has another opportunity

9
What Value Prediction Begat
  • Value prediction catalyzed a new focus on values
    in computation
  • This had not been studied before
  • A whole new realm of research
  • Value-Aware Microarchitecture
  • Spans numerous subdisciplines
  • Significant industrial impact already
  • Also, developments in supporting technologies

10
Value-Aware Microarchitecture
  • Memory Hierarchy
  • Register File Compression [several]
  • Cache Compression [Gupta, Alameldeen]
  • Memory Compression [e.g. IBM MXT]
  • Bandwidth compression
  • Address and data bus encoding [Rudolph]
  • Initialization Traffic [Lewis]
  • Load/Store Processing
  • Load value prediction [numerous]
  • Fast address calculation [Austin]
  • Value-aware alias prediction [Onder]
  • Memory consistency [Cain]

Value-Aware Microarchitecture
  • Execution Core
  • Value Prediction
  • Operand Significance
  • Low Power [Canal]
  • Execution bandwidth [Loh]
  • Bit-slicing [Pentium 4, Mestan]
  • Instruction reuse [Sodani]
  • Carry prediction [Circuit-level Speculation]
  • Cache Coherence
  • Producer side
  • Silent stores, temporally silent stores [Lepak]
  • Speculative lock elision [Rajwar; Wisc, UIUC]
  • Consumer side
  • Load value prediction using stale lines [Lepak]
  • Coherence decoupling [Burger, Sohi; ASPLOS '04]
11
Supporting Technologies
  • Value prediction presented some unique
    challenges
  • Relatively low correct prediction rate (initially
    40-50%)
  • Nontrivial misprediction rate with avoidable
    misprediction cost
  • These drove study of
  • Confidence prediction/estimation
  • First microarchitectural application of
    confidence estimation, though not widely credited
    or cited as such
  • Since studied for numerous applications, e.g.
    gating control speculation
  • Selective recovery [Sazeides Ph.D., Kim HPCA '04]
  • Numerous challenges in extending recovery to
    entire window
  • Both have proved to be fruitful research areas
  • Also stimulated development of software
    technology
  • Value profiling
  • Value-based compiler optimizations
  • Run-time code specialization

12
Outline
  • Some History
  • Industry Trends
  • Value-Aware Microarchitecture
  • Case study: Memory Consistency [Trey Cain, ISCA
    2004]
  • Conventional load queue microarchitecture
  • Value-based memory ordering
  • Replay-reduction heuristics
  • Performance evaluation
  • Conclusions

13
Value-based Memory Consistency
  • High ILP => Large instruction windows
  • Larger physical register file
  • Larger scheduler
  • Larger load/store queues
  • Result in increased access latency
  • Value-based Replay
  • If load queue scalability is a problem... who needs
    one!
  • Instead, re-execute load instructions a 2nd time
    in program order
  • Filter replays: heuristics reduce extra cache
    bandwidth to 3.5% on average

14
Enforcing RAW dependences
Program order (execution order):
  1. (1) store A
  2. (3) store ?
  3. (2) load A
  • Load queue contains load addresses
  • Memory independence speculation
  • Hoist load above unknown store assuming it is to
    a different address
  • Check correctness at store retirement
  • One search per store address calculation
  • If address matches, the load is squashed
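
The load queue check described above can be sketched as follows. This is a simplified model, assuming each load records only its program-order age and address, and ignoring store-to-load forwarding and partial overlaps; `LoadQueue`, `LoadEntry`, and `search_on_store` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LoadEntry:
    age: int      # program-order sequence number (smaller means older)
    address: int  # address the load has already read from


class LoadQueue:
    def __init__(self):
        self.entries = []

    def insert(self, age, address):
        # Called when a load computes its address and executes.
        self.entries.append(LoadEntry(age, address))

    def search_on_store(self, store_age, store_address):
        """Associative search performed when a store address becomes known.
        Returns the ages of younger loads that already read the same
        address; these loads were hoisted incorrectly and must be squashed."""
        return sorted(e.age for e in self.entries
                      if e.age > store_age and e.address == store_address)


# The slide's example: load A (program order 3) executes before the
# unknown store (program order 2) resolves its address.
lq = LoadQueue()
lq.insert(age=3, address=0xA0)
print(lq.search_on_store(store_age=2, store_address=0xA0))  # store ? was to A
print(lq.search_on_store(store_age=2, store_address=0xB0))  # store ? was elsewhere
```

The first search reports load 3 as a squash candidate; the second finds nothing, so the speculation was safe.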

15
Enforcing memory consistency
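Program order (execution order):
  • Processor p2
  • 1. (2) store A
  • Processor p1
  • 1. (3) load A
  • 2. (1) load A

[Figure labels: RAW and WAR edges between p2's store and p1's loads]
  • Two approaches
  • Snooping: search per incoming invalidate
  • Insulated: search per load address calculation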

16
Load queue implementation
[Figure: load queue with queue management and squash determination logic; inputs: external request, external address, store address, store age, load address, load age]
  • # of write ports = load address calc width
  • # of read ports = load + store address calc width
    (+ 1)
  • Current generation designs (32-48 entries, 2
    write ports, 2 (3) read ports)

17
Load queue scaling
  • Larger instruction window => larger load queue
  • Increases access latency
  • Increases energy consumption
  • Wider issue width => more read/write ports
  • Also increases latency and energy

18
Related work: MICRO 2003
  • Park et al., Purdue
  • Extra structure dedicated to enforcing memory
    consistency
  • Increase capacity through segmentation
  • Sethumadhavan et al., UT-Austin
  • Add set of filters summarizing contents of load
    queue

19
Keep it simple
  • Throw more hardware at the problem?
  • Need to design/implement/verify
  • Execution core is already complicated
  • Load queue checks for rare errors
  • Why not move error checking away from exe?

20
Value-based Consistency

[Figure: pipeline diagram with stages IF1, IF2, D, R, Q, S, EX, WB, and C, plus the added replay (REP) and compare (CMP) stages]
  • Replay: access the cache a second time, cheaply!
  • Almost always a cache hit
  • Reuse address calculation and translation
  • Share cache port used by stores in commit stage
  • Compare: compare new value to original value
  • Squash if the values differ
  • This is value prediction!
  • Predict: access cache prematurely
  • Execute: as usual
  • Verify: replay load, compare value, recover if
    necessary
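
The replay and compare steps can be sketched in a few lines. This is a minimal model, assuming the cache is a dictionary of committed values and each load remembers the value it obtained when it executed prematurely; `replay_and_compare` and the dictionary keys are hypothetical names.

```python
def replay_and_compare(loads, cache):
    """loads: list of dicts with 'address' and 'value' (the value read at
    premature execution), in program order.
    cache: dict mapping address -> current committed value.
    Returns the index of the first load whose replayed value differs
    (the squash point), or None if every load verifies."""
    for index, load in enumerate(loads):
        replayed = cache[load["address"]]  # replay: second cache access
        if replayed != load["value"]:      # compare: new vs. original value
            return index                   # squash from this load onward
    return None


cache = {0xA0: 1, 0xB0: 2}
loads = [{"address": 0xA0, "value": 1},   # verified
         {"address": 0xB0, "value": 9}]   # stale value, must be squashed
print(replay_and_compare(loads, cache))
```

Note that a load that observed a stale copy of an unchanged value still verifies, which is where value locality can hide an ordering violation.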

21
Rules of replay
  • All prior stores must have written data to the
    cache
  • No store-to-load forwarding
  • Loads must replay in program order
  • If a load is squashed, it should not be replayed
    a second time
  • Ensures forward progress

22
Replay reduction
  • Replay costs
  • Consumes cache bandwidth (and power)
  • Increases reorder buffer occupancy
  • Can we avoid these penalties?
  • Infer correctness of certain operations
  • Four replay filters
  • These are used to avoid checking our value
    prediction when in fact no value prediction
    occurred (loaded value is known to be correct)
  • Similar to constant prediction in initial work

23
No-Reorder filter
  • Avoid replay if load isn't reordered w.r.t. other
    memory operations
  • Can we do better?

24
Enforcing single-thread RAW dependencies
  • No-Unresolved Store Address Filter
  • Load instruction i is replayed if there are prior
    stores with unresolved addresses when i issues
  • Works for intra-processor RAW dependences
  • Doesn't enforce memory consistency

25
Enforcing MP consistency
  • No-Recent-Miss Filter
  • Avoid replay if there have been no cache line
    fills (to any address) while load in instruction
    window
  • No-Recent-Snoop Filter
  • Avoid replay if there have been no external
    invalidates (to any address) while load in
    instruction window
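
A sketch of how these filters might decide whether a committed load needs a replay. It assumes the No-Unresolved Store Address filter is paired with one of the two multiprocessor filters, as the filter summary slide suggests; the `LoadState` fields and the `must_replay` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class LoadState:
    unresolved_prior_store_at_issue: bool  # an older store address was unknown
    fill_while_in_window: bool             # any cache line fill since load entered
    snoop_while_in_window: bool            # any external invalidate since then


def must_replay(load, consistency_filter="no-recent-snoop"):
    """Return True if the load has to be replayed at commit."""
    # Single-thread RAW check (No-Unresolved Store Address filter).
    if load.unresolved_prior_store_at_issue:
        return True
    # Multiprocessor consistency check: pick one of the two filters.
    if consistency_filter == "no-recent-snoop":
        return load.snoop_while_in_window
    if consistency_filter == "no-recent-miss":
        return load.fill_while_in_window
    raise ValueError(f"unknown filter: {consistency_filter}")


quiet = LoadState(False, False, False)
filled = LoadState(False, True, False)
print(must_replay(quiet))                      # no replay needed
print(must_replay(filled, "no-recent-miss"))   # replay: a fill occurred
print(must_replay(filled, "no-recent-snoop"))  # no replay: no invalidate seen
```

Both multiprocessor filters are deliberately coarse: they react to any fill or invalidate, regardless of address, so they need no address comparison at all.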

26
Constraint graph
  • Defined for sequential consistency by Landin et
    al., ISCA-18
  • Directed-graph represents a multithreaded
    execution
  • Nodes represent dynamic instruction instances
  • Edges represent their transitive orders (program
    order, RAW, WAW, WAR).
  • If the constraint graph is acyclic, then the
    execution is correct
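
The acyclicity test can be sketched with a depth-first search. The example edges below are an assumed execution chosen for illustration, not a transcription of the talk's figure, and `has_cycle` is a hypothetical helper.

```python
def has_cycle(edges):
    """edges: iterable of (src, dst) ordering constraints.
    Returns True if the directed constraint graph contains a cycle."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for nxt in graph[node]:
            if color[nxt] == GRAY:
                return True       # back edge: a cycle exists
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)


# Assumed example: one processor stores B then A; the other loads A
# (seeing the new value) and then loads B (seeing the old value).
edges = [
    ("ST B", "ST A"),  # program order
    ("LD A", "LD B"),  # program order
    ("ST A", "LD A"),  # RAW: the load observed the store
    ("LD B", "ST B"),  # WAR: the load read before the store
]
print(has_cycle(edges))       # cyclic, so not sequentially consistent
print(has_cycle(edges[:3]))   # without the WAR edge the graph is acyclic
```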

27
Constraint graph example - SC
[Figure: constraint graph over the operations ST A, LD B, ST B, and LD A (numbered 1-4) on Proc 1 and Proc 2, with program-order, RAW, and WAR edges]
Cycle indicates that execution is incorrect
28
Anatomy of a cycle
[Figure: constraint graph over ST A, LD B, ST B, and LD A on Proc 1 and Proc 2, with program-order, RAW, and WAR edges; annotations mark an incoming invalidate and a cache miss]
29
Enforcing MP consistency
  • No-Recent-Miss Filter
  • Avoid replay if there have been no cache line
    fills (to any address) while load in instruction
    window
  • No-Recent-Snoop Filter
  • Avoid replay if there have been no external
    invalidates (to any address) while load in
    instruction window

30
Filter Summary
From conservative to aggressive:
  • Replay all committed loads
  • No-Reorder Filter
  • No-Unresolved Store / No-Recent-Snoop Filter
  • No-Unresolved Store / No-Recent-Miss Filter
31
Outline
  • Some History
  • Industry Trends
  • Value-Aware Microarchitecture
  • Case study: Memory Consistency [Cain, ISCA]
  • Conventional load queue microarchitecture
  • Value-based memory ordering
  • Replay-reduction heuristics
  • Performance evaluation
  • Conclusions

32
Base machine model
  • Simulator: PHARMsim, a PowerPC execute-at-execute simulator with OOO cores and an aggressive split-transaction snooping coherence protocol
  • Out-of-order execution core: 5 GHz, 15-stage, 8-wide pipeline; 256-entry reorder buffer; 128-entry load/store queue; 32-entry issue queue
  • Functional units (latency): 8 Int ALUs (1), 3 Int MULT/DIV (3/12), 4 FP ALUs (4), 4 FP MULT/DIV (4,4); 4 L1 Dcache load ports in OoO window; 1 L1 Dcache load/store port at commit
  • Front-end: combined bimodal (16k entry) / gshare (16k entry) branch predictor with 16k-entry selection table, 64-entry RAS, 8k-entry 4-way BTB
  • Memory system (latency): 32k DM L1 icache (1), 32k DM L1 dcache (1), 256K 8-way L2 (7), 8MB 8-way L3 (15), 64-byte cache lines; memory (400 cycle/100 ns best-case latency, 10 GB/s BW); stride-based prefetcher modeled after Power4
33
L1 DCache bandwidth increase
[Figure: L1 Dcache bandwidth increase for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads. Bars: (a) replay all, (b) no-reorder filter, (c) no-recent-miss filter, (d) no-recent-snoop filter]

On average, 3.4% bandwidth overhead using
no-recent-snoop filter
34
Value-based replay performance (relative to
constrained load queue)
[Figure: results for SPECint2000, SPECfp2000, commercial, and multiprocessor workloads]
Value-based replay 8% faster on avg than baseline
using 16-entry ld queue
35
Does value locality help?
  • Not much
  • Value locality does avoid memory ordering
    violations
  • 59% of single-thread violations avoided
  • 95% of consistency violations avoided
  • But these violations rarely occur
  • 1 single-thread violation per 100 million instr
  • 4 consistency violations per 10,000 instr

36
What About Power?
  • Simple power model
  • Empirically 0.02 replay loads per committed
    instruction
  • If load queue CAM energy/insn > 0.02 × energy
    expenditure of a cache access and comparison
  • value-based implementation saves power!

\[
\Delta \mathrm{Energy} = \mathrm{replays} \times \left( E_{\text{per cache access}} + E_{\text{per word comparison}} \right) + \text{replay overhead} - \left( E_{\text{per ldq search}} \times \text{ldq searches} \right)
\]
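
A worked example of the simple power model, assuming the energy difference is the replay energy plus replay overhead minus the load queue search energy that is no longer spent. All numeric values below are made-up placeholders in arbitrary energy units, not measurements from the talk; only the 0.02 replays per committed instruction comes from the slide.

```python
def delta_energy(replays, e_cache_access, e_word_compare,
                 replay_overhead, ldq_searches, e_ldq_search):
    """Energy change from switching to value-based replay.
    A negative result means the value-based design saves energy."""
    replay_cost = replays * (e_cache_access + e_word_compare)
    ldq_savings = ldq_searches * e_ldq_search
    return replay_cost + replay_overhead - ldq_savings


# Per 1000 committed instructions: 0.02 replays per instruction = 20 replays.
# Energy values and the search count are hypothetical.
change = delta_energy(replays=20, e_cache_access=1.0, e_word_compare=0.1,
                      replay_overhead=0.0, ldq_searches=440, e_ldq_search=0.1)
print(change)  # negative here, so replay would save energy under these assumptions
```

With these placeholder numbers the replay cost is about 22 units against 44 units of avoided searches; the conclusion flips if a load queue search is cheap relative to a cache access.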
37
Value-based replay Pros/Cons
  • Eliminates associative lookup hardware
  • Load queue becomes simple FIFO
  • Negligible IPC or L1D bandwidth impact
  • Can be used to fix value prediction
  • Enforces dependence order consistency constraint
    [MICRO '01]
  • Requires additional pipeline stages
  • Requires additional cache datapath for loads

38
Conclusions
  • Value prediction
  • Continues to generate lots of academic interest
  • Little industry uptake so far
  • Historical trends (narrow deep pipelines)
    minimized benefit
  • Sea-change underway on this front
  • Value prediction will be revisited in quest for
    ILP
  • Power consumption is key!
  • Value-Aware Microarchitecture
  • Multiple fertile areas of research
  • Some has found its way into products
  • Are we done yet? No!
  • Questions?

39
Backups
40
Caveat Memory Dependence Prediction
  • Some predictors train using the conflicting store
  • (e.g. store-set predictor)
  • Replay mechanism is unable to pinpoint
    conflicting store
  • Fair comparison
  • Baseline machine: store-set predictor w/ 4k-entry
    SSIT and 128-entry LFST
  • Experimental machine: simple 21264-style
    dependence predictor w/ 4k-entry history table

41
Load queue search energy
Based on 0.09 micron process technology using
Cacti v. 3.2
42
Load queue search latency
Based on 0.09 micron process technology using
Cacti v. 3.2
43
Benchmarks
  • MP (16-way)
  • Commercial workloads (SPECweb, TPC-H)
  • SPLASH2 scientific application (ocean)
  • Error bars signify 95% statistical confidence
  • UP
  • 3 from SPECfp2000
  • Selected due to high reorder buffer utilization
  • apsi, art, wupwise
  • 3 commercial
  • SPECjbb2000, TPC-B, TPC-H
  • A few from SPECint2000

44
Life cycle of a load
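[Figure: animation of an OoO execution window filled with loads and stores whose addresses are not yet known (LD ?, ST ?). An LD A and an ST A resolve to the same address, and the load queue search flags the conflict ("Blam!").]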
45
Performance relative to unconstrained load queue
Good news: replay w/ no-recent-snoop filter only
1% slower on average
46
Reorder-Buffer Utilization
47
Why focus on load queue?
  • Load queue has different constraints than store
    queue
  • More loads than stores (30% vs. 14% of dynamic
    instructions)
  • Load queue searched more frequently (consuming
    more power)
  • Store-forwarding logic performance critical
  • Many non-scalable structures in OoO processor
  • Scheduler
  • Physical register file
  • Register map

48
Prior work: formal memory model representations
  • Local, WRT, global performance of memory ops
    (Dubois et al., ISCA-13)
  • Acyclic graph representation (Landin et al.,
    ISCA-18)
  • Modeling memory operation as a series of
    sub-operations (Collier, RAPA)
  • Acyclic graph sub-operations (Adve, thesis)
  • Initiation event, for modeling early
    store-to-load forwarding (Gharachorloo, thesis)

49
Some History
From: Larry.Widigen@amd.com (Larry Widigen)
Received: by charlie (4.1) id AA00850; Wed, 14 Aug 96 10:33:12 PDT
Date: Wed, 14 Aug 96 10:33:12 PDT
Message-Id: <9608141733.AA00850@charlie>
To: Mikko_H._Lipasti@cmu.edu
Subject: www location of paper
Status: RO
X-Status:
X-Keywords:
X-UID: 1

I would like to review your forthcoming paper, "Value Locality and Load Value
Prediction." Could you provide a www address where it resides? I am curious as
to its contents since its title suggests that it may discuss an area where I
have done some work.

Cordially,
Larry Widigen
Manager of Processor Development
  • Classical value prediction
  • Independently invented by 4 groups in 1995-1996
  • AMD (Nexgen): L. Widigen and E. Sowadsky, patent
    filed March 1996, inv. March 1995
  • Technion: F. Gabbay and A. Mendelson, inv.
    sometime 1995, TR 11/96, US patent Sep 1997
  • CMU: M. Lipasti, C. Wilkerson, J. Shen, inv. Oct.
    1995, ASPLOS paper submitted March 1996
  • Wisconsin: Y. Sazeides, J. Smith, Summer 1996