Feedback Directed Prefetching

Provided by: Santhosh
Learn more at: http://users.ece.cmu.edu

1
Feedback Directed Prefetching
  • Santhosh Srinath
  • Onur Mutlu
  • Hyesoon Kim
  • Yale N. Patt


2
Problem
  • Prefetching can significantly improve performance
    • When prefetches are accurate
    • And timely
  • However, prefetching can also significantly
    degrade performance
    • Due to memory bandwidth impact
    • Due to pollution of the cache

Solution
Feedback Directed Prefetching is a
comprehensive mechanism that reduces the
negative effects of prefetching and
improves its positive effects
3
Outline
  • Background and Motivation
  • Feedback Directed Prefetching (FDP)
    • Metrics and how to collect them
    • How to adapt
      • Prefetcher Aggressiveness
      • Cache Insertion Policy for Prefetches
  • Results

4
Background (Prefetcher Aggressiveness)
  • Prefetch Distance
  • Prefetch Degree

[Figure: an access stream and its predicted stream, annotated with Prefetch Distance and Prefetch Degree, for the Very Conservative, Middle of the Road, and Very Aggressive configurations]
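
As a rough illustration of the two knobs, the sketch below models one stream over sequential cache-block addresses. It assumes the conventional stream-prefetcher meaning of the terms: distance bounds how far ahead of the latest demand access the prefetcher may run, and degree is the number of prefetches issued per triggering access. The function name and the block-granular addressing are hypothetical, not taken from the slides.

```python
def blocks_to_prefetch(demand_block, last_prefetched, distance, degree):
    """Return the block addresses to prefetch after one demand access.

    demand_block:    block address of the triggering demand access
    last_prefetched: highest block address already prefetched for the stream
    distance:        how many blocks ahead of the demand access we may run
    degree:          how many prefetches we may issue per triggering access
    """
    limit = demand_block + distance                 # farthest block allowed
    start = max(last_prefetched, demand_block) + 1  # next block not yet fetched
    end = min(start + degree - 1, limit)            # respect both knobs
    return list(range(start, end + 1))              # empty if already at limit


if __name__ == "__main__":
    # Conservative setting: short distance, one prefetch per access.
    print(blocks_to_prefetch(100, 100, distance=4, degree=1))
    # Aggressive setting: long distance, four prefetches per access.
    print(blocks_to_prefetch(100, 120, distance=64, degree=4))
```

A larger distance lets the prefetcher run further ahead of the loads, and a larger degree lets it get there in fewer accesses; both increase speculation.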
5
Background (Prefetcher Aggressiveness)
  • Very Aggressive
    • Well ahead of the load access stream
    • Hides memory access latency better
    • More speculative
  • Very Conservative
    • Closer to the load access stream
    • Might not hide memory access latency completely
    • Reduces potential for cache pollution and
      bandwidth contention

6
Motivation
[Figure annotations: performance drops of 48% and 29% on individual benchmarks]
  • Very Aggressive improves average performance by
    84%
  • However, it can also significantly reduce
    performance on some benchmarks
7
Outline
  • Background and Motivation
  • Feedback Directed Prefetching (FDP)
    • Metrics and how to collect them
    • How to adapt
      • Prefetcher Aggressiveness
      • Cache Insertion Policy for Prefetches
  • Results

8
Feedback Directed Prefetching
  • Comprehensive mechanism which takes into account
    • Prefetcher Accuracy
    • Prefetcher Lateness
    • Prefetcher-caused Cache Pollution
  • Adapts
    • Prefetcher Aggressiveness
    • Cache Insertion Policy for Prefetches

9
Metrics
  • Prefetch Accuracy
  • Prefetch Lateness
  • Prefetcher-caused Cache Pollution

10
Prefetch Accuracy
  • A prefetch is useful if its block is referenced by a
    demand request while it resides in the L2 cache

11
Prefetch Accuracy
  • Low Accuracy
    • More likely that prefetching reduces
      performance

12
Prefetch Accuracy
  • Implementation
    • pref-bit added to each L2 tag-store entry
    • Tracked using two counters: pref_total, used_total
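
A minimal sketch of how the pref-bit and the two counters could fit together. It assumes accuracy is computed as used_total / pref_total, that pref_total counts prefetches sent to memory, and that a pref-bit is cleared on first demand use so each prefetch is counted once. The class and method names are hypothetical, and the dictionary stands in for the L2 tag store.

```python
class AccuracyTracker:
    def __init__(self):
        self.pref_total = 0   # prefetches sent to memory
        self.used_total = 0   # prefetched blocks later used by a demand request
        self.pref_bit = {}    # block address -> pref-bit (stand-in for L2 tags)

    def prefetch_sent(self):
        self.pref_total += 1

    def fill(self, block, is_prefetch):
        # Insert a block into L2; the pref-bit marks prefetched blocks.
        self.pref_bit[block] = is_prefetch

    def demand_hit(self, block):
        # A demand request hit in L2: count the prefetch as useful once.
        if self.pref_bit.get(block, False):
            self.used_total += 1
            self.pref_bit[block] = False

    def evict(self, block):
        self.pref_bit.pop(block, None)

    def accuracy(self):
        # Assumed definition: useful prefetches / prefetches sent.
        return self.used_total / self.pref_total if self.pref_total else 0.0


if __name__ == "__main__":
    t = AccuracyTracker()
    for block in (1, 2, 3, 4):
        t.prefetch_sent()
        t.fill(block, is_prefetch=True)
    for block in (1, 2, 2, 3):   # block 2 is used twice but counted once
        t.demand_hit(block)
    print(t.accuracy())
```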

13
Prefetch Lateness
  • Measure of how timely prefetches are
    • Used to determine if increasing the
      aggressiveness helps
  • Implementation
    • pref-bit added to each L2 MSHR entry
    • New counter: late_total
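
The sketch below shows one way the MSHR pref-bit could drive late_total. It assumes a prefetch is late when a demand request arrives while the prefetch is still in flight, and that lateness is reported as late_total divided by the number of useful prefetches. The names are hypothetical, and the dictionary stands in for the L2 MSHR.

```python
class LatenessTracker:
    def __init__(self):
        self.late_total = 0
        self.mshr_pref_bit = {}   # in-flight block -> pref-bit

    def allocate(self, block, is_prefetch):
        # A miss (demand or prefetch) allocates an MSHR entry.
        self.mshr_pref_bit[block] = is_prefetch

    def demand_access(self, block):
        """Return True if the demand access matched an in-flight request."""
        if block not in self.mshr_pref_bit:
            return False
        if self.mshr_pref_bit[block]:
            # The prefetch was issued, but not early enough.
            self.late_total += 1
            self.mshr_pref_bit[block] = False   # treat it as a demand from now on
        return True

    def fill(self, block):
        # Data returned from memory; the MSHR entry is released.
        self.mshr_pref_bit.pop(block, None)


def lateness(late_total, used_total):
    # Assumed definition: late prefetches / useful prefetches.
    return late_total / used_total if used_total else 0.0
```

A high lateness value suggests the prefetcher is predicting the right blocks but not running far enough ahead, which is the case where raising aggressiveness can help.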

14
Prefetcher-caused Cache Pollution
  • Measure of the disturbance caused by prefetched
    data in the cache
  • Used to determine if the prefetcher is evicting
    useful data from the cache

15
Prefetcher-caused Cache Pollution (2)
  • Hardware Implementation
    • Insight: the measurement does not need to be exact
    • Track pollution using a Pollution Filter
      • Based on the Bloom filter concept
      • Bit set when a prefetch evicts a demand-fetched block
      • Bit reset when a prefetch is serviced
    • Two counters: pollution_total, demand_total
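
A minimal sketch of the pollution filter as a bit vector. The 4096-entry size matches the hardware cost slide; the index hash (an XOR fold of the block address), the check on every demand miss, and the ratio pollution_total / demand_total are assumptions made for illustration. Because several addresses share a bit, the result is approximate, which is the point of the "does not need to be exact" insight.

```python
class PollutionFilter:
    def __init__(self, entries=4096):
        self.entries = entries
        self.bits = [False] * entries
        self.pollution_total = 0   # demand misses blamed on the prefetcher
        self.demand_total = 0      # all demand misses

    def _index(self, block):
        # Assumed hash: XOR the low-order bits with the next higher bits.
        return (block ^ (block // self.entries)) % self.entries

    def evicted_by_prefetch(self, victim_block, victim_was_demand):
        # Set the bit when a prefetch evicts a demand-fetched block.
        if victim_was_demand:
            self.bits[self._index(victim_block)] = True

    def prefetch_serviced(self, block):
        # Reset the bit when a prefetch for this address is serviced.
        self.bits[self._index(block)] = False

    def demand_miss(self, block):
        # A set bit suggests the block would still be cached without prefetching.
        self.demand_total += 1
        if self.bits[self._index(block)]:
            self.pollution_total += 1

    def pollution(self):
        return self.pollution_total / self.demand_total if self.demand_total else 0.0
```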

16
Feedback Directed Prefetching
  • Comprehensive mechanism which takes into account
    • Prefetcher Accuracy
    • Prefetcher Lateness
    • Prefetcher-caused Cache Pollution
  • Adapts
    • Prefetcher Aggressiveness
    • Cache Insertion Policy

17
How to adapt? Prefetcher Aggressiveness
  • Dynamic Configuration Counter
  • Current Aggressiveness

Counter  Aggressiveness      Distance  Degree
1        Very Conservative   4         1
2        Conservative        8         1
3        Middle-of-the-Road  16        2
4        Aggressive          32        4
5        Very Aggressive     64        4
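
The table above can be read as a lookup keyed by a small saturating counter. The sketch below encodes it that way; the starting level of 3 and the class name are assumptions, while the distance and degree values are the ones listed on the slide.

```python
# counter value -> (aggressiveness, prefetch distance, prefetch degree)
CONFIGS = {
    1: ("Very Conservative", 4, 1),
    2: ("Conservative", 8, 1),
    3: ("Middle-of-the-Road", 16, 2),
    4: ("Aggressive", 32, 4),
    5: ("Very Aggressive", 64, 4),
}


class AggressivenessCounter:
    def __init__(self, value=3):   # starting level is an assumption
        self.value = value

    def update(self, delta):
        """Apply +1 (increase), -1 (decrease), or 0; saturate at 1 and 5."""
        low, high = min(CONFIGS), max(CONFIGS)
        self.value = min(max(self.value + delta, low), high)
        return CONFIGS[self.value]


if __name__ == "__main__":
    counter = AggressivenessCounter()
    print(counter.update(+1))   # one step more aggressive
    print(counter.update(-1))   # and back again
```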
18
How to adapt? Prefetcher Aggressiveness (2)
  • For the current phase, classify each metric based on
    static thresholds
    • Accuracy (High / Medium / Low)
    • Lateness (Late / Not-Late)
    • Cache pollution caused by prefetches (Polluting / Not-Polluting)
  • The classification selects the adjustment:
    Increase, Decrease, or No Change

[Figure: decision tree from accuracy, lateness, and pollution to Increase / Decrease / No Change]

Goals of the adjustments, as annotated on the figure:
  • Reduce memory bandwidth usage and cache pollution
  • Improve timeliness
  • Reduce cache pollution
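
The slide's decision figure does not survive in readable form in this transcript, so the rule set below is an assumption: one policy that is consistent with the stated goals (increase to improve timeliness when prefetches are late and reasonably accurate, decrease to cut bandwidth use and pollution otherwise). The threshold defaults are illustrative placeholders, not values taken from the slides.

```python
def adjust(accuracy, lateness, pollution,
           a_high=0.75, a_low=0.40, t_late=0.01, t_poll=0.005):
    """Return +1 (increase), -1 (decrease), or 0 (no change).

    All four thresholds are illustrative placeholders.
    """
    high = accuracy >= a_high
    low = accuracy < a_low
    late = lateness > t_late
    polluting = pollution > t_poll

    # Late but trustworthy prefetches: run further ahead to improve timeliness.
    if late and not low and (high or not polluting):
        return +1
    # Polluting, or inaccurate and late: back off to save bandwidth and cache space.
    if polluting or (low and late):
        return -1
    # Timely and not polluting: leave the configuration alone.
    return 0


if __name__ == "__main__":
    print(adjust(accuracy=0.9, lateness=0.5, pollution=0.0))   # accurate but late
    print(adjust(accuracy=0.2, lateness=0.5, pollution=0.0))   # inaccurate and late
```

The returned value is meant to be applied to a saturating aggressiveness counter once per sampling phase.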
19
How to Adapt? Cache Insertion Policy for Prefetches
  • Why adapt?
    • Reduce the potential for cache pollution
  • Classify cache pollution based on static
    thresholds
    • Low: insert at MID (n/2) position
      • E.g., for a 16-way cache, MID = 8 in the LRU stack
    • Medium: insert at LRU-4 (n/4) position
      • E.g., for a 16-way cache, LRU-4 = 4 in the LRU stack
    • High: insert at LRU position
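
The mapping from pollution level to insertion position can be sketched as a small function. It assumes positions are counted from the LRU end of the stack, with 1 meaning the LRU position itself; that numbering, the function name, and the string labels are assumptions chosen to match the 16-way examples above.

```python
def insertion_position(pollution_level, ways):
    """Return the LRU-stack position (1 = LRU, ways = MRU) for a prefetched block."""
    if pollution_level == "low":
        return max(1, ways // 2)   # MID: halfway up the stack
    if pollution_level == "medium":
        return max(1, ways // 4)   # LRU-4: a quarter of the way up
    if pollution_level == "high":
        return 1                   # LRU: first candidate for eviction
    raise ValueError(f"unknown pollution level: {pollution_level!r}")


if __name__ == "__main__":
    for level in ("low", "medium", "high"):
        print(level, insertion_position(level, ways=16))
```

The more pollution the prefetcher causes, the closer to the LRU end its blocks are inserted, so a useless prefetch is evicted sooner and displaces less demand-fetched data.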

20
Outline
  • Background and Motivation
  • Feedback Directed Prefetching
    • Metrics and how to collect them
    • How to adapt
      • Prefetcher Aggressiveness
      • Cache Insertion Policy for Prefetches
  • Results

21
Evaluation Methodology
  • Execution-driven Alpha simulator
  • Aggressive out-of-order superscalar processor
  • 1 MB, 16-way, 10-cycle unified L2 cache
  • 500-cycle minimum main memory latency
  • Detailed memory model
  • Prefetchers modeled
    • Stream Prefetcher tracking 64 different streams
    • Global History Buffer Prefetcher (in paper)
    • PC-based Stride Prefetcher (in paper)

22
Results: Adjusting Only Aggressiveness
  • 4.7% higher average IPC than the Very Aggressive
    configuration
  • Most of the performance losses have been
    eliminated

23
Results: Adjusting Only Cache Insertion Policy
(Very Aggressive Prefetcher)
  • 5.1% better than inserting prefetches in MRU
    position
  • 1.9% better than inserting prefetches in LRU-4
    position

24
Results: Putting It All Together (FDP)
[Figure annotations: 13% and 11%]
  • 6.5% IPC improvement over Very Aggressive
    configuration
  • Performance losses converted to performance gains!

25
Bandwidth Impact
  • BPKI: memory bus accesses per 1000 retired
    instructions
    • Includes effects of L2 demand misses as well as
      pollution-induced misses and prefetches
  • FDP significantly improves bandwidth efficiency
    • 6.5% higher performance and 18.7% less bandwidth
      than Very Aggr
    • 13.6% higher performance than Mid with similar
      bandwidth usage

      No Pref.  Very Cons  Mid    Very Aggr  FDP
IPC   0.85      1.21       1.47   1.57       1.67
BPKI  8.56      9.34       10.60  13.38      10.88
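
The quoted percentages can be checked against the table with a few lines of arithmetic. The sketch below uses only the table's rounded values; the dictionary layout and helper name are hypothetical. Because the inputs are rounded to two decimals, the recomputed IPC gain over Very Aggr comes out slightly below the quoted 6.5%.

```python
ipc = {"No Pref.": 0.85, "Very Cons": 1.21, "Mid": 1.47, "Very Aggr": 1.57, "FDP": 1.67}
bpki = {"No Pref.": 8.56, "Very Cons": 9.34, "Mid": 10.60, "Very Aggr": 13.38, "FDP": 10.88}


def pct_change(new, old):
    """Relative change from old to new, in percent."""
    return 100.0 * (new - old) / old


if __name__ == "__main__":
    for baseline in ("Very Aggr", "Mid"):
        perf = pct_change(ipc["FDP"], ipc[baseline])
        bw = pct_change(bpki["FDP"], bpki[baseline])
        print(f"FDP vs {baseline}: IPC {perf:+.1f}%, BPKI {bw:+.1f}%")
```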
26
Hardware Cost

Component               Size                   Cost
pref-bits for L2 cache  16384 blocks x 1 bit   16384 bits
Pollution Filter        4096 entries x 1 bit   4096 bits
16-bit counters         11 counters x 16 bits  176 bits
pref-bits for MSHR      128 entries x 1 bit    128 bits
  • Total hardware cost: 20784 bits = 2.54 KB
  • Percentage area overhead compared to baseline 1 MB
    L2 cache: 2.5 KB / 1024 KB = 0.24%
  • NOT on the critical path
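
The storage total can be reproduced from the per-structure sizes. This sketch only re-adds the numbers on the slide; the variable names are hypothetical. The slide's 0.24% comes from rounding the total down to 2.5 KB before dividing, so the unrounded figure is marginally higher.

```python
# structure -> (number of entries, bits per entry), as listed on the slide
components = {
    "pref-bits for L2 cache": (16384, 1),
    "Pollution Filter": (4096, 1),
    "16-bit counters": (11, 16),
    "pref-bits for MSHR": (128, 1),
}

total_bits = sum(entries * width for entries, width in components.values())
total_kb = total_bits / 8 / 1024         # bits -> bytes -> KB
overhead_pct = 100.0 * total_kb / 1024   # relative to a 1 MB (1024 KB) L2 cache

if __name__ == "__main__":
    print(f"{total_bits} bits = {total_kb:.2f} KB, {overhead_pct:.3f}% of 1 MB")
```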

27
Outline
  • Background and Motivation
  • Feedback Directed Prefetching
  • Metrics and collecting this information in
    Hardware
  • How to adapt
  • Results
  • Conclusions

28
Contributions
  • Comprehensive and low-cost feedback mechanism for
    hardware prefetchers
  • Uses
    • Prefetcher Accuracy
    • Prefetcher Lateness
    • Prefetcher-caused Cache Pollution
  • Adapts
    • Aggressiveness
    • Cache Insertion Policy for prefetches
  • 6.5% higher performance and 18.7% less bandwidth
    compared to Very Aggressive Prefetching
  • Eliminates negative impact of prefetching
  • Applicable to any data prefetch algorithm

29
Questions?
30
Backups
31
FDP vs Prefetch Cache
  • Prefetch caches eliminate prefetcher-induced
    cache pollution
  • However, prefetches are now limited to the size
    of the prefetch cache
  • FDP has 5.3% higher performance than Very Aggr.
    with a 32 KB prefetch cache
  • FDP is within 2% of Very Aggr. with a 64 KB
    prefetch cache
  • Memory bandwidth of FDP is 16% less than with the
    32 KB and 9% less than with the 64 KB prefetch cache

32
Performance on Other Prefetch algorithms
  • Global History Buffer Prefetcher
    • 20.8% less memory bandwidth than very aggressive
      with similar performance
    • 9.9% better performance than middle-of-the-road
      with similar bandwidth usage
  • PC-based Stride Prefetcher
    • 4% better performance than very aggressive
    • 24% reduction in bandwidth usage

33
IPC Performance
34
Dynamic Prefetcher Accuracy
35
Prefetch Lateness
36
Pollution Filter
37
Thresholds
38
Prefetches Sent
39
Distribution of dynamic aggressiveness level
40
Distribution of insertion position of prefetched
blocks
41
Effect of FDP on memory bandwidth consumption
42
Performance of Prefetch cache vs FDP
43
Bandwidth consumption of prefetch cache vs. FDP
44
Effect of FDP on GHB
45
Effect of FDP on GHB (Bandwidth)
46
Effect of varying L2 size and memory latency
47
IPC on other benchmarks
48
BPKI on other benchmarks