Title: Predictor-Directed Stream Buffers
1Predictor-Directed Stream Buffers
- Timothy Sherwood
- Suleyman Sair
- Brad Calder
2Overview
- Introduction
- Past Stream Buffer work
- Predictor-Directed Stream Buffers
- Policy Improvements
- Results
- Contribution
3Introduction
- Memory Wall
- Latency reduction through prefetching
- without eating too much bandwidth
- Stream Buffers are among the most widely used prefetching mechanisms
- simple to implement
- very efficient
- Pointer based codes
4Past Stream Buffer work
- Jouppi 1990
- consecutive cache line FIFO
- Palacharla and Kessler 1994
- non-unit stride (based on memory chunk)
- allocation filters
- Farkas et al. 1997
- PC-based stride
- fully associative / non-overlapping
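As a rough illustration of the simplest design listed above, the following sketch models a single consecutive-cache-line FIFO in the style of Jouppi's stream buffer. The class name, the 32-byte line size, and the head-only lookup are assumptions made for illustration, not details taken from the slides.

```python
from collections import deque

LINE_SIZE = 32  # assumed cache line size in bytes


class SequentialStreamBuffer:
    """FIFO of consecutive cache-line addresses (hypothetical model)."""

    def __init__(self, depth=4):
        self.depth = depth
        self.fifo = deque()
        self.next_line = None

    def allocate(self, miss_addr):
        """Restart the stream at the line following a cache miss."""
        self.fifo.clear()
        self.next_line = (miss_addr // LINE_SIZE + 1) * LINE_SIZE
        self._refill()

    def _refill(self):
        # Keep the FIFO full with the next sequential line addresses.
        while len(self.fifo) < self.depth:
            self.fifo.append(self.next_line)
            self.next_line += LINE_SIZE

    def lookup(self, addr):
        """Hit only at the head of the FIFO; on a hit, prefetch one more line."""
        line = addr // LINE_SIZE * LINE_SIZE
        if self.fifo and self.fifo[0] == line:
            self.fifo.popleft()
            self._refill()
            return True
        return False
```

A non-unit-stride or PC-based variant would replace the fixed `LINE_SIZE` increment in `_refill` with a predicted stride.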
5Past Stream Buffer work
[Figure: stride-based stream buffer architecture. Labels: N buffers; to data cache, register file, and MSHRs; from/to next lower level of memory; store predict_stride in streaming buffer on allocation.]
6Past Stream Buffer work
- Past work targeted at streaming in arrays
- either in sequential order
- or stride order (multidimensional array)
- Could not handle Pointer Codes
- repetitive non-striding references
- Need a more General Predictor
7Predictor-Directed Stream Buffer
- The Goal: Simple and efficient hardware-based prefetching of complex but predictable streams
- Approach: Take a general predictor and hook it up to the well-established stream buffer front end.
- Separate the predictor from the prefetcher
- Can use almost any predictor
- 2 Delta
- Context
- Markov
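To make one of the candidate predictors concrete, here is a minimal sketch of a two-delta stride predictor: the predicted stride changes only after the same new stride has been observed twice in a row. The class and method names are hypothetical, and a hardware version would keep this state per load PC in a table.

```python
class TwoDeltaStride:
    """Two-delta stride predictor for a single address stream (sketch)."""

    def __init__(self):
        self.last_addr = None
        self.last_stride = 0   # most recently observed stride
        self.pred_stride = 0   # stride actually used for prediction

    def update(self, addr):
        """Train on the next observed address."""
        if self.last_addr is not None:
            stride = addr - self.last_addr
            # Adopt a new stride only when it repeats.
            if stride == self.last_stride:
                self.pred_stride = stride
            self.last_stride = stride
        self.last_addr = addr

    def predict(self):
        """Return the predicted next address, or None before any training."""
        if self.last_addr is None:
            return None
        return self.last_addr + self.pred_stride
```

The two-delta rule means a single irregular access does not disturb an established stride.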
8PSB Generalized Architecture
[Figure: PSB generalized architecture. Labels: predictor fields (Load PC, History, Stride, Confidence, Last Address); Prediction Info; subset of prediction info; predicted address; update prediction information; N buffers; to data cache, register file, and MSHRs; from/to next lower level of memory.]
9PSB Stages
- Allocation
- Prediction
- Probe
- Prefetching
- Lookup
10Stage Descriptions
- Allocation
- Stream Buffer is allocated to a particular load
- the buffer is initialized
- subject to Allocation Filters
- Prediction
- an empty buffer entry asks for an address
- subject to limited predictor speed.
11Stage Descriptions (Continued)
- Probe
- if there are free ports, remove useless prefetches
- not mandatory
- Prefetching
- subject to scheduling for ports and priority, prefetches are sent to memory
- Lookup
- when a load performs an L1 access, the Stream Buffers are checked in parallel
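The stages described above can be sketched in software as follows. This is a simplified model under stated assumptions: the predictor interface (`next_address`), both predictor classes, and the buffer class are hypothetical names; the Probe and Prefetching stages are omitted, so every predicted address is treated as if it had already been fetched.

```python
class StrideModel:
    """Stand-in predictor; any object with next_address() can be plugged in."""

    def __init__(self, stride):
        self.stride = stride

    def next_address(self, last_addr):
        return last_addr + self.stride


class TableModel:
    """Stand-in Markov-like predictor backed by an address -> next address map."""

    def __init__(self, table):
        self.table = table

    def next_address(self, last_addr):
        return self.table.get(last_addr)  # None when there is no prediction


class PSBuffer:
    """One stream buffer whose addresses come from an external predictor."""

    def __init__(self, predictor, depth=4):
        self.predictor = predictor
        self.depth = depth
        self.entries = []
        self.last_addr = None

    def allocate(self, miss_addr):
        """Allocation: bind the buffer to a new stream and clear it."""
        self.entries = []
        self.last_addr = miss_addr

    def predict(self):
        """Prediction: fill at most one empty entry per call (limited predictor speed)."""
        if self.last_addr is None or len(self.entries) >= self.depth:
            return
        nxt = self.predictor.next_address(self.last_addr)
        if nxt is not None:
            self.last_addr = nxt
            self.entries.append(nxt)

    def lookup(self, addr):
        """Lookup: check all entries; a hit frees the entry for a new prediction."""
        if addr in self.entries:
            self.entries.remove(addr)
            return True
        return False
```

Because the buffer only calls `next_address`, swapping `StrideModel` for `TableModel` changes the stream it follows without touching the buffer logic.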
12PSB Implementation
- Tried many different address predictors
- Best is Stride Filtered Markov
- similar to Joseph and Grunwald's predictor
- first order Markov
- striding behavior is filtered out
- Difference is stored to reduce size
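A rough sketch of the idea follows: a first-order Markov table that is only trained on transitions the stride predictor does not already cover, and that stores the address difference rather than the full next address. This is not the authors' exact design. The indexing function, the 16-bit difference width, the 32-byte line size, the untagged table, and the use of a simple last-stride filter are all assumptions made for illustration.

```python
TABLE_ENTRIES = 2048  # 2K entries, as on the Methods slide
DELTA_BITS = 16       # assumed width of a stored difference
LINE_SHIFT = 5        # assumed 32-byte cache lines


class StrideFilteredMarkov:
    """First-order Markov predictor with stride filtering (hypothetical sketch)."""

    def __init__(self):
        self.table = [None] * TABLE_ENTRIES  # differences, not full addresses
        self.last_addr = None
        self.stride = 0

    def _index(self, addr):
        return (addr >> LINE_SHIFT) % TABLE_ENTRIES

    def predict(self, addr):
        """Predict the address following addr: Markov entry first, else stride."""
        delta = self.table[self._index(addr)]
        return addr + (delta if delta is not None else self.stride)

    def update(self, addr):
        """Train on the next observed address of the stream."""
        if self.last_addr is not None:
            delta = addr - self.last_addr
            lo, hi = -(1 << (DELTA_BITS - 1)), (1 << (DELTA_BITS - 1))
            if delta == self.stride:
                pass  # striding behavior: filtered out of the Markov table
            elif lo <= delta < hi:
                self.table[self._index(self.last_addr)] = delta
            self.stride = delta
        self.last_addr = addr
```

Storing a narrow difference instead of a full address is what lets a small table hold many transitions; transitions whose difference does not fit are simply not recorded in this sketch.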
13Difference Storing
14PSB with SFM
15Methods
- SimpleScalar 3.0
- Rewrote memory hierarchy
- Model bandwidth between all levels
- Added perfect store sets
- Ran over a set of pointer benchmarks
- 2K entry predictor table
- 8 buffers x 4 entry Stream Buffers
- 32k 4-way associative cache
16Speedup from PSB
17Allocation Filtering
- Farkas et al. showed how two-miss filtering
- prevents too many streams from requesting resources
- Does not work as well for pointer codes
- irregular miss patterns
- We use Priority and Accuracy Counters
- track behavior of Loads
- allocate to Loads that are Behaving well
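One way to picture accuracy-based allocation filtering is a small saturating counter per load that rises when the load's address was predicted correctly and falls otherwise; a stream buffer is allocated only to loads above a threshold. The counter width, the threshold, and all names below are assumptions for illustration, not values from the slides.

```python
class AccuracyFilter:
    """Per-load saturating accuracy counters gating allocation (sketch)."""

    def __init__(self, max_count=7, threshold=4):
        self.max_count = max_count   # assumed 3-bit counter
        self.threshold = threshold   # assumed allocation threshold
        self.counters = {}           # load PC -> counter value

    def record(self, pc, predicted, actual):
        """Update the load's counter with the outcome of one prediction."""
        count = self.counters.get(pc, 0)
        if predicted == actual:
            count = min(count + 1, self.max_count)
        else:
            count = max(count - 1, 0)
        self.counters[pc] = count

    def may_allocate(self, pc):
        """Only loads that have been predicting well may take a stream buffer."""
        return self.counters.get(pc, 0) >= self.threshold
```

Unlike two-miss filtering, this decision depends on how predictable the load has been rather than on seeing consecutive misses in a regular pattern.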
18Allocation Filtering Speedup
19Stream Buffer Priority
- Round Robin
- give each active buffer equal resources
- predictor and prefetching
- Priority Counters
- uses small counters with each buffer
- use the counters to rank buffers
- more resources to better performing buffers
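The two scheduling policies can be contrasted with a small sketch. Each buffer is modelled as a dictionary with a hypothetical `wants_service` flag and `priority` counter; how the counters are updated is left out, and the tie-breaking rule (lowest index wins) is an assumption.

```python
def pick_round_robin(buffers, last_index):
    """Return the next buffer after last_index that wants service, or None."""
    n = len(buffers)
    for step in range(1, n + 1):
        i = (last_index + step) % n
        if buffers[i]["wants_service"]:
            return i
    return None


def pick_by_priority(buffers):
    """Return the waiting buffer with the highest priority counter, or None."""
    ready = [i for i, b in enumerate(buffers) if b["wants_service"]]
    if not ready:
        return None
    # max() keeps the first maximum, so ties go to the lowest index.
    return max(ready, key=lambda i: buffers[i]["priority"])
```

Round robin spreads predictor and prefetch bandwidth evenly, while the priority policy steers it toward buffers whose counters indicate they have been performing well.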
20Priority Scheduling Speedup
21Latency Reduction
22Contributions
- Predictor-Directed Stream Buffers allow decoupling of the Stream Buffer front end from address generation
- Using accuracy-based allocation filtering and priority scheduling can make a large difference in performance
- With some simple compression, even small Markov tables can be very effective
23Accuracy
24Bus Results