1
Cache Prefetching
  • Stefan G. Berg

2
Outline
  • Introduction and Terminology
  • Three Techniques
  • Stride Prefetching
  • Recursive Prefetching
  • Markov Prefetching
  • Hybrid Approaches
  • Conclusions

3
Traditional Processor
[Timeline: a memory reference misses in L1; the processor stalls for the entire memory latency until the data arrives.]
4
Lockup-Free Cache
[Timeline: after the L1 miss, independent instructions keep executing under the miss; the processor stalls only when a dependent instruction is reached, overlapping part of the memory latency.]
5
Out-of-Order Execution
[Timeline: independent instructions execute past the miss and the dependent instruction is deferred; the stall shrinks further, but the remaining memory latency still stalls the processor.]
6
Cache Prefetching
[Timeline: a prefetch is issued well before the memory reference; the data arrives in time, so neither the reference nor the dependent instruction stalls.]
7
Accuracy and Coverage
  • Prefetch_all: number of prefetches issued
  • Prefetch_hit: number of prefetches that result
    in a cache hit
  • Misses: number of cache misses that remain

Accuracy (percentage of useful prefetches) = Prefetch_hit / Prefetch_all
Coverage (percentage of misses removed) = Prefetch_hit / (Prefetch_hit + Misses)
8
Producing Prefetch Addresses
  • Observing memory references
  • Stride prefetcher: strided memory reference pattern (hw/sw)
  • Recursive prefetcher: linked memory reference pattern (hw/sw)
  • Markov prefetcher: history of miss addresses (hw)

9
Outline
  • Introduction and Terminology
  • Three Techniques
  • Stride Prefetching
  • Recursive Prefetching
  • Markov Prefetching
  • Hybrid Approaches
  • Conclusions

10
Strided Access
[Diagram: successive addresses b, b+s, b+2s, b+3s, b+4s, b+5s, each pair separated by a constant stride s.]
11
Stride Prefetcher
[Diagram: the Reference Prediction Table (RPT) is indexed by the PC of a load. Each entry holds a PC tag, the previous address, the detected stride, and a state field. For "PC: LOAD reg, address" in a loop, the RPT predicts the next reference at address + stride.]
12
Prefetching in a Loop
[Diagram: inside the loop, "PREFETCH address + stride" is issued alongside "LOAD reg, address". With ideal loop execution time t_x and memory latency t_m: if t_m > t_x, the loop is memory-latency bound; if t_m <= t_x, the memory latency is completely hidden.]
13
Overlapped Prefetching
[Diagram: the loop issues "PREFETCH address + n*stride" alongside "LOAD reg, address", prefetching n iterations ahead so that t_m is covered by n loop iterations.]
Instead of computing n we can use a lookahead PC
14
Lookahead PC (Chen and Baer)
  • Initially set equal to PC
  • Incremented by 1 every cycle
  • (Branches predicted using BPT)
  • Lookahead PC will run ahead of PC when PC stalls
    due to cache miss
  • Lookahead PC used to index RPT and issue
    prefetches
  • Distance between Lookahead PC and PC not allowed
    to exceed memory latency

15
Superscalar Challenges
  • Lookahead PC (LA PC) is not fast enough when
    several instructions are issued every cycle
  • Cache tag can become a bottleneck because the
    rate at which prefetches and memory references
    are issued increases
  • Tango (Pinter and Yoaz)
  • Faster Lookahead PC
  • Cache for tag memory

16
Tango (Pinter and Yoaz)
  • Lookahead PC advances not by 1 every cycle, but
    from one branch to the next

[Diagram: an enhanced branch target buffer (BTB). Each entry holds a PC tag, the target address, prediction info, a T-entry for the taken path, and an NT-entry for the not-taken path.]
17
Prefetching with new LA PC
[Diagram: the Reference Prediction Table (RPT), with its usual PC tag, previous address, stride, and state fields, extended per entry with a BTB-entry pointer, a taken/not-taken (T/NT) flag, and a memory-reference count (MemRefCnt). Loads B, C, and D are each associated with branch A's not-taken path, with MemRefCnt counting the loads on that path; when the LA PC jumps from branch A toward the next branch E, the loads on the predicted path are looked up in the RPT and prefetched.]
18
Cache for Tag Memory
  • FIFO queue keeps track of last n (6) cache lines
    that were found (hit) in the cache
  • If a prefetch address hits in the FIFO, it is
    useless and can be discarded
  • Cuts the number of cache-tag lookups caused by
    unnecessary prefetches roughly in half

19
Outline
  • Introduction and Terminology
  • Three Techniques
  • Stride Prefetching
  • Recursive Prefetching
  • Markov Prefetching
  • Hybrid Approaches
  • Conclusions

20
Linked Data Structure Access
[Diagram: four nodes of a linked list. Each node has fields at offsets 0, 4, 8, and 12, and a next pointer at offset 14; following next at the same offset in every node produces the linked reference pattern.]
21
Detecting Recursive Accesses
[Diagram: three linked nodes a, b, c, each with fields at offsets 0, 4, 8, 12 and a next pointer at offset 14. The load that produces b (traversing from a) and the load that consumes b (traversing on to c) are the same static instruction.]

Example:
  LOAD rdest, rsrc(14)    (producer of b)
  LOAD rdest, rsrc(14)    (consumer of b / producer of c)
The rdest of the first load and the rsrc of the second hold the same value, as in p = p->next.
22
Roth, Moshovos, Sohi (HW)
[Diagram: the same producer/consumer pair of loads, now detected in hardware. The Potential Producer Window (PPW) maps a loaded memory value to the producer instruction's address; the Correlation Table (CT) maps a producer instruction address to the consumer instruction's address and template. Example: the PPW records that PC-A loaded value b; when PC-B later uses b as its base address, the CT records the correlation (producer PC-A, consumer PC-B, template LOAD r, r(14)).]
23
Recursive Prefetching?
[Diagram: the Potential Producer Window and Correlation Table again, annotated with the three steps: record the producer in the PPW, establish the producer/consumer correlation in the CT, and then prefetch when the producer executes again.]
24
Luk and Mowry (SW)
  • Recursive Data Structure (RDS):
      struct T { int data; struct T *next; };
  • RDS Traversal:
      T *l;
      ...
      while (l) { ... l = l->next; }
  • Greedy Prefetching:
      T *l;
      ...
      while (l) { prefetch(l->next); ... l = l->next; }
  • Identify RDSs
  • Find RDS traversals
  • Insert prefetches
25
Pre-Order Tree Traversal
Ordering of Prefetch Requests
[Diagram: a 15-node binary tree numbered 1-15 in the order prefetch requests are issued during a pre-order traversal; each node n is marked as a miss, a partial miss (prefetch still in flight when the node is reached), or a hit.]
26
Outline
  • Introduction and Terminology
  • Three Techniques
  • Stride Prefetching
  • Recursive Prefetching
  • Markov Prefetching
  • Hybrid Approaches
  • Conclusions

27
Markov Prefetcher (Joseph and Grunwald)
[Diagram: a tree of observed miss-address sequences: miss address 1 is followed by 2a or 2b, each of which leads to further alternatives (3a/3b, 4a-4d, 5a-5d).]

State transition table with history of 1:
  Miss Tag   Predictor 1   Predictor 2
  Addr. 1    Addr. 2a      Addr. 2b
  Addr. 2a   Addr. 3a      ---
  Addr. 2b   Addr. 3b      ---
  Addr. 3a   Addr. 4a      ---
  Addr. 3b   Addr. 4b      Addr. 4c

State transition table with history of 3:
  Miss Tag   Predictor 1   Predictor 2
  Addr. 1    Addr. 4a      Addr. 4b
  Addr. 2a   Addr. 5a      ---
  Addr. 2b   Addr. 5b      Addr. 5c
28
Outline
  • Introduction and Terminology
  • Three Techniques
  • Stride Prefetching
  • Recursive Prefetching
  • Markov Prefetching
  • Hybrid Approaches
  • Conclusions

29
Hybrid Approaches (Joseph and Grunwald)
  • Parallel Prefetching
  • all prefetchers have access to hardware resources
    (e.g. miss addresses, data on memory reference
    instructions)
  • all prefetchers are allowed to prefetch
  • Serial Prefetching
  • most accurate prefetcher is allowed to prefetch
    first
  • static ordering: stride, Markov, consecutive

30
Conclusions
  • Stride prefetchers are most mature
  • good coverage, good timing, and high accuracy
  • Recursive prefetchers
  • poor timing; prefetches cannot be overlapped well
  • Markov prefetcher
  • high hardware cost, not a good stand-alone
    prefetcher, mediocre accuracy
  • Hybrid prefetching
  • need evaluation of which prefetchers to combine
  • and how to combine them (e.g., serial prefetching)