Title: Using the Compiler to Improve Cache Replacement Decisions
1 Using the Compiler to Improve Cache Replacement Decisions
Zhenlin Wang, UMass Amherst
Kathryn S. McKinley, UT Austin
Arnold L. Rosenberg, UMass Amherst
Charles C. Weems, UMass Amherst
2 Motivation and Background
- LRU is not always effective
- Optimal cache replacement must peek into the future
- Compiler locality analysis determines data access patterns for numeric applications
- Cache line tag bit(s) and an ISA extension control cache replacement explicitly
- Replacement logic augments LRU with compiler hints
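The first two points can be made concrete with a small simulation. The sketch below is illustrative and not taken from the talk: it assumes a fully associative cache of 3 lines and a cyclic trace over 4 blocks, and the helper names `lru_misses` and `opt_misses` are hypothetical. It compares LRU with Belady's optimal policy, which evicts the block whose next use lies furthest in the future.

```python
def lru_misses(trace, capacity):
    """Count misses for a fully associative LRU cache (illustrative helper)."""
    cache = []  # least recently used block sits at index 0
    misses = 0
    for block in trace:
        if block in cache:
            cache.remove(block)      # hit: recency is refreshed by the append below
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)         # evict the least recently used block
        cache.append(block)
    return misses


def opt_misses(trace, capacity):
    """Count misses for Belady's optimal policy, which looks into the future."""
    cache = set()
    misses = 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]

            def next_use(b):
                # Blocks that are never used again sort after all others.
                return future.index(b) if b in future else len(future) + 1

            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses


trace = list("abcd") * 3   # assumed cyclic access pattern over 4 blocks
lru = lru_misses(trace, 3)
opt = opt_misses(trace, 3)
```

On this trace LRU misses on all 12 accesses, while the optimal policy misses on 6, because it can keep the blocks that will be reused soonest.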
3 LRU vs. Compiler Control
2-way cache with LRU
SUBROUTINE TEST(N)
INTEGER A(N), B(N), C(N)
DO I = 1, N
  C(I) = A(I) + B(I)
ENDDO
DO I = 1, N
  A(I) = C(I) + 5
ENDDO
END
[Figure: cache contents for Sets 1 to 128 (N = 128), showing lines A1, B1, and C1 at Set 1 and A2 at Set 2.]
4 Compiler Locality Analysis
SUBROUTINE TEST(N)
INTEGER A(N), B(N), C(N)
DO I = 1, N
  C(I) = A(I) + B(I)
ENDDO
DO I = 1, N
  A(I) = C(I) + 5
ENDDO
END
[Figure: Locality Graph. Nodes: B(I), C(I), and A(I) at nest N1; C(I) and A(I) at nest N2. Each node is annotated Spatial (=, <). Cross-loop temporal edges link C(I) at N1 to C(I) at N2 and A(I) at N1 to A(I) at N2. Each node and cross-loop edge is labeled 1N1.]
5 An Abstract Model
- An optimal algorithm uses exact reuse distances
- Given trace a b c d a c d e b f a, reuse distance of a is 4
- Reuse level: a range in which the next reuse will occur
- [i, j] < [k, l], if j < k
- For example, a reuse level of a is [3, 5]. (a b c d a c d e b f a)
- We combine data dependences with loop iteration point to compute reuse levels
- For example, (=, <) < (<, =)
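The trace example and the interval ordering above can be checked with a few lines of Python. This is a minimal sketch under one assumption not stated on the slide: reuse distance is measured as the number of accesses from one use to the next use of the same item. The function names are hypothetical.

```python
def next_reuse_distances(trace, item):
    """Distance, in accesses, from each use of `item` to its next use."""
    positions = [i for i, x in enumerate(trace) if x == item]
    return [later - earlier for earlier, later in zip(positions, positions[1:])]


def level_less_than(first, second):
    """Interval order from the slide: [i, j] < [k, l] if j < k."""
    (_, j), (k, _) = first, second
    return j < k


trace = "a b c d a c d e b f a".split()
distances = next_reuse_distances(trace, "a")   # one entry per reuse of a
first_in_level = 3 <= distances[0] <= 5        # reuse level [3, 5] from the slide
```

Under that assumption the first reuse of a is at distance 4, which falls inside the reuse level [3, 5]. A compiler cannot in general know the exact distance, which is why the model works with ranges that can be ordered even when the precise values are unknown.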
6 The Architecture: Evict-Me Bit
- Inspired by the Alpha 21264 prefetch-and-evict-next and evict instructions
- Each cache line has an extra evict-me bit
- On a replacement, choose the cache line with the evict-me bit set
- Use LRU policy if no evict-me bits are set
- Extend ISA with load/store instructions that set the evict-me bit
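The replacement rule above can be sketched as a victim-selection function for one cache set. This is an illustration, not the hardware design from the talk: the `Line` record, the `last_used` timestamp standing in for the LRU bits, and the LRU tie-break among several marked lines are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Line:
    tag: int
    last_used: int          # timestamp standing in for the LRU bits (assumed)
    evict_me: bool = False  # the extra per-line bit set by tagged loads/stores


def choose_victim(cache_set):
    """Pick the line to replace in one set: evict-me lines first, LRU otherwise."""
    marked = [line for line in cache_set if line.evict_me]
    candidates = marked if marked else cache_set
    # Assumed tie-break: among the candidates, take the least recently used.
    return min(candidates, key=lambda line: line.last_used)


example_set = [Line(tag=1, last_used=5), Line(tag=2, last_used=9, evict_me=True)]
victim = choose_victim(example_set)
```

Here the line with tag 2 is chosen even though it was used more recently, because its evict-me bit is set. With no bits set, the function reduces to plain LRU.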
7 Heuristics for Setting Evict-me Bits
- On a replacement, evict the cache line if its evict-me bit is set; otherwise, use the LRU bits
- Compiler heuristics
  - Set evict-me bit if the reuse distance of a reference is greater than cache size
  - Intuition: even a fully associative cache cannot exploit the reuse
  - Reuse levels: [1, cache size], [cache size + 1, ∞]
- Volume-based heuristics
  - Its reuse crosses nests whose data volume is greater than 2 × cache size
  - Or reuse crosses nests of nesting level > 2
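The heuristics above amount to two simple predicates. The sketch below is a hedged reading of the slide, with hypothetical function and parameter names: a reuse level is a `(low, high)` pair, and sizes and volumes are in the same unit (for example, words).

```python
import math


def evict_me_by_reuse_level(reuse_level, cache_size):
    """Set the bit when the whole reuse level lies beyond the cache size."""
    low, _high = reuse_level
    return low > cache_size


def evict_me_by_volume(crossed_volume, cache_size, nesting_level):
    """Volume-based heuristic for reuse that crosses loop nests.

    crossed_volume: data volume of the nests the reuse crosses (assumed input)
    nesting_level: nesting level of those nests (assumed input)
    """
    return crossed_volume > 2 * cache_size or nesting_level > 2


near = evict_me_by_reuse_level((1, 256), 256)            # reuse fits in the cache
far = evict_me_by_reuse_level((257, math.inf), 256)      # reuse cannot be exploited
```

The factor of 2 and the nesting-level threshold come from the slide; how the compiler estimates the crossed volume is not shown here.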
8 Algorithm for Setting Evict-me Bits
- Mark evict-me bit for an array reference if
  - It has no temporal locality in its nest
  - Its reuse crosses nests whose data volume > 2 × cache size
- Spatial locality is resolved by run-time address calculation or loop unrolling
DO I = 1, N
  A(I)
ENDDO
DO I = 1, N
  A(I) A(I+1) A(I+2) A(I+3)
ENDDO
[Figure: a cache line holding A1, A2, and A3, with evict-me bit values 0 and 1.]
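One way to read the spatial-locality point is that a line should be marked only by the last reference that touches it, so that earlier words in the same line do not cause an early eviction. The sketch below illustrates that reading and is an assumption, not a statement of the authors' implementation. It assumes the array is line-aligned, that the unroll factor equals the number of words per line, and that word indices are 0-based; both function names are hypothetical.

```python
def evict_me_bits_unrolled(unroll_factor):
    """Bits for A(I), A(I+1), ..., A(I+unroll_factor-1) in one unrolled body.

    Only the last reference in the body sets the evict-me bit.
    """
    return [offset == unroll_factor - 1 for offset in range(unroll_factor)]


def evict_me_at_runtime(word_index, words_per_line):
    """Run-time alternative: set the bit only on the last word of a line."""
    return word_index % words_per_line == words_per_line - 1


bits = evict_me_bits_unrolled(4)
runtime_bits = [evict_me_at_runtime(i, 4) for i in range(8)]
```

Unrolling decides the bit statically, at the cost of code size; the run-time check keeps the loop compact but adds an address test to each access.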
9 Evict-me: An Example
2-way cache with evict-me
SUBROUTINE TEST(N)
INTEGER A(N), B(N), C(N)
DO I = 1, N
  C(I) = A(I) + B(I)
ENDDO
DO I = 1, N
  A(I) = C(I) + 5
ENDDO
END
[Figure: cache contents for Sets 1 to 128 (N = 128). Set 1 shows B1 with evict-me bit 1, and A1 and C1 with evict-me bit 0.]
Cache size: 256 words
Nest 1 volume: 384 words < 2 × 256
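The example above can be replayed in a short simulation. This is a sketch under assumptions the slide does not spell out: one-word cache lines, arrays laid out so that A(I), B(I), and C(I) all map to set I, and the first loop reading A(I) and B(I) before writing C(I). The evict-me bit is set only on B, which is never reused, matching the bits shown in the figure. The function name `simulate` is hypothetical.

```python
def simulate(n, num_sets, ways, use_evict_me):
    """Count misses for the two loops of TEST on a small set-associative cache."""
    sets = [[] for _ in range(num_sets)]    # each set: lines ordered LRU first
    misses = 0

    def access(array, i, evict_me=False):
        nonlocal misses
        lines = sets[(i - 1) % num_sets]    # assumed mapping: element I -> set I
        key = (array, i)
        for line in lines:
            if line[0] == key:
                lines.remove(line)          # hit: re-appended below as most recent
                break
        else:
            misses += 1
            if len(lines) == ways:
                marked = [l for l in lines if l[1]] if use_evict_me else []
                # Prefer a line with the evict-me bit; otherwise evict the LRU line.
                lines.remove(marked[0] if marked else lines[0])
        lines.append((key, evict_me))

    for i in range(1, n + 1):               # nest 1: C(I) computed from A(I), B(I)
        access("A", i)
        access("B", i, evict_me=True)       # B has no reuse, so it is marked
        access("C", i)
    for i in range(1, n + 1):               # nest 2: A(I) computed from C(I)
        access("C", i)
        access("A", i)
    return misses


lru_total = simulate(128, 128, 2, use_evict_me=False)
evict_me_total = simulate(128, 128, 2, use_evict_me=True)
```

Under these assumptions plain LRU takes 512 misses, because C(I) displaces A(I) in the first loop and A(I) misses again in the second. With the evict-me bit, C(I) displaces B(I) instead and only the 384 compulsory misses remain.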
10 Experimental Framework
- Implemented in Scale, a compiler infrastructure developed at UMass
- Scale includes optimizations such as partial redundancy elimination, scalar replacement, value numbering, sparse conditional constant propagation, register allocation, etc.
- Generates SPARC assembly
- Simulate the evict-me cache with URSIM
  - Out-of-order execution
  - Lock-up free cache
  - SDRAM
[Figure: tool flow. Source code → Scale → SPARC assembly → native assembler and linker → SPARC executable → URSIM]
11 Cache configurations
- Both levels are lock-up free with 8 MSHRs each
[Table: size and associativity, and latencies (cycles), per configuration]
12 Miss reduction (level 1)
13 Miss reduction (level 2)
14 Performance Impact of Evict-me (Conf. 2)
15 Evict-me and Prefetching Combined (Conf. 3)
16 Summary
- Compiler can improve cache replacement decisions
- Evict-me algorithm seldom degrades performance
- Architectural support for evict-me is practical
- Effectiveness depends on cache configuration,
data set size, and access patterns