Title: Cache Performance Optimization by Application-Specific Re-Configurable Indexing
1. Cache Performance Optimization by Application-Specific Re-Configurable Indexing
K. Patel, E. Macii, M. Poncino, Politecnico di Torino, 10129 Torino, Italy
L. Benini, Università di Bologna, 40136 Bologna, Italy
ICCAD 2004, San Jose, CA, November 9-11, 2004
2. Outline
- Introduction.
- Previous work.
- Cache indexing overview.
- Application specific indexing.
- Re-configurable indexing.
- Experimental results.
- Conclusions.
3. Introduction
- Caches in embedded systems: we want low miss rates, but area/power budgets are tight.
- Trade-off: low miss rate vs. complexity.
- Direct-mapped caches: low complexity, faster access, but higher miss rate.
- Associative caches: higher complexity, slower access, but lower miss rate.
- Can we escape this trade-off?
- Yes: by exploiting knowledge of the memory references.
4. Our Contribution
- Target: embedded systems running a given application mix.
- Objective: an optimization technique for reducing conflict misses in direct-mapped caches.
- Solution: use an application-specific indexing mechanism.
- The mechanism is re-configurable upon re-targeting of the system.
5. Smart Indexing: Previous Work
- General-purpose indexing.
- XOR-based indexing (Frailong, 1985).
- Irreducible polynomials (Rau, 1997).
- Bit selection.
- Application-specific indexing.
- Trace-driven bit selection (Givargis, 2003).
6. Cache Indexing Revisited
- Strong analogy between cache indexing and hashing.
- Objective: map a large set of objects onto a much smaller space.
- Difference: must use simple hashing functions for HW implementation!
7. Cache Misses Revisited
- Compulsory misses: first access to a line not in cache.
- Capacity misses: active portion of memory exceeds cache size.
- Conflict misses: active portion of address space fits in cache, but too many lines map to the same cache entry.
- Conflict misses occur only in direct-mapped and set-associative caches.
We target conflict misses, which are the weak point of direct-mapped caches.
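8. Cache Indexing and Conflict Misses
- Conflict misses are affected by how cache lines are addressed.
- Traditional indexing
- $S$ = cache size, $B$ = block size, $n$ = # of address bits.
- $b = \log_2 B$ = # of offset bits.
- $S/B$ = # of cache lines.
- $m = \log_2 (S/B)$ = # of index bits.
- $t = n - (m + b)$ = # of tag bits.
[Figure: the $n$-bit address split into tag ($t$ bits), index ($m$ bits), and offset ($b$ bits).]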
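As a rough illustration of the traditional split, the sketch below decomposes an address into tag, index, and offset using the definitions of b, m, and t above. The function name `split_address`, the 32-bit default for n, and the restriction to power-of-two sizes are assumptions made for this example, not details from the slides.

```python
def split_address(addr, cache_size, block_size, n_bits=32):
    """Split an n-bit address into (tag, index, offset), traditional indexing.

    Assumes cache_size and block_size are powers of two (illustrative only).
    """
    b = block_size.bit_length() - 1                  # b = log2(B) offset bits
    m = (cache_size // block_size).bit_length() - 1  # m = log2(S/B) index bits
    t = n_bits - (m + b)                             # t = n - (m + b) tag bits
    offset = addr & ((1 << b) - 1)
    index = (addr >> b) & ((1 << m) - 1)
    tag = (addr >> (b + m)) & ((1 << t) - 1)
    return tag, index, offset


# Example: a 1KB cache with 4-byte lines gives b = 2 and m = 8.
tag, index, offset = split_address(0x12345678, 1024, 4)
```

The index is always the m bits directly above the offset, which is exactly the fixed choice that bit selection relaxes.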
9. Proposed Indexing Scheme
- Compute the index by selecting bits among all non-offset bits (i.e., $m$ bits out of $z$).
- Problem: which bits should we select?
- Use a specific address trace to select bits so as to minimize conflict misses.
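A minimal sketch of index computation by bit selection is shown below. The helper name `select_bits` and the convention that the lowest selected bit becomes index bit 0 are assumptions for illustration; the slides do not fix an output ordering.

```python
def select_bits(line_addr, selected):
    """Build a cache index from chosen bit positions of a non-offset address.

    line_addr: the address with the offset bits already removed.
    selected:  iterable of bit positions (0 = least significant) to use.
    """
    index = 0
    for out_pos, k in enumerate(sorted(selected)):
        index |= ((line_addr >> k) & 1) << out_pos
    return index


# With z = 3 and m = 2: addresses 1 (001) and 5 (101) collide when
# bits {0, 1} are selected, but not when bits {0, 2} are selected.
same_slot_traditional = select_bits(0b001, [0, 1]) == select_bits(0b101, [0, 1])
same_slot_selected = select_bits(0b001, [0, 2]) == select_bits(0b101, [0, 2])
```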
10. Application-Specific Cache Indexing
- Modeling of conflict conditions
- Trace $T = (a_0, \dots, a_{L-1})$.
- Direct Conflict Pattern between two addresses $a_i$ and $a_j$: $DCP_{ij}$
- Boolean condition under which $a_i$ and $a_j$ would conflict.
- $DCP_{ij} = \bigwedge_{k=0,\dots,z-1} \overline{(\alpha_k \, y_k)}$
- $y_k$: variable which is 1 if bit $k$ is in the index.
- $\alpha_k$: 1 if $a_i$ and $a_j$ differ in bit $k$.
($\bigwedge$ = AND)
11. Application-Specific Cache Indexing
- Modeling of conflict conditions
- Example
- $z = 6$ bits
- $a_i = (010101)$, $a_j = (110111)$
- The addresses differ in bits 5 and 1, so $DCP_{ij} = \overline{y_5}\,\overline{y_1}$
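The DCP of two addresses can be sketched in a few lines: two addresses conflict exactly when none of the bits in which they differ is part of the index. The function names `differing_bits` and `dcp` and the list encoding of the y variables are assumptions chosen for this illustration.

```python
def differing_bits(a_i, a_j, z):
    """Bit positions k (0 = least significant) where a_i and a_j differ."""
    diff = a_i ^ a_j
    return [k for k in range(z) if (diff >> k) & 1]


def dcp(a_i, a_j, z, y):
    """Evaluate the direct conflict pattern for a given selection.

    y[k] == 1 means bit k is used in the index. The two addresses
    conflict when every differing bit is left out of the index.
    """
    return all(y[k] == 0 for k in differing_bits(a_i, a_j, z))


# Slide example: 010101 and 110111 differ in bits 1 and 5.
bits = differing_bits(0b010101, 0b110111, 6)
conflict = dcp(0b010101, 0b110111, 6, y=[1, 0, 1, 1, 1, 0])
```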
12. Application-Specific Cache Indexing
- Modeling of conflict conditions
- Total Conflict Pattern for an address $a_i$: $CP_i$
- OR of all the DCPs between $a_i$ and its successors in the trace.
- $CP_i = \bigvee_{k=i+1,\dots,L} DCP_{ik}$
($\bigvee$ = OR)
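13. Application-Specific Cache Indexing
- Modeling of conflict conditions
- Example
- Trace $T = (0, 1, 5, 6, 1, 5, 6, 5, 6)$.
- For $a_1$ (CP for the 2nd reference): $CP_1 = DCP_{12} \lor DCP_{13}$
- $DCP_{12}$: $a_1 = (001)$, $a_2 = (101)$, so $DCP_{12} = \overline{y_2}$
- $DCP_{13}$: $a_1 = (001)$, $a_3 = (110)$, so $DCP_{13} = \overline{y_2}\,\overline{y_1}\,\overline{y_0}$
- $CP_1 = DCP_{12} \lor DCP_{13} = \overline{y_2} \lor \overline{y_2}\,\overline{y_1}\,\overline{y_0} = \overline{y_2}$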
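The CP example can be reproduced symbolically with a small sketch. Each DCP is represented as the set of bit positions that must all stay out of the index, and the OR is simplified by absorption (x + x·y = x). The names `cp_terms` and `absorb` are hypothetical, and stopping at the next reference to the same address is an assumption suggested by the example, which uses only the two references that follow $a_1$.

```python
def cp_terms(trace, i, z):
    """DCP terms of CP_i, each as a frozenset of bits that must be unselected.

    Assumption: the OR stops at the next reference to the same address.
    """
    a_i = trace[i]
    terms = []
    for a_k in trace[i + 1:]:
        if a_k == a_i:
            break
        diff = a_i ^ a_k
        terms.append(frozenset(k for k in range(z) if (diff >> k) & 1))
    return terms


def absorb(terms):
    """Drop every term that strictly contains another term (x + x*y = x)."""
    unique = set(terms)
    return {t for t in unique if not any(other < t for other in unique)}


T = [0, 1, 5, 6, 1, 5, 6, 5, 6]
terms = cp_terms(T, 1, z=3)   # one set per DCP of the 2nd reference
simplified = absorb(terms)    # the single remaining term is {2}
```

The remaining term `{2}` corresponds to the condition that bit 2 is not selected.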
14. Application-Specific Cache Indexing
- From conflict conditions to conflict misses.
- Consider each $CP_i$ as an integer-valued term.
- Value of each $CP_i$ is either 0 or 1.
- Sum over all addresses in the trace:
- $Cost = \sum_{i=0,\dots,L-1} CP_i$
- Cost is an integer-valued function of all $y_i$'s.
- Find an assignment of the $y_i$'s that minimizes Cost.
- Proposed algorithm uses
- BDDs for Boolean functions.
- ADDs for integer-valued function.
($\sum$ = arithmetic sum)
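The cost function can be illustrated without decision diagrams. The sketch below swaps the BDD/ADD machinery for plain exhaustive enumeration of all m-bit selections, which is only practical for small z. It also assumes that $CP_i$ is evaluated over the references between $a_i$ and the next reference to the same address, and counts 0 when the address is never referenced again; the slides do not spell out these boundary cases. All function names are hypothetical.

```python
from itertools import combinations


def cp(trace, i, mask):
    """1 if some reference before the next reuse of trace[i] conflicts with it."""
    a_i = trace[i]
    try:
        nxt = trace.index(a_i, i + 1)   # next reference to the same address
    except ValueError:
        return 0                        # assumption: no reuse, nothing to count
    return int(any(((a_i ^ a_k) & mask) == 0 for a_k in trace[i + 1:nxt]))


def conflict_cost(trace, selected):
    """Cost = sum of CP_i over the trace for one selection of index bits."""
    mask = sum(1 << k for k in selected)
    return sum(cp(trace, i, mask) for i in range(len(trace)))


def best_selection(trace, z, m):
    """Exhaustive search over all m-out-of-z selections (not the BDD/ADD method)."""
    return min(combinations(range(z), m),
               key=lambda sel: conflict_cost(trace, sel))


T = [0, 1, 5, 6, 1, 5, 6, 5, 6]
traditional_cost = conflict_cost(T, (0, 1))
best = best_selection(T, z=3, m=2)
```

On this trace the sketch reports a cost of 2 for the traditional bits 0 and 1, and returns bits 0 and 2 with a cost of 0. Ties between equally good selections are broken by enumeration order.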
15. Application-Specific Cache Indexing
- Example
- $z = 3$ (8 memory words) and $m = 2$ (4 cache slots).
- $T = (0, 1, 5, 6, 1, 5, 6, 5, 6)$.
- $m = 2$ index bits $\Rightarrow$ 3 choices:
- bit 0 and bit 1 (traditional indexing)
- bit 0 and bit 2
- bit 1 and bit 2
- Resulting ADD: [figure not reproduced]
- Optimal indexing
- $y_0 = 1$, $y_1 = 0$, $y_2 = 1$
- bit 0 and bit 2
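One way to sanity-check this example is to simulate a direct-mapped cache directly and count misses for a given choice of index bits. The simulator below is an illustrative sketch (the name `count_misses` and the dictionary-based cache are assumptions), not the tool used by the authors.

```python
def count_misses(trace, selected):
    """Simulate a direct-mapped cache whose index is built from `selected` bits.

    Returns the total number of misses (compulsory and conflict).
    """
    cache = {}      # cache slot -> address currently stored there
    misses = 0
    for addr in trace:
        slot = tuple((addr >> k) & 1 for k in selected)
        if cache.get(slot) != addr:
            misses += 1
            cache[slot] = addr
    return misses


T = [0, 1, 5, 6, 1, 5, 6, 5, 6]
misses_traditional = count_misses(T, (0, 1))   # bits 0 and 1
misses_selected = count_misses(T, (0, 2))      # bits 0 and 2
```

With bits 0 and 1 the simulation gives 6 misses; with bits 0 and 2 it gives 4, which are just the first accesses to the four distinct addresses in the trace.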
16. Making Bit Selection Re-Configurable
- Architecture of a bit-slice of the bit selector
[Figure: bit-slice schematic. Labels: z address bits, selection value, z-bit register, transistors P1, P2, N1, NP, Vdd, output index bit i.]
- 2 FO4 maximum delay!
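A behavioral model of the selector may help make the figure concrete. The slide does not state how the selection value is encoded, so this sketch assumes that each slice holds a one-hot z-bit register that picks the single address bit forwarded to index bit i. The function names and the one-hot encoding are assumptions, and the model says nothing about the transistor-level circuit or its delay.

```python
def bit_slice(addr_bits, select_reg):
    """One slice: forward the address bit whose register bit is set.

    Assumes select_reg is one-hot (exactly one entry equals 1).
    """
    if sum(select_reg) != 1:
        raise ValueError("selection register must be one-hot")
    return next(a for a, s in zip(addr_bits, select_reg) if s)


def bit_selector(addr, z, select_regs):
    """Compute all index bits; select_regs holds one z-bit register per slice."""
    addr_bits = [(addr >> k) & 1 for k in range(z)]   # addr_bits[k] = bit k
    return [bit_slice(addr_bits, reg) for reg in select_regs]


# Two slices configured to pick address bits 0 and 2 (z = 3).
regs = [[1, 0, 0], [0, 0, 1]]
index_bits = bit_selector(0b110, 3, regs)
```

Re-targeting the system then amounts to loading new register contents.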
17. Bit Selector Implementation
18. Experimental Flow
[Figure: experimental flow diagram; inputs are the input trace and the cache parameters.]
19. Experimental Results: Embedded Applications
- PowerStone benchmarks.
- Filters, transforms, crypto, ...
- Cache configurations
- Configuration A: Size 1KB, Line Size 4 Bytes.
- Configuration B: Size 2KB, Line Size 4 Bytes.
- Configuration C: Size 4KB, Line Size 4 Bytes.
20. Experimental Results: Embedded Applications
- Conflict miss reduction w.r.t. default indexing.
- Consider data and instruction caches.
Average 9.56% miss reduction.
Average 16.94% miss reduction.
21. Experimental Results: Embedded Applications
- Conflict miss reduction w.r.t. heuristic bit selection (Givargis, 2003).
- Consider data and instruction caches.
Average 6.2% miss reduction.
Average 12.17% miss reduction.
22. Experimental Results: General Purpose Applications
- SPEC benchmarks.
- Subset of entire suite.
- Limited to 10M references.
- Cache configurations
- Configuration A: Size 4KB, Line Size 4 Bytes.
- Configuration B: Size 16KB, Line Size 8 Bytes.
- Configuration C: Size 64KB, Line Size 16 Bytes.
23. Experimental Results: General Purpose Applications
- Conflict miss reduction w.r.t. default indexing.
- Consider data and instruction caches.
Average 34.33% miss reduction.
Average 24.67% miss reduction.
Higher savings than for embedded applications!
24. Conclusions
- Bit selection: a low-cost approach to reduce the number of conflict misses.
- Our algorithm
- Exactly models conflict miss conditions.
- Resulting bit selection yields the minimum number of conflict misses.
- Re-configurability: indexing is easily changeable based on the application.