Cache Performance Optimization by Application-Specific Re-Configurable Indexing - PowerPoint PPT Presentation

1
Cache Performance Optimization by
Application-Specific Re-Configurable Indexing
K. Patel, E. Macii, M. Poncino (Politecnico di
Torino, 10129 Torino, Italy)
L. Benini (Università di Bologna, 40136 Bologna,
Italy)
ICCAD 2004, San Jose, CA, November 9-11, 2004
2
Outline
  • Introduction.
  • Previous work.
  • Cache indexing overview.
  • Application-specific indexing.
  • Re-configurable indexing.
  • Experimental results.
  • Conclusions.

3
Introduction
  • Caches in embedded systems
  • Want low miss rates...
  • ...but tight area/power budgets.
  • Trade-off: low miss rate vs. complexity.
  • Direct-mapped caches
  • Low complexity, faster access, but higher miss
    rate.
  • Associative caches
  • Higher complexity, slower access, but smaller
    miss rate.
  • Can we escape this trade-off?
  • Yes, by exploiting knowledge of the memory
    references.

4
Our Contribution
  • Target
  • Embedded systems running a given application mix.
  • Objective
  • Optimization technique for reducing conflict
    misses in direct-mapped caches.
  • Solution
  • Use an application-specific indexing mechanism.
  • Re-configurable upon re-targeting of the system.

5
Smart Indexing Previous Work
  • General-purpose indexing.
  • XOR-based indexing [Frailong, 1985].
  • Irreducible polynomials [Rau, 1997].
  • Bit selection.
  • Application-specific indexing.
  • Trace-driven bit selection [Givargis, 2003].

6
Cache Indexing Revisited
  • Strong analogy between cache indexing and
    hashing.
  • Objective: map a large set of objects onto a much
    smaller space.
  • Difference
  • Must use simple hashing functions for HW
    implementation!

7
Cache Misses Revisited
  • Compulsory misses.
  • First access to line not in cache.
  • Capacity misses.
  • Active portion of memory exceeds cache size.
  • Conflict misses.
  • Active portion of address space fits in cache,
    but too many lines map to the same cache entry.
  • Occur only in direct-mapped and set-associative
    caches.

We target conflict misses, which are the weak
point of direct-mapped caches.
8
Cache Indexing and Conflict Misses
  • Conflict misses are affected by how cache lines
    are addressed.
  • Traditional indexing
  • S = cache size; B = block size; n = # of address
    bits.
  • b = log2(B) = # of offset bits.
  • S/B = # of cache lines.
  • m = log2(S/B) = # of index bits.
  • t = n - (m + b) = # of tag bits.

[Figure: the n address bits split into tag (t bits), index (m bits), and offset (b bits).]
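A minimal C sketch of this traditional split, assuming 32-bit addresses and the 1 KB cache with 4-byte lines used later as Configuration A (so b = 2 and m = 8):

/* Minimal sketch of traditional indexing, assuming 32-bit addresses and
 * a 1 KB direct-mapped cache with 4-byte lines:
 * b = log2(4) = 2 offset bits, m = log2(1024/4) = 8 index bits. */
#include <stdint.h>
#include <stdio.h>

#define B_BITS 2u                    /* b: offset bits  */
#define M_BITS 8u                    /* m: index bits   */

static uint32_t index_of(uint32_t addr) {
    return (addr >> B_BITS) & ((1u << M_BITS) - 1u);  /* middle m bits */
}

static uint32_t tag_of(uint32_t addr) {
    return addr >> (B_BITS + M_BITS);  /* top t = n - (m + b) bits */
}

int main(void) {
    uint32_t a = 0x12345678u;        /* illustrative address */
    printf("index = 0x%02x, tag = 0x%06x\n", index_of(a), tag_of(a));
    return 0;
}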
9
Proposed Indexing Scheme
  • Compute the index by selecting bits from among
    all the non-offset bits (i.e., m bits out of z).
  • Problem What bits should we consider?
  • Use a specific address trace to select bits so as
    to minimize conflict misses.
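A minimal C sketch of this selection, assuming 32-bit addresses with b = 2 offset bits; the selection mask holds the yk variables (bit k set means non-offset bit k participates in the index), and the mask value in main() is purely illustrative:

/* Minimal sketch of the proposed indexing: pick m of the z non-offset
 * bits according to a selection mask (the yk variables) and concatenate
 * them to form the index. */
#include <stdint.h>
#include <stdio.h>

#define B_BITS 2u                       /* offset bits b             */
#define Z_BITS 30u                      /* z = n - b non-offset bits */

static uint32_t select_index(uint32_t addr, uint32_t y_mask) {
    uint32_t z = addr >> B_BITS;        /* the z candidate bits      */
    uint32_t index = 0;
    unsigned out = 0;
    for (unsigned k = 0; k < Z_BITS; ++k)
        if (y_mask & (1u << k))         /* yk = 1: bit k is selected */
            index |= ((z >> k) & 1u) << out++;
    return index;
}

int main(void) {
    uint32_t y_mask = 0x000000FFu;      /* selecting bits 0..7 reproduces
                                           traditional indexing for m = 8 */
    printf("index = 0x%02x\n", select_index(0x12345678u, y_mask));
    return 0;
}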

10
Application-Specific Cache Indexing
  • Modeling of conflict conditions
  • Trace T = (a0, …, aL-1).
  • Direct Conflict Pattern between two addresses ai
    and aj: DCPij.
  • Boolean conditions under which ai and aj would
    conflict.
  • DCPij = ∧k=0,…,z-1 (ak' ∨ yk')
  • yk = variable which is 1 if bit k is in the
    index.
  • ak = 1 if ai and aj differ in bit k.

(∧ = AND, ' = NOT)
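A minimal C sketch of the DCP check for one address pair and one candidate selection; the pair conflicts exactly when no selected bit distinguishes the two (line) addresses:

/* Minimal sketch of DCPij: addresses ai and aj (offset bits already
 * stripped) conflict under selection y_mask iff the mask of differing
 * bits (the ak = 1 positions) is disjoint from the selected bits. */
#include <stdbool.h>
#include <stdint.h>

static bool dcp(uint32_t ai, uint32_t aj, uint32_t y_mask) {
    uint32_t alpha = ai ^ aj;       /* bit k set <=> ai and aj differ in bit k */
    return (alpha & y_mask) == 0;   /* no selected bit tells them apart        */
}

For the example on the next slide, ai XOR aj = 100010, so dcp() is true exactly when y5 = y1 = 0, i.e., DCPij = y5' ∧ y1'.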
11
Application-Specific Cache Indexing
  • Modeling of conflict conditions
  • Example
  • z = 6 bits
  • ai = (010101), aj = (110111)

DCPij = y5' ∧ y1'
12
Application-Specific Cache Indexing
  • Modeling of conflict conditions
  • Total Conflict Pattern for an address ai: CPi.
  • OR of all the DCPs between ai and its successors
    in the trace.
  • CPi = ∨k=i+1,…,L-1 DCPik

(∨ = OR)
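A minimal C sketch of CPi for one candidate selection; following the worked example on the next slide, pairs of identical addresses are skipped, since a repeated address maps to the same line and cannot cause a conflict:

/* Minimal sketch of CPi under one selection y_mask: ai contributes a
 * conflict if some later, different address in the trace cannot be
 * told apart from ai by any selected bit (i.e., some DCPik holds). */
#include <stdbool.h>
#include <stdint.h>

static bool cp(const uint32_t *trace, int len, int i, uint32_t y_mask) {
    for (int k = i + 1; k < len; ++k) {
        if (trace[k] == trace[i])
            continue;                              /* same line: not a conflict */
        if (((trace[i] ^ trace[k]) & y_mask) == 0)
            return true;                           /* DCPik evaluates to 1 */
    }
    return false;
}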
13
Application-Specific Cache Indexing
  • Modeling of conflict conditions
  • Example
  • Trace T = (0, 1, 5, 6, 1, 5, 6, 5, 6).
  • For a1 (the CP of the 2nd reference):
  • CP1 = DCP12 ∨ DCP13
  • DCP12:
  • a1 = (001)
  • a2 = (101)
  • DCP12 = y2'
  • DCP13:
  • a1 = (001)
  • a3 = (110)
  • DCP13 = y2' ∧ y1' ∧ y0'
  • CP1 = DCP12 ∨ DCP13 = y2' ∨ (y2' ∧ y1' ∧ y0') = y2'

14
Application-Specific Cache Indexing
  • From conflict conditions to conflict misses.
  • Consider each CPi as an integer-valued term.
  • Value of each CPi is either 0 or 1.
  • Sum over all addresses in the trace
  • Cost = Σi=0,…,L-1 CPi
  • Cost is an integer-valued function of all the yk's.
  • Find an assignment of the yk's that minimizes
    Cost.
  • Proposed algorithm uses
  • BDDs for Boolean functions.
  • ADDs for the integer-valued cost function.

(Σ = arithmetic sum)
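A brute-force C sketch of this minimization, using the example trace from slide 13 with z = 3 and m = 2; it simply enumerates every selection of m bits instead of manipulating BDDs/ADDs, but on this small example it reproduces the optimal selection shown on the next slide (bits 0 and 2):

/* Brute-force sketch of Cost minimization on the example trace
 * T = (0, 1, 5, 6, 1, 5, 6, 5, 6), z = 3, m = 2. The exact algorithm
 * uses BDDs/ADDs rather than this exhaustive enumeration. */
#include <stdint.h>
#include <stdio.h>

static int popcount3(uint32_t x) { return (x & 1) + ((x >> 1) & 1) + ((x >> 2) & 1); }

static int cost(const uint32_t *t, int len, uint32_t y_mask) {
    int sum = 0;
    for (int i = 0; i < len; ++i) {              /* Cost = sum of CPi       */
        for (int k = i + 1; k < len; ++k) {
            if (t[k] != t[i] && ((t[i] ^ t[k]) & y_mask) == 0) {
                ++sum;                           /* CPi evaluates to 1      */
                break;
            }
        }
    }
    return sum;
}

int main(void) {
    const uint32_t T[] = {0, 1, 5, 6, 1, 5, 6, 5, 6};
    const int L = 9, z = 3, m = 2;
    uint32_t best_y = 0; int best_cost = L + 1;
    for (uint32_t y = 0; y < (1u << z); ++y) {   /* all selections of m bits */
        if (popcount3(y) != m) continue;
        int c = cost(T, L, y);
        printf("y2 y1 y0 = %u%u%u  cost = %d\n", (y >> 2) & 1, (y >> 1) & 1, y & 1, c);
        if (c < best_cost) { best_cost = c; best_y = y; }
    }
    printf("best selection: y2 y1 y0 = %u%u%u\n",
           (best_y >> 2) & 1, (best_y >> 1) & 1, best_y & 1);
    return 0;
}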
15
Application-Specific Cache Indexing
  • Example
  • z = 3 (8 memory words) and m = 2 (4 cache slots).
  • T = (0, 1, 5, 6, 1, 5, 6, 5, 6).
  • m = 2 index bits → 3 choices:
  • bit 0 and bit 1 (traditional indexing)
  • bit 0 and bit 2
  • bit 1 and bit 2
  • Resulting ADD
  • Optimal indexing
  • y0 = 1, y1 = 0, y2 = 1
  • bit 0 and bit 2

16
Making Bit Selection Re-Configurable
  • Architecture of a bit-slice of the bit selector

[Figure: bit-slice schematic. Labels: z address bits, selection value, z-bit register, Vdd, transistors P1, P2, NP, N1, index bit i. Maximum delay: 2 FO4.]
17
Bit Selector Implementation
18
Experimental Flow
[Flow diagram inputs: Input Trace, Cache Parameters.]
19
Experimental Results Embedded Applications
  • PowerStone benchmarks.
  • Filters, transforms, crypto, ...
  • Cache configurations
  • Configuration A: Size = 1 KB, Line Size = 4 Bytes.
  • Configuration B: Size = 2 KB, Line Size = 4 Bytes.
  • Configuration C: Size = 4 KB, Line Size = 4 Bytes.

20
Experimental Results Embedded Applications
  • Conflict miss reduction w.r.t. default indexing.
  • Consider data and instruction caches.

Average 9.56% miss reduction.
Average 16.94% miss reduction.
21
Experimental Results Embedded Applications
  • Conflict miss reduction w.r.t. heuristic bit
    selection [Givargis, 2003].
  • Consider data and instruction caches.

Average 6.2% miss reduction.
Average 12.17% miss reduction.
22
Experimental Results General Purpose Applications
  • SPEC benchmarks.
  • Subset of entire suite.
  • Limited to 10M references.
  • Cache configurations
  • Configuration A: Size = 4 KB, Line Size = 4 Bytes.
  • Configuration B: Size = 16 KB, Line Size = 8 Bytes.
  • Configuration C: Size = 64 KB, Line Size = 16 Bytes.

23
Experimental Results General Purpose Applications
  • Conflict miss reduction w.r.t. default indexing.
  • Consider data and instruction caches.

Average 34.33% miss reduction.
Average 24.67% miss reduction.
Higher savings than for embedded applications!
24
Conclusions
  • Bit selection: a low-cost approach to reducing
    the number of conflict misses.
  • Our algorithm
  • Exactly models conflict miss conditions.
  • The resulting bit selection yields the minimum
    (optimal) number of conflict misses.
  • Re-configurability: the indexing can easily be
    changed when the application changes.