Automatic Tuning of Two-Level Caches to Embedded Applications

Slides: 27

Provided by: jau60

Learn more at: http://www.cs.ucr.edu

Transcript and Presenter's Notes

Title: Automatic Tuning of Two-Level Caches to Embedded Applications

1
Automatic Tuning of Two-Level Caches to Embedded Applications

Ann Gordon-Ross and Frank Vahid
Department of Computer Science and Engineering
University of California, Riverside
Also with the Center for Embedded Computer Systems, UC Irvine

Nikil Dutt
Center for Embedded Computer Systems
School for Information and Computer Science
University of California, Irvine

This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation
2
Introduction

Memory access: 50% of an embedded processor's system power
Caches are power hungry
ARM920T (Segars 01)
MCORE (Lee/Moyer/Arends 99)
Thus, the cache is a good candidate for optimization

[Figure: memory hierarchy with Processor, L1 Cache, L2 Cache, and Main Mem]
3
Motivation

Tuning cache parameters to an application can save energy: 60% on average (Balasubramonian00, Zhang03)
Each application has different cache requirements
One predetermined cache configuration can't be best for all applications

L1 Cache
6
Motivation

By tuning these parameters, the cache can be
customized to a particular application

[Figure: L1 Cache, L2 Cache, and Main Memory hierarchy; plot of Energy versus Possible Cache Configurations]
7
Related Work

Configurable caches
Soft cores (ARM, MIPS, Tensilica, etc.)
Even for hard processors (Motorola MCore - Malik ISLPED00, Albonesi MICRO00, Zhang ISCA03)
Configurable cache tuning
Mostly manual in practice
Sub-optimal, time-consuming
L1 automated methods
Platune (Givargis TCAD02, Palesi CODES02)
Zhang RSP03
Two-level caches becoming popular
More transistors available on-chip
Bigger gap between on-chip and off-chip accesses
Need automated tuning for L1 and L2

8
Challenge for Two-Level Cache Tuning

One level: 10s of configurations
Two levels: 100s/1000s of configurations
Need efficient heuristic
Especially if used with simulation-based search

Level 1 (total size, line size, associativity): say 50 configs
Level 2 (total size, line size, associativity): 50 configs
Combined: 50 x 50 = 2500 configs
9
Two-Level Cache Tuning Goal

Develop fast, good-quality heuristic for tuning
two-level caches to embedded applications for
reduced energy consumption
Presently focused on separate I and D caches in both levels

[Figure: Microprocessor, Level 1 Caches (I-cache, D-cache), Level 2 Caches (I-cache, D-cache), Main Memory]
10
Configurable Cache Architecture

Our target configurable cache architecture is based on Zhang/Vahid/Najjar's "Highly-Configurable Cache Architecture for Embedded Systems," ISCA 2003

Base Level One Cache: 8 KB cache consisting of four 2 KB banks that can operate as 4 ways
11
Configuration Space

Cache parameters
Size - L1 cache: 2, 4, and 8 KBytes; L2 cache: 16, 32, and 64 KBytes
Line size (L1 or L2) - 16, 32, and 64 Bytes
16 byte physical base line size
Associativity (L1 or L2) - Direct-mapped, 2-way, and 4-way
432 possible configurations for two levels, with separate I and D
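
The slide gives the parameter values and the total of 432 but not the counting rule. The sketch below shows one set of assumptions that reproduces that total: (a) associativity is limited by the number of banks in use (2 KB banks for L1, 16 KB banks for L2, consistent with the bank arrangement described in this deck), and (b) the L2 line size is never smaller than the L1 line size. Assumption (b) and the helper names are illustrative and are not stated in the slides.

```python
from itertools import product

LINE_SIZES = (16, 32, 64)  # bytes
WAYS = (1, 2, 4)           # direct-mapped, 2-way, 4-way

def level_configs(sizes_kb, bank_kb):
    """All (size_kb, line_bytes, ways) tuples for one cache level.

    Assumption: a cache built from `size_kb // bank_kb` banks can offer
    at most that many ways.
    """
    return [(s, l, w)
            for s, l, w in product(sizes_kb, LINE_SIZES, WAYS)
            if w <= s // bank_kb]

l1 = level_configs((2, 4, 8), bank_kb=2)
l2 = level_configs((16, 32, 64), bank_kb=16)

# Assumption (not from the slides): L2 line size >= L1 line size.
pairs = [(a, b) for a, b in product(l1, l2) if b[1] >= a[1]]

# Separate instruction and data hierarchies, each tuned independently.
total = 2 * len(pairs)
print(len(l1), len(l2), len(pairs), total)  # prints: 18 18 216 432
```

Other counting rules could give the same total, so this is a plausibility check rather than a statement of how the authors counted.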

12
Experimental Environment

Benchmarks: MediaBench, EEMBC
SimpleScalar: hit and miss ratios for each configuration
Energy model: cache energy - Cacti; main memory energy - Samsung memory; CPU stall energy - 0.18 micron MIPS uP
Cache exploration heuristic: selects the chosen cache configuration
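
The three energy sources named on this slide can be combined into a single figure of merit for each configuration. This is a minimal sketch, assuming the total is a plain sum of cache access energy, main memory access energy, and CPU stall energy. The class, function, parameter names, and all numbers are hypothetical placeholders, not values from Cacti, a memory datasheet, or the authors' model.

```python
from dataclasses import dataclass

@dataclass
class LevelStats:
    accesses: int         # accesses that reach this cache level
    misses: int           # misses at this level
    nj_per_access: float  # per-access energy, e.g. from a cache model

def hierarchy_energy_nj(l1, l2, mem_nj_per_access, stall_cycles, stall_nj_per_cycle):
    """Assumed total: cache energy + main memory energy + CPU stall energy."""
    cache = l1.accesses * l1.nj_per_access + l2.accesses * l2.nj_per_access
    memory = l2.misses * mem_nj_per_access   # assume every L2 miss goes off-chip
    stall = stall_cycles * stall_nj_per_cycle
    return cache + memory + stall

# Placeholder statistics, as a simulator might report for one configuration.
l1 = LevelStats(accesses=1000, misses=100, nj_per_access=0.5)
l2 = LevelStats(accesses=100, misses=10, nj_per_access=2.0)
total_nj = hierarchy_energy_nj(l1, l2, mem_nj_per_access=50.0,
                               stall_cycles=1500, stall_nj_per_cycle=0.25)
print(total_nj)  # prints: 1575.0
```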
13
First Heuristic: Tune Levels One-at-a-Time

Tune L1, then L2
Initial L2: 64 KByte, 4-way, 64 byte line size
For best L1 found, tune L2 cache
Tuned each cache using Zhang's heuristic for one-level cache tuning (RSP03)

L1 Cache
L2 Cache
Main Memory
14
First Heuristic: Tune Levels One-at-a-Time

Zhang's heuristic: search parameters in order of importance (RSP03)
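
A one-parameter-at-a-time search of this kind can be sketched as follows. The sketch assumes the order size, then line size, then associativity, and assumes each parameter is stepped up from its smallest value until energy stops improving. Those details, the function names, and the toy energy function are illustrative assumptions and are not taken from the slides. Bank constraints on associativity are ignored here for brevity.

```python
def tune_one_level(energy, params):
    """Greedy search over one cache.

    energy: callable taking a config dict and returning a number.
    params: ordered list of (name, ascending_values), most important first.
    Returns (best_config, best_energy, evaluations).
    """
    cfg = {name: values[0] for name, values in params}
    best_e = energy(cfg)
    evals = 1
    for name, values in params:
        for v in values[1:]:
            trial = dict(cfg, **{name: v})
            e = energy(trial)
            evals += 1
            if e >= best_e:      # no improvement: stop growing this parameter
                break
            cfg, best_e = trial, e
    return cfg, best_e, evals

PARAMS = [("size_kb", (2, 4, 8)), ("line_b", (16, 32, 64)), ("ways", (1, 2, 4))]

def toy_energy(c):
    # Hypothetical stand-in for simulation: minimum at 4 KB, 32 B lines, 2-way.
    return abs(c["size_kb"] - 4) + abs(c["line_b"] - 32) / 16 + abs(c["ways"] - 2)

best_cfg, best_e, evals = tune_one_level(toy_energy, PARAMS)
print(best_cfg, best_e, evals)
```

On this toy function the search evaluates 7 of the 27 unconstrained combinations, which illustrates why a guided search is attractive when each evaluation is a full simulation.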

15
Results of First Heuristic

Base cache configuration
Level 1 - 8 KByte, 4-way, 32 byte line
Level 2 - 64 KByte, 4-way, 64 byte line

16
First Heuristic

Did not find optimal in most cases
Sometimes 200% or 300% worse
The two levels should not be explored separately
Too much interdependence among L1 and L2 cache parameters
E.g., high L1 associativity decreases misses and thus reduces the need for a large L2
Dozens of other such interdependencies

17
Improved Heuristic: Basic Interlacing

To more fully explore the dependencies between
the two levels, we interlaced the exploration of
the level one and level two caches

L1 Cache
L2 Cache
Basic interlacing performed better than the initial heuristic, but there was still much room for improvement
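
The slides do not give the exact interlacing order. The sketch below assumes each parameter is tuned for L1 and then L2 before moving to the next parameter (size, line size, associativity), with each parameter stepped up while energy improves. The order, the names, and the toy energy function are hypothetical and only illustrate the idea of alternating between levels.

```python
SPACE = {
    "l1": {"size": (2, 4, 8), "line": (16, 32, 64), "ways": (1, 2, 4)},
    "l2": {"size": (16, 32, 64), "line": (16, 32, 64), "ways": (1, 2, 4)},
}

def interlaced_order():
    # Assumed order: L1 size, L2 size, L1 line, L2 line, L1 ways, L2 ways.
    return [(lvl, p) for p in ("size", "line", "ways") for lvl in ("l1", "l2")]

def tune_interlaced(energy):
    """Greedy search that alternates between the two cache levels."""
    cfg = {(lvl, p): SPACE[lvl][p][0] for lvl, p in interlaced_order()}
    best_e = energy(cfg)
    evals = 1
    for key in interlaced_order():
        lvl, p = key
        for v in SPACE[lvl][p][1:]:
            trial = {**cfg, key: v}
            e = energy(trial)
            evals += 1
            if e >= best_e:      # no improvement: move on to the next step
                break
            cfg, best_e = trial, e
    return cfg, best_e, evals

def toy_energy(c):
    # Hypothetical interdependence: a more associative L1 needs a smaller L2.
    l2_need = 64 // c[("l1", "ways")]
    return c[("l1", "size")] + c[("l1", "ways")] + abs(c[("l2", "size")] - l2_need) / 8

cfg, best_e, evals = tune_interlaced(toy_energy)
print(cfg[("l1", "size")], cfg[("l2", "size")], best_e, evals)  # prints: 2 64 3.0 8
```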
20
Final Heuristic: Interlaced with Local Search

Interlaced heuristic performed well, but some cases were sub-optimal
Manually examined those cases
Determined that a small local search was needed
Final heuristic called TCaT - The Two Level Cache Tuner

Because of the bank arrangement, if a 16 KB cache is determined to be the best size, the only associativity option is direct-mapped. However, the application may require increased associativity. During the associativity search step, the cache size is therefore allowed to increase so that larger associativities may be explored.
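
The bank constraint and the size-increase rule can be made concrete with a small sketch. It assumes 16 KB banks in the L2 cache, with each bank able to act as one way, and it assumes the associativity step tries each associativity at the smallest size that supports it without going below the best size found. The function names and this exact candidate rule are illustrative assumptions, not the authors' implementation.

```python
BANK_KB = 16           # assumed L2 bank size
SIZES_KB = (16, 32, 64)
WAYS = (1, 2, 4)

def ways_available(size_kb, bank_kb=BANK_KB):
    """Associativities a cache of this size can offer, one way per bank."""
    banks = size_kb // bank_kb
    return [w for w in WAYS if w <= banks]

def associativity_candidates(best_size_kb, bank_kb=BANK_KB):
    """(size_kb, ways) pairs to evaluate in the associativity step.

    The size grows only as far as needed to reach each associativity.
    """
    cands = []
    for w in WAYS:
        size = max(best_size_kb, w * bank_kb)
        if size in SIZES_KB:
            cands.append((size, w))
    return cands

print(ways_available(16))            # prints: [1]
print(associativity_candidates(16))  # prints: [(16, 1), (32, 2), (64, 4)]
print(associativity_candidates(32))  # prints: [(32, 1), (32, 2), (64, 4)]
```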
21
TCaT Results: Energy

Energy consumption (normalized to the base cache configuration)
53% energy savings in cache/memory access sub-system vs. base cache

22
TCaT Results: Performance

Execution time for the TCaT cache configuration and the optimal cache configuration (normalized to the execution time of the benchmark running with the base cache configuration)
TCaT finds near-optimal configuration, nearly 30% improvement over base cache

23
TCaT Exploration Time Improvements