Title: Automatic Tuning of Two-Level Caches to Embedded Applications
1 Automatic Tuning of Two-Level Caches to Embedded Applications
- Ann Gordon-Ross and Frank Vahid
- Department of Computer Science and Engineering
- University of California, Riverside
- Also with the Center for Embedded Computer Systems, UC Irvine
- Nikil Dutt
- Center for Embedded Computer Systems
- School for Information and Computer Science
- University of California, Irvine
This work was supported by the U.S. National Science Foundation and by the Semiconductor Research Corporation.
2 Introduction
- Memory access: 50% of an embedded processor's system power
- Caches are power hungry
- ARM920T (Segars 01)
- MCORE (Lee/Moyer/Arends 99)
- Thus, the cache is a good candidate for optimizations
[Figure: memory hierarchy with Processor, L1 Cache, L2 Cache, and Main Mem]
3 Motivation
- Tuning cache parameters to an application can save 60% energy on average (Balasubramonian00, Zhang03)
- Each application has different cache requirements
- One predetermined cache configuration can't be best for all applications
[Figure: L1 Cache]
6 Motivation
- By tuning these parameters, the cache can be customized to a particular application
[Figure: Energy over Possible Cache Configurations; labels: L1 Cache, L2 Cache, Main Memory]
7 Related Work
- Configurable caches
- Soft cores (ARM, MIPS, Tensilica, etc.)
- Even for hard processors (Motorola MCore, Malik ISLPED00; Albonesi MICRO00; Zhang ISCA03)
- Configurable cache tuning
- Mostly manually in practice
- Sub-optimal, time-consuming
- L1 automated methods
- Platune (Givargis TCAD02, Palesi CODES02)
- Zhang RSP03
- Two-level caches becoming popular
- More transistors on-chip available
- Bigger gap between on-chip and off-chip accesses
- Need automated tuning for L1+L2
8 Challenge for Two-Level Cache Tuning
- One level: 10s of configurations
- Two levels: 100s/1000s of configurations
- Need efficient heuristic
- Especially if used with simulation-based search
[Figure: Level 1 (total size, line size, associativity): say 50 configs. Level 2 (total size, line size, associativity): 50 configs. Combined: 50 x 50 = 2500 configs]
9 Two-Level Cache Tuning Goal
- Develop fast, good-quality heuristic for tuning two-level caches to embedded applications for reduced energy consumption
- Presently focus on separate I and D cache in both levels
[Figure: Microprocessor; Level 1 Caches (I-cache, D-cache); Level 2 Caches (I-cache, D-cache); Main Memory]
10 Configurable Cache Architecture
- Our target configurable cache architecture is based on Zhang/Vahid/Najjar's "Highly-Configurable Cache Architecture for Embedded Systems," ISCA 2003
[Figure: Base Level One Cache, four 2KB banks]
- 8 KB cache consisting of four 2 KB banks that can operate as 4 ways
11 Configuration Space
- Cache parameters
- Size: L1 cache 2, 4, and 8 KBytes; L2 cache 16, 32, and 64 KBytes
- Line size (L1 or L2): 16, 32, and 64 Bytes
- 16 byte physical base line size
- Associativity (L1 or L2): direct-mapped, 2-way, and 4-way
- 432 possible configurations
- For two levels, with separate I and D
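The slide gives the parameter choices and the total of 432 but not how the total is reached. The sketch below enumerates the space under two assumptions that are not stated on the slide: (1) each cache is built from four equal banks acting as ways, so the smallest size is direct-mapped only and the middle size is at most 2-way, and (2) the L2 line is never shorter than the L1 line. Under these assumptions the count comes out to 432; the actual constraints used by the authors may differ.

```python
from itertools import product

LINE_SIZES = (16, 32, 64)   # bytes, same choices for L1 and L2
ASSOCS = (1, 2, 4)          # direct-mapped, 2-way, 4-way


def size_assoc_pairs(sizes_kb):
    """Pair each size with the associativities its banks allow.

    Assumption: the smallest size uses one bank (direct-mapped only), the
    middle size two banks (up to 2-way), the largest four banks (up to 4-way).
    """
    banks_in_use = (1, 2, 4)
    return [(size, assoc)
            for size, banks in zip(sizes_kb, banks_in_use)
            for assoc in ASSOCS if assoc <= banks]


def hierarchy_configs():
    """Enumerate (L1, L2) configurations for one hierarchy (I or D)."""
    l1 = size_assoc_pairs((2, 4, 8))
    l2 = size_assoc_pairs((16, 32, 64))
    # Assumption: the L2 line size is at least the L1 line size.
    lines = [(a, b) for a, b in product(LINE_SIZES, repeat=2) if b >= a]
    return [(s1, w1, ln1, s2, w2, ln2)
            for (s1, w1), (s2, w2), (ln1, ln2) in product(l1, l2, lines)]


per_hierarchy = len(hierarchy_configs())
# Instruction and data hierarchies are separate, so their spaces add up.
total = 2 * per_hierarchy
print(per_hierarchy, total)
```

Because the I and D hierarchies do not share a cache, they can be tuned independently, which is why the two spaces are summed rather than multiplied in this sketch.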
12 Experimental Environment
- Benchmarks: MediaBench, EEMBC
- SimpleScalar: hit and miss ratios for each configuration
- Cache exploration heuristic: chosen cache configuration
- Cache energy: Cacti
- Main memory energy: Samsung memory
- CPU stall energy: 0.18 micron MIPS uP
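The slide names three energy sources (cache, main memory, CPU stall) but does not show how they are combined. The following is a minimal sketch of one plausible way to add them up from simulated access and miss counts. The formula, the class and function names, and every number are hypothetical, and static (leakage) energy is left out.

```python
from dataclasses import dataclass


@dataclass
class LevelStats:
    accesses: int            # accesses seen by this cache level
    misses: int              # accesses that missed in this level
    access_energy_nj: float  # energy per access (hypothetical value)


def memory_subsystem_energy_nj(l1, l2, mem_access_nj,
                               stall_cycles, stall_nj_per_cycle):
    """Sum cache access, main memory access, and CPU stall energy."""
    cache = (l1.accesses * l1.access_energy_nj
             + l2.accesses * l2.access_energy_nj)
    main_memory = l2.misses * mem_access_nj   # every L2 miss goes off-chip
    cpu_stall = stall_cycles * stall_nj_per_cycle
    return cache + main_memory + cpu_stall


# Hypothetical counts, as a simulator might report for one configuration.
l1 = LevelStats(accesses=1_000_000, misses=50_000, access_energy_nj=0.5)
l2 = LevelStats(accesses=l1.misses, misses=5_000, access_energy_nj=2.0)
# Assumed penalties: 10 stall cycles per L1 miss, 100 per L2 miss.
stall_cycles = l1.misses * 10 + l2.misses * 100
energy = memory_subsystem_energy_nj(l1, l2, mem_access_nj=50.0,
                                    stall_cycles=stall_cycles,
                                    stall_nj_per_cycle=0.1)
print(energy)
```

A tuning heuristic would call a function like this once per candidate configuration and compare the totals.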
13 First Heuristic: Tune Levels One-at-a-Time
- Tune L1, then L2
- Initial L2: 64 KByte, 4-way, 64 byte line size
- For best L1 found, tune L2 cache
- Tuned each cache using Zhang's heuristic for one-level cache tuning (RSP03)
[Figure: L1 Cache, L2 Cache, Main Memory]
14 First Heuristic: Tune Levels One-at-a-Time
- Zhang's heuristic: search parameters in order of importance (RSP03)
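The one-at-a-time procedure can be sketched as a greedy search over one parameter at a time. The slides only say that parameters are searched in order of importance; the specific order used here (size, line size, associativity), the rule of stopping at the first value that does not lower energy, and the `toy_energy` stand-in for a simulator run are all assumptions. Bank constraints on associativity are ignored for brevity.

```python
PARAM_ORDER = ("size", "line", "assoc")   # assumed order of importance

L1_CHOICES = {"size": (2, 4, 8), "line": (16, 32, 64), "assoc": (1, 2, 4)}
L2_CHOICES = {"size": (16, 32, 64), "line": (16, 32, 64), "assoc": (1, 2, 4)}


def tune_parameter(config, key, choices, energy):
    """Start at the smallest value; keep stepping up while energy drops."""
    best = {**config, key: choices[0]}
    best_e = energy(best)
    evaluated = 1
    for value in choices[1:]:
        trial = {**best, key: value}
        e = energy(trial)
        evaluated += 1
        if e >= best_e:
            break                      # no improvement: stop on this parameter
        best, best_e = trial, e
    return best, evaluated


def tune_level(config, level, choices, energy):
    """Tune one cache level, one parameter after another."""
    evaluated = 0
    for param in PARAM_ORDER:
        config, n = tune_parameter(config, (level, param),
                                   choices[param], energy)
        evaluated += n
    return config, evaluated


def tune_one_level_at_a_time(energy):
    # Initial L2 from the slide: 64 KByte, 4-way, 64 byte line.
    config = {("L1", "size"): 2, ("L1", "line"): 16, ("L1", "assoc"): 1,
              ("L2", "size"): 64, ("L2", "line"): 64, ("L2", "assoc"): 4}
    config, n1 = tune_level(config, "L1", L1_CHOICES, energy)
    config, n2 = tune_level(config, "L2", L2_CHOICES, energy)
    return config, n1 + n2


def toy_energy(c):
    """Made-up energy model; a real flow would run a cache simulator."""
    l1_capacity = c["L1", "size"] * c["L1", "assoc"]
    l2_capacity = c["L2", "size"] * c["L2", "assoc"]
    l1_misses = 1000 / l1_capacity + 640 / c["L1", "line"]
    l2_misses = 16 * l1_misses / l2_capacity + 64 / c["L2", "line"]
    return l1_capacity + 0.2 * l2_capacity + 2 * l1_misses + 20 * l2_misses


config, evaluated = tune_one_level_at_a_time(toy_energy)
print(config, evaluated)
```

Each parameter costs at most three evaluations, so this sketch never evaluates more than 18 configurations.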
15 Results of First Heuristic
- Base cache configuration
- Level 1 - 8 KByte, 4-way, 32 byte line
- Level 2 - 64 KByte, 4-way, 64 byte line
16 First Heuristic
- Did not find optimal in most cases
- Sometimes 200% or 300% worse
- The two levels should not be explored separately
- Too much interdependence among L1 and L2 cache parameters
- E.g., high L1 associativity decreases misses and thus reduces need for large L2
- Dozens of other such interdependencies
17 Improved Heuristic: Basic Interlacing
- To more fully explore the dependencies between the two levels, we interlaced the exploration of the level one and level two caches
[Figure: L1 Cache, L2 Cache]
- Basic interlacing performed better than the initial heuristic, but there was still much room for improvement
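Interlacing can be illustrated by alternating between the two levels for each parameter instead of finishing one level first. The exact interlacing order is not given on the slides; the order below (L1 size, L2 size, L1 line, L2 line, L1 associativity, L2 associativity) is an assumption, as are the greedy stopping rule and the made-up `toy_energy` function.

```python
from itertools import chain

PARAMS = ("size", "line", "assoc")
CHOICES = {
    ("L1", "size"): (2, 4, 8), ("L2", "size"): (16, 32, 64),
    ("L1", "line"): (16, 32, 64), ("L2", "line"): (16, 32, 64),
    ("L1", "assoc"): (1, 2, 4), ("L2", "assoc"): (1, 2, 4),
}


def interlaced_order():
    """Alternate levels per parameter: L1 size, L2 size, L1 line, ..."""
    return list(chain.from_iterable((("L1", p), ("L2", p)) for p in PARAMS))


def greedy_search(order, energy):
    """Visit parameters in `order`, stepping up while energy improves."""
    config = {key: values[0] for key, values in CHOICES.items()}
    best_e = energy(config)
    evaluated = 1
    for key in order:
        for value in CHOICES[key][1:]:
            trial = {**config, key: value}
            e = energy(trial)
            evaluated += 1
            if e >= best_e:
                break
            config, best_e = trial, e
    return config, best_e, evaluated


def toy_energy(c):
    """Made-up energy model; a real flow would run a cache simulator."""
    l1_capacity = c["L1", "size"] * c["L1", "assoc"]
    l2_capacity = c["L2", "size"] * c["L2", "assoc"]
    l1_misses = 1000 / l1_capacity + 640 / c["L1", "line"]
    l2_misses = 16 * l1_misses / l2_capacity + 64 / c["L2", "line"]
    return l1_capacity + 0.2 * l2_capacity + 2 * l1_misses + 20 * l2_misses


config, best_e, evaluated = greedy_search(interlaced_order(), toy_energy)
print(config, best_e, evaluated)
```

Because each L2 decision is made right after the matching L1 decision, a choice at one level can influence the next choice at the other, which is the interdependence the slide refers to.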
20 Final Heuristic: Interlaced with Local Search
- Basic interlacing performed well, but some cases were sub-optimal
- Manually examined those cases
- Determined small local search needed
- Because of the bank arrangements, if a 16KB cache is determined to be the best size, the only associativity option is direct-mapped
- However, the application may require the increased associativity
- During the associativity search step, the cache size is allowed to increase so that larger associativities may be explored
- Final heuristic called TCaT: The Two Level Cache Tuner
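The size adjustment during the associativity step can be sketched as follows. The function name and the four-bank default are hypothetical; the 16 KB bank size for L2 is inferred from the statement that a 16KB cache can only be direct-mapped, and the 2 KB bank size for L1 comes from the base cache description.

```python
def associativity_candidates(best_size_kb, bank_kb, num_banks=4):
    """List the (size_kb, ways) pairs to try in the associativity step.

    Each way occupies one bank, so a w-way cache needs at least w banks.
    When the best size found so far is too small for w ways, the size is
    raised to the smallest size that can hold them.
    """
    candidates = []
    ways = 1
    while ways <= num_banks:
        size_kb = max(best_size_kb, ways * bank_kb)
        candidates.append((size_kb, ways))
        ways *= 2
    return candidates


# L2 example: best size so far is 16 KB, banks assumed to be 16 KB each.
print(associativity_candidates(16, bank_kb=16))
# L1 example: best size so far is 4 KB, banks are 2 KB each.
print(associativity_candidates(4, bank_kb=2))
```

With a 16 KB best size, the 2-way and 4-way candidates are tried at 32 KB and 64 KB, so an application that needs associativity more than small size is not locked into direct-mapped.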
21 TCaT Results: Energy
- Energy consumption (normalized to the base cache configuration)
- 53% energy savings in cache/memory access sub-system vs. base cache
22 TCaT Results: Performance
- Execution time for the TCaT cache configuration and the optimal cache configuration (normalized to the execution time of the benchmark running with the base cache configuration)
- TCaT finds near-optimal configuration, nearly 30% improvement over base cache
23 TCaT Exploration Time Improvements
- Searches only 28 of 432 possible configurations
- 6% of space
- Simulation-based approach
- 500 MHz Sparc
- 50 hrs (exhaustive) vs. 3 hrs (TCaT)
- Hardware-based approach
- 434 sec (exhaustive) vs. 28 sec (TCaT)
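The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch reads each "A vs. B" pair as exhaustive search versus TCaT, which is an interpretation of the slide, not something it states outright.

```python
searched, total = 28, 432

fraction = searched / total     # share of the space that is explored
pruned = 1 - fraction           # share that is never evaluated

sim_speedup = 50 / 3            # simulation-based: hours vs. hours
hw_speedup = 434 / 28           # hardware-based: seconds vs. seconds
space_ratio = total / searched  # how many times smaller the searched set is

print(f"searched {fraction:.1%}, pruned {pruned:.1%}")
print(f"simulation {sim_speedup:.1f}x, hardware {hw_speedup:.1f}x, "
      f"space {space_ratio:.1f}x")
```

The two speedups are close to the reduction in the number of configurations, which is what one would expect if evaluation time per configuration is roughly constant.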
24 TCaT in Presence of Hw/Sw Partitioning
- Hardware/software partitioning may become common in SOC platforms
- On-chip FPGA
- Program kernels moved to FPGA
- Greatly reduces temporal and spatial locality of program
- Does TCaT still work well on programs with very low locality?
25 TCaT With Hardware/Software Partitioning
- Energy consumption (normalized to the base cache configuration)
- 55% energy savings in cache/memory access sub-system vs. base cache
26 Conclusions
- TCaT is an effective heuristic for two-level cache tuning
- Prunes 94% of search space for a given two-level configurable cache architecture
- Near-optimal performance results, 30% improvement vs. base cache
- Near-optimal energy results, 53% improvement vs. base cache
- Robust in presence of hw/sw partitioning
- Future work
- More cache parameters, unified L2 cache
- Even larger search space
- Dynamic in-system tuning
- Must avoid cache flushes