Automatic Tuning of Two-Level Caches to Embedded Applications

1
Automatic Tuning of Two-Level Caches to Embedded
Applications
  • Ann Gordon-Ross and Frank Vahid
  • Department of Computer Science and Engineering
  • University of California, Riverside
  • Also with the Center for Embedded Computer
    Systems, UC Irvine

  • Nikil Dutt
  • Center for Embedded Computer Systems
  • School for Information and Computer Science
  • University of California, Irvine

This work was supported by the U.S. National
Science Foundation and by the Semiconductor
Research Corporation.
2
Introduction
  • Memory access: 50% of an embedded processor's
    system power
  • Caches are power hungry
  • ARM920T (Segars 01)
  • MCORE (Lee/Moyer/Arends 99)
  • Thus, the cache is a good candidate for
    optimization

[Figure: memory hierarchy with Processor, L1 Cache, L2 Cache, and Main Mem]
3
Motivation
  • Tuning cache parameters to an application can
    save energy, 60% on average
  • Balasubramonian 00, Zhang 03
  • Each application has different cache requirements
  • One predetermined cache configuration can't be
    best for all applications

6
Motivation
  • By tuning these parameters, the cache can be
    customized to a particular application

[Figure: energy plotted over the possible cache configurations, for a hierarchy of L1 Cache, L2 Cache, and Main Memory]
7
Related Work
  • Configurable caches
  • Soft cores (ARM, MIPS, Tensilica, etc.)
  • Even for hard processors (Motorola MCore: Malik
    ISLPED 00; Albonesi MICRO 00; Zhang ISCA 03)
  • Configurable cache tuning
  • Mostly manually in practice
  • Sub-optimal, time-consuming
  • L1 automated methods
  • Platune (Givargis TCAD02, Palesi CODES02)
  • Zhang RSP03
  • Two-level caches becoming popular
  • More transistors on-chip available
  • Bigger gap between on-chip and off-chip accesses
  • Need automated tuning for L1+L2

8
Challenge for Two-Level Cache Tuning
  • One level: 10s of configurations
  • Two levels: 100s/1000s of configurations
  • Need efficient heuristic
  • Especially if used with simulation-based search

[Figure: Level 1 and Level 2 each have three parameters (total size, line size, associativity). With, say, 50 configs per level, the two levels together give 50 x 50 = 2500 configs.]
9
Two-Level Cache Tuning Goal
  • Develop fast, good-quality heuristic for tuning
    two-level caches to embedded applications for
    reduced energy consumption
  • Presently focused on separate I and D caches in
    both levels

[Figure: Microprocessor connected to Level 1 Caches (I-cache, D-cache), Level 2 Caches (I-cache, D-cache), and Main Memory]
10
Configurable Cache Architecture
  • Our target configurable cache architecture is
    based on Zhang/Vahid/Najjar's "Highly-Configurable
    Cache Architecture for Embedded Systems," ISCA
    2003

[Figure: Base Level One Cache, drawn as four 2KB banks]
8 KB cache consisting of four 2KB banks that can
operate as 4 ways
11
Configuration Space
  • Cache parameters
  • Size: L1 cache 2, 4, and 8 KBytes; L2 cache
    16, 32, and 64 KBytes
  • Line size (L1 or L2): 16, 32, and 64 bytes
  • 16-byte physical base line size
  • Associativity (L1 or L2): direct-mapped, 2-way,
    and 4-way
  • 432 possible configurations
  • For two levels, with separate I and D
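
The slides do not show how the figure of 432 is derived. The sketch below is one possible reading that reproduces it, under two assumptions that are not stated in the slides: (1) because of the bank arrangement, the smallest size in each level is direct-mapped only, the middle size allows up to 2-way, and the largest allows up to 4-way; (2) the L2 line size is never smaller than the L1 line size. The function names are hypothetical.

```python
from itertools import product

LINE_SIZES = (16, 32, 64)  # bytes, same choices for L1 and L2


def size_assoc_options(sizes_kb):
    """Pair each cache size with the associativities assumed to be legal.

    Assumption: sizes use 1, 2, and 4 banks respectively, so the maximum
    number of ways is 1, 2, and 4.
    """
    max_ways = (1, 2, 4)
    return [(size, ways)
            for size, limit in zip(sizes_kb, max_ways)
            for ways in (1, 2, 4) if ways <= limit]


def hierarchy_configs():
    """Enumerate configurations of one L1+L2 hierarchy (I or D)."""
    l1 = size_assoc_options((2, 4, 8))       # KBytes
    l2 = size_assoc_options((16, 32, 64))    # KBytes
    # Assumption: L2 line size >= L1 line size.
    lines = [(a, b) for a, b in product(LINE_SIZES, repeat=2) if a <= b]
    return list(product(l1, l2, lines))


per_hierarchy = len(hierarchy_configs())  # one instruction or data hierarchy
total = 2 * per_hierarchy                 # I and D tuned independently
print(per_hierarchy, total)               # prints "216 432"
```

Under these assumptions the instruction and data hierarchies are counted additively (216 + 216) rather than multiplied, since each can be tuned on its own.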

12
Experimental Environment
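[Figure: experimental flow. MediaBench and EEMBC benchmarks are run on SimpleScalar, which supplies hit and miss ratios for each configuration to the cache exploration heuristic, which outputs the chosen cache configuration.]
Energy model inputs: cache energy from Cacti; main memory energy from a Samsung memory; CPU stall energy from a 0.18 micron MIPS microprocessor.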
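
To make the role of the three energy sources concrete, here is a minimal sketch of how hit and miss counts could be combined with per-access energies. It is a simplified model written for illustration, not the model used in the presentation: the class, the function, and every numeric value are hypothetical placeholders for what Cacti, the memory data sheet, and the processor model would supply.

```python
from dataclasses import dataclass


@dataclass
class LevelCounts:
    accesses: int          # number of accesses seen by this cache level
    misses: int            # number of those accesses that missed
    nj_per_access: float   # hypothetical per-access energy (e.g. from Cacti)


def total_energy_nj(l1, l2, mem_nj_per_access, stall_cycles, stall_nj_per_cycle):
    """Sum cache, main-memory, and CPU-stall energy in nanojoules."""
    cache_energy = (l1.accesses * l1.nj_per_access
                    + l2.accesses * l2.nj_per_access)
    memory_energy = l2.misses * mem_nj_per_access   # L2 misses go off-chip
    stall_energy = stall_cycles * stall_nj_per_cycle
    return cache_energy + memory_energy + stall_energy


# Made-up numbers, chosen only to exercise the function.
l1 = LevelCounts(accesses=1000, misses=100, nj_per_access=0.5)
l2 = LevelCounts(accesses=l1.misses, misses=10, nj_per_access=2.0)
stall_cycles = l1.misses * 10 + l2.misses * 100     # assumed miss penalties
energy = total_energy_nj(l1, l2, 50.0, stall_cycles, 0.25)
print(energy)
```

A heuristic would call such a function once per simulated configuration and compare the results.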
13
First Heuristic: Tune Levels One-at-a-Time
  • Tune L1, then L2
  • Initial L2: 64 KByte, 4-way, 64-byte line size
  • For best L1 found, tune L2 cache
  • Tuned each cache using Zhang's heuristic for
    one-level cache tuning (RSP03)

[Figure: L1 Cache, L2 Cache, Main Memory]
14
First Heuristic: Tune Levels One-at-a-Time
  • Zhang's heuristic: search parameters in order of
    importance (RSP03)
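
The transcript does not preserve the parameter order or the stopping rule. As a minimal sketch, assume the order is size, then line size, then associativity, and that each parameter is increased from its smallest value until energy stops improving. The function names and the toy energy function are hypothetical, and the bank restrictions on associativity are ignored for brevity.

```python
def tune_one_level(energy_of, options, order=("size", "line", "assoc")):
    """Greedy one-parameter-at-a-time search over a single cache.

    energy_of: callable mapping a config dict to an energy value.
    options:   dict of parameter name -> values in increasing order.
    Returns (best config, its energy, number of configs evaluated).
    """
    config = {name: options[name][0] for name in order}  # start smallest
    best = energy_of(config)
    evaluated = 1
    for name in order:
        for value in options[name][1:]:
            trial = dict(config)
            trial[name] = value
            energy = energy_of(trial)
            evaluated += 1
            if energy >= best:
                break            # no improvement: freeze this parameter
            config, best = trial, energy
    return config, best, evaluated


def toy_energy(c):
    """Made-up energy surface with its minimum at 4 KB, 32 B, direct-mapped."""
    return abs(c["size"] - 4) * 10 + abs(c["line"] - 32) + c["assoc"] * 5


options = {"size": (2, 4, 8), "line": (16, 32, 64), "assoc": (1, 2, 4)}
config, energy, evaluated = tune_one_level(toy_energy, options)
print(config, energy, evaluated)
```

On this toy surface the search evaluates 6 of the 27 unconstrained combinations, which is the kind of pruning that makes such a heuristic attractive when every evaluation is a simulation.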

15
Results of First Heuristic
  • Base cache configuration
  • Level 1 - 8 KByte, 4-way, 32 byte line
  • Level 2 - 64 KByte, 4-way, 64 byte line

16
First Heuristic
  • Did not find optimal in most cases
  • Sometimes 200% or 300% worse
  • The two levels should not be explored separately
  • Too much interdependence among L1 and L2 cache
    parameters
  • E.g., high L1 associativity decreases misses and
    thus reduces need for large L2
  • Dozens of other such interdependencies

17
Improved Heuristic: Basic Interlacing
  • To more fully explore the dependencies between
    the two levels, we interlaced the exploration of
    the level one and level two caches

L1 Cache
L2 Cache
Basic interlacing performed better than the
initial heuristic, but there was still much room
for improvement.
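
The figure showing the interlaced order is lost in this transcript. The sketch below assumes one plausible order (L1 size, L2 size, L1 line size, L2 line size, L1 associativity, L2 associativity) and assumes that every parameter starts at its smallest value. The names and the toy energy function, which couples the two levels through an L1 miss count, are hypothetical.

```python
INTERLACED_ORDER = ("l1_size", "l2_size", "l1_line",
                    "l2_line", "l1_assoc", "l2_assoc")


def tune_interlaced(energy_of, options, order=INTERLACED_ORDER):
    """Greedy search that alternates between L1 and L2 parameters."""
    config = {name: options[name][0] for name in order}
    best = energy_of(config)
    evaluated = 1
    for name in order:
        for value in options[name][1:]:
            trial = dict(config)
            trial[name] = value
            energy = energy_of(trial)
            evaluated += 1
            if energy >= best:
                break            # stop growing this parameter
            config, best = trial, energy
    return config, best, evaluated


def toy_energy(c):
    """Made-up model: L1 misses shrink with L1 size and associativity,
    and each L1 miss costs more when the L2 is small."""
    l1_misses = 64 // (c["l1_size"] * c["l1_assoc"])
    return (c["l1_size"] + c["l2_size"] // 8
            + l1_misses * (64 // c["l2_size"])
            + abs(c["l1_line"] - 32) // 16
            + abs(c["l2_line"] - 64) // 16
            + c["l1_assoc"] + c["l2_assoc"])


options = {
    "l1_size": (2, 4, 8), "l2_size": (16, 32, 64),
    "l1_line": (16, 32, 64), "l2_line": (16, 32, 64),
    "l1_assoc": (1, 2, 4), "l2_assoc": (1, 2, 4),
}
config, energy, evaluated = tune_interlaced(toy_energy, options)
print(config, energy, evaluated)
```

Because the L2 size is chosen before the L1 line size and associativity are fixed, later L1 decisions are made with a realistic L2 in place, which is the interdependence the interlacing is meant to capture.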
20
Final Heuristic: Interlaced with Local Search
  • Performed well, but some cases sub-optimal
  • Manually examined those cases
  • Determined small local search needed
  • Final heuristic called TCaT - The Two Level
    Cache Tuner

Because of the bank arrangements, if a 16KB cache
is determined to be the best size, the only
associativity option is direct-mapped.
However, the application may require the
increased associativity. During the associativity
search step, the cache size is allowed to
increase so that larger associativities may be
explored.
21
TCaT Results: Energy
  • Energy consumption (normalized to the base cache
    configuration)
  • 53% energy savings in cache/memory access
    sub-system vs. base cache

22
TCaT Results: Performance
  • Execution time for the TCaT cache configuration
    and the optimal cache configuration (normalized
    to the execution time of the benchmark running
    with the base cache configuration)
  • TCaT finds near-optimal configuration, nearly 30%
    improvement over base cache

23
TCaT Exploration Time Improvements
  • Searches only 28 of 432 possible configurations
  • 6% of space
  • Simulation-based approach
  • 500 MHz Sparc
  • 50 hrs vs. 3 hrs
  • Hardware-based approach
  • 434 sec vs. 28 sec
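
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch assumes that in each "X vs. Y" pair the first number is exhaustive search and the second is TCaT; the variable names are hypothetical.

```python
searched, total = 28, 432

fraction_searched = searched / total        # share of the space visited
fraction_pruned = 1 - fraction_searched     # share never simulated

sim_speedup = 50 / 3      # hours: assumed exhaustive vs. TCaT, simulation
hw_speedup = 434 / 28     # seconds: assumed exhaustive vs. TCaT, hardware

print(f"searched {fraction_searched:.1%}, pruned {fraction_pruned:.1%}")
print(f"simulation speedup ~{sim_speedup:.1f}x, hardware speedup {hw_speedup:.1f}x")
```

Both speedups come out close to 432 / 28, which is consistent with exploration time scaling with the number of configurations evaluated.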

24
TCaT in Presence of Hw/Sw Partitioning
  • Hardware/software partitioning may become common
    in SOC platforms
  • On-chip FPGA
  • Program kernels moved to FPGA
  • Greatly reduces temporal and spatial locality of
    program
  • Does TCaT still work well on programs with very
    low locality?

25
TCaT With Hardware/Software Partitioning
  • Energy consumption (normalized to the base cache
    configuration)
  • 55% energy savings in cache/memory access
    sub-system vs. base cache

26
Conclusions
  • TCaT is an effective heuristic for two-level
    cache tuning
  • Prunes 94% of search space for a given two-level
    configurable cache architecture
  • Near-optimal performance results, 30% improvement
    vs. base cache
  • Near-optimal energy results, 53% improvement vs.
    base cache
  • Robust in presence of hw/sw partitioning
  • Future work
  • More cache parameters, unified 2L cache
  • Even larger search space
  • Dynamic in-system tuning
  • Must avoid cache flushes