Cache Miss Clustering for Banked Memory Systems

1
Cache Miss Clustering for Banked Memory Systems
  • O. Ozturk, G. Chen, G. Chen, M. Kandemir
    Pennsylvania State University
  • M. Karakoy
    Imperial College

2
Agenda
  • Introduction
  • Related Work
  • Architectural Model
  • Example
  • Compiler Approach
  • Experimental Evaluation
  • Conclusion

3
Introduction
  • Embedded applications spend a considerable fraction
    of execution cycles in memory accesses
  • A significant portion of overall execution time
  • A large fraction of overall energy
  • Research efforts on optimizing memory accesses
  • Latency, energy, and latency-energy perspectives
  • Proposed techniques span
  • Circuit, architectural, software (application,
    OS, compiler)
  • Low-power operating modes: an energy reduction
    technique in DRAMs
  • Implemented by shutting off certain components
  • Memory banks that are not used can be put into a
    low-power mode
  • Prior efforts using low-power modes: reactive and
    proactive

4
Introduction
  • Reactive Techniques
  • A special hardware circuit
  • Monitor memory accesses
  • If a bank is idle for a certain number of cycles →
    low-power mode
  • Proactive Techniques
  • Software (usually compiler) analyzes the behavior
    of the application
  • Explicit instructions to change power modes
  • High level information about the behavior of the
    application
  • Needs the source code
  • Low-power modes studied in the literature
  • Do not consider dynamic data cache behavior at
    runtime
  • Our goal: compiler-directed code restructuring
    scheme
  • Employ cache miss clustering
  • Increase bank idleness → Better use of low-power
    modes
  • Cache miss clustering
  • Cluster data cache misses → Cluster memory idle
    cycles
  • Longer idle periods for banks → Higher energy
    savings

5
Related Work
  • Banked memories
  • Lebeck et al.: OS-based strategies
  • Page allocation in a banked memory
  • Delaluz et al.: Architectural and compiler
    techniques (bank pre-activation)
  • Fan et al.: Memory controller policies
  • Cache miss clustering based studies
  • Pai and Adve: Cache miss clustering as a means
    of improving memory parallelism in
    high-performance computing
  • Compare cache miss clustering against software
    prefetching
  • Based on loop strip-mining
  • Clustering memory accesses
  • Sezer et al.: Cluster accesses to a small set of banks
    at a given time
  • Improve bank locality
  • Cache independent -- does not take cache behavior
    explicitly into account
  • Our approach
  • Reducing energy consumption rather than improving
    performance
  • Work at the data cache level

6
Introduction
[Figure: flow diagram. Input Loop Nest → Clustering Memory Bank Accesses → Bank Invariant Loop Nests → Clustering Data Cache Misses → Miss Clustered Loop Nests]

7
Architectural Model
  • A multi-bank memory system, similar to RDRAM
  • Banks can be in different low-power operating
    modes
  • Active, standby, nap, and power-down
  • Memory accesses (read/write) only in active mode

8
Architectural Model
  • Each low-power mode has
  • Different energy consumption (per cycle)
  • Resynchronization cost (wake up penalty)

Mode         Energy per cycle   Resynchronization cost
Active       3.570 nJ           none (reads / writes are served in this mode)
Standby      0.830 nJ           2 cycles
Nap          0.320 nJ           30 cycles
Power-Down   0.005 nJ           9000 cycles

9
Architectural Model
  • A more power-saving mode is selected if the
    idleness period (bank inter-access time) is
    sufficiently long
  • Transitions between low-power and active modes
  • Incur performance penalties
  • Cannot be frequently used
  • Frequent use may result in non-tolerable
    performance penalties

10
Architectural Model
  • BIT: bank inter-access time
  • The larger the BIT, the more aggressive the
    low-power mode that can be exploited
  • Select the most suitable low-power mode
  • By software using the compiler or OS support
  • By hardware using a prediction mechanism attached
    to the memory controller
  • By a combination of both
  • We employ a hardware-based BIT prediction
    mechanism
  • Similar to the mechanisms used in current memory
    controllers
  • After 10 cycles of idleness → Standby mode
  • After another 100 cycles → Nap mode
  • After 1,000,000 cycles → Power-down mode
  • When referenced → back into the active mode
  • Reactivation overhead (based on what mode it was
    in)
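
The threshold policy above can be sketched as a small model. This is only an illustration under stated assumptions: it uses the per-cycle energies from the architectural model slides, reads the thresholds as cumulative (standby at 10 idle cycles, nap at 110, power-down at 1,000,000), and ignores reactivation cost. The names `mode_after` and `idle_energy_nj` are hypothetical.

```python
# Per-cycle energy (nJ) of each operating mode, from the architectural model.
ENERGY_NJ = {"active": 3.570, "standby": 0.830, "nap": 0.320, "power-down": 0.005}

# Idle-cycle count at which a bank enters each mode (assumed to be cumulative).
THRESHOLDS = [(0, "active"), (10, "standby"), (110, "nap"), (1_000_000, "power-down")]


def mode_after(idle_cycles):
    """Return the mode a bank is in after `idle_cycles` cycles without an access."""
    current = "active"
    for start, mode in THRESHOLDS:
        if idle_cycles >= start:
            current = mode
    return current


def idle_energy_nj(idle_cycles):
    """Energy spent by one bank over an idle period, excluding reactivation cost."""
    total = 0.0
    for i, (start, mode) in enumerate(THRESHOLDS):
        # Each mode lasts until the next threshold or the end of the idle period.
        end = THRESHOLDS[i + 1][0] if i + 1 < len(THRESHOLDS) else idle_cycles
        cycles = max(0, min(idle_cycles, end) - start)
        total += cycles * ENERGY_NJ[mode]
    return total
```

Under these assumptions, one 200-cycle idle period costs about 147.5 nJ, while two separate 100-cycle periods cost about 220.8 nJ, which is the effect that longer idle periods are meant to achieve.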

11
Compiler Approach
  • Program
  • Loop nest
  • Arrays accessed by Ni
  • Arrays accessed by P

12
Compiler Approach
  • Iteration vector
  • Bounds
  • Loop step vector

13
Clustering Memory Bank Accesses
  • Split a given loop nest N into a set of
    bank-invariant loop nests N1, N2, ..., Nm
  • N may not be bank-invariant
  • Iteration space of each Ni, given by (Li, Ui), is a
    subset of that of N, given by (L, U)
  • Executing these bank-invariant loop nests is
    equivalent to executing the original loop
    nest
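
A minimal sketch of the splitting step for the simplest case, not the authors' algorithm: a single loop that accesses A[i] for i from L to U, with the array laid out contiguously across banks. It assumes that "bank-invariant" means every iteration of a sub-loop touches the same bank. The function name, `bank_size`, and `base` are hypothetical.

```python
def split_bank_invariant(lower, upper, bank_size, base=0):
    """Split iterations lower..upper (inclusive) of a loop that accesses A[i]
    into sub-ranges (Li, Ui) whose accesses each stay within one bank.

    `base` is the element offset of A[0] in memory (hypothetical layout).
    """
    ranges = []
    start = lower
    while start <= upper:
        # Last iteration that still falls in the same bank as `start`.
        bank = (base + start) // bank_size
        last_in_bank = (bank + 1) * bank_size - 1 - base
        end = min(upper, last_in_bank)
        ranges.append((start, end))
        start = end + 1
    return ranges
```

Running the resulting sub-ranges one after another visits exactly the original iterations in the original order, so the split preserves the semantics of the original loop.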

14
Clustering Memory Bank Accesses
15
Example
  • One bank
  • Th: cache hit latency
  • Idle period: 2Th

16
Example
  • Cluster 2 cache misses together
  • Idle period: 4Th

17
Example
  • A memory bank can hold 5000 array elements
  • Cache line = one array element
  • Cache miss latency: Tm cycles
  • A → Bi and Bi+1
  • Execution time / iteration for loop nest N: T
    cycles
  • Loop nest N incurs a data cache miss every T + Tm
    cycles
  • Bank idleness period is T cycles

18
Example
  • Apply clustering to loop nest N
  • Invariant loop nests N1 and N2
  • Clustering factor: c = 2
  • N1 and N2
  • Cache misses every 2(T + Tm) cycles
  • Bank idleness period is 2T cycles
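
The arithmetic of this example can be written as a small helper. The slides state the c = 1 and c = 2 cases only; extending the same linear relation to other values of c is an assumption made here, and the function name and the sample values of T and Tm are hypothetical.

```python
def clustered_timing(t_iter, t_miss, c=1):
    """Return (cycles between miss clusters, bank idle period) for clustering factor c.

    t_iter is the execution time per iteration (T) and t_miss the cache miss
    latency (Tm), both in cycles.
    """
    miss_interval = c * (t_iter + t_miss)
    idle_period = c * t_iter
    return miss_interval, idle_period


# Hypothetical values: T = 20 cycles, Tm = 100 cycles.
baseline = clustered_timing(20, 100)        # (120, 20)
clustered = clustered_timing(20, 100, c=2)  # (240, 40)
```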

19
Experimental Setup
  • SUIF Compiler
  • Using SimpleScalar
  • Schemes
  • Baseline: no optimization
  • LOOP: conventional loop restructuring (loop
    permutation and loop tiling)
  • CMC: our approach
  • PRI: cluster memory accesses to a small set of
    banks at a given time (bank locality)
  • Benchmarks
  • Fourier: Fourier transform
  • Flt: Digital filtering routine
  • Adi: Adi decomposition
  • Cholesky: Cholesky decomposition
  • Hydro2d: array-dominated code from the SPEC
    benchmark suite
  • Tis, Tsf: Perfect Club benchmarks

20
Memory Idle Times
  • Idle Time CDF
  • Baseline vs. CMC → Combine smaller idle times
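
For reference, an idle-time CDF of this kind can be computed from a list of measured idle-period lengths as sketched below. The function name is hypothetical, and no data from the presentation is reproduced here.

```python
def idle_time_cdf(idle_periods):
    """Return sorted (length, fraction of idle periods <= length) pairs."""
    ordered = sorted(idle_periods)
    n = len(ordered)
    cdf = {}
    for rank, length in enumerate(ordered, start=1):
        cdf[length] = rank / n  # a repeated length keeps its highest rank
    return sorted(cdf.items())
```

Combining short idle periods into longer ones shifts this curve toward larger lengths.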

21
Normalized Energy Consumption
  • [Chart: normalized energy consumption; data labels 16.1, 27.3, 25.4, 37.7]

22
Experimental Evaluation
  • Distribution of energy savings across banks
  • Savings with our scheme are uniform across the
    banks
  • Variance with the PRI scheme can be very large in
    some cases
  • [Chart panels: Cholesky, Fourier]

23
Normalized Execution Cycles
24
Sensitivity Analysis
  • Using different clustering factors
  • Varying data cache capacities (cholesky)

25
Conclusion
  • Memory behavior is one of the prime metrics in
    embedded systems
  • Performance
  • Energy consumption
  • Prior efforts that target energy reduction for
    banked memories
  • Cache oblivious
  • Savings can be unpredictable
  • Our approach
  • Cache conscious memory energy reduction scheme
  • Cache miss clustering
  • Cluster data cache misses
  • Cluster banked memory accesses
  • Increase in idle periods → better energy
    management
  • Our approach is effective in reducing
  • Energy
  • Execution cycles

26
  • Thank You !

27
Clustering Memory Bank Accesses
28
Clustering Memory Bank Accesses
  • c: cache miss clustering factor

29
Clustering Memory Bank Accesses
  • Input
  • A bank invariant loop nest
  • c: cache miss clustering factor
  • Output
  • A new loop nest
  • Data cache misses are clustered
  • Clustering factor
  • Determines the number of cache misses that we
    want to cluster together
  • A larger clustering factor increases the code
    size
  • A very large clustering factor can also increase
    the number of cache misses
  • An example where c = 3
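
The slide's own example is not preserved in this transcript. As a stand-in, the sketch below models the effect of clustering as an event trace, using strip-mining by c (the vehicle the related-work slide attributes to Pai and Adve). It is not the authors' transformation. It assumes every iteration touches a new cache line and that the computation for iteration i depends only on A[i], so the reordering is legal. Both function names are hypothetical.

```python
def original_order(n):
    """Event trace of the unclustered loop: each iteration misses, then computes."""
    trace = []
    for i in range(n):
        trace.append(("miss", i))
        trace.append(("compute", i))
    return trace


def clustered_order(n, c):
    """Event trace after strip-mining by c: c misses back to back, then c computes."""
    trace = []
    for start in range(0, n, c):
        block = range(start, min(start + c, n))
        trace.extend(("miss", i) for i in block)     # clustered cache misses
        trace.extend(("compute", i) for i in block)  # bank is idle during these
    return trace
```

With c = 3 and six iterations, the trace becomes three misses, three computations, three misses, three computations, so each idle stretch between miss clusters is three computations long instead of one.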