Title: Dynamic Cache Clustering for Chip Multiprocessors
1. Dynamic Cache Clustering for Chip Multiprocessors
- Mohammad Hammoud, Sangyeun Cho, and Rami Melhem
Dept. of Computer Science, University of Pittsburgh
2. Tiled CMP Architectures
- Tiled CMP architectures have recently been advocated as a scalable design.
- They replicate identical building blocks (tiles) and connect them with a switched network-on-chip (NoC).
- A tile typically incorporates a private L1 cache and an L2 cache bank.
- Two traditional practices of CMP caches:
  - One bank to one core assignment (Private Scheme).
  - One bank to all cores assignment (Shared Scheme).
3. Private and Shared Schemes
- Private Scheme
  - A core maps and locates a cache block, B, to and from its local L2 bank.
  - Coherence maintenance is required at both the L1 and the L2 levels.
  - Data is read very fast, but the cache miss rate might be high.
- Shared Scheme
  - A core maps and locates a cache block, B, to and from a target tile, selected by the home select (HS) bits of B's physical address and referred to as the static home tile (SHT) of B.
  - Coherence is required only at the L1 level.
  - The cache miss rate is low, but data reads are slow (NUCA design).
[Figure: B's physical address, highlighting the HS (home select) bits.]
4. The Degree of Sharing
- The sharing degree (SD), or the number of cores that share a given pool of cache banks, can be set anywhere between the shared and the private designs.
[Figure: possible banks-to-cores assignments: 1-1 (Private Design), 2-2, 4-4, 8-8, and 16-16 (Shared Design).]
5. Static Designs' Principal Deficiency
- The aforementioned static designs are subject to a principal deficiency:
  - In reality, computer applications exhibit different cache demands.
  - A single application may demonstrate different phases corresponding to distinct code regions invoked during its execution.
  - Program phases can be characterized by different L2 cache miss rates and durations.
- They all entail a static partitioning of the available cache capacity and don't tolerate the variability among working sets and among the phases of a working set.
6. Our Work
- Dynamically monitor the behaviors of the programs running on different CMP cores.
- Adapt to each program's cache demand by offering fine-grained banks-to-cores assignments (a technique we refer to as cache clustering).
- Introduce novel mapping and location strategies to manage dynamic cache designs in tiled CMPs.
[Figure: example cache clusters of different dimensions (CD = cluster dimension).]
7. Talk Roadmap
- The proposed dynamic cache clustering (DCC) scheme.
- Performance metrics.
- DCC algorithm.
- DCC mapping strategy.
- DCC location strategy.
- Quantitative evaluation.
- Concluding remarks.
8. The Proposed Scheme
- We denote the L2 cache banks that can be assigned to a specific core, i, as i's cache cluster.
- We further denote the number of banks that the cache cluster of core i consists of as the cache cluster dimension of core i (CDi).
- We propose a dynamic cache clustering (DCC) scheme where:
  - Each core initially starts up with a specific cache cluster.
  - After every time period T (a potential re-clustering point), the cache cluster of a core is dynamically contracted, expanded, or kept intact, depending on the cache demand experienced by that core.
9. Performance Metrics
- The basic trade-offs of varying the dimension of a cache cluster are the average L2 access latency and the L2 miss rate:
  - Average L2 access latency (AAL) increases strictly with the cluster dimension.
  - L2 miss rate (MR) is inversely proportional to the cluster dimension.
- Improving either AAL or MR alone doesn't necessarily correlate with an improvement in overall system performance.
- Improving the average memory access time (AMAT), which combines the two, typically translates to better system performance:
  AMAT = L1 hit time + L1 miss rate x (AAL + MR x memory latency)
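To make the trade-off concrete, here is a back-of-the-envelope sketch in C. It assumes the standard textbook AMAT decomposition (the slide's own formulas did not survive extraction) and uses purely illustrative miss rates; none of these numbers are results from the paper.

    #include <stdio.h>

    /* Assumed standard decomposition:
     *   AMAT = L1_hit + L1_miss_rate * (AAL + MR * mem_latency)
     * A larger cluster raises AAL (more hops to the home tile) but
     * lowers MR (more aggregate capacity). Values are illustrative. */
    static double amat(double aal, double mr) {
        const double l1_hit = 1.0;        /* cycles */
        const double l1_miss_rate = 0.05; /* illustrative */
        const double mem_latency = 300.0; /* cycles */
        return l1_hit + l1_miss_rate * (aal + mr * mem_latency);
    }

    int main(void) {
        printf("small cluster: AMAT = %.2f cycles\n", amat(12.0, 0.30));
        printf("large cluster: AMAT = %.2f cycles\n", amat(24.0, 0.10));
        return 0;
    }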
10. DCC Algorithm
- The AMAT metric can be utilized to judiciously gauge the benefit of varying the cache cluster dimension of a certain core i.
- At every potential re-clustering point:
  - The AMAT currently experienced by a process P running on core i is evaluated and stored (AMATi,current).
  - AMATi,current is subtracted from the previously stored value (AMATi,previous).
  - Assume a contraction action was taken previously:
    - A positive difference indicates that AMATi has increased; hence, we retract the action and expand P's cluster.
    - A negative difference indicates that AMATi has decreased; hence, we contract P's cluster a step further, predicting more benefit.
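A minimal C sketch of this feedback rule, generalized symmetrically to the case where the previous action was an expansion. The threshold parameters T, Tl, and Tg mentioned later in the sensitivity study are omitted, and all names are hypothetical, not the authors' implementation.

    /* Per-core re-clustering decision, evaluated every period T. */
    typedef enum { CONTRACT, EXPAND } action_t;

    static int apply(int cd, action_t a) {
        /* CD steps through 1, 2, 4, 8, 16 on a 16-tile CMP. */
        if (a == EXPAND) return cd < 16 ? cd * 2 : 16;
        return cd > 1 ? cd / 2 : 1;  /* CONTRACT */
    }

    int next_cd(double amat_prev, double amat_cur, action_t *last, int cd) {
        if (amat_cur > amat_prev)  /* AMAT got worse: retract last action */
            *last = (*last == CONTRACT) ? EXPAND : CONTRACT;
        /* else AMAT improved: repeat the last action a step further */
        return apply(cd, *last);
    }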
11. DCC Mapping Strategy
- Varying the cache cluster dimension (CD) of each core over time requires a function that maps cache blocks to cache clusters exactly as required.
- Assume that a core i requests a cache block B. If CDi < 16 (for instance), B is mapped to a dynamic home tile (DHT) different from the static home tile (SHT) of B.
- The DHT of B depends on CDi. With CDi smaller than 16, only a subset of the bits from the HS field of B's physical address needs to be utilized to determine B's DHT (e.g., 3 bits from HS are used if CDi = 8).
- We developed the following generic function to determine the DHT of block B (ID is the binary representation of core i, and MB is a mask of CD-dependent masking bits):
  DHT = (HS & MB) | (ID & ~MB)
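A one-line rendering of this function in C for a 16-tile CMP (4-bit tile IDs); the name dcc_dht is ours, not the paper's.

    /* DHT of block B: take the HS bits selected by the mask MB and
     * the requesting core's ID bits where MB is zero. */
    unsigned dcc_dht(unsigned hs, unsigned id, unsigned mb) {
        return (hs & mb) | (id & ~mb & 0xFu);  /* 4-bit tile IDs */
    }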
12. DCC Mapping Strategy: A Working Example
- Assume core 5 (ID = 0101) requests cache block B with HS = 1111:
  - CD = 16, MB = 1111: DHT = (1111 & 1111) | (0101 & 0000) = 1111 (tile 15)
  - CD = 8, MB = 0111: DHT = (1111 & 0111) | (0101 & 1000) = 0111 (tile 7)
  - CD = 4, MB = 0101: DHT = (1111 & 0101) | (0101 & 1010) = 0101 (tile 5)
  - CD = 2, MB = 0001: DHT = (1111 & 0001) | (0101 & 1110) = 0101 (tile 5)
  - CD = 1, MB = 0000: DHT = (1111 & 0000) | (0101 & 1111) = 0101 (tile 5)
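Plugging the slide's values into the sketch above reproduces every line of the example; the CD-to-MB pairing is read directly off the slide.

    #include <stdio.h>

    unsigned dcc_dht(unsigned hs, unsigned id, unsigned mb) {
        return (hs & mb) | (id & ~mb & 0xFu);
    }

    int main(void) {
        unsigned hs = 0xF, id = 0x5;                 /* HS = 1111, core 5 */
        int cd[] = { 16, 8, 4, 2, 1 };
        unsigned mb[] = { 0xF, 0x7, 0x5, 0x1, 0x0 }; /* masks per CD */
        for (int k = 0; k < 5; k++)                  /* prints 15, 7, 5, 5, 5 */
            printf("CD = %2d: DHT = tile %u\n", cd[k], dcc_dht(hs, id, mb[k]));
        return 0;
    }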
13. DCC Location Strategy
- The generic mapping function we defined can't be used straightforwardly to locate cache blocks.
- Assume cache block B with HS = 1111 is requested by core 0 (ID = 0000) while CD0 = 8:
  DHT = (1111 & 0111) | (0000 & 1000) = 0111, so B is cached at tile 7.
- Assume the cache cluster of core 0 is afterward contracted to CD0 = 4 and B is requested again by core 0:
  DHT = (1111 & 0101) | (0000 & 1010) = 0101, so the lookup now targets tile 5 and misses, although B still resides at tile 7.
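The same sketch function from slide 11 reproduces this mismatch (hypothetical names; values taken from this slide).

    #include <stdio.h>

    unsigned dcc_dht(unsigned hs, unsigned id, unsigned mb) {
        return (hs & mb) | (id & ~mb & 0xFu);
    }

    int main(void) {
        unsigned hs = 0xF, id = 0x0;  /* block B, core 0 */
        printf("CD=8 mapping: tile %u\n", dcc_dht(hs, id, 0x7)); /* 7 */
        printf("CD=4 lookup:  tile %u\n", dcc_dht(hs, id, 0x5)); /* 5: miss */
        return 0;
    }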
14. DCC Location Strategy
- Solution 1: re-copy all blocks upon a re-clustering action. (Very expensive.)
- Solution 2: after missing at B's DHT (tile 5), access B's SHT (tile 15) to locate B at tile 7. (Slow: inter-tile communication between tiles 0, 5, 15, 7, and lastly 0.)
- Solution 3: send the L2 request directly to B's SHT instead of sending it first to B's DHT and then possibly to B's SHT. (Still slow: inter-tile communication between tiles 0, 15, 7, and lastly 0.)
15. DCC Location Strategy
- Solution 4: send simultaneous requests to only the tiles that are potential DHTs of B.
- The potential DHTs of B can be easily determined by varying MB (and its complement) in the DCC mapping function over the range of CDs: 1, 2, 4, 8, and 16.
- Messages per request:
  - Upper bound: log2(number of tiles) + 1 (i.e., 5 for 16 tiles)
  - Lower bound: 1
  - Average: 1 + (1/2) log2(n) (i.e., 3 messages per request for 16 tiles)
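A sketch of the multicast-set computation, reusing the mapping function above. The per-CD masks follow the slide-12 example, and duplicate tiles are collapsed, which is how the message count can fall below the upper bound (down to 1 when every candidate coincides).

    #include <stdio.h>

    unsigned dcc_dht(unsigned hs, unsigned id, unsigned mb) {
        return (hs & mb) | (id & ~mb & 0xFu);
    }

    int main(void) {
        unsigned hs = 0xF, id = 0x0;                 /* block B, core 0 */
        unsigned mb[] = { 0x0, 0x1, 0x5, 0x7, 0xF }; /* CD = 1, 2, 4, 8, 16 */
        int seen[16] = { 0 }, msgs = 0;
        for (int k = 0; k < 5; k++) {
            unsigned t = dcc_dht(hs, id, mb[k]);
            if (!seen[t]) { seen[t] = 1; msgs++; }   /* distinct DHTs only */
            printf("CD = %2d -> potential DHT = tile %u\n", 1 << k, t);
        }
        printf("messages per request: %d (upper bound: log2(16)+1 = 5)\n", msgs);
        return 0;
    }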
16. Quantitative Evaluation: Methodology
- System parameters:
  - We simulate a 16-tile CMP.
  - Simulator: Simics 3.0.29 (Solaris OS).
  - Cache line size: 64 bytes.
  - L1 I/D caches: 16KB, 2-way, 1-cycle latency.
  - L2 cache: 512KB per bank, 16-way, 12-cycle latency.
  - Latency per hop: 3 cycles.
  - Memory latency: 300 cycles.
  - L1 and L2 replacement policy: LRU.
- Benchmarks: SPECjbb, OCEAN, BARNES, LU, RADIX, FFT, MIX1 (16 copies of HMMER), MIX2 (16 copies of SPHINX), MIX3 (BARNES, LU, MILC, MCF, BZIP2, and HMMER; 2 threads/copies each).
17. Comparing With Static Schemes
We first compare the average L1 miss time (AMT) across FS1, FS2, FS4, FS8, FS16 (static schemes with sharing degrees 1 to 16), and DCC.
- DCC outperforms FS16, FS8, FS4, FS2, and FS1 by averages of 6.5%, 8.6%, 10.1%, 10%, and 4.5%, respectively, and by as much as 21.3%.
18. Comparing With Static Schemes
We next study the L2 miss rate across FS1, FS2, FS4, FS8, FS16, and DCC.
- No single static scheme provides the best miss rate for all the benchmarks.
- DCC always provides miss rates comparable to those of the best static alternative.
19. Comparing With Static Schemes
We then study the execution time across FS1, FS2, FS4, FS8, FS16, and DCC.
- The superiority of DCC in AMT translates to better overall performance.
- DCC always provides performance comparable to that of the best static alternative.
20. Sensitivity Study
We also study the sensitivity of DCC to different values of T, Tl, and Tg.
- DCC does not depend much on the values of the parameters T, Tl, and Tg.
- Overall, DCC performs a little better with T = 100K than with T = 300K.
21. Comparing With Cooperative Caching
Finally, we compare DCC against the cooperative caching (CC) scheme. CC is based on FS1 (the private scheme).
- DCC outperforms CC by an average of 1.59%.
- The basic problem with CC is that it spills blocks without knowing whether spilling helps or hurts cache performance (a problem addressed recently in HPCA'09).
22. Concluding Remarks
- This paper proposes DCC, a distributed cache management scheme for large-scale chip multiprocessors.
- Contrary to static designs, DCC adapts to the irregularities of working sets.
- We propose generic mapping and location strategies that can be utilized for both static designs (with different sharing degrees) and dynamic designs in tiled CMPs.
- The proposed DCC location strategy can be improved (in regard to reducing the number of messages per request) by maintaining a small history of a cluster's expansions and contractions, as sketched after this list.
  - For instance, with an activity chain of 16-8-4, we can predict that a requested block cannot reside at a DHT corresponding to CD = 1 or 2, and is more likely to reside at the DHTs corresponding to CD = 4 and 8 than at the DHT corresponding to CD = 16.
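A sketch of that history-based pruning, again reusing the mapping function; the activity chain and its interpretation come from the slide, while the lookup structure is our assumption.

    #include <stdio.h>

    unsigned dcc_dht(unsigned hs, unsigned id, unsigned mb) {
        return (hs & mb) | (id & ~mb & 0xFu);
    }

    int main(void) {
        /* Activity chain 16-8-4: the cluster never ran at CD = 1 or 2
         * while B could have been cached, so those DHTs are skipped. */
        int chain[] = { 16, 8, 4 };
        unsigned mb_of[17] = { 0 };
        mb_of[1] = 0x0; mb_of[2] = 0x1; mb_of[4] = 0x5;
        mb_of[8] = 0x7; mb_of[16] = 0xF;
        unsigned hs = 0xF, id = 0x0;  /* block B, core 0 */
        for (int k = 0; k < 3; k++)   /* 3 messages instead of 5 */
            printf("CD = %2d -> candidate DHT = tile %u\n",
                   chain[k], dcc_dht(hs, id, mb_of[chain[k]]));
        return 0;
    }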
23. Dynamic Cache Clustering for Chip Multiprocessors
Thank you!
- M. Hammoud, S. Cho, and R. Melhem
Dept. of Computer Science, University of Pittsburgh