Title: Reconfigurable Caches and their Application to Media Processing
1. Reconfigurable Caches and their Application to Media Processing
- Parthasarathy (Partha) Ranganathan, Dept. of Electrical and Computer Engineering, Rice University, Houston, Texas
- Sarita Adve, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Urbana, Illinois
- Norman P. Jouppi, Western Research Laboratory, Compaq Computer Corporation, Palo Alto, California
2. Motivation (1 of 2)
- Different workloads on general-purpose processors
  - Scientific/engineering, databases, media processing, ...
  - Widely different characteristics
- Challenge for future general-purpose systems
  - Use most transistors effectively for all workloads
3. Motivation (2 of 2)
- Challenge for future general-purpose systems
  - Use most transistors effectively for all workloads
- 50 to 80% of processor transistors devoted to cache
  - Very effective for engineering and database workloads
  - BUT large caches often ineffective for media workloads
    - Streaming data and large working sets [ISCA 1999]
⇒ Can we reuse cache transistors for other useful work?
5. Contributions
- Reconfigurable caches
  - Flexibility to reuse cache SRAM for other activities
  - Several applications possible
  - Simple organization and design changes
  - Small impact on cache access time
- Application for media processing
  - e.g., instruction reuse: reuse memory for computation
  - 1.04X to 1.20X performance improvement
6. Outline for Talk
- Motivation
- Reconfigurable caches
  - Key idea
  - Organization
  - Implementation and timing analysis
- Application for media processing
- Summary and future work
7. Reconfigurable Caches: Key Idea
Key idea: reuse cache transistors!
- Dynamically divide SRAM into multiple partitions
- Use partitions for other useful activities
⇒ Cache SRAM useful for both conventional and media workloads
8. Reconfigurable Cache Uses
- Number of different uses for reconfigurable caches
  - Optimizations using lookup tables to store patterns
    - Instruction reuse, value prediction, address prediction, ...
  - Hardware and software prefetching
    - Caching of prefetched lines
  - Software-controlled memory
    - QoS guarantees, scratch memory area
⇒ Cache SRAM useful for both conventional and media workloads
9. Key Challenges
- How to partition SRAM?
- How to address the different partitions as they change?
- Minimize impact on cache access (clock cycle) time
⇒ Associativity-based partitioning
10. Conventional Cache Organization
11. Associativity-Based Partitioning
- Partition at the granularity of ways
- Requires multiple data paths and additional state/logic
12. Reconfigurable Cache Organization
- Associativity-based partitioning
  - Simple: small changes to conventional caches
  - But number and granularity of partitions depend on associativity
- Alternate approach: overlapped-wide-tag partitioning
  - More general, but slightly more complex
  - Details in paper
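The way-based scheme above can be sketched in software. This is a toy model under my own assumptions about the fill and repartitioning policies, not the paper's hardware design: a W-way set-associative cache is split at the granularity of ways, and each partition's lookups probe only the ways it owns.

```python
# Sketch of associativity-based cache partitioning (illustrative model only).
# A W-way set-associative cache is split at the granularity of ways:
# partition p owns a contiguous slice of the W ways, and its lookups
# probe only that slice.

class PartitionedCache:
    def __init__(self, num_sets=4, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # tags[set][way] holds the tag cached in that way (None = invalid)
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Initially one partition owning all ways: list of (first_way, n_ways)
        self.partitions = [(0, num_ways)]

    def repartition(self, way_counts):
        """Divide the ways among partitions, e.g. [2, 2] for two halves."""
        assert sum(way_counts) == self.num_ways
        first = 0
        self.partitions = []
        for n in way_counts:
            self.partitions.append((first, n))
            first += n
        # Cache scrubbing (one consistency option from the talk):
        # flush everything when the partitioning changes.
        self.tags = [[None] * self.num_ways for _ in range(self.num_sets)]

    def access(self, partition, addr):
        """Return True on hit; on a miss, fill a way owned by `partition`."""
        index = addr % self.num_sets
        tag = addr // self.num_sets
        first, n = self.partitions[partition]
        if any(self.tags[index][w] == tag for w in range(first, first + n)):
            return True
        # Naive fill policy for illustration: overwrite the first owned way.
        self.tags[index][first] = tag
        return False
```

With two equal partitions of a 4-way cache, partition 0 could serve as the conventional data cache and partition 1 as a lookup-table store, mirroring the talk's two-partition L1 configuration.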
13. Other Organizational Choices (1 of 2)
- Ensuring consistency of data at repartitioning
  - Cache scrubbing: flush data at repartitioning intervals
  - Lazy transitioning: augment state with partition information
- Addressing of partitions: software (ISA) vs. hardware
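The two consistency options can be contrasted in miniature. The sketch below shows lazy transitioning under my reading of the slide (an owner field per line, with stale lines detected on access), not the paper's exact state encoding:

```python
# Sketch of lazy transitioning (illustrative): instead of flushing the
# whole array when the partitioning changes, each line records which
# partition owned it when it was filled. A line whose recorded owner
# differs from the partition now accessing it is treated as invalid.

class LazyLine:
    def __init__(self):
        self.tag = None
        self.owner = None  # partition that filled this line

def lazy_access(line, partition, tag):
    """Hit only if the tag matches AND the line belongs to `partition`."""
    if line.tag == tag and line.owner == partition:
        return True
    line.tag = tag          # fill on miss
    line.owner = partition  # line now belongs to the accessing partition
    return False
```

Scrubbing pays the flush cost once per repartitioning; lazy transitioning spreads the cost over later accesses.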
14. Other Organizational Choices (2 of 2)
- Method of partitioning: hardware vs. software control
- Frequency of partitioning: frequent vs. infrequent
- Level of partitioning: L1, L2, or lower levels
- Tradeoffs based on application requirements
15. Outline for Talk
- Motivation
- Reconfigurable caches
  - Key idea
  - Organization
  - Implementation and timing analysis
- Application for media processing
- Summary and future work
16. Conventional Cache Implementation
[Figure: conventional cache datapath. The address feeds decoders driving the word lines and bit lines of the tag and data arrays; column muxes, sense amps, comparators, mux drivers, and output drivers produce the data and valid outputs.]
- Tag and data arrays split into multiple sub-arrays
  - To reduce/balance length of word lines and bit lines
17. Changes for Reconfigurable Cache
[Figure: the datapath of slide 16 with the address input, mux drivers, and output paths replicated per partition (labeled 1..NP).]
- Associate sub-arrays with partitions
- Constraint on minimum number of sub-arrays
- Additional multiplexors, drivers, and wiring
18. Impact on Cache Access Time
- Sub-array-based partitioning
  - Multiple simultaneous accesses to SRAM array
  - No additional data ports
- Timing analysis methodology
  - CACTI analytical timing model for cache access time (Compaq WRL)
  - Extended to model reconfigurable caches
  - Experiments varying cache sizes, partitions, technology, ...
19. Impact on Cache Access Time
- Cache access time
  - Comparable to base (within 1-4%) for few partitions (2)
  - Higher for more partitions, especially with small caches
  - But still within 6% for large caches
- Impact on clock frequency likely to be even lower
20. Outline for Talk
- Motivation
- Reconfigurable caches
- Application for media processing
  - Instruction reuse with media processing
  - Simulation results
- Summary and future work
21. Application for Media Processing
- Instruction reuse/memoization [Sodani and Sohi, ISCA 1997]
  - Exploits value redundancy in programs
  - Store instruction operands and result in reuse buffer
  - If a later instruction and its operands match in the reuse buffer,
    - skip execution
    - read answer from reuse buffer
[Figure: reconfigurable cache divided into partitions, with one partition serving as the reuse buffer.]
- Few changes needed for implementation with reconfigurable caches
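A reuse buffer behaves like a memo table keyed by the instruction's PC and operand values. The toy model below is my simplification of Sodani and Sohi's scheme, with a hypothetical `compute` callback standing in for the functional unit; it shows the match/skip behavior, not the hardware lookup:

```python
# Toy model of an instruction reuse buffer (memoization). Entries are
# keyed by (pc, operand values); a matching entry supplies the result,
# so the instruction's computation can be skipped.

class ReuseBuffer:
    def __init__(self, capacity=1024):
        self.capacity = capacity
        self.table = {}   # (pc, operands) -> result
        self.hits = 0
        self.misses = 0

    def execute(self, pc, operands, compute):
        """Run one instruction, reusing a stored result when possible."""
        key = (pc, operands)
        if key in self.table:           # match: skip execution
            self.hits += 1
            return self.table[key]
        self.misses += 1
        result = compute(*operands)     # normal execution
        if len(self.table) < self.capacity:
            self.table[key] = result    # store operands + result for reuse
        return result
```

In the talk's configuration, the table's storage would occupy one cache partition (the 64KB reuse buffer of slide 23) rather than dedicated SRAM.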
22. Simulation Methodology
- Detailed simulation using RSIM (Rice)
  - User-level execution-driven simulator
- Media processing benchmarks
  - JPEG image encoding/decoding
  - MPEG video encoding/decoding
  - GSM speech decoding and MPEG audio decoding
  - Speech recognition and synthesis
23. System Parameters
- Modern general-purpose processor with ILP/media extensions
  - 1 GHz, 8-way issue, OOO, VIS, prefetching
- Multi-level memory hierarchy
  - 128KB 4-way associative 2-cycle L1 data cache
  - 1MB 4-way associative 20-cycle L2 cache
- Simple reconfigurable cache organization
  - 2 partitions at L1 data cache
  - 64KB data cache, 64KB instruction reuse buffer
  - Partitioning at start of application, in software
24. Impact of Instruction Reuse
[Chart: normalized execution time, base = 100 for each application; with instruction reuse, 92 for JPEG decode, 89 for MPEG decode, and 84 for speech synthesis.]
- Performance improvements for all applications (1.04X to 1.20X)
- Use memory to reduce compute bottleneck
- Greater potential with aggressive designs (details in paper)
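The quoted speedups follow directly from the normalized execution times (base = 100). Pairing the bars with the benchmarks in slide order is my assumption:

```python
# Speedup = base execution time / new execution time, with times
# normalized so the base is 100. Bar values are read from the slide's
# chart; their pairing with benchmarks is assumed from slide order.
bars = {"JPEG decode": 92, "MPEG decode": 89, "Speech synthesis": 84}
speedups = {app: round(100 / t, 2) for app, t in bars.items()}
print(speedups)  # speech synthesis comes out near the quoted 1.20X
```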
25. Summary
- Goal: use cache transistors effectively for all workloads
- Reconfigurable caches: flexibility to reuse cache SRAM
  - Simple organization and design changes
  - Small impact on cache access time
  - Several applications possible
- Instruction reuse: reuse memory for computation
  - 1.04X to 1.20X performance improvement
- More aggressive reconfiguration currently under investigation
26. More Information
- http://www.ece.rice.edu/parthas
- parthas@rice.edu