Title: Designing Efficient Memory for Future Computing Systems
1 Designing Efficient Memory for Future Computing Systems
Aniruddha N. Udipi, University of Utah
Ph.D. Dissertation Defense, March 7, 2012
Advisor: Rajeev Balasubramonian
www.cs.utah.edu/udipi
2 My other computer is...
3 Scaling server farms
- Facebook: 30,000 servers, 80 billion images stored, serves 600,000 photos a second, logs 25 TB of data per day... the statistics go on
- The primary challenge to scaling: efficient supply of data to thousands of cores
- It's all about the memory!
4 Performance Trends
- Demand-side
  - Multi-socket, multi-core, multi-thread
  - Large datasets: big data analytics, scientific computation models
  - RAMCloud-like designs
  - 1 TB/s per node by 2017
- Supply-side
  - Pin count, per-pin BW, capacity
  - Severely power limited
5 Energy Trends
- Datacenters consume 2% of all power generated in the US
  - Operation + cooling
  - 100 billion kWh, $7.4 billion
- 25-40% of total power in large systems is consumed in memory
- As processors get simpler, this fraction is likely to increase
6 Cost-per-bit
- Traditionally the holy grail of DRAM design
- Operational expenditure over 3 years ≈ capital expenditure in datacenter servers
- Cost-per-bit is less important than before
[Figure: a $3.00 part drawing 13 W vs. a $0.30 part drawing 60 W]
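A hedged back-of-envelope reading of that figure: at an assumed electricity price of $0.10/kWh (an assumption, not from the slides), the cheap 60 W part costs vastly more to operate for three years than to buy:

\[ 60\,\mathrm{W} \times 24 \times 365 \times 3\ \mathrm{h} \approx 1577\ \mathrm{kWh} \;\Rightarrow\; 1577 \times \$0.10 \approx \$158 \gg \$0.30 \]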
7 Complexity Trends
- The job of the memory controller is hard
  - 18 timing parameters for DRAM! (see the sketch after this list)
  - Maintenance operations: refresh, scrub, power down, etc.
- Several DIMM and controller variants
  - Hard to provide interoperability
  - Need processor-side support for new memory features
- Now throw in heterogeneity
  - Memristors, PCM, STT-RAM, etc.
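To make the controller's bookkeeping concrete, here is a minimal Python sketch of timing-constraint checking. Only three of the ~18 JEDEC timing parameters are modeled, and the values are illustrative placeholders, not datasheet numbers.

# Minimal sketch of per-bank timing bookkeeping in a DDRx memory
# controller. Parameter values are placeholders for illustration.

T_RCD = 14  # ACTIVATE -> READ/WRITE delay, in cycles
T_RP  = 14  # PRECHARGE -> next ACTIVATE delay
T_RAS = 34  # ACTIVATE -> PRECHARGE minimum

class Bank:
    def __init__(self):
        self.last_activate  = float("-inf")
        self.last_precharge = float("-inf")

    def can_issue(self, cmd, now):
        # A command is legal only after every timing window it
        # participates in has elapsed.
        if cmd == "ACTIVATE":
            return now - self.last_precharge >= T_RP
        if cmd in ("READ", "WRITE"):
            return now - self.last_activate >= T_RCD
        if cmd == "PRECHARGE":
            return now - self.last_activate >= T_RAS
        return True

    def issue(self, cmd, now):
        assert self.can_issue(cmd, now), f"{cmd} violates timing at cycle {now}"
        if cmd == "ACTIVATE":
            self.last_activate = now
        elif cmd == "PRECHARGE":
            self.last_precharge = now

bank = Bank()
bank.issue("ACTIVATE", 0)
print(bank.can_issue("READ", 10))   # False: tRCD not yet satisfied
print(bank.can_issue("READ", 14))   # True

Multiply this by every bank, rank, and maintenance operation, and the controller's complexity burden becomes clear.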
8 Reliability Trends
- Shrinking feature sizes aren't helping
  - Nor is the scale: 64 x 10^15 DRAM cells in a typical datacenter
- DRAM errors are the #1 reason for servers at Google to enter repair
- Datacenters are the backbone of web-connected infrastructure
  - Reliability is essential
- Server downtime has a huge economic impact
  - Breached SLAs, for example
9 Thesis Statement
- Main memory systems are at an inflection point
  - Convergence of several trends
- A major overhaul is required to achieve a system that is energy-efficient, high-performance, low-complexity, reliable, and cost-effective
- Combination of two things:
  - Prudent application of novel technologies
  - Fundamental rethinking of conventional design decisions
10 Designing Future Memory Systems
[Diagram: CPU and memory controller (MC) connected to a DIMM, with callouts 1-4 marking the four contributions]
1. Memory Chip Architecture: reducing overfetch, increasing parallelism [ISCA '10]
2. Memory Interconnect: prudent use of silicon photonics, without modifying DRAM dies [ISCA '11]
3. Memory Access Protocol: streamlined slot-based interface with semi-autonomous memory [ISCA '11]
4. Memory Reliability: efficient RAID-based high-availability chipkill memory [ISCA '12]
11 PART 1: Memory Chip Organization
12 Key bottleneck
[Diagram: a cache line striped across four DRAM chips; RAS and CAS select one row per chip into its row buffer, and each chip contributes a slice of the line. One bank shown in each chip.]
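To make the waste concrete: with typical DDR3-era numbers (an assumption, not from this slide), one activation pulls an entire multi-kilobyte row into the row buffers across the rank, of which a single 64-byte line may be all that is used:

\[ \text{row-buffer utilization} = \frac{64\ \mathrm{B}}{8\ \mathrm{KB}} \approx 0.8\% \quad \text{(assuming an 8 KB row and no further row-buffer hits)} \]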
13 Why this is a problem
14-15 SSA Architecture
[Diagram: one DRAM chip on the DIMM, organized into banks and subarrays with bitlines and row buffers; the address/command bus selects a single subarray, and the entire 64-byte cache line travels over that chip's 8-bit interface, through the global interconnect to I/O, onto the data bus to the memory controller]
16 SSA Operation
[Diagram: four DRAM chips, each with eight subarrays; the address activates one subarray in one chip, which supplies the entire cache line, while the remaining subarrays and chips stay in sleep mode or serve other parallel accesses]
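A minimal sketch of the SSA idea, assuming a hypothetical geometry (8 chips, 16 subarrays per chip; not the paper's exact parameters): the whole line comes from one subarray of one chip, so consecutive lines spread across chips, and untouched chips can sleep or serve other requests.

# Sketch of SSA-style address mapping: an entire cache line is fetched
# from ONE subarray on ONE chip, instead of being striped across all
# chips in a rank. Geometry below is hypothetical, for illustration.

CHIPS_PER_DIMM     = 8
SUBARRAYS_PER_CHIP = 16

def ssa_map(line_addr):
    """Map a cache-line address to (chip, subarray, row offset)."""
    chip     = line_addr % CHIPS_PER_DIMM             # spread lines across chips
    subarray = (line_addr // CHIPS_PER_DIMM) % SUBARRAYS_PER_CHIP
    offset   = line_addr // (CHIPS_PER_DIMM * SUBARRAYS_PER_CHIP)
    return chip, subarray, offset

# Consecutive lines land on different chips, so independent requests
# proceed in parallel while the other chips stay in sleep mode.
for addr in range(4):
    print(addr, ssa_map(addr))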
17 SSA Impact
- Energy reduction
  - Dynamic: fewer bitlines activated
  - Static: smaller activation footprint, more and longer spells of inactivity, better power-down
- Latency impact
  - Limited pins per cache line: serialization latency
  - Higher bank-level parallelism: shorter queuing delays
- Area increase
  - More peripheral circuitry and I/O at finer granularities: area overhead (< 5%)
18 Key Contributions
- Up to 6X reduction in DRAM chip dynamic energy
- Up to 5X reduction in DRAM chip static energy
- Up to 50% improvement in performance in applications limited by bank contention
- All for a < 5% increase in area
19 PART 2: Memory Interconnect
20 Key Bottleneck
- Fundamental nature of electrical pins
  - Limited pin count, per-pin bandwidth, memory capacity, etc.
- Diverging growth rates of core count and pin count
- Limited by physics, not engineering!
21 Silicon Photonic Interconnects
- We need something that can break the edge-bandwidth bottleneck
- Ring-modulator-based photonics
  - Off-chip light source
  - Indirect modulation using resonant rings
  - Relatively cheap coupling on- and off-chip
- DWDM for high bandwidth density
  - As many as 67 wavelengths possible
  - Limited by free spectral range and coupling losses between rings
Source: Xu et al., Optics Express 16(6), 2008
22 The Questions We're Trying to Answer
How do we make photonics less invasive to memory die design?
What should the role of electrical signaling be?
Should we replace all interconnects with photonics? On-chip too?
What should the role of 3D be in an optically connected memory?
Should we be designing photonic DRAM dies? Stacks? Channels?
23 Design Considerations I
- Photonic interconnects
  - Large static power dissipation: ring tuning
    - Rings are designed to resonate at a specific frequency
    - Processing defects and temperature change this
    - Need to heat the rings to correct for this
  - Much lower dynamic energy consumption, relatively independent of distance (compared numerically below)
- Electrical interconnects
  - Relatively small static power dissipation
  - Large dynamic energy consumption
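A hedged back-of-envelope model of this tradeoff; every number below is an illustrative placeholder, not a measured value. The point is structural: static tuning power is paid continuously, so its per-bit share shrinks as the link gets busier, and photonics wins only at high utilization.

# Toy comparison of photonic vs. electrical link energy per bit.
# All parameter values are assumptions chosen for illustration.

def energy_per_bit(static_power_w, dynamic_pj_per_bit, utilization, peak_gbps):
    """Static power is amortized over however many bits actually move."""
    bits_per_sec = utilization * peak_gbps * 1e9
    static_pj = static_power_w / bits_per_sec * 1e12
    return static_pj + dynamic_pj_per_bit

for util in (0.1, 0.5, 0.9):
    photonic   = energy_per_bit(0.10, 0.1, util, 64)  # high static (ring tuning), low dynamic
    electrical = energy_per_bit(0.01, 5.0, util, 64)  # low static, high dynamic (off-chip I/O)
    print(f"util={util:.0%}  photonic={photonic:.2f} pJ/b  electrical={electrical:.2f} pJ/b")

With these placeholder numbers the electrical link is cheaper at 10% utilization and the photonic link is cheaper at 90%, which motivates the next slide's design rule.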
24 Design Considerations II
- Don't over-provision photonic bandwidth; use it only where necessary
- Use photonics where they're really useful
  - To break the off-chip pin barrier
- Exploit 3D stacking and TSVs
  - High bandwidth, low static power, decouples memory dies
- Exploit low-swing wires
  - Cheap on-chip communication
25 Proposed Design
[Diagram: processor and memory controller connected by a waveguide to the DIMM; DRAM chips are stacked on a photonic interface die]
- ADVANTAGE 1: Increased activity factor, more efficient use of photonics
- ADVANTAGE 2: Rings are co-located, easier to isolate or tune thermally
- ADVANTAGE 3: Not disruptive to the design of commodity memory dies
26 Key Contributions
[Same processor-waveguide-interface-die diagram as the previous slide]
- 23% reduction in energy consumption
- 4X capacity per channel
- Potential for performance improvements due to increased bank count
- Less disruptive to memory die design
- But: makes the job of the memory controller difficult!
27 PART 3: Memory Access Protocol
28 Key Bottleneck
- Large capacity, high bandwidth, and evolving technology trends will increase pressure on the memory interface
- The memory controller micro-manages every operation of the memory system
  - Processor-side support required for every memory innovation
- Several signals between processor and memory
  - Heavy pressure on the address/command bus
  - Worse with several independent banks and large amounts of state
29 Proposed Solution
- Release the MC's tight control; make the memory stack more autonomous
- Move mundane tasks to the interface die
  - Maintenance operations (refresh, scrub, etc.)
  - Routine operations (DRAM precharge, NVM wear leveling)
  - Timing control (18 constraints for DRAM alone)
  - Coding and any other special requirements
- The processor-side controller only schedules requests and controls the data bus
30 Memory Access Operation
[Timing diagram: a request arrives and is issued; the controller starts looking for a data-bus slot ML cycles after issue, reserves the first free slot, and also reserves a later backup slot]
Slot: cache-line data-bus occupancy. X: reserved slot. ML: memory latency = address latency + bank access + data-bus latency.
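A toy sketch of slot reservation under these rules; ML and the backup-slot gap are assumed values for illustration (in the real protocol, ML is communicated by the interface die).

# Sketch of the slot-based interface: the controller reserves the first
# free data-bus slot no earlier than ML slots after command issue, plus
# a later backup slot in case the access misses its primary slot.

ML         = 20   # memory latency advertised by the interface die (assumed)
BACKUP_GAP = 8    # distance to the backup reservation (assumed)

class DataBus:
    def __init__(self):
        self.reserved = set()   # slot indices already promised to requests

    def reserve(self, issue_slot):
        # Start looking ML slots after issue; take the first free slot.
        s = issue_slot + ML
        while s in self.reserved:
            s += 1
        self.reserved.add(s)
        # Also hold a backup slot further out.
        backup = s + BACKUP_GAP
        while backup in self.reserved:
            backup += 1
        self.reserved.add(backup)
        return s, backup

bus = DataBus()
print(bus.reserve(0))   # -> (20, 28)
print(bus.reserve(0))   # -> (21, 29): next free slot, no per-bank micromanagement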
31 Performance Impact: Synthetic Traffic
- < 9% latency impact, even at maximum load
- Virtually no impact on achieved bandwidth
32 Performance Impact: PARSEC/STREAM
- Apps have very low BW requirements
- Scaled-down system shows similar trends
33 Key Contributions
- Plug and play
  - Everything is interchangeable and interoperable
  - Only interface-die support required (communicate ML)
- Better support for heterogeneous systems
  - Easier DRAM-NVM data movement on the same channel
- More innovation in the memory system
  - Without processor-side support constraints
- Fewer commands between processor and memory
  - Energy and performance advantages
34 PART 4: Memory Reliability
35 Key Bottleneck
- Increased access granularity
  - Every data access is spread across 36 DRAM chips
  - DRAM industry standards define a minimum access granularity from each chip
- Massive overfetch of data at multiple levels (see the arithmetic after this list)
  - Wastes energy
  - Wastes bandwidth
  - Occupies ranks/banks for longer, hurting performance
- x4 device-width restriction
  - Fewer ranks for given DIMM real estate
  - x8/x16/x32 are more power-efficient per unit capacity
- Reliability level: 1 failed chip out of 36
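The granularity arithmetic, assuming the standard DDR3 minimum burst length of 8: a 36-chip x4 rank drives a 144-bit bus, so every access moves

\[ 144\ \text{bits} \times 8\ \text{beats} = 1152\ \text{bits} = 144\ \mathrm{B} \]

to deliver one 64-byte cache line, more than twice the useful data.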
36 A new approach: LOT-ECC
- Operate on a single rank of memory: 9 chips
  - ...and support the failure of 1 chip per rank (out of 9)
- Multiple tiers of localized protection
  - Tier 1: Local Error Detection (checksums)
  - Tier 2: Global Error Correction (parity)
  - Tiers 3 and 4 handle specific failure cases
- Error correction data stored in data memory
- Data mapping handled by the memory controller with firmware support
  - Transparent to OS, caches, etc.
37 LOT-ECC Design
38 The Devil is in the Details
- We're borrowing one bit from data + LED to use in the GEC
- Put them all in the same DRAM row
- When a cache line is written:
  - Write data + LED + GEC, all self-contained
  - No read-before-write
  - Guaranteed row-buffer hit
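A toy Python sketch of the two main tiers described above, under simplifying assumptions: byte-granularity chunks, an additive checksum standing in for the real LED code, and GEC parity held separately rather than embedded in data memory as the actual design does.

# Toy model of LOT-ECC's Tier-1 (LED) and Tier-2 (GEC) over a 9-chip rank.

CHIPS = 9

def led(chunk: bytes) -> int:
    """Tier-1 Local Error Detection: per-chip checksum (simplified)."""
    return sum(chunk) & 0xFF

def gec(chunks: list[bytes]) -> bytes:
    """Tier-2 Global Error Correction: byte-wise parity across chips."""
    parity = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            parity[i] ^= b
    return bytes(parity)

def recover(chunks, checksums, parity):
    """LED localizes the failed chip; GEC parity reconstructs its data."""
    for i, c in enumerate(chunks):
        if led(c) != checksums[i]:
            rebuilt = bytearray(parity)
            for j, other in enumerate(chunks):
                if j != i:
                    for k, b in enumerate(other):
                        rebuilt[k] ^= b
            return i, bytes(rebuilt)
    return None, None   # no chip failed

chunks = [bytes([i] * 8) for i in range(CHIPS)]
sums   = [led(c) for c in chunks]
parity = gec(chunks)
chunks[3] = bytes(8)                   # chip 3 fails, returns zeros
print(recover(chunks, sums, parity))   # -> (3, b'\x03\x03...') : chip found, data rebuilt

Because detection is local to each chip, only the failed chip's data need ever be reconstructed from parity, which is what keeps accesses within a single 9-chip rank.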
39 Key Benefits
- Energy efficiency: fewer chips activated per access, reduced access granularity, reduced static energy through better use of low-power modes
- Performance gains: more rank-level parallelism, reduced access granularity
- Improved protection: can handle 1 failed chip out of 9, compared to 1 in 36 currently
- Flexibility: works with a single rank of x4 DRAMs or more efficient wide-I/O x8/x16 DRAMs
- Implementation ease: changes to memory controller and system firmware only; commodity processor/memory/OS
40 Power Results
[Chart: 55% reduction in memory power]
41 Performance Results
[Chart: latency reduction of 43% for LOT-ECC x8, 47% with GEC coalescing, and 57% for an oracular scheme]
42 Exploiting features in SSA
43 Putting it all together
44 Summary
- Tremendous pressure on the memory system
  - Bandwidth, energy, complexity, reliability
- Prudently apply novel technologies
  - Silicon photonics
  - Low-swing wires
  - 3D stacking
- Rethink some fundamental design choices
  - Micromanagement by the memory controller
  - Overfetch in the face of diminishing locality
  - Conventional ECC codes
45 Impact
- Significant static/dynamic energy reduction
  - Memory core, channel, controller, reliability
- Significant performance improvement
  - Bank parallelism, channel bandwidth, reliability
- Significant complexity reduction
  - Memory controller
- Improved reliability
46 Synergies
- SSA + Photonics
- Photonics + Autonomous memory
- SSA + Reliability
- SSA, Photonics, and LOT-ECC provide additive energy benefits
  - Each targets one of three major sources of energy consumption: DRAM array, off-chip channel, reliability
- SSA, Photonics, and LOT-ECC also provide additive performance benefits
  - Each targets one of three major performance bottlenecks: bank contention, off-chip BW, reliability
47 Research Contributions
- Memory reliability (ISCA 2012)
- Memory access protocol (ISCA 2011)
- Memory channel architecture (ISCA 2011)
- Memory chip microarchitecture (ISCA 2010)
- On-chip networks (HPCA 2010)
- Non-uniform power caches (HiPC 2009)
- 3D stacked cache design (HPCA 2009)
48 Future Work
- Future project ideas include:
  - Memory architectures for graphics/throughput-oriented applications
  - Memory optimizations for handheld devices
  - Tightly integrated software support
  - Managing heterogeneity, reconfigurability
  - Novel memory hierarchies
  - Memory autonomy and virtualization
  - Refresh management in DRAM
49 Acknowledgements
- Rajeev
- Naveen
- Committee: Al, Norm, Erik, Ken
- Awesome lab-mates
- Karen, Ann, Emily (front office)
- Parents & family
- Friends