Designing Efficient Memory for Future Computing Systems

Transcript and Presenter's Notes

1
Designing Efficient Memory for Future Computing
Systems
Aniruddha N. Udipi, University of Utah
Ph.D. Dissertation Defense, March 7, 2012
Advisor: Rajeev Balasubramonian
www.cs.utah.edu/udipi
2
My other computer is...
3
Scaling server farms
  • Facebook: 30,000 servers, 80 billion images
    stored, 600,000 photos served per second, 25
    TB of data logged per day; the statistics go on...
  • The primary challenge to scaling: efficient
    supply of data to thousands of cores
  • It's all about the memory!

4
Performance Trends
  • Demand-side
  • Multi-socket, multi-core, multi-thread
  • Large datasets - big data analytics, scientific
    computation models
  • RAMCloud-like designs
  • 1 TB/s per node by 2017
  • Supply-side
  • Pin count, per pin BW, capacity
  • Severely power limited

5
Energy Trends
  • Datacenters consume 2% of all power generated in
    the US
  • Operation + cooling
  • 100 billion kWh, $7.4 billion
  • 25-40% of total power in large systems is consumed
    in memory
  • As processors get simpler, this fraction is likely
    to increase

6
Cost-per-bit
  • Traditionally the holy grail of DRAM design
  • Operational expenditure over 3 years now rivals
    capital expenditure in datacenter servers
  • Cost-per-bit is less important than before (see
    the sketch below)

[Figure: cost versus power trade-off: $3.00 at 13 W vs. $0.30 at 60 W]
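To see why cost-per-bit alone no longer decides the purchase, here is a minimal back-of-the-envelope sketch in Python; the prices, wattages, electricity rate, and PUE are all illustrative assumptions, not figures from this defense:

```python
# Hypothetical 3-year cost comparison for one memory module: purchase price vs. energy.
# Every number below is an illustrative assumption.

HOURS_3YR     = 3 * 365 * 24   # ~26,280 hours of continuous operation
PRICE_PER_KWH = 0.10           # assumed electricity price, $/kWh
PUE           = 2.0            # assumed datacenter overhead (cooling, power delivery)

def three_year_cost(purchase_usd, power_watts):
    """Capex plus 3-year opex (energy and cooling) for a single module."""
    energy_kwh = power_watts / 1000.0 * HOURS_3YR * PUE
    return purchase_usd + energy_kwh * PRICE_PER_KWH

cheap_hot   = three_year_cost(purchase_usd=30.0,  power_watts=60)  # low cost-per-bit, high power
pricey_cool = three_year_cost(purchase_usd=300.0, power_watts=13)  # high cost-per-bit, low power
print(f"cheap but hot:   ${cheap_hot:.0f}")     # ~$345
print(f"pricey but cool: ${pricey_cool:.0f}")   # ~$368
```

With these assumed numbers the 3-year energy bill rivals or exceeds the purchase price, which is the point of the slide.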
7
Complexity Trends
  • The job of the memory controller is hard
  • 18 timing parameters for DRAM!
  • Maintenance operations
  • Refresh, scrub, power down, etc.
  • Several DIMM and controller variants
  • Hard to provide interoperability
  • Need processor-side support for new
    memory features
  • Now throw in heterogeneity
  • Memristors, PCM, STT-RAM, etc.

8
Reliability Trends
  • Shrinking feature sizes are not helping
  • Nor is the scale
  • 64 × 10^15 DRAM cells in a typical datacenter
  • DRAM errors are the #1 reason for servers at Google
    to enter repair
  • Datacenters are the backbone of web-connected
    infrastructure
  • Reliability is essential
  • Server downtime has huge economic impact
  • Breached SLAs, for example

9
Thesis statement
  • Main memory systems are at an inflection point
  • Convergence of several trends
  • Major overhaul required to achieve a system that
    is
  • Energy-efficient, high-performance,
    low-complexity, reliable, and cost effective
  • Combination of two things
  • Prudent application of novel technologies
  • Fundamental rethinking of conventional design
    decisions

10
Designing Future Memory Systems
[Overview diagram: CPU and memory controller (MC) connected to DIMMs, with the four
thesis components numbered 1 through 4]
  1. Memory Chip Architecture: reducing overfetch,
     increasing parallelism [ISCA '10]
  2. Memory Interconnect: prudent use of silicon
     photonics, without modifying DRAM dies [ISCA '11]
  3. Memory Access Protocol: streamlined slot-based
     interface with semi-autonomous memory [ISCA '11]
  4. Memory Reliability: efficient RAID-based
     high-availability chipkill memory [ISCA '12]
11
PART 1: Memory Chip Organization
12
Key Bottleneck
[Diagram: a cache line striped across four DRAM chips; RAS and CAS commands open a
row into each chip's row buffer (one bank shown in each chip)]
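Why this is a bottleneck, in rough numbers: every chip in the rank opens a full row just to serve one cache line. A minimal sketch with assumed (typical-looking) sizes:

```python
# Illustrative overfetch arithmetic for a conventional access like the one above.
# Row-buffer size and chips-per-rank are assumed values, not taken from the slides.

ROW_BUFFER_PER_CHIP = 8 * 1024   # bytes read into each chip's row buffer on a RAS (assumed)
CHIPS_PER_RANK      = 8          # chips that all open a row for every access (assumed)
CACHE_LINE          = 64         # bytes the processor actually asked for

bytes_activated = ROW_BUFFER_PER_CHIP * CHIPS_PER_RANK
print(f"{bytes_activated} bytes activated for a {CACHE_LINE}-byte line "
      f"({bytes_activated // CACHE_LINE}x overfetch)")   # 65536 bytes, 1024x
```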
13
Why this is a problem
14

15
SSA Architecture
[Diagram: SSA architecture. A DIMM of DRAM chips sits on a shared address/command
bus; within one DRAM chip, each bank is split into subarrays, each with its own
bitlines and row buffer. An entire 64-byte cache line is supplied by a single
subarray and returned to the memory controller over that chip's 8-bit slice of the
data bus, via the global interconnect to I/O]
16
SSA Operation
[Diagram: SSA operation. The address selects one subarray in one DRAM chip, which
supplies the entire cache line; the remaining chips and subarrays stay in sleep mode
or serve other parallel accesses]
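A minimal Python sketch of the operation above; the address-to-subarray mapping is an illustrative assumption, and only the one-chip/one-subarray placement is the point being made:

```python
# Under SSA the whole cache line lives in one subarray of one chip, while the
# conventional scheme stripes it across every chip in the rank.

CHIPS_PER_DIMM     = 8
SUBARRAYS_PER_CHIP = 16
LINE_BYTES         = 64

def ssa_map(line_addr):
    """Return the single (chip, subarray) pair that holds the whole line under SSA."""
    chip     = line_addr % CHIPS_PER_DIMM
    subarray = (line_addr // CHIPS_PER_DIMM) % SUBARRAYS_PER_CHIP
    return chip, subarray

def conventional_map(line_addr):
    """Conventional DDRx: every chip supplies an equal slice of the line."""
    return [(chip, LINE_BYTES // CHIPS_PER_DIMM) for chip in range(CHIPS_PER_DIMM)]

print(ssa_map(0x1234))           # one chip/subarray is activated; the rest may sleep
print(conventional_map(0x1234))  # all 8 chips wake up to deliver 8 bytes each
```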
17
SSA Impact
  • Energy reduction
  • Dynamic: fewer bitlines activated
  • Static: smaller activation footprint, more and
    longer spells of inactivity, better power-down
  • Latency impact
  • Limited pins per cache line: serialization
    latency (see the sketch after this list)
  • Higher bank-level parallelism: shorter queuing
    delays
  • Area increase
  • More peripheral circuitry and I/O at finer
    granularities: area overhead (< 5%)

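The serialization point above, as a quick sketch (pin counts are assumed example values):

```python
# Transfer time for one 64-byte line, in bus beats, under SSA vs. conventional striping.
LINE_BITS      = 64 * 8
PINS_PER_CHIP  = 8
CHIPS_PER_RANK = 8

beats_ssa          = LINE_BITS // PINS_PER_CHIP                     # 64 beats from one chip
beats_conventional = LINE_BITS // (PINS_PER_CHIP * CHIPS_PER_RANK)  # 8 beats across the rank
print(beats_ssa, beats_conventional)  # 64 vs 8: a longer transfer per line, offset by
                                      # many more subarrays transferring concurrently
```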
18
Key Contributions
  • Up to 6X reduction in DRAM chip dynamic energy
  • Up to 5X reduction in DRAM chip static energy
  • Up to 50% improvement in performance in
    applications limited by bank contention
  • All for a less than 5% increase in area

19
PART 2: Memory Interconnect
20
Key Bottleneck
  • Fundamental nature of electrical pins
  • Limited pin count, per pin bandwidth, memory
    capacity, etc.
  • Diverging growth rates of core count and pin
    count
  • Limited by physics, not engineering!

21
Silicon Photonic Interconnects
  • We need something that can break the
    edge-bandwidth bottleneck
  • Ring modulator based photonics
  • Off-chip light source
  • Indirect modulation using resonant rings
  • Relatively cheap coupling on- and off-chip
  • DWDM for high bandwidth density
  • As many as 67 wavelengths possible
  • Limited by Free Spectral Range, and coupling
    losses between rings

Source: Xu et al., Optics Express 16(6), 2008
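For a feel of the bandwidth density DWDM offers, a rough estimate; the per-wavelength bit rate here is an assumption, only the wavelength count comes from the slide:

```python
# Rough DWDM bandwidth estimate per waveguide.
wavelengths     = 64      # slide quotes up to 67 as feasible
gbps_per_lambda = 10      # assumed modulation rate per wavelength

channel_GBps = wavelengths * gbps_per_lambda / 8
print(f"~{channel_GBps:.0f} GB/s per waveguide under these assumptions")
```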
22
The Questions We're Trying to Answer
How do we make photonics less invasive to memory
die design?
What should the role of electrical signaling be?
Should we replace all interconnects with
photonics? On-chip too?
What should the role of 3D be in an optically
connected memory?
Should we be designing photonic DRAM dies?
Stacks? Channels?
23
Design Considerations I
  • Photonic interconnects
  • Large static power dissipation: ring tuning
  • Rings are designed to resonate at a specific
    frequency
  • Processing defects and temperature change this
  • Need to heat the rings to correct for this
  • Much lower dynamic energy consumption,
    relatively independent of distance
  • Electrical interconnects
  • Relatively small static power dissipation
  • Large dynamic energy consumption

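A minimal sketch with hypothetical link parameters shows how these static/dynamic profiles trade off, and why photonic bandwidth should not be over-provisioned:

```python
# Hypothetical energy crossover between a photonic link (large static ring-tuning
# power, cheap per bit) and an electrical link (small static power, expensive per bit).
# All four constants are illustrative assumptions.

STATIC_PHOTONIC_W     = 0.5    # ring tuning/heating, paid even when idle
PJ_PER_BIT_PHOTONIC   = 0.5
STATIC_ELECTRICAL_W   = 0.05
PJ_PER_BIT_ELECTRICAL = 5.0

def watts(static_w, pj_per_bit, gbits_per_s):
    """Average power: static component plus dynamic energy per second of traffic."""
    return static_w + pj_per_bit * 1e-12 * gbits_per_s * 1e9

for gbps in (1, 10, 100, 1000):
    p = watts(STATIC_PHOTONIC_W, PJ_PER_BIT_PHOTONIC, gbps)
    e = watts(STATIC_ELECTRICAL_W, PJ_PER_BIT_ELECTRICAL, gbps)
    print(f"{gbps:4d} Gb/s: photonic {p:.2f} W, electrical {e:.2f} W")
# With these numbers photonics only wins once the link is kept busy enough to amortize
# its static power, which motivates the design rules on the next slide.
```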
24
Design Considerations II
  • Should not over-provision photonic bandwidth, use
    only where necessary
  • Use photonics where they're really useful
  • To break the off-chip pin barrier
  • Exploit 3D-Stacking and TSVs
  • High bandwidth, low static power, decouples
    memory dies
  • Exploit low-swing wires
  • Cheap on-chip communication

25
Proposed Design
  • ADVANTAGE 1: Increased activity factor, more
    efficient use of photonics
  • ADVANTAGE 2: Rings are co-located, easier to
    isolate or tune thermally
  • ADVANTAGE 3: Not disruptive to the design of
    commodity memory dies
[Diagram: processor and memory controller connected by a waveguide to a photonic
interface die on the DIMM, which serves the commodity DRAM chips]
26
Key Contributions
  • 23% reduction in energy consumption
  • 4X capacity per channel
  • Potential for performance improvements due
    to increased bank count
  • Less disruptive to memory die design

Makes the job of the memory controller difficult!
27
PART 3: Memory Access Protocol
28
Key Bottleneck
  • Large capacity, high bandwidth, and evolving
    technology trends will increase pressure on the
    memory interface
  • Memory controller micro-manages every operation
    of the memory system
  • Processor-side support required for every memory
    innovation
  • Several signals between processor and memory
  • Heavy pressure on address/command bus
  • Worse with several independent banks, large
    amounts of state

29
Proposed Solution
  • Release the MC's tight control, make the memory stack
    more autonomous
  • Move mundane tasks to the interface die
  • Maintenance operations (refresh, scrub, etc.)
  • Routine operations (DRAM precharge, NVM wear
    leveling)
  • Timing control (18 constraints for DRAM alone)
  • Coding and any other special requirements
  • Processor-side controller only schedules requests
    and controls the data bus

30
Memory Access Operation
[Timeline: a request arrives and is issued; the controller starts looking ML cycles
ahead and reserves the first free data-bus slot, plus a later backup slot. Legend:
Slot = cache line data bus occupancy; X = reserved slot; ML = memory latency =
addr. latency + bank access + data bus latency]
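A minimal Python sketch of that slot-based operation; the parameter values and the backup-slot handling are simplified assumptions, not the exact protocol:

```python
# On issue, the controller looks ML cycles ahead, books the first free data-bus slot,
# and also books a later backup slot for the rare case the primary slot is missed.

SLOT_CYCLES = 4      # data-bus occupancy of one cache line, in cycles (assumed)
ML          = 40     # memory latency: addr latency + bank access + data bus (assumed)

reserved = set()     # data-bus slot indices already booked

def reserve(issue_cycle):
    """Return (primary, backup) slot indices for a request issued at issue_cycle."""
    slot = (issue_cycle + ML) // SLOT_CYCLES   # start looking ML cycles out
    while slot in reserved:                    # take the first free slot
        slot += 1
    backup = slot + 1
    while backup in reserved:                  # and a later slot as the backup
        backup += 1
    reserved.update((slot, backup))
    return slot, backup

print(reserve(0))    # (10, 11) with these parameters
print(reserve(3))    # the next request takes the next free pair: (12, 13)
```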
31
Performance Impact Synthetic Traffic
< 9% latency impact, even at maximum load
Virtually no impact on achieved bandwidth
32
Performance Impact PARSEC/STREAM
Apps have very low BW requirements; a scaled-down
system shows similar trends
33
Key Contributions
  • Plug and play
  • Everything is interchangeable and interoperable
  • Only interface-die support required (communicate
    ML)
  • Better support for heterogeneous systems
  • Easier DRAM-NVM data movement on the same channel
  • More innovation in the memory system
  • Without processor-side support constraints
  • Fewer commands between processor and memory
  • Energy, performance advantages

34
PART 4: Memory Reliability
35
Key Bottleneck
  • Increased access granularity
  • Every data access is spread across 36 DRAM chips
  • DRAM industry standards define a minimum access
    granularity from each chip (see the sketch after
    this list)
  • Massive overfetch of data at multiple levels
  • Wastes energy
  • Wastes bandwidth
  • Occupies ranks/banks for longer, hurting
    performance
  • x4 device width restriction
  • Fewer ranks for given DIMM real estate
  • x8/x16/x32 are more power-efficient per capacity
  • Reliability level: 1 failed chip out of 36

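Putting rough numbers on that access granularity (the burst length and the data/ECC chip split are my assumptions about the conventional baseline, not stated on the slide):

```python
# Granularity arithmetic for a conventional x4 chipkill arrangement: 36 chips per
# access, burst length 8, 4 of the chips carrying ECC symbols (all assumed).
CHIPS         = 36
BITS_PER_BEAT = 4      # x4 devices
BURST_LENGTH  = 8
ECC_CHIPS     = 4

total_bytes = CHIPS * BITS_PER_BEAT * BURST_LENGTH // 8                 # bytes on the wire
data_bytes  = (CHIPS - ECC_CHIPS) * BITS_PER_BEAT * BURST_LENGTH // 8   # bytes of data
print(total_bytes, data_bytes)   # 144, 128: double the 64-byte line the core requested
```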
36
A new approach LOT-ECC
  • Operate on a single rank of memory (9 chips)
  • and support failure of 1 chip per rank (1 in 9)
  • Multiple tiers of localized protection
  • Tier 1: Local Error Detection (checksums)
  • Tier 2: Global Error Correction (parity)
  • Tiers 3 and 4 handle specific failure cases
  • Error correction data stored in data memory
  • Data mapping handled by the memory controller with
    firmware support
  • Transparent to OS, caches, etc.

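A conceptual Python sketch of the tiered structure; the specific checksum, parity, and data layout below are illustrative placeholders, not LOT-ECC's actual codes:

```python
# Tier 1 attaches a local checksum to each chip's slice of the line to *detect* which
# chip failed; Tier 2 keeps a parity column across chips to *reconstruct* that slice.

def led(chunk):
    """Tier 1, local error detection: a simple additive checksum (illustrative only)."""
    return sum(chunk) & 0xFFFF

def gec(chunks):
    """Tier 2, global error correction: byte-wise XOR parity across the chips' slices."""
    parity = bytearray(len(chunks[0]))
    for c in chunks:
        for i, b in enumerate(c):
            parity[i] ^= b
    return bytes(parity)

# Eight data chips each hold an 8-byte slice of a 64-byte line, plus its LED checksum.
slices    = [bytes([chip + 1] * 8) for chip in range(8)]
checksums = [led(s) for s in slices]
parity    = gec(slices)

# Suppose chip 3 returns garbage: its LED mismatch pinpoints the failure (detection),
# and XOR-ing the surviving slices with the parity rebuilds the lost data (correction).
slices[3] = bytes(8)
failed    = next(i for i, s in enumerate(slices) if led(s) != checksums[i])
rebuilt   = gec([s for i, s in enumerate(slices) if i != failed] + [parity])
assert failed == 3 and rebuilt == bytes([failed + 1] * 8)
print(f"chip {failed} failed; its slice was reconstructed from parity")
```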
37
LOT-ECC Design
38
The Devil is in the Details
  • We're borrowing one bit from the data + LED to
    use in the GEC
  • Put them all in the same DRAM row
  • When a cache line is written,
  • Write data, LED, GEC: all self-contained,
    no read-before-write
  • Guaranteed row-buffer hit

39
Key Benefits
  • Energy Efficiency: Fewer chips activated per
    access, reduced access granularity, reduced
    static energy through better use of low-power
    modes
  • Performance Gains: More rank-level parallelism,
    reduced access granularity
  • Improved Protection: Can handle 1 failed chip out
    of 9, compared to 1 in 36 currently
  • Flexibility: Works with a single rank of x4 DRAMs
    or more efficient wide-I/O x8/x16 DRAMs
  • Implementation Ease: Changes to memory controller
    and system firmware only; commodity
    processor/memory/OS

40
Power Results
[Chart: memory power results, roughly a 55% reduction]
41
Performance Results
Latency reduction: LOT-ECC x8 43%, GEC coalescing
47%, oracular 57%
42
Exploiting features in SSA
43
Putting it all together
44
Summary
  • Tremendous pressure on the memory system
  • Bandwidth, energy, complexity, reliability
  • Prudently apply novel technologies
  • Silicon photonics
  • Low-swing wires
  • 3D-stacking
  • Rethink some fundamental design choices
  • Micromanagement by the memory controller
  • Overfetch in the face of diminishing locality
  • Conventional ECC codes

45
Impact
  • Significant static/dynamic energy reduction
  • Memory core, channel, controller, reliability
  • Significant performance improvement
  • Bank parallelism, channel bandwidth, reliability
  • Significant complexity reduction
  • Memory controller
  • Improved reliability

46
Synergies
  • SSA + Photonics
  • Photonics + Autonomous memory
  • SSA + Reliability
  • SSA, Photonics, and LOT-ECC provide additive
    energy benefits
  • Each targets one of three major sources of energy
    consumption: DRAM array, off-chip channel,
    reliability
  • SSA, Photonics, and LOT-ECC also provide additive
    performance benefits
  • Each targets one of three major performance
    bottlenecks: bank contention, off-chip BW,
    reliability

47
Research Contributions
  • Memory reliability
  • Memory access protocol
  • Memory channel architecture
  • Memory chip microarchitecture
  • On-chip networks
  • Non-uniform power caches
  • 3D stacked cache design

ISCA 2012, ISCA 2011, ISCA 2010, HPCA 2010,
HiPC 2009, HPCA 2009
48
Future Work
  • Future project ideas include
  • Memory architectures for graphics/throughput-
    oriented applications
  • Memory optimizations for handheld devices
  • Tightly integrated software support
  • Managing heterogeneity, reconfigurability
  • Novel memory hierarchies
  • Memory autonomy and virtualization
  • Refresh management in DRAM

49
Acknowledgements
  • Rajeev
  • Naveen
  • Committee: Al, Norm, Erik, Ken
  • Awesome lab-mates
  • Karen, Ann, Emily (front office)
  • Parents and family
  • Friends