Embedded DRAM for a Reconfigurable Array

1
Embedded DRAM for a Reconfigurable Array
  • S. Perissakis, Y. Joo¹, J. Ahn¹,
  • A. DeHon, J. Wawrzynek
  • University of California, Berkeley
  • ¹LG Semicon Co., Ltd

2
Outline
  • Reconfigurable architecture overview
  • Motivation for on-chip DRAM
  • Configurable Memory Block (CMB)
  • Evaluation
  • Conclusion

3
Long Term Architecture Goal
  • On-chip CPU
  • LUT-based compute pages
  • DRAM memory pages
  • Fat pyramid network: fat tree plus shortcuts

7
Long Term Architecture Goal
[Figure: diagram labeled CPU, Reconfigure, Kernel 1 (producer), Kernel 2 (consumer)]

8
Motivation
Need large on-chip memory for
  • Stream buffers: reduce reconfiguration frequency
  • Configuration memory: speed up reconfiguration
  • Application memory: speed up individual kernels

9
Challenges
DRAM offers increased density (10X to 20X that of
SRAM), but:
  • Harder to use
      • Row/column accesses: variable latency
      • Refresh
  • Lower performance
      • Increased access latency

Q: Is it worth the trouble?
10
Trumpet test chip
  • One compute page
  • One memory page
  • Corresponding fraction of network

11
CMB Functions
  • Configuration source
  • State source/sink
  • Data store
  • Input/output

12
CMB Overview
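[Figure: CMB block diagram. Blocks: CMB Controller, DRAM Macro, Stall Buffers, Retiming Registers, Address/Data Xbars, Rate Matching. Labeled signals: Cmd, Ctl[1:0], Addr[9:0], Addr[17:0], DQ[127:0], Tree[159:0], Short[159:0], with 128-bit and 64-bit internal data paths. Inputs arrive from the host and from the compute page.]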
13
DRAM Macro
  • 0.25 µm, 4-metal eDRAM process
  • 1 to 8 Mbits (2 Mbits in test chip)
  • 128-bit wide SDRAM interface
  • Up to 125 MHz clock → 2 GB/s peak B/W
  • 36 ns / 12 ns row/column latencies
  • Row buffers to hide precharge and refresh

Designed by LG Semicon
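
The peak bandwidth figure follows directly from the interface width and clock rate. A minimal sketch of that arithmetic, assuming one full-width transfer per clock cycle (the function names are illustrative, not from the slides); it also expresses the row and column latencies in 125 MHz clock periods:

```python
# Interface parameters taken from the slide.
BUS_WIDTH_BITS = 128
CLOCK_HZ = 125e6
ROW_LATENCY_NS = 36
COL_LATENCY_NS = 12


def peak_bandwidth_bytes_per_s(bus_width_bits, clock_hz):
    """Peak bandwidth, assuming one full-width transfer per clock cycle."""
    return bus_width_bits / 8 * clock_hz


def latency_in_cycles(latency_ns, clock_hz):
    """Latency expressed in clock periods (may be fractional)."""
    cycle_ns = 1e9 / clock_hz
    return latency_ns / cycle_ns


peak = peak_bandwidth_bytes_per_s(BUS_WIDTH_BITS, CLOCK_HZ)
row_cycles = latency_in_cycles(ROW_LATENCY_NS, CLOCK_HZ)
col_cycles = latency_in_cycles(COL_LATENCY_NS, CLOCK_HZ)

print(f"peak bandwidth: {peak / 1e9:.1f} GB/s")
print(f"row latency:    {row_cycles:.1f} cycles")
print(f"column latency: {col_cycles:.1f} cycles")
```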
14
SRAM Abstraction
  • SRAM-like interface: Req, R/W, Address, Data
  • Row buffers: simple direct-mapped cache
  • 6-cycle minimum latency, pipelined
  • Misses handled by logic stalls
  • 10-cycle miss latency hidden from logic
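
The idea of treating row buffers as a direct-mapped cache can be sketched as a small behavioral model. This is only an illustration: the class name, the number of row buffers, and the words-per-row value are assumptions, not parameters given in the slides.

```python
class RowBufferCache:
    """Direct-mapped cache whose lines are whole DRAM rows."""

    def __init__(self, num_buffers=4, words_per_row=8):
        # Both sizes are illustrative assumptions, not values from the slides.
        self.num_buffers = num_buffers
        self.words_per_row = words_per_row
        self.tags = [None] * num_buffers  # row currently held by each buffer
        self.hits = 0
        self.misses = 0

    def access(self, word_addr):
        """Return True on a row-buffer hit, False on a miss."""
        row = word_addr // self.words_per_row
        index = row % self.num_buffers  # direct-mapped: one possible slot per row
        if self.tags[index] == row:
            self.hits += 1
            return True
        self.tags[index] = row  # miss: the new row replaces the old one
        self.misses += 1
        return False


# Sequential sweep over 4 rows: one miss per row, hits afterwards.
cache = RowBufferCache()
for addr in range(32):
    cache.access(addr)
sequential = (cache.hits, cache.misses)

# Alternating between rows 0 and 4, which map to the same buffer.
cache = RowBufferCache()
for addr in [0, 32] * 8:
    cache.access(addr)
conflicting = (cache.hits, cache.misses)

print("sequential  (hits, misses):", sequential)
print("conflicting (hits, misses):", conflicting)
```

In this model a miss corresponds to the case the slide says is handled by stalling the logic; the conflicting pattern shows why a direct-mapped organization can miss on every access when two active rows share a buffer.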

15
Stalls
  • Stall sources
      • Row buffer miss (10 cycles)
      • Write after read (4 cycles)
      • DRAM/logic clock alignment (1 cycle)
      • Refresh (Halt from host)
  • Multicycle stall distribution
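
The per-event stall costs listed above can be combined into a rough cycle budget. The sketch below is a simplification: it assumes stalls never overlap, leaves out refresh (the slide gives no cycle count for it), and uses a hypothetical workload whose event counts are made up for illustration.

```python
# Stall costs in logic-clock cycles, taken from the slide.
STALL_CYCLES = {
    "row_buffer_miss": 10,
    "write_after_read": 4,
    "clock_alignment": 1,
}


def total_stall_cycles(event_counts):
    """Sum stall cycles over a dict of {event name: number of occurrences}."""
    return sum(STALL_CYCLES[name] * count for name, count in event_counts.items())


def efficiency(useful_cycles, event_counts):
    """Fraction of cycles spent on useful transfers, assuming stalls never overlap."""
    stalls = total_stall_cycles(event_counts)
    return useful_cycles / (useful_cycles + stalls)


# Hypothetical workload: 100 transfer cycles, 2 row misses,
# 1 write-after-read turnaround, 1 clock alignment stall.
example = {"row_buffer_miss": 2, "write_after_read": 1, "clock_alignment": 1}
stalls = total_stall_cycles(example)
eff = efficiency(100, example)

print("stall cycles:", stalls)
print("efficiency:  ", round(eff, 3))
```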

16
Stall Buffers
  • Memory page is never stalled
  • Must buffer read data during stall
  • Must buffer requests during stall distribution

17
Trumpet Test Chip
  • 0.25 µm DRAM, 0.4 µm logic
  • 2 Mbits + 64 LUTs
  • 125 MHz operation
  • 1 GB/s peak bandwidth
  • 10 µs reconfiguration
  • 10 × 5 mm² die
  • 1 W @ 125 MHz

[Figure labels: CMB, Compute Page]
18
CMB Area Breakdown
  • 13.95 mm² total
  • 2 Mbits capacity → 147 Kbits/mm² average density
  • Compare to 700-900 Kbits/mm² commodity DRAM

19
Using a Custom Macro
  • Existing
      • 13.95 mm²
      • 147 Kbits/mm²
  • Custom
      • 9.4 mm²
      • 218 Kbits/mm²
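
Both density figures can be reproduced from the 2 Mbit test-chip capacity and the two CMB areas. A minimal check, assuming binary prefixes (1 Mbit = 1024 Kbits), which is the convention that matches the quoted values; the function name is illustrative:

```python
KBITS_PER_MBIT = 1024  # assumption: binary prefixes, which reproduce the slide values


def density_kbits_per_mm2(capacity_mbits, area_mm2):
    """Average storage density of the whole memory block."""
    return capacity_mbits * KBITS_PER_MBIT / area_mm2


existing = density_kbits_per_mm2(2, 13.95)  # CMB with the existing macro
custom = density_kbits_per_mm2(2, 9.4)      # CMB with a custom macro

print(f"existing macro: {existing:.0f} Kbits/mm^2")
print(f"custom macro:   {custom:.0f} Kbits/mm^2")
```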

20
Comparison to SRAM CMB
With typical SRAM core densities, no stall buffers,
and a simplified controller:
  • DRAM (custom macro): 218 Kb/mm²
  • SRAM (equal area): 25 Kb/mm²

Close to 1 order of magnitude density advantage
for DRAM
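
The "close to one order of magnitude" claim can be checked from the two densities. The short sketch below also estimates how much SRAM would fit in the 9.4 mm² custom-macro CMB area quoted earlier; the function names are illustrative.

```python
DRAM_DENSITY_KB_PER_MM2 = 218  # custom-macro DRAM CMB, from the slide
SRAM_DENSITY_KB_PER_MM2 = 25   # equal-area SRAM CMB, from the slide
CMB_AREA_MM2 = 9.4             # custom-macro CMB area, from the previous slide


def density_advantage(dram_density, sram_density):
    """How many times denser the DRAM block is than the SRAM block."""
    return dram_density / sram_density


def capacity_kbits(density_kb_per_mm2, area_mm2):
    """Capacity that fits in a given area at a given average density."""
    return density_kb_per_mm2 * area_mm2


advantage = density_advantage(DRAM_DENSITY_KB_PER_MM2, SRAM_DENSITY_KB_PER_MM2)
sram_capacity = capacity_kbits(SRAM_DENSITY_KB_PER_MM2, CMB_AREA_MM2)

print(f"DRAM density advantage: {advantage:.1f}x")
print(f"SRAM capacity in the same area: {sram_capacity:.0f} Kbits")
```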
21
Performance
  • Configuration / state swap: peak 1 GB/s
  • User accesses: dependent on access patterns
      • Peak if high locality
      • Near peak for sequential patterns (62-93%)
      • Column latency exposed when dependencies exist,
        or on mixed R/W
      • Row latency exposed on random accesses

22
Performance (example)
[Figure: input image read in scanline order as 8x8 DCT blocks; 1 Kbit = 1 DRAM row]
  • Row: 4 misses / DCT block
  • Col: 2 misses / DCT block
  • → 73% efficiency

23
Refresh Overhead
  • 8 to 16 ms retention time expected
  • 2.5% to 5.0% bandwidth loss
  • Can reduce by refreshing only active part of
    memory
  • May skip refresh for short-lived data

24
Conclusion
  • Q: Is on-chip DRAM advantageous compared to SRAM?
  • Our experience so far
      • User-friendly abstraction possible
      • Can maintain density advantage
  • Effect on application performance
      • Large buffer space → less frequent
        reconfiguration
      • High bandwidth → faster reconfiguration
      • Effect on individual kernels often limited by
        DRAM core latency