Title: Three-Dimensional Integration for Multi-Processor System-on-Chip
1Three-Dimensional Integration for Multi-Processor
System-on-Chip
Searching for the architectural sweet spot
Luca Benini, DEIS, Università di Bologna, lbenini_at_deis.unibo.it
Thanks to: M. Facchini, T. Carlson, P. Marchal (IMEC); C. Seiculescu, G. De Micheli (EPFL); S. Murali, A. Pullini, F. Angiolini (INoCs); S. Mitra (Stanford); I. Loi (UNIBO)
2The communication bottleneck
- Architectural issues
- Traditional shared buses do not scale well: bandwidth saturation
- Chip I/O is pad-limited
- Physical issues
- On-chip interconnects become increasingly slower w.r.t. logic
- I/Os are increasingly expensive
- Consequences
- Performance losses
- Power/Energy cost
- Design closure issues, respins or infeasibility
New architectures and design methods are
required!
3TSV market outlook
Yole07
4TSV performance
- Delay is given by combination of parasitics
- Horizontal wire to via base
- Via delay (includes R of bases)
- Horizontal wire from via top
- Load
- For a whole via of 50µm, delay is 16/18.5ps (SOI/bulk)
- For a 1.5mm horizontal link, delay is around 200ps
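To put the two delay figures side by side, here is a minimal Python sketch. It uses only the numbers quoted on this slide; the names are illustrative, and the comparison ignores the horizontal wires to and from the via as well as the load.

```python
# Delay figures quoted on the slide, in picoseconds.
TSV_DELAY_PS = {"SOI": 16.0, "bulk": 18.5}   # whole 50 µm via
HORIZONTAL_LINK_DELAY_PS = 200.0             # 1.5 mm horizontal link (approximate)


def speedup_vs_horizontal(tsv_delay_ps: float) -> float:
    """Ratio of the horizontal-link delay to the via delay."""
    return HORIZONTAL_LINK_DELAY_PS / tsv_delay_ps


for process, delay_ps in TSV_DELAY_PS.items():
    ratio = speedup_vs_horizontal(delay_ps)
    print(f"{process}: the via is roughly {ratio:.1f}x faster than the 1.5 mm link")
```

With these figures the vertical hop is about an order of magnitude faster than the horizontal link.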
5So far so good, but
- The area news is not so good!
- TSV itself can be small (even 2-3µm)
- But TSV pitch is not so small
- Limited by wafer alignment technology!
- Sub-micron alignment is not yet feasible
- Micron alignment is feasible but slow and expensive!
- Need large landing pads for TSVs
- 10µm pitches seem to be realistic
- Not all TSVs can be used for signals
- Power supply, clock, thermal vias
6TSV reliability losses
- Main failure mechanisms (fabrication)
- Misalignment
- Void formation during bonding phase
- Dislocation and defects of copper grains
- Oxide film formation over Cu interface
- Partial or full pad detachment due to thermal stress
- Thermal dissipation is much harder in 3D stacks, thereby further increasing the risk of temperature-related failures
12/14/2009
Loi Igor igor.loi_at_unibo.it
7TSV yield
Miyakawa HRI07
$Y = \exp(-D_{BI} \cdot N_{BI})$
$D_{BI}$: defect frequency, $N_{BI}$: number of TSVs
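The yield relation on this slide, read as Y = exp(-DBI x NBI), can be evaluated with a short Python sketch. The function name and the example defect frequency are assumptions made for illustration only; they are not values from the slide or from the cited source.

```python
import math


def tsv_yield(defect_frequency: float, num_tsvs: int) -> float:
    """Yield Y = exp(-D_BI * N_BI) for N_BI TSVs with per-TSV defect frequency D_BI."""
    return math.exp(-defect_frequency * num_tsvs)


# Hypothetical defect frequency, chosen only to show the trend.
ASSUMED_DEFECT_FREQUENCY = 1e-5

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} TSVs -> yield {tsv_yield(ASSUMED_DEFECT_FREQUENCY, n):.3f}")
```

Because the TSV count sits in the exponent, yield falls off quickly as the number of vertical connections grows, which is why TSV count shows up later as a design constraint.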
8Summing up
- Good power and speed
- Area overhead is significant
- Reliability not ideal (fabrication and aging)
- Synchronization is hard (skew minimization across layers)
- Therefore
- Cost and design effort are not trivial
- Not just another dimension for wiring (as of today)
- Need a systematic way to deal with non-ideality
9A medium-term vision
10Do We Really Need It?
- Multi-core logic performance is back on track, but
John McCalpin
http://www.cs.virginia.edu/stream/
- Multi-cores are bandwidth-hungry
- Limited caches
- Multi-threading
- Virtualization
The Bandwidth Challenge
11Scaling cores with constant BW
[Figure: cache size (C), thread traffic (T), memory bandwidth (B)]
Using cache size to accommodate increasing thread traffic is VERY expensive; using BW can be cheaper!!
$T / C^{1/d} = B$, $d > 1$ (2-3)
2x increased traffic drives 8x cache size (constant memory bandwidth)
4x increased traffic drives 64x cache size (constant memory bandwidth)
IBM
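The two data points on this slide follow from the relation if it is read as T / C^(1/d) = B with d = 3. That reading is an assumption, since the formula is garbled in the transcript. Under it, holding B constant gives C proportional to T^d, as the Python sketch below shows (the function name is illustrative).

```python
def cache_growth(traffic_factor: float, d: float = 3.0) -> float:
    """Cache-size multiplier needed to absorb a traffic multiplier at constant bandwidth.

    Assumes T / C**(1/d) = B, so with B fixed, C scales as T**d.
    """
    return traffic_factor ** d


for factor in (2.0, 4.0):
    print(f"{factor:.0f}x traffic -> {cache_growth(factor):.0f}x cache")
```

With d = 3 this gives 8x and 64x, matching the figures quoted on the slide.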
12What about Embedded MPSoCs?
NXP07
Frame rate constraint is getting too tight!
133D offers plenty of bandwidth
High-end packaging roadmap
Intel 07
10µm TSV pitch → 10K vertical connections per mm²
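The 10K figure is simple geometry, sketched below in Python under the assumption of a regular square grid of TSVs. The function name is illustrative.

```python
def tsv_connections_per_mm2(pitch_um: float) -> float:
    """Number of TSVs per square millimetre on a square grid with the given pitch."""
    tsvs_per_mm = 1000.0 / pitch_um   # 1 mm = 1000 µm
    return tsvs_per_mm ** 2


print(tsv_connections_per_mm2(10.0))  # 100 x 100 grid
```

This is an upper bound on signal connections: as noted on the earlier area slide, some TSVs are taken by power supply, clock, and thermal vias.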
14How do we get the bandwidth?
- Current low-cost SoC solution (2D): single-channel memory system interface
[Figure: slave ports S1..Sk on the system interconnect feed the SoC front-end (transaction queue, memory scheduler), which drives a single memory backend and PHY (off-chip physical interface circuits) on channel CH1]
Main bottleneck: the memory channel
15Memory Controller example
16Multi-channel (2D) Memory interface
- Data-parallel memory system (e.g. OpenSPARC T1, T2)
[Figure: slave ports S1..Sk on the system interconnect feed the SoC front-end (transaction queue, memory scheduler), which drives three memory backends, each with its own PHY, on channels CH1, CH2 and CH3]
- Main bottlenecks
- Power, pin budget of memory channels
- Front-end congestion
- Scheduler scalability
17Packaging scenarios vs. IO circuits
[Figure: packaging scenarios, each mounted on a PCB: board-level (DDR2 16-bit, SSTL2), SiP (DDR2 16-bit, RLC interconnect, SSTL2 with no terminations), 3D SiC with TSVs (DDR2 16-bit, RC interconnect; 3D-ready DDR2 16-bit, CMOS; 3D-ready DDR2 x-bit, RC interconnect)]
ORDERS OF MAGNITUDE MORE ENERGY EFFICIENT
SSTL2: Stub Series Terminated Logic (2.5 V). These are dedicated I/O circuits that transmit data across a transmission line, commonly used in DRAM interfaces.
183D-single channel interface
- Current low-cost SoC solution (2D): single-channel memory system interface
[Figure: slave ports S1..Sk feed the SoC front-end (queue, scheduler), which drives the memory backend and PHY over separate CTRL, Addr, DW and DR lanes on a single channel CH]
Fat data lanes → 1-cycle block transfers
Split R/W → reduced scheduling conflicts
ISSUE: requires changes in MEM
Main bottleneck: it is still a single channel (for transactions)
193D-multi channel interface
- Solution 1: multiple standard channels
[Figure: slave ports S1..Sk feed the SoC front-end (queue, scheduler) and PHY, which drive three memory backends on channels CH1, CH2 and CH3]
233D-multi channel interface
- Solution 1: multiple standard channels
[Figure: slave ports S1..Sk feed the SoC front-end (queue, scheduler) and PHY, which drive three memory backends on channels CH1, CH2 and CH3]
Advantage: does not require functional changes to DRAM interface
243D-multi channel interface
- Solution 2: TDMA 3D overclocked bus
[Figure: on the logic die, slave ports S1..Sk feed the SoC front-end (queue, scheduler) and PHY; a shared vertical bus serves three memory backends, with channels CH1, CH2 and CH3 taking turns along the time axis t]
Exploits speed of TSVs
Multi-channel on wide unidirectional data lanes is also possible
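The time-multiplexing idea can be illustrated with a minimal Python sketch. It assumes a plain round-robin schedule with equal slots, which the slide does not specify; the function names and the 200 MHz example rate are hypothetical.

```python
from itertools import cycle, islice


def tdma_slots(channels, num_slots):
    """Round-robin assignment of bus time slots to logical channels."""
    return list(islice(cycle(channels), num_slots))


def required_bus_clock_mhz(channel_clock_mhz, num_channels):
    """Bus clock needed so each channel keeps its native rate under equal-slot TDMA."""
    return channel_clock_mhz * num_channels


channels = ["CH1", "CH2", "CH3"]
print(tdma_slots(channels, 6))
print(required_bus_clock_mhz(200, len(channels)), "MHz")  # 200 MHz is a made-up rate
```

Under this assumption the shared bus has to run as many times faster as there are channels, which is the sense in which the scheme relies on the speed of TSVs.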
253D DRAM interface choices
- Multi-channel 3D DRAM looks very promising
- Can leverage existing DRAM organization
- Relieves bandwidth bottleneck
- Mitigates (by orders of magnitude) the cost issues of off-chip multi-channel interfaces
- Exploration of wide channels with unidirectional data lanes is worthwhile, coupled with a deep revision of DRAM chip interfaces
26DRAM for wide 3D interface
[Figure: DRAM array addressed by ROW Address [12:0] and Column Address [11:0]; a read latch drives DOUT (1 to N words) and a write buffer takes DIN (1 to N words)]
27Scalability bottleneck
- Single controller becomes a bottleneck even with many slave ports
- All cores need to reach it! Even using NoC interconnect, the latency price is high
- Internal management of multiple slave interfaces and many transaction queues creates complexity bottlenecks
- This approach does not exploit the possibility of fine-grain distribution of 3D vias
- Creates a single point of failure
28Multiple 3D DRAM interfaces
- Relieves single-controller bottleneck
- Cores have a friendly neighbor controller
- Memory is fully accessible to everybody
- Notion of vicinity in memory space
- Not without issues
- Area cost is increased (some hw sharing of memctrl is lost)
- Many points of entry in the memory dies. Need a regular pattern of memory access ports for commodity 3D RAM
29IMIS 1.0 Example
- Intimate memory interface specification
PORT
Chip footprint
80x19 cells, pitch 24µm
30A case study
- NoC Based Scalability and Modularity
- Increase QoS, Predictability and Bandwidth
[Figure: multi-layer stack. Logic layers hold cores with L1 caches, connected through network interfaces (ni) and NoC switches (SW); memory controllers (MC, MCU) reach the stacked DRAM through TSVs; FP, GPRs, PLL and IO blocks are also shown]
Optional low-latency, high-BW channel
31A promising approach
- A high-level analysis for GP architectures
- Traditional 2-Level cache hierarchy
G. Loh, "3D-Stacked Memory Architectures for Multi-Core Processors", ISCA 2008
32Looking forward
- More in general, multiple application-specific dies
- Logic-prevalent process → lots of metal, fast and leaky transistors, area-inefficient, large library of cells
- Memory-prevalent process → few levels of metal, low-leakage transistors, few highly specialized cell generators
- From SoC to ML-SoC
- Lots of opportunities! More degrees of freedom to achieve reliability, low quiescent power, low energy
- If TSV technology becomes really stable (high yield, low cost)
333D NoCs for MLSoCs
- Designing NoCs for 3D ICs: big challenge
- Which topology, switches on what layer and floorplan locations?
- Meet application constraints
- Bandwidth, latency
- Meet 3D technology constraints
- Maximum available TSV constraint
- Communication between adjacent layers
- NoC floorplan considering 3D layers
Automating 3D NoC design is essential!
342D vs 3D Synthesis
3D technology TSV constraints
352D vs 3D Synthesis
3D technology TSV constraints
TSV constraints: 7 links
362D vs 3D Synthesis
- TSV constraint: important factor in determining topology
- Addressing 3D floorplanning of NoC also crucial
- Additional constraint: only links across adjacent layers
Several new issues in 3D NoC synthesis!
37Contributions 3D NoC Flow
Communication characteristics
- Application bandwidth requirements
- Latency constraints
- Message type of traffic flows
3D Specs
- Core assignment to layer in 3D
- Optionally, floorplan of cores in each layer
Technology constraints
- Max. TSVs across adjacent layers
- Constraint on links only across adjacent layers
User objectives
- Power consumption
- Latency
NoC Topology Synthesis
- NoC area models
- NoC power models
- Vertical link power, latency models
Output: application-specific 3D NoC
383D NoC Flow
- Features
- Deadlock removal (routing and message-dependent), intra and inter layer
- Floorplan of network components layer by layer
- Meet 3D technology constraints
- Design trade-offs possible
- Inputs
- IP, communication specs
- Layer assignment, placement of cores
- Bandwidth, latency constraints of flows
- NoC area, power models
- Maximum inter-layer links (TSV constraint)
39Synthesis Approach
40Communication Abstraction
Synthesize best topology
Build communication graph based on application
specs
Edge weight: $\mu \cdot bw/max\_bw + (1 - \mu) \cdot min\_lat/lat$, where $\mu$ is a scaling parameter varied by the algorithm
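A minimal Python sketch of the weighting on this slide follows. It assumes the two terms are summed (the operator is missing in the transcript) and that the result is used as the weight of an edge in the communication graph; the function name and the example numbers are illustrative.

```python
def edge_weight(bw: float, lat: float, max_bw: float, min_lat: float, mu: float) -> float:
    """Blend normalized bandwidth and latency tightness into one weight.

    bw, lat   : bandwidth and latency constraint of this flow
    max_bw    : largest bandwidth over all flows
    min_lat   : tightest latency constraint over all flows
    mu        : scaling parameter in [0, 1]; 1 means bandwidth only, 0 means latency only
    """
    if not 0.0 <= mu <= 1.0:
        raise ValueError("mu must lie in [0, 1]")
    return mu * bw / max_bw + (1.0 - mu) * min_lat / lat


# Hypothetical flow: half the peak bandwidth, twice the tightest latency bound.
for mu in (0.0, 0.5, 1.0):
    print(mu, edge_weight(bw=100.0, lat=4.0, max_bw=200.0, min_lat=2.0, mu=mu))
```

Sweeping mu lets the synthesis step explore partitions that favor high-bandwidth flows, latency-critical flows, or a mix of both.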
41Core to Switch Assignment
- Build local partitioning graphs (LPG)
- Layer-by-layer NoC design
[Figure: local partitioning graphs LPG 0 and LPG 1, one per layer, with ARM and M cores as nodes and weighted edges (0.2, 0.5, 1.0)]
42Path Computation
- In 3D, 2 important constraints to be met
- TSV, maximum switch size (frequency)
Trade-offs of TSV count (yield) Vs
power-performance possible
43Placement of Switches
Layer assignment of switches
Floorplan of each layer
[Figure: floorplans of Layer 1 and Layer 0]
Initial placement of cores taken as input
44Placement of Switches
Layer 1
Layer 0
Switches inserted to minimize distance (weighted
by bandwidth)
Solved as a Linear Program
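The slide solves switch placement as a linear program. The sketch below does not reproduce that formulation. It covers only the special case of one switch and Manhattan distance, where the bandwidth-weighted optimum is known to be the weighted median of the core coordinates on each axis. The function names and the core positions are hypothetical.

```python
def weighted_median(coords, weights):
    """Return a coordinate that minimizes sum(w * |x - c|) over the given points."""
    total = sum(weights)
    running = 0.0
    for coord, weight in sorted(zip(coords, weights)):
        running += weight
        if running >= total / 2.0:
            return coord
    raise ValueError("need at least one point")


def place_switch(cores):
    """cores: list of (x, y, bandwidth) tuples for the cores attached to one switch.

    Returns the (x, y) position minimizing bandwidth-weighted Manhattan distance.
    """
    xs = [c[0] for c in cores]
    ys = [c[1] for c in cores]
    bws = [c[2] for c in cores]
    return weighted_median(xs, bws), weighted_median(ys, bws)


# Made-up cores: the high-bandwidth core at (4, 6) pulls the switch onto itself.
print(place_switch([(0, 0, 1.0), (10, 0, 1.0), (4, 6, 3.0)]))
```

A full flow would place many switches at once and keep them clear of cores and TSV areas, which is where a linear program becomes the natural tool.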
45Experiments
46Example Layer Layout
- Vertical links are laid out as floorplan obstructions
- NoC components based on xpipes library
47Multi-media Case Study
- Triple video object plane decoder (TVOPD)
- 38 cores, a lot of pipeline traffic flow
- Core layer assignment, floorplan given as inputs
48Generated Topology
49Multi-media Case Study
Floorplan of each layer
50Comparisons with Meshes
- Different SoC benchmarks: 36 to 65 cores
- Model bottleneck (shared memory), local memory, pipeline benchmarks
- Meshes optimized for application traffic
Proposed method: 38% reduction in power, 25% reduction in latency
51TSV constraints
- TSV constraints → inter-layer link constraints
- Tighter constraints give poorer latency and power
- Trade-off exploration possible with our algorithm
- Run time: a few hours for all benchmarks
52Conclusions
- NoCs are critical for 3D ICs
- Scalable, modular, support technology constraints
- Presented synthesis approach for 3D NoCs
- Topology generation, floorplanning
- Large improvements in power, delay compared to existing solutions
- Not the optimal solution for e.g. tightly coupled memories
- Need fast NoC bypass, e.g. NEC solution at ISSCC 2009