Title: 3D Integration for High-Performance Processor Microarchitectures
1. 3D Integration for High-Performance Processor Microarchitectures
2. Outline
- Motivation
- 3D - what is it?
- 3D - what can we do with it?
- Looking toward the future
3. Wire Scaling Problems
Devices are getting faster...
Wires shrink, too...
... but their delays aren't improving much
4. Wire Scaling Problems
[Figure: one clock cycle divided into device delay and wire delay at 180nm, 130nm, 90nm, and 65nm; the wire-delay fraction grows at each node (not to scale)]
For some designs, wire delay already accounts for ~50% of a clock cycle.
(Fetzer and Orton, ISSCC '02)
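The trend in the figure can be sketched with a toy model: assume gate (FO4) delay scales roughly linearly with feature size while a fixed-length global wire's delay stays roughly constant. All of the constants below are illustrative assumptions, not data from the talk.

```python
# Toy model: gate delay shrinks with the process node, a fixed-length
# wire's delay does not, so the wire eats a growing share of the cycle.
nodes_nm = [180, 130, 90, 65]

for n in nodes_nm:
    gate_ps = 0.12 * n                  # assumed: FO4 delay ~linear in feature size
    wire_ps = 12.0                      # assumed: fixed-length global wire
    cycle_ps = 12 * gate_ps + wire_ps   # assumed: 12 FO4 of logic per stage
    print(f"{n}nm: wire delay is {wire_ps / cycle_ps:.0%} of the cycle")
```

The exact constants do not matter; any model in which gate delay shrinks faster than wire delay reproduces the growing wire fraction shown on the slide.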
5. Power Scaling Problem
[Figure: power density (W/cm²) on a log scale from 1 to 1000 versus process node from 1.5µm down to 65nm; points for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, P4, Prescott, and Core 2 Duo, with "Hot Plate" and "Nuclear Reactor" reference levels and a "Dohh!" annotation]
Data: Intel, sandpile.org
Images: http://www.phys.ncku.edu.tw/htsu/humor/fry_egg.html, http://capefeare.com/homerdoh.gif, http://research.esd.ornl.gov/EMBYR/fire-crop.gif
6. Power Scaling Problems
- Peak power
- Power density
- Average power
Consequences: giant heatsinks, liquid cooling, HVAC cooling, 1000W PSUs, power costs, battery life (and weight), environmental noise
[Images: Coolmax PS-CTG1000 PSU, Toshiba PA3107U-1BRS battery, CoolIT Freezone CPU cooler, Scythe Ninja PLUS SCNJ-1100P CPU cooler; bose.com, laptop.org, georgiapower.com, www.hostcountryusa.com/images/aircondition.jpg]
7. Technology Scaling Problems
- Physical limits to silicon, lithography, etc.
... how far? 45nm, 32nm, 22nm, 16nm, 11nm, 8nm, 5nm?
Many other challenges...
The limits are inevitable, but we've been delaying the inevitable for a while now
8. 3D Integration
- One possible way to delay the inevitable for several generations
Moore's Law: the number of transistors on a chip doubles every two years
- Traditionally accomplished by reducing feature size
- Can achieve the same density by stacking in the third dimension
Gordon Moore, www.intel.com (not a direct quote!)
9. High-Level Benefits
- Wire is a major source of circuit delay and energy consumption
[Figure: in 2D, a driver reaches its load over millimeters of metal; placed and routed in 3D across Layer 1 and Layer 2, the same driver reaches its load over microns of metal]
- Can place and route circuits in 3D: a simultaneous reduction in latency and energy
- Can also mix-and-match process technologies (e.g., stack DRAM on CMOS)
10. Manufacturing Assumptions
(face-to-face example)
- Wafer 1: thinned to 10s of microns; etched through to connect power and I/O
- Wafer 2: unthinned, for structural support; the heat sink attaches to this side
- Compatible with current manufacturing
- 3D bonding is a relatively straightforward BEOL process
- Coarser connections
- Requires backside etching
This approach is currently being pursued by Intel, IBM, and others
11. Key Parameters
- For this talk, we mostly assume face-to-face wafer bonding
[Figure: Wafer 1 (thinned to 10-20 µm) bonded face-to-face onto Wafer 2, with TSVs through the thinned backside and d2d vias between the faces]
- Two types of vertical interconnect:
  - die-to-die (d2d) vias, in between the faces
  - through-silicon vias (TSVs), on the backside of the thinned die
d2d latency is fast: well under one gate delay
(www.tezzaron.com)
12. Not a Device Physics/Fab Talk
The computer architecture question: given 3D integration, what can (and should) we build?
(Image: Intel Corporation)
13. 3D Landscape
[Figure: the 3D design space along three axes. Integration heterogeneity: from CMOS-only, to mixed process (e.g., CMOS+DRAM), to block-on-block, gate-on-gate, and transistor-on-transistor stacking. Granularity of stacking: 2D cores vs. 3D cores. Number of cores: single-core, multi-core, many-core]
14. Circuit Level
- Example: SRAMs
  - Used for caches, register files, branch predictors, register renamers, etc.
- Questions:
  - How to organize in 3D?
  - Benefits?
[Figure: a 2D SRAM array with a row decoder, wordline wire delay across the rows, bitline wire delay down the columns, then column mux, sense amp, and drive out]
15. 3D Organizations (1)
Bitlines split between layers:
- Bitline wire length halved
- Less wire in the row decoder
- Reduces both latency and energy consumption
16. 3D Organizations (2)
Wordline split between layers:
- Half-length wordline
- Less wire in the column mux
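Why halving a bitline or wordline helps more than linearly on the wire itself: a distributed RC line's delay grows roughly quadratically with length (an Elmore-style model). The per-unit resistance and capacitance below are illustrative assumptions, not values from the talk.

```python
# Elmore-style delay of a distributed RC wire: t ~ 0.38 * R_total * C_total.
# Both totals scale with length, so delay scales with length^2.
def wire_delay_ps(length_mm, r_per_mm=1000.0, c_ff_per_mm=200.0):
    r_total = r_per_mm * length_mm               # ohms (assumed)
    c_total = c_ff_per_mm * length_mm * 1e-15    # farads (assumed)
    return 0.38 * r_total * c_total * 1e12       # picoseconds

full = wire_delay_ps(1.0)   # full-length 2D bitline (assumed 1 mm)
half = wire_delay_ps(0.5)   # bitline split across two layers
print(half / full)          # the distributed-wire term drops ~4x
```

Total access latency improves far less than 4x, since decoder, sense-amp, and drive-out delays are unaffected.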
17. Granularity of Stacking
[Figure: an original 2D layout, then progressively finer-grained 3D stacking: cache on cores, banks on banks, bitcells on bitcells]
Fine-grained stacking provides more wire reduction, but also requires more d2d vias
18. Empirical Results
[Figure: latency (ps) and energy (pJ) for SRAM organizations optimized for latency]
For a 4-layer stack: ~30% lower latency and ~10% lower energy
(Puttaswamy and Loh, ICCD 2005; Puttaswamy and Loh, HPCA 2007)
19. Datapath/Layout Level
Every ALU can forward its result to any other ALU:
- There are n ALUs → 2n muxes
- Each mux chooses from n results; the network must support n results
- Each mux is O(n) on a side
- Total area for the bypass network: O(n) x O(n) x 2n = O(n³) area, O(n²) wire length
20. Bit-Split Datapath
- Place bits by significance or interleaved
- n inputs and n results are each split over L layers
- There are still n ALUs → 2n muxes
- Each mux is now O(n/L) on a side
- Total area for the bypass network: O(n/L) x O(n/L) x 2n = O(n³/L²) area, O(n²/L) wire length
For L = 2 layers, area decreases to 0.25x of the original (a 75% reduction!)
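The area and wire arithmetic above can be checked with a small script; the constants are arbitrary, only the ratios matter.

```python
# Bypass-network scaling: 2n muxes, each O(n/L) x O(n/L) on a side.
def bypass_area(n, layers=1):
    return (n / layers) * (n / layers) * 2 * n   # O(n^3 / L^2)

def bypass_wire(n, layers=1):
    return (n / layers) * 2 * n                  # O(n^2 / L)

n = 8
print(bypass_area(n, layers=2) / bypass_area(n, layers=1))  # 0.25 -> the 75% reduction
print(bypass_wire(n, layers=2) / bypass_wire(n, layers=1))  # 0.5
```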
21. Bit-Split Datapath
No die-to-die vias are needed in the bypass network, since result bit i will never be forwarded to input bit j (for i ≠ j).
The ALUs themselves may need some die-to-die vias (e.g., for carry propagation).
22. Empirical Results
- 64-bit datapath (2D, one layer): baseline
- 32 bits per layer (3D, two layers): -30.9% latency, -29.2% energy
- 16 bits per layer (3D, four layers): -47.2% latency, -44.9% energy
(Puttaswamy and Loh, HPCA 2007)
23. Thermals?
- 3D stacking can improve wire delay and power
- ... but power density may still increase
- ... which can lead to higher chip temperatures
24. Microarchitecture Level
- Arrange blocks/circuits to:
  - Reduce power density
  - Keep power close to the heat sink
  - ... while still reducing wire (for latency and power)
First, an observation: most integer operations use small values.
Stored as 64-bit operands, values like 0b1110, 0b1001, or 0b10111 have all-zero upper bits: most bits are zeros, and not used/useful.
25. Significance-Partitioned Datapath
- Bypass network and ALUs are similar to earlier...
- ... but everything is split by significance: bits 0-15, bits 16-31, bits 32-47, bits 48-63
26. Thermal Herding
[Figure: a four-layer stack holding bits 0-15, 16-31, 32-47, and 48-63, with the bits 0-15 layer closest to the heat sink. An ADD of two small values keeps the upper three layers inactive; an ADD of wide values activates all four layers]
Most of the time, only the one layer closest to the heat sink is active.
Circuits are implemented in 3D, so latency and power are reduced whether one layer or all layers are active.
27. Width Prediction
- To clock-gate a block, we need to know to gate it early enough
- By the time we know that a value only needs a few bits, it's too late to clock-gate the RF
[Figure: a significance detector records a low-width bit when R5 is written (e.g., 0x00000013); on a later LW, the predictor predicts low-width before the register file is read, and the pipeline stalls and re-reads if necessary]
Dynamic width prediction can be very accurate (98% on average) [Loh, MICRO '02]
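A minimal sketch of the idea (not the paper's exact design; the threshold, per-register table, and default prediction are all assumptions): record each register's observed width when it is written, and predict narrowness on the next read so only the low-order layer needs to be activated.

```python
# Hedged sketch of width prediction: detect narrow values at write time,
# predict at read time, and let a misprediction stall and re-read.
def significant_bits(value):
    # width needed for an unsigned value; an assumption for illustration
    # (a real design would also consider sign extension)
    return max(1, value.bit_length())

class WidthPredictor:
    def __init__(self, threshold=16):
        self.threshold = threshold
        self.narrow = {}                  # last observed narrowness per register

    def predict(self, reg):
        return self.narrow.get(reg, True) # assume narrow until proven wide

    def update(self, reg, value):
        self.narrow[reg] = significant_bits(value) <= self.threshold

p = WidthPredictor()
p.update("R5", 0x13)             # 0x00000013 needs only 5 bits
print(p.predict("R5"))           # True: only the low-order layer is read
p.update("R5", 0x8910265539)     # a wide value
print(p.predict("R5"))           # False: a full-width read is needed
```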
28. Overview of Other Components
- Significance-partitioned using width prediction
- 3D-stacked, but no explicit thermal control
- Lookup/update split
- Entry-partitioned
- Port-stacked
29. Performance/Power Results
- Resulted in +47.9% clock speed and +47.0% performance
- Wire reduction due to 3D is significant (-19.3% power)
- Benefits are even greater with thermal herding (-28.6% power)
(Puttaswamy and Loh, HPCA 2007)
30. Thermals
Thermals are a challenge for 3D processors, but not an insurmountable problem or a show-stopper.
- Naive stacking increases the hotspot by 19°C; thermal herding (TH) reduces this to only 8°C, and TH can even reduce the temperature below 2D
- The remainder may require V/f scaling, better cooling, etc.
Results collected with UVa HotSpot 3.0. Preliminary results with an Intel thermal modeling tool indicate Thermal Herding is even more effective.
(Puttaswamy and Loh, HPCA 2007)
31. Obvious: Stack DRAM on CPU
3D-stacked DRAM, wider bus: Liu et al., IEEE D&T 2005; Loi et al., DAC 2006; Kgil et al., ASPLOS 2006
[Image: Tezzaron FaStack 3D DRAM]
32. Simple 3D DRAM Stacking Works
- Very low latency between the core and the MC, and between the MC and DRAM; the MC runs at core speed
- Greatly reduced contention on the bus between the core and the MC
- Actual DRAM array access latency is reduced
33. Still Using a 2D Interface
- Previous approaches showed decent gains, but still used a traditional interface to memory
- This interface only needs ~100 bits...
- ... but die-stacking provides many 1000s of connections!
34. Simple Modifications
- Increase the number of ranks
- Add more row buffer entries
- Add more memory controllers
35. Performance Impact
[Figure: speedups of +33.8% and +30.6% from the modifications]
Total of +74.7% over the previous 3D DRAM (3.8x over 2D)
A better MSHR design gives a further +17.8% (+106% over the previous 3D)
36. 3D Memory Stacking Summary
- Doing the simple thing works
- Opening up the interface gets you more
- Adjusting the microarchitecture to match the new
interface gets you even more
These general lessons can be applied to other
components that you want to implement in 3D
37. Research Summary
- Exploring 3D integration at many levels of microprocessor design
- Additional transistors and shorter wires can be converted into performance and power benefits
- Increasing the utilization of the 3D interface yields better results
- Simple approaches are beneficial, too (low-hanging fruit)
- Still a lot of open research:
  - 3D for new microarchitectures
  - More opportunities for mixed-process integration
38. What Else?
- Other open research problems in 3D:
  - Reliability (di/dt noise, electromigration)
  - Parametric variations, yield
  - Tools (CAD/EDA)
  - Test, DFT, and debug
[Figure: reliability illustration from http://www.bo.imm.cnr.it/researchs/elettronica/index_files/rel_res.htm]
39. Summary
- 3D can delay the end of Moore's Law
  - This is worth many billions of dollars
  - More importantly, it enables other areas of computation to continue their advancements
- Success depends on (among other things) new architectures that exploit the 3D technology while dealing with the challenges
40. Acknowledgments (People)
- GA Tech
  - Prof. Hsien-Hsin S. Lee, Prof. Sung Kyu Lim
  - Kiran Puttaswamy, Dean Lewis, Michael Healy, Dae Hyun Kim
- Tutorials
  - MICRO'06: Yuan Xie (Penn State), Bryan Black (Intel), John Devale (Intel), Kerry Bernstein (IBM)
  - ISCA'08: Jian Li (IBM), Jerry Bautista (Intel), Jason Cong (UCLA), Hsien-Hsin Lee (GT)
- Intel
  - Bryan Black, John Devale, Jeff Rupley, Ned Brekelbaum, Don McCauley, Paul Reed
41. Acknowledgments (Other)
- Funding
- SRC/FCRP C2S2
- NSF CAREER
- State of Georgia
- Equipment
- Intel Corporation
42. More Info
- www.3D.GATECH.edu
- Starting points for 3D architecture:
  - Loh, Xie, Black, "Processor Design in Three-Dimensional Die-Stacking Technologies," IEEE Micro magazine, May/June 2007
  - Xie, Loh, Black, Bernstein, "Design Space Exploration for 3D Architectures," ACM Journal on Emerging Technologies in Computing Systems, 2(2), pp. 65-103, April 2006
43. >>>> BACKUP SLIDES <<<<
44. Example Bonding Process
1. Manufacture dies/wafers separately
2. Deposit Cu via stubs
3. Thermocompression bonding
4. CMP backside thinning (~10 µm)
5. Backside etching for power, ground, I/O
6. Dice, package, etc. (heat spreader, heat sink, etc.)
45. d2d Via Latency
- 1mm of wire: ~225 picoseconds
- Die-to-die delay: ~8 picoseconds (<< 1 gate delay/FO4)
- FO4 delay: ~22 picoseconds (approximately one gate delay)
- RC_d2d ~ 0.35 x RC_viastack
HSpice, (B)PTM 70nm transistor model
(Sources: Puttaswamy and Loh, ICCD 2005; Intel Corporation)
46. Intel Stack Example
[Figure: cross-section showing a d2d via. Source: Intel]
47. X-SEM of Bond Structure
Good bonding is indicated when no voiding or seam is seen between the metal pieces.
(Source: Intel)
48. 300mm Wafer Bonding
- Scanning Acoustic Microscopy images (CSAM)
- Non-optimal bonding: time/temperature
- Good bonding across the full 300 mm wafer
Ref: Morrow et al., "Wafer-level 3D interconnects via Cu bonding," Proc. AMC, 125-130 (2004)
49. 3D Bond Test Configuration
50. Sample Chain Results
- Resistance measured using backside through-silicon vias; each point is a chain of 4096 links
- Obtained tight distributions of resistances with high yield
- Measurements agree with values calculated from geometry (modeling 5% layer thickness variation)
- Negligible contribution from bond interface resistance
Ref: Morrow et al., "Wafer-level 3D interconnects via Cu bonding," Proc. AMC, 125-130 (2004)
(Source: Intel)
51. Individual Device Parameters
- Wafer-level testing of NMOS and PMOS
- No difference between thin stacked wafers and non-bonded wafers
- Thin/stacked outliers are due to patterning issues with testing pads on the backside
Ref: P. R. Morrow et al., "Three-dimensional Wafer Stacking via Cu-Cu Bonding Integrated with 65 nm Strained-Si/Low-k CMOS Technology," IEEE Electron Device Letters, 27(5), 335-337 (2006).
52. Yield of Logic+Logic Stacking
[Figure: Planar: 1 wafer = 10 good die, 2 wafers = 20 good die. 3D: 2 wafers = 22 good die]
- Half-size die:
  - Increases individual die yield
  - Dramatically increases die count (edge effect)
- Bonding a slow die to a fast die:
  - A tight process will have a small impact
  - A simple e-test pre-sort can eliminate the impact
(Source: Intel)
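A hedged sketch of the yield argument, using a Poisson defect model and a standard dies-per-wafer approximation; the defect density, die areas, and wafer size below are all assumptions for illustration.

```python
# Why two stacked half-size die can beat one planar die: smaller die
# yield better AND waste less wafer edge.
import math

def die_yield(area_cm2, d0=0.5):
    # Poisson defect model: Y = exp(-D0 * A); D0 is an assumed defect density
    return math.exp(-d0 * area_cm2)

def dies_per_wafer(area_cm2, diam_cm=30.0):
    # common approximation including an edge-loss term
    return (math.pi * (diam_cm / 2) ** 2 / area_cm2
            - math.pi * diam_cm / math.sqrt(2 * area_cm2))

full, half = 2.0, 1.0   # cm^2: one full-size die vs. two stacked half-size die
planar = 2 * dies_per_wafer(full) * die_yield(full)     # two wafers, planar
stacked = dies_per_wafer(half) * die_yield(half) ** 2   # pairs bonded blindly
print(planar, stacked)  # stacked comes out ahead
```

Under a pure Poisson model, blindly stacking two half-size die gives exactly the planar per-unit yield, so the gain here comes from the edge effect; an e-test pre-sort (bonding only known-good die) would add to it, as the slide notes.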
53. Die vs. Wafer Stacking
[Images: die stacking with TSVs, wafer stacking. Source: Intel]

                      Die Stacking     Wafer Stacking
Possible Application  Logic + Memory   Logic + Logic
TSV Size              ~50 µm           <5 µm
Thickness             ~100 µm          ~10 µm
Bonding Structure     Bump size        <5 µm
Bonding Pitch         Bump pitch       <8 µm

Wafer stacking is currently limited by 300mm 3-sigma alignment capability.
54. Dealing with Large d2d Pitches
- For a split wordline, we need one d2d via per wordline
- What if the d2d via pitch is too large for one via per wordline?
- So long as the total d2d bandwidth is low enough, we can use dovetailed layouts
55. Power Density Estimates
- 130nm P4
  - 3.4 GHz, µPGA478, 0.13 µm, HTT
  - 109W peak
  - 131mm²
  - 109W / 1.31cm² = 83 W/cm²
- 90nm P4 (Prescott)
  - 3.6 GHz (Model 560), LGA775, 0.09 µm, HTT
  - 151W peak power
  - 112mm²
  - 151W / 1.12cm² = 134 W/cm²
Dohh!
Best guess... data on sandpile.org is somewhat difficult to correlate
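The slide's arithmetic, checked in code (figures taken directly from the bullets above):

```python
# Peak power / die area -> power density, as in the bullets above.
for name, watts, area_mm2 in [("130nm P4", 109, 131),
                              ("90nm Prescott", 151, 112)]:
    w_per_cm2 = watts / (area_mm2 / 100.0)    # 100 mm^2 = 1 cm^2
    print(f"{name}: {w_per_cm2:.1f} W/cm^2")  # 83.2 and 134.8 W/cm^2
```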
56. >>>> MICRO 2006 <<<<
57. Core-Level Design
- Simplest approach:
  - Core-on-core (likely to have severe thermal problems)
  - Cache-on-core (the LLC is relatively low power, so thermal impact should be limited)
- Reuse existing 2D designs
- A first step in an evolutionary path to new 3D processors
58. Options
- Use 3D to build larger last-level caches (LLC)
[Figure: Core 1 + Core 2 with a 4MB L2 cache (SRAM)]
- 2D baseline: original 4MB
- 3D (12MB): stack 8MB (SRAM) on the original 4MB
- 3D (32MB): stack 32MB (DRAM); original 4MB used for fast tags
- 3D (64MB): stack 64MB (DRAM); original 4MB removed
59. Performance Impact
[Figure: instructions per cycle (0-1.4) and off-chip bandwidth (0-14 GB/s) for 4MB, 12MB, 32MB, and 64MB caches across the Intel RMS parallel/multi-core benchmarks: conj, dSym, gauss, pcg, sMVM, sSym, sTrans, spgAVDF, spgAVIF, spgUS, svd, svm, and their average]
- Significant performance increase
- Significant reduction in off-chip BW
(Black et al., MICRO 2006)
60. Thermal Impact
[Figure: peak temperature (°C) for the 2D 4MB, 3D 12MB, 3D 32MB, and 3D 64MB configurations; all fall between 88.35°C and 92.85°C]
No major thermal issues
(Black et al., MICRO 2006)
61. Baseline Core 2 Duo Thermals
[Figure: thermal map of Core 1 and the 4MB L2 cache, with FP, Ld/St, and RS regions labeled. Coolest: 59°C; hottest: 88.4°C]
The temperature drop at the edge is due to an epoxy fillet.
62. Thermal Cost of DRAM on CPU
[Figure: peak temperature (°C, roughly 80-89) for CPU (2D), CPU (3D), DRAM on CPU (3D), DRAM on CPU - I/O (3D), and CPU on DRAM, with power splits of 86W+6.2W, 92W+6.2W, 92W+0W, and 92W annotated. Annotations: a ~3.8° increase for stacking DRAM; CPU on DRAM is actually better]
63. 3D-Stacked P4
- Planar die area is 50%
- Eliminates RC delay and drivers; reduces driver sizing
- 25% of pipe stages removed
- 15% perf. improvement from pipestage elimination
- 15% power improvement from clock elimination and pipestage elimination
- Splitting FUBs and placement benefits are additive
[Figure: top and bottom layers of the folded floorplan]
Black et al., "3D Processing Technology and Its Impact on iA32 Microprocessors," Proceedings of the International Conference on Computer Design, October 2004
64. P4 3D Tradeoff Space

                  Power (W)  Pwr   Temp  Perf  Vcc   Freq
Baseline            147      100%   99   100%  1.00  1.00
Same Power          147      100%  127   129%  1.00  1.18
Same Frequency      125       85%  113   115%  1.00  1.00
Same Temperature     97.3     66%   99   108%  0.92  0.92
Same Performance     68.2     46%   77   100%  0.82  0.82

Perf vs. Freq: 0.82% performance per 1% frequency. Freq vs. Vcc: 1% per 1% in Vcc.
(Black et al., MICRO 2006)
65. >>>> ISCA 2008 <<<<
66. Estimate for # of DRAM Layers
- Samsung K4T51083QE DDR2 SDRAM
  - 10.9Mb/mm² (1.36MB/mm²) in 80nm
  - 27.9Mb/mm² (3.5MB/mm²) in 50nm
- Quad-core in 45nm: 200-300mm²
  - Dual-core Penryn 107mm² → quad-core 214mm²
  - Quad-core Barcelona 285mm²
- 1GB per layer → 294mm² per DRAM layer
  - Should be slightly less, since peripheral logic is placed on a separate layer
  - Else, 512MB per layer (16 DRAM layers + 1 peripheral-logic layer)
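The density arithmetic above, as a quick script (densities taken from the slide; the slide's 294mm² presumably includes a little overhead beyond the raw division):

```python
# Area required for a 1GB DRAM layer at the two quoted densities.
for node, mb_per_mm2 in [("80nm", 1.36), ("50nm", 3.5)]:
    area = 1024 / mb_per_mm2    # mm^2 for 1GB (1024MB)
    print(f"{node}: {area:.0f} mm^2 per 1GB layer")
# The 50nm result (~293 mm^2) is in line with the slide's 294 mm^2 figure.
```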
67. Thermal Impact
- Results based on the UVA HotSpot thermal simulator
- Performance modeling already accounts for a 32ms refresh rate (as opposed to 64ms)
- Numbers are pessimistic in that we assume max power consumption for both CPU and DRAM (e.g., if the DRAM is running at full tilt, then the CPU is probably mostly idling; if the CPU is running at full tilt, then it is probably mostly hitting in cache, and DRAM activity would be low)
68. Baseline Config and Workloads
69. >>>> CF 2008 <<<<
70. 3D Benefits and Costs
- Benefits
  - More processor resources (larger caches, functional units, buffers) → higher performance
  - Additional functionality
    - On-stack voltage converters, noise control, profiling, ...
- Costs
  - More layers of silicon → $
  - Additional fab steps (and equipment) → $
  - Higher temps → better cooling → $
71. Benefits Not For Everybody
- Some markets can't afford additional costs
- Other design points may already be thermally limited
- May not want 3D in these markets
- Other markets (server, gaming) may be quite willing to deal with 3D costs for the benefits
72. Converged Design Methodology
- One microarchitecture for all segments
[Figure: a base core microarchitecture (e.g., the base K8 microarchitecture) feeding the budget, mobile, desktop, server, and gaming segments]
- Reduces design costs
- Differentiate products via speed/power binning, cache size, # of cores
Naively using 3D for only high-end products breaks the One Design methodology
73. Goal/Motivation
- Use 2D as the default (for low-end, mobile, etc.)
- Make 3D optional (for high-end market segments)
- Maintain a single overall design
- Stackable Microarchitecture approach:
  - The baseline 2D processor has everything it needs to stand alone
  - Optional 3D layers augment the conventional baseline structures
  - 3D layers snap on for segments that need them
74. Stackable SRAMs
- For caches, predictors, register files, etc.
[Figure: Banks 0-3 split across layers. With 2 banks, address bit a_i selects the bank; with 4 banks, bits a_i and a_(i-1) together select among the banks]
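A sketch of the address decoding this implies (names and bit positions are illustrative assumptions, not the paper's design): each layer's configuration determines how many address bits select a bank, so the same layout works whether one, two, or four banks are stacked.

```python
# Hedged sketch: the number of bank-select bits grows with the number of
# stacked banks; with one bank, no select bits are consumed.
def bank_select(addr, num_banks, index_bits=10):
    # bank chosen by the bits just above the (assumed 10-bit) row index
    bank_bits = num_banks.bit_length() - 1       # log2(num_banks)
    return (addr >> index_bits) & ((1 << bank_bits) - 1)

addr = 0b11_0101010101          # 12-bit example address
print(bank_select(addr, 1))     # 0: single bank, no select bits used
print(bank_select(addr, 2))     # 1: one select bit (a_i)
print(bank_select(addr, 4))     # 3: two select bits (a_i, a_(i-1))
```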
75. Another View
76. Configuring Each Layer
- Step 1: Layer identification (Layer IDs 0-3)
- Step 2: Per-layer configuration (config bits)
Set during assembly, at boot-up, etc.
77. Modular Processor Overview
2D floorplan and parameters based on the Core 2 microarchitecture
78. Performance Results
[Figure: uops-per-cycle speedup]
79. Design Tradeoffs
80. Modular Design Conclusions
- With a partitionable design, 3D benefits can be segregated across markets
- The baseline 2D processor fulfills the needs of mass markets
- Optional, stackable layers provide value for higher-end market segments
- Can be combined with conventional speed/power binning for further product-line differentiation
81. >>>> CAD/EDA <<<<
82. Some 3D CAD/EDA Problems
- Layer partitioning
- Global routing
- Clock routing
- Decap placement/sizing
See also www.GTCAD.gatech.edu
83. 3D Layer Partitioning
- Goals & Approach
  - Partition modules and/or gates into multiple layers
  - Consider bonding styles: F2F, F2B, B2B
  - Maximize inter-partition interconnect for F2F
  - Minimize inter-partition interconnect for F2B and B2B
84. 3D Global Routing
- Goals & Approach
  - Construct a 3D routing tree
  - Optimize performance (minimize Elmore delay) and power (minimize wirelength)
  - Thermal-aware through-via insertion
    - Move through-vias (which act as thermal passages) closer to hotspots
85. 3D Clock Routing
- Goals & Approach
  - Construct a 3D clock tree
  - Minimize clock skew under a non-uniform thermal profile
  - Bonding style imposes constraints on the # of through-vias
[Figure: an F2B stack vs. alternating F2F/B2B bonding]
86. 3D Decap Placement and Sizing
- Goals & Approach
  - Place modules to reduce IR-drop and the decap required
  - Allocate whitespace for decap insertion
  - Size decaps (Tox) for leakage reduction
87. >>>> OTHER <<<<
88. 3D-Stacked 21364
[Figure: the 2D 21364 core, the 3D 21364, and a detail of the 3D EBox]
- The EBox has a mix of self-stacking...
  - RF, IQ: critical wires are intra-FUB
- ... and FUB-stacking
  - EUs: critical wires are inter-FUB
89. Self-Bibliography
- Low-/Circuit-level
- DAC 2007
- GLSVLSI 2006 (2)
- ISCAS 2006
- ISVLSI 2006
- ICCD 2005
- Architecture-level
- IEEE Micro 2007
- HPCA 2007
- MICRO 2006
- JETC 2006
- ISCA 2008
- CF 2008
- Other
- TCAD 2007