Title: 3D Integration for High-Performance Processor Microarchitectures
1. 3D Integration for High-Performance Processor Microarchitectures
2. Outline
- Motivation
- 3D - what is it?
- 3D - what can we do with it?
- Looking toward the future
3. Wire Scaling Problems
Devices are getting faster...
Wires shrink, too...
... but their delays aren't improving much
4. Wire Scaling Problems
[Figure: one clock cycle divided into device delay and wire delay at 180nm, 130nm, 90nm, and 65nm; the wire-delay fraction grows at each node (not to scale)]
For some designs, wire delay already accounts for ~50% of a clock cycle.
(Fetzer and Orton, ISSCC '02)
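The trend in the figure can be sketched with a toy model: assume gate (FO4) delay scales roughly linearly with feature size while a fixed-length global wire's delay stays roughly constant. All of the constants below are illustrative assumptions, not data from the talk.

```python
# Toy model: gate delay shrinks with the process node, a fixed-length
# wire's delay does not, so the wire eats a growing share of the cycle.
nodes_nm = [180, 130, 90, 65]

for n in nodes_nm:
    gate_ps = 0.12 * n                  # assumed: FO4 delay ~linear in feature size
    wire_ps = 12.0                      # assumed: fixed-length global wire
    cycle_ps = 12 * gate_ps + wire_ps   # assumed: 12 FO4 of logic per stage
    print(f"{n}nm: wire delay is {wire_ps / cycle_ps:.0%} of the cycle")
```

The exact constants do not matter; any model in which gate delay shrinks faster than wire delay reproduces the growing wire fraction shown on the slide.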
5. Power Scaling Problem
[Figure: power density (W/cm²) on a log scale from 1 to 1000 versus process node from 1.5µm down to 65nm; points for i386, i486, Pentium, Pentium Pro, Pentium II, Pentium III, P4, Prescott, and Core 2 Duo, with "Hot Plate" and "Nuclear Reactor" reference levels and a "Dohh!" annotation]
Data: Intel, sandpile.org
Images: http://www.phys.ncku.edu.tw/htsu/humor/fry_egg.html, http://capefeare.com/homerdoh.gif, http://research.esd.ornl.gov/EMBYR/fire-crop.gif
6. Power Scaling Problems
- Peak power
- Power density
- Average power
Consequences: giant heatsinks, liquid cooling, HVAC cooling, 1000W PSUs, power costs, battery life (and weight), environmental noise
[Images: Coolmax PS-CTG1000 PSU, Toshiba PA3107U-1BRS battery, CoolIT Freezone CPU cooler, Scythe Ninja PLUS SCNJ-1100P CPU cooler; bose.com, laptop.org, georgiapower.com, www.hostcountryusa.com/images/aircondition.jpg]
7. Technology Scaling Problems
- Physical limits to silicon, lithography, etc.
... how far? 45nm, 32nm, 22nm, 16nm, 11nm, 8nm, 5nm?
Many other challenges...
The limits are inevitable, but we've been delaying the inevitable for a while now
8. 3D Integration
- One possible way to delay the inevitable for several generations
Moore's Law: the number of transistors on a chip doubles every two years
- Traditionally accomplished by reducing feature size
- Can achieve the same density by stacking in the third dimension
Gordon Moore, www.intel.com (not a direct quote!)
9. High-Level Benefits
- Wire is a major source of circuit delay and energy consumption
[Figure: in 2D, a driver reaches its load over millimeters of metal; placed and routed in 3D across Layer 1 and Layer 2, the same driver reaches its load over microns of metal]
- Can place and route circuits in 3D: a simultaneous reduction in latency and energy
- Can also mix-and-match process technologies (e.g., stack DRAM on CMOS)
10. Manufacturing Assumptions
(face-to-face example)
- Wafer 1: thinned to 10s of microns; etched through to connect power and I/O
- Wafer 2: unthinned, for structural support; the heat sink attaches to this side
- Compatible with current manufacturing
- 3D bonding is a relatively straightforward BEOL process
- Coarser connections
- Requires backside etching
This approach is currently being pursued by Intel, IBM, and others
11. Key Parameters
- For this talk, we mostly assume face-to-face wafer bonding
[Figure: Wafer 1 (thinned to 10-20 µm) bonded face-to-face onto Wafer 2, with TSVs through the thinned backside and d2d vias between the faces]
- Two types of vertical interconnect:
  - die-to-die (d2d) vias, in between the faces
  - through-silicon vias (TSVs), on the backside of the thinned die
d2d latency is fast: well under one gate delay
(www.tezzaron.com)
12. Not a Device Physics/Fab Talk
The computer architecture question: given 3D integration, what can (and should) we build?
(Image: Intel Corporation)
13. 3D Landscape
[Figure: the 3D design space along three axes. Integration heterogeneity: from CMOS-only, to mixed process (e.g., CMOS+DRAM), to block-on-block, gate-on-gate, and transistor-on-transistor stacking. Granularity of stacking: 2D cores vs. 3D cores. Number of cores: single-core, multi-core, many-core]
14. Circuit Level
- Example: SRAMs
  - Used for caches, register files, branch predictors, register renamers, etc.
- Questions:
  - How to organize in 3D?
  - Benefits?
[Figure: a 2D SRAM array with a row decoder, wordline wire delay across the rows, bitline wire delay down the columns, then column mux, sense amp, and drive out]
15. 3D Organizations (1)
Bitlines split between layers:
- Bitline wire length halved
- Less wire in the row decoder
- Reduces both latency and energy consumption
16. 3D Organizations (2)
Wordline split between layers:
- Half-length wordline
- Less wire in the column mux
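Why halving a bitline or wordline helps more than linearly on the wire itself: a distributed RC line's delay grows roughly quadratically with length (an Elmore-style model). The per-unit resistance and capacitance below are illustrative assumptions, not values from the talk.

```python
# Elmore-style delay of a distributed RC wire: t ~ 0.38 * R_total * C_total.
# Both totals scale with length, so delay scales with length^2.
def wire_delay_ps(length_mm, r_per_mm=1000.0, c_ff_per_mm=200.0):
    r_total = r_per_mm * length_mm               # ohms (assumed)
    c_total = c_ff_per_mm * length_mm * 1e-15    # farads (assumed)
    return 0.38 * r_total * c_total * 1e12       # picoseconds

full = wire_delay_ps(1.0)   # full-length 2D bitline (assumed 1 mm)
half = wire_delay_ps(0.5)   # bitline split across two layers
print(half / full)          # the distributed-wire term drops ~4x
```

Total access latency improves far less than 4x, since decoder, sense-amp, and drive-out delays are unaffected.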
17. Granularity of Stacking
[Figure: an original 2D layout, then progressively finer-grained 3D stacking: cache on cores, banks on banks, bitcells on bitcells]
Fine-grained stacking provides more wire reduction, but also requires more d2d vias
18. Empirical Results
[Figure: latency (ps) and energy (pJ) for SRAM organizations optimized for latency]
For a 4-layer stack: ~30% lower latency and ~10% lower energy
(Puttaswamy and Loh, ICCD 2005; Puttaswamy and Loh, HPCA 2007)
19. Datapath/Layout Level
Every ALU can forward its result to any other ALU:
- There are n ALUs → 2n muxes
- Each mux chooses from n results; the network must support n results
- Each mux is O(n) on a side
- Total area for the bypass network: O(n) x O(n) x 2n = O(n³) area, O(n²) wire length
20. Bit-Split Datapath
- Place bits by significance or interleaved
- n inputs and n results are each split over L layers
- There are still n ALUs → 2n muxes
- Each mux is now O(n/L) on a side
- Total area for the bypass network: O(n/L) x O(n/L) x 2n = O(n³/L²) area, O(n²/L) wire length
For L = 2 layers, area decreases to 0.25x of the original (a 75% reduction!)
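The area and wire arithmetic above can be checked with a small script; the constants are arbitrary, only the ratios matter.

```python
# Bypass-network scaling: 2n muxes, each O(n/L) x O(n/L) on a side.
def bypass_area(n, layers=1):
    return (n / layers) * (n / layers) * 2 * n   # O(n^3 / L^2)

def bypass_wire(n, layers=1):
    return (n / layers) * 2 * n                  # O(n^2 / L)

n = 8
print(bypass_area(n, layers=2) / bypass_area(n, layers=1))  # 0.25 -> the 75% reduction
print(bypass_wire(n, layers=2) / bypass_wire(n, layers=1))  # 0.5
```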
21. Bit-Split Datapath
No die-to-die vias are needed in the bypass network, since result bit i will never be forwarded to input bit j (for i ≠ j).
The ALUs themselves may need some die-to-die vias (e.g., for carry propagation).
22. Empirical Results
- 64-bit datapath (2D, one layer): baseline
- 32 bits per layer (3D, two layers): -30.9% latency, -29.2% energy
- 16 bits per layer (3D, four layers): -47.2% latency, -44.9% energy
(Puttaswamy and Loh, HPCA 2007)
23. Thermals?
- 3D stacking can improve wire delay and power
- ... but power density may still increase
- ... which can lead to higher chip temperatures
24. Microarchitecture Level
- Arrange blocks/circuits to:
  - Reduce power density
  - Keep power close to the heat sink
  - ... while still reducing wire (for latency and power)
First, an observation: most integer operations use small values.
Stored as 64-bit operands, values like 0b1110, 0b1001, or 0b10111 have all-zero upper bits: most bits are zeros, and not used/useful.
25. Significance-Partitioned Datapath
- Bypass network and ALUs are similar to earlier...
- ... but everything is split by significance: bits 0-15, bits 16-31, bits 32-47, bits 48-63
26. Thermal Herding
[Figure: a four-layer stack holding bits 0-15, 16-31, 32-47, and 48-63, with the bits 0-15 layer closest to the heat sink. An ADD of two small values keeps the upper three layers inactive; an ADD of wide values activates all four layers]
Most of the time, only the one layer closest to the heat sink is active.
Circuits are implemented in 3D, so latency and power are reduced whether one layer or all layers are active.
27. Width Prediction
- To clock-gate a block, we need to know to gate it early enough
- By the time we know that a value only needs a few bits, it's too late to clock-gate the RF
[Figure: a significance detector records a low-width bit when R5 is written (e.g., 0x00000013); on a later LW, the predictor predicts low-width before the register file is read, and the pipeline stalls and re-reads if necessary]
Dynamic width prediction can be very accurate (98% on average) [Loh, MICRO '02]
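A minimal sketch of the idea (not the paper's exact design; the threshold, per-register table, and default prediction are all assumptions): record each register's observed width when it is written, and predict narrowness on the next read so only the low-order layer needs to be activated.

```python
# Hedged sketch of width prediction: detect narrow values at write time,
# predict at read time, and let a misprediction stall and re-read.
def significant_bits(value):
    # width needed for an unsigned value; an assumption for illustration
    # (a real design would also consider sign extension)
    return max(1, value.bit_length())

class WidthPredictor:
    def __init__(self, threshold=16):
        self.threshold = threshold
        self.narrow = {}                  # last observed narrowness per register

    def predict(self, reg):
        return self.narrow.get(reg, True) # assume narrow until proven wide

    def update(self, reg, value):
        self.narrow[reg] = significant_bits(value) <= self.threshold

p = WidthPredictor()
p.update("R5", 0x13)             # 0x00000013 needs only 5 bits
print(p.predict("R5"))           # True: only the low-order layer is read
p.update("R5", 0x8910265539)     # a wide value
print(p.predict("R5"))           # False: a full-width read is needed
```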
28. Overview of Other Components
- Significance-partitioned using width prediction
- 3D-stacked, but no explicit thermal control
- Lookup/update split
- Entry-partitioned
- Port-stacked
29. Performance/Power Results
- Resulted in +47.9% clock speed and +47.0% performance
- Wire reduction due to 3D is significant (-19.3% power)
- Benefits are even greater with thermal herding (-28.6% power)
(Puttaswamy and Loh, HPCA 2007)
30. Thermals
Thermals are a challenge for 3D processors, but not an insurmountable problem or a show-stopper.
- Naive stacking increases the hotspot by 19°C; thermal herding (TH) reduces this to only 8°C, and TH can even reduce the temperature below 2D
- The remainder may require V/f scaling, better cooling, etc.
Results collected with UVa HotSpot 3.0. Preliminary results with an Intel thermal modeling tool indicate Thermal Herding is even more effective.
(Puttaswamy and Loh, HPCA 2007)
31. Obvious: Stack DRAM on CPU
3D-stacked DRAM, wider bus: Liu et al., IEEE D&T 2005; Loi et al., DAC 2006; Kgil et al., ASPLOS 2006
[Image: Tezzaron FaStack 3D DRAM]
32. Simple 3D DRAM Stacking Works
- Very low latency between the core and the MC, and between the MC and DRAM; the MC runs at core speed
- Greatly reduced contention on the bus between the core and the MC
- Actual DRAM array access latency is reduced
33. Still Using a 2D Interface
- Previous approaches showed decent gains, but still used a traditional interface to memory
- This interface only needs ~100 bits...
- ... but die-stacking provides many 1000s of connections!
34. Simple Modifications
- Increase the number of ranks
- Add more row buffer entries
- Add more memory controllers
35. Performance Impact
[Figure: speedups of +33.8% and +30.6% from the modifications]
Total of +74.7% over the previous 3D DRAM (3.8x over 2D)
A better MSHR design gives a further +17.8% (+106% over the previous 3D)
36. 3D Memory Stacking Summary
- Doing the simple thing works
- Opening up the interface gets you more
- Adjusting the microarchitecture to match the new
interface gets you even more
These general lessons can be applied to other
components that you want to implement in 3D
37. Research Summary
- Exploring 3D integration at many levels of microprocessor design
- Additional transistors and shorter wires can be converted into performance and power benefits
- Increasing the utilization of the 3D interface yields better results
- Simple approaches are beneficial, too (low-hanging fruit)
- Still a lot of open research:
  - 3D for new microarchitectures
  - More opportunities for mixed-process integration
38. What Else?
- Other open research problems in 3D:
  - Reliability (di/dt noise, electromigration)
  - Parametric variations, yield
  - Tools (CAD/EDA)
  - Test, DFT, and debug
[Figure: reliability illustration from http://www.bo.imm.cnr.it/researchs/elettronica/index_files/rel_res.htm]
39. Summary
- 3D can delay the end of Moore's Law
  - This is worth many billions of dollars
  - More importantly, it enables other areas of computation to continue their advancements
- Success depends on (among other things) new architectures that exploit the 3D technology while dealing with the challenges
40. Acknowledgments (People)
- GA Tech
  - Prof. Hsien-Hsin S. Lee, Prof. Sung Kyu Lim
  - Kiran Puttaswamy, Dean Lewis, Michael Healy, Dae Hyun Kim
- Tutorials
  - MICRO'06: Yuan Xie (Penn State), Bryan Black (Intel), John Devale (Intel), Kerry Bernstein (IBM)
  - ISCA'08: Jian Li (IBM), Jerry Bautista (Intel), Jason Cong (UCLA), Hsien-Hsin Lee (GT)
- Intel
  - Bryan Black, John Devale, Jeff Rupley, Ned Brekelbaum, Don McCauley, Paul Reed
41. Acknowledgments (Other)
- Funding
- SRC/FCRP C2S2
- NSF CAREER
- State of Georgia
- Equipment
- Intel Corporation
42. More Info
- www.3D.GATECH.edu
- Starting points for 3D architecture:
  - Loh, Xie, Black, "Processor Design in Three-Dimensional Die-Stacking Technologies," IEEE Micro magazine, May/June 2007
  - Xie, Loh, Black, Bernstein, "Design Space Exploration for 3D Architectures," ACM Journal on Emerging Technologies in Computing Systems, 2(2), pp. 65-103, April 2006
43. >>>> BACKUP SLIDES <<<<
44. Example Bonding Process
1. Manufacture dies/wafers separately
2. Deposit Cu via stubs
3. Thermocompression bonding
4. CMP backside thinning (~10 µm)
5. Backside etching for power, ground, I/O
6. Dice, package, etc. (heat spreader, heat sink, etc.)
45. d2d Via Latency
- 1mm of wire: ~225 picoseconds
- Die-to-die delay: ~8 picoseconds (<< 1 gate delay/FO4)
- FO4 delay: ~22 picoseconds (approximately one gate delay)
- RC_d2d ~ 0.35 x RC_viastack
HSpice, (B)PTM 70nm transistor model
(Sources: Puttaswamy and Loh, ICCD 2005; Intel Corporation)
46. Intel Stack Example
[Figure: cross-section showing a d2d via. Source: Intel]
47. X-SEM of Bond Structure
Good bonding is indicated when no voiding or seam is seen between the metal pieces.
(Source: Intel)
48. 300mm Wafer Bonding
- Scanning Acoustic Microscopy images (CSAM)
- Non-optimal bonding: time/temperature
- Good bonding across the full 300 mm wafer
Ref: Morrow et al., "Wafer-level 3D interconnects via Cu bonding," Proc. AMC, 125-130 (2004)
49. 3D Bond Test Configuration
50. Sample Chain Results
- Resistance measured using backside through-silicon vias; each point is a chain of 4096 links
- Obtained tight distributions of resistances with high yield
- Measurements agree with values calculated from geometry (modeling 5% layer thickness variation)
- Negligible contribution from bond interface resistance
Ref: Morrow et al., "Wafer-level 3D interconnects via Cu bonding," Proc. AMC, 125-130 (2004)
(Source: Intel)
51. Individual Device Parameters
- Wafer-level testing of NMOS and PMOS
- No difference between thin stacked wafers and non-bonded wafers
- Thin/stacked outliers are due to patterning issues with testing pads on the backside
Ref: P. R. Morrow et al., "Three-dimensional Wafer Stacking via Cu-Cu Bonding Integrated with 65 nm Strained-Si/Low-k CMOS Technology," IEEE Electron Device Letters, 27(5), 335-337 (2006).
52. Yield of Logic+Logic Stacking
[Figure: Planar: 1 wafer = 10 good die, 2 wafers = 20 good die. 3D: 2 wafers = 22 good die]
- Half-size die:
  - Increases individual die yield
  - Dramatically increases die count (edge effect)
- Bonding a slow die to a fast die:
  - A tight process will have a small impact
  - A simple e-test pre-sort can eliminate the impact
(Source: Intel)
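A hedged sketch of the yield argument, using a Poisson defect model and a standard dies-per-wafer approximation; the defect density, die areas, and wafer size below are all assumptions for illustration.

```python
# Why two stacked half-size die can beat one planar die: smaller die
# yield better AND waste less wafer edge.
import math

def die_yield(area_cm2, d0=0.5):
    # Poisson defect model: Y = exp(-D0 * A); D0 is an assumed defect density
    return math.exp(-d0 * area_cm2)

def dies_per_wafer(area_cm2, diam_cm=30.0):
    # common approximation including an edge-loss term
    return (math.pi * (diam_cm / 2) ** 2 / area_cm2
            - math.pi * diam_cm / math.sqrt(2 * area_cm2))

full, half = 2.0, 1.0   # cm^2: one full-size die vs. two stacked half-size die
planar = 2 * dies_per_wafer(full) * die_yield(full)     # two wafers, planar
stacked = dies_per_wafer(half) * die_yield(half) ** 2   # pairs bonded blindly
print(planar, stacked)  # stacked comes out ahead
```

Under a pure Poisson model, blindly stacking two half-size die gives exactly the planar per-unit yield, so the gain here comes from the edge effect; an e-test pre-sort (bonding only known-good die) would add to it, as the slide notes.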
53. Die vs. Wafer Stacking
[Images: die stacking with TSVs, wafer stacking. Source: Intel]

                      Die Stacking     Wafer Stacking
Possible Application  Logic + Memory   Logic + Logic
TSV Size              ~50 µm           <5 µm
Thickness             ~100 µm          ~10 µm
Bonding Structure     Bump size        <5 µm
Bonding Pitch         Bump pitch       <8 µm

Wafer stacking is currently limited by 300mm 3-sigma alignment capability.
54. Dealing with Large d2d Pitches
- For a split wordline, we need one d2d via per wordline
- What if the d2d via pitch is too large for one via per wordline?
- So long as the total d2d bandwidth is low enough, we can use dovetailed layouts
55. Power Density Estimates
- 130nm P4
  - 3.4 GHz, µPGA478, 0.13 µm, HTT
  - 109W peak
  - 131mm²
  - 109W / 1.31cm² = 83 W/cm²
- 90nm P4 (Prescott)
  - 3.6 GHz (Model 560), LGA775, 0.09 µm, HTT
  - 151W peak power
  - 112mm²
  - 151W / 1.12cm² = 134 W/cm²
Dohh!
Best guess... data on sandpile.org is somewhat difficult to correlate
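The slide's arithmetic, checked in code (figures taken directly from the bullets above):

```python
# Peak power / die area -> power density, as in the bullets above.
for name, watts, area_mm2 in [("130nm P4", 109, 131),
                              ("90nm Prescott", 151, 112)]:
    w_per_cm2 = watts / (area_mm2 / 100.0)    # 100 mm^2 = 1 cm^2
    print(f"{name}: {w_per_cm2:.1f} W/cm^2")  # 83.2 and 134.8 W/cm^2
```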
56. >>>> MICRO 2006 <<<<
57. Core-Level Design
- Simplest approach:
  - Core-on-core (likely to have severe thermal problems)
  - Cache-on-core (the LLC is relatively low power, so thermal impact should be limited)
- Reuse existing 2D designs
- A first step in an evolutionary path to new 3D processors
58. Options
- Use 3D to build larger last-level caches (LLC)
[Figure: Core 1 + Core 2 with a 4MB L2 cache (SRAM)]
- 2D baseline: original 4MB
- 3D (12MB): stack 8MB (SRAM) on the original 4MB
- 3D (32MB): stack 32MB (DRAM); original 4MB used for fast tags
- 3D (64MB): stack 64MB (DRAM); original 4MB removed
59. Performance Impact
[Figure: instructions per cycle (0-1.4) and off-chip bandwidth (0-14 GB/s) for 4MB, 12MB, 32MB, and 64MB caches across the Intel RMS parallel/multi-core benchmarks: conj, dSym, gauss, pcg, sMVM, sSym, sTrans, spgAVDF, spgAVIF, spgUS, svd, svm, and their average]
- Significant performance increase
- Significant reduction in off-chip BW
(Black et al., MICRO 2006)
60. Thermal Impact
[Figure: peak temperature (°C) for the 2D 4MB, 3D 12MB, 3D 32MB, and 3D 64MB configurations; all fall between 88.35°C and 92.85°C]
No major thermal issues
(Black et al., MICRO 2006)
61. Baseline Core 2 Duo Thermals
[Figure: thermal map of Core 1 and the 4MB L2 cache, with FP, Ld/St, and RS regions labeled. Coolest: 59°C; hottest: 88.4°C]
The temperature drop at the edge is due to an epoxy fillet.
62. Thermal Cost of DRAM on CPU
[Figure: peak temperature (°C, roughly 80-89) for CPU (2D), CPU (3D), DRAM on CPU (3D), DRAM on CPU - I/O (3D), and CPU on DRAM, with power splits of 86W+6.2W, 92W+6.2W, 92W+0W, and 92W annotated. Annotations: a ~3.8° increase for stacking DRAM; CPU on DRAM is actually better]
63. 3D-Stacked P4
- Planar die area is 50%
- Eliminates RC delay and drivers; reduces driver sizing
- 25% of pipe stages removed
- 15% perf. improvement from pipestage elimination
- 15% power improvement from clock elimination and pipestage elimination
- Splitting FUBs and placement benefits are additive
[Figure: top and bottom layers of the folded floorplan]
Black et al., "3D Processing Technology and Its Impact on iA32 Microprocessors," Proceedings of the International Conference on Computer Design, October 2004
64. P4 3D Tradeoff Space

                  Power (W)  Pwr   Temp  Perf  Vcc   Freq
Baseline            147      100%   99   100%  1.00  1.00
Same Power          147      100%  127   129%  1.00  1.18
Same Frequency      125       85%  113   115%  1.00  1.00
Same Temperature     97.3     66%   99   108%  0.92  0.92
Same Performance     68.2     46%   77   100%  0.82  0.82

Perf vs. Freq: 0.82% performance per 1% frequency. Freq vs. Vcc: 1% per 1% in Vcc.
(Black et al., MICRO 2006)
65. >>>> ISCA 2008 <<<<
66. Estimate for # of DRAM Layers
- Samsung K4T51083QE DDR2 SDRAM
  - 10.9Mb/mm² (1.36MB/mm²) in 80nm
  - 27.9Mb/mm² (3.5MB/mm²) in 50nm
- Quad-core in 45nm: 200-300mm²
  - Dual-core Penryn 107mm² → quad-core 214mm²
  - Quad-core Barcelona 285mm²
- 1GB per layer → 294mm² per DRAM layer
  - Should be slightly less, since peripheral logic is placed on a separate layer
  - Else, 512MB per layer (16 DRAM layers + 1 peripheral-logic layer)
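The density arithmetic above, as a quick script (densities taken from the slide; the slide's 294mm² presumably includes a little overhead beyond the raw division):

```python
# Area required for a 1GB DRAM layer at the two quoted densities.
for node, mb_per_mm2 in [("80nm", 1.36), ("50nm", 3.5)]:
    area = 1024 / mb_per_mm2    # mm^2 for 1GB (1024MB)
    print(f"{node}: {area:.0f} mm^2 per 1GB layer")
# The 50nm result (~293 mm^2) is in line with the slide's 294 mm^2 figure.
```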
67. Thermal Impact
- Results based on the UVA HotSpot thermal simulator
- Performance modeling already accounts for a 32ms refresh rate (as opposed to 64ms)
- Numbers are pessimistic in that we assume max power consumption for both CPU and DRAM (e.g., if the DRAM is running at full tilt, then the CPU is probably mostly idling; if the CPU is running at full tilt, then it is probably mostly hitting in cache, and DRAM activity would be low)
68. Baseline Config and Workloads
69. >>>> CF 2008 <<<<
70. 3D Benefits and Costs
- Benefits
  - More processor resources (larger caches, functional units, buffers) → higher performance
  - Additional functionality
    - On-stack voltage converters, noise control, profiling, ...
- Costs
  - More layers of silicon → $
  - Additional fab steps (and equipment) → $
  - Higher temps → better cooling → $
71. Benefits Not For Everybody
- Some markets can't afford additional costs
- Other design points may already be thermally limited
- May not want 3D in these markets
- Other markets (server, gaming) may be quite willing to deal with 3D costs for the benefits
72. Converged Design Methodology
- One microarchitecture for all segments
[Figure: a base core microarchitecture (e.g., the base K8 microarchitecture) feeding the budget, mobile, desktop, server, and gaming segments]
- Reduces design costs
- Differentiate products via speed/power binning, cache size, # of cores
Naively using 3D for only high-end products breaks the One Design methodology
73. Goal/Motivation
- Use 2D as the default (for low-end, mobile, etc.)
- Make 3D optional (for high-end market segments)
- Maintain a single overall design
- Stackable Microarchitecture approach:
  - The baseline 2D processor has everything it needs to stand alone
  - Optional 3D layers augment the conventional baseline structures
  - 3D layers snap on for segments that need them
74. Stackable SRAMs
- For caches, predictors, register files, etc.
[Figure: Banks 0-3 split across layers. With 2 banks, address bit a_i selects the bank; with 4 banks, bits a_i and a_(i-1) together select among the banks]
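A sketch of the address decoding this implies (names and bit positions are illustrative assumptions, not the paper's design): each layer's configuration determines how many address bits select a bank, so the same layout works whether one, two, or four banks are stacked.

```python
# Hedged sketch: the number of bank-select bits grows with the number of
# stacked banks; with one bank, no select bits are consumed.
def bank_select(addr, num_banks, index_bits=10):
    # bank chosen by the bits just above the (assumed 10-bit) row index
    bank_bits = num_banks.bit_length() - 1       # log2(num_banks)
    return (addr >> index_bits) & ((1 << bank_bits) - 1)

addr = 0b11_0101010101          # 12-bit example address
print(bank_select(addr, 1))     # 0: single bank, no select bits used
print(bank_select(addr, 2))     # 1: one select bit (a_i)
print(bank_select(addr, 4))     # 3: two select bits (a_i, a_(i-1))
```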
75. Another View
76. Configuring Each Layer
- Step 1: Layer identification (Layer IDs 0-3)
- Step 2: Per-layer configuration (config bits)
Set during assembly, at boot-up, etc.
77. Modular Processor Overview
2D floorplan and parameters based on the Core 2 microarchitecture
78. Performance Results
[Figure: uops-per-cycle speedup]
79. Design Tradeoffs
80. Modular Design Conclusions
- With a partitionable design, 3D benefits can be segregated across markets
- The baseline 2D processor fulfills the needs of mass markets
- Optional, stackable layers provide value for higher-end market segments
- Can be combined with conventional speed/power binning for further product-line differentiation
81. >>>> CAD/EDA <<<<
82. Some 3D CAD/EDA Problems
- Layer partitioning
- Global routing
- Clock routing
- Decap placement/sizing
See also www.GTCAD.gatech.edu
83. 3D Layer Partitioning
- Goals & Approach
  - Partition modules and/or gates into multiple layers
  - Consider bonding styles: F2F, F2B, B2B
  - Maximize inter-partition interconnect for F2F
  - Minimize inter-partition interconnect for F2B and B2B
84. 3D Global Routing
- Goals & Approach
  - Construct a 3D routing tree
  - Optimize performance (minimize Elmore delay) and power (minimize wirelength)
  - Thermal-aware through-via insertion
    - Move through-vias (which act as thermal passages) closer to hotspots
85. 3D Clock Routing
- Goals & Approach
  - Construct a 3D clock tree
  - Minimize clock skew under a non-uniform thermal profile
  - Bonding style imposes constraints on the # of through-vias
[Figure: an F2B stack vs. alternating F2F/B2B bonding]
86. 3D Decap Placement and Sizing
- Goals & Approach
  - Place modules to reduce IR-drop and the decap required
  - Allocate whitespace for decap insertion
  - Size decaps (Tox) for leakage reduction
87. >>>> OTHER <<<<
88. 3D-Stacked 21364
[Figure: the 2D 21364 core, the 3D 21364, and a detail of the 3D EBox]
- The EBox has a mix of self-stacking...
  - RF, IQ: critical wires are intra-FUB
- ... and FUB-stacking
  - EUs: critical wires are inter-FUB
89. Self-Bibliography
- Low-/Circuit-level
- DAC 2007
- GLSVLSI 2006 (2)
- ISCAS 2006
- ISVLSI 2006
- ICCD 2005
- Architecture-level
- IEEE Micro 2007
- HPCA 2007
- MICRO 2006
- JETC 2006
- ISCA 2008
- CF 2008
- Other
- TCAD 2007