Title: Design challenges in sub-100nm high performance microprocessors
1Design challenges in sub-100nm high performance
microprocessors
- Nitin Borkar, Siva Narendra, James Tschanz,
Vasantha Erraguntla - Circuit Research, Intel Labs
- nitin.borkar_at_intel.com
- siva.g.narendra_at_intel.com
- james.w.tschanz_at_intel.com
- vasantha.erraguntla_at_intel.com
2Outline
- Section 1: Challenges for low power and high performance (90 mins)
  - Historical device and system scaling trends
  - Sub-100nm device scaling challenges
  - Power delivery and dissipation challenges
  - Power-efficient design choices
- Section 2a: Circuit techniques for variation tolerance (90 mins)
  - Short channel effects
  - Adaptive circuit techniques for variation tolerance
3Outline (contd.)
- Section 2b: Circuit techniques for leakage control (90 mins)
  - Leakage power components
  - Leakage power prediction
  - Leakage reduction and control techniques
- Section 3: Full-chip power reduction techniques (90 mins)
  - Micro-architecture innovations
  - Coding techniques for interconnect power reduction
  - CMOS-compatible dense memory design
  - Special-purpose hardware
  - Design methodologies: challenges for CAD
4Section 1
- Challenges for low power and high performance
5Moore's Law on scaling
6 Scaling of dimensions
7Transistors on a chip
[Chart: transistor count (MT, log scale) vs. year, 1970-2010, from the 4004 through the Pentium 4; 2X growth in 1.96 years]
Transistors on lead microprocessors double every 2 years
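The doubling trend above can be sketched numerically. The 42 MT starting point (roughly Pentium 4-class, year 2000) is an assumed anchor for illustration, not a figure from the tutorial:

```python
# Sketch of the slide's trend: lead-microprocessor transistor count doubles
# roughly every 2 years. Base point (42 MT in 2000) is an assumption.
def transistors_mt(year: int, base_mt: float = 42.0, base_year: int = 2000,
                   doubling_years: float = 2.0) -> float:
    return base_mt * 2 ** ((year - base_year) / doubling_years)

print(round(transistors_mt(2010)))  # 1344 MT, inside the 200M-1.8B range quoted later
```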
8Die size growth
[Chart: die size (log scale) vs. year, 1970-2010, 4004 through Pentium 4; ~7% growth per year, 2X growth in 10 years]
Die size grows by ~14% per generation to satisfy Moore's Law
9Frequency
[Chart: frequency (MHz, log scale) vs. year, 1970-2010, 4004 through Pentium 4]
Lead microprocessor frequency doubles every 2 years
10Performance
Applications will demand TIPS performance
11Power
[Chart: power (W, log scale) vs. year, 1971-2000 and projected, 4004 through Pentium 4 and beyond]
Lead microprocessor power continues to increase
12Obeying Moore's Law...
200M to 1.8B transistors on the lead microprocessor
13Vcc will continue to reduce
[Chart: supply voltage (V, log scale) vs. year, 1970-2010, falling toward 1.35V, 1.15V, and 0.9V in future generations]
Only ~15% Vcc reduction per generation to meet frequency demand
14Constant Electric Field Scaling
15Active capacitance density
Active capacitance grows 30-35% each technology generation
16Power will be a problem
[Chart: power (W, log scale) vs. year, 1971-2008, 4004 through P4, extrapolating to 500W, 1.5kW, 5kW, and 18kW]
Power delivery and dissipation will be prohibitive
17Closer look at the power
[Chart: power (W, log scale) vs. year, 2002-2008; the "will be" trend runs 500W, 1.5kW, 5kW, 18kW, while the "should be" trend runs 135W, 225W, 375W, 623W]
18Advanced transistor design
[Cross-section: advanced transistor features]
- Shallow, highly doped source/drain extensions
- Thin Tox
- Halo/pocket implants
- Retrograde well
- Shallow trench isolation
- n-well
- Deep source/drain
19Intel's 15 nm bulk transistor
R. Chau et al., IEDM 2000
20Transistor scaling trends - SCE
[Figure: transistor cross-section annotated with Le, Tox, Dj, and depletion depth D, which define the aspect ratio]
- Short channel effects (SCE), as measured by the transistor aspect ratio, have been worsening with scaling
21Transistor scaling challenges - Dj
- Junction depth reduction
  - Device channel length can decrease for the same SCE
  - But series resistance to the channel increases
22Transistor scaling challenges - Tox
- Thinning gate oxide
- Increased gate tunneling leakage
- Electrical thickness is 2X physical thickness
- Gate stress now limits max VCC
- Solutions
- New decoupling caps
- Modified oxides/gate materials
- Model gate leakage in circuit simulation
23VCC and VT scaling
24Vcc scaling Soft errors
- Vcc and capacitance scaling with technology reduce the stored charge
- Soft errors become prominent in logic circuits
  - No error correction in logic circuits
- Storage nodes per chip increasing
- Higher soft error rate at the chip level
25Motivation
- Soft error rate (SER) per bit staying constant in future processes
  - T. Karnik et al., 2001 VLSI Circuits Symposium
- Need to reduce SER/bit
Goal Reduce chip-level SER with no performance
penalty and minimum power penalty
26Measured Latch Data
[Chart: measured latch errors and SER improvement (X) vs. supply voltage, 0.5-1.3V; hardened latch vs. original, up to ~2.25X improvement]
T. Karnik et al, 2001 VLSI Circuits Symposium
- Will need 2X SER improvement in latches with no
performance loss.
27VT vs. leakage
- Leakage rises as VT is lowered
- MOS has a sub-threshold slope of ~110mV/decade
  - Lowering VT by 50mV → ~3X leakage
- Solutions
  - Dual VT
  - Stacking of off gates
  - Controlled back-gate bias?
  - Multiple process technologies: mobile vs. performance?
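The 50mV → ~3X relationship follows directly from the sub-threshold slope; a minimal sketch, using the ~110mV/decade figure from the slide:

```python
# Subthreshold leakage grows as 10^(-VT/S) for slope S (mV/decade),
# so lowering VT by dVT multiplies Ioff by 10^(dVT/S).
def leakage_multiplier(delta_vt_mv: float, slope_mv_per_decade: float = 110.0) -> float:
    """Factor by which Ioff grows when VT is lowered by delta_vt_mv."""
    return 10 ** (delta_vt_mv / slope_mv_per_decade)

print(round(leakage_multiplier(50), 2))  # 2.85, i.e. roughly 3X leakage
```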
28Sub-threshold Leakage
Sub-threshold leakage current will increase exponentially
Assumptions: 0.25µm, Ioff = 1nA/µm; ~5X increase each generation at 30°C
29Leakage Power
Excessive sub-threshold leakage power
30Leakage Power increases
[Chart: Ioff (nA/µm, log scale) vs. temperature, 30-100°C, for 0.18µm through 0.05µm generations]
Drain leakage will have to increase to meet frequency demand, resulting in excessive leakage power
31Wide Domino Functionality
[Schematic: static gate vs. clocked D1/D2 domino gates with outputs Q1/Q2 and inputs A, B, C]
- Lower AC noise margin (~Vt)
- Ioff could limit NOR fan-in
- High activity → higher power (~2X)
- Irreversible logic evaluation
- Poor scalability
- High performance: ~30% over static
- High fan-in NOR → fewer logic gates
- High fan-in complex gates possible
- Smaller area
32Bitline Delay Scaling Problem
- Bit line swing limited by parameter mismatch and differential noise
- Cell stability degrades with Vt lowering
- Bit line delay ∝ (Cap/W)·Vswing / (Ion/W − rows·Ioff/W)
- Reduction of rows per bitline is approaching its limit
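The delay relation above can be played with numerically; all parameter values in this sketch are illustrative assumptions, not measured data:

```python
# Slide's bitline delay relation:
#   delay ∝ (C/W) * Vswing / (Ion/W - n_rows * Ioff/W)
def bitline_delay(c_per_w, v_swing, ion_per_w, ioff_per_w, n_rows):
    # Leakage of the unselected rows subtracts from the read current.
    drive = ion_per_w - n_rows * ioff_per_w
    if drive <= 0:
        raise ValueError("leakage of unselected cells overwhelms the read current")
    return c_per_w * v_swing / drive

# Doubling Ioff erodes the effective drive and slows the read:
fast = bitline_delay(2e-15, 0.1, 1e-3, 1e-8, 256)
slow = bitline_delay(2e-15, 0.1, 1e-3, 2e-8, 256)
print(slow > fast)  # True
```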
33Restrict transistor leakage
[Chart: frequency (MHz, log scale) vs. year, 1985-2010, 386 through Pentium 4, projecting 2.5, 4, 5.5, and 7 GHz]
Reduce leakage → frequency will not double every 2 years
34Interconnect scaling trends
35Interconnect performance
R increases faster at lower metal levels, C increases faster at higher levels, and RC increases 40-60%
36Interconnect distribution
Interconnect distribution does not change
significantly
37Wire Scaling
- uArch favoring short wires
- Repeaters
38Optimum Repeater
- Vary
- N size, P size
- Repeater distance
- Metal width, space
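Optimal repeater count and size follow textbook (Bakoglu-style) expressions; a sketch with assumed driver parameters, not values from the tutorial:

```python
import math

# Classic repeater insertion for a long RC wire.
# R0, C0: output resistance and input capacitance of a minimum-sized inverter (assumed).
def optimal_repeaters(R_wire, C_wire, R0, C0):
    """Returns (number of repeaters k, sizing factor h) minimizing wire delay."""
    k = math.sqrt(0.4 * R_wire * C_wire / (0.7 * R0 * C0))
    h = math.sqrt(R0 * C_wire / (R_wire * C0))
    return k, h

k, h = optimal_repeaters(R_wire=1000.0, C_wire=1e-12, R0=10000.0, C0=1e-15)
print(round(k, 1), round(h, 1))  # about 7.6 repeaters, each ~100X minimum size
```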
39P, V, T Variations
40Frequency SD Leakage
[Scatter plot: normalized frequency vs. normalized leakage (Isb), 0.18µm, 1000 samples; ~30% frequency spread and ~20X leakage spread]
41Vt Distribution
[Histogram: % of chips vs. ΔVTn (mV), 0.18µm, 1000 samples; distribution spans roughly −40mV to +32mV with σ ≈ 30mV]
42Frequency Distribution
[Histogram: % of chips vs. normalized frequency, 1.00-1.37]
43Isb Distribution
[Histogram: % of chips (log scale) vs. normalized Isb, 1.00-20.11]
44Supply Voltage Variation
- Activity changes
- Current delivery: IR and L(di/dt) drops
- Dynamic: ns to 10-100µs time scales
- Within-die variation
45Handling di/dt
- Bulk decoupling
- High-frequency decoupling
- VRM response
- Local decoupling
- Silver box response
- On-die decoupling
46Vcc Variation Reduction
- On-die decoupling capacitors reduce ΔVcc
  - Cost: area, and gate oxide leakage concerns
- On-die voltage down-converters / regulators
47Temperature Variation
Cache: 70°C
Core: 120°C
- Activity and ambient change
- Dynamic: 100-1000µs time scales
- Within-die variation
48Major Paradigm Shift
- From deterministic design to probabilistic and statistical design
- A path delay estimate is probabilistic (not deterministic)
- Multi-variable design optimization for
  - Yield and bin splits
  - Parameter variations
  - Active and leakage power
  - Performance
49Performance Efficiency of mArch
Pollack's Rule
[Chart: growth (X) vs. technology generation, 1.5µm to 0.18µm; area ratio (lead/compaction) of ~2-3X vs. performance ratio of ~1.5-1.7X. Note: performance measured using SpecINT and SpecFP]
- Implications (in the same technology)
  - A new microarchitecture takes 2-3X the die area of the last uArch
  - It provides 1.5-1.7X the performance of the last uArch
We are on the wrong side of a square law
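Pollack's Rule says performance grows roughly as the square root of area in the same process; a minimal sketch (the 2-3X area figures are from the slide, the square-root form is the rule's usual statement):

```python
import math

# Pollack's Rule: performance ~ sqrt(area) within one process generation.
def pollack_performance(area_ratio: float) -> float:
    return math.sqrt(area_ratio)

for area in (2.0, 3.0):
    print(f"{area:.0f}X area -> {pollack_performance(area):.2f}X performance")
# 2X area -> 1.41X, 3X area -> 1.73X: the wrong side of a square law.
```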
50Frequency Performance
- Frequency increased 61X
  - 18.3X from process technology
  - An additional 3.3X from uArch
- Performance increased 100X
  - 14X from process technology
  - An additional 7X from uArch and design
51Design EfficiencymArch
In the same process technology, compare Scalar → Super-scalar → Dynamic → Netburst:
2-3X growth in area, 1.4X growth in integer performance, 1.7X growth in total performance, 2-2.5X growth in power
Pollack's Rule in action: power inefficiency
52Power Efficiency - Circuits
Assumptions: activity of 0.2 for static and 0.5 for domino; clock consumes 40% of full-chip power
High-power circuits contribute to power inefficiency
53Power density will increase
Power density too high to keep junctions at low
temp
54Thermal Solutions
[Figure: thermal stack from ambient (Ta) to junction (Tj); heat sink with θsa (sink-to-ambient resistance), interface with θcs (case-to-sink resistance), and package with θjc (junction-to-case resistance), plus attachment and mounting layers]
55Thermal CapabilityToday
Package: polymer thermal interface, 1.5mm Cu heat spreader, 0.35°C/W (typical)
Thermal interface material: thermal grease / phase-change material, 0.12°C/W
Heat sink: Al folded fin with Cu base, 3.5 x 2.5 x 2 at 400g, 0.38°C/W (5% for RM fan)
[Chart: thermal resistance stack; heat sink 0.38°C/W + TIM 0.12°C/W + package 0.35°C/W, θJA ≈ 0.82°C/W]
TJ = 90°C, TA = 45°C, θJA = 0.82°C/W → P = (90−45)/0.82 ≈ 55W
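The power budget calculation at the bottom of the slide is just Ohm's law for heat; a sketch using the slide's numbers:

```python
# Max dissipable power for a given junction-to-ambient thermal resistance:
#   P = (Tj - Ta) / theta_JA
def max_power(tj_c: float, ta_c: float, theta_ja: float) -> float:
    """Power (W) that keeps the junction at tj_c with ambient ta_c."""
    return (tj_c - ta_c) / theta_ja

print(round(max_power(90, 45, 0.82)))  # 55 W, matching the slide
```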
56Thermal CapabilityFuture
Must improve on all fronts; no silver bullet
57Shrinking Size Quieter
[Chart: system volume (cubic inches) shrinking from PC tower (~3000) through mini tower, µ-tower, and slim line to small PC]
Small and quiet, yet high performance
58Thermal Budget
[Chart: desktop PC average selling price (US$) vs. year, 1995-2000; performance PCs falling from ~$2200, value PCs below ~$1000. Source: Dataquest, Personal Computers]
Shrinking ASP means a shrinking budget for thermals
59Thermal
- Throttling / clock gating
- Circuits and sizing
  - A 10% performance gain at the same power can be translated into a 25% power reduction by changing VCC
- Improved die attach / package
- Can enable new uArch / floorplanning
  - Spread and reduce power
60Thermal Envelope Cost
[Chart: thermal solution unit cost ($) vs. power (W), log-log; extrusions (Celeron), high-aspect-ratio heat sinks (Pentium 4), HAR HS with heat pipe (Itanium, mobile high-perf), liquid immersion, liquid spray, refrigeration]
61The Odds
[Chart: projected heat-sink volume (in³), projected air flow rate (CFM), and thermal budget (°C/W) vs. power (W), 0-250W, with Pentium III and Pentium 4 marked]
More power → more thermals → higher heat-sink volume → higher air flow. Is this cheaper, smaller, and quieter?
62What's next
- Circuit techniques for variation tolerance
- Circuit techniques for leakage control
- Full-chip power reduction techniques
- 30 min quiz
63Section 2a
- Circuit techniques for variation tolerance
64Moore's Law on scaling
65Scaling of dimensions
66Requires die size growth or same die size
67(No Transcript)
68Drain current (Linear scale)
[Figure: ID-VG curve; the linear scale shows the VT intercept, the log scale shows IOFF at VGS = 0]
69Barrier Lowering (BL)
[Band diagram: n+ source and drain around a p-channel of length L with depletion widths Xd; the source barrier height falls as L shrinks (barrier lowering)]
70Drain Induced BL (DIBL)
71Impact of variation in L
[Chart: VT (V) vs. channel length (µm); BL curve (VDS ≈ 0) and DIBL curve (VDS = VDD)]
ΔL → ΔVT → ΔION, ΔIOFF
72180nm measurements
Necessary to make circuits less sensitive to VT (ION, IOFF) variation
73Transistor scaling
[Figure: transistor cross-section annotated with L, Tox, Dj, and depletion depth D, which define the transistor aspect ratio]
Short channel effects increase with scaling
74Transistor scaling challenges - Dj
[Chart: NMOS and PMOS drive currents IDN, IDP (mA/µm) vs. junction depth (nm), 0-200nm. S. Thompson et al., 1998; S. Asai et al., 1997]
75Transistor scaling challenges - Tox
76High-K Gate Dielectric
- Lower gate leakage
- Higher Cox at a given gate leakage
77Parameter variation
Device- and chip-level parameters
Parameter variations increase with scaling
Adapt VDD and VT to reduce chip-level variation
78Scaling challenges summary
- L, VDD, VT scaling
  - → Increasing parameter variation
  - → Increasing sub-threshold leakage power
  - → Increasing gate leakage power
- Product life cycle reduced from 3.6 years to 2 years
  - → Concurrent engineering
  - → Better prediction models
79VT variation categories
80Adaptive Body Bias (ABB)
81Side effects of ABB
(2) Apply reverse bias
Determine impact of adaptive body bias on
within-die VT variation.
82Short Channel MOS VT
BL↑ → λb↑; DIBL↑ → λd↑
83Within-die VT Variation
Within-die VT variation is primarily due to CD
variation
84Solutions
- Bi-directional adaptive body bias
- Several separate bias generators on-chip
85Testchip die micrograph
- 150nm CMOS, 5.3 mm x 4.5 mm die
- 21 subsites per die
- Microprocessor critical path
  - Frequency = Min(F1..F21)
  - Power = Sum(P1..P21)
- Separate VBS for each subsite
- 62 dies per wafer
86Sub-site micrograph
21 sub-sites with separate body bias for each
sub-site
87CUT schematics
88Simple Adaptive Body Bias (S-ABB)
Neglects WID variation
Area overhead: ~2%
89Effectiveness of S-ABB
Frequency variation σ/µ (%): NBB 4.1, S-ABB 1.0
[Histograms: frequency distributions for S-ABB vs. NBB]
90Adaptive Body Bias (ABB)
Accounts for WID variation
Area overhead: ~2-3%
91Effectiveness of ABB
Frequency variation σ/µ (%): NBB 4.1, ABB 0.69
[Histograms: frequency distributions for ABB vs. NBB]
92Adaptive Bias Distribution
[Chart: distribution of adaptive NMOS/PMOS bias combinations (FBB/RBB) across dies; 1, 13, 38, and 10 dies per combination]
93Frequency vs. Critical Path Count (NCP)
- Frequency µ and σ reduce as NCP increases
- Frequency distribution unchanged for NCP > 14
94WID Delay Variation vs. Logic Depth
NMOS σ/µ = 5.6%, PMOS σ/µ = 3.0%
[Histogram: number of samples (%) vs. delay variation (%); delay σ/µ = 4.2%]

                     Miyazaki, ISSCC 2000   This work
Path depth           49                     16
Device σ/µ (%)       2.4                    4.27
Frequency σ/µ (%)    0.55                   4.17
95Within-Die Adaptive Body Bias (WID-ABB)
Compensates for WID variation
Area overhead: similar to ABB
96Effectiveness of WID-ABB
97% in highest bin
Frequency variation σ/µ (%): ABB 0.69, WID-ABB 0.21
[Histograms: frequency distributions for WID-ABB vs. ABB]
97Within-Die Bias Distributions
[Scatter plot: circuit block count over PMOS vs. NMOS body bias (V), spanning the quadrants P,N FBB; P FBB / N RBB; P RBB / N FBB; and P,N RBB]
98Bias Resolution
Bias resolution   ABB: % dies F > 1   ABB: σ/µ   WID-ABB: % dies F > 1.075   WID-ABB: σ/µ
500mV             79                  2.87       2                           1.89
300mV             100                 1.47       66                          0.50
100mV             100                 0.69       97                          0.21
- 300mV bias resolution is sufficient for ABB
- WID-ABB requires 100mV bias resolution
99ABB summary
- D2D and WID variations impact microprocessor frequency and leakage
- ABB improves the die acceptance rate from 50% to 100%
- ABB is most effective when WID variations are considered
- Compensating for WID variations by WID-ABB increases the number of high-frequency dies from 32% to 97%
100Adaptive VDD VT
For iso-frequency:
- Fast die: decrease VDD or increase VT
- Slow die: increase VDD or decrease VT
Psw ∝ a·VDD² ; Pleak ∝ 10^(−VT/S)
101Testchip goals
- Body bias (VBS) for VT modulation
- Measure frequency improvement with
  - Adaptive VDD
  - Adaptive VBS
  - Adaptive VDD + VBS
  - Adaptive VDD + within-die VBS
- Subject to total active and standby power constraints
102Baseline measurements
103Adaptive VDD vs. Fixed VDD
Active power limit: 10W/cm²; standby power limit: 0.5W/cm²
Fixed VDD = 1.05V: frequency reduced to meet the power limit
Adaptive VDD, 20mV resolution: VDD and frequency changed simultaneously
104VDD resolution requirement
[Histogram: die count vs. frequency bin (0.9-1.05) for fixed VDD = 1.05V, adaptive VDD at 50mV resolution, and adaptive VDD at 20mV resolution]
A minimum resolution of 20mV in VDD is required
105Adaptive VDD vs. Adaptive VBS
[Histogram: die count vs. frequency bin (0.9-1.05) for adaptive VDD at 20mV resolution vs. adaptive VBS at 100mV resolution]
Dies in the target frequency bin: 6% with fixed VDD, 10% with adaptive VDD, 16% with adaptive VBS
106Adaptive VDD VBS
[Histogram: die count vs. frequency bin (0.9-1.05) for adaptive VBS vs. adaptive VDD + VBS]
Adaptive VDD + VBS is more effective than adaptive VDD or adaptive VBS alone
107VDD distribution
[Histogram: die count vs. VDD (0.99-1.07V)]
Adaptive VDD + VBS results in lower VDD than adaptive VDD alone
108VBS distribution
[Histograms: VBS distributions for adaptive VBS vs. adaptive VDD + VBS]
Adaptive VDD + VBS results in more dies with FBB than adaptive VBS alone
109Adaptive VDD Within-die VBS
Adaptive VDD + within-die VBS is most effective
110AVDD ABB Summary
- In 150nm CMOS, with 10W/cm² active and 0.5W/cm² standby power density limits:
  - 20mV resolution in VDD is required
  - 100mV resolution in VBS is required
111Neighborhood VT variation
The devices of interest in close proximity can be of the same or different polarity.
- Voltage biasing: impacts sense amps, diff amps, current mirrors, etc.
- Current biasing: impacts clock generation circuits, switching thresholds, etc.
112Voltage biasing
Linear threshold voltage mismatch of a matched device pair for 500 mV forward body bias, zero body bias, and 500 mV reverse body bias.
113Application to sense-amp
Traditional sense-amplifier
New sense-amplifier
114Simulation results
1.5 V, 1 mV/ps ramp rate, and 110°C
115Current biasing
Basic iso-current biasing
116Application
Non-overlapping 2f clock generation
117Current biasing
Process insensitive current biasing
118Iref existing techniques
- Reference voltage to reference current conversion
  - Bandgap circuit with off-chip resistor
  - MOS reference voltage with off-chip resistor
- Direct reference current generation
  - MOS-based, temperature compensation only
119Objective
Generate a process-compensated current with thin-tox digital CMOS devices and without external resistors
120Device measurement
0.18 µm CMOS technology, 30°C, uncompensated current:
n = 77, σ = 235.8 µA, µ = 1.6 mA, σ/µ ≈ 15%
121Subtraction method
y1 and y2 vary with x, but yD = y1 − y2 (evaluated at and around x = xmid) is insensitive to x
122Example
Choose m1 ≠ m2 and n2; this provides a non-zero yD insensitive to x around xd for the proper n1
123Illustration
n2 = 1, m1 = 4.2, m2 = 2, xd = 15 → n1 = 0.13
[Chart: y1 (varies 35%), y2 (varies 47%), and yD (varies only 6%) vs. x around xd]
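A numerical sketch of the subtraction principle: two process-sensitive quantities with different power laws are subtracted so that, with coefficients chosen to match their slopes at the nominal point, the difference is first-order insensitive. The specific numbers below are illustrative assumptions, not the slide's:

```python
# Subtraction method: y1 = a1*x**m1, y2 = a2*x**m2; pick a1 so the slopes
# match at the nominal point xd, making yD = y1 - y2 flat to first order.
def make_subtracted(m1, m2, a2, xd):
    # Slope matching: a1*m1*xd**(m1-1) == a2*m2*xd**(m2-1)
    a1 = a2 * m2 / m1 * xd ** (m2 - m1)
    y1 = lambda x: a1 * x ** m1
    y2 = lambda x: a2 * x ** m2
    yD = lambda x: y1(x) - y2(x)
    return y1, y2, yD

y1, y2, yD = make_subtracted(m1=4.2, m2=2.0, a2=1.0, xd=1.0)
x = 1.10  # a 10% process shift in x
print(f"y1 shifts {y1(x)/y1(1)-1:+.0%}, y2 shifts {y2(x)/y2(1)-1:+.0%}, "
      f"yD shifts {yD(x)/yD(1)-1:+.0%}")
```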
124MOS devices in saturation
Using long-channel, wide devices in saturation: ID ≈ (µ·Cox·W / 2L)·(VGS − VT)²
125Compensation by subtraction
126VT generation circuit
[Schematic: VT generation circuit producing 1VT and 5VT references from VDD and ½VDD, with device sizes 20/2, 15/2, 15/2, 20/2, and 2/20]
127Subtraction circuit
[Schematic: subtraction circuit; currents I1 and I2 mirrored with ratio z1/z2 = 1/8, Vsg2 ≈ 2VT, output Iref = I1 − I2]
128Device measurement
0.18 µm digital CMOS technology, 30°C
[Histograms: measured I1 and I2 distributions, n = 112]
129Compensated current
0.18 µm digital CMOS technology, 30°C
n = 112, σ = 17.4 µA, µ = 305 µA, σ/µ = 5.7%
130Sub-1 V operation
(b, a)   Vddmin (V)   Temp (°C)   Iref variation   Vdd sensitivity
(5, 2)   0.9          30          5.0%             0.3% per 100 mV
(3, 2)   0.6          30          5.2%             0.4% per 100 mV
Low-voltage operation enabled by redesigning the VT generation circuit
131Process corner simulationresults
0.18 µm digital CMOS technology, 30°C, VDD = 0.9 V, z1/z2 = 1/6
[Chart: normalized current across process corners, slow-slow to fast-fast; uncompensated Iu varies −16% to +22%, compensated Iref stays within ±5%]
7.6X smaller variation than the uncompensated current
132Summary on Iref
- Subtraction technique for compensation
- The compensation technique reduces reference current variation from 38% to 5% at a Vdd of 0.9 V
- Variation remains ~5% at a Vdd of 0.6 V
133Section 2a Summary
- Device parameter variation increases with scaling → design margins increase
- Adaptive schemes are required to minimize the impact of device variation on the design margin of digital circuits
- Voltage and current biasing schemes minimize the impact of variation on analog circuits
134Section 2b
- Circuit techniques for leakage control
135Outline
- Leakage sources impact of variations
- Leakage estimation with variations
- Static leakage reduction techniques
- Dynamic leakage reduction techniques
- Leakage-tolerant circuits
136Sources of Leakage
137Transistor leakage mechanisms
From Keshavarzi, Roy, Hawkins (ITC 1997)
1. PN junction leakage
2. Weak inversion S-D leakage
3. DIBL and contribution from SCE
4. GIDL
5. Punchthrough current
6. Narrow width effects
7. Gate oxide leakage
8. Hot carrier injection
138Components of leakage
139Subthreshold leakage trends
- Historic Vt scaling: ~15% per generation
- S-D and gate leakage impact: 3-5X increase
- Significant component of total power
- Serious dynamic-circuit robustness penalty
140Leakage vs. switching power
Leakage > 50% of total power!
[Chart: leakage vs. switching power across the 250nm, 180nm, 130nm, 100nm, and 70nm generations]
- Key requirements
  - Accurate prediction of chip leakage power
  - Techniques to reduce chip leakage power
141DIBL impact on leakage
[Chart: VT (V) vs. channel length (µm); BL curve (VDS ≈ 0) and DIBL curve (VDS = VDD). Higher IOFF due to DIBL]
142Variation impact on leakage
[Chart: intrinsic IOFF (A, log scale, 1e-11 to 1e-5) vs. 1/IDlin, 0.18µm CMOS (150nm technology), 110°C, VD = 1V, NBB = 0V; worst-case-L (Lwc) devices leak orders of magnitude more than nominal-L (Lnom) devices]
Shorter L transistors contribute more to chip
leakage
143Transistor scaling challenges - Tox
144High-K Gate Dielectric
- Lower gate leakage
- Higher Cox at a given gate leakage
145Source/Drain Tunneling Leakage
146Leakage Estimation and Modeling
147Leakage estimation
Prior techniques
148New model
Includes within-die variation. After simplification using error-function properties, the chip leakage reduces to a closed-form expression.
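The reason variation must be included: Ioff depends exponentially on channel length, so the mean leakage exceeds the leakage of the mean-L device. A Monte Carlo stand-in for the closed-form model, with all device numbers assumed:

```python
import random

random.seed(0)
I_NOM = 1e-8          # Ioff at nominal L (A/um), assumed
SENSITIVITY = 0.05    # decades of Ioff per nm of L reduction, assumed
SIGMA_L = 5.0         # within-die L sigma (nm), assumed

def ioff(delta_l_nm):
    # Exponential (per-decade) dependence of leakage on channel-length offset.
    return I_NOM * 10 ** (-SENSITIVITY * delta_l_nm)

samples = [ioff(random.gauss(0.0, SIGMA_L)) for _ in range(100_000)]
mean_leak = sum(samples) / len(samples)
print(mean_leak / I_NOM)  # > 1: variation inflates total chip leakage
```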
149Applications
150Measurement results
0.18 µm 32-bit microprocessors (n = 960)
50% of the samples are within 20% of the measured leakage, compared with 11% and 0.2% of the samples using prior techniques
151Static Leakage Reduction 1) Transistor Stacks
152Leakage of Stacks
Stack leakage is 5-10X smaller
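The stack effect can be reproduced with a textbook subthreshold model; the slope, DIBL factor, and unit leakage below are assumed values, not the tutorial's silicon data:

```python
# Two-transistor stack effect: with both NMOS gates at 0, the internal node Vx
# settles where the two subthreshold currents balance. The top device then sees
# negative Vgs and reduced Vds, which is why stack leakage is 5-10X smaller.
S = 0.100    # subthreshold slope (V/decade), assumed
ETA = 0.1    # DIBL coefficient (V/V), assumed
VDD = 1.0
I0 = 1e-8    # leakage of one off device at Vgs=0, Vds=VDD (A), assumed

def i_sub(vgs, vds):
    # Normalized so that i_sub(0, VDD) == I0.
    return I0 * 10 ** ((vgs + ETA * (vds - VDD)) / S)

def stack_leakage():
    lo, hi = 0.0, VDD
    for _ in range(60):                    # bisect on the internal node voltage
        vx = (lo + hi) / 2
        if i_sub(-vx, VDD - vx) > i_sub(0.0, vx):
            lo = vx                        # top leaks more -> node rises
        else:
            hi = vx
    return i_sub(0.0, (lo + hi) / 2)

print(I0 / stack_leakage())  # close to an order-of-magnitude reduction
```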
153ScalabilityStack Effect
Stack effect becomes stronger with scaling
154Exploiting natural stacks
32-bit Kogge-Stone adder
                      High VT   Low VT
Energy overhead       1.64 nJ   1.84 nJ
Savings               2.2 mA    38.4 mA
Min time in standby   84 ms     5.4 ms

Reduction   Avg    Worst
High VT     1.5X   2.5X
Low VT      1.5X   2X
155Stack forcing
[Chart: delay penalty vs. leakage reduction for stack forcing at equal loading]
Low-Vt stack forcing reduces leakage power by ~3X
156Static Leakage Reduction 2) Dual-Vt Process
157Dual VT design technique
Leakage 3X smaller (active and standby), with no performance loss
158Optimum choices of high low Vt
75-100mV VT difference is optimal
159Dual-VT and sizing
- Techniques
- DVT
- min-lvt
- min-area
- min-pwr
Optimize the design with concurrent dual-VT allocation and sizing
160Results total power
[Chart: total power (normalized), split into switching and leakage, for min-lvt, min-pwr, min-area, and DVT at 1.96 GHz (high-VT target), 2.21 GHz, and 2.30 GHz (low-VT target)]
- Total power reduced by 6-8% over DVT-only
- Leakage power reduced by 20% over DVT-only
161Results total device width
[Chart: total device width (normalized), split into high-VT and low-VT, for min-lvt, min-pwr, DVT, and min-area at 1.96 GHz (high-VT target), 2.21 GHz, and 2.30 GHz (low-VT target)]
- Less low-VT usage than DVT-only
- Trade-off between area and low-VT usage
162Results area comparison
Frequency: 2.3GHz
[Layouts: DVT vs. min-lvt, min-pwr, and min-area]
min-lvt: 15% area overhead
20% burn-in power reduction
163Effect of leakage change
- Push leakage in manufacturing to increase frequency
- In a dual-VT design, ideally push low-VT only
[Chart: path count (x1000) vs. path delay (ps); DVTS original at 2.2 GHz vs. DVTS with low-VT leakage push at 2.76 GHz. High-VT paths do not speed up]
164Enhanced dual-VT design
- Allow for efficient frequency change
- Insert additional low-VT devices
[Chart: path count (x1000) vs. path delay (ps); EDVTS at 2.2 GHz vs. EDVTS with low-VT leakage push at 2.76 GHz]
Dual-VT insertion should consider process scaling
165Dual-VT sizing summary
- Dual-VT sizing reduces low-VT usage by 30-60% compared with DVT-only
- Leakage power reduced by 20%
- Dual-VT designs offer a 9% frequency improvement over single-VT
- The enhanced design allows frequency increase through low-VT leakage push
166Dynamic Leakage Reduction 1) Body bias
167Reverse body bias
Total leakage power measured on a 0.18µm test chip
Tech        0.35 µm   0.18 µm
Opt. RBB    2V        0.5V
Ioff red.   1000X     10X
RBB reduces S-D leakage; less effective with shorter L, lower VT, and scaling
168Impact of scaling on RBB effectiveness
RBB becomes less effective with technology scaling
169Switching leakage reduction forward body bias
20% power reduction at 1GHz; 8% higher frequency at iso-power; 20X lower idle-mode leakage
170Router chip with forward body bias
150nm technology
Digital core with on-chip PMOS body bias
generator (BG).
171Power and performance gain by FBB
33% performance gain at 1.1V! 25% power reduction at 1GHz!!
172Standby leakage control by FBB
173Dynamic Leakage Reduction 2) Dynamic sleep transistor
174Active leakage control
Sleep transistor
Body bias
175 32-bit ALU overview
Technology: 130nm dual-VT CMOS
Die area: 1.61 x 1.44 mm²
Transistors: 160K
Frequency: 4.05GHz at 1.28V + 450mV FBB, 75°C
CBG: central bias generator; LBG: local bias generator
176Sleep transistor layout
ALU
Sleep transistor cells
177Body bias layout
[Layout: ALU core LBGs and sleep transistor LBGs around the ALU]
Number of ALU core LBGs: 30
Number of sleep transistor LBGs: 10
PMOS device width: 13mm
Area overhead: 8%
178Frequency leakage impact
Reference: no sleep transistor, 450mV FBB to core, 1.35V, 75°C

Scheme                                  Frequency degradation   Leakage reduction   Area increase
No over/underdrive or sleep body bias   2.3%                    37X                 6%
200mV over/underdrive                   1.8%                    44X                 7%
Sleep body bias (FBB → RBB)             1.8%                    64X                 8%
Dynamic body bias (FBB → ZBB)           0%                      1.9X                8%

PMOS sleep transistor vs. PMOS body bias
179Virtual supply convergence
- Convergence time is dependent on capacitance (> 1ms with decap, < 1ms without)
- A leaky MOS decap on virtual VCC gives better leakage savings for > 1ms idle times
180Total power equal frequency
TON = 100 cycles, 75°C, a = 0.05, F = 4.05GHz
[Chart: total power (mW), split into switching, leakage, and LBG overhead, for clock gating only, clock gating + body bias (8% savings, leakage down 45%), and clock gating + sleep transistor (15% savings, leakage down 77%, switching up ~3%)]
181Leakage-Tolerant Circuits 1) Dynamic register file
182Impact of increasing leakage
- Leakage disturbs the local bit line (LBL)
- Noise can result in erroneous evaluation
- Wider addressing exacerbates the problem
183Dual-Vt design for robustness
- High-Vt devices and stronger keepers mitigate leakage and improve robustness
- Contention causes a severe delay penalty
184Source-follower NMOS (SFN)
- As leakage charges the output node, feedback
reduces the leakage
Automatic Vgs reduction and reverse Vbs
185Leakage bypass w/ stack forcing
- Extra PMOSs supply leakage currents
- Leakage is bypassed away from LBL
- Extra NMOS device forces stack node
Stack node
186Better robustness vs. delay
Larger keeper smaller skew
187Energy vs. delay for SFN
- Robustness fixed at 10% across all points
- Leakage-tolerant techniques not only improve robustness but reduce energy as well
- SFN width is not as competitive because of the PMOS pull-up
188Energy vs. delay for LBSF
- LBSF is faster despite a 3-stack pull-down in the LBL and a 2-stack in the GBL
- Comparable total widths in the pull-down stacks yield similar capacitance
189Summary of LBSF and SFN
                            Full LBSF   SFN   DVT SFN LBSF
Delay improvement (%)       33          10    31
Energy reduction (%)        37          24    38
Total width reduction (%)   47          -3    26
- Improved RF robustness without delay penalty
- The advantages of LBSF and SFN improve as leakage increases
190Leakage-Tolerant Circuits 2) L1 cache using bitline leakage reduction (BLR)
191Bitline leakage reduction
- Memory cell: HVT and Lmax
- Solution: larger, dual-Vt cell for the L1 cache
- 3 types of cells
  - HVT Lmax
  - HVT Lmin
  - DVT Lmin
192Intrinsic and effective read current
- The DVT Lmin cell's IINT is 35% larger, but its IEFF is smaller
193Bitline leakage reduction
WL at −100mV → Vmax; Vvc = Vmax + 100mV
194BLR test chip results
- 2Kb bank of a 16Kb L1 cache
BLR: 25% higher read current, 3% larger cell area
195BLR performance
1.2V, 110°C
- Bitline delay improved from 91ps to 75ps
- Read delay reduced from 159ps to 132ps
- Bitline development rate improved by 8%
196Leakage-Tolerant Circuits 3) Conditional keeper for burn-in
197Leakage at burn-in (BI)
- BI conditions (elevated voltage and temperature) further challenge the leakage issue
- Higher leakage at higher temperature
  - Thermal runaway and positive-feedback effects
- Impact of leakage (especially at BI) on circuit functionality
- Stability of IDDQ measurement with BI stress
198Keepers need to be upsized for burn-in
- Larger keepers increase delay under normal conditions
199Burn-in conditional keeper
[Schematic: burn-in conditional keeper; normal-mode keeper PK1 plus an effective burn-in keeper PKB gated by the burn-in signal (BI), over a minimum-sized clocked pull-down NMOS]
200Burn-in keeper 100nm comparison
[Chart: normalized delay (normal condition) vs. NOR fan-in (number of inputs) for STD vs. BI-CKP, with the burn-in keeper sized as a fraction of the pull-down]
Larger delay improvement for wider dynamic gates
201Summary
- Control of leakage power is becoming crucial
- Leakage estimation is necessary during the design phase
- Static and dynamic techniques can be used for leakage control
  - Dual-VT process and stack effect
  - Dynamic sleep transistor and body bias
- Leakage-tolerant circuits
  - Cache and memory leakage techniques
  - Burn-in leakage reduction
202Section 3
Full-chip power reduction techniques and design
methodologies
203Micro architecture innovations
204mArchitecture Tradeoffs
- Higher target frequency with
- Shallow logic depth
- Larger number of critical paths
- But with lower probability
205Improve mArch Efficiency
Thermals and power delivery are designed for full HW utilization
[Timeline: a single thread (ST) stalls waiting for memory; multi-threading (MT1-MT3) fills the wait slots]
Multi-threading improves performance without impacting thermals and power delivery
206Still obey Moores Law!
[Chart: transistors (MT, log scale) vs. year, 2000-2008; actual vs. Moore's Law]
Total transistor count meets Moore's Law
207Fred's Rule
[Chart: growth (X) vs. technology generation, 1.5µm to 0.18µm; area ratio (lead/compaction) vs. performance ratio]
In the same process technology 2X Area ? 1.4X
Performance
208Reduced die size causes Performance gap
30-60% performance loss even after meeting Moore's Law
209Exploit MemoryLow PD
- Large on-die caches provide
  - Increased data bandwidth and reduced latency
  - Hence, higher performance for much lower power
210Memory has lower power density
Exploit memory !
211Increase memory area
[Chart: memory area (% of total die) vs. year; 29% (2000), 41%, 54%, 55%, 57% (2008)]
Use > 50% of die area for memory
212Memory trend
[Chart: on-die memory (KB, log scale) vs. year, 1980-2010; from 8-16KB through 1M, 2.5M, 5.5M, and 12M to 24M]
213Power density is reduced
Full-chip power density is reduced, but local power density will be high
214Can DRAM help?
- Transistor performance is not critical for DRAM
- Don't need a large retention time
- 10X more storage in the same area and power
- TB/s bandwidth at < 10ns latency
215Embedded DRAM on logic
Provides 10X the memory at the same area and power as SRAM
216Embedded DRAM could improve performance
Source: Glenn Hinton, 1999
- Embedded DRAM provides
  - 10X increase in on-die memory
  - 1,000X increase in bandwidth
  - 10X reduction in latency
217On-die DRAM Applications
(1)
(2)
218130nm test chip
[Die photos: 130nm test-chip capacitor options; N/P inversion, P/N accumulation, and P/P depletion cells with dimensions from 0.52µm x 0.52µm up to 1.61µm x 1.10µm]
219Area and Power Comparison
- P/P is the best from a power and area perspective
220Interconnect power reduction
221Motivation CC Multiplier (CCM)
[Figure: victim wire between two aggressors with coupling caps Cc and ground cap Cg, for CCM = 0, 1, 2]
CCM = Cc multiplier
- The Rint·Cint delay of long busses is a key speed limiter
- Coupling capacitance (Cc) is a large component of Cint
Cint = Cg + CCM·(2Cc)
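The effective-capacitance model above is easy to evaluate; the capacitance values in this sketch are illustrative assumptions:

```python
# Cint = Cg + CCM * (2*Cc), where the coupling multiplier CCM depends on how
# the two neighbors switch (0: same direction, 1: quiet, 2: opposite).
CG = 0.10   # ground capacitance per mm (pF), assumed
CC = 0.15   # coupling capacitance to each neighbor per mm (pF), assumed

def c_int(ccm: float) -> float:
    return CG + ccm * (2 * CC)

for ccm in (0, 1, 2):
    print(f"CCM={ccm}: Cint={c_int(ccm):.2f} pF/mm")
# Worst case (CCM=2) is 7X the best case here: a static bus must be designed
# for CCM=2, while monotonic schemes cap the worst case at CCM=1.
```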
222Coupling Capacitance Scaling
- Coupling capacitance (metal-4) remains a large fraction of Cint despite the move from Al to Cu
223Static Bus (SB)
- Simple scheme with no timing constraints
- Minimize delay through optimal repeater insertion
- CCM of 2 has negative impact on delay
224Dynamic Bus
- Domino timing applied to interconnect
  - Monotonic transitions
- Reduced collinear capacitance
  - Static (worst case): 2X
  - Dynamic (worst case): 1X
- F2 repeater required; susceptible to noise
- Higher transition activity when the input is 1
- Static CMOS inverters drive all segments
225Dynamic Bus Advantages
- Capacitance effects reduced
  - Collinear capacitance reduced 2X
  - Orthogonal capacitance unchanged
- Inductance effects reduced
  - Can oppose transitions on a static bus
  - Can reduce capacitive effects on a dynamic bus
226Static Pulsed Bus (SPB)
- A static pulse generator (PG) generates a pulse on a data transition
- A toggle FF (TFF) restores the correct data at the bus end
- The leading edge is critical, so repeaters are skewed
227SPB Benefits
- In SPB, data transitions are monotonic
- worst-case CCM = 1 and repeaters can be skewed
- Similar to dynamic bus but (1) has no clock
overhead and (2) its energy scales with switching
activity
228SB Vs. SPB Delay
SPB Delay Breakdown: RC + repeaters 77%, other 23%
- SPB reduces delay by 22% as a result of
- Repeater skewing
- CCM < 1 due to useful noise coupling
229SB Vs. SPB Energy
- SPB reduces energy by 12% due to
- Smaller skewed repeater sizes
- Smaller CCM
230SB vs. SPB Different Bus Lengths
[Charts: SB vs. SPB at iso-energy and at iso-delay across bus lengths]
- At iso-energy, SPB improves delay by 15-25%
- At iso-delay, SPB reduces energy by 12-25%
- At iso-delay, SPB reduces current/width by 26-34%
231SPB summary
- SPB has monotonic data transitions
- ⇒ worst-case CCM = 1
- ⇒ repeaters can be skewed
- Unlike dynamic bus
- ⇒ no clock precharge-evaluate energy and routing
- ⇒ energy consumption is data activity dependent
- For 1500µm-4500µm metal-4 lines, SPB
- ⇒ improves delay by 15-25%
- ⇒ reduces energy by 12-25%
- ⇒ reduces width by 34-42%
- ⇒ reduces peak-current by 26-34%
232Transition-Encoded Bus (TEB)
- Encoder circuit
- XOR of previous and current input
- Domino compatible output
- Decoder circuit
- XOR of previous output and bus state
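The XOR encoder/decoder pair described above can be modeled bit-serially in a few lines (a behavioral sketch, not the domino circuit itself):

```python
# Transition-encoded bus sketch: the encoder sends a 1 only when the data
# bit changes (XOR of previous and current input); the decoder XORs each
# received transition flag with its previous output to rebuild the data.

def teb_encode(bits, prev=0):
    out = []
    for b in bits:
        out.append(b ^ prev)   # 1 = a transition occurred
        prev = b
    return out

def teb_decode(flags, prev=0):
    out = []
    for f in flags:
        prev ^= f              # toggle output on each transition flag
        out.append(prev)
    return out

data = [0, 1, 1, 0, 1]
enc = teb_encode(data)         # [0, 1, 0, 1, 1]
assert teb_decode(enc) == data
```

Because the bus carries transition flags, its switching activity tracks the data's transition count, which is the energy property the slides exploit.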
233TEB Advantages
- Dynamic bus performance improvement
- Collinear capacitance reduction
- Static bus energy
- Transition dependent switching activity
- No noise-sensitive F2 repeater required
- Regains noise immunity of CMOS inverter
234Energy Comparison
[Chart: energy vs. delay, static vs. transition-encoded bus; 9mm metal3, 130nm process, 1.2V, 30ºC]
235Results
- Averaged over 3-9mm buses
- Metal3 in 130nm technology, 1.2V, 30ºC
236TEB summary
- Transition-encoded bus
- High performance, energy efficient on-chip
interconnect technique
- 32% active area reduction
- 49% peak current reduction
- Transition dependent energy consumption
- ⇒ Energy savings at aggressive delay targets
- Enables 10-35% performance improvement on 79% of full-chip Pentium 4 buses
237Special purpose hardware
238Special-Purpose HW
- Special-purpose performance ⇒ more MIPS/mm²
- SIMD integer and FP instructions in several ISAs
- Integration of other platform components, e.g. memory controller, graphics
- Special-purpose logic, programmable logic, and separately programmable engines
                     Die Area   Power   Performance
General Purpose:     2X         2X      1.4X
Multimedia Kernels:  <10%       <10%    1.5-4X
Improve power efficiency with Valued Performance
239TCP/IP challenges
Saturated link: 1GbE = 1.48M pkts/sec (672 ns per packet); 10GbE = 14.8M pkts/sec (67.2 ns per packet)
General purpose MIPS will not keep up!
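The per-packet budgets quoted above follow from minimum-size Ethernet frames at wire speed; a quick check, using the standard 64B minimum frame plus 20B of preamble/SFD and inter-frame gap:

```python
# Wire-speed packet rate for minimum-size Ethernet frames.
# A 64-byte frame occupies 84 bytes on the wire: 8B preamble/SFD plus
# a 12B inter-frame gap are added to each frame.

def wire_speed_pps(link_bps, frame_bytes=64, overhead_bytes=20):
    bits_per_packet = (frame_bytes + overhead_bytes) * 8
    return link_bps / bits_per_packet

for name, bps in (("1GbE", 1e9), ("10GbE", 10e9)):
    pps = wire_speed_pps(bps)
    print(f"{name}: {pps / 1e6:.2f}M pkts/sec, {1e9 / pps:.1f} ns per packet")
```

This reproduces the 672 ns and 67.2 ns per-packet arrival times the TCP engine must keep up with.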
240Compute power required for TCP/IP
TCP/IP engine will provide the required MIPS
241A sample approach
- A programmable hardware engine for offloading TCP
processing - Focus on
- Most complex part TCP inbound processing
- Handle 10Gbps Ethernet traffic with sufficient
headroom for outbound processing - Aggressive wire speed goal - minimum packet size
on saturated wire - Simple, scalable, flexible design enabling fast
time to market
242Key features
- Special purpose processor
- Dual frequency, low latency, buffer-free design
- High frequency execution core
- Accelerated context lookup and loading
- Programmability for ever-changing protocols
- Programmable design with special instructions
- Rapid validation and debug
- Scalable solution
- Across bandwidth and packet sizes
- Extendable to multi-core solution
243Packet size vs. core frequency
[Chart: required core frequency vs. packet size: 64B packets need ~1GHz at 1Gbps and ~10GHz at 10Gbps]
Increase packet size ⇒ reduce frequency
244Chip characteristics
Chip area:    2.23 x 3.54 mm²
Process:      90nm dual-VT CMOS
Interconnect: 1 poly, 7 metal
Transistors:  460K
Pad count:    306
245Standard FP MAC
[Block diagram: standard FP MAC datapath (FB, MA)]
Critical path: 26 logic stages @ 30ps per stage; Fmax = 1.2GHz (P860, 1.1V)
246Prototype FP MAC
[Block diagram: prototype FP MAC: M(CS), FB(CS), MP(CS), ZD, 4:2 compressor, shift by 32, M > F overflow detector, ME/FBE]
Critical path: 12 logic stages @ 30ps per stage; Fmax = 3GHz (P860, 1.1V)
247Accumulator Algorithm
- Key: Minimize interaction between incoming operand and accumulator result
- Floating point number converted to base 32
- Exponent subtraction no longer necessary
- Exponent comparison reduced from 8 to 3 bits
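One way to read the base-32 idea (a hedged sketch of the arithmetic, not the chip's exact datapath): split the 8-bit exponent e as e = 32·e_hi + e_lo, fold the 5-bit e_lo into a mantissa pre-shift, and compare only the 3-bit e_hi inside the accumulation loop:

```python
# Illustrative sketch of base-32 accumulation. The low 5 exponent bits
# become a fixed pre-shift applied to the mantissa before the loop, so
# the accumulator only compares the 3-bit high part: an 8-bit exponent
# compare shrinks to 3 bits, as on the slide.

def to_base32(mantissa, exp8):
    e_hi = exp8 >> 5          # 3-bit base-32 exponent (compared in the loop)
    e_lo = exp8 & 0x1F        # 5-bit residue, absorbed into the mantissa
    return mantissa << e_lo, e_hi

m1, e1 = to_base32(0b1, 37)   # exp 37 = 32*1 + 5
m2, e2 = to_base32(0b1, 33)   # exp 33 = 32*1 + 1
assert e1 == e2 == 1          # same 3-bit exponent: mantissas add directly
assert m1 + m2 == (1 << 5) + (1 << 1)
```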
248Die photograph and characteristics
[Die photo: multiplier, aligner, accumulate, normalize, FIFOs/scan, clock grid buffers]
249Design methodologies
250Motivation
- Parameter variations will become worse with
technology scaling - Robust variation tolerant circuits and
microarchitectures needed - Multi-variable design optimizations considering
parameter variations - Major shift from deterministic to probabilistic
design
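The shift from deterministic to probabilistic design can be illustrated with a toy Monte Carlo over per-gate delay variation (all distributions and numbers are invented for illustration):

```python
import random

# Minimal Monte Carlo sketch of probabilistic path delay: sample each
# gate's delay under assumed process (Vt) variation and look at the
# spread of the critical path, instead of one deterministic number.

random.seed(0)

def path_delay(n_gates=20, t_nom=30e-12, sigma=0.08):
    # each gate's delay varies around nominal with process variation
    return sum(t_nom * (1.0 + random.gauss(0.0, sigma)) for _ in range(n_gates))

samples = sorted(path_delay() for _ in range(10000))
mean = sum(samples) / len(samples)
p99 = samples[int(0.99 * len(samples))]
print(f"mean = {mean * 1e12:.0f} ps, 99th percentile = {p99 * 1e12:.0f} ps")
```

A design signed off at the mean would fail on the tail dies, which is why the slides argue for optimizing against the distribution.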
251Impact on Design Methodology
Due to variations in Vdd, Vt, and Temp
[Charts: number-of-paths vs. delay distribution shifts from deterministic to probabilistic; frequency vs. leakage power shows 10X leakage variation, up to 50% of total power]
252Tool Complexity
- Problems
- Far too many tools and tool interfaces
- Data is not easily extractable
- Circuit reuse is minimal
- Solutions
- Common tool interfaces
- Standard databases
- Parameterized design
253Designer Cockpit
- Everything on the menu bar
254Designer Cockpit
255Designer Cockpit
[Screenshot: menu bar: File, Edit, View, Select, Synthesize, Parasitics, Sizing, Analyze, Experiment, Checks, Options; entries include tune-to-target sense amp / memory cell, autosize with restrictions, speed-power curve, sensitivity optimization, metal line optimization, VT selection, delay vs. size, cell / sense amp characterization, memory cell stability, setup/hold characterization, user-specified scripts]
- Tools work with partial or full selection
- Designer intervention allowed anywhere
- Layout planner provides wiring parasitics
- Not the route of the week
- All tools callable from user programs
- Experiment organizer
- Optimization and experiments built in
256Optimization Example
- Imagine
- Select gates from schematic editor or layout
planner to optimize - Select optimization for PD3
- Include a metal width and space
- Include VT range optimization
- Force a metal line length as a function of
transistor sizes in a cell - Select Pathmill analysis
- Run with sensitivity turned on
257Optimization Example
258Evolve a Macro Library
Feasibility studies and estimation → RTL → Circuit design → Layout → Tapeout
- Executable on-line documentation
- Designs must be easily absorbed into the library
259Tools and productivity
- Functional uarch modules
- Investigation tools and libraries
- Cross-discipline optimization, Monte Carlo
- Easy database access
- Designer has same access as developer
- Full chip path extraction and visualization
- Productive design requires
- Innovation to be early
- Early innovation enabled by
- Flexible and open tools
260Development CAD and DA
Research: core technologies
Tool vendors
CAD development: productize modules and sample flow
Design DA groups: interfaces, flows, and adaptations
Designers: special features
261Examples
262Chip with bias generator (BG)
150nm Communications router (ISSCC 01)
Digital core with on-chip PMOS body bias
generator (BG). 1.5 million PMOS devices
263Distributed biasing scheme
Central Bias Generator (CBG) and Local Bias
Generator (LBG)
264Bias generation distribution
265Routing details
[Diagram: global routing distributes Vcca and Vcca - 450mV to LBGs, plus an FBB/ZBB control bit; local routing delivers Vcc and Vcc - 450mV from LBGs]
266Router chip summary
267Dual-VT Motivation
- Low-VT used in critical paths
- Achieve same frequency as all low-VT design
- Leakage power much smaller than all low-VT design
268Dual-VT Options
- DVT
- H-SDVT
- L-SDVT
- DVTS
269Dual-VT Allocation Only (DVT)
- Transistors sized for original target
- Insert low-VT to meet new target frequency
270Selective LVT Insertion (H-SDVT)
- Size at target frequency
- Insert low-VT to fix critical paths
- Size to optimize slack (down-size)
271Selective HVT Insertion (L-SDVT)
- Convert netlist to all low-VT
- Size at target frequency
- Insert high-VT on non-critical paths
- Size to optimize slack
272Dual-VT and Sizing (DVTS)
- Iterative DVT flow
- Use different amounts of sizing, low-VT to reach
target - Pick best iteration
273Tutorial summary
- Challenges for low power and high performance
- Historical device and system scaling trends
- Sub-100nm device scaling challenges
- Power delivery and dissipation challenges
- Power efficient design choices
- Circuit techniques for variation tolerance
- Short channel effects
- Adaptive circuit techniques for variation
tolerance
274Tutorial summary (contd.)
- Circuit techniques for leakage control
- Leakage power components
- Leakage power prediction and control techniques
- Full-chip power reduction techniques
- Micro-architecture innovations
- Coding techniques for interconnect power
reduction - CMOS compatible dense memory design
- Special purpose hardware
- Design methodologies challenges for CAD
275Power limited microprocessor integration choices
[Diagram: present general purpose units plus memory evolve over the next decade into adaptive general purpose units, special purpose units (DSP, wired/wireless network processing), dense memory, active and standby power management, and adapt-to-process capability]
276Acknowledgements
- The presenters would like to thank all the CRL
team members and Intel design and manufacturing
teams for their contribution towards the contents
of this tutorial.
277Bibliography (1 of 7)
- De, V. Borkar, S. Technology and design
challenges for low power and high performance
microprocessors, Low Power Electronics and
Design, 1999. Proceedings. 1999 International
Symposium on , 1999, Page(s) 163-168 - Lundstrom, M. Ren, Z. Essential physics of
carrier transport in nanoscale MOSFETs, Electron
Devices, IEEE Transactions on , Volume 49 Issue
1 , Jan. 2002, Page(s) 133 -141 - Thompson, S. et al A 90 nm logic technology
featuring 50 nm strained silicon channel
transistors, 7 layers of Cu interconnects, low k
ILD, and 1um2 SRAM cell, Electron Devices
Meeting, 2002. IEDM '02. Digest. International ,
8-11 Dec. 2002, Page(s) 61 -64 - Karnik, T. Borkar, S. Vivek De Sub-90nm
technologies--challenges and opportunities for
CAD, Computer Aided Design, 2002. ICCAD 2002.
IEEE/ACM International Conference on , 2002,
Page(s) 203-206 - Belady, C. Cooling and power considerations for
semiconductors into the next century, Low Power
Electronics and Design, International Symposium
on, 2001. , 6-7 Aug. 2001, Page(s) 100 -105 - Karnik, T et al Selective node engineering for
chip-level soft error rate improvement, VLSI
Circuits Digest of Technical Papers, 2002.
Symposium on , 13-15 June 2002, Page(s) 204 -205 - Narendra, S. De, V. Borkar, S. Antoniadis, D.
Chandrakasan, A. Full chip sub-threshold leakage
power prediction model for sub 0.18um CMOS, Low
Power Electronics and Design, 2002. ISLPED '02.
Proceedings of the 2002 International Symposium
on , 2002, Page(s) 19 -23 - Narendra, S. Borkar, S. De, V. Antoniadis, D.
Chandrakasan, A. Scaling of stack effect and its
application for leakage reduction, Low Power
Electronics and Design, International Symposium
on, 2001. , 2001, Page(s) 195-200 - Narendra, S. et al 1.1 V 1 GHz communications
router with on-chip body bias in 150 nm CMOS,
Solid-State Circuits Conference, 2002. Digest of
Technical Papers. ISSCC. 2002 IEEE International
, Volume 1 , 2002, Page(s) 270 -466 vol.1
278Bibliography (2 of 7)
- Tschanz, J.W. Narendra, S. Nair, R. De, V.
Effectiveness of adaptive supply voltage and body
bias for reducing impact of parameter variations
in low power and high performance
microprocessors, Solid-State Circuits, IEEE
Journal of , Volume 38 Issue 5 , May 2003,
Page(s) 826 -829 - Tschanz, J.W. et al Adaptive body bias for
reducing impacts of die-to-die and within-die
parameter variations on microprocessor frequency
and leakage, Solid-State Circuits, IEEE Journal
of , Volume 37 Issue 11 , Nov. 2002, Page(s)
1396 -1402 - Vangal, S. et al 5GHz 32b integer-execution core
in 130nm dual-Vt CMOS, Solid-State Circuits
Conference, 2002. Digest of Technical Papers.
ISSCC. 2002 IEEE International , Volume 2 ,
2002, Page(s) 334 -535 - Narendra, S. Keshavarzi, A. Bloechel, B.A.
Borkar, S. De, V. Forward body bias for
microprocessors in 130-nm technology generation
and beyond, Solid-State Circuits, IEEE Journal of
, Volume 38 Issue 5 , May 2003, Page(s) 696
-701 - Somasekhar, D Lu, Shih-Lien Bloechel, Bradley
Lai, Konrad Borkar, Shekhar De, Vivek Planar
1T-Cell DRAM with MOS Storage Capacitors in a
130nm Logic Technology for High Density
Microprocessor Caches, European Solid-State
Circuits Conference, 2002, Proceedings of the
2002 International Conference on, ESSCIRC 2002,
Page(s) 127 - 130 - Khellah, M. Tschanz, J. Ye, Y. Narendra, S.
De, V. Static pulsed bus for on-chip
interconnects, VLSI Circuits Digest of Technical
Papers, 2002. Symposium on , 13-15 June 2002,
Page(s) 78-79 - Anders, M. Rai, N. Krishnamurthy, R.K. Borkar,
S. A transition-encoded dynamic bus technique
for high-performance interconnects, Solid-State
Circuits, IEEE Journal of , Volume 38 Issue 5 ,
May 2003, Page(s) 709-714 - Vangal, S. et al A 5GHz Floating Point Multiply
Accumulator in 90nm Dual-VT CMOS, Solid-State
Circuits Conference, 2003. Digest of Technical
Papers. ISSCC. 2003 IEEE International , Volume
46 , 2003, Page(s) 334 -335
279Bibliography (3 of 7)
- Hoskote, Y. et al A 10GHz TCP Offload
Accelerator for 10Gbps Ethernet in 90nm Dual-VT
CMOS, Solid-State Circuits Conference, 2003.
Digest of Technical Papers. ISSCC. 2003 IEEE
International , Volume 46 , 2003, Page(s)
- http://www.intel.com/research/silicon/mooreslaw.htm
- G.E. Moore, Cramming more components onto
integrated circuits, Electronics, vol. 38, no.
8, April 19, 1965. - K.G. Kempf, Improving Throughput across the
Factory Life-Cycle, Intel Technology Journal,
Q4, 1998. - S. Thompson, P. Packan, and M. Bohr, MOS
Scaling Transistor Challenges for the 21st
Century, Intel Technology Journal, Q3, 1998. - Y. Taur and T. H. Ning, Fundamentals of Modern
VLSI Devices, Cambridge University Press, 1998. - D. Antoniadis and J.E. Chung, Physics and
Technology of Ultra Short Channel MOSFET
Devices, Intl. Electron devices Meeting, pp.
21-24, 1991. - A. Chandrakasan, S. Sheng, and R. W. Brodersen,
Low-Power CMOS Digital design, IEEE J.
Solid-State Circuits, vol. 27, pp. 473-484, Apr.
1992. - Z. Chen, J. Shott, J. Burr, and J. D. Plummer,
CMOS Technology Scaling for Low Voltage Low
Power Applications, IEEE Symp. Low Power Elec.,
pp. 56-57, 1994. - H.C. Poon, L.D. Yau, R.L. Joh