Title: Variation Tolerant Analog and Digital Design Methodologies
1Variation Tolerant Analog and Digital Design
Methodologies
- Larry Pileggi
- Carnegie Mellon
- pileggi_at_ece.cmu.edu
2Preview
- Controlling the dominant (systematic) variations
- Regular and restricted design rule (RDR) logic fabrics
- Methodologies and circuits that are optimized for regular fabrics
- Analog/RF regularity
- Stochastic design methods for random variations
- Modeling circuit-level variability
- SRAM statistical modeling and design
- Analog/RF stochastic design
3Sub-65nm CMOS Challenges
- Design and manufacturing costs are now prohibitive
- Printability limited by sub-wavelength lithography
- Standard layout rules become insufficient
- First eliminate systematic variability, then
address random variability
4Gridded RDRs
- Return to the past? Gridded, fixed-pitch layouts
- The translation of stick-layouts to gds2 patterns dictates the required rules and layout density
5Example Gridded M1 Patterns
- Example rules for contacts/vias at line-ends
- Which micro-regular pattern is more manufacturable?
[Figure: Pattern A and Pattern B; 100nm spacing]
6Example Gridded M1 Patterns
- Exploiting the slightly tighter line-ends with pattern B can improve area, particularly with gridded layout
- Relies on ability to characterize all possible patterns
[Figure: Pattern B; desired process window]
7Reduced Number of Patterns
- Micro-regular patterning can reduce number of unique patterns and reduce systematic variability
- But the macro-regularity, or the way we group subtle patterns, can be equally important
- At what node do we benefit from limiting both micro-regular patterns and macro-regular groupings?
- How can we best utilize this regularity?
8Macro-Regular Predictability
- Ex: SRAM-layout-specific SPICE models are required for design closure of CMOS SRAMs
- Statistical transistor models (90nm) based on all possible patterns produce a much wider noise margin distribution
[Figure: noise margin distributions (σ = 0.060 and σ = 0.026) comparing SRAM-layout-specific models with DR-compliant-layout models]
9Regularity Simplifies Rules Qualification
- Standard design rules are created for the worst case; SRAM rules are created for specific patterns
- Rules can be simplified (pushed) for regular patterns with known neighborhoods
- If we can pre-qualify all regular patterns, there is less need to pre-qualify all logic cells
- Can now derive application-domain-specific logic for improved logic efficiency and density
- Requires methodology based on micro- and macro-regularity
10Logic Bricks for Macro-Regularity
- To control the number of geometry patterns that must be pre-qualified, we can implement logic from larger cells (bricks)
- Reduces the number of edge patterns
11CMU Experimental Brick Flow
12ARM926EJ Example
- 65nm Low Power CMOS
- Std Cell Spec Design
- 16KB D cache, 32KB I cache
- 250MHz worst case
- Area 1.1323 mm2
- Bricks derived from 7 fixed-size primitives
- AO22, AO12, Nand2, Nor2, And2, 4:1 Mux
- 3 Flip Flop types
- Various INV sizes for buffering
- 16 fixed-size application-specific bricks
- Compatible boundary INVs, NANDs and NORs
- Identical to std cell footprint
- 25 man-weeks of design time
- First-pass working silicon
[Figure: unidirectional micro-regular fabric]
13ARM926EJ Results
- Std cells based on sizing and resynthesis using complete library
- Results do not reflect improved control of variations, or possible improvement with brick-specific synthesis and design flow
- Normalized Leff comparison based on ACLV simulations at nominal process conditions for DFF cell vs. brick
14Regular Bricks vs. Non-Regular Std Cells
- DR-compliant, unidirectional FEOL pattern bricks incur 15-75% area penalty vs. non-regular standard cells at 65nm
- Simple patterns allow pushed line-end rules for area improvement
- Can merge diffusions within large brick functions
[Figure: normalized area (normalized to non-regular pattern std cells) vs. brick index for pushed-rule bricks]
15Regular Bricks vs. Non-Regular Std Cells
- DR-compliant, unidirectional FEOL pattern bricks incur 15-75% area penalty vs. non-regular standard cells
- Simple patterns allow pushed line-end rules for area improvement
- Can merge diffusions within large brick functions
- Transistor-level optimization (TLO) of large brick functions offers further improvement
[Figure: normalized area (normalized to non-regular pattern std cells) for pushed-rule bricks and bricks w/ manual TLO]
16Baking Bricks
- Can derive application-specific bricks that are constructed from pre-qualified regular patterns
- TLO can provide significant improvement for large non-traditional logic functions
- Some very efficient transistor-level implementations are possible for certain application-domain-specific designs
Example Brick: F ab bc bd acd
Synthesis with a standard cell library requires 18 transistors vs. 10 for this implementation
17Mapping to Bricks
- Difficult to add all possible large logic functions to standard cell libraries; technology mapping algorithms would struggle
Function ABC(DEFG): 20 transistors (2 AO22), 4 stages of logic
Function ABC(DEFG): 16 transistors, 2 stages of logic
18Logic/Pattern Co-Optimization
Gate-based Micro-Regular Gridded Layout
Micro-Regular Layout of Gridded TLO Brick
19Application-Specific TLO
- Can also attempt to extract functions that are particularly efficient for a specific fabric or application
- Example: b0 p0p1 p2p3 p1p2p4 p0p3p4
12 transistors, 2 logic stages
26 transistors, 4 logic stages
20Logic BRIX
- Greater advantages below 65nm as methodologies and mapping algorithms accommodate bricks
- Beta version of commercial flow has demonstrated the benefits of pre-qualified patterns and TLO
Courtesy of PDF Solutions pdBRIX
21Analog and Mixed-Signal
- Same lithography setup must work for analog and mixed-signal components (SRAM)
- SRAM has always been macro-regular, now becoming more micro-regular
- Analog layout has always been regular to control systematic mismatch
- Random variations now become dominant
22Random Variations
- Random variations are most prominent for min-sized FETs
- E.g., line edge roughness is most dominant for a min-length FET
- Wider FETs reduce variation via the Central Limit Theorem
[Figure: distribution of ΔL variation and distribution of avg. length for widths W0 and W; L (nm) axis from 45 to 55]
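The averaging effect can be illustrated with a small Monte Carlo sketch. It assumes, purely for illustration, that a gate edge can be split into independent width segments, each with its own Gaussian length deviation, and that a wider FET simply has more segments. The 50 nm nominal length and 2 nm per-segment sigma are hypothetical values, not data from the talk.

```python
import random
import statistics

def avg_length_sigma(n_segments, sigma_seg_nm=2.0, l_nom_nm=50.0,
                     trials=4000, seed=1):
    """Std. dev. of the width-averaged gate length over many sampled devices."""
    rng = random.Random(seed)
    averages = []
    for _ in range(trials):
        # Each segment gets an independent length deviation (assumed model).
        segments = [rng.gauss(l_nom_nm, sigma_seg_nm) for _ in range(n_segments)]
        averages.append(sum(segments) / n_segments)
    return statistics.pstdev(averages)

s1 = avg_length_sigma(1)     # minimum-width device: one segment
s16 = avg_length_sigma(16)   # 16x wider device: sixteen segments
print(f"sigma(W0) = {s1:.2f} nm, sigma(16*W0) = {s16:.2f} nm, ratio = {s1 / s16:.2f}")
```

Under these assumptions the ratio should come out close to sqrt(16) = 4, which is the 1/sqrt(W) narrowing the slide attributes to the Central Limit Theorem.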
23Stochastic Design SRAMs
- SRAM timing is determined by small FETs in
bit-cells
[Figure: SRAM read path with core cells, WL, BL/BL_, column mux, replica path, sense amp (SA), SAEN and OUT/OUT_]
Waveforms sampled from 90nm CMOS low-swing bitline SRAM testchip (in collaboration with Prof. Ken Mai, CMU)
24Replica Bitline (RBL)
- Conventional RBL chooses a fixed number of driver cells to partially average out the randomness
- Increasingly difficult as random mismatch becomes more dominant
25Configurable Replica Bitline (CRBL)
- Instead select a subset of potential driver cells
(post-manufacturing) that best average out
randomness
26Configurable Replica Bitline (CRBL)
- Post-manufacturing selection provided a 100ps tuning range using 3 cells selected from 10 candidates in 90nm testchip
- Randomness provides for wider tuning range
[Figure: measured 100ps tuning range]
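The selection step can be sketched as a small exhaustive search. This is only an illustration: it assumes each candidate driver cell can be summarized by a single delay number and that a subset behaves like the mean of its cells, which is a simplification of how replica cells actually combine on the bitline. The delay values, the target and the function name are hypothetical.

```python
import itertools
import random

def subset_mean_delays(cell_delays_ps, k):
    """Map each k-cell subset (tuple of cell indices) to its mean delay."""
    return {
        idx: sum(cell_delays_ps[i] for i in idx) / k
        for idx in itertools.combinations(range(len(cell_delays_ps)), k)
    }

rng = random.Random(0)
candidates = [rng.gauss(100.0, 15.0) for _ in range(10)]  # hypothetical delays (ps)
target_ps = 100.0                                         # hypothetical target

means = subset_mean_delays(candidates, 3)
chosen = min(means, key=lambda idx: abs(means[idx] - target_ps))
fixed = (0, 1, 2)  # conventional RBL: a fixed set of three driver cells
tuning_range_ps = max(means.values()) - min(means.values())

print("chosen cells:", chosen)
print("error, configurable:", abs(means[chosen] - target_ps))
print("error, fixed:", abs(means[fixed] - target_ps))
print("tuning range (ps):", tuning_range_ps)
```

Because the fixed set is one of the 120 possible 3-of-10 subsets, the configurable choice can never be worse than it in this model, and a larger spread among candidates widens the range of achievable subset means.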
27RBL vs. CRBL
- Simulations of read path for a commercial 65nm SRAM design
[Figure: replica path vs. read path; global only: ρ = 0.91; global + local: ρ = 0.41]
[Figure: RBL vs. CRBL (3 of 5 cells) and RBL vs. CRBL (3 of 10 cells), each comparing RBL delay w/o mismatch with configurable RBL delay w/ mismatch]
28Capturing the System Level Impact
- Build statistical response surface models (RSMs) to compare and optimize designs
- Example: SRAM self-timing
- Self-timing circuit must track bitcell delay
- Self-timing delay is part of READ delay
- Buffer chain (BUF): insensitive to intra-die variations, but poor tracking of inter-die and environmental variations
- Replica bitline (RBL): better tracking for inter-die and environmental variations, but more sensitive to mismatch
- Configurable replica bitline (C-RBL)
29Monte Carlo (Statistical) Analysis
- Monte Carlo analysis
- Randomly select M samples for ε1, ε2, ...
- Evaluate circuit performance at each sampling point
- Estimate performance distribution using the M samples
model NMOS bsim4 type=n tox = 4e-9 + 1e-10·ε1 + 1.3e-10·ε2 + ... vth0 = 0.6 + 0.24·ε1 + 0.3·ε2 + ...
ε1, ε2, ε3, ... → Simulator → Performance Distribution
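The three steps above can be sketched in a few lines. This is a minimal illustration, not the flow used in the talk: the "simulator" is a made-up analytic delay function standing in for a circuit simulation, and the parameter coefficients are illustrative values chosen so the toy function stays well behaved.

```python
import random
import statistics

def sample_params(rng):
    """Draw one device-parameter sample from two standard normal variables."""
    eps1 = rng.gauss(0.0, 1.0)
    eps2 = rng.gauss(0.0, 1.0)
    # Linear parameter models in the style of the slide (illustrative coefficients).
    tox = 4e-9 + 1e-10 * eps1 + 1.3e-10 * eps2   # oxide thickness (m)
    vth0 = 0.6 + 0.024 * eps1 + 0.03 * eps2      # threshold voltage (V)
    return tox, vth0

def toy_delay_ps(tox, vth0, vdd=1.2):
    """Hypothetical stand-in for a circuit simulator; not a physical model."""
    return 50.0 * (tox / 4e-9) * vdd / (vdd - vth0) ** 1.3

def monte_carlo(m, seed=0):
    """Evaluate the toy performance at m random sampling points."""
    rng = random.Random(seed)
    return [toy_delay_ps(*sample_params(rng)) for _ in range(m)]

delays = monte_carlo(2000)
mean_delay = statistics.mean(delays)
sigma_delay = statistics.stdev(delays)
print(f"mean = {mean_delay:.1f} ps, sigma = {sigma_delay:.1f} ps")
```

In a real flow the call to `toy_delay_ps` would be a full simulation run, which is why the number of samples M dominates the cost.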
30Monte Carlo Samples
- Applying MC at system level can be run-time costly
- 1k to 10k sampling points are typically required to achieve reasonable accuracy
- Even with 10k sampling points, an accurate result is not guaranteed!
- MC analysis is random, and you can be unlucky with samples (especially for results beyond the ±3 sigma range)
- Controlling sampling points is often important, especially for circuits like SRAMs
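A short calculation suggests why 10k samples can still be too few in the tails. It assumes independent samples and a failure probability p estimated as the fraction of failing samples; the symbols p, M and k are introduced here for illustration.

```latex
% k failures observed in M independent samples, true failure probability p
\hat{p} = \frac{k}{M}, \qquad
\sigma_{\hat{p}} = \sqrt{\frac{p\,(1-p)}{M}}, \qquad
\frac{\sigma_{\hat{p}}}{p} = \sqrt{\frac{1-p}{p\,M}}

% Example: p = 10^{-3} (roughly a one-sided 3-sigma tail), M = 10^{4}
\frac{\sigma_{\hat{p}}}{p} = \sqrt{\frac{0.999}{10^{-3}\cdot 10^{4}}} \approx 0.32
```

With only about ten expected failures among 10k samples, the estimated failure rate carries a relative uncertainty of roughly 30%.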
31Importance Sampling
- Brute-force MC simulation is impractical for rare events
- If Pr(Performance < SPEC) < 10^-6, at least a million samples are required to observe this event in MC simulation
- Idea: bias the random sample generation in such a way as to observe rare events with a much smaller number of samples
- Build a response surface model (RSM) to identify the failure space for MC analysis
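The biasing idea can be sketched on a one-dimensional toy problem. The example assumes the "performance" is a single standard normal variable and that failure means exceeding a known threshold, so the failure region is given analytically rather than found with an RSM. It uses a mean-shifted Gaussian as the biased distribution, which is one common choice and not necessarily the one used in the talk.

```python
import math
import random

def tail_prob_importance(threshold, n_samples, seed=0):
    """Estimate Pr[X > threshold] for X ~ N(0, 1) using a mean-shifted proposal."""
    rng = random.Random(seed)
    shift = threshold  # center the biased distribution on the failure boundary
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            # Likelihood ratio N(0,1) / N(shift,1) evaluated at x.
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n_samples

threshold = 4.75  # tail probability on the order of 1e-6
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))
estimate = tail_prob_importance(threshold, 20000)
print(f"exact = {exact:.3e}, importance-sampling estimate = {estimate:.3e}")
```

Plain Monte Carlo with the same 20,000 samples would most likely observe no failures at all, while the weighted estimate lands close to the exact tail probability.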
32Response Surface Modeling (RSM)
- Approximate the performance of interest (e.g., delay, power, gain, etc.) as an analytic function of process parameters
- Can cover local variations of process parameters (±30%)
- Use linear or quadratic functions to approximate the corresponding local variation
- Fitting RSM to samples, then performing MC on RSM, can be more efficient than direct MC if the number of variables (N) is small
- The number of sampling points must be equal to or greater than the number of unknown model coefficients to fit RSM
- Linear RSM contains N + 1 model coefficients
- Quadratic RSM contains N(N+1)/2 + N + 1 model coefficients
- PWL RSMs stitched together can be used to cover a larger space
[Figure: local RSM, p(X)]
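A small sketch makes the coefficient counts and the fit-then-sample idea concrete. The one-variable "simulator" below is a made-up function standing in for an expensive simulation, and the closed-form least-squares fit covers only the single-variable linear case; all names and values are illustrative assumptions.

```python
import random

def n_coeffs_linear(n):
    """Unknown coefficients in a linear RSM of n variables."""
    return n + 1

def n_coeffs_quadratic(n):
    """Unknown coefficients in a full quadratic RSM of n variables."""
    return n * (n + 1) // 2 + n + 1

def toy_simulator(x):
    """Hypothetical stand-in for an expensive simulation (mildly nonlinear)."""
    return 100.0 + 8.0 * x + 0.5 * x * x

def fit_linear_rsm(xs, ys):
    """Least-squares fit of y ~ c0 + c1 * x for a single variable."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    return my - c1 * mx, c1

print(n_coeffs_linear(10), n_coeffs_quadratic(10))    # 11 66
print(n_coeffs_linear(300), n_coeffs_quadratic(300))  # 301 45451

# Fit on a handful of "simulations", then run cheap Monte Carlo on the model.
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ys = [toy_simulator(x) for x in xs]
c0, c1 = fit_linear_rsm(xs, ys)
rng = random.Random(0)
rsm_samples = [c0 + c1 * rng.gauss(0.0, 0.3) for _ in range(10000)]
```

The counts show why the quadratic form is only attractive when N is small: at N = 300 it already needs more than 45,000 samples just to be solvable.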
33Statistical Response Surface Models
- Given set of correlated inter-die variations and set of spatially correlated intra-die variations, build statistical RSM
- Fitted analytical performance model based on well-chosen simulation samples
- Accuracy depends on model complexity: linear, quadratic, piecewise-linear, ...
- Include uniform distribution or corner models for VDD and temperature
34Model Explosion
- Statistical device models can be extremely complex
- Over 300 random variables (inter-die) for a 65nm process
- Mismatch modeling can require 10-20 additional random variables for every transistor
- If the number of variables is large, first convert set of correlated random variables to independent set of random variables
- Simple example: ΔVTH,NL and ΔVTH,NR are correlated: ΔVTH,NL = y1 + y3, ΔVTH,NR = y2 + y3
35PCA
- Principal Component Analysis (PCA) does this in a generalized way for jointly normal random variables
- Apply eigen decomposition to produce a new (possibly smaller) set of parameters that are uncorrelated
- Similar to finding the orthogonal basis of a vector space
Δx: correlated parameters; Δy: uncorrelated parameters
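A two-variable worked example shows the eigen decomposition at small scale. It assumes two threshold deviations that share one common component, written here as \Delta V_{NL} = y_1 + y_3 and \Delta V_{NR} = y_2 + y_3 with y_1, y_2, y_3 independent and of unit variance; the symbols z_1, z_2 for the new parameters are introduced for illustration.

```latex
% Covariance of the correlated pair (\Delta V_{NL}, \Delta V_{NR})
\Sigma =
\begin{pmatrix}
\operatorname{Var}(y_1) + \operatorname{Var}(y_3) & \operatorname{Var}(y_3) \\
\operatorname{Var}(y_3) & \operatorname{Var}(y_2) + \operatorname{Var}(y_3)
\end{pmatrix}
=
\begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix}

% Eigen decomposition \Sigma = V \Lambda V^{T}
\lambda_1 = 3,\; v_1 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 1 \end{pmatrix},
\qquad
\lambda_2 = 1,\; v_2 = \tfrac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ -1 \end{pmatrix}

% Uncorrelated parameters
z_1 = \frac{\Delta V_{NL} + \Delta V_{NR}}{\sqrt{2}}, \quad
z_2 = \frac{\Delta V_{NL} - \Delta V_{NR}}{\sqrt{2}}, \quad
\operatorname{Var}(z_1) = 3,\; \operatorname{Var}(z_2) = 1,\;
\operatorname{Cov}(z_1, z_2) = 0
```

The common-mode direction z_1 carries three quarters of the total variance, which is the kind of imbalance that lets small-eigenvalue directions be dropped in larger problems.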
36Reducing Number of Variables
- If some of the eigenvalues are small, they can be removed to reduce the random space dimension
- Allows us to use a compact set of independent random variables to approximate the original high-dimensional space
- Most large problems tend to be rank deficient
J. Friedman and W. Stuetzle, "Projection pursuit regression," Journal of the American Statistical Association, vol. 76, no. 376, pp. 817-823, 1981
X. Li, J. Le, L. Pileggi and A. Strojwas, "Projection-based performance modeling for inter/intra-die variations," ICCAD, pp. 721-727, 2005
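One simple way to decide which eigenvalues count as small is to keep the largest ones until a chosen fraction of the total variance is covered. The sketch below assumes that rule; the 90% threshold and the eigenvalue list are hypothetical, and the talk does not state which criterion was used.

```python
def components_to_keep(eigenvalues, variance_fraction=0.9):
    """Number of largest eigenvalues needed to cover the given variance fraction."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running = 0.0
    for count, lam in enumerate(ordered, start=1):
        running += lam
        if running >= variance_fraction * total:
            return count
    return len(ordered)

# Hypothetical eigenvalue spectrum of a covariance matrix with 8 parameters.
eigs = [5.0, 3.0, 1.5, 0.3, 0.1, 0.05, 0.03, 0.02]
kept = components_to_keep(eigs, 0.9)
print(f"keep {kept} of {len(eigs)} principal components")
```

For this spectrum the first three components already cover 95% of the variance, so the remaining five directions can be dropped with little loss.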
37Statistical Modeling Example
- Constructed PWL RSMs to compare designs
- Buffer chain (BUF)
- Replica bit line (RBL)
- Configurable Replica bit-line (C-RBL)
- Applied MC analysis to the region of most likely
failures
38M-C Simulation Results for 65nm CMOS
- Comparison of self timing architectures (results
based on 0.98 success rate at chosen frequency)
39Optimizing Designs
- Can we use RSMs to optimize the designs over local statistical parameter space?
- Formally: find the optimum set of design variables which minimizes a cost function and meets a set of specifications
- Both the objective and the constraint become stochastic
- Choice of optimization algorithm would depend on the objective and constraint functions
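One common way to write such a problem down is as a chance-constrained program. The form below is a sketch under assumed notation (design variables x, random process parameters ε, cost f, performance functions g_j with specifications s_j, and target yield levels η_j); it is not necessarily the formulation used in the talk.

```latex
\min_{x} \;\; \mathbb{E}_{\varepsilon}\!\left[ f(x, \varepsilon) \right]
\quad \text{subject to} \quad
\Pr_{\varepsilon}\!\left[ g_j(x, \varepsilon) \le s_j \right] \ge \eta_j,
\qquad j = 1, \dots, J
```

When f and g_j are replaced by fitted RSMs, both the expectation and the probabilities can be evaluated cheaply, which is what makes the optimization tractable.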
40Example Sense Amp Optimization
- Random offset impacts the self-timing of the READ
- Build RSM of offset that is dominated by VTHn variations
- Simulations suggest a linear relationship between offset and VTHn
65nm Latch type sense amp
Based on 1000 MC samples
41Random Input Offset
- There is little or no correlation for VTHp and other variables
- Offset is less sensitive to these other variations, even across different precharge voltages
65nm Latch type sense amp
Based on 1000 M-C samples
42Optimization
- Since the dominant variation parameter shows a linear relationship, a simple linear RSM can be used to optimize sizing of NFETs (N1 and N2) and PFETs (P1 and P2)
- Voffset = a·Diff(Vtn) + b·Diff(Vtp) + c
- Model has less than 3% error
- Vtn and Vtp are incorporated as independent and Gaussian
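With a linear model and independent Gaussian inputs, the offset distribution follows in closed form. The sketch below assumes the model reads V_offset = a·ΔV_tn + b·ΔV_tp + c, where ΔV_tn and ΔV_tp denote the threshold differences within the NFET and PFET pairs and are taken to be zero-mean; the sigma symbols are introduced for illustration.

```latex
V_{\text{offset}} = a\,\Delta V_{tn} + b\,\Delta V_{tp} + c,
\qquad
\Delta V_{tn} \sim \mathcal{N}(0, \sigma_{n}^{2}),\;
\Delta V_{tp} \sim \mathcal{N}(0, \sigma_{p}^{2}) \text{ independent}

\mathbb{E}\!\left[ V_{\text{offset}} \right] = c,
\qquad
\sigma_{V_{\text{offset}}}^{2} = a^{2}\,\sigma_{n}^{2} + b^{2}\,\sigma_{p}^{2}
```

Sizing changes the sensitivities a and b as well as the mismatch sigmas, so the offset spread can be minimized directly from this expression instead of rerunning Monte Carlo for every candidate size.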
43Measurement Results
- Large input offset voltage variation for 65nm
- Optimized circuit: 10% larger gate area, 25% lower offset
- Measured data based on 14K SAs from 20 different chips
44Simulation Results
- Comparison of simulation and measurement as a function of precharge voltage, Vpc
- Optimized circuit has been desensitized to variations, including precharge
45Pelgrom Model
- It is well known [Pelgrom] that increasing the device sizes will tend to average out random variations
- For random threshold variation, Pelgrom showed that the (uncorrelated) variance improves in proportion to the gate area WL
Pelgrom et al., "Matching Properties of MOS Transistors," IEEE JSSC, vol. 24, no. 5, Oct. 1989.
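For reference, the area term of the Pelgrom mismatch model can be written as follows, where A_{VT} is the process-dependent matching coefficient; the distance-dependent term of the full model is omitted here on the assumption of closely spaced devices.

```latex
\sigma^{2}\!\left( \Delta V_{T} \right) = \frac{A_{VT}^{2}}{W L}
\qquad \Longrightarrow \qquad
\sigma\!\left( \Delta V_{T} \right) = \frac{A_{VT}}{\sqrt{W L}}
```

Halving the threshold mismatch therefore requires four times the gate area.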
46Results Comparison
- How does the Pelgrom model compare as a function of precharge?
- Accuracy of the model depends on the region of operation
- Pelgrom model only applies if performance variation is dominated by mismatch of 2 transistors
47Analog/RF Design in Scaled CMOS
- As CMOS continues to scale, oversizing transistors can potentially cancel any benefit of moving to the next generation technology
- Example: Pelgrom model analysis of a 65nm differential pair
- Mismatch improves slowly with increasing transistor size: 1/sqrt(area)
48Sizing via Selection of Elements
- Start with regular fabric of analog sub-components but select only a subset of them for precision matching
- Ex: open-loop amp for pipeline ADC, mismatch in 65nm CMOS
- Select some (1/2) rather than all subcomponents to minimize offset
49Post-Silicon Element Selection for Mismatch
- Some circuit overhead required to implement post-silicon tuning
- But with further scaling, post-silicon tuning might be the only way to meet specs and reap the benefits of next-gen technology
- Example: exponential vs. sqrt improvement (Pelgrom model) with area for 65nm open-loop amplifier
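The contrast between sqrt averaging and subset selection can be illustrated with a toy experiment. It assumes each sub-component contributes an independent Gaussian offset and that the offset of a group is the mean of its members. The brute-force search over all half-size subsets stands in for whatever selection procedure a real post-silicon tuning scheme would use, and the 10 mV per-element sigma is a hypothetical value.

```python
import itertools
import math
import random

def rms(values):
    """Root-mean-square of a list of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def best_subset_offset(offsets, k):
    """Smallest |mean offset| over all k-element subsets (brute force)."""
    return min(abs(sum(c)) / k for c in itertools.combinations(offsets, k))

def compare(n=10, trials=300, sigma_mv=10.0, seed=0):
    """RMS offset across sampled dice: all n elements vs. best n/2 of n."""
    rng = random.Random(seed)
    use_all, use_best_half = [], []
    for _ in range(trials):
        offsets = [rng.gauss(0.0, sigma_mv) for _ in range(n)]
        use_all.append(sum(offsets) / n)   # Pelgrom-style averaging over all elements
        use_best_half.append(best_subset_offset(offsets, n // 2))
    return rms(use_all), rms(use_best_half)

rms_all, rms_selected = compare()
print(f"all elements: {rms_all:.2f} mV, best half selected: {rms_selected:.3f} mV")
```

Averaging all n elements only gains a factor of sqrt(n), whereas the number of candidate subsets grows combinatorially with n, which is the intuition behind the exponential vs. sqrt comparison on the slide.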
50Conclusions
- Regular patterning for logic, memory and analog becomes increasingly important below 65nm
- New circuits and methodologies can exploit this regularity for improved performance
- As systematic variations are better controlled, random variations will become dominant
- Stochastic design methods will be needed to produce competitive chips
- Configurable and tunable circuits will become more imperative, particularly for analog and mixed-signal