Title: Towards Technology Aware Design at the Architecture Level
Slide 1: Towards Technology Aware Design at the Architecture Level
Slide 2: Convergence of communication, computing and consumer
Slide 3: Convergence is enabled by complex digital real-time SoCs
20 radios, >200MHz CPU, >200MHz DSP, >64MB Flash, >32MB RAM, 600 components, 20cm PCB area
- Cost is all that matters
  - price erosion is high (2x/decade) for high-end products
- But power efficiency is a must
  - new features and limited battery capacity require power-efficient architectures (between 10 and 200 GOPS/Watt)
Slide 4: Scaling is the main engine for cost reduction...
[Figure: relative cost per gate (log scale)]
Slide 5: ...but it is turning into a hell of nano-scale physics
- Leakage power starts to dominate
  - gate delay is larger than performance scaling predicts
- Voltage headroom shrinks
  - makes analog/RF designers deeply worried (sub-1-Volt circuits)
- Interconnect claims the first role
  - challenging timing, power, synchronism, signal integrity
- Increasing uncertainties
  - jeopardize predictability and yield, affect the design process
Slide 6: Uncertainties are omnipresent, thereby...
[Figure: sources of uncertainty: dopant fluctuations, manufacturing uncertainties, electrical uncertainties, line edge roughness, NBTI]
Slide 7: ...causing functional errors
Chang, Symposium VLSI Tech. 2005
- Circuits that operate functionally correctly under the expected amount of variation have to be built
  - statistics-aware SRAM design techniques
  - redundancy
  - see Wim's presentation
Slide 8: ...causing parametric uncertainties on delay/energy of digital blocks
[Figure: energy per access vs. access delay of 200 random instances of an 8kB memory]
- Random shift of the performance/power consumption of components
Slide 9: Complicating timing closure in synchronous designs
- The synchronous design paradigm provides a static HW interface to system architects that makes the system easy to reason about
  - Delay of all critical paths is shorter than the fixed clock cycle
[Figure: logic between latches l1 and l2, with a timing diagram of clk, l1 out, l2 in and l2 out for data 1 and data 2; annotations: ok, setup time violation]
Slide 10: Complicating timing closure in synchronous designs
- The synchronous design paradigm provides a static HW interface to system architects that makes the system easy to reason about
  - Delay of all critical paths is shorter than the fixed clock cycle under all conditions!
[Figure: logic between latches l1 and l2, with a timing diagram of clk, l1 out, l2 in and l2 out for data 1 and data 2 on a slow path; annotation: setup time violation]
Slide 11: Guarantee timing closure under all uncertainties
- Worst-case design
  - Design margins for all design corners (worst, typical, best)
  - Accumulated margins lead to over-design -> more area/more power
  - The farther apart the corners, the larger the penalty
- Post-fabrication testing
  - To filter outliers, keep corners closer, limit over-design
Slide 12: Scaling increases uncertainties
- Prevailing worst-case design
  - Worst-case corner of a new node is worse than that of the previous node
  - Reduced Vdd increases sensitivity to variations
  - The more sources of uncertainty, the larger the penalty
  - Tight circuit parametric constraints limit the amount of tolerable low-level uncertainties
- Testing to keep corners closer
  - Distributions are getting wider -> more yield loss
  - Only works for static uncertainties
  - Cannot be extended to degradation-induced uncertainties
- Unable to benefit from scaling
  - Especially in the sub-45nm regime
Slide 13: How can we better handle these variations?
1. Push foundry for less variation
2. Extend the technology-design interface
3. Compensate for variations at run-time
Slide 15: Extending the technology/design interface
- Models extend the design interface to expose the impact of manufacturing on the system
  - reliability models, post-OPC extraction, variability information
- Models allow for better-than-worst-case designs
  - Only modify the hot-spots of the design rather than everything
- Models allow for a reduction of the accumulation of design margins
- Case studies
  - litho-simulation for better gate sizing
  - statistical static timing analysis
  - system-level yield prediction
Slide 16: Case study 1: Using litho-simulations for better gate sizing
J. Yang et al., "Advanced timing analysis based on post-OPC extraction of critical dimensions," Proc. DAC 2005
Slide 19: Case study 1: Using litho-simulations for better gate sizing
- Manufacturing introduces line-width variations (LWV)
  - major contributor to timing variation
  - yield loss, as we signed off on an incorrect layout
- However, 50% of LWV is systematic
J. Yang et al., "Advanced timing analysis based on post-OPC extraction of critical dimensions," Proc. DAC 2005
Slide 20: Case study 1: Using litho-simulations for better gate sizing
- Systematic variations can be modeled after physical layout through an aerial simulation
- Extraction of layout and transistor-level timing analysis allows identification of the true critical paths after manufacturing
- Design corrections for true critical paths (e.g., gate length trimming <-> resizing)
- Critical path information can be used for tuning manufacturing (e.g., reducing the data set for mask production)
Slide 21: Avoiding the accumulation of design margins
- In corner-point design, design margins are accumulated to ensure that the design operates under all worst-case conditions
  - Highest temperature, lowest voltage and worst process conditions for all gates
  - Likelihood of these corners is extremely low
- Optimize the design such that the probability that it meets the timing/power constraints is sufficiently high
- Case studies
  - statistical static timing analysis
  - system-level yield prediction
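The gap between corner-point and probabilistic sign-off can be illustrated with a minimal sketch. It assumes a single path of identical gates whose delays vary independently with a Gaussian distribution; the gate count, mean, sigma and the 3-sigma corner are illustrative values, not numbers from the talk.

```python
import random

def corner_delay(n_gates, mean, sigma, k=3.0):
    """Corner-point estimate: every gate sits at mean + k*sigma at the same time."""
    return n_gates * (mean + k * sigma)

def statistical_delay(n_gates, mean, sigma, quantile=0.999, trials=20000, seed=1):
    """Monte Carlo estimate of the path delay that `quantile` of the dies stay below."""
    rng = random.Random(seed)
    samples = sorted(
        sum(rng.gauss(mean, sigma) for _ in range(n_gates))
        for _ in range(trials)
    )
    return samples[int(quantile * (trials - 1))]

# Illustrative path: 20 gates, unit mean delay, 10% sigma (assumed values).
wc = corner_delay(20, 1.0, 0.1)
st = statistical_delay(20, 1.0, 0.1)
print(f"corner-point delay: {wc:.2f}, 99.9% statistical delay: {st:.2f}")
```

Under the independence assumption the random contributions partly cancel along the path, so the statistical bound sits well below the accumulated corner value. Correlated (systematic) variations would narrow that gap.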
Slide 22: Case study 2: Statistical Static Timing Analysis
[Figure: two flows across the technology, library design and design levels. Conventional flow: worst-case design rules -> gate library characterized for worst case -> gate netlist -> static timing analysis -> delay. Statistical flow: design rules parameterized with manufacturing-induced variations -> gate library with statistical delay distributions -> gate netlist -> statistical static timing analysis -> delay distribution.]
- Tools exist: MAGMA, Synopsys, ExtremeDA, etc.
- The solution is only as good as the input models on variability
Slide 23: Case study 3: System-level Yield Prediction
Yield-aware Architecture Exploration
- Can the system be made yielding?
- What are the yield-critical blocks?
- What-if analysis?
[Figure: correlated energy/delay information per component (e.g., obtained via statistical analysis or simulation techniques)]
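The three questions above can be sketched as a small Monte Carlo experiment. This is a simplified illustration, not the methodology of the talk: it models only the delay of each component as an independent Gaussian (the slide mentions correlated energy/delay data), and the component names and numbers are hypothetical.

```python
import random

def system_yield(components, clock_period, trials=10000, seed=7):
    """Return (fraction of sampled systems meeting the clock, failure count per component)."""
    rng = random.Random(seed)
    passing = 0
    fails = {name: 0 for name in components}
    for _ in range(trials):
        ok = True
        for name, (mean, sigma) in components.items():
            # One sampled delay per component per manufactured system.
            if rng.gauss(mean, sigma) > clock_period:
                fails[name] += 1
                ok = False
        passing += ok
    return passing / trials, fails

# Hypothetical components: (mean delay, sigma), normalized to the clock period.
components = {
    "L1_mem": (0.80, 0.08),
    "L2_mem": (0.70, 0.05),
    "datapath": (0.60, 0.03),
}
yield_estimate, fails = system_yield(components, clock_period=1.0)
critical = max(fails, key=fails.get)
print(f"yield: {yield_estimate:.3f}, yield-critical block: {critical}")
```

A what-if analysis then amounts to changing one entry (for example a faster memory) and re-running the estimate.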
Slide 24: Case study 3: Yield-aware Architecture Exploration
[Figure: average energy for memory organization (relative) of the reference vs. the redesigned architecture (fewer and faster memories) for three applications: image processing, wireless receiver, audio decoder; annotation: 99.9% of manufactured systems meeting the clock frequency]
Slide 25: Summarizing benefits and challenges of extending the technology/design interface
- Model-based BTWC: refining the interface between technology/processing data and design
  - Pass limits of the manufacturing flow to the designers
  - Pass functional intent to the manufacturing flow
- Fast analysis tools are required at all abstraction layers
  - Incorporate the most important effects
  - Manual analysis of the yield on a 1gCell netlist is impossible
  - Research for limiting run-times (e.g., STA after litho simulation)
- Silicon foundries should provide processing information
  - Many sources of processing variations exist (e.g., lithography, reliability, ...)
  - Factory-floor IT systems need to be able to archive, retrieve and analyze all types of cross-sections
  - They are reluctant to do so for other reasons
- Analysis results should be translated into design methods
  - Strain/stress simulation tools are available, but these tools cannot be used for design (e.g., no layout is available during synthesis)
  - Techniques are tedious; design margins remain required
Slide 26: How can we better handle these variations?
1. Push foundry for less variation
2. Extend the technology-design interface
3. Compensate for variations at run-time
Slide 27: Compensating for delay variations at run-time...
- Ensure that the system is fast enough at run-time
  - Avoid taking design margins; rather, speed up the limiting circuits at run-time
- Typical-case optimization
  - only burn energy when necessary
- The only solution for extreme conditions
  - large variations
  - sensitive designs (analog blocks, memories, ultra-low-power logic)
- Complementary to other design techniques
- Can be made independent of the origin of uncertainties
  - may raze design margins for ALL uncertainties in a single shot
[Figure: energy vs. delay scatter of design points, with the deadline marked]
Slide 28: ...requires a self-adjusting system
- guarantee functional correctness
  - circuits that remain functionally correct under variations
- guarantee parametric correctness (performance/energy)
  - energy/delay/robustness monitors
  - speed knobs
- A method for integrating knobs/monitors into the system at low cost
[Figure: conceptual view. A run-time controller (finds optimal knob settings, RTOS/HW) uses application knowledge (deadlines, workload, ...) and hardware status (distributed energy/delay measurement) to set the system knobs.]
Slide 29: Case study 4: Building a delay-variation-tolerant memory hierarchy
- Memories are most vulnerable to random variations
  - Many minimal-sized transistors (Pelgrom's law!)
  - Many critical paths
- Small memories near the functional units are the best candidates
  - these are the most power consuming
- Functional failures can be eliminated using circuit design techniques
[Figure: tiles connected by an inter-tile communication network with L2 memories; inside a tile, PEs and L1 memories connected by an intra-tile communication network]
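Pelgrom's law, referenced above, states that the standard deviation of the threshold-voltage mismatch between two identically drawn transistors scales as A_Vt / sqrt(W * L). A minimal sketch follows; the matching coefficient and the device dimensions are illustrative assumptions, not values from the talk.

```python
import math

def sigma_vt_mv(a_vt_mv_um, width_um, length_um):
    """Pelgrom's law: sigma(delta Vt) = A_Vt / sqrt(W * L), in mV."""
    return a_vt_mv_um / math.sqrt(width_um * length_um)

A_VT = 3.0  # mV*um, assumed matching coefficient for illustration only

minimal = sigma_vt_mv(A_VT, width_um=0.1, length_um=0.065)   # minimal-sized device
upsized = sigma_vt_mv(A_VT, width_um=0.4, length_um=0.065)   # 4x the gate area

print(f"minimal device: {minimal:.1f} mV, 4x area: {upsized:.1f} mV")
```

Quadrupling the gate area only halves the mismatch, which is why memories, built from many minimal-sized transistors, are hit hardest.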
Slide 31: Guaranteeing functional correctness under variations
- Delay variations cause functional errors in synchronous designs
- Can we build circuits that avoid latching the wrong data, independent of how much variation occurs?
  - self-timed logic
  - double-latched
  - adaptive synchronous
[Figure: logic between clocked registers (clk)]
Slide 32: Self-timed logic based on handshaking (1)
- Data hand-off between any pair of registers is by handshake
- Delivers actual performance
  - operand values determine the delay
- Each register (or associated combinational logic) needs completion-detection circuitry
  - Cost-effective, robust completion detectors for data-path FUs need to be explored
  - Matched delay line approach still requires design margins (cf. Wim)
Jens Sparsø et al., Principles of Asynchronous Circuit Design: A Systems Perspective, Kluwer Academic Publishers, Jan. 2002
Slide 33: Self-timed logic based on handshaking (2)
- Self-timed logic is attractive for knobbed components
  - no combined control of both Vdd/frequency required
- Difficult to integrate in current design flows
  - dedicated cells, difficult to characterize with current tools
  - logic and physical synthesis
  - actual-case timing
  - avoiding hazards
  - robustness
  - testing (no clock)
- Async design is reviving
  - Handshake Solutions: next Friday seminar
  - ARM developed an async version of the ARM9
[Figure: Shmoo plot of a self-timed circuit (ASPIDA processor). The chip operates correctly over a large Vdd range. From Cortadella et al., "De-synchronization: synthesis of asynchronous circuits from synchronous specifications," TCAD, Oct. 2005, Vol. 25, Issue 10, pp. 1904-]
Slide 34: Double Latching
- Circuit delay speculation, error detection, correction
- Exploits the typical case amidst dynamic uncertainties below the uArch level
- Demonstrated good energy savings
- Worst case (with all uncompensated uncertainties) is limited and must be guaranteed by design-time analysis
  - Cannot handle huge uncertainties, doesn't fully exploit the actual case
- Severe short-path constraints
  - Delay padding overhead
- Extra bypass path, shadow latch overhead, meta-stability issues
D. Ernst et al., "Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation," MICRO, 2003
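The speculation, detection and correction steps can be sketched behaviourally as follows. This is a simplified model written for illustration, not the Razor circuit itself: it ignores meta-stability and the short-path constraint, and the class name, field names and one-cycle recovery penalty are assumptions.

```python
from dataclasses import dataclass

@dataclass
class DoubleLatchedStage:
    clock_period: float   # the main flip-flop samples at this time
    shadow_delay: float   # the shadow latch samples this much later

    def sample(self, arrival_time, old_value, new_value):
        """Return (value used downstream, error flag, recovery cycles)."""
        if arrival_time > self.clock_period + self.shadow_delay:
            # Beyond the shadow window nothing can be detected; this case
            # has to be excluded by design-time analysis.
            raise ValueError("delay exceeds the shadow-latch window")
        # Main flip-flop: sees the new data only if it arrived in time.
        main = new_value if arrival_time <= self.clock_period else old_value
        # Shadow latch: samples late enough to always see the settled data.
        shadow = new_value
        error = main != shadow
        return shadow, error, (1 if error else 0)

stage = DoubleLatchedStage(clock_period=1.0, shadow_delay=0.5)
on_time = stage.sample(0.8, old_value=0, new_value=1)
late = stage.sample(1.2, old_value=0, new_value=1)
print(on_time, late)
```

The `ValueError` branch mirrors the bullet above: the worst case that the shadow latch can still cover is limited and has to be guaranteed at design time.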
Slide 35: Adaptive Synchronous
- Run-time determined clock based on HW status
  - Using hardware monitoring (test vectors + delay measurement)
  - PLL/DLL is directed to deliver the required clock
- Only pre-determined clocks can be generated by PLL/DLL
- Takes a while (1000 cycles) to complete the transition
[Figure: BIST applies test patterns to the IP blocks and steers the clock generator (clks); timeline alternating test and operate phases]
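Choosing among the pre-determined clocks can be sketched as below. The frequency list, the measured delay and the `guard_band` parameter are hypothetical; the guard band is an added assumption to cover measurement error and is not mentioned on the slide.

```python
def select_clock(available_mhz, measured_delay_ns, guard_band=0.05):
    """Pick the fastest pre-determined clock whose period covers the measured delay."""
    required_period_ns = measured_delay_ns * (1.0 + guard_band)
    feasible = [f for f in available_mhz if 1000.0 / f >= required_period_ns]
    if not feasible:
        raise ValueError("no available clock is slow enough for the measured delay")
    return max(feasible)

clocks_mhz = [100, 150, 200, 250]          # what the PLL/DLL can generate (assumed)
fast_die = select_clock(clocks_mhz, 4.5)   # critical delay measured by the BIST
slow_die = select_clock(clocks_mhz, 5.2)
print(fast_die, slow_die)
```

Because only a discrete set of frequencies exists, part of the measured slack is always lost to quantization, on top of the transition time noted above.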
Slide 36: Case study 4a: Self-timed memory
Orchestration of events is critical for functional correctness of the memory
[Figure: memory block diagram (interface, address latches 1 and 2, xDec, WL buffers, matrix, sense amplifiers, Data_In/Data_Out) with separate supplies Vdd_matrix, Vdd_IO and Vdd_decoder, and a timing diagram of CLK, DEC_START, PREb, WL EN and SA ACT]
Slide 37: Case study 4a: Self-timed memory
Decoder is too slow due to process variations
Wrong cell is read -> incorrect output at sense-amp
[Figure: the same memory block diagram, with a timing diagram of CLK, DEC_START, DEC_OUT (empty/valid), PREb, WL EN and SA ACT]
Slide 38: Case study 4a: Self-timed address decoder
[Figure: self-timed address decoder; annotation: stop pre-charging]
Slide 40: Knobs for controlling performance
- uArch-level components with run-time configurable parametric aspects
  - without affecting functionality
- Right configuration is decided at run-time
  - based on HW status and delay requirements
- Knobs or combinations of knobs should
  - be fast enough to compensate for the worst variations, but give the highest possible energy savings under more relaxed conditions
  - have low overhead, so as not to upset the original performance/energy
  - offer fine control, i.e. only speed up the failing path
  - be fast, i.e. have a low re-configuration time
- Translation of fine-grain performance variability into energy savings
  - By switching to energy-efficient configurations when longer operation latencies can be tolerated
- Typical knobs
  - power supply/back gating
  - (redundant logic)
  - re-configurable HW
Slide 41: 1. Back-biasing/supply-based knobs
- Excellent, proven dynamic range of combined Vdd/Vt knobs
  - sufficient for 90nm
- Widely researched and applied in designs
  - Intel's Enhanced SpeedStep technology
  - AMD's PowerNow! technology
  - Transmeta's LongRun2 power management (incl. ABB)
- Vt knob is losing its efficiency
  - area overhead
  - back-gate becoming less effective
  - gate leakage
  - multi-gate devices (e.g., FinFETs)?
- Usually very coarse granularity: a single knob for the entire chip
Courtesy M. Meijer, NXP
Slide 42: Towards knobs with a finer granularity: multiple Vdd islands
- Multiple Vdd/Vt domains enable finer-grain control, limiting worst-casing
- Multiple Vdd/Vt knobs are challenging
  - different islands operate at different speeds -> GALS-like communication fabric
  - multiple off-chip DC-DC converters incur overhead (too many pins, too many extra components and complex power distribution)
  - on-chip linear power regulators are not portable to new technologies, incur noise and contain biasing currents
Slide 43: A low-overhead on-chip voltage controller
- Vdd control through header and footer transistors
  - Linear resistors (active mode)
  - Power switch (standby)
- Fine-grain programmability with a digital resistance built from segmented transistors
- Fast settling times (order of 100ns)
- Overhead remains high
  - extra switches
  - level converters/clamp cells
- Sensitivity to noise
- Limited energy savings
  - C*Vdd*Vswing rather than C*Vdd^2
M. Meijer et al., "On-chip Digital Power Supply Control for System-on-chip Applications," Proc. ISLPED '05
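One reading of the last bullet is that, with a resistive header/footer, the charge C*Vswing is still drawn from the full Vdd rail, so the energy falls only linearly with the swing instead of quadratically. The sketch below works through that arithmetic under this assumed interpretation; the capacitance and voltages are illustrative.

```python
def energy_full_swing(c_farad, vdd):
    """Reference: node switches rail to rail, E = C * Vdd^2."""
    return c_farad * vdd ** 2

def energy_resistive_control(c_farad, vdd, vswing):
    """Swing reduced by a series resistance, charge still taken from Vdd: E = C * Vdd * Vswing."""
    return c_farad * vdd * vswing

def energy_ideal_scaling(c_farad, vswing):
    """Supply itself lowered to Vswing (ideal converter): E = C * Vswing^2."""
    return c_farad * vswing ** 2

C, VDD, VSWING = 1e-12, 1.2, 0.8   # assumed values for illustration
ref = energy_full_swing(C, VDD)
linear_ratio = energy_resistive_control(C, VDD, VSWING) / ref
quadratic_ratio = energy_ideal_scaling(C, VSWING) / ref
print(f"resistive control: {linear_ratio:.2f}x, ideal scaling: {quadratic_ratio:.2f}x")
```

For the same reduced swing, the resistive scheme keeps about two thirds of the reference energy in this example, where ideal supply scaling would keep less than half.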
Slide 44: 3. Re-configurable hardware knobs
- Knob consists of multiple units of hardware
  - a fast one to satisfy worst-case constraints
  - a slow one to save energy
- Exploring the best combination of low-power and high-performance knobs
- HW of the unused configuration should be isolated from input changes and/or Vdd
  - Mux/demux and operand-isolation circuits define the optimal granularity and the maximum number of combinations
E.g., variable-size buffers, carry-chain variants, etc.
Slide 45: Case study 4b: Configurable HW to vary memory performance
- Buffers inside the row decoder and wordline drivers are interesting circuits for building knobs
  - Inside the critical path of the memory
  - Important contribution to both power and delay, as these circuits drive large capacitive loads
  - Limited impact on area
Particularly for small memories (<128kB; cf. Amrutur et al., "Speed and power scaling of SRAMs," IEEE J. Solid-State Circuits, vol. 35, no. 2, pp. 175-185, Feb. 2000)
Slide 46: Case study 4b: Configurable drivers implemented with redundant logic
[Figure: a fast and an energy-efficient buffer in parallel between in and out, selected by ctrl_fast, driving Cwordline/Cdecoder/...]
- Maximizing the range of configurable buffers
  - sizing of buffers
  - number of stages
- Overhead for combining circuits can be limited
Slide 47: Case study 4b: Sizing of a Pareto-optimal buffer
[Figure: two tapered buffer chains: stages sized 1, f2 driving 16Cmin, and stages sized 1, f2, f3 driving 38Cmin]
Slide 48: Case study 4b: Determining optimal stage-length
- Energy-optimal number of stages depends on performance targets
- Optimal number of stages for speed can be determined using classical tapered buffer design
  - More stages only increase power consumption, as they do not further decrease delay
- Configurable buffer is built by combining Pareto-optimal buffers
[Figure: results for Cload = 32Cmin]
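The trade-off between stage count, delay and energy can be sketched with the textbook tapered-buffer model. This is a first-order illustration, not the sizing procedure of the talk: it assumes a uniform stage ratio, ignores self-loading and signal polarity, and uses the internal switched capacitance as a proxy for energy.

```python
def tapered_buffer(load_ratio, n_stages):
    """Return (relative delay, relative internal switched capacitance) of an n-stage chain."""
    f = load_ratio ** (1.0 / n_stages)      # uniform stage ratio
    delay = n_stages * f                    # each stage drives f times its own input cap
    internal_cap = sum(f ** i for i in range(1, n_stages))  # inputs of stages 2..n
    return delay, internal_cap

def pareto_buffers(load_ratio, max_stages=8):
    """Keep the stage counts up to the delay-optimal one; longer chains are slower and costlier."""
    points = [(n,) + tapered_buffer(load_ratio, n) for n in range(1, max_stages + 1)]
    n_fastest = min(points, key=lambda p: p[1])[0]
    return [p for p in points if p[0] <= n_fastest]

# Load of 32 Cmin driven from a minimum-sized input, as on the slide.
pareto = pareto_buffers(32)
for n, delay, cap in pareto:
    print(f"{n} stages: delay {delay:.2f}, internal cap {cap:.2f}")
```

In this model the delay stops improving beyond four stages for a load ratio of 32, while the switched capacitance keeps growing, so the shorter chains form the energy-efficient end of the Pareto set.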
Slide 49: Case study 4b: Design issues in building configurable drivers
- Tri-state buffers for selecting the buffer configuration
- Output sharing impacts the performance of the low-power buffer, not of the high-speed one
- Area savings through
  - the use of a normal inverter in the intermediate stage of the high-power buffer
  - no tri-state buffers for the initial stages of the low-power buffer
E.g., a configurable buffer with a fast and a slow option.
Hua Wang et al., "Variable tapered Pareto buffer design and implementation allowing run-time configuration for low-power embedded SRAMs," IEEE Trans. VLSI Syst., 13(10): 1127-1135 (2005)
Slide 50: Case study 4b: Integration of the configurable buffer inside a memory
- Configurable buffers integrated in the pre- and post-decoder
- 1kB SRAM in 65nm BPTM
- 10% variation assumed in both Vt and Beta
- Area overhead is limited compared to the array (estimated at less than 5% for this small memory)
- Three-option configurable buffer
- Spice simulations in 65nm
- Sizing indicated on figure
Slide 52: Run-time Hardware Monitoring
- Requirements of monitor circuits depend on the choice of circuit style
- Adaptive synchronous
  - Functional + parametric testing
  - Excited with expected typical input vectors generated by compact configurable BIST HW
  - Possible by exploiting the structural regularity of memories and data-paths and using symbolic cellular-automata methods -> research
- Circuits which are always functionally correct (e.g., self-timed)
  - Parametric testing only
  - Measurement can be less tedious (e.g., delay line)
[Figure: energy vs. delay. What is the actual energy/performance of the circuit?]
Slide 53: Some performance monitoring circuits...
- Delay line for on-chip performance monitoring
  - calibration with an accurate reference
- Counter-based performance monitor for self-timed logic
[Figure: feedback loop with logic, clk/completion signal, 1/50 divider, external reference, summing node and integral controller]
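A feedback loop of this kind can be sketched as an integral controller that accumulates the difference between the measured and the target delay into a supply knob. The delay model, gain and voltage limits below are hypothetical stand-ins for the on-chip monitor and knob, chosen only to make the loop runnable.

```python
def measured_delay(vdd, vt=0.3):
    """Hypothetical monitor model: delay falls as the supply rises above Vt."""
    return 1.0 / (vdd - vt)

def integral_control(target_delay, vdd=1.2, ki=0.05, steps=200, vmin=0.5, vmax=1.3):
    """Integrate the delay error into the supply knob until the target is met."""
    for _ in range(steps):
        error = measured_delay(vdd) - target_delay   # > 0 means the logic is too slow
        vdd = min(vmax, max(vmin, vdd + ki * error)) # raise Vdd when slow, lower when fast
    return vdd

settled_vdd = integral_control(target_delay=2.0)
print(f"settled supply: {settled_vdd:.3f} V, delay: {measured_delay(settled_vdd):.3f}")
```

Starting from a supply that is faster than required, the loop lowers the knob until the measured delay just meets the target, which is where the energy saving comes from.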
Slide 55: System integration challenges
How to create a self-adaptive system at lowest cost?
- System characteristics
  - application dynamism
  - performance constraints
  - energy budget
  - cost target
  - reliability constraints
  - ...
- Architecture
  - sensitivity to variations
- Technology
  - intra-die variations
    - slow (random, reliability)
    - fast (IR drop, Xtalk, ...)
  - global variations
    - slow (D2D)
    - fast (power noise)
  - range of variations
Slide 56: Case study 4: An adaptive synchronous integration (1)
- Memories are self-timed and tuneable
- System is assumed to operate synchronously
  - Easy to integrate in existing systems
- Variations define which is the slowest component and thus the max. clock speed
  - uncertainties -> access delay variations
  - slowest word determines the access delay of a memory
- Monitoring circuits to determine fmax (energy/access -> optional)
  - BIST generates test vectors
- Increasing fmax by moving the slowest memory to high speed
  - at the cost of extra power
- Tuneable clock required to set the operating frequency
  - determined by application/system requirements
  - configuration time is relatively high
- Assumption
  - logic is not BTWC designed
Slide 57: Case study 4: An adaptive synchronous integration (2)
- Run-time controller identifies the energy-optimal knob positions of each component to achieve the desired fmax at lowest energy
  - in case of multiple knobs per component, energy monitoring is needed
[Figure: energy vs. delay of the component configurations and the resulting execution timeline against the deadline; point labels as extracted: 100Na, nn, ss, 120EU]
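The selection step of such a run-time controller can be sketched as follows. It assumes each component exposes one knob with a few settings whose delay and energy have already been measured on this particular chip; the component names, setting names and numbers are hypothetical.

```python
def lowest_energy_settings(components, clock_period):
    """For each component pick the cheapest knob setting whose measured delay fits the clock."""
    chosen = {}
    for name, settings in components.items():
        feasible = [(energy, knob) for knob, (delay, energy) in settings.items()
                    if delay <= clock_period]
        if not feasible:
            raise ValueError(f"{name} cannot meet a clock period of {clock_period}")
        chosen[name] = min(feasible)[1]
    return chosen

def min_clock_period(components):
    """Shortest feasible period: the slowest component running at its fastest setting."""
    return max(min(delay for delay, _ in s.values()) for s in components.values())

# Measured (delay, energy) per knob setting on one die; values are illustrative.
measured = {
    "Mem1": {"slow": (1.10, 1.0), "fast": (0.85, 1.5)},
    "Mem2": {"slow": (0.90, 1.0), "fast": (0.70, 1.5)},
}
settings = lowest_energy_settings(measured, clock_period=1.0)
print(settings, min_clock_period(measured))
```

Only the memory that would otherwise miss the clock period is switched to its fast setting; the other stays in its energy-efficient setting.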
Slide 59: Case study 4: An adaptive synchronous integration (2)
- Run-time controller identifies the energy-optimal knob positions of each component to achieve the desired fmax at lowest energy
  - in case of multiple knobs per component, energy monitoring is needed
[Figure: energy vs. delay of the component configurations and the execution timeline of DP, Mem1 and Mem2 against the deadline; point labels as extracted: 100Na, nn, ss, 120EU, sf, 150EU]
Slide 60: Experimental results for a DAB receiver
[Figure: relative energy overhead (energy - energy_nom)/energy_nom (axis 0 to 80) vs. normalized deadline constraint (0.754, 0.814, 0.91, 1) for the DAB receiver]
DAB receiver consists of 3 FUs connected to 7 configurable memories
Slide 61: Case study 4: An adaptive synchronous integration (2)
- Run-time controller identifies the energy-optimal knob positions of each component to achieve the desired fmax at lowest energy
  - in case of multiple knobs per component, energy monitoring is needed
- Energy benefits vary from chip to chip (depending on variations)
  - they average out if there are many components on chip
[Figure: energy vs. delay of the component configurations and the execution timeline against the deadline; point labels as extracted: 100Na, nn, 120EU, ss, sf, 150EU]
Slide 62: Involving task-level information in the feedback control loop
- A more complex control algorithm re-configures the memories' performance depending on the memory usage (application load) and the current hardware status
- Single solution that can deal with ALL sources of variations
  - from manufacturing-induced ones to application-level ones
- Similar to DVS-like solutions once calibrated
  - e.g., TCM, Vdd/Vt-hopping (see the Ph.D. thesis of Peng Yang for an extended overview)
[Figure: energy vs. delay of the component configurations and the execution timeline against the deadline; point labels as extracted: 100Na, nn, ss, sf, 150EU]
Slide 63: Experimental results for a DAB receiver
[Figure: relative energy overhead (energy - energy_nom)/energy_nom (axis 0 to 80) vs. normalized deadline constraint (0.754, 0.814, 0.91, 1) for the DAB receiver]
A. Papanikolaou et al., "A system-level methodology for fully compensating process variability impact of memory organizations in periodic applications," CODES+ISSS, 2005, pp. 117-122
Slide 64: More than compensating random process variations
- System can adapt itself to slowly changing environmental parameters
  - Temperature, degradation, aging, etc.
  - Requires re-calibration of the circuit depending on environmental conditions
- Worst-case margins remain necessary
  - Re-configuration of the clock involves considerable delay
  - No accurate tracking of fast-changing environmental or value-dependent conditions
[Figure: 64kB, 0.18um CMOS, 250MHz; results from a self-timed memory proposed in E. Karl et al.]
Slide 65: Summarizing the benefits and challenges of run-time compensation techniques
- Feedback control saves energy by razing design margins while still allowing for real-time operation
- The required components are available
  - Delay-variation-resilient (DVR) circuits
    - self-timed is feasible inside memories with limited overhead
    - the debate is still open on what the best DVR for logic is: minimize overhead while maximizing the razed margins
  - Knobs ranging from coarse-grain Vdd knobs to fine-grain re-configurable HW
  - More work is required on monitor/test circuits
- A possible integration of these circuits into a working system has been presented
  - removes manufacturing variations, can deal with slowly changing dynamic variations
  - which feedback system is best in a given context is still largely unknown
Slide 66: Acknowledgements
- Satyakiran Munaga
- Miguel Miranda
- Hua Wang
- Antonis Papanikolaou
- Francky Catthoor
- Wim Dehaene
- Hugo De Man
Slide 67: Thank you!