Title: Design Techniques for Power Reduction
1Design Techniques for Power Reduction
- Borivoje Nikolic
- bora_at_eecs.berkeley.edu
2Digital IC Challenges
- Robustness
- Device features scale faster than their tolerance
- Environment impact: supply noise, coupling, ...
- Tradeoff with power and performance
- Power
- Active power
- Drain and gate leakage
- Many known power reduction approaches
- Trade-off performance for power savings
- Affect robustness
- Cost
- NRE, masks, complexity
- Good news: density and intrinsic delays improve
3Power is a Problem
- If we continue doing business as usual, both dynamic and leakage power will be a problem: chips are getting hot and phones leaky!
- Need to deliver maximum performance under power constraints
From S. Borkar, Intel
4Outline
- Know your enemy: power consumption in CMOS
- Power and performance have to be jointly optimized
- Reducing leakage
- Robust design
- BEE and INSECTA
- Conclusions
5Dynamic Power Consumption
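[Figure: CMOS gate between Vdd and ground, with a PMOS network and an NMOS network sharing inputs A1 ... AN; output Vout, load capacitance CL, supply current iL.]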
- One half of the energy drawn from the supply is dissipated in the pull-up network and one half is stored on CL
- This also happens during glitching
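The half-and-half split can be checked numerically. The sketch below is only an illustration: it assumes the pull-up network behaves like a fixed resistance `r_on` (a real PMOS network is nonlinear), and the function name, parameters, and component values are hypothetical.

```python
import math


def charging_energy(c_load, vdd, r_on, steps=20000, t_end_taus=20.0):
    """Integrate one 0 -> VDD charging event of c_load through r_on."""
    tau = r_on * c_load
    dt = t_end_taus * tau / steps
    e_supply = 0.0
    e_pullup = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt                        # midpoint rule
        i = (vdd / r_on) * math.exp(-t / tau)     # charging current i_L(t)
        e_supply += vdd * i * dt                  # energy delivered by the supply
        e_pullup += i * i * r_on * dt             # energy dissipated in the pull-up
    e_stored = e_supply - e_pullup                # what remains ends up on c_load
    return e_supply, e_pullup, e_stored


# Hypothetical values: 10 fF load, 1.2 V supply, two different on-resistances.
for r_on in (1e3, 20e3):
    e_supply, e_pullup, e_stored = charging_energy(10e-15, 1.2, r_on)
    print(r_on, e_pullup / e_supply, e_stored / e_supply)
```

Under this model, changing `r_on` changes how fast CL charges but not the split between dissipated and stored energy.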
6Transistor Leakage
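VDS = 1.2 V
[Figure: transistor with gate (G), source (S), drain (D), and substrate (Sub) terminals, and capacitances Ci and Cd.]
Subthreshold slope: $S = \frac{kT}{q}\,\ln 10\,\left(1 + \frac{C_d}{C_i}\right)$
Drain leakage current is exponential with VGS.
Subthreshold slope is 70 mV/dec at room temperature.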
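To put numbers on the slope expression, the sketch below evaluates it at room temperature. The capacitance ratio of 0.18 is a hypothetical value, chosen only because it lands near the 70 mV/dec quoted above; the function names are illustrative.

```python
import math

K_BOLTZMANN = 1.380649e-23     # J/K
Q_ELECTRON = 1.602176634e-19   # C


def subthreshold_slope_mv(temp_k, cd_over_ci):
    """Subthreshold slope S = (kT/q) * ln(10) * (1 + Cd/Ci), in mV/decade."""
    thermal_voltage = K_BOLTZMANN * temp_k / Q_ELECTRON
    return 1000.0 * thermal_voltage * math.log(10.0) * (1.0 + cd_over_ci)


def leakage_ratio(delta_vgs_mv, slope_mv):
    """Factor by which subthreshold current changes for a given VGS change."""
    return 10.0 ** (delta_vgs_mv / slope_mv)


ideal = subthreshold_slope_mv(300.0, 0.0)      # limit with Cd/Ci = 0
assumed = subthreshold_slope_mv(300.0, 0.18)   # hypothetical Cd/Ci
print(round(ideal, 1), round(assumed, 1))
print(leakage_ratio(-100.0, assumed))          # effect of lowering VGS by 100 mV
```

The ideal limit is about 60 mV/dec at 300 K, so a slope of 70 mV/dec corresponds to a modest Cd/Ci.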
7Transistor Leakage
3-10x in current technologies
- Two effects
- diffusion current (like a bipolar transistor)
- exponential increase with VDS (DIBL)
8Gate Tunneling
- $I_{GD} \propto e^{-T_{ox}}\,e^{V_{gd}}$
- $I_{GS} \propto e^{-T_{ox}}\,e^{V_{gs}}$
- Independent of the subthreshold leakage
- Contributes to the total leakage
- Modeled in BSIM4
- Also in BSIM3v3 but foundries usually do not
include it
9Reducing active power
- Downsizing transistors (CL)
- Slows down logic
- Lowering the supply voltage (VDD)
- Slows down logic
- Reducing swing slows down the succeeding stage
- Reducing frequency (f)
- Does not reduce energy
- Reducing switching activity (a)
- Logic restructuring
- Reducing glitching
- Balancing logic
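The four knobs above map onto the usual switching-power expression, P = a · CL · VDD² · f. The sketch below assumes that standard expression and uses made-up numbers; it shows why lowering f cuts power but leaves the energy per cycle unchanged, while lowering VDD cuts both.

```python
def active_power(activity, c_load, vdd, freq):
    """Average switching power in watts: a * CL * VDD^2 * f."""
    return activity * c_load * vdd ** 2 * freq


def energy_per_cycle(activity, c_load, vdd):
    """Average switching energy per clock cycle in joules: a * CL * VDD^2."""
    return activity * c_load * vdd ** 2


# Hypothetical block: 1 nF total switched capacitance, activity 0.1, 1.2 V.
base = dict(activity=0.1, c_load=1e-9, vdd=1.2)
p_full = active_power(freq=1e9, **base)
p_half = active_power(freq=0.5e9, **base)
e_nominal = energy_per_cycle(**base)
e_low_vdd = energy_per_cycle(activity=0.1, c_load=1e-9, vdd=1.0)

print(p_half / p_full)        # halving f halves the power
print(e_low_vdd / e_nominal)  # lowering VDD reduces energy quadratically
```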
10Reducing Active Power
- Downsizing, lowering the supply on the critical path will lower the operating frequency
- Downsize non-critical paths
- Narrows down the path delay distribution
- Increases impact of variations
11Reducing Leakage
- Using higher thresholds
- Channel doping
- Body biasing
- Reduces drive current
- Using stack effect
- Stacked devices
- Sleep transistors
- Using longer transistors
- Limited benefit
- Increase in active current
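The trade-off behind a higher threshold can be sketched with two simple models. Both are assumptions, not taken from the slides: subthreshold leakage falls one decade per subthreshold slope of threshold increase, and drive current follows the alpha-power law, I_on ∝ (VDD − Vth)^alpha. The exponent 1.3 and the voltages are hypothetical.

```python
def leakage_reduction(delta_vth_mv, slope_mv_per_dec=70.0):
    """Factor by which subthreshold leakage drops when Vth rises by delta_vth_mv."""
    return 10.0 ** (delta_vth_mv / slope_mv_per_dec)


def drive_current_ratio(vdd, vth, delta_vth, alpha=1.3):
    """Drive current after raising Vth, relative to before (alpha-power law)."""
    return ((vdd - vth - delta_vth) / (vdd - vth)) ** alpha


# Hypothetical device: VDD = 1.2 V, Vth = 0.3 V, threshold raised by 100 mV.
print(leakage_reduction(100.0))              # leakage drops by more than 10x
print(drive_current_ratio(1.2, 0.3, 0.1))    # drive current drops by a smaller factor
```

Under these assumptions the leakage saving is exponential in the threshold shift while the drive-current penalty is polynomial, which is why threshold adjustment is attractive despite the slowdown.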
12Power-Performance Optimization
[Figure: Energy/op versus Delay plot marking Emax, Emin, Dmin, Dmax, and an unoptimized design.]
Maximize throughput for a given energy, or minimize energy for a given throughput
13Power-Performance Optimization
- There are many sets of parameters to adjust
- Tuning variables
- Devices
- Circuit (sizing, supply, threshold)
- Logic style (std. cells, custom, ...)
- Block topology (adder: CLA, CSA, ...)
- Micro-architecture (parallel, pipelined)
14Multi-Level Approach
- Energy minimization subject to delay constraint
- Optimal trade-off between energy and area
[Figure: optimization levels and their trade-off spaces]
- Architecture: Energy-Area (Cost)-Performance
- Micro-Architecture: Energy-Performance
- Circuit (Logic & FFs): Energy-Delay
D. Markovic
15Sizing, Supply, Threshold Optimization
- Transistor sizing can yield large power savings with small delay penalties
- Gate sizing
- Beta-ratio adjustments
- Stack resizing
- IBM EinsTuner
- Supply voltage affects both active and leakage energy
- Threshold voltages affect primarily the leakage
16Optimization framework
- Use for
- Optimize datapath building blocks
- Investigate the optimality of any given design
- Use inside microarchitecture optimizer
R. Zlatanovici
17Adders in Energy-Delay Space
Will demonstrate in 90nm
- Sparse Radix-4 adder is the fastest
- R. Zlatanovici, S. Kao
18Scope of Circuit Level Optimization
- By combining sizing, supply and threshold optimization block delay can be varied in the range 12
- Limited effectiveness
[Figure: VDD, VTh, sizing optimization for a 64-b adder example; axes Energy (Enom) and Delay (Dnom); nominal point (Dmin, Enom); sizing-optimized point (1.1 Dmin, 0.3 Enom).]
D. Markovic
19Microarchitecture Optimization
- Viterbi decoder ACS recursion: transforming from add-compare-select to compare-select-add
E. Yeo
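A functional sketch of the reordering is given below. It shows only that moving the add to after the compare-select (retiming the recursion) leaves the path metrics unchanged; it is not a model of E. Yeo's circuit. The function names, the two-state trellis, and the branch metrics are hypothetical.

```python
def run_acs(pm0, bms):
    """Add-compare-select: add branch metrics, then keep the minimum per state."""
    pm = list(pm0)
    n = len(pm)
    for bm in bms:                       # bm[i][j]: branch metric from state i to j
        pm = [min(pm[i] + bm[i][j] for i in range(n)) for j in range(n)]
    return pm


def run_csa(pm0, bms):
    """Compare-select-add: the loop state holds pre-added candidate sums."""
    if not bms:
        return list(pm0)
    n = len(pm0)
    # One add up front; afterwards every iteration ends with the add.
    sums = [[pm0[i] + bms[0][i][j] for j in range(n)] for i in range(n)]
    for k in range(len(bms)):
        # Compare-select: pick the surviving candidate for each state.
        pm = [min(sums[i][j] for i in range(n)) for j in range(n)]
        # Add: prepare the candidates for the next trellis step.
        if k + 1 < len(bms):
            nxt = bms[k + 1]
            sums = [[pm[j] + nxt[j][l] for l in range(n)] for j in range(n)]
    return pm


pm0 = [0, 3]
bms = [[[1, 4], [2, 0]], [[3, 1], [0, 5]], [[2, 2], [7, 1]]]
print(run_acs(pm0, bms), run_csa(pm0, bms))
```

Both functions return the same path metrics; the difference is only in which operation sits at the end of the loop body.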
20Sizing, Supply, Threshold Optimization
- There exists an optimal supply and threshold for each function
- In this optimum ESw/ELk ≈ 2
- Depends on logic depth, activity, function
- Technology is not optimal for all blocks
- Adjust during the design
- Multiple supplies, thresholds
- Variable throughput applications
- Variable supplies, thresholds
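The idea of a per-block optimum can be illustrated with a toy search over supply and threshold. Everything in the model below is an assumption made for illustration: the alpha-power-law delay, the normalized energy expressions, the activity factor, the leakage scale, and the voltage grids. It is not the optimization framework used in the talk and is not expected to reproduce the energy ratio quoted above.

```python
def delay(vdd, vth, alpha=1.3):
    """Normalized gate delay from the alpha-power law (arbitrary units)."""
    return vdd / (vdd - vth) ** alpha


def energies(vdd, vth, activity=0.1, slope_v=0.07, leak_scale=1.0):
    """Return (switching, leakage) energy per operation, both normalized."""
    e_sw = activity * vdd ** 2
    # Leakage current falls one decade per slope_v of threshold increase and
    # is integrated over one operation, whose length is the gate delay.
    e_lk = leak_scale * 10.0 ** (-vth / slope_v) * vdd * delay(vdd, vth)
    return e_sw, e_lk


def optimize(d_max, vdd_grid, vth_grid):
    """Grid search for the lowest total energy that meets delay <= d_max."""
    best = None
    for vdd in vdd_grid:
        for vth in vth_grid:
            if vdd - vth < 0.1 or delay(vdd, vth) > d_max:
                continue
            e_sw, e_lk = energies(vdd, vth)
            if best is None or e_sw + e_lk < best["total"]:
                best = {"total": e_sw + e_lk, "vdd": vdd, "vth": vth,
                        "e_sw": e_sw, "e_lk": e_lk}
    return best


VDD_GRID = [k / 100 for k in range(60, 125, 5)]   # 0.60 V ... 1.20 V
VTH_GRID = [k / 100 for k in range(10, 55, 5)]    # 0.10 V ... 0.50 V
D_REF = delay(1.2, 0.35)                          # hypothetical reference design
best = optimize(D_REF, VDD_GRID, VTH_GRID)
ref_total = sum(energies(1.2, 0.35))
print(best, ref_total)
```

Changing `activity` or `leak_scale` moves the optimum, which mirrors the point that the best supply and threshold depend on logic depth, activity, and function.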
21Sizing, Supply, Threshold Optimization
Reference design: Dref (Vddmax, Vthref)
Large variation in optimal circuit parameters Vddopt, Vthopt, wopt
[Figure: plot annotated with Vddmax, Vthmax, Vddmin, Vthmin]
Technology parameters (Vddmax, Vthref) are rarely optimal
22Dynamic Sleep Transistor
Active mode
Noise on virtual supply
Logic block
23Dynamic Sleep Transistor
Idle mode
Virtual supply collapse
M. Sheets
24Design Variability
- Power-performance optimization
[Figure: Power versus Performance plot with points T, S, and F, bounded by a power constraint, a performance constraint, and a leakage constraint.]
25Robust Optimization
- Optimization with uncertain parameters (R. Zlatanovici)
- Parameters are within an ellipsoid centered on the nominal values
- Optimize the worst case
- Optimization with stochastic parameters
- Parameters are random variables with known distribution centered on the nominal values
- Optimize for parametric yield in the power-delay space
- Linear delay (logical effort based) models allow a convex formulation of the optimization with uncertain parameters
- Bottom-up approach: get a handle on variations (K. Cao and Prof. Rabaey's group)
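One reason a linear delay model keeps the worst-case problem tractable is that the maximum of a linear function over an ellipsoid has a closed form. The sketch below assumes the uncertain parameters are p = nominal + shape · u with ‖u‖₂ ≤ 1 and the delay is coeffs · p; the worst case is then coeffs · nominal + ‖shapeᵀ · coeffs‖₂. The names and numbers are hypothetical and this is not the formulation used by the authors.

```python
import math


def worst_case_delay(coeffs, nominal, shape):
    """Maximum of coeffs . p over p = nominal + shape @ u with ||u||_2 <= 1."""
    n = len(coeffs)
    m = len(shape[0])
    nominal_delay = sum(c * p for c, p in zip(coeffs, nominal))
    # Project the delay coefficients through the ellipsoid shape matrix.
    projected = [sum(shape[i][j] * coeffs[i] for i in range(n)) for j in range(m)]
    margin = math.sqrt(sum(v * v for v in projected))
    return nominal_delay + margin


# Hypothetical two-parameter example with an axis-aligned ellipsoid.
coeffs = [1.0, 2.0]
nominal = [10.0, 5.0]
shape = [[3.0, 0.0],
         [0.0, 2.0]]
print(worst_case_delay(coeffs, nominal, shape))

# Brute-force check on the ellipsoid boundary.
sampled = max(
    sum(c * (p + s[0] * math.cos(a) + s[1] * math.sin(a))
        for c, p, s in zip(coeffs, nominal, shape))
    for a in (2.0 * math.pi * k / 3600 for k in range(3600))
)
print(sampled)
```

The margin term is a Euclidean norm, which is convex, so a worst-case delay constraint of this form fits a convex optimization.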
26What is the Berkeley Emulation Engine?
- A real-time FPGA-based hardware emulator, with speed up to 60 MHz
- Emulation capacity of 10 million ASIC gate-equivalents per module, corresponding to 600 Gops (16-bit adds)
- 2400 external parallel I/O providing 192 Gbps raw bandwidth
- Automated design flow from Simulink to FPGA emulation, integrated with the INSECTA ASIC design flow
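The headline figures can be related to each other with a little arithmetic. The sketch below assumes all of them are peak values and that the 600 Gops figure is reached at the full 60 MHz clock; the derived per-pin and per-cycle numbers are therefore estimates, not specifications.

```python
CLOCK_HZ = 60 * 10 ** 6           # up to 60 MHz
PEAK_OPS_PER_S = 600 * 10 ** 9    # 600 Gops (16-bit adds)
IO_PINS = 2400
RAW_BANDWIDTH_BPS = 192 * 10 ** 9
GATE_EQUIVALENTS = 10 * 10 ** 6

adds_per_cycle = PEAK_OPS_PER_S // CLOCK_HZ        # parallel 16-bit adds per clock
bits_per_pin_per_s = RAW_BANDWIDTH_BPS // IO_PINS  # raw rate per I/O
gates_per_add = GATE_EQUIVALENTS // adds_per_cycle # gate budget per parallel adder

print(adds_per_cycle, bits_per_pin_per_s, gates_per_add)
```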
27BEE Applications
- Real-time hardware emulation
- Novel communication systems with analog front-end hardware (MCMA, UWB, 60GHz)
- Digital signal processing systems
- Real-time control systems
- Hardware acceleration
- Large-scale communication/signal processing system simulation
- Hardware-in-the-loop cosimulation with software system
- Complex parallel computing algorithms
28BEE Design Environment
[Figure: BEE design environment with servers, client PC, network (Ethernet), BEE processing unit, analog front-end, and LVDS/LVTTL links. The BEE/Insecta design flow takes a Simulink MDL to an FPGA bit stream configuration file and an ASIC layout.]
29Design Flow: User's Perspective
Virtual Components
VHDL Netlist
30Basic Blocks
FIFO
DPRAM
Shifter
VHDL
Concat
Enable
Const
ROM
RAM
Counter
Delay
Mux
Down
P to S
Convert
ReInt
S to P
Sync
Slice
Up Smp
Register
FPGA and ASIC Support
FPGA Support Only
Scale
Sin Cos
Shift
Thresh
31Communication DSP Blocks
Puncture
Conv. Encoder
Depuncture
DDS
CIC
FIR
FFT
FPGA and ASIC Support
FPGA Support Only
32MAP (BCJR) Decoder
- Fully enclosed design
- Uniform RNG input vector
- Channel encoder
- AWGN filter
- Channel decoder
- BER collection mechanism
- Part of 3G Turbo Decoder
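The structure of such a self-contained test bench can be sketched in software. The example below is only a stand-in: it keeps the same stages (random input bits, encoder, AWGN channel, decoder, BER counter) but uses an identity encoder and a hard-decision detector in place of the channel code and the MAP decoder. BPSK signalling, the Eb/N0 interpretation of SNR, and all names are assumptions.

```python
import math
import random


def encode(bits):
    """Stand-in channel encoder: identity (no coding)."""
    return list(bits)


def decode(samples):
    """Stand-in channel decoder: hard decision on each received sample."""
    return [1 if s < 0.0 else 0 for s in samples]


def measure_ber(snr_db, n_bits, seed=0):
    """Run one closed-loop BER measurement at the given SNR (taken as Eb/N0)."""
    rng = random.Random(seed)
    sigma = math.sqrt(1.0 / (2.0 * 10.0 ** (snr_db / 10.0)))
    bits = [rng.getrandbits(1) for _ in range(n_bits)]        # uniform random input
    symbols = [1.0 - 2.0 * b for b in encode(bits)]           # BPSK: 0 -> +1, 1 -> -1
    received = [s + rng.gauss(0.0, sigma) for s in symbols]   # AWGN channel
    decided = decode(received)
    errors = sum(b != d for b, d in zip(bits, decided))       # BER collection
    return errors / n_bits


for snr_db in (-1.0, 4.0, 9.0):
    print(snr_db, measure_ber(snr_db, 20000))
```

Because every stage lives inside the design, only the final error count has to leave the emulator, which is what makes long runs practical.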
33MAP Simulation
- 10 MHz system clock
- SNR 14 dB to -1 dB
- 10^9 samples
- <30 minutes
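These figures are mutually consistent under a few assumptions that the slide does not state: the sample count is read as 10^9, one sample is processed per clock cycle, that many samples are run at each SNR point, and the SNR is swept in 1 dB steps.

```python
CLOCK_HZ = 10 * 10 ** 6        # 10 MHz system clock
SAMPLES_PER_POINT = 10 ** 9    # assumed: 10^9 samples at every SNR point
SNR_POINTS_DB = list(range(14, -2, -1))   # assumed 1 dB steps: 14, 13, ..., -1

seconds_per_point = SAMPLES_PER_POINT / CLOCK_HZ
total_minutes = len(SNR_POINTS_DB) * seconds_per_point / 60.0

print(seconds_per_point, len(SNR_POINTS_DB), round(total_minutes, 1))
```

With these assumptions each SNR point takes 100 seconds and the full sweep stays under half an hour.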
34ASIC Flow: INSECTA
- Tcl/Tk code drives the flow
- Same scripting language used by several EDA tools: First Encounter, Nanoroute, ModelSim, Synopsys
- GUI controls technology selection, parameter selection, flow sequencing
- A real push-button flow
- Users can refine flow-generated scripts
35ASIC Flow Details
- PC Software
- Matlab R13 (6.5)
- Xilinx ISE
- Xilinx SystemGenerator 2.2
- BEE ISE
- Xilinx ChipScope
- Xilinx Parallel Cable
- UNIX SW Versions
- TCL/TK 8.3
- Synopsys 2002.05
- Cadence SoC Encounter 2.2 (Nanoroute)
- Modelsim 5.6
- Cadence SE (icfb 4.4.6)
- Mentor Calibre
[Figure: flow chart starting from High-level Design; each step is listed with the tool that performs it, and a legend marks optional design steps.]
- Generate backend scripts: Insecta
- View hierarchy: Insecta
- Identify files and paths: Insecta
- Run floorplanning: First Encounter
- View logic schematic: DA
- Resolve design hierarchy: Insecta
- Backannotate netlist: DC
- Gate-level simulation: Modelsim
- Check hierarchy consistency: Insecta
- Run physical synthesis: DC/PSYN
- View floorplan: First Encounter
- Identify bad VHDL structures: Insecta
- Run signal integrity: First Encounter
- View routed design: NanoRoute
- Correct bad VHDL structures: Insecta
- Re-run physical synthesis: DC/PSYN
- View log files: Insecta
- View GDSII: pipo
- Generate synthesis scripts: Insecta
- Run route: NanoRoute
- Virtual component generation: MC
- Post process: DFII icfb
- Run (first) logic synthesis: DC
364092-bit LDPC Decoder
- 1.8 million transistors
- 2.7mm x 3.1mm (10x smaller than a 1024-bit LDPC decoder)
- 1GHz
E. Yeo
37Conclusions
- Power and energy are now primary design constraints
- Variations do not scale as well as the feature sizes
- Optimization has to be performed across all the levels of hierarchy
- Using multiple/variable supplies and thresholds helps achieve optimality
- BEE and INSECTA (in 0.13µm) are fully operational
- LDPC chip taped out in May