Title: Tutorial Outline
1Tutorial Outline
2Design Levels
Abstraction Analysis Analysis
Analysis Analysis Power Level
Capacity Accuracy Speed Resources
Savings Most
Worst Fastest Least
Most Application Behavioral Architectural
(RTL) Logic (Gate) Transistor (Switch)
Least Best
Slowest Most Least
3Basic Principles of Low Power Design
P CL VDD2 f0?1 tscVDD Ipeak f0?1 VDD
Ileakage
- Reduce switching (supply) voltage
- quadratic effect -gt dramatic savings
- negative effect on performance
- Reduce capacitance
- Reduce switching frequency
- switching activity
- clock rate
- Reduce glitching
- Reduce short circuit currents (slope engineering)
- Reduce leakage currents
4Processor Sleep Modes
- Software power control - power management
- DOZE - most fus stopped except on-chip cache
memory (cache coherency) - NAP - cache also turned off, PLL still on, time
out or external interrupt to resume - SLEEP - PLL off, external interrupt to resume
Deeper sleep mode saves more power
Deeper sleep mode requires more latency to resume
ACPI www.teleport.com/acpi
5PowerPC Sleep Modes
10 cycles to wake up from SLEEP
100us to wake up from SLEEP
6Keeper Circuits
- A floating node (not driven by any gates) can
suffer charge decay resulting in short-circuit
currents - Keeper circuits can
- slightly increase power dissipation
- slightly increase delay
- Essential in circuits with sleep modes
7Intels SpeedStep
- Hardware that steps down the clock and supply
voltage when the user unplugs a mobile computer
from AC power - step down the PLL from 650MHz ? 500MHz
- CPU stalls during SpeedStep adjustment
8Transmeta LongRun
- Hardware that scales supply voltage and clock
frequency in response to software demands - 32 levels of VDD (use 5 to 7 in practice) from
1.1V to 1.6V - clock frequency from 200MHz to 700MHz in
increments of 33MHz - Controlled through 5 pin interface
- triggered when CPU load change detected by
software - heavier load ? ramp up supply voltage, when
stable scale up clock frequency - lighter load ? scale down clock frequency, when
PLL locks onto new rate, ramp down supply voltage - always keeps clock frequency within limits
required by supply voltage to avoid clock skew
problems
9Transmeta LongRun, cont
- Software can detect CPU load change in 1/2
microsecond - LongRun scales
- the voltage up or down in less than 20
microseconds per step - the clock frequency in one step
- Worst-case scenario of a full swing from 1.1V to
1.6V and from 200MHz to 700MHz takes 280
microseconds - CPU stalls only during PLL relock (lt 20
microseconds in the worst case)
10Power Reduction Techniques in the Processor Core
11Core Power Reduction
Pcore CVDD2f
- Circuit level techniques
- voltage scaling, transistor sizing, dual supply
voltages - Gate level techniques
- logic gate and network restructuring, input
ordering - energy-delay efficient functional units, delay
balancing, etc. - Architecture level techniques
- processor sleep modes
- pipelining, clock and signal gating
- guarded evaluation, precomputation, power aware
state encoding, etc.
12Guarded Evaluation
- Reduce switching activity by adding latches at
the inputs if outputs are not used - Latch preserves previous value of inputs to
suppress activity - could also use AND gates to mask inputs to zero ?
forced zero (good if zero-out condition changes
infrequently compared to data rate)
A
A
B
Latch
Multiplier
C
condition
13Precomputation
Precomputed inputs
R1
Combination logic f(X)
Outputs
Gated inputs
R2
Load disable
Precomputation logic
g(X)
g(X)
- Identify logical conditions at inputs that are
invariant to the output - since those inputs dont affect output, disable
input transitions - trade area for energy
14Binary Comparator Example
Can achieve up to 75 power reduction with 3
area overhead and 1 to 5 additional gate delays
in worst case path
15Design Issues in Precomputation
- Design steps
- 1. Select precomputation architecture
- 2. Determined the precomputed and gated inputs
(R1 should be much smaller than R2) - 3. Find (good implementation for) g(X)
- 4. Evaluate potential energy savings based on
input statistics (if savings not sufficient go to
step 2 or 3 and try again) - Also works for multiple output functions where
g(X) is the product of gj(X) over all j
16Common Case Computation
Inputs
common case detected
sleep2
CC detection circuit
Original circuit
sleep1
sleep3
CC execution circuit
CCC controller
common case completed
Outputs
17Activity of CCC Circuit Over Time
Original circuit
CC detection circuit
CC execution circuit
tp
tc
te
Time
- Several (possibly conflicting) factors involved
in choosing the CC circuit leading to maximal
energy and/or time savings - Dependent on input data statistics
18CCC Performance
From Lakshminarayana, 1999
19Glitch Reduction by Pipelining
- Glitches are dependent on the logic depth of the
circuit - Nodes logically deeper are more prone to
glitching - arrival times of the gate inputs are more spread
due to delay imbalances - usually affected by more primary input switching
- Reduce depth by adding pipeline registers
20Typical RISC Datapath
- Five stage pipeline (originally for performance,
but also helps with energy)
Fetch
Decode
Execute
Memory
WriteBack
PC
Instruction
MAR
MDR
I
D
pipeline stage isolation register
clk
21Sample of Benchmark Set
22Datapath Energy Consumption
23Signal Gating
- Mask unwanted switching activity from propagating
- Generation of control signals requires additional
logic circuitry (more power)
source signal
gated signal
control signal to suppress source signal
24Signal Gating, cont
- Signal gating saves energy if the relative
enable/disable frequency of control signal is
much lower than the frequency of source signal
(so many signal activities blocked) - Savings even greater if a group of source signals
can share a control signal - Good candidates - clock signals, address or data
buses, signals with high frequency or high
glitching
25Selectively Gated Pipeline Regs
- Pipeline registers consume a large percentage of
datapath power - 40 for 0.35?
- Pipeline registers have large width
- Pipeline registers are clocked every cycle
- Not all clockings are necessary
- use the decoded control signals to selectively
gate the clock of pipeline register fields - only simple extra logic necessary
- can be built into the clock buffer circuit
26Gated Pipeline Register Example
Instr SW r1, 0(r2)
MEM/WB
EXE/MEM
mem/wb_cntl
MemData
Address
D
Data
EXE
MEM
WB
27Switch Capacitance Reduction
From Narayana, DAC, 2000
28Key References, Processor Core
- Alidina, Precomputation-based sequential logic
optimization for low power, IEEE Trans. on VLSI
Systems, 2(4)426-436, 1994. - Halfhill, Transmeta breaks x86 low-power barrier,
Microprocessor Report, Feb. 2000. - Lakshminarayana, et.al., Common-case Computation,
DAC, 1999. - Manne, etal., Pipeline gating Speculation
control for energy reduction, ISCA, June 1998. - Roy, Power analysis and design at the system
level, Low Power Design in Deep Submicron
Electronics, Nebel and Mermet, Ed., Kluwer, 1997. - Tiwari, Reducing power in high-performance
microprocessors, DAC, 1998. - Tiwari, Guarded evaluation, ISLPD, 1995.
- Ye, etal., The design and use of SimplePower A
cycle-accurate energy estimation tool, DAC, 2000. - Yeap, Practical Low Power Digital VLSI Design,
KAP, 1998.
29Power Reduction Techniques in the Clock System
30Clock Power
- Why clock power is important (large)
- Generally the signal with the highest frequency
- Typically drives a large load
- all sequential logic elements
- all precharged/dynamic logic
- distributed throughout chip, so lots of wiring
- E.g., DEC 21164s clock accounts for 40 of total
chip power - 3.75nF total clock load
- 20W (out of 50W) in clock distribution network
31Clocking System
- Clock generation, distribution and loading
Clock load
PLL
Control registers/latches Pipeline
registers Dynamic (precharged) logic
Memory precharged bit lines
32Typical Clock Power Distribution
33Clock Power Reduction
- Pclock CVdd2f
- Minimize voltage (V) using half swing clocks
- Minimize clock load (C)
- clock gating
- careful routing, distributed drivers
- Minimize clock frequency (f)
- DET flipflops
- localized PLL to multiply frequency of clock
- GALS design approach
34Reduced Swing Clock
Vdd
N-device clock
P-device clock
Gnd
Regular Clock
Vdd
P-device clock
Vtp
Vtn
N-device clock
Gnd
Half Swing Clock
35Half Swing Clocks
- Advantages
- as long as Vtn (Vtp) less (greater) than 1/2Vdd
on-off characteristics of nfet (pfet) unchanged - Disadvantages
- sequential element delay approx. doubled
(propagation delay and setup/hold time) due to
increased on-resistance - half-swing clock generator done via charge
sharing, so sleep modes problematic - not appropriate for very low voltage systems
36Clock Gating
- Most popular method for power reduction of clock
signals and fus - often idle functional units
- e.g., floating point units
- need circuit to generate
enable signal - increases complexity of control logic
- timing critical to avoid clock glitches at AND
gate output - additional gate delay on clock signal
- masking AND gate can replace a buffer in the
clock distribution tree
37Glitch Free Clock Gating
lt
From lt
B
lt
Gated Clock
A
Clock
Clock
(1)
0 1
Gated Clock (1)
lt
REG
Clock
Gated Clock
Gated Clock (2)
Clock
(2)
38Gated Clock FSM Architecture
Comb Logic
Reg
AF - Activation Function, Which evaluates to
logic 1 when clock needs to be stopped.
AF
Latch
Gated Clock
Clock
39Clock Tree Construction to Facilitate Gating
Can insert clock gating at multiple levels in
clock tree Can shut off entire subtree if all
gating conditions are satisfied
H-Tree Clock Network
40Clock Driver Distribution Comparison
SD single driver, DD distributed driver
(H-tree) 3.3V supply, 100MHz frequency, 1 micron
feature size
41Clock Tree Structure Affects Gating
x1
R1
R1
x1
R2
x2
A
R3
Clock
x3
x1x3
B
R3
R2
x2
R4
R4
x2x4
x4
(a)
(b)
Assuming x1, x2, x3, x4 are mutually exclusive
422005? Multimedia SoC
Interrupt Controller
X Memory
Y Memory
22.8 mm
System Level Interconnect
I/O
System Bus Controller
MPU Core
DSP Core
22.8 mm
2GHz System clock
200M transistor chip
43Multiple Local Clock Generators
f lt f1 lt f2 lt f3
Interrupt Contr
X Memory
Y Memory
System Level Interconnect
I/O
System Bus Controller
MPU Core
DSP Core
Key is in the design of the local circuits used
to generate the clock signal in each module
44Clock Generation Options
- Globally synched clock generators
- PLLs
- large, power hungry
- process variation robustness
- Clock multipliers
- smaller, lower power consumption
- Independent clock generators
- ring oscillators (DLLs)
- small size, low power consumption
- free running
45Clock Frequency Multipliers
1 Young, 1992 2 Alvarez, 1995 3 Gupta
46GALS Design Style
- Reduce clock power consumption by using a
Globally Asynchronous, Locally Synchronous (GALS)
design style - Overheads for
- local clock generation
- independent clock generators
- low power global clock reference signal with
local clock frequency multipliers - global asynchronous communication
- Skew tolerant
47GALS Multimedia SoC
Interrupt Controller
X Memory
Y Memory
I/O
MPU Core
DSP Core
data
handshake protocol
48Key References, Clock Power
- Alvarez, A wide bandwidth low voltage PLL for
PowerPC microprocessors, IEEE Journal of SSC,
30383-391, April 1995. - Chen, A simple technique for global clock power
reduction, PSU Internal Report, 1998. - Chen, Clock power issues in system-on-a-chip
designs, Proc. of Workshop on VLSI, pp. 48-53,
March 1999. - Friedman, Clock distribution design in VLSI
circuits An Overview, Proc. of ISCAS, pp.
1475-1478, May 1994. - Gupta, Features of differential delay line used
on the embedded ultra low power Intel486 in
developer.intel.com/design/intarch/papers/ddl486.h
tm - Hemani, Lowering power consumption in clock by
using GALS design style, Proc. of DAC, pp.
873-878, 1999. - Kojima, Half-swing clocking scheme for 75 power
saving, IEEE Journal of SSC, 30(4)432-435, April
1994. - Tellez, Activity driven clock design for low
power circuits, Proc. of ICCAD, pp. 62-65, Nov.
1995. - Young, A PLL clock generator with 5 to 110MHz of
lock range for microprocessors, IEEE Journal of
SSC, pp. 1599-1607, Nov. 1992
49Power Reduction Techniques in the Bus
Interconnects
50Bus Power
- Buses are a significant source of power
dissipation due to high switching activities and
large capacitive loading - 15 of total power in Alpha 21064
- 30 of total power in Intel 80386
Wout
Xout
Yout
Zout
Bus receivers
Bus
Bus drivers
Ain
Bin
Cin
Din
51Bus Power Reduction
- Pbus nCVdd2f for an n-bit bus
- Minimize bit switching activity (f) of buses by
encoding the data - Minimize voltage swing (V2) using differential
signaling - Alternative bus structures
- charge recovery buses
- bus multiplexing (lower f, maybe)
- segmented buses (lower C)
- Minimizing bus traffic (n)
- code compression
- instruction loop buffers
52Signal Encoding
Binary Code
Gray Code
53Toggle Rates
54Bus Signal Encoding
- Different encodings lead to different area,
delay, and power trade-offs - What is the power and latency cost of the
encoding/decoding logic? - What if the bus stream is not sequential?
- Can really pay off in buses with large capacitive
loading (off-chip buses and high level on-chip
buses)
55Bus Invert Encoding
- At each cycle decide whether sending the true or
compliment signal leads to fewer toggles - Need an additional polarity signal on the bus to
tell the bus receiver whether to invert the
signal or not - Only makes sense for groups of signals - buses -
that can share the polarity signal - Works for both sequential and random bus streams
56Bus Invert Coding Logic
Invert/pass
0000 ? 1110
Invert/pass
Source data
Data bus
0000 ? 0001
Received data
0000 ? 1110
0 ? 1
Polarity signal
0 ? 1
Polarity decision logic
Bus register
Under uniform random signal conditions
(non-correlated data), 25 upper bound on toggle
reduction
Hamming distance
57Efficiency of Bus Invert Encoding
- Have overhead in area, power and delay of
additional logic to encode/decode - Maximum number of toggling bits reduced from n
to n/2 - Under uniform random signal conditions
(non-correlated data sequence), the toggle
reduction has an upper bound of 25
58Efficiency of Bus Encoding
n/2
1 n1
EP n/2
EQ ? k Qk where Qk
2n k
k0
From Stan, 1995
59Bus Encoding Extensions
- For sequential data (e.g., generated on address
buses) - Gray code encoding (except for overhead)
- T0 code by Benini
- add address incrementer circuitry to receiver
- add INC line to address bus
- for consecutive addresses, just assert the INC
line without sending the second address - reduces address bus transitions by 36 over
binary - outperforms Gray code when probability of
consecutive addresses is gt 0.5
60Data Bus Switching Activity
Average switching activity
4 bit grouping of 32 bit bus
From Sacha, 1999
61Low Swing Buses
- Minimize voltage swing (V2) using differential
signaling - bus contains multiple bits -gt relatively low
overhead - all signals on the bus operate in sync -gt
creative circuit techniques for differential
circuits - Two basic approaches
- Additional reference voltage lines
- driver circuit responsible for generating Vref
- SA bus receiver circuit required
- Charge recycling
62Additional Reference Lines
- Introduce an additional reference voltage line
between the sender and receiver
Vref
driver circuit
receiver circuit
Send data
Received data
Vbus
Cbus
Low swing bus
Vbus
?V ? 0.1Vdd
Vref
Conventional bus
Logic 0
Logic 1
63Bus Driver Circuit
Vbus
Cref gtgt Cn,Cbus
Vref
Cn Vref Cbus(Vdd-Vref)
Source data
64Power Efficiency
- Depends on the extent of voltage swing reduction
(depends on required noise immunity and
sensitivity of sensing circuit) - 0.1Vdd reduced swing -gt 99 savings
- Also must consider
- additional power of driver and receiver circuits
- additional timing delays of circuits (but reduced
swing improves signal switching time) - reduced swing ? smaller transistors at driver ?
reduced short circuit currents
65Limitations
- Susceptible to noise and cross-talk
- Producing large on-chip capacitance Cref
difficult - Sensing circuit difficult to design for very low
operating voltages - Ratio of Cbus to Cn may be difficult to control
(sensitive to process variations) - Driver circuit inherently dynamic so cannot stay
dormant for long periods (what if data signal
contains long series of identical values?) - Takes time for Vref to recover if bus deactivated
66Charge Recycling Bus
- High order bit discharges to lower bit recycling
charge (need 2 wires per bit)
0
CD1
1
CD1-
CD2
0
CD2-
S1
S2
67Power Efficiency
- Depends on the number of bits stacked
- For n bits, voltage swing of each line is
- ?V Vdd/(2n)
- So power dissipation of recycling bus is
- PCRB 2n C (Vdd/(2n))2 (2f) Pconv /(n2)
- However, due to precharge dont gain from data
correlation, so efficiency reduced to - PCRB 2Pconv /(n2)
68Limitations
- Larger values of n improves power efficiency but
decreases noise immunity - Must maintain all line capacitances at an equal
value (may limit scheme to on-chip buses ? have
to be careful in layout to balance capacitances) - Requires precharge phase ? reduces data transfer
rate
69Comparisons
Vdd 2V, CL(bus) 1pF, 0.6?
From Zhang, 1998
70Charge Recovery Bus
- Recover charge from falling bit lines to
precharge rising bit lines
transmit control
receive control
short control
71Energy Savings
- The amount of energy savings depends on the
number of lines shorted, the control circuitry,
and the data length and pattern - For a single transfer charge recovery
- E RCVdd?V
- where R is the number of rising bit lines and
?V is the voltage change after charge transfer - E RCVdd(Vdd-Vdd(F/(RF))) CVdd2(R2/(RF))
72Reported Savings
- For random data, 32-bit bus
- single transfer energy savings of 47
- maximum optimal energy savings of 72
Avg energy savings
Width of databus
From Khoo, 1995
73Single Transfer Charge Recovery Bus
CD0
??
CD1
??
CD2
??
CD3
??
transmit control
receive control
Participates in charge sharing if data bit is
different from last data bit transmitted
74Data Patterns Affect Savings
Trace A 0001-gt1110-gt0001
Trace B 0011-gt1100-gt0011
Trace A Trace B
Step 3 Step 5 Total
312.5(2.5-0.625) 212.5(2.5-1.25) 112.5
(2.5-1.875) 212.5(2.5-1.25)
15.625 12.50
75Impacts of Signal Encoding on Charge Recovery
Relative energy consumption
Average of 15 Mediabench benchmarks
From Bishop, 1999
76Bus Multiplexing
- Share long data buses with time multiplexing (S1
uses even cycles, S2 odd) - But what if data samples are correlated (e.g.,
sign bits)?
77Correlated Data Streams
Bit switching probabilities
Bit position
MSB
LSB
78Disadvantage of Bus Multiplexing
- If data bus is shared, advantages of data
correlation are lost (bus carries samples from
two uncorrelated data streams) - Bus sharing should not be used for positively
correlated data streams - Bus sharing may prove advantageous in a
negatively correlated data stream (where
successive samples switch sign bits) - more
random switching
79Segmented Buses
- Partition bus into several segments that reduces
the capacitance per segment
Wout
Xout
TIE
Ain
Bin
TIE control
- Try to group often communicating circuits on the
same segment
80TIE Design
- To connect the segments
- Delay/power models for t-gate solution show a
60-70 reduction in power and a 10-30
improvement in bus delay
81Code Compression
- Assuming only a subset of instrs used, replace
them with a shorter encoding to reduce memory
bandwidth
addresses
Core
IDT
logN bits
instructions
k bits
instruction decompression table (restores
original format)
memory
82Instruction Loop Buffer
- Temporarily store decoded instrs from small
loops in a buffer (DIB)
skip Ifetch and decode
Fetch
Decode
Execute
Memory
WriteBack
PC
Instruction
MAR
MDR
I
D
DIB stores decoded instrs for a whole loop
DIB
Can achieve a 40 power savings in the MPU core
83Key References, Bus Power
- Bajwa, Stage-skip pipeline, Proc. of ISLPED, pp.
353-358, 1996. - Bellaouar, An ultra-low power CMOS on-chip
interconnect architecture, Proc. of SLPE, pp.
52-53, 1995. - Benini, Address bus encoding techniques for
system-level power optimization, Proc. of DATE,
pp. 861-866, 1998. - Bishop, Database charge recovery practical
considerations, Proc. SLPED, 1999. - Chen, Segmented bus design for low power systems,
IEEE Trans. on VLSI Systems, 7(1)25-29, Mar
1999. - Hikari, Data dependent logic swing internal bus
architecture for ultralow power LSI, IEEE Journal
of SSC, 30(4)397-402, Apr 1995. - Khoo, Charge recovery on a databus, Proc. SLPED,
185-189, Aug 1995. - Stan, Bus-invert coding for low power I/O, IEEE
Trans. on VLSI Systems, 3(1)49-58, 1995. - Yamauchi, An asymtotically zero power charge
recycling bus architecture, IEEE Journal of SSC,
30(4)423-431, Apr 1995. - Yoshida, An object code compression approach to
embedded processors, Proc. of ISLPED, pp.
265-268, 1997. - Zhang, Low-swing interconnect interface circuits,
Proc. SLPED, 161-166, Aug 1998.