Tutorial Outline

About This Presentation

Title:

Tutorial Outline

Description:

Most Worst Fastest Least Most. Application. Behavioral. Architectural (RTL) Logic (Gate) ... delay and setup/hold time) due to increased on-resistance ... – PowerPoint PPT presentation

Number of Views:119

Avg rating:3.0/5.0

Slides: 84

Provided by: Jan6154

Category:

more less

Transcript and Presenter's Notes

Title: Tutorial Outline

1
Tutorial Outline
2
Design Levels
Abstraction Analysis Analysis
Analysis Analysis Power Level
Capacity Accuracy Speed Resources
Savings Most
Worst Fastest Least
Most Application Behavioral Architectural
(RTL) Logic (Gate) Transistor (Switch)
Least Best
Slowest Most Least
3
Basic Principles of Low Power Design
P CL VDD2 f0?1 tscVDD Ipeak f0?1 VDD
Ileakage

Reduce switching (supply) voltage
quadratic effect -gt dramatic savings
negative effect on performance
Reduce capacitance
Reduce switching frequency
switching activity
clock rate
Reduce glitching
Reduce short circuit currents (slope engineering)
Reduce leakage currents

4
Processor Sleep Modes

Software power control - power management
DOZE - most fus stopped except on-chip cache
memory (cache coherency)
NAP - cache also turned off, PLL still on, time
out or external interrupt to resume
SLEEP - PLL off, external interrupt to resume

Deeper sleep mode saves more power
Deeper sleep mode requires more latency to resume
ACPI www.teleport.com/acpi
5
PowerPC Sleep Modes
10 cycles to wake up from SLEEP
100us to wake up from SLEEP
6
Keeper Circuits

A floating node (not driven by any gates) can
suffer charge decay resulting in short-circuit
currents
Keeper circuits can
slightly increase power dissipation
slightly increase delay
Essential in circuits with sleep modes

7
Intels SpeedStep

Hardware that steps down the clock and supply
voltage when the user unplugs a mobile computer
from AC power
step down the PLL from 650MHz ? 500MHz
CPU stalls during SpeedStep adjustment

8
Transmeta LongRun

Hardware that scales supply voltage and clock
frequency in response to software demands
32 levels of VDD (use 5 to 7 in practice) from
1.1V to 1.6V
clock frequency from 200MHz to 700MHz in
increments of 33MHz
Controlled through 5 pin interface
triggered when CPU load change detected by
software
heavier load ? ramp up supply voltage, when
stable scale up clock frequency
lighter load ? scale down clock frequency, when
PLL locks onto new rate, ramp down supply voltage
always keeps clock frequency within limits
required by supply voltage to avoid clock skew
problems

9
Transmeta LongRun, cont

Software can detect CPU load change in 1/2
microsecond
LongRun scales
the voltage up or down in less than 20
microseconds per step
the clock frequency in one step
Worst-case scenario of a full swing from 1.1V to
1.6V and from 200MHz to 700MHz takes 280
microseconds
CPU stalls only during PLL relock (lt 20
microseconds in the worst case)

10
Power Reduction Techniques in the Processor Core
11
Core Power Reduction
Pcore CVDD2f

Circuit level techniques
voltage scaling, transistor sizing, dual supply
voltages
Gate level techniques
logic gate and network restructuring, input
ordering
energy-delay efficient functional units, delay
balancing, etc.
Architecture level techniques
processor sleep modes
pipelining, clock and signal gating
guarded evaluation, precomputation, power aware
state encoding, etc.

12
Guarded Evaluation

Reduce switching activity by adding latches at
the inputs if outputs are not used
Latch preserves previous value of inputs to
suppress activity
could also use AND gates to mask inputs to zero ?
forced zero (good if zero-out condition changes
infrequently compared to data rate)

A
A
B
Latch
Multiplier
C
condition
13
Precomputation
Precomputed inputs
R1
Combination logic f(X)
Outputs
Gated inputs
R2
Load disable
Precomputation logic
g(X)
g(X)

Identify logical conditions at inputs that are
invariant to the output
since those inputs dont affect output, disable
input transitions
trade area for energy

14
Binary Comparator Example
Can achieve up to 75 power reduction with 3
area overhead and 1 to 5 additional gate delays
in worst case path
15
Design Issues in Precomputation

Design steps
1. Select precomputation architecture
2. Determined the precomputed and gated inputs
(R1 should be much smaller than R2)
3. Find (good implementation for) g(X)
4. Evaluate potential energy savings based on
input statistics (if savings not sufficient go to
step 2 or 3 and try again)
Also works for multiple output functions where
g(X) is the product of gj(X) over all j

16
Common Case Computation
Inputs
common case detected
sleep2
CC detection circuit
Original circuit
sleep1
sleep3
CC execution circuit
CCC controller
common case completed
Outputs
17
Activity of CCC Circuit Over Time
Original circuit
CC detection circuit
CC execution circuit
tp
tc
te
Time

Several (possibly conflicting) factors involved
in choosing the CC circuit leading to maximal
energy and/or time savings
Dependent on input data statistics

18
CCC Performance
From Lakshminarayana, 1999
19
Glitch Reduction by Pipelining

Glitches are dependent on the logic depth of the
circuit
Nodes logically deeper are more prone to
glitching
arrival times of the gate inputs are more spread
due to delay imbalances
usually affected by more primary input switching
Reduce depth by adding pipeline registers

20
Typical RISC Datapath

Five stage pipeline (originally for performance,
but also helps with energy)

Fetch
Decode
Execute
Memory
WriteBack
PC
Instruction
MAR
MDR
I
D
pipeline stage isolation register
clk
21
Sample of Benchmark Set
22
Datapath Energy Consumption
23
Signal Gating

Mask unwanted switching activity from propagating
Generation of control signals requires additional
logic circuitry (more power)

source signal
gated signal
control signal to suppress source signal
24
Signal Gating, cont

Signal gating saves energy if the relative
enable/disable frequency of control signal is
much lower than the frequency of source signal
(so many signal activities blocked)
Savings even greater if a group of source signals
can share a control signal
Good candidates - clock signals, address or data
buses, signals with high frequency or high
glitching

25
Selectively Gated Pipeline Regs

Pipeline registers consume a large percentage of
datapath power
40 for 0.35?
Pipeline registers have large width
Pipeline registers are clocked every cycle
Not all clockings are necessary
use the decoded control signals to selectively
gate the clock of pipeline register fields
only simple extra logic necessary
can be built into the clock buffer circuit

26
Gated Pipeline Register Example
Instr SW r1, 0(r2)
MEM/WB
EXE/MEM
mem/wb_cntl
MemData
Address
D
Data
EXE
MEM
WB
27
Switch Capacitance Reduction
From Narayana, DAC, 2000
28
Key References, Processor Core

Alidina, Precomputation-based sequential logic
optimization for low power, IEEE Trans. on VLSI
Systems, 2(4)426-436, 1994.
Halfhill, Transmeta breaks x86 low-power barrier,
Microprocessor Report, Feb. 2000.
Lakshminarayana, et.al., Common-case Computation,
DAC, 1999.
Manne, etal., Pipeline gating Speculation
control for energy reduction, ISCA, June 1998.
Roy, Power analysis and design at the system
level, Low Power Design in Deep Submicron
Electronics, Nebel and Mermet, Ed., Kluwer, 1997.
Tiwari, Reducing power in high-performance
microprocessors, DAC, 1998.
Tiwari, Guarded evaluation, ISLPD, 1995.
Ye, etal., The design and use of SimplePower A
cycle-accurate energy estimation tool, DAC, 2000.
Yeap, Practical Low Power Digital VLSI Design,
KAP, 1998.

29
Power Reduction Techniques in the Clock System
30
Clock Power

Why clock power is important (large)
Generally the signal with the highest frequency
Typically drives a large load
all sequential logic elements
all precharged/dynamic logic
distributed throughout chip, so lots of wiring
E.g., DEC 21164s clock accounts for 40 of total
chip power
3.75nF total clock load
20W (out of 50W) in clock distribution network

31
Clocking System

Clock generation, distribution and loading

Clock load
PLL
Control registers/latches Pipeline
registers Dynamic (precharged) logic
Memory precharged bit lines
32
Typical Clock Power Distribution
33
Clock Power Reduction

Pclock CVdd2f
Minimize voltage (V) using half swing clocks
Minimize clock load (C)
clock gating
careful routing, distributed drivers
Minimize clock frequency (f)
DET flipflops
localized PLL to multiply frequency of clock
GALS design approach

34
Reduced Swing Clock
Vdd
N-device clock
P-device clock
Gnd
Regular Clock
Vdd
P-device clock
Vtp
Vtn
N-device clock
Gnd
Half Swing Clock
35
Half Swing Clocks

Advantages
as long as Vtn (Vtp) less (greater) than 1/2Vdd
on-off characteristics of nfet (pfet) unchanged
Disadvantages
sequential element delay approx. doubled
(propagation delay and setup/hold time) due to
increased on-resistance
half-swing clock generator done via charge
sharing, so sleep modes problematic
not appropriate for very low voltage systems

36
Clock Gating

Most popular method for power reduction of clock
signals and fus
often idle functional units
e.g., floating point units
need circuit to generate
enable signal
increases complexity of control logic
timing critical to avoid clock glitches at AND
gate output
additional gate delay on clock signal
masking AND gate can replace a buffer in the
clock distribution tree

37
Glitch Free Clock Gating
lt
From lt
B
lt
Gated Clock
A
Clock
Clock
(1)
0 1
Gated Clock (1)
lt
REG
Clock
Gated Clock
Gated Clock (2)
Clock
(2)
38
Gated Clock FSM Architecture
Comb Logic
Reg
AF - Activation Function, Which evaluates to
logic 1 when clock needs to be stopped.
AF
Latch
Gated Clock
Clock
39
Clock Tree Construction to Facilitate Gating
Can insert clock gating at multiple levels in
clock tree Can shut off entire subtree if all
gating conditions are satisfied
H-Tree Clock Network
40
Clock Driver Distribution Comparison
SD single driver, DD distributed driver
(H-tree) 3.3V supply, 100MHz frequency, 1 micron
feature size
41
Clock Tree Structure Affects Gating
x1
R1
R1
x1
R2
x2
A
R3
Clock
x3
x1x3
B
R3
R2
x2
R4
R4
x2x4
x4
(a)
(b)
Assuming x1, x2, x3, x4 are mutually exclusive
42
2005? Multimedia SoC
Interrupt Controller
X Memory
Y Memory
22.8 mm
System Level Interconnect
I/O
System Bus Controller
MPU Core
DSP Core
22.8 mm
2GHz System clock
200M transistor chip
43
Multiple Local Clock Generators
f lt f1 lt f2 lt f3
Interrupt Contr
X Memory
Y Memory
System Level Interconnect
I/O
System Bus Controller
MPU Core
DSP Core
Key is in the design of the local circuits used
to generate the clock signal in each module
44
Clock Generation Options

Globally synched clock generators
PLLs
large, power hungry
process variation robustness
Clock multipliers
smaller, lower power consumption
Independent clock generators
ring oscillators (DLLs)
small size, low power consumption
free running

45
Clock Frequency Multipliers
1 Young, 1992 2 Alvarez, 1995 3 Gupta
46
GALS Design Style

Reduce clock power consumption by using a
Globally Asynchronous, Locally Synchronous (GALS)
design style
Overheads for
local clock generation
independent clock generators
low power global clock reference signal with
local clock frequency multipliers
global asynchronous communication
Skew tolerant

47
GALS Multimedia SoC
Interrupt Controller
X Memory
Y Memory
I/O
MPU Core
DSP Core
data
handshake protocol
48
Key References, Clock Power

Alvarez, A wide bandwidth low voltage PLL for
PowerPC microprocessors, IEEE Journal of SSC,
30383-391, April 1995.
Chen, A simple technique for global clock power
reduction, PSU Internal Report, 1998.
Chen, Clock power issues in system-on-a-chip
designs, Proc. of Workshop on VLSI, pp. 48-53,
March 1999.
Friedman, Clock distribution design in VLSI
circuits An Overview, Proc. of ISCAS, pp.
1475-1478, May 1994.
Gupta, Features of differential delay line used
on the embedded ultra low power Intel486 in
developer.intel.com/design/intarch/papers/ddl486.h
tm
Hemani, Lowering power consumption in clock by
using GALS design style, Proc. of DAC, pp.
873-878, 1999.
Kojima, Half-swing clocking scheme for 75 power
saving, IEEE Journal of SSC, 30(4)432-435, April
1994.
Tellez, Activity driven clock design for low
power circuits, Proc. of ICCAD, pp. 62-65, Nov.
1995.
Young, A PLL clock generator with 5 to 110MHz of
lock range for microprocessors, IEEE Journal of
SSC, pp. 1599-1607, Nov. 1992

49
Power Reduction Techniques in the Bus
Interconnects
50
Bus Power

Buses are a significant source of power
dissipation due to high switching activities and
large capacitive loading
15 of total power in Alpha 21064
30 of total power in Intel 80386

Wout
Xout
Yout
Zout
Bus receivers
Bus
Bus drivers
Ain
Bin
Cin
Din
51
Bus Power Reduction

Pbus nCVdd2f for an n-bit bus
Minimize bit switching activity (f) of buses by
encoding the data
Minimize voltage swing (V2) using differential
signaling
Alternative bus structures
charge recovery buses
bus multiplexing (lower f, maybe)
segmented buses (lower C)
Minimizing bus traffic (n)
code compression
instruction loop buffers

52
Signal Encoding
Binary Code
Gray Code
53
Toggle Rates
54
Bus Signal Encoding

Different encodings lead to different area,
delay, and power trade-offs
What is the power and latency cost of the
encoding/decoding logic?
What if the bus stream is not sequential?
Can really pay off in buses with large capacitive
loading (off-chip buses and high level on-chip
buses)

55
Bus Invert Encoding

At each cycle decide whether sending the true or
compliment signal leads to fewer toggles
Need an additional polarity signal on the bus to
tell the bus receiver whether to invert the
signal or not
Only makes sense for groups of signals - buses -
that can share the polarity signal
Works for both sequential and random bus streams

56
Bus Invert Coding Logic
Invert/pass
0000 ? 1110
Invert/pass
Source data
Data bus
0000 ? 0001
Received data
0000 ? 1110
0 ? 1
Polarity signal
0 ? 1
Polarity decision logic
Bus register
Under uniform random signal conditions
(non-correlated data), 25 upper bound on toggle
reduction
Hamming distance
57
Efficiency of Bus Invert Encoding

Have overhead in area, power and delay of
additional logic to encode/decode
Maximum number of toggling bits reduced from n
to n/2
Under uniform random signal conditions
(non-correlated data sequence), the toggle
reduction has an upper bound of 25

58
Efficiency of Bus Encoding
n/2
1 n1
EP n/2
EQ ? k Qk where Qk
2n k
k0
From Stan, 1995
59
Bus Encoding Extensions

For sequential data (e.g., generated on address
buses)
Gray code encoding (except for overhead)
T0 code by Benini
add address incrementer circuitry to receiver
add INC line to address bus
for consecutive addresses, just assert the INC
line without sending the second address
reduces address bus transitions by 36 over
binary
outperforms Gray code when probability of
consecutive addresses is gt 0.5

60
Data Bus Switching Activity
Average switching activity
4 bit grouping of 32 bit bus
From Sacha, 1999
61
Low Swing Buses

Minimize voltage swing (V2) using differential
signaling
bus contains multiple bits -gt relatively low
overhead
all signals on the bus operate in sync -gt
creative circuit techniques for differential
circuits
Two basic approaches
Additional reference voltage lines
driver circuit responsible for generating Vref
SA bus receiver circuit required
Charge recycling

62
Additional Reference Lines

Introduce an additional reference voltage line
between the sender and receiver

Vref
driver circuit
receiver circuit
Send data
Received data
Vbus
Cbus
Low swing bus
Vbus
?V ? 0.1Vdd
Vref
Conventional bus
Logic 0
Logic 1
63
Bus Driver Circuit
Vbus
Cref gtgt Cn,Cbus
Vref
Cn Vref Cbus(Vdd-Vref)
Source data
64
Power Efficiency

Depends on the extent of voltage swing reduction
(depends on required noise immunity and
sensitivity of sensing circuit)
0.1Vdd reduced swing -gt 99 savings
Also must consider
additional power of driver and receiver circuits
additional timing delays of circuits (but reduced
swing improves signal switching time)
reduced swing ? smaller transistors at driver ?
reduced short circuit currents

65
Limitations

Susceptible to noise and cross-talk
Producing large on-chip capacitance Cref
difficult
Sensing circuit difficult to design for very low
operating voltages
Ratio of Cbus to Cn may be difficult to control
(sensitive to process variations)
Driver circuit inherently dynamic so cannot stay
dormant for long periods (what if data signal
contains long series of identical values?)
Takes time for Vref to recover if bus deactivated

66
Charge Recycling Bus

High order bit discharges to lower bit recycling
charge (need 2 wires per bit)

0
CD1
1
CD1-
CD2
0
CD2-
S1
S2
67
Power Efficiency

Depends on the number of bits stacked
For n bits, voltage swing of each line is
?V Vdd/(2n)
So power dissipation of recycling bus is
PCRB 2n C (Vdd/(2n))2 (2f) Pconv /(n2)
However, due to precharge dont gain from data
correlation, so efficiency reduced to
PCRB 2Pconv /(n2)

68
Limitations

Larger values of n improves power efficiency but
decreases noise immunity
Must maintain all line capacitances at an equal
value (may limit scheme to on-chip buses ? have
to be careful in layout to balance capacitances)
Requires precharge phase ? reduces data transfer
rate

69
Comparisons
Vdd 2V, CL(bus) 1pF, 0.6?
From Zhang, 1998
70
Charge Recovery Bus

Recover charge from falling bit lines to
precharge rising bit lines

transmit control
receive control
short control
71
Energy Savings

The amount of energy savings depends on the
number of lines shorted, the control circuitry,
and the data length and pattern
For a single transfer charge recovery
E RCVdd?V
where R is the number of rising bit lines and
?V is the voltage change after charge transfer
E RCVdd(Vdd-Vdd(F/(RF))) CVdd2(R2/(RF))

72
Reported Savings

For random data, 32-bit bus
single transfer energy savings of 47
maximum optimal energy savings of 72

Avg energy savings
Width of databus
From Khoo, 1995
73
Single Transfer Charge Recovery Bus

CD0
??

CD1
??

CD2
??

CD3
??
transmit control
receive control
Participates in charge sharing if data bit is
different from last data bit transmitted
74
Data Patterns Affect Savings
Trace A 0001-gt1110-gt0001
Trace B 0011-gt1100-gt0011
Trace A Trace B
Step 3 Step 5 Total
312.5(2.5-0.625) 212.5(2.5-1.25) 112.5
(2.5-1.875) 212.5(2.5-1.25)
15.625 12.50
75
Impacts of Signal Encoding on Charge Recovery
Relative energy consumption
Average of 15 Mediabench benchmarks
From Bishop, 1999
76
Bus Multiplexing

Share long data buses with time multiplexing (S1
uses even cycles, S2 odd)
But what if data samples are correlated (e.g.,
sign bits)?

77
Correlated Data Streams
Bit switching probabilities
Bit position
MSB
LSB
78
Disadvantage of Bus Multiplexing

If data bus is shared, advantages of data
correlation are lost (bus carries samples from
two uncorrelated data streams)
Bus sharing should not be used for positively
correlated data streams
Bus sharing may prove advantageous in a
negatively correlated data stream (where
successive samples switch sign bits) - more
random switching

79
Segmented Buses

Partition bus into several segments that reduces
the capacitance per segment

Wout
Xout
TIE
Ain
Bin
TIE control

Try to group often communicating circuits on the
same segment

80
TIE Design

To connect the segments
Delay/power models for t-gate solution show a
60-70 reduction in power and a 10-30
improvement in bus delay

81
Code Compression

Assuming only a subset of instrs used, replace
them with a shorter encoding to reduce memory
bandwidth

addresses
Core
IDT
logN bits
instructions
k bits
instruction decompression table (restores
original format)
memory
82
Instruction Loop Buffer

Temporarily store decoded instrs from small
loops in a buffer (DIB)

skip Ifetch and decode
Fetch
Decode
Execute
Memory
WriteBack
PC
Instruction
MAR
MDR
I
D
DIB stores decoded instrs for a whole loop
DIB
Can achieve a 40 power savings in the MPU core
83
Key References, Bus Power

Bajwa, Stage-skip pipeline, Proc. of ISLPED, pp.
353-358, 1996.
Bellaouar, An ultra-low power CMOS on-chip
interconnect architecture, Proc. of SLPE, pp.
52-53, 1995.
Benini, Address bus encoding techniques for
system-level power optimization, Proc. of DATE,
pp. 861-866, 1998.
Bishop, Database charge recovery practical
considerations, Proc. SLPED, 1999.
Chen, Segmented bus design for low power systems,
IEEE Trans. on VLSI Systems, 7(1)25-29, Mar
1999.
Hikari, Data dependent logic swing internal bus
architecture for ultralow power LSI, IEEE Journal
of SSC, 30(4)397-402, Apr 1995.
Khoo, Charge recovery on a databus, Proc. SLPED,
185-189, Aug 1995.
Stan, Bus-invert coding for low power I/O, IEEE
Trans. on VLSI Systems, 3(1)49-58, 1995.
Yamauchi, An asymtotically zero power charge
recycling bus architecture, IEEE Journal of SSC,
30(4)423-431, Apr 1995.
Yoshida, An object code compression approach to
embedded processors, Proc. of ISLPED, pp.
265-268, 1997.
Zhang, Low-swing interconnect interface circuits,
Proc. SLPED, 161-166, Aug 1998.