Tutorial Outline - PowerPoint PPT Presentation

Tutorial Outline

Slides: 84
Provided by: Jan6154

1
Tutorial Outline
2
Design Levels
Columns: Abstraction Level | Analysis Capacity | Analysis Accuracy | Analysis Speed | Analysis Resources | Power Savings
Application (highest level): Most | Worst | Fastest | Least | Most
Behavioral
Architectural (RTL)
Logic (Gate)
Transistor (Switch) (lowest level): Least | Best | Slowest | Most | Least
3
Basic Principles of Low Power Design
$P = C_L V_{DD}^2 f_{0 \to 1} + t_{sc} V_{DD} I_{peak} f_{0 \to 1} + V_{DD} I_{leakage}$
  • Reduce switching (supply) voltage
  • quadratic effect -> dramatic savings
  • negative effect on performance
  • Reduce capacitance
  • Reduce switching frequency
  • switching activity
  • clock rate
  • Reduce glitching
  • Reduce short circuit currents (slope engineering)
  • Reduce leakage currents
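
As a rough numerical illustration of the three terms in the power equation, the sketch below evaluates dynamic, short-circuit, and leakage power for a made-up operating point. All parameter values are hypothetical and chosen only to show the quadratic effect of the supply voltage on the dynamic term; in a real circuit the peak short-circuit current would also change with the supply.

```python
def power_terms(c_load, vdd, f_sw, t_sc, i_peak, i_leak):
    """Return (dynamic, short_circuit, leakage) power in watts.

    dynamic       = C_L * VDD^2 * f_0->1
    short_circuit = t_sc * VDD * I_peak * f_0->1
    leakage       = VDD * I_leakage
    """
    dynamic = c_load * vdd ** 2 * f_sw
    short_circuit = t_sc * vdd * i_peak * f_sw
    leakage = vdd * i_leak
    return dynamic, short_circuit, leakage


# Hypothetical operating point (illustrative values, not from the slides).
params = dict(c_load=1e-9, f_sw=100e6, t_sc=0.1e-9, i_peak=10e-3, i_leak=1e-6)
full = power_terms(vdd=3.3, **params)
half = power_terms(vdd=1.65, **params)

# Halving VDD cuts the dynamic term to one quarter (quadratic effect).
dynamic_ratio = half[0] / full[0]
print(f"dynamic power ratio at half VDD: {dynamic_ratio:.2f}")
```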

4
Processor Sleep Modes
  • Software power control - power management
  • DOZE - most functional units (FUs) stopped except on-chip cache
    memory (cache coherency)
  • NAP - cache also turned off, PLL still on, time
    out or external interrupt to resume
  • SLEEP - PLL off, external interrupt to resume

Deeper sleep mode saves more power
Deeper sleep mode requires more latency to resume
ACPI www.teleport.com/acpi
5
PowerPC Sleep Modes
10 cycles to wake up from SLEEP
100us to wake up from SLEEP
6
Keeper Circuits
  • A floating node (not driven by any gates) can
    suffer charge decay resulting in short-circuit
    currents
  • Keeper circuits can
  • slightly increase power dissipation
  • slightly increase delay
  • Essential in circuits with sleep modes

7
Intel's SpeedStep
  • Hardware that steps down the clock and supply
    voltage when the user unplugs a mobile computer
    from AC power
  • step down the PLL from 650MHz → 500MHz
  • CPU stalls during SpeedStep adjustment

8
Transmeta LongRun
  • Hardware that scales supply voltage and clock
    frequency in response to software demands
  • 32 levels of VDD (use 5 to 7 in practice) from
    1.1V to 1.6V
  • clock frequency from 200MHz to 700MHz in
    increments of 33MHz
  • Controlled through 5 pin interface
  • triggered when CPU load change detected by
    software
  • heavier load → ramp up supply voltage, when
    stable scale up clock frequency
  • lighter load → scale down clock frequency, when
    PLL locks onto new rate, ramp down supply voltage
  • always keeps clock frequency within limits
    required by supply voltage to avoid clock skew
    problems
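
The ordering rule in the bullets above (raise voltage before frequency, lower frequency before voltage) can be sketched as a small planning function. This is only an illustration of the sequencing described on the slide; the function name and the action labels are hypothetical and do not correspond to any real LongRun interface.

```python
def plan_transition(cur_mhz, new_mhz, new_vdd):
    """Return the ordered list of adjustments for a load change.

    The order keeps the clock frequency within what the present
    supply voltage supports at every moment of the transition.
    """
    if new_mhz > cur_mhz:
        # Heavier load: voltage must be stable before the clock speeds up.
        return [("ramp_voltage", new_vdd), ("scale_clock", new_mhz)]
    if new_mhz < cur_mhz:
        # Lighter load: slow the clock first, then drop the voltage.
        return [("scale_clock", new_mhz), ("ramp_voltage", new_vdd)]
    return []


# Full swing in both directions, using the end points quoted on the slide.
speed_up = plan_transition(cur_mhz=200, new_mhz=700, new_vdd=1.6)
slow_down = plan_transition(cur_mhz=700, new_mhz=200, new_vdd=1.1)
print(speed_up)
print(slow_down)
```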

9
Transmeta LongRun, cont
  • Software can detect CPU load change in 1/2
    microsecond
  • LongRun scales
  • the voltage up or down in less than 20
    microseconds per step
  • the clock frequency in one step
  • Worst-case scenario of a full swing from 1.1V to
    1.6V and from 200MHz to 700MHz takes 280
    microseconds
  • CPU stalls only during PLL relock (< 20
    microseconds in the worst case)

10
Power Reduction Techniques in the Processor Core
11
Core Power Reduction
$P_{core} = C V_{DD}^2 f$
  • Circuit level techniques
  • voltage scaling, transistor sizing, dual supply
    voltages
  • Gate level techniques
  • logic gate and network restructuring, input
    ordering
  • energy-delay efficient functional units, delay
    balancing, etc.
  • Architecture level techniques
  • processor sleep modes
  • pipelining, clock and signal gating
  • guarded evaluation, precomputation, power aware
    state encoding, etc.

12
Guarded Evaluation
  • Reduce switching activity by adding latches at
    the inputs if outputs are not used
  • Latch preserves previous value of inputs to
    suppress activity
  • could also use AND gates to mask inputs to zero
    (forced zero); good if the zero-out condition changes
    infrequently compared to the data rate

[Figure: guarded evaluation example; labels: A, B, C, Latch, Multiplier, condition]
13
Precomputation
[Figure: precomputation architecture; precomputed inputs are registered in R1 and gated inputs in R2, precomputation logic g(X) drives the load disable of R2, and combinational logic f(X) produces the outputs]
  • Identify logical conditions at inputs that are
    invariant to the output
  • since those inputs don't affect output, disable
    input transitions
  • trade area for energy

14
Binary Comparator Example
Can achieve up to 75% power reduction with 3%
area overhead and 1 to 5 additional gate delays
in worst case path
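
To make the comparator example concrete, here is a minimal behavioral sketch, assuming the precomputed inputs are the two most significant bits and that the lower bits are loaded into the gated register only when those bits are equal. The function name and the uniform random input model are assumptions for illustration; the sketch counts register loads only and does not reproduce the power figure quoted above.

```python
import random


def gated_load_fraction(n_bits, samples, seed=0):
    """Fraction of cycles in which the gated register (low bits) is loaded."""
    rng = random.Random(seed)
    msb = 1 << (n_bits - 1)
    low_mask = msb - 1
    held_a = held_b = 0  # contents of the gated register R2
    loads = 0
    for _ in range(samples):
        a = rng.getrandbits(n_bits)
        b = rng.getrandbits(n_bits)
        a_msb, b_msb = bool(a & msb), bool(b & msb)
        if a_msb != b_msb:
            # MSBs alone decide A > B; R2 keeps its old value (no switching).
            result = a_msb
        else:
            # MSBs equal: the low bits are needed, so R2 is loaded.
            held_a, held_b = a & low_mask, b & low_mask
            loads += 1
            result = held_a > held_b
        assert result == (a > b)  # the gated circuit stays functionally correct
    return loads / samples


fraction = gated_load_fraction(n_bits=8, samples=10000)
print(f"gated register loaded in {fraction:.1%} of cycles")
```

Under uniform random inputs the most significant bits differ about half the time, so roughly half of the loads of the lower bits are suppressed.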
15
Design Issues in Precomputation
  • Design steps
  • 1. Select precomputation architecture
  • 2. Determine the precomputed and gated inputs
    (R1 should be much smaller than R2)
  • 3. Find (good implementation for) g(X)
  • 4. Evaluate potential energy savings based on
    input statistics (if savings not sufficient go to
    step 2 or 3 and try again)
  • Also works for multiple output functions where
    g(X) is the product of gj(X) over all j

16
Common Case Computation
[Figure: common-case computation (CCC) architecture; blocks: original circuit, CC detection circuit, CC execution circuit, CCC controller; signals: inputs, outputs, common case detected, common case completed, sleep1, sleep2, sleep3]
17
Activity of CCC Circuit Over Time
[Figure: activity over time of the original circuit, the CC detection circuit, and the CC execution circuit; time markers tp, tc, te]
  • Several (possibly conflicting) factors involved
    in choosing the CC circuit leading to maximal
    energy and/or time savings
  • Dependent on input data statistics

18
CCC Performance
From Lakshminarayana, 1999
19
Glitch Reduction by Pipelining
  • Glitches are dependent on the logic depth of the
    circuit
  • Nodes logically deeper are more prone to
    glitching
  • arrival times of the gate inputs are more spread
    due to delay imbalances
  • usually affected by more primary input switching
  • Reduce depth by adding pipeline registers

20
Typical RISC Datapath
  • Five stage pipeline (originally for performance,
    but also helps with energy)

[Figure: Fetch, Decode, Execute, Memory, WriteBack stages separated by pipeline stage isolation registers clocked by clk; labels: PC, Instruction, MAR, MDR, I, D]
21
Sample of Benchmark Set
22
Datapath Energy Consumption
23
Signal Gating
  • Mask unwanted switching activity from propagating
  • Generation of control signals requires additional
    logic circuitry (more power)

[Figure: a source signal is masked by a control signal (which suppresses the source signal) to produce the gated signal]
24
Signal Gating, cont
  • Signal gating saves energy if the relative
    enable/disable frequency of control signal is
    much lower than the frequency of source signal
    (so many signal activities blocked)
  • Savings even greater if a group of source signals
    can share a control signal
  • Good candidates - clock signals, address or data
    buses, signals with high frequency or high
    glitching

25
Selectively Gated Pipeline Regs
  • Pipeline registers consume a large percentage of
    datapath power
  • 40% for 0.35 µm
  • Pipeline registers have large width
  • Pipeline registers are clocked every cycle
  • Not all clockings are necessary
  • use the decoded control signals to selectively
    gate the clock of pipeline register fields
  • only simple extra logic necessary
  • can be built into the clock buffer circuit

26
Gated Pipeline Register Example
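[Figure: gated pipeline register example for the instruction SW r1, 0(r2); stages EXE, MEM, WB; pipeline registers EXE/MEM and MEM/WB with fields Address, Data, MemData, D; gating signal mem/wb_cntl]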
27
Switch Capacitance Reduction
From Narayana, DAC, 2000
28
Key References, Processor Core
  • Alidina, Precomputation-based sequential logic
    optimization for low power, IEEE Trans. on VLSI
    Systems, 2(4):426-436, 1994.
  • Halfhill, Transmeta breaks x86 low-power barrier,
    Microprocessor Report, Feb. 2000.
  • Lakshminarayana, et al., Common-case computation,
    DAC, 1999.
  • Manne, et al., Pipeline gating: Speculation
    control for energy reduction, ISCA, June 1998.
  • Roy, Power analysis and design at the system
    level, Low Power Design in Deep Submicron
    Electronics, Nebel and Mermet, Ed., Kluwer, 1997.
  • Tiwari, Reducing power in high-performance
    microprocessors, DAC, 1998.
  • Tiwari, Guarded evaluation, ISLPD, 1995.
  • Ye, et al., The design and use of SimplePower: A
    cycle-accurate energy estimation tool, DAC, 2000.
  • Yeap, Practical Low Power Digital VLSI Design,
    KAP, 1998.

29
Power Reduction Techniques in the Clock System
30
Clock Power
  • Why clock power is important (large)
  • Generally the signal with the highest frequency
  • Typically drives a large load
  • all sequential logic elements
  • all precharged/dynamic logic
  • distributed throughout chip, so lots of wiring
  • E.g., DEC 21164's clock accounts for 40% of total
    chip power
  • 3.75nF total clock load
  • 20W (out of 50W) in clock distribution network

31
Clocking System
  • Clock generation, distribution and loading

[Figure: PLL driving the clock load]
Clock load: control registers/latches, pipeline registers, dynamic (precharged) logic, memory precharged bit lines
32
Typical Clock Power Distribution
33
Clock Power Reduction
  • $P_{clock} = C V_{dd}^2 f$
  • Minimize voltage (V) using half swing clocks
  • Minimize clock load (C)
  • clock gating
  • careful routing, distributed drivers
  • Minimize clock frequency (f)
  • DET flipflops
  • localized PLL to multiply frequency of clock
  • GALS design approach

34
Reduced Swing Clock
[Figure: clock waveforms. Regular clock: N-device clock and P-device clock each swing between Gnd and Vdd. Half swing clock, labels from top to bottom: Vdd, P-device clock, Vtp, Vtn, N-device clock, Gnd]
35
Half Swing Clocks
  • Advantages
  • as long as Vtn (Vtp) is less (greater) than 1/2 Vdd, the
    on-off characteristics of the nfet (pfet) are unchanged
  • Disadvantages
  • sequential element delay approx. doubled
    (propagation delay and setup/hold time) due to
    increased on-resistance
  • half-swing clock generator done via charge
    sharing, so sleep modes problematic
  • not appropriate for very low voltage systems

36
Clock Gating
  • Most popular method for power reduction of clock
    signals and functional units (FUs)
  • often idle functional units
  • e.g., floating point units
  • need circuit to generate
    enable signal
  • increases complexity of control logic
  • timing critical to avoid clock glitches at AND
    gate output
  • additional gate delay on clock signal
  • masking AND gate can replace a buffer in the
    clock distribution tree

37
Glitch Free Clock Gating
[Figure: glitch-free clock gating; two variants, (1) and (2), each producing a Gated Clock for REG from Clock, with waveforms of Clock, Gated Clock (1), and Gated Clock (2); the remaining labels are not recoverable]
38
Gated Clock FSM Architecture
[Figure: FSM built from Comb Logic and Reg; the AF output passes through a Latch and gates Clock to produce Gated Clock]
AF - Activation Function, which evaluates to logic 1 when the clock needs to be stopped.
39
Clock Tree Construction to Facilitate Gating
Can insert clock gating at multiple levels in the clock tree.
Can shut off an entire subtree if all gating conditions are satisfied.
H-Tree Clock Network
40
Clock Driver Distribution Comparison
SD: single driver; DD: distributed driver (H-tree).
3.3V supply, 100MHz frequency, 1 micron feature size.
41
Clock Tree Structure Affects Gating
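[Figure: two clock tree structures, (a) and (b), distributing Clock to registers R1-R4 with gating conditions x1-x4; labels include internal nodes A and B and the combined conditions x1x3 and x2x4]
Assuming x1, x2, x3, x4 are mutually exclusive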
42
2005? Multimedia SoC
[Figure: 22.8 mm x 22.8 mm die; blocks: Interrupt Controller, X Memory, Y Memory, System Level Interconnect, I/O, System Bus Controller, MPU Core, DSP Core]
2GHz system clock
200M transistor chip
43
Multiple Local Clock Generators
f < f1 < f2 < f3
[Figure: SoC blocks: Interrupt Contr, X Memory, Y Memory, System Level Interconnect, I/O, System Bus Controller, MPU Core, DSP Core]
Key is in the design of the local circuits used
to generate the clock signal in each module
44
Clock Generation Options
  • Globally synched clock generators
  • PLLs
  • large, power hungry
  • process variation robustness
  • Clock multipliers
  • smaller, lower power consumption
  • Independent clock generators
  • ring oscillators (DLLs)
  • small size, low power consumption
  • free running

45
Clock Frequency Multipliers
[Table: comparison of clock frequency multipliers from [1] Young, 1992; [2] Alvarez, 1995; [3] Gupta]
46
GALS Design Style
  • Reduce clock power consumption by using a
    Globally Asynchronous, Locally Synchronous (GALS)
    design style
  • Overheads for
  • local clock generation
  • independent clock generators
  • low power global clock reference signal with
    local clock frequency multipliers
  • global asynchronous communication
  • Skew tolerant

47
GALS Multimedia SoC
Interrupt Controller
X Memory
Y Memory
I/O
MPU Core
DSP Core
data
handshake protocol
48
Key References, Clock Power
  • Alvarez, A wide bandwidth low voltage PLL for
    PowerPC microprocessors, IEEE Journal of SSC,
    30:383-391, April 1995.
  • Chen, A simple technique for global clock power
    reduction, PSU Internal Report, 1998.
  • Chen, Clock power issues in system-on-a-chip
    designs, Proc. of Workshop on VLSI, pp. 48-53,
    March 1999.
  • Friedman, Clock distribution design in VLSI
    circuits: An overview, Proc. of ISCAS, pp.
    1475-1478, May 1994.
  • Gupta, Features of differential delay line used
    on the embedded ultra low power Intel486, in
    developer.intel.com/design/intarch/papers/ddl486.htm
  • Hemani, Lowering power consumption in clock by
    using GALS design style, Proc. of DAC, pp.
    873-878, 1999.
  • Kojima, Half-swing clocking scheme for 75% power
    saving, IEEE Journal of SSC, 30(4):432-435, April
    1995.
  • Tellez, Activity driven clock design for low
    power circuits, Proc. of ICCAD, pp. 62-65, Nov.
    1995.
  • Young, A PLL clock generator with 5 to 110MHz of
    lock range for microprocessors, IEEE Journal of
    SSC, pp. 1599-1607, Nov. 1992.

49
Power Reduction Techniques in the Bus
Interconnects
50
Bus Power
  • Buses are a significant source of power
    dissipation due to high switching activities and
    large capacitive loading
  • 15% of total power in Alpha 21064
  • 30% of total power in Intel 80386

[Figure: bus drivers for Ain, Bin, Cin, Din drive a shared bus; bus receivers produce Wout, Xout, Yout, Zout]
51
Bus Power Reduction
  • $P_{bus} = n C V_{dd}^2 f$ for an n-bit bus
  • Minimize bit switching activity (f) of buses by
    encoding the data
  • Minimize voltage swing ($V^2$) using differential
    signaling
  • Alternative bus structures
  • charge recovery buses
  • bus multiplexing (lower f, maybe)
  • segmented buses (lower C)
  • Minimizing bus traffic (n)
  • code compression
  • instruction loop buffers

52
Signal Encoding
Binary Code
Gray Code
53
Toggle Rates
54
Bus Signal Encoding
  • Different encodings lead to different area,
    delay, and power trade-offs
  • What is the power and latency cost of the
    encoding/decoding logic?
  • What if the bus stream is not sequential?
  • Can really pay off in buses with large capacitive
    loading (off-chip buses and high level on-chip
    buses)

55
Bus Invert Encoding
  • At each cycle decide whether sending the true or
    complement signal leads to fewer toggles
  • Need an additional polarity signal on the bus to
    tell the bus receiver whether to invert the
    signal or not
  • Only makes sense for groups of signals - buses -
    that can share the polarity signal
  • Works for both sequential and random bus streams
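
The decision rule above can be sketched in a few lines. This is a minimal illustration, not the circuit from the slides: it assumes the bus and polarity line start at zero, and it assumes the previous polarity bit is counted in the Hamming distance (one common formulation of bus-invert coding). The function names are hypothetical.

```python
def bus_invert_encode(words, n_bits):
    """Encode a word stream as (bus_value, polarity) pairs."""
    mask = (1 << n_bits) - 1
    prev_bus, prev_pol = 0, 0  # assumed initial bus state
    encoded = []
    for word in words:
        # Toggles needed to send the true value with polarity 0.
        hamming = bin((word ^ prev_bus) & mask).count("1") + prev_pol
        if hamming > n_bits / 2:
            bus, pol = ~word & mask, 1  # send the complement instead
        else:
            bus, pol = word & mask, 0
        encoded.append((bus, pol))
        prev_bus, prev_pol = bus, pol
    return encoded


def bus_invert_decode(encoded, n_bits):
    """Recover the original words from (bus_value, polarity) pairs."""
    mask = (1 << n_bits) - 1
    return [(~bus & mask) if pol else bus for bus, pol in encoded]


# Sending 0000 then 1110 on a 4-bit bus: the second word goes out inverted.
stream = [0b0000, 0b1110]
encoded = bus_invert_encode(stream, n_bits=4)
decoded = bus_invert_decode(encoded, n_bits=4)
print(encoded, decoded)
```

In this run the bus goes from 0000 to 0001 with the polarity line rising, so two lines toggle instead of three.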

56
Bus Invert Coding Logic
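[Figure: source data passes through an invert/pass stage and a bus register onto the data bus; polarity decision logic uses the Hamming distance to drive the polarity signal; the receiver applies invert/pass to recover the data]
Example: source data 0000 → 1110, data bus 0000 → 0001, polarity signal 0 → 1, received data 0000 → 1110
Under uniform random signal conditions (non-correlated data), 25% upper bound on toggle reduction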
57
Efficiency of Bus Invert Encoding
  • Have overhead in area, power and delay of
    additional logic to encode/decode
  • Maximum number of toggling bits reduced from n
    to n/2
  • Under uniform random signal conditions
    (non-correlated data sequence), the toggle
    reduction has an upper bound of 25%

58
Efficiency of Bus Encoding
$E[P] = \frac{n}{2}$
$E[Q] = \sum_{k=0}^{n/2} k \, Q_k$ where $Q_k = \frac{1}{2^n} \binom{n+1}{k}$
From Stan, 1995
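
As a quick numerical check, the sketch below evaluates the expected number of toggles per cycle for an unencoded bus, n/2, against a bus-invert bus, taken here to be the sum over k from 0 to n/2 of k times C(n+1, k) / 2^n for even n. That reading of the slide's formula is an assumption, and the function names are hypothetical.

```python
from math import comb


def expected_toggles_unencoded(n):
    """Expected toggles per cycle on an n-bit bus with random data."""
    return n / 2


def expected_toggles_bus_invert(n):
    """Expected toggles per cycle with bus-invert coding (n even)."""
    return sum(k * comb(n + 1, k) for k in range(n // 2 + 1)) / 2 ** n


def toggle_reduction(n):
    """Relative reduction in expected toggles from bus-invert coding."""
    return 1 - expected_toggles_bus_invert(n) / expected_toggles_unencoded(n)


for width in (2, 8, 16, 32):
    print(f"n={width:2d}: reduction = {toggle_reduction(width):.1%}")
```

The reduction is largest for the narrowest bus (25% at n = 2) and shrinks as the bus gets wider, which is one reason wide buses are often split into smaller groups, each with its own polarity line.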
59
Bus Encoding Extensions
  • For sequential data (e.g., generated on address
    buses)
  • Gray code encoding (except for overhead)
  • T0 code by Benini
  • add address incrementer circuitry to receiver
  • add INC line to address bus
  • for consecutive addresses, just assert the INC
    line without sending the second address
  • reduces address bus transitions by 36% over
    binary
  • outperforms Gray code when probability of
    consecutive addresses is > 0.5
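
For a purely sequential address stream, the three encodings can be compared by counting line toggles. The sketch below is illustrative only: it assumes a stride of 1, counts the INC line as one extra bus line, and uses hypothetical function names.

```python
def count_toggles(values):
    """Total number of bit flips between successive bus values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


def to_gray(value):
    """Standard binary-reflected Gray code."""
    return value ^ (value >> 1)


def t0_toggles(addresses):
    """Toggles on the address lines plus the INC line under a T0-style scheme."""
    bus, inc, total = addresses[0], 0, 0
    for prev, cur in zip(addresses, addresses[1:]):
        if cur == prev + 1:
            new_bus, new_inc = bus, 1  # freeze address lines, assert INC
        else:
            new_bus, new_inc = cur, 0  # send the address itself
        total += bin(bus ^ new_bus).count("1") + (inc ^ new_inc)
        bus, inc = new_bus, new_inc
    return total


addresses = list(range(16))  # fully sequential stream
binary_toggles = count_toggles(addresses)
gray_toggles = count_toggles([to_gray(a) for a in addresses])
t0_count = t0_toggles(addresses)
print(binary_toggles, gray_toggles, t0_count)
```

On this stream binary toggles 26 lines, Gray code 15 (one per address), and the T0-style scheme only once, when INC is first asserted.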

60
Data Bus Switching Activity
Average switching activity
4 bit grouping of 32 bit bus
From Sacha, 1999
61
Low Swing Buses
  • Minimize voltage swing ($V^2$) using differential
    signaling
  • bus contains multiple bits -> relatively low
    overhead
  • all signals on the bus operate in sync ->
    creative circuit techniques for differential
    circuits
  • Two basic approaches
  • Additional reference voltage lines
  • driver circuit responsible for generating Vref
  • SA bus receiver circuit required
  • Charge recycling

62
Additional Reference Lines
  • Introduce an additional reference voltage line
    between the sender and receiver

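[Figure: a driver circuit sends data over the bus (Vbus, Cbus) to a receiver circuit, with a Vref line between them; waveforms compare a conventional bus swinging between logic 0 and logic 1 with a low swing bus where Vbus moves by ΔV ≈ 0.1Vdd relative to Vref]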
63
Bus Driver Circuit
[Figure: bus driver circuit; labels: Source data, Vbus, Vref]
$C_{ref} \gg C_n, C_{bus}$
$C_n V_{ref} = C_{bus} (V_{dd} - V_{ref})$
64
Power Efficiency
  • Depends on the extent of voltage swing reduction
    (depends on required noise immunity and
    sensitivity of sensing circuit)
  • 0.1Vdd reduced swing -> 99% savings (power scales with the square of the swing, and 0.1^2 = 0.01)
  • Also must consider
  • additional power of driver and receiver circuits
  • additional timing delays of circuits (but reduced
    swing improves signal switching time)
  • reduced swing → smaller transistors at driver →
    reduced short circuit currents

65
Limitations
  • Susceptible to noise and cross-talk
  • Producing large on-chip capacitance Cref
    difficult
  • Sensing circuit difficult to design for very low
    operating voltages
  • Ratio of Cbus to Cn may be difficult to control
    (sensitive to process variations)
  • Driver circuit inherently dynamic so cannot stay
    dormant for long periods (what if data signal
    contains long series of identical values?)
  • Takes time for Vref to recover if bus deactivated

66
Charge Recycling Bus
  • High order bit discharges to lower bit recycling
    charge (need 2 wires per bit)

[Figure: stacked bit line pairs CD1/CD1- and CD2/CD2- with switches S1 and S2; example data bits 0, 1, 0]
67
Power Efficiency
  • Depends on the number of bits stacked
  • For n bits, voltage swing of each line is
  • $\Delta V = V_{dd}/(2n)$
  • So power dissipation of recycling bus is
  • $P_{CRB} = 2n \, C \, (V_{dd}/(2n))^2 \, (2f) = P_{conv}/n^2$
  • However, due to precharge, we don't gain from data
    correlation, so efficiency is reduced to
  • $P_{CRB} = 2 P_{conv}/n^2$
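
The scaling with the number of stacked bits can be checked numerically. The sketch below takes the conventional bus power to be n C Vdd^2 f and the recycling bus power to be 2n C (Vdd/(2n))^2 (2f), optionally doubled for the precharge penalty; the capacitance, voltage, and frequency values are hypothetical.

```python
def p_conventional(n, c, vdd, f):
    """Power of a conventional full-swing n-bit bus."""
    return n * c * vdd ** 2 * f


def p_charge_recycling(n, c, vdd, f, precharge_penalty=True):
    """Power of an n-bit charge recycling bus (2 wires per bit)."""
    swing = vdd / (2 * n)
    power = 2 * n * c * swing ** 2 * (2 * f)
    # Precharging removes the benefit of data correlation: factor of 2.
    return 2 * power if precharge_penalty else power


# Hypothetical bus: 1 pF per line, 2 V supply, 100 MHz.
c, vdd, f = 1e-12, 2.0, 100e6
ratios = {
    n: p_charge_recycling(n, c, vdd, f) / p_conventional(n, c, vdd, f)
    for n in (2, 4, 8)
}
for n, ratio in ratios.items():
    print(f"n={n}: P_CRB / P_conv = {ratio:.4f}  (2/n^2 = {2 / n**2:.4f})")
```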

68
Limitations
  • Larger values of n improve power efficiency but
    decrease noise immunity
  • Must maintain all line capacitances at an equal
    value (may limit scheme to on-chip buses; have
    to be careful in layout to balance capacitances)
  • Requires precharge phase → reduces data transfer
    rate

69
Comparisons
Vdd = 2V, CL(bus) = 1pF, 0.6 µm
From Zhang, 1998
70
Charge Recovery Bus
  • Recover charge from falling bit lines to
    precharge rising bit lines





[Figure: charge recovery bus with transmit control, receive control, and short control]
71
Energy Savings
  • The amount of energy savings depends on the
    number of lines shorted, the control circuitry,
    and the data length and pattern
  • For a single transfer charge recovery
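  • $E = R \, C \, V_{dd} \, \Delta V$
  • where R is the number of rising bit lines and
    $\Delta V$ is the voltage change after charge transfer
  • $E = R C V_{dd} \left( V_{dd} - V_{dd} \frac{F}{R+F} \right) = C V_{dd}^2 \frac{R^2}{R+F}$
    (F is taken here to be the number of falling bit lines)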

72
Reported Savings
  • For random data, 32-bit bus
  • single transfer energy savings of 47%
  • maximum optimal energy savings of 72%

[Figure: average energy savings versus width of databus]
From Khoo, 1995
73
Single Transfer Charge Recovery Bus

[Figure: single transfer charge recovery bus with bit lines CD0-CD3, transmit control, and receive control]
Participates in charge sharing if data bit is
different from last data bit transmitted
74
Data Patterns Affect Savings
Trace A: 0001 -> 1110 -> 0001
Trace B: 0011 -> 1100 -> 0011

         Trace A                          Trace B
Step 3   3 x 1 x 2.5 x (2.5 - 0.625)      2 x 1 x 2.5 x (2.5 - 1.25)
Step 5   1 x 1 x 2.5 x (2.5 - 1.875)      2 x 1 x 2.5 x (2.5 - 1.25)
Total    15.625                           12.50
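
The totals for the two traces can be reproduced with a short calculation. The sketch assumes unit line capacitance (C = 1), Vdd = 2.5, and that only bits that change take part in charge sharing, so the shared voltage is Vdd F/(R+F) for R rising and F falling lines; the function names are hypothetical.

```python
def transfer_energy(prev_word, next_word, n_bits, c=1.0, vdd=2.5):
    """Supply energy for one transfer: E = R * C * Vdd * (Vdd - V_shared)."""
    rising = falling = 0
    for i in range(n_bits):
        before = (prev_word >> i) & 1
        after = (next_word >> i) & 1
        if after > before:
            rising += 1
        elif after < before:
            falling += 1
    if rising == 0:
        return 0.0  # nothing needs to be charged from the supply
    v_shared = vdd * falling / (rising + falling)
    return rising * c * vdd * (vdd - v_shared)


def trace_energy(words, n_bits):
    """Total supply energy over a sequence of transfers."""
    return sum(transfer_energy(a, b, n_bits) for a, b in zip(words, words[1:]))


trace_a = trace_energy([0b0001, 0b1110, 0b0001], n_bits=4)
trace_b = trace_energy([0b0011, 0b1100, 0b0011], n_bits=4)
print(trace_a, trace_b)
```

Both traces flip all four bits on every transfer, yet the balanced pattern of Trace B (two rising, two falling) recovers more charge than the lopsided pattern of Trace A.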
75
Impacts of Signal Encoding on Charge Recovery
Relative energy consumption
Average of 15 Mediabench benchmarks
From Bishop, 1999
76
Bus Multiplexing
  • Share long data buses with time multiplexing (S1
    uses even cycles, S2 odd)
  • But what if data samples are correlated (e.g.,
    sign bits)?

77
Correlated Data Streams
Bit switching probabilities
Bit position
MSB
LSB
78
Disadvantage of Bus Multiplexing
  • If data bus is shared, advantages of data
    correlation are lost (bus carries samples from
    two uncorrelated data streams)
  • Bus sharing should not be used for positively
    correlated data streams
  • Bus sharing may prove advantageous in a
    negatively correlated data stream (where
    successive samples switch sign bits) - more
    random switching

79
Segmented Buses
  • Partition the bus into several segments, which reduces
    the capacitance per segment

[Figure: two bus segments joined by a TIE under TIE control; labels: Ain, Bin, Wout, Xout]
  • Try to group often communicating circuits on the
    same segment

80
TIE Design
  • To connect the segments
  • Delay/power models for t-gate solution show a
    60-70% reduction in power and a 10-30%
    improvement in bus delay

81
Code Compression
  • Assuming only a subset of instrs used, replace
    them with a shorter encoding to reduce memory
    bandwidth

[Figure: the core sends addresses to memory and receives logN-bit compressed instructions; the instruction decompression table (IDT) restores the original k-bit instruction format]
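
A table-based scheme of this kind can be sketched as follows. This is a minimal illustration under stated assumptions: every distinct instruction word in the program gets an entry in the decompression table, each fetch transfers a fixed-width index of ceil(log2 N) bits, and the instruction words in the example are made up.

```python
from math import ceil, log2


def build_idt(program):
    """Return (table, codes): unique instructions and per-fetch indices."""
    table = list(dict.fromkeys(program))  # unique words, first-seen order
    index = {word: i for i, word in enumerate(table)}
    return table, [index[word] for word in program]


def fetch_bits(program, k_bits):
    """Bits fetched from memory: (uncompressed, compressed)."""
    table, codes = build_idt(program)
    code_bits = max(1, ceil(log2(len(table))))
    return len(program) * k_bits, len(codes) * code_bits


# Hypothetical program of eight 32-bit words using four distinct instructions.
program = [0x8C410000, 0x00221820, 0xAC430004, 0x00221820,
           0x8C410000, 0x1000FFFB, 0x00221820, 0xAC430004]
table, codes = build_idt(program)
restored = [table[code] for code in codes]  # what the IDT hands to the core
original_bits, compressed_bits = fetch_bits(program, k_bits=32)
print(original_bits, compressed_bits)
```

The saving in memory traffic has to be weighed against the cost of the table lookup on every fetch.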
82
Instruction Loop Buffer
  • Temporarily store decoded instrs from small
    loops in a buffer (DIB)

[Figure: Fetch, Decode, Execute, Memory, WriteBack pipeline (labels: PC, Instruction, MAR, MDR, I, D) with a DIB that lets instruction fetch and decode be skipped]
DIB stores decoded instrs for a whole loop
Can achieve a 40% power savings in the MPU core
83
Key References, Bus Power
  • Bajwa, Stage-skip pipeline, Proc. of ISLPED, pp.
    353-358, 1996.
  • Bellaouar, An ultra-low power CMOS on-chip
    interconnect architecture, Proc. of SLPE, pp.
    52-53, 1995.
  • Benini, Address bus encoding techniques for
    system-level power optimization, Proc. of DATE,
    pp. 861-866, 1998.
  • Bishop, Databus charge recovery: Practical
    considerations, Proc. SLPED, 1999.
  • Chen, Segmented bus design for low power systems,
    IEEE Trans. on VLSI Systems, 7(1):25-29, Mar
    1999.
  • Hiraki, Data dependent logic swing internal bus
    architecture for ultralow power LSI, IEEE Journal
    of SSC, 30(4):397-402, Apr 1995.
  • Khoo, Charge recovery on a databus, Proc. SLPED,
    185-189, Aug 1995.
  • Stan, Bus-invert coding for low power I/O, IEEE
    Trans. on VLSI Systems, 3(1):49-58, 1995.
  • Yamauchi, An asymptotically zero power charge
    recycling bus architecture, IEEE Journal of SSC,
    30(4):423-431, Apr 1995.
  • Yoshida, An object code compression approach to
    embedded processors, Proc. of ISLPED, pp.
    265-268, 1997.
  • Zhang, Low-swing interconnect interface circuits,
    Proc. SLPED, 161-166, Aug 1998.