Variable Word Width Computation for Low Power

1
Variable Word Width Computation for Low Power
  • By
  • Bret Victor
  • Sayf Alalusi

2
Motivation
  • 32 bit architecture required for most general
    purpose computing
  • However, many applications don't need a full 32
    bit data word:
  • Video: 24 bit
  • Audio: 16 bit
  • Text: 8 bit
  • Logic: 1 bit
  • How can we exploit this to save power?

3
Possibilities
  • Architecture that supports 32, 24, 16, 8, and 1
    bit operations? Or some subset?
  • Switch processor between modes, or specify width
    for each instruction? Global or distributed
    control?
  • Gated clocks? Don't drive unused outputs? Power
    down unused blocks?

4
Implementation
  • Based on MIPS architecture and ISA
  • Two widths: 16 bit and 32 bit
  • Width chosen on an instruction-by-instruction basis.
  • Flag bit in instruction word selects width
  • Modified ISA:
  • arithmetic: add16, add32, mul16, mul32
  • logical: and16, and32
  • memory: lw16, lw32, sw16, sw32
  • branch compare: beq16, beq32
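
As a rough illustration of the per-instruction width flag, the Python sketch below models the add16/add32 and and16/and32 pairs. The slides give neither the encoding nor the exact 16-bit semantics, so the function name `execute`, the `wide` argument standing in for the flag bit, and the choice to leave the destination's high word untouched on a 16-bit operation are all assumptions made for illustration.

```python
MASK16 = 0x0000FFFF
HIGH16 = 0xFFFF0000
MASK32 = 0xFFFFFFFF


def execute(op, wide, dest_old, a, b):
    """Model one ALU instruction (hypothetical helper, not from the slides).

    op       -- "add" or "and"
    wide     -- True for the 32-bit variant, False for the 16-bit variant
                (stands in for the flag bit in the instruction word)
    dest_old -- previous value of the destination register
    a, b     -- 32-bit source operand values
    """
    if op == "add":
        full = a + b
    elif op == "and":
        full = a & b
    else:
        raise ValueError(f"unsupported op: {op}")
    if wide:
        return full & MASK32
    # Assumed 16-bit semantics: compute only the low word and keep the
    # destination's old high word, so those nodes never transition.
    return (dest_old & HIGH16) | (full & MASK16)
```

For example, `execute("add", False, 0xABCD0000, 0xFFFF, 1)` wraps the low word to zero and leaves the high word `0xABCD` in place.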

5
Energy
  • Energy consumption occurs when a node
    transitions, and is proportional to the
    capacitance at that node.
  • Prevent nodes from transitioning unnecessarily.
  • Energy savings can be calculated by adding all
    the capacitance that is switching.
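
The accounting described above can be sketched in a few lines of Python: count the bits that toggle between successive values on a bus and weight each toggle by a capacitance. The function names and the use of one equal, arbitrary-unit capacitance per bit are assumptions for illustration; a real estimate would use per-node capacitances.

```python
def toggled_bits(prev, curr, width=32):
    """Number of bit positions that differ between two bus values."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def switching_energy(values, cap_per_bit=1.0, width=32):
    """Relative energy for a sequence of values driven onto one bus.

    Energy is taken as proportional to the capacitance that switches.
    cap_per_bit is an arbitrary unit (assumed equal for every bit).
    """
    return sum(cap_per_bit * toggled_bits(p, c, width)
               for p, c in zip(values, values[1:]))
```

If the high word of a bus holds the same value from one cycle to the next, it contributes nothing to the sum, which is the effect the design tries to achieve.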

6
Where We Save Energy
  • Our design saves energy over a traditional
    processor in three main areas
  • Clock and control line energy
  • HWTE (High Word Transition Energy)
  • Memory control energy
  • We will see these three areas as we step through
    the pipeline.

7
Pipeline Overview
[Figure: five-stage pipeline datapath with IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. Labeled signals: PC (32), PC + 4 (32), branch address (32) into a MUX, immed (16), branch offset, reg A and reg B into the ALU through MUXes with fwd from MEM and fwd from WB inputs, ALU result (32), data for SW (32), dest reg (5).]

8
IF Stage
[Figure: IF stage datapath ending at the IF/ID pipeline register. Labeled signals: PC (32), PC + 4 (32), branch address (32) into a MUX.]
  • Instruction words and addresses must be 32 bits.
  • Can't modify much.

9
ID Stage
[Figure: ID stage datapath between the IF/ID and ID/EX pipeline registers. Labeled signals: branch address (32), PC + 4 (32), immed (16), branch offset, dest reg (5).]
  • We can:
  • gate the clocks of the pipeline register
  • only drive the high words out of the register file
    on a 32 bit operation

10
Pipeline Register (ID)
[Figure: ID-stage pipeline register split into reg A high 16, reg A low 16, reg B high 16, reg B low 16, destReg 5, and immed 16, clocked from Clock through UngatedClock, WidthGatedClock (gated by Width), and ImmedGatedClock (from instruction word).]
  • Fit gating into clock distribution network.
  • Little energy overhead and helps control skew.
  • On the ID stage, gating reduces clock energy by:
  • 56% on 16-bit operations
  • 19% on 32-bit non-immediate operations
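
One way to arrive at figures like these is to assume that clock energy is proportional to the number of register bits that receive a clock edge. The field widths below are read off the pipeline register diagram (reg A and reg B as two 16-bit halves each, a 5-bit destReg, a 16-bit immediate). The proportionality, the helper name `clock_saving`, and the reading that the 16-bit figure refers to non-immediate operations are assumptions, not statements from the slides.

```python
# Bits held in the ID-stage pipeline register, per the diagram labels.
FIELDS = {
    "regA_hi": 16, "regA_lo": 16,
    "regB_hi": 16, "regB_lo": 16,
    "destReg": 5,
    "immed": 16,
}


def clock_saving(wide, uses_immediate):
    """Fraction of clocked bits that are gated off for one instruction.

    Assumes clock energy scales with the number of bits clocked.
    """
    gated = set()
    if not wide:
        gated |= {"regA_hi", "regB_hi"}   # high words unused at 16 bits
    if not uses_immediate:
        gated.add("immed")                # immediate field unused
    total = sum(FIELDS.values())
    return sum(FIELDS[name] for name in gated) / total
```

Under these assumptions a 16-bit non-immediate operation gates 48 of 85 bits and a 32-bit non-immediate operation gates 16 of 85, which round to the 56% and 19% quoted above.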

11
Register File Read Port (ID)
  • Decoder selects register to drive output bus.
  • We add one AND gate per register.
  • Switching capacitance dominated by output bus.
  • A 16 bit operation takes 50% less energy than a 32
    bit operation...
  • Not necessarily savings!

[Figure: register file read port. Labels: DECODER, Width, N, Reg 0 high 16, Reg 0 low 16, Reg 1 high 16, Reg 1 low 16, each half driving a 16-bit output bus.]

12
EX Stage
[Figure: EX stage datapath. reg A and reg B (or immed) reach the ALU through MUXes with fwd from MEM and fwd from WB inputs. Other labeled signals: data for SW (32), dest reg (5).]
  • Modify the ALU to perform 16 bit operations.
  • Prevent the high word output of the MUXes from
    changing on 16 bit operations.
  • Gate the clock of the pipeline register
  • Only latch high word of ALU result on 32 bit
    operations
  • Only latch reg B on store word operations

13
Logical Instructions (EX)
[Figure: bitwise logic array, e.g. X AND Y, with one gate per bit pair from (X0, Y0) through (X31, Y31).]
  • Just don't let the unused bits (high 16)
    transition
  • If they don't transition, they will not drive the
    next stage either.
  • 50% less energy

14
Adder (EX)
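[Figure: 32-bit carry-lookahead adder. Inputs A0 B0 ... An Bn and sums S0 ... Sn pass through eight 4-bit blocks (bits 0..3, 4..7, 8..11, 12..15, 16..19, 20..23, 24..27, 28..31) under an Upper Level CLA Generation block.]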
  • The 4-bit CLA blocks just get replicated for the
    number of bits, but the upper level CLA structure
    will grow with the number of bits.
  • 16 bits: 58% less energy

15
Multiplier (EX)
32 bit: 32 x 32-bit adds, 32 x 32-bit reg. writes, 32 shifts, in 32 cycles
vs.
16 bit: 16 x 16-bit adds, 16 x 16-bit reg. writes, 16 shifts, in 16 cycles
  • Multiply complexity grows as N^2, so a 16 bit
    multiply takes 77% less energy.
  • Even if the upper 16 bits are 0, a 32 bit multiply
    does 16 extra shifts.
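
The N^2 growth can be checked with a crude count of bit-level work in a shift-and-add multiplier. The sketch below weights every add bit, register-write bit, and shift equally, which is an assumption made for illustration and not the authors' capacitance model.

```python
def shift_add_bit_ops(n):
    """Rough bit-level work for an n-bit shift-and-add multiply:
    n adds and n register writes, each n bits wide, plus n shifts."""
    return n * n + n * n + n


ratio = shift_add_bit_ops(16) / shift_add_bit_ops(32)
saving = 1 - ratio   # fraction of work avoided by the 16-bit multiply
```

This equal-weight count gives a saving of roughly three quarters, close to but not exactly the figure on the slide, which presumably reflects the capacitance estimates behind the presentation.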

16
HWTE
  • Two types of data in a 16 bit application:
  • Computational data (16-bit): high word = 0
  • Pointers and addresses (32-bit): high word = C
  • Assume C is mostly constant (memory accesses
    mostly in one 64K block)
  • A traditional processor only consumes more datapath
    energy than our processor when transitioning
    between these data types.
  • HWTE = High Word Transition Energy

17
HWTE
  • With such a model, our processor effectively only
    executes 16 bit operations.
  • A traditional processor executes 32 bit
    operations only when transitioning between data
    types.
  • E32 = energy of a 32 bit operation
  • E16 = energy of a 16 bit operation
  • N = average number of consecutive instructions
    that use the same data type
  • HWTE = (E32 - E16) / N
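
A short Python sketch makes the formula concrete. It follows one reading of the model above: on a traditional processor, one instruction in every N pays E32 and the rest effectively pay E16, so the average cost per instruction exceeds E16 by exactly HWTE. The function names are hypothetical and the energy values in the usage note are arbitrary units.

```python
def hwte(e32, e16, n):
    """High Word Transition Energy: extra energy per instruction that a
    traditional processor spends relative to the variable-width design."""
    return (e32 - e16) / n


def traditional_avg_energy(e32, e16, n):
    """Average energy per instruction when one instruction in every n
    crosses between data types (paying e32) and the others pay e16."""
    return ((n - 1) * e16 + e32) / n
```

With `e32 = 2.0`, `e16 = 1.0`, and `n = 4`, `hwte` is 0.25 and `traditional_avg_energy` is 1.25, i.e. `e16 + hwte`. Longer runs of the same data type (larger N) shrink the advantage.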

18
Barrel Shifter (EX)
[Figure: 4-bit barrel shifter slice with inputs A0..A3, outputs B0..B3, and shift control lines SH0..SH3.]
  • The big win will come from not driving the control
    lines to the upper 16 bits.
  • Saves about 50% in energy

19
MEM Stage
[Figure: MEM stage datapath. Labeled signals: ALU result (32), dest reg (5), and 32-bit data buses.]
  • This is a big, regular memory (SRAM) structure
    that can easily be segmented into blocks.
  • Exploit this fact

20
DCache (MEM)
[Figure: cache array with Width and Block select inputs. Caption: Only drive the word line that you need!]
  • 2-way set associative, write-back
  • Blocks are 2 x 32b or 4 x 16b, i.e. 16b data
    values are aligned on 16b boundaries and 32b
    values on 32b boundaries.

21
DCache (MEM)
  • Only drive the word lines that are needed.
  • Need a little bit of logic to figure out what the
    correct lines are, but large capacitance of WL
    dominates.
  • Block size is larger for 16 bit values, better
    exploits spatial locality
  • Associativity does not change from 16 bit to 32
    bit word lengths
  • Energy savings: 50%
  • Control line savings, no HWTE!
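
The "little bit of logic" that picks word lines might look like the sketch below. It treats each 64-bit block as four 16-bit segments and returns the segments whose word lines would be driven for an aligned access. The slides do not describe the actual decode logic, so the segment numbering, the byte addressing, and the function name are assumptions for illustration.

```python
BLOCK_BYTES = 8      # 2 x 32b or 4 x 16b per block
SEGMENT_BYTES = 2    # one 16-bit segment


def segments_to_drive(addr, wide):
    """Return the 16-bit segment indices (0..3) within a block whose
    word lines are driven for an access at byte address addr.

    wide -- True for a 32-bit access, False for a 16-bit access
    """
    size = 4 if wide else 2
    if addr % size:
        raise ValueError("misaligned access")
    first = (addr % BLOCK_BYTES) // SEGMENT_BYTES
    return [first, first + 1] if wide else [first]
```

A 16-bit load at address `0x1006` drives only segment 3, while a 32-bit load at `0x1004` drives segments 2 and 3, which is where the halving of word line energy comes from.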

22
WB Stage
[Figure: WB stage after the MEM/WB pipeline register. Labeled signals: Dest reg (5), Mem data (32), ALU result (32).]
  • On a 16 bit operation, we can:
  • Only drive the low word out of the MUX
  • Capacitive load on register write port is large
  • Driving 16 bits out of the MUX consumes 50% less
    energy than driving 32 bits. The HWTE formula
    applies.
  • Only latch the low word into the register?

23
Reg. File Write Port (WB)
  • We can add one AND gate for each register.
  • But 16 bit write uses same amount of clock energy
    as 32 bit write without modifications.
  • Little savings from not writing into the
    register, because the high word would not change
    in a 16 bit application.
  • Not worth it!

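[Figure: register file write port. Labels: DECODER, Width, Write, HiWrite. HiWrite goes to the clock input (C) of each register's high 16 bits (Reg 0, Reg 1) and Write to the clock input of the low 16 bits; each half has a 16-bit data input (D).]
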
24
Summary
  • Typical power distribution in core (non-memory),
    as share of core power x energy remaining:
  • ALU: 34% x 66%
  • I-decode: 23% x 100%
  • Register file: 13% x 66%
  • Clock: 10% x 50%
  • Shifter: 11% x 50%
  • Pipeline: 9% x 74%
  • Core energy reduced by 29%.

25
Summary
  • Typical power distribution in memory:
  • Instruction cache: 60% x 100%
  • Data cache: 40% x 50%
  • Cache energy reduced by 20%.
  • Total processor power consumption:
  • Cache: 66% x 80%
  • Core: 33% x 71%
  • Total energy reduced by 24% when executing a 16
    bit application.
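
The summary figures can be reproduced as weighted sums, reading each row on the two Summary slides as (share of power) x (fraction of energy remaining). That reading is an interpretation of the slides, and the helper name `remaining` is hypothetical. The shares are used exactly as given, including the 66% and 33% split between cache and core.

```python
def remaining(parts):
    """Weighted fraction of energy remaining.

    parts is a list of (share_of_power, fraction_remaining) pairs.
    """
    return sum(share * frac for share, frac in parts)


core = remaining([
    (0.34, 0.66),   # ALU
    (0.23, 1.00),   # I-decode
    (0.13, 0.66),   # Register file
    (0.10, 0.50),   # Clock
    (0.11, 0.50),   # Shifter
    (0.09, 0.74),   # Pipeline
])
cache = remaining([
    (0.60, 1.00),   # Instruction cache
    (0.40, 0.50),   # Data cache
])
total = remaining([(0.66, cache), (0.33, core)])

core_reduction = round(100 * (1 - core))
cache_reduction = round(100 * (1 - cache))
total_reduction = round(100 * (1 - total))
```

Rounded to whole percentages, the three reductions match the 29%, 20%, and 24% stated on the slides.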

26
Conclusions
  • Primary drawback is modification of ISA.
  • Energy savings are reasonable.
  • Our modifications are fairly easy to implement,
    and can be fit into existing processor designs
    with minimal area increase.

27
Where do we go from here?
  • More accurate capacitance models and SPICE
    simulation
  • More accurate models of instruction mix