Variable Word Width Computation for Low Power

Slides: 28

Provided by: sayf

1
Variable Word Width Computation for Low Power

By
Bret Victor
Sayf Alalusi

2
Motivation

A 32-bit architecture is required for most general-purpose computing.
However, many applications don't need a full 32-bit data word:
Video: 24 bit
Audio: 16 bit
Text: 8 bit
Logic: 1 bit
How can we exploit this to save power?

3
Possibilities

Architecture that supports 32-, 24-, 16-, 8-, and 1-bit operations? Or some subset?
Switch the processor between modes, or specify the width for each instruction? Global or distributed control?
Gated clocks? Don't drive unused outputs? Power down unused blocks?

4
Implementation

Based on the MIPS architecture and ISA
Two widths: 16 bit and 32 bit
Width chosen on an instruction-by-instruction basis
Flag bit in the instruction word selects the width
Modified ISA:
arithmetic: add16, add32, mul16, mul32
logical: and16, and32
memory: lw16, lw32, sw16, sw32
branch compare: beq16, beq32

5
Energy

Energy consumption occurs when a node
transitions, and is proportional to the
capacitance at that node.
Prevent nodes from transitioning unnecessarily.
Energy savings can be calculated by adding all
the capacitance that is switching.
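
The accounting described above can be sketched for a single 32-bit bus. This assumes every bit line has the same capacitance and reports energy in arbitrary units, one unit per toggled bit; both are simplifications, and the function names are made up for this example.

```python
def toggled_bits(prev: int, curr: int, width: int = 32) -> int:
    """Count the bit lines that change state between two bus values."""
    mask = (1 << width) - 1
    return bin((prev ^ curr) & mask).count("1")


def switching_energy(prev: int, curr: int, cap_per_bit: float = 1.0) -> float:
    """Energy in arbitrary units: capacitance summed over the toggled lines."""
    return cap_per_bit * toggled_bits(prev, curr)


def hold_high_word(prev: int, curr: int) -> int:
    """Value seen on the bus if the high 16 bits are prevented from changing."""
    return (prev & 0xFFFF0000) | (curr & 0x0000FFFF)


prev, curr = 0x00001234, 0xFFFF00FF
full = switching_energy(prev, curr)                         # all 32 lines free to toggle
gated = switching_energy(prev, hold_high_word(prev, curr))  # high word held
```

In this example 23 lines toggle on the unrestricted bus and 7 toggle when the high word is held, so the saving depends entirely on how often the high word would otherwise have changed.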

6
Where We Save Energy

Our design saves energy over a traditional processor in three main areas:
Clock and control line energy
HWTE (High Word Transition Energy)
Memory control energy
We will see these three areas as we step through
the pipeline.

7
Pipeline Overview
[Figure: five-stage pipeline datapath. Labels: branch address (32), PC + 4 (32), immed (16), branch offset, dest reg (5), IF/ID, ID/EX, EX/MEM and MEM/WB registers, reg A and reg B operand MUXes with forwarding from MEM and WB, ALU, ALU result (32), data for SW (32).]

8
IF Stage
[Figure: IF stage datapath. Labels: branch address (32), MUX, PC + 4 (32), PC, I, IF/ID register.]

Instruction words and addresses must be 32 bits.
Can't modify much.

9
ID Stage
[Figure: ID stage datapath. Labels: branch address (32), PC + 4 (32), immed (16), branch offset, dest reg (5), 32-bit register outputs, IF/ID and ID/EX registers.]

We can:
gate the clocks of the pipeline register
only drive the high words out of the register file on a 32-bit operation

10
Pipeline Register (ID)
[Figure: gated-clock pipeline register for the ID stage. Labels: Clock, Width, UngatedClock, WidthGatedClock, ImmedGatedClock (from instruction word); register fields reg A high 16, reg A low 16, reg B high 16, reg B low 16, destReg 5, immed 16.]
Fit gating into the clock distribution network.
Little energy overhead, and it helps control skew.
In the ID stage, gating reduces clock energy by:
56% on 16-bit operations
19% on 32-bit non-immediate operations
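
The slides do not say how these two percentages were derived. They are consistent with a simple model in which clock energy is proportional to the number of register bits clocked, using the field widths labelled on this slide. The assignment of fields to gated clocks below is an assumption read off the figure labels, and the 56% figure matches a 16-bit operation that also has no immediate.

```python
# Field widths are from the slide; the gating assignment is an assumption.
FIELDS = {
    "reg_a_high": (16, "width"),   # assumed on WidthGatedClock
    "reg_a_low":  (16, None),      # assumed on UngatedClock
    "reg_b_high": (16, "width"),
    "reg_b_low":  (16, None),
    "dest_reg":   (5,  None),
    "immed":      (16, "immed"),   # assumed on ImmedGatedClock
}


def clock_saving(is_32bit: bool, uses_immediate: bool) -> float:
    """Fraction of register bits whose clock is gated off for one instruction."""
    total = sum(bits for bits, _ in FIELDS.values())
    clocked = 0
    for bits, gate in FIELDS.values():
        if gate == "width" and not is_32bit:
            continue  # high words are not clocked on 16-bit operations
        if gate == "immed" and not uses_immediate:
            continue  # immediate field is not clocked when unused
        clocked += bits
    return 1 - clocked / total


saving_16 = clock_saving(is_32bit=False, uses_immediate=False)  # 48 of 85 bits gated
saving_32 = clock_saving(is_32bit=True, uses_immediate=False)   # 16 of 85 bits gated
```

Under these assumptions the two cases round to 56% and 19%, matching the slide.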

11
Register File Read Port (ID)

Decoder selects the register that drives the output bus.
We add one AND gate per register.
Switching capacitance is dominated by the output bus.
A 16-bit operation takes 50% less energy than a 32-bit operation...
Not necessarily savings!

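[Figure: register file read port. A decoder selects one register; each register is split into a high 16-bit half, qualified by Width, and a low 16-bit half (Reg 0 high 16, Reg 0 low 16, Reg 1 high 16, Reg 1 low 16, ...), each driving 16 bits of the output bus.]
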
12
EX Stage
[Figure: EX stage datapath. Labels: reg A and reg B operand MUXes with forwarding from MEM and WB (32 bits each), immed, ALU, data for SW (32), dest reg (5).]

Modify the ALU to perform 16-bit operations.
Prevent the high-word outputs of the MUXes from changing on 16-bit operations.
Gate the clock of the pipeline register:
Only latch the high word of the ALU result on 32-bit operations
Only latch reg B on store word operations

13
Logical Instructions (EX)
[Figure: bitwise logic unit, e.g. X AND Y, with one gate per bit pair: X0/Y0, X1/Y1, ..., X31/Y31.]

Just don't let the unused bits (high 16) transition.
If they don't transition, they will not drive the next stage either.
50% less energy

14
Adder (EX)
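[Figure: 32-bit carry-lookahead adder. Inputs A0..An and B0..Bn, sums S0..Sn; eight 4-bit CLA blocks covering bits 0..3, 4..7, 8..11, 12..15, 16..19, 20..23, 24..27 and 28..31, joined by an upper-level CLA generation block.]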

The 4-bit CLA blocks just get replicated for the number of bits, but the upper-level CLA structure grows with the number of bits.
16 bits: 58% less energy

15
Multiplier (EX)
32-bit multiply: 32 x 32-bit adds, 32 x 32-bit register writes, 32 shifts, in 32 cycles
vs.
16-bit multiply: 16 x 16-bit adds, 16 x 16-bit register writes, 16 shifts, in 16 cycles

Multiply complexity grows as N^2, so a 16-bit multiply takes 77% less energy.
Even if the upper 16 bits are 0, a 32-bit multiply does 16 extra shifts.

16
HWTE

Two types of data in a 16-bit application:
Computational data (16-bit): high word = 0
Pointers and addresses (32-bit): high word = C
Assume C is mostly constant (memory accesses mostly within a 64K block)
A traditional processor only consumes more datapath energy than our processor when transitioning between these data types.
HWTE = High Word Transition Energy

17
HWTE

With such a model, our processor effectively only executes 16-bit operations.
A traditional processor executes 32-bit operations only when transitioning between data types.
E32 = energy of a 32-bit operation
E16 = energy of a 16-bit operation
N = average number of consecutive instructions that use the same data type
HWTE = (E32 - E16) / N
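
As a worked instance of the formula, the sketch below evaluates HWTE for made-up inputs. The values E32 = 1.0, E16 = 0.5 and N = 10 are illustrative only and are not measurements from the slides.

```python
def hwte(e32: float, e16: float, n: float) -> float:
    """Average extra energy per instruction paid by a traditional processor.

    e32, e16: energy of a 32-bit and a 16-bit operation (same units).
    n: average run length of consecutive instructions using the same data type.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    return (e32 - e16) / n


# Illustrative numbers only: a 16-bit operation costing half a 32-bit one,
# with the data type changing once every 10 instructions on average.
example = hwte(e32=1.0, e16=0.5, n=10)
```

The longer the runs of same-type data (larger N), the smaller the advantage over a traditional processor from high-word transitions alone.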

18
Barrel Shifter (EX)
[Figure: 4-bit barrel shifter example with inputs A0..A3, outputs B0..B3 and shift control lines SH0..SH3.]

The big win will come from not driving the control lines to the upper 16 bits.
Saves about 50% in energy

19
MEM Stage
[Figure: MEM stage datapath. Labels: ALU result (32), 32-bit data paths, dest reg (5).]

This is a big, regular memory (SRAM) structure that can easily be segmented into blocks.
Exploit this fact.

20
DCache (MEM)
[Figure: data cache array with Width and Block select inputs.]
Only drive the word line that you need!

2-way set associative, write-back
Blocks are 2 x 32b or 4 x 16b, i.e. 16b data values are aligned on 16b boundaries and 32b values on 32b boundaries.

21
DCache (MEM)

Only drive the word lines that are needed.
Need a little bit of logic to figure out what the correct lines are, but the large capacitance of the word line (WL) dominates.
Block size is larger for 16-bit values, which better exploits spatial locality
Associativity does not change from 16-bit to 32-bit word lengths
Energy savings: 50%
Control line savings, no HWTE!
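
The "little bit of logic" could look roughly like the sketch below. It assumes an 8-byte block (2 x 32b = 4 x 16b) whose word line is split into four independently driven 16-bit segments; the slides do not describe the physical segmentation, so the segment numbering and function name are hypothetical.

```python
from typing import List

BLOCK_BYTES = 8  # 2 x 32b or 4 x 16b, as stated on the slide


def segments_to_drive(address: int, width: int) -> List[int]:
    """Return the 16-bit word-line segments (0..3) an access needs."""
    offset = address % BLOCK_BYTES
    if width == 16:
        if offset % 2:
            raise ValueError("16-bit access must be aligned on a 16-bit boundary")
        return [offset // 2]
    if width == 32:
        if offset % 4:
            raise ValueError("32-bit access must be aligned on a 32-bit boundary")
        first = offset // 2
        return [first, first + 1]
    raise ValueError("width must be 16 or 32")
```

For example, `segments_to_drive(0x1006, 16)` selects only segment 3, while `segments_to_drive(0x1004, 32)` selects segments 2 and 3. A 16-bit access drives half as many segments as a 32-bit one, which is in line with the 50% figure.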

22
WB Stage
[Figure: WB stage datapath. Labels: dest reg (5), mem data (32), ALU result (32), MUX, MEM/WB register.]

On a 16-bit operation, we can:
Only drive the low word out of the MUX
Capacitive load on the register write port is large
Driving 16 bits out of the MUX consumes 50% less energy than driving 32 bits. The HWTE formula applies.
Only latch the low word into the register?

23
Reg. File Write Port (WB)
We can add one AND gate for each register.
But a 16-bit write uses the same amount of clock energy as a 32-bit write without modifications.
Little savings from not writing into the register, because the high word would not change in a 16-bit application.
Not worth it!

[Figure: register file write port. The decoder produces a Write signal per register; HiWrite is derived from Write and Width. HiWrite clocks the high 16-bit half of each register and Write clocks the low 16-bit half (Reg 0, Reg 1, ...).]

24
Summary

Typical power distribution in core (non-memory), as share of core power x energy remaining on 16-bit operations:
ALU: 34% x 66%
I-decode: 23% x 100%
Register file: 13% x 66%
Clock: 10% x 50%
Shifter: 11% x 50%
Pipeline: 9% x 74%
Core energy reduced by 29%.
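
The core figure can be checked by weighting each block's remaining energy by its share of core power. The sketch below reads each row as (share of core power, fraction of energy remaining on 16-bit operations); that reading is an interpretation of the slide, since the percent signs were lost in the transcript.

```python
# (share of core power, fraction of energy remaining in 16-bit mode)
CORE = {
    "ALU":           (0.34, 0.66),
    "I-decode":      (0.23, 1.00),
    "Register file": (0.13, 0.66),
    "Clock":         (0.10, 0.50),
    "Shifter":       (0.11, 0.50),
    "Pipeline":      (0.09, 0.74),
}


def relative_energy(parts: dict) -> float:
    """Weighted sum of the remaining-energy fractions."""
    return sum(share * remaining for share, remaining in parts.values())


core_remaining = relative_energy(CORE)          # about 0.71
core_reduction_pct = 100 * (1 - core_remaining)
```

The weighted sum leaves about 71% of the original core energy, which rounds to the 29% reduction quoted on the slide.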

25
Summary

Typical power distribution in memory, as share of cache power x energy remaining on 16-bit operations:
Instruction cache: 60% x 100%
Data cache: 40% x 50%
Cache energy reduced by 20%.
Total processor power consumption:
Cache: 66% x 80%
Core: 33% x 71%
Total energy reduced by 24% when executing a 16-bit application.
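
The cache and whole-processor totals follow from the same weighted sum. This sketch uses the numbers from the slide as given; note that the stated cache and core shares (66% and 33%) add up to 99%, and the arithmetic below does not renormalize them, which is an assumption about how the slide's 24% was obtained.

```python
def weighted(parts):
    """Sum of share x remaining-energy fraction over (share, remaining) pairs."""
    return sum(share * remaining for share, remaining in parts)


# Cache: instruction cache unchanged, data cache at 50%.
cache_remaining = weighted([(0.60, 1.00), (0.40, 0.50)])

# Whole processor, using the slide's rounded shares and per-part results.
total_remaining = weighted([(0.66, 0.80), (0.33, 0.71)])

cache_reduction_pct = 100 * (1 - cache_remaining)
total_reduction_pct = 100 * (1 - total_remaining)
```

These come out at a 20% reduction for the cache and roughly 24% for the whole processor. Dividing by the 99% total instead would give closer to 23%.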

26
Conclusions

The primary drawback is the modification of the ISA.
Energy savings are reasonable.
Our modifications are fairly easy to implement, and can be fit into existing processor designs with minimal area increase.

27
Where do we go from here?