Hamming Transcoders for Power Reduction on Internal Buses - PowerPoint PPT Presentation

Slides: 21

Provided by: victo56

1
Hamming Transcoders for Power Reduction on
Internal Buses

Victor Wen
July 12, 2000
University of California, Berkeley

2
Outline

Motivations
Related Work
General Coding Setup
Transition Code Technique
Simulation Results
What does it cost?
Future Work/Conclusion

3
Motivation

Increasing importance of wires relative to
transistors
Spend transistors to drive wires more
efficiently?
Try to reduce transitions over wires to decrease
capacitive charging and discharging.
Orthogonal to other power-saving techniques,
e.g. voltage reduction, low-swing driver/receiver,
clock gating
Parallel function blocks (like vectors!)
Important for portable devices, where power and
energy are major constraints

4
Power reduction through coding
[Figure: Input -> Encoder -> Encoded Value -> Decoder -> Output]

Can we encode information in a way that takes
less power?
Do this on chip?!
Do this dynamically?! Apply value-prediction
techniques to track data patterns.

5
Related Work

Bus Invert Coding, by M. R. Stan and W. P.
Burleson
Reduces peak power by 50%, average by up to 25%
Working-zone Encoding, by E. Musoll et al.
Compares favorably with other techniques
Test Vector Ordering, by P. Girard et al.
Result: 8.2% to 54.1% less activity
Minimizing Power Consumption, by A. Chandrakasan
and R. Brodersen
Introduces power-saving techniques at different
levels
The Predictability of Data Values, by Y. Sazeides
and J. Smith
The context-based predictor suggests using the
previous n values to help track data
6
Dynamic Transcoder

Two FSMs and some hardware cache sit at the two
ends of the bus, tracking each other. The values
sent across the bus synchronize the FSMs, so no
extra synchronization traffic over the bus is
needed.
The FSM decides when to admit a new data value into
the hardware table(s) and which low-Hamming-weight
code to assign to it.
The hardware table varies in size and design.

[Figure: Input -> FSM + Hardware Table(s) -> Encoded Value -> FSM + Hardware Table(s) -> Output]

7
FSM Details

[Figure: state transition diagram; labels 1, 2, 5 and 6; example arcs marked "Code 0x00, Freq 2620" and "Code 0xFF, Freq 10"]
Potentially 2^32 entries in table for a 32-bit bus!

The most frequent arc is assigned the lowest-weight
code (e.g. 0x0). The codes are re-assigned
dynamically to reflect the most frequent values in
the current phase of the trace.
A state can represent an actual value or a class
of values (e.g. state 1 could be a difference of 1
between the current and previous input). We call
the latter filtered input. Filtered input can
capture more unique values in one entry (e.g. all
inputs that differ from their predecessor by 1
share one entry).
The output code is XORed onto the transmission line
Every 1 in the coded version causes a transition on
the bus
The most frequent arcs cause the fewest
transitions
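
The XOR-based coding described on this slide can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the presented design: it uses an 8-bit bus, ranks all 256 values by running frequency instead of using a limited hardware table, and applies no filter. The class and function names are hypothetical.

```python
from collections import Counter

def popcount(x):
    """Hamming weight of x."""
    return bin(x).count("1")

class TransitionCoder:
    """One end of the bus. Encoder and decoder run identical copies."""

    def __init__(self, width=8):
        self.size = 1 << width
        # Codes ordered by Hamming weight: 0x00, then one-hot codes, and so on.
        self.codes = sorted(range(self.size), key=lambda c: (popcount(c), c))
        self.freq = Counter()
        self.bus = 0  # last word seen on the wires

    def _ranked_values(self):
        # Most frequent value first; ties and unseen values in numeric order.
        return sorted(range(self.size), key=lambda v: (-self.freq[v], v))

    def encode(self, value):
        code = self.codes[self._ranked_values().index(value)]
        self.bus ^= code          # every 1 in the code toggles one wire
        self.freq[value] += 1     # update after coding so both ends agree
        return self.bus

    def decode(self, bus_word):
        code = bus_word ^ self.bus
        value = self._ranked_values()[self.codes.index(code)]
        self.bus = bus_word
        self.freq[value] += 1
        return value

def count_transitions(words):
    """Total wire toggles when the words are driven in order from 0."""
    prev, total = 0, 0
    for w in words:
        total += popcount(prev ^ w)
        prev = w
    return total

trace = [0x3C, 0xC3] * 50
enc, dec = TransitionCoder(), TransitionCoder()
coded = [enc.encode(v) for v in trace]
decoded = [dec.decode(w) for w in coded]
```

Because both ends update their frequency counts only from values that have already crossed the bus, the decoder reconstructs the same ranking without any side channel, which mirrors the synchronization argument made for the two FSMs.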

8
Hardware Table Details

The hardware table consists of a filter and a
combination of shift registers, a pending table and
the actual map table.
The tables store the most frequent values (actual
or filtered inputs).
The shift registers and pending table are used to
admit new frequent values and hold values evicted
from the map table.
Currently investigating different hardware
combinations and admission policies to find a
balance between tracking ability and hardware
complexity.
Three setups are currently under consideration

[Figure: three candidate hardware table setups. Labels: Filter(s), Input, Output; Pending Table, Map Table; Shift Regs, Map Table; Shift Regs, P. Table, M. Table]

9
Evaluation: Input Pattern and Unique Values (gcc)
Inputs to Transcoder

The input pattern graph (a) shows the raw and
filtered input transitions to the transcoder. The
filtered inputs actually have more activity than
the raw input.
The unique value graph (b) shows the number of
unique values in the gcc trace.
Uniqueness: given a sliding window of size n, if
the current input matches any value in the
window, it is not unique; otherwise it is unique.
Notice that when n > 31, the total number of
unique values drops drastically.
Graphs (a) and (b) suggest that a transcoder
with table size > 31 and no filter would work
best.
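
The sliding-window uniqueness measure defined above can be sketched as follows. It assumes the window holds the n inputs immediately preceding the current one and that n is at least 1; the function name is hypothetical.

```python
from collections import deque

def count_unique(trace, n):
    """Count inputs that match none of the previous n inputs."""
    window = deque(maxlen=n)   # oldest entry drops out automatically
    unique = 0
    for value in trace:
        if value not in window:
            unique += 1
        window.append(value)
    return unique

trace = [1, 2, 3] * 3
small = count_unique(trace, 2)   # window too short to hold the 3-value loop
large = count_unique(trace, 3)   # window holds the whole loop
```

A trace that cycles through k values looks entirely unique until the window reaches size k, which is one way a sharp drop at a particular n could arise.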

[Figures: (a) input pattern; (b) unique values in the trace]

10
Evaluation: Transition Savings for gcc and compress

Graph (a) shows the resulting transitions for the
gcc trace after running it through the dynamic
transcoder with no filter, an XOR filter and a
subtract filter.
The trends show that the dynamic transcoder is able
to track the input and reduce activity. Also
notice that the area between the nocode line and
the other lines represents the energy saved.
The results for the static oracle transcoder are
also shown. The three traces have table sizes of
1, 31 and all.
Note: The oracle transcoder reads in the trace
file and constructs the map table statically, with
the most frequent values assigned lower-weight
codes. Then the trace file is re-read to find the
resulting transitions.
Graph (b) shows the result of a similar experiment
for the compress trace.
Note: Both the static and dynamic transcoders
perform much better here, because the input is more
predictable and the filters give the transcoder
better input to adapt to (the input pattern graph
for compress is not shown).
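
The two-pass oracle described in the note can be sketched roughly as below. Several details are assumptions, since the slides do not spell them out: an 8-bit bus, values that miss the table being sent raw, and one extra "coded" flag wire whose toggles are also counted. All names are hypothetical.

```python
from collections import Counter

def popcount(x):
    """Hamming weight of x."""
    return bin(x).count("1")

def nocode_transitions(trace):
    """Baseline: drive the raw values onto the bus, starting from 0."""
    prev, total = 0, 0
    for v in trace:
        total += popcount(prev ^ v)
        prev = v
    return total

def oracle_transitions(trace, table_size, width=8):
    """Static oracle: build the map from the whole trace, then replay it."""
    codes = sorted(range(1 << width), key=lambda c: (popcount(c), c))
    # Pass 1: most frequent values get the lowest-weight codes.
    ranked = [v for v, _ in Counter(trace).most_common(table_size)]
    table = {v: codes[i] for i, v in enumerate(ranked)}
    # Pass 2: replay the trace and count wire toggles.
    bus, flag, total = 0, 0, 0
    for value in trace:
        if value in table:
            new_bus, new_flag = bus ^ table[value], 1   # XOR the code onto the bus
        else:
            new_bus, new_flag = value, 0                # assumed: send misses raw
        total += popcount(bus ^ new_bus) + (flag ^ new_flag)
        bus, flag = new_bus, new_flag
    return total

trace = [0x3C, 0xC3] * 50 + [0x0F]
baseline = nocode_transitions(trace)
coded = oracle_transitions(trace, table_size=2)
```

Since the oracle sees the whole trace before assigning codes, it serves as a reference point that a dynamic transcoder can be compared against.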

[Figures: (a) gcc; (b) compress]

11
Conclusion and Future Work

Conclusion
Transition coding attacks the root of the problem.
The static oracle transcoder still beats the
dynamic transcoder, suggesting room for
improvement.
Changes to existing circuits are transparent.
Orthogonal to other low-power techniques.
Future work
Allow more context in filtering (e.g. use the
previous n values instead of only the immediately
previous value).
Simulate SPEC on Sparc/UltraSparc RTL.
Implement all three architectures described above.
Implement actual hardware and estimate how much
power the transcoder itself would consume.

12
Hardware Cost?

Given table size n and bus width m:
XOR (4 transistors/bit, pass-transistor logic) =>
136 T
nm AND gates, plus an n-input m-bit OR gate for
associative lookup => 2nm T + 2nm T = 4nm T
6nm T for table storage
an n-bit encoder (for the code)
32 1-bit inverters => 64 T
a majority voting circuit (to decide whether to
invert or not)
FSM circuit to perform the pattern-tracking
algorithm
n 8-bit counters (to keep track of the hit
frequency)
muxes
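
The items above that carry explicit transistor counts can be added up in a short calculation. This is only a partial tally: the encoder, majority voter, FSM, counters and muxes have no counts on the slide and are left out. The reading of 136 T as 4 T/bit over 34 wires is an assumption, and the function and parameter names are hypothetical.

```python
def partial_transistor_count(n, m, xor_bits=34, inverter_bits=32):
    """Sum only the items that have explicit counts on the slide.

    n: table size, m: bus width.
    xor_bits=34 is an assumption chosen so that 4 T/bit gives 136 T.
    """
    counts = {
        "xor": 4 * xor_bits,                # 4 T/bit, pass-transistor logic
        "lookup": 2 * n * m + 2 * n * m,    # AND gates plus OR gate: 4nm T
        "storage": 6 * n * m,               # 6nm T for table storage
        "inverters": 2 * inverter_bits,     # 2 T per 1-bit inverter
    }
    counts["total_counted"] = sum(counts.values())
    return counts

example = partial_transistor_count(n=32, m=32)
```

The lookup and storage terms both scale with n times m, so under these assumptions the table dominates the counted cost for all but very small tables.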

13
Huffman-based Compression

Variable bit-length problem!
Possible solution: macro clock
Fewer bits != fewer transitions

14
Hamming Weight

Find a map function to minimize transitions
Search space is large: 256! (for an 8-bit bus)
Leads to transition code idea

15
Simulation Setup

Sun offering processor descriptions in Verilog
picoJava (for now)
UltraSparc (soon)

16
Simulation Results (1)

Savings
Rank 9 saves 79.52%
Rank 256 saves 79.68%

9th-bit overhead: Rank 1: 23%, Rank 9: 0.29%

17
Simulation Results (2)
Number of transitions drops quickly as rank
increases
A 256x256 table might not be necessary
Other trace files show similar trends
Note: icu_data connects the instruction cache
unit and the integer unit. It is a fairly long bus
according to picoJava's floorplan

18
Hamming Transcoder (cont)

Only transitions matter, not absolute values
Recognize the more frequent transitions and assign
low-weight codes to them
Guarantees that more frequent transitions cause
fewer bit changes on the wire

19
Transition Code Overview
[Figure: block diagram of the transition coder. Labels: Encoder, Decoder, Coder, Filter, Transcode, Hardware Table(s), XOR, Cur input, Prev input, Cur bus value, To Bus, Coded?, Invert?; signal widths 32 and 34]

20
Simulation Setup