Title: Hamming Transcoders for Power Reduction on Internal Buses
1 Hamming Transcoders for Power Reduction on Internal Buses
- Victor Wen
- July 12, 2000
- University of California, Berkeley
2 Outline
- Motivations
- Related Work
- General Coding Setup
- Transition Code Technique
- Simulation Results
- What does it cost?
- Future Work/Conclusion
3 Motivation
- Increasing importance of wires relative to transistors
- Spend transistors to drive wires more efficiently?
- Try to reduce transitions over wires to decrease capacitive charge/discharge.
- Orthogonal to other power-saving techniques
- e.g. voltage reduction, low-swing driver/receiver
- clock gating
- Parallel function blocks (like vectors!)
- Important for portable devices, where power and energy are major constraints
4 Power reduction through coding
[Figure: Input -> Encoder -> Encoded Value -> Decoder -> Output]
- Can we encode information in a way that takes less power?
- Do this on chip?!
- Do this dynamically?! Apply value-prediction techniques to track data patterns.
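As a rough illustration of the quantity being minimized, the sketch below counts wire toggles as the Hamming distance between successive bus values. The function names and the all-zero initial bus state are assumptions made for this example, not part of the slides.

```python
def hamming_weight(x: int) -> int:
    """Number of 1 bits in a non-negative integer."""
    return bin(x).count("1")


def count_transitions(words, initial=0):
    """Total wire toggles when `words` are driven onto a bus that
    starts out holding `initial` (assumed to be 0 here)."""
    total = 0
    prev = initial
    for w in words:
        # Each bit that differs between consecutive words toggles one wire.
        total += hamming_weight(prev ^ w)
        prev = w
    return total


worst_case = [0x00, 0xFF, 0x00, 0xFF]  # alternating all-zeros / all-ones
print(count_transitions(worst_case))
```

Any encoding that lowers this count for the same data stream lowers the capacitive charge/discharge on the bus, provided the encoder and decoder cost less than what they save.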
5 Related Work
- Bus-Invert Coding, by M. R. Stan and W. P. Burleson
- Reduces peak power by 50%, average power by up to 25%
- Working-Zone Encoding, by E. Musoll et al.
- Compares favorably with other techniques
- Test Vector Ordering, by P. Girard et al.
- Results: 8.2% to 54.1% less activity
- Minimizing Power Consumption, by A. Chandrakasan and R. Brodersen
- Introduces power-saving techniques at different levels
- The Predictability of Data Values, by Y. Sazeides and J. Smith
- The context-based predictor suggests using the previous n values to help track data
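For readers unfamiliar with the first item, here is a minimal sketch of the bus-invert idea: if more than half of the data lines would toggle, send the complement and raise an extra invert line. The 8-bit width and the function names are assumptions for illustration.

```python
WIDTH = 8                      # assumed bus width for this example
MASK = (1 << WIDTH) - 1


def bus_invert_encode(data, prev_bus):
    """Return (bus_value, invert_bit) for one data word."""
    toggles = bin((data ^ prev_bus) & MASK).count("1")
    if toggles > WIDTH // 2:
        # Sending the complement toggles fewer data lines.
        return (~data) & MASK, 1
    return data & MASK, 0


def bus_invert_decode(bus_value, invert_bit):
    """Undo the optional inversion at the receiver."""
    return (~bus_value) & MASK if invert_bit else bus_value


bus, inv = bus_invert_encode(0xFF, prev_bus=0x00)
print(hex(bus), inv, hex(bus_invert_decode(bus, inv)))
```

With this rule at most half of the data lines change per transfer, which is where the peak-power reduction comes from.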
6 Dynamic Transcoder
- Place an FSM and a hardware cache at each end of the bus, with the two ends tracking each other. The values sent across the bus synchronize the FSMs, so no extra communication over the bus is needed.
- The FSM decides when to admit a new data value into the hardware table(s) and which low-Hamming-weight code to assign to it.
- The hardware table varies in size and design.
[Figure: Input -> FSM + Hardware Table(s) -> Encoded Value -> FSM + Hardware Table(s) -> Output]
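The synchronization idea can be sketched in software. In the example below both ends keep an identical frequency table and update it only from values that actually crossed the bus, so they never need a side channel. The class name, the frequency-count admission rule, and the (coded flag, payload) word format are hypothetical choices for this sketch; the slides do not specify them.

```python
from collections import Counter


def low_weight_codes(count, width=8):
    """First `count` codes of `width` bits, ordered by Hamming weight."""
    codes = sorted(range(1 << width), key=lambda c: (bin(c).count("1"), c))
    return codes[:count]


class TranscoderEnd:
    """One end of the bus: a frequency table plus a code assignment.
    (Hypothetical structure, not the hardware design from the slides.)"""

    def __init__(self, size=4, width=8):
        self.size = size
        self.codes = low_weight_codes(size, width)
        self.freq = Counter()

    def _ranked(self):
        # Most frequent values first; ties broken by value so that both
        # ends always derive exactly the same ordering.
        items = sorted(self.freq.items(), key=lambda kv: (-kv[1], kv[0]))
        return [v for v, _ in items[: self.size]]

    def encode(self, value):
        ranked = self._ranked()
        if value in ranked:
            word = (1, self.codes[ranked.index(value)])  # hit: send code
        else:
            word = (0, value)                            # miss: send raw
        self.freq[value] += 1                            # update after use
        return word

    def decode(self, word):
        coded, payload = word
        if coded:
            value = self._ranked()[self.codes.index(payload)]
        else:
            value = payload
        self.freq[value] += 1                            # same update rule
        return value


sender, receiver = TranscoderEnd(), TranscoderEnd()
trace = [7, 7, 9, 7, 9, 200, 7]
decoded = [receiver.decode(sender.encode(v)) for v in trace]
print(decoded == trace)
```

Because the receiver applies the same update after decoding each word, its table is always in the state the sender's table was in when that word was encoded.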
7 FSM Details
[Figure: State Transition Diagram. Recoverable labels: 1, 2, 5, 6; "Code 0xFF, Freq 10"; "Code 0x00, Freq 2620"]
Potentially 2^32 entries in the table for a 32-bit bus!
- The most frequent arc is assigned the lowest-weight code (e.g. 0x0). The codes are reassigned dynamically to reflect the most frequent values in the current phase of the trace.
- A state could represent an actual value or a class of values (e.g. state 1 could be a difference of 1 between the current and previous input). We call the latter filtered input. Filtered input can capture more unique values per entry (e.g. all inputs that differ from their predecessor by 1 are captured in one entry).
- Use the output codes to XOR the transmission line
- Every 1 in the coded version causes a transition on the bus
- The most frequent arcs cause the fewest transitions
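The XOR step can be sketched as follows: the sender XORs each code onto the previous bus value, so the number of toggled wires equals the Hamming weight of the code, and the receiver recovers the code as the set of wires that changed. The code stream and the initial bus state below are made-up values for illustration.

```python
def popcount(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")


def send(code, prev_bus):
    """Drive the bus by XORing the code onto its previous value."""
    return prev_bus ^ code


def receive(bus, prev_bus):
    """Recover the code as the set of wires that toggled."""
    return bus ^ prev_bus


codes = [0x00, 0x01, 0x00, 0xFF]  # hypothetical code stream from the FSM
bus = 0xA5                        # arbitrary initial bus state
toggles = 0
recovered = []
for c in codes:
    new_bus = send(c, bus)
    toggles += popcount(new_bus ^ bus)   # equals popcount(c)
    recovered.append(receive(new_bus, bus))
    bus = new_bus

print(toggles, recovered == codes)
```

Code 0x00 costs no transitions at all, which is why it is reserved for the most frequent arc.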
8 Hardware Table Details
- The hardware table consists of a filter and a combination of shift registers, a pending table and the actual map table.
- The tables store the most frequent values (actual or filtered inputs).
- The shift registers and pending table are used to admit new frequent values and to hold values evicted from the map table.
- We are currently investigating different hardware combinations and admission policies to balance tracking ability against hardware complexity.
- Three setups are currently under consideration.
[Figure: the setups under consideration. Recoverable labels: Input, Filter(s), Shift Regs, Pending Table (P. Table), Map Table (M. Table), Output]
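The filter stage can be sketched in a few lines. The deck names an XOR filter and a subtract filter in its evaluation; the versions below are one plausible reading (current input combined with the immediately previous input, modulo the bus width), with function names chosen for this example.

```python
WIDTH = 32                      # bus width used in the slides
MASK = (1 << WIDTH) - 1


def xor_filter(cur, prev):
    """Bits that changed relative to the previous input."""
    return (cur ^ prev) & MASK


def xor_unfilter(filtered, prev):
    """Inverse of xor_filter, applied at the decoder."""
    return (filtered ^ prev) & MASK


def sub_filter(cur, prev):
    """Arithmetic difference from the previous input, modulo 2**WIDTH."""
    return (cur - prev) & MASK


def sub_unfilter(filtered, prev):
    """Inverse of sub_filter, applied at the decoder."""
    return (filtered + prev) & MASK


# A sequential address stream collapses to a single filtered value.
addresses = [0x1000, 0x1001, 0x1002, 0x1003]
prev = 0x0FFF
filtered = []
for a in addresses:
    filtered.append(sub_filter(a, prev))
    prev = a
print(filtered)
```

Four distinct raw values become one filtered value here, so a single table entry covers all of them.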
9 Evaluation: Input pattern and unique values (gcc)
[Figure (a): Inputs to Transcoder. Figure (b): Unique values in the trace]
- The input pattern graph (a) shows the raw and filtered input transitions seen by the transcoder. The filtered input actually has more activity than the raw input.
- The unique value graph (b) shows the number of unique values in the gcc trace.
- Uniqueness: given a sliding window of size n, if the current input matches any value in the window, it is not unique; otherwise it is unique.
- Notice that when n > 31, the total number of unique values drops drastically.
- Graphs (a) and (b) suggest that a transcoder with table size > 31 and no filter would work best.
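The uniqueness metric defined above can be sketched directly. This version assumes the window holds the previous n inputs, repeats included; the slides do not say whether the window stores only distinct values.

```python
from collections import deque


def count_unique(trace, n):
    """Count inputs that match nothing in the preceding window of size n."""
    window = deque(maxlen=n)    # oldest entry drops out automatically
    unique = 0
    for value in trace:
        if value not in window:
            unique += 1
        window.append(value)
    return unique


trace = [1, 2, 1, 3, 1, 2]      # toy trace, not data from the slides
for n in (1, 2, 4):
    print(n, count_unique(trace, n))
```

A larger window can only lower the count, so plotting it against n shows how much history the table must hold before most inputs become hits.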
10 Evaluation: Transition savings for gcc and compress
[Figure (a): transitions for the gcc trace. Figure (b): transitions for the compress trace]
- Graph (a) shows the resulting transitions of the gcc trace after running it through the dynamic transcoder with no filter, the XOR filter and the subtract filter.
- The trends show that the dynamic transcoder was able to track the input and reduce activity. The area between the nocode line and the other lines represents the energy saved.
- The results of the static oracle transcoder are also shown. Its three traces have table sizes of 1, 31 and all.
- Note: The oracle transcoder reads in the trace file and constructs the map table statically, with the most frequent values assigned the lowest-weight codes. The trace file is then re-read to find the resulting transitions.
- Graph (b) shows the result of a similar experiment for the compress trace.
- Note: Both the static and the dynamic transcoder perform much better here, because the input is more predictable and the filters give the transcoder better input to adapt to (the input pattern graph for compress is not shown).
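The two-pass oracle described in the note can be sketched as below. Several details are assumptions for this example: an in-memory list stands in for the trace file, table misses are sent as raw values, and toggles on any extra flag line are ignored.

```python
from collections import Counter


def popcount(x):
    """Number of 1 bits in x."""
    return bin(x).count("1")


def oracle_transitions(trace, table_size, width=8):
    """Transitions after static, frequency-ranked coding of `trace`."""
    # Pass 1: build the static map table, most frequent value first.
    ranked = [v for v, _ in Counter(trace).most_common(table_size)]
    codes = sorted(range(1 << width), key=lambda c: (popcount(c), c))
    table = dict(zip(ranked, codes))

    # Pass 2: replay the trace and count the resulting wire toggles.
    total = 0
    bus = 0
    for value in trace:
        if value in table:
            code = table[value]
            total += popcount(code)         # code is XORed onto the bus
            bus ^= code
        else:
            total += popcount(bus ^ value)  # assumed: miss is sent raw
            bus = value
    return total


trace = [5] * 10 + [9]                      # toy trace, not slide data
print(oracle_transitions(trace, 1), oracle_transitions(trace, 2))
```

Because the oracle sees the whole trace before assigning codes, it serves as a reference point for how well a dynamic scheme with the same table size could do.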
11 Conclusion and Future Work
- Conclusion
- Transition coding attacks the root of the problem.
- The static oracle transcoder still beats the dynamic transcoder, suggesting room for improvement.
- Changes to existing circuits are transparent.
- Orthogonal to other low-power techniques.
- Future work
- Allow more context in filtering (e.g. use the previous n values instead of only the immediately preceding value).
- Simulate SPEC on Sparc/UltraSparc RTL.
- Implement all three architectures described above.
- Implement actual hardware and estimate how much power the transcoder itself would consume.
12 Hardware Cost?
- Given table size n and bus width m:
- XOR (4 transistors/bit, pass-transistor logic) => 136 T
- nm AND gates, plus an n-input, m-bit OR gate for associative lookup => 2nm T + 2nm T = 4nm T
- 6nm T for table storage
- an n-bit encoder (for the code)
- 32 1-bit inverters => 64 T
- a majority voting circuit (to decide whether to invert or not)
- FSM circuit to perform the pattern-tracking algorithm
- n 8-bit counters (to keep track of the hit frequency)
- muxes
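To get a feel for the numbers, the sketch below adds up only the items that the list quantifies. The encoder, majority voter, FSM, counters and muxes are left out because no transistor counts are given for them, so the result is a lower bound; the example values n = 32 and m = 32 are assumptions.

```python
def itemized_transistors(n, m):
    """Sum of the transistor counts quantified in the list above.

    n: table size, m: bus width. Unquantified items (encoder, majority
    voter, FSM, counters, muxes) are NOT included.
    """
    xor_stage = 136          # figure quoted in the list
    lookup = 4 * n * m       # AND gates plus OR gate, associative lookup
    storage = 6 * n * m      # table storage
    inverters = 64           # 32 one-bit inverters
    return xor_stage + lookup + storage + inverters


print(itemized_transistors(32, 32))
```

The lookup and storage terms both scale with n times m, so table size dominates the cost long before the fixed XOR and inverter stages matter.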
13 Huffman-based Compression
- Variable bit length problem!
- Possible solution: macro clock
- Less bits ! less transitions
14 Hamming Weight
- Find a map function that minimizes transitions
- The search space is large: 256! (for an 8-bit bus)
- Leads to the transition code idea
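A quick calculation shows why exhaustive search over mappings is out of reach and how scarce low-weight codes are. The function names are chosen for this example; it needs Python 3.8 or later for `math.comb`.

```python
import math


def mapping_search_space(width):
    """Number of one-to-one maps from bus values to codes: (2**width)!"""
    return math.factorial(1 << width)


def codes_per_weight(width):
    """How many `width`-bit codes exist at each Hamming weight 0..width."""
    return [math.comb(width, k) for k in range(width + 1)]


digits = len(str(mapping_search_space(8)))
print(digits)                   # decimal digits in 256!
print(codes_per_weight(8))
```

Only one code has weight 0 and eight have weight 1, so assigning the cheapest codes to the most frequent cases captures most of the benefit without searching the full space.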
15 Simulation Setup
- Sun offering processor descriptions in Verilog
- picoJava (for now)
- UltraSparc (soon)
16 Simulation Results (1)
- Savings
- Rank 9 saves 79.52%
- Rank 256 saves 79.68%
- 9th-bit overhead: Rank 1: 23%, Rank 9: 0.29%
17 Simulation Results (2)
- The number of transitions drops quickly as the rank increases
- A 256x256 table might not be necessary
- Other trace files show similar trends
- Note: icu_data connects the instruction cache unit and the integer unit. It is a fairly long bus according to picoJava's floorplan.
18 Hamming Transcoder (cont)
- Only transitions matter, not absolute values
- Recognize the more frequent transitions and assign low-weight codes to them
- Guarantees that more frequent transitions cause fewer bit changes on the wire
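One way to read "rank" is as the number of successors tracked per previous value. Under that assumption, the sketch below builds a per-previous-value table of the most frequent next values and looks up a low-weight code for a transition. The function names and the code list are hypothetical.

```python
from collections import Counter, defaultdict


def build_rank_table(trace, rank):
    """Map each previous value to its `rank` most frequent successors."""
    successors = defaultdict(Counter)
    for prev, cur in zip(trace, trace[1:]):
        successors[prev][cur] += 1
    return {prev: [v for v, _ in counts.most_common(rank)]
            for prev, counts in successors.items()}


def code_for(table, prev, cur, codes):
    """Low-weight code for the transition prev -> cur, or None if the
    transition is not ranked (it would then have to be sent uncoded)."""
    ranked = table.get(prev, [])
    return codes[ranked.index(cur)] if cur in ranked else None


codes = [0x00, 0x01, 0x02, 0x04]         # ordered by Hamming weight
trace = [1, 2, 1, 2, 1, 3, 1, 2]         # toy trace, not slide data
table = build_rank_table(trace, rank=2)
print(table)
print(code_for(table, 1, 2, codes), code_for(table, 2, 3, codes))
```

Keeping only a few ranks per previous value shrinks the table from a full 256x256 array for an 8-bit bus to a handful of entries per row.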
19 Transition Code Overview
[Figure: Transition code block diagram. Recoverable labels: Coder, Encoder, Decoder, Cur input, Prev input, Filter, Transcode, Hardware Table(s), Cur bus value, XOR, Coded?, Invert?, To Bus; signal widths 32 and 34]
20 Simulation Setup
- Now: running SPEC95 on a high-level processor model.
- Future: running SPEC benchmarks on the Verilog RTL model offered by Sun (sparc v8 release).