Title: Code Compression for Low Power Embedded System Design
Slide 1: Code Compression for Low Power Embedded System Design
- Haris Lekatsas (Princeton University)
- Joerg Henkel (NEC USA)
- Wayne Wolf (Princeton University)
Slide 2: Outline
- Introduction: code compression requirements
- Other approaches
- Architectural exploration
  - Pre-cache architecture
  - Post-cache architecture
  - Bus compaction techniques
- Experimental results: toggle counts, performance and energy savings
- Conclusions
Slide 3: What is code compression?
[Diagram: program memory holds compressed software; a decompression engine expands it in hardware on the way to the CPU.]
- Add hardware (a decompression engine)
- Reduce software code size
Slide 4: Code compression requirements
- Random access
  - Start decompression at block boundaries
  - Resynchronization points restart the FSM
- Byte alignment
  - Faster decoding
  - More compact indexing
- Indexing
  - LAT (line address table); see the sketch below
  - Patching branch offsets
[Figure: branches b1-b4 jumping between compressed blocks block1-block4, illustrating why branch targets need indexable block boundaries.]
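As an illustration of LAT-based indexing, here is a minimal C sketch assuming a simple one-offset-per-block table; the structure and names are hypothetical, not the LAT layout used in the paper.

```c
#include <stdint.h>

#define BLOCK_SIZE 32               /* uncompressed block size in bytes (assumed) */

/* Hypothetical LAT: entry i holds the byte offset of compressed block i. */
typedef struct {
    const uint32_t *offsets;        /* one entry per uncompressed block */
    uint32_t        num_blocks;
} lat_t;

/* Translate an uncompressed address to the start of its compressed block.
 * Decompression can then restart at this resynchronization point.        */
static uint32_t lat_lookup(const lat_t *lat, uint32_t uncompressed_addr)
{
    uint32_t block = uncompressed_addr / BLOCK_SIZE;
    return lat->offsets[block];     /* caller ensures block < num_blocks */
}
```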
Slide 5: Prior art
- Statistical coding methods
  - Kozuch and Wolfe (ICCD 1994): Huffman coding
- Dictionary coding methods
  - Liao et al. (ARVLSI 1995): can be done completely in software, but gives only modest compression
  - Lefurgy et al. (Micro-30, 1997): decompression is done at instruction fetch; compression performance around 60% for PowerPC
- Industry: essentially new instruction sets
  - ARM's Thumb: compression performance 60-70%
  - MIPS16: compression performance 60-70%
- For low power: Yoshida et al. (ISLPED 97), Benini et al. (ISLPED 99)
Slide 6: Contributions
- Objective: reduce power consumption while maintaining or even improving performance
- Techniques:
  - Post-cache decompression
  - Bus encoding schemes
  - Efficient instruction compression
Slide 7: Algorithm overview
- Based on Lekatsas and Wolf (TCAD, Dec. 99)
- Compression flow: instruction segment -> Markov modeling -> encoding process and decoding table creation -> 4-phase compression -> compress branches -> patch branch offsets -> compressed instruction segment
- A schematic sketch of the bit-level Markov model follows below.
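The Markov model conditions each bit's probability on a state built from previously seen bits; the full algorithm is in the TCAD paper. Below is only a schematic C sketch of such a bit-level model, with the state width and the update rule as illustrative assumptions.

```c
#include <stdint.h>

#define MODEL_BITS 8                      /* state = last 8 bits seen (assumed) */
#define NUM_STATES (1u << MODEL_BITS)

/* Per-state counts from which the encoder derives bit probabilities. */
static uint32_t count0[NUM_STATES], count1[NUM_STATES];

/* Probability that the next bit is 1, given the current state. */
static double prob_one(uint32_t state)
{
    uint32_t n0 = count0[state] + 1, n1 = count1[state] + 1; /* +1 smoothing */
    return (double)n1 / (n0 + n1);
}

/* Feed one bit of the instruction stream: update counts and shift the state. */
static uint32_t model_update(uint32_t state, int bit)
{
    if (bit) count1[state]++; else count0[state]++;
    return ((state << 1) | (bit & 1)) & (NUM_STATES - 1);
}
```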
Slide 8: Instruction grouping

  Group  Contents                                Code  Share  Bits/symbol
  1      Instructions with immediates            0     53.3%  8
  2      Branches                                11    26.1%  32
  3      Fast dictionary (no immediate fields)   100   20.0%  32 or 64
  4      Uncompressed instructions               101   0.6%   32

The codes 0, 11, 100, and 101 form a prefix-free set; a decoder dispatch sketch follows below.
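Because the group codes are prefix-free, the decompressor can classify each instruction by reading at most three bits. A minimal C sketch of that dispatch, assuming a hypothetical bitstream reader read_bit() that is not part of the original slides:

```c
typedef enum { GRP_IMMEDIATE, GRP_BRANCH, GRP_DICTIONARY, GRP_UNCOMPRESSED } group_t;

extern int read_bit(void);   /* hypothetical bitstream reader */

/* Classify the next compressed instruction by its prefix code:
 * "0" -> group 1, "11" -> group 2, "100" -> group 3, "101" -> group 4. */
static group_t classify(void)
{
    if (read_bit() == 0) return GRP_IMMEDIATE;       /* "0"   */
    if (read_bit() == 1) return GRP_BRANCH;          /* "11"  */
    return read_bit() ? GRP_UNCOMPRESSED             /* "101" */
                      : GRP_DICTIONARY;              /* "100" */
}
```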
Slide 9: SPARC branch compression

Original SPARC branch format (32 bits):
  bits 31-30: op | bit 29: a | bits 28-25: cond | bits 24-22: op2 | bits 21-0: 22-bit displacement

Compressed branch format (25 bits):
  bits 24-23: "11" | bit 22: a | bits 21-18: cond | bits 17-16: NB | bits 15-0: 16-bit offset

- The added field in bits 24-23 tells the decompressor this is a branch; the field in bits 17-16 (NB) specifies how many offset bytes are used.
- The offset size is estimated rather than solved exactly, avoiding an NP-complete problem.
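A hedged C sketch of assembling a compressed branch word from the fields above; the exact packing in the decompression engine may differ, and the function name is illustrative.

```c
#include <stdint.h>

/* Pack a compressed SPARC branch:
 * bits 24-23 = "11" branch marker, bit 22 = annul (a), bits 21-18 = cond,
 * bits 17-16 = NB (number of offset bytes), bits 15-0 = patched offset.  */
static uint32_t pack_branch(uint32_t a, uint32_t cond, uint32_t nb, uint32_t offset16)
{
    return (0x3u            << 23) |   /* "11" branch marker */
           ((a    & 0x1u)   << 22) |
           ((cond & 0xFu)   << 18) |
           ((nb   & 0x3u)   << 16) |
           (offset16 & 0xFFFFu);
}
```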
Slide 10: Compression ratios
[Chart: compression ratio per benchmark.]
Average compression ratio: 55% (compressed size as a fraction of the original size).
Slide 11: Previous work: pre-cache architecture

[Diagram: the decompression engine sits between main memory and the caches; DataBus 2 (32 bits) connects main memory to the engine, DataBus 1 (32 bits) connects the engine to the I-cache/D-cache and CPU, and an address bus connects the CPU to main memory.]

- Decompression happens only on a cache miss
- Decoding can overlap memory access, so decoding speed is not as critical
- DataBus 2 carries compressed code: fewer transitions, hence energy savings
- DataBus 1 carries uncompressed code: no gain on DataBus 1
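Why fewer transitions save energy: each toggle on a bus line charges or discharges that line's capacitance. As a standard first-order model (an assumption, not a formula from the slides):

```latex
E_{\mathrm{bus}} \approx \tfrac{1}{2}\, C_{\mathrm{line}}\, V_{dd}^{2}\, N_{\mathrm{toggles}}
```

where N_toggles is the total number of bit transitions on the bus lines; compressed code on DataBus 2 produces fewer transitions, hence the savings.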
Slide 12: Our new approach: post-cache architecture

[Diagram: the decompression engine now sits between the I-cache and the CPU, so both DataBus 2 (main memory to cache) and the I-cache itself hold compressed code; DataBus 1 connects the engine to the CPU.]

- Fewer memory accesses
- Less traffic on both DataBus 1 and DataBus 2
- Reduced cache misses potentially increase performance
- Even less energy consumed
Slide 13: Bus compaction approaches

[Diagram comparing "no packing" and "with packing":] Without packing, each variable-length compressed instruction occupies its own 32-bit bus word, leaving the high-order bits unused (instr1 leaves bits 22-31 unused; instr3 leaves bits 17-31 unused). With packing, codewords are concatenated across bus words, so an instruction can continue into the next word ("instr2, contd") and unused bits remain only in the final, partially filled word. A C sketch of the packing scheme follows below.
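A minimal C sketch of the "with packing" scheme, concatenating variable-length codewords MSB-first into 32-bit bus words; the MSB-first order and the emit_word() callback are assumptions for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t word;      /* bus word being filled          */
    int      bits_used; /* bits already occupied (0..31)  */
} packer_t;

/* Append an n-bit codeword (1 <= n <= 32, held in the low n bits of 'code');
 * emit_word() is a hypothetical callback shipping a full word onto the bus. */
static void pack(packer_t *p, uint32_t code, int n, void (*emit_word)(uint32_t))
{
    while (n > 0) {
        int room = 32 - p->bits_used;
        int take = n < room ? n : room;
        n -= take;
        /* place the top 'take' remaining bits just below the used bits */
        p->word |= (code >> n) << (room - take);
        if (n > 0)
            code &= (1u << n) - 1u;   /* keep only bits not yet emitted */
        p->bits_used += take;
        if (p->bits_used == 32) {     /* word full: flush it */
            emit_word(p->word);
            p->word = 0;
            p->bits_used = 0;
        }
    }
}
```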
Slide 14: Experimental setup and assumptions
- Design space exploration using the Avalanche framework by Li and Henkel (DAC 98)
- CPU performance (cycle counts) and energy obtained with the sparcsim simulator for the SPARClite processor (integrated in the Avalanche framework)
- We assume an SoC comprising a CPU, I-cache, D-cache, main memory, buses, and a decompression engine
Slide 15: Decompression engine power estimation

[Tool-flow diagram:]
- Software path: executable program -> instruction extraction -> instruction segment -> software compressor -> compressed program and decoding table
- Hardware path: decoder in BDL -> Cyber -> VHDL code -> dc_shell -> VHDL/Verilog netlists -> Opencad -> power estimation
- Stimulus path: QPT trace generation -> trace -> my_dinero -> hit/miss info -> pattern generator -> input patterns for power estimation
Slide 16: Toggles on DataBus 1 for Trick [chart]
Slide 17: Toggles on DataBus 2 for Trick [chart]
Slide 18: Toggles on both DataBus 1 and DataBus 2 for Trick [chart]
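The toggle counts in these charts are bit transitions between successive words on a bus, the quantity that drives dynamic bus energy. A minimal C sketch of counting toggles over a bus trace (the trace representation is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* Count bit transitions between consecutive words of a bus trace. */
static unsigned long toggle_count(const uint32_t *trace, size_t n)
{
    unsigned long toggles = 0;
    for (size_t i = 1; i < n; i++) {
        uint32_t diff = trace[i] ^ trace[i - 1];  /* bits that changed */
        while (diff) {                            /* popcount of diff  */
            toggles += diff & 1u;
            diff >>= 1;
        }
    }
    return toggles;
}
```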
Slide 19: System energy savings [chart]
Slide 20: System performance gain [chart]
Slide 21: Conclusions
- Energy savings between 22% and 82%
- For some cache sizes we get a performance advantage of 6% to 68%
- We reduce executable code size down to 55% of its original size
- Target: embedded systems running only one application
Slide 22: Thank you!