Title: Code Compression for Low Power Embedded System Design
Slide 1: Code Compression for Low Power Embedded System Design
- Haris Lekatsas (Princeton University)
- Joerg Henkel (NEC USA)
- Wayne Wolf (Princeton University)
Slide 2: Outline
- Introduction: code compression requirements
- Other approaches
- Architectural exploration
  - Pre-cache architecture
  - Post-cache architecture
  - Bus compaction techniques
- Experimental results: toggle counts, performance and energy savings
- Conclusions
Slide 3: What is code compression?
[Diagram: program memory holds compressed software; a decompression engine expands it in hardware on the way to the CPU.]
- Add hardware (a decompression engine)
- Reduce software code size
Slide 4: Code compression requirements
- Random access
  - Start decompression at block boundaries
  - Resynchronization points restart the FSM
- Byte alignment
  - Faster decoding
  - More compact indexing
- Indexing
  - LAT (line address table); see the sketch below
  - Patching branch offsets
[Figure: branches b1-b4 jumping between compressed blocks block1-block4, illustrating why branch targets need indexable block boundaries.]
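As an illustration of LAT-based indexing, here is a minimal C sketch assuming a simple one-offset-per-block table; the structure and names are hypothetical, not the LAT layout used in the paper.

```c
#include <stdint.h>

#define BLOCK_SIZE 32               /* uncompressed block size in bytes (assumed) */

/* Hypothetical LAT: entry i holds the byte offset of compressed block i. */
typedef struct {
    const uint32_t *offsets;        /* one entry per uncompressed block */
    uint32_t        num_blocks;
} lat_t;

/* Translate an uncompressed address to the start of its compressed block.
 * Decompression can then restart at this resynchronization point.        */
static uint32_t lat_lookup(const lat_t *lat, uint32_t uncompressed_addr)
{
    uint32_t block = uncompressed_addr / BLOCK_SIZE;
    return lat->offsets[block];     /* caller ensures block < num_blocks */
}
```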
Slide 5: Prior art
- Statistical coding methods
  - Kozuch and Wolfe (ICCD 1994): Huffman coding
- Dictionary coding methods
  - Liao et al. (ARVLSI 1995): can be done completely in software, but gives only modest compression
  - Lefurgy et al. (Micro-30, 1997): decompression is done at instruction fetch; compression performance around 60% for PowerPC
- Industry: essentially new instruction sets
  - ARM's Thumb: compression performance 60-70%
  - MIPS16: compression performance 60-70%
- For low power: Yoshida et al. (ISLPED 97), Benini et al. (ISLPED 99)
Slide 6: Contributions
- Objective: reduce power consumption while maintaining or even improving performance
- Techniques:
  - Post-cache decompression
  - Bus encoding schemes
  - Efficient instruction compression
Slide 7: Algorithm overview
- Based on Lekatsas and Wolf (TCAD, Dec. 99)
- Compression flow: instruction segment -> Markov modeling -> encoding process and decoding table creation -> 4-phase compression -> compress branches -> patch branch offsets -> compressed instruction segment
- A schematic sketch of the bit-level Markov model follows below.
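The Markov model conditions each bit's probability on a state built from previously seen bits; the full algorithm is in the TCAD paper. Below is only a schematic C sketch of such a bit-level model, with the state width and the update rule as illustrative assumptions.

```c
#include <stdint.h>

#define MODEL_BITS 8                      /* state = last 8 bits seen (assumed) */
#define NUM_STATES (1u << MODEL_BITS)

/* Per-state counts from which the encoder derives bit probabilities. */
static uint32_t count0[NUM_STATES], count1[NUM_STATES];

/* Probability that the next bit is 1, given the current state. */
static double prob_one(uint32_t state)
{
    uint32_t n0 = count0[state] + 1, n1 = count1[state] + 1; /* +1 smoothing */
    return (double)n1 / (n0 + n1);
}

/* Feed one bit of the instruction stream: update counts and shift the state. */
static uint32_t model_update(uint32_t state, int bit)
{
    if (bit) count1[state]++; else count0[state]++;
    return ((state << 1) | (bit & 1)) & (NUM_STATES - 1);
}
```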
Slide 8: Instruction grouping

  Group  Contents                                Code  Share  Bits/symbol
  1      Instructions with immediates            0     53.3%  8
  2      Branches                                11    26.1%  32
  3      Fast dictionary (no immediate fields)   100   20.0%  32 or 64
  4      Uncompressed instructions               101   0.6%   32

The codes 0, 11, 100, and 101 form a prefix-free set; a decoder dispatch sketch follows below.
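Because the group codes are prefix-free, the decompressor can classify each instruction by reading at most three bits. A minimal C sketch of that dispatch, assuming a hypothetical bitstream reader read_bit() that is not part of the original slides:

```c
typedef enum { GRP_IMMEDIATE, GRP_BRANCH, GRP_DICTIONARY, GRP_UNCOMPRESSED } group_t;

extern int read_bit(void);   /* hypothetical bitstream reader */

/* Classify the next compressed instruction by its prefix code:
 * "0" -> group 1, "11" -> group 2, "100" -> group 3, "101" -> group 4. */
static group_t classify(void)
{
    if (read_bit() == 0) return GRP_IMMEDIATE;       /* "0"   */
    if (read_bit() == 1) return GRP_BRANCH;          /* "11"  */
    return read_bit() ? GRP_UNCOMPRESSED             /* "101" */
                      : GRP_DICTIONARY;              /* "100" */
}
```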
Slide 9: SPARC branch compression

Original SPARC branch format (32 bits):
  bits 31-30: op | bit 29: a | bits 28-25: cond | bits 24-22: op2 | bits 21-0: 22-bit displacement

Compressed branch format (25 bits):
  bits 24-23: "11" | bit 22: a | bits 21-18: cond | bits 17-16: NB | bits 15-0: 16-bit offset

- The added field in bits 24-23 tells the decompressor this is a branch; the field in bits 17-16 (NB) specifies how many offset bytes are used.
- The offset size is estimated rather than solved exactly, avoiding an NP-complete problem.
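A hedged C sketch of assembling a compressed branch word from the fields above; the exact packing in the decompression engine may differ, and the function name is illustrative.

```c
#include <stdint.h>

/* Pack a compressed SPARC branch:
 * bits 24-23 = "11" branch marker, bit 22 = annul (a), bits 21-18 = cond,
 * bits 17-16 = NB (number of offset bytes), bits 15-0 = patched offset.  */
static uint32_t pack_branch(uint32_t a, uint32_t cond, uint32_t nb, uint32_t offset16)
{
    return (0x3u            << 23) |   /* "11" branch marker */
           ((a    & 0x1u)   << 22) |
           ((cond & 0xFu)   << 18) |
           ((nb   & 0x3u)   << 16) |
           (offset16 & 0xFFFFu);
}
```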
Slide 10: Compression ratios
[Chart: compression ratio per benchmark.]
Average compression ratio: 55% (compressed size as a fraction of the original size).
Slide 11: Previous work: pre-cache architecture

[Diagram: the decompression engine sits between main memory and the caches; DataBus 2 (32 bits) connects main memory to the engine, DataBus 1 (32 bits) connects the engine to the I-cache/D-cache and CPU, and an address bus connects the CPU to main memory.]

- Decompression happens only on a cache miss
- Decoding can overlap memory access, so decoding speed is not as critical
- DataBus 2 carries compressed code: fewer transitions, hence energy savings
- DataBus 1 carries uncompressed code: no gain on DataBus 1
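Why fewer transitions save energy: each toggle on a bus line charges or discharges that line's capacitance. As a standard first-order model (an assumption, not a formula from the slides):

```latex
E_{\mathrm{bus}} \approx \tfrac{1}{2}\, C_{\mathrm{line}}\, V_{dd}^{2}\, N_{\mathrm{toggles}}
```

where N_toggles is the total number of bit transitions on the bus lines; compressed code on DataBus 2 produces fewer transitions, hence the savings.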
Slide 12: Our new approach: post-cache architecture

[Diagram: the decompression engine now sits between the I-cache and the CPU, so both DataBus 2 (main memory to cache) and the I-cache itself hold compressed code; DataBus 1 connects the engine to the CPU.]

- Fewer memory accesses
- Less traffic on both DataBus 1 and DataBus 2
- Reduced cache misses potentially increase performance
- Even less energy consumed
Slide 13: Bus compaction approaches

[Diagram comparing "no packing" and "with packing":] Without packing, each variable-length compressed instruction occupies its own 32-bit bus word, leaving the high-order bits unused (instr1 leaves bits 22-31 unused; instr3 leaves bits 17-31 unused). With packing, codewords are concatenated across bus words, so an instruction can continue into the next word ("instr2, contd") and unused bits remain only in the final, partially filled word. A C sketch of the packing scheme follows below.
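A minimal C sketch of the "with packing" scheme, concatenating variable-length codewords MSB-first into 32-bit bus words; the MSB-first order and the emit_word() callback are assumptions for illustration.

```c
#include <stdint.h>

typedef struct {
    uint32_t word;      /* bus word being filled          */
    int      bits_used; /* bits already occupied (0..31)  */
} packer_t;

/* Append an n-bit codeword (1 <= n <= 32, held in the low n bits of 'code');
 * emit_word() is a hypothetical callback shipping a full word onto the bus. */
static void pack(packer_t *p, uint32_t code, int n, void (*emit_word)(uint32_t))
{
    while (n > 0) {
        int room = 32 - p->bits_used;
        int take = n < room ? n : room;
        n -= take;
        /* place the top 'take' remaining bits just below the used bits */
        p->word |= (code >> n) << (room - take);
        if (n > 0)
            code &= (1u << n) - 1u;   /* keep only bits not yet emitted */
        p->bits_used += take;
        if (p->bits_used == 32) {     /* word full: flush it */
            emit_word(p->word);
            p->word = 0;
            p->bits_used = 0;
        }
    }
}
```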
Slide 14: Experimental setup and assumptions
- Design space exploration using the Avalanche framework by Li and Henkel (DAC 98)
- CPU performance (cycle counts) and energy obtained with the sparcsim simulator for the SPARClite processor (integrated in the Avalanche framework)
- We assume an SoC comprising a CPU, I-cache, D-cache, main memory, buses, and a decompression engine
Slide 15: Decompression engine power estimation

[Tool-flow diagram:]
- Software path: executable program -> instruction extraction -> instruction segment -> software compressor -> compressed program and decoding table
- Hardware path: decoder in BDL -> Cyber -> VHDL code -> dc_shell -> VHDL/Verilog netlists -> Opencad -> power estimation
- Stimulus path: QPT trace generation -> trace -> my_dinero -> hit/miss info -> pattern generator -> input patterns for power estimation
Slide 16: Toggles on DataBus 1 for Trick [chart]
Slide 17: Toggles on DataBus 2 for Trick [chart]
Slide 18: Toggles on both DataBus 1 and DataBus 2 for Trick [chart]
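The toggle counts in these charts are bit transitions between successive words on a bus, the quantity that drives dynamic bus energy. A minimal C sketch of counting toggles over a bus trace (the trace representation is an assumption):

```c
#include <stdint.h>
#include <stddef.h>

/* Count bit transitions between consecutive words of a bus trace. */
static unsigned long toggle_count(const uint32_t *trace, size_t n)
{
    unsigned long toggles = 0;
    for (size_t i = 1; i < n; i++) {
        uint32_t diff = trace[i] ^ trace[i - 1];  /* bits that changed */
        while (diff) {                            /* popcount of diff  */
            toggles += diff & 1u;
            diff >>= 1;
        }
    }
    return toggles;
}
```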
Slide 19: System energy savings [chart]
Slide 20: System performance gain [chart]
Slide 21: Conclusions
- Energy savings between 22% and 82%
- For some cache sizes we get a performance advantage of 6% to 68%
- We reduce executable code size down to 55% of its original size
- Target: embedded systems running only one application
Slide 22: Thank you!