Code Compression for Low Power Embedded System Design
1
Code Compression for Low Power Embedded System Design
  • Haris Lekatsas (Princeton University)
  • Joerg Henkel (NEC USA)
  • Wayne Wolf (Princeton University)

2
Outline
  • Introduction: code compression requirements
  • Other approaches
  • Architectural exploration
  • Pre-cache architecture
  • Post-cache architecture
  • Bus compaction techniques
  • Experimental results: toggle counts, performance,
    and energy savings
  • Conclusions

3
What is code compression?
[Diagram: program memory holds compressed software; a decompression engine expands it in hardware on its way to the CPU. The trade-off: add hardware, reduce software code size.]
4
Code compression requirements
  • Random Access
  • Start decompression at block boundaries
  • Resynchronization points restart FSM
  • Byte Alignment
  • Faster Decoding
  • More compact indexing
  • Indexing
  • LAT
  • Patching branch offsets

[Figure: compressed blocks block1-block4 laid out against byte boundaries b1-b4; without byte alignment a block such as block2 straddles a boundary, which complicates indexing and decoding.]
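A minimal sketch of the LAT lookup in C, assuming fixed 32-byte uncompressed blocks; the block size and all names are illustrative, not from the presentation:

    #include <stdint.h>

    #define BLOCK_BYTES 32u  /* uncompressed block size (assumed) */

    /* lat[i] holds the byte offset at which compressed block i starts
     * in the compressed instruction segment. Byte alignment keeps the
     * entries small and lets decoding start on a byte boundary. */
    uint32_t lat_lookup(const uint32_t *lat, uint32_t uncompressed_addr)
    {
        uint32_t block = uncompressed_addr / BLOCK_BYTES;
        return lat[block];
    }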
5
Prior art
  • Statistical coding methods
    - Kozuch and Wolfe (ICCD 1994): Huffman coding
  • Dictionary coding methods
    - Liao et al. (ARVLSI 1995): can be done completely in software; gives only modest compression
    - Lefurgy et al. (MICRO-30, 1997): decompression is done at instruction fetch; compression ratio around 60% for PowerPC
  • Industry: essentially new instruction sets
    - ARM's Thumb: compression ratio 60-70%
    - MIPS16: compression ratio 60-70%
  • For low power: Yoshida et al. (ISLPED '97), Benini et al. (ISLPED '99)

6
Contributions
  • Objectives
    - Reduce power consumption while maintaining or even improving performance
  • Techniques
    - Post-cache decompression
    - Bus encoding schemes
    - Efficient instruction compression

7
Algorithm overview
[Flow: instruction segment → Markov modeling (Lekatsas and Wolf, TCAD Dec. '99) → encoding process and decoding-table creation → 4-phase compression, including branch compression and branch-offset patching → compressed instruction segment]
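As a flavor of the Markov-modeling step, here is a toy byte-level transition counter in C; the actual algorithm (Lekatsas and Wolf, TCAD Dec. '99) models the instruction stream at a finer granularity to drive the encoder, so treat this as an illustration only:

    #include <stddef.h>
    #include <stdint.h>

    /* counts[prev][next]: how often byte value `next` follows `prev`
     * in the instruction segment; normalizing each row yields the
     * transition probabilities of a first-order Markov model. */
    static uint32_t counts[256][256];

    void build_markov_model(const uint8_t *code, size_t len)
    {
        for (size_t i = 1; i < len; i++)
            counts[code[i - 1]][code[i]]++;
    }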
8
Instruction grouping
  • Group 1: instructions with immediates; code 0; 53.3% of instructions; bps 8
  • Group 2: branches; code 11; 26.1%; bps 32
  • Group 3: fast dictionary (no immediate fields); code 100; 20.0%; bps 32 or 64
  • Group 4: uncompressed instructions; code 101; 0.6%; bps 32 (the four codes form a prefix code; see the decoder sketch below)
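Because the group codes 0, 11, 100, and 101 are prefix-free, the decompressor can classify each instruction by reading at most three bits. A minimal decoder sketch in C, with get_bit standing in for a hypothetical bitstream reader:

    typedef enum { GROUP1, GROUP2, GROUP3, GROUP4 } group_t;

    /* Reads the variable-length group code from the compressed stream. */
    group_t read_group(int (*get_bit)(void))
    {
        if (get_bit() == 0)              /* code 0:   immediates       */
            return GROUP1;
        if (get_bit() == 1)              /* code 11:  branches         */
            return GROUP2;
        return get_bit() == 0 ? GROUP3   /* code 100: fast dictionary  */
                              : GROUP4;  /* code 101: uncompressed     */
    }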

9
SPARC branch compression
Original SPARC branch format (32 bits):
  bits 31-30: op | bit 29: a | bits 28-25: cond | bits 24-22: op2 | bits 21-0: 22-bit displacement

Compressed branch format (25 bits):
  bits 24-23: NB | bit 22: a | bits 21-18: cond | bits 17-16: offset size | bits 15-0: 16-bit offset

  • The added NB field (bits 24-23) tells the decompressor that this is a branch.
  • The added field at bits 17-16 specifies how many offset bytes follow.
  • We avoid an NP-complete problem by estimating the offset size.
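A small C sketch of unpacking the compressed branch fields at the bit positions given above; the struct and function names are illustrative:

    #include <stdint.h>

    typedef struct {
        unsigned nb;        /* bits 24-23: marks a branch for the decompressor */
        unsigned a;         /* bit  22:    annul bit                           */
        unsigned cond;      /* bits 21-18: condition code                      */
        unsigned off_bytes; /* bits 17-16: number of offset bytes              */
        unsigned offset;    /* bits 15-0:  16-bit branch offset                */
    } cbranch_t;

    cbranch_t unpack_branch(uint32_t w)
    {
        cbranch_t b;
        b.nb        = (w >> 23) & 0x3u;
        b.a         = (w >> 22) & 0x1u;
        b.cond      = (w >> 18) & 0xFu;
        b.off_bytes = (w >> 16) & 0x3u;
        b.offset    =  w        & 0xFFFFu;
        return b;
    }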
10
Compression ratios
Average compression ratio: 55% (chart)
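For reference, the ratio follows the definition implied by the conclusions slide (compressed size as a fraction of original size, so lower is better):

$$\text{compression ratio} = \frac{\text{compressed size}}{\text{original size}}$$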
11
Previous work: pre-cache architecture
[Diagram: a 32-bit address bus runs from the CPU to main memory; DataBus 2 (32 bits) carries compressed code from main memory to the decompression engine, and DataBus 1 (32 bits) carries uncompressed code from the engine to the I-cache and D-cache and on to the CPU.]
  • Decompression only on a cache miss
  • Decoding can overlap memory access
  • DataBus 2 carries compressed code
  • DataBus 1 carries uncompressed code

  • Decoding speed is not as critical
  • Fewer transitions means energy savings (see the toggle-count sketch below)
  • No gain on DataBus 1
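The transition counts behind this claim, and behind the toggle charts later in the deck, can be measured as bit flips between consecutive bus words. A minimal, portable counter in C (illustrative; not the instrumentation used in the experiments):

    #include <stddef.h>
    #include <stdint.h>

    /* Counts bit transitions between consecutive 32-bit bus words;
     * each toggle charges the bus capacitance, so fewer toggles
     * means lower bus energy. */
    unsigned count_toggles(const uint32_t *words, size_t n)
    {
        unsigned toggles = 0;
        for (size_t i = 1; i < n; i++) {
            uint32_t diff = words[i] ^ words[i - 1];
            while (diff) {
                toggles += diff & 1u;
                diff >>= 1;
            }
        }
        return toggles;
    }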
12
Our new approach: post-cache architecture
[Diagram: a 32-bit address bus runs from the CPU to main memory; DataBus 2 (32 bits) carries compressed code from main memory into the I-cache and D-cache, and DataBus 1 (32 bits) carries compressed code from the caches to the decompression engine, which now sits next to the CPU.]
  • Fewer memory accesses
  • Less traffic on both DataBus 1 and DataBus 2
  • Reduced cache misses can increase performance (the I-cache stores compressed code, so at a 55% compression ratio it effectively holds nearly twice as many instructions)

Even less energy consumed
13
Bus compaction approaches
[Diagram: two layouts of the same compressed instructions in 32-bit bus words. Without packing, each word holds one instruction and the high-order bits go unused (e.g., bits 22-31 after instr1, bits 17-31 after instr3, bits 13-31 after the continuation of instr2). With packing, instructions are concatenated and may straddle word boundaries (instr2 continues into the next word), so no bus bits are wasted; see the packer sketch below.]
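A toy packer in C showing the "with packing" idea: variable-length compressed instructions are concatenated into 32-bit bus words instead of each starting a fresh word. emit_word is a hypothetical sink for completed words, not part of the presentation:

    #include <stdint.h>

    static uint64_t acc;       /* bit accumulator                          */
    static unsigned acc_bits;  /* buffered bits (always < 32 on entry)     */

    /* Appends the n low bits of `bits` (n <= 32) to the stream and emits
     * every completed 32-bit bus word, so instructions may straddle word
     * boundaries and no bus bits go unused. */
    void pack_bits(uint32_t bits, unsigned n, void (*emit_word)(uint32_t))
    {
        acc = (acc << n) | (bits & (uint32_t)((1ull << n) - 1));
        acc_bits += n;
        while (acc_bits >= 32) {
            acc_bits -= 32;
            emit_word((uint32_t)(acc >> acc_bits));
        }
    }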
14
Experimental setup and assumptions
  • Design-space exploration using the Avalanche framework by Li and Henkel (DAC '98)
  • CPU performance (cycle counts) and energy measured with the sparcsim simulator for the SPARClite processor (integrated into the Avalanche framework)
  • We assume an SOC comprising a CPU, I-cache, D-cache, main memory, buses, and a decompression engine

15
Decompression engine power estimation
[Tool flow: the executable program is split into an instruction segment (instruction extraction), which feeds the software compressor to produce the compressed program and the decoding table, and into an execution trace (QPT trace generation), which my_dinero turns into hit/miss information. The decoder, specified in BDL together with its decoding table, is synthesized by Cyber into VHDL code and by dc_shell into VHDL/Verilog netlists. A pattern generator derives input patterns, and Opencad performs the power estimation.]
16
Toggles on DataBus 1 for Trick (chart)
17
Toggles on DataBus 2 for Trick (chart)
18
Toggles on both DataBus 1 and DataBus 2 for Trick (chart)
19
System energy savings (chart)
20
System performance gain (chart)
21
Conclusions
  • Energy savings between 22% and 82%
  • For some cache sizes we also gain performance: 6% to 68%
  • We reduce executable code size to 55% of its original size
  • Target domain: embedded systems running one application only

22
Thank you!