Title: A Decompression Architecture for Low Power Embedded Systems
1 A Decompression Architecture for Low Power Embedded Systems
Yi-hsin Tseng, 11/06/2007
- Lekatsas, H., Henkel, J., Wolf, W.
- Proceedings of the 2000 International Conference on Computer Design (ICCD 2000), IEEE
2 Outline
- Introduction / Motivation
- Code Compression Architecture
- Decompression Engine Design
- Experimental results
- Conclusion / Contributions of the paper
- Our project
- Relation to CSE520
- Q & A
3 Introduction / Motivation
4 For Embedded Systems
- Embedded system architectures are more complicated nowadays.
- Available memory space is smaller.
- A reduced executable program can also indirectly affect the chip's
- Size
- Weight
- Power consumption
5 Why Code Compression/Decompression?
- Compress the instruction segment of the executable running on the embedded system
- Reduces the memory requirements and bus transaction overheads
- Compression (offline) -> Decompression (at runtime)
6 Related Work on Compressed Instructions
- A logarithmic-based compression scheme where 32-bit instructions map to fixed but smaller-width compressed instructions
- (that system considers the memory area only)
- Frequently appearing instructions are compressed to 8 bits
- (fixed length: 8 or 32 bits)
7 The Compression Method in This Paper
- Gives comprehensive results for the whole system, including
- buses
- memories (main memory and cache)
- decompression unit
- CPU
8 Code Compression Architecture
9 Architecture in This System (Post-cache)
- Reasons:
- Increases the effective cache size
- Improves instruction bandwidth
10 Code Compression Architecture
- Uses SAMC (Semiadaptive Markov Compression) to compress instructions
- Divides instructions into 4 groups
- based on the SPARC architecture
- A short 3-bit code is prepended to the beginning of each compressed instruction (see the sketch below)
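The slides state only that a short 3-bit code marks the group of each compressed instruction. As a hedged illustration (not the paper's actual encoder), the C sketch below shows one way such a tagged, variable-length bit stream could be written out; bitstream_t, put_bits, and emit_compressed are hypothetical names, and the SAMC payload encoding itself is not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical bit-stream writer: each compressed instruction is framed
 * as a 3-bit group tag followed by a variable-length compressed payload. */
typedef struct {
    uint8_t *buf;     /* output buffer                 */
    size_t   bitpos;  /* next free bit position in buf */
} bitstream_t;

/* Append the 'nbits' low-order bits of 'value', most significant bit first. */
static void put_bits(bitstream_t *bs, uint32_t value, unsigned nbits)
{
    for (int i = (int)nbits - 1; i >= 0; i--) {
        size_t   byte = bs->bitpos >> 3;
        unsigned off  = 7u - (bs->bitpos & 7u);
        if ((value >> i) & 1u)
            bs->buf[byte] |= (uint8_t)(1u << off);
        else
            bs->buf[byte] &= (uint8_t)~(1u << off);
        bs->bitpos++;
    }
}

/* Emit one compressed instruction: the 3-bit group tag, then its payload. */
static void emit_compressed(bitstream_t *bs, unsigned group_tag,
                            uint32_t payload, unsigned payload_bits)
{
    put_bits(bs, group_tag & 0x7u, 3);    /* which group (tag values assumed)  */
    put_bits(bs, payload, payload_bits);  /* the SAMC-coded instruction bits   */
}
```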
11 4 Groups of Instructions
- Group 1
- instructions with immediates
- Ex: sub %i1, 2, %g3 / set 5000, %g2
- Group 2
- branch instructions
- Ex: be, bne, bl, bg, ...
- Group 3
- instructions with no immediates
- Ex: add %o1, %o2, %g3 / st %g1, [%o2]
- Group 4
- instructions that are left uncompressed (a classification sketch follows below)
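For illustration only, here is a rough C sketch of how a 32-bit SPARC V8 instruction word might be sorted into these groups using the standard V8 format fields (op in bits 31:30, op2 in bits 24:22, the i bit in bit 13). The exact mapping below, including treating call as a branch and using group 4 as a compressor-chosen fallback, is an assumption rather than the paper's stated rule.

```c
#include <stdint.h>

enum group { GROUP_IMMEDIATE = 1, GROUP_BRANCH = 2,
             GROUP_NO_IMMEDIATE = 3, GROUP_UNCOMPRESSED = 4 };

/* Rough classification of a SPARC V8 instruction word into the four
 * groups named on this slide. Group 4 is really a compressor decision
 * (instructions it chooses to leave uncompressed), not an encoding class. */
static enum group classify(uint32_t insn)
{
    unsigned op   = (insn >> 30) & 0x3;   /* major format field          */
    unsigned op2  = (insn >> 22) & 0x7;   /* format-2 sub-opcode         */
    unsigned ibit = (insn >> 13) & 0x1;   /* immediate selector, format 3 */

    if (op == 0) {                        /* format 2                    */
        if (op2 == 2 || op2 == 6 || op2 == 7)
            return GROUP_BRANCH;          /* Bicc, FBfcc, CBccc          */
        return GROUP_IMMEDIATE;           /* SETHI etc. carry an immediate */
    }
    if (op == 1)                          /* CALL: PC-relative target    */
        return GROUP_BRANCH;
    /* op == 2 or 3: arithmetic/logical or load/store (format 3)         */
    return ibit ? GROUP_IMMEDIATE : GROUP_NO_IMMEDIATE;
}
```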
12 Decompression Engine Design (Approach)
13 The Key Idea
- Present an architecture for embedded systems that decompresses offline-compressed instructions during runtime
- to reduce power consumption
- with a performance improvement (in most cases)
14 Pipelined Design
15 Pipelined Design (cont.)
16 Pipelined Design: group 1 (stage 1)
- Index the decoding table
- Input compressed instructions
- Forward instructions
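The slide names three stage-1 activities (input compressed bits, index the decoding table, forward the result) but gives no table layout, so the C sketch below is only a guess at what that indexing step could look like. DEC_INDEX_BITS, dec_entry_t, and the fixed-width index are assumptions; real SAMC decoding walks a Markov model bit by bit.

```c
#include <stdint.h>

#define DEC_INDEX_BITS 8   /* width of the table index (assumed) */

/* One decoding-table entry handed from stage 1 to the later stages. */
typedef struct {
    uint32_t template_word;  /* partially decoded instruction bits       */
    uint8_t  consumed_bits;  /* compressed bits this entry accounts for  */
} dec_entry_t;

static dec_entry_t dec_table[1u << DEC_INDEX_BITS];  /* built offline per application */

/* Read 'n' bits of the compressed stream starting at bit position 'pos'. */
static uint32_t peek_bits(const uint8_t *stream, uint32_t pos, unsigned n)
{
    uint32_t v = 0;
    for (unsigned i = 0; i < n; i++) {
        uint32_t p = pos + i;
        v = (v << 1) | ((stream[p >> 3] >> (7u - (p & 7u))) & 1u);
    }
    return v;
}

/* Stage 1: skip the 3-bit group tag, index the decoding table with the
 * following compressed bits, and forward the selected entry.             */
static dec_entry_t stage1_index(const uint8_t *stream, uint32_t *bitpos)
{
    *bitpos += 3;  /* the group tag already identified this as group 1 */
    uint32_t idx = peek_bits(stream, *bitpos, DEC_INDEX_BITS);
    return dec_table[idx];
}
```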
17 Pipelined Design: group 1 (stage 2)
18 Pipelined Design: group 1 (stage 3)
19 Pipelined Design: group 1 (stage 4)
20 Pipelined Design: group 2, branch instructions (stage 1)
21 Pipelined Design: group 2, branch instructions (stage 2)
22 Pipelined Design: group 2, branch instructions (stage 3)
23 Pipelined Design: group 2, branch instructions (stage 4)
24 Pipelined Design: group 3, instructions with no immediates (stage 1)
- No-immediate instructions may appear in pairs -> a pair is compressed into one byte (one byte <-> 64 bits)
- 256-entry table
- 8 bits used as the index to address the table
(a lookup sketch follows below)
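Based on what this slide states (a pair of no-immediate instructions compressed into one byte and expanded through a 256-entry table back to 64 bits), a minimal lookup sketch in C could look like the following. fast_dictionary and decompress_pair are illustrative names, and the ordering of the two instructions inside the 64-bit entry is an assumption.

```c
#include <stdint.h>

/* Group 3 fast dictionary: one compressed byte indexes a 256-entry table
 * whose entry holds the two original 32-bit no-immediate instructions.
 * Table contents are built offline for each application.                 */
static uint64_t fast_dictionary[256];   /* filled by the compressor */

/* Expand one compressed byte into the two original SPARC instructions. */
static void decompress_pair(uint8_t index, uint32_t *first, uint32_t *second)
{
    uint64_t pair = fast_dictionary[index];
    *first  = (uint32_t)(pair >> 32);          /* earlier instruction (assumed order) */
    *second = (uint32_t)(pair & 0xFFFFFFFFu);  /* later instruction                   */
}
```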
25 Pipelined Design: group 3, instructions with no immediates (stage 2)
26 Pipelined Design: group 3, instructions with no immediates (stage 3)
27 Pipelined Design: group 3, instructions with no immediates (stage 4)
28 Pipelined Design: group 4, uncompressed instructions
29 Experimental Results
30 Experimental Results
- Different applications are used:
- an algorithm for computing 3D vectors for a motion picture ("i3d")
- a complete MPEG-II encoder ("mpeg")
- a smoothing algorithm for digital images ("smo")
- a trick animation algorithm ("trick")
- A simulation tool written in C is used for obtaining performance data for the decompression engine
31 Experimental Results (cont.)
- The decompression engine is application-specific.
- For each application, a decoding table and a fast dictionary table are built that will decompress that particular application only (a table-construction sketch follows below).
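The slides say these tables are built offline for each application, but not how. A minimal sketch of one plausible way to fill the 256-entry group 3 fast dictionary, by keeping the most frequent adjacent no-immediate instruction pairs, is shown below; the selection policy, function name, and data layout are assumptions, not the authors' tool.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef struct { uint64_t pair; uint32_t count; } pair_count_t;

static int cmp_u64(const void *a, const void *b)
{
    uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
    return (x > y) - (x < y);
}

static int cmp_count_desc(const void *a, const void *b)
{
    const pair_count_t *pa = a, *pb = b;
    return (pb->count > pa->count) - (pb->count < pa->count);
}

/* pairs[]   : every adjacent no-immediate instruction pair found in the program
 * npairs    : number of such pairs
 * dict[256] : output fast dictionary (most frequent pairs get the indices)      */
void build_fast_dictionary(uint64_t *pairs, size_t npairs, uint64_t dict[256])
{
    pair_count_t *pc = malloc(npairs * sizeof *pc);
    size_t nunique = 0;
    if (pc == NULL || npairs == 0) { free(pc); return; }

    /* Sort so identical pairs are adjacent, then count each run. */
    qsort(pairs, npairs, sizeof *pairs, cmp_u64);
    for (size_t i = 0; i < npairs; ) {
        size_t j = i;
        while (j < npairs && pairs[j] == pairs[i]) j++;
        pc[nunique].pair  = pairs[i];
        pc[nunique].count = (uint32_t)(j - i);
        nunique++;
        i = j;
    }

    /* Keep the 256 most frequent pairs; each dictionary index decodes to one pair. */
    qsort(pc, nunique, sizeof *pc, cmp_count_desc);
    memset(dict, 0, 256 * sizeof dict[0]);
    for (size_t k = 0; k < 256 && k < nunique; k++)
        dict[k] = pc[k].pair;

    free(pc);
}
```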
32 Experimental Results for Energy and Performance
33 Why Worse Performance on "smo" with a 512-byte Instruction Cache?
- It does not require large memory (it executes in tight loops).
- It generates very few misses at this cache size.
- (The compressed architecture therefore does not help an already almost perfect hit ratio, and the slowdown caused by the decompression engine prevails.)
34 Conclusion / Contributions of the Paper
- This paper designed an instruction decompression engine as a soft IP core for low-power embedded systems.
- Applications run faster compared to systems with no code compression (due to improved cache performance).
- Lower power consumption (due to smaller memory requirements for the executable program and a smaller number of memory accesses).
35 Relation to CSE520
- Improves system performance and power consumption by using a pipelined architecture in the decompression engine.
- A different architecture design for lower power consumption in embedded systems.
- A smaller cache performs better on the compressed architecture; a larger cache performs better on the uncompressed architecture.
- Cache hit ratio
36 Our Project
- Goal
- How to improve the efficiency of power management in embedded multicore systems
- Idea
- Use different power modes within a given power budget, with a global power management policy (in Jun Shen's presentation)
- Use the SAMC algorithm and this decompression architecture as another factor to simulate (this paper)
- How?
- SimpleScalar tool set
- Try a simple function at first, then try the different power modes
37 Thank You! Q & A
38 Backup Slides
39 Critique
- The decompression engine will slow down the system if the cache generates very few misses at a given cache size.
40 Post-cache vs. Pre-cache
- Pre-cache: the instructions stored in the I-cache are already decompressed.
- Post-cache: the instructions stored in the I-cache are still compressed; decompression happens between the I-cache and the CPU.
41 Problems for the Post-cache Architecture
- Memory relocation
- The compression changes the instruction locations in memory.
- In the pre-cache architecture
- Decompression is done before instructions are fetched into the I-cache, so the addresses in the I-cache need not be fixed (an address-translation sketch follows below).
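The slides only name the relocation problem; they do not show how it is resolved in the post-cache case. One common approach in compressed-code systems (not necessarily the one used in this paper) is a line address table that maps each original cache-line address to the offset of its compressed form, consulted only on I-cache misses. The C sketch below, with assumed names and sizes, illustrates that idea.

```c
#include <stdint.h>

#define LINE_BYTES   32u              /* uncompressed I-cache line size (assumed) */
#define NUM_LINES    4096u            /* lines covered by the table (assumed)     */

/* line_addr_table[i] holds the byte offset of the compressed form of
 * uncompressed cache line i, built offline together with the code image. */
static uint32_t line_addr_table[NUM_LINES];

/* Translate an original instruction address (as seen by the CPU and the
 * I-cache) into the address of the compressed data in main memory.        */
static uint32_t to_compressed_addr(uint32_t original_addr,
                                   uint32_t text_base,
                                   uint32_t compressed_base)
{
    uint32_t line = (original_addr - text_base) / LINE_BYTES;
    return compressed_base + line_addr_table[line];
}
```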
42 SPARC Instruction Set
- Instruction groups
- load/store (ld, st, ...)
- Move data from memory to a register / move data from a register to memory
- integer arithmetic (add, sub, ...)
- Arithmetic operations on data in registers
- bit-wise logical (and, or, xor, ...)
- Logical operations on data in registers
- bit-wise shift (sll, srl, ...)
- Shift bits of data in registers
- integer branch (be, bne, bl, bg, ...)
- Trap (ta, te, ...)
- control transfer (call, save, ...)
- floating point (ldf, stf, fadds, fsubs, ...)
- floating point branch (fbe, fbne, fbl, fbg, ...)
43 SPARC Instruction Example