Title: General-Purpose Processor Huffman Encoding Extension
1 General-Purpose Processor Huffman Encoding Extension
- Stephan Wong, Sorin Cotofana,
- Stamatis Vassiliadis
- Computer Engineering Laboratory
- Electrical Engineering
- TU Delft
ITCC 2000, 3-2000, Las Vegas
2 Outline
- Introduction
- Assumptions and Research Question
- Architectural Extension
- Simulation Results
- Conclusions
3 Introduction
Topic: Compute real-time video using General-Purpose Processors (GPPs)
Problem: High-resolution video processing can be problematic with available GPPs
Goal: Increase GPP video processing performance!
Add specialized units to GPPs!
Huffman Coding
- used very often in video coding
- difficult to parallelize
- up to 20% of computations in video coding
4 Assumptions and Research Question
- Out-of-order superscalar General-Purpose Processor
- Extend GPP with new (multimedia) instructions
- Extend GPP with a Multimedia Functional Unit
In our case: addition of a Huffman Coding unit (called the Huffman Coding Functional Unit, HFU)
Research question: What are the implications of adding an HFU to a GPP?
5 Basic Steps in Picture Coding
[Figure: steps in picture coding, from the digital picture to Huffman Coding]
6 Processor Organization and Architecture
- Add a Huffman Coding Functional Unit (HFU)
- Add 3 new instructions
- LoadBlock: loads an 8x8 block of quantized DCT coefficients
- HEncode: performs the actual Huffman encoding
- WriteOutput: writes output of Huffman Coding to main memory
[Figure: processor organization with Instruction Fetch, Decode, Issue, functional units (FU, FU, ..., HFU), and Memory]
7 HFU Organization
[Figure: HFU connected to the Register File and Memory]
Instruction format
- LoadBlock (r2)imm
- HEncode r3, r4
- WriteOutput (r5)imm
8 Code Example
/* Load the block. */
load r2, starting_block_address
LoadBlock (r2)imm
/* Perform the actual coding. */
load r3, previous_dc_value
load r4, lum_or_chrom
HEncode r3, r4
/* Write output to memory. */
load r5, write_address
WriteOutput (r5)0
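The slides do not spell out what HEncode does internally. As a rough software sketch, assuming JPEG-baseline-style entropy coding (the kind of coding the ijpeg benchmark performs), the per-block symbol generation that precedes the Huffman table lookup could look like the following. The names `bit_size` and `block_symbols` and the tuple layout are hypothetical, and the table selection implied by `lum_or_chrom` is not modelled.

```python
def bit_size(value):
    """Number of bits needed to represent |value| (the JPEG 'category')."""
    return abs(value).bit_length()


def block_symbols(coeffs, previous_dc):
    """Turn one block of quantized coefficients into entropy-coding symbols.

    coeffs: 64 quantized DCT coefficients, assumed to be in zigzag order.
    previous_dc: DC value of the previous block (the r3 operand in the slides).
    Returns tuples of the form (kind, zero_run, size, value).
    """
    # The DC coefficient is coded as a difference from the previous block's DC.
    diff = coeffs[0] - previous_dc
    symbols = [("DC", 0, bit_size(diff), diff)]

    run = 0  # number of zero AC coefficients seen since the last non-zero one
    for c in coeffs[1:]:
        if c == 0:
            run += 1
            continue
        # A run longer than 15 zeros is split using "16 zeros" (ZRL) symbols.
        while run > 15:
            symbols.append(("ZRL", 15, 0, 0))
            run -= 16
        symbols.append(("AC", run, bit_size(c), c))
        run = 0

    # Trailing zeros are summarized by a single end-of-block symbol.
    if run > 0:
        symbols.append(("EOB", 0, 0, 0))
    return symbols
```

In this sketch each (zero_run, size) pair would then be mapped to a Huffman code, followed by `size` extra bits for the value itself; a hardwired unit can do that mapping without the data-dependent branches that the software loop needs.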
9 Simulation Environment
- 4-way superscalar, out-of-order GPP architecture
- 4 integer ALUs, 1 integer MULT/DIV-unit, 4 FP adders
- 1 FP MULT/DIV-unit, 2 memory ports
- L1 data cache organization (LRU)
- 16 KB: 128 sets, 4-way associative, block size of 32 bytes
- L2 unified cache organization (LRU)
- 256 KB: 1024 sets, 4-way associative, block size of 64 bytes
- sim-outorder simulator modified to support extensions
- ijpeg benchmark and modified ijpeg benchmark
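As a quick consistency check on the cache parameters listed above, capacity is the product of sets, associativity, and block size. The helper name `cache_size_bytes` is only illustrative.

```python
def cache_size_bytes(sets, ways, block_size):
    """Capacity of a set-associative cache in bytes."""
    return sets * ways * block_size


l1_bytes = cache_size_bytes(sets=128, ways=4, block_size=32)
l2_bytes = cache_size_bytes(sets=1024, ways=4, block_size=64)

print(l1_bytes // 1024, "KB")  # 16 KB
print(l2_bytes // 1024, "KB")  # 256 KB

# One 8x8 block of 2-byte coefficients occupies 128 bytes,
# i.e. four 32-byte L1 blocks.
block_bytes = 8 * 8 * 2
print(block_bytes // 32, "L1 blocks")  # 4 L1 blocks
```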
10 Benchmark Issues
- Original benchmark uses a function called emit_bits
- storing results of Huffman Coding one by one
- This prohibits the use of the WriteOutput instruction, because either
- the architecture would need to be adapted to the benchmark, or
- the original benchmark would need to be completely rewritten
- Instead, another instruction is used → MoveResult instruction
- storing the results now requires a while-loop
- each result is stored using two numbers
- one bit is used to control the while-loop
This introduces a penalty compared to the WriteOutput instruction!
→ The results presented here are a worst-case scenario for the proposal.
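The while-loop described above can be pictured with a small software model. This is only a sketch under stated assumptions: the slides do not define the register roles of MoveResult, so the model assumes it returns a code, its length in bits, and a one-bit "more results" flag. `HFUModel`, `move_result`, and `drain_results` are hypothetical names, and `emit_bits` is passed in as a plain two-argument callback rather than the benchmark's real function.

```python
from collections import deque


class HFUModel:
    """Toy stand-in for the HFU's result buffer after HEncode has run."""

    def __init__(self, results):
        # Each result is stored as two numbers: (code, size_in_bits).
        self._queue = deque(results)

    def move_result(self):
        """Model of MoveResult: pop one result and report whether more remain."""
        code, size = self._queue.popleft()
        more = 1 if self._queue else 0  # the one bit that controls the loop
        return code, size, more


def drain_results(hfu, emit_bits):
    """Store results one by one, as the unmodified benchmark expects.

    Assumes HEncode produced at least one result for the block.
    """
    while True:
        code, size, more = hfu.move_result()
        emit_bits(code, size)
        if not more:
            break


emitted = []
drain_results(HFUModel([(0b101, 3), (0b0, 1), (0b1110, 4)]),
              lambda code, size: emitted.append((code, size)))
print(emitted)  # [(5, 3), (0, 1), (14, 4)]
```

Every iteration costs one MoveResult, one emit_bits call, and one branch, which is the penalty that a single WriteOutput would avoid.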
11 Simulated HFU Organization
[Figure: HFU connected to the Register File and Memory]
Instruction format
- LoadBlock (r2)imm
- HEncode r3, r4
- WriteOutput (r5)imm
- MoveResult r6,r7,r8
12 Modified Code Example
/* Load the block. */
load r2, starting_block_address
LoadBlock (r2)imm
/* Perform the actual coding. */
load r3, previous_dc_value
load r4, lum_or_chrom
HEncode r3, r4
13 Simulation Results
- Total Number of Execution Cycles (TNEC)
- average decrease is between 6.3% and 7.4%
- Total number of instructions
- average decrease is about 5%
- Total number of branches
- average decrease is about 14%
14 Recalculating TNEC Results
- Determine number of calls to emit_bits
- Conservative assumption of penalty in clock cycles
- Subtract from the original TNEC value
new TNEC = Original TNEC - (number of emit_bits calls × penalty per call)
- Recalculate the decreases in TNEC
- average decrease is between 9% and 12%
Assumption: no specialized units for DCT and Quantization!
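The recalculation above is simple arithmetic and can be sketched as follows. The function names are hypothetical, and the input numbers are purely illustrative placeholders chosen to make the arithmetic easy to follow; they are not measurements from the simulations.

```python
def recalculated_tnec(measured_tnec, emit_bits_calls, penalty_cycles):
    """Remove the assumed per-call emit_bits penalty from a measured TNEC."""
    return measured_tnec - emit_bits_calls * penalty_cycles


def percent_decrease(baseline_tnec, new_tnec):
    """Relative cycle reduction with respect to the unmodified benchmark."""
    return 100.0 * (baseline_tnec - new_tnec) / baseline_tnec


# Illustrative values only (not taken from the slides).
baseline = 1_000_000   # cycles, original benchmark without HFU
measured = 930_000     # cycles, HFU version that still uses MoveResult
calls = 1_000          # number of emit_bits calls
penalty = 30           # assumed penalty per call, in cycles

new_tnec = recalculated_tnec(measured, calls, penalty)
print(percent_decrease(baseline, measured))  # 7.0
print(percent_decrease(baseline, new_tnec))  # 10.0
```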
15 What if?
What if DCT and Quantization are hardwired to
improve performance?
16 Conclusions
- Proposed hardwired Huffman Coding unit
- 3 new instructions
- Potential improvement between 6% and 12%
- Number of branches decreased by 14%
- Number of instructions decreased by 5%
- Hardware requirements are similar to adding 1-2k bytes of memory and control logic
- Future Work
- What is the impact on video processing performance when extending a GPP with FPGA MFUs?
17 More information
http://www.tudelft.nl
http://cardit.et.tudelft.nl
http://cardit.et.tudelft.nl/molen
18 Memory Model
Loading an 8x8 block of quantized DCT coefficients
- sim-outorder simulator
- Memory bandwidth is 8 bytes per cycle
- Each coefficient is represented by two bytes
LoadBlock can load up to 4 coefficients per cycle
Assumed load-latency (total_lat = 0):
1. Determine number of hits before miss/end-of-load
2. Divide by 4 and add to total_lat
3. Add miss latency to total_lat (if any)
4. If end-of-load, STOP. Else, go to step 1.
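The four steps above can be written as a short loop. This is a sketch, not the simulator's code: the function name `load_block_latency` and the hit/miss list are hypothetical, the division by 4 is rounded up, and a missing coefficient is assumed to be delivered as part of the miss latency. The slides do not settle either of those two points.

```python
from math import ceil

COEFFS_PER_CYCLE = 4  # 8 bytes per cycle / 2 bytes per coefficient


def load_block_latency(hits, miss_latency):
    """Estimate LoadBlock latency in cycles.

    hits: one boolean per coefficient, True if the access hits in the cache.
    miss_latency: cycles charged for each miss.
    """
    total_lat = 0
    i = 0
    n = len(hits)
    while i < n:
        # Step 1: count the hits before the next miss or the end of the load.
        run = 0
        while i < n and hits[i]:
            run += 1
            i += 1
        # Step 2: hits stream in at 4 coefficients per cycle (rounded up).
        total_lat += ceil(run / COEFFS_PER_CYCLE)
        # Step 3: if the run stopped on a miss, charge the miss latency.
        if i < n:
            total_lat += miss_latency
            i += 1  # assumption: the missing coefficient arrives with the miss
        # Step 4: the outer loop stops at end-of-load, otherwise repeats.
    return total_lat


all_hits = [True] * 64
print(load_block_latency(all_hits, miss_latency=10))  # 16
```

With all 64 coefficients hitting, the model gives 64 / 4 = 16 cycles, which is the lower bound implied by the 8-bytes-per-cycle bandwidth.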
19 ijpeg benchmark
- New Instructions Utilization
-compression.quality 70 -compression.smoothing_factor 0 -compression.optimize_coding 0 -verbose 1 -GO.compress
- Two benchmarks
- original benchmark
- updated benchmark using previous code example
- Two input pictures
- test picture: 33,124 bytes
- ref picture: 2,113,595 bytes
22 Simulated HFU Organization
[Figure: HFU connected to the Register File and Memory]
Instruction format
- LoadBlock (r2)imm
- HEncode r3, r4
- MoveResult r6,r7,r8