General-Purpose Processor Huffman Encoding Extension


1
General-Purpose Processor Huffman Encoding
Extension

Stephan Wong, Sorin Cotofana,
Stamatis Vassiliadis
Computer Engineering Laboratory
Electrical Engineering
TU Delft

ITCC 2000, 3-2000, Las Vegas
2
Outline

Introduction
Assumptions and Research Question
Architectural Extension
Simulation Results
Conclusions

3
Introduction

Topic: Compute real-time video using General-Purpose Processors (GPPs)
Problem: High-resolution video processing can be problematic with available GPPs
Goal: Increase GPP video processing performance!
Add specialized units to GPPs!

Huffman Coding
used very often in video coding
difficult to parallelize
up to 20% of computations in video coding
4
Assumptions and Research Question

Out-of-order superscalar General-Purpose Processor
Extend GPP with new (multimedia) instructions
Extend GPP with a Multimedia Functional Unit (MFU)
In our case: addition of a Huffman Coding unit (called the Huffman Coding Functional Unit, HFU)

What are the implications of adding an HFU to a GPP?
5
Basic Steps in Picture Coding
[Figure: picture coding steps; surviving labels: Digital picture, Huffman Coding]
6
Processor Organization and Architecture

Add a Huffman Coding Functional Unit (HFU)

Add 3 new instructions
LoadBlock: loads an 8x8 block of quantized DCT coefficients
HEncode: performs the actual Huffman encoding
WriteOutput: writes the output of Huffman Coding to main memory

[Figure: processor organization; surviving labels: Instruction Fetch, Decode/Issue, FU, FU, ..., HFU, Memory]
7
HFU Organization

[Figure: HFU organization; surviving labels: Register File, Memory, HFU, LoadBlock, HEncode, WriteOutput]

Instruction format
LoadBlock (r2)imm
HEncode r3, r4
WriteOutput (r5)imm
8
Code Example

/* Load the block. */
load r2, starting_block_address
LoadBlock (r2)imm

/* Perform the actual coding. */
load r3, previous_dc_value
load r4, lum_or_chrom
HEncode r3, r4

/* Write output to memory. */
load r5, write_address
WriteOutput (r5)0
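
The previous_dc_value operand of HEncode can be motivated with a small software model. The sketch below assumes the standard baseline JPEG scheme: the DC coefficient is coded as a difference from the previous block's DC value, and the AC coefficients are coded as (zero run, size) pairs in zigzag order. The slides do not describe the HFU internals, so the function names and the symbol format are hypothetical, and the final lookup in the Huffman tables is left out.

```python
def zigzag_order(n=8):
    """Return the (row, col) positions of an n x n block in JPEG zigzag order."""
    # Walk the anti-diagonals: odd ones top-to-bottom, even ones bottom-to-top.
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        key=lambda rc: (rc[0] + rc[1], rc[0] if (rc[0] + rc[1]) % 2 else rc[1]),
    )


def bit_size(value):
    """Number of bits needed for the magnitude of value (the JPEG size category)."""
    return abs(value).bit_length()


def encode_symbols(block, previous_dc):
    """Turn an 8x8 block of quantized coefficients into pre-Huffman symbols.

    Returns (symbols, dc); dc becomes previous_dc for the next block.
    """
    coeffs = [block[r][c] for r, c in zigzag_order(8)]
    dc = coeffs[0]
    diff = dc - previous_dc              # DC is coded differentially
    symbols = [("DC", bit_size(diff), diff)]
    run = 0
    for ac in coeffs[1:]:
        if ac == 0:
            run += 1
            continue
        while run > 15:                  # a run of 16 zeros gets its own symbol
            symbols.append(("AC", 15, 0, 0))
            run -= 16
        symbols.append(("AC", run, bit_size(ac), ac))
        run = 0
    if run > 0:                          # trailing zeros collapse into end-of-block
        symbols.append(("EOB",))
    return symbols, dc
```

Calling encode_symbols once per block and feeding the returned dc into the next call mirrors the load of previous_dc_value into r3 before each HEncode.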
9
Simulation Environment

4-way superscalar, out-of-order GPP architecture
4 integer ALUs, 1 integer MULT/DIV unit, 4 FP adders, 1 FP MULT/DIV unit, 2 memory ports
L1 data cache organization (LRU)
16 KB: 128 sets, 4-way associative, block size of 32 bytes
L2 unified cache organization (LRU)
256 KB: 1024 sets, 4-way associative, block size of 64 bytes
sim-outorder simulator modified to support the extensions
ijpeg benchmark and modified ijpeg benchmark
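
As a quick consistency check of the cache parameters listed for the simulation environment, the capacity of a set-associative cache is sets times associativity times block size. The helper name below is hypothetical; the numbers are the ones from the slide.

```python
def cache_bytes(sets, ways, block_size):
    """Capacity in bytes of a set-associative cache."""
    return sets * ways * block_size


l1_data = cache_bytes(sets=128, ways=4, block_size=32)
l2_unified = cache_bytes(sets=1024, ways=4, block_size=64)
print(l1_data // 1024, "KB L1,", l2_unified // 1024, "KB L2")  # 16 KB L1, 256 KB L2
```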
10
Benchmark Issues

The original benchmark uses a function called emit_bits
storing the results of Huffman Coding one by one
This prohibits the usage of the WriteOutput instruction, because either
the architecture needs to be adapted to the benchmark, or
the original benchmark needs to be completely rewritten
Instead, another instruction is used: the MoveResult instruction
storing the results now requires a while-loop
each result is stored using two numbers
one bit is used to control the while-loop

This introduces a penalty compared to the WriteOutput instruction!
Therefore, the results presented here are a worst-case scenario for the proposal.
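
The while-loop around MoveResult might look roughly like the sketch below. It assumes that the two numbers of each result are a code and its bit length, and that the control bit signals whether more results are pending; the slides do not confirm this register assignment. The FakeHFU class is a hypothetical software stand-in, not the simulated hardware, and emit_bits is passed in as a plain callback.

```python
class FakeHFU:
    """Software stand-in for the HFU result buffer (hypothetical)."""

    def __init__(self, results):
        # (code, size) pairs as they would be left behind by HEncode
        self._results = list(results)

    def move_result(self):
        """Model of MoveResult: return (code, size, more) for the next result."""
        code, size = self._results.pop(0)
        more = 1 if self._results else 0
        return code, size, more


def drain(hfu, emit_bits):
    """Store results one by one, as the original benchmark's emit_bits expects.

    Assumes at least one result is pending after HEncode.
    """
    more = 1
    while more:
        code, size, more = hfu.move_result()
        emit_bits(code, size)
```

Each loop iteration costs at least one MoveResult, one call, and one branch, which is the kind of penalty a single WriteOutput would avoid.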
11
Simulated HFU Organization

[Figure: simulated HFU organization; surviving labels: Register File, Memory, HFU, LoadBlock, HEncode, WriteOutput, MoveResult]

Instruction format
LoadBlock (r2)imm
HEncode r3, r4
WriteOutput (r5)imm
MoveResult r6, r7, r8
12
Modified Code Example

/* Load the block. */
load r2, starting_block_address
LoadBlock (r2)imm

/* Perform the actual coding. */
load r3, previous_dc_value
load r4, lum_or_chrom
HEncode r3, r4
13
Simulation Results

Total Number of Execution Cycles (TNEC)
average decrease is between 6.3% and 7.4%

Total number of instructions
average decrease is about 5%

Total number of branches
average decrease is about 14%
14
Recalculating TNEC Results

Determine the number of calls to emit_bits

Make a conservative assumption of the penalty in clock cycles

Subtract (# of emit_bits calls × penalty) from the original TNEC value to obtain the new TNEC

Recalculate the decreases in TNEC
average decrease is between 9% and 12%

Assumption: no specialized units for DCT and Quantization!
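
One reading of the recalculation can be sketched as follows. It assumes that the penalty is charged once per emit_bits call, that it is removed from the cycle count measured for the modified benchmark, and that the decrease is taken relative to the unmodified benchmark. The function and parameter names are hypothetical, and no measured values from the study are used.

```python
def recalculated_decrease(baseline_tnec, measured_tnec, emit_bits_calls, penalty_cycles):
    """Relative TNEC decrease after removing the assumed emit_bits penalty.

    baseline_tnec:   cycles of the original benchmark without the HFU
    measured_tnec:   cycles of the modified benchmark using MoveResult
    emit_bits_calls: number of calls to emit_bits
    penalty_cycles:  assumed penalty per call, in clock cycles
    """
    new_tnec = measured_tnec - emit_bits_calls * penalty_cycles
    return (baseline_tnec - new_tnec) / baseline_tnec
```

With a penalty of zero the function returns the decrease as measured, so a larger assumed penalty can only raise the estimate.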
15
What if?
What if DCT and Quantization are hardwired to improve performance?
16
Conclusions

Proposed hardwired Huffman Coding unit
3 new instructions

Potential improvement between 6% and 12%

Number of branches decreased by 14%

Number of instructions decreased by 5%

Hardware requirements are similar to adding 1-2k bytes of memory and control logic

Future Work
What is the impact on video processing performance when extending a GPP with FPGA MFUs?
17
More information
http://www.tudelft.nl
http://cardit.et.tudelft.nl
http://cardit.et.tudelft.nl/molen
18
Memory Model
Loading an 8x8 block of quantized DCT coefficients

sim-outorder simulator
Memory bandwidth is 8 bytes per cycle
Each coefficient is represented by two bytes

LoadBlock can load up to 4 coefficients per cycle
Assumed load latency (total_lat = 0):
1. Determine the number of hits before a miss or end-of-load
2. Divide by 4 and add to total_lat
3. Add the miss latency to total_lat (if any)
4. If end-of-load, STOP. Else, go to step 1.
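
The four latency steps can be sketched in a few lines. This is a minimal model under stated assumptions: cache behaviour is given as one hit/miss flag per coefficient, the division by 4 is rounded up, and the miss latency is an input because the slides do not give a value. All names are hypothetical.

```python
import math

COEFFS_PER_CYCLE = 4  # 8 bytes per cycle divided by 2 bytes per coefficient


def load_block_latency(hits, miss_latency):
    """Latency in cycles of one LoadBlock, following the four steps.

    hits:         one boolean per coefficient access, True for a cache hit
    miss_latency: cycles charged for each miss (assumed input)
    """
    total_lat = 0
    i = 0
    while i < len(hits):
        # Step 1: count the hits before the next miss or the end of the load.
        run = 0
        while i < len(hits) and hits[i]:
            run += 1
            i += 1
        # Step 2: hits stream at 4 coefficients per cycle (rounding up is an assumption).
        total_lat += math.ceil(run / COEFFS_PER_CYCLE)
        # Step 3: add the miss latency if a miss ended the run.
        if i < len(hits):
            total_lat += miss_latency
            i += 1
        # Step 4: the loop exits at end-of-load, otherwise it returns to step 1.
    return total_lat
```

For a block of 64 coefficients that hits in the cache throughout, the model gives the lower bound of 16 cycles implied by the 8 bytes per cycle bandwidth.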
19
ijpeg benchmark

Input parameters
-compression.quality 70 -compression.smoothing_factor 0 -compression.optimize_coding 0 -verbose 1 -GO.compress

Two benchmarks
original benchmark
updated benchmark using previous code example

Two input pictures
test picture: 33,124 bytes
ref picture: 2,113,595 bytes

New Instructions Utilization

22
Simulated HFU Organization

[Figure: simulated HFU organization; surviving labels: Register File, Memory, HFU, LoadBlock, HEncode, MoveResult]

Instruction format
LoadBlock (r2)imm
HEncode r3, r4
MoveResult r6, r7, r8
