1
General-Purpose Processor Huffman Encoding Extension
  • Stephan Wong, Sorin Cotofana, Stamatis Vassiliadis
  • Computer Engineering Laboratory
  • Electrical Engineering
  • TU Delft

ITCC 2000, 3-2000, Las Vegas
2
Outline
  • Introduction
  • Assumptions & Research Question
  • Architectural Extension
  • Simulation Results
  • Conclusions

3
Introduction
Topic: Computing real-time video using General-Purpose Processors (GPPs)
Problem: High-resolution video processing can be problematic with available GPPs
Goal: Increase GPP video processing performance!
Add specialized units to GPPs!
Huffman Coding:
  • used very often in video coding
  • difficult to parallelize
  • up to 20% of computations in video coding

4
Assumptions & Research Question
  • Out-of-order superscalar General-Purpose Processor
  • Extend GPP with new (multimedia) instructions
  • Extend GPP with a Multimedia Functional Unit

In our case: addition of a Huffman coding unit, called the Huffman Coding Functional Unit (HFU)
Research question: What are the implications of adding an HFU to a GPP?
5
Basic Steps in Picture Coding
[Figure: basic steps in picture coding, from the digital picture to Huffman coding]
6
Processor Organization
Architecture
  • Add a Huffman Coding Functional Unit (HFU)
  • Add 3 new instructions
    • LoadBlock: loads an 8x8 block of quantized DCT coefficients
    • HEncode: performs the actual Huffman encoding
    • WriteOutput: writes the output of Huffman coding to main memory

[Figure: processor organization with Instruction Fetch, Decode/Issue, functional units (FU, FU, ..., HFU) and Memory]
7
HFU Organization
Instruction format
  • LoadBlock (r2)imm
  • HEncode r3, r4
  • WriteOutput (r5)imm

[Figure: HFU organization, showing the HFU between the register file and memory, with the data paths used by LoadBlock, HEncode and WriteOutput]
8
Code Example
    /* Load the block. */
    load r2, starting_block_address
    LoadBlock (r2)imm

    /* Perform the actual coding. */
    load r3, previous_dc_value
    load r4, lum_or_chrom
    HEncode r3, r4

    /* Write output to memory. */
    load r5, write_address
    WriteOutput (r5)0
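The previous_dc_value operand of HEncode suggests that the DC coefficient is coded as a difference from the previous block, as in baseline JPEG. The following Python sketch illustrates only that one step, under the assumption (not stated on the slides) that the HFU follows the JPEG convention; dc_symbol is a hypothetical helper name, not part of the proposal.

```python
def dc_symbol(dc: int, previous_dc: int) -> tuple:
    """Return (size, bits) for a DC coefficient, baseline-JPEG style.

    Assumption: the DC term is coded as a difference from the previous
    block's DC value, which is what previous_dc_value hints at.
    """
    diff = dc - previous_dc
    # Size category: number of bits needed for the magnitude of diff.
    size = abs(diff).bit_length()
    # JPEG stores a negative difference as (diff - 1) in the low `size` bits.
    bits = diff if diff >= 0 else diff + (1 << size) - 1
    return size, bits


if __name__ == "__main__":
    print(dc_symbol(5, 3))  # difference of +2
    print(dc_symbol(3, 5))  # difference of -2
```

In JPEG, only the size category is Huffman coded; the extra bits are appended unchanged. Which table is used would depend on the lum_or_chrom operand.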
9
Simulation Environment
  • 4-way superscalar, out-of-order GPP architecture
    • 4 integer ALUs, 1 integer MULT/DIV-unit, 4 FP adders
    • 1 FP MULT/DIV-unit, 2 memory ports
  • L1 data cache organization (LRU)
    • 16 KB: 128 sets, 4-way associative, block size of 32 bytes
  • L2 unified cache organization (LRU)
    • 256 KB: 1024 sets, 4-way associative, block size of 64 bytes
  • sim-outorder simulator modified to support extensions
  • ijpeg benchmark and modified ijpeg benchmark

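As a quick consistency check on the cache parameters above, capacity is sets times associativity times block size. A small Python sketch (cache_size_bytes is an illustrative helper, not part of the simulator):

```python
def cache_size_bytes(sets: int, ways: int, block_bytes: int) -> int:
    """Capacity of a set-associative cache in bytes."""
    return sets * ways * block_bytes


if __name__ == "__main__":
    l1 = cache_size_bytes(sets=128, ways=4, block_bytes=32)
    l2 = cache_size_bytes(sets=1024, ways=4, block_bytes=64)
    print(l1 // 1024, "KB L1 data cache")
    print(l2 // 1024, "KB L2 unified cache")
```

Both results agree with the 16 KB and 256 KB figures listed on the slide.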
10
Benchmark Issues
  • Original benchmark uses a function called emit_bits
    • it stores the results of Huffman coding one by one
  • This prohibits the use of the WriteOutput instruction, because either
    • the architecture would need to be adapted to the benchmark, or
    • the original benchmark would need to be completely rewritten
  • Instead, another instruction is used: the MoveResult instruction
    • storing the results now requires a while-loop
    • each result is stored using two numbers
    • one bit is used to control the while-loop

This introduces a penalty compared to the WriteOutput instruction!
The results presented here are therefore a worst-case scenario for the proposal.
11
Simulated HFU Organization
Instruction format
  • LoadBlock (r2)imm
  • HEncode r3, r4
  • WriteOutput (r5)imm
  • MoveResult r6,r7,r8

[Figure: simulated HFU organization, showing the HFU between the register file and memory, with MoveResult returning results to the register file]
12
Modified Code Example
    /* Load the block. */
    load r2, starting_block_address
    LoadBlock (r2)imm

    /* Perform the actual coding. */
    load r3, previous_dc_value
    load r4, lum_or_chrom
    HEncode r3, r4
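The transcript does not show the while-loop that drains the results with MoveResult. The Python sketch below is one possible reading of it. It assumes that the three MoveResult registers carry a code word, its bit length, and a one-bit "more results" flag; that reading is inferred from "two numbers" and "one bit" on the Benchmark Issues slide and is not confirmed by the source. ToyHFU, move_result and the emit_bits callback signature are all hypothetical.

```python
from collections import deque


class ToyHFU:
    """Toy stand-in for the HFU result buffer (not the real hardware)."""

    def __init__(self, results):
        # Each result is a (code, length) pair produced by HEncode.
        # Assumption: a block always yields at least one result.
        self._results = deque(results)

    def move_result(self):
        """Model of MoveResult: return (code, length, more)."""
        code, length = self._results.popleft()
        more = len(self._results) > 0
        return code, length, more


def drain(hfu, emit_bits):
    """Store the results one by one, the way emit_bits expects them."""
    more = True
    while more:  # the one-bit flag controls the loop
        code, length, more = hfu.move_result()
        emit_bits(code, length)


if __name__ == "__main__":
    hfu = ToyHFU([(0b101, 3), (0b0, 1)])
    drain(hfu, lambda code, length: print(format(code, f"0{length}b")))
```

Each loop iteration costs one MoveResult plus one emit_bits call, which is the penalty relative to a single WriteOutput.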
13
Simulation Results
  • Total Number of Execution Cycles (TNEC)
    • average decrease between 6.3% and 7.4%
  • Total number of instructions
    • average decrease is about 5%
  • Total number of branches
    • average decrease is about 14%

14
Recalculating TNEC Results
  • Determine the number of calls to emit_bits
  • Make a conservative assumption about the penalty in clock cycles
  • Subtract from the original TNEC value

    new TNEC = original TNEC - (# of emit_bits calls × penalty)

  • Recalculate the decreases in TNEC
    • average decrease between 9% and 12%

Assumption: no specialized units for DCT and Quantization!
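The recalculation can be sketched in a few lines of Python. The function names and all numbers in the example are hypothetical and chosen only to show the arithmetic; they are not measurements from the presentation, and the per-call penalty is treated as a single constant.

```python
def recalculated_tnec(measured_tnec: int, emit_bits_calls: int,
                      penalty_cycles: int) -> int:
    """Remove the assumed MoveResult/emit_bits penalty from a measured TNEC."""
    return measured_tnec - emit_bits_calls * penalty_cycles


def percent_decrease(baseline: float, improved: float) -> float:
    """Relative decrease of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    baseline = 1_000_000   # hypothetical TNEC of the unmodified benchmark
    measured = 930_000     # hypothetical TNEC with the HFU and MoveResult
    new_tnec = recalculated_tnec(measured, emit_bits_calls=10_000,
                                 penalty_cycles=3)
    print(percent_decrease(baseline, measured))
    print(percent_decrease(baseline, new_tnec))
```

The second figure estimates what a WriteOutput-based implementation might achieve under the assumed penalty.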
15
What if?
What if DCT and Quantization are hardwired to improve performance?
16
Conclusions
  • Proposed hardwired Huffman Coding unit
    • 3 new instructions
    • Potential improvement between 6% and 12%
    • Number of branches decreased by 14%
    • Number of instructions decreased by 5%
    • Hardware requirements are similar to adding 1-2k bytes of memory and control logic
  • Future Work
    • What is the impact on video processing performance when extending a GPP with FPGA MFUs?

17
More information
http://www.tudelft.nl
http://cardit.et.tudelft.nl
http://cardit.et.tudelft.nl/molen
18
Memory Model
Loading an 8x8 block of quantized DCT coefficients
  • sim-outorder simulator
  • Memory bandwidth is 8 bytes per cycle
  • Each coefficient is represented by two bytes

LoadBlock can therefore load up to 4 coefficients per cycle
Assumed load latency (total_lat = 0):
  1. Determine the number of hits before a miss or end-of-load
  2. Divide by 4 and add to total_lat
  3. Add the miss latency to total_lat (if any)
  4. If end-of-load, STOP. Else, go to step 1.
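The four steps above can be sketched in Python as follows. This is a minimal model under stated assumptions: the slide does not say how the division by 4 is rounded (rounding up is assumed here), nor whether the access that misses is covered by the miss latency (assumed yes). The function name, the per-coefficient hit list and the miss latency value are illustrative only.

```python
import math


def load_block_latency(hits, miss_latency, coeffs_per_cycle=4):
    """Estimate LoadBlock latency in cycles.

    hits: one boolean per coefficient access, True meaning a cache hit.
    """
    total_lat = 0
    i = 0
    n = len(hits)
    while i < n:
        # Step 1: count the hits before the next miss or end-of-load.
        run = 0
        while i < n and hits[i]:
            run += 1
            i += 1
        # Step 2: divide by 4 (rounded up, an assumption) and accumulate.
        total_lat += math.ceil(run / coeffs_per_cycle)
        # Step 3: if we stopped on a miss, add the miss latency.
        if i < n:
            total_lat += miss_latency
            i += 1  # assume the miss latency covers the missing access
        # Step 4: the loop ends at end-of-load, otherwise repeat from step 1.
    return total_lat


if __name__ == "__main__":
    # An 8x8 block is 64 coefficients; with all hits, 64 / 4 cycles.
    print(load_block_latency([True] * 64, miss_latency=6))
```

With all 64 coefficients hitting in the cache, the model gives 16 cycles, which matches 128 bytes moved at 8 bytes per cycle.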
19
ijpeg benchmark
  • Input parameters
    -compression.quality 70 -compression.smoothing_factor 0 -compression.optimize_coding 0 -verbose 1 -GO.compress
  • New Instructions Utilization
  • Two benchmarks
    • original benchmark
    • updated benchmark using previous code example
  • Two input pictures
    • test picture: 33,124 bytes
    • ref picture: 2,113,595 bytes
