Title: November 1st, 2000
1. Human beings are great programmers, Computers are poor actors
VLIW DSP vs. SuperScalar Implementation of
a Baseline H.263 Encoder
Serene Banerjee, Hamid R. Sheikh, Lizy K. John, Brian L. Evans, Alan C. Bovik
Department of Electrical and Computer Engineering
The University of Texas at Austin
November 1st, 2000
serene_at_ece.utexas.edu
2. Baseline H.263 Video Encoding
I (intra) frame: the Discrete Cosine Transform (DCT) is used to reduce spatial redundancy within a frame.
P (predicted) frame: motion-compensated prediction (MCP) is used to reduce temporal redundancy, and the DCT is used to reduce spatial redundancy in the prediction error.
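The prediction error for a P frame can be illustrated with a minimal C sketch. It assumes 8-bit luminance samples stored row by row and a 16 x 16 macroblock; the function name `mb_residual` and its parameters are hypothetical and are not taken from the UBC codec.

```c
#include <assert.h>

#define MB_SIZE 16  /* luminance macroblock: 16 x 16 pixels */

/* Prediction error for one macroblock: each current pixel minus the
 * reference-frame pixel displaced by the motion vector (mvx, mvy).
 * cur and ref point at the top-left pixel of the macroblock position in
 * their frames; the caller keeps the displaced block inside the frame. */
void mb_residual(const unsigned char *cur, const unsigned char *ref,
                 int stride, int mvx, int mvy, short *resid)
{
    assert(stride >= MB_SIZE);
    for (int y = 0; y < MB_SIZE; y++) {
        for (int x = 0; x < MB_SIZE; x++) {
            int predicted = ref[(y + mvy) * stride + (x + mvx)];
            resid[y * MB_SIZE + x] = (short)(cur[y * stride + x] - predicted);
        }
    }
}
```

When the motion vector points at a well-matching block, the residual is close to zero everywhere, which is what makes the following DCT and quantization stage cheap to code.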
3. Baseline H.263 Encoder
4. H.263 Encoder
- Goals: baseline H.263 encoder only
  - Evaluate performance of compiled C code on Very Long Instruction Word (VLIW) Digital Signal Processors (DSPs) and superscalar processors
  - Hand optimize H.263 video encoder on VLIW DSP
- University of British Columbia (UBC) H.263 Version 2 (H.263+) video codec
  - By Prof. Faouzi Kossentini's group, http://spmg.ece.ubc.ca
  - 23,000 lines (720 kbytes) of C code targeted for PCs
  - Baseline H.263 and many optional H.263 modes
  - Primarily for research purposes
5. TMS320C6701 Processor
- Up to eight 32-bit instructions are executed in one instruction cycle, in order
- Two 32-bit data paths, with sixteen 32-bit registers and sixteen 16-bit data memory banks
[Figure: TMS320C6701 CPU core. Program fetch, instruction dispatch, instruction decode; register files A and B; functional units L1, S1, M1, D1 and L2, S2, M2, D2; control registers, control logic, test/emulation, interrupt control.]
6. TMS320C6701 EVM
- TMS320C6701 processor
  - 11 to 17 pipeline stages, depending on instruction
- External memory
  - 256 kB of 133 MHz synchronous burst static random-access memory (SBSRAM)
  - 8 MB of 100 MHz synchronous dynamic RAM (SDRAM) in two 16-bit RAM banks
- 100 MHz clock speed due to SDRAM
- Development environment
  - Code Composer: interactive real-time debugging
  - Simulator: does not report pipeline stalls
7. SimpleScalar Simulator
- Superscalar processor reorders sequential instructions based on data dependencies for parallel (out-of-order) execution
- SimpleScalar is a configurable superscalar simulator, http://www.simplescalar.org
[Figure: six pipeline stages for out-of-order simulation. Fetch, Dispatch, Scheduler, Execute, Writeback, Commit, with memory, data cache, and data TLB (translation lookaside buffer).]
8. Comparison of Processors
9. Encoder Profile for VLIW DSP (with level two C optimization only)
1476 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
[Figure: encoder cycle profile by routine; labeled segment: SAD]
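The SAD (sum of absolute differences) routine is the inner kernel of full-search motion estimation. A minimal C sketch follows, assuming 16 x 16 macroblocks of 8-bit pixels and a square search range; `sad16x16`, `full_search`, and `motion_vector` are hypothetical names, not the UBC codec's own routines.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define MB_SIZE 16  /* luminance macroblock: 16 x 16 pixels */

/* Sum of absolute differences between two 16 x 16 blocks. */
unsigned sad16x16(const unsigned char *cur, const unsigned char *ref, int stride)
{
    unsigned sad = 0;
    for (int y = 0; y < MB_SIZE; y++)
        for (int x = 0; x < MB_SIZE; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

typedef struct { int dx, dy; unsigned sad; } motion_vector;

/* Full search: evaluate every displacement in [-range, +range] in both
 * directions and keep the one with the smallest SAD. cur and ref point at
 * the same macroblock position in the current and reference frames; the
 * caller keeps the whole search window inside the reference frame. */
motion_vector full_search(const unsigned char *cur, const unsigned char *ref,
                          int stride, int range)
{
    motion_vector best = { 0, 0, UINT_MAX };
    assert(range >= 0);
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            unsigned s = sad16x16(cur, ref + (dy * stride + dx), stride);
            if (s < best.sad) {
                best.dx = dx;
                best.dy = dy;
                best.sad = s;
            }
        }
    }
    return best;
}
```

Each macroblock costs (2 * range + 1)^2 SAD evaluations of 256 pixel differences each, which is why this kernel is the natural first target for hand optimization.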
10. Encoder Profile for SuperScalar (1-way with level two C optimization)
196 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
11. H.263 Encoder Comparison (with level 2 C optimization only)
- Frame resolution: 128 x 96 (Sub-QCIF)
- Full-search motion estimation
- Clock speed: 100 MHz
12. VLIW DSP Memory Optimizations
- Internal program memory holds
  - Computationally intensive routines
  - Commonly used runtime support functions from TI libraries (memcpy, memcmp and memset)
- Internal data memory holds
  - Macroblocks and search area for motion estimation
  - Macroblocks for DCT, quantization, coding, reconstruction
  - Local data for computationally intensive routines
  - Stack
- Speedup: 29 times over level two optimization
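Keeping the search area in internal data memory can be sketched as a staging copy: the window around a macroblock is copied once from the frame buffer into a small contiguous buffer, and the search then reads only that buffer. The sketch below is portable C under stated assumptions: `RANGE`, `WIN`, and `load_search_window` are hypothetical, and how the destination buffer is actually placed in on-chip memory is a toolchain matter that is not shown.

```c
#include <assert.h>
#include <string.h>

#define MB_SIZE 16                     /* macroblock size in pixels */
#define RANGE   15                     /* assumed search range, in pixels */
#define WIN     (MB_SIZE + 2 * RANGE)  /* search window width and height */

/* Copy the WIN x WIN search window centered on the macroblock at
 * (mb_x, mb_y) from the frame buffer into win, a contiguous buffer of
 * WIN * WIN bytes standing in for internal data memory. The caller keeps
 * the window inside the frame. */
void load_search_window(unsigned char *win, const unsigned char *frame,
                        int stride, int mb_x, int mb_y)
{
    const unsigned char *src = frame + (mb_y - RANGE) * stride + (mb_x - RANGE);
    assert(stride >= WIN);
    for (int y = 0; y < WIN; y++)
        memcpy(win + y * WIN, src + y * stride, WIN);
}
```

After the copy, every candidate block in the search reads from the small buffer with a row stride of `WIN`, so the slower external memory is touched once per window rather than once per candidate.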
13. VLIW DSP Code Optimizations
- Compiler intrinsics gave little improvement
- Wrote assembly routines
  - Parallel assembly: SAD, Clip_MB (clips overflowing values)
  - Linear assembly: Interpolate, FillMBData (pack copy of pixel data into macroblock structures)
- Rewriting the C code
  - Unroll loops and pipeline computations
  - Use 32-bit packed data I/O to slower external RAM
  - Avoid pipeline stalls due to memory bank conflicts
- Speedup: 4 times over level two C optimization
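Two of the ideas above, loop unrolling with 32-bit packed loads and clipping of overflowing values, can be sketched in portable C. This is only an illustration of the C-level rewrite, not the hand-written assembly; `sad_row16_packed`, `absdiff_u8`, and `clip_u8` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Absolute difference of two values in the range 0..255. */
static unsigned absdiff_u8(unsigned a, unsigned b)
{
    return a > b ? a - b : b - a;
}

/* SAD of one 16-pixel row. Each iteration fetches four pixels with one
 * 32-bit load per operand and processes them in an unrolled body, so the
 * loop runs 4 times instead of 16. memcpy keeps the load portable; the
 * same byte lanes are paired, so the result does not depend on endianness. */
unsigned sad_row16_packed(const unsigned char *cur, const unsigned char *ref)
{
    unsigned sad = 0;
    assert(cur != NULL && ref != NULL);
    for (int i = 0; i < 16; i += 4) {
        uint32_t c, r;
        memcpy(&c, cur + i, 4);
        memcpy(&r, ref + i, 4);
        sad += absdiff_u8(c & 0xFF, r & 0xFF);
        sad += absdiff_u8((c >> 8) & 0xFF, (r >> 8) & 0xFF);
        sad += absdiff_u8((c >> 16) & 0xFF, (r >> 16) & 0xFF);
        sad += absdiff_u8(c >> 24, r >> 24);
    }
    return sad;
}

/* Clip a reconstructed sample to the 8-bit pixel range 0..255. */
unsigned char clip_u8(int v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (unsigned char)v;
}
```

Fewer, wider memory accesses reduce traffic to external RAM, and the unrolled body gives a compiler more independent operations to schedule across the functional units.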
14. VLIW DSP Optimizations (assembly routines only)
15. VLIW DSP Encoder Profile (after all C6701 optimizations)
24 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
[Figure: encoder cycle profile by routine; labeled segment: SAD]
16. Superscalar Encoder Profile (256-way SimpleScalar processor)
28 Mcycles/frame for 128 x 96 resolution with
full-search motion estimation
17. Subroutine Comparisons
18. H.263 Encoder Comparison
- Frame resolution: 128 x 96 (Sub-QCIF)
- Full-search motion estimation
- Clock speed: 100 MHz
19. Conclusions
- With level 2 optimization only
  - One-way superscalar is 7.5x faster than VLIW DSP
  - Four-way to one-way issue speedup is 2.88x
  - 256-way to four-way speedup is 2.4x
  - Variable length coding much faster on superscalar
- VLIW DSP hand optimization produces 61x speedup vs. level two C optimization
  - Placement of often-used data and code on-chip
  - Hand coded SAD, interpolation, and reconstruction
  - 14% faster than 256-way superscalar version
http://www.ece.utexas.edu/sheikh/h263
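The speedup figures follow from the cycle counts reported on the profile slides (1476, 196, 28, and 24 Mcycles/frame). A small C sketch of the arithmetic, with hypothetical helper names `speedup` and `seconds_per_frame`:

```c
#include <assert.h>

/* Ratio of baseline cost to improved cost, both in Mcycles/frame. */
double speedup(double base_mcycles, double new_mcycles)
{
    assert(new_mcycles > 0.0);
    return base_mcycles / new_mcycles;
}

/* Time to encode one frame, in seconds, at the given clock rate in MHz. */
double seconds_per_frame(double mcycles_per_frame, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return mcycles_per_frame / clock_mhz;
}
```

With these helpers, `speedup(1476, 196)` is about 7.5 (one-way superscalar vs. compiled VLIW code), `speedup(1476, 24)` is 61.5 (hand optimization on the VLIW DSP), and `speedup(196, 28)` is 7.0, close to the product 2.88 x 2.4 of the two issue-width speedups. The optimized DSP version needs (28 - 24) / 28, or about 14%, fewer cycles than the 256-way superscalar version. At 100 MHz, `seconds_per_frame(24, 100)` gives 0.24 s per frame, roughly 4 frames per second.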