Title: Customizing Wide-SIMD Architectures for H.264
1Customizing Wide-SIMD Architectures for H.264
- Sangwon Seo1, Mark Woh1, Scott Mahlke1, Trevor Mudge1
- Vijay Sundaram2, Chaitali Chakrabarti2
- 1 University of Michigan
- 2 Arizona State University
2Outline
- Motivation
- H.264 Analysis
- Proposed Architecture
- H.264 Kernel Mappings
- Results
- Conclusion
3Motivation: Smart Phone
Reference Images: http://www.apple.com/iphone/gallery/
4Motivation: Inside Smart Phone
Reference Images: http://idannyb.files.wordpress.com/2008/07/xiuvbfueck3gsdum-large.jpg
5H.264 Design
H.264 encoder/decoder reference design
Reference Images: I. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003
6H.264 Analysis
- H.264 Kernel Algorithms
- Heavy SIMD workload
- Different natural SIMD widths
- High to medium thread-level parallelism
- Need to support multiple SIMD widths to maximize SIMD utilization
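The utilization argument can be illustrated with a minimal sketch. The 32-lane machine and the kernel widths of 4, 8 and 16 are hypothetical values chosen for illustration, not figures from the slides; the model also assumes there are always enough independent threads to fill every partition.

```python
def simd_utilization(natural_width, hw_lanes, groups=1):
    """Return the fraction of hardware lanes doing useful work.

    hw_lanes is split into `groups` equal partitions; each partition runs
    one independent thread of a kernel whose natural SIMD width is
    natural_width. Assumes enough threads exist to fill every partition.
    """
    if groups < 1 or hw_lanes % groups != 0:
        raise ValueError("groups must be >= 1 and evenly divide hw_lanes")
    lanes_per_group = hw_lanes // groups
    busy = min(natural_width, lanes_per_group) * groups
    return busy / hw_lanes


if __name__ == "__main__":
    HW_LANES = 32  # hypothetical machine width, not taken from the slides
    for width in (4, 8, 16):  # hypothetical natural kernel widths
        groups = HW_LANES // width
        single = simd_utilization(width, HW_LANES)
        split = simd_utilization(width, HW_LANES, groups)
        print(f"width {width}: one thread {single:.0%}, {groups} threads {split:.0%}")
```

Under these assumptions a narrow kernel leaves most lanes idle on a single wide datapath, while partitioning the lanes among several threads recovers the idle lanes.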
7H.264 Analysis
- Example: Deblocking Filter
- Two-dimensional data are used in multimedia algorithms.
- Row- or column-order memory access works well for one set of edges, but not for the other.
- A diagonal memory bank system helps to access blocks along a row or a column.
Horizontal Filtering
Vertical Filtering
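One common way to obtain this property is a skewed mapping in which element (row, col) is stored in bank (row + col) mod N. The sketch below assumes that mapping and a 4-bank memory; both are illustrative assumptions, and the slides do not specify the exact bank function used.

```python
NUM_BANKS = 4  # assumed number of banks, for illustration only


def diagonal_bank(row, col, num_banks=NUM_BANKS):
    """Skewed (diagonal) mapping: neighbors in a row AND in a column differ."""
    return (row + col) % num_banks


def column_order_bank(row, col, num_banks=NUM_BANKS):
    """Plain interleaving by column index, shown for comparison."""
    return col % num_banks


def conflict_free(coords, bank_fn, num_banks=NUM_BANKS):
    """True if every coordinate in the access falls in a different bank."""
    banks = [bank_fn(r, c, num_banks) for r, c in coords]
    return len(set(banks)) == len(banks)


if __name__ == "__main__":
    row_access = [(0, c) for c in range(NUM_BANKS)]  # one row of a block
    col_access = [(r, 0) for r in range(NUM_BANKS)]  # one column of a block
    for name, fn in (("column-order", column_order_bank),
                     ("diagonal", diagonal_bank)):
        print(name,
              "row ok:", conflict_free(row_access, fn),
              "column ok:", conflict_free(col_access, fn))
```

With plain column-order interleaving, a column access lands entirely in one bank and serializes; with the skewed mapping, both access directions touch each bank exactly once.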
8H.264 Analysis
- Subgraphs for inner loops of two kernel algorithms
- Large amount of data locality
- Large register file (RF) power consumption (read/write)
- Bypass and Temporary buffer support
9H.264 Analysis
- Instruction Pairs
- Heavy usage of shuffle and arithmetic operations
- Add-Shift: round operation
- Sub-Abs: SAD operation
- Need to fuse the frequently used instruction pairs
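The two instruction pairs named above can be sketched as lane-wise operations. This is a behavioral model only: the function names are hypothetical, and the rounding form (a + b + 2^(shift-1)) >> shift is the usual rounded average and is assumed here rather than taken from the slides.

```python
def add_shift_round(a, b, shift=1):
    """Fused add + shift: rounded (a + b) >> shift, applied per lane."""
    if shift < 1:
        raise ValueError("shift must be at least 1")
    rounding = 1 << (shift - 1)  # half of the divisor, added before shifting
    return [(x + y + rounding) >> shift for x, y in zip(a, b)]


def sub_abs(a, b):
    """Fused subtract + absolute value, applied per lane."""
    return [abs(x - y) for x, y in zip(a, b)]


def sad(a, b):
    """Sum of absolute differences: reduce the sub_abs result to one value."""
    return sum(sub_abs(a, b))


if __name__ == "__main__":
    current = [10, 20, 30, 40]    # made-up pixel values
    reference = [12, 18, 30, 45]  # made-up pixel values
    print(add_shift_round(current, reference))
    print(sub_abs(current, reference), sad(current, reference))
```

Fusing each pair means the intermediate sum or difference never has to be written back to the register file, which is the motivation given on the preceding slide.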
10H.264 Analysis
- Permutation Patterns for Intra Prediction
- Fixed set of shuffle patterns
- Need for programmable shuffle network
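As an illustration of what a fixed shuffle pattern looks like, the sketch below expresses two 4x4 intra prediction modes as index vectors. In H.264, the vertical mode copies the four pixels above the block into every row, and the horizontal mode copies each left-neighbor pixel across its row. The `shuffle` helper and the flat 16-lane layout are assumptions made for this example, not the network described in the slides.

```python
def shuffle(src, pattern):
    """Generic permutation: output lane i reads input lane pattern[i]."""
    return [src[i] for i in pattern]


# 16 output lanes hold a 4x4 block in row-major order.
VERTICAL = [0, 1, 2, 3] * 4                           # every row = top neighbors
HORIZONTAL = [lane // 4 for lane in range(16)]        # row r = left neighbor r


if __name__ == "__main__":
    top = [10, 20, 30, 40]   # made-up pixels above the block
    left = [1, 2, 3, 4]      # made-up pixels left of the block
    for name, src, pattern in (("vertical", top, VERTICAL),
                               ("horizontal", left, HORIZONTAL)):
        block = shuffle(src, pattern)
        print(name)
        for r in range(4):
            print("  ", block[4 * r: 4 * r + 4])
```

Because each mode is just a different index vector fed to the same `shuffle`, a small table of patterns is enough to cover the modes modeled here.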
11Modified SIMD architecture
12Modified SIMD architecture
Multiple SIMD widths Thread-Level Parallelism
13Modified SIMD architecture
Diagonal Memory Organization Memory Bank System
Shuffle Network
14Modified SIMD architecture
Short-lived values stored in temporary buffers
15Modified SIMD architecture
Short-lived values Fused Operation
16Modified SIMD architecture
Shuffle networks are placed at several points in the datapath to align data
17Mapping of H.264 Kernels
18Results
- System Breakdown
- H.264 CIF video at 30fps
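For scale, the workload implied by CIF at 30 fps can be worked out directly: CIF is 352x288 pixels and H.264 processes 16x16 macroblocks. The cycle-budget helper takes the clock frequency as a parameter because the slides do not state one; any value passed to it is an assumption.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288  # CIF frame size in pixels
MACROBLOCK = 16                   # H.264 macroblock edge in pixels
FPS = 30                          # frame rate from the slide


def macroblocks_per_second(width, height, fps, mb=MACROBLOCK):
    """Macroblocks that must be processed each second."""
    return (width // mb) * (height // mb) * fps


def cycles_per_macroblock(clock_hz, width, height, fps):
    """Average cycle budget per macroblock at a given (assumed) clock."""
    return clock_hz // macroblocks_per_second(width, height, fps)


if __name__ == "__main__":
    per_frame = (CIF_WIDTH // MACROBLOCK) * (CIF_HEIGHT // MACROBLOCK)
    print("macroblocks per frame:", per_frame)
    print("macroblocks per second:",
          macroblocks_per_second(CIF_WIDTH, CIF_HEIGHT, FPS))
```

A CIF frame holds 22 x 18 = 396 macroblocks, so 30 fps corresponds to 11,880 macroblocks per second.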
19Results
- Speedup Breakdown
- 2.13x performance increase on average
20Results
- Energy-Delay product comparison
- 29% energy-delay improvement on average
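The energy-delay product (EDP) is energy multiplied by execution time, and an improvement is usually reported as the relative reduction against a baseline. The sketch below shows that calculation; the input numbers are placeholders chosen for the arithmetic and are not measurements from the slides.

```python
def energy_delay_product(energy, delay):
    """EDP = energy x delay (any consistent units)."""
    return energy * delay


def edp_improvement(base_energy, base_delay, new_energy, new_delay):
    """Relative EDP reduction versus the baseline, as a fraction."""
    base = energy_delay_product(base_energy, base_delay)
    new = energy_delay_product(new_energy, new_delay)
    return 1.0 - new / base


if __name__ == "__main__":
    # Placeholder values only: same energy, half the execution time.
    gain = edp_improvement(base_energy=1.0, base_delay=1.0,
                           new_energy=1.0, new_delay=0.5)
    print(f"EDP improvement: {gain:.0%}")
```

Because EDP multiplies the two terms, a design that runs faster but draws more energy per kernel only improves EDP if the speedup outweighs the energy increase.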
21Results
- Comparison with latest H.264 encoders
- [17] T. C. Chen et al., 2.8 to 62.7 mW low-power and power-aware H.264 encoder for mobile applications, 2007 IEEE Symposium on VLSI Circuits, pp. 222-223, June 2007.
- [18] M. Bhatnagar, TMS320DM6446/3 Power Consumption Summary, Texas Instruments Application Reports, http://focus.ti.com/lit/an/spraad6a/spraad6a.pdf, Feb. 2008.
22Conclusion
- Key architectural enhancements
- SIMD partitioning
- Diagonal memory bank system
- Bypass and temporary buffer support
- Fused operation support
- Programmable crossbar
- Future work
- Image processing algorithms on SIMD architecture
23Backup Slides
24H.264 Analysis
- Diagonal Memory Organization
- Two-dimensional data are used in multimedia algorithms.
- Blocks along a row or a column need to be accessed easily.
25Mapping of H.264 Kernels
26Mapping of H.264 Kernels
27Mapping of H.264 Kernels