Title: Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems
1 Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems
- Ryan DeVille, Vikas Aggarwal, Ian Troxel, and Alan D. George
- High-performance Computing and Simulation (HCS) Research Laboratory
- Department of Electrical and Computer Engineering
- University of Florida
2 Introduction
- JPEG-2000 Encoding
- State-of-the-art low bit-rate compression algorithm
- Progressive transmission by quality, resolution, component, or spatial locality
- Spatially random access to bitstream
- Region of interest coding
- Motivation for porting JPEG-2000 to RC systems
- High-performance and low-cost solution is attractive for airborne and satellite imaging systems
- Speedup readily available with fine-grain and coarse-grain parallelism opportunities
3 Related Research
- EBCOT Encoder designs
- Group of Column optimization method
- Previous RC Designs
- Space systems prototype [5]
- Scalable Entropy Encoder [6]
- Dual Processing Elements Architecture [7]
- 2D Discrete Wavelet Transform designs
- Several mimic early VLSI designs [8, 9]
- Multiple architecture design classifications [10]
- Direct
- 1D, transpose, perform another 1D
- Intrinsically slow
- Separate serial and parallel filters or parallel row, parallel column filters
- Processes along rows and columns
- Represents significant performance improvement
- Symmetrically extended
- Improves processing efficiency, especially towards center of image
4 JPEG-2000 Encoder Design Develop.
- Software code profiling first used to determine effort distribution
- Previous research efforts show that DWT and Tier-1 encoding consume 80-85% of execution time
- Current profiling results with JasPer and OpenJPEG show that >90% of execution time is spent in DWT and Tier-1
- Benchmark images selected from Kodak Lossless True Color Image Suite, JasPer benchmark images, and standard image processing images (lena, etc.)
- JasPer Execution Time Profile
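The profiling figures above set a ceiling on what accelerating only DWT and Tier-1 can deliver. The following is a minimal sketch of that bound using Amdahl's Law. The 0.90 fraction comes from the profiling claim on this slide; the kernel speedup values are hypothetical and are not measurements from this work.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the runtime is accelerated.

    accelerated_fraction: share of software runtime spent in the accelerated
                          kernels (0.90 is taken from the profiling slide).
    kernel_speedup:       speedup of those kernels alone (hypothetical here).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be in [0, 1]")
    if kernel_speedup <= 0:
        raise ValueError("kernel_speedup must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / kernel_speedup)


if __name__ == "__main__":
    # Hypothetical kernel speedups; the overall gain approaches 1 / (1 - 0.90).
    for s in (2, 10, 100):
        print(f"kernel x{s}: overall x{amdahl_speedup(0.90, s):.2f}")
```

With 90% of the runtime in the accelerated kernels, the overall speedup approaches but never reaches 10x, however fast the kernels become.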
5 Discrete Wavelet Transform (DWT)
- Features
- Second-most computationally intensive block in compression process
- Transforms each component tile data into coefficients
- Reversible transform involves all integer operations
- Represents high- and low-frequency components of image
- Amenable to compression; results in better compression ratios
- Recursive application yields frequency bands at multiple resolutions
- Operation
- 2D transform achieved by successively applying 1D transform in X and Y directions
- Each 1D transform consists of
- Filtering step
- De-interleave step: reorganizing of data in bands
- Available data and functional parallelism can be exploited
Figure: three-level DWT subband decomposition (a3LL, a3HL, a3LH, a3HH; a2HL, a2LH, a2HH; a1HL, a1LH, a1HH)
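The filtering and de-interleave steps can be illustrated in software. This is a minimal sketch, assuming the standard JPEG-2000 reversible 5/3 lifting filter with whole-sample symmetric extension; the slides do not spell out the filter. The function names are illustrative and are not taken from the authors' design.

```python
def _clamped(seq, k):
    """Read seq[k], clamping k into range (mirrors the high-pass band edges)."""
    return seq[min(max(k, 0), len(seq) - 1)]


def fwd_53_1d(x):
    """One level of the integer 5/3 lifting transform on a 1D list.

    Returns (low, high), i.e. the already de-interleaved bands.
    """
    n = len(x)
    if n < 2:
        return list(x), []

    def ext(i):  # whole-sample symmetric extension of the input signal
        if i < 0:
            i = -i
        if i >= n:
            i = 2 * (n - 1) - i
        return x[i]

    # Predict step: odd samples become high-pass coefficients.
    high = [x[i] - ((ext(i - 1) + ext(i + 1)) >> 1) for i in range(1, n, 2)]
    # Update step: even samples become low-pass coefficients.
    low = [x[2 * k] + ((_clamped(high, k - 1) + _clamped(high, k) + 2) >> 2)
           for k in range((n + 1) // 2)]
    return low, high


def inv_53_1d(low, high):
    """Exact inverse of fwd_53_1d (integer arithmetic, so lossless)."""
    if not high:
        return list(low)
    even = [low[k] - ((_clamped(high, k - 1) + _clamped(high, k) + 2) >> 2)
            for k in range(len(low))]
    x = []
    for k, e in enumerate(even):
        x.append(e)
        if k < len(high):
            x.append(high[k] + ((e + _clamped(even, k + 1)) >> 1))
    return x


def fwd_53_2d(tile):
    """One 2D level: 1D transform along rows, then along columns."""
    row_bands = [fwd_53_1d(list(r)) for r in tile]
    row_low = [b[0] for b in row_bands]
    row_high = [b[1] for b in row_bands]

    def vertical(block):
        cols = [fwd_53_1d(list(c)) for c in zip(*block)]
        low = [list(r) for r in zip(*(c[0] for c in cols))]
        high = [list(r) for r in zip(*(c[1] for c in cols))]
        return low, high

    ll, lh = vertical(row_low)    # low horizontal: low / high vertical
    hl, hh = vertical(row_high)   # high horizontal: low / high vertical
    return ll, hl, lh, hh
```

Applying `fwd_53_2d` again to the returned `ll` band would give the next resolution level, which is the recursive application described on the slide.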
6 DWT Hardware Architecture
- Challenges presented by DWT
- Parallel processing limited by memory bandwidth requirements
- Some sequential nature in processing involved
- Design features
- Data-level parallelism exploited by operating on multiple tiles
- Function-level parallelism exploited by pipelining different processing steps
- Data reuse eliminates extra read cycles
- Internal architecture
- Each tile is entirely stored in a single Block RAM to minimize data movement
- Overlapped processing to further reduce latency
7 Embedded Block Coding with Optimized Truncation (EBCOT) Tier-1
- Features
- Specially adapted arithmetic coder
- Four bit-plane coding primitives
- Three coding passes for each bit-plane (except the most significant)
- Operation
- Coding passes: cleanup pass (CUP) begins at most significant bit plane
- Iteratively perform coding passes over remaining bit planes
- Coding-pass-generated context and bit data serially encoded and compressed by arithmetic encoder
- Flush and reset arithmetic coder at completion
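The pass ordering described on this slide can be written out explicitly. This is a minimal sketch that only enumerates the schedule and extracts magnitude bit-planes. It does not implement context formation or the arithmetic coder. The pass names follow the JPEG-2000 standard; the function names and sample coefficients are illustrative.

```python
PASSES_PER_PLANE = ("significance propagation", "magnitude refinement", "cleanup")


def bit_plane(coeffs, plane):
    """Magnitude bits of every coefficient at the given bit-plane index."""
    return [(abs(c) >> plane) & 1 for c in coeffs]


def coding_pass_schedule(coeffs):
    """List (bit_plane_index, pass_name) in coding order for one code block.

    The most significant non-zero bit-plane gets a cleanup pass only;
    every lower plane gets all three passes.
    """
    top = max((abs(c) for c in coeffs), default=0)
    num_planes = top.bit_length()
    schedule = []
    for plane in range(num_planes - 1, -1, -1):
        if plane == num_planes - 1:
            schedule.append((plane, "cleanup"))
        else:
            schedule.extend((plane, name) for name in PASSES_PER_PLANE)
    return schedule


if __name__ == "__main__":
    block = [5, -3, 0, 12]  # illustrative coefficients, not real image data
    for plane, name in coding_pass_schedule(block):
        print(plane, name, bit_plane(block, plane))
```

A code block with P non-zero bit-planes therefore produces 3P - 2 coding passes, all of which feed the single arithmetic coder in order.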
8 Tier-1 Encoding Hardware Architecture
- Challenges presented by Tier-1 encoding
- Serial process: creation of current MQ context data directly depends upon previous pass results
- Bursty communication: contextual data from a pass arrives in short, semi-continuous bursts
- Large amounts of data and flags must be stored through multiple iterations of algorithm, requiring high memory bandwidth
- Internal architecture (high-level)
- Retrieve current stripe from memory for processing
- Data is operated on in a pipelined fashion through registers
- Context and data information sent to queues
- Serializing agent: arithmetic entropy encoder
- MQ Input Controller regulates input to arithmetic entropy encoder, ensuring correct operation
- Data from arithmetic entropy encoder is written to a separate, final buffer
Design decision to use MQ encoder as serializing
agent saves area and BlockRAM space without
sacrificing too much performance.
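The stripe-by-stripe retrieval follows the scan order that the JPEG-2000 standard defines for coding passes: a code block is split into stripes four rows high, and each stripe is visited column by column. This is a minimal sketch of that order. The function name is illustrative, and the sketch says nothing about how the hardware actually lays stripes out in memory.

```python
def stripe_scan_order(height, width, stripe_height=4):
    """Yield (row, col) positions of a code block in stripe scan order.

    Stripes are `stripe_height` rows high (4 in JPEG-2000); within a stripe,
    columns are visited left to right and each column top to bottom.
    The last stripe may be shorter when height is not a multiple of 4.
    """
    for top in range(0, height, stripe_height):
        bottom = min(top + stripe_height, height)
        for col in range(width):
            for row in range(top, bottom):
                yield (row, col)


if __name__ == "__main__":
    # A tiny 5x2 block: one full stripe followed by a one-row stripe.
    print(list(stripe_scan_order(5, 2)))
```

Because every pass walks the block in this fixed order and each decision depends on neighbor state from earlier positions, the generated context/bit pairs reach the arithmetic coder as a single ordered stream.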
9 Target HPEC Platform
- High-Perf. Embedded Computing: Nallatech BenNUEY w/ BenBLUE-II
- Three FPGAs (all Xilinx Virtex2 6000, -4 speed grade)
- Single user FPGA on BenNUEY PCI board
- Dual FPGAs on BenBLUE-II daughter card
- Low bandwidth to system memory through 64-bit/66 MHz PCI bus connection
- Large memory storage capability with 12 MB SRAM (166 MHz, ZBT)
- Advantages/Disadvantages
- High configuration time (PCI bus chained JTAG interface)
- Large memory storage helps alleviate strain on PCI bus
- Very good IO interface support with proprietary tools
Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.
10 DWT Single FPGA Results
Results for single DWT module design for BenNUEY board operating at 80 MHz
Note: software solution comes from execution on server with 2.4 GHz Xeon CPU
Resource utilization on Virtex2 6000 -4
Results for eight DWT modules design for BenNUEY board operating at 40 MHz
11 Tier-1 Encoding Current Results
Results for Tier-1 module design for BenNUEY board operating at 90 MHz
Note: software solution comes from execution on server with 2.4 GHz Xeon processor
Profiling shows performance projections with DMA transfer times included.
Results synthesized with Synplify Pro 7.7.1, PAR with Xilinx ISE 6.3
12 Conclusions from HPEC Platform
- Multi-chip system offers resources for increased parallelism or a multi-component application
- Order of magnitude improvement in total computation time
- Faster computation times on FPGA
- But communication overhead severely hinders performance improvement
- Low-bandwidth PCI interconnect not amenable to designs with challenging memory demands
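The effect of communication overhead on the achieved speedup can be made concrete with a simple timing model. This is a minimal sketch. The function name and all timing values are hypothetical and chosen only for illustration; they are not results from this work.

```python
def end_to_end_speedup(t_software, t_fpga_compute, t_transfer, overlapped=False):
    """Speedup over software once host<->FPGA transfers are counted.

    overlapped=False: transfers and computation happen back to back.
    overlapped=True:  transfers are fully masked behind computation
                      (e.g. by double buffering), so the longer one dominates.
    All times must use the same unit.
    """
    if overlapped:
        t_hw = max(t_fpga_compute, t_transfer)
    else:
        t_hw = t_fpga_compute + t_transfer
    return t_software / t_hw


if __name__ == "__main__":
    # Hypothetical times in milliseconds, for illustration only.
    t_sw, t_comp, t_xfer = 100.0, 5.0, 20.0
    print("compute only :", end_to_end_speedup(t_sw, t_comp, 0.0))
    print("with transfer:", end_to_end_speedup(t_sw, t_comp, t_xfer))
    print("masked       :", end_to_end_speedup(t_sw, t_comp, t_xfer, overlapped=True))
```

In this model a 20x computational speedup drops to 4x once an unmasked transfer four times longer than the computation is included, which is the kind of gap the conclusions above describe.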
13 Target HPC Platform
SGI Altix w/ RASC extension
- High-Performance Computing: SGI Altix 350 with FPGA Brick
- Single FPGA: Virtex2 6000 (-6 speed grade)
- Approximately 33% of chip used for SGI's RASC system layer
- Two algorithm clock speeds: 200 MHz and 100 MHz
- High bandwidth to system memory through proprietary NUMAlink interconnect (12.8 GB/s) through Scalable System Port (6.4 GB/s)
- 3 banks of QDR SRAM (6 MB each) with a full bandwidth of 9.6 GB/s (1.6 GB/s each for read and write per bank)
- Advantages/Disadvantages
- Extremely low reconfiguration time
- High memory bandwidth greatly helps memory-intensive apps, such as JPEG-2K
Diagram shown here only reflects those buses actually used in the design; other communication schemes are available.
14 Performance Projections
Profile shows projections for no-latency, infinite-bandwidth interconnect.
- NUMAlink interconnect
- Approximate order-of-magnitude improvement of transfers in similar designs
- Mitigates communication overhead bottleneck
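The order-of-magnitude claim is consistent with the peak link rates quoted on the platform slides. This is a minimal sketch comparing theoretical peak bandwidths only. It assumes a 64-bit, 66 MHz PCI bus and the 6.4 GB/s Scalable System Port figure; sustained rates in practice are lower on both links. The image size is a hypothetical example.

```python
# Theoretical peak rates in bytes per second.
PCI_PEAK = (64 // 8) * 66e6   # 64-bit bus at 66 MHz -> 528 MB/s
SSP_PEAK = 6.4e9              # Scalable System Port figure from the platform slide


def transfer_time(num_bytes, bandwidth):
    """Ideal transfer time in seconds, ignoring latency and protocol overhead."""
    return num_bytes / bandwidth


if __name__ == "__main__":
    # Hypothetical 768x512 image with three 8-bit components.
    image_bytes = 768 * 512 * 3
    t_pci = transfer_time(image_bytes, PCI_PEAK)
    t_ssp = transfer_time(image_bytes, SSP_PEAK)
    print(f"PCI: {t_pci * 1e3:.3f} ms, SSP: {t_ssp * 1e3:.3f} ms, "
          f"ratio: {t_pci / t_ssp:.1f}x")
```

The ratio of peak rates is about 12x, independent of the image size chosen.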
15 Lessons Learned and Conclusions
- Lessons Learned
- HW/SW codesign
- Shared-memory systems more amenable to closely-coupled processing associated with communication-sensitive RC applications
- PCI boards for servers effective when tasks are offloaded for processing with minimal or masked communication
- Memory bandwidth constrains parallelism in DWT design
- Serializing agent (arithmetic coder) in Tier-1 design is key limit to performance improvement
- Conclusions
- Identifying and accelerating key components yields better system performance (with a wary eye on Amdahl's Law)
- Performance enhancements achieved mostly through functional parallelism due to sequential processing constraints
16 Future Work and Acknowledgments
- Future Work
- Full system implementation on SGI Altix with RASC
- Region of Interest capability
- Lossy encoding and rate capability
- MCT and Tier-2 encoding on FPGA as well
- Single FPGA JPEG-2000 encoding application
- Acknowledgments
- We wish to thank the following vendors for equipment and/or tools in support of this research:
- SGI
- Nallatech
- Xilinx
- Aldec
- Special thanks to SGI Digital Media group and SGI RASC engineers for their help and suggestions
17 References
- [1] Adams, M.D. and Ward, R.K., "JasPer: a portable flexible open-source software tool kit for image coding/processing," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP04), pp. 241-244, May 2004.
- [2] OpenJPEG. http://www.opegjpeg.org/
- [3] Liu, L., Li, D., Li, Z., Wang, Z. and Chen, H., "A VLSI architecture of EBCOT encoder for JPEG2000," in 5th International Conference on ASIC, pp. 882-885, Oct. 2003.
- [4] Chen, K., Lian, C., Chen, H., and Chen, L., "Analysis and architecture design of EBCOT for JPEG-2000," in IEEE International Symposium on Circuits and Systems, vol. 2, pp. 765-768, May 2001.
- [5] Van Buren, D., "A high-rate JPEG2000 compression system for space," in IEEE Aerospace Conference, March 2005.
- [6] Aouadi, I., and Hammami, O., "Analysis and hardware design of a scalable dual JPEG-2000 entropy coder," in Euromicro Symposium on Digital System Design (DSD 2004), pp. 227-233, Sept. 2004.
- [7] Gangadhar, M. and Bhatia, D., "FPGA based EBCOT architecture for JPEG 2000," in IEEE International Conference on Field-Programmable Technology (FPT03), pp. 228-233, Dec. 2003.
- [8] Hung, K., Huang, Y., Truong, T., and Wang, C., "FPGA implementation for 2D discrete wavelet transform," in Electronics Letters, pp. 639-640, April 1998.
- [9] Lakshminarayanan, G., Venkataramani, B., Senthil Kumar, J., Yousuf, A.K. and Sriram, G., "Design and FPGA implementation of image block encoders with 2D-DWT," in Conference on Convergent Technologies for Asia-Pacific Region (TENCON 2003), pp. 1015-1019, Oct. 2003.
- [10] McCanny, P., Masud, S., and McCanny, J., "Design and implementation of the symmetrically extended 2-D wavelet transform," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP02), vol. 3, pp. 3108-3111, May 2002.
- [11] D. Taubman, "High performance scalable image compression with EBCOT," in IEEE Trans. Image Processing, vol. 9, pp. 1158-1170, July 2000.
- [12] I.E.G. Richardson, Video Codec Design: Developing Image and Video Compression Systems. Chichester, West Sussex, New York: John Wiley and Sons, Ltd (UK), 2002.
- [13] T. Acharya and P.-S. Tsai, JPEG 2000 Standard for Image Compression: Concepts, Algorithms, and VLSI Architectures. Hoboken, New Jersey: John Wiley and Sons, Inc., 2005.