Design and Tradeoff Analysis of JPEG-2000 on Hardware-Reconfigurable Systems

1
Design and Tradeoff Analysis of JPEG-2000
on Hardware-Reconfigurable Systems
  • Ryan DeVille, Vikas Aggarwal,
  • Ian Troxel, and Alan D. George
  • High-performance Computing and Simulation (HCS)
    Research Laboratory
  • Department of Electrical and Computer Engineering
  • University of Florida

2
Introduction
  • JPEG-2000 Encoding
  • State-of-the-art low bit-rate compression
    algorithm
  • Progressive transmission by quality, resolution,
    component, or spatial locality
  • Spatially random access to bitstream
  • Region of interest coding
  • Motivation for porting JPEG-2000 to RC systems
  • High-performance and low-cost solution is
    attractive for airborne and satellite imaging
    systems
  • Speedup readily available with fine-grain and
    coarse-grain parallelism opportunities

3
Related Research
  • EBCOT Encoder designs
  • Group of Column optimization method
  • Previous RC Designs
  • Space systems prototype [5]
  • Scalable Entropy Encoder [6]
  • Dual Processing Elements Architecture [7]
  • 2D Discrete Wavelet Transform designs
  • Several mimic early VLSI designs [8, 9]
  • Multiple architecture design classifications
    [10]
  • Direct
  • Perform 1D transform, transpose, perform another 1D transform
  • Intrinsically slow
  • Separate serial and parallel filters or parallel
    row, parallel column filters
  • Processes along rows and columns
  • Represents significant performance improvement
  • Symmetrically extended
  • Improves processing efficiency, especially
    towards center of image

4
JPEG-2000 Encoder Design Develop.
  • Software code profiling first used to determine
    effort distribution
  • Previous research efforts show that DWT and Tier-1
    encoding consume 80-85% of execution time
  • Current profiling results with JasPer and
    OpenJPEG show that >90% of execution time is spent
    in DWT and Tier-1
  • Benchmark images selected from Kodak Lossless
    True Color Image Suite, JasPer benchmark images,
    standard image processing images (lena, etc.)
  • [Figure: JasPer execution time profile]

5
Discrete Wavelet Transform (DWT)
  • Features
  • Second-most computationally intensive block in
    compression process
  • Transforms each component tile data into
    coefficients
  • Reversible transform involves all integer
    operations
  • Represents high- and low-frequency components of
    image
  • Amenable to compression; results in better
    compression ratios
  • Recursive application yields frequency bands at
    multiple resolutions
  • Operation
  • 2D transform achieved by successively
    applying 1D transform in X and Y directions
  • Each 1D transform consists of
  • Filtering step
  • De-interleave step: reorganizing data into bands
  • Available data and functional parallelism can
    be exploited

[Figure: three-level DWT subband layout (3LL, 3HL, 3LH, 3HH, 2HL, 2LH, 2HH, 1HL, 1LH, 1HH)]
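
The filtering and de-interleave steps described on this slide can be sketched in software. The example below is a minimal illustration, assuming the reversible 5/3 lifting filter that JPEG-2000 uses for lossless coding (the slides do not name the filter) and whole-sample symmetric extension at the boundaries. The function names are hypothetical, and this is not the hardware design from the presentation.

```python
def _mirror(i, n):
    """Whole-sample symmetric extension: reflect index i into [0, n)."""
    if i < 0:
        i = -i
    if i >= n:
        i = 2 * (n - 1) - i
    return i


def dwt53_1d(x):
    """One level of an integer 5/3 lifting transform on a list of ints.

    Returns (low, high), the de-interleaved low- and high-pass bands.
    """
    n = len(x)
    if n < 2:
        return list(x), []
    y = list(x)
    # Filtering step, predict: odd samples become high-pass coefficients.
    for i in range(1, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    # Filtering step, update: even samples become low-pass coefficients.
    for i in range(0, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    # De-interleave step: split the interleaved result into two bands.
    return y[0::2], y[1::2]


def idwt53_1d(low, high):
    """Exact inverse of dwt53_1d (undo the lifting steps in reverse order)."""
    n = len(low) + len(high)
    if n < 2:
        return list(low)
    y = [0] * n
    y[0::2] = low
    y[1::2] = high
    for i in range(0, n, 2):
        y[i] -= (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)] + 2) >> 2
    for i in range(1, n, 2):
        y[i] += (y[_mirror(i - 1, n)] + y[_mirror(i + 1, n)]) >> 1
    return y


def dwt53_2d(tile):
    """One 2D level: 1D transform along rows (X), then along columns (Y)."""
    rows = [dwt53_1d(r) for r in tile]
    left = [lo for lo, _ in rows]    # horizontally low-pass half
    right = [hi for _, hi in rows]   # horizontally high-pass half

    def along_columns(block):
        cols = [dwt53_1d(list(c)) for c in zip(*block)]
        lo = [list(r) for r in zip(*[l for l, _ in cols])]
        hi = [list(r) for r in zip(*[h for _, h in cols])]
        return lo, hi

    ll, lh = along_columns(left)
    hl, hh = along_columns(right)
    return ll, hl, lh, hh


print(dwt53_1d([5] * 8))  # prints ([5, 5, 5, 5], [0, 0, 0, 0])
```

Because every step uses only integer additions and shifts, the transform is exactly invertible. Applying `dwt53_2d` again to the returned LL band gives the next resolution level, which is the recursive application mentioned above.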
6
DWT Hardware Architecture
  • Challenges presented by DWT
  • Parallel processing limited by memory bandwidth
    requirements
  • Some sequential nature in processing involved
  • Design features
  • Data-level parallelism exploited by operating on
    multiple tiles
  • Function-level parallelism exploited by
    pipelining different processing steps
  • Data reuse eliminates extra read cycles
  • Internal architecture
  • Each tile is stored entirely in a single Block RAM
    to minimize data movement
  • Overlapped processing to further reduce latency

7
Embedded Block Coding with Optimized Truncation
(EBCOT) Tier-1
  • Features
  • Specially adapted arithmetic coder
  • Four bit-plane coding primitives
  • Three coding passes for each bit-plane (except
    the most significant)
  • Operation
  • Coding passes: cleanup pass (CUP) begins at most
    significant bit-plane
  • Iteratively perform coding passes over remaining
    bit planes
  • Coding-pass-generated context and bit data
    serially encoded and compressed by arithmetic
    encoder
  • Flush and reset arithmetic coder at completion
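
The pass ordering described above can be sketched as a simple schedule generator. This is a minimal illustration, assuming the standard EBCOT pass names (significance propagation, magnitude refinement, cleanup) and taking a flat list of integer coefficients as a stand-in for a code block. The function name is hypothetical, and no context formation or arithmetic coding is performed.

```python
def coding_pass_schedule(coefficients):
    """Return the ordered list of (bit_plane, pass_name) for one code block.

    The most significant non-zero bit-plane gets only a cleanup pass;
    every lower bit-plane gets all three passes.
    """
    # Index of the most significant non-zero magnitude bit-plane (-1 if none).
    top = max((abs(c) for c in coefficients), default=0).bit_length() - 1
    if top < 0:
        return []  # all-zero block: nothing to code

    schedule = [(top, "cleanup")]
    for plane in range(top - 1, -1, -1):
        schedule.append((plane, "significance propagation"))
        schedule.append((plane, "magnitude refinement"))
        schedule.append((plane, "cleanup"))
    return schedule


for plane, name in coding_pass_schedule([5, -2, 0, 1]):
    print(plane, name)
```

For a block with P non-zero bit-planes, this yields 3P - 2 passes, each of which feeds context and bit data to the arithmetic coder in sequence.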

8
Tier-1 Encoding Hardware Architecture
  • Challenges presented by Tier-1 encoding
  • Serial process: creation of current MQ context
    data directly depends upon previous pass results
  • Bursty communication: contextual data from a
    pass arrives in short, semi-continuous bursts
  • Large amounts of data and flags must be stored
    through multiple iterations of algorithm,
    requiring high memory bandwidth
  • Internal architecture (high-level)
  • Retrieve current stripe from memory for
    processing
  • Data is operated on in a pipelined fashion through
    registers
  • Context and data information sent to queues
  • Serializing agent: arithmetic entropy encoder
  • MQ Input Controller regulates input to arithmetic
    entropy encoder, ensuring correct operation
  • Data from arithmetic entropy encoder is written
    to a separate, final buffer

Design decision to use MQ encoder as serializing
agent saves area and BlockRAM space without
sacrificing too much performance.
9
Target HPEC Platform
  • High-Performance Embedded Computing (HPEC): Nallatech
    BenNUEY w/ BenBLUE-II
  • Three FPGAs (all Xilinx Virtex2 6000, -4 speed grade)
  • Single user FPGA on BenNUEY PCI board
  • Dual FPGAs on BenBLUE-II daughter card
  • Low bandwidth to system memory through 64-bit/66 MHz
    PCI bus connection
  • Large memory storage capability with 12 MB SRAM
    (166 MHz, ZBT)
  • Advantages/Disadvantages
  • High configuration time (PCI bus chained JTAG
    interface)
  • Large memory storage helps alleviate strain on
    PCI bus
  • Very good IO interface support with proprietary
    tools

Diagram shown here reflects only those buses
actually used in the design; other communication
schemes are available.
10
DWT Single FPGA Results
Results for single DWT module design for BenNUEY
board operating at 80 MHz
Note: software solution comes from execution on a
server with a 2.4 GHz Xeon CPU
Resource Utilization on Virtex2 6000 -4
Results for eight-DWT-module design for BenNUEY
board operating at 40 MHz
11
Tier-1 Encoding Current Results
Results for Tier-1 module design for BenNUEY board
operating at 90 MHz
Note: software solution comes from execution on a
server with a 2.4 GHz Xeon processor
Profiling shows performance projections with DMA
transfer times included.
Results synthesized with Synplify Pro 7.7.1,
PAR with Xilinx ISE 6.3
12
Conclusions from HPEC Platform
  • Multi-chip system offers resources for increased
    parallelism or a multi-component application
  • Order of magnitude improvement in total
    computation time
  • Faster computation times on FPGA
  • But communication overhead severely hinders
    performance improvement
  • Low-bandwidth PCI interconnect not amenable to
    designs with challenging memory demands

13
Target HPC Platform
SGI Altix w/ RASC extension
  • High-Performance Computing (HPC): SGI Altix 350 with
    FPGA Brick
  • Single FPGA: Virtex2 6000 (-6 speed grade)
  • Approximately 33% of chip used for SGI's RASC
    system layer
  • Two algorithm clock speeds: 200 MHz and 100 MHz
  • High bandwidth to system memory through
    proprietary NUMAlink interconnect (12.8 GB/s)
    via Scalable System Port (6.4 GB/s)
  • 3 banks of QDR SRAM (6 MB each) with a full
    bandwidth of 9.6 GB/s (1.6 GB/s each for read and
    write, per bank)
  • Advantages/Disadvantages
  • Extremely low reconfiguration time
  • High memory bandwidth greatly helps
    memory-intensive apps, such as JPEG-2K

Diagram shown here reflects only those buses
actually used in the design; other communication
schemes are available.
14
Performance Projections
Profile shows projections for no-latency,
infinite-bandwidth interconnect.
  • NUMAlink interconnect
  • Approximate order-of-magnitude improvement of
    transfers in similar designs
  • Mitigates communication overhead bottleneck
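
The order-of-magnitude claim can be checked with a back-of-the-envelope comparison of peak link rates. The sketch below uses the 64-bit/66 MHz PCI figure and the 6.4 GB/s Scalable System Port figure quoted on the platform slides. The 768 x 512, 24-bit image size is an assumption for illustration, and sustained rates on real hardware will be lower than these theoretical peaks.

```python
# Theoretical peak rates in bytes per second (decimal units assumed).
PCI_PEAK = 64 / 8 * 66e6   # 64-bit bus at 66 MHz
SSP_PEAK = 6.4e9           # Scalable System Port figure from the slides


def transfer_time(num_bytes, bytes_per_second):
    """Ideal transfer time in seconds, ignoring latency and protocol overhead."""
    return num_bytes / bytes_per_second


# Hypothetical test image: 768 x 512 pixels, 3 bytes per pixel.
image_bytes = 768 * 512 * 3

pci_ms = transfer_time(image_bytes, PCI_PEAK) * 1e3
ssp_ms = transfer_time(image_bytes, SSP_PEAK) * 1e3
ratio = SSP_PEAK / PCI_PEAK

print(f"PCI: {pci_ms:.2f} ms, SSP: {ssp_ms:.2f} ms, ratio: {ratio:.1f}x")
```

Under these assumptions the ideal transfer drops from roughly 2.2 ms to roughly 0.18 ms, a ratio of about 12, which is consistent with the approximate order-of-magnitude improvement stated above.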

15
Lessons Learned and Conclusions
  • Lessons Learned
  • HW/SW codesign
  • Shared-memory systems more amenable to
    closely-coupled processing associated with
    communication-sensitive RC applications
  • PCI boards for servers effective when tasks are
    offloaded for processing with minimal or masked
    communication
  • Memory bandwidth constrains parallelism in DWT
    design
  • Serializing agent (arithmetic coder) in Tier-1
    design is key limit to performance improvement
  • Conclusions
  • Identifying and accelerating key components
    yields better system performance (with a wary eye
    on Amdahl's Law)
  • Performance enhancements achieved mostly through
    functional parallelism due to sequential
    processing constraints
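
The Amdahl's Law caveat can be made concrete with the profiling figure from earlier in the deck. The sketch below assumes that 90% of execution time (the lower bound of the ">90%" profiling result) is spent in the accelerated DWT and Tier-1 kernels. The 10x kernel speedup is a hypothetical value chosen for illustration, not a measured result.

```python
def amdahl_speedup(accelerated_fraction, kernel_speedup):
    """Overall speedup when only part of the runtime is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / kernel_speedup)


fraction = 0.90  # share of runtime in DWT + Tier-1 (assumed lower bound)

print(round(amdahl_speedup(fraction, 10.0), 2))          # prints 5.26
print(round(amdahl_speedup(fraction, float("inf")), 2))  # prints 10.0
```

Even with infinitely fast kernels, the remaining 10% of software time caps the overall gain at 10x under this assumption, which is why moving MCT and Tier-2 encoding onto the FPGA matters for further improvement.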

16
Future Work and Acknowledgments
  • Future Work
  • Full system implementation on SGI Altix with RASC
  • Region of Interest capability
  • Lossy encoding and rate capability
  • MCT and Tier-2 encoding on FPGA as well
  • Single FPGA JPEG-2000 encoding application
  • Acknowledgments
  • We wish to thank the following vendors for
    equipment and/or tools in support of this
    research
  • SGI
  • Nallatech
  • Xilinx
  • Aldec
  • Special thanks to SGI Digital Media group and SGI
    RASC engineers for their help and suggestions

17
References
  • [1] Adams, M.D. and Ward, R.K., "JasPer: a
    portable flexible open-source software tool kit
    for image coding/process," in IEEE International
    Conference on Acoustics, Speech, and Signal
    Processing (ICASSP'04), pp. 241-244, May 2004.
  • [2] OpenJPEG. http://www.openjpeg.org/
  • [3] Liu, L., Li, D., Li, Z., Wang, Z., and Chen,
    H., "A VLSI architecture of EBCOT encoder for
    JPEG2000," in 5th International Conference on
    ASIC, pp. 882-885, Oct. 2003.
  • [4] Chen, K., Lian, C., Chen, H., and Chen, L.,
    "Analysis and architecture design of EBCOT for
    JPEG-2000," in IEEE International Symposium on
    Circuits and Systems, vol. 2, pp. 765-768, May
    2001.
  • [5] Van Buren, D., "A high-rate JPEG2000
    compression system for space," in IEEE Aerospace
    Conference, March 2005.
  • [6] Aouadi, I. and Hammami, O., "Analysis and
    hardware design of a scalable dual JPEG-2000
    entropy coder," in Euromicro Symposium on Digital
    System Design (DSD 2004), pp. 227-233, Sept.
    2004.
  • [7] Gangadhar, M. and Bhatia, D., "FPGA based
    EBCOT architecture for JPEG 2000," in IEEE
    International Conference on Field-Programmable
    Technology (FPT'03), pp. 228-233, Dec. 2003.
  • [8] Hung, K., Huang, Y., Truong, T., and Wang, C.,
    "FPGA implementation for 2D discrete wavelet
    transform," in Electronics Letters, pp. 639-640,
    April 1998.
  • [9] Lakshminarayanan, G., Venkataramani, B.,
    Senthil Kumar, J., Yousuf, A.K., and Sriram, G.,
    "Design and FPGA implementation of image block
    encoders with 2D-DWT," in Conference on
    Convergent Technologies for Asia-Pacific Region
    (TENCON 2003), pp. 1015-1019, Oct. 2003.
  • [10] McCanny, P., Masud, S., and McCanny, J.,
    "Design and implementation of the symmetrically
    extended 2-D wavelet transform," in IEEE
    International Conference on Acoustics, Speech,
    and Signal Processing (ICASSP'02), vol. 3, pp.
    3108-3111, May 2002.
  • [11] Taubman, D., "High performance scalable
    image compression with EBCOT," in IEEE Trans.
    Image Processing, vol. 9, pp. 1158-1170, July
    2000.
  • [12] Richardson, I.E.G., Video Codec Design:
    Developing Image and Video Compression Systems.
    Chichester, West Sussex, New York: John Wiley and
    Sons, Ltd (UK), 2002.
  • [13] Acharya, T. and Tsai, P.-S., JPEG 2000
    Standard for Image Compression: Concepts,
    Algorithms, and VLSI Architectures. Hoboken, New
    Jersey: John Wiley and Sons, Inc., 2005.