1
FPGA Implementation of JPEG2000 Encoder/Decoder
  • By Yaw Appiah

Institute of Electronic Systems
ASPI Project Group 1044
June 12, 2006
2
Agenda
3
Introduction
4
Introduction
  • Unified scheme
  • Different still image types
  • Different characteristics
  • Different Image Models
  • State-of-the-art
  • Low and high bit-rate compression algorithm
  • Multiple resolution display
  • Progressive transmission precision
  • Lossy and lossless

5
JPEG2000 Applications
6
JPEG2000 Implementation Challenges
  • More features, more complexity.
  • Many different stages to produce compressed
    output.
  • Many parameters tracked individually for each
    code block (64x64).

COMPLEX ALGORITHM
OPERATION INTENSIVE
  • Several hundred operations per pixel; each bit is
    processed many times (DWT, entropy coding, etc.)

MEMORY INTENSIVE
  • Each Pixel must be accessed many times, many
    buffers needed for good throughput

Few processors are capable of implementing
JPEG2000 at high rates
Processors take a relatively long time to decode
still images
7
Goals
  • Study JPEG2000 standard
  • Develop modified software
  • Accelerate encoding and decoding process
  • Propose JPEG2000 architecture
  • Implement proposed architecture

8
A3 Model
Application
  • Medical Wireless
  • Imaging
  • Printers & Scanners
  • Digital Camcorders

Algorithm
  • JPEG2000
  • JPEG
  • JBIG
  • etc..

Architecture
  • FPGAs
  • Virtex II
  • Cyclone II
  • DSPs
  • ASICs, etc..

9
SpecC Methodology
Specification Model
  • Pure functionality of intended system
  • Functionality, with no notion of timing

Architecture Model
  • System architecture with no precise timing and
    communication

Communication Model
  • Model with accurate timing data
  • Model not cycle accurate

Implementation Model
  • Model is cycle-accurate RTL

10
Specification
11
Specification
Implementation
Flow
Verification
Requirements Capture
Specification Model
Functional Sim
Specify
Functionally Correct
  • Communication Computation Separation
  • To reveal data dependency
  • Expose parallelism

12
Encoder Functionality
1011101010010101010001110001
13
Decoder Functionality
1011101010010101010001110001
Code Stream
Code Stream
14
Encoder Specification
Read image
Pre-processing
Image-tile
Multiple Tiles
67K
DC Level
0K
Color Transform (mct.c)
7K
ICT
RCT
Core-Processing
predata
54K
coredata
22K
0K
111K
Post-processing (Write .j2k)
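
The pre-processing steps named on this slide (DC level shift, then the reversible colour transform, RCT) can be sketched as follows. This is a minimal illustration of the standard integer RCT, not the code from mct.c; the function names are chosen here for the example.

```python
def dc_level_shift(sample, bit_depth=8):
    """Shift an unsigned sample into a signed range centred on zero."""
    return sample - (1 << (bit_depth - 1))


def rct_forward(r, g, b):
    """Reversible colour transform (integer only, used on the lossless path)."""
    y = (r + 2 * g + b) >> 2  # floor((R + 2G + B) / 4); >> floors for negatives too
    cb = b - g
    cr = r - g
    return y, cb, cr


def rct_inverse(y, cb, cr):
    """Exact inverse of rct_forward."""
    g = y - ((cb + cr) >> 2)
    r = cr + g
    b = cb + g
    return r, g, b


# Round trip on one level-shifted RGB pixel.
pixel = tuple(dc_level_shift(v) for v in (200, 120, 30))
assert rct_inverse(*rct_forward(*pixel)) == pixel
```

Because only additions, subtractions and shifts are used, the transform is exactly invertible, which is what makes it suitable for lossless coding. The ICT is the floating-point counterpart used on the lossy path.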
15
Constraints
  • User-supplied JP2 parameters affect compression
    time

CODING DELAY
  • Encoding time should be less than 7.51ns
  • Delay applies to decoder side

IMAGE SIZE
  • Color Images set at max size of (2424)K pixel
  • Successful test on test image using modified
    JPEG2000 SW

IMAGE FORMAT
  • Image type set to .bmp, .pnm, .pgm, .ppm
  • Other formats can be used

HIGH OUTPUT BIT-RATE
  • Increase bit-rate by using internal memory of
    FPGA

FLEXIBILITY
  • Parts of encoding/decoding process must be in
    independent blocks
  • For duplication and parallelism

MEMORY
  • No external memory should be used
  • Encoding and decoding process in line-based mode

16
Coding Delay
CYCLES (×10^9)
SEC. (@ 994 MHz)
TRACY IMAGE
ENCODING OPERATION MORE COMPLEX THAN DECODING
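
The two axes on this slide are related by the profiling clock frequency. A minimal sketch of the conversion, assuming the 994 MHz figure from the slide; the cycle count in the example is hypothetical and is not a measurement from the presentation.

```python
CLOCK_HZ = 994e6  # profiling clock frequency quoted on the slide


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count into wall-clock seconds at a given clock."""
    return cycles / clock_hz


# Hypothetical example: 1.988e9 cycles at 994 MHz is about two seconds.
example_seconds = cycles_to_seconds(1.988e9)
```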
17
Design Space Exploration
18
Design Space Exploration
Implementation
Flow
Verification
Specify
Requirements Capture
Specification Model
Functional Sim
Functionally Correct
Architecture/Algorithm Exploration
Design
Architecture Model
Behavioural Sim
DataFlow Metrics
  • Partitioning of Behaviors
  • Scheduling of Behaviors

19
Profiling & Estimation
COMPUTATIONAL RUN TIME
Image
% of Running Time
20
Profiling & Estimation
COMPUTATIONAL RUN TIME ENCODER/DECODER
Functions
% of Running Time
21
Partitioning Tradeoff Analysis
CRITICALITY OF CODEC BEHAVIOR
[Figure: criticality versus sub-behaviors, with behavior groups Q (a, b, c, d), R (b, c, d) and T (e, f)]
22
HW/SW Estimation
SW ESTIMATION OF FUNCTIONAL BLOCKS
Millions Clock Cycles
HW ESTIMATION OF FUNCTIONAL BLOCKS
Millions Clock Cycles
23
Partitioning methodologies
  • Port small part of software code to hardware, set
    up interfaces and test
  • Incrementally move software code to hardware,
    test and verify at every stage
  • Stop when goals are reached (time, area,
    performance, etc.)
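
The incremental approach above can be sketched as a greedy loop: move the most expensive software function to hardware, re-evaluate, and stop once the goal is met. This is an illustrative sketch only. The cycle counts are hypothetical, the budget stands in for whatever goal is chosen (time, area, performance), and the block names t1, mqc and dwt are borrowed from the later partitioning slides.

```python
def partition_incrementally(sw_cycles, hw_cycles, cycle_budget):
    """Move functions from SW to HW, costliest first, until the budget is met.

    sw_cycles / hw_cycles map a function name to its estimated cycle count
    in software and in hardware respectively.
    """
    in_hw = []
    total = sum(sw_cycles.values())
    for name in sorted(sw_cycles, key=sw_cycles.get, reverse=True):
        if total <= cycle_budget:
            break  # goal reached, stop porting
        total += hw_cycles[name] - sw_cycles[name]
        in_hw.append(name)
    return in_hw, total


# Hypothetical estimates in millions of cycles.
sw = {"t1": 60, "mqc": 25, "dwt": 10, "io": 5}
hw = {"t1": 6, "mqc": 5, "dwt": 2, "io": 5}
moved, remaining = partition_incrementally(sw, hw, cycle_budget=30)
```

With these made-up numbers the loop ports t1 and then mqc and stops, because the total then falls under the budget. A real flow would also test and verify after each move, as the slide says.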

24
Partitioning Solution 1
REASONS
BIT-MODELING
  • Computations are done bit-by-bit
  • Such computation offers huge opportunity for
    parallelism

In SW
In HW
SW OVERHEAD
  • High SW overhead is better handled in HW
  • Due to the low number of control paths

HIGH RUN TIME
  • t1 functions account for the highest % of run time
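
To illustrate why bit-by-bit computation exposes parallelism, the sketch below splits a list of quantised coefficients into sign bits and magnitude bit-planes. It shows only the decomposition that Tier-1 bit-modeling operates on. The real coder also runs context-dependent coding passes over each plane, which are omitted here. The function name and sample values are assumptions for the example.

```python
def bit_planes(coeffs, num_planes):
    """Split integer coefficients into sign bits and magnitude bit-planes.

    Planes are returned most significant first. Each bit is extracted
    independently of every other, so hardware can do this in parallel.
    """
    signs = [1 if c < 0 else 0 for c in coeffs]
    mags = [abs(c) for c in coeffs]
    planes = [
        [(m >> p) & 1 for m in mags]
        for p in range(num_planes - 1, -1, -1)
    ]
    return signs, planes


signs, planes = bit_planes([5, -3, 0, 6], num_planes=3)
# signs  -> [0, 1, 0, 0]
# planes -> [[1, 0, 0, 1], [0, 1, 0, 1], [1, 1, 0, 0]]
```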

25
Partitioning Solution 2
REASONS
DATA TRANSFER
  • mqc functions frequently called
  • Needs to be in HW to reduce SW/HW data transfer

In SW
In HW
REPEATED LOOPS
  • Frequently repeated loops in mqc sub-functions
  • Avoid a SW/HW split here; it would be
    counterproductive due to huge communication overhead

DATA SHARING
  • Resource sharing between sub-behaviors makes a
    SW/HW divide impossible

26
Partitioning Solution 3
REASONS
DWT Coeff.
  • High number of DWT coeff.

In SW
In HW
DATA STRUCTURE
  • The data-dependency structure in dwt.c makes it
    possible to use MAC (multiply-accumulate) units

ITERATIVE LOOP
  • The repeated loop in dwt_encode argues against a
    SW/HW divide

DISADVANTAGE
INCREASED HW COMPLEXITY
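
For reference, the regular data flow of the DWT can be seen in a one-level, one-dimensional lifting step. The sketch below uses the reversible 5/3 filter because it is short and integer-only. The slides do not say which filter dwt.c was profiled with, so treat the choice as an assumption. The irreversible 9/7 filter follows the same predict/update pattern but multiplies by constants, which is where MAC units would apply.

```python
def dwt53_forward(x):
    """One level of the reversible 5/3 lifting DWT on an even-length int list."""
    n = len(x)
    assert n >= 2 and n % 2 == 0
    half = n // 2
    # Predict: d[i] = x[2i+1] - floor((x[2i] + x[2i+2]) / 2)
    d = []
    for i in range(half):
        right = x[2 * i + 2] if 2 * i + 2 < n else x[n - 2]  # symmetric extension
        d.append(x[2 * i + 1] - ((x[2 * i] + right) >> 1))
    # Update: s[i] = x[2i] + floor((d[i-1] + d[i] + 2) / 4)
    s = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]  # symmetric extension
        s.append(x[2 * i] + ((left + d[i] + 2) >> 2))
    return s, d  # low-pass, high-pass


def dwt53_inverse(s, d):
    """Undo dwt53_forward exactly by reversing the lifting steps."""
    half = len(s)
    even = []
    for i in range(half):
        left = d[i - 1] if i > 0 else d[0]
        even.append(s[i] - ((left + d[i] + 2) >> 2))
    x = []
    for i in range(half):
        right = even[i + 1] if i + 1 < half else even[half - 1]
        x.append(even[i])
        x.append(d[i] + ((even[i] + right) >> 1))
    return x


signal = [3, 7, 1, 8, 2, 9, 4, 6]
low, high = dwt53_forward(signal)
assert dwt53_inverse(low, high) == signal
```

Every output sample depends only on a few neighbouring inputs, and the same short loop body repeats across the whole row. That regular structure is what makes the transform a natural fit for a hardware pipeline.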
27
Traffic & Storage
K = Kilo-Byte, M = Mega-Byte, Com = Communication
CFB
28
Choice of Partition Solution
SELECTION
SOLUTION 3
In SW
REASONS
In HW
HARDWARE
  • Best performance with almost the same HW as
    Solutions 1 & 2

CONSTRAINT
  • Satisfies Design Constraints to a greater extent

29
Architectural Model
REASON
FLEXIBLE LOGIC RESOURCES, HIGH PERFORMANCE
INTERFACE TO EXTERNAL MEMORY
30
Delays after Archit. Model
HW SPEED-UP ESTIMATES GREATER THAN AMDAHL'S LAW
ESTIMATES
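
As a reminder of what the Amdahl's law estimate is: if a fraction p of the run time is accelerated by a factor s, the overall speed-up is 1 / ((1 - p) + p / s). A minimal sketch follows. The fractions used in the example are hypothetical and are not the profiling figures from the presentation.

```python
def amdahl_speedup(fraction_accelerated, block_speedup):
    """Overall speed-up when only part of the run time is accelerated."""
    p, s = fraction_accelerated, block_speedup
    return 1.0 / ((1.0 - p) + p / s)


# Hypothetical: 90% of the run time moved to HW that is 10x faster.
overall = amdahl_speedup(0.9, 10.0)

# However fast the HW block is, the speed-up stays below 1 / (1 - p).
assert amdahl_speedup(0.9, 1e12) < 10.0
```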
31
Communication Synthesis
32
Communications Synthesis
Implementation
Flow
Verification
Specify
Requirements Capture
Specification Model
Functional Sim
Functionally Correct
Design
Architecture/Algorithm Exploration
Architecture Model
Behavioural Sim
DataFlow Metrics
Bus Synthesis
Communications Synthesis
Communications Model
Bus-Functional Sim
Clock-Cycle Metrics
  • Protocol Insertion
  • Transducer Synthesis

33
Communication Model
System Bus
SW/HW communication blocks are timing accurate but
not cycle accurate
MicroBlaze, JPG2K_HW and BRAM are interconnected via
the system bus
Transducer inserted because the MicroBlaze and JPG2K_HW
protocols are not compatible
All interfaces to the same bus are clocked at the same
speed
34
MicroBlaze/JPG2K_HW Interface
35
MicroBlaze Read/Write Timing
WRITING TO JPG2K_HW
READING FROM JPG2K_HW
36
BRAM/JPG2K_HW Interface
WRITING TO JPG2K_HW
READING FROM JPG2K_HW
37
Comm. Model Results
JPG2K_HW Cycles (×10^6)
Simulation done with ModelSim XE
Com. Model (A) @ 6 ns
Archit. Model (B)
Diff (A - B)
Com. Model cycles (Tier-1 & DWT)
Com. Model delay: average of writes and reads
to/from JPG2K_HW
Fulfills timing constraint of 7.51 ns
No. of cycles rises by 0.1 in Com. Model due to
communication overhead
38
Back-End
39
Backend
Implementation
Verification
Requirements Capture
Specification Model
Functional Sim
Functionally Correct
Architecture/Algorithm Exploration
Architecture Model
Behavioral Sim
DataFlow Metrics
Implementation Refinement
Implementation Model
Cycle-Accurate Sim
Speed/Area
Compile
EDIF
Timing Sim
Place & Route
FPGA
CPU
40
Proposed Architecture
41
Operation Profile
OPERATION PROFILE FOR 1 CODE-BLOCK
Block counts
Type of Operation
42
Analysis of DWT Results
AREA
Reduction by factors of 11.77 & 11.81 for
DWT-mid & DWT-par
LUTS
REASON
- Seq: high amount of control logic
- Par: low amount of control logic
SPEED
Reduction by factors of 1.39 & 2.46 for DWT-mid &
DWT-par
CYCLES
REASON
- Seq: sequential processing of a high no. of DWT
code-blocks
- Par: parallel processing of a high no. of DWT
code-blocks
43
Analysis of MQC Results
AREA
Reduction by factors of 1.70 & 1.73 for MQC-mid &
MQC-par
LUTS
REASON
- Seq & Par: same as DWT
SPEED
Reduction by factor of 1.27 for MQC-par
CYCLES
REASON
- High no. of cycles: sequential processing of large data
transfer
- Low no. of cycles: parallel processing of large data
transfer
44
Analysis of EBCOT Results
AREA
Reduction by factors of 10.75 & 13.97 for
EBCOT-seq & EBCOT-par
LUTS
REASON
- High & low no. of cycles: same as DWT & MQC
SPEED
Reduction by factor of 1.04 for EBCOT-par
CYCLES
REASON
- High no. of cycles: sequential processing of a large no. of
bit passes
- Low no. of cycles: low amount of control logic
(control steps)
45
JPG2K_HW Imple. Results
JPG2K_HW
Results after PAR using Xilinx ISE tool
High slack for EBCOT & DWT
MQC: Proposed / Industry
The JPG2K_HW is faster by 13 and 2.3 for MQC & DWT
respectively.
46
Conclusions
Implementation Model: JPG2K_HW is faster by 13 and
2.3 for Tier-1 & DWT respectively.
47
Future Work
LENGTH Nov. 05 - May 06 (7 months)
Imple. Mod.
Spec. Model
  • 1 MTH
  • 2 MTHS
  • 1 MTH
  • 3 MTHS

48
The End!!!
Many Thanks for Listening!