Title: FPGA Implementation of JPEG2000 Encoder/Decoder
1FPGA Implementation of JPEG2000 Encoder/Decoder
Institute of Electronic Systems
ASPI Project Group 1044
June 12, 2006
2Agenda
3Introduction
4Introduction
- Unified scheme
- Different still image types
- Different characteristics
- Different Image Models
- State-of-the-art
- Low and high bit-rate compression algorithm
- Multiple resolution display
- Progressive transmission precision
- Lossy and lossless
5JPEG2000 Applications
6JPEG2000 Implementation Challenges
- More features mean more complexity.
- Many different stages to produce compressed output.
- Many parameters tracked individually for each code block (64x64).
COMPLEX ALGORITHM
OPERATION INTENSIVE
- Several hundred operations per pixel; each bit is processed many times (DWT, entropy coding, etc.)
MEMORY INTENSIVE
- Each pixel must be accessed many times; many buffers are needed for good throughput
Few processors are capable of implementing JPEG2000 at high rates
Processors take a relatively long time to decode still images
7Goals
- Study JPEG2000 standard
- Develop modified software
- Accelerate encoding and decoding process
- Propose JPEG2000 architecture
- Implement proposed architecture
8A3 Model
Application
- Medical Wireless
- Imaging
- Printers and Scanners
- Digital Camcorders
Algorithm
Architecture
- FPGAs
- Virtex II
- Cyclone II
- DSPs
- ASICs, etc.
9SpecC Methodology
Specification Model
- Pure functionality of the intended system
- Functionality, with no notion of timing
Architecture Model
- System architecture with no precise timing and communication
Communication Model
- Model with accurate timing data
- Model not cycle accurate
Implementation Model
- Model is cycle-accurate RTL
10Specification
11Specification
Implementation
Flow
Verification
Requirements Capture
Specification Model
Functional Sim
Specify
Functionally Correct
- Communication and Computation Separation
- To reveal data dependency
- Expose parallelism
12Encoder Functionality
1011101010010101010001110001
13Decoder Functionality
1011101010010101010001110001
Code Stream
Code Stream
14Encoder Specification
Read image
Pre-processing
Image-tile
Multiple Tiles
67K
DC Level
0K
Color Transform (mct.c)
7K
ICT
RCT
Core-Processing
predata
54K
coredata
22K
0K
111K
Post-processing (Write .j2k)
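The pre-processing stages named above (DC level shift and the reversible colour transform, RCT) can be sketched in a few lines. This is only an illustration of the standard integer RCT, not the code in mct.c; the function names and the 8-bit default are assumptions made for the example.

```python
def dc_level_shift(sample, bit_depth=8):
    """Centre an unsigned sample around zero by subtracting 2^(bit_depth - 1)."""
    return sample - (1 << (bit_depth - 1))


def rct_forward(r, g, b):
    """Reversible colour transform: integer-only, so it can be undone exactly."""
    y = (r + 2 * g + b) >> 2   # floor((R + 2G + B) / 4)
    cb = b - g
    cr = r - g
    return y, cb, cr


def rct_inverse(y, cb, cr):
    """Exact inverse of rct_forward."""
    g = y - ((cb + cr) >> 2)
    r = cr + g
    b = cb + g
    return r, g, b


# Illustrative pixel (values chosen for the example only).
shifted = tuple(dc_level_shift(c) for c in (200, 100, 50))
restored = rct_inverse(*rct_forward(*shifted))
```

Because only additions and shifts are involved, the round trip is lossless, which is why the RCT is paired with the lossless path while the ICT is used for lossy coding.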
15Constraints
- User-supplied JP2 parameters affect compression time
CODING DELAY
- Encoding time should be less than 7.51 ns
- The same delay constraint applies to the decoder side
IMAGE SIZE
- Color Images set at max size of (2424)K pixel
- Successful test on test image using modified
JPEG2000 SW
IMAGE FORMAT
- Image type set to .bmp, .pnm, .pgm, .ppm
- Other formats can be used
HIGH OUTPUT BIT-RATE
- Increase bit-rate by using internal memory of
FPGA
FLEXIBILITY
- Parts of encoding/decoding process must be in independent blocks
- For duplication and parallelism
MEMORY
- No external memory should be used
- Encoding and decoding process in line-based mode
16Coding Delay
CYCLES (10^9)
SEC. (@ 994 MHz)
TRACY IMAGE
ENCODING OPERATION MORE COMPLEX THAN DECODING
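The two axes on this slide (cycle counts and seconds at 994 MHz) are related by a single division. A minimal sketch of that conversion follows; the function name is an assumption, and no measured cycle counts are reproduced here because the chart values did not survive.

```python
CLOCK_HZ = 994e6  # clock rate quoted on the slide


def cycles_to_seconds(cycles, clock_hz=CLOCK_HZ):
    """Convert a profiled cycle count into wall-clock seconds at a given clock rate."""
    return cycles / clock_hz


# Example with a hypothetical count of 1.988e9 cycles.
example_seconds = cycles_to_seconds(1.988e9)
```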
17Design Space Exploration
18Design Space Exploration
Implementation
Flow
Verification
Specify
Requirements Capture
Specification Model
Functional Sim
Functionally Correct
Architecture/Algorithm Exploration
Design
Architecture Model
Behavioural Sim
DataFlow Metrics
- Partitioning of Behaviors
- Scheduling of Behaviors
19Profiling Estimation
COMPUTATIONAL RUN TIME
Image
% of Running Time
20Profiling Estimation
COMPUTATIONAL RUN TIME ENCODER/DECODER
Functions
% of Running Time
21Partitioning Tradeoff Analysis
CRITICALITY OF CODEC BEHAVIOR
Q a b c d
Q
R b c d
Criticality
a
T e f
Sub-behaviors
22HW/SW Estimation
SW ESTIMATION OF FUNCTIONAL BLOCKS
Millions Clock Cycles
HW ESTIMATION OF FUNCTIONAL BLOCKS
Millions Clock Cycles
23Partitioning methodologies
- Port a small part of the software code to hardware, set up interfaces and test
- Incrementally move software code to hardware; test and verify at every stage
- Stop when goals are reached (time, area, performance, etc.)
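The incremental methodology above can be sketched as a simple greedy loop: move the block with the largest estimated saving to hardware, re-check the goal, and stop once it is met. The block names are taken from the slides, but every cycle count below is hypothetical, and the greedy ordering is an assumption rather than the procedure the project necessarily followed.

```python
def incremental_partition(sw_cycles, hw_cycles, target_cycles):
    """Move blocks to HW one at a time, largest saving first, until the target is met."""
    in_hw = []
    total = sum(sw_cycles.values())  # start with everything in software
    by_saving = sorted(sw_cycles, key=lambda n: sw_cycles[n] - hw_cycles[n], reverse=True)
    for name in by_saving:
        if total <= target_cycles:   # goal reached: stop porting
            break
        total -= sw_cycles[name] - hw_cycles[name]
        in_hw.append(name)
    return in_hw, total


# Hypothetical estimates in millions of clock cycles (not measured data).
sw = {"t1": 500, "mqc": 300, "dwt": 150, "mct": 50}
hw = {"t1": 50, "mqc": 60, "dwt": 30, "mct": 40}
moved, remaining = incremental_partition(sw, hw, target_cycles=400)
```

In a real flow each step would also be followed by interface set-up and verification, and area would be tracked alongside time.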
24Partitioning Solution 1
REASONS
BIT-MODELING
- Computations are done bit-by-bit
- Such computation offers huge opportunity for
parallelism
In SW
In HW
SW OVERHEAD
- High SW overhead is better handled in HW
- Due to the low number of control paths
HIGH RUN TIME
- t1 functions account for the highest % of run time
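To illustrate what "bit-by-bit" means here: tier-1 coding visits the magnitude of each coefficient one bit-plane at a time, from the most significant plane down. The sketch below only splits a list of coefficients into bit-planes; it does not implement context modelling or the coding passes, and the function name is an assumption.

```python
def bit_planes(coeffs, num_planes):
    """Return the magnitude bit-planes of integer coefficients, most significant first."""
    planes = []
    for p in range(num_planes - 1, -1, -1):
        # Bit p of every magnitude forms one plane; all bits of a plane are independent.
        planes.append([(abs(c) >> p) & 1 for c in coeffs])
    return planes


planes = bit_planes([5, -3, 0, 6], num_planes=3)
```

Extracting a plane needs no data from neighbouring coefficients, which is the kind of operation that maps naturally onto parallel hardware while software has to loop over every bit.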
25Partitioning Solution 2
REASONS
DATA TRANSFER
- mqc functions are called frequently
- They need to be in HW to reduce SW/HW data transfer
In SW
In HW
REPEATED LOOPS
- Frequently repeated loops in mqc sub-functions
- Avoid a SW/HW split because of the huge communication overhead
DATA SHARING
- Resource sharing between sub-behaviors makes a SW/HW divide impossible
26Partitioning Solution 3
REASONS
DWT Coeff.
- High number of DWT coeff.
In SW
In HW
DATA STRUCTURE
- The structure of the data dependencies in dwt.c makes it possible to use MAC units
ITERATIVE LOOP
- The repeated loop in dwt_encode rules out a SW/HW divide
DISADVANTAGE
INCREASED HW COMPLEXITY
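The regular data dependency mentioned above can be seen in a lifting-based DWT, where each output depends only on a sample and its immediate neighbours. The sketch below is one level of the reversible 5/3 lifting transform on a 1-D, even-length signal. It is an assumption that this illustrates the same structure as dwt.c; the irreversible 9/7 filter has the same shape but multiplies by filter constants, which is where multiply-accumulate units apply.

```python
def dwt53_forward(x):
    """One level of reversible 5/3 lifting on an even-length list of ints."""
    half = len(x) // 2
    even, odd = x[0::2], x[1::2]
    # Predict: odd sample minus the floor-average of its two even neighbours.
    # The missing neighbour at the right edge is mirrored (symmetric extension).
    d = [odd[i] - ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    # Update: even sample plus a rounded quarter of the two adjacent details.
    s = [even[i] + ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    return s, d  # low-pass and high-pass coefficients


def dwt53_inverse(s, d):
    """Undo dwt53_forward by running the lifting steps in reverse order."""
    half = len(s)
    even = [s[i] - ((d[max(i - 1, 0)] + d[i] + 2) >> 2) for i in range(half)]
    odd = [d[i] + ((even[i] + even[min(i + 1, half - 1)]) >> 1) for i in range(half)]
    x = [0] * (2 * half)
    x[0::2], x[1::2] = even, odd
    return x


signal = [10, 12, 14, 13, 9, 7, 8, 8]   # illustrative samples
low, high = dwt53_forward(signal)
```

Every coefficient is computed from a fixed, small window, so the loop body can be replicated or pipelined in hardware without changing the result.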
27Traffic Storage
K = Kilobyte, M = Megabyte, Com = Communication
CFB
28Choice of Partition Solution
SELECTION
SOLUTION 3
In SW
REASONS
In HW
HARDWARE
- Best performance with almost the same HW as Solutions 1 and 2
CONSTRAINT
- Satisfies design constraints to a greater extent
29Architectural Model
REASON
FLEXIBLE LOGIC RESOURCES, HIGH PERFORMANCE
INTERFACE TO EXTERNAL MEMORY
30Delays after Archit. Model
HW SPEED-UP ESTIMATES GREATER THAN AMDAHL'S LAW ESTIMATES
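For reference, Amdahl's law bounds the overall speed-up when only a fraction of the run time is accelerated. The sketch below evaluates the formula; the function name and the example fractions are assumptions chosen for illustration and are not figures from this project.

```python
def amdahl_speedup(accelerated_fraction, local_speedup):
    """Overall speed-up when a fraction p of run time is made s times faster: 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)


# Hypothetical case: 90% of the run time moved to hardware that is 10x faster.
overall = amdahl_speedup(0.9, 10.0)
```

The formula assumes the remaining software part is unchanged, so an estimate that exceeds it usually means the partitioning also reduced work outside the accelerated fraction, for example by removing data transfers.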
31Communication Synthesis
32Communications Synthesis
Implementation
Flow
Verification
Specify
Requirements Capture
Specification Model
Functional Sim
Functionally Correct
Design
Architecture/Algorithm Exploration
Architecture Model
Behavioural Sim
DataFlow Metrics
Bus Synthesis
Communications Synthesis
Communications Model
Bus-Functional Sim
Clock-Cycle Metrics
- Protocol Insertion
- Transducer Synthesis
33Communication Model
System Bus
SW/HW communication blocks are timing accurate but not cycle accurate
MicroBlaze, JPG2K_HW and BRAM are interconnected via the system bus
A transducer is inserted because the MicroBlaze and JPG2K_HW protocols are not compatible
All interfaces to the same bus are clocked at the same speed
34MicroBlaze/JPG2K_HW Interface
35MicroBlaze Read/Write Timing
WRITING TO JPG2K_HW
READING FROM JPG2K_HW
36BRAM/JPG2K_HW Interface
WRITING TO JPG2K_HW
READING FROM JPG2K_HW
37Comm. Model Results
JPG2K_HW Cycles (10^6)
Simulation done with ModelSim XE
(Diff (A - B))
Com. Model Cycles (Tier-1, DWT)
Com. Model delay: average of writes to and reads from JPG2K_HW
Com. Model (A) @ 6 ns
Fulfills timing constraint of 7.51 ns
No. of cycles rises by 0.1 in Com. Model due to communication overhead
Archit. Model (B)
38Back-End
39Backend
Implementation
Verification
Requirements Capture
Specification Model
Functional Sim
Functionally Correct
Architecture/Algorithm Exploration
Architecture Model
Behavioral Sim
DataFlow Metrics
Implementation Refinement
Implementation Model
Cycle-Accurate Sim
Speed/Area
Compile
EDIF
Timing Sim
Place & Route
FPGA
CPU
40Proposed Architecture
41Operation Profile
OPERATION PROFILE FOR 1 CODE-BLOCK
Block counts
Type of Operation
42Analysis of DWT Results
AREA
Reduction by factor of 11.77 and 11.81 for DWT-mid and DWT-par
LUTS
REASON
- Seq: high no. of control logic
- Par: low no. of control logic
SPEED
Reduction by factor of 1.39 and 2.46 for DWT-mid and DWT-par
CYCLES
REASON
- Seq: sequential processing of high no. of DWT code-blocks
- Par: parallel processing of high no. of DWT code-blocks
43Analysis of MQC Results
AREA
Reduction by factor of 1.70 and 1.73 for MQC-mid and MQC-par
LUTS
REASON
- Seq and Par: same as DWT
SPEED
Reduction by factor of 1.27 for MQC-par
CYCLES
REASON
- High no. of cycles: sequential processing of large data transfers
- Low no. of cycles: parallel processing of large data transfers
44Analysis of EBCOT Results
AREA
Reduction by factor of 10.75 and 13.97 for EBCOT-seq and EBCOT-par
LUTS
REASON
- High and low no. of cycles: same as DWT and MQC
SPEED
Reduction by factor of 1.04 for EBCOT-par
CYCLES
REASON
- High no. of cycles: sequential processing of large no. of bit passes
- Low no. of cycles: low no. of control logic (control steps)
45JPG2K_HW Imple. Results
JPG2K_HW
Results after PAR using Xilinx ISE tool
High slack for EBCOT and DWT
MQC Proposed/ Industry
The JPG2K_HW is faster by 13 and 2.3 for MQC and DWT, respectively.
46Conclusions
Implementation Model JPG2K_HW is faster by 13 and 2.3 for Tier-1 and DWT, respectively.
47Future Work
LENGTH Nov. 05 - May 06 (7 months)
Imple. Mod.
Spec. Model
48The End!!!
Many Thanks for Listening!