Title: CHAPTER 6 VLSI Architectures for Motion Estimation
1CHAPTER 6 VLSI Architectures for Motion Estimation
21-D Systolic Array
- A Family of VLSI Designs for the Motion Compensation Block-Matching Algorithm
- K. M. Yang, M. T. Sun, and L. Wu
- IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1317-1325, Oct. 1989
3Main Features
- They allow full search capability, which is the optimal solution in block matching.
- They allow sequential inputs to save pin count but perform parallel processing.
- They use common busses for data transfers and save silicon area.
- They are very flexible, modular designs capable of processing different block sizes, e.g. 8 × 8, 16 × 16, or 32 × 32.
- They are cascadable, i.e., cascaded chips allow a larger tracking area.
- They contain testing circuitry to increase testability.
- They were the first chip design for block-matching motion estimation in the world.
4Architecture Design
- To fully utilize the processing power of the PEs, a special data flow has to be derived that keeps the PEs as busy as possible.
- The data are used repeatedly at different search positions.
- Two data-flow techniques that allow the designs to achieve 100 percent efficiency are described in the following. One broadcasts previous frame data and the other broadcasts current block data.
5Notations
6Broadcasting the Previous Frame Data
- While b(Ib, Jb+15) is being input, it can be broadcast to all processors that need it.
- This relieves the burden of repeatedly accessing the same data from the previous frame.
7Broadcasting Reference Frame
- The 16 PE columns represent the calculation of the error measurement for 16 search positions.
- Except for a very short initial delay, all the PEs are busy all the time, so the utilization is 100 percent.
- The address generator generates each address by summing a base address and a running index.
- The base address, (Ia, Ja) or (Ib, Jb), which is defined as the upper left corner of a block, remains the same for the entire processing of that block, and the running indexes (i, j) and (k, l) follow identical sequences for all blocks.
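The broadcasting idea can be illustrated with a small functional sketch. It is not cycle-accurate and does not model the chip's actual timing; the function names, the small sizes (n = 4, 4 PEs), and the use of the sum of absolute differences as the error measurement are assumptions made for illustration. Each previous-frame pixel is read once, at address base + running index, and handed to every PE that needs it.

```python
def broadcast_sads(cur, prev, base, n, num_pe):
    """Accumulate one error sum per PE while reading each previous-frame pixel once."""
    ib, jb = base                       # base address: upper left corner of the strip
    acc = [0] * num_pe                  # one accumulator per PE (one search position each)
    reads = 0
    for k in range(n):                  # running row index
        for l in range(n + num_pe - 1): # running column index across the strip
            pixel = prev[ib + k][jb + l]   # address = base + running index
            reads += 1
            for pe in range(num_pe):    # broadcast to every PE that needs this pixel
                j = l - pe
                if 0 <= j < n:
                    acc[pe] += abs(cur[k][j] - pixel)
    return acc, reads


def direct_sads(cur, prev, base, n, num_pe):
    """Reference result: each search position fetches its own n x n window."""
    ib, jb = base
    return [
        sum(abs(cur[i][j] - prev[ib + i][jb + pe + j])
            for i in range(n) for j in range(n))
        for pe in range(num_pe)
    ]


# Hypothetical test data, chosen only to be deterministic.
n, num_pe = 4, 4
cur = [[(3 * i + 5 * j) % 256 for j in range(n)] for i in range(n)]
prev = [[(7 * i + 11 * j + 2) % 256 for j in range(12)] for i in range(12)]
sads, reads = broadcast_sads(cur, prev, (2, 3), n, num_pe)
```

In this sketch the broadcast version reads n × (n + num_pe - 1) pixels, whereas fetching a separate window per search position would read n × n × num_pe.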
8Basic Data Flow
9Architecture of PE
- These sub-operations are performed in a pipeline fashion and thus reduce the cycle time.
- The accumulator in the last stage of the PE has 16-bit precision to accommodate the largest possible error measurement.
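The 16-bit figure can be checked with a short calculation. The sketch below assumes 8-bit pixels and a sum-of-absolute-differences error measurement; the function name is hypothetical.

```python
def accumulator_bits(block_size, pixel_bits=8):
    """Return (largest possible error sum, bits needed to hold it)."""
    max_diff = (1 << pixel_bits) - 1              # 255 for 8-bit pixels
    max_sum = block_size * block_size * max_diff  # every pixel differs by the maximum
    return max_sum, max_sum.bit_length()


worst_16 = accumulator_bits(16)   # 16 x 16 block
worst_8 = accumulator_bits(8)     # 8 x 8 block
```

Under these assumptions a 16 × 16 block gives a worst case of 256 × 255 = 65280, which fits in 16 bits.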
10Broadcasting the Current Frame Data
Parallel-in-parallel-output shift registers
Parallel-in-parallel-output shift registers with multiplexers
11Basic Dataflow for Broadcasting Current Block Data
12Flexible Block Size
- Different motion-compensation schemes may use different block sizes and require large tracking ranges. It is very desirable to have a chip flexible enough for use in different systems.
- For a block size of 8 × 8, the computation required for each block is ¼ of that required for a block size of 16 × 16.
- However, each frame then contains 4 times as many blocks as with a block size of 16 × 16.
13Flexible Block Size (Cont.)
- The computational load for each frame is the same for different block sizes, except that the internal dynamic range is slightly different (the tracking range is fixed).
- Both architectures discussed are flexible enough to process 8 × 8, 16 × 16, or 32 × 32 blocks as long as the tracking range is fixed to 16 searches in one coordinate.
- The same hardware containing 16 PEs can be reconfigured to process different block sizes by a very simple control signal (address generator).
- The above discussion can be generalized to other block sizes that are powers of 2.
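The claim that the per-frame load does not depend on the block size can be verified numerically. This is a minimal sketch; the 256 × 256 frame size and the function name are hypothetical, and the tracking range is fixed at 16 searches per coordinate as stated above.

```python
def abs_diff_ops_per_frame(width, height, block_size, searches_per_axis=16):
    """Count absolute-difference operations for a full search over one frame."""
    blocks = (width // block_size) * (height // block_size)
    ops_per_block = block_size ** 2 * searches_per_axis ** 2
    return blocks * ops_per_block


# Hypothetical 256 x 256 frame, three block sizes.
loads = {b: abs_diff_ops_per_frame(256, 256, b) for b in (8, 16, 32)}
```

Halving the block size divides the work per block by 4 and multiplies the number of blocks by 4, so the product stays at width × height × searches².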
14Larger Tracking Ranges
- The tracking range is basically limited by the computation power of the PEs. If a tracking range of -16 to 15 is needed, the computation load increases by 4 times.
- Assuming each PE is already operating at the limit of its capability, 4 times the number of PEs will be needed.
- In this connection, essentially two chips are cascaded to provide 32-stage input registers and 32 PEs for the doubled horizontal tracking range.
15Block Diagram for Cascading Four Chips to Achieve
Tracking Range of -16 to 15
[Figure labels: CHIP A, CHIP B, CHIP C, CHIP D, CMP, signals C1, C2, p1, p2, output Motion Vector]
16Overlapped Search Area
[Figure: the search area, with both axes marked 0, 16, 32, 47, divided into overlapping sub-tracking areas I to IV]
17Overlapped Search Area (Cont.)
- The cascaded chip design can also be done easily by assigning each chip to process one portion of the tracking area.
- While data from the overlapped area are being input, they can be broadcast to two chips to save bandwidth. This avoids a proportional increase of the memory requirement in a cascaded-chip system.
18Motion Estimation with Fractional Precision
19Fractional Motion Estimation Chip-Pair Design
[Block diagram labels: Video in; Reconstructed Video in; Current Frame Storage Memory; Previous Frame Storage Memory; Tracking Area Storage Memory; Current Block Storage Memory; Motion Compensation Chip I (Integer Precision); Motion Compensation Chip F (Fractional Precision); (mi, mj)]
20Block Diagram of a Fractional Motion Estimation Chip
21Interpolation
- The combination of IP1 and IP2 eases the input rate and keeps the PEs performing operations every cycle.
- The interpolated values at the output of the IP1 and the IP2 can be expressed as
22Basic Data Flow for Fractional Motion Vector
Estimator
23Basic Data Flow for Fractional Motion Vector
Estimator (cont.)
24Schematic Diagram of IP1
25Schematic Diagram of IP2
26Chip Layout
27Testability
- The motion vector calculated by the chip is a function of the current block data and the data in the previous frame within the tracking range. Since the number of possible combinations of these input data is extremely large, exhaustive testing of the chip is impossible.
- To be able to test the chip, it is highly desirable to have a testing circuit inside the chip that neither uses excessive chip area nor degrades performance.
- The proposed chip operates in two modes, the normal mode and the test mode, which are selected by an external signal named test.
28Testability (Cont.)
- By using tri-state buses and a decoder, the test vectors for the whole chip are reduced to much smaller sets for functionally divided modules.
- In the test mode, a test pattern is input on some data pins, which are normally used for one of the previous frame data inputs, and is then decoded by the Test Pattern Decoder.
- Only one of the modules is tested at a time, and only its results are routed to an output bus and observed at the output pins.
29 Array Architectures for Block Matching Algorithms
- T. Komarek and P. Pirsch
- IEEE Transactions on Circuits and Systems, vol. 36, no. 10, pp. 1301-1308, Oct. 1989
30Block Matching Algorithm
(motion vector)
- The BMA is defined over a four-dimensional index space because of its four indexes i, k, m, and n.
- As an example, the BMA is decomposed into two parts, each defined over a two-dimensional index space.
- The first part is spanned by the indexes i and k and consists of the accumulation of the sum s(m, n).
- The second part, defined over m and n, performs the minimum search and the selection of the displacement vector components.
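The two-part decomposition can be sketched in software as follows. This is only a functional illustration, not the systolic mapping itself; it assumes that s(m, n) is the sum of absolute differences between the reference block x(i, k) and the search area data y(i+m, k+n), and that displacements run from -p to p. The search area is stored with an offset of p so that list indexes stay non-negative.

```python
def s(x, y, m, n, N, p):
    """First part: accumulate s(m, n) over the indexes i and k for fixed m and n."""
    return sum(abs(x[i][k] - y[i + m + p][k + n + p])
               for i in range(N) for k in range(N))


def full_search(x, y, N, p):
    """Second part: minimum search over m and n, selecting the displacement vector."""
    best = None
    for m in range(-p, p + 1):
        for n in range(-p, p + 1):
            value = s(x, y, m, n, N, p)
            if best is None or value < best[0]:
                best = (value, m, n)
    return best


# Hypothetical data: embed the reference block at displacement (1, -2).
N, p = 3, 2
x = [[3 * i + k for k in range(N)] for i in range(N)]
y = [[100 + 7 * r + c for c in range(N + 2 * p)] for r in range(N + 2 * p)]
for i in range(N):
    for k in range(N):
        y[i + 1 + p][k - 2 + p] = x[i][k]
result = full_search(x, y, N, p)   # (minimum sum, m, n)
```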
31Derivation of Systolic Arrays for Full Search BMA
- The accumulation of s(m, n) starts with the index k and continues over the index i for fixed m and n.
- The second part of the decomposed BMA is given by
32DG Spanned in the i, k Plane
[Figure: DG for a block size of N = 3 and a maximum displacement of p = 2 in the i, k-plane of the decomposed full search BMA. Nodes AD perform subtraction, magnitude operation, and addition; nodes A perform addition; node M performs the minimum search and yields the displacement vector. Labels include the inputs x(i, k) and y(i+m, k+n), the sums s(m, n) and s(m-1, n), and a time schedule.]
33Systolic Architecture AB1 for N = 3, p = 2
[Figure: array of AD processing elements fed with search area data and reference data, followed by an M element that outputs the displacement vector]
Number of time instances necessary to determine a displacement vector: $N \times (2p+1) \times (2p+1+N-1) = N \times (2p+1) \times (2p+N)$
34Three-Dimensional Index Space Spanned by the Indexes i, k, and m
35Systolic Array AS2
- Systolic architecture AS2 with processing elements AD, A, and M, derived from the previous DG, with the indexes of input data x(i, k) and y(i+m, k+n). The indexes enclosed by the dashed lines belong to data of one search area line and one reference block.
Projection onto the i, m plane
36Systolic Architecture AB2
- Systolic architecture AB2 with the indexes of search area data y(i+m, k+n). The reference block data x(i, k) remain fixed in the PEs AD. The indexes of one search area line of data are enclosed by the dashed line.
Projection along the i, k-plane
37Processing Element
38Bit-Level Cell Array
4x4 PE array
39Bit-Level PE Array (Cont.)
40Systolic Array AS1
- Systolic architecture AS1 for N = 3 and p = 2 with the indexes of search area data y(i+m, k+n) and reference block data x(i, k).
41Efficient Hybrid Tree/Linear Array Architectures for Block-Matching Motion Estimation Algorithms
- M.-J. Chen, L.-G. Chen, K.-N. Cheng, M. C. Chen
- IEE Proc.-Vis. Image Signal Process., vol. 143, no. 4, pp. 217-222, Aug. 1996
42Illustration of One-Dimensional Full Search
Algorithm
43Tree-Type Array Architecture with N = 4
44Hybrid Tree/Linear Architecture
45Tree-Cut Technique Direct Form
46Image pel Distribution for Memory Interleaving
47Chip Layout and Characteristics
48Analysis and Architecture Design of Variable Block Size Motion Estimation for H.264/AVC
- Ching-Yeh Chen, Shao-Yi Chien, Yu-Wen Huang, Tung-Chien Chen, Tu-Chih Wang, and Liang-Gee Chen
- IEEE Trans. Circuits Syst. Video Technology
49Abstract
- Variable block size motion estimation (VBSME) has become an important video coding technique, but it increases the difficulty of hardware design.
- We use inter/intra-level classification and various data flows to analyze the impact of supporting VBSME in different hardware architectures.
- We propose two hardware architectures that can support traditional fixed block size motion estimation as well as VBSME with less chip area overhead than previous approaches.
50Abstract (Cont.)
- By broadcasting reference pixel rows and propagating partial SADs, the first design has fewer reference pixel registers and a shorter critical path.
- The second design uses a 2-D distortion array and one adder tree with a reference buffer, which can maximize the data reuse between successive searching candidates.
- We demonstrate a 720p, 30 fps solution at 108 MHz with a 330.2K gate count and 208K bits of on-chip memory.
51Introduction (Cont.)
- The row (column) SAD is the summation of N distortions in a row (column).
- Although FSBMA provides the best quality among the various ME algorithms, it consumes the most computation power. In general, ME accounts for 50 to 90 percent of the computation complexity of a typical video coding system. Hence, a hardware accelerator for ME is required.
52VBSME
- Variable block size motion estimation (VBSME) is a new coding technique that provides more accurate predictions than traditional fixed block size motion estimation (FBSME).
- With FBSME, if a MB contains two objects with different motion directions, the coding performance of this MB is worse.
- With VBSME, under the same condition, the MB can be divided into smaller blocks to fit the different motion directions.
- VBSME has been adopted in the latest video coding standards, including H.263, MPEG-4, WMV9.0, and H.264/AVC.
53VBSME (Cont.)
- In H.264/AVC, a MB with variable block size can be divided into seven kinds of blocks: 4×4, 4×8, 8×4, 8×8, 8×16, 16×8, and 16×16.
- Although VBSME can achieve a higher compression ratio, it not only requires huge computation complexity but also increases the difficulty of hardware implementation for ME.
- Traditional ME hardware architectures are designed for FBSME, and they can be classified into two categories.
- One is the inter-level architecture, where each processing element (PE) is responsible for one SAD of a specific searching candidate.
- The other is the intra-level architecture, where each PE is responsible for the distortion of a specific current pixel in the current MB for all searching candidates.
54Yang, Sun, and Wu's Architectures
- A 1-D inter-level hardware architecture (1DInterYSW).
- The number of PEs is equal to the number of searching candidates in the horizontal direction, 2Ph.
- The most important concept is data broadcasting. With the broadcasting technique, the memory bandwidth, defined as the number of bits of reference data required in one cycle, is reduced significantly, although some global routing is required.
55Yeo and Hu's Architectures
56Lai and Chen's Architecture
- Reference pixels are propagated with propagation registers, and current pixels are broadcast into the PEs.
- The partial SADs are still stored and accumulated in the PEs.
- In addition, 2DInterLC has to load reference pixels into the propagation registers before computing SADs. The latency of loading reference pixels can be reduced by partitioning the search range in 2DInterLC.
57Vos and Stegherr's Architecture
58Vos and Stegherr's Architecture (Cont.)
- A 2-D intra-level architecture.
- The number of PEs is equal to the block size. Each PE corresponds to one current pixel, and the current pixels are stored in their respective PEs.
- The important concept of 2DIntraVS is the snake scan, the scanning order over the searching candidates.
- The computation flow is as follows.
- First, the distortion is computed in each PE, and N partial row SADs are propagated and accumulated in the horizontal direction.
- Second, an adder tree accumulates the N row SADs into the SAD. The accumulation of the row SADs and of the SAD is done in one cycle, so no partial SAD needs to be stored.
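A snake scan visits the searching candidates so that each candidate is adjacent to the previous one. The sketch below is one possible ordering; the column-by-column orientation and the function name are assumptions, since the slide names only the scan and not its direction.

```python
def snake_scan(rows, cols):
    """Visit candidates column by column, reversing direction in every other column."""
    order = []
    for c in range(cols):
        if c % 2 == 0:
            row_range = range(rows)                # top to bottom
        else:
            row_range = range(rows - 1, -1, -1)    # bottom to top
        order.extend((r, c) for r in row_range)
    return order


order = snake_scan(3, 2)
```

Because consecutive candidates differ by a single step, most of the reference pixels already held in the array can be reused from one candidate to the next.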
59Komarek and Pirsch's Architecture
Hsieh and Lin's
Komarek and Pirsch's Architecture
60Komarek and Pirsch's Architecture (Cont.)
- Komarek and Pirsch contributed a detailed systolic mapping procedure based on the dependence graph (DG). AB2 (2DIntraKP) is a 2-D intra-level architecture.
- Current pixels are stored in the corresponding PEs. Reference pixels are propagated PE by PE in the horizontal direction.
- The N partial column SADs are first propagated and accumulated in the vertical direction.
- After the vertical propagation, these N column SADs are propagated in the horizontal direction.
61Hsieh and Lin's Architecture
- 2DIntraHL consists of N PE arrays in the vertical direction, and each PE array is composed of N PEs in a row.
- In 2DIntraHL, reference pixels are propagated through propagation registers one by one, which provides the advantages of serial data input and increased data reuse.
- Current pixels are still stored in the PEs. The N partial column SADs are propagated in the vertical direction from bottom to top.
- In each computing cycle, each PE array generates N distortions of a searching candidate and accumulates these distortions with the N partial column SADs in the vertical propagation.
- After the accumulation in the vertical direction, the N column SADs are accumulated in the top adder tree in one cycle. The longer latency for loading reference pixels and the large propagation registers are the penalties for the reduction in memory bandwidth.
62Proposed Propagate Partial SAD
63Proposed Propagate Partial SAD (Cont.)
- The architecture is composed of N PE arrays with a 1-D adder tree in the vertical direction.
- Current pixels are stored in each PE, and two sets of N continuous reference pixels in a row are broadcast to the N PE arrays at the same time.
64Data Flow of Propagate Partial SAD
65Proposed SAD Tree
66Scan Order and Memory Access
67Variable Block Size Motion Estimation
68The Impact of Variable Block Size Motion
Estimation in Hardware Architectures
- There are many methods to support VBSME in hardware architectures.
- For example, we can increase the number of PEs or the operating frequency to do ME for the different block sizes separately. Another method is to reuse the SADs of the smallest blocks, which are the blocks partitioned with the smallest block size, to derive the SADs of larger blocks.
- With this method, the overhead of supporting VBSME is only a slight increase in gate count, and the other factors, such as frequency, hardware utilization, and memory usage, are the same as those of FBSME.
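Reusing the SADs of the smallest blocks can be sketched as follows, assuming a 16 × 16 MB whose smallest blocks are 4 × 4. The function names and the test data are hypothetical. The 16 smallest-block SADs are computed once, and every larger block SAD is obtained by adding the ones it covers.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4 x 4 grid holding the SAD of each 4 x 4 block of a 16 x 16 MB."""
    return [[sum(abs(cur[4 * br + i][4 * bc + j] - ref[4 * br + i][4 * bc + j])
                 for i in range(4) for j in range(4))
             for bc in range(4)]
            for br in range(4)]


def merge(grid, h, w):
    """SADs of all blocks spanning h x w smallest blocks (rows x columns)."""
    return [[sum(grid[r + i][c + j] for i in range(h) for j in range(w))
             for c in range(0, 4, w)]
            for r in range(0, 4, h)]


# Hypothetical deterministic pixel data.
cur = [[(7 * i + 3 * j) % 256 for j in range(16)] for i in range(16)]
ref = [[(5 * i + 11 * j + 1) % 256 for j in range(16)] for i in range(16)]

grid = sad_4x4_grid(cur, ref)
# Seven block shapes, in units of the smallest block (rows, columns).
shapes = [(1, 1), (1, 2), (2, 1), (2, 2), (2, 4), (4, 2), (4, 4)]
all_sads = {shape: merge(grid, *shape) for shape in shapes}
total_blocks = sum(len(rows) * len(rows[0]) for rows in all_sads.values())
```

In this sketch only additions of already computed SADs are needed for the larger blocks, and the seven shapes together yield 41 block SADs per searching candidate.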
69Data Flow I: Storing in PEs (Inter-Level Architecture)
- The number of bits for the data buffer in each PE increases from $\log_2 N^2 + 8$ to $n^2(\log_2 (N/n)^2 + 8)$, where $N^2$ and $(N/n)^2$ are the numbers of pixels in one block, and 8 is the wordlength of one pixel.
[Figure panels: FBSME, N = 16; VBSME, N = 16, n = 4]
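The buffer-size expression can be evaluated for the two cases named in the figure, N = 16 without partitioning and N = 16 with n = 4. This is a small sketch of the arithmetic only; the function name is hypothetical, and it assumes block sizes that are powers of 2.

```python
from math import log2


def pe_buffer_bits(N, n=1, pixel_bits=8):
    """Data buffer bits per PE: n^2 partial SADs, each over (N/n)^2 pixels."""
    pixels_per_smallest_block = (N // n) ** 2
    bits_per_sad = int(log2(pixels_per_smallest_block)) + pixel_bits
    return n * n * bits_per_sad


fbsme_bits = pe_buffer_bits(16)        # one 16 x 16 SAD per PE
vbsme_bits = pe_buffer_bits(16, n=4)   # sixteen 4 x 4 SADs per PE
```

Under these assumptions each PE grows from one 16-bit register to sixteen 12-bit registers, i.e. from 16 to 192 bits.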
70Data Flow II: Propagating with Propagation Registers (Intra-Level Architecture)
- In intra-level architectures, partial SADs can be accumulated and propagated with propagation registers.
- Each PE computes the distortion of one corresponding current pixel in the current MB.
- Through propagation adders and registers, the partial SAD is accumulated from these distortions.
- When supporting VBSME, more propagation registers are required to store the partial SADs of the smallest blocks. In each propagating direction, the number of propagation registers is n times the original number, because there are n smallest blocks in the other direction.
71The Proposed Propagate Partial SAD Architecture
with Data Flow II
72Data Flow III: No Partial SADs
[Figure: the proposed SAD Tree architecture with Data Flow III, where N = 16 and n = 4]
73Data Flow III: No Partial SADs (Cont.)
- In intra-level architectures, it is possible that no partial SADs need to be stored, as in SAD Tree.
- Each PE computes the distortion of one current pixel for a searching candidate, and the total SAD is accumulated by an adder tree in one cycle, as shown in Fig. 5(a).
- Because there is no partial SAD in this architecture, there is no register overhead for storing partial SADs when supporting VBSME.
- The adder tree is the part that has to be reorganized to support VBSME.
- That is, we partition the 2-D adder tree to get the SADs of the smallest blocks first and then derive the SADs of the larger blocks from them. Although there is no additional register overhead, the extra additions in the adder tree required to support VBSME do require additional area.
74THE PARALLELISM, CYCLES, LATENCY, AND DATA FLOW
OF EIGHT HARDWARE ARCHITECTURES
75THE DATA BUFFER AND MEMORY BITWIDTH OF EIGHT
HARDWARE ARCHITECTURES
76An Example
- The specifications of ME are as follows. The MB size is 16×16, and the search range is Ph = 64 and Pv = 32.
- The frame size is D1, 720 × 480.
- When VBSME is supported, a MB can be partitioned into at most 16 4×4 blocks.
- We use Verilog-HDL and SYNOPSYS Design Compiler with the ARTISAN UMC 0.18 um cell library to implement each hardware architecture.
- Because the critical path in some architectures is too long, which means the maximum operating frequency is limited unless the architecture is modified, the frame rate is set to only 10 frames per second (fps).
77Area and Required Frequency
- Among these eight hardware architectures, all inter-level architectures with Data Flow I increase the gate count dramatically. The chip area is at least five times that of FBSME.
78Latency
- The latency is defined as the number of start-up cycles that the hardware takes to generate the first SAD.
- If a module has a long latency that cannot be shortened by parallel architectures, the effect of parallel computation is reduced. That is, a shorter latency is better for video coding systems.
- Two factors affect the latency.
- Hardware architecture
- Memory bandwidth
- Compared to these hardware architectures, the other intra-level architectures, such as the proposed Propagate Partial SAD and SAD Tree, have shorter latencies.
79Utilization
- In general, inter-level architectures can compute continuously MB by MB, so the initial cycles can be neglected and the utilization will be 100 percent.
- Therefore, we define the utilization as Computing cycles / Operating cycles for a MB.
- The operating cycles include three parts: latency, computing cycles, and bubble cycles. Computing cycles are the cycles in which we get at least one SAD. That is, if the utilization is 100 percent, we get at least one SAD in each cycle. With fewer operating cycles, the penalty of the latency becomes more apparent.
- The more bubble cycles there are, the lower the utilization is.
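The utilization definition can be written out directly. The cycle counts below are hypothetical and serve only to show how latency and bubble cycles lower the ratio.

```python
def utilization(computing_cycles, latency=0, bubble_cycles=0):
    """Utilization = computing cycles / operating cycles for one MB."""
    operating_cycles = latency + computing_cycles + bubble_cycles
    return computing_cycles / operating_cycles


ideal = utilization(4096)                                   # no latency, no bubbles
stalled = utilization(4096, latency=256, bubble_cycles=768) # hypothetical overheads
```

With these made-up numbers the second case drops to 4096 / 5120 = 0.8, i.e. 80 percent.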
80Memory Usage
- Memory usage consists of two parts, memory bitwidth and memory bandwidth.
- Memory bitwidth is defined as the number of bits that the hardware has to access from memory in each cycle, and memory bandwidth is redefined as the number of bits that the hardware has to access from memory for one MB.
- Memory bandwidth affects the load on the system bus when there is no on-chip memory, or the power of the on-chip memory when there is, and memory bitwidth is the key to the data arrangement of on-chip memories.
- Memory bitwidth and bandwidth are affected by the data reuse scheme and the operating cycles.
81Hexagonal Plot
- The closer a point is to the center, the worse the performance is.
- Note that the weighting of each axis can be very different across video coding systems and hardware system platforms.
- We can use these hexagonal plots to select the optimal architecture based on the different constraints of the system integration.
82Hexagonal Plots
83Hexagonal Plots
84Hexagonal Plots
85Hexagonal Plots
86Hardware Architecture of H.264 Integer Motion
Estimation
- Based on the above analysis, we propose a ME hardware design for H.264/AVC integer-pixel motion estimation (IME) as an example.
- Our specification supports two frame sizes.
- One is D1 Format with four reference frames at 30 fps. In the previous frame, the search range is [-64, 64) and [-32, 32) in the horizontal and vertical directions. In the remaining frames, the search range is [-32, 32) and [-16, 16) in the horizontal and vertical directions.
- The other is 720p with one reference frame at 30 fps. The search range is the same as that of the previous frame in D1 Format.
87Hardware Architecture of H.264 Integer Motion
Estimation (Cont.)
- In our specification, the computation complexity of H.264 is 2.4 tera instructions per second and 3.8 tera bytes per second in D1 Format and is dominated by IME, as estimated by instruction profiling of the reference software, JM7.3.
- The very large computation complexity can be handled by parallel computation, but the huge external memory bandwidth cannot. Therefore, the huge memory bandwidth is a difficult challenge for hardware design.
- There are still two problems.
- First, because of VBSME and the Lagrangian mode decision, the data dependency of the motion vector predictor prohibits parallel computation among the smaller blocks in a MB.
- Second, when high processing ability is necessary, the hardware cost of ME hardware architectures with high degrees of parallelism also has to be discussed.
88Modified Algorithm
- First, we divide the computation of ME into two parts, integer-pixel ME (IME) and fractional-pixel ME (FME), and propose two individual hardware accelerators for IME and FME, respectively. The utilization of the hardware accelerators can be improved significantly in this way.
- Second, in the original Lagrangian mode decision, the MV predictor of a block is the median MV among the MVs of the top, top-right, and left neighboring 4×4 blocks, but in the parallel computation of hardware architectures, the coding modes of the neighboring 4×4 blocks cannot be decided in parallel, especially when the block size is 4×4.
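The original MV predictor can be sketched as a component-wise median of the three neighboring MVs. This shows only the median step; the function name and the example vectors are hypothetical, and the sketch ignores the special cases for unavailable neighbors. It also makes the dependency visible: the predictor cannot be formed until all three neighboring MVs are known.

```python
def median_mv(left, top, top_right):
    """Component-wise median of three motion vectors given as (x, y) tuples."""
    xs = sorted(mv[0] for mv in (left, top, top_right))
    ys = sorted(mv[1] for mv in (left, top, top_right))
    return (xs[1], ys[1])   # middle element of each sorted component


predictor = median_mv((1, 5), (3, -2), (2, 9))
```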
89The motion vector predictor for (a) the 4×8 block, (b) the 16×16 block, and (c) the modified motion vector predictor for all blocks.
90Hardware Architecture with M-parallelism
- In our specification, we require eight sets of Propagate Partial SAD or SAD Tree to achieve real-time computation.
- Eight sets of Propagate Partial SAD and SAD Tree, which can process eight successive candidates in a row at the same time, are combined as Eight-Parallel Propagate Partial SAD and Eight-Parallel SAD Tree, respectively.
91Hardware Architecture of H.264 Integer Motion
Estimation.
92Comparison of RD Curves Between JM7.3 and Our
Proposed Encoder
93Memory Reduction of H.264 IME