Title: EE398 Project Presentation
1. EE398 Project Presentation
- Performance-Complexity Tradeoff in H.264 Motion Search
- Ionut Hristodorescu
- ionuth_at_stanford.edu
2. Outline
- H.264 motion search algorithm
- Mapping of the motion compensation algorithm on a memory hierarchy subsystem
- Cache organization and its impact on motion compensation speed
- Making the internal H.264 data structures more cache-friendly
3. H.264 Motion Search Algorithm
- Block matching algorithm
- Computes the SADs for all the targets in a given area (exhaustive search)
- Computationally intensive
- Its complexity is equal to or greater than that of the rest of the encoding steps
- Takes most of the encoding time
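The exhaustive search above can be sketched in C. This is a minimal illustration, not the encoder's actual code; the names `sad16x16` and `full_search` and the boundary handling are assumptions:

```c
#include <stdint.h>
#include <stdlib.h>

/* SAD between a 16x16 current macroblock and a reference block,
 * both stored line-by-line in planes of width w. */
static unsigned sad16x16(const uint8_t *cur, const uint8_t *ref, int w)
{
    unsigned sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(cur[y * w + x] - ref[y * w + x]);
    return sad;
}

/* Exhaustive (full) search: try every offset (dx, dy) in a +/-range
 * window around the macroblock at (mbx, mby) and keep the minimum SAD. */
static unsigned full_search(const uint8_t *cur, const uint8_t *ref,
                            int w, int h, int mbx, int mby, int range,
                            int *best_dx, int *best_dy)
{
    unsigned best = ~0u;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int rx = mbx + dx, ry = mby + dy;
            if (rx < 0 || ry < 0 || rx + 16 > w || ry + 16 > h)
                continue;               /* keep the candidate inside the frame */
            unsigned sad = sad16x16(cur + mby * w + mbx,
                                    ref + ry * w + rx, w);
            if (sad < best) { best = sad; *best_dx = dx; *best_dy = dy; }
        }
    return best;
}
```

Note that every candidate touches 16 separate picture lines, which is exactly what makes the memory layout discussed next matter.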
4. Mapping on a memory hierarchy subsystem
- The luma/chroma is represented internally in the motion compensation algorithm as a line-by-line matrix
- So, each line of a macroblock is separated from the next by size(pel) * width bytes
- This means that accessing vertically adjacent pels (pels in the same column) generates 1 cache miss per pel!
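A small stride calculation makes the cost concrete. The 720-pel width and 64-byte cache line below are illustrative values, not figures from the talk:

```c
#include <stddef.h>

/* Count the distinct cache lines touched by the first pel of each of
 * the 16 lines of a macroblock, when consecutive picture lines are
 * `stride` bytes apart in memory. */
static size_t lines_touched(size_t stride, size_t cache_line)
{
    size_t distinct = 0, prev = (size_t)-1;
    for (int y = 0; y < 16; y++) {
        size_t line = ((size_t)y * stride) / cache_line; /* pel (0, y) */
        if (line != prev) { distinct++; prev = line; }
    }
    return distinct;
}
```

With stride 720 (line-by-line layout) a 16-pel column touches 16 distinct cache lines, i.e. one miss per pel in the worst case; with stride 16 (the blocked layout introduced below) it touches only 4.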
5. Mapping on a memory hierarchy subsystem
- To overcome this, we could arrange the information so that consecutive block lines sit in consecutive memory locations
6. Mapping on a memory hierarchy subsystem
- So, a natural representation of the chroma/luma matrices is sequential, macroblock line by macroblock line
- This way, the needed information is loaded into the cache faster
7. Mapping on a memory hierarchy subsystem
- The advantages are immediate
- Each macroblock line is 16 pels
- So, we can fit 2 consecutive 16-pel lines in a cache line
- The macroblock is now accessed in a natural, sequential order
8. Mapping on a memory hierarchy subsystem
9. Mapping on a memory hierarchy subsystem
- The biggest problem that arises now is access that is not aligned to a macroblock line boundary
- In our representation so far, each macroblock line sits at a 16-pel boundary
- For macroblock-line-aligned access, this is great
- How about non-aligned access?
10. Mapping on a memory hierarchy subsystem
- We have a problem: imagine we want to access 16 pels, but starting from position 4 in a macroblock line
- In the original representation this is no problem, since the original picture lines are sequential in memory
- In our case, we will end up in the next consecutive macroblock line
11Mapping on a memory hierarchy subsystem
- Solutions 1 pretend we dont know about this
problem and let the encoder access the wrong pels - Solution 2 check each time if we are crossing a
macroblock line boundary and proceed accordingly - Solution 3 keep two blocked versions of the
picture the original picture blocked and the
shifted-by-32pels blocked
12. Mapping on a memory hierarchy subsystem
- We prefer solution 3 (even if it is more expensive in terms of memory) because this way the pels are accessed faster
- If pel_pos32 < 16, we pick up the pels from the blocked version of the original picture
- Else, we pick up the pels from the blocked version of the original picture shifted by 32
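The copy selection amounts to a single comparison. A sketch, with illustrative names (`pick_copy`, `pel_pos32` as the horizontal position modulo 32):

```c
#include <stdint.h>

/* Solution 3: two blocked copies of the reference picture are kept,
 * the second one shifted by 32 pels.  For a 16-pel access starting at
 * pel_pos32, return the copy in which those 16 pels are contiguous. */
static const uint8_t *pick_copy(const uint8_t *blocked,
                                const uint8_t *blocked_shift32,
                                int pel_pos32)
{
    /* Starting below 16, the 16 pels stay inside one 32-pel span of
     * the unshifted copy; otherwise the same pels are contiguous in
     * the shifted copy. */
    return (pel_pos32 < 16) ? blocked : blocked_shift32;
}
```

The branch is trivially predictable, which is why this beats the per-access boundary check of solution 2.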
13. Mapping on a memory hierarchy subsystem
- 32 pels fit exactly in one cache line (or even 64 pels on better processors)
- So, each time we access two macroblock lines, we will have no cache miss, since the two macroblock lines fit into a single cache line
14. Results
- Motion estimation (MET) time decreased by approx. 8x compared to the non-blocked exhaustive search
- Cache misses/pixel decreased to approx. 15 from 600-800!
- The rate-distortion ratio was preserved
15. Further optimizations
- Assembly-language coding of the SAD computation, in particular using the PSADBW MMX instruction
- Multi-threading of the motion compensation algorithm
- By using the Performance API (PAPI), we could measure the runtime behavior of the cache and introduce the cache misses into the motion cost function, much like in [1]
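The PSADBW idea can be sketched with the SSE2 intrinsic `_mm_sad_epu8`, which compiles to that instruction; the function name below is illustrative:

```c
#include <emmintrin.h>   /* SSE2 intrinsics; _mm_sad_epu8 emits PSADBW */
#include <stdint.h>

/* SAD of one 16-pel line computed by hardware: PSADBW produces two
 * 64-bit partial sums (one per 8-byte half), which we add together. */
static unsigned sad16_psadbw(const uint8_t *a, const uint8_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i v  = _mm_sad_epu8(va, vb);
    return (unsigned)_mm_cvtsi128_si32(v) +
           (unsigned)_mm_cvtsi128_si32(_mm_srli_si128(v, 8));
}
```

One such call replaces 16 subtract/abs/accumulate iterations of the scalar SAD loop.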
16. Further optimizations
- Intelligent prefetching of data
- Extend the blocking algorithm to the entire motion estimation engine