1
EE398 Project Presentation
  • Performance-Complexity Tradeoff in H.264
    Motion Search
  • Ionut Hristodorescu
  • ionuth@stanford.edu

2
Outline
  • H.264 motion search algorithm
  • Mapping of the motion compensation algorithm on a
    memory hierarchy subsystem
  • Cache organization and its impact on the motion
    compensation speed
  • Making the internal H.264 data structures more
    cache friendly

3
H.264 Motion Search Algorithm
  • Block Matching Algorithm
  • Computes the SADs for all the targets in a given
    area (exhaustive search)
  • Computationally intensive
  • Its complexity equals or exceeds that of the
    rest of the encoding steps
  • Takes most of the encoding time
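As a concrete illustration, a minimal C sketch of the exhaustive block-matching search described above (the function names and the fixed 16×16 block size are illustrative conventions, not the encoder's actual API):

```c
#include <stdlib.h>

/* Sum of absolute differences between a 16x16 current block and a
 * candidate block in the reference frame; both frames are stored
 * row-major with the given stride (bytes per picture line). */
static int sad_16x16(const unsigned char *cur, const unsigned char *ref,
                     int stride)
{
    int sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(cur[x] - ref[x]);
        cur += stride;
        ref += stride;
    }
    return sad;
}

/* Exhaustive (full) search: evaluate every candidate displacement in a
 * +/-range window and keep the one with the smallest SAD. */
static int full_search(const unsigned char *cur, const unsigned char *ref,
                       int stride, int range, int *best_dx, int *best_dy)
{
    int best = 1 << 30;
    for (int dy = -range; dy <= range; dy++) {
        for (int dx = -range; dx <= range; dx++) {
            int s = sad_16x16(cur, ref + dy * stride + dx, stride);
            if (s < best) { best = s; *best_dx = dx; *best_dy = dy; }
        }
    }
    return best;
}
```

The two nested displacement loops are what makes the search exhaustive, and why this step dominates encoding time.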

4
Mapping on a memory hierarchy subsystem
  • The luma/chroma planes are represented
    internally in the motion compensation algorithm
    as line-by-line (row-major) matrices
  • So, consecutive lines of a macroblock are
    separated by (size(pel) × width) bytes
  • This means that accessing vertically adjacent
    pels (the same column across consecutive lines)
    can generate 1 cache miss/pel !!!
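The stride effect can be sketched numerically. Assuming 1-byte pels and illustrative helper names, the following counts how many distinct cache lines one column of a 16-line macroblock touches:

```c
#include <stddef.h>

/* Conventional line-by-line layout: pel (row, col) of a width-wide
 * picture sits at byte offset row*width + col (1 byte/pel assumed
 * for illustration).  Vertically adjacent pels are `width` bytes
 * apart, so each row of a macroblock starts in a different cache
 * line once the picture width exceeds the cache-line size. */
static size_t pel_offset(size_t width, size_t row, size_t col)
{
    return row * width + col;
}

/* Number of distinct cache lines touched by the first column of a
 * 16-line macroblock, for a given picture width and cache-line size. */
static int lines_touched(size_t width, size_t cache_line)
{
    int n = 0;
    size_t prev = (size_t)-1;
    for (size_t r = 0; r < 16; r++) {
        size_t cl = pel_offset(width, r, 0) / cache_line;
        if (cl != prev) { n++; prev = cl; }
    }
    return n;
}
```

For a 720-pel-wide picture and 32-byte cache lines, every one of the 16 rows lands in its own cache line: in the worst case, a miss per pel accessed vertically.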

5
Mapping on a memory hierarchy subsystem
  • To overcome this, we can arrange the data so
    that consecutive block lines sit in consecutive
    memory locations

6
Mapping on a memory hierarchy subsystem
  • So, a natural representation of the chroma/luma
    matrices is sequential, macroblock line by
    macroblock line
  • This way, the needed information is loaded into
    the cache faster
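A minimal sketch of such a repacking step, assuming 1-byte pels, 16×16 macroblocks, and picture dimensions that are multiples of 16 (the function name is illustrative, not the encoder's own):

```c
#include <string.h>

#define MB 16

/* Repack a row-major picture into the "blocked" layout: the 16 lines
 * of each macroblock are stored back to back, so every macroblock
 * occupies one contiguous 16*16-byte region.  Sketch only; assumes
 * width and height are multiples of 16 and 1-byte pels. */
static void block_picture(unsigned char *dst, const unsigned char *src,
                          int width, int height)
{
    int mbs_per_row = width / MB;
    for (int my = 0; my < height / MB; my++)
        for (int mx = 0; mx < mbs_per_row; mx++)
            for (int r = 0; r < MB; r++)
                memcpy(dst + ((my * mbs_per_row + mx) * MB + r) * MB,
                       src + (my * MB + r) * width + mx * MB,
                       MB);
}
```

After this pass, walking a macroblock touches one contiguous 256-byte region instead of 16 widely separated picture lines.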

7
Mapping on a memory hierarchy subsystem
  • The advantages are immediate
  • Each macroblock line is 16 pels
  • So, we can fit two consecutive 16-pel lines in
    one cache line
  • The macroblock is now accessed in a natural,
    sequential order

8
Mapping on a memory hierarchy subsystem
9
Mapping on a memory hierarchy subsystem
  • The biggest problem that arises now is access
    that is not aligned to a macroblock-line
    boundary
  • Each macroblock line sits at a 16-pel boundary
    in our representation so far
  • For macroblock-line-aligned access, this is
    great
  • But what about non-aligned access?

10
Mapping on a memory hierarchy subsystem
  • We have a problem: imagine we want to access 16
    pels, but starting from position 4 in a
    macroblock line
  • In the original representation, this is no
    problem, since the original picture lines are
    sequential in memory
  • In our case, we would end up reading into the
    next consecutive macroblock line

11
Mapping on a memory hierarchy subsystem
  • Solution 1: pretend we don't know about the
    problem and let the encoder access the wrong
    pels
  • Solution 2: check each time whether we are
    crossing a macroblock-line boundary and proceed
    accordingly
  • Solution 3: keep two blocked versions of the
    picture: the original picture, blocked, and a
    blocked copy shifted by 32 pels

12
Mapping on a memory hierarchy subsystem
  • We prefer solution 3 (even if it is more
    expensive in terms of memory) because this way
    the pels are accessed faster
  • If pel_pos mod 32 < 16, we pick up the pels
    from the blocked version of the original
    picture
  • Otherwise, we pick up the pels from the blocked
    version of the original picture shifted by 32
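Solution 3's selection rule can be sketched as follows; the helper and parameter names are illustrative, and only the pel_pos mod 32 test comes from the slides:

```c
/* Sketch of "solution 3": keep two blocked copies of the picture,
 * the original and one shifted by 32 pels, and pick whichever copy
 * holds the requested 16-pel run contiguously.  pel_pos is the
 * horizontal start position of the run. */
static const unsigned char *
pick_copy(const unsigned char *blocked,          /* original, blocked  */
          const unsigned char *blocked_shift32,  /* shifted-by-32 copy */
          int pel_pos)
{
    /* a run starting in the first half of a 32-pel span is fully
     * contained in the original blocked copy; otherwise the same
     * run is contiguous in the shifted copy */
    return (pel_pos % 32 < 16) ? blocked : blocked_shift32;
}
```

The point of the design is that this single test replaces per-access boundary checks (solution 2), trading memory for speed.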

13
Mapping on a memory hierarchy subsystem
  • 32 pels fit exactly in one cache line (or even
    64 pels on processors with larger cache lines)
  • So, each time we access two macroblock lines,
    we incur no extra cache miss, since the two
    macroblock lines fit into a single cache line

14
Results
  • Motion estimation (MET) time decreased by
    approx. 8% compared to the non-blocked
    exhaustive search
  • Cache misses/pixel decreased from 600-800 to
    approx. 15 !!!
  • The rate-distortion ratio was preserved

15
Further optimizations
  • Assembly-language coding of the SAD
    computation, in particular using the PSADBW
    MMX instruction
  • Multi-threading of the motion compensation
    algorithm
  • By using the Performance API (PAPI), we could
    measure the runtime behavior of the cache and
    introduce the cache misses into the motion cost
    function, much like in [1]
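The PSADBW idea can also be reached with compiler intrinsics instead of hand-written assembly. A sketch assuming an x86 target with SSE2 and 1-byte pels; `_mm_sad_epu8` is the intrinsic that compiles to PSADBW:

```c
#include <emmintrin.h>  /* SSE2: _mm_sad_epu8 maps to PSADBW */

/* SAD of one 16-pel line in a single instruction: PSADBW produces
 * two partial sums, one per 8-byte half of the register, which we
 * add together.  Requires an x86 processor with SSE2. */
static int sad_line16_sse2(const unsigned char *a, const unsigned char *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i s  = _mm_sad_epu8(va, vb);
    return _mm_cvtsi128_si32(s) +                    /* low-half sum  */
           _mm_cvtsi128_si32(_mm_srli_si128(s, 8));  /* high-half sum */
}
```

This replaces the 16-iteration inner loop of the scalar SAD with one instruction plus a horizontal add.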

16
Further optimizations
  • Intelligent prefetching of data
  • Extend the blocking algorithm to the entire
    motion estimation engine