EE398 Project Presentation

About This Presentation

Title:

Description:

Number of Views:22

Avg rating:3.0/5.0

Slides: 17

Provided by: caden48

Category:

Tags: ee398 | papi | presentation | project

Transcript and Presenter's Notes

Title: EE398 Project Presentation

1
EE398 Project Presentation

2
Outline

3
H.264 Motion Search Algorithm

4
Mapping on a memory hierarchy subsystem

The luma/chroma is represented internally in the
motion compensation algorithm as a line-by-line
matrix
So, each line of a macroblock will be separated
by (size(pel)width) bytes
This means that accessing pels that are sitting
on the same row will generate 1 cache miss/pel !!!

5
Mapping on a memory hierarchy subsystem

To overcome the above,
we could arrange the information so that
consecutive block lines will sit in consecutive
memory locations

6
Mapping on a memory hierarchy subsystem

So, a natural representation of the chroma/luma
matrixes would be as a sequential macroblock line
by macroblock line
This way, the needed information is loaded into
the cache quicker

7
Mapping on a memory hierarchy subsystem

8
Mapping on a memory hierarchy subsystem
9
Mapping on a memory hierarchy subsystem

10
Mapping on a memory hierarchy subsystem

We have problems imagine we want to access 16
pixels, but starting from position 4 in a
macroblock line
In the original representation, this is no
problem, since the original picture lines are
sequential in memory
In our case, we will end up in the next
consecutive macroblock line

11
Mapping on a memory hierarchy subsystem

Solutions 1 pretend we dont know about this
problem and let the encoder access the wrong pels
Solution 2 check each time if we are crossing a
macroblock line boundary and proceed accordingly
Solution 3 keep two blocked versions of the
picture the original picture blocked and the
shifted-by-32pels blocked

12
Mapping on a memory hierarchy subsystem

We prefer solution 3 (even if it is more
expensive in terms of memory) because this way
the pels are accessed quicker
If pel_pos32 lt 16, we are going to pick up the
pels from the blocked version of the original
picture
Else, we are going to pick up the pels from the
blocked versions of the original picture
shifted-by-32

13
Mapping on a memory hierarchy subsystem

32pels will fit exactly in one cache line (or,
for better processors, even 64 pels)
So, each time we access two macroblock lines, we
will have no cache miss since the two macroblock
lines will fit into a cache line

14
Results

15
Further optimizations

Assembly language coding of the SAD computation
and in particular usage of the PSADW MMX
instruction
Multi-threading of the motion compensation
algorithm
By using Performance API (PAPI), we could measure
the runtime behavior of the cache and introduce
the cache misses into the motion cost function,
much like in 1