1 BMVA Technical Meeting: Multimedia Data Compression
Linear Combination of Face Views for Low Bit Rate Face Video Compression
Ioannis Koufakis
Department of Computer Science, University College London
2 December 1998
2 Overview
- Very low bit-rate encoding of faces for videoconferencing applications over the Internet or mobile communication networks.
- A face image is encoded as a linear combination of three basis face views of the same person.
- The background area is encoded similarly, using the background areas of the three basis frames.
- Eyes and mouth are compressed using principal components analysis.
- High compression ratio and good quality of the decoded face images, at 1384 to 1620 bits/frame.
3 Talk Structure
- Introduction to the problem of compression of video data for videoconferencing and its requirements
- Why current techniques and standards fail to provide a solution for our purposes
- Proposed linear combination method
  - Geometry (shape estimation and reconstruction)
  - Texture rendering (grey-level/colour reconstruction)
- Experiments with face reconstruction
  - Examples of reconstructed face images
  - Objective quality measurements
  - Bit rate and compression ratio estimations
4 Talk Structure
- Encoding of the background and of facial features such as the eyes and mouth
- Discussion and conclusions
  - Tracking and face detection issues
  - Advantages and drawbacks of the methods
  - Further work that could be done
5 Introduction
- For videoconferencing over the Internet or mobile telephony networks, very low bit rate encoding is required due to the usual limitations in the capacity of the communication channels.
- Target bit rate ≤ 64 Kbits/sec.
- Video data require a huge amount of bandwidth. For example, for colour video at 30 frames/sec:
  - CIF (360 × 288) → 36.5 Mbits/sec → compression ratio 600:1
  - QCIF (180 × 144) → 10 Mbits/sec → compression ratio 150:1
- The temporal continuity of the video data has to be preserved in order to ensure lip synchronisation and consistency of the face motion with the spoken words.
6 Introduction
- Conventional DCT block-based encoding (e.g. H.261) fails to achieve these target ratios without deterioration of image quality.
- Naive solution: lower the resolution and frame rate. Even at 2-10 frames/sec at QCIF the quality is poor; in addition, synchronisation problems between audio and video appear.
- Knowledge of the image content should be used, as in model-based or object-based coding (Pearson, Aizawa).
- Such coders can achieve high compression ratios, but complex 3D face models or explicit 3D information about the image scene are required. In addition, robust tracking techniques that fit the face onto the model in each frame are needed.
7 Linear Combination of Views
- A knowledge-based method, but one that uses only 2D information about the scene content.
- Introduced by Ullman and Basri (1991) for object recognition applications.
- All possible views of an object can be expressed as a linear combination of a small number of basis 2D views of the same object.
- For objects with smooth boundaries (e.g. faces) undergoing rigid 3D transformations and scaling under orthographic or weak perspective projection, 5 views are required, but 3 views are sufficient for a good approximation.
- Each novel face view of the person who uses the system is expressed as a combination of 3 face views of the same person that have been transmitted to the receivers off-line, prior to the video session.
8 2D Warping with an Affine Transformation
- Why not use a single view to produce novel face views?
- A 2D affine transformation of one view can model on-the-plane transformations, but fails to capture off-the-plane (3D) transformations.
- 2D warping techniques work well for flat objects.
- The face is a complicated 3D object whose shape and texture information cannot be inferred from only one view.
9 2D Warping with an Affine Transformation
- Initial and target face images used to test 2D affine warping
10 2D Warping with an Affine Transformation
- Reconstructed target by a 2D affine transformation and by our linear combination method (both with 30 control points)
11 Geometry
- Any point (x, y) in the novel view is expressed in terms of the 3 corresponding points (x1, y1), (x2, y2), (x3, y3) in the basis views.
- The 14 coefficients ai, bi are computed using a small number of control points located in the novel and basis images, and by solving a set of linear equations by means of a least squares approximation.
- The shape of the novel face view is encoded either by the 14 coefficients or by the coordinates of the control points.
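The equation itself is not preserved in this transcript; a form consistent with the stated 14 coefficients (7 per image coordinate, with the exact arrangement assumed here) is

  x = a_0 + a_1 x_1 + a_2 y_1 + a_3 x_2 + a_4 y_2 + a_5 x_3 + a_6 y_3
  y = b_0 + b_1 x_1 + b_2 y_1 + b_3 x_2 + b_4 y_2 + b_5 x_3 + b_6 y_3

Each control point contributes two such equations, so n control points give 2n linear equations in the 14 unknowns, which can be solved by least squares whenever n ≥ 7.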
12 Texture Rendering
- The texture of the novel view is reconstructed from the texture of the three basis views using a weighted interpolation.
- For each pixel in the novel view we find the corresponding pixels in the basis views using the piecewise linear mapping technique (Goshtasby, 1986).
- The control points divide the face area into triangular regions.
- For each pair of corresponding triangles in the target and basis views, a local linear mapping function is estimated that maps pixels from one triangle to the other.
13 Piecewise Linear Mapping
- For each pair of basis and target view triangles, we find the pixel in the basis view onto which the pixel in the target view is mapped.
[Figure: a triangle in the target view and the corresponding triangles in basis views 1, 2 and 3]
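A minimal sketch of this per-triangle affine mapping, assuming NumPy; the function names are illustrative, not the original implementation:

    import numpy as np

    def triangle_affine(src_tri, dst_tri):
        """2x3 affine matrix taking the 3 source triangle vertices onto the
        3 destination triangle vertices (both given as (3, 2) arrays)."""
        A = np.zeros((6, 6))
        b = np.zeros(6)
        for i, ((xs, ys), (xd, yd)) in enumerate(zip(src_tri, dst_tri)):
            A[2 * i] = [xs, ys, 1, 0, 0, 0]       # row for the x equation
            A[2 * i + 1] = [0, 0, 0, xs, ys, 1]   # row for the y equation
            b[2 * i], b[2 * i + 1] = xd, yd
        return np.linalg.solve(A, b).reshape(2, 3)

    def map_pixel(M, x, y):
        """Map the pixel position (x, y) with a 2x3 affine matrix."""
        return M @ np.array([x, y, 1.0])

For every pixel of the target view, the affine map of its enclosing triangle (one map per basis view) gives the corresponding basis-view pixel, whose texture is then blended as described on the next slides.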
14 Texture Rendering
- The texture of each pixel (x, y) in the novel view is given by a weighted combination of the corresponding basis view pixels.
- The weights wi are inversely proportional to the distance di of the novel view from the ith basis view.
- The distance di is given by the expression below.
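The slide's formulas are not preserved in this transcript; a plausible reconstruction, with the mean control-point displacement assumed as the view distance, is

  I(x, y) = \sum_{i=1}^{3} w_i I_i(x_i, y_i),    d_i = \frac{1}{n} \sum_{k=1}^{n} \| p_k - p_k^{(i)} \|

where I_i is the texture of the ith basis view, (x_i, y_i) is the corresponding pixel found by the piecewise linear mapping, p_k is the kth control point of the novel view and p_k^{(i)} its counterpart in the ith basis view.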
15 Texture Rendering
- The weights wi are given by the normalised expression shown below.
- Colour video data are encoded at the same bit rate as grey-level video data.
- The same linear combination coefficients and weights are applied to each colour channel (R, G, B).
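A normalised inverse-distance weighting is the natural reading here (assumed form):

  w_i = (1/d_i) / \sum_{j=1}^{3} (1/d_j)

so the weights sum to one and the basis view closest to the novel view contributes most to the synthesised texture.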
16 Experiments and Results
- The three basis views are selected to be one frontal, one oriented to the left and one to the right, at almost 45° about the vertical axis.
17 Experiments and Results
- A number of face images of the same person in different positions in 3D space were used as test views.
- In each image, 19 and 30 control points were located manually.
- The triangulation is done manually and remains the same for all views.
- All test views were encoded by the coordinates of the control points instead of the linear combination coefficients → better quality.
18 Experiments and Results
- Actual and reconstructed face views when 30 control points were used
19 Experiments and Results
- Actual and reconstructed face views when 19 control points were used
20 Experiments and Results
- Test face views used in our experiments
21 Experiments and Results
- Reconstructed test face views with 30 control points
22 Experiments and Results
- Reconstructed face views using 30 control points and the coordinates of the landmark points (left) or the linear combination coefficients (right)
23 Experiments and Results
- Signal-to-Noise Ratio (SNR) of the Normalised Mean Square Error
24 Experiments and Results
- The bit rate depends on the number of control points; quality improves as the number of control points increases.
- For CIF resolution we need 18 bits/control point (9 bits/coordinate):
  - 19 points → 342 bits/frame
  - 30 points → 540 bits/frame
- For QCIF resolution we need 16 bits/control point (8 bits/coordinate):
  - 19 points → 304 bits/frame
  - 30 points → 480 bits/frame
- Using the linear combination coefficients instead:
  - 14 coefficients, quantized at 17 bits/coefficient → 238 bits/frame
  - lower bit rate, BUT poorer quality and less robust
25 Background Encoding
- The background is also encoded using the linear combination.
- We place a small number of control points at fixed positions on the boundary of each frame and of the basis views.
- The background boundary points are combined with the face boundary points, and the background texture is synthesised similarly to the face texture.
- The background is effectively encoded with zero bits.
- Good results, especially for bland areas, even with a small number of control points. Areas of the background revealed or occluded as a result of face rotation and movement are not encoded so well.
26 Background Encoding
- Actual and reconstructed face and background when 19 control points were used for the face area and 6 control points at the frame boundary
27 Background Encoding
- Face and background reconstruction for a larger frame, with 30 face control points and 16 fixed boundary points
28 Encoding of eyes and mouth
- The linear combination of face views fails to encode the eyes and mouth when the person talks, opens or closes their eyes, or changes facial expression, unless a large number of basis views is used → inefficient encoding.
- We suggest a combined approach of linear combination and principal components analysis (PCA).
- Principal components analysis is used to encode the eyes and mouth.
- Three training sets are used: left eyes, right eyes and mouths.
- For colour video data, a training set is required for each colour channel → 9 training sets.
29 Encoding of eyes and mouth
- The images are geometrically normalised so that they all have horizontal orientation and the same size → this reduces the variation in the training set.
- Normalisation is performed using the control points already transmitted for the linear combination.
- The inverse process is followed in order to obtain the reconstructed images.
- A small number of eigenvectors (10) is sufficient for encoding.
- The projection coefficients can also be quantized to lower accuracy, and hence further decrease the required bit rate.
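A minimal sketch of the PCA encode/decode step for one normalised patch and one colour channel; the helper names and the uniform quantiser are illustrative assumptions, not the original implementation:

    import numpy as np

    def pca_basis(training_patches, n_components=10):
        """training_patches: (N, H*W) array of normalised patches for one channel.
        Returns the mean patch and the top n_components eigenvectors."""
        mean = training_patches.mean(axis=0)
        # SVD of the centred data gives the eigenvectors of the covariance matrix.
        _, _, vt = np.linalg.svd(training_patches - mean, full_matrices=False)
        return mean, vt[:n_components]

    def encode(patch, mean, components, bits=12):
        """Project a patch onto the eigenvectors and quantize the coefficients."""
        coeffs = components @ (patch - mean)
        scale = (2 ** (bits - 1) - 1) / np.max(np.abs(coeffs))  # assumed uniform quantiser
        return np.round(coeffs * scale).astype(int), scale

    def decode(quantized, scale, mean, components):
        """Reconstruct the patch from the quantized projection coefficients."""
        return mean + components.T @ (quantized / scale)

With three regions (left eye, right eye, mouth), three colour channels, 10 coefficients per set and 12 bits per coefficient, this is consistent with the 1080 bits/frame quoted on the following slides (3 × 3 × 10 × 12 = 1080).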
30 Encoding of eyes and mouth - Experiments
- The Miss America video sequence was used.
- The first 30 frames were used as training examples.
- 20 subsequent frames were used as test frames.
- Good quality reconstructed images were obtained using only 10 out of the 29 eigenvectors of the training set, with the projection coefficients quantized to 12 bits accuracy.
- 1080 bits/frame encode the eyes and mouth changes.
31 Encoding of eyes and mouth - Experiments
- First ten eigenvectors of the mouth training set (shown for the red channel)
- Reconstructed mouth images
[Figure: actual, normalised, decoded and reconstructed mouth images]
32 Encoding of eyes and mouth - Experiments
- First ten eigenvectors of the left eye training set (shown for the blue channel)
- Reconstructed left eye image examples
[Figure: actual, normalised, decoded and reconstructed left eye images]
33 Linear Combination and PCA - Results
- Actual and reconstructed example frame of the Miss America video sequence (frame 97)
34 Linear Combination and PCA - Results
- The Miss America frame was synthesised using:
  - 19 control points
  - 30 training images
  - the 10 most important PCs from each training set
  - projection coefficients quantized at 12 bits/coefficient
- 1080 bits/frame encode the eyes and mouth changes.
- More bits are used for the encoding of important features, such as the eyes and mouth, than for the rest of the face.
35 Linear Combination and PCA - Results
- Bit rate per frame:
  - 30 control points: 1620 bits/frame (CIF), 1560 bits/frame (QCIF)
  - 19 control points: 1422 bits/frame (CIF), 1384 bits/frame (QCIF)
- Compression ratio:
  - 30 control points: 1024:1 (CIF), 265:1 (QCIF)
  - 19 control points: 1167:1 (CIF), 300:1 (QCIF)
- Bit rate at 30 frames/sec:
  - 30 control points: 47.5 Kbits/sec (CIF), 45.7 Kbits/sec (QCIF)
  - 19 control points: 41.7 Kbits/sec (CIF), 40.5 Kbits/sec (QCIF)
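A quick sanity check of these figures (the per-frame total is the control point budget from slide 24 plus the 1080 bits for the eyes and mouth, with 1 Kbit taken as 1024 bits):

    EYES_MOUTH_BITS = 1080                    # bits/frame for eyes and mouth (slides 30 and 34)
    POINT_BITS = {"CIF": 18, "QCIF": 16}      # bits per control point (slide 24)

    for resolution, bits_per_point in POINT_BITS.items():
        for n_points in (19, 30):
            frame_bits = n_points * bits_per_point + EYES_MOUTH_BITS
            kbits_per_sec = frame_bits * 30 / 1024
            print(f"{resolution}, {n_points} points: {frame_bits} bits/frame, "
                  f"{kbits_per_sec:.1f} Kbits/sec")

This reproduces the 1422/1620 bits per frame for CIF, 1384/1560 for QCIF, and the quoted Kbits/sec values at 30 frames/sec.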
36 Linear Combination and PCA - Results
- Compared with a conventional H.261 video coder.
- We used the PVRG-P64 software implementation of H.261 (developed by the Portable Video Research Group at Stanford).
- Default motion compensation and quantization parameters were used.
- Miss America at 30 frames/sec with no restriction on the bit rate → 423.7 Kbits/sec on average, with good quality decoded video data.
- Miss America at 30 frames/sec, but with the bit rate restricted to 64 Kbits/sec → 89 Kbits/sec on average, but the quality of the decoded video data was very poor: each frame was encoded like the first frame of the sequence, and no motion was encoded in the compressed video data.
37 Linear Combination and PCA - Results
- Reconstructed frames of the Miss America video sequence with the PVRG H.261 encoder:
  - compressed at 423.7 Kbits/sec
  - compressed at 89 Kbits/sec
38 Tracking of Control Points
- In our experiments, detection and tracking of control points was done manually.
- The linear combination method requires fast and robust face detection and facial feature tracking techniques.
- Suitable trackers have been developed by other researchers:
  - Hager and Toyama: XVision, a general purpose tracking framework
  - Gee and Cipolla: a model-based tracker for human faces, using a RANSAC-type regression paradigm
  - Maurer and von der Malsburg: tracking of facial landmarks using Gabor wavelets
  - Wiskott and von der Malsburg: elastic bunch graph matching, which also uses Gabor wavelets and is a variation of the Maurer tracker.
39 Missing points estimation
- The proposed scheme can tolerate some degree of drop-out of control points by the tracker.
- In such a case, assuming that all the control points are correctly detected in the basis views, any missing points can be estimated using the linear combination coefficients computed from the available control points.
- For 30 control points, 500 random patterns of missing points were tested and the RMS error in the estimated positions of the control points was computed:
  - drop-out of 5 points → 6% increase in the RMS error
  - drop-out of 8 points → 9% increase in the RMS error
- The RMS error increases very slowly; a major increase is observed only when 20 out of the 30 points are missing.
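A minimal sketch of that estimation step, under the 14-coefficient form assumed on slide 11 (array layouts and names are illustrative):

    import numpy as np

    def fit_coefficients(novel_pts, basis_pts):
        """novel_pts: (n, 2) control points detected in the novel view.
        basis_pts: (3, n, 2) corresponding points in the three basis views.
        Returns the 7 x-coefficients and 7 y-coefficients fitted by least squares."""
        n = novel_pts.shape[0]
        # Design matrix rows: [1, x1, y1, x2, y2, x3, y3]
        A = np.hstack([np.ones((n, 1)), basis_pts[0], basis_pts[1], basis_pts[2]])
        ax, *_ = np.linalg.lstsq(A, novel_pts[:, 0], rcond=None)
        ay, *_ = np.linalg.lstsq(A, novel_pts[:, 1], rcond=None)
        return ax, ay

    def predict_points(ax, ay, basis_pts):
        """Predict novel-view positions (e.g. of missing points) from the basis views."""
        n = basis_pts.shape[1]
        A = np.hstack([np.ones((n, 1)), basis_pts[0], basis_pts[1], basis_pts[2]])
        return np.stack([A @ ax, A @ ay], axis=1)

Fitting on the points the tracker did find and then evaluating predict_points on the basis-view coordinates of the missing points gives their estimated positions in the novel view.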
40 Missing points estimation
41 Elimination of outlier control points
- When some of the control points are incorrectly detected, or the tracker provides more control points than requested, statistical methods can be used to reduce the effect of those outliers or to eliminate them:
  - least median of squares (LMS)
  - Huber's M-estimator
  - RANSAC
- This framework can also be used for the estimation of occluded control points, by regarding them as missing, from the correctly identified inliers.
- The RANSAC algorithm seems to be the most effective, especially with a large proportion of outliers.
42 Elimination of outlier control points
- RANSAC seems to be the most effective, especially with a potentially large proportion of outliers.
- For the linear combination, our aim is to find, with high probability (≥ 95%), at least one sample of 7 inliers from a set of random patterns of control points detected by the tracker.
- For 30 control points and 7 points in each sample, the fraction of contaminated data (false positives) that can be tolerated is almost 46-47%.
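The standard RANSAC sampling relation gives the number of random samples N needed to find, with probability p, at least one all-inlier sample of size s when a fraction ε of the points are outliers:

  N = \log(1 - p) / \log(1 - (1 - \varepsilon)^{s})

For p = 0.95 and s = 7, an outlier fraction of about 46-47% corresponds to roughly 250 random samples.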
43 Elimination of outlier control points
- Number of samples required vs. fraction of outliers, for probability ≥ 95%
44 Conclusions
- We propose a computationally simple and efficient technique for the compression of face video data.
- Knowledge that the video data contain face images facilitates the very high compression ratios.
- Only shape information (control points or linear combination coefficients) needs to be transmitted; texture information is not required.
- The linear combination captures changes in scaling and 2D orientation (on the image plane), as well as most of the 3D orientation changes.
- Explicit 3D information or complex 3D modelling is avoided.
- The same bit rate is required for large as well as small frames.
45 Conclusions
46 Conclusions
- Start-up cost: transmission of the face basis views and of the eigen-eyes and eigen-mouths of the eye and mouth training sets.
- These can be further compressed with conventional (e.g. DCT-based) compression algorithms.
- Sparse correspondence between the basis and target views is required → a robust and fast tracking system is needed.
- The robustness of the tracker can be increased by using RANSAC-type methods.
- The combined approach preserves both temporal and spatial resolution at bit rates well below the target of 64 Kbits/sec we set.
47 Conclusions
- Basis view 1, compressed with JPEG at 2300 bytes
- Reconstructed view from the JPEG-compressed basis views (200 bytes)
- JPEG-compressed view at 1% quality of the original (780 bytes)
48 Conclusions
- Good quality for the reconstructed images. Minor artefacts are due to self-occlusion effects and illumination specularities.
- Self-occlusion can be eliminated by hidden surface removal (affine depth, Ullman, 1991).
- Geometric distortions in the background are usually unpredictable, but they don't cause a serious problem. They are not easily detected, so we ignore them.
49 Conclusions
- Considering the temporal dimension, problems may appear when coding artefacts are quite prominent in consecutive frames, so that a smooth transition from frame to frame cannot be achieved.
- An error concealment method that eliminates such problems is required.
- Temporal averaging over consecutive frames may remove small spatial errors.
- Error concealment could be applied more severely at the periphery of the frame, since users will not be concentrating on these areas.
50 Further work
- Integration of the principal components method for the eyes and mouth with the linear combination method for face views in a single framework, possibly with an automated tracking technique (such as the one developed by Maurer et al.).
- More experiments need to be done with faces of different persons and with data recorded from actual videoconferencing sessions.
- Study of visibility effects (e.g. self-occlusion) and verification that we can eliminate them.
- Simulation of the transmission of compressed face video data over a low bandwidth link in a packet-switched network, using UDP or RTP.
- Real-time transmission of compressed face video data over the Internet, in order to assess the feasibility of the proposed method for the envisaged applications.
51 Contact Information
- e-mail: G.Koufakis@cs.ucl.ac.uk
- www: http://www.cs.ucl.ac.uk/staff/G.Koufakis