1 BMVA Technical Meeting: Multimedia Data Compression
Linear Combination of Face Views for Low Bit Rate Face Video Compression
Ioannis Koufakis
Department of Computer Science, University College London
2 December 1998
2 Overview
- Very low bit-rate encoding of faces for videoconferencing applications over the Internet or mobile communication networks.
- A face image is encoded as a linear combination of three basis face views of the same person.
- The background area is encoded similarly, using the background areas of the three basis frames.
- Eyes and mouth are compressed using principal components analysis.
- High compression ratio and good quality of the decoded face images, at 1384 to 1620 bits/frame.
3 Talk Structure
- Introduction to the problem of compression of video data for videoconferencing and its requirements
- Why current techniques and standards fail to provide a solution for our purposes
- Proposed linear combination method
  - Geometry (shape estimation and reconstruction)
  - Texture rendering (grey-level/colour reconstruction)
- Experiments with face reconstruction
  - Examples of reconstructed face images
  - Objective quality measurements
  - Bit rate and compression ratio estimations
4 Talk Structure
- Encoding of the background and of facial features such as the eyes and mouth
- Discussion and conclusions
  - Tracking and face detection issues
  - Advantages and drawbacks of the methods
  - Further work that could be done
5 Introduction
- For videoconferencing over the Internet or mobile telephony networks, very low bit rate encoding is required due to the usual limitations in the capacity of the communication channels.
- Target bit rate ≤ 64 Kbits/sec.
- Video data require a huge amount of bandwidth. For example, for colour video at 30 frames/sec:
  - CIF (360 × 288) → 36.5 Mbits/sec → compression ratio 600:1
  - QCIF (180 × 144) → 10 Mbits/sec → compression ratio 150:1
- The temporal continuity of the video data has to be preserved in order to ensure lip synchronisation and consistency of the face motion with the spoken words.
6 Introduction
- Conventional DCT block-based encoding (e.g. H.261) fails to achieve these target ratios without deterioration of image quality.
- Naive solution: lower the resolution and frame rate. Even at 2-10 frames/sec at QCIF the quality is poor; in addition, synchronisation problems between audio and video appear.
- Knowledge of the image content should be used, as in model-based or object-based coding (Pearson, Aizawa).
- Such coders can achieve high compression ratios, but complex 3D face models or explicit 3D information about the image scene are required. In addition, robust tracking techniques that fit the face onto the model in each frame are needed.
7 Linear Combination of Views
- A knowledge-based method, but one that uses only 2D information about the scene content.
- Introduced by Ullman and Basri (1991) for object recognition applications.
- All possible views of an object can be expressed as a linear combination of a small number of basis 2D views of the same object.
- For objects with smooth boundaries (e.g. faces) undergoing rigid 3D transformations and scaling under orthographic or weak perspective projection, 5 views are required, but 3 views are sufficient for a good approximation.
- Each novel face view of the person who uses the system is expressed as a combination of 3 face views of the same person that have been transmitted to the receivers off-line, prior to the video session.
8 2D Warping with an Affine Transformation
- Why not use a single view to produce novel face views?
- A 2D affine transformation of one view can model on-the-plane transformations, but fails to capture off-the-plane (3D) transformations.
- 2D warping techniques work well for flat objects.
- The face is a complicated 3D object whose shape and texture information cannot be inferred from only one view.
9 2D Warping with an Affine Transformation
- Initial and target face images used to test 2D affine warping
10 2D Warping with an Affine Transformation
- Reconstructed target by a 2D affine transformation and by our linear combination method (both with 30 control points)
11 Geometry
- Any point (x, y) in the novel view is expressed in terms of the 3 corresponding points (x1, y1), (x2, y2), (x3, y3) in the basis views.
- The 14 coefficients ai, bi are computed using a small number of control points located in the novel and basis images, and by solving a set of linear equations by means of a least squares approximation.
- The shape of the novel face view is encoded either by the 14 coefficients or by the coordinates of the control points.
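The equation itself is not preserved in this transcript; a form consistent with the stated 14 coefficients (7 per image coordinate, with the exact arrangement assumed here) is

  x = a_0 + a_1 x_1 + a_2 y_1 + a_3 x_2 + a_4 y_2 + a_5 x_3 + a_6 y_3
  y = b_0 + b_1 x_1 + b_2 y_1 + b_3 x_2 + b_4 y_2 + b_5 x_3 + b_6 y_3

Each control point contributes two such equations, so n control points give 2n linear equations in the 14 unknowns, which can be solved by least squares whenever n ≥ 7.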
12 Texture Rendering
- The texture of the novel view is reconstructed from the texture of the three basis views using a weighted interpolation.
- For each pixel in the novel view we find the corresponding pixels in the basis views using the piecewise linear mapping technique (Goshtasby, 1986).
- The control points divide the face area into triangular regions.
- For each pair of corresponding triangles in the target and basis views, a local linear mapping function is estimated that maps pixels from one triangle to the other.
13 Piecewise Linear Mapping
- For each pair of basis and target view triangles, we find the pixel in the basis view onto which the pixel in the target view is mapped.
[Figure: a triangle in the target view and the corresponding triangles in basis views 1, 2 and 3]
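A minimal sketch of this per-triangle affine mapping, assuming NumPy; the function names are illustrative, not the original implementation:

    import numpy as np

    def triangle_affine(src_tri, dst_tri):
        """2x3 affine matrix taking the 3 source triangle vertices onto the
        3 destination triangle vertices (both given as (3, 2) arrays)."""
        A = np.zeros((6, 6))
        b = np.zeros(6)
        for i, ((xs, ys), (xd, yd)) in enumerate(zip(src_tri, dst_tri)):
            A[2 * i] = [xs, ys, 1, 0, 0, 0]       # row for the x equation
            A[2 * i + 1] = [0, 0, 0, xs, ys, 1]   # row for the y equation
            b[2 * i], b[2 * i + 1] = xd, yd
        return np.linalg.solve(A, b).reshape(2, 3)

    def map_pixel(M, x, y):
        """Map the pixel position (x, y) with a 2x3 affine matrix."""
        return M @ np.array([x, y, 1.0])

For every pixel of the target view, the affine map of its enclosing triangle (one map per basis view) gives the corresponding basis-view pixel, whose texture is then blended as described on the next slides.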
14 Texture Rendering
- The texture of each pixel (x, y) in the novel view is given by a weighted combination of the corresponding basis view pixels.
- The weights wi are inversely proportional to the distance di of the novel view from the ith basis view.
- The distance di is given by the expression below.
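The slide's formulas are not preserved in this transcript; a plausible reconstruction, with the mean control-point displacement assumed as the view distance, is

  I(x, y) = \sum_{i=1}^{3} w_i I_i(x_i, y_i),    d_i = \frac{1}{n} \sum_{k=1}^{n} \| p_k - p_k^{(i)} \|

where I_i is the texture of the ith basis view, (x_i, y_i) is the corresponding pixel found by the piecewise linear mapping, p_k is the kth control point of the novel view and p_k^{(i)} its counterpart in the ith basis view.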
15 Texture Rendering
- The weights wi are given by the normalised expression shown below.
- Colour video data are encoded at the same bit rate as grey-level video data.
- The same linear combination coefficients and weights are applied to each colour channel (R, G, B).
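A normalised inverse-distance weighting is the natural reading here (assumed form):

  w_i = (1/d_i) / \sum_{j=1}^{3} (1/d_j)

so the weights sum to one and the basis view closest to the novel view contributes most to the synthesised texture.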
16 Experiments and Results
- The three basis views are selected to be one frontal, one oriented to the left and one to the right, at almost 45° about the vertical axis.
17 Experiments and Results
- A number of face images of the same person in different positions in 3D space were used as test views.
- In each image, 19 and 30 control points were located manually.
- The triangulation is done manually and remains the same for all views.
- All test views were encoded by the coordinates of the control points instead of the linear combination coefficients → better quality.
18 Experiments and Results
- Actual and reconstructed face views when 30 control points were used
19 Experiments and Results
- Actual and reconstructed face views when 19 control points were used
20 Experiments and Results
- Test face views used in our experiments
21 Experiments and Results
- Reconstructed test face views with 30 control points
22 Experiments and Results
- Reconstructed face views using 30 control points and the coordinates of the landmark points (left) or the linear combination coefficients (right)
23 Experiments and Results
- Signal-to-Noise Ratio (SNR) of the Normalised Mean Square Error
24 Experiments and Results
- The bit rate depends on the number of control points; quality improves as the number of control points increases.
- For CIF resolution we need 18 bits/control point (9 bits/coordinate):
  - 19 points → 342 bits/frame
  - 30 points → 540 bits/frame
- For QCIF resolution we need 16 bits/control point (8 bits/coordinate):
  - 19 points → 304 bits/frame
  - 30 points → 480 bits/frame
- Using the linear combination coefficients instead:
  - 14 coefficients, quantized at 17 bits/coefficient → 238 bits/frame
  - lower bit rate, BUT poorer quality and less robust
25 Background Encoding
- The background is also encoded using the linear combination.
- We place a small number of control points at fixed positions on the boundary of each frame and of the basis views.
- The background boundary points are combined with the face boundary points, and the background texture is synthesised similarly to the face texture.
- The background is effectively encoded with zero bits.
- Good results, especially for bland areas, even with a small number of control points. Areas of the background revealed or occluded as a result of face rotation and movement are not encoded so well.
26 Background Encoding
- Actual and reconstructed face and background when 19 control points were used for the face area and 6 control points at the frame boundary
27 Background Encoding
- Face and background reconstruction for a larger frame, with 30 face control points and 16 fixed boundary points
28 Encoding of eyes and mouth
- The linear combination of face views fails to encode the eyes and mouth when the person talks, opens or closes their eyes, or changes facial expression, unless a large number of basis views is used → inefficient encoding.
- We suggest a combined approach of linear combination and principal components analysis (PCA).
- Principal components analysis is used to encode the eyes and mouth.
- Three training sets are used: left eyes, right eyes and mouths.
- For colour video data, a training set is required for each colour channel → 9 training sets.
29 Encoding of eyes and mouth
- The images are geometrically normalised so that they all have horizontal orientation and the same size → this reduces the variation in the training set.
- Normalisation is performed using the control points already transmitted for the linear combination.
- The inverse process is followed in order to obtain the reconstructed images.
- A small number of eigenvectors (10) is sufficient for encoding.
- The projection coefficients can also be quantized to lower accuracy, and hence further decrease the required bit rate.
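A minimal sketch of the PCA encode/decode step for one normalised patch and one colour channel; the helper names and the uniform quantiser are illustrative assumptions, not the original implementation:

    import numpy as np

    def pca_basis(training_patches, n_components=10):
        """training_patches: (N, H*W) array of normalised patches for one channel.
        Returns the mean patch and the top n_components eigenvectors."""
        mean = training_patches.mean(axis=0)
        # SVD of the centred data gives the eigenvectors of the covariance matrix.
        _, _, vt = np.linalg.svd(training_patches - mean, full_matrices=False)
        return mean, vt[:n_components]

    def encode(patch, mean, components, bits=12):
        """Project a patch onto the eigenvectors and quantize the coefficients."""
        coeffs = components @ (patch - mean)
        scale = (2 ** (bits - 1) - 1) / np.max(np.abs(coeffs))  # assumed uniform quantiser
        return np.round(coeffs * scale).astype(int), scale

    def decode(quantized, scale, mean, components):
        """Reconstruct the patch from the quantized projection coefficients."""
        return mean + components.T @ (quantized / scale)

With three regions (left eye, right eye, mouth), three colour channels, 10 coefficients per set and 12 bits per coefficient, this is consistent with the 1080 bits/frame quoted on the following slides (3 × 3 × 10 × 12 = 1080).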
30 Encoding of eyes and mouth - Experiments
- The Miss America video sequence was used.
- The first 30 frames were used as training examples.
- 20 subsequent frames were used as test frames.
- Good quality reconstructed images were obtained using only 10 out of the 29 eigenvectors of the training set, with the projection coefficients quantized to 12 bits accuracy.
- 1080 bits/frame encode the eyes and mouth changes.
31 Encoding of eyes and mouth - Experiments
- First ten eigenvectors of the mouth training set (shown for the red channel)
- Reconstructed mouth images
[Figure: actual, normalised, decoded and reconstructed mouth images]
32 Encoding of eyes and mouth - Experiments
- First ten eigenvectors of the left eye training set (shown for the blue channel)
- Reconstructed left eye image examples
[Figure: actual, normalised, decoded and reconstructed left eye images]
33 Linear Combination and PCA - Results
- Actual and reconstructed example frame of the Miss America video sequence (frame 97)
34 Linear Combination and PCA - Results
- The Miss America frame was synthesised using:
  - 19 control points
  - 30 training images
  - the 10 most important PCs from each training set
  - projection coefficients quantized at 12 bits/coefficient
- 1080 bits/frame encode the eyes and mouth changes.
- More bits are used for the encoding of important features, such as the eyes and mouth, than for the rest of the face.
35 Linear Combination and PCA - Results
- Bit rate per frame:
  - 30 control points: 1620 bits/frame (CIF), 1560 bits/frame (QCIF)
  - 19 control points: 1422 bits/frame (CIF), 1384 bits/frame (QCIF)
- Compression ratio:
  - 30 control points: 1024:1 (CIF), 265:1 (QCIF)
  - 19 control points: 1167:1 (CIF), 300:1 (QCIF)
- Bit rate at 30 frames/sec:
  - 30 control points: 47.5 Kbits/sec (CIF), 45.7 Kbits/sec (QCIF)
  - 19 control points: 41.7 Kbits/sec (CIF), 40.5 Kbits/sec (QCIF)
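A quick sanity check of these figures (the per-frame total is the control point budget from slide 24 plus the 1080 bits for the eyes and mouth, with 1 Kbit taken as 1024 bits):

    EYES_MOUTH_BITS = 1080                    # bits/frame for eyes and mouth (slides 30 and 34)
    POINT_BITS = {"CIF": 18, "QCIF": 16}      # bits per control point (slide 24)

    for resolution, bits_per_point in POINT_BITS.items():
        for n_points in (19, 30):
            frame_bits = n_points * bits_per_point + EYES_MOUTH_BITS
            kbits_per_sec = frame_bits * 30 / 1024
            print(f"{resolution}, {n_points} points: {frame_bits} bits/frame, "
                  f"{kbits_per_sec:.1f} Kbits/sec")

This reproduces the 1422/1620 bits per frame for CIF, 1384/1560 for QCIF, and the quoted Kbits/sec values at 30 frames/sec.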
36 Linear Combination and PCA - Results
- Compared with a conventional H.261 video coder.
- We used the PVRG-P64 software implementation of H.261 (developed by the Portable Video Research Group at Stanford).
- Default motion compensation and quantization parameters were used.
- Miss America at 30 frames/sec with no restriction on the bit rate → 423.7 Kbits/sec on average, with good quality decoded video data.
- Miss America at 30 frames/sec, but with the bit rate restricted to 64 Kbits/sec → 89 Kbits/sec on average, but the quality of the decoded video data was very poor: each frame was encoded like the first frame of the sequence, and no motion was encoded in the compressed video data.
37 Linear Combination and PCA - Results
- Reconstructed frames of the Miss America video sequence with the PVRG H.261 encoder:
  - compressed at 423.7 Kbits/sec
  - compressed at 89 Kbits/sec
38 Tracking of Control Points
- In our experiments, detection and tracking of control points was done manually.
- The linear combination method requires fast and robust face detection and facial feature tracking techniques.
- Suitable trackers have been developed by other researchers:
  - Hager and Toyama: XVision, a general purpose tracking framework
  - Gee and Cipolla: a model-based tracker for human faces, using a RANSAC-type regression paradigm
  - Maurer and von der Malsburg: tracking of facial landmarks using Gabor wavelets
  - Wiskott and von der Malsburg: elastic bunch graph matching, which also uses Gabor wavelets and is a variation of the Maurer tracker.
39 Missing points estimation
- The proposed scheme can tolerate some degree of drop-out of control points by the tracker.
- In such a case, assuming that all the control points are correctly detected in the basis views, any missing points can be estimated using the linear combination coefficients computed from the available control points.
- For 30 control points, 500 random patterns of missing points were tested and the RMS error in the estimated positions of the control points was computed:
  - drop-out of 5 points → 6% increase in the RMS error
  - drop-out of 8 points → 9% increase in the RMS error
- The RMS error increases very slowly; a major increase is observed only when 20 out of the 30 points are missing.
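A minimal sketch of that estimation step, under the 14-coefficient form assumed on slide 11 (array layouts and names are illustrative):

    import numpy as np

    def fit_coefficients(novel_pts, basis_pts):
        """novel_pts: (n, 2) control points detected in the novel view.
        basis_pts: (3, n, 2) corresponding points in the three basis views.
        Returns the 7 x-coefficients and 7 y-coefficients fitted by least squares."""
        n = novel_pts.shape[0]
        # Design matrix rows: [1, x1, y1, x2, y2, x3, y3]
        A = np.hstack([np.ones((n, 1)), basis_pts[0], basis_pts[1], basis_pts[2]])
        ax, *_ = np.linalg.lstsq(A, novel_pts[:, 0], rcond=None)
        ay, *_ = np.linalg.lstsq(A, novel_pts[:, 1], rcond=None)
        return ax, ay

    def predict_points(ax, ay, basis_pts):
        """Predict novel-view positions (e.g. of missing points) from the basis views."""
        n = basis_pts.shape[1]
        A = np.hstack([np.ones((n, 1)), basis_pts[0], basis_pts[1], basis_pts[2]])
        return np.stack([A @ ax, A @ ay], axis=1)

Fitting on the points the tracker did find and then evaluating predict_points on the basis-view coordinates of the missing points gives their estimated positions in the novel view.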
40 Missing points estimation
41 Elimination of outlier control points
- When some of the control points are incorrectly detected, or the tracker provides more control points than requested, statistical methods can be used to reduce the effect of those outliers or to eliminate them:
  - least median of squares (LMS)
  - Huber's M-estimator
  - RANSAC
- This framework can also be used for the estimation of occluded control points, by regarding them as missing, from the correctly identified inliers.
- The RANSAC algorithm seems to be the most effective, especially with a large proportion of outliers.
42 Elimination of outlier control points
- RANSAC seems to be the most effective, especially with a potentially large proportion of outliers.
- For the linear combination, our aim is to find, with high probability (≥ 95%), at least one sample of 7 inliers from a set of random patterns of control points detected by the tracker.
- For 30 control points and 7 points in each sample, the fraction of contaminated data (false positives) that can be tolerated is almost 46-47%.
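The standard RANSAC sampling relation gives the number of random samples N needed to find, with probability p, at least one all-inlier sample of size s when a fraction ε of the points are outliers:

  N = \log(1 - p) / \log(1 - (1 - \varepsilon)^{s})

For p = 0.95 and s = 7, an outlier fraction of about 46-47% corresponds to roughly 250 random samples.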
43 Elimination of outlier control points
- Number of samples required vs. fraction of outliers, for probability ≥ 95%
44 Conclusions
- We propose a computationally simple and efficient technique for the compression of face video data.
- Knowledge that the video data contain face images facilitates the very high compression ratios.
- Only shape information (control points or linear combination coefficients) needs to be transmitted; texture information is not required.
- The linear combination captures changes in scaling and 2D orientation (on the image plane), as well as most of the 3D orientation changes.
- Explicit 3D information or complex 3D modelling is avoided.
- The same bit rate is required for large as well as small frames.
45 Conclusions
46 Conclusions
- Start-up cost: transmission of the face basis views and of the eigen-eyes and eigen-mouths of the eye and mouth training sets.
- These can be further compressed with conventional (e.g. DCT-based) compression algorithms.
- Sparse correspondence between the basis and target views is required → a robust and fast tracking system is needed.
- The robustness of the tracker can be increased by using RANSAC-type methods.
- The combined approach preserves both temporal and spatial resolution at bit rates well below the target of 64 Kbits/sec we set.
47 Conclusions
- Basis view 1, compressed with JPEG at 2300 bytes
- Reconstructed view from the JPEG-compressed basis views (200 bytes)
- JPEG-compressed view at 1% quality of the original (780 bytes)
48 Conclusions
- Good quality for the reconstructed images. Minor artefacts are due to self-occlusion effects and illumination specularities.
- Self-occlusion can be eliminated by hidden surface removal (affine depth, Ullman, 1991).
- Geometric distortions in the background are usually unpredictable, but they don't cause a serious problem. They are not easily detected, so we ignore them.
49 Conclusions
- Considering the temporal dimension, problems may appear when coding artefacts are quite prominent in consecutive frames, so that a smooth transition from frame to frame cannot be achieved.
- An error concealment method that eliminates such problems is required.
- Temporal averaging over consecutive frames may remove small spatial errors.
- Error concealment could be applied more severely at the periphery of the frame, since users will not be concentrating on these areas.
50 Further work
- Integration of the principal components method for the eyes and mouth with the linear combination method for face views in a single framework, possibly with an automated tracking technique (such as the one developed by Maurer et al.).
- More experiments need to be done with faces of different persons and with data recorded from actual videoconferencing sessions.
- Study of visibility effects (e.g. self-occlusion) and verification that we can eliminate them.
- Simulation of the transmission of compressed face video data over a low bandwidth link in a packet-switched network, using UDP or RTP.
- Real-time transmission of compressed face video data over the Internet, in order to assess the feasibility of the proposed method for the envisaged applications.
51 Contact Information
- e-mail: G.Koufakis@cs.ucl.ac.uk
- www: http://www.cs.ucl.ac.uk/staff/G.Koufakis