1
BMVA Technical Meeting
Multimedia Data Compression
  • Linear Combination of Face Views for Low Bit Rate
    Face Video Compression

Ioannis Koufakis
Department of Computer Science
University College London
2 December 1998
2
Overview
  • Very low bit-rate encoding of faces for
    videoconferencing applications over the Internet
    or mobile communication networks.
  • A face image is encoded as a linear combination
    of three basis face views of the same person.
  • Background area is encoded similarly by using the
    background areas of the three basis frames.
  • Eyes and mouth are compressed using principal
    components analysis.
  • High compression ratio and good quality of the
    decoded face images, at 1384 to 1620 bits/frame.

3
Talk Structure
  • Introduction to the problem of compression of
    video data for videoconferencing and its
    requirements
  • Why do current techniques and standards fail to
    provide a solution for our purposes?
  • Proposed linear combination method
  • Geometry (shape estimation and reconstruction)
  • Texture Rendering (grey-level/colour
    reconstruction)
  • Experiments with face reconstruction.
  • Examples of reconstructed face images
  • Objective quality measurements
  • Bit rate and compression ratio estimations

4
Talk Structure
  • Encoding of the background and facial features,
    such as eyes and mouth
  • Discussion and conclusions
  • Tracking and face detection issues
  • Advantages and drawbacks of the methods
  • Further work that could be done

5
Introduction
  • For videoconferencing over the Internet or mobile
    telephony networks very low bit rate encoding is
    required due to usual limitations in the capacity
    of the communication channels.
  • Target bit rate ≤ 64 Kbits/sec.
  • Video data requires a huge amount of bandwidth.
    For example, for colour video at 30 frames/sec
    (see the check after this list):
  • CIF (360 × 288) → 36.5 Mbits/sec →
    compression ratio 600:1
  • QCIF (180 × 144) → 10 Mbits/sec →
    compression ratio 150:1
  • The temporal continuity of the video data has to
    be preserved in order to ensure lip
    synchronisation and consistency of the face
    motion with the spoken words.
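A rough check of the bandwidth figures above (a sketch
only; the 12 bits/pixel value assumes 4:2:0 chroma
subsampling, an assumption not stated on the slide, so
the results come out close to but not exactly the quoted
numbers):

    # Raw bit rate of uncompressed video, and the
    # compression ratio needed to reach 64 Kbits/sec.
    def raw_bitrate(width, height, bits_per_pixel=12, fps=30):
        return width * height * bits_per_pixel * fps

    for name, w, h in [("CIF", 360, 288), ("QCIF", 180, 144)]:
        rate = raw_bitrate(w, h)
        print(f"{name}: {rate / 1e6:.1f} Mbits/sec, "
              f"needs ~{rate / 64e3:.0f}:1 compression")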

6
Introduction
  • Conventional DCT block-based encoding (e.g.
    H.261) fails to achieve these target ratios
    without deterioration of image quality.
  • Naive solution: lower resolution and frame rate.
    Even at 2-10 frames/sec at QCIF, quality is
    poor. In addition, synchronisation problems
    between audio and video appear.
  • Knowledge of the image content should be used, as
    in model- or object-based coding (Pearson,
    Aizawa).
  • These can achieve high compression ratios, but
    complex 3D face models or explicit 3D information
    of the image scene are required. In addition,
    robust tracking techniques that fit the face onto
    the model in each frame are needed.

7
Linear Combination of Views
  • Knowledge-based method, but uses only 2D
    information of the scene content.
  • Introduced by Ullman and Basri (1991) for object
    recognition applications.
  • all possible views of an object can be expressed
    as a linear combination of a small number of
    basis 2D views of the same object.
  • For objects with smooth boundaries (e.g. faces)
    undergoing rigid 3D transformations and scaling
    under orthographic or weak perspective projection
    → 5 views are required, but 3 views are
    sufficient for a good approximation.
  • Each novel face view of the person who uses the
    system is expressed as a combination of 3 face
    views of the same person that have been initially
    transmitted to the receivers off-line, prior to
    the video session.

8
2D Warping with an Affine Transformation
  • Why not use a single view to produce novel face
    views?
  • A 2D affine transformation in one view can model
    on-the-plane transformations but fails to capture
    off-the-plane (3D) transformations.
  • 2D Warping techniques work well for flat objects.
  • The face is a complicated 3D object whose shape
    and texture information can not be inferred from
    only one view.

9
2D Warping with an Affine Transformation
  • Initial and target face images used to test 2D
    Affine warping

10
2D Warping with an Affine Transformation
Target reconstructed by a 2D affine
transformation and by our linear combination method
(both with 30 control points)
11
Geometry
  • Any point (x, y) in the novel view is expressed
    in terms of the 3 corresponding points (x1, y1),
    (x2, y2), (x3, y3) in the basis views (a
    reconstruction of the equations follows this
    list).
  • The 14 coefficients ai, bi are computed using a
    small number of control points located in the
    novel and basis images, and solving a set of
    linear equations by means of least squares
    approximation.
  • The shape of the novel face view is encoded
    either by the 14 coefficients or by the
    coordinates of the control points.
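The slide's equations are not reproduced in this
transcript. A plausible reconstruction, consistent with
the 14 coefficients mentioned above and with the
Ullman-Basri linear combination of views, is:

    x = a_0 + a_1 x_1 + a_2 y_1 + a_3 x_2 + a_4 y_2 + a_5 x_3 + a_6 y_3
    y = b_0 + b_1 x_1 + b_2 y_1 + b_3 x_2 + b_4 y_2 + b_5 x_3 + b_6 y_3

With n ≥ 7 control points, the a_i and b_i follow from
two linear least squares problems, one per coordinate.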

12
Texture Rendering
  • The texture of the novel view is reconstructed
    from the texture of the three basis views using a
    weighted interpolation.
  • For each pixel in the novel view we find the
    corresponding pixels in the basis views using the
    piecewise linear mapping technique (Goshtasby
    1986).
  • Control points divide the face area into
    triangular regions.
  • For each pair of corresponding triangles in the
    target and basis views, a local linear mapping
    function is estimated that maps pixels from one
    triangle to the other.

13
Piecewise Linear Mapping
  • For each pair of basis and target view triangles,
    we find the pixel in the basis view onto which
    the pixel in the target view is mapped (a sketch
    of this step follows the figure).

[Figure: a pixel in the Target View and its
corresponding pixels in Basis Views 1, 2 and 3]
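A minimal sketch of the local mapping step, assuming
each triangle's affine map is estimated from its three
vertex correspondences (numpy-based; function names are
illustrative, not from the slides):

    import numpy as np

    def affine_from_triangles(src_tri, dst_tri):
        """2x3 affine map sending the vertices of src_tri
        (target view) onto the matching vertices of
        dst_tri (basis view)."""
        src = np.asarray(src_tri, dtype=float)  # (3, 2) vertices
        dst = np.asarray(dst_tri, dtype=float)  # (3, 2) vertices
        M = np.hstack([src, np.ones((3, 1))])   # homogeneous coords
        # Each column of the solution holds the parameters
        # for one output coordinate.
        return np.linalg.solve(M, dst).T        # (2, 3) matrix

    def map_pixel(affine, x, y):
        """Map a target-view pixel (x, y) into a basis view."""
        return affine @ np.array([x, y, 1.0])

At decode time, each pixel inside a target triangle is
mapped through that triangle's three affine maps (one
per basis view) to fetch the three texture values to be
blended.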
14
Texture Rendering
  • The texture of each pixel (x, y) in the novel
    view is given by a weighted combination of the
    corresponding pixels in the three basis views (a
    reconstruction of the formulas follows this
    list).
  • Weights wi are inversely proportional to the
    distance di of the novel view from the ith basis
    view.
  • Distance di is measured between the control point
    configurations of the novel view and the ith
    basis view.
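The formulas are not reproduced in this transcript. A
plausible reconstruction, consistent with the
description above (the exact distance used on the slide
may differ), is:

    I(x, y) = w_1 I_1(x_1, y_1) + w_2 I_2(x_2, y_2) + w_3 I_3(x_3, y_3)

    d_i = \sqrt{ \sum_k \left[ (x^k - x_i^k)^2 + (y^k - y_i^k)^2 \right] }

where I_i is the texture of the ith basis view and
(x_i^k, y_i^k) is its kth control point.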

15
Texture Rendering
  • Weights wi are obtained by normalising the
    inverse distances (a reconstruction follows this
    list).
  • Colour video data are encoded with the same bit
    rate as grey-level video data.
  • The same linear combination coefficients and
    weights are applied to each colour channel
    (R, G, B).
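A plausible reconstruction of the weight formula,
assuming simple normalisation of the inverse distances:

    w_i = \frac{1/d_i}{1/d_1 + 1/d_2 + 1/d_3}

so that the weights sum to one and the closest basis
view contributes most.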

16
Experiments and Results
  • The three basis views are selected to be one
    frontal, one oriented to the left and one to the
    right, at almost 45° about the vertical axis.

17
Experiments and results
  • A number of face images of the same person in
    different positions in 3D space were used as test
    views.
  • In each image, two sets of 19 and 30 control
    points were located manually.
  • Triangulation is done manually and remains the
    same for all views.
  • All test views were encoded by the coordinates of
    the control points instead of the linear
    combination coefficients → better quality.

18
Experiments and Results
  • Actual and reconstructed face views when 30
    control points were used

19
Experiments and Results
  • Actual and reconstructed face views when 19
    control points were used

20
Experiments and Results
  • Test face views used in our experiments

21
Experiments and Results
  • Reconstructed test face views with 30 control
    points

22
Experiments and Results
  • Reconstructed face views using 30 control points
    and the coordinates of the landmark points (left)
    or the linear combination coefficients (right)

23
Experiments and Results
  • Signal-to-Noise Ratio (SNR) of Normalised Mean
    Square Error.

24
Experiments and Results
  • Bit rate depends on the number of control points.
    The quality improves as the number of control
    points increases (the arithmetic is checked in
    the sketch after this list).
  • For CIF resolution we need 18 bits/control point
    (9 bits/coordinate):
  • 19 points → 342 bits/frame
  • 30 points → 540 bits/frame
  • For QCIF resolution we need 16 bits/control point
    (8 bits/coordinate):
  • 19 points → 304 bits/frame
  • 30 points → 480 bits/frame
  • Using the linear combination coefficients:
  • 14 coefficients, quantized at 17 bits/coefficient
    → 238 bits/frame
  • Lower bit rate BUT poorer quality, less robust
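A quick check of the bit budgets above (the
per-coordinate bit widths come straight from the slide):

    # Shape bits per frame: 2 coordinates per control point.
    def shape_bits(points, bits_per_coord):
        return points * 2 * bits_per_coord

    print(shape_bits(19, 9), shape_bits(30, 9))  # CIF:  342, 540
    print(shape_bits(19, 8), shape_bits(30, 8))  # QCIF: 304, 480
    print(14 * 17)                               # coefficients: 238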

25
Background Encoding
  • Background is also encoded using the linear
    combination.
  • We place a small number of control points at
    fixed positions on the frame boundary of the
    novel view and of the basis views.
  • Background boundary points are combined with the
    face boundary points, and the background texture
    is synthesised similarly to the face texture.
  • Background is effectively encoded with zero bits.
  • Good results are obtained, especially for bland
    areas, even with a small number of control
    points. Areas of the background revealed or
    occluded as a result of face rotation and
    movement are not encoded so well.

26
Background Encoding
  • Actual and reconstructed face and background
    when 19 control points were used for the face
    area and 6 control points at the frame boundary.

27
Background Encoding
  • Face and background reconstruction for a larger
    frame with 30 face control points and 16 fixed
    boundary points

28
Encoding of eyes and mouth
  • Linear combination of face views fails to encode
    the eyes and mouth when the person talks, opens
    or closes their eyes or changes facial
    expression, unless a large number of basis views
    is used → inefficient encoding
  • We suggest a combined approach of linear
    combination and principal components analysis
    (PCA).
  • Principal components analysis is used to encode
    the eyes and mouth.
  • Three training sets are used: left eyes, right
    eyes and mouths.
  • For colour video data, a training set is required
    for each colour channel → 9 training sets

29
Encoding of eyes and mouth
  • Images are geometrically normalised so that they
    all have horizontal orientation and the same size
    → reduces the variation in the training set
  • Normalisation is performed by using the control
    points already transmitted for the linear
    combination.
  • The inverse process is followed in order to
    obtain the reconstructed images.
  • A small number of eigenvectors (10) is sufficient
    for encoding (a sketch of the codec follows this
    list).
  • Projection coefficients can also be quantized to
    lower accuracy and hence further decrease the
    required bit rate.
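A minimal sketch of the eye/mouth PCA codec, assuming
the geometrically normalised patches are flattened to
vectors (array names and shapes are illustrative, not
from the slides):

    import numpy as np

    def train_pca(patches, k=10):
        """patches: (N, P) array of N flattened training
        patches. Returns the mean patch and the k leading
        eigenvectors."""
        mean = patches.mean(axis=0)
        # Rows of vt are the principal components of the set.
        _, _, vt = np.linalg.svd(patches - mean,
                                 full_matrices=False)
        return mean, vt[:k]

    def encode(patch, mean, basis):
        return basis @ (patch - mean)   # k projection coefficients

    def decode(coeffs, mean, basis):
        return mean + basis.T @ coeffs  # reconstructed patch

Only the k coefficients are transmitted per feature per
frame; the mean and basis belong to the start-up cost
discussed on slide 46.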

30
Encoding of eyes and mouth - Experiments
  • The Miss America video sequence was used.
  • The first 30 frames were used as training
    examples.
  • The 20 subsequent frames were used as test
    frames.
  • Good quality reconstructed images were obtained
    using only 10 out of 29 eigenvectors of the
    training set, with the projection coefficients
    quantized to 12 bits accuracy.
  • 1080 bits/frame encode the eyes and mouth changes
    (10 coefficients × 12 bits × 9 training sets).

31
Encoding of eyes and mouth - Experiments
  • First ten eigenvectors of mouth training set
    (shown for red channel)
  • Reconstructed mouth images

[Image rows: actual, normalised, decoded, reconstructed]
32
Encoding of eyes and mouth - Experiments
  • First ten eigenvectors of left eye training set
    (shown for blue channel)
  • Reconstructed left eye images examples

[Image rows: actual, normalised, decoded, reconstructed]
33
Linear Combination and PCA - Results
  • Actual and reconstructed example frame of the
    Miss America video sequence (frame 97)

34
Linear Combination and PCA - Results
  • The Miss America frame was synthesised using:
  • 19 control points
  • 30 training images
  • the 10 most important PCs from each set
  • projection coefficients quantized at 12
    bits/coefficient
  • 1080 bits/frame encode the eyes and mouth
    changes.
  • More bits are used for the encoding of important
    features, such as the eyes and mouth, than for
    the rest of the face.

35
Linear Combination and PCA - Results
  • Bit rate/frame:
  • 30 control points: 1620 bits/frame (CIF) - 1560
    bits/frame (QCIF)
  • 19 control points: 1422 bits/frame (CIF) - 1384
    bits/frame (QCIF)
  • Compression ratio:
  • 30 control points: 1024:1 (CIF) - 265:1 (QCIF)
  • 19 control points: 1167:1 (CIF) - 300:1 (QCIF)
  • Bit rate for 30 frames/sec:
  • 30 control points: 47.5 Kbits/sec (CIF) - 45.7
    Kbits/sec (QCIF)
  • 19 control points: 41.7 Kbits/sec (CIF) - 40.5
    Kbits/sec (QCIF)

36
Linear Combination and PCA - Results
  • Compared with a conventional H.261 video coder:
  • We used the PVRG-P64 software implementation of
    H.261 (developed by the Portable Video Research
    Group at Stanford).
  • Default motion compensation and quantization
    parameters were used.
  • Miss America, at 30 frames/sec with no
    restrictions on the bit rate → 423.7 Kbits/sec on
    average, with good quality decoded video data.
  • Miss America, at 30 frames/sec, but with the bit
    rate restricted to 64 Kbits/sec → 89 Kbits/sec on
    average, but the quality of the decoded video
    data was very poor. Each frame was encoded as the
    first frame of the sequence; no motion was
    encoded in the compressed video data.

37
Linear Combination and PCA - Results
Reconstructed frame of the Miss America video
sequence with PVRG H.261 encoder
Compressed at 423.7 Kbits/sec
Compressed at 89 Kbits/sec
38
Tracking of Control points
  • In our experiments, detection and tracking of
    control points was done manually.
  • The linear combination method requires fast and
    robust face detection and facial feature tracking
    techniques.
  • Suitable trackers developed by other researchers:
  • Hager and Toyama: XVision, a general-purpose
    tracking framework
  • Gee and Cipolla: model-based tracker for human
    faces, using a RANSAC-type regression paradigm
  • Maurer and Malsburg: tracking of facial landmarks
    using Gabor wavelets
  • Wiskott and Malsburg: elastic bunch graph
    matching, which also uses Gabor wavelets and is a
    variation of the Maurer tracker.

39
Missing points estimation
  • The proposed scheme can tolerate some degree of
    drop-out of control points by the tracker.
  • In such a case, assuming that all the control
    points are correctly detected in the basis views,
    any missing points can be estimated by using the
    linear combination coefficients computed from the
    available control points (a sketch follows this
    list).
  • For 30 control points, 500 random patterns of
    missing points were tested and the RMS error in
    the estimation of the control points was
    computed:
  • drop-out of 5 points → 6% increase in the RMS
    error
  • drop-out of 8 points → 9% increase in the RMS
    error
  • The RMS error increases very slowly; only when 20
    out of 30 points are missing is a major increase
    observed.
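A minimal sketch of this estimation step, assuming the
linear combination form reconstructed on slide 11
(numpy-based; names and array layouts are illustrative):

    import numpy as np

    def fit_and_fill(basis_xy, novel_xy, observed):
        """basis_xy: (n, 6) array of (x1, y1, x2, y2, x3, y3)
        per control point; novel_xy: (n, 2) novel-view
        coordinates (entries at unobserved points are ignored
        and overwritten); observed: boolean mask marking the
        points the tracker found."""
        M = np.hstack([np.ones((len(basis_xy), 1)), basis_xy])
        # Least squares fit of the 7 x- and 7 y-coefficients
        # using only the available points.
        coeffs, *_ = np.linalg.lstsq(M[observed],
                                     novel_xy[observed],
                                     rcond=None)
        filled = novel_xy.copy()
        filled[~observed] = M[~observed] @ coeffs  # predict the rest
        return filled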

40
Missing points estimation
41
Elimination of outlier control points
  • When some of the control points are incorrectly
    detected, or the tracker provides more control
    points than those requested, statistical methods
    can be used to reduce the effect of those
    outliers or eliminate them:
  • least median of squares (LMS)
  • Huber's M-estimator
  • RANSAC
  • This framework can also be used for estimation of
    occluded control points, by regarding them as
    missing, from the correctly identified inliers.
  • The RANSAC algorithm seems to be the most
    effective, especially with a large portion of
    outliers.

42
Elimination of outlier control points
  • RANSAC seems to be the most effective, especially
    with a potentially large portion of outliers.
  • For the linear combination, our aim is to find
    with high probability (≥ 95%) at least one sample
    of 7 inliers from a set of random patterns of
    control points detected by the tracker (see the
    relation after this list).
  • For 30 control points and 7 points in each
    sample, the fraction of contaminated data (false
    positives) that can be tolerated is almost
    46-47%.
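The number of samples required follows from the standard
RANSAC relation (not shown in this transcript, but
implied by the plot on slide 43). With outlier fraction
ε, sample size s = 7 and required success probability p:

    N \geq \frac{\log(1 - p)}{\log\left(1 - (1 - \epsilon)^s\right)}

For p = 0.95 and ε = 0.47, (1 - ε)^7 ≈ 0.012, giving
N ≈ 250 samples.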

43
Elimination of outlier control points
  • Number of samples required vs. fraction of
    outliers, for probability ≥ 95%

44
Conclusions
  • We propose a computationally simple and efficient
    technique for compression of face video data.
  • Knowledge that the video data contains face
    images facilitates the very high compression
    ratios.
  • Only shape information (control points or linear
    combination coefficients) needs to be
    transmitted. Texture information is not required.
  • The linear combination captures changes in
    scaling and 2D orientation (on the image plane)
    as well as most of the 3D orientation changes.
  • Explicit 3D information or complex 3D modelling
    is avoided.
  • The same bit rate is required for large as well
    as small frames.

45
Conclusions
46
Conclusions
  • Start-up costs: transmission of the face basis
    views and the eigen-eyes and eigen-mouths of the
    eye and mouth training sets.
  • These can be further compressed with conventional
    (e.g. DCT-based) compression algorithms.
  • Sparse correspondence between the basis and
    target views is required → a robust and fast
    tracking system is needed.
  • The robustness of the tracker can be increased by
    using RANSAC-type methods.
  • The combined approach preserves both temporal and
    spatial resolution at bit rates well below the
    target of 64 Kbits/sec we set.

47
Conclusions
Basis view 1, compressed with JPEG at 2300 bytes
Reconstructed view from the JPEG-compressed basis
views (200 bytes)
JPEG-compressed view at 1% quality of the
original (780 bytes)
48
Conclusions
  • Good quality for the reconstructed images. Minor
    artefacts are due to self-occlusion effects and
    illumination specularities.
  • Self-occlusion can be eliminated by hidden
    surface removal (affine depth, Ullman, 1991).
  • Geometric distortions in the background are
    usually unpredictable, but they don't cause
    serious problems. They are not easily detected,
    so we ignore them.

49
Conclusions
  • If we consider the temporal dimension, there
    might appear some problems when coding artefacts
    are quite prominent in consecutive frames, so
    that a smooth transition from to frame can not be
    achieved.
  • An error concealment method is required that
    eliminates such problem.
  • Temporal averaging over consecutive frames may
    remove small spatial errors.
  • Error concealment could be more severe at the
    periphery of the frame, since users will not be
    concentrated at these areas

50
Further work
  • Integration of the principal components method
    for eyes and mouth with the linear combination
    method for face views in the same framework,
    possibly with an automated tracking technique
    (such as the one developed by Maurer et al.)
  • More experiments need to be done with faces of
    different persons and with data recorded from
    actual videoconferencing sessions.
  • Study visibility effects (e.g. self-occlusion)
    and verify that we can eliminate them.
  • Simulation of the transmission of compressed face
    video data over a low bandwidth link in a
    packet-switched network, using UDP or RTP.
  • Real-time transmission of compressed face video
    data over the Internet in order to assess the
    feasibility of the proposed method for the
    envisaged applications.

51
Contact Information
  • e-mail: G.Koufakis@cs.ucl.ac.uk
  • www: http://www.cs.ucl.ac.uk/staff/G.Koufakis