1
Audio-visual speaker verification using
continuous fused HMMs
  • David Dean, Sridha Sridharan,
  • and Tim Wark
  • Presented by David Dean
  • Slides will be available at
    http://www.davidbdean.com/category/publications

2
Why audio-visual speaker recognition?
  • Bimodal recognition exploits the synergy between
    acoustic speech and visual speech, particularly
    under adverse conditions. It is motivated by the
    need, in many potential applications of
    speech-based recognition, for robustness to speech
    variability, high recognition accuracy, and
    protection against impersonation.
  • (Chibelushi, Deravi and Mason 2002)

3
Early and late fusion
  • Most early approaches to audio-visual speaker
    recognition (AVSPR) used either early or late
    fusion (feature or output)
  • Problems
  • Output fusion cannot model temporal dependencies
  • Feature fusion suffers from problems with noise

[Diagrams: early (feature) fusion and late (output) fusion architectures]
4
Middle fusion - coupled HMMs
  • Middle fusion models accept two streams of input,
    and the combination is performed within the
    classifier
  • Most middle fusion is performed using coupled
    HMMs (shown here)
  • Can be difficult to train
  • Dependencies between hidden states are not strong
    (Brand 1999)

5
Middle fusion - fused HMMs
  • Pan et al. (2004) used probabilistic models to
    investigate the optimal multi-stream HMM design
  • The design criterion is to maximise the mutual
    information between the audio and video streams
  • They found that linking the observations of one
    modality to the hidden states of the other works
    better than linking only the hidden states (as in
    a coupled HMM); a scoring sketch is given below

[Diagram: acoustic-biased FHMM structure]
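A minimal scoring sketch for the acoustic-biased FHMM, assuming hypothetical audio_hmm.viterbi() and per-state GMM log_prob() helpers (no particular toolkit is implied), with the video observations already resampled to the audio frame rate:

```python
def acoustic_biased_fhmm_score(audio_obs, video_obs, audio_hmm, state_gmms):
    """Illustrative FHMM scoring: couple the video observations to the
    decoded acoustic state sequence (after Pan et al. 2004).

    audio_hmm  -- trained acoustic HMM with a hypothetical viterbi() method
                  returning (log_likelihood, best_state_path)
    state_gmms -- hypothetical dict mapping each acoustic state to a GMM
                  over video observations, exposing log_prob(frame)
    """
    # Decode the dominant (acoustic) stream to obtain its best state path.
    audio_loglik, state_path = audio_hmm.viterbi(audio_obs)

    # Score each video frame against the GMM attached to the acoustic state
    # active at that frame, accumulating in the log domain.
    video_loglik = sum(state_gmms[state].log_prob(frame)
                       for state, frame in zip(state_path, video_obs))

    # Joint log score: log p(O_audio, q*) + log p(O_video | q*).
    return audio_loglik + video_loglik
```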
6
Choosing the dominant modality
  • The fused HMM design results in two variants:
    acoustic-biased or video-biased
  • The choice of the dominant modality (the one
    biased towards) should be based upon which
    individual HMM can more reliably estimate the
    hidden state sequence for a particular
    application
  • Generally audio
  • Alternatively, both versions can be used
    concurrently and decision fused (as in Pan et al.
    2004)

7
Continuous fused HMMs
  • The original Fused HMM implementation treated the
    secondary domain as discrete (Pan et al. 2004)
  • This caused problems with within-speaker
    variation
  • Works fine on single-session data (CUAVE; Dean et
    al. 2006)
  • Fails on multi-session data (XM2VTS)
  • Continuous FHMMs model both modalities with GMMs

8
Training FHMMs
  • Both biased FHMMs (if needed) are trained
    independently
  • Train the dominant (audio for acoustic-biased,
    video for video-biased) HMM independently upon
    the training observation sequences for that
    modality
  • The best hidden state sequence of the trained HMM
    is found for each training observation using the
    Viterbi process
  • Model the relationship between the dominant
    hidden state sequence and the training
    observation sequences for the subordinate
    modality
  • i.e. model the probability of observing a
    particular subordinate observation while within a
    particular dominant hidden state (a training
    sketch follows below)
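These training steps can be sketched as follows; the helpers (audio_hmm.fit(), audio_hmm.viterbi(), a fit_gmm() factory) are assumptions rather than part of any toolkit used in the paper, and the video frames are assumed to be already resampled to the audio frame rate:

```python
from collections import defaultdict

def train_acoustic_biased_fhmm(train_pairs, audio_hmm, fit_gmm):
    """train_pairs: list of (audio_frames, video_frames) per training
    utterance, with the video resampled to the audio frame rate."""
    # 1. Train the dominant (acoustic) HMM on its own modality only.
    audio_hmm.fit([audio for audio, _ in train_pairs])

    # 2. Viterbi-align each utterance to the best dominant state path and
    #    pool the subordinate (video) frames that fall in each state.
    frames_per_state = defaultdict(list)
    for audio, video in train_pairs:
        _, state_path = audio_hmm.viterbi(audio)
        for state, frame in zip(state_path, video):
            frames_per_state[state].append(frame)

    # 3. Model the subordinate observations of each dominant state with a
    #    GMM (the continuous FHMM case, rather than a discrete table).
    return {state: fit_gmm(frames)
            for state, frames in frames_per_state.items()}
```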

9
Decoding FHMMs
  • The dominant FHMM can be viewed as a special type
    of HMM that outputs observations in two streams
  • This does not affect the decoding lattice, and
    the Viterbi algorithm can be used to decode
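Because the decoding lattice is unchanged, the standard Viterbi recursion applies; only the emission term becomes the sum (in the log domain) of the two streams' per-state likelihoods. A minimal sketch, assuming the per-frame, per-state log emissions for each stream are precomputed as (T, n_states) arrays:

```python
import numpy as np

def viterbi_two_stream(log_trans, log_prior, log_em_audio, log_em_video):
    """Standard Viterbi over a fused two-stream HMM.
    log_trans: (n, n) log transition matrix; log_prior: (n,) log initial
    probabilities; log_em_*: (T, n) per-frame, per-state log likelihoods."""
    log_em = log_em_audio + log_em_video       # fused emissions; lattice unchanged
    T, n = log_em.shape
    delta = log_prior + log_em[0]
    back = np.zeros((T, n), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans    # scores[i, j]: from state i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_em[t]
    # Backtrack the best state path.
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return float(delta.max()), path[::-1]
```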

10
Experimental setup
[Block diagram: acoustic feature extraction and visual feature extraction (with manual lip tracking) feed three parallel systems, each producing a speaker score: acoustic + visual HMM/GMM output fusion, an acoustic-biased FHMM, and a video-biased FHMM]
11
Training and testing datasets
  • Training and testing configuration was based on
    the XM2VTSDB protocol (Messer et al. 1999)
  • 12 configurations were generated based on the
    single XM2VTSDB configuration
  • For each configuration
  • 400 client tests
  • 8000 impostor tests

12
Feature extraction
  • Audio
  • MFCC: 12 coefficients + 1 energy, plus deltas and
    accelerations = 43 features
  • Video
  • Lip ROI manually tracked every 50 frames
  • 120x80 pixels
  • Grayscale
  • Down-sampled to 24x16
  • DCT: 20 coefficients, plus deltas and
    accelerations = 60 features (a feature-extraction
    sketch follows)
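A sketch of the video feature step, assuming a (120, 80) grayscale ROI array and a zig-zag-style selection of the 20 lowest-order DCT coefficients (the slide does not specify the coefficient ordering or the down-sampling method); deltas and accelerations would be appended afterwards, as for the audio stream:

```python
import numpy as np
from scipy.fft import dctn

def lip_dct_features(roi):
    """roi: (120, 80) grayscale lip region, as described above."""
    # Down-sample 120x80 -> 24x16 by 5x5 block averaging (one simple choice).
    small = roi.reshape(24, 5, 16, 5).mean(axis=(1, 3))
    # 2-D DCT of the down-sampled ROI.
    coeffs = dctn(small, norm='ortho')
    # Keep the 20 lowest-order coefficients (diagonal ordering assumed).
    order = sorted((i + j, i, j) for i in range(24) for j in range(16))
    return np.array([coeffs[i, j] for _, i, j in order[:20]])
```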

13
Output fusion training
  • Two classifier types for each modality
  • Gaussian mixture models (GMMs)
  • Trained over entire sequences
  • Hidden Markov models (HMMs)
  • Trained for each word
  • Speaker models adapted from background models
    using maximum a posteriori (MAP) adaptation (Lee &
    Gauvain 1993); a MAP-adaptation sketch follows
    below
  • Topology of HMMs and GMMs determined by testing on
    the evaluation partition of the first
    configuration
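The MAP step can be sketched as mean-only relevance adaptation of a background GMM toward a speaker's data; this is a common formulation rather than necessarily the exact variant used here, and the relevance factor of 16 is an assumed default:

```python
import numpy as np

def map_adapt_means(bg_weights, bg_means, bg_vars, data, relevance=16.0):
    """Mean-only MAP adaptation of a diagonal-covariance background GMM.
    bg_weights: (K,); bg_means, bg_vars: (K, D); data: (T, D)."""
    # E-step: responsibility of each background Gaussian for each frame.
    def log_gauss(x, mu, var):
        return -0.5 * (np.sum(np.log(2 * np.pi * var))
                       + np.sum((x - mu) ** 2 / var, axis=-1))
    logp = np.stack([np.log(w) + log_gauss(data, m, v)
                     for w, m, v in zip(bg_weights, bg_means, bg_vars)],
                    axis=1)
    resp = np.exp(logp - logp.max(axis=1, keepdims=True))
    resp /= resp.sum(axis=1, keepdims=True)

    # Sufficient statistics, then MAP smoothing toward the background means.
    n_k = resp.sum(axis=0)                                   # (K,)
    ex_k = resp.T @ data / np.maximum(n_k[:, None], 1e-10)   # (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * ex_k + (1 - alpha) * bg_means
```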

14
Output fusion testing
  • Fused HMM performance is compared to output
    fusion of normal HMMs and GMMs in each modality
  • Audio HMM + Video HMM
  • Audio HMM + Video GMM
  • Audio GMM + Video HMM
  • Audio GMM + Video GMM
  • Evaluation session used to estimate each
    modality's output-score distribution, to normalise
    scores within each modality
  • Background model score subtracted from speaker
    scores to normalise for environment and length (a
    scoring sketch follows)
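A sketch of the per-trial output-fusion score, assuming a weighted sum of z-normalised, background-compensated scores; the fusion weight and the exact order of the normalisation steps are assumptions, since the slide states only what is normalised, not the formula:

```python
def fused_output_score(audio_score, audio_bg_score, video_score,
                       video_bg_score, audio_stats, video_stats,
                       audio_weight=0.5):
    """audio_stats / video_stats: (mean, std) of each modality's scores,
    estimated on the evaluation session."""
    # Subtract the background-model score (environment/length normalisation),
    # then z-normalise each modality with the evaluation-session statistics.
    a = (audio_score - audio_bg_score - audio_stats[0]) / audio_stats[1]
    v = (video_score - video_bg_score - video_stats[0]) / video_stats[1]
    # Weighted sum across modalities gives the final speaker score.
    return audio_weight * a + (1.0 - audio_weight) * v
```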

15
Output fusion results
  • HMM-based models take advantage of temporal
    information to improve performance over GMM-based
    models in both modalities
  • The audio GMM approaches HMM performance, but only
    with a large number of Gaussians
  • Video GMM does not improve with more Gaussians

16
Output fusion results
  • Advantage of video HMM does not carry over to
    output fusion
  • Little difference between video HMM and GMM in
    output fusion
  • Output fusion performance affected mostly by
    choice of audio classifier
  • Output fusion doesn't take advantage of video
    temporal information

[Result plots: output fusion grouped by audio GMM and audio HMM back-ends]
17
Fused HMM training
  • Both acoustic- and visual-biased FHMMs are
    examined
  • Individual audio and video HMMs used as basis for
    FHMMs
  • Secondary models adapted from individual
    speakers' GMMs for each state of the underlying
    HMM
  • Background FHMM was formed similarly

18
Fused HMM testing
  • Subordinate observations are up- or down-sampled
    to the same rate as the dominant HMM (see the
    resampling sketch below)
  • Evaluation session used to estimate each
    modality's frame-score distribution to normalise
    scores within each modality
  • Similar to output-fusion, but on a frame-by-frame
    basis rather than using final output score
  • As well as using subordinate models adapted to
    the states of the dominant HMM, testing is
    performed with
  • Word subordinate models
  • (same secondary model for the entire word)
  • Global subordinate models
  • (same secondary model for all words)
  • Finally, background FHMM score is subtracted to
    normalise for environment and length
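The up/down-sampling of the subordinate stream can be sketched as nearest-neighbour resampling in time, giving exactly one subordinate frame per dominant frame; the interpolation scheme is an assumption, as the slides do not specify it:

```python
import numpy as np

def resample_to_dominant_rate(sub_obs, n_dominant_frames):
    """sub_obs: (T_sub, D) subordinate observations; returns an array of
    shape (n_dominant_frames, D) by repeating or dropping frames."""
    idx = np.round(np.linspace(0, len(sub_obs) - 1,
                               n_dominant_frames)).astype(int)
    return np.asarray(sub_obs)[idx]
```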

19
Comparison with output-fusion
  • If the same subordinate model is used for each
    dominant state, the FHMM model can be viewed as
    functionally equivalent to HMM-GMM output fusion
  • Although in practice this is not the case due to
    resampling of the subordinate observations and
    where modality-normalisation occurs
  • Word and State subordinate models can also be
    viewed as functionally equivalent to HMM-GMM
    output fusion
  • Choose the subordinate GMM based on the dominant
    state for each frame
  • Provided that the FHMM design doesn't affect the
    best path through the lattice

20
Acoustic-biased FHMM results
  • There is some benefit in using state-based FHMM
    models for audio
  • State or word-based FHMM models are better than
    global for most of the plot

[Result plot: acoustic-biased FHMM variants compared against the best performing output fusion]
21
Video-biased FHMM results
  • No benefit in using word or state-based FHMM
    models for video
  • Therefore, there is no benefit in using
    video-biased FHMM models at all

[Result plot: video-biased FHMM variants compared against the best performing output fusion]
22
Acoustic vs. video-biased FHMM
  • The acoustic-biased FHMM shows that the audio can
    be used to segment the video into
    visually-similar sequences
  • However, the video-biased FHMM cannot use video
    to segment the audio into acoustically-similar
    sequences
  • Whilst the performance increase is small, it
    appears that the acoustic FHMM is benefiting from
    a temporal relationship between the acoustic
    states and the visual observations

23
Conclusion
  • Video HMM improves performance over video GMM,
    but this advantage does not carry over to output
    fusion
  • Output fusion performance depends mainly on the
    acoustic classifier chosen
  • Audio-biased continuous FHMMs can take advantage
    of the temporal relationship between audio states
    and video observations
  • However, the video-biased continuous FHMM
    performance appears to show no corresponding
    relationship between video states and audio
    observations

24
Continuing/Future Research
  • Secondary GMMs are recognising a large amount of
    static video information
  • Skin or lip colour, facial hair, etc.
  • This information has no temporal relationship
    with the audio states, and may be swamping the
    more dynamic information available in facial
    movements
  • A more efficient structure may be realised by
    using more dynamic video features (mean-removed
    DCT, contour-based or optical flow) and output
    fusion with a face GMM
  • This would take advantage of the temporal
    audio-visual relationship, in addition to static
    face-recognition

25
Speech Recognition with Fused HMMs
  • FHMMs improve single-modal speech processing in
    two ways
  • 2nd modality improves scores within states
  • 2nd modality improves state sequence
  • Text-dependent speaker recognition only benefits
    from the first improvement
  • The state sequence is fixed
  • However, speech recognition can take advantage of
    both improvements

26
Speech Recognition with Fused HMMs
  • Using first XM2VTS configuration (XM2VTSDB)
  • Speaker-independent, continuous-speech, digit
    recognition
  • PLP-based audio features; hierarchical-LDA-based
    video features (from mean-removed DCT)
  • We believe this gives performance comparable to
    coupled and asynchronous HMMs
  • But simpler to train and decode

27
FHMMs, synchronous HMMs and feature fusion
  • The FHMM structure can be implemented as a
    multi-modal synchronous HMM, and therefore with
    minor simplification as a feature-fusion HMM
  • The difference is in how the structure is trained
  • In synchronous HMMs and feature-fusion, both
    modalities are used to train the HMMs
  • FHMMs can be viewed as adapting a multi-modal
    synchronous HMM from the dominant single-modal
    HMM
  • If the same number of Gaussians is used for both
    modalities, an FHMM can be implemented within a
    single-modal HMM decoder
  • Decoding is then exactly the same as with
    feature fusion (a minimal sketch follows)
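A minimal sketch of the feature-fusion view: once the video stream has been resampled to the audio frame rate, concatenating the per-frame features lets an unmodified single-stream decoder process both modalities, provided each state's emission model factorises into an audio part and a video part (as the adapted FHMM emissions do). This only illustrates the equivalence claimed above, not the authors' implementation:

```python
import numpy as np

def as_single_stream(audio_obs, video_obs):
    """Concatenate frame-synchronous audio and video observations so a
    standard single-stream HMM decoder can process them together."""
    audio_obs, video_obs = np.asarray(audio_obs), np.asarray(video_obs)
    assert len(audio_obs) == len(video_obs), "streams must be frame-synchronous"
    return np.hstack([audio_obs, video_obs])   # shape (T, D_audio + D_video)
```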

28
References
  • Brand, M. (1999), 'A Bayesian computer vision
    system for modeling human interactions', in ICVS
    '99, Gran Canaria, Spain.
  • Chibelushi, C., Deravi, F. & Mason, J. (2002), 'A
    review of speech-based bimodal recognition', IEEE
    Transactions on Multimedia 4(1), 23-37.
  • Dean, D., Wark, T. & Sridharan, S. (2006), 'An
    examination of audio-visual fused HMMs for speaker
    recognition', in MMUA 2006, Toulouse, France.
  • Lee, C.-H. & Gauvain, J.-L. (1993), 'Speaker
    adaptation based on MAP estimation of HMM
    parameters', in ICASSP-93, Vol. 2, pp. 558-561.
  • Luettin, J. & Maitre, G. (1998), 'Evaluation
    protocol for the extended M2VTS database
    (XM2VTSDB)', Technical report, IDIAP.
  • Messer, K., Matas, J., Kittler, J., Luettin, J. &
    Maitre, G. (1999), 'XM2VTSDB: The extended M2VTS
    database', in Second International Conference on
    Audio and Video-based Biometric Person
    Authentication (AVBPA '99), Washington D.C., pp.
    72-77.
  • Pan, H., Levinson, S., Huang, T. & Liang, Z.-P.
    (2004), 'A fused hidden Markov model with
    application to bimodal speech processing', IEEE
    Transactions on Signal Processing 52(3), 573-581.

29
Questions?