Title: Audio-visual speaker verification using continuous fused HMMs
1. Audio-visual speaker verification using continuous fused HMMs
- David Dean, Sridha Sridharan, and Tim Wark
- Presented by David Dean
- Slides will be available at http://www.davidbdean.com/category/publications
2. Why audio-visual speaker recognition?
- "Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need, in many potential applications of speech-based recognition, for robustness to speech variability, high recognition accuracy, and protection against impersonation." (Chibelushi, Deravi and Mason 2002)
3. Early and late fusion
- Most early approaches to audio-visual speaker recognition (AVSPR) used either early (feature) or late (output) fusion
- Problems:
  - Output fusion cannot model temporal dependencies
  - Feature fusion suffers from problems with noise
[Diagram: early (feature) fusion and late (output) fusion architectures]
4. Middle fusion: coupled HMMs
- Middle fusion models can accept two streams of input, and the combination is done within the classifier
- Most middle fusion is performed using coupled HMMs (shown here)
- Can be difficult to train
- Dependencies between hidden states are not strong (Brand 1999)
[Diagram: coupled HMM structure]
5. Middle fusion: fused HMMs
- Pan et al. (2004) used probabilistic models to investigate the optimal multi-stream HMM design
  - Maximise the mutual information between audio and video
- They found that linking the observations of one modality to the hidden states of the other was better than linking just the hidden states (i.e. the coupled HMM); a rough sketch of the resulting likelihood follows below
[Diagram: acoustic-biased FHMM structure]
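As a rough illustration of this structure (the notation is mine, not from the slides, and is simplified relative to Pan et al. 2004): an acoustic-biased FHMM ties the video observations to the acoustic hidden states, so the joint likelihood factorises approximately as below. In practice the coupling term is often evaluated along the single best (Viterbi) acoustic state path rather than summed over all paths.

% Hedged sketch, not the exact formulation of Pan et al. (2004).
% O^a, O^v: acoustic and video observation sequences;
% q_{1:T}: acoustic hidden-state path.
\[
  p(O^a, O^v) \;\approx\;
    \sum_{q_{1:T}} p(q_{1:T})
    \prod_{t=1}^{T} p(o^a_t \mid q_t)\, p(o^v_t \mid q_t)
\]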
6. Choosing the dominant modality
- The fused HMM design results in two variants: acoustic-biased or video-biased
- The choice of the dominant modality (the one biased towards) should be based on which individual HMM can more reliably estimate the hidden state sequence for a particular application
  - Generally audio
- Alternatively, both versions can be used concurrently and decision fused (as in Pan et al. 2004)
7. Continuous fused HMMs
- The original fused HMM implementation treated the secondary modality as discrete (Pan et al. 2004)
- This caused problems with within-speaker variation
  - Works fine on a single session (CUAVE: Dean et al. 2006)
  - Fails on multi-session data (XM2VTS)
- Continuous FHMMs model both modalities with GMMs (contrasted with the discrete form below)
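To make the discrete/continuous distinction concrete (again a sketch in my own notation): the original FHMM scores subordinate frames through a vector-quantised codebook lookup, whereas the continuous FHMM replaces that table with a per-state GMM.

% Discrete FHMM: the subordinate frame o^v_t is vector-quantised to a
% codebook index k(o^v_t), and each dominant state j stores a discrete table.
\[
  p(o^v_t \mid q_t = j) \;=\; P_j\bigl(k(o^v_t)\bigr)
\]
% Continuous FHMM: each dominant state j instead carries a GMM over the
% raw subordinate feature vector.
\[
  p(o^v_t \mid q_t = j) \;=\;
    \sum_{m=1}^{M} w_{jm}\,
    \mathcal{N}\!\bigl(o^v_t;\, \mu_{jm}, \Sigma_{jm}\bigr)
\]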
8. Training FHMMs
- Both biased FHMMs (if needed) are trained independently:
  - Train the dominant HMM (audio for acoustic-biased, video for video-biased) independently on the training observation sequences for that modality
  - Find the best hidden state sequence of the trained HMM for each training observation sequence using the Viterbi algorithm
  - Model the relationship between the dominant hidden state sequence and the training observation sequences for the subordinate modality
    - i.e. model the probability of observing a given subordinate observation whilst within a particular dominant hidden state (see the training sketch below)
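A minimal sketch of this training recipe, using hmmlearn and scikit-learn as stand-ins for the actual toolchain (the slides do not name the tools); the model sizes, the nearest-neighbour video resampling, and the assumption that audio_seqs/video_seqs are lists of (frames x dims) NumPy arrays are all illustrative.

import numpy as np
from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

def train_acoustic_biased_fhmm(audio_seqs, video_seqs, n_states=5, n_mix=4):
    """Sketch: train the dominant audio HMM, Viterbi-align it, then fit
    one subordinate video GMM per dominant state."""
    # 1. Train the dominant (audio) HMM on the audio sequences only.
    audio_hmm = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    audio_hmm.fit(np.vstack(audio_seqs), lengths=[len(a) for a in audio_seqs])

    # 2. Viterbi-align each sequence and pool the (resampled) video frames
    #    that fall within each dominant state.
    frames_per_state = {j: [] for j in range(n_states)}
    for audio, video in zip(audio_seqs, video_seqs):
        states = audio_hmm.predict(audio)                  # best state path
        idx = np.linspace(0, len(video) - 1, num=len(audio)).round().astype(int)
        for j, frame in zip(states, video[idx]):
            frames_per_state[j].append(frame)

    # 3. Model p(video frame | dominant state) with a GMM per state.
    sub_gmms = {}
    for j, frames in frames_per_state.items():
        if frames:                                         # skip unused states
            sub_gmms[j] = GaussianMixture(
                n_components=n_mix, covariance_type="diag").fit(np.vstack(frames))
    return audio_hmm, sub_gmms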
9. Decoding FHMMs
- The dominant FHMM can be viewed as a special type of HMM that outputs observations in two streams
- This does not affect the decoding lattice, and the Viterbi algorithm can be used to decode (see the recursion sketched below)
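A sketch of why the lattice is unchanged (my notation): the only modification to the standard Viterbi recursion is that the single-stream emission term is replaced by a product over the two streams.

% a_{ij}: transition probability; the bracketed product replaces the
% usual single-stream emission b_j(o_t).
\[
  \delta_t(j) \;=\; \max_{i}\,\bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\;
    p(o^a_t \mid q_t = j)\, p(o^v_t \mid q_t = j)
\]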
10. Experimental setup
[System diagram: acoustic feature extraction and visual feature extraction (with manual lip tracking) feed three verification systems, each producing a speaker score: HMM/GMM output fusion (acoustic HMM/GMM + visual HMM/GMM), an acoustic-biased FHMM, and a video-biased FHMM]
11. Training and testing datasets
- Training and testing configuration was based on the XM2VTSDB protocol (Messer et al. 1999)
- 12 configurations were generated based on the single XM2VTSDB configuration
- For each configuration:
  - 400 client tests
  - 8000 imposter tests
12. Feature extraction
- Audio
  - 12 MFCCs + 1 energy, with deltas and accelerations (43 features)
- Video
  - Lip ROI manually tracked every 50 frames
  - 120x80 pixels
  - Grayscale
  - Down-sampled to 24x16
  - 20 DCT coefficients, with deltas and accelerations (60 features); a feature-extraction sketch follows below
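A hedged sketch of comparable features using librosa and SciPy (the slides do not state the actual tools or exact parameters; the frame settings, the use of c0 as an energy stand-in, and the raster ordering of DCT coefficients are assumptions).

import numpy as np
import librosa
from scipy.fft import dctn

def audio_features(wav_path):
    """12 MFCCs + energy term, with deltas and accelerations (sketch)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # c0 stands in for energy
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                        # (frames, dims)

def video_features(lip_rois):
    """2-D DCT of each 24x16 grayscale lip ROI, keeping 20 low-order
    coefficients, with deltas and accelerations (sketch)."""
    coeffs = []
    for roi in lip_rois:                                  # roi: 24x16 array
        d = dctn(roi, norm="ortho")
        coeffs.append(d.flatten()[:20])                   # raster order (assumed)
    static = np.array(coeffs).T                           # (20, frames)
    feats = np.vstack([static,
                       librosa.feature.delta(static),
                       librosa.feature.delta(static, order=2)])
    return feats.T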
13. Output fusion training
- Two classifier types for each modality:
  - Gaussian mixture models (GMMs), trained over entire sequences
  - Hidden Markov models (HMMs), trained for each word
- Speaker models adapted from background models using maximum a posteriori (MAP) adaptation (Lee & Gauvain 1993); a simplified form of the mean update is sketched below
- Topology of HMMs and GMMs determined by testing on the evaluation partition of the first configuration
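For reference, a commonly used simplified form of MAP mean adaptation for GMM speaker models (the relevance-factor form popularised for GMM-UBM systems, given here as a sketch rather than the full Gauvain & Lee formulation): each background mixture mean is pulled towards the speaker's data in proportion to how much adaptation data that mixture accounts for.

% n_k: soft count of speaker frames assigned to mixture k;
% \bar{x}_k: their posterior-weighted mean; \mu_k: background mean;
% \tau: relevance factor controlling how quickly adaptation occurs.
\[
  \hat{\mu}_k \;=\; \frac{n_k\,\bar{x}_k + \tau\,\mu_k}{n_k + \tau}
\]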
14. Output fusion testing
- Fused HMM performance is compared to output fusion of normal HMMs and GMMs in each modality:
  - Audio HMM + Video HMM
  - Audio HMM + Video GMM
  - Audio GMM + Video HMM
  - Audio GMM + Video GMM
- Evaluation session used to estimate each modality's output score distribution, to normalise scores within each modality
- Background model score subtracted from speaker scores to normalise for environment and utterance length (see the scoring sketch below)
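A minimal sketch of this score handling, assuming a simple mean/variance (z-score style) normalisation per modality and a weighted-sum fusion rule; the slides do not specify the exact normalisation or fusion, so treat these details as assumptions.

import numpy as np

def normalise(scores, eval_scores):
    """Normalise test scores using statistics estimated on the
    evaluation session for that modality (assumed z-score form)."""
    mu, sigma = np.mean(eval_scores), np.std(eval_scores)
    return (np.asarray(scores) - mu) / sigma

def fuse(audio_llr, video_llr, w_audio=0.7):
    """Weighted-sum output fusion of per-modality scores, where each
    score is already speaker log-likelihood minus background
    log-likelihood (the background subtraction mentioned above)."""
    return w_audio * audio_llr + (1.0 - w_audio) * video_llr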
15. Output fusion results
- HMM-based models take advantage of temporal information to improve performance over GMM-based models in both modalities
- The audio GMM approaches the audio HMM, but only with a large number of Gaussians
- The video GMM does not improve with more Gaussians
16. Output fusion results
- The advantage of the video HMM does not carry over to output fusion
  - Little difference between video HMM and GMM in output fusion
- Output fusion performance is affected mostly by the choice of audio classifier
- Output fusion doesn't take advantage of video temporal information
[Plot: output fusion results, grouped by audio classifier (audio GMM vs. audio HMM)]
17. Fused HMM training
- Both acoustic- and visual-biased FHMMs are examined
- Individual audio and video HMMs used as the basis for the FHMMs
- Secondary models adapted from the individual speaker's GMMs for each state of the underlying HMM
- Background FHMM was formed similarly
18. Fused HMM testing
- Subordinate observations are up/down-sampled to the same rate as the dominant HMM
- Evaluation session used to estimate each modality's frame-score distribution, to normalise scores within each modality
  - Similar to output fusion, but on a frame-by-frame basis rather than using the final output score
- As well as using subordinate models adapted to the states of the dominant HMM, testing is performed with:
  - Word subordinate models (same secondary model for the entire word)
  - Global subordinate models (same secondary model for all words)
- Finally, the background FHMM score is subtracted to normalise for environment and utterance length (the resampling and frame scoring are sketched below)
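A sketch of the resampling and frame-level scoring, continuing the hypothetical train_acoustic_biased_fhmm() example from slide 8 (the nearest-neighbour resampling and scoring details are assumptions, not the slides' exact procedure).

import numpy as np

def fhmm_frame_scores(audio, video, audio_hmm, sub_gmms):
    """Per-frame subordinate (video) log-likelihoods evaluated along the
    dominant (audio) Viterbi state path."""
    states = audio_hmm.predict(audio)                     # dominant state path
    # Up/down-sample the video stream to the audio frame rate.
    idx = np.linspace(0, len(video) - 1, num=len(audio)).round().astype(int)
    video_rs = video[idx]
    return np.array([sub_gmms[j].score_samples(v[None, :])[0]
                     for j, v in zip(states, video_rs)])

These frame scores would then be normalised per modality (as on this slide) before being combined with the dominant HMM score.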
19. Comparison with output fusion
- If the same subordinate model is used for every dominant state, the FHMM can be viewed as functionally equivalent to HMM-GMM output fusion
  - Although in practice this is not exactly the case, due to the resampling of the subordinate observations and where the modality normalisation occurs
- Word and state subordinate models can also be viewed as functionally equivalent to HMM-GMM output fusion that chooses the subordinate GMM based on the dominant state for each frame
  - Provided that the FHMM design doesn't affect the best path through the lattice
20. Acoustic-biased FHMM results
- There is some benefit in using state-based FHMM models for audio
- State- or word-based FHMM models are better than the global model over most of the plot
[Plot: acoustic-biased FHMM results against the best-performing output fusion]
21. Video-biased FHMM results
- No benefit in using word- or state-based FHMM models for video
- Therefore, no benefit in using video-biased FHMM models at all
[Plot: video-biased FHMM results against the best-performing output fusion]
22. Acoustic- vs. video-biased FHMM
- The acoustic-biased FHMM shows that the audio can be used to segment the video into visually-similar sequences
- However, the video-biased FHMM cannot use the video to segment the audio into acoustically-similar sequences
- Whilst the performance increase is small, it appears that the acoustic-biased FHMM is benefiting from a temporal relationship between the acoustic states and the visual observations
23. Conclusion
- The video HMM improves performance over the video GMM, but not when used in output fusion
- Output fusion performance is based mainly on the acoustic classifier chosen
- Audio-biased continuous FHMMs can take advantage of the temporal relationship between audio states and video observations
- However, video-biased continuous FHMM performance appears to show no corresponding relationship between video states and audio observations
24. Continuing/future research
- The secondary GMMs are recognising a large amount of static video information
  - Skin or lip colour, facial hair, etc.
- This information has no temporal relationship with the audio states, and may be swamping the more dynamic information available in facial movements
- A more efficient structure may be realised by using more dynamic video features (mean-removed DCT, contour-based or optical flow, with mean-removed DCT sketched below) and output fusion with a face GMM
- This would take advantage of the temporal audio-visual relationship, in addition to static face recognition
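As a small illustration of one of the dynamic feature options mentioned above: mean-removed DCT simply subtracts the per-utterance mean of each DCT coefficient, suppressing static appearance (skin/lip colour, facial hair) while keeping frame-to-frame movement. A minimal sketch:

import numpy as np

def mean_removed(dct_feats):
    """Subtract the per-utterance mean of each DCT coefficient so that
    static appearance cancels and only the dynamics remain.
    dct_feats: (frames, coeffs) array of static DCT features."""
    return dct_feats - dct_feats.mean(axis=0, keepdims=True)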
25. Speech recognition with fused HMMs
- FHMMs improve single-modal speech processing in two ways:
  - The second modality improves scores within states
  - The second modality improves the state sequence
- Text-dependent speaker recognition only benefits from the first improvement
  - The state sequence is fixed
- However, speech recognition can take advantage of both improvements
26. Speech recognition with fused HMMs
- Using the first XM2VTS configuration (XM2VTSDB)
- Speaker-independent, continuous-speech digit recognition
- PLP-based audio features; hierarchical LDA-based (of mean-removed DCT) video features
- We believe this gives comparable performance to coupled and asynchronous HMMs
  - But simpler to train and decode
27. FHMMs, synchronous HMMs and feature fusion
- The FHMM structure can be implemented as a multi-modal synchronous HMM, and therefore, with minor simplification, as a feature-fusion HMM
- The difference is in how the structure is trained:
  - In synchronous HMMs and feature fusion, both modalities are used to train the HMMs
  - FHMMs can be viewed as adapting a multi-modal synchronous HMM from the dominant single-modal HMM
- If the same number of Gaussians is used for both modalities, an FHMM can be implemented within a single-modal HMM decoder (see the concatenation sketch below)
  - Decoding is then exactly the same as with feature fusion
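A sketch of the mechanical part of this implementation point: resample the subordinate observations to the dominant frame rate and concatenate them onto the dominant features, so that a standard single-stream HMM decoder (with its state output distributions arranged across the two modality blocks as the slide describes) can be used unchanged. The function name is illustrative, not from the slides.

import numpy as np

def to_single_stream(audio, video):
    """Resample video to the audio frame rate and concatenate, so a
    standard single-modal HMM decoder can score the combined stream."""
    idx = np.linspace(0, len(video) - 1, num=len(audio)).round().astype(int)
    return np.hstack([audio, video[idx]])                 # (frames, d_a + d_v)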
28. References
- Brand, M. (1999), 'A Bayesian computer vision system for modeling human interactions', in ICVS '99, Gran Canaria, Spain.
- Chibelushi, C., Deravi, F. & Mason, J. (2002), 'A review of speech-based bimodal recognition', IEEE Transactions on Multimedia 4(1), 23-37.
- Dean, D., Wark, T. & Sridharan, S. (2006), 'An examination of audio-visual fused HMMs for speaker recognition', in MMUA 2006, Toulouse, France.
- Lee, C.-H. & Gauvain, J.-L. (1993), 'Speaker adaptation based on MAP estimation of HMM parameters', in Proceedings of ICASSP-93, Vol. 2, pp. 558-561.
- Luettin, J. & Maitre, G. (1998), 'Evaluation protocol for the extended M2VTS database (XM2VTSDB)', Technical report, IDIAP.
- Messer, K., Matas, J., Kittler, J., Luettin, J. & Maitre, G. (1999), 'XM2VTSDB: The extended M2VTS database', in Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA '99), Washington D.C., pp. 72-77.
- Pan, H., Levinson, S., Huang, T. & Liang, Z.-P. (2004), 'A fused hidden Markov model with application to bimodal speech processing', IEEE Transactions on Signal Processing 52(3), 573-581.
29. Questions?