Title: Audio-visual speaker verification using continuous fused HMMs
1. Audio-visual speaker verification using continuous fused HMMs
- David Dean, Sridha Sridharan, and Tim Wark
- Presented by David Dean
- Slides will be available at http://www.davidbdean.com/category/publications
2. Why audio-visual speaker recognition?
- "Bimodal recognition exploits the synergy between acoustic speech and visual speech, particularly under adverse conditions. It is motivated by the need, in many potential applications of speech-based recognition, for robustness to speech variability, high recognition accuracy, and protection against impersonation." (Chibelushi, Deravi and Mason 2002)
3. Early and late fusion
- Most early approaches to audio-visual speaker recognition (AVSPR) used either early (feature) or late (output) fusion
- Problems:
  - Output fusion cannot model temporal dependencies
  - Feature fusion suffers from problems with noise
[Diagram: early (feature) fusion and late (output) fusion architectures]
4. Middle fusion: coupled HMMs
- Middle fusion models can accept two streams of input, and the combination is done within the classifier
- Most middle fusion is performed using coupled HMMs (shown here)
- Can be difficult to train
- Dependencies between hidden states are not strong (Brand 1999)
[Diagram: coupled HMM structure]
5. Middle fusion: fused HMMs
- Pan et al. (2004) used probabilistic models to investigate the optimal multi-stream HMM design
  - Maximise the mutual information between audio and video
- They found that linking the observations of one modality to the hidden states of the other was better than linking just the hidden states (i.e. the coupled HMM); a rough sketch of the resulting likelihood follows below
[Diagram: acoustic-biased FHMM structure]
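As a rough illustration of this structure (the notation is mine, not from the slides, and is simplified relative to Pan et al. 2004): an acoustic-biased FHMM ties the video observations to the acoustic hidden states, so the joint likelihood factorises approximately as below. In practice the coupling term is often evaluated along the single best (Viterbi) acoustic state path rather than summed over all paths.

% Hedged sketch, not the exact formulation of Pan et al. (2004).
% O^a, O^v: acoustic and video observation sequences;
% q_{1:T}: acoustic hidden-state path.
\[
  p(O^a, O^v) \;\approx\;
    \sum_{q_{1:T}} p(q_{1:T})
    \prod_{t=1}^{T} p(o^a_t \mid q_t)\, p(o^v_t \mid q_t)
\]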
6. Choosing the dominant modality
- The fused HMM design results in two variants: acoustic-biased or video-biased
- The choice of the dominant modality (the one biased towards) should be based on which individual HMM can more reliably estimate the hidden state sequence for a particular application
  - Generally audio
- Alternatively, both versions can be used concurrently and decision fused (as in Pan et al. 2004)
7. Continuous fused HMMs
- The original fused HMM implementation treated the secondary modality as discrete (Pan et al. 2004)
- This caused problems with within-speaker variation
  - Works fine on a single session (CUAVE: Dean et al. 2006)
  - Fails on multi-session data (XM2VTS)
- Continuous FHMMs model both modalities with GMMs (contrasted with the discrete form below)
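To make the discrete/continuous distinction concrete (again a sketch in my own notation): the original FHMM scores subordinate frames through a vector-quantised codebook lookup, whereas the continuous FHMM replaces that table with a per-state GMM.

% Discrete FHMM: the subordinate frame o^v_t is vector-quantised to a
% codebook index k(o^v_t), and each dominant state j stores a discrete table.
\[
  p(o^v_t \mid q_t = j) \;=\; P_j\bigl(k(o^v_t)\bigr)
\]
% Continuous FHMM: each dominant state j instead carries a GMM over the
% raw subordinate feature vector.
\[
  p(o^v_t \mid q_t = j) \;=\;
    \sum_{m=1}^{M} w_{jm}\,
    \mathcal{N}\!\bigl(o^v_t;\, \mu_{jm}, \Sigma_{jm}\bigr)
\]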
8. Training FHMMs
- Both biased FHMMs (if needed) are trained independently:
  - Train the dominant HMM (audio for acoustic-biased, video for video-biased) independently on the training observation sequences for that modality
  - Find the best hidden state sequence of the trained HMM for each training observation sequence using the Viterbi algorithm
  - Model the relationship between the dominant hidden state sequence and the training observation sequences for the subordinate modality
    - i.e. model the probability of observing a given subordinate observation whilst within a particular dominant hidden state (see the training sketch below)
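A minimal sketch of this training recipe, using hmmlearn and scikit-learn as stand-ins for the actual toolchain (the slides do not name the tools); the model sizes, the nearest-neighbour video resampling, and the assumption that audio_seqs/video_seqs are lists of (frames x dims) NumPy arrays are all illustrative.

import numpy as np
from hmmlearn import hmm
from sklearn.mixture import GaussianMixture

def train_acoustic_biased_fhmm(audio_seqs, video_seqs, n_states=5, n_mix=4):
    """Sketch: train the dominant audio HMM, Viterbi-align it, then fit
    one subordinate video GMM per dominant state."""
    # 1. Train the dominant (audio) HMM on the audio sequences only.
    audio_hmm = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    audio_hmm.fit(np.vstack(audio_seqs), lengths=[len(a) for a in audio_seqs])

    # 2. Viterbi-align each sequence and pool the (resampled) video frames
    #    that fall within each dominant state.
    frames_per_state = {j: [] for j in range(n_states)}
    for audio, video in zip(audio_seqs, video_seqs):
        states = audio_hmm.predict(audio)                  # best state path
        idx = np.linspace(0, len(video) - 1, num=len(audio)).round().astype(int)
        for j, frame in zip(states, video[idx]):
            frames_per_state[j].append(frame)

    # 3. Model p(video frame | dominant state) with a GMM per state.
    sub_gmms = {}
    for j, frames in frames_per_state.items():
        if frames:                                         # skip unused states
            sub_gmms[j] = GaussianMixture(
                n_components=n_mix, covariance_type="diag").fit(np.vstack(frames))
    return audio_hmm, sub_gmms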
9. Decoding FHMMs
- The dominant FHMM can be viewed as a special type of HMM that outputs observations in two streams
- This does not affect the decoding lattice, and the Viterbi algorithm can be used to decode (see the recursion sketched below)
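A sketch of why the lattice is unchanged (my notation): the only modification to the standard Viterbi recursion is that the single-stream emission term is replaced by a product over the two streams.

% a_{ij}: transition probability; the bracketed product replaces the
% usual single-stream emission b_j(o_t).
\[
  \delta_t(j) \;=\; \max_{i}\,\bigl[\delta_{t-1}(i)\, a_{ij}\bigr]\;
    p(o^a_t \mid q_t = j)\, p(o^v_t \mid q_t = j)
\]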
10. Experimental setup
[System diagram: acoustic feature extraction and visual feature extraction (with manual lip tracking) feed three verification systems, each producing a speaker score: HMM/GMM output fusion (acoustic HMM/GMM + visual HMM/GMM), an acoustic-biased FHMM, and a video-biased FHMM]
11. Training and testing datasets
- Training and testing configuration was based on the XM2VTSDB protocol (Messer et al. 1999)
- 12 configurations were generated based on the single XM2VTSDB configuration
- For each configuration:
  - 400 client tests
  - 8000 imposter tests
12. Feature extraction
- Audio
  - 12 MFCCs + 1 energy, with deltas and accelerations (43 features)
- Video
  - Lip ROI manually tracked every 50 frames
  - 120x80 pixels
  - Grayscale
  - Down-sampled to 24x16
  - 20 DCT coefficients, with deltas and accelerations (60 features); a feature-extraction sketch follows below
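A hedged sketch of comparable features using librosa and SciPy (the slides do not state the actual tools or exact parameters; the frame settings, the use of c0 as an energy stand-in, and the raster ordering of DCT coefficients are assumptions).

import numpy as np
import librosa
from scipy.fft import dctn

def audio_features(wav_path):
    """12 MFCCs + energy term, with deltas and accelerations (sketch)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # c0 stands in for energy
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),
                       librosa.feature.delta(mfcc, order=2)])
    return feats.T                                        # (frames, dims)

def video_features(lip_rois):
    """2-D DCT of each 24x16 grayscale lip ROI, keeping 20 low-order
    coefficients, with deltas and accelerations (sketch)."""
    coeffs = []
    for roi in lip_rois:                                  # roi: 24x16 array
        d = dctn(roi, norm="ortho")
        coeffs.append(d.flatten()[:20])                   # raster order (assumed)
    static = np.array(coeffs).T                           # (20, frames)
    feats = np.vstack([static,
                       librosa.feature.delta(static),
                       librosa.feature.delta(static, order=2)])
    return feats.T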
13. Output fusion training
- Two classifier types for each modality:
  - Gaussian mixture models (GMMs), trained over entire sequences
  - Hidden Markov models (HMMs), trained for each word
- Speaker models adapted from background models using maximum a posteriori (MAP) adaptation (Lee & Gauvain 1993); a simplified form of the mean update is sketched below
- Topology of HMMs and GMMs determined by testing on the evaluation partition of the first configuration
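For reference, a commonly used simplified form of MAP mean adaptation for GMM speaker models (the relevance-factor form popularised for GMM-UBM systems, given here as a sketch rather than the full Gauvain & Lee formulation): each background mixture mean is pulled towards the speaker's data in proportion to how much adaptation data that mixture accounts for.

% n_k: soft count of speaker frames assigned to mixture k;
% \bar{x}_k: their posterior-weighted mean; \mu_k: background mean;
% \tau: relevance factor controlling how quickly adaptation occurs.
\[
  \hat{\mu}_k \;=\; \frac{n_k\,\bar{x}_k + \tau\,\mu_k}{n_k + \tau}
\]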
14. Output fusion testing
- Fused HMM performance is compared to output fusion of normal HMMs and GMMs in each modality:
  - Audio HMM + Video HMM
  - Audio HMM + Video GMM
  - Audio GMM + Video HMM
  - Audio GMM + Video GMM
- Evaluation session used to estimate each modality's output score distribution, to normalise scores within each modality
- Background model score subtracted from speaker scores to normalise for environment and utterance length (see the scoring sketch below)
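A minimal sketch of this score handling, assuming a simple mean/variance (z-score style) normalisation per modality and a weighted-sum fusion rule; the slides do not specify the exact normalisation or fusion, so treat these details as assumptions.

import numpy as np

def normalise(scores, eval_scores):
    """Normalise test scores using statistics estimated on the
    evaluation session for that modality (assumed z-score form)."""
    mu, sigma = np.mean(eval_scores), np.std(eval_scores)
    return (np.asarray(scores) - mu) / sigma

def fuse(audio_llr, video_llr, w_audio=0.7):
    """Weighted-sum output fusion of per-modality scores, where each
    score is already speaker log-likelihood minus background
    log-likelihood (the background subtraction mentioned above)."""
    return w_audio * audio_llr + (1.0 - w_audio) * video_llr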
15. Output fusion results
- HMM-based models take advantage of temporal information to improve performance over GMM-based models in both modalities
- The audio GMM approaches the audio HMM, but only with a large number of Gaussians
- The video GMM does not improve with more Gaussians
16. Output fusion results
- The advantage of the video HMM does not carry over to output fusion
  - Little difference between video HMM and GMM in output fusion
- Output fusion performance is affected mostly by the choice of audio classifier
- Output fusion doesn't take advantage of video temporal information
[Plot: output fusion results, grouped by audio classifier (audio GMM vs. audio HMM)]
17. Fused HMM training
- Both acoustic- and visual-biased FHMMs are examined
- Individual audio and video HMMs used as the basis for the FHMMs
- Secondary models adapted from the individual speaker's GMMs for each state of the underlying HMM
- Background FHMM was formed similarly
18. Fused HMM testing
- Subordinate observations are up/down-sampled to the same rate as the dominant HMM
- Evaluation session used to estimate each modality's frame-score distribution, to normalise scores within each modality
  - Similar to output fusion, but on a frame-by-frame basis rather than using the final output score
- As well as using subordinate models adapted to the states of the dominant HMM, testing is performed with:
  - Word subordinate models (same secondary model for the entire word)
  - Global subordinate models (same secondary model for all words)
- Finally, the background FHMM score is subtracted to normalise for environment and utterance length (the resampling and frame scoring are sketched below)
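A sketch of the resampling and frame-level scoring, continuing the hypothetical train_acoustic_biased_fhmm() example from slide 8 (the nearest-neighbour resampling and scoring details are assumptions, not the slides' exact procedure).

import numpy as np

def fhmm_frame_scores(audio, video, audio_hmm, sub_gmms):
    """Per-frame subordinate (video) log-likelihoods evaluated along the
    dominant (audio) Viterbi state path."""
    states = audio_hmm.predict(audio)                     # dominant state path
    # Up/down-sample the video stream to the audio frame rate.
    idx = np.linspace(0, len(video) - 1, num=len(audio)).round().astype(int)
    video_rs = video[idx]
    return np.array([sub_gmms[j].score_samples(v[None, :])[0]
                     for j, v in zip(states, video_rs)])

These frame scores would then be normalised per modality (as on this slide) before being combined with the dominant HMM score.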
19. Comparison with output fusion
- If the same subordinate model is used for every dominant state, the FHMM can be viewed as functionally equivalent to HMM-GMM output fusion
  - Although in practice this is not exactly the case, due to the resampling of the subordinate observations and where the modality normalisation occurs
- Word and state subordinate models can also be viewed as functionally equivalent to HMM-GMM output fusion that chooses the subordinate GMM based on the dominant state for each frame
  - Provided that the FHMM design doesn't affect the best path through the lattice
20. Acoustic-biased FHMM results
- There is some benefit in using state-based FHMM models for audio
- State- or word-based FHMM models are better than the global model over most of the plot
[Plot: acoustic-biased FHMM results against the best-performing output fusion]
21. Video-biased FHMM results
- No benefit in using word- or state-based FHMM models for video
- Therefore, no benefit in using video-biased FHMM models at all
[Plot: video-biased FHMM results against the best-performing output fusion]
22. Acoustic- vs. video-biased FHMM
- The acoustic-biased FHMM shows that the audio can be used to segment the video into visually-similar sequences
- However, the video-biased FHMM cannot use the video to segment the audio into acoustically-similar sequences
- Whilst the performance increase is small, it appears that the acoustic-biased FHMM is benefiting from a temporal relationship between the acoustic states and the visual observations
23. Conclusion
- The video HMM improves performance over the video GMM, but not when used in output fusion
- Output fusion performance is based mainly on the acoustic classifier chosen
- Audio-biased continuous FHMMs can take advantage of the temporal relationship between audio states and video observations
- However, video-biased continuous FHMM performance appears to show no corresponding relationship between video states and audio observations
24. Continuing/future research
- The secondary GMMs are recognising a large amount of static video information
  - Skin or lip colour, facial hair, etc.
- This information has no temporal relationship with the audio states, and may be swamping the more dynamic information available in facial movements
- A more efficient structure may be realised by using more dynamic video features (mean-removed DCT, contour-based or optical flow, with mean-removed DCT sketched below) and output fusion with a face GMM
- This would take advantage of the temporal audio-visual relationship, in addition to static face recognition
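As a small illustration of one of the dynamic feature options mentioned above: mean-removed DCT simply subtracts the per-utterance mean of each DCT coefficient, suppressing static appearance (skin/lip colour, facial hair) while keeping frame-to-frame movement. A minimal sketch:

import numpy as np

def mean_removed(dct_feats):
    """Subtract the per-utterance mean of each DCT coefficient so that
    static appearance cancels and only the dynamics remain.
    dct_feats: (frames, coeffs) array of static DCT features."""
    return dct_feats - dct_feats.mean(axis=0, keepdims=True)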
25. Speech recognition with fused HMMs
- FHMMs improve single-modal speech processing in two ways:
  - The second modality improves scores within states
  - The second modality improves the state sequence
- Text-dependent speaker recognition only benefits from the first improvement
  - The state sequence is fixed
- However, speech recognition can take advantage of both improvements
26. Speech recognition with fused HMMs
- Using the first XM2VTS configuration (XM2VTSDB)
- Speaker-independent, continuous-speech digit recognition
- PLP-based audio features; hierarchical LDA-based (of mean-removed DCT) video features
- We believe this gives comparable performance to coupled and asynchronous HMMs
  - But simpler to train and decode
27. FHMMs, synchronous HMMs and feature fusion
- The FHMM structure can be implemented as a multi-modal synchronous HMM, and therefore, with minor simplification, as a feature-fusion HMM
- The difference is in how the structure is trained:
  - In synchronous HMMs and feature fusion, both modalities are used to train the HMMs
  - FHMMs can be viewed as adapting a multi-modal synchronous HMM from the dominant single-modal HMM
- If the same number of Gaussians is used for both modalities, an FHMM can be implemented within a single-modal HMM decoder (see the concatenation sketch below)
  - Decoding is then exactly the same as with feature fusion
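A sketch of the mechanical part of this implementation point: resample the subordinate observations to the dominant frame rate and concatenate them onto the dominant features, so that a standard single-stream HMM decoder (with its state output distributions arranged across the two modality blocks as the slide describes) can be used unchanged. The function name is illustrative, not from the slides.

import numpy as np

def to_single_stream(audio, video):
    """Resample video to the audio frame rate and concatenate, so a
    standard single-modal HMM decoder can score the combined stream."""
    idx = np.linspace(0, len(video) - 1, num=len(audio)).round().astype(int)
    return np.hstack([audio, video[idx]])                 # (frames, d_a + d_v)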
28. References
- Brand, M. (1999), 'A Bayesian computer vision system for modeling human interactions', in ICVS '99, Gran Canaria, Spain.
- Chibelushi, C., Deravi, F. & Mason, J. (2002), 'A review of speech-based bimodal recognition', IEEE Transactions on Multimedia 4(1), 23-37.
- Dean, D., Wark, T. & Sridharan, S. (2006), 'An examination of audio-visual fused HMMs for speaker recognition', in MMUA 2006, Toulouse, France.
- Lee, C.-H. & Gauvain, J.-L. (1993), 'Speaker adaptation based on MAP estimation of HMM parameters', in Proceedings of ICASSP-93, Vol. 2, pp. 558-561.
- Luettin, J. & Maitre, G. (1998), 'Evaluation protocol for the extended M2VTS database (XM2VTSDB)', Technical report, IDIAP.
- Messer, K., Matas, J., Kittler, J., Luettin, J. & Maitre, G. (1999), 'XM2VTSDB: The extended M2VTS database', in Second International Conference on Audio and Video-based Biometric Person Authentication (AVBPA '99), Washington D.C., pp. 72-77.
- Pan, H., Levinson, S., Huang, T. & Liang, Z.-P. (2004), 'A fused hidden Markov model with application to bimodal speech processing', IEEE Transactions on Signal Processing 52(3), 573-581.
29. Questions?