Title: Computer Vision, Speech Communication and Signal Processing Research Group
1. HIWIRE
Computer Vision, Speech Communication and Signal Processing Research Group
2. CVSP Members Involved in HIWIRE
- Group Leader: Prof. Petros Maragos
- Ph.D. Students / Graduate Research Assistants:
  - D. Dimitriadis (speech recognition, modulations)
  - V. Pitsikalis (speech recognition, fractals/chaos, NLP)
  - A. Katsamanis (speech modulations, statistical processing, recognition)
  - G. Papandreou (vision PDEs, active contours, level sets, AV-ASR)
  - G. Evangelopoulos (vision/speech texture, modulations, fractals)
  - S. Leykimiatis (speech statistical processing, microphone arrays)
3. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
4. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
    - Modulation Features: Results, 1st Year
    - Fractal Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
5. WP1: Noise Robustness
- Platform: HTK
- Baseline Evaluation:
  - Aurora 2, Aurora 3, TIMIT+Noise
- Modulation Features:
  - AM-FM Modulations
  - Teager Energy Cepstrum
- Fractal Features:
  - Dynamical Denoising
  - Correlation Dimension
  - Multiscale Fractal Dimension
- Hybrid-Merged Features
(Chart callouts: up to 62% on Aurora 3; up to 36% and up to 61% on Aurora 2.)
6. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
    - Speech Modulation Features: Results, 1st Year
    - Fractal Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
7. Speech Modulation Features
- Filterbank Design
- Short-Term AM-FM Modulation Features (see the demodulation sketch below):
  - Short-Term Mean Inst. Amplitude (IA-Mean)
  - Short-Term Mean Inst. Frequency (IF-Mean)
  - Frequency Modulation Percentages (FMP)
- Short-Term Energy Modulation Features:
  - Average Teager Energy, Cepstrum Coef. (TECC)
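The deck does not reproduce the demodulation equations, so here is a minimal sketch of how such short-term AM-FM features can be computed with the Teager-Kaiser operator and the continuous-form Energy Separation Algorithm (ESA). The Gabor filterbank stage is omitted, the frame sizes are assumptions, and the FMP shown (IF spread over IF mean) is only an illustrative stand-in for the authors' exact definition.

```python
import numpy as np

def teager(x):
    """Discrete Teager-Kaiser energy: Psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def esa_demodulate(x, fs):
    """Continuous-form ESA on one bandpass signal:
    |a| ~ Psi[x]/sqrt(Psi[x']),  f ~ (1/2pi)*sqrt(Psi[x']/Psi[x])."""
    dx = np.gradient(x) * fs                    # approximate derivative (1/s)
    px = np.maximum(teager(x), 1e-12)           # guard against zeros
    pdx = np.maximum(teager(dx), 1e-12)
    inst_a = px / np.sqrt(pdx)
    inst_f = np.sqrt(pdx / px) / (2.0 * np.pi)  # instantaneous frequency, Hz
    return inst_a, inst_f

def short_term_features(inst_a, inst_f, frame=400, hop=160):
    """Per-frame IA-Mean, IF-Mean, and an illustrative FMP (IF std / IF mean)."""
    rows = []
    for s in range(0, len(inst_a) - frame, hop):
        a, f = inst_a[s:s + frame], inst_f[s:s + frame]
        rows.append([a.mean(), f.mean(), f.std() / max(f.mean(), 1e-12)])
    return np.array(rows)

# Toy usage on a synthetic AM tone (in practice: one Gabor band of speech).
fs = 16000
t = np.arange(fs) / fs
band = (1 + 0.3 * np.cos(2 * np.pi * 20 * t)) * np.cos(2 * np.pi * 2000 * t)
ia, iff = esa_demodulate(band, fs)
print(short_term_features(ia, iff)[:3])         # IF-Mean column near 2000 Hz
```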
8. Modulation Acoustic Features
Processing pipeline (block diagram): speech → regularized multiband filtering → nonlinear processing / demodulation → statistical processing → robust feature transformation / selection.
Outputs:
- AM-FM modulation features: Mean Inst. Ampl. (IA-Mean), Mean Inst. Freq. (IF-Mean), Freq. Mod. Percent. (FMP)
- Energy features: Teager Energy Cepstrum Coeff. (TECC)
- V.A.D.
9. TIMIT-based Speech Databases
- TIMIT Database:
  - Training Set: 3696 sentences, 35 phonemes/utterance
  - Testing Set: 1344 utterances, 46680 phonemes
  - Sampling Frequency: 16 kHz
- Feature Vectors:
  - MFCC + C0 + AM-FM + 1st/2nd time derivatives (see the delta sketch below)
  - Stream Weights: (1) for MFCC and (2) for AM-FM
- 3-state left-right HMMs, 16 mixtures
- All-pair, unweighted grammar
- Performance Criterion: Phone Accuracy Rates (%)
- Back-end System: HTK v3.2.0
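As a concrete illustration of the "1st/2nd time derivatives" above, here is a minimal sketch of the standard HTK-style regression deltas appended to the static MFCC+C0 and AM-FM streams. The window size K=2 and the random placeholder frames are assumptions, not the deck's actual configuration.

```python
import numpy as np

def deltas(feats, K=2):
    """HTK-style regression deltas:
    d_t = sum_{k=1..K} k*(c_{t+k} - c_{t-k}) / (2*sum k^2), edges padded."""
    T, _ = feats.shape
    pad = np.pad(feats, ((K, K), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, K + 1))
    d = np.zeros_like(feats)
    for k in range(1, K + 1):
        d += k * (pad[K + k:K + k + T] - pad[K - k:K - k + T])
    return d / denom

# Assemble a hybrid observation: static MFCC+C0 plus AM-FM stream,
# each extended with 1st (delta) and 2nd (delta-delta) derivatives.
mfcc = np.random.randn(100, 13)   # placeholder static MFCC+C0 frames
amfm = np.random.randn(100, 6)    # placeholder IA/IF/FMP frames
static = np.hstack([mfcc, amfm])
obs = np.hstack([static, deltas(static), deltas(deltas(static))])
print(obs.shape)                  # (100, 57): 19 static dims x 3
```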
10. Results: TIMIT+Noise
Up to 106%
11. Aurora 3 (Spanish)
- Connected digits, Sampling Frequency: 8 kHz
- Training Set:
  - WM (Well-Matched): 3392 utterances (quiet 532, low noise 1668, high noise 1192)
  - MM (Medium-Mismatch): 1607 utterances (quiet 396, low noise 1211)
  - HM (High-Mismatch): 1696 utterances (quiet 266, low noise 834, high noise 596)
- Testing Set:
  - WM: 1522 utterances (quiet 260, low noise 754, high noise 508), 8056 digits
  - MM: 850 utterances (quiet 0, low noise 0, high noise 850), 4543 digits
  - HM: 631 utterances (quiet 0, low noise 377, high noise 254), 3325 digits
- 2 back-end ASR systems (HTK and BLasr)
- Feature Vectors: MFCC + AM-FM (or Auditory AM-FM), TECC
- All-pair, unweighted grammar (or word-pair grammar)
- Performance Criterion: Word (digit) Accuracy Rates (%)
12. Results: Aurora 3 (HTK)
Up to 62%
13. Databases: Aurora 2
- Task: speaker-independent recognition of digit sequences
- TI-Digits at 8 kHz
- Training (8440 utterances per scenario, 55M/55F):
  - Clean (8 kHz, G.712)
  - Multi-Condition (8 kHz, G.712):
    - 4 noises (artificial): subway, babble, car, exhibition
    - 5 SNRs: 5, 10, 15, 20 dB, clean
- Testing (artificially added noise):
  - 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
  - Set A: noises as in multi-condition training, G.712 (28028 utterances)
  - Set B: restaurant, street, airport, train station, G.712 (28028 utterances)
  - Set C: subway, street (MIRS) (14014 utterances)
14. Results: Aurora 2
Up to 12%
15. Work To Be Done on Modulation Features
16. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
    - Speech Modulation Features: Results, 1st Year
    - Fractal Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
17. Fractal Features
Processing pipeline (block diagram): speech signal → N-d (noisy) embedding → local SVD geometrical filtering → filtered embedding / N-d cleaned signal → filtered dynamics.
Outputs (see the sketch below):
- FDCD: Filtered Dynamics - Correlation Dimension (8)
- MFD: Multiscale Fractal Dimension (6)
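A minimal sketch of the correlation-dimension half of this pipeline, assuming a plain Grassberger-Procaccia estimator on a delay embedding. The local-SVD geometrical filtering from the diagram is omitted, and the embedding parameters and radii are illustrative choices, not the deck's settings.

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Delay embedding: X_n = [x_n, x_{n+tau}, ..., x_{n+(dim-1)tau}]."""
    N = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau: i * tau + N] for i in range(dim)], axis=1)

def correlation_sum(X, r):
    """Grassberger-Procaccia C(r): fraction of point pairs closer than r."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(X), k=1)
    return np.mean(d[iu] < r)

def correlation_dimension(x, dim=10, tau=7, radii=None):
    """Estimate D2 as the slope of log C(r) vs log r over a range of radii."""
    X = delay_embed(np.asarray(x, float), dim, tau)
    X = X[:: max(1, len(X) // 400)]       # subsample to keep O(N^2) cheap
    if radii is None:
        radii = X.std() * np.logspace(-1.2, 0, 8)
    C = np.array([correlation_sum(X, r) for r in radii])
    good = C > 0
    slope, _ = np.polyfit(np.log(radii[good]), np.log(C[good]), 1)
    return slope

x = np.sin(0.3 * np.arange(4000)) + 0.05 * np.random.randn(4000)
print(correlation_dimension(x))   # roughly 1 for a (noisy) limit cycle
```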
18. Databases: Aurora 2
- Task: speaker-independent recognition of digit sequences
- TI-Digits at 8 kHz
- Training (8440 utterances per scenario, 55M/55F):
  - Clean (8 kHz, G.712)
  - Multi-Condition (8 kHz, G.712):
    - 4 noises (artificial): subway, babble, car, exhibition
    - 5 SNRs: 5, 10, 15, 20 dB, clean
- Testing (artificially added noise):
  - 7 SNRs: -5, 0, 5, 10, 15, 20 dB, clean
  - Set A: noises as in multi-condition training, G.712 (28028 utterances)
  - Set B: restaurant, street, airport, train station, G.712 (28028 utterances)
  - Set C: subway, street (MIRS) (14014 utterances)
19. Results: Aurora 2
Up to 40%
20. Results: Aurora 2
Up to 27%
21. Results: Aurora 2
Up to 61%
22. Future Directions on Fractal Features
- Refine Fractal Feature Extraction.
- Application to Aurora 3.
- Fusion with other features.
23. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
24. Visual Front-End
- Aim:
  - Extract a low-dimensional visual speech feature vector from video
- Visual front-end modules (see the detection sketch below):
  - Speaker's face detection
  - ROI tracking
  - Facial model fitting
  - Visual feature extraction
- Challenges:
  - Very high-dimensional signal; which features are proper?
  - Robustness
  - Computational efficiency
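The deck does not say which detector implements the face detection and ROI modules, so the following is only a hedged sketch using OpenCV's stock Haar cascade to grab a face and a crude mouth region of interest. The input file name and the ROI geometry are illustrative assumptions.

```python
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def mouth_roi(frame):
    """Detect the largest face and return the lower-middle part of the
    face box as a crude mouth region of interest."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face
    return frame[y + 2 * h // 3: y + h, x + w // 4: x + 3 * w // 4]

cap = cv2.VideoCapture("speaker.avi")   # hypothetical input video
ok, frame = cap.read()
if ok:
    roi = mouth_roi(frame)              # feed downstream model fitting
```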
25. Face Modeling
- A well-studied problem in Computer Vision:
  - Active Appearance Models, Morphable Models, Active Blobs
- Both shape and appearance can enhance lipreading
- The shape and appearance of human faces live in low-dimensional manifolds
26. Image Fitting Example
(Fitting iterations shown at steps 2, 6, 10, 14, and 18.)
27. Example: Face Interpretation Using AAM
(Panels: original video; shape track superimposed on the original video; reconstructed face. The reconstructed face is what the visual-only speech recognizer sees!)
- Generative models like AAMs allow us to evaluate the output of the visual front-end
28. Evaluation on the CUAVE Database
29. Audio-Visual ASR Database
- Subset of the CUAVE database used:
  - 36 speakers (30 training, 6 testing)
  - 5 sequences of 10 connected digits per speaker
  - Training set: 1500 digits (30x5x10)
  - Test set: 300 digits (6x5x10)
- CUAVE also contains more complex data sets: speaker moving around, speaker in profile, continuous digits, two speakers (to be used in future evaluations)
- CUAVE was kindly provided by Clemson University
30. Recognition Results (Word Accuracy)
- Data:
  - Training: 500 digits (29 speakers)
  - Testing: 100 digits (4 speakers)

| Word accuracy (%) | Audio | Visual | Audio-Visual |
| Classification    | 99    | 46     | 85           |
| Recognition       | 98    | 26     | 78           |
31. Future Work
- Visual front-end:
  - Better-trained AAM
  - Temporal tracking
- Feature fusion:
  - Experimentation with alternative DBN architectures
  - Automatic stream weight determination
- Integration with non-linear acoustic features
- Experiments on other audio-visual databases
- Systematic evaluation of visual features
32. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
    - Modulation Features: Results, 1st Year
    - Fractal Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
33. User Robustness: Speaker Adaptation
- VTLN Baseline:
  - Platform: HTK
  - Database: AURORA 4
  - Fs: 8 kHz
  - Scenarios: Training, Testing
  - Comparison with MLLR
- Collection of Non-Native Speech Data: Completed
  - 10 speakers
  - 100 utterances/speaker
34. Vocal Tract Length Normalization
- Implementation: HTK
- Warping Factor Estimation:
  - Maximum Likelihood (ML) criterion
(Figures from Hain99, Lee96.)
35. VTLN
- Training:
  - AURORA 4 baseline setup
  - Clean (SIC), Multi-Condition (SIM), Noisy (SIN)
- Testing:
  - Estimate the warping factor using adaptation utterances (supervised VTLN)
  - Per-speaker warping factor (1, 2, 10, or 20 utterances)
- 2-pass decoding:
  - 1st pass:
    - Get a hypothesis transcription
    - Align against it and use ML to estimate a per-utterance warping factor (see the sketch below)
  - 2nd pass:
    - Decode the normalized utterance
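A minimal sketch of the ML warping-factor search described above. The grid, the piecewise-linear warping with an 85% cutoff, and the warp_mfcc / log_likelihood helpers are hypothetical stand-ins (e.g. for HTK's feature extraction and forced-alignment scores), not the deck's actual implementation.

```python
import numpy as np

ALPHAS = np.arange(0.88, 1.13, 0.02)   # typical VTLN search grid

def warp_frequencies(f, alpha, f_cut=0.85 * 4000):
    """Piecewise-linear warping f -> alpha*f below a cutoff, continued
    linearly above it so the full band [0, 4000] Hz is preserved."""
    f = np.asarray(f, float)
    hi = (4000 - alpha * f_cut) / (4000 - f_cut)   # slope above the cutoff
    return np.where(f < f_cut, alpha * f, alpha * f_cut + hi * (f - f_cut))

def estimate_alpha(wav, transcription, warp_mfcc, log_likelihood):
    """Grid-search the warping factor maximizing the alignment likelihood.
    warp_mfcc would build its mel filterbank on warp_frequencies(...)."""
    scores = []
    for a in ALPHAS:
        feats = warp_mfcc(wav, alpha=a)            # hypothetical extractor
        scores.append(log_likelihood(feats, transcription))
    return ALPHAS[int(np.argmax(scores))]
```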
36. Databases: Aurora 4
- Task: 5000-word continuous speech recognition
- WSJ0 (16 / 8 kHz) with artificially added noise:
  - 2 microphones: Sennheiser, other
  - Filtering: G.712, P.341
  - Noises: car, babble, restaurant, street, airport, train station
- Training (7138 utterances per scenario):
  - Clean: Sennheiser mic.
  - Multi-Condition: Sennheiser + other mic., 75% with artificially added noise at SNR 10-20 dB
  - Noisy: Sennheiser mic., artificially added noise at SNR 10-20 dB
- Testing (330 utterances, or 166-utterance subsets, from 8 speakers):
  - SNR 5-15 dB
  - Test sets 1-7: Sennheiser microphone
  - Test sets 8-14: other microphone
37. VTLN Results: Clean Training
38. VTLN Results: Multi-Condition Training
39. VTLN Results: Noisy Training
40. Future Directions for Speaker Normalization
- Estimate warping transforms at the signal level:
  - Exploit instantaneous amplitude or frequency signals to estimate the warping parameters; normalize the signal itself
- Effective integration with model-based adaptation techniques (collaboration with TSI)
41. ICCS-NTUA in HIWIRE: 1st Year
- Evaluation
  - Databases: Completed
  - Baseline: Completed
- WP1
  - Noise Robust Features: Results, 1st Year
  - Audio-Visual ASR: Baseline, Visual Features
  - Multi-microphone Array: Exploratory Phase
  - VAD: Preliminary Results
- WP2
  - Speaker Normalization: Baseline
  - Non-native Speech Database: Completed
42. WP1 Appendix Slides
43. ASR Results
44. Experimental Results IIa (HTK)
45. Aurora 3 Configs
- HM: 14 states, 12 mixtures
- MM: 16 states, 6 mixtures
- WM: 16 states, 16 mixtures
46. WP1 Appendix Slides
47. Baseline: Aurora 2
- Database Structure:
  - 2 training scenarios, 3 test sets, 4+4+2 conditions, 7 SNRs per condition: a total of 2x70 tests
- Presentation of Selected Results:
  - Average over SNR
  - Average over condition
  - Training scenarios: clean vs. multi-condition training
  - Noise level: low vs. high SNR
  - Condition: worst vs. easiest conditions
  - Features: MFCC+D+A vs. MFCC+D+A+CMS
- Setup: 18 states (10-22), 3-32 mixtures, MFCC+D+A+CMS
48. Average Baseline Results: Aurora 2
- Average over all SNRs and all conditions.
- Plain = MFCC+D+A; CMS = MFCC+D+A+CMS.
- Mixtures: clean training (both Plain and CMS) 3; multi-condition training: Plain 22, CMS 32.
- Best = for each condition/noise, select the number of mixtures with the best result.
- Average HTK results as reported with the database.
49. Results: Aurora 2
Up to 12%
50. Results: Aurora 2
Up to 40%
51. Results: Aurora 2
Up to 27%
52. Results: Aurora 2
Up to 61%
53. Aurora 2 Distributed: Multi-Condition Training
Word accuracy (%), full multi-condition training. The "Average" row is the mean over the 0-20 dB conditions.

| SNR | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Overall |
| Clean | 98.68 | 98.52 | 98.39 | 98.49 | 98.52 | 98.68 | 98.52 | 98.39 | 98.49 | 98.52 | 98.50 | 98.58 | 98.54 | 98.52 |
| 20 dB | 97.61 | 97.73 | 98.03 | 97.41 | 97.70 | 96.87 | 97.58 | 97.44 | 97.01 | 97.23 | 97.30 | 96.55 | 96.93 | 97.35 |
| 15 dB | 96.47 | 97.04 | 97.61 | 96.67 | 96.95 | 95.30 | 96.31 | 96.12 | 95.53 | 95.82 | 96.35 | 95.53 | 95.94 | 96.29 |
| 10 dB | 94.44 | 95.28 | 95.74 | 94.11 | 94.89 | 91.96 | 94.35 | 93.29 | 92.87 | 93.12 | 93.34 | 92.50 | 92.92 | 93.79 |
| 5 dB | 88.36 | 87.55 | 87.80 | 87.60 | 87.83 | 83.54 | 85.61 | 86.25 | 83.52 | 84.73 | 82.41 | 82.53 | 82.47 | 85.52 |
| 0 dB | 66.90 | 62.15 | 53.44 | 64.36 | 61.71 | 59.29 | 61.34 | 65.11 | 56.12 | 60.47 | 46.82 | 54.44 | 50.63 | 59.00 |
| -5 dB | 26.13 | 27.18 | 20.58 | 24.34 | 24.56 | 25.51 | 27.60 | 29.41 | 21.07 | 25.90 | 18.91 | 24.24 | 21.58 | 24.50 |
| Average | 88.76 | 87.95 | 86.52 | 88.03 | 87.82 | 85.39 | 87.04 | 87.64 | 85.01 | 86.27 | 83.24 | 84.31 | 83.78 | 86.39 |
54. Aurora 2 Distributed: Clean Training
Word accuracy (%), full clean training. The "Average" row is the mean over the 0-20 dB conditions.

| SNR | A: Subway | A: Babble | A: Car | A: Exhibition | A: Average | B: Restaurant | B: Street | B: Airport | B: Station | B: Average | C: Subway M | C: Street M | C: Average | Overall |
| Clean | 98.93 | 99.00 | 98.96 | 99.20 | 99.02 | 98.93 | 99.00 | 98.96 | 99.20 | 99.02 | 99.14 | 98.97 | 99.06 | 99.03 |
| 20 dB | 97.05 | 90.15 | 97.41 | 96.39 | 95.25 | 89.99 | 95.74 | 90.64 | 94.72 | 92.77 | 93.46 | 95.13 | 94.30 | 94.07 |
| 15 dB | 93.49 | 73.76 | 90.04 | 92.04 | 87.33 | 76.24 | 88.45 | 77.01 | 83.65 | 81.34 | 86.77 | 88.91 | 87.84 | 85.04 |
| 10 dB | 78.72 | 49.43 | 67.01 | 75.66 | 67.71 | 54.77 | 67.11 | 53.86 | 60.29 | 59.01 | 73.90 | 74.43 | 74.17 | 65.52 |
| 5 dB | 52.16 | 26.81 | 34.09 | 44.83 | 39.47 | 31.01 | 38.45 | 30.33 | 27.92 | 31.93 | 51.27 | 49.21 | 50.24 | 38.61 |
| 0 dB | 26.01 | 9.28 | 14.46 | 18.05 | 16.95 | 10.96 | 17.84 | 14.41 | 11.57 | 13.70 | 25.42 | 22.91 | 24.17 | 17.09 |
| -5 dB | 11.18 | 1.57 | 9.39 | 9.60 | 7.94 | 3.47 | 10.46 | 8.23 | 8.45 | 7.65 | 11.82 | 11.15 | 11.49 | 8.53 |
| Average | 69.49 | 49.89 | 60.60 | 65.39 | 61.34 | 52.59 | 61.52 | 53.25 | 55.63 | 55.75 | 66.16 | 66.12 | 66.14 | 60.06 |
55. WP1 Appendix Slides
56. Introduction: Motivations for AV-ASR
- Audio-only ASR does not work reliably in many scenarios:
  - Noisy background (e.g. a car cabin or cockpit)
  - Interference between talkers
  - Need to enhance the auditory signal when it is not reliable
- Human speech perception is multimodal:
  - Different modalities are weighted according to their reliability
  - Hearing-impaired people can lipread
  - McGurk effect (McGurk & MacDonald, 1976)
- Machines should also be able to exploit multimodal information
57. Audio-Visual Feature Fusion
- Audio-visual feature integration is highly non-trivial:
  - Audio-visual speech asynchrony (100 ms)
  - Relative reliability of streams can vary wildly
- Many approaches to feature fusion in the literature (see the weighted-stream sketch below):
  - Early integration
  - Intermediate integration
  - Late integration
- Highly active research area (mainly machine learning)
- The class of Dynamic Bayesian Networks (DBNs) seems particularly suited for the problem:
  - Stream interaction explicitly modeled
  - Model parameter inference is more difficult than in HMMs
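Since the deck builds on multi-stream HMMs (cf. the two-stream HMM in the AV-ASR setup slide), here is a minimal sketch of the weighted-stream observation model that such fusion schemes share: per-state log-likelihoods of the audio and visual streams are combined with stream-weight exponents. The Gaussians and weights are illustrative, not the deck's trained models.

```python
import numpy as np
from scipy.stats import multivariate_normal

def two_stream_loglik(o_audio, o_visual, state, w_audio=1.0, w_visual=1.0):
    """log b(o) = w_a * log N(o_a; mu_a, S_a) + w_v * log N(o_v; mu_v, S_v)."""
    la = multivariate_normal.logpdf(o_audio, state["mu_a"], state["cov_a"])
    lv = multivariate_normal.logpdf(o_visual, state["mu_v"], state["cov_v"])
    return w_audio * la + w_visual * lv

# Illustrative single-Gaussian state for 39 audio and 15 visual dimensions.
state = {"mu_a": np.zeros(39), "cov_a": np.eye(39),
         "mu_v": np.zeros(15), "cov_v": np.eye(15)}
print(two_stream_loglik(np.random.randn(39), np.random.randn(15), state))
```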
58. Visual Front-End: AAM Parameters
- First frame of each of the 36 videos manually annotated:
  - 68 points on the whole face as shape landmarks
  - Color appearance sampled at 10000 pixels
- Eigenvectors retained explain 70% of the variance:
  - 5 eigenshapes + 10 eigenfaces (see the PCA sketch below)
- Initial condition at each new frame: the converged solution from the previous frame
- Inverse-compositional gradient descent algorithm
- Coarse-to-fine refinement (Gaussian pyramid, 3 scales)
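A minimal sketch of how a shape basis like the one above (eigenshapes explaining 70% of the variance) can be obtained with PCA over the annotated landmarks. Procrustes alignment and the appearance (eigenface) half are omitted, and the data here are random placeholders.

```python
import numpy as np

def shape_basis(shapes, var_target=0.70):
    """shapes: (n_samples, 2*n_landmarks) aligned landmark vectors.
    Returns the mean shape and the leading eigenshapes."""
    mean = shapes.mean(axis=0)
    U, s, Vt = np.linalg.svd(shapes - mean, full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)                 # per-component variance
    k = int(np.searchsorted(np.cumsum(var), var_target)) + 1
    return mean, Vt[:k]                           # each row: one eigenshape

# A new shape is synthesized as mean + sum_i p_i * eigenshape_i.
shapes = np.random.randn(36, 2 * 68)   # placeholder: 36 annotated frames
mean, P = shape_basis(shapes)
synth = mean + P.T @ np.random.randn(len(P))
```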
59. AV-ASR Experiment Setup
- Features:
  - Audio: 39 features (MFCC_D_A)
  - Visual (upsampled from 30 Hz to 100 Hz; see the sketch below):
    - 5 shape features (Sh)
    - 10 appearance features (App)
  - Audio-Visual: 39+45 features (MFCC_D_A + Sh+App_D_A)
- Two-stream HMM:
  - 8-state, left-to-right, whole-digit models with no state skipping
  - Single-Gaussian observation probability densities
  - Separate audio and video feature streams with equal weights (1, 1)
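A minimal sketch of the 30 Hz → 100 Hz rate matching implied above, using per-dimension linear interpolation before concatenating audio and visual frames (early integration). Array shapes are illustrative.

```python
import numpy as np

def upsample(feats, rate_in=30.0, rate_out=100.0):
    """Linearly interpolate each feature dimension to the target frame rate."""
    t_in = np.arange(len(feats)) / rate_in
    t_out = np.arange(int(len(feats) * rate_out / rate_in)) / rate_out
    return np.stack([np.interp(t_out, t_in, f) for f in feats.T], axis=1)

audio = np.random.randn(100, 39)    # 1 s of MFCC_D_A at 100 Hz
visual = np.random.randn(30, 45)    # 1 s of Sh+App_D_A at 30 Hz
visual_100 = upsample(visual)
T = min(len(audio), len(visual_100))
av = np.hstack([audio[:T], visual_100[:T]])   # (T, 84) audio-visual frames
print(av.shape)
```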
60. WP1 Appendix Slides
61. Aurora 4, Multi-Condition Training
62. Aurora 4, Noisy Training
63. Aurora 4, Noisy Training