Title: PowerPoint-Pr
2 Outline
- System Overview
- Emotional Speech Corpus
- Acoustic Analysis
- Semantic Analysis
- Stream Fusion
- Results
3 System Overview
- Speech signal
- Acoustic stream: Prosodic features -> Classifier (SVM)
- Semantic stream: ASR-unit -> Semantic interpretation (Bayesian Networks)
- Stream fusion (MLP)
- Emotion
4 Emotional Speech Corpus
- Emotion set: anger, disgust, fear, joy, neutrality, sadness, surprise
- Corpus 1: Practical course
  - 404 acted samples per emotion
  - 13 speakers (1 female)
  - Recorded within one year
- Corpus 2: Driving simulator
  - 500 spontaneous emotion samples
  - 200 acted samples (disgust, sadness)
6 Acoustic Analysis
- Low-level features
  - Pitch contour (AMDF, low-pass filtering)
  - Energy contour
  - Spectrum
  - Signal
- High-level features
  - Statistical analysis of contours
  - Elimination of mean, normalization to standard deviation
  - Duration of one utterance (1-5 seconds)
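The slide names AMDF for the pitch contour and statistics over contours as high-level features, but gives no formulas. The following is a minimal sketch of both steps, not the authors' implementation. The function names, the pitch search range, and the synthetic test tone are assumptions made for illustration, and the low-pass filtering mentioned on the slide is left out.

```python
import math

def amdf(frame, lag):
    """Average magnitude difference between a frame and itself shifted by `lag`."""
    n = len(frame) - lag
    return sum(abs(frame[i] - frame[i + lag]) for i in range(n)) / n

def estimate_pitch(frame, sample_rate, f_min=80.0, f_max=400.0):
    """Pick the lag with the smallest AMDF inside the allowed pitch range.

    The range limits f_min and f_max are assumed values, not from the slides.
    """
    min_lag = int(sample_rate / f_max)
    max_lag = int(sample_rate / f_min)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: amdf(frame, lag))
    return sample_rate / best_lag

def contour_statistics(contour):
    """High-level features: simple statistics over a low-level contour."""
    mean = sum(contour) / len(contour)
    std = math.sqrt(sum((v - mean) ** 2 for v in contour) / len(contour))
    max_gradient = max(abs(b - a) for a, b in zip(contour, contour[1:]))
    return {"mean": mean, "std": std, "max_gradient": max_gradient}

def normalize(values):
    """Remove the mean and scale to unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std if std > 0 else 0.0 for v in values]

# Synthetic 200 Hz tone, 50 ms at 8 kHz (hypothetical test signal).
SAMPLE_RATE = 8000
frame = [math.sin(2 * math.pi * 200 * t / SAMPLE_RATE) for t in range(400)]
pitch = estimate_pitch(frame, SAMPLE_RATE, f_min=120.0)

# Hypothetical pitch contour in Hz, one value per analysis frame.
stats = contour_statistics([190.0, 200.0, 230.0, 210.0])
```

The search range is narrowed to 120-400 Hz in the example so that only one period of the tone fits into the lag window. With a wider range a pure tone produces equally deep AMDF minima at multiples of the period, which is one reason practical pitch trackers smooth the resulting contour.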
7 Acoustic Analysis
- Feature selection (1/2)
  - Initial set of 200 statistical features
  - Ranking 1: Single performance of each feature (nearest-mean classifier)
  - Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
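The two rankings above can be sketched as follows. This is an illustrative reading of the slide, not the authors' code: the function names and the toy data are hypothetical, and accuracy is measured on the training samples themselves, an assumption made only to keep the example short (the slides do not state the evaluation protocol).

```python
def nearest_mean_accuracy(samples, labels, features):
    """Accuracy of a nearest-mean classifier restricted to `features`.

    Assumption: scored on the training data itself (resubstitution).
    """
    classes = sorted(set(labels))
    means = {}
    for c in classes:
        rows = [s for s, l in zip(samples, labels) if l == c]
        means[c] = [sum(r[f] for r in rows) / len(rows) for f in features]
    correct = 0
    for sample, label in zip(samples, labels):
        x = [sample[f] for f in features]
        predicted = min(
            classes,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, means[c])),
        )
        correct += predicted == label
    return correct / len(samples)

def rank_single(samples, labels):
    """Ranking 1: order features by their individual accuracy."""
    n_features = len(samples[0])
    return sorted(
        range(n_features),
        key=lambda f: nearest_mean_accuracy(samples, labels, [f]),
        reverse=True,
    )

def sffs(samples, labels, target_size):
    """Ranking 2: sequential forward floating search with the same wrapper."""
    def score(subset):
        return nearest_mean_accuracy(samples, labels, subset)

    n_features = len(samples[0])
    selected = []
    best = {}  # best score seen so far for each subset size
    while len(selected) < target_size:
        # Forward step: add the feature that helps most.
        candidates = [f for f in range(n_features) if f not in selected]
        selected = selected + [max(candidates, key=lambda f: score(selected + [f]))]
        best[len(selected)] = max(best.get(len(selected), 0.0), score(selected))
        # Floating step: drop features while that beats the best smaller subset.
        while len(selected) > 2:
            reduced = max(
                ([g for g in selected if g != f] for f in selected),
                key=score,
            )
            if score(reduced) <= best[len(reduced)]:
                break
            selected = reduced
            best[len(selected)] = score(selected)
    return selected

# Toy data (hypothetical): feature 1 separates the classes, 0 and 2 are noisy.
samples = [
    [0.5, 0.0, 0.1], [0.1, 0.1, 0.9], [0.9, 0.2, 0.2],
    [0.4, 0.9, 0.8], [0.6, 1.0, 0.3], [0.2, 0.8, 0.7],
]
labels = ["a", "a", "a", "b", "b", "b"]
single_ranking = rank_single(samples, labels)
sffs_selection = sffs(samples, labels, target_size=2)
```

The floating step accepts a removal only when it strictly improves on the best subset of that size seen so far, which guarantees that the search terminates. `target_size` must not exceed the number of available features.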
8 Acoustic Analysis
- Feature selection (2/2)
  - Top 10 features
Acoustic feature                                               SFFS rank  Single perf.
Pitch, maximum gradient                                        1          31.5
Pitch, standard deviation of distance between reversal points  2          23.0
Pitch, mean value                                              3          25.6
Signal, number of zero-crossings                               4          16.9
Pitch, standard deviation                                      5          27.6
Duration of silences, mean value                               6          17.5
Duration of voiced sounds, mean value                          7          18.5
Energy, median of fall-time                                    8          17.8
Energy, mean distance between reversal points                  9          19.0
Energy, mean of rise-time                                      10         17.6
9 Acoustic Analysis
- Classification
  - Evaluation of various classification methods
  - 33 features
Classifier  Error, speaker indep. (%)  Error, speaker dep. (%)
kMeans      57.05                      27.38
kNN         30.41                      17.39
GMM         25.17                      10.88
MLP         26.86                      9.36
SVM         23.88                      7.05
ML-SVM      18.71                      9.05
Output: Vector of (pseudo-) recognition confidences
10 Acoustic Analysis
- Classification
  - Multi-Layer Support Vector Machines (ML-SVM)
  - -> No confidence vector to forward to fusion
12 Semantic Analysis
- ASR-Unit
  - HMM-based
  - German vocabulary of 1300 words
  - No language model
  - 5-best phrase hypotheses
  - Recognition confidences per word
  - Example output (first hypothesis):
    I     cant  stand  this  every  tray  traffic-jam
    69.3  34.6  72.1   20.0  36.1   15.9  55.8
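As a small illustration of the data the ASR unit hands on, the first hypothesis above can be held as word/confidence pairs. This is only a sketch: the function name is hypothetical, and treating the confidences as percentages that can be rescaled to the range 0-1 is an assumption, since the slides do not state the scale.

```python
# First hypothesis from the slide: recognized words and per-word confidences.
words = "I cant stand this every tray traffic-jam".split()
confidences = [69.3, 34.6, 72.1, 20.0, 36.1, 15.9, 55.8]

# One (word, confidence) pair per recognized word.
hypothesis = list(zip(words, confidences))

def as_soft_evidence(hypothesis):
    """Rescale confidences to 0..1 (assumes they are percentages)."""
    return {word: confidence / 100.0 for word, confidence in hypothesis}

evidence = as_soft_evidence(hypothesis)
```

Keeping the confidence next to each word lets a later stage weight uncertain words such as "tray" less than confident ones such as "stand", instead of discarding them outright.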
13 Semantic Analysis
- Conditions
  - Natural language
  - Erroneous speech recognition
  - Uncertain knowledge
  - Incomplete knowledge
  - Superfluous knowledge
- -> Probabilistic spotting approach
- -> Bayesian Belief Networks
14 Semantic Analysis
Bayesian Belief Networks
- Acyclic graph of nodes and directed edges
- One state variable per node (here: two states)
- Setting node dependencies via conditional probability matrices
- Setting initial probabilities in root nodes
- Observation A causes evidence in a child node (i.e. its state is known)
- Inference to direct parent nodes and finally to root nodes: Bayes' rule
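The inference step from an observed child node back to a root node can be sketched with Bayes' rule on a two-node network. All node names and probabilities below are hypothetical, chosen only to make the arithmetic visible; they are not taken from the talk. The second function handles an uncertain observation with Jeffrey's rule, which is one common choice and not necessarily the one the authors used.

```python
def bayes_posterior(prior, likelihood):
    """Bayes' rule: P(root | evidence) from P(root) and P(evidence | root)."""
    joint = {state: prior[state] * likelihood[state] for state in prior}
    total = sum(joint.values())
    return {state: p / total for state, p in joint.items()}

def soft_posterior(prior, p_true_given_root, confidence):
    """Posterior for an uncertain observation (Jeffrey's rule).

    Mixes the posteriors for 'observed' and 'not observed', weighted by the
    confidence that the observation really occurred.
    """
    p_false_given_root = {s: 1.0 - p for s, p in p_true_given_root.items()}
    seen = bayes_posterior(prior, p_true_given_root)
    unseen = bayes_posterior(prior, p_false_given_root)
    return {
        s: confidence * seen[s] + (1.0 - confidence) * unseen[s] for s in prior
    }

# Hypothetical network: root node "anger" -> child node "phrase spotted".
prior = {"anger": 0.5, "not_anger": 0.5}                # initial root probabilities
p_phrase_given_root = {"anger": 0.6, "not_anger": 0.2}  # conditional probabilities

# Phrase observed with certainty: P(anger | phrase) = 0.3 / (0.3 + 0.1) = 0.75
hard = bayes_posterior(prior, p_phrase_given_root)

# Phrase observed with a recognition confidence of 0.721 (assumed scale 0..1).
soft = soft_posterior(prior, p_phrase_given_root, confidence=0.721)
```

With certain evidence the root belief in "anger" rises from 0.5 to 0.75. With the uncertain observation it rises less, because part of the probability mass stays with the case that the phrase was misrecognized.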
15 Semantic Analysis
Output: Vector of real recognition confidences
17 Stream Fusion
- Pairwise mean
- Discriminative fusion applying MLP
  - Input layer: 2 x 7 confidences
  - Hidden layer: 100 nodes
  - Output layer: 7 recognition confidences
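Both fusion variants can be sketched in a few lines. The layer sizes (14 inputs, 100 hidden nodes, 7 outputs) come from the slide. Everything else is an assumption for illustration: the class and function names, the tanh and softmax activations, the example confidence vectors, and above all the weights, which are random and untrained here, so the MLP output shows only the data flow and not a meaningful prediction.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]

def fuse_by_mean(acoustic, semantic):
    """Pairwise mean of the two confidence vectors."""
    return [(a + s) / 2 for a, s in zip(acoustic, semantic)]

class FusionMLP:
    """Forward pass of a 14-100-7 MLP with random, untrained weights."""

    def __init__(self, n_in=14, n_hidden=100, n_out=7, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [[rng.uniform(-0.1, 0.1) for _ in range(n_hidden)]
                   for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def forward(self, acoustic, semantic):
        # Input layer: both 7-dimensional confidence vectors, concatenated.
        x = list(acoustic) + list(semantic)
        # Hidden layer (tanh activation is an assumption).
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
                  for row, b in zip(self.w1, self.b1)]
        # Output layer, normalized with a softmax (also an assumption).
        logits = [sum(w * h for w, h in zip(row, hidden)) + b
                  for row, b in zip(self.w2, self.b2)]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        total = sum(exps)
        return [e / total for e in exps]

# Hypothetical confidence vectors from the two streams, ordered as EMOTIONS.
acoustic = [0.50, 0.05, 0.10, 0.05, 0.20, 0.05, 0.05]
semantic = [0.30, 0.30, 0.05, 0.05, 0.10, 0.10, 0.10]

mean_fused = fuse_by_mean(acoustic, semantic)
mean_decision = EMOTIONS[mean_fused.index(max(mean_fused))]
mlp_fused = FusionMLP().forward(acoustic, semantic)
```

The pairwise mean treats both streams as equally reliable for every emotion. A trained MLP can instead learn which stream to trust for which emotion, which is the motivation for the discriminative variant.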
18 Results
Acoustic recognition rates (SVM)
Emotion   ang   dis   fea   joy   ntl   sad   sur   Mean
Rate (%)  95.5  61.3  78.7  75.1  78.5  62.1  68.3  74.2
Semantic recognition rates
Emotion   ang   dis   fea   joy   ntl   sad   sur   Mean
Rate (%)  78.4  71.2  53.4  57.7  56.0  35.0  65.5  59.6
19 Results
Recognition rates after discriminative fusion
Emotion   ang   dis   fea   joy   ntl   sad   sur   Mean
Rate (%)  98.0  78.7  88.3  95.9  98.2  91.7  95.8  92.0
Overview (mean recognition rates in %)
Acoustic information  Language information  Fusion by mean  Fusion by MLP
74.2                  59.6                  83.1            92.0
20 Summary
- Acted emotions
- 7 discrete emotion categories
- Prosodic feature selection via
  - Single feature performance
  - Sequential forward floating search
- Evaluative comparison of different classifiers
  - SVMs outperform the other classifiers
- Semantic analysis applying Bayesian Networks
- Significant gain by discriminative stream fusion