PowerPoint-Pr - PowerPoint PPT Presentation

About This Presentation
Title: PowerPoint-Pr
Description: Discriminative fusion applying MLP. Input layer: 2 x 7 confidences. Hidden layer: 100 nodes ... Significant gain by discriminative stream fusion.

Slides: 22
Provided by: mmk9

Transcript and Presenter's Notes



1
(No Transcript)
2
Outline
  • System Overview
  • Emotional Speech Corpus
  • Acoustic Analysis
  • Semantic Analysis
  • Stream Fusion
  • Results

3
System Overview
Speech signal
Prosodic features
ASR-unit
Semantic interpretation (Bayesian Networks)
Classifier (SVM)
Stream fusion (MLP)
Emotion
4
Emotional Speech Corpus
  • Emotion set: anger, disgust, fear, joy, neutrality, sadness, surprise
  • Corpus 1: practical course
      • 404 acted samples per emotion
      • 13 speakers (1 female)
      • Recorded within one year
  • Corpus 2: driving simulator
      • 500 spontaneous emotion samples
      • 200 acted samples (disgust, sadness)

5
System Overview
Speech signal
Prosodic features
ASR-unit
Semantic interpretation (Bayesian Networks)
Classifier (SVM)
Stream fusion (MLP)
Emotion
6
Acoustic Analysis
  • Low-level features
      • Pitch contour (AMDF, low-pass filtering)
      • Energy contour
      • Spectrum
      • Signal
  • High-level features
      • Statistical analysis of contours
      • Elimination of mean, normalization to standard deviation
      • Duration of one utterance (1-5 seconds)
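
The pitch-contour and normalization steps listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the presenters' implementation: the search range (`f_min`, `f_max`), the sampling rate, the frame length, and all function names are hypothetical, and the low-pass filtering step is left out.

```python
import math


def amdf(frame, max_lag):
    """Average magnitude difference function for lags 1..max_lag (index 0 unused)."""
    n = len(frame)
    values = [0.0] * (max_lag + 1)
    for lag in range(1, max_lag + 1):
        diffs = [abs(frame[i] - frame[i + lag]) for i in range(n - lag)]
        values[lag] = sum(diffs) / len(diffs)
    return values


def estimate_pitch(frame, sample_rate, f_min=60.0, f_max=400.0):
    """Return the pitch (Hz) whose lag minimizes the AMDF within [f_min, f_max]."""
    min_lag = int(sample_rate / f_max)   # assumed upper pitch limit
    max_lag = int(sample_rate / f_min)   # assumed lower pitch limit
    d = amdf(frame, max_lag)
    best_lag = min(range(min_lag, max_lag + 1), key=lambda lag: d[lag])
    return sample_rate / best_lag


def zscore(contour):
    """Remove the mean and scale to unit standard deviation."""
    mean = sum(contour) / len(contour)
    std = math.sqrt(sum((x - mean) ** 2 for x in contour) / len(contour))
    if std == 0.0:
        return [0.0 for _ in contour]
    return [(x - mean) / std for x in contour]


# Synthetic test frame: a 100 Hz sine sampled at 8 kHz (assumed values).
sample_rate = 8000
frame = [math.sin(2 * math.pi * 100.0 * t / sample_rate) for t in range(400)]
pitch = estimate_pitch(frame, sample_rate)
normalized = zscore([1.0, 2.0, 3.0])
```

In a real system one pitch value per frame would form the contour, and the statistics (mean, standard deviation, gradients and so on) would then be taken over the normalized contour of each utterance.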

7
Acoustic Analysis
  • Feature selection (1/2)
      • Initial set of 200 statistical features
      • Ranking 1: single-feature performance (nearest-mean classifier)
      • Ranking 2: Sequential Forward Floating Search (SFFS), wrapped around a nearest-mean classifier
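
The two rankings can be sketched as below. This is a simplified illustration, not the presenters' code: accuracy is measured on the training data itself (resubstitution), the toy data set is invented, and the function names are hypothetical. The floating step drops a feature only when the reduced subset beats the best subset of that size seen so far.

```python
def nearest_mean_accuracy(X, y, features):
    """Resubstitution accuracy of a nearest-mean classifier on the given feature indices."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
        counts[label] = counts.get(label, 0) + 1
    means = {c: [v / counts[c] for v in sums[c]] for c in sums}

    def predict(row):
        return min(means, key=lambda c: sum((row[f] - m) ** 2
                                            for f, m in zip(features, means[c])))

    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)


def rank_single(X, y):
    """Ranking 1: order features by their individual accuracy (best first)."""
    return sorted(range(len(X[0])),
                  key=lambda f: nearest_mean_accuracy(X, y, [f]), reverse=True)


def sffs(X, y, target_size):
    """Ranking 2: sequential forward floating search up to target_size features."""
    selected, best_score = [], {}
    while len(selected) < target_size:
        # Forward step: add the candidate that gives the best subset.
        candidates = [f for f in range(len(X[0])) if f not in selected]
        selected.append(max(candidates,
                            key=lambda f: nearest_mean_accuracy(X, y, selected + [f])))
        size = len(selected)
        best_score[size] = max(best_score.get(size, -1.0),
                               nearest_mean_accuracy(X, y, selected))
        # Floating backward steps: remove the least useful feature while that
        # improves on the best subset of the smaller size found so far.
        while len(selected) > 2:
            drop = max(selected, key=lambda f: nearest_mean_accuracy(
                X, y, [g for g in selected if g != f]))
            reduced = [g for g in selected if g != drop]
            score = nearest_mean_accuracy(X, y, reduced)
            if score <= best_score.get(len(reduced), -1.0):
                break
            selected, best_score[len(reduced)] = reduced, score
    return selected


# Invented toy data: features 0 and 2 separate the classes, feature 1 is noise.
X = [[0.0, 5.0, 1.0], [0.2, 1.0, 0.9], [0.1, 3.0, 1.1],
     [1.0, 4.0, 0.0], [0.9, 2.0, 0.1], [1.1, 0.0, -0.1]]
y = ["a", "a", "a", "b", "b", "b"]
single_ranking = rank_single(X, y)
sffs_selection = sffs(X, y, 2)
```

Single-feature ranking ignores redundancy between features, which is why a wrapper search such as SFFS can order the same features differently.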

8
Acoustic Analysis
  • Feature selection (2/2)
  • Top 10 features

Acoustic feature                                               SFFS rank  Single perf.
Pitch, maximum gradient                                        1          31.5
Pitch, standard deviation of distance between reversal points  2          23.0
Pitch, mean value                                              3          25.6
Signal, number of zero-crossings                               4          16.9
Pitch, standard deviation                                      5          27.6
Duration of silences, mean value                               6          17.5
Duration of voiced sounds, mean value                          7          18.5
Energy, median of fall-time                                    8          17.8
Energy, mean distance between reversal points                  9          19.0
Energy, mean of rise-time                                      10         17.6
9
Acoustic Analysis
  • Classification
  • Evaluation of various classification methods
  • 33 features

Classifier  Error (%), speaker-indep.  Error (%), speaker-dep.
kMeans      57.05                      27.38
kNN         30.41                      17.39
GMM         25.17                      10.88
MLP         26.86                      9.36
SVM         23.88                      7.05
ML-SVM      18.71                      9.05
Output: vector of (pseudo-)recognition confidences
10
Acoustic Analysis
  • Classification
  • Multi-Layer Support Vector Machines

→ No confidence vector to forward to fusion
11
System Overview
Speech signal
Prosodic features
ASR-unit
Semantic interpretation (Bayesian Networks)
Classifier (SVM)
Stream fusion (MLP)
Emotion
12
Semantic Analysis
  • ASR-unit
      • HMM-based
      • 1,300-word German vocabulary
      • No language model
      • 5-best phrase hypotheses
      • Recognition confidences per word
  • Example output (first hypothesis)

Word        I     cant  stand  this  every  tray  traffic-jam
Confidence  69.3  34.6  72.1   20.0  36.1   15.9  55.8
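
One way to hold such a hypothesis for later semantic interpretation is to pair each recognized word with its confidence. The sketch below assumes the confidences are on a 0-100 scale and rescales them to 0-1; that scaling and the variable names are assumptions, not something the slides state.

```python
# First hypothesis from the slide, kept exactly as recognized.
words = "I cant stand this every tray traffic-jam".split()
confidences = [69.3, 34.6, 72.1, 20.0, 36.1, 15.9, 55.8]

# Pair every word with its confidence.
hypothesis = list(zip(words, confidences))

# Assumed rescaling to [0, 1] so the values can act as soft evidence.
soft_evidence = {word: conf / 100.0 for word, conf in hypothesis}

# Average confidence of the whole phrase hypothesis.
mean_confidence = sum(confidences) / len(confidences)
```

Keeping the low-confidence words instead of discarding them lets a probabilistic interpreter weigh them rather than trust or ignore them outright.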
13
Semantic Analysis
  • Conditions
      • Natural language
      • Erroneous speech recognition
      • Uncertain knowledge
      • Incomplete knowledge
      • Superfluous knowledge
  • → Probabilistic spotting approach
  • → Bayesian Belief Networks

14
Semantic Analysis
Bayesian Belief Networks
  • Acyclic graph of nodes and directed edges
  • One state variable per node (here states
    , )
  • Setting node dependencies via conditional probability matrices
  • Setting initial probabilities in root nodes
  • Observation A causes evidence in a child node (i.e. its state is known)
  • Inference to direct parent nodes and finally to root nodes: Bayes' rule
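
The inference chain from an observed child node up to a root node can be illustrated with a three-node network (root, intermediate node, observed leaf). Every node name, state, and probability below is invented for the example; the network used in the presentation is not given in the transcript.

```python
def infer_root(prior, cpt_mid, cpt_leaf, observed):
    """Posterior over root states given an observed leaf state (Bayes' rule).

    prior:    P(root)
    cpt_mid:  P(mid | root), indexed as cpt_mid[root][mid]
    cpt_leaf: P(leaf | mid), indexed as cpt_leaf[mid][leaf]
    """
    joint = {}
    for root, p_root in prior.items():
        # Sum out the intermediate node for this root state.
        joint[root] = sum(p_root * p_mid * cpt_leaf[mid][observed]
                          for mid, p_mid in cpt_mid[root].items())
    evidence = sum(joint.values())
    return {root: value / evidence for root, value in joint.items()}


# Hypothetical numbers: root = emotion, mid = semantic concept, leaf = spotted word.
prior = {"anger": 0.5, "other": 0.5}
cpt_mid = {"anger": {"complaint": 0.7, "none": 0.3},
           "other": {"complaint": 0.2, "none": 0.8}}
cpt_leaf = {"complaint": {"spotted": 0.9, "absent": 0.1},
            "none": {"spotted": 0.1, "absent": 0.9}}

posterior = infer_root(prior, cpt_mid, cpt_leaf, "spotted")
```

Spotting the word raises the belief in "anger" above its prior of 0.5, and the posterior over all root states still sums to one.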

15
Semantic Analysis
  • Emotion modelling

Output: vector of real recognition confidences
16
System Overview
Speech signal
Prosodic features
ASR-unit
Semantic interpretation (Bayesian Networks)
Classifier (SVM)
Stream fusion (MLP)
Emotion
17
Stream Fusion
  • Pairwise mean
  • Discriminative fusion applying MLP
      • Input layer: 2 x 7 confidences
      • Hidden layer: 100 nodes
      • Output layer: 7 recognition confidences
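
Both fusion variants can be sketched with the layer sizes given above (14 inputs, 100 hidden nodes, 7 outputs). The weights here are random and untrained, and the tanh hidden activation, the softmax output, and the example confidence vectors are assumptions made only for illustration.

```python
import math
import random

EMOTIONS = ["ang", "dis", "fea", "joy", "ntl", "sad", "sur"]


def pairwise_mean(acoustic, semantic):
    """Fuse two confidence vectors by averaging them element-wise."""
    return [(a + s) / 2.0 for a, s in zip(acoustic, semantic)]


def softmax(scores):
    """Turn raw output scores into confidences that sum to one."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_forward(x, w_hidden, b_hidden, w_out, b_out):
    """One forward pass: 14 inputs -> 100 tanh hidden nodes -> 7 softmax outputs."""
    hidden = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    scores = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    return softmax(scores)


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def random_matrix(rows, cols):
    """Small random weights standing in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


w_hidden, b_hidden = random_matrix(100, 14), [0.0] * 100
w_out, b_out = random_matrix(7, 100), [0.0] * 7

# Invented confidence vectors from the acoustic and the semantic stream.
acoustic = [0.60, 0.05, 0.10, 0.05, 0.10, 0.05, 0.05]
semantic = [0.40, 0.20, 0.05, 0.05, 0.20, 0.05, 0.05]

mean_fused = pairwise_mean(acoustic, semantic)
mlp_fused = mlp_forward(acoustic + semantic, w_hidden, b_hidden, w_out, b_out)
```

The pairwise mean treats both streams as equally reliable for every emotion, whereas a trained MLP can learn which stream to trust for which class.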

18
Results
Acoustic recognition rates (SVM)
Emotion   ang   dis   fea   joy   ntl   sad   sur   Mean
Rate (%)  95.5  61.3  78.7  75.1  78.5  62.1  68.3  74.2
Semantic recognition rates
Emotion   ang   dis   fea   joy   ntl   sad   sur   Mean
Rate (%)  78.4  71.2  53.4  57.7  56.0  35.0  65.5  59.6
19
Results
Recognition rates after discriminative fusion
Emotion   ang   dis   fea   joy   ntl   sad   sur   Mean
Rate (%)  98.0  78.7  88.3  95.9  98.2  91.7  95.8  92.0
Overview
Acoustic information  Language information  Fusion by mean  Fusion by MLP
74.2                  59.6                  83.1            92.0
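
The size of the fusion gain can be read off the overview figures with a little arithmetic. The sketch assumes the four values are recognition rates in percent; the variable names are hypothetical.

```python
# Overview figures from the slide, assumed to be recognition rates in percent.
rates = {"acoustic": 74.2, "language": 59.6, "fusion_mean": 83.1, "fusion_mlp": 92.0}


def relative_error_reduction(baseline, improved):
    """Share of the baseline error (100 - rate) removed by the improved system."""
    return ((100.0 - baseline) - (100.0 - improved)) / (100.0 - baseline)


best_single = max(rates["acoustic"], rates["language"])

# Absolute gains over the best single stream, in percentage points.
gain_mean = rates["fusion_mean"] - best_single
gain_mlp = rates["fusion_mlp"] - best_single

# Relative error reduction of MLP fusion over the best single stream.
rer_mlp = relative_error_reduction(best_single, rates["fusion_mlp"])
```

On these figures, MLP fusion adds 17.8 points over the acoustic stream alone, twice the 8.9 points gained by the pairwise mean.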
20
Summary
  • Acted emotions
  • 7 discrete emotion categories
  • Prosodic feature selection via
      • Single-feature performance
      • Sequential forward floating search
  • Comparative evaluation of different classifiers
      • SVMs perform best
  • Semantic analysis applying Bayesian Networks
  • Significant gain by discriminative stream fusion

21
(No Transcript)