Title: Emotional Grounding in Spoken Dialog Systems
1 Emotional Grounding in Spoken Dialog Systems
- Jackson Liscombe
- jaxin_at_cs.columbia.edu
- Giuseppe Riccardi, Dilek Hakkani-Tür
- dsp3_at_research.att.com, dtur_at_research.att.com
2 The Problem: Emotion
- In Spoken Dialog Systems, users can
- start angry.
- get angry.
- end angry.
3 Outline
- Previous Work
- Corpus Description
- Feature Extraction
- Classification Experiments
5 Past Work
- Isolated Speech
- Spoken Dialog Systems
6 Past Work: Isolated Speech
- Acted Data
- Features
- F0/pitch
- energy
- speaking rate
- Researchers (late 1990s - present)
- Aubergé, Campbell, Cowie, Douglas-Cowie, Hirschberg, Liscombe, Mozziconacci, Oudeyer, Pereira, Roach, Scherer, Schröder, Tato, Yuan, Zetterholm, ...
7 Past Work: Spoken Dialog Systems (1)
- Batliner, Huber, Fischer, Spilker, Nöth (2003)
- system: Verbmobil (Wizard of Oz scenarios)
- binary classification
- features
- prosodic
- lexical (POS tags, swear words)
- dialog acts (repeat/repair/insult)
- 0.1% relative improvement using dialog acts
8 Past Work: Spoken Dialog Systems (2)
- Ang, Dhillon, Krupski, Shriberg, Stolcke (2002)
- system: DARPA Communicator
- binary classification
- features
- prosodic
- lexical (language model)
- dialog acts (repeats/repairs)
- 4% relative improvement using dialog acts
9 Past Work: Spoken Dialog Systems (3)
- Lee, Narayanan (2004)
- system: Speechworks call-center
- binary classification
- features
- prosodic
- lexical (weighted mutual information)
- dialog acts (repeat/rejection)
- 3% improvement using dialog acts
10 Past Work: Summary
- Past research has focused on acoustic data
- But research is moving toward grounding emotion in context (dialog acts)
- Summer work: extend contextual features for better emotion prediction
11 Outline
- Previous Work
- Corpus Description
- Feature Extraction
- Classification Experiments
12 Corpus Description
- AT&T's How May I Help You?(SM) corpus (0300 Benchmark)
- Labeled with Voice Signature information
- user state (emotion)
- gender
- age
- accent type
13 Corpus Description
Statistic                    Training   Testing
number of user turns         15,013     5,000
number of dialogs            4,259      1,431
number of turns per dialog   3.5        3.5
number of words per turn     9.0        9.9
14 User Emotion Distribution
15 Emotion Labels
- Original Set
- Positive/Neutral
- Somewhat Frustrated
- Very Frustrated
- Somewhat Angry
- Very Angry
- Other Somewhat Negative
- Very Negative
- Reduced Set
- Positive
- Negative
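The reduction from the original label set to the binary set can be sketched as below. The mapping rule is an assumption: the slides list the two sets but do not spell out which original labels fall into each class, so this sketch treats only "Positive/Neutral" as Positive and everything else as Negative. The function name `reduce_label` is hypothetical.

```python
def reduce_label(original_label: str) -> str:
    """Collapse an original emotion label into the reduced binary set.

    Assumption (not stated on the slide): only "Positive/Neutral"
    maps to "Positive"; every other label maps to "Negative".
    """
    is_positive = original_label.strip().lower() == "positive/neutral"
    return "Positive" if is_positive else "Negative"


if __name__ == "__main__":
    for label in ["Positive/Neutral", "Somewhat Frustrated", "Very Angry"]:
        print(label, "->", reduce_label(label))
```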
16 Corpus Description: Binary User States
Statistic                                      Training   Testing
% of turns that are positive                   88.1       73.1
% of dialogs with at least one negative turn   24.8       44.7
% of negative dialogs that start negative      43.5       59.9
% of negative dialogs that end negative        42.4       48.7
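For readers who want to reproduce statistics of this kind on their own data, here is a minimal sketch. It assumes each dialog is a list of per-turn labels from the reduced set ("Positive" or "Negative"); the function name and the toy dialogs are hypothetical and are not drawn from the corpus.

```python
from typing import Dict, List


def binary_state_stats(dialogs: List[List[str]]) -> Dict[str, float]:
    """Compute the four percentages from per-turn binary labels."""

    def pct(part: int, whole: int) -> float:
        return 100.0 * part / whole if whole else 0.0

    all_turns = [label for dialog in dialogs for label in dialog]
    # A "negative dialog" is one with at least one negative turn.
    negative_dialogs = [d for d in dialogs if "Negative" in d]
    return {
        "turns_positive": pct(all_turns.count("Positive"), len(all_turns)),
        "dialogs_with_negative": pct(len(negative_dialogs), len(dialogs)),
        "negative_dialogs_start_negative": pct(
            sum(d[0] == "Negative" for d in negative_dialogs),
            len(negative_dialogs),
        ),
        "negative_dialogs_end_negative": pct(
            sum(d[-1] == "Negative" for d in negative_dialogs),
            len(negative_dialogs),
        ),
    }


if __name__ == "__main__":
    toy = [
        ["Positive", "Positive"],
        ["Negative", "Positive"],
        ["Positive", "Negative"],
        ["Positive"],
    ]
    for name, value in binary_state_stats(toy).items():
        print(f"{name}: {value:.1f}")
```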
17 Outline
- Previous Work
- Corpus Description
- Feature Extraction
- Classification Experiments
18 Feature Set Space
Context \ Features       Prosodic   Lexical   Discourse
turn_i
turn_{i-1}, turn_i
turn_{i-2}, turn_{i-1}
19 Feature Set Space: Context Overview
Context \ Features       Prosodic        Lexical            Discourse
turn_i                   Isolated        Isolated           Isolated
turn_{i-1}, turn_i       Differentials   Prior Statistics   Prior Statistics
turn_{i-2}, turn_{i-1}   Differentials   Prior Statistics   Prior Statistics
                         Differentials   Prior Statistics   Prior Statistics
20 Lexical Features
- Language Model (n-grams)
- Examples of words significantly correlated with negative user state (p < 0.001)
- 1st person pronouns: I, me
- requests for a human operator: person, talk, speak, human, machine
- billing-related words: dollars, cents
- curse words
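As an illustration of the n-gram representation behind the lexical features, the sketch below counts word n-grams in a single user turn. The whitespace tokenization, the lowercasing, and the maximum order of 3 are assumptions; the slides do not specify them.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(turn: str, max_n: int = 3) -> Counter:
    """Count all n-grams of order 1..max_n in one transcribed turn."""
    tokens = turn.lower().split()  # assumption: whitespace tokenization
    counts: Counter = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts


if __name__ == "__main__":
    counts = ngram_counts("may I speak to an assistant please", max_n=2)
    print(counts[("speak",)], counts[("may", "i")])
```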
21 Prosodic Features
- Praat
- open source tool for speech analysis, synthesis, statistics, manipulation, ...
- Paul Boersma and David Weenink
- University of Amsterdam
- www.praat.org
22 Prosodic Features
- Pitch (F0)
- overall minimum
- overall maximum
- overall median
- overall standard deviation
- mean absolute slope
- slope of final vowel
- longest vowel mean
- Other
- local jitter over longest vowel
- Energy
- overall minimum
- overall maximum
- overall mean
- overall standard deviation
- longest vowel mean
- Speaking Rate
- vowels per second
- mean vowel length
- ratio of voiced frames to total frames
- percent internal silence
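To make the turn-level pitch statistics concrete, here is a rough sketch that summarizes an F0 contour. It is not Praat's computation: it assumes the contour has already been extracted as one value per frame, with `None` marking unvoiced frames, and it defines mean absolute slope as the average absolute Hz-per-second change between consecutive voiced frames. The function name, the 10 ms frame step, and the toy contour are all hypothetical.

```python
import statistics
from typing import Dict, List, Optional


def pitch_summary(f0_hz: List[Optional[float]],
                  frame_step_s: float = 0.01) -> Dict[str, float]:
    """Summarize an F0 contour (None = unvoiced frame)."""
    voiced = [v for v in f0_hz if v is not None]
    if len(voiced) < 2:
        raise ValueError("need at least two voiced frames")
    # Slopes are only taken between adjacent frames that are both voiced.
    slopes = [
        abs(b - a) / frame_step_s
        for a, b in zip(f0_hz, f0_hz[1:])
        if a is not None and b is not None
    ]
    return {
        "min": min(voiced),
        "max": max(voiced),
        "median": statistics.median(voiced),
        "stdev": statistics.stdev(voiced),
        "mean_abs_slope": statistics.mean(slopes) if slopes else 0.0,
        "voiced_ratio": len(voiced) / len(f0_hz),
    }


if __name__ == "__main__":
    contour = [None, 100.0, 110.0, 120.0, None, 90.0]
    for name, value in pitch_summary(contour).items():
        print(f"{name}: {value:.2f}")
```

The same summary function could be applied to an energy contour for the minimum, maximum, and standard deviation, although the slide lists the mean rather than the median for energy.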
23 Prosodic Features in Action
24 A View of Pitch
Very Frustrated
Somewhat Frustrated
- That amount is incorrect.
20020508/0508095935fhf2910
25 Pitch/Energy/Speaking Rate Change
Person   Transcription [Emotion]
Agent    I'm sorry, I didn't understand that. How may I help you?
User     please an operator please .hst may I speak- [Positive/Neutral]
Agent    I'm sorry we're having trouble. Can we try one more time? Just briefly tell me how I may help you.
User     may I speak to an assistant please [Somewhat Frustrated]
Agent    Are you calling from your home phone?
User     yes [Somewhat Angry]
20020221/0221080552atf1536
26 Pitch/Energy/Speaking Rate Change
20020221/0221080552atf1536
27 Pitch/Energy/Speaking Rate No Change
Person   Transcription [Emotion]
Agent    Hello, I am an AT&T automated assistant. You can speak naturally to me. How may I help you?
User     I need to find out about a number that I don't recognize [Positive/Neutral]
Agent    Would you like to look up a number you don't recognize on your bill?
User     yes I would [Positive/Neutral]
Agent    Are you calling from your home phone?
User     yes I am [Positive/Neutral]
20020317/0317220739atf9051
28 Pitch/Energy/Speaking Rate No Change
20020317/0317220739atf9051
29 Feature Set Space: Baseline
30 Discourse Features
- Dialog Acts
- greeting
- re-prompt
- confirmation
- specification
- acknowledgment
- disambiguation
31 Feature Set Space: State-of-the-Art
32 Contextual Features
- Lexical (2)
- edit distance with previous 2 turns
- Discourse (10)
- turn number
- call type repetition with previous 2 turns
- dialog act repetition with previous 2 turns
- Prosodic (34)
- 1st and 2nd order differentials for each feature
- Other (2)
- user state of previous 2 turns
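Two of the contextual features above lend themselves to a short sketch: the edit distance between the current turn and an earlier turn, and the 1st and 2nd order differentials of a prosodic feature. The definitions here are assumptions, since the slides do not give formulas: edit distance is taken as word-level Levenshtein distance, the 1st order differential as the change from turn i-1 to turn i, and the 2nd order differential as the change in that change. Function names are hypothetical.

```python
from typing import Dict, Sequence


def word_edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, word_a in enumerate(a, start=1):
        curr = [i]
        for j, word_b in enumerate(b, start=1):
            cost = 0 if word_a == word_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def differentials(values: Sequence[float]) -> Dict[str, float]:
    """Differentials of one feature over turns [i-2, i-1, i]."""
    two_back, one_back, current = values
    first = current - one_back
    previous_first = one_back - two_back
    return {"first_order": first, "second_order": first - previous_first}


if __name__ == "__main__":
    current_turn = "may i speak to an assistant please".split()
    previous_turn = "may i speak to an operator please".split()
    print(word_edit_distance(current_turn, previous_turn))
    print(differentials([100.0, 110.0, 130.0]))
```

Applying `differentials` to each of the 17 turn-level prosodic features would yield 34 values, which matches the count given for the prosodic contextual features.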
33 Feature Set Space: Contextual
34 Outline
- Previous Work
- Corpus Description
- Feature Extraction
- Classification Experiments
35 Experimental Design
- Training size: 15,013 turns
- Testing size: 5,000 turns
- Most frequent user state (positive) accounts for 73.1% of testing data
- Learning Algorithm Used
- BoosTexter (boosting w/ weak learners)
- continuous and discrete valued features
- 2000 iterations
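To give a feel for boosting with weak learners, the sketch below implements plain binary AdaBoost over single-feature threshold rules (decision stumps). It is not BoosTexter and does not reproduce its handling of discrete or text-valued features; it only illustrates the general idea of re-weighting examples and combining weak rules. All names and the toy data are hypothetical.

```python
import math
from typing import List, Tuple

# A stump is (feature index, threshold, polarity, vote weight alpha).
Stump = Tuple[int, float, int, float]


def train_adaboost(X: List[List[float]], y: List[int],
                   rounds: int = 10) -> List[Stump]:
    """Train binary AdaBoost; labels in y must be +1 or -1."""
    n = len(X)
    weights = [1.0 / n] * n
    ensemble: List[Stump] = []
    for _ in range(rounds):
        best = None
        # Exhaustively pick the stump with the lowest weighted error.
        for f in range(len(X[0])):
            for thr in sorted({row[f] for row in X}):
                for polarity in (1, -1):
                    preds = [polarity if row[f] > thr else -polarity
                             for row in X]
                    err = sum(w for w, p, t in zip(weights, preds, y)
                              if p != t)
                    if best is None or err < best[0]:
                        best = (err, f, thr, polarity, preds)
        err, f, thr, polarity, preds = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((f, thr, polarity, alpha))
        # Increase the weight of misclassified examples, then normalize.
        weights = [w * math.exp(-alpha * t * p)
                   for w, t, p in zip(weights, y, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble: List[Stump], row: List[float]) -> int:
    """Weighted vote of all stumps; returns +1 or -1."""
    score = sum(alpha * (pol if row[f] > thr else -pol)
                for f, thr, pol, alpha in ensemble)
    return 1 if score >= 0 else -1


if __name__ == "__main__":
    X = [[1.0], [2.0], [3.0], [4.0]]
    y = [-1, -1, 1, 1]
    model = train_adaboost(X, y, rounds=3)
    print([predict(model, row) for row in X])
```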
36 Performance: Accuracy Summary
Feature Set        Accuracy (%)   Rel. Improv. over Baseline (%)
Most Freq. State   73.1           -----
Baseline           76.1           -----
State-of-the-Art   77.0           1.2
Contextual         79.0           3.8
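The relative improvement figures can be checked from the accuracies in the table, assuming the usual definition: the accuracy gain over the baseline divided by the baseline accuracy, expressed as a percentage. The function name is hypothetical.

```python
def relative_improvement(accuracy: float, baseline: float) -> float:
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (accuracy - baseline) / baseline


if __name__ == "__main__":
    baseline = 76.1
    print(round(relative_improvement(77.0, baseline), 1))  # 1.2
    print(round(relative_improvement(79.0, baseline), 1))  # 3.8
```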
37 Conclusions
- Baseline (prosodic and lexical features)
- leads to improved emotion prediction over chance
- State-of-the-Art (baseline plus dialog acts)
- gives further improvement
- Innovative contextual features
- improve emotion prediction even further
- Towards a computational model of emotional grounding
38 Thank You