Title: Spoken Dialogue Systems
Issues
- Error avoidance
- Error detection
  - From the system side: how likely is it that the system made an error?
  - From the user side: what cues does the user provide to indicate an error?
- Error handling: what can the system do when it thinks an error has occurred?
- Evaluation: how do you know what needs fixing most?
Avoiding Misunderstandings
- By imitating human performance
- Timing and grounding (Clark 03)
Recognizing Problematic Dialogues
- Hastie et al., "What's the Trouble?" (ACL 2002)
Recognizing Problematic Utterances (Hirschberg et al., 1999 onward)
- Collect a corpus from an interactive voice response system
- Identify speaker turns:
  - incorrectly recognized
  - where speakers first become aware of an error
  - that correct misrecognitions
- Identify prosodic features of turns in each category and compare them to other turns
- Use machine learning techniques to train a classifier to make these distinctions automatically (a sketch follows below)
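
As a rough illustration of that last step, here is a minimal sketch of training such a classifier, assuming prosodic features (e.g., F0 maximum, RMS energy, duration, tempo, prior pause) have already been extracted per turn. The feature set, data, and learner below are placeholders, not the actual setup from Hirschberg et al.

```python
# Sketch: classify turns as misrecognized vs. correctly recognized from
# per-turn prosodic features. Features and data are synthetic placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# One row per speaker turn: [f0_max, rms_max, duration, tempo, prior_pause]
X = rng.normal(size=(200, 5))
# Synthetic labels loosely tied to pitch and duration so the tree has
# something to learn; 1 = misrecognized turn, 0 = correctly recognized.
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(0, 0.5, 200) > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"estimated classification error: {1 - scores.mean():.2f}")
```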
Turn Types

TOOT: Hi. This is AT&T Amtrak Schedule System. This is TOOT. How may I help you?
User: Hello. I would like trains from Philadelphia to New York leaving on Sunday at ten thirty in the evening.  [misrecognition]
TOOT: Which city do you want to go to?
User: New York.  [aware site, correction]
Results
- Reduced error in predicting misrecognized turns to 8.64%
- Error in predicting aware sites: ~12%
- Error in predicting corrections: 18-21%
Evidence from Human Performance
- Users provide explicit positive and negative feedback
- Corpus-based vs. laboratory experiments: do these tell us different things?
- Bell & Gustafson 00
  - What do we learn from this?
  - What functions does feedback serve?
- Krahmer et al.
  - "go on" and "go back" signals in grounding situations (implicit/explicit verification)
- Positive cues: short turns, unmarked word order, confirmation, answers, no corrections or repetitions, new info
- Negative cues: long turns, marked word order, disconfirmation, no answer, corrections, repetitions, no new info
- Hypotheses supported, but:
  - Can these cues be identified automatically? (see the sketch after this list)
  - How might they affect the design of SDS?
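
To make the automation question concrete, a toy detector over the surface cues listed above might look like the following. The word list, thresholds, and voting scheme are invented for illustration; they are not from Krahmer et al.

```python
# Toy detector for negative ("go back") cues: long turns, disconfirmation,
# and repetition of material from the system's prior turn. Word lists and
# thresholds are illustrative guesses, not values from the literature.
DISCONFIRMATIONS = {"no", "nope", "wrong", "incorrect", "not"}

def looks_negative(user_turn: str, prev_system_turn: str = "") -> bool:
    words = user_turn.lower().split()
    long_turn = len(words) > 8                                 # long turn
    disconfirms = bool(DISCONFIRMATIONS & set(words))          # disconfirmation
    repeated = len(set(words) & set(prev_system_turn.lower().split())) >= 3
    return sum([long_turn, disconfirms, repeated]) >= 2        # simple vote

print(looks_negative("no I want to leave from Philadelphia not Baltimore",
                     "Do you want to leave from Baltimore?"))  # True
```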
Error Handling Strategies
- Goldberg et al. 03: how should systems best inform the user that they don't understand?
- System rephrasing vs. repetition vs. statement of non-understanding
- Apologies
- What behaviors might these produce?
  - Hyperarticulation
  - User frustration
  - User repetition or rephrasing
- What lessons do we learn?
  - What produces the least frustration?
  - What produces the best-recognized input?
Evaluating Dialogue Systems
- PARADISE framework (Walker et al. 00)
- Performance of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent, and by how it gets accomplished (see the objective sketched below):
  - Maximize task success
  - Minimize costs: efficiency measures and qualitative measures
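
The PARADISE performance function, roughly as given by Walker et al., weighs normalized task success against a weighted sum of normalized costs; N denotes Z-score normalization, and alpha and the w_i are estimated by the regression described later:

```latex
\mathrm{Performance} \;=\; \alpha \cdot \mathcal{N}(\kappa) \;-\; \sum_{i=1}^{n} w_i \cdot \mathcal{N}(c_i)
```

where kappa is the task-success measure and the c_i are the cost measures.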
Task Success
- Task goals seen as an Attribute-Value Matrix (AVM)
- ELVIS e-mail retrieval task (Walker et al. 97)
  - "Find the time and place of your meeting with Kim."

  Attribute            Value
  Selection Criterion  Kim or Meeting
  Time                 10:30 a.m.
  Place                2D516

- Task success defined by the match between the AVM values at the end of the dialogue and the true values for the AVM (a sketch follows below)
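
A minimal sketch of that comparison, assuming both the scenario key and the values logged at the end of the dialogue are attribute-to-value dicts. Walker et al. actually pool dialogues into a confusion matrix per attribute and compute the kappa statistic; this per-dialogue version is a simplification.

```python
# Sketch: score task success by comparing the end-of-dialogue AVM against
# the scenario key. Walker et al. compute kappa over a confusion matrix
# pooled across dialogues; this is a simplified per-dialogue version.
from sklearn.metrics import cohen_kappa_score

scenario_key = {"selection": "Kim or Meeting", "time": "10:30 a.m.", "place": "2D516"}
logged_avm   = {"selection": "Kim or Meeting", "time": "10:30 a.m.", "place": "2D516"}

attrs = sorted(scenario_key)
truth = [scenario_key[a] for a in attrs]
guess = [logged_avm.get(a, "<missing>") for a in attrs]
# With many dialogues, extend `truth` and `guess` before computing kappa.
print(cohen_kappa_score(truth, guess))  # 1.0 for a perfect match
```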
Metrics
- Efficiency of the interaction: user turns, system turns, elapsed time
- Quality of the interaction: ASR rejections, timeout prompts, help requests, barge-ins, mean recognition score (concept accuracy), cancellation requests
- User satisfaction
- Task success: perceived completion, information extracted (one way to log these measures is sketched below)
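
One way to record these per dialogue is a simple log record; the field names below are shorthand for the measures above, not identifiers from the PARADISE papers.

```python
# Per-dialogue metric log covering the efficiency, quality, and
# task-success measures listed above. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class DialogueMetrics:
    user_turns: int                 # efficiency measures
    system_turns: int
    elapsed_time_s: float
    asr_rejections: int             # quality measures
    timeout_prompts: int
    help_requests: int
    barge_ins: int
    mean_recognition_score: float   # concept accuracy, 0..1
    cancellation_requests: int
    perceived_completion: bool      # task success
```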
Experimental Procedures
- Subjects given specified tasks
- Spoken dialogues recorded
- Cost factors, states, and dialogue acts automatically logged; ASR accuracy and barge-ins hand-labeled
- Users specify the task solution via a web page
- Users complete user satisfaction surveys
- Use multiple linear regression to model user satisfaction as a function of task success and costs; test for significant predictive factors (a sketch follows below)
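
A sketch of that final regression step on synthetic data; in the real procedure the rows come from the logged dialogues and surveys, and the factors are Z-score normalized so the fitted weights are comparable.

```python
# Sketch: fit User Satisfaction as a linear function of task success and
# costs. Data are synthetic stand-ins for the logged corpus.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 60
comp = rng.integers(0, 2, n).astype(float)   # perceived task completion
mrs  = rng.uniform(0.5, 1.0, n)              # mean recognition score
et   = rng.uniform(60, 600, n)               # elapsed time, seconds
user_sat = 10 * comp + 20 * mrs - 0.01 * et + rng.normal(0, 1, n)

X = np.column_stack([comp, mrs, et])
X = (X - X.mean(axis=0)) / X.std(axis=0)     # Z-score normalize, as in PARADISE
model = LinearRegression().fit(X, user_sat)
print(dict(zip(["COMP", "MRS", "ET"], model.coef_.round(2))))
```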
User Satisfaction: Sum of Many Measures
- Was Annie easy to understand in this conversation? (TTS Performance)
- In this conversation, did Annie understand what you said? (ASR Performance)
- In this conversation, was it easy to find the message you wanted? (Task Ease)
- Was the pace of interaction with Annie appropriate in this conversation? (Interaction Pace)
- In this conversation, did you know what you could say at each point of the dialog? (User Expertise)
- How often was Annie sluggish and slow to reply to you in this conversation? (System Response)
- Did Annie work the way you expected her to in this conversation? (Expected Behavior)
- From your current experience with using Annie to get your email, do you think you'd use Annie regularly to access your mail when you are away from your desk? (Future Use)
Performance Functions from Three Systems
- ELVIS: User Sat. = .21 COMP + .47 MRS - .15 ET
- TOOT: User Sat. = .35 COMP + .45 MRS - .14 ET
- ANNIE: User Sat. = .33 COMP + .25 MRS + .33 Help
- COMP: user perception of task completion (task success)
- MRS: mean recognition accuracy (cost)
- ET: elapsed time (cost)
- Help: help requests (cost)
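
Plugging hypothetical normalized scores into the ELVIS function shows how the weights trade off; the input values here are made up.

```python
# Hypothetical Z-normalized scores for one ELVIS dialogue.
comp, mrs, et = 1.0, 0.8, -0.5   # task completed, decent ASR, faster than average
user_sat = 0.21 * comp + 0.47 * mrs - 0.15 * et
print(round(user_sat, 3))        # 0.661
```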
Performance Model
- Perceived task completion and mean recognition score are consistently significant predictors of user satisfaction
- Performance model useful for system development:
  - Making predictions about system modifications
  - Distinguishing good dialogues from bad dialogues
- But can we also tell on-line when a dialogue is going wrong?
Next Week
- Speech summarization and data mining