Title: constructing accurate beliefs in task-oriented spoken dialog systems
1. constructing accurate beliefs in task-oriented spoken dialog systems
- Dan Bohus, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213
- www.cs.cmu.edu/dbohus
- dbohus_at_cs.cmu.edu
2. problem
- spoken language interfaces lack robustness when faced with understanding errors
- errors stem mostly from speech recognition
- typical word error rates: 20-30%
- significant negative impact on interactions
3. more concretely
- S: What city are you leaving from?
- U: Urbana Champaign [OKAY IN THAT SAME PAY]
- S: Sorry, I'm not sure I understood what you said. What city are you leaving from?
- U: Urbana Champaign [FOR MINUTE SINCE HEY]
- S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
- U: Chicago [CHICAGO]
- S: traveling from Chicago. Where would you like to go?
- U: Huntsville [SEOUL]
- S: traveling to Seoul. What day did you need to travel?
- U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M]
- S: traveling in the afternoon. Okay, what day would you be departing chicago
- U: /uh/ the tenth of august [AT THE TENTH OF AUGUST]
- S: a flight tomorrow I have a Korean Airlines flight departing Chicago at 1:40pm, arrives Seoul at 5pm the next day. Do you want to take that?
- U: my destination is Birmingham [FLIGHT DESTINATION MR WEEKEND]
- S: traveling on Saturday, August 12th I have a flight departing Chicago at 1:40pm arrives Seoul at
4. two types of understanding errors
5. approaches for increasing robustness
- gracefully handle errors through interaction
  - detect the problems
  - develop a set of recovery strategies
  - know how to choose between them (policy)
6. six not-so-easy pieces
7. today's talk
misunderstandings / detection
- construct more accurate beliefs by integrating information over multiple turns in a conversation
- S: Where would you like to go?
- U: Huntsville [SEOUL / 0.65]
- destination = seoul/0.65
- S: traveling to Seoul. What day did you need to travel?
- U: no no I'm traveling to Birmingham [THE TRAVELING TO BERLIN P_M / 0.60]
- destination = ?
8. belief updating: problem statement
- destination = seoul/0.65
- S: traveling to Seoul. What day did you need to travel?
- U: [THE TRAVELING TO BERLIN P_M / 0.60]
- destination = ?
- given
  - an initial belief $P_{initial}(C)$ over concept C
  - a system action SA
  - a user response R
- construct an updated belief
  - $P_{updated}(C) \leftarrow f(P_{initial}(C), SA, R)$
9. outline
- related work
- a restricted version
- data
- user response analysis
- experiments and results
- current and future work
10. current solutions
- most systems only track values, not beliefs
  - new values overwrite old values
- use confidence scores
- explicit confirm
  - yes → trust hypothesis
  - no → delete hypothesis
  - other → non-understanding
- implicit confirm: not much
- "users who discover errors through incorrect implicit confirmations have a harder time getting back on track" [Shin et al., 2002]
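The heuristic update rule listed on this slide can be sketched in Python. This is only an illustration under stated assumptions: the function name `heuristic_update`, the action and response labels, the use of 1.0 for "trust", and leaving the score unchanged for "other" responses are all hypothetical choices, not details taken from any particular system.

```python
from typing import Optional


def heuristic_update(initial_conf: float, action: str, response: str) -> Optional[float]:
    """Toy heuristic update for the top hypothesis of a concept.

    Returns the new confidence score, or None when the hypothesis is deleted.
    """
    if action == "explicit_confirm":
        if response == "yes":
            return 1.0           # yes -> trust the hypothesis (assumed to mean full confidence)
        if response == "no":
            return None          # no -> delete the hypothesis
        return initial_conf      # other -> non-understanding; assumed to leave the belief unchanged
    # implicit confirmation: "not much" is done, so the initial score is kept
    return initial_conf


print(heuristic_update(0.65, "explicit_confirm", "yes"))
print(heuristic_update(0.65, "explicit_confirm", "no"))
print(heuristic_update(0.65, "implicit_confirm", "no"))
```

Note how the implicit-confirmation branch ignores the response entirely, which is the gap the rest of the talk addresses.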
related work restricted version data user
response analysis results current and future
work
11. confidence / detecting misunderstandings
- traditionally focused on word-level errors [Chase, Cox, Bansal, Ravinshankar, and many others]
- recently: detecting misunderstandings [Walker, Wright, Litman, Bosch, Swerts, San-Segundo, Pao, Gurevych, Bohus, and many others]
- machine learning approach: binary classification
  - in-domain, labeled dataset
  - features from different knowledge sources
    - acoustic, language model, parsing, dialog management
  - 50% relative reduction in classification error
12. detecting corrections
- detect if the user is trying to correct the system [Litman, Swerts, Hirschberg, Krahmer, Levow]
- machine learning approach: binary classification
  - in-domain, labeled dataset
  - features from different knowledge sources
    - acoustic, prosody, language model, parsing, dialog management
  - 50% relative reduction in classification error
13. integration
- confidence annotation and correction detection are useful tools
- but separately, neither solves the problem
- bridge together in a unified approach to accurately track beliefs
15. belief updating: general form
- given
  - an initial belief $P_{initial}(C)$ over concept C
  - a system action SA
  - a user response R
- construct an updated belief
  - $P_{updated}(C) \leftarrow f(P_{initial}(C), SA, R)$
16. two simplifications
- 1. belief representation
  - system unlikely to hear more than 3 or 4 values for a concept within a dialog session
  - in our data, considering only the top hypothesis from recognition
    - max 3 (conflicting values heard)
    - only in 6.9% of cases, more than 1 value heard
  - compressed beliefs: top-K concept hypotheses + other
  - for now, K=1
- 2. updates following system confirmation actions
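The compressed belief representation (top-K hypotheses plus "other") can be illustrated with a short Python sketch. The function name `compress_belief`, the `OTHER` label, and the example probabilities are assumptions made for illustration only.

```python
from typing import Dict

OTHER = "<other>"  # hypothetical label for the lumped remainder of the distribution


def compress_belief(belief: Dict[str, float], k: int = 1) -> Dict[str, float]:
    """Keep the k most probable values; assign all remaining mass to OTHER."""
    top = sorted(belief.items(), key=lambda kv: kv[1], reverse=True)[:k]
    compressed = dict(top)
    compressed[OTHER] = max(0.0, 1.0 - sum(compressed.values()))
    return compressed


# Made-up belief over the destination concept
belief = {"seoul": 0.65, "berlin": 0.20, "birmingham": 0.05}
print(compress_belief(belief, k=1))
print(compress_belief(belief, k=2))
```

With K=1 the belief reduces to a single confidence score for the top hypothesis, which is the form used in the restricted version of the problem.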
17. belief updating: reduced version
- given
  - an initial confidence score $Conf_{init}(th_C)$ for the current top hypothesis $th_C$ for concept C
  - a system confirmation action SA
  - a user response R
- construct an updated confidence score for that hypothesis
  - $Conf_{upd}(th_C) \leftarrow f(Conf_{init}(th_C), SA, R)$
19. data
- collected with RoomLine
  - a phone-based mixed-initiative spoken dialog system
  - conference room reservation
- explicit and implicit confirmations
  - confidence threshold model (+ some exploration)
- unplanned implicit confirmations
  - I found 10 rooms for Friday between 1 and 3 p.m. Would you like a small room or a large one?
20. corpus
- user study
  - 46 participants (naïve users)
  - 10 scenario-based interactions each
  - compensated per task success
- corpus
  - 449 sessions, 8848 user turns
  - orthographically transcribed
  - manually annotated
    - misunderstandings
    - corrections
    - correct concept values
22. user response types
- following [Krahmer and Swerts, 2000]
  - study on Dutch train-table information system
- 3 user response types
  - YES: yes, right, that's right, correct, etc.
  - NO: no, wrong, etc.
  - OTHER
- cross-tabulated against correctness of system confirmations
23. user responses to explicit confirmations
            YES         NO          Other
CORRECT     94% [93%]   0% [0%]     5% [7%]
INCORRECT   1% [6%]     72% [57%]   27% [37%]
- numbers in brackets from Krahmer & Swerts
24. other responses to explicit confirmations
- 70%: users repeat the correct value
- 15%: users don't address the question
  - attempt to shift conversation focus
- how often do users correct the system?
            User does not correct     User corrects
CORRECT     1159                      0
INCORRECT   29 [10% of incorrect]     250 [90% of incorrect]
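The percentages for incorrect explicit confirmations follow directly from the two counts on the slide (29 and 250). A minimal Python check, where the helper name `correction_rates` is hypothetical:

```python
from typing import Tuple


def correction_rates(not_corrected: int, corrected: int) -> Tuple[float, float]:
    """Return (share not corrected, share corrected) for a row of counts."""
    total = not_corrected + corrected
    return not_corrected / total, corrected / total


# Counts for incorrect explicit confirmations, taken from the slide
no_corr, corr = correction_rates(29, 250)
print(f"{no_corr:.0%} not corrected, {corr:.0%} corrected")
```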
25. user responses to implicit confirmations
            YES        NO          Other
CORRECT     30% [0%]   7% [0%]     63% [100%]
INCORRECT   6% [0%]    33% [15%]   61% [85%]
- numbers in brackets from Krahmer & Swerts
26. ignoring errors in implicit confirmations
- how often do users correct the system?
            User does not correct     User corrects
CORRECT     552                       2
INCORRECT   118 [51% of incorrect]    111 [49% of incorrect]
- explanation
  - users correct later (40% of 118)
  - users interact strategically / correct only if essential
                not corrected later   corrected later
not critical    55                    2
critical        14                    47
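The "correct only if essential" pattern can be made explicit by computing, for each kind of error, the share that users corrected later. The sketch below assumes the slide's 2x2 table reads as follows: the first row is non-critical errors, the second row is critical errors, and the columns are "not corrected later" and "corrected later". That reading is an interpretation of the slide, and the variable names are hypothetical.

```python
from typing import Dict, Tuple

# (not corrected later, corrected later), under the assumed reading of the table
later_counts: Dict[str, Tuple[int, int]] = {
    "not critical": (55, 2),
    "critical": (14, 47),
}


def corrected_later_rate(counts: Tuple[int, int]) -> float:
    """Share of errors in one row that the user corrected later."""
    not_corrected, corrected = counts
    return corrected / (not_corrected + corrected)


rates = {kind: corrected_later_rate(c) for kind, c in later_counts.items()}
total = sum(a + b for a, b in later_counts.values())
total_corrected = sum(b for _, b in later_counts.values())
overall = total_corrected / total

for kind, rate in rates.items():
    print(f"{kind}: {rate:.1%} corrected later")
print(f"overall: {total_corrected} of {total} ({overall:.1%})")
```

Under this reading the four cells sum to the 118 uncorrected incorrect confirmations, and critical errors are corrected later far more often than non-critical ones.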
28. machine learning approach
- problem: $Conf_{upd}(th_C) \leftarrow f(Conf_{init}(th_C), SA, R)$
- need good probability outputs
  - low cross-entropy between model predictions and reality
- logistic regression
  - sample efficient
  - stepwise approach → feature selection
- logistic model tree for each action
  - root splits on response-type
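A logistic model tree whose root splits on the response type can be pictured as one logistic model per response type. The Python sketch below is a toy under stated assumptions: the class names, the single feature (initial confidence), and all coefficients are invented for illustration, and are not the learned model from the talk.

```python
import math
from dataclasses import dataclass
from typing import Dict


@dataclass
class LogisticLeaf:
    intercept: float
    weight: float  # weight on the initial confidence score

    def predict(self, initial_conf: float) -> float:
        """Probability that the top hypothesis is correct."""
        z = self.intercept + self.weight * initial_conf
        return 1.0 / (1.0 + math.exp(-z))


@dataclass
class ResponseTypeTree:
    leaves: Dict[str, LogisticLeaf]  # root split: one leaf per response type

    def update(self, initial_conf: float, response_type: str) -> float:
        return self.leaves[response_type].predict(initial_conf)


# Hypothetical coefficients for one system action (e.g. explicit confirmation)
tree = ResponseTypeTree({
    "YES": LogisticLeaf(intercept=2.0, weight=3.0),
    "NO": LogisticLeaf(intercept=-4.0, weight=2.0),
    "OTHER": LogisticLeaf(intercept=-1.0, weight=2.0),
})

for response in ("YES", "NO", "OTHER"):
    print(response, round(tree.update(0.65, response), 3))
```

One such tree per system action would give the "logistic model tree for each action" arrangement; a real leaf would use many more features than the initial confidence.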
29. features. target.
- initial: initial confidence score of top hypothesis, number of initial hypotheses, concept type (bool / non-bool), concept identity
- system action: indicators describing other system actions in conjunction with current confirmation
- user response
  - acoustic / prosodic: acoustic and language scores, duration, pitch (min, max, mean, range, std.dev, min and max slope, plus normalized versions), voiced-to-unvoiced ratio, speech rate, initial pause
  - lexical: number of words, lexical terms highly correlated with corrections (MI)
  - grammatical: number of slots (new, repeated), parse fragmentation, parse gaps
  - dialog: dialog state, turn number, expectation match, new value for concept, timeout, barge-in
- target: was the top hypothesis correct?
30. baselines
- initial baseline
  - accuracy of system beliefs before the update
- heuristic baseline
  - accuracy of heuristic update rule used by the system
- oracle baseline
  - accuracy if we knew exactly what the user said
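The baselines and the learned models are compared on two measures, hard error and soft error. A plausible reading, assumed here rather than stated on the slides, is that hard error is the classification error after thresholding the confidence at 0.5, and soft error is the average negative log-likelihood (cross-entropy) of the confidence scores. The function names and the toy data below are hypothetical.

```python
import math
from typing import Sequence


def hard_error(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Fraction of cases where the thresholded confidence disagrees with the label."""
    wrong = sum((p >= 0.5) != bool(y) for p, y in zip(probs, labels))
    return wrong / len(labels)


def soft_error(probs: Sequence[float], labels: Sequence[int], eps: float = 1e-12) -> float:
    """Average negative log-likelihood of the labels under the confidence scores."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -math.log(p) if y else -math.log(1.0 - p)
    return total / len(labels)


# Made-up confidence scores and correctness labels (1 = top hypothesis correct)
probs = [0.9, 0.2, 0.6]
labels = [1, 0, 0]
print(hard_error(probs, labels), soft_error(probs, labels))
```

Soft error penalizes confident mistakes much more heavily than hard error does, which is why a rule can do reasonably on one measure and poorly on the other.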
31. results: explicit confirmation
[bar chart: hard error (%) and soft error for initial, heuristic, logistic model tree, and oracle; hard-error bar labels: 31.15, 8.41, 3.57, 2.71; soft-error bar labels: 0.51, 0.19, 0.12]
32. results: implicit confirmation
[bar chart: hard error (%) and soft error for initial, heuristic, logistic model tree, and oracle; hard-error bar labels: 30.40, 23.37, 16.15, 15.33; soft-error bar labels: 0.67, 0.61, 0.43]
33. results: unplanned implicit confirmation
[bar chart: hard error (%) and soft error for initial, heuristic, logistic model tree, and oracle; hard-error bar labels: 15.40, 14.36, 12.64, 10.37; soft-error bar labels: 0.46, 0.43, 0.34]
34. informative features
- initial confidence score
- prosody features
- barge-in
- expectation match
- repeated grammar slots
- concept identity
35. summary
- data-driven approach for constructing accurate system beliefs
  - integrate information across multiple turns
  - bridge together detection of misunderstandings and corrections
- performs better than current heuristics
- user response analysis
  - users don't correct unless the error is critical
37. current extensions
belief representation
- top hypothesis + other
- logistic regression model
system action
38. 2 hypotheses + other
[bar charts comparing initial, heuristic, lmt(basic), lmt(basic+concept), and oracle; panels: implicit confirmation, unplanned impl. conf., explicit confirmation, unexpected update, request]
39. other work
misunderstandings
non-understandings
- costs for errors
- rejection threshold adaptation
- non-understanding impact on performance [Interspeech-05]
- transferring confidence annotators across domains [in progress]
detection
- comparative analysis of 10 recovery strategies [SIGdial-05]
strategies
- impact of policy on performance
- towards learning non-understanding recovery policies [SIGdial-05]
policy
- RavenClaw: dialog management for task-oriented systems
- RoomLine, Let's Go Public!, Vera, LARRI, TeamTalk, Sublime [EuroSpeech-03, HLT-05]
40. thank you! questions?
41. a more subtle caveat
- distribution of training data
  - confidence annotator + heuristic update rules
- distribution of run-time data
  - confidence annotator + learned model
- always a problem when interacting with the world!
- hopefully, distribution shift will not cause large degradation in performance
  - remains to validate empirically
  - maybe a bootstrap approach?
42. KL-divergence & cross-entropy
- KL divergence: $D(p \| q)$
- Cross-entropy: $CH(p, q) = H(p) + D(p \| q)$
- Negative log likelihood
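For reference, the standard definitions behind these three quantities can be written out as follows. The symbols $y_i$ (binary correctness label), $\hat{p}_i$ (predicted confidence), and $N$ (number of examples) are introduced here for illustration and do not appear on the slide.

```latex
\begin{align}
H(p) &= -\sum_x p(x)\,\log p(x) \\
D(p \,\|\, q) &= \sum_x p(x)\,\log\frac{p(x)}{q(x)} \\
CH(p, q) &= -\sum_x p(x)\,\log q(x) = H(p) + D(p \,\|\, q) \\
\text{NLL} &= -\frac{1}{N}\sum_{i=1}^{N}\Big[\, y_i \log \hat{p}_i + (1 - y_i)\log\big(1 - \hat{p}_i\big) \Big]
\end{align}
```

When $p$ is the empirical distribution of the labels and $q$ is the model, the cross-entropy reduces to the average negative log-likelihood, so minimizing one minimizes the other.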
43. logistic regression
- regression model for binomial (binary) dependent variables
- fit a model using max likelihood (avg log-likelihood)
  - any stats package will do it for you
- no $R^2$ measure
  - test fit using likelihood ratio test
- stepwise logistic regression
  - keep adding variables while data likelihood increases significantly
  - use Bayesian information criterion to avoid overfitting
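One step of this procedure can be sketched in plain Python: fit a logistic regression by maximizing the average log-likelihood, then compare an intercept-only model against a model with one added variable using the Bayesian information criterion, BIC = k ln(n) - 2 ln(L). This is a simplified illustration, not the procedure used in the talk: the gradient-ascent fit, the learning rate, the toy data, and the use of BIC alone as the acceptance test are all assumptions.

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X: Sequence[Sequence[float]], y: Sequence[int],
                 lr: float = 0.5, steps: int = 2000) -> List[float]:
    """Gradient ascent on the average log-likelihood; rows of X include a constant 1.0."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
            for j in range(d):
                grad[j] += err * xi[j]
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w


def log_likelihood(w: Sequence[float], X: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    ll = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)))
        ll += math.log(p) if yi == 1 else math.log(1.0 - p)
    return ll


def bic(ll: float, k: int, n: int) -> float:
    """Bayesian information criterion for a model with k parameters; lower is better."""
    return k * math.log(n) - 2.0 * ll


# Toy data: initial confidence scores and whether the hypothesis was correct
conf = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

X_base = [[1.0] for _ in conf]       # intercept only
X_full = [[1.0, c] for c in conf]    # intercept + candidate variable

w_base = fit_logistic(X_base, y)
w_full = fit_logistic(X_full, y)
ll_base = log_likelihood(w_base, X_base, y)
ll_full = log_likelihood(w_full, X_full, y)

bic_base = bic(ll_base, 1, len(y))
bic_full = bic(ll_full, 2, len(y))
keep_feature = bic_full < bic_base   # accept the variable only if BIC improves
print(ll_base, ll_full, bic_base, bic_full, keep_feature)
```

A full stepwise run would repeat this comparison over all candidate variables and stop when no addition improves the criterion.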
44. logistic regression
45. logistic model tree
- regression tree, but with logistic models on leaves
[figure: example tree with splits on f (f=0, f=1) and on g (g>10, g<10)]
46. user study
- 46 participants, 1st time users
- 10 scenarios, fixed order
- presented graphically (explained during briefing)
- participants compensated per task success