Title: Increased Robustness in Spoken Dialog Systems (roadmap to a thesis proposal)
1. Increased Robustness in Spoken Dialog Systems
- (roadmap to a thesis proposal)
- Dan Bohus
- SPHINX Lunch, May 2003
2. The problem
- S: Are you a registered user?
- U: No I'm not. No   [ASR: NO I'M NOT NO]
- S: What is your full name?
- U: Adam Schumacher   [ASR: ATHENS IN AKRON]
- S: There is an Athens in Georgia and in Greece. Which destination did you want?
- U: Neither   [ASR: NEITHER]
- S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want?
- U: Georgia   [ASR: GEORGIA]
- S: A flight from Athens... Where do you want to go?
- U: Start over   [ASR: START OVER]
- S: Do you really want to start over?
- U: Yes   [ASR: YES]
- S: What city are you leaving from?
- U: Hamilton, Ontario   [ASR: HILTON ONTARIO]
- S: Sorry, I'm not sure I understood what you said. Where are you leaving from?
- U: Hamilton   [ASR: HILTON]
- S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
- U: Toronto   [ASR: TORONTO]
3. Is this really a big problem?
- Lots of anecdotal evidence
- And some statistics
- CMU Communicator:
  - 66% of the sessions contain a serious misunderstanding
    - 26% result in a complete breakdown in interaction
    - The remaining 40% are still frustrating experiences
4. More statistics
- USC study, Shin et al (1)
  - Labeled errors and user behavior on the Communicator (multi-site) corpus
  - Average of 1.66 error segments per session
  - 78% of error segments get back on track
  - 37% of the sessions have errors leading to a complete breakdown in interaction
5. Yet more statistics
- Utterance-level understanding error rates:
  - CMU Communicator: 32.4% → 66% of sess. (Rudnicky, Bohus et al (2))
  - CU Communicator: 27.5% → ?% of sess. (San-Segundo (3))
  - HMIHY (AT&T): 36.5% → ?% of sess. (Walker (4))
  - Jupiter (MIT): 28.5% → ?% of sess. (Hazen (5))
6. It is a significant problem!
- 10-30% of sessions lead to interaction breakdowns
- 60-70% of sessions contain misunderstandings
7. Goal of proposed work
- [Figure: reduce interaction breakdowns and the number of sessions containing misunderstandings]
8. Outline
- The problem
- Sources of the problem
- The approach
- Infrastructure: the RavenClaw framework
- Proposed work, in detail
- Discussion
9. The problems, in more detail
- S: Are you a registered user?
- U: No I'm not. No   [ASR: NO I'M NOT NO]
- S: What is your full name?
- U: Adam Schumacher   [ASR: ATHENS IN AKRON]
- S: There is an Athens in Georgia and in Greece. Which destination did you want?
- U: Neither   [ASR: NEITHER]
- S: I'm sorry, I didn't catch that. There is an Athens in Georgia and in Greece. Which destination did you want?
- U: Georgia   [ASR: GEORGIA]
- S: A flight from Athens... Where do you want to go?
- U: Start over   [ASR: START OVER]
- S: Do you really want to start over?
- U: Yes   [ASR: YES]
- S: What city are you leaving from?
- U: Hamilton, Ontario   [ASR: HILTON ONTARIO]
- S: Sorry, I'm not sure I understood what you said. Where are you leaving from?
- U: Hamilton   [ASR: HILTON]
- S: I'm still having trouble understanding you. To go on, I need you to answer the following question. What city are you leaving from?
- U: Toronto   [ASR: TORONTO]
10. Three contributing factors
- 1. Low accuracy of speech recognition
- 2. Inability to assess reliability of beliefs
- 3. Lack of efficient error recovery and prevention mechanisms
11. Factor 1: Low recognition accuracy
- ASR is still imperfect at best
- Variability: environmental, speaker
- 10-30% WER in spoken language systems
- Tradeoff: accuracy vs. system flexibility
- Effect: main source of errors in SDS
  - WER is the most important predictor of user satisfaction, Walker et al (6,7)
  - Users prefer less flexible, more accurate systems, Walker et al (8)
12. Factor 2: Inability to assess reliability of beliefs
- Errors typically propagate to the upper levels of the system, leading to:
  - Non-understandings
  - Misunderstandings
- Effect: misunderstandings are taken as facts and acted upon
  - At best: extra turns, user-initiated repairs, frustration
  - At worst: complete breakdown in interaction
13. Factor 3: Lack of recovery mechanisms
- Small number of strategies
  - Implicit and explicit verifications are the most popular
- Sub-optimal implementations
  - Triggered in an ad-hoc / heuristic manner
  - Problem is often regarded as an add-on
  - Non-uniform, domain-specific treatment
- Effect: systems are prone to complete breakdowns in interaction
14. Outline
- The problem
- Sources of the problem
- The approach
- Infrastructure: the RavenClaw framework
- Proposed work, in detail
- Discussion
15. Three contributing factors
- 1. Low accuracy of speech recognition
- 2. Inability to assess reliability of beliefs
- 3. Lack of efficient error recovery and prevention mechanisms
16. Approach 1
- 1. Low accuracy of speech recognition
- 2. Inability to assess reliability of beliefs
- 3. Lack of efficient error recovery and prevention mechanisms
17. Approach 2
- 1. Low accuracy of speech recognition
- 2. Inability to assess reliability of beliefs
- 3. Lack of efficient error recovery and prevention mechanisms
18. Why not just fix ASR?
- ASR performance is improving, but requirements are increasing too
- ASR will not become perfect anytime soon
- ASR is not the only source of errors
- Approach 2: ensure robustness under a large variety of conditions
19. Proposed solution
- Assuming the inputs are unreliable:
  A. Make systems able to assess the reliability of their beliefs
  B. Optimally deploy a set of error prevention and recovery strategies
20. Proposed solution, more precisely
- Assuming the inputs are unreliable:
  1. Compute grounding state indicators
     - reliability of beliefs (confidence annotation / updating)
     - correction detection
     - goodness-of-dialog metrics
     - other user models, etc.
  B. Optimally deploy a set of error prevention and recovery strategies
21. Proposed solution, more precisely
- Assuming the inputs are unreliable:
  1. Compute grounding state indicators
     - reliability of beliefs (confidence annotation / updating)
     - correction detection
     - goodness-of-dialog metrics
     - other user models, etc.
  2. Define the grounding actions
     - error prevention and recovery strategies
  3. Create a grounding decision model
     - decides upon the optimal strategy to employ at a given point
- Do it in a domain-independent manner!
22. Outline
- The problem
- Sources of the problem
- The approach
- Infrastructure: the RavenClaw framework
- Proposed work, in detail
- Discussion
23. The RavenClaw DM framework
- Dialog management framework for complex, task-oriented dialog systems
- Separation between Dialog Task and Generic Conversational Skills
  - Developer focuses only on the Dialog Task description
  - Dialog Engine automatically ensures a minimum set of conversational skills
  - Dialog Engine automatically ensures the grounding behaviors
24. RavenClaw architecture
- [Figure: dialog task tree with agents Communicator, Welcome, Login, Travel, Locals, Bye, AskRegistered, GreetUser, GetProfile, Leg1, AskName, DepartLocation, ArriveLocation]
- Dialog Task implemented by a hierarchy of agents
- Information captured in concepts
  - Probability distributions over sets of values (see the sketch below)
  - Support for belief assessment and grounding mechanisms
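To make "probability distributions over sets of values" concrete, here is a minimal sketch; it is not the actual RavenClaw API, and the class and method names are hypothetical.

```python
# Hypothetical sketch of a RavenClaw-style "concept": a slot whose value is
# held as a probability distribution over candidate hypotheses rather than a
# single best guess. Names and structure are illustrative only.

class Concept:
    def __init__(self, name):
        self.name = name
        self.belief = {}          # value -> probability

    def update_from_hypotheses(self, scored_hypotheses):
        """Set the belief from (value, score) pairs, e.g. an ASR n-best list."""
        total = sum(score for _, score in scored_hypotheses) or 1.0
        self.belief = {value: score / total for value, score in scored_hypotheses}

    def top_hypothesis(self):
        """Return the most likely value and its confidence."""
        if not self.belief:
            return None, 0.0
        value = max(self.belief, key=self.belief.get)
        return value, self.belief[value]

# Example: the departure city concept after a noisy recognition result.
depart_city = Concept("departure_city")
depart_city.update_from_hypotheses([("Hamilton", 0.4), ("Hilton", 0.6)])
print(depart_city.top_hypothesis())   # ('Hilton', 0.6) -- a likely misrecognition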
25. Domain-Independent Grounding
26. RavenClaw-based systems
- LARRI (Symphony): Language-based Assistant for Retrieval of Repair Information
- IPA (NASA Ames): Intelligent Procedure Assistant
- BusLine (Let's Go!): Pittsburgh bus route information
- RoomLine: conference room reservation at CMU
- TeamTalk (11-754): spoken command and control for a team of robots
27. Outline
- The problem
- Sources of the problem
- The approach
- Infrastructure: the RavenClaw framework
- Proposed work, in detail
- Discussion
28. Previous / Proposed Work Overview
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
29. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
30. Reliability of beliefs
- Continuously assess the reliability of beliefs
- Two sub-problems:
  - Computing the initial confidence in a concept
    - The confidence annotation problem
  - Updating confidence based on events in the dialog
    - User reaction to implicit or explicit verifications
    - Domain reasoning
31. Confidence annotation
- Traditionally focused on ASR: Chase (9)
- More recently, interest in CA geared towards use in SDS: Walker (4), San-Segundo (3), Hazen (5), Rudnicky, Bohus et al (2)
  - Utterance-level and concept-level CA
- Integrating multiple features (see the sketch below)
  - ASR: acoustic / LM scores, lattice, n-best
  - Parser: various measures of parse goodness
  - Dialog Management: state, expectations, history, etc.
- 50% relative improvement in classification error
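As a rough illustration of combining heterogeneous features into a confidence score: this is a sketch under assumed feature names and a logistic-regression learner, not the actual feature set or classifier behind the figure above.

```python
# Minimal sketch of concept-level confidence annotation as binary classification.
# Feature names and the logistic-regression choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row combines ASR-, parser-, and dialog-level features for one concept
# hypothesis; the label says whether the hypothesis was in fact correct.
feature_names = [
    "asr_acoustic_score", "asr_lm_score", "nbest_agreement",   # ASR features
    "parse_fragmentation", "parse_coverage",                   # parser features
    "dialog_expectation_match", "turn_number",                 # dialog features
]

X_train = np.array([
    [0.82, 0.75, 0.9, 0.1, 0.95, 1.0, 3],
    [0.41, 0.30, 0.2, 0.6, 0.40, 0.0, 7],
    [0.77, 0.68, 0.8, 0.2, 0.90, 1.0, 2],
    [0.35, 0.25, 0.1, 0.7, 0.35, 0.0, 9],
])
y_train = np.array([1, 0, 1, 0])   # 1 = concept value was correct

annotator = LogisticRegression().fit(X_train, y_train)

# At runtime, the predicted probability becomes the concept's initial confidence.
new_features = np.array([[0.60, 0.55, 0.5, 0.3, 0.7, 1.0, 4]])
initial_confidence = annotator.predict_proba(new_features)[0, 1]
print(f"initial confidence: {initial_confidence:.2f}")
```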
32. Confidence annotation: To-Do List
- Improve accuracy even more
  - More features / fewer features / better features
- Study transferability across domains
  - Q: Can we identify a set of features that transfer well?
  - Q: Can we use un- or semi-supervised learning, or bootstrap from little data and an annotator in a different domain?
33. Confidence updating
- To my knowledge, not really studied yet!
34. Confidence updating: approaches
- Naïve Bayesian updating (see the sketch below)
  - Assumptions do not match reality
- Analytical model
  - Set of heuristic / probabilistic rules
- Data-driven model
  - Define events as features
  - Learning task: Initial Conf. + E1, E2, E3, ... → Current Conf. (1/0)
- Bypass confidence updating
  - Keep all events as grounding state indicators (doesn't lose that much information)
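For the naïve Bayesian option, here is a minimal sketch of what updating a concept's confidence from dialog events could look like; the event likelihoods are invented for illustration, and a real model would estimate them from data.

```python
# Illustrative sketch of naive Bayesian confidence updating: the concept's
# confidence is treated as P(correct), and each observed dialog event E is
# assumed conditionally independent given correctness. Likelihoods are made up.

# P(event | concept value is correct), P(event | concept value is incorrect)
EVENT_LIKELIHOODS = {
    "user_confirmed_explicitly":     (0.90, 0.10),
    "user_corrected_after_implicit": (0.05, 0.60),
    "user_ignored_implicit_confirm": (0.70, 0.30),
}

def update_confidence(confidence, events):
    """Update P(correct) after a sequence of observed dialog events."""
    p_correct, p_incorrect = confidence, 1.0 - confidence
    for event in events:
        l_correct, l_incorrect = EVENT_LIKELIHOODS[event]
        p_correct *= l_correct
        p_incorrect *= l_incorrect
    total = p_correct + p_incorrect
    return p_correct / total if total > 0 else confidence

# Example: an initially uncertain departure city, then the user corrects it
# after an implicit confirmation ("traveling from Hilton..." -> "no, Hamilton").
print(update_confidence(0.6, ["user_corrected_after_implicit"]))   # drops well below 0.6
```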
35. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
36. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
37. Correction Detection
- Automatically detect, at run-time, correction sites or aware sites
- Another data-driven classification task (see the feature sketch below)
  - Prosodic features, bag-of-words features, lexical markers: Litman (10), Bosch (11), Swerts (12), Levow (13)
- Useful for:
  - implementation of implicit / explicit verifications
  - belief assessment / updating
  - as a direct indicator for grounding decisions
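As an illustration of the kind of prosodic and lexical evidence such a detector might use; the feature choices are my assumptions, loosely inspired by the cited work rather than their exact feature sets.

```python
# Illustrative feature extraction for correction / aware-site detection.
# Marker words and hyperarticulation proxies are assumed, not taken from the
# cited papers; a real detector would feed these into a trained classifier.

CORRECTION_MARKERS = {"no", "not", "wrong", "said", "meant", "repeat"}

def correction_features(turn_text, turn_duration_sec, prev_turn_text,
                        mean_pitch_hz, prev_mean_pitch_hz):
    words = turn_text.lower().split()
    prev_words = prev_turn_text.lower().split()
    return {
        # lexical cues
        "num_marker_words": sum(w in CORRECTION_MARKERS for w in words),
        "repeats_prev_words": len(set(words) & set(prev_words)) / max(len(words), 1),
        # prosodic cues (hyperarticulated corrections often raise pitch, slow down)
        "pitch_increase": mean_pitch_hz - prev_mean_pitch_hz,
        "speaking_rate": len(words) / max(turn_duration_sec, 1e-3),
        "turn_length": len(words),
    }

# Example: "HILTON" was heard, the user repeats "Hamilton" more slowly and higher.
print(correction_features("no Hamilton", 1.8, "Hamilton Ontario", 215.0, 190.0))
```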
38. Correction Detection: To-Do List
- Build an aware-site detector
  - Q: Can we identify what the user is correcting?
- Study transferability across domains
  - Q: Can we identify a set of features that transfer well?
  - Q: Can we use un- or semi-supervised learning, or bootstrap from little data and a detector in a different domain?
39. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
40. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
41. Goodness-of-dialog indicators
- Assessing how well a conversation is advancing
- Non-understandings
  - Q: Can we identify the cause?
  - Q: Can we relate a non-understood utterance to a dialog expectation?
- Dialog-state-related indicators / Stay_Here
  - Q: Can we expand this to some distance to an optimal dialog trace?
- Overall confidence in beliefs within a topic
  - Q: How to aggregate? Entropy-based measures? (see the sketch below)
- Allow for task-specific metrics of goodness
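One possible reading of "entropy-based measures" for aggregating within-topic confidence, sketched under the assumption that each concept carries a belief distribution; this is a speculative illustration, not a method fixed by the proposal.

```python
# Speculative sketch: aggregate "goodness" over a topic as the average
# certainty (1 - normalized entropy) of the concepts' belief distributions.
import math

def normalized_entropy(belief):
    """Entropy of a value distribution, scaled to [0, 1] (1 = maximally uncertain)."""
    values = [p for p in belief.values() if p > 0]
    if len(values) <= 1:
        return 0.0
    h = -sum(p * math.log(p) for p in values)
    return h / math.log(len(values))

def topic_confidence(concepts):
    """Average certainty across the concepts of a topic."""
    if not concepts:
        return 1.0
    return sum(1.0 - normalized_entropy(b) for b in concepts.values()) / len(concepts)

# Example: a room-reservation topic where the date is settled but the time is not.
room_topic = {
    "date": {"Tuesday": 0.9, "Thursday": 0.1},
    "time": {"10am": 0.4, "2pm": 0.35, "4pm": 0.25},
}
print(round(topic_confidence(room_topic), 2))
```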
42. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
43. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
44. Grounding Actions
- Design and evaluate a rich set of strategies for preventing and recovering from errors (both misunderstandings and non-understandings)
- Current status: few strategies used / analyzed
  - Explicit verification: "Did you say Pittsburgh?"
  - Implicit verification: "Traveling from Pittsburgh... when do you want to leave?"
45. Explicit / Implicit Verifications
- Analysis of user behavior following these 2 strategies: Krahmer (10), Swerts (11)
- User behavior is rich; correction detectors are important!
- Design is important!
  - "Did you say Pittsburgh?"
  - "Did you say Pittsburgh? Please respond yes or no."
  - "Do you want to fly from Pittsburgh?"
- Correct implementation / adequate support is important!
  - Users discovering errors through implicit confirmations are less likely to get back on track... hmm
46. Strategies for misunderstandings
- Explicit verification (w/ variants)
- Implicit verification (w/ variants)
- Disambiguation
  - "I'm sorry, are you flying out of Pittsburgh or San Francisco?"
- Rejection
  - "I'm not sure I understood what you said. Can you tell me again where you are flying from?"
47. Strategies for non-understandings - I
- Lexically entrain
  - "Right now I need you to tell me the departure city. You can say, for instance, 'I'd like to fly from Pittsburgh.'"
- Ask repeat
  - "I'm not sure I understood you. Can you repeat that, please?"
- Ask reformulate
  - "Can you please rephrase that?"
- Diagnose
  - If the source of the non-understanding can be known / estimated, give that information to the user
  - "I can't hear you very well. Can you please speak closer to the microphone?"
48. Strategies for non-understandings - II
- Select alternative plan: domain-specific strategies
  - E.g. try to get the state name first, then the city name
- Establish context (Confirm context variant)
  - "Right now I'm trying to gather enough information to make a room reservation. So far I know you want a room on Tuesday. Now I need to know for what time you need the room."
- Give targeted help
  - Give help on the topic / focus of the conversation / estimated user goal
- Constrain language model / recognition
49. Strategies for non-understandings - III
- Switch input modality (e.g. DTMF, pen, etc.)
- Restart topic / back up dialog
- Start over
- Switch to operator
- Terminate session
- ...
50. Grounding Strategies: To-Do List
- Design, implement, analyze, iterate
  - Human-human dialog analysis
  - Design the strategies, with variants and appropriate support
  - Implement in the RavenClaw framework
  - Perform data-driven analysis
    - Q: User behaviors
    - Q: Applicability conditions
    - Q: Costs, success rates
51. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
52. Grounding decision model
- Decide which is the best grounding action to take at a certain time
- Goals / desired properties:
  - Domain-independent
  - Adaptive
    - Learn and target any dialog performance metric
    - Adjust to large variations in the reliability of inputs
    - Accept any new strategies on the fly
  - Scalable
53. Previous work
- Conversation as action under uncertainty: Horvitz (14), Paek (15)
  - Bayesian decision theory with assumed utilities
- Reinforcement learning in spoken dialog systems: Kearns (16), Singh (17), Pieraccini (18), Litman (19), Walker (20)
  - Learning dialog policies
- Heuristic approaches [add refs]
  - Predominant in today's systems
54. Grounding: Decision-Theoretic Approach
- Given:
  - A set of states S = {s} and a probabilistic model of state given some evidence e, P(s|e)  ← grounding state indicators
  - A set of actions A = {a}  ← grounding actions
  - A model describing the utility of each action from each state, U(s,a)  ← grounding model
- Take the action that maximizes expected utility:
  - EU(a|e) = Σ_s U(a,s) · P(s|e)   (toy example below)
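A toy instance of this decision rule, with invented states, actions, utilities, and belief; it is purely illustrative and not the proposal's actual model.

```python
# Toy illustration of the decision rule EU(a|e) = sum_s U(a,s) * P(s|e).
# States, actions, utilities, and the belief are invented for illustration.

# P(s|e): belief over whether the top concept hypothesis is correct,
# as produced by the grounding state indicators.
belief = {"correct": 0.55, "incorrect": 0.45}

# U(a,s): utility of each grounding action in each state (made-up numbers).
utilities = {
    "no_action":        {"correct": 1.0, "incorrect": -3.0},
    "implicit_confirm": {"correct": 0.8, "incorrect": -1.0},
    "explicit_confirm": {"correct": 0.2, "incorrect":  0.5},
}

def best_grounding_action(belief, utilities):
    """Pick the action maximizing expected utility under the current belief."""
    def expected_utility(action):
        return sum(utilities[action][s] * p for s, p in belief.items())
    return max(utilities, key=expected_utility)

# With confidence this low, verifying explicitly wins over accepting silently.
print(best_grounding_action(belief, utilities))   # -> explicit_confirm
```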
55. The missing ingredient: utilities
56. Learning utilities
- Essentially a POMDP problem
  - Hidden state
    - Belief dictated by the grounding state indicator models
  - Actions
    - Strategies
  - Rewards
    - Targeted optimization measures (see the estimation sketch below)
- [Figure: grounding-action diagram with labels EV, IC, IV, NGA, C, U]
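One simple way utilities could be estimated from logged dialogs, sketched here as a Monte Carlo average over discretized belief bins; this is an assumption for illustration, and the proposal's actual learning method (e.g. a POMDP solver or reinforcement learning) may differ.

```python
# Speculative sketch of estimating U(s,a) from logged dialogs: discretize the
# belief (confidence) into a few state bins and average the observed reward
# for each (state, action) pair. Logged data below is invented.
from collections import defaultdict

def belief_bin(confidence):
    return "low" if confidence < 0.4 else "medium" if confidence < 0.7 else "high"

# Logged (confidence, grounding action, observed reward) triples.
log = [
    (0.85, "no_action", +1.0), (0.30, "no_action", -3.0),
    (0.35, "explicit_confirm", +0.5), (0.80, "explicit_confirm", +0.2),
    (0.55, "implicit_confirm", +0.6), (0.45, "implicit_confirm", -0.5),
]

totals, counts = defaultdict(float), defaultdict(int)
for confidence, action, reward in log:
    key = (belief_bin(confidence), action)
    totals[key] += reward
    counts[key] += 1

utilities = {key: totals[key] / counts[key] for key in totals}
for (state, action), value in sorted(utilities.items()):
    print(f"U({state}, {action}) = {value:+.2f}")
```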
57. A possible overall architecture
- Two types of grounding models:
  - For misunderstandings: one grounding model per concept
  - For non-understandings: one grounding model per agent
- [Dialog task tree figure, as on slide 24]
58. A possible overall architecture
- Q: How to combine the decisions?
  - Identify a small set of rules
    - E.g. concepts first, then agents, focused-to-top
  - Hierarchical POMDP approaches: Roy, Pineau, Thrun
- [Dialog task tree figure, as on slide 24]
59. A possible overall architecture
- Q: Formulate a parallel learning problem?
  - Large numbers of small models are good in principle
  - Need to clearly identify the assumptions
- Or a hierarchical learning problem
- [Dialog task tree figure, as on slide 24]
60. Proposed Work, in Detail - Outline
1. Compute grounding state indicators
   - reliability of beliefs (confidence annotation / updating)
   - correction detection
   - goodness-of-dialog metrics
   - other user models, etc.
2. Define the grounding actions
   - error prevention and recovery strategies
3. Create a grounding decision model
   - decides upon the optimal strategy to employ at a given point
61. Evaluation
- [Figure: reduction in interaction breakdowns and in sessions containing misunderstandings, as on slide 7]
62. Evaluation
- Evaluate the proposed framework across a large variety of domains
  - RoomLine, BusLine, LARRI, TeamTalk, etc.
- Grounding state indicators evaluation
  - Internal metrics, e.g. accuracy, etc.
- Grounding strategies analysis
  - Empirical analysis
  - Quantitative assessments: costs, success rates
  - Qualitative insights: user behaviors, best variants
63. Evaluation
- Grounding model / framework evaluation (in terms of the chosen performance metric)
  - Against an expert heuristic strategy
  - Against a smaller number of strategies
  - Against a non-adaptive system
64. ? and !