Title: Learning Optimal Strategies for Spoken Dialogue Systems
1. Learning Optimal Strategies for Spoken Dialogue Systems
- Diane Litman
- University of Pittsburgh
- Pittsburgh, PA 15260 USA
2. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
3. Motivation
- Builders of real-time spoken dialogue systems face fundamental design choices that strongly influence system performance:
  - when to confirm/reject/clarify what the user just said?
  - when to ask a directive versus an open prompt?
  - when to use user, system, or mixed initiative?
  - when to provide positive/negative/no feedback?
  - etc.
- Can such decisions be automatically optimized via reinforcement learning?
4. Spoken Dialogue Systems (SDS)
- Provide voice access to a back-end via telephone or microphone
- Front-end: ASR (automatic speech recognition) and TTS (text-to-speech)
- Back-end: DB, web, etc.
- Middle: dialogue policy (what action to take at each point in a dialogue)
5. Typical SDS Architecture
- (Architecture diagram: Language Understanding, Dialogue Policy, Domain Back-end, Language Generation)
6. Reinforcement Learning (RL)
- Learning is associated with a reward
- By optimizing reward, the algorithm learns the optimal strategy
- Application to SDS
  - Key assumption: the SDS can be represented as a Markov Decision Process
  - Key benefit: formalization (when in a state, what is the reward for taking a particular action, among all action choices?)
7. Reinforcement Learning and SDS
- Debate over design choices
- Learn choices using reinforcement learning
- Agent interacting with an environment
- Noisy inputs
- Temporal / sequential aspect
- Task success / failure
- (Architecture diagram: Language Understanding passes noisy semantic input to the Dialogue Manager, which sends actions (semantic output) to Language Generation and queries the Domain Back-end)
8. Sample Research Questions
- Which aspects of dialogue management are amenable to learning, and what reward functions are needed?
- What representation of the dialogue state best serves this learning?
- What reinforcement learning methods are tractable with large-scale dialogue systems?
9. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
10. Markov Decision Processes (MDPs)
- Characterized by:
  - a set of states S an agent can be in
  - a set of actions A the agent can take
  - a reward r(a,s) that the agent receives for taking an action in a state
  - (some other things I'll come back to: gamma, state transition probabilities)
11. Modeling a Spoken Dialogue System as a Probabilistic Agent
- An SDS can be characterized by:
  - the current knowledge of the system
  - a set of states S the agent can be in
  - a set of actions A the agent can take
  - a goal G, which implies
    - a success metric that tells us how well the agent achieved its goal
    - a way of using this metric to create a strategy or policy π for what action to take in any particular state
12. Reinforcement Learning
- The agent interacts with its environment to achieve a goal
- It receives reward (possibly delayed reward) for its actions
  - it is not told what actions to take
  - instead, it learns from indirect, potentially delayed reward to choose sequences of actions that produce the greatest cumulative reward
- Trial-and-error search
  - neither exploitation nor exploration can be pursued exclusively without failing at the task
- Life-long learning
  - on-going exploration
13. Reinforcement Learning
- (Agent-environment loop: the policy π : S → A maps states to actions; at each step the agent observes state s_t, takes action a_t, and receives reward r_t, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...)
14. State Value Function, V
- V(s) predicts the future total reward we can obtain by entering state s
- Example values: V(s1) = 10, V(s2) = 15, V(s3) = 6
- Example transitions from s0: p(s0, a1, s1) = 0.7, p(s0, a1, s2) = 0.3, r(s0, a1) = 2; p(s0, a2, s2) = 0.5, p(s0, a2, s3) = 0.5, r(s0, a2) = 5
- π can exploit V greedily, i.e. in s, choose the action a for which r(s, a) + Σ_s' p(s, a, s') V(s') is largest
- Choosing a1: 2 + 0.7·10 + 0.3·15 = 13.5; choosing a2: 5 + 0.5·15 + 0.5·6 = 15.5 (see the check below)
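To make the greedy choice concrete, here is a minimal Python check of the numbers on this slide: a one-step lookahead over r and p picks a2 because 15.5 > 13.5. The dictionary encoding is just one convenient representation of the example, not part of the lecture.

    # One-step lookahead using the slide's example values: V(s1)=10, V(s2)=15, V(s3)=6
    V = {"s1": 10, "s2": 15, "s3": 6}
    p = {("s0", "a1"): {"s1": 0.7, "s2": 0.3},      # p(s0, a1, s')
         ("s0", "a2"): {"s2": 0.5, "s3": 0.5}}      # p(s0, a2, s')
    r = {("s0", "a1"): 2, ("s0", "a2"): 5}

    def lookahead(s, a):
        # r(s, a) + sum over s' of p(s, a, s') * V(s')
        return r[(s, a)] + sum(prob * V[s2] for s2, prob in p[(s, a)].items())

    best = max(["a1", "a2"], key=lambda a: lookahead("s0", a))
    print(lookahead("s0", "a1"), lookahead("s0", "a2"), best)   # 13.5 15.5 a2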
15. Action Value Function, Q
- Q(s, a) predicts the future total reward we can obtain by executing a in s
- Example values: Q(s0, a1) = 13.5, Q(s0, a2) = 15.5, Q(s1, a1) = ..., Q(s1, a2) = ...
- π can exploit Q greedily, i.e. in s, choose the action a for which Q(s, a) is largest
16. Q Learning
- Exploration versus exploitation:
  For each (s, a), initialise Q(s, a) arbitrarily
  Observe the current state s
  Do until reaching the goal state:
    Select action a by exploiting Q ε-greedily, i.e. with probability ε choose a randomly, else choose the a for which Q(s, a) is largest
    Execute a, entering state s' and receiving immediate reward r
    Update the table entry for Q(s, a)
    s ← s'
- [Watkins 1989] (see the sketch below)
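A minimal sketch of this tabular Q-learning loop in Python. The toy interface (a step(s, a) function returning the next state and reward, plus start and goal states) is an assumption for illustration, not the lecture's dialogue environment; alpha and gamma are placeholder values.

    import random
    from collections import defaultdict

    def q_learning(actions, step, start, goal, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                    # Q[(s, a)], initialised to 0
        for _ in range(episodes):
            s = start
            while s != goal:
                if random.random() < epsilon:     # explore
                    a = random.choice(actions)
                else:                             # exploit: argmax_a Q(s, a)
                    a = max(actions, key=lambda x: Q[(s, x)])
                s_next, reward = step(s, a)       # execute a, observe s' and r
                best_next = max(Q[(s_next, x)] for x in actions)
                Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q

The epsilon parameter controls the exploration/exploitation trade-off named on the slide.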
17. More on Q Learning
- (Diagram: the experience tuple s, a, r, s' is used to update the table entry Q(s, a) from the immediate reward r and the next-state values Q(s', a'); the standard Q-learning update is Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ])
18. A Brief Tutorial Example
- A Day-and-Month dialogue system
- Goal: fill in a two-slot frame
  - Month: November
  - Day: 12th
  - via the shortest possible interaction with the user
- Levin, E., Pieraccini, R., and Eckert, W. A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing. 2000.
19. What is a State?
- In principle, the MDP state could include any possible information about the dialogue
  - complete dialogue history so far
- Usually we use a much more limited set
  - values of slots in the current frame
  - most recent question asked to the user
  - user's most recent answer
  - ASR confidence
  - etc.
20. State in the Day-and-Month Example
- Values of the two slots, day and month
- Total:
  - 2 special states, initial si and final sf
  - 365 states with a day and month
  - 1 state for leap year
  - 12 states with a month but no day
  - 31 states with a day but no month
  - 411 total states
21. Actions in MDP Models of Dialogue
- Speech acts!
- Ask a question
- Explicit confirmation
- Rejection
- Give the user some database information
- Tell the user their choices
- Do a database query
22. Actions in the Day-and-Month Example
- ad: a question asking for the day
- am: a question asking for the month
- adm: a question asking for the day and the month
- af: a final action, submitting the form and terminating the dialogue
23. A Simple Reward Function
- For this example, let's use a cost function for the entire dialogue
- Let
  - Ni = number of interactions (duration of dialogue)
  - Ne = number of errors in the obtained values (0-2)
  - Nf = expected distance from goal (0 for a complete date, 1 if either the day or the month is missing, 2 if both are missing)
- Then the (weighted) cost is C = wi·Ni + we·Ne + wf·Nf (see the sketch below)
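A minimal sketch of this cost function in Python; the weight values below are arbitrary placeholders, not values from the lecture.

    def dialogue_cost(n_interactions, n_errors, n_missing, wi=1.0, we=2.0, wf=3.0):
        # C = wi*Ni + we*Ne + wf*Nf
        return wi * n_interactions + we * n_errors + wf * n_missing

    # e.g. a 2-turn dialogue with one misrecognised slot and a complete date:
    print(dialogue_cost(n_interactions=2, n_errors=1, n_missing=0))   # 4.0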
24. 3 Possible Policies
- (Diagram of three candidate policies: a dumb policy, an open-prompt policy, and a directive-prompt policy; P1 = probability of error in the open prompt, P2 = probability of error in the directive prompt)
25. 3 Possible Policies
- Strategy 3 is better than strategy 2 when the improved error rate justifies the longer interaction
- (Diagram comparing the OPEN and DIRECTIVE policies in terms of P1, the probability of error in the open prompt, and P2, the probability of error in the directive prompt)
26. That was an Easy Optimization
- Only two actions, only a tiny number of policies
- In general, the number of actions, states, and policies is quite large
- So finding the optimal policy is harder
- We need reinforcement learning
- Back to MDPs
27. MDP
- We can think of a dialogue as a trajectory in state space
- The best policy is the one with the greatest expected reward over all trajectories
- How to compute a reward for a state sequence?
28. Reward for a State Sequence
- One common approach: discounted rewards
- The cumulative reward Q of a sequence is the discounted sum of the utilities of the individual states
- Discount factor γ between 0 and 1
- Makes the agent care more about current than future rewards: the further in the future a reward is, the more discounted its value (see the formula below)
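Written out in the standard form, the discounted cumulative reward of a sequence of per-state rewards r_0, r_1, ... with discount factor γ is:

    R = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t \ge 0} \gamma^{t} r_t, \qquad 0 \le \gamma \le 1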
29. The Markov Assumption
- MDP assumes that state transitions are Markovian
30. Expected Reward for an Action
- The expected cumulative reward Q(s, a) for taking a particular action from a particular state can be computed by the Bellman equation (written out below):
  - immediate reward for the current state
  - plus the expected discounted utility of all possible next states s'
  - weighted by the probability of moving to that state s'
  - and assuming that once there we take the optimal action a'
31. Needed for the Bellman Equation
- A model of P(s'|s, a) and an estimate of R(s, a)
- If we had labeled training data:
  - P(s'|s, a) = C(s, s', a) / C(s, a)
- If we knew the final reward for the whole dialogue, R(s1, a1, s2, a2, ..., sn)
- Given these parameters, we can use the value iteration algorithm to learn Q values (pushing reward values back over state sequences) and hence the best policy (see the counting sketch below)
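A minimal sketch of the counting step in Python, assuming the dialogues have been logged as lists of (state, action, next_state) triples; that logging format is an assumption for illustration, not the lecture's.

    from collections import Counter, defaultdict

    def estimate_transitions(dialogues):
        c_sa, c_sas = Counter(), Counter()
        for dialogue in dialogues:
            for s, a, s_next in dialogue:
                c_sa[(s, a)] += 1             # C(s, a)
                c_sas[(s, a, s_next)] += 1    # C(s, s', a)
        P = defaultdict(dict)                 # P[(s, a)][s'] = C(s, s', a) / C(s, a)
        for (s, a, s_next), n in c_sas.items():
            P[(s, a)][s_next] = n / c_sa[(s, a)]
        return P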
32. How to Estimate P(s'|s, a) without Labeled Data
- Random conversations with real people
  - carefully hand-tune a small number of states and policies
  - build a dialogue system which explores the state space by generating a few hundred random conversations with real humans
  - set probabilities from this corpus
- Random conversations with simulated people
  - have millions of conversations with simulated people
  - can have a slightly larger state space
33. Final Reward
- What is the final reward for the whole dialogue, R(s1, a1, s2, a2, ..., sn)?
- The general goodness of a whole dialogue!
- Can be automatically computed using the PARADISE evaluation framework (more later)
34. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
35. NJFun: A Case Study
- Idea: use reinforcement learning to make a small set of optimal policy decisions while interacting with real people, and empirically evaluate
- Singh, S., D. Litman, M. Kearns, and M. Walker. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of AI Research. 2002.
36. NJFun (aka RLDS) Dialogues
- Task description: "You have lived in Stanhope for many years but have never managed to visit its historic sites. Today you are feeling virtuous. Use RLDS to find out what you can see this morning."

Unoptimized System
TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about.
USER: I would like to visit a historic site in Stanhope this morning.
TTS: Did you say you are interested in going to a zoo?
USER: No.
TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
USER: I would like to visit a historic site.
TTS: Did you say you are interested in going to a historic site?
USER: Yes.
TTS: Please give me more information.
USER: ...
TTS: Please tell me the location that you are interested in. You can also tell me the time.
... (8 turns omitted)
TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Bad.
TTS: Goodbye and have a nice day!

Optimized System
TTS: Welcome to RLDS. How may I help you?
USER: I would like to um find about Stanhope historic sites.
TTS: Did you say you are interested in Stanhope?
USER: Yes.
TTS: What time of the day do you want to go?
USER: Uh... morning.
TTS: Did you say you want to go in the morning?
USER: Yes.
TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Good.
TTS: Goodbye and have a nice day!
37. Some Issues in Automatic Speech Recognition (ASR)
- Inputs: audio file; grammar/language model; acoustic model
- Outputs: utterance matched from the grammar, or no-match; confidence score
- Performance tradeoff:
  - small grammar → high accuracy on constrained utterances, lots of no-matches
  - large grammar → match more utterances, but with lower confidence
38. Some Issues in Dialogue Policy Design
- Initiative policy
- Confirmation policy
- Criteria to be optimized
39. Initiative Policy
- System initiative vs. user initiative:
  - "Please state your departure city."
  - "How can I help you?"
- Influences expectations
- ASR grammar must be chosen accordingly
- Best choice may differ from state to state
- May depend on the user population and the task
40. Confirmation Policy
- High ASR confidence: accept the ASR match and move on
- Moderate ASR confidence: confirm
- Low ASR confidence: re-ask
- How to set the confidence thresholds? (see the sketch below)
- Early mistakes can be costly later, but excessive confirmation is annoying
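A minimal sketch of such a threshold-based confirmation choice; the threshold values and action labels are illustrative assumptions, not NJFun's settings.

    def confirmation_action(asr_confidence, accept_threshold=0.8, confirm_threshold=0.5):
        if asr_confidence >= accept_threshold:
            return "accept"      # high confidence: take the ASR match and move on
        if asr_confidence >= confirm_threshold:
            return "confirm"     # moderate confidence: ask the user to confirm
        return "re-ask"          # low confidence: re-prompt the user

    print(confirmation_action(0.92), confirmation_action(0.65), confirmation_action(0.3))

Reinforcement learning can be seen as replacing such hand-set thresholds with choices optimized from data.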
41. Criteria to be Optimized
- Task completion
- Sales revenues
- User satisfaction
- ASR performance
- Number of turns
42. Typical System Design: Sequential Search
- Choose and implement several reasonable dialogue policies
- Field the systems, gather dialogue data
- Do statistical analyses
- Refield the system with the best dialogue policy
- Can only examine a handful of policies
43. Why Reinforcement Learning?
- Agents can learn to improve performance by interacting with their environment
- There are thousands of possible dialogue policies, and we want to automate the choice of the optimal one
- Can handle many features of spoken dialogue:
  - noisy sensors (ASR output)
  - stochastic behavior (user population)
  - delayed rewards, and many possible rewards
  - multiple plausible actions
- However, many practical challenges remain
44. Proposed Approach
- Build an initial system that is deliberately exploratory with respect to state and action space
- Use dialogue data from the initial system to build a Markov decision process (MDP)
- Use reinforcement learning methods to compute the optimal policy (here, dialogue policy) of the MDP
- Refield the (improved?) system given by the optimal policy
- Empirically evaluate
45. State-Based Design
- The system state contains the information relevant for deciding the next action:
  - info attributes perceived so far
  - individual and average ASR confidences
  - data on the particular user
  - etc.
- In practice, we need a compressed state
- Dialogue policy: a mapping from each state in the state space to a system action
46. Markov Decision Processes
- System state s (in S)
- System action a (in A)
- Transition probabilities P(s'|s, a)
- Reward function R(s, a) (stochastic)
- In our application, P(s'|s, a) models the population of users
47. SDSs as MDPs
- (Diagram: a dialogue trajectory alternating system actions a and user/ASR evidence e, starting from the initial system utterance and the initial user utterance; actions have probabilistic outcomes)
- Estimate the transition probabilities P(next state | current state, action) and the rewards R(current state, action) from the system logs of a set of exploratory dialogues (random action choice)
- Violations of the Markov property! Will this work?
48. Computing the Optimal Policy
- Given the parameters P(s'|s, a) and R(s, a), we can efficiently compute the policy maximizing expected return
- Typically we compute the expected cumulative reward (or Q-value) Q(s, a) using value iteration (sketched below)
- The optimal policy selects the action with the maximum Q-value at each dialogue state
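A minimal sketch of Q-value iteration over an estimated MDP. The data structures (P mapping (s, a) to a distribution over next states, R mapping (s, a) to a reward, and actions mapping each state to its allowed actions) are assumptions about how the estimates might be stored, not NJFun's implementation.

    def value_iteration(states, actions, P, R, gamma=0.95, sweeps=1000):
        Q = {(s, a): 0.0 for s in states for a in actions.get(s, [])}
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for (s, a) in Q:                  # back up each state-action pair
                Q[(s, a)] = R[(s, a)] + gamma * sum(
                    prob * V[s2] for s2, prob in P[(s, a)].items())
            for s in states:                  # V(s) = max_a Q(s, a)
                V[s] = max((Q[(s, a)] for a in actions.get(s, [])), default=0.0)
        policy = {s: max(actions[s], key=lambda a: Q[(s, a)])
                  for s in states if actions.get(s)}
        return policy, Q

The returned policy is exactly the greedy choice in the last bullet: the action with the maximum Q-value in each dialogue state.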
49. Potential Benefits
- A principled and general framework for automated dialogue policy synthesis
  - learn the optimal action to take in each state
- Compares all policies simultaneously
  - data efficient, because actions are evaluated as a function of state
  - traditional methods evaluate entire policies
- Potential for lifelong learning systems, adapting to changing user populations
50. The Application: NJFun
- Dialogue system providing telephone access to a DB of activities in NJ
- Want to obtain 3 attributes:
  - activity type (e.g., wine tasting)
  - location (e.g., Lambertville)
  - time (e.g., morning)
- Failure to bind an attribute: query the DB with "don't care"
51. NJFun as an MDP
- Define the state space
- Define the action space
- Define the reward structure
- Collect data for training; learn the policy
- Evaluate the learned policy
52. The State Space
- N.B. Non-state variables record the attribute values; the state does not condition on previous attributes!
53. Sample Action Choices
- Initiative (when T = 0):
  - user (open prompt and grammar)
  - mixed (constrained prompt, open grammar)
  - system (constrained prompt and grammar)
- Example:
  - GreetU: "How may I help you?"
  - GreetS: "Please say an activity name."
54. Sample Confirmation Choices
- Confirmation (when V = 1):
  - confirm
  - no confirm
- Example:
  - Conf3: "Did you say you want to go in the <time>?"
  - NoConf3
55. Dialogue Policy Class
- Specify reasonable actions for each state
  - 42 choice states (binary initiative or confirmation action choices)
  - no choice for all other states
- Small state space (62), large policy space (2^42)
- Example choice state:
  - initial state: 1,0,0,0,0,0
  - action choices: GreetS, GreetU
- Learn the optimal action for each choice state
56. Some System Details
- Uses AT&T's WATSON ASR and TTS platform and the DMD dialogue manager
- A natural-language web version was used to build multiple ASR language models
- Initial statistics were used to tune the bins for confidence values and the history bit (informative state encoding)
57. The Experiment
- Designed 6 specific tasks, each with a web survey
- Split 75 internal subjects into training and test sets, controlling for M/F, native/non-native, experienced/inexperienced
- 54 training subjects generated 311 dialogues
- Training dialogues were used to build the MDP
- The optimal policy for BINARY TASK COMPLETION was computed and implemented
- 21 test subjects (for the modified system) generated 124 dialogues
- Did statistical analyses of the performance changes
58. Example of Learning
- The initial state is always:
  - Attribute(1), Confidence/Confirmed(0), Value(0), Tries(0), Grammar(0), History(0)
- Possible actions in this state:
  - GreetU: "How may I help you?"
  - GreetS: "Please say an activity name or say 'list activities' for a list of activities I know about."
- In this state, the system learned that GreetU is the optimal action.
59. Reward Function
- Binary task completion (objective measure):
  - +1 for 3 correct bindings, else -1
- Task completion (allows partial credit):
  - -1 for an incorrect attribute binding
  - 0, 1, 2, 3 for the number of correct attribute bindings
- Other evaluation measures: ASR performance (objective), and phone feedback, perceived completion, future use, perceived understanding, user understanding, ease of use (all subjective)
- Optimized for binary task completion, but predicted improvements in the other measures
60. Main Results
- Task completion (-1 to 3):
  - train mean = 1.72
  - test mean = 2.18
  - p-value < 0.03
- Binary task completion:
  - train mean = 51.5
  - test mean = 63.5
  - p-value < 0.06
61. Other Results
- ASR performance (0-3):
  - train mean = 2.48
  - test mean = 2.67
  - p-value < 0.04
- Binary task completion for experts (dialogues 3-6):
  - train mean = 45.6
  - test mean = 68.2
  - p-value < 0.01
62. Subjective Measures
- Subjective measures move to the middle rather than improve
- First graph: "It was easy to find the place that I wanted" (strongly agree = 5, ..., strongly disagree = 1): train mean = 3.38, test mean = 3.39, p-value = .98
63. Comparison to Human Design
- A fielded comparison is infeasible, but the exploratory dialogues provide a Monte Carlo proxy (consistent trajectories)
- Test policy: average binary completion reward = 0.67 (based on 12 trajectories)
- Outperforms several standard fixed policies:
  - SysNoConfirm: -0.08 (11)
  - SysConfirm: -0.6 (5)
  - UserNoConfirm: -0.2 (15)
  - Mixed: -0.077 (13)
  - UserConfirm: 0.2727 (11), no difference
64. A Sanity Check of the MDP
- Generate many random policies
- Compare each policy's value according to the MDP with its value based on consistent exploratory trajectories
- If the MDP evaluation of a policy were perfectly accurate (infinite Monte Carlo sampling), we would see a linear fit with slope 1 and intercept 0
- Correlation between Monte Carlo and MDP values:
  - 1000 policies, > 0 trajectories: corr. 0.31, slope 0.953, intercept 0.067, p < 0.001
  - 868 policies, > 5 trajectories: corr. 0.39, slope 1.058, intercept 0.087, p < 0.001
65. Conclusions from NJFun
- MDPs and RL are a promising framework for automated dialogue policy design
- A practical methodology for system building:
  - given a relatively small number of exploratory dialogues, learn the optimal policy within a large policy search space
- NJFun was the first empirical test of the formalism
- It resulted in measurable and significant system improvements, as well as interesting linguistic results
66. Caveats
- Must still choose states, actions, and the reward
- Must be exploratory with taste
- Data sparsity
- Violations of the Markov property
- A formal framework and methodology, hopefully automating one important step in system design
67. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
68. Some Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
69. Addressing Scalability
- Approach 1: user models / simulations
  - costly to obtain real data → simulate users
  - an inexpensive and potentially richer source of large corpora
  - but: what is the quality of the simulated data?
  - again, real-world evaluation becomes paramount
- Approach 2: value function approximation
  - data-driven state abstraction / aggregation
70. Some Example Simulation Models
- P(userAction | systemAction)
- P(yesAnswer | explicitConfirmation)
- P(yesAnswer | explicitConfirmation, goal)
- E.g.:
  - Levin, Pieraccini, and Eckert
  - Georgila, Henderson, and Lemon
  - Pietquin
  - Scheffler and Young
71. Example Simulation Models in Spoken Dialogue Tutoring
- ProbCorrect Model
  - answers a question with the same correctness rate as a real student
- Random Model
  - gives random answers
- Hua Ai and Diane Litman. Comparing Real-Real, Simulated-Simulated, and Simulated-Real Spoken Dialogue Corpora. Proceedings of the AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems. 2006.
72. ProbCorrect Model
- Compute correctness rates from the system logs
- Use real student utterances as candidate answers
- A bigram model (the simulated answer is conditioned on the tutor's question); see the sketch below
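A minimal sketch of how such a simulated student might be driven, assuming per-question correctness rates and banks of real correct/incorrect student utterances have already been extracted from the logs (those data structures are illustrative assumptions, not the paper's implementation).

    import random

    def simulated_answer(question_id, correctness_rate, correct_answers, incorrect_answers):
        # Answer correctly with the rate observed in the logs for this question,
        # sampling a real student utterance of the chosen type.
        if random.random() < correctness_rate[question_id]:
            return random.choice(correct_answers[question_id])
        return random.choice(incorrect_answers[question_id])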
73. Random Model
- Randomly pick an answer from the whole utterance set
- A unigram model
74. Simulated Dialogue (ProbCorrect Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: The second law.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: No.
Tutor said: It says ... Is there a gravitational force here?
You said: Yes.
Tutor said: Good. Now, what's the relationship between the two forces' magnitudes?
You said: They are equal.
Tutor said: Yeah. What's the relationship between the two forces' directions?
You said: They're different.
Tutor said: To summarize ...
75. Simulated Dialogue (Random Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: More.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: After all forces.
Tutor said: It says ... Is there a gravitational force here?
You said: Yes.
Tutor said: We just discussed ... Now, what's the relationship between the two forces' magnitudes?
You said: Yes.
Tutor said: Well ... What's the relationship between the two forces' directions?
You said: It's decreasing.
Tutor said: To summarize ...
76. Evaluating Simulation Models
- Does the model produce human-like behavior?
  - compare real and simulated user responses
  - metrics: precision and recall
- Does the model reproduce the variety of human behavior?
  - compare real and simulated dialogue corpora
  - metrics: statistical characteristics of dialogue features (see below)
77. Evaluating Simulated Corpora
- High-level dialogue features
  - dialogue length (number of turns)
  - turn length (number of actions per turn)
  - participant activity (ratio of system/user actions per dialogue)
- Dialogue style and cooperativeness
  - proportion of goal-directed dialogues vs. others
  - number of times a piece of information is re-asked
- Dialogue success rate and efficiency
  - average goal/subgoal achievement rate
- Schatzmann, J., Georgila, K., and Young, S. Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue. 2005.
78. Evaluating ProbCorrect vs. Random
- Differences shown by similar metrics are not necessarily related to the level of realism
  - two real corpora can be very different
- The metrics can distinguish, to some extent:
  - real from simulated corpora
  - two simulated corpora generated by different models trained on the same real corpus
  - two simulated corpora generated by the same model trained on two different real corpora
79. Scalability Approach 2: Function Approximation
- Q can be represented by a table only if the number of states and actions is small
- Besides, a table makes poor use of experience
- Hence, we use function approximation, e.g.:
  - neural nets
  - weighted linear functions (sketched below)
  - case-based/instance-based/memory-based representations
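A minimal sketch of the weighted-linear-function option: Q(s, a) is approximated as a dot product between a per-action weight vector and a state feature vector phi(s), updated toward a one-step TD target. The feature encoding phi and the parameter values are assumptions supplied for illustration.

    import numpy as np

    class LinearQ:
        def __init__(self, n_features, actions, alpha=0.01, gamma=0.9):
            self.w = {a: np.zeros(n_features) for a in actions}   # one weight vector per action
            self.alpha, self.gamma = alpha, gamma

        def q(self, phi, a):
            return float(self.w[a] @ phi)                         # Q(s, a) ~ w_a . phi(s)

        def update(self, phi, a, reward, phi_next, done):
            # One-step TD target: r + gamma * max_a' Q(s', a'), or just r at the end
            target = reward if done else reward + self.gamma * max(
                self.q(phi_next, b) for b in self.w)
            self.w[a] += self.alpha * (target - self.q(phi, a)) * phi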
80. Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
81. Designing the State Representation
- Incrementally add features to the state and test whether the learned strategy improves
- Frampton, M. and Lemon, O. Learning More Effective Dialogue Strategies Using Limited Dialogue Move Features. Proceedings of ACL/COLING. 2006.
  - adding the last system and user dialogue acts improves performance by 7.8%
- Tetreault, J. and Litman, D. Using Reinforcement Learning to Build a Better Model of Dialogue State. Proceedings of EACL. 2006.
  - see below
82. Example Methodology and Evaluation in SDS Tutoring
- Construct MDPs to test the inclusion of new state features relative to a baseline
  - develop a baseline state and policy
  - add a state feature to the baseline and compare policies
  - a feature is deemed important if adding it results in a change in policy from the baseline policy
- Joel R. Tetreault and Diane J. Litman. Comparing the Utility of State Features in Spoken Dialogue Using Reinforcement Learning. Proceedings of HLT/NAACL. 2006.
83. Baseline Policy
- State 1 (Correct): state size 1308, policy SimpleFeedback
- State 2 (Incorrect): state size 872, policy SimpleFeedback
- Trend: if you only have student correctness as a model of student state, the best policy is to always give simple feedback
84. Adding Certainty Features: Hypothetical Policy Change
- Baseline: state 1 = C (Correct), policy SimFeed; state 2 = I (Incorrect), policy SimFeed
- Adding certainty splits each baseline state into three: C,Certain / C,Neutral / C,Uncertain and I,Certain / I,Neutral / I,Uncertain
- Hypothetical +Certainty policy 1 (0 shifts from baseline): SimFeed in all six states
- Hypothetical +Certainty policy 2 (5 shifts from baseline): Mix / SimFeed / Mix for the C states and Mix / ComplexFeedback / Mix for the I states
85. Evaluation Results
- Incorporating new features into the standard tutorial state representation has an impact on Tutor Feedback policies
- Including Certainty, Student Moves, and Concept Repetition in the state effected the most change
- Similar feature utility was found for choosing Tutor Questions
86. Designing the State Representation (continued)
- Other approaches, e.g.:
  - Paek, T. and Chickering, D. The Markov Assumption in Spoken Dialogue Management. Proc. SIGdial. 2005.
  - Henderson, J., Lemon, O., and Georgila, K. Hybrid Reinforcement/Supervised Learning for Dialogue Policies from Communicator Data. Proc. IJCAI Workshop on KR in Practical Dialogue Systems. 2005.
87. Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
88. Beyond MDPs
- Partially Observable MDPs (POMDPs)
  - we don't REALLY know the user's state (we only know what we THOUGHT the user said)
  - so we need to take actions based on our BELIEF, i.e. a probability distribution over states rather than the true state
  - e.g., Roy, Pineau, and Thrun; Young and Williams
- Decision-theoretic (DT) methods
  - e.g., Paek and Horvitz
89. Why POMDPs?
- Does the state model uncertainty natively (i.e., is it partially rather than fully observable)?
  - Yes: POMDP and DT
  - No: MDP
- Does the system plan (i.e., can cumulative reward force the system to construct a plan for the choice of immediate actions)?
  - Yes: MDP and POMDP
  - No: DT
90. POMDP Intuitions
- At each time step t the machine is in some hidden state s ∈ S
- Since we don't observe s, we keep a distribution over states called a belief state b
  - so the probability of being in state s given belief state b is b(s)
- Based on the current belief state b, the machine:
  - selects an action a_m ∈ A_m
  - receives a reward r(s, a_m)
  - transitions to a new (hidden) state s', where s' depends only on s and a_m
- The machine then receives an observation o ∈ O, which depends on s' and a_m
- The belief distribution is then updated based on o and a_m (see the update equation below)
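In the usual POMDP notation, with transition model P(s' | s, a_m) and observation model O(o | s', a_m), the belief update the last bullet refers to is:

    b'(s') = \frac{O(o \mid s', a_m) \sum_{s \in S} P(s' \mid s, a_m)\, b(s)}{P(o \mid a_m, b)}

where the denominator is a normalising constant.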
91. How to Learn Policies?
- The state space is now continuous
- With a smaller discrete state space, an MDP could use dynamic programming; this doesn't work for POMDPs
- Exact solutions only work for small spaces
- Need approximate solutions
- And simplifying assumptions
92. Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
93. Dialogue System Evaluation
- The normal reason: we need a metric to help us compare different implementations
- A new reason: we need a metric for how good a dialogue went, in order to automatically improve SDS performance via reinforcement learning
- Marilyn Walker. An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email. JAIR. 2000.
94. PARADISE: PARAdigm for DIalogue System Evaluation
- The performance of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished
- Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. Proceedings of ACL/EACL. 1997.
95. Performance as User Satisfaction (from Questionnaire)
96. PARADISE Framework
- Measure parameters (interaction costs and benefits) and performance in a corpus
- Train a model via multiple linear regression over the parameters, predicting performance (see the sketch below):
  System Performance = Σ_{i=1}^{n} w_i · p_i
- Test the model on a new corpus
- Predict performance during future system design
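A minimal sketch of the regression step in Python using ordinary least squares; the parameter matrix and satisfaction scores are made-up toy numbers, not data from any PARADISE study.

    import numpy as np

    # Each row: one dialogue's (normalised) parameters p_i, e.g. COMP, MRS, Reject
    params = np.array([[1.0, 0.9, 0.1],
                       [0.0, 0.6, 0.4],
                       [1.0, 0.8, 0.0],
                       [0.0, 0.5, 0.5]])
    user_sat = np.array([4.2, 2.5, 4.6, 2.0])                # questionnaire scores

    w, *_ = np.linalg.lstsq(params, user_sat, rcond=None)    # fitted weights w_i
    predicted = params @ w                                   # predicted performance
    print(w, predicted)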
97. Example Learned Performance Function from Elvis (Walker 2000)
- User Sat. = .27·COMP + .54·MRS - .09·BargeIn + .15·Reject
  - COMP: user perception of task completion (task success)
  - MRS: mean (concept) recognition accuracy (quality cost)
  - BargeIn: normalized # of user interruptions (quality cost)
  - Reject: normalized # of ASR rejections (quality cost)
- Amount of variance in User Sat. accounted for by the model:
  - average training R² = .37
  - average testing R² = .38
- Used as the reward for reinforcement learning
98. Some Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
99. Offline versus Online Learning
- MDP training typically works offline
- We would like to learn the policy online
  - the system can improve over time
  - the policy can change as the environment changes
- (Diagram: training data feeds the MDP, which yields a policy for the dialogue system; the dialogue system interacts with a user simulator or human users, and these interactions feed back as training data, so the loop can work online)
100. Summary
- (PO)MDPs and RL are a promising framework for automated dialogue policy design
  - the designer states the problem and the desired goal
  - solution methods find (or approximate) optimal plans for any possible state
  - disparate sources of uncertainty are unified into a probabilistic framework
- Many interesting problems remain, e.g.:
  - using this approach as a practical methodology for system building
  - making more principled choices (states, rewards, discount factors, etc.)
101. Acknowledgements
- Talks on the web by Dan Bohus, Derek Bridge, Joyce Chai, Dan Jurafsky, Oliver Lemon and James Henderson, Jost Schatzmann and Steve Young, and Jason Williams were used in the development of this presentation
- Slides from the ITSPOKE group at the University of Pittsburgh
102. Further Information
- Reinforcement Learning
  - Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press. 1998 (much of it available online)
  - Artificial Intelligence and Machine Learning journals and conferences
- Application to Dialogue
  - Jurafsky, D. and Martin, J. Dialogue and Conversational Agents. Chapter 19 of Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of May 18, 2005 (available online only)
  - ACL literature
  - Spoken language community (e.g., IEEE and ISCA publications)