Title: Learning Optimal Strategies for Spoken Dialogue Systems
1. Learning Optimal Strategies for Spoken Dialogue Systems
- Diane Litman
- University of Pittsburgh
- Pittsburgh, PA 15260 USA
2. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
3. Motivation
- Builders of real-time spoken dialogue systems face fundamental design choices that strongly influence system performance:
  - when to confirm/reject/clarify what the user just said?
  - when to ask a directive versus an open prompt?
  - when to use user, system, or mixed initiative?
  - when to provide positive/negative/no feedback?
  - etc.
- Can such decisions be automatically optimized via reinforcement learning?
4. Spoken Dialogue Systems (SDS)
- Provide voice access to a back-end via telephone or microphone
- Front-end: ASR (automatic speech recognition) and TTS (text-to-speech)
- Back-end: DB, web, etc.
- Middle: dialogue policy (what action to take at each point in a dialogue)
5. Typical SDS Architecture
- (Architecture diagram: Language Understanding, Dialogue Policy, Domain Back-end, Language Generation)
6. Reinforcement Learning (RL)
- Learning is associated with a reward
- By optimizing reward, the algorithm learns the optimal strategy
- Application to SDS
  - Key assumption: the SDS can be represented as a Markov Decision Process
  - Key benefit: formalization (when in a state, what is the reward for taking a particular action, among all action choices?)
7. Reinforcement Learning and SDS
- Debate over design choices
- Learn choices using reinforcement learning
- Agent interacting with an environment
- Noisy inputs
- Temporal / sequential aspect
- Task success / failure
- (Architecture diagram: Language Understanding passes noisy semantic input to the Dialogue Manager, which sends actions (semantic output) to Language Generation and queries the Domain Back-end)
8. Sample Research Questions
- Which aspects of dialogue management are amenable to learning, and what reward functions are needed?
- What representation of the dialogue state best serves this learning?
- What reinforcement learning methods are tractable with large-scale dialogue systems?
9. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
10. Markov Decision Processes (MDPs)
- Characterized by:
  - a set of states S an agent can be in
  - a set of actions A the agent can take
  - a reward r(a,s) that the agent receives for taking an action in a state
  - (some other things I'll come back to: gamma, state transition probabilities)
11. Modeling a Spoken Dialogue System as a Probabilistic Agent
- An SDS can be characterized by:
  - the current knowledge of the system
  - a set of states S the agent can be in
  - a set of actions A the agent can take
  - a goal G, which implies
    - a success metric that tells us how well the agent achieved its goal
    - a way of using this metric to create a strategy or policy π for what action to take in any particular state
12. Reinforcement Learning
- The agent interacts with its environment to achieve a goal
- It receives reward (possibly delayed reward) for its actions
  - it is not told what actions to take
  - instead, it learns from indirect, potentially delayed reward to choose sequences of actions that produce the greatest cumulative reward
- Trial-and-error search
  - neither exploitation nor exploration can be pursued exclusively without failing at the task
- Life-long learning
  - on-going exploration
13. Reinforcement Learning
- (Agent-environment loop: the policy π : S → A maps states to actions; at each step the agent observes state s_t, takes action a_t, and receives reward r_t, producing the sequence s0, a0, r0, s1, a1, r1, s2, a2, r2, ...)
14. State Value Function, V
- V(s) predicts the future total reward we can obtain by entering state s
- Example values: V(s1) = 10, V(s2) = 15, V(s3) = 6
- Example transitions from s0: p(s0, a1, s1) = 0.7, p(s0, a1, s2) = 0.3, r(s0, a1) = 2; p(s0, a2, s2) = 0.5, p(s0, a2, s3) = 0.5, r(s0, a2) = 5
- π can exploit V greedily, i.e. in s, choose the action a for which r(s, a) + Σ_s' p(s, a, s') V(s') is largest
- Choosing a1: 2 + 0.7·10 + 0.3·15 = 13.5; choosing a2: 5 + 0.5·15 + 0.5·6 = 15.5 (see the check below)
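To make the greedy choice concrete, here is a minimal Python check of the numbers on this slide: a one-step lookahead over r and p picks a2 because 15.5 > 13.5. The dictionary encoding is just one convenient representation of the example, not part of the lecture.

    # One-step lookahead using the slide's example values: V(s1)=10, V(s2)=15, V(s3)=6
    V = {"s1": 10, "s2": 15, "s3": 6}
    p = {("s0", "a1"): {"s1": 0.7, "s2": 0.3},      # p(s0, a1, s')
         ("s0", "a2"): {"s2": 0.5, "s3": 0.5}}      # p(s0, a2, s')
    r = {("s0", "a1"): 2, ("s0", "a2"): 5}

    def lookahead(s, a):
        # r(s, a) + sum over s' of p(s, a, s') * V(s')
        return r[(s, a)] + sum(prob * V[s2] for s2, prob in p[(s, a)].items())

    best = max(["a1", "a2"], key=lambda a: lookahead("s0", a))
    print(lookahead("s0", "a1"), lookahead("s0", "a2"), best)   # 13.5 15.5 a2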
15. Action Value Function, Q
- Q(s, a) predicts the future total reward we can obtain by executing a in s
- Example values: Q(s0, a1) = 13.5, Q(s0, a2) = 15.5, Q(s1, a1) = ..., Q(s1, a2) = ...
- π can exploit Q greedily, i.e. in s, choose the action a for which Q(s, a) is largest
16. Q Learning
- Exploration versus exploitation:
  For each (s, a), initialise Q(s, a) arbitrarily
  Observe the current state s
  Do until reaching the goal state:
    Select action a by exploiting Q ε-greedily, i.e. with probability ε choose a randomly, else choose the a for which Q(s, a) is largest
    Execute a, entering state s' and receiving immediate reward r
    Update the table entry for Q(s, a)
    s ← s'
- [Watkins 1989] (see the sketch below)
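A minimal sketch of this tabular Q-learning loop in Python. The toy interface (a step(s, a) function returning the next state and reward, plus start and goal states) is an assumption for illustration, not the lecture's dialogue environment; alpha and gamma are placeholder values.

    import random
    from collections import defaultdict

    def q_learning(actions, step, start, goal, episodes=500,
                   alpha=0.1, gamma=0.9, epsilon=0.1):
        Q = defaultdict(float)                    # Q[(s, a)], initialised to 0
        for _ in range(episodes):
            s = start
            while s != goal:
                if random.random() < epsilon:     # explore
                    a = random.choice(actions)
                else:                             # exploit: argmax_a Q(s, a)
                    a = max(actions, key=lambda x: Q[(s, x)])
                s_next, reward = step(s, a)       # execute a, observe s' and r
                best_next = max(Q[(s_next, x)] for x in actions)
                Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
                s = s_next
        return Q

The epsilon parameter controls the exploration/exploitation trade-off named on the slide.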
17. More on Q Learning
- (Diagram: the experience tuple s, a, r, s' is used to update the table entry Q(s, a) from the immediate reward r and the next-state values Q(s', a'); the standard Q-learning update is Q(s, a) ← Q(s, a) + α [ r + γ max_a' Q(s', a') − Q(s, a) ])
18. A Brief Tutorial Example
- A Day-and-Month dialogue system
- Goal: fill in a two-slot frame
  - Month: November
  - Day: 12th
  - via the shortest possible interaction with the user
- Levin, E., Pieraccini, R., and Eckert, W. A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing. 2000.
19. What is a State?
- In principle, the MDP state could include any possible information about the dialogue
  - complete dialogue history so far
- Usually we use a much more limited set
  - values of slots in the current frame
  - most recent question asked to the user
  - user's most recent answer
  - ASR confidence
  - etc.
20. State in the Day-and-Month Example
- Values of the two slots, day and month
- Total:
  - 2 special states, initial si and final sf
  - 365 states with a day and month
  - 1 state for leap year
  - 12 states with a month but no day
  - 31 states with a day but no month
  - 411 total states
21. Actions in MDP Models of Dialogue
- Speech acts!
- Ask a question
- Explicit confirmation
- Rejection
- Give the user some database information
- Tell the user their choices
- Do a database query
22. Actions in the Day-and-Month Example
- ad: a question asking for the day
- am: a question asking for the month
- adm: a question asking for the day and the month
- af: a final action, submitting the form and terminating the dialogue
23. A Simple Reward Function
- For this example, let's use a cost function for the entire dialogue
- Let
  - Ni = number of interactions (duration of dialogue)
  - Ne = number of errors in the obtained values (0-2)
  - Nf = expected distance from goal (0 for a complete date, 1 if either the day or the month is missing, 2 if both are missing)
- Then the (weighted) cost is C = wi·Ni + we·Ne + wf·Nf (see the sketch below)
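A minimal sketch of this cost function in Python; the weight values below are arbitrary placeholders, not values from the lecture.

    def dialogue_cost(n_interactions, n_errors, n_missing, wi=1.0, we=2.0, wf=3.0):
        # C = wi*Ni + we*Ne + wf*Nf
        return wi * n_interactions + we * n_errors + wf * n_missing

    # e.g. a 2-turn dialogue with one misrecognised slot and a complete date:
    print(dialogue_cost(n_interactions=2, n_errors=1, n_missing=0))   # 4.0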
24. 3 Possible Policies
- (Diagram of three candidate policies: a dumb policy, an open-prompt policy, and a directive-prompt policy; P1 = probability of error in the open prompt, P2 = probability of error in the directive prompt)
25. 3 Possible Policies
- Strategy 3 is better than strategy 2 when the improved error rate justifies the longer interaction
- (Diagram comparing the OPEN and DIRECTIVE policies in terms of P1, the probability of error in the open prompt, and P2, the probability of error in the directive prompt)
26. That was an Easy Optimization
- Only two actions, only a tiny number of policies
- In general, the number of actions, states, and policies is quite large
- So finding the optimal policy is harder
- We need reinforcement learning
- Back to MDPs
27. MDP
- We can think of a dialogue as a trajectory in state space
- The best policy is the one with the greatest expected reward over all trajectories
- How to compute a reward for a state sequence?
28. Reward for a State Sequence
- One common approach: discounted rewards
- The cumulative reward Q of a sequence is the discounted sum of the utilities of the individual states
- Discount factor γ between 0 and 1
- Makes the agent care more about current than future rewards: the further in the future a reward is, the more discounted its value (see the formula below)
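Written out in the standard form, the discounted cumulative reward of a sequence of per-state rewards r_0, r_1, ... with discount factor γ is:

    R = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t \ge 0} \gamma^{t} r_t, \qquad 0 \le \gamma \le 1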
29. The Markov Assumption
- MDP assumes that state transitions are Markovian
30. Expected Reward for an Action
- The expected cumulative reward Q(s, a) for taking a particular action from a particular state can be computed by the Bellman equation (written out below):
  - immediate reward for the current state
  - plus the expected discounted utility of all possible next states s'
  - weighted by the probability of moving to that state s'
  - and assuming that once there we take the optimal action a'
31. Needed for the Bellman Equation
- A model of P(s'|s, a) and an estimate of R(s, a)
- If we had labeled training data:
  - P(s'|s, a) = C(s, s', a) / C(s, a)
- If we knew the final reward for the whole dialogue, R(s1, a1, s2, a2, ..., sn)
- Given these parameters, we can use the value iteration algorithm to learn Q values (pushing reward values back over state sequences) and hence the best policy (see the counting sketch below)
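A minimal sketch of the counting step in Python, assuming the dialogues have been logged as lists of (state, action, next_state) triples; that logging format is an assumption for illustration, not the lecture's.

    from collections import Counter, defaultdict

    def estimate_transitions(dialogues):
        c_sa, c_sas = Counter(), Counter()
        for dialogue in dialogues:
            for s, a, s_next in dialogue:
                c_sa[(s, a)] += 1             # C(s, a)
                c_sas[(s, a, s_next)] += 1    # C(s, s', a)
        P = defaultdict(dict)                 # P[(s, a)][s'] = C(s, s', a) / C(s, a)
        for (s, a, s_next), n in c_sas.items():
            P[(s, a)][s_next] = n / c_sa[(s, a)]
        return P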
32. How to Estimate P(s'|s, a) without Labeled Data
- Random conversations with real people
  - carefully hand-tune a small number of states and policies
  - build a dialogue system which explores the state space by generating a few hundred random conversations with real humans
  - set probabilities from this corpus
- Random conversations with simulated people
  - have millions of conversations with simulated people
  - can have a slightly larger state space
33. Final Reward
- What is the final reward for the whole dialogue, R(s1, a1, s2, a2, ..., sn)?
- The general goodness of a whole dialogue!
- Can be automatically computed using the PARADISE evaluation framework (more later)
34. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
35. NJFun: A Case Study
- Idea: use reinforcement learning to make a small set of optimal policy decisions while interacting with real people, and empirically evaluate
- Singh, S., D. Litman, M. Kearns, and M. Walker. Optimizing Dialogue Management with Reinforcement Learning: Experiments with the NJFun System. Journal of AI Research. 2002.
36. NJFun (aka RLDS) Dialogues
- Task description: "You have lived in Stanhope for many years but have never managed to visit its historic sites. Today you are feeling virtuous. Use RLDS to find out what you can see this morning."

Unoptimized System
TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about.
USER: I would like to visit a historic site in Stanhope this morning.
TTS: Did you say you are interested in going to a zoo?
USER: No.
TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
USER: I would like to visit a historic site.
TTS: Did you say you are interested in going to a historic site?
USER: Yes.
TTS: Please give me more information.
USER: ...
TTS: Please tell me the location that you are interested in. You can also tell me the time.
... (8 turns omitted)
TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Bad.
TTS: Goodbye and have a nice day!

Optimized System
TTS: Welcome to RLDS. How may I help you?
USER: I would like to um find about Stanhope historic sites.
TTS: Did you say you are interested in Stanhope?
USER: Yes.
TTS: What time of the day do you want to go?
USER: Uh... morning.
TTS: Did you say you want to go in the morning?
USER: Yes.
TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Good.
TTS: Goodbye and have a nice day!
37. Some Issues in Automatic Speech Recognition (ASR)
- Inputs: audio file; grammar/language model; acoustic model
- Outputs: utterance matched from the grammar, or no-match; confidence score
- Performance tradeoff:
  - small grammar → high accuracy on constrained utterances, lots of no-matches
  - large grammar → match more utterances, but with lower confidence
38. Some Issues in Dialogue Policy Design
- Initiative policy
- Confirmation policy
- Criteria to be optimized
39. Initiative Policy
- System initiative vs. user initiative:
  - "Please state your departure city."
  - "How can I help you?"
- Influences expectations
- ASR grammar must be chosen accordingly
- Best choice may differ from state to state
- May depend on the user population and the task
40. Confirmation Policy
- High ASR confidence: accept the ASR match and move on
- Moderate ASR confidence: confirm
- Low ASR confidence: re-ask
- How to set the confidence thresholds? (see the sketch below)
- Early mistakes can be costly later, but excessive confirmation is annoying
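A minimal sketch of such a threshold-based confirmation choice; the threshold values and action labels are illustrative assumptions, not NJFun's settings.

    def confirmation_action(asr_confidence, accept_threshold=0.8, confirm_threshold=0.5):
        if asr_confidence >= accept_threshold:
            return "accept"      # high confidence: take the ASR match and move on
        if asr_confidence >= confirm_threshold:
            return "confirm"     # moderate confidence: ask the user to confirm
        return "re-ask"          # low confidence: re-prompt the user

    print(confirmation_action(0.92), confirmation_action(0.65), confirmation_action(0.3))

Reinforcement learning can be seen as replacing such hand-set thresholds with choices optimized from data.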
41. Criteria to be Optimized
- Task completion
- Sales revenues
- User satisfaction
- ASR performance
- Number of turns
42. Typical System Design: Sequential Search
- Choose and implement several reasonable dialogue policies
- Field the systems, gather dialogue data
- Do statistical analyses
- Refield the system with the best dialogue policy
- Can only examine a handful of policies
43. Why Reinforcement Learning?
- Agents can learn to improve performance by interacting with their environment
- There are thousands of possible dialogue policies, and we want to automate the choice of the optimal one
- Can handle many features of spoken dialogue:
  - noisy sensors (ASR output)
  - stochastic behavior (user population)
  - delayed rewards, and many possible rewards
  - multiple plausible actions
- However, many practical challenges remain
44. Proposed Approach
- Build an initial system that is deliberately exploratory with respect to state and action space
- Use dialogue data from the initial system to build a Markov decision process (MDP)
- Use reinforcement learning methods to compute the optimal policy (here, dialogue policy) of the MDP
- Refield the (improved?) system given by the optimal policy
- Empirically evaluate
45. State-Based Design
- The system state contains the information relevant for deciding the next action:
  - info attributes perceived so far
  - individual and average ASR confidences
  - data on the particular user
  - etc.
- In practice, we need a compressed state
- Dialogue policy: a mapping from each state in the state space to a system action
46. Markov Decision Processes
- System state s (in S)
- System action a (in A)
- Transition probabilities P(s'|s, a)
- Reward function R(s, a) (stochastic)
- In our application, P(s'|s, a) models the population of users
47. SDSs as MDPs
- (Diagram: a dialogue trajectory alternating system actions a and user/ASR evidence e, starting from the initial system utterance and the initial user utterance; actions have probabilistic outcomes)
- Estimate the transition probabilities P(next state | current state, action) and the rewards R(current state, action) from the system logs of a set of exploratory dialogues (random action choice)
- Violations of the Markov property! Will this work?
48. Computing the Optimal Policy
- Given the parameters P(s'|s, a) and R(s, a), we can efficiently compute the policy maximizing expected return
- Typically we compute the expected cumulative reward (or Q-value) Q(s, a) using value iteration (sketched below)
- The optimal policy selects the action with the maximum Q-value at each dialogue state
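A minimal sketch of Q-value iteration over an estimated MDP. The data structures (P mapping (s, a) to a distribution over next states, R mapping (s, a) to a reward, and actions mapping each state to its allowed actions) are assumptions about how the estimates might be stored, not NJFun's implementation.

    def value_iteration(states, actions, P, R, gamma=0.95, sweeps=1000):
        Q = {(s, a): 0.0 for s in states for a in actions.get(s, [])}
        V = {s: 0.0 for s in states}
        for _ in range(sweeps):
            for (s, a) in Q:                  # back up each state-action pair
                Q[(s, a)] = R[(s, a)] + gamma * sum(
                    prob * V[s2] for s2, prob in P[(s, a)].items())
            for s in states:                  # V(s) = max_a Q(s, a)
                V[s] = max((Q[(s, a)] for a in actions.get(s, [])), default=0.0)
        policy = {s: max(actions[s], key=lambda a: Q[(s, a)])
                  for s in states if actions.get(s)}
        return policy, Q

The returned policy is exactly the greedy choice in the last bullet: the action with the maximum Q-value in each dialogue state.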
49. Potential Benefits
- A principled and general framework for automated dialogue policy synthesis
  - learn the optimal action to take in each state
- Compares all policies simultaneously
  - data efficient, because actions are evaluated as a function of state
  - traditional methods evaluate entire policies
- Potential for lifelong learning systems, adapting to changing user populations
50. The Application: NJFun
- Dialogue system providing telephone access to a DB of activities in NJ
- Want to obtain 3 attributes:
  - activity type (e.g., wine tasting)
  - location (e.g., Lambertville)
  - time (e.g., morning)
- Failure to bind an attribute: query the DB with "don't care"
51. NJFun as an MDP
- Define the state space
- Define the action space
- Define the reward structure
- Collect data for training; learn the policy
- Evaluate the learned policy
52. The State Space
- N.B. Non-state variables record the attribute values; the state does not condition on previous attributes!
53. Sample Action Choices
- Initiative (when T = 0):
  - user (open prompt and grammar)
  - mixed (constrained prompt, open grammar)
  - system (constrained prompt and grammar)
- Example:
  - GreetU: "How may I help you?"
  - GreetS: "Please say an activity name."
54. Sample Confirmation Choices
- Confirmation (when V = 1):
  - confirm
  - no confirm
- Example:
  - Conf3: "Did you say you want to go in the <time>?"
  - NoConf3
55. Dialogue Policy Class
- Specify reasonable actions for each state
  - 42 choice states (binary initiative or confirmation action choices)
  - no choice for all other states
- Small state space (62), large policy space (2^42)
- Example choice state:
  - initial state: 1,0,0,0,0,0
  - action choices: GreetS, GreetU
- Learn the optimal action for each choice state
56. Some System Details
- Uses AT&T's WATSON ASR and TTS platform and the DMD dialogue manager
- A natural-language web version was used to build multiple ASR language models
- Initial statistics were used to tune the bins for confidence values and the history bit (informative state encoding)
57. The Experiment
- Designed 6 specific tasks, each with a web survey
- Split 75 internal subjects into training and test sets, controlling for M/F, native/non-native, experienced/inexperienced
- 54 training subjects generated 311 dialogues
- Training dialogues were used to build the MDP
- The optimal policy for BINARY TASK COMPLETION was computed and implemented
- 21 test subjects (for the modified system) generated 124 dialogues
- Did statistical analyses of the performance changes
58. Example of Learning
- The initial state is always:
  - Attribute(1), Confidence/Confirmed(0), Value(0), Tries(0), Grammar(0), History(0)
- Possible actions in this state:
  - GreetU: "How may I help you?"
  - GreetS: "Please say an activity name or say 'list activities' for a list of activities I know about."
- In this state, the system learned that GreetU is the optimal action.
59. Reward Function
- Binary task completion (objective measure):
  - +1 for 3 correct bindings, else -1
- Task completion (allows partial credit):
  - -1 for an incorrect attribute binding
  - 0, 1, 2, 3 for the number of correct attribute bindings
- Other evaluation measures: ASR performance (objective), and phone feedback, perceived completion, future use, perceived understanding, user understanding, ease of use (all subjective)
- Optimized for binary task completion, but predicted improvements in the other measures
60. Main Results
- Task completion (-1 to 3):
  - train mean = 1.72
  - test mean = 2.18
  - p-value < 0.03
- Binary task completion:
  - train mean = 51.5
  - test mean = 63.5
  - p-value < 0.06
61. Other Results
- ASR performance (0-3):
  - train mean = 2.48
  - test mean = 2.67
  - p-value < 0.04
- Binary task completion for experts (dialogues 3-6):
  - train mean = 45.6
  - test mean = 68.2
  - p-value < 0.01
62. Subjective Measures
- Subjective measures move to the middle rather than improve
- First graph: "It was easy to find the place that I wanted" (strongly agree = 5, ..., strongly disagree = 1): train mean = 3.38, test mean = 3.39, p-value = .98
63. Comparison to Human Design
- A fielded comparison is infeasible, but the exploratory dialogues provide a Monte Carlo proxy (consistent trajectories)
- Test policy: average binary completion reward = 0.67 (based on 12 trajectories)
- Outperforms several standard fixed policies:
  - SysNoConfirm: -0.08 (11)
  - SysConfirm: -0.6 (5)
  - UserNoConfirm: -0.2 (15)
  - Mixed: -0.077 (13)
  - UserConfirm: 0.2727 (11), no difference
64. A Sanity Check of the MDP
- Generate many random policies
- Compare each policy's value according to the MDP with its value based on consistent exploratory trajectories
- If the MDP evaluation of a policy were perfectly accurate (infinite Monte Carlo sampling), we would see a linear fit with slope 1 and intercept 0
- Correlation between Monte Carlo and MDP values:
  - 1000 policies, > 0 trajectories: corr. 0.31, slope 0.953, intercept 0.067, p < 0.001
  - 868 policies, > 5 trajectories: corr. 0.39, slope 1.058, intercept 0.087, p < 0.001
65. Conclusions from NJFun
- MDPs and RL are a promising framework for automated dialogue policy design
- A practical methodology for system building:
  - given a relatively small number of exploratory dialogues, learn the optimal policy within a large policy search space
- NJFun was the first empirical test of the formalism
- It resulted in measurable and significant system improvements, as well as interesting linguistic results
66. Caveats
- Must still choose states, actions, and the reward
- Must be exploratory with taste
- Data sparsity
- Violations of the Markov property
- A formal framework and methodology, hopefully automating one important step in system design
67. Outline
- Motivation
- Markov Decision Processes and Reinforcement Learning
- NJFun: A Case Study
- Advanced Topics
68. Some Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
69. Addressing Scalability
- Approach 1: user models / simulations
  - costly to obtain real data → simulate users
  - an inexpensive and potentially richer source of large corpora
  - but: what is the quality of the simulated data?
  - again, real-world evaluation becomes paramount
- Approach 2: value function approximation
  - data-driven state abstraction / aggregation
70. Some Example Simulation Models
- P(userAction | systemAction)
- P(yesAnswer | explicitConfirmation)
- P(yesAnswer | explicitConfirmation, goal)
- E.g.:
  - Levin, Pieraccini, and Eckert
  - Georgila, Henderson, and Lemon
  - Pietquin
  - Scheffler and Young
71. Example Simulation Models in Spoken Dialogue Tutoring
- ProbCorrect Model
  - answers a question with the same correctness rate as a real student
- Random Model
  - gives random answers
- Hua Ai and Diane Litman. Comparing Real-Real, Simulated-Simulated, and Simulated-Real Spoken Dialogue Corpora. Proceedings of the AAAI Workshop on Statistical and Empirical Approaches for Spoken Dialogue Systems. 2006.
72. ProbCorrect Model
- Compute correctness rates from the system logs
- Use real student utterances as candidate answers
- A bigram model (the simulated answer is conditioned on the tutor's question); see the sketch below
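A minimal sketch of how such a simulated student might be driven, assuming per-question correctness rates and banks of real correct/incorrect student utterances have already been extracted from the logs (those data structures are illustrative assumptions, not the paper's implementation).

    import random

    def simulated_answer(question_id, correctness_rate, correct_answers, incorrect_answers):
        # Answer correctly with the rate observed in the logs for this question,
        # sampling a real student utterance of the chosen type.
        if random.random() < correctness_rate[question_id]:
            return random.choice(correct_answers[question_id])
        return random.choice(incorrect_answers[question_id])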
73. Random Model
- Randomly pick an answer from the whole utterance set
- A unigram model
74. Simulated Dialogue (ProbCorrect Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: The second law.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: No.
Tutor said: It says ... Is there a gravitational force here?
You said: Yes.
Tutor said: Good. Now, what's the relationship between the two forces' magnitudes?
You said: They are equal.
Tutor said: Yeah. What's the relationship between the two forces' directions?
You said: They're different.
Tutor said: To summarize ...
75. Simulated Dialogue (Random Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: More.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: After all forces.
Tutor said: It says ... Is there a gravitational force here?
You said: Yes.
Tutor said: We just discussed ... Now, what's the relationship between the two forces' magnitudes?
You said: Yes.
Tutor said: Well ... What's the relationship between the two forces' directions?
You said: It's decreasing.
Tutor said: To summarize ...
76. Evaluating Simulation Models
- Does the model produce human-like behavior?
  - compare real and simulated user responses
  - metrics: precision and recall
- Does the model reproduce the variety of human behavior?
  - compare real and simulated dialogue corpora
  - metrics: statistical characteristics of dialogue features (see below)
77. Evaluating Simulated Corpora
- High-level dialogue features
  - dialogue length (number of turns)
  - turn length (number of actions per turn)
  - participant activity (ratio of system/user actions per dialogue)
- Dialogue style and cooperativeness
  - proportion of goal-directed dialogues vs. others
  - number of times a piece of information is re-asked
- Dialogue success rate and efficiency
  - average goal/subgoal achievement rate
- Schatzmann, J., Georgila, K., and Young, S. Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue. 2005.
78. Evaluating ProbCorrect vs. Random
- Differences shown by similar metrics are not necessarily related to the level of realism
  - two real corpora can be very different
- The metrics can distinguish, to some extent:
  - real from simulated corpora
  - two simulated corpora generated by different models trained on the same real corpus
  - two simulated corpora generated by the same model trained on two different real corpora
79. Scalability Approach 2: Function Approximation
- Q can be represented by a table only if the number of states and actions is small
- Besides, a table makes poor use of experience
- Hence, we use function approximation, e.g.:
  - neural nets
  - weighted linear functions (sketched below)
  - case-based/instance-based/memory-based representations
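A minimal sketch of the weighted-linear-function option: Q(s, a) is approximated as a dot product between a per-action weight vector and a state feature vector phi(s), updated toward a one-step TD target. The feature encoding phi and the parameter values are assumptions supplied for illustration.

    import numpy as np

    class LinearQ:
        def __init__(self, n_features, actions, alpha=0.01, gamma=0.9):
            self.w = {a: np.zeros(n_features) for a in actions}   # one weight vector per action
            self.alpha, self.gamma = alpha, gamma

        def q(self, phi, a):
            return float(self.w[a] @ phi)                         # Q(s, a) ~ w_a . phi(s)

        def update(self, phi, a, reward, phi_next, done):
            # One-step TD target: r + gamma * max_a' Q(s', a'), or just r at the end
            target = reward if done else reward + self.gamma * max(
                self.q(phi_next, b) for b in self.w)
            self.w[a] += self.alpha * (target - self.q(phi, a)) * phi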
80. Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
81. Designing the State Representation
- Incrementally add features to the state and test whether the learned strategy improves
- Frampton, M. and Lemon, O. Learning More Effective Dialogue Strategies Using Limited Dialogue Move Features. Proceedings of ACL/COLING. 2006.
  - adding the last system and user dialogue acts improves performance by 7.8%
- Tetreault, J. and Litman, D. Using Reinforcement Learning to Build a Better Model of Dialogue State. Proceedings of EACL. 2006.
  - see below
82. Example Methodology and Evaluation in SDS Tutoring
- Construct MDPs to test the inclusion of new state features relative to a baseline
  - develop a baseline state and policy
  - add a state feature to the baseline and compare policies
  - a feature is deemed important if adding it results in a change in policy from the baseline policy
- Joel R. Tetreault and Diane J. Litman. Comparing the Utility of State Features in Spoken Dialogue Using Reinforcement Learning. Proceedings of HLT/NAACL. 2006.
83. Baseline Policy
- State 1 (Correct): state size 1308, policy SimpleFeedback
- State 2 (Incorrect): state size 872, policy SimpleFeedback
- Trend: if you only have student correctness as a model of student state, the best policy is to always give simple feedback
84. Adding Certainty Features: Hypothetical Policy Change
- Baseline: state 1 = C (Correct), policy SimFeed; state 2 = I (Incorrect), policy SimFeed
- Adding certainty splits each baseline state into three: C,Certain / C,Neutral / C,Uncertain and I,Certain / I,Neutral / I,Uncertain
- Hypothetical +Certainty policy 1 (0 shifts from baseline): SimFeed in all six states
- Hypothetical +Certainty policy 2 (5 shifts from baseline): Mix / SimFeed / Mix for the C states and Mix / ComplexFeedback / Mix for the I states
85. Evaluation Results
- Incorporating new features into the standard tutorial state representation has an impact on Tutor Feedback policies
- Including Certainty, Student Moves, and Concept Repetition in the state effected the most change
- Similar feature utility was found for choosing Tutor Questions
86. Designing the State Representation (continued)
- Other approaches, e.g.:
  - Paek, T. and Chickering, D. The Markov Assumption in Spoken Dialogue Management. Proc. SIGdial. 2005.
  - Henderson, J., Lemon, O., and Georgila, K. Hybrid Reinforcement/Supervised Learning for Dialogue Policies from Communicator Data. Proc. IJCAI Workshop on KR in Practical Dialogue Systems. 2005.
87. Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
88. Beyond MDPs
- Partially Observable MDPs (POMDPs)
  - we don't REALLY know the user's state (we only know what we THOUGHT the user said)
  - so we need to take actions based on our BELIEF, i.e. a probability distribution over states rather than the true state
  - e.g., Roy, Pineau, and Thrun; Young and Williams
- Decision-theoretic (DT) methods
  - e.g., Paek and Horvitz
89. Why POMDPs?
- Does the state model uncertainty natively (i.e., is it partially rather than fully observable)?
  - Yes: POMDP and DT
  - No: MDP
- Does the system plan (i.e., can cumulative reward force the system to construct a plan for the choice of immediate actions)?
  - Yes: MDP and POMDP
  - No: DT
90. POMDP Intuitions
- At each time step t the machine is in some hidden state s ∈ S
- Since we don't observe s, we keep a distribution over states called a belief state b
  - so the probability of being in state s given belief state b is b(s)
- Based on the current belief state b, the machine:
  - selects an action a_m ∈ A_m
  - receives a reward r(s, a_m)
  - transitions to a new (hidden) state s', where s' depends only on s and a_m
- The machine then receives an observation o ∈ O, which depends on s' and a_m
- The belief distribution is then updated based on o and a_m (see the update equation below)
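In the usual POMDP notation, with transition model P(s' | s, a_m) and observation model O(o | s', a_m), the belief update the last bullet refers to is:

    b'(s') = \frac{O(o \mid s', a_m) \sum_{s \in S} P(s' \mid s, a_m)\, b(s)}{P(o \mid a_m, b)}

where the denominator is a normalising constant.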
91. How to Learn Policies?
- The state space is now continuous
- With a smaller discrete state space, an MDP could use dynamic programming; this doesn't work for POMDPs
- Exact solutions only work for small spaces
- Need approximate solutions
- And simplifying assumptions
92. Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
93. Dialogue System Evaluation
- The normal reason: we need a metric to help us compare different implementations
- A new reason: we need a metric for how good a dialogue went, in order to automatically improve SDS performance via reinforcement learning
- Marilyn Walker. An Application of Reinforcement Learning to Dialogue Strategy Selection in a Spoken Dialogue System for Email. JAIR. 2000.
94. PARADISE: PARAdigm for DIalogue System Evaluation
- The performance of a dialogue system is affected both by what gets accomplished by the user and the dialogue agent and by how it gets accomplished
- Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. PARADISE: A Framework for Evaluating Spoken Dialogue Agents. Proceedings of ACL/EACL. 1997.
95. Performance as User Satisfaction (from Questionnaire)
96. PARADISE Framework
- Measure parameters (interaction costs and benefits) and performance in a corpus
- Train a model via multiple linear regression over the parameters, predicting performance (see the sketch below):
  System Performance = Σ_{i=1}^{n} w_i · p_i
- Test the model on a new corpus
- Predict performance during future system design
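A minimal sketch of the regression step in Python using ordinary least squares; the parameter matrix and satisfaction scores are made-up toy numbers, not data from any PARADISE study.

    import numpy as np

    # Each row: one dialogue's (normalised) parameters p_i, e.g. COMP, MRS, Reject
    params = np.array([[1.0, 0.9, 0.1],
                       [0.0, 0.6, 0.4],
                       [1.0, 0.8, 0.0],
                       [0.0, 0.5, 0.5]])
    user_sat = np.array([4.2, 2.5, 4.6, 2.0])                # questionnaire scores

    w, *_ = np.linalg.lstsq(params, user_sat, rcond=None)    # fitted weights w_i
    predicted = params @ w                                   # predicted performance
    print(w, predicted)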
97. Example Learned Performance Function from Elvis (Walker 2000)
- User Sat. = .27·COMP + .54·MRS - .09·BargeIn + .15·Reject
  - COMP: user perception of task completion (task success)
  - MRS: mean (concept) recognition accuracy (quality cost)
  - BargeIn: normalized # of user interruptions (quality cost)
  - Reject: normalized # of ASR rejections (quality cost)
- Amount of variance in User Sat. accounted for by the model:
  - average training R² = .37
  - average testing R² = .38
- Used as the reward for reinforcement learning
98. Some Current Research Topics
- Scale to more complex systems
- Automate state representation
- POMDPs due to hidden state
- Learn terminal (and non-terminal) reward function
- Online rather than batch learning
99. Offline versus Online Learning
- MDP training typically works offline
- We would like to learn the policy online
  - the system can improve over time
  - the policy can change as the environment changes
- (Diagram: training data feeds the MDP, which yields a policy for the dialogue system; the dialogue system interacts with a user simulator or human users, and these interactions feed back as training data, so the loop can work online)
100. Summary
- (PO)MDPs and RL are a promising framework for automated dialogue policy design
  - the designer states the problem and the desired goal
  - solution methods find (or approximate) optimal plans for any possible state
  - disparate sources of uncertainty are unified into a probabilistic framework
- Many interesting problems remain, e.g.:
  - using this approach as a practical methodology for system building
  - making more principled choices (states, rewards, discount factors, etc.)
101. Acknowledgements
- Talks on the web by Dan Bohus, Derek Bridge, Joyce Chai, Dan Jurafsky, Oliver Lemon and James Henderson, Jost Schatzmann and Steve Young, and Jason Williams were used in the development of this presentation
- Slides from the ITSPOKE group at the University of Pittsburgh
102. Further Information
- Reinforcement Learning
  - Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press. 1998 (much of it available online)
  - Artificial Intelligence and Machine Learning journals and conferences
- Application to Dialogue
  - Jurafsky, D. and Martin, J. Dialogue and Conversational Agents. Chapter 19 of Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Draft of May 18, 2005 (available online only)
  - ACL literature
  - Spoken language community (e.g., IEEE and ISCA publications)