Learning Optimal Strategies for Spoken Dialogue Systems

1
Learning Optimal Strategies for Spoken Dialogue
Systems
  • Diane Litman
  • University of Pittsburgh
  • Pittsburgh, PA 15260 USA

2
Outline
  • Motivation
  • Markov Decision Processes and Reinforcement
    Learning
  • NJFun: A Case Study
  • Advanced Topics

3
Motivation
  • Builders of real-time spoken dialogue systems
    face fundamental design choices that strongly
    influence system performance
  • when to confirm/reject/clarify what the user just
    said?
  • when to ask a directive versus open prompt?
  • when to use user, system, or mixed initiative?
  • when to provide positive/negative/no feedback?
  • etc.
  • Can such decisions be automatically optimized via
    reinforcement learning?

4
Spoken Dialogue Systems (SDS)
  • Provide voice access to back-end via telephone or
    microphone
  • Front-end: ASR (automatic speech recognition) and
    TTS (text-to-speech)
  • Back-end: DB, web, etc.
  • Middle: dialogue policy (what action to take at
    each point in a dialogue)

5
Typical SDS Architecture
[Architecture diagram showing the Language Understanding, Dialogue Policy, Domain Back-end, and Language Generation components]
6
Reinforcement Learning (RL)
  • Learning is associated with a reward
  • By optimizing reward, algorithm learns optimal
    strategy
  • Application to SDS:
  • Key assumption: SDS can be represented as a
    Markov Decision Process
  • Key benefit: formalization (when in a state, what
    is the reward for taking a particular action,
    among all action choices?)

7
Reinforcement Learning and SDS
  • debate over design choices
  • learn choices using reinforcement learning
  • agent interacting with an environment
  • noisy inputs
  • temporal / sequential aspect
  • task success / failure

[Diagram: Language Understanding passes noisy semantic input to the Dialogue Manager, which consults the Domain Back-end and sends actions (semantic output) to Language Generation]
8
Sample Research Questions
  • Which aspects of dialogue management are amenable
    to learning and what reward functions are needed?
  • What representation of the dialogue state best
    serves this learning?
  • What reinforcement learning methods are tractable
    with large scale dialogue systems?

9
Outline
  • Motivation
  • Markov Decision Processes and Reinforcement
    Learning
  • NJFun: A Case Study
  • Advanced Topics

10
Markov Decision Processes (MDP)
  • Characterized by:
  • a set of states S an agent can be in
  • a set of actions A the agent can take
  • A reward r(a,s) that the agent receives for
    taking an action in a state
  • (Some other things I'll come back to: gamma,
    state transition probabilities)
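For concreteness, here is a minimal sketch (not from the original slides) of how these pieces might be held in plain Python containers; the toy rewards and transition probabilities are the ones that reappear on the later state-value slide.

```python
# Minimal sketch of an MDP as plain Python containers (illustrative only,
# not the NJFun system): states, actions, rewards, transition probabilities.

STATES = ["s0", "s1", "s2", "s3"]
ACTIONS = ["a1", "a2"]

# r(s, a): immediate reward for taking action a in state s
REWARDS = {("s0", "a1"): 2.0, ("s0", "a2"): 5.0}

# p(s, a, s'): probability of landing in s' after taking a in s
TRANSITIONS = {
    ("s0", "a1"): {"s1": 0.7, "s2": 0.3},
    ("s0", "a2"): {"s2": 0.5, "s3": 0.5},
}

def successors(state, action):
    """Return the possible next states and their probabilities."""
    return TRANSITIONS.get((state, action), {})

if __name__ == "__main__":
    print(successors("s0", "a1"))  # {'s1': 0.7, 's2': 0.3}
```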

11
Modeling a Spoken Dialogue System as a
Probabilistic Agent
  • An SDS can be characterized by:
  • The current knowledge of the system
  • A set of states S the agent can be in
  • a set of actions A the agent can take
  • A goal G, which implies:
  • A success metric that tells us how well the agent
    achieved its goal
  • A way of using this metric to create a strategy
    or policy π for what action to take in any
    particular state.

12
Reinforcement Learning
  • The agent interacts with its environment to
    achieve a goal
  • It receives reward (possibly delayed reward) for
    its actions
  • it is not told what actions to take
  • instead, it learns from indirect, potentially
    delayed reward, to choose sequences of actions
    that produce the greatest cumulative reward
  • Trial-and-error search
  • neither exploitation nor exploration can be
    pursued exclusively without failing at the task
  • Life-long learning
  • on-going exploration

13
Reinforcement Learning
Policy π: S → A
[Diagram: agent-environment loop in which the agent observes the state, receives a reward, and takes an action, yielding the trajectory s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]
14
State Value Function, V
V(s) predicts the future total reward we can
obtain by entering state s
State, s   V(s)
s0         ...
s1         10
s2         15
s3         6
[Diagram: from s0, action a1 has reward r(s0, a1) = 2 and transitions p(s0, a1, s1) = 0.7, p(s0, a1, s2) = 0.3; action a2 has reward r(s0, a2) = 5 and transitions p(s0, a2, s2) = 0.5, p(s0, a2, s3) = 0.5]
π can exploit V greedily, i.e. in s, choose the
action a for which r(s, a) + Σ_s' p(s, a, s') V(s') is largest:
Choosing a1: 2 + 0.7 × 10 + 0.3 × 15 = 13.5
Choosing a2: 5 + 0.5 × 15 + 0.5 × 6 = 15.5
15
Action Value Function, Q
Q(s, a) predicts the future total reward we can
obtain by executing a in s
State, s   Action, a   Q(s, a)
s0         a1          13.5
s0         a2          15.5
s1         a1          ...
s1         a2          ...
π can exploit Q greedily, i.e. in s, choose the
action a for which Q(s, a) is largest
16
Q Learning
Exploration versus exploitation
For each (s, a), initialise Q(s, a) arbitrarily
Observe current state, s
Do until reaching a goal state:
    Select action a by exploiting Q ε-greedily, i.e.
        with probability ε, choose a randomly,
        else choose the a for which Q(s, a) is largest
    Execute a, entering state s' and receiving immediate reward r
    Update the table entry for Q(s, a)
    s ← s'
Watkins 1989
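A minimal tabular Q-learning sketch in Python, following the pseudocode above; the environment interface (reset(), step(), actions) and the learning-rate and discount values are assumptions for illustration, not part of the original slides.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (Watkins 1989).

    `env` is assumed to expose reset() -> state, step(action) -> (state, reward, done),
    and a list of discrete actions in `env.actions`.
    """
    Q = defaultdict(float)  # Q[(s, a)], initialised arbitrarily (here: 0)

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy selection: explore with probability epsilon, else exploit
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[(s, act)])

            s_next, r, done = env.step(a)

            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = max(Q[(s_next, act)] for act in env.actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

            s = s_next
    return Q
```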
17
More on Q Learning
[Diagram: one experience step (s, a, r, s'); the stored value Q(s, a) is updated using the received reward r and the value Q(s', a') of the best action a' in the next state s']
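The slide's figure is not reproduced here, but the update it refers to is the standard Q-learning rule; with learning rate α and discount factor γ (standard parameters of the algorithm, not visible in the surviving text), it reads:

$$ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] $$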
18
A Brief Tutorial Example
  • A Day-and-Month dialogue system
  • Goal: fill in a two-slot frame
  • Month: November
  • Day: 12th
  • Via the shortest possible interaction with user
  • Levin, E., Pieraccini, R. and Eckert, W. A
    Stochastic Model of Human-Machine Interaction for
    Learning Dialog Strategies. IEEE Transactions on
    Speech and Audio Processing. 2000.

19
What is a State?
  • In principle, MDP state could include any
    possible information about dialogue
  • Complete dialogue history so far
  • Usually use a much more limited set
  • Values of slots in current frame
  • Most recent question asked to user
  • User's most recent answer
  • ASR confidence
  • etc

20
State in the Day-and-Month Example
  • Values of the two slots: day and month.
  • Total:
  • 2 special states: initial s_i and final s_f
  • 365 states with a day and month
  • 1 state for leap year
  • 12 states with a month but no day
  • 31 states with a day but no month
  • 411 total states

21
Actions in MDP Models of Dialogue
  • Speech acts!
  • Ask a question
  • Explicit confirmation
  • Rejection
  • Give the user some database information
  • Tell the user their choices
  • Do a database query

22
Actions in the Day-and-Month Example
  • a_d: a question asking for the day
  • a_m: a question asking for the month
  • a_dm: a question asking for both the day and the month
  • a_f: a final action, submitting the form and
    terminating the dialogue

23
A Simple Reward Function
  • For this example, let's use a cost function for
    the entire dialogue
  • Let:
  • N_i = number of interactions (duration of dialogue)
  • N_e = number of errors in the obtained values (0-2)
  • N_f = expected distance from goal
  • (0 for a complete date, 1 if either day or month
    is missing, 2 if both are missing)
  • Then the (weighted) cost is:
  • C = w_i·N_i + w_e·N_e + w_f·N_f
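A direct transcription of this cost function as a small Python helper; the weight values are illustrative placeholders, not from the slides.

```python
def dialogue_cost(n_interactions, n_errors, n_missing,
                  w_i=1.0, w_e=1.0, w_f=1.0):
    """Weighted dialogue cost C = w_i*N_i + w_e*N_e + w_f*N_f.

    n_interactions: N_i, number of interactions (dialogue duration)
    n_errors:       N_e, number of errors in the obtained values (0-2)
    n_missing:      N_f, distance from goal (0 = complete date, 1 = day or
                    month missing, 2 = both missing)
    The weights w_i, w_e, w_f are illustrative placeholders.
    """
    return w_i * n_interactions + w_e * n_errors + w_f * n_missing

# Example: a 3-turn dialogue with one recognition error and a complete date
print(dialogue_cost(n_interactions=3, n_errors=1, n_missing=0))  # 4.0
```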

24
3 Possible Policies
[Diagram of three candidate policies: a "dumb" policy, an open-prompt policy, and a directive-prompt policy]
P1 = probability of error in the open prompt
P2 = probability of error in the directive prompt
25
3 Possible Policies
Strategy 3 is better than strategy 2 when the
improved error rate justifies the longer interaction
P1 = probability of error in the open prompt (OPEN)
P2 = probability of error in the directive prompt (DIRECTIVE)
26
That was an Easy Optimization
  • Only two actions, only a tiny number of policies
  • In general, number of actions, states, policies
    is quite large
  • So finding optimal policy is harder
  • We need reinforcement learning
  • Back to MDPs

27
MDP
  • We can think of a dialogue as a trajectory in
    state space
  • The best policy is the one with the greatest
    expected reward over all trajectories
  • How to compute a reward for a state sequence?

28
Reward for a State Sequence
  • One common approach: discounted rewards
  • Cumulative reward Q of a sequence is the discounted
    sum of the utilities of the individual states
  • Discount factor γ between 0 and 1
  • Makes the agent care more about current than future
    rewards: the further in the future a reward is, the
    more its value is discounted
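Written out with the discount factor γ, the discounted cumulative reward of a reward sequence r0, r1, r2, ... is:

$$ Q = r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots = \sum_{t=0}^{\infty} \gamma^t r_t, \qquad 0 \le \gamma \le 1 $$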

29
The Markov Assumption
  • MDP assumes that state transitions are Markovian
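In symbols, the assumption is that the next state depends only on the current state and action, not on the earlier history:

$$ P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \ldots, s_0, a_0) = P(s_{t+1} \mid s_t, a_t) $$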

30
Expected Reward for an Action
  • The expected cumulative reward Q(s,a) for taking a
    particular action from a particular state can be
    computed by the Bellman equation:
  • immediate reward for the current state
  • + expected discounted utility of all possible
    next states s'
  • weighted by the probability of moving to that state
    s'
  • and assuming that once there we take the optimal action a'
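The equation itself appears as an image in the original slides; reconstructing it from the bullets above, its standard form is:

$$ Q(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, \max_{a'} Q(s', a') $$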

31
Needed for Bellman Equation
  • A model of p(s'|s,a) and an estimate of R(s,a)
  • If we had labeled training data:
  • P(s'|s,a) = C(s,a,s') / C(s,a)
  • If we knew the final reward for the whole dialogue:
    R(s1,a1,s2,a2,...,sn)
  • Given these parameters, we can use the value iteration
    algorithm to learn Q values (pushing back reward
    values over state sequences) and hence the best policy
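A minimal value-iteration sketch in Python, assuming the model P(s'|s,a) and reward estimate R(s,a) are available as dictionaries; the data layout is an illustrative assumption, not the NJFun implementation.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """Compute Q(s, a) by iterating the Bellman equation to convergence.

    P[(s, a)] is a dict {s_next: probability}; R[(s, a)] is the expected
    immediate reward. Both would be estimated from dialogue data.
    """
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                # Bellman backup: immediate reward plus discounted value of successors
                backup = R.get((s, a), 0.0) + gamma * sum(
                    prob * max(Q[(s_next, a_next)] for a_next in actions)
                    for s_next, prob in P.get((s, a), {}).items()
                )
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:
            break
    # Optimal policy: in each state, pick the action with the largest Q-value
    policy = {s: max(actions, key=lambda a: Q[(s, a)]) for s in states}
    return Q, policy
```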

32
How to estimate p(s'|s,a) without labeled data
  • Random conversations with real people
  • Carefully hand-tune small number of states and
    policies
  • Build a dialogue system which explores state
    space by generating a few hundred random
    conversations with real humans
  • Set probabilities from this corpus
  • Random conversations with simulated people
  • Have millions of conversations with simulated
    people
  • Can have a slightly larger state space
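A sketch of the counting step, assuming the exploratory dialogues have been logged as (state, action, reward, next state) tuples; the tuple format is an assumption for illustration.

```python
from collections import defaultdict

def estimate_mdp(transitions):
    """Estimate P(s'|s,a) = C(s,a,s')/C(s,a) and the mean R(s,a) from logged data.

    `transitions` is an iterable of (s, a, r, s_next) tuples collected from
    exploratory dialogues (with real or simulated users).
    """
    counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
    reward_sum = defaultdict(float)
    visits = defaultdict(int)

    for s, a, r, s_next in transitions:
        counts[(s, a)][s_next] += 1
        reward_sum[(s, a)] += r
        visits[(s, a)] += 1

    P = {sa: {s_next: c / visits[sa] for s_next, c in nexts.items()}
         for sa, nexts in counts.items()}
    R = {sa: reward_sum[sa] / visits[sa] for sa in visits}
    return P, R
```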

33
Final Reward
  • What is the final reward for the whole dialogue
    R(s1,a1,s2,a2,...,sn)?
  • The general goodness of a whole dialogue!
  • Can be automatically computed using PARADISE
    evaluation framework (more later)

34
Outline
  • Motivation
  • Markov Decision Processes and Reinforcement
    Learning
  • NJFun: A Case Study
  • Advanced Topics

35
NJFun: A Case Study
  • Idea: use reinforcement learning to make a small
    set of optimal policy decisions while interacting
    with real people, and empirically evaluate
  • Singh, S., D. Litman, M. Kearns, and M. Walker.
    Optimizing Dialogue Management with Reinforcement
    Learning: Experiments with the NJFun System.
    Journal of AI Research. 2002.

36
NJFun (aka RLDS) Dialogues
  • Task Description: "You have lived in Stanhope for
    many years but have never managed to visit its
    historic sites. Today you are feeling virtuous.
    Use RLDS to find out what you can see this
    morning."

Unoptimized System
TTS: Welcome to RLDS. Please say an activity name or say 'list activities' for a list of activities I know about.
USER: I would like to visit a historic site in Stanhope this morning.
TTS: Did you say you are interested in going to a zoo?
USER: No.
TTS: I know about amusement parks, aquariums, cruises, historic sites, museums, parks, theaters, wineries, and zoos. Please say an activity name from this list.
USER: I would like to visit a historic site.
TTS: Did you say you are interested in going to a historic site?
USER: Yes.
TTS: Please give me more information.
USER: ...
TTS: Please tell me the location that you are interested in. You can also tell me the time.
... (8 turns omitted)
TTS: Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Bad.
TTS: Goodbye and have a nice day!

Optimized System
TTS: Welcome to RLDS. How may I help you?
USER: I would like to um find about Stanhope historic sites.
TTS: Did you say you are interested in Stanhope?
USER: Yes.
TTS: What time of the day do you want to go?
USER: Uh... morning.
TTS: Did you say you want to go in the morning?
USER: Yes.
TTS: I found 2 historic sites in Stanhope that are open in the morning. They are the Lenape Indian Village, and the Waterloo Village. Thank you for using the system. Please give me feedback by saying 'good', 'so-so', or 'bad'.
USER: Good.
TTS: Goodbye and have a nice day!
37
Some Issues in Automatic Speech Recognition (ASR)
  • Inputs: audio file, grammar/language model,
    acoustic model
  • Outputs: utterance matched from grammar (or no
    match), plus a confidence score
  • Performance tradeoff:
  • small grammar → high accuracy on constrained
    utterances, lots of no-matches
  • large grammar → match more utterances, but
    with lower confidence

38
Some Issues in Dialogue Policy Design
  • Initiative policy
  • Confirmation policy
  • Criteria to be optimized

39
Initiative Policy
  • System initiative vs. user initiative
  • Please state your departure city.
  • How can I help you?
  • Influences expectations
  • ASR grammar must be chosen accordingly
  • Best choice may differ from state to state
  • May depend on user population and task

40
Confirmation Policy
  • High ASR confidence: accept ASR match and move on
  • Moderate ASR confidence: confirm
  • Low ASR confidence: re-ask
  • How to set confidence thresholds?
  • Early mistakes can be costly later, but excessive
    confirmation is annoying
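A minimal sketch of such a threshold rule; the numeric thresholds are placeholders, and choosing them well is exactly the kind of decision the rest of this talk proposes to learn.

```python
def confirmation_action(asr_confidence, high=0.8, low=0.4):
    """Map an ASR confidence score to a confirmation action.

    The thresholds `high` and `low` are illustrative placeholders; in a real
    system they would be hand-tuned or, as in this talk, optimized by RL.
    """
    if asr_confidence >= high:
        return "accept"     # accept the ASR match and move on
    elif asr_confidence >= low:
        return "confirm"    # explicitly confirm with the user
    else:
        return "re-ask"     # re-prompt the user
```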

41
Criteria to be Optimized
  • Task completion
  • Sales revenues
  • User satisfaction
  • ASR performance
  • Number of turns

42
Typical System Design Sequential Search
  • Choose and implement several reasonable
    dialogue policies
  • Field systems, gather dialogue data
  • Do statistical analyses
  • Refield system with best dialogue policy
  • Can only examine a handful of policies

43
Why Reinforcement Learning?
  • Agents can learn to improve performance by
    interacting with their environment
  • Thousands of possible dialogue policies, and want
    to automate the choice of the optimal
  • Can handle many features of spoken dialogue
  • noisy sensors (ASR output)
  • stochastic behavior (user population)
  • delayed rewards, and many possible rewards
  • multiple plausible actions
  • However, many practical challenges remain

44
Proposed Approach
  • Build an initial system that is deliberately
    exploratory with respect to its state and action space
  • Use dialogue data from initial system to build a
    Markov decision process (MDP)
  • Use methods of reinforcement learning to compute
    optimal policy (here, dialogue policy) of the MDP
  • Refield (improved?) system given by the optimal
    policy
  • Empirically evaluate

45
State-Based Design
  • System state contains information relevant for
    deciding the next action
  • info attributes perceived so far
  • individual and average ASR confidences
  • data on particular user
  • etc.
  • In practice, need a compressed state
  • Dialogue policy: a mapping from each state in the
    state space to a system action

46
Markov Decision Processes
  • System state s (in S)
  • System action a (in A)
  • Transition probabilities P(s'|s,a)
  • Reward function R(s,a) (stochastic)
  • In our application, P(s'|s,a) models the population
    of users

47
SDSs as MDPs
[Diagram: a dialogue unrolls as an initial system utterance, an initial user utterance, and then alternating system actions a and ASR/understanding results e, recorded in the system logs; actions have probabilistic outcomes]
Estimate transition probabilities P(next state | current state, action) and rewards R(current state, action) from a set of exploratory dialogues (random action choice).
Violations of the Markov property! Will this work?
48
Computing the Optimal
  • Given the parameters P(s'|s,a) and R(s,a), we can
    efficiently compute the policy maximizing expected
    return
  • Typically compute the expected cumulative reward
    (or Q-value) Q(s,a), using value iteration
  • Optimal policy selects the action with the
    maximum Q-value at each dialogue state

49
Potential Benefits
  • A principled and general framework for automated
    dialogue policy synthesis
  • learn the optimal action to take in each state
  • Compares all policies simultaneously
  • data efficient because actions are evaluated as a
    function of state
  • traditional methods evaluate entire policies
  • Potential for lifelong learning systems,
    adapting to changing user populations

50
The Application NJFun
  • Dialogue system providing telephone access to a
    DB of activities in NJ
  • Want to obtain 3 attributes
  • activity type (e.g., wine tasting)
  • location (e.g., Lambertville)
  • time (e.g., morning)
  • Failure to bind an attribute: query the DB with
    "don't care"

51
NJFun as an MDP
  • define state-space
  • define action-space
  • define reward structure
  • collect data for training; learn policy
  • evaluate learned policy

52
The State Space
[State feature table omitted; the features are Attribute, Confidence/Confirmed, Value, Tries, Grammar, and History (cf. slide 58)]
N.B. Non-state variables record attribute values; the state does not condition on previous attributes!
53
Sample Action Choices
  • Initiative (when T = 0)
  • user (open prompt and grammar)
  • mixed (constrained prompt, open grammar)
  • system (constrained prompt and grammar)
  • Example:
  • GreetU: "How may I help you?"
  • GreetS: "Please say an activity name."

54
Sample Confirmation Choices
  • Confirmation (when V = 1)
  • confirm
  • no confirm
  • Example:
  • Conf3: "Did you say you want to go in the <time>?"
  • NoConf3: (skip confirmation and move on)

55
Dialogue Policy Class
  • Specify reasonable actions for each state
  • 42 choice states (binary initiative or
    confirmation action choices)
  • no choice for all other states
  • Small state space (62 states), large policy space (2^42 policies)
  • Example choice state:
  • initial state: 1,0,0,0,0,0
  • action choices: GreetS, GreetU
  • Learn optimal action for each choice state

56
Some System Details
  • Uses AT&T's WATSON ASR and TTS platform and the DMD
    dialogue manager
  • Natural language web version used to build
    multiple ASR language models
  • Initial statistics used to tune bins for
    confidence values and the history bit (informative
    state encoding)

57
The Experiment
  • Designed 6 specific tasks, each with web survey
  • Split 75 internal subjects into training and
    test, controlling for M/F, native/non-native,
    experienced/inexperienced
  • 54 training subjects generated 311 dialogues
  • Training dialogues used to build MDP
  • Optimal policy for BINARY TASK COMPLETION
    computed and implemented
  • 21 test subjects (for modified system) generated
    124 dialogues
  • Did statistical analyses of performance changes

58
Example of Learning
  • The initial state is always:
  • Attribute(1), Confidence/Confirmed(0), Value(0),
    Tries(0), Grammar(0), History(0)
  • Possible actions in this state:
  • GreetU: "How may I help you?"
  • GreetS: "Please say an activity name or say 'list
    activities' for a list of activities I know about"
  • In this state, the system learned that GreetU is the
    optimal action.

59
Reward Function
  • Binary task completion (objective measure)
  • +1 for 3 correct bindings, else -1
  • Task completion (allows partial credit)
  • -1 for an incorrect attribute binding
  • 0, 1, 2, or 3 for the number of correct attribute
    bindings
  • Other evaluation measures: ASR performance
    (objective), and phone feedback, perceived
    completion, future use, perceived understanding,
    user understanding, ease of use (all subjective)
  • Optimized for binary task completion, but
    predicted improvements in other measures

60
Main Results
  • Task completion (-1 to 3)
  • train mean = 1.72
  • test mean = 2.18
  • p < 0.03
  • Binary task completion
  • train mean = 51.5
  • test mean = 63.5
  • p < 0.06

61
Other Results
  • ASR performance (0-3)
  • train mean = 2.48
  • test mean = 2.67
  • p < 0.04
  • Binary task completion for experts (dialogues
    3-6)
  • train mean = 45.6
  • test mean = 68.2
  • p < 0.01

62
Subjective Measures
Subjective measures move to the middle rather
than improve
First graph: "It was easy to find the place that I
wanted" (strongly agree = 5, ..., strongly
disagree = 1): train mean 3.38, test mean 3.39,
p-value .98
63
Comparison to Human Design
  • Fielded comparison infeasible, but exploratory
    dialogues provide a Monte Carlo proxy of
    consistent trajectories
  • Test policy: average binary completion reward =
    0.67 (based on 12 trajectories)
  • Outperforms several standard fixed policies
  • SysNoConfirm: -0.08 (11)
  • SysConfirm: -0.6 (5)
  • UserNoConfirm: -0.2 (15)
  • Mixed: -0.077 (13)
  • UserConfirm: 0.2727 (11), no significant difference

64
A Sanity Check of the MDP
  • Generate many random policies
  • Compare value according to MDP and value based on
    consistent exploratory trajectories
  • MDP evaluation of a policy would ideally be perfectly
    accurate (infinite Monte Carlo sampling): a linear
    fit with slope 1, intercept 0
  • Correlation between Monte Carlo and MDP
  • 1000 policies, > 0 trajectories: corr. 0.31, slope 0.953,
    intercept 0.067, p < 0.001
  • 868 policies, > 5 trajectories: corr. 0.39, slope 1.058,
    intercept 0.087, p < 0.001

65
Conclusions from NJFun
  • MDPs and RL are a promising framework for
    automated dialogue policy design
  • Practical methodology for system-building
  • given a relatively small number of exploratory
    dialogues, learn the optimal policy within a
    large policy search space
  • NJFun: first empirical test of the formalism
  • Resulted in measurable and significant system
    improvements, as well as interesting linguistic
    results

66
Caveats
  • Must still choose states, actions, reward
  • Must be exploratory with taste
  • Data sparsity
  • Violations of the Markov property
  • A formal framework and methodology, hopefully
    automating one important step in system design

67
Outline
  • Motivation
  • Markov Decision Processes and Reinforcement
    Learning
  • NJFun A Case Study
  • Advanced Topics

68
Some Current Research Topics
  • Scale to more complex systems
  • Automate state representation
  • POMDPs due to hidden state
  • Learn terminal (and non-terminal) reward function
  • Online rather than batch learning

69
Addressing Scalability
  • Approach 1: user models / simulations
  • costly to obtain real data → simulate users
  • inexpensive and potentially richer source of
    large corpora
  • but what's the quality of the simulated data?
  • again, real-world evaluation becomes paramount
  • Approach 2: value function approximation
  • data-driven state abstraction / aggregation

70
Some Example Simulation Models
  • P(userAction | systemAction)
  • P(yesAnswer | explicitConfirmation)
  • P(yesAnswer | explicitConfirmation, goal)
  • E.g.,
  • Levin, Pieraccini, Eckert
  • Georgila, Henderson, Lemon
  • Pietquin
  • Scheffler and Young

71
Example Simulation Models in Spoken Dialogue
Tutoring
  • ProbCorrect Model
  • Answers a question with the same correctness rate
    as a real student
  • Random Model
  • Gives random answers
  • Hua Ai and Diane Litman. Comparing Real-Real,
    Simulated-Simulated, and Simulated-Real Spoken
    Dialogue Corpora. Proceedings of the AAAI
    Workshop on Statistical and Empirical Approaches
    for Spoken Dialogue Systems. 2006.

72
ProbCorrect Model
  • Compute correctness rates from the system logs
  • Use real student utterances as candidate answers
  • A Bigram Model
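A sketch of the idea, assuming per-question correctness rates and pools of real student answers have been extracted from the logs; the data structures here are illustrative assumptions, not the authors' implementation.

```python
import random

def probcorrect_answer(question_id, correctness_rate,
                       correct_answers, incorrect_answers):
    """Simulate a student answer in the ProbCorrect (bigram-style) model.

    correctness_rate:   empirical P(correct | question) from the system logs
    correct_answers:    real student utterances logged as correct for this question
    incorrect_answers:  real student utterances logged as incorrect for this question
    """
    if random.random() < correctness_rate[question_id]:
        return random.choice(correct_answers[question_id])
    return random.choice(incorrect_answers[question_id])

def random_answer(all_utterances):
    """Random (unigram) model: pick any utterance from the whole set."""
    return random.choice(all_utterances)
```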

73
Random Model
Pick an answer from the whole set
  • Randomly pick an answer from the whole utterance
    set
  • A Unigram Model

74
Simulated Dialogue (ProbCorrect Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: The second law.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: No.
Tutor said: It says... Is there a gravitational force here?
You said: Yes.
Tutor said: Good. Now, what's the relationship between the two forces' magnitudes?
You said: They are equal.
Tutor said: Yeah. What's the relationship between the two forces' directions?
You said: They're different.
Tutor said: To summarize...
75
Simulated Dialogue (Random Model)
Tutor said: To analyze this problem, which law of motion would you use?
You said: More.
Tutor said: The best law is the 3rd law. Do you recall what this law says?
You said: After all forces.
Tutor said: It says... Is there a gravitational force here?
You said: Yes.
Tutor said: We just discussed... Now, what's the relationship between the two forces' magnitudes?
You said: Yes.
Tutor said: Well... What's the relationship between the two forces' directions?
You said: It's decreasing.
Tutor said: To summarize...
76
Evaluating Simulation Models
  • Does the model produce human-like behavior?
  • Compare real and simulated user responses
  • Metrics: precision and recall
  • Does the model reproduce the variety of human
    behavior?
  • Compare real and simulated dialogue corpora
  • Metrics: statistical characteristics of dialogue
    features (see below)

77
Evaluating Simulated Corpora
  • High-level Dialogue Features
  • Dialog Length (Number of turns)
  • Turn Length (Number of actions per turn)
  • Participant Activity (Ratio of system/user
    actions per dialog)
  • Dialogue style and cooperativeness
  • Proportion of goal-directed dialogues vs. others
  • Number of times a piece of information is
    re-asked
  • Dialogue success rate and efficiency
  • Average goal/subgoal achievement rate
  • Schatzmann, J., Georgila, K., and Young, S.
    Quantitative Evaluation of User Simulation
    Techniques for Spoken Dialogue Systems. In
    Proceedings 6th SIGdial Workshop on Discourse and
    Dialogue. 2005.

78
Evaluating ProbCorrect vs. Random
  • Differences shown by similar metrics are not
    necessarily related to the reality level
  • two real corpora can be very different
  • Metrics can distinguish to some extent
  • real from simulated corpora
  • two simulated corpora generated by different
    models trained on the same real corpus
  • two simulated corpora generated by the same
    model trained on two different real corpora

79
Scalability Approach 2: Function Approximation
  • Q can be represented by a table only if the
    number of states × actions is small
  • Besides, a table makes poor use of experience
  • Hence, we use function approximation, e.g.
  • neural nets
  • weighted linear functions
  • case-based/instance-based/memory-based
    representations
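As an illustration of the weighted-linear-function option, a minimal semi-gradient Q-learning sketch with a hypothetical feature function φ(s, a); the feature design and parameter values are assumptions for illustration.

```python
import numpy as np

class LinearQ:
    """Linear Q-value approximation: Q(s, a) ~= w . phi(s, a).

    `featurize(s, a)` must return a fixed-length numpy vector; choosing
    which state/action features it encodes is the hard part in practice.
    """
    def __init__(self, featurize, n_features, alpha=0.01, gamma=0.9):
        self.featurize = featurize
        self.w = np.zeros(n_features)
        self.alpha = alpha
        self.gamma = gamma

    def q(self, s, a):
        return float(self.w @ self.featurize(s, a))

    def update(self, s, a, r, s_next, actions):
        """One semi-gradient Q-learning step on a logged transition."""
        target = r + self.gamma * max(self.q(s_next, a2) for a2 in actions)
        error = target - self.q(s, a)
        self.w += self.alpha * error * self.featurize(s, a)
```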

80
Current Research Topics
  • Scale to more complex systems
  • Automate state representation
  • POMDPs due to hidden state
  • Learn terminal (and non-terminal) reward function
  • Online rather than batch learning

81
Designing the State Representation
  • Incrementally add features to a state and test
    whether the learned strategy improves
  • Frampton, M. and Lemon, O. Learning More
    Effective Dialogue Strategies Using Limited
    Dialogue Move Features. Proceedings ACL/Coling.
    2006.
  • Adding Last System and User Dialogue Acts
    improves performance by 7.8%
  • Tetreault J. and Litman, D. Using Reinforcement
    Learning to Build a Better Model of Dialogue
    State. Proceedings EACL. 2006.
  • See below

82
Example Methodology and Evaluation in SDS Tutoring
  • Construct MDPs to test the inclusion of new
    state features to a baseline
  • Develop baseline state and policy
  • Add a state feature to the baseline and compare
    policies
  • A feature is deemed important if adding it
    results in a change in policy from a baseline
    policy
  • Joel R. Tetreault and Diane J. Litman. Comparing
    the Utility of State Features in Spoken Dialogue
    Using Reinforcement Learning. Proceedings
    HLT/NAACL. 2006.

83
Baseline Policy
State         State Size   Policy
1 Correct     1308         SimpleFeedback
2 Incorrect   872          SimpleFeedback
  • Trend: if you only have student correctness as a
    model of student state, the best policy is to
    always give simple feedback

84
Adding Certainty Features: Hypothetical Policy Change
Baseline State   Policy    +Certainty State
1 C              SimFeed   C,Certain   C,Neutral   C,Uncertain
2 I              SimFeed   I,Certain   I,Neutral   I,Uncertain
Certainty Policy 1 (0 shifts):   SimFeed   SimFeed           SimFeed
                                 SimFeed   SimFeed           SimFeed
Certainty Policy 2 (5 shifts):   Mix       SimFeed           Mix
                                 Mix       ComplexFeedback   Mix
85
Evaluation Results
  • Incorporating new features into standard tutorial
    state representation has an impact on Tutor
    Feedback policies
  • Including Certainty, Student Moves and Concept
    Repetition into the state effected the most
    change
  • Similar feature utility for choosing Tutor
    Questions

86
Designing the State Representation (continued)
  • Other Approaches, e.g.,
  • Paek, T. and Chickering, D. The Markov Assumption
    in Spoken Dialogue Management. Proc. SIGDial.
    2005.
  • Henderson, J., Lemon, O, and Georgila, K. Hybrid
    Reinforcement/Supervised Learning for Dialogue
    Policies from Communicator Data. Proc. IJCAI
    Workshop on KR in Practical Dialogue Systems.
    2005.

87
Current Research Topics
  • Scale to more complex systems
  • Automate state representation
  • POMDPs due to hidden state
  • Learn terminal (and non-terminal) reward function
  • Online rather than batch learning

88
Beyond MDPs
  • Partially Observable MDPs (POMDPs)
  • We don't REALLY know the user's state (we only
    know what we THOUGHT the user said)
  • So we need to take actions based on our BELIEF,
    i.e., a probability distribution over states
    rather than the true state
  • e.g., Roy, Pineau and Thrun; Young and Williams
  • Decision Theoretic Methods
  • e.g., Paek and Horvitz

89
Why POMDPs?
  • Does the state model uncertainty natively (i.e., is
    it partially rather than fully observable)?
  • Yes: POMDP and DT
  • No: MDP
  • Does the system plan (i.e., can cumulative reward
    force the system to construct a plan for choice
    of immediate actions)?
  • Yes: MDP and POMDP
  • No: DT

90
POMDP Intuitions
  • At each time step t, the machine is in some hidden
    state s ∈ S
  • Since we don't observe s, we keep a distribution
    over states called a belief state b
  • So the probability of being in state s given
    belief state b is b(s).
  • Based on the current belief state b, the machine
  • selects an action a_m ∈ A_m
  • Receives a reward r(s, a_m)
  • Transitions to a new (hidden) state s', where s'
    depends only on s and a_m
  • The machine then receives an observation o ∈ O,
    which is dependent on s' and a_m
  • The belief distribution is then updated based on o
    and a_m.
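The update in the last bullet has the standard POMDP form, with η a normalizing constant:

$$ b'(s') = \eta \, P(o \mid s', a_m) \sum_{s \in S} P(s' \mid s, a_m) \, b(s) $$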

91
How to Learn Policies?
  • The state space is now continuous
  • With a smaller discrete state space, an MDP could use
    dynamic programming; this doesn't work for POMDPs
  • Exact solutions only work for small spaces
  • Need approximate solutions
  • And simplifying assumptions

92
Current Research Topics
  • Scale to more complex systems
  • Automate state representation
  • POMDPs due to hidden state
  • Learn terminal (and non-terminal) reward function
  • Online rather than batch learning

93
Dialogue System Evaluation
  • The normal reason: we need a metric to help us
    compare different implementations
  • A new reason: we need a metric for how good a
    dialogue went, in order to automatically improve SDS
    performance via reinforcement learning
  • Marilyn Walker. An Application of Reinforcement
    Learning to Dialogue Strategy Selection in a
    Spoken Dialogue System for Email. JAIR. 2000.

94
PARADISE: PARAdigm for DIalogue System Evaluation
  • Performance of a dialogue system is affected
    both by what gets accomplished by the user and
    the dialogue agent and how it gets accomplished
  • Walker, M. A., Litman, D. J., Kamm, C. A., and
    Abella, A. PARADISE: A Framework for Evaluating
    Spoken Dialogue Agents. Proceedings of ACL/EACL.
    1997.

95
Performance as User Satisfaction (from
Questionnaire)
96
PARADISE Framework
  • Measure parameters (interaction costs and
    benefits) and performance in a corpus
  • Train model via multiple linear regression over
    parameters, predicting performance
  • System Performance = Σ_{i=1}^{n} w_i · p_i
  • Test model on new corpus
  • Predict performance during future system design
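A sketch of the regression step using ordinary least squares in NumPy; the parameter columns and numbers below are made-up placeholders, not actual PARADISE data or coefficients.

```python
import numpy as np

# Rows = dialogues; columns = measured parameters p_i (costs and benefits),
# e.g. task success, mean recognition score, #barge-ins, #rejections.
# These values are invented purely to illustrate the fitting step.
params = np.array([
    [1.0, 0.90, 2, 0],
    [0.0, 0.60, 5, 3],
    [1.0, 0.95, 1, 1],
    [0.0, 0.70, 4, 2],
])
user_satisfaction = np.array([4.5, 2.0, 4.8, 2.5])  # from questionnaires

# Multiple linear regression: performance ~ sum_i w_i * p_i (+ intercept)
X = np.hstack([params, np.ones((len(params), 1))])
weights, *_ = np.linalg.lstsq(X, user_satisfaction, rcond=None)

def predict_performance(p):
    """Predicted performance for a new dialogue's parameter vector p."""
    return float(np.append(p, 1.0) @ weights)
```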
97
Example Learned Performance Function from ELVIS
(Walker 2000)
  • User Sat. = .27·COMP + .54·MRS − .09·BargeIn + .15·Reject
  • COMP: User perception of task completion (task
    success)
  • MRS: Mean (concept) recognition accuracy
    (quality cost)
  • BargeIn: Normalized # of user interruptions
    (quality cost)
  • Reject: Normalized # of ASR rejections (quality
    cost)
  • Amount of variance in User Sat. accounted for by
    the model:
  • Average Training R² = .37
  • Average Testing R² = .38
  • Used as the reward for reinforcement learning

98
Some Current Research Topics
  • Scale to more complex systems
  • Automate state representation
  • POMDPs due to hidden state
  • Learn terminal (and non-terminal) reward function
  • Online rather than batch learning

99
Offline versus Online Learning
  • MDP typically works offline
  • Would like to learn policy online
  • System can improve over time
  • Policy can change as environment changes

[Diagram: in the offline/batch setting, training data is used to build an MDP, which yields a policy for the dialogue system; in the online setting, the dialogue system learns directly from interactions with a user simulator or a human user]
100
Summary
  • (PO)MDPs and RL are a promising framework for
    automated dialogue policy design
  • Designer states the problem and the desired goal
  • Solution methods find (or approximate) optimal
    plans for any possible state
  • Disparate sources of uncertainty unified into a
    probabilistic framework
  • Many interesting problems remain, e.g.,
  • using this approach as a practical methodology
    for system building
  • making more principled choices (states, rewards,
    discount factors, etc.)

101
Acknowledgements
  • Talks on the web by Dan Bohus, Derek Bridge,
    Joyce Chai, Dan Jurafsky, Oliver Lemon and James
    Henderson, Jost Schatzmann and Steve Young, and
    Jason Williams were used in the development of
    this presentation
  • Slides from ITSPOKE group at University of
    Pittsburgh

102
Further Information
  • Reinforcement Learning
  • Sutton, R. and Barto, G. Reinforcement Learning:
    An Introduction. MIT Press. 1998 (much available
    online)
  • Artificial Intelligence and Machine Learning
    Journals and Conferences
  • Application to Dialogue
  • Jurafsky, D. and Martin, J. Dialogue and
    Conversational Agents. Chapter 19 of Speech and
    Language Processing: An Introduction to Natural
    Language Processing, Computational Linguistics,
    and Speech Recognition. Draft of May 18, 2005
    (available online only)
  • ACL Literature
  • Spoken Language Community (e.g., IEEE and ISCA
    publications)