Title: Using Reinforcement Learning to Build a Better Model of Dialogue State
1. Using Reinforcement Learning to Build a Better Model of Dialogue State
- Joel Tetreault
- LRDC
- University of Pittsburgh
- August 3, 2006
2. Interests
- Natural Language Processing
  - How do we get computers to understand speech or text?
  - How do we get computers to reply in an intelligent and satisfactory fashion?
- Research
  - Discourse Processing
  - Pronoun Resolution
  - Affect Detection (IR)
  - Machine learning for Spoken Dialogue Systems
3. Intelligent Tutoring Systems
- Students who receive one-on-one instruction perform as well as the top two percent of students who receive traditional classroom instruction (Bloom, 1984)
- Unfortunately, providing every student with a personal human tutor is infeasible
- Develop computer tutors instead
4. Intelligent Tutoring Systems
- Working hypothesis regarding learning gains:
  - Human Dialogue > Computer Dialogue > Text
- Long-term goal of ITS designers:
  - Close the gap between human and computer dialogue
  - Make adaptive systems
5. How to do it?
- U. Pittsburgh ITSPOKE group:
  - Use speech (instead of a text-based system)
  - Emotion detection
  - Use information about the content of the student's response
  - Dialogue context
- However, with all the features and factors that influence learning gain, how does one design a system that takes each factor into account properly?
- Are some features more important than others?
6. Reinforcement Learning for SDSs
- Previous work has researched using machine learning techniques to find the best action for a system to take given huge state spaces (Singh et al., 2002; Walker, 2000; Henderson et al., 2005)
- Problems with designing spoken dialogue systems:
  - How to handle noisy data or miscommunications?
  - Hand-tailoring policies for complex dialogues?
- However, there is very little empirical work (Paek et al., 2005; Frampton, 2005) comparing the utility of adding specialized features to construct a better dialogue state
7. Goal
- How does one choose which features best contribute to a better model of dialogue state?
- Goal: show the comparative utility of adding four different features to a dialogue state
- The 4 features: concept repetition, frustration, student performance, student moves
- All are important to tutoring systems, but also important to dialogue systems in general
8. Goal
- Long-term goal:
  - The current ITSPOKE system only responds to the correctness of the last student turn
  - Determine the best state features and actions (for each state) that would improve the system
  - Incorporate the action and state set into a new dialogue manager and test on human subjects to measure improvement
9. Outline
- Markov Decision Processes (MDP)
- MDP Instantiation
- Experimental Method
- Results
- Policies
- Feature Comparison
10. Markov Decision Processes
- What is the best action an agent should take at any state to maximize reward at the end?
- MDP input:
  - States
  - Actions
  - Reward Function
11. MDP Output
- Policy: the optimal action for the system to take in each state
- Calculated using policy iteration, which depends on:
  - Propagating the final reward back to each state
  - The probabilities of transitioning from one state to the next given a certain action
- Additional output: V-value, the worth of each state
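Policy iteration, as described above, alternates between evaluating the current policy and greedily improving it until the policy stops changing. A minimal sketch on a generic toy MDP (not the tutoring MDP; `P`, `R`, and the discount factor are assumed inputs):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """P[a][s, s'] = transition probabilities for action a; R[s] = reward.
    Returns the optimal policy (best action per state) and the V-values."""
    n_actions, n_states = len(P), len(R)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        # Policy evaluation: solve V = R + gamma * P_pi V exactly
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to the current V
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break  # policy is stable, hence optimal
        policy = new_policy
    return policy, V
```

The exact linear solve in the evaluation step is what "propagates the final reward to each state" in the bullet above.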
12. What's the best path to the fly?
13. MDP Frog Example
(Figure: a grid of states in which each hop costs -1 and the final state is worth +1)
14. MDP Frog Example
(Figure: the same grid with V-values propagated back from the final state, e.g. 0, -1, -2, -3 along paths of increasing distance)
15. MDPs in Spoken Dialogue
(Diagram: Training data feeds an MDP, which produces a Policy for the Dialogue System; the MDP works offline. The Dialogue System then interacts online with a User Simulator or a Human User.)
16. ITSPOKE System
- ITSPOKE: spoken dialogue tutoring system (Litman et al., 2004)
- Back-end is the Why2-Atlas system (VanLehn et al., 2002)
- Sphinx2 speech recognition and Cepstral text-to-speech
17-19. (No transcript: figure-only slides)
20. ITSPOKE Corpus
- 100 dialogues with the ITSPOKE spoken dialogue tutoring system (Litman et al., 2004)
- All possible dialogue paths were authored by physics experts
- Dialogues informally follow a question-answer format
- 60 turns per dialogue on average
- Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
21. Corpus Annotations
- Manual annotations:
  - Tutor and Student Moves (similar to Dialogue Acts) (Forbes-Riley et al., 2005)
  - Student Frustration and Certainty (Litman et al., 2004; Liscombe et al., 2005)
- Automated annotations:
  - Correctness (based on the student's response to the last question)
  - Concept Repetition (whether a concept is repeated)
  - Percent Correctness (past performance)
22. MDP State Features
23. (No transcript: figure-only slide)
24. Student Frustration
Tutor 26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student 27: dammit (ASR: it is)
Tutor 28: Could you please repeat that?
Student 29: same (ASR: i same)
Tutor 30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student 31: zero (ASR: the zero)
Tutor 32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <omitted>
Student 33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor 34: Fine. Are there any other forces acting on the apple as it falls?
Student 35: no why are you doing this again (ASR: no y and to it yes)
Tutor 36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student 37: downward you computer (ASR: downward you computer)
25. MDP Action Choices
26. MDP Reward Function
- Reward function: use normalized learning gain to do a median split on the corpus
- 10 students are high learners and the other 10 are low learners
- High learners' dialogues had a final state with a reward of +100; low learners', -100
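The median-split reward assignment can be sketched as follows (`gains` maps each student to a normalized learning gain; the function name is illustrative):

```python
import statistics

def assign_final_rewards(gains):
    # Median split on normalized learning gain: students above the median
    # are "high learners" (+100 final-state reward), the rest low (-100).
    med = statistics.median(gains.values())
    return {s: (100 if g > med else -100) for s, g in gains.items()}
```

With 20 students and distinct gain values, this yields the 10/10 split described above.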
27. Methodology
- Construct MDPs to test the inclusion of new state features against a baseline
- Develop a baseline state and policy
- Add a feature to the baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from the baseline policy, given 3 metrics:
  - # of Policy Differences (Diffs)
  - Policy Change (P.C.)
  - Expected Cumulative Reward (E.C.R.)
- For each MDP, verify that policies are reliable (V-value convergence)
28. Hypothetical Policy Change Example
(Figure: two hypothetical policies compared against an original, one showing 0 Diffs and one showing 5 Diffs)
29. Tests
(Diagram: Baseline 1 (B1) = Correctness; Baseline 2 (B2) = B1 + Certainty; SMove, Concept Repetition, Frustration, and Percent Correct are each added individually to B2)
30. Baseline
- Actions: SAQ, CAQ, Mix, NoQ
- Baseline state: Correctness
(Diagram: baseline network; from states C (correct) and I (incorrect), each action SAQ/CAQ/Mix/NoQ transitions toward the FINAL state)
31. Baseline 1 Policies
- Trend: if you only have student correctness as a model of student state, give a hint or other state act to the student; otherwise give a Mix of complex and short answer questions
32. But are our policies reliable?
- The best way to test is to run real experiments with human users on the new dialogue manager, but that is months of work
- Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data
- Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data, and rerun the MDP on each subset)
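The subset-and-rerun check can be sketched like this; `train_policy` is a stand-in for the full MDP training step (not shown), and the diff count is the # Diffs metric against the full-corpus policy:

```python
def convergence_diffs(students, train_policy):
    """Re-train on growing subsets of the corpus (one student, i.e. 5
    dialogues, at a time) and count, for each subset, how many states'
    learned actions differ from the full-corpus policy."""
    full = train_policy(students)  # policy from the entire corpus
    diffs = []
    for i in range(1, len(students) + 1):
        p = train_policy(students[:i])  # policy from the first i students
        diffs.append(sum(p.get(s) != a for s, a in full.items()))
    return diffs  # should tend toward 0 as the subsets grow
```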
33. Baseline Convergence Plot
34. Methodology: Adding More Features
- Create a more complicated baseline by adding the certainty feature (new baseline B2)
- Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline
- Check V-value and policy convergence
- Analyze policy changes
- Use the feature comparison metrics to determine the relative utility of the features
35. Tests
(Roadmap diagram repeated: B1 = Correctness; B2 = B1 + Certainty; SMove, Concept Repetition, Frustration, and Percent Correct added individually to B2)
36. Certainty
- Previous work (Bhatt et al., 2004) has shown the importance of certainty in ITS
- A student who is certain and correct may require a harder question, since he or she is doing well; but a student who is correct yet showing some doubt may be becoming confused, so give an easier question
37. B2 (Baseline + Certainty) Policies
- Trend: if neutral, give SAQ or NoQ; else give Mix
38. Baseline 2 Convergence Plots
39. Baseline 2 Diff Plots
- Diff: for each subset corpus, compare its policy with the policy generated from the full corpus
40. Tests
(Roadmap diagram repeated: B1 = Correctness; B2 = B1 + Certainty; SMove, Concept Repetition, Frustration, and Percent Correct added individually to B2)
41. Concept Repetition Policies
- Trend: if a concept is repeated (R), give CAQ
42. Frustration Policies
- Trend: if neutral, give CAQ
43. Percent Correctness Policies
- Trend: give Mix, especially for low performers
44. Feature Comparison (3 metrics)
- Diffs
  - Number of new states whose policies differ from the original
  - Insensitive to how frequently a state occurs
- Policy Change (P.C.)
  - Takes into account the frequency of each state-action sequence
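One plausible reading of the P.C. metric is to weight each differing state by its corpus frequency; the paper's exact weighting may differ, so treat this as an illustrative sketch:

```python
def policy_change(new_policy, base_policy, state_freq):
    # Sum the relative frequencies of states whose learned action changed,
    # so frequently occurring states count for more than rare ones.
    return sum(f for s, f in state_freq.items()
               if new_policy.get(s) != base_policy.get(s))
```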
45. Feature Comparison
- Expected Cumulative Reward (E.C.R.)
  - One issue with P.C. is that frequently occurring states may have low V-values and thus may bias the score
  - Use the expected value of being at the start of the dialogue to compare features
  - ECR: the average V-value of all start states
46. Question Act Results
- The trend of SMove > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics
- Baseline: also tested the effects of a binary random feature
  - Given enough data, a random feature should not alter policies
  - Average of 5.1 Diffs
47. Feedback Act Results
- The trend of SMove and Concept Repetition being the best stays the same, though features have less impact given this action set
- Frustration is now slightly worse than Percent Correctness
48. Discussion
- Incorporating more information into the representation of the student state has an impact on tutor policies
- Proposed three metrics to determine the relative weight of the features
- Including the last Student Move and Concept Repetition effected the most change across different action sets
49. Future Work
- Next step: take promising state features and resulting policies and implement them in ITSPOKE
- Evaluate with human users and simulated users
- Also researching: how much data is enough to prove the reliability of policies?
50. How reliable are policies?
(Diff plots for Frustration and Concept Repetition)
- Possibly the data size is small, and with increased data we may see more fluctuations
51-52. CB Example
(Figures: a small example MDP with states S0, S1, S2 and transition counts 2, 1, 5, used to illustrate confidence bounds)
53. Confidence Bounds
- Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value
- As data increases, the confidence bounds of the V-values should shrink, reflecting a better model of the world
- Additionally, the policies should converge as well
54. Confidence Bound Methodology
- For each data slice, calculate upper and lower bounds on the V-value:
  - Take the transition matrix for the slice and sample from each row 1000 times using the Dirichlet distribution
  - This yields 1000 new transition matrices that are all very similar
  - Run the MDP on all 1000 transition matrices to get a range of ECRs
  - Rows without a lot of data are very volatile, so expect a large range of ECRs; but as data increases, the transition matrices should stabilize
  - Take upper and lower bounds at the 2.5th percentile of each tail
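A sketch of the bootstrap described above. `solve_ecr` stands in for re-running the MDP on a sampled transition matrix, and the +1 Dirichlet prior is an assumption, not something the talk specifies:

```python
import numpy as np

def ecr_confidence_bounds(counts, solve_ecr, n_samples=1000, seed=0):
    # counts[s, s'] = observed transitions for one action. Sample each row
    # from a Dirichlet over the counts, re-solve for the ECR each time,
    # and report the 2.5th/97.5th percentiles as lower/upper bounds.
    rng = np.random.default_rng(seed)
    ecrs = []
    for _ in range(n_samples):
        T = np.vstack([rng.dirichlet(row + 1) for row in counts])
        ecrs.append(solve_ecr(T))
    return np.percentile(ecrs, 2.5), np.percentile(ecrs, 97.5)
```

Rows with many observations produce tightly clustered samples, so the ECR interval narrows as data accumulates, matching the trend in the slides.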
55. Confidence Bounds
- CBs can also be used to distinguish how much better an additional state feature is over a baseline state space
- That is, the new state space is reliably better if its lower bound is greater than the upper bound of the baseline state space
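The comparison rule above is a one-line test; the numbers in the usage note are the B1/B2 bounds reported later in this talk:

```python
def clearly_better(new_lower_bound, baseline_upper_bound):
    # The richer state space is reliably better once its worst-case ECR
    # exceeds the baseline's best-case ECR.
    return new_lower_bound > baseline_upper_bound
```

For example, B2's lower bound of 39.62 exceeds B1's upper bound of 23.65, so B2 is judged reliably better.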
56. Crossover Example
(Plot: ECR vs. amount of data; the more complicated model's confidence bounds eventually cross above the baseline's)
57. Confidence Bounds: Application 2
- Automatic model switching
  - If you know that a model at its worst (i.e., its lower bound) is better than another model's upper bound, then you can automatically switch to the more complicated model
  - Good for online RL applications
58. Preliminary Results
- As data increases, confidence bounds for all models shrink
- Baseline 2 (certainty) has a lower bound that is higher than the upper bound of B1, and its policies tend to stabilize
- More complex states take longer to stabilize but still perform better than the baselines
59. Baseline 1
- Upper: 23.65; Lower: 0.24
60. Baseline 2
- Upper: 57.16; Lower: 39.62
61. B2 + Concept Repetition
- Upper: 64.30; Lower: 49.16
62. Acknowledgements
- ITSPOKE research group
- Dialog on Dialogs research group
- For more information:
  - http://www.cs.pitt.edu/tetreaul
  - NLP/CL conference listings