Title: How much data is enough? Generating reliable policies with MDPs
1 How much data is enough? Generating reliable policies with MDPs
- Joel Tetreault
- University of Pittsburgh
- LRDC
- July 14, 2006
2 Problem
- Problems in designing spoken dialogue systems:
- How to handle noisy data or miscommunications?
- How to avoid hand-tailoring policies for complex dialogues?
- What features to use?
- Previous work used machine learning to improve the dialogue manager of spoken dialogue systems (Singh et al., '02; Walker, '00; Henderson et al., '05)
- However, there is very little empirical work (Paek et al., '05; Frampton, '05) on comparing the utility of adding specialized features to construct a better dialogue state
3 Goal
- How does one choose which features best contribute to a better model of dialogue state?
- Goal: show the comparative utility of adding different features to a dialogue state
- 4 features: concept repetition, frustration, student performance, student moves
- All are important to tutoring systems, but also to dialogue systems in general
4 Previous Work
- In complex domains, annotation and testing are time-consuming, so it is important to properly choose the best features beforehand
- Developed a methodology for using Reinforcement Learning to determine whether adding complex features to a dialogue state will beneficially alter policies (Tetreault & Litman, EACL '06)
- Extensions:
- Methodology to determine which features are the best
- Also show our results generalize over different action choices (feedback vs. questions)
5 Outline
- Markov Decision Processes (MDP)
- MDP Instantiation
- Experimental Method
- Results
- Policies
- Feature Comparison
6 Markov Decision Processes
- What is the best action an agent should take at any state to maximize reward at the end?
- MDP Input:
- States
- Actions
- Reward Function
7 MDP Output
- Policy: the optimal action for the system to take in each state
- Calculated using policy iteration, which depends on:
- Propagating the final reward back to each state
- The probabilities of getting from one state to the next given a certain action
- Additional output: V-value, the worth of each state
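The policy-iteration loop described above can be written compactly for a small tabular MDP. This is an illustrative sketch, not the ITSPOKE system's code; the transition matrices and reward vector in the usage example below are made up for the illustration.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.95):
    """P[a][s, s'] = probability of moving from s to s' under action a;
    R[s] = reward for being in state s. Returns (policy, V), where V is
    the 'worth' (expected cumulative reward) of each state."""
    n_actions, n_states = len(P), len(R)
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve the linear system V = R + gamma * P_pi @ V
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to V
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            return policy, V
        policy = new_policy
```

For example, with two states where only action 1 reaches the rewarding state, the computed policy picks action 1 from state 0, and the V-values reflect the propagated reward.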
8 MDPs in Spoken Dialogue
[Diagram: training data feeds the MDP, which produces a policy for the dialogue system; the dialogue system interacts with a user simulator or a human user. The MDP works offline; the interactions work online.]
9 ITSPOKE Corpus
- 100 dialogues with ITSPOKE, a spoken dialogue tutoring system (Litman et al., '04)
- All possible dialogue paths were authored by physics experts
- Dialogues informally follow a question-answer format
- 60 turns per dialogue on average
- Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
10 Corpus Annotations
- Manual annotations:
- Tutor Moves (similar to Dialogue Acts) (Forbes-Riley et al., '05)
- Student Frustration and Certainty (Litman et al., '04; Liscombe et al., '05)
- Automated annotations:
- Correctness (based on the student's response to the last question)
- Concept Repetition (whether a concept is repeated)
- Percent Correctness (past performance)
11 MDP State Features
12 MDP Action Choices
13 MDP Reward Function
- Reward function: use normalized learning gain to do a median split on the corpus
- 10 students are high learners and the other 10 are low learners
- High-learner dialogues had a final state with a reward of +100; low-learner dialogues, -100
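The reward assignment can be sketched as follows. The normalized-learning-gain formula (posttest − pretest)/(1 − pretest) and the tie-breaking at the upper median are assumptions of this sketch, not details stated in the slides.

```python
def assign_final_rewards(students):
    """students: list of (pretest, posttest) scores, each in [0, 1].
    Normalized learning gain = (post - pre) / (1 - pre); a median split
    labels high learners' final states +100 and low learners' -100."""
    gains = [(post - pre) / (1.0 - pre) for pre, post in students]
    median = sorted(gains)[len(gains) // 2]  # upper median for even counts
    return [100 if g >= median else -100 for g in gains]
```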
14 Methodology
- Construct MDPs to test the inclusion of new state features against a baseline
- Develop a baseline state and policy
- Add a feature to the baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from the baseline policy, given 3 metrics:
- # of Policy Differences (Diffs)
- Policy Change (PC)
- Expected Cumulative Reward (ECR)
- For each MDP, verify policies are reliable (V-value convergence)
15 Hypothetical Policy Change Example
[Figure: two hypothetical comparisons, one yielding 0 Diffs and one yielding 5 Diffs from the baseline policy.]
16 Tests
[Diagram: Baseline 1 uses Correctness; Baseline 2 adds Certainty; Concept, Frustration, and Correctness are each added on top of B2.]
17 Baseline
- Actions: SAQ, CAQ, Mix, NoQ
- Baseline State: Correctness
[Diagram: baseline network over states C (correct), I (incorrect), and FINAL, with actions SAQ, CAQ, Mix, NoQ.]
18 Baseline 1 Policies
- Trend: if you only have student correctness as a model of student state, give a hint or other state act to the student; otherwise give a Mix of complex and short answer questions
19 But are our policies reliable?
- The best way to test is to run real experiments with human users with the new dialogue manager, but that is months of work
- Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data
- Method: run the MDP on subsets of our corpus (incrementally add one student (5 dialogues) to the data, and rerun the MDP on each subset)
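The incremental check above can be framed as a small harness. `estimate_mdp` and `solve_mdp` are hypothetical placeholder names standing in for the corpus-specific model estimation and the policy-iteration solver.

```python
def vvalue_curves(students, estimate_mdp, solve_mdp):
    """Re-estimate the MDP on growing subsets of the corpus, adding one
    student (5 dialogues) at a time, and record each subset's V-values.
    Reliable policies should show these curves flattening out."""
    curves, corpus = [], []
    for student in students:
        corpus.append(student)
        model = estimate_mdp(corpus)   # transitions + rewards from this subset
        _, V = solve_mdp(model)        # e.g. policy iteration
        curves.append(V)
    return curves
```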
20 Baseline Convergence Plot
21 Methodology: Adding More Features
- Create a more complicated baseline by adding the certainty feature (new baseline B2)
- Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline
- Check V-value and policy convergence
- Analyze policy changes
- Use the Feature Comparison Metrics to determine the relative utility of the four features
22 Tests
[Diagram repeated: Baseline 1 uses Correctness; Baseline 2 adds Certainty; Concept, Frustration, and Correctness are each added on top of B2.]
23 Certainty
- Previous work (Bhatt et al., '04) has shown the importance of certainty in ITS
- A student who is certain and correct may require a harder question, since he or she is doing well; but a correct answer with some doubt is a sign the student is becoming confused, so give an easier question
24 B2 (Baseline + Certainty) Policies
- Trend: if neutral, give SAQ or NoQ; else give Mix
25 Baseline 2 Convergence Plots
26 Baseline 2 Diff Plots
- Diff: for each subset corpus, compare its policy with the policy generated from the full corpus
27 Tests
[Diagram repeated: Baseline 1 uses Correctness; Baseline 2 adds Certainty; Concept, Frustration, and Correctness are each added on top of B2.]
28 Feature Comparison (3 metrics)
- # of Diffs
- Number of new states whose policies differ from the original
- Insensitive to how frequently a state occurs
- Policy Change (P.C.)
- Takes into account the frequency of each state-action sequence
29 Feature Comparison
- Expected Cumulative Reward (E.C.R.)
- One issue with P.C. is that frequently occurring states have low V-values and thus may bias the score
- Use the expected value of being at the start of the dialogue to compare features
- ECR: average V-value of all start states
30 Feature Comparison Results
- The trend SMove > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics
- Baseline: also tested the effects of a binary random feature
- Given enough data, a random feature should not alter policies
- Average diff of 5.1
31 How reliable are policies?
[Convergence plots for the Frustration and Concept features.]
- Possibly the data size is small, and with increased data we may see more fluctuations
32 Confidence Bounds
- Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value
- As data increases, the confidence interval of a V-value should shrink, reflecting a better model of the world
- Additionally, the policies should converge as well
33 Confidence Bounds
- CBs can also be used to distinguish how much better an additional state feature is over a baseline state space
- That is, check whether the lower bound of the new state space is greater than the upper bound of the baseline state space
34 Crossover Example
[Plot: ECR vs. amount of data; the more complicated model's confidence band eventually crosses above the baseline's.]
35 Confidence Bounds App 2
- Automatic model switching
- If you know a model at its worst (i.e., its lower bound) is better than another model's upper bound, then you can automatically switch to the more complicated model
- Good for online RL applications
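The switching rule is just a bound comparison. Using the confidence bounds reported later in the deck (Baseline 1 vs. Baseline 2) as example inputs:

```python
def should_switch(baseline_bounds, candidate_bounds):
    """Switch to the more complicated model once its lower confidence
    bound exceeds the baseline's upper bound (the crossover point).
    Each argument is a (lower, upper) pair of ECR bounds."""
    _, baseline_upper = baseline_bounds
    candidate_lower, _ = candidate_bounds
    return candidate_lower > baseline_upper

# Baseline 1 (0.24, 23.65) vs. Baseline 2 (39.62, 57.16): switch.
```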
36 Confidence Bound Methodology
- For each data slice, calculate upper and lower bounds on the V-value
- Take the transition matrix for the slice and sample from each row 1000 times using the Dirichlet distribution
- We do this because the observed transition matrix only approximates the real world, but should be close to it
- The result is 1000 new transition matrices that are all very similar
- Run the MDP on all 1000 transition matrices to get a range of ECRs
- Rows without much data are very volatile, so expect a large range of ECRs; but as data increases, the transition matrices should stabilize such that most of the new matrices produce policies and values similar to the original
- Take the upper and lower bounds at the 2.5th percentile from each end
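The resampling procedure above can be sketched as follows. The add-one Dirichlet prior, the value-iteration solver, and the choice of state 0 as the start state are assumptions of this sketch, not details given in the talk.

```python
import numpy as np

def solve_values(P, R, gamma=0.95, iters=500):
    """Value iteration: V(s) = R(s) + gamma * max_a sum_s' P[a][s,s'] V(s')."""
    V = np.zeros(len(R))
    for _ in range(iters):
        V = R + gamma * np.max([Pa @ V for Pa in P], axis=0)
    return V

def ecr_confidence_bounds(counts, R, gamma=0.95, n_samples=1000, seed=0):
    """counts[a][s, s'] = observed transition counts for action a.
    Resample each transition row from a Dirichlet posterior (add-one
    prior), re-solve the MDP per sample, and take the 2.5th/97.5th
    percentiles of the ECR (here: the V-value of start state 0)."""
    rng = np.random.default_rng(seed)
    ecrs = []
    for _ in range(n_samples):
        P = [np.array([rng.dirichlet(row + 1.0) for row in Ca])
             for Ca in counts]
        ecrs.append(solve_values(P, R, gamma)[0])
    return np.percentile(ecrs, 2.5), np.percentile(ecrs, 97.5)
```

With abundant counts the sampled matrices barely vary and the bounds tighten around the true ECR; with sparse rows they spread out, exactly the volatility described above.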
37 Experiment
- The original action/state setup did not show anything promising
- State/action space too large for the data?
- Not the best MDP instantiation?
- Looked at a variety of MDP configurations:
- Refined the reward metric
- Added discourse segmentation
38 Essay Instantiation with 0305 data
39 Essay: Baseline 1
40 Essay: Baseline 2
41 Essay: B2 + SMove
42 Feature Comparison Results
- Reduced state size: Certainty collapsed to CertNeutral, Uncert
- Trend: SMove and Concept Repetition are the best features
- B2 ECR: 31.92
43 Baseline 1
- Upper: 23.65; Lower: 0.24
44 Baseline 2
- Upper: 57.16; Lower: 39.62
45 B2 + Concept Repetition
- Upper: 64.30; Lower: 49.16
46 B2 + Percent Correctness
- Upper: 48.42; Lower: 32.86
47 B2 + Student Move
- Upper: 61.36; Lower: 39.94
48 Discussion
- Baseline 2 shows the crossover effect and policy stability
- More complex features (B2 + X) show the crossover effect, but it is not clear their policies are stable (some stabilize at 17 students)
- Indicates that 100 dialogues isn't enough for even this simple MDP? (but is enough to feel confident about Baseline 2?)