How much data is enough? Generating reliable policies w/MDPs - PowerPoint PPT Presentation

Provided by: csC76

Learn more at: http://www.cs.cmu.edu

Transcript and Presenter's Notes

1
How much data is enough? Generating reliable
policies w/MDPs

Joel Tetreault
University of Pittsburgh
LRDC
July 14, 2006

2
Problem

Problems with designing spoken dialogue systems
How to handle noisy data or miscommunications?
Hand-tailoring policies for complex dialogues?
What features to use?
Previous work used machine learning to improve
the dialogue manager of spoken dialogue systems
(Singh et al., 02; Walker, 00; Henderson et
al., 05)
However, there is very little empirical work (Paek et al.,
05; Frampton, 05) on comparing the utility of
adding specialized features to construct a better
dialogue state

3
Goal

How does one choose which features best
contribute to a better model of dialogue state?
Goal: show the comparative utility of adding
four different features to a dialogue state
4 features: concept repetition, frustration,
student performance, student moves
All are important to tutoring systems, but also
are important to dialogue systems in general

4
Previous Work

In complex domains, annotation and testing is
time-consuming so it is important to properly
choose best features beforehand
Developed a methodology for using Reinforcement
Learning to determine whether adding complex
features to a dialogue state will beneficially
alter policies (Tetreault & Litman, EACL 06)
Extensions
Methodology to determine which features are the
best
Also show our results generalize over different
action choices (feedback vs. questions)

5
Outline

Markov Decision Processes (MDP)
MDP Instantiation
Experimental Method
Results
Policies
Feature Comparison

6
Markov Decision Processes

What is the best action an agent should take at
any state to maximize reward at the end?
MDP Input
States
Actions
Reward Function

7
MDP Output

Policy: optimal action for system to take in each
state
Calculated using policy iteration, which depends
on
propagating final reward to each state
the probabilities of getting from one state to
the next given a certain action
Additional output: V-value, the worth of each
state
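
A minimal sketch of policy iteration on a toy MDP may make the two outputs (policy and V-values) concrete. The transition probabilities, rewards, discount factor and function names below are invented for illustration; only the state and action labels echo the slides, and this is not the implementation used in the talk.

```python
# Toy MDP: the transition probabilities, rewards and discount are invented
# for illustration; only the state/action labels echo the slides.
# P[state][action] = list of (probability, next_state, reward)
P = {
    "I": {"SAQ": [(1.0, "C", 0.0)],
          "NoQ": [(1.0, "FINAL", -100.0)]},
    "C": {"SAQ": [(0.8, "FINAL", 100.0), (0.2, "I", 0.0)],
          "NoQ": [(1.0, "I", 0.0)]},
}


def q_value(P, V, state, action, gamma):
    """Expected return of taking `action` in `state`, then following V."""
    return sum(p * (r + gamma * V.get(s2, 0.0)) for p, s2, r in P[state][action])


def evaluate(P, policy, gamma, tol=1e-9):
    """Iterative policy evaluation: V-value of each state under a fixed policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = q_value(P, V, s, policy[s], gamma)
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V


def policy_iteration(P, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    policy = {s: sorted(P[s])[0] for s in P}  # arbitrary starting policy
    while True:
        V = evaluate(P, policy, gamma)
        changed = False
        for s in P:
            best = max(sorted(P[s]), key=lambda a: q_value(P, V, s, a, gamma))
            # Switch only on a strict improvement so ties cannot cause cycling.
            if q_value(P, V, s, best, gamma) > q_value(P, V, s, policy[s], gamma) + 1e-9:
                policy[s] = best
                changed = True
        if not changed:
            return policy, V
```

Calling `policy_iteration(P)` returns the chosen action per state together with the V-value of each state; the terminal state FINAL has no outgoing actions and is treated as worth 0.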

8
MDPs in Spoken Dialogue
MDP works offline: Training data, MDP, Policy
Interactions work online: Dialogue System, User Simulator, Human User
9
ITSPOKE Corpus

100 dialogues with ITSPOKE spoken dialogue
tutoring system (Litman et al., 04)
All possible dialogue paths were authored by
physics experts
Dialogues informally follow question-answer
format
60 turns per dialogue on average
Each student session has 5 dialogues bookended by
a pretest and posttest to calculate how much
the student learned

10
Corpus Annotations

Manual annotations
Tutor Moves (similar to Dialog Acts)
(Forbes-Riley et al., 05)
Student Frustration and Certainty
(Litman et al., 04; Liscombe et al., 05)
Automated annotations
Correctness (based on student's response to last
question)
Concept Repetition (whether a concept is
repeated)
Percent Correctness (past performance)

11
MDP State Features
12
MDP Action Choices
13
MDP Reward Function

Reward Function: use normalized learning gain to
do a median split on corpus
10 students are high learners and the other 10
are low learners
High learner dialogues had a final state with a
reward of 100, low learners had one of -100
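
The median split can be sketched as follows. The function name, the student ids in the usage note and the tie rule (a gain exactly at the median counts as low) are assumptions; the slides give only the split itself and the +100 / -100 rewards.

```python
from statistics import median


def assign_final_rewards(gains, high=100, low=-100):
    """Give each student +100 if their normalized learning gain is above the
    corpus median, otherwise -100 (ties at the median count as low here,
    which is an assumption)."""
    cut = median(gains.values())
    return {student: (high if gain > cut else low)
            for student, gain in gains.items()}
```

For example, `assign_final_rewards({"s1": 0.1, "s2": 0.4, "s3": 0.6, "s4": 0.9})` splits at 0.5 and marks s3 and s4 as high learners.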

14
Methodology

Construct MDPs to test the inclusion of new
state features to a baseline
Develop baseline state and policy
Add a feature to baseline and compare policies
A feature is deemed important if adding it
results in a change in policy from a baseline
policy given 3 metrics:
# of Policy Differences (Diffs)
Policy Change (PC)
Expected Cumulative Reward (ECR)
For each MDP, verify policies are reliable
(V-value convergence)

15
Hypothetical Policy Change Example
0 Diffs
5 Diffs
16
Tests
Baseline 1 (B1): Correctness
Baseline 2 (B2): Correctness, Certainty
Features tested on top of B2: Concept, Frustration, Correct
17
Baseline

Actions: SAQ, CAQ, Mix, NoQ
Baseline State: Correctness

Baseline network
States: C, I, FINAL
Actions on each transition: SAQ, CAQ, Mix, NoQ
18
Baseline 1 Policies

Trend: if you only have student correctness as a
model of student state, give a hint or other
state act to the student, otherwise give a Mix of
complex and short answer questions

19
But are our policies reliable?

Best way to test is to run real experiments with
human users with new dialogue manager, but that
is months of work
Our tack: check whether our corpus is large enough to
develop reliable policies by seeing if V-values
converge as we add more data to the corpus
Method: run MDP on subsets of our corpus
(incrementally add a student (5 dialogues) to the
data, and rerun the MDP on each subset)
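
The incremental-subset loop might look like the sketch below. It assumes each student's dialogues have already been reduced to (state, action, next_state) triples and that transition probabilities are estimated by relative frequency; both the data layout and the estimator are assumptions, since the slides do not specify them. Solving the MDP on each model and recording the V-values is left to the caller.

```python
from collections import Counter, defaultdict


def estimate_transitions(triples):
    """Relative-frequency estimate of P(next_state | state, action)."""
    counts = defaultdict(Counter)
    for state, action, next_state in triples:
        counts[(state, action)][next_state] += 1
    return {
        sa: {s2: n / sum(c.values()) for s2, n in c.items()}
        for sa, c in counts.items()
    }


def cumulative_models(triples_by_student):
    """Yield (number_of_students, transition_model), adding one student at a time."""
    seen = []
    for k, triples in enumerate(triples_by_student, start=1):
        seen.extend(triples)
        yield k, estimate_transitions(seen)
```

Plotting a state's V-value against the number of students included then gives a convergence curve of the kind described on this slide.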

20
Baseline Convergence Plot
21
Methodology: Adding more Features

Create more complicated baseline by adding
certainty feature (new baseline B2)
Add other 4 features (concept repetition,
frustration, performance, student move)
individually to new baseline
Check V-value and policy convergence
Analyze policy changes
Use Feature Comparison Metrics to determine the
relative utility of the four features

22
Tests
Baseline 1 (B1): Correctness
Baseline 2 (B2): Correctness, Certainty
Features tested on top of B2: Concept, Frustration, Correct
23
Certainty

Previous work (Bhatt et al., 04) has shown the
importance of certainty in ITS
A student who is certain and correct may require
a harder question since he or she is doing well,
but a student who is correct yet shows some doubt
may be becoming confused, so give an easier
question

24
B2: Baseline + Certainty Policies
Trend: if neutral, give SAQ or NoQ, else give Mix
25
Baseline 2 Convergence Plots
26
Baseline 2 Diff Plots
Diff: for each subset of the corpus, compare its policy with
the policy generated with the full corpus
27
Tests
Baseline 1 (B1): Correctness
Baseline 2 (B2): Correctness, Certainty
Features tested on top of B2: Concept, Frustration, Correct
28
Feature Comparison (3 metrics)

Diffs
Number of new states whose policies differ from
the original
Insensitive to how frequently a state occurs
Policy Change (P.C.)
Takes into account the frequency of each
state-action sequence
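
A rough sketch of the Diffs count, assuming each new state is a (baseline_state, feature_value) pair so it can be compared with the baseline state it refines. The second function is only one possible reading of "takes frequency into account" and is not claimed to be the Policy Change formula from the talk; all names and the data layout are hypothetical.

```python
def count_diffs(baseline_policy, new_policy):
    """Number of new states whose action differs from the baseline action
    of the state they refine. Insensitive to how often a state occurs."""
    return sum(
        1
        for (base_state, _feature), action in new_policy.items()
        if action != baseline_policy[base_state]
    )


def frequency_weighted_diffs(baseline_policy, new_policy, state_counts):
    """Share of observed state visits that fall in states whose action changed.
    An assumed, simplified frequency-sensitive variant (not the talk's PC)."""
    total = sum(state_counts.values())
    changed = sum(
        state_counts[state]
        for state, action in new_policy.items()
        if action != baseline_policy[state[0]]
    )
    return changed / total
```

With the unweighted count, a rarely visited state and a very common one contribute equally, which is the insensitivity noted above.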

29
Feature Comparison

Expected Cumulative Reward (E.C.R.)
One issue with P.C. is that frequently occurring
states have low V-values and thus may bias the
score
Use the expected value of being at the start of
the dialogue to compare features
ECR: average V-value of all start states
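
As a small illustration, ECR as described here is a plain average over start states. The optional `weights` argument (for example, how often each start state occurs) is an assumed extension and is not stated on the slide.

```python
def expected_cumulative_reward(v_values, start_states, weights=None):
    """Average V-value over the start states; optionally weighted (assumption)."""
    if weights is None:
        return sum(v_values[s] for s in start_states) / len(start_states)
    total = sum(weights[s] for s in start_states)
    return sum(weights[s] * v_values[s] for s in start_states) / total
```

For V-values of 40 and 20 on two start states, the unweighted ECR is 30.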

30
Feature Comparison Results

Trend of SMove > Concept Repetition > Frustration
> Percent Correctness stays the same over all
three metrics
Baseline: also tested the effects of a binary
random feature
If enough data, a random feature should not alter
policies
Average diff of 5.1

31
How reliable are policies?
Frustration
Concept
Possibly the data size is small, and with increased
data we may see more fluctuations
32
Confidence Bounds

Hypothesis: instead of looking at the V-values
and policy differences directly, look at the
confidence bounds of each V-value
As data increases, the confidence bounds of each V-value should
shrink to reflect a better model of the world
Additionally, the policies should converge as
well

33
Confidence Bounds

CBs can also be used to distinguish how much
better an additional state feature is over a
baseline state space
That is, check whether the lower bound of a new state space
is greater than the upper bound of the baseline
state space

34
Crossover Example
Plot of ECR against Data for the Baseline and the More complicated Model
35
Confidence Bounds: Application 2

Automatic model switching
If you know that a model at its worst (i.e., its
lower bound) is better than another model's upper
bound, then you can automatically switch to the
more complicated model
Good for online RL applications
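
The switching rule can be sketched as a single comparison. The numbers below are the Upper and Lower values shown on the Baseline 1, Baseline 2 and B2 Concept Repetition slides later in this transcript, assumed here to be ECR bounds; the function and constant names are hypothetical.

```python
def clearly_better(new_bounds, base_bounds):
    """True if the new model's lower bound exceeds the baseline's upper bound.
    Each argument is a (lower, upper) pair."""
    new_lower, _new_upper = new_bounds
    _base_lower, base_upper = base_bounds
    return new_lower > base_upper


# (lower, upper) values taken from the later slides of this talk
BASELINE_1 = (0.24, 23.65)
BASELINE_2 = (39.62, 57.16)
B2_CONCEPT = (49.16, 64.30)
```

With these values `clearly_better(BASELINE_2, BASELINE_1)` holds (39.62 > 23.65), while `clearly_better(B2_CONCEPT, BASELINE_2)` does not (49.16 < 57.16), so under this strict rule only the first switch would be made automatically.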

36
Confidence Bound Methodology

For each data slice, calculate upper and lower
bounds on the V-value
Take transition matrix for slice and sample from
each row using the Dirichlet distribution 1000
times
Do this because the collected data does not exactly
match what data is like in the real
world, but may be close
So get 1000 new transition matrices that are all
very similar
Run MDP on all 1000 transition matrices to get a
range of ECRs
Rows with not a lot of data are very volatile, so
expect a large range of ECRs, but as data
increases, transition matrices should stabilize
such that most of the new matrices produce
similar policies and values as the original
Take upper and lower bounds at the 97.5th and 2.5th percentiles
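
A sketch of this resampling procedure under stated assumptions: each row is resampled from a Dirichlet whose parameters are the observed transition counts, the MDP is solved by value iteration with a discount of 0.9, and ECR is the unweighted mean V-value of the start states. The slides name none of these details, and all function names and the data layout are hypothetical.

```python
import random


def sample_dirichlet(counts, rng):
    """One draw from Dirichlet(counts), via normalized Gamma variates."""
    draws = [rng.gammavariate(c, 1.0) for c in counts]
    total = sum(draws)
    return [d / total for d in draws]


def value_iteration(model, final_reward, gamma=0.9, tol=1e-6):
    """model[(s, a)] = {next_state: prob}; final_reward[s2] is paid on
    entering final state s2. Final states have no actions and are worth 0."""
    states = sorted({s for s, _ in model})
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(p * (final_reward.get(s2, 0.0) + gamma * V.get(s2, 0.0))
                    for s2, p in dist.items())
                for (s1, _a), dist in model.items() if s1 == s
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V


def ecr_bounds(counts, final_reward, start_states, n_samples=1000, seed=0):
    """counts[(s, a)] = {next_state: observed count (> 0)}.
    Returns (lower, upper) ECR at roughly the 2.5th and 97.5th percentiles."""
    rng = random.Random(seed)
    ecrs = []
    for _ in range(n_samples):
        model = {}
        for sa, nxt in counts.items():
            probs = sample_dirichlet(list(nxt.values()), rng)
            model[sa] = dict(zip(nxt.keys(), probs))
        V = value_iteration(model, final_reward)
        ecrs.append(sum(V[s] for s in start_states) / len(start_states))
    ecrs.sort()
    lower = ecrs[int(0.025 * n_samples)]
    upper = ecrs[int(0.975 * n_samples) - 1]
    return lower, upper
```

Rows built from few observations give widely varying samples and hence wide bounds; multiplying every count by a large factor should make the interval visibly narrower.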

37
Experiment

Original action/state setup did not show anything
promising
State/action space too large for data?
Not best MDP instantiation
Looked at a variety of MDP configurations
Refined reward metric
Adding discourse segmentation

38
essay Instantiation with 0305 data
39
essay Baseline 1
40
essay Baseline 2
41
essay B2 + SMove
42
Feature Comparison Results

Reduced state size: Certainty (CertNeutral,
Uncert)
Trend that SMove and Concept Repetition are the
best features
B2 ECR: 31.92

43
Baseline 1
Upper: 23.65, Lower: 0.24
44
Baseline 2
Upper: 57.16, Lower: 39.62
45
B2 + Concept Repetition
Upper: 64.30, Lower: 49.16
46
B2 + Percent Correctness
Upper: 48.42, Lower: 32.86
47
B2 + Student Move
Upper: 61.36, Lower: 39.94
48
Discussion

Baseline 2 has crossover effect and policy
stability
More complex features (B2 + X) have crossover
effect, but not sure if policies are stable (some
stabilize at 17 students)
Indicates that 100 dialogues isn't enough for
even this simple MDP? (but is enough for
baseline 2 to feel confident about?)
