How much data is enough? Generating reliable policies w/MDPs

Slides: 49
Provided by: csC76
Learn more at: http://www.cs.cmu.edu

1
How much data is enough? Generating reliable
policies w/MDPs
  • Joel Tetreault
  • University of Pittsburgh
  • LRDC
  • July 14, 2006

2
Problem
  • Problems with designing spoken dialogue systems
  • How to handle noisy data or miscommunications?
  • Hand-tailoring policies for complex dialogues?
  • What features to use?
  • Previous work used machine learning to improve
    the dialogue manager of spoken dialogue systems
  • (Singh et al., 02; Walker, 00; Henderson et
    al., 05)
  • However, very little empirical work (Paek et al.,
    05; Frampton, 05) on comparing the utility of
    adding specialized features to construct a better
    dialogue state

3
Goal
  • How does one choose which features best
    contribute to a better model of dialogue state?
  • Goal: show the comparative utility of adding
    four different features to a dialogue state
  • 4 features: concept repetition, frustration,
    student performance, student moves
  • All are important to tutoring systems, but also
    are important to dialogue systems in general

4
Previous Work
  • In complex domains, annotation and testing are
    time-consuming, so it is important to properly
    choose the best features beforehand
  • Developed a methodology for using Reinforcement
    Learning to determine whether adding complex
    features to a dialogue state will beneficially
    alter policies (Tetreault & Litman, EACL 06)
  • Extensions
  • Methodology to determine which features are the
    best
  • Also show our results generalize over different
    action choices (feedback vs. questions)

5
Outline
  • Markov Decision Processes (MDP)
  • MDP Instantiation
  • Experimental Method
  • Results
  • Policies
  • Feature Comparison

6
Markov Decision Processes
  • What is the best action an agent should take at
    any state to maximize reward at the end?
  • MDP Input
  • States
  • Actions
  • Reward Function

7
MDP Output
  • Policy: optimal action for the system to take in
    each state
  • Calculated using policy iteration, which depends
    on:
  • Propagating the final reward to each state
  • The probabilities of getting from one state to
    the next given a certain action
  • Additional output: V-value, the worth of each
    state
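
A minimal sketch of policy iteration on a toy MDP, to make the two dependencies above concrete. The transition probabilities, the discount factor `gamma`, and the state names `FINAL_HIGH` / `FINAL_LOW` are illustrative assumptions, not values from the talk.

```python
def policy_iteration(P, R, gamma=0.9, tol=1e-8):
    """Return (policy, V) for a small tabular MDP.

    P[s][a] is a list of (next_state, probability) pairs.
    R[s] is the reward for entering state s; final states appear in R only.
    """
    V = {s: 0.0 for s in R}
    # Start from an arbitrary policy: the first action listed for each state.
    policy = {s: next(iter(actions)) for s, actions in P.items()}

    def q(s, a):
        # Expected reward plus discounted value of the successor state.
        return sum(p * (R[s2] + gamma * V[s2]) for s2, p in P[s][a])

    while True:
        # Policy evaluation: propagate rewards back under the current policy.
        while True:
            delta = 0.0
            for s, a in policy.items():
                v = q(s, a)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < tol:
                break
        # Policy improvement: switch to a strictly better action if one exists.
        stable = True
        for s in policy:
            best = max(P[s], key=lambda a: q(s, a))
            if q(s, best) > q(s, policy[s]) + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


# Hypothetical two-state example (C = correct, I = incorrect).
P = {
    "C": {"SAQ": [("FINAL_HIGH", 0.8), ("FINAL_LOW", 0.2)],
          "Mix": [("FINAL_HIGH", 0.5), ("FINAL_LOW", 0.5)]},
    "I": {"SAQ": [("C", 0.3), ("FINAL_LOW", 0.7)],
          "Mix": [("C", 0.6), ("FINAL_LOW", 0.4)]},
}
R = {"C": 0.0, "I": 0.0, "FINAL_HIGH": 100.0, "FINAL_LOW": -100.0}
policy, V = policy_iteration(P, R)
```

The returned `V` holds the V-value of each state under the final policy; final states keep a V-value of 0 because no action is taken from them.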

8
MDPs in Spoken Dialogue
[Diagram. Offline: training data feeds the MDP, which outputs a policy. Online: the dialogue system interacts with a user simulator or a human user.]
9
ITSPOKE Corpus
  • 100 dialogues with the ITSPOKE spoken dialogue
    tutoring system (Litman et al., 04)
  • All possible dialogue paths were authored by
    physics experts
  • Dialogues informally follow a question-answer
    format
  • 60 turns per dialogue on average
  • Each student session has 5 dialogues bookended by
    a pretest and posttest to calculate how much
    the student learned

10
Corpus Annotations
  • Manual annotations
  • Tutor Moves (similar to Dialog Acts)
    (Forbes-Riley et al., 05)
  • Student Frustration and Certainty
    (Litman et al., 04; Liscombe et al., 05)
  • Automated annotations
  • Correctness (based on student's response to last
    question)
  • Concept Repetition (whether a concept is
    repeated)
  • Percent Correctness (past performance)

11
MDP State Features
12
MDP Action Choices
13
MDP Reward Function
  • Reward function: use normalized learning gain to
    do a median split on the corpus
  • 10 students are high learners and the other 10
    are low learners
  • High learner dialogues had a final state with a
    reward of 100; low learner dialogues had one of -100
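
The median split can be sketched as follows. The slide does not define normalized learning gain; the formula used here, (posttest - pretest) / (1 - pretest), is the common definition and is an assumption, as are the student names and scores.

```python
from statistics import median


def normalized_learning_gain(pre, post):
    # Assumed definition: fraction of the possible improvement actually achieved.
    # Scores are fractions in [0, 1), so the denominator is never zero.
    return (post - pre) / (1.0 - pre)


def final_rewards(scores):
    """Map each student to +100 (high learner) or -100 (low learner).

    scores: student id -> (pretest, posttest).
    """
    gains = {s: normalized_learning_gain(pre, post)
             for s, (pre, post) in scores.items()}
    cut = median(gains.values())
    return {s: 100 if g > cut else -100 for s, g in gains.items()}


# Hypothetical scores for four students.
scores = {"s1": (0.4, 0.7), "s2": (0.5, 0.6), "s3": (0.2, 0.8), "s4": (0.6, 0.6)}
rewards = final_rewards(scores)
```

With an even number of students and distinct gains, the median falls between the two middle values, so exactly half of the students land on each side of the split.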

14
Methodology
  • Construct MDPs to test the inclusion of new
    state features to a baseline
  • Develop baseline state and policy
  • Add a feature to baseline and compare policies
  • A feature is deemed important if adding it
    results in a change in policy from a baseline
    policy given 3 metrics:
  • # of Policy Differences (Diffs)
  • Policy Change (PC)
  • Expected Cumulative Reward (ECR)
  • For each MDP, verify policies are reliable
    (V-value convergence)

15
Hypothetical Policy Change Example
0 Diffs
5 Diffs
16
Tests
[Diagram of test configurations. Baseline 1 (B1): Correctness. Baseline 2 (B2): Correctness, Certainty. Added features shown: Concept, Frustration.]
17
Baseline
  • Actions: SAQ, CAQ, Mix, NoQ
  • Baseline state: Correctness

[Diagram: baseline network with states C, I, and FINAL, and actions SAQ, CAQ, Mix, NoQ.]
18
Baseline 1 Policies
  • Trend: if you only have student correctness as a
    model of student state, give a hint or other
    state act to the student; otherwise give a Mix of
    complex and short answer questions

19
But are our policies reliable?
  • Best way to test is to run real experiments with
    human users and the new dialogue manager, but that
    is months of work
  • Our tack: check if our corpus is large enough to
    develop reliable policies by seeing if V-values
    converge as we add more data to the corpus
  • Method: run the MDP on subsets of our corpus
    (incrementally add a student (5 dialogues) to the
    data, and rerun the MDP on each subset)
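
The subset procedure can be sketched as below: re-estimate the transition model each time a student is added, re-solve, and record the V-values so they can be plotted against corpus size. The triple format `(state, action, next_state)`, the discount factor, and the use of value iteration as the solver are assumptions made for this illustration.

```python
from collections import Counter, defaultdict


def estimate_model(triples):
    """Maximum-likelihood P[s][a][s2] from (state, action, next_state) triples."""
    counts = defaultdict(lambda: defaultdict(Counter))
    for s, a, s2 in triples:
        counts[s][a][s2] += 1
    return {s: {a: {s2: n / sum(c.values()) for s2, n in c.items()}
                for a, c in acts.items()}
            for s, acts in counts.items()}


def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Return V-values; R gives the reward for entering a state (default 0)."""
    V = defaultdict(float)
    while True:
        delta = 0.0
        for s in P:
            v = max(sum(p * (R.get(s2, 0.0) + gamma * V[s2])
                        for s2, p in nxt.items())
                    for nxt in P[s].values())
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return dict(V)


def v_values_by_subset(per_student, R):
    """per_student: one list of triples per student, in the order they are added."""
    seen, history = [], []
    for triples in per_student:
        seen.extend(triples)  # cumulative corpus after adding this student
        history.append(value_iteration(estimate_model(seen), R))
    return history


# Hypothetical two-student corpus.
R = {"FINAL_HIGH": 100.0, "FINAL_LOW": -100.0}
per_student = [[("C", "SAQ", "FINAL_HIGH")], [("C", "SAQ", "FINAL_LOW")]]
history = v_values_by_subset(per_student, R)
```

If the V-values in `history` stop moving as more students are added, that is the convergence signal the slide describes.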

20
Baseline Convergence Plot
21
Methodology: Adding more Features
  • Create a more complicated baseline by adding the
    certainty feature (new baseline B2)
  • Add the other 4 features (concept repetition,
    frustration, performance, student move)
    individually to the new baseline
  • Check V-value and policy convergence
  • Analyze policy changes
  • Use Feature Comparison Metrics to determine the
    relative utility of the four features

22
Tests
[Diagram of test configurations. Baseline 1 (B1): Correctness. Baseline 2 (B2): Correctness, Certainty. Added features shown: Concept, Frustration.]
23
Certainty
  • Previous work (Bhatt et al., 04) has shown the
    importance of certainty in ITS
  • A student who is certain and correct may require
    a harder question since he or she is doing well,
    but a student who is correct yet showing some
    doubt may be becoming confused, so give an easier
    question

24
B2 Baseline Certainty Policies
Trend: if neutral, give SAQ or NoQ; else give Mix
25
Baseline 2 Convergence Plots
26
Baseline 2 Diff Plots
Diff: for each subset of the corpus, compare its policy with
the policy generated from the full corpus
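
A minimal sketch of that comparison, assuming each policy is a dict from state to action and that the last entry in the list is the full-corpus policy (both assumptions for illustration):

```python
def diffs_vs_full(policies_by_subset):
    """Count, for each subset policy, the states whose action differs
    from the policy learned on the full corpus (the last entry)."""
    full = policies_by_subset[-1]
    return [sum(1 for s, a in full.items() if p.get(s) != a)
            for p in policies_by_subset]


# Hypothetical policies for three growing subsets.
policies = [
    {"C": "Mix", "I": "SAQ"},
    {"C": "SAQ", "I": "SAQ"},
    {"C": "SAQ", "I": "Mix"},
]
curve = diffs_vs_full(policies)
```

A curve that falls to 0 and stays there well before the full corpus is reached suggests the policy has stabilized.
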
27
Tests
[Diagram of test configurations. Baseline 1 (B1): Correctness. Baseline 2 (B2): Correctness, Certainty. Added features shown: Concept, Frustration.]
28
Feature Comparison (3 metrics)
  • Diffs
  • Number of new states whose policies differ from
    the original
  • Insensitive to how frequently a state occurs
  • Policy Change (P.C.)
  • Take into account the frequency of each
    state-action sequence
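
One way to compute both metrics, as a sketch. It assumes each new state is a `(baseline_state, feature_value)` tuple and that Policy Change is the share of corpus occurrences that fall in changed states; the slide does not give the exact normalization, so that part is an assumption.

```python
def diffs_and_policy_change(baseline_policy, new_policy, counts):
    """Return (Diffs, Policy Change) for a new state space vs. the baseline.

    new_policy keys are (baseline_state, feature_value) tuples.
    counts[new_state] is how often that state occurs in the corpus.
    """
    changed = [s for s, a in new_policy.items()
               if baseline_policy[s[0]] != a]
    diffs = len(changed)  # insensitive to state frequency
    total = sum(counts.get(s, 0) for s in new_policy)
    weighted = sum(counts.get(s, 0) for s in changed)
    policy_change = weighted / total if total else 0.0
    return diffs, policy_change


# Hypothetical baseline and frustration-augmented policies.
baseline_policy = {"C": "SAQ", "I": "Mix"}
new_policy = {("C", "F"): "SAQ", ("C", "N"): "Mix",
              ("I", "F"): "Mix", ("I", "N"): "Mix"}
counts = {("C", "F"): 10, ("C", "N"): 30, ("I", "F"): 20, ("I", "N"): 40}
diffs, pc = diffs_and_policy_change(baseline_policy, new_policy, counts)
```

Here a single changed state yields a Diffs score of 1, while Policy Change also reflects how often that state is actually visited.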

29
Feature Comparison
  • Expected Cumulative Reward (E.C.R.)
  • One issue with P.C. is that frequently occurring
    states have low V-values and thus may bias the
    score
  • Use the expected value of being at the start of
    the dialogue to compare features
  • ECR: average V-value of all start states
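
A short sketch of the ECR computation. Weighting each start state by how often dialogues begin there is an assumption; with equal counts it reduces to the plain average named on the slide.

```python
def expected_cumulative_reward(V, start_counts):
    """Frequency-weighted average V-value over start states.

    V: state -> V-value.
    start_counts: start state -> number of dialogues that begin there.
    """
    total = sum(start_counts.values())
    return sum(n / total * V[s] for s, n in start_counts.items())


# Hypothetical V-values and start-state frequencies.
V = {"C": 60.0, "I": -20.0}
ecr = expected_cumulative_reward(V, {"C": 3, "I": 1})
```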

30
Feature Comparison Results
  • Trend of SMove > Concept Repetition > Frustration
    > Percent Correctness stays the same over all
    three metrics
  • Baseline: also tested the effects of a binary
    random feature
  • With enough data, a random feature should not
    alter policies
  • Average diff of 5.1

31
How reliable are policies?
[Plots: Frustration, Concept]
Possibly the data size is small, and with increased
data we may see more fluctuations
32
Confidence Bounds
  • Hypothesis: instead of looking at the V-values
    and policy differences directly, look at the
    confidence bounds of each V-value
  • As data increases, the confidence bounds of each
    V-value should shrink to reflect a better model
    of the world
  • Additionally, the policies should converge as
    well

33
Confidence Bounds
  • CBs can also be used to distinguish how much
    better an additional state feature is over a
    baseline state space
  • That is, a new state space is better if its
    lower bound is greater than the upper bound of
    the baseline state space
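
The comparison rule can be sketched as a simple check on `(lower, upper)` pairs, plus a helper that finds the first data slice where the rule holds. The function names and the per-slice list layout are hypothetical.

```python
def reliably_better(new_bounds, baseline_bounds):
    """True if the new model's lower bound exceeds the baseline's upper bound.

    Each argument is a (lower, upper) pair.
    """
    new_lower, _ = new_bounds
    _, base_upper = baseline_bounds
    return new_lower > base_upper


def crossover_point(new_by_slice, baseline_by_slice):
    """Index of the first data slice where the new model is reliably better,
    or None if that never happens."""
    for i, (new, base) in enumerate(zip(new_by_slice, baseline_by_slice)):
        if reliably_better(new, base):
            return i
    return None


# Bounds quoted later in this transcript for Baseline 1 and Baseline 2,
# assuming both refer to the same data slice.
b1 = (0.24, 23.65)
b2 = (39.62, 57.16)
b2_beats_b1 = reliably_better(b2, b1)
```

The same check could drive automatic model switching in an online setting: stay with the simpler model until `crossover_point` returns an index.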

34
Crossover Example
[Plot: ECR (y-axis) vs. Data (x-axis) for the baseline and a more complicated model.]
35
Confidence Bounds App 2
  • Automatic model switching
  • If you know that a model at its worst (i.e., its
    lower bound) is better than another model's upper
    bound, then you can automatically switch to the
    more complicated model
  • Good for online RL applications

36
Confidence Bound Methodology
  • For each data slice, calculate upper and lower
    bounds on the V-value
  • Take the transition matrix for the slice and
    sample from each row using the Dirichlet
    distribution 1000 times
  • Do this because the observed data does not
    exactly match what data is like in the real
    world, but may be close
  • So get 1000 new transition matrices that are all
    very similar
  • Run MDP on all 1000 transition matrices to get a
    range of ECRs
  • Rows with not a lot of data are very volatile so
    expect large range of ECRs, but as data
    increases, transition matrices should stabilize
    such that most of the new matrices produce
    similar policies and values as the original
  • Take upper and lower bounds at the 2.5% tails
    (2.5th and 97.5th percentiles)
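
The sampling steps above can be sketched with the standard library only. A Dirichlet draw is obtained by normalizing independent gamma variates. The add-one `prior`, the `solve_ecr` callback, and the toy one-row model are assumptions for illustration; in practice `solve_ecr` would run the MDP solver on each sampled transition matrix.

```python
import random


def sample_dirichlet(alphas, rng):
    # Normalized gamma draws are distributed as Dirichlet(alphas).
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def ecr_bounds(counts, solve_ecr, n_samples=1000, prior=1.0, seed=0):
    """Return (lower, upper) ECR bounds at the 2.5th and 97.5th percentiles.

    counts[(state, action)] maps next_state -> observed count.
    solve_ecr(P) takes a sampled model in the same layout (with probabilities
    instead of counts) and returns its ECR.
    """
    rng = random.Random(seed)
    ecrs = []
    for _ in range(n_samples):
        P = {}
        for key, row in counts.items():
            nxt = list(row)
            # prior keeps every Dirichlet parameter strictly positive
            probs = sample_dirichlet([row[s2] + prior for s2 in nxt], rng)
            P[key] = dict(zip(nxt, probs))
        ecrs.append(solve_ecr(P))
    ecrs.sort()
    lower = ecrs[int(0.025 * (n_samples - 1))]
    upper = ecrs[int(0.975 * (n_samples - 1))]
    return lower, upper


# Toy one-step model: ECR is the expected final reward from state C.
def toy_ecr(P):
    row = P[("C", "SAQ")]
    return 100.0 * row["HIGH"] - 100.0 * row["LOW"]


small = ecr_bounds({("C", "SAQ"): {"HIGH": 8, "LOW": 2}}, toy_ecr)
large = ecr_bounds({("C", "SAQ"): {"HIGH": 800, "LOW": 200}}, toy_ecr)
```

Comparing `small` and `large` illustrates the expected behavior: rows backed by few observations give a wide range of ECRs, and the range narrows as counts grow.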

37
Experiment
  • Original action/state setup did not show anything
    promising
  • State/action space too large for data?
  • Not best MDP instantiation
  • Looked at a variety of MDP configurations
  • Refined reward metric
  • Adding discourse segmentation

38
essay Instantiation with 0305 data
39
essay Baseline1
40
essay Baseline2
41
essay B2SMove
42
Feature Comparison Results
  • Reduced state size: Certainty is Cert+Neutral,
    Uncert
  • Trend that SMove and Concept Repetition are the
    best features
  • B2 ECR: 31.92

43
Baseline 1
Upper: 23.65, Lower: 0.24
44
Baseline 2
Upper: 57.16, Lower: 39.62
45
B2 + Concept Repetition
Upper: 64.30, Lower: 49.16
46
B2 + Percent Correctness
Upper: 48.42, Lower: 32.86
47
B2 + Student Move
Upper: 61.36, Lower: 39.94
48
Discussion
  • Baseline 2 has crossover effect and policy
    stability
  • More complex features (B2 + X) have crossover
    effect, but not sure if policies are stable (some
    stabilize at 17 students)
  • Indicates that 100 dialogues isn't enough for
    even this simple MDP? (but is enough to feel
    confident about Baseline 2?)