Using Reinforcement Learning to Build a Better Model of Dialogue State

1
Using Reinforcement Learning to Build a Better
Model of Dialogue State
  • Joel Tetreault
  • LRDC
  • University of Pittsburgh
  • August 3, 2006

2
Interests
  • Natural Language Processing
  • How do we get computers to understand speech or
    text?
  • Get computers to reply in an intelligent and
    satisfactory fashion
  • Research
  • Discourse Processing
  • Pronoun Resolution
  • Affect Detection (IR)
  • Machine learning for Spoken Dialogue Systems

3
Intelligent Tutoring Systems
  • Students who receive one-on-one instruction perform as well as the top two percent of students who receive traditional classroom instruction [Bloom 1984]
  • Unfortunately, providing every student with a
    personal human tutor is infeasible
  • Develop computer tutors instead

4
Intelligent Tutoring Systems
  • Working hypothesis regarding learning gains
  • Human Dialogue > Computer Dialogue > Text
  • Long-term goal of ITS designers:
  • to close the gap between human and computer dialogue
  • Make adaptive systems

5
How to do it?
  • U.Pittsburgh ITSPOKE group
  • Use speech (instead of text-based system)
  • Emotion detection
  • Use information about the content of the student's response
  • Dialogue context
  • However, with all the features and factors that
    influence learning gain, how does one design a
    system that can take each factor into account
    properly?
  • Are some features more important than others?

6
Reinforcement Learning for SDs
  • Previous work has researched using machine learning techniques to find the best action for a system to take given huge state spaces
  • [Singh et al., 02; Walker, 00; Henderson et al., 05]
  • Problems with designing spoken dialogue systems
  • How to handle noisy data or miscommunications?
  • Hand-tailoring policies for complex dialogues?
  • However, very little empirical work [Paek et al., 05; Frampton, 05] on comparing the utility of adding specialized features to construct a better dialogue state

7
Goal
  • How does one choose which features best
    contribute to a better model of dialogue state?
  • Goal: show the comparative utility of adding four different features to a dialogue state
  • 4 features: concept repetition, frustration, student performance, student moves
  • All are important to tutoring systems, but also
    are important to dialogue systems in general

8
Goal
  • Long-term goal:
  • Current ITSPOKE system only responds to
    correctness of last student turn
  • Determine best state features and actions (for
    each state) that would improve system
  • Incorporate action and state set into new
    dialogue manager and test on human subjects to
    measure improvement

9
Outline
  • Markov Decision Processes (MDP)
  • MDP Instantiation
  • Experimental Method
  • Results
  • Policies
  • Feature Comparison

10
Markov Decision Processes
  • What is the best action an agent should take at
    any state to maximize reward at the end?
  • MDP Input
  • States
  • Actions
  • Reward Function
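Below is a minimal sketch, not from the talk, of how these three inputs might be represented for a dialogue MDP; the state names, action names, and probabilities are illustrative assumptions only.

  # Minimal sketch (assumed, not from the talk) of the three MDP inputs.
  # State names, action names, and probabilities are illustrative only.
  states = ["Correct", "Incorrect", "FINAL"]      # dialogue states
  actions = ["SAQ", "CAQ", "Mix", "NoQ"]          # tutor actions

  # Transition model P(s' | s, a); in practice estimated from corpus counts.
  transitions = {
      ("Correct", "SAQ"): {"Correct": 0.6, "Incorrect": 0.3, "FINAL": 0.1},
      ("Correct", "Mix"): {"Correct": 0.5, "Incorrect": 0.3, "FINAL": 0.2},
      ("Incorrect", "CAQ"): {"Correct": 0.4, "Incorrect": 0.5, "FINAL": 0.1},
      # ... one entry per (state, action) pair observed in the data
  }

  # Reward function: only the dialogue-final state carries a reward
  # (+100 for high learners, -100 for low learners later in the talk).
  rewards = {"FINAL": 100.0}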

11
MDP Output
  • Policy: optimal action for the system to take in each state
  • Calculated using policy iteration, which depends on
  • propagating the final reward to each state
  • the probabilities of transitioning from one state to the next given a certain action
  • Additional output: V-value, the worth of each state (see the sketch below)
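As a rough illustration only, assuming the toy dictionary-style transition model and rewards sketched earlier; the talk uses policy iteration, while this sketch uses the closely related value-iteration update.

  # Rough sketch (assumption, not the talk's implementation) of computing
  # V-values and a policy for a small MDP.
  def solve_mdp(states, actions, P, R, gamma=0.95, iters=200):
      """P[(s, a)] -> {s': prob}; R[s'] -> reward received on entering s'."""
      V = {s: 0.0 for s in states}
      for _ in range(iters):
          for s in states:
              qs = [sum(p * (R.get(s2, 0.0) + gamma * V[s2])
                        for s2, p in P[(s, a)].items())
                    for a in actions if (s, a) in P]
              if qs:                       # final states have no outgoing actions
                  V[s] = max(qs)
      policy = {}
      for s in states:
          best_a, best_q = None, float("-inf")
          for a in actions:
              if (s, a) in P:
                  q = sum(p * (R.get(s2, 0.0) + gamma * V[s2])
                          for s2, p in P[(s, a)].items())
                  if q > best_q:
                      best_a, best_q = a, q
          if best_a is not None:
              policy[s] = best_a           # optimal action for this state
      return V, policy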

12
What's the best path to the fly?
13
MDP Frog Example
(Grid figure: the fly is the final state with reward +1; each hop to a neighboring lily pad costs -1.)
14
MDP Frog Example
(Grid figure: the -1 step costs are propagated back from the final state, giving each lily pad a V-value and revealing the best path to the fly.)
15
MDPs in Spoken Dialogue
(Diagram: the MDP works offline, turning training data into a policy; the dialogue system then uses that policy in online interactions with a user simulator or a human user.)
16
ITSPOKE System
  • ITSPOKE spoken dialogue tutoring system [Litman et al. 04]
  • Back-end is the Why2-Atlas system [VanLehn et al. 2002]
  • Sphinx2 speech recognition and Cepstral
    text-to-speech

17
(No Transcript)
18
(No Transcript)
19
(No Transcript)
20
ITSPOKE Corpus
  • 100 dialogues with the ITSPOKE spoken dialogue tutoring system [Litman et al. 04]
  • All possible dialogue paths were authored by physics experts
  • Dialogues informally follow a question-answer format
  • 60 turns per dialogue on average
  • Each student session has 5 dialogues bookended by a pretest and posttest to calculate how much the student learned

21
Corpus Annotations
  • Manual annotations
  • Tutor and Student Moves (similar to Dialog Acts) [Forbes-Riley et al., 05]
  • Student Frustration and Certainty [Litman et al. 04; Liscombe et al. 05]
  • Automated annotations
  • Correctness (based on the student's response to the last question)
  • Concept Repetition (whether a concept is repeated)
  • Percent Correctness (past performance)

22
MDP State Features
23
(No Transcript)
24
Student Frustration
Tutor 26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student 27: dammit (ASR: it is)
Tutor 28: Could you please repeat that?
Student 29: same (ASR: i same)
Tutor 30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student 31: zero (ASR: the zero)
Tutor 32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <omitted>
Student 33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor 34: Fine. Are there any other forces acting on the apple as it falls?
Student 35: no why are you doing this again (ASR: no y and to it yes)
Tutor 36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student 37: downward you computer (ASR: downward you computer)
25
MDP Action Choices
26
MDP Reward Function
  • Reward Function: use normalized learning gain to do a median split on the corpus
  • 10 students are high learners and the other 10 are low learners
  • High learner dialogues had a final state with a reward of +100; low learner dialogues, -100 (see the sketch below)
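A small sketch of how that median split could be computed; the normalized-learning-gain formula and the input format are assumptions for illustration.

  # Sketch (assumed details) of the median-split reward assignment.
  def assign_rewards(students):
      """students: list of (student_id, pretest_score, posttest_score), scores in [0, 1]."""
      # Normalized learning gain (a common formulation, assumed here): (post - pre) / (1 - pre)
      gains = {sid: (post - pre) / (1.0 - pre) if pre < 1.0 else 0.0
               for sid, pre, post in students}
      median = sorted(gains.values())[len(gains) // 2]
      # High learners get a final-state reward of +100, low learners -100.
      return {sid: (100 if g >= median else -100) for sid, g in gains.items()}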

27
Methodology
  • Construct MDPs to test the inclusion of new state features to a baseline
  • Develop baseline state and policy
  • Add a feature to the baseline and compare policies
  • A feature is deemed important if adding it results in a change in policy from the baseline policy, given 3 metrics
  • # of Policy Differences (Diffs)
  • Policy Change (PC)
  • Expected Cumulative Reward (ECR)
  • For each MDP, verify policies are reliable (V-value convergence)

28
Hypothetical Policy Change Example
(Two hypothetical policy tables: one showing 0 Diffs from the baseline, one showing 5 Diffs.)
29
Tests
(Diagram: Baseline 1 (B1) = Correctness; Baseline 2 (B2) = B1 + Certainty; Concept Repetition, Frustration, Percent Correctness, and Student Move (SMove) are each added to B2.)
30
Baseline
  • Actions: SAQ, CAQ, Mix, NoQ
  • Baseline State: Correctness

(Baseline network figure: states C (correct), I (incorrect), and FINAL, connected by the actions SAQ, CAQ, Mix, and NoQ.)
31
Baseline 1 Policies
  • Trend: if you only have student correctness as a model of student state, give a hint or other non-question act to the student; otherwise give a Mix of complex and short answer questions

32
But are our policies reliable?
  • The best way to test is to run real experiments with human users and the new dialogue manager, but that is months of work
  • Our tack: check if our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data to the corpus
  • Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data, and rerun the MDP on each subset), as sketched below
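A sketch of that check, with `build_mdp` and `solve_mdp` standing in (as assumptions) for the corpus-to-MDP estimation and the solver sketched earlier.

  # Sketch of the V-value convergence check on growing corpus subsets.
  # `build_mdp` and `solve_mdp` are assumed helpers, not the talk's actual code.
  def convergence_curve(per_student_dialogues, build_mdp, solve_mdp):
      """per_student_dialogues: list of dialogue sets, one entry (5 dialogues) per student."""
      curve = []
      for k in range(1, len(per_student_dialogues) + 1):
          subset = per_student_dialogues[:k]            # first k students
          states, actions, P, R = build_mdp(subset)     # re-estimate transitions/rewards
          V, policy = solve_mdp(states, actions, P, R)  # re-solve the MDP on this slice
          curve.append((k, V, policy))                  # plot V-values vs. k to check convergence
      return curve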

33
Baseline Convergence Plot
34
Methodology: Adding More Features
  • Create a more complicated baseline by adding the certainty feature (new baseline B2)
  • Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline
  • Check V-value and policy convergence
  • Analyze policy changes
  • Use Feature Comparison Metrics to determine the relative utility of the four features

35
Tests
(Diagram: Baseline 1 (B1) = Correctness; Baseline 2 (B2) = B1 + Certainty; Concept Repetition, Frustration, Percent Correctness, and Student Move (SMove) are each added to B2.)
36
Certainty
  • Previous work [Bhatt et al., 04] has shown the importance of certainty in ITS
  • A student who is certain and correct may require a harder question since he or she is doing well, but a student who is correct yet showing some doubt may be becoming confused, so give an easier question

37
B2 (Baseline + Certainty) Policies
Trend: if neutral, give SAQ or NoQ; otherwise give Mix
38
Baseline 2 Convergence Plots
39
Baseline 2 Diff Plots
Diff: for each corpus subset, compare its policy with the policy generated from the full corpus
40
Tests
(Diagram: Baseline 1 (B1) = Correctness; Baseline 2 (B2) = B1 + Certainty; Concept Repetition, Frustration, Percent Correctness, and Student Move (SMove) are each added to B2.)
41
Concept Repetition Policies
Trend: if the concept is repeated (R), give CAQ
42
Frustration Policies
Trend: if neutral, give CAQ
43
Percent Correctness Policies
Trend: give Mix, especially for low performers
44
Feature Comparison (3 metrics)
  • Diffs
  • Number of new states whose policies differ from
    the original
  • Insensitive to how frequently a state occurs
  • Policy Change (P.C.)
  • Takes into account the frequency of each state-action sequence (a sketch of both metrics follows below)
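A sketch of these two metrics under one plausible reading of the slides: each state in the feature-augmented model is projected back to the baseline state it refines, and P.C. weights each changed state by how often it occurs in the corpus; the exact weighting in the talk may differ.

  # Sketch of the two policy-comparison metrics; details are assumptions.
  def policy_diffs(base_policy, new_policy, project):
      """# Diffs: number of new states whose action differs from that of the
      baseline state they refine. `project` maps a new state to its baseline state."""
      return sum(1 for s in new_policy
                 if new_policy[s] != base_policy.get(project(s)))

  def policy_change(base_policy, new_policy, project, state_counts):
      """P.C.: like Diffs, but each changed state is weighted by its corpus frequency."""
      total = sum(state_counts.values())
      changed = sum(state_counts.get(s, 0) for s in new_policy
                    if new_policy[s] != base_policy.get(project(s)))
      return changed / total if total else 0.0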

45
Feature Comparison
  • Expected Cumulative Reward (E.C.R.)
  • One issue with P.C. is that frequently occurring states have low V-values and thus may bias the score
  • Use the expected value of being at the start of the dialogue to compare features
  • ECR: average V-value of all start states (see the sketch below)
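A one-function sketch of ECR as described on this slide, assuming each start state is weighted by how often dialogues actually begin in it.

  # Sketch of ECR: the (frequency-weighted) average V-value over dialogue-start states.
  def expected_cumulative_reward(V, start_state_counts):
      total = sum(start_state_counts.values())
      return sum(V[s] * n for s, n in start_state_counts.items()) / total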

46
Question Act Results
  • Trend of SMove > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics
  • Baseline: also tested the effects of a binary random feature
  • If there is enough data, a random feature should not alter policies
  • Average diff of 5.1

47
Feedback Act Results
  • Trend of SMove and Concept Repetition being the best stays the same, though features have less impact given this action set
  • Frustration now slightly worse than Percent
    Correctness

48
Discussion
  • Incorporating more information into a
    representation of the student state has an impact
    on tutor policies
  • Proposed three metrics to determine the relative weight of four features
  • Including last Student Move and Concept Repetition effected the most change across different action sets

49
Future Work
  • Next step: take promising state features and resulting policies and implement them in ITSPOKE
  • Evaluate with human users and simulated users
  • Also researching: how much data is enough to prove the reliability of policies?

50
How reliable are policies?
(Convergence plots for the Concept Repetition and Frustration features.)
  • Possibly the data size is small, and with increased data we may see more fluctuations
51
CB example
(Figure: toy confidence-bound example with states S0, S1, S2 and values 2, 1, 5.)
52
CB example
(Figure: toy confidence-bound example with states S0, S1, S2 and values 2, 1, 5, continued.)
53
Confidence Bounds
  • Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value
  • As data increases, confidence of V-value should
    shrink to reflect a better model of the world
  • Additionally, the policies should converge as
    well

54
Confidence Bound Methodology
  • For each data slice, calculate upper and lower bounds on the V-value
  • Take the transition matrix for the slice and sample from each row 1000 times using the Dirichlet distribution
  • This gives 1000 new transition matrices that are all very similar
  • Run the MDP on all 1000 transition matrices to get a range of ECRs
  • Rows with little data are very volatile, so expect a large range of ECRs, but as data increases the transition matrices should stabilize
  • Take the upper and lower bounds at the 2.5th percentile on each side (a sketch of this procedure follows)
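A sketch of this procedure; the counts layout, the smoothing prior, and the `solve_mdp_for_ecr` helper are assumptions, not the talk's actual code.

  import numpy as np

  # Sketch (assumed details) of the Dirichlet confidence-bound procedure.
  # `counts` holds observed transition counts, one row per (state, action) pair;
  # `solve_mdp_for_ecr` stands in for re-solving the MDP and returning its ECR.
  def ecr_confidence_bounds(counts, solve_mdp_for_ecr,
                            n_samples=1000, prior=1.0, seed=0):
      rng = np.random.default_rng(seed)
      ecrs = []
      for _ in range(n_samples):
          # Sample each row of the transition matrix from a Dirichlet whose
          # parameters are the observed counts plus a small smoothing prior.
          sampled = np.vstack([rng.dirichlet(np.asarray(row) + prior) for row in counts])
          ecrs.append(solve_mdp_for_ecr(sampled))
      # Lower and upper bounds at the 2.5th and 97.5th percentiles of the sampled ECRs.
      lower, upper = np.percentile(ecrs, [2.5, 97.5])
      return lower, upper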

55
Confidence Bounds
  • CBs can also be used to distinguish how much better an additional state feature is than a baseline state space
  • That is, a new state space is reliably better if its lower bound is greater than the upper bound of the baseline state space

56
Crossover Example
(Plot of ECR vs. data: the more complicated model's confidence interval eventually rises above the baseline's.)
57
Confidence Bounds App 2
  • Automatic model switching
  • If you know that a model, at its worst (i.e., its lower bound), is better than another model's upper bound, then you can automatically switch to the more complicated model
  • Good for online RL applications

58
Preliminary Results
  • As data increases, the confidence bounds for all models shrink
  • Baseline 2 (certainty) has a lower bound that is higher than the upper bound of B1, and its policies tend to stabilize
  • More complex states take longer to stabilize but still perform better than the baselines

59
Baseline 1
(Confidence-bound plot. Upper bound: 23.65, lower bound: 0.24)
60
Baseline 2
(Confidence-bound plot. Upper bound: 57.16, lower bound: 39.62)
61
B2 Concept Repetition
(Confidence-bound plot. Upper bound: 64.30, lower bound: 49.16)
62
Acknowledgements
  • ITSPOKE research group
  • Dialog on Dialogs research group
  • For more information:
  • http://www.cs.pitt.edu/tetreaul
  • NLP/CL conference listings