Title: Using Reinforcement Learning to Build a Better Model of Dialogue State
1. Using Reinforcement Learning to Build a Better Model of Dialogue State
- Joel Tetreault
- LRDC
- University of Pittsburgh
- August 3, 2006
2. Interests
- Natural Language Processing
  - How do we get computers to understand speech or text?
  - How do we get computers to reply in an intelligent and satisfactory fashion?
- Research
  - Discourse Processing
  - Pronoun Resolution
  - Affect Detection (IR)
  - Machine learning for Spoken Dialogue Systems
3. Intelligent Tutoring Systems
- Students who receive one-on-one instruction perform as well as the top two percent of students who receive traditional classroom instruction (Bloom, 1984)
- Unfortunately, providing every student with a personal human tutor is infeasible
- Develop computer tutors instead
4. Intelligent Tutoring Systems
- Working hypothesis regarding learning gains:
  - Human Dialogue > Computer Dialogue > Text
- Long-term goal of ITS designers:
  - Close the gap between human and computer dialogue
  - Make adaptive systems
5. How to do it?
- U. Pittsburgh ITSPOKE group:
  - Use speech (instead of a text-based system)
  - Emotion detection
  - Use information about the content of the student's response
  - Dialogue context
- However, with all the features and factors that influence learning gain, how does one design a system that takes each factor into account properly?
- Are some features more important than others?
6. Reinforcement Learning for SDSs
- Previous work has researched using machine learning techniques to find the best action for a system to take given huge state spaces (Singh et al., 2002; Walker, 2000; Henderson et al., 2005)
- Problems with designing spoken dialogue systems:
  - How to handle noisy data or miscommunications?
  - Hand-tailoring policies for complex dialogues?
- However, there is very little empirical work (Paek et al., 2005; Frampton, 2005) comparing the utility of adding specialized features to construct a better dialogue state
7. Goal
- How does one choose which features best contribute to a better model of dialogue state?
- Goal: show the comparative utility of adding four different features to a dialogue state
- The 4 features: concept repetition, frustration, student performance, student moves
- All are important to tutoring systems, but also important to dialogue systems in general
8. Goal
- Long-term goal:
  - The current ITSPOKE system only responds to the correctness of the last student turn
  - Determine the best state features and actions (for each state) that would improve the system
  - Incorporate the action and state set into a new dialogue manager and test on human subjects to measure improvement
9. Outline
- Markov Decision Processes (MDP)
- MDP Instantiation
- Experimental Method
- Results
- Policies
- Feature Comparison
10. Markov Decision Processes
- What is the best action an agent should take at any state to maximize reward at the end?
- MDP input:
  - States
  - Actions
  - Reward Function
11. MDP Output
- Policy: the optimal action for the system to take in each state
- Calculated using policy iteration, which depends on:
  - Propagating the final reward back to each state
  - The probabilities of transitioning from one state to the next given a certain action
- Additional output: V-value, the worth of each state
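Policy iteration, as described above, alternates between evaluating the current policy and greedily improving it until the policy stops changing. A minimal sketch on a generic toy MDP (not the tutoring MDP; `P`, `R`, and the discount factor are assumed inputs):

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9, max_iters=100):
    """P[a][s, s'] = transition probabilities for action a; R[s] = reward.
    Returns the optimal policy (best action per state) and the V-values."""
    n_actions, n_states = len(P), len(R)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        # Policy evaluation: solve V = R + gamma * P_pi V exactly
        P_pi = np.array([P[policy[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Policy improvement: act greedily with respect to the current V
        Q = np.array([R + gamma * P[a] @ V for a in range(n_actions)])
        new_policy = Q.argmax(axis=0)
        if np.array_equal(new_policy, policy):
            break  # policy is stable, hence optimal
        policy = new_policy
    return policy, V
```

The exact linear solve in the evaluation step is what "propagates the final reward to each state" in the bullet above.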
12. What's the best path to the fly?
13. MDP Frog Example
(Figure: a grid of states in which each hop costs -1 and the final state is worth +1)
14. MDP Frog Example
(Figure: the same grid with V-values propagated back from the final state, e.g. 0, -1, -2, -3 along paths of increasing distance)
15. MDPs in Spoken Dialogue
(Diagram: Training data feeds an MDP, which produces a Policy for the Dialogue System; the MDP works offline. The Dialogue System then interacts online with a User Simulator or a Human User.)
16. ITSPOKE System
- ITSPOKE: spoken dialogue tutoring system (Litman et al., 2004)
- Back-end is the Why2-Atlas system (VanLehn et al., 2002)
- Sphinx2 speech recognition and Cepstral text-to-speech
17-19. (No transcript: figure-only slides)
20. ITSPOKE Corpus
- 100 dialogues with the ITSPOKE spoken dialogue tutoring system (Litman et al., 2004)
- All possible dialogue paths were authored by physics experts
- Dialogues informally follow a question-answer format
- 60 turns per dialogue on average
- Each student session has 5 dialogues, bookended by a pretest and posttest to calculate how much the student learned
21. Corpus Annotations
- Manual annotations:
  - Tutor and Student Moves (similar to Dialogue Acts) (Forbes-Riley et al., 2005)
  - Student Frustration and Certainty (Litman et al., 2004; Liscombe et al., 2005)
- Automated annotations:
  - Correctness (based on the student's response to the last question)
  - Concept Repetition (whether a concept is repeated)
  - Percent Correctness (past performance)
22. MDP State Features
23. (No transcript: figure-only slide)
24. Student Frustration
Tutor 26: Let me ask my question in a more general way. What happens to the velocity of a body when there is no force acting on it?
Student 27: dammit (ASR: it is)
Tutor 28: Could you please repeat that?
Student 29: same (ASR: i same)
Tutor 30: Right. And what happens to the HORIZONTAL velocity of a body when there is no HORIZONTAL force acting on it?
Student 31: zero (ASR: the zero)
Tutor 32: Well... OK, let's consider how forces in the vertical direction affect the motion of a body in a horizontal direction in a different scenario <omitted>
Student 33: oh gravity you already said this (ASR: beats gravity you're exert less)
Tutor 34: Fine. Are there any other forces acting on the apple as it falls?
Student 35: no why are you doing this again (ASR: no y and to it yes)
Tutor 36: Correct. The only force acting on the apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?
Student 37: downward you computer (ASR: downward you computer)
25. MDP Action Choices
26. MDP Reward Function
- Reward function: use normalized learning gain to do a median split on the corpus
- 10 students are high learners and the other 10 are low learners
- High learners' dialogues had a final state with a reward of +100; low learners', -100
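The median-split reward assignment can be sketched as follows (`gains` maps each student to a normalized learning gain; the function name is illustrative):

```python
import statistics

def assign_final_rewards(gains):
    # Median split on normalized learning gain: students above the median
    # are "high learners" (+100 final-state reward), the rest low (-100).
    med = statistics.median(gains.values())
    return {s: (100 if g > med else -100) for s, g in gains.items()}
```

With 20 students and distinct gain values, this yields the 10/10 split described above.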
27. Methodology
- Construct MDPs to test the inclusion of new state features against a baseline
- Develop a baseline state and policy
- Add a feature to the baseline and compare policies
- A feature is deemed important if adding it results in a change in policy from the baseline policy, given 3 metrics:
  - # of Policy Differences (Diffs)
  - Policy Change (P.C.)
  - Expected Cumulative Reward (E.C.R.)
- For each MDP, verify that policies are reliable (V-value convergence)
28. Hypothetical Policy Change Example
(Figure: two hypothetical policies compared against an original, one showing 0 Diffs and one showing 5 Diffs)
29. Tests
(Diagram: Baseline 1 (B1) = Correctness; Baseline 2 (B2) = B1 + Certainty; SMove, Concept Repetition, Frustration, and Percent Correct are each added individually to B2)
30. Baseline
- Actions: SAQ, CAQ, Mix, NoQ
- Baseline state: Correctness
(Diagram: baseline network; from states C (correct) and I (incorrect), each action SAQ/CAQ/Mix/NoQ transitions toward the FINAL state)
31. Baseline 1 Policies
- Trend: if you only have student correctness as a model of student state, give a hint or other state act to the student; otherwise give a Mix of complex and short answer questions
32. But are our policies reliable?
- The best way to test is to run real experiments with human users on the new dialogue manager, but that is months of work
- Our tack: check whether our corpus is large enough to develop reliable policies by seeing if V-values converge as we add more data
- Method: run the MDP on subsets of our corpus (incrementally add a student (5 dialogues) to the data, and rerun the MDP on each subset)
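The subset-and-rerun check can be sketched like this; `train_policy` is a stand-in for the full MDP training step (not shown), and the diff count is the # Diffs metric against the full-corpus policy:

```python
def convergence_diffs(students, train_policy):
    """Re-train on growing subsets of the corpus (one student, i.e. 5
    dialogues, at a time) and count, for each subset, how many states'
    learned actions differ from the full-corpus policy."""
    full = train_policy(students)  # policy from the entire corpus
    diffs = []
    for i in range(1, len(students) + 1):
        p = train_policy(students[:i])  # policy from the first i students
        diffs.append(sum(p.get(s) != a for s, a in full.items()))
    return diffs  # should tend toward 0 as the subsets grow
```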
33. Baseline Convergence Plot
34. Methodology: Adding More Features
- Create a more complicated baseline by adding the certainty feature (new baseline B2)
- Add the other 4 features (concept repetition, frustration, performance, student move) individually to the new baseline
- Check V-value and policy convergence
- Analyze policy changes
- Use the feature comparison metrics to determine the relative utility of the features
35. Tests
(Roadmap diagram repeated: B1 = Correctness; B2 = B1 + Certainty; SMove, Concept Repetition, Frustration, and Percent Correct added individually to B2)
36. Certainty
- Previous work (Bhatt et al., 2004) has shown the importance of certainty in ITS
- A student who is certain and correct may require a harder question, since he or she is doing well; but a student who is correct yet showing some doubt may be becoming confused, so give an easier question
37. B2 (Baseline + Certainty) Policies
- Trend: if neutral, give SAQ or NoQ; else give Mix
38. Baseline 2 Convergence Plots
39. Baseline 2 Diff Plots
- Diff: for each subset corpus, compare its policy with the policy generated from the full corpus
40. Tests
(Roadmap diagram repeated: B1 = Correctness; B2 = B1 + Certainty; SMove, Concept Repetition, Frustration, and Percent Correct added individually to B2)
41. Concept Repetition Policies
- Trend: if a concept is repeated (R), give CAQ
42. Frustration Policies
- Trend: if neutral, give CAQ
43. Percent Correctness Policies
- Trend: give Mix, especially for low performers
44. Feature Comparison (3 metrics)
- Diffs
  - Number of new states whose policies differ from the original
  - Insensitive to how frequently a state occurs
- Policy Change (P.C.)
  - Takes into account the frequency of each state-action sequence
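One plausible reading of the P.C. metric is to weight each differing state by its corpus frequency; the paper's exact weighting may differ, so treat this as an illustrative sketch:

```python
def policy_change(new_policy, base_policy, state_freq):
    # Sum the relative frequencies of states whose learned action changed,
    # so frequently occurring states count for more than rare ones.
    return sum(f for s, f in state_freq.items()
               if new_policy.get(s) != base_policy.get(s))
```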
45. Feature Comparison
- Expected Cumulative Reward (E.C.R.)
  - One issue with P.C. is that frequently occurring states may have low V-values and thus may bias the score
  - Use the expected value of being at the start of the dialogue to compare features
  - ECR: the average V-value of all start states
46. Question Act Results
- The trend of SMove > Concept Repetition > Frustration > Percent Correctness stays the same over all three metrics
- Baseline: also tested the effects of a binary random feature
  - Given enough data, a random feature should not alter policies
  - Average of 5.1 Diffs
47. Feedback Act Results
- The trend of SMove and Concept Repetition being the best stays the same, though features have less impact given this action set
- Frustration is now slightly worse than Percent Correctness
48. Discussion
- Incorporating more information into the representation of the student state has an impact on tutor policies
- Proposed three metrics to determine the relative weight of the features
- Including the last Student Move and Concept Repetition effected the most change across different action sets
49. Future Work
- Next step: take promising state features and resulting policies and implement them in ITSPOKE
- Evaluate with human users and simulated users
- Also researching: how much data is enough to prove the reliability of policies?
50. How reliable are policies?
(Diff plots for Frustration and Concept Repetition)
- Possibly the data size is small, and with increased data we may see more fluctuations
51-52. CB Example
(Figures: a small example MDP with states S0, S1, S2 and transition counts 2, 1, 5, used to illustrate confidence bounds)
53. Confidence Bounds
- Hypothesis: instead of looking at the V-values and policy differences directly, look at the confidence bounds of each V-value
- As data increases, the confidence bounds of the V-values should shrink, reflecting a better model of the world
- Additionally, the policies should converge as well
54. Confidence Bound Methodology
- For each data slice, calculate upper and lower bounds on the V-value:
  - Take the transition matrix for the slice and sample from each row 1000 times using the Dirichlet distribution
  - This yields 1000 new transition matrices that are all very similar
  - Run the MDP on all 1000 transition matrices to get a range of ECRs
  - Rows without a lot of data are very volatile, so expect a large range of ECRs; but as data increases, the transition matrices should stabilize
  - Take upper and lower bounds at the 2.5th percentile of each tail
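A sketch of the bootstrap described above. `solve_ecr` stands in for re-running the MDP on a sampled transition matrix, and the +1 Dirichlet prior is an assumption, not something the talk specifies:

```python
import numpy as np

def ecr_confidence_bounds(counts, solve_ecr, n_samples=1000, seed=0):
    # counts[s, s'] = observed transitions for one action. Sample each row
    # from a Dirichlet over the counts, re-solve for the ECR each time,
    # and report the 2.5th/97.5th percentiles as lower/upper bounds.
    rng = np.random.default_rng(seed)
    ecrs = []
    for _ in range(n_samples):
        T = np.vstack([rng.dirichlet(row + 1) for row in counts])
        ecrs.append(solve_ecr(T))
    return np.percentile(ecrs, 2.5), np.percentile(ecrs, 97.5)
```

Rows with many observations produce tightly clustered samples, so the ECR interval narrows as data accumulates, matching the trend in the slides.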
55. Confidence Bounds
- CBs can also be used to distinguish how much better an additional state feature is over a baseline state space
- That is, the new state space is reliably better if its lower bound is greater than the upper bound of the baseline state space
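The comparison rule above is a one-line test; the numbers in the usage note are the B1/B2 bounds reported later in this talk:

```python
def clearly_better(new_lower_bound, baseline_upper_bound):
    # The richer state space is reliably better once its worst-case ECR
    # exceeds the baseline's best-case ECR.
    return new_lower_bound > baseline_upper_bound
```

For example, B2's lower bound of 39.62 exceeds B1's upper bound of 23.65, so B2 is judged reliably better.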
56. Crossover Example
(Plot: ECR vs. amount of data; the more complicated model's confidence bounds eventually cross above the baseline's)
57. Confidence Bounds: Application 2
- Automatic model switching
  - If you know that a model at its worst (i.e., its lower bound) is better than another model's upper bound, then you can automatically switch to the more complicated model
  - Good for online RL applications
58. Preliminary Results
- As data increases, confidence bounds for all models shrink
- Baseline 2 (certainty) has a lower bound that is higher than the upper bound of B1, and its policies tend to stabilize
- More complex states take longer to stabilize but still perform better than the baselines
59. Baseline 1
- Upper: 23.65; Lower: 0.24
60. Baseline 2
- Upper: 57.16; Lower: 39.62
61. B2 + Concept Repetition
- Upper: 64.30; Lower: 49.16
62. Acknowledgements
- ITSPOKE research group
- Dialog on Dialogs research group
- For more information:
  - http://www.cs.pitt.edu/tetreaul
  - NLP/CL conference listings