Title: online supervised learning of non-understanding recovery policies
Slide 1: online supervised learning of non-understanding recovery policies
- Dan Bohus
- www.cs.cmu.edu/dbohus
- dbohus_at_cs.cmu.edu
- Computer Science Department
- Carnegie Mellon University
- Pittsburgh, PA 15213
with thanks to Alex Rudnicky, Brian Langner, Antoine Raux, Alan Black, Maxine Eskenazi
Slide 2: understanding errors in spoken dialog
- MIS-understanding: the system constructs an incorrect semantic representation of the user's turn
  - S: Where are you flying from?  U: Birmingham  [recognized as: BERLIN PM]
- NON-understanding: the system fails to construct a semantic representation of the user's turn
  - S: Where are you flying from?  U: Urbana Champaign  [recognized as: OKAY IN THAT SAME PAY]
Slide 3: recovery strategies
- large set of strategies (strategy = 1-step action)
- tradeoffs not well understood
- some strategies are more appropriate at certain times
  - OOV -> ask repeat is not a good idea
  - door slam -> ask repeat might work well
- example strategies (system prompts):
  - Sorry, I didn't catch that
  - Can you repeat that?
  - Can you rephrase that?
  - Where are you flying from?
  - Please tell me the name of the city you are leaving from
  - Could you please go to a quieter place?
  - Sorry, I didn't catch that. Tell me the state first
Slide 4: recovery policy
- policy = method for choosing between strategies
- difficult to handcraft, especially over a large set of recovery strategies
- common approaches: heuristic
  - "three strikes and you're out" [Balentine]
    - 1st non-understanding: ask user to repeat
    - 2nd non-understanding: provide more help, including examples
    - 3rd non-understanding: transfer to an operator
Slide 5: this talk
- an online, supervised method for learning a non-understanding recovery policy from data
Slide 6: overview
- introduction
- approach
- experimental setup
- results
- discussion
Slide 8: intuition
- if we knew the probability of success for each strategy in the current situation, we could easily construct a policy
- S: Where are you flying from?  U: Urbana Champaign  [recognized as: OKAY IN THAT SAME PAY]
- candidate strategies:
  - Sorry, I didn't catch that
  - Can you repeat that?
  - Can you rephrase that?
  - Where are you flying from?
  - Please tell me the name of the city you are leaving from
  - Could you please go to a quieter place?
  - Sorry, I didn't catch that. Tell me the state first
- [figure: a probability of success shown next to each candidate strategy: 32, 15, 20, 30, 45, 25, 43 (percent)]
Slide 9: two-step approach
- step 1: learn to estimate the probability of success for each strategy, in a given situation
- step 2: use these estimates to choose between strategies (and hence build a policy)
Slide 10: learning predictors for strategy success
- supervised learning: logistic regression (a minimal code sketch follows this slide)
- target: did the strategy recover successfully or not
  - success = next turn is correctly understood
  - labeled semi-automatically
- features describe the current situation
  - extracted from different knowledge sources
  - recognition features
  - language understanding features
  - dialog-level features: state, history
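A minimal sketch of step 1, assuming Python with statsmodels (this is an illustration, not the system's actual code): one logistic regression per strategy, fit on the non-understandings where that strategy was tried, predicting whether recovery succeeded. The data layout and feature names are hypothetical, and the stepwise feature selection mentioned on the next slide is not reproduced here.

```python
import numpy as np
import statsmodels.api as sm

def train_success_predictors(episodes):
    """episodes: list of dicts such as
    {"strategy": "HLP", "features": np.array([...]), "success": 0 or 1},
    one entry per non-understanding where that strategy was engaged."""
    models = {}
    for name in {e["strategy"] for e in episodes}:
        rows = [e for e in episodes if e["strategy"] == name]
        X = np.stack([r["features"] for r in rows])
        y = np.array([r["success"] for r in rows])
        if len(rows) < 20 or len(set(y)) < 2:        # too little data / only one class seen
            continue
        # add_constant prepends an intercept column; fit a plain logistic regression
        models[name] = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
    return models
```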
Slide 11: logistic regression
- well-calibrated class-posterior probabilities
  - predictions reflect the empirical probability of success: in x% of the cases where P(S|F) = x, the strategy is indeed successful
- sample efficient
  - one model per strategy, so data will be sparse
- stepwise construction
  - automatic feature selection
- provides confidence bounds
  - very useful for online learning (see the sketch after this slide)
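One common way to obtain such bounds, sketched here under the assumption of a normal approximation on the linear predictor (an illustration, not necessarily the exact procedure used in the system), applied to a model fitted as in the previous sketch:

```python
import numpy as np

def success_bounds(result, x, z=1.96):
    """Predicted P(success) and ~95% (lower, upper) bounds for feature vector x,
    given a fitted statsmodels Logit result with a prepended intercept."""
    xc = np.concatenate(([1.0], x))              # prepend intercept term
    eta = xc @ result.params                     # linear predictor
    se = np.sqrt(xc @ result.cov_params() @ xc)  # its standard error
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    return sigmoid(eta), (sigmoid(eta - z * se), sigmoid(eta + z * se))
```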
Slide 12: two-step approach
- step 1: learn to estimate the probability of success for each strategy, in a given situation
- step 2: use these estimates to choose between strategies (and hence build a policy)
Slide 13: policy learning
- choose the strategy most likely to succeed
- [figure: bar chart of estimated probability of success (0 to 1) for strategies S1-S4]
- BUT
  - we want to learn online
  - we have to deal with the exploration / exploitation tradeoff
Slide 14: highest-upper-bound learning
- choose the strategy with the highest upper confidence bound
- proposed by Kaelbling '93
- empirically shown to do well in various problems
- intuition (see the sketch after this slide):
- [figure: bar charts of estimated success (0 to 1) for strategies S1-S4 with confidence intervals; a wide interval whose upper bound is highest drives exploration, a tight interval with a high estimate drives exploitation]
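A minimal sketch of the decision rule, reusing the hypothetical `success_bounds` helper from the previous sketch: among the strategies the constraints currently allow, pick the one with the largest upper bound; never-tried strategies get an optimistic bound so each is explored at least once. Names and the optimistic default are illustrative assumptions.

```python
def choose_strategy(models, allowed, x):
    """models: strategy name -> fitted Logit result; allowed: names the constraints permit."""
    def ucb(name):
        if name not in models:                   # no model yet: optimistic default
            return 1.0
        _, (_, upper) = success_bounds(models[name], x)
        return upper
    return max(allowed, key=ucb)
```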
Slide 19: overview
- introduction
- approach
- experimental setup
- results
- discussion
Slide 20: system
- Let's Go! Public bus information system
- connected to the PAT customer service line during non-business hours
- 30-50 calls / night
Slide 21: strategies
Slide 22: constraints
- constraints (see the sketch after this slide):
  - don't AREP more than twice in a row
  - don't ARPH if words < 3
  - don't ASA unless words > 5
  - don't ASO unless (4 non-understandings in a row) and (ratio.nonu > 50%)
  - don't GUP unless (dialog > 30 turns) and (ratio.nonu > 80%)
- capture expert knowledge; ensure the system doesn't use an unreasonable policy
- 4.2/11 strategies available on average (min 1, max 9)
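For concreteness, the constraints above can be written as a simple filter over candidate strategies. The sketch below assumes hypothetical field names for the quantities the constraints refer to; it is not the system's actual implementation.

```python
def allowed_strategies(state, all_strategies):
    """Return the subset of strategies the hand-coded constraints allow."""
    ok = set(all_strategies)
    if state["arep_in_a_row"] >= 2:     # don't AREP more than twice in a row
        ok.discard("AREP")
    if state["words"] < 3:              # don't ARPH if words < 3
        ok.discard("ARPH")
    if state["words"] <= 5:             # don't ASA unless words > 5
        ok.discard("ASA")
    if not (state["nonu_in_a_row"] >= 4 and state["ratio_nonu"] > 0.50):
        ok.discard("ASO")               # don't ASO unless 4 nonu in a row and ratio.nonu > 50%
    if not (state["turns"] > 30 and state["ratio_nonu"] > 0.80):
        ok.discard("GUP")               # don't GUP unless dialog > 30 turns and ratio.nonu > 80%
    return ok
```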
Slide 23: features
- current non-understanding
  - recognition, lexical, grammar, timing info
- current non-understanding segment
  - length, which strategies were already taken
- current dialog state and history
  - encoded dialog states
  - how well things have been going
- (an illustrative feature vector is sketched after this slide)
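Purely as an illustration of what such a feature vector might look like; all names and values here are hypothetical placeholders, not the system's actual feature set.

```python
example_features = {
    # current non-understanding: recognition / lexical / grammar / timing
    "asr_confidence": 0.21,
    "num_words": 4,
    "fraction_parsed": 0.0,
    "turn_duration_s": 1.8,
    # current non-understanding segment
    "nonu_in_a_row": 2,
    "strategies_already_tried": ["AREP"],
    # dialog state and history
    "dialog_state": "request_departure_place",
    "turns_so_far": 12,
    "ratio_nonu": 0.33,
}
```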
Slide 24: learning
- baseline period: 2 weeks, 3/11 -> 3/25, 2006
  - system randomly chose a strategy, while obeying constraints
  - in effect, a heuristic / stochastic policy
- learning period: 5 weeks, 3/26 -> 5/5, 2006
  - each morning: labeled data from the previous night, retrained the likelihood-of-success predictors, installed them in the system for the next night (see the sketch after this slide)
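A rough sketch of that daily cycle, reusing the hypothetical `train_success_predictors` helper from the earlier sketch; `load_labeled_episodes` is a placeholder standing in for the semi-automatic labeling step, so this is an outline of the loop rather than runnable system code.

```python
all_episodes = []   # grows night by night

def morning_update(date, deployed_models):
    """Retrain all success predictors on the data collected so far and
    return the models to be installed for the coming night."""
    all_episodes.extend(load_labeled_episodes(date))          # previous night's calls
    deployed_models.clear()
    deployed_models.update(train_success_predictors(all_episodes))
    return deployed_models
```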
Slide 25: 2 strategies eliminated
Slide 26: overview
- introduction
- approach
- experimental setup
- results
- discussion
Slide 27: results
- average non-understanding recovery rate (ANRR)
- improvement: 33.6% -> 37.8% (p = 0.03), i.e. 12.5% relative
- fitted learning curve with parameters A = 0.3385, B = 0.0470, C = 0.5566, D = -11.44
Slide 28: policy evolution
- MOVE, HLP, ASA engaged more often
- AREP, ARPH engaged less often
Slide 29: overview
- introduction
- approach
- experimental setup
- results
- discussion
Slide 30: are the predictors learning anything?
- AREP(653), IT(273), SLL(300)
- no informative features
- ARPH(674), MOVE(1514)
- 1 informative feature (prev.nonu, words)
- ASA(637), RP(2532), HLP(3698), HLP_R(989)
- 4 or more informative features in the model
- dialog state (especially explicit confirm states)
- dialog history
Slide 31: more features, more (specific) strategies
- more features would be useful
  - day-of-week
  - clustered dialog states
  - ... (any ideas?)
- more strategies / variants
  - the approach might be able to filter out bad versions
- more specific strategies, features
  - "ask short answers" worked well
  - "speak less loud" didn't (why?)
Slide 32: noise in the experiment
- 15-20% of responses following non-understandings are non-user-responses
  - transient noises
  - secondary speech
  - primary speech not directed to the system
- this might affect training; in a future experiment we want to eliminate that
Slide 33: unsupervised learning
- supervised version
  - success = next turn is correctly understood, i.e. no misunderstanding, no non-understanding
- unsupervised version (see the sketch after this slide)
  - success = next turn is not a non-understanding
  - success = confidence score of the next turn
  - training labels automatically available
  - performance improvements might still be possible
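A small sketch of how such labels could be derived automatically from the next turn, covering both the binary variant and the soft, confidence-score variant; the field names are hypothetical.

```python
def unsupervised_label(next_turn, binary=True):
    """Derive a training target from the next user turn, with no manual labeling."""
    if binary:
        # success = the next turn is not itself a non-understanding
        return 0 if next_turn["is_nonunderstanding"] else 1
    # soft variant: use the next turn's confidence score as the target
    return next_turn["confidence_score"]
```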
Slide 34: thank you!