Title: Modeling the Cost of Misunderstandings
Modeling the Cost of Misunderstandings in the CMU Communicator System
Dan Bohus, Alex Rudnicky
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213
Abstract
We present a data-driven approach that allows us to quantitatively assess the costs of the various types of errors that a confidence annotator commits in the CMU Communicator spoken dialog system. Knowing these costs, we can determine the optimal tradeoff point between these errors and fine-tune the confidence annotator accordingly. The cost models based on net concept transfer efficiency fit our data quite well, and the relative costs of false positives and false negatives are in accordance with our intuitions. We also find, surprisingly, that for a mixed-initiative system such as the CMU Communicator, these errors trade off equally over a wide operating range.
1. Motivation and Problem Formulation
- Introduction
- In previous work [1], we cast the problem of utterance-level confidence annotation as a binary classification task and trained multiple classifiers for this purpose.
- Training corpus: 131 dialogs, 4550 utterances.
- 12 features from the recognition, parsing, and dialog levels.
- 7 classifiers: decision tree, ANN, Bayesian network, AdaBoost, Naïve Bayes, SVM, logistic regression.
- Results (mean classification error rates in 10-fold cross-validation): most of the classifiers obtained statistically indistinguishable results (with the notable exception of Naïve Bayes); the logistic regression model obtained much better performance on a soft metric. A cross-validation sketch follows this list.
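As a rough illustration of this earlier setup, the sketch below runs a 10-fold cross-validation over several of the listed classifiers. The feature matrix and labels are random placeholders, not the actual Communicator corpus, and only the scikit-learn estimators that map directly onto the listed classifiers are included.

```python
# Hedged sketch: 10-fold cross-validation over a few of the classifiers
# listed above. X and y are random placeholders standing in for the 12
# utterance-level features and the binary misunderstanding labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(4550, 12))        # placeholder: 4550 utterances x 12 features
y = rng.integers(0, 2, size=4550)      # placeholder: 1 = correctly understood

classifiers = {
    "decision tree": DecisionTreeClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "logistic regression": LogisticRegression(max_iter=1000),
}
for name, clf in classifiers.items():
    accuracy = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name:20s} mean error rate: {1 - accuracy.mean():.3f}")
```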
2. Cost Models: The Approach
- To model the impact of false positives (FPs) and false negatives (FNs) on system performance, we:
- identify a suitable dialog performance metric (P) which we want to optimize for
- build a statistical regression model over whole sessions, using P as the response variable and the counts of FPs and FNs as predictors (a fitting sketch follows this list)
- P = f(FP, FN)
- P = k + C_FP · FP + C_FN · FN   (linear regression)
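A minimal sketch of the regression step, assuming per-session counts of FPs and FNs and a per-session performance value P are already available (the arrays below are made-up placeholders):

```python
# Fit P = k + C_FP * FP + C_FN * FN by ordinary least squares.
import numpy as np

fp = np.array([3, 1, 0, 5, 2, 4])               # false positives per session (placeholder)
fn = np.array([1, 2, 0, 4, 1, 3])               # false negatives per session (placeholder)
p  = np.array([0.9, 0.8, 1.1, 0.2, 0.8, 0.4])   # performance metric per session (placeholder)

design = np.column_stack([np.ones(len(fp)), fp, fn])
(k, c_fp, c_fn), *_ = np.linalg.lstsq(design, p, rcond=None)
print(f"k = {k:.2f}, C_FP = {c_fp:.2f}, C_FN = {c_fn:.2f}")
```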
- Performance metrics
- User satisfaction (5-point scale): subjective, hard to obtain
- Completion (binary): too coarse
- Concept transmission efficiency (a computation sketch follows these definitions):
- CTC = correctly transferred concepts / turn
- ITC = incorrectly transferred concepts / turn
- REC = relevantly expressed concepts / turn
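A small sketch of how these per-turn metrics could be computed from session-level concept counts (the function and its arguments are hypothetical; the poster does not specify the exact bookkeeping):

```python
# Concept transmission efficiency metrics, computed per session.
def concept_metrics(correct, incorrect, relevant, num_turns):
    """Return (CTC, ITC, REC) as concepts per turn."""
    ctc = correct / num_turns      # correctly transferred concepts / turn
    itc = incorrect / num_turns    # incorrectly transferred concepts / turn
    rec = relevant / num_turns     # relevantly expressed concepts / turn
    return ctc, itc, rec

# Example: 18 correct, 3 incorrect, 25 relevant concepts over 20 turns
print(concept_metrics(18, 3, 25, 20))   # (0.9, 0.15, 1.25)
```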
- The Dataset
Mean classification error rates (%), 10-fold cross-validation:
- Random baseline: 32
- Previous (GARBLE) baseline: 25
- Trained classifiers: 16
So the problem translates to locating the point on the operating characteristic (reached by moving the classification threshold) which minimizes the total cost, and thus implicitly maximizes the chosen performance metric, rather than the point that minimizes the classification error rate. The cost, according to Model 3, is:

Cost = 0.48 · FP_NC + 2.12 · FP_C + 1.33 · FN + 0.56 · TN
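The sketch below illustrates this threshold search, assuming each utterance comes with a confidence score, the true label, and a flag indicating whether it carried relevant concepts; all the data here is randomly generated and the 0.7/0.8 proportions are arbitrary.

```python
# Sweep the classification threshold and pick the one with minimal total cost.
import numpy as np

rng = np.random.default_rng(1)
conf = rng.uniform(size=1000)                    # annotator confidence scores
ok = rng.uniform(size=1000) < 0.7                # True = correctly understood
has_concepts = rng.uniform(size=1000) < 0.8      # utterance carries relevant concepts

def total_cost(threshold):
    accept = conf >= threshold
    fp = accept & ~ok                            # accepted but misunderstood
    fn = ~accept & ok                            # rejected although correct
    tn = ~accept & ~ok                           # rejected and misunderstood
    fp_c = (fp & has_concepts).sum()             # FPs carrying relevant concepts
    fp_nc = (fp & ~has_concepts).sum()           # FPs without relevant concepts
    # Cost function of Model 3, with the coefficients quoted above.
    return 0.48 * fp_nc + 2.12 * fp_c + 1.33 * fn.sum() + 0.56 * tn.sum()

thresholds = np.linspace(0.05, 0.95, 19)
costs = np.array([total_cost(t) for t in thresholds])
print("cost curve:", np.round(costs, 1))
print("best threshold:", thresholds[costs.argmin()])
```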
3. Cost Models: The Results
- Cost Models Targeting Efficiency
- Three successively refined cost models were developed, targeting efficiency as the response variable.
- The goodness of fit of these models (indicated by R²), both on the training data and in a 10-fold cross-validation, is shown in the table below.
- Model 1: CTC = k + C_FP · FP + C_FN · FN + C_TN · TN
- Model 2: CTC − ITC = k + C_REC · REC + C_FP · FP + C_FN · FN + C_TN · TN
- The ITC term was added to the response so that we also minimize the number of incorrectly transferred concepts.
- REC captures a prior on the verbosity of the user.
- Both changes further improve the fit.
- Model 3: CTC − ITC = k + C_REC · REC + C_FPC · FP_C + C_FPNC · FP_NC + C_FN · FN + C_TN · TN
- The FP term was split in two, since there are two different types of false positives in the system, which intuitively should have very different costs:
- FP_C: false positives containing relevant concepts
- FP_NC: false positives containing no relevant concepts
The fact that the cost function is almost constant across a wide range of thresholds indicates that the efficiency of the dialog stays about the same, regardless of the ratio of FPs to FNs that the system makes.
- Further Analysis
- Is CTC − ITC an Adequate Metric?
- Mean: 0.71; standard deviation: 0.28
- Mean for completed dialogs: 0.82
- Mean for uncompleted dialogs: 0.57
- The differences are statistically significant at a very high level of confidence (p = 7.23 × 10⁻⁹).
- Can We Reliably Extrapolate the Model to Other Areas of the ROC?
Areas of the ROC ? - The distribution of FPs and FNs across dialogs
indicates that, although the data is obtained
with the confidence annotator running with a
threshold of 0.5, we have enough samples to
reliably estimate the other areas of the ROC.
Model (response ~ predictors)                 R² (all)   R² (train)   R² (test)
CTC ~ FP, FN, TN                                0.81       0.81         0.73
CTC−ITC ~ FP, FN, TN                            0.86       0.86         0.78
CTC−ITC ~ REC, FP, FN, TN                       0.89       0.89         0.83
CTC−ITC ~ REC, FP_C, FP_NC, FN, TN              0.94       0.94         0.90
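The train and cross-validated R² values in the table could be obtained along these lines (a sketch only; the per-session predictor matrix and response below are synthetic placeholders, not the actual Communicator sessions):

```python
# Goodness of fit for a session-level linear cost model: R^2 on the
# training data and mean R^2 under 10-fold cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(2)
n_sessions = 131                                           # placeholder: one row per session
X = rng.poisson(2.0, size=(n_sessions, 5)).astype(float)   # placeholder REC, FP_C, FP_NC, FN, TN
# arbitrary synthetic relation standing in for the true CTC - ITC response
y = 0.4 + 0.6 * X[:, 0] - 0.1 * X[:, 1:].sum(axis=1) + rng.normal(0, 0.1, n_sessions)

model = LinearRegression().fit(X, y)
print("R2 (train):     ", round(model.score(X, y), 2))
print("R2 (10-fold CV):", round(cross_val_score(model, X, y, cv=10, scoring="r2").mean(), 2))
```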
Fitted coefficients (Model 3):
k = 0.41
C_REC = 0.62
C_FPNC = -0.48
C_FPC = -2.12
C_FN = -1.33
C_TN = -0.55
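Writing the fitted coefficients out as a single equation (a reconstruction from the table above; the negative signs mean that each error type lowers the predicted efficiency):

```latex
\[
  \widehat{CTC - ITC} \;=\; 0.41 \;+\; 0.62\,REC \;-\; 0.48\,FP_{NC}
  \;-\; 2.12\,FP_{C} \;-\; 1.33\,FN \;-\; 0.55\,TN
\]
```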
4. Fine-tuning the Annotator
- We want to find the optimal trade-off point on the operating characteristic of the classifier: rather than implicitly minimizing the classification error rate (FP + FN), we choose the threshold that minimizes the total cost.
- Conclusions
- We proposed a data-driven approach to quantitatively assess the costs of the various types of errors committed by a confidence annotator.
- The models based on efficiency fit the data well, and the obtained costs confirm our intuitions.
- For the CMU Communicator, the models predict that the total cost stays roughly the same across a large range of the operating characteristic of the confidence annotator.