Predicting Good Probabilities With Supervised Learning
1
Predicting Good Probabilities With Supervised
Learning
  • Alexandru Niculescu-Mizil
  • Rich Caruana
  • Cornell University

2
What are good probabilities?
  • Ideally, if the model predicts 0.75 for an
    example then the conditional probability, given
    the available attributes, of that example to be
    positive is 0.75.
  • In practice
  • Good calibration: out of all the cases the model
    predicts 0.75 for, 75% are positive.
  • Low Brier score (squared error).
  • Low cross-entropy (log-loss).
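Both scoring rules on this slide are simple averages over the predictions. A minimal NumPy sketch (function names are mine, not from the talk):

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return np.mean((p_pred - y_true) ** 2)

def log_loss(y_true, p_pred, eps=1e-15):
    """Cross-entropy; predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

Both are minimized, in expectation, by predicting the true conditional probability, which is why they serve as proxies for calibration quality.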

3
Why good probabilities?
  • Intelligibility
  • If the classifier is part of a larger system
  • Speech recognition
  • Handwriting recognition
  • If the classifier is used for decision making
  • Cost sensitive decisions
  • Medical applications
  • Meteorology
  • Risk analysis

4
What did we do?
  • We analyzed the predictions made by ten
    supervised learning algorithms.
  • For the analysis we used eight binary
    classification problems.
  • Limitations
  • Only binary problems. No multiclass.
  • No high-dimensional problems (all under 200 attributes).
  • Only moderately sized training sets.

5
Questions addressed in this talk
  • Which models are well calibrated and which are
    not?
  • Can we fix the models that are not well
    calibrated?
  • Which learning algorithm makes the best
    probabilistic predictions?

6
Reliability diagrams
  • Put the cases with predicted values between 0 and
    0.1 in the first bin, between 0.1 and 0.2 in the
    second, etc.
  • For each bin, plot the mean predicted value
    against the true fraction of positives.
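The binning procedure described above can be sketched directly (a minimal version; the function name is mine):

```python
import numpy as np

def reliability_diagram(y_true, p_pred, n_bins=10):
    """For each bin [0, 0.1), [0.1, 0.2), ... return the mean predicted
    value and the observed fraction of positives (non-empty bins only)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # predictions of exactly 1.0 fall into the last bin
    bins = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            points.append((p_pred[mask].mean(), y_true[mask].mean()))
    return points
```

A well-calibrated model yields points close to the diagonal: the mean predicted value in each bin matches the fraction of positives.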

7
Which models are well calibrated?
  [Reliability diagrams for each learning algorithm:
  ANN, BAG-DT, LOGREG, SVM, BST-DT, BST-STMP, RF, DT, KNN, NB]
8
Questions addressed in this talk
  • Which models are well calibrated and which are
    not?
  • Can we fix the models that are not well
    calibrated?
  • Which learning algorithm makes the best
    probabilistic predictions?

9
Can we fix the models that are not well
calibrated?
  • Platt Scaling
  • Method used by Platt to obtain calibrated
    probabilities from SVMs. [Platt '99]
  • Converts the outputs by passing them through a
    sigmoid.
  • The sigmoid is fitted using an independent
    calibration set.
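A minimal sketch of the sigmoid fit on a calibration set, using plain gradient descent on the log-loss. Platt's original method uses a Newton-style solver and regularized (smoothed) targets, so treat this as illustrative only; function names are mine:

```python
import numpy as np

def fit_platt(scores, y, lr=0.01, n_iter=5000):
    """Fit p = 1 / (1 + exp(A*s + B)) to the calibration set by
    gradient descent on the cross-entropy loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * s + B))
        resid = p - y  # derivative of the loss w.r.t. -(A*s + B)
        A -= lr * np.mean(-resid * s)
        B -= lr * np.mean(-resid)
    return A, B

def platt_transform(scores, A, B):
    """Map raw model outputs to calibrated probabilities."""
    s = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(A * s + B))
```

With this parameterization, A ends up negative when larger raw scores indicate the positive class.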

10
Can we fix the models that are not well
calibrated?
  • Isotonic Regression [Robertson et al. '88]
  • More general calibration method used by Zadrozny
    and Elkan. [Zadrozny & Elkan '01, '02]
  • Converts the outputs by passing them through a
    general isotonic (monotonically increasing)
    function.
  • The isotonic function is fitted using an
    independent calibration set.
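The isotonic fit is typically computed with the pool-adjacent-violators (PAV) algorithm: sort by score, then repeatedly merge adjacent blocks whose means violate monotonicity. A compact plain-Python sketch (the function name is mine):

```python
def pav_calibrate(scores, y):
    """Fit a monotonically increasing step function mapping scores to
    probabilities via pool-adjacent-violators; returns fitted values
    in the original input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    merged = []  # each block: [sum_of_labels, count]
    for i in order:
        merged.append([float(y[i]), 1])
        # merge while the previous block's mean exceeds the last one's
        while len(merged) > 1 and \
                merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]:
            s2, n2 = merged.pop()
            merged[-1][0] += s2
            merged[-1][1] += n2
    # expand block means back to per-example values (in score order)
    fitted = []
    for s, n in merged:
        fitted.extend([s / n] * n)
    # undo the sort to restore input order
    return [v for _, v in sorted(zip(order, fitted))]
```

Because the fitted function is a step function learned from the calibration set, isotonic regression is more flexible than Platt's sigmoid but needs more data to avoid overfitting.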

11
Max-margin methods
12
Boosted decision trees
  [Figure: plots for the eight problems, P1 COVT, P2 ADULT, P3 LET1,
  P4 LET2, P5 MEDIS, P6 SLAC, P7 HS, P8 MG; legend: HIST, PLATT, ISO]
13
Platt calibration for boosting
  • Before / After
  [Figure: histograms of predicted values on COVT, ADULT, LET1, LET2,
  MEDIS, SLAC, HS, MG, before and after Platt calibration]
14
Naive Bayes
15
Platt Scaling vs. Isotonic Regression
16
Questions addressed in this talk
  • Which models are well calibrated and which are
    not?
  • Can we fix the models that are not well
    calibrated?
  • Which learning algorithm makes the best
    probabilistic predictions?

17
Empirical Comparison
  • For every learning algorithm we train different
    models using many parameter settings and
    variations.
  • For SVMs we vary the kernel, the kernel
    parameters, and the tradeoff parameter.
  • For neural nets we vary the number of hidden
    units, the momentum, etc.
  • For boosted trees we vary the type of decision
    tree used as a base learner and the number of
    boosting steps.
  • Each model is trained on 4000 points and
    calibrated with Platt Scaling and Isotonic
    Regression on 1000 points.
  • For each data set, learning algorithm and
    calibration method we select the best model using
    the same 1000 points used for calibration.

18
Empirical Comparison
  [Bar chart of Brier scores by algorithm:
  BST-DT, SVM, RF, ANN, BAG, KNN, STMP, DT, LR, NB]
19
Summary and Conclusions
  • We examined the quality of the probabilities
    predicted by ten supervised learning algorithms.
  • Neural nets, bagged trees and logistic regression
    have well calibrated predictions.
  • Max-margin methods such as boosting and SVMs push
    the predicted values away from 0 and 1. This
    yields a sigmoid-shaped reliability diagram.
  • Learning algorithms such as Naive Bayes distort
    the probabilities in the opposite way, pushing
    them closer to 0 and 1.

20
Summary and Conclusions
  • We examined two methods to calibrate the
    predictions.
  • Max-margin methods and Naive Bayes benefit a lot
    from calibration, while well-calibrated methods
    do not.
  • Platt Scaling is more effective when the
    calibration set is small, but Isotonic Regression
    is more powerful when there is enough data to
    prevent overfitting.
  • The methods that predict the best probabilities
    are calibrated boosted trees, calibrated random
    forests, calibrated SVMs, uncalibrated bagged
    trees and uncalibrated neural nets.

21
Thank you!
  • Questions?