Predicting Good Probabilities With Supervised Learning
1
Predicting Good Probabilities With Supervised
Learning
  • Alexandru Niculescu-Mizil
  • Rich Caruana
  • Cornell University

2
What are good probabilities?
  • Ideally, if the model predicts 0.75 for an
    example then the conditional probability, given
    the available attributes, of that example to be
    positive is 0.75.
  • In practice
  • Good calibration: out of all the cases the model
    predicts 0.75 for, 75% are positive.
  • Low Brier score (squared error).
  • Low cross-entropy (log-loss).
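Both scoring rules on this slide are simple averages over the predictions. A minimal NumPy sketch (function names are mine, not from the talk):

```python
import numpy as np

def brier_score(y_true, p_pred):
    """Mean squared error between predicted probabilities and 0/1 labels."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    return np.mean((p_pred - y_true) ** 2)

def log_loss(y_true, p_pred, eps=1e-15):
    """Cross-entropy; predictions are clipped to avoid log(0)."""
    y_true = np.asarray(y_true, dtype=float)
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1 - eps)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
```

Both are minimized, in expectation, by predicting the true conditional probability, which is why they serve as proxies for calibration quality.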

3
Why good probabilities?
  • Intelligibility
  • If the classifier is part of a larger system
  • Speech recognition
  • Handwriting recognition
  • If the classifier is used for decision making
  • Cost sensitive decisions
  • Medical applications
  • Meteorology
  • Risk analysis

4
What did we do?
  • We analyzed the predictions made by ten
    supervised learning algorithms.
  • For the analysis we used eight binary
    classification problems.
  • Limitations
  • Only binary problems. No multiclass.
  • No high-dimensional problems (all under 200 attributes).
  • Only moderately sized training sets.

5
Questions addressed in this talk
  • Which models are well calibrated and which are
    not?
  • Can we fix the models that are not well
    calibrated?
  • Which learning algorithm makes the best
    probabilistic predictions?

6
Reliability diagrams
  • Put the cases with predicted values between 0 and
    0.1 in the first bin, between 0.1 and 0.2 in the
    second, etc.
  • For each bin, plot the mean predicted value
    against the true fraction of positives.
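The binning procedure described above can be sketched directly (a minimal version; the function name is mine):

```python
import numpy as np

def reliability_diagram(y_true, p_pred, n_bins=10):
    """For each bin [0, 0.1), [0.1, 0.2), ... return the mean predicted
    value and the observed fraction of positives (non-empty bins only)."""
    y_true = np.asarray(y_true, dtype=float)
    p_pred = np.asarray(p_pred, dtype=float)
    # predictions of exactly 1.0 fall into the last bin
    bins = np.minimum((p_pred * n_bins).astype(int), n_bins - 1)
    points = []
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            points.append((p_pred[mask].mean(), y_true[mask].mean()))
    return points
```

A well-calibrated model yields points close to the diagonal: the mean predicted value in each bin matches the fraction of positives.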

7
Which models are well calibrated?
  [Reliability diagrams for each learning algorithm:
  ANN, BAG-DT, LOGREG, SVM, BST-DT, BST-STMP, RF, DT, KNN, NB]
8
Questions addressed in this talk
  • Which models are well calibrated and which are
    not?
  • Can we fix the models that are not well
    calibrated?
  • Which learning algorithm makes the best
    probabilistic predictions?

9
Can we fix the models that are not well
calibrated?
  • Platt Scaling
  • Method used by Platt to obtain calibrated
    probabilities from SVMs. [Platt '99]
  • Converts the outputs by passing them through a
    sigmoid.
  • The sigmoid is fitted using an independent
    calibration set.
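A minimal sketch of the sigmoid fit on a calibration set, using plain gradient descent on the log-loss. Platt's original method uses a Newton-style solver and regularized (smoothed) targets, so treat this as illustrative only; function names are mine:

```python
import numpy as np

def fit_platt(scores, y, lr=0.01, n_iter=5000):
    """Fit p = 1 / (1 + exp(A*s + B)) to the calibration set by
    gradient descent on the cross-entropy loss."""
    s = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    A, B = 0.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * s + B))
        resid = p - y  # derivative of the loss w.r.t. -(A*s + B)
        A -= lr * np.mean(-resid * s)
        B -= lr * np.mean(-resid)
    return A, B

def platt_transform(scores, A, B):
    """Map raw model outputs to calibrated probabilities."""
    s = np.asarray(scores, dtype=float)
    return 1.0 / (1.0 + np.exp(A * s + B))
```

With this parameterization, A ends up negative when larger raw scores indicate the positive class.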

10
Can we fix the models that are not well
calibrated?
  • Isotonic Regression [Robertson et al. '88]
  • More general calibration method used by Zadrozny
    and Elkan. [Zadrozny & Elkan '01, '02]
  • Converts the outputs by passing them through a
    general isotonic (monotonically increasing)
    function.
  • The isotonic function is fitted using an
    independent calibration set.
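The isotonic fit is typically computed with the pool-adjacent-violators (PAV) algorithm: sort by score, then repeatedly merge adjacent blocks whose means violate monotonicity. A compact plain-Python sketch (the function name is mine):

```python
def pav_calibrate(scores, y):
    """Fit a monotonically increasing step function mapping scores to
    probabilities via pool-adjacent-violators; returns fitted values
    in the original input order."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    merged = []  # each block: [sum_of_labels, count]
    for i in order:
        merged.append([float(y[i]), 1])
        # merge while the previous block's mean exceeds the last one's
        while len(merged) > 1 and \
                merged[-2][0] / merged[-2][1] > merged[-1][0] / merged[-1][1]:
            s2, n2 = merged.pop()
            merged[-1][0] += s2
            merged[-1][1] += n2
    # expand block means back to per-example values (in score order)
    fitted = []
    for s, n in merged:
        fitted.extend([s / n] * n)
    # undo the sort to restore input order
    return [v for _, v in sorted(zip(order, fitted))]
```

Because the fitted function is a step function learned from the calibration set, isotonic regression is more flexible than Platt's sigmoid but needs more data to avoid overfitting.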

11
Max-margin methods
12
Boosted decision trees
  [Figure: plots for the eight problems, P1 COVT, P2 ADULT, P3 LET1,
  P4 LET2, P5 MEDIS, P6 SLAC, P7 HS, P8 MG; legend: HIST, PLATT, ISO]
13
Platt calibration for boosting
  • Before / After
  [Figure: histograms of predicted values on COVT, ADULT, LET1, LET2,
  MEDIS, SLAC, HS, MG, before and after Platt calibration]
14
Naive Bayes
15
Platt Scaling vs. Isotonic Regression
16
Questions addressed in this talk
  • Which models are well calibrated and which are
    not?
  • Can we fix the models that are not well
    calibrated?
  • Which learning algorithm makes the best
    probabilistic predictions?

17
Empirical Comparison
  • For every learning algorithm we train different
    models using many parameter settings and
    variations.
  • For SVMs we vary the kernel, the kernel
    parameters, and the tradeoff parameter.
  • For neural nets we vary the number of hidden
    units, the momentum, etc.
  • For boosted trees we vary the type of decision
    tree used as a base learner and the number of
    boosting steps.
  • Each model is trained on 4000 points and
    calibrated with Platt Scaling and Isotonic
    Regression on 1000 points.
  • For each data set, learning algorithm and
    calibration method we select the best model using
    the same 1000 points used for calibration.

18
Empirical Comparison
  [Bar chart of Brier scores by algorithm:
  BST-DT, SVM, RF, ANN, BAG, KNN, STMP, DT, LR, NB]
19
Summary and Conclusions
  • We examined the quality of the probabilities
    predicted by ten supervised learning algorithms.
  • Neural nets, bagged trees and logistic regression
    have well calibrated predictions.
  • Max-margin methods such as boosting and SVMs push
    the predicted values away from 0 and 1. This
    yields a sigmoid-shaped reliability diagram.
  • Learning algorithms such as Naive Bayes distort
    the probabilities in the opposite way, pushing
    them closer to 0 and 1.

20
Summary and Conclusions
  • We examined two methods to calibrate the
    predictions.
  • Max-margin methods and Naive Bayes benefit a lot
    from calibration, while well-calibrated methods
    do not.
  • Platt Scaling is more effective when the
    calibration set is small, but Isotonic Regression
    is more powerful when there is enough data to
    prevent overfitting.
  • The methods that predict the best probabilities
    are calibrated boosted trees, calibrated random
    forests, calibrated SVMs, uncalibrated bagged
    trees and uncalibrated neural nets.

21
Thank you!
  • Questions?