Title: Data Mining in Metric Space
1. Spooky Stuff in Metric Space
2. Spooky Stuff: Data Mining in Metric Space
- Rich Caruana
- Alex Niculescu
- Cornell University
3. Motivation 1
4. Motivation 1: Pneumonia Risk Prediction
5. Motivation 1: Many Learning Algorithms
- Neural nets
- Logistic regression
- Linear perceptron
- K-nearest neighbor
- Decision trees
- ILP (Inductive Logic Programming)
- SVMs (Support Vector Machines)
- Bagging
- Boosting
- Rule learners (CN2, …)
- Ripper
- Random Forests (forests of decision trees)
- Gaussian Processes
- Bayes Nets
- …
- No single learning method dominates the others
6. Motivation 2
7. Motivation 2: SLAC B/Bbar
- Particle accelerator generates B/Bbar particles
- Use machine learning to classify tracks as B or Bbar
- Domain-specific performance measure: SLQ-Score
- 5% increase in SLQ can save $1M in accelerator time
- SLAC researchers tried various DM/ML methods
  - Good, but not great, SLQ performance
- We tried standard methods, got similar results
- We studied the SLQ metric
  - similar to probability calibration
  - tried bagged probabilistic decision trees (good on C-Section)
8. Motivation 2: Bagged Probabilistic Trees
- Draw N bootstrap samples of the data
- Train a tree on each sample => N trees
- Final prediction = average prediction of the N trees
- Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + 0.31) / 6 ≈ 0.26
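The recipe above is simple enough to sketch in a few lines. Below is a minimal illustration using scikit-learn; the synthetic dataset, default tree settings, and 100-tree count are placeholder assumptions, not the configuration used in the SLAC experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the SLAC problem.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
n_trees = 100
probs = np.zeros(len(X))

for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))        # draw a bootstrap sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    probs += tree.predict_proba(X)[:, 1]              # each tree's P(class = 1)

probs /= n_trees    # final prediction = average of the N trees
                    # (predicting on the training data here only for brevity)
```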
9. Motivation 2: Improves Calibration by an Order of Magnitude
[Figure: calibration of a single tree (poor calibration) vs. 100 bagged trees (excellent calibration)]
10. Motivation 2: Significantly Improves SLQ
[Figure: SLQ performance of 100 bagged trees vs. a single tree]
11. Motivation 2
- Can we automate this analysis of performance metrics so that it's easier to recognize which metrics are similar to each other?
12. Motivation 3
13. Motivation 3
14. Scary Stuff
- In an ideal world
  - Learn a model that predicts correct conditional probabilities (Bayes optimal)
  - Yields optimal performance on any reasonable metric
- In the real world
  - Finite data
  - 0/1 targets instead of conditional probabilities
  - Hard to learn this ideal model
  - Don't have good metrics for recognizing the ideal model
  - Ideal model isn't always needed
- In practice
  - Do learning using many different metrics: ACC, AUC, CXE, RMS, …
  - Each metric represents different tradeoffs
  - Because of this, usually important to optimize to the appropriate metric
15. Scary Stuff
16. Scary Stuff
17. In this work we compare nine commonly used performance metrics by applying data mining to the results of a massive empirical study
- Goals
  - Discover relationships between performance metrics
  - Are the metrics really that different?
  - If you optimize to metric X, do you also get good performance on metric Y?
  - If you need to do well on metric Y, which metric X should you optimize to?
  - Which metrics are more/less robust?
  - Design new, better metrics?
18. 10 Binary Classification Performance Metrics
- Threshold Metrics
  - Accuracy
  - F-Score
  - Lift
- Ordering/Ranking Metrics
  - ROC Area
  - Average Precision
  - Precision/Recall Break-Even Point
- Probability Metrics
  - Root-Mean-Squared-Error
  - Cross-Entropy
  - Probability Calibration
- SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
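As a concrete reference, here is a small sketch of how the SAR metric defined above might be computed with scikit-learn. It assumes the "Squared Error" term is the root-mean-squared error (RMS); the 0.5 threshold and the example arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, y_prob, threshold=0.5):
    rms = np.sqrt(np.mean((y_prob - y_true) ** 2))                    # probability metric
    acc = accuracy_score(y_true, (y_prob >= threshold).astype(int))   # threshold metric
    auc = roc_auc_score(y_true, y_prob)                               # ordering metric
    return ((1.0 - rms) + acc + auc) / 3.0

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.6])
print(sar(y_true, y_prob))
```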
19. Accuracy

                 Predicted 1     Predicted 0
    True 1       a (correct)     b (incorrect)
    True 0       c (incorrect)   d (correct)

threshold on f(x) determines which cases are predicted 1
accuracy = (a + d) / (a + b + c + d)
20. Lift
- not interested in accuracy on the entire dataset
- want accurate predictions for 5%, 10%, or 20% of the dataset
- don't care about the remaining 95%, 90%, 80%, respectively
- typical application: marketing
- how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)
21. Lift

                 Predicted 1     Predicted 0
    True 1       a               b
    True 0       c               d

threshold on f(x) determines which cases are predicted 1
lift = [a / (a + c)] / [(a + b) / (a + b + c + d)]
22. [Lift chart: lift = 3.5 if mailings are sent to 20% of the customers]
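A rough sketch of how lift at a fixed "mailing" fraction could be computed; the 20% fraction mirrors the example above, but the scores and labels below are made up.

```python
import numpy as np

def lift_at_fraction(y_true, y_prob, fraction=0.20):
    n = len(y_true)
    k = int(np.ceil(fraction * n))             # size of the predicted-true set
    top = np.argsort(-y_prob)[:k]              # the k highest-scoring cases
    precision_in_top = y_true[top].mean()      # positive rate among cases predicted true
    base_rate = y_true.mean()                  # positive rate of a random selection
    return precision_in_top / base_rate

# Made-up scores and labels; lift > 1 means better than random.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.4, 0.6, 0.2, 0.1])
print(lift_at_fraction(y_true, y_prob, 0.20))
```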
23. Precision/Recall, F, Break-Even Point
- F = (2 · Precision · Recall) / (Precision + Recall), the harmonic average of precision and recall
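For concreteness, a small sketch of computing F and locating the precision/recall break-even point by sweeping the threshold; the data and the simple nearest-gap search are illustrative assumptions, not the authors' code.

```python
import numpy as np

def precision_recall(y_true, y_prob, t):
    pred = y_prob >= t
    tp = (pred & (y_true == 1)).sum()
    precision = tp / max(pred.sum(), 1)      # fraction of predicted positives that are correct
    recall = tp / (y_true == 1).sum()        # fraction of true positives that are returned
    return precision, recall

def f_score(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0   # harmonic average

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.9, 0.6, 0.4, 0.7, 0.1, 0.3, 0.5])

# Break-even point: the threshold where precision and recall are closest.
best_t = min(np.unique(y_prob),
             key=lambda t: abs(np.subtract(*precision_recall(y_true, y_prob, t))))
p, r = precision_recall(y_true, y_prob, best_t)
print(f"BEP threshold={best_t}, precision={p:.2f}, recall={r:.2f}, F={f_score(p, r):.2f}")
```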
24. [Plot annotated with regions of better vs. worse performance]
25. Confusion matrix terminology
- True 1, Predicted 1: true positive (TP), "hits", P(pr=1 | tr=1)
- True 1, Predicted 0: false negative (FN), "misses", P(pr=0 | tr=1)
- True 0, Predicted 1: false positive (FP), "false alarms", P(pr=1 | tr=0)
- True 0, Predicted 0: true negative (TN), "correct rejections", P(pr=0 | tr=0)
26. ROC Plot and ROC Area
- Receiver Operating Characteristic
- Developed in WWII to statistically model false positive and false negative detections of radar operators
- Better statistical foundations than most other measures
- Standard measure in medicine and biology
- Becoming more popular in ML
- Sweep the threshold and plot
  - TPR vs. FPR
  - Sensitivity vs. 1 - Specificity
  - P(true | true) vs. P(true | false)
- Sensitivity = a / (a + b) = Recall = LIFT numerator
- 1 - Specificity = 1 - d / (c + d)
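The threshold sweep described above can be sketched directly. The helper names and toy data below are illustrative, and the area is computed with numpy's trapezoidal rule rather than any particular ROC library routine.

```python
import numpy as np

def roc_points(y_true, y_prob):
    pts = [(0.0, 0.0)]                              # start at the origin
    for t in np.unique(y_prob)[::-1]:               # sweep thresholds high -> low
        pred = y_prob >= t
        tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()   # sensitivity
        fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()   # 1 - specificity
        pts.append((fpr, tpr))
    return pts

def roc_area(pts):
    xs, ys = zip(*pts)
    return float(np.trapz(ys, xs))                  # trapezoidal area under TPR vs. FPR

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])
print(roc_area(roc_points(y_true, y_prob)))         # AUC for the toy data
```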
27. [ROC plot: the diagonal line is random prediction]
28. Calibration
- Good calibration
  - If 1000 x's have pred(x) = 0.2, 200 should be positive
29. Calibration
- Model can be accurate but poorly calibrated
  - good threshold with uncalibrated probabilities
- Model can have good ROC but be poorly calibrated
  - ROC is insensitive to scaling/stretching
  - only the ordering has to be correct, not the probabilities themselves
- Model can have very high variance, but be well calibrated
- Model can be stupid, but be well calibrated
- Calibration is a real oddball
30. Measuring Calibration
- Bucket method
  - In each bucket
    - measure the observed c-section rate
    - and the predicted c-section rate (average of the predicted probabilities)
  - if the observed rate is similar to the predicted rate => good calibration in that bucket
- Ten equal-width buckets with boundaries 0.0, 0.1, 0.2, ..., 1.0 and centers 0.05, 0.15, ..., 0.95
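A minimal sketch of the bucket method, assuming ten equal-width buckets as above; the labels and predictions are toy values, not the C-Section data.

```python
import numpy as np

def calibration_table(y_true, y_prob, n_buckets=10):
    # Assign each prediction to one of n_buckets equal-width buckets in [0, 1].
    bucket = np.minimum((y_prob * n_buckets).astype(int), n_buckets - 1)
    rows = []
    for b in range(n_buckets):
        mask = bucket == b
        if not mask.any():
            continue
        rows.append((b / n_buckets,            # bucket lower edge
                     (b + 1) / n_buckets,      # bucket upper edge
                     y_prob[mask].mean(),      # mean predicted probability
                     y_true[mask].mean()))     # observed positive rate
    return rows

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.15, 0.65, 0.72, 0.88, 0.30, 0.55, 0.25, 0.91, 0.60])
for lo, hi, pred, obs in calibration_table(y_true, y_prob):
    # Good calibration: predicted rate close to observed rate in every bucket.
    print(f"bucket [{lo:.1f}, {hi:.1f}): predicted={pred:.2f} observed={obs:.2f}")
```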
31. Calibration Plot
32. Experiments
33. Base-Level Learning Methods
- Decision trees
- K-nearest neighbor
- Neural nets
- SVMs
- Bagged Decision Trees
- Boosted Decision Trees
- Boosted Stumps
- Each optimizes different things
- Each best in different regimes
- Each algorithm has many variations and free parameters
- Generate about 2000 models on each test problem
34. Data Sets
- 7 binary classification data sets
- Adult
- Cover Type
- Letter.p1 (balanced)
- Letter.p2 (unbalanced)
- Pneumonia (University of Pittsburgh)
- Hyper Spectral (NASA Goddard Space Center)
- Particle Physics (Stanford Linear Accelerator)
- 4k train sets
- Large final test sets (usually 20k)
35. Massive Empirical Comparison
- 7 base-level learning methods
- x 100s of parameter settings per method
- = about 2000 models per problem
- x 7 test problems
- = 14,000 models
- x 10 performance metrics
- = 140,000 model performance evaluations
36. COVTYPE: Calibration vs. Accuracy
37. Multi Dimensional Scaling
38. Scaling, Ranking, and Normalizing
- Problem
  - some metrics: 1.00 is best (e.g. ACC)
  - some metrics: 0.00 is best (e.g. RMS)
  - some metrics: baseline is 0.50 (e.g. AUC)
  - some problems/metrics: 0.60 is excellent performance
  - some problems/metrics: 0.99 is poor performance
- Solution 1: Normalized Scores (see the sketch after this list)
  - baseline performance => 0.00
  - best observed performance => 1.00 (proxy for Bayes optimal)
  - puts all metrics on equal footing
- Solution 2: Scale by Standard Deviation
- Solution 3: Rank Correlation
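A tiny sketch of Solution 1 (normalized scores), assuming the simple linear rescaling described above; the baseline and best-observed values in the example are invented.

```python
def normalized_score(raw, baseline, best_observed):
    # Maps baseline -> 0.0 and the best observed score -> 1.0.
    # For metrics where smaller is better (e.g. RMS), baseline > best_observed,
    # and the same formula still yields 0.0 at baseline and 1.0 at the best score.
    return (raw - baseline) / (best_observed - baseline)

# Made-up example: baseline (majority-class) accuracy 0.75, best observed 0.91.
print(normalized_score(0.83, baseline=0.75, best_observed=0.91))   # ~0.5
```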
39. Multi Dimensional Scaling
- Find a low-dimensional embedding of the 10 x 14,000 data
- The 10 metrics span a 2-5 dimensional subspace
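A rough sketch of how such an embedding might be produced with scipy and scikit-learn, using 1 minus rank correlation as the distance between metrics. The random score matrix is a placeholder for the real 10 x 14,000 table, and this is not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# Placeholder for the real 10 x 14,000 table of metric scores (10 metrics, 14,000 models).
scores = np.random.rand(10, 14000)

corr, _ = spearmanr(scores, axis=1)                   # 10 x 10 rank-correlation matrix
dist = 1.0 - corr                                     # turn similarity into distance

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)                      # one 2-D point per metric
print(coords.shape)                                   # (10, 2)
```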
40. Multi Dimensional Scaling
- Look at 2-D MDS plots
- Scaled by standard deviation
- Normalized scores
- MDS of rank correlations
- MDS on each problem individually
- MDS averaged across all problems
41. 2-D Multi-Dimensional Scaling
42. 2-D Multi-Dimensional Scaling
[Figure: MDS plots using Normalized Scores scaling and Rank-Correlation distance]
43. [Figure: per-problem MDS plots for Adult, Covertype, Hyper-Spectral, Letter, Medis, and SLAC]
44. Correlation Analysis
- 2000 performances for each metric on each problem
- Correlation between all pairs of metrics
  - 10 metrics => 45 pairwise correlations
- Average of the correlations over the 7 test problems
- Standard correlation
- Rank correlation
- Present rank correlation here
45. Rank Correlations
- Correlation analysis is consistent with the MDS analysis
- Ordering metrics have high correlations to each other
- ACC, AUC, RMS have the best correlations of the metrics in each metric class
- RMS has good correlation to the other metrics
- SAR has the best correlation to the other metrics
46. Summary
- 10 metrics span a 2-5 dimensional subspace
- Consistent results across problems and scalings
- Ordering metrics cluster: AUC, APR, BEP
- CAL is far from the ordering metrics
- CAL is nearest to RMS/MXE
- RMS ≈ MXE, but RMS is much more centrally located
- Threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
- Lift behaves more like an ordering metric than a threshold metric
- Old friends ACC, AUC, and RMS are most representative
- New SAR metric is good, but not much better than RMS
47. New Resources
- Want to borrow 14,000 models?
  - margin analysis
  - comparison to new algorithm X
  - …
- PERF code: software that calculates 2 dozen performance metrics
  - Accuracy (at different thresholds)
  - ROC Area and ROC plots
  - Precision and Recall plots
  - Break-even point, F-score, Average Precision
  - Squared Error
  - Cross-Entropy
  - Lift
  - …
- Currently, most metrics are for boolean classification problems
- We are willing to add new metrics and new capabilities
- Available at http://www.cs.cornell.edu/caruana
48. Future Work
49. Future/Related Work
- Ensemble method that optimizes any metric (ICML'04)
- Getting good probabilities from Boosted Trees (AISTATS'05)
- Comparison of learning algorithms on metrics (ICML'06)
- First step in analyzing different performance metrics
- Develop new metrics with better properties
  - SAR is a good general-purpose metric
  - Does optimizing to SAR yield better models?
  - but RMS is nearly as good
  - attempts to make SAR better did not help much
- Extend to multi-class or hierarchical problems where evaluating performance is more difficult
50. Thank You.
51. Spooky Stuff in Metric Space
52. Which learning methods perform best on each metric?
53. Normalized Scores: Best Single Models
- SVM predictions transformed to posterior probabilities via Platt Scaling
- SVM and ANN tied for first place; Bagged Trees nearly as good
- Boosted Trees win 5 of 6 Threshold and Rank metrics, but yield lousy probabilities!
- Boosting weaker stumps does not compare to boosting full trees
- KNN and plain Decision Trees usually not competitive (with 4k train sets)
- Other interesting things: see the papers.
54. Platt Scaling
- SVM predictions are in (-inf, +inf)
- Probability metrics require [0, 1]
- Platt scaling transforms SVM predictions by fitting a sigmoid
- This gives SVMs good probability performance
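A minimal sketch of the idea, assuming a linear SVM and using a logistic regression on held-out decision values as a stand-in for Platt's exact sigmoid-fitting procedure; all data and settings below are illustrative. (scikit-learn's CalibratedClassifierCV with method="sigmoid" packages the same sigmoid calibration.)

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_cal, y_cal = X[1000:], y[1000:]                      # held-out data for calibration

svm = LinearSVC().fit(X_train, y_train)
f_cal = svm.decision_function(X_cal).reshape(-1, 1)    # raw SVM outputs in (-inf, +inf)

# Fit a sigmoid mapping decision values to probabilities (approximates Platt scaling).
platt = LogisticRegression().fit(f_cal, y_cal)
probs = platt.predict_proba(f_cal)[:, 1]               # calibrated probabilities in [0, 1]
print(probs[:5])
```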
55. Outline
- Motivation: The One True Model
- Ten Performance Metrics
- Experiments
- Multidimensional Scaling (MDS) Analysis
- Correlation Analysis
- Learning Algorithm vs. Metric
- Summary
56. Base-Level Learners
- Each optimizes different things
  - ANN: minimize squared error or cross-entropy (good for probs)
  - SVM, Boosting: optimize margin (good for accuracy, poor for probs)
  - DT: optimize info gain
  - KNN: ?
- Each is best in different regimes
  - SVM: high dimensional data
  - DT, KNN: large data sets
  - ANN: non-linear prediction from many correlated features
- Each algorithm has many variations and free parameters
  - SVM: margin parameter, kernel, kernel parameters (gamma, …)
  - ANN: hidden units, hidden layers, learning rate, early stopping point
  - DT: splitting criterion, pruning options, smoothing options, …
  - KNN: K, distance metric, distance-weighted averaging, …
- Generate about 2000 models on each test problem
57. Motivation
- Holy Grail of Supervised Learning
  - One True Model (a.k.a. Bayes Optimal Model)
  - Predicts the correct conditional probability for each case
  - Yields optimal performance on all reasonable metrics
  - Hard to learn given finite data
    - train sets rarely have conditional probs, usually just 0/1 targets
  - Isn't always necessary
- Many Different Performance Metrics
  - ACC, AUC, CXE, RMS, PRE/REC
  - Each represents different tradeoffs
  - Usually important to optimize to the appropriate metric
  - Not all metrics are created equal
58. Motivation
- In an ideal world
  - Learn a model that predicts correct conditional probabilities
  - Yields optimal performance on any reasonable metric
- In the real world
  - Finite data
  - 0/1 targets instead of conditional probabilities
  - Hard to learn this ideal model
  - Don't have good metrics for recognizing the ideal model
  - Ideal model isn't always necessary
- In practice
  - Do learning using many different metrics: ACC, AUC, CXE, RMS, …
  - Each metric represents different tradeoffs
  - Because of this, usually important to optimize to the appropriate metric
59. Accuracy
- Target: 0/1, -1/1, True/False, …
- Prediction = f(inputs) = f(x), either 0/1 or real-valued
- Threshold: f(x) > thresh => 1, else => 0
  - threshold(f(x)) in {0, 1}
- accuracy = right / total
- P(correct) = P(threshold(f(x)) = target)
60. Precision and Recall
- Typically used in document retrieval
- Precision
  - how many of the returned documents are correct
  - precision(threshold) = a / (a + c)
- Recall
  - how many of the positives does the model return
  - recall(threshold) = a / (a + b)
- Precision/Recall Curve: sweep thresholds
61. Precision/Recall

                 Predicted 1     Predicted 0
    True 1       a               b
    True 0       c               d

threshold on f(x) determines which cases are predicted 1
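In the a/b/c/d notation of this confusion matrix, precision and recall reduce to two one-line functions; the counts in the example are arbitrary.

```python
# a = true positives, b = false negatives, c = false positives, d = true negatives.
def precision(a, c):
    return a / (a + c)        # correct fraction of the cases predicted 1

def recall(a, b):
    return a / (a + b)        # fraction of all true positives that are predicted 1

print(precision(a=30, c=10), recall(a=30, b=20))   # arbitrary counts: 0.75 and 0.6
```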
62. (No transcript)