Title: Data Mining in Metric Space
1. Spooky Stuff in Metric Space
2. Spooky Stuff: Data Mining in Metric Space
- Rich Caruana
- Alex Niculescu
- Cornell University
3. Motivation 1
4. Motivation 1: Pneumonia Risk Prediction
5. Motivation 1: Many Learning Algorithms
- Neural nets
- Logistic regression
- Linear perceptron
- K-nearest neighbor
- Decision trees
- ILP (Inductive Logic Programming)
- SVMs (Support Vector Machines)
- Bagging
- Boosting
- Rule learners (CN2, …)
- Ripper
- Random Forests (forests of decision trees)
- Gaussian Processes
- Bayes Nets
- …
- No single learning method dominates the others
6. Motivation 2
7. Motivation 2: SLAC B/Bbar
- Particle accelerator generates B/Bbar particles
- Use machine learning to classify tracks as B or Bbar
- Domain-specific performance measure: SLQ-Score
- 5% increase in SLQ can save $1M in accelerator time
- SLAC researchers tried various DM/ML methods
  - Good, but not great, SLQ performance
- We tried standard methods, got similar results
- We studied the SLQ metric
  - similar to probability calibration
  - tried bagged probabilistic decision trees (good on C-Section)
8. Motivation 2: Bagged Probabilistic Trees
- Draw N bootstrap samples of the data
- Train a tree on each sample => N trees
- Final prediction = average prediction of the N trees
- Average prediction = (0.23 + 0.19 + 0.34 + 0.22 + 0.26 + 0.31) / 6 ≈ 0.26
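The recipe above is simple enough to sketch in a few lines. Below is a minimal illustration using scikit-learn; the synthetic dataset, default tree settings, and 100-tree count are placeholder assumptions, not the configuration used in the SLAC experiments.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Placeholder data standing in for the SLAC problem.
X, y = make_classification(n_samples=4000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
n_trees = 100
probs = np.zeros(len(X))

for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))        # draw a bootstrap sample
    tree = DecisionTreeClassifier().fit(X[idx], y[idx])
    probs += tree.predict_proba(X)[:, 1]              # each tree's P(class = 1)

probs /= n_trees    # final prediction = average of the N trees
                    # (predicting on the training data here only for brevity)
```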
9. Motivation 2: Improves Calibration by an Order of Magnitude
[Figure: calibration of a single tree (poor calibration) vs. 100 bagged trees (excellent calibration)]
10. Motivation 2: Significantly Improves SLQ
[Figure: SLQ performance of 100 bagged trees vs. a single tree]
11. Motivation 2
- Can we automate this analysis of performance metrics so that it's easier to recognize which metrics are similar to each other?
12. Motivation 3
13. Motivation 3
14. Scary Stuff
- In an ideal world
  - Learn a model that predicts correct conditional probabilities (Bayes optimal)
  - Yields optimal performance on any reasonable metric
- In the real world
  - Finite data
  - 0/1 targets instead of conditional probabilities
  - Hard to learn this ideal model
  - Don't have good metrics for recognizing the ideal model
  - Ideal model isn't always needed
- In practice
  - Do learning using many different metrics: ACC, AUC, CXE, RMS, …
  - Each metric represents different tradeoffs
  - Because of this, usually important to optimize to the appropriate metric
15. Scary Stuff
16. Scary Stuff
17. In this work we compare nine commonly used performance metrics by applying data mining to the results of a massive empirical study
- Goals
  - Discover relationships between performance metrics
  - Are the metrics really that different?
  - If you optimize to metric X, do you also get good performance on metric Y?
  - If you need to do well on metric Y, which metric X should you optimize to?
  - Which metrics are more/less robust?
  - Design new, better metrics?
18. 10 Binary Classification Performance Metrics
- Threshold Metrics
  - Accuracy
  - F-Score
  - Lift
- Ordering/Ranking Metrics
  - ROC Area
  - Average Precision
  - Precision/Recall Break-Even Point
- Probability Metrics
  - Root-Mean-Squared-Error
  - Cross-Entropy
  - Probability Calibration
- SAR = ((1 - Squared Error) + Accuracy + ROC Area) / 3
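As a concrete reference, here is a small sketch of how the SAR metric defined above might be computed with scikit-learn. It assumes the "Squared Error" term is the root-mean-squared error (RMS); the 0.5 threshold and the example arrays are illustrative only.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, y_prob, threshold=0.5):
    rms = np.sqrt(np.mean((y_prob - y_true) ** 2))                    # probability metric
    acc = accuracy_score(y_true, (y_prob >= threshold).astype(int))   # threshold metric
    auc = roc_auc_score(y_true, y_prob)                               # ordering metric
    return ((1.0 - rms) + acc + auc) / 3.0

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.6])
print(sar(y_true, y_prob))
```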
19. Accuracy

                 Predicted 1     Predicted 0
    True 1       a (correct)     b (incorrect)
    True 0       c (incorrect)   d (correct)

threshold on f(x) determines which cases are predicted 1
accuracy = (a + d) / (a + b + c + d)
20. Lift
- not interested in accuracy on the entire dataset
- want accurate predictions for 5%, 10%, or 20% of the dataset
- don't care about the remaining 95%, 90%, 80%, respectively
- typical application: marketing
- how much better than random prediction on the fraction of the dataset predicted true (f(x) > threshold)
21. Lift

                 Predicted 1     Predicted 0
    True 1       a               b
    True 0       c               d

threshold on f(x) determines which cases are predicted 1
lift = [a / (a + c)] / [(a + b) / (a + b + c + d)]
22. [Lift chart: lift = 3.5 if mailings are sent to 20% of the customers]
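A rough sketch of how lift at a fixed "mailing" fraction could be computed; the 20% fraction mirrors the example above, but the scores and labels below are made up.

```python
import numpy as np

def lift_at_fraction(y_true, y_prob, fraction=0.20):
    n = len(y_true)
    k = int(np.ceil(fraction * n))             # size of the predicted-true set
    top = np.argsort(-y_prob)[:k]              # the k highest-scoring cases
    precision_in_top = y_true[top].mean()      # positive rate among cases predicted true
    base_rate = y_true.mean()                  # positive rate of a random selection
    return precision_in_top / base_rate

# Made-up scores and labels; lift > 1 means better than random.
y_true = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
y_prob = np.array([0.9, 0.8, 0.1, 0.7, 0.2, 0.3, 0.4, 0.6, 0.2, 0.1])
print(lift_at_fraction(y_true, y_prob, 0.20))
```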
23. Precision/Recall, F, Break-Even Point
- F = (2 · Precision · Recall) / (Precision + Recall), the harmonic average of precision and recall
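For concreteness, a small sketch of computing F and locating the precision/recall break-even point by sweeping the threshold; the data and the simple nearest-gap search are illustrative assumptions, not the authors' code.

```python
import numpy as np

def precision_recall(y_true, y_prob, t):
    pred = y_prob >= t
    tp = (pred & (y_true == 1)).sum()
    precision = tp / max(pred.sum(), 1)      # fraction of predicted positives that are correct
    recall = tp / (y_true == 1).sum()        # fraction of true positives that are returned
    return precision, recall

def f_score(p, r):
    return 2 * p * r / (p + r) if (p + r) > 0 else 0.0   # harmonic average

y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_prob = np.array([0.2, 0.9, 0.6, 0.4, 0.7, 0.1, 0.3, 0.5])

# Break-even point: the threshold where precision and recall are closest.
best_t = min(np.unique(y_prob),
             key=lambda t: abs(np.subtract(*precision_recall(y_true, y_prob, t))))
p, r = precision_recall(y_true, y_prob, best_t)
print(f"BEP threshold={best_t}, precision={p:.2f}, recall={r:.2f}, F={f_score(p, r):.2f}")
```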
24. [Plot annotated with regions of better vs. worse performance]
25. Confusion matrix terminology
- True 1, Predicted 1: true positive (TP), "hits", P(pr=1 | tr=1)
- True 1, Predicted 0: false negative (FN), "misses", P(pr=0 | tr=1)
- True 0, Predicted 1: false positive (FP), "false alarms", P(pr=1 | tr=0)
- True 0, Predicted 0: true negative (TN), "correct rejections", P(pr=0 | tr=0)
26. ROC Plot and ROC Area
- Receiver Operating Characteristic
- Developed in WWII to statistically model false positive and false negative detections of radar operators
- Better statistical foundations than most other measures
- Standard measure in medicine and biology
- Becoming more popular in ML
- Sweep the threshold and plot
  - TPR vs. FPR
  - Sensitivity vs. 1 - Specificity
  - P(true | true) vs. P(true | false)
- Sensitivity = a / (a + b) = Recall = LIFT numerator
- 1 - Specificity = 1 - d / (c + d)
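The threshold sweep described above can be sketched directly. The helper names and toy data below are illustrative, and the area is computed with numpy's trapezoidal rule rather than any particular ROC library routine.

```python
import numpy as np

def roc_points(y_true, y_prob):
    pts = [(0.0, 0.0)]                              # start at the origin
    for t in np.unique(y_prob)[::-1]:               # sweep thresholds high -> low
        pred = y_prob >= t
        tpr = (pred & (y_true == 1)).sum() / (y_true == 1).sum()   # sensitivity
        fpr = (pred & (y_true == 0)).sum() / (y_true == 0).sum()   # 1 - specificity
        pts.append((fpr, tpr))
    return pts

def roc_area(pts):
    xs, ys = zip(*pts)
    return float(np.trapz(ys, xs))                  # trapezoidal area under TPR vs. FPR

y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.6, 0.9])
print(roc_area(roc_points(y_true, y_prob)))         # AUC for the toy data
```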
27. [ROC plot: the diagonal line is random prediction]
28. Calibration
- Good calibration
  - If 1000 x's have pred(x) = 0.2, 200 should be positive
29. Calibration
- Model can be accurate but poorly calibrated
  - good threshold with uncalibrated probabilities
- Model can have good ROC but be poorly calibrated
  - ROC is insensitive to scaling/stretching
  - only the ordering has to be correct, not the probabilities themselves
- Model can have very high variance, but be well calibrated
- Model can be stupid, but be well calibrated
- Calibration is a real oddball
30. Measuring Calibration
- Bucket method
  - In each bucket
    - measure the observed c-section rate
    - and the predicted c-section rate (average of the predicted probabilities)
  - if the observed rate is similar to the predicted rate => good calibration in that bucket
- Ten equal-width buckets with boundaries 0.0, 0.1, 0.2, ..., 1.0 and centers 0.05, 0.15, ..., 0.95
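A minimal sketch of the bucket method, assuming ten equal-width buckets as above; the labels and predictions are toy values, not the C-Section data.

```python
import numpy as np

def calibration_table(y_true, y_prob, n_buckets=10):
    # Assign each prediction to one of n_buckets equal-width buckets in [0, 1].
    bucket = np.minimum((y_prob * n_buckets).astype(int), n_buckets - 1)
    rows = []
    for b in range(n_buckets):
        mask = bucket == b
        if not mask.any():
            continue
        rows.append((b / n_buckets,            # bucket lower edge
                     (b + 1) / n_buckets,      # bucket upper edge
                     y_prob[mask].mean(),      # mean predicted probability
                     y_true[mask].mean()))     # observed positive rate
    return rows

y_true = np.array([0, 0, 1, 1, 1, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.15, 0.65, 0.72, 0.88, 0.30, 0.55, 0.25, 0.91, 0.60])
for lo, hi, pred, obs in calibration_table(y_true, y_prob):
    # Good calibration: predicted rate close to observed rate in every bucket.
    print(f"bucket [{lo:.1f}, {hi:.1f}): predicted={pred:.2f} observed={obs:.2f}")
```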
31. Calibration Plot
32. Experiments
33. Base-Level Learning Methods
- Decision trees
- K-nearest neighbor
- Neural nets
- SVMs
- Bagged Decision Trees
- Boosted Decision Trees
- Boosted Stumps
- Each optimizes different things
- Each best in different regimes
- Each algorithm has many variations and free parameters
- Generate about 2000 models on each test problem
34. Data Sets
- 7 binary classification data sets
- Adult
- Cover Type
- Letter.p1 (balanced)
- Letter.p2 (unbalanced)
- Pneumonia (University of Pittsburgh)
- Hyper Spectral (NASA Goddard Space Center)
- Particle Physics (Stanford Linear Accelerator)
- 4k train sets
- Large final test sets (usually 20k)
35. Massive Empirical Comparison
- 7 base-level learning methods
- x 100s of parameter settings per method
- = about 2000 models per problem
- x 7 test problems
- = 14,000 models
- x 10 performance metrics
- = 140,000 model performance evaluations
36. COVTYPE: Calibration vs. Accuracy
37. Multi Dimensional Scaling
38. Scaling, Ranking, and Normalizing
- Problem
  - some metrics: 1.00 is best (e.g. ACC)
  - some metrics: 0.00 is best (e.g. RMS)
  - some metrics: baseline is 0.50 (e.g. AUC)
  - some problems/metrics: 0.60 is excellent performance
  - some problems/metrics: 0.99 is poor performance
- Solution 1: Normalized Scores (see the sketch after this list)
  - baseline performance => 0.00
  - best observed performance => 1.00 (proxy for Bayes optimal)
  - puts all metrics on equal footing
- Solution 2: Scale by Standard Deviation
- Solution 3: Rank Correlation
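A tiny sketch of Solution 1 (normalized scores), assuming the simple linear rescaling described above; the baseline and best-observed values in the example are invented.

```python
def normalized_score(raw, baseline, best_observed):
    # Maps baseline -> 0.0 and the best observed score -> 1.0.
    # For metrics where smaller is better (e.g. RMS), baseline > best_observed,
    # and the same formula still yields 0.0 at baseline and 1.0 at the best score.
    return (raw - baseline) / (best_observed - baseline)

# Made-up example: baseline (majority-class) accuracy 0.75, best observed 0.91.
print(normalized_score(0.83, baseline=0.75, best_observed=0.91))   # ~0.5
```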
39. Multi Dimensional Scaling
- Find a low-dimensional embedding of the 10 x 14,000 data
- The 10 metrics span a 2-5 dimensional subspace
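A rough sketch of how such an embedding might be produced with scipy and scikit-learn, using 1 minus rank correlation as the distance between metrics. The random score matrix is a placeholder for the real 10 x 14,000 table, and this is not the authors' exact procedure.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.manifold import MDS

# Placeholder for the real 10 x 14,000 table of metric scores (10 metrics, 14,000 models).
scores = np.random.rand(10, 14000)

corr, _ = spearmanr(scores, axis=1)                   # 10 x 10 rank-correlation matrix
dist = 1.0 - corr                                     # turn similarity into distance

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dist)                      # one 2-D point per metric
print(coords.shape)                                   # (10, 2)
```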
40. Multi Dimensional Scaling
- Look at 2-D MDS plots
- Scaled by standard deviation
- Normalized scores
- MDS of rank correlations
- MDS on each problem individually
- MDS averaged across all problems
41. 2-D Multi-Dimensional Scaling
42. 2-D Multi-Dimensional Scaling
[Figure: MDS plots using Normalized Scores scaling and Rank-Correlation distance]
43. [Figure: per-problem MDS plots for Adult, Covertype, Hyper-Spectral, Letter, Medis, and SLAC]
44. Correlation Analysis
- 2000 performances for each metric on each problem
- Correlation between all pairs of metrics
  - 10 metrics => 45 pairwise correlations
- Average of the correlations over the 7 test problems
- Standard correlation
- Rank correlation
- Present rank correlation here
45. Rank Correlations
- Correlation analysis is consistent with the MDS analysis
- Ordering metrics have high correlations to each other
- ACC, AUC, RMS have the best correlations of the metrics in each metric class
- RMS has good correlation to the other metrics
- SAR has the best correlation to the other metrics
46. Summary
- 10 metrics span a 2-5 dimensional subspace
- Consistent results across problems and scalings
- Ordering metrics cluster: AUC, APR, BEP
- CAL is far from the ordering metrics
- CAL is nearest to RMS/MXE
- RMS ≈ MXE, but RMS is much more centrally located
- Threshold metrics ACC and FSC do not cluster as tightly as the ordering metrics and RMS/MXE
- Lift behaves more like an ordering metric than a threshold metric
- Old friends ACC, AUC, and RMS are most representative
- New SAR metric is good, but not much better than RMS
47. New Resources
- Want to borrow 14,000 models?
  - margin analysis
  - comparison to new algorithm X
  - …
- PERF code: software that calculates 2 dozen performance metrics
  - Accuracy (at different thresholds)
  - ROC Area and ROC plots
  - Precision and Recall plots
  - Break-even point, F-score, Average Precision
  - Squared Error
  - Cross-Entropy
  - Lift
  - …
- Currently, most metrics are for boolean classification problems
- We are willing to add new metrics and new capabilities
- Available at http://www.cs.cornell.edu/caruana
48. Future Work
49. Future/Related Work
- Ensemble method that optimizes any metric (ICML'04)
- Getting good probabilities from Boosted Trees (AISTATS'05)
- Comparison of learning algorithms on metrics (ICML'06)
- First step in analyzing different performance metrics
- Develop new metrics with better properties
  - SAR is a good general-purpose metric
  - Does optimizing to SAR yield better models?
  - but RMS is nearly as good
  - attempts to make SAR better did not help much
- Extend to multi-class or hierarchical problems where evaluating performance is more difficult
50. Thank You.
51. Spooky Stuff in Metric Space
52. Which learning methods perform best on each metric?
53. Normalized Scores: Best Single Models
- SVM predictions transformed to posterior probabilities via Platt Scaling
- SVM and ANN tied for first place; Bagged Trees nearly as good
- Boosted Trees win 5 of 6 Threshold and Rank metrics, but yield lousy probabilities!
- Boosting weaker stumps does not compare to boosting full trees
- KNN and plain Decision Trees usually not competitive (with 4k train sets)
- Other interesting things: see the papers.
54. Platt Scaling
- SVM predictions are in (-inf, +inf)
- Probability metrics require [0, 1]
- Platt scaling transforms SVM predictions by fitting a sigmoid
- This gives SVMs good probability performance
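A minimal sketch of the idea, assuming a linear SVM and using a logistic regression on held-out decision values as a stand-in for Platt's exact sigmoid-fitting procedure; all data and settings below are illustrative. (scikit-learn's CalibratedClassifierCV with method="sigmoid" packages the same sigmoid calibration.)

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, y_train = X[:1000], y[:1000]
X_cal, y_cal = X[1000:], y[1000:]                      # held-out data for calibration

svm = LinearSVC().fit(X_train, y_train)
f_cal = svm.decision_function(X_cal).reshape(-1, 1)    # raw SVM outputs in (-inf, +inf)

# Fit a sigmoid mapping decision values to probabilities (approximates Platt scaling).
platt = LogisticRegression().fit(f_cal, y_cal)
probs = platt.predict_proba(f_cal)[:, 1]               # calibrated probabilities in [0, 1]
print(probs[:5])
```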
55. Outline
- Motivation: The One True Model
- Ten Performance Metrics
- Experiments
- Multidimensional Scaling (MDS) Analysis
- Correlation Analysis
- Learning Algorithm vs. Metric
- Summary
56. Base-Level Learners
- Each optimizes different things
  - ANN: minimize squared error or cross-entropy (good for probs)
  - SVM, Boosting: optimize margin (good for accuracy, poor for probs)
  - DT: optimize info gain
  - KNN: ?
- Each is best in different regimes
  - SVM: high dimensional data
  - DT, KNN: large data sets
  - ANN: non-linear prediction from many correlated features
- Each algorithm has many variations and free parameters
  - SVM: margin parameter, kernel, kernel parameters (gamma, …)
  - ANN: hidden units, hidden layers, learning rate, early stopping point
  - DT: splitting criterion, pruning options, smoothing options, …
  - KNN: K, distance metric, distance-weighted averaging, …
- Generate about 2000 models on each test problem
57. Motivation
- Holy Grail of Supervised Learning
  - One True Model (a.k.a. Bayes Optimal Model)
  - Predicts the correct conditional probability for each case
  - Yields optimal performance on all reasonable metrics
  - Hard to learn given finite data
    - train sets rarely have conditional probs, usually just 0/1 targets
  - Isn't always necessary
- Many Different Performance Metrics
  - ACC, AUC, CXE, RMS, PRE/REC
  - Each represents different tradeoffs
  - Usually important to optimize to the appropriate metric
  - Not all metrics are created equal
58. Motivation
- In an ideal world
  - Learn a model that predicts correct conditional probabilities
  - Yields optimal performance on any reasonable metric
- In the real world
  - Finite data
  - 0/1 targets instead of conditional probabilities
  - Hard to learn this ideal model
  - Don't have good metrics for recognizing the ideal model
  - Ideal model isn't always necessary
- In practice
  - Do learning using many different metrics: ACC, AUC, CXE, RMS, …
  - Each metric represents different tradeoffs
  - Because of this, usually important to optimize to the appropriate metric
59. Accuracy
- Target: 0/1, -1/1, True/False, …
- Prediction = f(inputs) = f(x), either 0/1 or real-valued
- Threshold: f(x) > thresh => 1, else => 0
  - threshold(f(x)) in {0, 1}
- accuracy = right / total
- P(correct) = P(threshold(f(x)) = target)
60. Precision and Recall
- Typically used in document retrieval
- Precision
  - how many of the returned documents are correct
  - precision(threshold) = a / (a + c)
- Recall
  - how many of the positives does the model return
  - recall(threshold) = a / (a + b)
- Precision/Recall Curve: sweep thresholds
61. Precision/Recall

                 Predicted 1     Predicted 0
    True 1       a               b
    True 0       c               d

threshold on f(x) determines which cases are predicted 1
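In the a/b/c/d notation of this confusion matrix, precision and recall reduce to two one-line functions; the counts in the example are arbitrary.

```python
# a = true positives, b = false negatives, c = false positives, d = true negatives.
def precision(a, c):
    return a / (a + c)        # correct fraction of the cases predicted 1

def recall(a, b):
    return a / (a + b)        # fraction of all true positives that are predicted 1

print(precision(a=30, c=10), recall(a=30, b=20))   # arbitrary counts: 0.75 and 0.6
```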
62. (No transcript)