Title: Revising our Evaluation Practices in Machine Learning, by Nathalie Japkowicz
1 Revising our Evaluation Practices in Machine Learning
Nathalie Japkowicz
- School of Computer Science and Software Engineering, Monash University (Visiting Professor)
- School of Information Technology and Engineering, University of Ottawa (Home Institution)
2 Observations
- The way in which evaluation is conducted in Machine Learning / Data Mining has not been a primary concern in the community.
- This is very different from the way evaluation is approached in other applied fields such as Economics, Psychology, and Sociology.
- In those fields, researchers have been more concerned with the meaning and validity of their results than we have been in ours.
3 The Problem
- The objective value of our advances in Machine Learning may be different from what we believe it is.
- Our conclusions may be flawed or meaningless.
- ML methods may get undue credit, or may not get sufficiently recognized.
- The field may start stagnating.
- Practitioners in other fields or potential business partners may dismiss our approaches and results.
- We hope that, with better evaluation practices, we can help the field of machine learning focus on more effective research and encourage more cross-discipline or cross-purpose exchanges.
4 Organization of the Talk
- I. A review of the shortcomings of current evaluation methods
  - Problems with Performance Evaluation
  - Problems with Confidence Estimation
- II. Borrowing performance evaluation measures and confidence tests from other disciplines (with Marina Sokolova, Stan Szpakowicz, William Elazmeh and Stan Matwin)
- III. Constructing new evaluation measures (with William Elazmeh and Stan Matwin)
5 Part I
- A review of the shortcomings of current evaluation methods
  - Problems with Performance Evaluation
  - Problems with Confidence Estimation
- Presented at the AAAI-2006 Workshop on Evaluation Methods for Machine Learning
6 Recommended Steps for Proper Evaluation
- Identify the interesting properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.
(Suggested by William Elazmeh)
7 Commonly Followed Steps of Evaluation
- Identify the interesting properties of the classifier.
- Choose an evaluation metric accordingly.
- Choose a confidence estimation method.
- Check that all the assumptions made by the evaluation metric and the confidence estimator are verified.
- Run the evaluation method with the chosen metric and confidence estimator, and analyze the results.
- Interpret the results with respect to the domain.
These steps are typically considered, but only very lightly.
8 Overview
- What happens when bad choices of performance evaluation metrics are made? (Steps 1 and 2 are considered too lightly.) E.g.:
  - Accuracy
  - Precision/Recall
  - ROC Analysis
- Note: each metric solves the problem of the previous one, but introduces new shortcomings (usually caught by the previous metrics).
- What happens when bad choices of confidence estimators are made and the assumptions underlying these estimators are not respected? (Step 3 is considered too lightly and Step 4 is disregarded.) E.g.:
  - The t-test
9 A Short Review I: Confusion Matrix / Common Performance Evaluation Metrics
- Accuracy = (TP + TN) / (P + N)
- Precision = TP / (TP + FP)
- Recall (TP rate) = TP / P
- FP rate = FP / N
- ROC analysis moves the threshold between the positive and negative class from a small FP rate to a large one. It plots the value of the Recall against that of the FP rate at each FP rate considered.
A Confusion Matrix
(A short code sketch of these metrics follows.)
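To make the definitions above concrete, here is a minimal Python sketch (the function name basic_metrics and the example counts are hypothetical, not from the talk) that computes the four metrics from the cells of a confusion matrix:

```python
# Minimal sketch: the four confusion-matrix cells are TP, FP, FN, TN.
# P = TP + FN (all positive examples), N = FP + TN (all negative examples).

def basic_metrics(tp, fp, fn, tn):
    """Return the metrics of slide 9 as a dictionary."""
    p, n = tp + fn, fp + tn
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,          # also called TP rate or sensitivity
        "fp_rate":   fp / n,
    }

# Hypothetical example: 200 positives, 800 negatives.
print(basic_metrics(tp=120, fp=60, fn=80, tn=740))
```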
10 A Short Review II: Confidence Estimation / The t-Test
- The most commonly used approach to confidence estimation in Machine Learning is:
  - To run the algorithm using 10-fold cross-validation and to record the accuracy at each fold.
  - To compute a confidence interval around the average of the difference between these reported accuracies and a given gold standard, using the t-test, i.e., the following formula:
  - d_bar +/- t_{N,9} * s_d, where
    - d_bar is the average difference between the reported accuracy and the given gold standard,
    - t_{N,9} is a constant chosen according to the degree of confidence desired (the Student's t value with 9 degrees of freedom),
    - s_d = sqrt( (1/90) * sum_{i=1..10} (d_i - d_bar)^2 ), where d_i is the difference between the reported accuracy and the given gold standard at fold i.
(A short code sketch of this interval follows.)
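A minimal sketch of the interval described above, assuming SciPy is available; the fold accuracies and the 0.80 gold standard are hypothetical:

```python
import math
from scipy import stats

def cv_t_interval(accuracies, gold_standard, confidence=0.95):
    """t-interval around the mean difference between 10-fold CV accuracies
    and a fixed gold-standard accuracy (slide 10)."""
    diffs = [a - gold_standard for a in accuracies]
    k = len(diffs)                       # 10 folds -> 9 degrees of freedom
    d_bar = sum(diffs) / k
    # s_d = sqrt( 1/(k*(k-1)) * sum (d_i - d_bar)^2 ), i.e. sqrt(1/90 * ...) for k = 10
    s_d = math.sqrt(sum((d - d_bar) ** 2 for d in diffs) / (k * (k - 1)))
    t = stats.t.ppf(1 - (1 - confidence) / 2, df=k - 1)
    return d_bar - t * s_d, d_bar + t * s_d

# Hypothetical fold accuracies compared against a 0.80 gold standard.
folds = [0.82, 0.79, 0.85, 0.81, 0.78, 0.84, 0.80, 0.83, 0.79, 0.82]
print(cv_t_interval(folds, gold_standard=0.80))
```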
11 What's wrong with Accuracy?
- Both classifiers obtain 60% accuracy.
- Yet they exhibit very different behaviours:
  - On the left: weak positive recognition rate / strong negative recognition rate.
  - On the right: strong positive recognition rate / weak negative recognition rate.
12 What's wrong with Precision/Recall?
- Both classifiers obtain the same precision and recall values of 66.7% and 40%.
- Yet they exhibit very different behaviours:
  - Same positive recognition rate.
  - Extremely different negative recognition rates: strong on the left / nil on the right.
- Note: Accuracy has no problem catching this!
13 What's wrong with ROC Analysis? (We consider single points in ROC space, not the entire ROC curve.)
- ROC analysis and Precision yield contradictory results:
- In terms of ROC analysis, the classifier on the right is a significantly better choice than the one on the left: the point representing the right classifier is on the same vertical line but 22.25% higher than the point representing the left classifier.
- Yet, the classifier on the right has ridiculously low precision (33.3%) while the classifier on the left has excellent precision (95.24%).
14 What's wrong with the t-test?
- Classifiers 1 and 2 yield the same mean and confidence interval.
- Yet, Classifier 1 is relatively stable, while Classifier 2 is not.
- Problem: the t-test assumes a normal distribution. The difference in accuracy between Classifier 2 and the gold standard is not normally distributed.
15 Part I: Discussion
- There is nothing intrinsically wrong with any of the performance evaluation measures or confidence tests discussed. It is all a matter of thinking about which one to use when, and what the results mean (both in terms of added value and limitations).
- A simple conceptualization of the problem with current evaluation practices: evaluation metrics and confidence measures summarize the results, so ML practitioners must understand the terms of these summarizations and verify that their assumptions hold.
- In certain cases, however, it is necessary to look further and, eventually, borrow practices from other disciplines. In yet other cases, it pays to devise our own methods. Both situations are discussed in what follows.
16 Part II a
- Borrowing new performance evaluation measures from the medical diagnostic community
- (Marina Sokolova, Nathalie Japkowicz and Stan Szpakowicz)
- To be presented at the Australian AI 2006 Conference
17 The need to borrow new performance measures: an example
- It has come to our attention that the performance measures commonly used in Machine Learning are not very good at assessing performance on problems in which the two classes are equally important.
- Accuracy focuses on both classes, but it does not distinguish between the two classes.
- Other measures, such as Precision/Recall, F-score and ROC analysis, focus on only one class, without concerning themselves with performance on the other class.
18 Learning Problems in which the classes are equally important
- Examples of recent Machine Learning domains that require equal focus on both classes and a distinction between false positive and false negative rates are:
  - opinion/sentiment identification
  - classification of negotiations
- An example of a traditional problem that requires equal focus on both classes and a distinction between false positive and false negative rates is:
  - medical diagnostic tests
- What measures have researchers in the medical diagnostic test community used that we can borrow?
19 Performance Measures in use in the Medical Diagnostic Community
- Common performance measures in use in the medical diagnostic community are:
  - Sensitivity/Specificity (also in use in Machine Learning)
  - Likelihood ratios
  - Youden's Index
  - Discriminant Power
- [Biggerstaff, 2000; Blakeley & Oddone, 1995]
20 Sensitivity/Specificity
- The sensitivity of a diagnostic test is P(T+|D+), i.e., the probability of obtaining a positive test result in the diseased population.
- The specificity of a diagnostic test is P(T-|D-), i.e., the probability of obtaining a negative test result in the disease-free population.
- Sensitivity and specificity are not that useful, however, since what one is really interested in, both in the medical testing community and in Machine Learning, is P(D+|T+) (PVP, the Predictive Value of a Positive) and P(D-|T-) (PVN, the Predictive Value of a Negative). -> We can apply Bayes' Theorem to derive the PVP and PVN.
21 Deriving the PVPs and PVNs
- The problem with deriving the PVP and PVN of a test is that, in order to derive them, we need to know P(D+), the pre-test probability of the disease. This cannot be done directly.
- As usual, however, we can set ourselves in the context of the comparison of two tests (with P(D+) being the same in both cases).
- Doing so, and using Bayes' Theorem (a short code sketch follows this slide):
  - P(D+|T+) = P(T+|D+) P(D+) / ( P(T+|D+) P(D+) + P(T+|D-) P(D-) )
- we can get the following relationships (see [Biggerstaff, 2000]):
  - P(D+|Y+) > P(D+|X+)  <=>  lambda+_Y > lambda+_X
  - P(D-|Y-) > P(D-|X-)  <=>  lambda-_Y < lambda-_X
- where X and Y are two diagnostic tests, and X+ and X- stand for test X confirming the presence of the disease and confirming the absence of the disease, respectively (and similarly for Y+ and Y-).
- lambda+ and lambda- are the likelihood ratios that are defined on the next slide.
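For concreteness, here is a minimal sketch of the Bayes computation above, under the assumption that the prevalence P(D+) is known (which, as the slide notes, it usually is not); the function name and numbers are hypothetical:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PVP = P(D+|T+) and PVN = P(D-|T-) via Bayes' theorem (slide 21).
    prevalence is the pre-test probability P(D+), assumed known here."""
    p_pos = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    p_neg = (1 - sensitivity) * prevalence + specificity * (1 - prevalence)
    pvp = sensitivity * prevalence / p_pos
    pvn = specificity * (1 - prevalence) / p_neg
    return pvp, pvn

# Hypothetical test: sensitivity 0.90, specificity 0.95, prevalence 0.02.
print(predictive_values(0.90, 0.95, 0.02))
```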
22 Likelihood Ratios
- lambda+ and lambda- are actually easy to derive.
- The likelihood ratio of a positive test is lambda+ = P(T+|D+) / P(T+|D-), i.e., the ratio of the true positive rate to the false positive rate.
- The likelihood ratio of a negative test is lambda- = P(T-|D+) / P(T-|D-), i.e., the ratio of the false negative rate to the true negative rate.
- Note: we want to maximize lambda+ and minimize lambda-.
- This means that, even though we cannot calculate the PVP and PVN directly, we can get the information we need to compare two tests through the likelihood ratios (see the sketch below).
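A minimal sketch of the two likelihood ratios, computed from hypothetical confusion-matrix counts (the function name is not from the talk):

```python
def likelihood_ratios(tp, fp, fn, tn):
    """lambda+ = P(T+|D+) / P(T+|D-)  (true positive rate over false positive rate)
       lambda- = P(T-|D+) / P(T-|D-)  (false negative rate over true negative rate)"""
    sensitivity = tp / (tp + fn)          # P(T+|D+)
    specificity = tn / (tn + fp)          # P(T-|D-)
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Hypothetical counts; we prefer the test with larger lambda+ and smaller lambda-.
print(likelihood_ratios(tp=120, fp=60, fn=80, tn=740))
```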
23 Youden's Index and Discriminant Power
- Youden's Index measures the avoidance of failure of an algorithm, while Discriminant Power evaluates how well an algorithm distinguishes between positive and negative examples.
- Youden's Index:
  - gamma = sensitivity - (1 - specificity)
  -       = P(T+|D+) - (1 - P(T-|D-))
- Discriminant Power:
  - DP = (sqrt(3)/pi) * (log X + log Y),
  - where X = sensitivity/(1 - sensitivity) and Y = specificity/(1 - specificity)
(A short code sketch of both measures follows.)
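A minimal sketch of both measures, with hypothetical sensitivity and specificity values:

```python
import math

def youden_index(sensitivity, specificity):
    """Youden's index: gamma = sensitivity - (1 - specificity)  (slide 23)."""
    return sensitivity - (1 - specificity)

def discriminant_power(sensitivity, specificity):
    """DP = sqrt(3)/pi * (log X + log Y), with X = sens/(1 - sens) and
    Y = spec/(1 - spec); a DP below 3 is read as insignificant (slide 24)."""
    x = sensitivity / (1 - sensitivity)
    y = specificity / (1 - specificity)
    return math.sqrt(3) / math.pi * (math.log(x) + math.log(y))

# Hypothetical values for illustration.
print(youden_index(0.90, 0.95), discriminant_power(0.90, 0.95))
```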
24 Comparison of the various measures on the outcome of e-negotiation
- [Table of results]
- DP is below 3 -> insignificant
25 What does this all mean? Traditional ML measures
26 What does this all mean? New measures that are more appropriate for problems where both classes are equally important
27 Part II a: Discussion
- The variety of results obtained with the different measures suggests two conclusions:
  - It is very important for practitioners of Machine Learning to understand their domain deeply, to understand what it is, exactly, that they want to evaluate, and to reach that goal using appropriate measures (existing or new ones).
  - Since some of the results are very close to each other, it is important to establish reliable confidence tests to find out whether or not these results are significant.
28 Part II b
- Borrowing new confidence tests from the medical diagnostic community
- (William Elazmeh, Nathalie Japkowicz and Stan Matwin)
- Presented at the ICML-2006 Workshop on ROC Analysis
29 The need to borrow new confidence tests: an example
- The class imbalance problem is very pervasive in practical applications of machine learning.
- In such cases, it is recommended to use ROC analysis, which is believed not to be sensitive to class imbalances.
- Recently, a confidence test was proposed for ROC curves: the ROC bands of Macskassy & Provost (ICML-2005).
- These ROC bands are based on sampling methods, which become useless in the case of severe class imbalance and when the data distribution is unknown. -> To compensate for that, we will consider Tango's test.
30 Tango's Test
- Rather than focusing on the true positive (a) and true negative (d) rates, Tango's test focuses on the false negative (b) and false positive (c) rates.
- Its null hypothesis is H0: b - c = 0, i.e., the normalized difference delta = (b - c)/n is zero.
- Testing this null hypothesis allows the statistical test not to be influenced by class imbalance in favour of the majority class (see further).
- Conversely, tests based on a and d rather than b and c do break down in the case of severe imbalance.
(A rough code sketch of the corresponding score statistic follows.)
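The talk does not spell out Tango's formulas, so the following is only a rough sketch, assuming SciPy is available: at delta_0 = 0 the score statistic reduces to a McNemar-type form; the full Tango (1998) confidence interval, which inverts the statistic over candidate values of delta using a constrained maximum-likelihood estimate, is omitted here.

```python
import math
from scipy import stats

def tango_score_test_at_zero(b, c, n):
    """Score test of H0: delta = (b - c)/n = 0 for the b (false negative)
    and c (false positive) cells of a confusion matrix over n examples.
    At delta_0 = 0 the score statistic reduces to z = (b - c) / sqrt(b + c)."""
    delta_hat = (b - c) / n
    z = (b - c) / math.sqrt(b + c)
    p_value = 2 * (1 - stats.norm.cdf(abs(z)))
    return delta_hat, z, p_value

# Hypothetical counts from one point on a ROC curve.
print(tango_score_test_at_zero(b=30, c=12, n=1000))
```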
31 Example of a standard test's breakdown: ROC bands on severely imbalanced data sets
- The bands are wide and not very useful. This was true of all the data sets we experimented with.
32 Examples of Tango's performance on severely imbalanced data sets
- The dark segments show that Tango's test detected confident segments in most curves.
33 Part II b: Discussion
- The evaluation results we obtain always need to be corroborated using confidence measures. Failure to do so may yield false conclusions.
- When necessary, it is useful to borrow tests and measures from other disciplines, as we did with the likelihood ratios, etc., and Tango's test.
- Sometimes, however, these borrowed tests are not sufficient. In our case, for example, we may want to visualize several measures and confidence test results simultaneously.
- It may be necessary to construct new evaluation methods in order to reach our goal.
- This will be the topic of the next and last section.
34 Part III
- Constructing new evaluation measures
- (William Elazmeh, Nathalie Japkowicz and Stan Matwin)
- To be presented at ECML-2006
35 Motivation for our new evaluation method
- ROC analysis alone and its associated AUC measure do not assess the performance of classifiers adequately, since they omit any information regarding the confidence of these estimates.
- Though the identification of the significant portion of ROC curves is an important step towards generating a more useful assessment, this analysis remains biased in favour of the large class in the case of severe imbalances.
- We would like to combine the information provided by the ROC curve with information regarding how balanced the classifier is with regard to the misclassification of positive and negative examples.
36 ROC's bias in the case of severe class imbalances
- The ROC curve for the positive class plots the true positive rate a/(a+b) against the false positive rate c/(c+d).
- When the number of positive examples is significantly lower than the number of negative examples (a+b << c+d), then, as we change the class probability threshold, a/(a+b) climbs faster than c/(c+d).
- ROC gives the majority class (-) an unfair advantage.
- Ideally, a classifier should classify both classes proportionally.
Confusion Matrix
37 Correcting for ROC's bias in the case of severe class imbalances
- Though we keep ROC as a performance evaluation measure, since rate information is useful, we propose, for confidence estimation, to favour classifiers that make a similar number of errors in both classes.
- More specifically, as in Tango's test, we favour classifiers that have a lower difference in classification errors between the two classes, (b - c)/n.
- This quantity (b - c)/n is interesting not just for confidence estimation, but also as an evaluation measure in its own right.
Confusion Matrix
38 Proposed Evaluation Method for Severely Imbalanced Data Sets
- Our method consists of five steps (a short code sketch follows this slide):
  - Generate a ROC curve R for a classifier K applied to data D.
  - Apply Tango's confidence test in order to identify the confident segments of R.
  - Compute CAUC, the area under the confident ROC segment.
  - Compute AveD, the average normalized difference (b - c)/n over all points in the confident ROC segment.
  - Plot CAUC against AveD. -> An effective classifier shows low AveD and high CAUC.
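A minimal sketch of steps 3 and 4, assuming the confident ROC points and their (b, c) counts have already been obtained from Tango's test; the data structure and numbers are hypothetical, not from the talk:

```python
def cauc_and_aved(confident_points, n):
    """confident_points: (fpr, tpr, b, c) tuples for the ROC points that
    passed Tango's confidence test, assumed to form one contiguous segment;
    n is the number of examples in the data set.
    Returns (CAUC, AveD) as described on slide 38."""
    pts = sorted(confident_points)
    # CAUC: trapezoidal area under the confident segment of the ROC curve.
    cauc = sum((pts[i + 1][0] - pts[i][0]) * (pts[i + 1][1] + pts[i][1]) / 2
               for i in range(len(pts) - 1))
    # AveD: average normalized difference (b - c)/n over the confident points.
    aved = sum((b - c) / n for _, _, b, c in pts) / len(pts)
    return cauc, aved

# Hypothetical confident segment on a data set of 1000 examples.
segment = [(0.05, 0.40, 40, 30), (0.10, 0.55, 30, 60), (0.20, 0.70, 20, 120)]
print(cauc_and_aved(segment, n=1000))
```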
39 Experiments and Expected Results
- We considered 6 imbalanced domains from UCI. The most imbalanced one contained only 1.4% of its examples in the small class, while the least imbalanced one had as many as 26%.
- We ran 4 classifiers: Decision Stumps, Decision Trees, Random Forests and Naïve Bayes (the first three belong to the same family of learners).
- We expected the following results:
  - Weak performance from the Decision Stumps.
  - Stronger performance from the Decision Trees.
  - Even stronger performance from the Random Forests.
  - We expected Naïve Bayes to perform reasonably well, but had no idea how it would compare to the tree family of learners.
40 Results using our new method: our expectations are met
- Note: classifiers in the top left corner outperform those in the bottom right corner.
- Decision Stumps perform the worst, followed by Decision Trees and then Random Forests (in most cases).
- Surprise 1: Decision Trees outperform Random Forests on the two most balanced data sets.
- Surprise 2: Naïve Bayes consistently outperforms Random Forests.
41 AUC Results
- Our more informed results contradict the AUC results, which claim that:
  - Decision Stumps are sometimes as good as or superior to Decision Trees (!)
  - Random Forests outperform all other systems in all but one case.
42 Part III: Discussion
- In order to better understand the performance of classifiers on various domains, it can be useful to consider several aspects of this evaluation simultaneously.
- In order to do so, it might be useful to create specific measures adapted to the purpose of the evaluation.
- In our case, above, our evaluation measure allowed us to study the tradeoff between the classification difference and the area under the confident segment of the ROC curve, thus producing more reliable results.
43 General Conclusions
- It is time for us (researchers in Machine Learning) to stop applying evaluation methods without thinking about the meaning of this evaluation.
- This may lead to the borrowing of existing methods from other fields or to the creation of new measures to suit our purposes.
- As a result, our field should expand more meaningfully, and exchanges between our discipline and others should be boosted, thus benefiting both disciplines.
44 Future Work
- To survey more disciplines for useful evaluation methods applicable to Machine Learning.
- To associate template Machine Learning problems with approaches to evaluation, in order to ease a Machine Learning researcher's choice of evaluation methods.
- To continue designing new approaches to evaluation in Machine Learning.