Title: Non-Traditional Metrics
1. Non-Traditional Metrics
- Evaluation measures from the medical diagnostic community
- Constructing new evaluation measures that combine metric and statistical information
2. Part I
- Borrowing new performance evaluation measures from the medical diagnostic community
- (Marina Sokolova, Nathalie Japkowicz and Stan Szpakowicz)
3. The need to borrow new performance measures: an example
- It has come to our attention that the performance measures commonly used in Machine Learning are not very good at assessing performance on problems in which the two classes are equally important.
- Accuracy considers both classes, but it does not distinguish between them.
- Other measures, such as Precision/Recall, F-Score and ROC Analysis, focus on only one class, without concerning themselves with performance on the other class.
4. Learning Problems in which the classes are equally important
- Examples of recent Machine Learning domains that require equal focus on both classes and a distinction between false positive and false negative rates are
- opinion/sentiment identification
- classification of negotiations
- An example of a traditional problem that requires equal focus on both classes and a distinction between false positive and false negative rates is
- Medical Diagnostic Tests
- What measures have researchers in the Medical Diagnostic Test community used that we can borrow?
5. Performance Measures in use in the Medical Diagnostic Community
- Common performance measures in use in the Medical Diagnostic Community are
- Sensitivity/Specificity (also in use in Machine Learning)
- Likelihood ratios
- Youden's Index
- Discriminant Power
- (Biggerstaff, 2000; Blakeley and Oddone, 1995)
6. Sensitivity/Specificity
- The sensitivity of a diagnostic test is P(T+ | D+), i.e., the probability of obtaining a positive test result in the diseased population.
- The specificity of a diagnostic test is P(T- | D-), i.e., the probability of obtaining a negative test result in the disease-free population.
- Sensitivity and specificity on their own are not that useful, however, since what one is really interested in, both in the medical testing community and in Machine Learning, is P(D+ | T+) (PVP, the Predictive Value of a Positive) and P(D- | T-) (PVN, the Predictive Value of a Negative). We can apply Bayes' Theorem to derive the PVP and PVN. (A small computational sketch of sensitivity and specificity follows below.)
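As an illustration (not part of the original slides), here is a minimal Python sketch of sensitivity and specificity computed from confusion-matrix counts; the counts and function names are hypothetical:

```python
# Minimal sketch: sensitivity and specificity from binary confusion-matrix counts.

def sensitivity(tp, fn):
    """P(T+ | D+): fraction of diseased (positive) cases the test flags."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """P(T- | D-): fraction of disease-free (negative) cases the test clears."""
    return tn / (tn + fp)

tp, fn, fp, tn = 40, 10, 5, 45     # hypothetical counts
print(sensitivity(tp, fn))         # 0.8
print(specificity(tn, fp))         # 0.9
```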
7. Deriving the PVPs and PVNs
- The problem with deriving the PVP and PVN of a test is that, in order to derive them, we need to know P(D+), the pre-test probability of the disease. This cannot be obtained directly.
- As usual, however, we can set ourselves in the context of the comparison of two tests (with P(D+) being the same in both cases).
- Doing so, and using Bayes' Theorem,
- P(D+ | T+) = P(T+ | D+) P(D+) / (P(T+ | D+) P(D+) + P(T+ | D-) P(D-)),
- we can get the following relationships (see Biggerstaff, 2000):
- P(D+ | T+)_Y > P(D+ | T+)_X  <=>  λ+_Y > λ+_X
- P(D- | T-)_Y > P(D- | T-)_X  <=>  λ-_Y < λ-_X
- where X and Y are two diagnostic tests, and X+ and X- stand for test X confirming the presence and the absence of the disease, respectively (and similarly for Y+ and Y-).
- λ+ and λ- are the likelihood ratios defined on the next slide. (A short derivation sketch follows below.)
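To make explicit why the likelihood ratios carry the information needed to compare PVPs, here is a short derivation sketch in the notation above (this step is implicit in the slide; see Biggerstaff, 2000):

```latex
P(D^+ \mid T^+)
  = \frac{P(T^+ \mid D^+)\,P(D^+)}
         {P(T^+ \mid D^+)\,P(D^+) + P(T^+ \mid D^-)\,P(D^-)}
  = \frac{\lambda^{+} P(D^+)}{\lambda^{+} P(D^+) + P(D^-)},
  \qquad \lambda^{+} = \frac{P(T^+ \mid D^+)}{P(T^+ \mid D^-)}.
```

Since x/(x + c) is increasing in x for any fixed c > 0, a larger λ+ yields a larger PVP at the same pre-test probability, which is exactly the first relationship above; the argument for PVN and λ- is symmetric, with the inequality reversed because λ- ends up in the denominator.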
8. Likelihood Ratios
- λ+ and λ- are actually easy to derive.
- The likelihood ratio of a positive test is
- λ+ = P(T+ | D+) / P(T+ | D-), i.e., the ratio of the true positive rate to the false positive rate.
- The likelihood ratio of a negative test is
- λ- = P(T- | D+) / P(T- | D-), i.e., the ratio of the false negative rate to the true negative rate.
- Note: we want to maximize λ+ and minimize λ-.
- This means that, even though we cannot calculate the PVP and PVN directly, we can get the information we need to compare two tests through the likelihood ratios (see the sketch below).
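A minimal Python sketch expressing both ratios in terms of sensitivity and specificity (the input values are hypothetical, continuing the earlier example):

```python
# Minimal sketch: likelihood ratios from sensitivity and specificity.
# We want lambda+ (lr_pos) as large as possible and lambda- (lr_neg) as small as possible.

def likelihood_ratios(sens, spec):
    lr_pos = sens / (1.0 - spec)    # lambda+ = TPR / FPR
    lr_neg = (1.0 - sens) / spec    # lambda- = FNR / TNR
    return lr_pos, lr_neg

print(likelihood_ratios(0.8, 0.9))  # (8.0, 0.222...)
```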
9. Youden's Index and Discriminant Power
- Youden's Index measures the avoidance of failure of an algorithm, while Discriminant Power evaluates how well an algorithm distinguishes between positive and negative examples.
- Youden's Index
- γ = sensitivity - (1 - specificity)
-   = P(T+ | D+) - (1 - P(T- | D-))
- Discriminant Power
- DP = (√3 / π) (log X + log Y),
- where X = sensitivity / (1 - sensitivity) and Y = specificity / (1 - specificity). (A computational sketch follows below.)
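A minimal Python sketch of both measures as written above; it assumes natural logarithms in DP and reuses the hypothetical sensitivity/specificity values from earlier:

```python
import math

def youden_index(sens, spec):
    # gamma = sensitivity - (1 - specificity)
    return sens - (1.0 - spec)

def discriminant_power(sens, spec):
    # DP = (sqrt(3) / pi) * (log X + log Y),
    # with X = sens / (1 - sens) and Y = spec / (1 - spec)
    x = sens / (1.0 - sens)
    y = spec / (1.0 - spec)
    return (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y))

print(youden_index(0.8, 0.9))        # 0.7
print(discriminant_power(0.8, 0.9))  # about 1.98
```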
10. Comparison of the various measures on the outcome of e-negotiation
- Note: DP below 3 → insignificant
11. What does this all mean? Traditional ML Measures
12. What does this all mean? New measures that are more appropriate for problems where both classes are equally important
13. Part I Discussion
- The variety of results obtained with the different measures suggests two conclusions:
- It is very important for practitioners of Machine Learning to understand their domain deeply, to understand what exactly it is that they want to evaluate, and to reach their goal using appropriate measures (existing or new ones).
- Since some of the results are very close to each other, it is important to establish reliable confidence tests to find out whether or not these results are significant.
14. Part II
- Constructing new evaluation measures
- (William Elamzeh, Nathalie Japkowicz and Stan Matwin)
15. Motivation for our new evaluation method
- ROC Analysis alone and its associated AUC measure do not assess the performance of classifiers adequately, since they omit any information regarding the confidence of these estimates.
- Though the identification of the significant portion of ROC curves is an important step towards generating a more useful assessment, this analysis remains biased in favour of the large class in the case of severe imbalances.
- We would like to combine the information provided by the ROC curve with information regarding how balanced the classifier is with regard to the misclassification of positive and negative examples.
16. ROC's bias in the case of severe class imbalances
- The ROC curve for the positive class plots the true positive rate a/(a+b) against the false positive rate c/(c+d).
- When the number of positive examples is significantly lower than the number of negative examples, a+b << c+d; as we change the class probability threshold, a/(a+b) climbs faster than c/(c+d) (illustrated in the sketch below).
- ROC thus gives the majority class (-) an unfair advantage.
- Ideally, a classifier should classify both classes proportionally.
- Confusion matrix: a = true positives, b = false negatives, c = false positives, d = true negatives; n = a + b + c + d.
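A minimal Python sketch of the two rates from the confusion-matrix cells; the counts are hypothetical and chosen to mimic a severe imbalance (few positives, many negatives):

```python
# Minimal sketch: TPR climbs quickly while FPR stays low under severe imbalance,
# even though the classifier makes many more false positives than false negatives.

def true_positive_rate(a, b):
    return a / (a + b)              # y-axis of the ROC curve

def false_positive_rate(c, d):
    return c / (c + d)              # x-axis of the ROC curve

a, b, c, d = 8, 2, 50, 940          # hypothetical: 10 positives vs. 990 negatives
print(true_positive_rate(a, b))     # 0.8
print(false_positive_rate(c, d))    # about 0.05, despite 50 false positives
```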
17. Correcting for ROC's bias in the case of severe class imbalances
- Though we keep ROC as a performance evaluation measure, since rate information is useful, we propose, for confidence estimation, to favour classifiers that make a similar number of errors in both classes.
- More specifically, as in the Tango test, we favour classifiers that have a lower difference in classification errors between the two classes, (b - c)/n.
- This quantity (b - c)/n is interesting not just for confidence estimation, but also as an evaluation measure in its own right (a small sketch follows below).
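A minimal Python sketch of the normalized error difference, continuing the hypothetical counts from the previous slide (b = false negatives, c = false positives, n = total number of examples):

```python
# Minimal sketch: normalized difference between the two kinds of errors.
# Values close to zero indicate errors that are balanced across the classes.

def error_difference(b, c, n):
    return (b - c) / n

a, b, c, d = 8, 2, 50, 940          # hypothetical counts from the previous sketch
n = a + b + c + d
print(error_difference(b, c, n))    # -0.048: far more false positives than false negatives
```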
18. Proposed Evaluation Method for Severely Imbalanced Data Sets
- Our method consists of five steps:
- Generate a ROC curve R for a classifier K applied to data D.
- Apply Tango's confidence test in order to identify the confident segments of R.
- Compute CAUC, the area under the confident ROC segment.
- Compute AveD, the average normalized difference (b - c)/n over all points in the confident ROC segment.
- Plot CAUC against AveD → an effective classifier shows low AveD and high CAUC. (A rough sketch of these steps follows below.)
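For illustration, here is a rough Python sketch of the five steps under stated assumptions: scikit-learn's roc_curve and auc stand in for steps 1 and 3, labels are 0/1 with 1 as the positive (minority) class, and confident_mask is a hypothetical helper standing in for Tango's confidence test, which is not implemented here. This is a sketch of the workflow, not the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

def evaluate_confident_segment(y_true, y_score, confident_mask):
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)

    # Step 1: ROC curve R for the classifier's scores on data D
    fpr, tpr, thresholds = roc_curve(y_true, y_score)

    # Step 2: boolean mask over the ROC points marking the confident segment
    # (hypothetical helper standing in for Tango's confidence test)
    mask = confident_mask(fpr, tpr, y_true, y_score)

    # Step 3: CAUC, the area under the confident segment only
    # (assumes the confident segment contains at least two points)
    cauc = auc(fpr[mask], tpr[mask])

    # Step 4: AveD, the average normalized difference (b - c)/n over the
    # confident points; b (false negatives) and c (false positives) are
    # recomputed at each corresponding threshold
    n = len(y_true)
    diffs = []
    for t in thresholds[mask]:
        pred = (y_score >= t).astype(int)
        b = np.sum((y_true == 1) & (pred == 0))
        c = np.sum((y_true == 0) & (pred == 1))
        diffs.append((b - c) / n)
    ave_d = float(np.mean(diffs))

    # Step 5: the (AveD, CAUC) pair is what gets plotted;
    # low AveD and high CAUC indicate an effective classifier
    return cauc, ave_d
```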
19. Experiments and Expected Results
- We considered 6 imbalanced domains from UCI. The most imbalanced one contained only 1.4% of its examples in the small class, while the least imbalanced one had as many as 26%.
- We ran 4 classifiers: Decision Stumps, Decision Trees, Random Forests and Naïve Bayes (a scikit-learn sketch of this lineup follows below).
- We expected the following results:
- Weak performance from the Decision Stumps
- Stronger performance from the Decision Trees
- Even stronger performance from the Random Forests
- We expected Naïve Bayes to perform reasonably well, but had no idea how it would compare to the tree family of learners.
- (Decision Stumps, Decision Trees and Random Forests belong to the same family of learners.)
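For reference, a sketch of a comparable classifier lineup using scikit-learn stand-ins (chosen here for illustration; the original experiments' exact implementations and parameter settings are not given in the slides):

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Hypothetical stand-ins for the four classifiers used in the experiments.
classifiers = {
    "Decision Stump": DecisionTreeClassifier(max_depth=1),  # single-split tree
    "Decision Tree": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
    "Naive Bayes": GaussianNB(),
}
```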
20. Results using our new method: our expectations are met
- Note: classifiers in the top left corner (of the CAUC vs. AveD plot) outperform those in the bottom right corner.
- Decision Stumps perform the worst, followed by Decision Trees and then Random Forests (in most cases).
- Surprise 1: Decision Trees outperform Random Forests on the two most balanced data sets.
- Surprise 2: Naïve Bayes consistently outperforms Random Forests.
21. AUC Results
- Our more informed results contradict the AUC results, which claim that:
- Decision Stumps are sometimes as good as or superior to Decision Trees (!)
- Random Forests outperform all other systems in all but one case.
22. Part II Discussion
- In order to better understand the performance of classifiers on various domains, it can be useful to consider several aspects of the evaluation simultaneously.
- In order to do so, it might be useful to create specific measures adapted to the purpose of the evaluation.
- In our case, above, our evaluation measure allowed us to study the tradeoff between the classification error difference and the area under the confident segment of the ROC curve, thus producing more reliable results.