Revising our Evaluation Practices in Machine Learning - Nathalie Japkowicz

1
Revising our Evaluation Practices in Machine Learning
Nathalie Japkowicz
  • School of Computer Science and Software
    Engineering
  • Monash University
  • (Visiting professor)
  • School of Information Technology and Engineering
  • University of Ottawa
  • (Home Institution)

2
Observations
  • The way in which Evaluation is conducted in
    Machine learning/Data Mining has not been a
    primary concern in the community.
  • This is very different from the way Evaluation
    is approached in other applied fields such as
    Economics, Psychology and Sociology.
  • In such fields, researchers have been more
    concerned with the meaning and validity of their
    results than we typically are in ours.

3
The Problem
  • The objective value of our advances in Machine
    Learning may be different from what we believe it
    is.
  • Our conclusions may be flawed or meaningless.
  • ML methods may get undue credit, or fail to be
    sufficiently recognized.
  • The field may start stagnating.
  • Practitioners in other fields or potential
    business partners may dismiss our
    approaches/results.
  • We hope that with better evaluation practices, we
    can help the field of machine learning focus on
    more effective research and encourage more
    cross-discipline or cross-purpose exchanges.

4
Organization of the Talk
  • I. A review of the shortcomings of current
    evaluation methods
  • Problems with Performance Evaluation
  • Problems with Confidence Estimation
  • II. Borrowing performance evaluation measures and
    confidence tests from other disciplines (with
    Marina Sokolova, Stan Szpakowicz, William Elazmeh
    and Stan Matwin).
  • III. Constructing new Evaluation measures (with
    William Elazmeh and Stan Matwin)

5
Part I
  • A review of the shortcomings of current
    evaluation methods
  • Problems with Performance Evaluation
  • Problems with Confidence Evaluation
  • Presented at the AAAI 2006 Workshop on
    Evaluation Methods for Machine Learning

6
Recommended Steps for Proper Evaluation
  • Identify the interesting properties of the
    classifier.
  • Choose an evaluation metric accordingly.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the
    evaluation metric and confidence estimator are
    verified.
  • Run the evaluation method with the chosen metric
    and confidence estimator, and analyze the
    results.
  • Interpret the results with respect to the domain.

Suggested by William Elazmeh
7
Commonly Followed Steps of Evaluation
  • Identify the interesting properties of the
    classifier.
  • Choose an evaluation metric accordingly.
  • Choose a confidence estimation method.
  • Check that all the assumptions made by the
    evaluation metric and confidence estimator are
    verified.
  • Run the evaluation method with the chosen metric
    and confidence estimator, and analyze the
    results.
  • Interpret the results with respect to the domain.

These steps are typically considered, but only
very lightly.
8
Overview
  • What happens when bad choices of performance
    evaluation metrics are made? (Steps 1 and 2 are
    considered too lightly)
  • Accuracy
  • Precision/Recall
  • ROC Analysis
  • Note: each metric solves the problem of the
    previous one, but introduces new shortcomings
    (usually caught by the previous metrics)
  • What happens when bad choices of confidence
    estimators are made and the assumptions
    underlying these confidence estimators are not
    respected? (Step 3 is considered lightly and Step
    4 is disregarded).
  • The t-test

9
A Short Review I: Confusion Matrix / Common
Performance Evaluation Metrics
  • Accuracy = (TP + TN) / (P + N)
  • Precision = TP / (TP + FP)
  • Recall / TP rate = TP / P
  • FP rate = FP / N
  • ROC Analysis moves the threshold between the
    positive and negative class from a small FP rate
    to a large one. It plots the value of the Recall
    against that of the FP rate at each FP rate
    considered.

A Confusion Matrix
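As a quick illustration, here is a minimal Python sketch (not part of the talk) that computes these four metrics from the counts of a confusion matrix; the function name and argument order are our own.

```python
def basic_metrics(tp, fp, fn, tn):
    """Compute the common metrics from confusion-matrix counts."""
    p = tp + fn          # actual positives
    n = fp + tn          # actual negatives
    return {
        "accuracy":  (tp + tn) / (p + n),
        "precision": tp / (tp + fp),
        "recall":    tp / p,     # also the TP rate
        "fp_rate":   fp / n,
    }

# Example: basic_metrics(tp=40, fp=10, fn=60, tn=90)
# -> accuracy 0.65, precision 0.8, recall 0.4, fp_rate 0.1
```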
10
A Short Review II: Confidence Estimation / The t-Test
  • The most commonly used approach to confidence
    estimation in Machine learning is
  • To run the algorithm using 10-fold
    cross-validation and to record the accuracy at
    each fold.
  • To compute a confidence interval around the
    average of the difference between these reported
    accuracies and a given gold standard, using the
    t-test, i.e., the following formula:
  • d̄ ± t_(N,9) · s_d, where
  • d̄ is the average difference between the reported
    accuracy and the given gold standard,
  • t_(N,9) is a constant chosen according to the degree
    of confidence desired,
  • s_d = sqrt( (1/90) · Σ_{i=1..10} (d_i - d̄)² ), where d_i
    represents the difference between the reported
    accuracy and the given gold standard at fold i.
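A minimal Python sketch of this procedure, assuming the per-fold accuracies and the gold standard are given (the function name is ours):

```python
import numpy as np
from scipy import stats

def t_confidence_interval(fold_accuracies, gold_standard, confidence=0.95):
    d = np.asarray(fold_accuracies) - gold_standard        # d_i, one per fold
    k = len(d)                                             # k = 10 folds -> 1/(k*(k-1)) = 1/90
    d_bar = d.mean()
    s_d = np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))
    t_crit = stats.t.ppf((1 + confidence) / 2, df=k - 1)   # the t_(N,9) constant
    return d_bar - t_crit * s_d, d_bar + t_crit * s_d
```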

11
What's wrong with Accuracy?
  • Both classifiers obtain 60% accuracy
  • They exhibit very different behaviours
  • On the left: weak positive recognition
    rate / strong negative recognition rate
  • On the right: strong positive recognition
    rate / weak negative recognition rate
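A hypothetical pair of confusion matrices consistent with the slide's description (the original slide's exact counts are not reproduced here) makes the point concrete:

```python
# Both classifiers score 60% accuracy on 100 positives and 100 negatives,
# yet their class-wise behaviour is opposite.
left  = dict(tp=20,  fn=80, fp=0,  tn=100)   # weak on positives, strong on negatives
right = dict(tp=100, fn=0,  fp=80, tn=20)    # strong on positives, weak on negatives

for m in (left, right):
    accuracy = (m["tp"] + m["tn"]) / sum(m.values())
    print(accuracy)    # 0.6 in both cases
```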

12
What's wrong with Precision/Recall?
  • Both classifiers obtain the same precision and
    recall values of 66.7% and 40%
  • They exhibit very different behaviours
  • Same positive recognition rate
  • Extremely different negative recognition rate:
    strong on the left / nil on the right
  • Note: Accuracy has no problem catching this!
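Again with hypothetical counts consistent with the stated precision and recall values (not the original slide's numbers):

```python
# Both classifiers have precision 66.7% and recall 40%,
# but only the left one recognizes any negatives.
left  = dict(tp=200, fn=300, fp=100, tn=400)   # negative recognition rate 80%
right = dict(tp=200, fn=300, fp=100, tn=0)     # negative recognition rate 0%

for m in (left, right):
    precision = m["tp"] / (m["tp"] + m["fp"])            # 0.667 for both
    recall    = m["tp"] / (m["tp"] + m["fn"])            # 0.4 for both
    accuracy  = (m["tp"] + m["tn"]) / sum(m.values())    # 0.60 vs 0.33 -- accuracy catches it
    print(precision, recall, accuracy)
```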

13
What's wrong with ROC Analysis? (We consider
single points in ROC space, not the entire ROC curve)
  • ROC Analysis and Precision yield contradictory
    results
  • In terms of ROC Analysis, the classifier on the
    right is a significantly better choice than the
    one on the left: the point representing the
    right classifier is on the same vertical line but
    22.25% higher than the point representing the
    left classifier
  • Yet, the classifier on the right has ridiculously
    low precision (33.3%) while the classifier on the
    left has excellent precision (95.24%).

14
What's wrong with the t-test?
  • Classifiers 1 and 2 yield the same average and
    confidence interval.
  • Yet, Classifier 1 is relatively stable, while
    Classifier 2 is not.
  • Problem: the t-test assumes a normal
    distribution. The difference in accuracy between
    Classifier 2 and the gold standard is not
    normally distributed.
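One way to respect Step 4 here is to test the normality assumption before trusting the t-test. A small sketch with made-up per-fold differences (not the talk's data), using the Shapiro-Wilk test:

```python
import numpy as np
from scipy import stats

# Two classifiers with the same mean difference (0.02) from the gold standard:
# classifier 1 is stable, classifier 2 owes its mean to a single outlier fold.
diffs_1 = np.array([0.02, 0.01, 0.03, 0.02, 0.01, 0.02, 0.03, 0.02, 0.01, 0.03])
diffs_2 = np.array([0.00, 0.00, 0.00, 0.00, 0.20, 0.00, 0.00, 0.00, 0.00, 0.00])

for d in (diffs_1, diffs_2):
    stat, p_value = stats.shapiro(d)   # normality test on the fold differences
    print(d.mean(), p_value)           # a low p-value warns against using the t-test
```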

15
Part I Discussion
  • There is nothing intrinsically wrong with any of
    the performance evaluation measures or confidence
    tests discussed. It's all a matter of thinking
    about which one to use when, and what the results
    mean (both in terms of added value and
    limitations).
  • Simple conceptualization of the problem with
    current evaluation practices:
  • Evaluation metrics and confidence measures
    summarize the results → ML practitioners must
    understand the terms of these summarizations and
    verify that their assumptions are satisfied.
  • In certain cases, however, it is necessary to
    look further and, eventually, borrow practices
    from other disciplines. In yet other cases, it
    pays to devise our own methods. Both instances
    are discussed in what follows.

16
Part II a
  • Borrowing new performance evaluation measures
    from the medical diagnostic community
  • (Marina Sokolova, Nathalie Japkowicz
  • and Stan Szpakowicz)
  • To be presented at the Australian AI 2006
    Conference

17
The need to borrow new performance measures: an
example
  • It has come to our attention that the performance
    measures commonly used in Machine Learning are
    not very good at assessing performance on
    problems in which the two classes are equally
    important.
  • Accuracy focuses on both classes, but it does
    not distinguish between the two classes.
  • Other measures, such as Precision/Recall, F-Score
    and ROC Analysis, only focus on one class, without
    concerning themselves with performance on the
    other class.

18
Learning Problems in which the classes are
equally important
  • Examples of recent Machine Learning domains that
    require equal focus on both classes and a
    distinction between false positive and false
    negative rates are
  • opinion/sentiment identification
  • classification of negotiations
  • An example of a traditional problem that
    requires equal focus on both classes and a
    distinction between false positive and false
    negative rates is
  • Medical Diagnostic Tests
  • What measures have researchers in the Medical
    Diagnostic Test Community used that we can borrow?

19
Performance Measures in use in the Medical
Diagnostic Community
  • Common performance measures in use in the Medical
    Diagnostic Community are
  • Sensitivity/Specificity (also in use in Machine
    learning)
  • Likelihood ratios
  • Youden's Index
  • Discriminant Power
  • (Biggerstaff, 2000; Blakeley & Oddone, 1995)

20
Sensitivity/Specificity
  • The sensitivity of a diagnostic test is
    P(T+|D+), i.e., the probability of obtaining
    a positive test result in the diseased population.
  • The specificity of a diagnostic test is
    P(T-|D-), i.e., the probability of obtaining
    a negative test result in the disease-free population.
  • Sensitivity and specificity are not that useful,
    however, since one is really interested in
    P(D+|T+) (PVP, the Predictive Value of a Positive)
    and P(D-|T-) (PVN, the Predictive Value of a
    Negative), both in the medical testing community
    and in Machine Learning. → We can apply Bayes'
    Theorem to derive the PVP and PVN.

21
Deriving the PVPs and PVNs
  • The problem with deriving the PVP and PVN of a
    test is that, in order to derive them, we need to
    know P(D+), the pre-test probability of the
    disease. This cannot be done directly.
  • As usual, however, we can set ourselves in the
    context of the comparison of two tests (with
    P(D+) being the same in both cases).
  • Doing so, and using Bayes' Theorem:
  • P(D+|T+) = P(T+|D+) P(D+) / (P(T+|D+) P(D+) + P(T+|D-) P(D-))
  • we can get the following relationships (see
    Biggerstaff, 2000):
  • P(D+|T+_Y) > P(D+|T+_X)  <=>  λ+_Y > λ+_X
  • P(D-|T-_Y) > P(D-|T-_X)  <=>  λ-_Y < λ-_X
  • where X and Y are two diagnostic tests, and T+_X
    and T-_X stand for test X confirming the presence
    of the disease and confirming the absence of the
    disease, respectively (and similarly for T+_Y and
    T-_Y).
  • λ+ and λ- are the likelihood ratios that are
    defined on the next slide.
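For concreteness, a short sketch (our own, with made-up numbers) that applies Bayes' Theorem to turn sensitivity, specificity, and a pre-test probability into the PVP and PVN:

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PVP = P(D+|T+) and PVN = P(D-|T-) via Bayes' Theorem."""
    pvp = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    pvn = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return pvp, pvn

# Example: predictive_values(0.9, 0.8, 0.1) -> (0.333..., 0.986...)
```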

22
Likelihood Ratios
  • λ+ and λ- are actually easy to derive.
  • The likelihood ratio of a positive test is
  • λ+ = P(T+|D+) / P(T+|D-), i.e., the ratio of the
    true positive rate to the false positive rate.
  • The likelihood ratio of a negative test is
  • λ- = P(T-|D+) / P(T-|D-), i.e., the ratio of the
    false negative rate to the true negative rate.
  • Note: We want to maximize λ+ and minimize λ-.
  • This means that, even though we cannot calculate
    the PVP and PVN directly, we can get the
    information we need to compare two tests through
    the likelihood ratios.
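A minimal sketch of the two ratios, computed from sensitivity and specificity (function name is ours):

```python
def likelihood_ratios(sensitivity, specificity):
    lr_plus  = sensitivity / (1 - specificity)        # TP rate / FP rate, to maximize
    lr_minus = (1 - sensitivity) / specificity        # FN rate / TN rate, to minimize
    return lr_plus, lr_minus

# Example: likelihood_ratios(0.9, 0.8) -> (4.5, 0.125)
```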

23
Youden's Index and Discriminant Power
  • Youden's Index measures the avoidance of failure
    of an algorithm, while Discriminant Power
    evaluates how well an algorithm distinguishes
    between positive and negative examples.
  • Youden's Index:
  • γ = sensitivity - (1 - specificity)
      = P(T+|D+) - (1 - P(T-|D-))
  • Discriminant Power:
  • DP = (sqrt(3) / π) · (log X + log Y),
  • where X = sensitivity / (1 - sensitivity) and
  • Y = specificity / (1 - specificity)
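Both indices follow directly from sensitivity and specificity; a short sketch, assuming neither value is exactly 0 or 1 (which would make the DP logarithms undefined):

```python
import math

def youden_index(sensitivity, specificity):
    return sensitivity - (1 - specificity)

def discriminant_power(sensitivity, specificity):
    x = sensitivity / (1 - sensitivity)
    y = specificity / (1 - specificity)
    return (math.sqrt(3) / math.pi) * (math.log(x) + math.log(y))

# Example: youden_index(0.9, 0.8)       -> 0.7
#          discriminant_power(0.9, 0.8) -> ~1.97 (below 3, i.e., a weak discriminant)
```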

24
Comparison of the various measures on the outcome
of e-negotiation
DP is below 3 → insignificant
25
What does this all mean? Traditional ML Measures
26
What does this all mean? New measures that are
more appropriate for problems where both classes
are equally important
27
Part II a Discussion
  • The variety of results obtained with the
    different measures suggests two conclusions:
  • It is very important for practitioners of Machine
    Learning to understand their domain deeply, to
    understand what it is, exactly, that they want to
    evaluate, and to reach their goal using
    appropriate measures (existing or new ones).
  • Since some of the results are very close to each
    other, it is important to establish reliable
    confidence tests to find out whether or not these
    results are significant.

28
Part II b
  • Borrowing new confidence tests from the medical
    diagnostic community
  • (William Elazmeh, Nathalie Japkowicz
  • and Stan Matwin)
  • Presented at the ICML 2006 Workshop on ROC
    Analysis

29
The need to borrow new confidence tests: an
example
  • The class imbalance problem is very pervasive in
    practical applications of machine learning.
  • In such cases, it is recommended to use ROC
    analysis, which is believed not to be sensitive
    to class imbalances.
  • Recently, a confidence test was proposed for ROC
    Curves by Macskassy & Provost (ICML 2005): the ROC
    bands.
  • These ROC bands are based on sampling methods,
    which become useless in case of severe class
    imbalance and when the data distribution is
    unknown. → To compensate for that, we will
    consider Tango's test.

30
Tango's Test
  • Rather than focusing on the true positive (a) and
    true negative (d) rates, Tango's test focuses on
    the false negative (b) and false positive (c)
    rates.
  • Its null hypothesis is
    H0: δ = b - c = 0
  • Testing this null hypothesis allows the
    statistical test not to be influenced by class
    imbalance in favour of the majority class (see
    further)
  • Conversely, tests based on a and d rather than b
    and c do break down in case of severe imbalance.
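The full Tango (1998) procedure builds a score-based confidence interval for the difference of paired proportions; at the null value 0, its test statistic depends only on the discordant counts b and c. The sketch below implements only that simplified statistic (our simplification, not the authors' implementation):

```python
import math

def tango_statistic_at_zero(b, c):
    """Score statistic for H0: b - c = 0, using only the off-diagonal counts
    b (false negatives) and c (false positives). Large |z| suggests rejecting
    H0; note it never looks at a or d, so it is not dominated by the
    majority class."""
    if b + c == 0:
        return 0.0
    return (b - c) / math.sqrt(b + c)
```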

31
Example of a standard test's breakdown: ROC
Bands on severely imbalanced data sets
The bands are wide and not very useful. This was
true of all the data sets we experimented with.
32
Examples of Tango's performance on severely
imbalanced data sets
The dark segments show that Tango's test detected
confident segments in most curves.
33
Part II b Discussion
  • Evaluation results we obtain always need to be
    corroborated using confidence measures. Failure
    to do so may yield false conclusions.
  • When necessary, it is useful to borrow tests and
    measures from other disciplines, as we did with
    the likelihood ratios and Tango's test.
  • Sometimes, however, these borrowed tests are not
    sufficient. In our case, for example, we may want
    to visualize several measures and confidence test
    results simultaneously.
  • It may be necessary to construct new evaluation
    methods in order to reach our goal.
  • This will be the topic of the next and last
    section.

34
Part III
  • Constructing new evaluation measures
  • (William Elazmeh, Nathalie Japkowicz
  • and Stan Matwin)
  • To be presented at ECML 2006

35
Motivation for our new evaluation method
  • ROC Analysis alone and its associated AUC measure
    do not assess the performance of classifiers
    adequately since they omit any information
    regarding the confidence of these estimates.
  • Though the identification of the significant
    portion of ROC Curves is an important step
    towards generating a more useful assessment, this
    analysis remains biased in favour of the large
    class, in case of severe imbalances.
  • We would like to combine the information provided
    by the ROC Curve together with information
    regarding how balanced the classifier is with
    regard to the misclassification of positive and
    negative examples.

36
ROC's bias in the case of severe class imbalances
  • The ROC Curve for the positive class plots the true
    positive rate a/(a+b) against the false positive
    rate c/(c+d).
  • When the number of positive examples is significantly
    lower than the number of negative examples, a+b <<
    c+d; as we change the class probability
    threshold, a/(a+b) climbs faster than c/(c+d)
  • ROC gives the majority class (-) an unfair
    advantage.
  • Ideally, a classifier should classify both
    classes proportionally

Confusion Matrix
37
Correcting for ROC's bias in the case of severe
class imbalances
  • Though we keep ROC as a performance evaluation
    measure, since rate information is useful, we
    propose to favour classifiers that make a
    similar number of errors in both classes, for
    confidence estimation.
  • More specifically, as in Tango's test, we favour
    classifiers that have a lower difference in
    classification errors between the two classes, (b-c)/n.
  • This quantity (b-c)/n is interesting not just for
    confidence estimation, but also as an evaluation
    measure in its own right.

Confusion Matrix
38
Proposed Evaluation Method for Severely
Imbalanced Data Sets
  • Our method consists of five steps:
  • Generate a ROC Curve R for a classifier K applied
    to data D.
  • Apply Tango's confidence test in order to
    identify the confident segments of R.
  • Compute the CAUC, the area under the confident
    ROC segment.
  • Compute AveD, the average normalized difference
    (b-c)/n for all points in the confident ROC
    segment.
  • Plot CAUC against AveD → an effective classifier
    shows low AveD and high CAUC.
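A rough sketch of steps 3-5, assuming the confident ROC segment has already been extracted as a list of points; the data layout and function name here are our assumptions, not the authors' code:

```python
import numpy as np

def cauc_and_aved(confident_points, n):
    """confident_points: list of (fp_rate, tp_rate, b, c) tuples for the points
    on the confident ROC segment, ordered by increasing fp_rate; n is the
    number of test examples. Returns (CAUC, AveD)."""
    fpr = np.array([p[0] for p in confident_points])
    tpr = np.array([p[1] for p in confident_points])
    cauc = np.trapz(tpr, fpr)                          # area under the confident segment
    aved = float(np.mean([(b - c) / n for _, _, b, c in confident_points]))
    return cauc, aved

# An effective classifier lands in the low-AveD / high-CAUC corner of the plot.
```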

39
Experiments and Expected Results
  • We considered 6 imbalanced domain from UCI. The
    most imbalanced one contained only 1.4 examples
    in the small class while the least imbalanced one
    had as many as 26.
  • We ran 4 classifiers Decision Stumps, Decision
    Trees, Decision Forests and Naïve Bayes
  • We expected the following results
  • Weak Performance from the Decision Stumps
  • Stronger Performance from the Decision Trees
  • Even Stronger Performance from the Random Forests
  • We expected Naïve Bayes to perform reasonably
    well, but with no idea of how it would compare to
    the tree family of learners

Same family of learners
40
Results using our new method: our expectations
are met
  • Note: Classifiers in the top left corner
    outperform those in the bottom right corner
  • Decision Stumps perform the worst, followed by
    decision trees and then random forests (in most
    cases)
  • Surprise 1: Decision trees outperform random
    forests in the two most balanced data sets.
  • Surprise 2: Naïve Bayes consistently outperforms
    Random Forests

41
AUC Results
  • Our more informed results contradict the AUC
    results, which claim that:
  • Decision Stumps are sometimes as good as or
    superior to decision trees (!)
  • Random Forests outperform all other systems in
    all but one case.

42
Part III Discussion
  • In order to better understand the performance of
    classifiers on various domains, it can be useful
    to consider several aspects of this evaluation
    simultaneously.
  • In order to do so, it might be useful to create
    specific measures adapted to the purpose of the
    evaluation.
  • In our case, above, our evaluation measure
    allowed us to study the tradeoff between the
    classification difference and the area under the
    confident segment of the ROC curve, thus
    producing more reliable results.

43
General Conclusions
  • It is time for us (researchers in Machine
    Learning) to stop applying evaluation methods
    without thinking about the meaning of this
    evaluation.
  • This may lead to the borrowing of existing
    methods from other fields or to the creation of
    new measures to suit our purposes.
  • As a result, our field should expand more
    meaningfully, and exchanges between our discipline
    and others should be boosted, thus benefiting
    both disciplines.

44
Future Work
  • To survey more disciplines for useful evaluation
    methods applicable to Machine Learning.
  • To associate template Machine Learning problems
    with approaches to evaluation, in order to ease a
    Machine Learning researcher's choice of
    evaluation methods.
  • To continue designing new approaches to
    evaluation in Machine Learning.