Statistical Significance Testing

Provided by: COEMA4



1
Statistical Significance Testing
2
The purpose of Statistical Significance Testing
  • The purpose of Statistical Significance Testing
    is to answer the following questions:
  • Can the observed results be attributed to real
    characteristics of the classifiers under scrutiny
    or were they obtained by chance?
  • Were the data sets representative of the problems
    to which the classifier will be applied in the
    future?
  • Important: Unfortunately, because of the
    inductive nature of the problem, such questions
    cannot be fully answered. The user should,
    instead, accept that no matter what evaluation
    procedures are followed, they only allow us to
    gather some evidence about the classifiers'
    behaviour. They are almost never conclusive.

3
Current Disagreements with Statistical
Significance Testing
  • There is, currently, a controversy in the
    statistical community, with some scholars calling
    for the rejection of Null Hypothesis Significance
    Testing (NHST) because:
  • It is commonly misinterpreted, causing
    over-confidence in meaningless results.
  • Its results can be manipulated to show
    statistical significance even if that
    significance is not practically meaningful.
  • While remaining cautious about these issues, we
    believe that NHST is the best tool we have to
    answer the previous two questions, and will
    continue to use it.

4
Overview of Statistical Tests
  • We consider two categories of approaches:
  • Parametric approaches, which make strong
    assumptions about the distribution of the
    population, and
  • Non-parametric approaches, whose assumptions are
    not as strong.
  • Parametric approaches are typically more powerful
    than non-parametric ones when their assumptions
    hold.
  • Note 1: The quality of a statistical test is
    measured using two quantities: the Type I error
    of the test, which denotes the probability of
    incorrectly detecting a difference when no such
    difference between two classifiers exists, and
    the power of the test, which signifies its
    ability to detect differences when they do exist.
  • Note 2: We assume that in all cases, the
    algorithms are tested on the same domains
    (matched samples).
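
The two quantities in Note 1 can be estimated by simulation. The sketch below is only an illustration under stated assumptions: the paired differences are drawn from a normal distribution with unit variance, the test is an exact two-sided sign test, and the names `sign_test_p` and `rejection_rate` are hypothetical helpers, not part of the original slides.

```python
import math
import random


def sign_test_p(diffs):
    """Two-sided exact sign test p-value for paired differences."""
    wins = sum(1 for d in diffs if d > 0)
    losses = sum(1 for d in diffs if d < 0)
    n = wins + losses  # zero differences are dropped
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def rejection_rate(true_shift, n_pairs=20, trials=2000, alpha=0.05, seed=0):
    """Fraction of simulated experiments in which the null is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # Assumption: differences are normal with unit variance.
        diffs = [rng.gauss(true_shift, 1.0) for _ in range(n_pairs)]
        if sign_test_p(diffs) < alpha:
            rejections += 1
    return rejections / trials


# No real difference: the rejection rate estimates the Type I error.
type_i_error = rejection_rate(true_shift=0.0)
# A real difference of one standard deviation: the rate estimates the power.
power = rejection_rate(true_shift=1.0, seed=1)

print(f"estimated Type I error: {type_i_error:.3f}")
print(f"estimated power:        {power:.3f}")
```

A well-behaved test keeps the first number at or below the chosen significance level while making the second as large as possible.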

5
Overview of the Different Tests I
  • Comparison of two algorithms on a single domain
    (I)
  • Parametric: the t-test, along with a measure of
    the effect size (Cohen's d statistic).
  • Explanation: The t-test determines whether the
    observed difference in the performance measures
    of the classifiers is statistically significant.
    However, it cannot confirm whether this
    difference, although statistically significant,
    is also of any practical importance. That is, it
    detects the effect but does not measure its
    size. The size can be measured using one of the
    available effect size statistics, such as
    Cohen's d.
  • When is it appropriate to use the t-test?
  • If the samples come from a normal or
    pseudo-normal distribution (i.e., the test set
    has at least 30 examples in it; more than 300
    examples are necessary if we are running 10-fold
    cross-validation experiments).
  • If the samples were selected at random. (Can we
    really know?)
  • If the two populations have equal variances.
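
To make the distinction between significance and effect size concrete, here is a minimal sketch using only the Python standard library. The per-fold accuracies are invented for illustration, the paired form of the t-test is an assumption that follows from the matched-samples note, and Cohen's d is computed with a pooled standard deviation, which is one common variant among several. The critical value 2.262 is the standard two-sided 5% value for 9 degrees of freedom.

```python
import math
import statistics

# Hypothetical per-fold accuracies of two classifiers on the same 10 folds.
acc_a = [0.82, 0.85, 0.80, 0.83, 0.86, 0.81, 0.84, 0.82, 0.85, 0.83]
acc_b = [0.79, 0.83, 0.79, 0.80, 0.84, 0.80, 0.81, 0.80, 0.82, 0.81]


def paired_t_statistic(x, y):
    """t statistic of the paired differences (n - 1 degrees of freedom)."""
    diffs = [a - b for a, b in zip(x, y)]
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / std_err


def cohens_d(x, y):
    """Cohen's d with a pooled standard deviation (equal sample sizes)."""
    pooled_sd = math.sqrt((statistics.variance(x) + statistics.variance(y)) / 2)
    return (statistics.mean(x) - statistics.mean(y)) / pooled_sd


# Two-sided critical value for alpha = 0.05 and 9 degrees of freedom.
T_CRITICAL_DF9 = 2.262

t_stat = paired_t_statistic(acc_a, acc_b)
d = cohens_d(acc_a, acc_b)
significant = abs(t_stat) > T_CRITICAL_DF9

print(f"t = {t_stat:.2f}, significant at 0.05: {significant}")
print(f"Cohen's d = {d:.2f}")
```

The t statistic answers whether a difference exists; d expresses how large it is relative to the spread of the scores, so the two should be reported together.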

6
Overview of the Different Tests II
  • Comparison of two algorithms on a single domain
    (II)
  • Non-Parametric:
  • McNemar's Test
  • t-test versus McNemar: McNemar's test does not
    make the kind of assumptions made by the t-test,
    and it compares well to it in terms of Type I
    error and power. However, McNemar's test can be
    applied only when the number of disagreements
    between the two classifiers is large (generally
    > 20). If not, the Sign test can be applied
    instead (see next slide).
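
As a rough sketch of how McNemar's test is computed from the disagreement counts, assuming the usual continuity-corrected chi-square form with one degree of freedom: the counts below are invented, and `mcnemar_applicable` and `mcnemar_test` are hypothetical helper names. The threshold of 20 disagreements is the rule of thumb stated above.

```python
import math


def mcnemar_applicable(n01, n10, threshold=20):
    """Rule of thumb: the test needs more than `threshold` disagreements."""
    return n01 + n10 > threshold


def mcnemar_test(n01, n10):
    """McNemar's test with continuity correction.

    n01: examples misclassified by the first classifier only
    n10: examples misclassified by the second classifier only
    Returns (chi-square statistic, p-value).
    """
    statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
    # Chi-square survival function for 1 degree of freedom:
    # P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical counts: 25 test examples only the first classifier got
# wrong, 10 that only the second classifier got wrong.
n01, n10 = 25, 10
if mcnemar_applicable(n01, n10):
    statistic, p_value = mcnemar_test(n01, n10)
    print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
else:
    print("Too few disagreements; consider the sign test instead.")
```

Only the examples on which the classifiers disagree enter the statistic; examples both get right or both get wrong carry no information about which one is better.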

7
Overview of the Different Tests III
  • Comparison of two algorithms on multiple domains
  • Non-Parametric:
  • Sign Test
  • Wilcoxon's Signed-Rank Test
  • The Sign Test versus Wilcoxon's Signed-Rank Test:
    Wilcoxon's Signed-Rank Test is more powerful than
    the Sign Test and is generally preferred.
    However, the Sign Test is very simple.
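
The difference in power can be seen on a small example. The sketch below is an illustration only: the accuracies are invented, the Wilcoxon p-value uses the normal approximation without a tie correction (rough when there are few domains), and all function names are hypothetical helpers.

```python
import math

# Hypothetical accuracies (%) of two classifiers on the same 12 domains.
scores_a = [82.0, 74.5, 93.0, 66.0, 73.5, 85.0, 64.5, 73.5, 85.5, 82.0, 87.0, 74.5]
scores_b = [80.0, 75.0, 90.0, 65.0, 70.0, 85.0, 60.0, 72.0, 88.0, 78.0, 82.0, 69.0]


def nonzero_diffs(x, y):
    """Paired differences, dropping domains where the scores are equal."""
    return [a - b for a, b in zip(x, y) if a != b]


def sign_test_p(diffs):
    """Two-sided exact sign test: only the direction of each difference counts."""
    n = len(diffs)
    k = min(sum(d > 0 for d in diffs), sum(d < 0 for d in diffs))
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def wilcoxon_signed_rank(diffs):
    """Return (W, p) using the normal approximation, no tie correction."""
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return w, math.erfc(abs(w - mean) / sd / math.sqrt(2))


diffs = nonzero_diffs(scores_a, scores_b)
sign_p = sign_test_p(diffs)
w, wilcoxon_p = wilcoxon_signed_rank(diffs)
print(f"sign test p = {sign_p:.4f}")
print(f"Wilcoxon W = {w}, p = {wilcoxon_p:.4f}")
```

On this data the sign test sees only 9 wins against 2 losses, while Wilcoxon's test also uses the fact that the two losses are among the smaller differences, which is where its extra power comes from.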

8
Overview of the Different Tests IV
  • Comparison of multiple algorithms on multiple
    domains
  • Parametric: One-Way Repeated Measures ANOVA,
    followed by an appropriate post-hoc test (e.g.,
    Tukey, Dunnett, Bonferroni, or Bonferroni-Dunn).
  • Non-Parametric: Friedman's Test, followed by an
    appropriate post-hoc test (e.g., Nemenyi, Hommel,
    Holm, or Hochberg).
  • ANOVA versus Friedman: More often than not, the
    assumptions required by the ANOVA test cannot be
    ascertained. Therefore, the Friedman Test is
    usually preferred.
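
A minimal sketch of the non-parametric route, under several assumptions: the accuracies are invented, the Friedman statistic is the plain chi-square form without a tie correction, the post-hoc step compares each algorithm with a control (column 0) through a z statistic on average-rank differences, and Holm's step-down procedure adjusts for the multiple comparisons. The function names are hypothetical helpers, and the chi-square approximation is rough for so few domains.

```python
import math

# Hypothetical accuracies: one row per domain, one column per algorithm.
scores = [
    [0.85, 0.80, 0.78],
    [0.90, 0.88, 0.84],
    [0.70, 0.72, 0.65],
    [0.81, 0.79, 0.75],
    [0.93, 0.90, 0.91],
    [0.77, 0.74, 0.70],
    [0.68, 0.66, 0.60],
    [0.88, 0.85, 0.86],
]

# Chi-square critical values at alpha = 0.05, keyed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def rank_row(row):
    """Rank one domain: rank 1 is the highest score; ties share the average."""
    order = sorted(range(len(row)), key=lambda j: -row[j])
    ranks = [0.0] * len(row)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and row[order[end + 1]] == row[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def friedman(table):
    """Return (average rank per algorithm, Friedman chi-square statistic)."""
    n, k = len(table), len(table[0])
    rank_rows = [rank_row(row) for row in table]
    avg = [sum(r[j] for r in rank_rows) / n for j in range(k)]
    stat = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg) - k * (k + 1) ** 2 / 4)
    return avg, stat


def holm(p_values, alpha=0.05):
    """Holm's step-down procedure; one reject flag per input p-value."""
    m = len(p_values)
    reject = [False] * m
    for step, i in enumerate(sorted(range(m), key=lambda i: p_values[i])):
        if p_values[i] > alpha / (m - step):
            break  # this and all larger p-values are retained
        reject[i] = True
    return reject


n, k = len(scores), len(scores[0])
avg_ranks, stat = friedman(scores)
omnibus_significant = stat > CHI2_CRITICAL_05[k - 1]

# Post-hoc, run only if the omnibus test rejects: each algorithm vs. column 0.
std_err = math.sqrt(k * (k + 1) / (6 * n))
p_vs_control = [
    math.erfc(abs(avg_ranks[j] - avg_ranks[0]) / std_err / math.sqrt(2))
    for j in range(1, k)
]
rejected = holm(p_vs_control)
print(avg_ranks, stat, omnibus_significant, rejected)
```

Friedman's test only says that at least one algorithm differs; the post-hoc step is what identifies which pairs differ while keeping the overall error rate under control.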

9
Practical Concerns
  • Section 6.8 of my book (with M. Shah) shows how
    to use the freely downloadable R Statistical
    Software, in order to compute most of the tests
    discussed on the previous slides.