Title: Statistical Comparison of Two Learning Algorithms
1. Statistical Comparison of Two Learning Algorithms
- Presented by Payam Refaeilzadeh
2. Overview
- How can we tell if one algorithm learns better than another?
- Design an experiment to measure the accuracy of the two algorithms.
- Run multiple trials.
- Compare the samples, not just their means: perform a statistically sound test on the two samples.
- Is any observed difference significant? Is it due to a true difference between the algorithms, or to natural variation in the measurements?
3. Statistical Hypothesis Testing
- Statistical hypothesis: a statement about the parameters of one or more populations.
- Hypothesis testing: a procedure for deciding whether to accept or reject the hypothesis:
  - Identify the parameter of interest.
  - State a null hypothesis, H0.
  - Specify an alternate hypothesis, H1.
  - Choose a significance level α.
  - State an appropriate test statistic.
4. Statistical Hypothesis Testing (cont.)
- Null hypothesis (H0): a statement presumed to be true until statistical evidence shows otherwise.
  - Usually specifies an exact value for a parameter.
  - Example: H0: µ = 30 kg (a worked test of this hypothesis is sketched below).
- Alternate hypothesis (H1): accepted if the null hypothesis is rejected.
- Test statistic: a particular statistic calculated from the measurements of a random sample / experiment.
  - A test statistic is assumed to follow a particular distribution (normal, t, chi-square, etc.).
  - That distribution can then be used to test the significance of the calculated test statistic.
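As a minimal illustration of these steps, the following sketch tests the slide's example hypothesis H0: µ = 30 kg with a one-sample t-test; the weight measurements are invented purely for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of weights in kg (data invented for illustration).
weights = np.array([29.1, 31.4, 30.2, 28.7, 32.0, 30.9, 29.5, 31.1])

alpha = 0.05                                            # significance level
t0, p_value = stats.ttest_1samp(weights, popmean=30.0)  # H0: mu = 30 kg

print(f"t0 = {t0:.3f}, p = {p_value:.3f}")
if p_value < alpha:
    print("Reject H0: the mean weight differs from 30 kg.")
else:
    print("Fail to reject H0.")
```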
5. Error in Hypothesis Testing
- A Type I error occurs when H0 is rejected but is in fact true.
  - P(Type I error) = α, the significance level.
- A Type II error occurs when we fail to reject H0 but it is in fact false.
  - P(Type II error) = β.
- Power = 1 - β: the probability of correctly rejecting H0 (simulated below).
- Power measures the ability to distinguish between the two populations.
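To make α, β, and power concrete, here is a small Monte Carlo sketch; the distribution parameters (µ = 0 under H0, µ = 0.5 under the alternative, unit variance) are assumptions chosen only for illustration. It estimates the rejection rate of a one-sample t-test when H0 is true (≈ α) and when H0 is false (≈ power = 1 - β).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000

def reject_rate(true_mean):
    """Fraction of simulated experiments in which H0: mu = 0 is rejected."""
    rejections = 0
    for _ in range(trials):
        sample = rng.normal(loc=true_mean, scale=1.0, size=n)
        _, p = stats.ttest_1samp(sample, popmean=0.0)
        if p < alpha:
            rejections += 1
    return rejections / trials

print("Type I error rate (H0 true, mu = 0):", reject_rate(0.0))  # ~ alpha
print("Power (H0 false, mu = 0.5):         ", reject_rate(0.5))  # 1 - beta
```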
6. Paired t-Test
- Collect data in pairs.
- Example: given a training set DTrain and a test set DTest, train both learning algorithms on DTrain and then test their accuracies on DTest.
- Suppose n paired measurements have been made.
- Assume:
  - The measurements are independent.
  - The measurements for each algorithm follow a normal distribution.
- Under these assumptions, the test statistic T0 follows a t-distribution with n-1 degrees of freedom.
7. Paired t-Test (cont.)

Trial   Algorithm 1 Accuracy (X1)   Algorithm 2 Accuracy (X2)
1       X11                         X21
2       X12                         X22
...     ...                         ...
n       X1n                         X2n

Assume X1 follows N(µ1, σ1²) and X2 follows N(µ2, σ2²). Let µD = µ1 - µ2 and Di = X1i - X2i for i = 1, 2, ..., n.

Null hypothesis: H0: µD = Δ0.

Test statistic:

$t_0 = \frac{\bar{D} - \Delta_0}{S_D / \sqrt{n}}$

where $\bar{D}$ and $S_D$ are the sample mean and sample standard deviation of the differences Di.

Rejection criteria:
- H1: µD ≠ Δ0, reject H0 if |t0| > t(α/2, n-1)
- H1: µD > Δ0, reject H0 if t0 > t(α, n-1)
- H1: µD < Δ0, reject H0 if t0 < -t(α, n-1)
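A minimal sketch of the paired t-test above: it computes t0 directly from the differences Di (with Δ0 = 0) and cross-checks the result against scipy.stats.ttest_rel. The accuracy values are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical paired accuracies from n = 10 trials (invented values).
x1 = np.array([0.81, 0.79, 0.84, 0.80, 0.83, 0.78, 0.82, 0.85, 0.80, 0.81])
x2 = np.array([0.78, 0.77, 0.80, 0.79, 0.81, 0.75, 0.80, 0.82, 0.78, 0.79])

d = x1 - x2                                   # Di = X1i - X2i
n = len(d)
t0 = d.mean() / (d.std(ddof=1) / np.sqrt(n))  # test statistic with Delta0 = 0

# Two-sided test at alpha = 0.05: reject H0 if |t0| > t(alpha/2, n-1).
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(f"t0 = {t0:.3f}, critical value = {t_crit:.3f}")

# Cross-check with SciPy's built-in paired t-test.
print(stats.ttest_rel(x1, x2))
```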
8. Cross-Validated t-Test
- Paired t-test on the 10 paired accuracies obtained from 10-fold cross validation (sketched below).
- Advantages:
  - Large training set size.
  - Most powerful (Dietterich, 1998).
- Disadvantages:
  - Accuracy results are not independent (the training sets overlap).
  - Somewhat elevated probability of Type I error (Dietterich, 1998).
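A sketch of the 10-fold cross-validated t-test, assuming scikit-learn; the synthetic dataset and the two classifiers are stand-ins chosen for illustration. The key point is that both algorithms are trained and tested on the same 10 folds, so the per-fold accuracies form matched pairs for a paired t-test.

```python
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)  # stand-in dataset
clf_a, clf_b = GaussianNB(), DecisionTreeClassifier(random_state=0)

acc_a, acc_b = [], []
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for train_idx, test_idx in kfold.split(X, y):
    # Train and test both algorithms on the *same* fold so the pair is matched.
    acc_a.append(clf_a.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx]))
    acc_b.append(clf_b.fit(X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx]))

t0, p = stats.ttest_rel(acc_a, acc_b)   # paired t-test over 10 folds, df = 9
print(f"t0 = {t0:.3f}, p = {p:.3f}")
```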
9. 5x2 Cross-Validated t-Test
- Run 2-fold cross validation 5 times (sketched below).
- Use the results from the first of the five replications to estimate the mean difference.
- Use the results from all folds to estimate the variance.
- Advantage:
  - Lowest Type I error (Dietterich, 1998).
- Disadvantage:
  - Not as powerful as the 10-fold cross-validated t-test (Dietterich, 1998).
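A sketch of Dietterich's 5x2cv statistic under the scheme above: the numerator is the difference from the very first fold, the denominator pools the per-replication variance estimates, and the result is referred to a t distribution with 5 degrees of freedom. The error_diff helper is a hypothetical placeholder, and random half/half splits are used as an approximation to the 2 folds.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split

def error_diff(X_tr, y_tr, X_te, y_te, clf_a, clf_b):
    """Hypothetical helper: accuracy(A) - accuracy(B) for one train/test split."""
    a = clf_a.fit(X_tr, y_tr).score(X_te, y_te)
    b = clf_b.fit(X_tr, y_tr).score(X_te, y_te)
    return a - b

def five_by_two_cv_t(X, y, clf_a, clf_b):
    variances, p11 = [], None
    for rep in range(5):                   # 5 replications of 2-fold CV
        X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=rep)
        p1 = error_diff(X1, y1, X2, y2, clf_a, clf_b)   # fold 1 trains
        p2 = error_diff(X2, y2, X1, y1, clf_a, clf_b)   # fold 2 trains
        p_bar = (p1 + p2) / 2
        variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)
        if rep == 0:
            p11 = p1                       # difference from the very first fold
    t = p11 / np.sqrt(np.mean(variances))  # compare with t(alpha, df=5)
    p_value = 2 * stats.t.sf(abs(t), df=5)
    return t, p_value
```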
10. Re-sampled t-Test
- Randomly divide the data into train / test sets (usually 2/3 vs. 1/3).
- Run multiple trials (usually 30).
- Perform a paired t-test between the trial accuracies.
- This test has a very high probability of Type I error and should never be used.
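For concreteness, a sketch of the re-sampled procedure described above (the classifiers are placeholders); as the slide stresses, it is shown only to make the flawed construction explicit, not as a recommended test.

```python
import numpy as np
from scipy import stats
from sklearn.model_selection import train_test_split

def resampled_t_test(X, y, clf_a, clf_b, trials=30):
    """Paired t-test over repeated random 2/3 train, 1/3 test splits.
    NOTE: per the slide, this test has a very high Type I error rate."""
    diffs = []
    for seed in range(trials):
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3,
                                                  random_state=seed)
        diffs.append(clf_a.fit(X_tr, y_tr).score(X_te, y_te)
                     - clf_b.fit(X_tr, y_tr).score(X_te, y_te))
    d = np.asarray(diffs)
    t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return t0, 2 * stats.t.sf(abs(t0), df=len(d) - 1)
```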
11. Calibrated Tests
- Bouckaert, ICML 2003.
- It is very difficult to estimate the true degrees of freedom because the independence assumptions are being violated.
- Instead of correcting the mean difference, calibrate the degrees of freedom.
- Recommendation: use 10 times repeated 10-fold cross validation with 10 degrees of freedom (sketched below).
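One plausible reading of this recommendation, sketched below under stated assumptions: compute the usual paired statistic over all 100 per-fold accuracy differences from 10 repetitions of 10-fold cross validation, but refer it to a t distribution with only 10 (calibrated) degrees of freedom rather than n - 1 = 99. Collecting the 100 differences is left to the caller; this is a sketch, not Bouckaert's exact procedure.

```python
import numpy as np
from scipy import stats

def calibrated_t_test(diffs, df=10):
    """diffs: the 100 per-fold accuracy differences from 10x 10-fold CV.
    The statistic is the usual paired t over all differences, but the
    p-value is taken from a t distribution with the calibrated df = 10."""
    d = np.asarray(diffs)
    t0 = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    p_value = 2 * stats.t.sf(abs(t0), df=df)
    return t0, p_value
```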
12. References
- R. R. Bouckaert. Choosing between two learning algorithms based on calibrated tests. ICML 2003, pp. 51-58.
- T. G. Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural Computation, 10:1895-1924, 1998.
- D. C. Montgomery et al. Engineering Statistics. 2nd Edition. Wiley Press, 2001.