Evaluation of Learning Models - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Evaluation of Learning Models


1
Evaluation of Learning Models
  • Literature
  • T. Mitchel, Machine Learning, chapter 5
  • I.H. Witten and E. Frank, Data Mining, chapter 5

2
Fayyad's KDD Methodology
(Figure: diagram of the KDD process, starting from data.)
3
Contents
  • Estimation of errors for one hypothesis
  • Comparison of hypotheses
  • Comparison of learning models
  • Practical aspects

4
Two definitions of error
  • True error of hypothesis h with respect to target
    function f and distribution D is the probability
    that h will misclassify an instance drawn at
    random according to D:
    errorD(h) ≡ Pr_{x ∈ D} [ f(x) ≠ h(x) ]

5
Two definitions of error (2)
  • Sample error of hypothesis h with respect to
    target function f and data sample S is the
    proportion of examples h misclassifies:
    errorS(h) ≡ (1/n) Σ_{x ∈ S} δ(f(x), h(x))
    where δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0
    otherwise (see the sketch below).
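A minimal sketch of this definition in Python; the hypothesis, data, and function names here are illustrative, not from the slides:

```python
# Sample error: fraction of examples in S that hypothesis h misclassifies.
def sample_error(h, xs, ys):
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

# Toy example: a threshold hypothesis on one-dimensional inputs.
h = lambda x: 1 if x > 0.5 else 0
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]                  # h gets the second example wrong
print(sample_error(h, xs, ys))     # 0.25
```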

6
Two definitions of error (3)
  • How well does errorS(h) estimate errorD(h)?

7
Problems estimating error
  • 1. Bias: If S is the training set, errorS(h) is
    optimistically biased:
    bias ≡ E[errorS(h)] − errorD(h)
    For an unbiased estimate, h and S must be chosen
    independently.
  • 2. Variance: Even with an unbiased S, errorS(h)
    may still vary from errorD(h).

8
Example
  • Hypothesis h misclassifies 12 of the 40 examples
    in S, so errorS(h) = 12/40 = 0.30.
    What is errorD(h)?

9
Estimators
  • Experiment:
  • 1. Choose sample S of size n according to
    distribution D
  • 2. Measure errorS(h)
  • errorS(h) is a random variable (i.e., the result
    of an experiment)
  • errorS(h) is an unbiased estimator for errorD(h)
  • Given observed errorS(h), what can we conclude
    about errorD(h)?

10
Confidence intervals
  • If
  • S contains n examples, drawn independently of h
    and each other
  • n ≥ 30
  • Then
  • With approximately 95% probability, errorD(h)
    lies in the interval
    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )
    (see the sketch below)
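A minimal sketch of this interval, evaluated on the earlier 12-out-of-40 example; the function name error_interval is illustrative:

```python
from math import sqrt

def error_interval(error_s, n, z=1.96):
    """Approximate confidence interval for errorD(h) around errorS(h)."""
    se = sqrt(error_s * (1 - error_s) / n)    # estimated standard deviation
    return error_s - z * se, error_s + z * se

# h misclassifies 12 of 40 examples: errorS(h) = 0.30.
print(error_interval(12 / 40, 40))            # about (0.158, 0.442)
```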

11
Confidence intervals (2)
  • If
  • S contains n examples, drawn independently of h
    and each other
  • n ≥ 30
  • Then
  • With approximately N% probability, errorD(h) lies
    in the interval
    errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )
    where
    N%:  50    68    80    90    95    98    99
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

12
errorS(h) is a random variable
  • Rerun the experiment with different randomly
    drawn S (of size n)
  • Probability of observing r misclassified examples:
    P(r) = ( n! / (r!(n − r)!) ) · errorD(h)^r · (1 − errorD(h))^(n − r)

13
Binomial probability distribution
  • Probability P(r) of r heads in n coin flips, if
    p = Pr(heads):
    P(r) = ( n! / (r!(n − r)!) ) · p^r · (1 − p)^(n − r)
    (see the sketch below)

(Figure: Binomial distribution for n = 10 and p = 0.3)
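A minimal sketch of this formula in Python (math.comb supplies the binomial coefficient), reproducing the distribution plotted on the slide:

```python
from math import comb

def binomial_p(r, n, p):
    """Probability of exactly r heads in n flips with Pr(heads) = p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The distribution shown on the slide: n = 10, p = 0.3.
for r in range(11):
    print(r, round(binomial_p(r, 10, 0.3), 3))
```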
14
Binomial probability distribution (2)
  • Expected, or mean value of X, E[X], is E[X] = np
  • Variance of X is Var(X) = np(1 − p)
  • Standard deviation of X, σX, is σX = √( np(1 − p) )

15
Normal distribution approximates binomial
  • errorS(h) follows a Binomial distribution, with
  • mean μ_errorS(h) = errorD(h)
  • standard deviation
    σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )
  • Approximate this by a Normal distribution with
  • mean μ_errorS(h) = errorD(h)
  • standard deviation
    σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )

16
Normal probability distribution
  • Density: p(x) = (1 / √(2πσ²)) e^( −(x − μ)² / (2σ²) )
  • The probability that X will fall into the
    interval (a, b) is given by ∫ from a to b of p(x) dx
  • Expected, or mean value of X, E[X], is E[X] = μ
  • Variance of X is Var(X) = σ²
  • Standard deviation of X, σX, is σX = σ

17
Normal probability distribution
  • 80% of the area (probability) lies in μ ± 1.28σ
  • N% of the area (probability) lies in μ ± zN σ
    N%:  50    68    80    90    95    98    99
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

18
Confidence intervals, more correctly
  • If
  • S contains n examples, drawn independently of h
    and each other
  • n ≥ 30
  • Then
  • with approximately 95% probability, errorS(h)
    lies in the interval
    errorD(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )
  • and errorD(h) approximately lies in the interval
    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

19
Central Limit Theorem
  • Consider a set of independent, identically
    distributed random variables Y1...Yn, all
    governed by an arbitrary probability distribution
    with mean μ and finite variance σ². Define the
    sample mean
    Ȳ ≡ (1/n) Σ_{i=1}^{n} Yi
  • Central Limit Theorem: As n → ∞, the distribution
    governing Ȳ approaches a Normal distribution,
    with mean μ and variance σ²/n (see the sketch
    below).
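A minimal simulation sketch of the theorem: means of samples drawn from a decidedly non-normal (uniform) distribution concentrate around μ with variance σ²/n. The sample size and number of repetitions are illustrative:

```python
import random
from statistics import mean, variance

random.seed(0)
n = 50                                    # size of each sample
sample_means = [mean(random.random() for _ in range(n))
                for _ in range(10_000)]

print(mean(sample_means))       # close to mu = 0.5
print(variance(sample_means))   # close to sigma^2 / n = (1/12) / 50 ~ 0.00167
```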

20
Difference between hypotheses
  • Test h1 on sample S1, test h2 on S2
  • 1. Pick the parameter to estimate:
    d ≡ errorD(h1) − errorD(h2)
  • 2. Choose an estimator:
    d̂ ≡ errorS1(h1) − errorS2(h2)
  • 3. Determine the probability distribution that
    governs the estimator: d̂ is approximately Normal,
    with mean d and standard deviation
    σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )

21
Difference between hypotheses (2)
  • 4. Find the interval (L, U) such that N% of the
    probability mass falls in the interval:
    d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
    (see the sketch below)
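A minimal sketch of steps 1-4; the error rates and sample sizes in the example call are hypothetical:

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Approximate interval for errorD(h1) - errorD(h2) from separate test sets."""
    d_hat = e1 - e2
    se = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * se, d_hat + z * se

# h1 misclassifies 30 of 100 examples, h2 misclassifies 20 of 100 (hypothetical).
print(difference_interval(0.30, 100, 0.20, 100))
```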

22
Paired t test to compare hA, hB
  • 1. Partition the data into k disjoint test sets
    T1, T2, ..., Tk of equal size, where this size is
    at least 30.
  • 2. For i from 1 to k, do
    δi ← errorTi(hA) − errorTi(hB)
  • 3. Return the value δ̄, where
    δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

23
Paired t test to compare hA, hB (2)
  • N% confidence interval estimate for d:
    δ̄ ± t_{N,k−1} s_δ̄
    where
    s_δ̄ ≡ √( (1 / (k(k − 1))) Σ_{i=1}^{k} (δi − δ̄)² )
  • Note: δ̄ is approximately normally distributed
    (see the sketch below)
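A minimal sketch of the paired test from the two slides above; the per-test-set error values and the choice of k = 5 are hypothetical, and 2.776 is the tabulated t value for 95% confidence with k − 1 = 4 degrees of freedom:

```python
from math import sqrt

err_a = [0.20, 0.24, 0.18, 0.22, 0.25]   # errorTi(hA), hypothetical values
err_b = [0.18, 0.21, 0.17, 0.20, 0.22]   # errorTi(hB), hypothetical values

k = len(err_a)
deltas = [a - b for a, b in zip(err_a, err_b)]
d_bar = sum(deltas) / k                                   # mean difference
s_d = sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))

t_95 = 2.776                              # t_{N,k-1} for N = 95%, k - 1 = 4
print(d_bar - t_95 * s_d, d_bar + t_95 * s_d)
```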

24
Comparing learning algorithms LA and LB
  • What we'd like to estimate:
    E_{S ⊂ D} [ errorD(LA(S)) − errorD(LB(S)) ]
    where L(S) is the hypothesis output by learner L
    using training set S
  • I.e., the expected difference in true error
    between hypotheses output by learners LA and LB,
    when trained using randomly selected training
    sets S drawn according to distribution D.

25
Comparing learning algorithms LA and LB (2)
  • But, given limited data D0, what is a good
    estimator?
  • Could partition D0 into training set S0 and test
    set T0, and measure
    errorT0(LA(S0)) − errorT0(LB(S0))
  • Even better, repeat this many times and average
    the results

26
Comparing learning algorithms LA and LB (3)
k-fold cross validation
  • 1. Partition data D0 into k disjoint test sets
    T1, T2, ..., Tk of equal size, where this size is
    at least 30.
  • 2. For i from 1 to k, do: use Ti for the test
    set, and the remaining data for training set Si
    δi ← errorTi(LA(Si)) − errorTi(LB(Si))
  • 3. Return the average δ̄ of the per-fold
    differences measured on the test sets (see the
    sketch below)
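A minimal sketch of this k-fold comparison using scikit-learn (assumed available); the dataset and the two learners, a decision tree and naive Bayes, are illustrative stand-ins for LA and LB:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

deltas = []
for train_idx, test_idx in kf.split(X):
    err_a = 1 - DecisionTreeClassifier(random_state=0).fit(
        X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    err_b = 1 - GaussianNB().fit(
        X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    deltas.append(err_a - err_b)       # errorTi(LA(Si)) - errorTi(LB(Si))

print(np.mean(deltas))                 # estimated expected difference
```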

27
Practical Aspects: A note on parameter tuning
  • It is important that the test data is not used in
    any way to create the classifier
  • Some learning schemes operate in two stages
  • Stage 1 builds the basic structure
  • Stage 2 optimizes parameter settings
  • The test data can't be used for parameter tuning!
  • Proper procedure uses three sets: training data,
    validation data, and test data
  • Validation data is used to optimize parameters
    (see the sketch below)
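A minimal sketch of the three-set procedure with scikit-learn (assumed available); the dataset, the split proportions, and the tuned parameter (a decision tree's max_depth) are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Split into training, validation, and test data.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)

# Stage 2: tune the parameter on the validation data only.
best_score, best_depth = max(
    (DecisionTreeClassifier(max_depth=d, random_state=0)
        .fit(X_train, y_train).score(X_val, y_val), d)
    for d in (1, 2, 3, 5, 8))

# The test data is used exactly once, at the very end.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X_train, y_train)
print("chosen max_depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```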

28
Holdout estimation, stratification
  • What shall we do if the amount of data is
    limited?
  • The holdout method reserves a certain amount for
  • testing and uses the remainder for training
  • Usually one third for testing, the rest for
    training
  • Problem: the samples might not be representative
  • Example: a class might be missing in the test data
  • Advanced version uses stratification (see the
    sketch below)
  • Ensures that each class is represented with
    approximately equal proportions in both subsets
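A minimal sketch of a stratified holdout split with scikit-learn (assumed available); the dataset and the one-third test fraction are illustrative:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Stratification keeps the class proportions roughly equal in both subsets.
print(Counter(y_train), Counter(y_test))
```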

29
More on cross-validation
  • Standard method for evaluation: stratified
    ten-fold cross-validation
  • Why ten? Extensive experiments have shown that
    this is the best choice to get an accurate
    estimate
  • There is also some theoretical evidence for this
  • Stratification reduces the estimate's variance
  • Even better: repeated stratified cross-validation
  • E.g. ten-fold cross-validation is repeated ten
    times and results are averaged (reduces the
    variance; see the sketch below)
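A minimal sketch of repeated stratified ten-fold cross-validation with scikit-learn (assumed available); the dataset and learner are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)   # 100 accuracy estimates
print(scores.mean(), scores.std())
```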

30
Issues in evaluation
  • Statistical reliability of estimated differences
    in performance
  • Choice of performance measure
  • Number of correct classifications
  • Accuracy of probability estimates
  • Error in numeric predictions
  • Costs assigned to different types of errors
  • Many practical applications involve costs

31
Counting the costs
  • In practice, different types of classification
    errors often incur different costs
  • Examples
  • Predicting when cows are in heat (in estrus)
  • "Not in estrus" is correct 97% of the time
  • Loan decisions
  • Oil-slick detection
  • Fault diagnosis
  • Promotional mailing

32
Taking costs into account
  • The confusion matrix: predicted vs. actual class
    (true/false positives and negatives; a
    cost-weighted sketch follows below)
  • There are many other types of costs!
  • E.g. costs of collecting training data
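A minimal sketch of weighting the entries of a confusion matrix with a cost matrix; the labels and the costs (a false negative five times as expensive as a false positive) are hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted
costs = np.array([[0, 1],               # false positive costs 1
                  [5, 0]])              # false negative costs 5
print(cm)
print("total cost:", (cm * costs).sum())
```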

33
Lift charts
  • In practice, costs are rarely known
  • Decisions are usually made by comparing possible
    scenarios
  • Example: promotional mailout
  • Situation 1: classifier predicts that 0.1% of all
    households will respond
  • Situation 2: classifier predicts that 0.4% of the
    10,000 most promising households will respond
  • A lift chart allows for a visual comparison

34
Generating a lift chart
  • Instances are sorted according to their predicted
    probability of being a true positive:
    Rank  Predicted probability  Actual class
    1     0.95                   Yes
    2     0.93                   Yes
    3     0.93                   No
    4     0.88                   Yes
  • In a lift chart, the x axis is sample size and
    the y axis is the number of true positives (see
    the sketch below)
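A minimal sketch of the cumulative counts behind a lift chart, using the four ranked instances from the table above:

```python
ranked = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes")]
ranked.sort(key=lambda t: t[0], reverse=True)   # sort by predicted probability

true_positives = 0
for size, (prob, actual) in enumerate(ranked, start=1):
    true_positives += (actual == "Yes")
    print(size, true_positives)   # x = sample size, y = true positives so far
```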

35
A hypothetical lift chart
36
Summary of measures
37
Model selection criteria
  • Model selection criteria attempt to find a good
    compromise between:
  • A. The complexity of a model
  • B. Its prediction accuracy on the training data
  • Reasoning: a good model is a simple model that
    achieves high accuracy on the given data
  • Also known as Occam's Razor: the best theory is
    the smallest one that describes all the facts

38
Warning
  • Suppose you are gathering hypotheses that have a
    probability of 95% of having an error level below
    10%
  • What if you have found 100 hypotheses satisfying
    this condition?
  • Then the probability that all have an error below
    10% is equal to (0.95)^100 ≈ 0.006, corresponding
    to 0.6%. So, the probability of having at least
    one hypothesis with an error above 10% is about
    99.4%! (See the sketch below.)
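A minimal sketch of the calculation behind this warning:

```python
p_single = 0.95       # each hypothesis: 95% chance its error is below 10%
m = 100               # number of hypotheses found
p_all_ok = p_single ** m
print(p_all_ok)       # about 0.006
print(1 - p_all_ok)   # about 0.994: at least one hypothesis with error above 10%
```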