Evaluation of Learning Models - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Evaluation of Learning Models


1
Evaluation of Learning Models
  • Literature
  • T. Mitchel, Machine Learning, chapter 5
  • I.H. Witten and E. Frank, Data Mining, chapter 5

2
Fayyad's KDD Methodology
(Figure: diagram of the KDD process, starting from data.)
3
Contents
  • Estimation of errors for one hypothesis
  • Comparison of hypotheses
  • Comparison of learning models
  • Practical aspects

4
Two definitions of error
  • True error of hypothesis h with respect to target
    function f and distribution D is the probability
    that h will misclassify an instance drawn at
    random according to D:
    errorD(h) ≡ Pr_{x ∈ D} [ f(x) ≠ h(x) ]

5
Two definitions of error (2)
  • Sample error of hypothesis h with respect to
    target function f and data sample S is the
    proportion of examples h misclassifies:
    errorS(h) ≡ (1/n) Σ_{x ∈ S} δ(f(x), h(x))
    where δ(f(x), h(x)) is 1 if f(x) ≠ h(x), and 0
    otherwise (see the sketch below).
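A minimal sketch of this definition in Python; the hypothesis, data, and function names here are illustrative, not from the slides:

```python
# Sample error: fraction of examples in S that hypothesis h misclassifies.
def sample_error(h, xs, ys):
    mistakes = sum(1 for x, y in zip(xs, ys) if h(x) != y)
    return mistakes / len(xs)

# Toy example: a threshold hypothesis on one-dimensional inputs.
h = lambda x: 1 if x > 0.5 else 0
xs = [0.1, 0.4, 0.6, 0.9]
ys = [0, 1, 1, 1]                  # h gets the second example wrong
print(sample_error(h, xs, ys))     # 0.25
```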

6
Two definitions of error (3)
  • How well does errorS(h) estimate errorD(h)?

7
Problems estimating error
  • 1. Bias: If S is the training set, errorS(h) is
    optimistically biased:
    bias ≡ E[errorS(h)] − errorD(h)
    For an unbiased estimate, h and S must be chosen
    independently.
  • 2. Variance: Even with an unbiased S, errorS(h)
    may still vary from errorD(h).

8
Example
  • Hypothesis h misclassifies 12 of the 40 examples
    in S, so errorS(h) = 12/40 = 0.30.
    What is errorD(h)?

9
Estimators
  • Experiment:
  • 1. Choose sample S of size n according to
    distribution D
  • 2. Measure errorS(h)
  • errorS(h) is a random variable (i.e., the result
    of an experiment)
  • errorS(h) is an unbiased estimator for errorD(h)
  • Given observed errorS(h), what can we conclude
    about errorD(h)?

10
Confidence intervals
  • If
  • S contains n examples, drawn independently of h
    and each other
  • n ≥ 30
  • Then
  • With approximately 95% probability, errorD(h)
    lies in the interval
    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )
    (see the sketch below)
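A minimal sketch of this interval, evaluated on the earlier 12-out-of-40 example; the function name error_interval is illustrative:

```python
from math import sqrt

def error_interval(error_s, n, z=1.96):
    """Approximate confidence interval for errorD(h) around errorS(h)."""
    se = sqrt(error_s * (1 - error_s) / n)    # estimated standard deviation
    return error_s - z * se, error_s + z * se

# h misclassifies 12 of 40 examples: errorS(h) = 0.30.
print(error_interval(12 / 40, 40))            # about (0.158, 0.442)
```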

11
Confidence intervals (2)
  • If
  • S contains n examples, drawn independently of h
    and each other
  • n ≥ 30
  • Then
  • With approximately N% probability, errorD(h) lies
    in the interval
    errorS(h) ± zN √( errorS(h)(1 − errorS(h)) / n )
    where
    N%:  50    68    80    90    95    98    99
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

12
errorS(h) is a random variable
  • Rerun the experiment with different randomly
    drawn S (of size n)
  • Probability of observing r misclassified examples:
    P(r) = ( n! / (r!(n − r)!) ) · errorD(h)^r · (1 − errorD(h))^(n − r)

13
Binomial probability distribution
  • Probability P(r) of r heads in n coin flips, if
    p = Pr(heads):
    P(r) = ( n! / (r!(n − r)!) ) · p^r · (1 − p)^(n − r)
    (see the sketch below)

(Figure: Binomial distribution for n = 10 and p = 0.3)
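A minimal sketch of this formula in Python (math.comb supplies the binomial coefficient), reproducing the distribution plotted on the slide:

```python
from math import comb

def binomial_p(r, n, p):
    """Probability of exactly r heads in n flips with Pr(heads) = p."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

# The distribution shown on the slide: n = 10, p = 0.3.
for r in range(11):
    print(r, round(binomial_p(r, 10, 0.3), 3))
```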
14
Binomial probability distribution (2)
  • Expected, or mean value of X, E[X], is E[X] = np
  • Variance of X is Var(X) = np(1 − p)
  • Standard deviation of X, σX, is σX = √( np(1 − p) )

15
Normal distribution approximates binomial
  • errorS(h) follows a Binomial distribution, with
  • mean μ_errorS(h) = errorD(h)
  • standard deviation
    σ_errorS(h) = √( errorD(h)(1 − errorD(h)) / n )
  • Approximate this by a Normal distribution with
  • mean μ_errorS(h) = errorD(h)
  • standard deviation
    σ_errorS(h) ≈ √( errorS(h)(1 − errorS(h)) / n )

16
Normal probability distribution
  • Density: p(x) = (1 / √(2πσ²)) e^( −(x − μ)² / (2σ²) )
  • The probability that X will fall into the
    interval (a, b) is given by ∫ from a to b of p(x) dx
  • Expected, or mean value of X, E[X], is E[X] = μ
  • Variance of X is Var(X) = σ²
  • Standard deviation of X, σX, is σX = σ

17
Normal probability distribution
  • 80% of the area (probability) lies in μ ± 1.28σ
  • N% of the area (probability) lies in μ ± zN σ
    N%:  50    68    80    90    95    98    99
    zN:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

18
Confidence intervals, more correctly
  • If
  • S contains n examples, drawn independently of h
    and each other
  • n ≥ 30
  • Then
  • with approximately 95% probability, errorS(h)
    lies in the interval
    errorD(h) ± 1.96 √( errorD(h)(1 − errorD(h)) / n )
  • and errorD(h) approximately lies in the interval
    errorS(h) ± 1.96 √( errorS(h)(1 − errorS(h)) / n )

19
Central Limit Theorem
  • Consider a set of independent, identically
    distributed random variables Y1...Yn, all
    governed by an arbitrary probability distribution
    with mean μ and finite variance σ². Define the
    sample mean
    Ȳ ≡ (1/n) Σ_{i=1}^{n} Yi
  • Central Limit Theorem: As n → ∞, the distribution
    governing Ȳ approaches a Normal distribution,
    with mean μ and variance σ²/n (see the sketch
    below).
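A minimal simulation sketch of the theorem: means of samples drawn from a decidedly non-normal (uniform) distribution concentrate around μ with variance σ²/n. The sample size and number of repetitions are illustrative:

```python
import random
from statistics import mean, variance

random.seed(0)
n = 50                                    # size of each sample
sample_means = [mean(random.random() for _ in range(n))
                for _ in range(10_000)]

print(mean(sample_means))       # close to mu = 0.5
print(variance(sample_means))   # close to sigma^2 / n = (1/12) / 50 ~ 0.00167
```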

20
Difference between hypotheses
  • Test h1 on sample S1, test h2 on S2
  • 1. Pick the parameter to estimate:
    d ≡ errorD(h1) − errorD(h2)
  • 2. Choose an estimator:
    d̂ ≡ errorS1(h1) − errorS2(h2)
  • 3. Determine the probability distribution that
    governs the estimator: d̂ is approximately Normal,
    with mean d and standard deviation
    σ_d̂ ≈ √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )

21
Difference between hypotheses (2)
  • 4. Find the interval (L, U) such that N% of the
    probability mass falls in the interval:
    d̂ ± zN √( errorS1(h1)(1 − errorS1(h1))/n1 + errorS2(h2)(1 − errorS2(h2))/n2 )
    (see the sketch below)
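A minimal sketch of steps 1-4; the error rates and sample sizes in the example call are hypothetical:

```python
from math import sqrt

def difference_interval(e1, n1, e2, n2, z=1.96):
    """Approximate interval for errorD(h1) - errorD(h2) from separate test sets."""
    d_hat = e1 - e2
    se = sqrt(e1 * (1 - e1) / n1 + e2 * (1 - e2) / n2)
    return d_hat - z * se, d_hat + z * se

# h1 misclassifies 30 of 100 examples, h2 misclassifies 20 of 100 (hypothetical).
print(difference_interval(0.30, 100, 0.20, 100))
```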

22
Paired t test to compare hA, hB
  • 1. Partition the data into k disjoint test sets
    T1, T2, ..., Tk of equal size, where this size is
    at least 30.
  • 2. For i from 1 to k, do
    δi ← errorTi(hA) − errorTi(hB)
  • 3. Return the value δ̄, where
    δ̄ ≡ (1/k) Σ_{i=1}^{k} δi

23
Paired t test to compare hA, hB (2)
  • N% confidence interval estimate for d:
    δ̄ ± t_{N,k−1} s_δ̄
    where
    s_δ̄ ≡ √( (1 / (k(k − 1))) Σ_{i=1}^{k} (δi − δ̄)² )
  • Note: δ̄ is approximately normally distributed
    (see the sketch below)
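A minimal sketch of the paired test from the two slides above; the per-test-set error values and the choice of k = 5 are hypothetical, and 2.776 is the tabulated t value for 95% confidence with k − 1 = 4 degrees of freedom:

```python
from math import sqrt

err_a = [0.20, 0.24, 0.18, 0.22, 0.25]   # errorTi(hA), hypothetical values
err_b = [0.18, 0.21, 0.17, 0.20, 0.22]   # errorTi(hB), hypothetical values

k = len(err_a)
deltas = [a - b for a, b in zip(err_a, err_b)]
d_bar = sum(deltas) / k                                   # mean difference
s_d = sqrt(sum((d - d_bar) ** 2 for d in deltas) / (k * (k - 1)))

t_95 = 2.776                              # t_{N,k-1} for N = 95%, k - 1 = 4
print(d_bar - t_95 * s_d, d_bar + t_95 * s_d)
```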

24
Comparing learning algorithms LA and LB
  • What we'd like to estimate:
    E_{S ⊂ D} [ errorD(LA(S)) − errorD(LB(S)) ]
    where L(S) is the hypothesis output by learner L
    using training set S
  • I.e., the expected difference in true error
    between hypotheses output by learners LA and LB,
    when trained using randomly selected training
    sets S drawn according to distribution D.

25
Comparing learning algorithms LA and LB (2)
  • But, given limited data D0, what is a good
    estimator?
  • Could partition D0 into training set S0 and test
    set T0, and measure
    errorT0(LA(S0)) − errorT0(LB(S0))
  • Even better, repeat this many times and average
    the results

26
Comparing learning algorithms LA and LB (3)
k-fold cross validation
  • 1. Partition data D0 into k disjoint test sets
    T1, T2, ..., Tk of equal size, where this size is
    at least 30.
  • 2. For i from 1 to k, do: use Ti for the test
    set, and the remaining data for training set Si
    δi ← errorTi(LA(Si)) − errorTi(LB(Si))
  • 3. Return the average δ̄ of the per-fold
    differences measured on the test sets (see the
    sketch below)
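A minimal sketch of this k-fold comparison using scikit-learn (assumed available); the dataset and the two learners, a decision tree and naive Bayes, are illustrative stand-ins for LA and LB:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

deltas = []
for train_idx, test_idx in kf.split(X):
    err_a = 1 - DecisionTreeClassifier(random_state=0).fit(
        X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    err_b = 1 - GaussianNB().fit(
        X[train_idx], y[train_idx]).score(X[test_idx], y[test_idx])
    deltas.append(err_a - err_b)       # errorTi(LA(Si)) - errorTi(LB(Si))

print(np.mean(deltas))                 # estimated expected difference
```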

27
Practical Aspects: A note on parameter tuning
  • It is important that the test data is not used in
    any way to create the classifier
  • Some learning schemes operate in two stages
  • Stage 1 builds the basic structure
  • Stage 2 optimizes parameter settings
  • The test data can't be used for parameter tuning!
  • Proper procedure uses three sets: training data,
    validation data, and test data
  • Validation data is used to optimize parameters
    (see the sketch below)
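A minimal sketch of the three-set procedure with scikit-learn (assumed available); the dataset, the split proportions, and the tuned parameter (a decision tree's max_depth) are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
# Split into training, validation, and test data.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4,
                                                    random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5,
                                                random_state=0)

# Stage 2: tune the parameter on the validation data only.
best_score, best_depth = max(
    (DecisionTreeClassifier(max_depth=d, random_state=0)
        .fit(X_train, y_train).score(X_val, y_val), d)
    for d in (1, 2, 3, 5, 8))

# The test data is used exactly once, at the very end.
final = DecisionTreeClassifier(max_depth=best_depth, random_state=0)
final.fit(X_train, y_train)
print("chosen max_depth:", best_depth, "test accuracy:", final.score(X_test, y_test))
```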

28
Holdout estimation, stratification
  • What shall we do if the amount of data is
    limited?
  • The holdout method reserves a certain amount for
  • testing and uses the remainder for training
  • Usually one third for testing, the rest for
    training
  • Problem: the samples might not be representative
  • Example: a class might be missing in the test data
  • Advanced version uses stratification (see the
    sketch below)
  • Ensures that each class is represented with
    approximately equal proportions in both subsets
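A minimal sketch of a stratified holdout split with scikit-learn (assumed available); the dataset and the one-third test fraction are illustrative:

```python
from collections import Counter
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, stratify=y, random_state=0)

# Stratification keeps the class proportions roughly equal in both subsets.
print(Counter(y_train), Counter(y_test))
```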

29
More on cross-validation
  • Standard method for evaluation: stratified
    ten-fold cross-validation
  • Why ten? Extensive experiments have shown that
    this is the best choice to get an accurate
    estimate
  • There is also some theoretical evidence for this
  • Stratification reduces the estimate's variance
  • Even better: repeated stratified cross-validation
  • E.g. ten-fold cross-validation is repeated ten
    times and results are averaged (reduces the
    variance; see the sketch below)
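A minimal sketch of repeated stratified ten-fold cross-validation with scikit-learn (assumed available); the dataset and learner are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = load_breast_cancer(return_X_y=True)
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(GaussianNB(), X, y, cv=cv)   # 100 accuracy estimates
print(scores.mean(), scores.std())
```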

30
Issues in evaluation
  • Statistical reliability of estimated differences
    in performance
  • Choice of performance measure
  • Number of correct classifications
  • Accuracy of probability estimates
  • Error in numeric predictions
  • Costs assigned to different types of errors
  • Many practical applications involve costs

31
Counting the costs
  • In practice, different types of classification
    errors often incur different costs
  • Examples
  • Predicting when cows are in heat (in estrus)
  • "Not in estrus" is correct 97% of the time
  • Loan decisions
  • Oil-slick detection
  • Fault diagnosis
  • Promotional mailing

32
Taking costs into account
  • The confusion matrix: predicted vs. actual class
    (true/false positives and negatives; a
    cost-weighted sketch follows below)
  • There are many other types of costs!
  • E.g. costs of collecting training data
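A minimal sketch of weighting the entries of a confusion matrix with a cost matrix; the labels and the costs (a false negative five times as expensive as a false positive) are hypothetical:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: actual class, columns: predicted
costs = np.array([[0, 1],               # false positive costs 1
                  [5, 0]])              # false negative costs 5
print(cm)
print("total cost:", (cm * costs).sum())
```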

33
Lift charts
  • In practice, costs are rarely known
  • Decisions are usually made by comparing possible
    scenarios
  • Example: promotional mailout
  • Situation 1: classifier predicts that 0.1% of all
    households will respond
  • Situation 2: classifier predicts that 0.4% of the
    10,000 most promising households will respond
  • A lift chart allows for a visual comparison

34
Generating a lift chart
  • Instances are sorted according to their predicted
    probability of being a true positive:
    Rank  Predicted probability  Actual class
    1     0.95                   Yes
    2     0.93                   Yes
    3     0.93                   No
    4     0.88                   Yes
  • In a lift chart, the x axis is sample size and
    the y axis is the number of true positives (see
    the sketch below)
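A minimal sketch of the cumulative counts behind a lift chart, using the four ranked instances from the table above:

```python
ranked = [(0.95, "Yes"), (0.93, "Yes"), (0.93, "No"), (0.88, "Yes")]
ranked.sort(key=lambda t: t[0], reverse=True)   # sort by predicted probability

true_positives = 0
for size, (prob, actual) in enumerate(ranked, start=1):
    true_positives += (actual == "Yes")
    print(size, true_positives)   # x = sample size, y = true positives so far
```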

35
A hypothetical lift chart
36
Summary of measures
37
Model selection criteria
  • Model selection criteria attempt to find a good
    compromise between:
  • A. The complexity of a model
  • B. Its prediction accuracy on the training data
  • Reasoning: a good model is a simple model that
    achieves high accuracy on the given data
  • Also known as Occam's Razor: the best theory is
    the smallest one that describes all the facts

38
Warning
  • Suppose you are gathering hypotheses that have a
    probability of 95% of having an error level below
    10%
  • What if you have found 100 hypotheses satisfying
    this condition?
  • Then the probability that all have an error below
    10% is equal to (0.95)^100 ≈ 0.006, corresponding
    to 0.6%. So, the probability of having at least
    one hypothesis with an error above 10% is about
    99.4%! (See the sketch below.)
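A minimal sketch of the calculation behind this warning:

```python
p_single = 0.95       # each hypothesis: 95% chance its error is below 10%
m = 100               # number of hypotheses found
p_all_ok = p_single ** m
print(p_all_ok)       # about 0.006
print(1 - p_all_ok)   # about 0.994: at least one hypothesis with error above 10%
```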