Algorithm-Independent Machine Learning (Anna Egorova-Förster)

1
Algorithm-Independent Machine Learning
Anna Egorova-Förster, University of Lugano
Pattern Classification Reading Group, January 2007
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
2
Algorithm-Independent Machine Learning
  • So far, different classifiers and methods have been presented
  • BUT
  • Is some classifier better than all others?
  • How can classifiers be compared?
  • Is such a comparison possible at all?
  • Is at least some classifier always better than random guessing?
  • AND
  • Do techniques exist that boost all classifiers?

3
No Free Lunch Theorem
  • For any two learning algorithms P1(h|D) and P2(h|D), the following are true, independent of the sampling distribution P(x) and the number n of training points (ℰk denotes the expected off-training-set error of algorithm k):
  • Uniformly averaged over all target functions F, ℰ1(E|F, n) - ℰ2(E|F, n) = 0
  • For any fixed training set D, uniformly averaged over F, ℰ1(E|F, D) - ℰ2(E|F, D) = 0
  • Uniformly averaged over all priors P(F), ℰ1(E|n) - ℰ2(E|n) = 0
  • For any fixed training set D, uniformly averaged over P(F), ℰ1(E|D) - ℰ2(E|D) = 0

4
No Free Lunch Theorem
  • Uniformly averaged over all target functions F, ℰ1(E|F, n) - ℰ2(E|F, n) = 0
  • Averaged over all possible target functions, the off-training-set error is the same for all classifiers.
  • Possible target functions consistent with D: 2^5 = 32, one for each labeling of the five off-training-set points (a small enumeration demo follows the table).
  • For any fixed training set D, uniformly averaged over F, ℰ1(E|F, D) - ℰ2(E|F, D) = 0
  • Even if we know the training set D, the off-training-set errors will be the same.

  x    F   h1   h2
 000   1    1    1    (Training set D)
 001  -1   -1   -1    (Training set D)
 010   1    1    1    (Training set D)
 011  -1    1   -1    (Off-training set)
 100   1    1   -1    (Off-training set)
 101  -1    1   -1    (Off-training set)
 110   1    1   -1    (Off-training set)
 111   1    1   -1    (Off-training set)
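A minimal sketch (not from the slides), based on the table above: it enumerates all 2^5 target functions consistent with the training set D and checks that h1 and h2 have the same average off-training-set error. Variable names are illustrative.

```python
import itertools

# Outputs of h1 and h2 on the off-training-set patterns 011..111 (from the table).
h1 = [1, 1, 1, 1, 1]
h2 = [-1, -1, -1, -1, -1]

def avg_off_training_error(h):
    """Average error of h over all 2^5 target functions that agree with D."""
    errors = []
    for labels in itertools.product([-1, 1], repeat=5):   # every F consistent with D
        wrong = sum(1 for f, y in zip(labels, h) if f != y)
        errors.append(wrong / 5)
    return sum(errors) / len(errors)

print(avg_off_training_error(h1))  # 0.5
print(avg_off_training_error(h2))  # 0.5 -- identical, as the theorem predicts
```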
5
Consequences of the No Free Lunch Theorem
  • If no information about the target function F(x) is provided:
  • No classifier is better than any other in the general case
  • No classifier is better than random guessing in the general case

6
Ugly Duckling Theorem: Feature Comparison
  • Binary features fi
  • Patterns xi are predicates built from the features: f1 AND f2, f1 OR f2, etc.
  • Rank r of a predicate: the number of simplest (rank-1) patterns it contains
  • Rank 1:
  • x1 = f1 AND NOT f2
  • x2 = f1 AND f2
  • x3 = f2 AND NOT f1
  • Rank 2:
  • x1 OR x2 = f1
  • Rank 3:
  • x1 OR x2 OR x3 = f1 OR f2

(Figure: Venn diagram of the predicates built from f1 and f2)
7
Features with prior information
8
Feature Comparison
  • To compare two patterns, should we take the number of features they share?
  • Blind_left ∈ {0, 1}
  • Blind_right ∈ {0, 1}
  • Is (0,1) more similar to (1,0) or to (1,1)?
  • Different representations are also possible:
  • Blind_right ∈ {0, 1}
  • Both_eyes_same ∈ {0, 1}
  • With no prior information about the features available, it is impossible to prefer one representation over another (a small demo follows)
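A minimal sketch (not from the slides) illustrating the point: counting shared feature values gives different similarity judgements under the two representations. The encoding chosen for both_eyes_same is an assumption for illustration.

```python
def to_alt(pattern):
    """Re-encode (blind_left, blind_right) as (blind_right, both_eyes_same).

    both_eyes_same = 1 when both eyes have the same state (illustrative choice).
    """
    blind_left, blind_right = pattern
    return (blind_right, 1 if blind_left == blind_right else 0)

def shared(a, b):
    """Number of feature values two patterns share."""
    return sum(1 for x, y in zip(a, b) if x == y)

p, q, r = (0, 1), (1, 0), (1, 1)

# Original representation: (0,1) shares 0 features with (1,0) but 1 with (1,1).
print(shared(p, q), shared(p, r))                                   # 0 1
# Alternative representation: now (0,1) is equally similar to both.
print(shared(to_alt(p), to_alt(q)), shared(to_alt(p), to_alt(r)))   # 1 1
```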

9
Ugly Duckling Theorem
  • Given that we use a finite set of predicates that
    enables us to distinguish any two patterns under
    consideration, the number of predicates shared
    by two such patterns is constant and independent
    of the choice of those patterns. Furthermore, if
    pattern similarity is based on the total number
    of predicates shared by two patterns, then any
    two patterns are equally similar.
  • An ugly duckling is as similar to beautiful swan 1 as beautiful swan 2 is to beautiful swan 1.

10
Ugly Duckling Theorem
  • Use the number of predicates two patterns share to compare them.
  • For two different patterns xi and xj:
  • No shared predicate of rank 1
  • Exactly one shared predicate of rank 2: xi OR xj
  • In the general case, with d distinct patterns, the number of shared predicates of rank r is C(d-2, r-2), for a total of 2^(d-2) over all ranks
  • The result is independent of the choice of xi and xj! (A counting sketch follows.)
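A minimal counting sketch (not from the slides), treating each predicate as a nonempty subset of the d patterns, to verify that any two distinct patterns share the same number of predicates. The value of d is an illustrative choice.

```python
from itertools import combinations

d = 4                      # number of distinct patterns (illustrative)
patterns = list(range(d))

def shared_predicates(i, j):
    """Count predicates (nonempty subsets of patterns) containing both i and j."""
    count = 0
    for r in range(1, d + 1):                       # r = rank of the predicate
        for subset in combinations(patterns, r):
            if i in subset and j in subset:
                count += 1
    return count

# Every pair shares the same number of predicates: 2^(d-2).
print([shared_predicates(i, j) for i, j in combinations(patterns, 2)])
print(2 ** (d - 2))  # matches each entry above
```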

11
Bias and Variance
  • Bias: how far, on average, the estimate of F learned from a training set D is from the true F
  • Variance: how much the estimates of F differ across different training sets D
  • Low bias usually means high variance
  • High bias usually means low variance
  • Best: low bias and low variance
  • Only possible with as much information about F(x) as possible (a small simulation follows)
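A minimal simulation sketch (not from the slides): it repeatedly fits a low-degree and a high-degree polynomial to noisy samples of an assumed target function and estimates squared bias and variance numerically. The target function, noise level, and degrees are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 30)
f_true = np.sin(np.pi * x)          # assumed target function F(x)

def bias2_and_variance(degree, n_datasets=200, noise=0.3):
    """Estimate squared bias and variance of polynomial fits of a given degree."""
    fits = []
    for _ in range(n_datasets):
        y = f_true + rng.normal(0.0, noise, size=x.shape)   # a fresh training set D
        coeffs = np.polyfit(x, y, degree)
        fits.append(np.polyval(coeffs, x))
    fits = np.array(fits)
    mean_fit = fits.mean(axis=0)
    bias2 = np.mean((mean_fit - f_true) ** 2)       # squared bias
    variance = np.mean(fits.var(axis=0))            # variance across datasets
    return bias2, variance

print(bias2_and_variance(degree=1))  # high bias, low variance
print(bias2_and_variance(degree=7))  # low bias, high variance
```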

12
Bias and variance
13
Resampling for Estimating Statistics: Jackknife
  • Remove one point from the training set (leave-one-out)
  • Calculate the statistic on the remaining n - 1 points
  • Repeat for each of the n points
  • Combine the n leave-one-out values into the jackknife estimate of the statistic (see the sketch below)
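A minimal sketch (not from the slides) of the jackknife estimate of a statistic and its standard error, using the median as the illustrative statistic and made-up sample values.

```python
import numpy as np

def jackknife(data, statistic):
    """Leave-one-out jackknife estimate and standard error of a statistic."""
    data = np.asarray(data)
    n = len(data)
    # Recompute the statistic with each point removed in turn.
    loo = np.array([statistic(np.delete(data, i)) for i in range(n)])
    estimate = loo.mean()
    std_err = np.sqrt((n - 1) / n * np.sum((loo - estimate) ** 2))
    return estimate, std_err

sample = np.array([3.1, 2.4, 5.6, 4.8, 3.9, 6.2, 2.7, 4.1])
print(jackknife(sample, np.median))
```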

14
Bagging
  • Draw n' < n training points and train a different classifier on each such sample
  • Combine the classifiers' votes into the end result
  • The classifiers are of the same type: all neural networks, all decision trees, etc.
  • Instability: small changes in the training set lead to significantly different classifiers and/or results; bagging helps most for such unstable classifiers (see the sketch below)
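A minimal sketch (not from the slides) of bagging with scikit-learn decision trees as the component classifiers; the dataset, subsample size, and number of classifiers are illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

def bagging_fit(X, y, n_classifiers=15, sample_frac=0.7):
    """Train one decision tree per random subsample of n' < n points."""
    n_prime = int(sample_frac * len(X))
    trees = []
    for _ in range(n_classifiers):
        idx = rng.choice(len(X), size=n_prime, replace=True)
        trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Combine the component classifiers by majority vote."""
    votes = np.array([t.predict(X) for t in trees])
    return (votes.mean(axis=0) > 0.5).astype(int)

trees = bagging_fit(X, y)
print((bagging_predict(trees, X) == y).mean())  # training accuracy of the ensemble
```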

15
Boosting
  • Improve the performance of different types of classifiers
  • Weak learners: classifiers whose accuracy is only slightly better than random guessing
  • Example: three component classifiers for a two-class problem
  • Draw three different training sets D1, D2 and D3 and train three different classifiers C1, C2 and C3 (weak learners)

16
Boosting
  • D1: randomly draw n1 < n training points from D
  • Train C1 with D1
  • D2: the most informative dataset with respect to C1
  • Half of its points are classified correctly by C1, half are not
  • Flip a coin: if heads, find the first pattern in D \ D1 that C1 misclassifies; if tails, find the first pattern that C1 classifies correctly; add it to D2
  • Continue until no more such patterns can be found
  • Train C2 with D2
  • D3: the most informative dataset with respect to C1 and C2
  • Randomly select a pattern from D \ (D1 ∪ D2)
  • If C1 and C2 disagree on it, add it to D3
  • Train C3 with D3 (a sketch of the whole procedure follows)
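A minimal sketch (not from the slides) of this three-classifier procedure, with decision stumps as the weak learners. The dataset, the subsample size n1, and the final voting rule (use C3 only when C1 and C2 disagree, as commonly described for this scheme) are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=600, n_features=10, random_state=1)

def stump():
    """Weak learner: a depth-1 decision tree."""
    return DecisionTreeClassifier(max_depth=1)

# D1: a random subsample of n1 < n points; train C1 on it.
n1 = 200
idx1 = rng.choice(len(X), size=n1, replace=False)
C1 = stump().fit(X[idx1], y[idx1])

# D2: roughly half patterns C1 gets right, half it gets wrong (coin flips).
rest = np.setdiff1d(np.arange(len(X)), idx1)
correct = C1.predict(X[rest]) == y[rest]
idx2 = []
for _ in range(n1):
    want_wrong = rng.random() < 0.5              # heads: look for a misclassified pattern
    pool = rest[~correct] if want_wrong else rest[correct]
    pool = np.setdiff1d(pool, idx2)
    if len(pool) == 0:
        continue                                 # no such pattern left
    idx2.append(pool[0])
idx2 = np.array(idx2)
C2 = stump().fit(X[idx2], y[idx2])

# D3: remaining patterns on which C1 and C2 disagree; train C3 on them.
rest3 = np.setdiff1d(rest, idx2)
disagree = C1.predict(X[rest3]) != C2.predict(X[rest3])
C3 = stump().fit(X[rest3[disagree]], y[rest3[disagree]])

# Final rule: if C1 and C2 agree, take their label; otherwise let C3 decide.
p1, p2, p3 = (C.predict(X) for C in (C1, C2, C3))
pred = np.where(p1 == p2, p1, p3)
print("C1 alone:", (p1 == y).mean(), " boosted:", (pred == y).mean())
```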

17
Boosting