1. Algorithm-Independent Machine Learning
Anna Egorova-Förster
University of Lugano
Pattern Classification Reading Group, January 2007
All materials in these slides were taken from Pattern Classification (2nd ed.) by R. O. Duda, P. E. Hart and D. G. Stork, John Wiley & Sons, 2000, with the permission of the authors and the publisher.
2. Algorithm-Independent Machine Learning
- So far, different classifiers and methods have been presented. BUT:
- Is some classifier better than all others?
- How can classifiers be compared?
- Is comparison possible at all?
- Is at least some classifier always better than random? AND:
- Do techniques exist that boost all classifiers?
3. No Free Lunch Theorem
- For any two learning algorithms P1(h|D) and P2(h|D), the following are true, independent of the sampling distribution P(x) and the number n of training points:
- 1. Uniformly averaged over all target functions F, E1(E|F,n) - E2(E|F,n) = 0.
- 2. For any fixed training set D, uniformly averaged over F, E1(E|F,D) - E2(E|F,D) = 0.
- 3. Uniformly averaged over all priors P(F), E1(E|n) - E2(E|n) = 0.
- 4. For any fixed training set D, uniformly averaged over P(F), E1(E|D) - E2(E|D) = 0.
- Here Ek(E|...) denotes the expected off-training-set classification error of algorithm k.
4. No Free Lunch Theorem
- 1. Uniformly averaged over all target functions F, E1(E|F,n) - E2(E|F,n) = 0.
- Averaged over all possible target functions, the off-training-set error is the same for all classifiers.
- Possible target functions consistent with the training set: 2^5 (each of the 5 off-training patterns can be labelled +1 or -1).
- 2. For any fixed training set D, uniformly averaged over F, E1(E|F,D) - E2(E|F,D) = 0.
- Even if we know the training set D, the off-training-set errors will be the same.
Training set D: the first three rows; the remaining five rows form the off-training set.

   x  |  F | h1 | h2
 -----+----+----+----
  000 |  1 |  1 |  1
  001 | -1 | -1 | -1
  010 |  1 |  1 |  1
 -----+----+----+----
  011 | -1 |  1 | -1
  100 |  1 |  1 | -1
  101 | -1 |  1 | -1
  110 |  1 |  1 | -1
  111 |  1 |  1 | -1
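The averaging in part 1 of the theorem can be checked directly on this example. The sketch below (the h1/h2 labellings are taken from the table above; everything else is illustrative scaffolding) enumerates all 2^5 target functions consistent with D and shows that both hypotheses have the same average off-training-set error:

```python
# Sketch: average off-training-set error over all 2^5 target functions
# consistent with D, for the two hypotheses h1 and h2 from the table.
from itertools import product

off = ['011', '100', '101', '110', '111']  # off-training patterns

# Labels of h1 and h2 on the off-training patterns (from the table):
# h1 predicts +1 everywhere, h2 predicts -1 everywhere off-training.
h1 = {'011': 1, '100': 1, '101': 1, '110': 1, '111': 1}
h2 = {'011': -1, '100': -1, '101': -1, '110': -1, '111': -1}

def mean_off_error(h):
    # Average error of h, uniformly over all labelings of the
    # off-training patterns (i.e. all consistent target functions F).
    mis, total = 0, 0
    for labels in product([1, -1], repeat=len(off)):
        F = dict(zip(off, labels))
        mis += sum(h[x] != F[x] for x in off)
        total += len(off)
    return mis / total

print(mean_off_error(h1), mean_off_error(h2))  # 0.5 0.5
```

Both classifiers average exactly chance-level error: for every off-training pattern, half of the possible target functions disagree with any fixed prediction.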
5. Consequences of the No Free Lunch Theorem
- If no information about the target function F(x) is provided:
- No classifier is better than any other in the general case.
- No classifier is better than random guessing in the general case.
6. Ugly Duckling Theorem: Feature Comparison
- Binary features fi.
- Predicates over patterns xi are formed from the features: f1 AND f2, f1 OR f2, etc.
- Rank r of a predicate: the number of simplest patterns it contains.
- Rank 1:
- x1 = f1 AND NOT f2
- x2 = f1 AND f2
- x3 = f2 AND NOT f1
- Rank 2:
- x1 OR x2 = f1
- Rank 3:
- x1 OR x2 OR x3 = f1 OR f2
(Venn diagram)
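The rank definition can be made concrete by viewing each predicate as the set of simplest patterns it contains; the rank is then just the set's size. A minimal sketch (the pattern names follow the slide; the subset view is an assumption consistent with the definition above):

```python
# Sketch: predicates over d = 3 simplest patterns as nonempty subsets;
# the rank of a predicate is the number of simplest patterns it contains.
from itertools import combinations

patterns = ['x1', 'x2', 'x3']
by_rank = {}
for r in range(1, len(patterns) + 1):
    by_rank[r] = [set(c) for c in combinations(patterns, r)]

for r, preds in by_rank.items():
    print(r, preds)
# rank 1: 3 predicates (x1, x2, x3)
# rank 2: 3 predicates (e.g. {x1, x2}, which equals f1)
# rank 3: 1 predicate ({x1, x2, x3}, which equals f1 OR f2)
```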
7. Features with prior information
8. Feature Comparison
- To compare two patterns, take the number of features they share?
- Blind_left ∈ {0, 1}
- Blind_right ∈ {0, 1}
- Is (0, 1) more similar to (1, 0) or to (1, 1)?
- Different representations are also possible:
- Blind_right ∈ {0, 1}
- Both_eyes_same ∈ {0, 1}
- With no prior information about the features available, it is impossible to prefer one representation over another.
9. Ugly Duckling Theorem
- Given that we use a finite set of predicates that enables us to distinguish any two patterns under consideration, the number of predicates shared by two such patterns is constant and independent of the choice of those patterns. Furthermore, if pattern similarity is based on the total number of predicates shared by two patterns, then any two patterns are equally similar.
- An ugly duckling is as similar to beautiful swan 1 as beautiful swan 2 is to beautiful swan 1.
10. Ugly Duckling Theorem
- Use as the similarity of two patterns the number of predicates they share.
- For two different patterns xi and xj:
- No shared predicates of rank 1.
- One shared predicate of rank 2: xi OR xj.
- In the general case, with d distinct patterns, the number of rank-r predicates shared by xi and xj is C(d-2, r-2), so the total over all ranks is sum_r C(d-2, r-2) = 2^(d-2).
- The result is independent of the choice of xi and xj!
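The constant count can be verified by brute force. The sketch below (assuming, as above, that predicates are nonempty subsets of the d patterns and rank is subset size) checks that every pair of patterns shares exactly 2^(d-2) predicates:

```python
# Sketch: count predicates (nonempty subsets of d patterns) that contain
# both pattern i and pattern j; per the Ugly Duckling theorem the count
# is 2^(d-2) regardless of which pair is chosen.
from itertools import combinations

def shared_predicates(d, i, j):
    count = 0
    for r in range(1, d + 1):                     # rank = subset size
        for subset in combinations(range(d), r):
            if i in subset and j in subset:
                count += 1
    return count

d = 5
counts = [shared_predicates(d, i, j) for i, j in combinations(range(d), 2)]
print(counts)  # every pair shares 2**(d-2) = 8 predicates
```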
11. Bias and Variance
- Bias: how closely the estimate of F learned from a training set D matches the true F; low bias means we can accurately estimate F from D.
- Variance: how much the estimate of F changes across different training sets D; low variance means little difference between the estimates.
- Low bias usually means high variance.
- High bias usually means low variance.
- Best case: low bias and low variance.
- Only possible with as much information about F(x) as possible.
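The trade-off can be seen in a small simulation. The setup below is hypothetical (not from the slides): we repeatedly draw training sets, estimate F(x) = sin(x) at a test point with two models, and measure squared bias and variance across the draws.

```python
# Sketch: bias/variance of two estimators of F(x) = sin(x) at x_test,
# measured over many independently drawn noisy training sets D.
import math, random

random.seed(0)
F = math.sin
x_test = 1.0
runs = 2000

preds_const, preds_nn = [], []
for _ in range(runs):
    # Fresh training set D of 10 noisy samples of F.
    xs = [random.uniform(0, 2) for _ in range(10)]
    ys = [F(x) + random.gauss(0, 0.5) for x in xs]
    preds_const.append(0.0)                       # model 1: always predict 0
    nearest = min(range(10), key=lambda i: abs(xs[i] - x_test))
    preds_nn.append(ys[nearest])                  # model 2: 1-nearest neighbour

def bias_var(preds):
    m = sum(preds) / len(preds)
    bias2 = (m - F(x_test)) ** 2                  # squared bias at x_test
    var = sum((p - m) ** 2 for p in preds) / len(preds)
    return bias2, var

print('constant:', bias_var(preds_const))  # high bias^2, zero variance
print('1-NN    :', bias_var(preds_nn))     # low bias^2, high variance
```

The constant model ignores D entirely (high bias, zero variance), while the nearest-neighbour model tracks each noisy training set closely (low bias, high variance).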
12. Bias and Variance
13. Resampling for Estimating Statistics: Jackknife
- Remove one point from the training set.
- Calculate the statistic with the reduced training set.
- Repeat for all points.
- Calculate the jackknife estimate of the statistic.
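The steps above can be sketched directly (the data values are made up; the statistic here is the mean, and the standard-error formula is the standard jackknife one):

```python
# Sketch: jackknife estimate of a statistic from leave-one-out replicates,
# plus the jackknife estimate of its standard error.
def jackknife(data, statistic):
    n = len(data)
    # Leave one point out at a time and recompute the statistic.
    reps = [statistic(data[:i] + data[i+1:]) for i in range(n)]
    est = sum(reps) / n
    # Jackknife standard error of the statistic.
    se = ((n - 1) / n * sum((r - est) ** 2 for r in reps)) ** 0.5
    return est, se

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean = lambda xs: sum(xs) / len(xs)
est, se = jackknife(data, mean)
print(est, se)  # est ≈ 5.0, the sample mean
```

For the mean, the jackknife estimate coincides with the plain sample mean; the procedure becomes interesting for statistics (e.g. the median or a trimmed mean) whose bias and standard error are hard to derive analytically.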
14. Bagging
- Draw n' < n training points and train a different classifier on each such sample.
- Combine the classifiers' votes into the end result.
- The classifiers are all of the same type: all neural networks, all decision trees, etc.
- Instability: small changes in the training set lead to significantly different classifiers and/or results.
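A minimal sketch of the idea (the 1-D toy data, the decision-stump base classifier, and all parameter values are illustrative assumptions, not from the slides): each stump is trained on a random subsample of n' < n points, and the ensemble predicts by majority vote.

```python
# Sketch: bagging decision stumps by majority vote over random subsamples.
import random

random.seed(1)
# Toy 1-D data: class -1 below 0.5, class +1 above, with two noisy labels.
X = [i / 20 for i in range(20)]
y = [-1 if x < 0.5 else 1 for x in X]
y[3], y[17] = 1, -1  # label noise

def train_stump(xs, ys):
    # Exhaustively pick the best threshold/orientation on this subsample.
    best = None
    for t in xs:
        for sign in (1, -1):
            err = sum(1 for x, lab in zip(xs, ys)
                      if sign * (1 if x >= t else -1) != lab)
            if best is None or err < best[0]:
                best = (err, t, sign)
    _, t, sign = best
    return lambda x: sign * (1 if x >= t else -1)

# Train 15 stumps, each on a random subsample of n' = 12 < n = 20 points.
stumps = []
for _ in range(15):
    idx = random.sample(range(20), 12)
    stumps.append(train_stump([X[i] for i in idx], [y[i] for i in idx]))

def bagged(x):
    # Majority vote of the component classifiers.
    return 1 if sum(s(x) for s in stumps) > 0 else -1

print([bagged(x) for x in (0.1, 0.4, 0.6, 0.9)])
```

Because each stump sees a slightly different sample, their individual thresholds vary (the instability mentioned above), but the vote smooths this variation out.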
15. Boosting
- Improves the performance of different types of classifiers.
- Weak learners: classifiers with accuracy only slightly better than random.
- Example: three component classifiers for a two-class problem.
- Draw three different training sets D1, D2 and D3 and train three different classifiers C1, C2 and C3 (weak learners).
16. Boosting
- D1: randomly draw n1 < n training points from D.
- Train C1 with D1.
- D2: the most informative dataset with respect to C1.
- Half of its points are classified correctly by C1, half of them are not.
- Flip a coin: if heads, find the first pattern in D \ D1 misclassified by C1; if tails, find a pattern correctly classified by C1.
- Continue as long as possible.
- Train C2 with D2.
- D3: the most informative dataset with respect to C1 and C2.
- Randomly select a pattern from D \ (D1 ∪ D2).
- If C1 and C2 disagree on it, add it to D3.
- Train C3 with D3.
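The procedure above can be sketched as follows. The helpers `train` and `predict` are hypothetical placeholders for any weak learner, and the final decision rule (take C1's label when C1 and C2 agree, otherwise defer to C3) follows the textbook's description of this scheme:

```python
# Sketch of the three-classifier boosting procedure, with hypothetical
# weak-learner hooks train(data) -> model and predict(model, x) -> label.
import random

def boost_three(D, train, predict, n1):
    """D: list of (x, label) pairs; D is shuffled in place."""
    random.shuffle(D)
    D1, rest = D[:n1], D[n1:]          # D1: n1 < n random points
    C1 = train(D1)

    # D2: roughly half correctly / half incorrectly classified by C1,
    # selected by coin flips until one kind runs out.
    D2, pool = [], list(rest)
    while pool:
        want_wrong = random.random() < 0.5   # the coin flip
        match = next((p for p in pool
                      if (predict(C1, p[0]) != p[1]) == want_wrong), None)
        if match is None:
            break                            # no such pattern remains
        pool.remove(match)
        D2.append(match)
    C2 = train(D2)

    # D3: remaining patterns on which C1 and C2 disagree.
    D3 = [p for p in pool if predict(C1, p[0]) != predict(C2, p[0])]
    C3 = train(D3)

    def classify(x):
        a, b = predict(C1, x), predict(C2, x)
        return a if a == b else predict(C3, x)
    return classify
```

Each component sees data the previous ones found hard, so even weak learners can combine into a stronger ensemble.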
17. Boosting