Title: 5 Evaluation of hypotheses
5 Evaluation of hypotheses
- Q: How good are the results of a learning process?
- Some statistics
- Evaluating a single hypothesis
- Comparing hypotheses
- Comparing learning algorithms
- ROC analysis
- Reference: Mitchell, Ch. 5
Some statistics (sorry)
- Evaluation of a predictive model is often based on predictive accuracy: the probability that the hypothesis makes a correct prediction for a random instance
  - acc(h) = P(h(X) = c(X)) = 1 - error(h)
- Estimating this accuracy is standard statistics
- Recall:
  - binomial and normal distributions
  - confidence intervals and hypothesis tests
Binomial distribution
- An experiment that succeeds with probability p is repeated n times; what is the probability of having x successes?
  - assumptions: constant p, independent experiments
- Given a hypothesis h with accuracy p and n instances in some test set, this gives the probability of making x correct predictions
Normal distribution
- The sum of many independent variables follows (approximately) a normal distribution
- Can be used to approximate the binomial distribution
- The formulae used in practice are derived from this approximation
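A minimal sketch of both distributions in plain Python (no external libraries); the accuracy value p and test-set size n below are invented illustration numbers, not values from the slides.

```python
from math import comb, exp, sqrt, pi

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def normal_pdf(x, mean, std):
    """Density of the normal distribution used to approximate the binomial."""
    return exp(-((x - mean) ** 2) / (2 * std**2)) / (std * sqrt(2 * pi))

# Example: hypothesis with true accuracy p = 0.8 evaluated on n = 100 test instances.
n, p = 100, 0.8
mean, std = n * p, sqrt(n * p * (1 - p))   # binomial mean and standard deviation
print(binomial_pmf(80, n, p))              # exact probability of 80 correct predictions
print(normal_pdf(80, mean, std))           # normal approximation of the same probability
```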
Confidence intervals
- From p and n we can compute an interval in which x (or x/n) lies with probability close to 1 (e.g. 0.95)
- In practice we want to do the opposite
  - given p̂ = x/n, give an interval for p
  - an interval that contains p with probability c is called a c confidence interval
(Figure: population distribution and sample distribution, both on the interval [0, 1])
Hypothesis tests
- Principle of a hypothesis test
  - given a certain claim (hypothesis) H0, test it by looking at a sample
  - if the sample gives a result that is very unlikely if H0 were true, reject H0
- E.g.
  - claim: h predicts correctly in 90% of cases (H0: p = 0.9)
  - test on a data set: p̂ = 0.8
  - this is abnormally low, hence reject the claim
  - "abnormal" = the confidence interval computed from p̂ does not contain p
Evaluating a single hypothesis
- To estimate the true accuracy of a hypothesis h
  - compute the accuracy of h on a sample of unseen data
  - compute e.g. a 95% confidence interval
- Formula for the 95% confidence interval
  - derived from the normal distribution
  - p̂ is the accuracy on the sample, n is the size of the sample
  - p̂ ± z_N * sqrt(p̂(1 - p̂)/n)
- z_N values: confidence 0.90 -> 1.64, 0.95 -> 1.96, 0.99 -> 2.58
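A small sketch of this interval computation; the sample accuracy and sample size used in the example call are made up for illustration.

```python
from math import sqrt

# z values for common confidence levels (from the normal distribution)
Z = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def accuracy_confidence_interval(p_hat, n, level=0.95):
    """Confidence interval for the true accuracy, given sample accuracy p_hat on n test instances."""
    margin = Z[level] * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Example: 80% accuracy measured on a test set of 100 unseen instances.
print(accuracy_confidence_interval(0.80, 100))   # roughly (0.72, 0.88)
```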
The importance of test sets
- Important: the theory assumes a random, independent sample
- The training set used to learn h is not independent!
- If we denote
  - errorTr(h): error of h on the training set
  - error(h): true error of h on the population
  - errorTe(h): error of h on a sample different from the training set
- then typically (E() denotes expected value)
  - E(errorTr(h)) < error(h) (cf. overfitting as extreme case)
  - E(errorTr(h)) - error(h) = bias of the estimator errorTr
  - E(errorTe(h)) = error(h)
Creating test sets
- How to obtain an independent test set?
- Simple method: use e.g. 2/3 of the available data for training, 1/3 for testing
- Problem if not much data is available
  - a smaller training set makes it more difficult to learn
  - a smaller test set gives less accurate estimates
- Popular solution: cross-validation (see the sketch below)
  - learn h from the full set S
  - partition S into n subsets; learn n hypotheses hi, each time leaving out a different subset
  - use the average of the test-set accuracies of the hi as an estimate of the accuracy of h
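A minimal cross-validation sketch; `learn` and `accuracy` are hypothetical stand-ins for whatever learner and evaluation function are used.

```python
def cross_validation_accuracy(data, learn, accuracy, n_folds=3):
    """Estimate the accuracy of the hypothesis learned from the full data set
    by averaging the test-set accuracies of hypotheses learned on n-1 folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]   # simple partition into n subsets
    scores = []
    for i in range(n_folds):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_i = learn(training_set)                 # hypothesis learned with fold i left out
        scores.append(accuracy(h_i, test_set))    # accuracy of h_i on the held-out fold
    return sum(scores) / n_folds                  # estimate for h learned from the full set
```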
Example: 3-fold cross-validation
(Figure: given data set S, learn h from S; S is partitioned into S1, S2, S3; each Si serves as the test set for hi and the remaining folds as its training set, giving errorS1(h1), errorS2(h2), errorS3(h3))
- Not entirely unbiased: usually a small pessimistic bias
  - S contains more elements than S - Si
  - different folds are not independent
- Still preferable over using the training-set accuracy
Comparison of hypotheses
- Given two hypotheses, which one has the lower true error?
- Statistical hypothesis test
  - claim that both are equally good
  - if the claim is rejected, accept that one is better
- 2 cases
  - compare 2 hypotheses on possibly different test sets
  - compare 2 hypotheses on the same test set
Comparing 2 hypotheses
- To compare h1 and h2, estimate p1 - p2 from samples S1 (giving p̂1) and S2 (giving p̂2)
  - if very likely p1 - p2 > 0: h1 is better
  - similarly, if < 0: h2 is better
  - otherwise, no difference is demonstrated
- Formula for the confidence interval of the difference (analogous to the single-hypothesis case):
  (p̂1 - p̂2) ± z_N * sqrt(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2)
Comparing 2 hypotheses on the same data set
- When comparing hypotheses on the same data set, a more powerful procedure is possible
  - it uses more information from the test
  - the possible influence of easy/difficult examples is removed
- More informative method
  - for each single example, compare h1 and h2
  - how often was h1 correct and h2 wrong on the same example, vs. the other way around?
  - use McNemar's test
McNemar's test
- Consider the table of counts: B = number of examples where h1 is correct and h2 wrong, C = number of examples where h2 is correct and h1 wrong
- If h1 is equally good as h2
  - for each instance where h1 and h2 differ, probability 0.5 that either is correct
  - hence we expect B ≈ C ≈ (B+C)/2
- B and C follow a binomial (approximately normal) distribution
- Reject equality if B deviates too much from (B+C)/2
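A sketch of the exact (binomial) version of this test; the example numbers reuse the 10 vs. 0 comparison from the next slide.

```python
from math import comb

def mcnemar_p_value(b, c):
    """Two-sided exact McNemar test: b = examples where only h1 is correct,
    c = examples where only h2 is correct. Under the null hypothesis that
    h1 and h2 are equally good, b follows Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # probability of a deviation from (b+c)/2 at least as large as the observed one
    tail = sum(comb(n, i) for i in range(0, k + 1)) * 0.5**n
    return min(1.0, 2 * tail)

# 10 examples where only h2 is correct, 0 where only h1 is correct:
print(mcnemar_p_value(0, 10))   # about 0.002, so reject "h1 and h2 are equally good"
```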
Example comparison
- Consider the table below
- Method with independent test sets
  - 55 vs. 45 correct predictions (out of 100) in favour of h2
  - not very convincing
- Method with the same test set
  - much more convincing: 10 vs. 0 in favour of h2
  - h2 is clearly better than h1; this might not be discovered using the "conservative" comparison
Comparing learning algorithms
- Compare these two questions
  - Q1: given hypotheses h1 and h2, which one has better predictive accuracy?
  - Q2: given learners L1 and L2 and a data set S, which learner can be expected to build the best hypothesis from S?
    - note that the hypotheses themselves may vary
    - more difficult to answer than Q1
- One possible method (see the sketch below)
  - For several data sets Si similar to S
    - split Si into a training set Str and a test set Ste
    - learn h1 and h2 from Str using L1 resp. L2
    - compute δi = errorSte(h1) - errorSte(h2)
  - Hypothesis test / confidence interval for the mean of δ
- What if only a limited set of data is available?
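A sketch of that procedure, expressed as a confidence interval for the mean per-dataset difference; `learn_L1`, `learn_L2`, `error`, and `split` are hypothetical placeholders for the two learners and the evaluation utilities.

```python
from math import sqrt

def compare_learners(datasets, learn_L1, learn_L2, error, split):
    """Return the mean difference in test error and an approximate 95% confidence
    interval, based on one train/test split per data set."""
    deltas = []
    for S_i in datasets:
        S_tr, S_te = split(S_i)                      # training and test part of data set i
        h1, h2 = learn_L1(S_tr), learn_L2(S_tr)      # hypotheses built by the two learners
        deltas.append(error(h1, S_te) - error(h2, S_te))
    k = len(deltas)
    mean = sum(deltas) / k
    var = sum((d - mean) ** 2 for d in deltas) / (k - 1)
    margin = 1.96 * sqrt(var / k)                    # normal approximation; for small k a t-value is more appropriate
    return mean, (mean - margin, mean + margin)
```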
- If limited data is available
  - repeated runs within 1 data set?
    - e.g. cross-validation: n splits into Str and Ste
    - e.g. 30 random splits into Str and Ste
  - problem: dependencies between these data sets
    - may make it very easy to draw incorrect conclusions!
    - high probability of a "type 1" error: concluding one learner is better than the other when this is not the case
    - to be avoided
  - reasonable approach: 5 times 2-fold cross-validation
    - details: T. Dietterich, Neural Computation 10(7), 1998
  - only really good solution: collect more data!
ROC analysis
- Accuracy-based evaluation is not always appropriate
- Shortcomings
  - it is a relative measure
  - unstable when the class distribution may change
  - assumes symmetric misclassification costs
- Alternatives
  - correlation
  - ROC analysis
1. Accuracy is a relative measure
- E.g., "99% correct prediction": is this good?
  - Yes, if 50% "+" and 50% "-"
  - No, if 1% "+" and 99% "-"
    - always predicting "neg" gives 99% accuracy
- Should be compared with the "base accuracy" of always predicting the majority class
  - base accuracy = max(a+c, b+d) / T
- Even then, it may be misleading...
- Assume all examples are "-", except those in the blue region ("+")
- Which of these classifiers is best?
  - Classifier 1: IF false THEN pos (96% correct)
  - Classifier 2: IF green area THEN pos (92% correct)
- Alternative measures exist
  - e.g., correlation: ρ = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
  - close to 1: high correlation between predictions and classes
  - close to 0: no correlation
  - (close to -1: predicting the opposite)
- Note: in the confusion matrix, +/- are the actual values and pos/neg are the predictions
2. Different misclassification costs
- Accuracy ignores the possibility of different misclassification costs
  - sometimes, incorrectly predicting "pos" costs more/less than incorrectly predicting "neg"
- E.g.
  - not treating an ill patient vs. treating a healthy patient
  - refusing credit to a client who would have paid back vs. assigning credit to a client who won't pay back
- Need to distinguish the probabilities of making different types of errors
- Solution: distinguish predictive accuracy for the different classes
- Acc = probability that some instance is classified correctly
- Decomposed into
  - TP = probability that a positive instance is classified correctly (true positive rate)
  - TN = probability that a negative instance is classified correctly (true negative rate)
- We also define
  - FP = 1 - TN = false positive rate = probability that a negative is classified as positive
  - analogously, FN = 1 - TP
- Consider costs CFP and CFN
  - the cost of a false positive resp. a false negative
- Expected cost of a single prediction
  - C = CFP * P(pos|-) * P(-) + CFN * P(neg|+) * P(+)
  - estimated by C = CFP * FP * T-/T + CFN * FN * T+/T
- Note
  - Acc is a weighted average of TP and TN: Acc = TP * T+/T + TN * T-/T
  - C is not computable from Acc alone
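A tiny sketch of these two formulas; the rates, costs, and class proportions are illustrative numbers only. It shows two classifiers with the same Acc but very different expected costs.

```python
def expected_cost(tp_rate, tn_rate, c_fp, c_fn, pos_fraction):
    """Expected misclassification cost C = CFP*FP*(T-/T) + CFN*FN*(T+/T)."""
    fp_rate, fn_rate = 1 - tn_rate, 1 - tp_rate
    neg_fraction = 1 - pos_fraction
    return c_fp * fp_rate * neg_fraction + c_fn * fn_rate * pos_fraction

def accuracy(tp_rate, tn_rate, pos_fraction):
    """Acc = TP*(T+/T) + TN*(T-/T): a weighted average of TP and TN."""
    return tp_rate * pos_fraction + tn_rate * (1 - pos_fraction)

# Same Acc (0.7), very different costs when a false negative is 10x as expensive as a false positive:
print(accuracy(0.8, 0.6, 0.5), expected_cost(0.8, 0.6, c_fp=1, c_fn=10, pos_fraction=0.5))
print(accuracy(0.6, 0.8, 0.5), expected_cost(0.6, 0.8, c_fp=1, c_fn=10, pos_fraction=0.5))
```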
3. Changing class distributions
- Accuracy is sensitive to changes in the class distribution
- E.g.
  - suppose a classifier has TP = 0.8, TN = 0.6
  - tested on a test set with T+/T = 0.5, T-/T = 0.5: Acc = 0.7
  - employed in an environment with T+/T = 0.3, T-/T = 0.7: Acc = 0.66
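The same weighted-average formula, reused to reproduce the numbers above:

```python
# Acc = TP * (T+/T) + TN * (T-/T), for the classifier with TP = 0.8 and TN = 0.6
tp_rate, tn_rate = 0.8, 0.6
print(tp_rate * 0.5 + tn_rate * 0.5)   # test set with 50% positives: 0.70
print(tp_rate * 0.3 + tn_rate * 0.7)   # deployment with 30% positives: 0.66
```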
ROC diagrams
- ROC = "Receiver Operating Characteristic"
- Allows us to see
  - how well a classifier will perform given certain misclassification costs and class distribution
  - in which environments one classifier is better than another
- A ROC diagram plots TP versus FP
- From the confusion matrix (rows: prediction, columns: actual value)
  - TP = a/(a+c) = a/T+
  - FP = b/(b+d) = b/T-
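A minimal helper computing a classifier's ROC point from confusion-matrix counts (a, b, c, d as in the slide: a = true positives, b = false positives, c = false negatives, d = true negatives); the counts in the example call are invented.

```python
def roc_point(a, b, c, d):
    """Return (FP rate, TP rate) for one classifier: TP = a/(a+c), FP = b/(b+d)."""
    return b / (b + d), a / (a + c)

# Example confusion matrix: 40 true positives, 10 false positives, 10 false negatives, 40 true negatives.
print(roc_point(a=40, b=10, c=10, d=40))   # (0.2, 0.8): one point in the ROC diagram
```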
Classifier in a ROC diagram
- 1 classifier = 1 point in the ROC diagram
- The closer to the upper left, the better
(Figure: ROC diagram with FP on the horizontal axis and TP on the vertical axis, both from 0 to 1; the point (0,1) is perfect prediction, TP = 1 means no positives forgotten, FP = 0 means no negatives returned, the diagonal corresponds to random prediction; two classifiers A and B are shown as points)
Rank classifiers
- Rank classifiers
  - assign a rank to their predictions
  - some predictions are more certain than others -> higher rank
- E.g. neural nets
  - criterion: output < 0.5 -> neg, > 0.5 -> pos
  - but 0.9 is more certainly positive than 0.51
  - raise/lower the threshold of 0.5: what is the effect? (see the sketch below)
- E.g. decision trees
  - use the purity of the leaf used for a prediction to rank it
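A sketch of how sweeping the threshold of a rank classifier traces a ROC curve; the scores and labels in the example are made up.

```python
def roc_curve(scores, labels):
    """Sweep the decision threshold over the ranked scores and return the ROC points.
    scores: predicted 'certainty of positive'; labels: True for actual positives."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                     # threshold above every score: nothing predicted positive
    tp = fp = 0
    # lower the threshold one example at a time, from most to least certain
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FP rate, TP rate) at this threshold
    return points

scores = [0.9, 0.8, 0.7, 0.55, 0.51, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False, True, False, False]
print(roc_curve(scores, labels))
```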
- A rank classifier gives a ROC curve
(Figure: ROC diagram with the curve of rank classifier C and classifiers A and B as points; C with a low threshold is better than B, C with a high threshold is worse than A)
Costs in a ROC diagram
- Given misclassification costs
  - cFP = cost of a false positive
  - cFN = cost of a false negative (undetected "+")
- Average cost is
  - c = cFP * FP * T-/T + cFN * (1 - TP) * T+/T
- Lines of equal cost can be drawn in the ROC diagram (straight lines, with slope (cFP * T-) / (cFN * T+))
(Figure: ROC diagram with classifiers A, B, C and parallel iso-cost lines; cost increases towards the lower right)
(Figure: the same diagram with two families of iso-cost lines; with a high cost of false positives A is better, with a low cost of false positives C is better; B is never better than C)
Sets of classifiers
- Different classifiers may be good in different environments
- Given a set of classifiers, ROC analysis allows us to
  - decide in which cases a classifier is optimal
  - remove classifiers that are never optimal
- Classifiers that may be optimal always lie on the convex hull of the set of points (see the sketch below)
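A small sketch that computes the upper convex hull of a set of ROC points with the standard monotone-chain construction; the five example classifiers are invented.

```python
def roc_convex_hull(points):
    """Return the ROC points on the upper convex hull, i.e. the classifiers that can be
    optimal for some combination of costs and class distribution. The trivial classifiers
    (0,0) ("always neg") and (1,1) ("always pos") are included."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})

    def cross(o, a, b):
        # > 0 if o -> a -> b makes a left (counter-clockwise) turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []
    for p in pts:                       # monotone-chain construction of the upper hull
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()                  # the middle point lies on or below the hull
        hull.append(p)
    return hull

# Five classifiers as (FP, TP) points; those not returned are never optimal.
print(roc_convex_hull([(0.1, 0.4), (0.2, 0.7), (0.4, 0.6), (0.5, 0.9), (0.8, 0.95)]))
```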
Example: convex hull
- Which classifiers are never optimal?
(Figure: several classifiers plotted as points in the ROC diagram; only those on the convex hull can ever be optimal)
Evaluation of regression models
- Predicting numbers: no "right or wrong" approach
- Possible measures (see the sketch below)
  - Sum of squared errors SSE
    - an absolute measure
  - Relative error RE: measures improvement over a trivial model
    - RE = SSE(hypothesis) / SSE(trivial hypothesis)
    - trivial hypothesis: e.g. always predict the mean
  - Spearman correlation r
    - measures how well predictions and actual values correlate
    - less sensitive to the actual size of the errors
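A sketch of these three measures in plain Python; the small data vectors at the end are invented, and the Spearman formula below assumes there are no ties.

```python
def sse(predicted, actual):
    """Sum of squared errors: an absolute measure."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def relative_error(predicted, actual):
    """RE = SSE(hypothesis) / SSE(trivial hypothesis that always predicts the mean)."""
    mean = sum(actual) / len(actual)
    return sse(predicted, actual) / sse([mean] * len(actual), actual)

def spearman_correlation(predicted, actual):
    """Correlation between the ranks of predictions and actual values
    (insensitive to the actual size of the errors); assumes no ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp, ra = ranks(predicted), ranks(actual)
    n = len(rp)
    d2 = sum((x - y) ** 2 for x, y in zip(rp, ra))
    return 1 - 6 * d2 / (n * (n**2 - 1))

predicted = [2.1, 3.9, 6.2, 7.8]
actual = [2.0, 4.0, 6.0, 8.0]
print(sse(predicted, actual), relative_error(predicted, actual), spearman_correlation(predicted, actual))
```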
To remember
- Accuracy-based evaluation
  - methods for evaluating a single hypothesis, comparing hypotheses, comparing algorithms
  - limitations of accuracy as an evaluation criterion
- Other evaluation criteria
  - criteria for regression
  - correlation as an alternative for accuracy
  - ROC analysis: plotting classifiers and rank classifiers, iso-cost lines, convex hull
6 Bayesian learning
- Introduction to probabilistic (Bayesian) methods
- MAP and ML hypotheses
- Minimum description length principle
- Bayes optimal classifier
- Naïve Bayes learner
  - example: learning over text data
- Bayesian belief networks
- (Expectation Maximization (EM): see later)
- Reference: Mitchell, Ch. 6
Bayesian approaches
- Several roles for probability theory in machine learning
  - describing existing learners
    - e.g. compare them with the optimal probabilistic learner
  - developing practical learning algorithms
    - e.g. the Naïve Bayes learner
- Bayes' theorem plays a central role
Basics of probability
- P(A): probability that A happens
- P(A|B): probability that A happens, given that B happens (conditional probability)
- Some rules
  - complement: P(not A) = 1 - P(A)
  - disjunction: P(A or B) = P(A) + P(B) - P(A and B)
  - conjunction: P(A and B) = P(A) * P(B|A)
    - = P(A) * P(B) if A and B are independent
  - total probability: P(A) = Σi P(A|Bi) * P(Bi)
Bayes' Theorem
- P(A|B) = P(B|A) * P(A) / P(B)
- Mainly 2 ways of using Bayes' theorem
- Applied to learning a hypothesis h from data D
  - P(h|D) = P(D|h) * P(h) / P(D) ∝ P(D|h) * P(h)
  - P(h): a priori probability that h is correct
  - P(h|D): a posteriori probability that h is correct
  - P(D): probability of obtaining data D
  - P(D|h): probability of obtaining data D if h is correct
- Applied to the classification of a single example e
  - P(class|e) = P(e|class) * P(class) / P(e)
Bayes' theorem: example
- Example
  - assume some lab test for a disease has a 98% chance of giving a positive result if the disease is present, and a 97% chance of giving a negative result if the disease is absent
  - assume furthermore that 0.8% of the population has this disease
  - given a positive result, what is the probability that the disease is present?
  - P(D|P) = P(P|D) * P(D) / P(P) = 0.98 * 0.008 / (0.98 * 0.008 + 0.03 * 0.992)
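The same computation spelled out, with the numbers from the slide:

```python
# Lab-test example: P(pos | disease) = 0.98, P(neg | no disease) = 0.97, P(disease) = 0.008
p_pos_given_d = 0.98
p_pos_given_not_d = 1 - 0.97          # = 0.03, the false-positive rate of the test
p_d = 0.008

# total probability: P(pos) = P(pos | D) P(D) + P(pos | not D) P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D | pos) = P(pos | D) P(D) / P(pos)
print(p_pos_given_d * p_d / p_pos)    # about 0.21: a positive test is far from conclusive
```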
MAP and ML hypotheses
- Question: given the current data D and some hypothesis space H, return the hypothesis h in H that is most likely to be correct
- Note: this h is optimal in a certain sense
  - no method can exist that finds the correct h with higher probability
MAP hypothesis
- Given some data D and a hypothesis space H, find the hypothesis h ∈ H that has the highest probability of being correct, i.e., P(h|D) is maximal
- This hypothesis is called the maximum a posteriori hypothesis hMAP
  - hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h)/P(D) = argmax_{h∈H} P(D|h)P(h)
  - the last equality holds because P(D) is constant
- So we need P(D|h) and P(h) for all h ∈ H to compute hMAP
ML hypothesis
- P(h): a priori probability that h is correct
- What if there are no preferences for one h over another?
- Then assume P(h) = P(h') for all h, h' ∈ H
- Under this assumption hMAP is called the maximum likelihood hypothesis hML
  - hML = argmax_{h∈H} P(D|h) (because P(h) is constant)
- How to find hMAP or hML?
  - brute force method: compute P(D|h), P(h) for all h ∈ H
  - usually not feasible
Version spaces in a MAP / ML setting
- Consider Find-S: finds the most specific hypothesis hms in H consistent with the data D
- Under what circumstances is hms = hMAP?
- Assume (for simplicity) the set of instances is given, and D consists of the classes of the instances
  - P(D|h) = 0 if h is inconsistent with D, 1 otherwise
  - P(h|D) ∝ P(D|h) * P(h)
  - for any h not in VS: P(h|D) = 0
  - for any h in VS: P(h|D) ∝ P(h)
  - so hms = hMAP if, for all h, h': h more specific than h' implies P(h) ≥ P(h')
Characterising learners in a MAP setting
- E.g. candidate elimination
(Figure: candidate elimination, taking training examples and a hypothesis space H and producing hypotheses, is equivalent to a brute-force MAP learner on the same inputs with P(h) uniform and P(D|h) = 1 if h is consistent with D, 0 otherwise)
Characterising numeric prediction in a ML setting
- Minimisation of MSE (mean squared error)
  - hMSE = hML under the assumption of Gaussian noise
  - target values in the data are produced as d(x) = f(x) + ε, with
    - f(x) the true target value of x and ε noise
    - ε random and normally distributed
- Predicting probabilities
  - the most likely hypothesis can be found by maximising the cross-entropy
  - hML = argmax_{h∈H} Σi [di ln h(xi) + (1 - di) ln(1 - h(xi))]
  - with di the target value for instance xi (0/1) and h(xi) the predicted probability that the class of xi is 1
Minimum Description Length (MDL)
- Occam's razor: prefer the simplest hypothesis
  - simplest = shortest description
- Minimum description length principle
  - hypothesis + corrections should have the shortest description
  - trades off complexity and correctness of the hypothesis
  - given data D, hypothesis space H, and encodings C1 and C2 for hypotheses resp. data:
    - hMDL = argmin_{h∈H} LC1(h) + LC2(D|h)
    - LC(x) denotes the description length of x under encoding C
    - LC2(D|h) encodes the exceptions to h's predictions
    - LC2(D|h) = 0 if h is entirely correct
- Interesting observation
  - hMAP equals hMDL under optimal encodings
    - the optimal encoding length of a message is based on its probability
  - But note: there is no link between optimal encodings and practical encoding mechanisms (trees, ...)
    - hence, no reason for claiming that e.g. shorter trees have a higher probability of being correct!
Bayes Optimal Classifier
- Problem considered up till now
  - given data D and hypothesis space H, find the most probable hypothesis h in H
- Now consider this problem
  - given data D, hypothesis space H, and a new instance x, what is the most probable classification of x?
  - equivalent to h(x), with h the most probable hypothesis? No.
- Example
  - P(h1|D) = 0.4, P(h2|D) = P(h3|D) = 0.3, and h1(x) = +, h2(x) = h3(x) = -
  - What is the most probable classification of x?
Bayes optimal classifier
- With hypothesis space H and set of classes V, the most probable classification is
  argmax_{v∈V} Σ_{h∈H} P(v|h) P(h|D)
- In our example: P(+|h1) = 1, P(-|h1) = 0, etc., so the most probable classification of x is - (0.6 vs. 0.4)
Gibbs classifier
- Bayes optimal is optimal but expensive
  - uses all hypotheses in H
- What if we approximate it as follows
  - select one h at random, according to the probabilities P(hi|D)
  - predict h(x)
- This method is called the Gibbs classifier
- Surprisingly: E(errorGibbs) ≤ 2 * E(errorBayesOptimal)
- Try to apply this to the VS approach
Naïve Bayes classifier
- Simple, popular classification method
- Based on Bayes' rule + an assumption of conditional independence
  - the assumption is often violated in practice
  - even then, it usually works well
- Successful application: classification of text documents
Classification using Bayes' rule
- Given attribute values, what is the most probable value of the target variable?
  vMAP = argmax_{vj∈V} P(vj | a1,...,an) = argmax_{vj∈V} P(a1,...,an | vj) P(vj)
- Problem: too much data needed to estimate P(a1,...,an | vj)
The Naïve Bayes classifier
- Naïve Bayes assumption: the attributes are independent given the class
  - P(a1,...,an | vj) = P(a1|vj) P(a2|vj) ... P(an|vj)
  - also called conditional independence (given the class)
- Under that assumption, vMAP becomes
  vNB = argmax_{vj∈V} P(vj) Πi P(ai|vj)
- What if the assumption is violated?
  - i.e. P(a1,...,an | vj) ≠ P(a1|vj) P(a2|vj) ... P(an|vj)
- The prediction is still equivalent to the Bayes prediction as long as the weaker condition holds that the argmax is unchanged:
  argmax_{vj∈V} P(vj) Πi P(ai|vj) = argmax_{vj∈V} P(vj) P(a1,...,an | vj)
- But the probabilities associated with the prediction may be unrealistically close to 0 or 1
Learning a Naïve Bayes classifier
- To learn such a classifier: just estimate P(vj) and P(ai|vj) from the data
- How to estimate?
  - simplest: the standard estimate from statistics
    - estimate a probability by the sample proportion
    - e.g., estimate P(A|B) as count(A and B) / count(B)
  - in practice, something more complicated is needed
Estimating probabilities
- Problem
  - What if attribute value ai is never observed for class vj?
  - Estimate P(ai|vj) = 0 because count(ai and vj) = 0?
  - The effect is too strong: this 0 makes the whole product 0!
- Solution: use the m-estimate (see the sketch below)
  - interpolates between the observed proportion nc/n and an a priori estimate p: (nc + m*p) / (n + m)
  - the estimate may get close to 0 but is never 0
  - m is the weight given to the a priori estimate
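A one-function sketch of the m-estimate; the counts and prior in the example call are invented.

```python
def m_estimate(n_c, n, prior, m):
    """m-estimate of a probability: interpolates between the observed proportion n_c/n
    and the a priori estimate `prior`; m is the weight given to the prior."""
    return (n_c + m * prior) / (n + m)

# Attribute value never observed for this class (n_c = 0 out of n = 20 examples):
print(0 / 20)                                # plain estimate: 0, which would zero out the whole product
print(m_estimate(0, 20, prior=0.5, m=2))     # m-estimate: small but non-zero (about 0.045)
```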
Learning to classify text
- Example application
  - given the text of a newsgroup article, guess which newsgroup it is taken from
- Naïve Bayes turns out to work well on this application
- How to apply NB?
  - key issue: how do we represent examples? what are the attributes?
Representation
- Binary classification (+/-) or multiple classes possible
- Attributes = word positions
  - i.e. attribute i represents the i-th word in the text
  - the value of the attribute is the word that occurs there
- Note: we could have chosen other representations, e.g. attribute = a specific word, value = its frequency in the text
- Further assumption: the probability of having a specific word is independent of its position
  - P(ai = wk | vj) = P(am = wk | vj) for all i, m
Algorithm
procedure learn_naïve_bayes_text(E: set of articles, V: set of classes)
  Voc = all words and tokens occurring in E
  estimate P(vj) and P(wk|vj) for all wk in Voc and vj in V:
    Nj = number of articles of class vj
    N = total number of articles
    P(vj) = Nj / N
    nkj = number of times word wk occurs in texts of class vj
    nj = total number of words in texts of class vj (counting duplicates)
    P(wk|vj) = (nkj + 1) / (nj + |Voc|)

procedure classify_naïve_bayes_text(A: article)
  remove from A all words/tokens that are not in Voc
  return argmax_{vj∈V} P(vj) Πi P(ai|vj)
- Experiment reported in Mitchell
  - 1000 articles taken from 20 newsgroups
  - guess the correct newsgroup for unseen documents
  - 89% classification accuracy with the previous approach
Bayesian Belief Networks
- Consider two extremes of a spectrum
  - estimating the full joint probability distribution
    - would yield the optimal classifier
    - but infeasible in practice (too much data needed)
  - Naïve Bayes
    - much more feasible
    - but strong assumptions of conditional independence
- Can we find something in between?
  - make some independence assumptions, but only where reasonable
Bayesian belief networks
- A Bayesian belief network consists of
  - 1. a graph
    - intuitively indicates which variables directly influence which other variables
    - an arrow from A to B: A has a direct effect on B
    - parents(X) = set of all nodes directly influencing X
    - X is influenced only by its parents
    - formally: each node is conditionally independent of each of its non-descendants, given its parents
      - conditional independence: cf. Naïve Bayes
      - X is conditionally independent of Y given Z iff P(X|Y,Z) = P(X|Z)
  - 2. conditional probability tables
    - for each node X, P(X|parents(X)) is given
Example
- A burglary or an earthquake may cause the alarm to go off
- The alarm going off may cause one of the neighbours to call
(Figure: network Burglary -> Alarm <- Earthquake, with Alarm -> John calls and Alarm -> Mary calls, and conditional probability tables:
  P(B) = 0.05, P(E) = 0.01
  P(A|B,E) = 0.9, P(A|B,-E) = 0.8, P(A|-B,E) = 0.4, P(A|-B,-E) = 0.01
  P(J|A) = 0.8, P(J|-A) = 0.1
  P(M|A) = 0.9, P(M|-A) = 0.2)
- The network topology usually reflects direct causal influences
  - other structures are also possible
  - but they may render the network more complex
(Figure: two networks over the same five variables, the causal topology above and an alternative, non-causal topology that makes the network more complex)
- The graph + conditional probability tables allow us to construct the joint probability distribution of all variables
  - P(X1,X2,...,Xn) = Πi P(Xi | parents(Xi))
- In other words: a Bayesian belief network carries full information on the joint probability distribution
Example
- Joint probability distribution from the conditional ones, for the alarm network above (a sketch follows below)
  - P(J,M,A,B,E) = P(J|A) P(M|A) P(A|B,E) P(B) P(E)
  - to see this, start with P(B) and P(E)
    - conditionally independent from each other given their parents = unconditionally independent, hence P(B,E) = P(B) P(E)
  - P(A,B,E) = P(A|B,E) P(B,E) (by definition)
  - P(J,M,A,B,E) = P(J,M|A,B,E) P(A,B,E) (by def.) = P(J|A) P(M|A) P(A,B,E)
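A sketch that evaluates this factorisation with the conditional probability tables from the alarm example above:

```python
# Conditional probability tables from the alarm network (True = event happens)
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.01, False: 0.99}
P_A = {(True, True): 0.9, (True, False): 0.8, (False, True): 0.4, (False, False): 0.01}  # P(A=True | B, E)
P_J = {True: 0.8, False: 0.1}   # P(J=True | A)
P_M = {True: 0.9, False: 0.2}   # P(M=True | A)

def joint(j, m, a, b, e):
    """P(J,M,A,B,E) = P(J|A) P(M|A) P(A|B,E) P(B) P(E)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return p_j * p_m * p_a * P_B[b] * P_E[e]

# Probability that both neighbours call, the alarm went off, there was a burglary but no earthquake:
print(joint(j=True, m=True, a=True, b=True, e=False))   # 0.8 * 0.9 * 0.8 * 0.05 * 0.99 ≈ 0.0285
```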
Inference
- Given values for certain nodes, infer the probability distribution for the values of other nodes
- The general algorithm is quite complicated
  - see Russell & Norvig, 1995: Artificial Intelligence, a Modern Approach
Simplest case: 2 nodes
- Network A -> B; given P(A) and P(B|A)
  - A known: infer P(B | A=a), directly from P(B|A)
  - A unknown: infer P(B), using the "total probability" rule
  - B known: infer P(A | B=b), using Bayes' rule
  - B unknown: infer P(A), which is simply the given P(A)
A simple 3-node network
- Network A -> B -> C; given P(A), P(B|A), P(C|B)
- E.g., A = a and C = c known:
  P(B=b | A=a, C=c) = P(B=b, A=a, C=c) / Σi P(A=a, B=bi, C=c)
                    = P(A=a) P(B=b|A=a) P(C=c|B=b) / Σi P(A=a) P(B=bi|A=a) P(C=c|B=bi)
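A sketch of this computation for a chain of three binary variables; all the table values below are invented illustration numbers.

```python
# Chain A -> B -> C with binary variables; invented conditional probability tables.
P_A = 0.3                               # P(A = true)
P_B_given_A = {True: 0.9, False: 0.2}   # P(B = true | A)
P_C_given_B = {True: 0.7, False: 0.1}   # P(C = true | B)

def p_b_given_a_and_c(a, c):
    """P(B = true | A = a, C = c), computed by summing the joint over the values of B."""
    def joint(b):
        p_b = P_B_given_A[a] if b else 1 - P_B_given_A[a]
        p_c = P_C_given_B[b] if c else 1 - P_C_given_B[b]
        return p_b * p_c                # P(A=a) cancels out in numerator and denominator
    return joint(True) / (joint(True) + joint(False))

# Observing A = true and C = true makes B = true very likely:
print(p_b_given_a_and_c(a=True, c=True))   # about 0.98
```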
More nodes
- In general, relatively complex reasoning can be achieved
  - forward and backward reasoning
  - "explaining away"
    - "good grade" is evidence for "studied hard"
    - but it is less strong evidence if it is known that the person looked at a neighbour's copy
(Figure: network with "study" and "see neighbour's copy" both pointing to "good grade")
General case
(Figure: a larger network in which some nodes are evidence (observed), one node is to be predicted, and the remaining nodes are unobserved)
- In general, inference is NP-complete
  - approximate methods exist, e.g. Monte-Carlo methods
Learning Bayesian networks
- Assume the structure of the network is given
  - only the conditional probability tables need to be learnt
  - training examples may include values for all variables, or just for some of them
- When all variables are observable
  - estimating the probabilities is as easy as for Naïve Bayes
  - e.g. estimate P(A|B,C) as count(A,B,C) / count(B,C)
- When not all variables are observable
  - notice the similarity with training neural networks (hidden units)
  - methods exist, based on gradient descent or EM (see later)
- When the structure of the network is not given
  - search for structure + tables
  - e.g. propose a structure, learn the tables
  - propose a change to the structure, relearn, see whether results improve
  - active research topic
Sample complexity
- When the structure is known and all variables are observable
  - how many examples are needed for learning?
  - accurate estimates of the conditional probability tables are needed
  - the complexity of learning is linear in the size of the largest probability table
    - i.e. exponential in the number of parent variables of a node
  - compare with estimating the joint distribution
    - exponential in the total number of variables
  - and with Naïve Bayes
    - always only 1 parent variable, i.e. the class
To remember
- Importance of Bayes' theorem
- MAP, ML, MDL
  - definitions, characterising learners from this perspective, relationship MDL-MAP
- Bayes optimal classifier, Gibbs classifier
- Naïve Bayes: how it works, assumptions made, application to text classification
- Bayesian networks: representation, inference, learning
7 Computational learning theory
- Q: how difficult are certain learning tasks?
- Measuring the complexity of learning
- Different settings for concept learning
- PAC-learning
- VC dimension
- Mistake bounds
- Reference: Mitchell, Ch. 7
COLT: computational learning theory
- Find a theory that relates
  - the probability of successful learning
  - the number of training examples (sample complexity)
  - how training examples are chosen
  - the complexity of the hypothesis space
  - the accuracy to which the target is approximated
  - the time spent on learning (time complexity)
Complexity of concept learning
- Task
  - given
    - instance space X
    - unknown target function c: X -> {0,1}
    - hypothesis space H
    - training examples D ⊆ X
  - find
    - a hypothesis h in H such that h(x) = c(x)
      - for all x ∈ D?
      - for all x ∈ X? (this is most interesting)
Settings for concept learning
- How many examples are needed for learning depends on how well they are chosen
- 1) the learner proposes instances, the teacher classifies them
  - i.e. the learner chooses x, the teacher provides c(x)
- 2) the teacher gives instances
  - the teacher chooses x and provides c(x)
- 3) instances are provided randomly
  - nobody chooses x, the teacher provides c(x)
- Which one do you think is easiest?
Sample complexity: setting 1
- The learner proposes x, the teacher gives c(x)
- The learner can choose x based on what it already knows
  - e.g. version spaces
    - try to choose x so that half of VS predicts +, the other half -
    - after seeing c(x), VS is divided by 2
    - hence, the hypothesis will be learnt after ⌈log2 |H|⌉ examples
  - it may not be possible to choose x in this way
    - in this case, more examples are needed
  - general idea: the learner should reduce the remaining possibilities as much as possible
Sample complexity: setting 2
- The teacher chooses x and provides c(x)
- Note: the teacher knows how the learner works (i.e., has all the knowledge the learner has) + knows the target concept
  - a benevolent teacher can point the learner in the right direction
  - what is the optimal teaching strategy?
  - it will depend on the form of H
- Example
  - learning conjunctions of up to n boolean literals of the form Ai = true/false
  - n+1 examples suffice (why?)
Sample complexity: setting 3
- x is randomly chosen, according to a probability distribution D over X
- Very important setting in practice
  - often no control over how data are collected
- Question, more specifically
  - assume instance space X, hypothesis space H, set of possible target concepts C, distribution D over X
  - given a random set S of examples <x, c(x)> with c ∈ C and each x drawn according to D
  - find h ∈ H for which P_{x drawn from D}(h(x) ≠ c(x)) is small
PAC learning
- Covers setting 3
  - examples presented in random fashion make it difficult to guarantee learning of the correct hypothesis
  - relax the learning task: the learner should very probably find an approximately correct hypothesis
    - "probably": probability close to 1 of finding an acceptable hypothesis
      - denote this probability as 1-δ
    - "acceptable": the probability that h predicts something different from c is close to 0
      - denote this probability as ε
PAC learning
- Probably (1-δ) Approximately (ε) Correct
- Consider
  - a class C of possible target concepts defined over an instance space X; instances have size n
  - a learner L using hypothesis space H
- Definition (cf. Mitchell)
  - C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, and all 0 < ε < 1/2 and 0 < δ < 1/2, L finds with probability at least 1-δ some h ∈ H with errD(h) < ε, in time polynomial in 1/ε, 1/δ, n and size(c)
Bounding the true error
- Consider
  - true error: errD(h) = P_{x drawn from D}(h(x) ≠ c(x))
  - training error: errS(h) = P_{x∈S}(h(x) ≠ c(x))
- Can we bound errD(h), given errS(h)?
- Simplest case: assume errS(h) = 0
  - no noise, the proposed hypothesis is consistent with the data
  - to find an upper bound on errD(h) given errS(h) = 0, find the probability that some h ∈ VS could have errD(h) ≥ ε
  - if this probability is low, ε is a good upper bound
ε-exhausting the VS
- Definition
  - The version space VS_{H,S} is ε-exhausted with respect to c and D if, for all h ∈ VS_{H,S}, errD(h) < ε
  - i.e., every hypothesis h in VS_{H,S} has error less than ε with respect to c and D
- Important question
  - how large should S be so that, with probability 1-δ, VS_{H,S} is ε-exhausted?
  - this will indicate the sample complexity of the task
- Theorem (Haussler, 1988)
  - assume H is finite and S is a set of random examples (drawn according to D) of some target concept c; then for any 0 ≤ ε ≤ 1: P(VS_{H,S} not ε-exhausted) ≤ |H| e^(-ε|S|)
- Proof
  - for any single h ∈ H with errD(h) ≥ ε: P(h ∈ VS_{H,S}) ≤ (1-ε)^|S|
  - if we have n such h's, the probability that at least one of them is in VS is ≤ n (1-ε)^|S|
  - since there can't be more than |H| of them, this is ≤ |H| (1-ε)^|S|
  - hence P(VS_{H,S} not ε-exhausted) ≤ |H| (1-ε)^|S|
  - the latter is bounded by |H| e^(-ε|S|), using (1-ε) ≤ e^(-ε)
Sample complexity
- So if we want VS to be ε-exhausted with probability 1-δ, we need |H| e^(-ε|S|) ≤ δ and hence
  - |S| ≥ 1/ε (ln|H| + ln(1/δ))
- Try yourself (a sketch follows below)
  - assume H is the space of conjunctions of literals chosen from n attributes (e.g. A1 and not A3)
  - how many random examples are needed to 0.01-exhaust VS with probability 0.95, if n = 10?
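A small sketch of the bound; it assumes |H| = 3^n for conjunctions of literals over n attributes (each attribute appears positively, negatively, or not at all), which is the counting used in Mitchell for this hypothesis space.

```python
from math import log, ceil

def sample_complexity(h_size, epsilon, delta):
    """Smallest |S| satisfying |S| >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((log(h_size) + log(1 / delta)) / epsilon)

# Conjunctions of literals over n attributes: each attribute is positive, negated, or absent.
n = 10
h_size = 3 ** n     # assumption: |H| = 3^n for this hypothesis space
print(sample_complexity(h_size, epsilon=0.01, delta=0.05))   # roughly 1400 examples
```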
Agnostic learning
- So far, we assumed finding an h ∈ VS_{H,S} was possible
- What if not?
- Agnostic learning: not sure that c ∈ H
- In that case, we can only hope to approximate c as closely as possible
- From the so-called Hoeffding bounds
  - P(errD(h) > errS(h) + ε) ≤ e^(-2mε²), with m = |S|
  - derive |S| ≥ 1/(2ε²) (ln|H| + ln(1/δ))
The VC dimension
- Vapnik-Chervonenkis dimension
  - an alternative notion for measuring how expressive H is
  - up till now we used |H|
  - what if H is infinite? (e.g., neural nets)
  - also, many different h ∈ H may represent more or less the same hypothesis: to what extent does H contain truly different hypotheses?
  - the VC dimension is based on the notion of shattering sets of instances
Shattering instances
- A set of instances S is shattered by a hypothesis space H iff for each subset of S there exists a hypothesis h ∈ H consistent with it
  - in other words: for each possible concept c on S, an h ∈ H exists that is equivalent to c on S
- Examples
  - consider H = half-planes (bounded by straight lines)
  - (Figure: the left set of points is shattered by H, the right set is not)
Vapnik-Chervonenkis dimension
- The VC dimension VC(H) of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H, or ∞ if arbitrarily large subsets can be shattered
- Note the peculiarities
  - "shatters": for each labelling, a consistent h exists
  - VC-dim ≥ d if at least one subset of size d is shattered
- E.g. the VC dimension of straight lines in R² is 3
  - check that four points can never be shattered
Sample complexity bounds with the VC dimension
- How many examples are needed to ε-exhaust VS_{H,S} with probability at least 1-δ?
  - |S| ≥ 1/ε (4 log2(2/δ) + 8 VC(H) log2(13/ε))
Mistake bounds
- Consider an alternative kind of complexity measure
  - up till now: look at the total number of examples needed before finding a good hypothesis
  - other setting:
    - when getting an example, first guess its classification
    - if wrong, the teacher corrects it
    - can we bound the number of mistakes made during this process, before converging to a good hypothesis?
Example: Find-S
- Find-S, in the context of conjunctions of boolean literals
  - start with the most specific hypothesis
    - h = l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ ... ∧ ln ∧ ¬ln (everything classified negative)
  - for each positive instance x
    - remove from h each literal not satisfied by x
- How many mistakes before finding the correct h?
  - Try yourself (the answer is n+1)
Example: Halving algorithm
- Halving algorithm
  - keep track of VS using the candidate-elimination algorithm
  - classify a new instance as follows
    - consider the prediction of each h ∈ VS as a vote
    - use the majority vote for classification
- How many mistakes before VS converges to the correct concept?
  - worst case?
  - best case?
Optimal mistake bounds
- Let MA(C) = the maximum number of mistakes made by algorithm A to learn concepts in C (over all concepts and training sequences)
- The optimal mistake bound Opt(C) is the minimal MA(C) over all possible A
- Property: VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|)
To remember
- What COLT is about
- Different settings for learning
- PAC-learning: definition, ε-exhaustion, derivation of the simplest bound
- Shattering, VC dimension: definitions and intuition
- Mistake bounds: examples, optimal mistake bound, relationship with other measures