Title: 5 Evaluation of hypotheses
5 Evaluation of hypotheses
- Q: How good are the results of a learning process?
- Some statistics
- Evaluating a single hypothesis
- Comparing hypotheses
- Comparing learning algorithms
- ROC analysis
- Reference: Mitchell, Ch. 5
Some statistics (sorry)
- Evaluation of a predictive model is often based on predictive accuracy: the probability that the hypothesis makes a correct prediction for a random instance
  - acc(h) = P(h(X) = c(X)) = 1 - error(h)
- Estimating this accuracy is standard statistics
- Recall:
  - binomial and normal distributions
  - confidence intervals and hypothesis tests
Binomial distribution
- An experiment that succeeds with probability p is repeated n times; what is the probability of having x successes?
  - assumptions: constant p, independent experiments
- Given a hypothesis h with accuracy p and n instances in some test set, this gives the probability of making x correct predictions
Normal distribution
- The sum of many independent variables follows (approximately) a normal distribution
- Can be used to approximate the binomial distribution
- The formulae used in practice are derived from this approximation
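A minimal sketch of both distributions in plain Python (no external libraries); the accuracy value p and test-set size n below are invented illustration numbers, not values from the slides.

```python
from math import comb, exp, sqrt, pi

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials with success probability p."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

def normal_pdf(x, mean, std):
    """Density of the normal distribution used to approximate the binomial."""
    return exp(-((x - mean) ** 2) / (2 * std**2)) / (std * sqrt(2 * pi))

# Example: hypothesis with true accuracy p = 0.8 evaluated on n = 100 test instances.
n, p = 100, 0.8
mean, std = n * p, sqrt(n * p * (1 - p))   # binomial mean and standard deviation
print(binomial_pmf(80, n, p))              # exact probability of 80 correct predictions
print(normal_pdf(80, mean, std))           # normal approximation of the same probability
```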
Confidence intervals
- From p and n we can compute an interval in which x (or x/n) lies with probability close to 1 (e.g. 0.95)
- In practice we want to do the opposite
  - given p̂ = x/n, give an interval for p
  - an interval that contains p with probability c is called a c confidence interval
(Figure: population distribution and sample distribution, both on the interval [0, 1])
Hypothesis tests
- Principle of a hypothesis test
  - given a certain claim (hypothesis) H0, test it by looking at a sample
  - if the sample gives a result that is very unlikely if H0 were true, reject H0
- E.g.
  - claim: h predicts correctly in 90% of cases (H0: p = 0.9)
  - test on a data set: p̂ = 0.8
  - this is abnormally low, hence reject the claim
  - "abnormal" = the confidence interval computed from p̂ does not contain p
Evaluating a single hypothesis
- To estimate the true accuracy of a hypothesis h
  - compute the accuracy of h on a sample of unseen data
  - compute e.g. a 95% confidence interval
- Formula for the 95% confidence interval
  - derived from the normal distribution
  - p̂ is the accuracy on the sample, n is the size of the sample
  - p̂ ± z_N * sqrt(p̂(1 - p̂)/n)
- z_N values: confidence 0.90 -> 1.64, 0.95 -> 1.96, 0.99 -> 2.58
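A small sketch of this interval computation; the sample accuracy and sample size used in the example call are made up for illustration.

```python
from math import sqrt

# z values for common confidence levels (from the normal distribution)
Z = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def accuracy_confidence_interval(p_hat, n, level=0.95):
    """Confidence interval for the true accuracy, given sample accuracy p_hat on n test instances."""
    margin = Z[level] * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

# Example: 80% accuracy measured on a test set of 100 unseen instances.
print(accuracy_confidence_interval(0.80, 100))   # roughly (0.72, 0.88)
```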
The importance of test sets
- Important: the theory assumes a random, independent sample
- The training set used to learn h is not independent!
- If we denote
  - errorTr(h): error of h on the training set
  - error(h): true error of h on the population
  - errorTe(h): error of h on a sample different from the training set
- then typically (E() denotes expected value)
  - E(errorTr(h)) < error(h) (cf. overfitting as extreme case)
  - E(errorTr(h)) - error(h) = bias of the estimator errorTr
  - E(errorTe(h)) = error(h)
Creating test sets
- How to obtain an independent test set?
- Simple method: use e.g. 2/3 of the available data for training, 1/3 for testing
- Problem if not much data is available
  - a smaller training set makes it more difficult to learn
  - a smaller test set gives less accurate estimates
- Popular solution: cross-validation (see the sketch below)
  - learn h from the full set S
  - partition S into n subsets; learn n hypotheses hi, each time leaving out a different subset
  - use the average of the test-set accuracies of the hi as an estimate of the accuracy of h
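A minimal cross-validation sketch; `learn` and `accuracy` are hypothetical stand-ins for whatever learner and evaluation function are used.

```python
def cross_validation_accuracy(data, learn, accuracy, n_folds=3):
    """Estimate the accuracy of the hypothesis learned from the full data set
    by averaging the test-set accuracies of hypotheses learned on n-1 folds."""
    folds = [data[i::n_folds] for i in range(n_folds)]   # simple partition into n subsets
    scores = []
    for i in range(n_folds):
        test_set = folds[i]
        training_set = [x for j, fold in enumerate(folds) if j != i for x in fold]
        h_i = learn(training_set)                 # hypothesis learned with fold i left out
        scores.append(accuracy(h_i, test_set))    # accuracy of h_i on the held-out fold
    return sum(scores) / n_folds                  # estimate for h learned from the full set
```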
Example: 3-fold cross-validation
(Figure: given data set S, learn h from S; S is partitioned into S1, S2, S3; each Si serves as the test set for hi and the remaining folds as its training set, giving errorS1(h1), errorS2(h2), errorS3(h3))
- Not entirely unbiased: usually a small pessimistic bias
  - S contains more elements than S - Si
  - different folds are not independent
- Still preferable over using the training-set accuracy
Comparison of hypotheses
- Given two hypotheses, which one has the lower true error?
- Statistical hypothesis test
  - claim that both are equally good
  - if the claim is rejected, accept that one is better
- 2 cases
  - compare 2 hypotheses on possibly different test sets
  - compare 2 hypotheses on the same test set
Comparing 2 hypotheses
- To compare h1 and h2, estimate p1 - p2 from samples S1 (giving p̂1) and S2 (giving p̂2)
  - if very likely p1 - p2 > 0: h1 is better
  - similarly, if < 0: h2 is better
  - otherwise, no difference is demonstrated
- Formula for the confidence interval of the difference (analogous to the single-hypothesis case):
  (p̂1 - p̂2) ± z_N * sqrt(p̂1(1 - p̂1)/n1 + p̂2(1 - p̂2)/n2)
Comparing 2 hypotheses on the same data set
- When comparing hypotheses on the same data set, a more powerful procedure is possible
  - it uses more information from the test
  - the possible influence of easy/difficult examples is removed
- More informative method
  - for each single example, compare h1 and h2
  - how often was h1 correct and h2 wrong on the same example, vs. the other way around?
  - use McNemar's test
McNemar's test
- Consider the table of counts: B = number of examples where h1 is correct and h2 wrong, C = number of examples where h2 is correct and h1 wrong
- If h1 is equally good as h2
  - for each instance where h1 and h2 differ, probability 0.5 that either is correct
  - hence we expect B ≈ C ≈ (B+C)/2
- B and C follow a binomial (approximately normal) distribution
- Reject equality if B deviates too much from (B+C)/2
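A sketch of the exact (binomial) version of this test; the example numbers reuse the 10 vs. 0 comparison from the next slide.

```python
from math import comb

def mcnemar_p_value(b, c):
    """Two-sided exact McNemar test: b = examples where only h1 is correct,
    c = examples where only h2 is correct. Under the null hypothesis that
    h1 and h2 are equally good, b follows Binomial(b + c, 0.5)."""
    n = b + c
    k = min(b, c)
    # probability of a deviation from (b+c)/2 at least as large as the observed one
    tail = sum(comb(n, i) for i in range(0, k + 1)) * 0.5**n
    return min(1.0, 2 * tail)

# 10 examples where only h2 is correct, 0 where only h1 is correct:
print(mcnemar_p_value(0, 10))   # about 0.002, so reject "h1 and h2 are equally good"
```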
Example comparison
- Consider the table below
- Method with independent test sets
  - 55 vs. 45 correct predictions (out of 100) in favour of h2
  - not very convincing
- Method with the same test set
  - much more convincing: 10 vs. 0 in favour of h2
  - h2 is clearly better than h1; this might not be discovered using the "conservative" comparison
Comparing learning algorithms
- Compare these two questions
  - Q1: given hypotheses h1 and h2, which one has better predictive accuracy?
  - Q2: given learners L1 and L2 and a data set S, which learner can be expected to build the best hypothesis from S?
    - note that the hypotheses themselves may vary
    - more difficult to answer than Q1
- One possible method (see the sketch below)
  - For several data sets Si similar to S
    - split Si into a training set Str and a test set Ste
    - learn h1 and h2 from Str using L1 resp. L2
    - compute δi = errorSte(h1) - errorSte(h2)
  - Hypothesis test / confidence interval for the mean of δ
- What if only a limited set of data is available?
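A sketch of that procedure, expressed as a confidence interval for the mean per-dataset difference; `learn_L1`, `learn_L2`, `error`, and `split` are hypothetical placeholders for the two learners and the evaluation utilities.

```python
from math import sqrt

def compare_learners(datasets, learn_L1, learn_L2, error, split):
    """Return the mean difference in test error and an approximate 95% confidence
    interval, based on one train/test split per data set."""
    deltas = []
    for S_i in datasets:
        S_tr, S_te = split(S_i)                      # training and test part of data set i
        h1, h2 = learn_L1(S_tr), learn_L2(S_tr)      # hypotheses built by the two learners
        deltas.append(error(h1, S_te) - error(h2, S_te))
    k = len(deltas)
    mean = sum(deltas) / k
    var = sum((d - mean) ** 2 for d in deltas) / (k - 1)
    margin = 1.96 * sqrt(var / k)                    # normal approximation; for small k a t-value is more appropriate
    return mean, (mean - margin, mean + margin)
```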
- If limited data is available
  - repeated runs within 1 data set?
    - e.g. cross-validation: n splits into Str and Ste
    - e.g. 30 random splits into Str and Ste
  - problem: dependencies between these data sets
    - may make it very easy to draw incorrect conclusions!
    - high probability of a "type 1" error: concluding one learner is better than the other when this is not the case
    - to be avoided
  - reasonable approach: 5 times 2-fold cross-validation
    - details: T. Dietterich, Neural Computation 10(7), 1998
  - only really good solution: collect more data!
ROC analysis
- Accuracy-based evaluation is not always appropriate
- Shortcomings
  - it is a relative measure
  - unstable when the class distribution may change
  - assumes symmetric misclassification costs
- Alternatives
  - correlation
  - ROC analysis
1. Accuracy is a relative measure
- E.g., "99% correct prediction": is this good?
  - Yes, if 50% "+" and 50% "-"
  - No, if 1% "+" and 99% "-"
    - always predicting "neg" gives 99% accuracy
- Should be compared with the "base accuracy" of always predicting the majority class
  - base accuracy = max(a+c, b+d) / T
- Even then, it may be misleading...
- Assume all examples are "-", except those in the blue region ("+")
- Which of these classifiers is best?
  - Classifier 1: IF false THEN pos (96% correct)
  - Classifier 2: IF green area THEN pos (92% correct)
- Alternative measures exist
  - e.g., correlation: ρ = (ad - bc) / sqrt((a+b)(c+d)(a+c)(b+d))
  - close to 1: high correlation between predictions and classes
  - close to 0: no correlation
  - (close to -1: predicting the opposite)
- Note: in the confusion matrix, +/- are the actual values and pos/neg are the predictions
2. Different misclassification costs
- Accuracy ignores the possibility of different misclassification costs
  - sometimes, incorrectly predicting "pos" costs more/less than incorrectly predicting "neg"
- E.g.
  - not treating an ill patient vs. treating a healthy patient
  - refusing credit to a client who would have paid back vs. assigning credit to a client who won't pay back
- Need to distinguish the probabilities of making different types of errors
- Solution: distinguish predictive accuracy for the different classes
- Acc = probability that some instance is classified correctly
- Decomposed into
  - TP = probability that a positive instance is classified correctly (true positive rate)
  - TN = probability that a negative instance is classified correctly (true negative rate)
- We also define
  - FP = 1 - TN = false positive rate = probability that a negative is classified as positive
  - analogously, FN = 1 - TP
- Consider costs CFP and CFN
  - the cost of a false positive resp. a false negative
- Expected cost of a single prediction
  - C = CFP * P(pos|-) * P(-) + CFN * P(neg|+) * P(+)
  - estimated by C = CFP * FP * T-/T + CFN * FN * T+/T
- Note
  - Acc is a weighted average of TP and TN: Acc = TP * T+/T + TN * T-/T
  - C is not computable from Acc alone
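A tiny sketch of these two formulas; the rates, costs, and class proportions are illustrative numbers only. It shows two classifiers with the same Acc but very different expected costs.

```python
def expected_cost(tp_rate, tn_rate, c_fp, c_fn, pos_fraction):
    """Expected misclassification cost C = CFP*FP*(T-/T) + CFN*FN*(T+/T)."""
    fp_rate, fn_rate = 1 - tn_rate, 1 - tp_rate
    neg_fraction = 1 - pos_fraction
    return c_fp * fp_rate * neg_fraction + c_fn * fn_rate * pos_fraction

def accuracy(tp_rate, tn_rate, pos_fraction):
    """Acc = TP*(T+/T) + TN*(T-/T): a weighted average of TP and TN."""
    return tp_rate * pos_fraction + tn_rate * (1 - pos_fraction)

# Same Acc (0.7), very different costs when a false negative is 10x as expensive as a false positive:
print(accuracy(0.8, 0.6, 0.5), expected_cost(0.8, 0.6, c_fp=1, c_fn=10, pos_fraction=0.5))
print(accuracy(0.6, 0.8, 0.5), expected_cost(0.6, 0.8, c_fp=1, c_fn=10, pos_fraction=0.5))
```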
3. Changing class distributions
- Accuracy is sensitive to changes in the class distribution
- E.g.
  - suppose a classifier has TP = 0.8, TN = 0.6
  - tested on a test set with T+/T = 0.5, T-/T = 0.5: Acc = 0.7
  - employed in an environment with T+/T = 0.3, T-/T = 0.7: Acc = 0.66
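The same weighted-average formula, reused to reproduce the numbers above:

```python
# Acc = TP * (T+/T) + TN * (T-/T), for the classifier with TP = 0.8 and TN = 0.6
tp_rate, tn_rate = 0.8, 0.6
print(tp_rate * 0.5 + tn_rate * 0.5)   # test set with 50% positives: 0.70
print(tp_rate * 0.3 + tn_rate * 0.7)   # deployment with 30% positives: 0.66
```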
ROC diagrams
- ROC = "Receiver Operating Characteristic"
- Allows us to see
  - how well a classifier will perform given certain misclassification costs and class distribution
  - in which environments one classifier is better than another
- A ROC diagram plots TP versus FP
- From the confusion matrix (rows: prediction, columns: actual value)
  - TP = a/(a+c) = a/T+
  - FP = b/(b+d) = b/T-
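A minimal helper computing a classifier's ROC point from confusion-matrix counts (a, b, c, d as in the slide: a = true positives, b = false positives, c = false negatives, d = true negatives); the counts in the example call are invented.

```python
def roc_point(a, b, c, d):
    """Return (FP rate, TP rate) for one classifier: TP = a/(a+c), FP = b/(b+d)."""
    return b / (b + d), a / (a + c)

# Example confusion matrix: 40 true positives, 10 false positives, 10 false negatives, 40 true negatives.
print(roc_point(a=40, b=10, c=10, d=40))   # (0.2, 0.8): one point in the ROC diagram
```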
Classifier in a ROC diagram
- 1 classifier = 1 point in the ROC diagram
- The closer to the upper left, the better
(Figure: ROC diagram with FP on the horizontal axis and TP on the vertical axis, both from 0 to 1; the point (0,1) is perfect prediction, TP = 1 means no positives forgotten, FP = 0 means no negatives returned, the diagonal corresponds to random prediction; two classifiers A and B are shown as points)
Rank classifiers
- Rank classifiers
  - assign a rank to their predictions
  - some predictions are more certain than others -> higher rank
- E.g. neural nets
  - criterion: output < 0.5 -> neg, > 0.5 -> pos
  - but 0.9 is more certainly positive than 0.51
  - raise/lower the threshold of 0.5: what is the effect? (see the sketch below)
- E.g. decision trees
  - use the purity of the leaf used for a prediction to rank it
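A sketch of how sweeping the threshold of a rank classifier traces a ROC curve; the scores and labels in the example are made up.

```python
def roc_curve(scores, labels):
    """Sweep the decision threshold over the ranked scores and return the ROC points.
    scores: predicted 'certainty of positive'; labels: True for actual positives."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]                     # threshold above every score: nothing predicted positive
    tp = fp = 0
    # lower the threshold one example at a time, from most to least certain
    for _, label in sorted(zip(scores, labels), reverse=True):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))   # (FP rate, TP rate) at this threshold
    return points

scores = [0.9, 0.8, 0.7, 0.55, 0.51, 0.4, 0.3, 0.2]
labels = [True, True, False, True, False, True, False, False]
print(roc_curve(scores, labels))
```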
- A rank classifier gives a ROC curve
(Figure: ROC diagram with the curve of rank classifier C and classifiers A and B as points; C with a low threshold is better than B, C with a high threshold is worse than A)
Costs in a ROC diagram
- Given misclassification costs
  - cFP = cost of a false positive
  - cFN = cost of a false negative (undetected "+")
- Average cost is
  - c = cFP * FP * T-/T + cFN * (1 - TP) * T+/T
- Lines of equal cost can be drawn in the ROC diagram (straight lines, with slope (cFP * T-) / (cFN * T+))
(Figure: ROC diagram with classifiers A, B, C and parallel iso-cost lines; cost increases towards the lower right)
(Figure: the same diagram with two families of iso-cost lines; with a high cost of false positives A is better, with a low cost of false positives C is better; B is never better than C)
Sets of classifiers
- Different classifiers may be good in different environments
- Given a set of classifiers, ROC analysis allows us to
  - decide in which cases a classifier is optimal
  - remove classifiers that are never optimal
- Classifiers that may be optimal always lie on the convex hull of the set of points (see the sketch below)
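A small sketch that computes the upper convex hull of a set of ROC points with the standard monotone-chain construction; the five example classifiers are invented.

```python
def roc_convex_hull(points):
    """Return the ROC points on the upper convex hull, i.e. the classifiers that can be
    optimal for some combination of costs and class distribution. The trivial classifiers
    (0,0) ("always neg") and (1,1) ("always pos") are included."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})

    def cross(o, a, b):
        # > 0 if o -> a -> b makes a left (counter-clockwise) turn
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []
    for p in pts:                       # monotone-chain construction of the upper hull
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()                  # the middle point lies on or below the hull
        hull.append(p)
    return hull

# Five classifiers as (FP, TP) points; those not returned are never optimal.
print(roc_convex_hull([(0.1, 0.4), (0.2, 0.7), (0.4, 0.6), (0.5, 0.9), (0.8, 0.95)]))
```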
Example: convex hull
- Which classifiers are never optimal?
(Figure: several classifiers plotted as points in the ROC diagram; only those on the convex hull can ever be optimal)
Evaluation of regression models
- Predicting numbers: no "right or wrong" approach
- Possible measures (see the sketch below)
  - Sum of squared errors SSE
    - an absolute measure
  - Relative error RE: measures improvement over a trivial model
    - RE = SSE(hypothesis) / SSE(trivial hypothesis)
    - trivial hypothesis: e.g. always predict the mean
  - Spearman correlation r
    - measures how well predictions and actual values correlate
    - less sensitive to the actual size of the errors
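A sketch of these three measures in plain Python; the small data vectors at the end are invented, and the Spearman formula below assumes there are no ties.

```python
def sse(predicted, actual):
    """Sum of squared errors: an absolute measure."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def relative_error(predicted, actual):
    """RE = SSE(hypothesis) / SSE(trivial hypothesis that always predicts the mean)."""
    mean = sum(actual) / len(actual)
    return sse(predicted, actual) / sse([mean] * len(actual), actual)

def spearman_correlation(predicted, actual):
    """Correlation between the ranks of predictions and actual values
    (insensitive to the actual size of the errors); assumes no ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0] * len(values)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rp, ra = ranks(predicted), ranks(actual)
    n = len(rp)
    d2 = sum((x - y) ** 2 for x, y in zip(rp, ra))
    return 1 - 6 * d2 / (n * (n**2 - 1))

predicted = [2.1, 3.9, 6.2, 7.8]
actual = [2.0, 4.0, 6.0, 8.0]
print(sse(predicted, actual), relative_error(predicted, actual), spearman_correlation(predicted, actual))
```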
To remember
- Accuracy-based evaluation
  - methods for evaluating a single hypothesis, comparing hypotheses, comparing algorithms
  - limitations of accuracy as an evaluation criterion
- Other evaluation criteria
  - criteria for regression
  - correlation as an alternative for accuracy
  - ROC analysis: plotting classifiers and rank classifiers, iso-cost lines, convex hull
6 Bayesian learning
- Introduction to probabilistic (Bayesian) methods
- MAP and ML hypotheses
- Minimum description length principle
- Bayes optimal classifier
- Naïve Bayes learner
  - example: learning over text data
- Bayesian belief networks
- (Expectation Maximization (EM): see later)
- Reference: Mitchell, Ch. 6
Bayesian approaches
- Several roles for probability theory in machine learning
  - describing existing learners
    - e.g. compare them with the optimal probabilistic learner
  - developing practical learning algorithms
    - e.g. the Naïve Bayes learner
- Bayes' theorem plays a central role
Basics of probability
- P(A): probability that A happens
- P(A|B): probability that A happens, given that B happens (conditional probability)
- Some rules
  - complement: P(not A) = 1 - P(A)
  - disjunction: P(A or B) = P(A) + P(B) - P(A and B)
  - conjunction: P(A and B) = P(A) * P(B|A)
    - = P(A) * P(B) if A and B are independent
  - total probability: P(A) = Σi P(A|Bi) * P(Bi)
Bayes' Theorem
- P(A|B) = P(B|A) * P(A) / P(B)
- Mainly 2 ways of using Bayes' theorem
- Applied to learning a hypothesis h from data D
  - P(h|D) = P(D|h) * P(h) / P(D) ∝ P(D|h) * P(h)
  - P(h): a priori probability that h is correct
  - P(h|D): a posteriori probability that h is correct
  - P(D): probability of obtaining data D
  - P(D|h): probability of obtaining data D if h is correct
- Applied to the classification of a single example e
  - P(class|e) = P(e|class) * P(class) / P(e)
Bayes' theorem: example
- Example
  - assume some lab test for a disease has a 98% chance of giving a positive result if the disease is present, and a 97% chance of giving a negative result if the disease is absent
  - assume furthermore that 0.8% of the population has this disease
  - given a positive result, what is the probability that the disease is present?
  - P(D|P) = P(P|D) * P(D) / P(P) = 0.98 * 0.008 / (0.98 * 0.008 + 0.03 * 0.992)
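The same computation spelled out, with the numbers from the slide:

```python
# Lab-test example: P(pos | disease) = 0.98, P(neg | no disease) = 0.97, P(disease) = 0.008
p_pos_given_d = 0.98
p_pos_given_not_d = 1 - 0.97          # = 0.03, the false-positive rate of the test
p_d = 0.008

# total probability: P(pos) = P(pos | D) P(D) + P(pos | not D) P(not D)
p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)

# Bayes' theorem: P(D | pos) = P(pos | D) P(D) / P(pos)
print(p_pos_given_d * p_d / p_pos)    # about 0.21: a positive test is far from conclusive
```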
MAP and ML hypotheses
- Question: given the current data D and some hypothesis space H, return the hypothesis h in H that is most likely to be correct
- Note: this h is optimal in a certain sense
  - no method can exist that finds the correct h with higher probability
MAP hypothesis
- Given some data D and a hypothesis space H, find the hypothesis h ∈ H that has the highest probability of being correct, i.e., P(h|D) is maximal
- This hypothesis is called the maximum a posteriori hypothesis hMAP
  - hMAP = argmax_{h∈H} P(h|D) = argmax_{h∈H} P(D|h)P(h)/P(D) = argmax_{h∈H} P(D|h)P(h)
  - the last equality holds because P(D) is constant
- So we need P(D|h) and P(h) for all h ∈ H to compute hMAP
ML hypothesis
- P(h): a priori probability that h is correct
- What if there are no preferences for one h over another?
- Then assume P(h) = P(h') for all h, h' ∈ H
- Under this assumption hMAP is called the maximum likelihood hypothesis hML
  - hML = argmax_{h∈H} P(D|h) (because P(h) is constant)
- How to find hMAP or hML?
  - brute force method: compute P(D|h), P(h) for all h ∈ H
  - usually not feasible
Version spaces in a MAP / ML setting
- Consider Find-S: finds the most specific hypothesis hms in H consistent with the data D
- Under what circumstances is hms = hMAP?
- Assume (for simplicity) the set of instances is given, and D consists of the classes of the instances
  - P(D|h) = 0 if h is inconsistent with D, 1 otherwise
  - P(h|D) ∝ P(D|h) * P(h)
  - for any h not in VS: P(h|D) = 0
  - for any h in VS: P(h|D) ∝ P(h)
  - so hms = hMAP if, for all h, h': h more specific than h' implies P(h) ≥ P(h')
Characterising learners in a MAP setting
- E.g. candidate elimination
(Figure: candidate elimination, taking training examples and a hypothesis space H and producing hypotheses, is equivalent to a brute-force MAP learner on the same inputs with P(h) uniform and P(D|h) = 1 if h is consistent with D, 0 otherwise)
Characterising numeric prediction in a ML setting
- Minimisation of MSE (mean squared error)
  - hMSE = hML under the assumption of Gaussian noise
  - target values in the data are produced as d(x) = f(x) + ε, with
    - f(x) the true target value of x and ε noise
    - ε random and normally distributed
- Predicting probabilities
  - the most likely hypothesis can be found by maximising the cross-entropy
  - hML = argmax_{h∈H} Σi [di ln h(xi) + (1 - di) ln(1 - h(xi))]
  - with di the target value for instance xi (0/1) and h(xi) the predicted probability that the class of xi is 1
Minimum Description Length (MDL)
- Occam's razor: prefer the simplest hypothesis
  - simplest = shortest description
- Minimum description length principle
  - hypothesis + corrections should have the shortest description
  - trades off complexity and correctness of the hypothesis
  - given data D, hypothesis space H, and encodings C1 and C2 for hypotheses resp. data:
    - hMDL = argmin_{h∈H} LC1(h) + LC2(D|h)
    - LC(x) denotes the description length of x under encoding C
    - LC2(D|h) encodes the exceptions to h's predictions
    - LC2(D|h) = 0 if h is entirely correct
- Interesting observation
  - hMAP equals hMDL under optimal encodings
    - the optimal encoding length of a message is based on its probability
  - But note: there is no link between optimal encodings and practical encoding mechanisms (trees, ...)
    - hence, no reason for claiming that e.g. shorter trees have a higher probability of being correct!
Bayes Optimal Classifier
- Problem considered up till now
  - given data D and hypothesis space H, find the most probable hypothesis h in H
- Now consider this problem
  - given data D, hypothesis space H, and a new instance x, what is the most probable classification of x?
  - equivalent to h(x), with h the most probable hypothesis? No.
- Example
  - P(h1|D) = 0.4, P(h2|D) = P(h3|D) = 0.3, and h1(x) = +, h2(x) = h3(x) = -
  - What is the most probable classification of x?
Bayes optimal classifier
- With hypothesis space H and set of classes V, the most probable classification is
  argmax_{v∈V} Σ_{h∈H} P(v|h) P(h|D)
- In our example: P(+|h1) = 1, P(-|h1) = 0, etc., so the most probable classification of x is - (0.6 vs. 0.4)
Gibbs classifier
- Bayes optimal is optimal but expensive
  - uses all hypotheses in H
- What if we approximate it as follows
  - select one h at random, according to the probabilities P(hi|D)
  - predict h(x)
- This method is called the Gibbs classifier
- Surprisingly: E(errorGibbs) ≤ 2 * E(errorBayesOptimal)
- Try to apply this to the VS approach
Naïve Bayes classifier
- Simple, popular classification method
- Based on Bayes' rule + an assumption of conditional independence
  - the assumption is often violated in practice
  - even then, it usually works well
- Successful application: classification of text documents
Classification using Bayes' rule
- Given attribute values, what is the most probable value of the target variable?
  vMAP = argmax_{vj∈V} P(vj | a1,...,an) = argmax_{vj∈V} P(a1,...,an | vj) P(vj)
- Problem: too much data needed to estimate P(a1,...,an | vj)
The Naïve Bayes classifier
- Naïve Bayes assumption: the attributes are independent given the class
  - P(a1,...,an | vj) = P(a1|vj) P(a2|vj) ... P(an|vj)
  - also called conditional independence (given the class)
- Under that assumption, vMAP becomes
  vNB = argmax_{vj∈V} P(vj) Πi P(ai|vj)
- What if the assumption is violated?
  - i.e. P(a1,...,an | vj) ≠ P(a1|vj) P(a2|vj) ... P(an|vj)
- The prediction is still equivalent to the Bayes prediction as long as the weaker condition holds that the argmax is unchanged:
  argmax_{vj∈V} P(vj) Πi P(ai|vj) = argmax_{vj∈V} P(vj) P(a1,...,an | vj)
- But the probabilities associated with the prediction may be unrealistically close to 0 or 1
Learning a Naïve Bayes classifier
- To learn such a classifier: just estimate P(vj) and P(ai|vj) from the data
- How to estimate?
  - simplest: the standard estimate from statistics
    - estimate a probability by the sample proportion
    - e.g., estimate P(A|B) as count(A and B) / count(B)
  - in practice, something more complicated is needed
Estimating probabilities
- Problem
  - What if attribute value ai is never observed for class vj?
  - Estimate P(ai|vj) = 0 because count(ai and vj) = 0?
  - The effect is too strong: this 0 makes the whole product 0!
- Solution: use the m-estimate (see the sketch below)
  - interpolates between the observed proportion nc/n and an a priori estimate p: (nc + m*p) / (n + m)
  - the estimate may get close to 0 but is never 0
  - m is the weight given to the a priori estimate
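A one-function sketch of the m-estimate; the counts and prior in the example call are invented.

```python
def m_estimate(n_c, n, prior, m):
    """m-estimate of a probability: interpolates between the observed proportion n_c/n
    and the a priori estimate `prior`; m is the weight given to the prior."""
    return (n_c + m * prior) / (n + m)

# Attribute value never observed for this class (n_c = 0 out of n = 20 examples):
print(0 / 20)                                # plain estimate: 0, which would zero out the whole product
print(m_estimate(0, 20, prior=0.5, m=2))     # m-estimate: small but non-zero (about 0.045)
```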
Learning to classify text
- Example application
  - given the text of a newsgroup article, guess which newsgroup it is taken from
- Naïve Bayes turns out to work well on this application
- How to apply NB?
  - key issue: how do we represent examples? what are the attributes?
Representation
- Binary classification (+/-) or multiple classes possible
- Attributes = word positions
  - i.e. attribute i represents the i-th word in the text
  - the value of the attribute is the word that occurs there
- Note: we could have chosen other representations, e.g. attribute = a specific word, value = its frequency in the text
- Further assumption: the probability of having a specific word is independent of its position
  - P(ai = wk | vj) = P(am = wk | vj) for all i, m
Algorithm
procedure learn_naïve_bayes_text(E: set of articles, V: set of classes)
  Voc = all words and tokens occurring in E
  estimate P(vj) and P(wk|vj) for all wk in Voc and vj in V:
    Nj = number of articles of class vj
    N = total number of articles
    P(vj) = Nj / N
    nkj = number of times word wk occurs in texts of class vj
    nj = total number of words in texts of class vj (counting duplicates)
    P(wk|vj) = (nkj + 1) / (nj + |Voc|)

procedure classify_naïve_bayes_text(A: article)
  remove from A all words/tokens that are not in Voc
  return argmax_{vj∈V} P(vj) Πi P(ai|vj)
- Experiment reported in Mitchell
  - 1000 articles taken from 20 newsgroups
  - guess the correct newsgroup for unseen documents
  - 89% classification accuracy with the previous approach
Bayesian Belief Networks
- Consider two extremes of a spectrum
  - estimating the full joint probability distribution
    - would yield the optimal classifier
    - but infeasible in practice (too much data needed)
  - Naïve Bayes
    - much more feasible
    - but strong assumptions of conditional independence
- Can we find something in between?
  - make some independence assumptions, but only where reasonable
Bayesian belief networks
- A Bayesian belief network consists of
  - 1. a graph
    - intuitively indicates which variables directly influence which other variables
    - an arrow from A to B: A has a direct effect on B
    - parents(X) = set of all nodes directly influencing X
    - X is influenced only by its parents
    - formally: each node is conditionally independent of each of its non-descendants, given its parents
      - conditional independence: cf. Naïve Bayes
      - X is conditionally independent of Y given Z iff P(X|Y,Z) = P(X|Z)
  - 2. conditional probability tables
    - for each node X, P(X|parents(X)) is given
Example
- A burglary or an earthquake may cause the alarm to go off
- The alarm going off may cause one of the neighbours to call
(Figure: network Burglary -> Alarm <- Earthquake, with Alarm -> John calls and Alarm -> Mary calls, and conditional probability tables:
  P(B) = 0.05, P(E) = 0.01
  P(A|B,E) = 0.9, P(A|B,-E) = 0.8, P(A|-B,E) = 0.4, P(A|-B,-E) = 0.01
  P(J|A) = 0.8, P(J|-A) = 0.1
  P(M|A) = 0.9, P(M|-A) = 0.2)
- The network topology usually reflects direct causal influences
  - other structures are also possible
  - but they may render the network more complex
(Figure: two networks over the same five variables, the causal topology above and an alternative, non-causal topology that makes the network more complex)
- The graph + conditional probability tables allow us to construct the joint probability distribution of all variables
  - P(X1,X2,...,Xn) = Πi P(Xi | parents(Xi))
- In other words: a Bayesian belief network carries full information on the joint probability distribution
Example
- Joint probability distribution from the conditional ones, for the alarm network above (a sketch follows below)
  - P(J,M,A,B,E) = P(J|A) P(M|A) P(A|B,E) P(B) P(E)
  - to see this, start with P(B) and P(E)
    - conditionally independent from each other given their parents = unconditionally independent, hence P(B,E) = P(B) P(E)
  - P(A,B,E) = P(A|B,E) P(B,E) (by definition)
  - P(J,M,A,B,E) = P(J,M|A,B,E) P(A,B,E) (by def.) = P(J|A) P(M|A) P(A,B,E)
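A sketch that evaluates this factorisation with the conditional probability tables from the alarm example above:

```python
# Conditional probability tables from the alarm network (True = event happens)
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.01, False: 0.99}
P_A = {(True, True): 0.9, (True, False): 0.8, (False, True): 0.4, (False, False): 0.01}  # P(A=True | B, E)
P_J = {True: 0.8, False: 0.1}   # P(J=True | A)
P_M = {True: 0.9, False: 0.2}   # P(M=True | A)

def joint(j, m, a, b, e):
    """P(J,M,A,B,E) = P(J|A) P(M|A) P(A|B,E) P(B) P(E)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return p_j * p_m * p_a * P_B[b] * P_E[e]

# Probability that both neighbours call, the alarm went off, there was a burglary but no earthquake:
print(joint(j=True, m=True, a=True, b=True, e=False))   # 0.8 * 0.9 * 0.8 * 0.05 * 0.99 ≈ 0.0285
```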
Inference
- Given values for certain nodes, infer the probability distribution for the values of other nodes
- The general algorithm is quite complicated
  - see Russell & Norvig, 1995: Artificial Intelligence, a Modern Approach
Simplest case: 2 nodes
- Network A -> B; given P(A) and P(B|A)
  - A known: infer P(B | A=a), directly from P(B|A)
  - A unknown: infer P(B), using the "total probability" rule
  - B known: infer P(A | B=b), using Bayes' rule
  - B unknown: infer P(A), which is simply the given P(A)
A simple 3-node network
- Network A -> B -> C; given P(A), P(B|A), P(C|B)
- E.g., A = a and C = c known:
  P(B=b | A=a, C=c) = P(B=b, A=a, C=c) / Σi P(A=a, B=bi, C=c)
                    = P(A=a) P(B=b|A=a) P(C=c|B=b) / Σi P(A=a) P(B=bi|A=a) P(C=c|B=bi)
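A sketch of this computation for a chain of three binary variables; all the table values below are invented illustration numbers.

```python
# Chain A -> B -> C with binary variables; invented conditional probability tables.
P_A = 0.3                               # P(A = true)
P_B_given_A = {True: 0.9, False: 0.2}   # P(B = true | A)
P_C_given_B = {True: 0.7, False: 0.1}   # P(C = true | B)

def p_b_given_a_and_c(a, c):
    """P(B = true | A = a, C = c), computed by summing the joint over the values of B."""
    def joint(b):
        p_b = P_B_given_A[a] if b else 1 - P_B_given_A[a]
        p_c = P_C_given_B[b] if c else 1 - P_C_given_B[b]
        return p_b * p_c                # P(A=a) cancels out in numerator and denominator
    return joint(True) / (joint(True) + joint(False))

# Observing A = true and C = true makes B = true very likely:
print(p_b_given_a_and_c(a=True, c=True))   # about 0.98
```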
More nodes
- In general, relatively complex reasoning can be achieved
  - forward and backward reasoning
  - "explaining away"
    - "good grade" is evidence for "studied hard"
    - but it is less strong evidence if it is known that the person looked at a neighbour's copy
(Figure: network with "study" and "see neighbour's copy" both pointing to "good grade")
General case
(Figure: a larger network in which some nodes are evidence (observed), one node is to be predicted, and the remaining nodes are unobserved)
- In general, inference is NP-complete
  - approximate methods exist, e.g. Monte-Carlo methods
Learning Bayesian networks
- Assume the structure of the network is given
  - only the conditional probability tables need to be learnt
  - training examples may include values for all variables, or just for some of them
- When all variables are observable
  - estimating the probabilities is as easy as for Naïve Bayes
  - e.g. estimate P(A|B,C) as count(A,B,C) / count(B,C)
- When not all variables are observable
  - notice the similarity with training neural networks (hidden units)
  - methods exist, based on gradient descent or EM (see later)
- When the structure of the network is not given
  - search for structure + tables
  - e.g. propose a structure, learn the tables
  - propose a change to the structure, relearn, see whether results improve
  - active research topic
Sample complexity
- When the structure is known and all variables are observable
  - how many examples are needed for learning?
  - accurate estimates of the conditional probability tables are needed
  - the complexity of learning is linear in the size of the largest probability table
    - i.e. exponential in the number of parent variables of a node
  - compare with estimating the joint distribution
    - exponential in the total number of variables
  - and with Naïve Bayes
    - always only 1 parent variable, i.e. the class
To remember
- Importance of Bayes' theorem
- MAP, ML, MDL
  - definitions, characterising learners from this perspective, relationship MDL-MAP
- Bayes optimal classifier, Gibbs classifier
- Naïve Bayes: how it works, assumptions made, application to text classification
- Bayesian networks: representation, inference, learning
7 Computational learning theory
- Q: how difficult are certain learning tasks?
- Measuring the complexity of learning
- Different settings for concept learning
- PAC-learning
- VC dimension
- Mistake bounds
- Reference: Mitchell, Ch. 7
COLT: computational learning theory
- Find a theory that relates
  - the probability of successful learning
  - the number of training examples (sample complexity)
  - how training examples are chosen
  - the complexity of the hypothesis space
  - the accuracy to which the target is approximated
  - the time spent on learning (time complexity)
Complexity of concept learning
- Task
  - given
    - instance space X
    - unknown target function c: X -> {0,1}
    - hypothesis space H
    - training examples D ⊆ X
  - find
    - a hypothesis h in H such that h(x) = c(x)
      - for all x ∈ D?
      - for all x ∈ X? (this is most interesting)
Settings for concept learning
- How many examples are needed for learning depends on how well they are chosen
- 1) the learner proposes instances, the teacher classifies them
  - i.e. the learner chooses x, the teacher provides c(x)
- 2) the teacher gives instances
  - the teacher chooses x and provides c(x)
- 3) instances are provided randomly
  - nobody chooses x, the teacher provides c(x)
- Which one do you think is easiest?
Sample complexity: setting 1
- The learner proposes x, the teacher gives c(x)
- The learner can choose x based on what it already knows
  - e.g. version spaces
    - try to choose x so that half of VS predicts +, the other half -
    - after seeing c(x), VS is divided by 2
    - hence, the hypothesis will be learnt after ⌈log2 |H|⌉ examples
  - it may not be possible to choose x in this way
    - in this case, more examples are needed
  - general idea: the learner should reduce the remaining possibilities as much as possible
Sample complexity: setting 2
- The teacher chooses x and provides c(x)
- Note: the teacher knows how the learner works (i.e., has all the knowledge the learner has) + knows the target concept
  - a benevolent teacher can point the learner in the right direction
  - what is the optimal teaching strategy?
  - it will depend on the form of H
- Example
  - learning conjunctions of up to n boolean literals of the form Ai = true/false
  - n+1 examples suffice (why?)
Sample complexity: setting 3
- x is randomly chosen, according to a probability distribution D over X
- Very important setting in practice
  - often no control over how data are collected
- Question, more specifically
  - assume instance space X, hypothesis space H, set of possible target concepts C, distribution D over X
  - given a random set S of examples <x, c(x)> with c ∈ C and each x drawn according to D
  - find h ∈ H for which P_{x drawn from D}(h(x) ≠ c(x)) is small
PAC learning
- Covers setting 3
  - examples presented in random fashion make it difficult to guarantee learning of the correct hypothesis
  - relax the learning task: the learner should very probably find an approximately correct hypothesis
    - "probably": probability close to 1 of finding an acceptable hypothesis
      - denote this probability as 1-δ
    - "acceptable": the probability that h predicts something different from c is close to 0
      - denote this probability as ε
PAC learning
- Probably (1-δ) Approximately (ε) Correct
- Consider
  - a class C of possible target concepts defined over an instance space X; instances have size n
  - a learner L using hypothesis space H
- Definition (cf. Mitchell)
  - C is PAC-learnable by L using H if, for all c ∈ C, all distributions D over X, and all 0 < ε < 1/2 and 0 < δ < 1/2, L finds with probability at least 1-δ some h ∈ H with errD(h) < ε, in time polynomial in 1/ε, 1/δ, n and size(c)
Bounding the true error
- Consider
  - true error: errD(h) = P_{x drawn from D}(h(x) ≠ c(x))
  - training error: errS(h) = P_{x∈S}(h(x) ≠ c(x))
- Can we bound errD(h), given errS(h)?
- Simplest case: assume errS(h) = 0
  - no noise, the proposed hypothesis is consistent with the data
  - to find an upper bound on errD(h) given errS(h) = 0, find the probability that some h ∈ VS could have errD(h) ≥ ε
  - if this probability is low, ε is a good upper bound
ε-exhausting the VS
- Definition
  - The version space VS_{H,S} is ε-exhausted with respect to c and D if, for all h ∈ VS_{H,S}, errD(h) < ε
  - i.e., every hypothesis h in VS_{H,S} has error less than ε with respect to c and D
- Important question
  - how large should S be so that, with probability 1-δ, VS_{H,S} is ε-exhausted?
  - this will indicate the sample complexity of the task
- Theorem (Haussler, 1988)
  - assume H is finite and S is a set of random examples (drawn according to D) of some target concept c; then for any 0 ≤ ε ≤ 1: P(VS_{H,S} not ε-exhausted) ≤ |H| e^(-ε|S|)
- Proof
  - for any single h ∈ H with errD(h) ≥ ε: P(h ∈ VS_{H,S}) ≤ (1-ε)^|S|
  - if we have n such h's, the probability that at least one of them is in VS is ≤ n (1-ε)^|S|
  - since there can't be more than |H| of them, this is ≤ |H| (1-ε)^|S|
  - hence P(VS_{H,S} not ε-exhausted) ≤ |H| (1-ε)^|S|
  - the latter is bounded by |H| e^(-ε|S|), using (1-ε) ≤ e^(-ε)
Sample complexity
- So if we want VS to be ε-exhausted with probability 1-δ, we need |H| e^(-ε|S|) ≤ δ and hence
  - |S| ≥ 1/ε (ln|H| + ln(1/δ))
- Try yourself (a sketch follows below)
  - assume H is the space of conjunctions of literals chosen from n attributes (e.g. A1 and not A3)
  - how many random examples are needed to 0.01-exhaust VS with probability 0.95, if n = 10?
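A small sketch of the bound; it assumes |H| = 3^n for conjunctions of literals over n attributes (each attribute appears positively, negatively, or not at all), which is the counting used in Mitchell for this hypothesis space.

```python
from math import log, ceil

def sample_complexity(h_size, epsilon, delta):
    """Smallest |S| satisfying |S| >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return ceil((log(h_size) + log(1 / delta)) / epsilon)

# Conjunctions of literals over n attributes: each attribute is positive, negated, or absent.
n = 10
h_size = 3 ** n     # assumption: |H| = 3^n for this hypothesis space
print(sample_complexity(h_size, epsilon=0.01, delta=0.05))   # roughly 1400 examples
```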
Agnostic learning
- So far, we assumed finding an h ∈ VS_{H,S} was possible
- What if not?
- Agnostic learning: not sure that c ∈ H
- In that case, we can only hope to approximate c as closely as possible
- From the so-called Hoeffding bounds
  - P(errD(h) > errS(h) + ε) ≤ e^(-2mε²), with m = |S|
  - derive |S| ≥ 1/(2ε²) (ln|H| + ln(1/δ))
The VC dimension
- Vapnik-Chervonenkis dimension
  - an alternative notion for measuring how expressive H is
  - up till now we used |H|
  - what if H is infinite? (e.g., neural nets)
  - also, many different h ∈ H may represent more or less the same hypothesis: to what extent does H contain truly different hypotheses?
  - the VC dimension is based on the notion of shattering sets of instances
Shattering instances
- A set of instances S is shattered by a hypothesis space H iff for each subset of S there exists a hypothesis h ∈ H consistent with it
  - in other words: for each possible concept c on S, an h ∈ H exists that is equivalent to c on S
- Examples
  - consider H = half-planes (bounded by straight lines)
  - (Figure: the left set of points is shattered by H, the right set is not)
Vapnik-Chervonenkis dimension
- The VC dimension VC(H) of a hypothesis space H defined over instance space X is the size of the largest finite subset of X shattered by H, or ∞ if arbitrarily large subsets can be shattered
- Note the peculiarities
  - "shatters": for each labelling, a consistent h exists
  - VC-dim ≥ d if at least one subset of size d is shattered
- E.g. the VC dimension of straight lines in R² is 3
  - check that four points can never be shattered
Sample complexity bounds with the VC dimension
- How many examples are needed to ε-exhaust VS_{H,S} with probability at least 1-δ?
  - |S| ≥ 1/ε (4 log2(2/δ) + 8 VC(H) log2(13/ε))
Mistake bounds
- Consider an alternative kind of complexity measure
  - up till now: look at the total number of examples needed before finding a good hypothesis
  - other setting:
    - when getting an example, first guess its classification
    - if wrong, the teacher corrects it
    - can we bound the number of mistakes made during this process, before converging to a good hypothesis?
Example: Find-S
- Find-S, in the context of conjunctions of boolean literals
  - start with the most specific hypothesis
    - h = l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ ... ∧ ln ∧ ¬ln (everything classified negative)
  - for each positive instance x
    - remove from h each literal not satisfied by x
- How many mistakes before finding the correct h?
  - Try yourself (the answer is n+1)
Example: Halving algorithm
- Halving algorithm
  - keep track of VS using the candidate-elimination algorithm
  - classify a new instance as follows
    - consider the prediction of each h ∈ VS as a vote
    - use the majority vote for classification
- How many mistakes before VS converges to the correct concept?
  - worst case?
  - best case?
Optimal mistake bounds
- Let MA(C) = the maximum number of mistakes made by algorithm A to learn concepts in C (over all concepts and training sequences)
- The optimal mistake bound Opt(C) is the minimal MA(C) over all possible A
- Property: VC(C) ≤ Opt(C) ≤ M_Halving(C) ≤ log2(|C|)
To remember
- What COLT is about
- Different settings for learning
- PAC-learning: definition, ε-exhaustion, derivation of the simplest bound
- Shattering, VC dimension: definitions and intuition
- Mistake bounds: examples, optimal mistake bound, relationship with other measures