5 Evaluation of hypotheses

Transcript and Presenter's Notes

1
5 Evaluation of hypotheses
  • Q: How good are the results of a learning process?
  • Some statistics
  • Evaluating a single hypothesis
  • Comparing hypotheses
  • Comparing learning algorithms
  • ROC analysis
  • → Mitchell Ch. 5

2
Some Statistics (sorry)
  • Evaluation of a predictive model is often based on
    predictive accuracy: the probability that the hypothesis
    makes a correct prediction for a random instance
  • acc(h) = P(h(X) = c(X)) = 1 - error(h)
  • Estimating this accuracy: standard statistics
  • Recall:
  • binomial and normal distributions
  • confidence intervals and hypothesis tests

3
Binomial distribution
  • An experiment that succeeds with probability p is
    repeated n times: what is the probability of
    having x successes?
  • assumptions: constant p, independent experiments
  • P(x successes) = C(n,x) p^x (1-p)^(n-x)
  • Given h with accuracy p and n instances in some test
    set, this gives the probability of making x correct
    predictions

4
Normal distribution
  • Sum of many independent variables follows
    (approximately) normal distribution
  • Can be used to approximate binomial distribution
  • Formulae used in practice are derived from this
    approximation

5
  • Example for n = 10 and p = 0.3

6
Confidence intervals
  • From p and n we can compute an interval in which
    x (or x/n) lies with probability close to 1 (e.g.
    0.95)
  • In practice we want to do the opposite:
  • given p̂ = x/n, give an interval for p
  • an interval that contains p with probability c is
    called a c confidence interval

(Figure: population proportion and sample proportion, both on a [0,1] scale)
7
Hypothesis tests
  • Principle of a hypothesis test
  • given a certain claim (hypothesis) H0, test it by
    looking at a sample
  • if sample gives a result that is very unlikely if
    H0 were true, reject H0
  • E.g.
  • claim: h predicts correctly in 90% of cases (H0: p = 0.9)
  • test on a data set: p̂ = 0.8
  • this is abnormally low, hence reject the claim
  • "abnormally low": the confidence interval computed from p̂ does not
    contain p

8
Evaluating a single hypothesis
  • To estimate the true accuracy of hypothesis h:
  • compute the accuracy of h on a sample of unseen data
  • compute e.g. a 95% confidence interval
  • Formula for the 95% confidence interval: p̂ ± z_c · √( p̂ (1 - p̂) / n )
  • derived from the normal distribution
  • p̂ is the accuracy on the sample, n is the size of the sample

confidence level c    0.90   0.95   0.99
z_c                   1.64   1.96   2.58
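
As an illustration (not from the original slides), a minimal Python sketch of this interval; the function name and the example numbers are made up:

import math

# z-values for common confidence levels (from the table above)
Z = {0.90: 1.64, 0.95: 1.96, 0.99: 2.58}

def accuracy_confidence_interval(correct, n, confidence=0.95):
    """Approximate confidence interval for the true accuracy p,
    given 'correct' successes out of n independent test examples."""
    p_hat = correct / n
    z = Z[confidence]
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# e.g. 80 correct predictions on a 100-example test set
print(accuracy_confidence_interval(80, 100))   # roughly (0.72, 0.88)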
9
The importance of test sets
  • Important: the theory assumes a random, independent
    sample
  • The training set used to learn h is not independent!
  • if we denote
  • errorTr(h) = error of h on the training set
  • error(h) = true error of h on the population
  • errorTe(h) = error of h on a sample different from the
    training set
  • then typically (E() denotes expected value):
  • E(errorTr(h)) < error(h) (cf. overfitting as an
    extreme case)
  • E(errorTr(h)) - error(h) = bias of the estimator
    errorTr
  • E(errorTe(h)) = error(h)

10
Creating test sets
  • How to obtain an independent test set?
  • Simple method: use e.g. 2/3 of the available data for
    training, 1/3 for testing
  • Problem if not much data is available:
  • a smaller training set makes it more difficult to
    learn
  • a smaller test set gives less accurate estimates
  • Popular solution: cross-validation (a sketch follows below)
  • learn h from the full set S
  • partition S into n subsets; learn n hypotheses hi,
    each time leaving out a different subset; use the
    average of the test set accuracies of the hi as an estimate
    of the accuracy of h
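
A minimal Python sketch of this estimate (illustrative only; 'train' and 'accuracy' are placeholders for the learner and the evaluation function at hand):

import random

def cross_val_accuracy(data, n_folds, train, accuracy):
    """Estimate the accuracy of the hypothesis learnt from the full data set
    by averaging test-set accuracies over n_folds folds.
    train(examples) -> h, accuracy(h, examples) -> float."""
    data = list(data)
    random.shuffle(data)
    folds = [data[i::n_folds] for i in range(n_folds)]
    scores = []
    for i, test_fold in enumerate(folds):
        train_part = [x for j, f in enumerate(folds) if j != i for x in f]
        h_i = train(train_part)          # hypothesis learnt without fold i
        scores.append(accuracy(h_i, test_fold))
    return sum(scores) / n_folds         # estimate for h learnt from all of S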

11
Example: 3-fold cross-validation
(Figure: given data set S, learn h from S; partition S into S1, S2, S3;
each hi is trained on the other two subsets and tested on Si, giving
errorS1(h1), errorS2(h2), errorS3(h3))
12
  • Not entirely unbiased: usually a small pessimistic
    bias
  • S contains more elements than S - Si
  • the different folds are not independent
  • Still preferable over using training set accuracy

13
Comparison of hypotheses
  • Given two hypotheses, which one has lower true
    error?
  • Statistical hypothesis test:
  • claim that both are equally good
  • if the claim is rejected, accept that one is better
  • 2 cases:
  • compare 2 hypotheses on possibly different test
    sets
  • compare 2 hypotheses on the same test set

14
Comparing 2 hypotheses
  • To compare h1 and h2, estimate p1 - p2 from samples
    S1 (giving p̂1) and S2 (giving p̂2)
  • if very likely p1 - p2 > 0: h1 is better
  • similarly, if very likely p1 - p2 < 0: h2 is better
  • otherwise, no difference demonstrated
  • Formula for the confidence interval of the difference:
    (p̂1 - p̂2) ± z_c · √( p̂1(1-p̂1)/n1 + p̂2(1-p̂2)/n2 )
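
A small Python sketch of this interval (illustrative, using the normal approximation above):

import math

def diff_confidence_interval(p1, n1, p2, n2, z=1.96):
    """Approximate confidence interval for the true accuracy difference
    p1 - p2, estimated on two independent test sets of sizes n1 and n2."""
    d = p1 - p2
    half_width = z * math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return d - half_width, d + half_width

# if the interval lies entirely above 0, h1 is very likely better
print(diff_confidence_interval(0.85, 200, 0.80, 150))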

15
Comparing 2 hypotheses on the same data set
  • When comparing hypotheses on the same data set, a
    more powerful procedure is possible
  • it uses more information from the test
  • the possible influence of easy/difficult examples is
    removed
  • More informative method:
  • for each single example, compare h1 and h2
  • how often was h1 correct and h2 wrong on the same
    example, vs. the other way around?
  • use McNemar's test

16
McNemar's test
  • Consider the 2x2 table of counts: A = both correct,
    B = h1 correct and h2 wrong, C = h1 wrong and h2 correct,
    D = both wrong
  • If h1 is equally good as h2:
  • for each instance where h1 and h2 differ, the
    probability is 0.5 that either is correct
  • hence we expect B ≈ C ≈ (B+C)/2
  • B and C follow a binomial (approximately normal) distribution
  • reject equality if B deviates too much from
    (B+C)/2
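
A sketch of one common formulation of McNemar's test (the chi-square version with continuity correction; the slides' binomial/normal view is equivalent). 3.84 is the 95% critical value of the chi-square distribution with 1 degree of freedom:

def mcnemar_statistic(b, c):
    """McNemar's chi-square statistic with continuity correction.
    b = examples where h1 is correct and h2 wrong,
    c = examples where h2 is correct and h1 wrong."""
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# example from the next slide: b = 0, c = 10
stat = mcnemar_statistic(0, 10)
print(stat, "reject equality at the 5% level" if stat > 3.84 else "no significant difference")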

17
Example comparison
  • Consider the table below (both correct: 45, only h2 correct: 10,
    only h1 correct: 0, both wrong: 45)
  • Method with independent test sets:
  • 55-45 in favour of h2 (correct predictions, out
    of 100)
  • not very convincing
  • Method with the same test set:
  • much more convincing: 10-0 in favour of h2

- h2 clearly better than h1 - might not be
discovered using a "conservative" comparison
18
Comparing learning algorithms
  • Compare these two questions:
  • Q1: given hypotheses h1 and h2, which one has the
    better predictive accuracy?
  • Q2: given learners L1 and L2 and data set S,
    which learner can be expected to build the best
    hypothesis from S?
  • note that the hypotheses themselves may vary
  • more difficult to answer than Q1

19
  • One possible method:
  • For several data sets Si similar to S:
  • split Si into a training set Str and a test set Ste
  • learn h1 and h2 from Str using L1 resp. L2
  • compute δi = errorSte(h1) - errorSte(h2)
  • Hypothesis test / confidence interval for the mean of
    δ
  • What if only a limited set of data is available?

20
  • If limited data is available:
  • repeated runs within 1 data set?
  • e.g. cross-validation: n splits into Str and Ste
  • e.g. 30 random splits into Str and Ste (see the sketch below)
  • problem: dependencies between the data sets
  • may make it very easy to draw incorrect
    conclusions!
  • high probability of a "type I" error: concluding
    that one learner is better than the other when this is not
    the case
  • to be avoided
  • reasonable approach: 5 times 2-fold cross-validation
  • details: T. Dietterich, Neural Computation 10(7),
    1998
  • the only really good solution: collect more data!
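
A rough Python sketch of the repeated-random-splits idea only (not Dietterich's 5x2cv procedure); 'learner1', 'learner2' and 'error' are placeholder callables, and the caveat above applies: the splits overlap, so the interval is optimistic.

import math, random, statistics

def compare_learners(data, learner1, learner2, error, k=30, test_fraction=1/3):
    """Mean error difference between two learners over k random train/test
    splits of one data set, with an approximate 95% t interval."""
    deltas = []
    for _ in range(k):
        shuffled = random.sample(list(data), len(data))
        cut = int(len(shuffled) * (1 - test_fraction))
        train, test = shuffled[:cut], shuffled[cut:]
        deltas.append(error(learner1(train), test) - error(learner2(train), test))
    mean = statistics.mean(deltas)
    sem = statistics.stdev(deltas) / math.sqrt(k)
    return mean, (mean - 2.05 * sem, mean + 2.05 * sem)   # t ~ 2.05 for k = 30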

21
ROC analysis
  • Accuracy-based evaluation is not always appropriate
  • Shortcomings:
  • it is a relative measure
  • unstable when the class distribution may change
  • assumes symmetric misclassification costs
  • Alternatives:
  • correlation
  • ROC analysis

22
(1) Accuracy is a relative measure
  • E.g., "99% correct predictions": is this good?
  • Yes, if 50% "+" and 50% "-"
  • No, if 1% "+" and 99% "-"
  • always predicting "neg" gives 99% accuracy
  • Should be compared with the "base accuracy" of always
    predicting the majority class
  • base accuracy = max(a+c, b+d) / T
  • Even then, it may be misleading...

23
  • Assume all examples are "-", except in the blue region
    ("+")
  • Which of these classifiers is best?

(Figure: two classifiers on this data)
Classifier 1: IF false THEN pos (96% correct)
Classifier 2: IF green area THEN pos (92% correct)
24
  • Alternative measures exist
  • e.g., correlation φ = (ad - bc) / √( Tpos · Tneg · T+ · T- )
  • close to 1: high correlation between predictions and
    classes
  • close to 0: no correlation
  • (close to -1: predicting the opposite)

(Confusion matrix with cells a, b, c, d;
note: +/- are actual values, pos/neg are
predictions)
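
A small Python sketch of this measure (cell names follow the confusion matrix convention used on these slides: a = pos&+, b = pos&-, c = neg&+, d = neg&-; the example counts are made up):

import math

def correlation(a, b, c, d):
    """Correlation between predictions and actual classes from the 2x2 table."""
    t_pos, t_neg = a + b, c + d        # predicted positives / negatives
    t_plus, t_minus = a + c, b + d     # actual positives / negatives
    denom = math.sqrt(t_pos * t_neg * t_plus * t_minus)
    return (a * d - b * c) / denom if denom else 0.0

print(correlation(a=40, b=5, c=10, d=45))   # about 0.70: fairly strong agreement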
25
(2) Different misclassification costs
  • Accuracy ignores possibility of different
    misclassification costs
  • sometimes, incorrectly predicting "pos" costs
    more/less than incorrectly predicting "neg"
  • e.g.
  • not treating an ill patient vs. treating a
    healthy patient
  • refusing credit to client who would have paid
    back vs. assigning credit to client who won't pay
    back
  • Need to distinguish probability of making
    different types of errors

26
  • Solution: distinguish the predictive accuracy for the
    different classes
  • Acc = probability that some instance is classified
    correctly
  • Decomposed into:
  • TP = probability that a positive instance is
    classified correctly
  • TN = probability that a negative instance is classified
    correctly
  • TP = true positive rate, TN = true negative
    rate
  • We also define:
  • FP = 1 - TN = false positive rate = probability
    that a negative instance is classified as positive
  • analogously, FN = 1 - TP

27
  • Consider costs CFP and CFN
  • cost of a false positive resp. a false negative
  • Expected cost of a single prediction:
  • C = CFP · P(pos|-) · P(-) + CFN · P(neg|+) · P(+)
  • estimated by C = CFP · FP · T-/T + CFN · FN · T+/T
  • Note:
  • Acc is a weighted average of TP and TN
  • Acc = TP · T+/T + TN · T-/T
  • C is not computable from Acc alone
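
A one-function Python sketch of the estimated cost formula above (the example numbers are illustrative):

def expected_cost(fp_rate, fn_rate, pos_fraction, c_fp, c_fn):
    """Estimated expected cost per prediction:
    C = c_fp * FP * T-/T + c_fn * FN * T+/T."""
    neg_fraction = 1 - pos_fraction
    return c_fp * fp_rate * neg_fraction + c_fn * fn_rate * pos_fraction

# e.g. FP = 0.4, FN = 0.2, 30% positives, false negatives 10x as costly
print(expected_cost(0.4, 0.2, 0.3, c_fp=1.0, c_fn=10.0))   # 0.88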

28
(3) Changing class distributions
  • Accuracy is sensitive to changes in the class
    distribution
  • E.g.
  • Suppose a classifier has TP = 0.8, TN = 0.6
  • Test on a test set with T+/T = 0.5, T-/T = 0.5:
  • Acc = 0.7
  • Employed in an environment with T+/T = 0.3, T-/T =
    0.7:
  • Acc = 0.66

29
ROC diagrams
  • ROC "Receiver operating characteristic"
  • Allows to see
  • how well a classifier will perform given certain
    misclassification costs and class distribution
  • in which environments one classifier is better
    than another

30
  • An ROC diagram plots TP versus FP
  • From the confusion matrix:
  • TP = a/(a+c) = a/T+
  • FP = b/(b+d) = b/T-

(Confusion matrix with rows = predictions, columns = actual values)
31
Classifier in ROC diagram
  • 1 classifier = 1 point on the ROC diagram
  • The closer to the upper left, the better

(ROC diagram: TP on the vertical axis, FP on the horizontal axis, both
from 0 to 1. The upper-left corner is perfect prediction: no positives
forgotten and no negatives returned as positive. The diagonal corresponds
to random prediction. A and B mark two example classifiers.)
32
Rank classifiers
  • Rank classifiers:
  • assign a rank to their predictions
  • some predictions are more certain than others ->
    higher rank
  • e.g. neural nets
  • criterion: output < 0.5 -> neg, > 0.5 -> pos
  • but 0.9 is more certainly positive than 0.51
  • raise/lower the threshold of 0.5: what is the effect?
  • e.g. decision trees
  • use the purity of the leaf used for a prediction to rank it
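
An illustrative Python sketch of how sweeping the threshold of a rank classifier yields the ROC curve of the next slide (scores and labels are made up):

def roc_points(scores, labels):
    """(FP, TP) points obtained by lowering the decision threshold one
    example at a time; labels are True for '+', False for '-'."""
    ranked = sorted(zip(scores, labels), reverse=True)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, is_pos in ranked:
        if is_pos:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

print(roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4],
                 [True, True, False, True, False, False]))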

33
  • A rank classifier gives an ROC curve

(ROC diagram: the rank classifier C traces a curve as its threshold is
varied; C with a low threshold is better than classifier B, C with a high
threshold is worse than classifier A.)
34
Costs in ROC diagram
  • Given misclassification costs
  • cFP = cost of a false positive
  • cFN = cost of a false negative (undetected "+")
  • The average cost is
  • c = cFP · FP · T-/T + cFN · (1 - TP) · T+/T
  • Lines of equal cost can be drawn in the ROC diagram
    (straight lines)

35
(ROC diagram with classifiers A, B, C and straight iso-cost lines; the
arrow indicates the direction of increasing cost.)
36
(ROC diagram with classifiers A, B, C: with a high cost of false positives
A is better, with a low cost of false positives C is better; B is never
better than C.)
37
Sets of classifiers
  • Different classifiers may be good in different
    environments
  • Given a set of classifiers, ROC analysis allows one
    to:
  • decide in which cases a classifier is optimal
  • remove classifiers that are never optimal
  • Classifiers that may be optimal always lie on the
    convex hull of the set of points

38
Example convex hull
  • Which classifiers are never optimal?

(ROC diagram with several classifier points; those strictly below the
convex hull are never optimal.)
39
Evaluation of regression models
  • Predicting numbers: no single "right or wrong" approach
  • Possible measures:
  • Sum of squared errors, SSE
  • an absolute measure
  • Relative error RE: measures the improvement over a
    trivial model
  • RE = SSE(hypothesis) / SSE(trivial hypothesis)
  • trivial hypothesis: e.g. always predict the mean
  • Spearman correlation r
  • measures how well predictions and actual values
    correlate
  • less sensitive to the actual errors (only ranks matter)
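
A small Python sketch of these three measures (scipy's spearmanr is used for the rank correlation; the example values are made up):

from scipy.stats import spearmanr

def sse(predicted, actual):
    """Sum of squared errors (an absolute measure)."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual))

def relative_error(predicted, actual):
    """RE = SSE(hypothesis) / SSE(trivial hypothesis that predicts the mean)."""
    mean = sum(actual) / len(actual)
    return sse(predicted, actual) / sse([mean] * len(actual), actual)

actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.2, 1.9, 3.4, 3.8]
print(sse(predicted, actual), relative_error(predicted, actual),
      spearmanr(predicted, actual).correlation)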

40
To remember
  • Accuracy-based evaluation:
  • methods for evaluating a single hypothesis,
    comparing hypotheses, comparing algorithms
  • limitations of accuracy as an evaluation criterion
  • Other evaluation criteria:
  • criteria for regression
  • correlation as an alternative to accuracy
  • ROC analysis: plotting classifiers and rank
    classifiers, iso-cost lines, convex hull

41
6 Bayesian learning
  • Introduction: probabilistic (Bayesian) methods
  • MAP and ML hypotheses
  • Minimum description length principle
  • Bayes optimal classifier
  • Naïve Bayes learner
  • example: learning over text data
  • Bayesian belief networks
  • (Expectation Maximization (EM): see later)
  • → Mitchell Ch. 6

42
Bayesian approaches
  • Several roles for probability theory in machine
    learning
  • describing existing learners
  • e.g. compare them with optimal probabilistic
    learner
  • developing practical learning algorithms
  • e.g. Naïve Bayes learner
  • Bayes theorem plays a central role

43
Basics of probability
  • P(A) = probability that A happens
  • P(A|B) = probability that A happens, given that B
    happens (conditional probability)
  • Some rules:
  • complement: P(not A) = 1 - P(A)
  • disjunction: P(A or B) = P(A) + P(B) - P(A and B)
  • conjunction: P(A and B) = P(A) · P(B|A)
  • = P(A) · P(B) if A and B are independent
  • total probability: P(A) = Σi P(A|Bi) · P(Bi)

44
Bayes Theorem
  • P(A|B) = P(B|A) · P(A) / P(B)
  • Mainly 2 ways of using Bayes' theorem:
  • Applied to learning a hypothesis h from data D:
  • P(h|D) = P(D|h) · P(h) / P(D) ∝ P(D|h) · P(h)
  • P(h) = a priori probability that h is correct
  • P(h|D) = a posteriori probability that h is
    correct
  • P(D) = probability of obtaining data D
  • P(D|h) = probability of obtaining data D if h is
    correct
  • Applied to the classification of a single example e:
  • P(class|e) = P(e|class) · P(class) / P(e)

45
Bayes theorem Example
  • Example
  • assume some lab test for a disease has a 98% chance
    of giving a positive result if the disease is present,
    and a 97% chance of giving a negative result if the
    disease is absent
  • assume furthermore that 0.8% of the population has this
    disease
  • given a positive result, what is the probability that the
    disease is present?
  • P(D|P) = P(P|D)·P(D) / P(P) = 0.98·0.008 /
    (0.98·0.008 + 0.03·0.992) ≈ 0.21
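
The same computation in a few lines of Python (numbers taken from the slide):

p_pos_given_d = 0.98        # P(positive | disease)
p_pos_given_not_d = 1 - 0.97   # P(positive | no disease)
p_d = 0.008                 # prior P(disease)

p_pos = p_pos_given_d * p_d + p_pos_given_not_d * (1 - p_d)   # total probability
p_d_given_pos = p_pos_given_d * p_d / p_pos                   # Bayes' theorem
print(round(p_d_given_pos, 3))   # about 0.21: still unlikely despite the positive test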

46
MAP and ML hypotheses
  • Question: given the current data D and some
    hypothesis space H, return the hypothesis h in H
    that is most likely to be correct.
  • Note: this h is optimal in a certain sense
  • no method can exist that finds the correct h with higher
    probability

47
MAP hypothesis
  • Given some data D and a hypothesis space H, find
    the hypothesis h ∈ H that has the highest
    probability of being correct, i.e., P(h|D) is
    maximal
  • This hypothesis is called the maximum a
    posteriori hypothesis hMAP: hMAP = argmax_{h∈H}
    P(h|D)
  • = argmax_{h∈H} P(D|h)·P(h)/P(D) = argmax_{h∈H}
    P(D|h)·P(h)
  • the last equality holds because P(D) is constant
  • So we need P(D|h) and P(h) for all h ∈ H to
    compute hMAP

48
ML hypothesis
  • P(h) = a priori probability that h is correct
  • What if there is no preference for one h over another?
  • Then assume P(h) = P(h') for all h, h' ∈ H
  • Under this assumption hMAP is called the maximum
    likelihood hypothesis hML
  • hML = argmax_{h∈H} P(D|h) (because P(h) is constant)
  • How to find hMAP or hML?
  • brute force method: compute P(D|h), P(h) for all
    h ∈ H
  • usually not feasible

49
Version spaces in a MAP / ML setting
  • Consider Find-S: it finds the most specific hypothesis
    hms in H consistent with the data D
  • Under what circumstances is hms = hMAP?
  • Assume (for simplicity) the set of data is given, and D
    consists of the classes of the instances
  • P(D|h) = 0 if h is inconsistent with D, 1 otherwise
  • P(h|D) ∝ P(D|h) · P(h)
  • for any h not in VS: P(h|D) = 0
  • for any h in VS: P(h|D) ∝ P(h)
  • so hms = hMAP if ∀h,h': P(h) ≥ P(h') whenever h is more
    specific than h'

50
Characterising learners in a MAP setting
  • E.g. candidate elimination

(Diagram: Candidate Elimination maps training examples and a hypothesis
space H to hypotheses. This is equivalent to a brute-force MAP learner on
the same inputs with P(h) uniform and P(D|h) = 1 if h is consistent with
D, 0 otherwise.)
51
Characterising numeric prediction in a ML setting
  • Minimisation of the MSE (mean squared error)
  • hMSE = hML under the assumption of Gaussian noise
  • target values in the data are produced as d(x) =
    f(x) + ε with
  • f(x) = true target value of x and ε = noise
  • ε random and normally distributed
  • Predicting probabilities
  • the most likely hypothesis can be found by maximising the
    cross-entropy
  • hML = argmax_{h∈H} Σi [ di ln h(xi) + (1-di) ln
    (1-h(xi)) ]
  • with di = target value for instance xi (0/1) and
    h(xi) = predicted probability that the class of xi is 1

52
Minimum Description Length (MDL)
  • Occam's razor: prefer the simplest hypothesis
  • simplest = shortest description
  • Minimum description length principle:
  • hypothesis + corrections should have the shortest
    description
  • trades off complexity and correctness of the
    hypothesis
  • given data D, hypothesis space H, and encodings
    C1 and C2 for hypotheses resp. data:
  • hMDL = argmin_{h∈H} LC1(h) + LC2(D|h)
  • LC(x) denotes the description length of x under
    encoding C
  • LC2(D|h) encodes the exceptions to h's predictions
  • LC2(D|h) = 0 if h is entirely correct

53
  • Interesting observation:
  • hMAP equals hMDL under optimal encodings
  • optimal encoding: the length of a message is based on its
    probability
  • But note: there is no link between optimal encodings and
    practical encoding mechanisms (trees, ...)!
  • hence, no reason for claiming that e.g. shorter
    trees have a higher probability of being correct!

54
Bayes Optimal Classifier
  • Problem considered up till now:
  • given data D and hypothesis space H, find the most
    probable hypothesis h in H
  • Now consider this problem:
  • given data D, hypothesis space H, and a new
    instance x, what is the most probable classification
    of x?
  • equivalent to h(x), with h the most probable
    hypothesis? No.
  • example:
  • P(h1|D) = 0.4, P(h2|D) = P(h3|D) = 0.3 and h1(x) = +,
    h2(x) = h3(x) = -
  • What is the most probable classification of x?

55
Bayes optimal classifier
  • With hypothesis space H and set of classes V, the
    most probable classification is
    argmax_{vj∈V} Σ_{hi∈H} P(vj|hi) · P(hi|D)
  • In our example: P(+|h1) = 1, P(-|h1) = 0, etc., so "-"
    wins with probability 0.3 + 0.3 = 0.6

56
Gibbs classifier
  • Bayes Optimal is optimal but expensive
  • it uses all hypotheses in H
  • What if we approximate it as follows:
  • select one h at random, according to the
    posterior probabilities P(hi|D)
  • predict h(x)
  • This method is called the Gibbs classifier
  • Surprisingly: E(errorGibbs) ≤ 2 ·
    E(errorBayesOptimal)
  • Try to apply this to the VS approach

57
Naïve Bayes classifier
  • Simple, popular classification method
  • Based on Bayes' rule + an assumption of conditional
    independence
  • the assumption is often violated in practice
  • even then, it usually works well
  • Successful application: classification of text
    documents

58
Classification using Bayes rule
  • Given attribute values, what is the most probable
    value of the target variable?
  • vMAP = argmax_{vj∈V} P(vj|a1,...,an) = argmax_{vj∈V} P(a1,...,an|vj) · P(vj)
  • Problem: too much data is needed to estimate
    P(a1,...,an|vj)

59
The Naïve Bayes classifier
  • Naïve Bayes assumption: the attributes are
    independent, given the class
  • P(a1,...,an|vj) = P(a1|vj) · P(a2|vj) · ... · P(an|vj)
  • also called conditional independence (given the
    class)
  • Under that assumption, vMAP becomes
    vNB = argmax_{vj∈V} P(vj) · Πi P(ai|vj)

60
  • What if the assumption is violated?
  • i.e. P(a1,...,an|vj) ≠ P(a1|vj) · P(a2|vj) · ... · P(an|vj)
  • The prediction is still equivalent to the Bayes prediction
    as long as the following (weaker) condition holds:
    the class that maximises P(vj) · Πi P(ai|vj) also maximises
    P(vj) · P(a1,...,an|vj)
  • But the probabilities associated with the prediction may
    be unrealistically close to 0 or 1

61
Learning a Naïve Bayes classifier
  • To learn such a classifier, just estimate P(vj) and
    P(ai|vj) from the data
  • How to estimate?
  • simplest: the standard estimate from statistics
  • estimate a probability by the sample proportion
  • e.g., estimate P(A|B) as count(A and B) /
    count(B)
  • in practice, something more complicated is needed

62
Estimating probabilities
  • Problem:
  • What if attribute value ai is never observed for
    class vj?
  • Estimate P(ai|vj) = 0 because count(ai and vj) = 0?
  • The effect is too strong: this 0 makes the whole
    product 0!
  • Solution: use the m-estimate
    P(ai|vj) ≈ (nc + m·p) / (n + m)
  • it interpolates between the observed value nc/n and the a
    priori estimate p -> the estimate may get close to 0
    but is never 0
  • m is the weight given to the a priori estimate
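
The m-estimate in one line of Python (the example values are illustrative):

def m_estimate(nc, n, p, m):
    """m-estimate of a probability: interpolates between the observed
    proportion nc/n and the prior estimate p, with weight m on the prior."""
    return (nc + m * p) / (n + m)

# an attribute value never seen for this class (nc = 0) no longer gives 0
print(m_estimate(nc=0, n=50, p=1/3, m=3))   # small but nonzero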

63
Learning to classify text
  • Example application:
  • given the text of a newsgroup article, guess which
    newsgroup it is taken from
  • Naïve Bayes turns out to work well on this
    application
  • How to apply NB?
  • Key issue: how do we represent examples? What
    are the attributes?

64
Representation
  • Binary classification (+/-) or multiple classes
    possible
  • Attributes = word positions
  • i.e. attribute i represents the i-th word in the text
  • value of the attribute = the word that occurs there
  • Note: other representations could have been chosen,
    e.g. attribute = a specific word, value = its
    frequency in the text
  • further assumption: the probability of having a
    specific word is independent of its position
  • P(ai = wk | vj) = P(am = wk | vj) ∀ i, m

65
Algorithm
procedure learn_naïve_bayes_text(E: set of articles, V: set of classes)
  Voc = all words and tokens occurring in E
  estimate P(vj) and P(wk|vj) for all wk in E and vj in V:
    N = number of articles
    Nj = number of articles of class j
    P(vj) = Nj / N
    nj = number of words in class j (counting doubles)
    nkj = number of times word wk occurs in texts of class j
    P(wk|vj) = (nkj + 1) / (nj + |Voc|)

procedure classify_naïve_bayes_text(A: article)
  remove from A all words/tokens that are not in Voc
  return argmax_{vj∈V} P(vj) · Πi P(ai|vj)
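
A minimal Python sketch of these two procedures, assuming articles are given as lists of tokens (all names below are illustrative):

import math
from collections import Counter

def learn_naive_bayes_text(examples):
    """examples: list of (tokens, class) pairs. Returns (vocabulary, priors,
    per-class word probabilities) with the Laplace smoothing from the slide."""
    vocab = {w for tokens, _ in examples for w in tokens}
    classes = {c for _, c in examples}
    priors, word_probs = {}, {}
    for c in classes:
        docs = [tokens for tokens, cls in examples if cls == c]
        priors[c] = len(docs) / len(examples)
        counts = Counter(w for tokens in docs for w in tokens)
        n_words = sum(counts.values())
        word_probs[c] = {w: (counts[w] + 1) / (n_words + len(vocab)) for w in vocab}
    return vocab, priors, word_probs

def classify_naive_bayes_text(tokens, vocab, priors, word_probs):
    tokens = [w for w in tokens if w in vocab]       # drop unknown tokens
    # log probabilities avoid underflow on long articles
    scores = {c: math.log(priors[c]) + sum(math.log(word_probs[c][w]) for w in tokens)
              for c in priors}
    return max(scores, key=scores.get)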
66
  • Experiment reported in Mitchell:
  • 1000 articles taken from each of 20 newsgroups
  • guess the correct newsgroup for unseen documents
  • 89% classification accuracy with the previous approach

67
Bayesian Belief Networks
  • Consider two extremes of spectrum
  • guessing joint probability distribution
  • would yield optimal classifier
  • but infeasible in practice (too much data needed)
  • Naïve Bayes
  • much more feasible
  • but strong assumptions of conditional
    independence
  • Can we find something in between?
  • make some independence assumptions, but only
    where reasonable

68
Bayesian belief networks
  • A Bayesian belief network consists of:
  • 1. a graph
  • intuitively indicates which variables directly
    influence which other variables
  • arrow from A to B: A has a direct effect on B
  • parents(X) = set of all nodes directly
    influencing X
  • X is influenced only by its parents
  • formally: each node is conditionally independent
    of each of its non-descendants, given its parents
  • conditional independence: cf. Naïve Bayes
  • X is conditionally independent of Y given Z iff
    P(X|Y,Z) = P(X|Z)
  • 2. conditional probability tables
  • for each node X, P(X|parents(X)) is given

69
Example
  • Burglary or earthquake may cause alarm to go off
  • Alarm going off may cause one of neighbours to
    call

(Network: Burglary and Earthquake are parents of Alarm; Alarm is the
parent of JohnCalls and MaryCalls. Conditional probability tables:)

P(Burglary):     B 0.05    -B 0.95
P(Earthquake):   E 0.01    -E 0.99

P(Alarm | B,E):         B,E    B,-E   -B,E   -B,-E
                   A    0.9    0.8    0.4    0.01
                  -A    0.1    0.2    0.6    0.99

P(JohnCalls | Alarm):        A      -A
                        J    0.8    0.1
                       -J    0.2    0.9

P(MaryCalls | Alarm):        A      -A
                        M    0.9    0.2
                       -M    0.1    0.8
70
  • The network topology usually reflects direct causal
    influences
  • other structures are also possible
  • but may render the network more complex

(Figure: the same five nodes Burglary, Earthquake, Alarm, JohnCalls,
MaryCalls arranged in an alternative, non-causal structure, which renders
the network more complex.)
71
  • The graph + conditional probability tables allow one to
    construct the joint probability distribution of all
    variables
  • P(X1, X2, ..., Xn) = Πi P(Xi | parents(Xi))
  • In other words: a Bayesian belief network carries
    full information on the joint probability distribution

72
Example
(Network: Burglary, Earthquake -> Alarm -> JohnCalls, MaryCalls)
  • Joint probability distribution from the conditional
    ones:
  • P(J,M,A,B,E) = P(J|A) · P(M|A) · P(A|B,E) · P(B) · P(E)
  • to see this: start with P(B) and P(E)
  • conditionally independent from each other given
    their parents = unconditionally independent here,
    hence P(B,E) = P(B) · P(E)
  • P(A,B,E) = P(A|B,E) · P(B,E) (by definition)
  • P(J,M,A,B,E) = P(J,M|A,B,E) · P(A,B,E) (by def.)
    = P(J|A) · P(M|A) · P(A,B,E)
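
A small Python sketch of this factorisation, using the CPTs of the burglary example (True = event occurs; the queried assignment is illustrative):

# CPTs from the example network
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.01, False: 0.99}
P_A = {(True, True): 0.9, (True, False): 0.8, (False, True): 0.4, (False, False): 0.01}
P_J = {True: 0.8, False: 0.1}    # P(JohnCalls=True | Alarm)
P_M = {True: 0.9, False: 0.2}    # P(MaryCalls=True | Alarm)

def joint(j, m, a, b, e):
    """P(J=j, M=m, A=a, B=b, E=e) = P(j|a) P(m|a) P(a|b,e) P(b) P(e)."""
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_j = P_J[a] if j else 1 - P_J[a]
    p_m = P_M[a] if m else 1 - P_M[a]
    return p_j * p_m * p_a * P_B[b] * P_E[e]

# probability that both neighbours call, the alarm rings, no burglary, no earthquake
print(joint(j=True, m=True, a=True, b=False, e=False))   # about 0.0068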
73
Inference
  • Given values for certain nodes, infer the probability
    distribution over the values of other nodes
  • The general algorithm is quite complicated
  • see Russell & Norvig, 1995: Artificial
    Intelligence, a Modern Approach

74
Simplest case 2 nodes
  • Given P(A), P(B|A)  (network: A -> B)
  • A known (A = a), infer P(B|A=a)
  • directly from P(B|A)
  • A unknown, infer P(B)
  • "total probability" rule
  • B known (B = b), infer P(A|B=b)
  • Bayes' rule
  • B unknown, infer P(A)
  • simply P(A) itself
75
A simple 3-node network
  • Given P(A), P(B|A), P(C|B)  (network: A -> B -> C)
  • E.g., A = a and C = c known:

P(B=b | A=a, C=c) = P(B=b, A=a, C=c) / Σi P(A=a, B=bi, C=c)
                  = P(A=a)·P(B=b|A=a)·P(C=c|B=b) /
                    Σi P(A=a)·P(B=bi|A=a)·P(C=c|B=bi)
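
A tiny Python sketch of this inference for a boolean B (the common factor P(A=a) cancels; the numbers are illustrative):

def posterior_b(p_b_given_a, p_c_given_b):
    """P(B | A=a, C=c) for the chain A -> B -> C.
    p_b_given_a[b] = P(B=b | A=a) and p_c_given_b[b] = P(C=c | B=b)
    for the observed values a and c."""
    unnormalised = {b: p_b_given_a[b] * p_c_given_b[b] for b in p_b_given_a}
    total = sum(unnormalised.values())
    return {b: v / total for b, v in unnormalised.items()}

print(posterior_b({True: 0.7, False: 0.3}, {True: 0.9, False: 0.2}))
# {True: ~0.91, False: ~0.09}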
76
More nodes
  • In general, relatively complex reasoning can be
    achieved
  • forward and backward reasoning
  • "explaining away":
  • a "good grade" is evidence for "studied hard"
  • but it is weaker evidence if it is known that the person
    looked at the neighbour's copy

(Network: "study" -> "good grade" <- "see neighbour's copy")
77
General case
(Figure: a general network in which some nodes are observed as evidence,
some are to be predicted, and some are unobserved.)
  • In general, inference is NP-complete
  • approximate methods exist, e.g. Monte Carlo

78
Learning bayesian networks
  • Assume the structure of the network is given
  • only the conditional probability tables have to be learnt
  • training examples may include values for all
    variables, or just for some of them
  • when all variables are observable:
  • estimating the probabilities is as easy as for Naïve
    Bayes
  • e.g. estimate P(A|B,C) as count(A,B,C)/count(B,C)
  • when not all variables are observable:
  • notice the similarity with training neural networks
    (hidden units)
  • methods exist, based on gradient descent or EM
    (see later)

79
  • When the structure of the network is not given:
  • search for structure + tables
  • e.g. propose a structure, learn the tables
  • propose a change to the structure, relearn, see whether
    the results improve
  • an active research topic

80
Sample complexity
  • When the structure is known and all variables are
    observable:
  • how many examples are needed for learning?
  • accurate estimates of the conditional probability
    tables are needed
  • the complexity of learning is linear in the size of the largest
    probability table
  • i.e. exponential in the number of parent variables of a
    node
  • compare with estimating the full joint distribution:
  • exponential in the total number of variables
  • and with Naïve Bayes:
  • always only 1 parent variable, i.e. the class

81
To remember
  • Importance of Bayes' theorem
  • MAP, ML, MDL:
  • definitions, characterising learners from this
    perspective, relationship MDL-MAP
  • Bayes optimal classifier, Gibbs classifier
  • Naïve Bayes: how it works, assumptions made,
    application to text classification
  • Bayesian networks: representation, inference,
    learning

82
7 Computational learning theory
  • Q: how difficult are certain learning tasks?
  • Measuring the complexity of learning
  • Different settings for concept learning
  • PAC-learning
  • VC dimension
  • Mistake bounds
  • → Mitchell Ch. 7

83
COLT Computational learning theory
  • Find theory to relate
  • probability of successful learning
  • number of training examples (sample complexity)
  • how training examples are chosen
  • complexity of hypothesis space
  • accuracy to which target is approximated
  • time spent on learning (time complexity)

84
Complexity of concept learning
  • Task:
  • given
  • instance space X
  • unknown target function c: X → {0,1}
  • hypothesis space H
  • training examples D ⊆ X
  • find
  • a hypothesis h in H such that h(x) = c(x)
  • for all x ∈ D?
  • for all x ∈ X? (this is the most interesting case)

85
Settings for concept learning
  • How many examples are needed for learning depends on
    how well they are chosen
  • 1) the learner proposes instances, the teacher classifies
    them
  • i.e. the learner chooses x, the teacher provides c(x)
  • 2) the teacher gives instances
  • the teacher chooses x and provides c(x)
  • 3) instances are provided randomly
  • nobody chooses x, the teacher provides c(x)
  • Which one do you think is easiest?

86
Sample complexity setting 1
  • The learner proposes x, the teacher gives c(x)
  • The learner can choose x based on what it already
    knows
  • e.g. version spaces:
  • try to choose x so that half of the VS predicts +, the other
    half -
  • after seeing c(x), the VS is halved
  • hence, the hypothesis will be learnt after ⌈log2|H|⌉
    examples
  • it may not be possible to choose x in this way
  • in this case, more examples are needed
  • general idea: the learner should reduce the remaining
    possibilities as much as possible

87
Sample complexity setting 2
  • The teacher chooses x and provides c(x)
  • Note: the teacher knows how the learner works (i.e., has
    all the knowledge the learner has) and knows the target
    concept
  • a benevolent teacher can point the learner in the right
    direction
  • what is the optimal teaching strategy?
  • it will depend on the form of H
  • Example:
  • learning conjunctions of up to n boolean literals
    of the form Ai = true/false
  • n + 1 examples suffice (why?)

88
Sample complexity setting 3
  • x is randomly chosen, according to a probability
    distribution D over X
  • A very important setting in practice:
  • often no control over how the data are collected
  • The question, more specifically:
  • assume instance space X, hypothesis space H, a set
    of possible target concepts C, and a distribution D
    over X
  • given a random set S of examples ⟨x, c(x)⟩ with c ∈ C
    and each x drawn according to D
  • find h ∈ H for which P_{x~D}(h(x) ≠ c(x)) is
    small

89
PAC learning
  • Covers setting 3
  • examples presented in a random fashion make it
    difficult to guarantee learning of the correct
    hypothesis
  • relax the learning task: the learner should very
    probably find an approximately correct
    hypothesis
  • "probably": probability close to 1 of finding an acceptable
    hypothesis
  • denote this probability as 1 - δ
  • "acceptable": the probability that h predicts
    something different from c is close to 0
  • denote this error probability as ε

90
PAC learning
  • Probably (1 - δ) Approximately (ε) Correct
  • Consider:
  • a class C of possible target concepts defined over an
    instance space X; instances have size n
  • a learner L using hypothesis space H
  • Definition, cf. Mitchell:
  • C is PAC-learnable by L using H if for all c ∈ C, all
    distributions D over X, and all 0 < ε < 1/2, 0 < δ < 1/2,
    L finds with probability at least 1 - δ some h ∈ H with
    errD(h) < ε, in time polynomial in 1/ε, 1/δ, n and size(c)

91
Bounding the true error
  • Consider
  • true error errD(h) = P_{x~D}(h(x) ≠ c(x))
  • training error errS(h) = P_{x∈S}(h(x) ≠ c(x))
  • Can we bound errD(h), given errS(h)?
  • Simplest case: assume errS(h) = 0
  • no noise, the proposed hypothesis is consistent with the
    data
  • to find an upper bound on errD(h) given errS(h) = 0,
    find the probability that some h ∈ VS could have
    errD(h) ≥ ε
  • if this probability is low, ε is a good upper
    bound

92
ε-exhausting the VS
  • Definition:
  • The version space VS_{H,S} is ε-exhausted w.r.t. c
    and D if ∀h ∈ VS_{H,S}: errD(h) < ε
  • i.e., every hypothesis h in VS_{H,S} should have an
    error less than ε with respect to c and D
  • Important question:
  • how large should S be, so that with probability
    1 - δ, VS_{H,S} is ε-exhausted?
  • this will indicate the sample complexity of the
    task

93
  • Theorem (Haussler, 1988):
  • assume a finite H and let S be a set of random examples
    (drawn according to D) of some target concept c;
    then for any 0 ≤ ε ≤ 1:
    P(VS_{H,S} not ε-exhausted) ≤ |H| e^(-ε|S|)
  • proof:
  • for any single h ∈ H with errD(h) ≥ ε:
    P(h ∈ VS_{H,S}) ≤ (1-ε)^|S|
  • if we have n such h's, the probability that at least
    one of them is in the VS is ≤ n (1-ε)^|S|
  • since there can't be more than |H|, this is
    ≤ |H| (1-ε)^|S|
  • hence P(VS_{H,S} not ε-exhausted) ≤ |H| (1-ε)^|S|
  • the latter is bounded above by |H| e^(-ε|S|) (since 1-ε ≤ e^(-ε))

94
Sample complexity
  • So if we want the VS to be ε-exhausted with
    probability 1 - δ, we need |H| e^(-ε|S|) ≤ δ and
    hence
  • |S| ≥ (1/ε) (ln|H| + ln(1/δ))
  • Try yourself (a numeric sketch follows below):
  • assume H is the space of conjunctions of literals
    chosen from n attributes (e.g. A1 and not A3)
  • how many random examples are needed to 0.01-exhaust the
    VS with probability 0.95, if n = 10?
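
A short Python sketch of the bound, under the assumption that |H| = 3^n for conjunctions over n boolean attributes (each attribute appears as a positive literal, a negative literal, or not at all):

import math

def sample_bound(h_size, epsilon, delta):
    """Number of random examples sufficient to epsilon-exhaust the version
    space with probability at least 1 - delta (finite H):
    m >= (1/epsilon) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / epsilon)

n = 10
print(sample_bound(3 ** n, epsilon=0.01, delta=0.05))   # about 1400 examples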

95
Agnostic learning
  • So far, we assumed finding an h ∈ VS_{H,S} was possible
  • What if not?
  • Agnostic learning: not sure that c ∈ H
  • In that case, we can only hope to approximate c as
    closely as possible
  • From the so-called Hoeffding bounds:
  • P(errD(h) > errS(h) + ε) ≤ e^(-2mε²)
  • Derive |S| ≥ 1/(2ε²) (ln|H| + ln(1/δ))

96
The VC dimension
  • Vapnik-Chervonenkis dimension
  • an alternative notion to measure how expressive H
    is
  • up till now we used |H|
  • what if H is infinite? (e.g., neural nets)
  • also, many different h ∈ H may represent more or
    less the same hypothesis: to what extent does H
    contain truly different hypotheses?
  • The VC dimension is based on the notion of shattering instances

97
Shattering instances
  • A set of instances S is shattered by a hypothesis
    space H iff for each subset of S there exists a
    hypothesis h ∈ H consistent with it
  • in other words: for each possible concept c, an
    h ∈ H exists that is equivalent to c on S
  • Examples:
  • consider H = half-planes (bounded by straight
    lines)
  • (Figure: the left set of points is shattered by H, the right set is not)

98
Vapnik-Chervonenkis dimension
  • The VC-dimension VC(H) of a hypothesis space H
    defined over an instance space X is the size of the
    largest finite subset of X shattered by H, or ∞
    if arbitrarily large subsets can be shattered
  • Note the peculiarities:
  • "shatters": for each labeling, a consistent h
    exists
  • VC-dim = d if at least one subset of size d is
    shattered
  • E.g. the VC-dimension of straight lines in ℝ² is 3
  • check that four points can never be shattered

99
Sample complexity bounds with VC-dimension
  • How many examples are needed to ε-exhaust VS_{H,S} with
    probability at least 1 - δ?
  • |S| ≥ (1/ε) (4 log2(2/δ) + 8 VC(H) log2(13/ε))

100
Mistake bounds
  • Consider an alternative kind of complexity
    measure
  • up till now: the total number of examples
    needed before finding a good hypothesis
  • other setting:
  • when getting an example, first guess its
    classification
  • if wrong, the teacher corrects it
  • can we bound the number of mistakes made during
    this process, before converging to a good
    hypothesis?

101
Example Find-S
  • Find-S, in the context of conjunctions of boolean
    literals
  • start with h = the most specific hypothesis
  • h = l1 ∧ ¬l1 ∧ l2 ∧ ¬l2 ∧ ... ∧ ln ∧ ¬ln (classifies everything as
    negative)
  • For each positive instance x:
  • remove from h each literal not satisfied by x
  • How many mistakes before finding the correct h?
  • Try yourself (the answer is n + 1)

102
Example Halving algorithm
  • Halving algorithm:
  • keep track of the VS using the candidate-elimination
    algorithm
  • classify a new instance as follows:
  • consider the prediction of each h ∈ VS as a vote
  • use the majority vote for classification
  • How many mistakes before the VS converges to the correct
    concept?
  • worst case?
  • best case?

103
Optimal mistake bounds
  • Let MA(C) = the maximum number of mistakes made by algorithm A to
    learn concepts in C (over all concepts and
    training sequences)
  • The optimal mistake bound Opt(C) is the minimal
    MA(C) over all possible A
  • Property: VC(C) ≤ Opt(C) ≤ M_halving(C) ≤ log2(|C|)

104
To remember
  • What COLT is about
  • Different settings for learning
  • PAC-learning: definition, ε-exhaustion,
    derivation of the simplest bound
  • Shattering, VC-dimension: definitions and
    intuition
  • Mistake bounds: examples, optimal mistake bound,
    relationship with other measures