Computational Learning Theory - PowerPoint PPT Presentation


Computational Learning Theory


Slides: 45
Provided by: Che79


Ch5 Computational Learning Theory
  • Introduction
  • Probably Approximately Correct (PAC) Learning Model
  • Sample Complexity for Finite Hypothesis Spaces
  • Sample Complexity for Infinite Hypothesis Spaces
  • Mistake Bound Model

Introduction
  • Problem setting in Ch5
    • Inductively learning an unknown target function,
      given training examples and a hypothesis space
  • Focus on
    • How many training examples are sufficient?
    • How many mistakes will the learner make before
      it converges to the correct hypothesis?

Introduction (2)
  • Desirable: quantitative bounds depending on
    • Complexity of the hypothesis space
    • Accuracy of the approximation
    • Probability of outputting a successful hypothesis
    • How the training examples are presented
      • Learner proposes instances
      • Teacher presents instances
      • Some random process produces instances
  • Specifically, we study sample complexity,
    computational complexity, and mistake bounds.

Problem Setting
  • Space of possible instances X (e.g. the set of all
    people) over which target functions may be
    defined.
  • Assume that different instances in X may be
    encountered with different frequencies.
  • Model this assumption as an unknown
    (stationary) probability distribution $\mathcal{D}$ that
    defines the probability of encountering each
    instance in X.
  • Training examples are provided by drawing
    instances independently from X according to $\mathcal{D}$,
    and they are noise-free.
  • Each element c of the target function set C
    corresponds to a certain subset of X, i.e. c is a
    Boolean function. (Just for the sake of
    simplicity.)

Error of a Hypothesis
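  • The training error of hypothesis h w.r.t. target function c
    and a training data set S of n samples is
    $error_S(h) = \frac{1}{n} \sum_{x \in S} \mathbf{1}[c(x) \neq h(x)]$,
    the fraction of S that h misclassifies.
  • The true error of hypothesis h w.r.t. target function c and
    distribution $\mathcal{D}$ is
    $error_{\mathcal{D}}(h) = \Pr_{x \sim \mathcal{D}}[c(x) \neq h(x)]$.
  • $error_{\mathcal{D}}(h)$ is not observable, so how probable is
    it that $error_S(h)$ gives a misleading estimate of
    $error_{\mathcal{D}}(h)$?
  • This differs from the problem setting in Ch3, where
    samples are drawn independently of h; here h
    depends on the training samples.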
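
The two error notions above can be illustrated with a small sketch. Everything concrete here is an assumption made for illustration: the instance space is [0, 1) with a uniform distribution, and the target `c` and hypothesis `h` are simple thresholds. The true error is only estimated by Monte Carlo sampling, since in general it cannot be observed directly.

```python
import random

def training_error(h, c, sample):
    """Fraction of the sample on which hypothesis h disagrees with target c."""
    return sum(h(x) != c(x) for x in sample) / len(sample)

def estimate_true_error(h, c, draw, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr[h(x) != c(x)]; draw(rng) samples one x."""
    rng = random.Random(seed)
    return sum(h(x) != c(x) for x in (draw(rng) for _ in range(trials))) / trials

# Hypothetical toy setup: X = [0, 1), uniform distribution, threshold concepts.
c = lambda x: x >= 0.5
h = lambda x: x >= 0.6           # disagrees with c exactly on [0.5, 0.6)
draw = lambda rng: rng.random()

sample = [0.1, 0.3, 0.7, 0.9]    # h agrees with c on every sample point
train_err = training_error(h, c, sample)
true_err = estimate_true_error(h, c, draw)
```

Here the training error is zero although the true error is about 0.1, which is the kind of misleading estimate the slide asks about.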

An Illustration of True Error
  • [Figure not preserved in the transcript.]

PAC Learnability
  • PAC stands for Probably Approximately Correct.
  • Ideally $error_{\mathcal{D}}(h)$ would be zero;
    to be realistic, we weaken our demand in
    two ways:
    • $error_{\mathcal{D}}(h)$ only has to be bounded by a small number $\epsilon$.
    • The learner is not required to succeed on every
      training sample; rather, its probability of
      failure has to be bounded by a constant $\delta$.
  • Hence the idea of Probably
    Approximately Correct learning.

  • Def. Consider a concept class C defined over an
    instance space X with instances of length n, and a learner L
    using hypothesis space H. C is PAC-learnable by L using
    H if for all c in C, all distributions $\mathcal{D}$ over X, and all
    $\epsilon, \delta \in (0, 0.5)$, L will with probability at least
    $1 - \delta$ output a hypothesis h s.t. $error_{\mathcal{D}}(h) \le \epsilon$, in time
    that is polynomial in $1/\epsilon$, $1/\delta$, n, and size(c),
    the encoding length of c in C.

Sample Complexity for Finite Hypothesis Spaces
  • Start from a good class of learners: the consistent
    learner, defined as one that outputs a hypothesis which
    perfectly fits the training data set whenever
    possible.
  • Recall: the version space $VS_{H,D}$ is defined to be the
    set of all hypotheses $h \in H$ that correctly classify all
    training examples in D. (From here on, D denotes the
    training data and $\mathcal{D}$ the instance distribution.)
  • Property. Every consistent learner outputs a hypothesis
    belonging to the version space.

  • Def. $VS_{H,D}$ is said to be $\epsilon$-exhausted w.r.t. c and
    $\mathcal{D}$ if for every h in $VS_{H,D}$, $error_{\mathcal{D}}(h) < \epsilon$.

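$\epsilon$-exhausting the Version Space
  • Theorem 5.1 If the hypothesis space H is finite, and D is a
    sequence of m independent randomly drawn examples
    of some target concept c, then for any $0 \le \epsilon \le 1$, the
    probability that $VS_{H,D}$ is not $\epsilon$-exhausted w.r.t.
    c is no more than $|H| e^{-\epsilon m}$.
  • Basic idea behind the proof: since H is finite,
    we can enumerate the hypotheses in $VS_{H,D}$ as $h_1, h_2, \ldots, h_k$.
    $VS_{H,D}$ is not $\epsilon$-exhausted iff at least one
    of these $h_i$ has true error at least $\epsilon$; however, such an
    $h_i$ perfectly fits all m training
    examples, which happens with probability at most
    $(1 - \epsilon)^m \le e^{-\epsilon m}$. A union bound over the at most
    $|H|$ such hypotheses gives the theorem.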

  • The theorem bounds the probability that m
    training examples fail to eliminate all bad
    hypotheses.
  • If we want this upper bound to be no more than $\delta$,
    and we solve the resulting inequality $|H| e^{-\epsilon m} \le \delta$ for m, it
    follows that $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln(1/\delta)\right)$.
  • This many training examples are sufficient
    to guarantee that any consistent hypothesis will be
    probably (with probability $1 - \delta$) approximately
    correct (with error less than $\epsilon$).
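
The sample-size bound for finite hypothesis spaces, m ≥ (1/ε)(ln|H| + ln(1/δ)), is easy to evaluate numerically. The sketch below is a minimal illustration; the function name and the example values (|H| = 1000, ε = 0.1, δ = 0.05) are assumptions, not taken from the slides.

```python
import math

def pac_sample_bound(h_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / eps)

# Hypothetical values chosen only to show the order of magnitude.
m = pac_sample_bound(h_size=1000, eps=0.1, delta=0.05)

# Sanity check against Theorem 5.1: |H| * exp(-eps * m) should not exceed delta.
failure_bound = 1000 * math.exp(-0.1 * m)
```

Note that the bound grows only logarithmically in |H| and 1/δ, but linearly in 1/ε.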

A PAC-Learnable Example
  • Consider the class C of conjunctions of boolean
    literals.
  • A boolean literal is any boolean variable or its
    negation.
  • Q: Is such a C PAC-learnable?
  • A: Yes, by going through the following two steps:
    • Show that any consistent learner will require
      only a polynomial number of training examples to
      learn any element of C.
    • Suggest a specific algorithm that uses polynomial
      time per training example.

  • Step 1
    • Let H consist of conjunctions of literals based
      on n boolean variables.
    • Now take a look at $m \ge \frac{1}{\epsilon}(\ln|H| + \ln(1/\delta))$ and
      observe that $|H| = 3^n$; the inequality then becomes
      $m \ge \frac{1}{\epsilon}(n \ln 3 + \ln(1/\delta))$.
  • Step 2
    • The FIND-S algorithm satisfies the requirement.
    • For each new positive training example, the
      algorithm computes the intersection of the literals
      shared by the current hypothesis and the example,
      using time linear in n.

  • Conclusion: conjunctions of boolean literals are
    PAC-learnable.
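
As a worked instance of the bound for conjunctions, where |H| = 3^n because each variable appears positively, negatively, or not at all, take the illustrative values n = 10, ε = 0.1, δ = 0.05 (these numbers are assumptions chosen for the example, not from the slides):

```latex
m \;\ge\; \frac{1}{\epsilon}\left(n \ln 3 + \ln\frac{1}{\delta}\right)
  \;=\; \frac{1}{0.1}\left(10 \ln 3 + \ln 20\right)
  \;\approx\; 10\,(10.99 + 3.00)
  \;\approx\; 139.8
```

So 140 training examples suffice under these values, and the requirement is linear in n.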

Agnostic Learning: Inconsistent Hypotheses
  • In the proof of Theorem 5.1 we assume that $VS_{H,D}$
    is not empty; a simple way to guarantee that this
    condition holds is to assume that c belongs
    to H.
  • Agnostic learning setting: don't assume $c \in H$;
    the learner simply finds the hypothesis with minimum
    training error instead.

  • The question in Theorem 5.1 becomes:
  • Let $error_D(h)$ denote the training error of hypothesis
    h over the training data D, $error_{\mathcal{D}}(h)$ its true error,
    and $h_{best}$ the hypothesis in H with the smallest training
    error. How many training examples suffice to
    ensure (with high probability) that
    $error_{\mathcal{D}}(h_{best}) \le error_D(h_{best}) + \epsilon$?
  • Borrow the setting from Ch3 under which we estimate
    the true error via the sample error, and apply the
    Hoeffding bound: $\Pr[error_{\mathcal{D}}(h) > error_D(h) + \epsilon] \le e^{-2m\epsilon^2}$
    for any single h.
  • Taking a union bound over all of H and requiring the
    resulting probability $|H| e^{-2m\epsilon^2}$ to be bounded by some
    constant $\delta$, it follows that
    $m \ge \frac{1}{2\epsilon^2}\left(\ln|H| + \ln(1/\delta)\right)$.
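
For comparison with the consistent-learner case, the agnostic bound obtained from the Hoeffding inequality plus a union bound over H is m ≥ (1/(2ε²))(ln|H| + ln(1/δ)). The sketch below evaluates it; the function name and the example values are assumptions made for illustration.

```python
import math

def agnostic_sample_bound(h_size, eps, delta):
    """Smallest integer m with m >= (1/(2*eps^2)) * (ln|H| + ln(1/delta))."""
    return math.ceil((math.log(h_size) + math.log(1 / delta)) / (2 * eps ** 2))

# Hypothetical values: |H| = 1000, eps = 0.1, delta = 0.05.
m_agnostic = agnostic_sample_bound(h_size=1000, eps=0.1, delta=0.05)

# Sanity check: |H| * exp(-2 * m * eps^2) should not exceed delta.
failure_bound = 1000 * math.exp(-2 * m_agnostic * 0.1 ** 2)
```

The dependence on 1/ε is now quadratic rather than linear, which is the price of not assuming that the target concept lies in H.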

Limitation of Theorem 5.1
  • The bound is quite weak (it can easily be greater than 1 if
    the cardinality of H is large enough!)
  • H must be finite
  • Introduce a new measure: the Vapnik-Chervonenkis
    dimension of H, or VC dimension.
  • Rough idea of VC: it measures the complexity of H by
    the number of distinct instances from X that can
    be completely discriminated using H.

Shattering a Set of Instances
  • Def. A dichotomy of a set S is a partition of S
    into two disjoint subsets.
  • Def. A set of instances S is shattered by hypothesis
    space H iff for every dichotomy of S, there
    exists some hypothesis in H consistent with this
    dichotomy.

3 instances shattered
  • [Figure not preserved in the transcript.]

VC Dimension
  • Motivation: what if H can't shatter X? Try finite
    subsets of X.
  • Def. The VC dimension of hypothesis space H defined over
    instance space X is the size of the largest finite
    subset of X shattered by H. If arbitrarily
    large finite subsets of X can be shattered by H,
    then $VC(H) = \infty$.
  • Roughly speaking, the VC dimension measures how many
    (training) points can be separated for all
    possible labelings using functions of a given
    class.

An Example: Linear Decision Surfaces
  • Line case: X = the set of real numbers and H = the set of all
    open intervals; then $VC(H) = 2$.
  • Plane case: X = the xy-plane and H = the set of all linear
    decision surfaces in the plane; then $VC(H) = 3$.
  • General case: for n-dimensional real space, let H
    be the set of linear decision surfaces; then $VC(H) = n + 1$.
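
The line case can be checked by brute force. The sketch below tests whether a hypothesis set realizes every dichotomy of a point set. It assumes a finite grid of interval endpoints (a hypothetical stand-in for the infinite class of open intervals), which is fine enough for the test points used here.

```python
from itertools import product

def shatters(hypotheses, points):
    """True if every labeling (dichotomy) of points is realized by some hypothesis."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return all(labels in realized
               for labels in product([False, True], repeat=len(points)))

def open_interval(a, b):
    """Hypothesis that labels x positive iff a < x < b."""
    return lambda x: a < x < b

# Hypothetical finite grid of endpoints: 0.0, 0.5, ..., 4.5.
grid = [0.5 * i for i in range(10)]
H = [open_interval(a, b) for a in grid for b in grid if a <= b]

two_ok = shatters(H, [1.0, 2.0])          # two points can be shattered
three_ok = shatters(H, [1.0, 2.0, 3.0])   # no interval yields (+, -, +)
```

Two points are shattered, but no three points on the line can be, because a single interval cannot include the outer points while excluding the middle one. This matches VC(H) = 2.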

Sample Complexity from VC Dimension
  • How many randomly drawn examples suffice to
    $\epsilon$-exhaust $VS_{H,D}$ with probability at least $1 - \delta$?
  • $m \ge \frac{1}{\epsilon}\left(4 \log_2(2/\delta) + 8\, VC(H) \log_2(13/\epsilon)\right)$
    (Blumer et al. 1989)
  • Furthermore, it is possible to obtain a lower
    bound on sample complexity (i.e. the minimum number
    of required training samples).
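
The upper bound attributed to Blumer et al. (1989), in the form usually quoted in textbooks, is m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε)). The sketch below evaluates it; the function name and the example values (linear decision surfaces in the plane with VC(H) = 3, ε = 0.1, δ = 0.05) are illustrative assumptions.

```python
import math

def vc_sample_bound(vc_dim, eps, delta):
    """Smallest integer m with
    m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))."""
    return math.ceil(
        (4 * math.log2(2 / delta) + 8 * vc_dim * math.log2(13 / eps)) / eps
    )

# Hypothetical values: VC(H) = 3, eps = 0.1, delta = 0.05.
m_vc = vc_sample_bound(vc_dim=3, eps=0.1, delta=0.05)
```

Unlike the ln|H| bound, this one stays finite for infinite hypothesis spaces as long as VC(H) is finite.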

Lower Bound on Sample Complexity
  • Theorem 5.2 (Ehrenfeucht et al. 1989) Consider
    any concept class C s.t. $VC(C) \ge 2$, any learner L,
    and any $0 < \epsilon < 1/8$ and $0 < \delta < 1/100$. Then there exists
    a distribution $\mathcal{D}$ and a target concept in C s.t. if
    L observes fewer examples than
    $\max\left[\frac{1}{\epsilon}\log(1/\delta),\ \frac{VC(C) - 1}{32\epsilon}\right]$, then with
    probability at least $\delta$, L outputs a hypothesis h having
    $error_{\mathcal{D}}(h) > \epsilon$.

Introduction to Mistake Bound
  • Mistake bound: the total number of mistakes a
    learner makes before it converges to the correct
    hypothesis.
  • Assume the learner receives a sequence of
    training examples; for each instance x,
    the learner must first predict c(x) before it
    receives the correct answer from the teacher.
  • Application scenario: when learning must be
    done on the fly, rather than during an off-line
    training stage.

Find-S Algorithm
  • FIND-S: find a maximally specific hypothesis
  • Initialize h to the most specific hypothesis in H
  • For each positive training example x
    • For each attribute constraint $a_i$ in h: if it is
      satisfied by x, do nothing; otherwise
      replace $a_i$ by the next more general constraint
      that is satisfied by x.
  • Output hypothesis h
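
The steps above can be sketched for the special case of conjunctions of boolean literals. This is a minimal illustration, not the general attribute-constraint version: a hypothesis is represented as a set of `(index, value)` literals, and that encoding, the function names, and the toy data are all assumptions.

```python
def find_s(n, examples):
    """FIND-S for conjunctions of boolean literals over n variables.

    A hypothesis is a set of literals (i, v), read as "variable i must equal v".
    The most specific hypothesis holds all 2n literals and accepts nothing.
    """
    h = {(i, v) for i in range(n) for v in (0, 1)}
    for x, label in examples:
        if label:  # negative examples are ignored
            # Keep only the literals that the positive example satisfies.
            h = {(i, v) for (i, v) in h if x[i] == v}
    return h

def predict(h, x):
    """Classify x as positive iff it satisfies every literal in h."""
    return all(x[i] == v for (i, v) in h)

# Hypothetical data for the target "x0 AND (NOT x2)" over n = 3 variables.
data = [((1, 0, 0), True), ((1, 1, 0), True), ((0, 1, 0), False)]
h = find_s(3, data)
```

Each positive example is processed in time linear in n, since at most 2n literals are inspected.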

Mistake Bound for FIND-S
  • Assume the training data is noise-free and the target
    concept c is in the hypothesis space H, which consists
    of conjunctions of up to n boolean literals.
  • Then in the worst case the learner makes
    n + 1 mistakes before it learns c.
  • Note that a misclassification occurs only when
    the current hypothesis misclassifies a
    positive example as negative, and each such
    mistake removes at least one constraint from the
    hypothesis. (The first mistake removes n of the 2n initial literals.)
  • In the above worst case, c is the function that
    labels every instance as true.
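
The worst case can be reproduced with an online version of FIND-S that predicts before seeing each label and counts its mistakes. The literal-set encoding and the particular example stream are assumptions chosen so that each mistake removes as few literals as possible.

```python
def find_s_online_mistakes(n, stream):
    """Run FIND-S online: predict, then see the label; return the mistake count."""
    h = {(i, v) for i in range(n) for v in (0, 1)}   # most specific hypothesis
    mistakes = 0
    for x, label in stream:
        prediction = all(x[i] == v for (i, v) in h)
        if prediction != label:
            mistakes += 1
        if label:
            # Generalize: drop every literal the positive example violates.
            h = {(i, v) for (i, v) in h if x[i] == v}
    return mistakes

# Worst case: the target labels every instance positive (empty conjunction).
n = 3
stream = [((1, 1, 1), True), ((0, 1, 1), True), ((1, 0, 1), True), ((1, 1, 0), True)]
worst = find_s_online_mistakes(n, stream)
```

The first example removes n of the 2n initial literals, and each later example removes exactly one more, giving n + 1 mistakes in total.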

Mistake Bound for Halving Algorithm
  • Halving algorithm: incrementally maintain the
    version space as every new instance arrives, and
    predict the label of a new instance by majority vote
    (of the hypotheses in the version space).
  • Q: What is the maximum number of mistakes that
    can be made by the Halving algorithm, for an
    arbitrary finite H, before it exactly learns the
    target concept c (assuming c is in H)?
  • Answer: the largest integer no more than $\log_2|H|$.
  • How about the minimum number of mistakes?
  • Answer: zero mistakes!
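
A minimal sketch of the Halving algorithm follows. The hypothesis space (all 16 boolean functions of two bits, stored as truth tables), the XOR target, and the tie-breaking rule (predict 0 on a tie) are assumptions made for illustration.

```python
from itertools import product
import math

def halving_mistakes(hypotheses, stream):
    """Run the Halving algorithm online and return the number of mistakes."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, label in stream:
        votes_for_1 = sum(h(x) for h in version_space)
        prediction = 2 * votes_for_1 > len(version_space)   # majority vote, ties -> 0
        if prediction != label:
            mistakes += 1
        # Keep only the hypotheses consistent with the revealed label.
        version_space = [h for h in version_space if h(x) == label]
    return mistakes

# Hypothetical H: all 16 boolean functions on 2 bits, each stored as a truth table.
X = list(product([0, 1], repeat=2))
H = [lambda x, t=dict(zip(X, bits)): bool(t[x]) for bits in product([0, 1], repeat=4)]
target = lambda x: bool(x[0] ^ x[1])
stream = [(x, target(x)) for x in X]
m = halving_mistakes(H, stream)
bound = math.floor(math.log2(len(H)))
```

Every mistake means at least half of the version space voted wrongly and is eliminated, which is why the count can never exceed the floor of log2|H|.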

Optimal Mistake Bounds
  • For an arbitrary concept class C, assuming H = C, we are
    interested in the lowest worst-case mistake bound
    over all possible learning algorithms.
  • Let $M_A(c)$ denote the maximum number of mistakes,
    over all possible training sequences, that a
    learner A makes to exactly learn c.
  • Def. $M_A(C) = \max_{c \in C} M_A(c)$
  • Ex: $M_{\text{FIND-S}}(C) = n + 1$, $M_{\text{Halving}}(C) \le \log_2|C|$

Optimal Mistake Bounds (2)
  • The optimal mistake bound for C, denoted by
    Opt(C), is defined as $Opt(C) = \min_{A \in \text{learning algorithms}} M_A(C)$.
  • Notice that $Opt(C) \le M_{\text{Halving}}(C) \le \log_2|C|$.
  • Furthermore, Littlestone (1987) shows that
    $VC(C) \le Opt(C)$!
  • When C equals the power set $C_P$ (the set of all subsets) of a
    finite instance space X, the above four
    quantities become equal to each other, namely $|X|$.

Weighted-Majority Algorithm
  • It is a generalization of the Halving algorithm: it
    makes a prediction by taking a weighted vote
    among a pool of prediction algorithms (or
    hypotheses) and learns by altering the weights.
  • It starts by assigning equal weight (1) to every
    prediction algorithm. Whenever an algorithm
    misclassifies a training example, its
    weight is reduced.
  • The Halving algorithm corresponds to reducing the weight to zero.

Procedure for Adjusting Weights
  • $a_i$ denotes the i-th prediction algorithm in the
    pool; $w_i$ denotes the weight of $a_i$ and is
    initialized to 1.
  • For each training example $\langle x, c(x) \rangle$:
    • Initialize $q_0$ and $q_1$ to 0.
    • For each $a_i$: if $a_i(x) = 0$ then $q_0 \leftarrow q_0 + w_i$, else
      $q_1 \leftarrow q_1 + w_i$.
    • If $q_1 > q_0$, predict c(x) to be 1; else
      if $q_1 < q_0$, predict c(x) to be 0; else
      predict c(x) at random to be 1 or 0.
    • For each $a_i$: if $a_i(x) \neq c(x)$ (as given by the
      teacher), set $w_i \leftarrow \beta w_i$.
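
The weight-adjusting procedure can be sketched as follows. The predictor pool, the example stream, and the seeded random tie-breaker are hypothetical choices made for illustration; the update rule multiplies the weight of every wrong predictor by β.

```python
import math
import random

def weighted_majority(predictors, stream, beta=0.5, seed=0):
    """Run Weighted-Majority; return (mistakes of the vote, final weights)."""
    rng = random.Random(seed)
    weights = [1.0] * len(predictors)
    mistakes = 0
    for x, label in stream:
        votes = [a(x) for a in predictors]
        q0 = sum(w for w, v in zip(weights, votes) if v == 0)
        q1 = sum(w for w, v in zip(weights, votes) if v == 1)
        if q1 > q0:
            prediction = 1
        elif q1 < q0:
            prediction = 0
        else:
            prediction = rng.choice([0, 1])   # break ties at random
        if prediction != label:
            mistakes += 1
        # Multiply the weight of every predictor that was wrong by beta.
        weights = [w * beta if v != label else w for w, v in zip(weights, votes)]
    return mistakes, weights

# Hypothetical pool: always-0, always-1, and the parity of x (the true target here).
pool = [lambda x: 0, lambda x: 1, lambda x: x % 2]
stream = [(x, x % 2) for x in range(20)]
M, w = weighted_majority(pool, stream)
k = 0   # the parity predictor makes no mistakes on this stream
```

Setting `beta=0` removes a predictor after its first error, which recovers the Halving algorithm when the pool is the hypothesis space itself.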

Comments on Adjusting Weights Idea
  • The idea can be found in various problems such as
    pattern matching, where we might reduce weights
    of less frequently used patterns in the learned
  • The textbook claims that one benefit of the
    algorithm is that it is able to accommodate
    inconsistent training data, but in case of
    learning by query, we presume that answer given
    by the teacher is always correct.

Relative Mistake Bound for the Algorithm
  • Theorem 5.3 Let D be the training sequence, A be
    any set of n prediction algorithms, and k be the
    minimum number of mistakes made by any algorithm
    in A on the training sequence D. Then the number
    of mistakes over D made by the Weighted-Majority
    algorithm using $\beta = 0.5$ is at most $2.4(k + \log_2 n)$.
  • Proof: the basic idea is to compare the
    final weight of the best prediction algorithm to the
    sum of the weights of all prediction algorithms. Let $a_j$ be
    an algorithm with k mistakes; then its final
    weight is $w_j = 0.5^k$. Now consider the sum W of the weights
    of all prediction algorithms, and observe that for every
    mistake made by Weighted-Majority, W is reduced to at most 0.75W,
    because the algorithms holding at least half of the total
    weight were wrong and had their weights halved.

Proof of Theorem 5.3 (contd)
  • Let M be the total number of mistakes made by the
    algorithm; then the final total weight is at most
    $n(0.75)^M$, and furthermore $0.5^k \le n(0.75)^M$. Solving
    this inequality for M completes the proof.
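
For completeness, the last step can be written out by taking base-2 logarithms of the inequality between the best algorithm's final weight and the total weight. The constant is rounded to 2.4 as in the theorem statement.

```latex
\left(\tfrac{1}{2}\right)^{k} \le n \left(\tfrac{3}{4}\right)^{M}
\;\Longrightarrow\;
-k \le \log_2 n + M \log_2\tfrac{3}{4}
\;\Longrightarrow\;
M \le \frac{k + \log_2 n}{-\log_2\tfrac{3}{4}}
\approx 2.4\,(k + \log_2 n)
```

Here $-\log_2(3/4) \approx 0.415$, so its reciprocal is roughly 2.4.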

  • Problem setting in Ch5
    • Inductively learning an unknown target function,
      given training examples and a hypothesis space
  • Focus on
    • How many training examples are sufficient?
      • PAC-learning model (probably approximately correct), VC
        dimension for infinite hypothesis spaces
    • How many mistakes will the learner make before
      it converges to the correct hypothesis?
      • Mistake bound, optimal mistake bound

  • 7.2, 7.5, 7.8 (10pt each, Due Tuesday, 11-3)