Title: Advanced Artificial Intelligence Lecture 4: Learning Theory
1 Advanced Artificial Intelligence Lecture 4
Learning Theory
- Bob McKay
- School of Computer Science and Engineering
- College of Engineering
- Seoul National University
2 Outline
- Language Identification
- PAC Learning
- Vapnik-Chervonenkis Dimension
- Mistake-Bounded Learning
3 What should a Definition of Learnability Look Like?
- First try
- How easy is it to learn a function f?
- Easy: build a definition of f into the learning algorithm
- Second try
- How easy is it to learn a given function f from a set of functions F?
4 Language Identification in the Limit
- Gold (1967)
- An algorithm identifies language L in the limit if there is some K such that after K steps, the algorithm always answers L, and L is in fact the correct answer.
- Computability focus
- can the concept be learned at all
- rather than computational feasibility
- can the concept be learned with reasonable resources
- Many sub-definitions, the most important being whether
- the algorithm gets positive examples only, or negative plus positive examples
- the algorithm is given the examples in a predetermined order, or can ask about specific examples
- A very strict definition, appropriate only for noise-free environments and infinite time
5 What should a Definition of Learnability Look Like?
- Third try
- Add a requirement for polynomial time, rather than just eventual convergence
- What's wrong with this?
6 Defining Learnability - Noise Issues
- You might get a row of misleading instances by chance
- Don't require a guaranteed correct answer, just one correct with a given probability
- You might only see noisy answers for some inputs
- Don't require the learned function to always be correct - just correct 'almost everywhere'
- The examples may not be equally likely to be seen
- Take the example distribution into account
- As greater accuracy is required, learning is likely to require more examples
- Learning is required to be polynomial in both size of input and required accuracy
7 PAC Learning
- A set F of Boolean functions is learnable iff there is
- a polynomial p and an algorithm A(F) such that
- for every f in F, for any distributions D+, D- of likelihood of samples, and for every ε, δ > 0,
- A halts in time p(S(f), 1/ε, 1/δ) and outputs a program g such that
- with probability at least 1 - δ
- Σ_{g(x)=0} D+(x) < ε
- Σ_{g(x)=1} D-(x) < ε
- (S(f) is some measure of the actual size of f)
- Valiant, 'A Theory of the Learnable', 1984: a motivational and informal paper, with both positive and negative results
- Pitt and Valiant, 'Computational Limits on Learning from Examples': a formal and mathematical paper, with further, mainly negative, results
8 PAC Learning
- ε is a measure of how accurate the function g is ('approximately correct')
- δ is a measure of how often g can be wrongly chosen ('probably correct')
- The definition could be rewritten with ε for both these roles
- this is equivalent to the original definition anyway
- but the use of separate ε and δ simplifies the derivation of limits on learnability.
- Variants of the definitions
- A is allowed access either to positive examples only, or to both positive and negative examples
- g is required to produce no false positives
- g is required to produce no false negatives
9 PAC Learning Results
- k-CNF
- Formulas in Conjunctive Normal Form, at most k literals per clause
- PAC learnable from positive examples only
- k-DNF
- Formulas in Disjunctive Normal Form, at most k literals per term
- PAC learnable from negative examples only
- k-term CNF
- (CNF with at most k conjuncts)
- Polynomially hard
- k-term DNF
- (DNF with at most k disjuncts)
- Polynomially hard
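The positive-example results above have a simple algorithmic core. Below is a minimal Python sketch (not from the lecture) of the elimination learner for plain conjunctions of literals; the k-CNF algorithm works the same way, treating each clause of at most k literals as a single "meta-literal" to keep or delete. All names and the toy data are invented for illustration.

```python
# Sketch: learn a conjunction of literals from positive examples only.
# Start from the conjunction of every literal and delete any literal that a
# positive example falsifies; what survives is the most specific consistent
# conjunction.

def learn_conjunction(n_vars, positive_examples):
    # literal (i, True) means x_i, (i, False) means not-x_i
    hypothesis = {(i, v) for i in range(n_vars) for v in (True, False)}
    for x in positive_examples:                 # x is a tuple of booleans
        hypothesis = {(i, v) for (i, v) in hypothesis if x[i] == v}
    return hypothesis

def predict(hypothesis, x):
    return all(x[i] == v for (i, v) in hypothesis)

# Toy target: x0 AND not-x2, over 3 variables
positives = [(True, True, False), (True, False, False)]
h = learn_conjunction(3, positives)
print(sorted(h))                          # [(0, True), (2, False)]
print(predict(h, (False, True, True)))    # False
```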
10 PAC Learning Results
- Virtually all negative results rely on the assumption that RP ≠ NP
- i.e. that some problems solvable in non-deterministic polynomial time cannot be solved in random polynomial time
- informally, that making the right guesses gives an advantage over just making random guesses
- The above results may seem somewhat surprising, since k-CNF includes k-term DNF (and mutatis mutandis)
- There have been a series of PAC-learnability results since Valiant's original work, more negative than positive
- This leads to the current emphasis on bias to restrict the hypothesis space, and background knowledge to guide learning
11 Extensions of PAC Learning
- k-term DNF is not PAC learnable
- but
- we can extend the PAC definition to allow f and g to belong to different function classes
- and
- k-term DNF is PAC learnable by k-CNF!!!
- The problem is more in expressing the right hypothesis
- than in converging on that hypothesis
- Pitt and Warmuth, 'Reductions among Prediction Problems: On the Difficulty of Predicting Automata', 1988
- Polynomial predictability
- Essentially the PAC definition, but with the hypothesis allowed to belong to an arbitrary language
12 PAC Learning and Sample Size
- PAC learning results are expressed in terms of the amount of computation time needed to learn a concept
- Many algorithms require a constant or near-constant time to process a sample, independent of the number of samples
- Most (but not all) results regarding polynomial time may be translated into results about polynomial sample size
- k-term DNF
- We mentioned above that k-term DNF is not PAC-learnable
- Nevertheless (see Mitchell) k-term DNF is learnable in polynomial sample size
- The samples just take longer to process, the longer the formula is
13 Vapnik-Chervonenkis Dimension: Why?
- Good estimates of the amount of data needed to learn
- A neutral comparison measure between different methods
- A measure to help avoid over-fitting
- The underpinning for support vector machine learning
14 Reminder: Version Spaces
- The version space VS_{H,D} is the subset of the hypothesis space H which is consistent with the learning data D
- The region of the generalisation hierarchy
- bounded above by the positive examples
- bounded below by the negative examples
- As further examples are added to D, the boundaries of the version space contract to remain consistent.
15 Reminder: Candidate Elimination
- Set G to the most general hypotheses in L
- Set S to the most specific hypotheses in L
- For each example d in D
- If d is a positive example
- Remove from G any hypothesis inconsistent with d
- For each hypothesis s in S inconsistent with d
- Remove s from S
- Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h
- Remove from S any hypothesis that is more general than another hypothesis in S
- If d is a negative example
- Remove from S any hypothesis inconsistent with d
- For each hypothesis g in G that is not consistent with d
- Remove g from G
- Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h
- Remove from G any hypothesis that is less general than another hypothesis in G
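As a concrete reminder, here is a compact Python sketch of the procedure above for conjunctive hypotheses over discrete attributes, in the style of Mitchell. Hypotheses are tuples whose entries are either a required value or '?' (any value), plus a special BOTTOM hypothesis covering nothing; the attribute names and data in the usage example are invented.

```python
# Candidate elimination for conjunctive hypotheses.
BOTTOM = None                                # the maximally specific hypothesis

def covers(h, x):
    return h is not BOTTOM and all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    """h1 is more general than, or equal to, h2."""
    if h2 is BOTTOM:
        return True
    if h1 is BOTTOM:
        return False
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    G = [tuple(['?'] * n)]                   # most general boundary
    S = [BOTTOM]                             # most specific boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                else:                        # minimal generalisation covering x
                    h = x if s is BOTTOM else tuple(
                        sv if sv == xv else '?' for sv, xv in zip(s, x))
                    if any(more_general(g, h) for g in G):
                        new_S.append(h)
            S = [s for s in new_S            # drop over-general members of S
                 if not any(t != s and more_general(s, t) for t in new_S)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                else:                        # minimal specialisations excluding x
                    for i, gv in enumerate(g):
                        if gv == '?':
                            for val in domains[i]:
                                if val != x[i]:
                                    h = g[:i] + (val,) + g[i + 1:]
                                    if any(more_general(h, s) for s in S):
                                        new_G.append(h)
            G = [g for g in new_G            # drop over-specific members of G
                 if not any(t != g and more_general(t, g) for t in new_G)]
    return S, G

# Invented toy data: two attributes, one positive and one negative example
domains = [('sunny', 'rainy'), ('warm', 'cold')]
data = [(('sunny', 'warm'), True), (('rainy', 'cold'), False)]
print(candidate_elimination(data, domains))
# ([('sunny', 'warm')], [('sunny', '?'), ('?', 'warm')])
```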
16 True vs Sample Error
- error is the true error rate of the hypothesis
- r is the error rate on the examples seen so far
- Note that r = 0 for all hypotheses in VS_{H,D}
- Aim: reduce the true error of every hypothesis in VS_{H,D} to below ε
17 ε-Exhaustion
- Suppose we wish to learn a target concept c from a hypothesis space H, using a set of training examples D drawn from c with distribution D
- VS_{H,D} is ε-exhausted if every hypothesis h in VS_{H,D} has error less than ε
- (∀h ∈ VS_{H,D}) error_D(h) < ε
18 Probability Bounds
- Suppose we are given a particular sample size m (drawn randomly and independently)
- What is the probability that the version space VS_{H,D} has not been ε-exhausted?
- There is a relatively simple bound - the probability is at most |H| e^(-εm)
19 Sample Size, Finite H
- We would like the probability that we have not ε-exhausted VS_{H,D} to be less than δ
- |H| e^(-εm) < δ
- Then we need m samples, where
- m ≥ (ln|H| + ln(1/δ)) / ε
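As a quick illustration of how this bound behaves, the following Python snippet evaluates the sample-size formula and checks it against the |H| e^(-εm) bound; the values of |H|, ε and δ are arbitrary.

```python
# Evaluate m >= (ln|H| + ln(1/delta)) / eps and sanity-check it against the
# probability bound |H| * exp(-eps * m).  Numbers are illustrative only.
from math import ceil, exp, log

def pac_sample_bound(h_size, eps, delta):
    return ceil((log(h_size) + log(1.0 / delta)) / eps)

H_SIZE = 3 ** 10 + 1        # e.g. conjunctions over 10 boolean variables
m = pac_sample_bound(H_SIZE, eps=0.1, delta=0.05)
print(m)                                  # 140
print(H_SIZE * exp(-0.1 * m) <= 0.05)     # True: failure probability below delta
```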
20 Sample Size, Infinite H
- For finite hypothesis spaces
- The formula is very generous
- For infinite hypothesis spaces
- Gives no guidance at all
- We would like a measure of difficulty of a hypothesis space giving a bound for infinite spaces and a tighter bound for finite spaces.
- This is what the VC dimension gives us
- Note that the previous analysis completely ignores the structure of the individual hypotheses in H, relying on the corresponding version space
- The VC dimension takes into account the fine-grained structure of H, and its interaction with the individual data items.
21 Shattering
- Definition: A hypothesis space H shatters a set of instances S iff for every dichotomy of S, there is a hypothesis h in H consistent with that dichotomy
- Figure 1: 8 hypotheses shattering 3 instances
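The definition can be checked mechanically for small cases. Below is a brute-force Python sketch: hypotheses are treated as boolean-valued functions, and every dichotomy of S is tested. It is only practical for tiny S and a finite sample of H, and it is reused in the examples that follow.

```python
# Brute-force shattering check: H (a collection of indicator functions)
# shatters S iff every dichotomy of S is realised by some h in H.
from itertools import product

def shatters(hypotheses, S):
    realised = {tuple(bool(h(x)) for x in S) for h in hypotheses}
    return all(d in realised for d in product((False, True), repeat=len(S)))
```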
22 VC Dimension
- We seek bounds on the number of instances required to learn a concept with a given fidelity.
- We would like to know the largest size of S that H can shatter
- the larger S is, the more expressive H is
- Definition 2: VC(H) is the size of the largest subset of the instance space X which H shatters
- If there is no limit on this size, then VC(H) = ∞
23 Example 1: Real Intervals
- X = R and H = the set of closed intervals on R
- What is VC(H)?
- Consider the set S = {-1, 1}
- S is shattered by H
- so VC(H) is at least 2.
- Consider S = {x1, x2, x3} with x1 < x2 < x3
- H doesn't shatter S
- no hypothesis from H can represent {x1, x3}.
- Why not?
- Suppose Y = [y1, y2] covers {x1, x3}
- Then clearly, y1 ≤ x1 and y2 ≥ x3
- So y1 < x2 < y2, so that Y covers x2 as well.
- Thus H cannot shatter any 3-element subset of R, from which it follows that VC(H) = 2
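Continuing the sketch from slide 21, a small finite sample of intervals already exhibits all four dichotomies of {-1, 1}, while (as the argument above shows) no interval at all separates {x1, x3} from x2, so the three-point check necessarily fails; the particular endpoints below are arbitrary.

```python
# Closed intervals on R, reusing shatters() from the sketch after slide 21.
def interval(a, b):
    return lambda x: a <= x <= b

H = [interval(a, b) for a in (-2.0, -1.5, 0.0, 2.0) for b in (-2.0, 0.0, 1.0, 3.0)]
print(shatters(H, [-1.0, 1.0]))        # True  -> VC(H) >= 2
print(shatters(H, [-1.0, 0.0, 1.0]))   # False -> the dichotomy {x1, x3} is missing
```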
24 Example 2: Linear Decisions
- Let X = R^2, H = the set of linear decision surfaces
- (two-input perceptron)
- H shatters any 2-element subset
- VC(H) ≥ 2.
- For three element sets
- if the elements of S are co-linear, then H cannot shatter them (as above).
- H can shatter any set of three points which are not co-linear.
- Thus VC(H) ≥ 3
25 Example 2: Linear Decisions (cont)
- 4 points not shattered
- No single decision surface can partition these points into {(-1,-1),(1,1)} and {(-1,1),(1,-1)}
- But of course, this isn't enough
- to be sure that VC(H) = 3, we need to know that no set of four points can be shattered
26 Example 2: Linear Decisions (cont)
- If there were such a set of four points
- No three of them are collinear (see previous)
- Hence there is an affine transformation of them onto (-1,-1),(-1,1),(1,-1),(1,1)
- That transformation would also transform the decision surfaces into new linear decision surfaces which shatter (-1,-1),(-1,1),(1,-1),(1,1)
- contradiction!
- For linear decision surfaces in R^n, the VC dimension is n+1
- (i.e. the VC dimension of an n-input perceptron is n+1).
27 Example 3: Conjunctions of Literals
- Consider conjunctions of n = 3 literals
- Represent each instance as a bitstring
- Consider the set S = {100, 010, 001}.
- Naming the boolean variables A, B, C, we see
- the hypothesis set {∅, A, B, C, ¬A, ¬B, ¬C, A∧B∧C} shatters S
- (∅ is the empty conjunction, which is always true, hence covers the full set S)
- Thus VC(H) is at least 3.
28 Example 3: Conjunctions of Literals (cont)
- Can a set of four instances be shattered?
- The answer is no, though the proof is non-trivial
- So VC(H) = 3.
- The proof is more general
- if H is the set of boolean conjunctions of up to n variables, and X is the set of boolean instances, then VC(H) = n.
29 VC Dimension and Hypothesis Space Size
- From examples 1 and 2
- The VC dimension can be quite small even when the hypothesis space is infinite
- If we can get learning bounds in terms of the VC dimension, these will apply even to infinite hypothesis spaces
30 VC Dimension and Minimum Sample Size
- Recall: for finite spaces, a bound on the number m of samples necessary to ε-exhaust H with probability at least 1 - δ is
- m ≥ (ln|H| + ln(1/δ)) / ε
- Using the VC dimension, we get
- m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
- (Blumer's Theorem)
- The minimum number of examples is proportional to the VC dimension
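For concreteness, here is the Blumer bound as a Python function; the VC dimension, ε and δ used in the example call are arbitrary.

```python
# Sufficient sample size (Blumer et al.):
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
from math import ceil, log2

def blumer_bound(vc, eps, delta):
    return ceil((1.0 / eps) * (4 * log2(2.0 / delta) + 8 * vc * log2(13.0 / eps)))

print(blumer_bound(vc=3, eps=0.1, delta=0.05))   # 1899 examples suffice
```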
31 VC Dimension and Sample Size
- The above is a guaranteed bound. But how lucky could we get?
- Assuming VC(H) ≥ 2, ε < 1/8 and δ < 1/100
- For any learner L, there is a situation in which, with probability at least δ, L outputs a hypothesis having error rate at least ε, if L observes fewer training examples than
- max[(1/ε) log(1/δ), (VC(H) - 1) / (32ε)]
- (Ehrenfeucht's Theorem)
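The matching lower bound can be written the same way; the base of the logarithm is assumed to be 2 here (as in Mitchell's statement of the result), and the example values mirror the ones used above.

```python
# Worst-case lower bound (Ehrenfeucht et al.):
#   max[(1/eps) * log2(1/delta), (VC(H) - 1) / (32 * eps)]
from math import ceil, log2

def ehrenfeucht_lower_bound(vc, eps, delta):
    return ceil(max((1.0 / eps) * log2(1.0 / delta), (vc - 1) / (32.0 * eps)))

print(ehrenfeucht_lower_bound(vc=3, eps=0.1, delta=0.05))   # at least 44 examples
```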
32 VC Dimension and Neural Nets
- The VC dimension of a neural network is determined by the number of free parameters in the network
- A free parameter is one (usually a weight) which can change independently of any other parameters of the network.
33 VC Dimension: Threshold Activation
- For networks with a threshold activation function
- φ(v) = 1 for v ≥ 0
- φ(v) = 0 for v < 0
- the VC dimension is proportional to W log W, where W is the total number of free parameters in the network
34 VC Dimension: Sigmoid Activation
- For networks with a sigmoid activation function
- φ(v) = 1 / (1 + e^(-v))
- the VC dimension is proportional to W^2, where W is the total number of free parameters in the network
35 Structural Risk Minimisation
- We would like to find the neural network N with the minimum generalisation error v_gen(w) for the trained weight vector w.
36 Decision Tree Error Curve
37 Generalisation Error Curve
38 Structural Risk Minimisation
- There is an upper bound for v_gen(w) given by
- v_gteed(w) = v_train(w) + ε1(N, VC(N), δ, v_train(w))
- N is the number of training examples, δ is a measure of the certainty we want
- The exact form of ε1 is complex; most importantly, ε1 increases with VC(N), so the guaranteed risk and generalisation error have the general form shown above
39 Structural Risk Algorithm
- General method for finding the best-generalising neural network
- Define a sequence N1, N2, ... of classifiers with monotonically increasing VC dimension.
- Minimise the training error of each
- Identify the classifier N with the smallest guaranteed risk
- This classifier is the one with the best generalising ability for unseen data.
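A minimal Python sketch of this loop is given below. The confidence term eps1 is only a placeholder with the right qualitative behaviour (it grows with the VC dimension and shrinks with the number of examples); the slides do not give its exact form, and train_and_eval stands for whatever training routine is used for each classifier.

```python
# Structural risk minimisation: train a sequence of models of increasing VC
# dimension and keep the one with the smallest guaranteed risk
#   v_gteed = v_train + eps1(N, VC, delta).
from math import log, sqrt

def eps1(n_examples, vc, delta):
    # placeholder confidence term: increases with vc, decreases with n_examples
    return sqrt((vc * (log(2.0 * n_examples / vc) + 1.0) + log(4.0 / delta))
                / n_examples)

def structural_risk_minimisation(classifiers, train_and_eval, n_examples, delta=0.05):
    """classifiers: a sequence N1, N2, ... of increasing VC dimension.
    train_and_eval(c) -> (trained_model, training_error, vc_dimension)."""
    best_model, best_risk = None, float('inf')
    for c in classifiers:
        model, v_train, vc = train_and_eval(c)
        guaranteed = v_train + eps1(n_examples, vc, delta)
        if guaranteed < best_risk:
            best_model, best_risk = model, guaranteed
    return best_model, best_risk
```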
40 Varying VC Dimension
- For fully connected multilayer feedforward
networks, one simple way to vary VC(N) is to
monotonically increase the number of neurons in
one of the hidden layers.
41 Mistake-Bounded Learning
- In some situations (where we must use the result of the learning right from the start) we may be more concerned about the number of mistakes we make in learning a concept than about the total number of instances required
- In mistake-bound learning, the learner is required, after receiving each instance x, to give a prediction of c(x)
- before it is given the real answer
- Each erroneous value counts as a mistake
- we are interested in the total number of mistakes made before the algorithm converges to c
- In some ways, an extension of Gold's definition
42 Mistake-Bounded Learning - Example
- For some algorithms and hypothesis spaces, it is possible to derive bounds on the number of mistakes which will be made in learning
- if H is the set of conjunctions formed from any subset of n literals and their negations
- the Find-S algorithm will make at most n+1 mistakes in learning a given concept
- With the same H
- the candidate-elimination algorithm (predicting by majority vote over the version space) will make at most log2|H| mistakes
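The Find-S bound is easy to see in code: run Find-S online over conjunctions of literals and count the disagreements. The first positive example can cost one mistake and each later mistake removes at least one literal, which is where the n + 1 comes from; the data stream below is invented.

```python
# Find-S run online over conjunctions of n boolean literals, counting mistakes.
def find_s_online(n_vars, stream):
    hypothesis = {(i, v) for i in range(n_vars) for v in (True, False)}
    mistakes = 0
    for x, label in stream:                  # x: tuple of bools, label: bool
        prediction = all(x[i] == v for (i, v) in hypothesis)
        if prediction != label:
            mistakes += 1
        if label:                            # Find-S only generalises on positives
            hypothesis = {(i, v) for (i, v) in hypothesis if x[i] == v}
    return mistakes

# target concept: x0 AND not-x1, over n = 2 variables (so at most 3 mistakes)
stream = [((True, True), False), ((True, False), True), ((False, False), False)]
print(find_s_online(2, stream))   # 1
```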
43 Optimal Mistake Bounds
- Optimal mistake bounds give an estimate of the overall complexity of a hypothesis space
- The optimal mistake bound opt(H) is the minimum over all algorithms of the mistake bound for H
- Littlestone's Theorem
- VC(H) ≤ opt(H) ≤ log2|H|