Title: Advanced Artificial Intelligence Lecture 4: Learning Theory
1 Advanced Artificial Intelligence Lecture 4
Learning Theory
- Bob McKay
- School of Computer Science and Engineering
- College of Engineering
- Seoul National University
2 Outline
- Language Identification
- PAC Learning
- Vapnik-Chervonenkis Dimension
- Mistake-Bounded Learning
3 What should a Definition of Learnability Look Like?
- First try
- How easy is it to learn a function f?
- Easy: build a definition of f into the learning algorithm
- Second try
- How easy is it to learn a given function f from a set of functions F?
4 Language Identification in the Limit
- Gold (1967)
- An algorithm identifies language L in the limit if there is some K such that after K steps, the algorithm always answers L, and L is in fact the correct answer.
- Computability focus
- can the concept be learned at all
- rather than computational feasibility
- can the concept be learned with reasonable resources
- Many sub-definitions, the most important being whether
- the algorithm gets positive examples only, or negative plus positive examples
- the algorithm is given the examples in a predetermined order, or can ask about specific examples
- A very strict definition, appropriate only for noise-free environments and infinite time
5 What should a Definition of Learnability Look Like?
- Third try
- Add a requirement for polynomial time, rather than just eventual convergence
- What's wrong with this?
6 Defining Learnability - Noise Issues
- You might get a row of misleading instances by chance
- Don't require a guaranteed correct answer, just one correct with a given probability
- You might only see noisy answers for some inputs
- Don't require the learned function to always be correct - just correct 'almost everywhere'
- The examples may not be equally likely to be seen
- Take the example distribution into account
- As greater accuracy is required, learning is likely to require more examples
- Learning is required to be polynomial in both size of input and required accuracy
7 PAC Learning
- A set F of Boolean functions is learnable iff there is
- a polynomial p and an algorithm A(F) such that
- for every f in F, for any distributions D+, D- of likelihood of samples, and for every ε, δ > 0,
- A halts in time p(S(f), 1/ε, 1/δ) and outputs a program g such that
- with probability at least 1 - δ
- Σ_{g(x)=0} D+(x) < ε
- Σ_{g(x)=1} D-(x) < ε
- (S(f) is some measure of the actual size of f)
- Valiant, 'A Theory of the Learnable', 1984: a motivational and informal paper, with both positive and negative results
- Pitt and Valiant, 'Computational Limits on Learning from Examples': a formal and mathematical paper, with further, mainly negative, results
8 PAC Learning
- ε is a measure of how accurate the function g is ('approximately correct')
- δ is a measure of how often g can be wrongly chosen ('probably correct')
- The definition could be rewritten with ε for both these roles
- this is equivalent to the original definition anyway
- but the use of separate ε and δ simplifies the derivation of limits on learnability.
- Variants of the definitions
- A is allowed access either to positive examples only, or to both positive and negative examples
- g is required to produce no false positives
- g is required to produce no false negatives
9 PAC Learning Results
- k-CNF
- Formulas in Conjunctive Normal Form, at most k literals per clause
- PAC learnable from positive examples only
- k-DNF
- Formulas in Disjunctive Normal Form, at most k literals per term
- PAC learnable from negative examples only
- k-term CNF
- (CNF with at most k conjuncts)
- Polynomially hard
- k-term DNF
- (DNF with at most k disjuncts)
- Polynomially hard
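The positive-example results above have a simple algorithmic core. Below is a minimal Python sketch (not from the lecture) of the elimination learner for plain conjunctions of literals; the k-CNF algorithm works the same way, treating each clause of at most k literals as a single "meta-literal" to keep or delete. All names and the toy data are invented for illustration.

```python
# Sketch: learn a conjunction of literals from positive examples only.
# Start from the conjunction of every literal and delete any literal that a
# positive example falsifies; what survives is the most specific consistent
# conjunction.

def learn_conjunction(n_vars, positive_examples):
    # literal (i, True) means x_i, (i, False) means not-x_i
    hypothesis = {(i, v) for i in range(n_vars) for v in (True, False)}
    for x in positive_examples:                 # x is a tuple of booleans
        hypothesis = {(i, v) for (i, v) in hypothesis if x[i] == v}
    return hypothesis

def predict(hypothesis, x):
    return all(x[i] == v for (i, v) in hypothesis)

# Toy target: x0 AND not-x2, over 3 variables
positives = [(True, True, False), (True, False, False)]
h = learn_conjunction(3, positives)
print(sorted(h))                          # [(0, True), (2, False)]
print(predict(h, (False, True, True)))    # False
```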
10 PAC Learning Results
- Virtually all negative results rely on the assumption that RP ≠ NP
- i.e. that some problems solvable in non-deterministic polynomial time cannot be solved in random polynomial time
- informally, that making the right guesses gives an advantage over just making random guesses
- The above results may seem somewhat surprising, since k-CNF includes k-term DNF (and mutatis mutandis)
- There have been a series of PAC-learnability results since Valiant's original work, more negative than positive
- This leads to the current emphasis on bias to restrict the hypothesis space, and background knowledge to guide learning
11 Extensions of PAC Learning
- k-term DNF is not PAC learnable
- but
- we can extend the PAC definition to allow f and g to belong to different function classes
- and
- k-term DNF is PAC learnable by k-CNF!!!
- The problem is more in expressing the right hypothesis
- than in converging on that hypothesis
- Pitt and Warmuth, 'Reductions among Prediction Problems: On the Difficulty of Predicting Automata', 1988
- Polynomial predictability
- Essentially the PAC definition, but with the hypothesis allowed to belong to an arbitrary language
12 PAC Learning and Sample Size
- PAC learning results are expressed in terms of the amount of computation time needed to learn a concept
- Many algorithms require a constant or near-constant time to process a sample, independent of the number of samples
- Most (but not all) results regarding polynomial time may be translated into results about polynomial sample size
- k-term DNF
- We mentioned above that k-term DNF is not PAC-learnable
- Nevertheless (see Mitchell) k-term DNF is learnable in polynomial sample size
- The samples just take longer to process, the longer the formula is
13 Vapnik-Chervonenkis Dimension: Why?
- Good estimates of the amount of data needed to learn
- A neutral comparison measure between different methods
- A measure to help avoid over-fitting
- The underpinning for support vector machine learning
14 Reminder: Version Spaces
- The version space VS_{H,D} is the subset of the hypothesis space H which is consistent with the learning data D
- The region of the generalisation hierarchy
- bounded above by the positive examples
- bounded below by the negative examples
- As further examples are added to D, the boundaries of the version space contract to remain consistent.
15 Reminder: Candidate Elimination
- Set G to the most general hypotheses in L
- Set S to the most specific hypotheses in L
- For each example d in D
- If d is a positive example
- Remove from G any hypothesis inconsistent with d
- For each hypothesis s in S inconsistent with d
- Remove s from S
- Add to S all minimal generalisations h of s such that h is consistent with d, and some member of G is more general than h
- Remove from S any hypothesis that is more general than another hypothesis in S
- If d is a negative example
- Remove from S any hypothesis inconsistent with d
- For each hypothesis g in G that is not consistent with d
- Remove g from G
- Add to G all minimal specialisations h of g such that h is consistent with d, and some member of S is more specific than h
- Remove from G any hypothesis that is less general than another hypothesis in G
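As a concrete reminder, here is a compact Python sketch of the procedure above for conjunctive hypotheses over discrete attributes, in the style of Mitchell. Hypotheses are tuples whose entries are either a required value or '?' (any value), plus a special BOTTOM hypothesis covering nothing; the attribute names and data in the usage example are invented.

```python
# Candidate elimination for conjunctive hypotheses.
BOTTOM = None                                # the maximally specific hypothesis

def covers(h, x):
    return h is not BOTTOM and all(hv == '?' or hv == xv for hv, xv in zip(h, x))

def more_general(h1, h2):
    """h1 is more general than, or equal to, h2."""
    if h2 is BOTTOM:
        return True
    if h1 is BOTTOM:
        return False
    return all(a == '?' or a == b for a, b in zip(h1, h2))

def candidate_elimination(examples, domains):
    n = len(domains)
    G = [tuple(['?'] * n)]                   # most general boundary
    S = [BOTTOM]                             # most specific boundary
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]
            new_S = []
            for s in S:
                if covers(s, x):
                    new_S.append(s)
                else:                        # minimal generalisation covering x
                    h = x if s is BOTTOM else tuple(
                        sv if sv == xv else '?' for sv, xv in zip(s, x))
                    if any(more_general(g, h) for g in G):
                        new_S.append(h)
            S = [s for s in new_S            # drop over-general members of S
                 if not any(t != s and more_general(s, t) for t in new_S)]
        else:
            S = [s for s in S if not covers(s, x)]
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                else:                        # minimal specialisations excluding x
                    for i, gv in enumerate(g):
                        if gv == '?':
                            for val in domains[i]:
                                if val != x[i]:
                                    h = g[:i] + (val,) + g[i + 1:]
                                    if any(more_general(h, s) for s in S):
                                        new_G.append(h)
            G = [g for g in new_G            # drop over-specific members of G
                 if not any(t != g and more_general(t, g) for t in new_G)]
    return S, G

# Invented toy data: two attributes, one positive and one negative example
domains = [('sunny', 'rainy'), ('warm', 'cold')]
data = [(('sunny', 'warm'), True), (('rainy', 'cold'), False)]
print(candidate_elimination(data, domains))
# ([('sunny', 'warm')], [('sunny', '?'), ('?', 'warm')])
```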
16 True vs Sample Error
- error is the true error rate of the hypothesis
- r is the error rate on the examples seen so far
- Note that r = 0 for all hypotheses in VS_{H,D}
- Aim: reduce the true error of every hypothesis in VS_{H,D} to below ε
17 ε-Exhaustion
- Suppose we wish to learn a target concept c from a hypothesis space H, using a set of training examples D drawn from c with distribution D
- VS_{H,D} is ε-exhausted if every hypothesis h in VS_{H,D} has error less than ε
- (∀h ∈ VS_{H,D}) error_D(h) < ε
18 Probability Bounds
- Suppose we are given a particular sample size m (drawn randomly and independently)
- What is the probability that the version space VS_{H,D} has not been ε-exhausted?
- There is a relatively simple bound - the probability is at most |H| e^(-εm)
19 Sample Size, Finite H
- We would like the probability that we have not ε-exhausted VS_{H,D} to be less than δ
- |H| e^(-εm) < δ
- Then we need m samples, where
- m ≥ (ln|H| + ln(1/δ)) / ε
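As a quick illustration of how this bound behaves, the following Python snippet evaluates the sample-size formula and checks it against the |H| e^(-εm) bound; the values of |H|, ε and δ are arbitrary.

```python
# Evaluate m >= (ln|H| + ln(1/delta)) / eps and sanity-check it against the
# probability bound |H| * exp(-eps * m).  Numbers are illustrative only.
from math import ceil, exp, log

def pac_sample_bound(h_size, eps, delta):
    return ceil((log(h_size) + log(1.0 / delta)) / eps)

H_SIZE = 3 ** 10 + 1        # e.g. conjunctions over 10 boolean variables
m = pac_sample_bound(H_SIZE, eps=0.1, delta=0.05)
print(m)                                  # 140
print(H_SIZE * exp(-0.1 * m) <= 0.05)     # True: failure probability below delta
```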
20 Sample Size, Infinite H
- For finite hypothesis spaces
- The formula is very generous
- For infinite hypothesis spaces
- Gives no guidance at all
- We would like a measure of difficulty of a hypothesis space giving a bound for infinite spaces and a tighter bound for finite spaces.
- This is what the VC dimension gives us
- Note that the previous analysis completely ignores the structure of the individual hypotheses in H, relying on the corresponding version space
- The VC dimension takes into account the fine-grained structure of H, and its interaction with the individual data items.
21 Shattering
- Definition: A hypothesis space H shatters a set of instances S iff for every dichotomy of S, there is a hypothesis h in H consistent with that dichotomy
- Figure 1: 8 hypotheses shattering 3 instances
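The definition can be checked mechanically for small cases. Below is a brute-force Python sketch: hypotheses are treated as boolean-valued functions, and every dichotomy of S is tested. It is only practical for tiny S and a finite sample of H, and it is reused in the examples that follow.

```python
# Brute-force shattering check: H (a collection of indicator functions)
# shatters S iff every dichotomy of S is realised by some h in H.
from itertools import product

def shatters(hypotheses, S):
    realised = {tuple(bool(h(x)) for x in S) for h in hypotheses}
    return all(d in realised for d in product((False, True), repeat=len(S)))
```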
22 VC Dimension
- We seek bounds on the number of instances required to learn a concept with a given fidelity.
- We would like to know the largest size of S that H can shatter
- the larger S is, the more expressive H is
- Definition 2: VC(H) is the size of the largest subset of the instance space X which H shatters
- If there is no limit on this size, then VC(H) = ∞
23 Example 1: Real Intervals
- X = R and H = the set of closed intervals on R
- What is VC(H)?
- Consider the set S = {-1, 1}
- S is shattered by H
- so VC(H) is at least 2.
- Consider S = {x1, x2, x3} with x1 < x2 < x3
- H doesn't shatter S
- no hypothesis from H can represent {x1, x3}.
- Why not?
- Suppose Y = [y1, y2] covers {x1, x3}
- Then clearly, y1 ≤ x1 and y2 ≥ x3
- So y1 < x2 < y2, so that Y covers x2 as well.
- Thus H cannot shatter any 3-element subset of R, from which it follows that VC(H) = 2
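Continuing the sketch from slide 21, a small finite sample of intervals already exhibits all four dichotomies of {-1, 1}, while (as the argument above shows) no interval at all separates {x1, x3} from x2, so the three-point check necessarily fails; the particular endpoints below are arbitrary.

```python
# Closed intervals on R, reusing shatters() from the sketch after slide 21.
def interval(a, b):
    return lambda x: a <= x <= b

H = [interval(a, b) for a in (-2.0, -1.5, 0.0, 2.0) for b in (-2.0, 0.0, 1.0, 3.0)]
print(shatters(H, [-1.0, 1.0]))        # True  -> VC(H) >= 2
print(shatters(H, [-1.0, 0.0, 1.0]))   # False -> the dichotomy {x1, x3} is missing
```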
24 Example 2: Linear Decisions
- Let X = R^2, H = the set of linear decision surfaces
- (two-input perceptron)
- H shatters any 2-element subset
- VC(H) ≥ 2.
- For three element sets
- if the elements of S are co-linear, then H cannot shatter them (as above).
- H can shatter any set of three points which are not co-linear.
- Thus VC(H) ≥ 3
25 Example 2: Linear Decisions (cont)
- 4 points not shattered
- No single decision surface can partition these points into {(-1,-1),(1,1)} and {(-1,1),(1,-1)}
- But of course, this isn't enough
- to be sure that VC(H) = 3, we need to know that no set of four points can be shattered
26 Example 2: Linear Decisions (cont)
- If there were such a set of four points
- No three of them are collinear (see previous)
- Hence there is an affine transformation of them onto (-1,-1),(-1,1),(1,-1),(1,1)
- That transformation would also transform the decision surfaces into new linear decision surfaces which shatter (-1,-1),(-1,1),(1,-1),(1,1)
- contradiction!
- For linear decision surfaces in R^n, the VC dimension is n+1
- (i.e. the VC dimension of an n-input perceptron is n+1).
27 Example 3: Conjunctions of Literals
- Consider conjunctions of n = 3 literals
- Represent each instance as a bitstring
- Consider the set S = {100, 010, 001}.
- Naming the boolean variables A, B, C, we see
- the hypothesis set {∅, A, B, C, ¬A, ¬B, ¬C, A∧B∧C} shatters S
- (∅ is the empty conjunction, which is always true, hence covers the full set S)
- Thus VC(H) is at least 3.
28 Example 3: Conjunctions of Literals (cont)
- Can a set of four instances be shattered?
- The answer is no, though the proof is non-trivial
- So VC(H) = 3.
- The proof is more general
- if H is the set of boolean conjunctions of up to n variables, and X is the set of boolean instances, then VC(H) = n.
29 VC Dimension and Hypothesis Space Size
- From examples 1 and 2
- The VC dimension can be quite small even when the hypothesis space is infinite
- If we can get learning bounds in terms of the VC dimension, these will apply even to infinite hypothesis spaces
30 VC Dimension and Minimum Sample Size
- Recall: for finite spaces, a bound on the number m of samples necessary to ε-exhaust H with probability at least 1 - δ is
- m ≥ (ln|H| + ln(1/δ)) / ε
- Using the VC dimension, we get
- m ≥ (1/ε)(4 log2(2/δ) + 8 VC(H) log2(13/ε))
- (Blumer's Theorem)
- The minimum number of examples is proportional to the VC dimension
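For concreteness, here is the Blumer bound as a Python function; the VC dimension, ε and δ used in the example call are arbitrary.

```python
# Sufficient sample size (Blumer et al.):
#   m >= (1/eps) * (4*log2(2/delta) + 8*VC(H)*log2(13/eps))
from math import ceil, log2

def blumer_bound(vc, eps, delta):
    return ceil((1.0 / eps) * (4 * log2(2.0 / delta) + 8 * vc * log2(13.0 / eps)))

print(blumer_bound(vc=3, eps=0.1, delta=0.05))   # 1899 examples suffice
```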
31 VC Dimension and Sample Size
- The above is a guaranteed bound. But how lucky could we get?
- Assuming VC(H) ≥ 2, ε < 1/8 and δ < 1/100
- For any learner L, there is a situation in which, with probability at least δ, L outputs a hypothesis having error rate at least ε, if L observes fewer training examples than
- max[(1/ε) log(1/δ), (VC(H) - 1) / (32ε)]
- (Ehrenfeucht's Theorem)
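The matching lower bound can be written the same way; the base of the logarithm is assumed to be 2 here (as in Mitchell's statement of the result), and the example values mirror the ones used above.

```python
# Worst-case lower bound (Ehrenfeucht et al.):
#   max[(1/eps) * log2(1/delta), (VC(H) - 1) / (32 * eps)]
from math import ceil, log2

def ehrenfeucht_lower_bound(vc, eps, delta):
    return ceil(max((1.0 / eps) * log2(1.0 / delta), (vc - 1) / (32.0 * eps)))

print(ehrenfeucht_lower_bound(vc=3, eps=0.1, delta=0.05))   # at least 44 examples
```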
32 VC Dimension and Neural Nets
- The VC dimension of a neural network is determined by the number of free parameters in the network
- A free parameter is one (usually a weight) which can change independently of any other parameters of the network.
33 VC Dimension: Threshold Activation
- For networks with a threshold activation function
- φ(v) = 1 for v ≥ 0
- φ(v) = 0 for v < 0
- the VC dimension is proportional to W log W, where W is the total number of free parameters in the network
34 VC Dimension: Sigmoid Activation
- For networks with a sigmoid activation function
- φ(v) = 1 / (1 + e^(-v))
- the VC dimension is proportional to W^2, where W is the total number of free parameters in the network
35 Structural Risk Minimisation
- We would like to find the neural network N with the minimum generalisation error v_gen(w) for the trained weight vector w.
36 Decision Tree Error Curve
37 Generalisation Error Curve
38 Structural Risk Minimisation
- There is an upper bound for v_gen(w) given by
- v_gteed(w) = v_train(w) + ε1(N, VC(N), δ, v_train(w))
- N is the number of training examples, δ is a measure of the certainty we want
- The exact form of ε1 is complex; most importantly, ε1 increases with VC(N), so the guaranteed risk and generalisation error have the general form shown above
39 Structural Risk Algorithm
- General method for finding the best-generalising neural network
- Define a sequence N1, N2, ... of classifiers with monotonically increasing VC dimension.
- Minimise the training error of each
- Identify the classifier N with the smallest guaranteed risk
- This classifier is the one with the best generalising ability for unseen data.
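A minimal Python sketch of this loop is given below. The confidence term eps1 is only a placeholder with the right qualitative behaviour (it grows with the VC dimension and shrinks with the number of examples); the slides do not give its exact form, and train_and_eval stands for whatever training routine is used for each classifier.

```python
# Structural risk minimisation: train a sequence of models of increasing VC
# dimension and keep the one with the smallest guaranteed risk
#   v_gteed = v_train + eps1(N, VC, delta).
from math import log, sqrt

def eps1(n_examples, vc, delta):
    # placeholder confidence term: increases with vc, decreases with n_examples
    return sqrt((vc * (log(2.0 * n_examples / vc) + 1.0) + log(4.0 / delta))
                / n_examples)

def structural_risk_minimisation(classifiers, train_and_eval, n_examples, delta=0.05):
    """classifiers: a sequence N1, N2, ... of increasing VC dimension.
    train_and_eval(c) -> (trained_model, training_error, vc_dimension)."""
    best_model, best_risk = None, float('inf')
    for c in classifiers:
        model, v_train, vc = train_and_eval(c)
        guaranteed = v_train + eps1(n_examples, vc, delta)
        if guaranteed < best_risk:
            best_model, best_risk = model, guaranteed
    return best_model, best_risk
```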
40 Varying VC Dimension
- For fully connected multilayer feedforward
networks, one simple way to vary VC(N) is to
monotonically increase the number of neurons in
one of the hidden layers.
41 Mistake-Bounded Learning
- In some situations (where we must use the result of the learning right from the start) we may be more concerned about the number of mistakes we make in learning a concept than about the total number of instances required
- In mistake-bound learning, the learner is required, after receiving each instance x, to give a prediction of c(x)
- before it is given the real answer
- Each erroneous value counts as a mistake
- we are interested in the total number of mistakes made before the algorithm converges to c
- In some ways, an extension of Gold's definition
42 Mistake-Bounded Learning - Example
- For some algorithms and hypothesis spaces, it is possible to derive bounds on the number of mistakes which will be made in learning
- if H is the set of conjunctions formed from any subset of n literals and their negations
- the Find-S algorithm will make at most n+1 mistakes in learning a given concept
- With the same H
- the candidate-elimination algorithm (predicting by majority vote over the version space) will make at most log2|H| mistakes
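The Find-S bound is easy to see in code: run Find-S online over conjunctions of literals and count the disagreements. The first positive example can cost one mistake and each later mistake removes at least one literal, which is where the n + 1 comes from; the data stream below is invented.

```python
# Find-S run online over conjunctions of n boolean literals, counting mistakes.
def find_s_online(n_vars, stream):
    hypothesis = {(i, v) for i in range(n_vars) for v in (True, False)}
    mistakes = 0
    for x, label in stream:                  # x: tuple of bools, label: bool
        prediction = all(x[i] == v for (i, v) in hypothesis)
        if prediction != label:
            mistakes += 1
        if label:                            # Find-S only generalises on positives
            hypothesis = {(i, v) for (i, v) in hypothesis if x[i] == v}
    return mistakes

# target concept: x0 AND not-x1, over n = 2 variables (so at most 3 mistakes)
stream = [((True, True), False), ((True, False), True), ((False, False), False)]
print(find_s_online(2, stream))   # 1
```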
43 Optimal Mistake Bounds
- Optimal mistake bounds give an estimate of the overall complexity of a hypothesis space
- The optimal mistake bound opt(H) is the minimum over all algorithms of the mistake bound for H
- Littlestone's Theorem
- VC(H) ≤ opt(H) ≤ log2|H|