Title: Rigorous Learning Curve Bounds from Statistical Mechanics
1. Rigorous Learning Curve Bounds from Statistical Mechanics
- D. Haussler, M. Kearns, H. S. Seung, N. Tishby
Presentation: Talya Meltzer
2. Motivation
- According to VC theory, minimizing the empirical error within a function class F on a random sample leads to generalization error bounds
  - Realizable case
  - Unrealizable case
- The VC bounds are the best distribution-independent upper bounds
3. Motivation
- Yet, these bounds are vacuous for m < d (sample size smaller than the VC dimension)
- And they fail to capture the true behavior of particular learning curves
  - Experimental learning curves fit a variety of functional forms, including exponentials
  - Learning curves analyzed with statistical mechanics methods exhibit phase transitions (sudden drops in the generalization error)
4. Main Ideas
- Decompose the hypothesis class into error shells
- Assign each hypothesis its correct generalization error, taking the specific distribution into account
- Use the thermodynamic limit method
  - Identify the correct scale at which to analyze a learning curve
  - Express the learning curve as a competition between an entropy function and an energy function
5. Overview: The PAC Learning Model
- The hypothesis class
- Input
- Assumptions
  - The examples in the training set S are sampled i.i.d. according to a distribution D over X
  - D is unknown
  - D is fixed throughout the learning process
  - There exists a target function f: X → Y, i.e. yi = f(xi)
- Goal: find the target function
6. Overview: The PAC Learning Model
- Training (empirical) error
- Generalization error
- The class F is PAC-learnable if there exists a learning algorithm which, given ε and δ, returns h ∈ F such that
  - The training error is minimal
  - With probability at least 1 − δ, the generalization error of h is at most ε
7. The Finite Realizable Case
- The version space VS(S)
- The ε-ball B(ε)
- If B(ε) contains VS(S), then any function in the version space has generalization error at most ε
8. The Finite Realizable Case
9. Decomposition into error shells
In a finite class, there is only a finite number of possible error values: 0 ≤ ε1 < ε2 < … < εr ≤ 1, with r ≤ |F| < ∞.
10. Decomposition into error shells
So we can replace the union bound by the exact sum over the error shells.
Now, with probability at least 1 − δ, any h consistent with the sample obeys the resulting shell bound.
To understand the behavior of this bound, we will use the thermodynamic limit method (a small numerical sketch of the shell bound follows below).
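As a concrete illustration, here is a minimal numerical sketch of the shell bound, not from the paper; the shell structure at the end is a hypothetical toy class.

    # Minimal sketch of the error-shell bound (illustrative, not the paper's code).
    # Input: a list of (eps_j, n_j) pairs, where n_j is the number of hypotheses
    # whose generalization error is exactly eps_j under the fixed distribution D.
    from math import comb

    def shell_bound(shells, m, delta):
        # Exclude shells from the largest error downwards while their cumulative
        # survival probability sum n_j * (1 - eps_j)^m stays below delta.
        # The first shell that cannot be excluded sets the bound: with probability
        # >= 1 - delta, every h consistent with the m-sample has error at most
        # the returned value.
        tail = 0.0
        for eps_j, n_j in sorted(shells, reverse=True):
            term = n_j * (1.0 - eps_j) ** m
            if tail + term > delta:
                return eps_j
            tail += term
        return 0.0   # every nonzero-error shell was excluded: perfect learning

    # Hypothetical toy class: one target at error 0 plus binomial-sized shells.
    toy = [(j / 20.0, comb(20, j)) for j in range(21)]
    print(shell_bound(toy, m=100, delta=0.05))   # -> 0.05 for this toy class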
11. The Thermodynamic Limit
- We consider an infinite sequence of function classes F1, F2, …, FN, …
  - FN: a class of functions f : XN → {0,1}, with N = log2(|FN|)
- We are often interested in a parametric class of functions
- The number of functions in the class at any given error value may have a limiting asymptotic behavior as the number of parameters grows
12. The Thermodynamic Limit
- Rewrite the expression (see the reconstruction below)
- Introduce the scaling function t(N): when chosen properly, it captures the scale at which the learning curve is most interesting
- Find a permissible entropy bound s(ε) that tightly captures the behavior of the shell sizes
- In the rewritten sum, the entropy of the j-th error shell is POSITIVE and the minus-energy of the j-th error shell is NEGATIVE
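A plausible reconstruction of the rewritten expression (the displayed formula is missing from this transcript); it assumes the permissible entropy bound takes the form |Fj| ≤ exp(t(N)·s(εj)) and that m = α·t(N):

    % reconstruction under the assumptions |F_j| <= exp(t(N) s(eps_j)) and m = alpha t(N)
    \Pr\big[\exists\, h \in VS(S):\ \epsilon(h) \ge \epsilon\big]
      \;\le\; \sum_{j:\,\epsilon_j \ge \epsilon} |F_j|\,(1-\epsilon_j)^m
      \;\le\; \sum_{j:\,\epsilon_j \ge \epsilon}
              \exp\!\big( t(N)\,[\, s(\epsilon_j) \;+\; \alpha \ln(1-\epsilon_j) \,]\big)

Here s(εj) is the positive entropy term and α ln(1 − εj) is the negative minus-energy term referred to above.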
13. The Thermodynamic Limit
- Formal definitions
  - t(N): a mapping from the natural numbers to the natural numbers, such that …
  - s(ε): a continuous function
  - s(ε) is called a permissible entropy bound if there exists a natural number N0 such that for all N ≥ N0 and for all 1 ≤ j ≤ r(N), …
14. The Thermodynamic Limit
α = m/t(N) remains constant as m, N → ∞; α controls the competition between the entropy and the energy.
15. The Thermodynamic Limit
- In order to describe infinite systems:
  - We describe a system of finite size, then let the size grow to infinity
  - We normalize extensive variables by the volume
  - We keep the density fixed: ρ = N/V constant as N, V → ∞
16. The Thermodynamic Limit
The Learning System vs. The Thermodynamic System
17. The Thermodynamic Limit
- Benefit: N is isolated in the factor t(N), and the remaining factor is the continuous function s(ε) + α ln(1 − ε)
- Define ε* as the largest ε such that s(ε) ≥ −α ln(1 − ε)
- In the thermodynamic limit, under certain conditions, we can bound the generalization error of any consistent hypothesis by ε* + τ
18. The Thermodynamic Limit
We will see that for ε > ε*, the thermodynamic limit of the sum is 0. Let 0 < τ ≤ 1 be an arbitrarily small quantity.
19. The Thermodynamic Limit
The limit will indeed be zero, provided that r(N) = o(exp(t(N)·τ)).
20. The Thermodynamic Limit
- Summary
  - ε* is the rightmost crossing point of s(ε) and −α ln(1 − ε)
  - In the thermodynamic limit, any hypothesis h consistent with m = α·t(N) examples will have εgen(h) ≤ ε* + τ (with probability 1)
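A minimal numerical sketch of this recipe (an illustration, not the paper's code): given any permissible entropy bound s and a value of α, it locates the rightmost ε at which s(ε) still matches or exceeds the energy −α ln(1 − ε).

    # Sketch: rightmost crossing of the entropy s(eps) with the energy -alpha*ln(1-eps).
    import numpy as np

    def epsilon_star(s, alpha, n_grid=100_000):
        # s: a permissible entropy bound, given as a vectorized callable on (0, 1).
        # Returns the largest grid point where s(eps) >= -alpha*ln(1 - eps),
        # i.e. the bound eps* of this slide, or 0.0 if the energy dominates everywhere.
        eps = np.linspace(1e-6, 1.0 - 1e-6, n_grid)
        ok = s(eps) >= -alpha * np.log(1.0 - eps)
        return float(eps[ok][-1]) if ok.any() else 0.0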
21. Scaled Learning Curves
- Extracting scaled learning curves (see the usage sketch below):
  - Let the value of α vary
  - Apply the thermodynamic limit method to each value
  - Plot the generalization error bound as a function of α (instead of m, hence "scaled")
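For instance, reusing the epsilon_star sketch above with the trivial entropy bound s(ε) = 1 of the next slide gives a smooth scaled curve:

    # Scaled learning curve: sweep alpha and record the bound for each value.
    alphas = np.linspace(0.1, 10.0, 200)
    curve = [epsilon_star(lambda e: np.ones_like(e), a) for a in alphas]
    # With s(eps) = 1 the crossing is at eps*(alpha) = 1 - exp(-1/alpha),
    # so the bound decays smoothly (no phase transition) as alpha grows.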
22. Artificial Examples
Using the weak permissible entropy bound s(ε) = 1, for some scaling function t(N).
23. Artificial Examples
Using a single-peak permissible entropy bound.
24. Artificial Examples
Using a different single-peak permissible entropy bound.
25. Artificial Examples
Using a double-peak permissible entropy bound (a toy reproduction follows below).
26. Phase Transitions
- The sudden drops in the learning curves are called phase transitions
- In thermodynamic systems, a phase transition is the transformation from one phase to another
- A critical point is the set of conditions (such as temperature and pressure) at which the transition occurs
27. Phase Transitions
Well-known phase transitions: solid to liquid, liquid to gas...
28. Phase Transitions (more)
29. Phase Transitions in Learning
- In some learning curves, we see a transition from a finite generalization error to perfect learning
- The transition occurs at a critical α, i.e. when the sample reaches the size m = αC·t(N)
- At this critical point the system "realizes" the problem all at once
30. (Almost) Real Examples: The Ising Perceptron
fN: an arbitrary target function, defined by a weight vector w0
31. (Almost) Real Examples: The Ising Perceptron
Due to the spherically symmetric input distribution, the generalization error of a hypothesis depends only on its overlap with the target.
The number of perceptrons at Hamming distance j from the target is the binomial coefficient C(N, j).
32. (Almost) Real Examples: The Ising Perceptron
We have already seen this entropy bound: it is the single-peak example from the artificial curves.
- The phase transition to perfect learning occurs at αC ≈ 1.448 (a numerical check follows below)
- The critical m for perfect learning given by the VC and cardinality bounds is considerably larger than the m = αC·t(N) obtained here
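A numerical sketch reproducing the critical value quoted above (not from the paper; it assumes the standard Ising-perceptron setup: for spherically symmetric inputs a hypothesis at Hamming distance j from w0 has error ε = arccos(1 − 2j/N)/π, there are C(N, j) such hypotheses, and t(N) = N, so s(ε) is the binary entropy H(q) with q = (1 − cos(πε))/2):

    # Critical alpha for the Ising perceptron: perfect learning once the energy
    # -alpha*ln(1 - eps) exceeds the entropy s(eps) for every eps > 0, i.e.
    # alpha_c = max over eps of s(eps) / (-ln(1 - eps)).
    import numpy as np

    def H(q):                                   # binary entropy in nats
        return -q * np.log(q) - (1.0 - q) * np.log(1.0 - q)

    eps = np.linspace(1e-4, 1.0 - 1e-4, 200_000)
    q = (1.0 - np.cos(np.pi * eps)) / 2.0       # assumed Hamming-fraction <-> error map
    alpha_c = np.max(H(q) / (-np.log(1.0 - eps)))
    print(alpha_c)                              # ~1.448, matching the slide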
33. (Almost) Real Examples: The Ising Perceptron
- The right zero crossing yields the upper bound on the generalization error
- With high probability, there are no hypotheses in VS(S) with error less than the left zero crossing, except for the target itself
- So VS(S), minus the target, is contained between these two zero crossings
34. The Thermodynamic Limit: Lower Bound
- The thermodynamic limit method can also provide a lower bound on the generalization error
- The lower bound shows that the behavior seen in scaled learning curves, including phase transitions, can actually occur for certain function classes and distributions
- We will use the energy function 2αε
- The qualitative behavior of the curves obtained by intersecting s(ε) with 2αε and with −α ln(1 − ε) is essentially the same
35. The Thermodynamic Limit: Lower Bound
- We can construct
  - a function class sequence FN over XN
  - a distribution sequence DN over XN
  - a target function sequence fN
- such that
  - s(ε) is a permissible entropy bound with respect to t(N) = N
  - for the largest ε ≤ ½ for which 2αε ≤ s(ε), there is a constant probability of finding a consistent hypothesis with εgen(h) ≥ ε
  - ⇒ this ε is a lower bound on the error of the worst consistent hypothesis
36. The Finite Unrealizable Case
- The data can be labeled according to a function not within our class
- Or sampled from a distribution DN over XN × {0,1}, which can also model noise in the examples
- Use u(ε) as a permissible energy bound if, for any h in F and any sample size m, …
  (for the realizable case we had the energy −ln(1 − ε), with the exact equality Pr[h is consistent] = (1 − ε)^m)
37. The Finite Unrealizable Case
- We can always choose … (and in certain cases we can do better)
- This yields the standard cardinality bound
- Since the class is finite, we can slice it into error shells and apply the thermodynamic limit, just as in the realizable case
- Choosing ε* to be the rightmost intersection of s(ε) and α·u(ε), we get, for any τ > 0, the bound sketched below
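A reconstruction of the resulting statement, by direct analogy with the realizable-case summary of slide 20 (the displayed formula is missing from this transcript, so this is a reading of the slide, not a quote):

    % assumes eps* is the rightmost solution of s(eps) = alpha*u(eps), by analogy with slide 20
    \epsilon^{*} = \max\{\epsilon : s(\epsilon) \ge \alpha\, u(\epsilon)\},
    \qquad
    \epsilon_{\mathrm{gen}}(\hat h) \;\le\; \epsilon^{*} + \tau
    \quad \text{in the thermodynamic limit, for any } \tau > 0,

where ĥ is a hypothesis minimizing the training error.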
38. The Infinite Case
- The covering approach: build a finite cover of the infinite class at some chosen resolution ⇒ the best error εmin achievable within the cover depends on that resolution
- Apply the thermodynamic limit by building a sequence of nested covers
- Result: a bound on the error given by the rightmost crossing of the cover's entropy bound s(ε) and α·u(ε)
- Trade-off
  - The best error achievable in the chosen cover improves as the resolution gets finer
  - The size of the cover increases as the resolution gets finer
39. Real World Example
Sufficient Dimensionality Reduction with Irrelevance Statistics (A. Globerson, G. Chechik, N. Tishby)
- In this example:
  - Main data: images of all the male subjects with a neutral facial expression, illuminated either from the right or from the left
  - Irrelevance data: created similarly from female images
40. Real World Example
41. Summary
- Benefits of the method
  - Derives tighter bounds
  - Also describes the behavior for small samples ⇒ useful in practice, where we often want to work with m < d
  - Captures the phase transitions in learning curves, including transitions to perfect learning, which can actually occur experimentally in certain problems
- Further work to be done
  - Refined extensions to the infinite case