Title: VC Dimension
1. VC Dimension: Direct Robustness Control in Statistical Learning Theory
Michel Béra, co-Founder and Chief Scientific Officer
21/10/2002
2. Agenda
- Company positioning: 15 mins
- SRM / SVM Theory: 45 mins
- KXEN Analytic Framework (Demo): 15 mins
3. Company Background
- Founded in July 1998
- Delaware corporation
- Headquartered in San Francisco
- Operations in U.S. and Europe
- Strong Executive Team
- R. Haddad (CEO), E. Marcade (CTO), M. Bera (CSO)
- Active Scientific Committee
- Includes Gregory Piatetsky-Shapiro (founder of SIGKDD), Lee Giles (Penn State Professor, formerly with NEC), Gilbert Saporta (French Statistical Society President), Yann Le Cun (NEC, manager of V. Vapnik), Léon Bottou (NEC), Bernhard Schoelkopf (Max-Planck-Institut)
4. Go-To-Market Strategy
By embedding KXEN into major applications and partnering with leading SIs (systems integrators), KXEN is on the way to becoming the de facto standard for Predictive Analytics.
5. KXEN Value Proposition
[Diagram: the traditional data mining approach, where the business user poses a business question, then waits about 3 weeks of analyst time while the analyst prepares data, builds the model, tests the model and interprets the results before the business user can understand and apply them, at a cost per model of about 30,000]
"Our company builds hundreds of predictive models in the same time we used to build one. KXEN allows us to save millions of dollars with more effective campaigns." (Financial Industry Customer)
6. What is a good model?
[Chart contrasting model quality and robustness: low quality / high robustness, low robustness, and the desired robust model]
7. Agenda
- Company positioning: 15 mins
- SRM / SVM Theory: 45 mins
- KXEN Analytic Framework (Demo): 15 mins
8. VC dimension: definition (1)
- Let us consider a sample (x_1, ..., x_L) from R^n.
- There are 2^L different ways to separate the sample into two sub-samples.
- A set S of functions f(x, w) shatters the sample if all 2^L separations can be realized by functions f(x, w) from the family S.
9. VC dimension: definition (2)
- A function family S has VC dimension h (an integer) if:
  - 1) there is at least one sample of h vectors from R^n that can be shattered by S, and
  - 2) no sample of h+1 vectors from R^n can be shattered by S (a small empirical shattering check is sketched below).
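To make the shattering definition concrete, here is a minimal sketch (not from the slides; it assumes Python with NumPy and scikit-learn) that empirically checks whether a set of points in R^2 is shattered by linear classifiers f(x) = sign(⟨w, x⟩ + b), by trying to separate every one of the 2^L labelings with a high-C linear SVM as a separability heuristic:

```python
# Heuristic shattering check for linear classifiers in R^2 (assumed example).
from itertools import product

import numpy as np
from sklearn.svm import LinearSVC


def is_shattered_by_lines(points: np.ndarray) -> bool:
    """Return True if every +/-1 labeling of `points` is linearly separable."""
    L = len(points)
    for labels in product([-1, 1], repeat=L):
        if len(set(labels)) == 1:
            continue  # a constant labeling is trivially realizable (w = 0, sign of b)
        y = np.array(labels)
        clf = LinearSVC(C=1e6, max_iter=100_000).fit(points, y)
        if (clf.predict(points) != y).any():
            return False  # at least one labeling cannot be separated
    return True


three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(is_shattered_by_lines(three_points))  # True: 3 points in general position
print(is_shattered_by_lines(four_points))   # False: the XOR labeling fails
```

This matches the statement of the next slides: the VC dimension of lines (hyperplanes of R^2) is 3 = n + 1.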
10. Example: VC dimension
- The VC dimension:
  - measures the complexity of a solution (function family);
  - is not directly related to the number of variables.
11. Other examples
- The VC dimension of hyperplanes of R^n is n+1.
- The VC dimension of the set of functions f(x, w) = sign(sin(w·x)), c < x < 1, c > 0, where w is a free parameter, is infinite.
- VC dimensions are therefore not always equal to the number of parameters that define a given family S of functions.
12. Key example: linear models y = ⟨w, x⟩ + b
- The VC dimension of the family S of linear models y = ⟨w, x⟩ + b, with ‖w‖ ≤ C (and inputs bounded by ‖x‖ ≤ R), depends on C and can take any value between 0 and n.
13. VC dimension: interpretation
- The VC dimension of S is an integer that measures the dispersion, or separating power (complexity), of the function family S.
- We shall now show that the VC dimension (through a major theorem from Vapnik) gives a powerful indication of model robustness.
14. Learning Theory Problem (1)
- A model computes a function f(X, w).
- Problem: minimize in w the Risk Expectation R(w) = ∫ Q(z, w) dP(z), where
  - w is a parameter that specifies the chosen model,
  - z = (X, y) are possible values for the attributes (variables),
  - Q measures (quantifies) the model error cost,
  - P(z) is the (unknown) underlying probability law for the data z.
15. Learning Theory Problem (2)
- We get L data points from the learning sample (z_1, ..., z_L), which we suppose are sampled i.i.d. from the law P(z).
- To minimize R(w), we start by minimizing the Empirical Risk over this sample: E(w) = (1/L) Σ_{i=1..L} Q(z_i, w) (a small sketch follows).
- We shall use such an approach for:
  - classification (e.g. Q can be a cost function based on the cost of misclassified points),
  - regression (e.g. Q can be a cost of least-squares type).
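As a small illustration (an assumed example in Python/NumPy, not part of the original deck), the empirical risk for the two choices of cost Q mentioned above:

```python
# Empirical risk E(w) = (1/L) * sum_i Q(z_i, w) for two common costs Q.
import numpy as np


def empirical_risk_classification(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """0/1 cost: Q(z, w) = 1 if the point is misclassified, 0 otherwise."""
    return float(np.mean(y_true != y_pred))


def empirical_risk_regression(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Least-squares cost: Q(z, w) = (y - f(x, w))^2."""
    return float(np.mean((y_true - y_pred) ** 2))


y = np.array([1, -1, 1, 1])
y_hat = np.array([1, 1, 1, -1])
print(empirical_risk_classification(y, y_hat))  # 0.5: two of four points misclassified
```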
16. Learning Theory Problem (3)
- Central problems for Statistical Learning Theory:
  - What is the relation between the Risk Expectation R(w) and the Empirical Risk E(w)?
  - How do we define and measure a generalization capacity (robustness) for a model?
17. Four Pillars of SLT (1 and 2)
- Consistency (guarantees generalization)
  - Under what conditions will a model be consistent?
- Model convergence speed (a measure of generalization)
  - How does generalization capacity improve when the sample size L grows?
18. Four Pillars of SLT (3 and 4)
- Generalization capacity control
  - How can we efficiently control model generalization, starting from the only information we have: our sample data?
- A strategy for good learning algorithms
  - Is there a strategy that guarantees, measures and controls the generalization capacity of our learning model?
19. Consistency: definition
- A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying probability law as our original sample, converges, as the original sample size increases, towards the model error measured on the original sample.
20. Consistent training?
[Figure: two plots of error versus number of training examples, each showing a test-error curve and a training-error curve, contrasting consistent and inconsistent training]
21. Vapnik's main theorem
- Q: Under which conditions will a learning process (model) be consistent?
- A: A model will be consistent if and only if the function f that defines the model comes from a family of functions S with finite VC dimension h.
- A finite VC dimension h not only guarantees generalization capacity (consistency); picking f in a family S with finite VC dimension h is the only way to build a model that generalizes.
22. Model convergence speed (generalization capacity)
- Q: What is the nature of the model error difference between learning data (sample) and test data, for a sample of finite size L?
- A: This difference is no greater than a bound that depends only on the ratio between the VC dimension h of the model function family S and the sample size L, i.e. on h/L.
- This statement is a theorem of the Kolmogorov-Smirnov type, i.e. a theorem that does not depend on the data's underlying probability law.
23. Model convergence speed
[Figure: convergence of the error difference as a function of the sample size L]
24. Empirical Risk Minimization
- With probability 1-q, the following inequality holds (the bound is sketched below):
  R(w_0) ≤ E(w_0) + Confidence Interval,
  where w_0 is the value of the parameter w that minimizes the Empirical Risk.
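The exact formula on this slide did not survive extraction. For reference, the classical Vapnik bound it presumably refers to (stated here as an assumption about the exact form used) says that, with probability at least 1 - q, for every function of a family S with VC dimension h:

\[
R(w) \;\le\; E(w) \;+\; \sqrt{\frac{h\left(\ln\frac{2L}{h} + 1\right) - \ln\frac{q}{4}}{L}}
\]

Applied at w = w_0, the minimizer of the Empirical Risk, this is presumably the inequality the slide invokes; the square-root term is the "Confidence Interval" of the next slides.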
25. SRM methodology: how to control model generalization capacity
- Risk Expectation ≤ Empirical Risk + Confidence Interval
- Minimizing the Empirical Risk alone will not always give a good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval.
- What matters is not the numerical value of Vapnik's bound, most often too large to be of any practical use; it is the fact that this bound is a non-decreasing function of the richness of the model function family.
26. SRM strategy (1)
- With probability 1-q, R(w) ≤ E(w) + Confidence Interval(h, L, q) (the bound above).
- When L/h is small (h too large), the second term of the bound becomes large.
- The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of the above bound on R(w).
- To do this, one has to make h a controlled parameter.
27. SRM strategy (2)
- Let us consider a nested sequence S_1 ⊂ S_2 ⊂ ... ⊂ S_n of model function families, with respective growing VC dimensions h_1 < h_2 < ... < h_n.
- For each family S_i of our sequence, the inequality R(w) ≤ E(w) + Confidence Interval(h_i, L, q) is valid.
28. SRM strategy (3)
SRM: find the index i such that the bound on the expected risk R(w) becomes minimal, for the specific h = h_i of the corresponding family S_i of our sequence, then build the model using f from S_i (a toy sketch of this selection follows).
[Figure: risk as a function of model complexity]
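A toy sketch of the SRM selection loop (an assumed example in Python/NumPy, not KXEN's implementation): the nested families are polynomial models of growing degree, and the "confidence interval" term reuses the classical VC bound with h approximated by the number of free parameters. Applying that classification-style term to a squared loss is a simplification made only to illustrate the mechanics.

```python
# SRM-style model selection over a nested sequence of polynomial families.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + 0.2 * rng.normal(size=x.size)   # noisy target
L, q = x.size, 0.05


def vc_confidence(h: float, L: int, q: float) -> float:
    """Classical VC confidence term (assumed form of the bound on the slides)."""
    return np.sqrt((h * (np.log(2 * L / h) + 1) - np.log(q / 4)) / L)


best = None
for degree in range(1, 10):                          # S_1 ⊂ S_2 ⊂ ... (growing h)
    coefs = np.polyfit(x, y, degree)
    emp_risk = np.mean((np.polyval(coefs, x) - y) ** 2)
    bound = emp_risk + vc_confidence(degree + 1, L, q)  # empirical risk + confidence
    if best is None or bound < best[1]:
        best = (degree, bound)
print("SRM-selected degree:", best[0])
```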
29. Putting SRM into action: the linear models case (1)
- There are many SRM-based strategies to build models.
- In the case of linear models y = ⟨w, x⟩ + b, one wants to make ‖w‖ a controlled parameter: let us call S_C the linear model function family satisfying the constraint ‖w‖ ≤ C.
- Vapnik's major theorem: when C decreases, h(S_C) decreases (for inputs bounded by ‖x‖ ≤ R).
30. Putting SRM into action: the linear models case (2)
- To control ‖w‖, one can envision two routes to a model (a short sketch of both routes follows after this slide):
  - Regularization / Ridge Regression, i.e. minimize over w and b:
    RG(w, b) = Σ_{i=1..L} (y_i - ⟨w, x_i⟩ - b)² + λ‖w‖²
  - Support Vector Machines (SVM), i.e. solve directly an optimization problem (below, the classification SVM with separable data):
    minimize ‖w‖², with y_i = ±1 and y_i(⟨w, x_i⟩ + b) ≥ 1 for all i = 1, ..., L
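A short sketch of the two routes (an assumed example using NumPy and scikit-learn; KXEN's own K2R and KSVM components are not shown): ridge regression shrinks ‖w‖ through the penalty λ‖w‖², while a linear SVM controls ‖w‖ by maximizing the margin.

```python
# Two routes to a controlled-norm linear model on the same toy data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=+1.0, scale=0.6, size=(40, 2))
X_neg = rng.normal(loc=-1.0, scale=0.6, size=(40, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 40 + [-1] * 40)

# Route 1: regularization (ridge regression on the +/-1 labels).
ridge = Ridge(alpha=1.0).fit(X, y)              # alpha plays the role of lambda
print("ridge ||w|| =", np.linalg.norm(ridge.coef_))

# Route 2: linear SVM (a large C approximates the hard-margin, separable case).
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print("svm   ||w|| =", np.linalg.norm(svm.coef_[0]))
print("support vectors:", svm.support_.size, "of", len(X))
```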
31. Linear classifiers
- Rules of the form: weight vector w, threshold b
  - f(x) = sign( Σ_{i=1..n} w_i x_i + b )
  - i.e. f(x) = +1 if Σ_{i=1..n} w_i x_i + b > 0, and -1 otherwise
- If the L training examples (x_1, y_1), ..., (x_L, y_L), where x_i is a vector from R^n and y_i = +1 or -1, are linearly separable, there is an infinity of hyperplanes f separating the +1 from the -1.
- However, there is a unique f that defines the maximum-width corridor between the y_i = +1 and the y_i = -1 of the sample. Let 2d be the width of this optimal corridor.
32. VC dimension of "thick" hyperplanes
- Lemma: the VC dimension of hyperplanes defined by f = (w, b) with margin d ("thick" hyperplanes), and sample vectors x_i verifying ‖x_i‖ ≤ R, i = 1, ..., L, is bounded by
  VC dim ≤ R²/d² + 1
- The VC dimension of such a linear classifier does not necessarily depend on the number of attributes or the number of parameters!
33. Maximizing the margin d: relation to SRM
- The hypothesis space with minimal VC dimension according to SRM will be the hyperplane with maximum margin d.
- It will be entirely defined by the parts of the sample at minimal distance: the support vectors.
- The number of support vectors is neither L, nor n, nor h, the corresponding VC dimension.
- If the number of support vectors is large compared to L, the model may be beautiful in theory, but extremely costly to apply!
34. Computing the Optimal Hyperplane
- Training examples: (x_1, y_1), ..., (x_L, y_L), x_i from R^n, y_i = +1 or -1, i = 1, ..., L
- Requirement 1: zero training error
  - (y_i = -1) ⇒ ⟨w, x_i⟩ + b < 0
  - (y_i = +1) ⇒ ⟨w, x_i⟩ + b > 0
  - Hence in all cases: y_i(⟨w, x_i⟩ + b) > 0
- Requirement 2: maximum margin
  - Maximize d, with d = min_{i=1..L} |⟨w, x_i⟩ + b| / ‖w‖
- Requirements 1 + 2:
  - Maximize d with, for every i = 1, ..., L, y_i(⟨w, x_i⟩ + b)/‖w‖ ≥ d
35. Notions of Duality
- A large number of linear model optimizations (including ridge regression and SVMs) can be written in a dual space of dimension L: the space of the coefficients a.
- With y = f(x) = ⟨w, x⟩ + b, let us write w = Xᵀa, where a is a vector from R^L and X is the L×n matrix of the x_ij, i = 1, ..., L, j = 1, ..., n; component-wise, w_j = Σ_{i=1..L} a_i x_ij.
- It can be shown that for such models a solution y can be written using only the scalar products ⟨x_i, x_j⟩ and ⟨x_i, x⟩, and expressed as a linear combination of the y_i:
  f(x) = Σ_{i=1..L} a_i y_i ⟨x, x_i⟩ + b
36. SVM dual optimization problem
- By setting d = 1/‖w‖, the problem becomes:
  Minimize J(w, b) = ½‖w‖², with y_i(⟨w, x_i⟩ + b) ≥ 1
- The solution can be written as a linear combination of the training data:
  - w = Σ_{i=1..L} a_i y_i x_i, with a_i ≥ 0
  - b = -(1/2)(⟨w, x_pos⟩ + ⟨w, x_neg⟩)
- Dual optimization problem:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i=1..L} Σ_{j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with Σ_{i=1..L} a_i y_i = 0 and a_i ≥ 0, i = 1, ..., L
- This is a positive semi-definite quadratic program.
37. SVM primal and dual equivalences
- Theorem: the primal OP and the dual OP have the same solution.
- Given the solution a of the dual OP,
  w = Σ_{i=1..L} a_i y_i x_i and b = -(1/2)(⟨w, x_pos⟩ + ⟨w, x_neg⟩)
  is the solution of the primal OP.
- Hence the learning result (SVM classifier) can be represented in two alternative ways (see the sketch below):
  - weight vector and threshold (w, b),
  - vector of "influences" of each sample data point: a_1, ..., a_L.
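A small check of this equivalence (an assumed example using scikit-learn, where dual_coef_ stores a_i·y_i for the support vectors only): the weight vector rebuilt from the dual "influences" matches the primal (w, b).

```python
# Verifying w = sum_i a_i y_i x_i on a trained linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 0.5, (30, 2)), rng.normal(-1, 0.5, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin

# Primal representation: (w, b)
w_primal, b = clf.coef_[0], clf.intercept_[0]

# Dual representation: influences a_i * y_i on the support vectors only
w_dual = clf.dual_coef_[0] @ clf.support_vectors_

print(np.allclose(w_primal, w_dual))             # True: both representations agree
print("support vectors:", len(clf.support_))     # typically far fewer than L
```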
38. Properties of the SVM Dual OP
- Dual optimization problem:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i=1..L} Σ_{j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with Σ_{i=1..L} a_i y_i = 0 and a_i ≥ 0, i = 1, ..., L
- There is a single solution (i.e. (w, b) is unique).
- There is one factor a_i for each training example:
  - it describes the influence of training example i on the result;
  - a_i > 0 ⇒ the training example is a support vector;
  - a_i = 0 otherwise.
- The solution depends exclusively on inner products between samples.
39. SVM: the ugly case of non-separable training samples
- For some training samples there is no separating hyperplane.
- Complete separation is suboptimal for many training samples (e.g. a single "-1" close to the cloud of "+1", all the other "-1" far away).
- There is hence a need for a trade-off between margin size (robustness) and training error.
40. Soft Margin: Sub-Optimal Example
41. SVM Soft-Margin Separation
- Same idea as regularization: maximize the margin and minimize the training error simultaneously.
- Hard Margin:
  Minimize J(w, b) = (1/2)‖w‖²
  with constraints y_i(⟨w, x_i⟩ + b) ≥ 1, i = 1, ..., L
- Soft Margin:
  Minimize J(w, b, ξ) = (1/2)‖w‖² + C Σ_{i=1..L} ξ_i
  with constraints y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i and ξ_i ≥ 0, i = 1, ..., L
  - Σ_{i=1..L} ξ_i is an upper bound on the number of training errors.
  - C is a parameter that controls the trade-off between margin and error.
- Dual optimization problem for the soft margin:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i,j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with constraints Σ_{i=1..L} a_i y_i = 0 and 0 ≤ a_i ≤ C, i = 1, ..., L
42. Properties of the Soft-Margin Dual OP
- Dual OP:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i,j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with constraints Σ_{i=1..L} a_i y_i = 0 and 0 ≤ a_i ≤ C, i = 1, ..., L
- Single solution (i.e. (w, b) is unique).
- One factor a_i for each training example:
  - the influence of a single training example is limited by C;
  - 0 < a_i < C ⇒ support vector with ξ_i = 0;
  - a_i = C ⇒ support vector with ξ_i > 0;
  - a_i = 0 otherwise.
- Results are based exclusively on inner products between training examples.
43. Soft Margin: Support vectors
[Figure: soft-margin separation showing support vectors with slacks ξ_i and ξ_j] (a small sketch of the C trade-off follows)
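A minimal sketch of the C trade-off (an assumed example using scikit-learn, not KXEN's SVM): on overlapping classes, a small C tolerates training errors and keeps ‖w‖ small (a larger margin d = 1/‖w‖), while a large C approaches the hard-margin behaviour.

```python
# Sweeping the soft-margin parameter C: margin vs. training error.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+1, 1.0, (100, 2)), rng.normal(-1, 1.0, (100, 2))])
y = np.array([+1] * 100 + [-1] * 100)            # overlapping classes: not separable

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])  # d = 1 / ||w||
    train_err = np.mean(clf.predict(X) != y)
    print(f"C={C:>6}: margin={margin:.2f}, training error={train_err:.2%}, "
          f"support vectors={len(clf.support_)}")
```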
44. Strategies towards non-linear problems (1/4)
- Notion of "feature space" (Vapnik's extended space for attributes): a feature space is a manifold in which one tries to embed, through a homeomorphism, the original attributes:
  x = (x_1, ..., x_n) -> Φ(x) = (φ_1(x), ..., φ_N(x))
- By doing so, one tries, in the new manifold, to build models with a linear approach on the new attributes φ_p:
  y = f(x) = Σ_{p=1..N} w_p φ_p(x) + b
45. Strategies towards non-linear problems (2/4)
- The dual representation then allows us to express the generalized linear model thus obtained in the following way:
  y = Σ_{i=1..L} a_i y_i ⟨Φ(x_i), Φ(x)⟩ + b
- The idea of Reproducing Kernel Hilbert Spaces (RKHS) is to express the non-linearity in an indirect way through the use of a function K, satisfying a certain number of criteria (Mercer), which defines the extended feature space geometry through its inner product:
  K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
- Our model becomes: y = Σ_{i=1..L} a_i y_i K(x_i, x) + b
46. Strategies towards non-linear problems (3/4)
- There are many examples of Mercer kernels K, such as:
  - Linear: K(x_i, x_j) = ⟨x_i, x_j⟩
  - Polynomial: K(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^d
  - Radial basis functions: K(x_i, x_j) = exp(-γ‖x_i - x_j‖²)
  - Sigmoid kernels: K(x_i, x_j) = tanh(γ⟨x_i, x_j⟩ + c)
- The dual approach and kernels allow us to build an important class of robust models for non-linear problems, where the number of attributes can be huge (thousands, millions, even infinite...). In dual space one works with a space of finite dimension L. Such models belong to the family of so-called generalized linear models (a kernel SVM sketch follows).
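A minimal kernel-SVM sketch (an assumed example using scikit-learn, not KXEN's KSVM): an RBF kernel separates two concentric rings that no hyperplane in the original attributes can separate, and the decision value is rebuilt from the dual form f(x) = Σ a_i y_i K(x_i, x) + b.

```python
# Kernelized soft-margin SVM on a non-linearly-separable problem.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Two concentric rings: no separating hyperplane exists in R^2.
rng = np.random.default_rng(4)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([+1] * 100 + [-1] * 100)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
print("training error:", np.mean(clf.predict(X) != y))

# Rebuild the decision value from the dual representation.
x_new = np.array([[0.0, 1.0]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)     # K(x_i, x)
f = clf.dual_coef_[0] @ K[:, 0] + clf.intercept_[0]          # sum_i a_i y_i K(x_i, x) + b
print(np.sign(f) == clf.predict(x_new)[0])                   # True
```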
47. Strategies towards non-linear problems (4/4)
- Induced non-linearity: Mercer's theorem allows us to express a kernel K in the following form:
  K(x_1, x_2) = Σ_{i=1,2,...} λ_i φ_i(x_1) φ_i(x_2)
- K here defines an inner product in the extended feature space.
- A generalized linear model then has a representation in the original attribute space:
  y = Σ_{i=1,2,...} λ_i w_i φ_i(x) + b, where w = Σ_{m=1..L} a_m y_m Φ(x_m)
48. Soft-Margin SVM with Kernels
- Training optimization problem in dual space:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i,j=1..L} a_i a_j y_i y_j K(x_i, x_j)
  with constraints Σ_{i=1..L} a_i y_i = 0 and 0 ≤ a_i ≤ C, i = 1, ..., L
- Classification model for a new example x:
  f(x) = sign( Σ_{x_i ∈ SV set} a_i y_i K(x_i, x) + b )
49. When do SVMs Work?
- If:
  - the training error on the sample is on average low,
  - and the margin d / R on the sample is on average large,
- Then:
  - the SVM learns a classification rule with a low error rate with high probability (worst case);
  - the SVM learns classification rules that have a low error rate on average;
  - the SVM learns a classification rule for which the (leave-one-out) estimated error rate is low.
50. Conclusion (1/2)
- Vapnik's theory allows one to build a new vision of the notion of robustness, with a set of theorems of the "Kolmogorov type", which means "whatever the underlying probabilistic laws of the sample data".
- Building a model becomes, under this vision, negotiating a trade-off (Friedman) between an excellent fit and proper robustness. Cross-validation (a tool also used in SVMs to fine-tune the constant C for soft-margin SVM models) replaces here the tests used in the Fisher approach.
51. Conclusion (2/2)
- In a first phase of their work, the statistician is freed by SRM (and K2C!) from a tedious and time-expensive task: fine-tuning and testing the data's probabilistic laws.
- Linear models can be controlled efficiently in robustness. The two roads to a model are Regularization (e.g. Ridge Regression, K2R) and Support Vector Machines (SVM and KSVM).
- Reproducing Kernel Hilbert Space (RKHS) theory, together with Vapnik's vision of linear models, opens a major way to the build-up of efficient non-linear models: generalized linear models.
52. Agenda
- Company positioning: 15 mins
- SRM / SVM Theory: 45 mins
- KXEN Analytic Framework (Demo): 15 mins
53. The Business Value of Analytics
[Chart: business value (low to high) versus usability (hard to use to easy to use), positioning Query and Reporting, OLAP, and Traditional Data Mining]
54. KXEN Analytic Framework 2.2
[Architecture diagram, including: Data Access, C API, Consistent Coder (K2C)]
55. DEMO: American Census
Business Case: identify and target the persons in my database who are earning more than 50K.
- Available Information
  - American Census
  - International benchmark
- Dataset
  - 15 variables
  - 48,000 records
  - Mix of text and numerical data
56. Enterprise and Production Modeling
- Enterprise
  - Planning
  - KPIs
  - ROI
  - Setting Policy
57. Value Added for Professionals
Each group has different goals and
constraints. KXEN breaks down the walls between
the departments.
58. THANKS FOR YOUR TIME