Title: VC Dimension
1. VC Dimension: Direct Robustness Control in Statistical Learning Theory
Michel Béra, co-Founder and Chief Scientific Officer
21/10/2002
2. Agenda
- Company positioning: 15 mins
- SRM / SVM Theory: 45 mins
- KXEN Analytic Framework (Demo): 15 mins
3. Company Background
- Founded in July 1998
- Delaware corporation
- Headquartered in San Francisco
- Operations in U.S. and Europe
- Strong Executive Team
- R. Haddad (CEO), E. Marcade (CTO), M. Bera (CSO)
- Active Scientific Committee
- Includes Gregory Piatetsky-Shapiro (founder of SIGKDD), Lee Giles (Penn State Professor, formerly with NEC), Gilbert Saporta (French Statistical Society President), Yann Le Cun (NEC, manager of V. Vapnik), Léon Bottou (NEC), Bernhard Schoelkopf (Max-Planck-Institut)
4. Go-To-Market Strategy
By embedding KXEN into major applications and partnering with leading SIs (systems integrators), KXEN is on the way to becoming the de facto standard for Predictive Analytics.
5. KXEN Value Proposition
[Diagram: the traditional data mining approach, where the business user poses a business question, then waits about 3 weeks of analyst time while the analyst prepares data, builds the model, tests the model and interprets the results before the business user can understand and apply them, at a cost per model of about 30,000]
"Our company builds hundreds of predictive models in the same time we used to build one. KXEN allows us to save millions of dollars with more effective campaigns." (Financial Industry Customer)
6. What is a good model?
[Chart contrasting model quality and robustness: low quality / high robustness, low robustness, and the desired robust model]
7. Agenda
- Company positioning: 15 mins
- SRM / SVM Theory: 45 mins
- KXEN Analytic Framework (Demo): 15 mins
8. VC dimension: definition (1)
- Let us consider a sample (x_1, ..., x_L) from R^n.
- There are 2^L different ways to separate the sample into two sub-samples.
- A set S of functions f(x, w) shatters the sample if all 2^L separations can be realized by functions f(x, w) from the family S.
9. VC dimension: definition (2)
- A function family S has VC dimension h (an integer) if:
  - 1) there is at least one sample of h vectors from R^n that can be shattered by S, and
  - 2) no sample of h+1 vectors from R^n can be shattered by S (a small empirical shattering check is sketched below).
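To make the shattering definition concrete, here is a minimal sketch (not from the slides; it assumes Python with NumPy and scikit-learn) that empirically checks whether a set of points in R^2 is shattered by linear classifiers f(x) = sign(⟨w, x⟩ + b), by trying to separate every one of the 2^L labelings with a high-C linear SVM as a separability heuristic:

```python
# Heuristic shattering check for linear classifiers in R^2 (assumed example).
from itertools import product

import numpy as np
from sklearn.svm import LinearSVC


def is_shattered_by_lines(points: np.ndarray) -> bool:
    """Return True if every +/-1 labeling of `points` is linearly separable."""
    L = len(points)
    for labels in product([-1, 1], repeat=L):
        if len(set(labels)) == 1:
            continue  # a constant labeling is trivially realizable (w = 0, sign of b)
        y = np.array(labels)
        clf = LinearSVC(C=1e6, max_iter=100_000).fit(points, y)
        if (clf.predict(points) != y).any():
            return False  # at least one labeling cannot be separated
    return True


three_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four_points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(is_shattered_by_lines(three_points))  # True: 3 points in general position
print(is_shattered_by_lines(four_points))   # False: the XOR labeling fails
```

This matches the statement of the next slides: the VC dimension of lines (hyperplanes of R^2) is 3 = n + 1.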
10. Example: VC dimension
- The VC dimension:
  - measures the complexity of a solution (function family);
  - is not directly related to the number of variables.
11. Other examples
- The VC dimension of hyperplanes of R^n is n+1.
- The VC dimension of the set of functions f(x, w) = sign(sin(w·x)), c < x < 1, c > 0, where w is a free parameter, is infinite.
- VC dimensions are therefore not always equal to the number of parameters that define a given family S of functions.
12. Key example: linear models y = ⟨w, x⟩ + b
- The VC dimension of the family S of linear models y = ⟨w, x⟩ + b, with ‖w‖ ≤ C (and inputs bounded by ‖x‖ ≤ R), depends on C and can take any value between 0 and n.
13. VC dimension: interpretation
- The VC dimension of S is an integer that measures the dispersion, or separating power (complexity), of the function family S.
- We shall now show that the VC dimension (through a major theorem from Vapnik) gives a powerful indication of model robustness.
14. Learning Theory Problem (1)
- A model computes a function f(X, w).
- Problem: minimize in w the Risk Expectation R(w) = ∫ Q(z, w) dP(z), where
  - w is a parameter that specifies the chosen model,
  - z = (X, y) are possible values for the attributes (variables),
  - Q measures (quantifies) the model error cost,
  - P(z) is the (unknown) underlying probability law for the data z.
15. Learning Theory Problem (2)
- We get L data points from the learning sample (z_1, ..., z_L), which we suppose are sampled i.i.d. from the law P(z).
- To minimize R(w), we start by minimizing the Empirical Risk over this sample: E(w) = (1/L) Σ_{i=1..L} Q(z_i, w) (a small sketch follows).
- We shall use such an approach for:
  - classification (e.g. Q can be a cost function based on the cost of misclassified points),
  - regression (e.g. Q can be a cost of least-squares type).
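As a small illustration (an assumed example in Python/NumPy, not part of the original deck), the empirical risk for the two choices of cost Q mentioned above:

```python
# Empirical risk E(w) = (1/L) * sum_i Q(z_i, w) for two common costs Q.
import numpy as np


def empirical_risk_classification(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """0/1 cost: Q(z, w) = 1 if the point is misclassified, 0 otherwise."""
    return float(np.mean(y_true != y_pred))


def empirical_risk_regression(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Least-squares cost: Q(z, w) = (y - f(x, w))^2."""
    return float(np.mean((y_true - y_pred) ** 2))


y = np.array([1, -1, 1, 1])
y_hat = np.array([1, 1, 1, -1])
print(empirical_risk_classification(y, y_hat))  # 0.5: two of four points misclassified
```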
16. Learning Theory Problem (3)
- Central problems for Statistical Learning Theory:
  - What is the relation between the Risk Expectation R(w) and the Empirical Risk E(w)?
  - How do we define and measure a generalization capacity (robustness) for a model?
17. Four Pillars of SLT (1 and 2)
- Consistency (guarantees generalization)
  - Under what conditions will a model be consistent?
- Model convergence speed (a measure of generalization)
  - How does generalization capacity improve when the sample size L grows?
18. Four Pillars of SLT (3 and 4)
- Generalization capacity control
  - How can we efficiently control model generalization, starting from the only information we have: our sample data?
- A strategy for good learning algorithms
  - Is there a strategy that guarantees, measures and controls the generalization capacity of our learning model?
19. Consistency: definition
- A learning process (model) is said to be consistent if the model error, measured on new data sampled from the same underlying probability law as our original sample, converges, as the original sample size increases, towards the model error measured on the original sample.
20. Consistent training?
[Figure: two plots of error versus number of training examples, each showing a test-error curve and a training-error curve, contrasting consistent and inconsistent training]
21. Vapnik's main theorem
- Q: Under which conditions will a learning process (model) be consistent?
- A: A model will be consistent if and only if the function f that defines the model comes from a family of functions S with finite VC dimension h.
- A finite VC dimension h not only guarantees generalization capacity (consistency); picking f in a family S with finite VC dimension h is the only way to build a model that generalizes.
22. Model convergence speed (generalization capacity)
- Q: What is the nature of the model error difference between learning data (sample) and test data, for a sample of finite size L?
- A: This difference is no greater than a bound that depends only on the ratio between the VC dimension h of the model function family S and the sample size L, i.e. on h/L.
- This statement is a theorem of the Kolmogorov-Smirnov type, i.e. a theorem that does not depend on the data's underlying probability law.
23. Model convergence speed
[Figure: convergence of the error difference as a function of the sample size L]
24. Empirical Risk Minimization
- With probability 1-q, the following inequality holds (the bound is sketched below):
  R(w_0) ≤ E(w_0) + Confidence Interval,
  where w_0 is the value of the parameter w that minimizes the Empirical Risk.
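The exact formula on this slide did not survive extraction. For reference, the classical Vapnik bound it presumably refers to (stated here as an assumption about the exact form used) says that, with probability at least 1 - q, for every function of a family S with VC dimension h:

\[
R(w) \;\le\; E(w) \;+\; \sqrt{\frac{h\left(\ln\frac{2L}{h} + 1\right) - \ln\frac{q}{4}}{L}}
\]

Applied at w = w_0, the minimizer of the Empirical Risk, this is presumably the inequality the slide invokes; the square-root term is the "Confidence Interval" of the next slides.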
25. SRM methodology: how to control model generalization capacity
- Risk Expectation ≤ Empirical Risk + Confidence Interval
- Minimizing the Empirical Risk alone will not always give a good generalization capacity: one wants to minimize the sum of the Empirical Risk and the Confidence Interval.
- What matters is not the numerical value of Vapnik's bound, most often too large to be of any practical use; it is the fact that this bound is a non-decreasing function of the richness of the model function family.
26. SRM strategy (1)
- With probability 1-q, R(w) ≤ E(w) + Confidence Interval(h, L, q) (the bound above).
- When L/h is small (h too large), the second term of the bound becomes large.
- The basic idea of the SRM strategy is to minimize simultaneously both terms on the right-hand side of the above bound on R(w).
- To do this, one has to make h a controlled parameter.
27. SRM strategy (2)
- Let us consider a nested sequence S_1 ⊂ S_2 ⊂ ... ⊂ S_n of model function families, with respective growing VC dimensions h_1 < h_2 < ... < h_n.
- For each family S_i of our sequence, the inequality R(w) ≤ E(w) + Confidence Interval(h_i, L, q) is valid.
28. SRM strategy (3)
SRM: find the index i such that the bound on the expected risk R(w) becomes minimal, for the specific h = h_i of the corresponding family S_i of our sequence, then build the model using f from S_i (a toy sketch of this selection follows).
[Figure: risk as a function of model complexity]
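A toy sketch of the SRM selection loop (an assumed example in Python/NumPy, not KXEN's implementation): the nested families are polynomial models of growing degree, and the "confidence interval" term reuses the classical VC bound with h approximated by the number of free parameters. Applying that classification-style term to a squared loss is a simplification made only to illustrate the mechanics.

```python
# SRM-style model selection over a nested sequence of polynomial families.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=60)
y = np.sin(3 * x) + 0.2 * rng.normal(size=x.size)   # noisy target
L, q = x.size, 0.05


def vc_confidence(h: float, L: int, q: float) -> float:
    """Classical VC confidence term (assumed form of the bound on the slides)."""
    return np.sqrt((h * (np.log(2 * L / h) + 1) - np.log(q / 4)) / L)


best = None
for degree in range(1, 10):                          # S_1 ⊂ S_2 ⊂ ... (growing h)
    coefs = np.polyfit(x, y, degree)
    emp_risk = np.mean((np.polyval(coefs, x) - y) ** 2)
    bound = emp_risk + vc_confidence(degree + 1, L, q)  # empirical risk + confidence
    if best is None or bound < best[1]:
        best = (degree, bound)
print("SRM-selected degree:", best[0])
```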
29. Putting SRM into action: the linear models case (1)
- There are many SRM-based strategies to build models.
- In the case of linear models y = ⟨w, x⟩ + b, one wants to make ‖w‖ a controlled parameter: let us call S_C the linear model function family satisfying the constraint ‖w‖ ≤ C.
- Vapnik's major theorem: when C decreases, h(S_C) decreases (for inputs bounded by ‖x‖ ≤ R).
30. Putting SRM into action: the linear models case (2)
- To control ‖w‖, one can envision two routes to a model (a short sketch of both routes follows after this slide):
  - Regularization / Ridge Regression, i.e. minimize over w and b:
    RG(w, b) = Σ_{i=1..L} (y_i - ⟨w, x_i⟩ - b)² + λ‖w‖²
  - Support Vector Machines (SVM), i.e. solve directly an optimization problem (below, the classification SVM with separable data):
    minimize ‖w‖², with y_i = ±1 and y_i(⟨w, x_i⟩ + b) ≥ 1 for all i = 1, ..., L
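A short sketch of the two routes (an assumed example using NumPy and scikit-learn; KXEN's own K2R and KSVM components are not shown): ridge regression shrinks ‖w‖ through the penalty λ‖w‖², while a linear SVM controls ‖w‖ by maximizing the margin.

```python
# Two routes to a controlled-norm linear model on the same toy data.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X_pos = rng.normal(loc=+1.0, scale=0.6, size=(40, 2))
X_neg = rng.normal(loc=-1.0, scale=0.6, size=(40, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([+1] * 40 + [-1] * 40)

# Route 1: regularization (ridge regression on the +/-1 labels).
ridge = Ridge(alpha=1.0).fit(X, y)              # alpha plays the role of lambda
print("ridge ||w|| =", np.linalg.norm(ridge.coef_))

# Route 2: linear SVM (a large C approximates the hard-margin, separable case).
svm = SVC(kernel="linear", C=1e6).fit(X, y)
print("svm   ||w|| =", np.linalg.norm(svm.coef_[0]))
print("support vectors:", svm.support_.size, "of", len(X))
```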
31. Linear classifiers
- Rules of the form: weight vector w, threshold b
  - f(x) = sign( Σ_{i=1..n} w_i x_i + b )
  - i.e. f(x) = +1 if Σ_{i=1..n} w_i x_i + b > 0, and -1 otherwise
- If the L training examples (x_1, y_1), ..., (x_L, y_L), where x_i is a vector from R^n and y_i = +1 or -1, are linearly separable, there is an infinity of hyperplanes f separating the +1 from the -1.
- However, there is a unique f that defines the maximum-width corridor between the y_i = +1 and the y_i = -1 of the sample. Let 2d be the width of this optimal corridor.
32. VC dimension of "thick" hyperplanes
- Lemma: the VC dimension of hyperplanes defined by f = (w, b) with margin d ("thick" hyperplanes), and sample vectors x_i verifying ‖x_i‖ ≤ R, i = 1, ..., L, is bounded by
  VC dim ≤ R²/d² + 1
- The VC dimension of such a linear classifier does not necessarily depend on the number of attributes or the number of parameters!
33. Maximizing the margin d: relation to SRM
- The hypothesis space with minimal VC dimension according to SRM will be the hyperplane with maximum margin d.
- It will be entirely defined by the parts of the sample at minimal distance: the support vectors.
- The number of support vectors is neither L, nor n, nor h, the corresponding VC dimension.
- If the number of support vectors is large compared to L, the model may be beautiful in theory, but extremely costly to apply!
34. Computing the Optimal Hyperplane
- Training examples: (x_1, y_1), ..., (x_L, y_L), x_i from R^n, y_i = +1 or -1, i = 1, ..., L
- Requirement 1: zero training error
  - (y_i = -1) ⇒ ⟨w, x_i⟩ + b < 0
  - (y_i = +1) ⇒ ⟨w, x_i⟩ + b > 0
  - Hence in all cases: y_i(⟨w, x_i⟩ + b) > 0
- Requirement 2: maximum margin
  - Maximize d, with d = min_{i=1..L} |⟨w, x_i⟩ + b| / ‖w‖
- Requirements 1 + 2:
  - Maximize d with, for every i = 1, ..., L, y_i(⟨w, x_i⟩ + b)/‖w‖ ≥ d
35. Notions of Duality
- A large number of linear model optimizations (including ridge regression and SVMs) can be written in a dual space of dimension L: the space of the coefficients a.
- With y = f(x) = ⟨w, x⟩ + b, let us write w = Xᵀa, where a is a vector from R^L and X is the L×n matrix of the x_ij, i = 1, ..., L, j = 1, ..., n; component-wise, w_j = Σ_{i=1..L} a_i x_ij.
- It can be shown that for such models a solution y can be written using only the scalar products ⟨x_i, x_j⟩ and ⟨x_i, x⟩, and expressed as a linear combination of the y_i:
  f(x) = Σ_{i=1..L} a_i y_i ⟨x, x_i⟩ + b
36. SVM dual optimization problem
- By setting d = 1/‖w‖, the problem becomes:
  Minimize J(w, b) = ½‖w‖², with y_i(⟨w, x_i⟩ + b) ≥ 1
- The solution can be written as a linear combination of the training data:
  - w = Σ_{i=1..L} a_i y_i x_i, with a_i ≥ 0
  - b = -(1/2)(⟨w, x_pos⟩ + ⟨w, x_neg⟩)
- Dual optimization problem:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i=1..L} Σ_{j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with Σ_{i=1..L} a_i y_i = 0 and a_i ≥ 0, i = 1, ..., L
- This is a positive semi-definite quadratic program.
37. SVM primal and dual equivalences
- Theorem: the primal OP and the dual OP have the same solution.
- Given the solution a of the dual OP,
  w = Σ_{i=1..L} a_i y_i x_i and b = -(1/2)(⟨w, x_pos⟩ + ⟨w, x_neg⟩)
  is the solution of the primal OP.
- Hence the learning result (SVM classifier) can be represented in two alternative ways (see the sketch below):
  - weight vector and threshold (w, b),
  - vector of "influences" of each sample data point: a_1, ..., a_L.
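A small check of this equivalence (an assumed example using scikit-learn, where dual_coef_ stores a_i·y_i for the support vectors only): the weight vector rebuilt from the dual "influences" matches the primal (w, b).

```python
# Verifying w = sum_i a_i y_i x_i on a trained linear SVM.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1, 0.5, (30, 2)), rng.normal(-1, 0.5, (30, 2))])
y = np.array([+1] * 30 + [-1] * 30)

clf = SVC(kernel="linear", C=1e6).fit(X, y)      # large C ~ hard margin

# Primal representation: (w, b)
w_primal, b = clf.coef_[0], clf.intercept_[0]

# Dual representation: influences a_i * y_i on the support vectors only
w_dual = clf.dual_coef_[0] @ clf.support_vectors_

print(np.allclose(w_primal, w_dual))             # True: both representations agree
print("support vectors:", len(clf.support_))     # typically far fewer than L
```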
38. Properties of the SVM Dual OP
- Dual optimization problem:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i=1..L} Σ_{j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with Σ_{i=1..L} a_i y_i = 0 and a_i ≥ 0, i = 1, ..., L
- There is a single solution (i.e. (w, b) is unique).
- There is one factor a_i for each training example:
  - it describes the influence of training example i on the result;
  - a_i > 0 ⇒ the training example is a support vector;
  - a_i = 0 otherwise.
- The solution depends exclusively on inner products between samples.
39. SVM: the ugly case of non-separable training samples
- For some training samples there is no separating hyperplane.
- Complete separation is suboptimal for many training samples (e.g. a single "-1" close to the cloud of "+1", all the other "-1" far away).
- There is hence a need for a trade-off between margin size (robustness) and training error.
40. Soft Margin: Sub-Optimal Example
41. SVM Soft-Margin Separation
- Same idea as regularization: maximize the margin and minimize the training error simultaneously.
- Hard Margin:
  Minimize J(w, b) = (1/2)‖w‖²
  with constraints y_i(⟨w, x_i⟩ + b) ≥ 1, i = 1, ..., L
- Soft Margin:
  Minimize J(w, b, ξ) = (1/2)‖w‖² + C Σ_{i=1..L} ξ_i
  with constraints y_i(⟨w, x_i⟩ + b) ≥ 1 - ξ_i and ξ_i ≥ 0, i = 1, ..., L
  - Σ_{i=1..L} ξ_i is an upper bound on the number of training errors.
  - C is a parameter that controls the trade-off between margin and error.
- Dual optimization problem for the soft margin:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i,j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with constraints Σ_{i=1..L} a_i y_i = 0 and 0 ≤ a_i ≤ C, i = 1, ..., L
42. Properties of the Soft-Margin Dual OP
- Dual OP:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i,j=1..L} a_i a_j y_i y_j ⟨x_i, x_j⟩
  with constraints Σ_{i=1..L} a_i y_i = 0 and 0 ≤ a_i ≤ C, i = 1, ..., L
- Single solution (i.e. (w, b) is unique).
- One factor a_i for each training example:
  - the influence of a single training example is limited by C;
  - 0 < a_i < C ⇒ support vector with ξ_i = 0;
  - a_i = C ⇒ support vector with ξ_i > 0;
  - a_i = 0 otherwise.
- Results are based exclusively on inner products between training examples.
43. Soft Margin: Support vectors
[Figure: soft-margin separation showing support vectors with slacks ξ_i and ξ_j] (a small sketch of the C trade-off follows)
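A minimal sketch of the C trade-off (an assumed example using scikit-learn, not KXEN's SVM): on overlapping classes, a small C tolerates training errors and keeps ‖w‖ small (a larger margin d = 1/‖w‖), while a large C approaches the hard-margin behaviour.

```python
# Sweeping the soft-margin parameter C: margin vs. training error.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(+1, 1.0, (100, 2)), rng.normal(-1, 1.0, (100, 2))])
y = np.array([+1] * 100 + [-1] * 100)            # overlapping classes: not separable

for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    margin = 1.0 / np.linalg.norm(clf.coef_[0])  # d = 1 / ||w||
    train_err = np.mean(clf.predict(X) != y)
    print(f"C={C:>6}: margin={margin:.2f}, training error={train_err:.2%}, "
          f"support vectors={len(clf.support_)}")
```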
44. Strategies towards non-linear problems (1/4)
- Notion of "feature space" (Vapnik's extended space for attributes): a feature space is a manifold in which one tries to embed, through a homeomorphism, the original attributes:
  x = (x_1, ..., x_n) -> Φ(x) = (φ_1(x), ..., φ_N(x))
- By doing so, one tries, in the new manifold, to build models with a linear approach on the new attributes φ_p:
  y = f(x) = Σ_{p=1..N} w_p φ_p(x) + b
45. Strategies towards non-linear problems (2/4)
- The dual representation then allows us to express the generalized linear model thus obtained in the following way:
  y = Σ_{i=1..L} a_i y_i ⟨Φ(x_i), Φ(x)⟩ + b
- The idea of Reproducing Kernel Hilbert Spaces (RKHS) is to express the non-linearity in an indirect way through the use of a function K, satisfying a certain number of criteria (Mercer), which defines the extended feature space geometry through its inner product:
  K(x_i, x_j) = ⟨Φ(x_i), Φ(x_j)⟩
- Our model becomes: y = Σ_{i=1..L} a_i y_i K(x_i, x) + b
46. Strategies towards non-linear problems (3/4)
- There are many examples of Mercer kernels K, such as:
  - Linear: K(x_i, x_j) = ⟨x_i, x_j⟩
  - Polynomial: K(x_i, x_j) = (⟨x_i, x_j⟩ + 1)^d
  - Radial basis functions: K(x_i, x_j) = exp(-γ‖x_i - x_j‖²)
  - Sigmoid kernels: K(x_i, x_j) = tanh(γ⟨x_i, x_j⟩ + c)
- The dual approach and kernels allow us to build an important class of robust models for non-linear problems, where the number of attributes can be huge (thousands, millions, even infinite...). In dual space one works with a space of finite dimension L. Such models belong to the family of so-called generalized linear models (a kernel SVM sketch follows).
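A minimal kernel-SVM sketch (an assumed example using scikit-learn, not KXEN's KSVM): an RBF kernel separates two concentric rings that no hyperplane in the original attributes can separate, and the decision value is rebuilt from the dual form f(x) = Σ a_i y_i K(x_i, x) + b.

```python
# Kernelized soft-margin SVM on a non-linearly-separable problem.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

# Two concentric rings: no separating hyperplane exists in R^2.
rng = np.random.default_rng(4)
angles = rng.uniform(0, 2 * np.pi, 200)
radii = np.concatenate([rng.normal(1.0, 0.1, 100), rng.normal(3.0, 0.1, 100)])
X = np.column_stack([radii * np.cos(angles), radii * np.sin(angles)])
y = np.array([+1] * 100 + [-1] * 100)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)
print("training error:", np.mean(clf.predict(X) != y))

# Rebuild the decision value from the dual representation.
x_new = np.array([[0.0, 1.0]])
K = rbf_kernel(clf.support_vectors_, x_new, gamma=gamma)     # K(x_i, x)
f = clf.dual_coef_[0] @ K[:, 0] + clf.intercept_[0]          # sum_i a_i y_i K(x_i, x) + b
print(np.sign(f) == clf.predict(x_new)[0])                   # True
```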
47. Strategies towards non-linear problems (4/4)
- Induced non-linearity: Mercer's theorem allows us to express a kernel K in the following form:
  K(x_1, x_2) = Σ_{i=1,2,...} λ_i φ_i(x_1) φ_i(x_2)
- K here defines an inner product in the extended feature space.
- A generalized linear model then has a representation in the original attribute space:
  y = Σ_{i=1,2,...} λ_i w_i φ_i(x) + b, where w = Σ_{m=1..L} a_m y_m Φ(x_m)
48. Soft-Margin SVM with Kernels
- Training optimization problem in dual space:
  Maximize L(a) = Σ_{i=1..L} a_i - (1/2) Σ_{i,j=1..L} a_i a_j y_i y_j K(x_i, x_j)
  with constraints Σ_{i=1..L} a_i y_i = 0 and 0 ≤ a_i ≤ C, i = 1, ..., L
- Classification model for a new example x:
  f(x) = sign( Σ_{x_i ∈ SV set} a_i y_i K(x_i, x) + b )
49. When do SVMs Work?
- If:
  - the training error on the sample is on average low,
  - and the margin d / R on the sample is on average large,
- Then:
  - the SVM learns a classification rule with a low error rate with high probability (worst case);
  - the SVM learns classification rules that have a low error rate on average;
  - the SVM learns a classification rule for which the (leave-one-out) estimated error rate is low.
50. Conclusion (1/2)
- Vapnik's theory allows one to build a new vision of the notion of robustness, with a set of theorems of the "Kolmogorov type", which means "whatever the underlying probabilistic laws of the sample data".
- Building a model becomes, under this vision, negotiating a trade-off (Friedman) between an excellent fit and proper robustness. Cross-validation (a tool also used in SVMs to fine-tune the constant C for soft-margin SVM models) replaces here the tests used in the Fisher approach.
51. Conclusion (2/2)
- In a first phase of their work, the statistician is freed by SRM (and K2C!) from a tedious and time-expensive task: fine-tuning and testing the data's probabilistic laws.
- Linear models can be controlled efficiently in robustness. The two roads to a model are Regularization (e.g. Ridge Regression, K2R) and Support Vector Machines (SVM and KSVM).
- Reproducing Kernel Hilbert Space (RKHS) theory, together with Vapnik's vision of linear models, opens a major way to the build-up of efficient non-linear models: generalized linear models.
52. Agenda
- Company positioning: 15 mins
- SRM / SVM Theory: 45 mins
- KXEN Analytic Framework (Demo): 15 mins
53. The Business Value of Analytics
[Chart: business value (low to high) versus usability (hard to use to easy to use), positioning Query and Reporting, OLAP, and Traditional Data Mining]
54. KXEN Analytic Framework 2.2
[Architecture diagram, including: Data Access, C API, Consistent Coder (K2C)]
55. DEMO: American Census
Business Case: identify and target the persons in my database who are earning more than 50K.
- Available Information
  - American Census
  - International benchmark
- Dataset
  - 15 variables
  - 48,000 records
  - Mix of text and numerical data
56. Enterprise and Production Modeling
- Enterprise
  - Planning
  - KPIs
  - ROI
  - Setting Policy
57. Value Added for Professionals
Each group has different goals and
constraints. KXEN breaks down the walls between
the departments.
58. THANKS FOR YOUR TIME