Title: Functional Analytic Approach to Model Selection
1. Functional Analytic Approach to Model Selection
- Masashi Sugiyama (sugi_at_cs.titech.ac.jp)
- Department of Computer Science,
- Tokyo Institute of Technology, Tokyo, Japan
2. Regression Problem
- Learning target function: f(x)
- Learned function: \hat{f}(x)
- Training examples: {(x_i, y_i)}_{i=1}^{n}, where y_i = f(x_i) + ε_i (ε_i: noise)
- From the training examples, obtain \hat{f}(x) such that it is as close to f(x) as possible.
3. Typical Method of Learning
- Linear regression model: \hat{f}(x) = Σ_{p=1}^{P} α_p φ_p(x)
  (α_p: parameters to be learned, φ_p(x): basis functions)
- Ridge regression: minimize Σ_{i=1}^{n} (\hat{f}(x_i) - y_i)^2 + λ Σ_{p=1}^{P} α_p^2
  (λ: ridge parameter)
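A minimal NumPy sketch of this setup, assuming Gaussian basis functions and synthetic data (the basis choice, width, and data are illustrative assumptions, not from the slides):

```python
import numpy as np

def design_matrix(x, centers, width=0.5):
    """Gaussian basis functions phi_p(x) evaluated at the inputs x."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

def ridge_fit(x, y, centers, lam):
    """Ridge regression: alpha = (Phi^T Phi + lam I)^{-1} Phi^T y."""
    Phi = design_matrix(x, centers)
    P = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(P), Phi.T @ y)

def ridge_predict(x, centers, alpha):
    """Evaluate the learned function at new inputs."""
    return design_matrix(x, centers) @ alpha

# Illustrative data: noisy samples of an (unknown) sinc-like target
rng = np.random.default_rng(0)
x_train = rng.uniform(-3, 3, size=50)
y_train = np.sinc(x_train) + 0.1 * rng.standard_normal(50)
centers = np.linspace(-3, 3, 10)
alpha = ridge_fit(x_train, y_train, centers, lam=0.1)
```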
4. Model Selection
[Figure: target function vs. learned function for three models: too simple, appropriate, too complex]
- The choice of the model heavily affects the learned function.
- (Here "model" refers to, e.g., the ridge parameter λ.)
5. Ideal Model Selection
- Determine the model such that a certain generalization error (the "badness" of the learned function \hat{f}) is minimized.
6. Practical Model Selection
- However, the generalization error cannot be calculated directly since it includes the unknown learning target function f.
- (This does not apply to Bayesian model selection using the evidence.)
- Instead, determine the model such that an estimator of the generalization error is minimized.
- We want an accurate estimator. (A schematic selection loop is sketched below.)
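A schematic sketch of this selection loop; `fit` and `estimate_gen_error` are placeholders (e.g., ridge regression with SIC or cross-validation plugged in), and the candidate grid is an illustrative assumption:

```python
import numpy as np

def select_model(candidates, fit, estimate_gen_error):
    """Return the model (e.g., a ridge parameter) minimizing the error estimate."""
    best_model, best_score = None, np.inf
    for model in candidates:
        learned = fit(model)                         # train with this candidate model
        score = estimate_gen_error(learned, model)   # estimate its generalization error
        if score < best_score:
            best_model, best_score = model, score
    return best_model

# Example: candidate ridge parameters on a log grid
lambda_grid = np.logspace(-3, 2, 20)
```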
7. Two Approaches to Estimating the Generalization Error (1)
- Try to obtain unbiased estimators of the generalization error:
  - C_P (Mallows, 1973)
  - Cross-validation
  - Akaike Information Criterion (Akaike, 1974), etc.
- Interested in typical-case performance.
8. Two Approaches to Estimating the Generalization Error (2)
- Try to obtain probabilistic upper bounds on the generalization error:
  - VC-bound (Vapnik & Chervonenkis, 1974)
  - Span bound (Chapelle & Vapnik, 2000)
  - Concentration bound (Bousquet & Elisseeff, 2001), etc.
- The bound holds with probability 1 - δ.
- Interested in worst-case performance.
9. Popular Choices of Generalization Measure
- Risk: expected loss over the input distribution, e.g., the expected squared error
- Kullback-Leibler divergence between the target density and the learned density
10. Concerns in Existing Methods
- The approximations used often require a large (infinite) number of training examples for their justification (asymptotic approximation), so they do not work with small samples.
- The generalization measure has to be integrated over the input distribution from which the training examples are drawn, so they cannot be used for transduction (estimating the error at a point of interest).
11. Our Interests
- We are interested in:
  - Estimating the generalization error with accuracy guaranteed for small (finite) samples
  - Estimating the transduction error (the error at a point of interest)
  - Investigating the role of unlabeled samples (input samples without output values)
12. Our Generalization Measure
- Generalization measure: ||\hat{f} - f||^2, where || · || is the norm in a functional Hilbert space H
- We assume that the target function f belongs to H.
13. Generalization Measure in Functional Hilbert Space
- A functional Hilbert space is specified by:
  - a set of functions which span the space,
  - an inner product (and the induced norm).
- Given a set of functions, we can design the inner product (and therefore the generalization measure) as desired.
14. Examples of the Norm
- Weighted distance in the input domain (with a weight function over inputs)
- Weighted distance in the Fourier domain (with a weight function over frequencies)
- Sobolev norm (involving the k-th derivatives of f)
(Standard forms of these norms are written out below.)
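The formulas were left to the slide figures; a standard way to write the three norms (my notation, with weight functions w, \tilde{f} the Fourier transform of f, and f^{(k)} its k-th derivative) is:

\[
\|f\|^2 = \int |f(x)|^2\, w(x)\, dx, \qquad
\|f\|^2 = \int |\tilde{f}(\omega)|^2\, w(\omega)\, d\omega, \qquad
\|f\|^2 = \sum_{k=0}^{m} \int |f^{(k)}(x)|^2\, dx .
\]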
15. Interesting Features
- The weight function in the norm can be chosen according to the purpose:
  - When the weight is the input distribution of the test samples, unlabelled samples can be used for estimating the prediction error.
  - For transductive inference (given test input points), put the weight on those points.
  - For interpolation / extrapolation, put the weight on the desired input region.
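One concrete reading of these bullets (my notation: q is the test input density, \tilde{x}_j are unlabeled samples from q, and x_0 is a test point of interest):

\[
\|\hat f - f\|_w^2 = \int |\hat f(x) - f(x)|^2\, w(x)\, dx
\;\approx\; \frac{1}{m}\sum_{j=1}^{m} |\hat f(\tilde x_j) - f(\tilde x_j)|^2
\quad \text{for } w = q,
\]

so the weighted norm becomes the prediction error, with the integral replaced by an average over unlabeled samples; choosing w(x) = δ(x - x_0) instead makes it the transduction error |\hat f(x_0) - f(x_0)|^2 at the single test point x_0.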
16. Goal of My Talk
- I suppose that you like the generalization measure defined in the functional Hilbert space.
- The goal of my talk is to give a method for estimating this generalization error, ||\hat{f} - f||^2 (|| · ||: norm in the function space).
17. Function Spaces for Learning
- For further discussion, we have to specify the class of function spaces.
- We want the class to be as unrestrictive as possible.
- A general function space such as L_2 is not suitable for learning problems, because the value of a function at a single point is not specified in L_2:
  two functions may have different values at a point, but they are treated as the same function in L_2 if they differ only on a set of measure zero.
18. Reproducing Kernel Hilbert Space
- A function space that is rather general and in which the value of a function at each point is specified is the reproducing kernel Hilbert space (RKHS).
- An RKHS H has a reproducing kernel K(x, x'):
  - For any fixed x', K(·, x') is a function of x in H.
  - For any function f in H and any x', ⟨f, K(·, x')⟩ = f(x'), where ⟨·, ·⟩ is the inner product in the function space.
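A small sketch of one concrete reproducing kernel, the Gaussian kernel used later in the talk (Simulation 2); the width value is an assumption:

```python
import numpy as np

def gaussian_kernel(x, x_prime, width=1.0):
    """Reproducing kernel of a Gaussian RKHS: K(x, x') = exp(-(x - x')^2 / (2 c^2))."""
    x, x_prime = np.asarray(x, dtype=float), np.asarray(x_prime, dtype=float)
    return np.exp(-(x - x_prime) ** 2 / (2.0 * width ** 2))

# Functions of the form f(x) = sum_i theta_i K(x, x_i) belong to this RKHS.
def kernel_expansion(x, centers, theta, width=1.0):
    return sum(t * gaussian_kernel(x, c, width) for t, c in zip(theta, centers))
```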
19. Formulation of Learning Problem
- A specified RKHS H (the target function f is assumed to lie in H)
- Fixed (deterministic) training input points x_i
- Additive noise ε_i with mean 0 and variance σ²
- Linear estimation: the learned function is obtained from the sample values by a linear operator,
  e.g., ridge regression for a linear model whose basis functions lie in H
20. Sampling Operator
- For any RKHS H, there exists a linear operator A from H to R^n such that A f = (f(x_1), ..., f(x_n))^T.
- Indeed, A = Σ_{i=1}^{n} e_i ⊗ K(·, x_i), where ⊗ denotes the Neumann-Schatten product (for vectors/functions, (u ⊗ v) h = ⟨h, v⟩ u) and e_i is the i-th standard basis vector in R^n.
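A matrix picture of the sampling operator for a finite-dimensional space spanned by basis functions: a function f = Σ_p θ_p φ_p is represented by its coefficient vector θ, and A becomes the design matrix with entries φ_p(x_i). The particular basis functions and input points below are illustrative assumptions:

```python
import numpy as np

def sampling_operator(x, basis_functions):
    """Matrix representation of the sampling operator: A[i, p] = phi_p(x_i),
    so that (A @ theta)[i] = f(x_i) for f = sum_p theta_p phi_p."""
    return np.column_stack([phi(x) for phi in basis_functions])

# Illustrative example: three trigonometric-type basis functions
basis = [lambda x: np.ones_like(x), np.sin, np.cos]
x_points = np.array([-1.0, 0.0, 2.0])
A = sampling_operator(x_points, basis)   # shape (n, P) = (3, 3)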
21. Formulation
[Diagram] The learning target function f in the RKHS H is mapped by the sampling operator A (always linear) into the sample value space R^n; adding noise ε gives the observed sample values y = A f + ε. The learning operator X (assumed linear) maps y back into H, giving the learned function \hat{f} = X y. The generalization error is measured by the norm in H.
22. Expected Generalization Error
- We are interested in typical performance, so we estimate the expected generalization error over the noise: E_ε ||\hat{f} - f||^2 (E_ε: expectation over the noise).
- We do not take the expectation over the training input points, so the estimate is data-dependent!
- We do not assume that the training input points are drawn from a particular distribution, which is advantageous in active learning!
23. Bias / Variance Decomposition
- The expected generalization error in the RKHS decomposes into a bias term and a variance term.
- For a linear learning operator X and noise variance σ², the variance term is σ² tr(X X*), where X* is the adjoint of X; it is therefore computable.
- We want to estimate the bias!
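Spelled out in my notation (X: learning operator, A: sampling operator, σ²: noise variance):

\[
\mathbb{E}_\varepsilon \|\hat f - f\|^2
= \underbrace{\|\mathbb{E}_\varepsilon \hat f - f\|^2}_{\text{bias}}
+ \underbrace{\mathbb{E}_\varepsilon \|\hat f - \mathbb{E}_\varepsilon \hat f\|^2}_{\text{variance}},
\qquad
\hat f = X y = X(Af + \varepsilon)
\;\Rightarrow\;
\text{variance} = \sigma^2\,\mathrm{tr}(X X^*).
\]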
24. Tricks for Estimating the Bias
Sugiyama & Ogawa (Neural Comp., 2001)
- Suppose we have a linear operator X_u that gives an unbiased estimate \hat{g} = X_u y of the learning target: E_ε \hat{g} = f (E_ε: expectation over the noise).
- We use \hat{g} for estimating the bias of \hat{f}.
25. Unbiased Estimator of the Bias
[Figure: in the RKHS, the squared distance ||\hat{f} - \hat{g}||^2 serves as a rough estimate of the bias.]
26. Subspace Information Criterion (SIC)
SIC = (unbiased estimate of the bias) + (variance term)
- SIC is an unbiased estimator of the generalization error with finite samples: E_ε[SIC] = E_ε ||\hat{f} - f||^2 (E_ε: expectation over the noise).
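A sketch of how SIC can be computed for the ridge estimator from the earlier sketches, in the finite-dimensional case where the basis is orthonormal (so the RKHS norm is the Euclidean norm of the coefficients). The exact correction terms below are my reading of Sugiyama & Ogawa (2001) and should be checked against the paper; the noise variance σ² is assumed known:

```python
import numpy as np

def sic_ridge(Phi, y, lam, sigma2):
    """SIC for ridge regression with an orthonormal basis (coefficient-space norm).

    Phi    : (n, P) design matrix, Phi[i, p] = phi_p(x_i)
    y      : (n,) observed sample values
    lam    : ridge parameter
    sigma2 : noise variance (assumed known here)
    """
    n, P = Phi.shape
    # Learning operator of ridge regression and the unbiased (least-squares) operator
    X_lam = np.linalg.solve(Phi.T @ Phi + lam * np.eye(P), Phi.T)
    X_u = np.linalg.pinv(Phi)          # unbiased only if rank(Phi) = P
    theta_hat = X_lam @ y              # ridge estimate of the coefficients
    theta_u = X_u @ y                  # unbiased estimate of the true coefficients
    D = X_lam - X_u
    bias_estimate = np.sum((theta_hat - theta_u) ** 2) - sigma2 * np.trace(D @ D.T)
    variance = sigma2 * np.trace(X_lam @ X_lam.T)
    return bias_estimate + variance
```

The ridge parameter is then chosen by minimizing `sic_ridge` over a grid of candidates, as in the selection loop sketched earlier.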
27. Obtaining the Unbiased Estimate
- We need an operator X_u that gives an unbiased estimate of the learning target f.
- Such an X_u exists if and only if {K(·, x_i)}_{i=1}^{n} span the entire space H. When this is satisfied, X_u is given by A†, the generalized inverse of the sampling operator A.
- Then we can enjoy all the features! (unlabeled samples, transductive inference, etc.)
28. Example of Using SIC: Standard Linear Regression
- Learning target function: f(x) = Σ_{p=1}^{P} θ_p φ_p(x), where the coefficients θ_p are unknown
- Regression model: \hat{f}(x) = Σ_{p=1}^{P} \hat{θ}_p φ_p(x), where the \hat{θ}_p are estimated linearly (e.g., by ridge regression)
29. Example (cont.)
- Generalization measure: a weighted norm of \hat{f} - f, designed with a weight function as desired
- If the design matrix (with entries φ_p(x_i)) has rank P (the number of basis functions), then the best linear unbiased estimator (BLUE) always exists.
- In this case, SIC provides an unbiased estimate of the above generalization error.
30. Applicability of SIC
- However, the design matrix has rank P only if P ≤ n (P: number of basis functions, n: number of training examples).
- Therefore, the target function should be included in a rather small model.
- The range of application of SIC is rather limited.
31. When the Unbiased Estimate Does Not Exist
Sugiyama & Müller (JMLR, 2002)
- X_u exists if and only if {K(·, x_i)}_{i=1}^{n} span the whole space H.
- When this condition is not fulfilled, let us restrict ourselves to finding the learning result function from a subspace S, not from the entire RKHS H.
32. Essential Generalization Error
- For \hat{f} in the subspace S, the generalization error decomposes as ||\hat{f} - f||^2 = ||\hat{f} - P f||^2 + ||P f - f||^2, where P is the orthogonal projection onto S.
- The second term is irrelevant (a constant independent of the model); in the essential first term, f is just replaced by its projection P f.
- Essentially, we are estimating the projection P f.
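The decomposition follows from the Pythagorean theorem in the RKHS (my notation):

\[
\hat f - f = (\hat f - Pf) + (Pf - f), \qquad \hat f - Pf \in S, \quad f - Pf \perp S
\;\;\Rightarrow\;\;
\|\hat f - f\|^2 = \|\hat f - Pf\|^2 + \|Pf - f\|^2 .
\]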
33. Unbiased Estimate of the Projection
- If a linear operator X_u that gives an unbiased estimate of the projection P f of the learning target is available, then SIC is an unbiased estimator of the essential generalization error.
- Such an X_u exists if and only if the subspace S is included in the span of {K(·, x_i)}_{i=1}^{n}.
- e.g., the kernel regression model \hat{f}(x) = Σ_{i=1}^{n} θ_i K(x, x_i) satisfies this condition.
34. Restriction
- However, another restriction arises:
- If the generalization measure is designed as desired, we have to use the kernel function induced by that generalization measure.
35. Restriction (cont.)
- On the other hand, if a desired kernel function is used, then we have to use the generalization measure induced by that kernel.
- e.g., the generalization measure in a Gaussian RKHS heavily penalizes high-frequency components.
36. Summary of Usage of SIC
- SIC essentially has two modes.
- For rather restricted linear regression, SIC has several interesting properties:
  - Unlabeled samples can be utilized for estimating the prediction error (expected test error).
  - Any weighted error measure can be used, e.g., for interpolation / extrapolation and transductive inference.
- For kernel regression, SIC can always be applied. However, the kernel-induced generalization measure must be employed.
37. Simulation (1): Setting
- Trigonometric polynomial RKHS, spanned by trigonometric polynomials {1, sin x, cos x, ..., sin Nx, cos Nx}
- Generalization measure: a (weighted) norm of \hat{f} - f in this space
- Learning target function: a sinc-like function in this RKHS
38. Simulation (1): Setting (cont.)
- Training examples
- Ridge regression is used for learning
- Number of training examples
- Noise variance
- Number of trials
39. Simulation (1-a): Using Unlabeled Samples
- We estimate the prediction error using 1000 unlabeled samples.
40. Results: Unlabeled Samples
[Plot: estimated vs. true prediction error as a function of the ridge parameter. Values can be negative since some constants are ignored.]
41. Results: Unlabeled Samples (cont.)
[Plot: results as a function of the ridge parameter.]
42. Simulation (1-b): Transduction
- We estimate the test error at a single test point.
43. Results: Transduction
[Plot: estimated vs. true test error as a function of the ridge parameter.]
44. Results: Transduction (cont.)
[Plot: results as a function of the ridge parameter.]
45. Simulation (2): Infinite-Dimensional RKHS
- Gaussian RKHS
- Learning target function: the sinc function
- Training examples: noisy samples of the target
- We estimate the (essential) generalization error with the kernel-induced measure.
46. Results: Gaussian RKHS
[Plot: estimated vs. true error as a function of the ridge parameter.]
47. Results: Gaussian RKHS (cont.)
[Plot: results as a function of the ridge parameter.]
48. Simulation (3): DELVE Data Sets
- Gaussian RKHS
- We choose the ridge parameter by:
  - SIC
  - Leave-one-out cross-validation
  - An empirical Bayesian method (marginal likelihood maximization) (Akaike, 1980)
- Performance is compared by the test error. (A comparison sketch follows.)
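The following is not the DELVE experiment itself (which uses a Gaussian RKHS); it is an illustrative comparison of SIC against closed-form leave-one-out cross-validation on the linear-basis ridge setup from the earlier sketches, reusing `sic_ridge` from above. The candidate grid and the known noise variance σ² are assumptions:

```python
import numpy as np

def loocv_ridge(Phi, y, lam):
    """Closed-form leave-one-out CV score for ridge regression (a linear smoother)."""
    n, P = Phi.shape
    H = Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(P), Phi.T)  # hat matrix
    residuals = (y - H @ y) / (1.0 - np.diag(H))
    return np.mean(residuals ** 2)

def choose_lambda(Phi, y, sigma2, candidates):
    """Pick the ridge parameter by SIC and by LOOCV over a candidate grid."""
    by_sic = min(candidates, key=lambda lam: sic_ridge(Phi, y, lam, sigma2))
    by_cv = min(candidates, key=lambda lam: loocv_ridge(Phi, y, lam))
    return by_sic, by_cv
```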
49. Normalized Test Errors
[Table: normalized test errors on the DELVE data sets. Red: best or comparable (95% t-test).]
50. Image Restoration
Sugiyama et al. (IEICE Trans., 2001); Sugiyama & Ogawa (Signal Processing, 2002)
- A restoration filter with a parameter (e.g., a Gaussian filter or a regularization filter) is applied to the degraded image.
[Figure: restored images when the parameter is too large, too small, and appropriate.]
- We would like to determine the parameter values appropriately.
51. Formulation
[Diagram] The original image in a Hilbert space is mapped by the degradation operator into another Hilbert space and corrupted by noise, giving the observed image; the restoration filter maps the observed image back to the restored image. This parallels the regression formulation: degradation plays the role of sampling, and the filter plays the role of the learning operator.
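In symbols (my notation: T the degradation operator, n the noise, X_θ the restoration filter with parameter θ):

\[
g = T f + n, \qquad \hat f = X_\theta\, g, \qquad J(\theta) = \mathbb{E}_n \|\hat f - f\|^2 ,
\]

so choosing the filter parameter θ is again a model-selection problem to which SIC applies.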
52. Results with the Regularization Filter
[Figure: original images, degraded images, and images restored using SIC.]
53. Precipitation Estimation
Moro & Sugiyama (IEICE General Conf., 2001)
- Estimating future precipitation from past precipitation and weather radar data.
- Our method with SIC won the 1st prize in estimation accuracy in the IEICE Precipitation Estimation Contest 2001:
  1st: TokyoTech (MSE 0.71)
  2nd: KyuTech (MSE 0.75)
  3rd: Chiba Univ (MSE 0.93)
  4th: (MSE 1.18)
- Precipitation and weather radar data from the IEICE Precipitation Estimation Contest 2001.
54. References (Fundamentals of SIC)
- Proposing the concept of SIC:
  - Sugiyama, M. & Ogawa, H. Subspace information criterion for model selection. Neural Computation, vol.13, no.8, pp.1863-1889, 2001.
- Performance evaluation of SIC:
  - Sugiyama, M. & Ogawa, H. Theoretical and experimental evaluation of the subspace information criterion. Machine Learning, vol.48, no.1/2/3, pp.25-50, 2002.
55. References (SIC for Particular Learning Methods)
- SIC for regularization learning:
  - Sugiyama, M. & Ogawa, H. Optimal design of regularization term and regularization parameter by subspace information criterion. Neural Networks, vol.15, no.3, pp.349-361, 2002.
- SIC for sparse regressors:
  - Tsuda, K., Sugiyama, M. & Müller, K.-R. Subspace information criterion for non-quadratic regularizers --- Model selection for sparse regressors. IEEE Transactions on Neural Networks, vol.13, no.1, pp.70-80, 2002.
56. References (Applications of SIC)
- Applying SIC to image restoration:
  - Sugiyama, M., Imaizumi, D. & Ogawa, H. Subspace information criterion for image restoration --- Optimizing parameters in linear filters. IEICE Transactions on Information and Systems, vol.E84-D, no.9, pp.1249-1256, Sep. 2001.
  - Sugiyama, M. & Ogawa, H. A unified method for optimizing linear image restoration filters. Signal Processing, vol.82, no.11, pp.1773-1787, 2002.
- Applying SIC to precipitation estimation:
  - Moro, S. & Sugiyama, M. Estimation of precipitation from meteorological radar data. In Proceedings of the 2001 IEICE General Conference, SD-1-10, pp.264-265, Shiga, Japan, Mar. 26-29, 2001.
57. References (Extensions of SIC)
- Extending the range of application of SIC:
  - Sugiyama, M. & Müller, K.-R. The subspace information criterion for infinite dimensional hypothesis spaces. Journal of Machine Learning Research, vol.3 (Nov), pp.323-359, 2002.
- Further improving SIC:
  - Sugiyama, M. Improving precision of the subspace information criterion. IEICE Transactions on Fundamentals (to appear).
  - Sugiyama, M., Kawanabe, M. & Müller, K.-R. Trading variance reduction with unbiasedness --- The regularized subspace information criterion for robust model selection (submitted).
58. Conclusions
- We formulated the regression problem from a functional analytic point of view.
- Within this framework, we gave a generalization error estimator called the subspace information criterion (SIC).
- The unbiasedness of SIC is guaranteed even with finite samples.
- We did not take the expectation over the training sample points, so SIC may be more data-dependent.
59. Conclusions (cont.)
- SIC essentially has two modes:
  - For rather restrictive linear regression, SIC has several interesting properties:
    - Unlabeled samples can be utilized for estimating the prediction error.
    - Any weighted error measure can be used, e.g., for interpolation, extrapolation, and transductive inference.
  - For kernel regression, SIC can always be applied. However, the kernel-induced generalization measure must be employed.