Title: Feature selection methods
1. Feature selection methods
- Isabelle Guyon
- isabelle_at_clopinet.com
IPAM summer school on Mathematics in Brain Imaging, July 2008
2. Feature Selection
- Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.
[Figure: m x n data matrix X (m examples, n features)]
3. Applications
[Figure: example applications plotted by number of training examples (10 to 10^6) against number of variables/features (10 to 10^5): Customer knowledge, Quality control, Market analysis, OCR/HWR, Machine vision, Text categorization, System diagnosis, Bioinformatics]
4. Nomenclature
- Univariate method: considers one variable (feature) at a time.
- Multivariate method: considers subsets of variables (features) together.
- Filter method: ranks features or feature subsets independently of the predictor (classifier).
- Wrapper method: uses a classifier to assess features or feature subsets.
5. Univariate Filter Methods
6. Univariate feature ranking
[Figure: class-conditional densities P(Xi | Y=1) and P(Xi | Y=-1) along feature xi, with means μ+ and μ- and standard deviations σ+ and σ-]
- Normally distributed classes, equal variance σ², unknown, estimated from the data as σ²_within.
- Null hypothesis H0: μ+ = μ-
- T statistic: t = (μ+ - μ-) / (σ_within √(1/m+ + 1/m-)), where m+ and m- are the numbers of examples in each class. If H0 is true, t ~ Student(m+ + m- - 2 d.f.)
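Below is a minimal numpy sketch of this T-statistic ranking (not code from the slides), assuming a data matrix X of shape m x n and labels y in {-1, +1}; the toy data at the end is only for illustration.

```python
import numpy as np

def t_statistic_ranking(X, y):
    """Rank features by the two-sample T statistic (equal-variance form)."""
    Xp, Xm = X[y == 1], X[y == -1]          # class-conditional samples
    mp, mm = len(Xp), len(Xm)               # class sizes m+ and m-
    mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
    # pooled "within-class" variance estimate
    var_within = ((mp - 1) * Xp.var(axis=0, ddof=1) +
                  (mm - 1) * Xm.var(axis=0, ddof=1)) / (mp + mm - 2)
    t = (mu_p - mu_m) / (np.sqrt(var_within) * np.sqrt(1.0 / mp + 1.0 / mm))
    return np.argsort(-np.abs(t)), t        # most relevant features first

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
y = np.where(rng.random(100) > 0.5, 1, -1)
X[y == 1, 0] += 1.0                          # make feature 0 informative
order, t = t_statistic_ranking(X, y)
print(order[:5])
```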
7. Statistical tests (chap. 2)
[Figure: null distribution of the test statistic]
- H0: X and Y are independent.
- Relevance index = test statistic.
- P-value ≈ false positive rate FPR = n_fp / n_irr
- Multiple testing problem: use the Bonferroni correction pval ← n · pval
- False discovery rate: FDR = n_fp / n_sc ≈ FPR · n / n_sc
- Probe method: FPR ≈ n_sp / n_p
(n_fp: false positives; n_irr: irrelevant features; n_sc: selected features; n_sp: selected probes; n_p: probes)
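A hedged illustration of these quantities, assuming per-feature two-sample t-tests (scipy) as the relevance index and random Gaussian "probes" appended to the data; the 0.05 threshold and the toy data are illustrative, not from the slides.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
m, n = 200, 50
X = rng.normal(size=(m, n))
y = np.where(rng.random(m) > 0.5, 1, -1)
X[y == 1, :5] += 0.8                      # first 5 features are relevant

# Per-feature p-values under H0: "feature and target are independent"
pvals = np.array([ttest_ind(X[y == 1, j], X[y == -1, j]).pvalue
                  for j in range(n)])

# Bonferroni correction for multiple testing: pval <- n * pval
selected = np.where(np.minimum(1.0, n * pvals) < 0.05)[0]
print("selected after Bonferroni:", selected)

# Probe method: append n_p random probes; the FPR is estimated by the
# fraction of probes passing the same selection threshold (~ n_sp / n_p).
n_p = 1000
probes = rng.normal(size=(m, n_p))
probe_p = np.array([ttest_ind(probes[y == 1, j], probes[y == -1, j]).pvalue
                    for j in range(n_p)])
print("estimated FPR at p<0.05:", (probe_p < 0.05).mean())
```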
8. Univariate Dependence
- Independence: P(X, Y) = P(X) P(Y)
- Measure of dependence: mutual information
  MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X)P(Y)) ] dX dY = KL( P(X,Y) || P(X)P(Y) )
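A small numpy sketch of a histogram-based estimate of MI(X, Y) for a single feature and a discrete target; the number of bins and the toy data are illustrative assumptions.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Histogram estimate of MI(X, Y) = KL(P(X,Y) || P(X)P(Y)), in nats."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()                     # joint distribution P(X, Y)
    px = pxy.sum(axis=1, keepdims=True)       # marginal P(X)
    py = pxy.sum(axis=0, keepdims=True)       # marginal P(Y)
    nz = pxy > 0                              # avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
x_dep = y + 0.5 * rng.normal(size=1000)       # dependent on y
x_ind = rng.normal(size=1000)                 # independent of y
print(mutual_information(x_dep, y), mutual_information(x_ind, y))
```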
9. Other criteria (chap. 3)
- The choice of a feature ranking method depends on:
  - the nature of the variables and the target (binary, categorical, continuous),
  - the problem (dependencies between variables, linear/non-linear relationships between variables and target),
  - the available data (number of examples, number of variables, noise in the data),
  - the available tabulated statistics.
10. Multivariate Methods
11. Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006
12. Filters, Wrappers, and Embedded methods
13. Relief
- Relief criterion: Relief = <D_miss / D_hit>, averaged over examples, where D_hit is the distance of an example to its nearest hit (nearest example of the same class) and D_miss is its distance to its nearest miss (nearest example of the other class).
[Figure: an example with its nearest hit and nearest miss, and the distances D_hit and D_miss]
Kira and Rendell, 1992
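A minimal numpy sketch of the classic Relief weight update of Kira and Rendell (1992); the slide's ratio form <D_miss/D_hit> aggregates the same nearest-hit/nearest-miss distances slightly differently. The distance metric, number of iterations, and toy data are illustrative.

```python
import numpy as np

def relief(X, y, n_iter=100, seed=0):
    """Classic Relief: reward features that separate the nearest miss, not the nearest hit."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iter):
        i = rng.integers(m)
        d = np.abs(X - X[i]).sum(axis=1)          # L1 distances to example i
        d[i] = np.inf                              # exclude the example itself
        same, diff = (y == y[i]), (y != y[i])
        hit = np.where(same, d, np.inf).argmin()   # nearest hit
        miss = np.where(diff, d, np.inf).argmin()  # nearest miss
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)            # only features 0 and 1 matter
print(np.argsort(-relief(X, y))[:3])
```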
14. Wrappers for feature selection
Kohavi-John, 1997
N features, 2^N possible feature subsets!
15. Search Strategies (chap. 4)
- Exhaustive search.
- Simulated annealing, genetic algorithms.
- Beam search: keep the k best paths at each step.
- Greedy search: forward selection or backward elimination.
- PTA(l,r): plus l, take away r; at each step, run SFS l times then SBS r times.
- Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far. At any time, if a better subset of the same size was already found, switch abruptly.
16. Forward Selection (wrapper)
[Figure: greedy forward search; n, then n-1, n-2, ..., 1 candidate features are examined at successive steps]
Also referred to as SFS: Sequential Forward Selection.
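A hedged sketch of SFS as a wrapper, using cross-validated accuracy of a scikit-learn logistic regression as the assessment; the classifier, the fixed target size k, and the toy data are illustrative choices, not those of the slides.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def forward_selection(X, y, k):
    """Greedily add the feature that most improves cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        scores = []
        for j in remaining:                       # n, n-1, n-2, ... candidates
            cols = selected + [j]
            clf = LogisticRegression(max_iter=1000)
            scores.append(cross_val_score(clf, X[:, cols], y, cv=5).mean())
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 15))
y = (X[:, 2] - X[:, 7] > 0).astype(int)
print(forward_selection(X, y, k=3))
```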
17. Forward Selection (embedded)
[Figure: the same forward search tree, following a single path]
Guided search: we do not consider alternative paths. Typical examples: Gram-Schmidt orthogonalization and tree classifiers.
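A minimal numpy sketch of Gram-Schmidt-style forward selection for a linear model: pick the feature most correlated with the current residual target, then project it out of the target and the remaining features. Normalization details and the toy data are illustrative assumptions.

```python
import numpy as np

def gram_schmidt_selection(X, y, k):
    """Forward selection by orthogonalization (a guided, single-path search)."""
    X = X.astype(float) - X.mean(axis=0)
    r = (y - y.mean()).astype(float)               # residual target
    Q = X.copy()
    selected = []
    for _ in range(k):
        norms = np.linalg.norm(Q, axis=0)
        corr = np.abs(Q.T @ r) / np.maximum(norms, 1e-12)
        corr[selected] = -np.inf                    # never re-pick a feature
        j = int(np.argmax(corr))
        selected.append(j)
        q = Q[:, j] / np.linalg.norm(Q[:, j])       # unit vector of chosen feature
        r = r - (q @ r) * q                         # deflate the target
        Q = Q - np.outer(q, q @ Q)                  # deflate the remaining features
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))
y = 3 * X[:, 4] - 2 * X[:, 9] + 0.1 * rng.normal(size=100)
print(gram_schmidt_selection(X, y, k=2))
```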
18. Backward Elimination (wrapper)
Also referred to as SBS: Sequential Backward Selection.
[Figure: backward search starting from the full set of n features; features are removed one at a time]
19. Backward Elimination (embedded)
Guided search: we do not consider alternative paths. Typical example: recursive feature elimination (RFE-SVM).
[Figure: the same backward search starting from the full set of n features, following a single path]
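A hedged sketch in the spirit of RFE-SVM: train a linear SVM, remove the feature with the smallest weight magnitude, and repeat. scikit-learn's LinearSVC stands in for the SVM; removing one feature at a time (rather than in chunks) and the toy data are simplifying assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def rfe_linear_svm(X, y, n_keep):
    """Backward elimination guided by the magnitude of the SVM weights."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        clf = LinearSVC(C=1.0, dual=False, max_iter=10000)
        clf.fit(X[:, remaining], y)
        w = clf.coef_.ravel()
        worst = int(np.argmin(np.abs(w)))          # least useful remaining feature
        del remaining[worst]
    return remaining

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (2 * X[:, 1] - X[:, 6] > 0).astype(int)
print(rfe_linear_svm(X, y, n_keep=2))
```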
20. Scaling Factors
- Idea: transform a discrete space into a continuous space.
  s = (s1, s2, s3, s4)
- Discrete indicators of feature presence: si ∈ {0, 1}
- Continuous scaling factors: si ∈ IR
Now we can do gradient descent!
21. Formalism (chap. 5)
- Many learning algorithms are cast as the minimization of some regularized functional: an empirical error term plus a regularization (capacity control) term.
- This is the justification of RFE and many other embedded methods.
Next few slides: André Elisseeff.
22. Embedded method
- Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms:
  - Find a functional that represents your prior knowledge about what a good model is.
  - Add the s weights into the functional and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently.
  - Optimize alternately with respect to α and s (see the sketch after this list).
  - Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.
- Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass, regression, etc.
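A minimal numpy sketch of this recipe for a linear model with logistic loss: the prediction uses scaled inputs s ⊙ x, and gradient steps alternate between the model weights (playing the role of α) and the scaling factors s. The learning rate, the l1 pressure on s, and the non-negativity clipping are illustrative assumptions.

```python
import numpy as np

def alternating_embedded_selection(X, y, steps=200, lr=0.1, lam=0.01):
    """Alternate gradient steps on model weights w and scaling factors s."""
    m, n = X.shape
    w, s = np.zeros(n), np.ones(n)
    for _ in range(steps):
        z = (X * s) @ w                            # linear output on scaled inputs
        g = (1.0 / (1.0 + np.exp(-z)) - y) / m     # d(logistic loss)/dz
        w -= lr * ((X * s).T @ g)                  # step on w, s fixed
        z = (X * s) @ w
        g = (1.0 / (1.0 + np.exp(-z)) - y) / m
        grad_s = (X * w).T @ g + lam * np.sign(s)  # step on s, w fixed (+ l1 pressure)
        s = np.clip(s - lr * grad_s, 0.0, None)    # keep scaling factors non-negative
    return w, s

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
y = (X[:, 0] - X[:, 3] > 0).astype(float)
w, s = alternating_embedded_selection(X, y)
print(np.argsort(-s)[:3])                          # largest scaling factors
```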
23. The l1 SVM
- A version of the SVM where the regularizer W(w) = ||w||² is replaced by the l1 norm W(w) = Σ_i |w_i|.
- Can be considered an embedded feature selection method.
- Some weights will be drawn to zero (tends to remove redundant features).
- Difference from the regular SVM, where redundant features are included.
Bi et al., 2003; Zhu et al., 2003
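A brief scikit-learn illustration in the spirit of the l1 SVM (not a reimplementation of Bi et al. or Zhu et al.): with an l1 penalty, many coefficients come out exactly zero, giving an embedded selection. The penalty strength C and the toy data are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# l1-regularized linear SVM: W(w) = sum_i |w_i| instead of ||w||^2
l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=0.1,
                   max_iter=10000).fit(X, y)
selected = np.flatnonzero(l1_svm.coef_.ravel())    # features with non-zero weight
print("non-zero weights:", selected)
```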
24. The l0 SVM
- Replace the regularizer ||w||² by the l0 norm (the number of non-zero weights).
- Further replace it by Σ_i log(ε + |w_i|).
- This boils down to a multiplicative update algorithm (sketched below).
Weston et al., 2003
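A hedged sketch of a multiplicative update in the spirit of Weston et al., 2003: repeatedly train a linear SVM on rescaled inputs and multiply the scaling factors by |w|, so the scales of useless features shrink toward zero. The stand-in solver (LinearSVC), the iteration count, and the threshold are assumptions.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l0_approx_selection(X, y, n_iter=10, tol=1e-3):
    """Multiplicative update: z <- z * |w|, with w trained on X scaled by z."""
    z = np.ones(X.shape[1])                        # feature scaling factors
    for _ in range(n_iter):
        clf = LinearSVC(C=1.0, dual=False, max_iter=10000)
        clf.fit(X * z, y)
        z = z * np.abs(clf.coef_.ravel())          # shrink unimportant features
        z = z / z.max()                            # normalize for readability
    return np.flatnonzero(z > tol)                 # surviving features

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 20))
y = (X[:, 3] - X[:, 11] > 0).astype(int)
print(l0_approx_selection(X, y))
```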
25. Causality
26. What can go wrong?
Guyon-Aliferis-Elisseeff, 2007
27. What can go wrong?
28. What can go wrong?
Guyon-Aliferis-Elisseeff, 2007
29. http://clopinet.com/causality
30. Wrapping up
31. Bilevel optimization
[Figure: data matrix with M samples and N variables/features]
Split the data into 3 sets: training, validation, and test set.
- 1) For each feature subset, train the predictor on the training data.
- 2) Select the feature subset which performs best on the validation data.
- Repeat and average if you want to reduce variance (cross-validation).
- 3) Test on the test data.
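A hedged end-to-end sketch of this procedure, assuming the candidate subsets are nested top-k subsets from a simple correlation ranking (to keep the number of subsets small) and a logistic regression as the predictor; the split sizes, classifier, and toy data are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(600, 40))
y = (X[:, 0] + X[:, 1] - X[:, 2] > 0).astype(int)

# Split the M samples into training, validation, and test sets.
idx = rng.permutation(len(X))
tr, va, te = idx[:300], idx[300:450], idx[450:]

# Candidate subsets: top-k features from a correlation ranking on the training set.
rank = np.argsort(-np.abs(np.corrcoef(X[tr].T, y[tr])[-1, :-1]))
best_k, best_acc = None, -1.0
for k in range(1, 11):
    cols = rank[:k]
    clf = LogisticRegression(max_iter=1000).fit(X[tr][:, cols], y[tr])  # 1) train
    acc = clf.score(X[va][:, cols], y[va])                              # 2) validate
    if acc > best_acc:
        best_k, best_acc = k, acc
cols = rank[:best_k]
clf = LogisticRegression(max_iter=1000).fit(X[tr][:, cols], y[tr])
print("test accuracy:", clf.score(X[te][:, cols], y[te]))               # 3) test
```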
32. Complexity of Feature Selection
With high probability:
Generalization_error ≤ Validation_error + ε(C/m2)
[Figure: error as a function of the feature subset size n]
m2 = number of validation examples, N = total number of features, n = feature subset size.
Try to keep C of the order of m2.
33. CLOP: http://clopinet.com/CLOP/
- CLOP = Challenge Learning Object Package.
- Based on the Matlab Spider package developed at the Max Planck Institute.
- Two basic abstractions:
  - Data object
  - Model object
- Typical script:
  - D = data(X, Y);         % Data constructor
  - M = kridge;             % Model constructor
  - [R, Mt] = train(M, D);  % Train model -> Mt
  - Dt = data(Xt, Yt);      % Test data constructor
  - Rt = test(Mt, Dt);      % Test the model
34. NIPS 2003 FS challenge
http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html
35. Conclusion
- Feature selection focuses on uncovering subsets of variables X1, X2, ... predictive of the target Y.
- Multivariate feature selection is in principle more powerful than univariate feature selection, but not always in practice.
- Taking a closer look at the type of dependencies, in terms of causal relationships, may help refine the notion of variable relevance.
36. Acknowledgements and references
1) Feature Extraction, Foundations and Applications. I. Guyon et al., Eds. Springer, 2006. http://clopinet.com/fextract-book
2) Causal feature selection. I. Guyon, C. Aliferis, A. Elisseeff. To appear in Computational Methods of Feature Selection, Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press, 2007. http://clopinet.com/causality