1
Feature selection methods
  • Isabelle Guyon
  • isabelle@clopinet.com

IPAM summer school on Mathematics in Brain
Imaging. July 2008
2
Feature Selection
  • Thousands to millions of low-level features:
    select the most relevant ones to build better,
    faster, and easier-to-understand learning
    machines.

[Figure: data matrix X with m examples (rows) and n features (columns)]
3
Applications
[Figure: application domains plotted by number of examples versus number of
variables/features (log scale, 10 to 10^6): Customer knowledge, Quality
control, Market Analysis, OCR / HWR, Machine vision, Text Categorization,
System diagnosis, Bioinformatics]
4
Nomenclature
  • Univariate method: considers one variable
    (feature) at a time.
  • Multivariate method: considers subsets of
    variables (features) together.
  • Filter method: ranks features or feature subsets
    independently of the predictor (classifier).
  • Wrapper method: uses a classifier to assess
    features or feature subsets.

5
Univariate Filter Methods
6
Univariate feature ranking
[Figure: two Gaussian class-conditional densities P(Xi|Y=1) and P(Xi|Y=-1)
plotted against xi, with means μ+ and μ- and standard deviations σ+ and σ-]

  • Normally distributed classes, equal variance σ2,
    unknown, estimated from the data as σ2within.
  • Null hypothesis H0: μ+ = μ-
  • T statistic: if H0 is true,
    t = (μ+ - μ-) / (σwithin √(1/m+ + 1/m-))
    ~ Student(m+ + m- - 2 d.f.)
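
As a concrete illustration (a sketch in Python/NumPy, not the presenter's
code), the t-statistic ranking above can be written in a few lines; it assumes
X is an m x n data matrix and y a label vector in {-1, +1}:

    # Hedged sketch: pooled-variance t-statistic per feature, as on this slide.
    import numpy as np
    from scipy import stats

    def t_statistic_ranking(X, y):
        Xp, Xm = X[y == 1], X[y == -1]             # class-conditional samples
        mp, mm = len(Xp), len(Xm)                  # class sizes m+ and m-
        mu_p, mu_m = Xp.mean(axis=0), Xm.mean(axis=0)
        # within-class variance estimate under the equal-variance assumption
        s2_within = (((Xp - mu_p) ** 2).sum(axis=0) +
                     ((Xm - mu_m) ** 2).sum(axis=0)) / (mp + mm - 2)
        t = (mu_p - mu_m) / np.sqrt(s2_within * (1.0 / mp + 1.0 / mm))
        # two-sided p-values under Student(m+ + m- - 2 d.f.) if H0 holds
        pval = 2 * stats.t.sf(np.abs(t), df=mp + mm - 2)
        return t, pval

    # Rank features by |t| (equivalently by p-value):
    # t, pval = t_statistic_ranking(X, y); ranking = np.argsort(-np.abs(t))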

7
Statistical tests (chap. 2)
[Figure: null distribution of the test statistic]
  • H0: X and Y are independent.
  • Relevance index ⇔ test statistic.
  • P-value ⇔ false positive rate FPR = nfp / nirr
    (false positives among the nirr irrelevant features).
  • Multiple testing problem: use the Bonferroni
    correction, pval ← n pval.
  • False discovery rate: FDR = nfp / nsc ≤ FPR n/nsc
    (nsc = number of features called significant).
  • Probe method: FPR ≈ nsp / np
    (selected probes / total number of probes).
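
A minimal sketch of how these corrections translate into code, assuming pval
holds one p-value per feature (e.g. from the t-test above) and is_probe flags
artificially added random probes; the function names are mine, not from any
particular package:

    import numpy as np

    def bonferroni(pval):
        # Bonferroni correction: pval <- n * pval (capped at 1)
        return np.minimum(len(pval) * np.asarray(pval), 1.0)

    def fdr_upper_bound(pval, threshold):
        # FDR = nfp / nsc <= FPR * n / nsc, approximating FPR by the p-value
        # threshold; nsc = number of features called significant
        pval = np.asarray(pval)
        n_sc = max(int((pval <= threshold).sum()), 1)
        return threshold * len(pval) / n_sc

    def probe_fpr(selected, is_probe):
        # Probe method: FPR ~ nsp / np (selected probes / total probes)
        selected, is_probe = np.asarray(selected), np.asarray(is_probe)
        return (selected & is_probe).sum() / max(int(is_probe.sum()), 1)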

8
Univariate Dependence
  • Independence:
    P(X, Y) = P(X) P(Y)
  • Measure of dependence:
    MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X)P(Y)) ] dX dY
             = KL( P(X,Y) || P(X)P(Y) )
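
To make the criterion concrete, here is a hedged plug-in estimate of MI(Xi, Y)
for one feature and a discrete target, obtained by discretizing the feature
into bins (the function name and bin count are mine); scikit-learn's
mutual_info_classif provides a ready-made alternative:

    import numpy as np

    def mutual_information(xi, y, bins=10):
        # discretize the feature, then build the joint histogram P(Xi, Y)
        edges = np.histogram_bin_edges(xi, bins=bins)[1:-1]
        xb = np.digitize(xi, edges)                   # bin index in 0..bins-1
        classes = np.unique(y)
        joint = np.array([np.bincount(xb[y == c], minlength=bins)
                          for c in classes], dtype=float).T
        pxy = joint / joint.sum()                     # P(Xi, Y)
        px = pxy.sum(axis=1, keepdims=True)           # P(Xi)
        py = pxy.sum(axis=0, keepdims=True)           # P(Y)
        nz = pxy > 0                                  # avoid log(0)
        return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

    # Ready-made ranking criterion in scikit-learn:
    # from sklearn.feature_selection import mutual_info_classif
    # scores = mutual_info_classif(X, y)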
9
Other criteria (chap. 3)
  • The choice of a feature selection / ranking method
    depends on the nature of:
  • the variables and the target (binary,
    categorical, continuous)
  • the problem (dependencies between variables,
    linear/non-linear relationships between variables
    and target)
  • the available data (number of examples and
    number of variables, noise in data)
  • the available tabulated statistics.

10
Multivariate Methods
11
Univariate selection may fail
Guyon-Elisseeff, JMLR 2004; Springer 2006
12
Filters, Wrappers, and Embedded methods
13
Relief
Relief = ⟨Dmiss / Dhit⟩

[Figure: for each example, Dhit = distance to its nearest hit (closest example
of the same class) and Dmiss = distance to its nearest miss (closest example
of the other class)]

Kira and Rendell, 1992
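
A hedged sketch of the Relief idea (my simplification of Kira and Rendell's
algorithm, using L1 distances): each example pushes a feature's weight up by
its coordinate-wise distance to the nearest miss and down by its distance to
the nearest hit:

    import numpy as np

    def relief_scores(X, y):
        X = np.asarray(X, dtype=float)
        m, n = X.shape
        w = np.zeros(n)
        for i in range(m):
            d = np.abs(X - X[i]).sum(axis=1)          # distances to example i
            d[i] = np.inf                             # exclude the example itself
            hit = np.argmin(np.where(y == y[i], d, np.inf))   # nearest hit
            miss = np.argmin(np.where(y != y[i], d, np.inf))  # nearest miss
            w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return w / m   # higher score: feature separates hits from misses better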
14
Wrappers for feature selection
Kohavi-John, 1997
N features, 2^N possible feature subsets!
15
Search Strategies (chap. 4)
  • Exhaustive search.
  • Simulated annealing, genetic algorithms.
  • Beam search: keep the k best paths at each step.
  • Greedy search: forward selection or backward
    elimination.
  • PTA(l,r): plus l, take away r; at each step,
    run SFS l times then SBS r times.
  • Floating search (SFFS and SBFS): one step of SFS
    (resp. SBS), then SBS (resp. SFS) as long as we
    find better subsets than those of the same size
    obtained so far. At any time, if a better subset of
    the same size was already found, switch abruptly.
16
Forward Selection (wrapper)
[Figure: at each step, the best of the remaining n, n-1, n-2, ..., 1 candidate
features is added to the current subset]

Also referred to as SFS: Sequential Forward Selection.
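
A possible wrapper implementation of SFS with scikit-learn (a sketch under my
own naming, not the presenter's code): at each step, add the feature whose
inclusion yields the best cross-validated score of the chosen predictor.
scikit-learn's SequentialFeatureSelector packages the same strategy.

    import numpy as np
    from sklearn.model_selection import cross_val_score

    def forward_selection(model, X, y, k):
        selected, remaining = [], list(range(X.shape[1]))
        while remaining and len(selected) < k:
            scores = [cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                      for j in remaining]
            best = remaining[int(np.argmax(scores))]   # best single addition
            selected.append(best)
            remaining.remove(best)
        return selected

    # e.g. forward_selection(LogisticRegression(max_iter=1000), X, y, k=10)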
17
Forward Selection (embedded)

Guided search: we do not consider alternative
paths. Typical examples: Gram-Schmidt orthogonalization
and tree classifiers.
18
Backward Elimination (wrapper)
Also referred to as SBS: Sequential Backward Selection.

[Figure: backward elimination starts from the full set of n features and
removes one feature at each step (n, n-1, n-2, ..., 1)]
19
Backward Elimination (embedded)
Guided search: we do not consider alternative
paths. Typical example: recursive feature
elimination (RFE-SVM).

[Figure: backward elimination path starting from the full feature set]
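
As a concrete instance of this embedded backward elimination, RFE with a
linear SVM can be sketched with scikit-learn's RFE wrapper; the subset size
and C value below are placeholders, not values from the talk:

    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    # Train a linear SVM, drop the feature with the smallest |w|, retrain.
    rfe = RFE(estimator=LinearSVC(C=1.0, max_iter=10000),
              n_features_to_select=20,   # placeholder target subset size
              step=1)                    # eliminate one feature per iteration
    # rfe.fit(X, y); selected = rfe.support_; ranking = rfe.ranking_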
20
Scaling Factors
  • Idea: transform a discrete space into a
    continuous space.

s = (s1, s2, s3, s4)

  • Discrete indicators of feature presence: si ∈ {0, 1}
  • Continuous scaling factors: si ∈ IR
Now we can do gradient descent!
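
To illustrate the idea (a sketch with squared loss and an l1 penalty on s,
both my choices, not the talk's formalism): with continuous scaling factors
the objective becomes differentiable, so w and s can be optimized jointly by
plain gradient descent, and features whose factor shrinks toward zero can be
pruned:

    import numpy as np

    def scaling_factor_descent(X, y, lam=0.1, lr=0.01, steps=1000):
        m, n = X.shape
        w, s = np.zeros(n), np.ones(n)           # weights and scaling factors
        for _ in range(steps):
            r = (X * s) @ w - y                  # residuals of the scaled model
            grad_w = 2.0 / m * (X * s).T @ r
            grad_s = 2.0 / m * (X.T @ r) * w + lam * np.sign(s)
            w -= lr * grad_w
            s -= lr * grad_s
        return w, s                              # small |s_j|: feature j prunable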
21
Formalism (chap. 5)
  • Many learning algorithms are cast into a
    minimization of some regularized functional:

    min over α of  Σk L(f(α, xk), yk)  +  Ω(α)
                   [empirical error]      [regularization / capacity control]

Justification of RFE and many other embedded
methods.
Next few slides: André Elisseeff
22
Embedded method
  • Embedded methods are a good inspiration to design
    new feature selection techniques for your own
    algorithms:
  • Find a functional that represents your prior
    knowledge about what a good model is.
  • Add the s weights into the functional and make
    sure it is either differentiable or you can
    perform a sensitivity analysis efficiently.
  • Optimize alternately with respect to α and s.
  • Use early stopping (validation set) or your own
    stopping criterion to stop and select the subset
    of features.
  • Embedded methods are therefore not too far from
    wrapper techniques and can be extended to
    multiclass, regression, etc.

23
The l1 SVM
  • A version of the SVM where the regularizer
    W(w) = ||w||2 is replaced by the l1 norm
    W(w) = Σi |wi|.
  • Can be considered an embedded feature selection
    method.
  • Some weights are driven to zero (tends to
    remove redundant features).
  • Differs from the regular SVM, where redundant
    features are kept.

Bi et al., 2003; Zhu et al., 2003
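
In scikit-learn terms, this variant can be sketched as follows (an
illustration, not the implementation used in the cited papers):

    import numpy as np
    from sklearn.svm import LinearSVC

    # The l1 penalty drives some weights exactly to zero; the surviving
    # features are the selected ones.
    l1_svm = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0)
    # l1_svm.fit(X, y)
    # selected = np.flatnonzero(np.abs(l1_svm.coef_).max(axis=0) > 1e-8)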
24
The l0 SVM
  • Replace the regularizer ||w||2 by the l0 norm
    ||w||0 (the number of non-zero weights).
  • Further approximate it by Σi log(ε + |wi|).
  • This boils down to the following multiplicative
    update algorithm:

Weston et al., 2003
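
The update formula itself is not reproduced in this transcript, so the
following is only a hedged sketch in the spirit of Weston et al. (2003):
alternately train a linear SVM and rescale each feature by the magnitude of
its weight, so that the scales of irrelevant features shrink toward zero
(iteration count and thresholds are placeholders):

    import numpy as np
    from sklearn.svm import LinearSVC

    def l0_style_selection(X, y, iters=10, C=1.0):
        scale = np.ones(X.shape[1])
        for _ in range(iters):
            svm = LinearSVC(C=C, max_iter=10000).fit(X * scale, y)
            scale *= np.abs(svm.coef_).ravel()       # multiplicative update
            scale /= scale.max() + 1e-12             # keep the scale bounded
        return np.flatnonzero(scale > 1e-6)          # surviving features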
25
Causality
26
What can go wrong?
Guyon-Aliferis-Elisseeff, 2007
27
What can go wrong?
28
What can go wrong?
Guyon-Aliferis-Elisseeff, 2007
29
http://clopinet.com/causality
30
Wrapping up
31
Bilevel optimization
[Figure: data matrix of M samples × N variables/features split into training,
validation, and test sets]

Split the data into 3 sets: training, validation, and
test set.
  • 1) For each feature subset, train the predictor on
    the training data.
  • 2) Select the feature subset which performs best
    on the validation data.
  • Repeat and average if you want to reduce variance
    (cross-validation).
  • 3) Test on the test data.
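
The three steps amount to a few lines of code; this sketch assumes the
candidate feature subsets are given as lists of column indices and that the
three splits have already been made (all names are mine):

    import numpy as np
    from sklearn.base import clone

    def select_and_test(model, subsets, X_tr, y_tr, X_va, y_va, X_te, y_te):
        # 1) train on the training set for every candidate subset
        val_scores = [clone(model).fit(X_tr[:, s], y_tr).score(X_va[:, s], y_va)
                      for s in subsets]
        # 2) keep the subset that performs best on the validation set
        best = subsets[int(np.argmax(val_scores))]
        # 3) report performance once on the held-out test set
        final = clone(model).fit(X_tr[:, best], y_tr)
        return best, final.score(X_te[:, best], y_te)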
32
Complexity of Feature Selection
With high probability:

Generalization_error ≤ Validation_error + ε(C/m2)

[Figure: error versus feature subset size n]

m2 = number of validation examples, N = total
number of features, n = feature subset size.

Try to keep C of the order of m2.
33
CLOP: http://clopinet.com/CLOP/
  • CLOP = Challenge Learning Object Package.
  • Based on the Matlab Spider package developed at
    the Max Planck Institute.
  • Two basic abstractions:
  • Data object
  • Model object
  • Typical script:
  • D = data(X,Y);          % Data constructor
  • M = kridge;             % Model constructor
  • [R, Mt] = train(M, D);  % Train model => Mt
  • Dt = data(Xt, Yt);      % Test data constructor
  • Rt = test(Mt, Dt);      % Test the model

34
NIPS 2003 FS challenge
http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html
35
Conclusion
  • Feature selection focuses on uncovering subsets
    of variables X1, X2, ... that are predictive of the
    target Y.
  • Multivariate feature selection is in principle
    more powerful than univariate feature selection,
    but not always in practice.
  • Taking a closer look at the type of dependencies
    in terms of causal relationships may help refine
    the notion of variable relevance.

36
Acknowledgements and references
  • 1) Feature Extraction, Foundations and Applications.
    I. Guyon et al., Eds.
    Springer, 2006.
    http://clopinet.com/fextract-book
  • 2) Causal feature selection.
    I. Guyon, C. Aliferis, A. Elisseeff.
    To appear in Computational Methods of Feature
    Selection, Huan Liu and Hiroshi Motoda, Eds.,
    Chapman and Hall/CRC Press, 2007.
    http://clopinet.com/causality