Title: Feature Selection and Causal Discovery
1. Feature Selection and Causal Discovery
- Isabelle Guyon, Clopinet
- André Elisseeff, IBM Zürich
- Constantin Aliferis, Vanderbilt University
2. Road Map
Feature selection
- What is feature selection?
- Why is it hard?
- What works best in practice?
- How to make progress using causality?
Causal discovery
- Can causal discovery benefit from feature selection?
3. Introduction
4. Causal discovery
- What affects your health?
- What affects the economy?
- What affects climate change?
- ... and which actions will have beneficial effects?
5. Feature Selection
(Diagram: features X and target Y)
Remove features Xi to improve (or at least not degrade) the prediction of Y.
6. Uncovering Dependencies
(Diagram: factors of variability, classified as actual vs. artifactual, known vs. unknown, observable vs. unobservable, controllable vs. uncontrollable)
7. Predictions and Actions
(Diagram: features X and target Y)
See e.g. Judea Pearl, Causality, 2000.
8. Predictive power of causes and effects
(Diagram: Smoking, Lung disease, Coughing)
Smoking is a better predictor of lung disease than coughing.
9. Causal feature selection
- Abandon the usual motto of predictive modeling: "we don't care about causality."
- Feature selection may benefit from introducing a notion of causality:
  - To be able to predict the consequences of given actions.
  - To add robustness to the predictions if the input distribution changes.
  - To get more compact and robust feature sets.
10. FS-enabled causal discovery
- Isn't causal discovery solved with experiments?
- No! Randomized Controlled Trials (RCTs) may be
  - Unethical (e.g. an RCT about the effects of smoking)
  - Costly and time consuming
  - Impossible (e.g. astronomy)
- Observational data may be available to help plan future experiments ⇒ causal discovery may benefit from feature selection.
11. Feature selection basics
12. Individual Feature Irrelevance
- P(Xi, Y) = P(Xi) P(Y)
- Equivalently: P(Xi | Y) = P(Xi)
(Figure: class-conditional densities of xi; an irrelevant feature has the same density for both classes. A numeric sketch follows.)
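A minimal numeric sketch of this criterion, using synthetic data and scikit-learn's mutual information estimator (the variable names and the data-generating process are illustrative assumptions, not part of the tutorial): a feature independent of Y gets a score near zero, a dependent one does not.

```python
# Illustrative sketch (assumed synthetic data): score each feature with an
# estimate of its mutual information with Y; independence, P(Xi, Y) = P(Xi)P(Y),
# corresponds to a score near zero.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)              # binary target
x_rel = y + rng.normal(0, 0.5, size=1000)      # shifted by y: individually relevant
x_irr = rng.normal(0, 1, size=1000)            # pure noise: individually irrelevant

X = np.column_stack([x_rel, x_irr])
print(mutual_info_classif(X, y, random_state=0))  # first score >> second (~0)
```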
13. Individual Feature Relevance
(Figure: class-conditional densities of xi for classes -1 and +1, with means m-, m+ and standard deviations s-, s+; a relevant feature separates the two classes.)
14. Univariate selection may fail
(Guyon & Elisseeff, JMLR 2004; Springer 2006)
15. Multivariate FS is complex
(Kohavi & John, 1997)
n features, 2^n possible feature subsets!
16. FS strategies
- Wrappers
  - Use the target risk functional to evaluate feature subsets.
  - Train one learning machine for each feature subset investigated.
- Filters
  - Use an evaluation function other than the target risk functional.
  - Often no learning machine is involved in the feature selection process.
(A sketch contrasting the two strategies follows.)
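A hedged sketch of the two strategies on synthetic data (the dataset, the models, and the subset size are illustrative assumptions): the filter ranks features with a univariate statistic and involves no learning machine, while the wrapper evaluates each candidate subset by the cross-validated accuracy of a learning machine trained on it.

```python
# Illustrative sketch (assumed dataset and models): filter vs. wrapper.
import numpy as np
from itertools import combinations
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

# Filter: rank features by a univariate statistic, no learning machine involved.
f_scores, _ = f_classif(X, y)
print("filter ranking:", np.argsort(f_scores)[::-1])

# Wrapper: evaluate each candidate subset with the cross-validated risk of the
# learning machine itself (restricted here to subsets of size 3 for tractability).
best_subset, best_acc = None, -np.inf
for subset in combinations(range(X.shape[1]), 3):
    acc = cross_val_score(LogisticRegression(max_iter=1000),
                          X[:, list(subset)], y, cv=5).mean()
    if acc > best_acc:
        best_subset, best_acc = subset, acc
print("wrapper choice:", best_subset, round(best_acc, 3))
```

The wrapper loop makes the 2^n cost of subset search explicit: it is affordable here only because n = 6 and the search is restricted to subsets of size 3.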
17. Reducing complexity
- For wrappers
  - Use forward or backward selection: O(n^2) steps.
  - Mix forward and backward search, e.g. floating search.
- For filters
  - Use a cheap evaluation function (no learning machine).
  - Make independence assumptions: n evaluations.
- Embedded methods
  - Do not retrain the LM at every step, e.g. RFE: n steps.
  - Search the FS space and LM parameter space simultaneously, e.g. 1-norm/Lasso approaches (see the sketch below).
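A sketch of the two embedded methods named above, using scikit-learn as a stand-in (the data and hyperparameters are illustrative assumptions): RFE repeatedly drops the lowest-weight feature and re-fits, while a 1-norm (Lasso-style) penalty drives some weights to exactly zero during a single fit.

```python
# Illustrative sketch (assumed data and hyperparameters) of two embedded methods.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           n_redundant=0, random_state=0)

# RFE: repeatedly drop the feature with the smallest weight and re-fit,
# roughly n steps instead of searching 2^n subsets.
rfe = RFE(LinearSVC(dual=False, max_iter=5000), n_features_to_select=3).fit(X, y)
print("RFE keeps:", rfe.support_)

# 1-norm / Lasso-style approach: the L1 penalty drives some weights to exactly
# zero, selecting features and fitting the model in one optimization.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
print("non-zero weights:", (l1_model.coef_ != 0).ravel())
```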
18. In practice
- Univariate feature selection often yields better accuracy results than multivariate feature selection.
- NO feature selection at all sometimes gives the best accuracy results, even in the presence of known distracters.
- Multivariate methods usually claim only better parsimony.
- How can we make multivariate FS work better?
NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
19. Definition of irrelevance
- We want to determine whether a variable Xi is relevant to the target Y.
- Surely irrelevant feature:
  - P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i)
  - for all S\i ⊆ X\i
  - for all assignments of values to S\i
- Are all non-irrelevant features relevant? (A toy illustration of this test follows.)
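A toy illustration of the "surely irrelevant" test, under the assumption of discrete variables so that conditional mutual information can be estimated from counts (the data and the cmi helper are illustrative, not a standard library routine): the independence must hold for every subset S\i of the remaining features.

```python
# Illustrative sketch (assumed discrete synthetic data; cmi is a hand-rolled
# helper, not a library routine): Xi is "surely irrelevant" only if
# I(Xi; Y | S) is (approximately) zero for every subset S of the other features.
import numpy as np
from collections import Counter
from itertools import combinations

def cmi(x, y, S):
    """Empirical conditional mutual information I(X; Y | S) for discrete data.
    S is an (n, k) array of conditioning columns; k may be 0."""
    n = len(x)
    s_keys = [tuple(row) for row in S]            # () when there is no conditioning
    p_xys, p_xs = Counter(zip(x, y, s_keys)), Counter(zip(x, s_keys))
    p_ys, p_s = Counter(zip(y, s_keys)), Counter(s_keys)
    return sum((c / n) * np.log(c * p_s[s] / (p_xs[(xi, s)] * p_ys[(yi, s)]))
               for (xi, yi, s), c in p_xys.items())

rng = np.random.default_rng(0)
n = 2000
x_rel = rng.integers(0, 2, n)
y = np.where(rng.random(n) < 0.2, 1 - x_rel, x_rel)   # y is a noisy copy of x_rel
x_noise = rng.integers(0, 2, n)                        # independent of everything
X = np.column_stack([x_rel, x_noise])

for i, name in [(1, "x_noise"), (0, "x_rel")]:
    others = [j for j in range(X.shape[1]) if j != i]
    vals = [cmi(X[:, i], y, X[:, list(s)])
            for r in range(len(others) + 1) for s in combinations(others, r)]
    print(name, [round(v, 4) for v in vals])
# x_noise: both values near 0 -> passes the irrelevance test for every subset
# x_rel:   clearly positive   -> relevant
```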
20. Causality enters the picture
21. Causal Bayesian networks
- Bayesian network:
  - Graph with random variables X1, X2, ..., Xn as nodes.
  - Dependencies represented by edges.
  - Allows us to compute P(X1, X2, ..., Xn) as ∏i P(Xi | Parents(Xi)) (see the sketch below).
  - Edge directions have no meaning.
- Causal Bayesian network: edge directions indicate causality.
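A minimal sketch of this factorization on a three-node toy network; the structure (Smoking → Lung disease → Coughing) and all probability values are assumptions made up for illustration.

```python
# Illustrative sketch: joint probability as a product of local conditional
# tables, P(S, D, C) = P(S) P(D | S) P(C | D). The network structure and the
# numbers are made up for this example.
p_smoking = {1: 0.3, 0: 0.7}                                 # P(Smoking)
p_disease = {1: {1: 0.20, 0: 0.80}, 0: {1: 0.02, 0: 0.98}}   # P(Disease | Smoking)
p_cough   = {1: {1: 0.70, 0: 0.30}, 0: {1: 0.10, 0: 0.90}}   # P(Cough | Disease)

def joint(s, d, c):
    """P(Smoking=s, Disease=d, Cough=c) from the factorization."""
    return p_smoking[s] * p_disease[s][d] * p_cough[d][c]

# Sanity check: the factorized joint sums to 1 over all 2^3 assignments.
print(sum(joint(s, d, c) for s in (0, 1) for d in (0, 1) for c in (0, 1)))
```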
22. Markov blanket
(Diagram: a causal network over Smoking, Lung disease, Allergy, and Coughing)
A node is conditionally independent of all other nodes given its Markov blanket.
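A hedged sketch of how the Markov blanket is read off a causal Bayesian network: the node's parents, its children, and its children's other parents ("spouses"). The DAG below is an assumed reconstruction of the slide's diagram, not taken verbatim from it.

```python
# Illustrative sketch (assumed reconstruction of the slide's DAG): the Markov
# blanket of a node is the union of its parents, its children, and its
# children's other parents.
parents = {
    "Smoking": [],
    "Allergy": [],
    "Lung disease": ["Smoking"],
    "Coughing": ["Lung disease", "Allergy"],
}

def markov_blanket(node):
    children = [v for v, ps in parents.items() if node in ps]
    spouses = {p for c in children for p in parents[c] if p != node}
    return set(parents[node]) | set(children) | spouses

# Conditioned on {Smoking, Coughing, Allergy}, Lung disease is independent
# of every other node in the network.
print(markov_blanket("Lung disease"))   # {'Smoking', 'Coughing', 'Allergy'}
```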
23. Relevance revisited
- In terms of Bayesian networks, in faithful distributions:
  - Strongly relevant features: members of the Markov blanket.
  - Weakly relevant features: variables with a path to the Markov blanket, but not in the Markov blanket.
  - Irrelevant features: variables with no path to the Markov blanket.
(Koller & Sahami, 1996; Kohavi & John, 1997; Aliferis et al., 2002)
24. Is X2 relevant?
(Example 1)
P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)
25. Are X1 and X2 relevant?
(Example 2; figure: Y labels the classes "disease" and "normal")
P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)
26. XOR and unfaithfulness
Y = X1 ⊕ X2
Example: X1 and X2 are two fair coins tossed at random; Y = win if both coins end on the same side. (A numeric check follows.)
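A numeric check of the XOR example with scikit-learn's mutual information estimator (the sample size and the 0/1 coin encoding are illustrative assumptions): each coin alone scores zero, so univariate selection discards both, yet together they determine Y exactly.

```python
# Illustrative numeric check (assumed sample size and 0/1 coin encoding):
# univariate mutual information is ~0 for each coin, but the pair (here
# summarized by X1 XOR X2) carries the full ~log 2 nats about Y.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
x1 = rng.integers(0, 2, 5000)
x2 = rng.integers(0, 2, 5000)
y = (x1 == x2).astype(int)        # "win if both coins end on the same side"

print(mutual_info_classif(np.column_stack([x1, x2]), y,
                          discrete_features=True, random_state=0))  # both ~ 0
print(mutual_info_classif((x1 ^ x2).reshape(-1, 1), y,
                          discrete_features=True, random_state=0))  # ~ 0.69 (log 2)
```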
27. Adding a variable
(Example 3)
28. X1 Y X2
(Example 3; figure: life expectancy vs. chocolate intake)
Is chocolate good for your health?
29. Really?
(Example 3; figure: life expectancy vs. chocolate intake)
Is chocolate good for your health?
30. Same independence relations, different causal relations
P(X1, X2, Y) = P(X1 | X2) P(Y | X2) P(X2)
P(X1, X2, Y) = P(Y | X2) P(X2 | X1) P(X1)
P(X1, X2, Y) = P(X1 | X2) P(X2 | Y) P(Y)
31. Is X1 relevant?
(Example 3)
32. Non-causal features may be predictive yet not relevant
(Examples 1, 2, and 3)
33. Causal feature discovery
P(X, Y) = P(X | Y) P(Y)
P(X, Y) = P(Y | X) P(X)
(Sun, Janzing & Schoelkopf, 2005)
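A small sketch of the point made by the two factorizations above: estimated from data, both reconstruct the same joint table, so the joint distribution alone cannot orient the edge; Sun, Janzing and Schoelkopf propose comparing properties of the two conditionals instead (not implemented here). The data-generating process below is an illustrative assumption.

```python
# Illustrative sketch (assumed data-generating process): both factorizations
# reconstruct exactly the same joint table, so the joint distribution alone
# cannot tell cause from effect.
import numpy as np

rng = np.random.default_rng(0)
x = rng.integers(0, 2, 10000)
y = np.where(rng.random(10000) < 0.8, x, 1 - x)    # Y is a noisy copy of X

joint = np.zeros((2, 2))
for xi, yi in zip(x, y):
    joint[xi, yi] += 1
joint /= joint.sum()

p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
p_y_given_x = joint / p_x[:, None]
p_x_given_y = joint / p_y[None, :]

print(np.allclose(p_y_given_x * p_x[:, None], joint))   # P(Y|X) P(X) == P(X,Y)
print(np.allclose(p_x_given_y * p_y[None, :], joint))   # P(X|Y) P(Y) == P(X,Y)
```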
34. Conclusion
- Feature selection focuses on uncovering subsets of variables X1, X2, ... predictive of the target Y.
- Taking a closer look at the type of dependencies may help refine the notion of variable relevance.
- Uncovering causal relationships may yield better feature selection, robust under distribution changes.
- These causal features may be better targets of action.