Title: Feature Selection and Causality Inference
1. Feature Selection and Causality Inference
- Isabelle Guyon, Clopinet
- André Elisseeff, IBM Zürich
2. Purpose
- What affects your health?
- What affects the economy?
- What affects climate changes?
- and which actions will have beneficial effects?
3. Road Map
- What is feature selection?
- Why is it hard?
- What works best in practice?
- How are we going to make progress?
4. Feature Selection
[Figure: data matrix X and target Y.]
Remove features to improve (or at least minimally degrade) performance.
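The "remove features" idea above can be sketched as greedy backward elimination. This is an illustrative sketch, not the speakers' algorithm; the `score` function is a hypothetical stand-in for cross-validated model performance.

```python
# Greedy backward elimination, a sketch of "remove features to improve
# (or least degrade) performance". The scorer is a hypothetical stand-in
# for cross-validated model performance.
def backward_elimination(features, score):
    """Repeatedly drop the feature whose removal helps (or hurts least)."""
    current = list(features)
    while len(current) > 1:
        candidates = [[f for f in current if f != g] for g in current]
        best = max(candidates, key=score)
        if score(best) < score(current):
            break  # every removal degrades the score: stop
        current = best
    return current

# Toy score: only x1 and x3 carry signal; each extra feature costs 0.1.
useful = {"x1", "x3"}
score = lambda fs: len(useful & set(fs)) - 0.1 * len(fs)

selected = backward_elimination(["x1", "x2", "x3", "x4"], score)
```

With this toy scorer the distracters x2 and x4 are pruned and the useful pair survives.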
5. Uncovering Dependencies
Factors of variability:
- factual vs. artifactual
- known vs. unknown
- observable vs. unobservable
- controllable vs. uncontrollable
6. Predictions and Actions
[Figure: data matrix X and target Y.]
Judea Pearl, Causality, 2000
7. Individual Feature Irrelevance
- P(Xi, Y) = P(Xi) P(Y)
- equivalently, P(Xi | Y) = P(Xi)
[Figure: class-conditional densities of xi.]
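The irrelevance criterion P(Xi, Y) = P(Xi) P(Y) can be probed empirically: for discrete samples, the mutual information between Xi and Y is near zero exactly when they are independent. A minimal sketch; all variable names and parameters are illustrative.

```python
# Empirical check of P(Xi, Y) = P(Xi) P(Y) via mutual information (in nats).
# All names and parameters are illustrative.
import random
from collections import Counter
from math import log

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y); ~0 when X and Y are independent."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

random.seed(0)
y = [random.randint(0, 1) for _ in range(20000)]
x_irrelevant = [random.randint(0, 1) for _ in range(20000)]         # independent of y
x_relevant = [yi if random.random() < 0.9 else 1 - yi for yi in y]  # 90% agreement

mi_irr = mutual_information(x_irrelevant, y)  # close to 0
mi_rel = mutual_information(x_relevant, y)    # clearly positive
```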
8. Individual Feature Relevance
[Figure: class-conditional densities for y = -1 and y = +1, with means μ- and μ+ and standard deviations σ- and σ+.]
9. Multivariate Cases
Guyon-Elisseeff, JMLR 2004; Springer 2006
10. Is multivariate FS always best?
Kohavi-John, 1997
n features, 2^n possible feature subsets!
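The 2^n blow-up is easy to see by enumeration; a tiny sketch using Python's itertools (feature names are placeholders):

```python
# Why exhaustive multivariate search is hard: n features yield 2^n subsets.
# Feature names are placeholders.
from itertools import chain, combinations

def all_subsets(features):
    return list(chain.from_iterable(
        combinations(features, k) for k in range(len(features) + 1)))

subsets = all_subsets(["x1", "x2", "x3", "x4"])  # 2^4 = 16 subsets
```

Already at a few dozen features, scoring every subset is infeasible, which motivates greedy and univariate shortcuts.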
11. In practice
- Univariate feature selection often gives better results than multivariate feature selection.
- NO feature selection at all sometimes gives the best results, even in the presence of known distracters.
- How can we make multivariate FS work better?
NIPS 2003 and WCCI 2006 challenges
http://clopinet.com/challenges
12. Definition of relevance
- We want to determine whether one variable Xi is relevant to the target Y.
- Surely irrelevant feature:
  - P(Xi, Y | S\i) = P(Xi | S\i) P(Y | S\i)
  - for all S\i ⊆ X\i
  - for all assignments of values to S\i
- Are all non-irrelevant features relevant?
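The "surely irrelevant" condition is a conditional independence statement. For one fixed conditioning set S it can be checked via conditional mutual information I(Xi; Y | S), which is near zero exactly when the factorization holds. A sketch for a single binary S; a real test would scan all subsets S\i and all value assignments, and all names here are illustrative.

```python
# Checking P(Xi, Y | S) = P(Xi | S) P(Y | S) for one conditioning set S,
# via conditional mutual information I(Xi; Y | S). Illustrative names/parameters.
import random
from collections import Counter
from math import log

def conditional_mi(xs, ys, ss):
    """Plug-in estimate of I(X; Y | S); ~0 when X is independent of Y given S."""
    n, total = len(xs), 0.0
    strata = {}
    for x, y, s in zip(xs, ys, ss):
        strata.setdefault(s, []).append((x, y))
    for pairs in strata.values():
        m = len(pairs)
        pxy = Counter(pairs)
        px = Counter(x for x, _ in pairs)
        py = Counter(y for _, y in pairs)
        total += (m / n) * sum(
            (c / m) * log((c / m) / ((px[x] / m) * (py[y] / m)))
            for (x, y), c in pxy.items())
    return total

random.seed(1)
noisy = lambda b: b if random.random() < 0.8 else 1 - b
s = [random.randint(0, 1) for _ in range(20000)]
x = [noisy(si) for si in s]  # x depends on y only through s
y = [noisy(si) for si in s]

agree = sum(a == b for a, b in zip(x, y)) / len(x)  # marginal dependence (> 0.5)
cmi_xy_given_s = conditional_mi(x, y, s)            # ~0: x irrelevant given s
```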
13. Is X2 relevant? (Example 1)
P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)
14. Are X1 and X2 relevant? (Example 2)
[Figure: scatter plot of X1 vs. X2, classes Y = disease / normal.]
P(X1, X2, Y) = P(X1 | X2, Y) P(X2) P(Y)
15. Adding a variable (Example 3)
16. X1, Y, X2 (Example 3)
[Figure: life expectancy vs. chocolate intake.]
Is chocolate good for your health?
17. Really? (Example 3)
[Figure: life expectancy vs. chocolate intake.]
Is chocolate good for your health?
18. Same independence relations, different causal relations
P(X1, X2, Y) = P(X1 | X2) P(Y | X2) P(X2)
P(X1, X2, Y) = P(Y | X2) P(X2 | X1) P(X1)
P(X1, X2, Y) = P(X1 | X2) P(X2 | Y) P(Y)
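These factorizations can be probed by simulation: a common-cause model X1 ← X2 → Y and a chain X1 → X2 → Y induce the same observational dependence between X1 and Y, so observational data alone cannot distinguish the causal stories. A sketch with binary variables; parameters are chosen purely for illustration.

```python
# Two causal structures with the same observational signature:
# common cause X1 <- X2 -> Y vs. chain X1 -> X2 -> Y.
# Parameters are illustrative.
import random

random.seed(2)
N = 50000
noisy = lambda b: b if random.random() < 0.8 else 1 - b

# Common cause: X2 generates both X1 and Y.
x2 = [random.randint(0, 1) for _ in range(N)]
x1_cc = [noisy(v) for v in x2]
y_cc = [noisy(v) for v in x2]

# Chain: X1 generates X2, which generates Y.
x1_ch = [random.randint(0, 1) for _ in range(N)]
y_ch = [noisy(noisy(v)) for v in x1_ch]

def agree_rate(u, v):
    return sum(a == b for a, b in zip(u, v)) / len(u)

dep_cc = agree_rate(x1_cc, y_cc)  # X1 and Y agree well above chance
dep_ch = agree_rate(x1_ch, y_ch)  # same dependence, different causal story
```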
19. Is X1 relevant? (Example 3)
20. Non-causal features may be predictive yet not relevant (Examples 1, 2, 3)
21. Causal Features
P(X, Y) = P(X | Y) P(Y)
22. Experiments
- Features: gene expression coefficients.
- Samples: prostate tissues, tumor vs. control.
- Training data (Stanford): 87 laser-microdissected tissues (tumor: G3/4; control: NL, Dysplasia, BPH), U133A array (20,000 genes).
- Test data (Oncomine, 3 datasets): 164 tissues, U95A array (12,500 genes).
- X (87, 6839), Y (87); Xt (164, 6839), Yt (164).
23. Univariate Filter AUC
Fnum = 5, Errate = 0.28, BER = 0.26, AUC = 0.83
[Figure: ROC curve, Sensitivity vs. Specificity.]
24. Causal Feature Selection
Fnum = 5, Errate = 0.15, BER = 0.12, AUC = 0.95
[Figure: ROC curve, Sensitivity vs. Specificity.]
25. Causal features are robust under change of distribution
[Figure: number of genes found (among the 1000 best to predict test data) vs. gene rank (ranking performed with training examples).]
26. Conclusion
- Feature selection focuses on uncovering subsets of variables X1, X2, ... predictive of the target Y.
- Taking a closer look at the type of dependencies may help refine the notion of variable relevance.
- Uncovering causal relationships may yield better feature selection, robust under distribution changes.
- These causal features may be better targets of action.
27. http://clopinet.com/fextract-book
Feature Extraction: Foundations and Applications, I. Guyon et al., Eds., Springer, 2006.
- Tutorials
- NIPS03 challenge results
- Challenge data
- Sample code
- Teaching material
28. Extras (not in the talk)
29. Individual Feature Relevance (S2N)
[Figure: class-conditional densities for y = -1 and y = +1, with means μ- and μ+ and standard deviations σ- and σ+.]
S2N = (μ+ - μ-) / (σ+ + σ-)
Golub et al., Science, Vol. 286, 15 Oct. 1999
S2N ≈ R(x, y) after standardization x ← (x - μx)/σx
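The Golub et al. signal-to-noise criterion, S2N = (μ+ - μ-)/(σ+ + σ-), is only a few lines of code. A sketch with toy data, using population standard deviations:

```python
# Golub-style signal-to-noise criterion for ranking one feature:
# S2N = (mu+ - mu-) / (sigma+ + sigma-). Toy data; population std devs.
from statistics import mean, pstdev

def s2n(x, y):
    pos = [xi for xi, yi in zip(x, y) if yi == +1]
    neg = [xi for xi, yi in zip(x, y) if yi == -1]
    return (mean(pos) - mean(neg)) / (pstdev(pos) + pstdev(neg))

x = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]
y = [-1, -1, -1, +1, +1, +1]
value = s2n(x, y)  # well-separated classes give a large score
```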
30. Is X1 relevant?
[Figure: peak vs. temperature.]
Population selection bias
P(X1, X2, Y) = P(X1 | X2) P(Y | X2) P(X2)
31. Is X1 relevant?
[Figure: health status (X1), plate (X2), peak (Y).]
Confounding factor
P(X1, X2, Y) = P(Y | X2) P(X2 | X1) P(X1)