Title: Feature Selection Methods
1Feature Selection Methods
- An overview
- Thanks to Qiang Yang
- Modified by Charles Ling
2What is Feature Selection?
- Feature selection: the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest (DIMENSIONALITY REDUCTION)
- Humans and animals do this constantly!
3Motivational example from Biology
- Monkeys performing a classification task¹
- N. Sigala & N. Logothetis (2002). Visual categorization shapes feature selectivity in the primate temporal cortex.
¹ Natasha Sigala, Nikos Logothetis. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002).
4Motivational example from Biology
- Monkeys performing a classification task
- Diagnostic features: eye separation, eye height
- Non-diagnostic features: mouth height, nose length
5Motivational example from Biology
- Monkeys performing a classification task
- Results
- The activity of a population of 150 neurons in the anterior inferior temporal cortex was measured
- 44 neurons responded significantly differently to at least one feature
- After training, 72% (32/44) were selective to one or both of the diagnostic features (and not to the non-diagnostic features)
6Motivational example from Biology
- Monkeys performing a classification task
- Results (single neurons): the data from the present study indicate that neuronal selectivity was shaped by the most relevant subset of features during the categorization training.
7Feature Selection
- Reducing the feature space by throwing out some of the features (covariates)
- Also called variable selection
- Motivating idea: try to find a simple, parsimonious model
- Occam's razor: the simplest explanation that accounts for the data is best
8Feature Extraction
- Feature extraction is a process that extracts a new set of features from the original data through a numerical functional mapping.
- Idea:
- Given data points in d-dimensional space,
- Project into a lower-dimensional space while preserving as much information as possible
- E.g., find the best planar approximation to 3D data
- E.g., find the best planar approximation to 10^4-D data
9Feature Selection vs Feature Extraction
- The two approaches differ:
- Feature selection chooses a subset of the original features
- Feature extraction creates new features (dimensions) defined as functions over all of the features
10Outline
- What is Feature Reduction?
- Feature Selection
- Feature Extraction
- Why need Feature Reduction?
- Feature Selection Methods
- Filter
- Wrapper
- Feature Extraction Methods
- Linear
- Nonlinear
11Motivation
- The objective of feature reduction is three-fold:
- Improving the prediction performance of the predictors (accuracy)
- Providing faster and more cost-effective predictors (CPU time)
- Providing a better understanding of the underlying process that generated the data (interpretability)
12Feature reduction: examples
- Task 1: classify whether a document is about cats. Data: word counts in the document.
- Task 2: predict chances of lung disease. Data: medical history survey.
- X (Task 1, word counts): cat 2, and 35, it 20, kitten 8, electric 2, trouble 4, then 5, several 9, feline 2, while 4, lemon 2
- X (Task 2, survey): Vegetarian: No, Plays video games: Yes, Family history: No, Athletic: No, Smoker: Yes, Sex: Male, Lung capacity: 5.8 L, Hair color: Red, Car: Audi, Weight: 185 lbs
- Reduced X (Task 1): cat 2, kitten 8, feline 2
- Reduced X (Task 2): Family history: No, Smoker: Yes
13Feature reduction in task 1
- Task 1: We're interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor).
- Text classification
- Features for all 10^5 English words, and maybe all word pairs
- Common practice: throw in every feature you can think of, let feature selection get rid of the useless ones
- Training is too expensive with all features
- The presence of irrelevant features hurts generalization.
14Feature reduction in task 2
- Task 2: We're interested in the features; we want to know which are relevant. If we fit a model, it should be interpretable.
- What causes lung cancer?
- Features are aspects of a patient's medical history
- Binary response variable: did the patient develop lung cancer?
- Which features best predict whether lung cancer will develop? We might want to legislate against these features.
15Get at Case 2 through Case 1
- Even if we just want to identify features, it can be useful to pretend we want to do prediction.
- Relevant features are (typically) exactly those that most aid prediction.
- But not always: highly correlated features may be redundant, yet both interesting as causes.
- e.g., smoking in the morning, smoking at night
16Outline
- What is Feature Reduction?
- Feature Selection
- Feature Extraction
- Why need Feature Reduction?
- Feature Selection Methods
- Filtering
- Wrapper
- Feature Extraction Methods
- Linear
- Nonlinear
17Filtering methods
- Basic idea: assign a score to each feature f indicating how related x_f and y are.
- Intuition: if x_{i,f} = y_i for all i, then f is good no matter what our model is; it contains all the information about y.
- Many popular scores [see Yang and Pedersen, 1997]
- Classification with categorical data: chi-squared, information gain
- Can use binning to make continuous data categorical
- Regression: correlation, mutual information
- Markov blanket [Koller and Sahami, 1996]
- Then somehow pick how many of the highest-scoring features to keep (nested models); a minimal scoring sketch follows below.
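As an illustration only (not from the slides), the sketch below scores each feature independently against the label using scikit-learn's mutual_info_classif and chi2; the Iris data and the choice to keep the top two features are assumptions made purely for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, chi2

X, y = load_iris(return_X_y=True)          # toy dataset, assumed only for illustration

# score each feature independently against the class label
mi = mutual_info_classif(X, y, random_state=0)
chi2_scores, p_values = chi2(X, y)          # chi2 expects non-negative feature values

# rank features by score and keep, say, the top 2 (the cutoff is arbitrary here)
top2 = np.argsort(mi)[::-1][:2]
print("mutual information scores:", np.round(mi, 3))
print("top-2 features by MI:", top2)
```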
18Filtering methods
- Advantages
- Very fast
- Simple to apply
- Disadvantages
- Doesn't take into account which learning algorithm will be used.
- Doesn't take into account correlations between features
- This can be an advantage if we're only interested in ranking the relevance of features, rather than performing prediction.
- It is also a significant disadvantage (see homework)
- Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features
- Caveat here too: apparently useless features can be useful when grouped with others
19Wrapper Methods
- The learner is considered a black box
- The interface of the black box is used to score subsets of variables according to the predictive power of the learner when using those subsets.
- Results vary for different learners
- One needs to define:
- how to search the space of all possible variable subsets
- how to assess the prediction performance of a learner
20Wrapper Methods
- The problem of finding the optimal subset is NP-hard!
- A wide range of heuristic search strategies can be used. Two different classes:
- Forward selection (start with an empty feature set and add features at each step)
- Backward elimination (start with the full feature set and discard features at each step)
- Predictive power is usually measured on a validation set or by cross-validation
- By using the learner as a black box, wrappers are universal and simple!
- Criticism: a large amount of computation is required.
21Wrapper Methods
22Feature selection search strategy
- Exhaustive search. Property: evaluate all C(d, m) possible subsets of size m. Comments: guaranteed to find the optimal subset; not feasible for even moderately large values of m and d.
- Sequential Forward Selection (SFS). Property: select the best single feature, then add one feature at a time which, in combination with the already selected features, maximizes the criterion function. Comments: once a feature is retained, it cannot be discarded; computationally attractive since, to select a subset of size 2, it examines only (d-1) possible subsets. (A minimal SFS sketch follows below.)
- Sequential Backward Selection (SBS). Property: start with all d features and successively delete one feature at a time. Comments: once a feature is deleted, it cannot be brought back into the optimal subset; requires more computation than sequential forward selection.
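A minimal sequential forward selection sketch, assuming scikit-learn, a logistic-regression pipeline as the black-box learner, 5-fold cross-validation as the criterion, and a cap of five features; all of these choices are illustrative, not prescribed by the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                 # toy dataset (assumption)
learner = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))  # the "black box"

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
for _ in range(5):                                          # grow the subset up to 5 features
    # score every candidate feature added to the current subset by CV accuracy
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:                                # stop when no candidate improves
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```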
23Comparison of filter and wrapper
- The wrapper method is tied to a specific classification algorithm, hence the criterion can be directly optimized
- but it is potentially very time-consuming, since it typically needs to evaluate a cross-validation scheme at every iteration.
- The filtering method is much faster, but it does not incorporate learning.
24Multivariate FS is complex
[Kohavi and John, 1997]
With n features, there are 2^n possible feature subsets!
25In practice
- Univariate feature selection often yields better accuracy results than multivariate feature selection.
- NO feature selection at all sometimes gives the best accuracy results, even in the presence of known distracters.
- Multivariate methods usually claim only better parsimony.
- How can we make multivariate FS work better? See the NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
26Feature Extraction: Definition
- Given a set of features F = {x_1, ..., x_N},
- the feature extraction (construction) problem is to map F to some feature set F' that maximizes the learner's ability to classify patterns.
- This general definition subsumes feature selection (i.e., a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables); here F' ranges over the set of all possible feature sets.
27Linear, Unsupervised Feature Selection
- Question: Are attributes A1 and A2 independent?
- If they are very dependent, we can remove either A1 or A2
- If A1 is independent of a class attribute A2, we can remove A1 from our training data (a minimal redundancy filter is sketched below)
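As a rough, assumption-laden illustration of removing dependent attributes (not the slides' own method), one can drop one attribute from every highly correlated pair; the 0.95 threshold below is an arbitrary choice.

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Simple unsupervised redundancy filter: keep a feature only if it is not
    nearly collinear with a feature that was already kept (threshold is tunable)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature |correlation|
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

# toy usage on synthetic data (for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
X_reduced, kept = drop_correlated(X)
print("kept feature indices:", kept)             # the near-duplicate column is dropped
```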
28Chi-Squared Test (cont.)
- Question: Are attributes A1 and A2 independent?
- These features are nominal-valued (discrete)
- Null hypothesis: the attributes are independent
- Example data (Outlook, Temperature): (Sunny, High), (Cloudy, Low), (Sunny, High)
29The Weather example: Observed Counts
- Observed counts (Outlook vs. Temperature):
- Sunny: High = 2, Low = 0 (Outlook subtotal 2)
- Cloudy: High = 0, Low = 1 (Outlook subtotal 1)
- Temperature subtotals: High = 2, Low = 1; total count in table = 3
- Example data (Outlook, Temperature): (Sunny, High), (Cloudy, Low), (Sunny, High)
30The Weather example: Expected Counts
If the attributes were independent, the expected counts would be as follows (expected count = total × row probability × column probability):
- Sunny: High = 3 × (2/3) × (2/3) = 4/3 ≈ 1.3, Low = 3 × (2/3) × (1/3) = 2/3 ≈ 0.6; subtotal 2 (prob 2/3)
- Cloudy: High = 3 × (1/3) × (2/3) ≈ 0.6, Low = 3 × (1/3) × (1/3) ≈ 0.3; subtotal 1 (prob 1/3)
- Subtotals: High = 2 (prob 2/3), Low = 1 (prob 1/3); total count in table = 3
- Example data (Outlook, Temperature): (Sunny, High), (Cloudy, Low), (Sunny, High)
31Question: How different are the observed and expected counts?
- The chi-squared statistic: X² = Σ (O − E)² / E, summed over all cells
- If the chi-squared value is very large, then A1 and A2 are not independent; that is, they are dependent!
- Degrees of freedom: if the table has n × m cells, then degrees of freedom = (n − 1)(m − 1)
- In our example:
- Degrees of freedom = 1
- Chi-squared = ? (a small computational check is sketched below)
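A quick computational check of this example, assuming SciPy is available; scipy.stats.chi2_contingency computes both the expected counts and the chi-squared statistic (correction=False disables Yates' continuity correction so the result matches the hand calculation above).

```python
from scipy.stats import chi2_contingency

observed = [[2, 0],   # Sunny:  High, Low
            [0, 1]]   # Cloudy: High, Low

chi2_stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, dof)   # 3.0 with 1 degree of freedom
print(expected)         # [[1.33 0.67], [0.67 0.33]], matching the expected-count table
```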
32Chi-Squared Table: what does it mean?
- If the calculated value is much greater than the value in the table, then you have reason to reject the independence assumption
- When your calculated chi-squared value is greater than the X² value shown in the 0.05 column (3.84) of the table, you are 95% certain that the attributes are actually dependent!
- i.e., there is only a 5% probability that your calculated X² value would occur by chance
33Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
- We don't have to have a two-dimensional count table (also known as a contingency table)
- Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1,
- but in the Honours class over the past ten years there have been 80 females and 40 males.
- Question: Is this a significant departure from the 1:1 expectation?
- Observed (Honours): Male = 40, Female = 80, Total = 120
34Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
- Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1,
- but in the Honours class over the past ten years there have been 80 females and 40 males.
- Question: Is this a significant departure from the 1:1 expectation?
- Note: here the expected counts are filled in from the 1:1 expectation, instead of being calculated from marginals
- Expected (Honours): Male = 60, Female = 60, Total = 120
35Chi-Squared Calculation
                       Female   Male   Total
Observed numbers (O)       80     40     120
Expected numbers (E)       60     60     120
O − E                      20    −20       0
(O − E)²                  400    400
(O − E)² / E             6.67   6.67    Sum = 13.34 = X²
36Chi-Squared Test (Cont.)
- Then, check the chi-squared table for significance:
- http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
- Compare our X² value with a χ² (chi-squared) value in a table of χ² with n − 1 degrees of freedom
- n is the number of categories, i.e., 2 in our case (males and females)
- We have only one degree of freedom (n − 1). From the χ² table, we find a "critical value" of 3.84 for p = 0.05.
- 13.34 > 3.84, so the expectation (that the Male:Female ratio in the Honours class is 1:1) is wrong! (See the computational check below.)
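The same test can be verified with SciPy's one-dimensional chisquare, which takes the observed and expected counts directly; this is only a sketch checking the numbers above.

```python
from scipy.stats import chisquare

observed = [80, 40]    # females, males in the Honours class
expected = [60, 60]    # what a 1:1 ratio would predict for 120 students

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)   # 13.33  (> 3.84, the critical value for p = 0.05, df = 1)
print(result.pvalue)      # about 0.0003, so we reject the 1:1 hypothesis
```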
37Chi-Squared Test in Weka weather.nominal.arff
38Chi-Squared Test in Weka
39Chi-Squared Test in Weka
40Example of Decision Tree Induction
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- [Decision tree figure: the induced tree splits only on A4, A6, and A1, with leaves labeled Class 1 and Class 2]
- Reduced attribute set: {A1, A4, A6}
41Unsupervised Feature Extraction: PCA
- Given N data vectors (samples) in k dimensions (features), find c < k orthogonal dimensions that can best be used to represent the data
- The feature set is reduced from k to c
- Example: data = a collection of emails, k = 100 word counts, c = 10 new features
- The original data set is reduced by projecting the N data vectors onto the c principal components (reduced dimensions)
- Each (old) data vector X_j is (approximately) a linear combination of the c principal component vectors Y_1, Y_2, ..., Y_c through weights W_i:
- X_j = m + W_1 Y_1 + W_2 Y_2 + ... + W_c Y_c,  j = 1, 2, ..., N
- m is the mean of the data set
- W_1, W_2, ... are the component weights
- Y_1, Y_2, ... are the eigenvectors
- Works for numeric data only
- Used when the number of dimensions is large
42Principal Component Analysis
- See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- [Scatter-plot figure in the (X1, X2) plane: Y1 is the first eigenvector, Y2 the second; Y2 is ignorable. Key observation: Y1 points along the direction of largest variance!]
43Principal Component Analysis (PCA)
Principal Component Analysis: project onto the subspace with the most variance (unsupervised: it doesn't take y into account)
44Principal Component Analysis: one attribute first
- Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
- Question: how much spread is in the data along the axis? (distance to the mean)
- Variance = (standard deviation)²
45Now consider two dimensions
- X = Temperature, Y = Humidity. Data pairs (X, Y): (40, 90), (40, 90), (40, 90), (30, 90), (15, 70), (15, 70), (15, 70), (30, 90), (15, 70), (30, 70), (30, 70), (30, 90), (40, 70), (30, 90)
- Covariance measures the (linear) correlation between X and Y
- cov(X, Y) = 0: uncorrelated (no linear relationship)
- cov(X, Y) > 0: X and Y move in the same direction
- cov(X, Y) < 0: X and Y move in opposite directions
46More than two attributes: covariance matrix
- Contains covariance values between all possible pairs of dimensions (attributes)
- Example for three attributes (x, y, z), with a numpy sketch below:
  C = [ cov(x,x)  cov(x,y)  cov(x,z)
        cov(y,x)  cov(y,y)  cov(y,z)
        cov(z,x)  cov(z,y)  cov(z,z) ]
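A small numpy sketch of the covariance matrix for three attributes; the data values are made up purely for illustration.

```python
import numpy as np

# toy data: rows are samples, columns are the attributes x, y, z (made-up numbers)
data = np.array([[2.5, 2.4, 0.5],
                 [0.5, 0.7, 1.9],
                 [2.2, 2.9, 0.4],
                 [1.9, 2.2, 0.8],
                 [3.1, 3.0, 0.1]])

C = np.cov(data, rowvar=False)   # 3x3 matrix; C[i, j] = cov(attribute i, attribute j)
print(C.shape)                   # (3, 3); symmetric, with variances on the diagonal
```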
47Background: eigenvalues and eigenvectors
- Eigenvectors e of C satisfy C e = λ e
- How to calculate e and λ?
- Calculate det(C − λI); this yields a polynomial of degree n
- Determine the roots of det(C − λI) = 0; the roots are the eigenvalues λ
- Check out any math book, such as Elementary Linear Algebra by Howard Anton (publisher: John Wiley & Sons)
- Or any math package, such as MATLAB
48Steps of PCA
- Calculate the eigenvalues λ and eigenvectors e of the covariance matrix C
- Eigenvalue λ_j corresponds to the variance along component j
- Thus, sort by λ_j
- Take the first n eigenvectors e_i, where n is the number of top eigenvalues kept
- These are the directions with the largest variances (see the sketch below)
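A minimal numpy implementation of these steps (center, covariance, eigendecomposition, sort, project); the synthetic data and the choice of two components are assumptions for the example.

```python
import numpy as np

def pca(X, n_components):
    """Follow the slide's steps: adjust by the mean, covariance matrix,
    eigendecomposition, sort by eigenvalue, project onto the top directions."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # adjust data by the mean
    C = np.cov(Xc, rowvar=False)                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]               # largest variance first
    components = eigvecs[:, order[:n_components]]   # top-n eigenvectors
    return Xc @ components, components, mean        # projected data, directions, mean

# toy usage on synthetic data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, comps, mu = pca(X, n_components=2)
print(Z.shape)   # (100, 2)
```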
49An Example
Mean1 = 24.1, Mean2 = 53.8 (X1' and X2' are the mean-centered values)
X1   X2   X1'   X2'
19 63 -5.1 9.25
39 74 14.9 20.25
30 87 5.9 33.25
30 23 5.9 -30.75
15 35 -9.1 -18.75
15 43 -9.1 -10.75
15 32 -9.1 -21.75
30 73 5.9 19.25
50Covariance Matrix
- C = [  75  106
        106  482 ]
- Using MATLAB, we find the eigenvectors and eigenvalues:
- e1 = (-0.98, -0.21), λ1 = 51.8
- e2 = (0.21, -0.98), λ2 = 560.2
- Thus the second eigenvector is the more important one!
51If we only keep one dimension: e2
- We keep the dimension of e2 = (0.21, -0.98)
- We obtain the final (projected) data as y_i = 0.21 · X1'_i − 0.98 · X2'_i:
- y_i: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
52Using Matlab to figure it out
53PCA in Weka
54Weather Data from the UCI Dataset (comes with the Weka package)
56Summary of PCA
- PCA is used for reducing the number of numerical attributes
- The key is in the data transformation:
- Adjust the data by the mean
- Find the eigenvectors of the covariance matrix
- Transform the data
- Note: the new features are only linear combinations of the data (weighted sums of the original attributes)
57Summary
- Data preparation is a big issue for data mining
- Data preparation includes transformations such as:
- Data sampling and feature selection
- Discretization
- Missing value handling
- Incorrect value handling
- Feature selection and feature extraction
58Linear Method: Linear Discriminant Analysis (LDA)
- LDA finds the projection that best separates the two classes
- Multiple discriminant analysis (MDA) extends LDA to multiple classes
- [Figure: best projection direction for classification]
59PCA vs. LDA
- PCA is unsupervised, while LDA is supervised.
- PCA can extract r (the rank of the data) principal features, while LDA can find at most (c − 1) features, where c is the number of classes.
- Both can be computed via the SVD technique.
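A short scikit-learn sketch contrasting the two on synthetic three-class data (the data, class means, and component counts are assumptions for illustration): PCA ignores the labels, while LDA uses them and returns at most c − 1 = 2 dimensions here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# toy 3-class data in 5 dimensions (synthetic, for illustration only)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 5)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 50)

Z_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised: ignores y
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: uses y, at most c-1 dims
print(Z_pca.shape, Z_lda.shape)   # (150, 2) (150, 2)
```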
60SVD - Definition
- A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Λ: r x r diagonal matrix (strength of each concept; r = rank of the matrix)
- V: m x r matrix (m terms, r concepts)
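A numpy sketch of the decomposition and its shapes, using a small random matrix as a stand-in for a document-term matrix (the sizes are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                      # e.g., 6 documents x 4 terms (toy numbers)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# U: 6 x r, s: the r singular values (the diagonal of Lambda), Vt: r x 4, with r = min(6, 4) = 4
print(U.shape, s.shape, Vt.shape)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: A = U * Lambda * V^T
```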
61SVD - Properties
- Spectral decomposition of the matrix: A = λ1 · u1 · v1^T + λ2 · u2 · v2^T + ... (a sum of rank-one terms, one per concept)
62SVD - Example
- [Example figure: a document-term matrix A over the terms data, inf. (information), retrieval, brain, lung, with CS documents and MD documents as rows, decomposed as A = U Λ V^T]
63SVD - Example
- [Same example: U is the document-to-concept similarity matrix; the two concepts are a CS-concept (data, inf., retrieval) and an MD-concept (brain, lung)]
64SVD - Example
- [Same example: the diagonal of Λ gives the strength of each concept, e.g., the strength of the CS-concept]
65SVD Dimensionality reduction
- Q: How exactly is the dimensionality reduction done?
- A: Set the smallest singular values to zero (see the sketch below)
66SVD - Dimensionality reduction
- [Figure: the reduced matrices after zeroing the smallest singular values]
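A minimal rank-k approximation sketch in numpy, zeroing the smallest singular values as described above; the matrix and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

def low_rank(A, k):
    """Rank-k approximation: keep the k largest singular values, zero out the rest."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s[k:] = 0.0                        # set the smallest singular values to zero
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
A = rng.random((6, 4))                 # toy matrix (assumption for illustration)
A2 = low_rank(A, k=2)                  # best rank-2 approximation in the least-squares sense
print(np.linalg.matrix_rank(A2))       # 2
```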
67Other Linear Methods: FA, ICA, NMF
- These can all be interpreted as matrix factorization, but they differ in their basic assumptions.
- [Figure: the data matrix V (n x m) is factored as V ≈ W H, where W (n x k) holds the mixture weights and H (k x m) holds the factors]
68Assumptions
- Factor Analysis (FA): uncorrelated assumption
- Independent Component Analysis (ICA): independence assumption
- Nonnegative Matrix Factorization (NMF): nonnegative assumption
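As one concrete instance, a short scikit-learn NMF sketch factoring a nonnegative matrix V into W and H; the matrix sizes, k = 3, and the solver settings are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 10))                        # nonnegative n x m data matrix (toy numbers)

model = NMF(n_components=3, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)                      # n x k mixture weights
H = model.components_                           # k x m factors
print(W.shape, H.shape)                         # (20, 3) (3, 10)
print(np.abs(V - W @ H).mean())                 # the reconstruction V ≈ W H is approximate
```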
69Deficiencies of Linear Methods
- Data may not be best summarized by a linear combination of features
- Example: PCA cannot discover the 1D structure of a helix