Title: Feature Selection Methods
1Feature Selection Methods
- An overview
- Thanks to Qiang Yang
- Modified by Charles Ling
2What is Feature Selection?
- Feature selection: the problem of selecting some subset of a learning algorithm's input variables upon which it should focus attention, while ignoring the rest (DIMENSIONALITY REDUCTION)
- Humans and animals do this constantly!
3Motivational example from Biology
- Monkeys performing a classification task¹
- N. Sigala & N. Logothetis (2002). Visual categorization shapes feature selectivity in the primate temporal cortex.
¹ Natasha Sigala, Nikos Logothetis. Visual categorization shapes feature selectivity in the primate temporal cortex. Nature, Vol. 415 (2002).
4Motivational example from Biology
- Monkeys performing a classification task
- Diagnostic features: eye separation, eye height
- Non-diagnostic features: mouth height, nose length
5Motivational example from Biology
- Monkeys performing a classification task
- Results
- The activity of a population of 150 neurons in the anterior inferior temporal cortex was measured
- 44 neurons responded significantly differently to at least one feature
- After training, 72% (32/44) were selective to one or both of the diagnostic features (and not to the non-diagnostic features)
6Motivational example from Biology
- Monkeys performing a classification task
- Results (single neurons): the data from the present study indicate that neuronal selectivity was shaped by the most relevant subset of features during the categorization training.
7Feature Selection
- Reducing the feature space by throwing out some of the features (covariates)
- Also called variable selection
- Motivating idea: try to find a simple, parsimonious model
- Occam's razor: the simplest explanation that accounts for the data is best
8Feature Extraction
- Feature extraction is a process that extracts a new set of features from the original data through a numerical functional mapping.
- Idea:
- Given data points in d-dimensional space,
- Project into a lower-dimensional space while preserving as much information as possible
- E.g., find the best planar approximation to 3D data
- E.g., find the best planar approximation to 10^4-D data
9Feature Selection vs Feature Extraction
- The two approaches differ:
- Feature selection chooses a subset of the original features
- Feature extraction creates new features (dimensions) defined as functions over all of the features
10Outline
- What is Feature Reduction?
- Feature Selection
- Feature Extraction
- Why need Feature Reduction?
- Feature Selection Methods
- Filter
- Wrapper
- Feature Extraction Methods
- Linear
- Nonlinear
11Motivation
- The objective of feature reduction is three-fold:
- Improving the prediction performance of the predictors (accuracy)
- Providing faster and more cost-effective predictors (CPU time)
- Providing a better understanding of the underlying process that generated the data (interpretability)
12Feature reduction: examples
- Task 1: classify whether a document is about cats. Data: word counts in the document.
- Task 2: predict chances of lung disease. Data: medical history survey.
- X (Task 1, word counts): cat 2, and 35, it 20, kitten 8, electric 2, trouble 4, then 5, several 9, feline 2, while 4, lemon 2
- X (Task 2, survey): Vegetarian: No, Plays video games: Yes, Family history: No, Athletic: No, Smoker: Yes, Sex: Male, Lung capacity: 5.8 L, Hair color: Red, Car: Audi, Weight: 185 lbs
- Reduced X (Task 1): cat 2, kitten 8, feline 2
- Reduced X (Task 2): Family history: No, Smoker: Yes
13Feature reduction in task 1
- Task 1: We're interested in prediction; the features are not interesting in themselves, we just want to build a good classifier (or other kind of predictor).
- Text classification
- Features for all 10^5 English words, and maybe all word pairs
- Common practice: throw in every feature you can think of, let feature selection get rid of the useless ones
- Training is too expensive with all features
- The presence of irrelevant features hurts generalization.
14Feature reduction in task 2
- Task 2: We're interested in the features; we want to know which are relevant. If we fit a model, it should be interpretable.
- What causes lung cancer?
- Features are aspects of a patient's medical history
- Binary response variable: did the patient develop lung cancer?
- Which features best predict whether lung cancer will develop? We might want to legislate against these features.
15Get at Case 2 through Case 1
- Even if we just want to identify features, it can be useful to pretend we want to do prediction.
- Relevant features are (typically) exactly those that most aid prediction.
- But not always: highly correlated features may be redundant, yet both interesting as causes.
- e.g., smoking in the morning, smoking at night
16Outline
- What is Feature Reduction?
- Feature Selection
- Feature Extraction
- Why need Feature Reduction?
- Feature Selection Methods
- Filtering
- Wrapper
- Feature Extraction Methods
- Linear
- Nonlinear
17Filtering methods
- Basic idea: assign a score to each feature f indicating how related x_f and y are.
- Intuition: if x_{i,f} = y_i for all i, then f is good no matter what our model is; it contains all the information about y.
- Many popular scores [see Yang and Pedersen, 1997]
- Classification with categorical data: chi-squared, information gain
- Can use binning to make continuous data categorical
- Regression: correlation, mutual information
- Markov blanket [Koller and Sahami, 1996]
- Then somehow pick how many of the highest-scoring features to keep (nested models); a minimal scoring sketch follows below.
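As an illustration only (not from the slides), the sketch below scores each feature independently against the label using scikit-learn's mutual_info_classif and chi2; the Iris data and the choice to keep the top two features are assumptions made purely for this example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif, chi2

X, y = load_iris(return_X_y=True)          # toy dataset, assumed only for illustration

# score each feature independently against the class label
mi = mutual_info_classif(X, y, random_state=0)
chi2_scores, p_values = chi2(X, y)          # chi2 expects non-negative feature values

# rank features by score and keep, say, the top 2 (the cutoff is arbitrary here)
top2 = np.argsort(mi)[::-1][:2]
print("mutual information scores:", np.round(mi, 3))
print("top-2 features by MI:", top2)
```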
18Filtering methods
- Advantages
- Very fast
- Simple to apply
- Disadvantages
- Doesn't take into account which learning algorithm will be used.
- Doesn't take into account correlations between features
- This can be an advantage if we're only interested in ranking the relevance of features, rather than performing prediction.
- It is also a significant disadvantage (see homework)
- Suggestion: use light filtering as an efficient initial step if there are many obviously irrelevant features
- Caveat here too: apparently useless features can be useful when grouped with others
19Wrapper Methods
- The learner is considered a black box
- The interface of the black box is used to score subsets of variables according to the predictive power of the learner when using those subsets.
- Results vary for different learners
- One needs to define:
- how to search the space of all possible variable subsets
- how to assess the prediction performance of a learner
20Wrapper Methods
- The problem of finding the optimal subset is NP-hard!
- A wide range of heuristic search strategies can be used. Two different classes:
- Forward selection (start with an empty feature set and add features at each step)
- Backward elimination (start with the full feature set and discard features at each step)
- Predictive power is usually measured on a validation set or by cross-validation
- By using the learner as a black box, wrappers are universal and simple!
- Criticism: a large amount of computation is required.
21Wrapper Methods
22Feature selection search strategy
- Exhaustive search. Property: evaluate all C(d, m) possible subsets of size m. Comments: guaranteed to find the optimal subset; not feasible for even moderately large values of m and d.
- Sequential Forward Selection (SFS). Property: select the best single feature, then add one feature at a time which, in combination with the already selected features, maximizes the criterion function. Comments: once a feature is retained, it cannot be discarded; computationally attractive since, to select a subset of size 2, it examines only (d-1) possible subsets. (A minimal SFS sketch follows below.)
- Sequential Backward Selection (SBS). Property: start with all d features and successively delete one feature at a time. Comments: once a feature is deleted, it cannot be brought back into the optimal subset; requires more computation than sequential forward selection.
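A minimal sequential forward selection sketch, assuming scikit-learn, a logistic-regression pipeline as the black-box learner, 5-fold cross-validation as the criterion, and a cap of five features; all of these choices are illustrative, not prescribed by the slides.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)                 # toy dataset (assumption)
learner = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))  # the "black box"

selected, remaining = [], list(range(X.shape[1]))
best_score = -np.inf
for _ in range(5):                                          # grow the subset up to 5 features
    # score every candidate feature added to the current subset by CV accuracy
    scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
              for f in remaining}
    f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
    if s_best <= best_score:                                # stop when no candidate improves
        break
    selected.append(f_best)
    remaining.remove(f_best)
    best_score = s_best

print("selected features:", selected, "CV accuracy:", round(best_score, 3))
```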
23Comparison of filter and wrapper
- The wrapper method is tied to a specific classification algorithm, hence the criterion can be directly optimized
- but it is potentially very time-consuming, since it typically needs to evaluate a cross-validation scheme at every iteration.
- The filtering method is much faster, but it does not incorporate learning.
24Multivariate FS is complex
[Kohavi and John, 1997]
With n features, there are 2^n possible feature subsets!
25In practice
- Univariate feature selection often yields better accuracy results than multivariate feature selection.
- NO feature selection at all sometimes gives the best accuracy results, even in the presence of known distracters.
- Multivariate methods usually claim only better parsimony.
- How can we make multivariate FS work better? See the NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
26Feature Extraction: Definition
- Given a set of features F = {x_1, ..., x_N},
- the feature extraction (construction) problem is to map F to some feature set F' that maximizes the learner's ability to classify patterns.
- This general definition subsumes feature selection (i.e., a feature selection algorithm also performs a mapping, but can only map to subsets of the input variables); here F' ranges over the set of all possible feature sets.
27Linear, Unsupervised Feature Selection
- Question: Are attributes A1 and A2 independent?
- If they are very dependent, we can remove either A1 or A2
- If A1 is independent of a class attribute A2, we can remove A1 from our training data (a minimal redundancy filter is sketched below)
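As a rough, assumption-laden illustration of removing dependent attributes (not the slides' own method), one can drop one attribute from every highly correlated pair; the 0.95 threshold below is an arbitrary choice.

```python
import numpy as np

def drop_correlated(X, threshold=0.95):
    """Simple unsupervised redundancy filter: keep a feature only if it is not
    nearly collinear with a feature that was already kept (threshold is tunable)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))   # feature-by-feature |correlation|
    keep = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < threshold for k in keep):
            keep.append(j)
    return X[:, keep], keep

# toy usage on synthetic data (for illustration only)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.01 * rng.normal(size=200), rng.normal(size=200)])
X_reduced, kept = drop_correlated(X)
print("kept feature indices:", kept)             # the near-duplicate column is dropped
```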
28Chi-Squared Test (cont.)
- Question: Are attributes A1 and A2 independent?
- These features are nominal-valued (discrete)
- Null hypothesis: the attributes are independent
- Example data (Outlook, Temperature): (Sunny, High), (Cloudy, Low), (Sunny, High)
29The Weather example: Observed Counts
- Observed counts (Outlook vs. Temperature):
- Sunny: High = 2, Low = 0 (Outlook subtotal 2)
- Cloudy: High = 0, Low = 1 (Outlook subtotal 1)
- Temperature subtotals: High = 2, Low = 1; total count in table = 3
- Example data (Outlook, Temperature): (Sunny, High), (Cloudy, Low), (Sunny, High)
30The Weather example: Expected Counts
If the attributes were independent, the expected counts would be as follows (expected count = total × row probability × column probability):
- Sunny: High = 3 × (2/3) × (2/3) = 4/3 ≈ 1.3, Low = 3 × (2/3) × (1/3) = 2/3 ≈ 0.6; subtotal 2 (prob 2/3)
- Cloudy: High = 3 × (1/3) × (2/3) ≈ 0.6, Low = 3 × (1/3) × (1/3) ≈ 0.3; subtotal 1 (prob 1/3)
- Subtotals: High = 2 (prob 2/3), Low = 1 (prob 1/3); total count in table = 3
- Example data (Outlook, Temperature): (Sunny, High), (Cloudy, Low), (Sunny, High)
31Question: How different are the observed and expected counts?
- The chi-squared statistic: X² = Σ (O − E)² / E, summed over all cells
- If the chi-squared value is very large, then A1 and A2 are not independent; that is, they are dependent!
- Degrees of freedom: if the table has n × m cells, then degrees of freedom = (n − 1)(m − 1)
- In our example:
- Degrees of freedom = 1
- Chi-squared = ? (a small computational check is sketched below)
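A quick computational check of this example, assuming SciPy is available; scipy.stats.chi2_contingency computes both the expected counts and the chi-squared statistic (correction=False disables Yates' continuity correction so the result matches the hand calculation above).

```python
from scipy.stats import chi2_contingency

observed = [[2, 0],   # Sunny:  High, Low
            [0, 1]]   # Cloudy: High, Low

chi2_stat, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_stat, dof)   # 3.0 with 1 degree of freedom
print(expected)         # [[1.33 0.67], [0.67 0.33]], matching the expected-count table
```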
32Chi-Squared Table: what does it mean?
- If the calculated value is much greater than the value in the table, then you have reason to reject the independence assumption
- When your calculated chi-squared value is greater than the X² value shown in the 0.05 column (3.84) of the table, you are 95% certain that the attributes are actually dependent!
- i.e., there is only a 5% probability that your calculated X² value would occur by chance
33Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
- We don't have to have a two-dimensional count table (also known as a contingency table)
- Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1,
- but in the Honours class over the past ten years there have been 80 females and 40 males.
- Question: Is this a significant departure from the 1:1 expectation?
- Observed (Honours): Male = 40, Female = 80, Total = 120
34Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
- Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1,
- but in the Honours class over the past ten years there have been 80 females and 40 males.
- Question: Is this a significant departure from the 1:1 expectation?
- Note: here the expected counts are filled in from the 1:1 expectation, instead of being calculated from marginals
- Expected (Honours): Male = 60, Female = 60, Total = 120
35Chi-Squared Calculation
                       Female   Male   Total
Observed numbers (O)       80     40     120
Expected numbers (E)       60     60     120
O − E                      20    −20       0
(O − E)²                  400    400
(O − E)² / E             6.67   6.67    Sum = 13.34 = X²
36Chi-Squared Test (Cont.)
- Then, check the chi-squared table for significance:
- http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
- Compare our X² value with a χ² (chi-squared) value in a table of χ² with n − 1 degrees of freedom
- n is the number of categories, i.e., 2 in our case (males and females)
- We have only one degree of freedom (n − 1). From the χ² table, we find a "critical value" of 3.84 for p = 0.05.
- 13.34 > 3.84, so the expectation (that the Male:Female ratio in the Honours class is 1:1) is wrong! (See the computational check below.)
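The same test can be verified with SciPy's one-dimensional chisquare, which takes the observed and expected counts directly; this is only a sketch checking the numbers above.

```python
from scipy.stats import chisquare

observed = [80, 40]    # females, males in the Honours class
expected = [60, 60]    # what a 1:1 ratio would predict for 120 students

result = chisquare(f_obs=observed, f_exp=expected)
print(result.statistic)   # 13.33  (> 3.84, the critical value for p = 0.05, df = 1)
print(result.pvalue)      # about 0.0003, so we reject the 1:1 hypothesis
```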
37Chi-Squared Test in Weka weather.nominal.arff
38Chi-Squared Test in Weka
39Chi-Squared Test in Weka
40Example of Decision Tree Induction
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
- [Decision tree figure: the induced tree splits only on A4, A6, and A1, with leaves labeled Class 1 and Class 2]
- Reduced attribute set: {A1, A4, A6}
41Unsupervised Feature Extraction: PCA
- Given N data vectors (samples) in k dimensions (features), find c < k orthogonal dimensions that can best be used to represent the data
- The feature set is reduced from k to c
- Example: data = a collection of emails, k = 100 word counts, c = 10 new features
- The original data set is reduced by projecting the N data vectors onto the c principal components (reduced dimensions)
- Each (old) data vector X_j is (approximately) a linear combination of the c principal component vectors Y_1, Y_2, ..., Y_c through weights W_i:
- X_j = m + W_1 Y_1 + W_2 Y_2 + ... + W_c Y_c,  j = 1, 2, ..., N
- m is the mean of the data set
- W_1, W_2, ... are the component weights
- Y_1, Y_2, ... are the eigenvectors
- Works for numeric data only
- Used when the number of dimensions is large
42Principal Component Analysis
- See online tutorials such as http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- [Scatter-plot figure in the (X1, X2) plane: Y1 is the first eigenvector, Y2 the second; Y2 is ignorable. Key observation: Y1 points along the direction of largest variance!]
43Principal Component Analysis (PCA)
Principal Component Analysis: project onto the subspace with the most variance (unsupervised: it doesn't take y into account)
44Principal Component Analysis: one attribute first
- Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
- Question: how much spread is in the data along the axis? (distance to the mean)
- Variance = (standard deviation)²
45Now consider two dimensions
- X = Temperature, Y = Humidity. Data pairs (X, Y): (40, 90), (40, 90), (40, 90), (30, 90), (15, 70), (15, 70), (15, 70), (30, 90), (15, 70), (30, 70), (30, 70), (30, 90), (40, 70), (30, 90)
- Covariance measures the (linear) correlation between X and Y
- cov(X, Y) = 0: uncorrelated (no linear relationship)
- cov(X, Y) > 0: X and Y move in the same direction
- cov(X, Y) < 0: X and Y move in opposite directions
46More than two attributes: covariance matrix
- Contains covariance values between all possible pairs of dimensions (attributes)
- Example for three attributes (x, y, z), with a numpy sketch below:
  C = [ cov(x,x)  cov(x,y)  cov(x,z)
        cov(y,x)  cov(y,y)  cov(y,z)
        cov(z,x)  cov(z,y)  cov(z,z) ]
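A small numpy sketch of the covariance matrix for three attributes; the data values are made up purely for illustration.

```python
import numpy as np

# toy data: rows are samples, columns are the attributes x, y, z (made-up numbers)
data = np.array([[2.5, 2.4, 0.5],
                 [0.5, 0.7, 1.9],
                 [2.2, 2.9, 0.4],
                 [1.9, 2.2, 0.8],
                 [3.1, 3.0, 0.1]])

C = np.cov(data, rowvar=False)   # 3x3 matrix; C[i, j] = cov(attribute i, attribute j)
print(C.shape)                   # (3, 3); symmetric, with variances on the diagonal
```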
47Background: eigenvalues and eigenvectors
- Eigenvectors e of C satisfy C e = λ e
- How to calculate e and λ?
- Calculate det(C − λI); this yields a polynomial of degree n
- Determine the roots of det(C − λI) = 0; the roots are the eigenvalues λ
- Check out any math book, such as Elementary Linear Algebra by Howard Anton (publisher: John Wiley & Sons)
- Or any math package, such as MATLAB
48Steps of PCA
- Calculate the eigenvalues λ and eigenvectors e of the covariance matrix C
- Eigenvalue λ_j corresponds to the variance along component j
- Thus, sort by λ_j
- Take the first n eigenvectors e_i, where n is the number of top eigenvalues kept
- These are the directions with the largest variances (see the sketch below)
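A minimal numpy implementation of these steps (center, covariance, eigendecomposition, sort, project); the synthetic data and the choice of two components are assumptions for the example.

```python
import numpy as np

def pca(X, n_components):
    """Follow the slide's steps: adjust by the mean, covariance matrix,
    eigendecomposition, sort by eigenvalue, project onto the top directions."""
    mean = X.mean(axis=0)
    Xc = X - mean                                   # adjust data by the mean
    C = np.cov(Xc, rowvar=False)                    # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)            # eigh: C is symmetric
    order = np.argsort(eigvals)[::-1]               # largest variance first
    components = eigvecs[:, order[:n_components]]   # top-n eigenvectors
    return Xc @ components, components, mean        # projected data, directions, mean

# toy usage on synthetic data (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z, comps, mu = pca(X, n_components=2)
print(Z.shape)   # (100, 2)
```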
49An Example
Mean1 = 24.1, Mean2 = 53.8 (X1' and X2' are the mean-centered values)
X1   X2   X1'   X2'
19 63 -5.1 9.25
39 74 14.9 20.25
30 87 5.9 33.25
30 23 5.9 -30.75
15 35 -9.1 -18.75
15 43 -9.1 -10.75
15 32 -9.1 -21.75
30 73 5.9 19.25
50Covariance Matrix
- C = [  75  106
        106  482 ]
- Using MATLAB, we find the eigenvectors and eigenvalues:
- e1 = (-0.98, -0.21), λ1 = 51.8
- e2 = (0.21, -0.98), λ2 = 560.2
- Thus the second eigenvector is the more important one!
51If we only keep one dimension: e2
- We keep the dimension of e2 = (0.21, -0.98)
- We obtain the final (projected) data as y_i = 0.21 · X1'_i − 0.98 · X2'_i:
- y_i: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
52Using Matlab to figure it out
53PCA in Weka
54Weather Data from the UCI Dataset (comes with the Weka package)
56Summary of PCA
- PCA is used for reducing the number of numerical attributes
- The key is in the data transformation:
- Adjust the data by the mean
- Find the eigenvectors of the covariance matrix
- Transform the data
- Note: the new features are only linear combinations of the data (weighted sums of the original attributes)
57Summary
- Data preparation is a big issue for data mining
- Data preparation includes transformations such as:
- Data sampling and feature selection
- Discretization
- Missing value handling
- Incorrect value handling
- Feature selection and feature extraction
58Linear Method: Linear Discriminant Analysis (LDA)
- LDA finds the projection that best separates the two classes
- Multiple discriminant analysis (MDA) extends LDA to multiple classes
- [Figure: best projection direction for classification]
59PCA vs. LDA
- PCA is unsupervised, while LDA is supervised.
- PCA can extract r (the rank of the data) principal features, while LDA can find at most (c − 1) features, where c is the number of classes.
- Both can be computed via the SVD technique.
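A short scikit-learn sketch contrasting the two on synthetic three-class data (the data, class means, and component counts are assumptions for illustration): PCA ignores the labels, while LDA uses them and returns at most c − 1 = 2 dimensions here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# toy 3-class data in 5 dimensions (synthetic, for illustration only)
X = np.vstack([rng.normal(loc=c, scale=1.0, size=(50, 5)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 50)

Z_pca = PCA(n_components=2).fit_transform(X)                            # unsupervised: ignores y
Z_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)  # supervised: uses y, at most c-1 dims
print(Z_pca.shape, Z_lda.shape)   # (150, 2) (150, 2)
```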
60SVD - Definition
- A[n x m] = U[n x r] Λ[r x r] (V[m x r])^T
- A: n x m matrix (e.g., n documents, m terms)
- U: n x r matrix (n documents, r concepts)
- Λ: r x r diagonal matrix (strength of each concept; r = rank of the matrix)
- V: m x r matrix (m terms, r concepts)
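A numpy sketch of the decomposition and its shapes, using a small random matrix as a stand-in for a document-term matrix (the sizes are assumptions for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 4))                      # e.g., 6 documents x 4 terms (toy numbers)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
# U: 6 x r, s: the r singular values (the diagonal of Lambda), Vt: r x 4, with r = min(6, 4) = 4
print(U.shape, s.shape, Vt.shape)
print(np.allclose(A, U @ np.diag(s) @ Vt))  # True: A = U * Lambda * V^T
```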
61SVD - Properties
- Spectral decomposition of the matrix: A = λ1 · u1 · v1^T + λ2 · u2 · v2^T + ... (a sum of rank-one terms, one per concept)
62SVD - Example
- [Example figure: a document-term matrix A over the terms data, inf. (information), retrieval, brain, lung, with CS documents and MD documents as rows, decomposed as A = U Λ V^T]
63SVD - Example
- [Same example: U is the document-to-concept similarity matrix; the two concepts are a CS-concept (data, inf., retrieval) and an MD-concept (brain, lung)]
64SVD - Example
- [Same example: the diagonal of Λ gives the strength of each concept, e.g., the strength of the CS-concept]
65SVD Dimensionality reduction
- Q: How exactly is the dimensionality reduction done?
- A: Set the smallest singular values to zero (see the sketch below)
66SVD - Dimensionality reduction
- [Figure: the reduced matrices after zeroing the smallest singular values]
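A minimal rank-k approximation sketch in numpy, zeroing the smallest singular values as described above; the matrix and the choice k = 2 are illustrative assumptions.

```python
import numpy as np

def low_rank(A, k):
    """Rank-k approximation: keep the k largest singular values, zero out the rest."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s[k:] = 0.0                        # set the smallest singular values to zero
    return U @ np.diag(s) @ Vt

rng = np.random.default_rng(0)
A = rng.random((6, 4))                 # toy matrix (assumption for illustration)
A2 = low_rank(A, k=2)                  # best rank-2 approximation in the least-squares sense
print(np.linalg.matrix_rank(A2))       # 2
```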
67Other Linear Methods: FA, ICA, NMF
- These can all be interpreted as matrix factorization, but they differ in their basic assumptions.
- [Figure: the data matrix V (n x m) is factored as V ≈ W H, where W (n x k) holds the mixture weights and H (k x m) holds the factors]
68Assumptions
- Factor Analysis (FA): uncorrelated assumption
- Independent Component Analysis (ICA): independence assumption
- Nonnegative Matrix Factorization (NMF): nonnegative assumption
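As one concrete instance, a short scikit-learn NMF sketch factoring a nonnegative matrix V into W and H; the matrix sizes, k = 3, and the solver settings are assumptions made only for illustration.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = rng.random((20, 10))                        # nonnegative n x m data matrix (toy numbers)

model = NMF(n_components=3, init='random', random_state=0, max_iter=500)
W = model.fit_transform(V)                      # n x k mixture weights
H = model.components_                           # k x m factors
print(W.shape, H.shape)                         # (20, 3) (3, 10)
print(np.abs(V - W @ H).mean())                 # the reconstruction V ≈ W H is approximate
```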
69Deficiencies of Linear Methods
- Data may not be best summarized by a linear combination of features
- Example: PCA cannot discover the 1D structure of a helix