1
Feature Selection Methods
  • An overview
  • Thanks to Qiang Yang
  • Modified by Charles Ling

2
What is Feature Selection?
  • Feature selection: the problem of selecting some
    subset of a learning algorithm's input variables
    upon which it should focus attention, while
    ignoring the rest (dimensionality reduction)
  • Humans/animals do that constantly!

3
Motivational example from Biology [1]
  • Monkeys performing a classification task

[1] Sigala, N. & Logothetis, N. K. (2002). Visual categorization shapes
feature selectivity in the primate temporal cortex. Nature, Vol. 415.
4
Motivational example from Biology
  • Monkeys performing classification task

Diagnostic features: eye separation, eye height
Non-diagnostic features: mouth height, nose length
5
Motivational example from Biology
  • Monkeys performing classification task
  • Results
  • Activity of a population of 150 neurons in the
    anterior inferior temporal cortex was measured
  • 44 neurons responded significantly differently to
    at least one feature
  • After training, 72% (32/44) of these were selective to
    one or both of the diagnostic features (and not to
    the non-diagnostic features)

6
Motivational example from Biology
  • Monkeys performing classification task
  • Results
  • (single neurons)

The data from the present study indicate that
neuronal selectivity was shaped by the most
relevant subset of features during the
categorization training.
7
Feature Selection
  • Reducing the feature space by throwing out some
    of the features (covariates)
  • Also called variable selection
  • Motivating idea: try to find a simple,
    parsimonious model
  • Occam's razor: the simplest explanation that accounts
    for the data is best

8
Feature Extraction
  • Feature extraction is a process that extracts a
    new set of features from the original data
    through a numerical (functional) mapping
  • Idea
  • Given data points in a d-dimensional space,
  • project them into a lower-dimensional space while
    preserving as much information as possible
  • E.g., find the best planar approximation to 3D data
  • E.g., find the best planar approximation to 10^4-D data

9
Feature Selection vs Feature Extraction
  • They differ in two ways
  • Feature selection chooses a subset of the features
  • Feature extraction creates new features
    (dimensions) defined as functions over all
    features

10
Outline
  • What is Feature Reduction?
  • Feature Selection
  • Feature Extraction
  • Why do we need Feature Reduction?
  • Feature Selection Methods
  • Filter
  • Wrapper
  • Feature Extraction Methods
  • Linear
  • Nonlinear

11
Motivation
  • The objective of feature reduction is three-fold:
  • Improving the prediction performance of the
    predictors (accuracy)
  • Providing faster and more cost-effective
    predictors (CPU time)
  • Providing a better understanding of the
    underlying process that generated the data

12
feature reduction--examples
Task 1: classify whether a document is about cats.
  Data: word counts in the document.
Task 2: predict chances of lung disease.
  Data: medical history survey.

X (Task 1, word counts):
  cat 2, and 35, it 20, kitten 8, electric 2, trouble 4,
  then 5, several 9, feline 2, while 4, lemon 2

X (Task 2, survey answers):
  Vegetarian: No, Plays video games: Yes, Family history: No,
  Athletic: No, Smoker: Yes, Sex: Male, Lung capacity: 5.8 L,
  Hair color: Red, Car: Audi, Weight: 185 lbs

Reduced X (Task 1): cat 2, kitten 8, feline 2
Reduced X (Task 2): Family history: No, Smoker: Yes

13
Feature reduction in task 1
  • Task 1: we're interested in prediction; the features
    are not interesting in themselves, we just want
    to build a good classifier (or other kind of
    predictor)
  • Text classification
  • Features for all 10^5 English words, and maybe all
    word pairs
  • Common practice: throw in every feature you can
    think of, let feature selection get rid of the
    useless ones
  • Training is too expensive with all features
  • The presence of irrelevant features hurts
    generalization

14
Feature reduction in task 2
  • Task 2: we're interested in the features; we want to
    know which are relevant. If we fit a model, it
    should be interpretable.
  • What causes lung cancer?
  • Features are aspects of a patient's medical
    history
  • Binary response variable: did the patient develop
    lung cancer?
  • Which features best predict whether lung cancer
    will develop? Might want to legislate against
    these features.

15
Get at Task 2 through Task 1
  • Even if we just want to identify features, it can
    be useful to pretend we want to do prediction
  • Relevant features are (typically) exactly those
    that most aid prediction
  • But not always: highly correlated features may
    be redundant, yet both interesting as causes
  • e.g., smoking in the morning, smoking at night

16
Outline
  • What is Feature Reduction?
  • Feature Selection
  • Feature Extraction
  • Why do we need Feature Reduction?
  • Feature Selection Methods
  • Filtering
  • Wrapper
  • Feature Extraction Methods
  • Linear
  • Nonlinear

17
Filtering methods
  • Basic idea: assign a score to each feature f
    indicating how related x_f and y are
  • Intuition: if x_{i,f} ≈ y_i for all i, then f is good
    no matter what our model is; f contains all the
    information about y
  • Many popular scores (see Yang and Pedersen '97)
  • Classification with categorical data:
    chi-squared, information gain
  • Can use binning to make continuous data
    categorical
  • Regression: correlation, mutual information
  • Markov blanket (Koller and Sahami '96)
  • Then somehow pick how many of the highest-scoring
    features to keep (nested models); a small sketch
    follows below
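
A minimal sketch of univariate filtering with scikit-learn (not part of the
original deck; the library, the iris data, and the choice of mutual information
as the score are this sketch's own assumptions). Each feature is scored
independently against y and the k best are kept.

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)          # 4 numeric features, 3 classes
selector = SelectKBest(score_func=mutual_info_classif, k=2)
X_reduced = selector.fit_transform(X, y)   # keep the 2 highest-scoring features

print(selector.scores_)                    # one relevance score per feature
print(selector.get_support(indices=True))  # indices of the kept features
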

18
Filtering methods
  • Advantages
  • Very fast
  • Simple to apply
  • Disadvantages
  • Doesn't take into account which learning
    algorithm will be used
  • Doesn't take into account correlations between
    features
  • This can be an advantage if we're only interested
    in ranking the relevance of features, rather than
    performing prediction
  • Also a significant disadvantage (see homework)
  • Suggestion: use light filtering as an efficient
    initial step if there are many obviously
    irrelevant features
  • Caveat here too: apparently useless features can
    be useful when grouped with others

19
Wrapper Methods
  • The learner is treated as a black box
  • The interface of the black box is used to score
    subsets of variables according to the predictive
    power of the learner when using those subsets
  • Results vary for different learners
  • One needs to define
  • how to search the space of all possible variable
    subsets
  • how to assess the prediction performance of a
    learner

20
Wrapper Methods
  • The problem of finding the optimal subset is
    NP-hard!
  • A wide range of heuristic search strategies can
    be used. Two different classes:
  • Forward selection (start with an empty feature set
    and add features at each step)
  • Backward elimination (start with the full feature set
    and discard features at each step)
  • Predictive power is usually measured on a
    validation set or by cross-validation
  • By using the learner as a black box, wrappers are
    universal and simple!
  • Criticism: a large amount of computation is
    required
  • Both search directions are sketched in code below

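A hedged sketch of wrapper-style search with scikit-learn's
SequentialFeatureSelector (assumed available in the installed version); the
learner, dataset, and subset size are illustrative choices, not from the deck.

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
learner = LogisticRegression(max_iter=5000)      # the black-box learner

# Forward selection: start empty and greedily add features, scored by 5-fold CV.
sfs = SequentialFeatureSelector(learner, n_features_to_select=5,
                                direction="forward", cv=5)
sfs.fit(X, y)
print(sfs.get_support(indices=True))             # features chosen by forward search

# Backward elimination: start with all features and greedily discard them.
sbs = SequentialFeatureSelector(learner, n_features_to_select=5,
                                direction="backward", cv=5)
sbs.fit(X, y)
print(sbs.get_support(indices=True))             # features kept by backward search
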
21
Wrapper Methods
22
Feature selection search strategy
Method: Exhaustive search
  Property: evaluate all (d choose m) possible subsets.
  Comments: guaranteed to find the optimal subset; not feasible for even moderately large values of m and d.

Method: Sequential Forward Selection (SFS)
  Property: select the best single feature, then add one feature at a time which, in combination with the already selected features, maximizes the criterion function.
  Comments: once a feature is retained it cannot be discarded; computationally attractive since, to select a subset of size 2, it examines only (d-1) possible subsets.

Method: Sequential Backward Selection (SBS)
  Property: start with all d features and successively delete one feature at a time.
  Comments: once a feature is deleted it cannot be brought back into the optimal subset; requires more computation than sequential forward selection.
23
Comparison of filter and wrapper methods
  • The wrapper method is tied to a particular
    classification algorithm, hence the criterion can
    be optimized directly
  • but it is potentially very time consuming, since
    it typically needs to run a cross-validation
    scheme at every iteration
  • The filter method is much faster, but it does not
    incorporate learning

24
Multivariate FS is complex
Kohavi-John, 1997
n features, 2^n possible feature subsets!
25
In practice
  • Univariate feature selection often yields better
    accuracy results than multivariate feature
    selection
  • NO feature selection at all sometimes gives the
    best accuracy results, even in the presence of
    known distracters
  • Multivariate methods usually claim only better
    parsimony
  • How can we make multivariate FS work better?

NIPS 2003 and WCCI 2006 challenges
http://clopinet.com/challenges
26
Feature Extraction-Definition
  • Given a set of features F = {f_1, ..., f_d},
  • the feature extraction (construction) problem is
    to map F to some new feature set F' that
    maximizes the learner's ability to classify
    patterns
  • (again, the search is over the set of all possible
    feature sets)
  • This general definition subsumes feature
    selection (i.e. a feature selection algorithm
    also performs a mapping, but can only map to
    subsets of the input variables)

27
Linear, Unsupervised Feature Selection
  • Question: are attributes A1 and A2 independent?
  • If they are very dependent, we can remove
    either A1 or A2
  • If A1 is independent of the class attribute A2, we
    can remove A1 from our training data

28
Chi-Squared Test (cont.)
  • Question: are attributes A1 and A2 independent?
  • These features are nominal-valued (discrete)
  • Null hypothesis: we expect independence

Outlook   Temperature
Sunny     High
Cloudy    Low
Sunny     High
29
The Weather example: Observed Counts

Observed          Temperature
Outlook           High   Low    Outlook subtotal
Sunny              2      0            2
Cloudy             0      1            1
Temp. subtotal     2      1     Total count in table: 3

Outlook   Temperature
Sunny     High
Cloudy    Low
Sunny     High
30
The Weather example: Expected Counts
If the attributes were independent, the counts would follow the marginal
subtotals, giving the expected contingency table:

Expected     Temperature
Outlook      High                  Low                   Subtotal
Sunny        3·(2/3)·(2/3) ≈ 1.3   3·(2/3)·(1/3) ≈ 0.6   2 (prob = 2/3)
Cloudy       3·(1/3)·(2/3) ≈ 0.6   3·(1/3)·(1/3) ≈ 0.3   1 (prob = 1/3)
Subtotal     2 (prob = 2/3)        1 (prob = 1/3)        Total count in table: 3

Outlook   Temperature
Sunny     High
Cloudy    Low
Sunny     High
31
Question: how different are the observed and
expected counts?
  • If the chi-squared value is very large, then A1 and
    A2 are not independent; that is, they are
    dependent!
  • Degrees of freedom: if the table has n x m cells, then
    degrees of freedom = (n-1)(m-1)
  • In our example
  • degrees of freedom = 1
  • chi-squared = ? (a quick computation is sketched below)
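
A quick computation of the missing value with SciPy (assumed available, and
not part of the original deck); correction=False matches the plain
(O-E)^2/E formula used on these slides.

from scipy.stats import chi2_contingency

observed = [[2, 0],   # Sunny:  High, Low
            [0, 1]]   # Cloudy: High, Low

chi2, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2)      # 3.0 for this tiny table
print(dof)       # (2-1)*(2-1) = 1
print(expected)  # approx. [[1.33, 0.67], [0.67, 0.33]], as on the previous slide
print(p)         # approx. 0.083, so independence is not rejected at the 5% level
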

32
Chi-Squared Table: what does it mean?
  • If the calculated value is much greater than the one in the
    table, then you have reason to reject the
    independence assumption
  • When your calculated chi-squared value is greater
    than the χ² value shown in the 0.05 column
    (3.84) of the table, you are 95% certain that the
    attributes are actually dependent!
  • i.e. there is only a 5% probability that your
    calculated χ² value would occur by chance

33
Example Revisited (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
  • We don't have to have a two-dimensional count table
    (also known as a contingency table)
  • Suppose that the ratio of male to female students
    in the Science Faculty is exactly 1:1,
  • but in the Honours class over the past ten years
    there have been 80 females and 40 males.
  • Question: is this a significant departure from
    the (1:1) expectation?

Observed (Honours)   Male   Female   Total
                      40      80      120
34
Expected (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)
  • Suppose that the ratio of male to female students
    in the Science Faculty is exactly 1:1,
  • but in the Honours class over the past ten years
    there have been 80 females and 40 males.
  • Question: is this a significant departure from
    the (1:1) expectation?
  • Note: the expected counts are filled in from the 1:1
    expectation, instead of being calculated from marginals

Expected (Honours)   Male   Female   Total
                      60      60      120
35
Chi-Squared Calculation
                        Female   Male   Total
Observed numbers (O)      80      40     120
Expected numbers (E)      60      60     120
O - E                     20     -20       0
(O - E)^2                400     400
(O - E)^2 / E            6.67    6.67   Sum = 13.34 = X^2
36
Chi-Squared Test (Cont.)
  • Then, check the chi-squared table for
    significance:
  • http://helios.bto.ed.ac.uk/bto/statistics/table2.html#Chi%20squared%20test
  • Compare our X^2 value with a χ² (chi-squared)
    value in a table of χ² with n-1 degrees of
    freedom
  • (n is the number of categories, i.e. 2 in our case:
    males and females)
  • We have only one degree of freedom (n-1). From
    the χ² table, we find a critical value of 3.84
    for p = 0.05.
  • 13.34 > 3.84, so the expectation (that the
    male:female ratio in the Honours class is 1:1) is wrong!
  • The same check in code is sketched below.
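The same test done with SciPy (assumed available), as a cross-check of the
hand calculation above.

from scipy.stats import chisquare

observed = [80, 40]   # females, males in the Honours class
expected = [60, 60]   # counts implied by the 1:1 assumption

stat, p = chisquare(f_obs=observed, f_exp=expected)
print(stat)  # approx. 13.33, matching the table (400/60 + 400/60)
print(p)     # approx. 0.00026, far below 0.05, so the 1:1 expectation is rejected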

37
Chi-Squared Test in Weka weather.nominal.arff
38
Chi-Squared Test in Weka
39
Chi-Squared Test in Weka
40
Example of Decision Tree Induction
Initial attribute set: {A1, A2, A3, A4, A5, A6}

(Figure: a decision tree that splits on A4 at the root, then on A1 and A6 at
internal nodes, with leaves labeled Class 1 and Class 2.)

Reduced attribute set: {A1, A4, A6}
41
Unsupervised Feature ExtractionPCA
  • Given N data vectors (samples) from k dimensions
    (features), find c < k orthogonal dimensions
    that can best be used to represent the data
  • The feature set is reduced from k to c
  • Example: data = collection of emails; k = 100 word
    counts; c = 10 new features
  • The original data set is reduced by projecting
    the N data vectors onto the c principal components
    (reduced dimensions)
  • Each (old) data vector X_j is a linear combination
    of the c principal component vectors Y_1, Y_2, ..., Y_c
    through weights W_i:
  • X_j = m + W_1·Y_1 + W_2·Y_2 + ... + W_c·Y_c,   j = 1, 2, ..., N
  • m is the mean of the data set
  • W_1, W_2, ... are the component weights
  • Y_1, Y_2, ... are the eigenvectors
  • Works for numeric data only
  • Used when the number of dimensions is large

42
  • Principal Component Analysis
  • See online tutorials such as
    http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf

(Figure: data scattered in the X1-X2 plane; Y1 is the first eigenvector,
Y2 the second; Y2 is ignorable.)
Key observation: the variance along Y1 is largest!
43
Principal Component Analysis (PCA)
Principal Component Analysis: project onto the
subspace with the most variance (unsupervised:
doesn't take y into account)
44
Principal Component Analysis one attribute first
Temperature: 42, 40, 24, 30, 15, 18, 15, 30, 15, 30, 35, 30, 40, 30
  • Question: how much spread is in the data along
    the axis? (distance to the mean)
  • Variance = (standard deviation)^2

45
Now consider two dimensions
X = Temperature   Y = Humidity
      40               90
      40               90
      40               90
      30               90
      15               70
      15               70
      15               70
      30               90
      15               70
      30               70
      30               70
      30               90
      40               70
      30               90
  • Covariance measures the correlation between X
    and Y
  • cov(X,Y) = 0: independent
  • cov(X,Y) > 0: move in the same direction
  • cov(X,Y) < 0: move in opposite directions
  • (a quick numeric check follows below)
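
A quick numeric check of the sign of the covariance with NumPy (assumed
available; not part of the original deck), using the table above.

import numpy as np

temperature = [40, 40, 40, 30, 15, 15, 15, 30, 15, 30, 30, 30, 40, 30]
humidity    = [90, 90, 90, 90, 70, 70, 70, 90, 70, 70, 70, 90, 70, 90]

C = np.cov(temperature, humidity)   # 2 x 2 covariance matrix
print(C[0, 1])                      # positive: X and Y tend to move in the same direction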

46
More than two attributes covariance matrix
  • Contains the covariance values between all possible
    pairs of dimensions (attributes)
  • Example for three attributes (x, y, z), shown below
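
The three-attribute covariance matrix referred to above (the original slide
graphic is not reproduced here) has the standard form, written in LaTeX; note
that the worked example later in the deck divides by n, while many texts
divide by n-1:

C =
\begin{pmatrix}
\operatorname{cov}(x,x) & \operatorname{cov}(x,y) & \operatorname{cov}(x,z) \\
\operatorname{cov}(y,x) & \operatorname{cov}(y,y) & \operatorname{cov}(y,z) \\
\operatorname{cov}(z,x) & \operatorname{cov}(z,y) & \operatorname{cov}(z,z)
\end{pmatrix},
\qquad
\operatorname{cov}(x,y) = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})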

47
Background: eigenvalues and eigenvectors
  • Eigenvectors e: C e = λ e
  • How to calculate e and λ:
  • Calculate det(C - λI); this yields a polynomial (degree
    n)
  • Determine the roots of det(C - λI) = 0; the roots are the
    eigenvalues λ
  • Check out any math book, such as
  • Elementary Linear Algebra by Howard Anton,
    Publisher: John Wiley & Sons
  • or any math package such as MATLAB (a NumPy sketch follows below)
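
A minimal NumPy sketch of this step (NumPy assumed available);
np.linalg.eig solves det(C - λI) = 0 numerically. The matrix here is the
rounded covariance matrix from the worked example a few slides ahead, so the
results will only approximately agree with the numbers quoted there.

import numpy as np

C = np.array([[ 75.0, 106.0],
              [106.0, 482.0]])       # a symmetric (covariance) matrix

eigenvalues, eigenvectors = np.linalg.eig(C)
print(eigenvalues)          # the roots lambda of det(C - lambda*I) = 0
print(eigenvectors[:, 0])   # the eigenvector (column) paired with eigenvalues[0]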

48
Steps of PCA
  • Calculate the eigenvalues λ and eigenvectors e of the
    covariance matrix C
  • Eigenvalue λ_j corresponds to the variance along each
    component j
  • Thus, sort by λ_j
  • Take the first n eigenvectors e_i, where n is the
    number of top eigenvalues
  • These are the directions with the largest
    variances (a from-scratch sketch follows below)
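
A compact from-scratch sketch of these steps with NumPy (assumed available);
the helper names and the choice to divide by n, as the worked example on the
next slides does, are this sketch's own conventions.

import numpy as np

def pca_fit(X, n_components):
    """Return (mean, components); components holds the top eigenvectors as rows."""
    mean = X.mean(axis=0)
    Xc = X - mean                           # 1. adjust the data by the mean
    C = (Xc.T @ Xc) / X.shape[0]            # 2. covariance matrix (divide by n)
    eigvals, eigvecs = np.linalg.eigh(C)    # 3. eigenvalues and eigenvectors of C
    order = np.argsort(eigvals)[::-1]       # 4. sort by eigenvalue (variance)
    top = eigvecs[:, order[:n_components]]  # 5. keep the leading eigenvectors
    return mean, top.T

def pca_transform(X, mean, components):
    return (X - mean) @ components.T        # project onto the kept directions

# Usage on the 8-point example from the next slides:
X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)
mean, comps = pca_fit(X, n_components=1)
print(pca_transform(X, mean, comps))        # one coordinate per data point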

49
An Example
Mean1 = 24.1   Mean2 = 53.8
X1   X2   X1'   X2'
19 63 -5.1 9.25
39 74 14.9 20.25
30 87 5.9 33.25
30 23 5.9 -30.75
15 35 -9.1 -18.75
15 43 -9.1 -10.75
15 32 -9.1 -21.75
30 73 5.9 19.25
50
Covariance Matrix
  • C = [  75  106
          106  482 ]
  • Using MATLAB, we find out
  • Eigenvectors
  • e1 = (-0.98, -0.21), λ1 = 51.8
  • e2 = (0.21, -0.98), λ2 = 560.2
  • Thus the second eigenvector is more important!

51
If we only keep one dimension: e2
  • We keep the dimension of e2 = (0.21, -0.98)
  • We can obtain the final data y_i by projecting
    onto e2 (a code check follows below):

y_i: -10.14, -16.72, -31.35, 31.374, 16.464, 8.624, 19.404, -17.63
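
A small NumPy check (NumPy assumed available) that reproduces the y_i values
by centering the data and projecting onto the reported e2.

import numpy as np

X = np.array([[19, 63], [39, 74], [30, 87], [30, 23],
              [15, 35], [15, 43], [15, 32], [30, 73]], dtype=float)
e2 = np.array([0.21, -0.98])          # direction reported on the previous slide

y = (X - X.mean(axis=0)) @ e2         # center, then project onto e2
print(np.round(y, 2))                 # approx. -10.14, -16.72, -31.35, 31.37,
                                      #         16.46, 8.62, 19.40, -17.63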

52
Using Matlab to figure it out
53
PCA in Weka
54
Weather Data from the UCI Dataset (comes with the Weka
package)
55
(No Transcript)
56
Summary of PCA
  • PCA is used for reducing the number of numerical
    attributes
  • The key is in the data transformation:
  • Adjust the data by the mean
  • Find the eigenvectors of the covariance matrix
  • Transform the data
  • Note: PCA produces only linear combinations of the data
    (weighted sums of the original attributes)

57
Summary
  • Data preparation is a big issue for data mining
  • Data preparation includes transformations such as:
  • Data sampling and feature selection
  • Discretization
  • Missing value handling
  • Incorrect value handling
  • Feature selection and feature extraction

58
Linear Method Linear Discriminant Analysis (LDA)
  • LDA finds the projection that best separates the
    two classes
  • Multiple discriminant analysis (MDA) extends LDA
    to multiple classes

Best projection direction for classification
(a scikit-learn sketch follows below)
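
A hedged scikit-learn sketch of LDA as a supervised projection (the library
and the iris data are assumptions, not from the deck); with c classes it
yields at most c-1 directions.

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)                 # 4 features, 3 classes
lda = LinearDiscriminantAnalysis(n_components=2)  # at most c - 1 = 2 components
X_proj = lda.fit_transform(X, y)                  # supervised: uses the labels y
print(X_proj.shape)                               # (150, 2)
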
59
PCA vs. LDA
  • PCA is unsupervised while LDA is supervised
  • PCA can extract r (r = rank of the data) principal
    features, while LDA can find at most (c-1) features
    for c classes
  • Both can be computed with the SVD technique

60
SVD - Definition
  • A_{n x m} = U_{n x r} L_{r x r} (V_{m x r})^T
  • A: n x m matrix (e.g., n documents, m terms)
  • U: n x r matrix (n documents, r concepts)
  • L: r x r diagonal matrix (strength of each
    concept) (r = rank of the matrix)
  • V: m x r matrix (m terms, r concepts)
  • (a shape check in code follows below)
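
A small shape check of the definition with NumPy (assumed available); here
r = min(n, m) rather than the exact rank, and the random matrix is purely
illustrative.

import numpy as np

n, m = 6, 4                                        # e.g., 6 documents, 4 terms
A = np.random.rand(n, m)

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # "thin" SVD
print(U.shape, s.shape, Vt.shape)                  # (6, 4) (4,) (4, 4)
print(np.allclose(A, U @ np.diag(s) @ Vt))         # True: U L V^T reconstructs A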

61
SVD - Properties
  • Spectral decomposition of the matrix:

A = λ_1 · u_1 · v_1^T + λ_2 · u_2 · v_2^T + ...

(Figure: A written as a weighted sum of rank-one outer products, with
u_1, u_2 the left singular vectors, v_1, v_2 the right singular vectors,
and λ_1, λ_2 the singular values.)
62
SVD - Example
  • A = U L V^T - example

(Figure: a document-term matrix A over the terms {data, inf., retrieval,
brain, lung}, with rows from CS documents and MD documents, written as the
product U x L x V^T.)
63
SVD - Example
  • A = U L V^T - example

(Figure: the same factorization, annotated with a CS-concept and an
MD-concept; U is the document-to-concept similarity matrix.)
64
SVD - Example
  • A = U L V^T - example

(Figure: the same factorization, annotated with the diagonal entry of L that
gives the strength of the CS-concept.)
65
SVD Dimensionality reduction
  • Q: how exactly is the dimensionality reduction done?
  • A: set the smallest singular values to zero

(Figure: the factorization with the smallest singular values zeroed out.)
66
SVD - Dimensionality reduction
(Figure: the resulting lower-rank approximation of A; a code sketch
follows below.)
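A sketch of that truncation with NumPy (assumed available; the random matrix
is illustrative): drop the smallest singular values and keep only the top k
concepts.

import numpy as np

A = np.random.rand(6, 4)                       # e.g., document-term counts
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                          # number of concepts to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]    # best rank-k approximation of A

print(np.linalg.matrix_rank(A_k))              # 2
print(np.linalg.norm(A - A_k))                 # error introduced by the truncation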

67
Others Linear FA,ICA,NMF
  • These can all be interpreted as matrix factorizations,
    but they differ in their basic assumptions:

V ≈ W · H,  where V is the n x m data matrix, W is the n x k matrix of
mixture weights, and H is the k x m matrix of factors
(an NMF sketch follows below)
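
A hedged sketch of the V ≈ W·H factorization using scikit-learn's NMF
(the library, the random nonnegative data, and the chosen k are assumptions,
not from the deck).

import numpy as np
from sklearn.decomposition import NMF

V = np.random.rand(20, 10)                     # nonnegative data: n=20 samples, m=10 features
model = NMF(n_components=3, init="random", random_state=0, max_iter=500)

W = model.fit_transform(V)                     # n x k mixture weights
H = model.components_                          # k x m factors
print(W.shape, H.shape)                        # (20, 3) (3, 10)
print(np.linalg.norm(V - W @ H))               # residual of the approximation V ~ W H
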
68
Assumptions
  • Factor Analysis (FA):
  • uncorrelatedness assumption
  • Independent Component Analysis (ICA):
  • independence assumption
  • Nonnegative Matrix Factorization (NMF):
  • nonnegativity assumption

69
Deficiencies of Linear Methods
  • Data may not be best summarized by a linear
    combination of features
  • Example: PCA cannot discover the 1D structure of a
    helix