1
Dimensionality Reduction by Feature Selection in
Machine Learning
  • Dunja Mladenic
  • J.Stefan Institute, Slovenia

2
Reasons for dimensionality reduction
  • Dimensionality reduction in machine learning is
    usually performed to:
  • Improve the prediction performance
  • Improve learning efficiency
  • Provide faster predictors, possibly requiring
    less information about the original data
  • Reduce the complexity of the learned results and
    enable better understanding of the underlying
    process

3
Approaches to dimensionality reduction
  • Map the original features onto a
    reduced-dimensionality space by
  • selecting a subset of the original features
  • no feature transformation, just select a feature
    subset
  • constructing features to replace the original
    features
  • using methods from statistics, such as PCA (see
    the sketch below)
  • using background knowledge to construct new
    features used in addition to/instead of the
    original features (can be followed by feature
    subset selection)
  • general background knowledge (sum or product of
    features, ...)
  • domain-specific background knowledge (parser for
    text data to get noun phrases, clustering of
    words, user-specified functions, ...)

Addressed here
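
To make the two families concrete, a minimal scikit-learn sketch
(the dataset and the choice k=2 are illustrative, not from the talk):

    from sklearn.datasets import load_iris
    from sklearn.decomposition import PCA
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)

    # Feature selection: keep 2 of the original, untransformed features
    X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)

    # Feature construction: replace the originals with 2 PCA components
    X_constructed = PCA(n_components=2).fit_transform(X)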
4
Example for the problem
  • Data set
  • five Boolean features
  • C = F1 ∨ F2
  • F3 = ¬F2, F5 = ¬F4
  • Optimal subset
  • F1, F2 or F1, F3
  • optimization in the space of all feature subsets
    (2^5 = 32 possibilities); see the sketch below
  • (tutorial on genomics, Yu 2004)
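
The exhaustive search can be spelled out for this toy problem; a
minimal sketch (the dependency signs F3 = ¬F2 and F5 = ¬F4 are assumed
from the reconstruction above):

    from itertools import combinations, product

    # All consistent examples over the five Boolean features.
    examples = []
    for f1, f2, f4 in product([0, 1], repeat=3):
        f3, f5 = 1 - f2, 1 - f4            # F3 = not F2, F5 = not F4
        examples.append(((f1, f2, f3, f4, f5), f1 or f2))  # C = F1 or F2

    def sufficient(subset):
        # sufficient = equal values on the subset imply an equal class
        seen = {}
        return all(seen.setdefault(tuple(f[i] for i in subset), c) == c
                   for f, c in examples)

    for size in range(6):                  # scan all 2**5 = 32 subsets
        found = [s for s in combinations(range(5), size) if sufficient(s)]
        if found:                          # smallest sufficient subsets
            print([tuple(f"F{i + 1}" for i in s) for s in found])
            break                          # -> (F1, F2) and (F1, F3)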

5
Search for feature subset
  • An example of the search space [John & Kohavi 1997]

[Diagram: the lattice of all feature subsets; forward selection starts from the empty set and adds features, backward elimination starts from the full set and removes them]
6
Feature subset selection
  • commonly used search strategies (a code sketch
    follows this slide)
  • forward selection
  • FSubset = {} ; greedily add features one at a
    time
  • forward stepwise selection
  • FSubset = {} ; greedily add or remove features
    one at a time
  • backward elimination
  • FSubset = AllFeatures ; greedily remove features
    one at a time
  • backward stepwise elimination
  • FSubset = AllFeatures ; greedily add or remove
    features one at a time
  • random mutation
  • FSubset = RandomFeatures
  • greedily add or remove a randomly selected
    feature, one at a time
  • stop after a given number of iterations
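
A minimal sketch of the greedy forward-selection loop (evaluate() is a
placeholder for any of the subset-evaluation functions discussed on the
later slides):

    def forward_selection(all_features, evaluate, max_size):
        subset = []                            # FSubset = {}
        while len(subset) < max_size:
            candidates = [f for f in all_features if f not in subset]
            if not candidates:
                break
            best = max(candidates, key=lambda f: evaluate(subset + [f]))
            if evaluate(subset + [best]) <= evaluate(subset):
                break                          # no single addition helps
            subset.append(best)                # greedily add one feature
        return subset

Backward elimination mirrors this loop: start from all the features and
greedily remove the feature whose removal costs the least.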

7
Approaches to feature subset selection
  • Filters - evaluation function independent of the
    learning algorithm
  • Wrappers - evaluation using model selection based
    on the machine learning algorithm
  • Embedded approaches - feature selection during
    learning
  • Simple Filters - assume feature independence
    (used for problems with a large number of
    features, e.g. text classification)

8
Filtering
Evaluation independent of ML algorithm
9
Filters: Distribution-based [Koller & Sahami 1996]
  • Idea: select a minimal subset of features that
    keeps the class probability distribution close to
    the original distribution, i.e. P(C|FeatureSet)
    is close to P(C|AllFeatures)
  • start with all the features
  • use backward elimination to eliminate a
    predefined number of features
  • evaluation: the next feature to be deleted is
    chosen using a cross-entropy measure (see the
    sketch below)
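
A simplified sketch of one elimination step over discrete features,
scoring each candidate by the cross-entropy (KL divergence) between the
class distributions with and without it; note the actual algorithm
approximates this with Markov blankets rather than this brute force:

    import numpy as np
    from collections import defaultdict

    def class_dists(X, y, cols, n_classes):
        # class counts conditioned on the values of the features in cols
        d = defaultdict(lambda: np.zeros(n_classes))
        for row, c in zip(X, y):
            d[tuple(row[i] for i in cols)][c] += 1
        return d

    def elimination_cost(X, y, cols, drop, n_classes):
        kept = [c for c in cols if c != drop]
        full = class_dists(X, y, cols, n_classes)
        reduced = class_dists(X, y, kept, n_classes)
        cost = 0.0
        for key, cnt in full.items():
            p = cnt / cnt.sum()                # P(C | all features in cols)
            rkey = tuple(v for v, c in zip(key, cols) if c != drop)
            q = reduced[rkey] / reduced[rkey].sum()
            m = p > 0                          # KL divergence term
            cost += cnt.sum() / len(y) * np.sum(p[m] * np.log(p[m] / q[m]))
        return cost                            # delete the lowest-cost feature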

10
Filters: Relief [Kira & Rendell 1992]
  • Evaluation of a feature subset (sketched below)
  • represent examples using the feature subset
  • on a random subset of examples, calculate the
    average difference in distance from
  • the nearest example of the same class and the
    nearest example of a different class
  • per-feature difference: diff(F, e1, e2) = 0 if
    the values are equal and 1 otherwise for discrete
    F; diff(F, e1, e2) = |value(F,e1) - value(F,e2)|
    / (max(F) - min(F)) for continuous F
  • some extensions, empirical and theoretical
    analysis in [Robnik-Sikonja & Kononenko 2003]
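
A minimal sketch of the Relief weighting (assumes a binary class and
features scaled to [0, 1], so the discrete and continuous cases share
the |difference| form):

    import numpy as np

    def relief(X, y, n_samples, rng=np.random.default_rng(0)):
        n, d = X.shape
        w = np.zeros(d)
        for i in rng.choice(n, size=n_samples, replace=False):
            dist = np.abs(X - X[i]).sum(axis=1)   # L1 distance to example i
            dist[i] = np.inf                      # exclude the example itself
            hit = np.argmin(np.where(y == y[i], dist, np.inf))
            miss = np.argmin(np.where(y != y[i], dist, np.inf))
            # reward separation from the other class, punish spread within
            w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return w / n_samples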

11
Filters: FOCUS [Almuallim & Dietterich 1991]
  • Evaluation of a feature subset
  • represent examples using the feature subset
  • count conflicts in the class value (two examples
    with the same feature values but different class
    values)
  • Search: all the (promising) subsets of the same
    (increasing) size are evaluated until a
    sufficient (conflict-free) subset is found
  • assumes the existence of a small sufficient
    subset --> not appropriate for tasks with many
    features
  • some extensions of the algorithm use heuristic
    search to avoid evaluating all the subsets of the
    same size

12
Illustration of FOCUS
[Figure: examples projected onto a candidate feature subset; two pairs of rows agree on all the selected features but differ in class value, each pair marked "Conflict!"]
13
Filters: Random [Liu & Setiono 1996]
  • Evaluation of a feature subset (see the sketch
    below)
  • represent examples using the feature subset
  • calculate the inconsistency rate
  • (for each group of examples with equal feature
    values, count the examples outside the locally
    most frequent class; sum over the groups and
    divide by the number of examples)
  • select the smallest subset with an inconsistency
    rate below the given threshold
  • Search: random sampling of the space of feature
    subsets
  • evaluate a predetermined number of subsets
  • noise handling by setting the threshold > 0
  • if the threshold is 0, the evaluation is the same
    as in FOCUS
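
The inconsistency rate and the random search can be sketched as follows
(subset sizes and trial counts are illustrative):

    import random
    from collections import Counter, defaultdict

    def inconsistency_rate(X, y, subset):
        groups = defaultdict(Counter)      # examples grouped by subset values
        for row, c in zip(X, y):
            groups[tuple(row[i] for i in subset)][c] += 1
        # per group: examples falling outside the locally most frequent class
        return sum(sum(g.values()) - max(g.values())
                   for g in groups.values()) / len(y)

    def random_search(X, y, d, trials, threshold, rng=random.Random(0)):
        best = list(range(d))              # start from all d features
        for _ in range(trials):
            cand = rng.sample(range(d), rng.randint(1, len(best)))
            if inconsistency_rate(X, y, cand) <= threshold:
                best = cand                # keep the smaller consistent subset
        return best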

14
Filters: MDL-based [Pfahringer 1995]
  • Evaluation using Minimum Description Length
  • represent examples using the feature subset
  • calculate the MDL of a simple decision table
    representing the examples
  • Search: start with a random feature subset and
    add or delete one feature at a time
  • performs at least as well as the wrapper approach
    applied to simple decision tables, and scales
    better to a large number of training examples

15
Wrapper
Evaluation uses the same ML algorithm that is
used after the feature selection
16
Wrappers: Instance-based learning
  • Evaluation using instance-based learning (a code
    sketch follows this slide)
  • represent examples using the feature subset
  • estimate model quality using cross-validation
  • Search [Aha & Bankert 1994]:
  • start with a random feature subset
  • use beam search with backward elimination
  • Search [Skalak 1994]:
  • start with a random feature subset
  • use random mutation
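
A minimal sketch of the wrapper evaluation step with an instance-based
learner (any search strategy from the earlier slide can drive it; the
choice of k=3 neighbors and 5 folds is illustrative):

    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def wrapper_score(X, y, subset):
        # evaluation uses the same learner that is applied afterwards
        model = KNeighborsClassifier(n_neighbors=3)
        return cross_val_score(model, X[:, list(subset)], y, cv=5).mean()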

17
Wrappers: Decision tree induction
  • Evaluation using decision tree induction
  • represent examples using the feature subset
  • estimate model quality using cross-validation
  • Search [Bala et al 1995; Cherkauer & Shavlik 1996]:
  • using a genetic algorithm
  • Search [Caruana & Freitag 1994]:
  • adding and removing features (backward stepwise
    elimination)
  • additionally, at each step remove all the
    features that were not used in the decision tree
    induced for the evaluation of the current feature
    subset

18
Metric-based model selection
  • Idea: poor models behave differently on training
    and other data
  • Evaluation using a machine learning algorithm
  • represent examples using the feature subset
  • generate a model using some ML algorithm
  • estimate model quality by comparing the
    performance of two models on training and on
    unlabeled data; choose the largest subset that
    satisfies the triangle inequality with all the
    smaller subsets
  • Combine metric and cross-validation [Bengio &
    Chapados 2003]
  • based on their disagreement on testing examples
    (higher disagreement means lower trust in
    cross-validation)
  • Intuition: cross-validation provides good results
    but has high variance, so it should benefit from
    a combination with a model-selection method that
    has lower variance

19
Embedded
Feature selection as an integral part of model
generation
20
Embedded
  • at each iteration of the incremental optimization
    of the model,
  • use a fast gradient-based heuristic to find the
    most promising feature [Perkins et al 2003]
  • Idea: features that are relevant to the concept
    should affect the generalization error bound of a
    non-linear SVM more than irrelevant features
  • use backward elimination based on criteria
    derived from the generalization error bounds of
    SVM theory (the weight-vector norm or upper
    bounds on the leave-one-out error)
    [Rakotomamonjy 2003]; a related sketch follows
    this slide
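
Weight-driven backward elimination of this kind is available off the
shelf; a minimal sketch with scikit-learn's RFE, which uses a related
criterion (the smallest |w| of a linear SVM) rather than the exact
error bounds of the paper:

    from sklearn.feature_selection import RFE
    from sklearn.svm import LinearSVC

    # Repeatedly drop the feature with the smallest |w| in the trained
    # linear SVM until 10 features remain (10 is an arbitrary choice).
    selector = RFE(LinearSVC(dual=False), n_features_to_select=10, step=1)
    # usage: X_reduced = selector.fit_transform(X_train, y_train)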

21
Embedded in filters [Cardie 1993]
  • Use embedded feature selection as pre-processing
    (a code sketch follows this slide)
  • evaluation and search using the process embedded
    in decision tree induction
  • the final feature subset contains only the
    features that appear in the induced decision tree
  • used for learning with the Nearest Neighbor
    algorithm
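
A minimal sketch in the spirit of this approach (scikit-learn names;
reading the used features via tree_.feature is an implementation
detail of that library):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    def tree_filtered_knn(X_train, y_train):
        # induce a decision tree, keep only the features it actually uses
        tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
        used = np.unique(tree.tree_.feature[tree.tree_.feature >= 0])
        # train the Nearest Neighbor learner on the reduced representation
        knn = KNeighborsClassifier().fit(X_train[:, used], y_train)
        return knn, used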

22
Simple Filtering
Evaluation independent of ML algorithm
23
Feature subset selection on text data: commonly
used methods
  • Simple filtering using some scoring measure to
    evaluate each individual feature
  • supervised measures
  • information gain, cross entropy for text
    (information gain on only one feature value),
    mutual information for text
  • supervised measures for a binary class
  • odds ratio (target class vs. the rest), bi-normal
    separation
  • unsupervised measures
  • term frequency, document frequency
  • Simple filtering using an embedded approach to
    score the features (see the sketch below)
  • scoring measure equal to the weights in the
    normal to the hyperplane of a linear SVM trained
    on all the features [Brank et al 2002]
  • learning using linear SVM, Perceptron, Naïve Bayes
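
The SVM-normal scoring can be sketched end to end on toy documents (the
toy corpus and the cutoff of 3 terms are illustrative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    docs = ["cheap loan offer", "meeting at noon",
            "loan approval offer", "lunch meeting today"]
    y = [1, 0, 1, 0]                         # toy binary labels

    vec = CountVectorizer()
    X = vec.fit_transform(docs)              # document-term matrix
    svm = LinearSVC(dual=False).fit(X, y)    # train on all the features
    scores = abs(svm.coef_).ravel()          # |weight| in the normal = score
    ranked = scores.argsort()[::-1]          # best-scoring terms first
    print([vec.get_feature_names_out()[i] for i in ranked[:3]])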

24
Scoring individual features (standard formulas
given below)
  • InformationGain
  • CrossEntropyTxt
  • MutualInfoTxt
  • OddsRatio
  • Frequency
  • Bi-NormalSeparation
  • F - the Normal distribution cumulative
    probability function
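
The formulas behind these names were images in the original slides; the
standard definitions from the text feature-selection literature, with W
a feature (word) and F the Normal cumulative distribution function as
in the legend above, read:

    \mathrm{InfGain}(W) = H(C) - P(W)\,H(C \mid W) - P(\bar{W})\,H(C \mid \bar{W})

    \mathrm{CrossEntropyTxt}(W) = P(W) \sum_{C} P(C \mid W) \log \frac{P(C \mid W)}{P(C)}

    \mathrm{MutualInfoTxt}(W) = \sum_{C} P(C) \log \frac{P(W \mid C)}{P(W)}

    \mathrm{OddsRatio}(W) = \log \frac{P(W \mid C_{pos}) \big(1 - P(W \mid C_{neg})\big)}{\big(1 - P(W \mid C_{pos})\big) P(W \mid C_{neg})}

    \mathrm{BNS}(W) = \left| F^{-1}\big(P(W \mid C_{pos})\big) - F^{-1}\big(P(W \mid C_{neg})\big) \right|

Frequency needs no formula: it simply counts term or document
occurrences.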

25
Influence of feature selection on the
classification performance
  • Some ML algorithms are more sensitive to the
    feature subset than others
  • Naïve Bayes on document categorization is
    sensitive to the feature subset
  • Linear SVM has embedded weighting of features
    that partially compensates for feature selection

26
Illustration of feature selection
  • Naïve Bayes on Yahoo! hierarchy data
  • Comparison of different feature scoring measures
    in simple filtering
  • Linear SVM on standard Reuters-2000 news data
  • Comparison of scoring measures including embedded
    SVM-normal and perceptron used as pre-processing

27
Illustration on 5 datasets from the Yahoo! hierarchy
using Naïve Bayes [Mladenic & Grobelnik 2003]
28
[Charts: classification performance vs. number of selected features for the CrossEntropy, OddsRatio, MutualInf, InfGain, and Random scoring measures]
  • The size of the feature subset strongly
    influences performance
  • Some measures are more sensitive to it than
    others
29
  • Rank: the rank of the correct category in the
    list of all categories
  • F2-measure: combines precision and recall, with
    the emphasis on recall
  • Ctgs: the number of categories that look
    promising (a testing example needs to be
    classified by their models)
  • best results: Odds ratio
  • using only a small number of features (50-100,
    i.e. 0.2-5% of all features)
  • improves the performance of Naïve Bayes
  • surprisingly good results: the unsupervised Term
    frequency
  • poor results: Information gain
  • probably because it is not compatible with Naïve
    Bayes (it selects mostly features representative
    of the negative class and features informative
    when not occurring in the document)

30
Illustration on Reuters-2000 Data [Brank et al 2002]
[Timeline: 810,000 news articles in 103 categories; training period 20 Aug 1996 - 14 Apr 1997 (504,468 articles), test period 14 Apr 1997 - 19 Aug 1997 (302,323 articles)]
  • Reuters-2000 data used in the experiments
  • 16 categories covering the range of break-even
    point (estimated on a sample) and class
    distribution
  • training: a sample of 118,294 articles from the
    training period
  • testing: 302,323 articles from the test period

31
Experiments with the Naïve Bayes Classifier
  • Benefits from feature selection
  • SVM-normal gives the best performance

[Chart: F1 vs. number of selected features for the SVM-normal, InfoGain, OddsRatio, and Perceptron-normal scoring measures]
32
Average number of nonzero components per vector
instead of the overall number of features
  • The same results, showing F1 vs. the sparsity of
    the document vectors represented with the
    selected features

[Chart: F1 vs. average number of nonzero components per vector for the SVM-normal, InfoGain, OddsRatio, and Perceptron-normal scoring measures]
33
Experiments with the Perceptron Classifier
  • Does not benefit from feature selection
  • Perceptron-normal and SVM-normal feature
    selection give comparable performance

[Chart: F1 vs. number of selected features for the SVM-normal, InfoGain, Perceptron-normal, and OddsRatio scoring measures]
34
Experiments with the Linear SVM Classifier
  • Does not benefit from feature selection
  • SVM-normal gives the best performance

[Chart: F1 vs. number of selected features for the SVM-normal, OddsRatio, InfoGain, and Perceptron-normal scoring measures]
35
Discussion: Using discarded features can help
  • Features that harm performance when used as
    inputs were found to improve performance when
    used as additional outputs
  • obtain additional information by introducing a
    mapping from the selected features to the
    discarded features (the multitask learning
    setting [Caruana & de Sa 2003])
  • experiments on synthetic regression and
    classification problems and on real-world medical
    data have shown improvements in performance
  • Intuition: transfer of information occurs inside
    the model when, in addition to the class value,
    it also models an additional output consisting of
    the discarded features

36
Discussion
  • Feature subset selection as pre-processing
  • ignores interaction with the target learning
    algorithm
  • Simple Filters: work for a large number of
    features
  • assume feature independence; limited results
  • the size of the feature subset has to be
    determined
  • Filters: search a space of size 2^N (for N
    features), cannot handle many features
  • rely on general data characteristics
    (consistency, distance, class distribution)
  • Wrappers: use the target learning algorithm for
    evaluation; high accuracy, computationally
    expensive
  • use model selection with cross-validation of the
    target algorithm, similar to metric-based model
    selection (e.g. comparing output on training and
    on unlabeled data)
  • Feature subset selection during learning
  • uses the target learning algorithm during feature
    selection
  • Embedded: can be used by filters to find the
    feature subset