1
What I Did This Summer
  • Rebecca Fiebrink
  • 2 August 2005

2
or,
THE TRUTH ABOUT FEATURE SELECTION
3
Overview
  • Wrapper-based feature selection
  • Common wisdom regarding feature selection
  • Methodological problems with existing work
  • Practical problems with existing work
  • Where my work fits in

4
Wrapper-based Feature Selection
  • Features = Attributes = the classifier's
    'handles' on the stuff we want to classify
  • We'd like to improve classifier performance by
    removing irrelevant/redundant features
  • Train a classifier using an example feature
    subset, then look at its performance
  • Perform some sort of search among subsets, using
    classification accuracy to inform the search and
    ultimately the choice of the best subset
  • The search could be a genetic algorithm, random,
    exhaustive, etc. (a minimal sketch follows this
    list)
  • Feature weighting is a related option
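
A minimal sketch of the wrapper idea above, assuming scikit-learn and a greedy forward search (the slide also mentions genetic, random, and exhaustive searches); the Wisconsin Breast Cancer data, k, and the subset-size cap are illustrative stand-ins, not the talk's actual setup:

# Greedy forward wrapper selection: repeatedly add the feature that most
# improves k-NN cross-validation accuracy on the available (training) data.
import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_forward_selection(X, y, max_features=5, k=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    clf = KNeighborsClassifier(n_neighbors=k)
    while remaining and len(selected) < max_features:
        # Score every candidate subset (current selection plus one new feature).
        scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:     # stop when no candidate improves the score
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = s_best
    return selected, best_score

X, y = load_breast_cancer(return_X_y=True)
subset, score = wrapper_forward_selection(X, y)
print("selected features:", subset, "CV accuracy:", round(score, 3))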

5
Common Wisdom
  • Feature selection can improve accuracy
  • More intensive/complex search algorithms can
    improve accuracy, given the necessary resources
  • Standard datasets make good examples when
    comparing search algorithms
  • Classifier performance on available data gives us
    useful information for predicting its performance
    on new data

6
Big Problem 1
  • Many feature selection/weighting studies don't
    measure performance on a hold-out test set!
  • e.g., Punch et al. 1993
  • See Reunanen 2003
  • Algorithm A is claimed to be better than B
    because A can usually find subsets that lead to a
    higher classification rate on dataset X
  • A might actually be overfitting on X! (see the
    hold-out sketch below)
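
A sketch of the pitfall and the fix, reusing the wrapper_forward_selection helper from the sketch above; the 30% hold-out fraction and the dataset are assumptions for illustration, not the protocol of the cited studies:

# The accuracy that guided the search is computed on data the search has seen,
# so it is optimistic; only a hold-out split the search never touched gives an
# honest basis for claiming that algorithm A beats algorithm B.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_search, X_holdout, y_search, y_holdout = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

subset, search_score = wrapper_forward_selection(X_search, y_search)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_search[:, subset], y_search)
holdout_score = clf.score(X_holdout[:, subset], y_holdout)

print("accuracy that guided the search:", round(search_score, 3))
print("accuracy on untouched hold-out: ", round(holdout_score, 3))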

7
Related to Big Problem 1
  • There is little work establishing good practices
    for comparing selection algorithms using a
    hold-out set
  • How much data to leave out? 10%? 50%? 90%?
  • Stratify the data before choosing hold-out?
  • Use cross-validation or random selection to form
    multiple test sets?
  • Numerous parameters to set for selection methods
    and classifier on top of this
  • Metric for classifier performance to guide
    selection?
  • Which normalization method, distance metric, etc.
    (for k-NN)? (a sketch of this design space
    follows this list)
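
These questions multiply: every combination of answers is a different experiment. The grid below is only an illustration of that design space, and every value in it is an assumption rather than a recommendation from the talk:

# Enumerate one hypothetical design space for a selection-method comparison.
from itertools import product

design_space = {
    "holdout_fraction": [0.1, 0.5, 0.9],        # how much data to leave out
    "stratified":       [True, False],          # stratify before splitting?
    "resampling":       ["cross_validation", "random_subsampling"],
    "search_metric":    ["accuracy", "balanced_accuracy"],
    "knn_k":            [1, 3, 5],               # classifier parameter
    "distance":         ["euclidean", "manhattan"],
    "normalization":    ["none", "min_max", "z_score"],
}

configs = [dict(zip(design_space, values))
           for values in product(*design_space.values())]
print(len(configs), "distinct experimental configurations")   # 432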

8
Big Problem 2
  • Wrapper-based selection relies on performance on
    the training data to guide the search.

9
Big Problem 2
  • Ideal case: Ecoli

10-11
(No transcript available for these slides)
12
Big Problem 2
  • Ugly Blob case: Breast Cancer

13-14
(No transcript available for these slides)
15
Related to Problem 2
  • For the datasets where training and testing
    accuracy do correlate well, optimal performance
    is often achieved using all features
  • Example: Tic-tac-toe (a correlation sketch
    follows this list)
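
One rough way to quantify the "ideal case vs. ugly blob" distinction is to sample random feature subsets and correlate training-set cross-validation accuracy with hold-out accuracy; the dataset and every parameter below are stand-ins, not the data behind the talk's plots:

# A tight, correlated cloud of (train CV, hold-out) points is the ideal case;
# a shapeless blob means training accuracy is a poor guide for the search.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

pairs = []
for _ in range(50):                              # 50 random feature subsets
    size = rng.integers(1, X.shape[1] + 1)
    subset = rng.choice(X.shape[1], size=size, replace=False)
    clf = KNeighborsClassifier(n_neighbors=5)
    train_cv = cross_val_score(clf, X_tr[:, subset], y_tr, cv=5).mean()
    test_acc = clf.fit(X_tr[:, subset], y_tr).score(X_te[:, subset], y_te)
    pairs.append((train_cv, test_acc))

corr = np.corrcoef(np.array(pairs).T)[0, 1]
print("correlation between training CV and hold-out accuracy:", round(corr, 2))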

16-17
(No transcript available for these slides)
18
The Truth
  • Some algorithms that have been claimed to be
    superior may not be
  • Popular standard datasets (e.g., UCI repository)
    can actually be horrible examples to use
  • Feature selection won't improve accuracy in many
    cases (see Reunanen 2004)
  • Conclusions that have been drawn using these
    datasets might not apply to other domains (e.g.,
    music)
  • We really know less than we thought we knew

19
My work
  • Establish that the Big Ugly Blob phenomenon is a
    problem
  • Examine potential for feature selection/weighting
    to improve accuracy on many datasets
  • Design an informed framework for comparing
    selection methods
  • Fix parameters of algorithms and classifier
    (k-NN) appropriately
  • Fix comparison framework options, such as ratio
    of training size to testing size
  • Run comparisons of different algorithms on
    different datasets in order to generalize about
    their behavior (an outline of such a loop is
    sketched below)
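
The outer comparison loop might look roughly like the sketch below; the datasets are scikit-learn stand-ins for the UCI sets, the wrapper method reuses the wrapper_forward_selection sketch from earlier, and the "all features" baseline is a hypothetical second method added only for illustration:

# Fixed classifier and framework options; each selection method is judged only
# on an untouched hold-out split of each dataset.
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

datasets = {"breast_cancer": load_breast_cancer, "iris": load_iris,
            "wine": load_wine}
methods = {"forward_wrapper": wrapper_forward_selection,
           "all_features": lambda X, y: (list(range(X.shape[1])), None)}

for dname, loader in datasets.items():
    X, y = loader(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    for mname, method in methods.items():
        subset, _ = method(X_tr, y_tr)
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, subset], y_tr)
        print(dname, mname, round(clf.score(X_te[:, subset], y_te), 3))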

20
Very brief list of references mentioned
  • Blake, C., and C. Merz. 1998. UCI Repository of
    machine learning databases.
    <http://www.ics.uci.edu/mlearn/MLRepository.html>
    University of California, Irvine, Department of
    Information and Computer Sciences. Accessed 13
    April 2005.
  • Punch, W. et al. 1993. Further research on
    feature selection and classification using
    genetic algorithms. Proceedings of the Fifth
    International Conference on Genetic Algorithms,
    557-64.
  • Reunanen, J. 2003. Overfitting in making
    comparisons between variable selection methods.
    Journal of Machine Learning Research 3: 1371-82.
  • Reunanen, J. 2004. A pitfall in determining the
    optimal feature subset size. Proceedings of the
    Fourth International Workshop on Pattern
    Recognition in Information Systems, 176-85.