1
What I Did This Summer
  • Rebecca Fiebrink
  • 2 August 2005

2
or,
THE TRUTH ABOUT FEATURE SELECTION
3
Overview
  • Wrapper-based feature selection
  • Common wisdom regarding feature selection
  • Methodological problems with existing work
  • Practical problems with existing work
  • Where my work fits in

4
Wrapper-based Feature Selection
  • Features = Attributes = the classifier's
    'handles' on the stuff we want to classify
  • We'd like to improve classifier performance by
    removing irrelevant/redundant features
  • Train a classifier using an example feature
    subset, then look at its performance
  • Perform some sort of search among subsets, using
    classification accuracy to inform the search and
    ultimately the choice of the best subset
  • The search could be a genetic algorithm, random,
    exhaustive, etc. (a minimal sketch follows this
    list)
  • Feature weighting is a related option
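
A minimal sketch of the wrapper idea above, assuming scikit-learn and a greedy forward search (the slide also mentions genetic, random, and exhaustive searches); the Wisconsin Breast Cancer data, k, and the subset-size cap are illustrative stand-ins, not the talk's actual setup:

# Greedy forward wrapper selection: repeatedly add the feature that most
# improves k-NN cross-validation accuracy on the available (training) data.
import numpy as np
from sklearn.datasets import load_breast_cancer          # stand-in dataset
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def wrapper_forward_selection(X, y, max_features=5, k=5):
    remaining = list(range(X.shape[1]))
    selected, best_score = [], -np.inf
    clf = KNeighborsClassifier(n_neighbors=k)
    while remaining and len(selected) < max_features:
        # Score every candidate subset (current selection plus one new feature).
        scores = {f: cross_val_score(clf, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        f_best, s_best = max(scores.items(), key=lambda kv: kv[1])
        if s_best <= best_score:     # stop when no candidate improves the score
            break
        selected.append(f_best)
        remaining.remove(f_best)
        best_score = s_best
    return selected, best_score

X, y = load_breast_cancer(return_X_y=True)
subset, score = wrapper_forward_selection(X, y)
print("selected features:", subset, "CV accuracy:", round(score, 3))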

5
Common Wisdom
  • Feature selection can improve accuracy
  • More intensive/complex search algorithms can
    improve accuracy, given the necessary resources
  • Standard datasets make good examples when
    comparing search algorithms
  • Classifier performance on available data gives us
    useful information for predicting its performance
    on new data

6
Big Problem 1
  • Many feature selection/weighting studies don't
    measure performance on a hold-out test set!
  • e.g., Punch et al. 1993
  • See Reunanen 2003
  • Algorithm A is claimed to be better than B
    because A can usually find subsets that lead to a
    higher classification rate on dataset X
  • A might actually be overfitting on X! (see the
    hold-out sketch below)
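
A sketch of the pitfall and the fix, reusing the wrapper_forward_selection helper from the sketch above; the 30% hold-out fraction and the dataset are assumptions for illustration, not the protocol of the cited studies:

# The accuracy that guided the search is computed on data the search has seen,
# so it is optimistic; only a hold-out split the search never touched gives an
# honest basis for claiming that algorithm A beats algorithm B.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_search, X_holdout, y_search, y_holdout = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

subset, search_score = wrapper_forward_selection(X_search, y_search)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_search[:, subset], y_search)
holdout_score = clf.score(X_holdout[:, subset], y_holdout)

print("accuracy that guided the search:", round(search_score, 3))
print("accuracy on untouched hold-out: ", round(holdout_score, 3))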

7
Related to Big Problem 1
  • There is little work establishing good practices
    for comparing selection algorithms using a
    hold-out set
  • How much data to leave out? 10%? 50%? 90%?
  • Stratify the data before choosing hold-out?
  • Use cross-validation or random selection to form
    multiple test sets?
  • Numerous parameters to set for selection methods
    and classifier on top of this
  • Metric for classifier performance to guide
    selection?
  • Which normalization method, distance metric, etc.
    (for k-NN)? (a sketch of this design space
    follows this list)
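
These questions multiply: every combination of answers is a different experiment. The grid below is only an illustration of that design space, and every value in it is an assumption rather than a recommendation from the talk:

# Enumerate one hypothetical design space for a selection-method comparison.
from itertools import product

design_space = {
    "holdout_fraction": [0.1, 0.5, 0.9],        # how much data to leave out
    "stratified":       [True, False],          # stratify before splitting?
    "resampling":       ["cross_validation", "random_subsampling"],
    "search_metric":    ["accuracy", "balanced_accuracy"],
    "knn_k":            [1, 3, 5],               # classifier parameter
    "distance":         ["euclidean", "manhattan"],
    "normalization":    ["none", "min_max", "z_score"],
}

configs = [dict(zip(design_space, values))
           for values in product(*design_space.values())]
print(len(configs), "distinct experimental configurations")   # 432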

8
Big Problem 2
  • Wrapper-based selection relies on performance on
    the training data to guide the search.

9
Big Problem 2
  • Ideal case: Ecoli

10-11
(No transcript available for these slides)
12
Big Problem 2
  • Ugly Blob case: Breast Cancer

13-14
(No transcript available for these slides)
15
Related to Problem 2
  • For the datasets where training and testing
    accuracy do correlate well, optimal performance
    is often achieved using all features
  • Example: Tic-tac-toe (a correlation sketch
    follows this list)
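
One rough way to quantify the "ideal case vs. ugly blob" distinction is to sample random feature subsets and correlate training-set cross-validation accuracy with hold-out accuracy; the dataset and every parameter below are stand-ins, not the data behind the talk's plots:

# A tight, correlated cloud of (train CV, hold-out) points is the ideal case;
# a shapeless blob means training accuracy is a poor guide for the search.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          stratify=y, random_state=0)

pairs = []
for _ in range(50):                              # 50 random feature subsets
    size = rng.integers(1, X.shape[1] + 1)
    subset = rng.choice(X.shape[1], size=size, replace=False)
    clf = KNeighborsClassifier(n_neighbors=5)
    train_cv = cross_val_score(clf, X_tr[:, subset], y_tr, cv=5).mean()
    test_acc = clf.fit(X_tr[:, subset], y_tr).score(X_te[:, subset], y_te)
    pairs.append((train_cv, test_acc))

corr = np.corrcoef(np.array(pairs).T)[0, 1]
print("correlation between training CV and hold-out accuracy:", round(corr, 2))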

16-17
(No transcript available for these slides)
18
The Truth
  • Some algorithms that have been claimed to be
    superior may not be
  • Popular standard datasets (e.g., UCI repository)
    can actually be horrible examples to use
  • Feature selection won't improve accuracy in many
    cases (see Reunanen 2004)
  • Conclusions that have been drawn using these
    datasets might not apply to other domains (e.g.,
    music)
  • We really know less than we thought we knew

19
My work
  • Establish that the Big Ugly Blob phenomenon is a
    problem
  • Examine potential for feature selection/weighting
    to improve accuracy on many datasets
  • Design an informed framework for comparing
    selection methods
  • Fix parameters of algorithms and classifier
    (k-NN) appropriately
  • Fix comparison framework options, such as ratio
    of training size to testing size
  • Run comparisons of different algorithms on
    different datasets in order to generalize about
    their behavior (an outline of such a loop is
    sketched below)
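
The outer comparison loop might look roughly like the sketch below; the datasets are scikit-learn stand-ins for the UCI sets, the wrapper method reuses the wrapper_forward_selection sketch from earlier, and the "all features" baseline is a hypothetical second method added only for illustration:

# Fixed classifier and framework options; each selection method is judged only
# on an untouched hold-out split of each dataset.
from sklearn.datasets import load_breast_cancer, load_iris, load_wine
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

datasets = {"breast_cancer": load_breast_cancer, "iris": load_iris,
            "wine": load_wine}
methods = {"forward_wrapper": wrapper_forward_selection,
           "all_features": lambda X, y: (list(range(X.shape[1])), None)}

for dname, loader in datasets.items():
    X, y = loader(return_X_y=True)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
    for mname, method in methods.items():
        subset, _ = method(X_tr, y_tr)
        clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr[:, subset], y_tr)
        print(dname, mname, round(clf.score(X_te[:, subset], y_te), 3))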

20
Very brief list of references mentioned
  • Blake, C., and C. Merz. 1998. UCI Repository of
    machine learning databases.
    <http://www.ics.uci.edu/mlearn/MLRepository.html>
    University of California, Irvine, Department of
    Information and Computer Sciences. Accessed 13
    April 2005.
  • Punch, W. et al. 1993. Further research on
    feature selection and classification using
    genetic algorithms. Proceedings of the Fifth
    International Conference on Genetic Algorithms,
    557-64.
  • Reunanen, J. 2003. Overfitting in making
    comparisons between variable selection methods.
    Journal of Machine Learning Research 3: 1371-82.
  • Reunanen, J. 2004. A pitfall in determining the
    optimal feature subset size. Proceedings of the
    Fourth International Workshop on Pattern
    Recognition in Information Systems, 176-85.