1 Lecture 5: Feature Selection (adapted from Elena Marchiori's slides)
Bioinformatics Data Analysis and Tools
bvhoute_at_few.vu.nl
2 What is feature selection?
- Reducing the feature space by removing some of the (non-relevant) features.
- Also known as:
  - variable selection
  - feature reduction
  - attribute selection
  - variable subset selection
3 Why select features?
- It is cheaper to measure fewer variables.
- The resulting classifier is simpler and potentially faster.
- Prediction accuracy may improve by discarding irrelevant variables.
- Identifying relevant variables gives more insight into the nature of the corresponding classification problem (biomarker detection).
- It alleviates the curse of dimensionality.
4 Why select features?
[Figure: correlation plots of the leukemia data (3 classes), with no feature selection vs. the top 100 features selected by variance; colour scale from -1 to 1.]
5 The curse of dimensionality
- Term introduced by Richard Bellman¹.
- Problems caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space.
- So the problem space increases with the number of variables/features.
¹ Bellman, R.E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
6 The curse of dimensionality
- A high-dimensional feature space leads to problems in, for example:
  - Machine learning: danger of overfitting with too many variables.
  - Optimization: finding the global optimum is (virtually) infeasible in a high-dimensional space.
  - Microarray analysis: the number of features (genes) is much larger than the number of objects (samples), so a huge number of observations would be needed to obtain a good estimate of the function of a gene.
- (A small numerical illustration of this effect follows below.)
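As an illustration (not from the original slides), a minimal NumPy sketch of distance concentration: for points sampled uniformly in the unit hypercube, the contrast between the nearest and farthest neighbour of a query point shrinks as dimensions are added, which is one face of the curse.

    import numpy as np

    # Minimal sketch: 500 random points in the unit hypercube.
    # The relative contrast between the farthest and nearest neighbour
    # of a random query point shrinks as the dimension d grows.
    rng = np.random.default_rng(0)
    for d in (2, 10, 100, 1000):
        X = rng.uniform(size=(500, d))
        q = rng.uniform(size=d)
        dist = np.linalg.norm(X - q, axis=1)
        contrast = (dist.max() - dist.min()) / dist.min()
        print(f"d={d:5d}  relative contrast={contrast:.3f}")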
7 Approaches
- Wrapper
  - Feature selection takes into account the contribution to the performance of a given type of classifier.
- Filter
  - Feature selection is based on an evaluation criterion that quantifies how well features (or feature subsets) discriminate between the classes.
- Embedded
  - Feature selection is part of the training procedure of a classifier (e.g. decision trees).
8 Embedded methods
- Attempt to jointly or simultaneously train both a classifier and a feature subset.
- Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features.
- Intuitively appealing.
- Example: tree-building algorithms (a small sketch follows below).
Adapted from J. Fridlyand
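A hedged sketch of the tree example (not part of the original slides), assuming scikit-learn is available; the toy data and parameters are arbitrary placeholders. Training the tree is itself the selection step: features never used in a split receive importance zero.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Toy data: 100 samples, 50 features, only feature 0 is informative.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 50))
    y = (X[:, 0] > 0).astype(int)

    # Feature selection happens inside training: unused features get importance 0.
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
    selected = np.nonzero(tree.feature_importances_ > 0)[0]
    print("features used by the tree:", selected)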
9 Approaches to Feature Selection
[Diagram: Filter approach — input features are scored with a distance metric, the selected features are passed to the model, and the model is trained. Wrapper approach — a feature-selection search proposes feature sets, a model is trained on each, and the importance of features given by the model guides the search.]
Adapted from Shin and Jasso
10 Filter methods
[Diagram: feature selection maps the p input features to a subset of S features (S << p), which are then passed to classifier design.]
- Features are scored independently and the top S are used by the classifier.
- Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.
- Easy to interpret. Can provide some insight into the disease markers. (A minimal sketch follows below.)
Adapted from J. Fridlyand
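A minimal sketch of such a filter (not from the slides), assuming scikit-learn; each feature is scored independently with the F-statistic and only the top S are kept before any classifier sees the data. The toy matrix and labels are placeholders.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 7129))          # p = 7129 features, few samples
    y = rng.integers(0, 2, size=38)          # two classes (toy labels)

    S = 100                                  # keep the top S features (S << p)
    filt = SelectKBest(score_func=f_classif, k=S).fit(X, y)
    X_reduced = filt.transform(X)            # shape (38, 100)
    top_features = np.argsort(filt.scores_)[::-1][:S]
    print(X_reduced.shape, top_features[:10])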
11 Problems with filter methods
- Redundancy in selected features: features are considered independently and not measured on the basis of whether they contribute new information.
- Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others).
- The classifier has no say in what features should be used: some scores may be more appropriate in conjunction with some classifiers than with others.
Adapted from J. Fridlyand
12 Wrapper methods
[Diagram: feature selection and classifier design are coupled; the p input features are reduced to a subset of S features (S << p) based on classifier performance.]
- Iterative approach: many feature subsets are scored based on classification performance and the best one is used. (A greedy forward-selection sketch follows below.)
Adapted from J. Fridlyand
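A minimal wrapper sketch (not from the slides), assuming scikit-learn: greedy forward selection that scores each candidate subset by cross-validated accuracy of a k-NN classifier and keeps the best addition at every step. Data and parameters are placeholders.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def forward_select(X, y, n_keep=5):
        """Greedy wrapper: grow the feature subset one feature at a time."""
        selected, remaining = [], list(range(X.shape[1]))
        while len(selected) < n_keep:
            scores = []
            for f in remaining:
                cols = selected + [f]
                acc = cross_val_score(KNeighborsClassifier(3), X[:, cols], y, cv=3).mean()
                scores.append((acc, f))
            best_acc, best_f = max(scores)       # subset scored by classifier performance
            selected.append(best_f)
            remaining.remove(best_f)
        return selected

    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 30))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    print(forward_select(X, y, n_keep=3))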
13 Problems with wrapper methods
- Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated.
- No exhaustive search is possible (2^p subsets to consider); generally only greedy algorithms are used.
- Easy to overfit.
Adapted from J. Fridlyand
14 Example: Microarray Analysis
[Diagram: 38 labeled bone marrow samples (27 ALL, 11 AML), each containing 7129 gene expression values → train a model (using Neural Networks, Support Vector Machines, Bayesian nets, etc.) → key genes and a trained model; 34 new unlabeled bone marrow samples → model → AML/ALL prediction.]
15 Microarray Data: Challenges to Machine Learning Algorithms
- Few samples for analysis (38 labeled).
- Extremely high-dimensional data (7129 gene expression values per sample).
- Noisy data.
- Complex underlying mechanisms, not fully understood.
16 Some genes are more useful than others for building classification models
[Figure: expression of genes 36569_at and 36495_at — useful for separating the classes.]
17 Some genes are more useful than others for building classification models
[Figure: expression of genes 36569_at and 36495_at with the AML and ALL samples labeled — the two classes separate well.]
18 Some genes are more useful than others for building classification models
[Figure: expression of genes 37176_at and 36563_at — not useful.]
19 Importance of feature (gene) selection
- The majority of genes are not directly related to leukemia.
- Having a large number of features enhances the model's flexibility, but makes it prone to overfitting.
- Noise and the small number of training samples make this even more likely.
- Some types of models, like kNN, do not scale well with many features.
20 How do we choose the most relevant of the 7129 genes?
- Use distance metrics to capture class separation.
- Rank genes according to their distance metric score.
- Choose the top n ranked genes.
[Figure: example expression profiles of a gene with a HIGH score and a gene with a LOW score.]
21 Distance metrics
- Tamayo's Relative Class Separation
- t-test
- Bhattacharyya distance
(A sketch computing the first two follows below.)
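A minimal NumPy sketch (not from the slides) of the first two scores, assuming Tamayo's relative class separation is the signal-to-noise ratio (mean1 - mean2) / (sd1 + sd2) and taking Welch's t-statistic for the t-test; genes are then ranked by absolute score. The expression matrices are toy placeholders.

    import numpy as np

    def snr_score(X1, X2):
        """Relative class separation / signal-to-noise: (mu1 - mu2) / (sd1 + sd2)."""
        return (X1.mean(0) - X2.mean(0)) / (X1.std(0, ddof=1) + X2.std(0, ddof=1))

    def t_score(X1, X2):
        """Welch t-statistic per gene."""
        v1, v2 = X1.var(0, ddof=1), X2.var(0, ddof=1)
        return (X1.mean(0) - X2.mean(0)) / np.sqrt(v1 / len(X1) + v2 / len(X2))

    rng = np.random.default_rng(0)
    X_aml = rng.normal(0.0, 1.0, size=(11, 7129))   # toy expression matrices
    X_all = rng.normal(0.2, 1.0, size=(27, 7129))

    scores = np.abs(snr_score(X_aml, X_all))
    top_n = np.argsort(scores)[::-1][:100]          # indices of the top 100 genes
    print(top_n[:10], np.abs(t_score(X_aml, X_all))[top_n[:3]])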
22 SVM-RFE wrapper
- Recursive Feature Elimination:
  - Train a linear SVM → linear decision function.
  - Use the absolute value of the variable weights to rank the variables.
  - Remove the half of the variables with the lowest ranks.
  - Repeat the above steps (train, rank, remove) on the data restricted to the variables not removed.
  - Output: subset of variables.
23 SVM-RFE
- Linear binary classifier decision function: f(x) = sign(w · x + b).
- Recursive Feature Elimination (SVM-RFE): variables are scored by their weights, e.g. score_i = |w_i| (equivalently w_i^2).
- At each iteration:
  - eliminate a threshold fraction of the variables with the lowest scores
  - recompute the scores of the remaining variables
(A minimal sketch follows below.)
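A minimal sketch of the recursion (my own, not the authors' code), assuming a scikit-learn linear SVM; half of the remaining variables with the smallest w_i^2 are dropped at each iteration until the requested number is left. Data and parameters are placeholders.

    import numpy as np
    from sklearn.svm import LinearSVC

    def svm_rfe(X, y, n_keep=16, C=1.0):
        """SVM-RFE sketch: train, rank by squared weights, drop the lower half, repeat."""
        active = np.arange(X.shape[1])
        while len(active) > n_keep:
            svm = LinearSVC(C=C, dual=False).fit(X[:, active], y)
            scores = svm.coef_.ravel() ** 2            # ranking criterion w_i^2
            order = np.argsort(scores)                 # ascending: worst first
            keep = max(n_keep, len(active) // 2)       # remove half each round
            active = active[order[len(active) - keep:]]
        return np.sort(active)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(38, 200))
    y = (X[:, 0] - X[:, 1] > 0).astype(int)
    print(svm_rfe(X, y, n_keep=8))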
24 SVM-RFE
I. Guyon et al., Machine Learning, 46, 389-422, 2002.
25 RELIEF
- Idea: relevant features make (1) nearest examples of the same class closer and (2) nearest examples of opposite classes farther apart.
- Initialize the weights of all features to zero.
- For each example in the training set:
  - find the nearest example from the same class (hit) and from the opposite class (miss)
  - update the weight of each feature by adding abs(example - miss) - abs(example - hit)
RELIEF: Kira K, Rendell L, 10th Int. Conf. on AI, 129-134, 1992.
26 RELIEF Algorithm
- RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.

RELIEF
  input: X (two classes)
  output: W (weights assigned to variables)
  nr_var = total number of variables
  weights = zero vector of size nr_var
  for all x in X do
    hit(x) = nnb of x from the same class
    miss(x) = nnb of x from the opposite class
    weights += abs(x - miss(x)) - abs(x - hit(x))
  end
  nr_ex = number of examples in X
  return W = weights / nr_ex

- Note: variables have to be normalized (e.g., divide each variable by its (max - min) value). (A runnable sketch follows below.)
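A runnable NumPy translation of the pseudocode above (my sketch, not the original course code); it applies the (max - min) normalization and uses Euclidean nearest neighbours, whereas the worked example on the next slides uses 1 - Pearson correlation as the distance measure.

    import numpy as np

    def relief(X, y):
        """RELIEF for two classes: returns one weight per feature."""
        X = X / (X.max(0) - X.min(0))                 # (max - min) normalization
        n = len(X)
        weights = np.zeros(X.shape[1])
        for i in range(n):
            d = np.linalg.norm(X - X[i], axis=1)      # distances to all examples
            d[i] = np.inf                             # exclude the example itself
            same, other = (y == y[i]), (y != y[i])
            hit = X[np.where(same)[0][np.argmin(d[same])]]
            miss = X[np.where(other)[0][np.argmin(d[other])]]
            weights += np.abs(X[i] - miss) - np.abs(X[i] - hit)
        return weights / n

    rng = np.random.default_rng(0)
    X = rng.uniform(1, 6, size=(6, 5))                # e.g. 6 patients, 5 genes (toy values)
    y = np.array([0, 0, 0, 1, 1, 1])                  # AML vs ALL labels
    print(relief(X, y))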
27 RELIEF example
- Gene expressions for two types of leukemia:
  - 3 patients with AML (Acute Myeloid Leukemia)
  - 3 patients with ALL (Acute Lymphoblastic Leukemia)
- What are the weights of genes 1-5, assigned by RELIEF?
28 RELIEF: normalization
- First, apply (max - min) normalization:
  - identify the max and min value of each feature (gene)
  - divide all values of each feature by the corresponding (max - min)
- Example: 3 / (6 - 1) = 0.6 (see the sketch below).
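A one-line sketch of this step (assuming NumPy), reproducing the 3 / (6 - 1) = 0.6 example for a gene whose values range from 1 to 6.

    import numpy as np

    expr = np.array([1.0, 3.0, 6.0])             # toy expression values of one gene
    norm = expr / (expr.max() - expr.min())      # divide by (max - min) = 5
    print(norm)                                  # [0.2 0.6 1.2]; 3 / (6 - 1) = 0.6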
29 RELIEF: distance matrix
- Data after normalization.
- Then, calculate the distance matrix (see the sketch below).
- Distance measure: 1 - Pearson correlation.
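A small sketch of that distance matrix (assuming NumPy; the patient profiles here are random placeholders, not the slide's table): pairwise 1 - Pearson correlation between the normalized patient profiles.

    import numpy as np

    rng = np.random.default_rng(0)
    X_norm = rng.uniform(size=(6, 5))            # 6 patients x 5 normalized genes
    D = 1 - np.corrcoef(X_norm)                  # 6 x 6 matrix of 1 - Pearson correlation
    print(np.round(D, 2))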
30 RELIEF: 1st iteration
[Table: RELIEF weight update, iteration 1, example AML1.]
31 RELIEF: 2nd iteration
[Table: RELIEF weight update, iteration 2, example AML2.]
32 RELIEF: results (after 6th iteration)
- Weights after the last iteration.
- The last step is to sort the features by their weights and select the features with the highest ranks.
33 RELIEF
- Advantages
  - Fast.
  - Easy to implement.
- Disadvantages
  - Does not filter out redundant features, so features with very similar values could be selected.
  - Not robust to outliers.
  - Classic RELIEF can only handle data sets with two classes.
34 Extension of RELIEF: RELIEF-F
- Extension for multi-class problems.
- Instead of finding one near miss, the algorithm finds one near miss for each different class and averages their contribution to updating the weights. (A sketch of the modified update follows below.)

RELIEF-F
  input: X (two or more classes C)
  output: W (weights assigned to variables)
  nr_var = total number of variables
  weights = zero vector of size nr_var
  for all x in X do
    hit(x) = nnb of x from the same class
    sum_miss = 0
    for all c in C, c != class(x), do
      miss(x, c) = nnb of x from class c
      sum_miss += abs(x - miss(x, c)) / nr_examples(c)
    end
    weights += sum_miss - abs(x - hit(x))
  end
  nr_ex = number of examples in X
  return W = weights / nr_ex
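A hedged NumPy sketch of the multi-class update (my reading of the pseudocode above, not Kononenko's canonical ReliefF): one near hit from the same class and one near miss per other class, with each miss contribution divided by that class's size as in the pseudocode.

    import numpy as np

    def relief_f(X, y):
        """RELIEF-F sketch following the pseudocode above (two or more classes)."""
        X = X / (X.max(0) - X.min(0))                     # (max - min) normalization
        n = len(X)
        weights = np.zeros(X.shape[1])
        for i in range(n):
            d = np.linalg.norm(X - X[i], axis=1)
            d[i] = np.inf                                 # exclude the example itself
            hit_idx = np.where(y == y[i])[0]
            hit = X[hit_idx[np.argmin(d[hit_idx])]]       # near hit from the same class
            sum_miss = np.zeros(X.shape[1])
            for c in np.unique(y):
                if c == y[i]:
                    continue
                idx = np.where(y == c)[0]
                miss = X[idx[np.argmin(d[idx])]]          # near miss from class c
                sum_miss += np.abs(X[i] - miss) / len(idx)
            weights += sum_miss - np.abs(X[i] - hit)
        return weights / n

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(9, 5))
    y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])             # three classes (toy labels)
    print(relief_f(X, y))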