1
Lecture 5: Feature Selection (Elena Marchiori's slides, adapted)
Bioinformatics Data Analysis and Tools
bvhoute@few.vu.nl
2
What is feature selection?
  • Reducing the feature space by removing some of
    the (non-relevant) features.
  • Also known as:
  • variable selection
  • feature reduction
  • attribute selection
  • variable subset selection

3
Why select features?
  • It is cheaper to measure fewer variables.
  • The resulting classifier is simpler and
    potentially faster.
  • Prediction accuracy may improve by discarding
    irrelevant variables.
  • Identifying relevant variables gives more insight
    into the nature of the corresponding
    classification problem (biomarker detection).
  • It alleviates the curse of dimensionality.

4
Why select features?
[Figure: correlation plots of the leukemia data (3 classes), without feature selection and with the top 100 features selected by variance; colour scale from -1 to 1.]
5
The curse of dimensionality
  • Term introduced by Richard Bellman [1].
  • Problems caused by the exponential increase in
    volume associated with adding extra dimensions to
    a (mathematical) space.
  • So the problem space increases with the number
    of variables/features.

[1] Bellman, R.E. 1957. Dynamic Programming. Princeton University Press, Princeton, NJ.
6
The curse of dimensionality
  • A high-dimensional feature space leads to problems in, for example:
  • Machine learning: danger of overfitting with too many variables.
  • Optimization: finding the global optimum is (virtually) infeasible in a high-dimensional space.
  • Microarray analysis: the number of features (genes) is much larger than the number of objects (samples), so a huge number of observations would be needed to obtain a good estimate of the function of a gene.

7
Approaches
  • Wrapper: feature selection takes into account the contribution to the performance of a given type of classifier.
  • Filter: feature selection is based on an evaluation criterion that quantifies how well features (or feature subsets) discriminate between the two classes.
  • Embedded: feature selection is part of the training procedure of a classifier (e.g. decision trees).

8
Embedded methods
  • Attempt to jointly (simultaneously) train both a classifier and a feature subset.
  • Often optimize an objective function that jointly rewards classification accuracy and penalizes the use of more features.
  • Intuitively appealing.
  • Example: tree-building algorithms (a sketch follows below).

Adapted from J. Fridlyand
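A minimal sketch of the embedded idea, assuming scikit-learn is available (the toy data, parameters and variable names are illustrative, not from the slides): training a decision tree selects features as a side effect, and feature_importances_ shows which features the tree actually used.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: 100 samples, 50 features; only feature 0 carries the class signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
y = (X[:, 0] > 0).astype(int)

# Fitting the classifier and selecting features happen in one step:
# features that never appear in a split receive importance zero.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
selected = np.flatnonzero(tree.feature_importances_ > 0)
print("features used by the tree:", selected)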
9
Approaches to Feature Selection
[Diagram: Filter approach: input features -> feature selection by distance-metric score -> train model. Wrapper approach: input features -> feature-selection search -> train model, with the importance of features given by the model fed back to the search.]
Adapted from Shin and Jasso
10
Filter methods
[Diagram: data in R^p -> feature selection -> data in R^S -> classifier design, with S << p.]
  • Features are scored independently and the top S are used by the classifier.
  • Scores: correlation, mutual information, t-statistic, F-statistic, p-value, tree importance statistic, etc.

Easy to interpret. Can provide some insight into
the disease markers.
Adapted from J. Fridlyand
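A minimal sketch of a filter, assuming a two-class expression matrix X (samples x genes), a label vector y and SciPy; the function name and the choice of the t-statistic as the score are illustrative.

import numpy as np
from scipy.stats import ttest_ind

def filter_top_s(X, y, S=100):
    """Score every feature independently with a two-sample t-statistic
    and return the indices of the S best-separating features."""
    t, _ = ttest_ind(X[y == 0], X[y == 1], axis=0, equal_var=False)
    ranking = np.argsort(-np.abs(t))   # largest |t| first
    return ranking[:S]

# Usage: selected = filter_top_s(X, y, S=100); X_reduced = X[:, selected]
# The classifier is trained afterwards on X_reduced and plays no role in the
# scoring, which is exactly the limitation discussed on the next slide.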
11
Problems with filter method
  • Redundancy in the selected features: features are considered independently and are not measured on whether they contribute new information.
  • Interactions among features generally cannot be explicitly incorporated (some filter methods are smarter than others).
  • The classifier has no say in which features should be used: some scores may be more appropriate in conjunction with some classifiers than with others.

Adapted from J. Fridlyand
12
Wrapper methods
[Diagram: data in R^p -> feature selection -> data in R^S -> classifier design, with S << p.]
  • Iterative approach: many feature subsets are scored based on classification performance and the best one is used (a sketch follows below).

Adapted from J. Fridlyand
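A minimal sketch of a wrapper, assuming scikit-learn: greedy forward selection that, at each step, adds the single feature whose inclusion gives the best cross-validated accuracy for the chosen classifier. The classifier (3-NN), the 5-fold cross-validation and the stopping size are illustrative choices, not prescribed by the slides.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def forward_selection(X, y, n_keep=10):
    """Greedy wrapper: grow the feature subset one feature at a time,
    scoring every candidate subset by cross-validated accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    clf = KNeighborsClassifier(n_neighbors=3)
    while len(selected) < n_keep and remaining:
        scores = [(cross_val_score(clf, X[:, selected + [j]], y, cv=5).mean(), j)
                  for j in remaining]
        best_score, best_j = max(scores)
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Every pass over `remaining` trains and evaluates one classifier per candidate
# feature, which is why wrappers are computationally expensive (next slide).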
13
Problems with wrapper methods
  • Computationally expensive: for each feature subset to be considered, a classifier must be built and evaluated.
  • No exhaustive search is possible (2^p subsets to consider); generally greedy algorithms only.
  • Easy to overfit.

Adapted from J. Fridlyand
14
Example Microarray Analysis
[Diagram: labeled cases (38 bone marrow samples: 27 ALL, 11 AML; each contains 7129 gene expression values) -> train model (using neural networks, support vector machines, Bayesian nets, etc.) -> key genes and trained model -> classify 34 new unlabeled bone marrow samples as AML or ALL.]
15
Microarray Data: Challenges to Machine Learning Algorithms
  • Few samples for analysis (38 labeled).
  • Extremely high-dimensional data (7129 gene
    expression values per sample).
  • Noisy data.
  • Complex underlying mechanisms, not fully
    understood.

16
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful.
17
Some genes are more useful than others for
building classification models
Example: genes 36569_at and 36495_at are useful.
[Figure: expression values of these two genes for the AML and ALL samples.]
18
Some genes are more useful than others for
building classification models
Example: genes 37176_at and 36563_at are not useful.
19
Importance of feature (gene) selection
  • The majority of genes are not directly related to leukemia.
  • Having a large number of features enhances the model's flexibility but makes it prone to overfitting.
  • Noise and the small number of training samples make this even more likely.
  • Some types of models, like kNN, do not scale well with many features.

20
How do we choose the most relevant of the 7129 genes?
  1. Distance metrics to capture class separation.
  2. Rank genes according to distance metric score.
  3. Choose the top n ranked genes.

[Figure: example genes with a HIGH separation score vs. a LOW separation score.]
21
Distance metrics
  • Tamayo's relative class separation
  • t-test
  • Bhattacharyya distance
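The formulas themselves are not reproduced in this transcript; as a hedged reminder, these per-gene scores are commonly written as follows (for one gene, \mu_k, \sigma_k and n_k are the mean, standard deviation and size of class k):

P(g) = \frac{\mu_1 - \mu_2}{\sigma_1 + \sigma_2}   (Tamayo's relative class separation, a signal-to-noise ratio)

t(g) = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}}   (t-test statistic)

B(g) = \frac{1}{4}\,\frac{(\mu_1 - \mu_2)^2}{\sigma_1^2 + \sigma_2^2} + \frac{1}{2}\,\ln\!\left(\frac{\sigma_1^2 + \sigma_2^2}{2\,\sigma_1\sigma_2}\right)   (Bhattacharyya distance between two univariate normal densities)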

22
SVM-RFE wrapper
  • Recursive Feature Elimination:
  • Train a linear SVM -> linear decision function.
  • Use the absolute values of the variable weights to rank the variables.
  • Remove the half of the variables with the lowest rank.
  • Repeat the above steps (train, rank, remove) on the data restricted to the variables not yet removed.
  • Output the subset of remaining variables (a sketch follows below).
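A minimal sketch of this recursion, assuming scikit-learn (an illustrative re-implementation, not the authors' original code): a linear SVM is refit at every round and the half of the surviving features with the smallest absolute weights is dropped.

import numpy as np
from sklearn.svm import SVC

def svm_rfe(X, y, n_keep=8):
    """SVM-RFE sketch: train a linear SVM, rank features by |weight|,
    drop the lower-ranked half, and repeat on the surviving features."""
    surviving = np.arange(X.shape[1])
    while len(surviving) > n_keep:
        svm = SVC(kernel="linear").fit(X[:, surviving], y)
        w = np.abs(svm.coef_).ravel()            # one weight per surviving feature
        keep = max(n_keep, len(surviving) // 2)  # remove (at most) half each round
        order = np.argsort(w)                    # ascending: smallest |w| first
        surviving = surviving[np.sort(order[-keep:])]
    return surviving

# Usage: selected = svm_rfe(X_train, y_train, n_keep=8); X_reduced = X_train[:, selected]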

23
SVM-RFE
  • Linear binary classifier decision function (see the formula below).
  • Recursive Feature Elimination (SVM-RFE): at each iteration,
  • eliminate a set fraction of the variables with the lowest scores, and
  • recompute the scores of the remaining variables.
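The decision function referred to above, reconstructed here from Guyon et al. (it is not legible in this transcript), together with the score used to rank variable i:

D(\mathbf{x}) = \mathbf{w} \cdot \mathbf{x} + b, \qquad \text{score}(i) = w_i^2

Removing the variable with the smallest w_i^2 is, to first approximation, the removal that changes the decision function least, which is why the SVM weights double as a ranking criterion.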

24
SVM-RFE: I. Guyon et al., Machine Learning, 46, 389-422, 2002.
25
RELIEF
  • Idea: relevant features make (1) the nearest examples of the same class closer and (2) the nearest examples of opposite classes farther apart.
  • Set the weights of all features to zero.
  • For each example in the training set:
  • find the nearest example from the same class (hit) and from the opposite class (miss);
  • update the weight of each feature by adding abs(example - miss) - abs(example - hit).

RELIEF: Kira K. and Rendell L., 10th Int. Conf. on AI, 129-134, 1992.
26
RELIEF Algorithm
  • RELIEF assigns weights to variables based on how well they separate samples from their nearest neighbors (nnb) from the same and from the opposite class.
  • RELIEF
  • input: X (two classes)
  • output: W (weights assigned to the variables)
  • nr_var = total number of variables
  • weights = zero vector of size nr_var
  • for all x in X do
  •   hit(x) = nnb of x from the same class
  •   miss(x) = nnb of x from the opposite class
  •   weights = weights + abs(x - miss(x)) - abs(x - hit(x))
  • end
  • nr_ex = number of examples in X
  • return W = weights / nr_ex
  • Note: variables have to be normalized first (e.g., divide each variable by its (max - min) value).
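A runnable sketch of this pseudocode in Python; the Manhattan distance used to find the nearest neighbors is an illustrative choice (the worked example on the following slides uses 1 - Pearson correlation instead).

import numpy as np

def relief(X, y):
    """RELIEF sketch: for every example, find its nearest hit and nearest
    miss and update the per-feature weights accordingly (two classes)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    ranges = X.max(axis=0) - X.min(axis=0)   # (max - min) normalization, as noted above
    ranges[ranges == 0] = 1.0                # guard against constant features
    X = X / ranges
    weights = np.zeros(X.shape[1])
    for i in range(len(X)):
        d = np.abs(X - X[i]).sum(axis=1)     # distances from x_i to all examples
        d[i] = np.inf                        # never pick x_i as its own neighbor
        same, other = (y == y[i]), (y != y[i])
        hit = X[np.where(same)[0][np.argmin(d[same])]]
        miss = X[np.where(other)[0][np.argmin(d[other])]]
        weights += np.abs(X[i] - miss) - np.abs(X[i] - hit)
    return weights / len(X)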

27
RELIEF example
Gene expression values for two types of leukemia:
- 3 patients with AML (Acute Myeloid Leukemia)
- 3 patients with ALL (Acute Lymphoblastic Leukemia)
  • What are the weights of genes 1-5, as assigned by RELIEF?

28
RELIEF normalization
First, apply (max - min) normalization:
- identify the max and min value of each feature (gene);
- divide all values of each feature by the corresponding (max - min).
Example: 3 / (6 - 1) = 0.6
29
RELIEF distance matrix
Data after normalization.
Then, calculate the distance matrix.
Distance measure: 1 - Pearson correlation.
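A minimal sketch of this step, assuming a samples-by-genes matrix X_norm holding the normalized data from the previous slide (the name is illustrative):

import numpy as np

def correlation_distance_matrix(X_norm):
    """Pairwise distances between the rows (samples) of X_norm,
    defined as 1 - Pearson correlation of their expression profiles."""
    return 1.0 - np.corrcoef(X_norm)

# D = correlation_distance_matrix(X_norm)
# D[i, j] is small when samples i and j have strongly correlated profiles,
# so a sample's nearest hit/miss is the sample with the smallest D entry.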
30
RELIEF 1st iteration
RELIEF, iteration 1: example AML1.
31
RELIEF 2nd iteration
RELIEF, iteration 2: example AML2.
32
RELIEF results (after 6th iteration)
Weights after the last iteration.
The last step is to sort the features by their weights and select the features with the highest ranks.
33
RELIEF
  • Advantages
  • Fast.
  • Easy to implement.
  • Disadvantages
  • Does not filter out redundant features, so
    features with very similar values could be
    selected.
  • Not robust to outliers.
  • Classic RELIEF can only handle data sets with two
    classes.

34
Extension of RELIEF: RELIEF-F
  • Extension for multi-class problems.
  • Instead of finding one near miss, the algorithm finds one near miss for each different class and averages their contributions when updating the weights.

RELIEF-F
input: X (two or more classes C)
output: W (weights assigned to the variables)
nr_var = total number of variables
weights = zero vector of size nr_var
for all x in X do
  hit(x) = nnb of x from the same class
  sum_miss = 0
  for all c in C, c != class of x, do
    miss(x, c) = nnb of x from class c
    sum_miss = sum_miss + abs(x - miss(x, c)) / nr_examples(c)
  end
  weights = weights + sum_miss - abs(x - hit(x))
end
nr_ex = number of examples in X
return W = weights / nr_ex