1
SVM-based Feature Selection for Genomic Data
  • Gang Fang
  • gangfang@cs.umn.edu

2
Motivation: Biomarker Discovery from Gene Expression Data

[Figure: gene expression heatmap (genes × samples) from the classical study of cancer subtypes by Golub et al. (1999), illustrating the identification of diagnostic genes]
3
Problem Statement
  • A gene expression matrix M (N × p)
  • M(i, j) is the j-th gene's expression level in patient i
  • Phenotype class label vector (N × 1)
  • Phenotypes can be:
  • Disease vs. non-disease
  • Survival longer than k years vs. less than k years
  • Cancer subtypes that need different treatments
  • Goal (biomarker discovery):
  • Find a small group of genes that is highly predictive w.r.t. the phenotype class, for future diagnostics.

4
The Major Challenge: N << p
# of samples N: 50–300
# of genes p: ~20,000
5
Additional Challenges
  • Noise
  • Microarray data is very noisy
  • Complexity of disease
  • Multiple genes together drive a disease
  • Univariate ranking mostly does not help
  • Exponential search space
  • 2^p possible gene subsets

6
Existing approaches
  • Do not explicitly identify biomarkers
  • Pure classification models
  • Dimension reduction
  • Biomarker identification oriented methods
  • Univariate ranking
  • Rank all the singletons via some measure
  • Correlation with label, Odds ratio, P-value
  • Cannot capture patterns involving multiple genes
  • Discriminative pattern mining
  • Feature selection

7
SVM-Recursive Feature Elimination - Motivation
  • Change in the objective function J when a feature is removed can serve as a ranking criterion (Kohavi & John, 1997)
  • For linear discriminant functions with a quadratic cost function J (a function of the weights w_i), the change DJ(i) and the magnitude of the weights are equivalent criteria:
  • Mean-squared-error classifier, with cost function J = (w·x − y)²
  • Linear SVMs, which minimize J = (1/2)||w||² under constraints
  • This justifies the use of w_i² as a feature-ranking criterion in linear SVMs.
  • Note: this is different from singleton ranking.

8
SVM-Recursive Feature Elimination - Algorithm
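The original slide shows the algorithm as a figure. A minimal runnable sketch of the one-at-a-time variant, assuming a linear SVM (scikit-learn's LinearSVC here) and synthetic data, could look like:

```python
# Sketch of SVM-RFE: repeatedly fit a linear SVM on the surviving
# features and eliminate the feature(s) with the smallest w_i^2.
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_remove=1):
    """Return feature indices ranked from most to least important."""
    surviving = list(range(X.shape[1]))
    ranking = []  # filled from least to most important
    while surviving:
        clf = LinearSVC(C=1.0, dual=False, max_iter=10000)
        clf.fit(X[:, surviving], y)
        w2 = clf.coef_.ravel() ** 2                    # ranking criterion: w_i^2
        drop = np.argsort(w2)[:min(n_remove, len(surviving))]
        for pos in sorted(drop, reverse=True):         # pop from the back first
            ranking.append(surviving.pop(pos))
    return ranking[::-1]                               # most important first

# Toy data: only features 3 and 7 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 3] + X[:, 7] > 0).astype(int)
ranked = svm_rfe(X, y)
print(ranked)
```

With n_remove > 1 this same loop realizes the multiple-features-at-a-time variations discussed on the next slide.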
9
SVM-Recursive Feature Elimination - A Deeper Look
  • Pros
  • Guarantee of classification accuracy
  • Gradually removes features that are relatively less relevant
  • Cons
  • Eliminating one feature at a time is time-consuming (large p, cross-validation, parameter settings)
  • Variations
  • Original SVM-RFE (A)
  • The regularization parameter C is tuned for each subset of surviving features, in contrast to the SVM-RFE paper, where a single (large) value of C was used for all subsets of features (B)
  • Eliminating multiple features at a time:
  • A constant number (e.g., 10) (C)
  • According to the distribution of w_i² (D)
  • E.g., remove those features whose w_i² is close to min(w_i²)
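Variation (D) can be sketched as follows; the 20% fraction is an assumed example threshold for "close to min(w_i²)":

```python
# Weights-based removal (variation D): drop every surviving feature whose
# squared weight lies within a fixed fraction of the range above min(w_i^2).
import numpy as np

def weights_based_drop(w, frac=0.20):
    """Indices of features with w_i^2 within `frac` of the range above the minimum."""
    w2 = np.asarray(w) ** 2
    threshold = w2.min() + frac * (w2.max() - w2.min())
    return np.flatnonzero(w2 <= threshold)

w = np.array([0.05, -0.02, 0.9, -0.6, 0.1])
print(weights_based_drop(w))  # -> [0 1 4]: the three small-weight features
```

Because the number of dropped features adapts to the weight distribution, this scheme removes many features when most weights are near the minimum, which is what can make it "too aggressive" (see slide 22).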

10
Datasets
  • A simple dataset for sanity checks:
  • UCI Cleve (165+, 165−; 27 features)
  • Two breast cancer gene expression datasets (public):
  • Rosetta
  • 97 patients (51 labeled 1, 46 labeled 0), 24,881 genes
  • Class label: metastasis (1) vs. non-metastasis (0)
  • Two SNP datasets:
  • Myeloma
  • 143 patients (70 labeled 0, 73 labeled 1), 8,231 SNP genotypes
  • Class label: survival < 1 year (0) vs. survival > 3 years (1)
  • Kidney transplant
  • 271 patients (135 labeled 0, 136 labeled 1), 8,231 SNP genotypes
  • Class label: rejection (0) vs. non-rejection (1)

11
Experimental Design (1)
  • Framework for visualization of SVM-RFE
  • 80% as training, 20% as validation
  • Run SVM-RFE on the training set to compute weights, and select the model (the value of C) with the best validation accuracy.
  • Note: the regularization parameter C is tuned for each subset of surviving features.
  • Use the weights computed on the training set to select the feature subset:
  • One-at-a-time removal
  • Constant-percentage removal (e.g., 5%)
  • Weights-based removal (e.g., within a 20% range of min(w))

12
Cleve: one-at-a-time removal
13
Cleve: fixed 10
14
Cleve: min(weights) 20
15
Rosetta: fixed 10
16
Rosetta: min(weights) 20
17
Myeloma: fixed 10
18
Myeloma: min(weights) 20
19
Experimental Design (2)
  • Framework of training, validation, and test
  • In each iteration:
  • Randomly take 60% as training, 20% as validation, and 20% as test (sampled from the + and − classes separately to maintain the class balance).
  • Run SVM-RFE on each training set (with validation) to generate a ranked list.
  • Note: the regularization parameter C is tuned for each subset of surviving features.
  • Select the feature subset:
  • (1) with the highest validation accuracy, aiming at higher test accuracy
  • (2) with the top k features (fixed small k), aiming at biomarker discovery
  • Use the average test accuracy for comparison.
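The stratified 60/20/20 split described above can be sketched with scikit-learn; the array sizes are illustrative:

```python
# Stratified 60/20/20 train/validation/test split, preserving class balance.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))           # toy data: 100 patients, 50 features
y = np.array([0] * 50 + [1] * 50)        # balanced binary phenotype

# First carve off 20% as the test set, stratified by class label...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# ...then split the remaining 80% so that 60/20 of the total go to
# training/validation (0.25 of the remainder equals 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```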

20
Experimental Results: For Higher Accuracy
21
Experimental Results: Fixed Top 100 Features for Biomarker Discovery
22
Expected Results
  • SVM-RFE vs. SVM
  • SVM-RFE can improve classification accuracy.
  • SVM-RFE can select a relatively small subset of genes (even if classification accuracy falls slightly).
  • Constant C vs. varying C
  • Model selection over C generally improves classification accuracy.
  • Removal scheme
  • One-at-a-time: time-consuming
  • Fixed percentage: not feasible for high-resolution selection
  • Weights-based: automatically decides the number of features to remove, but may be too aggressive

23
Future work
  • Algorithm
  • More advanced weights-based removal
  • Experiments
  • Constant C vs. varying C
  • More parameters, and a more detailed summary (e.g., the average percentage of features selected for classification)
  • More runs

24
Thank you! Gang Fang, gangfang@cs.umn.edu