1
SVM-based Feature Selection for Genomic Data
  • Gang Fang
  • gangfang@cs.umn.edu

2
Motivation: Biomarker Discovery from Gene Expression Data

[Figure: gene expression heatmap (genes × samples) from the classical study of cancer subtypes by Golub et al. (1999), illustrating the identification of diagnostic genes]
3
Problem Statement
  • A gene expression matrix M (N × p)
  • M(i, j) is the j-th gene's expression level in patient i
  • Phenotype class label vector (N × 1)
  • Phenotypes can be:
  • Disease vs. non-disease
  • Survival longer than k years vs. less than k years
  • Cancer subtypes that need different treatments
  • Goal (biomarker discovery):
  • Find a small group of genes that is highly predictive w.r.t. the phenotype class, for future diagnostics.

4
The Major Challenge: N << p
# of samples N: 50–300
# of genes p: ~20,000
5
Additional Challenges
  • Noise
  • Microarray data is very noisy
  • Complexity of disease
  • Multiple genes together drive a disease
  • Univariate ranking mostly does not help
  • Exponential search space
  • 2^p possible gene subsets

6
Existing approaches
  • Do not explicitly identify biomarkers
  • Pure classification models
  • Dimension reduction
  • Biomarker identification oriented methods
  • Univariate ranking
  • Rank all the singletons via some measure
  • Correlation with label, Odds ratio, P-value
  • Cannot capture patterns involving multiple genes
  • Discriminative pattern mining
  • Feature selection

7
SVM-Recursive Feature Elimination - Motivation
  • Change in the objective function J when a feature is removed can serve as a ranking criterion (Kohavi & John, 1997)
  • For linear discriminant functions with a quadratic cost function J (a function of the weights w_i), the change DJ(i) and the magnitude of the weights are equivalent criteria:
  • Mean-squared-error classifier, with cost function J = (w·x − y)²
  • Linear SVMs, which minimize J = (1/2)||w||² under constraints
  • This justifies the use of w_i² as a feature-ranking criterion in linear SVMs.
  • Note: this is different from singleton ranking.

8
SVM-Recursive Feature Elimination - Algorithm
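The original slide shows the algorithm as a figure. A minimal runnable sketch of the one-at-a-time variant, assuming a linear SVM (scikit-learn's LinearSVC here) and synthetic data, could look like:

```python
# Sketch of SVM-RFE: repeatedly fit a linear SVM on the surviving
# features and eliminate the feature(s) with the smallest w_i^2.
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_remove=1):
    """Return feature indices ranked from most to least important."""
    surviving = list(range(X.shape[1]))
    ranking = []  # filled from least to most important
    while surviving:
        clf = LinearSVC(C=1.0, dual=False, max_iter=10000)
        clf.fit(X[:, surviving], y)
        w2 = clf.coef_.ravel() ** 2                    # ranking criterion: w_i^2
        drop = np.argsort(w2)[:min(n_remove, len(surviving))]
        for pos in sorted(drop, reverse=True):         # pop from the back first
            ranking.append(surviving.pop(pos))
    return ranking[::-1]                               # most important first

# Toy data: only features 3 and 7 determine the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = (X[:, 3] + X[:, 7] > 0).astype(int)
ranked = svm_rfe(X, y)
print(ranked)
```

With n_remove > 1 this same loop realizes the multiple-features-at-a-time variations discussed on the next slide.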
9
SVM-Recursive Feature Elimination - A Deeper Look
  • Pros
  • Guarantee of classification accuracy
  • Gradually removes features that are relatively less relevant
  • Cons
  • Eliminating one feature at a time is time-consuming (large p, cross-validation, parameter settings)
  • Variations
  • Original SVM-RFE (A)
  • The regularization parameter C is tuned for each subset of surviving features, in contrast to the SVM-RFE paper, where a single (large) value of C was used for all subsets of features (B)
  • Eliminating multiple features at a time:
  • A constant number (e.g., 10) (C)
  • According to the distribution of w_i² (D)
  • E.g., remove those features whose w_i² is close to min(w_i²)
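Variation (D) can be sketched as follows; the 20% fraction is an assumed example threshold for "close to min(w_i²)":

```python
# Weights-based removal (variation D): drop every surviving feature whose
# squared weight lies within a fixed fraction of the range above min(w_i^2).
import numpy as np

def weights_based_drop(w, frac=0.20):
    """Indices of features with w_i^2 within `frac` of the range above the minimum."""
    w2 = np.asarray(w) ** 2
    threshold = w2.min() + frac * (w2.max() - w2.min())
    return np.flatnonzero(w2 <= threshold)

w = np.array([0.05, -0.02, 0.9, -0.6, 0.1])
print(weights_based_drop(w))  # -> [0 1 4]: the three small-weight features
```

Because the number of dropped features adapts to the weight distribution, this scheme removes many features when most weights are near the minimum, which is what can make it "too aggressive" (see slide 22).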

10
Datasets
  • A simple dataset for sanity checks:
  • UCI Cleve (165+, 165−; 27 features)
  • Two breast cancer gene expression datasets (public):
  • Rosetta
  • 97 patients (51 labeled 1, 46 labeled 0), 24,881 genes
  • Class label: metastasis (1) vs. non-metastasis (0)
  • Two SNP datasets:
  • Myeloma
  • 143 patients (70 labeled 0, 73 labeled 1), 8,231 SNP genotypes
  • Class label: survival < 1 year (0) vs. survival > 3 years (1)
  • Kidney transplant
  • 271 patients (135 labeled 0, 136 labeled 1), 8,231 SNP genotypes
  • Class label: rejection (0) vs. non-rejection (1)

11
Experimental Design (1)
  • Framework for visualization of SVM-RFE
  • 80% as training, 20% as validation
  • Run SVM-RFE on the training set to compute weights, and select the model (the value of C) with the best validation accuracy.
  • Note: the regularization parameter C is tuned for each subset of surviving features.
  • Use the weights computed on the training set to select the feature subset:
  • One-at-a-time removal
  • Constant-percentage removal (e.g., 5%)
  • Weights-based removal (e.g., within a 20% range of min(w))

12
Cleve: one-at-a-time removal
13
Cleve: fixed 10
14
Cleve: min(weights) 20
15
Rosetta: fixed 10
16
Rosetta: min(weights) 20
17
Myeloma: fixed 10
18
Myeloma: min(weights) 20
19
Experimental Design (2)
  • Framework of training, validation, and test
  • In each iteration:
  • Randomly take 60% as training, 20% as validation, and 20% as test (sampled from the + and − classes separately to maintain the class balance).
  • Run SVM-RFE on each training set (with validation) to generate a ranked list.
  • Note: the regularization parameter C is tuned for each subset of surviving features.
  • Select the feature subset:
  • (1) with the highest validation accuracy, aiming at higher test accuracy
  • (2) with the top k features (fixed small k), aiming at biomarker discovery
  • Use the average test accuracy for comparison.
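The stratified 60/20/20 split described above can be sketched with scikit-learn; the array sizes are illustrative:

```python
# Stratified 60/20/20 train/validation/test split, preserving class balance.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))           # toy data: 100 patients, 50 features
y = np.array([0] * 50 + [1] * 50)        # balanced binary phenotype

# First carve off 20% as the test set, stratified by class label...
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
# ...then split the remaining 80% so that 60/20 of the total go to
# training/validation (0.25 of the remainder equals 20% overall).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```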

20
Experimental Results: For Higher Accuracy
21
Experimental Results: Fixed Top 100 Features for Biomarker Discovery
22
Expected Results
  • SVM-RFE vs. SVM
  • SVM-RFE can improve classification accuracy.
  • SVM-RFE can select a relatively small subset of genes (even if classification accuracy falls slightly).
  • Constant C vs. varying C
  • Model selection over C generally improves classification accuracy.
  • Removal scheme
  • One-at-a-time: time-consuming
  • Fixed percentage: not feasible for high-resolution selection
  • Weights-based: automatically decides the number of features to remove, but may be too aggressive

23
Future work
  • Algorithm
  • More advanced weights-based removal
  • Experiments
  • Constant C vs. varying C
  • More parameters, and a more detailed summary (e.g., the average percentage of features selected for classification)
  • More runs

24
Thank you! Gang Fang, gangfang@cs.umn.edu