Classifying Lymphoma Dataset Using Multiclass Support Vector Machines

About This Presentation

Title:

Classifying Lymphoma Dataset Using Multiclass Support Vector Machines

Description:

Missing Values Imputation. 3% of gene expression profiles data are missing ... Local Mean Imputation (KNN) Partition the data set D into two sets. ... –

Number of Views:365

Avg rating:3.0/5.0

Slides: 31

Provided by: anar7

Learn more at: https://cs.gmu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Classifying Lymphoma Dataset Using Multiclass Support Vector Machines

1
Classifying Lymphoma Dataset Using Multi-class
Support Vector Machines

INFS-795 Advanced Data Mining
Prof. Domeniconi
Presented by Hong Chai

2
Agenda

(1) Lymphoma Dataset Description
(2) Data Preprocessing
- Formatting
- Dealing with Missing Values
- Gene Selections
(3) Multi-class SVM Classification
- 1-against-all
- 1-against-1
(4) Tools
(5) References

3
Lymphoma Dataset

Alizadeh et al.(2000), Distinct Types of Diffuse
Large B-cell Lymphoma Identified by Gene
Expression Profiling
Publicly available at http//llmpp.nih.gov/lymphom
a/
In microarray data,
Expression profiling of
genes are measured in
rows
Samples are columns

4
Lymphoma Dataset

96 samples of lymphocytes (instances)
4026 human genes (features)
9 classes of lymphoma
DLBCL, GCB, NIL, ABB, RAT, TCL, FL,
RBB, CLL
Glimpse of data

5
Lymphoma Dataset

6
Goal

Task classification
Assign each patient sample to one of 9
categories, e.g. Diffuse Large B-cell Lymphoma
(DLBCL) or Chronic Lymphocytic Leukemia (CLL).
Microarray data classification an alternative to
current malignancies classification that relies
on morphological or clinical variables
Medical implications
Precise categorization of cancers more relevant
diagnosis
More accurate assignment of cases to high risk or
low risk categories
more targeted therapies
Improved predictability of outcome.

7
Data Preprocessing

Missing Values Imputation
3 of gene expression profiles data are missing
1980 of the 4026 genes have missing values
49.1 of genes (features) involved
Some of these genes may be highly informative
for
classification
Need to deal with missing values before applying
to
SVM

8
Missing Value Approaches

Instance or feature deletion
- if dataset large enough does not
distort distribution
Replace with a randomly drawn observed value
- proved to work well (http//globain/cse/
psu.edu/courses/spring2003/3-norm-val.pdf)
EM algorithm
Global mode or mean substitution
- will distort distribution
Local mode or mean substitution with KNN
algorithm (Prof. Domeniconi)

9
Local Mean Imputation (KNN)

Partition the data set D into two sets.
Let the first set, Dm, contain
instances with missing value(s).
The other set, Dc, contains instances
with complete values.
2. For each instance vector x ? Dm
Divide the vector into observed and
missing parts as x xo xm.
Calculate the distance between xo and
every instance y ? Dc,
using only those features that are
observed in x.
From the K closest ys (instances in
Dc), calculate the mean of
the feature for which x has missing
value(s). Make substitution
with this local mean.
(Note for nominal features use mode.
n/a in microarray data)

10
Data Preprocessing

Feature Selection Motivations
- Number of features large, instances small
- Reduce dimensionality to overcome
overfitting
- A small number of discriminant marker
genes
may characterize different cancer classes
Example Guyon et al. identified 2 genes
that yield zero leave-
one-out error in the
leukemia dataset, 4 genes in the
colon cancer dataset
that give 98 accuracy.
(Guyon et al. Gene
Selection for Cancer Classification using SVM,
2002)

11
Feature Selection

Discriminant Score Ranking
Which gene is more informative in the 2-class
case
-
-
Gene 1
Gene 2

12
Separation Score

Gene 1 more discriminant. Criteria
- Large difference of µ and µ-
- Small variance within each class
Score function
F(gj) (µj - µj-) / (sj sj-)

13
Separation Score

In multi-class cases, rank genes that are
discriminant among multiple classes
C1 C2
? C3
A gene may functionally relates to several cancer
classes such as C1 and C2

14
Separation Score

Proposing an adapted score function
For each gene j
Calculate µi in each class Ci
Sort µi in descending order
Find a cutoff point with largest diff(µi,
µj)
µ ? µexp-cutoff-left
s ? sexp-cutoff-left
µ- ? µexp-cutoff-right
s- ? sexp-cutoff-right
F(gj) (µj - µj-) / (sj sj-)
Rank genes by F(gj)
Select top genes via threshold

15
Separation Score

Disadvantage
Does not yield more compact gene sets still
abundant
Does not consider mutual information between
genes

16
Feature Selection

Recursive Feature Elimination/SVM
In the linear SVM model on the full feature set
Sign (wx b)
w is a vector of weights for each feature,
x is an input instance, and b a threshold.
If wi 0, feature Xi does not influence
classification and can be eliminated from the set
of features.

17
RFE/SVM

2. When w is computed for the full feature set,
sort features according in descending order of
weights. The lower half is eliminated.
3. A new linear SVM is built using the new set of
features. Repeat the process until the set of
predictors is non-divisible by two.
4. The best feature subset is chosen.

18
Feature Selection

PCA comment not common in microarray data.
Disadvantage none of original inputs can be
discarded
We want to retain a minimum subset of informative
genes to achieve best classification performance.

19
Multi-class SVM

20
Multi-class SVM Approaches

1-against-all
Each of the SVMs separates a single class from
all remaining classes (Cortes and Vapnik, 1995)
1-against-1
Pair-wise. k(k-1)/2, k? Y SVMs are trained. Each
SVM separates a pair of classes (Fridman, 1996)
Performance similar in some experiments
(Nakajima, 2000)
Time complexity similar k evaluation in 1-all,
k-1 in 1-1

21
1 -against- All

Or one-against-rest, a tree algorithm
Decomposed to a collection of binary
classifications
k decision functions, one for each class
(wk)T ?xbk,
k? Y
The kth classifier constructs a hyperplane
between class n and the k-1 other classes
Class of x argmaxi(wi)T
?(x)bi

22
1 -against- 1

k(k-1)/2 classifiers where each one is trained on
data from two classes
For training data from ith and jth classes, run
binary classification
Voting strategy If
Sign(wij)T ?
xbij)
says x is in class i, then add 1 to class i.
Else to class j.
Assign x to class with largest vote (Max wins)

23
Kernels to Experiment

Polynomial kernels
K(Xi, Xj)(XiXj1)d
Gaussian Kernels
K(Xi, Xj)e(- Xi - Xj
/s2)

24
SVM Tools - Weka

Data Preprocessing
To ARFF format
Import file

25
SVM Tools - Weka

Feature Selection using SVM
Select Attribute
SVMAttributeEval

26
SVM Tools - Weka

Multi-class classifier
Classify
Meta
MultiClassClassifier
(Handles multi-class
datasets with 2-class
classifiers)

27
SVM Tools - Weka

Multi-class SVM
Classify
Functions
SMO
(Wekas SVM)

28
SVM Tools - Weka

Multi-class SVM Options
Method
1-against-1
1-against-all
Kernel options
not found

29
Multi-class SVM Tools

Other Tools include
SVMTorch (1-against-all)
LibSVM (1-against-1)
LightSVM

30
References

Alizadeh et al. Distinct types of diffuse large
B-cell lymphoma identified by gene expression
profiling, 1999
Cristianini, An Introduction to Support Vector
Machines, 2000
Dor et al, Scoring Genes for Relevance, 2000
Franc and Hlavac, Multi-class Support Vector
Machines
Furey et al. Support vector machine
classification and validation of cancer tissue
samples using microarray expression data, 2000
Guyon et al. Gene Selection for Cancer
Classification using Support Vector Machines,
2002
Selikoff, The SVM-Tree Algorithm, A New Method
for Handling Multi-class SVM, 2003
Shipp et al. Diffuse Large B-cell lymphoma
outcome prediction by gene expression profiling
and supervised machine learning, 2002
Weston, Multi-class Support Vector Machines,
Technical Report, 1998