Imbalanced Data Set Learning with Synthetic Examples - PowerPoint PPT Presentation



1
Imbalanced Data Set Learning with Synthetic
Examples
  • Benjamin X. Wang
  • and
  • Nathalie Japkowicz

2
The Class Imbalance Problem I
  • Data sets are said to be balanced if there are,
    approximately, as many positive examples of the
    concept as there are negative ones.
  • There exist many domains that do not have a
    balanced data set.
  • Examples:
  • Helicopter Gearbox Fault Monitoring
  • Discrimination between Earthquakes and Nuclear
    Explosions
  • Document Filtering
  • Detection of Oil Spills
  • Detection of Fraudulent Telephone Calls

3
The Class Imbalance Problem II
  • The problem with class imbalances is that
    standard learners are often biased towards the
    majority class.
  • That is because these classifiers attempt to
    reduce global quantities such as the error rate,
    not taking the data distribution into
    consideration.
  • As a result, examples from the overwhelming
    (majority) class are well classified, whereas
    examples from the minority class tend to be
    misclassified.

4
Some Generalities
  • Evaluating the performance of a learning system
    on a class imbalance problem is not done
    appropriately with the standard accuracy/error
    rate measures; ROC analysis is typically used
    instead (a toy illustration follows this slide).
  • There is a parallel between research on class
    imbalances and cost-sensitive learning.
  • There are four main ways to deal with class
    imbalances: re-sampling, re-weighting, adjusting
    the probabilistic estimate, and one-class learning.
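As a toy illustration (not from the original slides) of why plain accuracy misleads under class imbalance while the ROC area does not, here is a sketch assuming a roughly 1%-minority problem and a degenerate classifier that scores every example as the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy 1%-minority problem: a classifier that assigns every example the same
# "majority" score looks excellent on accuracy but is useless by ROC area.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # roughly 1% positives
scores = np.zeros_like(y_true, dtype=float)        # constant score for all
print(accuracy_score(y_true, scores > 0.5))        # about 0.99
print(roc_auc_score(y_true, scores))               # 0.5, i.e., chance level
```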

5
Advantage of Resampling
  • Re-sampling provides a simple way of biasing the
    generalization process.
  • It can do so by:
  • Generating synthetic samples that are biased
    accordingly
  • Controlling the amount and placement of the new
    samples
  • Note: this type of control can also be achieved
    by smoothing the classifier's probabilistic
    estimate (e.g., Zadrozny & Elkan, 2001), but that
    type of control cannot be as localized as the one
    achieved with re-sampling techniques.

6
SMOTE: A State-of-the-Art Resampling Approach
  • SMOTE stands for Synthetic Minority Oversampling
    Technique.
  • It is a technique designed by Chawla, Hall, and
    Kegelmeyer in 2002.
  • It combines Informed Oversampling of the minority
    class with Random Undersampling of the
    majority class.
  • SMOTE currently yields the best results as far as
    re-sampling and modifying the probabilistic
    estimate techniques go (Chawla, 2003).

7
SMOTE's Informed Oversampling Procedure II
  • For each minority sample:
  • Find its k nearest minority neighbours
  • Randomly select j of these neighbours
  • Randomly generate synthetic samples along the
    lines joining the minority sample and its j
    selected neighbours (a sketch of this
    interpolation follows this slide)
  • (j depends on the amount of oversampling desired)
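The following is a minimal sketch of this interpolation step (not the authors' implementation); the function name, the plain NumPy interface, and the default of k = 5 neighbours are illustrative assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    X_min : (n, d) array of minority-class samples
    n_synthetic : number of synthetic samples to generate
    """
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    # Indices of the k nearest minority neighbours (column 0 is the sample itself).
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                 # pick a minority sample
        j = rng.choice(neighbours[i])       # pick one of its k neighbours
        gap = rng.random()                  # random position along the joining line
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```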

8
SMOTE's Informed vs. Random Oversampling
  • Random Oversampling (with replacement) of the
    minority class has the effect of making the
    decision region for the minority class very
    specific.
  • In a decision tree, it would cause a new split
    and often lead to overfitting.
  • SMOTE's informed oversampling generalizes the
    decision region for the minority class.
  • As a result, larger and less specific regions are
    learned, thus, paying attention to minority class
    samples without causing overfitting.

9
SMOTE's Informed Oversampling Procedure I
But what if there is a majority sample nearby?
[Figure: minority samples, a synthetic sample generated between them, and a nearby majority sample]
10
SMOTE's Shortcomings
  • Overgeneralization
  • SMOTE's procedure is inherently dangerous since
    it blindly generalizes the minority area without
    regard to the majority class.
  • This strategy is particularly problematic in the
    case of highly skewed class distributions since,
    in such cases, the minority class is very sparse
    with respect to the majority class, thus
    resulting in a greater chance of class mixture.
  • Lack of Flexibility
  • The number of synthetic samples generated by
    SMOTE is fixed in advance, thus not allowing for
    any flexibility in the re-balancing rate.

11
SMOTE's Tendency for Overgeneralization
[Figure: a synthetic sample generated inside the majority region, illustrating overgeneralization]
12
Our Proposed Solution
  • In order to avoid overgeneralization, we propose
    to use three techniques:
  • Testing for data sparsity
  • Clustering the minority class
  • 2-class (rather than 1-class) sample
    generation
  • In order to avoid SMOTE's lack of flexibility,
    we propose one technique:
  • Multiple Trials/Feedback
  • We call our approach the Adaptive Synthetic
    Minority Oversampling Method (ASMO).

13
ASMO's Strategy I
  • Overfitting Avoidance I: Testing for data
    sparsity
  • For each minority sample m, if m's g nearest
    neighbours are majority samples, then the data set
    is sparse and ASMO should be used; otherwise,
    SMOTE can be used. (As a default, we used g = 20.)
    A sketch of this test follows this slide.
  • Overgeneralization Avoidance II: Clustering
  • We will use k-means or other such clustering
    systems on the minority class (for now, this step
    is done, but in a non-standard way)
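A minimal sketch of such a sparsity test (not the authors' implementation); it assumes that "m's g neighbours are majority samples" means all g nearest neighbours, and the scikit-learn interface and function name are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sparse_minority_mask(X, y, minority_label, g=20):
    """For each minority sample m, check whether its g nearest neighbours in
    the full data set all belong to the majority class.  Returns a boolean
    mask over the minority samples; a largely True mask suggests the data is
    sparse and ASMO should be used rather than SMOTE.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=g + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    neighbour_labels = y[idx[:, 1:]]        # drop each query point itself
    return np.all(neighbour_labels != minority_label, axis=1)
```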

14
ASMO's Strategy II
  • Overfitting Avoidance III: Synthetic sample
    generation using two classes
  • Rather than using the k nearest neighbours of the
    minority class to generate new samples, we use
    the k nearest neighbours of the opposite class
    (see the sketch below).
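A minimal sketch of this two-class generation step (not the authors' implementation); the function name, the NumPy/scikit-learn interface, the default k = 5, and the choice to interpolate only up to the midpoint of the joining line are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_class_synthetic(X_min, X_maj, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples by interpolating from a minority
    sample towards one of its k nearest majority neighbours, so new points
    fill the space between the two classes rather than only the interior of
    the minority region.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k).fit(X_maj)
    _, idx = nn.kneighbors(X_min)           # k majority neighbours per minority sample
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i])
        # Assumption: interpolate only up to the midpoint so that the new
        # sample stays on the minority side of the joining line.
        gap = rng.random() * 0.5
        synthetic.append(X_min[i] + gap * (X_maj[j] - X_min[i]))
    return np.asarray(synthetic)
```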

15
ASMO's Strategy III: Overfitting Avoidance Overview
- Clustering
- 2-class sample generation
[Figure: clustered minority samples with synthetic samples placed between the minority and majority classes]
16
ASMO's Strategy III
  • Flexibility Enhancement through Multiple Trials
    and Feedback
  • For each cluster Ci, iterate through different
    rates of majority undersampling and synthetic
    minority generation. Keep the best combination as
    subset Si (a sketch of this search follows this
    slide).
  • Merge the Si's into a single training set S.
  • Apply the classifier to S.
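A minimal sketch of this per-cluster trial-and-feedback loop (not the authors' implementation); the candidate rate grids and the caller-supplied build_subset and evaluate callables are illustrative placeholders.

```python
from itertools import product

def best_combination(cluster_id, build_subset, evaluate,
                     under_rates=(0.5, 0.75, 1.0),
                     synth_rates=(1.0, 2.0, 3.0)):
    """Search over (majority undersampling rate, synthetic generation rate)
    pairs for one minority cluster C_i and keep the best-scoring subset S_i.

    build_subset(cluster_id, u, s) -> candidate training subset (caller-supplied)
    evaluate(subset) -> validation score, e.g. ROC area (caller-supplied)
    """
    best_score, best_subset = float("-inf"), None
    for u, s in product(under_rates, synth_rates):
        subset = build_subset(cluster_id, u, s)
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset

# The per-cluster winners S_i are then merged into a single training set S,
# and the base classifier is trained on S.
```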

17
Discussion of our Technique I
  • Assumption we made / Justification
  • The problem is decomposable, i.e., optimizing
    each subset will yield an optimal merged set.
  • As long as the base classifier we use does some
    kind of local learning (not just global
    optimization), this assumption should hold.
  • Question/Answer
  • Why did we use different oversampling and
    undersampling rates?
  • It was previously shown that optimal sampling
    rates are problem dependent, and thus, are best
    set adaptively (Weiss & Provost, 2003; Estabrook &
    Japkowicz, 2001).

18
Experiment Setup I
  • We tested our system on three different data
    sets:
  • Lupus (thanks to James Malley of NIH)
  • Minority class: 2.8%
  • Dataset size: 3839
  • Abalone-5 (UCI)
  • Minority class: 2.75%
  • Dataset size: 4177
  • Connect-4 (UCI)
  • Minority class: 9.5%
  • Dataset size: 11,258

19
Experiment Setup II
  • ASMO was compared to two other techniques:
  • SMOTE
  • O-D: the combination of random over- and down-
    (under-) sampling. O-D was shown to outperform
    both random oversampling and random undersampling
    in preliminary experiments.
  • The base classifier in all experiments is SVM;
    k-NN was used in the synthetic generation
    process in order to identify the samples' nearest
    neighbours (within the minority class or between
    the minority and majority classes).
  • The results are reported in the form of ROC
    curves on 10-fold cross-validation experiments
    (a sketch of this evaluation loop follows this
    slide).
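A minimal sketch of such an evaluation loop (not the authors' code); the scikit-learn interface, the optional resample hook, and the convention of pooling scores across folds into one ROC curve are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

def cv_roc(X, y, resample=None, n_splits=10, seed=0):
    """10-fold cross-validated ROC for an SVM base classifier.  `resample`
    is an optional callable applied to each training fold (e.g. a SMOTE- or
    ASMO-style re-sampler returning a rebalanced X, y).
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores, labels = [], []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if resample is not None:
            X_tr, y_tr = resample(X_tr, y_tr)
        clf = SVC(probability=True).fit(X_tr, y_tr)
        scores.append(clf.predict_proba(X[test_idx])[:, 1])
        labels.append(y[test_idx])
    fpr, tpr, _ = roc_curve(np.concatenate(labels), np.concatenate(scores))
    return fpr, tpr, auc(fpr, tpr)
```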

20
Results on Lupus
21
Results on Abalone-5
22
Results on Connect-4
23
Discussion of the Results
  • On every domain, ASMO slightly outperforms both
    O-D and SMOTE. In the regions of the ROC curve
    where ASMO does not outperform the other two
    systems, its performance equals theirs.
  • ASMO's effect seems to be one of smoothing
    SMOTE's ROC curve.
  • SMOTE's performance is comparatively better in
    the two domains where the class imbalance is
    greater (Lupus, Abalone-5). We expect its
    relative performance to increase as the imbalance
    grows even more.

24
Summary
  • We presented a few modifications to the
    state-of-the-art re-sampling system, SMOTE.
  • These modifications had two goals:
  • To correct for SMOTE's tendency to overgeneralize
  • To make SMOTE more flexible
  • We observed slightly improved performance on
    three domains. However, that improvement came at
    the expense of greater time consumption.

25
Future Work: This was a very preliminary study!
  • To clean up the system (e.g., to use a standard
    clustering method)
  • To test the system more rigorously (to test for
    significance; to use TANGO, used in the medical
    domain)
  • To test our system on highly imbalanced data
    sets, to see if, indeed, our design helps address
    this particular issue.
  • To modify the data generation process so as to
    test biases other than the one proposed by SMOTE.