Imbalanced Data Set Learning with Synthetic Examples - PowerPoint PPT Presentation



1
Imbalanced Data Set Learning with Synthetic
Examples
  • Benjamin X. Wang
  • and
  • Nathalie Japkowicz

2
The Class Imbalance Problem I
  • Data sets are said to be balanced if there are,
    approximately, as many positive examples of the
    concept as there are negative ones.
  • There exist many domains that do not have a
    balanced data set.
  • Examples:
  • Helicopter Gearbox Fault Monitoring
  • Discrimination between Earthquakes and Nuclear
    Explosions
  • Document Filtering
  • Detection of Oil Spills
  • Detection of Fraudulent Telephone Calls

3
The Class Imbalance Problem II
  • The problem with class imbalances is that
    standard learners are often biased towards the
    majority class.
  • That is because these classifiers attempt to
    reduce global quantities such as the error rate,
    not taking the data distribution into
    consideration.
  • As a result, examples from the overwhelming
    (majority) class are well classified, whereas
    examples from the minority class tend to be
    misclassified.

4
Some Generalities
  • Evaluating the performance of a learning system
    on a class imbalance problem is not done
    appropriately with the standard accuracy/error
    rate measures; ROC analysis is typically used
    instead (a toy illustration follows this slide).
  • There is a parallel between research on class
    imbalances and cost-sensitive learning.
  • There are four main ways to deal with class
    imbalances: re-sampling, re-weighting, adjusting
    the probabilistic estimate, and one-class learning.
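As a toy illustration (not from the original slides) of why plain accuracy misleads under class imbalance while the ROC area does not, here is a sketch assuming a roughly 1%-minority problem and a degenerate classifier that scores every example as the majority class:

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Toy 1%-minority problem: a classifier that assigns every example the same
# "majority" score looks excellent on accuracy but is useless by ROC area.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)   # roughly 1% positives
scores = np.zeros_like(y_true, dtype=float)        # constant score for all
print(accuracy_score(y_true, scores > 0.5))        # about 0.99
print(roc_auc_score(y_true, scores))               # 0.5, i.e., chance level
```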

5
Advantage of Resampling
  • Re-sampling provides a simple way of biasing the
    generalization process.
  • It can do so by:
  • Generating synthetic samples that are biased
    accordingly
  • Controlling the amount and placement of the new
    samples
  • Note: this type of control can also be achieved
    by smoothing the classifier's probabilistic
    estimate (e.g., Zadrozny & Elkan, 2001), but that
    type of control cannot be as localized as the one
    achieved with re-sampling techniques.

6
SMOTE: A State-of-the-Art Resampling Approach
  • SMOTE stands for Synthetic Minority Oversampling
    Technique.
  • It is a technique designed by Chawla, Hall, and
    Kegelmeyer in 2002.
  • It combines Informed Oversampling of the minority
    class with Random Undersampling of the
    majority class.
  • SMOTE currently yields the best results as far as
    re-sampling and modifying the probabilistic
    estimate techniques go (Chawla, 2003).

7
SMOTE's Informed Oversampling Procedure II
  • For each minority sample:
  • Find its k nearest minority neighbours
  • Randomly select j of these neighbours
  • Randomly generate synthetic samples along the
    lines joining the minority sample and its j
    selected neighbours (a sketch of this
    interpolation follows this slide)
  • (j depends on the amount of oversampling desired)
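The following is a minimal sketch of this interpolation step (not the authors' implementation); the function name, the plain NumPy interface, and the default of k = 5 neighbours are illustrative assumptions.

```python
import numpy as np

def smote_like_oversample(X_min, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.

    X_min : (n, d) array of minority-class samples
    n_synthetic : number of synthetic samples to generate
    """
    X_min = np.asarray(X_min, dtype=float)
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    # Indices of the k nearest minority neighbours (column 0 is the sample itself).
    neighbours = np.argsort(dists, axis=1)[:, 1:k + 1]
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(n)                 # pick a minority sample
        j = rng.choice(neighbours[i])       # pick one of its k neighbours
        gap = rng.random()                  # random position along the joining line
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.asarray(synthetic)
```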

8
SMOTE's Informed vs. Random Oversampling
  • Random Oversampling (with replacement) of the
    minority class has the effect of making the
    decision region for the minority class very
    specific.
  • In a decision tree, it would cause a new split
    and often lead to overfitting.
  • SMOTE's informed oversampling generalizes the
    decision region for the minority class.
  • As a result, larger and less specific regions are
    learned, thus, paying attention to minority class
    samples without causing overfitting.

9
SMOTE's Informed Oversampling Procedure I
But what if there is a majority sample nearby?
[Figure: minority samples, a synthetic sample generated between them, and a nearby majority sample]
10
SMOTE's Shortcomings
  • Overgeneralization
  • SMOTE's procedure is inherently dangerous since
    it blindly generalizes the minority area without
    regard to the majority class.
  • This strategy is particularly problematic in the
    case of highly skewed class distributions since,
    in such cases, the minority class is very sparse
    with respect to the majority class, thus
    resulting in a greater chance of class mixture.
  • Lack of Flexibility
  • The number of synthetic samples generated by
    SMOTE is fixed in advance, thus not allowing for
    any flexibility in the re-balancing rate.

11
SMOTE's Tendency for Overgeneralization
[Figure: a synthetic sample generated inside the majority region, illustrating overgeneralization]
12
Our Proposed Solution
  • In order to avoid overgeneralization, we propose
    to use three techniques:
  • Testing for data sparsity
  • Clustering the minority class
  • 2-class (rather than 1-class) sample
    generation
  • In order to avoid SMOTE's lack of flexibility,
    we propose one technique:
  • Multiple Trials/Feedback
  • We call our approach the Adaptive Synthetic
    Minority Oversampling Method (ASMO).

13
ASMO's Strategy I
  • Overfitting Avoidance I: Testing for data
    sparsity
  • For each minority sample m, if m's g nearest
    neighbours are majority samples, then the data set
    is sparse and ASMO should be used; otherwise,
    SMOTE can be used. (As a default, we used g = 20.)
    A sketch of this test follows this slide.
  • Overgeneralization Avoidance II: Clustering
  • We will use k-means or other such clustering
    systems on the minority class (for now, this step
    is done, but in a non-standard way)
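A minimal sketch of such a sparsity test (not the authors' implementation); it assumes that "m's g neighbours are majority samples" means all g nearest neighbours, and the scikit-learn interface and function name are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def sparse_minority_mask(X, y, minority_label, g=20):
    """For each minority sample m, check whether its g nearest neighbours in
    the full data set all belong to the majority class.  Returns a boolean
    mask over the minority samples; a largely True mask suggests the data is
    sparse and ASMO should be used rather than SMOTE.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    nn = NearestNeighbors(n_neighbors=g + 1).fit(X)
    _, idx = nn.kneighbors(X[y == minority_label])
    neighbour_labels = y[idx[:, 1:]]        # drop each query point itself
    return np.all(neighbour_labels != minority_label, axis=1)
```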

14
ASMO's Strategy II
  • Overfitting Avoidance III: Synthetic sample
    generation using two classes
  • Rather than using the k nearest neighbours of the
    minority class to generate new samples, we use
    the k nearest neighbours of the opposite class
    (see the sketch below).
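A minimal sketch of this two-class generation step (not the authors' implementation); the function name, the NumPy/scikit-learn interface, the default k = 5, and the choice to interpolate only up to the midpoint of the joining line are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_class_synthetic(X_min, X_maj, n_synthetic, k=5, rng=None):
    """Create synthetic minority samples by interpolating from a minority
    sample towards one of its k nearest majority neighbours, so new points
    fill the space between the two classes rather than only the interior of
    the minority region.
    """
    X_min = np.asarray(X_min, dtype=float)
    X_maj = np.asarray(X_maj, dtype=float)
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k).fit(X_maj)
    _, idx = nn.kneighbors(X_min)           # k majority neighbours per minority sample
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.integers(len(X_min))
        j = rng.choice(idx[i])
        # Assumption: interpolate only up to the midpoint so that the new
        # sample stays on the minority side of the joining line.
        gap = rng.random() * 0.5
        synthetic.append(X_min[i] + gap * (X_maj[j] - X_min[i]))
    return np.asarray(synthetic)
```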

15
ASMO's Strategy III: Overfitting Avoidance Overview
- Clustering
- 2-class sample generation
[Figure: clustered minority samples with synthetic samples placed between the minority and majority classes]
16
ASMO's Strategy III
  • Flexibility Enhancement through Multiple Trials
    and Feedback
  • For each cluster Ci, iterate through different
    rates of majority undersampling and synthetic
    minority generation. Keep the best combination as
    subset Si (a sketch of this search follows this
    slide).
  • Merge the Si's into a single training set S.
  • Apply the classifier to S.
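A minimal sketch of this per-cluster trial-and-feedback loop (not the authors' implementation); the candidate rate grids and the caller-supplied build_subset and evaluate callables are illustrative placeholders.

```python
from itertools import product

def best_combination(cluster_id, build_subset, evaluate,
                     under_rates=(0.5, 0.75, 1.0),
                     synth_rates=(1.0, 2.0, 3.0)):
    """Search over (majority undersampling rate, synthetic generation rate)
    pairs for one minority cluster C_i and keep the best-scoring subset S_i.

    build_subset(cluster_id, u, s) -> candidate training subset (caller-supplied)
    evaluate(subset) -> validation score, e.g. ROC area (caller-supplied)
    """
    best_score, best_subset = float("-inf"), None
    for u, s in product(under_rates, synth_rates):
        subset = build_subset(cluster_id, u, s)
        score = evaluate(subset)
        if score > best_score:
            best_score, best_subset = score, subset
    return best_subset

# The per-cluster winners S_i are then merged into a single training set S,
# and the base classifier is trained on S.
```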

17
Discussion of our Technique I
  • Assumption we made / Justification
  • The problem is decomposable, i.e., optimizing
    each subset will yield an optimal merged set.
  • As long as the base classifier we use does some
    kind of local learning (not just global
    optimization), this assumption should hold.
  • Question/Answer
  • Why did we use different oversampling and
    undersampling rates?
  • It was previously shown that optimal sampling
    rates are problem dependent, and thus, are best
    set adaptively (Weiss & Provost, 2003; Estabrook &
    Japkowicz, 2001).

18
Experiment Setup I
  • We tested our system on three different data
    sets:
  • Lupus (thanks to James Malley of NIH)
  • Minority class: 2.8%
  • Dataset size: 3839
  • Abalone-5 (UCI)
  • Minority class: 2.75%
  • Dataset size: 4177
  • Connect-4 (UCI)
  • Minority class: 9.5%
  • Dataset size: 11,258

19
Experiment Setup II
  • ASMO was compared to two other techniques:
  • SMOTE
  • O-D: the combination of random over- and down-
    (under-) sampling. O-D was shown to outperform
    both random oversampling and random undersampling
    in preliminary experiments.
  • The base classifier in all experiments is SVM;
    k-NN was used in the synthetic generation
    process in order to identify the samples' nearest
    neighbours (within the minority class or between
    the minority and majority classes).
  • The results are reported in the form of ROC
    curves on 10-fold cross-validation experiments
    (a sketch of this evaluation loop follows this
    slide).
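A minimal sketch of such an evaluation loop (not the authors' code); the scikit-learn interface, the optional resample hook, and the convention of pooling scores across folds into one ROC curve are illustrative assumptions.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

def cv_roc(X, y, resample=None, n_splits=10, seed=0):
    """10-fold cross-validated ROC for an SVM base classifier.  `resample`
    is an optional callable applied to each training fold (e.g. a SMOTE- or
    ASMO-style re-sampler returning a rebalanced X, y).
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores, labels = [], []
    for train_idx, test_idx in skf.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        if resample is not None:
            X_tr, y_tr = resample(X_tr, y_tr)
        clf = SVC(probability=True).fit(X_tr, y_tr)
        scores.append(clf.predict_proba(X[test_idx])[:, 1])
        labels.append(y[test_idx])
    fpr, tpr, _ = roc_curve(np.concatenate(labels), np.concatenate(scores))
    return fpr, tpr, auc(fpr, tpr)
```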

20
Results on Lupus
21
Results on Abalone-5
22
Results on Connect-4
23
Discussion of the Results
  • On every domain, ASMO slightly outperforms both
    O-D and SMOTE. In the regions of the ROC curve
    where ASMO does not outperform the other two
    systems, its performance equals theirs.
  • ASMO's effect seems to be one of smoothing
    SMOTE's ROC curve.
  • SMOTE's performance is comparatively better in
    the two domains where the class imbalance is
    greater (Lupus, Abalone-5). We expect its
    relative performance to increase as the imbalance
    grows even more.

24
Summary
  • We presented a few modifications to the
    state-of-the-art re-sampling system, SMOTE.
  • These modifications had two goals:
  • To correct for SMOTE's tendency to overgeneralize
  • To make SMOTE more flexible
  • We observed slightly improved performance on
    three domains. However, that improvement came at
    the expense of greater time consumption.

25
Future Work: This was a very preliminary study!
  • To clean up the system (e.g., to use a standard
    clustering method)
  • To test the system more rigorously (to test for
    significance; to use TANGO, used in the medical
    domain)
  • To test our system on highly imbalanced data
    sets, to see if, indeed, our design helps address
    this particular issue.
  • To modify the data generation process so as to
    test biases other than the one proposed by SMOTE.