Title: Mining with Rare Cases
Mining with Rare Cases
- Paper by Gary M. Weiss
- Presenter: Indar Bhatia
- INFS 795
- April 28, 2005
Presentation Overview
- Motivation and Introduction to the Problem
- Why Rare Cases are Problematic
- Techniques for Handling Rare Cases
- Summary and Conclusion
Motivation and Introduction
- What are rare cases?
  - A case corresponds to a region in the instance space that is meaningful to the domain under study.
  - A rare case is a case that covers a small region of the instance space.
- Why are they important?
  - Detecting suspicious cargo
  - Finding sources of rare diseases
  - Detecting fraud
  - Finding terrorists
  - Identifying rare diseases
- Classification problem
  - Rare cases cover relatively few training examples
- Example: finding associations between infrequently purchased supermarket items
Modeling Problem
[Figure: a clustering with one common class and three rare classes (P1, P2, P3), and a two-class classification problem whose positive class contains one common and two rare cases]
- For a classification problem, the rare cases may manifest themselves as small disjuncts, i.e., those disjuncts in the classifier that cover few training examples.
- In unsupervised learning, the three rare cases will be more difficult to generalize from because they contain fewer data points.
- In association rule mining, the problem is to detect items that co-occur only infrequently.
Modeling Problem
- Current research indicates that rare cases and small disjuncts pose difficulties for data mining: rare cases have a much higher misclassification rate than common cases.
- Small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated; doing so would substantially degrade the performance of a classifier.
- In the most thorough study of small disjuncts (Weiss & Hirsh, 2000), it was shown that in classifiers induced from 30 real-world data sets, most classifier errors are contributed by the smaller disjuncts.
Why Rare Cases are Problematic
- Problems arise due to absolute rarity
  - The most fundamental problem is lack of data: only a few examples related to rare cases are in the data set (absolute rarity).
  - Lack of data makes it difficult to detect rare cases and, once they are detected, makes generalization difficult.
- Problems arise due to relative rarity
  - Looking for a needle in a haystack: rare cases are obscured by common cases (relative rarity).
  - Data mining algorithms rely on greedy search heuristics that examine one variable at a time. Since the detection of rare cases may depend on the conjunction of many conditions, any single condition in isolation may not provide much guidance.
  - For example, consider the association rule mining problem. To find rare associations, the minimum support threshold must be set very low (close to 0), which causes a combinatorial explosion in large datasets.
Why Rare Cases are Problematic
- The metrics
  - The metrics used to evaluate classifier accuracy are focused on common cases. As a consequence, rare cases may be totally ignored.
- Example
  - Consider a decision tree. Most decision trees are grown in a top-down manner, where test conditions are repeatedly evaluated and the best one is selected.
  - The metric (e.g., information gain) used to select the best test generally prefers tests that result in a balanced tree where purity is increased for most of the examples.
  - Rare cases, which correspond to high-purity branches covering few examples, will often not be included in the decision tree.
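A minimal numeric illustration of this preference (not from the paper; the split counts below are invented): information gain rewards a test that improves purity for the bulk of the examples far more than a test that isolates a small, perfectly pure rare-case branch.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    """parent: class counts at the node; children: class counts per branch."""
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [50, 50]  # 50 common-class and 50 positive-class examples
# Test A carves out a small, perfectly pure branch of 5 rare-case examples:
gain_rare = information_gain(parent, [[0, 5], [50, 45]])
# Test B improves purity for the bulk of the examples:
gain_bulk = information_gain(parent, [[5, 45], [45, 5]])
print(f"gain for rare-case test: {gain_rare:.3f}")   # ~0.05
print(f"gain for bulk test:      {gain_bulk:.3f}")   # ~0.53
```

The greedy tree grower picks test B by a factor of ten, so the rare case may never receive its own branch.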
Why Rare Cases are Problematic
- The bias
  - The bias of a data mining system is critical to its performance. Extra-evidentiary bias is what makes it possible to generalize from specific examples.
  - The bias used by many data mining systems, especially those used to induce classifiers, is a maximum-generality bias.
  - This means that when a disjunct covering some set of training examples is formed, only the most general set of conditions that satisfies those examples is selected.
  - The maximum-generality bias works well for common cases, but not for rare cases/small disjuncts.
  - Attempts to address the problems of small disjuncts by selecting an appropriate bias must be considered.
Why Rare Cases are Problematic
- Noisy data
  - A sufficiently high level of background noise may prevent the learner from distinguishing between noise and rare cases.
  - Unfortunately, there is not much that can be done to minimize the impact of noise on rare cases.
  - For example, pruning and overfitting-avoidance techniques, as well as inductive biases that foster generalization, can minimize the overall impact of noise; but because these methods tend to remove both the rare cases and the noise-generated ones, they do so at the expense of the rare cases.
Techniques for Handling Rare Cases
- Obtain Additional Training Data
- Use a More Appropriate Inductive Bias
- Use More Appropriate Metrics
- Employ Non-Greedy Search Techniques
- Employ Knowledge/Human Interaction
- Employ Boosting
- Place Rare Cases Into Separate Classes
1. Obtain Additional Training Data
- Simply obtaining additional training data will not help much, because most of the new data will also be associated with the common cases and only some of it with the rare cases. This may help with problems of absolute rarity, but not with relative rarity.
- Only by selectively obtaining additional data for the rare cases can one address the issues with relative rarity. Such a sampling scheme would also help with absolute rarity.
- The selective sampling approach does not seem practical for real-world data sets.
2. Use a More Appropriate Inductive Bias
- Rare cases tend to cause small disjuncts to be formed in a classifier induced from labeled data. This is partly due to the bias used by most learners.
- Simple strategies that eliminate all small disjuncts, or that use statistical significance testing to prevent small disjuncts from being formed, have proven to perform poorly.
- More sophisticated approaches for adjusting the bias of a learner in order to minimize the problem with small disjuncts have been investigated.
- Holte et al. (1989) use a maximum-generality bias for large disjuncts and a maximum-specificity bias for small disjuncts. This was shown to improve the performance of the small disjuncts but to degrade the performance of the large disjuncts, yielding poorer overall performance.
2. Use a More Appropriate Inductive Bias
- The approach was refined to ensure that the more specific bias used to induce the small disjuncts does not affect, and therefore cannot degrade, the performance of the large disjuncts.
- This was accomplished by using different learners for examples that fall into large disjuncts and examples that fall into small disjuncts (Ting, 1994).
- This hybrid approach was shown to improve the accuracy of small disjuncts, but the results were not conclusive.
- Carvalho and Freitas (2002a, 2002b) use essentially the same approach, except that the set of training examples falling into each individual small disjunct is used to generate a separate classifier.
- Several attempts have been made to perform better on rare cases by using a highly specific bias for the induced small disjuncts. These methods have shown mixed success. A sketch of the hybrid idea follows.
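A minimal sketch of the hybrid idea (an illustration under assumed details, not the published method): leaves of an induced tree that cover only a handful of training examples are treated as small disjuncts, and test examples landing in them are answered by a maximum-specificity learner (here 1-NN) instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data with a 5% minority class (assumed setup, for illustration only).
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)  # maximum-specificity learner

# A leaf is a "small disjunct" if it covers few training examples.
leaf_of_train = tree.apply(X_tr)
leaf_sizes = np.bincount(leaf_of_train, minlength=tree.tree_.node_count)
SMALL = 5  # size threshold (arbitrary choice for this sketch)

leaf_of_test = tree.apply(X_te)
small = leaf_sizes[leaf_of_test] <= SMALL
pred = np.where(small, knn.predict(X_te), tree.predict(X_te))
print(f"{small.sum()} test examples routed to the specific learner")
```

Because only queries that fall into small disjuncts are rerouted, the large disjuncts of the tree are left untouched, which is the point of the refinement.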
3. Use More Appropriate Metrics
- Altering the relative importance of precision vs. recall
  - Use evaluation metrics that, unlike accuracy, do not discount the importance of rare cases.
  - Given a classification rule R that predicts target class C, the recall of R is the percentage of examples belonging to C that are correctly identified, while the precision of R is the percentage of times that the rule is correct.
  - Rare cases can be given more prominence by increasing the importance of precision over recall.
  - Timeweaver (Weiss, 1999), a genetic-algorithm-based classification system, searches for rare cases by carefully altering the relative importance of precision vs. recall.
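A minimal sketch of these measures on toy labels; the F-beta measure shown is one standard way to trade precision against recall, not necessarily Timeweaver's exact mechanism.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (rare) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

def f_beta(precision, recall, beta):
    """F-measure; beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
p, r = precision_recall(y_true, y_pred)              # p = 1.00, r = 0.67
print(f"precision={p:.2f} recall={r:.2f}")
print(f"F0.5={f_beta(p, r, 0.5):.2f}  F2={f_beta(p, r, 2):.2f}")  # 0.91 vs 0.71
```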
3. Use More Appropriate Metrics
- Two-phase rule induction
  - PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule induction to focus on each measure separately.
  - The first phase focuses on recall; in the second phase, precision is optimized. This is accomplished by learning to identify the false positives within the rules from phase 1.
  - In the needle-in-the-haystack analogy, the first phase identifies regions likely to contain the needle, and the second phase learns to discard the strands of hay within these regions.
PN-rule Learning
- P-phase
  - Cover the positive examples with rules that have good support
  - Seek good recall
- N-phase
  - Remove false positives from the examples covered by the P-phase
  - Seek high accuracy and significant support
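A toy sketch of the two-phase idea (greatly simplified from PNrule, with assumed synthetic data and decision trees standing in for rule induction): phase 1 learns a recall-oriented cover, phase 2 learns to recognize the false positives inside it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data: 5% positive class (assumed setup for illustration).
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# P-phase: a shallow, recall-oriented model (the class_weight pushes it to
# cover as many positives as possible, accepting false positives).
p_model = DecisionTreeClassifier(max_depth=2, class_weight={0: 1, 1: 20},
                                 random_state=1).fit(X_tr, y_tr)
covered = p_model.predict(X_tr) == 1

# N-phase: within the covered region only, learn to weed out false positives.
n_model = DecisionTreeClassifier(max_depth=4, random_state=1)
n_model.fit(X_tr[covered], y_tr[covered])

# Final prediction: positive only if covered by the P-phase and kept by the N-phase.
cov_te = p_model.predict(X_te) == 1
pred = np.zeros(len(X_te), dtype=int)
pred[cov_te] = n_model.predict(X_te[cov_te])
```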
4. Employ Non-Greedy Search Techniques
- Greedy algorithms make locally optimal choices so that the search remains tractable. Mining algorithms based on the greedy method are therefore not globally optimal.
- Greedy algorithms are not suitable for dealing with rare cases, because rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide the needed guidance.
- Mining algorithms for handling rare cases must use more powerful global search methods.
- Recommended solution
  - Genetic algorithms, which operate on a population of candidate solutions rather than a single solution
  - For this reason, genetic algorithms are more appropriate for rare cases (Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho and Freitas, 2002)
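A compact sketch of a genetic search for a conjunctive rule (an illustration with invented toy data: the rare case is a three-feature conjunction that no single feature reveals, which is exactly where greedy one-condition-at-a-time search struggles).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target is the conjunction of the first three binary features;
# no single feature alone is informative (assumed setup).
X = rng.integers(0, 2, size=(1000, 8))
y = (X[:, 0] & X[:, 1] & X[:, 2]).astype(int)   # ~12.5% positives

def fitness(mask):
    """Fitness of the rule 'all features in mask are 1' = precision * recall."""
    if not mask.any():
        return 0.0
    pred = X[:, mask].all(axis=1)
    tp = np.sum(pred & (y == 1))
    if tp == 0:
        return 0.0
    return (tp / pred.sum()) * (tp / (y == 1).sum())

pop = rng.integers(0, 2, size=(30, X.shape[1])).astype(bool)
for gen in range(40):
    scores = np.array([fitness(ind) for ind in pop])
    # Tournament selection: the fitter of two random individuals survives.
    a, b = rng.integers(0, len(pop), (2, len(pop)))
    parents = pop[np.where(scores[a] > scores[b], a, b)]
    # Uniform crossover between consecutive parents, then bit-flip mutation.
    cross = rng.random(parents.shape) < 0.5
    children = np.where(cross, parents, np.roll(parents, 1, axis=0))
    children ^= rng.random(children.shape) < 0.02
    pop = children

best = max(pop, key=fitness)
print("best rule uses features:", np.flatnonzero(best))  # typically [0 1 2]
```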
5. Employ Knowledge/Human Interaction
- The interaction and domain knowledge of experts can be used more effectively for rare-case mining.
- Examples
  - SAR detection
  - Rare disease detection
  - Etc.
6. Employ Boosting
- Boosting algorithms, such as AdaBoost, are iterative algorithms that place different weights on the training distribution at each iteration.
- Following each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples.
- This forces the learner to focus more on the incorrectly classified examples in the next iteration; since rare cases are misclassified more often, they tend to receive ever larger weights.
- RareBoost (Joshi, Agarwal & Kumar, 2001) applies a modified weight-update mechanism to improve the performance on rare classes and rare cases.
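A minimal sketch of the standard AdaBoost weight update the slide describes (synthetic data; RareBoost's modified update is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data with a rare positive class; labels in {-1, +1} (assumed setup).
X, y01 = make_classification(n_samples=1000, weights=[0.9], random_state=2)
y = 2 * y01 - 1

n = len(X)
w = np.full(n, 1 / n)                 # uniform initial distribution
stumps, alphas = [], []
for t in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                          # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    # The AdaBoost update: misclassified examples (often the rare ones)
    # gain weight, correctly classified examples lose weight.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: the sign of the alpha-weighted vote of the stumps.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(score) == y))
```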
7. Place Rare Cases Into Separate Classes
- Rare cases complicate classification because different rare cases may have little in common with one another, making it difficult to assign the same class label to all of them.
- Solution: reformulate the problem so that rare cases are viewed as separate classes.
- Approach
  - Separate each class into subclasses using clustering
  - Relearn after relabeling the training examples with the new class labels
  - Because multiple clustering experiments are used in step 1, step 2 involves learning multiple models
  - These models are combined using voting
Boosting-Based Algorithms
- RareBoost
  - Updates the weights differently for the different types of errors
- SMOTEBoost
  - A combination of SMOTE (Synthetic Minority Oversampling Technique) and boosting
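A minimal sketch of SMOTE's interpolation step (standalone and simplified; SMOTEBoost embeds this resampling inside each boosting iteration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Generate n_new synthetic minority examples by interpolating between
    each sampled minority point and one of its k nearest minority neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, nbrs = nn.kneighbors(X_min)          # column 0 is the point itself
    base = rng.integers(0, len(X_min), n_new)
    pick = nbrs[base, rng.integers(1, k + 1, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Example: triple a 20-example minority class (toy data).
X_min = np.random.default_rng(1).normal(size=(20, 4))
X_syn = smote(X_min, n_new=40)
```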
CREDOS
- First uses ripple-down rules to overfit the data
  - Ripple-down rules are often used
- Then prunes to improve generalization
  - Uses a different mechanism from decision trees
Cost-Sensitive Modeling
- Detection rate / false alarm rate alone may be misleading
- Cost factors: damage cost, response cost, operational cost
- Assign costs to TP, FP, TN, FN
- Define a cumulative cost
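A minimal sketch of a cumulative cost measure (the per-outcome costs below are invented, purely for illustration): a missed intrusion (FN) incurs a damage cost, every alarm (TP or FP) incurs a response cost, and every examined event incurs an operational cost.

```python
import numpy as np

COST = {"TP": 10.0, "FP": 10.0, "TN": 0.0, "FN": 100.0}  # hypothetical values
OPERATIONAL = 1.0                                         # per examined event

def cumulative_cost(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    outcome = (tp * COST["TP"] + fp * COST["FP"]
               + tn * COST["TN"] + fn * COST["FN"])
    return outcome + OPERATIONAL * len(y_true)

y_true = np.array([1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0])
print(cumulative_cost(y_true, y_pred))   # 10 + 10 + 100 + 6 = 126
```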
Outlier Detection Schemes
- Detect intrusions (data points) that are very different from the normal activities (the rest of the data points)
- General steps
  - Identify normal behavior
  - Construct a useful set of features
  - Define a similarity function
  - Use an outlier detection algorithm
    - Statistics based
    - Distance based
    - Model based
Distance Based Outlier Detection
- Represent data as a vector of features
- Major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based
- Problem
  - High dimensionality of the data
Distance Based - Nearest Neighbor
- Not enough neighbors → outliers
- Compute the distance d to the k-th nearest neighbor
- Outlier points
  - Are located in sparser neighborhoods
  - Have d larger than a certain threshold
- Mahalanobis-distance based approach
  - More appropriate for computing distances with skewed distributions
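A minimal sketch of the k-th-nearest-neighbor distance score (toy data; the percentile cutoff is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
# Toy data: a dense normal cluster plus a few scattered outliers (assumed).
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-8, 8, (5, 2))])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbor
dist, _ = nn.kneighbors(X)
d_k = dist[:, k]                                 # distance to the k-th nearest neighbor

threshold = np.percentile(d_k, 97.5)             # arbitrary cutoff for this sketch
outliers = np.flatnonzero(d_k > threshold)       # sparse-neighborhood points
```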
Distance Based - Density
- Local Outlier Factor (LOF)
  - The average of the ratios of the density of example p and the densities of its nearest neighbors
  - Compute the density of the local neighborhood for each point
  - Compute the LOF
  - Larger LOF → outliers
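A minimal LOF example using scikit-learn's implementation (toy data; n_neighbors is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-8, 8, (5, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)               # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_    # larger score = more outlying
print(np.flatnonzero(labels == -1))
```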
Distance Based - Clustering
- A radius w of proximity is specified
- Two points x1 and x2 are near if d(x1, x2) < w
- Define N(x) as the number of points that are within w of x
- Points in small clusters → outliers
- Fixed-width clustering for speedup
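A minimal sketch of single-pass fixed-width clustering for outlier flagging (toy data; the radius w and the small-cluster threshold are arbitrary choices):

```python
import numpy as np

def fixed_width_clusters(X, w):
    """Single-pass fixed-width clustering: assign each point to the first
    cluster whose center lies within w, else start a new cluster there."""
    centers, members = [], []
    for i, x in enumerate(X):
        for c, center in enumerate(centers):
            if np.linalg.norm(x - center) < w:
                members[c].append(i)
                break
        else:
            centers.append(x)
            members.append([i])
    return members

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-8, 8, (5, 2))])
clusters = fixed_width_clusters(X, w=1.5)
# Points in small clusters are flagged as outliers.
outliers = [i for m in clusters if len(m) < 3 for i in m]
```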
Distance Based - Clustering (cont.)
- K-Nearest Neighbor / canopy clustering
  - Compute the sum of the distances to the k nearest neighbors
  - Small k-NN sum → point lies in a dense region
  - Canopy clustering for speedup
- WaveCluster
  - Transform the data into multidimensional signals using the wavelet transform
  - Remove the high/low frequency parts
  - The remaining parts → outliers
Model Based Outlier Detection
- Similar to probability-based schemes
- Build a prediction model for normal behavior
- Deviation from the model → potential intrusion
- Major approaches
  - Neural networks
  - Unsupervised Support Vector Machines (SVMs)
Model Based - Neural Networks
- Use a replicator 4-layer feed-forward neural network (RNN)
- The input variables are also the target outputs during training
- The RNN forms a compressed model of the training data
- Outlyingness → reconstruction error
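A minimal stand-in for the replicator idea, using an MLP regressor as an autoencoder-style network (assumed architecture and toy data; not the exact 4-layer replicator network):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.uniform(-6, 6, (5, 4))])

# Replicator idea: train the network to reproduce its own input through
# a narrow hidden layer, then score points by reconstruction error.
net = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=3000, random_state=7)
net.fit(X, X)                                    # inputs are their own targets
err = np.mean((net.predict(X) - X) ** 2, axis=1)
outliers = np.argsort(err)[-5:]                  # the worst-reconstructed points
```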
Model Based - SVMs
- Attempt to separate the entire set of training data from the origin
- Regions where most of the data lies are labeled as one class
- Parameters
  - Expected outlier rate
    - Works well with high-quality, controlled training data
  - Variance of the Radial Basis Function (RBF) kernel
    - Larger → higher detection rate but more false alarms
    - Smaller → lower detection rate but fewer false alarms
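A minimal one-class SVM example with scikit-learn (toy data; nu plays the role of the expected outlier rate and gamma controls the RBF kernel width; the values here are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.uniform(-6, 6, (5, 2))])

# nu roughly corresponds to the expected outlier rate; gamma is the
# inverse width of the RBF kernel.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1)
labels = ocsvm.fit_predict(X)            # -1 for outliers, +1 for inliers
print(np.flatnonzero(labels == -1))
```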
Summary and Conclusion
- Rare classes, which result from highly skewed class distributions, share many of the problems associated with rare cases; rare classes and rare cases are connected.
- Although rare cases can occur within both rare classes and common classes, rare cases are expected to be more of an issue for rare classes.
- Japkowicz (2001) views rare classes as a consequence of between-class imbalance and rare cases as a consequence of within-class imbalance.
- Thus, both forms of rarity are a type of data imbalance.
- The modeling improvements presented in this paper are applicable to both types of rarity.