Title: Mining with Rare Cases
Mining with Rare Cases
- Paper by Gary M. Weiss
- Presenter: Indar Bhatia
- INFS 795
- April 28, 2005
Presentation Overview
- Motivation and Introduction to the Problem
- Why Rare Cases are Problematic
- Techniques for Handling Rare Cases
- Summary and Conclusion
Motivation and Introduction
- What are rare cases?
  - A case corresponds to a region in the instance space that is meaningful to the domain under study.
  - A rare case is a case that covers a small region of the instance space.
- Why are they important?
  - Detecting suspicious cargo
  - Finding sources of rare diseases
  - Detecting fraud
  - Finding terrorists
  - Identifying rare diseases
- Classification problem
  - Rare cases cover relatively few training examples
- Example: finding associations between infrequently purchased supermarket items
Modeling Problem
[Figure: a clustering with one common class and three rare classes (P1, P2, P3), and a two-class classification problem whose positive class contains one common and two rare cases]
- For a classification problem, the rare cases may manifest themselves as small disjuncts, i.e., those disjuncts in the classifier that cover few training examples.
- In unsupervised learning, the three rare cases will be more difficult to generalize from because they contain fewer data points.
- In association rule mining, the problem is to detect items that co-occur only infrequently.
Modeling Problem
- Current research indicates that rare cases and small disjuncts pose difficulties for data mining: rare cases have a much higher misclassification rate than common cases.
- Small disjuncts collectively cover a substantial fraction of all examples and cannot simply be eliminated; doing so would substantially degrade the performance of a classifier.
- In the most thorough study of small disjuncts (Weiss & Hirsh, 2000), it was shown that in classifiers induced from 30 real-world data sets, most classifier errors are contributed by the smaller disjuncts.
Why Rare Cases are Problematic
- Problems arise due to absolute rarity
  - The most fundamental problem is lack of data: only a few examples related to rare cases are in the data set (absolute rarity).
  - Lack of data makes it difficult to detect rare cases and, once they are detected, makes generalization difficult.
- Problems arise due to relative rarity
  - Looking for a needle in a haystack: rare cases are obscured by common cases (relative rarity).
  - Data mining algorithms rely on greedy search heuristics that examine one variable at a time. Since the detection of rare cases may depend on the conjunction of many conditions, any single condition in isolation may not provide much guidance.
  - For example, consider the association rule mining problem. To find rare associations, the minimum support threshold must be set very low (close to 0), which causes a combinatorial explosion in large datasets.
Why Rare Cases are Problematic
- The metrics
  - The metrics used to evaluate classifier accuracy are focused on common cases. As a consequence, rare cases may be totally ignored.
- Example
  - Consider a decision tree. Most decision trees are grown in a top-down manner, where test conditions are repeatedly evaluated and the best one is selected.
  - The metric (e.g., information gain) used to select the best test generally prefers tests that result in a balanced tree where purity is increased for most of the examples.
  - Rare cases, which correspond to high-purity branches covering few examples, will often not be included in the decision tree.
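A minimal numeric illustration of this preference (not from the paper; the split counts below are invented): information gain rewards a test that improves purity for the bulk of the examples far more than a test that isolates a small, perfectly pure rare-case branch.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (bits) of a class-count vector."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def information_gain(parent, children):
    """parent: class counts at the node; children: class counts per branch."""
    n = sum(sum(c) for c in children)
    weighted = sum(sum(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = [50, 50]  # 50 common-class and 50 positive-class examples
# Test A carves out a small, perfectly pure branch of 5 rare-case examples:
gain_rare = information_gain(parent, [[0, 5], [50, 45]])
# Test B improves purity for the bulk of the examples:
gain_bulk = information_gain(parent, [[5, 45], [45, 5]])
print(f"gain for rare-case test: {gain_rare:.3f}")   # ~0.05
print(f"gain for bulk test:      {gain_bulk:.3f}")   # ~0.53
```

The greedy tree grower picks test B by a factor of ten, so the rare case may never receive its own branch.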
Why Rare Cases are Problematic
- The bias
  - The bias of a data mining system is critical to its performance. Extra-evidentiary bias is what makes it possible to generalize from specific examples.
  - The bias used by many data mining systems, especially those used to induce classifiers, is a maximum-generality bias.
  - This means that when a disjunct covering some set of training examples is formed, only the most general set of conditions that satisfies those examples is selected.
  - The maximum-generality bias works well for common cases, but not for rare cases/small disjuncts.
  - Attempts to address the problems of small disjuncts by selecting an appropriate bias must be considered.
Why Rare Cases are Problematic
- Noisy data
  - A sufficiently high level of background noise may prevent the learner from distinguishing between noise and rare cases.
  - Unfortunately, there is not much that can be done to minimize the impact of noise on rare cases.
  - For example, pruning and overfitting-avoidance techniques, as well as inductive biases that foster generalization, can minimize the overall impact of noise; but because these methods tend to remove both the rare cases and the noise-generated ones, they do so at the expense of the rare cases.
Techniques for Handling Rare Cases
- Obtain Additional Training Data
- Use a More Appropriate Inductive Bias
- Use More Appropriate Metrics
- Employ Non-Greedy Search Techniques
- Employ Knowledge/Human Interaction
- Employ Boosting
- Place Rare Cases Into Separate Classes
1. Obtain Additional Training Data
- Simply obtaining additional training data will not help much, because most of the new data will also be associated with the common cases and only some of it with the rare cases. This may help with problems of absolute rarity, but not with relative rarity.
- Only by selectively obtaining additional data for the rare cases can one address the issues with relative rarity. Such a sampling scheme would also help with absolute rarity.
- The selective sampling approach does not seem practical for real-world data sets.
2. Use a More Appropriate Inductive Bias
- Rare cases tend to cause small disjuncts to be formed in a classifier induced from labeled data. This is partly due to the bias used by most learners.
- Simple strategies that eliminate all small disjuncts, or that use statistical significance testing to prevent small disjuncts from being formed, have proven to perform poorly.
- More sophisticated approaches for adjusting the bias of a learner in order to minimize the problem with small disjuncts have been investigated.
- Holte et al. (1989) use a maximum-generality bias for large disjuncts and a maximum-specificity bias for small disjuncts. This was shown to improve the performance of the small disjuncts but to degrade the performance of the large disjuncts, yielding poorer overall performance.
2. Use a More Appropriate Inductive Bias
- The approach was refined to ensure that the more specific bias used to induce the small disjuncts does not affect, and therefore cannot degrade, the performance of the large disjuncts.
- This was accomplished by using different learners for examples that fall into large disjuncts and examples that fall into small disjuncts (Ting, 1994).
- This hybrid approach was shown to improve the accuracy of small disjuncts, but the results were not conclusive.
- Carvalho and Freitas (2002a, 2002b) use essentially the same approach, except that the set of training examples falling into each individual small disjunct is used to generate a separate classifier.
- Several attempts have been made to perform better on rare cases by using a highly specific bias for the induced small disjuncts. These methods have shown mixed success. A sketch of the hybrid idea follows.
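A minimal sketch of the hybrid idea (an illustration under assumed details, not the published method): leaves of an induced tree that cover only a handful of training examples are treated as small disjuncts, and test examples landing in them are answered by a maximum-specificity learner (here 1-NN) instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy data with a 5% minority class (assumed setup, for illustration only).
X, y = make_classification(n_samples=2000, weights=[0.95], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)  # maximum-specificity learner

# A leaf is a "small disjunct" if it covers few training examples.
leaf_of_train = tree.apply(X_tr)
leaf_sizes = np.bincount(leaf_of_train, minlength=tree.tree_.node_count)
SMALL = 5  # size threshold (arbitrary choice for this sketch)

leaf_of_test = tree.apply(X_te)
small = leaf_sizes[leaf_of_test] <= SMALL
pred = np.where(small, knn.predict(X_te), tree.predict(X_te))
print(f"{small.sum()} test examples routed to the specific learner")
```

Because only queries that fall into small disjuncts are rerouted, the large disjuncts of the tree are left untouched, which is the point of the refinement.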
3. Use More Appropriate Metrics
- Altering the relative importance of precision vs. recall
  - Use evaluation metrics that, unlike accuracy, do not discount the importance of rare cases.
  - Given a classification rule R that predicts target class C, the recall of R is the percentage of examples belonging to C that are correctly identified, while the precision of R is the percentage of times that the rule is correct.
  - Rare cases can be given more prominence by increasing the importance of precision over recall.
  - Timeweaver (Weiss, 1999), a genetic-algorithm-based classification system, searches for rare cases by carefully altering the relative importance of precision vs. recall.
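A minimal sketch of these measures on toy labels; the F-beta measure shown is one standard way to trade precision against recall, not necessarily Timeweaver's exact mechanism.

```python
import numpy as np

def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (rare) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    return tp / (tp + fp), tp / (tp + fn)

def f_beta(precision, recall, beta):
    """F-measure; beta < 1 emphasizes precision, beta > 1 emphasizes recall."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
p, r = precision_recall(y_true, y_pred)              # p = 1.00, r = 0.67
print(f"precision={p:.2f} recall={r:.2f}")
print(f"F0.5={f_beta(p, r, 0.5):.2f}  F2={f_beta(p, r, 2):.2f}")  # 0.91 vs 0.71
```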
3. Use More Appropriate Metrics
- Two-phase rule induction
  - PNrule (Joshi, Agarwal & Kumar, 2001) uses two-phase rule induction to focus on each measure separately.
  - The first phase focuses on recall; in the second phase, precision is optimized. This is accomplished by learning to identify the false positives within the rules from phase 1.
  - In the needle-in-the-haystack analogy, the first phase identifies regions likely to contain the needle, and the second phase learns to discard the strands of hay within these regions.
PN-rule Learning
- P-phase
  - Cover the positive examples with rules that have good support
  - Seek good recall
- N-phase
  - Remove false positives from the examples covered by the P-phase
  - Seek high accuracy and significant support
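A toy sketch of the two-phase idea (greatly simplified from PNrule, with assumed synthetic data and decision trees standing in for rule induction): phase 1 learns a recall-oriented cover, phase 2 learns to recognize the false positives inside it.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data: 5% positive class (assumed setup for illustration).
X, y = make_classification(n_samples=4000, weights=[0.95], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# P-phase: a shallow, recall-oriented model (the class_weight pushes it to
# cover as many positives as possible, accepting false positives).
p_model = DecisionTreeClassifier(max_depth=2, class_weight={0: 1, 1: 20},
                                 random_state=1).fit(X_tr, y_tr)
covered = p_model.predict(X_tr) == 1

# N-phase: within the covered region only, learn to weed out false positives.
n_model = DecisionTreeClassifier(max_depth=4, random_state=1)
n_model.fit(X_tr[covered], y_tr[covered])

# Final prediction: positive only if covered by the P-phase and kept by the N-phase.
cov_te = p_model.predict(X_te) == 1
pred = np.zeros(len(X_te), dtype=int)
pred[cov_te] = n_model.predict(X_te[cov_te])
```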
4. Employ Non-Greedy Search Techniques
- Greedy algorithms make locally optimal choices so that the search remains tractable. Mining algorithms based on the greedy method are therefore not globally optimal.
- Greedy algorithms are not suitable for dealing with rare cases, because rare cases may depend on the conjunction of many conditions, and any single condition in isolation may not provide the needed guidance.
- Mining algorithms for handling rare cases must use more powerful global search methods.
- Recommended solution
  - Genetic algorithms, which operate on a population of candidate solutions rather than a single solution
  - For this reason, genetic algorithms are more appropriate for rare cases (Goldberg, 1989; Freitas, 2002; Weiss, 1999; Carvalho and Freitas, 2002)
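A compact sketch of a genetic search for a conjunctive rule (an illustration with invented toy data: the rare case is a three-feature conjunction that no single feature reveals, which is exactly where greedy one-condition-at-a-time search struggles).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: the target is the conjunction of the first three binary features;
# no single feature alone is informative (assumed setup).
X = rng.integers(0, 2, size=(1000, 8))
y = (X[:, 0] & X[:, 1] & X[:, 2]).astype(int)   # ~12.5% positives

def fitness(mask):
    """Fitness of the rule 'all features in mask are 1' = precision * recall."""
    if not mask.any():
        return 0.0
    pred = X[:, mask].all(axis=1)
    tp = np.sum(pred & (y == 1))
    if tp == 0:
        return 0.0
    return (tp / pred.sum()) * (tp / (y == 1).sum())

pop = rng.integers(0, 2, size=(30, X.shape[1])).astype(bool)
for gen in range(40):
    scores = np.array([fitness(ind) for ind in pop])
    # Tournament selection: the fitter of two random individuals survives.
    a, b = rng.integers(0, len(pop), (2, len(pop)))
    parents = pop[np.where(scores[a] > scores[b], a, b)]
    # Uniform crossover between consecutive parents, then bit-flip mutation.
    cross = rng.random(parents.shape) < 0.5
    children = np.where(cross, parents, np.roll(parents, 1, axis=0))
    children ^= rng.random(children.shape) < 0.02
    pop = children

best = max(pop, key=fitness)
print("best rule uses features:", np.flatnonzero(best))  # typically [0 1 2]
```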
5. Employ Knowledge/Human Interaction
- The interaction and domain knowledge of experts can be used more effectively for rare-case mining.
- Examples
  - SAR detection
  - Rare disease detection
  - Etc.
6. Employ Boosting
- Boosting algorithms, such as AdaBoost, are iterative algorithms that place different weights on the training distribution at each iteration.
- Following each iteration, boosting increases the weights associated with the incorrectly classified examples and decreases the weights associated with the correctly classified examples.
- This forces the learner to focus more on the incorrectly classified examples in the next iteration; since rare cases are misclassified more often, they tend to receive ever larger weights.
- RareBoost (Joshi, Agarwal & Kumar, 2001) applies a modified weight-update mechanism to improve the performance on rare classes and rare cases.
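A minimal sketch of the standard AdaBoost weight update the slide describes (synthetic data; RareBoost's modified update is not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy data with a rare positive class; labels in {-1, +1} (assumed setup).
X, y01 = make_classification(n_samples=1000, weights=[0.9], random_state=2)
y = 2 * y01 - 1

n = len(X)
w = np.full(n, 1 / n)                 # uniform initial distribution
stumps, alphas = [], []
for t in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = w[pred != y].sum()                          # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
    # The AdaBoost update: misclassified examples (often the rare ones)
    # gain weight, correctly classified examples lose weight.
    w *= np.exp(-alpha * y * pred)
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Final classifier: the sign of the alpha-weighted vote of the stumps.
score = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(score) == y))
```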
7. Place Rare Cases Into Separate Classes
- Rare cases complicate classification because different rare cases may have little in common with one another, making it difficult to assign the same class label to all of them.
- Solution: reformulate the problem so that rare cases are viewed as separate classes.
- Approach
  - Separate each class into subclasses using clustering
  - Relearn after relabeling the training examples with the new class labels
  - Because multiple clustering experiments are used in step 1, step 2 involves learning multiple models
  - These models are combined using voting
Boosting-Based Algorithms
- RareBoost
  - Updates the weights differently for the different types of errors
- SMOTEBoost
  - A combination of SMOTE (Synthetic Minority Oversampling Technique) and boosting
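A minimal sketch of SMOTE's interpolation step (standalone and simplified; SMOTEBoost embeds this resampling inside each boosting iteration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_new, k=5, rng=np.random.default_rng(0)):
    """Generate n_new synthetic minority examples by interpolating between
    each sampled minority point and one of its k nearest minority neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, nbrs = nn.kneighbors(X_min)          # column 0 is the point itself
    base = rng.integers(0, len(X_min), n_new)
    pick = nbrs[base, rng.integers(1, k + 1, n_new)]
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[pick] - X_min[base])

# Example: triple a 20-example minority class (toy data).
X_min = np.random.default_rng(1).normal(size=(20, 4))
X_syn = smote(X_min, n_new=40)
```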
CREDOS
- First uses ripple-down rules to overfit the data
  - Ripple-down rules are often used
- Then prunes to improve generalization
  - Uses a different mechanism from decision trees
Cost-Sensitive Modeling
- Detection rate / false alarm rate alone may be misleading
- Cost factors: damage cost, response cost, operational cost
- Assign costs to TP, FP, TN, FN
- Define a cumulative cost
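A minimal sketch of a cumulative cost measure (the per-outcome costs below are invented, purely for illustration): a missed intrusion (FN) incurs a damage cost, every alarm (TP or FP) incurs a response cost, and every examined event incurs an operational cost.

```python
import numpy as np

COST = {"TP": 10.0, "FP": 10.0, "TN": 0.0, "FN": 100.0}  # hypothetical values
OPERATIONAL = 1.0                                         # per examined event

def cumulative_cost(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    outcome = (tp * COST["TP"] + fp * COST["FP"]
               + tn * COST["TN"] + fn * COST["FN"])
    return outcome + OPERATIONAL * len(y_true)

y_true = np.array([1, 0, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 0])
print(cumulative_cost(y_true, y_pred))   # 10 + 10 + 100 + 6 = 126
```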
Outlier Detection Schemes
- Detect intrusions (data points) that are very different from the normal activities (the rest of the data points)
- General steps
  - Identify normal behavior
  - Construct a useful set of features
  - Define a similarity function
  - Use an outlier detection algorithm
    - Statistics based
    - Distance based
    - Model based
Distance Based Outlier Detection
- Represent data as a vector of features
- Major approaches
  - Nearest-neighbor based
  - Density based
  - Clustering based
- Problem
  - High dimensionality of the data
Distance Based - Nearest Neighbor
- Not enough neighbors → outliers
- Compute the distance d to the k-th nearest neighbor
- Outlier points
  - Are located in sparser neighborhoods
  - Have d larger than a certain threshold
- Mahalanobis-distance based approach
  - More appropriate for computing distances with skewed distributions
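A minimal sketch of the k-th-nearest-neighbor distance score (toy data; the percentile cutoff is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
# Toy data: a dense normal cluster plus a few scattered outliers (assumed).
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-8, 8, (5, 2))])

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-th neighbor
dist, _ = nn.kneighbors(X)
d_k = dist[:, k]                                 # distance to the k-th nearest neighbor

threshold = np.percentile(d_k, 97.5)             # arbitrary cutoff for this sketch
outliers = np.flatnonzero(d_k > threshold)       # sparse-neighborhood points
```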
Distance Based - Density
- Local Outlier Factor (LOF)
  - The average of the ratios of the density of example p and the densities of its nearest neighbors
  - Compute the density of the local neighborhood for each point
  - Compute the LOF
  - Larger LOF → outliers
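A minimal LOF example using scikit-learn's implementation (toy data; n_neighbors is an arbitrary choice):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-8, 8, (5, 2))])

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)               # -1 for outliers, 1 for inliers
scores = -lof.negative_outlier_factor_    # larger score = more outlying
print(np.flatnonzero(labels == -1))
```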
Distance Based - Clustering
- A radius w of proximity is specified
- Two points x1 and x2 are near if d(x1, x2) < w
- Define N(x) as the number of points that are within w of x
- Points in small clusters → outliers
- Fixed-width clustering for speedup
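A minimal sketch of single-pass fixed-width clustering for outlier flagging (toy data; the radius w and the small-cluster threshold are arbitrary choices):

```python
import numpy as np

def fixed_width_clusters(X, w):
    """Single-pass fixed-width clustering: assign each point to the first
    cluster whose center lies within w, else start a new cluster there."""
    centers, members = [], []
    for i, x in enumerate(X):
        for c, center in enumerate(centers):
            if np.linalg.norm(x - center) < w:
                members[c].append(i)
                break
        else:
            centers.append(x)
            members.append([i])
    return members

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.uniform(-8, 8, (5, 2))])
clusters = fixed_width_clusters(X, w=1.5)
# Points in small clusters are flagged as outliers.
outliers = [i for m in clusters if len(m) < 3 for i in m]
```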
Distance Based - Clustering (cont.)
- K-Nearest Neighbor / canopy clustering
  - Compute the sum of the distances to the k nearest neighbors
  - Small k-NN sum → point lies in a dense region
  - Canopy clustering for speedup
- WaveCluster
  - Transform the data into multidimensional signals using the wavelet transform
  - Remove the high/low frequency parts
  - The remaining parts → outliers
Model Based Outlier Detection
- Similar to probability-based schemes
- Build a prediction model for normal behavior
- Deviation from the model → potential intrusion
- Major approaches
  - Neural networks
  - Unsupervised Support Vector Machines (SVMs)
Model Based - Neural Networks
- Use a replicator 4-layer feed-forward neural network (RNN)
- The input variables are also the target outputs during training
- The RNN forms a compressed model of the training data
- Outlyingness → reconstruction error
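A minimal stand-in for the replicator idea, using an MLP regressor as an autoencoder-style network (assumed architecture and toy data; not the exact 4-layer replicator network):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.uniform(-6, 6, (5, 4))])

# Replicator idea: train the network to reproduce its own input through
# a narrow hidden layer, then score points by reconstruction error.
net = MLPRegressor(hidden_layer_sizes=(8, 2, 8), max_iter=3000, random_state=7)
net.fit(X, X)                                    # inputs are their own targets
err = np.mean((net.predict(X) - X) ** 2, axis=1)
outliers = np.argsort(err)[-5:]                  # the worst-reconstructed points
```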
Model Based - SVMs
- Attempt to separate the entire set of training data from the origin
- Regions where most of the data lies are labeled as one class
- Parameters
  - Expected outlier rate
    - Works well with high-quality, controlled training data
  - Variance of the Radial Basis Function (RBF) kernel
    - Larger → higher detection rate but more false alarms
    - Smaller → lower detection rate but fewer false alarms
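A minimal one-class SVM example with scikit-learn (toy data; nu plays the role of the expected outlier rate and gamma controls the RBF kernel width; the values here are illustrative):

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(8)
X = np.vstack([rng.normal(0, 1, (300, 2)), rng.uniform(-6, 6, (5, 2))])

# nu roughly corresponds to the expected outlier rate; gamma is the
# inverse width of the RBF kernel.
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma=0.1)
labels = ocsvm.fit_predict(X)            # -1 for outliers, +1 for inliers
print(np.flatnonzero(labels == -1))
```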
Summary and Conclusion
- Rare classes, which result from highly skewed class distributions, share many of the problems associated with rare cases; rare classes and rare cases are connected.
- Although rare cases can occur within both rare classes and common classes, rare cases are expected to be more of an issue for rare classes.
- Japkowicz (2001) views rare classes as a consequence of between-class imbalance and rare cases as a consequence of within-class imbalance.
- Thus, both forms of rarity are a type of data imbalance.
- The modeling improvements presented in this paper are applicable to both types of rarity.