Mining with Rare Cases

1
Mining with Rare Cases
  • Paper by Gary M. Weiss
  • Presenter: Indar Bhatia
  • INFS 795
  • April 28, 2005

2
Presentation Overview
  • Motivation and Introduction to problem
  • Why Rare Cases are Problematic
  • Techniques for Handling Rare Cases
  • Summary and Conclusion

3
Motivation and Introduction
  • What are rare cases?
  • A case corresponds to a region in the instance
    space that is meaningful to the domain under
    study.
  • A rare case is a case that covers a small region
    of the instance space.
  • Why are they important?
  • Detecting suspicious cargo
  • Finding sources of rare diseases
  • Detecting fraud
  • Finding terrorists
  • Identifying rare diseases
  • Classification problem: rare cases cover
    relatively few training examples
  • Example: finding associations between
    infrequently purchased supermarket items

4
Modeling Problem
[Figure: a clustering showing one common and three rare classes (P1, P2, P3), and a two-class classification problem whose positive class contains one common and two rare classes]
  • For a classification problem, the rare cases may
    manifest themselves as small disjuncts, i.e.,
    those disjuncts in the classifier that cover few
    training examples
  • In unsupervised learning, the three rare cases
    will be more difficult to generalize from because
    they contain fewer data points
  • In association rule mining, the problem is to
    detect items that co-occur infrequently.

5
Modeling Problem
  • Current research indicates that rare cases and
    small disjuncts pose difficulties for data
    mining, i.e., rare cases have a much higher
    misclassification rate than common cases.
  • Small disjuncts collectively cover a substantial
    fraction of all examples and cannot simply be
    eliminated; doing so would substantially degrade
    the performance of a classifier.
  • In the most thorough study of small disjuncts
    (Weiss & Hirsh, 2000), it was shown that in
    classifiers induced from 30 real-world data
    sets, most classifier errors are contributed by
    the smaller disjuncts.

6
Why Rare Cases are Problematic
  • Problems arise due to absolute rarity
  • The most fundamental problem is the associated
    lack of data: only a few examples related to
    rare cases are in the data set (absolute rarity).
  • Lack of data makes it difficult to detect rare
    cases and, even when they are detected, makes
    generalization difficult.
  • Problems arise due to relative rarity
  • Like looking for a needle in a haystack: rare
    cases are obscured by common cases (relative
    rarity).
  • Data mining algorithms rely on greedy search
    heuristics that examine one variable at a time.
    Since the detection of rare cases may depend on
    the conjunction of many conditions, any single
    condition in isolation may not provide much
    guidance.
  • For example, consider the association rule
    mining problem. To find rare associations, the
    analysis must use a very low support threshold
    (approaching 0), which causes a combinatorial
    explosion in large datasets; see the sketch
    below.
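To make that explosion concrete, here is a minimal back-of-the-envelope count in Python; the catalogue size of 1,000 items is a hypothetical figure, not from the paper.

    # With a support threshold near 0, candidate generation approaches
    # the full space of item combinations.
    from math import comb

    n_items = 1000  # hypothetical supermarket catalogue size
    for k in range(1, 5):
        print(f"candidate itemsets of size {k}: {comb(n_items, k):,}")
    # size 1: 1,000
    # size 2: 499,500
    # size 3: 166,167,000
    # size 4: 41,417,124,750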

7
Why Rare Cases are Problematic
  • The metrics
  • The metrics used to evaluate classifier accuracy
    are focused on common cases. As a consequence,
    rare cases may be totally ignored.
  • Example
  • Consider a decision tree. Most decision trees
    are grown in a top-down manner, where test
    conditions are repeatedly evaluated and the best
    one selected.
  • The metric used to select the best test (e.g.,
    information gain) generally prefers tests that
    result in a balanced tree, where purity is
    increased for most of the examples.
  • Rare cases, which correspond to high-purity
    branches covering few examples, will often not
    be included in the decision tree; the worked
    example below illustrates this.
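A toy information-gain comparison, assuming illustrative numbers that are not from the paper: 1,000 examples split 500/500, one candidate test that isolates a pure rare branch of 5 positives, and one that improves purity for most examples.

    from math import log2

    def entropy(pos, neg):
        total = pos + neg
        h = 0.0
        for c in (pos, neg):
            if c:
                p = c / total
                h -= p * log2(p)
        return h

    def info_gain(parent, branches):
        total = sum(p + n for p, n in branches)
        child = sum((p + n) / total * entropy(p, n)
                    for p, n in branches)
        return entropy(*parent) - child

    parent = (500, 500)
    # Test A isolates a pure rare branch of 5 positives; purity
    # elsewhere is essentially unchanged.
    print(info_gain(parent, [(5, 0), (495, 500)]))      # ~0.005
    # Test B improves purity for most examples.
    print(info_gain(parent, [(400, 100), (100, 400)]))  # ~0.278

Test B wins by a wide margin, so the rare case never becomes its own branch.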

8
Why Rare Cases are Problematic
  • The bias
  • The bias of a data mining system is critical to
    its performance. Extra-evidentiary bias is what
    makes it possible to generalize from specific
    examples.
  • The bias used by many data mining systems,
    especially those used to induce classifiers, is
    a maximum-generality bias.
  • This means that when a disjunct that covers some
    set of training examples is formed, only the
    most general set of conditions that satisfy
    those examples is selected.
  • The maximum-generality bias works well for
    common cases, but not for rare cases/small
    disjuncts.
  • Attempts to address the problems of small
    disjuncts by selecting an appropriate bias must
    be considered.

9
Why Rare Cases are Problematic
  • Noisy data
  • A sufficiently high level of background noise
    may prevent the learner from distinguishing
    between noise and rare cases.
  • Unfortunately, there is not much that can be
    done to minimize the impact of noise on rare
    cases.
  • For example, pruning and overfitting-avoidance
    techniques, as well as inductive biases that
    foster generalization, can minimize the overall
    impact of noise; but because these methods tend
    to remove both the rare cases and the
    noise-generated ones, they do so at the expense
    of rare cases.

10
Techniques for Handling Rare Cases
  • Obtain Additional Training Data
  • Use a More Appropriate Inductive Bias
  • Use More Appropriate Metrics
  • Employ Non-Greedy Search Techniques
  • Employ Knowledge/Human Interaction
  • Employ Boosting
  • Place Rare Cases Into Separate Classes

11
1. Obtain Additional Training Data
  • Simply obtaining additional training data will
    not help much, because most of the new data will
    also be associated with the common cases and
    only some with the rare cases. This may help
    with problems of absolute rarity, but not with
    relative rarity.
  • Only by selectively obtaining additional data
    for the rare cases can one address the issue of
    relative rarity. Such a sampling scheme will
    also help with absolute rarity.
  • However, the selective sampling approach does
    not seem practical for real-world data sets.

12
2. Use a More Appropriate Inductive Bias
  • Rare cases tend to cause small disjuncts to be
    formed in a classifier induced from labeled
    data. This is partly due to the bias used by
    most learners.
  • Simple strategies that eliminate all small
    disjuncts, or that use statistical significance
    testing to prevent small disjuncts from being
    formed, have proven to perform poorly.
  • More sophisticated approaches for adjusting the
    bias of a learner in order to minimize the
    problem with small disjuncts have been
    investigated.
  • Holte et al. (1989) use a maximum-generality
    bias for large disjuncts and a
    maximum-specificity bias for small disjuncts.
    This was shown to improve the performance of the
    small disjuncts but degrade the performance of
    the large disjuncts, yielding poorer overall
    performance.

13
2. Use a More Appropriate Inductive Bias
  • The approach was refined to ensure that the
    more specific bias used to induce the small
    disjuncts does not affect, and therefore cannot
    degrade, the performance of the large disjuncts.
  • This was accomplished by using different
    learners for examples that fall into large
    disjuncts and examples that fall into small
    disjuncts (Ting, 1994); a rough sketch of this
    hybrid idea follows the list below.
  • Although this hybrid approach was shown to
    improve the accuracy of small disjuncts, the
    results were not conclusive.
  • Carvalho and Freitas (2002a, 2002b) use
    essentially the same approach, except that the
    set of training examples falling into each
    individual small disjunct is used to generate a
    separate classifier.
  • Several attempts have been made to perform
    better on rare cases by using a highly specific
    bias for the induced small disjuncts. These
    methods have shown mixed success.
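A rough sketch of the hybrid idea, assuming scikit-learn: treat a "small disjunct" as a decision-tree leaf covering few training examples and defer those predictions to an instance-based learner. Ting (1994) combined C4.5 with an instance-based method; the k-NN fallback and the size threshold here are illustrative stand-ins.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                               random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)

    # Disjunct size = number of training examples covered by each leaf.
    leaf_sizes = np.bincount(tree.apply(X_tr))
    small = leaf_sizes[tree.apply(X_te)] < 10  # illustrative threshold

    # Large disjuncts keep the tree's prediction; small ones defer to k-NN.
    pred = np.where(small, knn.predict(X_te), tree.predict(X_te))
    print("hybrid accuracy:", (pred == y_te).mean())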

14
3. Use More Appropriate Metrics
  • Altering the Relative Importance of Precision
    vs. Recall
  • Use evaluation metrics that, unlike accuracy
    metrics, do not discount the importance of rare
    cases.
  • Given a classification rule R that predicts
    target class C, the recall of R is the
    percentage of examples belonging to C that are
    correctly identified, while the precision of R
    is the percentage of times that the rule is
    correct.
  • Rare cases can be given more prominence by
    increasing the importance of precision over
    recall.
  • Timeweaver (Weiss, 1999), a genetic-algorithm
    based classification system, searches for rare
    cases by carefully altering the relative
    importance of precision vs. recall; the F-beta
    measure sketched below is one standard way to
    encode that trade-off.
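A minimal sketch of the F-beta measure; whether Timeweaver uses exactly this formula is not stated here, so treat it as a generic illustration of weighting precision against recall (the values are illustrative).

    def f_beta(precision, recall, beta):
        # beta < 1 emphasizes precision; beta > 1 emphasizes recall.
        return ((1 + beta**2) * precision * recall /
                (beta**2 * precision + recall))

    p, r = 0.8, 0.3                 # a precise rule with modest recall
    print(f_beta(p, r, beta=0.5))   # ~0.60: precision emphasized
    print(f_beta(p, r, beta=1.0))   # ~0.44: balanced
    print(f_beta(p, r, beta=2.0))   # ~0.34: recall emphasized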

15
3. Use More Appropriate Metrics
  • Two-Phase Rule Induction
  • PNrule (Joshi, Agarwal & Kumar, 2001) uses
    two-phase rule induction to focus on each
    measure separately.
  • The first phase focuses on recall. In the second
    phase, precision is optimized. This is
    accomplished by learning to identify the false
    positives of the rules from phase 1.
  • In the needle-in-the-haystack analogy, the first
    phase identifies regions likely to contain the
    needle; the second phase learns to discard the
    strands of hay within these regions.

16
PN-rule Learning
  • P-phase
  • Learn rules covering positive examples with good
    support
  • Seek good recall
  • N-phase
  • Remove false positives from the records covered
    by the P-phase rules
  • Seek high accuracy and significant support; a
    toy sketch of the two phases follows
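A toy sketch of the two-phase scheme, assuming a rule language of single-interval tests on one numeric feature and synthetic data; PNrule itself grows conjunctive rules over many attributes.

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical data: positives live in [4, 6]; a pocket of
    # negatives ("hay") sits at [4.8, 5.2] inside that region.
    X_pos = rng.uniform(4, 6, 40)
    X_neg = np.concatenate([rng.uniform(0, 10, 400),
                            rng.uniform(4.8, 5.2, 30)])

    # P-phase: one high-recall rule covering the positives' range,
    # tolerating false positives.
    lo, hi = X_pos.min(), X_pos.max()
    fp = X_neg[(X_neg >= lo) & (X_neg <= hi)]
    print(f"P-rule: {lo:.2f} <= x <= {hi:.2f}, "
          f"recall = 1.00, false positives = {len(fp)}")

    # N-phase: within the P-rule's coverage, find the window that
    # removes the most false positives per true positive sacrificed.
    width, best = 0.5, None
    for start in np.arange(lo, hi - width, 0.05):
        fp_removed = np.sum((fp >= start) & (fp <= start + width))
        tp_lost = np.sum((X_pos >= start) & (X_pos <= start + width))
        score = fp_removed - 5 * tp_lost    # penalty weight is arbitrary
        if best is None or score > best[0]:
            best = (score, start)
    print(f"N-rule: exclude {best[1]:.2f} <= x <= {best[1] + width:.2f}")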

17
4. Employ Non-Greedy Search Techniques
  • Greedy algorithms make locally optimal choices
    in order to keep the search tractable; as a
    result, mining algorithms based on greedy
    methods are not globally optimal.
  • Greedy algorithms are not suitable for dealing
    with rare cases because rare cases may depend on
    the conjunction of many conditions, and any
    single condition in isolation may not provide
    the needed guidance.
  • Mining algorithms for handling rare cases must
    use more powerful global search methods.
  • Recommended solution
  • Genetic algorithms, which operate on a
    population of candidate solutions rather than a
    single solution (a minimal illustration follows
    this list).
  • For this reason GAs are more appropriate for
    rare cases (Goldberg, 1989; Freitas, 2002;
    Weiss, 1999; Carvalho and Freitas, 2002).
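A minimal genetic-algorithm illustration with a toy fitness function (not Timeweaver): candidates are bitstrings standing for rule conditions, and the "rare case" is a specific conjunction that gives greedy, one-condition-at-a-time search no gradient to follow.

    import random

    random.seed(0)
    TARGET = [1, 0, 1, 1, 0, 0, 1, 0]   # hypothetical rare conjunction
    N = len(TARGET)

    def fitness(bits):
        # Reward only near-complete matches: a single matching
        # condition gives no signal, mimicking why greedy search
        # struggles with rare conjunctions.
        matches = sum(b == t for b, t in zip(bits, TARGET))
        return matches if matches >= N - 1 else 0

    def crossover(a, b):
        cut = random.randrange(1, N)
        return a[:cut] + b[cut:]

    def mutate(bits, rate=0.1):
        return [1 - b if random.random() < rate else b for b in bits]

    pop = [[random.randint(0, 1) for _ in range(N)] for _ in range(50)]
    for gen in range(500):
        pop.sort(key=fitness, reverse=True)
        if pop[0] == TARGET:
            print(f"found the rare conjunction at generation {gen}")
            break
        parents = pop[:10]              # truncation selection
        pop = parents + [mutate(crossover(random.choice(parents),
                                          random.choice(parents)))
                         for _ in range(len(pop) - 10)]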

18
5. Employ Knowledge/Human Interaction
  • The interaction and domain knowledge of experts
    can be used more effectively for rare-case
    mining.
  • Examples
  • SAR detection
  • Rare disease detection
  • Etc.

19
6. Employ Boosting
  • Boosting algorithms, such as AdaBoost, are
    iterative algorithms that place different
    weights on the training distribution at each
    iteration.
  • Following each iteration, boosting increases the
    weights associated with the incorrectly
    classified examples and decreases the weights
    associated with the correctly classified
    examples.
  • This forces the learner to focus more on the
    incorrectly classified examples in the next
    iteration; a bare-bones version of this weight
    update is sketched below.
  • RareBoost (Joshi, Kumar and Agarwal, 2001)
    applies a modified weight-update mechanism to
    improve the performance of rare classes and rare
    cases.
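A bare-bones sketch of the generic AdaBoost weight update (not RareBoost's modified rule), assuming scikit-learn decision stumps as the weak learner and a synthetic skewed dataset.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, weights=[0.95, 0.05],
                               random_state=0)
    y = np.where(y == 1, 1, -1)         # rare class is +1
    w = np.full(len(X), 1 / len(X))     # uniform initial weights

    for t in range(10):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.sum(w[pred != y])
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))
        # Up-weight misclassified examples, down-weight the rest.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        print(f"round {t}: weighted error = {err:.3f}, "
              f"weight on rare class = {w[y == 1].sum():.3f}")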

20
7. Place Rare Cases Into Separate Classes
  • Rare cases complicate classification because
    different rare cases may have little in common
    with one another, making it difficult to assign
    the same class label to all of them.
  • Solution: reformulate the problem so that rare
    cases are viewed as separate classes.
  • Approach (a minimal sketch follows this list)
  • 1. Separate each class into subclasses using
    clustering
  • 2. Re-label the training examples with the new
    class labels and learn
  • Because multiple clustering experiments are used
    in step 1, step 2 involves learning multiple
    models.
  • These models are combined using voting.
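A minimal sketch of the reformulation, assuming scikit-learn. It performs a single clustering pass per class and trains one model; the approach above repeats the clustering and combines the resulting models by voting.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=600, weights=[0.9, 0.1],
                               random_state=1)

    # Step 1: split each class into subclasses via clustering.
    sub_labels = np.empty(len(y), dtype=int)
    next_label, label_to_class = 0, {}
    for cls in np.unique(y):
        idx = np.where(y == cls)[0]
        k = 3                           # subclasses per class (arbitrary)
        clusters = KMeans(n_clusters=k, n_init=10,
                          random_state=0).fit_predict(X[idx])
        for c in range(k):
            label_to_class[next_label] = cls
            sub_labels[idx[clusters == c]] = next_label
            next_label += 1

    # Step 2: learn on the re-labeled data, then map predictions back.
    clf = DecisionTreeClassifier(random_state=0).fit(X, sub_labels)
    pred = np.vectorize(label_to_class.get)(clf.predict(X))
    print("training accuracy on original labels:", (pred == y).mean())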

21
Boosting based algorithms
  • RareBoost
  • Updates the weights differently
  • SMOTEBoost
  • A combination of SMOTE (Synthetic Minority
    Oversampling Technique) and boosting; SMOTE's
    core interpolation step is sketched below
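A minimal sketch of SMOTE's core step, on toy data: synthesize a new minority example by interpolating between a minority point and one of its k nearest minority neighbors (the full algorithm also controls the amount of oversampling).

    import numpy as np

    rng = np.random.default_rng(0)
    minority = rng.normal(loc=5.0, scale=0.5, size=(10, 2))  # toy points

    def smote_sample(X_min, k=3):
        i = rng.integers(len(X_min))
        p = X_min[i]
        # k nearest minority neighbors of p (excluding p itself)
        d = np.linalg.norm(X_min - p, axis=1)
        neighbors = np.argsort(d)[1:k + 1]
        q = X_min[rng.choice(neighbors)]
        gap = rng.random()              # uniform in [0, 1)
        return p + gap * (q - p)        # a point on the segment p -> q

    synthetic = np.array([smote_sample(minority) for _ in range(20)])
    print(synthetic[:3])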

22
CREDOS
  • First use ripple-down rules to overfit the data
  • Ripple-down rules are often used
  • Then prune to improve generalization
  • A different mechanism from decision trees

23
Cost Sensitive Modeling
  • Detection rate / false alarm rate alone may be
    misleading
  • Cost factors: damage cost, response cost,
    operational cost
  • Assign costs to TP, FP, TN, FN
  • Define a cumulative cost (a minimal sketch
    follows)
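A minimal sketch of a cumulative cost measure: weight each cell of the confusion matrix by a per-outcome cost. The cost values are hypothetical placeholders, not from the presentation.

    def cumulative_cost(tp, fp, tn, fn,
                        c_tp=1.0,    # response cost for a true detection
                        c_fp=5.0,    # cost of a false alarm
                        c_tn=0.0,    # normal operation
                        c_fn=50.0):  # damage cost of a missed rare case
        return tp * c_tp + fp * c_fp + tn * c_tn + fn * c_fn

    # A model with fewer false alarms but more misses can still be
    # worse once the damage cost of misses dominates.
    print(cumulative_cost(tp=40, fp=100, tn=9000, fn=10))  # 1040.0
    print(cumulative_cost(tp=30, fp=20, tn=9080, fn=20))   # 1130.0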

24
Outlier Detection Schemes
  • Detect intrusions (data points) that are very
    different from the normal activities (the rest
    of the data points)
  • General steps
  • Identify normal behavior
  • Construct a useful set of features
  • Define a similarity function
  • Use an outlier detection algorithm
  • Statistics-based
  • Distance-based
  • Model-based

25
Distance Based Outlier Detection
  • Represent data as a vector of features
  • Major approaches
  • Nearest neighbor based
  • Density based
  • Clustering based
  • Problem
  • High dimensionality of data

26
Distance Based Nearest Neighbor
  • Not enough neighbors → outliers
  • Compute the distance d to the k-th nearest
    neighbor
  • Outlier points
  • Are located in sparser neighborhoods
  • Have d larger than a certain threshold
  • Mahalanobis-distance based approach
  • More appropriate for computing distances with
    skewed distributions (both ideas are sketched
    below)
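A minimal sketch of both scores on toy data: the k-th nearest neighbor distance, and the Mahalanobis distance as a covariance-aware alternative; k and the planted outlier are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])  # 1 outlier

    # Distance to the k-th nearest neighbor, for every point
    k = 5
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    kth = np.sort(D, axis=1)[:, k]      # column 0 is the self-distance 0
    print("largest k-NN distance: point", np.argmax(kth))

    # Mahalanobis distance accounts for the covariance of the data
    mu = X.mean(axis=0)
    VI = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    maha = np.sqrt(np.einsum("ij,jk,ik->i", diff, VI, diff))
    print("largest Mahalanobis distance: point", np.argmax(maha))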

27
Distance Based Density
  • Local Outlier Factor (LOF)
  • The average ratio of the density of example p's
    nearest neighbors to the density of p itself
  • Compute the density of the local neighborhood
    for each point
  • Compute the LOF
  • Larger LOF → outliers (a usage sketch follows)
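A short usage sketch with scikit-learn's LocalOutlierFactor; the neighbor count and contamination rate are arbitrary choices for the toy data.

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])

    lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
    labels = lof.fit_predict(X)             # -1 marks predicted outliers
    scores = -lof.negative_outlier_factor_  # larger -> more outlying
    print("flagged points:", np.where(labels == -1)[0])
    print("max LOF score:", scores.max())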

28
Distance Based Clustering
  • A radius w of proximity is specified
  • Two points x1 and x2 are near if d(x1, x2) < w
  • Define N(x) as the number of points that are
    within w of x
  • Points in small clusters → outliers
  • Fixed-width clustering is used for speedup (a
    toy sketch follows)
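A toy sketch of fixed-width clustering: each point joins the first cluster whose center lies within radius w, otherwise it starts a new cluster, and points in small clusters are flagged. The radius and the "small" threshold are arbitrary choices.

    import numpy as np

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(200, 2)), [[6.0, 6.0]]])

    w = 1.5
    centers, members = [], []
    for i, x in enumerate(X):
        for c, center in enumerate(centers):
            if np.linalg.norm(x - center) < w:
                members[c].append(i)
                break
        else:
            centers.append(x)           # single pass: centers stay fixed
            members.append([i])

    outliers = [i for m in members if len(m) <= 2 for i in m]
    print(f"{len(centers)} clusters; outlier points: {outliers}")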

29
Distance Based - Clustering (cont.)
  • k-Nearest Neighbor / Canopy Clustering
  • Compute the sum of distances to the k nearest
    neighbors
  • A small sum → point in a dense region
  • Canopy clustering is used for speedup
  • WaveCluster
  • Transform the data into multidimensional signals
    using the wavelet transform
  • Remove the high/low frequency parts
  • The remaining parts → outliers

30
Model Based Outlier Detection
  • Similar to probabilistic-based schemes
  • Build a prediction model for normal behavior
  • Deviation from the model → potential intrusion
  • Major approaches
  • Neural networks
  • Unsupervised Support Vector Machines (SVMs)

31
Model Based - Neural Networks
  • Use a replicator 4-layer feed-forward neural
    network (RNN)
  • The input variables are also the target outputs
    during training
  • The RNN forms a compressed model of the training
    data
  • Outlyingness → reconstruction error (a minimal
    sketch follows)
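A minimal replicator-style sketch using scikit-learn's MLPRegressor trained to reproduce its own inputs, with reconstruction error as the outlier score; the layer sizes and data are hypothetical choices, not the original 4-layer replicator design.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(300, 4)), [[5, 5, 5, 5]]])

    net = MLPRegressor(hidden_layer_sizes=(8, 2, 8), activation="tanh",
                       max_iter=3000, random_state=0)
    net.fit(X, X)                       # the inputs are the targets
    errors = np.mean((net.predict(X) - X) ** 2, axis=1)
    print("most outlying point:", np.argmax(errors))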

32
Model Based - SVMs
  • Attempt to separate the entire set of training
    data from the origin
  • Regions where most of the data lies are labeled
    as one class
  • Parameters
  • Expected outlier rate
  • Works well with high-quality, controlled
    training data
  • Variance of the Radial Basis Function (RBF)
    kernel
  • Larger → higher detection rate and more false
    alarms
  • Smaller → lower detection rate and fewer false
    alarms (a short usage sketch follows)
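A short usage sketch with scikit-learn's OneClassSVM, where nu plays the role of the expected outlier rate and gamma sets the RBF kernel width (gamma is inversely related to the kernel's variance); the values are arbitrary choices for the toy data.

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(size=(300, 2)), [[6.0, 6.0]]])

    ocsvm = OneClassSVM(kernel="rbf", nu=0.01, gamma=0.5).fit(X)
    labels = ocsvm.predict(X)           # -1 marks predicted outliers
    print("flagged points:", np.where(labels == -1)[0])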

33
Summary And Conclusion
  • Rare classes, which result from highly skewed
    class distributions, share many of the problems
    associated with rare cases. Rare classes and
    rare cases are connected.
  • Although rare cases can occur within both rare
    classes and common classes, rare cases are
    expected to be more of an issue for rare
    classes.
  • Japkowicz (2001) views rare classes as a
    consequence of between-class imbalance and rare
    cases as a consequence of within-class
    imbalance.
  • Thus, both forms of rarity are a type of data
    imbalance.
  • The modeling improvements presented in this
    paper are applicable to both types of rarity.