1
Discriminative Frequent Pattern Analysis for
Effective Classification
  • By Hong Cheng, Xifeng Yan, Jiawei Han, Chih-Wei
    Hsu
  • Presented by Mary Biddle

2
Introduction: Pattern Example
  • Patterns: ABCD, ABCF, BCD, BCEF
  • Frequencies (reproduced in the sketch below): A: 2,
    B: 4, C: 4, D: 2, E: 1, F: 2, AB: 2, BC: 4, CD: 2,
    CE: 1, CF: 2
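A minimal Python sketch (the four example patterns are hard-coded as transactions) that reproduces these counts for single items and pairs:

  from itertools import combinations
  from collections import Counter

  # The four example patterns, treated as transactions
  transactions = ["ABCD", "ABCF", "BCD", "BCEF"]

  support = Counter()
  for t in transactions:
      items = sorted(set(t))
      # count every single item and every 2-item combination
      for size in (1, 2):
          for combo in combinations(items, size):
              support["".join(combo)] += 1

  for pattern in ["A", "B", "C", "D", "E", "F",
                  "AB", "BC", "CD", "CE", "CF"]:
      print(pattern, support[pattern])   # matches the counts above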

3
Motivation
  • Why are frequent patterns useful for
    classification? Why do frequent patterns provide
    a good substitute for the complete pattern set?
  • How does frequent pattern-based classification
    achieve both high scalability and accuracy for
    the classification of large datasets?
  • What is the strategy for setting the minimum
    support threshold?
  • Given a set of frequent patterns, how should we
    select high quality ones for effective
    classification?

4
Introduction: Fisher Score Definition
  • In statistics and information theory, the Fisher
    information is the variance of the score.
  • The Fisher information measures the amount of
    information that an observable random variable X
    carries about an unknown parameter θ upon which
    the likelihood function of θ, L(θ) = f(X; θ),
    depends. The likelihood function is the joint
    probability of the data, the X's, conditional on
    the value of θ, viewed as a function of θ; the
    definition is restated in symbols below.
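In standard notation (not taken from the slides), the score U(θ) is the derivative of the log-likelihood and the Fisher information is its variance:

  U(θ) = ∂/∂θ log f(X; θ)
  I(θ) = Var[U(θ)] = E[U(θ)²]

where the last equality holds because E[U(θ)] = 0 under the usual regularity conditions.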

5
Introduction: Information Gain Definition
  • In probability theory and information theory,
    information gain is a measure of the difference
    between two probability distributions: from a
    true probability distribution P to an arbitrary
    probability distribution Q.
  • The expected information gain is the change in
    information entropy from a prior state to a state
    that takes some information as given.
  • Usually an attribute with high information gain
    should be preferred to other attributes (see the
    formulas below).
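In standard notation (again, not from the slides), the two quantities referred to here are the Kullback–Leibler divergence and the expected information gain of an attribute A:

  D_KL(P ‖ Q) = Σx P(x) log( P(x) / Q(x) )
  IG(C, A) = H(C) − H(C|A) = H(C) − Σa P(A = a) H(C | A = a)

so an attribute with high information gain removes more uncertainty about the class C.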

6
Model: Combined Feature Definition
  • Each (attribute, value) pair is mapped to a
    distinct item in I = {o1, …, od}.
  • A combined feature α = {oα1, …, oαk} is a subset
    of I, where oαi ∈ {o1, …, od}, 1 ≤ i ≤ k.
  • oi ∈ I is a single feature.
  • Given a dataset D = {xi}, the set of data that
    contains α is denoted Dα = {xi | xi,αj = 1,
    ∀ oαj ∈ α}.

7
Model: Frequent Combined Feature Definition
  • For a dataset D, a combined feature α is frequent
    if θ = |Dα| / |D| ≥ θ0, where θ is the relative
    support of α and θ0 is the min_sup threshold,
    0 ≤ θ0 ≤ 1.
  • The set of frequent combined features is denoted
    F (a small sketch of these definitions follows).
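A minimal Python sketch of these two definitions, with hypothetical attribute names: each (attribute, value) pair is one item, a combined feature α is a set of such items, Dα collects the records containing all of them, and α is frequent when |Dα| / |D| ≥ θ0.

  # Hypothetical (attribute, value) records; each pair is one item o_i in I.
  D = [
      {("color", "red"), ("shape", "round"), ("size", "big")},
      {("color", "red"), ("shape", "square")},
      {("color", "blue"), ("shape", "round"), ("size", "big")},
      {("color", "red"), ("shape", "round")},
  ]

  def support(alpha, D):
      """Relative support theta = |D_alpha| / |D| of a combined feature alpha."""
      D_alpha = [x for x in D if alpha <= x]   # records containing every item of alpha
      return len(D_alpha) / len(D)

  alpha = {("color", "red"), ("shape", "round")}
  theta0 = 0.5                          # min_sup threshold
  theta = support(alpha, D)
  print(theta, theta >= theta0)         # 0.5 True -> alpha is frequent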

8
Model: Information Gain
  • For a pattern α represented by a random variable
    X, the information gain is
  • IG(C|X) = H(C) − H(C|X)
  • where H(C) is the entropy and H(C|X) is the
    conditional entropy.
  • Given a dataset with a fixed class distribution,
    H(C) is a constant (a computation sketch follows).
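A short Python sketch of this computation for a binary pattern feature X (the toy labels and occurrence vector are made up for illustration):

  from collections import Counter
  from math import log2

  def entropy(labels):
      """H(C) for a list of class labels."""
      n = len(labels)
      return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

  def info_gain(labels, x):
      """IG(C|X) = H(C) - H(C|X), where x[i] = 1 if the pattern occurs in record i."""
      n = len(labels)
      h_cond = 0.0
      for v in (0, 1):
          part = [c for c, xi in zip(labels, x) if xi == v]
          if part:
              h_cond += (len(part) / n) * entropy(part)
      return entropy(labels) - h_cond

  labels = ["pos", "pos", "neg", "neg", "neg", "pos"]
  x      = [1, 1, 0, 0, 1, 1]           # pattern occurrence per record
  print(info_gain(labels, x))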

9
Model: Information Gain Upper Bound
  • The information gain upper bound IGub is
  • IGub(C|X) = H(C) − Hlb(C|X)
  • where Hlb(C|X) is the lower bound of H(C|X).

10
Model: Fisher Score
  • The Fisher score is defined as
  • Fr = ( Σi=1..c ni (μi − μ)² ) / ( Σi=1..c ni si² )
  • where ni is the number of data samples in class i,
  • μi is the average feature value in class i,
  • si is the standard deviation of the feature value
    in class i, and
  • μ is the average feature value in the whole
    dataset (a computation sketch follows).
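A direct Python transcription of this formula (the toy feature values and labels are made up for illustration):

  from statistics import mean, pstdev

  def fisher_score(values, labels):
      """Fr = sum_i n_i (mu_i - mu)^2 / sum_i n_i s_i^2, summed over classes i."""
      mu = mean(values)
      num = den = 0.0
      for c in set(labels):
          vals_c = [v for v, l in zip(values, labels) if l == c]
          n_i = len(vals_c)
          num += n_i * (mean(vals_c) - mu) ** 2
          den += n_i * pstdev(vals_c) ** 2
      return num / den

  values = [0.9, 0.8, 0.7, 0.2, 0.1, 0.3]   # feature value per record
  labels = ["pos", "pos", "pos", "neg", "neg", "neg"]
  print(fisher_score(values, labels))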

11
(No Transcript)
12
Model: Relevance Measure S
  • A relevance measure S is a function mapping a
    pattern α to a real value such that S(α) is the
    relevance of α w.r.t. the class label.
  • Measures like information gain and Fisher score
    can be used as a relevance measure.

13
Model: Redundancy Measure
  • A redundancy measure R is a function mapping two
    patterns α and β to a real value such that
    R(α, β) is the redundancy between them.
  • R(α, β) = ( P(α, β) / ( P(α) + P(β) − P(α, β) ) )
    × min( S(α), S(β) )
  • Here P(·) is the probability (relative support) of
    a pattern; the first factor is the Jaccard measure
    of the overlap between α and β (a sketch follows).
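A Python sketch of R(α, β), estimating P by relative support over a dataset D of item sets and taking the relevance function S as given (all names are illustrative):

  def redundancy(alpha, beta, D, S):
      """R(alpha, beta) = Jaccard overlap of alpha and beta, weighted by min relevance."""
      p_a  = sum(alpha <= x for x in D) / len(D)                # P(alpha)
      p_b  = sum(beta  <= x for x in D) / len(D)                # P(beta)
      p_ab = sum(alpha <= x and beta <= x for x in D) / len(D)  # P(alpha, beta)
      denom = p_a + p_b - p_ab
      jaccard = p_ab / denom if denom else 0.0
      return jaccard * min(S(alpha), S(beta))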

14
Model: Gain
  • The gain of a pattern α, given a set of already
    selected patterns Fs, is
  • g(α) = S(α) − max R(α, β)
  • where the maximum is taken over β ∈ Fs.

15
Algorithm: Framework of Frequent Pattern-Based
Classification
  1. Feature generation
  2. Feature selection
  3. Model learning

16
Algorithm: 1. Feature Generation
  1. Compute the information gain (or Fisher score)
    upper bound as a function of support θ.
  2. Choose an information gain threshold IG0 for
    feature filtering purposes.
  3. Find θ* = arg maxθ { θ : IGub(θ) ≥ IG0 } (see the
    sketch after this list).
  4. Mine frequent patterns with min_sup = θ*.
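A small sketch of step 3, assuming IGub is available as a function of support θ (here ig_ub is a placeholder for the bound derived in the paper): scan candidate thresholds and keep the largest θ whose upper bound still reaches IG0, so that mining stays as cheap as possible.

  def choose_min_sup(ig_ub, ig0, candidates):
      """Largest support theta with ig_ub(theta) >= ig0 (fallback: smallest candidate)."""
      feasible = [t for t in candidates if ig_ub(t) >= ig0]
      return max(feasible) if feasible else min(candidates)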

17
Algorithm: 2. Feature Selection Algorithm MMRFS
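The MMRFS pseudocode itself is not transcribed here; the Python sketch below only illustrates the general idea, with all names illustrative: repeatedly pick the candidate pattern with the highest gain g(α) = S(α) − max over β in Fs of R(α, β), keep it if it covers at least one not-yet-covered training instance, and stop once every instance is covered.

  def mmrfs(patterns, D, S, R, covers):
      """Select relevant, non-redundant patterns with a coverage requirement (sketch).

      patterns: candidate frequent patterns F
      S(p): relevance (e.g. information gain); R(p, q): redundancy between patterns
      covers(p, x): True if pattern p occurs in instance x
      """
      selected, uncovered = [], set(range(len(D)))
      remaining = list(patterns)
      while remaining and uncovered:
          # gain = relevance minus worst-case redundancy w.r.t. already selected patterns
          best = max(remaining,
                     key=lambda p: S(p) - max((R(p, q) for q in selected), default=0.0))
          remaining.remove(best)
          newly = {i for i in uncovered if covers(best, D[i])}
          if newly or not selected:     # keep only patterns that add coverage
              selected.append(best)
              uncovered -= newly
      return selected
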
18
Algorithm: 3. Model Learning
  • Use the resulting features as input to the
    learning model of your choice (a minimal example
    follows).
  • The authors experimented with SVM and C4.5.
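For example, each selected pattern can become one binary indicator column and the resulting matrix fed to a standard learner; a minimal sketch using scikit-learn's LinearSVC (scikit-learn, covers(), and the selected pattern list are assumptions carried over from the sketches above, not part of the original slides):

  from sklearn.svm import LinearSVC

  def to_feature_matrix(D, selected, covers):
      """One 0/1 column per selected pattern, one row per instance."""
      return [[1 if covers(p, x) else 0 for p in selected] for x in D]

  # D_train, y_train, D_test are assumed to be given
  clf = LinearSVC().fit(to_feature_matrix(D_train, selected, covers), y_train)
  print(clf.predict(to_feature_matrix(D_test, selected, covers)))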

19
Contributions
  • Propose a framework for frequent pattern-based
    classification by analyzing the relationship
    between pattern frequency and its predictive
    power.
  • Show that frequent pattern-based classification
    can exploit state-of-the-art frequent pattern
    mining algorithms for feature generation, with
    much better scalability.
  • Suggest a strategy for setting the minimum support
    threshold.
  • Propose an effective and efficient feature
    selection algorithm (MMRFS) that selects a set of
    frequent and discriminative patterns for
    classification.

20
Experiments: Accuracy with SVM and C4.5
21
Experiments: Accuracy and Time Measures
22
Related Work
  • Associative Classification
  • The association between frequent patterns and
    class labels is used for prediction. A
    classifier is built based on high-confidence,
    high-support association rules.
  • Top-K rule mining
  • A recent work on top-k rule mining discovers the
    top-k covering rule groups for each row of gene
    expression profiles. Prediction is performed
    based on a classification score that combines
    the support and confidence measures of the rules.
  • HARMONY (mines classification rules)
  • It uses an instance-centric rule-generation
    approach and ensures, for each training instance,
    that one of the highest-confidence rules covering
    that instance is included in the rule set. It is
    more efficient and scalable than previous
    rule-based classifiers, and on several datasets
    its accuracy is significantly higher, e.g. by
    11.94% on Waveform and 3.4% on Letter Recognition.
  • All of the following use frequent patterns
  • String kernels
  • Word combinations (NLP)
  • Structural features in graph classification

23
Differences between Associative Classification
and Discriminative Frequent Pattern Analysis
Classification
  • Frequent patterns are used to represent the data
    in a different feature space, whereas associative
    classification builds a classifier from rules
    only.
  • In associative classification, prediction is made
    by matching one or several top-ranked rules. In
    discriminative frequent pattern analysis, the
    prediction is made by the learned classification
    model (e.g. SVM or C4.5).
  • Information gain is used to discriminate among
    patterns, both to determine min_sup and to select
    the frequent patterns used as features.

24
Pros and Cons
  • Pros
  • Reduces computation time
  • More accurate
  • Cons
  • Space concerns on large datasets, because the
    complete frequent pattern set is generated
    initially.