Title: Discriminative Frequent Pattern Analysis for Effective Classification
1. Title Slide
- Discriminative Frequent Pattern Analysis for Effective Classification
- By Hong Cheng, Xifeng Yan, Jiawei Han, Chih-Wei Hsu
- Presented by Mary Biddle
2. Introduction: Pattern Example
- Patterns (transactions): ABCD, ABCF, BCD, BCEF
- Single-item frequencies: A: 2, B: 4, C: 4, D: 2, E: 1, F: 2
- Pair frequencies (selected): AB: 2, BC: 4, CD: 2, CE: 1, CF: 2
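As a quick illustration of how these counts arise, here is a minimal Python sketch (standard library only) that tallies the single-item and pair frequencies over the four example transactions; the slide lists only a subset of the pairs.

```python
from collections import Counter
from itertools import combinations

# The four example patterns (transactions) from the slide.
transactions = ["ABCD", "ABCF", "BCD", "BCEF"]

single = Counter()
pairs = Counter()
for t in transactions:
    items = sorted(set(t))
    single.update(items)                  # single-item frequencies
    pairs.update(combinations(items, 2))  # 2-itemset frequencies

print(dict(single))  # {'A': 2, 'B': 4, 'C': 4, 'D': 2, 'E': 1, 'F': 2}
print({"".join(p): n for p, n in pairs.items()})
# includes AB: 2, BC: 4, CD: 2, CE: 1, CF: 2, among the other pairs
```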
3. Motivation
- Why are frequent patterns useful for classification? Why do frequent patterns provide a good substitute for the complete pattern set?
- How does frequent pattern-based classification achieve both high scalability and accuracy for the classification of large datasets?
- What is the strategy for setting the minimum support threshold?
- Given a set of frequent patterns, how should we select high-quality ones for effective classification?
4. Information: Fisher Information Definition
- In statistics and information theory, the Fisher information is the variance of the score.
- The Fisher information measures the amount of information that an observable random variable X carries about an unknown parameter θ on which the likelihood function L(θ) = f(X; θ) depends. The likelihood function is the joint probability of the data, the Xs, conditional on the value of θ, viewed as a function of θ.
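Written out as a formula (the standard definition, included here for reference):

```latex
% Fisher information: the variance of the score, i.e. of the derivative
% of the log-likelihood with respect to the parameter theta
I(\theta) = \operatorname{Var}\!\left[\frac{\partial}{\partial\theta}\log L(\theta; X)\right]
          = \mathbb{E}\!\left[\left(\frac{\partial}{\partial\theta}\log f(X;\theta)\right)^{2}\right]
```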
5. Introduction: Information Gain Definition
- In probability theory and information theory, information gain is a measure of the difference between two probability distributions: from a true probability distribution P to an arbitrary probability distribution Q.
- The expected information gain is the change in information entropy from a prior state to a state that takes some information as given.
- An attribute with high information gain is usually preferred over other attributes.
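In symbols (standard definitions, added for reference): the distribution-difference reading corresponds to the Kullback-Leibler divergence, and the expected information gain of an attribute A is the resulting drop in entropy of the class variable C.

```latex
% KL divergence from an arbitrary Q to the true distribution P
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{i} P(i)\,\log\frac{P(i)}{Q(i)}
% Expected information gain of attribute A: prior minus posterior entropy
IG(A) = H(C) - H(C \mid A)
```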
6. Model: Combined Feature Definition
- Each (attribute, value) pair is mapped to a distinct item in I = {o_1, ..., o_d}.
- A combined feature α = {o_α1, ..., o_αk} is a subset of I, where o_αi ∈ {o_1, ..., o_d}, 1 ≤ i ≤ k; each o_i ∈ I is a single feature.
- Given a dataset D = {x_i}, the set of data containing α is denoted D_α = {x_i | x_i,αj = 1, ∀ o_αj ∈ α}.
7. Model: Frequent Combined Feature Definition
- For a dataset D, a combined feature α is frequent if θ = |D_α| / |D| ≥ θ_0, where θ is the relative support of α and θ_0 is the min_sup threshold, 0 ≤ θ_0 ≤ 1.
- The set of frequent combined features is denoted F. (A sketch of both definitions follows below.)
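A minimal sketch of these two definitions, assuming the dataset is stored as binary indicator rows exactly as in the D_α definition (the toy values are invented for illustration):

```python
def matching_rows(D, alpha):
    """D_alpha: rows of D that contain every item of the combined feature alpha."""
    return [x for x in D if all(x[j] == 1 for j in alpha)]

def is_frequent(D, alpha, theta0):
    """alpha is frequent iff its relative support |D_alpha| / |D| >= theta0."""
    theta = len(matching_rows(D, alpha)) / len(D)
    return theta >= theta0

# Toy dataset over d = 4 items; x[j] = 1 iff item o_j occurs in the record.
D = [
    [1, 1, 0, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 0],
    [1, 0, 1, 0],
]
alpha = {0, 1}                      # combined feature {o_1, o_2}
print(is_frequent(D, alpha, 0.5))   # support = 2/4 = 0.5 -> True
```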
8. Model: Information Gain
- For a pattern α represented by a random variable X, the information gain is IG(C|X) = H(C) - H(C|X),
- where H(C) is the entropy and H(C|X) is the conditional entropy.
- Given a dataset with a fixed class distribution, H(C) is a constant. (A worked computation follows below.)
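A small worked computation of IG(C|X) for a binary pattern indicator x and class labels c, using the entropy definitions above (the toy data are invented):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(C) over a list of class labels."""
    n = len(labels)
    return -sum(k / n * log2(k / n) for k in Counter(labels).values())

def info_gain(x, c):
    """IG(C|X) = H(C) - H(C|X) for a 0/1 pattern indicator x."""
    n = len(c)
    h_cond = 0.0
    for v in (0, 1):
        subset = [ci for xi, ci in zip(x, c) if xi == v]
        if subset:
            h_cond += len(subset) / n * entropy(subset)
    return entropy(c) - h_cond

x = [1, 1, 1, 0, 0, 0]              # pattern occurrence per record
c = ["+", "+", "+", "-", "-", "+"]  # class labels
print(round(info_gain(x, c), 3))    # 0.459: H(C) = 0.918, H(C|X) = 0.459
```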
9. Model: Information Gain Upper Bound
- The information gain upper bound IG_ub is IG_ub(C|X) = H(C) - H_lb(C|X),
- where H_lb(C|X) is the lower bound of H(C|X).
10. Model: Fisher Score
- The Fisher score is defined as Fr = (Σ_{i=1..c} n_i (μ_i - μ)²) / (Σ_{i=1..c} n_i σ_i²),
- where n_i is the number of data samples in class i,
- μ_i is the average feature value in class i,
- σ_i is the standard deviation of the feature value in class i,
- and μ is the average feature value in the whole dataset.
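A direct transcription of this formula in Python (assuming, as one plausible reading, that σ_i² is the population variance within class i; the values are invented):

```python
from statistics import mean, pvariance

def fisher_score(values, labels):
    """Fr = (sum_i n_i * (mu_i - mu)^2) / (sum_i n_i * sigma_i^2) over classes i."""
    mu = mean(values)
    num = den = 0.0
    for cls in set(labels):
        vals = [v for v, l in zip(values, labels) if l == cls]
        num += len(vals) * (mean(vals) - mu) ** 2
        den += len(vals) * pvariance(vals)
    return num / den

values = [0.9, 1.1, 1.0, 2.9, 3.1, 3.0]
labels = ["a", "a", "a", "b", "b", "b"]
print(fisher_score(values, labels))  # large: well-separated class means, low spread
```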
12. Model: Relevance Measure S
- A relevance measure S is a function mapping a pattern α to a real value such that S(α) is the relevance of α w.r.t. the class label.
- Measures like information gain and Fisher score can be used as relevance measures.
13. Model: Redundancy Measure
- A redundancy measure R is a function mapping two patterns α and β to a real value such that R(α, β) is the redundancy between them.
- R(α, β) = (P(α, β) / (P(α) + P(β) - P(α, β))) × min(S(α), S(β)),
- where P denotes the relative support of a pattern and the first factor is the Jaccard measure of overlap between α and β. (A sketch follows below.)
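A sketch of R(α, β) under the reading above, with P computed as relative support and hypothetical relevance scores standing in for S:

```python
def support(D, alpha):
    """P(alpha): fraction of records containing every item of alpha."""
    return sum(all(x[j] == 1 for j in alpha) for x in D) / len(D)

def redundancy(D, a, b, S):
    """R(a, b) = Jaccard(a, b) * min(S(a), S(b)), with
    Jaccard(a, b) = P(a, b) / (P(a) + P(b) - P(a, b))."""
    p_ab = support(D, a | b)  # P(a, b): both patterns occur together
    jaccard = p_ab / (support(D, a) + support(D, b) - p_ab)
    return jaccard * min(S(a), S(b))

D = [[1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1]]
a, b = frozenset({0, 1}), frozenset({1, 2})
scores = {a: 0.6, b: 0.4}               # hypothetical relevance values
print(redundancy(D, a, b, scores.get))  # Jaccard = 1/3, so R = 0.4 / 3
```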
14. Model: Gain
- The gain of a pattern α, given a set of already selected patterns F_s, is
- g(α) = S(α) - max R(α, β), where the max ranges over β ∈ F_s.
15. Algorithm: Framework of Frequent Pattern-Based Classification
- Feature generation
- Feature selection
- Model learning
16. Algorithm: 1. Feature Generation
- Compute the information gain (or Fisher score) upper bound as a function of support θ.
- Choose an information gain threshold IG_0 for feature filtering.
- Find θ* = arg max_θ {θ : IG_ub(θ) ≤ IG_0}, i.e., the largest support at which the upper bound still falls below the filtering threshold. (A sketch of this search follows below.)
- Mine frequent patterns with min_sup = θ*.
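One way to implement the threshold search, assuming a callable igub(theta) implementing the paper's upper-bound curve is available (the bound's closed form is not reproduced here, and the toy curve below is invented):

```python
def choose_min_sup(igub, ig0, grid):
    """Return the largest candidate support theta with IGub(theta) <= IG0:
    patterns rarer than this theta* cannot reach the filtering threshold,
    so mining with min_sup = theta* prunes them safely."""
    feasible = [t for t in grid if igub(t) <= ig0]
    return max(feasible) if feasible else min(grid)

# Toy illustration: a made-up increasing bound on the low-support side.
print(choose_min_sup(lambda t: 2 * t, ig0=0.1, grid=[0.01, 0.05, 0.1, 0.2]))  # 0.05
```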
17. Algorithm: 2. Feature Selection (MMRFS)
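The algorithm figure on this slide is not transcribed. As a hedged reconstruction from the definitions on the preceding slides, MMRFS can be read as a greedy, maximal-marginal-relevance-style loop that repeatedly adds the pattern with the highest gain g(α) = S(α) - max over selected β of R(α, β); the paper additionally uses a coverage constraint over training instances, which this sketch omits.

```python
def mmrfs(patterns, S, R, k):
    """Greedily select up to k patterns by gain: relevance minus the maximum
    redundancy against the already selected set F_s."""
    selected = []
    candidates = set(patterns)
    while candidates and len(selected) < k:
        def gain(a):
            if not selected:
                return S(a)  # first pick is by pure relevance
            return S(a) - max(R(a, b) for b in selected)
        best = max(candidates, key=gain)
        selected.append(best)
        candidates.remove(best)
    return selected

# Toy run with hypothetical scores: p2 is highly redundant with p1.
S = {"p1": 0.9, "p2": 0.85, "p3": 0.3}.get
R = lambda a, b: 0.8 if {a, b} == {"p1", "p2"} else 0.1
print(mmrfs(["p1", "p2", "p3"], S, R, k=2))  # ['p1', 'p3']
```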
18. Algorithm: 3. Model Learning
- Use the resulting features as input to the learning model of your choice.
- The authors experimented with SVM and C4.5. (A usage sketch follows below.)
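A minimal usage sketch, assuming scikit-learn and assuming the selected frequent patterns have already been converted into 0/1 indicator columns; LinearSVC stands in for the SVM and DecisionTreeClassifier for C4.5 (scikit-learn implements CART rather than C4.5):

```python
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# X: one row per record, one 0/1 column per selected frequent pattern
# (1 iff the pattern occurs in that record); y: class labels.
X = [[1, 0, 1], [1, 1, 0], [0, 1, 1], [0, 0, 1]]
y = [1, 1, 0, 0]

svm = LinearSVC().fit(X, y)
tree = DecisionTreeClassifier().fit(X, y)
print(svm.predict([[1, 0, 0]]), tree.predict([[1, 0, 0]]))
```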
19. Contributions
- Proposes a framework for frequent pattern-based classification by analyzing the relationship between pattern frequency and predictive power.
- Frequent pattern-based classification can exploit state-of-the-art frequent pattern mining algorithms for feature generation, with much better scalability.
- Suggests a strategy for setting the minimum support threshold.
- Proposes an effective and efficient feature selection algorithm for choosing a set of frequent and discriminative patterns for classification.
20. Experiments: Accuracy with SVM and C4.5
21. Experiments: Accuracy and Time Measures
22. Related Work
- Associative classification: the association between frequent patterns and class labels is used for prediction. A classifier is built from high-confidence, high-support association rules.
- Top-k rule mining: recent work discovers the top-k covering rule groups for each row of gene expression profiles. Prediction is performed with a classification score that combines the support and confidence measures of the rules.
- HARMONY (mines classification rules): uses an instance-centric rule-generation approach and ensures that, for each training instance, one of the highest-confidence rules covering that instance is included in the rule set. It is more efficient and scalable than previous rule-based classifiers, and on several datasets its accuracy was significantly higher, e.g., by 11.94% on Waveform and 3.4% on Letter Recognition.
- Other approaches that also use frequent patterns:
  - String kernels
  - Word combinations (NLP)
  - Structural features in graph classification
23. Differences between Associative Classification and Discriminative Frequent Pattern Analysis
- Frequent patterns are used to represent the data in a different feature space, whereas associative classification builds a classifier from rules only.
- In associative classification, prediction means finding one or several top-ranked rules; in frequent pattern-based classification, the prediction is made by the learned classification model.
- Information gain is used to discriminate among patterns: it determines the min_sup threshold and guides the selection of frequent patterns.
24. Pros and Cons
- Pros
  - Reduces runtime
  - More accurate
- Cons
  - Space concerns on large datasets, because the entire frequent pattern set is generated initially.