Selecting the Right Interestingness Measure for Association Patterns

1
Selecting the Right Interestingness Measure for
Association Patterns

Pang-Ning Tan, Vipin Kumar, and Jaideep
Srivastava
Department of Computer Science and Engineering
University of Minnesota
Presented by Ahmet Bulut

2
Motivation

Major data mining problem: analysis of relationships among variables
Finding sets of binary variables that co-occur
Association rule mining (Agrawal et al.)
How to find a suitable metric to capture the dependencies between variables (defined in terms of contingency tables)
Many metrics provide conflicting information
Goal: an automated way of choosing the best metric for an application domain

3
(No Transcript)
4
Justification for Conflicts

E10 is ranked highest by the I measure but lowest according to the φ-coefficient.
Recognize the intrinsic properties of the existing measures

5
Analysis of a Measure

Example: the relationship between the gender of a student and the grade obtained in a course
Scale the number of male students by 2 and the number of female students by 10
One expects scale invariance in this particular application
Most measures, such as the Gini index, interest, and mutual information, are sensitive to scaling of rows and columns

6
Solutions to zero-in

Support-based pruning: eliminate uncorrelated and poorly correlated patterns
Table standardization: modify contingency tables to have uniform margins
Under these conditions, many measures provide non-conflicting information
Expectation of domain experts: choose the measure that agrees with the expectations the most
The number of contingency tables, T, is high
It is possible to extract a small set, S, of contingency tables
Find the best measure for S as an approximation of the best measure for T

7
Preliminaries

The similarity between any two measures $M_1$ and $M_2$ is defined as the similarity between $O_{M_1}(T)$ and $O_{M_2}(T)$
The similarity metric used is the correlation coefficient
If $\mathrm{corr}(O_{M_1}(T), O_{M_2}(T)) > \text{threshold}$, then the two measures are similar
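The similarity test above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: $O_M(T)$ is taken to be the vector of ranks that measure M assigns to the tables in T, each table is a tuple `(f11, f10, f01, f00)`, ties are not handled, and the default threshold of 0.85 is borrowed from the support-based pruning slide. The function names are hypothetical.

```python
from math import sqrt

def interest(f11, f10, f01, f00):
    # Interest factor I = N * f11 / (row-1 total * column-1 total)
    n = f11 + f10 + f01 + f00
    return n * f11 / ((f11 + f10) * (f11 + f01))

def odds_ratio(f11, f10, f01, f00):
    # Odds ratio = f11 * f00 / (f10 * f01); assumes f10 and f01 are non-zero
    return (f11 * f00) / (f10 * f01)

def ranks(values):
    # Rank 1 goes to the largest value (assumes distinct values, no tie handling)
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def pearson(a, b):
    # Pearson correlation coefficient of two equal-length sequences
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

def similar(measure1, measure2, tables, threshold=0.85):
    # Two measures are "similar" if their rank vectors correlate above the threshold
    r1 = ranks([measure1(*t) for t in tables])
    r2 = ranks([measure2(*t) for t in tables])
    return pearson(r1, r2) > threshold

# Made-up tables, for illustration only
tables = [(30, 20, 10, 40), (10, 40, 30, 20), (25, 25, 25, 25), (45, 5, 5, 45)]
print(similar(interest, odds_ratio, tables))
```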

8
Desired Properties of a Measure M
9
Properties of a Measure M contd.

Denote a 2×2 contingency table as a contingency matrix $M$
An interestingness measure is a matrix operator, $O$, such that $OM = k$, where $k$ is a scalar.
For instance, with the φ-coefficient as the interestingness measure, $k$ equals a normalized form of the determinant operator
$\det(M) = f_{11} f_{00} - f_{01} f_{10}$
Statistical independence: a singular matrix $M$, whose determinant is equal to 0.
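As a minimal sketch of the determinant view, the φ-coefficient can be written as the determinant divided by the square root of the product of the four marginal totals. The tuple layout `(f11, f10, f01, f00)` and the function names are assumptions made for this example, and all marginals are assumed to be non-zero.

```python
from math import sqrt

def det(f11, f10, f01, f00):
    # Determinant of the contingency matrix [[f11, f10], [f01, f00]]
    return f11 * f00 - f01 * f10

def phi(f11, f10, f01, f00):
    # phi = det(M) / sqrt(product of the two row totals and two column totals)
    denom = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return det(f11, f10, f01, f00) / denom

# Independent variables: the matrix is singular, so phi is 0
print(phi(20, 30, 20, 30))
# Perfect positive association: phi is 1
print(phi(50, 0, 0, 50))
```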

10
Properties of a Measure M contd.

Property 1: Symmetry under variable permutation
$O(M^T) = O(M)$
Examples: cosine (IS), interest factor (I), odds ratio (α)
Property 2: Row/Column Scaling Invariance
$R = C = \begin{bmatrix} k_1 & 0 \\ 0 & k_2 \end{bmatrix}$
$R \times M$ is row scaling and $M \times C$ is column scaling
If $O(RM) = O(M)$ and $O(MC) = O(M)$, then the measure $O$ is row/column scaling invariant
The odds ratio (α) satisfies this property, along with Yule's Q and Y
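Property 2 can be checked numerically. The sketch below assumes a table stored as `((f11, f10), (f01, f00))`, uses made-up counts, and scales the two columns by 2 and 10 (the same factors as in the gender/grade example). Exact fractions are used so that equality comparisons are meaningful; the helper names are hypothetical.

```python
from fractions import Fraction

def scale_rows(m, k1, k2):
    # R x M with R = diag(k1, k2): multiply row 1 by k1 and row 2 by k2
    (f11, f10), (f01, f00) = m
    return ((k1 * f11, k1 * f10), (k2 * f01, k2 * f00))

def scale_cols(m, k1, k2):
    # M x C with C = diag(k1, k2): multiply column 1 by k1 and column 2 by k2
    (f11, f10), (f01, f00) = m
    return ((k1 * f11, k2 * f10), (k1 * f01, k2 * f00))

def odds_ratio(m):
    (f11, f10), (f01, f00) = m
    return Fraction(f11 * f00, f10 * f01)

def interest(m):
    (f11, f10), (f01, f00) = m
    n = f11 + f10 + f01 + f00
    return Fraction(n * f11, (f11 + f10) * (f11 + f01))

m = ((30, 20), (10, 40))          # made-up counts
scaled = scale_cols(m, 2, 10)     # column 1 x 2, column 2 x 10

print(odds_ratio(m) == odds_ratio(scaled))   # odds ratio is unchanged
print(interest(m) == interest(scaled))       # interest factor changes
```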

Property 3: Antisymmetry under row/column permutation
$S = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$
If $O(SM) = -O(M)$, the measure is antisymmetric under row permutation
If $O(MS) = -O(M)$, the measure is antisymmetric under column permutation
Measures that are symmetric under the row and column permutation operations make no distinction between positive and negative correlations of a table
For example, the Gini index
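A short check of Property 3, using the φ-coefficient as an assumed example of an antisymmetric measure: swapping the rows (SM) or the columns (MS) negates the determinant and leaves the normalizing term unchanged, so the sign of φ flips. The table layout `((f11, f10), (f01, f00))` and the counts are illustrative assumptions.

```python
from math import sqrt

def phi(m):
    # phi-coefficient of a table stored as ((f11, f10), (f01, f00))
    (f11, f10), (f01, f00) = m
    denom = sqrt((f11 + f10) * (f01 + f00) * (f11 + f01) * (f10 + f00))
    return (f11 * f00 - f01 * f10) / denom

def swap_rows(m):
    # S x M: exchange the two rows
    row1, row2 = m
    return (row2, row1)

def swap_cols(m):
    # M x S: exchange the two columns
    (f11, f10), (f01, f00) = m
    return ((f10, f11), (f00, f01))

m = ((30, 20), (10, 40))  # made-up counts
print(phi(m), phi(swap_rows(m)), phi(swap_cols(m)))
```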

Property 4: Inversion Invariance
$S = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$
Row and column permutation applied together
If $O(SMS) = O(M)$, the measure is inversion invariant
Insight: flip presence with absence, and vice versa, for binary variables.
The φ-coefficient, odds ratio, and collective strength are symmetric binary measures
The Jaccard measure is asymmetric
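To illustrate the difference between symmetric and asymmetric binary measures, the sketch below applies the inversion SMS to a made-up table and compares the odds ratio with the Jaccard measure. The layout `((f11, f10), (f01, f00))` and the function names are assumptions for this example.

```python
from fractions import Fraction

def invert(m):
    # S x M x S: swap rows and columns together, i.e. exchange presence and absence
    (f11, f10), (f01, f00) = m
    return ((f00, f01), (f10, f11))

def odds_ratio(m):
    (f11, f10), (f01, f00) = m
    return Fraction(f11 * f00, f10 * f01)

def jaccard(m):
    # Jaccard = f11 / (f11 + f10 + f01); f00 is ignored
    (f11, f10), (f01, f00) = m
    return Fraction(f11, f11 + f10 + f01)

m = ((30, 20), (10, 40))  # made-up counts
print(odds_ratio(m) == odds_ratio(invert(m)))  # inversion invariant
print(jaccard(m) == jaccard(invert(m)))        # not inversion invariant
```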

11
Property 4 and Property 5

Market basket analysis requires unequal treatment of the binary values of a variable
A symmetric measure like the ones above is not suitable
Property 5: Null Invariance
$O(M + C) = O(M)$, where $C = \begin{bmatrix} 0 & 0 \\ 0 & k \end{bmatrix}$ and $k$ is a positive constant
For binary variables: adding more records that do not contain the two variables under consideration does not change the measure, so co-occurrence is emphasized
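Null invariance can be illustrated by adding k records that contain neither variable, which only increases $f_{00}$. In this sketch the Jaccard measure is used as an assumed example of a null-invariant measure and the interest factor as one that is not; the layout `((f11, f10), (f01, f00))` and the counts are made up.

```python
from fractions import Fraction

def add_null_records(m, k):
    # M + C with C = [[0, 0], [0, k]]: k extra records containing neither variable
    (f11, f10), (f01, f00) = m
    return ((f11, f10), (f01, f00 + k))

def jaccard(m):
    (f11, f10), (f01, f00) = m
    return Fraction(f11, f11 + f10 + f01)

def interest(m):
    (f11, f10), (f01, f00) = m
    n = f11 + f10 + f01 + f00
    return Fraction(n * f11, (f11 + f10) * (f11 + f01))

m = ((30, 20), (10, 40))          # made-up counts
bigger = add_null_records(m, 1000)

print(jaccard(m) == jaccard(bigger))    # unchanged by null records
print(interest(m) == interest(bigger))  # inflated by null records
```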

12
Effect of Support Based Pruning

Randomly generated synthetic dataset of 10,000 contingency tables
Darker cells indicate correlation > 0.85; lighter cells indicate otherwise
With tighter bounds on the support of the patterns, many measures become correlated

13
Elimination of poorly correlated tables using
Support-based Pruning

A minimum support threshold prunes out the low-support patterns
A maximum support threshold eliminates uncorrelated, negatively correlated, and positively correlated tables equally
A lower bound on support prunes out the negatively correlated or uncorrelated tables.
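A minimal sketch of support-based pruning, assuming that the support of a table is $f_{11}/N$ and that a table is stored as a tuple `(f11, f10, f01, f00)`. The threshold values and the function names are illustrative only.

```python
def support(table):
    # Fraction of records in which both variables are present (assumed definition)
    f11, f10, f01, f00 = table
    return f11 / (f11 + f10 + f01 + f00)

def prune(tables, min_support=0.0, max_support=1.0):
    # Keep only the tables whose support lies within [min_support, max_support]
    return [t for t in tables if min_support <= support(t) <= max_support]

# Made-up tables with supports 0.01, 0.30 and 0.90
tables = [(1, 9, 10, 80), (30, 20, 10, 40), (90, 5, 3, 2)]
kept = prune(tables, min_support=0.05, max_support=0.5)
print(kept)
```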

14
Table standardization

A standardized table gives a visual depiction of the joint distribution of two variables after elimination of non-uniform marginals

15
Implications of standardization

The rankings from different measures become
identical

16
Implications of standardization contd

After standardization, a matrix has the form $\begin{bmatrix} x & y \\ y & x \end{bmatrix}$, where $y = N/2 - x$ and $x$ is the standardized $f_{11}$
Nearly all of the measures are monotonically increasing functions of $x$
This gives identical rankings on standardized, positively correlated tables
Some measures do not satisfy this property
Consider the values of $x$ where $N/4 < x < N/2$
IPF (iterative proportional fitting) favors the odds ratio measure; therefore the final rankings agree with the odds ratio rankings before standardization
Takeaway: different standardization techniques may be more appropriate for different application domains
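The standardization step can be sketched with a simple iterative proportional fitting loop that alternately rescales rows and columns until every marginal equals N/2. This is an illustrative version, not necessarily the exact procedure used by the authors; it assumes all four cells are strictly positive, and the function name, iteration cap, and tolerance are hypothetical. Because row and column scaling leave the odds ratio unchanged, the standardized table keeps the odds ratio of the original one.

```python
def standardize(table, max_iter=1000, tol=1e-9):
    # table is ((f11, f10), (f01, f00)) with strictly positive cells (assumption)
    (a, b), (c, d) = table
    a, b, c, d = float(a), float(b), float(c), float(d)
    half = (a + b + c + d) / 2
    for _ in range(max_iter):
        # Rescale each row so that it sums to N/2
        r1, r2 = a + b, c + d
        a, b = a * half / r1, b * half / r1
        c, d = c * half / r2, d * half / r2
        # Rescale each column so that it sums to N/2
        c1, c2 = a + c, b + d
        a, c = a * half / c1, c * half / c1
        b, d = b * half / c2, d * half / c2
        # Columns are now exact; stop once the rows are also close enough
        if abs(a + b - half) < tol and abs(c + d - half) < tol:
            break
    return ((a, b), (c, d))

original = ((30, 20), (10, 40))   # made-up counts, odds ratio = 6
(x, y), (y2, x2) = standardize(original)
print(x, y, y2, x2)               # approximately the form [[x, y], [y, x]]
print((x * x2) / (y * y2))        # odds ratio, approximately unchanged
```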

17
Measure Selection Based on Rankings by Experts

Ideally, experts rank all the contingency tables, and the best measure is chosen accordingly
This is a laborious task if the number of tables is too large
Instead, provide a smaller set of tables for deciding the best measure

18
Table Selection via Disjoint Algorithm

Use the Disjoint algorithm to choose a subset of tables of cardinality k.
Rank tables according to various measures
Compute the similarity between different measures
A good table selection scheme minimizes

19
Experimental Results
20
Conclusions

Key properties to consider when selecting the right measure
No measure is consistently better than the others
There are situations where most measures provide correlated information
Choosing the right measure on a small, unbiased subset of the tables gives a good estimate of the ideal solution
Future work:
Extension to k-way contingency tables
Association between mixed data types