Title: Informative rule set and unbalanced class distributions
1. Informative rule set and unbalanced class distributions
- J. Li, H. Shen, R. Topor: Mining Informative Rule Set for Prediction. Journal of Intelligent Information Systems, 22(2), 155-174, 2004
- L. Gu, J. Li, H. He, G. J. Williams, S. Hawkins, C. Kelman: Association Rule Discovery with Unbalanced Class Distributions. Australian Conference on Artificial Intelligence, 2003, pp. 221-232
- Presented by Jonas Maaskola
2. Outline
- Introduction
- Notation and rule measures
- Informative rule set
  - Definition of the informative rule set (IR)
  - Lemmas and properties
  - Algorithm to mine the IR
- Unbalanced classes
  - Illustration of the problem
  - Different interestingness metrics
  - Application example evaluated
3. Part 0: Introduction
4. Notation
- I = {1, 2, ..., m} is a set of items.
- A transaction T ⊆ I is a set of items.
- A database D is a collection of transactions.
- Given two non-overlapping itemsets X and Y, the association rule X→Y is defined if sup(X) ≥ min-sup and conf(X→Y) ≥ min-conf.
- (See the next slide for the definitions of sup(X) and conf(X→Y).)
5. Rule measures
- sup(X): the support of itemset X, i.e. the relative frequency of transactions containing X.
- conf(X→Y): the confidence of rule X→Y, i.e. the conditional probability that a transaction contains Y given that it contains X.
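These two measures are easy to state in code. A minimal Python sketch, assuming transactions are represented as sets (function names are illustrative):

```python
def sup(db, itemset):
    # Relative frequency of transactions containing every item of `itemset`.
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def conf(db, antecedent, consequent):
    # Conditional probability of the consequent given the antecedent.
    return sup(db, set(antecedent) | set(consequent)) / sup(db, antecedent)

# Transaction DB of Example 1 (slide 9):
db = [{'a','b','c'}, {'a','b','c'}, {'a','b','c'},
      {'a','b','d'}, {'a','c','d'}, {'b','c','d'}]
print(sup(db, 'ab'), conf(db, 'a', 'b'))  # 0.67 and 0.8 (rounded)
```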
6. Rule measures
- If A and B are itemsets, denote A ∪ B as AB.
- Given two rules A→c and AB→c, A→c is more general than AB→c, and AB→c is more specific than A→c.
7. Part 1: Informative rule set
8. Informative vs. association rule set
- Association rule set:
  - Includes all association rules that exceed the confidence threshold.
- Informative rule set:
  - Includes all rules satisfying the minimum support.
  - Excludes every more specific rule whose confidence is not greater than that of one of its more general rules.
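A rough sketch of this exclusion test, assuming rules are given as (antecedent, consequent, confidence) triples that already meet the support threshold. Multi-item consequents are checked item by item, a reading that reproduces Example 1 below; consult the paper for the exact definition and for tie-breaking between equally confident rules:

```python
def informative(rules):
    # rules: list of (frozenset antecedent, frozenset consequent, confidence).
    keep = []
    for ant, cons, c in rules:
        # A rule is redundant if every item it predicts is already predicted,
        # at no lower confidence, by another rule with a subset antecedent.
        redundant = all(
            any((ant2, cons2) != (ant, cons) and ant2 <= ant
                and y in cons2 and c2 >= c
                for ant2, cons2, c2 in rules)
            for y in cons)
        if not redundant:
            keep.append((ant, cons, c))
    return keep
```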
9. IR Example 1
- min-sup = min-conf = 0.5
- Transaction DB: 1: {a,b,c}, 2: {a,b,c}, 3: {a,b,c}, 4: {a,b,d}, 5: {a,c,d}, 6: {b,c,d}
10. IR Example 1
- min-sup = min-conf = 0.5
- Transaction DB: 1: {a,b,c}, 2: {a,b,c}, 3: {a,b,c}, 4: {a,b,d}, 5: {a,c,d}, 6: {b,c,d}
- 12 association rules exceed the thresholds (support, confidence): a→b (0.67, 0.8), a→c (0.67, 0.8), b→c (0.67, 0.8), b→a (0.67, 0.8), c→a (0.67, 0.8), c→b (0.67, 0.8), ab→c (0.5, 0.75), ac→b (0.5, 0.75), bc→a (0.5, 0.75), a→bc (0.5, 0.6), b→ac (0.5, 0.6), c→ab (0.5, 0.6)
11. IR Example 1
- min-sup = min-conf = 0.5
- Transaction DB: 1: {a,b,c}, 2: {a,b,c}, 3: {a,b,c}, 4: {a,b,d}, 5: {a,c,d}, 6: {b,c,d}
- 12 association rules exceed the thresholds (support, confidence): a→b (0.67, 0.8), a→c (0.67, 0.8), b→c (0.67, 0.8), b→a (0.67, 0.8), c→a (0.67, 0.8), c→b (0.67, 0.8), ab→c (0.5, 0.75), ac→b (0.5, 0.75), bc→a (0.5, 0.75), a→bc (0.5, 0.6), b→ac (0.5, 0.6), c→ab (0.5, 0.6)
- Informative rule set: a→b (0.67, 0.8), a→c (0.67, 0.8), b→c (0.67, 0.8), b→a (0.67, 0.8), c→a (0.67, 0.8), c→b (0.67, 0.8)
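For what it is worth, combining the sketches above reproduces this example; the brute-force enumeration below is purely illustrative:

```python
from itertools import combinations

items = sorted({i for t in db for i in t})
rules = []
for k in range(2, len(items) + 1):
    for itemset in map(frozenset, combinations(items, k)):
        if sup(db, itemset) >= 0.5:                    # min-sup
            for r in range(1, k):
                for ant in map(frozenset, combinations(itemset, r)):
                    c = conf(db, ant, itemset - ant)
                    if c >= 0.5:                       # min-conf
                        rules.append((ant, itemset - ant, c))

print(len(rules), len(informative(rules)))  # 12 association rules, 6 informative
```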
12. IR Example 2
- min-sup = min-conf = 0.5
- Rule set: a→b (0.25, 1.0), a→c (0.2, 0.7), ab→c (0.2, 0.7), b→d (0.3, 1.0), a→d (0.25, 1.0)
13. IR Example 2
- min-sup = min-conf = 0.5
- Rule set: a→b (0.25, 1.0), a→c (0.2, 0.7), ab→c (0.2, 0.7), b→d (0.3, 1.0), a→d (0.25, 1.0)
- In this case the IR is identical to the rule set above, because:
  - ab→c cannot be omitted: the more general rule a→c has the same confidence, not a greater one, and
  - a→d cannot be omitted, as transitive reasoning (via a→b and b→d) is not intended.
14. Lemmas and properties of the IR
- There exists a unique IR for any given rule set.
- The IR is the smallest subset of the association rule set (AR) fulfilling (4).
- To predict, select matching rules in decreasing order of confidence; stop when satisfied or when no rules are left.
- The IR predicts items in the same order as the association rule set when using confidence priority.
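A minimal sketch of prediction by confidence priority, reusing the rule triples from the sketches above (the stopping condition "when satisfied" is simplified here to "after k predicted items"):

```python
def predict(rules, transaction, k=1):
    # Consider matching rules (antecedent contained in the transaction)
    # in decreasing order of confidence.
    matching = sorted((r for r in rules if r[0] <= transaction),
                      key=lambda r: -r[2])
    predicted = []
    for ant, cons, c in matching:
        for item in sorted(cons):
            if item not in transaction and item not in predicted:
                predicted.append(item)
                if len(predicted) == k:   # stop when satisfied
                    return predicted
    return predicted

print(predict(informative(rules), {'a'}, k=2))  # e.g. ['b', 'c']
```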
15. Candidate tree
16. Algorithm: mine the informative rule set
- Input: database D, the minimum support and the minimum confidence.
- Output: the informative rule set R.
- Set the informative rule set R = ∅
- Count the support of 1-itemsets
- Initialize the candidate tree T
- Generate new candidates as leaves of T
- While the new candidate set is non-empty:
  - Count the support of the new candidates
  - Prune the new candidate set
  - Include qualified rules from T in R
  - Generate new candidates as leaves of T
- Return the rule set R
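A compressed Python sketch of this loop, with the candidate tree flattened to a plain candidate list and pruning reduced to the support check (the paper's algorithm also prunes candidates using confidence bounds, which is omitted here):

```python
from itertools import combinations

def mine_informative(db, min_sup, min_conf):
    items = sorted({i for t in db for i in t})
    rules = []                                          # R := empty set
    candidates = [frozenset([i]) for i in items
                  if sup(db, [i]) >= min_sup]           # frequent 1-itemsets
    while candidates:                                   # new candidates exist
        # generate new candidates one level deeper
        new = {c | {i} for c in candidates for i in items if i not in c}
        # count supports and prune the candidate set
        candidates = [c for c in new if sup(db, c) >= min_sup]
        # include qualified rules in R
        for itemset in candidates:
            for r in range(1, len(itemset)):
                for ant in map(frozenset, combinations(itemset, r)):
                    c = conf(db, ant, itemset - ant)
                    if c >= min_conf:
                        rules.append((ant, itemset - ant, c))
    return informative(rules)                           # return rule set R
```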
17. IR - Conclusions
- The IR makes the same predictions as the AR.
- The IR set is significantly smaller than the AR set when the minimum support is small.
- The IR can be generated efficiently.
- The IR does not make use of transitive reasoning.
18. Part 2: Unbalanced classes
19. Introductory problem illustration
- Consider the following cases.
- Here P denotes a pattern that is observed in two different classes c1 and c2.
20. Introductory problem illustration
- Consider the following cases:
- Example 1: Prob(P|c2) = 0.6, Prob(P|c1) = 0.3, so Prob(P|c2) / Prob(P|c1) = 2
- Example 2: Prob(P|c2) = 0.95, Prob(P|c1) = 0.8, so Prob(P|c2) / Prob(P|c1) ≈ 1.19
- The ratio ranks Example 1 far higher, even though in Example 2 the pattern covers almost all of c2; this motivates the fairness requirement on the next slide.
21. Nature of interestingness metrics
- The metrics should be fair to both large and small classes.
- More generally, they should be fair regardless of the class distribution.
22. Interestingness metrics
- Lift:
  - lift(X→Y) = sup(XY) / (sup(X) · sup(Y))
- Local support (reverse conditional probability):
  - lsup(X→Y) = sup(XY) / sup(Y) = Prob(X|Y)
- Exclusiveness
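Both listed formulas are one-liners on top of the `sup` sketch from slide 5 (illustrative; exclusiveness is not sketched since the slide gives no formula for it):

```python
def lift(db, x, y):
    # Co-occurrence of X and Y relative to independence; symmetric in X and Y.
    return sup(db, set(x) | set(y)) / (sup(db, x) * sup(db, y))

def lsup(db, x, y):
    # Local support Prob(X|Y): support measured inside the consequent's class,
    # so a small class is not penalized for its overall rarity.
    return sup(db, set(x) | set(y)) / sup(db, y)
```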
23. Application example
- Identify groups of patients with a high risk of an adverse reaction to certain drugs.
- Both the patients with adverse drug reactions and the patients taking those drugs are underrepresented in the data.
24. Feature selection
- Interpret the m classes as the independent variables.
- The other variables are then the dependent variables.
- One then has to decide which of the dependent variables have the strongest influence on the independent ones.
25. Feature selection method
- Calculate a statistical measure on the joint distribution of the dependent and independent variables.
- Compare the value for each dependent-independent variable pair to a cut-off value.
26. Feature selection method: χ²
- Bivariate analysis:
  - Calculate the χ² value of the dependent and independent variables.
  - Compare it to the cut-off value for m-1 degrees of freedom at the required p-value.
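A minimal sketch of this step with scipy (the 2×2 contingency table is made-up illustration, not the Queensland data; for a class variable with m levels the table would have m columns, giving the m-1 degrees of freedom above):

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

# Hypothetical contingency table: rows = feature value (yes/no),
# columns = class (e.g. adverse reaction yes/no).
table = np.array([[30, 970],
                  [12, 988]])

stat, p, dof, expected = chi2_contingency(table)
cutoff = chi2.ppf(1 - 0.05, df=dof)      # critical value at p = 0.05
print(stat, cutoff, stat > cutoff)       # select the feature if True
```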
27. Feature selection method: logistic regression
- Fit a regression of the form ln(p / (1-p)) = α + β₁x₁ + β₂x₂ + ... + βₙxₙ
- Use the coefficients βᵢ: compare the odds ratios ORᵢ = e^βᵢ to the cutoff 1.
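A sketch with scikit-learn on synthetic stand-in data (the feature matrix and labels are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(1000, 5)).astype(float)  # 5 binary features
y = rng.integers(0, 2, size=1000)                     # class labels

model = LogisticRegression().fit(X, y)
odds_ratios = np.exp(model.coef_[0])                  # OR_i = e^(beta_i)
# Features whose odds ratio departs from the cutoff 1 are kept.
print(odds_ratios)
```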
28. Data
- The Queensland Linked Data Set covers the period July 1995 to June 1999:
  - de-identified patient-level hospital separation data,
  - Medicare Benefits Scheme data, and
  - Pharmaceutical Benefits Scheme data.
- Initially extracted variables: age, gender, indigenous status, postcode, total number of bed days, and 8 hospital flags; from the PBS, 15 drug flags (number of ACE inhibitor scripts, plus 14 other ATC level-1 drug flags).
29. Results of feature selection
- Selected the 15 most discriminating features, among them:
  - Age
  - Gender
  - Hospital flags
  - Flags for exposure to other drugs
- The selected data consist of 132,000 records.
30. Results of data mining
Rule 1:
- Gender: Female
- Age: 60
- Took genito-urinary system and sex hormone drugs: Yes
- Took antineoplastic and immunomodulating agent drugs: Yes
- Took musculo-skeletal system drugs: Yes
31. Results of data mining
Rule 2:
- Gender: Female
- Had circulatory disease: Yes
- Took systemic hormonal preparation drugs: Yes
- Took musculo-skeletal system drugs: Yes
- Took various other drugs: Yes
32. Results of data mining
Rule 3:
- Gender: Female
- Had circulatory disease: Yes
- Had respiratory disease: Yes
- Took systemic hormonal preparation drugs: Yes
- Took various other drugs: Yes
34. Concluding remarks
- Fair interestingness measures make it possible to find rules for underrepresented classes.
- They help identify key areas of the data that are worth exploring and explaining.
- Using the IR leads to a compact selection of rules.
35. We have reached the end...