Title: Evaluation of Association Patterns
2. Evaluation of Association Patterns
- Association analysis algorithms have the potential to generate a large number of patterns.
- In real commercial databases we could easily end up with thousands or even millions of patterns, many of which might not be interesting.
- It is very important to establish a set of well-accepted criteria for evaluating the quality of association patterns.
- A first set of criteria can be established through statistical arguments.
- A second set of criteria can be established through subjective arguments.
3. Subjective Arguments
- A pattern is considered subjectively uninteresting unless it reveals unexpected information about the data.
- E.g., the rule Butter → Bread isn't interesting, despite having high support and confidence values.
- On the other hand, the rule Diapers → Beer is interesting because the relationship is quite unexpected and may suggest a new cross-selling opportunity for retailers.
- Drawback: Incorporating subjective knowledge into pattern evaluation is a difficult task, because it requires a considerable amount of prior information from the domain experts.
4. Computing Interestingness Measures
- Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

Contingency table for X → Y:

          Y     ¬Y
  X      f11   f10   f1+
  ¬X     f01   f00   f0+
         f+1   f+0    N

Used to define various measures.
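Below is a minimal sketch of how these counts could be tallied from raw transactions; the function name and toy data are illustrative, not from any particular library.

```python
# Tally the 2x2 contingency table for a rule X -> Y.
# Each transaction is a set of items; X and Y are itemsets.
def contingency_table(transactions, X, Y):
    f11 = f10 = f01 = f00 = 0
    for t in transactions:
        has_x, has_y = X.issubset(t), Y.issubset(t)
        if has_x and has_y:
            f11 += 1
        elif has_x:
            f10 += 1
        elif has_y:
            f01 += 1
        else:
            f00 += 1
    return f11, f10, f01, f00  # marginals f1+, f+1, ... follow by addition

transactions = [{"tea", "coffee"}, {"coffee"}, {"tea"}, {"milk"}]
print(contingency_table(transactions, {"tea"}, {"coffee"}))  # (1, 1, 1, 1)
```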
5. Pitfall of Confidence
The pitfall of confidence can be traced to the
fact that the measure ignores the support of the
itemset in the rule consequent.
          Coffee   ¬Coffee
  Tea       150        50    200
  ¬Tea      750       150    900
            900       200   1100
- Consider the association rule Tea → Coffee.
- Confidence = P(Coffee, Tea)/P(Tea) = P(Coffee | Tea) = 150/200 = 0.75 (seems quite high)
- But P(Coffee) = 900/1100 ≈ 0.82.
- Thus, knowing that a person is a tea drinker actually decreases his/her probability of being a coffee drinker from about 82% to 75%!
- Although confidence is high, the rule is misleading.
- In fact, P(Coffee | ¬Tea) = P(Coffee, ¬Tea)/P(¬Tea) = 750/900 ≈ 0.83.
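A quick numeric check of this argument, assuming the counts from the table above (a sketch, not library code):

```python
# Confidence of Tea -> Coffee versus the baseline P(Coffee).
f11, f10, f01, f00 = 150, 50, 750, 150   # counts from the table above
N = f11 + f10 + f01 + f00                # 1100

confidence = f11 / (f11 + f10)           # P(Coffee | Tea)  = 0.75
p_coffee = (f11 + f01) / N               # P(Coffee)        ~ 0.82
p_coffee_no_tea = f01 / (f01 + f00)      # P(Coffee | ~Tea) ~ 0.83
print(confidence, p_coffee, p_coffee_no_tea)
```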
6. Statistical Independence
- Population of 1000 students:
  - 600 students know how to swim (S)
  - 700 students know how to bike (B)
  - 420 students know how to swim and bike (S, B)
- P(S | B) = P(S ∩ B)/P(B) = 0.42/0.7 = 0.6 = P(S)
- P(S ∩ B) = P(S) × P(B) ⇒ statistical independence
- P(S ∩ B) > P(S) × P(B) ⇒ positively correlated
  - i.e., if someone knows how to swim, it is more probable that he/she knows how to bike, and vice versa
- P(S ∩ B) < P(S) × P(B) ⇒ negatively correlated
  - i.e., if someone knows how to swim, it is less probable that he/she knows how to bike, and vice versa
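The same classification in code, using the student counts above (an illustrative sketch):

```python
# Compare P(S and B) against P(S) * P(B) to classify the relationship.
N = 1000
p_s, p_b, p_sb = 600 / N, 700 / N, 420 / N

if abs(p_sb - p_s * p_b) < 1e-9:
    print("statistically independent")   # here: 0.42 == 0.6 * 0.7
elif p_sb > p_s * p_b:
    print("positively correlated")
else:
    print("negatively correlated")
```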
7. Interest Factor
- A measure that takes statistical dependence into account.
- The interest factor compares the frequency of a pattern against a baseline frequency computed under the statistical independence assumption.
- The baseline frequency for a pair of mutually independent variables is
    f11/N = (f1+/N) × (f+1/N)
  or equivalently
    f11 = (f1+ × f+1)/N
8. Interest Equation
- The interest factor of a rule A → B is defined as
    Interest(A, B) = (N × f11)/(f1+ × f+1)
- The fraction f11/N is an estimate of the joint probability P(A, B), while f1+/N and f+1/N are the estimates of P(A) and P(B), respectively.
- If A and B are statistically independent, then P(A ∩ B) = P(A) × P(B), and thus the Interest is 1.
9. Example: Interest

          Coffee   ¬Coffee
  Tea       150        50    200
  ¬Tea      750       150    900
            900       200   1100

Association rule: Tea → Coffee
Interest = (150 × 1100)/(200 × 900) ≈ 0.92 (< 1, therefore Tea and Coffee are negatively correlated)
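The computation as a short sketch, with the counts hard-coded from the table:

```python
# Interest factor for Tea -> Coffee: (N * f11) / (f1+ * f+1).
f11, f1_plus, f_plus_1, N = 150, 200, 900, 1100
interest = (N * f11) / (f1_plus * f_plus_1)
print(round(interest, 2))  # 0.92 < 1 -> negatively correlated
```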
10. Simpson's Paradox
11. Another example
- What's the confidence of the following rules?
  - (rule 1) HDTV = Yes → Exercise machine = Yes
  - (rule 2) HDTV = No → Exercise machine = Yes
- Confidence of rule 1 = 99/180 = 55%
- Confidence of rule 2 = 54/120 = 45%
- So, customers who buy high-definition televisions are more likely to buy exercise machines than those who don't buy high-definition televisions. Right?
- Well, maybe not...
12. Stratification: Simpson's Paradox
- Consider this more detailed breakdown of the same data:

                       Exercise machine = Yes   Total
  College students
    HDTV = Yes                    1                10
    HDTV = No                     4                34
  Working adults
    HDTV = Yes                   98               170
    HDTV = No                    50                86

- What's the confidence of the rules for each stratum?
  - (rule 1) HDTV = Yes → Exercise machine = Yes
  - (rule 2) HDTV = No → Exercise machine = Yes
- College students:
  - Confidence of rule 1 = 1/10 = 10%
  - Confidence of rule 2 = 4/34 ≈ 11.8%
- Working adults:
  - Confidence of rule 1 = 98/170 ≈ 57.6%
  - Confidence of rule 2 = 50/86 ≈ 58.1%
The rules suggest that, within each group, customers who don't buy HDTVs are more likely to buy exercise machines, which contradicts the conclusion reached when the data from the two customer groups are pooled together (the sketch below reproduces both views).
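A sketch that computes the per-stratum and pooled confidences; the dictionary layout is illustrative:

```python
# Confidence of HDTV -> Exercise machine, per stratum and pooled.
strata = {
    "college students": {"HDTV=Yes": (1, 10), "HDTV=No": (4, 34)},
    "working adults": {"HDTV=Yes": (98, 170), "HDTV=No": (50, 86)},
}
pooled = {"HDTV=Yes": [0, 0], "HDTV=No": [0, 0]}
for group, rules in strata.items():
    for antecedent, (yes, total) in rules.items():
        print(f"{group:16} {antecedent}: {yes / total:.1%}")
        pooled[antecedent][0] += yes
        pooled[antecedent][1] += total
for antecedent, (yes, total) in pooled.items():
    print(f"{'pooled':16} {antecedent}: {yes / total:.1%}")
# Each stratum favors HDTV=No, yet the pooled totals favor HDTV=Yes.
```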
13. Importance of Stratification
- The lesson here is that proper stratification is needed to avoid generating spurious patterns resulting from Simpson's paradox. For example:
  - market basket data from a major supermarket chain should be stratified according to store locations, while
  - medical records from various patients should be stratified according to confounding factors such as age and gender.
14. Effect of Support Distribution
- Many real data sets have a skewed support distribution, where most of the items have relatively low to moderate frequencies, but a small number of them have very high frequencies.
15. Skewed distribution
- It is tricky to choose the right support threshold for mining such data sets.
- If we set the threshold too high (e.g., 20%), we may miss many interesting patterns involving the low-support items from G1.
- Such low-support items may correspond to expensive products (such as jewelry) that are seldom bought by customers, but whose patterns are still interesting to retailers.
- Conversely, when the threshold is set too low, there is the risk of generating spurious patterns that relate a high-frequency item such as milk to a low-frequency item such as caviar.
16. Cross-support patterns
- Cross-support patterns are those that relate a high-frequency item such as milk to a low-frequency item such as caviar.
- They are likely to be spurious because their correlations tend to be weak.
- E.g., the confidence of caviar → milk is likely to be high, yet the pattern is spurious, since there is probably no correlation between caviar and milk.
- However, we don't want to use the Interest Factor during the computation of frequent itemsets, because it doesn't have the anti-monotone property; the interest factor is rather used in a post-processing step.
- So, we want to detect cross-support patterns by looking at some anti-monotone property.
17. Cross-support patterns
- Definition:
  - A cross-support pattern is an itemset X = {i1, i2, ..., ik} whose support ratio
      r(X) = min{s(i1), s(i2), ..., s(ik)} / max{s(i1), s(i2), ..., s(ik)}
    is less than a user-specified threshold hc.
- Example: Suppose the support for milk is 70%, the support for sugar is 10%, and the support for caviar is 0.04%. Given hc = 0.01, the frequent itemset {milk, sugar, caviar} is a cross-support pattern, because its support ratio is
      r = min{0.7, 0.1, 0.0004} / max{0.7, 0.1, 0.0004} = 0.0004/0.7 ≈ 0.00057 < 0.01
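The support-ratio test in code, using the example's numbers (a sketch; the function name is illustrative):

```python
# Flag cross-support patterns via the min/max support ratio.
def support_ratio(item_supports):
    return min(item_supports) / max(item_supports)

supports = [0.7, 0.1, 0.0004]   # milk, sugar, caviar
hc = 0.01
r = support_ratio(supports)
print(r, r < hc)                # ~0.00057 True -> cross-support pattern
```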
18. Detecting cross-support patterns
- E.g., assuming hc = 0.3, the itemsets {p, q}, {p, r}, and {p, q, r} are cross-support patterns, because their support ratios, equal to 0.2, are less than the threshold hc.
- We could apply a high support threshold, say 20%, to eliminate the cross-support patterns, but this may come at the expense of discarding other interesting patterns, such as the strongly correlated itemset {q, r}, which has support equal to 16.7%.
19. Detecting cross-support patterns
- Confidence pruning also doesn't help:
  - The confidence of q → p is 80%, even though {p, q} is a cross-support pattern.
  - Meanwhile, the rule q → r also has high confidence, even though {q, r} is not a cross-support pattern.
- This demonstrates the difficulty of using the confidence measure to distinguish between rules extracted from cross-support and non-cross-support patterns.
20. Lowest confidence rule
- Notice that the rule p → q has very low confidence, because most of the transactions that contain p do not contain q.
- This observation suggests that cross-support patterns can be detected by examining the lowest confidence rule that can be extracted from a given itemset.
21. Finding lowest confidence
- Recall the anti-monotone property of confidence:
    conf({i1, i2} → {i3, i4, ..., ik}) ≤ conf({i1, i2, i3} → {i4, ..., ik})
- This property suggests that confidence never increases as we shift items from the left-hand to the right-hand side of an association rule (a numeric check follows below).
- Hence, the lowest confidence rule that can be extracted from a frequent itemset contains only one item on its left-hand side.
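A small numeric check of the property, with hypothetical supports chosen only to satisfy s({p}) ≥ s({p, q}) ≥ s({p, q, r}):

```python
# Moving an item from the antecedent to the consequent cannot raise confidence.
s = {("p",): 0.833, ("p", "q"): 0.133, ("p", "q", "r"): 0.1}  # hypothetical

conf_pq_to_r = s[("p", "q", "r")] / s[("p", "q")]  # conf({p,q} -> {r}) ~ 0.75
conf_p_to_qr = s[("p", "q", "r")] / s[("p",)]      # conf({p} -> {q,r}) ~ 0.12
print(conf_p_to_qr <= conf_pq_to_r)                # True
```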
22. Finding lowest confidence
- Given a frequent itemset {i1, i2, ..., ik}, the rule
    {ij} → {i1, i2, ..., ij-1, ij+1, ..., ik}
  has the lowest confidence if
    s(ij) = max{s(i1), s(i2), ..., s(ik)}
- This follows directly from the definition of confidence as the ratio between the rule's support and the support of the rule antecedent.
23. Finding lowest confidence
- Summarizing, the lowest confidence attainable from a frequent itemset {i1, i2, ..., ik} is
    s({i1, i2, ..., ik}) / max{s(i1), s(i2), ..., s(ik)}
- This is also known as the h-confidence measure or all-confidence measure.
24. h-confidence
- Clearly, cross-support patterns can be eliminated by ensuring that the h-confidence values of the patterns exceed some threshold hc.
- Observe that the measure is also anti-monotone, i.e.,
    h-confidence({i1, i2, ..., ik}) ≤ h-confidence({i1, i2, ..., ik-1})
  and thus can be incorporated directly into the mining algorithm (see the sketch below).
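Finally, a sketch of the h-confidence test itself; the numbers are back-derived from the p, q, r example above (s(p) ≈ 0.833, s(q) = s(r) ≈ 0.167, s({p, q}) ≈ 0.133) and should be read as illustrative:

```python
# h-confidence = itemset support / max single-item support; prune below hc.
def h_confidence(itemset_support, item_supports):
    return itemset_support / max(item_supports)

hc = 0.3
print(h_confidence(0.133, [0.833, 0.167]) >= hc)  # False -> {p, q} pruned
print(h_confidence(0.167, [0.167, 0.167]) >= hc)  # True  -> {q, r} kept
```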