Title: From Association Rules To Causality
1. From Association Rules To Causality
Presenters: Amol Shukla, University of Waterloo
Claude-Guy Quimper, University of Waterloo
2. From Association Rules To Causality
Presentation Outline
- Limitations of Association Rules and the Support-Confidence Framework
- Generalizing Association Rules to Correlations
- Scalable Techniques for Mining Causal Structures
- Applications of Correlation and Causality
- Summary
3. Review: Association Rules Mining
- Itemset $I = \{i_1, \ldots, i_k\}$
- Find all the rules X => Y with min confidence and support
- support, s: probability that a transaction contains $X \cup Y$
- confidence, c: conditional probability that a transaction having X also contains Y, i.e., $P(Y \mid X)$
Let min_support = 50%, min_conf = 50%. Two example association rules are A => C (50%, 66.7%) and C => A (50%, 100%)
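As a minimal sketch of the two measures reviewed above, the following Python computes support and confidence over a small transaction database. The four transactions are an assumption chosen for illustration so that the numbers agree with the two example rules; the slide's own table is not reproduced here, and the function names are hypothetical.

```python
# Hypothetical transaction database (an assumption for illustration only).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(x, y, db):
    """Estimate of P(Y | X): support(X union Y) / support(X)."""
    return support(x | y, db) / support(x, db)

a, c = {"A"}, {"C"}
print(f"A => C: support {support(a | c, transactions):.1%}, "
      f"confidence {confidence(a, c, transactions):.1%}")
print(f"C => A: support {support(c | a, transactions):.1%}, "
      f"confidence {confidence(c, a, transactions):.1%}")
```

Under these assumed transactions both rules clear the 50% thresholds, with the same support but different confidence depending on the direction of the rule.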
4. Limitations of Association Rules using Support-Confidence Framework
- Negative implications or dependencies are ignored
- Consider the adjoining database.
- X and Y are positively related
- X and Z are negatively related
- Yet the support and confidence of X => Z dominate
- Only the presence of items is taken into account
5. Limitations of Association Rules using Support-Confidence Framework
- Another market basket data example
- Buys Tea => Buys Coffee
- (support = 20%, confidence = 80%)
- Is this rule really valid?
- Pr(Buys Coffee) = 90%
- Pr(Buys Coffee | Buys Tea) = 80%
- Negative correlation between buying tea and buying coffee is ignored
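The arithmetic behind this example can be checked with a short sketch. The cell counts below are the per-100-basket figures used in the tea/coffee example of these slides; the variable names, and the ratio of confidence to the prior probability of coffee, are illustrative additions rather than part of the support-confidence framework itself.

```python
# Cell counts per 100 baskets for the tea/coffee example.
counts = {
    ("coffee", "tea"): 20,
    ("no coffee", "tea"): 5,
    ("coffee", "no tea"): 70,
    ("no coffee", "no tea"): 5,
}
n = sum(counts.values())

p_tea = (counts[("coffee", "tea")] + counts[("no coffee", "tea")]) / n
p_coffee = (counts[("coffee", "tea")] + counts[("coffee", "no tea")]) / n
support = counts[("coffee", "tea")] / n   # P(tea and coffee)
confidence = support / p_tea              # P(coffee | tea)
ratio = confidence / p_coffee             # below 1: tea buyers buy coffee less often

print(f"support={support:.2f} confidence={confidence:.2f} "
      f"P(coffee)={p_coffee:.2f} ratio={ratio:.3f}")
```

The rule passes typical support and confidence thresholds, yet the ratio is below 1, which is the negative dependence that the framework does not report.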
7. What is Correlation?
- $P(A)$: probability that event A occurs
- $P(\bar{A})$: probability that event A does not occur
- $P(AB)$: probability that events A and B occur together
- Events A and B are said to be independent if $P(AB) = P(A) \times P(B)$
- Otherwise A and B are dependent
- Events A and B are said to be correlated if any of $AB$, $A\bar{B}$, $\bar{A}B$, $\bar{A}\bar{B}$ are dependent
- A correlation rule is a set of items that are correlated
8. Computing Correlation Rules: Chi-squared Test for Independence
- For an itemset $I = \{i_1, \ldots, i_k\}$, construct a k-dimensional contingency table $R = \{i_1, \bar{i_1}\} \times \cdots \times \{i_k, \bar{i_k}\}$
- We need to test whether each cell $r = (r_1, \ldots, r_k)$ in this table is dependent
- Let $O(r)$ denote the observed value of cell $r$ in this table, and $E(r)$ be its expected value.
- The chi-squared statistic is then computed as
$$\chi^2 = \sum_{r \in R} \frac{(O(r) - E(r))^2}{E(r)}$$
- If $\chi^2 = 0$, the cells are independent. If $\chi^2$ > cut-off value, reject the independence assumption
9. Example: Computing the Chi-squared Statistic
E(Coffee, Tea) = (90 x 25)/100 = 22.5
E(No Coffee, Tea) = (10 x 25)/100 = 2.5
E(Coffee, No Tea) = (90 x 75)/100 = 67.5
E(No Coffee, No Tea) = (10 x 75)/100 = 7.5
$$\chi^2 = \frac{(20-22.5)^2}{22.5} + \frac{(5-2.5)^2}{2.5} + \frac{(70-67.5)^2}{67.5} + \frac{(5-7.5)^2}{7.5} = 0.28 + 2.5 + 0.09 + 0.83 = 3.7$$
Since this value is greater than the cut-off value (2.71 at the 90% significance level), we reject the independence assumption
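The same calculation can be reproduced with a few lines of plain Python. This is a sketch for a 2x2 table only; the function name `chi_squared_2x2` and the row/column layout are assumptions made for the illustration.

```python
def chi_squared_2x2(observed):
    """Chi-squared statistic for a 2x2 table given as [[a, b], [c, d]]."""
    n = sum(sum(row) for row in observed)
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            chi2 += (o - e) ** 2 / e
    return chi2

# Rows: coffee, no coffee. Columns: tea, no tea.
table = [[20, 70], [5, 5]]
chi2 = chi_squared_2x2(table)
print(round(chi2, 2), chi2 > 2.71)
```

The expected counts are derived from the row and column totals, so only the four observed cells need to be supplied.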
10. Determining the Cause of Correlation
- Define a measure of interest for each cell: $I(r) = O(r) / E(r)$
- $I(r) > 1$ indicates positive dependence and $I(r) < 1$ indicates negative dependence
- The farther $I(r)$ is from 1, the more a cell contributes to the $\chi^2$ value, and to the correlation.
Cell Counts
             Tea    No Tea
Coffee       20     70
No Coffee    5      5
Measures of Interest
             Tea               No Tea
Coffee       20/22.5 = 0.89    70/67.5 = 1.04
No Coffee    5/2.5 = 2.00      5/7.5 = 0.67
- Thus, (No Coffee, Tea) contributes the most to the correlation, indicating that buying tea might inhibit buying coffee
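A short sketch of the interest measure, using the observed tea/coffee counts from the chi-squared example in these slides. The dictionary layout and the helper `marginal` are assumptions for illustration.

```python
observed = {
    ("coffee", "tea"): 20, ("coffee", "no tea"): 70,
    ("no coffee", "tea"): 5, ("no coffee", "no tea"): 5,
}
n = sum(observed.values())

def marginal(label):
    """Total count of all cells whose key contains `label`."""
    return sum(v for k, v in observed.items() if label in k)

# I(r) = O(r) / E(r), with E(r) = row total * column total / n
interest = {
    cell: o / (marginal(cell[0]) * marginal(cell[1]) / n)
    for cell, o in observed.items()
}

# The cell whose interest is farthest from 1 contributes most to the correlation.
most_extreme = max(interest, key=lambda cell: abs(interest[cell] - 1))
print(most_extreme, round(interest[most_extreme], 2))
```

Ranking cells by their distance from 1 is one simple way to pick out which combination of presence and absence drives the dependence.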
11. Properties of Correlation
- If a set of items is correlated, all its supersets are also correlated. Thus, correlation is upward-closed
- We can focus on minimal correlated itemsets to reduce our search space
- Support is downward-closed. A set has minimum support only if all its subsets have minimum support
- We can combine correlation with support for an effective pruning strategy
12. Combining Correlation with Support
- The support-confidence framework looks at only the top-left cell in the contingency table. To incorporate negative dependence, we must consider all the cells in the table
- Combine correlation with support by defining CT-support
- Let s be a user-specified min-support threshold. Let p be a user-specified cut-off percentage value
- An itemset I is CT-supported if at least p% of the cells in its contingency table have support not less than s
- An itemset is significant if it is CT-supported and minimally correlated
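The CT-support definition translates into a small predicate. In this sketch the threshold s is treated as an absolute cell count and p as a percentage between 0 and 100; both conventions, and the name `is_ct_supported`, are assumptions for illustration.

```python
def is_ct_supported(cell_counts, s, p):
    """True if at least p percent of the cells have a count of at least s."""
    cells = list(cell_counts)
    supported = sum(1 for c in cells if c >= s)
    # Compare in integers to avoid rounding issues with percentages.
    return 100 * supported >= p * len(cells)

# The four cells of the tea/coffee contingency table.
table = [20, 70, 5, 5]
print(is_ct_supported(table, s=5, p=100))   # every cell has at least 5
print(is_ct_supported(table, s=10, p=75))   # only 2 of 4 cells reach 10
```

Because every cell counts toward the test, an itemset whose items are mostly absent can still be CT-supported, which is what lets negative dependence be considered.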
13. A level-wise algorithm for finding correlation rules
14. Steps performed by the algorithm at level k
1) Start: construct the contingency table for the next itemset at the level
2) Is the itemset CT-supported? If no, move on to the next itemset
3) If yes, is $\chi^2$ greater than the cut-off value? If yes, mark the itemset as significant; if no, add it to the set NOTSIG
4) Done processing all itemsets at level k? If no, return to step 1
5) If yes, generate itemset(s) of size k+1 such that all of their subsets are in NOTSIG
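The level-wise steps can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the presenters' implementation: transactions are sets of items, s is an absolute cell count, p is a percentage, a single cut-off is reused at every level, expected counts come from the item marginals, and all function names are hypothetical.

```python
from itertools import combinations, product

def contingency_table(transactions, itemset):
    """Count transactions for each presence/absence pattern of the itemset."""
    counts = {cell: 0 for cell in product((True, False), repeat=len(itemset))}
    for t in transactions:
        counts[tuple(item in t for item in itemset)] += 1
    return counts

def chi_squared(counts, transactions, itemset):
    """Chi-squared statistic with expected counts under full independence."""
    n = len(transactions)
    probs = [sum(item in t for t in transactions) / n for item in itemset]
    chi2 = 0.0
    for cell, observed in counts.items():
        expected = n
        for present, prob in zip(cell, probs):
            expected *= prob if present else 1 - prob
        if expected > 0:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def ct_supported(counts, s, p):
    """At least p percent of the cells have a count of at least s."""
    return 100 * sum(c >= s for c in counts.values()) >= p * len(counts)

def significant_itemsets(transactions, items, s, p, cutoff, max_size=3):
    sig = []
    ordered = sorted(items)
    k = 2
    candidates = list(combinations(ordered, k))
    while candidates and k <= max_size:
        notsig = set()
        for itemset in candidates:
            counts = contingency_table(transactions, itemset)
            if not ct_supported(counts, s, p):
                continue                      # not CT-supported: skip
            if chi_squared(counts, transactions, itemset) > cutoff:
                sig.append(itemset)           # significant (minimal) itemset
            else:
                notsig.add(itemset)
        k += 1
        # Next level: itemsets of size k whose (k-1)-subsets are all in NOTSIG.
        candidates = [c for c in combinations(ordered, k)
                      if all(sub in notsig for sub in combinations(c, k - 1))]
    return sig

# Tea/coffee data from the chi-squared example, expanded into 100 baskets.
baskets = ([{"coffee", "tea"}] * 20 + [{"tea"}] * 5
           + [{"coffee"}] * 70 + [set()] * 5)
result = significant_itemsets(baskets, {"coffee", "tea"}, s=5, p=100, cutoff=2.71)
print(result)
```

Supersets of a significant itemset are never generated, because candidates at the next level are built only from itemsets in NOTSIG; this is how the upward closure of correlation is used for pruning.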
15. Limitations of Correlation
- Correlation might not be valid for sparse itemsets. At least 80% of the cells in the contingency table must have expected value greater than 5.
- Finding correlation rules is computationally more expensive than finding association rules.
- Correlation only indicates the existence of a relationship. It does not specify the nature of the relationship, i.e., the cause and effect phenomenon is ignored.
- Identifying causality is important for decision-making.
17. Causality
Association Rule: Hot-Dogs => BBQ Sauce (33%, 50%)
Causality Rule: Hamburgers => BBQ Sauce
18. Bayesian Networks
- What is the best topology of a Bayesian network that describes the observed data?
- Problem: Very expensive to compute
19. Simplifying Causal Relationships
- Knowing the existence of a causal relationship is as good as knowing the relationship
20. Causality vs Correlation
- Two correlated variables can have either
21. First Rule of Causality
1) Suppose we have three pairwise dependent variables
2) And two variables become independent when conditioned on the third one
22. First Rule of Causality
Then we have one of the following configurations
23. Second Rule of Causality
1) Suppose we have three variables with these relationships
2) And the two independent variables become dependent when conditioned on the third variable
24. Second Rule of Causality
- Then the two independent variables cause the third variable.
25. Finding Causality
1) Construct a graph where each variable is a vertex
2) Perform a Chi-squared test to determine correlation
3) Add an edge labeled C for each pair found correlated
4) Add an edge labeled U for each pair found uncorrelated
5) For each triplet, check if a causality rule can be applied
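The five steps above can be sketched as follows. This is a simplified illustration under several assumptions: records are dictionaries of boolean variables, a single cut-off of 3.84 (one degree of freedom at the 95% level) decides between C and U, and conditioning on a variable is approximated by testing the pair separately within each value of that variable. All function names are hypothetical.

```python
from itertools import combinations

CUTOFF = 3.84  # chi-squared cut-off, 1 degree of freedom, 95% level

def chi2_pair(rows, a, b):
    """Chi-squared statistic of the 2x2 table for boolean columns a and b."""
    n = len(rows)
    if n == 0:
        return 0.0
    pa = sum(r[a] for r in rows) / n
    pb = sum(r[b] for r in rows) / n
    chi2 = 0.0
    for va in (True, False):
        for vb in (True, False):
            observed = sum(1 for r in rows if r[a] == va and r[b] == vb)
            expected = n * (pa if va else 1 - pa) * (pb if vb else 1 - pb)
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

def label_edges(rows, variables):
    """Steps 1-4: one edge per pair, labeled C (correlated) or U (uncorrelated)."""
    return {frozenset(pair): "C" if chi2_pair(rows, *pair) > CUTOFF else "U"
            for pair in combinations(variables, 2)}

def correlated_given(rows, a, b, c):
    """True if a and b are correlated within at least one value of c."""
    return any(chi2_pair([r for r in rows if r[c] == v], a, b) > CUTOFF
               for v in (True, False))

def causal_triplets(rows, variables):
    """Step 5: report triplets to which one of the two rules applies."""
    labels = label_edges(rows, variables)
    found = []
    for triple in combinations(variables, 3):
        for mid in triple:
            a, b = [v for v in triple if v != mid]
            if labels[frozenset((a, mid))] != "C" or labels[frozenset((b, mid))] != "C":
                continue
            ab = labels[frozenset((a, b))]
            if ab == "U" and correlated_given(rows, a, b, mid):
                found.append(("second rule", a, b, mid))  # a and b cause mid
            elif ab == "C" and not correlated_given(rows, a, b, mid):
                found.append(("first rule", a, b, mid))   # a, b independent given mid
    return found

# Synthetic data: A and B are independent, and C is true when either is true.
rows = [{"A": a, "B": b, "C": a or b}
        for a in (True, False) for b in (True, False)] * 50
print(causal_triplets(rows, ["A", "B", "C"]))
```

On this synthetic data the A-B edge is labeled U while both edges to C are labeled C, so only the second rule applies, with A and B reported as causes of C.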
26. Weaknesses of the Algorithm
- Causality rules do not cover all possible causality relationships
- The $\chi^2$ test with confidence set to 95% is expected to fail 5 times in every 100 tests
- Some pairs of variables might be reported as neither correlated nor uncorrelated
28. Experiments (Census)
- Correlation rules
- Not a native English speaker <=> Not born in the U.S.
- Served in the military <=> Male
- Married <=> more than 40 years old
- Causality Rules
- Male => Moved Last 5 years, Support-Job
- Native-Amer. => 20-40K => House Holder
- Asian, Laborer => < 20K
29. Experiments (Text Data)
- 416 distinct frequent words
- 86320 pairs of words, 10% are correlated
- Correlation rules
- Nelson, Mandela
- area, province
- area, secretary, war
- area, secretary, they
- Causality Rules
- upi, not reuter
- Iraqi, Iraq
- united, states
- prime, minister
30. Beyond Correlation and Causality
- Correlation and causality seem to be stronger mathematical models than confidence and support
- It is possible to apply these concepts where confidence and support were previously applied
31. Association Rules with Constraints
- Correlation can be seen as a monotone constraint
- An algorithm can be obtained by modifying algorithms for mining constrained association rules
33. Conclusion (Good news)
- Correlation and causality are stronger mathematical models to retrieve interesting association rules
- They allow negative implications to be detected
- Causality explains why there is a correlation
34. Conclusion (Bad news)
- Difficult to precisely detect correlation (especially in sparse data cubes)
- Not all causality relationships can be found
- Are the results really better than with support and confidence?
35. Open Problems
- How to discover hidden variables in causality?
- How to resolve bi-directional causality for disambiguation, e.g., prime => minister and minister => prime?
- How do we find causal patterns for more than 3 variables?
36. References
- Papers
- Beyond Market Baskets: Generalizing Association Rules to Correlations - Brin, Motwani, Silverstein, SIGMOD 97
- Scalable Techniques for Mining Causal Structures - Silverstein, Brin, Motwani, Ullman, VLDB 98
- Efficient Mining of Constrained Correlated Sets - Grahne, Lakshmanan, Wang, ICDE 2000
- A Simple Constraint-Based Algorithm for Efficiently Mining Observational Databases for Causal Relationships - Cooper, Data Mining and Knowledge Discovery, vol. 1, 1997
- Textbook
- Causality: Models, Reasoning, and Inference - Judea Pearl, Cambridge University Press, 2000