Title: Data Mining - CSE5230
1. Data Mining - CSE5230
CSE5230/DMS/2002/2
- Market Basket Analysis
- Machine Learning
2. Lecture Outline
- Association Rules
- Usefulness
- Example
- Choosing the right item set
- What is a rule?
- Is the Rule a Useful Predictor?
- Discovering Large Itemsets
- Strengths and Weaknesses
- Machine Learning
- Concept Learning
- Hypothesis Characteristics
- Complexity of Search Space
- Learning as Compression
- Minimum Message Length Principle
- Noise and Redundancy
3. Lecture Objectives
- By the end of this lecture, you should be able to:
  - describe the components of an association rule (AR)
  - indicate why some ARs are more useful than others
  - give an example of why classes and taxonomies are important for association rule discovery
  - explain the factors that determine whether an AR is a useful predictor
  - describe the empirical cycle
  - explain the terms complete and consistent with respect to concept learning
  - describe the characteristics of a useful hypothesis
  - use the kangaroo in the mist metaphor to describe search in machine learning
  - explain the Minimum Message Length principle
4. Association Rules (1)
- Association Rule (AR) discovery is often referred to as Market Basket Analysis (MBA), and is also referred to as Affinity Grouping
- A common example is the discovery of which items are frequently sold together at a supermarket. If this is known, decisions can be made about
  - arranging items on shelves
  - which items should be promoted together
  - which items should not simultaneously be discounted
5. Association Rules (2)
- Example rule: "When a customer buys a shirt (rule body), in 70% of cases (confidence) he or she will also buy a tie (rule head)! We find this happens in 13.5% of all purchases (support)."
6. Usefulness of ARs
- Some rules are useful
  - unknown, unexpected and indicative of some action to take
- Some rules are trivial
  - known by anyone familiar with the business
- Some rules are inexplicable
  - seem to have no explanation and do not suggest a course of action
- "The key to success in business is to know something that nobody else knows" - Aristotle Onassis
7. AR Example: Co-Occurrence Table
- Transactions:
  - Customer 1: orange juice (OJ), cola
  - Customer 2: milk, orange juice, window cleaner
  - Customer 3: orange juice, detergent
  - Customer 4: orange juice, detergent, cola
  - Customer 5: window cleaner, cola
- Co-occurrence counts:

             OJ  Cleaner  Milk  Cola  Detergent
  OJ          4        1     1     2          2
  Cleaner     1        2     1     1          0
  Milk        1        1     1     0          0
  Cola        2        1     0     3          1
  Detergent   2        0     0     1          2
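The following Python sketch (not part of the original slides) rebuilds this co-occurrence matrix from the five transactions above; the abbreviated item names are my own.

```python
from itertools import combinations
from collections import Counter

# Transactions from the example above (slide 7)
transactions = [
    {"OJ", "cola"},
    {"milk", "OJ", "cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "cola"},
    {"cleaner", "cola"},
]

items = sorted({item for t in transactions for item in t})

# Diagonal: number of baskets containing the item;
# off-diagonal: number of baskets containing both items.
pair_counts = Counter()
for basket in transactions:
    for item in basket:
        pair_counts[(item, item)] += 1
    for a, b in combinations(sorted(basket), 2):
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1

# Print the co-occurrence matrix
print("".join(f"{'':>11}") + "".join(f"{i:>11}" for i in items))
for a in items:
    print(f"{a:>11}" + "".join(f"{pair_counts[(a, b)]:>11}" for b in items))
```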
8. The AR Discovery Process
- A co-occurrence cube would show associations in three dimensions - it is hard to visualize more
- We must:
  - choose the right set of items
  - generate rules by deciphering the counts in the co-occurrence matrix
  - overcome the practical limits imposed by many items in large numbers of transactions
9. ARs: Choosing the Right Item Set
- Choosing the right level of detail (the creation of classes and a taxonomy)
- Virtual items may be added to take advantage of information that goes beyond the taxonomy
- Anonymous versus signed transactions
10. ARs: What is a Rule?
- if condition then result
- Note:
  - if (nappies and Thursday) then beer
  - is usually better (in the sense that it is more actionable) than
  - if Thursday then (nappies and beer)
  - because it has just one item in the result. If a 3-way combination is the most common, then consider rules with just 1 item in the result, e.g.
  - if (A and B) then C
  - if (A and C) then B
11. AR: Is the Rule a Useful Predictor? (1)
- Confidence is the ratio of the number of transactions containing all the items in the rule to the number of transactions containing just the items in the condition. Consider: if B and C then A
- If this rule has a confidence of 0.33, it means that when B and C occur in a transaction, there is a 33% chance that A also occurs.
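A minimal sketch of this confidence calculation (not from the slides), reusing the example baskets from slide 7; the rule chosen is illustrative.

```python
def confidence(transactions, condition, result):
    """Confidence of 'if condition then result': the fraction of baskets
    containing the condition items that also contain the result items."""
    condition, result = set(condition), set(result)
    with_condition = [t for t in transactions if condition <= t]
    if not with_condition:
        return 0.0
    with_both = [t for t in with_condition if result <= t]
    return len(with_both) / len(with_condition)

# Baskets from the co-occurrence example (slide 7)
transactions = [
    {"OJ", "cola"},
    {"milk", "OJ", "cleaner"},
    {"OJ", "detergent"},
    {"OJ", "detergent", "cola"},
    {"cleaner", "cola"},
]
# 2 of the 4 baskets containing OJ also contain cola -> 0.5
print(confidence(transactions, {"OJ"}, {"cola"}))
```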
12. AR: Is the Rule a Useful Predictor? (2)
- Consider the following table of probabilities of items and their combinations
13. AR: Is the Rule a Useful Predictor? (3)
- Now consider the following rules
- It is tempting to choose "if B and C then A", because it is the most confident (33%) - but there is a problem

  Rule                 p(condition)   p(condition and result)   confidence
  if A and B then C            0.25                      0.05         0.20
  if A and C then B            0.20                      0.05         0.25
  if B and C then A            0.15                      0.05         0.33
14. AR: Is the Rule a Useful Predictor? (4)
- This rule is actually worse than just saying that A randomly occurs in the transaction - which happens 45% of the time
- A measure called improvement indicates whether the rule predicts the result better than just assuming the result in the first place:

  improvement = p(condition and result) / (p(condition) * p(result))
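A one-function sketch of the improvement measure (not from the slides), checked against the "if B and C then A" figures quoted here:

```python
def improvement(p_condition, p_result, p_condition_and_result):
    """improvement = p(condition and result) / (p(condition) * p(result)).
    Values > 1 mean the rule beats assuming the result at random."""
    return p_condition_and_result / (p_condition * p_result)

# Rule "if B and C then A", using the probabilities from the slides:
# p(B and C) = 0.15, p(A) = 0.45, p(A and B and C) = 0.05
print(round(improvement(0.15, 0.45, 0.05), 2))  # 0.74, as in the table on slide 16
```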
15. AR: Is the Rule a Useful Predictor? (5)
- When improvement > 1, the rule is better at predicting the result than random chance
- The improvement measure is based on whether or not the probability p(condition and result) is higher than it would be if condition and result were statistically independent
- If there is no statistical dependence between condition and result, improvement = 1.
16. AR: Is the Rule a Useful Predictor? (6)
- Consider the improvement for our rules:

  Rule                 support   confidence   improvement
  if A and B then C       0.05         0.20          0.50
  if A and C then B       0.05         0.25          0.59
  if B and C then A       0.05         0.33          0.74
  if A then B             0.25         0.59          1.31

- None of the rules with three items shows any improvement - the best rule in the data actually has only two items: if A then B. A predicts the occurrence of B 1.31 times better than chance.
17. AR: Is the Rule a Useful Predictor? (7)
- When improvement < 1, negating the result produces a better rule. For example, "if B and C then not A" has a confidence of 0.67 and thus an improvement of 0.67/0.55 = 1.22
- Negated rules may not be as useful as the original association rules when it comes to acting on the results
18. AR: Discovering Large Item Sets
- The term frequent itemset means a set S that appears in at least fraction s of the baskets, where s is some chosen constant, typically 0.01 (i.e. 1%).
- DM datasets are usually too large to fit in main memory. When evaluating the running time of AR discovery algorithms we
  - count the number of passes through the data. Since the principal cost is often the time it takes to read data from disk, the number of times we need to read each datum is often the best measure of the running time of the algorithm.
19. AR: Discovering Large Item Sets (2)
- There is a key principle, called monotonicity or the a-priori trick, that helps us find frequent itemsets:
  - If a set of items S is frequent (i.e., appears in at least fraction s of the baskets), then every subset of S is also frequent.
- To find frequent itemsets, we can:
  - 1. Proceed level-wise, finding first the frequent items (sets of size 1), then the frequent pairs, the frequent triples, etc.
    - Level-wise algorithms use one pass per level.
  - 2. Find all maximal frequent itemsets (i.e., sets S such that no proper superset of S is frequent) in one (or few) passes
20. AR: The A-priori Algorithm (1)
- The A-priori algorithm proceeds level-wise.
- 1. Given support threshold s, in the first pass we find the items that appear in at least fraction s of the baskets. This set is called L1, the frequent items.
  - (Presumably there is enough main memory to count occurrences of each item, since a typical store sells no more than 100,000 different items.)
- 2. Pairs of items in L1 become the candidate pairs C2 for the second pass. We hope that the size of C2 is not so large that there is not room for an integer count per candidate pair. The pairs in C2 whose count reaches s are the frequent pairs, L2.
21. AR: The A-priori Algorithm (2)
- 3. The candidate triples, C3, are those sets {A, B, C} such that all of {A, B}, {A, C} and {B, C} are in L2. On the third pass, count the occurrences of triples in C3; those with a count of at least s are the frequent triples, L3.
- 4. Proceed as far as you like (or until the sets become empty). Li is the frequent sets of size i; Ci+1 is the set of sets of size i+1 such that each subset of size i is in Li.
- The A-priori algorithm helps because the number of tuples which must be considered at each level is much smaller than it otherwise would be (see the sketch below).
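A minimal level-wise sketch of the a-priori idea described on slides 20-21 (not from the original slides; function and variable names are my own):

```python
from itertools import combinations
from collections import Counter

def apriori(baskets, s):
    """Level-wise frequent-itemset discovery (a-priori trick).
    baskets: list of sets of items; s: minimum support as a fraction.
    Returns a dict mapping itemset size k to the set of frequent k-itemsets."""
    min_count = s * len(baskets)

    # Pass 1: frequent single items, L1
    item_counts = Counter(item for basket in baskets for item in basket)
    frequent = {1: {frozenset([i]) for i, c in item_counts.items() if c >= min_count}}

    k = 1
    while frequent[k]:
        # Candidate (k+1)-sets: unions of frequent k-sets all of whose
        # k-subsets are frequent (monotonicity prunes the rest)
        candidates = set()
        for a in frequent[k]:
            for b in frequent[k]:
                union = a | b
                if len(union) == k + 1 and all(
                    frozenset(sub) in frequent[k] for sub in combinations(union, k)
                ):
                    candidates.add(union)

        # One pass over the data per level to count the candidates
        counts = Counter()
        for basket in baskets:
            for cand in candidates:
                if cand <= basket:
                    counts[cand] += 1

        frequent[k + 1] = {c for c in candidates if counts[c] >= min_count}
        k += 1

    frequent.pop(k)  # drop the final empty level
    return frequent

# Example with the supermarket baskets from slide 7 and an illustrative s = 0.4:
baskets = [
    {"OJ", "cola"}, {"milk", "OJ", "cleaner"}, {"OJ", "detergent"},
    {"OJ", "detergent", "cola"}, {"cleaner", "cola"},
]
print(apriori(baskets, 0.4))
```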
22. AR: Strengths and Weaknesses
- Strengths
  - Clear, understandable results
  - Supports undirected data mining
  - Works on variable-length data
  - Is simple to understand
- Weaknesses
  - Requires exponentially more computational effort as the problem size grows
  - Suits items in transactions, but not all problems fit this description
  - It can be difficult to determine the right set of items to analyse
  - It does not handle rare items well; simply considering the level of support will exclude these items
23. Machine Learning
- "A general law can never be verified by a finite number of observations. It can, however, be falsified by only one observation." - Karl Popper
- The patterns that machine learning algorithms find can never be definitive theories
- Any results discovered must be tested for statistical relevance
24. The Empirical Cycle
- The empirical cycle: theory leads to predictions, predictions are tested by observation, observations are analysed, and analysis feeds back into theory.
25. Concept Learning (1)
- Example: the concept of a wombat
  - a learning algorithm could consider the characteristics (features) of many animals and be advised in each case whether it is a wombat or not. From this a definition would be deduced.
- The definition is
  - complete if it recognizes all instances of a concept (in this case, a wombat).
  - consistent if it does not classify any negative examples as falling under the concept.
26. Concept Learning (2)
- An incomplete definition is too narrow and would not recognize some wombats.
- An inconsistent definition is too broad and would classify some non-wombats as wombats.
- A bad definition could be both inconsistent and incomplete.
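A small illustration of complete and consistent as defined above; the animal features and the candidate wombat definition are invented for illustration, not taken from the slides.

```python
def is_complete(hypothesis, examples):
    """Complete: the hypothesis accepts every positive example."""
    return all(hypothesis(x) for x, label in examples if label)

def is_consistent(hypothesis, examples):
    """Consistent: the hypothesis accepts no negative example."""
    return not any(hypothesis(x) for x, label in examples if not label)

# Toy labelled animals (True = wombat); feature values are hypothetical
examples = [
    ({"legs": 4, "marsupial": True,  "burrows": True},  True),
    ({"legs": 4, "marsupial": True,  "burrows": False}, False),  # e.g. a koala
    ({"legs": 2, "marsupial": False, "burrows": False}, False),
]

# A candidate definition of "wombat"
hypothesis = lambda animal: animal["marsupial"] and animal["burrows"]

print(is_complete(hypothesis, examples), is_consistent(hypothesis, examples))  # True True
```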
27. Hypothesis Characteristics
- Classification Accuracy
  - 1 in a million wrong is better than 1 in 10 wrong.
- Transparency
  - A person is able to understand the hypothesis generated. It is then much easier to take action.
- Statistical Significance
  - The hypothesis must perform better than the naïve prediction. Imagine a situation where 80% of all animals considered are wombats. A theory that all animals are wombats would be right 80% of the time, but nothing would have been learnt about classifying animals on the basis of their characteristics.
- Information Content
  - We look for a rich hypothesis. The more information contained (while still being transparent), the more understanding is gained and the easier it is to formulate an action plan.
28. Complexity of Search Space
- Machine learning can be considered as a search problem. We wish to find the correct hypothesis from among many.
- If there are only a few hypotheses we could try them all, but if there are an infinite number we need a better strategy.
- If we have a measure of the quality of a hypothesis, we can use that measure to select potentially good hypotheses and, based on the selection, try to improve the theories (hill-climbing search; a small sketch follows this list).
- Consider the metaphor of the kangaroo in the mist (see example on whiteboard).
- This demonstrates that it is important to know the complexity of the search space, and also that some pattern recognition problems are almost impossible to solve.
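A minimal hill-climbing sketch in the spirit of the kangaroo-in-the-mist metaphor; the quality landscape and neighbour function below are invented for illustration.

```python
import random

def hill_climb(initial, quality, neighbours, steps=1000):
    """Greedy hill-climbing: repeatedly move to a neighbouring hypothesis
    if it scores better, otherwise stay put. Like the kangaroo in the mist,
    it only sees nearby ground and can get stuck on a local peak."""
    current = initial
    for _ in range(steps):
        candidate = random.choice(neighbours(current))
        if quality(candidate) > quality(current):
            current = candidate
    return current

# Toy example: find x maximising an invented, bumpy quality function over 0..100
quality = lambda x: -(x - 70) ** 2 + 10 * (x % 7 == 0)
neighbours = lambda x: [max(0, x - 1), min(100, x + 1)]

print(hill_climb(random.randint(0, 100), quality, neighbours))
```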
29. Learning as Compression
- We have learnt something if we have an algorithm that creates a description of the data that is shorter than the original data set
- A knowledge representation is required that is incrementally compressible, together with an algorithm that can achieve that incremental compression
- The file-in could be a relational table, and the file-out a prediction or a suggested clustering (file-in -> algorithm -> file-out)
30. Types of Input Message (File-in)
- Unstructured or random messages
- Highly structured messages with patterns that are easy to find
- Highly structured messages that are difficult to decipher
- Partly structured messages
  - Most data sets considered by data mining are in this class. There are patterns to be found, but the data sets are not highly regular.
31. Minimum Message Length Principle
- The best theory to explain a data set is the one that minimizes the sum of the length, in bits, of the description of the theory, plus the length of the data when encoded using the theory.
- i.e., if regularity is found in a data set and the description of this regularity, together with the description of the exceptions, is still shorter than the original data set, then we have found something of value.
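A toy illustration of this two-part message length idea; the theory lengths and the exception encoding below are assumptions for the sake of the example, not from the slides.

```python
import math

def message_length(theory_bits, data, encode_with_theory):
    """Total two-part message length: bits to describe the theory
    plus bits to describe the data when encoded using that theory."""
    return theory_bits + encode_with_theory(data)

# Illustrative data: mostly '0' bits with a few '1' exceptions
data = "0" * 90 + "1" * 10

# Theory 1: no regularity claimed; every bit is sent literally (1 bit each)
raw = lambda d: len(d)

# Theory 2 (hypothetical): "bits are '0' unless listed as exceptions";
# assume a short theory description plus ~log2(len(d)) bits per exception position
exceptions = lambda d: sum(math.ceil(math.log2(len(d))) for b in d if b == "1")

print(message_length(0, data, raw))         # 100 bits: no compression
print(message_length(8, data, exceptions))  # 8 + 10*7 = 78 bits: the theory is worth keeping
```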
32. Noise and Redundancy
- The distortion or mutation of a message is the number of bits that are corrupted
- Making the message longer by including redundant information can ensure that a message is received correctly even in the presence of noise
- Some pattern recognition algorithms cope well with the presence of noise, others do not
- We could consider a database which lacks integrity to contain a large amount of noise
  - patterns may exist for a small percentage of the data due solely to noise
33. References
- Berry, J.A. and Linoff, G., Data Mining Techniques: For Marketing, Sales, and Customer Support, John Wiley & Sons, Inc., 1997
- Rakesh Agrawal and Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules, in Jorge B. Bocca, Matthias Jarke and Carlo Zaniolo (eds.), VLDB'94, Proceedings of the 20th International Conference on Very Large Data Bases, Santiago de Chile, Chile, pp. 487-499, September 12-15, 1994
- CSE5230 web site links page