Title: COT5230 Data Mining
Monash - Australia's International University
Lecture Outline
- Market Basket Analysis
- Machine Learning - Basic Concepts
Data Mining Tasks 1
- Various taxonomies exist. Berry and Linoff define six tasks:
- Classification
- Estimation
- Prediction
- Affinity Grouping
- Clustering
- Description
Data Mining Tasks 2
- The tasks are also referred to as operations. Cabena et al. define four operations:
- Predictive Modeling
- Database Segmentation
- Link Analysis
- Deviation Detection
Affinity Grouping
- Affinity grouping is also referred to as Market Basket Analysis
- A common example is the discovery of which items are frequently sold together at a supermarket. If this is known, decisions can be made about
- arranging items on shelves
- which items should be promoted together
- which items should not simultaneously be discounted
Market Basket Analysis
- Example rule: "When a customer buys a shirt, in 70% of cases he or she will also buy a tie! We find this happens in 13.5% of all purchases."
- Here "buys a shirt" is the rule body, "will also buy a tie" is the rule head, 70% is the confidence of the rule, and 13.5% is its support.
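- As a concrete illustration, support and confidence can be computed directly from a list of transactions. A minimal sketch in Python (the transactions below are invented; they are not the data behind the 70%/13.5% figures):

    def support(transactions, items):
        """Fraction of all transactions containing every item in `items`."""
        items = set(items)
        return sum(items <= t for t in transactions) / len(transactions)

    def confidence(transactions, body, head):
        """Of the transactions matching the body, the fraction that also match the head."""
        return support(transactions, set(body) | set(head)) / support(transactions, body)

    transactions = [{"shirt", "tie"}, {"shirt", "tie", "belt"}, {"shirt"}, {"belt"}]
    print(support(transactions, {"shirt", "tie"}))        # 0.5
    print(confidence(transactions, {"shirt"}, {"tie"}))   # 2/3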
The Usefulness of Market Basket Analysis
- Some rules are useful: unknown, unexpected, and indicative of some action to take.
- Some rules are trivial: known by anyone familiar with the business.
- Some rules are inexplicable: they seem to have no explanation and do not suggest a course of action.
- "The key to success in business is to know something that nobody else knows" - Aristotle Onassis
Co-Occurrence Table

Customer  Items
1         orange juice (OJ), cola
2         milk, orange juice, window cleaner
3         orange juice, detergent
4         orange juice, detergent, cola
5         window cleaner, cola

            OJ  Cleaner  Milk  Cola  Detergent
OJ           4     1      1     2      2
Cleaner      1     2      1     1      0
Milk         1     1      1     0      0
Cola         2     1      0     3      1
Detergent    2     0      0     1      2
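- The matrix above can be reproduced mechanically. A small sketch using the five transactions listed (item names abbreviated):

    transactions = [
        {"OJ", "cola"},
        {"milk", "OJ", "cleaner"},
        {"OJ", "detergent"},
        {"OJ", "detergent", "cola"},
        {"cleaner", "cola"},
    ]
    items = ["OJ", "cleaner", "milk", "cola", "detergent"]
    counts = {(a, b): 0 for a in items for b in items}
    for t in transactions:
        for a in t:
            for b in t:
                counts[(a, b)] += 1   # diagonal = item frequency, off-diagonal = pair count
    for a in items:
        print(a, [counts[(a, b)] for b in items])   # prints the co-occurrence matrix above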
The Process for Market Basket Analysis
- A co-occurrence cube would show associations in three dimensions - hard to visualize more
- We must
- Choose the right set of items
- Generate rules by deciphering the counts in the co-occurrence matrix
- Overcome the practical limits imposed by many items in large numbers of transactions
Choosing the Right Set of Items
- Choosing the right level of detail (the creation of classes and a taxonomy)
- Virtual items may be added to take advantage of information that goes beyond the taxonomy
- Anonymous versus signed transactions
What is a Rule?
- If condition then result.
- Note: "If nappies and Thursday then beer" is usually better (in the sense that it is more actionable) than "If Thursday then nappies and beer", because it has just one item in the result.
- If a 3-way combination is the most common, then consider rules with just one item in the result, e.g.
- If A and B, then C
- If A and C, then B
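- A tiny sketch of this heuristic: given a frequent 3-way combination, enumerate every rule with a single item in the result:

    itemset = {"A", "B", "C"}              # the most common 3-way combination
    for result in sorted(itemset):
        condition = sorted(itemset - {result})
        print(f"If {' and '.join(condition)}, then {result}")
    # If B and C, then A / If A and C, then B / If A and B, then C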
Is the Rule a Useful Predictor? - 1
- Confidence is the ratio of the number of transactions with all the items in the rule to the number of transactions with just the items in the condition. Consider: if B and C then A
- If this rule has a confidence of 0.33, it means that when B and C occur in a transaction, there is a 33% chance that A also occurs.
Is the Rule a Useful Predictor? - 2
- Consider the following table of probabilities of items and their combinations
Is the Rule a Useful Predictor? - 3
- Now consider the following rules. It is tempting to choose "if B and C then A", because it is the most confident (33%) - but there is a problem
Is the Rule a Useful Predictor? - 4
- This rule is actually worse than just saying that A randomly occurs in the transaction - which happens 45% of the time
- A measure called improvement indicates whether the rule predicts the result better than just assuming the result in the first place:

improvement = p(condition and result) / (p(condition) × p(result))
Is the Rule a Useful Predictor? - 5
- Improvement measures how much better a rule is at predicting a result than just assuming the result in the first place
- When improvement > 1, the rule is better at predicting the result than random chance
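- A quick numeric check of the definition, using the figures quoted above (confidence 0.33 for "if B and C then A", with A occurring in 45% of transactions). Note that the formula reduces to confidence divided by p(result):

    p_a = 0.45        # A occurs in 45% of transactions
    conf = 0.33       # confidence of "if B and C then A"
    improvement = conf / p_a          # = p(cond and result) / (p(cond) * p(result))
    print(round(improvement, 2))      # 0.73 - below 1, so the rule is worse than chance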
Is the Rule a Useful Predictor? - 6
- Consider the improvement for our rules
- None of the rules with three items shows any improvement - the best rule in the data actually has only two items: "if A then B". A predicts the occurrence of B 1.31 times better than chance.
Is the Rule a Useful Predictor? - 7
- When improvement < 1, negating the result produces a better rule. For example, "if B and C then not A" has a confidence of 0.67 and thus an improvement of 0.67/0.55 = 1.22
- Negated rules may not be as useful as the original association rules when it comes to acting on the results
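- Checking the negated rule the same way, with the same figures as before:

    p_a = 0.45
    conf_not_a = 1 - 0.33                   # 0.67: when B and C occur, A is absent 67% of the time
    improvement = conf_not_a / (1 - p_a)    # 0.67 / 0.55
    print(round(improvement, 2))            # 1.22 - better than chance, as stated above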
Strengths and Weaknesses
- Strengths
- Clear, understandable results
- Supports undirected data mining
- Works on variable-length data
- Is simple to understand
- Weaknesses
- Requires exponentially more computational effort as the problem size grows
- Suits items in transactions, but not all problems fit this description
- It can be difficult to determine the right set of items to analyze
- It does not handle rare items well; simply considering the level of support will exclude these items
Machine Learning
- "A general law can never be verified by a finite number of observations. It can, however, be falsified by only one observation." - Karl Popper
- The patterns that machine learning algorithms find can never be definitive theories
- Any results discovered must be tested for statistical relevance
The Empirical Cycle
[Diagram: the empirical cycle - theory, prediction, observation and analysis feeding back into one another]
Concept Learning - 1
- Example: the concept of a wombat
- A learning algorithm could consider many animals and be advised in each case whether it is a wombat or not. From this a definition would be deduced.
- The definition is
- complete if it recognizes all instances of a concept (in this case, a wombat)
- consistent if it does not classify any negative examples as falling under the concept
Concept Learning - 2
- An incomplete definition is too narrow and would not recognize some wombats.
- An inconsistent definition is too broad and would classify some non-wombats as wombats.
- A bad definition could be both inconsistent and incomplete.
Hypothesis Characteristics - 1
- Classification Accuracy
- 1 in a million wrong is better than 1 in 10 wrong.
- Transparency
- A person is able to understand the hypothesis generated. It is then much easier to take action.
Hypothesis Characteristics - 2
- Statistical Significance
- The hypothesis must perform better than the naïve prediction; a baseline check of this kind is sketched after this list. (Imagine if 80% of the animals considered are wombats and the theory is that all animals are wombats - then the theory is right 80% of the time! But nothing has been learnt.)
- Information Content
- We look for a rich hypothesis. The more information contained (while still being transparent), the more understanding is gained and the easier it is to formulate an action plan.
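- A minimal sketch of that naïve baseline, using the 80% wombat example (the sample is invented for illustration):

    labels = ["wombat"] * 80 + ["other"] * 20         # 80% of the animals are wombats
    majority = max(set(labels), key=labels.count)     # naive theory: always predict the majority
    accuracy = labels.count(majority) / len(labels)
    print(majority, accuracy)    # wombat 0.8 - any useful hypothesis must beat this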
Complexity of Search Space
- Machine learning can be considered a search problem: we wish to find the correct hypothesis from among many.
- If there are only a few hypotheses we could try them all, but if there are an infinite number we need a better strategy.
- If we have a measure of the quality of a hypothesis, we can use that measure to select potentially good hypotheses and, based on the selection, try to improve the theories (hill-climbing search, sketched below).
- Consider the metaphor of the kangaroo in the mist.
- This demonstrates that it is important to know the complexity of the search space, and that some pattern recognition problems are almost impossible to solve.
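- Hill-climbing is only named above; as a generic sketch, here it searches bit-string hypotheses against a toy quality measure (all details invented):

    import random

    def hill_climb(score, hypothesis, steps=1000):
        """Keep a random one-bit change whenever it scores strictly better."""
        for _ in range(steps):
            candidate = hypothesis.copy()
            candidate[random.randrange(len(candidate))] ^= 1   # flip one bit
            if score(candidate) > score(hypothesis):
                hypothesis = candidate
        return hypothesis

    target = [1, 0, 1, 1, 0, 0, 1, 0]                  # hidden "correct" hypothesis
    score = lambda h: sum(a == b for a, b in zip(h, target))
    print(hill_climb(score, [0] * 8))                  # usually climbs all the way to target

- On this smooth landscape every accepted step is real progress; the kangaroo-in-the-mist point is that on a rugged or flat landscape the same strategy stalls at local optima.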
Learning as Compression
- We have learnt something if we have an algorithm that creates a description of the data that is shorter than the original data set
- A knowledge representation is required that is incrementally compressible, and an algorithm that can achieve that incremental compression
- The file-in could be a relational table and the file-out a prediction or a suggested clustering
[Diagram: file-in -> algorithm -> file-out]
Types of Input Message (File-in)
- Unstructured or random messages
- Highly structured messages with patterns that are easy to find
- Highly structured messages that are difficult to decipher
- Partly structured messages
- Most data sets considered by data mining are in this class: there are patterns to be found, but the data sets are not highly regular
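- The difference between these classes can be made concrete with a general-purpose compressor standing in for a learning algorithm (zlib here is just a stand-in, not a data mining method):

    import os, zlib

    structured = b"AB" * 500             # highly structured message, easy pattern
    random_msg = os.urandom(1000)        # unstructured/random message
    print(len(zlib.compress(structured)))   # a handful of bytes - the pattern is found
    print(len(zlib.compress(random_msg)))   # roughly 1000 bytes - nothing to exploit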
Minimum Message Length Principle
- The best theory to explain a set of data is the one that minimizes the sum of the length, in bits, of the description of the theory, plus the length of the data when encoded with the help of the theory.
- Put another way, if regularity is found in a data set and the description of this regularity together with the description of the exceptions is still shorter than the original data set, then something of value has been found.
[Diagram: a long bit string (the original data set) next to a shorter pair of strings (the theory, plus the data set coded with the theory)]
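- A toy illustration of the principle (the bit costs are invented): describe a 100-bit sequence either literally, or as the theory "every bit is 0" plus a list of exceptions:

    data = [0] * 100
    for i in (13, 40, 77):           # three exceptional 1-bits
        data[i] = 1

    raw_bits = len(data)             # 100 bits sent literally
    theory_bits = 8                  # assumed fixed cost to state "every bit is 0"
    exception_bits = sum(data) * 7   # ~7 bits to name each position in 0..99
    print(raw_bits, theory_bits + exception_bits)   # 100 vs 29: real regularity found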
Noise and Redundancy
- The distortion or mutation of a message is the number of bits that are corrupted
- Making the message longer by including redundant information can ensure that a message is received correctly even in the presence of noise (see the sketch after this list)
- Some pattern recognition algorithms cope well with the presence of noise; others do not
- We could consider a database which lacks integrity to contain a large amount of noise
- Patterns may exist for a small percentage of the data due solely to noise
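- A minimal sketch of redundancy beating noise: a 3x repetition code decoded by majority vote survives any single corrupted bit:

    def encode(bits):
        return [b for b in bits for _ in range(3)]     # send every bit three times

    def decode(bits):
        triples = [bits[i:i + 3] for i in range(0, len(bits), 3)]
        return [1 if sum(t) >= 2 else 0 for t in triples]   # majority vote per triple

    message = [1, 0, 1, 1]
    sent = encode(message)
    sent[4] ^= 1                       # noise corrupts one transmitted bit
    print(decode(sent) == message)     # True - the redundancy absorbed the error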