Title: Chapter 22: Advanced Querying and Information Retrieval
1. Data Mining (Chapter 24 of the book)
Dr. Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn_at_cs.ucr.edu
Don't bother reading 24.3.7 or 24.3.8.
2. What is Data Mining?
Data Mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data."
[Figure: Data Mining shown at the intersection of data visualization, statistics, artificial intelligence, and databases]
Informally, data mining is the extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
3. Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g., "Young women with annual incomes greater than 50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, prediction functions, or clusters.
- Some manual intervention is usually required:
  - pre-processing of the data, choice of which type of pattern to find, and post-processing to find novel patterns.
4. Applications of Data Mining
- Prediction based on past history
  - Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict whether a customer is likely to switch brand loyalty.
  - Predict whether a customer is likely to respond to junk mail.
  - Predict whether a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
  - Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
5. Applications of Data Mining (Cont.)
- Descriptive patterns
  - Associations
    - Find books that are often bought by the same customers. If a new customer buys one such book, suggest the others too.
    - Other similar applications: camera accessories, clothes, etc.
  - Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well.
    - Detection of clusters remains important in detecting epidemics.
6. Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.
7. Decision Tree
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
8. Construction of Decision Trees
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit-risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general
  - Different branches of the tree could grow to different levels.
  - Different nodes at the same level may use different partitioning attributes.
9. Construction of Decision Trees (Cont.)
- Greedy top-down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node.
  - More on choosing the partitioning attribute/condition shortly.
  - The algorithm is greedy: each choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  - Such a node is a leaf node.
- Otherwise the data at the node is partitioned further by picking an attribute for partitioning data at the node.
10. Best Splits
- Idea: evaluate different attributes and partitioning conditions, and pick the one that best improves the "purity" of the training-set examples.
  - The initial training set has a mixture of instances from different classes and is thus relatively impure.
  - E.g., if degree exactly predicts credit risk, partitioning on degree would result in each child having instances of only one class, i.e., the child nodes would be pure.
- The purity of a set S of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
- The Gini measure of purity is defined as
    Gini(S) = 1 - Σ_{i=1..k} p_i^2
  - When all instances are in a single class, the Gini value is 0, while it reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
11. Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
    entropy(S) = - Σ_{i=1..k} p_i log2(p_i)
- When a set S is split into multiple sets Si, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as
    purity(S1, S2, ..., Sr) = Σ_{i=1..r} (|Si| / |S|) purity(Si)
- The information gain due to a particular split of S into Si, i = 1, 2, ..., r:
    Information-gain(S, {S1, S2, ..., Sr}) = purity(S) - purity(S1, S2, ..., Sr)
12. Best Splits (Cont.)
- Measure of "cost" of a split:
    Information-content(S, {S1, S2, ..., Sr}) = - Σ_{i=1..r} (|Si| / |S|) log2(|Si| / |S|)
- Information-gain ratio = Information-gain(S, {S1, S2, ..., Sr}) / Information-content(S, {S1, S2, ..., Sr})
- The best split for an attribute is the one that gives the maximum information-gain ratio (see the sketch below).
- Continuous-valued attributes
  - can be ordered in a fashion meaningful to classification, e.g., integer and real values.
- Categorical attributes
  - cannot be meaningfully ordered (e.g., country, school/university, item-color, ...).
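To make these definitions concrete, here is a minimal Python sketch (my own illustration, not code from the book; the function names are made up) that evaluates Gini, entropy, the purity of a split, information gain, information content, and the information-gain ratio from per-class instance counts.

    from math import log2

    def gini(class_counts):
        # Gini(S) = 1 - sum_i p_i^2, computed from per-class instance counts
        n = sum(class_counts)
        return 1.0 - sum((c / n) ** 2 for c in class_counts)

    def entropy(class_counts):
        # entropy(S) = - sum_i p_i log2(p_i), treating 0 log 0 as 0
        n = sum(class_counts)
        return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)

    def split_purity(subsets, measure=gini):
        # purity(S1, ..., Sr) = sum_i (|Si| / |S|) * purity(Si)
        total = sum(sum(s) for s in subsets)
        return sum(sum(s) / total * measure(s) for s in subsets)

    def information_gain(parent, subsets, measure=gini):
        return measure(parent) - split_purity(subsets, measure)

    def information_content(subsets):
        # - sum_i (|Si| / |S|) log2(|Si| / |S|)
        total = sum(sum(s) for s in subsets)
        return -sum(sum(s) / total * log2(sum(s) / total) for s in subsets if sum(s) > 0)

    def gain_ratio(parent, subsets, measure=gini):
        return information_gain(parent, subsets, measure) / information_content(subsets)

    # Example: 14 instances (9 in one class, 5 in the other), split three ways by some attribute.
    parent = [9, 5]
    subsets = [[2, 3], [4, 0], [3, 2]]
    print(gain_ratio(parent, subsets, measure=entropy))

Both Gini and entropy are smallest (zero) for a pure node, so a larger gain means the split removes more impurity.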
13. Finding Best Splits
- Categorical attributes
  - Multi-way split: one child for each value
    - may have too many children in some cases.
  - Binary split: try all possible ways to break the values up into two sets, and pick the best.
- Continuous-valued attributes
  - Binary split
    - Sort the values appearing in the instances, and try each as a split point.
      - E.g., if the values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15.
    - Pick the value that gives the best split.
  - Multi-way split: more complicated, see the bibliographic notes.
    - A series of binary splits on the same attribute has roughly the equivalent effect.
14. Decision-Tree Construction Algorithm I

Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
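The pseudocode above maps directly onto a recursive implementation. The following Python sketch is my own illustrative rendering under simplifying assumptions (categorical attributes only, multi-way splits, Gini impurity); δp and δs appear as purity_threshold and min_size, and all names are invented.

    from collections import Counter

    def grow_tree(instances, attributes, target, purity_threshold=0.9, min_size=5):
        # instances: list of dicts; attributes: list of categorical attribute names
        labels = [row[target] for row in instances]
        majority, count = Counter(labels).most_common(1)[0]
        # Stop (leaf node) if the node is pure enough, too small, or no attributes remain.
        if count / len(labels) >= purity_threshold or len(instances) < min_size or not attributes:
            return {"leaf": True, "class": majority}

        def gini(rows):
            counts = Counter(r[target] for r in rows).values()
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts)

        def split_on(attr):
            groups = {}
            for row in instances:
                groups.setdefault(row[attr], []).append(row)
            return groups

        def weighted_impurity(groups):
            return sum(len(g) / len(instances) * gini(g) for g in groups.values())

        # Greedy choice: pick the attribute whose multi-way split is purest, and never revisit it.
        best_attr = min(attributes, key=lambda a: weighted_impurity(split_on(a)))
        remaining = [a for a in attributes if a != best_attr]
        return {"leaf": False, "attribute": best_attr,
                "children": {value: grow_tree(rows, remaining, target, purity_threshold, min_size)
                             for value, rows in split_on(best_attr).items()}}

    rows = [{"degree": "masters", "income": "high", "credit": "excellent"},
            {"degree": "bachelors", "income": "low", "credit": "good"},
            {"degree": "bachelors", "income": "high", "credit": "good"}]
    print(grow_tree(rows, ["degree", "income"], target="credit", min_size=1))

Real implementations also handle continuous attributes (via the binary splits of the previous slide) and prune the tree afterwards, as discussed next.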
15. Decision-Tree Construction Algorithm II
- A variety of algorithms have been developed to
  - reduce CPU cost and/or
  - reduce I/O cost when handling datasets larger than memory,
  - improve the accuracy of classification.
- A decision tree may be overfitted, i.e., overly tuned to the given training set.
  - Pruning of the decision tree may be done on branches that have too few training instances.
  - When a subtree is pruned, an internal node becomes a leaf, and its class is set to the majority class of the instances that map to the node.
- Pruning can be done by using one part of the training set to build the tree, and a second part to test the tree:
  - prune subtrees that increase misclassification on the second part.
16. A visual intuition of the classification problem
Given a database (called a training database) of labeled examples, predict future unlabeled examples.
[Figure: scatter plot of labeled training examples on Shoe Size vs. Blood Sugar axes. What is the class of Homer?]
17. Decision-Tree: a visual intuition
[Figure: decision-tree splits shown on the Shoe Size vs. Blood Sugar scatter plot]
18. [Figure: another example scatter plot, with axes White Cell Count vs. Blood Sugar]
19. Other Types of Classifiers
- Further types of classifiers
  - Neural net classifiers
  - Bayesian classifiers
- Neural net classifiers use the training data to train artificial neural nets.
  - Widely studied in AI; we won't cover them here.
- Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
  where
    p(cj | d) = probability of instance d being in class cj,
    p(d | cj) = probability of generating instance d given class cj,
    p(cj) = probability of occurrence of class cj, and
    p(d) = probability of instance d occurring.
- For more details see Keogh, E. & Pazzani, M. (1999). Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In Uncertainty 99, 7th Int'l Workshop on AI and Statistics, Ft. Lauderdale, FL, pp. 225-230.
20. Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of p(d | cj),
  - precomputation of p(cj);
  - p(d) can be ignored since it is the same for all classes.
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
- Each of the p(di | cj) can be estimated from a histogram on di values for each class cj (a small sketch of this estimate-and-multiply scheme follows this slide).
  - The histogram is computed from the training instances.
- Histograms on multiple attributes are more expensive to compute and store.
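A minimal sketch of a naïve Bayesian classifier for categorical attributes (my own illustration; the data and names are made up): estimate p(cj) and each p(di | cj) by counting over the training instances, then pick the class maximizing p(cj) * p(d1 | cj) * ... * p(dn | cj).

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, target):
        # Class priors p(cj) and per-(class, attribute) value histograms for p(di | cj).
        class_counts = Counter(r[target] for r in rows)
        histograms = defaultdict(Counter)           # (class, attribute) -> Counter over values
        for r in rows:
            for attr, value in r.items():
                if attr != target:
                    histograms[(r[target], attr)][value] += 1
        return class_counts, histograms

    def classify(instance, class_counts, histograms):
        total = sum(class_counts.values())
        best_class, best_score = None, -1.0
        for cj, n_cj in class_counts.items():
            score = n_cj / total                                  # p(cj)
            for attr, value in instance.items():
                score *= histograms[(cj, attr)][value] / n_cj     # p(di | cj); 0 if value unseen
            if score > best_score:
                best_class, best_score = cj, score
        return best_class                                         # p(d) is ignored, as above

    rows = [{"degree": "masters", "income": "high", "credit": "excellent"},
            {"degree": "bachelors", "income": "medium", "credit": "good"},
            {"degree": "bachelors", "income": "low", "credit": "good"}]
    model = train_naive_bayes(rows, "credit")
    print(classify({"degree": "masters", "income": "high"}, *model))

A practical implementation would also smooth the histogram counts so that a single unseen attribute value does not drive the whole product to zero.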
21. Naïve Bayesian Classifiers: Visual Intuition I
[Figure: class-conditional histograms of height (values such as 4'8", 5'8", 6'6") for two classes]
22. Naïve Bayesian Classifiers: Visual Intuition II
p(cj | d) = probability of instance d being in class cj
[Figure: at height 5'8" the class histograms contain 10 males and 2 females]
P(male | 5'8") = 10 / (10 + 2) = 0.833
P(female | 5'8") = 2 / (10 + 2) = 0.166
23. Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways.
  - E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized (see the sketch below).
    - Centroid: the point defined by taking the average of the coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Clustering has been studied extensively in statistics, but on small data sets.
- Data mining systems aim at clustering techniques that can handle very large data sets.
  - E.g., the BIRCH clustering algorithm (more shortly).
24. What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
- Organizing data into classes such that there is
  - high intra-class similarity
  - low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification).
- More informally, finding natural groupings among objects (e.g., east coast cities, west coast cities).
25. What is a natural grouping among these objects?
26. What is a natural grouping among these objects?
Clustering is subjective.
[Figure: the same set of objects grouped either as School Employees vs. Simpson's Family, or as Males vs. Females]
27. What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
28. Similarity Measures
For the moment, assume that we can measure the similarity between any two objects (we will cover this in detail later).
One intuitive example is to measure the distance between two cities and call it the similarity. For example, we have D(LA, San Diego) = 110 and D(LA, New York) = 3,000.
This would allow us to make (subjectively correct) statements like "LA is more similar to San Francisco than it is to New York."
29. Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Figure: a "black box" distance function takes two objects, e.g., Peter and Piotr, and returns a number such as 0.23, 3, or 342.7]
30. [Figure: the Peter/Piotr black box from the previous slide, here shown returning the value 3]
When we peek inside one of these black boxes, we see some function on two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?
For example, the string edit distance can be written recursively as
    d('', '') = 0
    d(s, '') = d('', s) = |s|        -- i.e., the length of s
    d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                             d(s1+ch1, s2) + 1,
                             d(s1, s2+ch2) + 1 )
What properties should a distance measure have?
- D(A,B) = D(B,A)                Symmetry
- D(A,A) = 0                     Constancy of Self-Similarity
- D(A,B) = 0 iff A = B           Positivity (Separation)
- D(A,B) ≤ D(A,C) + D(B,C)       Triangular Inequality
31. Intuitions behind desirable distance measure properties
- D(A,B) = D(B,A) (Symmetry): otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
- D(A,A) = 0 (Constancy of Self-Similarity): otherwise you could claim "Alex looks more like Bob than Bob does."
- D(A,B) = 0 iff A = B (Positivity / Separation): otherwise there are objects in your world that are different, but you cannot tell apart.
- D(A,B) ≤ D(A,C) + D(B,C) (Triangular Inequality): otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
32. Two Types of Clustering
- Partitional algorithms: construct various partitions and then evaluate them by some criterion (we will see an example called BIRCH).
- Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.
[Figure: examples of a partitional clustering and a hierarchical clustering]
33. A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the examples given in the early part of this talk, we will now introduce the dendrogram.
The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
34. Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on the automatic creation of hierarchies in data mining.
[Figure: a fragment of Yahoo's hierarchy: Business & Economy → B2B, Finance, Shopping, Jobs → Aerospace, Agriculture, Banking, Bonds, Animals, Apparel, Career Workspace]
35. (Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
[A hierarchical clustering of species, written as a nested (Newick-style) tree with branch lengths]
36. Desirable Properties of a Clustering Algorithm
- Scalability (in terms of both time and space)
- Ability to deal with different data types
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Incorporation of user-specified constraints
- Interpretability and usability
37. Hierarchical Clustering
Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:
- Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
- Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
The number of dendrograms with n leaves is (2n - 3)! / (2^(n-2) (n - 2)!):

  Number of leaves   Number of possible dendrograms
  2                  1
  3                  3
  4                  15
  5                  105
  ...                ...
  10                 34,459,425
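The closed form above can be checked directly; here is a small sketch (my own, for illustration) that reproduces the table:

    from math import factorial

    def num_dendrograms(n):
        # Number of distinct rooted dendrograms over n labelled leaves: (2n-3)! / (2^(n-2) * (n-2)!)
        if n < 2:
            return 1
        return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

    for n in (2, 3, 4, 5, 10):
        print(n, num_dendrograms(n))     # 1, 3, 15, 105, 34459425

The count grows explosively, which is why heuristic bottom-up or top-down search is used instead of exhaustive enumeration.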
38. We begin with a distance matrix which contains the distances between every pair of objects in our database.
[Figure: an example distance matrix between objects; e.g., one pair of objects has distance 8, another has distance 1]
39. Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
[Figure: first iteration; consider all possible merges and choose the best]

40. Bottom-Up (agglomerative), continued.
[Figure: second iteration; again consider all possible merges and choose the best]

41. Bottom-Up (agglomerative), continued.
[Figure: third iteration; consider all possible merges and choose the best]

42. Bottom-Up (agglomerative), continued.
[Figure: fourth iteration; consider all possible merges and choose the best]
43. We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is non-obvious.
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
A sketch of the bottom-up procedure with these linkage choices follows.
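Here is a compact sketch of the bottom-up procedure with a pluggable linkage function (illustrative only; it assumes a user-supplied dist function on pairs of objects and uses the naïve all-pairs search described above).

    def single_linkage(c1, c2, dist):
        return min(dist(a, b) for a in c1 for b in c2)      # nearest neighbors

    def complete_linkage(c1, c2, dist):
        return max(dist(a, b) for a in c1 for b in c2)      # furthest neighbors

    def group_average(c1, c2, dist):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    def agglomerate(objects, dist, linkage=single_linkage):
        # Start with each object in its own cluster; repeatedly merge the closest pair.
        clusters = [[o] for o in objects]
        merges = []
        while len(clusters) > 1:
            # Consider all possible merges and choose the best (smallest linkage distance).
            i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                       key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]], dist))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]                                  # j > i, so index i is unaffected
        return merges

    print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], dist=lambda a, b: abs(a - b)))

The returned list of merges is exactly the information a dendrogram records, from the leaves up to the root.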
44. Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is subjective.
45. Partitional Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets.
- E.g., the BIRCH algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered.
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance δ away.
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other.
  - At the end of the first pass we get a large number of clusters at the leaves of the R-tree.
    - Merge clusters to reduce the number of clusters.
46. Partitional Clustering Algorithms
We need to specify the number of clusters in advance; here 2 has been chosen.
[Figure: an R-tree whose internal nodes R10, R11, R12 point to leaf nodes R1-R9, which are data nodes containing points]

47. Partitional Clustering Algorithms
[Figure: the same R-tree after merging close clusters (e.g., R1 and R2) to reduce the number of leaf nodes]

48. Partitional Clustering Algorithms
[Figure: the resulting clusters corresponding to R10, R11, and R12]
49. Up to this point we have simply assumed that we can measure similarity. But how do we measure similarity?
[Figure: the Peter/Piotr "distance box" again, returning values such as 0.23, 3, and 342.7]
50. A generic technique for measuring similarity
To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Patty and Selma:
  Change dress color, 1 point
  Change earring shape, 1 point
  Change hair part, 1 point
  D(Patty, Selma) = 3
The distance between Marge and Selma:
  Change dress color, 1 point
  Add earrings, 1 point
  Decrease height, 1 point
  Take up smoking, 1 point
  Lose weight, 1 point
  D(Marge, Selma) = 5
This is called the "edit distance" or the "transformation distance."
51. Edit Distance Example
How similar are the names "Peter" and "Piotr"? Assume the following cost function:
  Substitution: 1 unit
  Insertion: 1 unit
  Deletion: 1 unit
  D(Peter, Piotr) is 3.
It is possible to transform any string Q into string C using only substitution, insertion and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.
  Peter → Piter → Pioter → Piotr
  Substitution (i for e), Insertion (o), Deletion (e)
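The recurrence shown earlier can be evaluated bottom-up with dynamic programming. A short sketch with unit costs, as assumed above (illustrative, not the only way to compute it):

    def edit_distance(q, c):
        # d[i][j] = cheapest way to turn q[:i] into c[:j] with unit-cost operations.
        m, n = len(q), len(c)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                                # delete all of q[:i]
        for j in range(n + 1):
            d[0][j] = j                                # insert all of c[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if q[i - 1] == c[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + cost,  # substitution (or free match)
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[m][n]

    print(edit_distance("Peter", "Piotr"))             # 3

This finds the cheapest transformation in O(|Q| * |C|) time, answering the "how do we find it" question set aside above.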
52. Association Rules (market basket analysis)
- Retail shops are often interested in associations between the different items that people buy.
  - Someone who buys bread is quite likely also to buy milk.
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Association information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread ⇒ milk
  - DB-Concepts, OS-Concepts ⇒ Networks
  - Left hand side: antecedent; right hand side: consequent.
- An association rule must have an associated population; the population consists of a set of instances.
  - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
53. Association Rule Definitions
- Set of items: I = {I1, I2, ..., Im}
- Transactions: D = {t1, t2, ..., tn}, with each tj ⊆ I
- Itemset: {Ii1, Ii2, ..., Iik} ⊆ I
- Support of an itemset: the percentage of transactions which contain that itemset.
- Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.
54. Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
[Table: five example transactions over these items]
Support of {Bread, PeanutButter} is 60%.
55. Association Rule Definitions
- Association Rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅.
- Support of AR (s) X ⇒ Y: the percentage of transactions that contain X ∪ Y.
- Confidence of AR (α) X ⇒ Y: the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
56. Association Rules Example (cont'd)
[Table: support and confidence of example rules over the five transactions]

57. Association Rules Example (cont'd)
- Support of Bread ⇒ PeanutButter: of the 5 transactions, 3 involve both Bread and PeanutButter, so 3/5 = 60%.
- Confidence of Bread ⇒ PeanutButter: of the 4 transactions that involve Bread, 3 of them also involve PeanutButter, so 3/4 = 75%.
The sketch below computes these numbers directly from a transaction list.
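The two numbers can be computed mechanically. The sketch below defines support and confidence over a list of transactions; the five transactions shown are an illustrative set consistent with the counts above, not necessarily the slide's exact table.

    def support(itemset, transactions):
        # Fraction of transactions containing every item in the itemset.
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # support(lhs ∪ rhs) / support(lhs)
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    transactions = [{"Bread", "Jelly", "PeanutButter"},
                    {"Bread", "PeanutButter"},
                    {"Bread", "Milk", "PeanutButter"},
                    {"Beer", "Bread"},
                    {"Beer", "Milk"}]
    print(support({"Bread", "PeanutButter"}, transactions))        # 0.6
    print(confidence({"Bread"}, {"PeanutButter"}, transactions))   # 0.75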
58. Association Rule Problem
- Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence (supplied by the user).
- NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y.
59. Association Rule Algorithm (Basic Idea)
- Find large itemsets.
- Generate rules from the frequent itemsets.
This is the simple naïve algorithm; better algorithms exist.
60. Association Rule Algorithm
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
- Naïve algorithm
  - Consider all possible sets of relevant items.
  - For each set, find its support (i.e., count how many transactions purchase all items in the set).
    - Large itemsets: sets with sufficiently high support.
  - Use large itemsets to generate association rules:
    - From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
      - Support of rule = support(A).
      - Confidence of rule = support(A) / support(A - {b}).
61.
- From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
  - Support of rule = support(A).
  - Confidence of rule = support(A) / support(A - {b}).
Let's say itemset A = {Bread, Butter, Milk}. Then A - {b} ⇒ b for each b ∈ A gives 3 possibilities:
  {Bread, Butter} ⇒ Milk
  {Bread, Milk} ⇒ Butter
  {Butter, Milk} ⇒ Bread
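A sketch of this rule-generation step (illustrative only; the transaction list is made up, and support() is the same helper as in the earlier sketch):

    def support(itemset, transactions):
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def rules_from_itemset(itemset, transactions, min_confidence=0.5):
        # For each b in A, emit the rule (A - {b}) => b if it is confident enough.
        rules = []
        for b in itemset:
            antecedent = set(itemset) - {b}
            conf = support(itemset, transactions) / support(antecedent, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, b, conf))
        return rules

    transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"}, {"Butter", "Milk"}]
    print(rules_from_itemset({"Bread", "Butter", "Milk"}, transactions))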
62. Apriori
- Large Itemset Property:
  - Any subset of a large itemset is large.
- Contrapositive:
  - If an itemset is not large, none of its supersets are large.
63. Large Itemset Property

64. Large Itemset Property
If B is not frequent, then none of the supersets of B can be frequent. If ACD is frequent, then all subsets of ACD (AC, AD, CD) must be frequent.
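The large itemset property is what makes the level-wise Apriori search feasible: candidates of size k are generated only from frequent itemsets of size k - 1, and any candidate with an infrequent subset is pruned before its support is ever counted. A minimal sketch (illustrative, not an optimized implementation; it reuses a small support() helper):

    from itertools import combinations

    def support(itemset, transactions):
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def apriori(transactions, min_support):
        # Level-wise search: size-k candidates are built only from frequent (k-1)-itemsets.
        items = {i for t in transactions for i in t}
        frequent = [{frozenset([i]) for i in items if support({i}, transactions) >= min_support}]
        k = 2
        while frequent[-1]:
            prev = frequent[-1]
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}       # join step
            candidates = {c for c in candidates                                       # prune step
                          if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            frequent.append({c for c in candidates if support(c, transactions) >= min_support})
            k += 1
        return [itemset for level in frequent for itemset in level]

    transactions = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
                    {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
    print(apriori(transactions, min_support=0.6))

Rules are then generated from each frequent itemset exactly as on slide 61.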