Title: Chapter 22: Advanced Querying and Information Retrieval
1. Data Mining (Chapter 24 of the book)
Dr. Eamonn Keogh
Computer Science & Engineering Department
University of California - Riverside
Riverside, CA 92521
eamonn_at_cs.ucr.edu
Don't bother reading 24.3.7 or 24.3.8.
2. What is Data Mining?
Data Mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data."
[Figure: Data Mining shown at the intersection of data visualization, statistics, artificial intelligence, and databases]
Informally, data mining is the extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.
3. Data Mining
- Broadly speaking, data mining is the process of semi-automatically analyzing large databases to find useful patterns.
- Like knowledge discovery in artificial intelligence, data mining discovers statistical rules and patterns.
- It differs from machine learning in that it deals with large volumes of data stored primarily on disk.
- Some types of knowledge discovered from a database can be represented by a set of rules, e.g., "Young women with annual incomes greater than 50,000 are most likely to buy sports cars."
- Other types of knowledge are represented by equations, prediction functions, or clusters.
- Some manual intervention is usually required:
  - pre-processing of the data, choice of which type of pattern to find, and post-processing to find novel patterns.
4. Applications of Data Mining
- Prediction based on past history
  - Predict whether a credit card applicant poses a good credit risk, based on some attributes (income, job type, age, ...) and past history.
  - Predict whether a customer is likely to switch brand loyalty.
  - Predict whether a customer is likely to respond to junk mail.
  - Predict whether a pattern of phone calling card usage is likely to be fraudulent.
- Some examples of prediction mechanisms:
  - Classification: given a training set consisting of items belonging to different classes, and a new item whose class is unknown, predict which class it belongs to.
  - Regression formulae: given a set of parameter-value to function-result mappings for an unknown function, predict the function-result for a new parameter-value.
5. Applications of Data Mining (Cont.)
- Descriptive patterns
  - Associations
    - Find books that are often bought by the same customers. If a new customer buys one such book, suggest the others too.
    - Other similar applications: camera accessories, clothes, etc.
  - Associations may also be used as a first step in detecting causation, e.g., an association between exposure to chemical X and cancer, or between a new medicine and cardiac problems.
  - Clusters
    - E.g., typhoid cases were clustered in an area surrounding a contaminated well.
    - Detection of clusters remains important in detecting epidemics.
6. Classification Rules
- Classification rules help assign new objects to a set of classes. E.g., given a new automobile insurance applicant, should he or she be classified as low risk, medium risk or high risk?
- Classification rules for the above example could use a variety of knowledge, such as the educational level, salary, and age of the applicant:
  - ∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
  - ∀ person P, P.degree = bachelors and (P.income ≥ 25,000 and P.income ≤ 75,000) ⇒ P.credit = good
- Rules are not necessarily exact: there may be some misclassifications.
- Classification rules can be compactly shown as a decision tree.
7. Decision Tree
∀ person P, P.degree = masters and P.income > 75,000 ⇒ P.credit = excellent
8. Construction of Decision Trees
- Training set: a data sample in which the grouping for each tuple is already known.
- Consider the credit-risk example. Suppose degree is chosen to partition the data at the root.
  - Since degree has a small number of possible values, one child is created for each value.
- At each child node of the root, further classification is done if required. Here, partitions are defined by income.
  - Since income is a continuous attribute, some number of intervals are chosen, and one child is created for each interval.
- Different classification algorithms use different ways of choosing which attribute to partition on at each node, and what the intervals, if any, are.
- In general
  - Different branches of the tree could grow to different levels.
  - Different nodes at the same level may use different partitioning attributes.
9. Construction of Decision Trees (Cont.)
- Greedy top-down generation of decision trees.
  - Each internal node of the tree partitions the data into groups based on a partitioning attribute and a partitioning condition for the node.
  - More on choosing the partitioning attribute/condition shortly.
  - The algorithm is greedy: each choice is made once and not revisited as more of the tree is constructed.
- The data at a node is not partitioned further if either
  - all (or most) of the items at the node belong to the same class, or
  - all attributes have been considered, and no further partitioning is possible.
  - Such a node is a leaf node.
- Otherwise the data at the node is partitioned further by picking an attribute for partitioning data at the node.
10. Best Splits
- Idea: evaluate different attributes and partitioning conditions, and pick the one that best improves the "purity" of the training-set examples.
  - The initial training set has a mixture of instances from different classes and is thus relatively impure.
  - E.g., if degree exactly predicts credit risk, partitioning on degree would result in each child having instances of only one class, i.e., the child nodes would be pure.
- The purity of a set S of training instances can be measured quantitatively in several ways.
  - Notation: number of classes = k, number of instances = |S|, fraction of instances in class i = p_i.
- The Gini measure of purity is defined as
    Gini(S) = 1 - Σ_{i=1..k} p_i^2
  - When all instances are in a single class, the Gini value is 0, while it reaches its maximum (of 1 - 1/k) if each class has the same number of instances.
11. Best Splits (Cont.)
- Another measure of purity is the entropy measure, which is defined as
    entropy(S) = - Σ_{i=1..k} p_i log2(p_i)
- When a set S is split into multiple sets Si, i = 1, 2, ..., r, we can measure the purity of the resultant set of sets as
    purity(S1, S2, ..., Sr) = Σ_{i=1..r} (|Si| / |S|) purity(Si)
- The information gain due to a particular split of S into Si, i = 1, 2, ..., r:
    Information-gain(S, {S1, S2, ..., Sr}) = purity(S) - purity(S1, S2, ..., Sr)
12. Best Splits (Cont.)
- Measure of "cost" of a split:
    Information-content(S, {S1, S2, ..., Sr}) = - Σ_{i=1..r} (|Si| / |S|) log2(|Si| / |S|)
- Information-gain ratio = Information-gain(S, {S1, S2, ..., Sr}) / Information-content(S, {S1, S2, ..., Sr})
- The best split for an attribute is the one that gives the maximum information-gain ratio (see the sketch below).
- Continuous-valued attributes
  - can be ordered in a fashion meaningful to classification, e.g., integer and real values.
- Categorical attributes
  - cannot be meaningfully ordered (e.g., country, school/university, item-color, ...).
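To make these definitions concrete, here is a minimal Python sketch (my own illustration, not code from the book; the function names are made up) that evaluates Gini, entropy, the purity of a split, information gain, information content, and the information-gain ratio from per-class instance counts.

    from math import log2

    def gini(class_counts):
        # Gini(S) = 1 - sum_i p_i^2, computed from per-class instance counts
        n = sum(class_counts)
        return 1.0 - sum((c / n) ** 2 for c in class_counts)

    def entropy(class_counts):
        # entropy(S) = - sum_i p_i log2(p_i), treating 0 log 0 as 0
        n = sum(class_counts)
        return -sum((c / n) * log2(c / n) for c in class_counts if c > 0)

    def split_purity(subsets, measure=gini):
        # purity(S1, ..., Sr) = sum_i (|Si| / |S|) * purity(Si)
        total = sum(sum(s) for s in subsets)
        return sum(sum(s) / total * measure(s) for s in subsets)

    def information_gain(parent, subsets, measure=gini):
        return measure(parent) - split_purity(subsets, measure)

    def information_content(subsets):
        # - sum_i (|Si| / |S|) log2(|Si| / |S|)
        total = sum(sum(s) for s in subsets)
        return -sum(sum(s) / total * log2(sum(s) / total) for s in subsets if sum(s) > 0)

    def gain_ratio(parent, subsets, measure=gini):
        return information_gain(parent, subsets, measure) / information_content(subsets)

    # Example: 14 instances (9 in one class, 5 in the other), split three ways by some attribute.
    parent = [9, 5]
    subsets = [[2, 3], [4, 0], [3, 2]]
    print(gain_ratio(parent, subsets, measure=entropy))

Both Gini and entropy are smallest (zero) for a pure node, so a larger gain means the split removes more impurity.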
13. Finding Best Splits
- Categorical attributes
  - Multi-way split: one child for each value
    - may have too many children in some cases.
  - Binary split: try all possible ways to break the values up into two sets, and pick the best.
- Continuous-valued attributes
  - Binary split
    - Sort the values appearing in the instances, and try each as a split point.
      - E.g., if the values are 1, 10, 15, 25, split at ≤ 1, ≤ 10, ≤ 15.
    - Pick the value that gives the best split.
  - Multi-way split: more complicated, see the bibliographic notes.
    - A series of binary splits on the same attribute has roughly the equivalent effect.
14. Decision-Tree Construction Algorithm I

Procedure GrowTree(S)
    Partition(S)

Procedure Partition(S)
    if (purity(S) > δp or |S| < δs) then return
    for each attribute A
        evaluate splits on attribute A
    use the best split found (across all attributes) to partition S into S1, S2, ..., Sr
    for i = 1, 2, ..., r
        Partition(Si)
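The pseudocode above maps directly onto a recursive implementation. The following Python sketch is my own illustrative rendering under simplifying assumptions (categorical attributes only, multi-way splits, Gini impurity); δp and δs appear as purity_threshold and min_size, and all names are invented.

    from collections import Counter

    def grow_tree(instances, attributes, target, purity_threshold=0.9, min_size=5):
        # instances: list of dicts; attributes: list of categorical attribute names
        labels = [row[target] for row in instances]
        majority, count = Counter(labels).most_common(1)[0]
        # Stop (leaf node) if the node is pure enough, too small, or no attributes remain.
        if count / len(labels) >= purity_threshold or len(instances) < min_size or not attributes:
            return {"leaf": True, "class": majority}

        def gini(rows):
            counts = Counter(r[target] for r in rows).values()
            n = sum(counts)
            return 1.0 - sum((c / n) ** 2 for c in counts)

        def split_on(attr):
            groups = {}
            for row in instances:
                groups.setdefault(row[attr], []).append(row)
            return groups

        def weighted_impurity(groups):
            return sum(len(g) / len(instances) * gini(g) for g in groups.values())

        # Greedy choice: pick the attribute whose multi-way split is purest, and never revisit it.
        best_attr = min(attributes, key=lambda a: weighted_impurity(split_on(a)))
        remaining = [a for a in attributes if a != best_attr]
        return {"leaf": False, "attribute": best_attr,
                "children": {value: grow_tree(rows, remaining, target, purity_threshold, min_size)
                             for value, rows in split_on(best_attr).items()}}

    rows = [{"degree": "masters", "income": "high", "credit": "excellent"},
            {"degree": "bachelors", "income": "low", "credit": "good"},
            {"degree": "bachelors", "income": "high", "credit": "good"}]
    print(grow_tree(rows, ["degree", "income"], target="credit", min_size=1))

Real implementations also handle continuous attributes (via the binary splits of the previous slide) and prune the tree afterwards, as discussed next.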
15. Decision-Tree Construction Algorithm II
- A variety of algorithms have been developed to
  - reduce CPU cost and/or
  - reduce I/O cost when handling datasets larger than memory,
  - improve the accuracy of classification.
- A decision tree may be overfitted, i.e., overly tuned to the given training set.
  - Pruning of the decision tree may be done on branches that have too few training instances.
  - When a subtree is pruned, an internal node becomes a leaf, and its class is set to the majority class of the instances that map to the node.
- Pruning can be done by using one part of the training set to build the tree, and a second part to test the tree:
  - prune subtrees that increase misclassification on the second part.
16. A visual intuition of the classification problem
Given a database (called a training database) of labeled examples, predict future unlabeled examples.
[Figure: scatter plot of labeled training examples on Shoe Size vs. Blood Sugar axes. What is the class of Homer?]
17. Decision-Tree: a visual intuition
[Figure: decision-tree splits shown on the Shoe Size vs. Blood Sugar scatter plot]
18. [Figure: another example scatter plot, with axes White Cell Count vs. Blood Sugar]
19. Other Types of Classifiers
- Further types of classifiers
  - Neural net classifiers
  - Bayesian classifiers
- Neural net classifiers use the training data to train artificial neural nets.
  - Widely studied in AI; we won't cover them here.
- Bayesian classifiers use Bayes' theorem, which says
    p(cj | d) = p(d | cj) p(cj) / p(d)
  where
    p(cj | d) = probability of instance d being in class cj,
    p(d | cj) = probability of generating instance d given class cj,
    p(cj) = probability of occurrence of class cj, and
    p(d) = probability of instance d occurring.
- For more details see Keogh, E. & Pazzani, M. (1999). Learning augmented Bayesian classifiers: A comparison of distribution-based and classification-based approaches. In Uncertainty 99, 7th Int'l Workshop on AI and Statistics, Ft. Lauderdale, FL, pp. 225-230.
20. Naïve Bayesian Classifiers
- Bayesian classifiers require
  - computation of p(d | cj),
  - precomputation of p(cj);
  - p(d) can be ignored since it is the same for all classes.
- To simplify the task, naïve Bayesian classifiers assume attributes have independent distributions, and thereby estimate
    p(d | cj) = p(d1 | cj) * p(d2 | cj) * ... * p(dn | cj)
- Each of the p(di | cj) can be estimated from a histogram on di values for each class cj (a small sketch of this estimate-and-multiply scheme follows this slide).
  - The histogram is computed from the training instances.
- Histograms on multiple attributes are more expensive to compute and store.
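A minimal sketch of a naïve Bayesian classifier for categorical attributes (my own illustration; the data and names are made up): estimate p(cj) and each p(di | cj) by counting over the training instances, then pick the class maximizing p(cj) * p(d1 | cj) * ... * p(dn | cj).

    from collections import Counter, defaultdict

    def train_naive_bayes(rows, target):
        # Class priors p(cj) and per-(class, attribute) value histograms for p(di | cj).
        class_counts = Counter(r[target] for r in rows)
        histograms = defaultdict(Counter)           # (class, attribute) -> Counter over values
        for r in rows:
            for attr, value in r.items():
                if attr != target:
                    histograms[(r[target], attr)][value] += 1
        return class_counts, histograms

    def classify(instance, class_counts, histograms):
        total = sum(class_counts.values())
        best_class, best_score = None, -1.0
        for cj, n_cj in class_counts.items():
            score = n_cj / total                                  # p(cj)
            for attr, value in instance.items():
                score *= histograms[(cj, attr)][value] / n_cj     # p(di | cj); 0 if value unseen
            if score > best_score:
                best_class, best_score = cj, score
        return best_class                                         # p(d) is ignored, as above

    rows = [{"degree": "masters", "income": "high", "credit": "excellent"},
            {"degree": "bachelors", "income": "medium", "credit": "good"},
            {"degree": "bachelors", "income": "low", "credit": "good"}]
    model = train_naive_bayes(rows, "credit")
    print(classify({"degree": "masters", "income": "high"}, *model))

A practical implementation would also smooth the histogram counts so that a single unseen attribute value does not drive the whole product to zero.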
21. Naïve Bayesian Classifiers: Visual Intuition I
[Figure: class-conditional histograms of height (values such as 4'8", 5'8", 6'6") for two classes]
22. Naïve Bayesian Classifiers: Visual Intuition II
p(cj | d) = probability of instance d being in class cj
[Figure: at height 5'8" the class histograms contain 10 males and 2 females]
P(male | 5'8") = 10 / (10 + 2) = 0.833
P(female | 5'8") = 2 / (10 + 2) = 0.166
23. Clustering
- Clustering: intuitively, finding clusters of points in the given data such that similar points lie in the same cluster.
- Can be formalized using distance metrics in several ways.
  - E.g., group points into k sets (for a given k) such that the average distance of points from the centroid of their assigned group is minimized (see the sketch below).
    - Centroid: the point defined by taking the average of the coordinates in each dimension.
  - Another metric: minimize the average distance between every pair of points in a cluster.
- Clustering has been studied extensively in statistics, but on small data sets.
- Data mining systems aim at clustering techniques that can handle very large data sets.
  - E.g., the BIRCH clustering algorithm (more shortly).
24. What is Clustering?
Also called unsupervised learning; sometimes called classification by statisticians, sorting by psychologists, and segmentation by people in marketing.
- Organizing data into classes such that there is
  - high intra-class similarity
  - low inter-class similarity
- Finding the class labels and the number of classes directly from the data (in contrast to classification).
- More informally, finding natural groupings among objects (e.g., east coast cities, west coast cities).
25. What is a natural grouping among these objects?
26. What is a natural grouping among these objects?
Clustering is subjective.
[Figure: the same set of objects grouped either as School Employees vs. Simpson's Family, or as Males vs. Females]
27. What is Similarity?
"The quality or state of being similar; likeness; resemblance; as, a similarity of features." (Webster's Dictionary)
Similarity is hard to define, but "we know it when we see it." The real meaning of similarity is a philosophical question. We will take a more pragmatic approach.
28. Similarity Measures
For the moment, assume that we can measure the similarity between any two objects (we will cover this in detail later).
One intuitive example is to measure the distance between two cities and call it the similarity. For example, we have D(LA, San Diego) = 110 and D(LA, New York) = 3,000.
This would allow us to make (subjectively correct) statements like "LA is more similar to San Francisco than it is to New York."
29. Defining Distance Measures
Definition: Let O1 and O2 be two objects from the universe of possible objects. The distance (dissimilarity) between O1 and O2 is a real number denoted by D(O1, O2).
[Figure: a "black box" distance function takes two objects, e.g., Peter and Piotr, and returns a number such as 0.23, 3, or 342.7]
30. [Figure: the Peter/Piotr black box from the previous slide, here shown returning the value 3]
When we peek inside one of these black boxes, we see some function on two variables. These functions might be very simple or very complex. In either case it is natural to ask: what properties should these functions have?
For example, the string edit distance can be written recursively as
    d('', '') = 0
    d(s, '') = d('', s) = |s|        -- i.e., the length of s
    d(s1+ch1, s2+ch2) = min( d(s1, s2) + (if ch1 = ch2 then 0 else 1),
                             d(s1+ch1, s2) + 1,
                             d(s1, s2+ch2) + 1 )
What properties should a distance measure have?
- D(A,B) = D(B,A)                Symmetry
- D(A,A) = 0                     Constancy of Self-Similarity
- D(A,B) = 0 iff A = B           Positivity (Separation)
- D(A,B) ≤ D(A,C) + D(B,C)       Triangular Inequality
31. Intuitions behind desirable distance measure properties
- D(A,B) = D(B,A) (Symmetry): otherwise you could claim "Alex looks like Bob, but Bob looks nothing like Alex."
- D(A,A) = 0 (Constancy of Self-Similarity): otherwise you could claim "Alex looks more like Bob than Bob does."
- D(A,B) = 0 iff A = B (Positivity / Separation): otherwise there are objects in your world that are different, but you cannot tell apart.
- D(A,B) ≤ D(A,C) + D(B,C) (Triangular Inequality): otherwise you could claim "Alex is very like Bob, and Alex is very like Carl, but Bob is very unlike Carl."
32. Two Types of Clustering
- Partitional algorithms: construct various partitions and then evaluate them by some criterion (we will see an example called BIRCH).
- Hierarchical algorithms: create a hierarchical decomposition of the set of objects using some criterion.
[Figure: examples of a partitional clustering and a hierarchical clustering]
33. A Useful Tool for Summarizing Similarity Measurements
In order to better appreciate and evaluate the examples given in the early part of this talk, we will now introduce the dendrogram.
The similarity between two objects in a dendrogram is represented as the height of the lowest internal node they share.
34. Note that hierarchies are commonly used to organize information, for example in a web portal. Yahoo's hierarchy is manually created; we will focus on the automatic creation of hierarchies in data mining.
[Figure: a fragment of Yahoo's hierarchy: Business & Economy → B2B, Finance, Shopping, Jobs → Aerospace, Agriculture, Banking, Bonds, Animals, Apparel, Career Workspace]
35. (Bovine:0.69395,(Gibbon:0.36079,(Orangutan:0.33636,(Gorilla:0.17147,(Chimp:0.19268,Human:0.11927):0.08386):0.06124):0.15057):0.54939)
[A hierarchical clustering of species, written as a nested (Newick-style) tree with branch lengths]
36. Desirable Properties of a Clustering Algorithm
- Scalability (in terms of both time and space)
- Ability to deal with different data types
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Incorporation of user-specified constraints
- Interpretability and usability
37. Hierarchical Clustering
Since we cannot test all possible trees, we will have to heuristically search the space of possible trees. We could do this:
- Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
- Top-Down (divisive): starting with all the data in a single cluster, consider every possible way to divide the cluster into two. Choose the best division and recursively operate on both sides.
The number of dendrograms with n leaves is (2n - 3)! / (2^(n-2) (n - 2)!):

  Number of leaves   Number of possible dendrograms
  2                  1
  3                  3
  4                  15
  5                  105
  ...                ...
  10                 34,459,425
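The closed form above can be checked directly; here is a small sketch (my own, for illustration) that reproduces the table:

    from math import factorial

    def num_dendrograms(n):
        # Number of distinct rooted dendrograms over n labelled leaves: (2n-3)! / (2^(n-2) * (n-2)!)
        if n < 2:
            return 1
        return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

    for n in (2, 3, 4, 5, 10):
        print(n, num_dendrograms(n))     # 1, 3, 15, 105, 34459425

The count grows explosively, which is why heuristic bottom-up or top-down search is used instead of exhaustive enumeration.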
38. We begin with a distance matrix which contains the distances between every pair of objects in our database.
[Figure: an example distance matrix between objects; e.g., one pair of objects has distance 8, another has distance 1]
39. Bottom-Up (agglomerative): starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.
[Figure: first iteration; consider all possible merges and choose the best]

40. Bottom-Up (agglomerative), continued.
[Figure: second iteration; again consider all possible merges and choose the best]

41. Bottom-Up (agglomerative), continued.
[Figure: third iteration; consider all possible merges and choose the best]

42. Bottom-Up (agglomerative), continued.
[Figure: fourth iteration; consider all possible merges and choose the best]
43. We know how to measure the distance between two objects, but defining the distance between an object and a cluster, or between two clusters, is non-obvious.
- Single linkage (nearest neighbor): the distance between two clusters is determined by the distance of the two closest objects (nearest neighbors) in the different clusters.
- Complete linkage (furthest neighbor): the distances between clusters are determined by the greatest distance between any two objects in the different clusters (i.e., by the "furthest neighbors").
- Group average: the distance between two clusters is calculated as the average distance between all pairs of objects in the two different clusters.
A sketch of the bottom-up procedure with these linkage choices follows.
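Here is a compact sketch of the bottom-up procedure with a pluggable linkage function (illustrative only; it assumes a user-supplied dist function on pairs of objects and uses the naïve all-pairs search described above).

    def single_linkage(c1, c2, dist):
        return min(dist(a, b) for a in c1 for b in c2)      # nearest neighbors

    def complete_linkage(c1, c2, dist):
        return max(dist(a, b) for a in c1 for b in c2)      # furthest neighbors

    def group_average(c1, c2, dist):
        return sum(dist(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    def agglomerate(objects, dist, linkage=single_linkage):
        # Start with each object in its own cluster; repeatedly merge the closest pair.
        clusters = [[o] for o in objects]
        merges = []
        while len(clusters) > 1:
            # Consider all possible merges and choose the best (smallest linkage distance).
            i, j = min(((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
                       key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]], dist))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]                                  # j > i, so index i is unaffected
        return merges

    print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], dist=lambda a, b: abs(a - b)))

The returned list of merges is exactly the information a dendrogram records, from the leaves up to the root.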
44. Summary of Hierarchical Clustering Methods
- No need to specify the number of clusters in advance.
- The hierarchical nature maps nicely onto human intuition for some domains.
- They do not scale well: time complexity of at least O(n^2), where n is the total number of objects.
- Like any heuristic search algorithm, local optima are a problem.
- Interpretation of results is subjective.
45. Partitional Clustering Algorithms
- Clustering algorithms have been designed to handle very large datasets.
- E.g., the BIRCH algorithm
  - Main idea: use an in-memory R-tree to store points that are being clustered.
  - Insert points one at a time into the R-tree, merging a new point with an existing cluster if it is less than some distance δ away.
  - If there are more leaf nodes than fit in memory, merge existing clusters that are close to each other.
  - At the end of the first pass we get a large number of clusters at the leaves of the R-tree.
    - Merge clusters to reduce the number of clusters.
46. Partitional Clustering Algorithms
We need to specify the number of clusters in advance; here 2 has been chosen.
[Figure: an R-tree whose internal nodes R10, R11, R12 point to leaf nodes R1-R9, which are data nodes containing points]

47. Partitional Clustering Algorithms
[Figure: the same R-tree after merging close clusters (e.g., R1 and R2) to reduce the number of leaf nodes]

48. Partitional Clustering Algorithms
[Figure: the resulting clusters corresponding to R10, R11, and R12]
49. Up to this point we have simply assumed that we can measure similarity. But how do we measure similarity?
[Figure: the Peter/Piotr "distance box" again, returning values such as 0.23, 3, and 342.7]
50. A generic technique for measuring similarity
To measure the similarity between two objects, transform one of the objects into the other, and measure how much effort it took. The measure of effort becomes the distance measure.
The distance between Patty and Selma:
  Change dress color, 1 point
  Change earring shape, 1 point
  Change hair part, 1 point
  D(Patty, Selma) = 3
The distance between Marge and Selma:
  Change dress color, 1 point
  Add earrings, 1 point
  Decrease height, 1 point
  Take up smoking, 1 point
  Lose weight, 1 point
  D(Marge, Selma) = 5
This is called the "edit distance" or the "transformation distance."
51. Edit Distance Example
How similar are the names "Peter" and "Piotr"? Assume the following cost function:
  Substitution: 1 unit
  Insertion: 1 unit
  Deletion: 1 unit
  D(Peter, Piotr) is 3.
It is possible to transform any string Q into string C using only substitution, insertion and deletion. Assume that each of these operators has a cost associated with it. The similarity between two strings can be defined as the cost of the cheapest transformation from Q to C. Note that for now we have ignored the issue of how we can find this cheapest transformation.
  Peter → Piter → Pioter → Piotr
  Substitution (i for e), Insertion (o), Deletion (e)
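The recurrence shown earlier can be evaluated bottom-up with dynamic programming. A short sketch with unit costs, as assumed above (illustrative, not the only way to compute it):

    def edit_distance(q, c):
        # d[i][j] = cheapest way to turn q[:i] into c[:j] with unit-cost operations.
        m, n = len(q), len(c)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                                # delete all of q[:i]
        for j in range(n + 1):
            d[0][j] = j                                # insert all of c[:j]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if q[i - 1] == c[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + cost,  # substitution (or free match)
                              d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1)         # insertion
        return d[m][n]

    print(edit_distance("Peter", "Piotr"))             # 3

This finds the cheapest transformation in O(|Q| * |C|) time, answering the "how do we find it" question set aside above.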
52. Association Rules (market basket analysis)
- Retail shops are often interested in associations between the different items that people buy.
  - Someone who buys bread is quite likely also to buy milk.
  - A person who bought the book Database System Concepts is quite likely also to buy the book Operating System Concepts.
- Association information can be used in several ways.
  - E.g., when a customer buys a particular book, an online shop may suggest associated books.
- Association rules:
  - bread ⇒ milk
  - DB-Concepts, OS-Concepts ⇒ Networks
  - Left hand side: antecedent; right hand side: consequent.
- An association rule must have an associated population; the population consists of a set of instances.
  - E.g., each transaction (sale) at a shop is an instance, and the set of all transactions is the population.
53. Association Rule Definitions
- Set of items: I = {I1, I2, ..., Im}
- Transactions: D = {t1, t2, ..., tn}, with each tj ⊆ I
- Itemset: {Ii1, Ii2, ..., Iik} ⊆ I
- Support of an itemset: the percentage of transactions which contain that itemset.
- Large (frequent) itemset: an itemset whose number of occurrences is above a threshold.
54. Association Rules Example
I = {Beer, Bread, Jelly, Milk, PeanutButter}
[Table: five example transactions over these items]
Support of {Bread, PeanutButter} is 60%.
55. Association Rule Definitions
- Association Rule (AR): an implication X ⇒ Y where X, Y ⊆ I and X ∩ Y = ∅.
- Support of AR (s) X ⇒ Y: the percentage of transactions that contain X ∪ Y.
- Confidence of AR (α) X ⇒ Y: the ratio of the number of transactions that contain X ∪ Y to the number that contain X.
56. Association Rules Example (cont'd)
[Table: support and confidence of example rules over the five transactions]

57. Association Rules Example (cont'd)
- Support of Bread ⇒ PeanutButter: of the 5 transactions, 3 involve both Bread and PeanutButter, so 3/5 = 60%.
- Confidence of Bread ⇒ PeanutButter: of the 4 transactions that involve Bread, 3 of them also involve PeanutButter, so 3/4 = 75%.
The sketch below computes these numbers directly from a transaction list.
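The two numbers can be computed mechanically. The sketch below defines support and confidence over a list of transactions; the five transactions shown are an illustrative set consistent with the counts above, not necessarily the slide's exact table.

    def support(itemset, transactions):
        # Fraction of transactions containing every item in the itemset.
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(lhs, rhs, transactions):
        # support(lhs ∪ rhs) / support(lhs)
        return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

    transactions = [{"Bread", "Jelly", "PeanutButter"},
                    {"Bread", "PeanutButter"},
                    {"Bread", "Milk", "PeanutButter"},
                    {"Beer", "Bread"},
                    {"Beer", "Milk"}]
    print(support({"Bread", "PeanutButter"}, transactions))        # 0.6
    print(confidence({"Bread"}, {"PeanutButter"}, transactions))   # 0.75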
58. Association Rule Problem
- Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij ∈ I, the Association Rule Problem is to identify all association rules X ⇒ Y with a minimum support and confidence (supplied by the user).
- NOTE: the support of X ⇒ Y is the same as the support of X ∪ Y.
59. Association Rule Algorithm (Basic Idea)
- Find large itemsets.
- Generate rules from the frequent itemsets.
This is the simple naïve algorithm; better algorithms exist.
60. Association Rule Algorithm
- We are generally only interested in association rules with reasonably high support (e.g., support of 2% or greater).
- Naïve algorithm
  - Consider all possible sets of relevant items.
  - For each set, find its support (i.e., count how many transactions purchase all items in the set).
    - Large itemsets: sets with sufficiently high support.
  - Use large itemsets to generate association rules:
    - From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
      - Support of rule = support(A).
      - Confidence of rule = support(A) / support(A - {b}).
61.
- From itemset A, generate the rule A - {b} ⇒ b for each b ∈ A.
  - Support of rule = support(A).
  - Confidence of rule = support(A) / support(A - {b}).
Let's say itemset A = {Bread, Butter, Milk}. Then A - {b} ⇒ b for each b ∈ A gives 3 possibilities:
  {Bread, Butter} ⇒ Milk
  {Bread, Milk} ⇒ Butter
  {Butter, Milk} ⇒ Bread
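A sketch of this rule-generation step (illustrative only; the transaction list is made up, and support() is the same helper as in the earlier sketch):

    def support(itemset, transactions):
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def rules_from_itemset(itemset, transactions, min_confidence=0.5):
        # For each b in A, emit the rule (A - {b}) => b if it is confident enough.
        rules = []
        for b in itemset:
            antecedent = set(itemset) - {b}
            conf = support(itemset, transactions) / support(antecedent, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, b, conf))
        return rules

    transactions = [{"Bread", "Butter", "Milk"}, {"Bread", "Butter"}, {"Butter", "Milk"}]
    print(rules_from_itemset({"Bread", "Butter", "Milk"}, transactions))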
62. Apriori
- Large Itemset Property:
  - Any subset of a large itemset is large.
- Contrapositive:
  - If an itemset is not large, none of its supersets are large.
63. Large Itemset Property

64. Large Itemset Property
If B is not frequent, then none of the supersets of B can be frequent. If ACD is frequent, then all subsets of ACD (AC, AD, CD) must be frequent.
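The large itemset property is what makes the level-wise Apriori search feasible: candidates of size k are generated only from frequent itemsets of size k - 1, and any candidate with an infrequent subset is pruned before its support is ever counted. A minimal sketch (illustrative, not an optimized implementation; it reuses a small support() helper):

    from itertools import combinations

    def support(itemset, transactions):
        itemset = set(itemset)
        return sum(itemset <= t for t in transactions) / len(transactions)

    def apriori(transactions, min_support):
        # Level-wise search: size-k candidates are built only from frequent (k-1)-itemsets.
        items = {i for t in transactions for i in t}
        frequent = [{frozenset([i]) for i in items if support({i}, transactions) >= min_support}]
        k = 2
        while frequent[-1]:
            prev = frequent[-1]
            candidates = {a | b for a in prev for b in prev if len(a | b) == k}       # join step
            candidates = {c for c in candidates                                       # prune step
                          if all(frozenset(s) in prev for s in combinations(c, k - 1))}
            frequent.append({c for c in candidates if support(c, transactions) >= min_support})
            k += 1
        return [itemset for level in frequent for itemset in level]

    transactions = [{"Bread", "Jelly", "PeanutButter"}, {"Bread", "PeanutButter"},
                    {"Bread", "Milk", "PeanutButter"}, {"Beer", "Bread"}, {"Beer", "Milk"}]
    print(apriori(transactions, min_support=0.6))

Rules are then generated from each frequent itemset exactly as on slide 61.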