Title: Techniques of Classification and Clustering
1Techniques of Classification and Clustering
2Problem Description
- Assume
- A1, A2, ..., Ad: (ordered or unordered) domains
- S = A1 x A2 x ... x Ad: a d-dimensional (numerical or non-numerical) space
- Input
- V = {v1, v2, ..., vm}: d-dimensional points, where vi = (vi1, vi2, ..., vid)
- The jth component of vi is drawn from domain Aj
- Output
- G = {g1, g2, ..., gk}: a set of groups of V with labels, where gi is a subset of V
3Classification
- Supervised classification
- Discriminant analysis, or simply Classification
- A collection of labeled (pre-classified) patterns is provided
- Aims to label a newly encountered, yet unlabeled, pattern
- Unsupervised classification
- Clustering
- Aims to group a given collection of unlabeled patterns into meaningful clusters
- Category labels are data driven
4Methods for Classification
- Neural Nets
- Classification functions are obtained by making multiple passes over the training set
- Poor generation efficiency
- Does not handle non-numerical data efficiently
- Decision trees
- If E contains only objects of one group, the decision tree is just a leaf labeled with that group.
- Construct a DT that correctly classifies objects in the training data set.
- Test the DT by classifying the unseen objects in the test data set.
5Decision Trees (Ex Credit Analysis)
- (decision tree figure) Root test: salary < 20000? If no, accept. If yes, test whether education is graduate: if yes, accept; if no, reject.
6Decision Trees
- Pros
- Fast execution time
- Generated rules are easy to interpret by humans
- Scale well for large data sets
- Can handle high dimensional data
- Cons
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts
7Decision Tree Algorithms
- Classifiers from the machine learning community
- ID3: J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1, 1986.
- C4.5: J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
- CART: L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
- Classifiers for large databases
- SLIQ [MAR96]; SPRINT: John Shafer, Rakesh Agrawal, and Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. of VLDB Conf., Bombay, India, September 1996.
- SONAR: Takeshi Fukuda, Yasuhiko Morimoto, and Shinichi Morishita, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., Bombay, India, 1996.
- RainForest: J. Gehrke, R. Ramakrishnan, V. Ganti, RainForest: A Framework for Fast Decision Tree Construction of Large Datasets, Proc. of VLDB Conf., 1998.
- Building phase followed by pruning phase
8Decision Tree Algorithms
- Building phase
- Recursively split nodes using best splitting
attribute for node - Pruning phase
- Smaller imperfect decision tree generally
achieves better accuracy - Prune leaf nodes recursively to prevent
over-fitting
9Preliminaries
- Theoretical Background
- Entropy
- Similarity measures
- Advanced terms
10Information Theory Concepts
- Entropy of a random variable X with probability distribution p(x)
- The Kullback-Leibler (KL) divergence, or relative entropy, between two probability distributions p and q
- Mutual information between random variables X and Y (definitions below)
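The formula images for these three quantities did not survive extraction; the standard definitions, assumed to be what the slide showed, are:

H(X) = -\sum_x p(x) \log p(x)
KL(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}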
11What is Entropy
- S is a sample of the training data set
- Entropy measures the impurity of S
- H(X): the entropy of X
- If H(X) = 0, X takes a single value; as H(X) increases, the values of X become more heterogeneous.
- For the same number of X values,
- High entropy means X is drawn from a (boring) uniform distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
- Low entropy means X is drawn from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable. (A small sketch follows.)
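A minimal Python sketch (not part of the original slides) that estimates H(X) from a sample, illustrating the two extremes described above:

# Empirical entropy of a sample, in bits.
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x), estimated from value frequencies."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

# A single repeated value has entropy 0; a 50/50 split has entropy 1 bit.
print(entropy(["a", "a", "a", "a"]))   # 0.0
print(entropy(["a", "b", "a", "b"]))   # 1.0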
12Entropy-Based Data Segmentation
T. Fukuda, Y. Morimoto, S. Morishita, T.
Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules,
Proc. of VLDB Conf., 1996.
- The attribute has three categories: 40 objects in C1, 30 in C2, 30 in C3 (100 objects in total).
- Two candidate segmentations of the 100 objects: {S1, S2} and {S3, S4}

        Total  C1  C2  C3
  All    100   40  30  30
  S1      60   40  10  10
  S2      40    0  20  20
  S3      60   20  20  20
  S4      40   20  10  10
13Information Theoretic Measure
R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami, An Interval Classifier for Database Mining Applications, Proc. of VLDB Conf., 1992.
- Information gain by branching on Ai
- gain(Ai) = E - Ei
- E: the entropy of an object set containing objects ek of group Gk
- Ei: the expected entropy for the tree with Ai as the root, where Eij is the expected entropy for the subtree of an object set
- Information content of the value of Ai
14Ex
        Total  C1  C2  C3
  All    100   40  30  30
  S1      60   20  20  20
  S2      40   20  10  10
  S3      40   40   0   0
  S4      30    0  30   0
  S5      30    0   0  30

- Split 1 = {S1, S2}, Split 2 = {S3, S4, S5}
- gain1 = E - E1 = 0.015, gain2 = E - E2 = 1.09 (a numerical check follows)
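A short Python check (not from the slides) of the two gains above, using natural-log entropy, which reproduces the reported values up to rounding:

from math import log

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log(c / n) for c in counts if c > 0)

def expected_entropy(subsets):
    n = sum(sum(s) for s in subsets)
    return sum((sum(s) / n) * entropy(s) for s in subsets)

all_counts = [40, 30, 30]
split1 = [[20, 20, 20], [20, 10, 10]]            # S1, S2
split2 = [[40, 0, 0], [0, 30, 0], [0, 0, 30]]    # S3, S4, S5

E = entropy(all_counts)
print(round(E - expected_entropy(split1), 3))    # ~0.014 (gain1, reported as 0.015)
print(round(E - expected_entropy(split2), 2))    # 1.09   (gain2)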
15Distributional Similarity Measures
- Cosine
- Jaccard coefficient
- Dice coefficient
- Overlap coefficient
- L1 distance (City block distance)
- Euclidean distance (L2 distance)
- Hellinger distance
- Information Radius (Jensen-Shannon divergence)
- Skew divergence
- Confusion Probability
- Lin's Similarity Measure
16Similarity Measures
- Minkowski distance
- Euclidean distance
- p = 2
- Manhattan distance
- p = 1
- Mahalanobis distance
- Normalization due to weighting schemes
- Σ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process (a sketch of these distances follows)
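A minimal NumPy sketch (not from the slides) of the distances named above; the sample points and the covariance matrix passed to the Mahalanobis distance are illustrative assumptions:

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, 2))            # Euclidean (p = 2) -> 5.0
print(minkowski(x, y, 1))            # Manhattan (p = 1) -> 7.0
print(mahalanobis(x, y, np.eye(2)))  # identity covariance reduces to Euclidean -> 5.0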
17General form
- I(common(A,B)): information content associated with the statement describing what A and B have in common
- I(description(A,B)): information content associated with the statement describing A and B
- π(s): probability of the statement within the world of the objects in question, i.e., the fraction of objects exhibiting feature s
IT-Sim (A,B)
18Similarity Measures
- The Set/Bag Model: let X and Y be two collections of XML documents
- Jaccard's Coefficient
- Dice's Coefficient
19Similarity Measures
- Cosine-Similarity Measure (CSM)
- The Vector-Space Model Cosine-Similarity Measure
(CSM)
20Query Processing a single cosine
- For every term i and each doc j, store the term frequency tfij.
- Some tradeoffs on whether to store the term count, the term weight, or the weight scaled by idfi.
- At query time, accumulate the component-wise sum (a term-at-a-time sketch follows).
- If you're indexing 5 billion documents (web search), an array of accumulators is infeasible. Ideas?
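A minimal Python sketch (not from the slides) of term-at-a-time accumulation over a toy inverted index; the index contents, weights and document norms are illustrative assumptions:

from collections import defaultdict
from math import sqrt

# inverted index: term -> list of (doc_id, term_weight); doc_norm: doc_id -> vector length
index = {
    "data":   [(1, 0.5), (2, 1.2)],
    "mining": [(1, 0.8), (3, 0.4)],
}
doc_norm = {1: 0.94, 2: 1.2, 3: 0.4}

def cosine_scores(query_weights):
    acc = defaultdict(float)                      # one accumulator per candidate doc
    for term, q_w in query_weights.items():
        for doc_id, d_w in index.get(term, []):
            acc[doc_id] += q_w * d_w              # component-wise sum
    q_norm = sqrt(sum(w * w for w in query_weights.values()))
    return {d: s / (q_norm * doc_norm[d]) for d, s in acc.items()}

print(cosine_scores({"data": 1.0, "mining": 1.0}))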
21Similarity Measures (2)
- The Generalized Cosine-Similarity Measure (GCSM)
Let X and Y be vectors and - where
- Hierarchical Model
- Why only for depth?
222 Dim Similarities
- Cosine Measure
- Hellinger Measure
- Tanimoto Measure
- Clarity Measure
23Advanced Terms
- Conditional Entropy
- Information Gain
24Specific Conditional Entropy
- H(Y | X = v)
- Suppose I'm trying to predict output Y and I have input X
- X = College Major, Y = Likes "Gladiator"
- Let's assume this data reflects the true probabilities
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
- From this data we estimate
- P(LikeG = Yes) = 0.5
- P(Major = Math and LikeG = No) = 0.25
- P(Major = Math) = 0.5
- P(LikeG = Yes | Major = History) = 0
- Note
- H(X) = 1.5, H(Y) = 1
- H(Y | X = Math) = 1
- H(Y | X = History) = 0
- H(Y | X = CS) = 0
25Conditional Entropy
- Definition of Conditional Entropy
- H(Y|X): the average specific conditional entropy of Y
- If you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X?
- Expected number of bits to transmit Y if both sides know the value of X
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
vj        Prob(X = vj)   H(Y | X = vj)
Math      0.5            1
History   0.25           0
CS        0.25           0

H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
26Information Gain
- Definition of Information Gain
- IG(Y|X): I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
- IG(Y|X) = H(Y) - H(Y|X) (a sketch follows the example)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
H(Y) = 1, H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5. Thus, IG(Y|X) = 1 - 0.5 = 0.5
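A minimal Python sketch (not from the slides) computing H(Y|X) and IG(Y|X) on the Major/Gladiator table above:

from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    n = len(xs)
    h = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]   # rows with X = v
        h += (len(ys_v) / n) * entropy(ys_v)           # P(X=v) * H(Y|X=v)
    return h

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(conditional_entropy(X, Y))                 # 0.5
print(entropy(Y) - conditional_entropy(X, Y))    # IG(Y|X) = 0.5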
27Relative Information Gain
- Definition of Relative Information Gain
- RIG(Y|X): I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?
- RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
H(Y) = 1, H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5. Thus, RIG(Y|X) = (1 - 0.5)/1 = 0.5
28What is Information Gain used for?
- Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find
- IG(LongLife | HairColor) = 0.01
- IG(LongLife | Smoker) = 0.2
- IG(LongLife | Gender) = 0.25
- IG(LongLife | LastDigitOfSSN) = 0.00001
- IG tells you how interesting a 2-d contingency table is going to be.
29Clustering
- Given
- Data points and number of desired clusters K
- Group the data points into K clusters
- Data points within clusters are more similar than
across clusters - Sample applications
- Customer segmentation
- Market basket customer analysis
- Attached mailing in direct marketing
- Clustering companies with similar growth
30A Clustering Example
- Four clusters described by attribute-value patterns:
- Income = High, Children = 1, Car = Luxury
- Income = Medium, Children = 2, Car = Truck
- Car = Sedan and Children = 3, Income = Medium
- Income = Low, Children = 0, Car = Compact
31Different ways of representing clusters
32Clustering Methods
- Partitioning
- Given a set of objects and a clustering
criterion, partitional clustering obtains a
partition of the objects into clusters such that
the objects in a cluster are more similar to each
other than to objects in different clusters. - K-means, and K-mediod methods determine K cluster
representatives and assign each object to the
cluster with its representative closest to the
object such that the sum of the distances squared
between the objects and their representatives is
minimized. - Hierarchical
- Nested sequence of partitions.
- Agglomerative starts by placing each object in
its own cluster and then merge these atomic
clusters into larger and larger clusters until
all objects are in a single cluster. - Divisive starts with all objects in cluster and
subdividing into smaller pieces.
33Algorithms
- k-Means
- Fuzzy C-Means Clustering
- Hierarchical Clustering
- Probabilistic Clustering
34Similarity Measures (2)
- Mutual Neighbor Distance (MND)
- MND(xi, xj) = NN(xi, xj) + NN(xj, xi), where NN(xi, xj) is the neighbor number of xj with respect to xi.
- Distance under context
- s(xi, xj) = f(xi, xj, e), where e is the context
35K-Means Clustering Algorithm
- Choose k cluster centers to coincide with k
randomly-chosen patterns - Assign each pattern to its closest cluster
center. - Recompute the cluster centers using the current
cluster memberships. - If a convergence criterion is not met, go to step
2. - Typical convergence criteria
- No (or minimal) reassignment of patterns to new
cluster centers, or minimal decrease in squared
error.
36Objective Function
- k-Means algorithm aims at minimizing the
following objective function (square error
function)
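A minimal NumPy sketch (not from the slides) of the four-step k-means loop described on the previous slide, which iteratively reduces this squared-error objective; the toy data and seed are assumptions:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # step 1: k random patterns
    for _ in range(n_iter):
        # step 2: assign each pattern to its closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 3: recompute centers from current memberships
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):               # step 4: convergence check
            break
        centers = new_centers
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, 2))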
37K-Means Algorithm (Ex)
38Distortion
- Given a clustering φ, we denote by φ(x) the centroid this clustering associates with an arbitrary point x. A measure of quality for φ:
- Distortion_φ = Σ_x d²(x, φ(x)) / R
- where R is the total number of points and x ranges over all input points.
- Improvement: penalize model size, e.g.
- Distortion + (#parameters)·log R
- Distortion + m·k·log R
39Remarks
- How to initialize the means is a key problem. One popular way to start is to randomly choose k of the samples.
- The results produced depend on the initial values for the means.
- It can happen that the set of samples closest to mi is empty, so that mi cannot be updated.
- The results depend on the metric used to measure distances.
40Related Work Clustering
- Graph-based clustering
- For an XML document collection C, the s-Graph sg(C) = (N, E) is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document(s) in C (b can be an element or an attribute).
- For two sets, C1 and C2, of XML documents, the distance between them is defined in terms of their s-Graphs, where |sg(Ci)| is the number of edges in sg(Ci).
41Fuzzy C-Means Clustering
- FCM is a method of clustering which allows one
piece of data to belong to two or more clusters. - Fuzzy partitioning is carried out through an
iterative optimization of the objective function
shown above, with the update of membership u and
the cluster center c by
42Membership
- The iteration stops when max_ij |u_ij^(k+1) - u_ij^(k)| < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of Jm. (An update-step sketch follows the properties below.)
43Fuzzy Clustering
- Properties
- uij ∈ [0,1] for all i, j
- the memberships of each data point sum to 1 over the clusters
- the total membership of each cluster lies strictly between 0 and N
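A minimal NumPy sketch (not from the slides) of one pass of the fuzzy c-means updates for the memberships u and the centers c, following the usual FCM update equations with fuzzifier m > 1; the toy data are assumptions:

import numpy as np

def fcm_step(X, centers, m=2.0, eps=1e-9):
    # distances from every point to every center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    # membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
    # center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    w = u ** m
    new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, new_centers

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u, centers = fcm_step(X, centers)
print(np.round(u, 3))   # each row (data point) sums to 1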
44Speculations
- Correlation between m and ε
- More iterations k are needed for smaller ε.
45Hierarchical Clustering
- Basic Process
- Start by assigning each item to its own cluster, giving N clusters for N items. (Let the distances between the clusters equal the distances between the items they contain.)
- Find the closest (most similar) pair of clusters and merge them into a single cluster.
- Compute distances between the new cluster and each of the old clusters.
- Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (A SciPy sketch follows.)
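A minimal sketch (not from the slides) using SciPy, whose linkage() routine implements exactly the merge loop above; the toy points and the cut into 3 clusters are assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
Z = linkage(X, method="single")                    # single-link merges (see next slides)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)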
46Hierarchical Clustering (Ex)
dendrogram
47Hierarchical Clustering Algorithms
- Single-linkage clustering
- The distance between two clusters is the minimum
of the distances between all pairs of patterns
drawn from the two clusters (one pattern from the
first cluster, the other from the second). - Complete-linkage clustering
- The distance between two clusters is the maximum
of the distances between all pairs of patterns
drawn from the two clusters - Average-linkage clustering
- Minimum-variance algorithm
48Single-/Complete-Link Clustering
49Single Linkage Hierarchical Cluster
- Steps
- Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
- Find the least dissimilar pair of clusters in the current clustering, d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
- Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set L(m) = d[(r),(s)].
- Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
- If all objects are in one cluster, stop. Else go to step 2.
50Ex Single-Linkage
51Agglomerative Hierarchical Clustering
ALGORITHM Agglomerative Hierarchical Clustering
INPUT: bit-vectors B in bitmap index BI
OUTPUT: a tree T
METHOD:
(1) Place each bit-vector Bi in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = {B1, B2, ..., Bn}.
(2) Compute a merging cost function between every pair of elements in L to find the two closest clusters Bi, Bj, which will be the cheapest pair to merge.
(3) Remove Bi and Bj from L.
(4) Merge Bi and Bj to create a new internal node Bij in T, which will be the parent of Bi and Bj in the result tree.
(5) Repeat from (2) until there is only one cluster remaining.
52Graph-Theoretic Clustering
- Construct the minimal spanning tree (MST)
- Delete the MST edges with the largest lengths
- (figure: a minimal spanning tree over points A-G in the (x1, x2) plane, with edge lengths ranging from 0.5 to 6.5; deleting the longest edge splits the points into two clusters)
53Improving k-Means
D. Pelleg and A. Moore, Accelerating Exact
k-means Algorithms with Geometric Reasoning, ACM
Proceedings of Conf. on Knowledge and Data
Mining, 1999.
- Definitions
- Center of clusters ↔ (Th. 2) center of rectangle owner(h)
- c1 dominates c2 w.r.t. h if h lies entirely on c1's side of the hyperplane bisecting c1 and c2 (pg. 7, 9)
- Update Centroid
- If one center c dominates all other centers w.r.t. h (so c = owner(h), pg. 10), assign h to owner(h); otherwise split h
- (blacklisting version) if c1 dominates c2 w.r.t. h, then c1 dominates c2 w.r.t. any h' contained in h (pg. 11)
54Clustering Categorical Data ROCK
- S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999 - Use links to measure similarity/proximity
- Not distance based
- Computational complexity
- Basic ideas
- Similarity function and neighbors
- Let T1 = {1, 2, 3}, T2 = {3, 4, 5}; then sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = 1/5 = 0.2
55Using Jaccard Coefficient
- According to the Jaccard coefficient, the distance between {1,2,3} and {1,2,6} is the same as the one between {1,2,3} and {1,2,4}, although the former pair comes from two different clusters.

CLUSTER 1 (from <1,2,3,4,5>): {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
CLUSTER 2 (from <1,2,6,7>): {1,2,6} {1,2,7} {1,6,7} {2,6,7}
56ROCK
- Motivation for LINK: the main problem with point-to-point similarity is that only local properties involving the two points themselves are considered
- Neighbor: if two points are similar enough to each other, they are neighbors
- Link: the link count for a pair of points is the number of common neighbors.
57Rock Algorithm
S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999
- Links: the number of common neighbors of the two points.
- Algorithm
- Draw a random sample
- Cluster with links
- Label the data on disk
- Ex: among the transactions {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}, the pair {1,2,3} and {1,2,4} has link value 3.
58Rock Algorithm
S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999
- Criterion function: maximize the total number of links within the k clusters
- Ci denotes cluster i of size ni. (A Python sketch of neighbors and links follows.)

For the similarity threshold 0.5:
link({1,2,6}, {1,2,7}) = 4
link({1,2,6}, {1,2,3}) = 3
link({1,6,7}, {1,2,3}) = 2
link({1,2,3}, {1,4,5}) = 3

Cluster 1: {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
Cluster 2: {1,2,6} {1,2,7} {1,6,7} {2,6,7}
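A minimal Python sketch (not from the paper) of neighbors and links over the example transactions, using the Jaccard coefficient with threshold 0.5; the exact counts depend on conventions such as whether a point counts as its own neighbor, so they may differ slightly from the values quoted on the slide:

from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(points, theta=0.5):
    nbrs = {p: {q for q in points if q != p and jaccard(set(p), set(q)) >= theta}
            for p in points}
    return {(p, q): len(nbrs[p] & nbrs[q]) for p, q in combinations(points, 2)}

cluster1 = [frozenset(s) for s in ([1,2,3],[1,4,5],[1,2,4],[2,3,4],[1,2,5],
                                   [2,3,5],[1,3,4],[2,4,5],[1,3,5],[3,4,5])]
cluster2 = [frozenset(s) for s in ([1,2,6],[1,2,7],[1,6,7],[2,6,7])]
L = links(cluster1 + cluster2)
print(L[(cluster1[0], cluster1[1])])   # a pair inside cluster 1
print(L[(cluster1[0], cluster2[0])])   # a pair spanning the two clusters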
59More on Hierarchical Clustering Methods
- Major weakness of agglomerative clustering
methods - do not scale well time complexity of at least
O(n2), where n is the number of total objects - can never undo what was done previously
- Integration of hierarchical with distance-based
clustering - BIRCH (1996) uses CF-tree and incrementally
adjusts the quality of sub-clusters - CURE (1998) selects well-scattered points from
the cluster and then shrinks them towards the
center of the cluster by a specified fraction
60BIRCH
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Pre-cluster data points using CF-tree
- For each point
- CF-tree is traversed to find the closest cluster
- If the threshold criterion is satisfied, the
point is absorbed into the cluster - Otherwise, it forms a new cluster
- Requires only single scan of data
- Cluster summaries stored in CF-tree are given to
main memory hierarchical clustering algorithm
61Initialization of BIRCH
- The CF of a cluster of n d-dimensional vectors V1, ..., Vn is defined as (n, LS, SS)
- n is the number of vectors
- LS is the sum of the vectors
- SS is the sum of squares of the vectors
- CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
- This additivity property is used for incrementally maintaining cluster features
- The distance between two clusters CF1 and CF2 is defined to be the distance between their centroids.
62Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points = Σ_{i=1..N} Xi
SS: square sum of the N data points = Σ_{i=1..N} Xi²
Ex: for the points (3,4), (2,6), (4,5), (4,7), (3,8): CF = (5, (16,30), (54,190))
(a small CF construction sketch follows)
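A minimal NumPy sketch (not from the paper) that builds a CF = (N, LS, SS) and merges two CFs with the additivity property, checked against the (3,4) ... (3,8) example above:

import numpy as np

def cf(points):
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

n, ls, ss = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, ls, ss)                                                     # 5 [16. 30.] [ 54. 190.]
print(cf_merge(cf([(3, 4), (2, 6)]), cf([(4, 5), (4, 7), (3, 8)])))  # same CF, built in two parts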
63Notations
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Given N d-dimensional data points {Xi} in a cluster
- Centroid X0, radius R, diameter D, centroid Euclidean distance D0, centroid Manhattan distance D1
64Notations (2)
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Given N d-dimensional data points in a cluster
Xi - Average inter-cluster distance D2, average
intra-cluster distance D3, variance increase
distance D4
65CF Tree
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- (figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold entries CF1, CF2, ... each with a child pointer, and leaf nodes hold CF entries chained by prev/next pointers)
66Example
- Given threshold T (= 2?) and B = 3, insert 3, 6, 8, and 1
- (2,(9,45)) → (2,(4,10)), (2,(14,100))
- For 2 inserted → (1,(2,4))
- (3,(6,14)), (2,(14,100))
- (2,(3,5)), (1,(3,9)) | (2,(14,100))
- For 5 inserted → (1,(5,25))
- (3,(6,14)), (3,(19,125))
- (2,(3,5)), (1,(3,9)) | (2,(11,61)), (1,(8,64))
- For 7 inserted → (1,(7,49))
- (3,(6,14)), (4,(26,174))
- (2,(3,5)), (1,(3,9)) | (2,(11,61)), (2,(15,113))
67Evaluation of BIRCH
- Scales linearly finds a good clustering with a
single scan and improves the quality with a few
additional scans - Weakness handles only numeric data and sensitive
to the order of the data record.
68Data Summarization
- To compress the data into suitable representative
objects - OPTICS Data Bubble
Finding clusters from hierarchical clustering
depending on the resolution
69OPTICS
M. Ankerst, M. Breunig, H. Kriegel, J. Sander,
OPTICS Ordering Points to Identify the
Clustering Structure, ACM SIGMOD, 1999.
- Preliminary: Nε(q) is the subset of D contained in the ε-neighborhood of q (ε is a radius).
- Definition 1 (directly density-reachable): object p is directly density-reachable from object q w.r.t. ε and MinPts in a set of objects D if 1) p ∈ Nε(q) and 2) Card(Nε(q)) ≥ MinPts (Card(N) denotes the cardinality of the set N).
- Definitions
- Directly density-reachable (p. 51, Figure 2) → density-reachable: transitive closure of direct density-reachability
- Density-connected (p → o ← q)
- Core-distance_{ε,MinPts}(p) = MinPts-distance(p)
- Reachability-distance_{ε,MinPts}(p, o) w.r.t. o = max(core-distance(o), dist(o, p)) → Figure 4
- Ex) cluster ordering → reachability values, Fig. 12
70Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander,
Data Bubbles Quality Preserving Performance
Boosting for Hierarchical Clustering, ACM SIGMOD,
2001.
- ε-neighborhood of P
- k-distance of P: the distance d(P, O) such that for at least k objects O' ∈ D it holds that d(P, O') ≤ d(P, O), and for at most k-1 objects O' ∈ D it holds that d(P, O') < d(P, O).
- k-nearest neighbors of P
- MinPts-dist(P): a distance within which there are at least MinPts objects in the neighborhood of P.
71Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander,
Data Bubbles Quality Preserving Performance
Boosting for Hierarchical Clustering, ACM SIGMOD,
2001.
- Structural distortion
- Figure 11
- Data Bubbles: B = (n, rep, extent, nnDist)
- n: number of objects in X; rep: a representative object for X; extent: an estimate of the radius of X; nnDist: a partial function estimating k-nearest neighbor distances in X.
- dist(B, C) (page 6): dist(B.rep, C.rep) - (B.extent + C.extent) + B.nnDist(1) + C.nnDist(1) when the bubbles do not overlap; max(B.nnDist(1), C.nnDist(1)) otherwise.
72K-Means in SQL
C. Ordonez, Integrating K-Means Clustering with a
Relational DBMS Using SQL, IEEE TKDE 2006.
- Dataset Y = {y1, y2, ..., yn}: a d×n matrix, where each yi is a d×1 column vector
- K-Means: find k clusters by minimizing the squared error from the centers.
- Squared distance, Eq. (1), and objective function, Eq. (2)
- Matrices
- W: k weights (fractions of n), d×k matrix
- C: k means (centroids), d×k matrix
- R: k variances (squared distances), k×1 matrix
- Matrices
- Mj contains the d sums of point dimension values in cluster j: d×k matrix
- Qj contains the d sums of squared dimension values in cluster j: d×k matrix
- Nj contains the number of points in cluster j: k×1 matrix
- Intermediate tables: YH, YV, YD, YNN, NMQ, WCR (figure, p. 193)
73Y, YH, YV, C, CH, YD, YNN, NMQ, WCR (example with n = 5, d = 3, k = 2)

YH (horizontal Y):      YV (vertical Y):     CH (horizontal centroids):
i  Y1  Y2  Y3           i  l  val            j  Y1  Y2  Y3
1   1   2   3           1  1   1             1   1   2   3
2   1   2   3           1  2   2             2   9   8   7
3   9   8   7           1  3   3
4   9   8   7           2  1   1             C:
5   9   8   7           2  2   2             l  k  C1/C2
                        2  3   3             1  1    1
                        3  1   9             2  1    2
                        3  2   8             3  1    3
                        3  3   7             1  2    9
                        4  1   9             2  2    8
                        4  2   8             3  2    7
                        4  3   7
                        5  1   9
                        5  2   8
                        5  3   7

Insert into C Select 1, 1, Y1 From CH Where j = 1
...
Insert into C Select d, k, Yd From CH Where j = k

YD (squared distance to each center):   YNN (nearest center):
i   d1   d2                             i  j
1    0  116                             1  1
2    0  116                             2  1
3  116    0                             3  2
4  116    0                             4  2
5  116    0                             5  2

Insert into YD
Select i, sum((YV.val - C.C1)**2) AS d1, ..., sum((YV.val - C.Ck)**2) AS dk
From YV, C Where YV.l = C.l Group by i

Insert into YNN
Select i, CASE When d1 < d2 AND ... AND d1 < dk Then 1
               When d2 < d3 AND ... Then 2
               ...
               ELSE k END
From YD

NMQ:                       WCR:
l  j  N   M  Q             l  j  W    C  R
1  1  2   2  3             1  1  0.4  1  0
2  1  2   4  3             2  1  0.4  2  0
3  1  2   6  7             3  1  0.4  3  0
1  2  3  27  7             1  2  0.6  9  0
2  2  3  24  7             2  2  0.6  8  0
3  2  3  21  7             3  2  0.6  7  0

Insert into NMQ
Select l, j, sum(1.0) AS N, sum(YV.val) AS M, sum(YV.val * YV.val) AS Q
From YV, YNN Where YV.i = YNN.i Group by l, j
74Incremental Data Summarization
S. Nassar, J. Sander, C. Cheng, Incremental and
Effective Data Summarization for Dynamic
Hierarchical Clustering, ACM SIGMOD, 2004.
- For D = {Xi}, 1 ≤ i ≤ N, and a data bubble containing n objects, the data index is π = n/N.
- For D = {Xi} with mean μ and standard deviation σ of the data index, a bubble is
- good iff π ∈ [μ - σ, μ + σ],
- under-filled iff π < μ - σ, and
- over-filled iff π > μ + σ.
75Research Issues
- Dimensionality reduction
- Approximation
77Cure The Algorithm
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
- Draw a random sample s.
- Partition the sample into p partitions of size s/p
- Partially cluster each partition into s/(pq) clusters
- Eliminate outliers
- By random sampling
- If a cluster grows too slowly, eliminate it.
- Cluster the partial clusters.
- Label the data on disk
78Data Partitioning and Clustering
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
79Cure Shrinking Representative Points
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
- Shrink the multiple representative points towards the gravity center by a fraction α.
- Multiple representatives capture the shape of the cluster
80Density-Based Clustering Methods
- Clustering based on density (local cluster
criterion), such as density-connected points - Major features
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
- Several interesting studies
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)
81CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan, Automatic
Subspace Clustering of High Dimensional Data for
Data Mining Applications, ACM SIGMOD 1998. - Automatically identifying subspaces of a high
dimensional data space that allow better
clustering than original space - CLIQUE can be considered as both density-based
and grid-based - It partitions each dimension into the same number
of equal length interval - It partitions a d-dimensional data space into
non-overlapping rectangular units - A unit is dense if the fraction of total data
points contained in the unit exceeds the input
model parameter - A cluster is a maximal set of connected dense
units within a subspace
82(figure: example grid over age (20-60) and salary (x 10,000, 0-7) showing dense units, with a density threshold of 3)
83CLIQUE The Major Steps
Agrawal, Gehrke, Gunopulos, Raghavan, Automatic
Subspace Clustering of High Dimensional Data for
Data Mining Applications, ACM SIGMOD 1998.
- Partition the data space and find the number of points that lie inside each cell of the partition (a dense-unit sketch follows this list).
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
- Determine dense units in all subspaces of interest
- Determine connected dense units in all subspaces of interest
- Generate a minimal description for the clusters
- Determine maximal regions that cover a cluster of connected dense units for each cluster
- Determine a minimal cover for each cluster
84Strength and Weakness of CLIQUE
- Strength
- It automatically finds subspaces of the highest
dimensionality such that high density clusters
exist in those subspaces - It is insensitive to the order of records in
input and does not presume some canonical data
distribution - It scales linearly with the size of input and has
good scalability as the number of dimensions in
the data increases - Weakness
- The accuracy of the clustering result may be
degraded at the expense of simplicity of the
method
85Model based clustering
- Assume data generated from K probability
distributions - Typically Gaussian distribution Soft or
probabilistic version of K-means clustering - Need to find distribution parameters.
- EM Algorithm
86EM Algorithm
- Initialize K cluster centers
- Iterate between two steps
- Expectation step: assign points to clusters (soft assignments)
- Maximization step: estimate the model parameters (a sketch follows)
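A minimal NumPy sketch (not from the slides) of EM for a 1-D mixture of two Gaussians, alternating the E-step (soft assignments) and M-step (parameter re-estimation); the synthetic data are an assumption:

import numpy as np

def em_gmm_1d(x, n_iter=50):
    rng = np.random.default_rng(0)
    mu = rng.choice(x, 2, replace=False)    # initialize 2 cluster centers
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing proportions, means and standard deviations
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x))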
87CURE (Clustering Using REpresentatives)
- Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998 - Stops the creation of a cluster hierarchy if a
level consists of k clusters - Uses multiple representative points to evaluate
the distance between clusters, adjusts well to
arbitrary shaped clusters and avoids single-link
effect
88Drawbacks of Distance-Based Method
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
- Drawbacks of square-error based clustering method
- Consider only one point as representative of a
cluster - Good only for convex shaped, similar size and
density, and if k can be reasonably estimated
89BIRCH
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Dependent on order of insertions
- Works for convex, isotropic clusters of uniform
size - Labeling Problem
- Centroid approach
- Labeling Problem even with correct centers, we
cannot label correctly
90Jensen-Shannon Divergence
- Jensen-Shannon (JS) divergence between two probability distributions
- Jensen-Shannon (JS) divergence between a finite number of probability distributions
- (the defining formulas are given below)
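The formula images are missing; the standard definitions, assumed to be what the slide showed, with mixture weights \pi_i summing to 1 and m the weighted mixture, are:

JS_{\pi}(p_1, p_2) = \pi_1 KL(p_1 \| m) + \pi_2 KL(p_2 \| m), \quad m = \pi_1 p_1 + \pi_2 p_2
JS_{\pi}(p_1, \ldots, p_n) = \sum_i \pi_i KL(p_i \| m) = H\Big(\sum_i \pi_i p_i\Big) - \sum_i \pi_i H(p_i), \quad m = \sum_i \pi_i p_i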
91Information-Theoretic Clustering (preserving
mutual information)
- (Lemma) The loss in mutual information equals
- Interpretation Quality of each cluster is
measured by the Jensen-Shannon Divergence between
the individual distributions in the cluster. - Can rewrite the above as
- Goal Find a clustering that minimizes the above
loss
92Information Theoretic Co-clustering (preserving
mutual information)
- (Lemma) The loss in mutual information equals the KL divergence between p(x,y) and the approximation q(x,y)
- where q is the distribution induced by the co-clustering
- It can be shown that q(x,y) is a maximum-entropy approximation to p(x,y).
- q(x,y) preserves the marginals: q(x) = p(x), q(y) = p(y)
93The number of parameters that determine q is (m - k) + (kl - 1) + (n - l)
94Preserving Mutual Information
- Lemma
-
- Note that may be thought of as
the prototype of row cluster (the usual
centroid of the cluster is
) - Similarly,
95Example Continued
96Co-Clustering Algorithm
97Properties of Co-clustering Algorithm
- Theorem The co-clustering algorithm
monotonically decreases loss in mutual
information (objective function value) - Marginals p(x) and p(y) are preserved at every
step (q(x)p(x) and q(y)p(y) ) - Can be generalized to higher dimensions
99Applications -- Text Classification
- Assigning class labels to text documents
- Training and Testing Phases
- (figure: a document collection grouped into classes Class-1, ..., Class-m serves as training data for a classifier, which assigns a class label to each new document)
100Dimensionality Reduction
- Feature Selection
- Select the best words; throw away the rest
- Frequency-based pruning
- Information-criterion-based pruning
- A document (bag of words) becomes a vector over the selected words Word1, ..., Wordk
- Feature Clustering
- Do not throw away words
- Cluster words instead
- Use the word clusters as features
- A document (bag of words) becomes a vector over the clusters Cluster1, ..., Clusterk
101Experiments
- Data sets
- 20 Newsgroups data
- 20 classes, 20000 documents
- Classic3 data set
- 3 classes (cisi, med and cran), 3893 documents
- Dmoz Science HTML data
- 49 leaves in the hierarchy
- 5000 documents with 14538 words
- Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
- Implementation Details
- Bow toolkit for indexing, co-clustering, clustering and classifying
102Naïve Bayes with word clusters
- Naïve Bayes classifier
- Assign document d to the class with the highest
score - Relation to KL Divergence
- Using word clusters instead of words
- where parameters for clusters are estimated
according to joint statistics
103Selecting Correlated Attributes
T. Fukuda, Y. Morimoto, S. Morishita, T.
Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules,
Proc. of VLDB Conf., 1996.
- Attributes A1 and A2 are decided to be strongly correlated iff the correlation criterion exceeds a threshold θ ≥ 1
104MDL-based Decision Tree Pruning
- M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995. - Two steps for induction of decision trees
- Construct a DT using training data
- Reduce the DT by pruning to prevent overfitting
- Possible approaches
- Cost-complexity pruning using a separate set of
samples for pruning - DT pruning using the same training data sets for
testing - MDL-based pruning using Minimum Description
Length (MDL) principle.
105Pruning Using MDL Principle
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
- View the decision tree as a means for efficiently encoding the classes of records in the training set
- MDL principle: the best tree is the one that can encode the records using the fewest bits
- Cost of encoding the tree includes
- 1 bit for encoding the type of each node (e.g., leaf or internal)
- Csplit: cost of encoding the attribute and value for each split
- n*E: cost of encoding the n records in each leaf (E is the entropy)
106Pruning Using MDL Principle
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
- Problem: compute the minimum-cost subtree at the root of the built tree
- Suppose minCN is the cost of encoding the minimum-cost subtree rooted at N
- Prune the children of a node N if minCN = n*E + 1
- Compute minCN as follows
- N is a leaf: n*E + 1
- N has children N1 and N2: min{n*E + 1, Csplit + 1 + minCN1 + minCN2}
- Prune the tree in a bottom-up fashion (a sketch follows)
107MDL Pruning - Example
R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998.
- Cost of encoding the records in N: (n*E + 1) = 3.8
- Csplit = 2.6
- minCN = min{3.8, 2.6 + 1 + 1 + 1} = 3.8
- Since minCN = n*E + 1, N1 and N2 are pruned
108PUBLIC
- R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998. - Prune tree during (not after) building phase
- Execute pruning algorithm (periodically) on
partial tree - Problem how to compute minCN for a yet to be
expanded leaf N in a partial tree - Solution compute lower bound on the subtree cost
at N and use this as minCN when pruning - minCN is thus a lower bound on the cost of
subtree rooted at N - Prune children of a node N if minCN nE1
- Guaranteed to generate identical tree to that
generated by SPRINT
109PUBLIC(1)
R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998.
sal education Label
10K High-school Reject
40K Under Accept
15K Under Reject
75K grad Accept
18K grad Accept
- Simple lower bound for a subtree: 1
- Cost of encoding the records in N: n*E + 1 = 5.8
- Csplit = 4
- minCN = min{5.8, 4 + 1 + 1 + 1} = 5.8
- Since minCN = n*E + 1, N1 and N2 are pruned
110PUBLIC(S)
- Theorem: the cost of any subtree with s splits and rooted at node N is at least 2s + 1 + s*log a + Σ_{i=s+2..k} ni
- a is the number of attributes
- k is the number of classes
- ni (≥ ni+1) is the number of records belonging to class i
- The lower bound on the subtree cost at N is thus the minimum over s of
- n*E + 1 (cost with zero splits)
- 2s + 1 + s*log a + Σ_{i=s+2..k} ni
111Whats Clustering
- Clustering is a kind of unsupervised learning.
- Clustering is a method of grouping data that share similar trends and patterns.
- Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
- Example (before and after clustering)
- Thus, clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.
112Partitional Algorithms
- Enumerate K partitions optimizing some criterion
- Example square-error criterion
- where x_i^(j) is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster.
113Squared Error Clustering Method
- Select an initial partition of the patterns with
a fixed number of clusters and cluster centers - Assign each pattern to its closest cluster center
and compute the new cluster centers as the
centroids of the clusters. Repeat this step until
convergence is achieved, i.e., until the cluster
membership is stable. - Merge and split clusters based on some heuristic
information, optionally repeating step 2.
114Agglomerative Clustering Algorithm
- Place each pattern in its own cluster. Construct
a list of interpattern distances for all distinct
unordered pairs of patterns, and sort this list
in ascending order - Step through the sorted list of distances,
forming for each distinct dissimilarity value dk
a graph on the patterns where pairs of patterns
closer than dk are connected by a graph edge. If
all the patterns are members of a connected
graph, stop. Otherwise, repeat this step. - The output of the algorithm is a nested hierarchy
of graphs with can be cut at a desired
dissimilarity level forming a partition
identified by simply connected components in the
corresponding graph.
115Agglomerative Hierarchical Clustering
- The most widely used hierarchical clustering algorithm
- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes K
- "Closest" can be defined by dmean(Ci, Cj) or dmin(Ci, Cj)
- Likewise dave(Ci, Cj) and dmax(Ci, Cj)
116Clustering
- Summary of Drawbacks of Traditional Methods
- Partitional algorithms split large clusters
- Centroid-based method splits large and
non-hyperspherical clusters - Centers of subclusters can be far apart
- Minimum spanning tree algorithm is sensitive to
outliers and slight change in position - Exhibits chaining effect on string of outliers
- Cannot scale up for large databases
117Model-based Clustering
- Mixture of Gaussians
- Gaussian components with priors P(ωi)
- Each data point is distributed as N(μi, σ²I)
- Consider
- Data points x1, x2, ..., xN
- Parameters P(ω1), ..., P(ωk), σ
- Likelihood function
- Maximize the likelihood function by calculating the parameter values at which its derivative vanishes
118Overview of EM Clustering
- Extensions and generalizations. The EM
(expectation maximization) algorithm extends the
k-means clustering technique in two important
ways - Instead of assigning cases or observations to
clusters to maximize the differences in means for
continuous variables, the EM clustering algorithm
computes probabilities of cluster memberships
based on one or more probability distributions.
The goal of the clustering algorithm then is to
maximize the overall probability or likelihood of
the data, given the (final) clusters. - Unlike the classic implementation of k-means
clustering, the general EM algorithm can be
applied to both continuous and categorical
variables (note that the classic k-means
algorithm can also be modified to accommodate
categorical variables).
119EM Algorithm
- The EM algorithm for clustering is described in
detail in Witten and Frank (2001). - The basic approach and logic of this clustering
method is as follows. - Suppose you measure a single continuous variable
in a large sample of observations. - Further, suppose that the sample consists of two
clusters of observations with different means
(and perhaps different standard deviations)
within each sample, the distribution of values
for the continuous variable follows the normal
distribution. - The resulting distribution of values (in the
population) may look like this
120EM v.s. k-Means
- Classification probabilities instead of
classifications. The results of EM clustering are
different from those computed by k-means
clustering. The latter will assign observations
to clusters to maximize the distances between
clusters. The EM algorithm does not compute
actual assignments of observations to clusters,
but classification probabilities. In other words,
each observation belongs to each cluster with a
certain probability. Of course, as a final result
you can usually review an actual assignment of
observations to clusters, based on the (largest)
classification probability.
121Finding k
- V-fold cross-validation. This type of
cross-validation is useful when no test sample is
available and the learning sample is too small to
have the test sample taken from it. A specified V
value for V-fold cross-validation determines the
number of random subsamples, as equal in size as
possible, that are formed from the learning
sample. The classification tree of the specified
size is computed V times, each time leaving out
one of the subsamples from the computations, and
using that subsample as a test sample for
cross-validation, so that each subsample is used
V - 1 times in the learning sample and just once
as the test sample. The CV costs computed for
each of the V test samples are then averaged to
give the V-fold estimate of the CV costs.
122Expectation Maximization
- A mixture of Gaussians
- Ex: x1 = 30 with P(x1) = 1/2, x2 = 18 with P(x2) = μ, x3 = 0 with P(x3) = 2μ, x4 = 23 with P(x4) = 1/2 - 3μ
- Likelihood for X: x1 observed for a students, x2 for b students, x3 for c students, x4 for d students
- To maximize L, work with the log-likelihood. Supposing a = 14, b = 6, c = 9, d = 10, then μ = 1/10. If only the combined count h of x1 and x2 is observed (a + b = h), then a = h/(2μ + 1) and b = 2μh/(2μ + 1).
123Gaussian (Normal) pdf
- The Gaussian function with mean (μ) and standard deviation (σ). Properties of the function:
- Symmetric about the mean
- Attains its maximum value at the mean and its minimum value at plus and minus infinity
- The distribution is often referred to as bell shaped
- At one standard deviation from the mean the function has dropped to about 2/3 of its maximum value; at two standard deviations it has fallen to about 1/7.
- The area under the function within one standard deviation of the mean is about 0.682, within two standard deviations 0.9545, and within three 0.9973. The total area under the curve is 1.
124Gaussian
Think of the cumulative distribution, F_{μ,σ²}(x)
125Multi-variate Density Estimation
Mixture of Gaussians
- The parameter vector contains all the parameters of the mixture model; the pi are known as mixing proportions or coefficients.
- A mixture of Gaussians model
- Generic mixture: draw a component y from P(y) (e.g., y1 or y2), then draw x from the corresponding component density P(x|y1) or P(x|y2)
126Mixture Density
- If we are given just x we do not know which
mixture component this example came from - We can evaluate the posterior probability that an
observed x was generated from the first mixture
component