Title: Techniques of Classification and Clustering
1Techniques of Classification and Clustering
2Problem Description
- Assume
- A1, A2, ..., Ad: (ordered or unordered) domains
- S = A1 x A2 x ... x Ad: a d-dimensional (numerical or non-numerical) space
- Input
- V = {v1, v2, ..., vm}: d-dimensional points, where vi = (vi1, vi2, ..., vid)
- The jth component of vi is drawn from domain Aj
- Output
- G = {g1, g2, ..., gk}: a set of groups of V with labels, where gi is a subset of V
3Classification
- Supervised classification
- Discriminant analysis, or simply Classification
- A collection of labeled (pre-classified) patterns is provided
- Aims to label a newly encountered, yet unlabeled, pattern
- Unsupervised classification
- Clustering
- Aims to group a given collection of unlabeled patterns into meaningful clusters
- Category labels are data driven
4Methods for Classification
- Neural Nets
- Classification functions are obtained by making multiple passes over the training set
- Poor generation efficiency
- Does not handle non-numerical data efficiently
- Decision trees
- If E contains only objects of one group, the decision tree is just a leaf labeled with that group.
- Construct a DT that correctly classifies objects in the training data set.
- Test the DT by classifying the unseen objects in the test data set.
5Decision Trees (Ex Credit Analysis)
- (decision tree figure) Root test: salary < 20000? If no, accept. If yes, test whether education is graduate: if yes, accept; if no, reject.
6Decision Trees
- Pros
- Fast execution time
- Generated rules are easy to interpret by humans
- Scale well for large data sets
- Can handle high dimensional data
- Cons
- Cannot capture correlations among attributes
- Consider only axis-parallel cuts
7Decision Tree Algorithms
- Classifiers from the machine learning community
- ID3: J. R. Quinlan, Induction of Decision Trees, Machine Learning, 1, 1986.
- C4.5: J. Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.
- CART: L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and Regression Trees, Wadsworth, Belmont, 1984.
- Classifiers for large databases
- SLIQ [MAR96]; SPRINT: John Shafer, Rakesh Agrawal, and Manish Mehta, SPRINT: A Scalable Parallel Classifier for Data Mining, Proc. of VLDB Conf., Bombay, India, September 1996.
- SONAR: Takeshi Fukuda, Yasuhiko Morimoto, and Shinichi Morishita, Constructing Efficient Decision Trees by Using Optimized Numeric Association Rules, Proc. of VLDB Conf., Bombay, India, 1996.
- RainForest: J. Gehrke, R. Ramakrishnan, V. Ganti, RainForest: A Framework for Fast Decision Tree Construction of Large Datasets, Proc. of VLDB Conf., 1998.
- Building phase followed by pruning phase
8Decision Tree Algorithms
- Building phase
- Recursively split nodes using best splitting
attribute for node - Pruning phase
- Smaller imperfect decision tree generally
achieves better accuracy - Prune leaf nodes recursively to prevent
over-fitting
9Preliminaries
- Theoretical Background
- Entropy
- Similarity measures
- Advanced terms
10Information Theory Concepts
- Entropy of a random variable X with probability distribution p(x)
- The Kullback-Leibler (KL) divergence, or relative entropy, between two probability distributions p and q
- Mutual information between random variables X and Y (definitions below)
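The formula images for these three quantities did not survive extraction; the standard definitions, assumed to be what the slide showed, are:

H(X) = -\sum_x p(x) \log p(x)
KL(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}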
11What is Entropy
- S is a sample of the training data set
- Entropy measures the impurity of S
- H(X): the entropy of X
- If H(X) = 0, X takes a single value; as H(X) increases, the values of X become more heterogeneous.
- For the same number of X values,
- High entropy means X is drawn from a (boring) uniform distribution: a histogram of the frequency distribution of values of X would be flat, and so the values sampled from it would be all over the place.
- Low entropy means X is drawn from a varied (peaks and valleys) distribution: a histogram of the frequency distribution of values of X would have many lows and one or two highs, and so the values sampled from it would be more predictable. (A small sketch follows.)
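A minimal Python sketch (not part of the original slides) that estimates H(X) from a sample, illustrating the two extremes described above:

# Empirical entropy of a sample, in bits.
from collections import Counter
from math import log2

def entropy(values):
    """H(X) = -sum_x p(x) log2 p(x), estimated from value frequencies."""
    counts = Counter(values)
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

# A single repeated value has entropy 0; a 50/50 split has entropy 1 bit.
print(entropy(["a", "a", "a", "a"]))   # 0.0
print(entropy(["a", "b", "a", "b"]))   # 1.0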
12Entropy-Based Data Segmentation
T. Fukuda, Y. Morimoto, S. Morishita, T.
Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules,
Proc. of VLDB Conf., 1996.
- The attribute has three categories: 40 objects in C1, 30 in C2, 30 in C3 (100 objects in total).
- Two candidate segmentations of the 100 objects: {S1, S2} and {S3, S4}

        Total  C1  C2  C3
  All    100   40  30  30
  S1      60   40  10  10
  S2      40    0  20  20
  S3      60   20  20  20
  S4      40   20  10  10
13Information Theoretic Measure
R. Agrawal, S. Ghosh, T. Imielinski, B. Iyer, A. Swami, An Interval Classifier for Database Mining Applications, Proc. of VLDB Conf., 1992.
- Information gain by branching on Ai
- gain(Ai) = E - Ei
- E: the entropy of an object set containing objects ek of group Gk
- Ei: the expected entropy for the tree with Ai as the root, where Eij is the expected entropy for the subtree of an object set
- Information content of the value of Ai
14Ex
        Total  C1  C2  C3
  All    100   40  30  30
  S1      60   20  20  20
  S2      40   20  10  10
  S3      40   40   0   0
  S4      30    0  30   0
  S5      30    0   0  30

- Split 1 = {S1, S2}, Split 2 = {S3, S4, S5}
- gain1 = E - E1 = 0.015, gain2 = E - E2 = 1.09 (a numerical check follows)
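A short Python check (not from the slides) of the two gains above, using natural-log entropy, which reproduces the reported values up to rounding:

from math import log

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * log(c / n) for c in counts if c > 0)

def expected_entropy(subsets):
    n = sum(sum(s) for s in subsets)
    return sum((sum(s) / n) * entropy(s) for s in subsets)

all_counts = [40, 30, 30]
split1 = [[20, 20, 20], [20, 10, 10]]            # S1, S2
split2 = [[40, 0, 0], [0, 30, 0], [0, 0, 30]]    # S3, S4, S5

E = entropy(all_counts)
print(round(E - expected_entropy(split1), 3))    # ~0.014 (gain1, reported as 0.015)
print(round(E - expected_entropy(split2), 2))    # 1.09   (gain2)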
15Distributional Similarity Measures
- Cosine
- Jaccard coefficient
- Dice coefficient
- Overlap coefficient
- L1 distance (City block distance)
- Euclidean distance (L2 distance)
- Hellinger distance
- Information Radius (Jensen-Shannon divergence)
- Skew divergence
- Confusion Probability
- Lin's Similarity Measure
16Similarity Measures
- Minkowski distance
- Euclidean distance
- p = 2
- Manhattan distance
- p = 1
- Mahalanobis distance
- Normalization due to weighting schemes
- Σ is the sample covariance matrix of the patterns or the known covariance matrix of the pattern generation process (a sketch of these distances follows)
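A minimal NumPy sketch (not from the slides) of the distances named above; the sample points and the covariance matrix passed to the Mahalanobis distance are illustrative assumptions:

import numpy as np

def minkowski(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def mahalanobis(x, y, cov):
    d = x - y
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

x, y = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(minkowski(x, y, 2))            # Euclidean (p = 2) -> 5.0
print(minkowski(x, y, 1))            # Manhattan (p = 1) -> 7.0
print(mahalanobis(x, y, np.eye(2)))  # identity covariance reduces to Euclidean -> 5.0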
17General form
- I(common(A,B)): information content associated with the statement describing what A and B have in common
- I(description(A,B)): information content associated with the statement describing A and B
- π(s): probability of the statement within the world of the objects in question, i.e., the fraction of objects exhibiting feature s
IT-Sim (A,B)
18Similarity Measures
- The Set/Bag Model: let X and Y be two collections of XML documents
- Jaccard's Coefficient
- Dice's Coefficient
19Similarity Measures
- Cosine-Similarity Measure (CSM)
- The Vector-Space Model Cosine-Similarity Measure
(CSM)
20Query Processing a single cosine
- For every term i and each doc j, store the term frequency tfij.
- Some tradeoffs on whether to store the term count, the term weight, or the weight scaled by idfi.
- At query time, accumulate the component-wise sum (a term-at-a-time sketch follows).
- If you're indexing 5 billion documents (web search), an array of accumulators is infeasible. Ideas?
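A minimal Python sketch (not from the slides) of term-at-a-time accumulation over a toy inverted index; the index contents, weights and document norms are illustrative assumptions:

from collections import defaultdict
from math import sqrt

# inverted index: term -> list of (doc_id, term_weight); doc_norm: doc_id -> vector length
index = {
    "data":   [(1, 0.5), (2, 1.2)],
    "mining": [(1, 0.8), (3, 0.4)],
}
doc_norm = {1: 0.94, 2: 1.2, 3: 0.4}

def cosine_scores(query_weights):
    acc = defaultdict(float)                      # one accumulator per candidate doc
    for term, q_w in query_weights.items():
        for doc_id, d_w in index.get(term, []):
            acc[doc_id] += q_w * d_w              # component-wise sum
    q_norm = sqrt(sum(w * w for w in query_weights.values()))
    return {d: s / (q_norm * doc_norm[d]) for d, s in acc.items()}

print(cosine_scores({"data": 1.0, "mining": 1.0}))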
21Similarity Measures (2)
- The Generalized Cosine-Similarity Measure (GCSM)
Let X and Y be vectors and - where
- Hierarchical Model
- Why only for depth?
222 Dim Similarities
- Cosine Measure
- Hellinger Measure
- Tanimoto Measure
- Clarity Measure
23Advanced Terms
- Conditional Entropy
- Information Gain
24Specific Conditional Entropy
- H(Y | X = v)
- Suppose I'm trying to predict output Y and I have input X
- X = College Major, Y = Likes "Gladiator"
- Let's assume this data reflects the true probabilities
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
- From this data we estimate
- P(LikeG = Yes) = 0.5
- P(Major = Math and LikeG = No) = 0.25
- P(Major = Math) = 0.5
- P(LikeG = Yes | Major = History) = 0
- Note
- H(X) = 1.5, H(Y) = 1
- H(Y | X = Math) = 1
- H(Y | X = History) = 0
- H(Y | X = CS) = 0
25Conditional Entropy
- Definition of Conditional Entropy
- H(Y|X): the average specific conditional entropy of Y
- If you choose a record at random, what will be the conditional entropy of Y, conditioned on that row's value of X?
- Expected number of bits to transmit Y if both sides know the value of X
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
vj        Prob(X = vj)   H(Y | X = vj)
Math      0.5            1
History   0.25           0
CS        0.25           0

H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5
26Information Gain
- Definition of Information Gain
- IG(Y|X): I must transmit Y. How many bits on average would it save me if both ends of the line knew X?
- IG(Y|X) = H(Y) - H(Y|X) (a sketch follows the example)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
H(Y) = 1, H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5. Thus, IG(Y|X) = 1 - 0.5 = 0.5
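A minimal Python sketch (not from the slides) computing H(Y|X) and IG(Y|X) on the Major/Gladiator table above:

from collections import Counter
from math import log2

def entropy(values):
    n = len(values)
    return sum(-(c / n) * log2(c / n) for c in Counter(values).values())

def conditional_entropy(xs, ys):
    n = len(xs)
    h = 0.0
    for v in set(xs):
        ys_v = [y for x, y in zip(xs, ys) if x == v]   # rows with X = v
        h += (len(ys_v) / n) * entropy(ys_v)           # P(X=v) * H(Y|X=v)
    return h

X = ["Math", "History", "CS", "Math", "Math", "CS", "History", "Math"]
Y = ["Yes", "No", "Yes", "No", "No", "Yes", "No", "Yes"]
print(conditional_entropy(X, Y))                 # 0.5
print(entropy(Y) - conditional_entropy(X, Y))    # IG(Y|X) = 0.5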
27Relative Information Gain
- Definition of Relative Information Gain
- RIG(Y|X): I must transmit Y. What fraction of the bits on average would it save me if both ends of the line knew X?
- RIG(Y|X) = (H(Y) - H(Y|X)) / H(Y)
X Y
Math Yes
History No
CS Yes
Math No
Math No
CS Yes
History No
Math Yes
H(Y) = 1, H(Y|X) = 0.5*1 + 0.25*0 + 0.25*0 = 0.5. Thus, RIG(Y|X) = (1 - 0.5)/1 = 0.5
28What is Information Gain used for?
- Suppose you are trying to predict whether someone is going to live past 80 years. From historical data you might find
- IG(LongLife | HairColor) = 0.01
- IG(LongLife | Smoker) = 0.2
- IG(LongLife | Gender) = 0.25
- IG(LongLife | LastDigitOfSSN) = 0.00001
- IG tells you how interesting a 2-d contingency table is going to be.
29Clustering
- Given
- Data points and number of desired clusters K
- Group the data points into K clusters
- Data points within clusters are more similar than
across clusters - Sample applications
- Customer segmentation
- Market basket customer analysis
- Attached mailing in direct marketing
- Clustering companies with similar growth
30A Clustering Example
- Four clusters described by attribute-value patterns:
- Income = High, Children = 1, Car = Luxury
- Income = Medium, Children = 2, Car = Truck
- Car = Sedan and Children = 3, Income = Medium
- Income = Low, Children = 0, Car = Compact
31Different ways of representing clusters
32Clustering Methods
- Partitioning
- Given a set of objects and a clustering
criterion, partitional clustering obtains a
partition of the objects into clusters such that
the objects in a cluster are more similar to each
other than to objects in different clusters. - K-means, and K-mediod methods determine K cluster
representatives and assign each object to the
cluster with its representative closest to the
object such that the sum of the distances squared
between the objects and their representatives is
minimized. - Hierarchical
- Nested sequence of partitions.
- Agglomerative starts by placing each object in
its own cluster and then merge these atomic
clusters into larger and larger clusters until
all objects are in a single cluster. - Divisive starts with all objects in cluster and
subdividing into smaller pieces.
33Algorithms
- k-Means
- Fuzzy C-Means Clustering
- Hierarchical Clustering
- Probabilistic Clustering
34Similarity Measures (2)
- Mutual Neighbor Distance (MND)
- MND(xi, xj) = NN(xi, xj) + NN(xj, xi), where NN(xi, xj) is the neighbor number of xj with respect to xi.
- Distance under context
- s(xi, xj) = f(xi, xj, e), where e is the context
35K-Means Clustering Algorithm
- Choose k cluster centers to coincide with k
randomly-chosen patterns - Assign each pattern to its closest cluster
center. - Recompute the cluster centers using the current
cluster memberships. - If a convergence criterion is not met, go to step
2. - Typical convergence criteria
- No (or minimal) reassignment of patterns to new
cluster centers, or minimal decrease in squared
error.
36Objective Function
- k-Means algorithm aims at minimizing the
following objective function (square error
function)
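A minimal NumPy sketch (not from the slides) of the four-step k-means loop described on the previous slide, which iteratively reduces this squared-error objective; the toy data and seed are assumptions:

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]      # step 1: k random patterns
    for _ in range(n_iter):
        # step 2: assign each pattern to its closest center
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        # step 3: recompute centers from current memberships
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):               # step 4: convergence check
            break
        centers = new_centers
    return centers, labels

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
print(kmeans(X, 2))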
37K-Means Algorithm (Ex)
38Distortion
- Given a clustering φ, we denote by φ(x) the centroid this clustering associates with an arbitrary point x. A measure of quality for φ:
- Distortion_φ = Σ_x d²(x, φ(x)) / R
- where R is the total number of points and x ranges over all input points.
- Improvement: penalize model size, e.g.
- Distortion + (#parameters)·log R
- Distortion + m·k·log R
39Remarks
- How to initialize the means is a key problem. One popular way to start is to randomly choose k of the samples.
- The results produced depend on the initial values for the means.
- It can happen that the set of samples closest to mi is empty, so that mi cannot be updated.
- The results depend on the metric used to measure distances.
40Related Work Clustering
- Graph-based clustering
- For an XML document collection C, the s-Graph sg(C) = (N, E) is a directed graph such that N is the set of all the elements and attributes in the documents in C, and (a, b) ∈ E if and only if a is a parent element of b in some document(s) in C (b can be an element or an attribute).
- For two sets, C1 and C2, of XML documents, the distance between them is defined in terms of their s-Graphs, where |sg(Ci)| is the number of edges in sg(Ci).
41Fuzzy C-Means Clustering
- FCM is a method of clustering which allows one
piece of data to belong to two or more clusters. - Fuzzy partitioning is carried out through an
iterative optimization of the objective function
shown above, with the update of membership u and
the cluster center c by
42Membership
- The iteration stops when max_ij |u_ij^(k+1) - u_ij^(k)| < ε, where ε is a termination criterion between 0 and 1 and k is the iteration step. This procedure converges to a local minimum or a saddle point of Jm. (An update-step sketch follows the properties below.)
43Fuzzy Clustering
- Properties
- uij ∈ [0,1] for all i, j
- the memberships of each data point sum to 1 over the clusters
- the total membership of each cluster lies strictly between 0 and N
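A minimal NumPy sketch (not from the slides) of one pass of the fuzzy c-means updates for the memberships u and the centers c, following the usual FCM update equations with fuzzifier m > 1; the toy data are assumptions:

import numpy as np

def fcm_step(X, centers, m=2.0, eps=1e-9):
    # distances from every point to every center
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + eps
    # membership update: u_ij = 1 / sum_k (d_ij / d_ik)^(2/(m-1))
    u = 1.0 / np.sum((d[:, :, None] / d[:, None, :]) ** (2.0 / (m - 1.0)), axis=2)
    # center update: c_j = sum_i u_ij^m x_i / sum_i u_ij^m
    w = u ** m
    new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
    return u, new_centers

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [4.8, 5.1]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
u, centers = fcm_step(X, centers)
print(np.round(u, 3))   # each row (data point) sums to 1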
44Speculations
- Correlation between m and ε
- More iterations k are needed for smaller ε.
45Hierarchical Clustering
- Basic Process
- Start by assigning each item to its own cluster, giving N clusters for N items. (Let the distances between the clusters equal the distances between the items they contain.)
- Find the closest (most similar) pair of clusters and merge them into a single cluster.
- Compute distances between the new cluster and each of the old clusters.
- Repeat steps 2 and 3 until all items are clustered into a single cluster of size N. (A SciPy sketch follows.)
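A minimal sketch (not from the slides) using SciPy, whose linkage() routine implements exactly the merge loop above; the toy points and the cut into 3 clusters are assumptions:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0], [5.1, 5.0], [10.0, 0.0]])
Z = linkage(X, method="single")                    # single-link merges (see next slides)
labels = fcluster(Z, t=3, criterion="maxclust")    # cut the dendrogram into 3 clusters
print(labels)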
46Hierarchical Clustering (Ex)
dendrogram
47Hierarchical Clustering Algorithms
- Single-linkage clustering
- The distance between two clusters is the minimum
of the distances between all pairs of patterns
drawn from the two clusters (one pattern from the
first cluster, the other from the second). - Complete-linkage clustering
- The distance between two clusters is the maximum
of the distances between all pairs of patterns
drawn from the two clusters - Average-linkage clustering
- Minimum-variance algorithm
48Single-/Complete-Link Clustering
49Single Linkage Hierarchical Cluster
- Steps
- Begin with the disjoint clustering having level L(0) = 0 and sequence number m = 0.
- Find the least dissimilar pair of clusters in the current clustering, d[(r),(s)] = min d[(i),(j)], where the minimum is over all pairs of clusters in the current clustering.
- Increment the sequence number: m = m + 1. Merge clusters (r) and (s) into a single cluster to form clustering m. Set L(m) = d[(r),(s)].
- Update the proximity matrix D by deleting the rows and columns corresponding to clusters (r) and (s) and adding a row and column corresponding to the newly formed cluster. The proximity between the new cluster, denoted (r,s), and an old cluster (k) is defined as d[(k),(r,s)] = min(d[(k),(r)], d[(k),(s)]).
- If all objects are in one cluster, stop. Else go to step 2.
50Ex Single-Linkage
51Agglomerative Hierarchical Clustering
ALGORITHM Agglomerative Hierarchical Clustering
INPUT: bit-vectors B in bitmap index BI
OUTPUT: a tree T
METHOD:
(1) Place each bit-vector Bi in its own cluster (singleton), creating the list of clusters L (initially, the leaves of T): L = {B1, B2, ..., Bn}.
(2) Compute a merging cost function between every pair of elements in L to find the two closest clusters Bi, Bj, which will be the cheapest pair to merge.
(3) Remove Bi and Bj from L.
(4) Merge Bi and Bj to create a new internal node Bij in T, which will be the parent of Bi and Bj in the result tree.
(5) Repeat from (2) until there is only one cluster remaining.
52Graph-Theoretic Clustering
- Construct the minimal spanning tree (MST)
- Delete the MST edges with the largest lengths
- (figure: a minimal spanning tree over points A-G in the (x1, x2) plane, with edge lengths ranging from 0.5 to 6.5; deleting the longest edge splits the points into two clusters)
53Improving k-Means
D. Pelleg and A. Moore, Accelerating Exact
k-means Algorithms with Geometric Reasoning, ACM
Proceedings of Conf. on Knowledge and Data
Mining, 1999.
- Definitions
- Center of clusters ↔ (Th. 2) center of rectangle owner(h)
- c1 dominates c2 w.r.t. h if h lies entirely on c1's side of the hyperplane bisecting c1 and c2 (pg. 7, 9)
- Update Centroid
- If one center c dominates all other centers w.r.t. h (so c = owner(h), pg. 10), assign h to owner(h); otherwise split h
- (blacklisting version) if c1 dominates c2 w.r.t. h, then c1 dominates c2 w.r.t. any h' contained in h (pg. 11)
54Clustering Categorical Data ROCK
- S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999 - Use links to measure similarity/proximity
- Not distance based
- Computational complexity
- Basic ideas
- Similarity function and neighbors
- Let T1 = {1, 2, 3}, T2 = {3, 4, 5}; then sim(T1, T2) = |T1 ∩ T2| / |T1 ∪ T2| = 1/5 = 0.2
55Using Jaccard Coefficient
- According to the Jaccard coefficient, the distance between {1,2,3} and {1,2,6} is the same as the one between {1,2,3} and {1,2,4}, although the former pair comes from two different clusters.

CLUSTER 1 (from <1,2,3,4,5>): {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
CLUSTER 2 (from <1,2,6,7>): {1,2,6} {1,2,7} {1,6,7} {2,6,7}
56ROCK
- Motivation for LINK: the main problem with point-to-point similarity is that only local properties involving the two points themselves are considered
- Neighbor: if two points are similar enough to each other, they are neighbors
- Link: the link count for a pair of points is the number of common neighbors.
57Rock Algorithm
S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999
- Links: the number of common neighbors of the two points.
- Algorithm
- Draw a random sample
- Cluster with links
- Label the data on disk
- Ex: among the transactions {1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5}, {1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}, the pair {1,2,3} and {1,2,4} has link value 3.
58Rock Algorithm
S. Guha, R. Rastogi, K. Shim, ROCK Robust
Clustering using linKs, IEEE Conf Data
Engineering, 1999
- Criterion function: maximize the total number of links within the k clusters
- Ci denotes cluster i of size ni. (A Python sketch of neighbors and links follows.)

For the similarity threshold 0.5:
link({1,2,6}, {1,2,7}) = 4
link({1,2,6}, {1,2,3}) = 3
link({1,6,7}, {1,2,3}) = 2
link({1,2,3}, {1,4,5}) = 3

Cluster 1: {1,2,3} {1,4,5} {1,2,4} {2,3,4} {1,2,5} {2,3,5} {1,3,4} {2,4,5} {1,3,5} {3,4,5}
Cluster 2: {1,2,6} {1,2,7} {1,6,7} {2,6,7}
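A minimal Python sketch (not from the paper) of neighbors and links over the example transactions, using the Jaccard coefficient with threshold 0.5; the exact counts depend on conventions such as whether a point counts as its own neighbor, so they may differ slightly from the values quoted on the slide:

from itertools import combinations

def jaccard(a, b):
    return len(a & b) / len(a | b)

def links(points, theta=0.5):
    nbrs = {p: {q for q in points if q != p and jaccard(set(p), set(q)) >= theta}
            for p in points}
    return {(p, q): len(nbrs[p] & nbrs[q]) for p, q in combinations(points, 2)}

cluster1 = [frozenset(s) for s in ([1,2,3],[1,4,5],[1,2,4],[2,3,4],[1,2,5],
                                   [2,3,5],[1,3,4],[2,4,5],[1,3,5],[3,4,5])]
cluster2 = [frozenset(s) for s in ([1,2,6],[1,2,7],[1,6,7],[2,6,7])]
L = links(cluster1 + cluster2)
print(L[(cluster1[0], cluster1[1])])   # a pair inside cluster 1
print(L[(cluster1[0], cluster2[0])])   # a pair spanning the two clusters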
59More on Hierarchical Clustering Methods
- Major weakness of agglomerative clustering
methods - do not scale well time complexity of at least
O(n2), where n is the number of total objects - can never undo what was done previously
- Integration of hierarchical with distance-based
clustering - BIRCH (1996) uses CF-tree and incrementally
adjusts the quality of sub-clusters - CURE (1998) selects well-scattered points from
the cluster and then shrinks them towards the
center of the cluster by a specified fraction
60BIRCH
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Pre-cluster data points using CF-tree
- For each point
- CF-tree is traversed to find the closest cluster
- If the threshold criterion is satisfied, the
point is absorbed into the cluster - Otherwise, it forms a new cluster
- Requires only single scan of data
- Cluster summaries stored in CF-tree are given to
main memory hierarchical clustering algorithm
61Initialization of BIRCH
- The CF of a cluster of n d-dimensional vectors V1, ..., Vn is defined as (n, LS, SS)
- n is the number of vectors
- LS is the sum of the vectors
- SS is the sum of squares of the vectors
- CF1 + CF2 = (n1 + n2, LS1 + LS2, SS1 + SS2)
- This additivity property is used for incrementally maintaining cluster features
- The distance between two clusters CF1 and CF2 is defined to be the distance between their centroids.
62Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
Clustering Feature Vector
Clustering Feature: CF = (N, LS, SS)
N: number of data points
LS: linear sum of the N data points = Σ_{i=1..N} Xi
SS: square sum of the N data points = Σ_{i=1..N} Xi²
Ex: for the points (3,4), (2,6), (4,5), (4,7), (3,8): CF = (5, (16,30), (54,190))
(a small CF construction sketch follows)
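A minimal NumPy sketch (not from the paper) that builds a CF = (N, LS, SS) and merges two CFs with the additivity property, checked against the (3,4) ... (3,8) example above:

import numpy as np

def cf(points):
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), (pts ** 2).sum(axis=0)

def cf_merge(cf1, cf2):
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

n, ls, ss = cf([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(n, ls, ss)                                                     # 5 [16. 30.] [ 54. 190.]
print(cf_merge(cf([(3, 4), (2, 6)]), cf([(4, 5), (4, 7), (3, 8)])))  # same CF, built in two parts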
63Notations
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Given N d-dimensional data points {Xi} in a cluster
- Centroid X0, radius R, diameter D, centroid Euclidean distance D0, centroid Manhattan distance D1
64Notations (2)
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Given N d-dimensional data points in a cluster
Xi - Average inter-cluster distance D2, average
intra-cluster distance D3, variance increase
distance D4
65CF Tree
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- (figure: a CF tree with branching factor B = 7 and leaf capacity L = 6; the root and non-leaf nodes hold entries CF1, CF2, ... each with a child pointer, and leaf nodes hold CF entries chained by prev/next pointers)
66Example
- Given threshold T (= 2?) and B = 3, insert 3, 6, 8, and 1
- (2,(9,45)) → (2,(4,10)), (2,(14,100))
- For 2 inserted → (1,(2,4))
- (3,(6,14)), (2,(14,100))
- (2,(3,5)), (1,(3,9)) | (2,(14,100))
- For 5 inserted → (1,(5,25))
- (3,(6,14)), (3,(19,125))
- (2,(3,5)), (1,(3,9)) | (2,(11,61)), (1,(8,64))
- For 7 inserted → (1,(7,49))
- (3,(6,14)), (4,(26,174))
- (2,(3,5)), (1,(3,9)) | (2,(11,61)), (2,(15,113))
67Evaluation of BIRCH
- Scales linearly finds a good clustering with a
single scan and improves the quality with a few
additional scans - Weakness handles only numeric data and sensitive
to the order of the data record.
68Data Summarization
- To compress the data into suitable representative
objects - OPTICS Data Bubble
Finding clusters from hierarchical clustering
depending on the resolution
69OPTICS
M. Ankerst, M. Breunig, H. Kriegel, J. Sander,
OPTICS Ordering Points to Identify the
Clustering Structure, ACM SIGMOD, 1999.
- Preliminary: Nε(q) is the subset of D contained in the ε-neighborhood of q (ε is a radius).
- Definition 1 (directly density-reachable): object p is directly density-reachable from object q w.r.t. ε and MinPts in a set of objects D if 1) p ∈ Nε(q) and 2) Card(Nε(q)) ≥ MinPts (Card(N) denotes the cardinality of the set N).
- Definitions
- Directly density-reachable (p. 51, Figure 2) → density-reachable: transitive closure of direct density-reachability
- Density-connected (p → o ← q)
- Core-distance_{ε,MinPts}(p) = MinPts-distance(p)
- Reachability-distance_{ε,MinPts}(p, o) w.r.t. o = max(core-distance(o), dist(o, p)) → Figure 4
- Ex) cluster ordering → reachability values, Fig. 12
70Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander,
Data Bubbles Quality Preserving Performance
Boosting for Hierarchical Clustering, ACM SIGMOD,
2001.
- ε-neighborhood of P
- k-distance of P: the distance d(P, O) such that for at least k objects O' ∈ D it holds that d(P, O') ≤ d(P, O), and for at most k-1 objects O' ∈ D it holds that d(P, O') < d(P, O).
- k-nearest neighbors of P
- MinPts-dist(P): a distance within which there are at least MinPts objects in the neighborhood of P.
71Data Bubbles
M. Breunig, H. Kriegel, P. Kroger, J. Sander,
Data Bubbles Quality Preserving Performance
Boosting for Hierarchical Clustering, ACM SIGMOD,
2001.
- Structural distortion
- Figure 11
- Data Bubbles: B = (n, rep, extent, nnDist)
- n: number of objects in X; rep: a representative object for X; extent: an estimate of the radius of X; nnDist: a partial function estimating k-nearest neighbor distances in X.
- dist(B, C) (page 6): dist(B.rep, C.rep) - (B.extent + C.extent) + B.nnDist(1) + C.nnDist(1) when the bubbles do not overlap; max(B.nnDist(1), C.nnDist(1)) otherwise.
72K-Means in SQL
C. Ordonez, Integrating K-Means Clustering with a
Relational DBMS Using SQL, IEEE TKDE 2006.
- Dataset Y = {y1, y2, ..., yn}: a d×n matrix, where each yi is a d×1 column vector
- K-Means: find k clusters by minimizing the squared error from the centers.
- Squared distance, Eq. (1), and objective function, Eq. (2)
- Matrices
- W: k weights (fractions of n), d×k matrix
- C: k means (centroids), d×k matrix
- R: k variances (squared distances), k×1 matrix
- Matrices
- Mj contains the d sums of point dimension values in cluster j: d×k matrix
- Qj contains the d sums of squared dimension values in cluster j: d×k matrix
- Nj contains the number of points in cluster j: k×1 matrix
- Intermediate tables: YH, YV, YD, YNN, NMQ, WCR (figure, p. 193)
73Y, YH, YV, C, CH, YD, YNN, NMQ, WCR (example with n = 5, d = 3, k = 2)

YH (horizontal Y):      YV (vertical Y):     CH (horizontal centroids):
i  Y1  Y2  Y3           i  l  val            j  Y1  Y2  Y3
1   1   2   3           1  1   1             1   1   2   3
2   1   2   3           1  2   2             2   9   8   7
3   9   8   7           1  3   3
4   9   8   7           2  1   1             C:
5   9   8   7           2  2   2             l  k  C1/C2
                        2  3   3             1  1    1
                        3  1   9             2  1    2
                        3  2   8             3  1    3
                        3  3   7             1  2    9
                        4  1   9             2  2    8
                        4  2   8             3  2    7
                        4  3   7
                        5  1   9
                        5  2   8
                        5  3   7

Insert into C Select 1, 1, Y1 From CH Where j = 1
...
Insert into C Select d, k, Yd From CH Where j = k

YD (squared distance to each center):   YNN (nearest center):
i   d1   d2                             i  j
1    0  116                             1  1
2    0  116                             2  1
3  116    0                             3  2
4  116    0                             4  2
5  116    0                             5  2

Insert into YD
Select i, sum((YV.val - C.C1)**2) AS d1, ..., sum((YV.val - C.Ck)**2) AS dk
From YV, C Where YV.l = C.l Group by i

Insert into YNN
Select i, CASE When d1 < d2 AND ... AND d1 < dk Then 1
               When d2 < d3 AND ... Then 2
               ...
               ELSE k END
From YD

NMQ:                       WCR:
l  j  N   M  Q             l  j  W    C  R
1  1  2   2  3             1  1  0.4  1  0
2  1  2   4  3             2  1  0.4  2  0
3  1  2   6  7             3  1  0.4  3  0
1  2  3  27  7             1  2  0.6  9  0
2  2  3  24  7             2  2  0.6  8  0
3  2  3  21  7             3  2  0.6  7  0

Insert into NMQ
Select l, j, sum(1.0) AS N, sum(YV.val) AS M, sum(YV.val * YV.val) AS Q
From YV, YNN Where YV.i = YNN.i Group by l, j
74Incremental Data Summarization
S. Nassar, J. Sander, C. Cheng, Incremental and
Effective Data Summarization for Dynamic
Hierarchical Clustering, ACM SIGMOD, 2004.
- For D = {Xi}, 1 ≤ i ≤ N, and a data bubble containing n objects, the data index is π = n/N.
- For D = {Xi} with mean μ and standard deviation σ of the data index, a bubble is
- good iff π ∈ [μ - σ, μ + σ],
- under-filled iff π < μ - σ, and
- over-filled iff π > μ + σ.
75Research Issues
- Dimensionality reduction
- Approximation
77Cure The Algorithm
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
- Draw a random sample s.
- Partition the sample into p partitions of size s/p
- Partially cluster each partition into s/(pq) clusters
- Eliminate outliers
- By random sampling
- If a cluster grows too slowly, eliminate it.
- Cluster the partial clusters.
- Label the data on disk
78Data Partitioning and Clustering
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
79Cure Shrinking Representative Points
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
- Shrink the multiple representative points towards the gravity center by a fraction α.
- Multiple representatives capture the shape of the cluster
80Density-Based Clustering Methods
- Clustering based on density (local cluster
criterion), such as density-connected points - Major features
- Discover clusters of arbitrary shape
- Handle noise
- One scan
- Need density parameters as termination condition
- Several interesting studies
- DBSCAN: Ester, et al. (KDD'96)
- OPTICS: Ankerst, et al. (SIGMOD'99)
- DENCLUE: Hinneburg & Keim (KDD'98)
- CLIQUE: Agrawal, et al. (SIGMOD'98)
81CLIQUE (Clustering In QUEst)
- Agrawal, Gehrke, Gunopulos, Raghavan, Automatic
Subspace Clustering of High Dimensional Data for
Data Mining Applications, ACM SIGMOD 1998. - Automatically identifying subspaces of a high
dimensional data space that allow better
clustering than original space - CLIQUE can be considered as both density-based
and grid-based - It partitions each dimension into the same number
of equal length interval - It partitions a d-dimensional data space into
non-overlapping rectangular units - A unit is dense if the fraction of total data
points contained in the unit exceeds the input
model parameter - A cluster is a maximal set of connected dense
units within a subspace
82(figure: example grid over age (20-60) and salary (x 10,000, 0-7) showing dense units, with a density threshold of 3)
83CLIQUE The Major Steps
Agrawal, Gehrke, Gunopulos, Raghavan, Automatic
Subspace Clustering of High Dimensional Data for
Data Mining Applications, ACM SIGMOD 1998.
- Partition the data space and find the number of points that lie inside each cell of the partition (a dense-unit sketch follows this list).
- Identify the subspaces that contain clusters using the Apriori principle
- Identify clusters
- Determine dense units in all subspaces of interest
- Determine connected dense units in all subspaces of interest
- Generate a minimal description for the clusters
- Determine maximal regions that cover a cluster of connected dense units for each cluster
- Determine a minimal cover for each cluster
84Strength and Weakness of CLIQUE
- Strength
- It automatically finds subspaces of the highest
dimensionality such that high density clusters
exist in those subspaces - It is insensitive to the order of records in
input and does not presume some canonical data
distribution - It scales linearly with the size of input and has
good scalability as the number of dimensions in
the data increases - Weakness
- The accuracy of the clustering result may be
degraded at the expense of simplicity of the
method
85Model based clustering
- Assume data generated from K probability
distributions - Typically Gaussian distribution Soft or
probabilistic version of K-means clustering - Need to find distribution parameters.
- EM Algorithm
86EM Algorithm
- Initialize K cluster centers
- Iterate between two steps
- Expectation step: assign points to clusters (soft assignments)
- Maximization step: estimate the model parameters (a sketch follows)
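A minimal NumPy sketch (not from the slides) of EM for a 1-D mixture of two Gaussians, alternating the E-step (soft assignments) and M-step (parameter re-estimation); the synthetic data are an assumption:

import numpy as np

def em_gmm_1d(x, n_iter=50):
    rng = np.random.default_rng(0)
    mu = rng.choice(x, 2, replace=False)    # initialize 2 cluster centers
    sigma = np.array([x.std(), x.std()])
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point
        dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing proportions, means and standard deviations
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk)
    return pi, mu, sigma

x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(6, 1, 200)])
print(em_gmm_1d(x))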
87CURE (Clustering Using REpresentatives)
- Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998 - Stops the creation of a cluster hierarchy if a
level consists of k clusters - Uses multiple representative points to evaluate
the distance between clusters, adjusts well to
arbitrary shaped clusters and avoids single-link
effect
88Drawbacks of Distance-Based Method
Guha, Rastogi Shim, CURE An Efficient
Clustering Algorithm for Large Databases, ACM
SIGMOD, 1998
- Drawbacks of square-error based clustering method
- Consider only one point as representative of a
cluster - Good only for convex shaped, similar size and
density, and if k can be reasonably estimated
89BIRCH
Zhang, Ramakrishnan, Livny, Birch Balanced
Iterative Reducing and Clustering using
Hierarchies, ACM SIGMOD 1996.
- Dependent on order of insertions
- Works for convex, isotropic clusters of uniform
size - Labeling Problem
- Centroid approach
- Labeling Problem even with correct centers, we
cannot label correctly
90Jensen-Shannon Divergence
- Jensen-Shannon (JS) divergence between two probability distributions
- Jensen-Shannon (JS) divergence between a finite number of probability distributions
- (the defining formulas are given below)
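The formula images are missing; the standard definitions, assumed to be what the slide showed, with mixture weights \pi_i summing to 1 and m the weighted mixture, are:

JS_{\pi}(p_1, p_2) = \pi_1 KL(p_1 \| m) + \pi_2 KL(p_2 \| m), \quad m = \pi_1 p_1 + \pi_2 p_2
JS_{\pi}(p_1, \ldots, p_n) = \sum_i \pi_i KL(p_i \| m) = H\Big(\sum_i \pi_i p_i\Big) - \sum_i \pi_i H(p_i), \quad m = \sum_i \pi_i p_i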
91Information-Theoretic Clustering (preserving
mutual information)
- (Lemma) The loss in mutual information equals
- Interpretation Quality of each cluster is
measured by the Jensen-Shannon Divergence between
the individual distributions in the cluster. - Can rewrite the above as
- Goal Find a clustering that minimizes the above
loss
92Information Theoretic Co-clustering (preserving
mutual information)
- (Lemma) The loss in mutual information equals the KL divergence between p(x,y) and the approximation q(x,y)
- where q is the distribution induced by the co-clustering
- It can be shown that q(x,y) is a maximum-entropy approximation to p(x,y).
- q(x,y) preserves the marginals: q(x) = p(x), q(y) = p(y)
93The number of parameters that determine q is (m - k) + (kl - 1) + (n - l)
94Preserving Mutual Information
- Lemma
-
- Note that may be thought of as
the prototype of row cluster (the usual
centroid of the cluster is
) - Similarly,
95Example Continued
96Co-Clustering Algorithm
97Properties of Co-clustering Algorithm
- Theorem The co-clustering algorithm
monotonically decreases loss in mutual
information (objective function value) - Marginals p(x) and p(y) are preserved at every
step (q(x)p(x) and q(y)p(y) ) - Can be generalized to higher dimensions
99Applications -- Text Classification
- Assigning class labels to text documents
- Training and Testing Phases
- (figure: a document collection grouped into classes Class-1, ..., Class-m serves as training data for a classifier, which assigns a class label to each new document)
100Dimensionality Reduction
- Feature Selection
- Select the best words; throw away the rest
- Frequency-based pruning
- Information-criterion-based pruning
- A document (bag of words) becomes a vector over the selected words Word1, ..., Wordk
- Feature Clustering
- Do not throw away words
- Cluster words instead
- Use the word clusters as features
- A document (bag of words) becomes a vector over the clusters Cluster1, ..., Clusterk
101Experiments
- Data sets
- 20 Newsgroups data
- 20 classes, 20000 documents
- Classic3 data set
- 3 classes (cisi, med and cran), 3893 documents
- Dmoz Science HTML data
- 49 leaves in the hierarchy
- 5000 documents with 14538 words
- Available at http://www.cs.utexas.edu/users/manyam/dmoz.txt
- Implementation Details
- Bow toolkit for indexing, co-clustering, clustering and classifying
102Naïve Bayes with word clusters
- Naïve Bayes classifier
- Assign document d to the class with the highest
score - Relation to KL Divergence
- Using word clusters instead of words
- where parameters for clusters are estimated
according to joint statistics
103Selecting Correlated Attributes
T. Fukuda, Y. Morimoto, S. Morishita, T.
Tokuyama, Constructing Efficient Decision Trees
by Using Optimized Numeric Association Rules,
Proc. of VLDB Conf., 1996.
- Attributes A1 and A2 are decided to be strongly correlated iff the correlation criterion exceeds a threshold θ ≥ 1
104MDL-based Decision Tree Pruning
- M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995. - Two steps for induction of decision trees
- Construct a DT using training data
- Reduce the DT by pruning to prevent overfitting
- Possible approaches
- Cost-complexity pruning using a separate set of
samples for pruning - DT pruning using the same training data sets for
testing - MDL-based pruning using Minimum Description
Length (MDL) principle.
105Pruning Using MDL Principle
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
- View the decision tree as a means for efficiently encoding the classes of records in the training set
- MDL principle: the best tree is the one that can encode the records using the fewest bits
- Cost of encoding the tree includes
- 1 bit for encoding the type of each node (e.g., leaf or internal)
- Csplit: cost of encoding the attribute and value for each split
- n*E: cost of encoding the n records in each leaf (E is the entropy)
106Pruning Using MDL Principle
M. Mehta, J. Rissanen, R. Agrawal, MDL-based
Decision Tree Pruning, Proc. on KDD Conf., 1995.
- Problem: compute the minimum-cost subtree at the root of the built tree
- Suppose minCN is the cost of encoding the minimum-cost subtree rooted at N
- Prune the children of a node N if minCN = n*E + 1
- Compute minCN as follows
- N is a leaf: n*E + 1
- N has children N1 and N2: min{n*E + 1, Csplit + 1 + minCN1 + minCN2}
- Prune the tree in a bottom-up fashion (a sketch follows)
107MDL Pruning - Example
R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998.
- Cost of encoding the records in N: (n*E + 1) = 3.8
- Csplit = 2.6
- minCN = min{3.8, 2.6 + 1 + 1 + 1} = 3.8
- Since minCN = n*E + 1, N1 and N2 are pruned
108PUBLIC
- R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998. - Prune tree during (not after) building phase
- Execute pruning algorithm (periodically) on
partial tree - Problem how to compute minCN for a yet to be
expanded leaf N in a partial tree - Solution compute lower bound on the subtree cost
at N and use this as minCN when pruning - minCN is thus a lower bound on the cost of
subtree rooted at N - Prune children of a node N if minCN nE1
- Guaranteed to generate identical tree to that
generated by SPRINT
109PUBLIC(1)
R. Rastogi, K. Shim, PUBLIC A Decision Tree
Classifier that Integrates Building and Pruning,
Proc. of VLDB Conf., 1998.
sal education Label
10K High-school Reject
40K Under Accept
15K Under Reject
75K grad Accept
18K grad Accept
- Simple lower bound for a subtree: 1
- Cost of encoding the records in N: n*E + 1 = 5.8
- Csplit = 4
- minCN = min{5.8, 4 + 1 + 1 + 1} = 5.8
- Since minCN = n*E + 1, N1 and N2 are pruned
110PUBLIC(S)
- Theorem: the cost of any subtree with s splits and rooted at node N is at least 2s + 1 + s*log a + Σ_{i=s+2..k} ni
- a is the number of attributes
- k is the number of classes
- ni (≥ ni+1) is the number of records belonging to class i
- The lower bound on the subtree cost at N is thus the minimum over s of
- n*E + 1 (cost with zero splits)
- 2s + 1 + s*log a + Σ_{i=s+2..k} ni
111Whats Clustering
- Clustering is a kind of unsupervised learning.
- Clustering is a method of grouping data that share similar trends and patterns.
- Clustering of data is a method by which large sets of data are grouped into clusters of smaller sets of similar data.
- Example (before and after clustering)
- Thus, clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.
112Partitional Algorithms
- Enumerate K partitions optimizing some criterion
- Example square-error criterion
- where x_i^(j) is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster.
113Squared Error Clustering Method
- Select an initial partition of the patterns with
a fixed number of clusters and cluster centers - Assign each pattern to its closest cluster center
and compute the new cluster centers as the
centroids of the clusters. Repeat this step until
convergence is achieved, i.e., until the cluster
membership is stable. - Merge and split clusters based on some heuristic
information, optionally repeating step 2.
114Agglomerative Clustering Algorithm
- Place each pattern in its own cluster. Construct
a list of interpattern distances for all distinct
unordered pairs of patterns, and sort this list
in ascending order - Step through the sorted list of distances,
forming for each distinct dissimilarity value dk
a graph on the patterns where pairs of patterns
closer than dk are connected by a graph edge. If
all the patterns are members of a connected
graph, stop. Otherwise, repeat this step. - The output of the algorithm is a nested hierarchy
of graphs with can be cut at a desired
dissimilarity level forming a partition
identified by simply connected components in the
corresponding graph.
115Agglomerative Hierarchical Clustering
- The most widely used hierarchical clustering algorithm
- Initially each point is a distinct cluster
- Repeatedly merge the closest clusters until the number of clusters becomes K
- "Closest" can be defined by dmean(Ci, Cj) or dmin(Ci, Cj)
- Likewise dave(Ci, Cj) and dmax(Ci, Cj)
116Clustering
- Summary of Drawbacks of Traditional Methods
- Partitional algorithms split large clusters
- Centroid-based method splits large and
non-hyperspherical clusters - Centers of subclusters can be far apart
- Minimum spanning tree algorithm is sensitive to
outliers and slight change in position - Exhibits chaining effect on string of outliers
- Cannot scale up for large databases
117Model-based Clustering
- Mixture of Gaussians
- Gaussian components with priors P(ωi)
- Each data point is distributed as N(μi, σ²I)
- Consider
- Data points x1, x2, ..., xN
- Parameters P(ω1), ..., P(ωk), σ
- Likelihood function
- Maximize the likelihood function by calculating the parameter values at which its derivative vanishes
118Overview of EM Clustering
- Extensions and generalizations. The EM
(expectation maximization) algorithm extends the
k-means clustering technique in two important
ways - Instead of assigning cases or observations to
clusters to maximize the differences in means for
continuous variables, the EM clustering algorithm
computes probabilities of cluster memberships
based on one or more probability distributions.
The goal of the clustering algorithm then is to
maximize the overall probability or likelihood of
the data, given the (final) clusters. - Unlike the classic implementation of k-means
clustering, the general EM algorithm can be
applied to both continuous and categorical
variables (note that the classic k-means
algorithm can also be modified to accommodate
categorical variables).
119EM Algorithm
- The EM algorithm for clustering is described in
detail in Witten and Frank (2001). - The basic approach and logic of this clustering
method is as follows. - Suppose you measure a single continuous variable
in a large sample of observations. - Further, suppose that the sample consists of two
clusters of observations with different means
(and perhaps different standard deviations)
within each sample, the distribution of values
for the continuous variable follows the normal
distribution. - The resulting distribution of values (in the
population) may look like this
120EM v.s. k-Means
- Classification probabilities instead of
classifications. The results of EM clustering are
different from those computed by k-means
clustering. The latter will assign observations
to clusters to maximize the distances between
clusters. The EM algorithm does not compute
actual assignments of observations to clusters,
but classification probabilities. In other words,
each observation belongs to each cluster with a
certain probability. Of course, as a final result
you can usually review an actual assignment of
observations to clusters, based on the (largest)
classification probability.
121Finding k
- V-fold cross-validation. This type of
cross-validation is useful when no test sample is
available and the learning sample is too small to
have the test sample taken from it. A specified V
value for V-fold cross-validation determines the
number of random subsamples, as equal in size as
possible, that are formed from the learning
sample. The classification tree of the specified
size is computed V times, each time leaving out
one of the subsamples from the computations, and
using that subsample as a test sample for
cross-validation, so that each subsample is used
V - 1 times in the learning sample and just once
as the test sample. The CV costs computed for
each of the V test samples are then averaged to
give the V-fold estimate of the CV costs.
122Expectation Maximization
- A mixture of Gaussians
- Ex: x1 = 30 with P(x1) = 1/2, x2 = 18 with P(x2) = μ, x3 = 0 with P(x3) = 2μ, x4 = 23 with P(x4) = 1/2 - 3μ
- Likelihood for X: x1 observed for a students, x2 for b students, x3 for c students, x4 for d students
- To maximize L, work with the log-likelihood. Supposing a = 14, b = 6, c = 9, d = 10, then μ = 1/10. If only the combined count h of x1 and x2 is observed (a + b = h), then a = h/(2μ + 1) and b = 2μh/(2μ + 1).
123Gaussian (Normal) pdf
- The Gaussian function with mean (μ) and standard deviation (σ). Properties of the function:
- Symmetric about the mean
- Attains its maximum value at the mean and its minimum value at plus and minus infinity
- The distribution is often referred to as bell shaped
- At one standard deviation from the mean the function has dropped to about 2/3 of its maximum value; at two standard deviations it has fallen to about 1/7.
- The area under the function within one standard deviation of the mean is about 0.682, within two standard deviations 0.9545, and within three 0.9973. The total area under the curve is 1.
124Gaussian
Think of the cumulative distribution, F_{μ,σ²}(x)
125Multi-variate Density Estimation
Mixture of Gaussians
- The parameter vector contains all the parameters of the mixture model; the pi are known as mixing proportions or coefficients.
- A mixture of Gaussians model
- Generic mixture: draw a component y from P(y) (e.g., y1 or y2), then draw x from the corresponding component density P(x|y1) or P(x|y2)
126Mixture Density
- If we are given just x we do not know which
mixture component this example came from - We can evaluate the posterior probability that an
observed x was generated from the first mixture
component