Title: Chapter 5: Clustering
1. Chapter 5: Clustering
2. Searching for groups
- Clustering is unsupervised or undirected.
- Unlike classification, clustering uses no pre-classified data.
- Search for groups or clusters of data points (records) that are similar to one another.
- Similar points may mean similar customers or products that will behave in similar ways.
3. Group similar points together
- Group points into classes using some distance measure.
- Within-cluster distance and between-cluster distance
- Applications
  - As a stand-alone tool to get insight into the data distribution
  - As a preprocessing step for other algorithms
4. An Illustration
5. Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
- Insurance: identifying groups of motor insurance policy holders with some interesting characteristics
- City planning: identifying groups of houses according to their house type, value, and geographical location
6. Concepts of Clustering
- Clusters
- Different ways of representing clusters
  - Division with boundaries
  - Spheres
  - Probabilistic
  - Dendrograms
(Illustration: cluster representations, e.g., probabilistic membership of items I1 ... In in clusters 1, 2, 3 with values such as 0.5, 0.2, 0.3.)
7. Clustering
- Clustering quality
  - Inter-cluster distance: maximized
  - Intra-cluster distance: minimized
- The quality of a clustering result depends on both the similarity measure used by the method and its application.
- The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- Clustering vs. classification
  - Which one is more difficult? Why?
- There are a huge number of clustering techniques.
8. Dissimilarity/Distance Measure
- Dissimilarity/similarity metric: similarity is expressed in terms of a distance function, typically a metric d(i, j).
- The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
- Weights should be associated with different variables based on applications and data semantics.
- It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
9. Types of data in clustering analysis
- Interval-scaled variables
- Binary variables
- Nominal, ordinal, and ratio variables
- Variables of mixed types
10. Interval-valued variables
- Continuous measurements on a roughly linear scale, e.g., weight, height, temperature, etc.
- Standardize data (depending on the application):
  - Calculate the mean absolute deviation: s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|), where m_f = (1/n)(x_1f + x_2f + ... + x_nf)
  - Calculate the standardized measurement (z-score): z_if = (x_if - m_f) / s_f
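A minimal sketch of this standardization, assuming a hypothetical list of interval-scaled measurements (the slides do not prescribe any particular code):

```python
# Standardize interval-scaled values using the mean absolute deviation (sketch).

def standardize(values):
    n = len(values)
    m = sum(values) / n                          # mean m_f
    s = sum(abs(x - m) for x in values) / n      # mean absolute deviation s_f
    return [(x - m) / s for x in values]         # z-scores z_if

heights_cm = [150, 160, 170, 180, 190]           # hypothetical measurements
print(standardize(heights_cm))                   # approx. [-1.67, -0.83, 0.0, 0.83, 1.67]
```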
11. Similarity Between Objects
- Distance: measures the similarity or dissimilarity between two data objects.
- Some popular ones include the Minkowski distance: d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q), where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance.
12. Similarity Between Objects (Cont.)
- If q = 2, d is the Euclidean distance.
- Properties
  - d(i,j) ≥ 0
  - d(i,i) = 0
  - d(i,j) = d(j,i)
  - d(i,j) ≤ d(i,k) + d(k,j)
- Also, one can use weighted distance and many other similarity/distance measures.
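As an illustrative sketch (not part of the original slides), the Minkowski distance and its two special cases can be computed as follows; the vectors i and j are made-up examples:

```python
# Minkowski distance between two p-dimensional objects (illustrative sketch).

def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1 / q)

i = [1.0, 2.0, 3.0]        # hypothetical p-dimensional objects
j = [4.0, 6.0, 3.0]

print(minkowski(i, j, 1))  # q = 1: Manhattan distance = 3 + 4 + 0 = 7
print(minkowski(i, j, 2))  # q = 2: Euclidean distance = sqrt(9 + 16 + 0) = 5
```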
13. Binary Variables
- A contingency table for binary data (counting, over the p binary variables, how often objects i and j agree or disagree):

                 Object j
                  1    0   sum
   Object i  1    a    b   a+b
             0    c    d   c+d
           sum   a+c  b+d    p

- Simple matching coefficient (invariant, if the binary variable is symmetric): d(i,j) = (b+c) / (a+b+c+d)
- Jaccard coefficient (noninvariant if the binary variable is asymmetric): d(i,j) = (b+c) / (a+b+c)
14. Dissimilarity of Binary Variables
- Example
  - gender is a symmetric attribute (not used below)
  - the remaining attributes are asymmetric attributes
  - let the values Y and P be set to 1, and the value N be set to 0
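A small sketch of both coefficients, using made-up 0/1 records rather than the slide's example table; the record names are assumptions:

```python
# Simple matching vs. Jaccard dissimilarity for binary records (illustrative).

def binary_dissimilarities(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)   # 1-1 matches
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)   # 0-0 matches
    simple_matching = (b + c) / (a + b + c + d)
    jaccard = (b + c) / (a + b + c) if (a + b + c) else 0.0
    return simple_matching, jaccard

obj_i = [1, 0, 1, 0, 0, 0]   # hypothetical asymmetric binary attributes (Y/P -> 1, N -> 0)
obj_j = [1, 0, 1, 0, 1, 0]
print(binary_dissimilarities(obj_i, obj_j))   # (0.1666..., 0.3333...)
```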
15. Nominal Variables
- A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green, etc.
- Method 1: simple matching
  - d(i,j) = (p - m) / p, where m = number of matches and p = total number of variables
- Method 2: use a large number of binary variables
  - create a new binary variable for each of the M nominal states
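Both methods can be sketched in a few lines; the color/shape/size records below are hypothetical examples, not from the slides:

```python
# Nominal variables: simple matching dissimilarity and one-hot recoding (sketch).

def nominal_dissimilarity(x, y):
    p = len(x)                                    # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)    # number of matches
    return (p - m) / p

def one_hot(value, states):
    # Method 2: one binary variable per nominal state.
    return [1 if value == s else 0 for s in states]

obj_i = ["red", "circle", "small"]                # hypothetical nominal records
obj_j = ["red", "square", "small"]
print(nominal_dissimilarity(obj_i, obj_j))        # (3 - 2) / 3 = 0.333...
print(one_hot("red", ["red", "yellow", "blue", "green"]))  # [1, 0, 0, 0]
```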
16. Ordinal Variables
- An ordinal variable can be discrete or continuous.
- Order is important, e.g., rank.
- Can be treated like interval-scaled (f is a variable):
  - replace x_if by its rank r_if in {1, ..., M_f}
  - map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by z_if = (r_if - 1) / (M_f - 1)
  - compute the dissimilarity using methods for interval-scaled variables
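A brief sketch of the rank-and-rescale step, assuming a hypothetical low/medium/high rating variable:

```python
# Ordinal variable -> ranks r_if -> z_if = (r_if - 1) / (M_f - 1) in [0, 1] (sketch).

ratings = ["low", "medium", "high", "medium", "low"]   # hypothetical ordinal values
order = {"low": 1, "medium": 2, "high": 3}             # ranks r_if
M = len(order)                                         # number of states M_f

z = [(order[r] - 1) / (M - 1) for r in ratings]
print(z)                 # [0.0, 0.5, 1.0, 0.5, 0.0]
print(abs(z[0] - z[2]))  # interval-style dissimilarity between "low" and "high" = 1.0
```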
17. Ratio-Scaled Variables
- Ratio-scaled variable: a measurement on a nonlinear scale, approximately at exponential scale, such as Ae^(Bt) or Ae^(-Bt), e.g., growth of a bacteria population.
- Methods
  - treat them like interval-scaled variables: not a good idea! (why? the scale can be distorted)
  - apply a logarithmic transformation: y_if = log(x_if)
  - treat them as continuous ordinal data and then treat their ranks as interval-scaled
18. Variables of Mixed Types
- A database may contain all six types of variables
  - symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio
- One may use a weighted formula to combine their effects: d(i,j) = Σ_f δ_ij^(f) d_ij^(f) / Σ_f δ_ij^(f), where δ_ij^(f) = 0 when the comparison on variable f is missing or invalid and 1 otherwise
  - f is binary or nominal: d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise
  - f is interval-based: use the normalized distance
  - f is ordinal or ratio-scaled: compute ranks r_if and z_if = (r_if - 1) / (M_f - 1), and treat z_if as interval-scaled
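A compact sketch of the weighted combination, under assumed variable types and made-up records; missing values get δ = 0, so they drop out of the sums:

```python
# Mixed-type dissimilarity d(i,j) = sum_f delta^f * d^f / sum_f delta^f (sketch).

def mixed_dissimilarity(x, y, types, ranges=None, levels=None):
    num, den = 0.0, 0.0
    for f, t in enumerate(types):
        if x[f] is None or y[f] is None:      # missing value: delta = 0, skip
            continue
        if t in ("binary", "nominal"):
            d = 0.0 if x[f] == y[f] else 1.0
        elif t == "interval":
            lo, hi = ranges[f]                # range of variable f over the data set
            d = abs(x[f] - y[f]) / (hi - lo)  # normalized distance
        elif t == "ordinal":
            M = levels[f]                     # number of ordered states M_f
            d = abs((x[f] - 1) / (M - 1) - (y[f] - 1) / (M - 1))
        num += d                              # delta = 1 for a valid comparison
        den += 1.0
    return num / den if den else 0.0

# Hypothetical records: (color: nominal, income: interval, rating: rank 1..5 ordinal)
types = ["nominal", "interval", "ordinal"]
print(mixed_dissimilarity(["red", 30_000, 2], ["blue", 50_000, 4],
                          types, ranges={1: (0, 100_000)}, levels={2: 5}))  # ~0.57
```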
19. Major Clustering Techniques
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion
- Hierarchy algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
- Density-based: based on connectivity and density functions
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to the models
20. Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
  - Global optimum: exhaustively enumerate all partitions
  - Heuristic methods: k-means and k-medoids algorithms
    - k-means: each cluster is represented by the center of the cluster
    - k-medoids or PAM (Partitioning Around Medoids): each cluster is represented by one of the objects in the cluster
21. The K-Means Clustering
- Given k, the k-means algorithm is as follows:
  1. Choose k cluster centers to coincide with k randomly chosen points
  2. Assign each data point to the closest cluster center
  3. Recompute the cluster centers using the current cluster memberships
  4. If a convergence criterion is not met, go to step 2
- Typical convergence criteria are no (or minimal) reassignment of data points to new cluster centers, or a minimal decrease in the squared error E = Σ_{i=1..k} Σ_{p ∈ C_i} |p - m_i|², where p is a point and m_i is the mean of cluster C_i.
22. Example
- For simplicity, 1-dimensional data and k = 2.
- data: 1, 2, 5, 6, 7
- K-means:
  - Randomly select 5 and 6 as initial centroids
  - => Two clusters {1, 2, 5} and {6, 7}; mean(C1) = 8/3, mean(C2) = 6.5
  - => {1, 2} and {5, 6, 7}; mean(C1) = 1.5, mean(C2) = 6
  - => no change.
  - Aggregate dissimilarity (squared error): 0.5² + 0.5² + 1² + 0² + 1² = 2.5
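A minimal 1-D k-means sketch (not the lecture's code) that reproduces this worked example, starting from the same initial centroids 5 and 6:

```python
# 1-D k-means: assign points, recompute means, stop when centroids are stable (sketch).

def kmeans_1d(data, centroids, max_iter=100):
    clusters = []
    for _ in range(max_iter):
        # Step 2: assign each point to the closest centroid.
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 3: recompute centroids (assumes no cluster becomes empty).
        new_centroids = [sum(c) / len(c) for c in clusters]
        # Step 4: stop when there is no change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    sse = sum((x - centroids[i]) ** 2 for i, c in enumerate(clusters) for x in c)
    return clusters, centroids, sse

clusters, centroids, sse = kmeans_1d([1, 2, 5, 6, 7], [5, 6])
print(clusters, centroids, sse)   # [[1, 2], [5, 6, 7]] [1.5, 6.0] 2.5
```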
23. Comments on K-Means
- Strength: efficient, O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Normally, k, t << n.
- Comment: often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weakness
  - Applicable only when the mean is defined; difficult for categorical data
  - Need to specify k, the number of clusters, in advance
  - Sensitive to noisy data and outliers
  - Not suitable for discovering clusters with non-convex shapes
  - Sensitive to initial seeds
24. Variations of the K-Means Method
- A few variants of k-means differ in
  - Selection of the initial k seeds
  - Dissimilarity measures
  - Strategies to calculate cluster means
- Handling categorical data: k-modes
  - Replacing means of clusters with modes
  - Using new dissimilarity measures to deal with categorical objects
  - Using a frequency-based method to update modes of clusters
25. The k-Medoids Clustering Method
- The k-means algorithm is sensitive to outliers
  - Since an object with an extremely large value may substantially distort the distribution of the data.
- Medoid: the most centrally located point in a cluster, used as a representative point of the cluster.
- An example (illustration: initial medoids)
- In contrast, a centroid is not necessarily inside a cluster.
26. Partitioning Around Medoids (PAM)
- PAM:
  1. Given k
  2. Randomly pick k instances as initial medoids
  3. Assign each data point to its nearest medoid x
  4. Calculate the objective function: the sum of dissimilarities of all points to their nearest medoids (squared-error criterion)
  5. Randomly select a point y
  6. Swap x with y if the swap reduces the objective function
  7. Repeat steps 3-6 until no change
27. Comments on PAM
(Illustration: a cluster with an outlier 100 units away.)
- PAM is more robust than k-means in the presence of noise and outliers because a medoid is less influenced by outliers or other extreme values than a mean (why?)
- PAM works well for small data sets but does not scale well for large data sets.
  - O(k(n-k)²) for each change, where n is the number of data points and k is the number of clusters
28. CLARA (Clustering LARge Applications)
- CLARA: built into statistical analysis packages, such as S
- It draws multiple samples of the data set, applies PAM to each sample, and gives the best clustering as the output
- Strength: deals with larger data sets than PAM
- Weakness
  - Efficiency depends on the sample size
  - A good clustering based on samples will not necessarily represent a good clustering of the whole data set if the sample is biased
- There are other scale-up methods, e.g., CLARANS
29. Hierarchical Clustering
- Uses a distance matrix for clustering. This method does not require the number of clusters k as an input, but needs a termination condition.
30. Agglomerative Clustering
- At the beginning, each data point forms a cluster (also called a node).
- Merge the nodes/clusters that have the least dissimilarity.
- Continue merging.
- Eventually all nodes belong to the same cluster.
31. A Dendrogram Shows How the Clusters are Merged Hierarchically
- Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
- A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster.
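As an illustrative sketch (assuming SciPy is available; the toy points are made up), agglomerative clustering and the dendrogram cut can look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Two visually separated groups of toy 2-D points (hypothetical data).
X = np.array([[1.0, 1.0], [1.5, 1.2], [1.2, 0.8],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# Agglomerative clustering: start from singleton clusters and repeatedly merge
# the two least-dissimilar clusters ('single' = minimum-distance linkage).
Z = linkage(X, method="single", metric="euclidean")

# Cut the dendrogram so that at most two clusters remain; each connected
# component below the cut becomes one cluster.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # e.g., [1 1 1 2 2 2]
```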
32. Divisive Clustering
- Inverse order of agglomerative clustering
- Eventually each node forms a cluster on its own
33. More on Hierarchical Methods
- Major weaknesses of agglomerative clustering methods
  - do not scale well: time complexity at least O(n²), where n is the total number of objects
  - can never undo what was done previously
- Integration of hierarchical with distance-based clustering to scale up these clustering methods
  - BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters
  - CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction
34. Summary
- Cluster analysis groups objects based on their similarity and has wide applications.
- Measures of similarity can be computed for various types of data.
- Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, etc.
- Clustering can also be used for outlier detection, which is useful for fraud detection.
- What is the best clustering algorithm?
35. Other Data Mining Methods
36. Sequence analysis
- Market basket analysis analyzes things that happen at the same time.
- What about things that happen over time?
  - E.g., if a customer buys a bed, he/she is likely to come back later to buy a mattress
- Sequential analysis needs
  - A time stamp for each data record
  - Customer identification
37. Sequence analysis (cont.)
- The analysis shows which items come before, after, or at the same time as other items.
- Sequential patterns can be used for analyzing cause and effect.
- Other applications
  - Finding cycles in association rules
    - Some association rules hold strongly in certain periods of time
    - E.g., every Monday people buy items X and Y together
  - Stock market prediction
  - Predicting possible failures in networks, etc.
38. Discovering holes in data
- Holes are empty (sparse) regions in the data space that contain few or no data points. Holes may represent impossible value combinations in the application domain.
- E.g., in a disease database, we may find that certain test values and/or symptoms do not go together, or that when a certain medicine is used, some test values never go beyond a certain range.
- Such information could lead to a significant discovery: a cure for a disease or some biological law.
39. Data and pattern visualization
- Data visualization: use computer graphics to reveal the patterns in data
  - 2-D and 3-D scatter plots, bar charts, pie charts, line plots, animation, etc.
- Pattern visualization: use a good interface and graphics to present the results of data mining
  - Rule visualizers, cluster visualizers, etc.
40. Scaling up data mining algorithms
- Adapt data mining algorithms to work on very large databases.
  - Data reside on hard disk (too large to fit in main memory)
  - Make fewer passes over the data
  - Quadratic algorithms are too expensive
- Many data mining algorithms are quadratic, especially clustering algorithms.