Title: Clustering Tutorial
1Clustering Tutorial
- Elias Raftopoulos
- HY539 29/3/06
- Prof. Maria Papadopouli
2Roadmap
- Math Reminder
- Principal Components Analysis
- Clustering
- ANOVA
3Standard Deviation
- Statistics analyzes data sets in terms of the relationships between the individual points
- Standard deviation is a measure of the spread of the data
- Calculation: the average distance of the data points from the mean, $s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}$
4Variance
- Another measure of the spread of the data in a data set
- Calculation: $\mathrm{Var}(X) = E\big((X - \mu)^2\big)$
- Why have both variance and SD to calculate the spread of data?
- Variance is claimed to be the original statistical measure of spread. However, its unit is a square, e.g. cm², which is unnatural for expressing heights or other measures. Hence SD, the square root of variance, was born. (A short Python sketch of both follows below.)
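A minimal Python sketch of the mean, sample variance, and standard deviation; the height values are made up for illustration:

```python
import math

# Made-up heights in cm
heights = [170, 165, 180, 175, 160]
mean = sum(heights) / len(heights)

# Sample variance: squared deviations from the mean, divided by n - 1
variance = sum((x - mean) ** 2 for x in heights) / (len(heights) - 1)
std_dev = math.sqrt(variance)  # back in the original units (cm)

print(mean, variance, std_dev)  # 170.0 62.5 7.905...
```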
5Covariance
- Variance measures the deviation from the mean for points in one dimension, e.g. heights
- Covariance is a measure of how much each of the dimensions varies from the mean with respect to the others
- Covariance is measured between 2 dimensions to see if there is a relationship between them, e.g. number of hours studied vs. marks obtained: $\mathrm{cov}(X, Y) = E\big((X - \mu_X)(Y - \mu_Y)\big)$
- The covariance between one dimension and itself is the variance
6Covariance Matrix
- Representing covariance between dimensions as a matrix, e.g. for 3 dimensions:
  $$C = \begin{pmatrix} \mathrm{cov}(x,x) & \mathrm{cov}(x,y) & \mathrm{cov}(x,z) \\ \mathrm{cov}(y,x) & \mathrm{cov}(y,y) & \mathrm{cov}(y,z) \\ \mathrm{cov}(z,x) & \mathrm{cov}(z,y) & \mathrm{cov}(z,z) \end{pmatrix}$$
- The diagonal holds the variances of x, y and z
- cov(x,y) = cov(y,x), hence the matrix is symmetric about the diagonal
- n-dimensional data results in an n×n covariance matrix (see the NumPy sketch below)
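A minimal NumPy sketch of a 3-dimensional covariance matrix; the x, y, z samples are made up for illustration:

```python
import numpy as np

# Made-up observations of three variables
x = [2.1, 2.5, 3.6, 4.0, 4.8]
y = [8.0, 10.0, 12.0, 14.0, 16.0]
z = [5.0, 4.1, 3.3, 2.2, 1.0]

C = np.cov(np.array([x, y, z]))   # 3x3 covariance matrix (rows are variables)
print(C)                          # diagonal entries are the variances of x, y, z
print(np.allclose(C, C.T))        # True: cov(x,y) == cov(y,x), so C is symmetric
```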
7Covariance
- The exact value is not as important as its sign.
- A positive value of covariance indicates that both dimensions increase or decrease together, e.g. as the number of hours studied increases, the marks in that subject increase.
- A negative value indicates that while one increases the other decreases, or vice-versa, e.g. active social life at RIT vs. performance in the CS dept.
- If the covariance is zero, the two dimensions are uncorrelated (no linear relationship between them), e.g. heights of students vs. the marks obtained in a subject
8Transformation matrices
- Consider
  $$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
- The square transformation matrix transforms (3,2) from its original location. Now if we were to take a multiple of (3,2):
  $$2 \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 6 \\ 4 \end{pmatrix}, \qquad \begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 6 \\ 4 \end{pmatrix} = \begin{pmatrix} 24 \\ 16 \end{pmatrix} = 4 \begin{pmatrix} 6 \\ 4 \end{pmatrix}$$
9Transformation matrices
- Scale vector (3,2) by a value 2 to get (6,4)
- Multiply by the square transformation matrix
- We see the result is still a multiple of 4.
- WHY?
- A vector consists of both length and direction. Scaling a vector only changes its length, not its direction. This is an important observation about matrix transformations, leading to the formation of eigenvectors and eigenvalues.
- Irrespective of how much we scale (3,2) by, the result of the transformation is always 4 times the scaled vector.
10eigenvalue problem
- The eigenvalue problem is any problem having the following form:
  $$A \cdot v = \lambda \cdot v$$
- A: n × n matrix
- v: n × 1 non-zero vector
- λ: scalar
- Any value of λ for which this equation has a solution is called an eigenvalue of A, and the vector v which corresponds to this value is called an eigenvector of A.
11eigenvalue problem
- $$\begin{pmatrix} 2 & 3 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 3 \\ 2 \end{pmatrix} = \begin{pmatrix} 12 \\ 8 \end{pmatrix} = 4 \begin{pmatrix} 3 \\ 2 \end{pmatrix}$$
- A · v = λ · v
- Therefore, (3,2) is an eigenvector of the square matrix A and 4 is an eigenvalue of A
- Given matrix A, how can we calculate the eigenvectors and eigenvalues for A?
12Calculating eigenvectors eigenvalues
- Given A · v = λ · v
- A · v − λ · I · v = 0
- (A − λ · I) · v = 0
- Finding the roots of the determinant |A − λ · I| = 0 gives the eigenvalues, and for each of these eigenvalues there will be an eigenvector
- Example:
13Calculating eigenvectors eigenvalues
- If $A = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix}$
- Then $A - \lambda I = \begin{pmatrix} 0 & 1 \\ -2 & -3 \end{pmatrix} - \begin{pmatrix} \lambda & 0 \\ 0 & \lambda \end{pmatrix} = \begin{pmatrix} -\lambda & 1 \\ -2 & -3-\lambda \end{pmatrix}$
- Setting the determinant to zero: $\lambda^2 + 3\lambda + 2 = 0$
- This gives us 2 eigenvalues: $\lambda_1 = -1$ and $\lambda_2 = -2$ (checked numerically in the sketch below)
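A quick numerical check of these eigenvalues with NumPy (a minimal sketch):

```python
import numpy as np

A = np.array([[0, 1],
              [-2, -3]])

# Eigenvalues are the roots of det(A - lambda*I) = lambda^2 + 3*lambda + 2 = 0
eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)   # [-1. -2.] (order may vary)
print(eigvecs)   # columns are the corresponding eigenvectors
```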
14Properties of eigenvectors and eigenvalues
- Note that irrespective of how much we scale (3,2) by, the result is always 4 times the scaled vector.
- Eigenvectors can only be found for square matrices, and not every square matrix has eigenvectors.
- Given an n × n matrix (that has eigenvectors), we can find up to n of them.
15Roadmap
- Principal Components Analysis
- Clustering
- ANOVA
16PCA
- Principal components analysis (PCA) is a technique that can be used to simplify a dataset
- It is a linear transformation that chooses a new coordinate system for the data set such that
  - the greatest variance by any projection of the data set comes to lie on the first axis (then called the first principal component),
  - the second greatest variance on the second axis, and so on.
- PCA can be used for reducing dimensionality by eliminating the later principal components.
17PCA
- By finding the eigenvalues and eigenvectors of the covariance matrix, we find that the eigenvectors with the largest eigenvalues correspond to the dimensions that have the strongest correlation in the dataset.
- This is the principal component.
- PCA is a useful statistical technique that has found application in
  - fields such as face recognition and image compression
  - finding patterns in data of high dimension
18PCA process STEP 1
- Subtract the mean from each of the data dimensions: all the x values have the mean of x subtracted, and all the y values have the mean of y subtracted from them. This produces a data set whose mean is zero.
- Subtracting the mean makes the variance and covariance calculations easier by simplifying their equations. The variance and covariance values are not affected by the mean value.
19PCA process STEP 1
- DATA
      x     y
      2.5   2.4
      0.5   0.7
      2.2   2.9
      1.9   2.2
      3.1   3.0
      2.3   2.7
      2.0   1.6
      1.0   1.1
      1.5   1.6
      1.1   0.9
- ZERO MEAN DATA
      x      y
      .69    .49
    -1.31  -1.21
      .39    .99
      .09    .29
     1.29   1.09
      .49    .79
      .19   -.31
     -.81   -.81
     -.31   -.31
     -.71  -1.01
20PCA process STEP 1
21PCA process STEP 2
- Calculate the covariance matrix:
  $$\mathrm{cov} = \begin{pmatrix} .616555556 & .615444444 \\ .615444444 & .716555556 \end{pmatrix}$$
- Since the non-diagonal elements in this covariance matrix are positive, we should expect that the x and y variables increase together. (A NumPy check follows below.)
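The covariance matrix above can be reproduced from the STEP 1 data; a minimal NumPy sketch:

```python
import numpy as np

x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# np.cov subtracts the means internally and divides by n - 1
C = np.cov(x, y)
print(C)   # approx [[0.6166, 0.6154], [0.6154, 0.7166]]
```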
22PCA process STEP 3
- Calculate the eigenvectors and eigenvalues of the covariance matrix:
  $$\text{eigenvalues} = \begin{pmatrix} .0490833989 \\ 1.28402771 \end{pmatrix} \qquad \text{eigenvectors} = \begin{pmatrix} -.735178656 & -.677873399 \\ .677873399 & -.735178656 \end{pmatrix}$$
23PCA process STEP 3
- The eigenvectors are plotted as diagonal dotted lines on the plot.
- Note they are perpendicular to each other.
- Note one of the eigenvectors goes through the middle of the points, like drawing a line of best fit.
- The second eigenvector gives us the other, less important, pattern in the data: all the points follow the main line, but are off to the side of the main line by some amount.
24PCA process STEP 4
- Reduce dimensionality and form the feature vector: the eigenvector with the highest eigenvalue is the principal component of the data set.
- In our example, the eigenvector with the largest eigenvalue was the one that pointed down the middle of the data.
- Once eigenvectors are found from the covariance matrix, the next step is to order them by eigenvalue, highest to lowest. This gives you the components in order of significance.
25PCA process STEP 4
- Now, if you like, you can decide to ignore the components of lesser significance
- You do lose some information, but if the eigenvalues are small, you don't lose much
- n dimensions in your data
  - calculate n eigenvectors and eigenvalues
  - choose only the first p eigenvectors
  - final data set has only p dimensions
26PCA process STEP 4
- Feature Vector
  - FeatureVector = (eig_1 eig_2 eig_3 ... eig_n)
- We can either form a feature vector with both of the eigenvectors:
  $$\begin{pmatrix} -.677873399 & -.735178656 \\ -.735178656 & .677873399 \end{pmatrix}$$
- or, we can choose to leave out the smaller, less significant component and only have a single column:
  $$\begin{pmatrix} -.677873399 \\ -.735178656 \end{pmatrix}$$
27PCA process STEP 5
- Deriving the new data:
  FinalData = RowFeatureVector × RowZeroMeanData
- RowFeatureVector is the matrix with the eigenvectors in the columns, transposed so that the eigenvectors are now in the rows, with the most significant eigenvector at the top
- RowZeroMeanData is the mean-adjusted data transposed, i.e. the data items are in each column, with each row holding a separate dimension. (A NumPy sketch of this projection follows below.)
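A minimal NumPy sketch of STEPs 1 to 5 on the example data, keeping both eigenvectors:

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

zero_mean = data - data.mean(axis=0)           # STEP 1: subtract the mean
cov = np.cov(zero_mean, rowvar=False)          # STEP 2: covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)         # STEP 3: eigh() suits symmetric matrices
order = np.argsort(eigvals)[::-1]              # STEP 4: sort by eigenvalue, highest first
row_feature_vector = eigvecs[:, order].T       # eigenvectors as rows, most significant on top
row_zero_mean_data = zero_mean.T               # data items in columns

final_data = row_feature_vector @ row_zero_mean_data   # STEP 5
print(final_data.T)   # matches the FinalData table (up to the sign of each component)
```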
28PCA process STEP 5
- (Figure: singular value decomposition of the data matrix into U, S and V^T, with samples, variables and factors as the axes; the leading factors are marked "significant" and the remaining ones "noise".)
29PCA process STEP 5
- FinalData is the final data set, with data items in columns, and dimensions along rows.
- What will this give us? It will give us the original data solely in terms of the vectors we chose.
- We have changed our data from being in terms of the axes x and y, and now they are in terms of our 2 eigenvectors.
30PCA process STEP 5
- FinalData transpose (dimensions along columns):
        x              y
   -.827970186    -.175115307
   1.77758033      .142857227
   -.992197494     .384374989
   -.274210416     .130417207
  -1.67580142     -.209498461
   -.912949103     .175282444
    .0991094375   -.349824698
   1.14457216      .0464172582
    .438046137     .0177646297
   1.22382056     -.162675287
31PCA process STEP 5
32Reconstruction of original Data
- If we reduced the dimensionality then, obviously, when reconstructing the data we would lose the dimensions we chose to discard. In our example let us assume that we kept only the x dimension (the projection onto the first principal component). A reconstruction sketch follows the table below.
33Reconstruction of original Data
- x
- -.827970186
- 1.77758033
- -.992197494
- -.274210416
- -1.67580142
- -.912949103
- .0991094375
- 1.14457216
- .438046137
- 1.22382056
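A minimal sketch of the reconstruction, repeating the setup from the earlier NumPy sketch and keeping only the first principal component (p = 1):

```python
import numpy as np

data = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
                 [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])
zero_mean = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(zero_mean, rowvar=False))
top = eigvecs[:, [np.argmax(eigvals)]].T          # most significant eigenvector only (1 x 2)

reduced_data = top @ zero_mean.T                  # the single "x" column above (1 x 10)

# Reconstruction: map back with the transpose and re-add the mean;
# the discarded second dimension is lost for good
reconstructed = (top.T @ reduced_data).T + data.mean(axis=0)
print(reconstructed)  # close to the original data, flattened onto the first principal component
```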
34Roadmap
- Principal Components Analysis
- Clustering
- ANOVA
35What is Cluster Analysis?
- Cluster: a collection of data objects
  - Similar to the objects in the same cluster (intraclass similarity)
  - Dissimilar to the objects in other clusters (interclass dissimilarity)
- Cluster analysis
  - Statistical method for grouping a set of data objects into clusters
  - A good clustering method produces high-quality clusters with high intraclass similarity and low interclass similarity
- Clustering is unsupervised classification
- Can be a stand-alone tool or a preprocessing step for other algorithms
36Group objects according to their similarity
Cluster: a set of objects that are similar to each other and separated from the other objects. Example: the green/red data points were generated from two different normal distributions.
37Clustering data
object / expression data matrix
- Experiments/samples are given as the row and column vectors of an expression data matrix (o objects × n experiments)
- Clustering may be applied either to objects or to experiments (regarded as vectors in R^o or R^n).
38Pattern matrix → Proximity matrix
- Pattern matrix (n × p)
  - p: # of attributes
  - n: # of objects
- Proximity matrix (n × n)
  - d(i,j): difference/dissimilarity between objects i and j
39Proximity matrix
- Clustering methods require that an index of proximity, or alikeness, or affinity, or association be established between pairs of patterns
- A proximity index is either a similarity or a dissimilarity
- The crucial problem in identifying clusters in data is to specify what proximity is and how to measure it
40Proximity indices
- A proximity index between the ith and kth patterns is denoted d(i,k) and must satisfy the following three properties:
  1. (a) for a dissimilarity: d(i,i) = 0, for all i
     (b) for a similarity: d(i,i) ≥ max_k d(i,k), for all i
  2. d(i,k) = d(k,i), for all (i,k)
  3. d(i,k) ≥ 0, for all (i,k)
41Different proximity measures
- For a component-wise difference of (4, 2) between two points (Minkowski metric with parameter r):
  - r = 2 (Euclidean distance): (4² + 2²)^(1/2) = 4.472
  - r = 1 (Manhattan distance): 4 + 2 = 6
  - r → ∞ (sup distance): max{4, 2} = 4
- (A NumPy sketch of the three metrics follows below.)
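A minimal NumPy sketch of the three distances for the same component-wise difference:

```python
import numpy as np

diff = np.array([4, 2])   # component-wise difference between two points

print(np.linalg.norm(diff, ord=2))       # Euclidean: (4^2 + 2^2)^(1/2) = 4.472...
print(np.linalg.norm(diff, ord=1))       # Manhattan: 4 + 2 = 6
print(np.linalg.norm(diff, ord=np.inf))  # sup distance: max{4, 2} = 4
```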
42 K-Means Clustering
- The meaning of K-means
  - Why is it called K-means clustering? K points are used to represent the clustering result; each point corresponds to the centre (mean) of a cluster
- Each point is assigned to the cluster with the closest center point
- The number K must be specified
- Basic algorithm:
43The K-Means Clustering Method
- Given k, the k-means algorithm is implemented in the following steps:
  1. Partition objects into k non-empty subsets
  2. Arbitrarily choose k points as initial centers
  3. Assign each object to the cluster with the nearest seed point (center)
  4. Calculate the mean of each cluster and update the seed points
  5. Go back to Step 3; stop when there are no more new assignments
- (A minimal NumPy sketch of the algorithm follows below.)
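A minimal NumPy sketch of the algorithm described above; the two-blob data set is made up for illustration, and empty clusters are not handled for brevity:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Arbitrarily choose k objects as the initial cluster centers
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each object to the cluster with the nearest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of its cluster
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):   # stop when no assignment changes
            break
        centers = new_centers
    return labels, centers

# Toy usage: two blobs drawn from different normal distributions
X = np.vstack([np.random.normal(0, 1, (50, 2)),
               np.random.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, k=2)
print(centers)   # roughly (0, 0) and (5, 5)
```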
44The K-Means Clustering Method (cont.)
- The basic step of k-means clustering is simple:
- Iterate until stable (no object moves between groups):
  1. Determine the centroid coordinates
  2. Determine the distance of each object to the centroids
  3. Group the objects based on minimum distance
45The K-Means Clustering Method (cont.)
46The K-Means Clustering Results
- (Figure: k-means iterations on a 2-D scatter plot with K = 2: arbitrarily choose K objects as the initial cluster centers, assign each object to the most similar center, update the cluster means, and reassign until the assignments stop changing.)
47Weaknesses of the K-Means Method
- Unable to handle noisy data and outliers
  - Very large or very small values can skew the mean
- Not suitable for discovering clusters with non-convex shapes
48Hierarchical Clustering
- Start with every data point in a separate cluster
- Keep merging the most similar pairs of data points/clusters until we have one big cluster left
- This is called a bottom-up or agglomerative method
49Hierarchical Clustering (cont.)
- This produces a binary tree or dendrogram
- The final cluster is the root and each data item is a leaf
- The height of the bars indicates how close the items are
50Hierarchical Clustering Demo
51Levels of Clustering
52Linkage in Hierarchical Clustering
- We already know about distance measures between data items, but what about between a data item and a cluster, or between two clusters?
- We just treat a data point as a cluster with a single item, so our only problem is to define a linkage method between clusters
- As usual, there are lots of choices
53Average Linkage
- Definition:
  - Each cluster c_i is associated with a mean vector µ_i which is the mean of all the data items in the cluster
  - The distance between two clusters c_i and c_j is then just d(µ_i, µ_j)
- This is somewhat non-standard: this method is usually referred to as centroid linkage, and average linkage is defined as the average of all pairwise distances between points in the two clusters
54Single Linkage
- The distance between two clusters is the minimum of all pairwise distances between points in the two clusters
- Tends to produce long, loose clusters
55Complete Linkage
- The distance between two clusters is the maximum of all pairwise distances between points in the two clusters
- Tends to produce very tight clusters
56Distances between clusters (summary)
- Calculation of the distance between two clusters is based on the pairwise distances between members of the clusters:
  - Complete linkage: largest distance between points
  - Average linkage: average distance between points
  - Single linkage: smallest distance between points
  - Centroid: distance between centroids
- Complete linkage gives preference to compact/spherical clusters. Single linkage can produce long, stretched clusters.
57EXAMPLE
- Distance matrix (a SciPy sketch of clustering this matrix follows below):
      A  B  C  D  E
  A   0  1  2  2  3
  B   1  0  2  4  3
  C   2  2  0  1  5
  D   2  4  1  0  3
  E   3  3  5  3  0
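A sketch of single-linkage agglomerative clustering on this distance matrix using SciPy; the labels A..E follow the slide, and other linkage methods can be swapped in:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

D = np.array([[0, 1, 2, 2, 3],
              [1, 0, 2, 4, 3],
              [2, 2, 0, 1, 5],
              [2, 4, 1, 0, 3],
              [3, 3, 5, 3, 0]], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the distance matrix
Z = linkage(squareform(D), method="single")       # try "complete" or "average" too
print(Z)                                          # merge history: which clusters join, and at what distance
dendrogram(Z, labels=["A", "B", "C", "D", "E"])   # draws the dendrogram (uses matplotlib)
```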
58More on Hierarchical Clustering Methods
- Major advantages
  - Conceptually very simple
  - Easy to implement → most commonly used technique
- Major weaknesses of agglomerative clustering methods
  - do not scale well: time complexity of at least O(n²), where n is the number of total objects
  - can never undo what was done previously → high likelihood of getting stuck in local minima
59Roadmap
- Principal Components Analysis
- Clustering
- ANOVA
60(M)ANOVA
- The analysis of variance technique in one-way analysis of variance (ANOVA) takes a set of grouped data and determines whether the mean of a variable differs significantly between groups
- Often there are multiple variables, and you are interested in determining whether the entire set of means is different from one group to the next
- There is a multivariate version of analysis of variance that can address that problem (MANOVA)
- (A minimal one-way ANOVA sketch with SciPy follows below.)
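A minimal sketch of one-way ANOVA with SciPy; the three groups of measurements below are made up for illustration:

```python
from scipy import stats

# Made-up measurements of one variable in three groups
group_a = [23.1, 24.5, 22.8, 25.0, 23.9]
group_b = [26.2, 27.0, 25.5, 26.8, 27.4]
group_c = [22.0, 21.5, 23.2, 22.7, 21.9]

# H0: all group means are equal; a small p-value suggests the means differ
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```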