Clustering - PowerPoint PPT Presentation

Provided by: csCha


1
Clustering

Petter Mostad

2
Clustering vs. class prediction

Class prediction
A learning set of objects with known classes
Goal: put new objects into existing classes
Also called supervised learning, or
classification
Clustering
No learning set, no given classes
Goal: discover the best classes or groupings
Also called unsupervised learning, or class
discovery

3
Overview

General clustering theory
Steps, methods, algorithms, issues...
Clustering microarray data
Recommendations for this kind of data
Programs for clustering
Some other visualization techniques

4
Issues in clustering

Used to explore and visualize data, with few
preconceptions
Many subjective choices must be made, so a
clustering output tends to be subjective
It is difficult to get truly statistically
significant conclusions
Algorithms will always produce clusters, whether
any exist in the data or not

5
Steps in clustering

Feature selection and extraction
Defining and computing similarities
Clustering or grouping objects
Assessing, presenting, and using the result

6
1. Feature selection and extraction

Deciding which measurements matter for similarity
Data reduction
Filtering away objects
Normalization of measurements

7
The data matrix

Every row contains the measurements for one
object.
Similarities are computed between all pairs of
rows
If all measurements are of the same type, one can
instead cluster the measurements (the columns)!

[Figure: data matrix with objects as rows and measurements as columns]
8
2. Defining and computing similarities

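Similarity measures for continuous data vectors
Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
Minkowski distance (including Manhattan metric, $p = 1$): $d(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/p}$
Mahalanobis distance: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is a
covariance matrix

Centered and non-centered (absolute) Pearson
correlation
centered: $r(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$
non-centered: $r_0(x, y) = \frac{\sum_i x_i y_i}{\sqrt{\sum_i x_i^2 \sum_i y_i^2}}$
where $\bar{x} = \frac{1}{n} \sum_i x_i$ and $\bar{y} = \frac{1}{n} \sum_i y_i$
Spearman rank correlation
Compute the ranking of the numbers in each vector
Find correlation between ranking numbers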
....
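
As a rough illustration of the measures listed above, here is a minimal Python sketch using only the standard library. The function names (minkowski, pearson, ranks, spearman) are chosen for this example and do not come from the slides; the Mahalanobis distance is left out because it needs a matrix inverse. Ties in the Spearman ranking are given their average rank, which is one common convention.

```python
import math

def minkowski(x, y, p=2):
    """Minkowski distance; p=2 gives Euclidean, p=1 gives Manhattan."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def pearson(x, y, centered=True):
    """Centered or non-centered Pearson correlation of two vectors."""
    if centered:
        mx, my = sum(x) / len(x), sum(y) / len(y)
        x = [a - mx for a in x]
        y = [b - my for b in y]
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def ranks(v):
    """Ranks 1..n of the values in v; tied values share their average rank."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0.0] * len(v)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and v[order[j + 1]] == v[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

x = [1.0, 2.0, 3.0, 4.0]
y = [1.0, 4.0, 9.0, 16.0]
print(minkowski(x, y, p=2), minkowski(x, y, p=1))
print(pearson(x, y), pearson(x, y, centered=False), spearman(x, y))
```

Because y increases whenever x does, the Spearman correlation of this example is 1 even though the relationship is not linear.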

10
Geometrical view of clustering

If measurements are coordinates, objects become
points in some space
If the similarity measure is Euclidean distance,
the goal is to group nearby points
Note: When we have only 2 or 3 measurements per
object, we can do better than most algorithms
using visual inspection

11
Similarity measures for discrete data

To compare two binary vectors, count the numbers
a, b, c, d of 1-1s, 1-0s, 0-1s, and 0-0s,
respectively
Construct different similarity measures based
on these numbers
Similarities between other kinds of objects, for
example trees, can also be defined in reasonable ways
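
A minimal sketch of two measures that can be built from the counts a, b, c, d. The slides do not name specific measures; the simple matching coefficient and the Jaccard coefficient are used here as two well-known examples. Returning 0.0 for two all-zero vectors in jaccard is a convention chosen for this sketch only.

```python
def binary_counts(x, y):
    """Counts (a, b, c, d) of 1-1, 1-0, 0-1 and 0-0 positions."""
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    return a, b, c, d

def simple_matching(x, y):
    """Fraction of positions where the two vectors agree."""
    a, b, c, d = binary_counts(x, y)
    return (a + d) / (a + b + c + d)

def jaccard(x, y):
    """Like simple matching, but 0-0 agreements are ignored."""
    a, b, c, _ = binary_counts(x, y)
    if a + b + c == 0:
        return 0.0  # both vectors all zero: convention assumed for this sketch
    return a / (a + b + c)

x = [1, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1]
print(binary_counts(x, y), simple_matching(x, y), jaccard(x, y))
```

Which measure is appropriate depends on whether a shared 0 carries information: simple matching counts it as agreement, Jaccard does not.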

12
Similarities using contexts

Mutual Neighbour Distance: $MND(x, y) = NN(x, y) + NN(y, x)$,
where $NN(x, y)$ is the neighbour number of x
with respect to y
This is not a metric, but similarities do not
need to be based on metrics.
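
A small sketch of the mutual neighbour distance, assuming the usual definition in which the neighbour number of x with respect to y is the rank of x among the neighbours of y (nearest neighbour = 1). The function names and the 1-D example points are made up for this illustration.

```python
def neighbour_number(points, i, j, dist):
    """Rank of point i among the neighbours of point j (nearest = 1)."""
    others = sorted(
        (k for k in range(len(points)) if k != j),
        key=lambda k: dist(points[j], points[k]),
    )
    return others.index(i) + 1

def mutual_neighbour_distance(points, i, j, dist):
    """Sum of the two neighbour numbers, so the measure is symmetric."""
    return (neighbour_number(points, i, j, dist)
            + neighbour_number(points, j, i, dist))

points = [0.0, 1.0, 1.5, 5.0]
absdiff = lambda a, b: abs(a - b)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    print(i, j, mutual_neighbour_distance(points, i, j, absdiff))
```

The value depends on the other points in the data set (the context), not only on the two points being compared.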

13
3. Clustering or grouping

Hierarchical clusterings
Divisive: Starts with one big cluster and
subdivides one cluster in each step
Agglomerative: Starts with each object in a
separate cluster. In each step, joins the two
closest clusters
Partitional clusterings
Probabilistic or fuzzy clusterings

14
Hierarchical clustering

Agglomerative clustering depends on the type of
linkage, i.e., how to compute the distance
between the merged cluster (UV) and an old cluster (W)
d(UV, W) = min(d(U, W), d(V, W)) (single linkage)
d(UV, W) = max(d(U, W), d(V, W)) (complete linkage)
d(UV, W) = average over all distances between
objects in (UV) and objects in W (average
linkage, or UPGMA: Unweighted Pair Group Method
with Arithmetic mean)
The output is a dendrogram
A simplification of average linkage is often
implemented (average group linkage). It may
lead to inverted dendrograms!
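
The three linkage rules can be compared with a naive agglomerative sketch like the one below. It recomputes cluster distances from the object distances at every step, which gives the same result as the update rules above but is slow; it is meant only for small examples. The function name agglomerate and the example points are assumptions of this sketch.

```python
def agglomerate(points, dist, linkage="average"):
    """Naive agglomerative clustering.

    Returns the merges in order, each as (members of U, members of V,
    distance at which U and V were joined).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def cluster_distance(u, v):
        d = [dist(points[i], points[j]) for i in u for j in v]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        return sum(d) / len(d)  # average linkage (UPGMA)

    while len(clusters) > 1:
        # Find the two closest clusters under the chosen linkage.
        height, a, b = min(
            (cluster_distance(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        merges.append((clusters[a], clusters[b], height))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

points = [0.0, 1.0, 3.0, 7.0]
absdiff = lambda a, b: abs(a - b)
for linkage in ("single", "complete", "average"):
    print(linkage, [h for _, _, h in agglomerate(points, absdiff, linkage)])
```

The merge heights returned here are what a dendrogram would draw; on this example the same objects are joined in the same order under all three linkages, but at different heights.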

15
Dendrograms, visualizations

The data matrix is often visualized using three
colors, representing positive, negative, and zero
values.
Hierarchical clustering results are often represented
with a dendrogram. The similarity at which
clusters merge should correspond to the height of
the corresponding horizontal line in the dendrogram!
To display the dendrogram, the objects (rows or
columns) need to be sorted. Each time two clusters
are merged, they can be placed in either of two
orders.

17
Ward's hierarchical clustering

Agglomerative.
Goal: minimize the Error Sum of Squares (ESS) at
every step.
ESS: The sum, over all clusters, of the sum of
the squares of the distances from the objects to
the cluster centroid.
When joining two clusters, find the pair that
results in the smallest increase in ESS.
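
One step of this procedure can be sketched as follows. The code tries every pair of clusters and recomputes the ESS from scratch, which is simple but inefficient; the names centroid, ess and ward_step are chosen for this example.

```python
def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(p[k] for p in cluster) / n for k in range(len(cluster[0]))]

def ess(clusters):
    """Error Sum of Squares: squared distances to the cluster centroids."""
    total = 0.0
    for c in clusters:
        m = centroid(c)
        total += sum(sum((x - mk) ** 2 for x, mk in zip(p, m)) for p in c)
    return total

def ward_step(clusters):
    """Join the pair of clusters that gives the smallest increase in ESS.

    Returns (increase in ESS, new list of clusters).
    """
    base = ess(clusters)
    best = None
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            trial = [c for k, c in enumerate(clusters) if k not in (a, b)]
            trial.append(clusters[a] + clusters[b])
            increase = ess(trial) - base
            if best is None or increase < best[0]:
                best = (increase, trial)
    return best

clusters = [[(0.0, 0.0)], [(1.0, 0.0)], [(5.0, 5.0)]]
increase, clusters = ward_step(clusters)
print(increase, clusters)
```

Repeating ward_step until one cluster remains gives the full hierarchy.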

18
Partitional clusterings

The number of desired clusters is fixed at the
start
K-means clustering
Partition into k initial clusters
Iteratively, reassign points to groups with the
closest centroid. Recompute centroids.
Repeat until stability
The result may depend on initial clusters
May include a procedure joining or splitting
clusters according to size
The choice of number of clusters may not be
obvious
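
A minimal k-means sketch following the steps above: start from an initial partition, recompute centroids, reassign points, and stop when no point changes cluster. The function signature and the handling of empty clusters (they are simply skipped) are assumptions of this sketch, and it does not include any joining or splitting of clusters.

```python
def kmeans(points, labels, k, max_iter=100):
    """points: list of coordinate tuples.
    labels: initial cluster index (0..k-1) for each point.
    """
    labels = list(labels)
    centroids = [None] * k
    for _ in range(max_iter):
        # Recompute the centroid of every non-empty cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Reassign each point to the cluster with the closest centroid.
        new_labels = [
            min(
                (c for c in range(k) if centroids[c] is not None),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:  # stable: no point changed cluster
            break
        labels = new_labels
    return labels, centroids

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centroids = kmeans(points, [0, 1, 0, 1, 0, 1], k=2)
print(labels, centroids)
```

Running it from several different initial partitions is a simple way to see how much the result depends on the starting point.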

19
Probabilistic or fuzzy clustering

The output is, for each object and each cluster,
a probability or weight that the object belongs
to the cluster
Example: The observations are modelled as
produced by drawing from a number of probability
densities (often multivariate normal). Parameters
are then estimated with Maximum Likelihood (for
example using the EM algorithm).
Example: A fuzzy version of k-means, where
weights for objects are changed iteratively
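
The first example can be sketched for the simplest case, a mixture of one-dimensional normal densities fitted with EM. This is an illustration under several assumptions: univariate rather than multivariate normals, a fixed number of iterations, starting values supplied by the caller, and a small floor on the variances to avoid division by zero.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_mixture(xs, mus, variances, weights, n_iter=50):
    """EM for a 1-D normal mixture.

    Returns (resp, mus, variances, weights), where resp[i][j] is the
    probability that observation i belongs to component j.
    """
    mus, variances, weights = list(mus), list(variances), list(weights)
    k = len(mus)
    resp = []
    for _ in range(n_iter):
        # E-step: probability of each component for each observation.
        resp = []
        for x in xs:
            dens = [weights[j] * normal_pdf(x, mus[j], variances[j])
                    for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: weighted maximum likelihood estimates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var, 1e-6)  # floor: assumption of this sketch
    return resp, mus, variances, weights

xs = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0]
resp, mus, variances, weights = em_mixture(xs, [0.0, 10.0], [1.0, 1.0], [0.5, 0.5])
print(mus, weights)
print(resp[0])
```

The rows of resp are the probabilistic clustering output: one weight per object and cluster, summing to 1 for each object.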

20
Neural networks for clustering

Neural networks are mathematical models designed to
resemble actual (biological) neural networks
They consist of layers of nodes that send out
signals that depend probabilistically on their input
signals
The best-known uses are in classification, i.e., with
learning sets

21
Self-Organising Maps (SOM)
22
Clustering as optimization

Given a similarity definition and a definition of
what an optimal clustering is, it can often be
a huge algorithmic challenge to find the optimum.
Example: Subdivide many thousand objects into 50
clusters, minimizing e.g. the sum of the squared
distances to centroids.
In such cases, algorithms for optimization are central.

23
Genetic algorithms

Try to use evolution to obtain good solutions
to a problem
A number of solutions are kept at every step
They may then mate or mutate to produce new
solutions. The fittest solutions are kept.
Can be seen as an optimization algorithm
It is a great challenge to design ways of mating and
mutating that produce an efficient algorithm

24
Simulated annealing

A general optimization technique
Iterative: At every step, nearby solutions are
chosen with probabilities depending on their
optimality (so even less optimal solutions may be
chosen)
As the algorithm proceeds and the temperature
decreases, the probability of choosing less optimal
solutions also decreases.
It is a good general way to avoid getting stuck in local optima.
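
As a sketch of how this could be applied to clustering, the code below anneals over cluster labels for 1-D points, minimizing the sum of squared distances to centroids. The acceptance rule exp(-delta / t) (the Metropolis rule), the geometric cooling schedule, and all parameter values are assumptions of this example, not taken from the slides.

```python
import math
import random

def sse(points, labels, k):
    """Sum of squared distances from each 1-D point to its cluster mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            m = sum(members) / len(members)
            total += sum((p - m) ** 2 for p in members)
    return total

def anneal(points, k, steps=2000, t0=10.0, cooling=0.995, seed=0):
    rng = random.Random(seed)
    labels = [rng.randrange(k) for _ in points]
    cost = sse(points, labels, k)
    best_labels, best_cost = list(labels), cost
    t = t0
    for _ in range(steps):
        # Nearby solution: move one randomly chosen point to a random cluster.
        i = rng.randrange(len(points))
        old = labels[i]
        labels[i] = rng.randrange(k)
        new_cost = sse(points, labels, k)
        delta = new_cost - cost
        # Improvements are always accepted; worse solutions are accepted
        # with a probability that shrinks as the temperature t decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cost = new_cost
            if cost < best_cost:
                best_labels, best_cost = list(labels), cost
        else:
            labels[i] = old  # reject the move
        t *= cooling
    return best_labels, best_cost

points = [0.0, 0.2, 0.1, 5.0, 5.1, 4.9]
print(anneal(points, k=2))
```

The function returns the best solution seen during the run, which need not be the final state of the chain.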

25
4. Assessing and using the result

Visualization and summarization of the clusters
Note: You should always investigate the
dependence of your results on the choices you
have made for the clustering!

26
Examples of applications of clustering

Image analysis
Speech recognition
Data mining
....

27
Clustering microarray data

Samples are columns, genes are rows, in the data
matrix
What values to cluster?
What is a biologically relevant measure of
similarity?
One can cluster genes and/or samples

[Figure: data matrix with genes as rows and samples as columns]
28
Clustering microarray data

Usually, use log-transformed data
Data should be on the same scale (it usually is if
you use data that is already normalized)
You may have to filter away genes that show too
little variation over samples.
Use an appropriate distance measure for the
question you want to focus on (Pearson
correlation often works OK).
Use appropriate clustering algorithm
(Hierarchical average linkage usually works OK).
If you draw some conclusion from the clustering
results, try to vary your clustering choices to
see how stable these results are.
Clustering works best as a tool to generate
hypotheses and ideas, which may then be tested in
other ways.
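
The first recommendations can be sketched as a small preprocessing pipeline: log-transform, filter genes with little variation, and compute a Pearson-based distance between the remaining genes. The base-2 logarithm, the variance threshold of 0.1, the use of 1 - r as the distance, and the toy expression values are all assumptions made for this illustration.

```python
import math

def log2_matrix(data):
    """Log-transform every value (values are assumed to be positive)."""
    return [[math.log2(v) for v in row] for row in data]

def variance(row):
    m = sum(row) / len(row)
    return sum((v - m) ** 2 for v in row) / (len(row) - 1)

def filter_genes(data, min_var):
    """Keep only the rows (genes) whose variance over samples is at least min_var."""
    return [row for row in data if variance(row) >= min_var]

def pearson_distance(x, y):
    """1 - centered Pearson correlation: 0 for identical profiles, 2 for opposite."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return 1.0 - num / den

# Rows are genes, columns are samples (toy values).
expression = [
    [1.0, 2.0, 4.0, 8.0],
    [2.0, 4.0, 8.0, 16.0],
    [8.0, 4.0, 2.0, 1.0],
    [4.0, 4.0, 4.0, 4.0],   # no variation: removed by the filter
]
genes = filter_genes(log2_matrix(expression), min_var=0.1)
for i in range(len(genes)):
    print([round(pearson_distance(genes[i], g), 3) for g in genes])
```

Filtering before computing correlations also avoids a division by zero for genes with constant expression.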

29
Clustering tumor samples
30
Clustering to confirm or reject hypotheses?

A clustering may appear to validate, or be
validated by, a grouping derived by using other
data
Caution: The many different ways to do a
clustering may make it possible to tweak it to
produce the clusters you want
There is a huge and complex multiple testing
problem
Note that small changes in the data can change the result
dramatically
If you insist on trying to get significance, consider:
Using permutations of the data
Using resampling of the data (bootstrapping)
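
A simplified sketch of the resampling idea: resample the measurements (columns) with replacement many times and record how often a given relationship survives. To keep the example short, the relationship checked here is only "object j is the nearest neighbour of object i"; a fuller analysis would rerun the whole clustering on every replicate. All names and parameter values are assumptions of this sketch.

```python
import random

def nearest(data, i):
    """Index of the object closest to object i (squared Euclidean distance)."""
    return min(
        (j for j in range(len(data)) if j != i),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(data[i], data[j])),
    )

def bootstrap_support(data, i, j, n_boot=200, seed=0):
    """Fraction of bootstrap replicates in which j is the nearest neighbour of i."""
    rng = random.Random(seed)
    n_cols = len(data[0])
    hits = 0
    for _ in range(n_boot):
        # Resample the columns with replacement.
        cols = [rng.randrange(n_cols) for _ in range(n_cols)]
        resampled = [[row[c] for c in cols] for row in data]
        if nearest(resampled, i) == j:
            hits += 1
    return hits / n_boot

data = [
    [0.0, 0.0, 0.0, 0.0],
    [0.1, 0.1, 0.1, 0.1],
    [5.0, 5.0, 5.0, 5.0],
]
print(bootstrap_support(data, 0, 1), bootstrap_support(data, 0, 2))
```

A support value well below 1 suggests that the relationship depends on a few individual measurements.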

31
How to do clustering: Programs

A good program for clustering and visualization:
HCE
Great visualization options
Adapted to microarray data
http://www.cs.umd.edu/hcil/hce/
Can import similarity matrices
Classic for microarray data: Cluster and TreeView
(Eisen)
R/BioConductor: package cluster, hclust function,
heatmap function, ...
Many other programs/packages

32
Other visualization techniques: Principal
Components

The principal components can be viewed as the
axes of a better coordinate system for the
data.
Better in the sense that the data is maximally
spread out along the first principal components.
The principal components correspond to
eigenvectors of the covariance matrix of the
data.
The eigenvalues represent the part of the total
variance explained by each of the principal
components.
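
The link between principal components and eigenvectors can be illustrated with a short sketch that computes the covariance matrix and then finds its leading eigenvector by power iteration. Power iteration is used here only because it needs no external library; it is an assumption of this example, it finds just the first component, and it can fail if the starting vector happens to be orthogonal to that component.

```python
import math

def covariance_matrix(data):
    """Sample covariance matrix; rows of data are objects, columns are measurements."""
    n, d = len(data), len(data[0])
    means = [sum(row[k] for row in data) / n for k in range(d)]
    centered = [[row[k] - means[k] for k in range(d)] for row in data]
    return [[sum(r[a] * r[b] for r in centered) / (n - 1) for b in range(d)]
            for a in range(d)]

def first_principal_component(cov, n_iter=200):
    """Leading eigenvector and eigenvalue of cov, by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d))
                     for a in range(d))
    return v, eigenvalue

data = [[1.0, 1.0], [2.0, 2.5], [3.0, 2.5], [4.0, 4.0]]
cov = covariance_matrix(data)
v, eigenvalue = first_principal_component(cov)
total_variance = sum(cov[a][a] for a in range(len(cov)))
print(v, eigenvalue / total_variance)
```

The printed ratio is the part of the total variance explained by the first principal component.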

33
Principal component analysis of expression data
34
Principal component analysis of expression data
35
Other visualization techniques: Multidimensional
scaling

Start with some points in a very high dimension.
Goal: Display these points in a lower dimension,
so that distances between them are similar to
distances in original dimension.
May also try to preserve only the ranking of the
pairwise distances.
Makes it possible to use powerful visual
inspection, in 2 or 3 dimensions.
Can sometimes give very convincing pictures
separating samples in a predicted way.
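
How well a low-dimensional display preserves the original distances can be measured with a stress function. The sketch below computes one simple variant, the raw stress (the sum of squared differences between original and displayed pairwise distances); the slides do not specify a criterion, so this choice and the function names are assumptions. An MDS algorithm would search for the low-dimensional points that minimize such a quantity.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def raw_stress(high, low):
    """Sum over all pairs of (distance in low dim - distance in high dim)^2."""
    total = 0.0
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            total += (euclid(low[i], low[j]) - euclid(high[i], high[j])) ** 2
    return total

# Three points that lie on a line in 3-D can be displayed in 1-D without error.
high = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
print(raw_stress(high, [(0.0,), (5.0,), (10.0,)]))
print(raw_stress(high, [(0.0,), (5.0,), (9.0,)]))
```

A stress of zero means all pairwise distances are reproduced exactly; in real data some distortion is usually unavoidable.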