1. Clustering on the Simplex
Morten Mørup, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark
2. Joint work with
Lars Kai Hansen, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark
Christian Walder, DTU Informatics, Intelligent Signal Processing, Technical University of Denmark
3. Clustering
- Cluster analysis or clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. (Wikipedia)
4. Clustering approaches
- K-means iterative refinement algorithm (Lloyd, 1982; Hartigan, 1979). The problem is NP-complete (Megiddo and Supowit, 1984).
- Relaxations of the hard assignment problem:
- Annealing approaches based on a temperature parameter (as T → 0 the original clustering problem is recovered); see for instance Hofmann and Buhmann, 1997
- Fuzzy clustering (Hathaway and Bezdek, 1988)
- Expectation Maximization (Mixture of Gaussians)
- Spectral Clustering
Assignment Step (S): Assign each data point to the cluster with the closest mean value. Update Step (C): Calculate the new mean value for each cluster.
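As a concrete reference point, here is a minimal NumPy sketch of Lloyd's two-step iteration (an illustration, not the authors' code; the function and variable names are our own):

```python
import numpy as np

def lloyd_kmeans(X, K, n_iter=100, seed=0):
    """Minimal Lloyd iteration; X is M x N with observations as columns."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the K cluster means from K distinct observations.
    C = X[:, rng.choice(X.shape[1], size=K, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step (S): each point joins the cluster with the closest mean.
        d2 = ((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)  # K x N distances
        z = d2.argmin(axis=0)
        # Update step (C): each mean becomes the average of its assigned points.
        for k in range(K):
            if np.any(z == k):
                C[:, k] = X[:, z == k].mean(axis=1)
    return z, C
```

Each pass can only decrease the objective, which is why the algorithm terminates in a 1-spin-stable assignment rather than a global optimum.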
Guarantee of optimality: No single change in assignment is better than the current assignment (1-spin stability).
Drawbacks: Previous relaxations are either not exact or depend on some problem-specific annealing parameter in order to recover the original binary combinatorial assignments.
5. From the K-means objective to Pairwise Clustering
K-means objective
Pairwise Clustering (Buhmann and Hofmann, 1994)
Here K is a similarity matrix; with K = XᵀX the pairwise objective is equivalent to the k-means objective.
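The equations on this slide did not survive extraction; in standard notation (a reconstruction, following Buhmann and Hofmann, 1994) the k-means objective is

$$\min_{S}\ \sum_{k}\sum_{n} s_{kn}\,\|x_n - c_k\|^2,\qquad c_k = \frac{\sum_n s_{kn}\, x_n}{\sum_n s_{kn}},\qquad s_{kn}\in\{0,1\},\ \ \sum_k s_{kn}=1,$$

and eliminating the means $c_k$ gives the pairwise form

$$\max_{S}\ \sum_k \frac{s_k^\top K\, s_k}{s_k^\top \mathbf{1}},\qquad K = X^\top X,$$

where $s_k$ denotes the $k$-th row of $S$. For a general similarity matrix $K$ this is the pairwise clustering problem.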
6. Although clustering is hard, there is room to be simple(x) minded!
Binary Combinatorial (BC)
Simplicial Relaxation (SR)
7. The simplicial relaxation (SR) admits standard continuous optimization for solving the pairwise clustering problem, for instance by normalization-invariant projected gradient ascent.
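A hypothetical sketch of such a scheme: plain projected gradient ascent with a standard sort-based simplex projection (Duchi et al., 2008). The step size lr, iteration count, and function names are ours, and the paper's normalization-invariant update is not reproduced here:

```python
import numpy as np

def project_simplex_columns(S):
    """Euclidean projection of each column of S onto the probability simplex."""
    K, N = S.shape
    U = np.sort(S, axis=0)[::-1]             # each column sorted descending
    css = np.cumsum(U, axis=0) - 1.0
    j = np.arange(1, K + 1)[:, None]
    rho = (U - css / j > 0).sum(axis=0) - 1  # last index where condition holds
    theta = css[rho, np.arange(N)] / (rho + 1)
    return np.maximum(S - theta, 0.0)

def sr_pairwise_clustering(Kmat, K, n_iter=500, lr=1e-2, seed=0):
    """Projected gradient ascent on the simplicial relaxation of
    f(S) = sum_k (s_k^T Kmat s_k) / (s_k^T 1), with each column of the
    K x N matrix S relaxed from a binary indicator to the simplex."""
    rng = np.random.default_rng(seed)
    N = Kmat.shape[0]
    S = project_simplex_columns(rng.random((K, N)))
    for _ in range(n_iter):
        SK = S @ Kmat                           # row k holds s_k^T Kmat
        num = np.einsum('kn,kn->k', SK, S)      # s_k^T Kmat s_k per cluster
        den = S.sum(axis=1) + 1e-12             # s_k^T 1 per cluster (guarded)
        grad = 2.0 * SK / den[:, None] - (num / den ** 2)[:, None]
        S = project_simplex_columns(S + lr * grad)
    return S
```

Per the deck's Theorems 1 and 2, the columns of S end up at vertices of the simplex at stationarity, so hard assignments can be read off as z = S.argmax(axis=0).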
8. Synthetic data example
[Figure: K-means vs. SR-clustering on synthetic data. The brown and grey clusters each contain 1000 data points in R², whereas the remaining clusters each contain 250 data points.]
9. The SR-clustering algorithm is driven by high-density regions.
10. Thus, solutions are in general substantially better than those of Lloyd's algorithm, at the same computational complexity.
[Figure: SR-clustering (init = 1), SR-clustering (init = 0.01), and Lloyd's K-means compared.]
11. [Figure: K-means vs. SR-clustering (init = 1) vs. SR-clustering (init = 0.01), for 10, 50, and 100 components.]
12. SR-clustering for kernel-based semi-supervised learning
Kernel-based semi-supervised learning based on pairwise clustering (Basu et al., 2004; Kulis et al., 2005; Kulis et al., 2009)
13. The simplicial relaxation admits solving the problem as a (non-convex) continuous optimization problem.
14. Class labels can be handled by explicit fixing; must-link and cannot-link constraints can be absorbed into the kernel. Hence the problem reduces more or less to the standard SR-clustering problem for the estimation of S.
15. At stationarity, the gradients of the elements in each column of S that are 1 are larger than those of the elements that are 0. Thus, the impact of the supervision can be evaluated by estimating the minimal Lagrange multipliers that guarantee stationarity of the solution obtained by the SR-clustering algorithm. This is a convex optimization problem.
Thus, the Lagrange multipliers give a measure of conflict between the data and the supervision.
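In standard KKT notation (a reconstruction of the argument, not the paper's exact statement): maximizing the objective f over a column s_n of S on the simplex, with multiplier $\lambda_n$ for the sum-to-one constraint and $\mu_{kn}\ge 0$ for nonnegativity, stationarity requires

$$\frac{\partial f}{\partial s_{kn}} + \mu_{kn} - \lambda_n = 0,\qquad \mu_{kn}\, s_{kn} = 0,$$

so at a binary vertex with $s_{k^\ast n}=1$,

$$\lambda_n = \frac{\partial f}{\partial s_{k^\ast n}}\ \ge\ \frac{\partial f}{\partial s_{kn}}\quad\text{for all } k \text{ with } s_{kn}=0.$$

Finding the smallest multipliers consistent with these linear inequalities is the convex problem referred to above; large multipliers flag supervised constraints that the data resist.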
16. Digit classification with one mislabeled data observation from each class.
17. Community Detection in Complex Networks
Communities/modules are natural divisions of network nodes into densely connected subgroups (Newman and Girvan, 2003).
[Figure: a graph G(V,E), its adjacency matrix A, and the permuted adjacency matrix PAPᵀ, where the community detection algorithm yields the permutation P of the graph from the clustering assignment S.]
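For illustration, the permuted view PAPᵀ can be produced from an assignment vector with a few lines of NumPy (an illustrative helper, not the authors' code):

```python
import numpy as np

def permute_by_clusters(A, z):
    """Reorder adjacency matrix A so that nodes sharing a cluster label
    in z become contiguous; returns P A P^T as a sliced array. With a
    good assignment, dense diagonal blocks (communities) appear."""
    perm = np.argsort(z, kind='stable')
    return A[np.ix_(perm, perm)]
```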
18. Common community detection objectives
- Hamiltonian (Fu and Anderson, 1986; Reichardt and Bornholdt, 2004)
- Modularity (Newman and Girvan, 2004)
These are generic problems of a common quadratic assignment form (see below).
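The equations themselves did not survive extraction; their standard forms are the Reichardt-Bornholdt Hamiltonian

$$H = -\sum_{i\neq j}\bigl(A_{ij} - \gamma\, p_{ij}\bigr)\,\delta(z_i, z_j)$$

and the Newman-Girvan modularity

$$Q = \frac{1}{2m}\sum_{ij}\Bigl(A_{ij} - \frac{k_i k_j}{2m}\Bigr)\,\delta(z_i, z_j),$$

both instances of the generic assignment problem $\max_S\ \operatorname{tr}(S B S^\top)$ for a suitable matrix $B$, with $S$ a binary assignment matrix.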
19. Again we can make an exact relaxation to the simplex!
22. SR-clustering of complex networks
The quality of the solutions is comparable to results obtained by extensive Gibbs sampling.
23. So far we have demonstrated how binary combinatorial constraints are recovered at stationarity when relaxing the problems to the simplex. However, simplex constraints also hold promising data mining properties of their own!
24. The Principal Convex Hull (PCH)
The Convex Hull
Def: The convex hull/convex envelope of X ∈ R^{M×N} is the minimal convex set containing X. (Informally, it can be described as a rubber band wrapped around the data points.) Finding the convex hull is solvable in linear time, O(N) (McCallum and Avis, 1979). However, the size of the convex set grows exponentially with the dimensionality of the data, O(log^{M−1}(N)) (Dwyer, 1988).
The Principal Convex Hull
Def: The best convex set of size K according to some measure of distortion D(·) (Mørup et al., 2009). (Informally, it can be described as a less flexible rubber band that wraps most of the data points.)
25. The mathematical formulation of the Principal Convex Hull (PCH) is given by two simplex constraints, principal in terms of the Frobenius norm.
C gives the fractions in which the observations in X are used to form each feature (the distinct aspects, or "freaks"). In general, C will be very sparse!
S gives the fraction by which each observation resembles each of the distinct aspects XC.
The model: X ≈ XCS, i.e.

$$\min_{C,S}\ \|X - XCS\|_F^2\quad\text{s.t.}\quad c_{nk}\ge 0,\ \sum_n c_{nk}=1;\qquad s_{kn}\ge 0,\ \sum_k s_{kn}=1$$

(note: when K is large enough, the PCH recovers the convex hull)
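A hypothetical alternating projected-gradient sketch of this problem (reusing the same sort-based simplex projection as earlier; the step size lr and iteration count are illustrative, and the authors' actual algorithm may differ):

```python
import numpy as np

def proj_simplex_cols(V):
    """Project each column of V onto the probability simplex (sort-based)."""
    K = V.shape[0]
    U = np.sort(V, axis=0)[::-1]
    css = np.cumsum(U, axis=0) - 1.0
    j = np.arange(1, K + 1)[:, None]
    rho = (U - css / j > 0).sum(axis=0) - 1
    theta = css[rho, np.arange(V.shape[1])] / (rho + 1)
    return np.maximum(V - theta, 0.0)

def pch(X, K, n_iter=500, lr=1e-3, seed=0):
    """Sketch of min_{C,S} ||X - X C S||_F^2 with the columns of C (N x K)
    and of S (K x N) constrained to the probability simplex."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    N = X.shape[1]
    C = proj_simplex_cols(rng.random((N, K)))
    S = proj_simplex_cols(rng.random((K, N)))
    for _ in range(n_iter):
        XC = X @ C                        # the K archetypal aspects
        R = X - XC @ S                    # residual
        S = proj_simplex_cols(S + lr * (XC.T @ R))       # descent step in S
        R = X - (X @ C) @ S
        C = proj_simplex_cols(C + lr * (X.T @ R @ S.T))  # descent step in C
    return C, S
```

XC holds the K aspects as convex combinations of observations, and S expresses every observation as a convex combination of those aspects.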
26. Relation between the PCH model, low-rank decomposition, and clustering approaches
PCH naturally bridges clustering and low-rank approximations!
27. Two important properties of the PCH model
- The PCH model is invariant to affine transformation and scaling.
- The PCH model is unique up to permutation of the components.
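A one-line check of the invariance claim (notation assumed): because the columns of C and S each sum to one, $\mathbf{1}^\top C = \mathbf{1}^\top$ and $\mathbf{1}^\top S = \mathbf{1}^\top$, so for affinely transformed data $\tilde{X} = AX + b\mathbf{1}^\top$

$$\tilde{X} C S = A(XCS) + b\mathbf{1}^\top C S = A(XCS) + b\mathbf{1}^\top,\qquad \tilde{X} - \tilde{X} C S = A\,(X - XCS),$$

i.e. the residual transforms covariantly with the data, so the fitted (C, S) is unaffected by translation and scaling.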
28. A feature extraction example
There is more contrast in the features than obtained by clustering approaches. As such, PCH aims for distinct aspects/regions in the data. The PCH model strives to attain Platonic Ideal Forms.
29. PCH model for PET data (Positron Emission Tomography)
The data contain 3 components: high-binding regions, low-binding regions, and non-binding regions. Each voxel is given as a concentration fraction of these regions.
[Figure: the estimated aspects XC and the fractions S.]
30. NMF spectroscopy of samples of mixtures of propanol, butanol, and pentanol.
31. Collaborative filtering example
Medium-size and large-size MovieLens data (www.grouplens.org). Medium size: 1,000,209 ratings of 3,952 movies by 6,040 users. Large size: 10,000,054 ratings of 10,677 movies by 71,567 users.
32. Conclusion
- The simplex offers unique data mining properties.
- Simplicial relaxations (SR) form exact relaxations of common hard-assignment clustering problems, i.e. K-means, pairwise clustering, and community detection in graphs.
- SR enables solving binary combinatorial problems using standard solvers from continuous optimization.
- The proposed SR-clustering algorithm outperforms traditional iterative refinement algorithms.
- No need for an annealing parameter; hard assignments are guaranteed at stationarity (Theorems 1 and 2).
- Semi-supervised learning can be posed as a continuous optimization problem, with the associated Lagrange multipliers giving an evaluation measure of each supervised constraint.
33. Conclusion cont.
- The Principal Convex Hull (PCH) is formed by two types of simplex constraints.
- It extracts distinct aspects of the data.
- It is relevant for data mining in general, wherever low-rank approximation and clustering approaches have been invoked.
34. A reformulation of Lex Parsimoniae
"The simplest explanation is usually the best." - William of Ockham
"The simplex explanation is usually the best."
"Simplicity is the ultimate sophistication." - Leonardo da Vinci
"Simplexity is the ultimate sophistication."
The presented work is described in:
- M. Mørup and L. K. Hansen, "An Exact Relaxation of Clustering", submitted to JMLR, 2009.
- M. Mørup, C. Walder and L. K. Hansen, "Simplicial Semi-supervised Learning", submitted.
- M. Mørup and L. K. Hansen, "Platonic Forms Revisited", submitted.