DB Seminar Series: Semi-supervised Projected Clustering
1
DB Seminar Series: Semi-supervised Projected Clustering
  • By Kevin Yip (4th May 2004)

2
Outline
  • Introduction
  • Projected clustering
  • Semi-supervised clustering
  • Our problem
  • Our new algorithm
  • Experimental results
  • Future work and extensions

3
Projected Clustering
  • Where are the clusters?

4
Projected Clustering
  • Where are the clusters?

5
Projected Clustering
  • Pattern-based projected cluster

6
Projected Clustering
  • Goal: to discover clusters and their relevant
    dimensions that optimize a certain objective
    function.
  • Previous approaches:
  • Partitional: PROCLUS, ORCLUS
  • One cluster at a time: DOC, FastDOC, MineClus
  • Hierarchical: HARP

7
Projected Clustering
  • Limitations of these approaches:
  • Cannot detect clusters of extremely low
    dimensionality (clusters with a low percentage of
    relevant dimensions, e.g. only 5% of the input
    dimensions are relevant)
  • Require input parameter values that are hard for
    users to supply
  • Performance is sensitive to the parameter values
  • High time complexity

8
Semi-supervised Clustering
  • In some applications, a small amount of domain
    knowledge is available (e.g. the functions of 5%
    of the genes probed on a microarray).
  • The knowledge may not be suitable or sufficient
    for carrying out classification.
  • Clustering algorithms make little use of such
    external knowledge.

9
Semi-supervised Clustering
  • The idea of semi-supervised clustering:
  • Use the models implicitly assumed by a clustering
    algorithm (e.g. the compact hyperspheres of
    k-means, the density-connected irregular regions
    of DBSCAN)
  • Use external knowledge to guide the tuning of
    model parameters (e.g. the locations of cluster
    centers)

10
Semi-supervised Clustering
  • Why not plain clustering?
  • The clusters produced may not be the ones
    required.
  • There could be multiple possible groupings.
  • There is no way to utilize the domain knowledge
    that is accessible (active learning vs. passive
    validation).

(Guha et al., 1998)
11
Semi-supervised Clustering
  • Why not classification?
  • There is insufficient labeled data:
  • Objects are not labeled.
  • The number of labeled objects is statistically
    insignificant.
  • The labeled objects do not cover all classes.
  • The labeled objects of a class do not cover all
    cases (e.g. they are all found at one side of the
    class).
  • It is not always possible to find a
    classification method with an underlying model
    that fits the data (e.g. pattern-based
    similarity).

12
Our Problem
  • Data Model
  • The input dataset has n objects and d dimensions
  • The dataset contains k disjoint clusters, and
    possibly some outlier objects
  • Each cluster is associated with a set of relevant
    dimensions
  • If a dimension is relevant to a cluster, the
    projections of the cluster members on the
    dimension are random samples of a local Gaussian
    distribution
  • Other projections are random samples of a global
    distribution (e.g. uniform distribution or
    Gaussian distribution with a standard deviation
    much larger than those of the local distributions)
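As a concrete illustration of this data model, the sketch below generates a synthetic dataset of this kind. The function name, the sizes, the [0, 1) global range, and the 0.02 local standard deviation are illustrative assumptions, not values from the slides:

```python
import random

def make_dataset(n=1000, d=100, k=5, rel_frac=0.2, seed=0):
    """Generate data following the slide's model: each cluster has a set of
    relevant dimensions whose projections are samples of a local Gaussian;
    all other projections are samples of a global uniform distribution."""
    rng = random.Random(seed)
    n_rel = max(1, int(rel_frac * d))
    clusters = []
    for _ in range(k):
        dims = rng.sample(range(d), n_rel)           # relevant dimensions
        centers = {j: rng.random() for j in dims}    # local Gaussian means
        clusters.append((dims, centers))
    data, labels = [], []
    for i in range(n):
        c = i % k                                    # round-robin membership
        dims, centers = clusters[c]
        row = [rng.gauss(centers[j], 0.02) if j in centers else rng.random()
               for j in range(d)]
        data.append(row)
        labels.append(c)
    return data, labels, clusters
```

On such data, the within-cluster variance of a relevant dimension is far below the global variance, which is exactly the property the later dimension-selection schemes exploit.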

13
Our Problem
  • Resulting data if a dimension is relevant to a
    cluster, the projections of its members on the
    dimension will be close to each other (the
    within-cluster variance much smaller than
    irrelevant dimensions).
  • Example

14-17
Our Problem
  • (Example figures from these slides are not
    preserved in the transcript.)
18
Our Problem
  • Problem definition
  • Inputs
  • The dataset D
  • The target number of clusters k
  • A (possibly empty) set Io of labeled objects
    (obj. ID, class label), which may or may not
    cover all classes
  • A (possibly empty) set Iv of labeled relevant
    dimensions (dim. ID, class label), which may or
    may not cover all classes. A single dimension can
    be specified as relevant to multiple clusters

19
Our Problem
  • Problem definition (cont'd)
  • Outputs
  • A set of k disjoint projected clusters with a
    (locally) optimal objective score
  • A (possibly empty) set of outlier objects

20
Our Problem
  • Assumptions made in this study:
  • There is a primary clustering target (cf.
    biclustering)
  • Disjoint, axis-parallel clusters (cf. subspace
    clustering and ORCLUS)
  • Distance-based similarity
  • One cluster per class (cf. decision trees)
  • All inputs are correct (but they can be biased,
    i.e., with projections on the relevant dimensions
    deviating from the cluster center)

21
Our New Algorithm
  • Basic idea: k-medoid/k-median
  1. Determine the potential medoids (seeds) and the
     relevant dimensions of each cluster
  2. Assign every object to the cluster (or to the
     outlier list) that gives the greatest improvement
     to the objective score
  3. Decide which medoids are good/bad:
     • A good medoid: replace it by the cluster median
       and refine the selected dimensions
     • A bad medoid: replace it by another seed
  4. Repeat steps 2 and 3 until no improvement is
     obtained in a certain number of iterations
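The loop above can be sketched as follows. This is a minimal rendition that ignores the objective score, the outlier list, and the seed-group machinery: each object is assigned by its projected distance to each medoid, and every medoid is then refined to the per-dimension median of its members (the function name and parameters are ours, not from the slides):

```python
import random

def projected_kmedian(data, k, sel_dims, iters=10, seed=0):
    """Minimal sketch of the iterate-assign-refine loop: sel_dims[c] lists
    the relevant dimensions of cluster c, and distances are computed on
    those dimensions only (the 'projected' distance)."""
    rng = random.Random(seed)
    medoids = rng.sample(data, k)
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for x in data:
            # assign to the medoid closest in that cluster's subspace
            best = min(range(k), key=lambda c: sum(
                (x[j] - medoids[c][j]) ** 2 for j in sel_dims[c]
            ) / len(sel_dims[c]))
            members[best].append(x)
        for c in range(k):
            if members[c]:
                # refine: replace the medoid by per-dimension medians
                medoids[c] = [sorted(x[j] for x in members[c])[len(members[c]) // 2]
                              for j in range(len(data[0]))]
    return medoids, members
```

On well-separated data this converges in a few iterations; the real algorithm additionally scores each candidate assignment and replaces bad medoids from seed groups.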

22
Our New Algorithm
  • Issues to consider
  • Design of the objective function
  • Selection of relevant dimensions for a cluster
  • Determination of seeds and the relevant
    dimensions of the corresponding potential
    clusters
  • Replacement of medoids

23
Our New Algorithm
  • Design goals of the objective function
  • Should not have a trivial best score (e.g. when
    each cluster selects only one dimension)
  • Should not be ruined by the selection of a small
    amount of irrelevant dimensions
  • Should be robust (clustering accuracy should not
    degrade seriously when the input parameter values
    are not very accurate)

24
Our New Algorithm
  • The objective function (the formulas on this
    slide are not preserved in the transcript):
  • Overall score
  • Score component of cluster Ci
  • Contribution of selected dimension vj to the
    score component of Ci
  • Normalization factor

25
Our New Algorithm
  • Characteristics of the objective function:
  • Higher score → better clustering
  • No trivial best score when each cluster selects
    only one dimension or all dimensions
  • Relevant dimensions (dimensions with smaller
    local variance σij²) contribute more to the
    objective score
  • Robust? (To be discussed soon)

26
Our New Algorithm
  • Dimension selection
  • To maximize the score component, all dimensions
    whose local variance σij² is below a threshold
    should be selected (the threshold formulas are
    not preserved in the transcript).
  • Appropriate values of the threshold should be
    related to σj², the global variance of dimension
    vj.
  • Scheme 1: use the threshold m·σj² for a
    user-supplied parameter m
  • Scheme 1b: same as Scheme 1, but only dimensions
    satisfying an additional condition (formula lost)
    are selected → easier to compare the results
    with different m
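Assuming Scheme 1 means selecting every dimension whose local variance is below m times the global variance (a reading consistent with the m·σj² cutoff derived on slide 28), a minimal sketch:

```python
def select_dimensions(cluster_rows, data, m=0.5):
    """Hedged sketch of the variance-ratio selection rule implied by the
    slides: keep dimension j when the cluster's local variance is below
    m times the global variance of that dimension (m is user-supplied)."""
    def variance(vals):
        mu = sum(vals) / len(vals)
        return sum((v - mu) ** 2 for v in vals) / len(vals)
    d = len(data[0])
    selected = []
    for j in range(d):
        local = variance([row[j] for row in cluster_rows])  # sigma_ij^2
        glob = variance([row[j] for row in data])           # sigma_j^2
        if glob > 0 and local < m * glob:
            selected.append(j)
    return selected
```

A dimension on which the cluster members are tightly packed relative to the whole dataset is selected; a dimension whose local spread matches the global spread is not.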

27
Our New Algorithm
  • Scheme 2: estimate the probability for an
    irrelevant dimension to be selected (the global
    distribution needs to be known). If the global
    distribution is Gaussian:
  • If ni values are randomly sampled from the global
    distribution of an irrelevant dimension vj, the
    random variable (ni−1)σij²/σj² has a chi-square
    distribution with ni−1 degrees of freedom.
  • Suppose we want the probability of selecting an
    irrelevant dimension to be p; then the
    corresponding threshold can be computed from the
    cumulative chi-square distribution.

28
Our New Algorithm
  • Probability density function and cumulative
    distribution (ni = 30)

(30−1)σij²/σj² ≤ 19  →  σij² ≤ 0.66σj² (= m·σj²)
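The cutoff above can be sanity-checked without chi-square tables: the sketch below estimates the chi-square quantile by Monte Carlo (stdlib only; `scipy.stats.chi2.ppf` would give the exact value). The probability p ≈ 0.08 is our back-calculation from the cutoff 19 at 29 degrees of freedom, not a value stated in the slides:

```python
import random

def chi2_quantile_mc(df, p, trials=20000, seed=42):
    """Monte Carlo estimate of the chi-square p-quantile: simulate sums of
    df squared standard normals and take the empirical p-quantile."""
    rng = random.Random(seed)
    stats = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(df))
                   for _ in range(trials))
    return stats[int(p * trials)]

# Slide 28's numbers: ni = 30, so df = 29; a cutoff of 19 corresponds to
# roughly the 8% quantile, giving sigma_ij^2 <= 19/29 ~ 0.66 sigma_j^2.
q = chi2_quantile_mc(29, 0.08)
threshold_m = q / 29
```

So choosing p fixes the chi-square quantile, which divided by ni−1 yields the multiplier m applied to the global variance.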
29
Our New Algorithm
  • Robustness of the algorithm
  • A good value of m should be:
  • Large enough to tolerate local variances
  • Small enough to distinguish local variances from
    global variances
  • The best value to use is data-dependent, but
    provided the difference between local and global
    variances is large, there is usually a wide range
    of values that lead to results with acceptable
    performance (e.g. 0.3 < m < 0.7)

30
Our New Algorithm
  • Determination of seeds and the relevant
    dimensions of the corresponding potential
    clusters
  • Traditional approach
  • One seed pool
  • Seeds determined randomly/by max-min distance
    method/by preclustering (e.g. hierarchical)
  • Relevant dimensions of each cluster determined by
    a set of objects near the medoid (in the input
    space)

31
Our New Algorithm
  • Our proposal: seed groups
  • Seeds are stored in separate seed groups; each
    seed group contains a small number (e.g. 5) of
    seeds
  • One private seed group for each cluster that has
    some inputs
  • A number of public seed groups are shared by all
    clusters without external inputs
  • The seeds of the cluster with the largest amount
    of inputs are initialized first (as we are most
    confident in their correctness), then those with
    fewer inputs, and so on. Finally, the public seed
    groups are initialized.

32
Our New Algorithm
  • Our proposal: seed selection
  • Based on low-dimensional histograms (grids)
  • Relevant dimension → small variance → high
    density
  • Procedure:
  • Determine a starting point
  • Hill-climbing
  • → Need to determine both the dimensions used in
    constructing the grid and the starting point
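The hill-climbing step can be sketched over a precomputed 2-D grid histogram. This is an assumed rendition (the slides do not specify the neighbourhood or the stopping rule; we use the 8-neighbourhood and stop when no neighbour is denser):

```python
def hill_climb(grid, start):
    """From the starting cell, repeatedly move to the densest neighbouring
    cell until no neighbour is denser; the final cell locates a dense
    region, i.e. a candidate seed location."""
    rows, cols = len(grid), len(grid[0])
    cell = start
    while True:
        r, c = cell
        neighbours = [(r + dr, c + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
        best = max(neighbours, key=lambda rc: grid[rc[0]][rc[1]])
        if grid[best[0]][best[1]] <= grid[r][c]:
            return cell  # local density maximum reached
        cell = best
```

Because relevant dimensions concentrate cluster members into few cells, the climb ends in a cell dominated by one cluster even when the starting point is slightly off, which is how biased inputs get corrected.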

33
Our New Algorithm
  • Determining the grid-constructing dimensions and
    the starting point:
  • Case 1: a cluster with both labeled objects and
    labeled relevant dimensions
  • Case 2: a cluster with only labeled objects
  • Case 3: a cluster with only labeled relevant
    dimensions
  • Case 4: a cluster with no inputs

34
Our New Algorithm
  • Case 1: both kinds of inputs are available
  • Form a seed cluster from the input objects
  • Rank all dimensions by a relevance score φij (the
    formula is not preserved in the transcript)
  • All dimensions with positive φij or in the input
    set Iv are candidate dimensions for constructing
    the histograms
  • Relative chance of being selected: φij if
    dimension vj is not in Iv, 1 otherwise
  • The starting point is the median of the seed
    cluster

35
Our New Algorithm
  • Example: cluster 2
  • φ2x = 0.68
  • φ2y = 0.83
  • φ2z = −0.02
  • The hill-climbing mechanism fixes errors due to
    biased inputs

36
Our New Algorithm
  • Case 2: labeled objects only
  • Similar to case 1, but the chance for each
    dimension to be selected is based on φij only
  • Case 3: labeled dimensions only
  • Similar to case 1, but with no starting point,
    i.e., all cells are examined, and the one with
    the highest density is returned

37
Our New Algorithm
  • Case 4: no inputs
  • The tentative seed is the object with the maximum
    projected distance to the closest already
    selected seeds (modified max-min distance method)
  • For each dimension, a one-dimensional histogram
    is constructed to determine the density of
    objects around the projection of the tentative
    seed
  • The chance of each dimension being selected to
    construct the grid is based on that density
  • The tentative seed is used as the starting point
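A hedged sketch of the modified max-min step, assuming "projected distance" means a squared distance averaged over a given set of dimensions (the function name and signature are ours):

```python
def next_seed_maxmin(candidates, chosen, sel_dims):
    """Among the candidate objects, pick the one whose projected distance
    to its closest already-chosen seed is largest, so seeds spread out."""
    def proj_dist(a, b):
        # squared distance restricted to the given dimensions
        return sum((a[j] - b[j]) ** 2 for j in sel_dims) / len(sel_dims)
    return max(candidates,
               key=lambda x: min(proj_dist(x, s) for s in chosen))
```

This is the classical max-min (farthest-point) heuristic, modified only in that the distance is computed in the projected subspace rather than the full input space.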

38
Our New Algorithm
  • Medoid drawing/replacement
  • The medoid of each cluster is initially drawn
    from:
  • The corresponding private seed group, if
    available
  • A unique public seed group, otherwise
  • After assignment, the medoid of cluster Ci is
    likely to be a bad one if:
  • (φi / maxi' φi') is small: the cluster has a low
    quality compared to the other clusters
  • (φi / φmax) is small: the cluster has a low
    quality compared to a perfect cluster
  • The cluster is very similar to another cluster

39
Our New Algorithm
  • Medoid drawing/replacement (cont'd)
  • Each time, only one potentially bad medoid is
    replaced, since the probability of simultaneously
    correcting multiple medoids is low
  • The target bad medoid is replaced by a seed from
    the corresponding private seed group or a new
    public seed group
  • The medoids of the other clusters are replaced by
    their cluster medians, and the relevant
    dimensions are reselected
  • The algorithm keeps track of the best set of
    medoids and relevant dimensions

40
Experimental Results
  • Dataset 1: n = 1000, d = 100, k = 5, lreal =
    5-40 (5%-40% of d)
  • No external inputs
  • Algorithms:
  • HARP
  • PROCLUS
  • SSPC
  • CLARANS (non-projected control)

41
Experimental Results
  • Best performance (results with the best ARI
    values)
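ARI here is the Adjusted Rand Index, which scores the agreement between the produced clustering and the true class labels (1.0 = identical partitions, ~0 for chance agreement). For reference, a self-contained implementation (not from the slides):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same objects."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency-table counts
    a = Counter(labels_a)
    b = Counter(labels_b)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total           # chance-agreement term
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                  # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to label permutation, so a clustering need not use the same label names as the ground truth to score 1.0.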

42
Experimental Results
  • Best performance vs. average performance

43
Experimental Results
  • Robustness (lreal = 10)

44
Experimental Results
  • Dataset 2: n = 150, d = 3000, k = 5, lreal = 30
    (1% of d)
  • Inputs:
  • Io size, Iv size = 1-9
  • 4 combinations: both, labeled objects only,
    labeled relevant dimensions only, none
  • Coverage: 1-5 clusters (20%-100%)

45
Experimental Results
  • Increasing input size (100% coverage)

46
Experimental Results
  • Increasing coverage (input size = 3)

47
Experimental Results
  • Increasing coverage (input size = 6)

48
Future Work and Extensions
  • Other required experiments:
  • Biased inputs
  • Multiple labeling methods for a single dataset
  • Scalability
  • Real data
  • Imperfect data with artificial outliers and
    errors
  • Searching for the best k

49
Future Work and Extensions
  • To be considered in the future:
  • Other input types (e.g. must-links and
    cannot-links)
  • Wrong/inconsistent inputs
  • Pattern-based and range-based similarity
  • Non-disjoint clusters

50
References
  • Projected clustering:
  • HARP: A Hierarchical Algorithm with Automatic
    Relevant Attribute Selection for Projected
    Clustering (DB Seminar, 20 Sep 2002)
  • Semi-supervised clustering:
  • The Semi-supervised Clustering Problem (DB
    Seminar, 2 Jan 2004)