DB Seminar Series: Semi-supervised Projected Clustering
1
DB Seminar Series: Semi-supervised Projected Clustering
  • By Kevin Yip (4th May 2004)

2
Outline
  • Introduction
  • Projected clustering
  • Semi-supervised clustering
  • Our problem
  • Our new algorithm
  • Experimental results
  • Future work and extensions

3
Projected Clustering
  • Where are the clusters?

4
Projected Clustering
  • Where are the clusters?

5
Projected Clustering
  • Pattern-based projected cluster

6
Projected Clustering
  • Goal: to discover clusters and their relevant
    dimensions that optimize a certain objective
    function.
  • Previous approaches:
  • Partitional: PROCLUS, ORCLUS
  • One cluster at a time: DOC, FastDOC, MineClus
  • Hierarchical: HARP

7
Projected Clustering
  • Limitations of these approaches:
  • Cannot detect clusters of extremely low
    dimensionality (clusters with a low percentage of
    relevant dimensions, e.g. only 5% of the input
    dimensions are relevant)
  • Require input parameter values that are hard for
    users to supply
  • Performance is sensitive to the parameter values
  • High time complexity

8
Semi-supervised Clustering
  • In some applications, a small amount of domain
    knowledge is available (e.g. the functions of 5%
    of the genes probed on a microarray).
  • The knowledge may not be suitable or sufficient
    for carrying out classification.
  • Clustering algorithms make little use of such
    external knowledge.

9
Semi-supervised Clustering
  • The idea of semi-supervised clustering:
  • Use the models implicitly assumed by a clustering
    algorithm (e.g. the compact hyperspheres of
    k-means, the density-connected irregular regions
    of DBSCAN)
  • Use external knowledge to guide the tuning of
    model parameters (e.g. the locations of cluster
    centers)

10
Semi-supervised Clustering
  • Why not plain clustering?
  • The clusters produced may not be the ones
    required.
  • There could be multiple possible groupings.
  • There is no way to utilize the domain knowledge
    that is accessible (active learning vs. passive
    validation).

(Guha et al., 1998)
11
Semi-supervised Clustering
  • Why not classification?
  • There is insufficient labeled data:
  • Objects are not labeled.
  • The number of labeled objects is statistically
    insignificant.
  • The labeled objects do not cover all classes.
  • The labeled objects of a class do not cover all
    cases (e.g. they are all found at one side of the
    class).
  • It is not always possible to find a
    classification method with an underlying model
    that fits the data (e.g. pattern-based
    similarity).

12
Our Problem
  • Data Model
  • The input dataset has n objects and d dimensions
  • The dataset contains k disjoint clusters, and
    possibly some outlier objects
  • Each cluster is associated with a set of relevant
    dimensions
  • If a dimension is relevant to a cluster, the
    projections of the cluster members on the
    dimension are random samples of a local Gaussian
    distribution
  • Other projections are random samples of a global
    distribution (e.g. uniform distribution or
    Gaussian distribution with a standard deviation
    much larger than those of the local distributions)
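As a concrete illustration of this data model, the sketch below generates a synthetic dataset of this kind. The function name, the sizes, the [0, 1) global range, and the 0.02 local standard deviation are illustrative assumptions, not values from the slides:

```python
import random

def make_dataset(n=1000, d=100, k=5, rel_frac=0.2, seed=0):
    """Generate data following the slide's model: each cluster has a set of
    relevant dimensions whose projections are samples of a local Gaussian;
    all other projections are samples of a global uniform distribution."""
    rng = random.Random(seed)
    n_rel = max(1, int(rel_frac * d))
    clusters = []
    for _ in range(k):
        dims = rng.sample(range(d), n_rel)           # relevant dimensions
        centers = {j: rng.random() for j in dims}    # local Gaussian means
        clusters.append((dims, centers))
    data, labels = [], []
    for i in range(n):
        c = i % k                                    # round-robin membership
        dims, centers = clusters[c]
        row = [rng.gauss(centers[j], 0.02) if j in centers else rng.random()
               for j in range(d)]
        data.append(row)
        labels.append(c)
    return data, labels, clusters
```

On such data, the within-cluster variance of a relevant dimension is far below the global variance, which is exactly the property the later dimension-selection schemes exploit.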

13
Our Problem
  • Resulting data if a dimension is relevant to a
    cluster, the projections of its members on the
    dimension will be close to each other (the
    within-cluster variance much smaller than
    irrelevant dimensions).
  • Example

14-17
Our Problem
  • (Example figures from these slides are not
    preserved in the transcript.)
18
Our Problem
  • Problem definition
  • Inputs
  • The dataset D
  • The target number of clusters k
  • A (possibly empty) set Io of labeled objects
    (obj. ID, class label), which may or may not
    cover all classes
  • A (possibly empty) set Iv of labeled relevant
    dimensions (dim. ID, class label), which may or
    may not cover all classes. A single dimension can
    be specified as relevant to multiple clusters

19
Our Problem
  • Problem definition (cont'd)
  • Outputs
  • A set of k disjoint projected clusters with a
    (locally) optimal objective score
  • A (possibly empty) set of outlier objects

20
Our Problem
  • Assumptions made in this study:
  • There is a primary clustering target (cf.
    biclustering)
  • Disjoint, axis-parallel clusters (cf. subspace
    clustering and ORCLUS)
  • Distance-based similarity
  • One cluster per class (cf. decision trees)
  • All inputs are correct (but they can be biased,
    i.e., with projections on the relevant dimensions
    deviating from the cluster center)

21
Our New Algorithm
  • Basic idea: k-medoid/k-median
  1. Determine the potential medoids (seeds) and the
     relevant dimensions of each cluster
  2. Assign every object to the cluster (or to the
     outlier list) that gives the greatest improvement
     to the objective score
  3. Decide which medoids are good/bad:
     • A good medoid: replace it by the cluster median
       and refine the selected dimensions
     • A bad medoid: replace it by another seed
  4. Repeat steps 2 and 3 until no improvement is
     obtained in a certain number of iterations
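The loop above can be sketched as follows. This is a minimal rendition that ignores the objective score, the outlier list, and the seed-group machinery: each object is assigned by its projected distance to each medoid, and every medoid is then refined to the per-dimension median of its members (the function name and parameters are ours, not from the slides):

```python
import random

def projected_kmedian(data, k, sel_dims, iters=10, seed=0):
    """Minimal sketch of the iterate-assign-refine loop: sel_dims[c] lists
    the relevant dimensions of cluster c, and distances are computed on
    those dimensions only (the 'projected' distance)."""
    rng = random.Random(seed)
    medoids = rng.sample(data, k)
    for _ in range(iters):
        members = [[] for _ in range(k)]
        for x in data:
            # assign to the medoid closest in that cluster's subspace
            best = min(range(k), key=lambda c: sum(
                (x[j] - medoids[c][j]) ** 2 for j in sel_dims[c]
            ) / len(sel_dims[c]))
            members[best].append(x)
        for c in range(k):
            if members[c]:
                # refine: replace the medoid by per-dimension medians
                medoids[c] = [sorted(x[j] for x in members[c])[len(members[c]) // 2]
                              for j in range(len(data[0]))]
    return medoids, members
```

On well-separated data this converges in a few iterations; the real algorithm additionally scores each candidate assignment and replaces bad medoids from seed groups.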

22
Our New Algorithm
  • Issues to consider
  • Design of the objective function
  • Selection of relevant dimensions for a cluster
  • Determination of seeds and the relevant
    dimensions of the corresponding potential
    clusters
  • Replacement of medoids

23
Our New Algorithm
  • Design goals of the objective function
  • Should not have a trivial best score (e.g. when
    each cluster selects only one dimension)
  • Should not be ruined by the selection of a small
    amount of irrelevant dimensions
  • Should be robust (clustering accuracy should not
    degrade seriously when the input parameter values
    are not very accurate)

24
Our New Algorithm
  • The objective function (the formulas on this
    slide are not preserved in the transcript):
  • Overall score
  • Score component of cluster Ci
  • Contribution of selected dimension vj to the
    score component of Ci
  • Normalization factor

25
Our New Algorithm
  • Characteristics of the objective function:
  • Higher score → better clustering
  • No trivial best score when each cluster selects
    only one dimension or all dimensions
  • Relevant dimensions (dimensions with smaller
    local variance σij²) contribute more to the
    objective score
  • Robust? (To be discussed soon)

26
Our New Algorithm
  • Dimension selection
  • To maximize the score component, all dimensions
    whose local variance σij² is below a threshold
    should be selected (the threshold formulas are
    not preserved in the transcript).
  • Appropriate values of the threshold should be
    related to σj², the global variance of dimension
    vj.
  • Scheme 1: use the threshold m·σj² for a
    user-supplied parameter m
  • Scheme 1b: same as Scheme 1, but only dimensions
    satisfying an additional condition (formula lost)
    are selected → easier to compare the results
    with different m
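Assuming Scheme 1 means selecting every dimension whose local variance is below m times the global variance (a reading consistent with the m·σj² cutoff derived on slide 28), a minimal sketch:

```python
def select_dimensions(cluster_rows, data, m=0.5):
    """Hedged sketch of the variance-ratio selection rule implied by the
    slides: keep dimension j when the cluster's local variance is below
    m times the global variance of that dimension (m is user-supplied)."""
    def variance(vals):
        mu = sum(vals) / len(vals)
        return sum((v - mu) ** 2 for v in vals) / len(vals)
    d = len(data[0])
    selected = []
    for j in range(d):
        local = variance([row[j] for row in cluster_rows])  # sigma_ij^2
        glob = variance([row[j] for row in data])           # sigma_j^2
        if glob > 0 and local < m * glob:
            selected.append(j)
    return selected
```

A dimension on which the cluster members are tightly packed relative to the whole dataset is selected; a dimension whose local spread matches the global spread is not.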

27
Our New Algorithm
  • Scheme 2: estimate the probability for an
    irrelevant dimension to be selected (the global
    distribution needs to be known). If the global
    distribution is Gaussian:
  • If ni values are randomly sampled from the global
    distribution of an irrelevant dimension vj, the
    random variable (ni−1)σij²/σj² has a chi-square
    distribution with ni−1 degrees of freedom.
  • Suppose we want the probability of selecting an
    irrelevant dimension to be p; then the
    corresponding threshold can be computed from the
    cumulative chi-square distribution.

28
Our New Algorithm
  • Probability density function and cumulative
    distribution (ni = 30)

(30−1)σij²/σj² ≤ 19  →  σij² ≤ 0.66σj² (= m·σj²)
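The cutoff above can be sanity-checked without chi-square tables: the sketch below estimates the chi-square quantile by Monte Carlo (stdlib only; `scipy.stats.chi2.ppf` would give the exact value). The probability p ≈ 0.08 is our back-calculation from the cutoff 19 at 29 degrees of freedom, not a value stated in the slides:

```python
import random

def chi2_quantile_mc(df, p, trials=20000, seed=42):
    """Monte Carlo estimate of the chi-square p-quantile: simulate sums of
    df squared standard normals and take the empirical p-quantile."""
    rng = random.Random(seed)
    stats = sorted(sum(rng.gauss(0, 1) ** 2 for _ in range(df))
                   for _ in range(trials))
    return stats[int(p * trials)]

# Slide 28's numbers: ni = 30, so df = 29; a cutoff of 19 corresponds to
# roughly the 8% quantile, giving sigma_ij^2 <= 19/29 ~ 0.66 sigma_j^2.
q = chi2_quantile_mc(29, 0.08)
threshold_m = q / 29
```

So choosing p fixes the chi-square quantile, which divided by ni−1 yields the multiplier m applied to the global variance.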
29
Our New Algorithm
  • Robustness of the algorithm
  • A good value of m should be:
  • Large enough to tolerate local variances
  • Small enough to distinguish local variances from
    global variances
  • The best value to use is data-dependent, but
    provided the difference between local and global
    variances is large, there is usually a wide range
    of values that lead to results with acceptable
    performance (e.g. 0.3 < m < 0.7)

30
Our New Algorithm
  • Determination of seeds and the relevant
    dimensions of the corresponding potential
    clusters
  • Traditional approach
  • One seed pool
  • Seeds determined randomly/by max-min distance
    method/by preclustering (e.g. hierarchical)
  • Relevant dimensions of each cluster determined by
    a set of objects near the medoid (in the input
    space)

31
Our New Algorithm
  • Our proposal: seed groups
  • Seeds are stored in separate seed groups; each
    seed group contains a small number (e.g. 5) of
    seeds
  • One private seed group for each cluster that has
    some inputs
  • A number of public seed groups are shared by all
    clusters without external inputs
  • The seeds of the cluster with the largest amount
    of inputs are initialized first (as we are most
    confident in their correctness), then those with
    fewer inputs, and so on. Finally, the public seed
    groups are initialized.

32
Our New Algorithm
  • Our proposal: seed selection
  • Based on low-dimensional histograms (grids)
  • Relevant dimension → small variance → high
    density
  • Procedure:
  • Determine a starting point
  • Hill-climbing
  • → Need to determine both the dimensions used in
    constructing the grid and the starting point
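The hill-climbing step can be sketched over a precomputed 2-D grid histogram. This is an assumed rendition (the slides do not specify the neighbourhood or the stopping rule; we use the 8-neighbourhood and stop when no neighbour is denser):

```python
def hill_climb(grid, start):
    """From the starting cell, repeatedly move to the densest neighbouring
    cell until no neighbour is denser; the final cell locates a dense
    region, i.e. a candidate seed location."""
    rows, cols = len(grid), len(grid[0])
    cell = start
    while True:
        r, c = cell
        neighbours = [(r + dr, c + dc)
                      for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                      if 0 <= r + dr < rows and 0 <= c + dc < cols]
        best = max(neighbours, key=lambda rc: grid[rc[0]][rc[1]])
        if grid[best[0]][best[1]] <= grid[r][c]:
            return cell  # local density maximum reached
        cell = best
```

Because relevant dimensions concentrate cluster members into few cells, the climb ends in a cell dominated by one cluster even when the starting point is slightly off, which is how biased inputs get corrected.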

33
Our New Algorithm
  • Determining the grid-constructing dimensions and
    the starting point:
  • Case 1: a cluster with both labeled objects and
    labeled relevant dimensions
  • Case 2: a cluster with only labeled objects
  • Case 3: a cluster with only labeled relevant
    dimensions
  • Case 4: a cluster with no inputs

34
Our New Algorithm
  • Case 1: both kinds of inputs are available
  • Form a seed cluster from the input objects
  • Rank all dimensions by a relevance score φij (the
    formula is not preserved in the transcript)
  • All dimensions with positive φij or in the input
    set Iv are candidate dimensions for constructing
    the histograms
  • Relative chance of being selected: φij if
    dimension vj is not in Iv, 1 otherwise
  • The starting point is the median of the seed
    cluster

35
Our New Algorithm
  • Example: cluster 2
  • φ2x = 0.68
  • φ2y = 0.83
  • φ2z = −0.02
  • The hill-climbing mechanism fixes errors due to
    biased inputs

36
Our New Algorithm
  • Case 2: labeled objects only
  • Similar to case 1, but the chance for each
    dimension to be selected is based on φij only
  • Case 3: labeled dimensions only
  • Similar to case 1, but with no starting point,
    i.e., all cells are examined, and the one with
    the highest density is returned

37
Our New Algorithm
  • Case 4: no inputs
  • The tentative seed is the object with the maximum
    projected distance to the closest already
    selected seeds (modified max-min distance method)
  • For each dimension, a one-dimensional histogram
    is constructed to determine the density of
    objects around the projection of the tentative
    seed
  • The chance of each dimension being selected to
    construct the grid is based on that density
  • The tentative seed is used as the starting point
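A hedged sketch of the modified max-min step, assuming "projected distance" means a squared distance averaged over a given set of dimensions (the function name and signature are ours):

```python
def next_seed_maxmin(candidates, chosen, sel_dims):
    """Among the candidate objects, pick the one whose projected distance
    to its closest already-chosen seed is largest, so seeds spread out."""
    def proj_dist(a, b):
        # squared distance restricted to the given dimensions
        return sum((a[j] - b[j]) ** 2 for j in sel_dims) / len(sel_dims)
    return max(candidates,
               key=lambda x: min(proj_dist(x, s) for s in chosen))
```

This is the classical max-min (farthest-point) heuristic, modified only in that the distance is computed in the projected subspace rather than the full input space.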

38
Our New Algorithm
  • Medoid drawing/replacement
  • The medoid of each cluster is initially drawn
    from:
  • The corresponding private seed group, if
    available
  • A unique public seed group, otherwise
  • After assignment, the medoid of cluster Ci is
    likely to be a bad one if:
  • (φi / maxi' φi') is small: the cluster has a low
    quality compared to the other clusters
  • (φi / φmax) is small: the cluster has a low
    quality compared to a perfect cluster
  • The cluster is very similar to another cluster

39
Our New Algorithm
  • Medoid drawing/replacement (cont'd)
  • Each time, only one potentially bad medoid is
    replaced, since the probability of simultaneously
    correcting multiple medoids is low
  • The target bad medoid is replaced by a seed from
    the corresponding private seed group or a new
    public seed group
  • The medoids of the other clusters are replaced by
    their cluster medians, and the relevant
    dimensions are reselected
  • The algorithm keeps track of the best set of
    medoids and relevant dimensions

40
Experimental Results
  • Dataset 1: n = 1000, d = 100, k = 5, lreal =
    5-40 (5%-40% of d)
  • No external inputs
  • Algorithms:
  • HARP
  • PROCLUS
  • SSPC
  • CLARANS (non-projected control)

41
Experimental Results
  • Best performance (results with the best ARI
    values)
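ARI here is the Adjusted Rand Index, which scores the agreement between the produced clustering and the true class labels (1.0 = identical partitions, ~0 for chance agreement). For reference, a self-contained implementation (not from the slides):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand Index between two labelings of the same objects."""
    n = len(labels_a)
    pairs = Counter(zip(labels_a, labels_b))   # contingency-table counts
    a = Counter(labels_a)
    b = Counter(labels_b)
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in a.values())
    sum_b = sum(comb(c, 2) for c in b.values())
    total = comb(n, 2)
    expected = sum_a * sum_b / total           # chance-agreement term
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:                  # degenerate partitions
        return 1.0
    return (sum_ij - expected) / (max_index - expected)
```

Note that ARI is invariant to label permutation, so a clustering need not use the same label names as the ground truth to score 1.0.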

42
Experimental Results
  • Best performance vs. average performance

43
Experimental Results
  • Robustness (lreal = 10)

44
Experimental Results
  • Dataset 2: n = 150, d = 3000, k = 5, lreal = 30
    (1% of d)
  • Inputs:
  • Io size, Iv size = 1-9
  • 4 combinations: both, labeled objects only,
    labeled relevant dimensions only, none
  • Coverage: 1-5 clusters (20%-100%)

45
Experimental Results
  • Increasing input size (100% coverage)

46
Experimental Results
  • Increasing coverage (input size = 3)

47
Experimental Results
  • Increasing coverage (input size = 6)

48
Future Work and Extensions
  • Other required experiments:
  • Biased inputs
  • Multiple labeling methods for a single dataset
  • Scalability
  • Real data
  • Imperfect data with artificial outliers and
    errors
  • Searching for the best k

49
Future Work and Extensions
  • To be considered in the future:
  • Other input types (e.g. must-links and
    cannot-links)
  • Wrong/inconsistent inputs
  • Pattern-based and range-based similarity
  • Non-disjoint clusters

50
References
  • Projected clustering:
  • HARP: A Hierarchical Algorithm with Automatic
    Relevant Attribute Selection for Projected
    Clustering (DB Seminar, 20 Sep 2002)
  • Semi-supervised clustering:
  • The Semi-supervised Clustering Problem (DB
    Seminar, 2 Jan 2004)