CSE634: DATA CLUSTERING METHODS
1
CSE634: DATA CLUSTERING METHODS
Group 9
  • Karthik Anandh Govindaraj (105845335)
  • Shashank Viswanadha (105955553)
  • Praveen Durairaj (105948340)
  • Ravikanth Pulavarthy (105227609)

2
CSE634: DATA CLUSTERING METHODS
Group 9
  • Karthik Anandh Govindaraj
  • (karthikanandh@gmail.com)

3
References
  • Ester, M., Kriegel, H.-P., Sander, J., and Xu, X.
    A Density-Based Algorithm for Discovering
    Clusters in Large Spatial Databases with Noise.
    In Proc. 2nd International Conference on
    Knowledge Discovery and Data Mining
    (KDD'96), pages 226-231, 1996.
  • Ankerst, M., Breunig, M. M., Kriegel, H.-P., and
    Sander, J. OPTICS: Ordering Points to Identify the
    Clustering Structure. In Proc. ACM SIGMOD Int.
    Conf. on Management of Data (SIGMOD'99), pages
    49-60, 1999.
  • Hinneburg, A. and Keim, D. A. An Efficient Approach to
    Clustering in Large Multimedia Databases with
    Noise. In Proc. 4th Int. Conf. on Knowledge
    Discovery and Data Mining (KDD'98), AAAI Press,
    pages 58-65, 1998.

4
What is Cluster Analysis?
  • Cluster: a collection of data objects
  • Similar to the objects in the same cluster
    (intraclass similarity)
  • Dissimilar to the objects in other clusters
    (interclass dissimilarity)
  • Cluster analysis: a statistical method for
    grouping a set of data objects into clusters
  • A good clustering method produces high-quality
    clusters with high intraclass similarity and low
    interclass similarity
  • Clustering is unsupervised classification

5
Clustering methods
  • Partitioning methods
  • Hierarchical methods
  • Density-based methods
  • Grid-based methods

6
Issues - Large Spatial Databases
  • Minimal requirements of domain knowledge to
    determine the input parameters
  • Discovery of clusters with arbitrary shape
  • Good efficiency on large databases

7
Density-Based Clustering Methods
  • Clustering based on density (local cluster
    criterion), such as density-connected points
  • Features
  • Discover clusters of arbitrary shape
  • Handle noise
  • One scan
  • Need density parameters as termination condition
  • Studies
  • DBSCAN: Ester et al. (KDD'96)
  • OPTICS: Ankerst et al. (SIGMOD'99)
  • DENCLUE: Hinneburg and Keim (KDD'98)

8
Density-Based Clustering Definitions
  • Parameters
  • Eps: maximum radius of the neighbourhood
  • MinPts: minimum number of points in an
    Eps-neighbourhood of that point
  • N_Eps(p) = { q ∈ D | dist(p,q) ≤ Eps }
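As a minimal illustration of this definition, the Eps-neighbourhood can be computed with a brute-force scan (a sketch assuming points are Python coordinate tuples; not the paper's index-based implementation):

```python
import math

def eps_neighborhood(D, p, eps):
    """N_Eps(p): all points of D within distance eps of p."""
    return [q for q in D if math.dist(p, q) <= eps]

D = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0)]
print(eps_neighborhood(D, (0.0, 0.0), eps=0.5))  # -> [(0.0, 0.0), (0.1, 0.1)]
```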

9
Definitions
  • Directly density-reachable: a point p is directly
    density-reachable from a point q wrt. Eps, MinPts
    if
  • 1) p belongs to N_Eps(q)
  • 2) core point condition:
  • |N_Eps(q)| ≥ MinPts

10
Contd.
  • Density-reachable
  • A point p is density-reachable from a point q
    wrt. Eps, MinPts if there is a chain of points
    p_1, ..., p_n with p_1 = q and p_n = p such that
    p_{i+1} is directly density-reachable from p_i
  • Density-connected
  • A point p is density-connected to a point q wrt.
    Eps, MinPts if there is a point o such that both
    p and q are density-reachable from o wrt. Eps and
    MinPts.

11
Density Based Cluster Definition
  • Cluster: a maximal set of density-connected
    points
  • A cluster C is a subset of D satisfying:
  • For all p, q: if p ∈ C and q is density-reachable
    from p, then q ∈ C
  • For all p, q ∈ C: p is density-connected to q

12
Contd.
  • Lemma 1: if p is a core point, and O is the set
    of points density-reachable from p, then O is a
    cluster
  • Lemma 2: let C be a cluster and p be any core
    point of C; then C equals the set of points
    density-reachable from p
  • Implication: finding the points density-reachable
    from an arbitrary core point generates a cluster.
    A cluster is uniquely determined by any of its
    core points

13
DBSCAN The Algorithm
  • Arbitrarily select a point p
  • Retrieve all points density-reachable from p wrt.
    Eps and MinPts.
  • If p is a core point, a cluster is formed.
  • If p is a border point, no points are
    density-reachable from p and DBSCAN visits the
    next point of the database.
  • Continue the process until all of the points have
    been processed.
  • Complexity: O(N²) in the worst case; O(N log N)
    with a spatial index such as an R*-tree
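A compact sketch of this procedure (illustrative Python with brute-force neighbor search; the original implementation speeds up the region queries with an R*-tree):

```python
# A minimal DBSCAN sketch (not the authors' code); points is a list of
# coordinate tuples; labels: 0 = unvisited, -1 = noise, >0 = cluster id.
import math

def region_query(points, p, eps):
    """Indices of all points within eps of points[p]."""
    return [q for q in range(len(points))
            if math.dist(points[p], points[q]) <= eps]

def dbscan(points, eps, min_pts):
    NOISE, UNVISITED = -1, 0
    labels = [UNVISITED] * len(points)
    cluster_id = 0
    for p in range(len(points)):
        if labels[p] != UNVISITED:
            continue
        neighbors = region_query(points, p, eps)
        if len(neighbors) < min_pts:
            labels[p] = NOISE              # border or noise for now
            continue
        cluster_id += 1                    # p is a core point: new cluster
        labels[p] = cluster_id
        seeds = list(neighbors)
        while seeds:
            q = seeds.pop()
            if labels[q] == NOISE:         # claim former noise as border
                labels[q] = cluster_id
            if labels[q] != UNVISITED:
                continue
            labels[q] = cluster_id
            q_neighbors = region_query(points, q, eps)
            if len(q_neighbors) >= min_pts:  # q is also core: expand
                seeds.extend(q_neighbors)
    return labels
```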

14
OPTICS A Cluster-Ordering Method
  • OPTICS: Ordering Points To Identify the
    Clustering Structure
  • Ankerst, Breunig, Kriegel, and Sander (SIGMOD'99)
  • Produces a special order of the database wrt. its
    density-based clustering structure
  • This cluster ordering contains information
    equivalent to the density-based clusterings
    corresponding to a broad range of parameter
    settings
  • Good for both automatic and interactive cluster
    analysis, including finding intrinsic clustering
    structure
  • Can be represented graphically or using
    visualization techniques

15
Core- and Reachability Distance
  • Parameters: generating distance ε, fixed value
    MinPts
  • core-distance_{ε,MinPts}(o):
  • smallest distance such that o is a core object
  • (if that distance is ≤ ε; ∞ otherwise)
  • reachability-distance_{ε,MinPts}(p, o):
  • smallest distance such that p is
  • directly density-reachable from o
  • (if that distance is ≤ ε; ∞ otherwise)

16
The Algorithm OPTICS
  • foreach o ∈ Database
  • // initially, o.processed = false for all objects o
  • if o.processed = false
  • insert (o, ∞) into ControlList
  • while ControlList is not empty
  • select first element (o, r_dist) from
    ControlList
  • retrieve N_ε(o) and determine c_dist =
    core-distance(o)
  • set o.processed = true
  • write (o, r_dist, c_dist) to file
  • if o is a core object at any distance ≤ ε

17
Contd..
  • foreach p ∈ N_ε(o) not yet processed
  • determine r_dist_p = reachability-distance(p, o)
  • if (p, _) ∉ ControlList
  • insert (p, r_dist_p) into ControlList
  • else if (p, old_r_dist) ∈ ControlList and r_dist_p
    < old_r_dist
  • update (p, r_dist_p) in ControlList
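For comparison, a short usage sketch with scikit-learn's OPTICS implementation, assuming scikit-learn is installed (parameter values are illustrative); its reachability_ and ordering_ attributes together give the reachability plot shown next:

```python
# Usage sketch with scikit-learn's OPTICS (assumes scikit-learn is
# installed; min_samples and max_eps values are illustrative).
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),   # a dense cluster
               rng.normal(3.0, 0.6, (100, 2))])  # a looser cluster

optics = OPTICS(min_samples=10, max_eps=2.0).fit(X)

# Reachability distances in cluster order: valleys correspond to clusters.
reachability = optics.reachability_[optics.ordering_]
print(reachability[:10])
```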

18
(Figure: reachability plot showing reachability-distance, undefined for some
objects, over the cluster-order of the objects; valleys correspond to clusters)
19
DENCLUE: Clustering using density functions
  • DENsity-based CLUstEring, by Hinneburg and Keim
    (KDD'98)
  • Major features
  • Solid mathematical foundation
  • Good for data sets with large amounts of noise
  • Allows a compact mathematical description of
    arbitrarily shaped clusters in high-dimensional
    data sets
  • Significantly faster than existing algorithms
    (faster than DBSCAN by a factor of up to 45)
  • But needs a large number of parameters

20
DENCLUE Technical Essence
  • Uses grid cells, but only keeps information about
    grid cells that actually contain data points,
    and manages these cells in a tree-based access
    structure.
  • Influence function: describes the impact of a
    data point within its neighborhood.
  • Overall density of the data space can be
    calculated as the sum of the influence functions
    of all data points.
  • Clusters can be determined mathematically by
    identifying density attractors.
  • Density attractors are local maxima of the
    overall density function.
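A minimal sketch of a Gaussian influence function, the resulting overall density, and a gradient (hill-climbing) step toward a density attractor; function names and the step size are illustrative, and the real algorithm operates on the grid-cell structure described above:

```python
# Sketch of a Gaussian influence function and DENCLUE-style density
# (illustrative names; the real algorithm works on populated grid cells).
import numpy as np

def gaussian_influence(x, y, sigma=1.0):
    """Influence of data point y at location x."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

def density(x, data, sigma=1.0):
    """Overall density at x: sum of the influences of all data points."""
    return sum(gaussian_influence(x, y, sigma) for y in data)

def hill_climb(x, data, sigma=1.0, step=0.1, iters=50):
    """Follow the density gradient toward a density attractor."""
    for _ in range(iters):
        grad = sum((y - x) * gaussian_influence(x, y, sigma) for y in data)
        norm = np.linalg.norm(grad)
        if norm < 1e-9:                 # at (or very near) a local maximum
            break
        x = x + step * grad / norm
    return x

data = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]])
print(hill_climb(np.array([0.5, 0.5]), data))  # drifts toward the dense pair
```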

21
Gradient: the steepness of a slope
  • Example

22
Example Density Computation
D = {x1, x2, x3, x4}
f_D^Gaussian(x) = influence(x1) + influence(x2) + influence(x3) + influence(x4)
                = 0.04 + 0.06 + 0.08 + 0.6 = 0.78
(Figure: points x1, x2, x3, x4 around locations x and y, with individual
influence values 0.04, 0.06, 0.08, and 0.6)
Remark: the density value of y would be larger
than the one for x
23
Density Attractor
24
Basic Steps - DENCLUE Algorithms
  • Determine density attractors
  • Associate data objects with density attractors
    (→ initial clustering)
  • Merge the initial clusters further, relying on a
    hierarchical clustering approach

25
CSE634: DATA CLUSTERING METHODS
Group 9
  • Shashank Viswanadha
  • (sviswana@cs.sunysb.edu)

26
Sources and References
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber (Second Edition)
  • Data Clustering by A. K. Jain (Michigan State
    University), M. N. Murty (Indian Institute of
    Science), and P. J. Flynn (The Ohio State
    University)
  • http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
  • STING: A Statistical Information Grid Approach to
    Spatial Data Mining by Wei Wang, Jiong Yang, and
    Richard Muntz (University of California, Los
    Angeles)
  • http://www.sigmod.org/vldb/conf/1997/P186.PDF
  • WaveCluster: a wavelet-based clustering approach
    for spatial data in very large databases by
    Gholamhosein Sheikholeslami, Surojit Chatterjee,
    and Aidong Zhang (The VLDB Journal (2000) 8:
    289-304)
  • http://www.cs.uiuc.edu/homes/hanj/refs/papers/sheikholeslami98.pdf

27
Overview
  • Grid-Based Methods
  • STING
  • WaveCluster
  • Clustering High-Dimensional Data
  • CLIQUE
  • PROCLUS

28
Grid-Based Methods
  • Uses multiresolution grid data structure
  • Operations are performed on finite number of
    cells which form a grid
  • Fast processing
  • Examples
  • STING: explores statistical information stored in
    grid cells
  • WaveCluster: clusters objects using wavelet
    transform methods

29
STING STatistical INformation Grid
  • Grid-based multiresolution clustering technique
    in which the spatial area is divided into
    rectangular cells.
  • Different levels of rectangular cells correspond
    to different levels of resolution, forming a
    hierarchical structure.
  • Statistical parameters of higher-level cells can
    be computed from the parameters of the
    lower-level cells.
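A sketch of that bottom-up computation (the cell fields here are illustrative; STING additionally maintains stdev and a distribution type per cell):

```python
# Sketch of bottom-up parameter aggregation in a STING-like hierarchy
# (illustrative cell structure; STING also tracks stdev and distribution).
from dataclasses import dataclass

@dataclass
class Cell:
    n: int        # number of objects in the cell
    mean: float   # mean of the attribute value
    lo: float     # minimum
    hi: float     # maximum

def aggregate(children):
    """Compute a parent cell's parameters from its child cells."""
    n = sum(c.n for c in children)
    mean = sum(c.mean * c.n for c in children) / n if n else 0.0
    return Cell(n=n, mean=mean,
                lo=min(c.lo for c in children),
                hi=max(c.hi for c in children))

parent = aggregate([Cell(10, 2.0, 0.5, 3.5), Cell(5, 4.0, 3.0, 5.0),
                    Cell(8, 1.0, 0.1, 2.2),
                    Cell(0, 0.0, float("inf"), float("-inf"))])
print(parent)   # n=23, weighted mean, global min/max of the children
```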

30
Contd.
31
Contd.
32
Contd.
  • Types of parameters
  • Attribute-independent: number of objects in a
    cell
  • Attribute-dependent: mean, stdev, min, and max
  • Types of distribution that the attribute value
    can follow:
  • Normal
  • Uniform
  • Exponential
  • None (if the distribution is unknown)

33
Contd.
  • Mainly used for query answering.
  • Advantages:
  • Query-independent
  • Facilitates parallel processing and incremental
    updating
  • Efficiency
  • Disadvantage:
  • All the cluster boundaries are either horizontal
    or vertical; no diagonal boundary is detected
  • Time complexity for query processing is O(g),
    where g is the total number of grid cells at the
    lowest level, which is much smaller than the
    number of objects.

34
WaveCluster: Clustering Using Wavelet Transformation
  • WaveCluster, a multiresolution clustering
    algorithm, involves two steps:
  • Summarizes the data by imposing a multidimensional
    grid structure onto the data space.
  • Transforms the original feature space, finding
    dense regions in the transformed space.
  • Handles large data sets efficiently, discovers
    clusters with arbitrary shape, handles outliers,
    and is insensitive to the order of input.
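A rough sketch of these two steps under simplifying assumptions (2D data, PyWavelets and SciPy available, illustrative grid size and density threshold):

```python
# Rough WaveCluster-style sketch (2D data; grid size and threshold are
# illustrative). Assumes PyWavelets (pywt) and SciPy are available.
import numpy as np
import pywt
from scipy import ndimage

def wavecluster_sketch(points, grid=64, density_threshold=1.0):
    # Step 1: summarize the data with a grid of per-cell counts.
    hist, _, _ = np.histogram2d(points[:, 0], points[:, 1], bins=grid)
    # Step 2: one level of 2D wavelet transform; the low-frequency
    # (approximation) band smooths dense regions and suppresses noise.
    approx, _details = pywt.dwt2(hist, "haar")
    # Threshold the transformed space and label connected dense regions.
    dense = approx > density_threshold
    labels, n_clusters = ndimage.label(dense)
    return labels, n_clusters

pts = np.random.default_rng(0).normal(0.0, 0.5, size=(500, 2))
_, n = wavecluster_sketch(pts)
print(n)   # number of connected dense regions found
```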

35
Contd.
  • Why is wavelet transformation useful for
    clustering?
  • Unsupervised clustering: it uses hat-shaped
    filters to emphasize regions where points cluster,
    while simultaneously suppressing weaker
    information at their boundaries

36
Contd.
  • Effective removal of outliers

37
Contd.
  • Multiresolution: the multiresolution property of
    the wavelet transform can help in detecting
    clusters at different levels of detail. The
    wavelet transform provides multiple levels of
    decomposition, which results in clusters at
    different scales from fine to coarse.
  • Cost efficiency: since applying the wavelet
    transform is very fast, it makes the approach
    cost-effective. As shown in the authors'
    experiments, clustering very large datasets takes
    only a few seconds, and parallel processing can
    give even faster responses.

38
Clustering High-Dimensional Data
  • Introduces clustering methods designed for
    high-dimensional data: generally over 10
    dimensions, or even thousands for some tasks
  • Challenges to be handled when finding clusters in
    high-dimensional data:
  • Noise produced by irrelevant dimensions
  • Sparse data
  • Data points located in different dimensions
    become equally distanced.

39
Contd.
  • Techniques used
  • Feature transformation methods: transform the
    data onto a smaller space while generally
    preserving the original distances between objects.
    Examples: principal component analysis and
    singular value decomposition
  • Feature selection methods: commonly used for data
    reduction by removing irrelevant or redundant
    dimensions (attributes)
  • Subspace clustering: an extension of the feature
    selection method. Searches for groups of
    clusters within different subspaces of the same
    data set.

40
Contd.
  • Examples
  • CLIQUE: dimension-growth subspace clustering
  • PROCLUS: dimension-reduction projected clustering

41
CLIQUE
  • CLIQUE's clustering algorithm outline:
  • Identify the sparse and crowded areas in space,
    thereby discovering the overall distribution.
  • A cluster is defined as a maximal set of connected
    dense units.
  • Performs multidimensional clustering in two steps:
  • Partitions the d-dimensional data space into
    nonoverlapping rectangular units, identifying the
    dense units among these.
  • Generates a minimal description for each cluster.
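A minimal sketch of the first step, identifying dense units (xi and tau are illustrative names for the number of intervals per dimension and the density threshold):

```python
# Sketch of CLIQUE's dense-unit identification (xi = intervals per
# dimension, tau = density threshold; both names are illustrative).
import numpy as np
from collections import Counter

def dense_units(X, xi=10, tau=5):
    """Return the set of dense grid units of a d-dimensional data set X."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    # Map each point to the index of its grid unit in every dimension.
    idx = np.floor((X - lo) / (hi - lo + 1e-12) * xi).astype(int)
    counts = Counter(map(tuple, idx))
    return {unit for unit, count in counts.items() if count >= tau}

X = np.random.default_rng(0).normal(0.0, 1.0, size=(200, 2))
print(len(dense_units(X)))   # step 2 would merge connected dense units
```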

42
Contd.
  • Insensitive to the order of input objects
  • Doesn't presume any canonical data distribution
  • Scales linearly with the size of input and hence
    has good scalability
  • Clustering results depend on proper tuning of the
    grid size.
  • Also difficult to find clusters of rather
    different density within different dimensional
    subspaces

43
PROCLUS
  • Typical dimension-reduction subspace clustering
    method.
  • Consists of three phases:
  • Initialization
  • Iteration
  • Cluster refinement
  • Initialization: select a set of initial medoids
    that are far apart from each other, so as to
    ensure that each cluster is represented by at
    least one object in the selected set.

44
Contd.
  • Iteration: selects a random set of k medoids
    from the reduced set and replaces bad medoids
    with randomly chosen new medoids if the
    clustering is improved.
  • Refinement: computes new dimensions for each
    medoid based on the clusters found, reassigns
    points to medoids, and removes outliers.
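A sketch of the greedy far-apart selection used in initialization (illustrative; PROCLUS applies this to a random sample of the data to build its candidate medoid set):

```python
# Sketch of greedy far-apart medoid selection (illustrative; PROCLUS
# applies this to a sample to build its candidate medoid set).
import numpy as np

def greedy_medoids(X, k, seed=0):
    """Greedily pick k points, each farthest from those already chosen."""
    rng = np.random.default_rng(seed)
    medoids = [int(rng.integers(len(X)))]        # start from a random point
    dists = np.linalg.norm(X - X[medoids[0]], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dists))              # farthest remaining point
        medoids.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(X - X[nxt], axis=1))
    return medoids

X = np.random.default_rng(1).normal(0.0, 1.0, size=(100, 3))
print(greedy_medoids(X, k=4))   # indices of 4 mutually far-apart points
```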

45
CSE634: DATA CLUSTERING METHODS
Group 9
  • Praveen Durairaj
  • (praveend@cs.sunysb.edu)

46
Sources and References
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber (Second Edition)
  • Data Clustering by A. K. Jain (Michigan State
    University), M. N. Murty (Indian Institute of
    Science), and P. J. Flynn (The Ohio State
    University)
  • http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
  • Clustering Through Decision Tree Construction
    (2000) by Bing Liu, Yiyuan Xia, Philip S. Yu
  • link

47
Constraint-Based Cluster Analysis
  • Used when the clustering task involves a very
    high-dimensional space.
  • User preferences.
  • Constraints while clustering.
  • Examples:
  • Expected number of clusters
  • Minimal/maximal cluster size

48
Categories of constraint based clustering
  • Constraints on individual objects
  • Constraints on the selection of clustering
    parameters
  • Constraints on distance or similarity functions
  • Clustering with obstacle objects
  • User specified constraints on properties of
    individual clusters
  • Semi-supervised clustering based on partial
    supervision

49
Clustering with Obstacle objects
  • Consider obstacle objects during clustering
  • Partitioning clustering method
  • k-medoids method
  • Uses triangulation method to compute the distance
    between two objects.
  • Computational cost is very high if a large number
    of objects and obstacles are present.

50
Solving clustering with obstacles-Visibility
Graphs
  • A visibility graph is a graph VG = (V, E) such
    that each vertex of the obstacles has a
    corresponding node in V, and two nodes v1 and
    v2 in V are joined by an edge in E if and
    only if the corresponding vertices they represent
    are visible to each other.
  • An example visibility graph:

51
Visibility graphs
  • Consider another visibility graph VG' = (V', E')
    created from VG by adding two points p and q
    to V'.
  • The shortest path between the two points p and q
    will be a sub-path of VG'.

52
Cost of distance computation
  • Preprocessing and optimization techniques are
    used
  • Triangulating the region into triangles
  • Group nearby points to form micro-clusters
  • Uses two types of indices for optimization
  • VV indices, for any pair of obstacle vertices
  • MV indices, for any pair of micro-cluster and
    obstacle vertex

53
User constrained cluster analysis
  • Constrained optimization problem
  • Example (package industry): n customers and k
    service stations
  • Customer classification:
  • High-value customers
  • Ordinary customers

54
Micro-clustering
  • Partition the data set into k clusters satisfying
    user-specified constraints
  • Iterative refinement of the solution:
  • Move m surplus objects from cluster Ci to Cj
    so that the total sum of the distances of the
    objects to their corresponding cluster centers
    is reduced

55
Computational efficiency
  • Should handle deadlock situations
  • A constraint may be impossible to satisfy
  • Data is preprocessed to form micro-clusters
  • Object movement
  • Deadlock detection
  • Constraint satisfaction
  • Advantage:
  • Improved scalability (micro-clustering reduces
    the number of objects to handle)

56
Semi-supervised cluster analysis
  • Clustering process based on user feedback or
    guidance constraints
  • Pair-wise constraints
  • Objects are labeled as belonging to the same
    cluster or different clusters
  • Generates highly desirable clusters

57
Methods
  • Constraint-based semi-supervised clustering
  • Relies on user-provided labels or constraints
  • Example: CLTree (based on decision trees)
  • Distance-based semi-supervised clustering
  • Adaptive distance measure
  • String-edit distance using
    Expectation-Maximization
  • Euclidean distance

58
Clustering using decision trees
  • Converts the clustering problem into a
    classification problem
  • Considers the set of points to be clustered as
    one class, Y
  • Adds a set of relatively uniformly distributed
    non-existence points with label N
  • Does not physically add the points, but only
    assumes their existence

59
Clustering using decision trees
a) Set of data points (Y) to be clustered
b) Addition of uniformly distributed N points
c) Clustering of the resulting space, shown with Y points only
60
Clustering using decision trees
  • Works efficiently because the decision tree only
    needs the number of N points
  • The number of N points for the current node E is
    determined by the following rule (note that at
    the root node, the number of inherited N points
    is 0):
  • If the number of N points inherited from the
    parent node of E is less than the number of Y
    points in E, then
  • the number of N points for E is increased to the
    number of Y points in E;
  • else the number of inherited N points is used for E

61
Clustering in Data Mining
  • Searching for useful information in large volumes
    of data
  • Current real-world data mining systems:
  • Detecting trends and patterns of play for NBA
    players
  • Categorizing patterns of children in the foster
    care system
  • Data mining approaches that use clustering:
  • Segmentation
  • Predictive modeling
  • Visualization of large databases

62
Segmentation
  • Clusters form homogeneous groups
  • Example: clustering pixels in Landsat images
  • Each pixel has 7 values from different satellite
    bands
  • A k-means algorithm clusters these 7-value
    vectors into 256 groups
  • The image is then displayed with the spatial
    information

63
Predictive Modeling
  • Clusters group items
  • Rules are inferred to characterize groups and
    suggest models
  • Consider magazine subscribers:
  • Clustered based on age, sex, income, etc.
  • Groups are clustered further to predict whether
    the subscribers will renew their subscriptions

64
Visualization
  • Aids human analysts in identifying groups that
    have similar characteristics
  • WinViz tool:
  • Exports derived clusters as new attributes and
    characterizes them
  • Cereals can be clustered based on calories,
    carbohydrates, sugar, etc.
  • Milk cereals can be characterized by high
    potassium content

65
Mining large unstructured databases
  • Classifying web documents using words or
    functions of words
  • Problems:
  • Very high dimensionality of the data sets
  • Relatively small sets of labeled samples
  • Cluster words from a small collection of World
    Wide Web documents in the document space

66
CSE634: DATA CLUSTERING METHODS
Group 9
  • Ravikanth Pulavarthy
  • (ravi.ingr@gmail.com)

67
Sources and References
  • Data Mining: Concepts and Techniques by Jiawei
    Han and Micheline Kamber (Second Edition)
  • Data Clustering by A. K. Jain (Michigan State
    University), M. N. Murty (Indian Institute of
    Science), and P. J. Flynn (The Ohio State
    University)
  • http://www.cs.rutgers.edu/~mlittman/courses/lightai03/jain99data.pdf
  • Parsing Images of Architectural Scenes by
    A. Berg, M. Agrawala, J. Malik
  • http://www.cs.berkeley.edu/~asimma/294-fall06/projects/reports/grabler.pdf

68
What defines an object?
"I stand at the window and see a house, trees,
sky. Theoretically I might say there were 327
brightnesses and nuances of colour. Do I have
"327"? No. I have sky, house, and trees." --Max
Wertheimer
69
Segmentation and Grouping
  • To recognize objects, rather than dealing with
    too many pixels, we need a compact/summary
    representation
  • Obtain this representation from an image, motion
    sequence, or set of tokens
  • What is interesting and what is not depends on
    the application
70
Image segmentation
  • Segmentation: splitting an image into regions
    based on some criteria (intensity, color,
    texture, orientation energy, ...).

71
Segmentation Algorithms
  • Simple Segmentation Algorithms
  • Thresholding
  • Segmentation by Clustering
  • Agglomerative clustering
  • Divisive clustering
  • K-means

72
Thresholding
  • Gray level thresholding is the simplest
    segmentation process.

  • Simple thresholding: g(i,j) = 1 if f(i,j) ≥ T
    (object); g(i,j) = 0 otherwise (background)
  • Multilevel thresholding assigns a different label
    to each gray-level band
73
Thresholding
  • Thresholding is computationally inexpensive and
    fast
  • Correct threshold selection is crucial for
    successful threshold segmentation
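A minimal NumPy sketch of gray-level thresholding (T = 128 is illustrative; choosing T well is the crucial step just noted):

```python
# Minimal gray-level thresholding with NumPy (T = 128 is illustrative;
# selecting T well is the crucial step noted above).
import numpy as np

def threshold(image: np.ndarray, T: int) -> np.ndarray:
    """Binary segmentation: 1 where a pixel is object, 0 for background."""
    return (image >= T).astype(np.uint8)

img = np.array([[ 12,  40, 200],
                [ 35, 180, 220],
                [ 10,  25, 190]])
print(threshold(img, T=128))   # the bright pixels become the object mask
```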

74
Thresholding example
75
Simple Clustering Methods
  • Two natural algorithms:
  • Agglomerative clustering (bottom-up)
  • attach the closest point to the cluster it is
    closest to
  • repeat
  • Divisive clustering (top-down)
  • split the cluster along the best boundary
  • repeat

76
Agglomerative Methods
  • Make each point a separate cluster
  • Until the clustering is satisfactory
  • Merge the two clusters with the smallest
    inter-cluster distance
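A usage sketch of this bottom-up scheme via SciPy's hierarchical clustering, assuming SciPy is available; "single" linkage merges the pair of clusters that are nearest apart:

```python
# Usage sketch of bottom-up clustering via SciPy (assumes SciPy is
# installed); 'single' linkage merges the clusters that are nearest apart.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [5.2, 4.9]])
Z = linkage(X, method="single")                   # the merge tree
labels = fcluster(Z, t=2, criterion="maxclust")   # stop at 2 clusters
print(labels)                                     # e.g. [1 1 2 2]
```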

77
Divisive Methods
  • Construct a single cluster containing all points
  • Until the clustering is satisfactory
  • Split the cluster that yields the two components
    with the largest inter-cluster distance

78
Agglomerative Versus Divisive Clustering
  • The user can specify the desired number of
    clusters as a termination condition

79
Measure of distance used
  • Minimum distance: d_min(Ci, Cj) = min_{p ∈ Ci, p' ∈ Cj} |p − p'|
    (nearest-neighbor clustering algorithm)
  • Maximum distance: d_max(Ci, Cj) = max_{p ∈ Ci, p' ∈ Cj} |p − p'|
    (farthest-neighbor clustering algorithm)
  • Mean distance: d_mean(Ci, Cj) = |m_i − m_j|,
    where m_i is the mean for Ci
  • Average distance: d_avg(Ci, Cj) =
    (1 / (n_i n_j)) Σ_{p ∈ Ci} Σ_{p' ∈ Cj} |p − p'|
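A sketch computing the four inter-cluster distances above for two point sets (NumPy arrays; the function names mirror the formulas):

```python
# Sketch of the four inter-cluster distances above for two point sets
# Ci and Cj (NumPy arrays of shape (n, d)); names mirror the formulas.
import numpy as np

def pairwise(Ci, Cj):
    """All Euclidean distances |p - p'| for p in Ci, p' in Cj."""
    return np.linalg.norm(Ci[:, None, :] - Cj[None, :, :], axis=-1)

def d_min(Ci, Cj):  return pairwise(Ci, Cj).min()
def d_max(Ci, Cj):  return pairwise(Ci, Cj).max()
def d_mean(Ci, Cj): return np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))
def d_avg(Ci, Cj):  return pairwise(Ci, Cj).mean()

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 0.0]])
print(d_min(Ci, Cj), d_max(Ci, Cj), d_mean(Ci, Cj), d_avg(Ci, Cj))
# -> 3.0 5.0 4.0 4.0
```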

80
Single Linkage
  • The distance between clusters is based on the
    points in each cluster that are closest to each
    other.

81
Complete Linkage Method
  • The distance between clusters is based on the
    points in each cluster that are farthest apart.

82
Centroid Linkage Method
  • The distance between clusters is defined as the
    distance between cluster centroids.

83
Average Linkage Method
  • The distance between clusters is the average
    distance between all pairs of observations.

84
Optimality
  • Neither agglomerative clustering nor divisive
    clustering is optimal
  • In other words, the set of centroids which they
    give is not guaranteed to minimise distortion

85
Contd.
  • For example
  • In agglomerative clustering, a dense cluster of
    data points will be combined into a single
    centroid
  • But to minimise distortion, we may need several
    centroids in a region where there are many data
    points
  • A single outlier may get its own cluster
  • Agglomerative clustering provides a useful
    starting point, but further refinement is needed

86
K-means Clustering
  • Choose a fixed number of clusters
  • Choose cluster centers and point-cluster
    allocations to minimize error

87
K-means Algorithm
  • Choose k data points to act as cluster centers
  • Until the clustering is satisfactory:
  • Assign each data point to the cluster that has
    the nearest cluster center
  • Ensure each cluster has at least one data point
    (by splitting clusters, etc.)
  • Replace the cluster centers with the means of the
    elements in the clusters
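A compact NumPy sketch of this loop (random initialization and a fixed iteration count are illustrative simplifications; real implementations add convergence checks and k-means++ or restarts):

```python
# Compact k-means sketch with NumPy (random initialization and a fixed
# iteration count are illustrative; real code adds convergence checks).
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to the nearest cluster center.
        dists = np.linalg.norm(X[:, None] - centers[None], axis=-1)
        labels = dists.argmin(axis=1)
        # Replace centers with the means of their assigned points,
        # keeping the old center if a cluster would become empty.
        centers = np.array([X[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels, centers

X = np.vstack([np.random.default_rng(1).normal(m, 0.3, (50, 2))
               for m in (0.0, 4.0)])
labels, centers = kmeans(X, k=2)
print(centers)   # close to (0, 0) and (4, 4)
```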

88
(Figure: an image, its clusters on intensity, and its clusters on color;
K-means clustering using intensity alone and color alone)
89
Conclusion
  • The approaches for clustering high-dimensional
    spatial data are well addressed.
  • Some applications of data clustering in data
    mining and image segmentation are discussed.
    These are vital because huge amounts of spatial
    data are obtained in real life from satellite
    images, medical equipment, geographic
    information systems (GIS), image database
    exploration, etc.

90
THANK YOU