k-Means and DBSCAN

1
k-Means and DBSCAN
  • Gyozo Gidofalvi
  • Uppsala Database Laboratory

2
Announcements
  • Updated material for assignment 2 on the lab
    course home page.
  • Posted sign-up sheets for labs and examinations
    for assignment 2 outside P1321.
  • Posted office hours

3
k-Means
  • Input
  • M (set of points)
  • k (number of clusters)
  • Output
  • µ1, …, µk (cluster centroids)
  • k-Means clusters the points in M into k clusters by
    minimizing the squared error function
    $E = \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$
  • over clusters Si, i = 1, …, k, where µi is the centroid of
    all xj ∈ Si.
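
To make the squared error function concrete, here is a minimal Python sketch. It assumes Euclidean distance and points stored as tuples; the names `squared_error` and `centroid` are illustrative and do not come from the slides.

```python
from math import dist  # Euclidean distance (Python 3.8+)

def centroid(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def squared_error(clusters, centroids):
    """Sum, over all clusters S_i, of squared distances from x_j to mu_i."""
    return sum(
        dist(x, mu) ** 2
        for points, mu in zip(clusters, centroids)
        for x in points
    )

# Two hand-made clusters (assumed example data).
clusters = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (12.0, 0.0)]]
centroids = [centroid(s) for s in clusters]
print(centroids)                            # [(0.0, 1.0), (11.0, 0.0)]
print(squared_error(clusters, centroids))   # 4.0
```

Every point lies at distance 1 from its centroid, so the four squared distances add up to 4.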

4
k-Means algorithm
  • select (m1 … mK) randomly from M // initial
    centroids
  • do
  •   (µ1 … µK) = (m1 … mK)
  •   all clusters Ci = { }
  •   for each point p in M
  •     // compute cluster membership of p
  •     i = argminj(dist(µj, p))
  •     // assign p to the corresponding cluster
  •     Ci = Ci ∪ {p}
  •   end
  •   for each cluster Ci // recompute the centroids
  •     mi = avg(p in Ci)
  • while exists mi ≠ µi // convergence criterion
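
The loop above can be sketched in plain Python as follows. This is one possible reading of the pseudocode, not the course's implementation: it assumes Euclidean distance, picks the initial centroids among the data points, keeps the old centroid when a cluster ends up empty, and adds a `max_iter` safety cap that the slide does not mention.

```python
import random
from math import dist

def mean(points):
    """Component-wise average of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(M, k, seed=None, max_iter=100):
    rng = random.Random(seed)
    m = rng.sample(M, k)                      # initial centroids, drawn from M
    for _ in range(max_iter):                 # max_iter is an added safety cap
        mu = m
        clusters = [[] for _ in range(k)]     # all clusters C_i = {}
        for p in M:
            # cluster membership of p: index of the nearest centroid
            i = min(range(k), key=lambda j: dist(mu[j], p))
            clusters[i].append(p)
        # recompute the centroids (an empty cluster keeps its old centroid)
        m = [mean(c) if c else mu[j] for j, c in enumerate(clusters)]
        if m == mu:                           # no centroid moved: converged
            break
    return m, clusters

M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(M, k=2, seed=0)
print(centroids)
print(clusters)
```

The result depends on the random initial centroids, which is why no fixed output is shown.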

5
K-Means on three clusters
6
I'm feeling Unlucky
Bad initial points
7
k-Means in practice
  • How to choose initial centroids
  • select randomly among the data points
  • generate completely randomly
  • How to choose k
  • study the data
  • run k-Means for different k
  • measure squared error for each k
  • Run k-Means many times!
  • Get many choices of initial points
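
The advice above (try several k, measure the squared error, and restart many times) might look like this in Python. It is a sketch under assumptions: `lloyd` is a compact k-Means run written for this example, `best_of` and the default of 20 restarts are arbitrary illustrative choices, and the data set is made up.

```python
import random
from math import dist

def lloyd(M, k, rng, max_iter=100):
    """One compact k-Means run; returns (squared error, centroids)."""
    mu = rng.sample(M, k)                     # random data points as start
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in M:
            groups[min(range(k), key=lambda j: dist(mu[j], p))].append(p)
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else mu[j]
               for j, g in enumerate(groups)]
        if new == mu:
            break
        mu = new
    error = sum(min(dist(p, c) for c in mu) ** 2 for p in M)
    return error, mu

def best_of(M, k, restarts=20, seed=0):
    """Run k-Means several times and keep the run with the lowest error."""
    rng = random.Random(seed)
    return min((lloyd(M, k, rng) for _ in range(restarts)),
               key=lambda run: run[0])

M = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
     (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
for k in range(1, 5):
    error, _ = best_of(M, k)
    print(k, round(error, 3))   # look for the k where the error stops dropping sharply
```

Keeping the best of several restarts reduces the chance that one unlucky set of initial points decides the outcome.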

8
k-Means iteration step in AmosQL
  • Calculate point-to-centroid distances:
    calp2c_distance()
  • select p, c, d
  • from Vector of Number p, Vector of Number c,
    Number d
  • where p in bag(iota(1,10))
  • and c in bag(iota(1,10))
  • and d = euclid(p,c)
  • Assign each point to the closest centroid:
    calc_cluster_assignment()
  • groupby((p2c_distances1()), argminv)
  • Recalculate centroids: calc_clust_means()
  • groupby(calc_cluster_assignment1(),
    col_means)

9
Transitive closure
  • tclose is a second-order function to explore
    graphs where the edges are expressed by a
    transition function fno
  • tclose(Function fno, Object o) -> Bag of Object
  • fno(o) produces the children of o
  • tclose applies the transition function fno(o),
    then fno(fno(o)), then fno(fno(fno(o))), etc.,
    until fno returns no new results
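
The behaviour described above can be mimicked in Python with a breadth-first traversal. This is only an analogue written for illustration, not the Amos II implementation; in particular, including the start object in the result is an assumption of this sketch, and the example graph is made up.

```python
from collections import deque

def tclose(fno, o):
    """Objects reachable from o by repeatedly applying fno.

    fno(x) must return an iterable with the children of x. The traversal
    stops once fno produces no object that has not been seen before.
    """
    seen = {o}
    order = [o]            # start object included (assumption of this sketch)
    queue = deque([o])
    while queue:
        current = queue.popleft()
        for child in fno(current):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order

# A small directed graph with a cycle, expressed as a transition function.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["a"]}
print(tclose(lambda n: edges.get(n, []), "a"))   # ['a', 'b', 'c', 'd']
```

The `seen` set is what makes the traversal terminate on cyclic graphs.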

10
Iterate until convergence with tclose in AmosQL
  • create function bagidiv2(Bag of Number b)
  •   -> Bag of Bag of Number
  •   as (select floor(n/2) from Number n where n in b)
  • create function vecchild_idiv2(Vector of Number vb)
  •   -> Bag of Vector of Number
  •   as sort(bagidiv2(in(vb)))
  • create function vecconverge_tclose(Bag of Number ib)
  •   -> Bag of Vector of Number
  •   /* tclose function iterating the vecchild_idiv2
      function until convergence */
  •   as select ov
  •   from Vector of Number ov
  •   where ov in tclose('vecchild_idiv2', sort(ib))
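
As a rough Python analogue of the functions above: each vector has exactly one child (its elements halved with floor division, then sorted), so the transitive closure reduces to iterating until a vector repeats. The function names mirror the slide, but the code is an interpretation of the AmosQL, not a translation checked against Amos II.

```python
def vec_child_idiv2(vb):
    """Child of a vector: every element floor-divided by 2, then sorted."""
    return tuple(sorted(n // 2 for n in vb))

def vec_converge(ib):
    """All vectors visited while iterating vec_child_idiv2 to convergence."""
    ov = tuple(sorted(ib))
    visited = [ov]
    while True:
        child = vec_child_idiv2(ov)
        if child in visited:       # no new result: converged
            return visited
        visited.append(child)
        ov = child

for vector in vec_converge([5, 2, 8]):
    print(vector)
# (2, 5, 8)
# (1, 2, 4)
# (0, 1, 2)
# (0, 0, 1)
# (0, 0, 0)
```

The same pattern applies to k-Means: the transition function maps one set of centroids to the next, and the iteration stops when the centroids no longer change.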

11
What about this?!
Non-spherical clusters
Noise
12
k-Means pros and cons
  • + Easy
  • + Fast
  • + Scalable?
  • - Works only for well-shaped clusters
  • - Sensitive to outliers
  • - Sensitive to noise
  • - Must know k a priori
13
Questions
  • Euclidean distance results in spherical clusters
  • What cluster shape does the Manhattan distance
    give?
  • Think of other distance measures too. What
    cluster shapes will those yield?
  • Assuming that the K-means algorithm converges
    in I iterations, with N points and X features for
    each point, give an approximation of the
    complexity of the algorithm expressed in K, I, N,
    and X.
  • Can the K-means algorithm be parallelized?
  • How?

14
DBSCAN
  • Density Based Spatial Clustering of Applications
    with Noise
  • Basic idea
  • If an object p is density connected to q,
    then p and q belong to the same cluster
  • If an object is not density connected to any
    other object, it is considered noise

15
Definitions
  • ε-neighbourhood
  • The ε-neighbourhood of an object p is the set of
    objects within ε-distance of p
  • core object
  • An object q is a core object iff there are at
    least MinPts objects in q's ε-neighbourhood
  • directly density reachable (ddr)
  • An object p is ddr from q iff q is
    a core object and p is inside
    the ε-neighbourhood of q
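
The three definitions above translate almost directly into code. The sketch below assumes Euclidean distance, refers to objects by their index in a list, and counts an object as a member of its own neighbourhood (a common convention that the slide does not state). The function names and the sample points are illustrative.

```python
from math import dist

def neighbourhood(points, p, eps):
    """Indices of all objects within distance eps of points[p] (p included)."""
    return {i for i, x in enumerate(points) if dist(points[p], x) <= eps}

def is_core(points, q, eps, min_pts):
    """q is a core object iff its eps-neighbourhood holds at least min_pts objects."""
    return len(neighbourhood(points, q, eps)) >= min_pts

def ddr(points, q, eps, min_pts):
    """Objects directly density reachable from q (empty if q is not core)."""
    if not is_core(points, q, eps, min_pts):
        return set()
    return neighbourhood(points, q, eps)

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (9, 9)]
print(sorted(ddr(points, 3, eps=1.0, min_pts=3)))   # [1, 2, 3, 4]
print(ddr(points, 4, eps=1.0, min_pts=3))           # set()
```

Object 4 is ddr from object 3, but not the other way round, because object 4 has only two objects in its neighbourhood and is therefore not a core object. Direct density reachability is not symmetric.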

16
Reachability and Connectivity
  • density reachable (dr)
  • An object p is dr from q iff there exists a
    chain of objects q1, …, qn s.t. q1 is ddr from q,
    q2 is ddr from q1, q3 is ddr from q2, …, and p
    is ddr from qn
  • density connected (dc)
  • p is dc to r iff there exists an object q such
    that p is dr from q and r is dr from q

17
Recall
  • Basic idea
  • If an object p is density connected to q,
    then p and q belong to the same cluster
  • If an object is not density connected to any
    other object, it is considered noise

18
DBSCAN
  • i = 1
  • do
  •   take a point p from M
  •   find the set of points P which are density
      connected to p
  •   if P = { }
  •     M = M \ {p}
  •   else
  •     Ci = P
  •     i = i + 1
  •     M = M \ P
  •   end
  • while M ≠ { }

HOW?
19
Finding density connected components
  • If r is dc to p ⇒ there exists q, s.t. both p and
    r are dr from q, i.e., there exists a ddr-chain
    from q to both r and p, and q is a core object.
  • Recall: tclose is a second-order function to
    explore graphs where the edges are expressed by a
    transition function fno.
  • fno = ddr

20
Finding dc components in AmosQL
  • Assuming q is a core object and that a ddr
    function with the following signature is defined:
    ddr(Integer q) -> Bag of Integer p
  • Then
  • create function dc(Integer q) -> Bag of Integer
  •   as select p
  •   from Integer p
  •   where p in tclose(ddr, q)
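
Putting the pieces together, a Python sketch of DBSCAN built on a transitive closure over ddr could look as follows. It is one way to realize the idea, under these assumptions: Euclidean distance, objects identified by list index, an object counts toward its own neighbourhood, clusters are grown only from core objects, and a border object reachable from two clusters stays in the first one found. `make_ddr`, `tclose`, and `dbscan` are names chosen for this example.

```python
from math import dist

def make_ddr(points, eps, min_pts):
    """Build the transition function ddr(q) for a fixed data set."""
    def ddr(q):
        near = [i for i, x in enumerate(points) if dist(points[q], x) <= eps]
        return near if len(near) >= min_pts else []   # only core objects have children
    return ddr

def tclose(fno, o):
    """All objects reachable from o by repeatedly applying fno (o included)."""
    seen, stack = {o}, [o]
    while stack:
        for child in fno(stack.pop()):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def dbscan(points, eps, min_pts):
    ddr = make_ddr(points, eps, min_pts)
    labels = {}                          # object index -> cluster number
    i = 1
    for q in range(len(points)):
        if q in labels or not ddr(q):    # skip assigned objects and non-core objects
            continue
        for p in tclose(ddr, q):         # dc(q): everything density reachable from q
            labels.setdefault(p, i)      # a shared border object keeps its first cluster
        i += 1
    noise = [p for p in range(len(points)) if p not in labels]
    return labels, noise

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (9, 9),
          (5, 5), (5, 6), (6, 5)]
labels, noise = dbscan(points, eps=1.0, min_pts=3)
print(labels)   # objects 0-4 form cluster 1, objects 6-8 form cluster 2
print(noise)    # [5]
```

Starting the closure only at core objects sidesteps the question of what to do when the chosen point is a border object: such objects are picked up when a neighbouring core object is expanded, and whatever is left unlabeled at the end is noise.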

21
DBSCAN pros and cons
  • + Clusters of arbitrary shape
  • + Robust to noise
  • + Does not need an a priori k
  • + Deterministic
  • + Scalable?
  • - Requires connected regions of sufficiently high density
  • - Data sets with varying densities are problematic
22
Questions
  • Why is the dc criterion useful to define a
    cluster, instead of dr or ddr?
  • For which points is density reachability
    symmetric, i.e., for which p, q do both dr(p, q)
    and dr(q, p) hold?
  • Express, using only core objects and ddr, which
    objects will belong to a cluster.