Title: A Framework for Projected Clustering of High Dimensional Data Streams
1. A Framework for Projected Clustering of High
Dimensional Data Streams
Proceedings of the 30th VLDB Conference, Toronto,
Canada, 2004
2. Motivation and Underlying Concepts
- Not all dimensions should be considered when clustering in a high-dimensional setting
- The fading cluster structure: use a fading function f(t)
- The half-life t0 of a point is defined as the time at which f(t0) = (1/2) f(0)
- A fading cluster structure is defined at time t for a set of d-dimensional points
- The cluster structure has two properties, called additivity and temporal multiplicity
- The clustering process requires simultaneous maintenance of the clusters as well as the set of dimensions associated with each cluster
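The concepts above can be illustrated with a minimal Python sketch. It assumes an exponential fading function f(t) = 2^(-decay_rate * t), which gives a half-life of 1/decay_rate; the slides do not state the form of f. The names `fading`, `FadingCluster`, `decay_to` and `merge` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


def fading(t: float, decay_rate: float) -> float:
    """Assumed fading function f(t) = 2^(-decay_rate * t)."""
    return 2.0 ** (-decay_rate * t)


@dataclass
class FadingCluster:
    """Weighted per-dimension statistics of a cluster, valid at `time`."""
    fc1: List[float]   # weighted sum of values, one entry per dimension
    fc2: List[float]   # weighted sum of squared values, one entry per dimension
    weight: float      # sum of the weights of all points in the cluster
    time: float        # time at which the statistics are expressed

    @classmethod
    def from_point(cls, x: Sequence[float], t: float) -> "FadingCluster":
        # A point that has just arrived has weight f(0) = 1.
        return cls([v for v in x], [v * v for v in x], 1.0, t)

    def decay_to(self, t: float, decay_rate: float) -> "FadingCluster":
        # Temporal multiplicity: with no new points, every statistic is
        # scaled by the same factor f(t - self.time).
        k = fading(t - self.time, decay_rate)
        return FadingCluster(
            [v * k for v in self.fc1],
            [v * k for v in self.fc2],
            self.weight * k,
            t,
        )

    def merge(self, other: "FadingCluster") -> "FadingCluster":
        # Additivity: statistics expressed at the same time simply add up.
        if self.time != other.time:
            raise ValueError("decay both clusters to the same time first")
        return FadingCluster(
            [a + b for a, b in zip(self.fc1, other.fc1)],
            [a + b for a, b in zip(self.fc2, other.fc2)],
            self.weight + other.weight,
            self.time,
        )
```

Under these assumptions, absorbing a new point amounts to decaying the cluster to the arrival time and merging it with `FadingCluster.from_point(x, t)`, so old points never need to be revisited.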
3. HPStream: High-Dimensional Projected Stream
Clustering Method
4. HPStream Algorithm: Brief Explanation
- Set parameters
- Normalization process
- Initial clustering using k-means on InitNumber points
- ComputeDimensions: this procedure determines the dimensions in such a way that the spread along the chosen dimensions is as small as possible
- The next step is to determine the closest cluster to the incoming data point using FindProjectedDist
- The limiting radius is determined by the procedure FindLimitingRadius
- Finally, decide which cluster to add or delete
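The dimension selection and closest-cluster steps listed above might look roughly like the following Python sketch. It is not the authors' implementation: it assumes that the per-dimension spread of each cluster is already available as a radius, that dimensions are picked globally as the (cluster, dimension) pairs with the smallest radii, and that the projected distance is an average absolute difference over the selected dimensions. The function names are hypothetical stand-ins for ComputeDimensions and FindProjectedDist.

```python
from typing import List, Sequence, Set


def compute_dimensions(radii: Sequence[Sequence[float]], avg_dims: int) -> List[Set[int]]:
    """Pick len(radii) * avg_dims (cluster, dimension) pairs with the smallest spread.

    radii[c][j] is the spread of cluster c along dimension j.
    """
    pairs = sorted(
        (r, c, j)
        for c, row in enumerate(radii)
        for j, r in enumerate(row)
    )
    chosen: List[Set[int]] = [set() for _ in radii]
    for _, c, j in pairs[: len(radii) * avg_dims]:
        chosen[c].add(j)
    return chosen


def projected_dist(point: Sequence[float], centroid: Sequence[float], dims: Set[int]) -> float:
    """Average absolute difference over the selected dimensions only (assumed metric)."""
    if not dims:
        return float("inf")  # a cluster with no dimensions cannot attract points
    return sum(abs(point[j] - centroid[j]) for j in dims) / len(dims)


def closest_cluster(
    point: Sequence[float],
    centroids: Sequence[Sequence[float]],
    dims_per_cluster: Sequence[Set[int]],
) -> int:
    """Index of the cluster with the smallest projected distance to the point."""
    dists = [projected_dist(point, c, d) for c, d in zip(centroids, dims_per_cluster)]
    return min(range(len(dists)), key=dists.__getitem__)
```

Because each cluster is compared only along its own dimensions, a point can sit far from a centroid in an ignored dimension and still be assigned to that cluster. The limiting-radius check and the add/delete decision are left out of this sketch.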
8. Experimental Setup
HPStream is compared with CluStream, both implemented in MS VC++.
Data: one synthetic data set and 2 real-world data sets (Network Intrusion and Forest Cover Type).
Comparison criteria for judging the 2 algorithms:
- accuracy: clustering quality
- efficiency: stream processing rate
- sensitivity: varying decay rate, l and radius threshold
- scalability: varying number of dimensions and clusters
Parameters initialized as follows: decay rate 0.5, spread radius factor 2, InitNumber 2000, average projected dimensionality l > d/2.
9. Comparing Accuracy Using Clustering Quality and
Cluster Purity
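As an illustration of the purity measure, the sketch below assumes the common definition: for each cluster, take the share of points carrying its dominant class label, then average those shares over all non-empty clusters. The slide does not spell out the formula, so this definition and the function name `cluster_purity` are assumptions.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def cluster_purity(assignments: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Average, over clusters, of the fraction of points with the dominant label."""
    if len(assignments) != len(labels) or not assignments:
        raise ValueError("need two non-empty sequences of equal length")
    per_cluster: Dict[Hashable, Counter] = {}
    for cluster_id, label in zip(assignments, labels):
        per_cluster.setdefault(cluster_id, Counter())[label] += 1
    shares = [
        max(counts.values()) / sum(counts.values())
        for counts in per_cluster.values()
    ]
    return sum(shares) / len(shares)
```

For example, with assignments `[0, 0, 0, 1, 1]` and labels `["a", "a", "b", "b", "b"]`, cluster 0 has purity 2/3 and cluster 1 has purity 1, so the average is 5/6.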
10. Accuracy comparison continued
11. Accuracy comparison continued
12. Efficiency comparison using Stream Processing
Rate
13. Sensitivity: Varying l
14. Sensitivity: Varying radius threshold and decay
rate
15. Scalability: varying dimensionality and number
of clusters