Clustering Geometric Data Streams


1
Clustering Geometric Data Streams

Jiří Skála

Ivana Kolingerová
ZCU/FAV/KIV 2007
2
Talk outline

motivation
background
existing solution
improvements
experiments and observations
conclusion

3
Motivation

why data streams?
geometric models growing larger
Stanford's Michelangelo project (David: 28 mil.
vertices, St. Matthew: 187 mil. vertices)
187 × 10^6 points × 3 coordinates × 8 bytes ≈ 4.5 GB
must be processed out-of-core
why clustering?
use hierarchical clustering to create
multiresolution model
various LOD in different parts
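
The size estimate on this slide can be checked with a few lines of Python. This is only an illustrative sketch; the function name and the assumption of one 8-byte value per coordinate with no per-vertex overhead are ours, not the authors'.

```python
def raw_size_bytes(num_points: int, coords_per_point: int = 3,
                   bytes_per_coord: int = 8) -> int:
    """Raw storage for a point set: points x coordinates x bytes per coordinate."""
    return num_points * coords_per_point * bytes_per_coord


# St. Matthew model from the slide: 187 million vertices.
st_matthew_bytes = raw_size_bytes(187_000_000)
st_matthew_gb = st_matthew_bytes / 1e9  # decimal gigabytes, roughly 4.5
```

Coordinates alone already approach the RAM of a 2007 workstation, which is why the model has to be processed out-of-core.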

4
Background: data stream

ordered set of data
data coming online or stored on HDD
too large to fit in main memory
viewed only in order; random access extremely
inefficient or even impossible
processed in one or very few linear scans

5
Background: clustering

grouping similar elements together
vertices, DB entries, documents
similarity most often measured as Euclidean
distance
k-means, k-median clustering
facility location
clients and facilities
facility cost
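
As a rough illustration of the facility location objective, the sketch below sums the cost of opening the facilities and the distance from each client to its nearest facility. It assumes a single uniform facility cost and Euclidean distance; the function name is hypothetical.

```python
import math


def facility_location_cost(clients, facilities, facility_cost):
    """Opening cost for every facility plus each client's distance
    to its closest open facility (assumes at least one facility)."""
    opening = facility_cost * len(facilities)
    service = sum(min(math.dist(c, f) for f in facilities) for c in clients)
    return opening + service


example_cost = facility_location_cost(
    clients=[(0.0, 0.0), (3.0, 4.0)],
    facilities=[(0.0, 0.0)],
    facility_cost=10.0,
)
```

Unlike k-means and k-median, the number of clusters is not fixed in advance here; it follows from the trade-off between the two terms.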

(figure panels: k-means, k-median)
6
Facility location

no data streams yet
introduced by Charikar and Guha, 1999
initial solution iteratively refined by local
improvements (local search algorithm)
initial solution
points taken in random order
first point always a facility
others become a facility with probability p = d / fc
if d / fc > 1 then p = 1
otherwise connect to closest existing facility
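
The initial solution described above might be sketched as follows. We assume that d is the distance from the point to its closest already open facility and that fc is a uniform facility cost; the slide does not define either symbol explicitly, and the function name and return format are our own.

```python
import math
import random


def initial_solution(points, fc, rng=None):
    """Randomized initial solution for a non-empty point list.

    Returns (facilities, assignment), where assignment is a list of
    (point, facility_index) pairs in the order the points were visited.
    """
    rng = rng or random.Random()
    order = list(points)
    rng.shuffle(order)                      # points taken in random order

    facilities = [order[0]]                 # first point is always a facility
    assignment = [(order[0], 0)]
    for p in order[1:]:
        nearest = min(range(len(facilities)),
                      key=lambda i: math.dist(p, facilities[i]))
        d = math.dist(p, facilities[nearest])   # assumed meaning of d
        prob = min(1.0, d / fc)                 # p = d / fc, capped at 1
        if rng.random() < prob:
            facilities.append(p)                # open a new facility here
            assignment.append((p, len(facilities) - 1))
        else:
            assignment.append((p, nearest))     # connect to closest facility
    return facilities, assignment
```

Points far from every open facility are almost certain to open a new one, while points close to a facility usually just connect to it.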

7
Facility location

pick a point at random (new facility candidate)
compute function gain
pay for opening a facility
inspect all points and compare distance to
facility
inspect facilities and determine whether they can
be closed
if gain > 0 then perform reassignments and closures
repeated m log m times?
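
One possible reading of the gain computation is sketched below. It is a simplified, unweighted version under our own assumptions: the candidate is not yet open, every facility has the same cost fc, and a facility is closed when moving its remaining points to the candidate costs less than fc. The exact rule in the original algorithm may differ.

```python
import math


def gain(candidate, points, assign, facilities, fc):
    """Estimate the gain of opening `candidate`.

    assign[i] is the index in `facilities` currently serving points[i].
    Returns (gain, moved_point_indices, closed_facility_indices).
    """
    total = -fc                                   # pay for opening a facility
    moved = set()
    # Extra distance needed to move each facility's remaining points.
    extra = {j: 0.0 for j in range(len(facilities))}
    for i, p in enumerate(points):
        d_old = math.dist(p, facilities[assign[i]])
        d_new = math.dist(p, candidate)
        if d_new < d_old:
            total += d_old - d_new                # reassignment saves distance
            moved.add(i)
        else:
            extra[assign[i]] += d_new - d_old
    closed = set()
    for j, cost in extra.items():
        if cost < fc:                             # closing saves more than it costs
            total += fc - cost
            closed.add(j)
    return total, moved, closed
```

Only when the returned gain is positive would the reassignments and closures actually be carried out.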

8
Facility location
(figure panels: new facility candidate; after reassignments and closures)
9
Data stream clustering

proposed by Guha et al., 2000
data stream processed in blocks
clustering within each block
cluster centers given weight and passed to higher
level
when higher level full, clustered again
distances multiplied by point weights
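
The block-wise scheme can be illustrated with the sketch below. Note that `cluster_block` is a deliberately simple greedy stand-in, not the local search from the slides; it only serves to show how weighted centers are passed to higher levels. All names and the final flush step are our assumptions.

```python
import math


def cluster_block(weighted_points, fc):
    """Stand-in clusterer (NOT the local search): a point joins its nearest
    center if weight * distance <= fc, otherwise it opens a new center."""
    centers = []                                  # [point, accumulated weight]
    for p, w in weighted_points:
        if centers:
            j = min(range(len(centers)),
                    key=lambda i: math.dist(p, centers[i][0]))
            if w * math.dist(p, centers[j][0]) <= fc:   # distance times weight
                centers[j][1] += w
                continue
        centers.append([p, w])
    return [(c, w) for c, w in centers]


def cluster_stream(stream, block_size, fc):
    """Single pass over the stream; full blocks are clustered and their
    weighted centers are passed to the next level."""
    levels = [[]]                                 # levels[0] buffers raw points
    for p in stream:
        levels[0].append((p, 1.0))
        lvl = 0
        while len(levels[lvl]) >= block_size:
            block, levels[lvl] = levels[lvl], []
            centers = cluster_block(block, fc)
            if lvl + 1 == len(levels):
                levels.append([])
            levels[lvl + 1].extend(centers)
            if len(centers) == len(block):        # no reduction: stop cascading
                break
            lvl += 1
    leftovers = [wp for level in levels for wp in level]
    return cluster_block(leftovers, fc)           # final pass over what remains


demo_stream = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0)] * 3
demo_centers = cluster_stream(demo_stream, block_size=4, fc=1.0)
```

In this sketch at most `block_size` raw points are held at level 0 at any time, and the total weight of the final centers equals the number of points read.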

10
Data stream clustering

time for video

11
Improvements

limiting the search space
inspect only points whose reassignment can
improve the solution
i.e., those assigned to facilities within a radius
of 2 fc
does not work for weighted points
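
A minimal sketch of this pruning step for unweighted points follows. We read the radius as 2 times fc measured from the new facility candidate; both this reading and the function name are assumptions.

```python
import math


def points_to_inspect(candidate, assign, facilities, fc):
    """Indices of points whose current facility lies within 2 * fc of the
    candidate; assign[i] is the facility index serving point i."""
    near = {j for j, f in enumerate(facilities)
            if math.dist(candidate, f) <= 2 * fc}
    return [i for i, a in enumerate(assign) if a in near]
```

For example, with facilities at (0, 0) and (100, 0), fc = 5 and a candidate at (3, 0), only the points served by the first facility would be inspected.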

12
Improvements

modification from k-median to facility location
choosing the facility cost
equal to the diagonal of bounding box
weight normalization
we need to keep weights around 1, i.e., average
weight equal to 1
divide weights by their average
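
Both parameter choices are easy to express in code. The sketch below assumes an axis-aligned bounding box and points given as equal-length coordinate tuples; the function names are hypothetical.

```python
import math


def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box."""
    dims = range(len(points[0]))
    lo = [min(p[k] for p in points) for k in dims]
    hi = [max(p[k] for p in points) for k in dims]
    return math.dist(lo, hi)


def normalize_weights(weights):
    """Divide weights by their average so the average weight becomes 1."""
    avg = sum(weights) / len(weights)
    return [w / avg for w in weights]


fc = bounding_box_diagonal([(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 0.0)])
normalized = normalize_weights([2.0, 4.0, 6.0])
```

Since distances are multiplied by point weights, keeping the average weight at 1 plausibly keeps weighted distances on a scale comparable to the facility cost at every level.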

13
Experiments: setting the facility cost

high setting
aggressive clustering
low number of large clusters
low setting
moderate clustering
many small clusters
set facility cost equal to diagonal of bounding
box
affects memory and running time

14
Experiments: setting the facility cost
(figure panels: facility cost = diagonal, 2 × diagonal, 1/2 × diagonal, 1/4 × diagonal)
15
Experiments: input point distribution

many authors rely on data being ordered
usually true
presented algorithm can handle unordered data
as well
there may be a problem
with few outliers

16
Experiments: input point distribution
(figure panels: 1st block, 2nd block, 3rd block, higher level)
17
Experiments: input point distribution
(figure panels: 1st block, 2nd block, 3rd block, higher level)
18
Experiments: block size

affects memory requirements
required memory
somewhat affects clustering result
affects running time
required iterations

19
Experiments: number of iterations

m log m iterations necessary for a
constant-factor approximation
for large blocks running time grows unpleasantly
0.1 m iterations seem to be enough; for data with
clusters even less
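
To get a feel for the difference between the two iteration budgets, a small sketch follows. The base of the logarithm is not given on the slide, so base 2 here is purely an assumption, as are the function names.

```python
import math


def full_iterations(m, log_base=2.0):
    """m log m iterations; the log base is an assumption."""
    return math.ceil(m * math.log(m, log_base))


def reduced_iterations(m, fraction=0.1):
    """Empirical budget of 0.1 m iterations, at least one."""
    return max(1, round(fraction * m))


block_points = 1640
reduced = reduced_iterations(block_points)
full = full_iterations(block_points)
```

For a block of 1640 points the reduced budget comes to 164 iterations, a small fraction of the m log m budget for any common log base.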

20
Experiments: number of iterations

1640 points

6560 iterations
164 iterations
21
Conclusion

modified data stream approach to facility
location
introduced facility weight normalization
improvement to limit the number of points
inspected
experiments
discussion of parameter settings
description of algorithm behavior

22
References

M. Charikar, S. Guha, Improved Combinatorial
Algorithms for the Facility Location and k-Median
Problems. Proc. 40th Sympos. on Foundations of
Computer Science, 1999, pp. 378-388.
S. Guha, N. Mishra, R. Motwani, L. O'Callaghan,
Clustering Data Streams. In Proceedings of the
Annual Symposium on Foundations of Computer
Science. IEEE, November 2000.
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha,
R. Motwani, Streaming-Data Algorithms for
High-Quality Clustering. In Proceedings of IEEE
International Conference on Data Engineering,
March 2002.
S. Guha, A. Meyerson, N. Mishra, R. Motwani, L.
O'Callaghan, Clustering Data Streams: Theory and
Practice. IEEE Trans. Knowl. Data Eng. 15, 3
(2003), pp. 515-528.