Clustering Geometric Data Streams

1
Clustering Geometric Data Streams
  • Jiří Skála
  • Ivana Kolingerová
  • ZCU/FAV/KIV 2007

2
Talk outline
  • motivation
  • background
  • existing solution
  • improvements
  • experiments and observations
  • conclusion

3
Motivation
  • why data streams?
  • geometric models growing larger
  • Stanford's Michelangelo project (David: 28 million
    vertices, St. Matthew: 187 million vertices)
  • 187 · 10^6 points × 3 coordinates × 8 bytes ≈ 4.5 GB
  • must be processed out-of-core
  • why clustering?
  • use hierarchical clustering to create
    multiresolution model
  • various LOD in different parts

4
Background: data stream
  • ordered set of data
  • data coming online or stored on HDD
  • too large to fit in main memory
  • viewed only in order; random access is extremely
    inefficient or even impossible
  • processed in one or very few linear scans

5
Background: clustering
  • grouping similar elements together
  • vertices, DB entries, documents
  • similarity most often measured as Euclidean
    distance
  • k-means, k-median clustering
  • facility location
  • clients and facilities
  • facility cost

[Figure: k-means vs. k-median clustering]
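
The facility location objective named on this slide can be written out as a short sketch. The function name `facility_location_cost` and the tuple-based point representation are illustrative assumptions, not taken from the talk; `fc` stands for the facility cost.

```python
import math

def facility_location_cost(points, facilities, fc):
    """Total cost: fc per open facility plus each client's distance
    to its nearest open facility (Euclidean)."""
    if not facilities:
        raise ValueError("at least one facility must be open")
    service = sum(min(math.dist(p, f) for f in facilities) for p in points)
    return fc * len(facilities) + service

points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0)]
one = facility_location_cost(points, [(0.0, 0.0)], fc=5.0)
two = facility_location_cost(points, [(0.0, 0.0), (10.0, 0.0)], fc=5.0)
```

Unlike k-means and k-median, the number of facilities is not fixed in advance; the facility cost decides whether opening another facility pays off.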
6
Facility location
  • no data streams yet
  • introduced by Charikar and Guha, 1999
  • initial solution iteratively refined by local
    improvements (local search algorithm)
  • initial solution
  • points taken in random order
  • first point always a facility
  • others become a facility with probability p = d / fc
    (d = distance to the closest existing facility,
    fc = facility cost)
  • if d / fc > 1 then p = 1
  • otherwise connect to closest existing facility
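
The initial-solution steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the name `initial_solution`, the index-based assignment list, and the optional `rng` parameter are assumptions made for the example, and points are treated as unweighted.

```python
import math
import random

def initial_solution(points, fc, rng=None):
    """Return (facilities, assignment); assignment[i] is the index of the
    facility point that serves point i."""
    rng = rng or random.Random()
    order = list(range(len(points)))
    rng.shuffle(order)                      # points taken in random order
    facilities = []
    assignment = [None] * len(points)
    for i in order:
        if not facilities:
            facilities.append(i)            # first point is always a facility
            assignment[i] = i
            continue
        nearest = min(facilities, key=lambda f: math.dist(points[i], points[f]))
        d = math.dist(points[i], points[nearest])
        p = min(1.0, d / fc)                # p = d / fc, capped at 1
        if rng.random() < p:
            facilities.append(i)            # open a new facility here
            assignment[i] = i
        else:
            assignment[i] = nearest         # connect to closest facility
    return facilities, assignment

# Two groups of coincident points, far apart relative to fc.
pts = [(0.0, 0.0)] * 3 + [(100.0, 0.0)] * 3
facilities, assignment = initial_solution(pts, fc=1.0, rng=random.Random(0))
```

With coincident points inside each group (d = 0, so p = 0) and a gap much larger than fc (so p = 1), exactly one facility opens per group regardless of the random order.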

7
Facility location
  • pick a point at random (new facility candidate)
  • compute function gain
  • pay for opening a facility
  • inspect all points and compare distance to
    facility
  • inspect facilities and determine whether they can
    be closed
  • if gain > 0 then perform reassignments and closures
  • repeated m log m times?
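
One possible reading of the gain computation is sketched below for unweighted points. The function name `gain`, its return values, and the index-based data layout are assumptions for illustration; the talk's actual implementation may differ in detail.

```python
import math

def gain(points, assignment, facilities, candidate, fc):
    """Cost saved by opening `candidate` (an index into points).

    assignment[i] is the facility index serving point i; facilities is a
    set of open facility indices. Returns (gain, reassign, close) and does
    not modify its inputs.
    """
    total = 0.0 if candidate in facilities else -fc   # pay for opening
    reassign = set()
    # Extra distance paid by points that would have to move if f closed.
    penalty = {f: 0.0 for f in facilities if f != candidate}
    for i, p in enumerate(points):
        f = assignment[i]
        if f == candidate:
            continue
        d_old = math.dist(p, points[f])
        d_new = math.dist(p, points[candidate])
        if d_new < d_old:
            total += d_old - d_new       # point is closer to the candidate
            reassign.add(i)
        else:
            penalty[f] += d_new - d_old
    close = set()
    for f, extra in penalty.items():
        if fc - extra > 0:               # closing f saves more than it costs
            total += fc - extra
            close.add(f)
    return total, reassign, close

pts = [(0.0, 0.0), (10.0, 0.0), (11.0, 0.0), (12.0, 0.0)]
g1, moved1, closed1 = gain(pts, [0, 0, 0, 0], {0}, candidate=2, fc=5.0)
g2, moved2, closed2 = gain([(0.0, 0.0), (1.0, 0.0)], [0, 1], {0, 1},
                           candidate=0, fc=5.0)
```

In the first call the total cost drops from 38 to 12, so the gain is 26 and the candidate is opened. In the second call the candidate is already open, and closing the other facility saves its cost of 5 while adding a distance of 1.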

8
Facility location
[Figure: new facility candidate; result after reassignments and closures]
9
Data stream clustering
  • proposed by Guha et al., 2000
  • data stream processed in blocks
  • clustering within each block
  • cluster centers given weight and passed to higher
    level
  • when higher level full, clustered again
  • distances multiplied by point weights
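
The block-wise scheme above can be sketched as follows. The per-block clusterer `cluster_block` is a deliberately simple stand-in (an assumption, not the local search used in the talk), and `cluster_stream`, `block_size`, and `max_levels` are hypothetical names introduced only for this example.

```python
import math

def cluster_block(weighted_points, fc):
    """Stand-in clusterer: open a new center when weight * distance to the
    nearest existing center exceeds fc; otherwise merge into that center."""
    centers = []                                   # entries: [point, weight]
    for p, w in weighted_points:
        best = min(centers, key=lambda c: math.dist(p, c[0]), default=None)
        if best is None or w * math.dist(p, best[0]) > fc:
            centers.append([p, w])
        else:
            best[1] += w                           # center absorbs the weight
    return [(c[0], c[1]) for c in centers]

def cluster_stream(stream, block_size, fc, max_levels=8):
    """Single pass over stream; levels[k] buffers weighted centers of level k."""
    levels = [[]]

    def push(level, item):
        if level == len(levels):
            levels.append([])
        levels[level].append(item)
        # A full block is clustered and its centers are passed one level up.
        # The top level only buffers, which bounds the recursion depth.
        if len(levels[level]) >= block_size and level + 1 < max_levels:
            block, levels[level] = levels[level], []
            for center in cluster_block(block, fc):
                push(level + 1, center)

    for p in stream:
        push(0, (p, 1.0))                          # raw points have weight 1
    leftovers = [item for buf in levels for item in buf]
    return cluster_block(leftovers, fc)

stream = ((i % 2 * 100.0, 0.0) for i in range(100))
result = cluster_stream(stream, block_size=10, fc=1.0)
```

Only one block per level is held in memory at a time, and the total weight of the final centers equals the number of points read from the stream.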

10
Data stream clustering
  • time for video

11
Improvements
  • limiting the search space
  • inspect only points whose reassignment can
    improve the solution
  • i.e., those assigned to facilities within a
    radius of 2 · fc
  • does not work for weighted points
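
A minimal sketch of this filter for unweighted points is shown below. The function name and the index-based layout are assumptions for illustration, and the radius is measured from the new facility candidate, which is how the slide is read here.

```python
import math

def points_worth_inspecting(points, assignment, facilities, candidate, fc):
    """Indices of points assigned to facilities that lie within a radius
    of 2 * fc around the candidate; all other points are skipped."""
    c = points[candidate]
    near = {f for f in facilities if math.dist(points[f], c) <= 2 * fc}
    return [i for i, f in enumerate(assignment) if f in near]

pts = [(0.0, 0.0), (1.0, 0.0), (50.0, 0.0), (51.0, 0.0)]
inspect = points_worth_inspecting(pts, [0, 0, 2, 2], {0, 2},
                                  candidate=1, fc=5.0)
```

This sketch still scans the whole assignment list; keeping a member list per facility would let an implementation visit only the nearby facilities' points.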

12
Improvements
  • modification from k-median to facility location
  • choosing the facility cost
  • equal to the diagonal of bounding box
  • weight normalization
  • we need to keep weights around 1, i.e., average
    weight equal to 1
  • divide weights by their average
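
Both choices are simple to state in code. The sketch below assumes points are coordinate tuples and uses hypothetical helper names; it is an illustration of the two bullet points, not the authors' implementation.

```python
import math

def bounding_box_diagonal(points):
    """Length of the diagonal of the axis-aligned bounding box."""
    dims = range(len(points[0]))
    lo = [min(p[k] for p in points) for k in dims]
    hi = [max(p[k] for p in points) for k in dims]
    return math.dist(lo, hi)

def normalize_weights(weights):
    """Divide weights by their average so that the average weight is 1."""
    avg = sum(weights) / len(weights)
    return [w / avg for w in weights]

pts = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (1.0, 1.0, 12.0)]
fc = bounding_box_diagonal(pts)           # facility cost for this data set
weights = normalize_weights([2.0, 4.0, 6.0])
```

Normalization keeps weighted distances on the same scale as the facility cost at every level of the hierarchy, even though raw weights grow as clusters are merged.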

13
Experiments: setting the facility cost
  • high setting
  • aggressive clustering
  • low number of large clusters
  • low setting
  • moderate clustering
  • many small clusters
  • set facility cost equal to diagonal of bounding
    box
  • affects memory and running time

14
Experiments: setting the facility cost
[Figure: clustering results for facility cost equal to the diagonal, 2 × diagonal, 1/2 × diagonal, and 1/4 × diagonal]
15
Experiments: input point distribution
  • many authors rely on data being ordered
  • usually true
  • presented algorithm can handle unordered data
    as well
  • there may be a problem with few outliers

16
Experiments: input point distribution
[Figure: 1st block, 2nd block, 3rd block, higher level]
17
Experiments: input point distribution
[Figure: 1st block, 2nd block, 3rd block, higher level]
18
Experiments: block size
  • affects memory requirements
  • required memory
  • somewhat affects clustering result
  • affects running time
  • required iterations

19
Experiments: number of iterations
  • m log m iterations necessary for a
    constant-factor approximation
  • for large blocks, running time grows unpleasantly
  • 0.1 m iterations seem to be enough; for data with
    clusters, even fewer

20
Experiments: number of iterations
  • 1640 points

[Figure: result after 6560 iterations vs. result after 164 iterations]
21
Conclusion
  • modified data stream approach to facility
    location
  • introduced facility weight normalization
  • improvement to limit the number of points
    inspected
  • experiments
  • discussion of parameter settings
  • description of algorithm behavior

22
References
  • M. Charikar, S. Guha, Improved Combinatorial
    Algorithms for the Facility Location and k-Median
    Problems. Proc. 40th Sympos. on Foundations of
    Computer Science, 1999, pp. 378-388.
  • S. Guha, N. Mishra, R. Motwani, L. O'Callaghan,
    Clustering Data Streams. In Proceedings of the
    Annual Symposium on Foundations of Computer
    Science. IEEE, November 2000.
  • L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha,
    R. Motwani, Streaming-Data Algorithms for
    High-Quality Clustering. In Proceedings of IEEE
    International Conference on Data Engineering,
    March 2002.
  • S. Guha, A. Meyerson, N. Mishra, R. Motwani, L.
    O'Callaghan, Clustering Data Streams: Theory and
    Practice. IEEE Trans. Knowl. Data Eng. 15, 3
    (2003), 515-528.