Clustering Data Streams

About This Presentation

Title:

Clustering Data Streams

Description:

Tracking network data to study change in traffic patterns and possible intrusions ... Ester M., Kriegel H.-P., Sander J. & Xu X (1998) ... – PowerPoint PPT presentation

Number of Views:45

Avg rating:3.0/5.0

Slides: 20

Provided by: weic

Category:

more less

Transcript and Presenter's Notes

Title: Clustering Data Streams

1
Clustering Data Streams

Chun Wei
Dept Computer Information Technology
Advisor Dr. Sprague

2
Data Stream

Massive data sets accumulated at an astonishing
rate.
Examples
Tracking network data to study change in traffic
patterns and possible intrusions
Tracking meteorological data, such as temperatures

3
NASA MISR satellite

Collects several TB of satellite imagery data
daily

4
Challenges

Compactness of data representation
Fast, incremental processing of new data points
(one-pass and linear access of data)
Clear and fast identification of changes in
evolving clustering models

5
Compactness

Utilize a data structure that summarizes a group
of data points, minimizing the storage space
The space required does not grow appreciably with
the number of points processed

6
Incremental Processing of data

When clustering new data points, the algorithm
should not require comparison with all the points
processed in the past
The data must be processed as they are produced.
Linear scan is required, random access is
prohibitively expensive.

7
Identification of Changes

The algorithm must be able to
diagnose changes in evolving data streams
distinguish outliers from data points that
represent a new cluster

8
Current Algorithms

BIRCH
STREAM
CLU-STREAM

9
BIRCH

Use CF vectors to store data
CF (N, ?Xi2 , ? Xi ) Xi is a vector
Store the number of points, the linear sum and
the square sum of all data points in a
micro-cluster
Sufficient to calculate centroids, radius,
diameter and distances

10
B-Tree
Root
29
16
22
7
39
3
5
19
20
22
24
27
38
2
7
14
16
29
33
34
11
Building of CF Tree

B-Tree, with a branch factor B, threshold T and L
maximum number of entries in a leave node

12
Adjusting CF Tree

Increases the threshold T so that each leaf entry
to absorb more points. T can be set as radius or
diameter.
Leaf entries with far fewer points are regarded
as outliers and written back to disk.

13
STREAM

Process data streams in batches of points
Use weighted centroids Ci to represent ith batch
of points.
Recursively cluster the weighted centroids until
k-clusters

14
Problems with BIRCH STREAM

Old data points are equally important as new data
points
May not be able to detect new trends in evolving
data stream

15
CLU-STREAM

Also use CF vectors to store data summary
Use time stamps to record the elapsed time from
the beginning
Take snapshots at different time stamp, favoring
the most recent data
(Snapshot micro-clusters stored at particular
moments in the stream)

16
CLU-STREAM (continue)

A snapshot contains q micro-clusters, q depends
on the memory available
New data points will be assigned to one of the
micro-clusters in previous snapshot if it falls
within the maximum boundary of that
micro-cluster.

17
CLU-STREAM (continue)

If a new data points fails to fit into any
current cluster, a new cluster is created, and an
existing one is deleted or two merged.
A cluster is removed if the average time-stamp
when it absorbs m new data points is least
recent.

18
Detect New Trends

Comparing clustering results from snapshots to
snapshots reveals trends in evolving data stream.

19
References

Aggarwal, C. C., Han J., Wang, J. Yu, P. S.
(2003). A Framework for Clustering Evolving Data
Stream. In Proc. of the 29th VLDB Conference.
Barbara, D. (2003). Requirements for Clustering
Data Streams. SIGKDD Explorations, 3 (2), 23-27.
Ester M., Kriegel H.-P., Sander J. Xu X (1998).
Clustering for Mining in Large Spatial Databases.
Special Issue on Data Mining, KI-Journal,
ScienTec Publishing, No. 1.
Guha, S., Meyerson, A., Mishra, N. Motwani, R.,
Callaghan, L. (2003). Clustering Data Streams
Theory and Practice. IEEE Transactions on
Knowledge and Data Engineering, 15 (3), 515-528.
Zhang, T., Ramakrishnan R. Livny, M. (1996).
BIRCH An Efficient Data Clustering Method for
Very Large Databases. In Proc. of ACM SIGMOD
International Conference on Management of Data.