1
Clustering Data Streams
  • Chun Wei
  • Dept Computer Information Technology
  • Advisor Dr. Sprague

2
Data Stream
  • Massive data sets accumulated at an astonishing
    rate.
  • Examples
  • Tracking network data to study change in traffic
    patterns and possible intrusions
  • Tracking meteorological data, such as temperatures

3
NASA MISR satellite
  • Collects several TB of satellite imagery data
    daily

4
Challenges
  • Compactness of data representation
  • Fast, incremental processing of new data points
    (one-pass and linear access of data)
  • Clear and fast identification of changes in
    evolving clustering models

5
Compactness
  • Utilize a data structure that summarizes a group
    of data points, minimizing the storage space
  • The space required does not grow appreciably with
    the number of points processed

6
Incremental Processing of Data
  • When clustering new data points, the algorithm
    should not require comparison with all the points
    processed in the past
  • The data must be processed as they are produced.
    A single linear scan is required; random access
    is prohibitively expensive.

7
Identification of Changes
  • The algorithm must be able to
  • diagnose changes in evolving data streams
  • distinguish outliers from data points that
    represent a new cluster

8
Current Algorithms
  • BIRCH
  • STREAM
  • CLU-STREAM

9
BIRCH
  • Use CF vectors to store data
  • CF = (N, ΣXi², ΣXi), where each Xi is a vector
  • Store the number of points, the linear sum and
    the square sum of all data points in a
    micro-cluster
  • Sufficient to calculate centroids, radius,
    diameter and distances
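
The claim that a CF vector is sufficient for centroids, radius and diameter can be checked with a minimal sketch. The class name CF and its method names are illustrative assumptions, not taken from the BIRCH paper or any library, and the square sum is kept as a single scalar (the sum of squared norms).

```python
import math


class CF:
    """Clustering feature: point count, linear sum, sum of squared norms.

    Illustrative sketch only; names are assumptions, not a library API.
    """

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum, one component per dimension
        self.ss = 0.0          # sum of squared norms of all points

    def add(self, x):
        """Absorb one point without storing it."""
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)

    def merge(self, other):
        """CF vectors are additive: merging two clusters adds their fields."""
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss += other.ss

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        # Root mean squared distance to the centroid: sqrt(SS/N - |LS/N|^2).
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def diameter(self):
        # Root mean squared pairwise distance:
        # sqrt((2*N*SS - 2*|LS|^2) / (N*(N-1))).
        if self.n < 2:
            return 0.0
        ls2 = sum(a * a for a in self.ls)
        num = 2 * self.n * self.ss - 2 * ls2
        return math.sqrt(max(num / (self.n * (self.n - 1)), 0.0))
```

Storage is one count, one vector and one scalar per cluster, regardless of how many points have been absorbed.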

10
B-Tree
  • [Figure: example B-tree with a root node and
    sorted integer keys in its internal and leaf
    nodes]

11
Building of CF Tree
  • A B-tree-like structure with a branching factor
    B, a threshold T, and a maximum of L entries in
    a leaf node

12
Adjusting CF Tree
  • Increase the threshold T so that each leaf entry
    can absorb more points. T can bound either the
    radius or the diameter.
  • Leaf entries with far fewer points than average
    are regarded as outliers and written out to
    disk.
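
The threshold test and the effect of raising T can be sketched as follows. This is a simplification under stated assumptions: leaf entries are kept in a flat list rather than a tree, each entry is a tuple (N, linear sum, square sum), T bounds the radius, and the function names and the outlier fraction `frac` are hypothetical.

```python
import math


def point_entry(x):
    """Leaf entry (n, linear_sum, square_sum) holding a single point."""
    return (1, list(x), sum(v * v for v in x))


def radius(n, ls, ss):
    c2 = sum((v / n) ** 2 for v in ls)
    return math.sqrt(max(ss / n - c2, 0.0))


def merged(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])


def centroid_dist2(a, b):
    return sum((x / a[0] - y / b[0]) ** 2 for x, y in zip(a[1], b[1]))


def insert(entries, entry, T):
    """Absorb entry into the closest leaf entry if the radius stays within T."""
    if entries:
        i = min(range(len(entries)),
                key=lambda j: centroid_dist2(entries[j], entry))
        candidate = merged(entries[i], entry)
        if radius(*candidate) <= T:
            entries[i] = candidate
            return
    entries.append(entry)  # otherwise start a new leaf entry


def rebuild(entries, new_T):
    """Reinsert existing entries under a larger threshold (no raw data needed)."""
    out = []
    for e in entries:
        insert(out, e, new_T)
    return out


def sparse_entries(entries, frac):
    """Entries holding fewer than frac * average points (assumed outlier rule)."""
    avg = sum(e[0] for e in entries) / len(entries)
    return [e for e in entries if e[0] < frac * avg]
```

Rebuilding works on the summaries alone, which is why the tree can be shrunk without rescanning the original data.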

13
STREAM
  • Process data streams in batches of points
  • Use weighted centroids Ci to represent the i-th
    batch of points.
  • Recursively cluster the weighted centroids until
    k clusters remain
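
The batch scheme above can be sketched in a few lines. Two assumptions are worth flagging: a plain weighted k-means (Lloyd iterations with the first k points as initial centers) stands in for the clustering subroutine, which is a substitution and not the method of the STREAM paper, and only one level of re-clustering is shown. The names weighted_kmeans and stream_cluster are illustrative.

```python
def weighted_kmeans(points, weights, k, iters=20):
    """Weighted Lloyd iterations; returns (centers, total weight per center)."""
    k = min(k, len(points))
    centers = [list(p) for p in points[:k]]  # deterministic initialization
    mass = [0.0] * k
    for _ in range(iters):
        sums = [[0.0] * len(points[0]) for _ in range(k)]
        mass = [0.0] * k
        for p, w in zip(points, weights):
            # Assign the point to its nearest current center.
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            mass[j] += w
            sums[j] = [s + w * a for s, a in zip(sums[j], p)]
        centers = [[s / mass[j] for s in sums[j]] if mass[j] > 0 else centers[j]
                   for j in range(k)]
    return centers, mass


def stream_cluster(stream, batch_size, k):
    """Summarize each batch by k weighted centroids, then cluster those."""
    reps, rep_weights, batch = [], [], []

    def flush():
        centers, mass = weighted_kmeans(batch, [1.0] * len(batch), k)
        for c, m in zip(centers, mass):
            if m > 0:  # keep only centroids that represent some points
                reps.append(c)
                rep_weights.append(m)

    for x in stream:
        batch.append(x)
        if len(batch) == batch_size:
            flush()
            batch = []
    if batch:
        flush()
    return weighted_kmeans(reps, rep_weights, k)
```

Only the current batch and the retained centroids are held in memory. When the retained centroids themselves grow too numerous, the same step can be applied to them again, which is the recursive part.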

14
Problems with BIRCH and STREAM
  • Old data points are weighted as heavily as new
    data points
  • May not be able to detect new trends in an
    evolving data stream

15
CLU-STREAM
  • Also use CF vectors to store data summaries
  • Use time stamps to record the elapsed time from
    the beginning of the stream
  • Take snapshots at different time stamps,
    favoring the most recent data
  • (Snapshot: the micro-clusters stored at a
    particular moment in the stream)

16
CLU-STREAM (continued)
  • A snapshot contains q micro-clusters; q depends
    on the memory available
  • A new data point is assigned to one of the
    micro-clusters in the previous snapshot if it
    falls within the maximum boundary of that
    micro-cluster.

17
CLU-STREAM (continued)
  • If a new data point fails to fit into any
    current cluster, a new cluster is created, and
    either an existing cluster is deleted or two
    are merged.
  • A cluster is removed if the average time stamp
    of the last m data points it absorbed is the
    least recent.
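
The maintenance rule for micro-clusters can be sketched as below. Several simplifications are assumptions made for brevity: the mean time stamp of all absorbed points replaces the average over the last m points, only deletion is shown (the merge alternative is omitted), and the boundary parameters `factor` and `min_boundary` are hypothetical.

```python
import math


class MicroCluster:
    """CF-style summary extended with a sum of time stamps (sketch only)."""

    def __init__(self, x, t):
        self.n = 1
        self.ls = list(x)
        self.ss = sum(v * v for v in x)
        self.t_sum = t

    def absorb(self, x, t):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, x)]
        self.ss += sum(v * v for v in x)
        self.t_sum += t

    def centroid(self):
        return [a / self.n for a in self.ls]

    def radius(self):
        c2 = sum(c * c for c in self.centroid())
        return math.sqrt(max(self.ss / self.n - c2, 0.0))

    def mean_time(self):
        return self.t_sum / self.n


def update(clusters, x, t, q, factor=2.0, min_boundary=1.0):
    """Process one point at time t, keeping at most q micro-clusters."""
    if clusters:
        nearest = min(clusters, key=lambda c: math.dist(c.centroid(), x))
        # Maximum boundary: a multiple of the radius, with a floor so that
        # single-point clusters (radius 0) can still absorb neighbours.
        boundary = max(factor * nearest.radius(), min_boundary)
        if math.dist(nearest.centroid(), x) <= boundary:
            nearest.absorb(x, t)
            return
    if len(clusters) >= q:
        # Make room by dropping the cluster with the least recent mean time.
        clusters.remove(min(clusters, key=lambda c: c.mean_time()))
    clusters.append(MicroCluster(x, t))
```

Each point is compared only with the q current summaries, never with earlier raw points, so the per-point cost does not grow with the length of the stream.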

18
Detect New Trends
  • Comparing clustering results from snapshot to
    snapshot reveals trends in an evolving data
    stream.
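
One way to compare two snapshots is to subtract the older CF vectors from the newer ones, which leaves a summary of only the points that arrived in between. The sketch below assumes each snapshot is a dictionary from a micro-cluster id to a tuple (N, linear sum, square sum); that layout and the function name are illustrative assumptions.

```python
def diff_snapshots(current, earlier):
    """Summarize the points that arrived between two snapshots.

    Each snapshot maps a micro-cluster id to (n, linear_sum, square_sum).
    Ids present only in `current` are clusters created after `earlier`.
    """
    recent = {}
    for cid, (n, ls, ss) in current.items():
        if cid in earlier:
            n0, ls0, ss0 = earlier[cid]
            # CF vectors are additive, so they can also be subtracted.
            n = n - n0
            ls = [a - b for a, b in zip(ls, ls0)]
            ss = ss - ss0
        if n > 0:
            recent[cid] = (n, ls, ss)
    return recent
```

Comparing the centroid of the recent points (linear sum divided by N) with the centroid in the earlier snapshot shows whether a cluster has drifted, and ids missing from the earlier snapshot mark newly formed clusters.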

19
References
  • Aggarwal, C. C., Han, J., Wang, J., & Yu, P. S.
    (2003). A Framework for Clustering Evolving Data
    Streams. In Proc. of the 29th VLDB Conference.
  • Barbara, D. (2003). Requirements for Clustering
    Data Streams. SIGKDD Explorations, 3 (2), 23-27.
  • Ester, M., Kriegel, H.-P., Sander, J., & Xu, X.
    (1998). Clustering for Mining in Large Spatial
    Databases. Special Issue on Data Mining,
    KI-Journal, ScienTec Publishing, No. 1.
  • Guha, S., Meyerson, A., Mishra, N., Motwani, R.,
    & O'Callaghan, L. (2003). Clustering Data
    Streams: Theory and Practice. IEEE Transactions
    on Knowledge and Data Engineering, 15 (3),
    515-528.
  • Zhang, T., Ramakrishnan, R., & Livny, M. (1996).
    BIRCH: An Efficient Data Clustering Method for
    Very Large Databases. In Proc. of ACM SIGMOD
    International Conference on Management of Data.