Mining Dynamics of Data Streams in Multidimensional Space: Cluster Analysis

1
Mining Dynamics of Data Streams in
Multidimensional Space: Cluster Analysis
  • Jiawei Han
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • www.cs.uiuc.edu/hanj

2
Outline
  • Why clustering data streams?
  • Cluster analysis: a general overview
  • Our design methodology: mining changes of
    clusters in multi-dimensional space
  • Tilted time frame window
  • Micro-cluster analysis and maintenance
  • Online macro-cluster analysis

3
Why Clustering Data Streams?
  • What is cluster analysis? Grouping a set of data
    objects into a set of clusters, such that the
    intra-cluster similarity is high and the
    inter-cluster similarity is low
  • Application example: network intrusion detection
  • Detect bursts of activity or abrupt changes in
    real time by online clustering
  • Clustering: a stream data reduction technique
  • New requirements in stream clustering
  • Generate high-quality clusters in one scan
  • High-quality, efficient incremental clustering
  • Analysis should take multi-dimensional space
    into account

4
Major Clustering Approaches in Traditional
Cluster Analysis
  • Partitioning algorithms: construct various
    partitions and then evaluate them by some
    criterion
  • E.g., k-means, k-medoids, etc.
  • Hierarchical algorithms: create a hierarchical
    decomposition of the set of data (or objects)
    using some criterion
  • Often need to be integrated with other clustering
    methods, e.g., BIRCH
  • Density-based: based on connectivity and density
    functions
  • Find clusters of arbitrary shape, e.g.,
    DBSCAN, OPTICS, etc.
  • Grid-based: based on a multiple-level granularity
    structure
  • View space as grid structures, e.g., STING,
    CLIQUE
  • Model-based: find the best fit of the model to
    all the clusters
  • Good for conceptual clustering, e.g., COBWEB, SOM

5
The K-Means Clustering Process
  • MacQueen'67: each cluster is represented by the
    center of the cluster

[Figure: the K-means process for K = 2, drawn on axes running from 0 to 10. Arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means.]

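
The loop shown in the figure (choose K initial centers, assign each object to the nearest center, update the means, reassign) can be sketched in a few lines. This is a minimal illustration in plain Python, not code from the presentation; the function name `kmeans`, the seeded random initialization, and the stopping rule (stop when the means no longer change) are assumptions made for the example.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means on a list of tuples. Returns (centers, labels)."""
    rng = random.Random(seed)
    # Arbitrarily choose K objects as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assign each object to the most similar (nearest) center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                  for p in points]
        # Update the cluster means.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                # Assumption: an empty cluster keeps its previous center.
                new_centers.append(centers[j])
        if new_centers == centers:  # no change, so reassignment is stable
            break
        centers = new_centers
    return centers, labels

points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, k=2)
```

On the six sample points the two well-separated groups end up in different clusters; which cluster index each group receives depends on the random initialization.
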
6
Clustering Data Streams: Previous Work
  • K-median stream clustering (Guha, Motwani, et al.,
    2000-2002)
  • Data stream with points from a metric space
  • Find k centers in the stream such that the sum of
    distances from data points to their closest
    center is minimized
  • Only the k centers (representing the clustering
    results) are retained when new data arrives
  • Only the new data set is used to perform
    incremental clustering
  • Each retained center carries a weight equal to
    the number of previous points it represents
  • The error is bounded over continuous incremental
    updates
  • The simple algorithm yields a constant-factor
    approximation
  • Weakness of the method
  • Low quality for evolving data streams (registers
    only k centers)
  • Limited functionality for exploring cluster
    evolution over time

7
CluStream: A Framework for Clustering Evolving
Data Streams (VLDB'03)
  • Outline of our methodology
  • Divide the clustering process into online and
    offline components
  • Online: periodically stores summary statistics
    about the stream data
  • Micro-clustering: better quality than k-means
  • Incremental, online processing and maintenance
  • Offline: answers various user queries based on
    the stored summary statistics
  • Tilted time framework: registers dynamic changes
  • Achieves high efficiency, scalability, quality
    of results, and power of evolution/change
    detection with limited overhead

8
BIRCH (1996)
  • BIRCH: Balanced Iterative Reducing and Clustering
    using Hierarchies, by Zhang, Ramakrishnan, Livny
    (SIGMOD'96)
  • Incrementally constructs a CF (Clustering Feature)
    tree, a hierarchical data structure for
    multiphase clustering
  • Phase 1: scan the DB to build an initial in-memory
    CF tree (a multi-level compression of the data
    that tries to preserve the inherent clustering
    structure of the data)
  • Phase 2: use an arbitrary clustering algorithm to
    cluster the leaf nodes of the CF tree
  • Scales linearly: finds a good clustering with a
    single scan and improves the quality with a few
    additional scans

9
Clustering Feature Vector
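  • CF = (N, LS, SS): the number of points, their
    linear sum, and their square sum per dimension
  • Example: the five points (3,4), (2,6), (4,5),
    (4,7), (3,8) give CF = (5, (16, 30), (54, 190))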
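
The example vector can be checked by direct computation. The sketch below is illustrative only: the helper names `clustering_feature`, `merge_cf`, and `centroid` are made up for this example and do not come from BIRCH itself. It also shows the additivity of CF vectors, which is what lets a non-leaf node store the sum of its children's CFs.

```python
def clustering_feature(points):
    """Return (N, LS, SS): count, per-dimension linear sum and square sum."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))
    ss = tuple(sum(v * v for v in col) for col in zip(*points))
    return n, ls, ss

def merge_cf(a, b):
    """CF vectors are additive: the CF of a union is the component-wise sum."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def centroid(cf):
    """The centroid is recovered from the CF alone as LS / N."""
    n, ls, _ = cf
    return tuple(v / n for v in ls)

points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)            # (5, (16, 30), (54, 190))
print(centroid(cf))  # (3.2, 6.0)
```
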
10
CF-Tree in BIRCH
  • Clustering feature
  • Summary of the statistics for a given subcluster:
    the 0-th, 1st and 2nd moments of the subcluster
    from the statistical point of view
  • Registers crucial measurements for computing
    clusters and utilizes storage efficiently
  • A CF tree is a height-balanced tree that stores
    the clustering features for a hierarchical
    clustering
  • A nonleaf node in the tree has descendants or
    children
  • The nonleaf nodes store sums of the CFs of their
    children
  • A CF tree has two parameters
  • Branching factor: specifies the maximum number of
    children
  • Threshold: the maximum diameter of sub-clusters
    stored at the leaf nodes

11
CF Tree
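[Figure: a CF tree with B = 7 and L = 6. The root and each non-leaf node hold CF entries (CF1, CF2, CF3, ...), each pointing to a child node (child1, child2, child3, ...). Leaf nodes hold CF entries and are chained together by prev and next pointers.]
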
12
Micro-Clusters Design Methodology
  • Data streams
  • Multi-dimensional points $\bar{X}_1, \ldots, \bar{X}_k, \ldots$ with
    time stamps $T_1, \ldots, T_k, \ldots$
  • Each point contains d dimensions, i.e.,
    $\bar{X}_i = (x_i^1, \ldots, x_i^d)$
  • A micro-cluster for n points is defined as a
    $(2 \cdot d + 3)$-tuple
    $(\overline{CF2^x}, \overline{CF1^x}, CF2^t, CF1^t, n)$
  • Online statistical data collection
  • Maintain a total of q micro-clusters, $M_1, \ldots, M_q$,
    where q is usually significantly larger than the
    number of natural clusters, and is determined by
    the amount of available memory
  • Each newly created micro-cluster is associated
    with a unique id, and a merged micro-cluster is
    associated with a list of ids
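
A micro-cluster of this kind can be held in a small record: d square sums and d linear sums over the data values, the square sum and linear sum of the time stamps, and the point count, for 2·d + 3 numbers in total. The class below is a hedged sketch; the field names `cf2x`, `cf1x`, `cf2t`, `cf1t` and the methods `from_point`, `absorb`, and `as_tuple` are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MicroCluster:
    cf2x: List[float]  # per-dimension sum of squared data values (d numbers)
    cf1x: List[float]  # per-dimension sum of data values (d numbers)
    cf2t: float        # sum of squared time stamps
    cf1t: float        # sum of time stamps
    n: int             # number of points absorbed

    @classmethod
    def from_point(cls, x, t):
        """Start a micro-cluster from a single point x arriving at time t."""
        return cls([v * v for v in x], list(x), t * t, t, 1)

    def absorb(self, x, t):
        """Add one point; every component is a running sum, so this is O(d)."""
        for j, v in enumerate(x):
            self.cf2x[j] += v * v
            self.cf1x[j] += v
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def as_tuple(self):
        """Flatten to the (2*d + 3)-tuple."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)

mc = MicroCluster.from_point([1.0, 2.0, 3.0], t=10.0)
mc.absorb([3.0, 2.0, 1.0], t=12.0)
print(len(mc.as_tuple()))  # 9, i.e. 2*3 + 3 for d = 3
```
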

13
Tilted Time Window Framework
  • Natural tilted time frame window
  • Example: the minimal unit is a quarter (15
    minutes), then 4 quarters → 1 hour, 24 hours →
    1 day, ...
  • Logarithmic, pyramidal, and geometric tilted time
    frame windows
  • Not going to be implemented at the first stage
  • Each micro-cluster is associated with a tilted
    time window
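
One way to picture the natural tilted time frame is as a stack of bounded queues: recent history is kept at quarter granularity, and each time a level fills up, its contents are aggregated into one slot at the next coarser level. The sketch below makes several assumptions that are not in the slides: the class name `NaturalTiltedWindow`, numeric summaries that aggregate by simple addition, and a third level of 31 days.

```python
from collections import deque

class NaturalTiltedWindow:
    # (level name, number of units that make up one unit of the next level)
    # Assumption: the 31-slot day level is added here for illustration.
    LEVELS = (("quarter", 4), ("hour", 24), ("day", 31))

    def __init__(self):
        # Each level keeps only its most recent `cap` summaries.
        self.slots = [deque(maxlen=cap) for _, cap in self.LEVELS]
        self.filled = [0] * len(self.LEVELS)

    def add_quarter(self, value):
        """Record the summary (here: a number) for one finished quarter."""
        self._add(0, value)

    def _add(self, level, value):
        self.slots[level].append(value)
        self.filled[level] += 1
        cap = self.LEVELS[level][1]
        # When a full group has accumulated, roll it up into the next level.
        if self.filled[level] % cap == 0 and level + 1 < len(self.LEVELS):
            self._add(level + 1, sum(self.slots[level]))

window = NaturalTiltedWindow()
for _ in range(96):        # one day of quarters, one event per quarter
    window.add_quarter(1)
print(list(window.slots[0]))  # [1, 1, 1, 1]
print(len(window.slots[1]))   # 24
print(list(window.slots[2]))  # [96]
```

With these levels, at most 4 + 24 + 31 = 59 summaries are stored no matter how long the stream runs, which is the point of tilting: fine detail for the recent past, coarse detail for the distant past.
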

14
Incremental Update of Micro-Clusters
  • If a new data point $\bar{X}$ falls within the
    maximum boundary of its closest micro-cluster
    $M_p$, $\bar{X}$ is added to $M_p$
  • Maximum boundary: a factor t of the RMS deviation
    of the data points in $M_p$ from its centroid
  • Otherwise, a new micro-cluster is created for
    $\bar{X}$, which corresponds to (1) an outlier,
    or (2) a new cluster
  • To make room: delete an old cluster or merge the
    two closest clusters?
  • First determine whether it is safe to delete the
    micro-cluster with the least recent time stamp
  • If not, merge the two closest micro-clusters
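
The update rule above can be sketched as follows. This is a simplified illustration under stated assumptions, not the CluStream implementation: the names `MC` and `update` are invented, the time statistics are reduced to a single last-update stamp, `min_radius` is an assumed fallback boundary for clusters whose RMS deviation is zero (for example, a single point), and "safe to delete" is approximated by comparing that stamp against a caller-supplied `stale_before` cutoff.

```python
import math
from dataclasses import dataclass
from itertools import combinations

@dataclass(eq=False)
class MC:
    n: int
    ls: list       # per-dimension linear sum
    ss: list       # per-dimension square sum
    last_t: float  # assumption: recency tracked as the last update time

    def centroid(self):
        return [s / self.n for s in self.ls]

    def rms(self):
        """RMS deviation of the points from the centroid."""
        var = sum(q / self.n - (s / self.n) ** 2
                  for s, q in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))

    def add(self, x, t):
        self.n += 1
        self.ls = [s + v for s, v in zip(self.ls, x)]
        self.ss = [q + v * v for q, v in zip(self.ss, x)]
        self.last_t = max(self.last_t, t)

    def merge(self, other):
        self.n += other.n
        self.ls = [a + b for a, b in zip(self.ls, other.ls)]
        self.ss = [a + b for a, b in zip(self.ss, other.ss)]
        self.last_t = max(self.last_t, other.last_t)

def update(clusters, x, t, q, factor=2.0, min_radius=1.0, stale_before=None):
    """Absorb x into its closest micro-cluster or open a new one (in place)."""
    if clusters:
        closest = min(clusters, key=lambda m: math.dist(m.centroid(), x))
        boundary = max(factor * closest.rms(), min_radius)
        if math.dist(closest.centroid(), x) <= boundary:
            closest.add(x, t)
            return
    clusters.append(MC(1, list(x), [v * v for v in x], t))
    if len(clusters) > q:  # over budget: delete a stale cluster or merge
        oldest = min(clusters, key=lambda m: m.last_t)
        if stale_before is not None and oldest.last_t < stale_before:
            clusters.remove(oldest)
        else:
            a, b = min(combinations(clusters, 2),
                       key=lambda p: math.dist(p[0].centroid(), p[1].centroid()))
            a.merge(b)
            clusters.remove(b)

clusters = []
for t, x in enumerate([(0, 0), (0.5, 0), (10, 10), (20, 20)], start=1):
    update(clusters, x, t, q=2)
```

In the usage lines, the second point is absorbed by the first cluster, the third and fourth points open new clusters, and the fourth arrival exceeds q = 2, so the two closest micro-clusters are merged.
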

15
Macro-Cluster Creation
  • Given a user-specified time-horizon h and the
    number of macro-clusters, K
  • Find the K high-level clusters over the
    pre-specified horizon h
  • Subtractive property
  • Use $\overline{CFT}(C)$ to denote the micro-cluster
    of the set of n points, C
  • Let $C_1$ and $C_2$ be two sets of points such that
    $C_1 \supseteq C_2$
  • Then $\overline{CFT}(C_1 - C_2) = \overline{CFT}(C_1) - \overline{CFT}(C_2)$
    holds
  • Compute the set of net micro-clusters based on
    the subtractive property
  • Use a variant of the k-means algorithm to compute
    the final K macro-clusters
  • Each net micro-cluster is treated as a weighted
    pseudo-point
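
The two steps (subtract an older snapshot to get the net micro-clusters over the horizon, then cluster the results as weighted pseudo-points) can be sketched as below. The sketch is deliberately simplified and rests on assumptions: snapshots are plain dictionaries mapping a single micro-cluster id to `(n, linear_sum)`, merged id lists are not handled, and the k-means variant is seeded with the K heaviest pseudo-points. The names `net_micro_clusters` and `weighted_kmeans` are illustrative.

```python
import math

def net_micro_clusters(current, past):
    """Subtractive property: stats at time t minus stats at time t - h."""
    net = {}
    for mc_id, (n, ls) in current.items():
        if mc_id in past:
            past_n, past_ls = past[mc_id]
            n, ls = n - past_n, [a - b for a, b in zip(ls, past_ls)]
        if n > 0:  # keep only micro-clusters that received points in the horizon
            net[mc_id] = (n, ls)
    return net

def weighted_kmeans(pseudo_points, k, n_iter=50):
    """k-means over (weight, centroid) pairs; means are weight-averaged."""
    # Assumption: seed with the k heaviest pseudo-points.
    seeds = sorted(pseudo_points, key=lambda p: -p[0])[:k]
    centers = [list(c) for _, c in seeds]
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for w, c in pseudo_points:
            j = min(range(k), key=lambda i: math.dist(c, centers[i]))
            groups[j].append((w, c))
        new_centers = []
        for j, group in enumerate(groups):
            if not group:
                new_centers.append(centers[j])
                continue
            total = sum(w for w, _ in group)
            dim = len(group[0][1])
            new_centers.append(
                [sum(w * c[i] for w, c in group) / total for i in range(dim)])
        if new_centers == centers:
            break
        centers = new_centers
    return centers

current = {1: (10, [10.0, 10.0]), 2: (6, [60.0, 60.0]), 3: (4, [44.0, 40.0])}
past = {1: (6, [6.0, 6.0]), 2: (6, [60.0, 60.0])}
net = net_micro_clusters(current, past)
pseudo = [(n, [s / n for s in ls]) for n, ls in net.values()]
macro = weighted_kmeans(pseudo, k=2)
```

In the usage lines, micro-cluster 2 received no points during the horizon and drops out, while micro-clusters 1 and 3 contribute four points each.
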

16
Evolution Analysis of Micro-Clusters
  • Given a user-specified time-horizon h and two
    clock times, $t_1$ and $t_2$ (where $t_1 < t_2$)
  • Analyze the evolving nature of the data arriving
    between $(t_2 - h, t_2)$ and the data arriving
    between $(t_1 - h, t_1)$
  • Answer the following questions
  • Are there new clusters in the data at time $t_2$
    which were not present at time $t_1$?
  • Have some of the original clusters been lost?
  • Have some of the original clusters at time $t_1$
    shifted in position and nature?
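
Because every micro-cluster carries an id, the three questions reduce to comparing the ids and centroids found in the two windows. The sketch below assumes each window has already been summarized as a dictionary from micro-cluster id to centroid; the function name `compare_windows` and the `shift_threshold` parameter are hypothetical and chosen only for this example.

```python
import math

def compare_windows(earlier, later, shift_threshold=1.0):
    """Compare two {id: centroid} summaries; return (new, lost, shifted) ids."""
    earlier_ids, later_ids = set(earlier), set(later)
    new_ids = sorted(later_ids - earlier_ids)    # present only in the later window
    lost_ids = sorted(earlier_ids - later_ids)   # present only in the earlier window
    shifted_ids = sorted(
        i for i in earlier_ids & later_ids
        if math.dist(earlier[i], later[i]) > shift_threshold)
    return new_ids, lost_ids, shifted_ids

earlier = {1: (0.0, 0.0), 2: (5.0, 5.0), 3: (9.0, 9.0)}
later = {2: (5.0, 8.0), 3: (9.0, 9.5), 4: (1.0, 1.0)}
print(compare_windows(earlier, later))  # ([4], [1], [2])
```
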

17
Clustering Streams: Finding Stream Dynamics
  • Application: network intrusion detection
  • Detect bursts of activity or abrupt changes in
    real time by online clustering
  • The combination of tilted time window and
    micro-clustering
  • Captures sufficient statistics of the evolving
    data streams
  • Without sacrificing the underlying space and
    time efficiency of the online clustering process
  • Provides a wide variety of functionalities
  • Macro-cluster creation over different horizons
  • Stream evolution analysis

18
References
  • C. Aggarwal, J. Han, J. Wang, and P. S. Yu, A
    Framework for Clustering Evolving Data Streams,
    VLDB'03.
  • B. Babcock, S. Babu, M. Datar, R. Motwani, and
    J. Widom, Models and issues in data stream
    systems, PODS'02 (tutorial).
  • Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
    Multi-dimensional regression analysis of
    time-series data streams, VLDB'02.
  • M. Garofalakis, J. Gehrke, and R. Rastogi,
    Querying and mining data streams: You only get
    one look, SIGMOD'02 (tutorial).
  • S. Guha, N. Mishra, R. Motwani, and L.
    O'Callaghan, Clustering data streams, FOCS'00.
  • L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha,
    and R. Motwani, High-Performance Clustering of
    Streams and Large Data Sets, ICDE'02.

19
www.cs.uiuc.edu/hanj
  • Thank you!