Title: Mining Dynamics of Data Streams in Multidimensional Space Cluster Analysis
1. Mining Dynamics of Data Streams in Multidimensional Space: Cluster Analysis
- Jiawei Han
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- www.cs.uiuc.edu/hanj
2. Outline
- Why clustering data streams?
- Cluster analysis: A general overview
- Our design methodology: mining changes of clusters in multi-dimensional space
  - Tilted time frame window
  - Micro-cluster analysis and maintenance
  - Online macro-cluster analysis
3. Why Clustering Data Streams?
- What is cluster analysis? Grouping a set of data objects into a set of clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low
- Application example: Network intrusion detection
  - Detect bursts of activity or abrupt changes in real time by on-line clustering
- Clustering: A stream data reduction technique
- New requirements in stream clustering
  - Generate high-quality clusters in one scan
  - High-quality, efficient incremental clustering
  - Analysis should take care of multi-dimensional space
4. Major Clustering Approaches in Traditional Cluster Analysis
- Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
  - E.g., k-means, k-medoids, etc.
- Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
  - Often need to be integrated with other clustering methods, e.g., BIRCH
- Density-based: based on connectivity and density functions
  - Finding clusters of arbitrary shapes, e.g., DBSCAN, OPTICS, etc.
- Grid-based: based on a multiple-level granularity structure
  - View space as grid structures, e.g., STING, CLIQUE
- Model-based: find the best fit of the model to all the clusters
  - Good for conceptual clustering, e.g., COBWEB, SOM
5. The K-Means Clustering Process
- MacQueen'67: Each cluster is represented by the center of the cluster
[Figure: k-means iterations with K = 2 on a scatter plot with both axes running from 0 to 10. Steps shown: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means; reassign.]
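The loop shown on this slide (choose K initial centers, assign each object to its nearest center, update the means, repeat until nothing is reassigned) can be sketched as follows. This is a minimal illustration under stated assumptions, not MacQueen's original formulation: the function name `k_means`, the use of the first K points as the "arbitrary" initial centers, and the sample data are all made up for the example.

```python
from math import dist


def k_means(points, k, max_iter=100):
    """Plain k-means on a list of equal-length tuples (illustrative sketch)."""
    # Assumption: the first k points serve as the arbitrary initial centers.
    centers = [tuple(p) for p in points[:k]]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no object was reassigned, so the process has converged
        assignment = new_assignment
        # Update the cluster means.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, assignment


# Hypothetical sample data with two visually separate groups.
data = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = k_means(data, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Every pass rescans all points, which is one reason this batch form does not carry over directly to a one-scan stream setting.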
6. Clustering Data Streams: Previous Work
- K-median stream clustering (Guha, Motwani, et al. 2000-2002)
  - Data stream with points from a metric space
  - Find k centers in the stream such that the sum of distances from data points to their closest center is minimized
  - Only the k centroids (representing the clustering results) are retained when new data comes
  - Only the new data set is used to perform incremental clustering
  - Each retained center carries the weight of the many previous points it represents
  - The error is bounded across continuous incremental updates
  - The simple algorithm yields a constant-factor approximation
- Weakness of the method
  - Low quality for evolving data streams (registers only k centers)
  - Limited functionality for exploring cluster evolution over time
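The k-median objective, and the idea that retained centers stand in for earlier points through their weights, can be made concrete with a small cost function. This is only a sketch of the objective, not the approximation algorithm of Guha et al.; the name `k_median_cost` and the sample numbers are hypothetical.

```python
from math import dist


def k_median_cost(weighted_points, centers):
    """Sum over points of weight * distance to the closest center."""
    return sum(w * min(dist(p, c) for c in centers) for p, w in weighted_points)


# Hypothetical data: two retained centers summarizing 3 and 2 earlier points,
# followed by two newly arrived raw points (weight 1 each).
summary = [((0.0, 0.0), 3), ((10.0, 0.0), 2)]
new_points = [((1.0, 0.0), 1), ((9.0, 0.0), 1)]
candidate_centers = [(0.0, 0.0), (10.0, 0.0)]

cost = k_median_cost(summary + new_points, candidate_centers)
print(cost)  # 2.0
```

Because only the weighted centers survive, the positions of the earlier raw points are lost, which is the weakness the slide points out for evolving streams.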
7. CluStream: A Framework for Clustering Evolving Data Streams (VLDB'03)
- Outline of our methodology
- Divide the clustering process into online and offline components
  - Online: periodically stores summary statistics about the stream data
  - Micro-clustering: better quality than k-means
  - Incremental, online processing and maintenance
  - Offline: answers various user queries based on the stored summary statistics
- Tilted time framework: registers dynamic changes
- Achieves high efficiency, scalability, quality of results, and power of evolution/change detection with limited overhead
8. BIRCH (1996)
- BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)
- Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  - Phase 1: scan DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  - Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF-tree
- Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans
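9. Clustering Feature Vector
- CF = (N, LS, SS): number of points, per-dimension linear sum, per-dimension square sum
- Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16, 30), (54, 190))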
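The example vector on this slide can be reproduced by summing the points per dimension. The sketch below assumes the usual BIRCH reading of the triple as (count, linear sum, square sum); the function names `clustering_feature` and `merge_cf` are hypothetical.

```python
def clustering_feature(points):
    """Return (N, LS, SS) with per-dimension linear and square sums."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))
    ss = tuple(sum(v * v for v in col) for col in zip(*points))
    return n, ls, ss


def merge_cf(cf_a, cf_b):
    """CFs are additive: merging two subclusters adds their components."""
    (na, lsa, ssa), (nb, lsb, ssb) = cf_a, cf_b
    return (
        na + nb,
        tuple(a + b for a, b in zip(lsa, lsb)),
        tuple(a + b for a, b in zip(ssa, ssb)),
    )


points = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
cf = clustering_feature(points)
print(cf)  # (5, (16, 30), (54, 190))

# Building the CF from two halves gives the same result as one pass.
merged = merge_cf(clustering_feature(points[:2]), clustering_feature(points[2:]))
```

Additivity is what lets a parent node summarize its children without revisiting the raw points.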
10. CF-Tree in BIRCH
- Clustering feature
  - Summary of the statistics for a given subcluster: the 0-th, 1st and 2nd moments of the subcluster from the statistical point of view
  - Registers crucial measurements for computing clusters and utilizes storage efficiently
- A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  - A nonleaf node in a tree has descendants or children
  - The nonleaf nodes store sums of the CFs of their children
- A CF tree has two parameters
  - Branching factor: specifies the maximum number of children
  - Threshold: max diameter of sub-clusters stored at the leaf nodes
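The threshold test on leaf sub-clusters needs only the three moments, not the raw points. The sketch below assumes the BIRCH definition of diameter as the root of the average squared pairwise distance; the helper names `centroid`, `diameter` and `fits_in_leaf` are hypothetical.

```python
from math import sqrt


def centroid(cf):
    """Centroid = linear sum divided by the number of points."""
    n, ls, _ = cf
    return tuple(v / n for v in ls)


def diameter(cf):
    """Root of the average squared pairwise distance, from (N, LS, SS) only."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    total_ss = sum(ss)                   # sum of squared norms of all points
    ls_norm_sq = sum(v * v for v in ls)  # squared norm of the linear sum
    value = (2 * n * total_ss - 2 * ls_norm_sq) / (n * (n - 1))
    return sqrt(max(value, 0.0))         # guard against rounding below zero


def fits_in_leaf(cf, threshold):
    """A leaf entry is acceptable while its diameter stays within the threshold."""
    return diameter(cf) <= threshold


cf = (5, (16, 30), (54, 190))  # the five-point example CF
print(centroid(cf))            # (3.2, 6.0)
print(round(diameter(cf), 4))  # 2.5298
```

With a threshold of 3.0 this entry would be kept as one leaf sub-cluster; with 2.0 it would be too wide.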
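11. CF Tree
[Figure: CF tree with B = 7 and L = 6. The root and non-leaf nodes hold CF entries (CF1, CF2, CF3, ...), each paired with a pointer to a child node (child1, child2, child3, ...). Leaf nodes hold CF entries and are chained together by prev/next pointers.]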
12. Micro-Clusters: Design Methodology
- Data streams
  - Multi-dimensional points $X_1, \ldots, X_k, \ldots$ with time stamps $T_1, \ldots, T_k, \ldots$
  - Each point contains d dimensions, i.e., $X_i = (x_i^1, \ldots, x_i^d)$
  - A micro-cluster for n points is defined as a $(2 \cdot d + 3)$-tuple $(CF2^x, CF1^x, CF2^t, CF1^t, n)$
- Online statistical data collection
  - Maintain a total of q micro-clusters, $M_1, \ldots, M_q$, where q is usually significantly larger than the number of natural clusters and is determined by the amount of available memory
  - Each newly created micro-cluster is associated with a unique id, and a merged micro-cluster is associated with a list of ids
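One way to picture the (2d + 3)-tuple is as a small record of running sums. The layout assumed here follows the cited VLDB'03 paper: d per-dimension sums of squares, d per-dimension sums, the sum of squared time stamps, the sum of time stamps, and the point count. The class name `MicroCluster` and its methods are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class MicroCluster:
    """Assumed layout of the (2d + 3)-tuple: CF2x, CF1x, CF2t, CF1t, n."""
    cf2x: list          # per-dimension sum of squared values (d entries)
    cf1x: list          # per-dimension sum of values (d entries)
    cf2t: float = 0.0   # sum of squared time stamps
    cf1t: float = 0.0   # sum of time stamps
    n: int = 0          # number of points absorbed so far
    ids: list = field(default_factory=list)  # id list (one id until merged)

    @classmethod
    def empty(cls, d, cluster_id):
        return cls([0.0] * d, [0.0] * d, ids=[cluster_id])

    def add(self, point, timestamp):
        """Absorb one point by updating the running sums."""
        for j, v in enumerate(point):
            self.cf1x[j] += v
            self.cf2x[j] += v * v
        self.cf1t += timestamp
        self.cf2t += timestamp * timestamp
        self.n += 1

    def as_tuple(self):
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)


mc = MicroCluster.empty(d=2, cluster_id=1)
mc.add((3.0, 4.0), timestamp=1.0)
mc.add((2.0, 6.0), timestamp=2.0)
print(len(mc.as_tuple()))  # 7, i.e. 2 * 2 + 3
```

The id list is kept beside the tuple rather than inside it, so the tuple length stays at 2d + 3.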
13. Tilted Time Window Framework
- Natural tilted time frame window
  - Example: the minimal unit is a quarter (of an hour), then 4 quarters → 1 hour, 24 hours → 1 day, ...
- Logarithmic, pyramidal, and geometric tilted time frame windows
  - Not going to be implemented at the first stage
- Each micro-cluster is associated with a tilted time window
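A natural tilted time frame keeps recent history at fine granularity and older history at coarse granularity. The sketch below assumes that the stored quantity is a simple additive number (for example, a point count per quarter hour); in the framework itself each slot would presumably hold micro-cluster statistics instead. The class name `NaturalTiltedWindow` and the 31-day cap are hypothetical.

```python
from collections import deque


class NaturalTiltedWindow:
    """Quarter-hour slots roll up into hours, and hours into days (sketch)."""

    def __init__(self, max_days=31):
        self.quarters = deque(maxlen=4)     # the 4 most recent quarters
        self.hours = deque(maxlen=24)       # the 24 most recent hours
        self.days = deque(maxlen=max_days)  # coarse history
        self._quarters_in_hour = 0
        self._hours_in_day = 0

    def add_quarter(self, value):
        self.quarters.append(value)
        self._quarters_in_hour += 1
        if self._quarters_in_hour == 4:     # 4 quarters -> 1 hour
            self._quarters_in_hour = 0
            self.hours.append(sum(self.quarters))
            self._hours_in_day += 1
            if self._hours_in_day == 24:    # 24 hours -> 1 day
                self._hours_in_day = 0
                self.days.append(sum(self.hours))


window = NaturalTiltedWindow()
for _ in range(100):  # 100 quarter-hours, one event in each
    window.add_quarter(1)

slots = len(window.quarters) + len(window.hours) + len(window.days)
print(slots)  # 29 stored slots instead of 100 raw quarters
```

The storage grows with the number of granularity levels rather than with the length of the stream.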
14. Incremental Update of Micro-Clusters
- If a new data point falls within the maximum boundary of its closest micro-cluster $M_p$, it is added to $M_p$
  - Maximum boundary: a factor t of the RMS deviation of the data points in $M_p$ from its centroid
- Otherwise, a new micro-cluster is created for the point, which corresponds to (1) an outlier, or (2) a new cluster
- Delete an old cluster or merge the two closest clusters?
  - First determine if it is safe to delete the micro-cluster with the least recent time-stamp
  - If not, merge the two closest micro-clusters
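The absorb-or-create decision can be sketched with micro-clusters stored as plain dictionaries of running sums. This is an illustration under stated assumptions, not the authors' implementation: the key names, the function names, and the default factor `t=2.0` are hypothetical, and the delete-or-merge step that keeps the number of micro-clusters bounded is left out.

```python
from math import dist, sqrt


def centroid(mc):
    return [s / mc["n"] for s in mc["cf1x"]]


def rms_deviation(mc):
    """RMS distance of the absorbed points from the centroid, from sums only."""
    n = mc["n"]
    var = sum(s2 / n - (s1 / n) ** 2 for s1, s2 in zip(mc["cf1x"], mc["cf2x"]))
    return sqrt(max(var, 0.0))


def new_micro_cluster(point, timestamp):
    return {"n": 1, "cf1x": list(point), "cf2x": [v * v for v in point],
            "cf1t": timestamp, "cf2t": timestamp * timestamp}


def absorb_or_create(micro_clusters, point, timestamp, t=2.0):
    """Add the point to its closest micro-cluster if within t * RMS deviation."""
    if micro_clusters:
        closest = min(micro_clusters, key=lambda mc: dist(point, centroid(mc)))
        # Note: a one-point cluster has zero deviation; this sketch does not
        # special-case it, so such a cluster only absorbs identical points.
        if dist(point, centroid(closest)) <= t * rms_deviation(closest):
            closest["n"] += 1
            for j, v in enumerate(point):
                closest["cf1x"][j] += v
                closest["cf2x"][j] += v * v
            closest["cf1t"] += timestamp
            closest["cf2t"] += timestamp * timestamp
            return closest
    created = new_micro_cluster(point, timestamp)
    micro_clusters.append(created)
    return created


# One existing micro-cluster built from (0,0), (2,0), (0,2), (2,2) at times 1..4.
clusters = [{"n": 4, "cf1x": [4.0, 4.0], "cf2x": [8.0, 8.0],
             "cf1t": 10.0, "cf2t": 30.0}]
absorb_or_create(clusters, (2.0, 3.0), timestamp=5.0)    # inside the boundary
absorb_or_create(clusters, (10.0, 10.0), timestamp=6.0)  # far away: new cluster
print(len(clusters), clusters[0]["n"])  # 2 5
```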
15. Macro-Cluster Creation
- Given a user-specified time-horizon h and the number of macro-clusters, K
  - Find the K high-level clusters over the pre-specified horizon h
- Subtractive property
  - Let $C_1$ and $C_2$ be two sets of points such that $C_1 \supseteq C_2$
  - Then $\overline{CFT}(C_1 - C_2) = \overline{CFT}(C_1) - \overline{CFT}(C_2)$ holds
  - Here $\overline{CFT}(C)$ denotes the micro-cluster of the set of n points, C
- Compute the set of net micro-clusters based on the subtractive property
- Use a variant of the k-means algorithm to compute the final K macro-clusters
  - Each net micro-cluster is treated as a weighted pseudo-point
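Because micro-cluster statistics are plain sums, the statistics for a horizon can be obtained by subtracting an older snapshot from the current one. The sketch below simplifies in two labeled ways: each micro-cluster has a single id (the merged id lists are ignored), and only the spatial sums are kept. The function names and sample numbers are hypothetical.

```python
def subtract(mc_now, mc_old):
    """Component-wise difference of two additive micro-cluster summaries."""
    return {
        "n": mc_now["n"] - mc_old["n"],
        "cf1x": [a - b for a, b in zip(mc_now["cf1x"], mc_old["cf1x"])],
        "cf2x": [a - b for a, b in zip(mc_now["cf2x"], mc_old["cf2x"])],
    }


def net_micro_clusters(snapshot_now, snapshot_old):
    """Snapshots map id -> summary; the result covers only the horizon."""
    net = {}
    for cid, mc in snapshot_now.items():
        old = snapshot_old.get(cid)
        diff = subtract(mc, old) if old is not None else mc
        if diff["n"] > 0:  # drop clusters that received no points in the horizon
            net[cid] = diff
    return net


def pseudo_points(net):
    """Each net micro-cluster becomes (centroid, weight) for a weighted k-means."""
    return [([s / mc["n"] for s in mc["cf1x"]], mc["n"]) for mc in net.values()]


# Hypothetical snapshots taken at times t - h and t.
old = {1: {"n": 2, "cf1x": [2.0, 2.0], "cf2x": [2.0, 2.0]}}
now = {1: {"n": 5, "cf1x": [11.0, 8.0], "cf2x": [29.0, 14.0]},
       2: {"n": 3, "cf1x": [30.0, 30.0], "cf2x": [300.0, 300.0]}}

points = pseudo_points(net_micro_clusters(now, old))
print(points)  # [([3.0, 2.0], 3), ([10.0, 10.0], 3)]
```

A weighted k-means would then cluster these pseudo-points, counting each centroid as many times as its weight.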
16. Evolution Analysis of Micro-Clusters
- Given a user-specified time-horizon h and two clock times, t1 and t2 (where t1 < t2)
  - Analyze the evolution nature of the data arriving between (t2 - h, t2) and the data arriving between (t1 - h, t1)
- Answer the following questions
  - Are there new clusters in the data at time t2 which were not present at time t1?
  - Have some of the original clusters been lost?
  - Have some of the original clusters at time t1 shifted in position and nature?
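The three questions can be answered by comparing the micro-cluster ids present in the two horizons. The sketch assumes t1 < t2, so ids seen only in the later window count as new and ids seen only in the earlier window count as lost. It also assumes a single id per micro-cluster and uses centroid displacement as a simple stand-in for "shift in position"; all names and data are hypothetical.

```python
from math import dist


def centroid(mc):
    return [s / mc["n"] for s in mc["cf1x"]]


def compare_horizons(net_t1, net_t2):
    """net_t1, net_t2: id -> summary for (t1 - h, t1) and (t2 - h, t2)."""
    ids1, ids2 = set(net_t1), set(net_t2)
    added = ids2 - ids1     # only in the later window: new clusters
    deleted = ids1 - ids2   # only in the earlier window: lost clusters
    retained = ids1 & ids2  # in both windows: check how far they moved
    shifts = {
        cid: dist(centroid(net_t1[cid]), centroid(net_t2[cid]))
        for cid in retained
    }
    return added, deleted, shifts


# Hypothetical net micro-clusters for the two horizons.
net_t1 = {1: {"n": 4, "cf1x": [4.0, 4.0]}, 2: {"n": 2, "cf1x": [20.0, 20.0]}}
net_t2 = {1: {"n": 4, "cf1x": [16.0, 4.0]}, 3: {"n": 5, "cf1x": [0.0, 50.0]}}

added, deleted, shifts = compare_horizons(net_t1, net_t2)
print(added, deleted, shifts)  # {3} {2} {1: 3.0}
```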
17. Clustering Streams: Finding Stream Dynamics
- Application: Network intrusion detection
  - Detect bursts of activity or abrupt changes in real time by on-line clustering
- The combination of tilted time window and micro-clustering
  - captures sufficient statistics of the evolving data streams
  - without sacrificing the underlying space- and time-efficiency of the online clustering process
- Provides a wide variety of functionalities
  - macro-cluster creation over different horizons
  - stream evolution analysis
18. References
- C. Aggarwal, J. Han, J. Wang, and P. S. Yu, A Framework for Clustering Evolving Data Streams, VLDB'03.
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Models and issues in data stream systems, PODS'02 (tutorial).
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-dimensional regression analysis of time-series data streams, VLDB'02.
- M. Garofalakis, J. Gehrke, and R. Rastogi, Querying and mining data streams: You only get one look, SIGMOD'02 (tutorial).
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, Clustering data streams, FOCS'00.
- L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, High-Performance Clustering of Streams and Large Data Sets, ICDE'02.