Mining Dynamics of Data Streams in Multidimensional Space: Cluster Analysis

1
Mining Dynamics of Data Streams in Multidimensional Space: Cluster Analysis

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/hanj

2
Outline

Why clustering data streams?
Cluster analysis: a general overview
Our design methodology: mining changes of clusters in multi-dimensional space
  Tilted time frame window
  Micro-cluster analysis and maintenance
  Online macro-cluster analysis

3
Why Clustering Data Streams?

What is cluster analysis? Grouping a set of data objects into a set of clusters such that the intra-cluster similarity is high and the inter-cluster similarity is low
Application example: network intrusion detection
  Detect bursts of activity or abrupt changes in real time by on-line clustering
Clustering: a stream data reduction technique
New requirements in stream clustering
  Generate high-quality clusters in one scan
  High-quality, efficient incremental clustering
  Analysis should take care of multi-dimensional space

4
Major Clustering Approaches in Traditional Cluster Analysis

Partitioning algorithms: construct various partitions and then evaluate them by some criterion
  E.g., k-means, k-medoids, etc.
Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion
  Often need to be integrated with other clustering methods, e.g., BIRCH
Density-based: based on connectivity and density functions
  Find clusters of arbitrary shapes, e.g., DBSCAN, OPTICS, etc.
Grid-based: based on a multiple-level granularity structure
  View space as grid structures, e.g., STING, CLIQUE
Model-based: find the best fit of the model to all the clusters
  Good for conceptual clustering, e.g., COBWEB, SOM

5
The K-Means Clustering Process

MacQueen'67: each cluster is represented by the center of the cluster

[Figure: k-means with K = 2 on a scatter plot whose axes both run from 0 to 10. Panels: arbitrarily choose K objects as initial cluster centers; assign each object to the most similar center; update the cluster means; reassign; update the cluster means.]
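The loop in the figure (choose K centers, assign, update means, reassign) can be sketched in a few lines of plain Python. This is only a minimal illustration, not code from the presentation: the function name `kmeans`, the `seed` and `iters` parameters, and the use of squared Euclidean distance as the similarity measure are assumptions made for the example.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain k-means on a list of equal-length numeric tuples (illustrative sketch)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # arbitrarily choose K objects as centers
    assign = None
    for _ in range(iters):
        # Assign each object to the most similar (here: nearest) center.
        new_assign = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        if new_assign == assign:              # nothing was reassigned: converged
            break
        assign = new_assign
        # Update the cluster means.
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:                       # an empty cluster keeps its old center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assign

pts = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assign = kmeans(pts, k=2)
```

On well-separated data such as `pts`, the two groups end up in different clusters; on harder data the result depends on the initial choice, which is why the sketch exposes `seed`.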

6
Clustering Data Streams: Previous Work

K-median stream clustering (Guha, Motwani, et al. 2000-2002)
  Data stream with points from a metric space
  Find k centers in the stream such that the sum of distances from data points to their closest center is minimized
  Only the k centroids (representing the clustering results) are retained when new data comes
  Only the new data set is used to perform incremental clustering
  The retained centers carry weights that stand for the many previous points they represent
  The error is bounded over continuous incremental updates
  The simple algorithm yields a constant-factor approximation
Weakness of the method
  Low quality for evolving data streams (registers only k centers)
  Limited functionality for exploring cluster evolution over time
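A rough sketch of the "keep only k weighted centers" idea follows. It is not the Guha et al. algorithm: a weighted k-means step with squared Euclidean distance is swapped in for the k-median approximation, the first k items are used as initial centers, and the names `weighted_kmeans` and `cluster_stream` are hypothetical.

```python
def weighted_kmeans(items, k, iters=50):
    """items: list of (point, weight). Returns k (center, weight) pairs."""
    centers = [p for p, _ in items[:k]]       # naive init (assumption): first k points
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p, w in items:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[j].append((p, w))
        new_centers = []
        for j, g in enumerate(groups):
            if not g:                         # empty group keeps its old center
                new_centers.append(centers[j])
                continue
            total = sum(w for _, w in g)
            dims = range(len(g[0][0]))
            new_centers.append(tuple(sum(p[i] * w for p, w in g) / total for i in dims))
        if new_centers == centers:
            break
        centers = new_centers
    weights = [sum(w for _, w in g) for g in groups]
    return list(zip(centers, weights))

def cluster_stream(chunks, k):
    """Retain only k weighted centers between chunks of the stream."""
    summary = []                              # weighted centers from earlier chunks
    for chunk in chunks:
        items = summary + [(p, 1) for p in chunk]   # new points have weight 1
        summary = weighted_kmeans(items, k)
    return summary

chunks = [[(0.0, 0.0), (10.0, 10.0), (0.0, 1.0)], [(10.0, 11.0), (1.0, 0.0)]]
summary = cluster_stream(chunks, k=2)
```

Each retained center stands in for all the points it has absorbed so far, which is also why the approach registers little about how clusters evolve: the history is folded into k weights.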

7
CluStream: A Framework for Clustering Evolving Data Streams (VLDB'03)

Outline of our methodology
  Divide the clustering process into online and offline components
  Online: periodically stores summary statistics about the stream data
    Micro-clustering: better quality than k-means
    Incremental, online processing and maintenance
  Offline: answers various user queries based on the stored summary statistics
Tilted time framework: register dynamic changes
With limited overhead, achieve high efficiency, scalability, quality of results, and power of evolution/change detection

8
BIRCH (1996)

BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD'96)
Incrementally construct a CF (Clustering Feature) tree, a hierarchical data structure for multiphase clustering
  Phase 1: scan the DB to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve the inherent clustering structure of the data)
  Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans

9
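Clustering Feature Vector

Clustering feature: CF = (N, LS, SS), i.e., the number of points, their per-dimension linear sum, and their per-dimension sum of squares
Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16, 30), (54, 190))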
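The example CF can be checked directly. The helper name `clustering_feature` is made up for this sketch; it simply computes the count, the per-dimension sum, and the per-dimension sum of squares.

```python
def clustering_feature(points):
    """Return (N, LS, SS) for a list of equal-length numeric tuples."""
    n = len(points)
    ls = tuple(sum(col) for col in zip(*points))                 # linear sum per dimension
    ss = tuple(sum(x * x for x in col) for col in zip(*points))  # square sum per dimension
    return n, ls, ss

cf = clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)])
print(cf)  # (5, (16, 30), (54, 190))
```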

10
CF-Tree in BIRCH

Clustering feature
  Summary of the statistics for a given subcluster: the 0th, 1st, and 2nd moments of the subcluster from the statistical point of view
  Registers the crucial measurements for computing clusters and uses storage efficiently
A CF tree is a height-balanced tree that stores the clustering features for a hierarchical clustering
  A nonleaf node in the tree has descendants or children
  The nonleaf nodes store the sums of the CFs of their children
A CF tree has two parameters
  Branching factor: specifies the maximum number of children
  Threshold: the maximum diameter of sub-clusters stored at the leaf nodes
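Two properties carry the CF tree: CFs add component-wise (so a nonleaf entry is the sum of its children), and the diameter needed for the threshold test can be computed from (N, LS, SS) alone. The sketch below assumes CFs are stored as `(N, LS, SS)` tuples; the function names are illustrative, and the diameter is taken as the square root of the average squared pairwise distance.

```python
import math

def cf_add(a, b):
    """CF additivity: the CF of a union is the component-wise sum."""
    return (a[0] + b[0],
            tuple(x + y for x, y in zip(a[1], b[1])),
            tuple(x + y for x, y in zip(a[2], b[2])))

def cf_centroid(cf):
    n, ls, _ = cf
    return tuple(s / n for s in ls)

def cf_diameter(cf):
    """Root of the average squared pairwise distance, from (N, LS, SS) only."""
    n, ls, ss = cf
    if n < 2:
        return 0.0
    total = 2 * n * sum(ss) - 2 * sum(s * s for s in ls)
    return math.sqrt(max(total, 0.0) / (n * (n - 1)))

def fits_in_leaf(entry, point, threshold):
    """Would absorbing `point` keep the leaf entry within the diameter threshold?"""
    point_cf = (1, tuple(point), tuple(v * v for v in point))
    return cf_diameter(cf_add(entry, point_cf)) <= threshold

leaf_a = (2, (5, 10), (13, 52))      # CF of the points (3,4), (2,6)
leaf_b = (3, (11, 20), (41, 138))    # CF of the points (4,5), (4,7), (3,8)
parent = cf_add(leaf_a, leaf_b)      # CF stored in the nonleaf entry above them
```

Because only sums are stored, inserting a point or merging two entries never requires revisiting the raw data.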

11
CF Tree
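[Figure: CF tree with B = 7 and L = 6. The root and the non-leaf nodes hold entries CF1, CF2, CF3, ..., each with a pointer child1, child2, child3, ... to a child node. Leaf nodes hold CF entries and are chained together by prev and next pointers.]

12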
Micro-Clusters Design Methodology

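Data streams
  Multi-dimensional points X1, ..., Xk, ... with time stamps T1, ..., Tk, ...
  Each point contains d dimensions, i.e., Xi = (xi^1, ..., xi^d)
A micro-cluster for n points is defined as a (2·d + 3)-tuple (CF2^x, CF1^x, CF2^t, CF1^t, n)
  CF2^x and CF1^x: per-dimension sums of squares and sums of the data values (d entries each)
  CF2^t and CF1^t: sum of squares and sum of the time stamps
Online statistical data collection
  Maintain a total of q micro-clusters, M1, ..., Mq, where q is usually significantly larger than the number of natural clusters and is determined by the amount of available memory
  Each newly created micro-cluster is associated with a unique id, and a merged micro-cluster is associated with a list of ids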
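One possible in-memory layout for such a micro-cluster is sketched below. The class name `MicroCluster`, its field names, and the `from_point`/`absorb` helpers are assumptions for illustration; only the tuple contents (d sums of squares, d sums, two time-stamp statistics, and the count) and the id list come from the slide.

```python
from dataclasses import dataclass, field

@dataclass
class MicroCluster:
    cf2x: list        # per-dimension sum of squared values (d entries)
    cf1x: list        # per-dimension sum of values (d entries)
    cf2t: float       # sum of squared time stamps
    cf1t: float       # sum of time stamps
    n: int            # number of points absorbed
    ids: list = field(default_factory=list)   # one id, or several after merges

    @classmethod
    def from_point(cls, x, t, mc_id):
        return cls([v * v for v in x], list(x), t * t, t, 1, [mc_id])

    def absorb(self, x, t):
        """Additive update when a point with time stamp t joins the micro-cluster."""
        self.cf2x = [a + v * v for a, v in zip(self.cf2x, x)]
        self.cf1x = [a + v for a, v in zip(self.cf1x, x)]
        self.cf2t += t * t
        self.cf1t += t
        self.n += 1

    def as_tuple(self):
        """Flatten to the (2*d + 3)-tuple."""
        return (*self.cf2x, *self.cf1x, self.cf2t, self.cf1t, self.n)

mc = MicroCluster.from_point((1.0, 2.0, 3.0), t=5.0, mc_id=1)
mc.absorb((1.0, 0.0, 1.0), t=6.0)
```

For d = 3 the flattened tuple has 2·3 + 3 = 9 entries, regardless of how many points have been absorbed.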

13
Tilted Time Window Framework

Natural tilted time frame window
  Example: the minimal unit is a quarter (of an hour); then 4 quarters -> 1 hour, 24 hours -> 1 day, ...
Logarithmic, pyramidal, and geometric tilted time frame windows
  Not going to be implemented at the first stage
Each micro-cluster is associated with a tilted time window
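The roll-up in the natural tilted time frame can be illustrated with a toy structure: recent data is kept at quarter granularity and older data is folded into coarser slots. The class name `NaturalTiltedWindow` and the use of a plain sum as the aggregate are assumptions of this sketch; a real system would aggregate micro-cluster statistics instead of a single number.

```python
class NaturalTiltedWindow:
    """Keeps fine-grained slots for recent data and coarse slots for old data."""

    def __init__(self):
        self.quarters = []   # most recent quarter-hour aggregates (fewer than 4)
        self.hours = []      # hourly aggregates (fewer than 24)
        self.days = []       # daily aggregates

    def add_quarter(self, value):
        self.quarters.append(value)
        if len(self.quarters) == 4:          # 4 quarters -> 1 hour
            self.hours.append(sum(self.quarters))
            self.quarters.clear()
        if len(self.hours) == 24:            # 24 hours -> 1 day
            self.days.append(sum(self.hours))
            self.hours.clear()

w = NaturalTiltedWindow()
for _ in range(101):                         # 101 quarter-hour readings of value 1
    w.add_quarter(1)
```

After 101 quarters the window holds one day slot, one hour slot, and one quarter slot: three stored values instead of 101, with the finest resolution reserved for the most recent data.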

14
Incremental Update of Micro-Clusters

If a new data point falls within the maximum boundary of its closest micro-cluster Mp, it is added to Mp
  Maximum boundary: a factor t of the RMS deviation of the data points in Mp from its centroid
Otherwise, a new micro-cluster is created for the point, which corresponds to (1) an outlier, or (2) the start of a new cluster
To make room for it: delete an old micro-cluster or merge the two closest ones?
  First determine if it is safe to delete the micro-cluster with the least recent time stamp
  If not, merge the two closest micro-clusters
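The absorb-or-create decision can be sketched as follows. Micro-clusters are reduced here to dictionaries with `n`, `ls`, and `ss` (time-stamp statistics and the delete/merge step are left out), and `min_boundary` is a hypothetical fallback for singleton micro-clusters, whose RMS deviation is zero; none of these names come from the presentation.

```python
import math

def centroid(mc):
    return [s / mc["n"] for s in mc["ls"]]

def rms_deviation(mc):
    """RMS distance of the member points from the centroid, from sums only."""
    var = sum(ss / mc["n"] - (ls / mc["n"]) ** 2
              for ls, ss in zip(mc["ls"], mc["ss"]))
    return math.sqrt(max(var, 0.0))

def insert_point(clusters, x, t=2.0, min_boundary=1.0):
    """Absorb x into its closest micro-cluster or start a new one.

    Returns the index of the micro-cluster that received x.
    """
    if clusters:
        dists = [math.dist(x, centroid(mc)) for mc in clusters]
        p = min(range(len(clusters)), key=dists.__getitem__)
        boundary = max(t * rms_deviation(clusters[p]), min_boundary)
        if dists[p] <= boundary:             # inside the maximum boundary of Mp
            mc = clusters[p]
            mc["n"] += 1
            mc["ls"] = [a + b for a, b in zip(mc["ls"], x)]
            mc["ss"] = [a + b * b for a, b in zip(mc["ss"], x)]
            return p
    # Outlier or start of a new cluster: create a new micro-cluster.
    clusters.append({"n": 1, "ls": list(x), "ss": [v * v for v in x]})
    return len(clusters) - 1

clusters = []
first = insert_point(clusters, (0.0, 0.0))
second = insert_point(clusters, (0.5, 0.0))
third = insert_point(clusters, (10.0, 10.0))
```

A full implementation would also enforce the budget of q micro-clusters by deleting a stale micro-cluster or merging the two closest ones whenever a new one is created.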

15
Macro-Cluster Creation

Given a user-specified time horizon h and the number of macro-clusters K
  Find the K high-level clusters over the pre-specified horizon h
Subtractive property
  Use CFT(C) to denote the micro-cluster of the set of n points C
  Let C1 and C2 be two sets of points such that C1 ⊇ C2
  Then CFT(C1 - C2) = CFT(C1) - CFT(C2) holds
Compute the set of net micro-clusters based on the subtractive property
Use a variant of the k-means algorithm to compute the final K macro-clusters
  Each net micro-cluster is treated as a weighted pseudo-point
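The subtractive property is what makes horizon queries cheap: the statistics for the last h time units are the current snapshot minus the snapshot stored h ago. The sketch below assumes snapshots are dictionaries mapping a micro-cluster id to an `(n, LS, SS)` tuple and ignores merged id lists; `cf_subtract`, `net_micro_clusters`, and `pseudo_points` are hypothetical names.

```python
def cf_subtract(a, b):
    """CFT(C1 - C2) = CFT(C1) - CFT(C2), component-wise, for C1 containing C2."""
    return (a[0] - b[0],
            tuple(x - y for x, y in zip(a[1], b[1])),
            tuple(x - y for x, y in zip(a[2], b[2])))

def net_micro_clusters(snapshot_now, snapshot_then):
    """Statistics of the points that arrived between the two snapshots."""
    net = {}
    for mc_id, cf in snapshot_now.items():
        old = snapshot_then.get(mc_id)
        diff = cf_subtract(cf, old) if old else cf
        if diff[0] > 0:                      # keep only micro-clusters that grew
            net[mc_id] = diff
    return net

def pseudo_points(net):
    """Each net micro-cluster becomes a (centroid, weight) pseudo-point."""
    return [(tuple(s / n for s in ls), n) for n, ls, _ in net.values()]

now = {1: (5, (16, 30), (54, 190)), 2: (2, (20, 20), (200, 200))}
then = {1: (2, (5, 10), (13, 52))}           # snapshot taken h time units earlier
net = net_micro_clusters(now, then)
points = pseudo_points(net)
```

The resulting weighted pseudo-points are what a weighted k-means variant would then group into the K macro-clusters.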

16
Evolution Analysis of Micro-Clusters

Given a user-specified time horizon h and two clock times t1 and t2 (where t1 < t2)
  Analyze how the data arriving in (t2 - h, t2) has evolved relative to the data arriving in (t1 - h, t1)
Answer the following questions
  Are there new clusters in the data at time t2 which were not present at time t1?
  Have some of the original clusters been lost?
  Have some of the original clusters at time t1 shifted in position and nature?
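One simple way to approach the three questions is to compare the ids of the net micro-clusters in the two windows. This is a sketch under assumptions, not the presenter's procedure: snapshots are taken to be dictionaries from id to `(n, LS)`, merged id lists are ignored, and the shift is measured as the distance between centroids; the function name `evolution` is illustrative.

```python
import math

def evolution(snap_t1, snap_t2):
    """Compare net micro-clusters over (t1 - h, t1) and (t2 - h, t2)."""
    ids1, ids2 = set(snap_t1), set(snap_t2)

    def centroid(cf):
        n, ls = cf
        return [s / n for s in ls]

    return {
        "new": sorted(ids2 - ids1),          # present at t2 but not at t1
        "lost": sorted(ids1 - ids2),         # present at t1 but gone at t2
        "shift": {i: math.dist(centroid(snap_t1[i]), centroid(snap_t2[i]))
                  for i in sorted(ids1 & ids2)},
    }

snap_t1 = {1: (2, (0.0, 0.0)), 2: (4, (4.0, 4.0))}
snap_t2 = {2: (2, (8.0, 10.0)), 3: (1, (9.0, 9.0))}
report = evolution(snap_t1, snap_t2)
```

In this toy input, micro-cluster 3 is new, micro-cluster 1 is lost, and micro-cluster 2 has moved from centroid (1, 1) to (4, 5).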

17
Clustering Streams: Finding Stream Dynamics

Application: network intrusion detection
  Detect bursts of activity or abrupt changes in real time by on-line clustering
The combination of tilted time window and micro-clustering
  Captures sufficient statistics of the evolving data streams
  Does so without sacrificing the underlying space and time efficiency of the online clustering process
Provides a wide variety of functionalities
  Macro-cluster creation over different horizons
  Stream evolution analysis

18
References

C. Aggarwal, J. Han, J. Wang, and P. S. Yu, A Framework for Clustering Evolving Data Streams, VLDB'03.
B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, Models and issues in data stream systems, PODS'02 (tutorial).
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, Multi-dimensional regression analysis of time-series data streams, VLDB'02.
M. Garofalakis, J. Gehrke, and R. Rastogi, Querying and mining data streams: You only get one look, SIGMOD'02 (tutorial).
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, Clustering data streams, FOCS'00.
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani, High-Performance Clustering of Streams and Large Data Sets, ICDE'02.