Title: Mining Dynamics of Data Streams in Multidimensional Space
1Mining Dynamics of Data Streams in
Multidimensional Space
- Jiawei Han
- Department of Computer Science
- University of Illinois at Urbana-Champaign
- www.cs.uiuc.edu/hanj
2Outline
- Characteristics of data streams
- Mining dynamics in data streams
- Multi-dimensional (regression) analysis of data streams
- Stream cubing and stream OLAP methods
- Mining other kinds of patterns in data streams
- Research problems
3Characteristics of Data Streams
- Data streams
- Data streams: continuous, ordered, changing, fast, huge in volume
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
- Huge volumes of continuous data, possibly infinite
- Fast changing, requiring fast, real-time response
- Data streams capture today's data processing needs well
- Random access is expensive: single linear scan algorithms (only one look at the data)
- Store only a summary of the data seen thus far
- Most stream data are low-level or multi-dimensional in nature and need multi-level and multi-dimensional processing
4Stream Data Applications
- Telecommunication: calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering & industrial processes: power supply & manufacturing
- Sensor, monitoring & surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)
5Architecture: Stream Query Processing
[Figure: multiple streams feed the Stream Query Processor inside the SDMS (Stream Data Management System), which uses scratch space (main memory and/or disk) and returns results to the user/application.]
6Challenges of Stream Data Processing
- Multiple, continuous, rapid, time-varying, ordered streams
- Main memory computations
- Queries are often continuous
- Evaluated continuously as stream data arrives
- Answer updated over time
- Queries are often complex
- Beyond element-at-a-time processing
- Beyond stream-at-a-time processing
- Beyond relational queries (scientific, data mining, OLAP)
- Multi-level/multi-dimensional processing and data mining
- Most stream data are low-level or multi-dimensional in nature
7Processing Stream Queries
- Query types
- One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
- Predefined query vs. ad-hoc query (issued on-line)
- Unbounded memory requirements
- For real-time response, main-memory algorithms should be used
- The memory requirement is unbounded if a query must join future tuples
- Approximate query answering
- With bounded memory, it is not always possible to produce exact answers
- High-quality approximate answers are desired
- Data reduction and synopsis construction methods
- Sketches, random sampling, histograms, wavelets, etc.
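As one illustration of the synopsis methods listed above, random sampling over a stream can be done in a single pass with reservoir sampling. The sketch below is a minimal version, assuming a uniform sample of fixed size k is wanted; the name `reservoir_sample` and its parameters are illustrative and do not come from the slides.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a one-pass stream."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample
```

Memory use is bounded by k regardless of stream length, which is the property the slide asks of a synopsis.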
8Projects on DSMS (Data Stream Management System)
- Research projects and system prototypes
- STREAM (Stanford): a general-purpose DSMS
- Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia Tech): triggers, incremental view maintenance
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tradebot (www.tradebot.com): stock tickers & streams
- Tribeca (Bellcore): network monitoring
- Streaminer (UIUC): new project for stream data mining
9Stream Data Mining vs. Stream Querying
- Stream mining: a more challenging task
- It shares most of the difficulties with stream querying
- Patterns are hidden and more general than querying
- It may require exploratory analysis
- Not necessarily continuous queries
- Stream data mining tasks
- Multi-dimensional on-line analysis of streams
- Mining outliers and unusual patterns in stream data
- Clustering data streams
- Classification of stream data
10Stream Data Mining Tasks
- Multi-dimensional (on-line) analysis of streams
- Clustering data streams
- Classification of data streams
- Mining frequent patterns in data streams
- Mining sequential patterns in data streams
- Mining partial periodicity in data streams
- Mining notable gradients in data streams
- Mining outliers and unusual patterns in data streams
- ..., more?
11Challenges for Mining Dynamics in Data Streams
- Most stream data are low-level or multi-dimensional in nature: they need multi-level/multi-dimensional (ML/MD) processing
- Analysis requirements
- Multi-dimensional trends and unusual patterns
- Capturing important changes at multiple dimensions/levels
- Fast, real-time detection and response
- Comparing with data cubes: similarities and differences
- Stream (data) cube or stream OLAP: is this feasible?
- Can we implement it efficiently?
12Multi-Dimensional Stream Analysis Examples
- Analysis of Web click streams
- Raw data at low levels: seconds, web page addresses, user IP addresses, ...
- Analysts want changes, trends, and unusual patterns at reasonable levels of detail
- E.g., "Average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours."
- Analysis of power consumption streams
- Raw data: power consumption flow for every household, every minute
- Patterns one may find: average hourly power consumption of manufacturing companies in Chicago in the last 2 hours today is up 30% from the same day a week ago
13A Key Step: Stream Data Reduction
- Challenges of OLAPing stream data
- Raw data cannot be stored
- Simple aggregates are not powerful enough
- History shape and patterns at different levels are desirable: multi-dimensional regression analysis
- Proposal
- A scalable multi-dimensional stream data cube that can aggregate regression models of stream data efficiently without accessing the raw data
- Stream data compression
- Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
14Regression Cube for Time-Series
- Initially, one time series per base cell
- Too costly to store all these time series
- Too costly to compute regression in multi-dimensional space
- Regression cube
- Base cube: store only the regression parameters of base cells (e.g., 2 points vs. 1000 points)
- All the upper-level cuboids can be computed precisely for linear regression on both the standard dimensions and the time dimension
- For quadratic regression, we need 5 points
- In general, we need ... (where k = 2 for quadratic)
15Basics of General Linear Regression
- n tuples in one cell: $(x_i, y_i)$, $i = 1..n$, where $y_i$ is the measure attribute to be analyzed
- For sample $i$, a vector of $k$ user-defined predictors $u_i$
- The linear regression model
- $E(y_i \mid u_i) = \eta^{T} u_i$
- where $\eta$ is a $k \times 1$ vector of regression parameters
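For reference, a common way to estimate the parameter vector is ordinary least squares. This is an assumption about the estimator, since the slide only states the model. The matrix $U$ (with rows $u_i^{T}$), the vector $y$ of measures, and the symbol $\eta$ for the parameter vector are notation introduced here.

```latex
\hat{\eta} = (U^{T} U)^{-1} U^{T} y,
\qquad
U^{T} U = \sum_{i=1}^{n} u_i u_i^{T},
\qquad
U^{T} y = \sum_{i=1}^{n} u_i y_i
```

Both $U^{T} U$ and $U^{T} y$ are sums over tuples, so they can be accumulated in one pass without retaining the tuples themselves.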
16Stock Price Example: Aggregation in Standard Dimensions
- Simple linear regression on time series data
- Cells of two companies
- After aggregation
17Stock Price Example: Aggregation in Time Dimension
- Cells of two adjacent time intervals
- After aggregation
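The two aggregation cases in the stock price example can be sketched in code. The version below is a minimal illustration for simple linear regression and does not reproduce the compressed representation of the cited papers: it assumes each cell keeps plain sufficient statistics (count and four sums), and the names `RegStats`, `merge_time`, and `merge_standard` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegStats:
    """Sufficient statistics for fitting y = a + b*t over one cell."""
    n: int
    st: float   # sum of t
    sy: float   # sum of y
    stt: float  # sum of t*t
    sty: float  # sum of t*y

    @classmethod
    def from_series(cls, ts, ys):
        return cls(len(ts), sum(ts), sum(ys),
                   sum(t * t for t in ts),
                   sum(t * y for t, y in zip(ts, ys)))

    def fit(self):
        """Return (intercept, slope) of the least-squares line."""
        denom = self.n * self.stt - self.st ** 2
        slope = (self.n * self.sty - self.st * self.sy) / denom
        intercept = (self.sy - slope * self.st) / self.n
        return intercept, slope


def merge_time(a, b):
    """Time dimension: two disjoint intervals are concatenated, so all sums add."""
    return RegStats(a.n + b.n, a.st + b.st, a.sy + b.sy,
                    a.stt + b.stt, a.sty + b.sty)


def merge_standard(a, b):
    """Standard dimension: same time points, measures summed (e.g. two companies)."""
    if (a.n, a.st, a.stt) != (b.n, b.st, b.stt):
        raise ValueError("cells must cover the same time points")
    return RegStats(a.n, a.st, a.sy + b.sy, a.stt, a.sty + b.sty)
```

In both cases the merged cell yields the same line as a fit over the raw data, and the storage per cell does not depend on the number of tuples.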
18Feasibility of Stream Regression Analysis
- Efficient storage and scalable (independent of the number of tuples in data cells)
- Lossless aggregation without accessing the raw data
- Fast aggregation: computationally efficient
- Regression models of data cells at all levels
- General results cover a large and the most popular class of regression
- Including quadratic, polynomial, and nonlinear models
19A Stream Cube Architecture
- A tilted time frame
- Different time granularities
- second, minute, quarter, hour, day, week, ...
- Critical layers
- Minimum interest layer (m-layer)
- Observation layer (o-layer)
- User watches at the o-layer and occasionally needs to drill down to the m-layer
- Partial materialization of stream cubes
- Full materialization: too space- and time-consuming
- No materialization: slow response at query time
- Partial materialization: what do we mean by "partial"?
20A Tilted Time-Frame Model
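[Figure: tilted time frames. One frame is labeled "up to 7 days" and "up to a year"; a logarithmic (exponential) scale frame has slots 1t, 2t, 4t, 8t, 16t along the time axis, with "Now" marked.]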
21Benefits of Tilted Time-Frame Model
- Each cell stores the measures according to the tilted time frame
- Limited memory space: impossible to store the history at full scale
- Emphasis more on recent data
- Most applications emphasize recent data (sliding window)
- Natural partition on different time granularities
- Putting different weights on remote data
- Useful even for uniform weights
- Tilted time frame forms a new time dimension for mining changes and evolutions
- Essential for mining dynamic patterns or outliers
- Finding those with dramatic changes
- E.g., exceptional stocks: not following the trends
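The logarithmic variant of the tilted time frame can be sketched as follows. This is a minimal illustration under stated assumptions: the measure is a sum, each level keeps at most two slots, and the two oldest slots of a full level are merged into one slot of twice the span. The class name `TiltedTimeFrame` and the parameter `per_level` are hypothetical.

```python
class TiltedTimeFrame:
    """Logarithmic tilted time frame over a summed measure."""

    def __init__(self, per_level=2):
        self.per_level = per_level
        # levels[i] holds sums over 2**i base time units, newest first.
        self.levels = []

    def add(self, value):
        """Record the measure of the newest base time unit."""
        carry = value
        i = 0
        while carry is not None:
            if i == len(self.levels):
                self.levels.append([])
            slots = self.levels[i]
            slots.insert(0, carry)
            carry = None
            if len(slots) > self.per_level:
                # Merge the two oldest slots and push the result one level up.
                oldest = slots.pop()
                second_oldest = slots.pop()
                carry = oldest + second_oldest
            i += 1

    def frames(self):
        """Return (span_in_units, sum) pairs from the newest slot to the oldest."""
        return [(2 ** i, s) for i, level in enumerate(self.levels) for s in level]

    def total(self):
        return sum(s for _, s in self.frames())
```

Recent history stays at fine granularity while older history is kept in progressively wider slots, so the number of slots grows only logarithmically with the number of time units seen.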
22Two Critical Layers in the Stream Cube
- o-layer (observation): (*, theme, quarter)
- m-layer (minimal interest): (user-group, URL-group, minute)
- (primitive) stream data layer: (individual-user, URL, second)
23On-Line Materialization vs. On-Line Computation
- On-line materialization
- Materialization takes precious resources and time
- Only incremental materialization (with sliding window)
- Only materialize cuboids of the critical layers?
- Some intermediate cells should also be materialized
- Popular path approach vs. exception cell approach
- Materialize intermediate cells along the popular paths
- Exception cells: how to set up exception thresholds?
- Notice: exceptions do not have monotonic behavior
- Computation problem
- How to compute and store stream cubes efficiently?
- How to discover unusual cells between the critical layers?
24Stream Cube Structure from m-layer to o-layer
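[Figure: lattice of cuboids from the o-layer (A1, *, C1) at the top to the m-layer (A2, B2, C2) at the bottom]
(A1, *, C1)
(A1, *, C2)  (A1, B1, C1)  (A2, *, C1)
(A1, B1, C2)  (A1, B2, C1)  (A2, *, C2)  (A2, B1, C1)
(A1, B2, C2)  (A2, B1, C2)  (A2, B2, C1)
(A2, B2, C2)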
25Stream Cube Computation
- Cube structure from m-layer to o-layer
- Three approaches
- All cuboids approach
- Materializing all cells (too much in both space and time)
- Exceptional cells approach
- Materializing only exceptional cells (saves space but not computation time, and the definition of exception is not flexible)
- Popular path approach
- Computing and materializing cells only along a popular path
- Using an H-tree structure to store computed cells (which form the stream cube: a selectively materialized cube)
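The popular path approach can be sketched as a chain of roll-ups, each computed from the previous cuboid rather than from the raw stream. The example below is a minimal illustration that assumes a summed measure; the function names, the sample cells, the `URL_THEME` mapping, and the reading of "quarter" as a 15-minute period are all assumptions made for the example.

```python
from collections import defaultdict


def roll_up(cells, mapper):
    """Aggregate a cuboid into a coarser one; mapper maps a cell key to its parent key."""
    parent = defaultdict(float)
    for key, measure in cells.items():
        parent[mapper(key)] += measure
    return dict(parent)


def materialize_path(m_layer, path):
    """Materialize only the cuboids on one path, each built from the previous one."""
    cuboids = [("m-layer", m_layer)]
    for name, mapper in path:
        cuboids.append((name, roll_up(cuboids[-1][1], mapper)))
    return dict(cuboids)


# Hypothetical m-layer cells keyed by (user_group, url_group, minute) with click counts.
URL_THEME = {"nba.example": "sports", "nfl.example": "sports", "vote.example": "politics"}
m_layer = {
    ("students", "nba.example", 0): 3.0,
    ("students", "nfl.example", 1): 2.0,
    ("staff", "vote.example", 16): 4.0,
    ("staff", "nba.example", 17): 1.0,
}
path = [
    ("(user_group, theme, minute)", lambda k: (k[0], URL_THEME[k[1]], k[2])),
    ("(*, theme, minute)", lambda k: ("*", k[1], k[2])),
    ("o-layer (*, theme, quarter)", lambda k: ("*", k[1], k[2] // 15)),
]
cube = materialize_path(m_layer, path)
```

Cuboids off the chosen path are not stored; a query on one of them would have to be answered from the nearest materialized cuboid below it.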
26An H-Tree Cubing Structure
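[Figure: H-tree. Below the root, the observation layer holds the nodes sports, politics, and entertainment; beneath them are uiuc and uic nodes; the minimal interest layer holds user nodes (jeff, jim, mary), with Q.I. entries attached.]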
27Benefits of H-Tree and H-Cubing
- H-tree and H-cubing
- Developed for computing data cubes and iceberg cubes
- J. Han, J. Pei, G. Dong, and K. Wang, "Efficient Computation of Iceberg Cubes with Complex Measures", SIGMOD'01
- Compressed database
- Fast cubing
- Space preserving in cube computation
- Using H-tree for stream cubing
- Space preserving
- Intermediate aggregates can be computed incrementally and saved in tree nodes
- Facilitates computing other cells and multi-dimensional analysis
- An H-tree with computed cells can be viewed as a stream cube
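The idea of saving incrementally computed aggregates in tree nodes can be sketched with a plain prefix tree. This is a simplified illustration, not the H-tree of the cited paper: it omits header tables and side links, assumes a summed measure, and assumes attribute values are ordered from the observation layer down to the minimal interest layer. The names `HNode` and `HTree` are hypothetical.

```python
class HNode:
    """Tree node holding the aggregate of every tuple that passes through it."""

    def __init__(self):
        self.children = {}
        self.total = 0.0


class HTree:
    def __init__(self):
        self.root = HNode()

    def insert(self, path, measure):
        """Add one tuple; path lists attribute values from coarse to fine."""
        node = self.root
        node.total += measure
        for value in path:
            node = node.children.setdefault(value, HNode())
            node.total += measure

    def aggregate(self, prefix):
        """Return the stored aggregate of the cell identified by a value prefix."""
        node = self.root
        for value in prefix:
            node = node.children.get(value)
            if node is None:
                return 0.0
        return node.total
```

Each insertion touches one root-to-leaf path, so the aggregates of all cells along that path stay current without rescanning earlier tuples.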
28Time and Space vs. Number of Tuples at the
m-Layer (Dataset D3L3C10T400K)
a) Time vs. m-layer size
b) Space vs. m-layer size
29Time and Space vs. the Number of Levels
a) Time vs. levels
b) Space vs. levels
30Mining Other Dynamic Patterns in Stream Data
- Clustering and outlier analysis for stream mining
- Clustering data streams (Guha, Motwani et al. 2000-2002)
- History-sensitive, high-quality incremental clustering
- Classification of stream data
- Evolution of decision trees: Domingos et al. (2000, 2001)
- Incremental integration of new streams in decision-tree induction
- Frequent pattern analysis
- Approximate frequent patterns (Manku & Motwani, VLDB'02)
- Evolution and dramatic changes of frequent patterns
31Clustering for Mining Stream Dynamics
- Network intrusion detection: one example
- Detect bursts of activities or abrupt changes in real time by on-line clustering
- Our methodology
- Tilted time framework: otherwise, dynamic changes cannot be found
- Micro-clustering: better quality than k-means/k-median (incremental, on-line processing and maintenance)
- Two stages: micro-clustering and macro-clustering
- With limited overhead, achieves high efficiency, scalability, quality of results, and power of evolution/change detection
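The on-line micro-clustering stage can be illustrated with additive cluster features (count, linear sum, squared sum), in the style of BIRCH. The slides do not specify the assignment rule, so the sketch below assumes a simple one: a point joins the nearest micro-cluster if its centroid is within `max_dist`, and otherwise starts a new one. All names here are hypothetical.

```python
import math


class MicroCluster:
    """Additive summary of a group of points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.ls = list(point)
        self.ss = [x * x for x in point]  # kept so spread can be derived later

    def centroid(self):
        return [s / self.n for s in self.ls]

    def absorb(self, point):
        self.n += 1
        for d, x in enumerate(point):
            self.ls[d] += x
            self.ss[d] += x * x


def update(clusters, point, max_dist):
    """Assign one arriving point to the nearest micro-cluster or open a new one."""
    best, best_dist = None, math.inf
    for cluster in clusters:
        dist = math.dist(cluster.centroid(), point)
        if dist < best_dist:
            best, best_dist = cluster, dist
    if best is not None and best_dist <= max_dist:
        best.absorb(point)
    else:
        clusters.append(MicroCluster(point))
    return clusters
```

A macro-clustering stage would then run an off-line algorithm over the micro-cluster centroids instead of the raw points.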
32Classification for Dynamic Data Streams
- Decision tree induction for stream data classification
- VFDT (Very Fast Decision Tree)/CVFDT (Domingos, Hulten, Spencer, KDD'00/KDD'01)
- Is a decision tree good for modeling fast-changing data, e.g., stock market analysis?
- Methodology
- Tilted time framework
- Instead of decision trees, should we consider other models that do not change drastically, e.g., Naïve Bayesian with boosting?
- Incremental updating, dynamic maintenance, and model construction
- Comparing models to find changes
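VFDT decides when a node has seen enough examples to split by using the Hoeffding bound. The sketch below shows only that test, with hypothetical function names; `value_range` is the range of the split measure (1.0 is used here as an illustrative value) and `delta` is the allowed probability of choosing the wrong attribute.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean is within epsilon of the mean of n
    observations with probability at least 1 - delta."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the observed lead of the best attribute exceeds epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)
```

Because the bound shrinks as n grows, a node with two closely matched attributes simply waits for more examples before committing to a split.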
33Frequent Patterns for Stream Data
- Frequent pattern mining is valuable in stream applications
- e.g., network intrusion mining (Dokas et al., '02)
- Mining precise frequent patterns in stream data: unrealistic
- Even if they are stored in a compressed form, such as an FP-tree
- How to mine frequent patterns with good approximation?
- Approximate frequent patterns (Manku & Motwani, VLDB'02)
- Keep only current frequent patterns? Then no changes can be detected
- Our approach
- Keep pattern trees in the tilted time window frame (using a tree-sharing method)
- Mining evolution and dramatic changes of frequent patterns
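Approximate frequency counting in the style of Manku and Motwani's lossy counting can be sketched for single items as follows. Extending it to itemsets, as needed for frequent patterns, takes more machinery and is not shown; the function names and the return format are assumptions made for the example.

```python
import math


def lossy_count(stream, epsilon):
    """One-pass approximate counts; each count is low by at most epsilon * n."""
    width = math.ceil(1 / epsilon)  # bucket width
    counts = {}                     # item -> [frequency, max undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, drop items that cannot be frequent.
            stale = [k for k, (f, d) in counts.items() if f + d <= bucket]
            for key in stale:
                del counts[key]
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Items whose stored frequency reaches (support - epsilon) * n."""
    return {k: f for k, (f, d) in counts.items() if f >= (support - epsilon) * n}
```

The result contains every item whose true frequency is at least `support * n`, and may also include some items with frequency between `(support - epsilon) * n` and `support * n`.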
34Conclusions
- Stream data mining: a rich and largely unexplored field
- Current research focus in the database community
- DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
- Powerful tools for finding general and unusual patterns
- Effectiveness, efficiency, and scalability: lots of open problems
- Our philosophy
- A multi-dimensional stream analysis framework
- Time is a special dimension: tilted time frame
- What to compute and what to save? Critical layers
- Very partial materialization/precomputation: popular path approach
- Mining dynamics of stream data
35References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems", PODS'02 (tutorial).
- S. Babu and J. Widom, "Continuous queries over data streams", SIGMOD Record, 30:109-120, 2001.
- Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, "Online analytical processing stream data: Is it feasible?", DMKD'02.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, "Multi-dimensional regression analysis of time-series data streams", VLDB'02.
- P. Domingos and G. Hulten, "Mining high-speed data streams", KDD'00.
- M. Garofalakis, J. Gehrke, and R. Rastogi, "Querying and mining data streams: You only get one look", SIGMOD'02 (tutorial).
- J. Gehrke, F. Korn, and D. Srivastava, "On computing correlated aggregates over continuous data streams", SIGMOD'01.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams", FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams", KDD'01.
36www.cs.uiuc.edu/hanj