Title: MAIDS: Mining Alarming Incidents in Data Streams
1MAIDS: Mining Alarming Incidents in Data Streams
- A discussion on the MAIDS project
- May 20, 2003
2Outline
- Characteristics of data streams
- Mining dynamics in data streams
- Multi-dimensional analysis of data streams
- Stream query and stream cubing
- Querying, statistical summary, OLAP, regression, gradient analysis, etc.
- Mining frequent patterns in data streams
- Clustering data streams
- Classification of data streams
- Planning for implementation and testing
3Characteristics of Data Streams
- Data Streams
- Data streams: continuous, ordered, changing, fast, huge amount
- Traditional DBMS: data stored in finite, persistent data sets
- Characteristics
- Huge volumes of continuous data, possibly infinite
- Fast changing; requires fast, real-time response
- Data streams capture today's data processing needs well
- Random access is expensive: single linear scan algorithms (can only have one look)
- Store only the summary of the data seen thus far
- Most stream data are at a fairly low level or multi-dimensional in nature, and need multi-level and multi-dimensional processing
4Stream Data Applications
- Telecommunication: calling records
- Business: credit card transaction flows
- Network monitoring and traffic engineering
- Financial market: stock exchange
- Engineering and industrial processes: power supply and manufacturing
- Sensor, monitoring, and surveillance: video streams
- Security monitoring
- Web logs and Web page click streams
- Massive data sets (even when saved, random access is too expensive)
5Architecture: Stream Query Processing
[Figure: architecture diagram. Components: User/Application, SDMS (Stream Data Management System), Stream Query Processor, Scratch Space (main memory and/or disk). Input: multiple streams. Output: results.]
6Challenges of Stream Data Processing
- Multiple, continuous, rapid, time-varying, ordered streams
- Main memory computations
- Queries are often continuous
- Evaluated continuously as stream data arrives
- Answer updated over time
- Queries are often complex
- Beyond element-at-a-time processing
- Beyond stream-at-a-time processing
- Beyond relational queries (scientific, data mining, OLAP)
- Multi-level/multi-dimensional processing and data mining
- Most stream data are at a fairly low level or multi-dimensional in nature
7Processing Stream Queries
- Query types
- One-time query vs. continuous query (evaluated continuously as the stream continues to arrive)
- Predefined query vs. ad-hoc query (issued on-line)
- Unbounded memory requirements
- For real-time response, main memory algorithms should be used
- Memory requirement is unbounded if a query must join with future tuples
- Approximate query answering
- With bounded memory, it is not always possible to produce exact answers
- High-quality approximate answers are desired
- Data reduction and synopsis construction methods
- Sketches, random sampling, histograms, wavelets, etc.
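Of the synopsis methods listed above, random sampling is the simplest to illustrate. The sketch below uses reservoir sampling (Algorithm R), one standard way to keep a fixed-size uniform sample in a single pass with bounded memory. The function name `reservoir_sample` and its `seed` parameter are illustrative and not part of the MAIDS project.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Keep a uniform random sample of size k from a stream seen only once."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 values from a stream of one million integers.
    print(reservoir_sample(range(1_000_000), 5, seed=42))
```

Memory use depends only on `k`, not on the stream length, which is the property the slide asks of a synopsis.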
8Projects on DSMS (Data Stream Management System)
- Research projects and system prototypes
- STREAM (Stanford): a general-purpose DSMS
- Cougar (Cornell): sensors
- Aurora (Brown/MIT): sensor monitoring, dataflow
- Hancock (AT&T): telecom streams
- Niagara (OGI/Wisconsin): Internet XML databases
- OpenCQ (Georgia Tech): triggers, incremental view maintenance
- Tapestry (Xerox): pub/sub content-based filtering
- Telegraph (Berkeley): adaptive engine for sensors
- Tradebot (www.tradebot.com): stock tickers and streams
- Tribeca (Bellcore): network monitoring
- Streaminer and MAIDS (UIUC and NCSA): new projects for stream data mining
9Stream Data Mining vs. Stream Querying
- Stream mining: a more challenging task
- It shares most of the difficulties with stream querying
- Patterns are hidden and more general than querying
- It may require exploratory analysis
- Not necessarily continuous queries
- Stream data mining tasks
- Multi-dimensional on-line analysis of streams
- Mining outliers and unusual patterns in stream data
- Clustering data streams
- Classification of stream data
10Stream Data Mining Tasks
- Multi-dimensional (on-line) statistical analysis of streams
- Clustering data streams
- Classification of data streams
- Mining frequent patterns in data streams
- Mining sequential patterns in data streams
- Mining partial periodicity in data streams
- Mining notable gradients in data streams
- Mining outliers and unusual patterns in data streams
- More?
11Challenges for Mining Dynamics in Data Streams
- Most stream data are at a fairly low level or multi-dimensional in nature: they need multi-level/multi-dimensional (ML/MD) processing
- Analysis requirements
- Multi-dimensional trends and unusual patterns
- Capturing important changes at multiple dimensions/levels
- Fast, real-time detection and response
- Comparing with the data cube: similarities and differences
- Stream (data) cube or stream OLAP: Is this feasible?
- Can we implement it efficiently?
12Multi-Dimensional Stream Analysis: Examples
- Analysis of Web click streams
- Raw data at low levels: seconds, web page addresses, user IP addresses, etc.
- Analysts want changes, trends, and unusual patterns at reasonable levels of detail
- E.g., average clicking traffic in North America on sports in the last 15 minutes is 40% higher than that in the last 24 hours.
- Analysis of power consumption streams
- Raw data: power consumption flow for every household, every minute
- Patterns one may find: average hourly power consumption of manufacturing companies in Chicago in the last 2 hours today is up 30% compared with the same day a week ago
13A Key Step: Stream Data Reduction
- Challenges of OLAPing stream data
- Raw data cannot be stored
- Simple aggregates are not powerful enough
- History shape and patterns at different levels are desirable: multi-dimensional regression analysis
- Proposal
- A scalable multi-dimensional stream data cube that can aggregate regression models of stream data efficiently without accessing the raw data
- Stream data compression
- Compress the stream data to support memory- and time-efficient multi-dimensional regression analysis
14A Tilted Time-Frame Model
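[Figure: tilted time-frame diagrams along a time axis ending at "Now". Labels: "Up to 7 days", "Up to a year"; a logarithmic (exponential) scale with slots of width 1t, 2t, 4t, 8t, 16t.]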
15Benefits of Tilted Time-Frame Model
- Each cell stores its measures according to the tilted time frame
- Limited memory space: impossible to store the history in full scale
- More emphasis on recent data
- Most applications emphasize recent data (sliding window)
- Natural partition on different time granularities
- Putting different weights on remote data
- Useful even for uniform weight
- The tilted time frame forms a new time dimension for mining changes and evolutions
- Essential for mining dynamic patterns or outliers
- Finding those with dramatic changes
- E.g., exceptional stocks: not following the trends
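The memory argument above can be made concrete with a small sketch of a logarithmic tilted time frame. This is a simplified variant, not necessarily the structure used in MAIDS: it assumes the measure is additive (such as a count), keeps at most two slots per level, and merges the two oldest slots of a level into one slot of the next coarser level. The class name `TiltedTimeFrame` and its methods are hypothetical.

```python
class TiltedTimeFrame:
    """Logarithmic tilted time frame over an additive measure (e.g. a count)."""

    def __init__(self):
        # levels[i] holds at most two slots, newest first; each slot
        # aggregates 2**i basic time units.
        self.levels = []

    def add(self, value):
        """Register the measure of the newest basic time unit."""
        carry, level = value, 0
        while carry is not None:
            if level == len(self.levels):
                self.levels.append([])
            slots = self.levels[level]
            slots.insert(0, carry)
            carry = None
            if len(slots) > 2:
                # Merge the two oldest slots into one coarser slot.
                oldest = slots.pop()
                second_oldest = slots.pop()
                carry = oldest + second_oldest
                level += 1

    def total(self):
        return sum(sum(slots) for slots in self.levels)

    def slot_count(self):
        return sum(len(slots) for slots in self.levels)


if __name__ == "__main__":
    frame = TiltedTimeFrame()
    for _ in range(8):
        frame.add(1)
    print(frame.levels)  # [[1, 1], [2], [4]]
```

Recent time units stay at fine granularity while older history is held in progressively coarser slots, so the number of slots grows only logarithmically with the length of the history.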
16A Stream Cube Architecture
- A tilted time frame
- Different time granularities
- second, minute, quarter, hour, day, week, etc.
- Critical layers
- Minimum interest layer (m-layer)
- Observation layer (o-layer)
- The user watches at the o-layer and occasionally needs to drill down to the m-layer
- Partial materialization of stream cubes
- Full materialization: too space- and time-consuming
- No materialization: slow response at query time
- Partial materialization: what do we mean by "partial"?
17Two Critical Layers in the Stream Cube
(*, theme, quarter): o-layer (observation)
(user-group, URL-group, minute): m-layer (minimal interest)
(individual-user, URL, second): (primitive) stream data layer
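A minimal sketch of how primitive click records could be rolled up to the two critical layers, using a click count as the measure. The concept hierarchies (`USER_TO_GROUP`, `URL_TO_GROUP`, `GROUP_TO_THEME`) are invented for illustration, and "quarter" is assumed here to mean a quarter of an hour (15 minutes).

```python
from collections import Counter

# Hypothetical concept hierarchies; real ones would come from the application.
USER_TO_GROUP = {"jeff": "uiuc", "mary": "uiuc", "jim": "uic"}
URL_TO_GROUP = {"/nba/scores": "basketball", "/nfl/news": "football",
                "/senate": "congress"}
GROUP_TO_THEME = {"basketball": "sports", "football": "sports",
                  "congress": "politics"}


def to_m_layer(user, url, second):
    """Map a primitive record to its (user-group, URL-group, minute) cell."""
    return (USER_TO_GROUP[user], URL_TO_GROUP[url], second // 60)


def to_o_layer(m_cell):
    """Map an m-layer cell to its (*, theme, quarter) cell."""
    _user_group, url_group, minute = m_cell
    return ("*", GROUP_TO_THEME[url_group], minute // 15)


def build_layers(clicks):
    """Aggregate click counts at the m-layer, then roll up to the o-layer."""
    m_layer = Counter(to_m_layer(*click) for click in clicks)
    o_layer = Counter()
    for cell, count in m_layer.items():
        # The o-layer is computed from m-layer cells, not from raw records.
        o_layer[to_o_layer(cell)] += count
    return m_layer, o_layer


if __name__ == "__main__":
    clicks = [("jeff", "/nba/scores", 5), ("mary", "/nba/scores", 30),
              ("jim", "/nfl/news", 70), ("jeff", "/senate", 1000)]
    m_layer, o_layer = build_layers(clicks)
    print(dict(o_layer))
```

Because the o-layer is derived from m-layer cells alone, the primitive records can be discarded once they have been counted at the m-layer.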
18Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cells between the m-layer and the o-layer. Cell labels: (A1, *, C1), (A1, *, C2), (A2, *, C2), (A1, B1, C2), (A1, B2, C1), (A1, B2, C2), (A2, B1, C1), (A2, B1, C2), (A2, B2, C1), (A2, B2, C2).]
19An H-Tree Cubing Structure
[Figure: H-tree. The root branches into observation-layer nodes (sports, politics, entertainment), then into uiuc and uic nodes, then into minimal-interest-layer nodes (jeff, jim, mary); leaf nodes carry Q.I. entries.]
20Benefits of H-Tree and H-Cubing
- H-tree and H-cubing
- Developed for computing data cubes and iceberg cubes
- J. Han, J. Pei, G. Dong, and K. Wang, "Efficient Computation of Iceberg Cubes with Complex Measures", SIGMOD'01
- D. Xin, J. Han, X. Li, B. Wah, "Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration", VLDB'03, Berlin, Germany, Sept. 2003.
- Compressed database, fast cubing, space preserving
- Using H-tree for stream cubing
- Space preserving: intermediate aggregates can be computed incrementally and saved in tree nodes
- Facilitates computing other cells and multi-dimensional analysis
- An H-tree with computed cells can be viewed as a stream cube
21Use of Stream Cubing: Regression Analysis
- Regression modeling of cells at all dimensions and levels
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, "Multi-dimensional regression analysis of time-series data streams", VLDB'02.
- Efficient storage and scalable, fast aggregation
- Lossless aggregation without accessing the raw data
- Covers a large and the most popular class of regression
- Including quadratic, polynomial, and nonlinear models
- The methodology can be used for other statistical summaries, gradient analysis, and so on
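To illustrate what lossless aggregation without the raw data can look like, the sketch below merges simple linear regressions through their sufficient statistics. This is an assumption made for illustration: the cited VLDB'02 paper defines its own compressed representation, which may differ. The class `RegressionSummary` and the helper `summarize` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RegressionSummary:
    """Sufficient statistics for least-squares fitting of y = slope*x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other):
        """Combine two cells' summaries; no raw points are needed."""
        return RegressionSummary(self.n + other.n, self.sx + other.sx,
                                 self.sy + other.sy, self.sxx + other.sxx,
                                 self.sxy + other.sxy)

    def fit(self):
        denom = self.n * self.sxx - self.sx ** 2
        if denom == 0:
            raise ValueError("need at least two distinct x values")
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


def summarize(points):
    summary = RegressionSummary()
    for x, y in points:
        summary.add(x, y)
    return summary


if __name__ == "__main__":
    first = summarize((x, 2 * x + 1) for x in range(0, 5))
    second = summarize((x, 2 * x + 1) for x in range(5, 10))
    print(first.merge(second).fit())
```

The merged summary yields the same fit as a regression over all the points, while each cell stores only five numbers.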
22Clustering Data Streams
- Network intrusion detection: one example
- Detect bursts of activities or abrupt changes in real time by on-line clustering
- Two major methodologies
- Motwani et al. (Stanford and HP Lab)
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams", FOCS'00.
- Merging and changing k-median cluster centers
- Our approach (UIUC and IBM)
- Tilted time frame to store historical data in a compressed way
- Mining evolving data streams
23Clustering Evolving Data Streams
- Why clustering evolving data streams?
- Finding the evolution of clusters, not just the current clusters
- C. Aggarwal, J. Han, J. Wang, P. S. Yu, "A Framework for Clustering Evolving Data Streams", VLDB'03
- Methodology
- Tilted time framework: compression and mining changes
- Micro-clustering: better quality than k-means/k-median
- Incremental, online processing and maintenance
- Two stages: micro-clustering and macro-clustering
- With limited overhead, achieves high efficiency, scalability, quality of results, and power of evolution/change detection
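A rough sketch of the online micro-clustering stage, assuming each micro-cluster is summarized by an additive feature vector (count, linear sum, squared sum). It is deliberately simplified: the cited VLDB'03 framework also tracks timestamps and bounds the number of micro-clusters, which this sketch omits. The names `MicroCluster`, `micro_cluster`, and `max_distance` are hypothetical.

```python
import math


class MicroCluster:
    """Additive summary of a group of points: count, linear sum, squared sum."""

    def __init__(self, point):
        self.n = 1
        self.linear_sum = list(point)
        self.square_sum = [v * v for v in point]

    def absorb(self, point):
        self.n += 1
        for i, v in enumerate(point):
            self.linear_sum[i] += v
            self.square_sum[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def radius(self):
        """Root-mean-square deviation of the absorbed points from the centroid."""
        variance = sum(sq / self.n - (ls / self.n) ** 2
                       for ls, sq in zip(self.linear_sum, self.square_sum))
        return math.sqrt(max(variance, 0.0))


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def micro_cluster(stream, max_distance):
    """One pass: absorb each point into the nearest micro-cluster or start a new one."""
    clusters = []
    for point in stream:
        nearest = min(clusters, key=lambda c: distance(c.centroid(), point),
                      default=None)
        if nearest is not None and distance(nearest.centroid(), point) <= max_distance:
            nearest.absorb(point)
        else:
            clusters.append(MicroCluster(point))
    return clusters


if __name__ == "__main__":
    points = [(0, 0), (10, 10), (0, 2), (10, 12), (2, 0), (12, 10)]
    for c in micro_cluster(points, max_distance=3):
        print(c.n, c.centroid(), round(c.radius(), 3))
```

A macro-clustering stage could then run an ordinary clustering algorithm over the micro-cluster centroids instead of the raw points.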
24Decision Tree Induction with Stream Data
- VFDT/CVFDT
- P. Domingos and G. Hulten, "Mining high-speed data streams", KDD'00
- G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams", KDD'01
- VFDT (Very Fast Decision Tree) (Domingos and Hulten'00)
- With high probability, constructs a model identical to the one a traditional (greedy) method would learn
- If it cannot be inserted into the same branch, construct shadow branches as preparation for changes
- If the shadow becomes dominant, a switch of tree branches occurs
- CVFDT: extension to time-changing data
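The "with high probability" guarantee of VFDT rests on the Hoeffding bound, which the cited KDD'00 paper uses to decide when enough examples have been seen at a leaf to commit to a split. The sketch below shows only that test; the function names and the example numbers are illustrative, and the rest of the tree-growing procedure is omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n
    independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # With a gain range of 1.0 and delta = 1e-7, a gap of 0.05 between the
    # two best attributes is not trusted after 1,000 examples but is after 100,000.
    print(should_split(0.30, 0.25, 1.0, 1e-7, 1_000))
    print(should_split(0.30, 0.25, 1.0, 1e-7, 100_000))
```

Because epsilon shrinks as `n` grows, a leaf simply waits for more stream examples whenever the two best attributes are too close to call.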
25Single-Pass Algorithm (An Example)
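[Figure: a decision tree built over a data stream in a single pass. The first tree tests Packets > 10 (yes/no branches) and then Protocol = http. The second, grown after more data, tests Packets > 10, then Bytes > 60K and Protocol = http, and then Protocol = ftp. A split annotation reads "SP(Bytes) - SP(Packets) >", truncated in the source.]
From Gehrke's SIGMOD tutorial slides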
26Our Approaches for Stream Classification
- Why our approaches?
- Is a decision tree good for modeling fast-changing data? Changes that are too fast may make trees obsolete
- May other models have better survivability (adaptability)?
- Can we find and compare evolution behavior?
- Our methodology
- Tilted time framework (compression and evolution)
- Instead of decision trees, consider other models that do not need dramatic changes, e.g., Naïve Bayesian with boosting, K-NN?
- Incremental updating, dynamic maintenance, and model construction
- Comparing models to find changes
27Stream Classification by Naïve Bayesian
- Naïve Bayesian with boosting
- A working paper with Jiong Yang, Wei Wang and Xifeng Yan
- A stable model that registers attribute-value pairs (similar to RainForest-based classification)
- History can be recorded using the tilted time framework
- Using boosting to increase classification accuracy
- Adaptable to dramatic changes
- Incremental updating, dynamic maintenance, and fast model construction
- Comparing models to find changes
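The idea of a model that only registers attribute-value pairs can be sketched as a count-based Naïve Bayes classifier that is updated one record at a time. This is an illustration under stated assumptions, not the working paper's method: it leaves out boosting and the tilted time framework, uses Laplace smoothing, and the class name `StreamingNaiveBayes` and the sample attribute values are invented.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Naive Bayes over categorical attributes, maintained by counting."""

    def __init__(self):
        self.total = 0
        self.class_counts = defaultdict(int)
        self.pair_counts = defaultdict(int)   # (attr index, value, label) -> count
        self.values_seen = defaultdict(set)   # attr index -> distinct values

    def update(self, attributes, label):
        """Register one labeled record; constant work per attribute."""
        self.total += 1
        self.class_counts[label] += 1
        for i, value in enumerate(attributes):
            self.pair_counts[(i, value, label)] += 1
            self.values_seen[i].add(value)

    def predict(self, attributes):
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            score = math.log(count / self.total)
            for i, value in enumerate(attributes):
                seen = self.pair_counts.get((i, value, label), 0)
                distinct = len(self.values_seen.get(i, ()))
                # Laplace smoothing keeps unseen pairs from zeroing the score.
                score += math.log((seen + 1) / (count + distinct))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


if __name__ == "__main__":
    model = StreamingNaiveBayes()
    for _ in range(3):
        model.update(("http", "large"), "alarm")
        model.update(("ftp", "small"), "normal")
    print(model.predict(("http", "large")))
```

Since the model is just a table of counts, one table per tilted time slot could be kept and compared across slots to look for changes; that extension is not shown here.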
28Stream Classification by K-Nearest Neighbor
- Classification based on nearest neighbors
- C. Aggarwal (IBM), J. Han, J. Wang, P. S. Yu (IBM), "A Framework for Effective Classification of Evolving Data", a working paper
- Two kinds of stream classification tasks
- Type I: prediction of peers in the current stream
- Type II: prediction of the behavior in the next window
- Registration of basic statistics (clustering features) using a B-tree (BIRCH-like) structure
- Different philosophies for training and model construction for Type I and Type II tasks
29Mining Frequent Patterns for Stream Data
- Frequent pattern mining is valuable in stream applications
- E.g., network intrusion mining (Dokas et al.'02)
- Mining precise frequent patterns in stream data: unrealistic
- Even if they are stored in a compressed form, such as an FP-tree
- How to mine frequent patterns with good approximation?
- Approximate frequent patterns (Manku and Motwani, VLDB'02)
- Major idea: do not trace an item until it first becomes frequent
- Advantage: guaranteed error bound
- Disadvantage: keeps a large set of traces
- Our comments
- Keep only current frequent patterns? No changes can be detected
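The approximate counting approach cited above can be sketched for single items with lossy counting, following the description in the Manku and Motwani paper: the stream is cut into buckets of width ceil(1/epsilon), and entries whose count plus maximum possible undercount falls to the current bucket number are pruned at each bucket boundary. The class and method names are hypothetical, and extending this to itemsets is not shown.

```python
import math


class LossyCounter:
    """Approximate item counts; undercounts by at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # items seen so far
        self.entries = {}                     # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been missed in all earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent yet.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


if __name__ == "__main__":
    counter = LossyCounter(epsilon=0.01)
    for i in range(1000):
        if i % 2 == 0:
            counter.add("a")
        elif i % 10 in (1, 3, 5):
            counter.add("b")
        else:
            counter.add(f"rare{i}")
    print(counter.frequent(support=0.2))
```

This shows both sides of the trade-off on the slide: the error is bounded by epsilon times the stream length, but every item that might still become frequent has to be tracked until the next pruning step.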
30Our Approach on Frequent Stream Patterns
- Approach 1: mining with multiple time granularities
- C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu, "Mining Frequent Patterns in Data Streams at Multiple Time Granularities", Next Gen. Data Mining, MIT Press, 2003
- Keep pattern trees in the tilted time window frame (using a tree-sharing method)
- Mining evolution and dramatic changes of frequent patterns
- Approach 2: mining only itemsets of interest
- Identify items of interest in the stream environment
- Keep precise/compressed history in the tilted time window
- Mining using FP-tree and related fast mining methods
31A Discussion of Our Work Plan
- System architecture design
- Need preprocessing of stream data flow
- Two working modes
- Disk-based and true stream processing
- Main memory algorithms
- Cube and tilted time frame structure
- Test data sets
- Network flow (multi-dimensional data)
- Web click stream analysis
- Stock trading data?
- Multiple stream weather data?
32Conclusions
- Stream data mining: a rich and largely unexplored field
- Current research focus in the database community
- DSMS system architecture, continuous query processing, supporting mechanisms
- Stream data mining and stream OLAP analysis
- Powerful tools for finding general and unusual patterns
- Effectiveness, efficiency and scalability: lots of open problems
- Our philosophy
- A multi-dimensional stream analysis framework
- Time is a special dimension: tilted time frame
- What to compute and what to save? Critical layers
- Very partial materialization/precomputation: popular path approach
- Mining dynamics of stream data
33References
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, "Models and issues in data stream systems", PODS'02 (tutorial).
- S. Babu and J. Widom, "Continuous queries over data streams", SIGMOD Record, 30:109-120, 2001.
- Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and J. Wang, "Online analytical processing stream data: Is it feasible?", DMKD'02.
- Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang, "Multi-dimensional regression analysis of time-series data streams", VLDB'02.
- P. Domingos and G. Hulten, "Mining high-speed data streams", KDD'00.
- M. Garofalakis, J. Gehrke, and R. Rastogi, "Querying and mining data streams: You only get one look", SIGMOD'02 (tutorial).
- J. Gehrke, F. Korn, and D. Srivastava, "On computing correlated aggregates over continuous data streams", SIGMOD'01.
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan, "Clustering data streams", FOCS'00.
- G. Hulten, L. Spencer, and P. Domingos, "Mining time-changing data streams", KDD'01.
34www.cs.uiuc.edu/hanj