MAIDS: Mining Alarming Incidents in Data Streams

1
MAIDS: Mining Alarming Incidents in Data Streams

A discussion on the MAIDS project
May 20, 2003

2
Outline

Characteristics of data streams
Mining dynamics in data streams
Multi-dimensional analysis of data streams
Stream query and stream cubing
Querying, statistical summary, OLAP, regression,
gradient analysis, ...
Mining frequent patterns in data streams
Clustering data streams
Classification of data streams
Planning for implementation and testing

3
Characteristics of Data Streams

Data Streams
Data streams: continuous, ordered, changing, fast,
huge amount
Traditional DBMS: data stored in finite,
persistent data sets
Characteristics
Huge volumes of continuous data, possibly
infinite
Fast changing and requires fast, real-time
response
Data streams capture today's data processing
needs well
Random access is expensive: single linear scan
algorithms (can only have one look)
Store only the summary of the data seen thus far
Most stream data are low-level or
multi-dimensional in nature and need multi-level
and multi-dimensional processing
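
To make "single linear scan" and "store only the summary" concrete, here is a minimal sketch that keeps a constant-size summary (count, mean, variance) of a numeric stream using Welford's update. The class name `RunningSummary` and the sample values are assumptions for illustration; the slides do not prescribe a particular summary.

```python
class RunningSummary:
    """Constant-size summary (count, mean, variance) of a numeric stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        # Welford's update: each element is looked at exactly once
        # and is not stored afterwards.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance of everything seen so far.
        return self._m2 / self.n if self.n else 0.0


summary = RunningSummary()
for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    summary.update(value)
```

Memory use stays fixed no matter how long the stream runs, which is the property the bullets above ask for.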

4
Stream Data Applications

Telecommunication: calling records
Business: credit card transaction flows
Network monitoring and traffic engineering
Financial market: stock exchange
Engineering and industrial processes: power
supply and manufacturing
Sensor, monitoring, and surveillance: video streams
Security monitoring
Web logs and Web page click streams
Massive data sets (even when saved, random access
is too expensive)

5
Architecture: Stream Query Processing
[Figure: architecture diagram with the labels User/Application, Results, Multiple streams, SDMS (Stream Data Management System), Stream Query Processor, and Scratch Space (main memory and/or disk).]
6
Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying,
ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data
mining, OLAP)
Multi-level/multi-dimensional processing and data
mining
Most stream data are low-level or
multi-dimensional in nature

7
Processing Stream Queries

Query types
One-time query vs. continuous query (being
evaluated continuously as stream continues to
arrive)
Predefined query vs. ad-hoc query (issued
on-line)
Unbounded memory requirements
For real-time response, main-memory algorithms
should be used
Memory requirements are unbounded if a query
must join against future tuples
Approximate query answering
With bounded memory, it is not always possible to
produce exact answers
High-quality approximate answers are desired
Data reduction and synopsis construction methods
Sketches, random sampling, histograms, wavelets,
etc.
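
Of the synopsis methods listed above, random sampling is the easiest to show in a few lines. The sketch below uses reservoir sampling (Algorithm R) to keep a uniform sample of fixed size from a stream of unknown length; the function name and parameters are illustrative assumptions, not part of any system named in these slides.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Return k items drawn uniformly at random from a stream of unknown length."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # keep the new item with probability k / (i + 1)
    return sample


sample = reservoir_sample(range(10_000), k=5, rng=random.Random(0))
```

Only `k` items are held at any time, so the memory bound is independent of the stream length; approximate answers are then computed on the sample.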

8
Projects on DSMS (Data Stream Management System)

Research projects and system prototypes
STREAM (Stanford): a general-purpose DSMS
Cougar (Cornell): sensors
Aurora (Brown/MIT): sensor monitoring, dataflow
Hancock (AT&T): telecom streams
Niagara (OGI/Wisconsin): Internet XML databases
OpenCQ (Georgia Tech): triggers, incr. view
maintenance
Tapestry (Xerox): pub/sub content-based filtering
Telegraph (Berkeley): adaptive engine for sensors
Tradebot (www.tradebot.com): stock tickers,
streams
Tribeca (Bellcore): network monitoring
Streaminer and MAIDS (UIUC, NCSA): new projects
for stream data mining

9
Stream Data Mining vs. Stream Querying

Stream mining: a more challenging task
It shares most of the difficulties with stream
querying
Patterns are hidden and more general than
query results
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Mining outliers and unusual patterns in stream
data
Clustering data streams
Classification of stream data

10
Stream Data Mining Tasks

Multi-dimensional (on-line) statistical analysis
of streams
Clustering data streams
Classification of data streams
Mining frequent patterns in data streams
Mining sequential patterns in data streams
Mining partial periodicity in data streams
Mining notable gradients in data streams
Mining outliers and unusual patterns in data
streams
..., more?

11
Challenges for Mining Dynamics in Data Streams

Most stream data are low-level or
multi-dimensional in nature: they need
multi-level/multi-dimensional (ML/MD) processing
Analysis requirements
Multi-dimensional trends and unusual patterns
Capturing important changes at multiple
dimensions/levels
Fast, real-time detection and response
Comparison with data cubes: similarities and
differences
Stream (data) cube or stream OLAP: is this
feasible?
Can we implement it efficiently?

12
Multi-Dimensional Stream Analysis Examples

Analysis of Web click streams
Raw data at low levels: seconds, web page
addresses, user IP addresses, ...
Analysts want changes, trends, and unusual
patterns at reasonable levels of detail
E.g., average click traffic in North America
on sports in the last 15 minutes is 40% higher
than that in the last 24 hours.
Analysis of power consumption streams
Raw data: power consumption flow for every
household, every minute
Patterns one may find: average hourly power
consumption of manufacturing companies in
Chicago over the last 2 hours today is up 30%
compared with the same day a week ago
13
A Key Step: Stream Data Reduction

Challenges of OLAPing stream data
Raw data cannot be stored
Simple aggregates are not powerful enough
History shape and patterns at different levels
are desirable: multi-dimensional regression
analysis
Proposal
A scalable multi-dimensional stream data cube
that can aggregate regression models of stream
data efficiently without accessing the raw data
Stream data compression
Compress the stream data to support memory- and
time-efficient multi-dimensional regression
analysis

14
A Tilted Time-Frame Model
[Figure: tilted time-frame models along a time axis ending at Now. Labels: "Up to 7 days", "Up to a year", and a logarithmic (exponential) scale with windows 1t, 2t, 4t, 8t, 16t.]
15
Benefits of Tilted Time-Frame Model

Each cell stores the measures according to the
tilted time frame
Limited memory space: impossible to store the
history in full scale
More emphasis on recent data
Most applications emphasize recent data (sliding
window)
Natural partition on different time granularities
Putting different weights on remote data
Useful even for uniform weight
Tilted time-frame forms a new time dimension
for mining changes and evolutions
Essential for mining dynamic patterns or outliers
Finding those with dramatic changes
E.g., exceptional stocks: not following the trends
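
One way to picture the logarithmic tilted time frame is as a stack of levels in which each entry at level i summarizes 2**i base time units: recent history stays fine-grained while older history is merged into ever coarser windows. The following is a minimal sketch under that assumption; the class `TiltedTimeFrame`, its `per_level` parameter, and the use of a plain sum as the measure are illustrative choices, not the MAIDS implementation.

```python
class TiltedTimeFrame:
    """Logarithmic tilted time frame holding one additive measure per window."""

    def __init__(self, per_level=2):
        self.per_level = per_level
        # levels[i] is newest-first; each entry covers 2**i base time units.
        self.levels = []

    def add(self, value):
        """Record the measure for the newest base time unit."""
        carry, i = value, 0
        while carry is not None:
            if i == len(self.levels):
                self.levels.append([])
            level = self.levels[i]
            level.insert(0, carry)
            carry = None
            if len(level) > self.per_level:
                # Merge the two oldest windows and push them one level up.
                oldest = level.pop()
                second_oldest = level.pop()
                carry = second_oldest + oldest
            i += 1

    def total(self):
        return sum(sum(level) for level in self.levels)

    def slots(self):
        return sum(len(level) for level in self.levels)


ttf = TiltedTimeFrame()
for _ in range(1024):
    ttf.add(1)
```

After 1024 base units the structure still accounts for every unit in its total, yet it holds only a logarithmic number of windows, which is the memory argument made in the bullets above.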

16
A Stream Cube Architecture

A tilted time frame
Different time granularities
second, minute, quarter, hour, day, week, ...
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User watches at the o-layer and occasionally
needs to drill down to the m-layer
Partial materialization of stream cubes
Full materialization: too space- and
time-consuming
No materialization: slow response at query time
Partial materialization: what do we mean by
"partial"?

17
Two Critical Layers in the Stream Cube
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
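
The step from the primitive stream layer up to the m-layer is a roll-up along concept hierarchies. Below is a minimal sketch of that aggregation for a count measure; the mapping tables, URLs, and group names are made up for illustration, and the same step would be repeated with coarser hierarchies to reach the o-layer.

```python
from collections import Counter

# Hypothetical concept hierarchies (not taken from the slides).
USER_GROUP = {"jeff": "uiuc", "mary": "uiuc", "jim": "uic"}
URL_GROUP = {"/nba/scores": "sports", "/election": "politics"}


def roll_up(primitive_cells):
    """Aggregate (user, url, second) counts into (user_group, url_group, minute) counts."""
    m_cells = Counter()
    for (user, url, second), count in primitive_cells.items():
        key = (USER_GROUP[user], URL_GROUP[url], second // 60)
        m_cells[key] += count
    return m_cells


primitive = {
    ("jeff", "/nba/scores", 5): 2,
    ("mary", "/nba/scores", 42): 1,
    ("jim", "/election", 61): 4,
}
m_layer = roll_up(primitive)
```

Only the m-layer cells need to be retained once a minute has closed, so the primitive records can be discarded as they are aggregated.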
18
Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cuboids between the m-layer and the o-layer. Cells shown: (A1, *, C1), (A1, *, C2), (A2, *, C2), (A1, B1, C2), (A1, B2, C1), (A2, B1, C1), (A1, B2, C2), (A2, B1, C2), (A2, B2, C1), (A2, B2, C2).]
19
An H-Tree Cubing Structure
[Figure: an H-tree. Labels: root; observation layer with nodes sports, politics, entertainment; intermediate nodes uiuc, uic; minimal int. layer with nodes jeff, jim, mary, each carrying Q.I.]
20
Benefits of H-Tree and H-Cubing

H-tree and H-cubing
Developed for computing data cubes and iceberg
cubes
J. Han, J. Pei, G. Dong, and K. Wang, "Efficient
Computation of Iceberg Cubes with Complex
Measures", SIGMOD'01
D. Xin, J. Han, X. Li, B. Wah, "Star-Cubing:
Computing Iceberg Cubes by Top-Down and Bottom-Up
Integration", VLDB'03, Berlin, Germany, Sept.
2003.
Compressed database, fast cubing, space
preserving
Using H-tree for stream cubing
Space preserving: intermediate aggregates can be
computed incrementally and saved in tree nodes
Facilitate computing other cells and
multi-dimensional analysis
H-tree with computed cells can be viewed as
stream cube
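
The idea of saving intermediate aggregates in tree nodes can be sketched with a plain prefix tree over dimension values. This is a simplified stand-in, not the H-tree of the cited papers: it omits the header tables and side links, and the names `HNode`, `HTree`, and `prefix_count` are assumptions made for this example.

```python
class HNode:
    def __init__(self):
        self.count = 0      # aggregate for the prefix that ends at this node
        self.children = {}  # dimension value -> child node


class HTree:
    """Prefix tree whose inner nodes keep incrementally maintained aggregates."""

    def __init__(self):
        self.root = HNode()

    def insert(self, values, count=1):
        # Update every node on the path, so each prefix cell stays current.
        node = self.root
        node.count += count
        for value in values:
            node = node.children.setdefault(value, HNode())
            node.count += count

    def prefix_count(self, prefix):
        node = self.root
        for value in prefix:
            node = node.children.get(value)
            if node is None:
                return 0
        return node.count


tree = HTree()
for row in [("sports", "uiuc", "jeff"), ("sports", "uiuc", "mary"),
            ("sports", "uic", "jim"), ("politics", "uiuc", "jeff")]:
    tree.insert(row)
```

Tuples that share a prefix share nodes, which is where the compression comes from, and a cell such as ("sports", "uiuc") can be answered from a single node without rescanning the data.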

21
Use of Stream Cubing: Regression Analysis

Regression modeling of cells at all dimensions
and levels
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
"Multi-dimensional regression analysis of
time-series data streams", VLDB'02.
Efficient storage and scalable and fast
aggregation
Lossless aggregation without accessing the raw
data
Covers a large and the most popular class of
regression models,
including quadratic, polynomial, and nonlinear
models
Methodology can be used for other statistical
summary, gradient analysis, and so on
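
To see why regression can be aggregated without going back to the raw data, consider simple linear regression: the fit depends only on a handful of sums, and those sums add across time segments. The sketch below uses these textbook sufficient statistics; it is an illustration of the principle and not necessarily the compressed representation used in the cited VLDB'02 paper. The class `RegStats` and its fields are assumed names.

```python
from dataclasses import dataclass


@dataclass
class RegStats:
    """Sufficient statistics for least-squares fitting of y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other):
        # Aggregating two segments is just component-wise addition.
        return RegStats(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                        self.sxx + other.sxx, self.sxy + other.sxy)

    def fit(self):
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


first, second, full = RegStats(), RegStats(), RegStats()
for t in range(10):
    y = 2.0 * t + 1.0
    (first if t < 5 else second).add(t, y)
    full.add(t, y)
merged = first.merge(second)
```

Here `merged.fit()` gives the same line as fitting all ten points directly, although `merged` was built from two five-number summaries alone.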

22
Clustering Data Streams

Network intrusion detection: one example
Detect bursts of activities or abrupt changes in
real time by on-line clustering
Two major methodologies
Motwani et al. (Stanford and HP Lab)
S. Guha, N. Mishra, R. Motwani, and L.
O'Callaghan, "Clustering data streams", FOCS'00.
Merging and changing k-median cluster centers
Our approach (UIUC and IBM)
Tilted time frame to store historical data in
compressed way
Mining evolving data streams

23
Clustering Evolving Data Streams

Why clustering evolving data streams?
Finding evolutions of clusters, not just current
clusters
C. Aggarwal, J. Han, J. Wang, P. S. Yu, "A
Framework for Clustering Evolving Data Streams",
VLDB'03
Methodology
Tilted time framework: compression and mining
changes
Micro-clustering: better quality than
k-means/k-median
Incremental, online processing and maintenance
Two stages: micro-clustering and macro-clustering
Achieves high efficiency, scalability, quality
of results, and power of evolution/change
detection with limited overhead
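
Micro-clustering rests on summaries that can be updated and merged incrementally. A minimal sketch, assuming a BIRCH-style cluster feature of (N, linear sum, squared sum) per dimension: the cited framework also tracks timestamp statistics, which are omitted here, and the class and method names are illustrative.

```python
import math


class MicroCluster:
    """Additive cluster feature: point count, per-dimension sum and sum of squares."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim  # linear sum per dimension
        self.ss = [0.0] * dim  # squared sum per dimension

    def absorb(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        # Two micro-clusters combine by adding their statistics.
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        # Root-mean-square distance of the absorbed points from the centroid.
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


a = MicroCluster(2)
a.absorb((0.0, 0.0))
a.absorb((2.0, 0.0))
b = MicroCluster(2)
b.absorb((4.0, 0.0))
a.merge(b)
```

Because the statistics are additive, snapshots of micro-clusters can be stored per time window and combined or compared later, which is what the macro-clustering stage builds on.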

24
Decision Tree Induction with Stream Data

VFDT/CVFDT
P. Domingos and G. Hulten, "Mining high-speed
data streams", KDD'00
G. Hulten, L. Spencer, and P. Domingos, "Mining
time-changing data streams", KDD'01
VFDT (Very Fast Decision Tree) (Domingos and
Hulten '00)
With high probability, constructs a model
identical to the one a traditional (greedy)
method would learn
CVFDT: extension to time-changing data
If new data no longer fits an existing branch,
shadow (alternate) branches are constructed in
preparation for changes
If a shadow branch becomes dominant, the tree
switches to it
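
The "with high probability" guarantee of VFDT comes from the Hoeffding bound: after n observations of a quantity with range R, the observed mean is within epsilon = sqrt(R^2 ln(1/delta) / (2n)) of the true mean with probability 1 - delta. A leaf is split once the gain of the best attribute exceeds that of the runner-up by more than epsilon. The sketch below shows only this test; the function names and the sample gains are assumptions for illustration.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that the true mean lies within epsilon of the observed mean
    with probability 1 - delta after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """True when the best attribute beats the runner-up by more than epsilon."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


eps_1k = hoeffding_bound(1.0, 1e-7, 1000)
eps_4k = hoeffding_bound(1.0, 1e-7, 4000)
decide_clear = can_split(0.30, 0.15, 1.0, 1e-7, 1000)
decide_close = can_split(0.30, 0.25, 1.0, 1e-7, 1000)
```

Quadrupling the number of examples halves epsilon, so a leaf with two closely matched attributes simply waits for more data before committing to a split.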

25
Single-Pass Algorithm (An Example)
[Figure: a decision tree grown over a data stream in two steps. Node labels: Packets > 10 (yes/no), Protocol = http, SP(Bytes) - SP(Packets) >, Bytes > 60K (yes), Protocol = ftp.]
From Gehrke's SIGMOD tutorial slides
26
Our Approaches for Stream Classification

Why our approaches?
Are decision trees good for modeling fast-changing
data? Changes that are too fast may make trees
obsolete
Might other models have better survivability
(adaptability)?
Can we find and compare evolution behavior?
Our methodology
Tilted time framework (compression and evolution)
Instead of decision-trees, consider other models
that do not need dramatic changes, e.g., Naïve
Bayesian with boosting, K-NN?
Incremental updating, dynamic maintenance, and
model construction
Comparing models to find changes

27
Stream Classification by Naïve Bayesian

Naïve Bayesian with boosting
A working paper with Jiong Yang, Wei Wang and
Xifeng Yan
A stable model that registers attribute-value
pairs (similar to RainForest-based classification)
History can be recorded using tilted time
framework
Using boosting to increase classification
accuracy
Adaptable to dramatic changes
Incremental updating, dynamic maintenance, and
fast model construction
Comparing models to find changes
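
A Naïve Bayesian model suits streams because it only needs counts of attribute-value pairs per class, and counts can be updated one record at a time. The following is a minimal sketch of that incremental core with Laplace smoothing; the boosting step and the tilted-time history mentioned above are left out, and the class name, feature values, and labels are invented for the example.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Naive Bayes over categorical attributes, updated one record at a time."""

    def __init__(self):
        self.class_counts = defaultdict(int)
        self.pair_counts = defaultdict(int)  # (label, attr index, value) -> count
        self.values = defaultdict(set)       # attr index -> distinct values seen

    def update(self, features, label):
        self.class_counts[label] += 1
        for i, value in enumerate(features):
            self.pair_counts[(label, i, value)] += 1
            self.values[i].add(value)

    def predict(self, features):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, class_count in self.class_counts.items():
            score = math.log(class_count / total)
            for i, value in enumerate(features):
                distinct = len(self.values[i]) or 1
                pair = self.pair_counts.get((label, i, value), 0)
                # Laplace smoothing keeps unseen pairs from zeroing the score.
                score += math.log((pair + 1) / (class_count + distinct))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = StreamingNaiveBayes()
for _ in range(3):
    model.update(("http", "large"), "attack")
    model.update(("ftp", "small"), "normal")
```

Since the whole model is a table of counts, two models built over different time windows can be compared count by count, which is one way to look for changes.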

28
Stream Classification by K-Nearest Neighbor

Classification based on nearest neighbors
C. Aggarwal (IBM), J. Han, J. Wang, P. S. Yu
(IBM), "A Framework for Effective Classification
of Evolving Data", a working paper
Two kinds of stream classification tasks
Type I: prediction of peers in the current stream
Type II: prediction of the behavior in the next
window
Registration of basic statistics (clustering
features) using B-tree (Birch-like) structure
Different philosophies for training and model
construction for type I and II tasks

29
Mining Frequent Patterns for Stream Data

Frequent pattern mining is valuable in stream
applications
e.g., network intrusion mining (Dokas et al. '02)
Mining precise freq. patterns in stream data:
unrealistic
Even if they are stored in a compressed form,
such as an FP-tree
How to mine frequent patterns with good
approximation?
Approximate frequent patterns (Manku & Motwani,
VLDB'02)
Major idea: do not trace an item until it first
becomes frequent
Advantage: guaranteed error bound
Disadvantage: must keep a large set of traces
Our comments
Keep only current frequent patterns? Then no
changes can be detected
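
The approximate counting approach cited above can be illustrated with lossy counting for single items: the stream is cut into buckets of width ceil(1/epsilon), each tracked item carries a count plus the maximum amount it may have been undercounted, and low-count entries are pruned at every bucket boundary. This is a sketch of the item-level algorithm only (the paper extends it to itemsets); the function names and the toy stream are assumptions.

```python
import math


def lossy_counting(stream, epsilon):
    """Return ({item: [count, max_undercount]}, n) after one pass over the stream."""
    width = math.ceil(1.0 / epsilon)
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)
        if item in entries:
            entries[item][0] += 1
        else:
            # A new entry may have been missed in all earlier buckets.
            entries[item] = [1, bucket - 1]
        if n % width == 0:
            # Bucket boundary: drop entries that cannot be frequent so far.
            stale = [k for k, (f, d) in entries.items() if f + d <= bucket]
            for key in stale:
                del entries[key]
    return entries, n


def frequent_items(entries, n, support, epsilon):
    """Items whose true frequency may reach support * n."""
    return {k for k, (f, _) in entries.items() if f >= (support - epsilon) * n}


stream = ["a", "b", "a", "c", "a", "d", "a", "e"] * 5
entries, n = lossy_counting(stream, epsilon=0.25)
result = frequent_items(entries, n, support=0.4, epsilon=0.25)
```

Counts are never overestimated and are underestimated by at most epsilon * n, which is the error guarantee listed as the advantage; the table of traces it keeps is the cost listed as the disadvantage.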

30
Our Approach on Frequent Stream Patterns

Approach 1: Mining with Multiple Time
Granularities
C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu,
"Mining Frequent Patterns in Data Streams at
Multiple Time Granularities", Next Gen. Data
Mining, MIT Press, 2003
Keep pattern-trees at the tilted time window
frame (using tree-sharing method)
Mining evolution and dramatic changes of frequent
patterns
Approach 2: Mining only itemsets of interest
Identify items of interest in the stream
environment
Keep precise/compressed history in tilted time
window
Mining using FP-tree and related fast mining
methods

31
A Discussion of Our Work Plan

System architecture design
Need preprocessing of stream data flow
Two working modes
Disk-based and true stream processing
Main memory algorithms
cube and tilted time frame structure
Test data sets
Network flow (multi-dimensional data)
Web click stream analysis
Stock trading data?
Multiple stream weather data?

32
Conclusions

Stream data mining: a rich and largely unexplored
field
Current research focus in database community
DSMS system architecture, continuous query
processing, supporting mechanisms
Stream data mining and stream OLAP analysis
Powerful tools for finding general and unusual
patterns
Effectiveness, efficiency and scalability: lots
of open problems
Our philosophy
A multi-dimensional stream analysis framework
Time is a special dimension: tilted time frame
What to compute and what to save? Critical layers
Very partial materialization/precomputation:
popular path approach
Mining dynamics of stream data

33
References

B. Babcock, S. Babu, M. Datar, R. Motwani, and
J. Widom, "Models and issues in data stream
systems", PODS'02 (tutorial).
S. Babu and J. Widom, "Continuous queries over
data streams", SIGMOD Record, 30(3):109-120, 2001.
Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and
J. Wang, "Online analytical processing stream
data: Is it feasible?", DMKD'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
"Multi-dimensional regression analysis of
time-series data streams", VLDB'02.
P. Domingos and G. Hulten, "Mining high-speed
data streams", KDD'00.
M. Garofalakis, J. Gehrke, and R. Rastogi,
"Querying and mining data streams: You only get
one look", SIGMOD'02 (tutorial).
J. Gehrke, F. Korn, and D. Srivastava, "On
computing correlated aggregates over continuous
data streams", SIGMOD'01.
S. Guha, N. Mishra, R. Motwani, and L.
O'Callaghan, "Clustering data streams", FOCS'00.
G. Hulten, L. Spencer, and P. Domingos, "Mining
time-changing data streams", KDD'01.