MAIDS: Mining Alarming Incidents in Data Streams

Provided by: jiaw186

1
MAIDS: Mining Alarming Incidents in Data Streams
  • A discussion on the MAIDS project
  • May 20, 2003

2
Outline
  • Characteristics of data streams
  • Mining dynamics in data streams
  • Multi-dimensional analysis of data streams
  • Stream query and stream cubing
  • Querying, statistical summary, OLAP, regression,
    gradient analysis,
  • Mining frequent patterns in data streams
  • Clustering data streams
  • Classification of data streams
  • Planning for implementation and testing

3
Characteristics of Data Streams
  • Data Streams
  • Data streams: continuous, ordered, changing, fast,
    huge amount
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • Characteristics
  • Huge volumes of continuous data, possibly
    infinite
  • Fast changing and requires fast, real-time
    response
  • The data stream model nicely captures today's
    data processing needs
  • Random access is expensive: single linear scan
    algorithm (can only have one look)
  • Store only the summary of the data seen thus far
  • Most stream data are at a pretty low level or
    multi-dimensional in nature, and need multi-level
    and multi-dimensional processing
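
The "store only the summary" point can be illustrated with a one-pass running mean and variance. This is a minimal sketch, not code from the MAIDS project: the class name `RunningStats` is hypothetical, and the update rule is Welford's online algorithm, picked here as one common single-scan summary.

```python
class RunningStats:
    """One-pass summary of a numeric stream (Welford's online algorithm)."""

    def __init__(self):
        self.n = 0        # number of values seen so far
        self.mean = 0.0   # running mean
        self._m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        """Fold one new value into the summary; the value itself is not stored."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Population variance of the values seen so far."""
        return self._m2 / self.n if self.n else 0.0


stats = RunningStats()
for value in [2, 4, 4, 4, 5, 5, 7, 9]:
    stats.update(value)
```

Each element is looked at once and then discarded, so memory stays constant no matter how long the stream runs.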

4
Stream Data Applications
  • Telecommunication: calling records
  • Business: credit card transaction flows
  • Network monitoring and traffic engineering
  • Financial market: stock exchange
  • Engineering and industrial processes: power supply
    and manufacturing
  • Sensor, monitoring and surveillance: video streams
  • Security monitoring
  • Web logs and Web page click streams
  • Massive data sets (even when saved, random access
    is too expensive)

5
Architecture: Stream Query Processing
[Figure: architecture diagram with components User/Application, SDMS (Stream Data Management System), Stream Query Processor, and Scratch Space (main memory and/or disk); input: multiple streams; output: results.]
6
Challenges of Stream Data Processing
  • Multiple, continuous, rapid, time-varying,
    ordered streams
  • Main memory computations
  • Queries are often continuous
  • Evaluated continuously as stream data arrives
  • Answer updated over time
  • Queries are often complex
  • Beyond element-at-a-time processing
  • Beyond stream-at-a-time processing
  • Beyond relational queries (scientific, data
    mining, OLAP)
  • Multi-level/multi-dimensional processing and data
    mining
  • Most stream data are at pretty low-level or
    multi-dimensional in nature

7
Processing Stream Queries
  • Query types
  • One-time query vs. continuous query (being
    evaluated continuously as stream continues to
    arrive)
  • Predefined query vs. ad-hoc query (issued
    on-line)
  • Unbounded memory requirements
  • For real-time response, main memory algorithm
    should be used
  • Memory requirement is unbounded if one will join
    future tuples
  • Approximate query answering
  • With bounded memory, it is not always possible to
    produce exact answers
  • High-quality approximate answers are desired
  • Data reduction and synopsis construction methods
  • Sketches, random sampling, histograms, wavelets,
    etc.
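
Of the synopsis methods listed above, random sampling is the easiest to sketch. The example below is an illustration under stated assumptions rather than anything from the slides: it uses classic reservoir sampling, and the function name `reservoir_sample` and its `rng` parameter are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Uses O(k) memory and a single pass over the stream.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


sample = reservoir_sample(range(1_000_000), k=5, rng=random.Random(42))
```

An approximate answer (for example, an estimated average) can then be computed from the sample instead of from the full stream.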

8
Projects on DSMS (Data Stream Management System)
  • Research projects and system prototypes
  • STREAM (Stanford): a general-purpose DSMS
  • Cougar (Cornell): sensors
  • Aurora (Brown/MIT): sensor monitoring, dataflow
  • Hancock (AT&T): telecom streams
  • Niagara (OGI/Wisconsin): Internet XML databases
  • OpenCQ (Georgia Tech): triggers, incr. view
    maintenance
  • Tapestry (Xerox): pub/sub content-based filtering
  • Telegraph (Berkeley): adaptive engine for sensors
  • Tradebot (www.tradebot.com): stock tickers and
    streams
  • Tribeca (Bellcore): network monitoring
  • Streaminer and MAIDS (UIUC and NCSA): new projects
    for stream data mining

9
Stream Data Mining vs. Stream Querying
  • Stream mining: a more challenging task
  • It shares most of the difficulties with stream
    querying
  • Patterns are hidden and more general than
    querying
  • It may require exploratory analysis
  • Not necessarily continuous queries
  • Stream data mining tasks
  • Multi-dimensional on-line analysis of streams
  • Mining outliers and unusual patterns in stream
    data
  • Clustering data streams
  • Classification of stream data

10
Stream Data Mining Tasks
  • Multi-dimensional (on-line) statistical analysis
    of streams
  • Clustering data streams
  • Classification of data streams
  • Mining frequent patterns in data streams
  • Mining sequential patterns in data streams
  • Mining partial periodicity in data streams
  • Mining notable gradients in data streams
  • Mining outliers and unusual patterns in data
    streams
  • ..., more?

11
Challenges for Mining Dynamics in Data Streams
  • Most stream data are at a pretty low level or
    multi-dimensional in nature: need
    multi-level/multi-dimensional (ML/MD) processing
  • Analysis requirements
  • Multi-dimensional trends and unusual patterns
  • Capturing important changes at multiple
    dimensions/levels
  • Fast, real-time detection and response
  • Comparing with data cube: similarities and
    differences
  • Stream (data) cube or stream OLAP: is this
    feasible?
  • Can we implement it efficiently?

12
Multi-Dimensional Stream Analysis Examples
  • Analysis of Web click streams
  • Raw data at low levels: seconds, web page
    addresses, user IP addresses, ...
  • Analysts want changes, trends, and unusual
    patterns at reasonable levels of detail
  • E.g., average clicking traffic in North America
    on sports in the last 15 minutes is 40% higher
    than that in the last 24 hours
  • Analysis of power consumption streams
  • Raw data: power consumption flow for every
    household, every minute
  • Patterns one may find: average hourly power
    consumption for manufacturing companies in
    Chicago over the last 2 hours today is up 30%
    compared with the same day a week ago

13
A Key Step: Stream Data Reduction
  • Challenges of OLAPing stream data
  • Raw data cannot be stored
  • Simple aggregates are not powerful enough
  • History shape and patterns at different levels
    are desirable: multi-dimensional regression
    analysis
  • Proposal
  • A scalable multi-dimensional stream data cube
    that can aggregate regression models of stream
    data efficiently without accessing the raw data
  • Stream data compression
  • Compress the stream data to support memory- and
    time-efficient multi-dimensional regression
    analysis

14
A Tilted Time-Frame Model
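[Figure: tilted time-frame examples along a Time axis ending at Now. One frame has spans labeled "Up to 7 days" and "Up to a year"; another uses a logarithmic (exponential) scale with spans 1t, 2t, 4t, 8t, 16t.]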
15
Benefits of Tilted Time-Frame Model
  • Each cell stores the measures according to
    tilt-time-frame
  • Limited memory space: impossible to store the
    history in full scale
  • Emphasis more on recent data
  • Most applications emphasize recent data (sliding
    window)
  • Natural partition on different time granularities
  • Putting different weights on remote data
  • Useful even for uniform weight
  • Tilted time-frame forms a new time dimension for
    mining changes and evolutions
  • Essential for mining dynamic patterns or outliers
  • Finding those with dramatic changes
  • E.g., exceptional stocks: not following the trends
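
A logarithmic tilted time frame can be sketched in a few lines. The version below is an assumption-laden illustration, not the MAIDS implementation: the class name `TiltedTimeFrame`, the choice of two slots per level, and the use of a plain sum as the stored measure are all hypothetical. Level `i` holds aggregates that each cover `2**i` base time units; when a level overflows, its two oldest slots are merged and pushed to the next coarser level.

```python
from collections import deque


class TiltedTimeFrame:
    """Logarithmic tilted time frame: recent data fine-grained, old data coarse."""

    def __init__(self, per_level=2):
        self.per_level = per_level
        # levels[i] is a deque of sums, newest first, each covering 2**i units.
        self.levels = []

    def add(self, value):
        """Record the measure for one new base time unit."""
        carry, i = value, 0
        while carry is not None:
            if i == len(self.levels):
                self.levels.append(deque())
            level = self.levels[i]
            level.appendleft(carry)
            carry = None
            if len(level) > self.per_level:
                # Merge the two oldest slots into one slot at the coarser level.
                oldest = level.pop()
                second_oldest = level.pop()
                carry = oldest + second_oldest
            i += 1

    def windows(self):
        """Return (span_in_base_units, sum) pairs, newest first."""
        return [(2 ** i, s) for i, level in enumerate(self.levels) for s in level]


frame = TiltedTimeFrame()
for _ in range(8):
    frame.add(1)
# frame.windows() -> [(1, 1), (1, 1), (2, 2), (4, 4)]
```

The number of slots grows logarithmically with the length of the history, which is the memory argument the slide makes, while the most recent units always stay at the finest granularity.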

16
A Stream Cube Architecture
  • A tilted time frame
  • Different time granularities
  • second, minute, quarter, hour, day, week, ...
  • Critical layers
  • Minimum interest layer (m-layer)
  • Observation layer (o-layer)
  • User watches at o-layer and occasionally needs
    to drill down to m-layer
  • Partial materialization of stream cubes
  • Full materialization: too space and time
    consuming
  • No materialization: slow response at query time
  • Partial materialization: what do we mean by
    "partial"?

17
Two Critical Layers in the Stream Cube
  • o-layer (observation): (*, theme, quarter)
  • m-layer (minimal interest): (user-group, URL-group, minute)
  • (primitive) stream data layer: (individual-user, URL, second)
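
Rolling cells up from the m-layer to the o-layer can be sketched as a simple re-keying and summation. This is only an illustration: the function name `roll_up`, the `url_group_to_theme` lookup table, the use of click counts as the measure, and the reading of "quarter" as a 15-minute interval are all assumptions made for the example.

```python
from collections import Counter


def roll_up(m_cells, url_group_to_theme):
    """Aggregate m-layer cells (user_group, url_group, minute) -> count
    into o-layer cells ("*", theme, quarter) -> count."""
    o_cells = Counter()
    for (user_group, url_group, minute), count in m_cells.items():
        theme = url_group_to_theme[url_group]   # generalize URL-group to theme
        quarter = minute // 15                  # generalize minute to quarter-hour
        o_cells[("*", theme, quarter)] += count  # user dimension rolled up to "*"
    return dict(o_cells)


m_layer = {
    ("students", "espn", 3): 5,
    ("staff", "espn", 10): 2,
    ("students", "cnn", 16): 4,
}
themes = {"espn": "sports", "cnn": "politics"}
o_layer = roll_up(m_layer, themes)
# o_layer -> {("*", "sports", 0): 7, ("*", "politics", 1): 4}
```

Drilling down from the o-layer simply means going back to the finer m-layer cells that contributed to a given o-layer cell.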
18
Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cuboids between the two critical layers. Cells shown on the slide: (A1, *, C1), (A1, *, C2), (A2, *, C2), (A2, B1, C1), (A1, B1, C2), (A1, B2, C1), (A2, B1, C2), (A2, B2, C1), (A1, B2, C2), (A2, B2, C2).]
19
An H-Tree Cubing Structure
[Figure: an H-tree. Below the root, the observation layer has nodes sports, politics, entertainment; the next level has nodes uiuc and uic; the minimal interest layer has nodes jeff, jim, mary; leaf nodes are labeled Q.I.]
20
Benefits of H-Tree and H-Cubing
  • H-tree and H-cubing
  • Developed for computing data cubes and iceberg
    cubes
  • J. Han, J. Pei, G. Dong, and K. Wang, Efficient
    Computation of Iceberg Cubes with Complex
    Measures, SIGMOD'01
  • D. Xin, J. Han, X. Li, and B. Wah, Star-Cubing:
    Computing Iceberg Cubes by Top-Down and Bottom-Up
    Integration, VLDB'03, Berlin, Germany, Sept.
    2003.
  • Compressed database, fast cubing, space
    preserving
  • Using H-tree for stream cubing
  • Space preserving: intermediate aggregates can be
    computed incrementally and saved in tree nodes
  • Facilitate computing other cells and
    multi-dimensional analysis
  • H-tree with computed cells can be viewed as
    stream cube
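
The idea of saving incrementally computed aggregates in tree nodes can be sketched with a plain prefix tree. This is a much-simplified stand-in for an H-tree (it omits header tables and side links), and the names `HNode` and `insert` as well as the count measure are hypothetical.

```python
class HNode:
    """Tree node holding an aggregate (here: a count) for its path prefix."""

    def __init__(self):
        self.count = 0
        self.children = {}


def insert(root, path, count=1):
    """Add one tuple along `path`, updating the aggregate at every node on the way."""
    root.count += count
    node = root
    for value in path:
        node = node.children.setdefault(value, HNode())
        node.count += count


root = HNode()
insert(root, ("sports", "uiuc", "jeff"))
insert(root, ("sports", "uiuc", "jim"))
insert(root, ("politics", "uic", "jeff"))
# root.children["sports"].count is now 2: an aggregate available
# without revisiting the individual tuples.
```

Because shared prefixes share nodes, the tree compresses the data while keeping higher-level aggregates ready for multi-dimensional queries.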

21
Use of Stream Cubing: Regression Analysis
  • Regression modeling of cells at all dimensions
    and levels
  • Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
    Multi-dimensional regression analysis of
    time-series data streams, VLDB'02.
  • Efficient storage and scalable and fast
    aggregation
  • Lossless aggregation without accessing the raw
    data
  • Covers a large and the most popular class of
    regression models
  • Including quadratic, polynomial, and nonlinear
    models
  • Methodology can be used for other statistical
    summaries, gradient analysis, and so on
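
Lossless aggregation without the raw data can be illustrated for simple linear regression. The sketch below is not the representation used in the cited VLDB'02 paper; it uses the textbook sufficient statistics (n, Σt, Σy, Σt², Σty), and the class name `RegStats` and its methods are hypothetical. Two cells covering disjoint sets of observations are merged by adding their statistics, and the merged fit equals the fit over all raw points.

```python
from dataclasses import dataclass


@dataclass
class RegStats:
    """Sufficient statistics for least-squares fitting of y = slope * t + intercept."""
    n: int = 0
    st: float = 0.0    # sum of t
    sy: float = 0.0    # sum of y
    stt: float = 0.0   # sum of t*t
    sty: float = 0.0   # sum of t*y

    def add(self, t, y):
        """Fold one observation into the statistics."""
        self.n += 1
        self.st += t
        self.sy += y
        self.stt += t * t
        self.sty += t * y

    def merge(self, other):
        """Combine two cells without touching the raw observations."""
        return RegStats(self.n + other.n, self.st + other.st, self.sy + other.sy,
                        self.stt + other.stt, self.sty + other.sty)

    def fit(self):
        """Return (slope, intercept) of the least-squares line."""
        denom = self.n * self.stt - self.st ** 2
        if denom == 0:
            raise ValueError("need at least two distinct time points")
        slope = (self.n * self.sty - self.st * self.sy) / denom
        intercept = (self.sy - slope * self.st) / self.n
        return slope, intercept


first_half, second_half = RegStats(), RegStats()
for t in range(5):
    first_half.add(t, 2 * t + 1)
for t in range(5, 10):
    second_half.add(t, 2 * t + 1)
slope, intercept = first_half.merge(second_half).fit()
```

Each cell stores five numbers regardless of how many observations it summarizes, which is what makes storage and aggregation scale.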

22
Clustering Data Streams
  • Network intrusion detection: one example
  • Detect bursts of activities or abrupt changes in
    real time, by on-line clustering
  • Two major methodologies
  • Motwani et al. (Stanford and HP Lab)
  • S. Guha, N. Mishra, R. Motwani, and L.
    O'Callaghan, Clustering data streams, FOCS'00.
  • Merging and changing k-median cluster centers
  • Our approach (UIUC and IBM)
  • Tilted time frame to store historical data in
    a compressed way
  • Mining evolving data streams

23
Clustering Evolving Data Streams
  • Why clustering evolving data streams?
  • Finding evolutions of clusters, not just current
    clusters
  • C. Aggarwal, J. Han, J. Wang, P. S. Yu, A
    Framework for Clustering Evolving Data Streams,
    VLDB'03
  • Methodology
  • Tilted time framework: compression and mining
    changes
  • Micro-clustering: better quality than
    k-means/k-median
  • Incremental, online processing and maintenance
  • Two stages: micro-clustering and macro-clustering
  • With limited overhead, achieves high
    efficiency, scalability, quality of results and
    power of evolution/change detection
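
The online micro-clustering stage relies on summaries that can be updated and merged incrementally. The sketch below is a simplified illustration, not the structure from the cited paper: it keeps only a count, a linear sum, and a squared sum per dimension (time statistics are left out), and the class name `MicroCluster` and its methods are hypothetical.

```python
import math


class MicroCluster:
    """Additive cluster summary: count, per-dimension linear sum and squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim   # linear sum of points
        self.ss = [0.0] * dim   # sum of squared coordinates

    def absorb(self, point):
        """Add one point to the summary; the point itself is not stored."""
        self.n += 1
        for i, x in enumerate(point):
            self.ls[i] += x
            self.ss[i] += x * x

    def merge(self, other):
        """Fold another micro-cluster into this one (summaries are additive)."""
        self.n += other.n
        for i in range(len(self.ls)):
            self.ls[i] += other.ls[i]
            self.ss[i] += other.ss[i]

    def centroid(self):
        return [s / self.n for s in self.ls]

    def radius(self):
        """Root-mean-square distance of the absorbed points from the centroid."""
        var = sum(ss / self.n - (ls / self.n) ** 2
                  for ls, ss in zip(self.ls, self.ss))
        return math.sqrt(max(var, 0.0))


a = MicroCluster(dim=2)
a.absorb((0.0, 0.0))
a.absorb((2.0, 0.0))
```

An offline macro-clustering step can then run an ordinary clustering algorithm over the micro-cluster centroids instead of over the raw stream.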

24
Decision Tree Induction with Stream Data
  • VFDT/CVFDT
  • P. Domingos and G. Hulten, Mining high-speed
    data streams, KDD'00
  • G. Hulten, L. Spencer, and P. Domingos, Mining
    time-changing data streams, KDD'01
  • VFDT (Very Fast Decision Tree) (Domingos and
    Hulten'00)
  • With high probability, constructs a model
    identical to the one a traditional (greedy)
    method would learn
  • If it cannot be inserted into the same branch,
    construct shadow branches as preparation for
    changes
  • If the shadow becomes dominant, a switch of tree
    branches occurs
  • CVFDT: extension to time-changing data
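
The "with high probability" guarantee of VFDT comes from the Hoeffding bound, which says how many examples are enough to trust the best split attribute. The sketch below shows only that test; the function names `hoeffding_bound` and `should_split` and the example numbers are hypothetical, and the rest of the tree-building procedure is omitted.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Epsilon such that, with probability 1 - delta, the true mean of a variable
    with the given range lies within epsilon of its mean over n observations."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split when the observed lead of the best attribute exceeds the bound."""
    return best_gain - second_gain > hoeffding_bound(value_range, delta, n)


# After 1000 examples, a lead of 0.15 in the split measure is decisive
# at delta = 1e-7, while a lead of 0.05 is not yet.
decisive = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
undecided = should_split(0.30, 0.25, value_range=1.0, delta=1e-7, n=1000)
```

Because the bound shrinks as more examples arrive, a node that cannot decide yet simply keeps accumulating statistics from the stream.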

25
Single-Pass Algorithm (An Example)
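[Figure: decision trees grown over a data stream in a single pass. Node labels on the slide: "Packets > 10", "Bytes > 60K", "Protocol = http", "Protocol = ftp", with yes/no branches and an annotation beginning "SP(Bytes) - SP(Packets) >" (truncated in the transcript).]
From Gehrke's SIGMOD tutorial slides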
26
Our Approaches for Stream Classification
  • Why our approaches?
  • Is a decision tree good for modeling fast-changing
    data? Changes that are too fast may make trees
    obsolete
  • May other models have better survivability
    (adaptability)?
  • Can we find and compare evolution behavior?
  • Our methodology
  • Tilted time framework (compression and evolution)
  • Instead of decision trees, consider other models
    that do not need dramatic changes, e.g., Naïve
    Bayesian with boosting, K-NN?
  • Incremental updating, dynamic maintenance, and
    model construction
  • Comparing models to find changes

27
Stream Classification by Naïve Bayesian
  • Naïve Bayesian with boosting
  • A working paper with Jiong Yang, Wei Wang and
    Xifeng Yan
  • A stable model that registers attribute-value
    pairs (similar to RainForest-based classification)
  • History can be recorded using tilted time
    framework
  • Using boosting to increase classification
    accuracy
  • Adaptable to dramatic changes
  • Incremental updating, dynamic maintenance, and
    fast model construction
  • Comparing models to find changes
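
A model that only registers attribute-value pairs can be updated one tuple at a time. The sketch below is a plain incremental Naïve Bayes classifier with Laplace smoothing; it leaves out the boosting and tilted-time history mentioned above, and the class name `StreamingNaiveBayes`, its methods, and the sample attributes are hypothetical.

```python
import math
from collections import defaultdict


class StreamingNaiveBayes:
    """Naive Bayes over categorical attributes, maintained by counting."""

    def __init__(self):
        self.class_counts = defaultdict(int)   # label -> count
        self.pair_counts = defaultdict(int)    # (label, attr_index, value) -> count
        self.values = defaultdict(set)         # attr_index -> distinct values seen

    def update(self, attrs, label):
        """Register one labeled tuple; only counters change."""
        self.class_counts[label] += 1
        for i, v in enumerate(attrs):
            self.pair_counts[(label, i, v)] += 1
            self.values[i].add(v)

    def predict(self, attrs):
        """Return the most probable label, or None if nothing has been seen."""
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, c in self.class_counts.items():
            score = math.log(c / total)
            for i, v in enumerate(attrs):
                k = len(self.values.get(i, ())) or 1
                seen = self.pair_counts.get((label, i, v), 0)
                score += math.log((seen + 1) / (c + k))  # Laplace smoothing
            if score > best_score:
                best_label, best_score = label, score
        return best_label


model = StreamingNaiveBayes()
for _ in range(3):
    model.update(("http", "small"), "normal")
    model.update(("ftp", "large"), "alarm")
```

Since the model is just a table of counts, two snapshots taken at different times can be compared count by count to look for changes.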

28
Stream Classification by K-Nearest Neighbor
  • Classification based on nearest neighbors
  • C. Aggarwal (IBM), J. Han, J. Wang, P. S. Yu
    (IBM), A Framework for Effective Classification
    of Evolving Data, a working paper
  • Two kinds of stream classification tasks
  • Type I: prediction of peers in the current stream
  • Type II: prediction of the behavior in the next
    window
  • Registration of basic statistics (clustering
    features) using B-tree (BIRCH-like) structure
  • Different philosophies for training and model
    construction for Type I and Type II tasks

29
Mining Frequent Patterns for Stream Data
  • Frequent pattern mining is valuable in stream
    applications
  • E.g., network intrusion mining (Dokas et al.'02)
  • Mining precise freq. patterns in stream data:
    unrealistic
  • Even if they are stored in a compressed form,
    such as an FP-tree
  • How to mine frequent patterns with good
    approximation?
  • Approximate frequent patterns (Manku and Motwani,
    VLDB'02)
  • Major idea: do not trace an item until it first
    becomes frequent
  • Adv.: guaranteed error bound
  • Disadv.: keeps a large set of traces
  • Our comments
  • Keep only current frequent patterns? No changes
    can be detected
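
The approximate counting idea attributed above to Manku and Motwani can be sketched for single items with their lossy counting scheme. This is an illustration under assumptions, not the authors' code: it counts individual items rather than itemsets, and the class name `LossyCounter` and its methods are hypothetical.

```python
import math


class LossyCounter:
    """Approximate item counts; undercounts by at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # items seen so far
        self.entries = {}                     # item -> [count, max_undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            # The item may have been seen and pruned in earlier buckets.
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: prune entries that cannot be frequent yet.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose true frequency may reach support * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}


counter = LossyCounter(epsilon=0.1)
for i in range(50):
    counter.add("a")        # a recurring item
    counter.add(f"x{i}")    # fifty one-off items
```

In this run the one-off items are pruned at every bucket boundary, so only the recurring item stays in memory; as the slide notes, a summary that keeps only current counts cannot show how frequencies changed over time.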

30
Our Approach on Frequent Stream Patterns
  • Approach 1: mining with multiple time
    granularities
  • C. Giannella, J. Han, J. Pei, X. Yan and P.S. Yu,
    Mining Frequent Patterns in Data Streams at
    Multiple Time Granularities, Next Gen. Data
    Mining, MIT Press, 2003
  • Keep pattern-trees at the tilted time window
    frame (using tree-sharing method)
  • Mining evolution and dramatic changes of frequent
    patterns
  • Approach 2: mining only itemsets of interest
  • Identify items of interest in the stream environment
  • Keep precise/compressed history in tilted time
    window
  • Mining using FP-tree and related fast mining
    method

31
A Discussion of Our Work Plan
  • System architecture design
  • Need preprocessing of stream data flow
  • Two working modes
  • Disk-based and true stream processing
  • Main memory algorithms
  • cube and tilted time frame structure
  • Test data sets
  • Network flow (multi-dimensional data)
  • Web click stream analysis
  • Stock trading data?
  • Multiple stream weather data?

32
Conclusions
  • Stream data mining: a rich and largely unexplored
    field
  • Current research focus in database community
  • DSMS system architecture, continuous query
    processing, supporting mechanisms
  • Stream data mining and stream OLAP analysis
  • Powerful tools for finding general and unusual
    patterns
  • Effectiveness, efficiency and scalability: lots
    of open problems
  • Our philosophy
  • A multi-dimensional stream analysis framework
  • Time is a special dimension: tilted time frame
  • What to compute and what to save? Critical layers
  • Very partial materialization/precomputation:
    popular path approach
  • Mining dynamics of stream data

33
References
  • B. Babcock, S. Babu, M. Datar, R. Motwani, and
    J. Widom, Models and issues in data stream
    systems, PODS'02 (tutorial).
  • S. Babu and J. Widom, Continuous queries over
    data streams, SIGMOD Record, 30:109-120, 2001.
  • Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and
    J. Wang, Online analytical processing stream
    data: Is it feasible?, DMKD'02.
  • Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
    Multi-dimensional regression analysis of
    time-series data streams, VLDB'02.
  • P. Domingos and G. Hulten, Mining high-speed
    data streams, KDD'00.
  • M. Garofalakis, J. Gehrke, and R. Rastogi,
    Querying and mining data streams: You only get
    one look, SIGMOD'02 (tutorial).
  • J. Gehrke, F. Korn, and D. Srivastava, On
    computing correlated aggregates over continuous
    data streams, SIGMOD'01.
  • S. Guha, N. Mishra, R. Motwani, and L.
    O'Callaghan, Clustering data streams, FOCS'00.
  • G. Hulten, L. Spencer, and P. Domingos, Mining
    time-changing data streams, KDD'01.

34
www.cs.uiuc.edu/hanj
  • Thank you !!!