Title: Mining Dynamics of Data Streams in Multidimensional Space


1
Mining Dynamics of Data Streams in
Multidimensional Space
  • Jiawei Han
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign
  • www.cs.uiuc.edu/hanj

2
Outline
  • Characteristics of data streams
  • Mining dynamics in data streams
  • Multi-dimensional (regression) analysis of data
    streams
  • Stream cubing and stream OLAP methods
  • Mining other kinds of patterns in data streams
  • Research problems

3
Characteristics of Data Streams
  • Data Streams
  • Data streams: continuous, ordered, changing, fast,
    huge amount
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • Characteristics
  • Huge volumes of continuous data, possibly
    infinite
  • Fast changing and requires fast, real-time
    response
  • Data stream captures nicely our data processing
    needs of today
  • Random access is expensive: single linear scan
    algorithm (can only have one look)
  • Store only the summary of the data seen thus far
  • Most stream data are at a pretty low level or
    multi-dimensional in nature, and need multi-level
    and multi-dimensional processing

4
Stream Data Applications
  • Telecommunication calling records
  • Business: credit card transaction flows
  • Network monitoring and traffic engineering
  • Financial market: stock exchange
  • Engineering and industrial processes: power supply
    and manufacturing
  • Sensor, monitoring, and surveillance: video streams
  • Security monitoring
  • Web logs and Web page click streams
  • Massive data sets (even if saved, random access
    is too expensive)

5
Architecture: Stream Query Processing
[Figure: architecture diagram with the labels User/Application, SDMS (Stream Data Management System), Results, Multiple streams, Stream Query Processor, and Scratch Space (main memory and/or disk)]

6
Challenges of Stream Data Processing
  • Multiple, continuous, rapid, time-varying,
    ordered streams
  • Main memory computations
  • Queries are often continuous
  • Evaluated continuously as stream data arrives
  • Answer updated over time
  • Queries are often complex
  • Beyond element-at-a-time processing
  • Beyond stream-at-a-time processing
  • Beyond relational queries (scientific, data
    mining, OLAP)
  • Multi-level/multi-dimensional processing and data
    mining
  • Most stream data are at pretty low-level or
    multi-dimensional in nature

7
Processing Stream Queries
  • Query types
  • One-time query vs. continuous query (being
    evaluated continuously as stream continues to
    arrive)
  • Predefined query vs. ad-hoc query (issued
    on-line)
  • Unbounded memory requirements
  • For real-time response, main memory algorithm
    should be used
  • Memory requirement is unbounded if one will join
    future tuples
  • Approximate query answering
  • With bounded memory, it is not always possible to
    produce exact answers
  • High-quality approximate answers are desired
  • Data reduction and synopsis construction methods
  • Sketches, random sampling, histograms, wavelets,
    etc.
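
As one illustration of the synopsis methods listed above, random sampling over an unbounded stream can be done in a single pass with bounded memory. The following is a minimal sketch of classic reservoir sampling; the function name `reservoir_sample` and its parameters are illustrative and do not come from the slides.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of at most k items from a one-pass stream."""
    rng = rng or random.Random()
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            sample.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item
    return sample


if __name__ == "__main__":
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

Memory use depends only on `k`, not on the stream length, which is the property the slide asks of a synopsis.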

8
Projects on DSMS (Data Stream Management System)
  • Research projects and system prototypes
  • STREAM (Stanford): a general-purpose DSMS
  • Cougar (Cornell): sensors
  • Aurora (Brown/MIT): sensor monitoring, dataflow
  • Hancock (AT&T): telecom streams
  • Niagara (OGI/Wisconsin): Internet XML databases
  • OpenCQ (Georgia Tech): triggers, incr. view
    maintenance
  • Tapestry (Xerox): pub/sub content-based filtering
  • Telegraph (Berkeley): adaptive engine for sensors
  • Tradebot (www.tradebot.com): stock tickers and
    streams
  • Tribeca (Bellcore): network monitoring
  • Streaminer (UIUC): new project for stream data
    mining

9
Stream Data Mining vs. Stream Querying
  • Stream mining: a more challenging task
  • It shares most of the difficulties with stream
    querying
  • Patterns are hidden and more general than
    querying
  • It may require exploratory analysis
  • Not necessarily continuous queries
  • Stream data mining tasks
  • Multi-dimensional on-line analysis of streams
  • Mining outliers and unusual patterns in stream
    data
  • Clustering data streams
  • Classification of stream data

10
Stream Data Mining Tasks
  • Multi-dimensional (on-line) analysis of streams
  • Clustering data streams
  • Classification of data streams
  • Mining frequent patterns in data streams
  • Mining sequential patterns in data streams
  • Mining partial periodicity in data streams
  • Mining notable gradients in data streams
  • Mining outliers and unusual patterns in data
    streams
  • ..., more?

11
Challenges for Mining Dynamics in Data Streams
  • Most stream data are at a pretty low level or
    multi-dimensional in nature: they need
    multi-level/multi-dimensional (ML/MD) processing
  • Analysis requirements
  • Multi-dimensional trends and unusual patterns
  • Capturing important changes at multiple
    dimensions/levels
  • Fast, real-time detection and response
  • Comparing with data cube: similarities and
    differences
  • Stream (data) cube or stream OLAP: is this
    feasible?
  • Can we implement it efficiently?

12
Multi-Dimensional Stream Analysis: Examples
  • Analysis of Web click streams
  • Raw data at low levels: seconds, web page
    addresses, user IP addresses, ...
  • Analysts want changes, trends, unusual patterns,
    at reasonable levels of detail
  • E.g., average clicking traffic in North America
    on sports in the last 15 minutes is 40% higher
    than that in the last 24 hours
  • Analysis of power consumption streams
  • Raw data: power consumption flow for every
    household, every minute
  • Patterns one may find: average hourly power
    consumption of manufacturing companies in Chicago
    surged 30% in the last 2 hours today compared
    with the same day a week ago
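
The click-stream statement above compares an average rate over a short recent window with the average rate over a long window. A minimal sketch of that arithmetic follows; the function name and the sample totals are hypothetical and chosen only so the numbers are easy to follow.

```python
def relative_change(recent_total, recent_minutes, long_total, long_minutes):
    """Relative difference between the recent per-minute rate and the long-run rate."""
    recent_rate = recent_total / recent_minutes
    long_rate = long_total / long_minutes
    return (recent_rate - long_rate) / long_rate


if __name__ == "__main__":
    # Hypothetical totals: 2,100 clicks in the last 15 minutes,
    # 144,000 clicks in the last 24 hours (1,440 minutes).
    change = relative_change(2100, 15, 144000, 1440)
    print(f"{change:.0%} higher than the 24-hour average")
```

Only two aggregates per window (a total and a duration) are needed, so the comparison can be answered from summaries rather than raw tuples.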

13
A Key Step: Stream Data Reduction
  • Challenges of OLAPing stream data
  • Raw data cannot be stored
  • Simple aggregates are not powerful enough
  • History shape and patterns at different levels
    are desirable: multi-dimensional regression
    analysis
  • Proposal
  • A scalable multi-dimensional stream data cube
    that can aggregate regression model of stream
    data efficiently without accessing the raw data
  • Stream data compression
  • Compress the stream data to support memory- and
    time-efficient multi-dimensional regression
    analysis

14
Regression Cube for Time-Series
  • Initially, one time-series per base cell
  • Too costly to store all these time-series
  • Too costly to compute regression at
    multi-dimensional space
  • Regression cube
  • Base cube: only store regression parameters of
    base cells (e.g., 2 points vs. 1000 points)
  • All the upper level cuboids can be computed
    precisely for linear regression on both standard
    dimensions and time dimensions
  • For quadratic regression, we need 5 points
  • In general, we need [formula missing in transcript],
    where k = 2 for quadratic.

15
Basics of General Linear Regression
  • n tuples in one cell: (x_i, y_i), i = 1..n, where
    y_i is the measure attribute to be analyzed
  • For sample i, a vector of k user-defined
    predictors u_i
  • The linear regression model [equation not preserved
    in this transcript]
  • where the coefficient vector is a k × 1 vector of
    regression parameters
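
For reference, the textbook form of a general linear regression model under the definitions above can be written as follows. This is a sketch in assumed notation, not the slide's own equation: the symbol \theta for the k × 1 parameter vector and the matrix U (whose i-th row is u_i^T) are choices made here, since the original symbols were lost.

```latex
% Assumed notation: \theta is the k x 1 parameter vector,
% U is the n x k matrix with rows u_i^T, y = (y_1, ..., y_n)^T.
\[
  E(y_i \mid u_i) = \theta^{T} u_i
                  = \theta_1 u_{i1} + \cdots + \theta_k u_{ik},
  \qquad i = 1, \dots, n
\]
% Ordinary least-squares estimate (requires U^T U to be invertible):
\[
  \hat{\theta} = (U^{T} U)^{-1} U^{T} y
\]
```

The least-squares estimate is linear in y, which is the property that makes aggregation of regression parameters across cells possible.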

16
Stock Price Example: Aggregation in Standard
Dimensions
  • Simple linear regression on time series data
  • Cells of two companies
  • After aggregation

17
Stock Price Example: Aggregation in Time Dimension
  • Cells of two adjacent time intervals
  • After aggregation
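
Merging two adjacent time intervals can also be done without the raw points. The sketch below uses the standard sufficient statistics of simple linear regression (n, Σt, Σt², Σy, Σty), which add across disjoint intervals. This is a swapped-in illustration of lossless time-dimension aggregation, not the compressed representation used in the original work; the class name `RegStats` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegStats:
    """Sufficient statistics for fitting y = base + slope * t."""
    n: int
    sum_t: float
    sum_tt: float
    sum_y: float
    sum_ty: float

    @classmethod
    def from_points(cls, ts, ys):
        return cls(
            n=len(ts),
            sum_t=sum(ts),
            sum_tt=sum(t * t for t in ts),
            sum_y=sum(ys),
            sum_ty=sum(t * y for t, y in zip(ts, ys)),
        )

    def merge(self, other):
        """Statistics of the union of two disjoint time intervals."""
        return RegStats(
            self.n + other.n,
            self.sum_t + other.sum_t,
            self.sum_tt + other.sum_tt,
            self.sum_y + other.sum_y,
            self.sum_ty + other.sum_ty,
        )

    def line(self):
        """Least-squares (base, slope) from the stored statistics."""
        denom = self.n * self.sum_tt - self.sum_t ** 2
        slope = (self.n * self.sum_ty - self.sum_t * self.sum_y) / denom
        base = (self.sum_y - slope * self.sum_t) / self.n
        return base, slope


if __name__ == "__main__":
    first = RegStats.from_points([0, 1, 2, 3], [1, 3, 5, 7])
    second = RegStats.from_points([4, 5, 6, 7], [9, 11, 13, 15])
    print(first.merge(second).line())
```

Each cell stores five numbers regardless of how many tuples fall into it, and the merged fit is identical to the fit over all raw points.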

18
Feasibility of Stream Regression Analysis
  • Efficient storage and scalable (independent of
    the number of tuples in data cells)
  • Lossless aggregation without accessing the raw
    data
  • Fast aggregation: computationally efficient
  • Regression models of data cells at all levels
  • General results: cover a large and the most
    popular class of regression
  • Including quadratic, polynomial, and nonlinear
    models

19
A Stream Cube Architecture
  • A tilted time frame
  • Different time granularities
  • second, minute, quarter, hour, day, week, ...
  • Critical layers
  • Minimum interest layer (m-layer)
  • Observation layer (o-layer)
  • User watches at o-layer and occasionally needs
    to drill down to m-layer
  • Partial materialization of stream cubes
  • Full materialization: too space and time
    consuming
  • No materialization: slow response at query time
  • Partial materialization: what do we mean by
    "partial"?

20
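A Tilted Time-Frame Model
[Figure: tilted time frame on a time axis ending at "Now", annotated "Up to 7 days" and "Up to a year"; a second panel shows a logarithmic (exponential) scale with windows 1t, 2t, 4t, 8t, 16t]
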
21
Benefits of Tilted Time-Frame Model
  • Each cell stores the measures according to
    tilt-time-frame
  • Limited memory space: impossible to store the
    history in full scale
  • Emphasis more on recent data
  • Most applications emphasize recent data (sliding
    window)
  • Natural partition on different time granularities
  • Putting different weights on remote data
  • Useful even for uniform weight
  • Tilted time-frame forms a new time dimension
  • for mining changes and evolutions
  • Essential for mining dynamic patterns or outliers
  • Finding those with dramatic changes
  • E.g., exceptional stocks: not following the trends
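
A logarithmic tilted time frame can be sketched with a small structure that keeps recent data at fine granularity and merges older data into windows that double in span (1t, 2t, 4t, ...). The class below is one possible minimal implementation under assumed rules (at most two windows per level, additive measures by default); the name `TiltedTimeFrame` and these rules are illustrative, not taken from the slides.

```python
class TiltedTimeFrame:
    """Logarithmic tilted time frame: level i holds windows spanning 2**i base units."""

    def __init__(self, merge=lambda older, newer: older + newer):
        self.merge = merge
        self.levels = []  # levels[i] is a list of aggregates, newest first

    def add(self, value):
        """Insert the aggregate of the newest base time unit."""
        carry, level = value, 0
        while carry is not None:
            if level == len(self.levels):
                self.levels.append([])
            slots = self.levels[level]
            slots.insert(0, carry)
            if len(slots) > 2:
                # Merge the two oldest windows and push the result one level up.
                older = slots.pop()
                newer = slots.pop()
                carry = self.merge(older, newer)
            else:
                carry = None
            level += 1

    def windows(self):
        """(span in base units, aggregate) pairs, from newest to oldest."""
        return [(2 ** i, v) for i, slots in enumerate(self.levels) for v in slots]


if __name__ == "__main__":
    frame = TiltedTimeFrame()
    for _ in range(100):
        frame.add(1)  # one event per base time unit
    print(frame.windows())
```

The number of stored windows grows logarithmically with the length of the history, while the total of an additive measure is preserved.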

22
Two Critical Layers in the Stream Cube
  • o-layer (observation): (*, theme, quarter)
  • m-layer (minimal interest): (user-group, URL-group, minute)
  • (primitive) stream data layer: (individual-user, URL, second)

23
On-Line Materialization vs. On-Line Computation
  • On-line materialization
  • Materialization takes precious resources and time
  • Only incremental materialization (with sliding
    window)
  • Only materialize cuboids of the critical
    layers?
  • Some intermediate cells should also be
    materialized
  • Popular path approach vs. exception cell approach
  • Materialize intermediate cells along the popular
    paths
  • Exception cells: how to set up exception
    thresholds?
  • Notice: exceptions do not have monotonic behavior
  • Computation problem
  • How to compute and store stream cubes
    efficiently?
  • How to discover unusual cells between the
    critical layers?

24
Stream Cube Structure from m-layer to o-layer
[Figure: lattice of cuboids; node labels recovered from the transcript: (A1, *, C1), (A1, *, C2), (A2, B1, C1), (A1, B1, C2), (A1, B2, C1), (A2, *, C2), (A2, B1, C2), (A2, B2, C1), (A1, B2, C2), (A2, B2, C2)]

25
Stream Cube Computation
  • Cube structure from m-layer to o-layer
  • Three approaches
  • All cuboids approach
  • Materializing all cells (too much in both space
    and time)
  • Exceptional cells approach
  • Materializing only exceptional cells (saves space
    but not computation time, and the definition of
    exception is not flexible)
  • Popular path approach
  • Computing and materializing cells only along a
    popular path
  • Using H-tree structure to store computed cells
    (which form the stream cube: a selectively
    materialized cube)
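
The popular path approach can be sketched as a chain of roll-ups that starts at the m-layer and ends at the o-layer, materializing only the cuboids on that chain. The sketch below uses plain dictionaries rather than an H-tree, an additive count as the measure, and a made-up click-stream hierarchy; `roll_up`, `popular_path`, and `theme_of` are hypothetical names.

```python
from collections import defaultdict


def roll_up(cells, key_fn):
    """Aggregate an additive measure after mapping each cell key to a coarser key."""
    out = defaultdict(int)
    for key, measure in cells.items():
        out[key_fn(key)] += measure
    return dict(out)


def popular_path(m_layer, steps):
    """Materialize one cuboid per step, from the m-layer up to the o-layer."""
    cuboids = [m_layer]
    for step in steps:
        cuboids.append(roll_up(cuboids[-1], step))
    return cuboids


if __name__ == "__main__":
    # m-layer cells keyed by (user-group, URL-group, minute); counts are made up.
    m_layer = {
        ("students", "sports-pages", 3): 5,
        ("students", "sports-pages", 20): 2,
        ("staff", "news-pages", 7): 4,
        ("staff", "sports-pages", 9): 1,
    }
    theme_of = {"sports-pages": "sports", "news-pages": "politics"}
    steps = [
        lambda k: (k[0], k[1], k[2] // 15),        # minute -> quarter
        lambda k: (k[0], theme_of[k[1]], k[2]),    # URL-group -> theme
        lambda k: ("*", k[1], k[2]),               # user-group -> *
    ]
    for cuboid in popular_path(m_layer, steps):
        print(cuboid)
```

Each cuboid is computed from the one just below it on the path, so no step needs to revisit the m-layer cells or the raw stream.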

26
An H-Tree Cubing Structure
[Figure: H-tree diagram; node labels recovered from the transcript: root; Observation layer: sports, politics, entertainment; uiuc, uic, uic, uiuc; Minimal int. layer: jeff, Jim, jeff, mary; Q.I. entries attached to nodes]

27
Benefits of H-Tree and H-Cubing
  • H-tree and H-cubing
  • Developed for computing data cubes and iceberg
    cubes
  • J. Han, J. Pei, G. Dong, and K. Wang, "Efficient
    Computation of Iceberg Cubes with Complex
    Measures", SIGMOD'01
  • Compressed database
  • Fast cubing
  • Space preserving in cube computation
  • Using H-tree for stream cubing
  • Space preserving
  • Intermediate aggregates can be computed
    incrementally and saved in tree nodes
  • Facilitate computing other cells and
    multi-dimensional analysis
  • H-tree with computed cells can be viewed as
    stream cube

28
Time and Space vs. Number of Tuples at the
m-Layer (Dataset D3L3C10T400K)
[Figure: a) Time vs. m-layer size; b) Space vs. m-layer size]

29
Time and Space vs. the Number of Levels
[Figure: a) Time vs. levels; b) Space vs. levels]

30
Mining Other Dynamic Patterns in Stream Data
  • Clustering and outlier analysis for stream mining
  • Clustering data streams (Guha, Motwani et al.
    2000-2002)
  • History-sensitive, high-quality incremental
    clustering
  • Classification of stream data
  • Evolution of decision trees: Domingos et al.
    (2000, 2001)
  • Incremental integration of new streams in
    decision-tree induction
  • Frequent pattern analysis
  • Approximate frequent patterns (Manku & Motwani,
    VLDB'02)
  • Evolution and dramatic changes of frequent
    patterns

31
Clustering for Mining Stream Dynamics
  • Network intrusion detection: one example
  • Detect bursts of activities or abrupt changes in
    real time by on-line clustering
  • Our methodology
  • Tilted time framework: otherwise, dynamic changes
    cannot be found
  • Micro-clustering: better quality than
    k-means/k-median (incremental, online processing
    and maintenance)
  • Two stages: micro-clustering and macro-clustering
  • With limited overhead to achieve high
    efficiency, scalability, quality of results and
    power of evolution/change detection

32
Classification for Dynamic Data Streams
  • Decision tree induction for stream data
    classification
  • VFDT (Very Fast Decision Tree)/CVFDT (Domingos,
    Hulten, Spencer, KDD'00/KDD'01)
  • Is a decision tree good for modeling fast-changing
    data, e.g., stock market analysis?
  • Methodology
  • Tilted time framework
  • Instead of decision trees, should we consider
    other models which do not change drastically,
    e.g., Naïve Bayesian with boosting?
  • Incremental updating, dynamic maintenance, and
    model construction
  • Comparing models to find changes
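
As background for the VFDT reference above: VFDT decides when a tree node has seen enough examples to split by using the Hoeffding bound. The sketch below shows only that test, with hypothetical function names; it is not an implementation of VFDT or CVFDT, and the gain values in the usage example are made up.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """With probability 1 - delta, the true mean of a variable with the given
    range lies within this distance of the mean observed over n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def can_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best attribute beats the runner-up by more than the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


if __name__ == "__main__":
    # Two classes, so information gain has range 1 bit (value_range = 1.0).
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(can_split(0.30, 0.10, 1.0, 1e-7, 1000))
    print(can_split(0.30, 0.25, 1.0, 1e-7, 1000))
```

The bound shrinks as more examples arrive, so a node with two closely matched attributes simply waits for more data before committing to a split.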

33
Frequent Patterns for Stream Data
  • Frequent pattern mining is valuable in stream
    applications
  • e.g., network intrusion mining (Dokas et al., '02)
  • Mining precise frequent patterns in stream data:
    unrealistic
  • Even if they are stored in a compressed form, such
    as an FP-tree
  • How to mine frequent patterns with good
    approximation?
  • Approximate frequent patterns (Manku & Motwani,
    VLDB'02)
  • Keep only current frequent patterns? Then no
    changes can be detected
  • Our approach
  • Keep pattern-trees at the tilted time window
    frame (using tree-sharing method)
  • Mining evolution and dramatic changes of frequent
    patterns
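
The approximate approach cited above (Manku & Motwani, VLDB'02) includes the Lossy Counting algorithm for frequent items. The following is a minimal single-item sketch of it, written from the published description; the function names are illustrative, and the extension to itemsets and to tilted time windows is not shown.

```python
import math


def lossy_count(stream, epsilon):
    """Lossy Counting: approximate item counts with error at most epsilon * N."""
    width = math.ceil(1 / epsilon)  # bucket width
    counts = {}                     # item -> [frequency, max possible undercount]
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)  # id of the current bucket
        if item in counts:
            counts[item][0] += 1
        else:
            counts[item] = [1, bucket - 1]
        if n % width == 0:
            # At each bucket boundary, drop entries that cannot be frequent.
            for key in [k for k, (f, d) in counts.items() if f + d <= bucket]:
                del counts[key]
    return counts, n


def frequent_items(counts, n, support, epsilon):
    """Items whose stored frequency is at least (support - epsilon) * N."""
    return {k: f for k, (f, d) in counts.items() if f >= (support - epsilon) * n}


if __name__ == "__main__":
    # 'a' appears in every other position; every other item appears once.
    stream = ["a" if i % 2 == 0 else f"x{i}" for i in range(100)]
    counts, n = lossy_count(stream, epsilon=0.1)
    print(frequent_items(counts, n, support=0.4, epsilon=0.1))
```

The table holds only items that could still be frequent, so memory stays small even though the stream itself is never stored.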

34
Conclusions
  • Stream data mining: a rich and largely unexplored
    field
  • Current research focus in database community
  • DSMS system architecture, continuous query
    processing, supporting mechanisms
  • Stream data mining and stream OLAP analysis
  • Powerful tools for finding general and unusual
    patterns
  • Effectiveness, efficiency and scalability: lots
    of open problems
  • Our philosophy
  • A multi-dimensional stream analysis framework
  • Time is a special dimension: tilted time frame
  • What to compute and what to save? Critical layers
  • Very partial materialization/precomputation:
    popular path approach
  • Mining dynamics of stream data

35
References
  • B. Babcock, S. Babu, M. Datar, R. Motwani, and
    J. Widom, Models and issues in data stream
    systems, PODS'02 (tutorial).
  • S. Babu and J. Widom, Continuous queries over
    data streams, SIGMOD Record, 30:109-120, 2001.
  • Y. Chen, G. Dong, J. Han, J. Pei, B. W. Wah, and
    J. Wang, Online analytical processing stream
    data: Is it feasible?, DMKD'02.
  • Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang,
    Multi-dimensional regression analysis of
    time-series data streams, VLDB'02.
  • P. Domingos and G. Hulten, Mining high-speed
    data streams, KDD'00.
  • M. Garofalakis, J. Gehrke, and R. Rastogi,
    Querying and mining data streams: You only get
    one look, SIGMOD'02 (tutorial).
  • J. Gehrke, F. Korn, and D. Srivastava, On
    computing correlated aggregates over continuous
    data streams, SIGMOD'01.
  • S. Guha, N. Mishra, R. Motwani, and L.
    O'Callaghan, Clustering data streams, FOCS'00.
  • G. Hulten, L. Spencer, and P. Domingos, Mining
    time-changing data streams, KDD'01.

36
www.cs.uiuc.edu/hanj
  • Thank you !!!