Statistical Mining in Data Streams - PowerPoint PPT Presentation

1
Statistical Mining in Data Streams
  • Ankur Jain
  • Dissertation Defense
  • Computer Science, UC Santa Barbara
  • Committee
  • Edward Y. Chang (chair)
  • Divyakant Agrawal
  • Yuan-Fang Wang

2
Roadmap
  • The Data Stream Model
  • Introduction and research issues
  • Related work
  • Data Stream Mining
  • Stream data clustering
  • Bayesian reasoning for sensor stream processing
  • Contribution Summary
  • Future work

3
Data Streams
  • A data stream is an unbounded and continuous
    sequence of tuples.
  • Tuples arrive online and could be
    multi-dimensional
  • A tuple seen once cannot be easily retrieved
    later
  • No control over the tuple arrival order
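To make the one-pass model concrete, here is a minimal Python sketch (ours, not from the talk); stream_source and update_synopsis are hypothetical placeholders:

    # Hedged sketch: tuples are consumed online, exactly once, in whatever
    # order they arrive; only a bounded-memory synopsis is retained.
    def process_stream(stream_source, update_synopsis):
        for tup in stream_source:   # unbounded sequence; no control over arrival order
            update_synopsis(tup)    # summarize now -- the tuple cannot be revisited later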

4
Applications: Sensor Networks
Applications: Network Monitoring
Applications: Text Processing
Applications
  • Video surveillance
  • Stock ticker monitoring
  • Process control and manufacturing
  • Traffic monitoring and analysis
  • Transaction log processing

Traditional DBMS does not work!
5
Data Stream Projects
  • STREAM (Stanford)
  • A general-purpose Data Stream Management System (DSMS)
  • Telegraph (Berkeley)
  • Adaptive query processing
  • TinyDB: a general-purpose sensor database
  • Aurora Project (Brown/MIT)
  • Distributed stream processing
  • Introduces new operators (map, drop, etc.)
  • The Cougar Project (Cornell)
  • Sensors form a distributed database system
  • Cross-layer optimizations (data management layer and the routing layer)
  • MAIDS (UIUC)
  • Mining Alarming Incidents in Data Streams
  • Streaminer: data stream mining

6
Data Stream Processing Key Ingredients
  • Adaptivity
  • Incorporate evolutionary changes in the stream
  • Approximation
  • Exact results are hard to compute fast with
    limited memory

7
A Data Stream Management System (DSMS)
[Figure: the central stream processing system]
8
Thesis Outline
  • Develop fast, online, statistical methods for
    mining data streams.
  • Adaptive non-linear clustering in
    multi-dimensional streams
  • Bayesian reasoning for sensor stream processing
  • Filtering methods for resource conservation
  • Change detection in data streams
  • Video sensor data stream processing

9
Roadmap
  • The Data Stream Model
  • Introduction and research issues
  • Related work
  • Data Stream Mining
  • Stream data clustering
  • Bayesian reasoning for sensor stream processing
  • Contribution Summary
  • Future work

10
Clustering in High-Dimensional Streams
  • Given a continuous sequence of points, group
    them into some number of clusters, such that the
    members of a cluster are geometrically close to
    each other.

11
Example Application: Network Monitoring
[Figure: high-dimensional connection tuples flowing from the Internet into the monitoring system]
12
Stream Clustering New Challenges
  • One-pass restriction and limited memory
    constraint
  • Fading cluster technique proposed by Aggarwal et
    al.
  • Non-linear separation boundaries
  • We propose using the kernel trick to deal with
    the non-linearity issue
  • Data dimensionality
  • We propose an effective incremental dimension-reduction technique

13
The 2-Tier Framework
Adaptive Non-linear Clustering
[Figure: the 2-tier clustering module (uses the kernel trick). The latest point from the d-dimensional input space passes through Tier 1 (stream segmentation) and Tier 2 (LDS projection update) into the q-dimensional LDS (q < d), where the fading clusters reside.]
14
The Fading Cluster Methodology
  • Each cluster Ci has a recency value Ri s.t.
  • Ri = f(t − tlast), where
  • t: current time
  • tlast: last time Ci was updated
  • f(t) = e^(−λt)
  • λ: fading factor
  • A cluster is erased from memory (faded) when Ri < h, where h is a user parameter
  • λ controls the influence of historical data
  • The total number of clusters is bounded
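As an illustration, a minimal Python sketch of the recency computation under the exponential form above; the parameter values and the cluster representation are assumptions, not the author's implementation:

    import math

    LAM = 0.01   # fading factor lambda (illustrative value)
    H = 0.05     # fade threshold h (illustrative value)

    def recency(t_now, t_last, lam=LAM):
        # R_i = f(t - t_last) with f(t) = exp(-lam * t)
        return math.exp(-lam * (t_now - t_last))

    def prune_faded(clusters, t_now):
        # Erase clusters whose recency fell below the user threshold h
        return [c for c in clusters if recency(t_now, c["t_last"]) >= H]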

15
Non-linearity in Data
[Figure: input space vs. feature space, related by the feature-space mapping Φ. Traditional clustering techniques (k-means) do not perform well on the non-linear input-space data; spectral clustering methods are likely to perform better.]
16
Non-linearity in Network Intrusion Data
[Figure: ipsweep attack data from the network intrusion dataset, shown in the input space and, after the mapping Φ, in the feature space, where it follows a geometrically well-behaved trend. Use the kernel trick!]
17
The Kernel Trick
  • An actual projection into the higher-dimensional space is computationally expensive
  • The kernel trick does the non-linear projection implicitly!
  • Given two input-space vectors x, y:
  • k(x, y) = ⟨Φ(x), Φ(y)⟩

The Gaussian kernel function k(x, y) = exp(−γ‖x − y‖²) was used in the previous example!
18
Kernel Trick - Working Example
  • x = (x1, x2) → Φ(x) = (x1², x2², √2·x1x2)   (Φ is not required explicitly!)
  • ⟨Φ(x), Φ(z)⟩ = ⟨(x1², x2², √2·x1x2), (z1², z2², √2·z1z2)⟩
  • = x1²z1² + x2²z2² + 2·x1x2·z1z2
  • = (x1z1 + x2z2)²
  • = ⟨x, z⟩²

The kernel trick allows us to perform operations in the high-dimensional feature space using a kernel function, without explicitly representing Φ.
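The identity is easy to check numerically. A small Python sketch verifying the polynomial-kernel example above, plus the Gaussian kernel from the previous slide (the gamma value is illustrative):

    import math

    def phi(x):
        # Explicit 2-D feature map: (x1^2, x2^2, sqrt(2)*x1*x2)
        x1, x2 = x
        return (x1 * x1, x2 * x2, math.sqrt(2) * x1 * x2)

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    x, z = (1.0, 2.0), (3.0, 4.0)
    # <phi(x), phi(z)> equals <x, z>^2 -- no explicit projection needed
    assert abs(dot(phi(x), phi(z)) - dot(x, z) ** 2) < 1e-9

    def gaussian_kernel(x, z, gamma=0.5):
        # k(x, z) = exp(-gamma * ||x - z||^2)
        return math.exp(-gamma * sum((xi - zi) ** 2 for xi, zi in zip(x, z)))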
19
Stream Clustering New Challenges
  • One-pass restriction and limited memory
    constraint
  • We use the fading cluster technique proposed by Aggarwal et al.
  • Non-linear separation boundaries
  • We propose using kernel methods to deal with the
    non-linearity issue
  • Data dimensionality
  • We propose an effective incremental dimension-reduction technique

20
Dimensionality Reduction
  • A PCA-like kernel method is desirable
  • Explicit representation: EVD preferred
  • KPCA is computationally prohibitive: O(n³)
  • The principal components evolve with time; frequent EVD updates may be necessary
  • We propose to perform EVD on grouped data instead of point data (see the sketch below)

This requires a novel kernel method.
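For contrast, a naive batch KPCA sketch in numpy; this is the O(n³) eigendecomposition of the full n×n kernel matrix that the grouped-data approach is meant to avoid (gamma and q are illustrative values):

    import numpy as np

    def batch_kpca(X, gamma=0.5, q=10):
        # Gaussian kernel matrix over all n points
        sq = np.sum(X ** 2, axis=1)
        K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2 * X @ X.T))
        n = len(X)
        J = np.eye(n) - np.ones((n, n)) / n     # centering in feature space
        vals, vecs = np.linalg.eigh(J @ K @ J)  # the O(n^3) EVD
        idx = np.argsort(vals)[::-1][:q]        # keep the top-q components
        return vals[idx], vecs[:, idx]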
21
The 2-Tier Framework
Adaptive Non-linear Clustering
[Figure: the 2-tier clustering module (uses the kernel trick). The latest point from the d-dimensional input space passes through Tier 1 (stream segmentation) and Tier 2 (LDS projection update) into the q-dimensional LDS (q < d), where the fading clusters reside.]
22
The 2-Tier Framework
  • Tier 1 captures the temporal locality in a segment
  • A segment is a group of contiguous points in the stream, packed closely together geometrically in the feature space
  • Tier 2 adaptively selects segments to project data into the LDS
  • Selected segments are called representative segments
  • Implicit data in the feature space is projected explicitly into the LDS such that the feature-space distances are preserved

23
The 2-Tier Framework
The per-point control flow (s denotes the size of the current segment S):

1. Obtain a point x from the stream.
2. Tier 1: test whether (Φ(x) is novel w.r.t. S and s > smin) or s = smax. If not, add x to S and go to step 4.
3. Tier 2: if S is a representative segment, add S to memory and update the LDS; then clear the contents of S, so that x starts a fresh segment.
4. Obtain x̂, the projection of x in the LDS.
5. If x̂ is close to an active cluster, assign x to its nearest cluster; otherwise, create a new cluster with x.
6. Update cluster centers and recency values; delete faded clusters.
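The same flow as a schematic Python sketch; every helper (is_novel, is_representative, project_to_lds, and so on) is a hypothetical placeholder for the kernel-based operations described above:

    def two_tier_step(x, S, clusters, smin, smax):
        # Tier 1: segment the stream on feature-space novelty
        if (is_novel(x, S) and len(S) > smin) or len(S) == smax:
            if is_representative(S):          # Tier 2: adapt the LDS
                store_segment(S)
                update_lds_projection(S)
            S.clear()                         # x starts a fresh segment
        S.append(x)
        # Cluster the explicit low-dimensional image of x
        xh = project_to_lds(x)
        c = nearest_active_cluster(xh, clusters)
        if c is not None:
            assign_to_cluster(xh, c)          # updates centers and recency values
        else:
            clusters.append(new_cluster(xh))
        delete_faded_clusters(clusters)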
24
Network Intrusion Stream
  • Simulated data from MIT Lincoln Labs
  • 34 continuous attributes (features)
  • 10.5K records
  • 22 types of intrusion attacks + 1 normal class

25
Network Intrusion Stream
[Figure: clustering accuracy at LDS dimensionality u = 10]
26
Efficiency - EVD Computations
Image data: 5K records, 576 features, 10 digits
Newswire data: 3.8K records, 16.5K features, 10 news topics
27
In Retrospect
  • We proposed an effective stream clustering
    framework
  • We use the kernel trick to delineate non-linear
    boundaries efficiently
  • We use a stream segmentation approach to continuously project data into a low-dimensional space

28
Roadmap
  • The Data Stream Model
  • Introduction and research issues
  • Related work
  • Contributions Towards Stream Mining
  • Stream data clustering
  • Bayesian reasoning for sensor stream processing
  • Contribution Summary
  • Future work

29
Bayesian Reasoning for Sensor Data Processing
  • Users submit queries with precision constraints
  • Resource conservation is of prime concern to
    prolong system life
  • Data acquisition
  • Data communication

"Find the temperature with 80% confidence"
Use probabilistic models at the central site for approximate predictions, preventing actual acquisitions.
30
Dependencies in Sensor Attributes
Attribute     Acquisition Cost
Temperature   50 J
Voltage       5 J

[Figure: a "Get Temperature" query consults the dependency model (a Bayesian network), which decides "Acquire voltage!"; the cheap voltage reading is acquired and the temperature inferred from it is reported.]
31
Using Correlation Models [Deshpande et al.]
  • Correlation models ignore conditional dependency

Intel Lab (real sensor-network data). Attributes: Voltage (V), Temperature (T), Humidity (H).
Voltage is correlated with temperature.
With humidity restricted to 35–40, voltage is conditionally independent of temperature, given humidity!
[Deshpande et al., VLDB '04]
32
BN vs. Correlations
  • Correlation model [Deshpande et al.]
  • Maintains all dependencies
  • The search space for finding the best alternative sensor attribute is large
  • Joint probability is represented in O(n²) cells

  • Bayesian network
  • Maintains vital dependencies only
  • Lower search complexity: O(n)
  • Storage: O(nd), where d is the average node degree
  • Intuitive dependency structure

[Figures: learned dependency structures for the NDBC buoy dataset and the Intel Lab dataset]
33
Bayesian Networks (BN)
  • Qualitative part: a Directed Acyclic Graph (DAG)
  • Nodes: sensor attributes
  • Edges: attribute influence relationships
  • Quantitative part: Conditional Probability Tables (CPTs)
  • Each node X has its own CPT, P(X | parents(X))
  • Together, the BN represents the joint probability in factored form: P(T,H,V,L) = P(T) P(H|T) P(V|H) P(L|T)
  • The influence relationship is quantified by the conditional entropy function H:
  • H(Xi) = −Σ_{l=1..k} P(Xi = xil) log P(Xi = xil)
  • We learn the BN by minimizing H(Xi | Parents(Xi)).
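To ground the factored form, a toy Python sketch with made-up CPT numbers for the T → H fragment of the chain above; it also computes the conditional entropy used as the learning criterion:

    import math

    P_T = {"hot": 0.3, "cold": 0.7}                       # illustrative CPTs
    P_H_given_T = {"hot":  {"high": 0.8, "low": 0.2},
                   "cold": {"high": 0.4, "low": 0.6}}

    def joint(t, h):
        # Factored form: P(T, H) = P(T) * P(H | T)
        return P_T[t] * P_H_given_T[t][h]

    def cond_entropy_H_given_T():
        # H(H | T) = -sum_t P(t) sum_h P(h | t) log P(h | t)
        return -sum(P_T[t] * sum(p * math.log(p) for p in P_H_given_T[t].values())
                    for t in P_T)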

34
System Architecture
[Figure: system architecture linking the query processor, storage, group queries (Q), acquisition plans, and acquired values]
35
Finding the Candidate Attributes
  • For any attribute in the group query Q, analyze candidate attributes in its Markov blanket recursively
  • Selection criteria:
  • Select candidates in a greedy fashion, trading information gain (conditional entropy) against acquisition cost
  • Meet the precision constraints
  • Maximize resource conservation
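A schematic sketch of the greedy selection loop implied by this slide; the model interface (meets_precision, markov_blanket_candidates, info_gain, cost) is hypothetical:

    def pick_candidates(query_attrs, model):
        # Greedily acquire cheap, informative attributes until every queried
        # attribute meets its precision constraint.
        acquired = set()
        while not all(model.meets_precision(a, acquired) for a in query_attrs):
            candidates = model.markov_blanket_candidates(query_attrs, acquired)
            if not candidates:
                break
            # Rank by information gain (conditional-entropy reduction) per joule
            best = max(candidates,
                       key=lambda c: model.info_gain(c, query_attrs) / model.cost(c))
            acquired.add(best)
        return acquired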
36
Experiments Resource Conservation
NDBC dataset, 7 attributes
[Figures: effect of using the MB property with δmin = 0.90; effect of using group queries, with Q the group-query size]
37
Results - Selectivity
[Figure: per-attribute selectivity for Wave Period (WP), Wind Speed (SP), Air Pressure (AP), Wind Direction (DR), Water Temperature (WT), Wave Height (WH), and Air Temperature (AT)]
38
In Retrospect
  • Bayesian networks can encode the sensor
    dependencies effectively
  • Our method provides significant resource
    conservation for group-queries

39
Contribution Summary
  • Adaptive stream resource management using Kalman filters [SIGMOD '04]
  • Adaptive sampling for sensor networks [DMSN '04]
  • Adaptive non-linear clustering for data streams [CIKM '06]
  • Using stationary-dynamic camera assemblies for wide-area video surveillance and selective attention [CVPR '06]
  • Filtering the data streams (in submission)
  • Efficient diagnostic and aggregate queries on sensor networks (in submission)
  • OCODDS: an on-line change-over detection framework for tracking evolutionary changes in data streams (in submission)

40
Future Work
  • Develop non-linear techniques for capturing
    temporal correlations in data streams
  • The Bayesian framework can be extended to address
    what-if queries with counterfactual evidence
  • The clustering framework can be extended for
    developing stream visualization systems
  • Incremental EVD techniques can improve the
    performance further

41
  • Thank You !

42
  • BACKUP SLIDES!

43
Back to Stream Clustering
  • We propose a 2-tier stream clustering framework
  • Tier 1: a kernel method that continuously divides the stream into segments
  • Tier 2: a kernel method that uses the segments to project data into a low-dimensional space (LDS)
  • The fading clusters reside in the LDS

44
Clustering: LDS Projection
45
Clustering: LDS Update
46
Network Intrusion Stream
[Figures: clustering accuracy and cluster strengths at LDS dimensionality u = 10]
47
Effect of dimensionality
48
Query Plan Generation
  • Given a group query, the query plan computes the candidate attributes that will actually be acquired to successfully address the query.
  • We exploit the Markov blanket (MB) property to select candidate attributes.
  • Given a BN G, the Markov blanket MB(Xi) of a node Xi comprises its parents, its children, and its children's other parents.
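A minimal sketch of extracting a Markov blanket from a DAG stored as parent sets, under the standard parents/children/co-parents definition; the example graph reuses the chain from slide 33:

    def markov_blanket(node, parents):
        # parents: dict mapping each node to the set of its parents in the DAG
        children = {n for n, ps in parents.items() if node in ps}
        co_parents = {p for c in children for p in parents[c]} - {node}
        return parents[node] | children | co_parents

    dag = {"T": set(), "H": {"T"}, "V": {"H"}, "L": {"T"}}
    assert markov_blanket("H", dag) == {"T", "V"}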

49
Exploiting the MB Property
  • Given a node Xi and an arbitrary set of nodes Y ⊆ X − {Xi} in a BN, the conditional entropy of Xi given Y is at least as high as that given its Markov blanket: H(Xi | Y) ≥ H(Xi | MB(Xi)).
  • Proof: Split MB(Xi) into two parts, MB1 = MB(Xi) ∩ Y and MB2 = MB(Xi) − MB1, and denote Z = Y − MB(Xi). Then:
  • H(Xi | Y) = H(Xi | Z, MB1)   [Y = Z ∪ MB1]
  • ≥ H(Xi | Z, MB1, MB2)   [additional information cannot increase entropy]
  • = H(Xi | Z, MB(Xi))   [MB(Xi) = MB1 ∪ MB2]
  • = H(Xi | MB(Xi))   [Markov blanket definition]

50
Bayesian Reasoning: More Results
[Figures: effect of using the MB property with δmin = 0.90; query-answer quality loss on a 50-node synthetic-data BN]
51
Bayesian Reasoning for Group Queries
  • More accurate in addressing group queries
  • Q = {(Xi, δi) : Xi ∈ X ∧ (0 < δi ≤ 1) ∧ (1 ≤ i ≤ n)}, s.t. δi < max_l P(Xi = xil)
  • X = {X1, X2, X3, ..., Xn}: sensor attributes
  • δi: confidence parameters
  • P(Xi = xil): the probability with which Xi assumes the value xil
  • Bayesian reasoning is helpful in detecting abnormalities
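The confidence condition above suggests a simple answerability test: the model alone suffices when each queried attribute's most likely value already clears its bound. A sketch, with predict as a hypothetical model call returning a {value: probability} map:

    def answerable_from_model(query, predict):
        # query: iterable of (attribute, delta_i) pairs
        return all(max(predict(attr).values()) > delta for attr, delta in query)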

52
Bayesian Reasoning: Candidate Attribute Selection Algorithm
[Figure: the candidate-attribute selection algorithm]