Title: Data Stream Mining with Extensible Markov Model
1Data Stream Mining with Extensible Markov Model
- Yu Meng, Margaret H. Dunham, F. Marco Marchetti,
- Jie Huang, Charlie Isaksson
- October 18, 2006
2Outline
- Data Stream Mining
- EMM Framework
- EMM Applications
- Future Work
- Conclusions
3Data Mining
- Is the process of automatically searching large volumes of data for nontrivial, hidden, previously unknown, and potentially useful information (interrelations in the data)
- Also called Knowledge Discovery in Databases (KDD) or Knowledge Discovery and Data Mining
- Classification (Yahoo news, finance, etc.)
- Clustering (types of customers in online purchases)
- Association (Market Basket Analysis)
4Classification
- Given a collection of records (the training set)
- Each record contains a set of attributes; one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
- Decision tree, neural network, naïve Bayes, etc.
- Classification is a supervised learning process.
5Illustrating Classification Task
6Clustering
- Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
- Clustering is an unsupervised learning process.
7Association Rule Mining
- Given a set of transactions, find rules that will
predict the occurrence of an item based on the
occurrences of other items in the transaction
Market-Basket transactions
Example of Association Rules
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Implication means co-occurrence, not causality!
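A rule such as {Diaper} → {Beer} is usually scored by how often its items co-occur. The sketch below computes two standard scores, support and confidence, which the slide does not name explicitly. The transaction list and the function names are hypothetical and chosen only for illustration.

```python
from typing import FrozenSet, Iterable, List

# Hypothetical market-basket transactions (illustrative only, not from the slides).
TRANSACTIONS: List[FrozenSet[str]] = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diaper", "Beer", "Eggs"}),
    frozenset({"Milk", "Diaper", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diaper", "Beer"}),
    frozenset({"Bread", "Milk", "Diaper", "Coke"}),
]


def count(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> int:
    """Number of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t)


def support(itemset: Iterable[str], transactions: List[FrozenSet[str]]) -> float:
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)


def confidence(antecedent: Iterable[str], consequent: Iterable[str],
               transactions: List[FrozenSet[str]]) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent. Returns 0.0 if the antecedent never occurs."""
    base = count(antecedent, transactions)
    if base == 0:
        return 0.0
    both = count(frozenset(antecedent) | frozenset(consequent), transactions)
    return both / base


if __name__ == "__main__":
    print(support({"Diaper", "Beer"}, TRANSACTIONS))
    print(confidence({"Diaper"}, {"Beer"}, TRANSACTIONS))
```

Both scores only measure co-occurrence in the data, which is why a high-confidence rule still says nothing about causality.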
8Why Data Stream Mining?
- A growing number of applications generate streams of data.
- Computer network monitoring data (IEPM-BW 2004, Abilene 2005)
- Call detail records in telecommunications (Cisco VoIP data 2003)
- Highway transportation traffic data (MnDot 2005)
- Online web purchase log records (JcPenny data 2003)
- Sensor network data (Ouse, Serwent 2002)
- Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.
9What do we see from the data streams?
- Characteristics of data streams
- Records may arrive at a rapid rate
- High volume (possibly infinite) of continuous data
- Concept drift: the data distribution changes on the fly
- Data are raw
- Multidimensional
- Spatiality, temporality
10What do we see from the data streams?
- Requirements
- Highly efficient computation and processing of the input streams in terms of both time and space: soft real-time and scalability.
- Seek needles in a haystack: rare event detection.
Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005
Keogh, ICDM04
11What do we see from the data streams?
- Stream processing restrictions
- Single pass: each record is examined at most once
- Bounded storage: limited memory to be used
- Real-time: per-record processing time must be low
- Incremental responses to queries
- Our solution
- Data modeling (global synopsis)
- Mining of local patterns based on the synopsis
- Incremental, scalable algorithms
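To make the single-pass, bounded-storage, and low per-record-cost restrictions concrete, here is a minimal sketch of a running synopsis that keeps only three numbers however long the stream is. It uses Welford's online mean/variance update, a generic streaming technique chosen for illustration; it is not part of EMM, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Constant-size synopsis of a numeric stream (Welford's algorithm)."""
    n: int = 0          # number of records seen so far
    mean: float = 0.0   # running mean
    m2: float = 0.0     # running sum of squared deviations from the mean

    def update(self, x: float) -> None:
        """Consume one record in O(1) time; the record is never revisited."""
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        """Population variance of the records seen so far (0.0 if empty)."""
        return self.m2 / self.n if self.n > 0 else 0.0


if __name__ == "__main__":
    stats = RunningStats()
    for value in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
        stats.update(value)      # single pass over the stream
    print(stats.n, stats.mean, stats.variance)
```

Because the synopsis is updated incrementally, a query can be answered at any point in the stream without storing or rescanning earlier records.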
12Extensible Markov Model
- To develop a new data mining framework to model spatiotemporal data streams and mine interesting local patterns.
- Assumptions about the data
- Data are collected in discrete time intervals,
- Data are in a structured format,
- Data are multidimensional,
- Data hold an approximation of the Markov property.
13Extensible Markov Model
- Capabilities of the technique
- Soft real-time processing (incremental)
- Global modeling capability (scalable, synopsis)
- Local pattern finding capability (mining performed on the synopsis)
- Adaptive to concept changes
- Rare event detection
14Outline
- Introduction
- EMM Framework
- EMM Applications
- Future Work
- Conclusions
15EMM: An Overview
- Motivation of EMM
- A Markov process is a random process satisfying the Markov property. A Markov chain is a Markov process with discrete states.
- Clustering: determine representative granules in the data space.
- Static Markov chain - dynamic Markov chain
- Map a cluster into a state in the Markov chain
- What is EMM: a data mining framework that models spatiotemporal data streams and is employed for local pattern detection.
- EMM models a data stream by interleaving a clustering algorithm with a dynamic Markov chain.
- EMM applies a series of efficient algorithms to mine interesting patterns from the modeled data (synopsis).
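The interleaving of clustering with a dynamic Markov chain can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes Euclidean distance, a fixed distance threshold for opening a new cluster, and a cluster representative that stays fixed at the first point assigned to it. The class name `SimpleEMM` and its attributes are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, List, Optional, Sequence, Tuple


class SimpleEMM:
    """Toy sketch: threshold-based nearest-neighbor clustering whose
    clusters double as Markov chain states."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold                      # assumed cluster radius
        self.centers: List[Tuple[float, ...]] = []      # one representative per state
        self.node_count: Dict[int, int] = defaultdict(int)              # visits per state
        self.link_count: Dict[Tuple[int, int], int] = defaultdict(int)  # visits per transition
        self.current: Optional[int] = None              # state of the previous record

    def _nearest(self, x: Sequence[float]) -> Tuple[Optional[int], float]:
        """Linear scan over the existing states for the closest representative."""
        best, best_d = None, math.inf
        for i, c in enumerate(self.centers):
            d = math.dist(x, c)
            if d < best_d:
                best, best_d = i, d
        return best, best_d

    def increment(self, x: Sequence[float]) -> int:
        """Process one record: assign it to a state (creating one if no
        existing state is within the threshold), then update the counts."""
        state, d = self._nearest(x)
        if state is None or d > self.threshold:
            self.centers.append(tuple(x))
            state = len(self.centers) - 1
        self.node_count[state] += 1
        if self.current is not None:
            self.link_count[(self.current, state)] += 1
        self.current = state
        return state


if __name__ == "__main__":
    emm = SimpleEMM(threshold=1.0)
    for point in [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 0.1)]:
        emm.increment(point)
    print(len(emm.centers), dict(emm.node_count), dict(emm.link_count))
```

The model grows only when a record falls outside every existing cluster, so the number of states depends on the threshold rather than on the length of the stream.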
16EMM Overview
- EMM Clustering Algorithms
- Nearest neighbor: O(m)
- Hierarchical clustering: O(log m)
- EMM Building Algorithms: O(1)
- EMMIncrement algorithm
- EMMDecrement algorithm
- EMMMerge algorithm
- EMMSplit algorithm
- EMM Application Algorithms: O(1)
- Predictions
- Anomaly detection
- Risk Assessment
- Emerging Event Finding
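As an illustration of the prediction application, the sketch below turns per-state transition counts into probabilities and picks the most likely next state. It assumes the counts are stored as a nested dictionary and that probabilities are normalized by the total outgoing count of the current state; both choices, the function names, and the sample counts are hypothetical.

```python
from typing import Dict, Optional

# links[i][j] = number of observed transitions from state i to state j.
Links = Dict[int, Dict[int, int]]


def transition_probabilities(links: Links, state: int) -> Dict[int, float]:
    """Estimated probability of each successor of `state`
    (empty if the state has no outgoing transitions)."""
    out = links.get(state, {})
    total = sum(out.values())
    if total == 0:
        return {}
    return {j: c / total for j, c in out.items()}


def predict_next(links: Links, state: int) -> Optional[int]:
    """Most likely successor of `state`, or None if nothing is known."""
    probs = transition_probabilities(links, state)
    if not probs:
        return None
    return max(probs, key=probs.get)


if __name__ == "__main__":
    # Hypothetical counts for a three-state model.
    links: Links = {0: {1: 7, 2: 3}, 1: {0: 5}, 2: {}}
    print(transition_probabilities(links, 0))
    print(predict_next(links, 0), predict_next(links, 2))
```

The lookup touches only the outgoing links of one state, so its cost depends on that state's out-degree and not on the amount of data seen.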
17EMM Components and Workflow
- Flexibility
- Modularization
- It models while executing applications
18EMM: A Walk Through
EMM Building
19EMM: A Walk Through
CL111
CN11
EMM Building
20EMM: A Walk Through
EMM Building
21EMM: A Walk Through
EMM Applications
EMM Building
22EMM: A Walk Through
EMM Applications
EMM Building
23EMM: A Walk Through
EMM Applications
EMM Building
24More Issues of EMM
- Label of Nodes
- Cluster feature
- LS Medoid or Centroid
- Label of Links
-
- Calibration of Granularity of Clusters
- Determine threshold using Markov property
- Parameter-free modeling (Keogh, KDD04)
25Modeling Performance
- Growth rate of EMM states (Matlab as a testbed)
- Sublinear growth of number of states
- Growth rate decreases
- Memory usage: 0.02-0.04 of data size for Ouse, Serwent, and MnDot.
- Time efficiency
- Clustering: O(m) vs. O(log m)
- Markov chain: O(1)
- Continued learning
26Outline
- Introduction
- EMM Framework
- EMM Applications
- Anomaly detection
- Risk Assessment
- Emerging Event Finding
- Future Work
- Conclusions
27EMM Application: Anomaly Detections
- Problem: compare a synopsis representing normal behavior to actual behavior. Any deviation is flagged as a potentially interesting pattern.
- Also known as the Positive Security Model (http://www.imperva.com)
- Assume that everything that deviates from normal is bad.
- Methodology: concepts and rules
- Cardinality of nodes and links
- Normalized Occurrence Frequency and Normalized Transition Probability
- Performance metric: detection rate TP/(TP+TN)
- Plus: has the potential to detect interesting patterns of all kinds, including "unknown" patterns
- Minus: can lead to a high false alarm rate.
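The slide names Normalized Occurrence Frequency and Normalized Transition Probability but does not give their formulas. The sketch below assumes one plausible reading: each count is divided by the total number of state visits, and a transition is flagged when either value falls below a user-chosen threshold. The definitions, thresholds, function names, and sample counts are all assumptions made for illustration.

```python
from typing import Dict, Tuple

NodeCounts = Dict[int, int]               # state -> number of visits
LinkCounts = Dict[Tuple[int, int], int]   # (src, dst) -> number of transitions


def occurrence_frequency(node_count: NodeCounts, state: int) -> float:
    """Assumed definition: visits to `state` divided by all state visits."""
    total = sum(node_count.values())
    return node_count.get(state, 0) / total if total else 0.0


def normalized_transition_probability(link_count: LinkCounts, node_count: NodeCounts,
                                      src: int, dst: int) -> float:
    """Assumed definition: count of the src->dst link divided by all state visits."""
    total = sum(node_count.values())
    return link_count.get((src, dst), 0) / total if total else 0.0


def is_anomaly(node_count: NodeCounts, link_count: LinkCounts, src: int, dst: int,
               of_threshold: float, tp_threshold: float) -> bool:
    """Flag the transition src->dst if the destination state is rare
    or the transition itself is rare (an unseen state or link counts as rare)."""
    rare_state = occurrence_frequency(node_count, dst) < of_threshold
    rare_link = normalized_transition_probability(
        link_count, node_count, src, dst) < tp_threshold
    return rare_state or rare_link


if __name__ == "__main__":
    # Hypothetical synopsis of "normal" behavior.
    nodes: NodeCounts = {0: 90, 1: 9, 2: 1}
    links: LinkCounts = {(0, 0): 80, (0, 1): 9, (1, 0): 9, (0, 2): 1, (2, 0): 1}
    for src, dst in [(0, 0), (0, 1), (0, 2), (1, 1)]:
        print(src, dst, is_anomaly(nodes, links, src, dst, 0.05, 0.05))
```

Lower thresholds flag fewer transitions, which trades detection rate against the false alarm rate noted on the slide.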
28EMM Application: Anomaly Detections
29EMM Application: Anomaly Detections
30EMM Application: Risk Assessment
- Problem: mitigate the false alarm rate while maintaining a high detection rate.
- Methodology
- Historic feedback can be used as a free resource to take out some possibly safe anomalies
- Combine the anomaly detection model and users' feedback.
- Risk level index
- Evaluation metrics: detection rate, false alarm rate.
- Results and discussions
- 98% of the alarm incidents in most communities are false alarms, which distract law enforcement from real public safety responses (PurvisGary, http://www.falsealarmreduction.com/).
Detection rate = TP/(TP+TN)
False alarm rate = FP/(TP+FP)
31EMM Application: Risk Assessment
32EMM Application: Risk Assessment
33EMM Application: Risk Assessment
34EMM Application: Emerging Events
- Problem: model dynamically changing spatiotemporal data series. Find emerging events that represent new and significant trends.
- How to delete obsolete nodes?
- How to identify the new trend at an early time?
- Methodology
- Sliding window: EMMDelete
- Decay of importance: Aging Score
- Extended Cluster Feature
- Extended Transition Labeling
- Emerging events
- Results and discussions
- O(1)
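The slide mentions an aging score for the decay of importance but does not state its formula. The sketch below assumes exponential decay per time step, which allows the score to be maintained with a constant-time update per observation. The decay form, the class name, and the parameter values are assumptions for illustration only.

```python
from typing import Optional


class AgingScore:
    """Sketch of a decayed visit count for one state or link: each
    observation adds 1, and old observations fade by `decay` per time step."""

    def __init__(self, decay: float) -> None:
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.score = 0.0
        self.last_time: Optional[int] = None   # time of the latest observation

    def value_at(self, now: int) -> float:
        """Score as seen at time `now`, without recording an observation."""
        if self.last_time is None:
            return 0.0
        return self.score * self.decay ** (now - self.last_time)

    def observe(self, now: int) -> None:
        """Record one observation at time `now` in O(1) time."""
        self.score = self.value_at(now) + 1.0
        self.last_time = now


if __name__ == "__main__":
    aging = AgingScore(decay=0.5)
    aging.observe(0)
    aging.observe(1)
    print(aging.value_at(1), aging.value_at(3))
```

Under this assumption, a state whose score has decayed close to zero is a candidate for deletion as obsolete, while a recently created state whose score rises quickly is a candidate emerging event.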
35EMM Application: Emerging Events
[Figure: EMM graph annotated with numeric labels between 0.3 and 1.0]
36EMM Application: Emerging Events
37Outline
- Introduction
- EMM Framework
- EMM Applications
- Future Work
- Conclusions
38Future Work: Adaptive EMM
- Adaptive EMM
- Motivation: modeling a dynamically changing data profile requires changing the cluster granularity.
- Our proposed methodology: a local ensemble of EMMs
- One main EMM and two ancillary EMMs (fewer descriptors),
- Compare the performance of the three EMMs,
- Switch the main EMM
- Create a new ancillary EMM based on the new main EMM (faster time-to-mature).
- New algorithms are needed
- EMMSplit
- EMMMerge
39Future Work: Hierarchical EMM
- Hierarchical EMM: the logical geographic area under consideration will be divided into virtual regions. A high-level EMM is an agglomeration of lower-level EMMs.
- Parallel EMM: a high-level EMM is a summary of lower-level EMMs with the same features/attributes.
- Heterogeneous EMM: a lower-level EMM is a feature of the higher-level EMM.
- Recursive EMM: a lower-level EMM represents one or several sub-states of the higher-level EMM.
40Conclusions
- EMM is an efficient, modularized, flexible data mining framework suitable for spatiotemporal data stream processing.
- It has a series of applications.
- EMM complies with current research trends and demanding techniques.
- EMM is innovative.
- List of Publications.
41Related Publications
- Yu Meng and Margaret H. Dunham, "Mining Developing Trends of Dynamic Spatiotemporal Data Streams", Journal of Computers, Vol. 1, No. 3, Academy Publisher, 2006.
- Charlie Isaksson, Yu Meng and Margaret H. Dunham, "Risk Leveling of Network Traffic Anomalies", Int'l Journal of Computer Science and Network Security (IJCSNS), Vol. 6, No. 6, 2006.
- Yu Meng and Margaret H. Dunham, "Online Mining of Risk Level of Traffic Anomalies with User's Feedbacks", in Proceedings of the Second IEEE International Conference on Granular Computing (GrC'06), Atlanta, GA, May 10-12, 2006.
- Y. Meng, M.H. Dunham, F.M. Marchetti, and J. Huang, "Rare Event Detection in A Spatiotemporal Environment", in Proceedings of the Second IEEE International Conference on Granular Computing (GrC'06), Atlanta, GA, May 10-12, 2006.
- Yu Meng and Margaret H. Dunham, "Efficient Mining of Emerging Events in A Dynamic Spatiotemporal Environment", in Proceedings of the Tenth Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD 2006), Singapore, April 9-12, 2006, Springer LNCS Vol. 3918.
- M.H. Dunham, Y. Meng, and J. Huang, "Extensible Markov Model", in Proceedings of the 4th IEEE International Conference on Data Mining (ICDM'04), Brighton, UK, November 1-4, 2004.
42Thank you