Chap. 8 Mining Stream, Time-Series, and Sequence Data

Slides: 52

Provided by: jiaw185

1
Chap. 8 Mining Stream, Time-Series, and Sequence
Data

Data Mining

2
Characteristics of Data Streams

Data Streams
Traditional DBMS - data stored in finite,
persistent data sets
Data streams - continuous, ordered, changing,
fast, huge amount
Characteristics
Fast changing and requires fast, real-time
response
Random access is expensive - single scan
algorithm (can only have one look)
Store only the summary of the data seen thus far
Most stream data are low-level or
multi-dimensional in nature and need multi-level,
multi-dimensional processing

3
Stream Data Applications

Telecommunication: calling records
Business: credit card transaction flows
Financial market: stock exchange
Computer network: network traffic monitoring
Sensor network: sensor data streams
Security: monitoring video streams
Web: click streams, Web logs

4
DBMS versus DSMS

DBMS                          DSMS
Persistent relations          Transient streams
One-time queries              Continuous queries
Random access                 Sequential access
Unbounded disk store          Bounded main memory
Only current state matters    Historical data is important
No real-time services         Real-time requirements
Relatively low update rate    Possibly multi-GB arrival rate
Data at any granularity       Data at fine granularity
Assume precise data           Data stale/imprecise

5
Stream Query Processing
(Figure: multiple streams flow into a Stream Query Processor that uses
scratch space (main memory and/or disk); the user/application submits a
continuous query and receives results.)
6
Challenges of Stream Data Processing

Multiple, continuous, rapid, time-varying,
ordered streams
Main memory computations
Queries are often continuous
Evaluated continuously as stream data arrives
Answer updated over time
Queries are often complex
Beyond element-at-a-time processing
Beyond stream-at-a-time processing
Beyond relational queries (scientific, data
mining, OLAP)
Multi-level/multi-dimensional processing and data
mining
Most stream data are at low-level or
multi-dimensional in nature

7
Methodologies for Stream Data Processing

Major challenges
Keep track of a large universe, e.g., pairs of IP
addresses
Methodology
Use synopsis data structures, much smaller
(O(log^k N) space) than their base data set
(O(N) space)
Compute an approximate answer within a small
error range (factor ε of the actual answer)
Major methods
Random sampling
Histograms
Sliding windows
Multi-resolution model
Sketches
Randomized algorithms
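
Random sampling, the first method in the list above, is often done with reservoir sampling when the stream length is unknown. The slides do not name a specific sampling scheme, so the following is only a minimal sketch of that one standard technique; the function name and the `rng` parameter are illustrative assumptions.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Return k items chosen uniformly at random from a single pass over stream."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir with the first k items
        else:
            j = rng.randint(0, i)         # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item       # keep the new item with probability k/(i+1)
    return reservoir

sample = reservoir_sample(range(10_000), 5, random.Random(42))
print(sample)
```

Only k items are ever held in memory, which matches the "store only the summary" constraint of stream processing.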

8
Stream Data Mining

Stream mining: a more challenging task
It shares most of the difficulties with stream
querying
But often requires less precision
Patterns are hidden and more general than
querying
It may require exploratory analysis
Not necessarily continuous queries
Stream data mining tasks
Multi-dimensional on-line analysis of streams
Mining outliers and unusual patterns in stream
data
Clustering data streams
Classification of stream data

9
Multi-Dimensional Stream Analysis Examples

Analysis of Web click streams
Raw data at low levels: seconds, web page
addresses, user IP addresses, ...
Analysts want changes, trends, unusual patterns,
at reasonable levels of detail
E.g., average clicking traffic in North America
on sports in the last 15 minutes is 40% higher
than that in the last 24 hours.
Analysis of power consumption streams
Raw data: power consumption flow for every
household, every minute
Patterns one may find: average hourly power
consumption for manufacturing companies in
Chicago in the last 2 hours today is up 30%
over that of the same day a week ago

10
A Stream Cube Architecture

A tilted time frame
Different time granularities
second, minute, quarter, hour, day, week, ...
Critical layers
Minimum interest layer (m-layer)
Observation layer (o-layer)
User watches at the o-layer and occasionally
needs to drill down to the m-layer
Partial materialization of stream cubes
Full materialization: too space- and
time-consuming
No materialization: slow response at query time
Partial materialization: what do we mean by
"partial"?

11
A Tilted Time Model

Natural tilted time frame
Example: minimal unit is a quarter, then
4 quarters → 1 hour, 24 hours → 1 day, ...
Logarithmic tilted time frame
Example: minimal unit is 1 minute, then 1, 2, 4,
8, 16, 32, ...
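
A natural tilted time frame can be sketched as a small roll-up structure. This is a simplified illustration, not the deck's implementation: it assumes an additive measure (such as a click count), caps the coarsest level at 31 days, and clears the finer slots when they roll up, whereas a real stream cube may retain them. The class and method names are hypothetical.

```python
from collections import deque

class NaturalTiltedTimeFrame:
    """Additive measure kept as quarters -> hours -> days (simplified sketch)."""

    # (level name, number of units that roll up into one unit of the next level)
    LEVELS = (("quarter", 4), ("hour", 24), ("day", 31))

    def __init__(self):
        self.slots = {name: deque() for name, _ in self.LEVELS}

    def add_quarter(self, value):
        """Register the aggregate for one finished quarter-hour."""
        self._push(0, value)

    def _push(self, level, value):
        name, capacity = self.LEVELS[level]
        slots = self.slots[name]
        slots.append(value)
        if level == len(self.LEVELS) - 1:
            if len(slots) > capacity:
                slots.popleft()            # forget the oldest day
        elif len(slots) == capacity:
            total = sum(slots)             # e.g. 4 quarters -> 1 hour
            slots.clear()
            self._push(level + 1, total)

frame = NaturalTiltedTimeFrame()
for _ in range(4 * 24 + 5):                # one full day plus five more quarters
    frame.add_quarter(10)
snapshot = {name: list(values) for name, values in frame.slots.items()}
print(snapshot)  # {'quarter': [10], 'hour': [40], 'day': [960]}
```

Recent data stay at fine granularity while older data survive only as coarse aggregates, so the number of stored slots stays small no matter how long the stream runs.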

12
Two Critical Layers
o-layer (observation): (*, theme, quarter)
m-layer (minimal interest): (user-group, URL-group, minute)
(primitive) stream data layer: (individual-user, URL, second)
13
Partial Materialization

On-line materialization
Materialization takes precious space and time
Only incremental materialization (with tilted
time frame)
Only materialize cuboids of the critical
layers?
Online computation may take too much time
Preferred solution
Popular-path approach: materialize the cuboids
along the popular drilling paths
H-tree structure: such cuboids can be computed
and stored efficiently using the H-tree structure

14
Stream Cube Structure: From m-layer to o-layer
15
Frequent Patterns for Stream Data

Frequent pattern mining is valuable in stream
applications
e.g., network intrusion mining
Mining precise frequent patterns in stream data
Unrealistic even if we store them in a
compressed form, such as an FP-tree
→ Mine approximate frequent patterns instead
Mining the evolution of frequent patterns
Use a tilted time window frame
Mine evolution and dramatic changes of frequent
patterns

16
Mining Approximate Frequent Patterns

Approximate answers are often sufficient
Example: a router is interested in all flows
whose frequency is at least 1% (s) of the entire
traffic stream seen so far, and feels that an
error of 1/10 of s (ε = 0.1%) is comfortable
Lossy Counting Algorithm
Major idea: do not trace an item until it becomes
frequent
Advantage: guaranteed error bound
Disadvantage: keeps a large set of traces

17
Lossy Counting
Divide the stream into buckets (bucket size is
1/ε, e.g., 1000)
18
Lossy Counting

First bucket

19
Lossy Counting

Next bucket

20
Approximation Guarantee

Given (1) support threshold s, (2) error
threshold ε, and (3) stream length N
Output: items with frequency counts exceeding
(s − ε)N
How much do we undercount?
If the stream length seen so far is N and the
bucket size is 1/ε, then
frequency count error ≤ number of buckets = εN
Approximation guarantee
No false negatives
False positives have true frequency count at
least (s − ε)N
Frequency count underestimated by at most εN

21
Lossy Counting Example
N = 1000, s = 0.1, ε = 0.01, bucket size = 100 → count error ≤ 10
(if the stored count is 50, the actual count is between 50 and 60)
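
The single-item version of Lossy Counting can be sketched as follows, using the same N = 1000, s = 0.1, ε = 0.01 setting. This is a minimal sketch of the published algorithm as commonly described (each entry keeps a count f and a maximum undercount delta); the function names and the synthetic stream are assumptions made for illustration.

```python
import math

def lossy_count(stream, epsilon):
    """One pass of Lossy Counting; returns (counts, n) with counts[item] = (f, delta)."""
    width = math.ceil(1 / epsilon)             # bucket size 1/epsilon
    counts = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)          # id of the current bucket
        if item in counts:
            f, delta = counts[item]
            counts[item] = (f + 1, delta)
        else:
            counts[item] = (1, bucket - 1)     # item may have been dropped in earlier buckets
        if n % width == 0:                     # bucket boundary: prune rare items
            counts = {e: (f, d) for e, (f, d) in counts.items() if f + d > bucket}
    return counts, n

def frequent_items(counts, n, s, epsilon):
    """Items whose stored count reaches (s - epsilon) * n."""
    return {e: f for e, (f, _) in counts.items() if f >= (s - epsilon) * n}

# Synthetic stream: "a" is 20% of the traffic, "b" is 10%, everything else is unique.
stream = ["a" if i % 5 == 0 else "b" if i % 10 == 1 else f"x{i}" for i in range(1000)]
counts, n = lossy_count(stream, epsilon=0.01)
result = frequent_items(counts, n, s=0.1, epsilon=0.01)
print(result)  # {'a': 200, 'b': 100}
```

The 700 one-off items are all pruned at bucket boundaries, so the summary ends with two entries instead of 702.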
22
Lossy Counting For Frequent Itemsets
Divide the stream into buckets as for frequent
items, but fill as many buckets as possible into
main memory at one time
If we put 3 buckets of data into main memory at
one time, then decrease each frequency count by 3
23

Lossy Counting For Frequent Itemsets
Itemset ( ) is deleted. Choosing a large
number of buckets deletes more itemsets.
24
Lossy Counting For Frequent Itemsets
Pruning itemsets (Apriori rule): if we find that
itemset ( ) is not a frequent itemset, then
we need not consider its supersets
25
Classification for Data Streams

Decision tree induction for stream data
classification
VFDT(Very Fast Decision Tree) / CVFDT
Other stream classification methods
Instead of decision-trees, consider other models
Naïve Bayesian
Ensemble
Tilted time framework, incremental updating,
dynamic maintenance, and model construction
Comparing models to find changes

26
Hoeffding Tree

Only uses a small sample
Based on the Hoeffding bound principle
Hoeffding bound (additive Chernoff bound)
r: random variable
R: range of r
n: number of independent observations
Mean of r is at least r_avg − ε, with probability
1 − δ
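
The Hoeffding bound is usually written ε = sqrt(R² ln(1/δ) / (2n)). The sketch below evaluates it and inverts it to estimate how many observations are needed for a target ε; the function names and the example values (R = 1, δ = 10⁻⁷) are illustrative assumptions.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """epsilon such that the true mean is at least the sample mean minus epsilon,
    with probability 1 - delta, after n independent observations."""
    return math.sqrt(value_range ** 2 * math.log(1 / delta) / (2 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound drops to epsilon or below."""
    return math.ceil(value_range ** 2 * math.log(1 / delta) / (2 * epsilon ** 2))

eps_1000 = hoeffding_bound(1.0, 1e-7, 1000)
n_needed = examples_needed(1.0, 1e-7, 0.05)
print(round(eps_1000, 4), n_needed)
```

The bound shrinks with the square root of n and does not depend on the data distribution, which is why a small sample can be enough to choose a split attribute.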

27
Hoeffding Tree

Input
S: sequence of examples
X: attributes
G( ): evaluation function, e.g., information gain
δ: desired accuracy
Hoeffding Tree Algorithm
for each example in S
  retrieve G(Xa) and G(Xb) // two highest G(Xi)
  if ( G(Xa) − G(Xb) > ε )
    split on Xa
    recurse to next node
    break

28
Hoeffding Tree
29
Hoeffding Tree

Strengths
Scales better than traditional methods
Sublinear with sampling
Very small memory utilization
Incremental
Make class predictions in parallel
New examples are added as they come
Weakness
Could spend a lot of time with ties
Memory used with tree expansion
Number of candidate attributes

30
VFDT

VFDT (Very Fast DT) - Modifications to Hoeffding
Tree
Near-ties broken more aggressively
G computed every n_min examples
Deactivates certain leaves to save memory
Poor attributes dropped
Initialize with traditional learner (helps
learning curve)
Compare to Hoeffding Tree
Better time and memory
Compare to traditional decision tree
Similar accuracy
Better runtime with 1.61 million examples
21 minutes for VFDT
24 hours for C4.5
Still does not handle concept drift

31
CVFDT

Concept Drift
Time-changing data streams
Incorporate new and eliminate old
CVFDT (Concept-adapting VFDT)
Increments count with new example
Decrement old example
Sliding window
Nodes assigned monotonically increasing IDs
Grows alternate subtrees
When the alternate subtree is more accurate → replace the old one
O(w) better runtime than VFDT-window

32
Ensemble of Classifiers Algorithm

Method
Train K classifiers from K chunks
For each subsequent chunk
Train a new classifier
Test other classifiers against the chunk
Assign weight to each classifier
Select top K classifiers
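
The chunk-based method above can be sketched in a few lines. This is a toy illustration only: `train_majority` is a stand-in learner invented for the example, weights are plain accuracy on the newest chunk, and the new classifier is scored on its own training chunk (a real system would likely use cross-validation or a held-out part instead).

```python
from collections import Counter

def train_majority(chunk):
    """Toy stand-in learner (an assumption): always predicts the chunk's majority label."""
    label = Counter(y for _, y in chunk).most_common(1)[0][0]
    return lambda x: label

def accuracy(classifier, chunk):
    return sum(classifier(x) == y for x, y in chunk) / len(chunk)

def update_ensemble(ensemble, chunk, k, train=train_majority):
    """Train on the new chunk, reweight every classifier on it, keep the k best."""
    candidates = [clf for clf, _ in ensemble] + [train(chunk)]
    weighted = [(clf, accuracy(clf, chunk)) for clf in candidates]
    weighted.sort(key=lambda pair: pair[1], reverse=True)
    return weighted[:k]

def predict(ensemble, x):
    """Weighted vote over the ensemble members."""
    votes = Counter()
    for clf, weight in ensemble:
        votes[clf(x)] += weight
    return votes.most_common(1)[0][0]

chunks = [
    [(0, "spam")] * 8 + [(1, "ham")] * 2,
    [(0, "spam")] * 6 + [(1, "ham")] * 4,
    [(0, "ham")] * 9 + [(1, "spam")] * 1,    # concept drift: the majority label flips
]
ensemble = []
for chunk in chunks:
    ensemble = update_ensemble(ensemble, chunk, k=2)
prediction = predict(ensemble, 0)
print(prediction)  # ham
```

After the drift in the third chunk, the classifiers trained on old data receive low weights, so the vote follows the newest concept.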

33

Clustering Data Streams

Based on the k-median method
Data stream points come from a metric space
Find k clusters in the stream s.t. the sum of
distances from data points to their closest
center is minimized
Constant factor approximation algorithm
For each set of M records, Si, find O(k) centers
in S1, ..., Sl
Local clustering: assign each point in Si to its
closest center
Let S' be the centers for S1, ..., Sl, with each
center weighted by the number of points assigned
to it
Cluster S' to find k centers

34
Mining Time-Series and Sequence Data

Time-series database
Consists of sequences of values/events changing
with time
Data is recorded at regular intervals
Ex: stock prices, power consumption,
precipitation
Sequence database
Database of ordered items
Ex: Web log data - page traversal sequence
Mining time-series and sequence data
Trend analysis
Similarity search
Mining of sequential/periodic patterns

36
Trend analysis

A time series
Illustrated as a time-series graph f(t)
Major components
Long-term (trend) movements
General direction of movement over a long interval
Cyclic movements
Long-term oscillations about the trend curve
Seasonal movements
Identical or nearly identical patterns that appear
annually during a specific period
Irregular movements

37
Estimation of Trend Curve

The least-squares method
Find the best-fitting curve, the one that
minimizes the sum of the squared errors
The moving-average method
Average of n consecutive data values
Smoothing of time series
Ex: stock price graph with 5-day, 20-day, and
60-day averages
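
Both estimators can be sketched briefly. The code assumes equally spaced observations at t = 0, 1, 2, ... and fits a straight line for the least-squares case; the function names and the sample values are illustrative.

```python
def moving_average(values, n):
    """Averages of every window of n consecutive values."""
    if n <= 0 or n > len(values):
        return []
    window_sum = sum(values[:n])
    averages = [window_sum / n]
    for i in range(n, len(values)):
        window_sum += values[i] - values[i - n]   # slide the window by one position
        averages.append(window_sum / n)
    return averages

def least_squares_line(values):
    """Fit y = a + b*t to values observed at t = 0, 1, 2, ...; returns (a, b)."""
    n = len(values)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    numerator = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    denominator = sum((t - t_mean) ** 2 for t in range(n))
    b = numerator / denominator
    return y_mean - b * t_mean, b

prices = [3, 7, 2, 0, 4, 5, 9, 7, 2]
print(moving_average(prices, 3))         # [4.0, 3.0, 2.0, 3.0, 6.0, 7.0, 6.0]
print(least_squares_line([1, 3, 5, 7]))  # (1.0, 2.0)
```

A larger n smooths more but reacts more slowly, which is why charts often overlay several window lengths.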

38
Similarity Search

Similarity search
Finds data sequences that differ slightly from
the given query sequence
Two categories of similarity queries
Whole matching
Find sequences that are similar, as a whole, to
the query sequence
Subsequence matching
Find sequences that contain a subsequence similar
to the query sequence
Typical Applications
Financial market
Scientific databases
Medical diagnosis

39
Data Transformation

Time domain → frequency domain
Many techniques for signal analysis require the
data to be in the frequency domain
Transformations
Discrete Fourier transform (DFT), discrete
wavelet transform (DWT)
The Euclidean distance between two signals in the
time domain equals their Euclidean distance in
the frequency domain
Matching
Construct a multidimensional index using the
first few Fourier coefficients
Subsequence matching
Break each sequence into a set of pieces with a
sliding window of length w, and extract the
features of the subsequence inside each window
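
The distance-preservation property can be checked numerically. The sketch below uses a hand-written unitary DFT (scaled by 1/sqrt(n)) so that it needs only the standard library; the two sample sequences and the choice of three coefficients are arbitrary assumptions.

```python
import cmath
import math

def dft(signal):
    """Unitary discrete Fourier transform (scaled by 1/sqrt(n))."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * f * t / n) for t, x in enumerate(signal)) / math.sqrt(n)
        for f in range(n)
    ]

def euclidean(a, b):
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))

x = [1.0, 2.0, 4.0, 3.0, 2.0, 1.0, 0.0, 1.0]
y = [1.5, 2.5, 3.0, 3.0, 1.0, 1.0, 0.5, 0.0]
X, Y = dft(x), dft(y)

time_distance = euclidean(x, y)
frequency_distance = euclidean(X, Y)            # equal to time_distance up to rounding
truncated_distance = euclidean(X[:3], Y[:3])    # first few coefficients only

print(time_distance, frequency_distance, truncated_distance)
```

Because dropping coefficients can only shrink the distance, a range query on the truncated features may return extra candidates but should not miss a true match; the candidates are then verified against the full sequences.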

40
Enhanced Similarity Search

Allow gaps, differences in offsets or amplitudes
Normalize sequences with amplitude scaling and
offset translation
Two sequences are said to be similar if they have
enough non-overlapping time-ordered pairs of
similar subsequences
Steps for a similarity search
Atomic matching - Find all pairs of gap-free
windows of a small length that are similar
Window stitching - Stitch similar windows to form
pairs of large similar subsequences allowing gaps
Subsequence ordering - Linearly order the
subsequence matches to determine whether enough
similar pieces exist
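
Normalization with offset translation and amplitude scaling can be sketched as below. Subtracting the mean and dividing by the standard deviation is one common choice and is an assumption here; the slides do not say which normalization is used.

```python
import math

def normalize(sequence):
    """Shift to zero mean (offset translation) and scale to unit standard
    deviation (amplitude scaling)."""
    mean = sum(sequence) / len(sequence)
    std = math.sqrt(sum((v - mean) ** 2 for v in sequence) / len(sequence))
    if std == 0:
        return [0.0] * len(sequence)      # a flat sequence has no shape to compare
    return [(v - mean) / std for v in sequence]

a = [10, 12, 14, 12, 10]
b = [105, 115, 125, 115, 105]             # same shape, different offset and amplitude
difference = max(abs(p - q) for p, q in zip(normalize(a), normalize(b)))
print(difference)
```

After normalization the two sequences coincide up to rounding, so windows with the same shape match even when their raw values differ widely.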

41
Sequential Pattern Mining

Mining of frequently occurring patterns
Concentrate on symbolic patterns
Examples
(Buy PC, buy memory)
Applications
Customer retention
Medical treatment
Disaster (e.g. earthquakes), market prediction
Weblog click stream analysis
Methods for sequential pattern mining
Variations of Apriori-like algorithms

42
Sequential Pattern Mining

Given a set of sequences, find the complete set
of frequent subsequences

A sequence: <(ef)(ab)(df)cb>
A sequence database
An element may contain a set of items. Items
within an element are unordered and we list them
alphabetically.
<a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>
Given support threshold min_sup = 2, <(ab)c> is
one of the sequential patterns
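
The subsequence test and support count can be sketched as follows. The sequence database table is not reproduced in this transcript, so the four sequences below are an assumption: two of them appear on the slide, and the other two are taken from the commonly used textbook example. The helper names are hypothetical.

```python
def parse(text):
    """Parse the slide notation, e.g. 'a(bc)dc' -> [{'a'}, {'b','c'}, {'d'}, {'c'}]."""
    elements, i = [], 0
    while i < len(text):
        if text[i] == "(":
            end = text.index(")", i)
            elements.append(frozenset(text[i + 1:end]))
            i = end + 1
        else:
            elements.append(frozenset(text[i]))
            i += 1
    return elements

def is_subsequence(pattern, sequence):
    """True if each pattern element is a subset of a distinct, later element of sequence."""
    position = 0
    for element in pattern:
        # Greedily take the earliest element of sequence that contains this one.
        while position < len(sequence) and not element <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1                     # the next pattern element must match further right
    return True

def support(pattern, database):
    return sum(is_subsequence(pattern, seq) for seq in database)

# Assumed database (see the note above).
database = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)", "(ef)(ab)(df)cb", "eg(af)cbc"]]

contained = is_subsequence(parse("a(bc)dc"), parse("a(abc)(ac)d(cf)"))
sup = support(parse("(ab)c"), database)
print(contained, sup)  # True 2
```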
43
GSP: Generalized Sequential Pattern Mining

Outline of the method
Initially, every item in DB is a candidate of
length 1
for each level (i.e., sequences of length k) do
scan database to collect support count for each
candidate sequence
generate candidate length-(k+1) sequences from
length-k frequent sequences using Apriori
repeat until no frequent sequence or no candidate
can be found
Major strength
Candidate pruning by Apriori
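
The level-wise loop above can be sketched for a restricted case. This simplified version assumes every element is a single item (no itemset elements, time constraints, or taxonomies), so it illustrates the scan / join / prune cycle rather than full GSP; the function names and the toy database are assumptions.

```python
def contains(sequence, pattern):
    """True if pattern occurs in sequence in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(item in it for item in pattern)   # each 'in' consumes the iterator

def gsp_single_items(database, min_sup):
    """Level-wise mining of frequent sequences whose elements are single items."""
    items = sorted({item for seq in database for item in seq})
    candidates = [(item,) for item in items]     # every item is a length-1 candidate
    frequent = {}
    while candidates:
        # Scan the database to collect the support count of each candidate.
        counts = {c: sum(contains(seq, c) for seq in database) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Join: a and b combine when a without its first item equals b without its last.
        joined = {a + b[-1:] for a in level for b in level if a[1:] == b[:-1]}
        # Prune (Apriori): dropping any one item must leave a frequent sequence.
        candidates = sorted(
            c for c in joined
            if all(c[:i] + c[i + 1:] in level for i in range(len(c)))
        )
    return frequent

patterns = gsp_single_items(["abc", "acb", "abcd", "bd"], min_sup=2)
print(patterns)
```

In this run the candidate <abd> is pruned without a database scan because its subsequence <ad> is infrequent, which is the Apriori strength the slide mentions.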

44
GSP: Generalized Sequential Pattern Mining
min_sup = 2
45
Biology Fundamentals: DNA

DNA: helix-shaped molecule whose constituents are
two parallel strands of nucleotides
DNA is usually represented by sequences of four
(A, C, G, T) nucleotides
This assumes only one strand is considered; the
second strand is always derivable from the first
by pairing As with Ts and Cs with Gs and
vice versa

Nucleotides (bases)
Adenine (A)
Cytosine (C)
Guanine (G)
Thymine (T)
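
Deriving the second strand from the first is a simple character mapping. A minimal sketch follows; the function names and the sample strand are illustrative. The reversed form is included because the two strands are conventionally read in opposite directions.

```python
# Pairing rule from the slide: A <-> T and C <-> G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(strand):
    """Base-by-base partner strand."""
    return strand.translate(COMPLEMENT)

def reverse_complement(strand):
    """Partner strand written in its own reading direction."""
    return complement(strand)[::-1]

print(complement("ATGCGT"))          # TACGCA
print(reverse_complement("ATGCGT"))  # ACGCAT
```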

46
Biology Fundamentals: Genes

Gene: contiguous subparts of single-strand DNA
that are templates for producing proteins. Genes
can appear on either DNA strand.
Chromosomes: compact chains of coiled DNA
Genome: the set of all genes in a given organism.

Source: www.mtsinai.on.ca/pdmg/Genetics/basic.htm
47
Biology Fundamentals: Protein

Genes are transcribed into RNA by a complex
ensemble of molecules. During transcription T is
substituted by the letter U (for uracil).
Triplets of consecutive nucleotides (called
codons) are translated one after another, each
producing one corresponding amino acid
Protein: built from amino acids. It participates
with other proteins and molecules in keeping the
cell alive and interacting with its environment

Source: fajerpc.magnet.fsu.edu/Education/2010/Lectures/26_DNA_Transcription.htm
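
Transcription and the split into codons can be sketched on a sequence string. The gene string is made up for illustration, and the codon table below is deliberately partial (three entries of the standard genetic code), so it is a sketch rather than a usable translator.

```python
# Partial codon table (assumption: only the entries needed for this example).
CODON_TABLE = {"AUG": "Met", "GCC": "Ala", "UAA": "Stop"}

def transcribe(gene):
    """Write the gene as RNA by substituting U for T."""
    return gene.replace("T", "U")

def codons(rna):
    """Split RNA into consecutive triplets, ignoring any incomplete tail."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]

def translate(rna):
    """Map codons to amino acids until a stop codon is reached."""
    protein = []
    for codon in codons(rna):
        amino_acid = CODON_TABLE[codon]   # raises KeyError for codons outside the partial table
        if amino_acid == "Stop":
            break
        protein.append(amino_acid)
    return protein

rna = transcribe("ATGGCCTAA")
print(rna, codons(rna), translate(rna))
```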
48
Data Mining and Bioinformatics

Many biological processes are not well understood
Biological data is abundant and information-rich
Genomics and proteomics data (sequences),
microarrays and protein arrays, protein database
(PDB), bio-testing data
Huge data banks, rich literature, openly
accessible
Largest and richest scientific data sets in the
world
Mining: gain biological insight (data →
knowledge)
Mining for correlations, linkages between disease
and gene sequences, protein networks,
classification, clustering, outliers, ...
Find correlations among linkages in literature
and heterogeneous databases

49
Data Mining and Bioinformatics

Research and development of new tools for
bioinformatics
Similarity search and comparison between classes
of genes by finding and comparing frequent
patterns
Identify sequential patterns that play roles in
various diseases
New clustering and classification methods for
micro-array data and protein-array data analysis
Mining, indexing and similarity search in
sequential and structured (e.g., graph and
network) data sets
Path analysis linking genes/proteins to
different disease development stages
High-dimensional analysis and OLAP mining
Visualization tools and genetic/proteomic data
analysis
