Title: INFS 795 PROJECT: Clustering Time Series
1 INFS 795 PROJECT: Clustering Time Series
- presented by Rafal Ladysz
2 AGENDA
- INTRODUCTION
- theoretical background
- project objectives
- other works
- EXPERIMENTAL SETUP
- data description
- data preprocessing
- tools and procedures
- RESULTS AND CONCLUSIONS (so far)
- NEXT STEPS
- REFERENCES
3 INTRODUCTION: theoretical background
- clustering: an unsupervised ML technique of grouping similar, unlabeled objects without prior knowledge about them
- clustering techniques can be divided and compared in many ways, e.g.
  - exclusive vs. overlapping
  - deterministic vs. probabilistic
  - incremental vs. batch learning
  - hierarchical vs. flat
- or
  - partitioning (e.g. k-means, EM)
  - hierarchical (agglomerative, divisive)
  - density-based
  - model-based: a model is hypothesized for each of the clusters, to find the best fit of the data to that model
4 INTRODUCTION: theoretical background
- examples of partitioning algorithms
  - k-means
  - EM: probabilistic generalization of k-means
- k-means characteristics
  - suboptimal (susceptible to local minima)
  - sensitive to initial conditions and... outliers
  - requires the number of clusters (k) as part of the input
  - Euclidean distance is its most natural dissimilarity metric (spherical clusters)
  - we remember how it works: re-partitioning until no changes occur
- EM characteristics
  - generalization of k-means to a probabilistic setting (maintains a probability of membership in every cluster rather than assigning elements to initial clusters)
  - works iteratively
    - initialize the means and covariance matrix
    - while the convergence criterion is not met, compute the probability of each data point belonging to each cluster
    - recompute the cluster distributions using the current membership probabilities
  - cluster probabilities are stored as instance weights, using means and standard deviations of the attributes
  - the procedure stops when the likelihood saturates
  - (a code sketch contrasting both algorithms follows)
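A minimal sketch contrasting the two algorithms, assuming scikit-learn and toy data (the project itself uses Weka and SPSS, so the library choice here is illustrative only):

    # k-means gives hard assignments; EM (Gaussian mixture) keeps a
    # membership probability for every cluster, as described above.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    # toy stand-in for preprocessed subsequences: two Gaussian blobs
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)),
                   rng.normal(5.0, 1.0, (50, 2))])

    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_[:5])           # hard labels: one cluster per point

    em = GaussianMixture(n_components=2, random_state=0).fit(X)
    print(em.predict_proba(X)[:2])  # soft memberships; each row sums to 1

Note how predict_proba exposes exactly the per-cluster membership probabilities that k-means collapses into a single label.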
5 INTRODUCTION: theoretical background
- distance / (dis)similarity measures
- Euclidean: the square root of the sum of squared differences, d(x, y) = sqrt(sum_i (x_i - y_i)^2)
  - main limitation: very sensitive to outliers!
- Keogh claims that
  - Euclidean distance error rate: about 30%
  - DTW error rate: 3%
- but there is a cost for accuracy
  - time to classify an instance using Euclidean distance: 1 sec
  - time to classify an instance using DTW: 4,320 sec
- by the way, DTW stands for Dynamic Time Warping (illustration and formula follow)
6 INTRODUCTION: project objectives
- in general: clustering of evolving time series data
- issues to be taken into consideration
  - dimensionality
  - outliers
  - similarity measure(s)
  - number of elements (subsequences)
  - overall evaluation measure(s)
  - context recognition-based support for another algorithm
- in particular: comparing and/or evaluating
  - efficiency and accuracy of k-means and EM
  - effect of the initial cluster positions on k-means accuracy
  - efficiency and accuracy of the Euclidean and DTW distance measures in initializing cluster seeds for k-means
7 INTRODUCTION: other works
- E. Keogh et al. inspired the use of DTW as an alternative to Euclidean distance (DTW originates from experiments with voice recognition in the 1970s)
- D. Barbara outlined prerequisites for clustering data streams
- H. Wang et al. described techniques used in detecting pattern similarity
  - similarity is buried deeply in subspaces: of no direct relevance to my experiments, since the attributes are arbitrarily selected (time series require temporal order)
8 PROJECT OBJECTIVES: summary
- challenges
  - data: evolving time series (?!)
  - k-means: initialization of seed positions and of k (an attempt at automatic optimization for the evolving data)
  - similarity measure: Euclidean is error-prone, DTW is costly
  - real-time requirement (as the target solution, not in the project)
  - tools: the necessity to create (some of them) from scratch, not encountered in the literature
- motivation
  - support for already designed and implemented software
  - comparing k-means vs. EM and Euclidean vs. DTW
  - the challenges listed above
9 EXPERIMENTAL DESIGN: data description
- three sources of data, for more general results
  - medical: EEG and EKG (http)
  - financial: NYSE and currency exchange (http)
  - climatological: temperature and SOI (http)
- all the data are temporal (time series), generated in their natural (not simulated) environments
- some knowledge is available (to the experimenter, not the machine)
- brief characteristics follow
10 EXPERIMENTAL DESIGN: data description
[Figure: examples of medical data: heart-related EKG (top, heart failure occurrences) and brain-related EEG (bottom, epileptic seizure duration)]
11 EXPERIMENTAL DESIGN: data description
[Figure: examples of climatological data: temperature in Virginia (top, seasonality: annual cycle) and Southern Oscillation Index (bottom, periodicity or chaos?)]
12 EXPERIMENTAL DESIGN: data description
[Figure: examples of financial data: New York Stock Exchange (top) and currency exchange rate (bottom); do we see any patterns in either of these two?]
- notice: both time series originate from a cultural rather than a natural environment
13 Dynamic Time Warping
- Euclidean: one-to-one alignment
- Dynamic Time Warping: many-to-many alignment
- cumulative distance recurrence: γ(i, j) = d(i, j) + min{ γ(i-1, j), γ(i, j-1), γ(i-1, j-1) }
- where γ(i, j) is the cumulative distance: the local distance d(i, j) plus the minimum cumulative distance among the adjacent cells
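A direct dynamic-programming implementation of this recurrence, as a sketch (the program actually used in the project is not reproduced here):

    import numpy as np

    def dtw_distance(x, y):
        """DTW distance between 1-D sequences x and y, O(n*m) time."""
        n, m = len(x), len(y)
        gamma = np.full((n + 1, m + 1), np.inf)  # cumulative distance table
        gamma[0, 0] = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d = (x[i - 1] - y[j - 1]) ** 2       # local distance d(i, j)
                gamma[i, j] = d + min(gamma[i - 1, j],      # cell above
                                      gamma[i, j - 1],      # cell to the left
                                      gamma[i - 1, j - 1])  # diagonal cell
        return np.sqrt(gamma[n, m])

    # DTW aligns many-to-many, so it tolerates local stretching in time:
    a = np.array([0.0, 0.0, 1.0, 2.0, 1.0, 0.0])
    b = np.array([0.0, 1.0, 2.0, 2.0, 1.0, 0.0, 0.0])
    print(dtw_distance(a, b))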
14 EXPERIMENTAL DESIGN: data preprocessing
- normalization: not necessary
- outlier detection: not done for the experimental data sets
  - remark: not feasible for the (assumed) real-time scenario
- subsequencing: using another program (LET)
  - equal length is required for the Euclidean distance measure (done)
- computing the mean of each subsequence and value shifting, to enable the Euclidean metric to capture similarity of subsequences (done)
- applying weights to each dimension (discrete sample value), to favor dimensions (points) closer to the cut-off (beginning) of the subsequence (a preprocessing sketch follows)
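A sketch of these two preprocessing steps. The exact weighting scheme is not spelled out in the slides; a geometric decay w**(-i) is assumed here, so w = 1 means no weighting and w = 1.05 favors early points, matching the two weight settings used in the experiments:

    import numpy as np

    def preprocess(subsequences, w=1.05):
        """subsequences: array of shape (n, D), n subsequences of length D."""
        X = np.asarray(subsequences, dtype=float)
        X = X - X.mean(axis=1, keepdims=True)  # value shifting per subsequence
        weights = w ** -np.arange(X.shape[1])  # 1, 1/w, 1/w^2, ... (assumed form)
        return X * weights

    X = preprocess(np.random.default_rng(1).normal(size=(4, 30)), w=1.05)
    print(X.shape)  # (4, 30)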
15 EXPERIMENTAL DESIGN: big picture
- the general experimental procedure for initialization:
- FOR all (six) time series data sets
- FOR dimensionalities D = 30, 100
- FOR subsequence weights w = 1, 1.05
- FOR ε = 5, 10
- FOR both (Euclidean, DTW) distance measures
- FOR both constraints (Kmax, S*)
  - capture and remember cluster seeds
  - apply to real clustering
  - evaluate final goodness
- 6 × 2 × 2 × 2 × 2 × 2 = 192 seed sets (enumerated in the sketch below)
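A quick enumeration of this grid. The dataset names are shorthand stand-ins for the six sources listed earlier; everything else mirrors the slide:

    from itertools import product

    datasets    = ["EEG", "EKG", "NYSE", "FX", "temperature", "SOI"]
    dims        = [30, 100]          # dimensionality D
    weights     = [1.0, 1.05]        # subsequence weight w
    epsilons    = [5, 10]            # seed radius
    measures    = ["Euclidean", "DTW"]
    constraints = ["Kmax", "S*"]     # which stopping constraint binds

    configs = list(product(datasets, dims, weights,
                           epsilons, measures, constraints))
    print(len(configs))  # 192 seed sets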
16 EXPERIMENTAL DESIGN: initialization
- initialization phase: collecting cluster seeds (subsequences) in D-dimensional space
- computing distances between the subsequences with the Euclidean (E) and DTW (D) measures, stored as matrices
- comparing pairwise distances from matrices E and D (see the sketch below)
- based on the above, creating the initial cluster seeds
- see next slide (SPSS)
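A sketch of building the two pairwise-distance matrices for a set of equal-length subsequences, so that E and D entries can be compared pair by pair; it assumes the dtw_distance function from the DTW slide is in scope:

    import numpy as np
    from scipy.spatial.distance import cdist

    def distance_matrices(X):
        """X: (n, D) array of subsequences; returns matrices (E, D)."""
        E = cdist(X, X, metric="euclidean")   # Euclidean, all pairs at once
        n = len(X)
        D = np.zeros((n, n))
        for i in range(n):                    # DTW is symmetric, so fill
            for j in range(i + 1, n):         # only the upper triangle
                D[i, j] = D[j, i] = dtw_distance(X[i], X[j])
        return E, D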
17 [SPSS screenshot (no transcript)]
18 EXPERIMENTAL DESIGN: tools and procedures
- the core of the experiment is generating the initial k cluster seeds (to be further used by k-means)
- that is done using two distance measures: Euclidean and DTW
- once the k seeds are generated (either way), their positions are remembered and each seed is assigned a class for final evaluation
- the initial cluster positions and/or classes are passed on to the clustering program (SPSS and/or Weka)
- effective from that moment, the algorithms work unattended
- the objective is to evaluate the impact of initial cluster optimization (in terms of positions and number)
19 EXPERIMENTAL DESIGN: tools and procedures
- initial cluster seeds: algorithmic approach (runnable sketch below)
  - define constraints: Kmin, Kmax, k = 0, ε, S, S*
  - start capturing time series subsequences (s.s.)
  - assign the first seed to the first s.s., increment k
  - do until either condition is fulfilled:
    - k = Kmax, OR S reaches its threshold S*, OR no more subsequences
    - if a new s.s. is farther than ε from all seeds, create a new seed assigned to that s.s. and increment k
    - otherwise, merge the s.s. into an existing seed not farther than ε
    - compute S
  - stop capturing s.s., label all generated seeds
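A runnable sketch of this capture loop. The dist argument can be Euclidean or the dtw_distance sketch; the S/S* stopping check is left out for brevity (S is defined on the next slide), and "merge into an existing seed not farther than ε" is read here as assignment to the nearest such seed:

    import numpy as np

    def capture_seeds(subsequences, epsilon, k_max, dist):
        seeds = [np.array(subsequences[0], dtype=float)]  # first seed = first s.s.
        members = [[0]]                                   # indices merged per seed
        for idx, s in enumerate(subsequences[1:], start=1):
            if len(seeds) >= k_max:
                break                                     # constraint k = Kmax
            d = [dist(s, c) for c in seeds]
            nearest = int(np.argmin(d))
            if d[nearest] > epsilon:                      # farther than ε from all
                seeds.append(np.array(s, dtype=float))    # -> create a new seed
                members.append([idx])
            else:                                         # -> merge into nearest seed
                members[nearest].append(idx)
        return seeds, members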
20 EXPERIMENTAL DESIGN: tools and procedures
- how is the number of clusters (seeds) computed?
- as we know, a good k-means algorithm minimizes intra-cluster while maximizing inter-cluster distances (thus grouping similar objects into separate clusters, not too many, not too few)
- the objective function used in the project is
  - S = <intra-cluster dist.> / <inter-cluster dist.> (computed in the sketch below)
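A direct reading of this formula as code: the mean distance of points to their own seed, divided by the mean pairwise distance between seeds. The slides do not spell out the averaging details, so this is one plausible interpretation; lower S is better:

    import numpy as np

    def goodness_S(X, labels, seeds):
        """X: (n, D) data; labels: seed index per point; seeds: (k, D), k >= 2."""
        X, seeds = np.asarray(X, float), np.asarray(seeds, float)
        intra = np.mean([np.linalg.norm(x - seeds[l])     # point to own seed
                         for x, l in zip(X, labels)])
        k = len(seeds)
        inter = np.mean([np.linalg.norm(seeds[i] - seeds[j])  # seed to seed
                         for i in range(k) for j in range(i + 1, k)])
        return intra / inter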
21 illustration of S
[Plot: S = <intra>/<inter> versus k (number of clusters), with Kmin marked]
- this plot shows the idea of when to stop capturing new cluster seeds: the measure is the slope between two neighboring points
- to avoid too early termination, a constraint of Kmin should be imposed (see the sketch below)
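A sketch of this stopping rule: watch the slope of S(k) between neighboring k values and stop adding seeds once it flattens, but never before Kmin seeds exist. The flatness tolerance is an assumption, not a value from the slides:

    def should_stop(S_values, k, k_min, tol=1e-3):
        """S_values: S(k) recorded so far, one entry per k; k: current seed count."""
        if k < k_min or len(S_values) < 2:
            return False                      # Kmin guards against early termination
        slope = S_values[-1] - S_values[-2]   # slope between two neighboring points
        return abs(slope) < tol               # flattened: stop capturing seeds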
22 illustration of ε
[Diagram: original seeds and their orbs; candidates merging within an existing orb vs. candidates outside all existing seed orbs, for which new orbs will be created]
- whenever a newly captured seed candidate falls within an existing seed's orb, it is fused with the latter; otherwise, its own orb is created
- during this processing phase we optimize the number k of clusters for the real clustering
- there is no guarantee that the estimated number is in fact optimal, but one can believe it is more suitable than one just guessed; the same refers to the initial seed positions
23 EXPERIMENTAL DESIGN: tools and procedures
- computing Euclidean and DTW distances
  - coding my own program
  - temporarily using a program downloaded from the Internet
- evaluating the influence of initialization on clustering accuracy: SPSS for Windows, ver. 11 (Standard Edition)
- comparing performance (accuracy and runtime) of k-means and EM: Weka
[Diagram: time series subsequences; computing distances (Euclidean and DTW); k-means, EM (SPSS)]
24 RESULTS AND CONCLUSIONS (so far)
- after running 12 k-means sessions over 6 preprocessed datasets, the average result WITH INITIALIZATION vs. WITHOUT can be approximated as
  - 39.4/112 vs. 77/110, i.e.
  - 0.35 vs. 0.7
- the measure is the ratio intra/inter, so lower is better: initialization roughly halves it
25 summarizing RESULTS to be reported
- performance measure of k-means WITH and WITHOUT initialization
  - goodness evaluation (S)
  - subjective evaluation of clustering
- performance comparison of k-means and EM under the same circumstances
- performance comparison of Euclidean and DTW
  - error
  - runtime
26 NEXT STEPS
- from now until the project deadline
  - finishing the E/DTW distance-computing program
  - finishing the k-optimizing program
  - generating 192 initial cluster seed sets
  - clustering using the above initial cluster seeds
  - comparing with no initialization
- after the deadline (continuation, if time allows)
  - writing my own k-means program (to run the whole process in one batch, thus truly measuring performance)
  - if the results are promising, embedding it into another program (LET)
27 REFERENCES
- Wang, H. et al.: Clustering by Pattern Similarity in Large Data Sets
- Perng, C.-S. et al.: Landmarks: A New Model for Similarity-Based Pattern...
- Aggarwal, C. et al.: A Framework for Clustering Evolving Data Streams
- Barbara, D.: Requirements for Clustering Data Streams
- Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining...
- Gunopulos, D., Das, G.: Finding Similar Time Series
- Keogh, E. et al.: Clustering of Time Series Subsequences is Meaningless...
- Lin, J. et al.: Iterative Incremental Clustering of Time Series
- Keogh, E., Pazzani, M.: An Enhanced Representation of Time Series...
- Kahveci, T. et al.: Similarity Searching for Multi-attribute Sequences
- and other information and public software resources found on the Internet.