Transcript and Presenter's Notes

Title: INFS 795 PROJECT: Clustering Time Series


1
INFS 795 PROJECT: Clustering Time Series
  • presented by
  • Rafal Ladysz

2
AGENDA
  • INTRODUCTION
  • theoretical background
  • project objectives
  • other works
  • EXPERIMENTAL SETUP
  • data description
  • data preprocessing
  • tools and procedures
  • RESULTS AND CONCLUSIONS (so far)
  • NEXT STEPS
  • REFERENCES

3
INTRODUCTION: theoretical background
  • clustering: an unsupervised ML technique for
    grouping similar, unlabeled objects without prior
    knowledge about them
  • clustering techniques can be divided and compared
    in many ways, e.g.
  • exclusive vs. overlapping
  • deterministic vs. probabilistic
  • incremental vs. batch learning
  • hierarchical vs. flat
  • or
  • partitioning (e.g. k-means, EM)
  • hierarchical (agglomerative, divisive)
  • density-based
  • model-based: a model is hypothesized for each of
    the clusters, and the goal is to find the best
    fit of the data to the given model

4
INTRODUCTION: theoretical background
  • examples of partitioning algorithms
  • k-means
  • EM: a probabilistic generalization of k-means
  • k-means characteristics
  • suboptimal (susceptible to local minima)
  • sensitive to initial conditions and... outliers
  • requires the number of clusters (k) as part of
    the input
  • Euclidean distance is its most natural
    dissimilarity metric (spherical clusters)
  • recall how it works: re-partitioning until no
    changes occur (sketched in code below)
  • EM characteristics
  • a generalization of k-means to a probabilistic
    setting (maintains probabilities of membership in
    all clusters rather than assigning elements to
    initial clusters)
  • works iteratively:
  • initialize the means and covariance matrices
  • while the convergence criterion is not met,
    compute the probability of each data point
    belonging to each cluster
  • recompute the cluster distributions using the
    current membership probabilities
  • cluster probabilities are stored as instance
    weights, using means and standard deviations of
    the attributes
  • the procedure stops when the likelihood saturates
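
(a minimal k-means sketch in Python, assuming points
are equal-length numeric tuples; the names and the
random initialization are illustrative, not the
project's code)

import math, random

def kmeans(points, k, max_iter=100):
    centroids = random.sample(points, k)  # sensitive to this initial choice!
    for _ in range(max_iter):
        # assignment step: each point joins its nearest centroid (Euclidean)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[i].append(p)
        # update step: move each centroid to the mean of its cluster
        new = [tuple(sum(xs) / len(c) for xs in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:              # no changes: converged
            break
        centroids = new
    return centroids, clusters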

5
INTRODUCTION: theoretical background
  • distance / (dis)similarity measures
  • Euclidean: square root of the sum of squares
  • main limitation: very sensitive to outliers!
  • Keogh claims that
  • Euclidean distance: error rate about 30%
  • DTW: error rate about 3%
  • but there is a cost for this accuracy
  • time to classify an instance using Euclidean
    distance: 1 sec
  • time to classify an instance using DTW: 4,320 sec
  • by the way, DTW stands for Dynamic Time Warping
    (illustration and formula follow)

6
INTRODUCTION: project objectives
  • in general: clustering of evolving time series
    data
  • issues to be taken into consideration
  • dimensionality
  • outliers
  • similarity measure(s)
  • number of elements (subsequences)
  • overall evaluation measure(s)
  • context recognition-based support for another
    algorithm
  • in particular: comparing and/or evaluating
  • efficiency and accuracy of k-means and EM
  • effect of initial cluster positions on k-means
    accuracy
  • efficiency and accuracy of Euclidean and DTW
    distance measures in initializing cluster seeds
    for k-means

7
INTRODUCTION: other works
  • E. Keogh et al.: inspired using DTW as an
    alternative to Euclidean distance (DTW originates
    from speech-recognition experiments of the 1970s)
  • D. Barbara: outlined prerequisites for clustering
    data streams
  • H. Wang et al.: described techniques used in
    detecting pattern similarity
  • there, similarity is buried deeply in subspaces;
    of no direct relevance to my experiments, since
    the attributes are selected arbitrarily (time
    series require temporal order)

8
PROJECT OBJECTIVES: summary
  • challenges
  • data: evolving time series (?!)
  • k-means: initialization of seed positions and of k
  • (attempt at automatic optimization for the
    evolving data)
  • similarity measure: Euclidean is error-prone,
    DTW is costly
  • real-time requirement (as the target solution,
    not in the project)
  • tools: the necessity to create (some of them)
    from scratch
  • not encountered in the literature
  • motivation
  • support for already designed and implemented
    software
  • comparing k-means vs. EM and Euclidean vs. DTW
  • the challenges listed above

9
EXPERIMENTAL DESIGN: data description
  • three sources of data, for more general results
  • medical: EEG and EKG http
  • financial: NYSE and currency exchange http
  • climatological: temperature and SOI http
  • all the data are temporal (time series),
    generated in their natural (not simulated)
    environments
  • some knowledge is available (to the experimenter,
    not to the machine)
  • brief characteristics follow

10
EXPERIMENTAL DESIGN: data description
[figure] examples of medical data: heart-related EKG
showing heart failure occurrences (top) and
brain-related EEG showing epileptic seizure
duration (bottom)
11
EXPERIMENTAL DESIGN: data description
[figure] examples of climatological data: temperature
in Virginia showing seasonality (annual cycle) (top)
and the Southern Oscillation Index showing
periodicity or chaos? (bottom)
12
EXPERIMENTAL DESIGN: data description
[figure] examples of financial data: New York Stock
Exchange (top) and a currency exchange rate (bottom);
do we see any patterns in either of these two?
notice: both time series originate from a cultural
rather than a natural environment
13
Dynamic Time Warping
Euclidean: one-to-one
Dynamic Time Warping: many-to-many
γ(i, j) = d(i, j) + min{γ(i-1, j-1), γ(i-1, j), γ(i, j-1)}
where γ(i, j) is the cumulative distance: the
distance d(i, j) plus the minimum cumulative
distance among the adjacent cells
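
(a hedged Python sketch of the γ recurrence above;
the squared local distance is a common choice, not
necessarily the presenter's)

def dtw_distance(x, y):
    n, m = len(x), len(y)
    INF = float("inf")
    # cumulative-distance matrix gamma, with an infinite border for padding
    gamma = [[INF] * (m + 1) for _ in range(n + 1)]
    gamma[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = (x[i - 1] - y[j - 1]) ** 2       # local (squared) distance
            gamma[i][j] = d + min(gamma[i - 1][j - 1],  # diagonal
                                  gamma[i - 1][j],      # vertical
                                  gamma[i][j - 1])      # horizontal
    return gamma[n][m] ** 0.5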
14
EXPERIMENTAL DESIGN: data preprocessing
  • normalization: not necessary
  • outlier detection: not done for the experimental
    data sets
  • remark: not feasible for the (assumed) real-time
    scenario
  • subsequencing: done using another program (LET)
  • for the Euclidean distance measure, equal lengths
    are required (done)
  • computing the mean of each subsequence and
    shifting its values, to enable the Euclidean
    metric to capture similarity of s.s. (done)
  • applying weights to each dimension (discrete
    sample value), to favor dimensions (points)
    closer to the cut-off (beginning) of the s.s.
    (sketched in code below)
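
(a hedged Python sketch of the preprocessing above:
mean-shifting a subsequence, then weighting points
nearer its cut-off; the exponential scheme with base
w is an assumption, only w itself appears on the slides)

def preprocess(subseq, w=1.05):           # w = 1 leaves values unweighted
    mean = sum(subseq) / len(subseq)
    shifted = [v - mean for v in subseq]  # value shifting: subtract the mean
    n = len(shifted)
    # points closer to the cut-off (beginning) of the s.s. get larger weights
    return [v * w ** (n - 1 - i) for i, v in enumerate(shifted)]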

15
EXPERIMENTAL DESIGN: big picture
  • the general experimental procedure regarding
    initialization
  • FOR all (six) time series data sets
  • FOR dimensionalities D = 30, 100
  • FOR subsequence weights w(1), w(1.05)
  • FOR ε = 5, 10
  • FOR both (E, DTW) distance measures
  • FOR both constraints (Kmax, S)
  • capture and remember cluster seeds
  • apply to real clustering
  • evaluate final goodness

6 x 2 x 2 x 2 x 2 x 2 = 192 seed sets (enumerated in
the sketch below)
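
(a sketch enumerating the 192 configurations; the
dataset labels are illustrative placeholders)

from itertools import product

datasets = ["EKG", "EEG", "NYSE", "FX", "temperature", "SOI"]  # placeholders
configs = list(product(datasets,
                       [30, 100],             # dimensionality D
                       [1.0, 1.05],           # subsequence weight w
                       [5, 10],               # epsilon
                       ["Euclidean", "DTW"],  # distance measure
                       ["Kmax", "S"]))        # stopping constraint
assert len(configs) == 192                    # 6 x 2 x 2 x 2 x 2 x 2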
16
EXPERIMENTAL DESIGN: initialization
  • initialization phase: collecting cluster seeds,
    i.e. subsequences in D-dimensional space
  • computing distances between the subsequences
    using the Euclidean (E) and DTW (D) measures,
    stored in matrices
  • comparing pairwise distances from matrices E and D
  • based on the above, creating initial cluster seeds
  • see next slide (SPSS)

17
(No Transcript)
18
EXPERIMENTAL DESIGN: tools and procedures
  • the core of the experiment is generating the
    initial k cluster seeds (to be further used by
    k-means)
  • this is done using two distance measures:
    Euclidean and DTW
  • once the k seeds are generated (either way),
    their positions are remembered and
  • each seed is assigned a class for the final
    evaluation
  • the initial cluster positions and/or classes are
    passed on to the clustering program (SPSS and/or
    Weka)
  • from that moment on, the algorithms work
    unattended
  • the objective is to evaluate the impact of
    optimizing the initial clusters (in terms of
    their positions and number)

19
EXPERIMENTAL DESIGN: tools and procedures
  • initial cluster seeds: algorithmic approach
    (sketched in code below)
  • define constraints Kmin, Kmax, ε and the
    threshold S*; set k = 0
  • start capturing time series subsequences (s.s.)
  • assign the first seed to the first s.s.,
    increment k
  • do until any of these conditions is fulfilled:
  • k = Kmax OR S reaches its threshold S* OR no more
    subsequences
  • if a new s.s. is farther than ε from all seeds,
  • create a new seed assigned to that s.s.,
    increment k
  • otherwise merge the s.s. into an existing seed
    not farther than ε
  • compute S
  • stop capturing s.s., label all generated seeds
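
(a hedged Python sketch of the seed-capture loop above;
the names follow the slide, distance() is assumed
given, and the S-threshold test is omitted)

def capture_seeds(subsequences, epsilon, k_max, distance):
    seeds = []                        # each seed: a list of member s.s.
    for ss in subsequences:
        if not seeds:
            seeds.append([ss])        # first seed = first subsequence
            continue
        # nearest existing seed; its first member stands in for the seed's
        # position here (a centroid would also be a reasonable choice)
        i = min(range(len(seeds)), key=lambda j: distance(ss, seeds[j][0]))
        if distance(ss, seeds[i][0]) > epsilon and len(seeds) < k_max:
            seeds.append([ss])        # outside all orbs: create a new seed
        else:
            seeds[i].append(ss)       # within an orb: merge into that seed
        # (computing S and testing it against S* would go here)
    return seeds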

20
EXPERIMENTAL DESIGN: tools and procedures
  • how is the number of clusters (seeds) computed?
  • as we know, a good k-means algorithm minimizes
    intra- while maximizing inter-cluster distances
    (thus grouping similar objects into separate
    clusters, not too many, not too few)
  • the objective function used in the project is
  • S = <intracl. dist.> / <intercl. dist.>
    (computed in the sketch below)
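
(an illustrative computation of S; the slide gives
only the ratio, so the pairwise averaging scheme is
an assumption)

def objective_S(clusters, dist):
    intra, n_intra = 0.0, 0           # mean pairwise intra-cluster distance
    for c in clusters:
        for a in range(len(c)):
            for b in range(a + 1, len(c)):
                intra += dist(c[a], c[b]); n_intra += 1
    inter, n_inter = 0.0, 0           # mean pairwise inter-cluster distance
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            for p in clusters[i]:
                for q in clusters[j]:
                    inter += dist(p, q); n_inter += 1
    return (intra / n_intra) / (inter / n_inter)  # lower S = better clustering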

21
illustration of S
[plot] S = <intra>/<inter> as a function of k, the
number of clusters, with Kmin marked
this plot shows the idea of when to stop capturing
new cluster seeds; the measure is the slope between
two neighboring points; to avoid too early
termination, a constraint of Kmin should be imposed
22
illustration of ε
whenever a newly captured seed candidate falls
within an existing seed's orb, it is fused with the
latter; otherwise, its own orb is created; during
this processing phase we optimize the number k of
clusters for the real clustering; there is no
guarantee that the estimated number is in fact
optimal
[diagram] original seeds with orbs of radius ε;
candidates within an original orb are merged into
its seed; candidates outside all existing seed orbs
get new orbs created for them
...but one can believe it is more suitable than
just a guess; the same applies to the initial seed
positions
23
EXPERIMENTAL DESIGN: tools and procedures
  • computing Euclidean and DTW distances
  • coding my own program
  • temporarily using a program downloaded from the
    Internet
  • evaluating the influence of initialization on
    clustering accuracy: SPSS for Windows, ver. 11
    (Standard Edition)
  • comparing performance (accuracy and runtime) of
    k-means and EM: Weka

[diagram] pipeline: time series subsequences →
computing distances (Euclidean and DTW) →
k-means, EM (SPSS)
24
RESULTS AND CONCLUSIONS (so far)
  • after running 12 k-means sessions over 6
    preprocessed datasets,
  • the average improvement WITH INITIALIZATION over
    WITHOUT can be approximated as
  • 39.4/112 vs. 77/110, i.e.
  • about 0.35 vs. 0.70
  • the improvement is computed via the intra/inter
    distance ratio (lower is better)

25
summarizing RESULTS to be reported
  • performance measure of k-means WITH and WITHOUT
    initialization
  • goodness evaluation (S)
  • subjective evaluation of clustering
  • performance comparison of k-means and EM in the
    same circumstances
  • performance comparison of Eucl. and DTW
  • error
  • runtime

26
NEXT STEPS
  • from now until the project deadline
  • finishing E/DTW distance computing program
  • finishing k-optimizing program
  • generating 192 initial cluster seeds
  • clustering using the above initial cluster seeds
  • comparing with no initialization
  • after the deadline (continuation, if time allows)
  • writing my own k-means program (to run the whole
    process in one batch, thus truly measuring
    performance)
  • if the results are promising, embedding it into
    another program (LET)

27
REFERENCES
  • Wang, H. et al.: Clustering by Pattern Similarity
    in Large Data Sets
  • Perng, C.-S. et al.: Landmarks: A New Model for
    Similarity-Based Pattern...
  • Aggarwal, C. et al.: A Framework for Clustering
    Evolving Data Streams
  • Barbara, D.: Requirements for Clustering Data
    Streams
  • Keogh, E., Kasetty, S.: On the Need for Time
    Series Data Mining...
  • Gunopulos, D., Das, G.: Finding Similar Time
    Series
  • Keogh, E. et al.: Clustering of Time Series
    Subsequences is Meaningless...
  • Lin, J. et al.: Iterative Incremental Clustering
    of Time Series
  • Keogh, E., Pazzani, M.: An enhanced
    representation of time series...
  • Kahveci, T. et al.: Similarity Searching for
    Multi-attribute Sequences
  • and other information and public software
    resources found on the Internet