Fast Subsequence Matching in Time-Series Databases.

Provided by: cise8

Learn more at: https://www.cise.ufl.edu

1

Fast Subsequence Matching in Time-Series
Databases.
C. Faloutsos, M. Ranganathan,
Y. Manolopoulos
Presented by
George Liu / Luis L. Perez

2
Time series?

Definition
Applications
Financial markets
Weather forecasting
Healthcare

3
What kind of problem are we trying to solve?

Whole sequence matching
Given a database S with n sequences, all of them
equally long, and a query sequence Q of the same
length.
Find all sequences in S that match Q, i.e. lie
within a given distance bound of Q.
Subsequence matching
Given a database S with n sequences, with
potentially different lengths, and a query
sequence Q.
Find all sequences in S that contain a subsequence
matching Q, along with the matching offsets.

4
Useful notation

Given a sequence S
Len(S) denotes the length of the sequence
S[i] denotes the i-th element
S[i:j] denotes the subsequence from S[i] to
S[j], inclusive
Given two sequences, S and Q
D(S,Q) denotes the distance between S and Q.
Euclidean
Distance bound e
Max. distance for two sequences to be considered
a match
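
A minimal Python sketch of this notation, assuming 0-based indices and plain lists of numbers. The names `subsequence`, `euclidean` and `matches` are illustrative and do not come from the slides.

```python
import math

def subsequence(s, i, j):
    """Elements i through j of s, both ends included (0-based here)."""
    return s[i:j + 1]

def euclidean(s, q):
    """Euclidean distance D(S, Q) between two equal-length sequences."""
    if len(s) != len(q):
        raise ValueError("sequences must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, q)))

def matches(s, q, e):
    """True when S and Q are within the distance bound e."""
    return euclidean(s, q) <= e
```

For example, `matches([0, 0], [3, 4], 5)` holds because the distance is exactly 5.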

5
Naïve approaches

Sequential scanning
Clearly unfeasible
R-tree
Might work, but dimensionality is extremely high
(proportional to sequence length)
Poor performance
What can we do to improve performance?

6
Dimensionality reduction

Redundant data, lots of patterns
Feature extraction
Data transformation
Cosine
Wavelet
Fourier <-- we'll focus on this.

7
Discrete Fourier Transformation

Map a sequence x in time-domain to a sequence X
in frequency-domain
Reversible!
Fast and easy-to-implement algorithms
Energy preservation property
Key concept in dimensionality reduction.
Just keep the first 2 or 3 coefficients.
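
A small sketch of the idea in pure Python, assuming the orthonormal scaling 1/sqrt(n) so that energy is preserved exactly. The function names `dft` and `features` and the default k=2 are assumptions for illustration; a real system would use an FFT library.

```python
import cmath
import math

def dft(x):
    """Orthonormal DFT: X_f = 1/sqrt(n) * sum_t x_t * exp(-2*pi*j*f*t/n)."""
    n = len(x)
    scale = 1.0 / math.sqrt(n)
    return [
        scale * sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]

def features(x, k=2):
    """Keep the first k coefficients, flattened into 2k real numbers."""
    out = []
    for c in dft(x)[:k]:
        out.extend((c.real, c.imag))
    return out

x = [1.0, 2.0, 3.0, 4.0]
energy_time = sum(v * v for v in x)          # energy in the time domain
energy_freq = sum(abs(c) ** 2 for c in dft(x))  # energy in the frequency domain
```

With this scaling the two energies agree up to rounding, which is the property the truncation relies on.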

8
Parseval's theorem

Let S and Q be the original sequences, and S' and
Q' their DFTs. With all coefficients kept,
D(S,Q) = D(S',Q').
Why is this important?
Keeping only the first few coefficients can only
underestimate the distance; remember the bound e.
D(S,Q) < e ---> D(S',Q') < e
We will get no false dismissals.
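
The underestimation argument can be written out as follows, assuming an orthonormal DFT over sequences of length n and features made of the first k coefficients. The symbol D_k for the truncated feature-space distance is introduced here for illustration only.

```latex
D(S,Q)^2 = \sum_{t=0}^{n-1} |s_t - q_t|^2
         = \sum_{f=0}^{n-1} |S'_f - Q'_f|^2
         \ge \sum_{f=0}^{k-1} |S'_f - Q'_f|^2
         = D_k(S',Q')^2
```

The middle equality is Parseval's theorem applied to the difference S - Q; dropping non-negative terms gives the inequality.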

9
Subsequence Matching

The problem
You are given a collection of N sequences of real
numbers (S1, S2, ..., SN), potentially of different
lengths.
The user specifies a query subsequence Q of length
Len(Q) and the tolerance e, the max. acceptable
dissimilarity.
You want to return all the sequences that contain a
match for the query within tolerance e, along with
the correct offsets k.
Solutions
many!

10
Possible Solutions

1) Brute force method - Sequentially scan every
possible subsequence of the data sequences for a
match.
2) I-Naive - Transform all subsequences to points
in feature space and store those points in an
R-tree.
3) ST-Index - Transform all subsequences to
points in feature space. Store the MBRs (minimum
bounding rectangles) of sub-trails in an R-tree.
Note: I-Naive and ST-Index are similar in the
initial steps.
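
As a baseline, the brute force method can be sketched in a few lines of Python. This is only an illustration under simple assumptions: sequences are lists of numbers, offsets are 0-based, and the name `brute_force_matches` is hypothetical.

```python
import math

def brute_force_matches(sequences, q, e):
    """Return (sequence index, offset k) for every subsequence within e of q."""
    hits = []
    m = len(q)
    for i, s in enumerate(sequences):
        # Slide over every offset where a window of length m fits.
        for k in range(len(s) - m + 1):
            d = math.sqrt(sum((s[k + t] - q[t]) ** 2 for t in range(m)))
            if d <= e:
                hits.append((i, k))
    return hits

data = [[1, 2, 3, 4, 5], [9, 9, 2, 3]]
exact = brute_force_matches(data, [2, 3], 0)
```

Every window of every sequence is compared against the query, which is what the indexed methods try to avoid.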

11
Possible Solutions
I-naive

Assume that the min. query length is w. w
changes according to the application (e.g., stock
market applications interested in weekly/monthly
patterns use a larger w).
Procedure
1) Use a "sliding window" of size w to extract
every subsequence of length w from a sequence.
2) Apply the DFT to map each subsequence of size w
to a point in feature space.
3) Each sequence S produces a trail of
Len(S)-w+1 points.
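
The three steps above can be sketched as follows. This is a minimal illustration, assuming an orthonormal DFT and k=2 retained coefficients (so each point has 4 real coordinates); `dft_features` and `trail` are hypothetical names.

```python
import cmath
import math

def dft_features(window, k=2):
    """First k orthonormal DFT coefficients as a tuple of 2k reals."""
    n = len(window)
    feats = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * math.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        feats.extend((c.real, c.imag))
    return tuple(feats)

def trail(s, w, k=2):
    """Slide a window of size w over s; one feature point per offset."""
    return [dft_features(s[i:i + w], k) for i in range(len(s) - w + 1)]

points = trail([1, 2, 3, 4, 5, 6], w=4)
```

A sequence of length 6 with w = 4 yields 6 - 4 + 1 = 3 points, one for each window offset.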

12
Possible Solutions
I-naive

Procedure cont.
4) Store all the points of the trails in feature
space in a spatial access method (R-tree).
5) When presented with a query of length w and
tolerance e, extract the features of the query
and perform a spatial range query with
radius e.
6) Discard false alarms by retrieving all those
subsequences and calculating their actual
distance from the query.
Note: Very slow approach, worse than sequential
scan. The R-tree is large (tall and slow).
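
A rough sketch of the I-naive query path for a single sequence. To stay self-contained it replaces the R-tree with a plain list scanned linearly, which is an assumption made only for illustration; the function names and k=2 are likewise hypothetical.

```python
import cmath
import math

def dft_features(window, k=2):
    """First k orthonormal DFT coefficients as a tuple of 2k reals."""
    n = len(window)
    feats = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * math.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        feats.extend((c.real, c.imag))
    return tuple(feats)

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def i_naive_search(s, q, e, k=2):
    w = len(q)
    # Step 4: one stored point per offset (a list stands in for the R-tree).
    points = [dft_features(s[i:i + w], k) for i in range(len(s) - w + 1)]
    # Step 5: range query of radius e around the query's feature point.
    qf = dft_features(q, k)
    candidates = [i for i, p in enumerate(points) if distance(p, qf) <= e]
    # Step 6: discard false alarms using the true distance.
    return [i for i in candidates if distance(s[i:i + w], q) <= e]

s = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
offsets = i_naive_search(s, [3, 4, 5, 4], e=0.5)
```

The candidate list may contain false alarms, but the final filter removes them.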

13
Possible Solutions
ST-Index

Assume that the min. query length is w. w
changes according to the application (e.g., stock
market applications interested in weekly/monthly
patterns use a larger w).
Procedure
1) Use a "sliding window" of size w to extract
every subsequence of length w from a sequence.
2) Apply the DFT to map each subsequence of size w
to a point in feature space.
3) Each sequence S produces a trail of
Len(S)-w+1 points.

14
Possible Solutions
ST-Index

Procedure cont.
4) Divide the trail of points in feature space
into sub-trails (algorithm described later).
5) Represent each sub-trail by its MBR (minimum
bounding rectangle).
6) Store the MBRs in a spatial access method
(e.g., an R-tree).
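
Steps 5 and 6 can be sketched as below, taking the sub-trails as given. The entry layout (first offset, point count, low corner, high corner) and the names `mbr` and `index_entries` are assumptions for illustration; a plain list stands in for the spatial access method.

```python
def mbr(points):
    """Minimum bounding rectangle of a set of points: (low corner, high corner)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high

def index_entries(subtrails):
    """One entry per sub-trail: (offset of first point, number of points, low, high)."""
    entries = []
    offset = 0
    for sub in subtrails:
        low, high = mbr(sub)
        entries.append((offset, len(sub), low, high))
        offset += len(sub)  # sub-trails cover consecutive trail offsets
    return entries

entries = index_entries([[(0.0, 1.0), (2.0, -1.0)], [(5.0, 5.0)]])
```

Only one rectangle per sub-trail is stored, so the index holds far fewer entries than one per trail point.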

15
MBRs in F-Dimension
16
MBRs in F-Dimension
17
MBRs in F-Dimension
18
MBRs in F-Dimension
19
MBRs in F-Dimension
20
Insertions

Problem: How do we divide these trails into
sub-trails?
Two heuristics
1) Every sub-trail has a predetermined, fixed
number of points. (I-fixed)
2) Every sub-trail has a predetermined, fixed
length. (I-fixed)
Solution: Use an "adaptive heuristic."
(I-adaptive)

21
I-adaptive Algorithm

- Based on the idea of the marginal cost of a
point in terms of disk accesses.
Marginal cost (mc) = (estimated disk accesses for a
given MBR) / (k points in that MBR)
Algorithm
Assign the first point of the trail to a
sub-trail.
FOR each successive point
  IF it increases the marginal cost of the current
  sub-trail
  THEN start another sub-trail
  ELSE include it in the current sub-trail
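
A sketch of the greedy loop above. The slides do not give the disk-access estimate, so the cost model here is an assumption: the product of (side length + 0.5) over all dimensions, with feature space normalized to the unit hypercube. The names `marginal_cost` and `adaptive_subtrails` are hypothetical.

```python
def marginal_cost(points):
    """Assumed cost model: prod(side_d + 0.5) / number of points in the MBR."""
    cost = 1.0
    for d in range(len(points[0])):
        values = [p[d] for p in points]
        cost *= (max(values) - min(values)) + 0.5
    return cost / len(points)

def adaptive_subtrails(points):
    """Greedily split a trail into sub-trails by marginal cost."""
    subtrails = [[points[0]]]  # the first point opens the first sub-trail
    for p in points[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [p]) > marginal_cost(current):
            subtrails.append([p])   # the point is too expensive: start a new sub-trail
        else:
            current.append(p)       # the point is cheap: grow the current sub-trail
    return subtrails

pts = [(0.0, 0.0), (0.01, 0.0), (0.02, 0.01), (0.9, 0.9), (0.91, 0.9)]
groups = adaptive_subtrails(pts)
```

With these points the jump to (0.9, 0.9) inflates the rectangle, so the trail splits into one group of three points and one of two.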

22
I-adaptive Algorithm
23
Searching

Consider the minimum query length w and distance
bound e.
Let Q be the query sequence.
If Len(Q) = w, it's all good.
Algorithm Search_Short
Use DFT to map Q to a point q in feature space.
Build a sphere of radius e centered at q.
Retrieve all the sub-trails whose MBRs intersect
the query region using our index.
Examine the corresponding subsequences and throw
away false alarms.
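
A self-contained sketch of Search_Short for one sequence. Several things are assumed for illustration: sub-trails of a fixed size of 4 points, a plain list instead of an R-tree, k=2 coefficients, and hypothetical names throughout.

```python
import cmath
import math

def dft_features(window, k=2):
    """First k orthonormal DFT coefficients as a tuple of 2k reals."""
    n = len(window)
    feats = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * math.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        feats.extend((c.real, c.imag))
    return tuple(feats)

def build_index(s, w, size=4, k=2):
    """Flat stand-in index: (first offset, last offset, MBR low, MBR high)."""
    points = [dft_features(s[i:i + w], k) for i in range(len(s) - w + 1)]
    index = []
    for start in range(0, len(points), size):
        chunk = points[start:start + size]
        dims = range(len(chunk[0]))
        low = [min(p[d] for p in chunk) for d in dims]
        high = [max(p[d] for p in chunk) for d in dims]
        index.append((start, start + len(chunk) - 1, low, high))
    return index

def mindist(point, low, high):
    """Distance from a point to the nearest point of a rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(point, low, high)))

def search_short(s, index, q, e, k=2):
    w = len(q)
    qf = dft_features(q, k)
    hits = []
    for first, last, low, high in index:
        if mindist(qf, low, high) <= e:          # sphere intersects the MBR
            for i in range(first, last + 1):     # examine its subsequences
                d = math.sqrt(sum((a - b) ** 2 for a, b in zip(s[i:i + w], q)))
                if d <= e:                       # keep true matches only
                    hits.append(i)
    return hits

s = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
index = build_index(s, w=4)
found = search_short(s, index, [3, 4, 5, 4], e=0.5)
```

A sphere intersects a rectangle exactly when the distance from its center to the rectangle is at most the radius, which is what `mindist` checks.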

24
Searching

Now, what if Len(Q) > w?
Requires more analysis, but basically we have
that Len(Q) = p*w
So we can split Q into p subsequences of
length w.
What about the radius?
r = e/sqrt(p)
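
The reduced radius follows from a short counting argument, sketched here under the assumption that Q is cut into p aligned pieces Q_1, ..., Q_p and that S_1, ..., S_p are the corresponding pieces of a matching subsequence S (this piece notation is introduced only for the derivation).

```latex
\sum_{j=1}^{p} D(S_j, Q_j)^2 = D(S,Q)^2 \le e^2
\;\Longrightarrow\;
\min_{j} D(S_j, Q_j)^2 \le \frac{e^2}{p}
\;\Longrightarrow\;
\min_{j} D(S_j, Q_j) \le \frac{e}{\sqrt{p}}
```

So at least one piece of any true match falls inside its sub-query sphere of radius e/sqrt(p), and searching with that radius causes no false dismissals.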

25
Searching

So we have...
Algorithm Search_Long
Break sequence Q into p sub-queries, each with
radius e/sqrt(p).
Retrieve from the index all the sub-trails whose
MBRs intersect at least one of the
sub-query regions.
Examine the subsequences, discard false alarms.
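
A self-contained sketch of Search_Long for one sequence, assuming Len(Q) is an exact multiple of w. As before, the flat list index, the fixed sub-trail size of 4, k=2 and all function names are assumptions for illustration.

```python
import cmath
import math

def dft_features(window, k=2):
    """First k orthonormal DFT coefficients as a tuple of 2k reals."""
    n = len(window)
    feats = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * math.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        feats.extend((c.real, c.imag))
    return tuple(feats)

def build_index(s, w, size=4, k=2):
    """Flat stand-in index: (first offset, last offset, MBR low, MBR high)."""
    points = [dft_features(s[i:i + w], k) for i in range(len(s) - w + 1)]
    index = []
    for start in range(0, len(points), size):
        chunk = points[start:start + size]
        dims = range(len(chunk[0]))
        low = [min(p[d] for p in chunk) for d in dims]
        high = [max(p[d] for p in chunk) for d in dims]
        index.append((start, start + len(chunk) - 1, low, high))
    return index

def mindist(point, low, high):
    """Distance from a point to the nearest point of a rectangle."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(point, low, high)))

def search_long(s, index, q, e, w, k=2):
    p = len(q) // w                      # number of sub-queries
    r = e / math.sqrt(p)                 # reduced radius per sub-query
    candidates = set()
    for j in range(p):
        pf = dft_features(q[j * w:(j + 1) * w], k)
        for first, last, low, high in index:
            if mindist(pf, low, high) <= r:
                for t in range(first, last + 1):
                    i = t - j * w        # where the whole query would start
                    if 0 <= i <= len(s) - len(q):
                        candidates.add(i)
    # Discard false alarms using the true distance on the full query.
    return sorted(
        i for i in candidates
        if math.sqrt(sum((a - b) ** 2 for a, b in zip(s[i:i + len(q)], q))) <= e
    )

s = [1, 2, 3, 4, 5, 4, 3, 2, 1, 2, 3, 4]
index = build_index(s, w=4)
found = search_long(s, index, s[2:10], e=0.5, w=4)
```

Each sub-query hit at trail offset t is shifted back by j*w to recover the candidate start of the full query before the final check.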

26
Experimental results
27
Experimental results