Fast Subsequence Matching in Time-Series Databases.

Learn more at: https://www.cise.ufl.edu

1
  • Fast Subsequence Matching in Time-Series
    Databases.
  • C. Faloutsos, M. Ranganathan,
  • Y. Manolopoulos
  • Presented by
  • George Liu / Luis L. Perez

2
Time series?
  • Definition
  • Applications
  • Financial markets
  • Weather forecasting
  • Healthcare

3
What kind of problem are we trying to solve?
  • Whole sequence matching
  • Given a database S with n sequences, all of them
    equally long, and a query sequence Q of the same
    length.
  • Find all sequences in S that match Q.
  • Subsequence matching
  • Given a database S with n sequences, with
    potentially different lengths, and a query
    sequence Q.
  • Find all sequences in S that contain a subsequence
    matching Q.

4
Useful notation
  • Given a sequence S
  • Len(S) denotes the length of the sequence
  • S[i] denotes the ith element
  • S[i:j] denotes the subsequence between S[i] and
    S[j]
  • Given two sequences, S and Q
  • D(S,Q) denotes the distance between S and Q.
  • Euclidean
  • Distance bound e
  • Max. distance for two sequences to be considered
    equal

5
Naïve approaches
  • Sequential scanning
  • Clearly unfeasible
  • R-tree
  • Might work, but dimensionality is extremely high
    (proportional to sequence length)
  • Poor performance
  • What can we do to improve performance?

6
Dimensionality reduction
  • Redundant data, lots of patterns
  • Feature extraction
  • Data transformation
  • Cosine
  • Wavelet
  • Fourier <-- we'll focus on this.

7
Discrete Fourier Transform (DFT)
  • Map a sequence x in time-domain to a sequence X
    in frequency-domain
  • Reversible!
  • Fast and easy-to-implement algorithms
  • Energy preservation property
  • Key concept in dimensionality reduction.
  • Just keep the first 2 or 3 coefficients.
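
The energy preservation property and the "keep the first few coefficients" idea can be sketched in a few lines. This is a minimal illustration, not the presenters' code: it assumes an orthonormal DFT (scaled by 1/sqrt(n)) so that energy is preserved exactly, and the names dft, energy and features are hypothetical.

```python
import cmath
import math


def dft(x):
    """Orthonormal DFT (assumed 1/sqrt(n) scaling) of a real sequence x."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


def energy(seq):
    """Sum of squared magnitudes; works for real or complex values."""
    return sum(abs(v) ** 2 for v in seq)


def features(x, k=3):
    """Dimensionality reduction: keep only the first k DFT coefficients."""
    return dft(x)[:k]


x = [1.0, 2.0, 4.0, 3.0, 5.0, 6.0, 5.0, 7.0]
X = dft(x)
print(energy(x), energy(X))   # the two energies agree up to rounding
print(features(x, k=2))       # a 2-coefficient summary of the 8-point sequence
```

A direct O(n^2) transform is used only to keep the sketch dependency-free; a real system would use an FFT routine.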

8
Parseval's theorem
  • Let S and Q be the original sequences.
  • S' and Q' are S and Q after applying DFT:
    D(S,Q) = D(S',Q')
  • Why is this important?
  • Distance underestimation: if S' and Q' keep only the
    first few coefficients, the distance can only shrink.
    Remember the bound e.
  • D(S,Q) < e ---> D(S', Q') < e
  • We will get no false dismissals.
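
The no-false-dismissal argument can be written out as a short derivation. This is a sketch assuming an orthonormal DFT over sequences of length n; D_k, the distance computed over only the first k coefficients, is a symbol introduced here and not used in the slides.

```latex
\begin{align*}
D(S,Q)^2 &= \sum_{t=0}^{n-1} \lvert S_t - Q_t \rvert^2
          = \sum_{f=0}^{n-1} \lvert S'_f - Q'_f \rvert^2
          && \text{(Parseval)} \\
         &\ge \sum_{f=0}^{k-1} \lvert S'_f - Q'_f \rvert^2
          = D_k(S',Q')^2
          && \text{(dropping non-negative terms)}
\end{align*}
```

So any pair within distance e in the time domain is also within e in the reduced feature space; the reverse does not hold, which is why false alarms must be filtered afterwards.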

9
Subsequence Matching
  • The problem
  • You are given a collection of n sequences of real
    numbers (S1, S2, ..., Sn), potentially of different
    lengths.
  • The user specifies a query subsequence Q of length
    Len(Q) and the tolerance e, the max. acceptable
    dis-similarity.
  • You want to return all the sequences, along with
    the correct offsets k, that contain a subsequence
    matching the query within tolerance e.
  • Solutions
  • many!

10
Possible Solutions
  • 1) Brute Force method - Sequentially scan every
    possible subsequence of the data sequences for a
    match.
  • 2) I-Naive - Transform all subsequences to points
    in feature space and store those points into an
    R-tree.
  • 3) ST-Index - Transform all subsequences to
    points in feature space. Store MBRs of
    sub-trails into an R-tree.
  • Note: I-Naive and ST-Index are similar in the
    initial steps.
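
As a baseline for the other two methods, the brute-force scan can be sketched as follows. The function names and the toy data are illustrative assumptions; the distance is Euclidean, as in the notation slide.

```python
import math


def dist(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sequential_scan(sequences, query, eps):
    """Return (sequence index, offset k) for every subsequence within eps of query."""
    m = len(query)
    hits = []
    for i, s in enumerate(sequences):
        # Slide over every possible starting offset in this sequence.
        for k in range(len(s) - m + 1):
            if dist(s[k:k + m], query) <= eps:
                hits.append((i, k))
    return hits


db = [[1, 2, 3, 4, 5], [9, 2, 3, 4]]
print(sequential_scan(db, [2, 3, 4], eps=0.5))  # [(0, 1), (1, 1)]
```

Every offset of every sequence is compared against the query, which is what the index-based methods try to avoid.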

11
Possible Solutions
I-naive
  • Assume that the min. query length is w. w
    changes according to the application (e.g., stock
    market analysts interested in weekly/monthly
    patterns use a larger w).
  • Procedure
  • 1) Use the "sliding window" to find every
    subsequence of length w in a sequence.
  • 2) Use the DFT to map each subsequence of size w
    to a point in feature space.
  • 3) A trail of Len(S)-w+1 points is produced.
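
Steps 1 to 3 can be sketched like this. It is an illustration under assumptions: an orthonormal DFT, k = 2 retained coefficients flattened into (real, imaginary) pairs so that each window becomes a real-valued point, and hypothetical function names.

```python
import cmath
import math


def dft_features(window, k=2):
    """First k orthonormal DFT coefficients, flattened to [re0, im0, re1, im1, ...]."""
    n = len(window)
    point = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        point += [c.real, c.imag]
    return point


def trail(s, w, k=2):
    """One feature point per window offset: Len(S) - w + 1 points in total."""
    return [dft_features(s[i:i + w], k) for i in range(len(s) - w + 1)]


s = [3.0, 4.0, 6.0, 5.0, 7.0, 8.0, 7.0, 9.0, 10.0, 9.0]
pts = trail(s, w=4)
print(len(pts))   # 7, i.e. 10 - 4 + 1
print(pts[0])     # feature point of the first window [3, 4, 6, 5]
```

Because consecutive windows overlap in all but one value, consecutive points tend to lie close together, which is what makes the trail worth grouping.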

12
Possible Solutions
I-naive
  • Procedure cont.
  • 4) Store all the points of the trails in feature
    space in a spatial access method (R-tree).
  • 5) When presented with a query of length w and
    tolerance e, extract the features of the query
    and perform the spatial access range query with
    radius e.
  • 6) Discard false alarms by retrieving all those
    subsequences and calculating their actual
    distance from the query.
  • Note: Very, very slow approach. Worse than
    sequential scan. You have a large R-tree (tall
    and slow).

13
Possible Solutions
ST-Index
  • Assume that the min. query length is w. w
    changes according to the application (e.g., stock
    market analysts interested in weekly/monthly
    patterns use a larger w).
  • Procedure
  • 1) Use the "sliding window" to find every
    subsequence of length w in a sequence.
  • 2) Use the DFT to map each subsequence of size w
    to a point in feature space.
  • 3) A trail of Len(S)-w+1 points is produced.

14
Possible Solutions
ST-Index
  • Procedure cont.
  • 4) Divide the trail of points in feature space
    into sub-trails (algorithm mentioned later).
  • 5) Represent each of them by an MBR (minimum
    bounding rectangle).
  • 6) Store the MBRs in a spatial access method
    (e.g., R-tree).
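
Steps 4 and 5 can be sketched as below. Cutting the trail into fixed-size chunks is only a placeholder assumption for the splitting algorithm discussed later, and the function names are hypothetical. Each entry records the first and last window offset covered, so matching offsets can be recovered at query time.

```python
def mbr(points):
    """Minimum bounding rectangle of a list of points, as (low corner, high corner)."""
    dims = range(len(points[0]))
    low = tuple(min(p[d] for p in points) for d in dims)
    high = tuple(max(p[d] for p in points) for d in dims)
    return low, high


def subtrail_mbrs(trail, size):
    """Cut the trail into consecutive chunks; return (first offset, last offset, MBR)."""
    entries = []
    for start in range(0, len(trail), size):
        chunk = trail[start:start + size]
        entries.append((start, start + len(chunk) - 1, mbr(chunk)))
    return entries


trail = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (1.0, 1.1), (1.1, 1.0)]
for entry in subtrail_mbrs(trail, size=3):
    print(entry)
```

Only the rectangles, not the individual points, would then be inserted into the spatial index, which is where the space saving over I-naive comes from.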

15-19
MBRs in F-Dimension (figure slides)

20
Insertions
  • Problem: How do we divide these trails into
    sub-trails?
  • Two heuristics
  • 1) Every sub-trail has a predetermined, fixed
    number of points (I-fixed).
  • 2) Every sub-trail has a predetermined, fixed
    length (I-fixed).
  • Solution: Use an "adaptive heuristic"
    (I-adaptive).

21
I-adaptive Algorithm
  • Based on the idea of the marginal cost of a
    point in terms of disk accesses.
  • Marginal cost (mc) = (disk accesses for a given MBR)
    / (number k of points in that MBR)
  • Algorithm
  • Assign the first point of the trail to a
    sub-trail.
  • FOR each successive point
  • IF it increases the marginal cost of the current
    sub-trail
  • THEN start another sub-trail
  • ELSE include it in the current sub-trail
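
The greedy loop above can be sketched as follows. The slides do not give the disk-access estimate, so the cost model used here, the product of (side length + 0.5) over the MBR sides for points normalized to the unit cube, is an assumption made for illustration, as are the function names.

```python
def mbr_sides(points):
    """Side lengths of the minimum bounding rectangle of the points."""
    dims = range(len(points[0]))
    return [max(p[d] for p in points) - min(p[d] for p in points) for d in dims]


def marginal_cost(points):
    """Assumed cost model: estimated disk accesses of the MBR divided by its k points."""
    accesses = 1.0
    for side in mbr_sides(points):
        accesses *= side + 0.5   # assumption: each side is widened by 0.5
    return accesses / len(points)


def adaptive_split(trail):
    """Start a new sub-trail whenever the next point would raise the marginal cost."""
    subtrails = [[trail[0]]]
    for p in trail[1:]:
        current = subtrails[-1]
        if marginal_cost(current + [p]) > marginal_cost(current):
            subtrails.append([p])
        else:
            current.append(p)
    return subtrails


trail = [(0.0, 0.0), (0.01, 0.01), (0.02, 0.0), (0.9, 0.9), (0.91, 0.9)]
print([len(st) for st in adaptive_split(trail)])  # [3, 2]
```

Nearby points are absorbed cheaply because they barely grow the rectangle, while a jump in feature space inflates the cost and triggers a new sub-trail.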

22
I-adaptive Algorithm
23
Searching
  • Consider the sliding-window length w and distance
    bound e.
  • Let Q be the query sequence.
  • If Len(Q) = w, it's all good.
  • Algorithm: Search_Short
  • Use DFT to map Q to a point q in feature space.
    Make it a sphere with radius e.
  • Retrieve all the sub-trails whose MBRs intersect
    the query region using our index.
  • Throw away false alarms.
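
A compact sketch of Search_Short, under several assumptions: a flat list of MBRs stands in for the R-tree, sub-trails are fixed-size chunks, the DFT is orthonormal with k = 2 coefficients kept, and all function names are hypothetical. An MBR intersects the query sphere when the smallest distance from q to the rectangle is at most e.

```python
import cmath
import math


def features(window, k=2):
    """First k orthonormal DFT coefficients as a flat real vector."""
    n = len(window)
    point = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        point += [c.real, c.imag]
    return point


def build_index(s, w, chunk=3):
    """Stand-in for the R-tree: list of (first offset, last offset, low, high)."""
    trail = [features(s[i:i + w]) for i in range(len(s) - w + 1)]
    index = []
    for start in range(0, len(trail), chunk):
        pts = trail[start:start + chunk]
        low = [min(col) for col in zip(*pts)]
        high = [max(col) for col in zip(*pts)]
        index.append((start, start + len(pts) - 1, low, high))
    return index


def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, low, high)))


def search_short(s, index, query, eps):
    q = features(query)
    hits = []
    for first, last, low, high in index:
        if min_dist_to_mbr(q, low, high) > eps:
            continue  # this MBR does not touch the query sphere
        for k in range(first, last + 1):
            # Post-processing: check the true distance to drop false alarms.
            if math.dist(s[k:k + len(query)], query) <= eps:
                hits.append(k)
    return hits


s = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0]
index = build_index(s, w=4)
print(search_short(s, index, [3.0, 2.0, 5.0, 4.0], eps=0.5))  # [2]
```

The MBR test can admit offsets that fail the true-distance check (false alarms) but, by the lower-bounding property, never rejects a real match.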

24
Searching
  • Now, what if Len(Q) > w?
  • Requires more analysis, but basically we have
    that Len(Q) = pw
  • So we can split Q into p subsequences of
    length w.
  • What about the radius? r =
    e/sqrt(p)
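
The radius e/sqrt(p) follows from a short pigeonhole-style argument. In this sketch S stands for a candidate subsequence of the same length as Q, and s_j, q_j (symbols introduced here) are the j-th length-w pieces of S and Q.

```latex
\begin{equation*}
D(S,Q)^2 = \sum_{j=1}^{p} D(s_j, q_j)^2 \le e^2
\;\Longrightarrow\;
\min_{j} D(s_j, q_j)^2 \le \frac{e^2}{p}
\;\Longrightarrow\;
\min_{j} D(s_j, q_j) \le \frac{e}{\sqrt{p}}
\end{equation*}
```

If every piece were farther than e/sqrt(p) from its counterpart, the squared distances would sum to more than e^2, so at least one piece of any true match must fall inside the smaller radius.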

25
Searching
  • So we have...
  • Algorithm: Search_Long
  • Break sequence Q into p sub-queries, each with radius
    e/sqrt(p).
  • Retrieve from the index all the sub-trails whose
    MBRs intersect at least one of the
    sub-query regions.
  • Examine the sub-sequences, discard false alarms.
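
Search_Long can be sketched along the same lines. The assumptions are the same as for any toy version of this index: a flat list of MBRs instead of an R-tree, fixed-size sub-trails, an orthonormal DFT, Len(Q) exactly equal to p*w, and hypothetical function names. A hit for piece j at window offset k means the whole query would start at offset k - j*w.

```python
import cmath
import math


def features(window, k=2):
    """First k orthonormal DFT coefficients as a flat real vector."""
    n = len(window)
    point = []
    for f in range(k):
        c = sum(window[t] * cmath.exp(-2j * cmath.pi * f * t / n)
                for t in range(n)) / math.sqrt(n)
        point += [c.real, c.imag]
    return point


def build_index(s, w, chunk=3):
    """Stand-in for the R-tree: list of (first offset, last offset, low, high)."""
    trail = [features(s[i:i + w]) for i in range(len(s) - w + 1)]
    index = []
    for start in range(0, len(trail), chunk):
        pts = trail[start:start + chunk]
        low = [min(col) for col in zip(*pts)]
        high = [max(col) for col in zip(*pts)]
        index.append((start, start + len(pts) - 1, low, high))
    return index


def min_dist_to_mbr(q, low, high):
    """Smallest Euclidean distance from point q to the rectangle [low, high]."""
    return math.sqrt(sum(max(l - x, 0.0, x - h) ** 2
                         for x, l, h in zip(q, low, high)))


def search_long(s, index, query, w, eps):
    p = len(query) // w              # assumes Len(Q) = p * w exactly
    radius = eps / math.sqrt(p)      # smaller radius for each sub-query
    candidates = set()
    for j in range(p):
        q = features(query[j * w:(j + 1) * w])
        for first, last, low, high in index:
            if min_dist_to_mbr(q, low, high) > radius:
                continue
            for k in range(first, last + 1):
                start = k - j * w    # where the full query would begin
                if 0 <= start <= len(s) - len(query):
                    candidates.add(start)
    # Examine the candidate subsequences and discard false alarms.
    return sorted(k for k in candidates
                  if math.dist(s[k:k + len(query)], query) <= eps)


s = [0.0, 1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0]
index = build_index(s, w=2)
query = [3.1, 2.0, 5.0, 3.9]         # Len(Q) = 4 = 2 * w
print(search_long(s, index, query, w=2, eps=0.5))  # [2]
```

Taking the union over all p sub-queries is what preserves the no-false-dismissal guarantee; the final distance check removes the extra candidates this union brings in.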

26
Experimental results
27
Experimental results
  • Stock price database with 300,000 points
  • 1 number = 4 bytes
  • DFT keeping first 3 coefficients (actually 6)
  • w = 512 bytes
  • R-tree

28
Experimental results
  • Space
  • Naïve methods: 24 MB
  • This method: 5 KB
  • Time - short queries (Len(Q) = w)
  • 3 to 100 times better response times
  • Time - long queries (Len(Q) > w)
  • 10 to 100 times better response times