Fast Subsequence Matching in Time-Series Databases

1
Fast Subsequence Matching in Time-Series Databases
  • Christos Faloutsos
  • M. Ranganathan
  • Yannis Manolopoulos
  • Department of Computer Science and ISR
  • University of Maryland at College Park

Presented by Rui Li
2
Abstract
  • Goal: To find an efficient indexing method to
    locate time series in a database
  • Main Idea:
  • Map each time series into a small set of
    multidimensional rectangles in feature space
  • Rectangles can be readily indexed using
    traditional spatial access methods, e.g., R-tree

3
Introduction
  • Hot Problem: Searching for similar patterns in
    time-series databases
  • Applications:
  • financial, marketing and production time series,
    e.g. stock prices
  • scientific databases, e.g. weather, geological,
    environmental data

4
Introduction (cont.)
  • Similarity Queries
  • Whole Matching
  • Subsequence Matching
  • partial matching
  • report time series along with offset

5
Introduction (cont.)
  • Whole Matching (Previous Work)
  • Use a distance-preserving transform (e.g., DFT)
    to extract f features from time series (e.g., the
    first f DFT coefficients), and then map them into
    points in the f-dimensional feature space
  • A spatial access method (e.g., an R-tree) can then
    be used to answer approximate-match queries
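
The filter-and-refine idea behind whole matching can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper names `dft_features` and `whole_match` are hypothetical, a linear scan stands in for the R-tree, and the DFT is assumed to be normalized by 1/sqrt(n) so that feature-space distances never exceed the true Euclidean distance.

```python
import cmath
import math

def dft_features(series, f):
    """First f coefficients of the orthonormal DFT, flattened to 2*f reals."""
    n = len(series)
    feats = []
    for k in range(f):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(series)) / math.sqrt(n)
        feats.extend([c.real, c.imag])
    return feats

def whole_match(database, query, eps, f=2):
    """Return indices of series within Euclidean distance eps of query."""
    q_feat = dft_features(query, f)
    hits = []
    for idx, s in enumerate(database):
        # Filter step: the feature distance is a lower bound on the true
        # distance, so anything farther than eps here cannot qualify.
        if math.dist(dft_features(s, f), q_feat) > eps:
            continue
        # Refine step: check the true distance to drop false alarms.
        if math.dist(s, query) <= eps:
            hits.append(idx)
    return hits

# Hypothetical toy data: three series of length 4.
database = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0], [9.0, 9.0, 9.0, 9.0]]
matches = whole_match(database, [1.0, 2.0, 3.0, 4.0], eps=1.5)
```

In a real system the filter step would be an R-tree range query over the stored feature points rather than a loop over every series.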

6
Introduction (cont.)
  • Subsequence Matching (Goal)
  • Map time series into rectangles in feature space
  • Spatial access methods as the eventual indexing
    mechanism

7
Background
  • To guarantee no false dismissals for range
    queries, the feature extraction function F()
    should satisfy the following formula for any two
    time series $O_1$ and $O_2$:
    $D_{feature}(F(O_1), F(O_2)) \le D(O_1, O_2)$
  • Parseval's Theorem:
  • The DFT preserves the Euclidean distance between
    two time series
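
Why Parseval's theorem yields the no-false-dismissal property can be spelled out in two lines. The notation is chosen here for illustration: $x$ and $y$ are two series of length $n$, $X$ and $Y$ are their DFTs normalized by $1/\sqrt{n}$, and $f$ is the number of retained coefficients.

```latex
\begin{align*}
D(x, y)^2 &= \sum_{t=0}^{n-1} |x_t - y_t|^2 = \sum_{k=0}^{n-1} |X_k - Y_k|^2
  && \text{(Parseval, using linearity of the DFT)} \\
D_{feature}(F(x), F(y))^2 &= \sum_{k=0}^{f-1} |X_k - Y_k|^2 \le D(x, y)^2
  && \text{(dropping non-negative terms)}
\end{align*}
```

So any pair within distance ε in the time domain is also within ε in feature space, and a range query in feature space cannot miss it.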

8
Proposed Method
  • Mapping each time series to a trail in feature
    space
  • Use a sliding window of size w and place it at
    every possible offset
  • For each such placement of the window, extract
    the features of the subsequence inside the window
  • A time series of length L is mapped to a trail in
    feature space, consisting of
  • L - w + 1 points: one point for each offset
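
A minimal sketch of the sliding-window mapping, assuming DFT features normalized by 1/sqrt(w). The names `dft_features` and `trail` are hypothetical, and the toy series is made up for illustration.

```python
import cmath
import math

def dft_features(window, f):
    """First f coefficients of the orthonormal DFT, flattened to 2*f reals."""
    n = len(window)
    feats = []
    for k in range(f):
        c = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(window)) / math.sqrt(n)
        feats.extend([c.real, c.imag])
    return feats

def trail(series, w, f):
    """One feature point per window offset: len(series) - w + 1 points."""
    return [dft_features(series[i:i + w], f)
            for i in range(len(series) - w + 1)]

series = [3.0, 4.0, 6.0, 5.0, 4.0, 6.0, 7.0, 9.0, 8.0, 7.0]
points = trail(series, w=4, f=2)
print(len(points))  # L - w + 1 = 10 - 4 + 1 = 7
```

Recomputing the DFT from scratch at every offset keeps the sketch short. Consecutive windows overlap in all but one sample, so a production version would update the coefficients incrementally.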

9
  • Example 1

10
  • Example 2
  • (a) a sample stock-price time series
  • (b) its trail in the feature space of the 0th
    and 1st DFT coefficients
  • (c) its trail in the feature space of the 1st
    and 2nd DFT coefficients

11
Proposed Method (cont.)
  • Indexing the trails
  • Simply storing the individual points of the trail
    in an R-tree is inefficient
  • Exploit the fact that successive points of the
    trail tend to be similar, i.e., the contents of
    the sliding window in nearby offsets tend to be
    similar
  • Divide the trail into sub-trails and represent
    each of them with its minimum bounding
    (hyper)-rectangle (MBR)
  • Store only a few MBRs

12
Proposed Method (cont.)
  • Indexing the trails (cont.)
  • Can guarantee no false dismissals: when a query
    arrives, all the MBRs that intersect the query
    region are retrieved, i.e., all the qualifying
    sub-trails are retrieved, plus some false alarms
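
The MBR test can be sketched as below, assuming a spherical query region. The helper names `mbr` and `intersects_sphere` are hypothetical, and the points are made up. The second query shows a false alarm: the rectangle intersects the query sphere although no trail point lies inside it.

```python
import math

def mbr(points):
    """Minimum bounding rectangle of a list of points: (low, high) corners."""
    dims = range(len(points[0]))
    low = [min(p[d] for p in points) for d in dims]
    high = [max(p[d] for p in points) for d in dims]
    return low, high

def intersects_sphere(rect, center, radius):
    """True if some point of the rectangle lies within radius of center."""
    low, high = rect
    # Clamp the center into the rectangle to find the rectangle's nearest point.
    nearest = [min(max(c, lo), hi) for c, lo, hi in zip(center, low, high)]
    return math.dist(nearest, center) <= radius

sub_trail = [(1.0, 1.0), (2.0, 1.5), (3.0, 1.2)]
rect = mbr(sub_trail)

hit = intersects_sphere(rect, center=(4.0, 1.0), radius=1.0)
# The center lies inside the rectangle, but every trail point is farther
# than 0.2 away: the sub-trail is retrieved and later discarded.
false_alarm = intersects_sphere(rect, center=(1.0, 1.5), radius=0.2)
miss = intersects_sphere(rect, center=(5.0, 5.0), radius=1.0)
```

Because every trail point lies inside its MBR, a sphere that contains a trail point always intersects the rectangle, which is why no qualifying sub-trail is missed.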

13
  • Return to Example 1

14
Proposed Method (cont.)
  • Indexing the trails (cont.)
  • Map a time series into a set of rectangles in
    feature space
  • Each MBR corresponds to a sub-trail

15
Proposed Method (cont.)
  • For each MBR we have to store:
  • $(t_{start}, t_{end})$, which are the offsets of the first
    and last window positionings in the sub-trail
  • A unique identifier for each time series
  • The extent of the MBR in each dimension, i.e.,
    $(F_{1,low}, F_{1,high}, F_{2,low}, F_{2,high}, \ldots)$
  • Store the MBRs in an R-tree
  • Recursively group the MBRs into parent MBRs,
    grandparent MBRs, etc.
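
One possible in-memory layout for such a record, with one grouping step of the hierarchy, is sketched below. The class name, field names, and sample values are all assumptions made for illustration; the slides do not prescribe a concrete layout.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SubTrailEntry:
    """One leaf-level record: a sub-trail summarized by its MBR."""
    series_id: int        # unique identifier of the time series
    t_start: int          # offset of the first window in the sub-trail
    t_end: int            # offset of the last window in the sub-trail
    low: List[float]      # lower corner of the MBR, one value per dimension
    high: List[float]     # upper corner of the MBR, one value per dimension

def parent_mbr(entries):
    """MBR enclosing a group of child entries (one R-tree grouping step)."""
    dims = range(len(entries[0].low))
    low = [min(e.low[d] for e in entries) for d in dims]
    high = [max(e.high[d] for e in entries) for d in dims]
    return low, high

# Two hypothetical sub-trails of the same series in a 2-D feature space.
leaves = [SubTrailEntry(7, 0, 41, [0.0, 0.0], [1.0, 2.0]),
          SubTrailEntry(7, 42, 90, [0.5, 1.0], [3.0, 2.5])]
root_low, root_high = parent_mbr(leaves)
```

Applying `parent_mbr` repeatedly to groups of entries gives the parent and grandparent rectangles. An actual R-tree also decides which entries to group, which this sketch leaves out.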

16
  • Example 1 (cont.)
  • assuming a fan-out of 4

17
Proposed Method (cont.)
  • The structure of a leaf node and a non-leaf node

18
Proposed Method (cont.)
  • Two questions:
  • Insertions: when a new time series is inserted,
    what is a good way to divide its trail into
    sub-trails?
  • Queries: how to handle queries, especially the
    ones that are longer than the sliding window?

19
Proposed Method (cont.)
  • Insertion: Dividing trails into sub-trails
  • Goal: Optimal division so that the number of disk
    accesses is minimized

20
  • Example 3: fixed heuristic vs. adaptive
    heuristic

21
Proposed Method (cont.)
  • Insertion (cont.)
  • Group trail-points into sub-trails by means of an
    adaptive heuristic
  • Based on a greedy algorithm, using a cost
    function to estimate the number of disk accesses
    for each of the options

22
Proposed Method (cont.)
  • Insertion (cont.)
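  • The cost function: $DA(\vec{L}) = \prod_{i=1}^{n} (L_i + 0.5)$,
    where $\vec{L} = (L_1, L_2, \ldots, L_n)$ are the
    sides of the n-dimensional MBR of a node in an
    R-tree
  • The marginal cost of each point: $mc = DA(\vec{L}) / k$,
    where k is the number of points in this MBR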

23
Proposed Method (cont.)
  • Insertion (cont.)
  • Algorithm:
    Assign the first point of the trail to a sub-trail
    (would be a predefined small MBR)
    FOR each successive point
      IF it increases the marginal cost of the current
      sub-trail THEN start a new sub-trail
      ELSE include it into the current sub-trail
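
The greedy grouping can be sketched in a few lines. The sketch assumes the disk-access estimate is the product of (L_i + 0.5) over the MBR sides, with feature coordinates normalized to the unit hypercube, and a marginal cost equal to that estimate divided by the number of points. The function names and sample trail are hypothetical.

```python
def cost(points):
    """Estimated disk accesses, prod(L_i + 0.5), for the MBR of the points."""
    da = 1.0
    for d in range(len(points[0])):
        side = max(p[d] for p in points) - min(p[d] for p in points)
        da *= side + 0.5
    return da

def split_trail(points):
    """Greedily cut a trail into sub-trails using the marginal cost."""
    sub_trails = [[points[0]]]
    for p in points[1:]:
        current = sub_trails[-1]
        mc_now = cost(current) / len(current)
        mc_with_p = cost(current + [p]) / (len(current) + 1)
        if mc_with_p > mc_now:
            sub_trails.append([p])   # p would raise the marginal cost
        else:
            current.append(p)        # p fits cheaply into the current MBR
    return sub_trails

# A trail that lingers near the origin, then jumps to the opposite corner.
trail = [(0.0, 0.0), (0.1, 0.0), (0.1, 0.1), (0.9, 0.9), (1.0, 0.9)]
parts = split_trail(trail)
print([len(s) for s in parts])  # [3, 2]
```

The jump to (0.9, 0.9) would stretch the first rectangle enough to raise its marginal cost from 0.12 to 0.49, so the heuristic starts a new sub-trail there.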

24
Proposed Method (cont.)
  • Insertion (cont.)
  • The algorithm may not work well under certain
    circumstances
  • The algorithm's goal is to minimize the size of
    each MBR, so why don't we use clustering techniques?

25
Proposed Method (cont.)
  • Searching: Queries longer than w
  • If Len(Q) = w, the searching algorithm goes as follows:
  • Map Q to a point q in the feature space; the
    query corresponds to a sphere with center q and
    radius ε
  • Retrieve the sub-trails whose MBRs intersect the
    query region
  • Examine the corresponding time series, and
    discard the false alarms
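
The final clean-up step can be sketched as follows, assuming each retrieved sub-trail reports the offset range it covers. The function name `verify_candidates` and the data are hypothetical.

```python
import math

def verify_candidates(series, query, t_start, t_end, eps):
    """Return offsets in [t_start, t_end] whose window is within eps of query."""
    w = len(query)
    matches = []
    for offset in range(t_start, t_end + 1):
        window = series[offset:offset + w]
        # Compare in the time domain: this is the exact distance, so any
        # false alarm introduced by the MBR approximation is discarded here.
        if math.dist(window, query) <= eps:
            matches.append(offset)
    return matches

series = [1.0, 2.0, 3.0, 4.0, 3.0, 2.0, 1.0, 2.0, 3.0, 4.0]
query = [2.0, 3.0, 4.0]
offsets = verify_candidates(series, query, t_start=0, t_end=7, eps=0.5)
print(offsets)  # [1, 7]
```

Only the offsets covered by retrieved sub-trails need to be checked this way, which is where the index saves work compared with scanning every offset of every series.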

26
Proposed Method (cont.)
  • Searching (cont.)
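  • If Len(Q) > w, consider the following Lemma:
  • Consider two sequences Q and S of the same length
    Len(Q) = Len(S) = p·w
  • Consider their p disjoint subsequences
  • $q_i = Q[i \cdot w + 1 : (i+1) \cdot w]$ and
    $s_i = S[i \cdot w + 1 : (i+1) \cdot w]$, where $0 \le i \le p - 1$
  • If Q and S agree within tolerance ε, then at
    least one of the pairs $(q_i, s_i)$ of corresponding
    subsequences agrees within tolerance $\varepsilon / \sqrt{p}$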
27
Proposed Method (cont.)
  • Searching (cont.)
  • If Len(Q) > w, the searching algorithm goes as follows:
  • The query time series Q is broken into p
    sub-queries which correspond to p spheres in the
    feature space with radius $\varepsilon / \sqrt{p}$
  • Retrieve the sub-trails whose MBRs intersect at
    least one of the sub-query regions
  • Examine the corresponding subsequences of the
    time series, and discard the false alarms
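
Breaking up a long query can be sketched as below. Two assumptions are made for illustration: each sub-query sphere has radius ε/√p, and any tail shorter than w is ignored during filtering (it is still checked when candidates are verified against the full query). The helper names are hypothetical.

```python
import math

def split_query(query, w):
    """Cut query into p = len(query) // w disjoint pieces of length w."""
    p = len(query) // w
    return [query[i * w:(i + 1) * w] for i in range(p)]

def sub_query_radius(eps, p):
    """Radius of each sub-query sphere: eps / sqrt(p)."""
    return eps / math.sqrt(p)

query = [float(v) for v in range(12)]     # Len(Q) = 12
pieces = split_query(query, w=4)          # p = 3 pieces of length 4
radius = sub_query_radius(eps=3.0, p=len(pieces))
```

Each piece is then mapped to its feature point and searched like a query of length w, using `radius` in place of ε. A sub-trail becomes a candidate if its MBR intersects any one of the p spheres.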

28
Experiments
  • Experiments were run on a stock prices database of
    329,000 points
  • Only the first 3 frequencies of the DFT are used;
    thus the feature space has 6 dimensions (real and
    imaginary parts of each retained DFT coefficient)
  • Sliding window size w = 512

29
Experiments (cont.)
  • Query time series were generated by taking random
    offsets into the time series and obtaining
    subsequences of length Len(Q) from those offsets

30
Experiments (cont.)
  • Four groups of experiments were carried out:
  • Comparison of the proposed method against the
    method that has sub-trails with only one point
    each
  • Experiments to compare the response time
  • Experiments with queries longer than w
  • Experiments with larger databases

31
Related Works (citations)
  • Continuous queries over data streams
  • Similarity indexing with M-tree/SS-tree, etc.
  • Efficient time series matching by wavelets
  • Fast similarity search in the presence of noise,
    scaling, and translation in time-series databases

32
Thank you!