Reza Sherkat and Davood Rafiei

1
Efficiently Evaluating Order Preserving
Similarity Queries over Historical Market-Basket
Data

Reza Sherkat and Davood Rafiei
Department of Computing Science
University of Alberta
Canada

Travel assistance provided by the Mary Louise
Imrie Graduate Student Award
2
Overview

Introduction
Histories and Time-series
Similarity model for histories
Problem Definition
Proposed Approach
Results Highlight
Conclusions

3
Querying Histories: Introduction

Querying multiple snapshots of data
Temporal selection, projection, and join queries
Finding similar time-series
Finding companies having similar stocks
Is it possible to define a notion of similarity
for objects based on the similarity of their
histories?

4
Histories

History: a sequence of time-stamped observations
Time-series: observations are real values
Observations can be more general

Example: the history of a patient
6
Similarity Model for Histories
History for 3 patients

Similarity of two histories depends on
Pair-wise similarity of their observations

The order that similar observations are recorded
Constraints on time-stamps of observations

7
Problem Definition

Given a history as a query
Evaluate k-NN and range queries efficiently.
For each history in the result, find its common
signature with the query, i.e., where the
similarity comes from.

8
Similarity Measure for Histories

Alignment of histories
An approach to line up subsequences of two
histories
Denoted by a sequence of matches
Each side of a match is an observation in A (B)
or a gap.
Each match has a score.
The alignment score measures the quality of an
alignment.

11
Alignments of Histories
Alignment score can be the sum of the score of
matches in the alignment.
The best alignment of two histories
What is the best alignment of length 3?
If a particular match could not be considered,
what would be the best alignment of length 2?
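
To make the "sum of the scores of matches" concrete, here is a minimal sketch. It assumes market-basket observations are item sets, that a match is scored by Jaccard similarity, and that a match with a gap scores 0. None of these choices is stated in the slides; `match_score` and `alignment_score` are hypothetical names.

```python
from typing import FrozenSet, List, Optional, Tuple

Observation = FrozenSet[str]
# A match pairs an observation (or a gap, written None) from each history.
Match = Tuple[Optional[Observation], Optional[Observation]]


def match_score(a: Optional[Observation], b: Optional[Observation]) -> float:
    """Assumed scoring: Jaccard similarity of two item sets, 0 for a gap."""
    if a is None or b is None:
        return 0.0
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def alignment_score(alignment: List[Match]) -> float:
    """Alignment score as the sum of the scores of its matches."""
    return sum(match_score(a, b) for a, b in alignment)


alignment = [
    (frozenset("bc"), frozenset("bc")),   # identical sets
    (frozenset("fgh"), frozenset("fg")),  # two shared items out of three
    (frozenset("hi"), None),              # observation aligned with a gap
]
print(alignment_score(alignment))
```

Any other pair-wise score could be plugged in; the sum is only one possible aggregate, as the slide notes.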
12
Constraints on the Alignments of Histories

The number of matches in the alignment
l-alignment: an alignment with l matches
The r-neighborhood constraint
For each match, the two matched observations must
lie within r of each other
r, l: parameters of the similarity query.

13
Principle of Optimality
Split A into p(A) and s(A), and B into p(B) and
s(B).
The principle of optimality holds if the optimal
alignment of A and B is the concatenation of the
optimal alignment of p(A) and p(B) with the
optimal alignment of s(A) and s(B).
14
Score of Optimal l-alignment
15
Similarity Measure for Histories
Similarity measure: the score of the optimal
l-alignment of two histories.
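
One way such a score could be computed is by dynamic programming, sketched below. This is an illustration, not the recurrence from the talk (the slide with the formula has no transcript). It assumes that time-stamps are list positions, that the r-neighborhood constraint means matched positions differ by at most r, that only observation-to-observation matches count toward l, and that observations are item sets scored by Jaccard similarity. The names `jaccard` and `l_alignment_score` are hypothetical.

```python
import math
from typing import Callable, FrozenSet, Sequence

Observation = FrozenSet[str]


def jaccard(a: Observation, b: Observation) -> float:
    """Assumed pair-wise score for item-set observations."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def l_alignment_score(
    A: Sequence[Observation],
    B: Sequence[Observation],
    l: int,
    r: int,
    score: Callable[[Observation, Observation], float] = jaccard,
) -> float:
    """Best total score of l order-preserving matches with |i - j| <= r.

    Returns -inf when no alignment with l matches exists.
    """
    n, k = len(A), len(B)
    neg = -math.inf
    # prev[i][j]: best score using m - 1 matches over A[:i] and B[:j].
    prev = [[0.0] * (k + 1) for _ in range(n + 1)]
    for _m in range(1, l + 1):
        cur = [[neg] * (k + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, k + 1):
                # Skip the last observation of A or of B.
                best = max(cur[i - 1][j], cur[i][j - 1])
                # Or match A[i-1] with B[j-1] if the constraint allows it.
                if abs(i - j) <= r and prev[i - 1][j - 1] > neg:
                    best = max(best, prev[i - 1][j - 1] + score(A[i - 1], B[j - 1]))
                cur[i][j] = best
        prev = cur
    return prev[n][k]


A = [frozenset("bc"), frozenset("fgh"), frozenset("hi")]
B = [frozenset("fgh"), frozenset("fg"), frozenset("hi")]
print(l_alignment_score(A, B, l=2, r=1))
print(l_alignment_score(A, B, l=2, r=0))
```

Tightening r from 1 to 0 forbids matching the second observation of A with the first of B, so the best 2-alignment score drops.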
16
Similarity Queries over Collection of Histories

Straightforward (not practical) approach: naïve
scan
Indexing techniques have been proposed for metric
spaces, but the similarity measure is not metric
when the distance between observations is not
metric, or
when an r-neighborhood constraint is specified.
We propose upper bounds to prune the history
search space.
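
The role of an upper bound in a k-NN query can be sketched as a generic filter-and-refine loop. This is a common pattern, not necessarily the evaluation strategy of the talk. The functions `similarity` and `upper_bound` are placeholders supplied by the caller; the toy pair used here (set overlap, bounded by the smaller set size) only stands in for the real measure and its bounds.

```python
import heapq
from typing import Callable, List, Sequence, Tuple, TypeVar

H = TypeVar("H")


def knn_with_pruning(
    query: H,
    histories: Sequence[H],
    k: int,
    similarity: Callable[[H, H], float],
    upper_bound: Callable[[H, H], float],
) -> Tuple[List[Tuple[float, int]], int]:
    """Return the k best (similarity, index) pairs and the number of exact evaluations."""
    # Visit candidates in decreasing order of their (cheap) upper bound.
    bounds = sorted(
        ((upper_bound(query, h), idx) for idx, h in enumerate(histories)),
        reverse=True,
    )
    top: List[Tuple[float, int]] = []  # min-heap holding the current k best
    examined = 0
    for ub, idx in bounds:
        if len(top) == k and ub <= top[0][0]:
            break  # no remaining candidate can beat the current k-th best
        examined += 1
        s = similarity(query, histories[idx])  # the expensive exact computation
        if len(top) < k:
            heapq.heappush(top, (s, idx))
        elif s > top[0][0]:
            heapq.heapreplace(top, (s, idx))
    return sorted(top, reverse=True), examined


# Toy stand-ins: overlap size as similarity, smaller set size as its upper bound.
def overlap(q: frozenset, h: frozenset) -> int:
    return len(q & h)


def size_bound(q: frozenset, h: frozenset) -> int:
    return min(len(q), len(h))


data = [frozenset("abcd"), frozenset("x"), frozenset("axy"), frozenset("bc")]
result, examined = knn_with_pruning(frozenset("abc"), data, 1, overlap, size_bound)
print(result, examined)
```

The loop is correct only if `upper_bound` never underestimates `similarity`; the tighter the bound, the fewer exact evaluations are needed.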

17
A General Upper Bound for the Similarity Measure

Intuition: the score of an optimal relaxed
l-alignment is not less than the score of the
optimal l-alignment.
For each observation, find an optimal match.
Aggregate the scores of the top l optimal matches
to find an upper bound for the similarity measure.

This upper bound can prune some extra
computations, but still all histories will be
accessed to evaluate a query.
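
The two steps above can be sketched as follows, under the assumptions that "relaxed" means dropping the ordering and neighborhood constraints, that observations are item sets, and that matches are scored by Jaccard similarity. The names `jaccard` and `relaxed_upper_bound` are hypothetical.

```python
from typing import Callable, FrozenSet, Sequence

Observation = FrozenSet[str]


def jaccard(a: Observation, b: Observation) -> float:
    """Assumed pair-wise score for item-set observations."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relaxed_upper_bound(
    query: Sequence[Observation],
    history: Sequence[Observation],
    l: int,
    score: Callable[[Observation, Observation], float] = jaccard,
) -> float:
    """Sum of the l largest per-observation best match scores.

    Each match in any l-alignment scores at most the best match of its
    query observation, so this sum cannot be below the optimal score.
    """
    best_per_observation = [
        max((score(a, b) for b in history), default=0.0) for a in query
    ]
    return sum(sorted(best_per_observation, reverse=True)[:l])


A = [frozenset("bc"), frozenset("fgh"), frozenset("hi")]
B = [frozenset("fgh"), frozenset("fg"), frozenset("hi")]
print(relaxed_upper_bound(A, B, l=2))
```

Computing this bound costs one pass over all pairs of observations and needs no dynamic programming, but it still touches every history, which is the limitation the slide points out.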
18
An Index-based Upper Bound for the Similarity
Measure

Intuitions
Observations are sparse in real-life
applications.
The score of an optimal relaxed match is not less
than the score of an optimal match.
The score of an optimal relaxed alignment is not
less than the score of an optimal relaxed
l-alignment.
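
The transcript does not say how the index is built, so the following is only an illustration of why sparseness helps, not the index-based bound of the talk. It assumes item-set observations and a score that is 0 for disjoint sets; under those assumptions, an inverted index from items to histories lets a query skip every history that shares no item with it. The names `build_inverted_index` and `candidate_histories` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Sequence, Set

Observation = FrozenSet[str]
History = Sequence[Observation]


def build_inverted_index(histories: Sequence[History]) -> Dict[str, Set[int]]:
    """Map each item to the ids of the histories in which it occurs."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for hid, history in enumerate(histories):
        for observation in history:
            for item in observation:
                index[item].add(hid)
    return index


def candidate_histories(index: Dict[str, Set[int]], query: History) -> Set[int]:
    """Ids of histories sharing at least one item with the query."""
    ids: Set[int] = set()
    for observation in query:
        for item in observation:
            ids |= index.get(item, set())
    return ids


histories = [
    [frozenset("ab")],
    [frozenset("c")],
    [frozenset("d"), frozenset("a")],
]
index = build_inverted_index(histories)
print(sorted(candidate_histories(index, [frozenset("a")])))
```

With many items and sparse observations, each posting list is short, so most histories are never touched.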

19
Experiments

Experiments performed on AMD/XP 2600, 512 MB RAM
Datasets
DBLP
Synth1: our synthetic data
Synth2: modified IBM synthetic data generator
Investigated
Effectiveness of similarity measure
Efficiency of our approach
Pruning power, running time, scalability

21
Effectiveness of our Similarity Measure
Observation: a document modeled as a bit string
First observation randomly selected

22
Effectiveness of our Similarity Measure (cont.)
Mean deviation of from for k-NN
queries
For 2,000 randomly generated queries
23
Pruning Power vs. k
Plot: fraction of database examined (0 to 100) vs. no. of neighbours in k-NN query (log scale, 1 to 1024)
24
Running Time vs. k
Plot: time (msec, 0 to 600) vs. no. of neighbours in k-NN query (log scale, 1 to 1024)
Dataset: Synth2, 8,000 histories, 1,000 items
25
Scalability for 1-NN queries
Plot: time (msec) vs. no. of histories in the collection (8,000 to 64,000)
26
Running time vs. Sparseness of Observations
Plot: time (msec) vs. no. of items (log scale)
27
Conclusions

Introduced a domain-independent framework to
formulate and evaluate similarity queries over
historical data.
Generalized a few concepts, including edit
distance and longest common subsequence, to
histories.
Developed upper bounds to efficiently evaluate
queries. One of our upper bounds can directly
take advantage of an index even though the
similarity measure is not metric.
Our experiments confirm the effectiveness and
efficiency of our approach.