Title: Reza Sherkat and Davood Rafiei
1Efficiently Evaluating Order Preserving
Similarity Queries over Historical Market-Basket
Data
- Reza Sherkat and Davood Rafiei
- Department of Computing Science
- University of Alberta
- Canada
Travel assistance provided by the Mary Louise
Imrie Graduate Student Award
2Overview
- Introduction
- Histories and Time-series
- Similarity model for histories
- Problem Definition
- Proposed Approach
- Results Highlight
- Conclusions
3Querying Histories Introduction
- Querying multiple snapshots of data
- Temporal selection, projection, and join queries
- Finding similar time-series
- Finding companies having similar stocks
- Is it possible to define a notion of similarity
for objects based on the similarity of their
histories?
4Histories
- History: a sequence of time-stamped observations
- Time-series: observations are real values
- Observations can be more general, e.g. the history of a patient
6Similarity Model for Histories
History for 3 patients
- Similarity of two histories depends on
- Pair-wise similarity of their observations
- The order that similar observations are recorded
- Constraints on time-stamps of observations
7Problem Definition
- Given a history as a query
- Evaluate k-NN and range queries efficiently.
- For each history in the result, find its common signature with the query, i.e. where does the similarity come from?
8Similarity Measure for Histories
- Alignment of histories
- An approach to line up subsequences of two histories
- Denoted by a sequence of matches
- Each side of a match is an observation in A (B) or a gap.
- Each match has a score.
- The alignment score measures the quality of an alignment.
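As a minimal sketch of the idea above, an alignment can be held as a list of matches and scored by summing the match scores. Several things here are assumptions for illustration, not taken from the slides: observations are modeled as item sets, the match score is the Jaccard coefficient, a gap is represented by `None`, and a match involving a gap scores zero.

```python
GAP = None  # assumed representation of a gap


def jaccard(a, b):
    """Assumed match score: Jaccard coefficient of two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_score(a, b, score=jaccard):
    """Score of one match; a match with a gap is assumed to score zero."""
    if a is GAP or b is GAP:
        return 0.0
    return score(a, b)


def alignment_score(alignment, score=jaccard):
    """Alignment score taken as the sum of the scores of its matches."""
    return sum(match_score(a, b, score) for a, b in alignment)


alignment = [
    ({"milk", "bread"}, {"milk", "bread"}),  # identical baskets
    ({"eggs"}, GAP),                         # observation aligned with a gap
    ({"tea"}, {"tea", "jam"}),               # partially overlapping baskets
]
total = alignment_score(alignment)
```

Any other pairwise score could be passed as `score`; the sum is only one possible way to aggregate match scores.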
11Alignments of Histories
Alignment score can be the sum of the score of
matches in the alignment.
The best alignment of two histories
What is the best alignment of length 3?
If the match could not be
considered, what would be the best alignment of
length 2?
12Constraints on the Alignments of Histories
- The number of matches in the alignment.
- l-alignment: an alignment with l matches
- The r-neighborhood constraint
- Must hold for each match in the alignment
- r, l: parameters of the similarity query.
13Principle of Optimality
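[Figure: histories A and B with parts p(A), s(A) and p(B), s(B). Legend: optimal alignment of p(A) and p(B), optimal alignment of s(A) and s(B), optimal alignment of A and B, concatenation operator.]
The principle of optimality holds if the optimal alignment of A and B is the concatenation of the optimal alignment of p(A) and p(B) with the optimal alignment of s(A) and s(B).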
14Score of Optimal l-alignment
15Similarity Measure for Histories
- The similarity measure is the score of the optimal l-alignment of two histories.
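The score of an optimal l-alignment can be computed by dynamic programming when the principle of optimality holds. The sketch below is one possible formulation, not the recurrence from the slides (which did not survive extraction). It assumes that a history is a list of `(timestamp, item_set)` pairs, that the match score is the Jaccard coefficient, that gaps contribute zero and are not counted among the l matches, and that the r-neighborhood constraint means the two matched time-stamps differ by at most r.

```python
NEG_INF = float("-inf")


def jaccard(a, b):
    """Assumed match score: Jaccard coefficient of two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def optimal_l_alignment_score(A, B, l, r, score=jaccard):
    """Best total score of an order-preserving alignment with exactly l matches.

    Returns -inf when no alignment with l matches satisfies the constraint.
    """
    n, m = len(A), len(B)
    # prev[i][j]: best score with k-1 matches over A[:i] and B[:j]; k-1 = 0 costs 0.
    prev = [[0.0] * (m + 1) for _ in range(n + 1)]
    for _k in range(1, l + 1):
        cur = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            ta, oa = A[i - 1]
            for j in range(1, m + 1):
                tb, ob = B[j - 1]
                # Skip the last observation of A or of B (aligned with a gap).
                best = max(cur[i - 1][j], cur[i][j - 1])
                # Match the two last observations if they are within r of each other.
                if abs(ta - tb) <= r and prev[i - 1][j - 1] != NEG_INF:
                    best = max(best, prev[i - 1][j - 1] + score(oa, ob))
                cur[i][j] = best
        prev = cur
    return prev[n][m]


A = [(1, {"a", "b"}), (2, {"c"}), (3, {"d"})]
B = [(1, {"a", "b"}), (3, {"d"}), (4, {"c"})]
s2 = optimal_l_alignment_score(A, B, l=2, r=1)
s3 = optimal_l_alignment_score(A, B, l=3, r=1)
```

The running time of this sketch is O(l · |A| · |B|) per pair of histories, which is why a naive scan over a large collection is expensive.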
16Similarity Queries over Collection of Histories
- Straightforward (not practical) approach: naïve scan
- Indexing techniques are proposed for metric spaces,
- but the similarity measure is not metric when the distance between observations is not metric, or when an r-neighborhood constraint is specified.
- We propose upper bounds to prune the history search space.
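How an upper bound prunes the search space can be sketched with a generic filter-and-refine k-NN loop. This is an illustration under stated assumptions rather than the authors' algorithm: `exact` and `upper_bound` are hypothetical callables supplied by the caller, with `upper_bound(q, h) >= exact(q, h)` for every history `h`. The toy scoring functions in the usage lines are stand-ins chosen only to make the example runnable.

```python
import heapq


def knn_with_pruning(query, collection, k, exact, upper_bound):
    """Return the k best (score, index) pairs and the number of exact evaluations."""
    # Visit candidates in order of decreasing upper bound.
    bounded = sorted(
        ((upper_bound(query, h), idx) for idx, h in enumerate(collection)),
        reverse=True,
    )
    heap = []  # min-heap holding the current k best (score, index) pairs
    examined = 0
    for ub, idx in bounded:
        # No remaining candidate can beat the current k-th best score.
        if len(heap) == k and ub <= heap[0][0]:
            break
        s = exact(query, collection[idx])
        examined += 1
        if len(heap) < k:
            heapq.heappush(heap, (s, idx))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, idx))
    return sorted(heap, reverse=True), examined


# Toy stand-ins: similarity = number of shared items, bound = smaller set size.
def toy_exact(q, h):
    return len(q & h)


def toy_bound(q, h):
    return min(len(q), len(h))


collection = [{1, 2, 3}, {1}, {4, 5, 6}, {2, 3, 9}, {7}]
result, examined = knn_with_pruning({1, 2, 3}, collection, 1, toy_exact, toy_bound)
```

In this toy run the exact score is evaluated for three of the five histories. A range query follows the same pattern with a fixed threshold in place of the k-th best score.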
17A General Upper Bound for the Similarity Measure
- Intuition: the score of an optimal relaxed l-alignment is not less than the score of the optimal l-alignment.
- For each observation, find an optimal match.
- Aggregate the scores of the top l optimal matches to find an upper bound for the similarity measure.
This upper bound can prune some extra computations, but all histories will still be accessed to evaluate a query.
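The two steps above can be sketched as follows. The sketch assumes histories are lists of `(timestamp, item_set)` pairs and uses the Jaccard coefficient as the match score. The relaxation it applies (ignoring the order and time-stamp constraints, and letting several query observations share the same best partner) is one reading of "relaxed", not necessarily the exact definition used in the talk.

```python
import heapq


def jaccard(a, b):
    """Assumed match score: Jaccard coefficient of two item sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def relaxed_upper_bound(A, B, l, score=jaccard):
    """Upper bound on the score of any alignment of A and B with l matches."""
    if l == 0:
        return 0.0
    if l > len(A) or l > len(B):
        return float("-inf")  # no alignment with l matches exists
    # Optimal (relaxed) match for each observation of A, ignoring order and time.
    best_per_obs = [max(score(oa, ob) for _tb, ob in B) for _ta, oa in A]
    # Aggregate the top l of these scores.
    return sum(heapq.nlargest(l, best_per_obs))


A = [(1, {"a", "b"}), (2, {"c"}), (3, {"d"})]
B = [(1, {"a", "b"}), (3, {"d"}), (4, {"c"})]
ub2 = relaxed_upper_bound(A, B, 2)
ub3 = relaxed_upper_bound(A, B, 3)
```

The bound holds because every match in an l-alignment uses a distinct observation of A, and its score cannot exceed that observation's best score against B. Computing it costs O(|A| · |B|) score evaluations, with no dependence on l beyond the final selection.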
18An Index-based Upper Bound for the Similarity
Measure
- Intuitions
- Observations are sparse in real-life applications.
- The score of an optimal relaxed match is not less than the score of an optimal match.
- The score of an optimal relaxed alignment is not less than the score of an optimal relaxed l-alignment.
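The slides do not describe the index structure itself. Purely as an illustration of how sparseness can be exploited, the sketch below builds an inverted index from items to histories, so that only histories sharing at least one item with the query are considered. The function names, the `(timestamp, item_set)` layout and the idea that non-overlapping histories can be skipped (true for overlap-based scores such as Jaccard) are assumptions, not the authors' design.

```python
from collections import defaultdict


def build_inverted_index(collection):
    """Map each item to the set of history ids containing it (hypothetical layout)."""
    index = defaultdict(set)
    for hid, history in enumerate(collection):
        for _t, items in history:
            for item in items:
                index[item].add(hid)
    return index


def candidate_histories(index, query):
    """Ids of histories that share at least one item with the query."""
    candidates = set()
    for _t, items in query:
        for item in items:
            candidates |= index.get(item, set())
    return candidates


collection = [
    [(1, {"a"}), (2, {"b"})],
    [(1, {"c"})],
    [(1, {"b", "d"})],
]
index = build_inverted_index(collection)
cands = candidate_histories(index, [(1, {"b"})])
```

The sparser the observations, the fewer histories each item's posting list touches, which is consistent with the first intuition above.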
19Experiments
- Experiments performed on an AMD XP 2600 with 512 MB RAM
- Datasets
- DBLP
- Synth1: our synthetic data
- Synth2: modified IBM synthetic data generator
- Investigated
- Effectiveness of the similarity measure
- Efficiency of our approach
- Pruning power, running time, scalability
21Effectiveness of our Similarity Measure
- Observation: a document modeled as a bit string
- First observation randomly selected
22Effectiveness of our Similarity Measure (cont.)
[Figure: mean deviation for k-NN queries, over 2,000 randomly generated queries]
23Pruning Power vs. k
[Figure: fraction of database examined (0 to 100) vs. number of neighbours in the k-NN query (log scale, 1 to 1024)]
24Running Time vs. k
[Figure: time (msec, 0 to 600) vs. number of neighbours in the k-NN query (log scale, 1 to 1024). Dataset: Synth2, 8,000 histories, 1,000 items]
25Scalability for 1-NN queries
[Figure: time (msec) vs. number of histories in the collection (8,000 to 64,000)]
26Running time vs. Sparseness of Observations
[Figure: time (msec) vs. number of items (log scale, 256 to 8,192)]
27Conclusions
- Introduced a domain-independent framework to formulate and evaluate similarity queries over historical data.
- Generalized a few concepts, including edit distance and longest common subsequence, to histories.
- Developed upper bounds to efficiently evaluate queries. One of our upper bounds can directly take advantage of an index even though the similarity measure is not metric.
- Our experiments confirm the effectiveness and efficiency of our approach.
28Thank you for your attention!
29Related Works
- Detecting, representing, querying histories: Chawathe 1998, Chien 2001
- Similarity-based sequence matching: Altschul 1990, Pearson 1990, Bieganski 1994
- Finding similar sequences of events: Wang 2003
- Finding similar time series: Agrawal 1995, Rafiei 1997, Keogh 2002, Vlachos 2002, 2003, ...