Title: Characterizing and Exploiting Reference Locality in Data Stream Applications
1Characterizing and Exploiting Reference Locality in Data Stream Applications
- Feifei Li, Ching Chang, George Kollios, Azer Bestavros
- Computer Science Department
- Boston University
2Data Stream Management System
[Figure: Data Stream Management System (DSMS), showing the application, the query processor, memory, and unselected tuples]
3Observations
- Storage / Computation limitation
- Full contents of tuples of interest cannot be
stored in memory.
4Caching Problem in DSMS
- Sliding window joins
- Sum of [...]; [...] is the memory size
- Which tuples should be stored to maximize the size of the join result?
5Locality-Aware Algorithms
6Our Contributions
- Cast query processing under memory constraints in a DSMS as a caching problem, and analyze the two causes of reference locality
- Provide a mathematical model that characterizes reference locality in data streams, together with a simple method to infer it
- Show how to improve performance of data stream
applications with locality-aware algorithms
7Reference Locality - Definition
- In a data stream, recently appearing tuples have a high probability of appearing again in the near future.
8Inter Arrival Distance (IAD)
- A random variable that corresponds to the number
of tuples separating consecutive appearances of
the same tuple.
9Calculate distribution of IAD
[Figure: example stream i, a, c, e, b, a, i; when x_n and x_{n+k} are consecutive appearances of the same tuple, the distance is k]
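The IAD calculation can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names `inter_arrival_distances` and `iad_distribution` are hypothetical, and the distance is taken to be the index gap k between consecutive appearances x_n and x_{n+k} of the same tuple.

```python
from collections import Counter

def inter_arrival_distances(stream):
    """Return one distance per repeated tuple: the index gap to its previous appearance."""
    last_seen = {}   # tuple value -> index of its most recent appearance
    distances = []
    for n, x in enumerate(stream):
        if x in last_seen:
            distances.append(n - last_seen[x])
        last_seen[x] = n
    return distances

def iad_distribution(stream):
    """Empirical distribution of the IAD: distance k -> fraction of observed gaps equal to k."""
    distances = inter_arrival_distances(stream)
    counts = Counter(distances)
    total = len(distances)
    return {k: c / total for k, c in sorted(counts.items())}

# Example stream: i, a, c, e, b, a, i
example = list("iacebai")
gaps = inter_arrival_distances(example)
dist = iad_distribution(example)
```

On the example stream, the tuple a reappears after 4 positions and the tuple i after 6, so the two distances each carry half of the empirical probability mass.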
10Sources of Reference Locality
- Long-term popularity vs. Short-term correlation
(web traces, Bestavros and Crovella)
11Independent Reference Model
- With the independent, identically-distributed
(IID) assumption
- Problem: it only captures reference locality due to a skewed popularity profile.
12Metrics of Reference Locality
- How to distinguish the two causes of reference
locality?
[Figure: original data stream S (A, MS, MS, A, A, GG, GG, MS, IBM)]
- Compare the IAD distributions of the original stream and a randomly permuted copy of it!
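The comparison can be sketched as follows. A random permutation keeps every tuple's frequency (long-term popularity) but destroys the ordering (short-term correlation), so any gap between the two IAD curves is attributable to short-term correlation. The helper names `iad_cdf` and `compare_with_permutation` and the parameters `max_k` and `seed` are assumptions made for this illustration.

```python
import random

def iad_cdf(stream, max_k):
    """Empirical CDF of the inter-arrival distance, evaluated at k = 1..max_k."""
    last_seen, gaps = {}, []
    for n, x in enumerate(stream):
        if x in last_seen:
            gaps.append(n - last_seen[x])
        last_seen[x] = n
    if not gaps:
        return [0.0] * max_k
    return [sum(g <= k for g in gaps) / len(gaps) for k in range(1, max_k + 1)]

def compare_with_permutation(stream, max_k=10, seed=0):
    """Return (CDF of original stream, CDF of a randomly permuted copy)."""
    rng = random.Random(seed)
    permuted = list(stream)
    rng.shuffle(permuted)   # same popularity profile, ordering destroyed
    return iad_cdf(stream, max_k), iad_cdf(permuted, max_k)

# Toy bursty stream: 100 distinct tuples, each arriving in a run of 5.
bursty = [s for s in range(100) for _ in range(5)]
original_cdf, permuted_cdf = compare_with_permutation(bursty)
```

In this toy stream every repeat is adjacent, so the original CDF reaches 1 at k = 1, while the permuted copy spreads the repeats over much larger distances.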
13Stock Transaction Traces
Daily stock transaction data from INET ATS, Inc.
Zipf-like Popularity Profile (log-log scale)
14Stock Transaction Traces
- The randomly permuted trace still has strong reference locality, due to the skewed popularity distribution
[Figure: CDF of IAD for original and randomly permuted traces]
15Network OD Flow Traces
Network traces of Origin-Destination (OD) flows in two major networks: US Abilene and Sprint-Europe
Zipf-like Popularity Profile (log-log scale)
16Network OD Flow Traces
CDF of IAD for Original and Randomly Permuted
Traces
17Outline
- Motivation
- Reference Locality source and metrics
- A Locality-Aware Data Stream Model
- Application of Locality-Aware Model
  - Max-subset Join
  - Approximate count estimation
  - Data summarization
- Performance Study
- Conclusion
18Locality-Aware Stream Model
[Figure: stream S with an index over its recent h tuples x_{n-1}, ..., x_{n-h} (example values 2, 2, 5, 10, 4, 10, 7, 7), together with P]
19Locality-Aware Stream Model
[Figure: stream S with an index over its recent h tuples x_{n-1}, ..., x_{n-h} (example values 2, 2, 5, 10, 4, 10, 7, 7)]
20Locality-Aware Stream Model
A similar model appears in work on caching of web traces, for example Konstantinos Psounis et al.
21Infer the Model
- Expected value for x_n
- Make N observations, infer the a_i and b (h+1 parameters)
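One possible way to carry out such an inference is ordinary least squares. The sketch below rests on an assumed form of the model: x_n copies x_{n-i} with probability a_i (i = 1..h) and is otherwise, with probability b, drawn from P, so that the expected value of x_n is a_1 x_{n-1} + ... + a_h x_{n-h} + b * mu, where mu is the mean of P. The names `simulate_stream`, `solve_linear`, and `infer_model` are hypothetical, tuples are assumed to be numeric, and a uniform domain stands in for P in the synthetic test stream.

```python
import random

def simulate_stream(a, domain, n, seed=0):
    """Assumed model: copy x_{n-i} with probability a[i-1], else draw uniformly from domain."""
    rng = random.Random(seed)
    xs = [rng.choice(domain) for _ in range(len(a))]   # warm-up history
    for _ in range(n):
        u, acc = rng.random(), 0.0
        for i, ai in enumerate(a, start=1):
            acc += ai
            if u < acc:
                xs.append(xs[-i])          # short-term correlation: repeat a recent tuple
                break
        else:
            xs.append(rng.choice(domain))  # long-term popularity: fresh draw (probability b)
    return xs

def solve_linear(m, v):
    """Solve m z = v by Gaussian elimination with partial pivoting."""
    k = len(v)
    aug = [row[:] + [v[r]] for r, row in enumerate(m)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, k):
            f = aug[r][c] / aug[c][c]
            for j in range(c, k + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * k
    for r in range(k - 1, -1, -1):
        s = sum(aug[r][j] * z[j] for j in range(r + 1, k))
        z[r] = (aug[r][k] - s) / aug[r][r]
    return z

def infer_model(xs, h):
    """Least-squares estimate of (a_1..a_h, b) from the normal equations."""
    mu = sum(xs) / len(xs)                 # sample mean as an estimate of the mean of P
    k = h + 1
    ata = [[0.0] * k for _ in range(k)]
    aty = [0.0] * k
    for n in range(h, len(xs)):
        row = [xs[n - i] for i in range(1, h + 1)] + [mu]
        for r in range(k):
            aty[r] += row[r] * xs[n]
            for c in range(k):
                ata[r][c] += row[r] * row[c]
    coef = solve_linear(ata, aty)
    return coef[:h], coef[h]

xs = simulate_stream([0.4, 0.2], list(range(1, 51)), 20000, seed=1)
a_hat, b_hat = infer_model(xs, h=2)
```

On the synthetic stream, `a_hat` and `b_hat` should land close to the generating values (0.4, 0.2) and 0.4. The fit does not enforce that the parameters sum to 1; a constrained solver would be a natural refinement.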
22Model on Real Traces - Stock
- b: degree of reference locality due to long-term popularity
- 1-b: degree of reference locality due to short-term correlation
23Model on Real Traces - OD Flow
24Utilizing Model for Prediction
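[Figure: stream S with past tuples x_{n-h}, ..., x_{n-1}, the current tuple x_n, and a future period of T tuples x_{n+1}, x_{n+2}, ..., x_{n+T}]
- E_T(e): the expected number of occurrences of a tuple with value e in a future period of length T.
- Computed using only T+1 constants derived from the locality model of S.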
25Outline
- Motivation
- Reference Locality source and metrics
- A Locality-Aware Data Stream Model
- Application of Locality-Aware Model
  - Max-subset Join
  - Approximate count estimation
  - Data summarization
- Performance Study
- Conclusion
26Approximate Sliding Window Join
- Sliding window joins
- Sum of [...]; [...] is the memory size
- Which tuples should be stored to maximize the size of the join result?
27Existing Approach
- Metric: Max-subset
- Previous approaches
  - Random load shedding: poor performance (J. Kang et al., A. Das et al.)
  - Frequency model: relies on the IID assumption (A. Das et al.)
  - Age-based model: assumption is too strict (U. Srivastava et al.)
  - Stochastic model: not universal (J. Xie et al.)
28Marginal Utility
29Calculate Marginal Utility
[Figure: worked example of calculating marginal utility for buffered tuples of streams S and R, laid out by tuple index up to n]
30ELBA
- Exact Locality-Based Algorithm (ELBA)
- Based on the previous analysis, calculate the marginal utility of each tuple in the buffer and evict the victim with the smallest value
- Expensive
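The eviction rule itself can be sketched independently of how the marginal utility is computed. In the illustration below, `marginal_utility` is a hypothetical callback supplied by the caller, and `admit` and `evict_min_utility` are names invented for this sketch; it shows only the "evict the smallest value" policy, not the utility calculation.

```python
def evict_min_utility(buffer, marginal_utility):
    """Remove and return the buffered tuple with the smallest marginal utility."""
    victim_index = min(range(len(buffer)), key=lambda i: marginal_utility(buffer[i]))
    return buffer.pop(victim_index)

def admit(buffer, capacity, new_tuple, marginal_utility):
    """Add new_tuple; if the buffer overflows, evict one victim and return it."""
    buffer.append(new_tuple)
    if len(buffer) > capacity:
        return evict_min_utility(buffer, marginal_utility)
    return None

# Toy utilities standing in for values produced by the locality model.
toy_utility = {"a": 0.9, "b": 0.1, "c": 0.5}.get
buf = []
admit(buf, 2, "a", toy_utility)
admit(buf, 2, "b", toy_utility)
victim = admit(buf, 2, "c", toy_utility)   # "b" has the smallest utility
```

The newly arrived tuple competes with the buffered ones, so it can itself be the victim when its utility is the lowest.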
31LBA
- Locality-Based Algorithm (LBA)
- Assume T is fixed; approximate the marginal utility based on the predictive power of the locality model.
- Depends on only T+1 constants, which can be pre-computed.
32Space Complexity
- A histogram stores both P over a domain of size D and the T+1 constants
- Histogram space usage is polylogarithmic: O(polylog N) space for N values (A. Gilbert et al.)
33Sliding window join varying buffer size - OD Flow
34Sliding window join varying buffer size - Stock
35Sliding window join varying window size - Stock
36Conclusion
- The reference locality property is important for query processing under memory constraints in data stream applications.
- Most real data streams have strong temporal locality, i.e. short-term correlations.
- What about spatial locality, i.e. correlation among different attributes of a tuple?
37Thanks!
38Approximate Count Estimation
- Derive a much tighter space bound for the Lossy-counting algorithm (G. Manku et al.) using locality-aware techniques.
- A tight space bound is important, as it tells us how much memory space to allocate.
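For context, the standard Lossy-counting algorithm can be sketched as below. This is the textbook version, not the locality-aware analysis: the stream is split into buckets of width ceil(1/epsilon), each entry keeps a count and a maximum undercount, and low-count entries are pruned at bucket boundaries. The class name `LossyCounter` and its method names are chosen for this illustration.

```python
import math

class LossyCounter:
    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)   # bucket width
        self.n = 0                            # tuples seen so far
        self.entries = {}                     # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries whose count plus undercount is too small.
            self.entries = {e: fd for e, fd in self.entries.items()
                            if fd[0] + fd[1] > bucket}

    def frequent(self, support):
        """Items whose stored count is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {e: f for e, (f, _) in self.entries.items() if f >= threshold}

counter = LossyCounter(epsilon=0.01)
for i in range(1000):
    counter.add("a" if i % 2 == 0 else f"u{i}")   # "a" is frequent, the rest are unique
heavy = counter.frequent(0.4)
```

The number of entries held at any moment is the quantity a space bound describes; in this toy stream only the frequent item survives the final pruning step.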
39Data Summarization
- Define entropy over a window in a data stream using locality-aware techniques, instead of the conventional entropy definition.
- Important for data summarization, change detection, etc.
- For example
40Data Stream Entropy
Data Stream              Locality-Aware Entropy
Uniform IID              6.19
Permuted Stock Stream    5.48
Original Stock Stream    3.32
A higher degree of reference locality implies lower entropy.
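For comparison, the conventional entropy of a window depends only on tuple frequencies and ignores their order. The sketch below computes that baseline (Shannon entropy in bits); it is not the locality-aware definition, and the function name `window_entropy` is an assumption made for this illustration.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (in bits) of the empirical tuple frequencies in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

uniform_window = ["a", "b", "c", "d"]   # four equally frequent tuples
constant_window = ["a"] * 4             # a single repeated tuple
h_uniform = window_entropy(uniform_window)
h_constant = window_entropy(constant_window)
```

Because this measure is invariant under reordering, it assigns the same value to a stream and to any permutation of it, which is why a frequency-only definition cannot separate the permuted and original streams.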