Characterizing and Exploiting Reference Locality in Data Stream Applications

1
Characterizing and Exploiting Reference Locality
in Data Stream Applications

Feifei Li, Ching Chang, George Kollios, Azer Bestavros
Computer Science Department
Boston University

2
Data Stream Management System
[Figure: architecture of a Data Stream Management System (DSMS). Labels: Application, Query Processor, Memory, Unselected tuples.]
3
Observations

Storage and computation are limited.
The full contents of the tuples of interest cannot be stored in memory.

4
Caching Problem in DSMS
Sliding window joins
The sum of [...] is the memory size
Which tuples should be stored to maximize the size of the join result?
5
Locality-Aware Algorithms
6
Our Contributions

Cast query processing under memory constraints in a DSMS as a caching problem, and analyze the two causes of reference locality

Provide a mathematical model that characterizes reference locality in data streams, and a simple method to infer it

Show how to improve the performance of data stream applications with locality-aware algorithms

7
Reference Locality - Definition

In a data stream, recently appearing tuples have a high probability of appearing again in the near future.

8
Inter Arrival Distance (IAD)

A random variable that corresponds to the number
of tuples separating consecutive appearances of
the same tuple.

9
Calculate distribution of IAD
[Figure: example stream ... i a c e b a i ...]
If x_n and x_{n+k} are consecutive appearances of the same tuple, the distance is k.
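
The IAD calculation on this slide can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `iad_distribution` is hypothetical, and the distance is taken as the index difference k between consecutive appearances, as in the figure.

```python
from collections import Counter

def iad_distribution(stream):
    """Empirical distribution of inter-arrival distances (IAD) in a stream."""
    last_seen = {}      # value -> index of its most recent appearance
    counts = Counter()  # distance k -> number of times it was observed
    for n, value in enumerate(stream):
        if value in last_seen:
            counts[n - last_seen[value]] += 1
        last_seen[value] = n
    total = sum(counts.values())
    if total == 0:
        return {}       # no value appeared twice
    return {k: c / total for k, c in sorted(counts.items())}

# The example stream from the slide: "a" repeats at distance 4, "i" at distance 6.
print(iad_distribution(list("iacebai")))
```

One pass over the stream and one dictionary entry per distinct value are enough, which is why this is practical on long traces.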
10
Sources of Reference Locality

Long-term popularity vs. Short-term correlation
(web traces, Bestavros and Crovella)

11
Independent Reference Model

With the independent, identically-distributed
(IID) assumption

Problem: this model only captures reference locality due to a skewed popularity profile.
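
The formula on this slide did not survive in the transcript. As a hedged illustration, the sketch below uses the standard geometric argument for IID references: if a tuple has popularity p, its next appearance is exactly k steps away with probability p(1-p)^(k-1), and weighting by how often each tuple is referenced gives the overall IAD distribution. The name `irm_iad_pmf` is hypothetical.

```python
def irm_iad_pmf(popularity, max_k):
    """P(IAD = k) for k = 1..max_k when references are IID with the given popularities.

    A reference hits a tuple with popularity p with probability p; the next
    appearance of that tuple is k steps later with probability p * (1 - p)**(k - 1).
    """
    return [sum(p * p * (1.0 - p) ** (k - 1) for p in popularity)
            for k in range(1, max_k + 1)]

skewed = [0.7, 0.1, 0.1, 0.1]   # illustrative skewed popularity profile
uniform = [0.25] * 4            # illustrative uniform profile
print(irm_iad_pmf(skewed, 3))
print(irm_iad_pmf(uniform, 3))
```

The skewed profile puts more mass on short distances than the uniform one, which is the popularity-driven locality that this model can capture.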

12
Metrics of Reference Locality

How to distinguish the two causes of reference
locality?

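[Figure: original data stream S, e.g. A MS MS A A GG GG MS IBM, shown next to a randomly permuted version of S.]
A random permutation keeps the popularity profile but destroys short-term correlations.
Compare the IAD distributions of the two!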
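
A minimal sketch of this comparison, assuming a fixed random seed for repeatability. The helper names (`iad_samples`, `cdf_at`, `compare_with_permutation`) are hypothetical; the example stream is the one from the figure.

```python
import random

def iad_samples(stream):
    """List of index distances between consecutive appearances of the same value."""
    last_seen, distances = {}, []
    for n, value in enumerate(stream):
        if value in last_seen:
            distances.append(n - last_seen[value])
        last_seen[value] = n
    return distances

def cdf_at(distances, k):
    """Fraction of observed distances that are <= k (needs at least one distance)."""
    return sum(d <= k for d in distances) / len(distances)

def compare_with_permutation(stream, k, seed=0):
    """Return (CDF of IAD at k for the original, same for a random permutation)."""
    shuffled = list(stream)
    random.Random(seed).shuffle(shuffled)
    return cdf_at(iad_samples(stream), k), cdf_at(iad_samples(shuffled), k)

stream = ["A", "MS", "MS", "A", "A", "GG", "GG", "MS", "IBM"]
print(compare_with_permutation(stream, k=1))
```

Locality that survives the shuffle is due to popularity alone; the gap between the two CDFs is attributable to short-term correlation.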
13
Stock Transaction Traces
Daily stock transaction data from INET ATS, Inc.
Zipf-like Popularity Profile (log-log scale)
14
Stock Transaction Traces
The randomly permuted trace still has strong reference locality, due to the skewed popularity distribution
CDF of IAD for Original and Randomly Permuted Traces
15
Network OD Flow Traces
Network traces of Origin-Destination (OD) flows in two major networks: US Abilene and Sprint-Europe
Zipf-like Popularity Profile (log-log scale)
16
Network OD Flow Traces
CDF of IAD for Original and Randomly Permuted
Traces
17
Outline

Motivation
Reference Locality source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model
Max-subset Join
Approximate count estimation
Data summarization
Performance Study
Conclusion

18
Locality-Aware Stream Model
[Figure: stream S with example values 2 2 5 10 4 10 7 7. Labels: Index, x_{n-1}, x_{n-h}, P, Recent h tuples of S.]
19
Locality-Aware Stream Model
[Figure: stream S with example values 2 2 5 10 4 10 7 7. Labels: Index, x_{n-1}, x_{n-h}, Recent h tuples.]
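
The model equation itself is not in the transcript. The sketch below assumes a mixture form that is consistent with the symbols on these slides: the next tuple copies x_{n-i} with probability a_i (i = 1..h), and with probability b it is drawn independently from the popularity distribution P, with the a_i and b summing to 1. The function name `generate_stream` and all parameter values are hypothetical.

```python
import random

def generate_stream(length, a, b, popularity, seed=0):
    """Generate a stream under the assumed mixture model.

    a[i-1]     : probability of repeating the tuple seen i steps ago
    b          : probability of a fresh draw from the popularity distribution
    popularity : dict mapping value -> probability (the distribution P)
    """
    rng = random.Random(seed)
    values = list(popularity)
    weights = [popularity[v] for v in values]
    h = len(a)
    stream = []
    for n in range(length):
        # Choice 0 means "draw from P"; choice i >= 1 means "copy x_{n-i}".
        i = rng.choices(range(h + 1), weights=[b] + list(a))[0]
        if i == 0 or i > n:   # not enough history yet: fall back to P
            stream.append(rng.choices(values, weights=weights)[0])
        else:
            stream.append(stream[n - i])
    return stream

P = {2: 0.4, 5: 0.2, 7: 0.2, 10: 0.2}   # illustrative popularity profile
print(generate_stream(12, a=[0.3, 0.2], b=0.5, popularity=P))
```

Setting b = 1 recovers the independent reference model, while a small b yields streams dominated by short-term repetition.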
20
Locality-Aware Stream Model
A similar model appears in caching of web traces, for example Konstantinos Psounis et al.
21
Infer the Model
Expected value for x_n
Make N observations, then infer a_i and b (h+1 parameters in total)
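
One plausible reading of this slide is a least-squares fit. Assuming the mixture model (copy x_{n-i} with probability a_i, otherwise draw from P with probability b), the expected value of x_n is linear in the previous h values plus a constant, so the a_i can be estimated by regressing each observation on its h predecessors, and b follows as 1 minus their sum. The sketch below uses only the standard library; `solve_linear`, `infer_model`, and `synthetic_stream` are hypothetical names, and the data are synthetic.

```python
import random

def solve_linear(A, y):
    """Solve A z = y by Gauss-Jordan elimination with partial pivoting."""
    m = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(m):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [u - f * v for u, v in zip(M[r], M[col])]
    return [M[i][m] / M[i][i] for i in range(m)]

def infer_model(stream, h):
    """Least-squares fit of x_n ~ sum_i a_i * x_{n-i} + c; returns (a, b)."""
    rows = [[stream[n - i] for i in range(1, h + 1)] + [1.0]
            for n in range(h, len(stream))]
    targets = stream[h:]
    m = h + 1
    # Normal equations: (R^T R) z = R^T t
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    y = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(m)]
    a = solve_linear(A, y)[:h]
    return a, 1.0 - sum(a)   # assumes a_1 + ... + a_h + b = 1

def synthetic_stream(length, a, b, seed=0):
    """Test data only: copy x_{n-i} w.p. a[i-1], else draw uniformly from 1..100."""
    rng = random.Random(seed)
    stream = [rng.randint(1, 100) for _ in range(len(a))]
    while len(stream) < length:
        i = rng.choices(range(len(a) + 1), weights=[b] + list(a))[0]
        stream.append(rng.randint(1, 100) if i == 0 else stream[-i])
    return stream

a_hat, b_hat = infer_model(synthetic_stream(20000, [0.5, 0.2], 0.3), h=2)
print(a_hat, b_hat)
```

The fit needs numeric tuple values; for categorical tuples a different encoding or estimator would be required.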
22
Model on Real Traces - Stock
b: degree of reference locality due to long-term popularity; 1-b: degree due to short-term correlation
23
Model on Real Traces - OD Flow
24
Utilizing Model for Prediction
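[Figure: stream S with past tuples x_{n-h}, ..., x_{n-1}, x_n and future tuples x_{n+1}, x_{n+2}, ..., x_{n+T}; the future period has length T.]
E_T(e): the expected number of occurrences of a tuple with value e in a future period of length T.
It is computed using only T+1 constants derived from the locality model of S.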
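
A sketch of how E_T(e) could be evaluated, assuming the mixture model (copy x_{n-i} with probability a_i, otherwise draw from P with probability b). It applies linearity of expectation step by step and does not reproduce the slide's shortcut with T+1 precomputed constants, which is not spelled out in the transcript. The function name and the parameter values are hypothetical.

```python
def expected_occurrences(history, a, b, popularity, e, T):
    """Expected number of times value e appears among the next T tuples.

    history    : observed tuples, most recent last (at least len(a) of them)
    a, b       : parameters of the assumed mixture model
    popularity : dict mapping value -> probability (the distribution P)
    """
    h = len(a)
    if len(history) < h:
        raise ValueError("need at least h past tuples")
    # seq holds, per position, the probability that the tuple equals e;
    # already observed tuples contribute exactly 0 or 1.
    seq = [1.0 if v == e else 0.0 for v in history[len(history) - h:]]
    p_e = popularity.get(e, 0.0)
    total = 0.0
    for _ in range(T):
        p = b * p_e + sum(a[i - 1] * seq[-i] for i in range(1, h + 1))
        seq.append(p)
        total += p
    return total

history = [2, 2, 5, 10, 4, 10, 7, 7]
P = {2: 0.2, 4: 0.2, 5: 0.2, 7: 0.2, 10: 0.2}
print(expected_occurrences(history, a=[0.4, 0.2], b=0.4, popularity=P, e=7, T=5))
```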
25
Outline

Motivation
Reference Locality source and metrics
A Locality-Aware Data Stream Model
Application of Locality-Aware Model
Max-subset Join
Approximate count estimation
Data summarization
Performance Study
Conclusion

26
Approximate Sliding Window Join
Sliding window joins
The sum of [...] is the memory size
Which tuples should be stored to maximize the size of the join result?
27
Existing Approach

Metric: max-subset
Previous approaches:
Random load shedding: poor performance (J. Kang et al., A. Das et al.)
Frequency model: relies on the IID assumption (A. Das et al.)
Age-based model: assumption is too strict (U. Srivastava et al.)
Stochastic model: not universal (J. Xie et al.)

28
Marginal Utility
29
Calculate Marginal Utility
[Figure: worked example on two streams, S and R, plotted against tuple index up to n; the individual tuple values did not survive extraction.]
30
ELBA

Exact Locality-Based Algorithm (ELBA)
Based on the previous analysis, calculate the marginal utility of each tuple in the buffer and evict the tuple with the smallest value
Drawback: expensive

31
LBA

Locality-Based Algorithm (LBA)
Assume T is fixed and approximate the marginal utility using the predictive power of the locality model.
Depends on only T+1 constants, which can be pre-computed.
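
The eviction step shared by ELBA and LBA can be sketched generically. The utility estimator is left as a pluggable function because the marginal-utility formula is not in the transcript; `admit` and the toy utility table are hypothetical.

```python
def admit(buffer, capacity, new_tuple, utility):
    """Insert new_tuple; if the buffer overflows, evict the lowest-utility tuple.

    utility is any callable that scores a tuple, e.g. an estimate of its
    marginal utility. Returns the evicted tuple, or None if nothing was evicted.
    """
    buffer.append(new_tuple)
    if len(buffer) <= capacity:
        return None
    victim = min(buffer, key=utility)
    buffer.remove(victim)
    return victim

# Toy scores standing in for estimated marginal utilities.
scores = {"A": 0.9, "MS": 0.1, "GG": 0.5}
buf = ["A", "MS"]
print(admit(buf, capacity=2, new_tuple="GG", utility=scores.get), buf)
```

Note that the newly arrived tuple competes with the buffered ones, so it is dropped immediately when its own estimate is the lowest.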

32
Space Complexity

A histogram stores P over a domain of size D, together with the T+1 constants
Histogram space usage is polylogarithmic: O(polylog N) space for N values
(A. Gilbert et al.)

33
Sliding window join, varying buffer size - OD Flow
34
Sliding window join, varying buffer size - Stock
35
Sliding window join, varying window size - Stock
36
Conclusion

Reference locality is important for query processing under memory constraints in data stream applications.

Most real data streams have strong temporal
locality, i.e. short term correlations.

How about spatial locality, i.e. correlation
among different attributes of the tuple?

37
Thanks!
38
Approximate Count Estimation

Derive a much tighter space bound for the lossy counting algorithm (G. Manku et al.) using locality-aware techniques.
A tight space bound is important, as it tells us how much memory to allocate.
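
For context, here is a minimal sketch of the baseline lossy counting algorithm that the bound refers to. It does not implement the locality-aware analysis from the talk; it only shows the structure whose memory footprint (the number of tracked entries) is being bounded. The function name `lossy_count` and the example stream are illustrative.

```python
import math

def lossy_count(stream, epsilon):
    """Baseline lossy counting with error parameter epsilon.

    Returns (entries, n), where entries maps value -> [count f, max undercount delta].
    For every tracked value, f <= true count <= f + epsilon * n.
    """
    width = math.ceil(1 / epsilon)   # bucket width
    entries = {}
    n = 0
    for value in stream:
        n += 1
        bucket = math.ceil(n / width)        # id of the current bucket
        if value in entries:
            entries[value][0] += 1
        else:
            entries[value] = [1, bucket - 1]
        if n % width == 0:                   # bucket boundary: prune rare values
            entries = {v: fd for v, fd in entries.items()
                       if fd[0] + fd[1] > bucket}
    return entries, n

stream = ["hot" if i % 3 == 0 else i % 7 for i in range(3000)]
entries, n = lossy_count(stream, epsilon=0.05)
print(len(entries), "entries tracked after", n, "tuples")
```

The number of entries that survive pruning is the quantity a space bound has to cover.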

39
Data Summarization

Define entropy over a window of a data stream using locality-aware techniques, instead of the standard entropy definition.

This is important for data summarization, change detection, etc.
For example:

40
Data Stream Entropy
Data Stream              Locality-Aware Entropy
Uniform IID              6.19
Permuted Stock Stream    5.48
Original Stock Stream    3.32

A higher degree of reference locality implies lower entropy
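
The locality-aware entropy definition is not given in the transcript. For comparison only, the sketch below computes the standard Shannon entropy of the empirical value distribution in a window, which is the definition the locality-aware version is contrasted with. The name `window_entropy` is hypothetical.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (in bits) of the empirical value distribution in a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(window_entropy(["A", "MS", "MS", "A", "A", "GG", "GG", "MS", "IBM"]))
```

This baseline depends only on value frequencies, so a stream and any permutation of it receive the same score; the table above reports different values for the original and permuted stock streams, which this definition cannot distinguish.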
