Characterizing and Exploiting Reference Locality in Data Stream Applications

1
Characterizing and Exploiting Reference Locality
in Data Stream Applications
  • Feifei Li, Ching Chang, George Kollios, Azer
    Bestavros
  • Computer Science Department
  • Boston University

2
Data Stream Management System
[Figure: a Data Stream Management System (DSMS) with components labeled Application, Query Processor, Memory, and Unselected tuples.]
3
Observations
  • Storage / Computation limitation
  • Full contents of tuples of interest cannot be
    stored in memory.

4
Caching Problem in DSMS
Sliding window joins: the sum of [symbols lost in extraction] is the memory size.
Which tuples should be stored to maximize the size of the join result?
5
Locality-Aware Algorithms
6
Our Contributions
  • Cast query processing under a memory constraint in
    a DSMS as a caching problem, and analyze the two
    causes of reference locality
  • Provide a mathematical model that characterizes
    reference locality in data streams, together with
    a simple method to infer it
  • Show how to improve the performance of data stream
    applications with locality-aware algorithms

7
Reference Locality - Definition
  • In a data stream, recently appearing tuples have
    a high probability of appearing again in the near
    future.

8
Inter Arrival Distance (IAD)
  • A random variable that corresponds to the number
    of tuples separating consecutive appearances of
    the same tuple.

9
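Calculate distribution of IAD
[Figure: a stream of tuples (..., i, a, c, e, b, a, i, ...); when x_n and x_{n+k} are consecutive appearances of the same tuple, the distance is k.]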
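
The IAD distribution can be estimated from a finite trace by remembering the last position of each tuple. The following is a minimal sketch, assuming tuples are hashable values and that distance is measured as the difference of stream positions (so x_n and x_{n+k} are at distance k). The function name `iad_distribution` is illustrative and not taken from the slides.

```python
from collections import Counter

def iad_distribution(stream):
    """Return {k: fraction of repeat appearances whose distance is k}."""
    last_seen = {}      # tuple value -> index of its latest appearance
    counts = Counter()  # distance k -> number of repeats at that distance
    for n, tup in enumerate(stream):
        if tup in last_seen:
            counts[n - last_seen[tup]] += 1
        last_seen[tup] = n
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: c / total for k, c in sorted(counts.items())}

# Toy stream: 'a' repeats at distance 4, 'i' repeats at distance 6.
print(iad_distribution(list("iacebai")))
```

A single pass and one dictionary entry per distinct tuple are enough, which is convenient for long traces.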
10
Sources of Reference Locality
  • Long-term popularity vs. Short-term correlation
    (web traces, Bestavros and Crovella)

11
Independent Reference Model
  • With the independent, identically-distributed
    (IID) assumption
  • Problem: it only captures reference locality due to
    a skewed popularity profile.

12
Metrics of Reference Locality
  • How to distinguish the two causes of reference
    locality?

[Figure: the original data stream S (for example A, MS, MS, A, A, GG, GG, MS, IBM, ...) next to a randomly permuted version of it.]
Compare the IAD distributions of the two!
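
The comparison can be sketched as follows. A random permutation keeps the popularity profile but destroys short-term correlation, so any gap between the two IAD CDFs is attributable to correlation. This is an illustrative sketch on a synthetic bursty stream, not the stock or OD flow traces; the names `iad_cdf` and `permuted` are assumptions.

```python
import random
from collections import Counter

def iad_cdf(stream, max_k):
    """CDF of the inter-arrival distance, evaluated at k = 1..max_k."""
    last_seen, counts = {}, Counter()
    for n, tup in enumerate(stream):
        if tup in last_seen:
            counts[n - last_seen[tup]] += 1
        last_seen[tup] = n
    total = sum(counts.values())
    cdf, running = [], 0
    for k in range(1, max_k + 1):
        running += counts[k]
        cdf.append(running / total if total else 0.0)
    return cdf

def permuted(stream, seed=0):
    """Randomly permuted copy: same popularity profile, no temporal order."""
    shuffled = list(stream)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Synthetic stream with strong short-term correlation: ten runs of 100 equal tuples.
bursty = [sym for sym in "ABCDEFGHIJ" for _ in range(100)]
orig_cdf = iad_cdf(bursty, 5)
perm_cdf = iad_cdf(permuted(bursty), 5)
print("original:", orig_cdf)
print("permuted:", perm_cdf)
```

On this toy input every repeat in the original stream is at distance 1, while the permuted stream spreads repeats over much larger distances.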
13
Stock Transaction Traces
Daily stock transaction data from INET ATS, Inc.
Zipf-like Popularity Profile (log-log scale)
14
Stock Transaction Traces
The randomly permuted trace still has strong reference locality, due to
the skewed popularity distribution
CDF of IAD for Original and Randomly Permuted
Traces
15
Network OD Flow Traces
Network traces of Origin-Destination (OD) flows
in two major networks: US Abilene and
Sprint-Europe
Zipf-like Popularity Profile (log-log scale)
16
Network OD Flow Traces
CDF of IAD for Original and Randomly Permuted
Traces
17
Outline
  • Motivation
  • Reference Locality source and metrics
  • A Locality-Aware Data Stream Model
  • Application of Locality-Aware Model
  • Max-subset Join
  • Approximate count estimation
  • Data summarization
  • Performance Study
  • Conclusion

18
Locality-Aware Stream Model
[Figure: stream S with its recent h tuples x_{n-1}, ..., x_{n-h} (example values 2, 2, 5, 10, 4, 10, 7, 7) and P.]
19
Locality-Aware Stream Model
[Figure: stream S with its recent h tuples x_{n-1}, ..., x_{n-h} (example values 2, 2, 5, 10, 4, 10, 7, 7).]
20
Locality-Aware Stream Model
A similar model appears for caching of web traces,
for example Konstantinos Psounis et al.
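
The model's formula did not survive in this transcript. The sketch below therefore assumes a form consistent with the surrounding slides: the next tuple repeats x_{n-i} with probability a_i (for i = 1..h), or is drawn from a long-term popularity distribution P with probability b, where the a_i and b sum to 1. The function name and parameters are illustrative.

```python
import random

def simulate_locality_stream(n, a, b, popularity, seed=0):
    """Generate n tuples under the assumed locality-aware model.

    a[i-1]     : probability that the next tuple copies x_{n-i}
    b          : probability that the next tuple is a fresh draw from P
    popularity : dict mapping tuple value -> probability under P
    """
    assert abs(sum(a) + b - 1.0) < 1e-9, "a_i and b must sum to 1"
    rng = random.Random(seed)
    values = list(popularity)
    weights = [popularity[v] for v in values]
    h = len(a)
    # Warm-up history of h tuples drawn from P (dropped from the result).
    stream = rng.choices(values, weights=weights, k=h)
    options = list(range(h + 1))   # 0 = draw from P, i >= 1 = copy x_{n-i}
    probs = [b] + list(a)
    while len(stream) < n + h:
        i = rng.choices(options, weights=probs)[0]
        if i == 0:
            stream.append(rng.choices(values, weights=weights)[0])
        else:
            stream.append(stream[-i])
    return stream[h:]

print(simulate_locality_stream(20, [0.4, 0.2], 0.4, {"A": 0.5, "MS": 0.3, "GG": 0.2}))
```

Setting b = 1 with an empty `a` reduces this to the independent reference model, which makes the two sources of locality easy to separate in experiments.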
21
Infer the Model
Expected value for x_n
Make N observations and infer the a_i and b (h+1
parameters)
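
One way to carry out such an inference is ordinary least squares. The sketch below is an assumption, not the authors' stated procedure: it takes numeric tuples, assumes E[x_n] = sum_i a_i x_{n-i} + b * mean(P) with sum_i a_i + b = 1, and therefore regresses (x_n - mean) on (x_{n-i} - mean), recovering b as 1 - sum_i a_i. The helper names are hypothetical.

```python
def solve_linear(A, y):
    """Solve A x = y by Gaussian elimination with partial pivoting."""
    h = len(y)
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(h):
        pivot = max(range(col, h), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, h):
            f = M[r][col] / M[col][col]
            for c in range(col, h + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * h
    for r in range(h - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, h))
        x[r] = (M[r][h] - s) / M[r][r]
    return x

def infer_model(stream, h, mean_p):
    """Least-squares estimate of (a_1..a_h, b) from observed numeric tuples."""
    G = [[0.0] * h for _ in range(h)]   # normal-equation matrix
    v = [0.0] * h                       # normal-equation right-hand side
    for n in range(h, len(stream)):
        z = [stream[n - i] - mean_p for i in range(1, h + 1)]
        target = stream[n] - mean_p
        for i in range(h):
            v[i] += z[i] * target
            for j in range(h):
                G[i][j] += z[i] * z[j]
    a = solve_linear(G, v)
    return a, 1.0 - sum(a)
```

Centering on the mean of P removes b from the regression, so only an h-by-h system has to be solved. The sketch does not constrain the estimates to lie in [0, 1]; a real implementation would likely need to.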
22
Model on Real Traces - Stock
b: degree of reference locality due to long-term
popularity; 1-b: due to short-term correlation
23
Model on Real Traces - OD Flow
24
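Utilizing Model for Prediction
[Figure: stream S with history x_{n-h}, ..., x_{n-1}, x_n followed by a future period of T tuples x_{n+1}, x_{n+2}, ..., x_{n+T}.]
The expected number of occurrences of a tuple with
value e in a future period of T tuples, E_T(e), can be
computed using only T+1 constants calculated from
the locality model of S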
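
The closed form with T+1 constants is not reproduced in this transcript. As an illustration only, E_T(e) can also be obtained by a direct recursion under the assumed model (copy x_{n-i} with probability a_i, draw from P with probability b): by linearity of expectation, the probability that a future position holds e is b * P(e) plus the a_i-weighted probabilities of the h positions before it. The function name and argument layout are assumptions.

```python
def expected_occurrences(e, recent, a, b, p_e, T):
    """Expected number of tuples equal to e among the next T arrivals.

    recent : observed tuples, oldest first (at least h = len(a) of them)
    a, b   : parameters of the assumed locality model
    p_e    : long-term popularity P(e)
    """
    h = len(a)
    if len(recent) < h:
        raise ValueError("need at least h recent tuples")
    # Known positions hold e with probability 1 or 0.
    probs = [1.0 if x == e else 0.0 for x in recent[len(recent) - h:]]
    total = 0.0
    for _ in range(T):
        # a[i] weights the position i + 1 steps back from the one predicted.
        p_next = b * p_e + sum(a[i] * probs[-(i + 1)] for i in range(h))
        probs.append(p_next)
        total += p_next
    return total

print(expected_occurrences("MS", ["A", "MS", "MS"], [0.4, 0.2], 0.4, 0.3, T=10))
```

This recursion costs O(T * h) per tuple value, which is why a formulation that depends only on precomputed constants is attractive for online use.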
25
Outline
  • Motivation
  • Reference Locality source and metrics
  • A Locality-Aware Data Stream Model
  • Application of Locality-Aware Model
  • Max-subset Join
  • Approximate count estimation
  • Data summarization
  • Performance Study
  • Conclusion

26
Approximate Sliding Window Join
Sliding window joins: the sum of [symbols lost in extraction] is the memory size.
Which tuples should be stored to maximize the size of the join result?
27
Existing Approach
  • Metric: Max-subset
  • Previous approaches:
  • Random load shedding: poor performance (J. Kang
    et al., A. Das et al.)
  • Frequency model: IID assumption (A. Das et al.)
  • Age-based model: too strict an assumption (U.
    Srivastava et al.)
  • Stochastic model: not universal (J. Xie et al.)

28
Marginal Utility
29
Calculate Marginal Utility
[Figure: streams S and R plotted against tuple index up to n, with example tuple values (S: 10, 8, 13, 8, 9; R: 9, 7) and positions marked x.]
30
ELBA
  • Exact Locality-Based Algorithm (ELBA)
  • Based on the previous analysis, calculate the
    marginal utility of tuples in the buffer, evict
    the victim with the smallest value
  • Expensive
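
The eviction step itself can be sketched independently of how marginal utility is computed. In the sketch below, `marginal_utility` is a hypothetical callable supplied by the caller; the slide's actual utility formula is not reproduced here.

```python
def admit(buffer, capacity, new_tuple, marginal_utility):
    """Add new_tuple to the buffer; if the buffer overflows, evict the
    tuple with the smallest marginal utility and return it (else None)."""
    buffer.append(new_tuple)
    if len(buffer) <= capacity:
        return None
    victim = min(buffer, key=marginal_utility)
    buffer.remove(victim)
    return victim

# Toy usage with made-up utilities.
utility = {"a": 3.0, "b": 1.0, "c": 2.0}
buf = ["a", "b"]
evicted = admit(buf, 2, "c", utility.get)
print(evicted, buf)
```

Note that the newly arrived tuple competes with the buffered ones, so it can be the victim itself when its utility is the lowest.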

31
LBA
  • Locality-Based Algorithm (LBA)
  • Assume T is fixed; approximate the marginal
    utility based on the predictive power of the
    locality model.
  • Depends on only T+1 constants that can be
    pre-computed.

32
Space Complexity
  • A histogram stores both P over a domain of size D
    and the T+1 constants
  • Histogram space usage is polylogarithmic:
    O(polylog N) space for N values
    (A. Gilbert et al.)

33
Sliding window join, varying buffer size - OD Flow
34
Sliding window join, varying buffer size - Stock
35
Sliding window join, varying window size - Stock
36
Conclusion
  • Reference locality property is important for
    query processing with memory constraint in data
    stream applications.
  • Most real data streams have strong temporal
    locality, i.e. short term correlations.
  • How about spatial locality, i.e. correlation
    among different attributes of the tuple?

37
Thanks!
38
Approximate Count Estimation
  • Derive a much tighter space bound for the
    Lossy Counting algorithm (G. Manku et al.) using
    locality-aware techniques.
  • Tight space bound is important, as it tells us
    how much memory space to allocate.
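
For context, the baseline Lossy Counting algorithm can be sketched as below. This is the standard algorithm as commonly described, without the locality-aware space analysis mentioned above; the class and method names are illustrative.

```python
import math

class LossyCounter:
    """Approximate frequency counts with error at most epsilon * n."""

    def __init__(self, epsilon):
        self.epsilon = epsilon
        self.width = math.ceil(1 / epsilon)  # bucket width
        self.n = 0                           # tuples seen so far
        self.entries = {}                    # item -> [count, max undercount]

    def add(self, item):
        self.n += 1
        bucket = math.ceil(self.n / self.width)  # current bucket id
        if item in self.entries:
            self.entries[item][0] += 1
        else:
            self.entries[item] = [1, bucket - 1]
        if self.n % self.width == 0:
            # Bucket boundary: drop entries that cannot be frequent.
            self.entries = {k: v for k, v in self.entries.items()
                            if v[0] + v[1] > bucket}

    def frequent(self, support):
        """Items whose estimated frequency is at least (support - epsilon) * n."""
        threshold = (support - self.epsilon) * self.n
        return {k: v[0] for k, v in self.entries.items() if v[0] >= threshold}

lc = LossyCounter(epsilon=0.1)
for i in range(100):
    lc.add("a" if i % 2 == 0 else i)   # 'a' is half the stream, the rest is unique
print(lc.frequent(0.3), len(lc.entries))
```

The number of entries held in `entries` is exactly the quantity a space bound speaks about, so it is the figure to watch when sizing memory.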

39
Data Summarization
  • Define Entropy over a window in data stream using
    locality-aware techniques, instead of the normal
    way of entropy definition.
  • Important for data summarization, change
    detection, etc.
  • For example

40
Data Stream Entropy
Data Stream              Locality-Aware Entropy
Uniform IID              6.19
Permuted Stock Stream    5.48
Original Stock Stream    3.32
A higher degree of reference locality implies lower
entropy.
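
For contrast, the ordinary (Shannon) entropy of a window depends only on value frequencies, so it assigns the same value to a stream and to any permutation of it. The sketch below shows that baseline only; the locality-aware definition behind the numbers above is not reproduced in this transcript, and the function name is an assumption.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (in bits) of the empirical distribution of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(window_entropy(list("aabb")), window_entropy(list("abab")))
```

Because the two calls return the same value, the ordinary definition cannot separate an original stream from its permuted version, which is the gap a locality-aware definition addresses.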