Title: SWAT: Hierarchical Stream Summarization in Large Networks
1SWAT Hierarchical Stream Summarization in Large
Networks
- Ahmet Bulut, Ambuj K. Singh
- Department of Computer Science
- University of California, Santa Barbara
- Santa Barbara, CA 93106
- USA
- Life is what happens to you while you are busy
making other plans
- --John Lennon
2Motivations
- Numerous real world applications generate streams
of data
- Call detail records in telecommunications
networks
- Transactions in retail chains, Sensor network
data
- Log records generated by Web servers
- NetFlow data (Cisco's monitoring tool)
3Network Data Processing
- Traffic estimation
- How many bytes were sent between a pair of IP
addresses?
- List the top 100 IP addresses in terms of
traffic
- Traffic analysis
- What is the average duration of a TCP session?
- Fraud detection
- List all sessions that transmitted more than 1000
bytes
- Security/Denial of Service
- Identify IP addresses involved in more than 1000
sessions
4Application characteristics
- Data volume is massive (several terabytes)
- AT&T collects 100 GB of NetFlow data each day!
- New records arrive at a fast rate
- Dealing with transactional data streams: "drinking
from the proverbial fire hose."
- Recent values are more important than old
values.
5Data Stream Computation Model
- A data stream is an infinite sequence of
elements ..., x_i, ...
- Stream processing requirements
- Bounded memory: small space usage
- Real-time: low per-record processing time (to
maintain the synopsis)
- Efficiency: fast response time and accurate
results to user queries
[Figure: Data Stream -> Stream Processing Engine (Synopsis in Memory) -> (Approximate) Answer]
6Data Stream Query Model
- Inner (dot) product
- Network Data Processing: What is the average
duration of a TCP session?
- Stock Data Processing: What is the average
closing price of INTEL for the last month?
- Medical Sensor Data Processing: Notify when the
weighted average of the last 20 body temperature
measurements of a patient exceeds a threshold
value.
- Query: a triple (I,W,d)
- I: data items of interest
- W: individual weights corresponding to each data
item
- d: precision quality
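As a rough illustration of the query triple, the sketch below evaluates the inner product part exactly over a small in-memory window. The function name `inner_product`, the convention that index 0 is the most recent item, and the sample readings are assumptions made for this example; the precision parameter d is ignored here because nothing is approximated.

```python
from collections import deque


def inner_product(window, indices, weights):
    """Weighted sum of the selected items; index 0 is the most recent value."""
    if len(indices) != len(weights):
        raise ValueError("indices and weights must have the same length")
    return sum(w * window[i] for i, w in zip(indices, weights))


# Hypothetical sliding window holding the last 4 readings, newest first.
window = deque(maxlen=4)
for reading in [2, 8, 6, 4]:
    window.appendleft(reading)  # the newest value ends up at index 0

# A plain average is an inner product with equal weights.
average = inner_product(window, [0, 1, 2, 3], [0.25] * 4)

# A recent-biased query puts larger weights on smaller (newer) indices.
biased = inner_product(window, [0, 1, 2, 3], [8, 4, 2, 1])
```

An average query and a recent-biased query differ only in the weight vector, which is why one query form covers the three application examples on this slide.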
7Outline
- Introduction
- Motivation and applications
- Stream computation model
- Stream query model
- Centralized system design
- Wavelet-based Approximation Tree
- Handling user queries
- De-centralized system design
- Adaptive stream replication in a large network
- Future directions
8One-Dimensional Haar Wavelets
- Wavelets: mathematical tool for hierarchical
decomposition of functions/signals
- Haar wavelets: simplest wavelet basis, easy to
understand and implement
- Recursive pairwise averaging and differencing at
different resolutions
Resolution   Averages                  Detail Coefficients
3            2, 2, 0, 2, 3, 5, 4, 4    ----
2            2, 1, 4, 4                0, -1, -1, 0
1            1.5, 4                    0.5, 0
0            2.75                      -1.25
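The recursive pairwise averaging and differencing can be sketched in a few lines. This is a minimal illustration rather than the SWAT implementation: the function name `haar_decompose` is an assumption, and the detail coefficient is taken as half the difference of each pair, which is the convention that yields the details 0, -1, -1, 0 for the signal 2, 2, 0, 2, 3, 5, 4, 4.

```python
def haar_decompose(signal):
    """Return (resolution, averages, details) tuples, finest level first."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    averages = list(signal)
    resolution = n.bit_length() - 1  # log2(n) for a power of two
    levels = [(resolution, averages, None)]  # the raw signal has no details
    while len(averages) > 1:
        pairs = list(zip(averages[0::2], averages[1::2]))
        details = [(a - b) / 2 for a, b in pairs]   # pairwise differencing
        averages = [(a + b) / 2 for a, b in pairs]  # pairwise averaging
        resolution -= 1
        levels.append((resolution, averages, details))
    return levels


levels = haar_decompose([2, 2, 0, 2, 3, 5, 4, 4])
for resolution, averages, details in levels:
    print(resolution, averages, details)
```

Each pass halves the number of averages, so a signal of length N produces log2(N) levels of detail coefficients plus one overall average.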
9One-Dimensional Haar Wavelets
- Recent biased wavelet decomposition
- Keep a few of the coefficients at each
resolution
- Approximation of the original signal
- Recent values are more important than old
values.
Resolution   Averages                  Detail Coefficients
3            2, 2, 0, 2, 3, 5, 4, 4    ----
2            2, 1, 4, 4                0, -1, -1, 0
1            1.5, 4                    0.5, 0
0            2.75                      -1.25
Approximation: 2.75, 1.5, 4, 4, 4
10Haar Wavelet Coefficients
- Hierarchical decomposition structure
- Keep a few coefficients at each resolution:
O(log N) coefficients for a window size of N
11Execution Trace on SWAT
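[Figure: SWAT tree built over the stream 2 8 6 4 2 10 16 6 8 2 2 0 4 2 12 14. Nodes are labeled sum/count: 98/16; 46/8, 54/8, 44/8, 54/8; 8/4, 22/4, 32/4, 36/4, 12/4; 14/2, 6/2, 18/2, 10/2, 8/2, 12/2, 26/2. Other labels: 2, 10, 4, 6.]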
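Some of the sum/count labels in the trace can be checked by hand. The sketch below recomputes the labels for aligned, non-overlapping blocks of the complete 16-value stream. It is only an arithmetic check under that assumption, not the SWAT update procedure, and the helper name `block_sums` is made up for this example. Labels in the trace that it does not reproduce (for example 46/8 or 22/4) presumably belong to windows at other positions or time instants.

```python
def block_sums(values, block_size):
    """(sum, count) for each consecutive, non-overlapping block of values."""
    return [
        (sum(values[start:start + block_size]), block_size)
        for start in range(0, len(values), block_size)
    ]


stream = [2, 8, 6, 4, 2, 10, 16, 6, 8, 2, 2, 0, 4, 2, 12, 14]

for size in (16, 8, 4, 2):
    labels = ["%d/%d" % pair for pair in block_sums(stream, size)]
    print(size, labels)
```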
12Query by Example
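- Q({0, 3, 8, 13}, {10, 8, 4, 1})
- 10·x0 + 8·x3 + 4·x8 + x13
[Figure: SWAT tree used to answer the query. Node labels: V, R2, R0, L0, L1, S2. Averages shown: Avg(3,18), Avg(7,14), Avg(3,10), Avg(11,18), Avg(3,6), Avg(1,4), Avg(5,8), Avg(1,2), Avg(2,3), Avg(0,1).]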
13Experimental Settings
- Real dataset of size 3K: daily maximum
temperature for the city of Santa Barbara, CA
from 1994 to 2001 (http://www.ipm.ucdavis.edu/WEATHER)
- Synthetic dataset of size 10M.
- Data arrival period, Td, and query arrival period,
Tq: 1 sec
- Exponential inner product query: Q({0, 1, 2, 3},
{8, 4, 2, 1}, 20)
- Linear inner product query: Q({8, 9, 10, 11},
{4, 3, 2, 1}, 40)
- Execution at a query point
- fixed query mode: the same recent-biased query
repeatedly
- random query mode: a new randomly chosen query
- Compare with the incremental Histogram [Guha and
Koudas 02]
14Performance measurements
- Dataset: Real data
- Query mode: fixed
- Exponential queries are more recent-biased than
linear queries
- Dataset: Synthetic data
- Query mode: fixed
15Performance measurements
- Dataset: Synthetic data
- Query mode: random
- Average query response times
- SWAT: 2.80e-3 sec.
- Histogram: 25.433 sec.
- Parameter tuning for Histogram
- Refer to the paper or the accompanying technical
report for more experimental results.
16Complexity analysis of SWAT
- Small space usage
- Real-time: per-record processing time (to
maintain the synopsis) must be low
- Time needed to answer posed queries,
and the precision of answers
- Space complexity is O(log N)
- Amortized per-item processing time is O(1)
17Outline
- Introduction
- Motivation and applications
- Stream computation model
- Stream query model
- Centralized system design
- Wavelet-based Approximation Tree
- Handling user queries
- De-centralized system design
- Adaptive stream replication in a large network
- Future directions
18Stream Replication Motivations
- Centralized model: the synopsis at a single
site.
- (+) easy system design, no replica consistency
to maintain
- (-) the central site becomes a bottleneck in
query-intensive environments.
- Decentralized model: the synopsis at multiple
sites.
- (+) fewer message transmissions in query-intensive
environments
- (-) replica consistency in data-intensive
environments
- Solution: an adaptive stream replication that
minimizes the number of message transmissions
19System topology
[Figure: system topology. Labels: Data Stream; S; stream values 2 8 6 4 2 10 16 6 8 2 2 0 4 2 12 14; clients C1, C2, C3, C4; queries Q1, Q3, Q4; answers A, A3.]
20Computation model
- Nodes maintain range values rather than exact
data values
- [dL, dH]: approximation for the data element d
Data window segment   Data range   Subscription list
(0,1)                 [25,45]      C1, C2
(2,3)                 [30,40]      C2
(4,7)                 [2,7]        C2
(8,15)                [4,10]       C2
21Query model
- Clients issue inner product queries Q(I,W,d) over
the stream.
- When a client receives a query Q(I,W,d) with d as
precision requirement
- Check the local cache to see if there are
approximations for I
- If d
- else return answer A
- Compute an answer A = [AL, AH]; define its
precision quality as AH-AL
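One way to picture how a range answer [AL, AH] could be derived from cached ranges is simple interval arithmetic over the inner product. The sketch below is an illustration under stated assumptions, not the method from the slides: the helper names `segment_range` and `answer_range` are hypothetical, each item is assumed to be approximated by the range of the window segment that covers it, and the cache contents are the segment/range pairs from the computation model slide.

```python
def segment_range(cache, index):
    """Look up the cached (low, high) range of the segment covering index."""
    for (first, last), (low, high) in cache.items():
        if first <= index <= last:
            return low, high
    raise KeyError(f"no cached approximation for index {index}")


def answer_range(cache, indices, weights):
    """Bounds (AL, AH) on sum(w * x_i), given a range for every item."""
    a_low = a_high = 0
    for i, w in zip(indices, weights):
        low, high = segment_range(cache, i)
        # A negative weight swaps which end of the range gives the lower bound.
        a_low += w * (low if w >= 0 else high)
        a_high += w * (high if w >= 0 else low)
    return a_low, a_high


# Segment -> range pairs taken from the computation model slide.
cache = {(0, 1): (25, 45), (2, 3): (30, 40), (4, 7): (2, 7), (8, 15): (4, 10)}

a_low, a_high = answer_range(cache, [0, 3, 8], [4, 2, 1])
width = a_high - a_low  # compared against the precision requirement d
```

With these numbers the answer range is [164, 270], so the width is 106. A query whose precision requirement is tighter than that could not be answered from this cache alone.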
22Adaptive Data Replication [Wolfson, Jajodia, and
Huang 97]
- The replication scheme is a sub-tree of nodes
that have the replica of the object
- The replication scheme R consists of nodes 3, 7,
8
- The replication scheme expands and/or contracts
adaptively, depending on read and write activities.
[Figure: tree of nodes. Labels: 3, 4, 7, 8.]
23Adaptive Stream Replication (SWAT-ASR)
[Figure: phase change in SWAT-ASR. Labels: S; C1; C3; ranges 32,38 and 34,35; segment entries (2,3) 30,40 and (2,3) 32,38; queries Q1(3,1,8) X 4 and Q0(3,1,20) X 3.]
24Experimental Settings
- Execute at each query point a new randomly chosen
inner product query (random query mode)
- Divergence Caching [Huang, Sloan, and Wolfson
94]: mechanism to reduce the number of object
transmissions in client-server architectures.
- Compute the optimal refresh rate (dH-dL for
[dL,dH]) using a window of past reads and writes
(Poisson processes)
- Adaptive Precision Setting [Olston, Widom, and
Loo 01]
- For every data value d, keep an approximation
[dL,dH] with width W = dH-dL
- In case of an update, the width stays the same or
is enlarged
- In case of an unsatisfied query, the width stays
the same or is shrunk
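The width adjustment rule of adaptive precision setting can be sketched as a small class. This is a hedged illustration only: the class name `ApproximateValue`, the factor `alpha`, and the probabilities `p_grow` and `p_shrink` are hypothetical parameters chosen for the example, and treating an update as relevant only when the new value leaves the cached range is an assumption of this sketch.

```python
import random


class ApproximateValue:
    """Cached range [low, high] for one data value, with adaptive width."""

    def __init__(self, value, width, alpha=1.0, p_grow=1.0, p_shrink=1.0, rng=None):
        self.width = width
        self.alpha = alpha        # hypothetical growth/shrink factor
        self.p_grow = p_grow      # hypothetical probability of enlarging
        self.p_shrink = p_shrink  # hypothetical probability of shrinking
        self.rng = rng or random.Random()
        self.recenter(value)

    def recenter(self, value):
        """Place the range symmetrically around the exact value."""
        self.low = value - self.width / 2
        self.high = value + self.width / 2

    def on_update(self, value):
        """A new exact value arrives; return True if a refresh was needed."""
        if self.low <= value <= self.high:
            return False                  # still covered, nothing to send
        if self.rng.random() < self.p_grow:
            self.width *= 1 + self.alpha  # enlarge, otherwise keep the width
        self.recenter(value)
        return True

    def on_unsatisfied_query(self, value):
        """A query needed more precision than the cached range offers."""
        if self.rng.random() < self.p_shrink:
            self.width /= 1 + self.alpha  # shrink, otherwise keep the width
        self.recenter(value)


approx = ApproximateValue(10, 4)
refreshed = approx.on_update(20)  # 20 lies outside [8, 12]
```

Frequent updates push the width up and cut refresh traffic, while frequent precise queries pull it down, so the width settles where the two kinds of messages balance.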
25Performance measurements for adaptive stream
replication
26Performance measurements for adaptive stream
replication (cont.)
27Performance measurements for adaptive stream
replication (cont.)
[Figure axis label: largest weight coefficient in weight vector]
28Future directions
- Resource utilization in a multiple-streams
environment
- Multiple streams competing for limited resources
- limited memory
- limited bandwidth
- limited cpu
- How to schedule resources to optimize throughput
- Fraud and security monitoring
- Application of stream mining techniques developed
on logs for pattern analysis
29References
- [GGR02] M. N. Garofalakis, J. Gehrke, and R.
Rastogi. Querying and Mining Data Streams: You
Only Get One Look. VLDB 2002.
- [GKMS01] A. C. Gilbert, Y. Kotidis, S.
Muthukrishnan, and M. Strauss. Surfing Wavelets on
Streams: One-Pass Summaries for Approximate
Aggregate Queries. VLDB 2001.
- [GK02] S. Guha and N. Koudas. Approximating a
Data Stream for Querying and Estimation:
Algorithms and Performance Evaluation. ICDE
2002.
- [HSW94] Y. Huang, R. H. Sloan, and O. Wolfson.
Divergence Caching in Client Server
Architectures. PDIS 1994.
- [OWL01] C. Olston, J. Widom, and B. T. Loo.
Adaptive Precision Setting for Cached Approximate
Values. SIGMOD 2001.
- [TR-BS02] A. Bulut and A. K. Singh. SWAT:
Hierarchical Stream Summarization in Large
Networks. University of California Santa Barbara,
2002.
- [WJH97] O. Wolfson, S. Jajodia, and Y. Huang.
An adaptive data replication algorithm. ACM TODS
1997.
- [ZS02] Y. Zhu and D. Shasha. StatStream:
Statistical monitoring of thousands of data
streams in real time. VLDB 2002.
30Thanks
you got to make some plans just in case