1
Randomization for Massive and Streaming Data Sets
  • Rajeev Motwani

2
Data Stream Management Systems
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • Data Streams: distributed, continuous,
    unbounded, rapid, time-varying, noisy
  • Emerging DSMSs serve a variety of modern
    applications:
  • Network monitoring and traffic engineering
  • Telecom call records
  • Network security
  • Financial applications
  • Sensor networks
  • Manufacturing processes
  • Web logs and clickstreams
  • Massive data sets

3
DSMS Big Picture
[Figure: DSMS architecture. Input streams enter the DSMS, which uses a scratch store, stored relations, and an archive, and produces stored results.]
4
Algorithmic Issues
  • Computational Model:
  • Streaming data (or data in secondary memory)
  • Bounded main memory
  • Techniques:
  • New paradigms
  • Negative results and approximation
  • Randomization
  • Complexity Measures:
  • Memory
  • Time per item (online, real-time)
  • Passes (linear scans over secondary memory)

5
Stream Model of Computation
[Figure: a data stream feeding synopsis data structures in main memory; time increases along the stream.]
  • Memory: poly(1/ε, log N)
  • Query/Update time: poly(1/ε, log N)
  • N = number of items so far (or the window size); ε = error parameter
6
Toy Example: Network Monitoring
[Figure: a DSMS with registered monitoring queries over network measurements and packet traces, emitting intrusion warnings and online performance metrics; backed by a scratch store, lookup tables, and an archive.]
7
Frequency-Related Problems
  • Analytics on packet headers (IP addresses)
  • Example: how many elements have non-zero frequency?
8
Example 1: Distinct Values
  • Input sequence X = x1, x2, ..., xn
  • Domain U = {0, 1, 2, ..., u-1}
  • Compute D(X) = number of distinct values in X
  • Remarks:
  • Assume stream size n is finite/known
  • (in general, n is the window size)
  • Domain could be arbitrary (e.g., text, tuples)

9
Naïve Approach
  • Counter C(i) for each domain value i
  • Initialize counters C(i) ← 0
  • Scan X, incrementing the appropriate counters
  • Problem:
  • Memory size M << n
  • Space O(u), and possibly u >> n
  • (e.g., when counting distinct words in a web crawl)
  • (a minimal sketch follows)
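A minimal sketch of this counter-based scan, in Python (the function name is illustrative; the domain is assumed to be the integers 0..u-1):

```python
def count_distinct_naive(stream, u):
    """Naive distinct count: one counter C(i) per domain value; O(u) space."""
    counts = [0] * u                  # C(i) <- 0 for every domain value i
    for x in stream:
        counts[x] += 1                # scan X, incrementing counters
    return sum(1 for c in counts if c > 0)
```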

10
Negative Result
  • Theorem:
  • Deterministic algorithms need M = Ω(n log u) bits
  • Proof: information-theoretic arguments
  • Note: this leaves open randomization/approximation

11
Randomized Algorithm
[Figure: input stream hashed by h: U → [1..t] into a t-bucket hash table chaining the distinct values seen.]
  • Analysis:
  • Random h ⇒ few collisions; average list size O(n/t)
  • Thus (see the sketch below):
  • Space O(n), since we need t = O(n)
  • Time O(1) per item, expected
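In Python, a set already provides the chained hash table described above, so a sketch of the exact randomized algorithm is short (assumption: the built-in hash behaves like a random hash function):

```python
def count_distinct_hashing(stream):
    """Exact distinct count via hashing: O(n) space, O(1) expected time per item."""
    table = set()          # hash table over the values seen so far
    for x in stream:
        table.add(x)       # expected O(1) insert, few collisions
    return len(table)
```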

12
Improvement via Sampling?
  • Sample-based estimation:
  • Draw a random sample R (of size r) of the n values in X
  • Compute D(R)
  • Estimator: E = D(R) × n/r
  • Benefit: sublinear space
  • Cost: estimation error is high
  • Why? Low-frequency values are underrepresented (see the sketch below)
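A sketch of the scale-up estimator E = D(R) · n/r (the function name is illustrative; random.sample draws R uniformly without replacement):

```python
import random

def sampled_distinct_estimate(x, r):
    """Scale-up estimator: count distinct values in a uniform sample R of
    size r and multiply by n/r.  Space is O(r), but low-frequency values
    are easily missed, so the estimation error can be large."""
    n = len(x)
    sample = random.sample(x, r)
    return len(set(sample)) * n / r
```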

13
Negative Result for Sampling
  • Consider any estimator E of D(X) that examines r items of X,
  • possibly in an adaptive/randomized fashion
  • Theorem [CCMN, PODS 2000]: for any γ > e^(-r), E has relative error
    at least √((n-r)/(2r) · ln(1/γ)) with probability at least γ
  • Remarks:
  • r = n/10 ⇒ error ≥ 75% with probability ≥ 1/2
  • Leaves open randomization/approximation with full scans

14
Randomized Approximation
  • Simplified problem: for a fixed t, is D(X) >> t?
  • Choose hash function h: U → [1..t]
  • Initialize answer to NO
  • For each xi: if h(xi) = t, set answer to YES
  • Observe: needs only 1 bit of memory!
  • Theorem:
  • If D(X) < t, P[output NO] > 0.25
  • If D(X) > 2t, P[output NO] < 0.14

[Figure: stream items hashed by h: U → [1..t]; a single Boolean flag records whether bucket t was ever hit, and the flag determines the YES/NO output. A sketch follows.]
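A sketch of the one-bit test (assumptions: Python's salted hash stands in for a random h: U → [1..t], with bucket t represented as residue t-1):

```python
def one_bit_distinct_test(stream, t, salt=0):
    """Is D(X) well above t?  Hash each item into t buckets; remember in a
    single Boolean flag whether bucket t was ever hit."""
    flag = False                                # the 1 bit of memory
    for x in stream:
        if hash((salt, x)) % t == t - 1:        # event "h(x) = t", prob. 1/t
            flag = True
    return "YES" if flag else "NO"              # NO is likely when D(X) < t
```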
15
Analysis
  • Let Y be the set of distinct elements of X
  • Output is NO ⇔ no element of Y hashes to t
  • P[a given element hashes to t] = 1/t
  • Thus: P[output NO] = (1 - 1/t)^|Y|
  • Since |Y| = D(X):
  • D(X) < t ⇒ P[output NO] > (1 - 1/t)^t > 0.25
  • D(X) > 2t ⇒ P[output NO] < (1 - 1/t)^(2t) < 1/e² ≈ 0.14

16
Boosting Accuracy
  • With 1 bit ? distinguish D(X)ltt from D(X)gt2t
  • Running O(log 1/d) instances in parallel
    ? reduce error probability to any dgt0
  • Running O(log n) in parallel for t 1, 2, 4,
    8,, n ? can estimate D(X) within factor 2
  • Choice of multiplier 2 is arbitrary
    ? can use factor (1e) to reduce error to
    e
  • Theorem Can estimate D(X) within factor (1e)
    with probability (1-d) using space
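A combined sketch, under the same salted-hash assumption: k = O(log 1/δ) independent flags per value of t, with t doubling. The decision rule used here (smallest t at which a majority of trials still say NO) is one reasonable way to turn the YES/NO tests into a factor-2 estimate, not the unique choice:

```python
import math
import random

def estimate_distinct(stream, n_max, delta=0.05):
    """Factor-2 estimate of D(X) from O(log n_max) x O(log 1/delta) one-bit tests."""
    k = max(1, math.ceil(math.log2(1.0 / delta)))            # trials per t
    ts = [2 ** i for i in range(int(math.log2(n_max)) + 1)]  # t = 1, 2, 4, ...
    rng = random.Random(0)
    salts = {(t, j): rng.random() for t in ts for j in range(k)}
    flags = {key: False for key in salts}                    # one bit each
    for x in stream:                                         # single pass
        for (t, j), salt in salts.items():
            if hash((salt, x)) % t == t - 1:                 # "h(x) = t"
                flags[(t, j)] = True
    for t in ts:                                             # smallest t whose
        if sum(flags[(t, j)] for j in range(k)) <= k / 2:    # trials mostly say NO
            return t                                         # D(X) is around t
    return ts[-1]
```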

17
Example 2: Elephants and Ants
[Figure: a stream containing a few very frequent items (elephants) among many rare ones (ants).]
  • Identify items whose current frequency exceeds a
    support threshold s = 0.1%
  • [Jacobson 2000, Estan-Verghese 2001]

18
Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows.
Window size W is a function of the support s (specified
later).
19
Lossy Counting in Action ...
[Figure: the first window starts empty; counts accumulate within it, and at the window boundary every counter is decremented by 1 and zero counters are dropped.]
20
Lossy Counting continued ...
[Figure: the same decrement-and-prune step repeats at each subsequent window boundary.]
21
Error Analysis
How much do we undercount?
If the current stream size is N and the window size is
W = 1/ε, then:
  number of windows = N/W = εN
  frequency undercount ≤ one per window, i.e., at most εN
Rule of thumb: set ε = 10% of the support s.
Example: given support frequency s = 1%, set error
frequency ε = 0.1%.
22
Putting it all together
Output: elements with counter values exceeding (s - ε)N
Approximation guarantees:
  • Frequencies underestimated by at most εN
  • No false negatives
  • False positives have true frequency at least (s - ε)N
  • How many counters do we need?
  • Worst-case bound: (1/ε) log(εN) counters
  • Implementation details: see the sketch below
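A sketch of the windowed Lossy Counting loop described above (simplified: plain decrement-and-prune at every boundary, without the per-counter Δ bookkeeping of the full Manku-Motwani algorithm):

```python
def lossy_counting(stream, epsilon):
    """Lossy Counting with window size W = 1/epsilon: each count is an
    underestimate of the true frequency by at most epsilon * N."""
    w = max(1, int(1 / epsilon))          # window size W = 1/epsilon
    counts = {}
    n = 0
    for x in stream:
        n += 1
        counts[x] = counts.get(x, 0) + 1
        if n % w == 0:                    # window boundary:
            for key in list(counts):      # decrement every counter,
                counts[key] -= 1
                if counts[key] == 0:      # and drop the zeros
                    del counts[key]
    return counts, n

# Output step: report items whose counter exceeds (s - epsilon) * N.
def frequent_items(counts, n, s, epsilon):
    return [x for x, c in counts.items() if c > (s - epsilon) * n]
```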

23
Algorithm 2: Sticky Sampling
  • Create counters by sampling
  • Maintain exact counts thereafter
What is the sampling rate?
24
Sticky Sampling cont'd...
For a finite stream of length N:
  sampling rate = (2/εN) log(1/(sδ))
where δ = probability of failure
Output: elements with counter values exceeding (s - ε)N
Same rule of thumb: set ε = 10% of the support s.
Example: given support threshold s = 1%, set error
threshold ε = 0.1% and failure probability δ = 0.01%.
25
Number of counters?
Finite stream of length N: sampling rate = (2/εN) log(1/(sδ))
Infinite stream with unknown N: gradually adjust the
sampling rate
In either case, the expected number of counters is
(2/ε) log(1/(sδ)), independent of N (see the sketch below)
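A sketch of Sticky Sampling for the finite-stream case, using the rate formula above (the gradual rate adjustment for unknown N is omitted):

```python
import math
import random

def sticky_sampling(stream, n, s, epsilon, delta):
    """Create counters by sampling at rate (2/(epsilon*n)) * log(1/(s*delta));
    once an item has a counter, count it exactly."""
    rate = min(1.0, (2.0 / (epsilon * n)) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1               # exact counting thereafter
        elif random.random() < rate:
            counts[x] = 1                # counter created by sampling
    return counts                        # report items with count > (s - epsilon)*n
```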
26
Example 3 Correlated Attributes
  • C1 C2 C3 C4 C5
  • R1 1 1 1 1 0
  • R2 1 1 0 1 0
  • R3 1 0 0 1 0
  • R4 0 0 1 0 1
  • R5 1 1 1 0 1
  • R6 1 1 1 1 1
  • R7 0 1 1 1 1
  • R8 0 1 1 1 0
  • Input stream: items with Boolean attributes
  • Matrix M(r,c) = 1 ⇔ row r has attribute c
  • Identify highly correlated column pairs

27
Correlation ⇒ Similarity
  • View a column as the set of row indexes where it has
    1s
  • Set similarity (Jaccard measure):
    sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
  • Example (checked in the sketch below):

Ci Cj
 0  1
 1  0
 1  1
 0  0
 1  1
 0  1

sim(Ci, Cj) = 2/5 = 0.4
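Viewing each column as its set of row indexes makes the measure a one-liner; the check below reproduces the example above (Ci = rows {2, 3, 5}, Cj = rows {1, 3, 5, 6}):

```python
def jaccard(ci, cj):
    """Jaccard similarity of two columns, each given as its set of 1-rows."""
    return len(ci & cj) / len(ci | cj)

ci = {2, 3, 5}                     # rows where Ci has a 1
cj = {1, 3, 5, 6}                  # rows where Cj has a 1
assert jaccard(ci, cj) == 2 / 5    # 0.4, as in the example above
```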
28
Identifying Similar Columns?
  • Goal: find candidate pairs using small memory
  • Signature idea:
  • Hash each column Ci to a small signature sig(Ci)
  • Set of signatures fits in memory
  • sim(Ci, Cj) approximated by sim(sig(Ci), sig(Cj))
  • Naïve approach:
  • Sample P rows uniformly at random
  • Define sig(Ci) as the P bits of Ci in the sample
  • Problem:
  • Sparsity ⇒ the sample would miss the interesting
    parts of the columns
  • The sample would get only 0s in most columns

29
Key Observation
  • For columns Ci, Cj, there are four types of rows:
  •     Ci Cj
  • A:   1  1
  • B:   1  0
  • C:   0  1
  • D:   0  0
  • Overload notation: A = |{rows of type A}|
  • Observation: sim(Ci, Cj) = A / (A + B + C)

30
Min Hashing
  • Randomly permute the rows
  • Hash h(Ci) = index of the first row with a 1 in column
    Ci
  • Surprising property:
  • P[h(Ci) = h(Cj)] = sim(Ci, Cj)
  • Why?
  • Both equal A/(A + B + C)
  • Look down columns Ci, Cj until the first non-type-D
    row
  • h(Ci) = h(Cj) ⇔ that row is of type A

31
Min-Hash Signatures
  • Pick k random row permutations
  • Min-Hash signature:
  • sig(C) = the k indexes of the first rows with a 1 in
    column C
  • Similarity of signatures:
  • Define sim(sig(Ci), sig(Cj)) = fraction of the k
    permutations where the Min-Hash values agree
  • Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

32
Example
Input matrix:
    C1 C2 C3
R1   1  0  1
R2   0  1  1
R3   1  0  0
R4   1  0  1
R5   0  1  0

Signatures:
                S1 S2 S3
Perm 1 (12345):  1  2  1
Perm 2 (54321):  4  5  4
Perm 3 (34512):  3  5  4

Similarities:
          1-2   1-3   2-3
Col-Col  0.00  0.50  0.25
Sig-Sig  0.00  0.67  0.00
33
Implementation Trick
  • Permuting the rows even once is prohibitive
  • Row hashing:
  • Pick k hash functions hk: {1,...,n} → {1,...,O(n)}
  • Ordering the rows by hk simulates a random row
    permutation
  • One-pass implementation (see the sketch below)
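A one-pass sketch combining the two slides above: k salted hash functions play the role of the k row permutations, and each column keeps the minimum hash value over its 1-rows (assumption: columns are given as a dict mapping column name to its set of row indexes):

```python
import random

def minhash_signatures(columns, n_rows, k, seed=0):
    """One pass over the rows; sig[c][j] = min, over rows r with M(r,c) = 1,
    of the j-th hash of r -- the analogue of the first 1 under a permutation."""
    rng = random.Random(seed)
    salts = [rng.random() for _ in range(k)]          # k "row permutations"
    sig = {c: [float("inf")] * k for c in columns}
    for r in range(n_rows):                           # single pass over rows
        hs = [hash((salt, r)) for salt in salts]
        for c, ones in columns.items():
            if r in ones:                             # M(r, c) == 1
                sig[c] = [min(s, h) for s, h in zip(sig[c], hs)]
    return sig

def signature_similarity(si, sj):
    """Fraction of hash functions on which the Min-Hash values agree."""
    return sum(a == b for a, b in zip(si, sj)) / len(si)
```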

34
Comparing Signatures
  • Signature matrix S:
  • Rows = hash functions
  • Columns = columns
  • Entries = signatures
  • Need: pairwise similarity of the signature columns
  • Problem:
  • MinHash fits the column signatures in memory
  • But comparing all signature pairs takes too much time
  • Limiting the candidate pairs: Locality-Sensitive
    Hashing

35
Summary
  • New algorithmic paradigms needed for streams and
    massive data sets
  • Negative results abound
  • Need to approximate
  • Power of randomization

36
Thank You!
37
References
  • Rajeev Motwani (http://theory.stanford.edu/~rajeev)
  • STREAM Project (http://www-db.stanford.edu/stream)
  • STREAM: The Stanford Stream Data Manager.
    Bulletin of the Technical Committee on Data
    Engineering 2003.
  • Motwani et al. Query Processing, Approximation,
    and Resource Management in a Data Stream
    Management System. CIDR 2003.
  • Babcock-Babu-Datar-Motwani-Widom. Models and
    Issues in Data Stream Systems. PODS 2002.
  • Manku-Motwani. Approximate Frequency Counts over
    Streaming Data. VLDB 2003.
  • Babcock-Datar-Motwani-O'Callaghan. Maintaining
    Variance and K-Medians over Data Stream Windows.
    PODS 2003.
  • Guha-Meyerson-Mishra-Motwani-O'Callaghan.
    Clustering Data Streams: Theory and Practice.
    IEEE TKDE 2003.

38
References (cont'd)
  • Datar-Gionis-Indyk-Motwani. Maintaining Stream
    Statistics over Sliding Windows. SIAM Journal on
    Computing 2002.
  • Babcock-Datar-Motwani. Sampling From a Moving
    Window Over Streaming Data. SODA 2002.
  • O'Callaghan-Guha-Mishra-Meyerson-Motwani.
    High-Performance Clustering of Streams and Large
    Data Sets. ICDE 2003.
  • Guha-Mishra-Motwani-O'Callaghan. Clustering Data
    Streams. FOCS 2000.
  • Cohen et al. Finding Interesting Associations
    without Support Pruning. ICDE 2000.
  • Charikar-Chaudhuri-Motwani-Narasayya. Towards
    Estimation Error Guarantees for Distinct Values.
    PODS 2000.
  • Gionis-Indyk-Motwani. Similarity Search in High
    Dimensions via Hashing. VLDB 1999.
  • Indyk-Motwani. Approximate Nearest Neighbors:
    Towards Removing the Curse of Dimensionality.
    STOC 1998.