Title: Randomization for Massive and Streaming Data Sets
1. Randomization for Massive and Streaming Data Sets
2. Data Stream Management Systems
- Traditional DBMS: data stored in finite, persistent data sets
- Data streams: distributed, continuous, unbounded, rapid, time-varying, noisy
- Emerging DSMSs: a variety of modern applications
- Network monitoring and traffic engineering
- Telecom call records
- Network security
- Financial applications
- Sensor networks
- Manufacturing processes
- Web logs and clickstreams
- Massive data sets
3. DSMS: Big Picture
[Figure: DSMS architecture. Input streams enter the DSMS, which is backed by a scratch store, an archive, and stored relations, and produces stored results.]
4. Algorithmic Issues
- Computational Model
- Streaming data (or, secondary memory)
- Bounded main memory
- Techniques
- New paradigms
- Negative Results and Approximation
- Randomization
- Complexity Measures
- Memory
- Time per item (online, real-time)
- Passes (linear scan in secondary memory)
5. Stream Model of Computation
[Figure: a data stream flows past main memory, which holds synopsis data structures; time increases along the stream.]
- Memory: poly(1/ε, log N)
- Query/Update time: poly(1/ε, log N)
- N = number of items so far, or window size
- ε = error parameter
6. Toy Example: Network Monitoring
[Figure: a DSMS ingests network measurements and packet traces; users register monitoring queries; outputs include intrusion warnings and online performance metrics; the DSMS is backed by an archive, a scratch store, and lookup tables.]
7. Frequency-Related Problems
Analytics on packet headers (IP addresses), e.g.: how many elements have non-zero frequency?
8. Example 1: Distinct Values
- Input: sequence X = x1, x2, …, xn
- Domain: U = {0, 1, 2, …, u−1}
- Compute D(X) = number of distinct values in X
- Remarks:
- Assume stream size n is finite/known (generally, n is the window size)
- The domain could be arbitrary (e.g., text, tuples)
9. Naïve Approach
- Counter C(i) for each domain value i
- Initialize counters C(i) ← 0
- Scan X, incrementing the appropriate counters
- Problem:
- Memory size M << n
- Space O(u), and possibly u >> n
- (e.g., when counting distinct words in a web crawl)
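The naïve approach as a minimal Python sketch (the function name is mine; a dict stands in for the counter array C, assuming stream items are hashable):

    def naive_distinct(stream):
        counts = {}  # one counter C(i) per domain value i seen
        for x in stream:
            counts[x] = counts.get(x, 0) + 1
        # distinct values = values with non-zero frequency
        return len(counts)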
10. Negative Result
- Theorem: deterministic algorithms need M = Ω(n log u) bits
- Proof: information-theoretic arguments
- Note: leaves open randomization/approximation
11. Randomized Algorithm
[Figure: the input stream is hashed by h: U → {1..t} into a hash table with chaining.]
- Analysis:
- A random h ⇒ few collisions; average list size is O(n/t)
- Thus:
- Space: O(n), since we need t = O(n)
- Time: O(1) per item (expected)
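A minimal sketch of this algorithm in Python (function name is mine; a salted built-in hash stands in for the random h):

    import random

    def distinct_count(stream, t):
        # Chained hash table with t buckets; for a random h the
        # expected chain length is O(n/t), so t = O(n) gives O(1)
        # expected time per item, at the cost of O(n) space.
        seed = random.getrandbits(64)
        buckets = [[] for _ in range(t)]
        distinct = 0
        for x in stream:
            chain = buckets[hash((seed, x)) % t]
            if x not in chain:       # expected O(n/t) comparisons
                chain.append(x)
                distinct += 1
        return distinct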
12. Improvement via Sampling?
- Sample-based estimation:
- Take a random sample R (of size r) of the n values in X
- Compute D(R)
- Estimator: E = D(R) × n/r
- Benefit: sublinear space
- Cost: estimation error is high
- Why? Low-frequency values are underrepresented
13. Negative Result for Sampling
- Consider an estimator E of D(X) that examines r items of X, possibly in an adaptive/randomized fashion
- Theorem: for any δ > 0, E has relative error at least √((n−r)/(2r) · ln(1/δ)) with probability at least δ
- Remarks:
- r = n/10 ⇒ error ≥ 75% with probability 1/2
- Leaves open randomization/approximation on full scans
14. Randomized Approximation
- Simplified problem: for a fixed t, is D(X) >> t?
- Choose a hash function h: U → {1..t}
- Initialize the answer to NO
- For each xi: if h(xi) = t, set the answer to YES
- Observe: only 1 bit of memory is needed!
- Theorem:
- If D(X) < t, P[output NO] > 0.25
- If D(X) > 2t, P[output NO] < 0.14
[Figure: the input stream is hashed by h: U → {1..t}; a Boolean flag records whether bucket t is ever hit, yielding the YES/NO answer.]
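A sketch of the one-bit test (names and the salted use of Python's hash are my own; buckets are numbered 0..t−1, so "h(x) = t" becomes a test against t−1):

    import random

    def flag_test(stream, t, seed=None):
        # The Boolean flag is the algorithm's only bit of memory.
        seed = random.getrandbits(64) if seed is None else seed
        flag = False
        for x in stream:
            if hash((seed, x)) % t == t - 1:   # "h(x) = t"
                flag = True                    # answer YES
        return flag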
15. Analysis
- Let Y be the set of distinct elements of X
- Output is NO ⇔ no element of Y hashes to t
- P[an element hashes to t] = 1/t
- Thus, P[output NO] = (1 − 1/t)^|Y|
- Since |Y| = D(X):
- D(X) < t ⇒ P[output NO] > (1 − 1/t)^t > 0.25
- D(X) > 2t ⇒ P[output NO] < (1 − 1/t)^(2t) < 1/e² ≈ 0.14
16. Boosting Accuracy
- With 1 bit we can distinguish D(X) < t from D(X) > 2t
- Running O(log 1/δ) instances in parallel reduces the error probability to any δ > 0
- Running O(log n) instances in parallel for t = 1, 2, 4, 8, …, n estimates D(X) within a factor of 2 (see the sketch below)
- The choice of multiplier 2 is arbitrary: using a factor of (1+ε) reduces the error to ε
- Theorem: D(X) can be estimated within a factor of (1+ε), with probability (1−δ), using space poly(1/ε, log(1/δ), log n)
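A sketch of the boosting step, reusing flag_test from the previous sketch (the 0.2 threshold is my choice: it splits the NO-probabilities 0.25 and 0.14 from the theorem; for clarity this sketch re-reads the items, whereas a real implementation would update all flags in a single pass):

    import math

    def estimate_distinct(items, n, k=64):
        # For t = 1, 2, 4, ..., n run k independent one-bit tests and
        # take the largest t that still looks "YES-heavy" (D(X) > t).
        est = 1
        for i in range(int(math.ceil(math.log2(max(n, 2)))) + 1):
            t = 2 ** i
            nos = sum(not flag_test(items, t) for _ in range(k))
            if nos / k < 0.2:    # too few NOs: evidence that D(X) > t
                est = 2 * t      # correct within a factor of 2
        return est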
17. Example 2: Elephants-and-Ants
[Figure: a stream with a few high-frequency items (elephants) among many rare ones (ants).]
- Identify items whose current frequency exceeds a support threshold s = 0.1%.
- [Jacobson 2000, Estan-Verghese 2001]
18. Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows.
Window size W is a function of the support s (specified later).
19. Lossy Counting in Action
[Figure: the counter table starts empty; counts accumulate within the first window.]
20. Lossy Counting (continued)
[Figure: at each window boundary, every counter is decremented by 1 and counters that reach zero are dropped.]
21. Error Analysis
How much do we undercount?
If the current stream size is N and the window size is W = 1/ε, then:
  #windows = N/W = εN
  frequency error ≤ #windows = εN
Rule of thumb: set ε = 10% of the support s.
Example: given support frequency s = 1%, set error frequency ε = 0.1%.
22. Putting It All Together
Output: elements with counter values exceeding (s − ε)N.
Approximation guarantees:
- Frequencies are underestimated by at most εN
- No false negatives
- False positives have true frequency at least (s − ε)N
How many counters do we need?
- Worst-case bound: (1/ε) log(εN) counters
- Implementation details: see the sketch below
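A compact Python sketch of lossy counting as presented above (decrement all counters at each window boundary and drop zeros; names are mine, and the output rule is the (s − ε)N threshold from this slide):

    def lossy_counting(stream, s, eps):
        W = int(1 / eps)        # window size
        counts, N = {}, 0
        for x in stream:
            N += 1
            counts[x] = counts.get(x, 0) + 1
            if N % W == 0:      # window boundary: decrement and drop
                for key in list(counts):
                    counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
        # frequencies are undercounted by at most eps * N
        return [x for x, c in counts.items() if c > (s - eps) * N]

For example, lossy_counting(words, s=0.01, eps=0.001) reports the words whose frequency appears to exceed 1%, with no false negatives.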
23. Algorithm 2: Sticky Sampling
- Create counters by sampling
- Maintain exact counts thereafter
- What is the sampling rate?
24. Sticky Sampling (contd.)
For a finite stream of length N: sampling rate = 2/(εN) · log(1/(sδ)), where δ = probability of failure.
Output: elements with counter values exceeding (s − ε)N.
Same rule of thumb: set ε = 10% of the support s.
Example: given support threshold s = 1%, set error threshold ε = 0.1% and failure probability δ = 0.01.
25. Number of Counters?
Finite stream of length N: sampling rate = 2/(εN) · log(1/(sδ)).
Infinite stream with unknown N: gradually adjust the sampling rate (see the sketch below).
In either case, the expected number of counters is (2/ε) · log(1/(sδ)), independent of the stream length.
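A sketch of sticky sampling under these parameters (the rate-doubling schedule and the coin-toss adjustment of existing counters follow the cited Manku-Motwani paper; helper names are mine):

    import math
    import random

    def sticky_sampling(stream, s, eps, delta):
        t = (1 / eps) * math.log(1 / (s * delta))
        counts, N = {}, 0
        rate, next_change = 1, 2 * t    # sample each item w.p. 1/rate
        for x in stream:
            N += 1
            if N > next_change:         # halve the sampling probability
                rate, next_change = rate * 2, next_change * 2
                for key in list(counts):    # re-sample existing counters
                    while counts[key] > 0 and random.random() < 0.5:
                        counts[key] -= 1
                    if counts[key] == 0:
                        del counts[key]
            if x in counts:
                counts[x] += 1          # exact counting once tracked
            elif random.random() < 1 / rate:
                counts[x] = 1           # counter created by sampling
        return [x for x, c in counts.items() if c > (s - eps) * N]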
26. Example 3: Correlated Attributes
      C1 C2 C3 C4 C5
R1     1  1  1  1  0
R2     1  1  0  1  0
R3     1  0  0  1  0
R4     0  0  1  0  1
R5     1  1  1  0  1
R6     1  1  1  1  1
R7     0  1  1  1  1
R8     0  1  1  1  0
- Input: stream of items with boolean attributes
- Matrix M(r,c) = 1 ⇔ row r has attribute c
- Goal: identify highly-correlated column pairs
27. Correlation ⇒ Similarity
- View a column as the set of row indexes where it has a 1
- Set similarity (Jaccard measure): sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
- Example:
  Ci Cj
   0  1
   1  0
   1  1
   0  0
   1  1
   0  1
  sim(Ci, Cj) = 2/5 = 0.4
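The Jaccard measure in Python, checked against the example above (columns represented as sets of the row indexes holding a 1):

    def jaccard(ci, cj):
        return len(ci & cj) / len(ci | cj)

    # Rows 1..6 from the example: Ci has 1s in rows {2, 3, 5},
    # Cj in rows {1, 3, 5, 6}; intersection 2, union 5.
    print(jaccard({2, 3, 5}, {1, 3, 5, 6}))   # 0.4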
28. Identifying Similar Columns?
- Goal: find candidate pairs in small memory
- Signature idea:
- Hash each column Ci to a small signature sig(Ci)
- The set of signatures fits in memory
- sim(Ci, Cj) is approximated by sim(sig(Ci), sig(Cj))
- Naïve approach:
- Sample P rows uniformly at random
- Define sig(Ci) as the P bits of Ci in the sample
- Problem: sparsity; we would miss the interesting parts of the columns, and the sample would get only 0s in most columns
29. Key Observation
- For columns Ci, Cj, there are four types of rows:
      Ci  Cj
  A    1   1
  B    1   0
  C    0   1
  D    0   0
- Overloading notation: A = number of rows of type A
- Observation: sim(Ci, Cj) = A / (A + B + C)
30. Min Hashing
- Randomly permute the rows
- Hash h(Ci) = index of the first row with a 1 in column Ci
- Surprising property: P[h(Ci) = h(Cj)] = sim(Ci, Cj)
- Why?
- Both equal A / (A + B + C)
- Look down columns Ci, Cj until the first non-type-D row
- h(Ci) = h(Cj) if and only if that row is of type A
31. Min-Hash Signatures
- Pick k random row permutations
- Min-hash signature: sig(C) = the k indexes of the first rows with a 1 in column C
- Similarity of signatures:
- Define sim(sig(Ci), sig(Cj)) = fraction of permutations where the min-hash values agree
- Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)
32. Example
Input:
     C1 C2 C3
R1    1  0  1
R2    0  1  1
R3    1  0  0
R4    1  0  1
R5    0  1  0
Signatures:
                 S1 S2 S3
Perm 1 (12345):   1  2  1
Perm 2 (54321):   4  5  4
Perm 3 (34512):   3  5  4
Similarities:
          1-2   1-3   2-3
Col-Col  0.00  0.50  0.25
Sig-Sig  0.00  0.67  0.00
33. Implementation Trick
- Permuting the rows even once is prohibitive
- Row hashing:
- Pick k hash functions hk: {1,…,n} → {1,…,O(n)}
- The ordering under hk gives a random row permutation
- One-pass implementation (sketched below)
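A one-pass sketch of min-hash signatures via row hashing (salted built-in hashes simulate the k permutations; the input format and names are my own):

    import random

    def minhash_signatures(rows, num_cols, k):
        # rows: iterable of (row_index, columns_with_a_1)
        # sig[i][c] = minimum hash of any 1-row of column c under hash i,
        # i.e., the index of c's first 1 in the i-th simulated permutation.
        seeds = [random.getrandbits(64) for _ in range(k)]
        sig = [[float("inf")] * num_cols for _ in range(k)]
        for r, cols in rows:
            hs = [hash((seed, r)) for seed in seeds]
            for c in cols:
                for i in range(k):
                    if hs[i] < sig[i][c]:
                        sig[i][c] = hs[i]
        return sig

    def sig_sim(sig, ci, cj):
        # fraction of hash functions on which the min-hash values agree
        return sum(row[ci] == row[cj] for row in sig) / len(sig)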
34. Comparing Signatures
- Signature matrix S:
- Rows = hash functions
- Columns = columns
- Entries = signatures
- Need: pair-wise similarity of the signature columns
- Problem:
- MinHash fits the column signatures in memory
- But comparing all signature pairs takes too much time
- Limiting the candidate pairs: Locality-Sensitive Hashing
35. Summary
- New algorithmic paradigms are needed for streams and massive data sets
- Negative results abound
- Need to approximate
- Power of randomization
36. Thank You!
37. References
- Rajeev Motwani (http://theory.stanford.edu/~rajeev)
- STREAM Project (http://www-db.stanford.edu/stream)
- STREAM: The Stanford Stream Data Manager. Bulletin of the Technical Committee on Data Engineering, 2003.
- Motwani et al. Query Processing, Approximation, and Resource Management in a Data Stream Management System. CIDR 2003.
- Babcock-Babu-Datar-Motwani-Widom. Models and Issues in Data Stream Systems. PODS 2002.
- Manku-Motwani. Approximate Frequency Counts over Streaming Data. VLDB 2003.
- Babcock-Datar-Motwani-O'Callaghan. Maintaining Variance and K-Medians over Data Stream Windows. PODS 2003.
- Guha-Meyerson-Mishra-Motwani-O'Callaghan. Clustering Data Streams: Theory and Practice. IEEE TKDE 2003.
38. References (contd.)
- Datar-Gionis-Indyk-Motwani. Maintaining Stream Statistics over Sliding Windows. SIAM Journal on Computing, 2002.
- Babcock-Datar-Motwani. Sampling From a Moving Window Over Streaming Data. SODA 2002.
- O'Callaghan-Guha-Mishra-Meyerson-Motwani. High-Performance Clustering of Streams and Large Data Sets. ICDE 2003.
- Guha-Mishra-Motwani-O'Callaghan. Clustering Data Streams. FOCS 2000.
- Cohen et al. Finding Interesting Associations without Support Pruning. ICDE 2000.
- Charikar-Chaudhuri-Motwani-Narasayya. Towards Estimation Error Guarantees for Distinct Values. PODS 2000.
- Gionis-Indyk-Motwani. Similarity Search in High Dimensions via Hashing. VLDB 1999.
- Indyk-Motwani. Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.