1
Randomization for Massive and Streaming Data Sets
  • Rajeev Motwani

2
Data Stream Management Systems
  • Traditional DBMS: data stored in finite,
    persistent data sets
  • Data Streams: distributed, continuous,
    unbounded, rapid, time-varying, noisy
  • Emerging DSMSs serve a variety of modern
    applications:
  • Network monitoring and traffic engineering
  • Telecom call records
  • Network security
  • Financial applications
  • Sensor networks
  • Manufacturing processes
  • Web logs and clickstreams
  • Massive data sets

3
DSMS Big Picture
[Figure: DSMS architecture. Input streams enter the DSMS, which uses a scratch store, stored relations, and an archive, and produces stored results.]
4
Algorithmic Issues
  • Computational Model:
  • Streaming data (or data in secondary memory)
  • Bounded main memory
  • Techniques:
  • New paradigms
  • Negative results and approximation
  • Randomization
  • Complexity Measures:
  • Memory
  • Time per item (online, real-time)
  • Passes (linear scans over secondary memory)

5
Stream Model of Computation
[Figure: a data stream feeding synopsis data structures in main memory; time increases along the stream.]
  • Memory: poly(1/ε, log N)
  • Query/Update time: poly(1/ε, log N)
  • N = number of items so far (or the window size); ε = error parameter
6
Toy Example: Network Monitoring
[Figure: a DSMS with registered monitoring queries over network measurements and packet traces, emitting intrusion warnings and online performance metrics; backed by a scratch store, lookup tables, and an archive.]
7
Frequency-Related Problems
  • Analytics on packet headers (IP addresses)
  • Example: how many elements have non-zero frequency?
8
Example 1: Distinct Values
  • Input sequence X = x1, x2, ..., xn
  • Domain U = {0, 1, 2, ..., u-1}
  • Compute D(X) = number of distinct values in X
  • Remarks:
  • Assume stream size n is finite/known
  • (in general, n is the window size)
  • Domain could be arbitrary (e.g., text, tuples)

9
Naïve Approach
  • Counter C(i) for each domain value i
  • Initialize counters C(i) ← 0
  • Scan X, incrementing the appropriate counters
  • Problem:
  • Memory size M << n
  • Space O(u), and possibly u >> n
  • (e.g., when counting distinct words in a web crawl)
  • (a minimal sketch follows)
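A minimal sketch of this counter-based scan, in Python (the function name is illustrative; the domain is assumed to be the integers 0..u-1):

```python
def count_distinct_naive(stream, u):
    """Naive distinct count: one counter C(i) per domain value; O(u) space."""
    counts = [0] * u                  # C(i) <- 0 for every domain value i
    for x in stream:
        counts[x] += 1                # scan X, incrementing counters
    return sum(1 for c in counts if c > 0)
```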

10
Negative Result
  • Theorem:
  • Deterministic algorithms need M = Ω(n log u) bits
  • Proof: information-theoretic arguments
  • Note: this leaves open randomization/approximation

11
Randomized Algorithm
[Figure: input stream hashed by h: U → [1..t] into a t-bucket hash table chaining the distinct values seen.]
  • Analysis:
  • Random h ⇒ few collisions; average list size O(n/t)
  • Thus (see the sketch below):
  • Space O(n), since we need t = O(n)
  • Time O(1) per item, expected
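In Python, a set already provides the chained hash table described above, so a sketch of the exact randomized algorithm is short (assumption: the built-in hash behaves like a random hash function):

```python
def count_distinct_hashing(stream):
    """Exact distinct count via hashing: O(n) space, O(1) expected time per item."""
    table = set()          # hash table over the values seen so far
    for x in stream:
        table.add(x)       # expected O(1) insert, few collisions
    return len(table)
```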

12
Improvement via Sampling?
  • Sample-based estimation:
  • Draw a random sample R (of size r) of the n values in X
  • Compute D(R)
  • Estimator: E = D(R) × n/r
  • Benefit: sublinear space
  • Cost: estimation error is high
  • Why? Low-frequency values are underrepresented (see the sketch below)
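A sketch of the scale-up estimator E = D(R) · n/r (the function name is illustrative; random.sample draws R uniformly without replacement):

```python
import random

def sampled_distinct_estimate(x, r):
    """Scale-up estimator: count distinct values in a uniform sample R of
    size r and multiply by n/r.  Space is O(r), but low-frequency values
    are easily missed, so the estimation error can be large."""
    n = len(x)
    sample = random.sample(x, r)
    return len(set(sample)) * n / r
```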

13
Negative Result for Sampling
  • Consider any estimator E of D(X) that examines r items of X,
  • possibly in an adaptive/randomized fashion
  • Theorem [CCMN, PODS 2000]: for any γ > e^(-r), E has relative error
    at least √((n-r)/(2r) · ln(1/γ)) with probability at least γ
  • Remarks:
  • r = n/10 ⇒ error ≥ 75% with probability ≥ 1/2
  • Leaves open randomization/approximation with full scans

14
Randomized Approximation
  • Simplified problem: for a fixed t, is D(X) >> t?
  • Choose hash function h: U → [1..t]
  • Initialize answer to NO
  • For each xi: if h(xi) = t, set answer to YES
  • Observe: needs only 1 bit of memory!
  • Theorem:
  • If D(X) < t, P[output NO] > 0.25
  • If D(X) > 2t, P[output NO] < 0.14

[Figure: stream items hashed by h: U → [1..t]; a single Boolean flag records whether bucket t was ever hit, and the flag determines the YES/NO output. A sketch follows.]
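A sketch of the one-bit test (assumptions: Python's salted hash stands in for a random h: U → [1..t], with bucket t represented as residue t-1):

```python
def one_bit_distinct_test(stream, t, salt=0):
    """Is D(X) well above t?  Hash each item into t buckets; remember in a
    single Boolean flag whether bucket t was ever hit."""
    flag = False                                # the 1 bit of memory
    for x in stream:
        if hash((salt, x)) % t == t - 1:        # event "h(x) = t", prob. 1/t
            flag = True
    return "YES" if flag else "NO"              # NO is likely when D(X) < t
```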
15
Analysis
  • Let Y be the set of distinct elements of X
  • Output is NO ⇔ no element of Y hashes to t
  • P[a given element hashes to t] = 1/t
  • Thus: P[output NO] = (1 - 1/t)^|Y|
  • Since |Y| = D(X):
  • D(X) < t ⇒ P[output NO] > (1 - 1/t)^t > 0.25
  • D(X) > 2t ⇒ P[output NO] < (1 - 1/t)^(2t) < 1/e² ≈ 0.14

16
Boosting Accuracy
  • With 1 bit ? distinguish D(X)ltt from D(X)gt2t
  • Running O(log 1/d) instances in parallel
    ? reduce error probability to any dgt0
  • Running O(log n) in parallel for t 1, 2, 4,
    8,, n ? can estimate D(X) within factor 2
  • Choice of multiplier 2 is arbitrary
    ? can use factor (1e) to reduce error to
    e
  • Theorem Can estimate D(X) within factor (1e)
    with probability (1-d) using space
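A combined sketch, under the same salted-hash assumption: k = O(log 1/δ) independent flags per value of t, with t doubling. The decision rule used here (smallest t at which a majority of trials still say NO) is one reasonable way to turn the YES/NO tests into a factor-2 estimate, not the unique choice:

```python
import math
import random

def estimate_distinct(stream, n_max, delta=0.05):
    """Factor-2 estimate of D(X) from O(log n_max) x O(log 1/delta) one-bit tests."""
    k = max(1, math.ceil(math.log2(1.0 / delta)))            # trials per t
    ts = [2 ** i for i in range(int(math.log2(n_max)) + 1)]  # t = 1, 2, 4, ...
    rng = random.Random(0)
    salts = {(t, j): rng.random() for t in ts for j in range(k)}
    flags = {key: False for key in salts}                    # one bit each
    for x in stream:                                         # single pass
        for (t, j), salt in salts.items():
            if hash((salt, x)) % t == t - 1:                 # "h(x) = t"
                flags[(t, j)] = True
    for t in ts:                                             # smallest t whose
        if sum(flags[(t, j)] for j in range(k)) <= k / 2:    # trials mostly say NO
            return t                                         # D(X) is around t
    return ts[-1]
```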

17
Example 2: Elephants and Ants
[Figure: a stream containing a few very frequent items (elephants) among many rare ones (ants).]
  • Identify items whose current frequency exceeds a
    support threshold s = 0.1%
  • [Jacobson 2000, Estan-Verghese 2001]

18
Algorithm 1: Lossy Counting
Step 1: Divide the stream into windows.
Window size W is a function of the support s (specified
later).
19
Lossy Counting in Action ...
[Figure: the first window starts empty; counts accumulate within it, and at the window boundary every counter is decremented by 1 and zero counters are dropped.]
20
Lossy Counting continued ...
[Figure: the same decrement-and-prune step repeats at each subsequent window boundary.]
21
Error Analysis
How much do we undercount?
If the current stream size is N and the window size is
W = 1/ε, then:
  number of windows = N/W = εN
  frequency undercount ≤ one per window, i.e., at most εN
Rule of thumb: set ε = 10% of the support s.
Example: given support frequency s = 1%, set error
frequency ε = 0.1%.
22
Putting it all together
Output: elements with counter values exceeding (s - ε)N
Approximation guarantees:
  • Frequencies underestimated by at most εN
  • No false negatives
  • False positives have true frequency at least (s - ε)N
  • How many counters do we need?
  • Worst-case bound: (1/ε) log(εN) counters
  • Implementation details: see the sketch below
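A sketch of the windowed Lossy Counting loop described above (simplified: plain decrement-and-prune at every boundary, without the per-counter Δ bookkeeping of the full Manku-Motwani algorithm):

```python
def lossy_counting(stream, epsilon):
    """Lossy Counting with window size W = 1/epsilon: each count is an
    underestimate of the true frequency by at most epsilon * N."""
    w = max(1, int(1 / epsilon))          # window size W = 1/epsilon
    counts = {}
    n = 0
    for x in stream:
        n += 1
        counts[x] = counts.get(x, 0) + 1
        if n % w == 0:                    # window boundary:
            for key in list(counts):      # decrement every counter,
                counts[key] -= 1
                if counts[key] == 0:      # and drop the zeros
                    del counts[key]
    return counts, n

# Output step: report items whose counter exceeds (s - epsilon) * N.
def frequent_items(counts, n, s, epsilon):
    return [x for x, c in counts.items() if c > (s - epsilon) * n]
```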

23
Algorithm 2: Sticky Sampling
  • Create counters by sampling
  • Maintain exact counts thereafter
What is the sampling rate?
24
Sticky Sampling cont'd...
For a finite stream of length N:
  sampling rate = (2/εN) log(1/(sδ))
where δ = probability of failure
Output: elements with counter values exceeding (s - ε)N
Same rule of thumb: set ε = 10% of the support s.
Example: given support threshold s = 1%, set error
threshold ε = 0.1% and failure probability δ = 0.01%.
25
Number of counters?
Finite stream of length N: sampling rate = (2/εN) log(1/(sδ))
Infinite stream with unknown N: gradually adjust the
sampling rate
In either case, the expected number of counters is
(2/ε) log(1/(sδ)), independent of N (see the sketch below)
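A sketch of Sticky Sampling for the finite-stream case, using the rate formula above (the gradual rate adjustment for unknown N is omitted):

```python
import math
import random

def sticky_sampling(stream, n, s, epsilon, delta):
    """Create counters by sampling at rate (2/(epsilon*n)) * log(1/(s*delta));
    once an item has a counter, count it exactly."""
    rate = min(1.0, (2.0 / (epsilon * n)) * math.log(1.0 / (s * delta)))
    counts = {}
    for x in stream:
        if x in counts:
            counts[x] += 1               # exact counting thereafter
        elif random.random() < rate:
            counts[x] = 1                # counter created by sampling
    return counts                        # report items with count > (s - epsilon)*n
```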
26
Example 3 Correlated Attributes
  • C1 C2 C3 C4 C5
  • R1 1 1 1 1 0
  • R2 1 1 0 1 0
  • R3 1 0 0 1 0
  • R4 0 0 1 0 1
  • R5 1 1 1 0 1
  • R6 1 1 1 1 1
  • R7 0 1 1 1 1
  • R8 0 1 1 1 0
  • Input stream: items with Boolean attributes
  • Matrix M(r,c) = 1 ⇔ row r has attribute c
  • Identify highly correlated column pairs

27
Correlation ⇒ Similarity
  • View a column as the set of row indexes where it has
    1s
  • Set similarity (Jaccard measure):
    sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
  • Example (checked in the sketch below):

Ci Cj
 0  1
 1  0
 1  1
 0  0
 1  1
 0  1

sim(Ci, Cj) = 2/5 = 0.4
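Viewing each column as its set of row indexes makes the measure a one-liner; the check below reproduces the example above (Ci = rows {2, 3, 5}, Cj = rows {1, 3, 5, 6}):

```python
def jaccard(ci, cj):
    """Jaccard similarity of two columns, each given as its set of 1-rows."""
    return len(ci & cj) / len(ci | cj)

ci = {2, 3, 5}                     # rows where Ci has a 1
cj = {1, 3, 5, 6}                  # rows where Cj has a 1
assert jaccard(ci, cj) == 2 / 5    # 0.4, as in the example above
```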
28
Identifying Similar Columns?
  • Goal: find candidate pairs using small memory
  • Signature idea:
  • Hash each column Ci to a small signature sig(Ci)
  • Set of signatures fits in memory
  • sim(Ci, Cj) approximated by sim(sig(Ci), sig(Cj))
  • Naïve approach:
  • Sample P rows uniformly at random
  • Define sig(Ci) as the P bits of Ci in the sample
  • Problem:
  • Sparsity ⇒ the sample would miss the interesting
    parts of the columns
  • The sample would get only 0s in most columns

29
Key Observation
  • For columns Ci, Cj, there are four types of rows:
  •     Ci Cj
  • A:   1  1
  • B:   1  0
  • C:   0  1
  • D:   0  0
  • Overload notation: A = |{rows of type A}|
  • Observation: sim(Ci, Cj) = A / (A + B + C)

30
Min Hashing
  • Randomly permute the rows
  • Hash h(Ci) = index of the first row with a 1 in column
    Ci
  • Surprising property:
  • P[h(Ci) = h(Cj)] = sim(Ci, Cj)
  • Why?
  • Both equal A/(A + B + C)
  • Look down columns Ci, Cj until the first non-type-D
    row
  • h(Ci) = h(Cj) ⇔ that row is of type A

31
Min-Hash Signatures
  • Pick k random row permutations
  • Min-Hash signature:
  • sig(C) = the k indexes of the first rows with a 1 in
    column C
  • Similarity of signatures:
  • Define sim(sig(Ci), sig(Cj)) = fraction of the k
    permutations where the Min-Hash values agree
  • Lemma: E[sim(sig(Ci), sig(Cj))] = sim(Ci, Cj)

32
Example
Input matrix:
    C1 C2 C3
R1   1  0  1
R2   0  1  1
R3   1  0  0
R4   1  0  1
R5   0  1  0

Signatures:
                S1 S2 S3
Perm 1 (12345):  1  2  1
Perm 2 (54321):  4  5  4
Perm 3 (34512):  3  5  4

Similarities:
          1-2   1-3   2-3
Col-Col  0.00  0.50  0.25
Sig-Sig  0.00  0.67  0.00
33
Implementation Trick
  • Permuting the rows even once is prohibitive
  • Row hashing:
  • Pick k hash functions hk: {1,...,n} → {1,...,O(n)}
  • Ordering the rows by hk simulates a random row
    permutation
  • One-pass implementation (see the sketch below)
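A one-pass sketch combining the two slides above: k salted hash functions play the role of the k row permutations, and each column keeps the minimum hash value over its 1-rows (assumption: columns are given as a dict mapping column name to its set of row indexes):

```python
import random

def minhash_signatures(columns, n_rows, k, seed=0):
    """One pass over the rows; sig[c][j] = min, over rows r with M(r,c) = 1,
    of the j-th hash of r -- the analogue of the first 1 under a permutation."""
    rng = random.Random(seed)
    salts = [rng.random() for _ in range(k)]          # k "row permutations"
    sig = {c: [float("inf")] * k for c in columns}
    for r in range(n_rows):                           # single pass over rows
        hs = [hash((salt, r)) for salt in salts]
        for c, ones in columns.items():
            if r in ones:                             # M(r, c) == 1
                sig[c] = [min(s, h) for s, h in zip(sig[c], hs)]
    return sig

def signature_similarity(si, sj):
    """Fraction of hash functions on which the Min-Hash values agree."""
    return sum(a == b for a, b in zip(si, sj)) / len(si)
```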

34
Comparing Signatures
  • Signature matrix S:
  • Rows = hash functions
  • Columns = columns
  • Entries = signatures
  • Need: pairwise similarity of the signature columns
  • Problem:
  • MinHash fits the column signatures in memory
  • But comparing all signature pairs takes too much time
  • Limiting the candidate pairs: Locality-Sensitive
    Hashing

35
Summary
  • New algorithmic paradigms needed for streams and
    massive data sets
  • Negative results abound
  • Need to approximate
  • Power of randomization

36
Thank You!
37
References
  • Rajeev Motwani (http://theory.stanford.edu/~rajeev)
  • STREAM Project (http://www-db.stanford.edu/stream)
  • STREAM: The Stanford Stream Data Manager.
    Bulletin of the Technical Committee on Data
    Engineering 2003.
  • Motwani et al. Query Processing, Approximation,
    and Resource Management in a Data Stream
    Management System. CIDR 2003.
  • Babcock-Babu-Datar-Motwani-Widom. Models and
    Issues in Data Stream Systems. PODS 2002.
  • Manku-Motwani. Approximate Frequency Counts over
    Streaming Data. VLDB 2003.
  • Babcock-Datar-Motwani-O'Callaghan. Maintaining
    Variance and K-Medians over Data Stream Windows.
    PODS 2003.
  • Guha-Meyerson-Mishra-Motwani-O'Callaghan.
    Clustering Data Streams: Theory and Practice.
    IEEE TKDE 2003.

38
References (cont'd)
  • Datar-Gionis-Indyk-Motwani. Maintaining Stream
    Statistics over Sliding Windows. SIAM Journal on
    Computing 2002.
  • Babcock-Datar-Motwani. Sampling From a Moving
    Window Over Streaming Data. SODA 2002.
  • O'Callaghan-Guha-Mishra-Meyerson-Motwani.
    High-Performance Clustering of Streams and Large
    Data Sets. ICDE 2003.
  • Guha-Mishra-Motwani-O'Callaghan. Clustering Data
    Streams. FOCS 2000.
  • Cohen et al. Finding Interesting Associations
    without Support Pruning. ICDE 2000.
  • Charikar-Chaudhuri-Motwani-Narasayya. Towards
    Estimation Error Guarantees for Distinct Values.
    PODS 2000.
  • Gionis-Indyk-Motwani. Similarity Search in High
    Dimensions via Hashing. VLDB 1999.
  • Indyk-Motwani. Approximate Nearest Neighbors:
    Towards Removing the Curse of Dimensionality.
    STOC 1998.