Estimating Entropy for Data Streams

1
Estimating Entropy for Data Streams
  • Khanh Do Ba, Dartmouth College
  • Advisor: S. Muthu Muthukrishnan

2
Review of Data Streams
  • Motivation: a huge data stream that needs to be
    mined for information efficiently.
  • Applications: monitoring IP traffic, mining email
    and text message streams, etc.

3
The Mathematical Model
  • Sequence of integers $A = \langle a_1, \ldots, a_m \rangle$, where each
    $a_i \in N = \{1, \ldots, n\}$.
  • For each $v \in N$, the frequency $m_v$ of $v$ is the number of
    occurrences of $v$ in $A$.
  • Statistics to be estimated are functions on $A$,
    but usually just on the $m_v$'s (e.g. frequency
    moments).
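
To make the model concrete, here is a minimal Python sketch that computes the frequencies $m_v$ and a frequency moment exactly for a toy stream. The function names and the toy stream are illustrative assumptions, not part of the talk. Note that this exact computation stores one counter per distinct value, which is the space cost the streaming setting tries to avoid.

```python
from collections import Counter

def frequencies(stream):
    """Return a mapping v -> m_v, the number of occurrences of v in the stream."""
    return Counter(stream)

def frequency_moment(freqs, k):
    """k-th frequency moment: the sum over v of m_v ** k."""
    return sum(c ** k for c in freqs.values())

# Toy stream (an assumption for illustration): m = 6 items drawn from N = {1, 2, 3}.
A = [1, 3, 1, 2, 1, 3]
m_v = frequencies(A)
F1 = frequency_moment(m_v, 1)  # equals the stream length m
F2 = frequency_moment(m_v, 2)
```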

4
What is Entropy?
  • In physics: a measure of disorder in a system.
  • In math: a measure of randomness (or uniformity) of
    a probability distribution.
  • Formula: $H = -\sum_{v} \Pr[v] \log \Pr[v]$.

5
Entropy on Data Streams
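  • For big $m$, $m_v/m \approx \Pr[v]$. So the formula becomes
    $H \approx -\sum_{v} \frac{m_v}{m} \log \frac{m_v}{m} = \log m - \frac{1}{m} \sum_{v} m_v \log m_v$.
  • Suffices to compute $m$ (easy) and
    $\mu = \sum_{v} m_v \log m_v$.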
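
The reduction on this slide can be checked numerically. The sketch below takes the second quantity to be $\mu = \sum_v m_v \log m_v$, which is what the estimator on the later "Computing X" slide is consistent with, and assumes base-2 logarithms (the slides do not state a base). The function names are illustrative, and the stream is held in a list only so the exact values can be compared.

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Entropy of the empirical distribution m_v / m, computed directly."""
    freqs = Counter(stream)
    m = sum(freqs.values())
    return -sum((c / m) * math.log2(c / m) for c in freqs.values())

def mu(stream):
    """mu = sum over v of m_v * log(m_v); values with m_v = 1 contribute 0."""
    return sum(c * math.log2(c) for c in Counter(stream).values())

def entropy_from_mu(stream):
    """The same entropy, recovered from m and mu alone: log m - mu / m."""
    m = len(stream)
    return math.log2(m) - mu(stream) / m

A = [1, 3, 1, 2, 1, 3]
direct = empirical_entropy(A)
via_mu = entropy_from_mu(A)
```

The two values agree up to floating-point error, so an estimate of $\mu$ together with the exact count $m$ yields an estimate of the entropy.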

6
The Goal
  • Approximation algorithm to estimate $\mu$.
  • Approximate means to output a number $Y$ such that
    $\Pr[|Y - \mu| \geq \lambda\mu] \leq \varepsilon$, for any
    user-specified $\lambda, \varepsilon > 0$.
  • Restrictions: $o(n)$ space, preferably $\tilde{O}(1)$, and
    only one pass over the data.

7
The Algorithm
  • We want $Y$ to have $E[Y] = \mu$ and very small
    variance, so find a computable random variable $X$
    with $E[X] = \mu$ and small variance, and compute it
    several times.
  • $Y$ is the median of $s_2$ RVs $Y_i$, each of which is
    the mean of $s_1$ RVs $X_{ij} \sim X$ (independently,
    identically computed).
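
The median-of-means combination described above can be sketched as follows. This is a generic illustration: `draw_x` is a hypothetical zero-argument function returning one independent copy of X, and the Gaussian stand-in in the usage lines is an assumption chosen only to make the example runnable.

```python
import random
import statistics

def median_of_means(draw_x, s1, s2):
    """Return the median of s2 group means, each the mean of s1 draws of X."""
    group_means = []
    for _ in range(s2):
        draws = [draw_x() for _ in range(s1)]
        group_means.append(statistics.fmean(draws))
    return statistics.median(group_means)

# Usage with a stand-in X (assumption): a noisy variable with mean 10.
rng = random.Random(0)
Y = median_of_means(lambda: rng.gauss(10, 5), s1=200, s2=5)
```

Averaging within a group reduces the variance, and taking the median across groups makes a badly off result unlikely unless more than half of the groups are off.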

8
Computing X
  • Choose $p \in \{1, \ldots, m\}$ uniformly at random.
  • Let $r = |\{q \geq p : a_q = a_p\}|$ ($\geq 1$).
  • $X = m\,(r \log r - (r - 1) \log (r - 1))$.
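
One way to compute X in a single pass, without knowing m in advance, is to pick the position p by reservoir sampling and restart the counter r whenever the sampled position moves. The slides do not say how p is chosen online, so the reservoir step is an assumption here, as are the base-2 logarithm, the convention $0 \log 0 = 0$, and the function names.

```python
import math
import random

def xlogx(r):
    """r * log2(r), with the convention 0 * log 0 = 0."""
    return r * math.log2(r) if r > 0 else 0.0

def sample_x(stream, rng=random):
    """One pass over the stream; returns one draw of the estimator X."""
    m = 0          # number of items seen so far
    value = None   # a_p for the currently sampled position p
    r = 0          # occurrences of a_p at positions q >= p
    for a in stream:
        m += 1
        if rng.randrange(m) == 0:   # with probability 1/m, move p to this position
            value, r = a, 1
        elif a == value:
            r += 1
    if m == 0:
        raise ValueError("empty stream")
    return m * (xlogx(r) - xlogx(r - 1))

# Usage: average many independent draws on a toy stream (an assumption).
A = [1, 3, 1, 2, 1, 3]
rng = random.Random(1)
avg = sum(sample_x(A, rng) for _ in range(20000)) / 20000
target = 3 * math.log2(3) + 2 * math.log2(2)  # sum of m_v log m_v for A
```

Each draw keeps only the count m, the sampled value, and the counter r, so the state per copy of X is a few machine words.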

9
The Analysis
  • Easy: $E[Y] = E[X] = \mu$.
  • Hard: $\mathrm{Var}[Y]$ is very small.
  • Turns out $s_1 = O(\log n)$, $s_2 = O(1)$ works.
  • Each $X$ maintained in $O(\log n + \log m)$ space.
  • Total: $O(s_1 s_2 (\log n + \log m)) = O(\log n \log m)$.
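
The "easy" part can be written out as a short telescoping argument. This is a sketch under the assumption that $\mu = \sum_v m_v \log m_v$ and that $0 \log 0 = 0$: for a value $v$ with $m_v$ occurrences, the random position $p$ lands on the occurrence that has exactly $j$ occurrences of $v$ from $p$ onward with probability $1/m$, for each $j = 1, \ldots, m_v$.

```latex
\begin{align*}
E[X] &= \sum_{v} \sum_{j=1}^{m_v} \frac{1}{m} \cdot m \bigl( j \log j - (j-1) \log (j-1) \bigr) \\
     &= \sum_{v} \bigl( m_v \log m_v - 0 \log 0 \bigr) \\
     &= \sum_{v} m_v \log m_v = \mu .
\end{align*}
```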

10
Future Directions
  • Extension to insert/delete streams. Applications
    in:
      • DBMSs where massive secondary storage cannot be
        scanned quickly enough to answer real-time
        queries.
      • Monitoring open flows through internet routers.
  • Lower-bound proof showing the algorithm is optimal, or
    an improved algorithm.