Verifying and Mining Frequent Patterns from Large Windows over Data Streams

1
Verifying and Mining Frequent Patterns from
Large Windows over Data Streams
  • Barzan Mozafari,
  • Hetal Thakkar,
  • and Carlo Zaniolo

ICDE 2008 Cancun, Mexico
2
Finding Frequent Patterns for Association Rule
Mining
  • Given a set of transactions T and a support
    threshold s, find all patterns with support > s
  • Apriori [Agrawal 94], FP-growth [Han 00]
  • Fast, light algorithms for data streams
  • More than 30 proposals [Jiang 06]
  • For mining windows over streams
  • In particular DSMSs divide windows into panes,
    a.k.a. slides
  • As in our Stream Mill Miner system
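As a point of reference for the problem statement above, here is a minimal brute-force sketch in Python. It is not Apriori or FP-growth; it simply enumerates every itemset of every transaction, which is only feasible for toy data. The function name `frequent_patterns` and the choice to express support as a fraction of the transactions are assumptions made for illustration.

```python
from itertools import combinations

def frequent_patterns(transactions, s):
    """Return {pattern: support} for every pattern whose support exceeds s.

    Support is taken here as the fraction of transactions that contain
    the pattern (an assumption; the slides do not fix the convention).
    """
    counts = {}
    for t in transactions:
        items = sorted(t)
        # Enumerate every non-empty subset of the transaction.
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                p = frozenset(combo)
                counts[p] = counts.get(p, 0) + 1
    n = len(transactions)
    return {p: c / n for p, c in counts.items() if c / n > s}

transactions = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = frequent_patterns(transactions, 0.5)
```

With s = 0.5 only the three single items qualify, since each pair occurs in exactly half of the transactions and the test is strict.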

3
Moment (Maintaining Closed Frequent Itemsets over
a Stream Sliding Window)
  • Yun Chi, Haixun Wang, Philip S. Yu, Richard R.
    Muntz
  • Collaboration of UCLA and IBM

4
Closed Enumeration Tree (CET)
  • Very similar to an FP-tree, except that it keeps a
    dynamic set of items
  • Closed freq itemsets
  • Boundary itemsets

5
Moment Algorithm (I)
  • Hope: in the absence of concept drift, not many
    changes in status
  • Maintains two types of boundary nodes
  • Freq / non-freq
  • Closed / non-closed
  • Taking specific actions to maintain a shifting
    boundary whenever a concept shift occurs

6
Moment Algorithm (II)
  • Infrequent gateway nodes
  • Infrequent, parent is frequent, and the node is the
    result of a candidate join
  • Unpromising gateway nodes
  • Frequent, and a prefix of a closed itemset with the
    same support
  • Intermediate nodes
  • Frequent, has a child with the same support, and is
    not unpromising
  • Closed nodes
  • Closed and frequent

7
Moment Algorithm (III)
  • Increments
  • Add/Delete to/from CET upon arrival/expiration of
    each transaction.
  • Downside
  • Batch operations not applicable, suffers from big
    slide sizes
  • Advantage
  • Efficient for small slides

8
CanTree [Leung 05]
  • Use a fixed canonical order, by decreasing
    single-item frequency
  • Use a single-round version of FP-growth
  • Algorithm
  • Upon each window move
  • Add/Remove new/expired trans to/from FP-tree
    (using the same item order)
  • Run FP-growth! (Without any pruning)
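The add/remove step above can be sketched as a prefix tree whose item order never changes, so a transaction always maps to the same path. This is only an illustration: `CanNode`, `update`, and `slide_window` are hypothetical names, alphabetical order stands in for the canonical order, and the mining step (FP-growth) is left out.

```python
class CanNode:
    """One node of a prefix tree kept in a fixed item order."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def update(root, transaction, delta):
    """Add (delta=+1) or remove (delta=-1) one transaction.

    Removal assumes the transaction was inserted earlier.
    """
    node, path = root, []
    for item in sorted(set(transaction)):  # fixed canonical order (assumed alphabetical)
        child = node.children.get(item)
        if child is None:
            child = node.children[item] = CanNode(item)
        child.count += delta
        path.append((node, child))
        node = child
    # Drop nodes whose count fell to zero, deepest first.
    for parent, child in reversed(path):
        if child.count <= 0:
            del parent.children[child.item]

def slide_window(root, expired, new):
    """Move the window: remove expired transactions, add new ones."""
    for t in expired:
        update(root, t, -1)
    for t in new:
        update(root, t, +1)

root = CanNode()
slide_window(root, [], [{"a", "b"}, {"a", "c"}, {"b"}])
slide_window(root, [{"a", "b"}], [{"a", "c", "d"}])
```

Because the order is fixed, no tree restructuring is needed when item frequencies change, which is what makes batch insertion and deletion cheap.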

9
CanTree (cont.)
  • Pros
  • Very efficient for large slides
  • Cons
  • Inefficient for small slides
  • Not scalable for large windows
  • Needs memory for entire window

10
Frequent Patterns Mining over Data Streams
[Figure: sliding window over slides S4 to S7, showing windows W4 and W5 with the expired and new slides marked]
  • Challenges
  • Computation
  • Storage
  • Real-time response
  • Customization
  • Integration with the DSMS

11
Frequent Patterns Mining over Data Streams
  • Difficult problem [Chi 04], [Leung 05], [Cheung
    03], [Koh 04]
  • Mining each window from scratch is too expensive
  • Subsequent windows have many frequent patterns in
    common
  • Updating frequent patterns on every new tuple is
    also too expensive
  • SWIM's middle-road approach: incrementally
    maintain frequent patterns over sliding windows
  • Desiderata: scalability with slide size and
    window size

12
SWIM (Sliding Window Incremental Miner)
  • If pattern p is freq in a window, it must be freq
    in at least one of its slides -- keep a union of
    freq patterns of all slides (PT)
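The property in the bullet above follows from a short counting argument. The notation is assumed here for illustration: the window W is the disjoint union of slides S_1, ..., S_n, α is the relative support threshold, and count_X(p) is the number of transactions in X that contain p.

```latex
\begin{align*}
\mathrm{count}_W(p) &= \sum_{i=1}^{n} \mathrm{count}_{S_i}(p). \\
\text{If } \mathrm{count}_{S_i}(p) &< \alpha\,|S_i| \text{ for every } i, \text{ then} \\
\mathrm{count}_W(p) &< \alpha \sum_{i=1}^{n} |S_i| = \alpha\,|W|.
\end{align*}
```

So a pattern that is infrequent in every slide is infrequent in the window, and by contraposition every pattern frequent in the window appears in the union of the per-slide frequent sets.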

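[Figure: SWIM over slides S4 to S7. Each new slide is mined by the mining algorithm and its frequent patterns are added to the pattern tree PT; PT is pruned when a slide expires. For window W4, PT = F4 ∪ F5 ∪ F6; for window W5, PT = F5 ∪ F6 ∪ F7.]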
13
SWIM
  • For each new slide Si:
  • Find all frequent patterns in Si (using
    FP-growth)
  • Verify the frequency of these new patterns in each
    slide of the window
  • Immediately, or
  • With delay (< N slides)
  • Trade-off: max delay vs. computation
  • No false negatives or false positives!
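The loop above can be sketched as follows, for the "immediately" variant only. This is a simplified illustration, not the SWIM implementation: a brute-force miner stands in for FP-growth, every pattern in PT is recounted over the whole window instead of being maintained incrementally with a verifier, and a pattern is treated as frequent when its count is at least α times the number of transactions (an assumed convention). The names `mine_slide` and `swim` are hypothetical.

```python
from collections import deque
from itertools import combinations

def mine_slide(slide, alpha):
    """Brute-force stand-in for FP-growth on a single slide."""
    counts = {}
    for t in slide:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                p = frozenset(combo)
                counts[p] = counts.get(p, 0) + 1
    return {p for p, c in counts.items() if c >= alpha * len(slide)}

def swim(slides, n, alpha):
    """Yield {pattern: count} of window-frequent patterns after each slide."""
    window = deque()  # pairs (slide, frequent patterns F_i of that slide)
    for slide in slides:
        window.append((slide, mine_slide(slide, alpha)))
        if len(window) > n:
            window.popleft()  # the oldest slide expires
        # PT is the union of the per-slide frequent sets in the window.
        pt = set().union(*(f for _, f in window))
        size = sum(len(s) for s, _ in window)
        frequent = {}
        for p in pt:
            c = sum(1 for s, _ in window for t in s if p <= t)
            if c >= alpha * size:
                frequent[p] = c
        yield frequent

slides = [[{"a", "b"}, {"a"}], [{"a", "b"}, {"b"}], [{"c"}, {"c"}]]
results = list(swim(slides, n=2, alpha=0.5))
```

Since PT contains every pattern that is frequent in at least one slide, and each reported count is exact, the output has neither false negatives nor false positives.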

14
SWIM Design Choices
  • Data structure for the slides Si: FP-tree [Han 00]
  • Data structure for PT: FP-tree
  • Mining algorithm: FP-growth
  • Count/update frequencies: naïve? Hash-tree?
  • Counting is the bottleneck
  • New and improved counting method, named
    Conditional Counting

15
Conditional Counting
  • Verification
  • Given a set of transactions T, a set of patterns
    P, and a threshold s
  • Goal: find the exact frequency of each p ∈ P w.r.t.
    T, if and only if its frequency is ≥ s
  • If s = 0, verification = counting; but if s > 0, extra
    computation can be avoided
  • Proposed fast verifiers
  • DTV, DFV, hybrid
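To make the verification contract concrete, here is a naive verifier sketch. It is not DTV, DFV, or the hybrid; it scans the transactions once per pattern and stops early when the threshold can no longer be reached, which is one simple way the s > 0 case saves work. The name `verify`, the use of an absolute count for s, and `None` as the "below threshold" marker are assumptions.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact count if count >= s, else None}."""
    n = len(transactions)
    result = {}
    for p in patterns:
        count = 0
        remaining = n
        for t in transactions:
            # Even if every remaining transaction matched, s is out of reach.
            if count + remaining < s:
                break
            remaining -= 1
            if p <= t:
                count += 1
        result[p] = count if count >= s else None
    return result

db = [frozenset("abcd"), frozenset("abc"), frozenset("be"),
      frozenset("abd"), frozenset("acd"), frozenset("bcd")]
patterns = [frozenset("ab"), frozenset("e"), frozenset("ad")]
verified = verify(db, patterns, 2)
counted = verify(db, patterns, 0)
```

With s = 0 the early exit never fires and the function degenerates to plain counting, matching the remark above.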

16
Conditionalization on FP-trees
[Figure: an FP-tree and its conditional trees FP-tree | g and FP-tree | gd]
17
Attempt I: DTV (Double-Tree Verifier)
  • Not only conditionalize the FP-tree, but also the
    pattern tree
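A single conditionalization step on both sides can be sketched on flat collections instead of trees. This is an assumption-laden illustration: transactions and patterns are plain frozensets, and `conditionalize` is a hypothetical helper, not the DTV data structure.

```python
def conditionalize(transactions, patterns, x):
    """Condition both the transactions and the patterns on item x."""
    # Keep only transactions that contain x, with x removed.
    cond_transactions = [t - {x} for t in transactions if x in t]
    # Keep only patterns that contain x, with x removed.
    cond_patterns = {p - {x} for p in patterns if x in p}
    return cond_transactions, cond_patterns

transactions = [frozenset("abg"), frozenset("ag"),
                frozenset("bc"), frozenset("abcg")]
patterns = {frozenset("ag"), frozenset("bg"), frozenset("bc")}
cond_t, cond_p = conditionalize(transactions, patterns, "g")
```

The count of a pattern containing g in the original transactions equals the count of the same pattern without g in the conditional transactions, so patterns that do not contain g never have to be looked at in this branch.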

18
[Figure: DTV example. The FP-tree and the initial pattern tree are both conditionalized on g, each with its own header table. Conditional pattern base of g: (a2,b2,c2,d2) (a1,b1,c1) (b1,e1). The pattern tree | g is shown after verification against the FP-tree, and the original pattern tree is then filled in using reverse pointers.]
19
DTV (cont.)
  • Scales up well on large trees
  • Much pruning from conditionalization
  • However, for smaller trees
  • Less pruning
  • Overhead of conditionalization not always worth
    it

20
Attempt II: DFV (Depth-First Verifier)
  • Each node n in PT corresponds to a unique pattern
    pn; therefore:
  • For each node n in PT
  • Traverse the FP-tree and count the occurrence of
    pn in a depth-first order
  • Keep the nodes marked as FAIL/OK while visiting
    their children
  • Utilize these marks for optimized execution
  • More efficient when both trees are small
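The depth-first counting idea can be sketched for a single pattern on a plain prefix tree. This is a simplification: the FAIL/OK marks shared across pattern-tree nodes are not modeled, and the only pruning shown relies on items along every path being sorted. `Node`, `build_tree`, and `count_pattern` are hypothetical names, and the pattern is assumed to be a non-empty sorted tuple.

```python
class Node:
    """Prefix-tree node; count = number of transactions through this node."""
    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def build_tree(transactions):
    root = Node()
    for t in transactions:
        node = root
        for item in sorted(set(t)):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

def count_pattern(node, pattern, k=0):
    """Count transactions under `node` that contain pattern[k:], depth first."""
    total = 0
    for item, child in node.children.items():
        if item > pattern[k]:
            continue  # paths are sorted, so pattern[k] cannot occur below
        if item == pattern[k]:
            if k + 1 == len(pattern):
                total += child.count  # every transaction through child matches
            else:
                total += count_pattern(child, pattern, k + 1)
        else:
            total += count_pattern(child, pattern, k)
    return total

tree = build_tree(["abcd", "abc", "be", "abd", "acd", "bcd"])
ab = count_pattern(tree, ("a", "b"))
bcd = count_pattern(tree, ("b", "c", "d"))
```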

21
DFV (cont.)
22
DFV (cont.)
23
Comparing Verifiers
24
Hybrid Verifier
  • Start by performing DTV recursively until the
    resulting trees are small enough, then
    perform DFV
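The control flow of the hybrid can be sketched on flat collections, under stated assumptions: transactions and patterns are frozensets rather than trees, direct counting stands in for DFV, the cutoff `small` is an arbitrary illustrative parameter, and `None` marks a pattern whose count is below the threshold. The names `count_direct` and `hybrid_verify` are hypothetical.

```python
def count_direct(db, patterns):
    """Stand-in for DFV: count each pattern by scanning the small database."""
    return {p: sum(1 for t in db if p <= t) for p in patterns}

def hybrid_verify(db, patterns, min_count, small=4):
    """Return {pattern: exact count, or None if the count is below min_count}."""
    if len(db) < min_count:
        # Pruning: no pattern can reach the threshold in this branch.
        return {p: None for p in patterns}
    if len(db) <= small:
        counts = count_direct(db, patterns)
        return {p: (c if c >= min_count else None) for p, c in counts.items()}
    result, groups = {}, {}
    for p in patterns:
        if not p:
            result[p] = len(db)  # the empty pattern matches every transaction
        else:
            groups.setdefault(max(p), []).append(p)
    for x, group in groups.items():
        # Conditionalize both sides on x (the DTV step), then recurse.
        cond_db = [frozenset(i for i in t if i < x) for t in db if x in t]
        cond_patterns = [p - {x} for p in group]
        sub = hybrid_verify(cond_db, cond_patterns, min_count, small)
        for p in group:
            result[p] = sub[p - {x}]
    return result

db = [frozenset("abcd"), frozenset("abc"), frozenset("be"),
      frozenset("abd"), frozenset("acd"), frozenset("bcd")]
patterns = [frozenset("ab"), frozenset("bcd"), frozenset("e"), frozenset("ad")]
verified = hybrid_verify(db, patterns, min_count=2, small=2)
```

Each recursive call removes one item from every pattern it carries, so the recursion depth is bounded by the longest pattern regardless of transaction length.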

25
Comparing Verifiers
26
Verifiers vs. Hash Trees (Counting)
27
SWIM with Hybrid Verifier (I)
28
SWIM with Hybrid Verifier (II)
29
Applications of Verifiers (I)
  • Improving counting in static mining methods
  • Candidate-generation (and pruning) phase
  • Example: Toivonen's approach [Toivonen 96]
  • Maintain a boundary of smallest non-frequent and
    largest frequent patterns
  • Check the frequency of boundary patterns

30
Applications of Verifiers (II)
  • In case resources are limited
  • Mine once
  • Keep monitoring the current patterns (by
    verifying them)
  • Since verifying is computationally cheaper
  • Whenever a significant concept shift is detected,
    mine again!
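The mine-once-then-monitor loop above can be sketched as a check that runs on each new batch. The slides do not define what counts as a "significant" shift, so the criterion used here (the fraction of monitored patterns that fall below the support threshold exceeds a tolerance) is purely an assumption, as are the name `shift_detected` and the default tolerance.

```python
def shift_detected(transactions, patterns, min_support, tolerance=0.2):
    """Return True if too many monitored patterns are no longer frequent."""
    if not patterns:
        return True  # nothing to monitor, so mining is needed
    need = min_support * len(transactions)
    lost = 0
    for p in patterns:
        count = sum(1 for t in transactions if p <= t)  # verification step
        if count < need:
            lost += 1
    return lost / len(patterns) > tolerance

monitored = [frozenset("a"), frozenset("ab")]
stable_batch = [{"a", "b"}, {"a", "b"}, {"a"}]
shifted_batch = [{"c"}, {"c"}, {"a"}]
stable = shift_detected(stable_batch, monitored, 0.5)
shifted = shift_detected(shifted_batch, monitored, 0.5)
```

A caller would re-run the miner only when the function returns True and otherwise keep reporting the verified counts of the existing patterns.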

31
Monitoring/Concept Shift Detection
  • Verification is much faster than mining (when it
    suffices)

32
Privacy Preserving Applications
  • Random noise methods
  • Add many fake items into the transactions to
    increase the variance [Evfimievski 03]
  • Overhead
  • Long transactions (on the order of the number of
    items)
  • Lemma: the max depth of the recursion in DTV is <
    the max length of the patterns to be verified
  • Run-time independent of the transaction length

33
Optimization when integrated into a DSMS
  • Stream Mill Miner (SMM) provides integrated
    support for online mining algorithms by
  • User-Defined Aggregates (UDAs)
  • Definition of Mining Models
  • Constraints used for optimization
  • Max allowed delay
  • Interesting/Uninteresting items
  • Interesting/Uninteresting patterns
  • These are turned from post-conditions into
    pre-conditions

34
Conclusions
  • SWIM for incremental mining over large windows
  • More efficient than existing approaches on data
    streams
  • Trade-off between real-time response, efficiency,
    memory, etc.
  • Efficient algorithms for verification/conditional
    counting
  • DTV, DFV, and Hybrid
  • These can be used to speed up many applications
  • Incremental mining, enhancing static algorithms,
    privacy-preserving techniques
  • Implementations of SWIM and the verifiers
    available at http://wis.cs.ucla.edu/swim/index.htm

35
References
  • [Agrawal 94] R. Agrawal and R. Srikant. Fast
    algorithms for mining association rules in large
    databases. In VLDB, pages 487-499, 1994.
  • [Cheung 03] W. Cheung and O. R. Zaiane.
    Incremental mining of frequent patterns without
    candidate generation or support constraint. In
    IDEAS, 2003.
  • [Chi 04] Y. Chi, H. Wang, P. S. Yu, and R. R.
    Muntz. Moment: Maintaining closed frequent
    itemsets over a stream sliding window. In ICDM,
    November 2004.
  • [Evfimievski 03] A. Evfimievski, J. Gehrke, and
    R. Srikant. Limiting privacy breaches in privacy
    preserving data mining. In PODS, 2003.
  • [Han 00] J. Han, J. Pei, and Y. Yin. Mining
    frequent patterns without candidate generation.
    In SIGMOD, 2000.
  • [Koh 04] J. Koh and S. Shieh. An efficient
    approach for maintaining association rules based
    on adjusting FP-tree structures. In DASFAA,
    2004.
  • [Leung 05] C.-S. Leung, Q. Khan, and T. Hoque.
    CanTree: A tree structure for efficient
    incremental mining of frequent patterns. In
    ICDM, 2005.
  • [Toivonen 96] H. Toivonen. Sampling large
    databases for association rules. In VLDB, pages
    134-145, 1996.

36
Thank you!
  • Questions?

37
DFV (cont.)
38
DFV (cont.)