Title: Verifying and Mining Frequent Patterns from Large Windows over Data Streams
1Verifying and Mining Frequent Patterns from
Large Windows over Data Streams
- Barzan Mozafari,
- Hetal Thakkar,
- and Carlo Zaniolo
ICDE 2008 Cancun, Mexico
2Finding Frequent Patterns for Association Rule
Mining
- Given a set of transactions T and a support
threshold s, find all patterns with support gt s - Apriori Agrawal 94, FP-growth Han 00
- Fast light algorithms for data streams
- More than 30 proposals Jiang 06
- For mining windows over streams
- In particular DSMSs divide windows into panes,
a.k.a. slides - As in our Stream Mill Miner system
3Moment (Maintaining Closed Frequent Itemsets over
a Stream Sliding Window)
- Yun Chi, Haixun Wang, Philip S. Yu, Richard R.
Muntz - Collaboration of UCLA IBM
4Closed Enumeration Tree (CET)
- Very similar to FP-tree, except that keeps a
dynamic set of items - Closed freq itemsets
- Boundary itemsets
5Moment Algorithm (I)
- Hope In the absence of cocncept drifts, not many
changes in status - Maintains two types of boundary nodes
- Freq / non-freq
- Closed / non-closed
- Taking specific actions to maintain a shifting
boundary whenever a concept shift occurs
6Moment Algorithm (II)
- Infreq gateway nodes
- Infreq its parent freq result of a candidate
join - Unpromising gateway nodes
- Freq prefix of a closed w/ same support
- Intermiddiate nodes
- Freq has a child w/ same supp not unpromising
- Closed nodes
- Closed freq
7Moment Algorithm (III)
- Increments
- Add/Delete to/from CET upon arrival/expiration of
each transaction. - Downside
- Batch operations not applicable, suffers from big
slide sizes - Advantage
- Efficient for small slides
8CanTree Leung 05
- Use a fixed canonical order according to
decreasing single freq. - Use a single-round version of FP-growth
- Algorithm
- Upon each window move
- Add/Remove new/expired trans to/from FP-tree
(using the same item order) - Run FP-growth! (Without any pruning)
9CanTree (cont.)
- Pros
- Very efficient for large slides
- Cons
- Inefficient for small slides
- Not scallable for large windows
- Needs memory for entire window
10Frequent Patterns Mining overData Streams
Expired
New
.
S4
S5
S6
S7
W4
W5
- Challenges
- Computation
- Storage
- Real-time response
- Customization
- Integration with the DSMS
11Frequent Patterns Mining over Data Streams
- Difficult problem Chi 04, Leung 05, Cheung
03, Koh 04, - Mining each window from scratch - too expensive
- Subsequent windows have many freq patterns in
common - Updating frequent patterns every new tuple, also
too expensive - SWIMs middle-road approach incrementally
maintain frequent patterns over sliding windows - Desiderata scalability with slide size and
window size
12SWIM (Sliding Window Incremental Miner)
- If pattern p is freq in a window, it must be freq
in at least one of its slides -- keep a union of
freq patterns of all slides (PT)
Expired
New
.
S4
S5
S6
S7
W4
W5
Mine
Mining Alg.
PT
Prune PT
PT F5 U F6 U F7
PT F4 U F5 U F6
13SWIM
- For each new slide Si
- Find all frequent patterns in Si (using
FP-growth) - Verify frequency of these new patterns in each
window slide - Immediately or
- With delay (lt N slides)
- Trade-off max delay vs. computation.
- No false negatives or false positives!
14SWIM Design Choices
- Data Structure for Sis FP-tree Han 00
- Data Structure for PT FP-tree
- Mining Algorithm FP-growth
- Count/Update frequencies Naïve? Hash-tree?
- Counting is the bottleneck ?
- New and improved counting method named
Conditional Counting
15Conditional Counting
- Verification
- Given a set of transactions T, a set of patterns
P, and a threshold s - Goal Find the exact freq of each p ? P w.r.t. to
T, IF AND ONLY IF its freq is ? s - If s0, verification counting, but if sgt0 extra
computation can be avoided - Proposed fast verifiers
- DTV, DFV, hybrid
16Conditionalization on FP-trees
FP-tree
FP-tree g
FP-tree gd
17Attempt I DTV (Double-Tree Verifier)
- Not only conditionalize the fp-tree, but also the
pattern tree
18FP-tree
FP-tree g
Header Table
(a2,b2,c2,d2) (a1,b1,c1) (b1,e1) Conditio
nal pattern base of g
Header Table
Header Table
Header Table
Header Table
Header Table
pattern tree g, after verification against
FP-tree
Filling original pattern tree using reverse
pointers
Initial pattern tree
pattern tree g
19DTV (cont.)
- Scales up well on large trees
- Much pruning from conditionalization
- However, for smaller trees
- Less pruning
- Overhead of conditionalization not always worth
it
20Attempt II DFV(Depth-First Verifier)
- Each node n in PT corresponds to a unique pattern
pn, therefore - For each node n in PT
- Traverse the FP-tree and count the occurrence of
pn in a depth-first order - Keep the nodes marked as FAIL/OK while visiting
their children - Utilize these marks for optimized execution
- More efficient when both trees are small
21DFV (cont.)
22DFV (cont.)
23Comparing Verifiers
24Hybrid Verifier
- Start with performing DTV recursively
- Until the resulting trees are small enough, then
perform DFV
25Comparing Verifiers
26Verifiers vs. Hash Trees (Counting)
27SWIM with Hybrid Verifier (I)
28SWIM with Hybrid Verifier (II)
29Applications of Verifiers (I)
- Improving counting in static mining methods
- Candidate-generation (and pruning) phase
- Example Toivonen Approach Toivonen 96
- Maintain a boundary of smallest non-frequent and
largest frequent patterns - Check the frequency of boundary patterns
30Applications of Verifiers (II)
- In case resources are limited
- Mine once
- Keep monitoring the current patterns (by
verifying them) - Since verifying is computationally cheaper
- Whenever a significant concept shift is detected,
mine again!
31Monitoring/Concept Shift Detection
- Verification is much faster than mining (when it
suffices)
32Privacy Preserving Applications
- Random noise methods
- Add many fake items into the transactions to
increase the variance Evfimievski 03 - Overhead
- Long transactions (in the order of the no of
items) - Lemma Max depth of the recursion in DTV is lt
the max len of the patterns to be verified. - Run-time independent of the transaction length
33Optimization when integrated into a DSMS
- Stream Mill Miner (SMM) provides integrated
support for online mining algorithms by - User Define Aggregates (UDAs)
- Definition of Mining Models
- Constraints used for optimization
- Max allowed delay
- Interesting/Uninteresting items
- Interesting/Uninteresting patterns
- These are turned from post-conditions into
pre-conditions
34Conclusions
- SWIM for incremental mining over large windows
- More efficient than existing approaches on data
streams - Trade-off between real-time response, efficiency,
memory, etc. - Efficient algorithms for verification/conditional
counting - DTV, DFV, and Hybrid
- These can be used to speed-up many applications
- Incremental mining, enhancing static algorithms,
privacy preserving techniques, - Implementations of SWIM and the verifiers
available at http//wis.cs.ucla.edu/swim/index.h
tm
35References
- Agrawal 94 R. Agrawal and R. Srikant. Fast
algorithms for mining association rules in large
databases. In VLDB, pages 487499, 1994. - Cheung 03 W. Cheung and O. R. Zaiane,
Incremental mining of frequent patterns without
candidate generation or support, in DEAS, 2003. - Chi 04 Y. Chi, H. Wang, P. S. Yu, and R. R.
Muntz, Moment Maintaining closed frequent
itemsets over a stream sliding window, in ICDM,
November 2004. - Evfimievski 03 A. Evfimievski, J. Gehrke, and
R. Srikant, Limiting privacy breaches in privacy
preserving data mining, in PODS, 2003. - Han 00 J. Han, J. Pei, and Y. Yin. Mining
frequent patterns without candidate generation.
In SIGMOD, 2000. - Koh 04 J. Koh and S. Shieh, An efficient
approach for maintaining association rules based
on adjusting fp-tree structures. in DASFAA,
2004. - Leung 05 C.-S. Leung, Q. Khan, and T. Hoque,
Cantree A tree structure for efficient
incremental mining of frequent patterns, in
ICDM, 2005. - Toivonen 96 H. Toivonen, Sampling large
databases for association rules, in VLDB, 1996,
pp. 134145.
36Thank you!
37DFV (cont.)
38DFV (cont.)