Verifying and Mining Frequent Patterns from Large Windows over Data Streams

1
Verifying and Mining Frequent Patterns from
Large Windows over Data Streams

Barzan Mozafari,
Hetal Thakkar,
and Carlo Zaniolo

ICDE 2008 Cancun, Mexico
2
Finding Frequent Patterns for Association Rule
Mining

Given a set of transactions T and a support
threshold s, find all patterns with support > s
Apriori [Agrawal 94], FP-growth [Han 00]
Fast and light algorithms for data streams
More than 30 proposals [Jiang 06]
For mining windows over streams
In particular DSMSs divide windows into panes,
a.k.a. slides
As in our Stream Mill Miner system
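
The problem statement above can be made concrete with a small brute-force sketch. It is not Apriori or FP-growth; it simply enumerates every candidate itemset, which is only workable for toy data. The function names and the sample transactions are illustrative assumptions, and "support" is taken here as an absolute count compared with s using the strict inequality from the slide.

```python
from itertools import combinations

def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    p = frozenset(pattern)
    return sum(1 for t in transactions if p <= set(t))

def frequent_patterns(transactions, s):
    """Brute force: return {pattern: support} for all patterns with support > s."""
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            sup = support(combo, transactions)
            if sup > s:
                result[frozenset(combo)] = sup
    return result

# Hypothetical toy data: four transactions over items a, b, c.
T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
freq = frequent_patterns(T, s=1)
```

With s = 1, the single items and the pairs qualify, while {a, b, c} (support 1) does not.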

3
Moment (Maintaining Closed Frequent Itemsets over
a Stream Sliding Window)

Yun Chi, Haixun Wang, Philip S. Yu, Richard R.
Muntz
Collaboration of UCLA and IBM

4
Closed Enumeration Tree (CET)

Very similar to an FP-tree, except that it keeps a
dynamic set of itemsets:
Closed frequent itemsets
Boundary itemsets

5
Moment Algorithm (I)

Hope: in the absence of concept drift, not many
changes in status
Maintains two types of boundary nodes
Freq / non-freq
Closed / non-closed
Taking specific actions to maintain a shifting
boundary whenever a concept shift occurs

6
Moment Algorithm (II)

Infrequent gateway nodes
Infrequent; its parent is frequent; result of a
candidate join
Unpromising gateway nodes
Frequent; prefix of a closed itemset with the same support
Intermediate nodes
Frequent; has a child with the same support; not unpromising
Closed nodes
Closed and frequent

7
Moment Algorithm (III)

Increments
Add/Delete to/from CET upon arrival/expiration of
each transaction.
Downside
Batch operations not applicable, suffers from big
slide sizes
Advantage
Efficient for small slides

8
CanTree [Leung 05]

Use a fixed canonical order according to
decreasing single freq.
Use a single-round version of FP-growth
Algorithm
Upon each window move
Add/Remove new/expired trans to/from FP-tree
(using the same item order)
Run FP-growth! (Without any pruning)
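
A minimal sketch of the window-move step described above, assuming a plain prefix tree with per-node counts. The class and method names are hypothetical, the canonical item order is passed in by the caller, and the FP-growth mining step is left out. Because the order never changes, a transaction always maps to the same path, so an expired transaction can be removed by walking that path again.

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}

class CanonicalTree:
    """Prefix tree whose paths follow one fixed item order (illustrative only)."""

    def __init__(self, order):
        self.rank = {item: i for i, item in enumerate(order)}
        self.root = Node()

    def _path(self, transaction):
        # Sort the items of a transaction by the fixed canonical order.
        return sorted(transaction, key=self.rank.__getitem__)

    def add(self, transaction):
        node = self.root
        for item in self._path(transaction):
            node = node.children.setdefault(item, Node())
            node.count += 1

    def remove(self, transaction):
        # Assumes the transaction was added earlier and not yet removed.
        node = self.root
        for item in self._path(transaction):
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                # No remaining transaction uses this branch: drop it whole.
                del node.children[item]
                return
            node = child
```

Usage would be one add call per new transaction and one remove call per expired transaction on each window move, followed by mining the tree.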

9
CanTree (cont.)

Pros
Very efficient for large slides
Cons
Inefficient for small slides
Not scalable for large windows
Needs memory for entire window

10
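Frequent Patterns Mining over Data Streams

[Figure: a stream divided into slides S4, S5, S6, S7; window W4 moves to window W5 as the expired slide leaves and the new slide arrives]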

Challenges
Computation
Storage
Real-time response
Customization
Integration with the DSMS

11
Frequent Patterns Mining over Data Streams

Difficult problem [Chi 04, Leung 05, Cheung 03,
Koh 04, ...]
Mining each window from scratch - too expensive
Subsequent windows have many freq patterns in
common
Updating frequent patterns every new tuple, also
too expensive
SWIM's middle-road approach: incrementally
maintain frequent patterns over sliding windows
Desiderata: scalability with slide size and
window size

12
SWIM (Sliding Window Incremental Miner)

If pattern p is freq in a window, it must be freq
in at least one of its slides -- keep a union of
freq patterns of all slides (PT)

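[Figure: as window W4 moves to W5, the expired slide leaves and the new slide is mined by the mining algorithm; its frequent patterns are added to PT and PT is pruned, so PT = F4 ∪ F5 ∪ F6 becomes PT = F5 ∪ F6 ∪ F7]
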
13
SWIM

For each new slide Si
Find all frequent patterns in Si (using
FP-growth)
Verify frequency of these new patterns in each
window slide
Immediately or
With delay (< N slides)
Trade-off: max delay vs. computation
No false negatives or false positives!
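
The steps above can be sketched as follows, using immediate verification. This is a simplified illustration, not the authors' implementation: a brute-force miner stands in for FP-growth, plain counting stands in for the verifiers, the threshold is assumed to be a relative support alpha applied to both slides and windows, and all names are hypothetical. Per-slide counts are cached, so an old pattern is counted only against the new slide and a new pattern only against the older slides.

```python
from collections import deque
from itertools import combinations

def count(pattern, slide):
    return sum(1 for t in slide if pattern <= t)

def mine(slide, min_count):
    """Brute-force stand-in for FP-growth: patterns with count >= min_count."""
    items = sorted({i for t in slide for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if count(frozenset(c), slide) >= min_count}
        if not level:          # no frequent k-itemsets, so none of size k+1
            break
        found |= level
    return found

class Swim:
    def __init__(self, n_slides, alpha):
        self.n, self.alpha = n_slides, alpha
        # Each entry: (transactions, frequent patterns F_i, cached counts).
        self.slides = deque()

    def add_slide(self, transactions):
        slide = [frozenset(t) for t in transactions]
        self.slides.append((slide, mine(slide, self.alpha * len(slide)), {}))
        if len(self.slides) > self.n:
            self.slides.popleft()                       # expire the oldest slide
        # PT is the union of the frequent patterns of the slides in the window.
        pt = set().union(*(f for _, f, _ in self.slides))
        total = sum(len(s) for s, _, _ in self.slides)
        result = {}
        for p in pt:
            c = 0
            for s, _, cache in self.slides:
                if p not in cache:                      # verify only where unknown
                    cache[p] = count(p, s)
                c += cache[p]
            if c >= self.alpha * total:
                result[p] = c
        return result
```

Since a pattern that is frequent in the window must be frequent in at least one slide, every window-frequent pattern appears in PT, and the exact counts rule out false positives.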

14
SWIM Design Choices

Data structure for the slides Si: FP-tree [Han 00]
Data structure for PT: FP-tree
Mining algorithm: FP-growth
Count/update frequencies: naïve? Hash-tree?
Counting is the bottleneck
New and improved counting method, named
Conditional Counting

15
Conditional Counting

Verification
Given a set of transactions T, a set of patterns
P, and a threshold s
Goal: find the exact frequency of each p ∈ P w.r.t.
T, if and only if its frequency is ≥ s
If s = 0, verification = counting, but if s > 0, extra
computation can be avoided
Proposed fast verifiers
DTV, DFV, hybrid
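
To illustrate the verification contract (not DTV, DFV, or the hybrid), here is a naive baseline sketch. It returns the exact count for a pattern that reaches s and None otherwise, and it stops scanning as soon as the remaining transactions cannot lift the count to s. That early exit is one simple way in which s > 0 saves work compared with plain counting. The function name and the use of None are assumptions.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact count if count >= s, else None}."""
    txs = [frozenset(t) for t in transactions]
    result = {}
    for p in map(frozenset, patterns):
        c = 0
        for seen, t in enumerate(txs, start=1):
            if p <= t:
                c += 1
            if c + (len(txs) - seen) < s:
                # Even if every remaining transaction matched, p stays below s.
                c = None
                break
        result[p] = c if c is not None and c >= s else None
    return result

# Hypothetical toy data.
T = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b"}]
P = [{"a", "b"}, {"c"}, {"a"}]
checked = verify(T, P, s=2)
```

With s = 0 the early exit never fires and the function degenerates to ordinary counting.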

16
Conditionalization on FP-trees

[Figure: an FP-tree, the FP-tree conditioned on g, and the FP-tree conditioned on gd]

17
Attempt I: DTV (Double-Tree Verifier)

Not only conditionalize the FP-tree, but also the
pattern tree
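
The idea of conditionalizing both sides can be sketched on flat lists instead of trees. This loses the compression that FP-trees and pattern trees provide, so it only shows the shape of the recursion; the function name and data layout are assumptions. Transactions and patterns are tuples sorted in one fixed item order. For each item x that ends some pattern, the transactions are projected onto the items preceding x, the patterns ending in x are reduced to their prefixes, and the smaller problem is verified recursively. If x itself falls below s, every pattern ending in x is pruned at once.

```python
def dtv(db, patterns, s):
    """db: list of (sorted item tuple, count); patterns: non-empty sorted tuples.
    Returns {pattern: exact count} for the patterns whose count is >= s."""
    result = {}
    by_last = {}
    for p in patterns:
        by_last.setdefault(p[-1], []).append(p)
    for x, ps in by_last.items():
        # Conditional database of x: prefixes of the transactions containing x.
        cond_db = [(t[:t.index(x)], c) for t, c in db if x in t]
        sup = sum(c for _, c in cond_db)
        if sup < s:
            continue                      # prunes every pattern ending in x
        # Conditional patterns of x: drop the last item.
        prefixes = {p[:-1] for p in ps if len(p) > 1}
        sub = dtv(cond_db, prefixes, s)
        for p in ps:
            if len(p) == 1:
                result[p] = sup
            elif p[:-1] in sub:
                result[p] = sub[p[:-1]]
    return result

# Hypothetical toy data: (transaction, multiplicity) pairs.
db = [(("a", "b", "c"), 2), (("a", "c"), 1), (("b", "c"), 1), (("a", "b"), 3)]
pats = [("a", "b", "c"), ("c",), ("a", "b"), ("b", "c"), ("a", "c")]
counts = dtv(db, pats, s=3)
```

Each recursive call shortens the patterns by one item, so the recursion depth in this sketch never exceeds the length of the longest pattern.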

18
[Figure: DTV example. Panels: FP-tree; FP-tree conditioned on g; initial pattern tree; pattern tree conditioned on g; pattern tree conditioned on g after verification against the FP-tree; filling the original pattern tree using reverse pointers. Each tree has a header table. Conditional pattern base of g: (a2,b2,c2,d2), (a1,b1,c1), (b1,e1)]
19
DTV (cont.)

Scales up well on large trees
Much pruning from conditionalization
However, for smaller trees
Less pruning
Overhead of conditionalization not always worth
it

20
Attempt II: DFV (Depth-First Verifier)

Each node n in PT corresponds to a unique pattern
pn, therefore
For each node n in PT
Traverse the FP-tree and count the occurrence of
pn in a depth-first order
Keep the nodes marked as FAIL/OK while visiting
their children
Utilize these marks for optimized execution
More efficient when both trees are small
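
A reduced sketch of the depth-first counting step for a single pattern. It omits the FAIL/OK marks that let DFV share work between neighbouring pattern-tree nodes, so it is only the inner traversal under simplifying assumptions: the tree is a plain prefix tree, items on every path are in alphabetical order, and all names are hypothetical. The ordering allows a subtree to be skipped as soon as its root item sorts after the item still being looked for.

```python
class FPNode:
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

def build_tree(transactions):
    """Prefix tree with counts; items on each path are in sorted order."""
    root = FPNode()
    for t in transactions:
        node = root
        for item in sorted(t):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root

def dfv_count(node, pattern, i=0):
    """Occurrences of the sorted tuple `pattern` below `node`,
    given that pattern[:i] has already been matched on the path."""
    total = 0
    for child in node.children.values():
        if child.item == pattern[i]:
            if i + 1 == len(pattern):
                total += child.count      # all transactions through child match
            else:
                total += dfv_count(child, pattern, i + 1)
        elif child.item < pattern[i]:
            total += dfv_count(child, pattern, i)
        # child.item > pattern[i]: pattern[i] cannot appear deeper, so prune.
    return total

# Hypothetical toy data.
root = build_tree([{"a", "b", "c"}, {"a", "c"}, {"b", "c"}, {"a", "b"}])
```

For example, dfv_count(root, ("a", "c")) visits the branch under a and skips the branch rooted at b.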

21
DFV (cont.)
22
DFV (cont.)
23
Comparing Verifiers
24
Hybrid Verifier

Start with performing DTV recursively
Until the resulting trees are small enough, then
perform DFV

25
Comparing Verifiers
26
Verifiers vs. Hash Trees (Counting)
27
SWIM with Hybrid Verifier (I)
28
SWIM with Hybrid Verifier (II)
29
Applications of Verifiers (I)

Improving counting in static mining methods
Candidate-generation (and pruning) phase
Example: Toivonen's approach [Toivonen 96]
Maintain a boundary of smallest non-frequent and
largest frequent patterns
Check the frequency of boundary patterns

30
Applications of Verifiers (II)

In case resources are limited
Mine once
Keep monitoring the current patterns (by
verifying them)
Since verifying is computationally cheaper
Whenever a significant concept shift is detected,
mine again!
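
The mine-once-then-monitor loop can be sketched as below. The slides do not say how a significant concept shift is detected, so the rule used here (re-mine when more than a fraction max_drop of the monitored patterns fall below the threshold) is purely an assumption, as are the function names. A brute-force miner and plain counting stand in for a real miner and a real verifier.

```python
from itertools import combinations

def count(pattern, transactions):
    return sum(1 for t in transactions if pattern <= t)

def mine(transactions, min_count):
    """Brute-force stand-in for a real mining algorithm."""
    items = sorted({i for t in transactions for i in t})
    found = set()
    for k in range(1, len(items) + 1):
        level = {frozenset(c) for c in combinations(items, k)
                 if count(frozenset(c), transactions) >= min_count}
        if not level:
            break
        found |= level
    return found

def monitor(batches, alpha, max_drop=0.5):
    """Yield (action, patterns) per batch: mine, verify, or re-mine."""
    patterns = set()
    for batch in batches:
        batch = [frozenset(t) for t in batch]
        min_count = alpha * len(batch)
        if not patterns:
            patterns = mine(batch, min_count)
            yield "mine", patterns
            continue
        # Cheap step: only check the patterns already known.
        alive = {p for p in patterns if count(p, batch) >= min_count}
        if len(alive) < (1 - max_drop) * len(patterns):
            patterns = mine(batch, min_count)       # assumed shift criterion
            yield "re-mine", patterns
        else:
            yield "verify", alive
```

Feeding it two similar batches followed by a batch over different items would yield the actions mine, verify, re-mine.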

31
Monitoring/Concept Shift Detection

Verification is much faster than mining (when it
suffices)

32
Privacy Preserving Applications

Random noise methods
Add many fake items into the transactions to
increase the variance [Evfimievski 03]
Overhead
Long transactions (on the order of the number of
items)
Lemma: the max depth of the recursion in DTV is <
the max length of the patterns to be verified.
Run-time independent of the transaction length

33
Optimization when integrated into a DSMS

Stream Mill Miner (SMM) provides integrated
support for online mining algorithms by
User-Defined Aggregates (UDAs)
Definition of Mining Models
Constraints used for optimization
Max allowed delay
Interesting/Uninteresting items
Interesting/Uninteresting patterns
These are turned from post-conditions into
pre-conditions

34
Conclusions

SWIM for incremental mining over large windows
More efficient than existing approaches on data
streams
Trade-off between real-time response, efficiency,
memory, etc.
Efficient algorithms for verification/conditional
counting
DTV, DFV, and Hybrid
These can be used to speed-up many applications
Incremental mining, enhancing static algorithms,
privacy preserving techniques,
Implementations of SWIM and the verifiers
available at http://wis.cs.ucla.edu/swim/index.htm

35
References

[Agrawal 94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[Cheung 03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support constraint. In IDEAS, 2003.
[Chi 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski 03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han 00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh 04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
[Leung 05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen 96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.