Mining Compressed Frequent-Pattern Sets

1
Mining Compressed Frequent-Pattern Sets
  • Dong Xin, Jiawei Han, Xifeng Yan, Hong Cheng
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

VLDB 2005
2
  • Introduction
  • Problem statement
  • Discovering Representative Patterns
  • Performance study
  • Conclusion

3
Introduction
  • Frequent pattern mining
    • High minimum support: common-sense patterns
    • Low minimum support: an explosive number of results
  • To solve this problem, it's natural to explore
    how to compress the patterns

4
Two major approaches
  • Lossless compression
    • Closed frequent patterns
  • Lossy approximation
    • Maximal frequent patterns

5
A Motivating Example
  • A subset of frequent itemsets in the accident
    dataset
  • High-quality compression needs to consider both
    expression and support

(Slide graphic not transcribed; its callouts mark the expression of P1 and the support of P1.)
6
A Motivating Example
  • Closed frequent patterns
    • Report P1, P2, P3, P4, P5
    • Put too much emphasis on support
    • No compression
  • Maximal frequent patterns
    • Report only P3
    • Care only about the expression
    • Lose the support information
  • A desirable output: P2, P3, P4
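
The contrast between closed and maximal patterns can be reproduced on a toy database. The sketch below is a brute-force illustration only, not the mining algorithm used in the talk; the database `db`, the threshold `min_sup`, and all function names are made up for this example. It relies on the standard definitions: a frequent pattern is closed if no proper super-pattern has the same support, and maximal if no proper super-pattern is frequent.

```python
from itertools import combinations


def frequent_itemsets(db, min_sup):
    """Enumerate every itemset with support >= min_sup (exponential; toy data only)."""
    items = sorted(set().union(*db))
    freq = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            support = sum(1 for t in db if pattern <= t)
            if support >= min_sup:
                freq[pattern] = support
    return freq


def closed_patterns(freq):
    """Closed: no proper super-pattern with the same support."""
    return {p for p, s in freq.items()
            if not any(p < q and freq[q] == s for q in freq)}


def maximal_patterns(freq):
    """Maximal: no proper super-pattern that is frequent."""
    return {p for p in freq if not any(p < q for q in freq)}


# Hypothetical four-transaction database.
db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("a")]
freq = frequent_itemsets(db, min_sup=2)
closed = closed_patterns(freq)
maximal = maximal_patterns(freq)
```

On this data, seven itemsets are frequent, three of them are closed ({a}, {a,b}, {a,b,c}), and only {a,b,c} is maximal. The closed set keeps every distinct support value, while the maximal set keeps only the longest expression.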

7
  • Our compression framework
    • Cluster frequent patterns by pattern similarity
    • Pick a representative pattern for each cluster
  • Three problems
    • How to measure the similarity of patterns
    • How to define quality-guaranteed clusters in which
      a representative pattern best describes the whole
      cluster
    • How to efficiently discover these clusters

8
Problem statement
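  • Distance measure: Let P1 and P2 be two closed
    patterns. The distance between P1 and P2 is defined
    as
    D(P1, P2) = 1 - |T(P1) ∩ T(P2)| / |T(P1) ∪ T(P2)|,
    where T(P) is the set of transactions that contain P.
  • Example: Let T(P1) = {t1, t2, t3, t4, t5} and
    T(P2) = {t1, t2, t3, t4, t6}; then
    D(P1, P2) = 1 - 4/6 = 1/3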
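
The distance in this example is one minus the ratio of shared transactions to all transactions supporting either pattern. A minimal sketch of that computation follows; the function name `pattern_distance` and the choice to return 0 for two empty transaction sets are assumptions made for the illustration.

```python
from fractions import Fraction


def pattern_distance(t1, t2):
    """Distance between two patterns, given their supporting transaction sets."""
    t1, t2 = set(t1), set(t2)
    union = t1 | t2
    if not union:
        # Assumption: two patterns with no supporting transactions are at distance 0.
        return Fraction(0)
    return 1 - Fraction(len(t1 & t2), len(union))


# Transaction sets from the slide's example.
T_P1 = {"t1", "t2", "t3", "t4", "t5"}
T_P2 = {"t1", "t2", "t3", "t4", "t6"}
d = pattern_distance(T_P1, T_P2)  # 1 - 4/6
```

Using `Fraction` keeps the result exact, so the example yields 1/3 rather than a rounded float.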

9
Clustering criterion
  • A pattern P is δ-covered by another pattern P′ if
    P can be expressed by P′ (P ⊆ P′) and D(P, P′) ≤ δ.
  • A set of patterns forms a δ-cluster if there
    exists a representative pattern Pr such that
    each pattern P in the set is δ-covered by Pr.
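
The covering test and the cluster condition can be sketched as two small predicates. This is an illustration under stated assumptions: patterns are modeled as frozensets of items, "can be expressed by" is read as the subset relation, the mapping `tids` from pattern to supporting transaction ids is hypothetical data, and the threshold is passed as `delta`.

```python
from fractions import Fraction


def distance(ta, tb):
    """1 - |ta ∩ tb| / |ta ∪ tb| on transaction-id sets (0 if both are empty)."""
    union = ta | tb
    return 1 - Fraction(len(ta & tb), len(union)) if union else Fraction(0)


def is_covered(p, rep, tids, delta):
    """True if pattern p is covered by rep: p ⊆ rep and D(p, rep) <= delta."""
    return p <= rep and distance(tids[p], tids[rep]) <= delta


def is_cluster(patterns, rep, tids, delta):
    """True if rep covers every pattern in the collection."""
    return all(is_covered(p, rep, tids, delta) for p in patterns)


# Hypothetical data: {a,b} occurs in 5 transactions, {a,b,c} in 4 of those.
p = frozenset("ab")
pr = frozenset("abc")
tids = {p: {1, 2, 3, 4, 5}, pr: {1, 2, 3, 4}}

covered_loose = is_covered(p, pr, tids, 0.25)  # D = 1 - 4/5 = 0.2
covered_tight = is_covered(p, pr, tids, 0.1)
```

Because a super-pattern can only occur in a subset of the transactions of its sub-pattern, the distance here reduces to one minus the ratio of the two supports.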

10
Pattern Compression Problem
  • Given a transaction database, a minimum support M, and the
    cluster quality measure δ
  • The pattern compression problem is to find a set
    of representative patterns R such that
    • For each frequent pattern P, there is a
      representative pattern Pr ∈ R that covers P
    • The size |R| is minimized

11
Discovering Representative Patterns
  • RPglobal
  • RPlocal

12
  • Input: FP, M, δ
  • Output: representative patterns
  • Begin
    • To collect the complete coverage information:
      • for each P ∈ FP s.t. support(P) ≥ M
        • insert P into the set E
        • for each Q ∈ FP s.t. Q covers P
          • insert P into set(Q)
    • To find the set of representative patterns:
      • while E ≠ ∅
        • find an RP that maximizes |set(RP)|
        • for each Q ∈ set(RP)
          • remove Q from E and the remaining sets
        • output RP
  • End
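
The two phases of this greedy procedure can be sketched in a few lines. This is a rough sketch rather than the authors' implementation: the input `tids` (a mapping from each frequent pattern to its supporting transaction ids) and the toy patterns are hypothetical, "covers" is read as subset plus a distance of at most `delta`, and ties between equally good candidates are broken arbitrarily.

```python
from fractions import Fraction


def distance(ta, tb):
    """1 - |ta ∩ tb| / |ta ∪ tb| on transaction-id sets (0 if both are empty)."""
    union = ta | tb
    return 1 - Fraction(len(ta & tb), len(union)) if union else Fraction(0)


def rp_global(tids, min_sup, delta):
    """Greedy selection of representatives over a mined pattern collection."""
    # Phase 1: collect the complete coverage information.
    uncovered = {p for p, t in tids.items() if len(t) >= min_sup}
    cover = {q: set() for q in tids}
    for p in uncovered:
        for q in tids:
            if p <= q and distance(tids[p], tids[q]) <= delta:
                cover[q].add(p)

    # Phase 2: repeatedly pick the candidate that covers the most
    # still-uncovered patterns, then drop those patterns everywhere.
    representatives = []
    while uncovered:
        rp = max(cover, key=lambda q: len(cover[q]))
        newly_covered = set(cover[rp])
        representatives.append(rp)
        uncovered -= newly_covered
        for q in cover:
            cover[q] -= newly_covered
    return representatives


# Hypothetical patterns with supports 100, 98, 50, 49.
p1, p2 = frozenset("a"), frozenset("ab")
p3, p4 = frozenset("abc"), frozenset("abcd")
tids = {p1: set(range(100)), p2: set(range(98)),
        p3: set(range(50)), p4: set(range(49))}
reps = rp_global(tids, min_sup=1, delta=0.1)
```

With these supports, {a,b} covers {a} and {a,b,c,d} covers {a,b,c}, so two representatives stand in for four patterns. The loop always terminates because every uncovered pattern covers itself at distance 0.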
13
  • RPglobal is expensive
    • Assumes all frequent patterns have already been mined
    • Needs to compute the pairwise distance between
      all frequent patterns
    • Needs to find the globally best representative
      pattern
  • RPlocal
    • Finds a locally good representative pattern
    • Mines directly from the raw data
    • Does not compute pairwise distances

14
RPlocal
  • Algorithm
    • Follow a depth-first search in the pattern space
    • Remember all previously discovered representative
      patterns
    • For each pattern P that is not yet covered and is
      being visited for the second time (when the traversal
      returns from its sons), select a representative
      pattern using the local method (with P as the new
      probe pattern)

15
(Figure not transcribed; its labels are: pattern P, P's son, visited patterns covering P.)
16
Efficient Implementation
  • Non-closed pattern
    • There exists a super-pattern with the same support
  • Closed_Index (N bits)
    • Each bit remembers the consistency of an item
    • Aggregate the closed_index with the pattern
    • Not closed if at least one out-of-pattern bit is set

Example: (c,a) has closed_index 111010.
f does not belong to (c,a), yet the support of (c,a) is the
same as the support of (f,c,a), so (c,a) is not closed.
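
One way to read the closed_index check is as a bitwise AND over the transactions that contain the pattern. The sketch below makes several assumptions that the slide does not state: bit i is taken to mean "item i occurs in every transaction containing the pattern", the item order f, c, a, b, m, p (leftmost bit first) is a guess, and the five-transaction database is hypothetical, chosen so that (c,a) yields the slide's bit string.

```python
ITEMS = ["f", "c", "a", "b", "m", "p"]  # assumed item order, leftmost bit first


def mask(items):
    """Encode a collection of items as an N-bit integer."""
    m = 0
    for item in items:
        m |= 1 << (len(ITEMS) - 1 - ITEMS.index(item))
    return m


def closed_index(pattern, transactions):
    """AND together the bitmasks of all transactions that contain the pattern."""
    pm = mask(pattern)
    index = (1 << len(ITEMS)) - 1  # start with all N bits set
    for t in transactions:
        tm = mask(t)
        if (tm & pm) == pm:  # transaction contains the pattern
            index &= tm
    return index


def is_closed(pattern, transactions):
    """Closed if no bit outside the pattern survives the aggregation."""
    return (closed_index(pattern, transactions) & ~mask(pattern)) == 0


# Hypothetical database.
DB = [["f", "c", "a", "m", "p"],
      ["f", "c", "a", "b", "m"],
      ["f", "b"],
      ["c", "b", "p"],
      ["f", "c", "a", "m", "p"]]

bits = format(closed_index(["c", "a"], DB), "06b")
```

Here `bits` is "111010": the bits for f and m are set even though neither item belongs to (c,a), so `is_closed(["c", "a"], DB)` is false, while (f,c,a,m) passes the check. A pattern that occurs in no transaction keeps all bits set and is reported as not closed, which is an artifact of this sketch.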
18
Performance study
  • Comparing algorithms
  • FPclose: an efficient algorithm that generates all
    closed itemsets; winner of the FIMI 2003 workshop
  • RPglobal: first uses FPclose to generate closed
    itemsets, then uses the global greedy method to find
    representative patterns
  • RPlocal: directly uses the local method to find
    representative patterns from the raw data

19
Performance Study
  • Number of Representative Patterns

20
Performance Study
  • Running Time

21
Performance Study
  • Quality of Representative Patterns

22
Conclusion
  • Two methods to approximate a collection of frequent patterns
  • RPglobal
    • Works well on small collections of patterns
  • RPlocal
    • Much more efficient
    • Still achieves quite good compression quality