From Path Tree To Frequent Patterns

1
From Path Tree To Frequent Patterns
  • A Framework for Mining Frequent Patterns

Yabo Xu (1), Jeffrey Xu Yu (1), Guimei Liu (2), Hongjun Lu (2)
(1) Chinese University of Hong Kong
(2) Hong Kong University of Science and Technology
2
Outline
  • Frequent pattern mining
  • Mining in memory: PP-Mine
  • Performance Study
  • Framework under multi-user environments
  • Experimental Results
  • Conclusion

3
Frequent Pattern Mining
  • Problem definition
  • Find frequent patterns whose support is greater
    than a threshold τ
  • An essential task in many data mining problems,
    e.g. association, correlation, causality
  • Two main approaches
  • Generate-and-test approach (Apriori-like
    algorithms)
  • Pattern-growth approach

4
Pattern Growth Approach
  • FP-Growth, H-Mine, and our approach, PP-Mine
  • Strengths
  • Compress the database into a compact structure and
    project the database on its frequent patterns,
    thus reducing the number of database scans
  • Extend patterns without candidate generation,
    avoiding the cost of generating candidates
  • Weaknesses
  • FP-Growth needs extra space and time to generate
    conditional FP-Trees
  • H-Mine does not cope well with dense datasets
  • Deficient in multi-user environments: one scan is
    needed to construct the structure

5
Mining in memory: PP-Mine
  • Main features
  • Two pushing operations and a no-counting
    strategy, described below
  • Push-down operation (Apriori heuristic)
  • Pushing down to one of a node's children checks
    the itemset with one more item, e.g. ab -> abc
  • Push-right operation (dynamic link adjusting)
  • Pushing right pushes a child to its
    corresponding sibling, which helps to identify the
    subtrees for a pattern transitively
  • No counting
  • Counting is done as a side effect of
    pushing right, in an accumulated manner

6
PP-Mine (cont.): An example
[Figure: a complete prefix-path tree with 4 items (a, b, c, d) under the root]
1. Push-down from node a, to mine subtrees a.b, a.c, a.d
2. After mining these subtrees, push-right these subtrees to a's right siblings
3. After mining the subtrees of node b, push-right them to b's right siblings
4. All 4 subtrees (a.b.c, a.c, b.c, c) are collected for pattern c
7
Mining in memory: PPM-Tree
TID   Items bought    (Ordered) frequent path
100   c,d,e,f,g,i     c,d,e,g
200   a,c,d,e,m       a,c,d,e
300   a,b,d,g,k       a,d,e,g
400   a,c,h           a,c
min_support threshold = 2
  • Build procedure
  • 1. Frequent items list (in frequency order): a:3,
    c:3, d:3, e:2, g:2
  • 2. Insert every frequent path into the prefix tree.
  • PPM-Tree (Prefix-Path Tree)
  • Every node in the tree represents a path and the
    frequency of this path.
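
The build procedure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Node` and `build_prefix_tree` are hypothetical, the transactions are taken from the "Items bought" column above, ties between equally frequent items are broken alphabetically (an assumption), and an item is treated as frequent when its frequency is at least the threshold.

```python
from collections import Counter


class Node:
    """One prefix-tree node: it stands for the path from the root to itself."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0      # frequency of the path ending at this node
        self.children = {}  # item -> Node


def build_prefix_tree(transactions, min_support):
    # Pass 1: count items and keep the frequent ones in frequency order.
    # Tie-breaking by item name is an assumption made for this sketch.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = sorted((i for i in freq if freq[i] >= min_support),
                      key=lambda i: (-freq[i], i))
    rank = {item: r for r, item in enumerate(frequent)}

    # Pass 2: insert each transaction's ordered frequent items as one path.
    root = Node()
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, [(i, freq[i]) for i in frequent]


transactions = [
    ["c", "d", "e", "f", "g", "i"],  # TID 100
    ["a", "c", "d", "e", "m"],       # TID 200
    ["a", "b", "d", "g", "k"],       # TID 300
    ["a", "c", "h"],                 # TID 400
]
root, header = build_prefix_tree(transactions, min_support=2)
print(header)
print(root.children["a"].count, root.children["a"].children["c"].count)
```

With these inputs the frequent items list is a:3, c:3, d:3, e:2, g:2, the node for path a has count 3, and the node for path a.c has count 2.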

8
PP-Mine (cont.): A Running Example
  • Construct the initial header table from the
    children of the root; output pattern a:3
  • Push-down from node a and construct a
    sub-header table; output pattern ac:2

9
PP-Mine (cont.): A Running Example
  • After mining the a.c prefix, push-right a.c's
    children to a.c's right siblings.
  • Mine the a.d prefix; output pattern ad:3

10
Algorithm Analysis
  • PP-Mine vs. FP-Growth
  • The FP-Tree links all nodes with the same item name,
    while the PP-Tree is node-link free when
    constructed.
  • FP-Growth explores the tree bottom-up
    and constructs conditional FP-Trees, while
    PP-Mine follows a depth-first order and replaces
    conditional FP-Trees with dynamic link adjusting.
  • PP-Mine vs. H-Mine
  • H-Mine performs dynamic hyper-link adjusting on a
    hyper-structure.
  • The hyper-structure is not as compact as a tree structure.
  • After hyper-link adjusting, H-Mine needs to count the
    projected database, while the PP-Tree does the counting
    by accumulation.

11
Performance Study on PP-Mine
[Two charts: general dataset and dense dataset]
12
Framework: From Path Tree To Frequent Pattern
[Diagram labels: dataset, Disk PP-Tree, Loading]
13
Framework: Four Components
1. PPM-Tree and an efficient mining algorithm on it
2. Three loading algorithms
3. Coding strategy (PPM-Tree -> PPD-Tree)
4. Coded prefix-path tree on disk (PPD-Tree)
  • The advantages
  • Reduce the cost of constructing trees
  • Load only the relevant portion of the whole tree
  • Use an even faster algorithm in memory

14
Comparison with FP-Tree and H-Struct
  • Simpler than the FP-Tree (node-link free)
  • More compact than the H-Struct

15
Coding the PP-Tree onto Disk
  • In a complete prefix-path tree of rank 5, each node
    is encoded with a number (its position in the pre-order
    traversal of the tree)
  • The shaded nodes are the nodes that actually exist
  • For every node with number M, the following are easy
    to calculate:
  • the path -> M (code)
  • M -> the path this node represents (restore)
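
The two mappings can be illustrated with one possible pre-order numbering. The slides do not give the formula, so everything here is an assumption made for the sketch: the root has code 0, items are numbered 1..N in frequency order, the children of a node whose last item is j are items j+1..N, and the function names `encode` and `decode` are hypothetical.

```python
def subtree_size(rank, item):
    """Number of nodes in the complete subtree rooted at a node ending in `item`."""
    return 2 ** (rank - item)


def encode(path, rank):
    """Map a path (increasing item numbers) to its pre-order code."""
    code, last = 0, 0
    for item in path:
        # Skip the complete subtrees of the siblings that come before `item`.
        for sibling in range(last + 1, item):
            code += subtree_size(rank, sibling)
        code += 1  # step down onto the child itself
        last = item
    return code


def decode(code, rank):
    """Restore the path that a pre-order code stands for."""
    if not 0 <= code < 2 ** rank:
        raise ValueError("code outside the complete tree")
    path, last, current = [], 0, 0
    while current != code:
        first = current + 1  # code of the next candidate child
        for item in range(last + 1, rank + 1):
            size = subtree_size(rank, item)
            if code < first + size:  # target lies in this child's subtree
                path.append(item)
                current, last = first, item
                break
            first += size
    return path


print(encode([1, 3, 5], 5))
print(decode(13, 5))
```

Under this numbering a complete tree of rank 5 has 32 nodes including the root, so every code from 0 to 31 round-trips through `decode` and `encode`.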

16
PPD-Tree representation
  • A PPD-Tree with rank N is represented by
  • (T, F, I, τm)
  • T: a heap for the tree structure, where an
    element consists of a code and its count
  • F: the N frequent 1-itemsets with their counts, in
    order
  • I: an index indicating the ranges of codes in
    disk pages
  • τm: this PPD-Tree can be used for mining frequent
    itemsets with a minimum threshold τm
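
As a rough picture of the four parts, the sketch below holds them in a small Python structure. It is illustrative only: the class name `PPDTree`, the helper `build_index`, the fixed page size, and the choice to store the first code of each page as the index are assumptions, and the codes in `T` are example values under an assumed pre-order numbering with the root at 0.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class PPDTree:
    T: list     # (code, count) pairs, sorted by code
    F: list     # (item, count) for the N frequent items, in order
    I: list     # first code stored on each disk page (assumed index layout)
    tau_m: int  # threshold the tree was built with

    def page_of(self, code):
        """Return the index of the page whose code range covers `code`."""
        return bisect_right(self.I, code) - 1


def build_index(T, page_size):
    """First code of every page when T is split into fixed-size pages."""
    return [T[i][0] for i in range(0, len(T), page_size)]


# Example values only; the codes assume the numbering described above.
T = [(1, 3), (2, 2), (3, 1), (4, 1), (10, 1),
     (13, 1), (17, 1), (18, 1), (19, 1), (20, 1)]
F = [("a", 3), ("c", 3), ("d", 3), ("e", 2), ("g", 2)]
tree = PPDTree(T=T, F=F, I=build_index(T, page_size=4), tau_m=2)
print(tree.I, tree.page_of(13))
```

Keeping the codes sorted lets a range of codes, which corresponds to a subtree in pre-order, map to a contiguous run of pages.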

17
Efficient Loading
  • Three tasks
  • Frequent itemsets mining
  • mining frequent patterns whose support is
    greater than or equal to τ
  • Frequent superitemsets mining
  • mining frequent patterns that include all items
    in V and have a support that is greater than or
    equal to τ
  • Frequent subitemsets mining
  • mining frequent patterns whose items are all
    included in V and that have a support greater
    than or equal to τ
  • Load algorithm
  • Push constraints into the loading phase
  • Load a sub PPD-Tree from disk
  • Construct a minimal PPM-Tree in memory
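
The three tasks differ only in the constraint placed on the patterns, which a brute-force reference makes concrete. This sketch is not one of the loading algorithms: it enumerates every candidate itemset, the function name `mine` and its parameters `tau`, `superset_of` and `subset_of` are hypothetical, and the transactions reuse the "Items bought" column from the PPM-Tree slide.

```python
from itertools import combinations


def mine(transactions, tau, superset_of=frozenset(), subset_of=None):
    """Brute force: {pattern: support} for every pattern with support >= tau."""
    items = sorted(set().union(*transactions))
    if subset_of is not None:
        # Subitemset task: only items of V may appear in a pattern.
        items = [i for i in items if i in subset_of]
    found = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            pattern = frozenset(combo)
            # Superitemset task: the pattern must contain all items of V.
            if not superset_of <= pattern:
                continue
            support = sum(1 for t in transactions if pattern <= t)
            if support >= tau:
                found[pattern] = support
    return found


transactions = [frozenset("cdefgi"), frozenset("acdem"),
                frozenset("abdgk"), frozenset("ach")]
everything = mine(transactions, tau=2)
supers = mine(transactions, tau=2, superset_of=frozenset("cd"))
subs = mine(transactions, tau=2, subset_of=frozenset("cde"))
print(len(everything), len(supers), len(subs))
```

Because both constraints shrink the set of patterns that can qualify, applying them while loading means less of the tree has to be read, which is the motivation for pushing them into the loading phase.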

18
Efficient Loading: PPτ-load
  • Example
  • Constraint: τ = 3
  • load only items with frequencies greater than 3

19
Efficient Loading: PP⊇-load
  • Example
  • Constraints
  • τ = τm, V = {a, d}
  • find all the paths that contain a and d, and
    merge them

20
Efficient Loading: PP⊆-load
  • Example
  • Constraints
  • τ = τm, V = {a, d, e}
  • find all the parts of paths that are contained in
    {a, d, e}, and merge them

21
Performance Study for Loading
  • Loading vs. constructing
  • X: the size of the tree we load and construct
  • Y: the time used
  • The building cost is much larger than the loading cost
  • The loading time is proportional to the size of
    the tree loaded

22
Performance Study for Loading
  • Pruning power of PP?-load
  • X: the number of items in the constraint
  • Y: the size of the loaded PP-Tree
  • Choose different items from different
    regions (top, high, middle, low)
  • The loaded part decreases greatly

23
Conclusion
  • A new framework for mining frequent patterns in
    multi-user environments
  • A novel algorithm, PP-Mine
  • Two prefix-path tree representations (PPM-Tree and
    PPD-Tree) and a coding strategy for the
    transformation
  • Pushing constraints into the loading phase reduces the
    I/O cost and CPU cost.