From Path Tree To Frequent Patterns

1
From Path Tree To Frequent Patterns

A Framework for Mining Frequent Patterns

Yabo Xu (1), Jeffrey Xu Yu (1), Guimei Liu (2), Hongjun Lu (2)
(1) Chinese University of Hong Kong
(2) Hong Kong University of Science and Technology
2
Outline

Frequent pattern mining
Mining in memory: PP-Mine
Performance Study
Framework under multi-user environments
Experimental Results
Conclusion

3
Frequent Pattern Mining

Problem Definition
Find frequent patterns whose support is at least a minimum threshold τ
An essential task in many data mining problems, e.g. association, correlation, causality
Two Main Approaches
Generate-and-test approach (Apriori-like algorithms)
Pattern-growth approach
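
The problem definition can be illustrated with a minimal Python sketch. The toy transactions and the helper names `support` and `is_frequent` are assumptions made for illustration, and "frequent" is taken to mean a support of at least τ, following the definition used on the later loading slides.

```python
def support(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    wanted = set(itemset)
    return sum(1 for t in transactions if wanted <= set(t))


def is_frequent(itemset, transactions, tau):
    """A pattern is frequent when its support reaches the threshold tau."""
    return support(itemset, transactions) >= tau


# Hypothetical toy database, not taken from the slides.
transactions = [
    ["a", "b", "c"],
    ["a", "c"],
    ["b", "c", "d"],
    ["a", "c", "d"],
]

print(support(["a", "c"], transactions))         # 3
print(is_frequent(["a", "b"], transactions, 2))  # False
```

Counting every candidate this way needs one pass over the database per pattern, which is the cost the pattern-growth approach tries to avoid.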

4
Pattern Growth Approach

FP-Growth, H-Mine and our approach, PP-Mine
Strengths
Compress the database into a compact structure and project the database on its frequent patterns, thus reducing database scans
Extend patterns without candidate generation, avoiding the cost of generating candidates
Weaknesses
FP-Growth needs extra space and time to generate conditional FP-Trees
H-Mine cannot cope with dense datasets
Deficient in multi-user environments: one scan is needed to construct the structure

5
Mining in memory: PP-Mine

Main features
Two pushing operations and a no-counting strategy, described below
Push-down Operation (Apriori heuristic)
Pushing down to one of a node's children checks the itemset with one more item, e.g. ab -> abc
Push-right Operation (dynamic link adjustment)
Pushing right pushes a child to its corresponding sibling, which helps to identify the subtrees for a pattern transitively
No-counting
Counting is done as a side effect of pushing right, in an accumulated manner

6
PP-Mine (cont.): An example

[Figure: a complete prefix-path tree with 4 items (a, b, c, d)]
1. Push down from node a to mine subtrees a.b, a.c, a.d
2. After mining these subtrees, push them right to a's right siblings
3. After mining the subtrees of node b, push them right to b's right siblings
4. All 4 subtrees (a.b.c, a.c, b.c, c) are collected for pattern c
7
Mining in memory: PPM-Tree

TID   Items bought   (Ordered) frequent path
100   c,d,e,f,g,i    c,d,e,g
200   a,c,d,e,m      a,c,d,e
300   a,b,d,g,k      a,d,g
400   a,c,h          a,c
min_support threshold = 2

Build Procedure
1. Frequent items list (frequency order): a:3, c:3, d:3, e:2, g:2
2. Insert every frequent path into the prefix tree.
PPM-Tree (Prefix-Path Tree)
Every node in the tree represents a path and the frequency of this path.
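
The build procedure can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the function name `build_prefix_tree` and the tie-breaking rule (alphabetical order among items with equal frequency) are assumptions. The transactions are the four from the table above.

```python
from collections import Counter


class Node:
    """One prefix-tree node: the path from the root to it, with a count."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}  # item -> Node


def build_prefix_tree(transactions, min_support):
    # Scan 1: count the items and keep the frequent ones.
    freq = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in freq.items() if c >= min_support}
    # Frequency-descending order; ties broken alphabetically (an assumption).
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    rank = {item: r for r, item in enumerate(order)}

    # Scan 2: insert the ordered frequent path of every transaction.
    root = Node()
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in path:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, order, frequent


transactions = ["cdefgi", "acdem", "abdgk", "ach"]  # TIDs 100 to 400
root, order, frequent = build_prefix_tree([list(t) for t in transactions], 2)

print([(i, frequent[i]) for i in order])
# [('a', 3), ('c', 3), ('d', 3), ('e', 2), ('g', 2)]
print(root.children["a"].count, root.children["a"].children["c"].count)  # 3 2
```

The tree carries no node-links between nodes of the same item, which is the property the later comparison with the FP-Tree relies on.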

8
PP-Mine (cont.): A Running Example

Construct the initial header table from the children of the root; output pattern a:3
Push down from node a and construct the sub-header table; output pattern ac:2

9
PP-Mine (cont.): A Running Example

After mining the a.c prefix, push a.c's children right to a.c's right siblings.
Mine the a.d prefix; output pattern ad:2
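
The running example can be reproduced with a simplified Python sketch of the two pushing operations. It is not the authors' PP-Mine: the slides describe dynamic link adjustment, whereas this sketch swaps in subtree merging that copies nodes (`merge`), which is easier to show but uses more memory. All names (`Node`, `build`, `merge`, `mine`) are hypothetical. The input is the set of ordered frequent paths of the four transactions, with transaction 300 taken as a, d, g, as its items bought imply.

```python
class Node:
    def __init__(self, count=0, children=None):
        self.count = count
        self.children = children if children is not None else {}


def build(paths):
    """Insert every ordered frequent path into a prefix tree."""
    root = Node()
    for path in paths:
        node = root
        for item in path:
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


def merge(left, right):
    """Return a new subtree whose counts are the sums of both subtrees."""
    if left is None:
        return right
    merged = dict(left.children)
    for item, child in right.children.items():
        merged[item] = merge(merged.get(item), child)
    return Node(left.count + right.count, merged)


def mine(children, prefix, order, min_support, out):
    pending = dict(children)  # plays the role of the header table
    for item in order:
        node = pending.pop(item, None)
        if node is None:
            continue
        if node.count >= min_support:
            pattern = prefix + (item,)
            out[pattern] = node.count
            # Push down: extend the pattern by one more item.
            mine(node.children, pattern, order, min_support, out)
        # Push right: hand the children to the matching right siblings,
        # so sibling counts accumulate without a separate counting pass.
        for child_item, child in node.children.items():
            pending[child_item] = merge(pending.get(child_item), child)
    return out


paths = ["cdeg", "acde", "adg", "ac"]
patterns = mine(build(paths).children, (), "acdeg", 2, {})
for pattern, count in patterns.items():
    print("".join(pattern), count)  # starts with: a 3, ac 2, ad 2
```

Because a node's count already equals the support of its pattern when the node is visited, an infrequent node is never pushed down into, which is the Apriori heuristic named on the features slide.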

10
Algorithm Analysis

PP-Mine vs. FP-Growth
The FP-Tree links all the nodes with the same item name, while the PP-Tree is node-link free when constructed.
FP-Growth explores the tree bottom-up and constructs conditional FP-Trees, while PP-Mine follows depth-first order and replaces conditional FP-Trees with dynamic link adjusting.
PP-Mine vs. H-Mine
H-Mine performs dynamic hyper-link adjusting on its hyper-structure.
The hyper-structure is not as compact as a tree structure.
After hyper-link adjusting, H-Mine needs to count the projected database, while PP-Mine counts in an accumulated fashion.

11
Performance Study on PP-Mine

[Figures: results on a general dataset and on a dense dataset]

12
Framework: From Path Tree To Frequent Patterns

[Figure: framework overview, with labels "dataset", "Disk PP-Tree" and "Loading"]

13
Framework: Four Components

1. PPM-Tree, with an efficient mining algorithm on it
2. Three loading algorithms
3. Coding strategy (between PPM-Tree and PPD-Tree)
4. Coded prefix-path tree on disk (PPD-Tree)

The advantages
Reduce the cost of constructing trees
Load only the relevant portion of the whole tree
Use an even faster algorithm in memory

14
Comparison with FP-Tree and H-Struct

Simpler than the FP-Tree (node-link free)
More compact than the H-Struct

15
Coding the PP-Tree onto Disk

In a complete prefix-path tree of rank 5, each node is encoded with a number (its position in the pre-order traversal of the tree)
The shaded nodes are the nodes that actually exist
For every node with number M, we can easily calculate the following:
the path -> M (code)
M -> the path this node represents (restore)
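
Both directions of the coding can be sketched in a few lines of Python. The slide does not give the exact numbering, so this sketch assumes the root has code 0, items are identified by their position 1..N in the frequent-item order, and a node whose last item is at position i has children i+1..N; its subtree then holds 2^(N-i) nodes. The function names `encode` and `decode` are hypothetical.

```python
def encode(path, rank):
    """Pre-order number of the node reached by `path` (positions 1..rank)."""
    code, prev = 0, 0
    for pos in path:
        # One step down to the first child, then skip the subtrees of the
        # siblings that come before `pos`.
        code += 1 + sum(2 ** (rank - j) for j in range(prev + 1, pos))
        prev = pos
    return code


def decode(code, rank):
    """Restore the path (list of positions) from a pre-order number."""
    path, prev, rest = [], 0, code
    while rest > 0:
        rest -= 1  # step down to the first child
        pos = prev + 1
        while rest >= 2 ** (rank - pos):  # skip whole sibling subtrees
            rest -= 2 ** (rank - pos)
            pos += 1
        path.append(pos)
        prev = pos
    return path


print(encode([1, 3], 4))  # 6
print(decode(12, 4))      # [2, 4]
```

With 4 items a, b, c, d at positions 1 to 4, the path a.c gets code 6 and code 12 restores to b.d, so only the code and its count need to be stored for each existing node.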

16
PPD-Tree representation

A PPD-Tree of rank N is represented by (T, F, I, τm)
T: a heap for the tree structure, where each element consists of a code and its count
F: the N frequent 1-itemsets with their counts, in order
I: an index indicating the ranges of codes in disk pages
τm: the minimum threshold; this PPD-Tree can be used for mining frequent itemsets with a minimum threshold τm
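
As a rough picture of the four parts, the sketch below models them as a Python dataclass. The field layout is an assumption: T is taken to be a list of (code, count) pairs sorted by code, I a list holding the first code of each disk page, and `page_of` a hypothetical lookup helper. The codes are illustrative pre-order numbers for the running example with rank 5, and the page boundaries are made up.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class PPDTree:
    T: list      # (code, count) pairs, assumed sorted by code
    F: list      # (item, count) pairs in frequency order
    I: list      # first code stored on each disk page, ascending
    tau_m: int   # minimum threshold the tree was built with

    def page_of(self, code):
        """Index of the disk page whose code range covers `code`."""
        return bisect_right(self.I, code) - 1


tree = PPDTree(
    T=[(1, 3), (2, 2), (3, 1), (4, 1), (10, 1),
       (13, 1), (17, 1), (18, 1), (19, 1), (20, 1)],
    F=[("a", 3), ("c", 3), ("d", 3), ("e", 2), ("g", 2)],
    I=[1, 10, 18],  # hypothetical page boundaries
    tau_m=2,
)

print(tree.page_of(13))  # 1
```

Keeping the index sorted lets a loader find the pages for a range of codes with a binary search instead of scanning the whole heap.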

17
Efficient Loading

Three tasks
Frequent Itemsets Mining
mining frequent patterns whose support is greater than or equal to τ
Frequent Superitemsets Mining
mining frequent patterns that include all items in V and have a support greater than or equal to τ
Frequent Subitemsets Mining
mining frequent patterns whose items are all included in V and that have a support greater than or equal to τ
Load algorithm
Push the constraint into the loading phase
Load a sub PPD-Tree from disk
Construct a minimal PPM-Tree in memory

18
Efficient Loading: PPτ-load

Example
Constraint: τ = 3
Load only the items with frequencies of at least 3
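
The threshold constraint can be pictured as pruning during loading. The sketch below is an assumption about how such a loader might behave, not the authors' algorithm: the tree is held as nested dictionaries mapping an item to a (count, children) pair, and `tau_load` is a hypothetical name. The data is the running example's tree with the item frequencies a:3, c:3, d:3, e:2, g:2.

```python
def tau_load(tree, item_counts, tau):
    """Copy of `tree` keeping only nodes whose item has frequency >= tau."""
    return {
        item: (count, tau_load(children, item_counts, tau))
        for item, (count, children) in tree.items()
        if item_counts[item] >= tau
    }


item_counts = {"a": 3, "c": 3, "d": 3, "e": 2, "g": 2}
tree = {
    "a": (3, {
        "c": (2, {"d": (1, {"e": (1, {})})}),
        "d": (1, {"g": (1, {})}),
    }),
    "c": (1, {"d": (1, {"e": (1, {"g": (1, {})})})}),
}

loaded = tau_load(tree, item_counts, 3)
print(loaded)
# {'a': (3, {'c': (2, {'d': (1, {})}), 'd': (1, {})}), 'c': (1, {'d': (1, {})})}
```

Because items are ordered by descending frequency along every path, dropping an infrequent node also drops only items that are no more frequent than it, so the pruned subtree never hides a frequent item.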

19
Efficient Loading: PP?-load (frequent superitemsets)

Example
Constraints: τ = τm, V = {a, d}
Find all the paths that contain both a and d, and merge them

20
Efficient Loading: PP?-load (frequent subitemsets)

Example
Constraints: τ = τm, V = {a, d, e}
Find all the parts of paths that are contained in {a, d, e}, and merge them
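
The two item-set constraints can be contrasted in a short Python sketch. It works on a list of (path, count) pairs rather than on coded disk pages, which is a simplification, and the function names `superset_load` and `subset_load` are hypothetical. The paths are the ordered frequent paths of the running example, with transaction 300 taken as a, d, g.

```python
from collections import Counter


def superset_load(paths, V):
    """Keep only the paths that contain every item of V; merge duplicates."""
    V = set(V)
    loaded = Counter()
    for path, count in paths:
        if V <= set(path):
            loaded[tuple(path)] += count
    return loaded


def subset_load(paths, V):
    """Keep the part of each path that lies inside V; merge duplicates."""
    V = set(V)
    loaded = Counter()
    for path, count in paths:
        part = tuple(item for item in path if item in V)
        if part:
            loaded[part] += count
    return loaded


paths = [("cdeg", 1), ("acde", 1), ("adg", 1), ("ac", 1)]

print(dict(superset_load(paths, "ad")))
# {('a', 'c', 'd', 'e'): 1, ('a', 'd', 'g'): 1}
print(dict(subset_load(paths, "ade")))
# {('d', 'e'): 1, ('a', 'd', 'e'): 1, ('a', 'd'): 1, ('a',): 1}
```

In both cases the loaded paths can be inserted into an in-memory prefix tree, which is smaller than the full tree whenever the constraint excludes paths or items.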

21
Performance Study for Loading

Loading vs. Constructing
X: the size of the tree we load or construct
Y: the time used
The building cost is much larger than the loading cost
The loading time is proportional to the size of the tree loaded

22
Performance Study for Loading

Pruning power of PP?-load
X: the number of items in the constraint
Y: the size of the loaded PP-Tree
Items are chosen from different regions (top, high, middle, low)
The loaded part decreases greatly

23
Conclusion

A new framework for mining frequent patterns in multi-user environments
A novel algorithm, PP-Mine
Two prefix-path tree representations (PPM-Tree and PPD-Tree) and a coding strategy for the transformation
Pushing constraints into the loading phase reduces the I/O cost and CPU cost
