Title: From Path Tree To Frequent Patterns
1 From Path Tree To Frequent Patterns: A Framework for Mining Frequent Patterns
Yabo Xu (1), Jeffrey Xu Yu (1), Guimei Liu (2), Hongjun Lu (2)
(1) Chinese University of Hong Kong
(2) Hong Kong University of Science and Technology
2 Outline
- Frequent pattern mining
- Mining in memory: PP-Mine
- Performance Study
- Framework under multi-user environments
- Experimental Results
- Conclusion
3 Frequent Pattern Mining
- Problem definition
  - Find the frequent patterns whose support is at least a threshold τ
  - An essential task in many data mining problems, e.g. association, correlation, causality
- Two main approaches
  - Generate-and-test approach (Apriori-like algorithms)
  - Pattern-growth approach
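The problem definition can be made concrete with a brute-force sketch. It is not one of the algorithms discussed in this talk; it only illustrates what "support" and "frequent pattern" mean. The function names and the four toy transactions are made up for illustration, and the threshold is treated as inclusive.

```python
from itertools import combinations

def support(pattern, transactions):
    """Number of transactions that contain every item of `pattern`."""
    wanted = set(pattern)
    return sum(1 for t in transactions if wanted <= set(t))

def frequent_patterns(transactions, min_support):
    """Enumerate every itemset and keep the frequent ones.

    Exponential in the number of items, so for illustration only.
    """
    items = sorted({i for t in transactions for i in t})
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result

# Hypothetical toy data, not taken from the slides.
transactions = [("a", "b", "c"), ("a", "c"), ("a", "d"), ("b", "c")]
patterns = frequent_patterns(transactions, min_support=2)
```

Both approaches named above compute the same result as this enumeration, but avoid testing every possible itemset.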
4 Pattern Growth Approach
- FP-Growth, H-Mine, and our approach, PP-Mine
- Strengths
  - Compress the database into a compact structure and project it on its frequent patterns, which reduces the number of database scans
  - Extend patterns without candidate generation, which avoids the cost of generating candidates
- Weaknesses
  - FP-Growth needs extra space and time to build conditional FP-Trees
  - H-Mine cannot cope with dense datasets
  - Deficient in a multi-user environment: constructing the structure requires a database scan
5 Mining in memory: PP-Mine
- Main features
  - Two pushing operations and a no-counting strategy, described below
- Push-down operation (Apriori heuristic)
  - Pushing down to one of a node's children checks the itemset with one more item, e.g. ab -> abc
- Push-right operation (dynamic link adjustment)
  - Pushing right moves a child to its corresponding sibling, which helps to identify the subtrees for a pattern transitively
- No counting
  - Counting is done as a side effect of pushing right, in an accumulated manner
6 PP-Mine (cont.): An example
[Figure: a complete prefix-path tree with 4 items (a, b, c, d)]
1. Push down from node a to mine subtrees a.b, a.c, a.d.
2. After mining these subtrees, push them right to a's right siblings.
3. After mining the subtrees of node b, push them right to b's right siblings.
4. All 4 subtrees (a.b.c, a.c, b.c, c) are collected for pattern c.
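A short sketch shows why exactly these four subtrees have to be collected for pattern c. In a complete prefix-path tree, every subset of the items that precede c can appear as a prefix before c. The function name is hypothetical.

```python
from itertools import combinations

def paths_ending_in(item, items):
    """All paths of the complete prefix-path tree over `items` that end at `item`."""
    before = items[:items.index(item)]
    paths = []
    # Longest prefixes first, to match the order used in the example.
    for k in range(len(before), -1, -1):
        for prefix in combinations(before, k):
            paths.append(".".join(prefix + (item,)))
    return paths

subtrees_for_c = paths_ending_in("c", ["a", "b", "c", "d"])
# subtrees_for_c == ["a.b.c", "a.c", "b.c", "c"]
```

Push-right gathers these subtrees transitively, one sibling at a time, instead of enumerating the prefixes as this sketch does.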
7 Mining in memory: PPM-Tree
TID   Items bought   (Ordered) frequent path
100   c,d,e,f,g,i    c,d,e,g
200   a,c,d,e,m      a,c,d,e
300   a,b,d,g,k      a,d,e,g
400   a,c,h          a,c
min_support threshold: 2
- Build procedure
  1. Build the list of frequent items in frequency order: a:3, c:3, d:3, e:2, g:2
  2. Insert every frequent path into the prefix tree
- PPM-Tree (Prefix-Path Tree)
  - Every node in the tree represents a path and the frequency of this path
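The two build steps can be sketched as follows. This is a minimal illustration, not the authors' implementation. The class and function names are hypothetical, the input is the "Items bought" column above, and ties in frequency are broken alphabetically (an assumption, since the slides do not state a tie-breaking rule).

```python
from collections import Counter

class Node:
    """A prefix-tree node: the path from the root to here, and its frequency."""
    def __init__(self):
        self.count = 0
        self.children = {}

def frequent_items(transactions, min_support):
    """Step 1: frequent items in descending frequency order."""
    counts = Counter(i for t in transactions for i in set(t))
    frequent = [(i, c) for i, c in counts.items() if c >= min_support]
    # Assumption: ties are broken alphabetically.
    return sorted(frequent, key=lambda ic: (-ic[1], ic[0]))

def build_tree(transactions, min_support):
    """Step 2: insert the ordered frequent items of each transaction as a path."""
    flist = frequent_items(transactions, min_support)
    rank = {item: r for r, (item, _) in enumerate(flist)}
    root = Node()
    for t in transactions:
        path = sorted((i for i in set(t) if i in rank), key=rank.get)
        node = root
        for item in path:
            node = node.children.setdefault(item, Node())
            node.count += 1
    return flist, root

# "Items bought" column, one string of single-letter items per transaction.
items_bought = ["cdefgi", "acdem", "abdgk", "ach"]
flist, root = build_tree(items_bought, min_support=2)
```

Here `flist` comes out as a:3, c:3, d:3, e:2, g:2, and the root has two children, a (count 3) and c (count 1).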
8 PP-Mine (cont.): A running example
- Construct the initial header table from the children of the root; output pattern a:3
- Push down from node a and construct the sub-header-table; output pattern ac:2
9 PP-Mine (cont.): A running example
- After mining the a.c prefix, push a.c's children right to a.c's right siblings
- Mine the a.d prefix; output pattern ad:2
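The interplay of push-down and push-right in the running example can be imitated with a simplified sketch. PP-Mine itself works with header tables and link adjustment. Here, as an assumption made for brevity, physical copies and merges of subtrees stand in for the links. All names are hypothetical, and the paths are derived from the "Items bought" column restricted to the frequent items a, c, d, e, g.

```python
import copy

def build(paths):
    """Prefix tree as nested dicts: item -> [count, children]."""
    root = {}
    for path in paths:
        level = root
        for item in path:
            node = level.setdefault(item, [0, {}])
            node[0] += 1
            level = node[1]
    return root

def merge(level, item, node):
    """Push-right: add `node` and its subtree to the sibling for `item`."""
    target = level.setdefault(item, [0, {}])
    target[0] += node[0]  # counting happens here, as a side effect
    for child_item, child in node[1].items():
        merge(target[1], child_item, child)

def mine(level, order, min_support, prefix=(), out=None):
    out = {} if out is None else out
    work = copy.deepcopy(level)  # keep the caller's tree untouched
    for item in order:  # siblings from left to right
        if item not in work:
            continue
        count, children = work.pop(item)
        if count >= min_support:
            out[prefix + (item,)] = count
            # Push-down: extend the pattern by one more item.
            mine(children, order, min_support, prefix + (item,), out)
        # Push-right: hand the children over to the right siblings.
        for child_item, child in children.items():
            merge(work, child_item, child)
    return out

order = "acdeg"  # frequent items in frequency order
items_bought = ["cdefgi", "acdem", "abdgk", "ach"]
paths = [[i for i in order if i in t] for t in items_bought]
patterns = mine(build(paths), order, min_support=2)
```

When a sibling is reached, every subtree that contributes to its pattern has already been merged into it, so its count is the support and no separate counting pass is needed. On this input, `patterns` contains a with support 3 and ac with support 2.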
10 Algorithm Analysis
- PP-Mine vs. FP-Growth
  - The FP-Tree links all nodes that carry the same item name, while the PP-Tree is built without node links
  - FP-Growth explores the tree bottom-up and constructs conditional FP-Trees; PP-Mine proceeds depth-first and replaces conditional FP-Trees with dynamic link adjustment
- PP-Mine vs. H-Mine
  - H-Mine performs dynamic hyperlink adjustment on a hyper-structure
  - The hyper-structure is not as compact as a tree structure
  - After adjusting hyperlinks, H-Mine needs to count the projected database, while PP-Mine counts in an accumulated fashion
11 Performance Study on PP-Mine
[Charts: general dataset; dense dataset]
12 Framework: From Path Tree To Frequent Pattern
[Diagram: dataset, disk PP-Tree, loading]
13 Framework: Four Components
1. PPM-Tree and an efficient mining algorithm on it
2. Three loading algorithms
3. Coding strategy (between PPM-Tree and PPD-Tree)
4. Coded prefix-path tree on disk (PPD-Tree)
- The advantages
  - Reduce the cost of constructing trees
  - Load only the relevant portion of the whole tree
  - Use an even faster algorithm in memory
14 Comparison with FP-Tree and H-Struct
- Simpler than FP-Tree (node-link free)
- More compact than H-Struct
15 Code PP-Tree into Disk
- A complete prefix-path tree of rank 5; each node is encoded with a number (its position in the pre-order traversal of the tree)
- The shaded nodes are the nodes that actually exist
- For every node with number M, both directions are easy to compute:
  - the path -> M (code)
  - M -> the path this node represents (restore)
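Both directions of the coding can be sketched with a little arithmetic. The sketch rests on assumptions that the slide does not state: items are numbered 1..n in tree order, the root gets code 0, and children are visited from left to right. Under these assumptions the subtree rooted at a node whose last item is i has 2^(n-i) nodes, which is all the functions need. The names `encode` and `decode` are hypothetical.

```python
def encode(path, n):
    """Pre-order number of the node reached by `path` (increasing items in 1..n)."""
    code, prev = 0, 0
    for item in path:
        # One step down to the first child, then skip the complete
        # subtrees of the siblings prev+1 .. item-1.
        code += 1 + sum(2 ** (n - j) for j in range(prev + 1, item))
        prev = item
    return code

def decode(code, n):
    """Restore the path from a pre-order number in 0 .. 2**n - 1."""
    path, remaining, prev = [], code, 0
    while remaining > 0:
        remaining -= 1  # step down to the first child
        item = prev + 1
        # Walk right while the target lies beyond the current sibling's subtree.
        while remaining >= 2 ** (n - item):
            remaining -= 2 ** (n - item)
            item += 1
        path.append(item)
        prev = item
    return tuple(path)

code_of_2 = encode((2,), 5)   # skips the 16-node subtree of item 1
path_of_16 = decode(16, 5)    # last node inside the subtree of item 1
```

Because a code depends only on the path and the rank, a node that does not exist in the actual tree simply leaves its number unused, and only the existing nodes need to be stored.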
16 PPD-Tree representation
- A PPD-Tree with rank N is represented by (T, F, I, τm)
  - T: a heap for the tree structure, where an element consists of a code and its count
  - F: the N frequent 1-itemsets with their counts, in order
  - I: an index indicating the ranges of codes in disk pages
  - τm: the PPD-Tree can be used for mining frequent itemsets with a minimum threshold τm
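The four components can be pictured as a small record. This is only a sketch under assumptions: T is modelled as a list of (code, count) pairs sorted by code, I as the first code stored on each disk page, and a threshold is considered usable when it is not below τm. The class, its methods, and the sample values are hypothetical and not taken from the slides.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class PPDTree:
    T: list      # (code, count) pairs, sorted by code
    F: list      # (item, count) for the N frequent 1-itemsets, in order
    I: list      # first code stored on each disk page, ascending
    tau_m: int   # minimum threshold the tree was built for

    def page_of(self, code):
        """Index of the disk page whose code range contains `code`."""
        return bisect_right(self.I, code) - 1

    def supports_threshold(self, tau):
        """Assumption: any threshold of at least tau_m can be served."""
        return tau >= self.tau_m

# Made-up values, for illustration only.
tree = PPDTree(
    T=[(1, 3), (2, 2), (9, 1), (20, 1)],
    F=[("a", 3), ("c", 3)],
    I=[1, 9, 20],
    tau_m=2,
)
page = tree.page_of(9)
```

With an index of this kind, a loader can decide which pages to read from the codes alone, without touching the rest of the tree.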
17 Efficient Loading
- Three tasks
  - Frequent itemsets mining: mine frequent patterns whose support is greater than or equal to τ
  - Frequent superitemsets mining: mine frequent patterns that include all items in V and have a support greater than or equal to τ
  - Frequent subitemsets mining: mine frequent patterns whose items are all included in V and that have a support greater than or equal to τ
- Load algorithm
  - Push the constraints into the loading phase
  - Load a sub PPD-Tree from disk
  - Construct a minimum PPM-Tree in memory
18 Efficient Loading: PPτ-load
- Example
  - Constraint: τ = 3
  - Load only items with frequencies of at least 3
19 Efficient Loading: PP⊇-load
- Example
  - Constraints: τ = τm, V = {a, d}
  - Find all the paths that contain a and d, and merge them
20 Efficient Loading: PP⊆-load
- Example
  - Constraints: τ = τm, V = {a, d, e}
  - Find all the parts of paths that are contained in {a, d, e}, and merge them
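The effect of the three loading constraints can be sketched on plain paths. The real loaders read coded nodes from the disk pages of a PPD-Tree. This sketch only shows which paths, or parts of paths, survive each constraint. The function names are hypothetical, and the paths are derived from the "Items bought" column of slide 7, restricted to the frequent items a, c, d, e, g in that order.

```python
from collections import Counter

ORDER = "acdeg"  # frequent items in frequency order
ITEMS_BOUGHT = ["cdefgi", "acdem", "abdgk", "ach"]
PATHS = [tuple(i for i in ORDER if i in t) for t in ITEMS_BOUGHT]
FREQ = Counter(i for p in PATHS for i in p)

def merge(paths):
    """Merge identical paths, summing their counts; empty paths are dropped."""
    return Counter(p for p in paths if p)

def load_tau(paths, tau):
    """Keep only the items whose frequency is at least tau."""
    return merge(tuple(i for i in p if FREQ[i] >= tau) for p in paths)

def load_super(paths, v):
    """Keep only the paths that contain every item of v."""
    return merge(p for p in paths if set(v) <= set(p))

def load_sub(paths, v):
    """Keep only the parts of each path that are contained in v."""
    return merge(tuple(i for i in p if i in v) for p in paths)

by_tau = load_tau(PATHS, 3)          # items e and g are cut off
by_super = load_super(PATHS, "ad")   # only paths with both a and d
by_sub = load_sub(PATHS, "ade")      # paths projected onto {a, d, e}
```

Since items are kept in frequency order, the threshold constraint only cuts off the tails of paths, so the loaded tree is a top part of the stored one.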
21 Performance Study for Loading
- Loading vs. constructing
  - X: the size of the tree loaded or constructed
  - Y: the time used
- The building cost is much larger than the loading cost
- The loading time is proportional to the size of the tree loaded
22 Performance Study for Loading
- Pruning power of PP?-load
  - X: the number of items in the constraint
  - Y: the size of the loaded PP-Tree
  - Items are chosen from different regions (top, high, middle, low)
- The loaded part decreases greatly
23 Conclusion
- A new framework for mining frequent patterns in a multi-user environment
- A novel algorithm, PP-Mine
- Two prefix-path tree representations (PPM-Tree and PPD-Tree) and a coding strategy for the transformation
- Pushing constraints into the loading phase reduces the I/O cost and CPU cost