1 Efficient and Effective Itemset Pattern Summarization: Regression-based Approaches
- Ruoming Jin
- Kent State University
- Joint work with Muad Abu-Ata, Yang Xiang, and
Ning Ruan (KSU)
2 Problem Definition
- Given a large collection of frequent itemsets and their supports, how can we concisely represent them?
- Coverage criterion
  - The Spanning Set Approach: F. Afrati, A. Gionis, and H. Mannila, Approximating a collection of frequent sets, KDD'04.
- Frequency criterion
  - The Profile-based Approach: X. Yan, H. Cheng, J. Han, and D. Xin, Summarizing itemset patterns: a profile-based approach, KDD'05.
  - The Markov Random Field Approach: C. Wang and S. Parthasarathy, Summarizing itemset patterns using probabilistic models, KDD'06.
3 Frequency Criterion
- The restoration function of a set of itemsets S is a function f_hat that maps each itemset T in S to an estimate of its true frequency f(T).
- The restoration error measures how far the estimates are from the true frequencies.
- We use the 2-norm in this study: E(S) = sqrt( sum over T in S of (f(T) - f_hat(T))^2 ).
4 Probabilistic Restoration Function
- Applying the independence probabilistic model to a set of itemsets S gives the restoration function f_hat(T) = p(S) * (product over a in T of p(a)).
- An example: for the itemset {a, c, d}, f_hat({a, c, d}) = p(S) * p(a) * p(c) * p(d) (see the sketch below).
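To make the model concrete, here is a minimal Python sketch of the restoration function and the 2-norm restoration error from the previous slide. The names restore and restoration_error and all probability values are illustrative assumptions, not taken from the paper.

    import math

    def restore(itemset, p_S, p_item):
        # Independence model: f_hat(T) = p(S) * prod over a in T of p(a).
        est = p_S
        for a in itemset:
            est *= p_item[a]
        return est

    def restoration_error(itemsets, supports, p_S, p_item):
        # 2-norm restoration error over the collection S (slide 3).
        return math.sqrt(sum((f - restore(T, p_S, p_item)) ** 2
                             for T, f in zip(itemsets, supports)))

    # Illustrative (made-up) parameter values:
    p_item = {"a": 0.8, "c": 0.6, "d": 0.5}
    print(restore(("a", "c", "d"), 0.9, p_item))  # 0.9 * 0.8 * 0.6 * 0.5 = 0.216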
5 Problem 1: Optimal Parameters
- What are the optimal parameters p(S), p(a), p(c), p(d) that minimize the restoration error?
6 Non-Linear Regression
- For each item a we introduce an indicator variable x_a, with x_a = 1 if a is in T and x_a = 0 otherwise, so the model becomes f_hat(T) = p(S) * (product over a of p(a)^x_a), which is non-linear in its parameters.
- We have |S| data points, one per itemset in S: the indicator vector of T together with its observed frequency f(T) (see the sketch below).
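A sketch of the non-linear fit under the assumptions above, using scipy's general least-squares solver as a stand-in for whatever optimizer the authors used; the itemsets and supports below are made up for illustration.

    import numpy as np
    from scipy.optimize import least_squares

    items = ["a", "b", "c", "d"]
    itemsets = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"),
                ("c", "d"), ("a", "b", "c"), ("a", "c", "d")]
    supports = np.array([0.30, 0.28, 0.22, 0.25, 0.20, 0.15, 0.12])  # made up

    # One indicator row per itemset: x_a = 1 iff item a is in T.
    X = np.array([[1.0 if a in T else 0.0 for a in items] for T in itemsets])

    def residuals(theta):
        # theta = [p(S), p(a), p(b), p(c), p(d)];
        # model: f_hat(T) = p(S) * prod over a of p(a)^x_a.
        p_S, p = theta[0], theta[1:]
        return supports - p_S * np.prod(p ** X, axis=1)

    # Keep all parameters positive; bounds are a practical choice here,
    # not something the slides specify.
    fit = least_squares(residuals, x0=np.full(len(items) + 1, 0.5),
                        bounds=(1e-6, np.inf))
    print(fit.x)  # fitted [p(S), p(a), ..., p(d)]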
7 Linear Regression Approximation
Taking logarithms turns the product model into a linear one: log f_hat(T) = log p(S) + sum over a in T of log p(a). Using a Taylor expansion, we show that the restoration error from this linear regression is very close to the error obtained with the non-linear regression!
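For comparison, a sketch of the linear approximation on the same illustrative data, assuming the log-space formulation above; numpy's lstsq is a stand-in solver.

    import numpy as np

    items = ["a", "b", "c", "d"]
    itemsets = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"),
                ("c", "d"), ("a", "b", "c"), ("a", "c", "d")]
    supports = np.array([0.30, 0.28, 0.22, 0.25, 0.20, 0.15, 0.12])

    # Design matrix: an intercept column for log p(S), then the 0/1
    # item indicators; the response is log f(T).
    A = np.array([[1.0] + [1.0 if a in T else 0.0 for a in items]
                  for T in itemsets])
    coef, *_ = np.linalg.lstsq(A, np.log(supports), rcond=None)

    p_S = np.exp(coef[0])
    p_item = dict(zip(items, np.exp(coef[1:])))
    print(p_S, p_item)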
8 Problem 2: Optimal Partition
- To reduce the restoration error, we adopt a partition strategy: partition the entire collection of frequent itemsets into K disjoint subsets, and build a restoration function for each subset.
- How can we optimally partition a set of itemsets into K disjoint subsets so that the total restoration error is minimized?
9 Our Approaches
- NP-hard problem
- Two heuristic algorithms
- K-Regression
- Tree Regression
10 K-Regression
- A k-means-type clustering procedure (a sketch follows below):
- Step 1: Randomly partition the set of itemsets S into K partitions.
- Step 2 (Regression): Apply regression to find the optimal parameters on each partition.
- Step 3 (Re-assignment): Assign each itemset to the partition that minimizes its restoration error under the optimal parameters found in Step 2.
- Repeat Steps 2 and 3 until the total restoration error stops decreasing or the improvement is small.
- Just like k-means, K-Regression is guaranteed to converge!
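A compact sketch of the K-Regression loop, reusing the log-space regression from the earlier slides; fit_params and sq_error are hypothetical helper names, and a fixed iteration cap stands in for the convergence test described above.

    import random
    import numpy as np

    def fit_params(itemsets, supports, items):
        # Regression step: log-space least squares on one partition.
        A = np.array([[1.0] + [1.0 if a in T else 0.0 for a in items]
                      for T in itemsets])
        coef, *_ = np.linalg.lstsq(A, np.log(np.asarray(supports)), rcond=None)
        return coef

    def sq_error(T, f, coef, items):
        # Squared restoration error of one itemset under parameters coef.
        x = np.array([1.0] + [1.0 if a in T else 0.0 for a in items])
        return (f - np.exp(x @ coef)) ** 2

    def k_regression(itemsets, supports, items, K, iters=20):
        # Step 1: random initial partition.
        labels = [random.randrange(K) for _ in itemsets]
        for _ in range(iters):  # fixed cap for brevity; see slide for the stopping rule
            # Step 2: fit a restoration function on each partition.
            coefs = []
            for k in range(K):
                part = [(T, f) for T, f, l in zip(itemsets, supports, labels)
                        if l == k]
                if not part:  # empty partition: fall back to a trivial fit
                    coefs.append(np.zeros(len(items) + 1))
                    continue
                Ts, fs = zip(*part)
                coefs.append(fit_params(Ts, fs, items))
            # Step 3: re-assign each itemset to its best partition.
            labels = [min(range(K), key=lambda k: sq_error(T, f, coefs[k], items))
                      for T, f in zip(itemsets, supports)]
        return labels, coefs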
11 Tree Regression
- Figure: a tree that recursively partitions an example collection S of itemsets over the items a, b, c, d ({a,b}, {a,c}, {b,c}, {a,d}, {c,d}, {a,b,c}, {a,b,d}, {a,c,d}, ...).
- Use regression to find the optimal parameters for each subset of itemsets.
12 Tree Regression Construction
- A decision-tree-style construction algorithm (a sketch of the split search follows after this list).
- Question 1: How do we find the K subsets of itemsets?
- Question 2: How do we find the optimal split?
- Answer to Q1: Maintain a queue of the current leaf nodes, and always pick the leaf node with the maximal average restoration error to split.
- Answer to Q2: Choose the split that maximally reduces the total restoration error, i.e., that minimizes E(S_1) + E(S_2) over the resulting subsets S_1 and S_2 (equivalently, maximizes E(S) - E(S_1) - E(S_2)).
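A sketch of the split search, assuming (as a simplification, not necessarily the paper's choice) that each candidate split partitions a node's itemsets by the presence or absence of a single item; it reuses the fit_params and sq_error helpers from the K-Regression sketch.

    def subset_error(itemsets, supports, items):
        # E(S'): fit one restoration function on a subset and return
        # its total squared restoration error.
        coef = fit_params(itemsets, supports, items)
        return sum(sq_error(T, f, coef, items)
                   for T, f in zip(itemsets, supports))

    def best_split(itemsets, supports, items):
        # Try each item as a presence/absence split and keep the one
        # minimizing E(S_1) + E(S_2), i.e. the maximal error reduction.
        best = None
        for a in items:
            S1 = [(T, f) for T, f in zip(itemsets, supports) if a in T]
            S2 = [(T, f) for T, f in zip(itemsets, supports) if a not in T]
            if not S1 or not S2:
                continue
            err = sum(subset_error(*zip(*part), items) for part in (S1, S2))
            if best is None or err < best[0]:
                best = (err, a, S1, S2)
        return best  # (error, split item, S_1, S_2)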
13 An Interesting Connection
- Jerome H. Friedman's 1977 paper, "A tree-structured approach to nonparametric multiple regression."
- Unfortunately, this work never seems to have received enough attention. However, it appears to have been part of the inspiration for CART (regression trees) and MARS (Multivariate Adaptive Regression Splines).
14 Experimental Results
15 Chess: Restoration Error
16 BMS-POS: Restoration Error
17 BMS-POS: Running Time
18 Conclusion
- Using linear regression to identify the optimal parameters of the probabilistic restoration function (based on the independence assumption) for a set of itemsets.
- Two heuristic algorithms to partition the set of itemsets into K parts:
  - K-Regression
  - Tree Regression
19 Thanks!!