Title: Improving Data Mining Utility with Projective Sampling
Slide 1: Improving Data Mining Utility with Projective Sampling
- Mark Last
- Department of Information Systems Engineering
- Ben-Gurion University of the Negev, Beer-Sheva, Israel
- E-mail: mlast@bgu.ac.il
- Home Page: http://www.bgu.ac.il/mlast/
Slide 2: Agenda
- Introduction
- Learning Curves and Progressive Sampling
- The Projective Sampling Strategy
- Empirical Results
- Conclusions and Future Research
Slide 3: Motivation - Data is not born free
- The training data is often scarce and costly
- Real-world examples:
  - A limited number of patient records stored by a hospital
  - Results of a costly engineering experiment
  - Seasonal records in an agricultural database
- Even when the raw data is free, its preparation may still be labor-intensive!
- Critical question:
  - Should we spend our resources (time and/or money) on acquiring more examples?
Slide 4: Total Cost of the Classification Process (based on Weiss and Tian, 2008)
[Diagram: the Training Set is used to induce the classification model; the Score Set holds future examples to be classified by the model]
- Total Cost = n*C_tr + err(n)*|S|*C_err + CPU(n)*C_time (sketched in code below)
- C_tr: cost of acquiring and labeling each new training example
- C_err: cost of each misclassified example from the score set
- C_time: cost per one unit of CPU time
- n: number of training set examples used to induce the model
- S: the score set of future examples to be classified by the model (|S| is its size)
- err(n): the model error rate measured on the score set
- CPU(n): CPU time required to induce the model
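A minimal sketch of this cost model in Python (the function and parameter names are mine, not from the slides):

```python
def total_cost(n, err_n, cpu_n, c_tr, c_err, c_time, s_size):
    """Total cost of the classification process (Slide 4's formula).

    n      -- number of training examples
    err_n  -- model error rate err(n) on the score set
    cpu_n  -- CPU time CPU(n) required to induce the model
    s_size -- |S|, the number of score-set examples
    """
    acquisition_cost = n * c_tr                  # acquiring and labeling n examples
    error_cost = err_n * s_size * c_err          # misclassified score-set examples
    induction_cost = cpu_n * c_time              # CPU time to induce the model
    return acquisition_cost + error_cost + induction_cost
```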
Slide 5: What is this research about?
- Problem statement:
  - Find the best training set size n* that is expected to maximize the overall utility (i.e., minimize the Total Cost)
- Basic idea: Projective Sampling
  - Estimate the optimal training set size using learning and run-time curves projected from a small subset of the potentially available data
- Research objectives:
  - Calculate the optimal training set size for a variety of learning curve equations (with and without CPU costs)
  - Improve the utility of the data mining process using the best-fitting curves for a given dataset and algorithm
Slide 6: Some Learning Curves for a Decision-Tree Algorithm
[Six learning-curve plots, annotated: slow rise; rapid rise with oscillations; rapid rise; plateau; rapid rise; slow rise]
Slide 7: The Best Fit for a Learning Curve
- Frey and Fisher (1999):
  - The power law is the best fit for modeling the C4.5 error rates
- Last (2007):
  - The power law is the best fit for modeling the error rates of an oblivious decision-tree algorithm (Information Network)
- Singh (2005):
  - The power law is only second best to logarithmic regression for ID3, k-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks
Slide 8: Progressive Sampling Strategy (Provost et al., 1999; Weiss and Tian, 2008)
- General strategy:
  - Start with some initial amount of training data n_0
  - Iteratively increase the training set until there is an increase in total cost
- Popular schedules (sketched below):
  - Uniform (arithmetic) sampling: n_0, n_0 + Δ, n_0 + 2Δ, ...
  - Geometric sampling: n_0, a*n_0, a^2*n_0, ...
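A quick sketch of the two schedules as Python generators (Δ and a as on the slide; the function names are mine):

```python
def uniform_schedule(n0, delta, n_max):
    """Arithmetic schedule: n0, n0 + Δ, n0 + 2Δ, ..."""
    n = n0
    while n <= n_max:
        yield n
        n += delta

def geometric_schedule(n0, a, n_max):
    """Geometric schedule: n0, a*n0, a^2*n0, ..."""
    n = n0
    while n <= n_max:
        yield round(n)
        n *= a

# Example: list(geometric_schedule(100, 2, 1000)) -> [100, 200, 400, 800]
```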
Slide 9: Limitations of Progressive Sampling
- Overfitting to local perturbations in the error rate
- Progressive sampling costs may exceed the optimal ones by 10-200% (Weiss and Tian, 2008)
- Potential overhead associated with purchasing and pre-processing each sampling increment (especially with uniform sampling)
- Our expectation:
  - The projective sampling strategy should reduce data mining costs by estimating the optimal training set size from a small subset of the potentially available data
Slide 10: The Projective Sampling Strategy
- Set a fixed sampling increment Δ
- Each acquired sample = one data point
- Do (sketched in code below):
  - Acquire a new data point
  - Compute Pearson's correlation coefficient for each candidate fitting function (given at least three data points)
    - Dependent variable: err(n)
    - Independent variable: training set size n
  - Find the function with the minimal correlation coefficient, Best_Corr
    - Why minimal? Because the error rate decreases with n, the best-fitting function is the one with the strongest negative correlation
- While ((Best_Corr >= 0) and (n < n_max))
- Estimate the regression coefficients of the selected function
- Estimate the optimal training set size n*
- Induce the classification model M(n*) from n* examples
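A condensed Python sketch of this loop. The candidate-curve encoding and function names are mine, scipy.stats.pearsonr is assumed for the correlation, and I read the slide's exit condition as "keep sampling while the best correlation is still non-negative":

```python
import math
from scipy.stats import pearsonr

# Each candidate curve is represented by the transformation that linearizes it
# (Slide 12): name -> function mapping a raw (n, err) pair to a linear (x, y) pair.
CANDIDATES = {
    "logarithmic": lambda n, e: (math.log(n), e),            # err = a + b*log(n)
    "weiss_tian":  lambda n, e: (n, e * (n + 1)),            # err = (a + b*n)/(n + 1)
    "power_law":   lambda n, e: (math.log(n), math.log(e)),  # err = a*n^b (err > 0)
    "exponential": lambda n, e: (n, math.log(e)),            # err = a*b^n (err > 0)
}

def projective_sampling(measure_err, delta, n_max):
    """Acquire Δ-sized samples until the best candidate curve shows a
    decreasing trend (negative Pearson correlation), then return it."""
    points, n = [], 0
    while True:
        n += delta
        points.append((n, measure_err(n)))  # train on n examples, record err(n)
        if len(points) < 3:                 # correlations need >= 3 data points
            continue
        best_name, best_corr = None, float("inf")
        for name, linearize in CANDIDATES.items():
            xs, ys = zip(*(linearize(ni, ei) for ni, ei in points))
            r, _ = pearsonr(xs, ys)
            if r < best_corr:               # minimal (most negative) r wins
                best_name, best_corr = name, r
        if best_corr < 0 or n >= n_max:     # exit: trend found or data exhausted
            return best_name, points
```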
Slide 11: Candidate Fitting Functions
- Learning curves (transcribed in code below):
  - Logarithmic: err_Log(n) = a + b*log(n)
  - Weiss and Tian: err_WT(n) = (a + b*n) / (n + 1)
  - Power law: err_PL(n) = a*n^b
  - Exponential: err_Exp(n) = a*b^n
- Run-time curves:
  - Linear: CPU_L(n) = d*n
  - Power law: CPU_PL(n) = c*n^d
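A direct Python transcription of these curves (parameter names follow the slide; the comments on parameter ranges are my assumptions for a falling error curve):

```python
import math

def err_log(n, a, b):  return a + b * math.log(n)    # logarithmic
def err_wt(n, a, b):   return (a + b * n) / (n + 1)  # Weiss and Tian
def err_pl(n, a, b):   return a * n ** b             # power law (b < 0 for a falling curve)
def err_exp(n, a, b):  return a * b ** n             # exponential (0 < b < 1)

def cpu_linear(n, d):  return d * n                  # linear run-time curve
def cpu_pl(n, c, d):   return c * n ** d             # power-law run-time curve
```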
Slide 12: Converting Learning Curves into the Linear Form y = a + b*x
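The slide's conversion table did not survive extraction; the linearizations follow directly from the curve definitions on Slide 11 (a reconstruction; in the last two rows a and b are recovered as exponentials of the fitted intercept and slope):

Curve         Linear form                     x        y
Logarithmic   err = a + b*log(n)              log(n)   err
Weiss-Tian    err*(n+1) = a + b*n             n        err*(n+1)
Power law     log(err) = log(a) + b*log(n)    log(n)   log(err)
Exponential   log(err) = log(a) + log(b)*n    n        log(err)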
Slide 13: Pearson's Correlation Coefficient
- r = (k*Σxy - Σx*Σy) / sqrt((k*Σx^2 - (Σx)^2) * (k*Σy^2 - (Σy)^2))
- k = number of data points
Slide 14: Linear Regression Coefficients for y = a + b*x
- The least squares estimate of the slope:
  - b = (k*Σxy - Σx*Σy) / (k*Σx^2 - (Σx)^2)
- The least squares estimate of the intercept:
  - a = (Σy - b*Σx) / k
- k = number of data points (both estimates appear in the sketch below)
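These two slides reduce to a few lines of arithmetic; a self-contained sketch using the summation formulas above:

```python
import math

def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x,
    plus Pearson's correlation coefficient r."""
    k = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    a = (sy - b * sx) / k
    r = (k * sxy - sx * sy) / math.sqrt((k * sxx - sx * sx) * (k * syy - sy * sy))
    return a, b, r
```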
Slide 15: Total Cost Functions
- Total Cost_Log(n) = n*C_tr + d*n*C_time + |S|*C_err*(a + b*log(n))
- Total Cost_WT(n) = n*C_tr + d*n*C_time + |S|*C_err*((a + b*n) / (n + 1))
- Total Cost_PL(n) = n*C_tr + d*n*C_time + |S|*C_err*a*n^b
- Total Cost_Exp(n) = n*C_tr + d*n*C_time + |S|*C_err*a*b^n
Slide 16: Optimizing the Training Set Size
- Let:
  - R = C_err / C_tr
  - C_tr = 1
  - CPU_L(n) = d*n
- Logarithmic:
  - Total Cost_Log(n) = n + d*n*C_time + |S|*R*(a + b*log(n))
- Weiss and Tian:
  - Total Cost_WT(n) = n + d*n*C_time + |S|*R*((a + b*n) / (n + 1))
- Power law:
  - Total Cost_PL(n) = n + d*n*C_time + |S|*R*a*n^b
- Exponential:
  - Total Cost_Exp(n) = n + d*n*C_time + |S|*R*a*b^n
- The optimal n* is found by setting the derivative of the total cost to zero (see the sketch below)
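The slides list the cost expressions but not the resulting optima; a derivation sketch for the two cases with clean closed forms (my algebra, assuming natural log and b < 0 so the error curve decreases):

```latex
% Logarithmic: set the derivative of the total cost to zero
% d/dn [ n + d n C_{time} + |S| R (a + b \ln n) ] = 1 + d\,C_{time} + |S| R\, b / n = 0
n^{*}_{Log} = \frac{-\,|S|\, R\, b}{1 + d\, C_{time}}

% Power law: 1 + d\,C_{time} + |S| R\, a\, b\, n^{\,b-1} = 0
n^{*}_{PL} = \left( \frac{-(1 + d\, C_{time})}{|S|\, R\, a\, b} \right)^{1/(b-1)}
```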
Slide 17: Experimental Settings
- Ten benchmark datasets (see next slide)
- Each dataset was randomly partitioned into 25-50% test examples and 50-75% examples potentially available for training
- The sampling increment Δ was set to 1% of the maximum possible training set size
- The error rate of each increment was averaged over 10 random partitions of the training set
- Sampling schedules: Uniform, Geometric (a = 2), Straw Man, Projective, Optimal
- Cost ratios (R): 1 to 50,000
- CPU factors: 0 and 1 (per one millisecond of CPU time)
Slide 18: Datasets Description
Slide 19: Projected Fitting Functions
Slide 20: Projected and Actual Learning Curves - Small Datasets
Slide 21: Projected and Actual Learning Curves - Medium and Large Datasets
Slide 22: Comparison of Sampling Schedules
- R = C_err / C_tr
Slide 23: Detailed Sampling Schedules without Induction Costs - Small Datasets
[Charts; legend groups: Uniform | Geometric, Straw Man, Projected, Optimal]
Slide 24: Detailed Sampling Schedules without Induction Costs - Medium and Large Datasets
[Charts; legend groups: Geometric | Uniform, Straw Man, Projected, Optimal | Geometric | Optimal | Uniform, Straw Man, Projected]
Slide 25: Conclusions
- The projective sampling strategy estimates the optimal training set size by fitting an analytical function to a partial learning curve
- The proposed methodology was evaluated on 10 benchmark datasets of variable size using a decision-tree algorithm
- The results show that, under negligible induction costs and high data acquisition costs, projective sampling outperforms the alternative progressive sampling techniques on average
Slide 26: Future Research
- Further optimization of projective sampling schedules, especially under substantial CPU costs
- Improving the utility of cost-sensitive data mining algorithms
- Modeling learning curves for non-random (active) sampling and labeling techniques
Slide 27: Thank you!
Thank you very much!