Title: Improving Data Mining Utility with Projective Sampling
Slide 1: Improving Data Mining Utility with Projective Sampling
- Mark Last
- Department of Information Systems Engineering
- Ben-Gurion University of the Negev, Beer-Sheva, Israel
- E-mail: mlast@bgu.ac.il
- Home Page: http://www.bgu.ac.il/mlast/
Slide 2: Agenda
- Introduction
- Learning Curves and Progressive Sampling
- The Projective Sampling Strategy
- Empirical Results
- Conclusions and Future Research
Slide 3: Motivation - Data is not born free
- The training data is often scarce and costly
- Real-world examples:
  - A limited number of patient records stored by a hospital
  - Results of a costly engineering experiment
  - Seasonal records in an agricultural database
- Even when the raw data is free, its preparation may still be labor-intensive!
- Critical question:
  - Should we spend our resources (time and/or money) on acquiring more examples?
Slide 4: Total Cost of the Classification Process (based on Weiss and Tian, 2008)
[Diagram: the Training Set is used to induce the classification model; the Score Set holds future examples to be classified by the model]
- Total Cost = n*C_tr + err(n)*|S|*C_err + CPU(n)*C_time (sketched in code below)
- C_tr: cost of acquiring and labeling each new training example
- C_err: cost of each misclassified example from the score set
- C_time: cost per one unit of CPU time
- n: number of training set examples used to induce the model
- S: the score set of future examples to be classified by the model (|S| is its size)
- err(n): the model error rate measured on the score set
- CPU(n): CPU time required to induce the model
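A minimal sketch of this cost model in Python (the function and parameter names are mine, not from the slides):

```python
def total_cost(n, err_n, cpu_n, c_tr, c_err, c_time, s_size):
    """Total cost of the classification process (Slide 4's formula).

    n      -- number of training examples
    err_n  -- model error rate err(n) on the score set
    cpu_n  -- CPU time CPU(n) required to induce the model
    s_size -- |S|, the number of score-set examples
    """
    acquisition_cost = n * c_tr                  # acquiring and labeling n examples
    error_cost = err_n * s_size * c_err          # misclassified score-set examples
    induction_cost = cpu_n * c_time              # CPU time to induce the model
    return acquisition_cost + error_cost + induction_cost
```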
Slide 5: What is this research about?
- Problem statement:
  - Find the best training set size n* that is expected to maximize the overall utility (i.e., minimize the Total Cost)
- Basic idea: Projective Sampling
  - Estimate the optimal training set size using learning and run-time curves projected from a small subset of the potentially available data
- Research objectives:
  - Calculate the optimal training set size for a variety of learning curve equations (with and without CPU costs)
  - Improve the utility of the data mining process using the best-fitting curves for a given dataset and algorithm
Slide 6: Some Learning Curves for a Decision-Tree Algorithm
[Six learning-curve plots, annotated: slow rise; rapid rise with oscillations; rapid rise; plateau; rapid rise; slow rise]
Slide 7: The Best Fit for a Learning Curve
- Frey and Fisher (1999):
  - The power law is the best fit for modeling the C4.5 error rates
- Last (2007):
  - The power law is the best fit for modeling the error rates of an oblivious decision-tree algorithm (Information Network)
- Singh (2005):
  - The power law is only second best to logarithmic regression for ID3, k-Nearest Neighbors, Support Vector Machines, and Artificial Neural Networks
Slide 8: Progressive Sampling Strategy (Provost et al., 1999; Weiss and Tian, 2008)
- General strategy:
  - Start with some initial amount of training data n_0
  - Iteratively increase the training set until there is an increase in total cost
- Popular schedules (sketched below):
  - Uniform (arithmetic) sampling: n_0, n_0 + Δ, n_0 + 2Δ, ...
  - Geometric sampling: n_0, a*n_0, a^2*n_0, ...
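A quick sketch of the two schedules as Python generators (Δ and a as on the slide; the function names are mine):

```python
def uniform_schedule(n0, delta, n_max):
    """Arithmetic schedule: n0, n0 + Δ, n0 + 2Δ, ..."""
    n = n0
    while n <= n_max:
        yield n
        n += delta

def geometric_schedule(n0, a, n_max):
    """Geometric schedule: n0, a*n0, a^2*n0, ..."""
    n = n0
    while n <= n_max:
        yield round(n)
        n *= a

# Example: list(geometric_schedule(100, 2, 1000)) -> [100, 200, 400, 800]
```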
Slide 9: Limitations of Progressive Sampling
- Overfitting to local perturbations in the error rate
- Progressive sampling costs may exceed the optimal ones by 10-200% (Weiss and Tian, 2008)
- Potential overhead associated with purchasing and pre-processing each sampling increment (especially with uniform sampling)
- Our expectation:
  - The projective sampling strategy should reduce data mining costs by estimating the optimal training set size from a small subset of the potentially available data
Slide 10: The Projective Sampling Strategy
- Set a fixed sampling increment Δ
- Each acquired sample = one data point
- Do (sketched in code below):
  - Acquire a new data point
  - Compute Pearson's correlation coefficient for each candidate fitting function (given at least three data points)
    - Dependent variable: err(n)
    - Independent variable: training set size n
  - Find the function with the minimal correlation coefficient, Best_Corr
    - Why minimal? Because the error rate decreases with n, the best-fitting function is the one with the strongest negative correlation
- While ((Best_Corr >= 0) and (n < n_max))
- Estimate the regression coefficients of the selected function
- Estimate the optimal training set size n*
- Induce the classification model M(n*) from n* examples
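A condensed Python sketch of this loop. The candidate-curve encoding and function names are mine, scipy.stats.pearsonr is assumed for the correlation, and I read the slide's exit condition as "keep sampling while the best correlation is still non-negative":

```python
import math
from scipy.stats import pearsonr

# Each candidate curve is represented by the transformation that linearizes it
# (Slide 12): name -> function mapping a raw (n, err) pair to a linear (x, y) pair.
CANDIDATES = {
    "logarithmic": lambda n, e: (math.log(n), e),            # err = a + b*log(n)
    "weiss_tian":  lambda n, e: (n, e * (n + 1)),            # err = (a + b*n)/(n + 1)
    "power_law":   lambda n, e: (math.log(n), math.log(e)),  # err = a*n^b (err > 0)
    "exponential": lambda n, e: (n, math.log(e)),            # err = a*b^n (err > 0)
}

def projective_sampling(measure_err, delta, n_max):
    """Acquire Δ-sized samples until the best candidate curve shows a
    decreasing trend (negative Pearson correlation), then return it."""
    points, n = [], 0
    while True:
        n += delta
        points.append((n, measure_err(n)))  # train on n examples, record err(n)
        if len(points) < 3:                 # correlations need >= 3 data points
            continue
        best_name, best_corr = None, float("inf")
        for name, linearize in CANDIDATES.items():
            xs, ys = zip(*(linearize(ni, ei) for ni, ei in points))
            r, _ = pearsonr(xs, ys)
            if r < best_corr:               # minimal (most negative) r wins
                best_name, best_corr = name, r
        if best_corr < 0 or n >= n_max:     # exit: trend found or data exhausted
            return best_name, points
```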
Slide 11: Candidate Fitting Functions
- Learning curves (transcribed in code below):
  - Logarithmic: err_Log(n) = a + b*log(n)
  - Weiss and Tian: err_WT(n) = (a + b*n) / (n + 1)
  - Power law: err_PL(n) = a*n^b
  - Exponential: err_Exp(n) = a*b^n
- Run-time curves:
  - Linear: CPU_L(n) = d*n
  - Power law: CPU_PL(n) = c*n^d
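A direct Python transcription of these curves (parameter names follow the slide; the comments on parameter ranges are my assumptions for a falling error curve):

```python
import math

def err_log(n, a, b):  return a + b * math.log(n)    # logarithmic
def err_wt(n, a, b):   return (a + b * n) / (n + 1)  # Weiss and Tian
def err_pl(n, a, b):   return a * n ** b             # power law (b < 0 for a falling curve)
def err_exp(n, a, b):  return a * b ** n             # exponential (0 < b < 1)

def cpu_linear(n, d):  return d * n                  # linear run-time curve
def cpu_pl(n, c, d):   return c * n ** d             # power-law run-time curve
```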
Slide 12: Converting Learning Curves into the Linear Form y = a + b*x
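The slide's conversion table did not survive extraction; the linearizations follow directly from the curve definitions on Slide 11 (a reconstruction; in the last two rows a and b are recovered as exponentials of the fitted intercept and slope):

Curve         Linear form                     x        y
Logarithmic   err = a + b*log(n)              log(n)   err
Weiss-Tian    err*(n+1) = a + b*n             n        err*(n+1)
Power law     log(err) = log(a) + b*log(n)    log(n)   log(err)
Exponential   log(err) = log(a) + log(b)*n    n        log(err)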
Slide 13: Pearson's Correlation Coefficient
- r = (k*Σxy - Σx*Σy) / sqrt((k*Σx^2 - (Σx)^2) * (k*Σy^2 - (Σy)^2))
- k = number of data points
Slide 14: Linear Regression Coefficients for y = a + b*x
- The least squares estimate of the slope:
  - b = (k*Σxy - Σx*Σy) / (k*Σx^2 - (Σx)^2)
- The least squares estimate of the intercept:
  - a = (Σy - b*Σx) / k
- k = number of data points (both estimates appear in the sketch below)
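These two slides reduce to a few lines of arithmetic; a self-contained sketch using the summation formulas above:

```python
import math

def fit_line(xs, ys):
    """Least-squares slope b and intercept a for y = a + b*x,
    plus Pearson's correlation coefficient r."""
    k = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    b = (k * sxy - sx * sy) / (k * sxx - sx * sx)
    a = (sy - b * sx) / k
    r = (k * sxy - sx * sy) / math.sqrt((k * sxx - sx * sx) * (k * syy - sy * sy))
    return a, b, r
```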
Slide 15: Total Cost Functions
- Total Cost_Log(n) = n*C_tr + d*n*C_time + |S|*C_err*(a + b*log(n))
- Total Cost_WT(n) = n*C_tr + d*n*C_time + |S|*C_err*((a + b*n) / (n + 1))
- Total Cost_PL(n) = n*C_tr + d*n*C_time + |S|*C_err*a*n^b
- Total Cost_Exp(n) = n*C_tr + d*n*C_time + |S|*C_err*a*b^n
Slide 16: Optimizing the Training Set Size
- Let:
  - R = C_err / C_tr
  - C_tr = 1
  - CPU_L(n) = d*n
- Logarithmic:
  - Total Cost_Log(n) = n + d*n*C_time + |S|*R*(a + b*log(n))
- Weiss and Tian:
  - Total Cost_WT(n) = n + d*n*C_time + |S|*R*((a + b*n) / (n + 1))
- Power law:
  - Total Cost_PL(n) = n + d*n*C_time + |S|*R*a*n^b
- Exponential:
  - Total Cost_Exp(n) = n + d*n*C_time + |S|*R*a*b^n
- The optimal n* is found by setting the derivative of the total cost to zero (see the sketch below)
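The slides list the cost expressions but not the resulting optima; a derivation sketch for the two cases with clean closed forms (my algebra, assuming natural log and b < 0 so the error curve decreases):

```latex
% Logarithmic: set the derivative of the total cost to zero
% d/dn [ n + d n C_{time} + |S| R (a + b \ln n) ] = 1 + d\,C_{time} + |S| R\, b / n = 0
n^{*}_{Log} = \frac{-\,|S|\, R\, b}{1 + d\, C_{time}}

% Power law: 1 + d\,C_{time} + |S| R\, a\, b\, n^{\,b-1} = 0
n^{*}_{PL} = \left( \frac{-(1 + d\, C_{time})}{|S|\, R\, a\, b} \right)^{1/(b-1)}
```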
Slide 17: Experimental Settings
- Ten benchmark datasets (see next slide)
- Each dataset was randomly partitioned into 25-50% test examples and 50-75% examples potentially available for training
- The sampling increment Δ was set to 1% of the maximum possible training set size
- The error rate of each increment was averaged over 10 random partitions of the training set
- Sampling schedules: Uniform, Geometric (a = 2), Straw Man, Projective, Optimal
- Cost ratios (R): 1 to 50,000
- CPU factors: 0 and 1 (per one millisecond of CPU time)
Slide 18: Datasets Description
Slide 19: Projected Fitting Functions
Slide 20: Projected and Actual Learning Curves - Small Datasets
Slide 21: Projected and Actual Learning Curves - Medium and Large Datasets
Slide 22: Comparison of Sampling Schedules
- R = C_err / C_tr
Slide 23: Detailed Sampling Schedules without Induction Costs - Small Datasets
[Charts; legend groups: Uniform | Geometric, Straw Man, Projected, Optimal]
Slide 24: Detailed Sampling Schedules without Induction Costs - Medium and Large Datasets
[Charts; legend groups: Geometric | Uniform, Straw Man, Projected, Optimal | Geometric | Optimal | Uniform, Straw Man, Projected]
Slide 25: Conclusions
- The projective sampling strategy estimates the optimal training set size by fitting an analytical function to a partial learning curve
- The proposed methodology was evaluated on 10 benchmark datasets of variable size using a decision-tree algorithm
- The results show that, under negligible induction costs and high data acquisition costs, projective sampling outperforms the alternative progressive sampling techniques on average
Slide 26: Future Research
- Further optimization of projective sampling schedules, especially under substantial CPU costs
- Improving the utility of cost-sensitive data mining algorithms
- Modeling learning curves for non-random (active) sampling and labeling techniques
Slide 27: Thank you!
Thank you very much!