1
Improving Data Mining Utility with Projective
Sampling
  • Mark Last
  • Department of Information Systems Engineering
  • Ben-Gurion University of the Negev, Beer-Sheva,
    Israel

E-mail: mlast@bgu.ac.il
Home Page: http://www.bgu.ac.il/mlast/
2
Agenda
  • Introduction
  • Learning Curves and Progressive Sampling
  • The Projective Sampling Strategy
  • Empirical Results
  • Conclusions and Future Research

3
Motivation: Data is not born free
  • The training data is often scarce and costly
  • Real-world examples
  • A limited number of patient records stored by a
    hospital
  • Results of a costly engineering experiment
  • Seasonal records in an agricultural database
  • Even when the raw data is free, its preparation
    may still be labor intensive!
  • Critical question
  • Should we spend our resources (time and/or money)
    on acquiring more examples?

4
Total Cost of the Classification Process (based
on Weiss and Tian, 2008)
[Diagram: the Training Set is used to induce the
classification model; the Score Set contains future
examples to be classified by the model]
  • Total Cost = n·Ctr + err(n)·S·Cerr + CPU(n)·Ctime
  • Ctr - cost of acquiring and labeling each new
    training example
  • Cerr - cost of each misclassified example from
    the score set
  • Ctime - cost per one unit of CPU time
  • n - number of training set examples used to
    induce the model
  • S - the number of future examples in the score
    set to be classified by the model
  • err(n) - the model error rate measured on the
    score set
  • CPU(n) - the CPU time required to induce the model
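To make the formula concrete, here is a minimal Python sketch; the function name total_cost and the example numbers are illustrative, not from the presentation:

```python
# Total Cost = n*C_tr + err(n)*S*C_err + CPU(n)*C_time
def total_cost(n, c_tr, err_n, s, c_err, cpu_n, c_time):
    acquisition = n * c_tr                  # buying and labeling n examples
    misclassification = err_n * s * c_err   # expected errors on the score set
    induction = cpu_n * c_time              # CPU cost of inducing the model
    return acquisition + misclassification + induction

# Example: 1,000 training examples at cost 1 each, 5% error on a
# 10,000-example score set at cost 10 per error, negligible CPU cost:
print(total_cost(1000, 1, 0.05, 10000, 10, 0, 0))  # -> 6000.0
```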

5
What is this research about?
  • Problem Statement
  • Find the best training set size n that is
    expected to maximize the overall utility
    (minimize the Total Cost)
  • Basic Idea - Projective Sampling
  • Estimate the optimal training set size using
    learning and run-time curves projected from a
    small subset of potentially available data
  • Research Objectives
  • Calculate the optimal training set size for a
    variety of learning curve equations (with and
    without CPU costs)
  • Improve the utility of the data mining process
    using the best-fitting curves for a given dataset
    and algorithm

6
Some Learning Curves for a Decision-Tree Algorithm
[Learning-curve plots, annotated: rapid rise, rapid
rise with oscillations, slow rise, plateau]
7
The Best Fit for a Learning Curve
  • Frey and Fisher (1999)
  • The power law is the best fit for modeling the
    C4.5 error rates
  • Last (2007)
  • The power law is the best fit for modeling the
    error rates of an oblivious decision-tree
    algorithm (Information Network)
  • Singh (2005)
  • The power law is only second best to the
    logarithmic regression for ID3, k-Nearest
    Neighbors, Support Vector Machines, and
    Artificial Neural Networks

8
Progressive Sampling Strategy (Provost et al.,
1999; Weiss and Tian, 2008)
  • General strategy
  • Start with some initial amount of training data
    n0
  • Iteratively increase the training set until there
    is an increase in total cost
  • Popular schedules
  • Uniform (arithmetic) sampling
  • n0, n0 + Δ, n0 + 2Δ, …
  • Geometric Sampling
  • n0, a·n0, a²·n0, …
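For example, with n0 = 100 and Δ = 100, the uniform schedule evaluates 100, 200, 300, 400, …, while the geometric schedule with a = 2 evaluates 100, 200, 400, 800, …, reaching large training sets in far fewer iterations.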

9
Limitations of Progressive Sampling
  • Overfitting some local perturbations in the error
    rate
  • Progressive sampling costs may exceed the optimal
    ones by 10%-200% (Weiss and Tian, 2008)
  • Potential overhead associated with purchasing and
    pre-processing each sampling increment
    (especially with uniform sampling).
  • Our expectation
  • The projective sampling strategy should reduce
    data mining costs by estimating the optimal
    training set size from a small subset of
    potentially available data

10
The Projective Sampling Strategy
  • Set a fixed sampling increment Δ
  • Each acquired sample = one data point
  • Do
  • Acquire a new data point
  • Compute Pearson's correlation coefficient for
    each candidate fitting function (given at least
    three data points)
  • Dependent variable: err(n)
  • Independent variable: training set size n
  • Find the function with the minimal correlation
    coefficient Best_Corr
  • Why minimal? Because the error rate should
    decrease with n, the best-fitting function is the
    one with the strongest negative correlation
  • While ((Best_Corr ≥ 0) and (n < nmax))
  • Estimate the regression coefficients of the
    selected function
  • Estimate the optimal training set size n*
  • Induce the classification model M(n*) from n*
    examples
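A minimal Python sketch of this loop; acquire_err(n) is a hypothetical helper that acquires examples up to a training set of size n, induces a model, and returns err(n), and the linearized candidate forms anticipate the next two slides:

```python
import numpy as np

def projective_sampling(acquire_err, delta, n_max):
    # Linearized forms y = a + b*x of the candidate learning curves:
    # each entry maps (n, err) to the (x, y) pair used for the linear fit.
    candidates = {
        "logarithmic": lambda n, e: (np.log(n), e),
        "weiss_tian":  lambda n, e: (n / (n + 1.0), e),
        "power_law":   lambda n, e: (np.log(n), np.log(e)),
        "exponential": lambda n, e: (n, np.log(e)),
    }
    ns, errs = [], []
    while True:
        n = (ns[-1] if ns else 0) + delta       # acquire a new data point
        ns.append(n)
        errs.append(acquire_err(n))
        if len(ns) < 3:
            continue                            # need >= 3 points to correlate
        n_arr, e_arr = np.array(ns, float), np.array(errs, float)
        best_name, best_corr = None, np.inf
        for name, transform in candidates.items():
            x, y = transform(n_arr, e_arr)
            r = np.corrcoef(x, y)[0, 1]         # Pearson's r on the linear form
            if r < best_corr:                   # minimal (most negative) r wins
                best_name, best_corr = name, r
        if best_corr < 0 or n >= n_max:         # While (Best_Corr >= 0 and n < n_max)
            return best_name, n_arr, e_arr
```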

11
Candidate Fitting Functions
  • Learning Curves
  • Logarithmic: errLog(n) = a + b·log n
  • Weiss and Tian: errWT(n) = a + b·n / (n + 1)
  • Power Law: errPL(n) = a·n^b
  • Exponential: errExp(n) = a·b^n
  • Run-time Curves
  • Linear: CPUL(n) = d·n
  • Power law: CPUPL(n) = c·n^d

12
Converting Learning Curves into the Linear Form
y = a + b·x
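The transforms follow directly from the candidate
functions on the previous slide:
  • Logarithmic: already linear with x = log n,
    y = err(n)
  • Weiss and Tian: x = n / (n + 1), y = err(n)
  • Power law: log err(n) = log a + b·log n, so
    x = log n, y = log err(n), and the intercept is
    log a
  • Exponential: log err(n) = log a + n·log b, so
    x = n, y = log err(n), the intercept is log a,
    and the slope is log b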
13
Pearson's Correlation Coefficient
k = number of data points
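Written out for the k data points (xi, yi), the standard computational form is
r = (k·Σxiyi − Σxi·Σyi) / √[(k·Σxi² − (Σxi)²) · (k·Σyi² − (Σyi)²)]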
14
Linear Regression Coefficients y a bx
  • The least squares estimate of the slope
  • The least squares estimate of the intercept
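Written out, the least-squares estimates are
b = (k·Σxiyi − Σxi·Σyi) / (k·Σxi² − (Σxi)²)
a = (Σyi − b·Σxi) / k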

k = number of data points
15
Total Cost Functions
  • Total CostLog(n) = n·Ctr + d·n·Ctime + S·Cerr·(a + b·log n)
  • Total CostWT(n) = n·Ctr + d·n·Ctime + S·Cerr·(a + b·n / (n + 1))
  • Total CostPL(n) = n·Ctr + d·n·Ctime + S·Cerr·a·n^b
  • Total CostExp(n) = n·Ctr + d·n·Ctime + S·Cerr·a·b^n

16
Optimizing the Training Set Size
  • Let
  • R = Cerr / Ctr
  • Ctr = 1
  • CPUL(n) = d·n
  • Logarithmic
  • Total CostLog(n) = n + d·n·Ctime + S·R·(a + b·log n)
  • Weiss and Tian
  • Total CostWT(n) = n + d·n·Ctime + S·R·(a + b·n / (n + 1))
  • Power Law
  • Total CostPL(n) = n + d·n·Ctime + S·R·a·n^b
  • Exponential
  • Total CostExp(n) = n + d·n·Ctime + S·R·a·b^n
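For example, taking log as the natural logarithm, setting the derivative of the logarithmic total cost to zero gives a closed-form optimum:
1 + d·Ctime + S·R·b/n = 0, so n* = −S·R·b / (1 + d·Ctime)
Since b < 0 for a decreasing learning curve, n* is positive, and it grows with the cost ratio R and the score set size S.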

17
Experimental Settings
  • Ten benchmark datasets (see next slide)
  • Each dataset was randomly partitioned into
    25%-50% of test examples and 50%-75% of examples
    potentially available for training.
  • The sampling increment Δ was set to 1% of the
    maximum possible training set size
  • The error rate of each increment was averaged
    over 10 random partitions of the training set.
  • Sampling schedules: Uniform, Geometric (a = 2),
    Straw Man, Projective, Optimal
  • Cost Ratios (R): 1 - 50,000
  • CPU Factors: 0 and 1 (per one millisecond of CPU
    time)

18
Datasets Description
19
Projected Fitting Functions
20
Projected and Actual Learning Curves: Small
Datasets
21
Projected and Actual Learning Curves: Medium and
Large Datasets
22
Comparison of Sampling Schedules
R = Cerr / Ctr
23
Detailed Sampling Schedules without Induction
Costs: Small Datasets
[Charts: the Uniform schedule is plotted separately
from the Geometric, Straw Man, Projected, and
Optimal schedules]
24
Detailed Sampling Schedules without Induction
Costs: Medium and Large Datasets
[Charts: one panel separates Geometric from the
Uniform, Straw Man, Projected, and Optimal
schedules; another separates Geometric and Optimal
from Uniform, Straw Man, and Projected]
25
Conclusions
  • The projective sampling strategy estimates the
    optimal training set size by fitting an
    analytical function to a partial learning curve
  • The proposed methodology was evaluated on 10
    benchmark datasets of variable size using a
    decision-tree algorithm.
  • The results show that under negligible induction
    costs and high data acquisition costs, projective
    sampling outperforms, on average, the alternative
    progressive sampling techniques.

26
Future Research
  • Further optimization of projective sampling
    schedules, especially under substantial CPU costs
  • Improving utility of cost-sensitive data mining
    algorithms
  • Modeling learning curves for non-random
    (active) sampling and labeling techniques

27
Thank you!
Thank you very much!