Maximizing Classifier Utility when Training Data is Costly - PowerPoint PPT Presentation

About This Presentation

Title:

Maximizing Classifier Utility when Training Data is Costly

Description:

Both types of curves allows the practitioner to understand the ... Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining. ... – PowerPoint PPT presentation

Number of Views:45

Avg rating:3.0/5.0

Slides: 34

Provided by: stormCis

Learn more at: https://storm.cis.fordham.edu

Category:

more less

Transcript and Presenter's Notes

Title: Maximizing Classifier Utility when Training Data is Costly

1
Maximizing Classifier Utility when Training Data
is Costly

Gary M. Weiss
Ye Tian
Fordham University

2
Outline

Introduction
Motivation, cost model
Experimental Methodology
Results
Adult data set
Progressive Sampling
Related Work
Future Work/Conclusion

3
Motivation

Utility-Based Data Mining
Concerned with utility of overall data mining
process
A key cost is the cost of training data
These costs often ignored (except for active
learning)
First ones to analyze the impact of a very simple
cost model
In doing so we fill a hole in existing research
Our cost model
A fixed cost for acquiring labeled training
examples
No separate cost for class labels, missing
features, etc.
Turney1 called this the cost of cases
No control over which training examples chosen
No active learning

4
Motivation (cont.)

Efficient progressive sampling2
Determines optimal training set size
Optimal is where the learning curve reaches a
plateau
Assumes data acquisition costs are essentially
zero
What if the acquisition costs are significant?

5
Motivating Examples

Predicting customer behavior/buying potential
Training data from DB and Ziff-Davis
These and other information vendors make money
by selling information
Poker playing
Learn about an opponent by playing him

6
Outline

Introduction
Motivation, cost model
Experimental Methodology
Results
Adult data set
Progressive Sampling
Related Work
Future Work/Conclusion

7
Experiments

Use C4.5 to determine relationship between
accuracy and training set size
20 runs used to increase reliability of results
Random sampling to reduce training set size
For this talk we focus on adult data set
21,000 examples
We utilize a predetermined sampling schedule
CPU times recorded, mainly for future work

8
Measuring Total Utility

Total cost Data Cost Error Cost
nCtr e S Cerr
n number training examples
e error rate
S number examples in score set
Ctr cost of a training example
Cerr cost of an error
Will know n and e for any experiment
With domain knowledge can estimate Ctr, Cerr, S
But we dont have this knowledge
Treat Ctr and Cerr as parameters and vary them
Assume S 100 with no loss of generality
If S is 100,000 then look at results for
Cerr/1,000

9
Measuring Total Utility (cont.)

Now only look at cost ratio, CtrCerr
Typical values evaluated 11, 11000, etc.
Relative cost ratio is Cerr/Ctr
Example
If cost ratio is 11000 then even trade-off if
buying 1000 training examples eliminates 1 error
Alternatively buying 1000 examples is worth a
1 reduction in error rate (then can ignore S
100)

10
Outline

Introduction
Motivation, cost model
Experimental Methodology
Results
Adult data set
Progressive Sampling
Related Work
Future Work/Conclusion

11
Learning Curve
12
Utility Curves
13
Utility Curves (Normalized Cost)
14
Optimal Training Set Size Curve
15
Value of Optimal Curve

Even without specific cost information, this
chart could be useful for a practitioner
Can put bounds on appropriate training set size
Analogous to Drummond and Holtes cost curves3
They looked at cost ratio of false positives and
negatives
We look at cost ratio of errors vs. cost of data
Both types of curves allows the practitioner to
understand the impact of the various costs

16
Idealized learning curve
17
Outline

Introduction
Motivation, cost model
Experimental Methodology
Results
Adult data set
Progressive Sampling
Related Work
Future Work/Conclusion

18
Progressive Sampling

We want to find the optimal training set size
Need to determine when to stop acquiring data
before acquiring all of it!
Strategy use a progressive sampling strategy
Key issues
When do we stop?
What sampling schedule should we use?

19
Our Progressive Sampling Strategy

We stop after first increase in total cost
Results therefore never optimal, but near-optimal
if learning curve is non-decreasing
We evaluate 2 simple sampling schedules
S1 10, 50, 100, 500, 1000, 2000, , 9000,
10,000, 12,000, 14,000,
S2 50, 100, 200, 400, 800, 1600,
S2 S1 are similar given modest sized data sets
Could use an adaptive strategy

20
Adult Data Set S1 vs. Straw Man
21
Progressive Sampling Conclusions

We can use progressive sampling to determine a
near optimal training set size
Effectiveness mainly based on how well behaved
the learning curve is (i.e., non-decreasing)
Sampling schedule/batch size is also important
Finer granularity requires more CPU time
But if data costly, CPU time most likely less
expensive
In our experiments, cumulative CPU time lt 1 minute

22
Related Work

Efficient progressive sampling2
It tries to efficiently find the asymptote
That work has a data cost of e
Stop only when added data has no benefit
Active Learning
Similar in that data cost is factored in but
setting different
User has control over which examples are selected
or features measured
Does not address simple cost of cases scenario
Find best class distribution when training data
costly4
Assumes training set size limited but size
pre-specified
Finds the best class distribution to maximize
performance

23
Limitations/Future Work

Improvements
Bigger data sets where learning curve plateaus
More sophisticated sampling schemes
Incorporate cost-sensitive learning (cost FP ?
FN)
Generate better behaved learning curves
Include CPU time in utility metric
Analyze other cost models
Study the learning curves
Real world motivating examples
Perhaps with cost information

24
Conclusion

We analyze impact of training data cost on
classification process
Introduce new ways of visualizing the impact of
data cost
Utility curves
Optimal training set size curves
Show that we can use progressive sampling to help
learn a near-optimal classifier

25
We Want Feedback

We are continuing this work
Clearly many minor enhancements possible
Feel free to suggest some more
Any major new directions/extensions?
What if anything is most interesting?
Any really good motivating examples that you are
familiar with

26
Questions?

If I have run out of time, please find me during
the break!!

27
References

P. Turney (2000). Types of cost in inductive
concept learning. Workshop on Cost-Sensitive
Learning at the 17th International Conference on
Machine Learning.
F. Provost, D. Jensen T. Oates (1999).
Proceedings of the 5th International Conference
on Knowledge Discovery and Data Mining.
C. Drummond R. Holte (2000). Explicitly
Representing Expected Cost An Alternative to ROC
Representation. Proceedings of the 6th ACM SIGKDD
International Conference of Knowledge Discovery
and Data Mining, 198-207.
G. Weiss F. Provost (2003). Learning when
Training Data are Costly The Effect of Class
Distribution on Tree Induction, Journal of
Artificial Intelligence Research, 19315-354.

28
Learning Curves for Large Data Sets
29
Optimal Curves for Large Data Sets
30
Learning Curves for Small Data Sets
31
Optimal Curves for Small Data Sets
32
Results for Adult Data Set
33
Optimal vs. S1 for Large Data Sets

Write a Comment

User Comments (0)