Targeted Marketing, KDD Cup and Customer Modeling - PowerPoint PPT Presentation

About This Presentation

Title:

Targeted Marketing, KDD Cup and Customer Modeling

Description:

KDD-CUP 1997 Awards. The GOLD MINER award is jointly shared by two contestants this year ... MineSet used a total of 6 variables in their final model ...

Slides: 41

Provided by: grego122

Transcript and Presenter's Notes

1
Targeted Marketing, KDD Cup and Customer Modeling
2
Outline

Direct Marketing
Review: Evaluation (Lift, Gains)
KDD Cup 1997
Lift and Benefit estimation
Privacy and Data Mining

3
Direct Marketing Paradigm

Find most likely prospects to contact
Not everybody needs to be contacted
Number of targets is usually much smaller than
number of prospects
Typical Applications
retailers, catalogues, direct mail (and e-mail)
customer acquisition, cross-sell, attrition
...

4
Direct Marketing Evaluation

Accuracy on the entire dataset is not the right measure
Approach
develop a target model
score all prospects and rank them by decreasing score
select top P% of prospects for action
Evaluate performance on top P% using Gains and Lift

5
CPH (Gains): Random List vs Model-ranked List
[Chart: Cumulative % Hits vs. % of list]
5% of random list have 5% of targets, but 5% of model-ranked list have 21% of targets
CPH(5%, model) = 21%.
6
Lift Curve
Lift(P) = CPH(P) / P
Lift (at 5%) = 21% / 5% = 4.2 times better than random
P -- percent of the list
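The CPH and Lift definitions can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation code used in the competition; the function names and the toy data are assumptions made for the example.

```python
def cumulative_pct_hits(scores, labels, pct):
    """Fraction of all targets found in the top `pct` of the score-ranked list.

    `labels` are 0/1 target flags; `pct` is a fraction in (0, 1].
    """
    if not 0 < pct <= 1:
        raise ValueError("pct must be in (0, 1]")
    # Rank prospects by decreasing model score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(pct * len(ranked)))
    hits_in_top = sum(label for _, label in ranked[:top_n])
    return hits_in_top / sum(labels)


def lift(scores, labels, pct):
    """Lift(P) = CPH(P) / P."""
    return cumulative_pct_hits(scores, labels, pct) / pct


# Hypothetical toy list: 20 prospects, 4 targets, scores already distinct.
toy_scores = list(range(20, 0, -1))
toy_labels = [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(cumulative_pct_hits(toy_scores, toy_labels, 0.10))
print(lift(toy_scores, toy_labels, 0.10))
```

In the toy list the top 10% (2 prospects) holds 2 of the 4 targets, so the model finds half of the targets in a tenth of the list.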
7
KDD-CUP 1997

Task: given data on past responders to fund-raising, predict most likely responders for new campaign
Population of 750K prospects
10K responded to a broad campaign mailing (1.4% response rate)
Analysis file included a stratified (non-random) sample of 10K responders and 26K non-responders (28.7% response rate)
75% used for learning, 25% used for validation
target variable removed from the validation data set

8
KDD-CUP 1997 Data Set

321 fields/variables with sanitized names and
labels
Demographic information
Credit history
Promotion history
Significant effort on data preprocessing
leaker detection and removal

9
KDD-CUP Participant Statistics

45 companies/institutions participated
23 research prototypes
22 commercial tools
16 contestants turned in their results
9 research prototypes
7 commercial tools

10
KDD-CUP Algorithm Statistics
Of the 16 software/tools (Score as % of best)

Algorithm         # of Entries   Ave. Score
Rules             2              87
k-NN              1              85
Bayesian          3              83
Multiple/Hybrid   4              79
Other             2              68
Decision Tree     4              44
11
KDD Cup 97 Evaluation

Best Gains at 40%
Urban Science
BNB
Mineset
Best Gains at 10%
BNB
Urban Science
Mineset

12
KDD-CUP 1997 Awards

The GOLD MINER award is jointly shared by two
contestants this year
1) Charles Elkan, Ph.D. from University of
California, San Diego with his software BNB,
Boosted Naive Bayesian Classifier
1) Urban Science Applications, Inc. with their
software gain, Direct Marketing Selection System
The BRONZE MINER award went to the runner-up
3) Silicon Graphics, Inc with their software
MineSet

13
KDD-CUP Results Discussion

Top finishers very close
Naïve Bayes algorithm was used by 2 of the top 3
contestants (BNB and MineSet)
BNB and MineSet did little data preprocessing
MineSet used a total of 6 variables in their
final model
Urban Science implemented a tremendous amount of
automated data preprocessing and exploratory data
analysis and developed more than 50 models in an
automated fashion to get to their results

14
KDD Cup 1997 Top 3 results
Top 3 finishers are very close
15
KDD Cup 1997 worst results
Note that the worst result (C6) was
actually worse than random. Competitor names
were kept anonymous, apart from top 3 winners
16
Better Model Evaluation?

Comparing Gains at 10% and 40% is ad-hoc
Are there more principled methods?
Area Under the Curve (AUC) of Gains Chart
Lift Quality
Ultimately, financial measures: Campaign Benefits

17
Model Evaluation: AUC

Area Under the Curve (AUC) is defined as the difference between the Gains and Random curves

[Chart: Cum Hits vs. Selection]
18
Model Evaluation: Lift Quality

See "Measuring Lift Quality in Database Marketing", Piatetsky-Shapiro and Steingold, SIGKDD Explorations, December 2000.

LQ = (AUC(Model) - AUC(Random)) / (AUC(Perfect) - AUC(Random))
19
Lift Quality (Lquality)

For a perfect model, Lquality = 100%
For a random model, Lquality = 0%
For KDD Cup 97,
Lquality(Urban Science) = 43.3%
Lquality(Elkan) = 42.7%
However, small differences in Lquality are not significant
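One way to compute Lift Quality from a ranked list is sketched below. It assumes AUC is measured as the area under the gains (CPH) curve, so that AUC(Random) = 0.5 and AUC(Perfect) = 1 - T/2 for target rate T. These conventions and the function names are assumptions for illustration and may differ in detail from the cited paper.

```python
def gains_auc(ranked_labels):
    """Area under the gains (CPH) curve, using the trapezoid rule.

    `ranked_labels` holds 0/1 target flags already sorted by decreasing score.
    """
    n = len(ranked_labels)
    total_hits = sum(ranked_labels)
    if n == 0 or total_hits == 0:
        raise ValueError("need at least one record and one target")
    area, cum_hits, prev_cph = 0.0, 0, 0.0
    for label in ranked_labels:
        cum_hits += label
        cph = cum_hits / total_hits
        area += (prev_cph + cph) / 2 / n  # trapezoid of width 1/n
        prev_cph = cph
    return area


def lift_quality(ranked_labels):
    """LQ = (AUC(Model) - AUC(Random)) / (AUC(Perfect) - AUC(Random))."""
    t = sum(ranked_labels) / len(ranked_labels)  # target rate T
    auc_random = 0.5            # assumed: diagonal gains curve
    auc_perfect = 1 - t / 2     # assumed: all targets in the first T of the list
    return (gains_auc(ranked_labels) - auc_random) / (auc_perfect - auc_random)


print(lift_quality([1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # perfect ranking
print(lift_quality([1, 0, 0, 1, 0, 0, 0, 0, 0, 0]))  # imperfect ranking
```

Under these assumptions a perfect ranking scores 1 (100%), and a ranking that puts every target last scores -1.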

20
Estimating Profit: Campaign Parameters

Direct Mail example
N -- number of prospects, e.g. 750,000
T -- fraction of targets, e.g. 0.014
B -- benefit of hitting a target, e.g. $20
Note: this is a simplification; actual benefit will vary
C -- cost of contacting a prospect, e.g. $0.68
P -- percentage selected for contact, e.g. 10%
Lift(P) -- model lift at P, e.g. 3

21
Contacting Top P of Model-Sorted List

Using previous example, let selection be P = 10% and Lift(P) = 3
Selection size = N * P, e.g. 75,000
Random has N * P * T targets in first P list, e.g. 1,050
Q: How many targets are in model P-selection?
Model has more by a factor Lift(P), or N * P * T * Lift(P) targets in the selection, e.g. 3,150
Benefit of contacting the selection is N * P * T * Lift(P) * B, e.g. $63,000
Cost of contacting N * P is N * P * C, e.g. $51,000

22
Profit of Contacting Top P

Profit(P) = Benefit(P) - Cost(P)
= N * P * T * Lift(P) * B - N * P * C
= N * P * (T * Lift(P) * B - C), e.g. $12,000
Q: When is Profit Positive?
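The profit formula, and one answer to the question, can be checked with a short sketch. Profit is positive exactly when T * Lift(P) * B > C, that is, when Lift(P) > C / (T * B). The function names below are illustrative; the parameter values are the ones from the direct mail example.

```python
def campaign_profit(n, t, b, c, p, lift_p):
    """Profit(P) = N * P * (T * Lift(P) * B - C)."""
    return n * p * (t * lift_p * b - c)


def break_even_lift(t, b, c):
    """Lift at which benefit equals cost: Lift(P) = C / (T * B)."""
    return c / (t * b)


profit = campaign_profit(n=750_000, t=0.014, b=20.0, c=0.68, p=0.10, lift_p=3.0)
print(round(profit))                                   # 12000
print(round(break_even_lift(0.014, 20.0, 0.68), 2))    # 2.43
```

With these parameters the model needs a lift of roughly 2.43 at the chosen cutoff before the mailing pays for itself.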

23
Finding Optimal Cutoff
Use the formula to estimate benefit for each P
Find optimal P
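A minimal sketch of the cutoff search: evaluate the profit formula on a grid of P values and keep the best one. The lift curve passed in here, Lift(P) = 1/sqrt(P), is only a placeholder assumption; in practice it would come from a model's measured lift on validation data. The function name and the grid step are also illustrative.

```python
def optimal_cutoff(lift_fn, n, t, b, c, steps=100):
    """Scan P = 1/steps, 2/steps, ..., 1 and return (best P, best profit).

    `lift_fn(p)` must return the (estimated or measured) lift at fraction p.
    """
    best_p, best_profit = 0.0, 0.0  # contacting nobody yields zero profit
    for k in range(1, steps + 1):
        p = k / steps
        profit = n * p * (t * lift_fn(p) * b - c)
        if profit > best_profit:
            best_p, best_profit = p, profit
    return best_p, best_profit


# Placeholder lift curve (an assumption for this example only).
best_p, best_profit = optimal_cutoff(
    lambda p: p ** -0.5, n=750_000, t=0.014, b=20.0, c=0.68
)
print(best_p, round(best_profit))
```

A coarse grid is usually enough because the profit curve is fairly flat near its maximum.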
24
Feasibility Assessment

Expected Profit(P) depends on known
Cost C,
Benefit B,
Target Rate T
and unknown Lift(P)
To compute Lift(P) we need to get all the data,
load it, clean it, ask for correct data, build
models, ...

25
Can Expected Lift be estimated?

only from N and T?
In theory -- no, but in many practical applications, surprisingly, yes!

26
Empirical Observations about Lift

For good models, Lift(P) usually decreases monotonically with P
Lift at fixed P (e.g. 0.05) is usually higher for lower T
Special point: P = T
for a perfect predictor, all targets are in the first T fraction of the list, for a maximum lift of 1/T
What can we expect compared to 1/T?

27
Meta Analysis of Lift

26 attrition and cross-sell problems from finance and telecom domains
N ranges from 1,000 to 150,000
T ranges from 1% to 22%
No clear relation to N, but there is dependence on T

28
Results: Lift(T) vs 1/T

Tried several linear and log-linear fits

Best Model (R^2 = 0.86): log10(Lift(T)) = -0.05 + 0.52 log10(1/T)
Approximately: Lift(T) ~ T^-0.5 = sqrt(1/T)
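To see how close the square-root approximation is to the fitted regression, both can be evaluated side by side. This is a small sketch using the coefficients quoted on the slide; the function names and the sample T values are chosen for illustration.

```python
import math


def lift_fitted(t):
    """Fitted model: log10(Lift(T)) = -0.05 + 0.52 * log10(1/T)."""
    return 10 ** (-0.05 + 0.52 * math.log10(1 / t))


def lift_sqrt(t):
    """Approximation: Lift(T) ~ sqrt(1/T)."""
    return math.sqrt(1 / t)


for t in (0.014, 0.05, 0.10, 0.22):
    print(f"T={t:.3f}  fitted={lift_fitted(t):.2f}  sqrt(1/T)={lift_sqrt(t):.2f}")
```

Over the sampled range the two expressions stay within a fraction of a lift point of each other.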
29
Actual Lift(T) vs sqrt(1/T) for All Problems
Error = Actual Lift - sqrt(1/T)
Avg(Error) = -0.08
St. Dev(Error) = 1.0
30
GPS Lift(T) Rule of Thumb

For targeted marketing campaigns,
where 0.01 < T < 0.25,
Lift(T) ~ sqrt(1/T) +/- 1
Exceptions for
truly predictable or random behaviors
poor models
information leakers

31
Estimating Entire Curve

Cumulative Percent Hits
CPH(P) = Lift(P) * P
CPH is easier to model than Lift
Several regressions for all CPH curves
Best results with regression
log10(CPH(P)) = a + b log10(P)
Average R^2 = 0.97

32
CPH Curve Estimate

Approximately
CPH(P) ~ sqrt(P)
bounds
P^0.6 < CPH(P) < P^0.4

33
Lift Curve Estimate

Since Lift(P) = CPH(P)/P
Lift(P) ~ 1/sqrt(P)
bounds
(1/P)^0.4 < Lift(P) < (1/P)^0.6
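The estimated curves and their bounds are easy to tabulate. The sketch below uses only the exponents given on the slides; the function names are illustrative.

```python
def cph_estimate(p):
    """CPH(P) ~ sqrt(P)."""
    return p ** 0.5


def cph_bounds(p):
    """P^0.6 < CPH(P) < P^0.4, returned as (lower, upper)."""
    return p ** 0.6, p ** 0.4


def lift_estimate(p):
    """Lift(P) ~ 1/sqrt(P)."""
    return p ** -0.5


def lift_bounds(p):
    """(1/P)^0.4 < Lift(P) < (1/P)^0.6, returned as (lower, upper)."""
    return (1 / p) ** 0.4, (1 / p) ** 0.6


for p in (0.05, 0.10, 0.40):
    lo, hi = lift_bounds(p)
    print(f"P={p:.2f}  lift~{lift_estimate(p):.2f}  bounds=({lo:.2f}, {hi:.2f})")
```

As a sanity check, the lift of 4.2 at P = 5% from the earlier gains example falls inside the bounds for P = 0.05.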

34
More on Estimating Lift and Profitability
G. Piatetsky-Shapiro, B. Masand, Estimating Campaign Benefits and Modeling Lift, Proc. KDD-99, ACM. www.KDnuggets.com/gpspubs/
35
KDD Cup 1998

Data from Paralyzed Veterans of America (charity)
Goal: select mailing with the highest profit
Winners: Urban Science, SAS, Quadstone
see full results and winners presentations at
www.kdnuggets.com/meetings/kdd98

36
KDD-CUP-98 Analysis Universe

Paralyzed Veterans of America (PVA), a
not-for-profit organization that provides
programs and services for US veterans with spinal
cord injuries or disease, generously provided the
data set
PVA's June 97 fund raising mailing, sent to 3.5 million donors, was selected as the competition data
Within this universe, a group of 200K "Lapsed" donors was of particular interest to PVA.
"Lapsed" donors are individuals who made their last donation to PVA 13 to 24 months prior to the mailing

37
KDD Cup-98 Example

Evaluation: Expected profit maximization with a mailing cost of $0.68
Sum of (actual donation - $0.68) for all records with predicted/expected donation > $0.68
Participant with the highest actual sum wins
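The evaluation rule can be written as a short scoring function. This is a sketch of the rule as stated on the slide, not the official scoring script; the function name and the sample donations are hypothetical.

```python
MAILING_COST = 0.68  # cost per mailed record, from the slide


def campaign_score(actual_donations, predicted_donations, cost=MAILING_COST):
    """Sum of (actual donation - cost) over records whose predicted donation exceeds cost."""
    return sum(
        actual - cost
        for actual, predicted in zip(actual_donations, predicted_donations)
        if predicted > cost
    )


# Hypothetical records: only the first two are mailed (predicted > 0.68).
actual = [10.0, 0.0, 5.0, 0.0]
predicted = [2.0, 1.0, 0.5, 0.1]
print(round(campaign_score(actual, predicted), 2))
```

Mailing a non-donor costs $0.68, while skipping a donor forfeits the donation but is scored as 0, so the score rewards selecting only records whose expected donation exceeds the mailing cost.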

38
KDD Cup Cost Matrix
                       Predicted Donation: Yes   Predicted Donation: No
Actual Donation: Yes   DonationAmt - 0.68        0
Actual Donation: No    -0.68                     0
39
KDD Cup 1998 Results
Model        Selected   Result   Rank
GainSmarts   56,330     14,712   1
SAS          55,838     14,662   2
Quadstone    57,836     13,954   3

ALL          96,367     10,560   13

20           42,270     1,706    20
21           1,551      -54      21

Selected: how many were selected by the model
Result: the total profit (donations - cost) of the model
ALL: selecting all
40
Summary