Title: Targeted Marketing, KDD Cup and Customer Modeling
1Targeted Marketing, KDD Cup and Customer Modeling
2Outline
- Direct Marketing
- Review of Evaluation: Lift, Gains
- KDD Cup 1997
- Lift and Benefit estimation
- Privacy and Data Mining
3Direct Marketing Paradigm
- Find most likely prospects to contact
- Not everybody needs to be contacted
- Number of targets is usually much smaller than number of prospects
- Typical Applications
- retailers, catalogues, direct mail (and e-mail)
- customer acquisition, cross-sell, attrition
- ...
4Direct Marketing Evaluation
- Accuracy on the entire dataset is not the right measure
- Approach
- develop a target model
- score all prospects and rank them by decreasing score
- select top P% of prospects for action
- Evaluate performance on top P% using Gains and Lift
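The ranking-and-selection step above can be sketched as follows. This is a minimal illustration, not the method of any KDD Cup entrant: the function name `select_top_percent` and the example scores are hypothetical, and any model that produces a numeric score per prospect could feed it.

```python
from typing import List, Sequence


def select_top_percent(scores: Sequence[float], percent: float) -> List[int]:
    """Return indices of the top `percent` of prospects, ranked by decreasing score."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    # Rank prospect indices by decreasing model score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Number of prospects to contact, rounded to a whole prospect.
    n_selected = round(len(scores) * percent / 100)
    return ranked[:n_selected]


# Hypothetical scores for ten prospects (not from the source).
example_scores = [0.10, 0.80, 0.35, 0.05, 0.60, 0.20, 0.90, 0.15, 0.40, 0.30]
top_20 = select_top_percent(example_scores, 20)
print(top_20)  # indices of the two highest-scoring prospects
```

Only the selected prospects are contacted; Gains and Lift are then measured on that selection rather than on the whole list.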
5CPH (Gains) Random List vs Model-ranked list
[Figure: gains chart, Cumulative % Hits vs. Pct list, for random and model-ranked lists]
5% of a random list has 5% of targets, but 5% of the model-ranked list has 21% of targets.
CPH(5%, model) = 21%.
6Lift Curve
Lift(P) = CPH(P) / P
Lift (at 5%) = 21% / 5% = 4.2 times better than random
P -- percent of the list
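As a small sketch of how CPH and Lift could be computed from a ranked list: the functions below assume the list is already sorted by decreasing model score and that each entry is 1 for a target and 0 otherwise. The function names and the 20-prospect example are hypothetical.

```python
from typing import Sequence


def cumulative_percent_hits(ranked_labels: Sequence[int], fraction: float) -> float:
    """Fraction of all targets found in the top `fraction` of a ranked list."""
    total_targets = sum(ranked_labels)
    if total_targets == 0:
        raise ValueError("no targets in the list")
    n_selected = round(len(ranked_labels) * fraction)
    return sum(ranked_labels[:n_selected]) / total_targets


def lift(ranked_labels: Sequence[int], fraction: float) -> float:
    """Lift(P) = CPH(P) / P, with P given as a fraction in (0, 1]."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return cumulative_percent_hits(ranked_labels, fraction) / fraction


# Hypothetical ranked list: 20 prospects, 4 targets, 3 of them in the top 25%.
ranked = [1, 1, 0, 1, 0, 0, 0, 0, 1, 0] + [0] * 10
print(cumulative_percent_hits(ranked, 0.25))  # 0.75
print(lift(ranked, 0.25))                     # 3.0

# The slide's own numbers: CPH = 21% at P = 5%.
print(round(0.21 / 0.05, 2))                  # 4.2
```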
7KDD-CUP 1997
- Task: given data on past responders to fund-raising, predict most likely responders for new campaign
- Population of 750K prospects
- 10K responded to a broad campaign mailing (1.4% response rate)
- Analysis file included a stratified (non-random) sample of
- 10K responders and 26K non-responders (28.7% response rate)
- 75% used for learning; 25% used for validation
- target variable removed from the validation data set
8KDD-CUP 1997 Data Set
- 321 fields/variables with sanitized names and labels
- Demographic information
- Credit history
- Promotion history
- Significant effort on data preprocessing
- leaker detection and removal
9KDD-CUP Participant Statistics
- 45 companies/institutions participated
- 23 research prototypes
- 22 commercial tools
- 16 contestants turned in their results
- 9 research prototypes
- 7 commercial tools
10KDD-CUP Algorithm Statistics
Of the 16 software/tools (Score as % of best)
Algorithm          # of Entries   Ave. Score
Rules              2              87
k-NN               1              85
Bayesian           3              83
Multiple/Hybrid    4              79
Other              2              68
Decision Tree      4              44
11KDD Cup 97 Evaluation
- Best Gains at 40%
- Urban Science
- BNB
- Mineset
- Best Gains at 10%
- BNB
- Urban Science
- Mineset
12KDD-CUP 1997 Awards
- The GOLD MINER award is jointly shared by two contestants this year
- 1) Charles Elkan, Ph.D. from University of California, San Diego with his software BNB, Boosted Naive Bayesian Classifier
- 1) Urban Science Applications, Inc. with their software gain, Direct Marketing Selection System
- The BRONZE MINER award went to the runner-up
- 3) Silicon Graphics, Inc. with their software MineSet
13KDD-CUP Results Discussion
- Top finishers very close
- Naïve Bayes algorithm was used by 2 of the top 3 contestants (BNB and MineSet)
- BNB and MineSet did little data preprocessing
- MineSet used a total of 6 variables in their final model
- Urban Science implemented a tremendous amount of automated data preprocessing and exploratory data analysis and developed more than 50 models in an automated fashion to get to their results
14KDD Cup 1997 Top 3 results
Top 3 finishers are very close
15KDD Cup 1997 worst results
Note that the worst result (C6) was actually worse than random.
Competitor names were kept anonymous, apart from the top 3 winners.
16Better Model Evaluation?
- Comparing Gains at 10% and 40% is ad-hoc
- Are there more principled methods?
- Area Under the Curve (AUC) of Gains Chart
- Lift Quality
- Ultimately, financial measures: Campaign Benefits
17Model Evaluation AUC
- Area Under the Curve (AUC) is defined as the difference between the Gains and Random curves
[Figure: gains chart, Cum Hits vs. Selection]
18Model Evaluation Lift Quality
- See "Measuring Lift Quality in Database Marketing", Piatetsky-Shapiro and Steingold, SIGKDD Explorations, December 2000.
LQ = (AUC(Model) - AUC(Random)) / (AUC(Perfect) - AUC(Random))
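A rough sketch of the Lift Quality ratio follows. It makes several assumptions that are not in the slides: AUC is taken as the area under the gains (CPH) curve, integrated with the trapezoid rule at one point per prospect; the random curve is the diagonal (area 0.5); and the perfect curve is obtained by ranking all targets first. Because the same AUC(Random) is subtracted in numerator and denominator, the ratio is the same whether AUC is measured from the axis or from the random curve. The function names are hypothetical.

```python
from typing import Sequence


def gains_auc(ranked_labels: Sequence[int]) -> float:
    """Area under the gains (CPH) curve, trapezoid rule, one step per prospect."""
    n = len(ranked_labels)
    total = sum(ranked_labels)
    if n == 0 or total == 0:
        raise ValueError("need a non-empty list with at least one target")
    area, hits, prev_cph = 0.0, 0, 0.0
    for label in ranked_labels:
        hits += label
        cph = hits / total
        area += (prev_cph + cph) / 2 / n  # trapezoid of width 1/n
        prev_cph = cph
    return area


def lift_quality(ranked_labels: Sequence[int]) -> float:
    """LQ = (AUC(Model) - AUC(Random)) / (AUC(Perfect) - AUC(Random))."""
    n = len(ranked_labels)
    total = sum(ranked_labels)
    if total == n:
        raise ValueError("every prospect is a target; LQ is undefined")
    auc_model = gains_auc(ranked_labels)
    auc_random = 0.5  # diagonal gains curve
    auc_perfect = gains_auc([1] * total + [0] * (n - total))  # targets first
    return (auc_model - auc_random) / (auc_perfect - auc_random)


# Hypothetical ranked lists (1 = target), sorted by decreasing model score.
print(lift_quality([1, 1, 0, 0]))  # perfect ordering
print(lift_quality([1, 0, 1, 0]))  # partly useful ordering
print(lift_quality([0, 0, 1, 1]))  # worse than random
```

Multiply the result by 100 to express it as a percentage.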
19Lift Quality (Lquality)
- For a perfect model, Lquality = 100%
- For a random model, Lquality = 0%
- For KDD Cup 97,
- Lquality(Urban Science) = 43.3%
- Lquality(Elkan) = 42.7%
- However, small differences in Lquality are not significant
20Estimating Profit Campaign Parameters
- Direct Mail example
- N -- number of prospects, e.g. 750,000
- T -- fraction of targets, e.g. 0.014
- B -- benefit of hitting a target, e.g. $20
- Note: this is a simplification; actual benefit will vary
- C -- cost of contacting a prospect, e.g. $0.68
- P -- percentage selected for contact, e.g. 10%
- Lift(P) -- model lift at P, e.g. 3
21Contacting Top P of Model-Sorted List
- Using previous example, let selection be P = 10% and Lift(P) = 3
- Selection size = N * P, e.g. 75,000
- Random has N * P * T targets in first P of the list, e.g. 1,050
- Q: How many targets are in model P-selection?
- Model has more by a factor Lift(P), or N * P * T * Lift(P) targets in the selection, e.g. 3,150
- Benefit of contacting the selection is N * P * T * Lift(P) * B, e.g. $63,000
- Cost of contacting N * P is N * P * C, e.g. $51,000
22Profit of Contacting Top P
- Profit(P) = Benefit(P) - Cost(P)
- = N * P * T * Lift(P) * B - N * P * C
- = N * P * (T * Lift(P) * B - C), e.g. $12,000
- Q: When is Profit Positive?
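The profit arithmetic can be checked with a few lines of code. This sketch uses the example parameters from the slides (N = 750,000, T = 0.014, B = 20, C = 0.68, P = 10%, Lift(P) = 3); the function names are hypothetical. The break-even function follows directly from the formula: profit is positive when T * Lift(P) * B > C, that is, when Lift(P) > C / (T * B).

```python
def campaign_profit(n: float, t: float, b: float, c: float,
                    p: float, lift_p: float) -> float:
    """Profit(P) = N*P*T*Lift(P)*B - N*P*C, with P given as a fraction."""
    selection_size = n * p
    targets_reached = selection_size * t * lift_p
    benefit = targets_reached * b
    cost = selection_size * c
    return benefit - cost


def break_even_lift(t: float, b: float, c: float) -> float:
    """Smallest lift at which contacting the selection stops losing money."""
    return c / (t * b)


profit = campaign_profit(n=750_000, t=0.014, b=20.0, c=0.68, p=0.10, lift_p=3.0)
print(round(profit))                                  # 12000
print(round(break_even_lift(0.014, 20.0, 0.68), 2))   # 2.43
```

With these parameters a model needs a lift of roughly 2.43 or better at the chosen P before the mailing pays for itself.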
23Finding Optimal Cutoff
Use the formula to estimate benefit for each P
Find optimal P
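One way to sketch the cutoff search is to evaluate the profit formula at a handful of candidate values of P and keep the best. The lift table below is hypothetical (only Lift(10%) = 3 echoes the slides' example), and `best_cutoff` is an illustrative name, not part of any tool mentioned here.

```python
from typing import Callable, Iterable, Tuple


def best_cutoff(n: float, t: float, b: float, c: float,
                lift_at: Callable[[float], float],
                candidates: Iterable[float]) -> Tuple[float, float]:
    """Return (P, profit) maximizing N*P*(T*Lift(P)*B - C) over the candidates."""
    best_p, best_profit = 0.0, 0.0  # contacting nobody yields zero profit
    for p in candidates:
        profit = n * p * (t * lift_at(p) * b - c)
        if profit > best_profit:
            best_p, best_profit = p, profit
    return best_p, best_profit


# Hypothetical lift curve, given as Lift(P) at a few fractions P.
hypothetical_lift = {0.05: 4.0, 0.10: 3.0, 0.20: 2.2, 0.40: 1.6, 1.00: 1.0}

p_opt, profit_opt = best_cutoff(
    n=750_000, t=0.014, b=20.0, c=0.68,
    lift_at=hypothetical_lift.__getitem__,
    candidates=hypothetical_lift,
)
print(p_opt, round(profit_opt))
```

If no candidate is profitable, the function returns P = 0, meaning the campaign should not be run with this model.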
24Feasibility Assessment
- Expected Profit(P) depends on known
- Cost C,
- Benefit B,
- Target Rate T
- and unknown Lift(P)
- To compute Lift(P) we need to get all the data,
load it, clean it, ask for correct data, build
models, ...
25Can Expected Lift be estimated ?
- only from N and T ?
- In theory -- no, but in many practical applications,
- ?!?! surprisingly yes ?!?!
26Empirical Observations about Lift
- For good models, usually Lift(P) is monotonically decreasing with P
- Lift at fixed P (e.g. 0.05) is usually higher for lower T
- Special point P = T
- for a perfect predictor, all targets are in the first T of the list, for a maximum lift of 1/T
- What can we expect compared to 1/T?
27Meta Analysis of Lift
- 26 attrition and cross-sell problems from finance and telecom domains
- N ranges from 1,000 to 150,000
- T ranges from 1% to 22%
- No clear relation to N, but there is dependence on T
28Results Lift(T) vs 1/T
- Tried several linear and log-linear fits
Best Model (R^2 = 0.86): log10(Lift(T)) = -0.05 + 0.52 log10(1/T)
Approximately: Lift(T) ≈ T^(-0.5) = sqrt(1/T)
29Actual Lift(T) vs sqrt(1/T) for All Problems
Error = Actual Lift - sqrt(1/T)
Avg(Error) = -0.08
St. Dev(Error) = 1.0
30GPS Lift(T) Rule of Thumb
- For targeted marketing campaigns,
- where 0.01 < T < 0.25,
- Lift(T) ≈ sqrt(1/T) ± 1
- Exceptions for
- truly predictable or random behaviors
- poor models
- information leakers
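To see how close the square-root rule sits to the log-linear fit quoted earlier in these slides, the sketch below evaluates both. It reads the fit as log10(Lift(T)) = -0.05 + 0.52 log10(1/T), which is an interpretation of the flattened slide text; the function names are hypothetical.

```python
import math


def fitted_lift(t: float) -> float:
    """Log-linear fit, read as log10(Lift(T)) = -0.05 + 0.52 * log10(1/T)."""
    return 10 ** (-0.05 + 0.52 * math.log10(1 / t))


def rule_of_thumb_lift(t: float) -> float:
    """Rule of thumb Lift(T) ~ sqrt(1/T), stated for 0.01 < T < 0.25."""
    if not 0.01 < t < 0.25:
        raise ValueError("rule of thumb is stated only for 0.01 < T < 0.25")
    return math.sqrt(1 / t)


for t in (0.014, 0.04, 0.10, 0.20):
    print(t, round(fitted_lift(t), 2), round(rule_of_thumb_lift(t), 2))
```

Across this range the two estimates differ by well under the stated error band of about 1.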
31Estimating Entire Curve
- Cumulative Percent Hits
- CPH(P) = Lift(P) * P
- CPH is easier to model than Lift
- Several regressions for all CPH curves
- Best results with regression
- log10(CPH(P)) = a + b log10(P)
- Average R^2 = 0.97
32CPH Curve Estimate
- Approximately
- CPH(P) ≈ sqrt(P)
- bounds
- P^0.6 < CPH(P) < P^0.4
33Lift Curve Estimate
- Since Lift(P) = CPH(P)/P
- Lift(P) ≈ 1/sqrt(P)
- bounds
- (1/P)^0.4 < Lift(P) < (1/P)^0.6
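The curve estimates and their bounds translate directly into code. In this sketch P is a fraction in (0, 1]; the function names are hypothetical, and the exponents 0.4, 0.5 and 0.6 are the ones given on the slides.

```python
from typing import Tuple


def _check_fraction(p: float) -> None:
    if not 0 < p <= 1:
        raise ValueError("P must be a fraction in (0, 1]")


def estimated_cph(p: float) -> float:
    """CPH(P) ~ sqrt(P)."""
    _check_fraction(p)
    return p ** 0.5


def cph_bounds(p: float) -> Tuple[float, float]:
    """Lower and upper bounds: P^0.6 < CPH(P) < P^0.4."""
    _check_fraction(p)
    return p ** 0.6, p ** 0.4


def estimated_lift(p: float) -> float:
    """Lift(P) = CPH(P) / P ~ 1 / sqrt(P)."""
    _check_fraction(p)
    return p ** -0.5


def lift_bounds(p: float) -> Tuple[float, float]:
    """Lower and upper bounds: (1/P)^0.4 < Lift(P) < (1/P)^0.6."""
    _check_fraction(p)
    return (1 / p) ** 0.4, (1 / p) ** 0.6


p = 0.25
print(estimated_cph(p), cph_bounds(p))    # estimate sits between the bounds
print(estimated_lift(p), lift_bounds(p))
```

Dividing each CPH bound by P gives the matching lift bound, which is why the exponents swap roles between the two slides.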
34More on Estimating Lift and Profitability
G. Piatetsky-Shapiro, B. Masand, "Estimating Campaign Benefits and Modeling Lift", Proc. KDD-99, ACM.
www.KDnuggets.com/gpspubs/
35KDD Cup 1998
- Data from Paralyzed Veterans of America (charity)
- Goal: select mailing with the highest profit
- Winners: Urban Science, SAS, Quadstone
- see full results and winners' presentations at
- www.kdnuggets.com/meetings/kdd98
36KDD-CUP-98 Analysis Universe
- Paralyzed Veterans of America (PVA), a not-for-profit organization that provides programs and services for US veterans with spinal cord injuries or disease, generously provided the data set
- PVA's June 97 fund raising mailing, sent to 3.5 million donors, was selected as the competition data
- Within this universe, a group of 200K "Lapsed" donors was of particular interest to PVA. "Lapsed" donors are individuals who made their last donation to PVA 13 to 24 months prior to the mailing
37KDD Cup-98 Example
- Evaluation: Expected profit maximization with a mailing cost of $0.68
- Sum of (actual donation - $0.68) for all records with predicted/expected donation > $0.68
- Participant with the highest actual sum wins
38KDD Cup Cost Matrix
                        Predicted Donation: Yes    Predicted Donation: No
Actual Donation: Yes    DonationAmt - 0.68         0
Actual Donation: No     -0.68                      0
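The evaluation rule and cost matrix can be sketched as a scoring function: a record contributes (actual donation - 0.68) if its predicted donation exceeds the mailing cost, and 0 otherwise. The function name and the five example records are hypothetical, not competition data.

```python
from typing import Sequence

MAILING_COST = 0.68  # cost per mailed piece, as stated in the evaluation rule


def campaign_score(actual_donations: Sequence[float],
                   predicted_donations: Sequence[float],
                   mailing_cost: float = MAILING_COST) -> float:
    """Sum of (actual donation - cost) over records predicted to exceed the cost."""
    if len(actual_donations) != len(predicted_donations):
        raise ValueError("actual and predicted lists must have the same length")
    total = 0.0
    for actual, predicted in zip(actual_donations, predicted_donations):
        if predicted > mailing_cost:        # record is selected for mailing
            total += actual - mailing_cost  # non-donors contribute -cost
    return total


# Hypothetical records: actual donation and the model's predicted donation.
actual = [0.0, 10.0, 0.0, 25.0, 0.0]
predicted = [0.20, 1.50, 0.90, 0.50, 0.10]
score = campaign_score(actual, predicted)
print(round(score, 2))  # 8.64
```

Here the model mails two records (one donor, one non-donor) and misses the 25.00 donor, which costs nothing under this matrix but forgoes that donation.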
39KDD Cup 1998 Results
Model         Selected    Result     Rank
GainSmarts    56,330      $14,712    1
SAS           55,838      $14,662    2
Quadstone     57,836      $13,954    3
ALL           96,367      $10,560    13
20            42,270      $1,706     20
21            1,551       -$54       21
Selected: how many were selected by the model
Result: the total profit (donations - cost) of the model
ALL: selecting all
40Summary
- KDD Cup 1997 case study
- Model Evaluation: AUC and Lift Quality
- Estimating Campaign Profit
- Feasibility Assessment
- GPS Rule of Thumb for Typical Lift Curve
- KDD Cup 1998