Title: KDD Cup 2009
1. KDD Cup 2009
- Fast Scoring on a Large Database
- Presentation of the Results at the KDD Cup Workshop, June 28, 2009
- The Organizing Team
2. KDD Cup 2009 Organizing Team
- Project team at Orange Labs R&D
- Vincent Lemaire
- Marc Boullé
- Fabrice Clérot
- Raphaël Féraud
- Aurélie Le Cam
- Pascal Gouzien
- Beta testing and proceedings editor
- Gideon Dror
- Web site design
- Olivier Guyon (MisterP.net, France)
- Coordination (KDD cup co-chairs)
- Isabelle Guyon
- David Vogel
3. Thanks to our sponsors
- Orange
- ACM SIGKDD
- Pascal
- Unipen
- Google
- Health Discovery Corp
- Clopinet
- Data Mining Solutions
- MPS
4. Record KDD Cup Participation
Year Teams
1997 45
1998 57
1999 24
2000 31
2001 136
2002 18
2003 57
2004 102
2005 37
2006 68
2007 95
2008 128
2009 453
5. Participation Statistics
- 1299 registered teams
- 7865 entries
- 46 countries
Argentina Germany Malaysia South Korea
Australia Greece Mexico Spain
Austria Hong Kong Netherlands Sweden
Belgium Hungary New Zealand Switzerland
Brazil India Pakistan Taiwan
Bulgaria Iran Portugal Turkey
Canada Ireland Romania Uganda
Chile Israel Russian Federation United Kingdom
China Italy Singapore Uruguay
Fiji Japan Slovak Republic United States
Finland Jordan Slovenia
France Latvia South Africa
6. A worldwide operator
- One of the main telecommunication operators in the world
- Providing services to more than 170 million customers over five continents
- Including 120 million under the Orange brand
7. KDD Cup 2009 organized by Orange: Customer Relationship Management (CRM)
- Three marketing tasks: predict the propensity of customers
- to switch provider (churn)
- to buy new products or services (appetency)
- to buy upgrades or new options proposed to them (up-selling)
- Objective: improve the return on investment (ROI) of marketing campaigns
- Increase the efficiency of the campaign for a given campaign cost
- Decrease the campaign cost for a given marketing objective
- Better prediction leads to better ROI
8. Data, constraints and requirements
- Train and deploy requirements
- About one hundred models per month
- Fast data preparation and modeling
- Fast deployment
- Model requirements
- Robust
- Accurate
- Understandable
- Business requirement
- Return on investment for the whole process
- Input data
- Relational databases
- Numerical or categorical
- Noisy
- Missing values
- Heavily unbalanced distribution
- Train data
- Hundreds of thousands of instances
- Tens of thousands of variables
- Deployment
- Tens of millions of instances
9. In-house system: from raw data to scoring models
- Data warehouse
- Relational database
- Data mart
- Star schema
- Feature construction
- PAC technology
- Generates tens of thousands of variables
- Data preparation and modeling
- Khiops technology
(Diagram: data feeding → PAC → Khiops)
10. Design of the challenge
- Orange business objective
- Benchmark the in-house system against state-of-the-art techniques
- Data
- Data store
- Not an option
- Data warehouse
- Confidentiality and scalability issues
- Relational data requires domain knowledge and specialized skills
- Tabular format
- Standard format for the data mining community
- Domain knowledge incorporated using feature construction (PAC)
- Easy anonymization
- Tasks
- Three representative marketing tasks
- Requirements
- Fast data preparation and modeling (fully automatic)
11. Data sets extraction and preparation
- Input data
- 10 relational tables
- A few hundred fields
- One million customers
- Instance selection
- Resampling given the three marketing tasks
- Keep 100,000 instances, with less unbalanced target distributions
- Variable construction
- Using PAC technology
- 20,000 constructed variables to get a tabular representation
- Keep 15,000 variables (discard constant variables)
- Small track: subset of 230 variables related to classical domain knowledge
- Anonymization
- Discard variable names, discard identifiers
- Randomize order of variables
- Rescale each numerical variable by a random factor
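The anonymization steps above can be sketched as follows; the `anonymize` helper, its `Var1..VarN` renaming, and the factor range 0.5-2.0 are illustrative assumptions, not the organizers' actual tooling:

```python
import numpy as np
import pandas as pd

def anonymize(df, numeric_cols, seed=0):
    """Drop real variable names, shuffle column order, and rescale each
    numerical variable by a random positive factor (a sketch of the
    anonymization described on the slide)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Rescale each numerical variable by a random factor (range is an
    # arbitrary choice for illustration).
    for col in numeric_cols:
        out[col] = out[col] * rng.uniform(0.5, 2.0)
    # Randomize the order of the variables and replace names by opaque ids.
    order = rng.permutation(out.columns.to_numpy())
    out = out[order]
    out.columns = [f"Var{i + 1}" for i in range(out.shape[1])]
    return out
```

Rescaling by a positive factor preserves the within-variable ordering, so rank-based preparation (discretization, AUC-oriented scoring) is unaffected.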
12. Scientific and technical challenge
- Scientific objective
- Fast data preparation and modeling within five days
- Large scale: 50,000 train and 50,000 test instances, 15,000 variables
- Heterogeneous data
- Numerical with missing values
- Categorical with hundreds of values
- Heavily unbalanced distribution
- KDD social meeting objective
- Attract as many participants as possible
- Additional small track and slow track
- Online feedback on the validation dataset
- Toy problem (only one informative input variable)
- Leverage challenge protocol overhead
- One month to explore descriptive data and test the submission protocol
- Attractive conditions
- No intellectual property conditions
- Money prizes
13. Business impact of the challenge
- Bring Orange datasets to the data mining community
- Benefit for the community
- Access to challenging data
- Benefit for Orange
- Benchmark of numerous competing techniques
- Drive research efforts towards Orange needs
- Evaluate the Orange in-house system
- High number of participants and high quality of results
- Orange in-house results
- Improved by a significant margin when leveraging all business requirements
- Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, understandability)
- Need to study the best challenge methods to get more insights
14. KDD Cup 2009 Result Analysis
(Figure: best result over the period considered; in-house system, downloadable at www.khiops.com; baseline: Naïve Bayes)
15. Overall Test AUC (fast track)
(Figure: best results on each dataset vs. submissions; a good result was reached very quickly)
16. Overall Test AUC (fast track)
(Figure: best results on each dataset vs. submissions; a good result was reached very quickly)
- In-house (Orange) system
- No parameters
- On one standard laptop (single processor)
- If treated as 3 different problems
17. Overall Test AUC (fast track)
(Figure: a good result very fast; small improvement after the first day: 83.85 → 84.93)
18. Overall Test AUC (slow track)
- Very small improvement after the 5th day (84.93 → 85.2)
- Improvement due to unscrambling?
19. Overall Test AUC: submissions
- 23.24% of the submissions (with AUC > 0.5) scored below the baseline
- 15.25% of the submissions (with AUC > 0.5) scored above the in-house system
- 84.75% of the submissions (with AUC > 0.5) scored below the in-house system
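The evaluation metric throughout these slides is the test AUC. A minimal self-contained sketch computes it through the equivalent Mann-Whitney rank statistic (the standard definition, not necessarily the organizers' scoring code):

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the probability that a randomly drawn positive is scored
    above a randomly drawn negative; ties count 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Count positive/negative pairs where the positive outscores the negative.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

An uninformative scorer gets 0.5 in expectation, which is why submissions are compared against the 0.5 level above.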
20. Overall Test AUC: 'Correlation' Test / Valid
21. Overall Test AUC: 'Correlation' Test / Train
(Figure annotations: random values submitted; boosting method or train target submitted; overfitting)
22. Overall Test AUC
(Figure: Test AUC at 24 hours, 12 hours, 5 days and 36 days)
23. Overall Test AUC
- Δ (difference between the best result at the end of the first day and the best result at the end of the 36 days) = 1.35
- Time to adjust model parameters?
- Time to train ensemble methods?
- Time to find more processors?
- Time to test more methods?
- Time to unscramble?
(Figure: Test AUC at 12 hours and at 36 days)
24. Test AUC = f(time)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36; which tasks are easier, which are harder?)
25. Test AUC = f(time)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36, with Δ = 1.84, 1.38 and 0.11 respectively; which tasks are easier, which are harder?)
- Δ = difference between the best result at the end of the first day and the best result at the end of the 36 days
26. Correlation Test AUC / Valid AUC (5 days)
(Figure: Churn, Appetency and Up-selling Test/Valid AUC over days 0-5; which tasks are easier, which are harder?)
27. Correlation Train AUC / Valid AUC (36 days)
(Figure: Churn, Appetency and Up-selling Test/Train AUC over days 0-36)
- Difficult to conclude anything
28. Histogram Test AUC / Valid AUC (days 0-5 vs. 5-36)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36)
- Does knowledge (parameters?) found during the first 5 days help afterwards?
29. Histogram Test AUC / Valid AUC (days 0-5 vs. 5-36)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36 and days 5-36)
- Does knowledge (parameters?) found during the first 5 days help afterwards? Yes!
30. Fact Sheets: Preprocessing & Feature Selection
PREPROCESSING (overall usage: 95%)
(Bar chart, percent of participants:)
- Replacement of the missing values
- Discretization
- Normalizations
- Grouping modalities
- Other preprocessing
- Principal Component Analysis
FEATURE SELECTION (overall usage: 85%)
(Bar chart, percent of participants:)
- Feature ranking
- Filter method
- Other feature selection
- Forward / backward wrapper
- Embedded method
- Wrapper with search
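Two of the most common preprocessing steps tallied above, missing-value replacement and discretization, can be sketched as follows; `impute_mean` and the equal-frequency `discretize` are illustrative helpers, not any participant's actual code:

```python
import numpy as np

def impute_mean(x):
    """Replace missing values (NaN) by the mean of the observed values."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)

def discretize(x, n_bins=10):
    """Equal-frequency discretization: map each value to a quantile bin index."""
    x = np.asarray(x, dtype=float)
    # Interior quantile cut points (n_bins - 1 of them).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")
```

Both transforms are robust to the heavy-tailed, noisy numerical variables described earlier, since they depend only on means and ranks.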
31. Fact Sheets: Classifier
CLASSIFIER (overall usage: 93%)
(Bar chart, percent of participants:)
- Decision tree
- Linear classifier
- Non-linear kernel
- Other classifier
- Neural network
- Naïve Bayes
- Nearest neighbors
- Bayesian network
- Bayesian neural network
- About 30% logistic loss, >15% exponential loss, >15% squared loss, 10% hinge loss.
- Less than 50% used regularization (20% 2-norm, 10% 1-norm).
- Only 13% used unlabeled data.
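The losses tallied above can all be written on the margin m = y·f(x) with labels y in {-1, +1}; a reference sketch (helper names are ours, not from the fact sheets):

```python
import numpy as np

# Margin m = y * f(x): a correct, confident prediction has a large positive m.

def logistic_loss(m):
    """log(1 + e^{-m}), the loss of logistic regression / LogitBoost."""
    return np.log1p(np.exp(-np.asarray(m, dtype=float)))

def exp_loss(m):
    """e^{-m}, the AdaBoost loss."""
    return np.exp(-np.asarray(m, dtype=float))

def squared_loss(m):
    """(1 - m)^2, least squares on the margin."""
    return (1.0 - np.asarray(m, dtype=float)) ** 2

def hinge_loss(m):
    """max(0, 1 - m), the SVM loss."""
    return np.maximum(0.0, 1.0 - np.asarray(m, dtype=float))
```

All four upper-bound the 0/1 error (up to scaling) but penalize confident mistakes differently, which is one reason ensembles built on different losses combine well.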
32. Fact Sheets: Model Selection
MODEL SELECTION (overall usage: 90%)
(Bar chart, percent of participants:)
- 10% test set
- K-fold or leave-one-out
- Out-of-bag estimation
- Bootstrap estimation
- Other model selection
- Other cross-validation
- Virtual leave-one-out
- Penalty-based
- Bi-level
- Bayesian
- About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
- About 10% used unscrambling.
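K-fold cross-validation, the dominant scheme above, can be sketched as follows; the `fit`/`score` callable interface is an illustrative assumption, not a specific participant's pipeline:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split a shuffled range(n) into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_val_score(fit, score, X, y, k=5):
    """Plain k-fold CV: train on k-1 folds, score on the held-out fold,
    return the average held-out score."""
    folds = kfold_indices(len(y), k)
    scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    return float(np.mean(scores))
```

With an AUC-based `score`, this is the kind of estimate participants used to pick models before touching the online validation feedback.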
33. Fact Sheets: Implementation
(Bar charts: run in parallel (none / multi-processor), memory, parallelism, operating system, software platform)
34. Winning methods
- Fast track
- IBM Research, USA: ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
- ID Analytics, Inc., USA: filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
- David Slate & Peter Frey, USA: grouping of modalities / discretization, filter feature selection, ensemble of decision trees.
- Slow track
- University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
- Financial Engineering Group, Inc., Japan: grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
- National Taiwan University: average of 3 classifiers: (1) the joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes. (Plus small-dataset unscrambling.)
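Most winning entries combined bagging or boosting with tree-based weak learners. A toy sketch of the bagging idea, using regression stumps as weak learners (an illustration of the principle, not any winner's actual method):

```python
import numpy as np

def fit_stump(X, y):
    """Regression stump: the best single-feature threshold split minimizing
    squared error, the typical weak learner inside tree ensembles."""
    best = None
    for j in range(X.shape[1]):
        # Exclude the max so the right side is never empty.
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, left.mean(), right.mean())
    if best is None:  # degenerate sample (all rows identical): constant model
        c = y.mean()
        return lambda X: np.full(len(X), c)
    _, j, t, lo, hi = best
    return lambda X: np.where(X[:, j] <= t, lo, hi)

def bagged_stumps(X, y, n_estimators=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample, average predictions."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), len(y))
        stumps.append(fit_stump(X[idx], y[idx]))
    return lambda X: np.mean([s(X) for s in stumps], axis=0)
```

Averaging over resamples is what makes tree ensembles robust to the noisy, mixed-type variables of this challenge; the winners used far stronger trees and boosting, but the resampling-and-averaging core is the same.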
35. Conclusion
- Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered
- a problem of real industrial interest with challenging scientific and technical aspects
- prizes.
- Lessons learned
- Do not underestimate the participants: five days were given for the fast challenge, yet a few hours sufficed for some participants.
- Ensemble methods are effective.
- Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and lots of missing values.