Title: KDD Cup 2009
1. KDD Cup 2009
- Fast Scoring on a Large Database
- Presentation of the Results at the KDD Cup Workshop, June 28, 2009
- The Organizing Team
2. KDD Cup 2009 Organizing Team
- Project team at Orange Labs R&D
- Vincent Lemaire
- Marc Boullé
- Fabrice Clérot
- Raphaël Féraud
- Aurélie Le Cam
- Pascal Gouzien
- Beta testing and proceedings editor
- Gideon Dror
- Web site design
- Olivier Guyon (MisterP.net, France)
- Coordination (KDD cup co-chairs)
- Isabelle Guyon
- David Vogel
3. Thanks to our sponsors
- Orange
- ACM SIGKDD
- Pascal
- Unipen
- Google
- Health Discovery Corp
- Clopinet
- Data Mining Solutions
- MPS
4. Record KDD Cup Participation
Year Teams
1997 45
1998 57
1999 24
2000 31
2001 136
2002 18
2003 57
2004 102
2005 37
2006 68
2007 95
2008 128
2009 453
5. Participation Statistics
- 1299 registered teams
- 7865 entries
- 46 countries
Argentina Germany Malaysia South Korea
Australia Greece Mexico Spain
Austria Hong Kong Netherlands Sweden
Belgium Hungary New Zealand Switzerland
Brazil India Pakistan Taiwan
Bulgaria Iran Portugal Turkey
Canada Ireland Romania Uganda
Chile Israel Russian Federation United Kingdom
China Italy Singapore Uruguay
Fiji Japan Slovak Republic United States
Finland Jordan Slovenia
France Latvia South Africa
6. A worldwide operator
- One of the main telecommunication operators in the world
- Providing services to more than 170 million customers over five continents
- Including 120 million under the Orange brand
7. KDD Cup 2009 organized by Orange: Customer Relationship Management (CRM)
- Three marketing tasks: predict the propensity of customers
- to switch provider (churn)
- to buy new products or services (appetency)
- to buy upgrades or new options proposed to them (up-selling)
- Objective: improve the return on investment (ROI) of marketing campaigns
- Increase the efficiency of the campaign for a given campaign cost
- Decrease the campaign cost for a given marketing objective
- Better prediction leads to better ROI
8. Data, constraints and requirements
- Train and deploy requirements
- About one hundred models per month
- Fast data preparation and modeling
- Fast deployment
- Model requirements
- Robust
- Accurate
- Understandable
- Business requirement
- Return on investment for the whole process
- Input data
- Relational databases
- Numerical or categorical
- Noisy
- Missing values
- Heavily unbalanced distribution
- Train data
- Hundreds of thousands of instances
- Tens of thousands of variables
- Deployment
- Tens of millions of instances
9. In-house system: from raw data to scoring models
- Data warehouse
- Relational database
- Data mart
- Star schema
- Feature construction
- PAC technology
- Generates tens of thousands of variables
- Data preparation and modeling
- Khiops technology
(Diagram: data feeding → PAC → Khiops)
10. Design of the challenge
- Orange business objective
- Benchmark the in-house system against state-of-the-art techniques
- Data
- Data store
- Not an option
- Data warehouse
- Confidentiality and scalability issues
- Relational data requires domain knowledge and specialized skills
- Tabular format
- Standard format for the data mining community
- Domain knowledge incorporated using feature construction (PAC)
- Easy anonymization
- Tasks
- Three representative marketing tasks
- Requirements
- Fast data preparation and modeling (fully automatic)
11. Data sets extraction and preparation
- Input data
- 10 relational tables
- A few hundred fields
- One million customers
- Instance selection
- Resampling given the three marketing tasks
- Keep 100,000 instances, with less unbalanced target distributions
- Variable construction
- Using PAC technology
- 20,000 constructed variables to get a tabular representation
- Keep 15,000 variables (discard constant variables)
- Small track: subset of 230 variables related to classical domain knowledge
- Anonymization
- Discard variable names, discard identifiers
- Randomize order of variables
- Rescale each numerical variable by a random factor
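The anonymization steps above can be sketched as follows; the `anonymize` helper, its `Var1..VarN` renaming, and the factor range 0.5-2.0 are illustrative assumptions, not the organizers' actual tooling:

```python
import numpy as np
import pandas as pd

def anonymize(df, numeric_cols, seed=0):
    """Drop real variable names, shuffle column order, and rescale each
    numerical variable by a random positive factor (a sketch of the
    anonymization described on the slide)."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Rescale each numerical variable by a random factor (range is an
    # arbitrary choice for illustration).
    for col in numeric_cols:
        out[col] = out[col] * rng.uniform(0.5, 2.0)
    # Randomize the order of the variables and replace names by opaque ids.
    order = rng.permutation(out.columns.to_numpy())
    out = out[order]
    out.columns = [f"Var{i + 1}" for i in range(out.shape[1])]
    return out
```

Rescaling by a positive factor preserves the within-variable ordering, so rank-based preparation (discretization, AUC-oriented scoring) is unaffected.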
12. Scientific and technical challenge
- Scientific objective
- Fast data preparation and modeling within five days
- Large scale: 50,000 train and 50,000 test instances, 15,000 variables
- Heterogeneous data
- Numerical with missing values
- Categorical with hundreds of values
- Heavily unbalanced distribution
- KDD social meeting objective
- Attract as many participants as possible
- Additional small track and slow track
- Online feedback on the validation dataset
- Toy problem (only one informative input variable)
- Leverage challenge protocol overhead
- One month to explore descriptive data and test the submission protocol
- Attractive conditions
- No intellectual property conditions
- Money prizes
13. Business impact of the challenge
- Bring Orange datasets to the data mining community
- Benefit for the community
- Access to challenging data
- Benefit for Orange
- Benchmark of numerous competing techniques
- Drive research efforts towards Orange needs
- Evaluate the Orange in-house system
- High number of participants and high quality of results
- Orange in-house results
- Improved by a significant margin when leveraging all business requirements
- Almost Pareto optimal when other criteria are considered (automation, very fast train and deploy, robustness, understandability)
- Need to study the best challenge methods to get more insights
14. KDD Cup 2009 Result Analysis
(Figure: best result over the period considered; in-house system, downloadable at www.khiops.com; baseline: Naïve Bayes)
15. Overall Test AUC (fast track)
(Figure: best results on each dataset vs. submissions; a good result was reached very quickly)
16. Overall Test AUC (fast track)
(Figure: best results on each dataset vs. submissions; a good result was reached very quickly)
- In-house (Orange) system
- No parameters
- On one standard laptop (single processor)
- If treated as 3 different problems
17. Overall Test AUC (fast track)
(Figure: a good result very fast; small improvement after the first day: 83.85 → 84.93)
18. Overall Test AUC (slow track)
- Very small improvement after the 5th day (84.93 → 85.2)
- Improvement due to unscrambling?
19. Overall Test AUC: submissions
- 23.24% of the submissions (with AUC > 0.5) scored below the baseline
- 15.25% of the submissions (with AUC > 0.5) scored above the in-house system
- 84.75% of the submissions (with AUC > 0.5) scored below the in-house system
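The evaluation metric throughout these slides is the test AUC. A minimal self-contained sketch computes it through the equivalent Mann-Whitney rank statistic (the standard definition, not necessarily the organizers' scoring code):

```python
import numpy as np

def auc(y_true, scores):
    """AUC as the probability that a randomly drawn positive is scored
    above a randomly drawn negative; ties count 1/2."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    # Count positive/negative pairs where the positive outscores the negative.
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))
```

An uninformative scorer gets 0.5 in expectation, which is why submissions are compared against the 0.5 level above.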
20. Overall Test AUC: 'Correlation' Test / Valid
21. Overall Test AUC: 'Correlation' Test / Train
(Figure annotations: random values submitted; boosting method or train target submitted; overfitting)
22. Overall Test AUC
(Figure: Test AUC at 24 hours, 12 hours, 5 days and 36 days)
23. Overall Test AUC
- Δ (difference between the best result at the end of the first day and the best result at the end of the 36 days) = 1.35
- Time to adjust model parameters?
- Time to train ensemble methods?
- Time to find more processors?
- Time to test more methods?
- Time to unscramble?
(Figure: Test AUC at 12 hours and at 36 days)
24. Test AUC = f(time)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36; which tasks are easier, which are harder?)
25. Test AUC = f(time)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36, with Δ = 1.84, 1.38 and 0.11 respectively; which tasks are easier, which are harder?)
- Δ = difference between the best result at the end of the first day and the best result at the end of the 36 days
26. Correlation Test AUC / Valid AUC (5 days)
(Figure: Churn, Appetency and Up-selling Test/Valid AUC over days 0-5; which tasks are easier, which are harder?)
27. Correlation Train AUC / Valid AUC (36 days)
(Figure: Churn, Appetency and Up-selling Test/Train AUC over days 0-36)
- Difficult to conclude anything
28. Histogram Test AUC / Valid AUC (days 0-5 vs. 5-36)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36)
- Does knowledge (parameters?) found during the first 5 days help afterwards?
29. Histogram Test AUC / Valid AUC (days 0-5 vs. 5-36)
(Figure: Churn, Appetency and Up-selling Test AUC over days 0-36 and days 5-36)
- Does knowledge (parameters?) found during the first 5 days help afterwards? Yes!
30. Fact Sheets: Preprocessing & Feature Selection
PREPROCESSING (overall usage: 95%)
(Bar chart, percent of participants:)
- Replacement of the missing values
- Discretization
- Normalizations
- Grouping modalities
- Other preprocessing
- Principal Component Analysis
FEATURE SELECTION (overall usage: 85%)
(Bar chart, percent of participants:)
- Feature ranking
- Filter method
- Other feature selection
- Forward / backward wrapper
- Embedded method
- Wrapper with search
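Two of the most common preprocessing steps tallied above, missing-value replacement and discretization, can be sketched as follows; `impute_mean` and the equal-frequency `discretize` are illustrative helpers, not any participant's actual code:

```python
import numpy as np

def impute_mean(x):
    """Replace missing values (NaN) by the mean of the observed values."""
    x = np.asarray(x, dtype=float)
    return np.where(np.isnan(x), np.nanmean(x), x)

def discretize(x, n_bins=10):
    """Equal-frequency discretization: map each value to a quantile bin index."""
    x = np.asarray(x, dtype=float)
    # Interior quantile cut points (n_bins - 1 of them).
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")
```

Both transforms are robust to the heavy-tailed, noisy numerical variables described earlier, since they depend only on means and ranks.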
31. Fact Sheets: Classifier
CLASSIFIER (overall usage: 93%)
(Bar chart, percent of participants:)
- Decision tree
- Linear classifier
- Non-linear kernel
- Other classifier
- Neural network
- Naïve Bayes
- Nearest neighbors
- Bayesian network
- Bayesian neural network
- About 30% logistic loss, >15% exponential loss, >15% squared loss, 10% hinge loss.
- Less than 50% used regularization (20% 2-norm, 10% 1-norm).
- Only 13% used unlabeled data.
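The losses tallied above can all be written on the margin m = y·f(x) with labels y in {-1, +1}; a reference sketch (helper names are ours, not from the fact sheets):

```python
import numpy as np

# Margin m = y * f(x): a correct, confident prediction has a large positive m.

def logistic_loss(m):
    """log(1 + e^{-m}), the loss of logistic regression / LogitBoost."""
    return np.log1p(np.exp(-np.asarray(m, dtype=float)))

def exp_loss(m):
    """e^{-m}, the AdaBoost loss."""
    return np.exp(-np.asarray(m, dtype=float))

def squared_loss(m):
    """(1 - m)^2, least squares on the margin."""
    return (1.0 - np.asarray(m, dtype=float)) ** 2

def hinge_loss(m):
    """max(0, 1 - m), the SVM loss."""
    return np.maximum(0.0, 1.0 - np.asarray(m, dtype=float))
```

All four upper-bound the 0/1 error (up to scaling) but penalize confident mistakes differently, which is one reason ensembles built on different losses combine well.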
32. Fact Sheets: Model Selection
MODEL SELECTION (overall usage: 90%)
(Bar chart, percent of participants:)
- 10% test set
- K-fold or leave-one-out
- Out-of-bag estimation
- Bootstrap estimation
- Other model selection
- Other cross-validation
- Virtual leave-one-out
- Penalty-based
- Bi-level
- Bayesian
- About 75% used ensemble methods (1/3 boosting, 1/3 bagging, 1/3 other).
- About 10% used unscrambling.
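K-fold cross-validation, the dominant scheme above, can be sketched as follows; the `fit`/`score` callable interface is an illustrative assumption, not a specific participant's pipeline:

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split a shuffled range(n) into k disjoint folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

def cross_val_score(fit, score, X, y, k=5):
    """Plain k-fold CV: train on k-1 folds, score on the held-out fold,
    return the average held-out score."""
    folds = kfold_indices(len(y), k)
    scores = []
    for i, test in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train], y[train])
        scores.append(score(model, X[test], y[test]))
    return float(np.mean(scores))
```

With an AUC-based `score`, this is the kind of estimate participants used to pick models before touching the online validation feedback.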
33. Fact Sheets: Implementation
(Bar charts: run in parallel (none / multi-processor), memory, parallelism, operating system, software platform)
34. Winning methods
- Fast track
- IBM Research, USA: ensemble of a wide variety of classifiers. Effort put into coding (most frequent values coded with binary features, missing values replaced by the mean, extra features constructed, etc.)
- ID Analytics, Inc., USA: filter + wrapper feature selection. TreeNet by Salford Systems, an additive boosting decision tree technology; bagging also used.
- David Slate & Peter Frey, USA: grouping of modalities / discretization, filter feature selection, ensemble of decision trees.
- Slow track
- University of Melbourne: CV-based feature selection targeting AUC. Boosting with classification trees and shrinkage, using Bernoulli loss.
- Financial Engineering Group, Inc., Japan: grouping of modalities, filter feature selection using AIC, gradient tree-classifier boosting.
- National Taiwan University: average of 3 classifiers: (1) the joint multiclass problem solved with an l1-regularized maximum entropy model; (2) AdaBoost with tree-based weak learners; (3) selective Naïve Bayes. (Plus small-dataset unscrambling.)
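Most winning entries combined bagging or boosting with tree-based weak learners. A toy sketch of the bagging idea, using regression stumps as weak learners (an illustration of the principle, not any winner's actual method):

```python
import numpy as np

def fit_stump(X, y):
    """Regression stump: the best single-feature threshold split minimizing
    squared error, the typical weak learner inside tree ensembles."""
    best = None
    for j in range(X.shape[1]):
        # Exclude the max so the right side is never empty.
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if best is None or err < best[0]:
                best = (err, j, t, left.mean(), right.mean())
    if best is None:  # degenerate sample (all rows identical): constant model
        c = y.mean()
        return lambda X: np.full(len(X), c)
    _, j, t, lo, hi = best
    return lambda X: np.where(X[:, j] <= t, lo, hi)

def bagged_stumps(X, y, n_estimators=25, seed=0):
    """Bagging: fit each stump on a bootstrap resample, average predictions."""
    rng = np.random.default_rng(seed)
    stumps = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(y), len(y))
        stumps.append(fit_stump(X[idx], y[idx]))
    return lambda X: np.mean([s(X) for s in stumps], axis=0)
```

Averaging over resamples is what makes tree ensembles robust to the noisy, mixed-type variables of this challenge; the winners used far stronger trees and boosting, but the resampling-and-averaging core is the same.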
35. Conclusion
- Participation exceeded our expectations. We thank the participants for their hard work, our sponsors, and Orange, who offered
- a problem of real industrial interest with challenging scientific and technical aspects
- prizes.
- Lessons learned
- Do not underestimate the participants: five days were given for the fast challenge, yet a few hours sufficed for some participants.
- Ensemble methods are effective.
- Ensembles of decision trees offer off-the-shelf solutions to problems with large numbers of samples and attributes, mixed variable types, and lots of missing values.