Data Mining and Knowledge Acquisition Chapter 9 Case Studies

Description:

The choice of tool. Type of model to build and which ... How to handle the time element. Which data to include and how to calculate the derived variables ...

Slides: 59

Provided by: jiaw196

Transcript and Presenter's Notes

Title: Data Mining and Knowledge Acquisition Chapter 9 Case Studies

1
Data Mining and Knowledge Acquisition Chapter 9 Case Studies

Summer
2005

2
Chapter 9 Case Studies

Churn Modeling in Wireless Communications
Web usage mining

The choice of tool
Type of model to build and which parameters to set
Algorithm-specific choices: pruning decision trees
How to segment the data for modeling
The size and density of the model set
How to handle the time element
Which data to include and how to calculate the derived variables

4
Segmenting the model set

Club members: two thirds high value
Non club members
Recent customers
Joined in the previous eight or nine months
Insufficient billing history

5
Churn Modeling in Wireless Communications

Predict which customers are likely to leave in
the near future

6
The Wireless Telephone Industry

Rapidly maturing
Initial stages
Exponential growth
Churn is not a problem
Many more new customers are joining than churning
Many new customers for every churner
Lost customers / new customers: 15%

As the industry matures
Lost customers / new customers: 80%

8
Some Differences from Other Industries

Relatively high cost of acquisition
Retaining an existing customer is much more valuable than attracting a new one
No direct customer contact
Brand management and direct marketing
Tremendous amount of data

9
The Business Problem

The largest company in a developing country
Investing in DSS technologies
Deregulated market
Several entrants
Market maturing (one third of population)
From reactive marketing to proactive customer management
Focus on existing customers
How to keep them
How to make them more profitable

10
Project Background

Parallel with development of the data warehouse
DSS based on relational OLAP by MicroStrategy
Slice and dice marketing and sales data
Handset type, region, time of day
Questions such as:
What is the churn rate in April or May for club and non-club members?

11
Specifics about the Market

About 5 million customers
Mostly in cities
Type      Number      Percent   Churn rate (%)
Club      1,500,000   30        1.3
Nonclub   3,500,000   70        0.9
Average number of calls per customer: around 12
Incoming calls are not charged, so no data is collected on them
Club members are more valuable: special promotions, discounts, and coupons

12
What is Churn

Involuntary: not paying the bill for several months
Predicting may reduce losses
Voluntary: everything that is not involuntary
Models for voluntary churn should not predict involuntary churn
Involves
Move out of service area
Pass away
Lured to another service provider
Do not develop different models for each group

13
Why useful

Churn prevention programs
Discounts on air time
Free incoming minutes
Other promotions
Predict customer lifetime
Expected lifetime = 1/(churn rate)
Calculate lifetime value
Prioritizing customer segments
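
The lifetime rule on this slide can be tried out on the churn rates quoted for club and non-club members. This is only a minimal sketch: it assumes the 1.3 and 0.9 figures are monthly churn rates in percent (the slides do not state the period) and that the rate stays constant over time. The helper name expected_lifetime is hypothetical.

```python
def expected_lifetime(churn_rate):
    """Expected lifetime in periods for a constant per-period churn rate (a fraction)."""
    if not 0 < churn_rate <= 1:
        raise ValueError("churn_rate must be in (0, 1]")
    return 1.0 / churn_rate

# Assumption: the rates from the market slide are monthly percentages.
club_lifetime = expected_lifetime(0.013)     # club members, 1.3 percent
nonclub_lifetime = expected_lifetime(0.009)  # non-club members, 0.9 percent

print(round(club_lifetime, 1), round(nonclub_lifetime, 1))
```

Under these assumptions a club member stays roughly 77 months and a non-club member roughly 111 months. Multiplying such a lifetime by an average monthly revenue would give one rough lifetime value.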

14
Three Goals

Near-term goal: identify a list of probable churners
Medium-term goal: build a churn management application
Long-term goal: complete customer relationship management

15
Near-term goal: identify a list of probable churners

The marketing department needs the top 10,000
Not a score for each customer
Club members only
List by the 24th of the month
10,000 club members most likely to churn by the 24th of each month
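
Producing a fixed-size list rather than a score for everyone amounts to a filter followed by a top-N selection. The sketch below assumes each customer is a dictionary with hypothetical keys id, club, and score; the scores themselves would come from whatever churn model is in use.

```python
import heapq

def top_churners(customers, n=10_000):
    """Return the n club members with the highest churn score, best first."""
    club_members = (c for c in customers if c["club"])
    return heapq.nlargest(n, club_members, key=lambda c: c["score"])

# Hypothetical scored customers, for illustration only.
customers = [
    {"id": "A", "club": True, "score": 0.12},
    {"id": "B", "club": False, "score": 0.40},
    {"id": "C", "club": True, "score": 0.31},
    {"id": "D", "club": True, "score": 0.05},
]
shortlist = [c["id"] for c in top_churners(customers, n=2)]
print(shortlist)
```

Customer B has the highest score but is not a club member, so it is left off the list.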

16
Medium-term goal: build a churn management application (CMA)

Aims
Running churn models
Manage models
Data analysis before and after modeling
Import data and transform
Export churn scores
From building a data mining model to automating
models as much as possible

Automation: new requirements
Incompatible with changing the modeling technique every month
Incompatible with manually pruning decision trees
Incompatible with clustering
Needs very reasonable default settings for modeling parameters
Precludes some hybrid techniques: first decision trees for variable selection, then input into neural nets or logistic regression

18
Approach to Building a churn model

Define churn
Inventory available data
Customer information file: age, gender, ZIP
Service account file: activation date, ...
Billing system: number of calls, ...
Building models
Decision trees
Deploying scores

Variables
Explain phenomena in the real world
As opposed to mathematical transformations
Derived variables
Growth rate of number of calls over time
Proportion of calls of different types
Change in proportions
Calls to customer service

20
Measure the scores against what really happens

How close are the estimated probabilities to the actual churn probabilities for each group?
Are the churn scores relatively true?
Does a higher churn score imply a higher probability of churn?
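
One way to check both questions is to rank customers by score, cut the ranking into equal groups, and compare the actual churn rate of each group. The sketch below is an illustration under assumptions: the scores and outcomes are made up, and the function name churn_rate_by_group is hypothetical.

```python
def churn_rate_by_group(scores, churned, n_groups=5):
    """Rank by score (highest first), split into equal groups, return actual churn rate per group."""
    if len(scores) < n_groups:
        raise ValueError("need at least one customer per group")
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    size = len(ranked) // n_groups
    rates = []
    for g in range(n_groups):
        # The last group absorbs any remainder.
        chunk = ranked[g * size:] if g == n_groups - 1 else ranked[g * size:(g + 1) * size]
        rates.append(sum(flag for _, flag in chunk) / len(chunk))
    return rates

# Hypothetical scores and observed outcomes (1 = churned).
scores = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
churned = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

rates = churn_rate_by_group(scores, churned)
overall = sum(churned) / len(churned)
top_lift = rates[0] / overall  # lift of the highest-scoring group
is_monotone = all(a >= b for a, b in zip(rates, rates[1:]))
print(rates, round(top_lift, 2), is_monotone)
```

If the group rates fall as the score falls, a higher score does imply a higher probability of churn. Comparing each group's rate with its mean predicted probability would address the first question on the slide.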

21
The Data

Customer

Account
Usage by hour
Service
Monthly bill summaries
22

Service: access to the network provided to a single telephone number.
Churn at the service level was of primary interest
Account: multiple services may share the same account
Several phones in a household
Customer: least well defined of all

Billing system to warehouse

24
Historical Churn Rates

A predictor of churn in the near future is the recent history of churn along various dimensions
Handset churn rate
Demographic churn rate
Gender, age group, geographic area
For a total of several hundred combinations
ZIP code churn rate
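
A historical churn rate along one dimension is a grouped ratio: churners divided by customers, per value of the dimension. The sketch below assumes records are dictionaries with a hypothetical churned flag taken from a past period, so that the rate does not leak information from the period being predicted.

```python
from collections import defaultdict

def historical_churn_rates(records, dimension):
    """Churn rate per value of `dimension`, computed from past-period records."""
    totals = defaultdict(int)
    churners = defaultdict(int)
    for record in records:
        key = record[dimension]
        totals[key] += 1
        churners[key] += record["churned"]
    return {key: churners[key] / totals[key] for key in totals}

# Hypothetical past-period records.
records = [
    {"handset": "H1", "churned": 1},
    {"handset": "H1", "churned": 0},
    {"handset": "H1", "churned": 0},
    {"handset": "H1", "churned": 0},
    {"handset": "H2", "churned": 1},
    {"handset": "H2", "churned": 0},
]
handset_rates = historical_churn_rates(records, "handset")
print(handset_rates)
```

The same function could be called with a demographic or ZIP code key, and the resulting rate attached to each customer as an input variable.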

25
Customer and Account Level

Social security number
Market id splits service area into different
marketing regions
Age and gender are not so accurate, but the social security number can be used to derive them
Inaccurate because the data is self-reported
Income and occupation not used

26
Service Level

Activation date and reason for activation
Deactivation date for churning customers
Features ordered by the customer
Billing plan
Handset type
Dealer where the service was activated
More accurate, except for dealers

27
Billing History

Monthly summary for nine months
Total amount billed, late charges, and amount overdue
All calls (number of calls and amount billed)
Overseas calls (number and amount billed)
Fee paid services
Directory assistance charges
Provides several time series

28
Rejecting some Variables

Variables that cheat
Future deactivation dates
Identifiers: customer ID, social security number, phone number
Very high skew: almost all values are identical
Categories with too many values
Group into larger units: dealer location into market area
Additional lookup information: weight, manufacturer
Historical churn rates included

Absolute dates to relative dates
Number of days to present
Number of days since activation rather than activation date
Seasonality information
Store year or month or day of month
Untrustworthy values
Customer information is collected by the sales force when customers sign up for service
No incentive for data collection
Salary, occupation
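
Converting an absolute date into relative and seasonal features could look like the sketch below. The dates are invented for illustration, and the function and key names are hypothetical; the reference date as_of stands for the "present" of the model set.

```python
from datetime import date

def relative_date_features(activation, as_of):
    """Replace an absolute activation date with model-friendly features."""
    return {
        "days_since_activation": (as_of - activation).days,  # relative date
        "activation_month": activation.month,                # seasonality
        "activation_day_of_month": activation.day,           # seasonality
    }

# Hypothetical activation date and reference date.
features = relative_date_features(date(2000, 1, 15), as_of=date(2000, 9, 30))
print(features)
```

Because the features are measured relative to as_of, a model built on one time window can be applied to a later window without retraining on new absolute dates.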

30
Derived Variables

Make sense
Can be explained to business users
Combination of variables
Even if they do not make apparent sense
From billing system
Sum of a variable over all months and its variance
The ratio of each month's value to the total
Ratios between successive months and between the first and last months
Ratios within a month, such as domestic or overseas usage to total number of calls within a month

Some are redundant
If the number of calls in a month is 0, then
all ratios are 0
Additional variables
Age of customer, length of service
Portion of the customer's life she has been a customer: length/age
Rough estimate of customer's worth
Length of service x average revenue per month
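
The derived variables listed on these two slides can be sketched for a single monthly series as follows. This is an illustration under assumptions: the input is a list of monthly call counts, ratios with a zero denominator are set to 0 (following the zero-usage remark on the slide), population variance is used, and all function and key names are hypothetical.

```python
from statistics import pvariance

def ratio(numerator, denominator):
    """Ratio that falls back to 0 when the denominator is 0."""
    return numerator / denominator if denominator else 0.0

def derived_billing_variables(monthly_calls):
    """Sum, variance, and ratios for one monthly time series."""
    total = sum(monthly_calls)
    return {
        "total": total,
        "variance": pvariance(monthly_calls),
        "share_of_total": [ratio(m, total) for m in monthly_calls],
        "successive_ratios": [ratio(b, a) for a, b in zip(monthly_calls, monthly_calls[1:])],
        "last_to_first": ratio(monthly_calls[-1], monthly_calls[0]),
    }

def tenure_variables(length_months, age_years, avg_monthly_revenue):
    """Share of life spent as a customer and a rough worth estimate."""
    return {
        "life_share": ratio(length_months, age_years * 12),
        "rough_worth": length_months * avg_monthly_revenue,
    }

# Hypothetical customer: four months of call counts, 24 months of service.
billing = derived_billing_variables([10, 20, 0, 30])
tenure = tenure_variables(length_months=24, age_years=40, avg_monthly_revenue=50)
print(billing["successive_ratios"], tenure)
```

The same function would be applied to each billing series (amount billed, overseas calls, and so on), which is why many of the resulting variables end up redundant.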

32
Lessons about building churn models

Finding the most significant variables
History is the best predictor of the future
Churn rate of the handset
Churn by demographics or ZIP code
Number of different telephones held by a customer
Customers with multiple telephones are much less likely to churn
Number of feature changes, age, market type
Decline to zero usage in the most recent month

33
Listening to the business users

Before starting
Assign a churn score to each customer
Marketing department needs the top 10,000 churners that are club members
Initially by using a single model
Two different models for club and non-club members

34
Listening to the data

Not all customers have 6 months of history
Develop another model for recent customers
Much more accurate
Lift value is close to the theoretical prediction
Billing data not available?
Variables
Handset is one
Billing plan is family basic
Handset has a high churn rate
Existing customers who leave their handset and join family plans get discounts

35
Including historical churn rates

Past is the best predictor of the future
Variables
Handset model
Demographic: age, income, ...
By area: ZIP or market area
Usage patterns
Breaking into several dimensions
Quantiles for total billing, total number of calls, average duration

36
Composing the model set

I: one month of history
Training set has 7 times higher
II: seven months of history
Seven months of history
Churners are in one month
Rich in history but only for one month (September)
May overfit data
III: four months of history for predicting three months

When historical data is limited
When the data warehouse was recently built
Customer growth is so rapid
Have little history
Size and density
Prefer 20-40 percent churners
Model set as large as possible
High oversampling rate
Actual rates are on the order of 1 percent
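
Raising the churner density from about 1 percent to the preferred range can be done by keeping every churner and sampling only as many non-churners as the target density allows. The sketch below takes that approach (sampling down the non-churners, which the slides refer to as oversampling); the function name, the churned key, and the 30 percent target are assumptions for illustration.

```python
import random

def build_model_set(customers, target_density=0.30, seed=0):
    """Keep all churners and sample non-churners so churners make up about target_density."""
    churners = [c for c in customers if c["churned"]]
    others = [c for c in customers if not c["churned"]]
    wanted = round(len(churners) * (1 - target_density) / target_density)
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sampled = rng.sample(others, min(wanted, len(others)))
    return churners + sampled

# Hypothetical population of 1,000 customers with a 1 percent churn rate.
population = [{"id": i, "churned": i < 10} for i in range(1000)]
model_set = build_model_set(population)
density = sum(c["churned"] for c in model_set) / len(model_set)
print(len(model_set), round(density, 3))
```

Keeping all churners is what makes the model set "as large as possible" for a given density. Scores from a model trained on such a set reflect the oversampled density, not the true rate, so they are best read as a ranking.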

38
Build a model for churn management application

Models are rebuilt automatically
Avoiding manual pruning and other manual processes
Much effort to adjust parameters for not one model
but

39
Listening to the data to determine model parameters

Many models were built
High density is required for churners
Oversampling
30 percent found to be better

Defining churn
Interesting: customers who leave for a competitor
Uninteresting: involuntary churn for not paying
How the churn results will be used
A model for predicting lifetime value
Listing high-value customers for a campaign
Identifying data requirements
Include historical churn rates for handsets and
demographic rates
Models slide in time windows

41
Segmentation

Profiling: using data to profile or describe a group of customers or prospects
Segmentation: splitting the database into different sections and segments
Market driven
Data driven
Demographic
Psychographic
Buying behavior
Risk pattern
Levels of profitability

42
RFM: Recency, Frequency, Monetary Value

Recency: number of months since the last purchase
Predicting response to a subsequent offer
If you have recently purchased something from the company, you are more likely to make another purchase
Frequency: number of purchases
Total or
Within a specified period
Monetary value: total dollar amount
Total or within a specified period

43
How to create RFM scores

Sort the data three times, once by each variable
Each list is divided into equal slices
Quantiles
People in the top segment have a score of 5
Those in the second segment have a score of 4
Construct the RFM cube
Customers in the same cell have the same score

Customers who have made the most recent purchases, buy frequently and spend lots of money are in the 555 cell of the cube
A customer who has made a recent purchase, buys
frequently but does not spend lots of money might
be in 542
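
The sort-and-slice procedure can be sketched as a rank-based scoring function applied once per variable. The customer values below are invented, ties are broken by input order, and the function name quantile_scores is hypothetical.

```python
def quantile_scores(values, n_slices=5, higher_is_better=True):
    """Score each value from n_slices (best slice) down to 1 (worst slice) by rank."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=higher_is_better)
    scores = [0] * len(values)
    for rank, index in enumerate(order):
        scores[index] = n_slices - (rank * n_slices) // len(values)
    return scores

# Hypothetical customers: months since last purchase, purchase count, amount spent.
recency_months = [1, 3, 6, 12, 24]
frequency = [10, 8, 2, 4, 1]
monetary = [50, 300, 20, 120, 10]

r = quantile_scores(recency_months, higher_is_better=False)  # fewer months is better
f = quantile_scores(frequency)
m = quantile_scores(monetary)
cells = [f"{a}{b}{c}" for a, b, c in zip(r, f, m)]
print(cells)
```

Each string is the customer's cell in the RFM cube. The first customer, for example, is recent and frequent but only a middling spender.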

Each of the buckets for each variable may be
recoded
Bucket_1 coded as 9
Bucket_2 coded as 6
Bucket_3 as 3
Weights are given to each of the three variables
Usually recency has the highest weight
Example: recency 5, frequency 3, monetary 1
A score is calculated for each customer
Based on the encodings and weights

46
Example RFM scores

Customer   Recency   Frequency   Monetary
Smith      09.2001   10          322
Jones      10.2000   2           25
Jonson     10.1999   4           120
Frequency and monetary values are for the last 24 months
Weights
recency 5
frequency 3
monetary 2

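Rules of encoding
Recency: 20 points if in the last 3 months
10 points if in the last 6 months
5 points if in the last 9 months
3 points if in the last 12 months
1 point if in the last 24 months
Frequency: number of purchases within 24 months x 4 points (maximum 20 points)
Monetary: value spent in the last 24 months x 0.10 (maximum 20 points)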

Customer   Recency   Frequency   Monetary
Smith      20        20          20
Jones      3         8           2.5
Jonson     1         16          12

Customer   Recency   Frequency   Monetary   Total
Smith      20x5      20x3        20x2       200
Jones      3x5       8x3         2.5x2      44
Jonson     1x5       16x3        12x2       77
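
The worked example can be reproduced with a short script. Two things here are assumptions and not stated on the slides: the reference month is taken to be 10.2001, chosen because it yields the recency points shown in the table, and purchases older than 24 months are given 0 points. The function names are hypothetical.

```python
RECENCY_RULES = [(3, 20), (6, 10), (9, 5), (12, 3), (24, 1)]  # (months, points)
WEIGHTS = {"recency": 5, "frequency": 3, "monetary": 2}

def recency_points(months_ago):
    """Points for the first rule whose month limit covers the purchase."""
    for limit, points in RECENCY_RULES:
        if months_ago <= limit:
            return points
    return 0  # assumption: nothing awarded beyond 24 months

def rfm_score(last_purchase, purchases, spent, as_of=(2001, 10)):
    """Weighted RFM score; dates are (year, month), as_of is an assumed reference month."""
    months_ago = (as_of[0] - last_purchase[0]) * 12 + (as_of[1] - last_purchase[1])
    r = recency_points(months_ago)
    f = min(purchases * 4, 20)   # 4 points per purchase, capped at 20
    m = min(spent * 0.10, 20)    # 0.10 points per unit spent, capped at 20
    return r * WEIGHTS["recency"] + f * WEIGHTS["frequency"] + m * WEIGHTS["monetary"]

smith = rfm_score((2001, 9), purchases=10, spent=322)
jones = rfm_score((2000, 10), purchases=2, spent=25)
jonson = rfm_score((1999, 10), purchases=4, spent=120)
print(smith, jones, jonson)
```

The three totals match the 200, 44, and 77 in the table above.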

49
Disadvantages of RFM

Based on only the three variables
Other more valuable information may exist in the
datawarehouse
Recodings and weights of variables are arbitrary
Not statistically based
Difficult to track customer movements between segments
Customers move because of the actions of other
customers

50
Web Mining

Web content mining
Web structure mining
Web usage mining

51
Web Usage Mining
52
Description of data

Click stream 1,148.6 MB
777480 rows
First quarter of 2000
Data has
Session
Assortment
Content
Product
demographic

53
The problem

Clustering sessions
Transformations
K-means used
0-1 normalization for
Categorical variables
Ordered or viewed a product
Yes/no coded as 0/1
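
The transformation and clustering steps can be sketched in plain Python as follows. The session rows are invented, the implementation is a basic k-means with random initial centers (the slides do not say which tool or settings were used), and the function names are hypothetical.

```python
import random

def min_max_normalize(rows):
    """Rescale every column to the 0-1 range; constant columns become 0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]

def k_means(points, k, iterations=20, seed=0):
    """Basic k-means: assign each point to the nearest center, then recompute centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centers[j])))
            for p in points
        ]
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical sessions: (average time, total clicks, ordered yes/no).
sessions = [(47, 18, "yes"), (45, 20, "yes"), (43, 6, "no"),
            (53, 5, "no"), (50, 7, "no"), (41, 33, "yes")]
encoded = [[t, clicks, 1.0 if ordered == "yes" else 0.0] for t, clicks, ordered in sessions]
normalized = min_max_normalize(encoded)
labels, centers = k_means(normalized, k=2)
print(labels)
```

Normalizing first keeps a wide-ranging variable such as click counts from dominating the distance against the 0/1 flags.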

54
Clustering Usage Sessions

A session: an episode of interaction between the web user and the web server
Ends when the user leaves the website
Cluster the sessions based on their common properties
Applications in education, e-commerce
Another problem: clustering web users
Requires user data

55
Variables

Avgtime: average session time
Totalclic: total number of clicks in a session
AvAsLev: average assortment level
Ass_UB: assortment unique boutique
Seasonal, brand order, sale assortment, life
style,M

                        Clusters
Variables    1      2      3      4      5      6
Avgtime      47     43     53     39     41     50
Tot_clicks   18     6      5      25     33     7
Ass_avg      3.96   3.24   1.98   3.24   2.58   3.12
Pr_yn        yes    no     no     yes    yes    yes
Or_yn        yes    no     no     no     yes    no

57
Classification Problem

1. Predict whether a visitor will visit the legcare product subcategory in the subsequent three clicks
2. Predict whether a visitor will visit the legwear product subcategory in the subsequent three clicks
3. Predict whether a visitor will click another page or leave the session
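
Before any classifier can be trained, each click in a session needs target labels for the three problems. The sketch below shows one way to derive them, assuming a session is an ordered list of page category names; the session contents and the function name next_click_labels are hypothetical.

```python
def next_click_labels(pages, position, horizon=3):
    """Target labels for the click at `position`, looking `horizon` clicks ahead."""
    upcoming = pages[position + 1:position + 1 + horizon]
    return {
        "visits_legcare": "legcare" in upcoming,   # problem 1
        "visits_legwear": "legwear" in upcoming,   # problem 2
        "continues": position + 1 < len(pages),    # problem 3: False means the visitor leaves
    }

# Hypothetical session, one page category per click.
session = ["home", "legwear", "cart", "legcare", "checkout"]
first = next_click_labels(session, 0)
second = next_click_labels(session, 1)
last = next_click_labels(session, len(session) - 1)
print(first, second, last)
```

Only the clicks up to and including the current position would be used as model inputs, so that the labels never leak into the features.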