Title: Data Mining and Knowledge Acquisition Chapter 9 Case Studies
1Data Mining and Knowledge Acquisition Chapter 9 Case Studies
2Chapter 9 Case Studies
- Churn Modeling in Wireless Communications
- Web usage mining
3- The choice of tool
- Type of model to build and which parameters to set
- Algorithm-specific choices, such as pruning decision trees
- How to segment the data for modeling
- The size and density of the model set
- How to handle the time element
- Which data to include and how to calculate the derived variables
4Segmenting the model set
- Club members: two thirds, high value
- Non club members
- Recent customers
- Joined in the previous eight or nine months
- Insufficient billing history
5Churn Modeling in Wireless Communications
- Predict which customers are likely to leave in the near future
6The Wireless Telephone Industry
- Rapidly maturing
- Initial stages
- Exponential growth
- Churn is not a problem
- Many more new customers are joining than churning
- Many new customers for every churner
- Lost customers / new customers: 15%
7- As the industry matures
- Lost customers / new customers: 80%
8Some Differences from Other Industries
- Relatively high cost of acquisition
- Retaining an existing customer is much more valuable than attracting a new one
- No direct customer contact
- Brand management and direct marketing
- Tremendous amount of data
9The Business Problem
- The largest company in a developing country
- Investing in DSS technologies
- Deregulated market
- Several entrants
- Market maturing (one third of population)
- From reactive marketing to proactive customer management
- Focus on existing customers
- How to keep them
- Make them more profitable
10Project Background
- Parallel with development of data warehouse
- DSS based on relational OLAP by MicroStrategy
- Slice and dice marketing and sales data
- Handset type, region, time of day
- Questions such as
- What is the churn rate in April or May for club and non-club members?
11Specifics about the Market
- About 5 million customers
- Mostly in cities
- Type      Number      Percent   Churn rate (%)
- Club      1,500,000   30        1.3
- Nonclub   3,500,000   70        0.9
- Average calls per customer: around 12
- Incoming calls are not charged, so no data is collected on them
- Club members are more valuable: special promotions, discounts and coupons
12What is Churn?
- Involuntary: not paying the bill for several months
- Predicting may reduce losses
- Voluntary: everything that is not involuntary
- Models for voluntary churn should not predict involuntary churn
- Involves
- Moving out of the service area
- Passing away
- Being lured to another service provider
- Do not develop different models for each group
13Why useful
- Churn prevention programs
- Discounts on air time
- Free incoming minutes
- Other promotions
- Predict customer lifetime
- Expected lifetime = 1/(churn rate)
- Calculate lifetime value
- Prioritizing customer segments
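The lifetime rule above can be tried on the churn rates from the market table. This is a minimal sketch: it assumes those rates are monthly percentages (the slides do not state the period), and the average monthly revenue figure is purely hypothetical.

```python
def expected_lifetime_months(monthly_churn_rate):
    """Expected customer lifetime in months, using lifetime = 1 / churn rate."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("churn rate must be in (0, 1]")
    return 1.0 / monthly_churn_rate


def lifetime_value(monthly_churn_rate, avg_monthly_revenue):
    """Rough lifetime value: expected lifetime times average monthly revenue."""
    return expected_lifetime_months(monthly_churn_rate) * avg_monthly_revenue


# Assumption: 1.3 and 0.9 are monthly churn rates in percent.
club_lifetime = expected_lifetime_months(0.013)      # about 77 months
nonclub_lifetime = expected_lifetime_months(0.009)   # about 111 months

# Hypothetical revenue of 50 currency units per month, for illustration only.
club_value = lifetime_value(0.013, 50.0)

print(round(club_lifetime, 1), round(nonclub_lifetime, 1), round(club_value))
```

Under these assumptions the lower churn rate of non-club members translates directly into a longer expected lifetime, which is why revenue has to be brought in before segments can be prioritized.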
14Three Goals
- Near-term goal: identify a list of probable churners
- Medium-term goal: build a churn management application
- Long-term goal: complete customer relationship management
15Near-term goal: identify a list of probable churners
- The marketing department needs the top 10,000
- Not a score for each customer
- Club members only
- List by the 24th of the month
- The 10,000 club members most likely to churn, by the 24th of each month
16Medium-term goal: build a churn management application (CMA)
- Aims
- Running churn models
- Managing models
- Data analysis before and after modeling
- Import and transform data
- Export churn scores
- From building a data mining model to automating models as much as possible
17- Automation brings new requirements
- Changing the modeling technique every month
- Manually pruning decision trees
- Incompatible with clustering
- Needs very reasonable default settings for modeling parameters
- Precludes some hybrid techniques, such as first using decision trees for variable selection and then feeding the result into neural nets or logistic regression
18Approach to Building a churn model
- Define churn
- Inventory available data
- Customer information file: age, gender, ZIP
- Service account file: activation date, ...
- Billing systems: number of calls, ...
- Building models
- Decision trees
- Deploying scores
19- Variables
- Explain phenomena in the real world
- As opposed to mathematical transformations
- Derived variables
- Growth rate of number of calls over time
- Proportion of calls of different types
- Change in proportions
- Calls to customer service
20Measure the scores against what really happens
- How close are the estimated probabilities to the actual churn probabilities for each group?
- Are the churn scores relatively true?
- Does a higher churn score imply a higher probability of churn?
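One way to check these questions is to rank customers by score, cut the ranking into equal groups, and compare each group's mean score with its actual churn rate. The sketch below is an illustration only: the function name, the group count and the ten scored customers are all hypothetical.

```python
def lift_by_group(scores, churned, n_groups=5):
    """Rank customers by score and compare each group with the overall churn rate.

    Returns one (mean_score, actual_churn_rate, lift) tuple per group,
    highest-scoring group first.
    """
    if len(scores) != len(churned) or len(scores) < n_groups:
        raise ValueError("need equal-length inputs with at least n_groups items")
    overall_rate = sum(churned) / len(churned)
    if overall_rate == 0:
        raise ValueError("no churners in the data, lift is undefined")
    ranked = sorted(zip(scores, churned), key=lambda pair: pair[0], reverse=True)
    size = len(ranked) // n_groups
    rows = []
    for g in range(n_groups):
        start = g * size
        # The last group absorbs any remainder.
        end = (g + 1) * size if g < n_groups - 1 else len(ranked)
        group = ranked[start:end]
        mean_score = sum(s for s, _ in group) / len(group)
        actual_rate = sum(c for _, c in group) / len(group)
        rows.append((mean_score, actual_rate, actual_rate / overall_rate))
    return rows


# Hypothetical scores and outcomes (1 = churned) for ten customers.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
churned = [1, 1, 1, 0, 0, 1, 0, 0, 0, 0]

rows = lift_by_group(scores, churned)
rates = [rate for _, rate, _ in rows]

# "Relatively true" scores: the churn rate never rises as the score falls.
is_monotone = all(a >= b for a, b in zip(rates, rates[1:]))
print(rows, is_monotone)
```

Comparing `mean_score` with `actual_churn_rate` in each row addresses the first question (closeness of the estimated probabilities), while `is_monotone` addresses the other two.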
21The Data
Account
Usage by hour
Service
Monthly bill summaries
22- Service: access to the network provided to a single telephone number
- Churn at the service level was the primary interest
- Account: multiple services may share the same account
- Several phones in a household
- Customer: least defined of all
23- Billing system to warehouse
24Historical Churn Rates
- A predictor of churn in the near future is the recent history of churn in various dimensions
- Handset churn rate
- Demographic churn rate
- Gender, age group, geographic area
- For a total of several hundred combinations
- ZIP code churn rate
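A historical churn rate along one dimension is just the share of churners within each value of that dimension, attached back to every customer as an input variable. The following sketch uses hypothetical record fields (`handset`, `churned`) and toy data; in practice the rates would be computed from a period earlier than the one being predicted, so that the variable does not leak the outcome.

```python
from collections import defaultdict


def churn_rate_by(records, key):
    """Churn rate for each distinct value of records[...][key]."""
    totals = defaultdict(int)
    churners = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        if record["churned"]:
            churners[record[key]] += 1
    return {value: churners[value] / totals[value] for value in totals}


# Hypothetical history from an earlier period.
history = [
    {"handset": "A", "churned": False},
    {"handset": "A", "churned": False},
    {"handset": "A", "churned": True},
    {"handset": "A", "churned": False},
    {"handset": "B", "churned": True},
    {"handset": "B", "churned": False},
]

handset_rates = churn_rate_by(history, "handset")

# Attach the historical rate to a current customer as a model input.
customer = {"handset": "B"}
customer["handset_churn_rate"] = handset_rates.get(customer["handset"], 0.0)
print(handset_rates, customer)
```

The same function can be called with a demographic or ZIP code field to produce the other rates listed above.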
25Customer and Account Level
- Social security number
- Market id: splits the service area into different marketing regions
- Age and gender are not so accurate, but the social security number can be used to derive them
- Inaccurate data is self-reported
- Income and occupation not used
26Service Level
- Activation date and reason for activation
- Deactivation date for churning customers
- Features ordered by the customer
- Billing plan
- Handset type
- Dealer where the service was activated
- More accurate, except for dealers
27Billing History
- Monthly summary for nine months
- Total amount billed, late charge and amount overdue
- All calls (number of calls and amount billed)
- Overseas calls (number and amount billed)
- Fee-paid services
- Directory assistance charges
- Provides several time series
28Rejecting some Variables
- Variables that cheat
- Future deactivation dates
- Identifiers: customer id, social security number, phone number
- Very high skew: almost all values are identical
- Categories with too many values
- Group into larger units: dealer location into market area
- Additional lookup information: weight, manufacturer
- Historical churn rates included
29- Absolute dates to relative dates
- Number of days to present
- Number of days since activation rather than activation date
- Seasonality information
- Store year or month or day of month
- Untrustworthy values
- Customer information is collected by the sales force when customers sign up for service
- No incentive for data collection
- Salary, occupation
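The conversion from absolute to relative dates can be sketched with the standard library. The function and field names here are hypothetical, as are the two example dates; the point is only that the model sees a duration plus optional seasonality fields rather than a raw calendar date.

```python
from datetime import date


def relative_date_features(activation_date, as_of):
    """Replace an absolute activation date with relative and seasonal features."""
    return {
        # Relative date: number of days from activation to the reference date.
        "days_since_activation": (as_of - activation_date).days,
        # Seasonality information kept as separate fields.
        "activation_year": activation_date.year,
        "activation_month": activation_date.month,
        "activation_day_of_month": activation_date.day,
    }


# Hypothetical customer activated on 15 March 1999, scored on 30 September 1999.
features = relative_date_features(date(1999, 3, 15), date(1999, 9, 30))
print(features)
```

Because `days_since_activation` is measured against the reference date, the same model can be reapplied in a later month without the date variables drifting out of the range seen in training.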
30Derived Variables
- Make sense
- Can be explained to business users
- Combination of variables
- Even if they do not make apparent sense
- From billing system
- Summation of a variable over all months, and its variance
- The ratio of each month's value to the total
- Ratios between successive months and between the first and last months
- Ratios within a month, such as domestic or overseas usage to total number of calls within the month
31- Some are redundant
- If the number of calls in a month is 0, then
- All ratios are 0
- Additional variables
- Age of customer, length of service
- What portion of her life she has been a customer: length/age
- Rough estimate of the customer's worth
- Length of service x average revenue per month
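The billing-based derived variables can be illustrated on one monthly series. This is a sketch under assumptions: the function and key names are hypothetical, the four-month series is made up, and a ratio with a zero denominator is set to 0, following the convention stated above for months with no calls.

```python
from statistics import pvariance


def derived_billing_variables(monthly_calls):
    """Totals, variance and ratios derived from one monthly time series."""
    total = sum(monthly_calls)
    first, last = monthly_calls[0], monthly_calls[-1]
    return {
        "total": total,
        "variance": pvariance(monthly_calls),
        # Ratio of each month's value to the total (0 when there is no usage).
        "share_by_month": [m / total if total else 0.0 for m in monthly_calls],
        # Ratio of each month to the month before it (0 when the earlier month is 0).
        "successive_ratios": [
            cur / prev if prev else 0.0
            for prev, cur in zip(monthly_calls, monthly_calls[1:])
        ],
        # Ratio between the last and the first month.
        "last_to_first": last / first if first else 0.0,
    }


def customer_worth(length_of_service_months, avg_revenue_per_month):
    """Rough worth estimate: length of service times average monthly revenue."""
    return length_of_service_months * avg_revenue_per_month


# Hypothetical number of calls over four months, ending in a month with no usage.
variables = derived_billing_variables([10, 20, 30, 0])
worth = customer_worth(24, 40.0)  # hypothetical: 24 months at 40 per month
print(variables, worth)
```

In this toy series the final successive ratio and `last_to_first` both drop to 0, which is the "decline to 0 usage in the most recent month" pattern that the later slides flag as a strong churn signal.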
32Lessons about building churn models
- Finding the most significant variables
- History is the best predictor of the future
- Churn rate of the handset
- Churn by demographics or ZIP code
- Number of different telephones held by a customer
- Customers with multiple telephones are much less likely to churn
- Number of feature changes, age, market type
- Decline to 0 usage in the most recent month
33Listening to the business users
- Before starting
- Assign a churn score to each customer
- Marketing department needs the top 10,000 churners that are club members
- Initially by using a single model
- Two different models for club and non-club members
34Listening to the data
- Not all customers have 6 months of history
- Develop another model for recent customers
- Much more accurate
- Lift value is close to the theoretical prediction
- Billing data not available?
- Variables
- Handset is one
- Billing plan is family basic
- Handset has a high churn rate
- Existing customers get discounts if they give up the handset and join family plans
35Including historical churn rates
- Past is the best predictor of the future
- Variables
- Handset model
- Demographic: age, income, ...
- By area: ZIP or market area
- Usage patterns
- Breaking into several dimensions
- Quantiles for total billing, total number of calls, average duration
36Composing the model set
- I: one month of history
- Training set has 7 times higher
- II: seven months of history
- Seven months of history
- Churners are in one month
- Rich in history but only for one month (September)
- May overfit data
- III: 4 months of history for predicting 3 months
37- When historical data is limited
- When the data warehouse is recently built
- Customer growth is so rapid
- Have little history
- Size and density
- Prefer 20-40 percent churners
- Model set as large as possible
- High oversampling rate
- Actual rates are on the order of 1 percent
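A model set with a chosen churner density can be built by keeping every churner and sampling the non-churners down. The sketch below is one simple way to do this, not necessarily the procedure used in the case study; the function name, the `churned` field, the fixed seed and the toy population are all assumptions.

```python
import random


def build_model_set(customers, target_density=0.30, seed=0):
    """Keep all churners and sample non-churners to reach the target density."""
    if not 0 < target_density < 1:
        raise ValueError("target_density must be between 0 and 1")
    churners = [c for c in customers if c["churned"]]
    stayers = [c for c in customers if not c["churned"]]
    # Number of non-churners needed so that churners make up target_density.
    wanted = round(len(churners) * (1 - target_density) / target_density)
    wanted = min(wanted, len(stayers))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    model_set = churners + rng.sample(stayers, wanted)
    rng.shuffle(model_set)
    return model_set


# Hypothetical population of 1,000 customers with a 1 percent churn rate.
population = [{"id": i, "churned": i < 10} for i in range(1000)]

model_set = build_model_set(population, target_density=0.30)
density = sum(c["churned"] for c in model_set) / len(model_set)
print(len(model_set), round(density, 3))
```

With only 10 churners, a 30 percent density leaves a model set of 33 rows, which shows why density and size pull against each other: the model set can only be as large as the number of available churners allows.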
38Build a model for churn management application
- Model is rebuilt automatically
- Avoiding manual pruning and other manual processes
- Much effort to adjust parameters for not one model but
39Listening to the data to determine model parameters
- Many models were built
- High density is required for churners
- Oversampling
- 30 percent found to be better
40- Defining churn
- Interesting: customers who leave for a competitor
- Uninteresting: involuntary churn, for not paying
- How the churn results will be used
- A model for predicting lifetime value
- Listing high-value customers for a campaign
- Identifying data requirements
- Include historical churn rates for handsets and demographic rates
- Models slide in time windows
41Segmentation
- Profiling: using data to profile or describe a group of customers or prospects
- Segmentation: splitting the database into different sections and segments
- Market driven
- Data driven
- Demographic
- Psychographic
- Buying behavior
- Risk pattern
- Levels of profitability
42RFM: Recency, Frequency, Monetary Value
- Recency: number of months since the last purchase
- Predicting response to a subsequent offer
- If you have recently purchased something from the company, you are more likely to make another purchase
- Frequency: number of purchases
- Total or
- Within a specified period
- Monetary value: total dollar amount
- Total or within a specified period
43How to create RFM scores
- Sort the data three times, once by each variable
- Each list is divided into equal slices
- Quantiles
- People in the top segment have a score of 5
- Those in the second segment have a score of 4
- Construct the RFM cube
- Customers in the same cell have the same score
44- Customers who have purchased most recently, buy frequently and spend lots of money are in the 555 cell of the cube
- A customer who has made a recent purchase, buys frequently but does not spend lots of money might be in 542
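The quantile procedure can be sketched as follows: each variable is ranked separately, the ranking is cut into five equal slices scored 5 down to 1, and the three digits are joined into a cell label such as 555. The function name and the five customers are hypothetical, and ties are simply broken by input order.

```python
def quantile_scores(values, n_slices=5, higher_is_better=True):
    """Score each value from n_slices (best slice) down to 1 (worst slice)."""
    order = sorted(
        range(len(values)), key=lambda i: values[i], reverse=higher_is_better
    )
    scores = [0] * len(values)
    for rank, index in enumerate(order):
        # Equal-sized slices: the first len/n_slices ranks get the top score.
        scores[index] = n_slices - (rank * n_slices) // len(values)
    return scores


# Hypothetical data for five customers.
months_since_last_purchase = [1, 3, 6, 12, 24]
number_of_purchases = [10, 8, 2, 4, 1]
total_spend = [120, 300, 25, 500, 60]

# For recency, a smaller number of months is better.
r = quantile_scores(months_since_last_purchase, higher_is_better=False)
f = quantile_scores(number_of_purchases)
m = quantile_scores(total_spend)

# Cell of the RFM cube for each customer, e.g. "553".
cells = [f"{ri}{fi}{mi}" for ri, fi, mi in zip(r, f, m)]
print(cells)
```

With only five customers each slice holds one person; on a real customer base each of the 125 cells would hold a group of customers who all share the same score.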
45- Each of the buckets for each variable may be recoded
- Bucket_1 coded as 9
- Bucket_2 coded as 6
- Bucket_3 coded as 3
- Weights are given to each of the three variables
- Usually recency has the highest weight
- Example: recency 5, frequency 3, monetary 1
- A score is calculated for each customer
- Based on the encodings and weights
46Example RFM scores
- Customer   Recency   Frequency   Monetary
- Smith      09.2001   10          322
- Jones      10.2000   2           25
- Jonson     10.1999   4           120
- Frequency and monetary values are for the last 24 months
- Weights
- Recency 5
- Frequency 3
- Monetary 2
47- Rules of encoding
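- Recency: 20 points if the last purchase was within the last 3 months
- 10 points if within 6 months
- 5 points if within 9 months
- 3 points if within 12 months
- 1 point if within 24 months
- Frequency: number of purchases within 24 months x 4 points (maximum 20 points)
- Monetary: value spent in the last 24 months x 0.10 (maximum 20 points)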
48- Customer   Recency   Frequency   Monetary
- Smith      20        20          20
- Jones      3         8           2.5
- Jonson     1         16          12
- Customer   Recency   Frequency   Monetary   Total
- Smith      20x5      20x3        20x2       200
- Jones      3x5       8x3         2.5x2      44
- Jonson     1x5       16x3        12x2       77
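The worked example can be reproduced in a few lines. The encoding rules and weights come from the preceding slides; the reference month of October 2001, the inclusive month boundaries and the 0 points beyond 24 months are assumptions chosen because they are consistent with the table.

```python
WEIGHTS = {"recency": 5, "frequency": 3, "monetary": 2}


def months_between(last_purchase, as_of):
    """Whole months between two (year, month) pairs."""
    return (as_of[0] - last_purchase[0]) * 12 + (as_of[1] - last_purchase[1])


def recency_points(months):
    """20/10/5/3/1 points for a purchase within 3/6/9/12/24 months."""
    for limit, points in ((3, 20), (6, 10), (9, 5), (12, 3), (24, 1)):
        if months <= limit:
            return points
    return 0  # assumption: no points beyond 24 months


def frequency_points(purchases):
    return min(purchases * 4, 20)


def monetary_points(amount):
    return min(amount * 0.10, 20)


def rfm_score(last_purchase, purchases, amount, as_of=(2001, 10)):
    """Weighted RFM score; as_of is an assumed reference month."""
    r = recency_points(months_between(last_purchase, as_of))
    f = frequency_points(purchases)
    m = monetary_points(amount)
    return (
        WEIGHTS["recency"] * r
        + WEIGHTS["frequency"] * f
        + WEIGHTS["monetary"] * m
    )


customers = {
    "Smith": ((2001, 9), 10, 322),
    "Jones": ((2000, 10), 2, 25),
    "Jonson": ((1999, 10), 4, 120),
}
totals = {name: rfm_score(*data) for name, data in customers.items()}
print(totals)
```

Note how the 20-point caps flatten Smith's frequency and monetary values: spending 322 earns no more points than spending 200 would, which is one concrete form of the arbitrariness that RFM encodings are criticized for.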
49Disadvantages of RFM
- Based on only the three variables
- Other, more valuable information may exist in the data warehouse
- Recodings and weights of variables are arbitrary
- Not statistically based
- Difficult to track customer movements between segments
- Customers move because of the actions of other customers
50Web Mining
- Web content mining
- Web structure mining
- Web usage mining
51Web Usage Mining
52Description of data
- Click stream: 1,148.6 MB
- 777,480 rows
- First quarter of 2000
- Data has
- Session
- Assortment
- Content
- Product
- Demographic
53The problem
- Clustering sessions
- Transformations
- K-means used
- 0-1 normalization for
- Categorical variables
- Order or view product
- Yes/no mapped to 0/1
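The transformations and the clustering step can be sketched together in plain Python. Everything specific here is an assumption for illustration: the four sessions, the column names, the choice of two clusters, and the use of two hand-picked sessions as initial centroids (k-means results depend on initialization, and a real run would use the study's own tool and settings).

```python
def min_max_normalize(column):
    """Rescale numeric values to the 0-1 range."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


def encode_yes_no(column):
    """Map yes/no answers to 1/0."""
    return [1 if v == "yes" else 0 for v in column]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, initial_centroids, n_iterations=10):
    """Plain k-means: alternate assignment and centroid update steps."""
    centroids = [list(c) for c in initial_centroids]
    assignments = [0] * len(points)
    for _ in range(n_iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignments, centroids


# Hypothetical sessions: average time, total clicks, whether an order was placed.
avg_time = [40, 42, 50, 52]
total_clicks = [5, 7, 30, 33]
ordered = ["no", "no", "yes", "yes"]

points = list(zip(min_max_normalize(avg_time),
                  min_max_normalize(total_clicks),
                  encode_yes_no(ordered)))

assignments, centroids = k_means(points, [points[0], points[2]])
print(assignments)
```

Putting every variable on the same 0-1 scale matters because k-means relies on distances: without it, a variable measured in tens of clicks would dominate a 0/1 flag.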
54Clustering Usage Sessions
- A session: an episode of interaction between the web user and the web server
- Ends when the user leaves the web site
- Cluster the sessions based on their common properties
- Applications in education, e-commerce
- Another problem: clustering web users
- Requires user data
55Variables
- Avgtime: average session time
- Totalclic: total number of clicks in a session
- AvAsLev: average assortment level
- Ass_UB: assortment unique boutique
- Seasonal, brand order, sale assortment, lifestyle, M -
56- Clusters
- Variables    1     2     3     4     5     6
- Avgtime      47    43    53    39    41    50
- Tot_clicks   18    6     5     25    33    7
- Ass_avg      3.96  3.24  1.98  3.24  2.58  3.12
- Pr_yn        yes   no    no    yes   yes   yes
- Or_yn        yes   no    no    no    yes   no
57Classification Problem
- 1. Predict whether a visitor will visit the legcare product subcategory in the subsequent three clicks
- 2. Predict whether a visitor will visit the legwear product subcategory in the subsequent three clicks
- 3. Predict whether a visitor will click on another page or leave the session
58- Functionality: classification
- Algorithm: decision trees
- Tool: Answer Tree