Title: Kinshuk Jerath, CMU
1. Customer-Base Analysis Using Aggregated Data
(Or The Joys of RCSS)
- Kinshuk Jerath, CMU
- Peter S. Fader, Wharton (www.petefader.com)
- Bruce G. S. Hardie, LBS
2. Customer-Base Analysis
Track the purchasing of a cohort of customers
and make predictions about their future
purchasing (collectively and individually)
3. The Pareto/NBD Model (Schmittlein, Morrison, and Colombo 1987)
- Transaction Process
  - While active, the number of transactions made by a customer follows a Poisson process with transaction rate λ
  - Transaction rates are distributed gamma(r, α) across the population
- Dropout Process
  - Each customer has an unobserved lifetime of length τ, which is distributed exponential with dropout rate μ
  - Dropout rates are distributed gamma(s, β) across the population
- Astonishingly good fit and predictive performance
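The two processes above can be simulated directly. The following is a minimal sketch, not the authors' code: it reads gamma(r, α) and gamma(s, β) as shape/rate parameterizations (an assumption), and the function names and the parameter values in the usage line are hypothetical, chosen only for illustration.

```python
import random

def simulate_customer(r, alpha, s, beta, horizon, rng):
    """Return (repeat-transaction times, lifetime) for one customer over (0, horizon]."""
    # random.gammavariate takes (shape, scale); gamma(r, alpha) is read here
    # as shape r and rate alpha, so the scale is 1 / alpha (assumption).
    lam = rng.gammavariate(r, 1.0 / alpha)   # individual transaction rate
    mu = rng.gammavariate(s, 1.0 / beta)     # individual dropout rate
    tau = rng.expovariate(mu)                # unobserved lifetime
    end = min(tau, horizon)                  # customer is active only until dropout
    times = []
    t = rng.expovariate(lam)                 # Poisson process: exponential gaps
    while t <= end:
        times.append(t)
        t += rng.expovariate(lam)
    return times, tau

def simulate_cohort(n, r, alpha, s, beta, horizon, seed=0):
    """Simulate n independent customers with a reproducible seed."""
    rng = random.Random(seed)
    return [simulate_customer(r, alpha, s, beta, horizon, rng) for _ in range(n)]

# Illustrative values only; they are not estimates from any dataset.
cohort = simulate_cohort(500, r=0.5, alpha=10.0, s=0.6, beta=12.0, horizon=52.0, seed=1)
```

Each customer's transaction times stop at the earlier of the dropout time and the end of the observation window, which is what makes the lifetime unobserved in real data.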
4. Tracking Cumulative Repeat Transactions
5. Conditional Expectations
6. The Pareto/NBD works very well given
individual-level (disaggregate) data.
7. Barriers to Disaggregate Data
- Many firms may not (be able to) keep detailed individual-level records
- Corporate information silos make data integration difficult
- Wariness given high-profile stories on data loss
- Data protection laws (with bans on trans-border data flows)
- General weaknesses with the firm's information systems capabilities
- Anonymizing (and other statistical disclosure control methods) costly and potentially ineffective
8. Repeated Cross-Sectional Summary Data
9. Proof of Concept: Tuscan Lifestyles
10. Tuscan Lifestyles Data
11. Proof of Concept: Tuscan Lifestyles
12. Tuscanizing the Pareto/NBD
Under the Pareto/NBD, for a specific individual,
given λ and μ
For a randomly chosen individual
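A sketch of these two cases, derived here from the Poisson and exponential assumptions of the Pareto/NBD rather than taken from the slides. It computes the probability of x transactions in the unit interval (t, t+1] for an individual with known rates, then averages over gamma-distributed rates by Monte Carlo for a randomly chosen individual. The function names, the unit interval length, and the shape/rate reading of the gamma parameters are assumptions.

```python
import math
import random

def pmf_given_rates(x, t, lam, mu):
    """P(x transactions in (t, t+1] | lam, mu) with an exponential lifetime."""
    # Case 1: dropped out before t, so the count in the interval must be zero.
    dead_before = (1.0 - math.exp(-mu * t)) if x == 0 else 0.0
    # Case 2: still active at t + 1, so the count is Poisson(lam).
    alive_through = (math.exp(-mu * (t + 1)) * math.exp(-lam)
                     * lam ** x / math.factorial(x))
    # Case 3: dropped out inside the interval. Integrating the Poisson pmf
    # over the dropout time gives a closed form for integer x.
    z = lam + mu
    tail = 1.0 - math.exp(-z) * sum(z ** k / math.factorial(k) for k in range(x + 1))
    died_inside = (mu / z) * math.exp(-mu * t) * (lam / z) ** x * tail
    return dead_before + alive_through + died_inside

def pmf_random_customer(x, t, r, alpha, s, beta, draws=2000, seed=0):
    """Monte Carlo average of pmf_given_rates over gamma-distributed rates."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        lam = rng.gammavariate(r, 1.0 / alpha)   # shape r, rate alpha (assumption)
        mu = rng.gammavariate(s, 1.0 / beta)     # shape s, rate beta (assumption)
        total += pmf_given_rates(x, t, lam, mu)
    return total / draws
```

Evaluating `pmf_random_customer` for x = 0, 1, 2, ... at t = 0, 1, 2, ... gives one predicted histogram per period, which is the quantity that can be compared against period-by-period summary data. A closed-form mixture would replace the Monte Carlo step in practice; the simulation average is used here only to keep the sketch short.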
13. Tuscanizing the Pareto/NBD
14. Augmenting the Model
- Assume a fraction of customers make exactly one
extra purchase in the first year
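One way to read this augmentation, offered as an assumption rather than the authors' specification: with probability `pi_extra` (a hypothetical name) a customer adds exactly one purchase to whatever the base model predicts for year 1, independently of that count. The first-year distribution is then a mixture of the base distribution and a copy shifted up by one.

```python
def augment_first_year(base_pmf, pi_extra):
    """Mix a first-year count pmf with a copy shifted up by one purchase.

    base_pmf[k] is the base-model probability of k purchases in year 1;
    pi_extra is the assumed fraction of customers making one extra purchase.
    """
    shifted = [0.0] + list(base_pmf)   # every count moved up by one
    padded = list(base_pmf) + [0.0]    # pad so both lists have equal length
    return [(1.0 - pi_extra) * p + pi_extra * q for p, q in zip(padded, shifted)]

# Toy distribution, not data from the slides.
augmented = augment_first_year([0.5, 0.3, 0.2], 0.1)
```

Only the first-year histogram changes under this reading; later years keep the unmodified model probabilities.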
15. Model Fit
16. Cohort Comparison Heterogeneity
17. Future Projections
E(CLV) for <50 cohort: 46
E(CLV) for ≥50 cohort: 89
18. Do We Need All Five Years of Data?
- Calibrate the model on years 1-3 only, predict
for years 4 and 5.
19. Customer-Base Analysis Using Repeated Cross-Sectional Summary (RCSS) Data
- Under more general conditions, what is the information loss from aggregating data?
- Under what conditions can a model built using aggregated data accurately mimic its individual-level counterpart?
- How much aggregated data is required to do this job well?
20. Forward-looking vs. backward-looking histograms
21. Summary of Results
- Using three or more quarters always matches disaggregate performance in terms of
  - In-sample LL
  - Out-of-sample histogram predictions
- This is true for both backward-looking and forward-looking approaches to creating histograms
- Validated across extensive simulations and a real-world dataset
22. Other Desirable Properties
- Just the percentage of total customers in each bucket is sufficient; don't even need actual numbers
- Data can be aperiodic (they just have to be repeated)
- Histograms can be of different time lengths, e.g.,
  - 3-month, 6-month, 4-month
- Histograms can be missing, e.g.,
  - Qtr. 1, (missing), Qtr. 3, Qtr. 4
- Data management/storage benefits
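To make the data structure concrete, here is a minimal sketch of building RCSS histograms from per-customer transaction counts. The bucket layout (0, 1, 2, 3 or more), the function names, and the toy counts are assumptions for illustration; a missing period is simply carried as `None`.

```python
from collections import Counter

def rcss_histogram(counts, max_bucket=3):
    """Share of customers with 0, 1, ..., max_bucket-or-more transactions."""
    capped = Counter(min(c, max_bucket) for c in counts)  # top bucket is open-ended
    n = len(counts)
    return [capped.get(k, 0) / n for k in range(max_bucket + 1)]

def rcss_summary(period_counts, max_bucket=3):
    """Map each period label to its histogram, keeping missing periods as None."""
    return {
        label: None if counts is None else rcss_histogram(counts, max_bucket)
        for label, counts in period_counts.items()
    }

# Toy data: four quarters, with the second quarter's histogram unavailable.
summary = rcss_summary({
    "Qtr. 1": [0, 0, 1, 2, 5, 3],
    "Qtr. 2": None,
    "Qtr. 3": [0, 1, 1, 0, 0, 4],
    "Qtr. 4": [0, 0, 0, 2, 0, 1],
})
```

Only the shares per bucket are stored, so no customer identifiers or raw counts need to leave the source system.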
23. Which Data Structure Would You (and Your
Customers) Prefer to Use?