Title: The Perils of Data Pruning in Consumer Choice Models
1. The Perils of Data Pruning in Consumer Choice Models
- Eric T. Bradlow
- Elaine Zanutto
- The Wharton School
2. Outline
- Current Practice
- Why it's a problem
- A small simulated example
- Application to Fader and Hardie (1996)
- How do we fix this?
- Summary Comments
3. Current Practice
- Many recent papers in Marketing have addressed heterogeneity in consumer choice models for scanner panel data
- One primary feature of these models is SKU- or brand-specific intercepts
- Problems arise when the number of SKUs or brands is large (computation, instability, reproducibility, etc.)
4. Current Practice (cont.)
- Therefore, current practice in EVERY paper that has appeared in JMR and Marketing Science (excluding Fader and Hardie 1996) is to:
- Observe the data
- Post-process the data by pruning brands using various mechanisms (to be described)
- Fit models, typically multinomial logit, to the remaining data, ignoring the fact that the data have been post-processed
5. Current Practice (cont.)
- Some common pruning mechanisms that have appeared are:
- Take the top X brands (e.g., X = 10)
- Choose all SKUs or brands with share > Y% (e.g., Y = 1-2)
- Choose all SKUs until Z% of the share is represented (e.g., Z = 80)
- Restrict the analysis to the most popular sizes or flavors
- Collapse SKUs or brands into an "Other" category
- Each of these approaches reduces, sometimes dramatically, the number of model parameters
6. Why is this a problem?
- Pruning the data leads to:
- Well known:
- Fewer parameters
- Smaller sample size (lower power, larger SEs)
- Faster computation
- Greater stability
- No inestimable parameters
- Not known (our contribution):
- A non-ignorable missing-data mechanism (Little and Rubin, 1987)
7. Missing-Data Formulation
- Let Yobs denote the observed data (consumer choices)
- Let Ii denote an indicator of whether a given unit i is in the sample
- The observed data are (Yobs, Ii), with associated likelihood
- f(Yobs, Ii | θ, φ)
- where θ are the parameters that govern the choice model and φ are the parameters that govern the missing-data process (which units end up in the final sample)
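With Ymis denoting the pruned (unobserved) choices, this likelihood can be unpacked by integrating over the missing data; a sketch of the standard selection-model factorization, in the slide's notation:

```latex
f(Y_{\mathrm{obs}}, I \mid \theta, \phi)
  = \int f(Y_{\mathrm{obs}}, Y_{\mathrm{mis}} \mid \theta)\,
         f(I \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}, \phi)\, dY_{\mathrm{mis}}
```

The first factor is the choice model; the second is the selection (pruning) model, and ignoring it is what the next slides examine.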
8. Non-Ignorability Assumptions
- When people ignore the data-pruning mechanism, they utilize (A)
- f(Yobs | θ)
- rather than the correct likelihood (B)
- f(Yobs, Ii | θ, φ)
- When is this OK? When does A = B?
9. For Ignorability of the Selection Process
- (A) f(I | Yobs, Ymis, θ, φ) = f(I | Yobs, θ, φ) or f(I | θ, φ)
- These assumptions are known as missing at random (MAR) and missing completely at random (MCAR), respectively
- (B) The parameters θ and φ are distinct
- Both conditions are highly suspect in consumer choice models that have gone through post-process data pruning
10. Example
- Imagine a log-log demand model for sales with
- Yi ~ N(α + βPi, σ²)
- where we only observe those brands such that Yi > c
- Then the likelihood ignoring the selection mechanism is
- L(α, β, σ) = ∏(i: Yi > c) (1/σ) φ((Yi − α − βPi)/σ)
- whereas the true (truncated-normal) likelihood is
- L(α, β, σ) = ∏(i: Yi > c) (1/σ) φ((Yi − α − βPi)/σ) / [1 − Φ((c − α − βPi)/σ)]
- where φ(·) and Φ(·) denote the standard normal density and CDF
- These are not the same, and they have different maxima
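The contrast between the two likelihoods can be checked numerically. The sketch below uses assumed values (α = 500, β = −25, σ = 40, cutoff c = 450; α and σ are held at their true values so the likelihoods vary only in β) and maximizes each over a grid:

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Assumed parameter values for illustration only.
alpha, beta, sigma, c, N = 500.0, -25.0, 40.0, 450.0, 2000
P = rng.uniform(0.1, 10.0, N)
Y = alpha + beta * P + rng.normal(0.0, sigma, N)
keep = Y > c                     # only brands with Y_i > c are observed
p, y = P[keep], Y[keep]

def Phi(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def naive_ll(b):
    """Log-likelihood that ignores the selection (plain normal kernel)."""
    z = (y - alpha - b * p) / sigma
    return float(-0.5 * np.sum(z * z))

def true_ll(b):
    """Truncated-normal log-likelihood: renormalize by P(Y_i > c)."""
    trunc = sum(np.log(1.0 - Phi((c - alpha - b * pi) / sigma)) for pi in p)
    return naive_ll(b) - trunc

grid = np.linspace(-40.0, -5.0, 351)
b_naive = grid[np.argmax([naive_ll(b) for b in grid])]
b_true = grid[np.argmax([true_ll(b) for b in grid])]
print(b_naive, b_true)  # the two likelihoods peak at different betas
```

The naive maximizer is attenuated toward zero, while the truncated likelihood peaks near the true β.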
11. Consequences of Non-Ignorability: A Simulation
- Use the log-log demand model with
- N = 200, α = 500, β = −25, Pi ~ U(0.1, 10), and σ ∈ {5, 40, 60}, corresponding to R² of 0.99, 0.74, and 0.56, respectively
- Selection mechanism: compare top X brands (X = 100, X = 160) vs. random sampling
- 1000 simulated data sets for each of the 12 conditions
12. Simulation Results
- Random sampling has minimal bias
- Top X brands has significant bias
- Bias increases with larger σ
13. Consequences of Non-Ignorability: A Real Example
- We applied a series of data-pruning mechanisms to the well-cited Fader and Hardie (1996, JMR) fabric-softener data set because:
- (1) It is a paper that fits the model to ALL brands (that is the purpose of the paper)
- (2) It is a latent-class model, and our tenet is that data pruning also affects the latent-class composition
- (3) The data are readily available
- (4) It is a data set where many SKUs exist, and hence one where data pruning would naturally occur
14. The Fader and Hardie Data
- IRI scanner panel data for 594 households and 9781 purchases, Jan 1990-June 1992, in the fabric-softener category
- 56 SKUs comprising combinations (but not all) of:
- Nine brands (Arm & Hammer, Bounce, Cling Free, Downy, Final Touch, Generic, Private Label, Sta-Puf, and Toss 'n Soft)
- Four forms (sheets, concentrated, refill, liquid)
- Four formulas (regular, staingard, unscented, and light)
- Four sizes (small, medium, large, and extra large)
15. Some Basic Statistics
- 73% of the total share is covered by the top 4 brands (Downy, Snuggle, Private Label, and Final Touch)
- However, if you screened on this, you would eliminate 6 of the 16 best-selling SKUs
- 24 SKUs have less than a 1% share
- The 1990 data (3227 purchases) are used to initialize the model's loyalty variables (Guadagni and Little, 1983)
16. The Latent-Class Multinomial Logit Model
- The likelihood of household h's choice history is Σs λs ∏t pht(i | s)
- where pht(i | s) is the choice probability of household h, at the t-th purchase occasion, for brand i within segment s
- and the λs are the segment shares
17. The Data-Pruning Mechanisms Utilized
- (1) All the data
- (2) Top 5 brands
- (3) Top 4 brands
- (4) Top 5 brands + "Other"
- (5) All brands with share > 5%
- (6) All brands with share > 10%
- (7) Top 30 SKUs
- (8) All SKUs with > 2% share
18. Fader and Hardie Results
- One-segment models
- Probability of SKU 45 under one segment
- Two-segment models
- Probability of SKU 45 under two segments
19. Summary of Empirical Findings
- Marketing-mix coefficients change
- Latent-class compositions change
- Orderings of popularity change (brand, form, formula, size)
- Probability estimates change
- Loyalties change
20. Can We Fix It?
- Random sampling
- PPS sampling
- Sample proportional to share
- Weighted-likelihood approach
- Weight by the probability of selection
- Note that in all of these approaches, the number of parameters stays at the REDUCED level that the analyst initially wanted
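A minimal sketch of the weighted-likelihood idea in the earlier demand-model setting, assuming the selection probabilities are known (a simplification for illustration; the actual weighting scheme used in the paper may differ):

```python
import numpy as np

rng = np.random.default_rng(3)

# Assumed values; each retained unit's likelihood contribution is weighted
# by the inverse of its (known) probability of surviving the pruning step.
N, alpha, beta, sigma = 300, 500.0, -25.0, 40.0

def wls_slope(p, y, w):
    """Weighted least-squares slope (the weighted normal MLE here)."""
    X = np.column_stack([np.ones(p.size), p])
    Xw = X * w[:, None]
    return np.linalg.solve(Xw.T @ X, Xw.T @ y)[1]

naive_est, weighted_est = [], []
for _ in range(300):
    P = rng.uniform(0.1, 10.0, N)
    Y = alpha + beta * P + rng.normal(0.0, sigma, N)
    pi = np.clip((Y - 200.0) / 350.0, 0.05, 1.0)  # popular units kept more often
    keep = rng.uniform(size=N) < pi
    p, y, w = P[keep], Y[keep], 1.0 / pi[keep]
    naive_est.append(wls_slope(p, y, np.ones(p.size)))
    weighted_est.append(wls_slope(p, y, w))

print(np.mean(naive_est), np.mean(weighted_est))
```

Because selection depends on Y given P, the unweighted fit is biased; the inverse-probability weights recover (approximately) the full-sample estimating equations, while the parameter count stays at the reduced level.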
21. Simulation Results for the 3 Approaches
- PPS sampling works well in reducing the bias
- The weighted-likelihood approach improves things but does not eliminate the bias
22. Summary and Conclusions
- Reducing the number of parameters in SKU choice models is a fact of life
- It is important to recognize that the data-pruning mechanisms people typically utilize are non-ignorable
- There are approaches that can minimize or eliminate the effects of data pruning