Title: Cost is the Dependent Variable
- Issues in Econometric Modeling
- Wei Yu
2 Focus of the Analysis
- Costs in a fixed period (e.g., annual person-level cost for patients with substance use disorders)
- Costs without a large number of zeros (e.g., VA
3 Patterns of Cost Data
- Skewed distribution
- Nonlinearity in response to covariates
- Cost response varies by type of care (e.g., outpatient to inpatient)
4 Potential Problems in OLS
- OLS may yield biased and/or less precise estimates of means and marginal effects
5 Alternative Estimation Methods
- Log transformation
- Generalized Linear Models (GLM)
6 Log Transformation
- $\ln(y) = X\beta + \varepsilon$
- where $E(\varepsilon) = 0$, $E(X'\varepsilon) = 0$
7 Log Transformation (cont'd)
- Advantages
- Improved precision
- Disadvantages
- Results on the log scale are not of direct interest
- Retransformation problems
- May not achieve linearity
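8 Retransformation
- $\ln(y) = X\beta + \varepsilon$
- where $E(\varepsilon) = 0$ and $E(X'\varepsilon) = 0$
- $E(y \mid x) = E[\exp(X\beta + \varepsilon)] = \exp(X\beta)\, E[\exp(\varepsilon)]$
- If $\varepsilon$ is normally distributed, $N(0, \sigma^2)$,
- $E(y \mid x) = \exp(X\beta)\, \exp(0.5\sigma^2)$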
9 Retransformation (cont'd)
- If $\varepsilon$ is not normally distributed, but is i.i.d.,
- or $\exp(\varepsilon)$ has constant mean and variance,
- $E(y \mid x) = \exp(X\beta) \times s$,
- where, for the smearing retransformation,
- $s = \frac{1}{n} \sum_i \exp(\hat{\varepsilon}_i)$
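The two retransformation factors above can be compared on simulated data. This is a minimal sketch under stated assumptions: a single covariate, simulated log-normal costs, and hypothetical helper names (`ols_simple`, `retransformation_factors`) that do not come from the slides.

```python
import math
import random


def ols_simple(x, y):
    """Closed-form OLS of y on a constant and one regressor -> (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def retransformation_factors(x, y):
    """Fit ln(y) = a + b*x + e by OLS and return (a, b, normal_factor, smear_factor)."""
    ln_y = [math.log(v) for v in y]
    a, b = ols_simple(x, ln_y)
    resid = [ly - (a + b * xi) for xi, ly in zip(x, ln_y)]
    n = len(resid)
    sigma2 = sum(r * r for r in resid) / (n - 2)          # residual variance
    normal_factor = math.exp(0.5 * sigma2)                # exp(0.5 * sigma^2)
    smear_factor = sum(math.exp(r) for r in resid) / n    # (1/n) * sum exp(e_hat_i)
    return a, b, normal_factor, smear_factor


# Simulated data (an assumption for illustration, not the VA sample).
random.seed(0)
x = [random.random() for _ in range(5000)]
y = [math.exp(1.0 + 0.5 * xi + random.gauss(0.0, 0.8)) for xi in x]

a, b, normal_factor, smear_factor = retransformation_factors(x, y)
pred_naive = math.exp(a + b * 0.5)        # ignores the retransformation factor
pred_smear = pred_naive * smear_factor    # smearing estimate of E(y | x = 0.5)
```

With normal errors the two factors should be close; when the errors are not normal, only the smearing factor remains appropriate, and neither is adequate if the error variance depends on x.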
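10 Retransformation (cont'd)
- If $\varepsilon$ is heteroscedastic in $x$:
- $E(y \mid x) = f(x) \times \exp(X\beta)$
- $E(y \mid x) \neq \text{constant} \times \exp(X\beta)$
- $\partial E(y \mid x)/\partial x \neq \beta \times \exp(X\beta)$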
- Using an appropriate Generalized Linear Model
- Model specification
- A link function
- A mean-variance relationship
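As an illustration of choosing a link function and a mean-variance relationship together, the sketch below fits a GLM with a log link and gamma variance by iteratively reweighted least squares (IRLS). It assumes a single covariate and simulated data; `ols_simple` and `fit_gamma_log_glm` are hypothetical names, and a statistical package would normally be used instead.

```python
import math
import random


def ols_simple(x, y):
    """Closed-form OLS of y on a constant and one regressor -> (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def fit_gamma_log_glm(x, y, n_iter=25):
    """IRLS for E(y|x) = exp(a + b*x) with gamma variance V(mu) = mu^2.

    With a log link and gamma variance the IRLS weights are constant, so each
    step reduces to an unweighted OLS of the working response z on x.
    """
    a, b = math.log(sum(y) / len(y)), 0.0     # start from the intercept-only fit
    for _ in range(n_iter):
        z = []
        for xi, yi in zip(x, y):
            eta = a + b * xi
            mu = math.exp(eta)
            z.append(eta + (yi - mu) / mu)    # working response
        a, b = ols_simple(x, z)
    return a, b


# Simulated gamma-distributed costs with mean exp(1 + x) (illustrative only).
random.seed(2)
x = [random.random() for _ in range(5000)]
y = [random.gammavariate(2.0, math.exp(1.0 + xi) / 2.0) for xi in x]

a_hat, b_hat = fit_gamma_log_glm(x, y)
mean_at_half = math.exp(a_hat + b_hat * 0.5)   # E(y | x = 0.5), no retransformation needed
```

Because the model is specified for E(y|x) directly, the predicted mean is on the raw cost scale and no smearing factor is involved.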
13 GLM: Picking a Link Function
- Box-Cox test
- Find the MLE of $\lambda$, where
- $y^{(\lambda)} = (y^{\lambda} - 1)/\lambda$ when $\lambda \neq 0$
- $y^{(\lambda)} = \ln(y)$ when $\lambda = 0$
- Stata: boxcox
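The Box-Cox search can be sketched as a grid search over the profile log-likelihood. This simplified version assumes an intercept-only model (no covariates) and simulated log-normal costs; the function names are hypothetical and it is not a substitute for Stata's `boxcox`.

```python
import math
import random


def boxcox_transform(y, lam):
    """Box-Cox transform: (y^lam - 1)/lam, or ln(y) when lam = 0."""
    if abs(lam) < 1e-12:
        return [math.log(v) for v in y]
    return [(v ** lam - 1.0) / lam for v in y]


def profile_loglik(y, lam):
    """Profile log-likelihood of lam for an intercept-only normal model."""
    n = len(y)
    t = boxcox_transform(y, lam)
    mean_t = sum(t) / n
    var_t = sum((v - mean_t) ** 2 for v in t) / n          # MLE of the variance
    jacobian = (lam - 1.0) * sum(math.log(v) for v in y)   # change-of-variable term
    return -0.5 * n * math.log(var_t) + jacobian


def boxcox_mle(y, grid):
    """Return the grid value of lam with the highest profile log-likelihood."""
    return max(grid, key=lambda lam: profile_loglik(y, lam))


# Simulated log-normal costs (illustrative), so the estimate should be near 0.
random.seed(1)
y = [math.exp(random.gauss(7.0, 1.0)) for _ in range(2000)]
grid = [round(-1.0 + 0.05 * k, 2) for k in range(61)]      # -1.00, -0.95, ..., 2.00
lam_hat = boxcox_mle(y, grid)
```

An estimate near 0 points to the log link, near 0.5 to the square root, and near 1 to the identity link.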
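14 GLM: Picking a Link Function (cont'd)
- Examples
- If $\lambda = -1$: inverse, $1/y = X\beta + \varepsilon$
- If $\lambda = 0$: log, $\ln(y) = X\beta + \varepsilon$
- If $\lambda = 0.5$: square root, $y^{1/2} = X\beta + \varepsilon$
- If $\lambda = 1$: linear, $y = X\beta + \varepsilon$
- If $\lambda = 2$: square, $y^2 = X\beta + \varepsilon$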
15 GLM: Test for Linearity
- Pregibon's link test
- $y = \delta_0 + \delta_1 (x\hat{\beta}) + \delta_2 (x\hat{\beta})^2 + \varepsilon$
- Test $\hat{\delta}_2 = 0$
- Stata: linktest
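A rough sketch of the link test follows, assuming a log-scale outcome, one covariate, and simulated data in which the true relationship is quadratic so that the fitted linear model is deliberately misspecified. The helpers `invert`, `ols`, and `link_test` are hypothetical names written for this example.

```python
import math
import random


def invert(a):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    k = len(a)
    aug = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
           for i, row in enumerate(a)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        pivot = aug[c][c]
        aug[c] = [v / pivot for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def ols(X, y):
    """OLS via the normal equations -> (coefficients, standard errors)."""
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    rss = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
              for r, yi in zip(X, y))
    s2 = rss / (n - k)
    se = [math.sqrt(s2 * inv[i][i]) for i in range(k)]
    return beta, se


def link_test(x, dep):
    """Regress dep on x, then on the fitted index and its square; return (delta2, t)."""
    beta, _ = ols([[1.0, xi] for xi in x], dep)
    xb = [beta[0] + beta[1] * xi for xi in x]
    delta, se = ols([[1.0, v, v * v] for v in xb], dep)
    return delta[2], delta[2] / se[2]


# Simulated data: ln(y) is quadratic in x, but the fitted model is linear in x.
random.seed(4)
x = [2.0 * random.random() for _ in range(2000)]
ln_y = [1.0 + xi * xi + random.gauss(0.0, 0.5) for xi in x]

delta2, t_stat = link_test(x, ln_y)
```

A large t-statistic on the squared term signals a specification problem; as the slides note for these tests in general, it does not say whether the problem lies on the left or the right side.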
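16 GLM: Test for Linearity (cont'd)
- Ramsey's RESET test
- $y = \delta_0 + \delta_1 (x\hat{\beta}) + \delta_2 (x\hat{\beta})^2 + \delta_3 (x\hat{\beta})^3 + \delta_4 (x\hat{\beta})^4 + \varepsilon$
- Test $\hat{\delta}_2 = \hat{\delta}_3 = \hat{\delta}_4 = 0$
- Stata: ovtest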
17 GLM: Test for Linearity (cont'd)
- Modified Hosmer-Lemeshow test
- Estimate the model (e.g., $\ln(y) = x\beta + \varepsilon$)
- Retransform to get $\hat{y}$ on the raw scale
- Compute $\hat{\varepsilon} = y - \hat{y}$ on the raw scale
- Create 10 groups, sorted by $x\hat{\beta}$
- F-test of whether the group mean residuals differ from zero
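The steps of the modified Hosmer-Lemeshow test can be sketched as follows. The example assumes a log-OLS model with one covariate, a smearing retransformation, and simulated data with a deliberately wrong functional form; it computes the F-statistic only (no p-value), and all function names are hypothetical.

```python
import math
import random


def ols_simple(x, y):
    """Closed-form OLS of y on a constant and one regressor -> (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def raw_scale_residuals(x, y):
    """Fit ln(y) on x, retransform with a smearing factor, return (xb_hat, y - y_hat)."""
    ln_y = [math.log(v) for v in y]
    a, b = ols_simple(x, ln_y)
    xb = [a + b * xi for xi in x]
    smear = sum(math.exp(ly - e) for ly, e in zip(ln_y, xb)) / len(y)
    resid = [yi - math.exp(e) * smear for yi, e in zip(y, xb)]
    return xb, resid


def hosmer_lemeshow_f(xb, resid, n_groups=10):
    """Group residuals by sorted xb_hat; F-statistic for all group means being zero."""
    order = sorted(range(len(resid)), key=lambda i: xb[i])
    size = len(order) // n_groups
    groups = []
    for g in range(n_groups):
        end = (g + 1) * size if g < n_groups - 1 else len(order)
        groups.append([resid[i] for i in order[g * size:end]])
    means = [sum(g) / len(g) for g in groups]
    explained = sum(len(g) * m * m for g, m in zip(groups, means))
    within = sum((r - m) ** 2 for g, m in zip(groups, means) for r in g)
    f_value = (explained / n_groups) / (within / (len(resid) - n_groups))
    return means, f_value


# Simulated data: ln(y) is quadratic in x, but the fitted model is linear in x.
random.seed(5)
x = [2.0 * random.random() for _ in range(2000)]
y = [math.exp(1.0 + xi * xi + random.gauss(0.0, 0.5)) for xi in x]

xb_hat, resid = raw_scale_residuals(x, y)
group_means, f_value = hosmer_lemeshow_f(xb_hat, resid)
```

Inspecting `group_means` by decile shows where the model under- or over-predicts, which is the same information a plot of mean residuals by group would convey.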
18 GLM: Test for Linearity (cont'd)
- All of the above tests are diagnostic, not constructive.
- If the null is rejected, look for problems on either the
- Left side (wrong power function) or
- Right side (wrong functional form of x)
19 GLM: Determine a Mean-Variance Relationship
- GLM family test (Park test)
- 1. Regress $y$ (raw scale) on $x$
- 2. Save the raw-scale residuals $\hat{\varepsilon}$ and predictions $\hat{y}$
- 3. Regress $\ln(\hat{\varepsilon}^2)$ on $\ln(\hat{y})$ and a constant
- Alternative to step 1
- GLM of $y$ on $x$ with gamma family and log link.
20 GLM: Family Test (cont'd)
- The coefficient on $\ln(\hat{y})$, $\hat{\gamma}$, gives the family
- If $\hat{\gamma} = 0$: Gaussian (variance unrelated to mean)
- If $\hat{\gamma} = 1$: Poisson (variance equals mean)
- If $\hat{\gamma} = 2$: Gamma (variance proportional to the square of the mean)
- If $\hat{\gamma} = 3$: Wald or inverse Gaussian
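The three Park-test steps can be sketched on simulated data. The example assumes one covariate, gamma-distributed costs with a mean that is linear in x (so that the raw-scale OLS predictions stay positive and their logs exist), and a hypothetical helper `ols_simple`.

```python
import math
import random


def ols_simple(x, y):
    """Closed-form OLS of y on a constant and one regressor -> (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def park_test(x, y):
    """Return the coefficient on ln(y_hat) from regressing ln(resid^2) on ln(y_hat)."""
    a, b = ols_simple(x, y)                                  # step 1: y on x, raw scale
    y_hat = [a + b * xi for xi in x]                         # step 2: predictions ...
    resid = [yi - yh for yi, yh in zip(y, y_hat)]            # ... and residuals
    ln_e2 = [math.log(r * r) for r in resid]
    ln_yhat = [math.log(v) for v in y_hat]                   # requires y_hat > 0
    _, gamma_hat = ols_simple(ln_yhat, ln_e2)                # step 3
    return gamma_hat


# Simulated gamma costs: mean 1 + 9x, variance = mean^2 / 2 (illustrative only).
random.seed(3)
x = [random.random() for _ in range(20000)]
y = [random.gammavariate(2.0, (1.0 + 9.0 * xi) / 2.0) for xi in x]

gamma_hat = park_test(x, y)
```

Because the simulated variance is proportional to the squared mean, the estimated coefficient should land near 2, which corresponds to the gamma family in the list above.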
21 GLM: Test for Over-Fitting
- Copas test
- Randomly split the sample into two groups (half-half, 2/3-1/3, etc.)
- Estimate the model on group 1
- Forecast to group 2
- $\hat{y}_2 = X_2 \hat{\beta}_1$
- Regress $y_2$ against $\hat{y}_2$
- $y_2 = \alpha_0 + \alpha_1 \hat{y}_2 + \varepsilon$
- Test $\alpha_1 = 1$
- Repeat 1000 times to get a distribution.
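The split-sample procedure can be sketched as below. For brevity this assumes a linear model with one covariate, a half-half split, 200 repetitions rather than 1000, and simulated data; `ols_simple` and `copas_slopes` are hypothetical names.

```python
import random


def ols_simple(x, y):
    """Closed-form OLS of y on a constant and one regressor -> (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def copas_slopes(x, y, n_reps=200, seed=0):
    """Repeat: fit on a random half, forecast the other half, regress y2 on y2_hat."""
    rng = random.Random(seed)
    idx = list(range(len(x)))
    half = len(idx) // 2
    slopes = []
    for _ in range(n_reps):
        rng.shuffle(idx)
        g1, g2 = idx[:half], idx[half:]
        a1, b1 = ols_simple([x[i] for i in g1], [y[i] for i in g1])   # estimate on group 1
        y2_hat = [a1 + b1 * x[i] for i in g2]                         # forecast to group 2
        _, alpha1 = ols_simple(y2_hat, [y[i] for i in g2])            # y2 on y2_hat
        slopes.append(alpha1)
    return slopes


# Simulated data from a correctly specified linear model (illustrative only).
random.seed(6)
x = [random.random() for _ in range(1000)]
y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]

slopes = copas_slopes(x, y)
mean_slope = sum(slopes) / len(slopes)
```

The distribution of `slopes` across repetitions is what gets compared with 1; slopes that sit systematically below 1 would suggest that forecasts from the estimation sample are too extreme for new data.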
22 GLM: Test for Over-Fitting (cont'd)
- If the null hypothesis is rejected, over-fitting may be a problem
- Examine
- The model
- Outliers
- Use it to compare models
23 GLM Example
- Sample: 300,000 randomly selected VA patients
- Dependent variable: annual person-level cost
- Independent variables: age, race, common chronic conditions
24 GLM Example
- Graphs of the cost distribution
25 Box-Cox Test
- Variable: total cost
- $\lambda = 0.04$
- Link function: $\ln(y)$
26 Link and RESET Tests
- Model: $\ln(y) = \alpha + \beta X + \varepsilon$
- P-values
- Link: < 0.001 ($\hat{\beta}_2 = -0.14$)
- RESET: < 0.001
- Both tests show problems
27 Hosmer-Lemeshow Test
- F-statistic = 497.9
- p-value < 0.001
- Problem in upper groups
- (Graph shown here)
28 GLM Family Test
- $\hat{\gamma} = 1.96$ (p < 0.001)
- Family: Gamma
29 Copas Test for ln(y)
Variable   Obs    Mean     Std. Dev.   Min        Max
copas1     1000   .06572   .0069411    .0500775   .095947
95% confidence interval for test of slope = 1: .05407852 to .08041483
Conclusion: The model failed the Copas over-fitting test
30 Methods Tried to Fix the Model
- Fixed outliers (both 1 and 10)
- Took a double log transformation of total cost
- Neither method improved the model fit
- Consider functional forms for the right-side variables
31 Other Things We May Try
- Consider functional forms for the right-side variables
32 The Final Model
- GLM with gamma family and log link function
- With a large number of observations,
- tests are more likely to reject any hypothesis that a coefficient is zero
- OLS may provide reasonably accurate estimates
- When right-side variables are all indicators, linearity may not be a major problem
