Title: Predicting household income in the ONS Longitudinal Study
1. Predicting household income in the ONS Longitudinal Study
- Salah Merad and Karl Ashworth
- Office for National Statistics
2. Talk Outline
- Brief information about the ONS Longitudinal Study (LS)
- Potential uses of predicted income
- Choice of data source
- The modelling process
- Using the LS to estimate domain means: comparisons with other data sources
3. The ONS Longitudinal Study (LS)
- Around 1% sample of individuals in England and Wales
- Linked Census records, 1971 to 2001
- Linked to Vital Events (Births, Deaths, Cancer)
- A variety of socio-economic and demographic variables
- No income measure, although there is regular demand from census users
- Want to predict household (HH) gross income for the 2001 data in the LS
4. Potential uses of predicted income
- Descriptive
  - Group means
  - Small areas
- Modelling
  - Predictor (surrogate for deprivation)
  - Outcome (taking care when the analysis uses variables that were used to construct the predicted income)
5. Data Source: Family Resources Survey (FRS)
- Population coverage (England and Wales, self-employed included)
- Income of all adults in a household collected
- All components of income included
  - Earnings, Self-employment, Investments, Pensions, Benefits (income support, disability)
- Gross and net income, housing costs collected
- Large sample size
- Potential predictors have similar definitions to the corresponding Census 2001 variables
- Use FRS 2001-2002
6. Data issues
- Data collected from April 2001 to March 2002
- Income collected is for the week preceding the interview date (point-in-time income)
- 69 out of 23,000 households reported zero or negative total income
  - Negative values were returned by the self-employed
  - Set negative values to 0 for modelling
7. Building a model: predictors 1
- At household level
  - Household composition, total number of dependent children
  - Type of tenure
  - Type of accommodation, number of rooms, location of home address (GOR)
  - Car use (number of cars/number of adults)
  - Total number of adults in employment (1 for full-time, ½ for part-time)
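The two derived household predictors above (the employment count and the car-use ratio) can be sketched as follows. The class, field names and category labels are hypothetical and chosen only for illustration; the slides do not say how the variables were coded in the FRS or Census files.

```python
from dataclasses import dataclass

@dataclass
class Adult:
    employment: str  # "full-time", "part-time" or "none" (hypothetical coding)

# Full-time counts as 1 and part-time as 1/2, as stated on the slide.
EMPLOYMENT_SCORE = {"full-time": 1.0, "part-time": 0.5, "none": 0.0}

def adults_in_employment(adults):
    """Employment-weighted number of adults in the household."""
    return sum(EMPLOYMENT_SCORE[a.employment] for a in adults)

def cars_per_adult(n_cars, n_adults):
    """Car use as cars per adult; None when the household has no adults."""
    return n_cars / n_adults if n_adults else None

household = [Adult("full-time"), Adult("part-time"), Adult("none")]
employment_total = adults_in_employment(household)   # 1 + 0.5 + 0
car_ratio = cars_per_adult(2, len(household))        # 2 cars, 3 adults
```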
8. Building a model: predictors 2
- Individual level (for household reference person, HRP)
  - Age, Age²
  - Sex
  - Economic activity
  - Social and economic classification (NSSEC)
  - Industry classification
  - Highest qualification attained
  - Ethnicity
9. Building a model: predictors 2
- Some variables were derived and recoded to make Census and FRS variable definitions and categories similar
  - New HRP derived in the FRS, as the original is not defined as in Census 2001
  - Highest qualification derived in the FRS, as no equivalent variable is available; categories were collapsed
10. Outcome variable transformation
- Income variables very skewed
- Applied square root and log transformations
- Log transform yields a more symmetric distribution
- For 0 values, the log is set to 0
- Tried a censored regression model; results are very similar
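A minimal sketch of the transformation rule, combining the two conventions stated on the slides: negative incomes are first set to 0, and the log of a 0 income is set to 0. The function name and the example incomes are made up for illustration.

```python
import math

def to_log_income(income):
    """Log-transform weekly income: negative values are set to 0,
    and the log of 0 is defined as 0."""
    income = max(income, 0.0)
    return math.log(income) if income > 0 else 0.0

weekly_incomes = [-120.0, 0.0, 1.0, 250.0, 900.0]  # made-up values in pounds
log_incomes = [to_log_income(v) for v in weekly_incomes]
```

Under this convention an income of exactly 1 is indistinguishable from 0 on the log scale, which is harmless when only a handful of households report zero income.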
11. Fitting a model for total weekly income (gross weekly income)
- Multiple regression model using ordinary least squares (OLS) for the whole sample
- R² = 0.57
- No evidence of collinearity (values of the Variance Inflation Factor very low)
- Model diagnostics
  - Studentised residuals plot shows a number of outlying points
  - On investigation, many of these points were found to have 0 or very small values of income
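The fit and the collinearity check can be illustrated with a small NumPy sketch. It runs on simulated data with three made-up predictors, not the FRS variables, and the helper names `ols_fit` and `vif` are hypothetical. The variance inflation factor of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the others.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients (intercept first) and R-squared."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)
    return beta, r2

def vif(X):
    """Variance inflation factor of each column of X."""
    out = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        _, r2_j = ols_fit(others, X[:, j])
        out.append(1.0 / (1.0 - r2_j))
    return np.array(out)

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))  # stand-ins for predictors
# Simulated log income: linear in the predictors plus noise.
y = 5.0 + X @ np.array([0.4, -0.2, 0.1]) + rng.normal(scale=0.5, size=n)

beta, r2 = ols_fit(X, y)
vifs = vif(X)  # close to 1 here because the columns are independent
```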
12. Model diagnostics
- Test for heteroscedasticity: Breusch-Pagan test highly significant
- Parameter estimators obtained using OLS are unbiased, but they are not efficient and standard errors could be inaccurate
- Problems with the bias-correction factor in back-transformation
- SEs of predictions generally too large
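For readers unfamiliar with the test, here is a standard-library sketch of the studentised (Koenker) form of the Breusch-Pagan test with a single predictor: regress the squared OLS residuals on the predictor and compare n × R² with a chi-square distribution on 1 degree of freedom. The slides do not say which variant was used, and the data below are simulated so that the error spread grows with x.

```python
import math
import random

def simple_ols(x, y):
    """Intercept, slope and R-squared for a one-predictor OLS fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

def breusch_pagan(x, y):
    """LM = n * R^2 from the auxiliary regression of squared residuals on x."""
    a, b, _ = simple_ols(x, y)
    sq_resid = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    _, _, r2_aux = simple_ols(x, sq_resid)
    lm = max(len(x) * r2_aux, 0.0)
    p_value = math.erfc(math.sqrt(lm / 2.0))  # chi-square(1) upper tail
    return lm, p_value

random.seed(1)
x = [random.uniform(1.0, 10.0) for _ in range(1000)]
# Error standard deviation is proportional to x: heteroscedastic by construction.
y = [2.0 + 0.3 * xi + random.gauss(0.0, 0.2 * xi) for xi in x]
lm_stat, p_value = breusch_pagan(x, y)
```

With several predictors the statistic has as many degrees of freedom as there are predictors in the auxiliary regression, and a statistics library would normally supply the p-value.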
13. Dealing with heteroscedasticity 1
- Weighted least squares yields efficient estimates; however, it is not clear how to
  - Adjust the back-transformation of predicted log income in a new data set
  - Estimate SEs of predicted values in a new data set
- Need to estimate a complex model where the variance of the residuals is also modelled (not explored)
- We consider a simple approach: an approximate solution
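The difficulty with the back-transformation can be seen from the standard lognormal result. It is written here under the assumption, not stated on the slides, that the errors on the log scale are normal with a variance that may depend on the predictors; the symbols are introduced only for this illustration.

```latex
\log Y = x^{\top}\beta + \varepsilon, \qquad
\varepsilon \mid x \sim N\!\bigl(0, \sigma^{2}(x)\bigr)
\quad\Longrightarrow\quad
E[\,Y \mid x\,] = \exp\!\Bigl(x^{\top}\beta + \tfrac{1}{2}\sigma^{2}(x)\Bigr)
```

The naive prediction exp(x'β) therefore targets the median rather than the mean, and the correction factor exp(σ²(x)/2) is a single constant only when the residual variance is the same for every household.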
14. Dealing with heteroscedasticity 2: a heuristic
- Distribution of residuals in CPM varies across groups: NSSEC, Qualification, Ethnicity
- Fit a model in each group: which variable to use?
  - Model mixing (new methodology)? Not done
  - Bigger spread in some NSSEC categories: use NSSEC
- Estimated a model in each NSSEC category using OLS: split population group models (SPGM)
- Models not good in high income and self-employed groups
- Flexible split population group models (FSPGM)
  - Use SPGM in medium and low income categories
  - Use CPM in high income and self-employed
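The FSPGM rule amounts to a dispatch: use the group-specific model where one is kept, and fall back to CPM (which the slides appear to use for the single whole-sample model) elsewhere. The sketch below uses toy one-predictor models with invented coefficients, and hypothetical band labels in place of the actual NSSEC categories.

```python
def make_linear_model(intercept, slope):
    """Toy stand-in for a fitted regression: log income from one predictor."""
    return lambda record: intercept + slope * record["adults_employed"]

# Whole-sample model, used for the high income and self-employed groups.
common_model = make_linear_model(5.0, 0.5)

# Group models are kept only for the low and medium income categories.
group_models = {
    "low": make_linear_model(4.6, 0.4),
    "medium": make_linear_model(5.2, 0.5),
}

def fspgm_predict(record):
    """Use the group model where one is kept, otherwise the common model."""
    model = group_models.get(record["income_band"], common_model)
    return model(record)

cases = [
    {"income_band": "low", "adults_employed": 1.0},
    {"income_band": "high", "adults_employed": 2.0},
    {"income_band": "self-employed", "adults_employed": 1.5},
]
predictions = [fspgm_predict(c) for c in cases]
```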
15. Standard errors of predicted values
- Accuracy of SEs under FSPGM investigated using re-sampling
  - Select a random sample from the dataset: the test sample (fixed)
  - Fit model using a bootstrap sample from the remaining dataset
  - Predict log income; compute 95% CI for cases in the test sample
  - Repeat the bootstrapping, fitting and prediction process
- Proportion of cases where CIs contain the returned values
  - 97% overall
  - Varies between 96% and 98% across NSSEC categories
  - Varies between 91% and 98% in other groups (Tenure, Ethnicity, HH composition)
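The re-sampling check can be sketched with the standard library alone. This version uses simulated data and a one-predictor OLS model in place of FSPGM, takes the usual OLS prediction interval with a normal quantile of 1.96, and picks the sample sizes and the number of replicates arbitrarily; all of these are assumptions made for illustration.

```python
import math
import random

def fit(x, y):
    """One-predictor OLS: returns the quantities needed for intervals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    s2 = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y)) / (n - 2)
    return a, b, s2, n, mx, sxx

def interval(model, x0, z=1.96):
    """Approximate 95% prediction interval for a new case at x0."""
    a, b, s2, n, mx, sxx = model
    se = math.sqrt(s2 * (1.0 + 1.0 / n + (x0 - mx) ** 2 / sxx))
    centre = a + b * x0
    return centre - z * se, centre + z * se

random.seed(2)
data = []
for _ in range(1200):
    xi = random.uniform(0.0, 3.0)
    data.append((xi, 5.0 + 0.5 * xi + random.gauss(0.0, 0.6)))

random.shuffle(data)
test, rest = data[:200], data[200:]  # the test sample stays fixed

hits = total = 0
for _ in range(50):  # bootstrap replicates
    boot = random.choices(rest, k=len(rest))  # resample with replacement
    model = fit([p[0] for p in boot], [p[1] for p in boot])
    for x0, y0 in test:
        lo, hi = interval(model, x0)
        hits += lo <= y0 <= hi
        total += 1

coverage = hits / total  # share of intervals containing the observed value
```

Coverage well above the nominal 95% would suggest standard errors that are too large, and coverage below it standard errors that are too small.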
16. Model Validation
- Using re-sampling (repeated data splitting)
  - CPM: drop in R² from 0.568 (training datasets) to 0.564 (validation datasets)
  - SPGM
    - Drop is small in the low income group (0.330 to 0.314) and medium income groups (0.550 to 0.520)
    - Drop is large in the high income group (0.277 to 0.203) and self-employed (0.259 to 0.184)
- Assess impact of fitting the model with and without outliers
  - CPM, no outliers: slightly lower R² at low to medium income groups
  - SPGM, no outliers: slightly higher R² in nearly all groups
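Repeated data splitting can be sketched as follows: split the data at random many times, fit on the training part, and compare the average R² on the training and validation parts. The 70/30 split, the 100 repetitions, the one-predictor model and the simulated data are all assumptions; the slides do not give these details.

```python
import random

def fit_line(pairs):
    """One-predictor OLS fit; returns (intercept, slope)."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    b = sum((x - mx) * (y - my) for x, y in pairs) / sxx
    return my - b * mx, b

def r_squared(pairs, coef):
    """R-squared of the fitted line on the given (x, y) pairs."""
    a, b = coef
    my = sum(y for _, y in pairs) / len(pairs)
    ss_res = sum((y - a - b * x) ** 2 for x, y in pairs)
    ss_tot = sum((y - my) ** 2 for _, y in pairs)
    return 1.0 - ss_res / ss_tot

random.seed(3)
data = []
for _ in range(1000):
    x = random.uniform(0.0, 3.0)
    data.append((x, 5.0 + 0.5 * x + random.gauss(0.0, 0.5)))

train_r2, valid_r2 = [], []
for _ in range(100):  # repeated random splits
    shuffled = random.sample(data, len(data))
    train, valid = shuffled[:700], shuffled[700:]
    coef = fit_line(train)
    train_r2.append(r_squared(train, coef))
    valid_r2.append(r_squared(valid, coef))

mean_train = sum(train_r2) / len(train_r2)
mean_valid = sum(valid_r2) / len(valid_r2)
drop = mean_train - mean_valid  # a large drop would indicate overfitting
```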
17. Using the prediction rule in LS 2001 data: group and area means
- Sampled units in the LS are individuals
  - Large HHs are represented more than small HHs
- Need to apply weighting to estimate average HH income
  - Use common weighting for all groups
- Weight of an individual from a HH of size k is
  - (Proportion of HHs of size k in Census 2001) / (Proportion of individuals in HHs of size k in LS 2001 data)
- Estimator is
  - $\hat{\mu}_g = \sum_{i \in g} w_i \hat{y}_i \big/ \sum_{i \in g} w_i$ (sums over LS members $i$ in group $g$; $w_i$ is the weight and $\hat{y}_i$ the predicted income)
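A worked sketch of the weighting and the group estimator. The proportions, group labels and incomes below are invented for illustration and are not Census or LS figures.

```python
from collections import defaultdict

# Hypothetical proportions by household size k.
census_hh_share = {1: 0.30, 2: 0.34, 3: 0.16, 4: 0.20}  # share of households
ls_person_share = {1: 0.13, 2: 0.29, 3: 0.21, 4: 0.37}  # share of LS members

def weight(k):
    """Weight of an LS member living in a household of size k."""
    return census_hh_share[k] / ls_person_share[k]

# (group, household size, predicted weekly household income in pounds)
ls_members = [
    ("owner", 1, 300.0),
    ("owner", 4, 800.0),
    ("renter", 2, 400.0),
    ("renter", 1, 200.0),
]

num = defaultdict(float)
den = defaultdict(float)
for group, k, pred in ls_members:
    w = weight(k)
    num[group] += w * pred
    den[group] += w

# Weighted mean of predicted household income in each group.
group_means = {g: num[g] / den[g] for g in num}
```

Members of large households receive weights below 1, which offsets their over-representation in a sample of individuals.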
18. Using the prediction rule in LS 2001 data: some comparisons - NSSEC
19. Comparisons - Tenure
20. Comparisons: Neighbourhood Statistics (NeSS) based estimates
- Obtained district level estimates of total weekly income based on
  - NeSS published ward estimates (population weighted)
  - Predicted values in LS data
- 67% of estimates are within £50 of each other
- 85% of estimates are within £75 of each other
- Relative differences
  - Between -28% and 25%
  - 97% are between -20% and 20%
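The comparison summaries can be computed as below. The district values are invented, the absolute differences are assumed to be in pounds per week, and the relative difference is assumed to be (LS - NeSS) / NeSS; the slides do not state the direction or the units explicitly.

```python
# Hypothetical district estimates of mean weekly household income, in pounds.
ness = {"A": 520.0, "B": 610.0, "C": 450.0, "D": 700.0}
ls = {"A": 500.0, "B": 680.0, "C": 455.0, "D": 610.0}

# Absolute and relative (percentage) differences per district.
abs_diff = {d: abs(ls[d] - ness[d]) for d in ness}
rel_diff = {d: 100.0 * (ls[d] - ness[d]) / ness[d] for d in ness}

# Share of districts whose two estimates agree within a given tolerance.
share_within_50 = sum(v <= 50.0 for v in abs_diff.values()) / len(abs_diff)
share_within_20pct = sum(abs(v) <= 20.0 for v in rel_diff.values()) / len(rel_diff)
```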
21. Comparisons - Ethnicity
22. Further Issues
- Obtain more accurate estimates of the standard errors of the residuals
- Reduce bias in back-transformation
- Estimate standard errors of group estimates
- Model net income and income excluding housing costs: consistency between predicted values of different income measures
- Test suitability of predicted income in correlation analysis
- Effect of imputation in the LS
- Documentation for users