Predicting household income in the ONS Longitudinal Study

1
Predicting household income in the ONS
Longitudinal Study
  • Salah Merad, Karl Ashworth
  • Office for National Statistics

2
Talk Outline
  • Brief information about the ONS Longitudinal
    Study (LS)
  • Potential uses of predicted income
  • Choice of data source
  • The modelling process
  • Using the LS to estimate domain means:
    comparisons with other data sources

3
The ONS Longitudinal Study (LS)
  • Around 1% sample of individuals in England and
    Wales
  • Linked Census records, 1971 to 2001
  • Linked to Vital Events (Births, Deaths, Cancer)
  • A variety of socio-economic and demographic
    variables
  • No income measure; regular demand from census
    users
  • Want to predict household (HH) gross income for
    the 2001 data in the LS

4
Potential uses of predicted income
  • Descriptive
      • Group means
      • Small areas
  • Modelling
      • Predictor (surrogate for deprivation)
      • Outcome (taking care if using variables used
        in construction)

5
Data Source: Family Resources Survey (FRS)
  • Population coverage (England and Wales,
    self-employed included)
  • Income of all adults in a household collected
  • All components of income included
      • Earnings, Self-employment, Investments,
        Pensions, Benefits (income support,
        disability)
  • Gross and net income, housing costs collected
  • Large sample size
  • Potential predictors have similar definitions to
    the corresponding Census 2001 variables: use FRS
    2001-2002

6
Data issues
  • Data collected from Apr 01 to Mar 02
  • Income collected is for the week preceding the
    interview date: point-in-time income
  • 69 out of 23,000 households reported zero or
    negative total income
  • Negative values were returned by self-employed
  • Set negative values to 0 for modelling

7
Building a model: predictors 1
  • At household level
      • Household composition, total number of
        dependent children
      • Type of tenure
      • Type of accommodation, number of rooms,
        location of home address (GOR)
      • Car use (number of cars / number of adults)
      • Total number of adults in employment (1 for
        full-time, ½ for part-time)
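
The two derived household-level predictors above (the employment count and car use) can be sketched as follows. This is only an illustration: the `Adult` record, the status labels and the function names are hypothetical and do not come from the FRS or Census data dictionaries.

```python
from dataclasses import dataclass


@dataclass
class Adult:
    # Hypothetical status labels: "full_time", "part_time", or anything else.
    status: str


# Weights stated on the slide: 1 for full-time, 1/2 for part-time.
EMPLOYMENT_WEIGHT = {"full_time": 1.0, "part_time": 0.5}


def employment_score(adults):
    """Total number of adults in employment, counting part-time as one half."""
    return sum(EMPLOYMENT_WEIGHT.get(a.status, 0.0) for a in adults)


def cars_per_adult(n_cars, n_adults):
    """Car use expressed as number of cars divided by number of adults."""
    if n_adults <= 0:
        raise ValueError("a household needs at least one adult")
    return n_cars / n_adults


household = [Adult("full_time"), Adult("part_time"), Adult("retired")]
print(employment_score(household))   # 1.5
print(cars_per_adult(1, len(household)))
```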

8
Building a model: predictors 2
  • Individual level (for household reference person,
    HRP)
      • Age, Age²
      • Sex
      • Economic activity
      • Social and economic classification (NSSEC)
      • Industry classification
      • Highest qualification attained
      • Ethnicity

9
Building a model: predictors 2
  • Some variables were derived and recoded to make
    Census and FRS variable definitions and
    categories similar
  • New HRP derived in FRS, as the original is not
    defined as in Census 2001
  • Highest qualification in FRS was derived, as no
    equivalent variable is available; categories
    were collapsed

10
Outcome variable transformation
  • Income variables very skewed
  • Applied square root and Log transformations
  • Log transform yields a more symmetric
    distribution
  • For 0 values, the log is set to 0
  • Tried a censored regression model, and results
    are very similar
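
A minimal sketch of the outcome transformation described above, assuming income is held as a plain number. The function name `log_income` is hypothetical; the two rules it applies (negative values set to 0 before modelling, and the log of a zero value set to 0) are the ones stated on the slides.

```python
import math


def log_income(income):
    """Log-transform weekly income using the conventions on the slides."""
    # Negative values (returned by some self-employed) are set to 0.
    income = max(income, 0.0)
    # For 0 values the log is set to 0 rather than minus infinity.
    return math.log(income) if income > 0 else 0.0


for value in (-120.0, 0.0, 1.0, 350.0):
    print(value, log_income(value))
```

One consequence of this convention is that an income of exactly 1 also maps to 0, so on the log scale it cannot be told apart from a zero income.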

11
Fitting a model for total weekly income (Gross
weekly income)
  • Multiple regression model using ordinary least
    squares (OLS) for the whole sample
  • R² = 0.57
  • No evidence of collinearity (values of Variance
    Inflation Factor very low)
  • Model diagnostics
      • Studentised residuals plot shows a number of
        outlying points
      • On investigation, many of these points found
        to have 0 or very small values of income

12
Model diagnostics
  • Test for heteroscedasticity: Breusch-Pagan test
    highly significant
  • Parameter estimators obtained using OLS are
    unbiased, but they are not efficient and their
    standard errors could be inaccurate
  • Problems with bias-correction factor in
    back-transformation
  • SEs of predictions generally too large
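
To illustrate the kind of test reported above, here is a small sketch of a Breusch-Pagan style check on simulated data with a single predictor. It uses Koenker's studentised form of the statistic (n times the R² from regressing the squared OLS residuals on the predictor). The slides do not say which variant or software was used, so the data, the function names and the one-predictor setup are all assumptions made for illustration.

```python
import math
import random


def fit_line(x, y):
    """OLS for y = a + b*x; returns (a, b, r_squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return a, b, 1.0 - ss_res / ss_tot


def breusch_pagan(x, y):
    """Studentised Breusch-Pagan statistic and p-value (1 degree of freedom)."""
    a, b, _ = fit_line(x, y)
    squared_resid = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    # Auxiliary regression: squared residuals on the predictor.
    _, _, r2_aux = fit_line(x, squared_resid)
    lm = max(len(x) * r2_aux, 0.0)
    # Upper tail of a chi-square with 1 df: P(X > lm) = erfc(sqrt(lm / 2)).
    return lm, math.erfc(math.sqrt(lm / 2.0))


# Simulated data whose error spread grows with x (heteroscedastic by design).
rng = random.Random(0)
x = [rng.uniform(1, 10) for _ in range(500)]
y = [2 + 0.5 * xi + rng.gauss(0, 0.2 * xi) for xi in x]
lm, p_value = breusch_pagan(x, y)
print(lm, p_value)
```

A small p-value indicates that the residual variance depends on the predictor, which is the situation the slide describes.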

13
Dealing with heteroscedasticity 1
  • Weighted least squares yields efficient
    estimates; however, it is not clear how to:
      • Adjust back-transformation of predicted log
        income in a new data set
      • Estimate SEs of predicted values in a new
        data set
  • Need to estimate a complex model where the
    variance of the residuals is also modelled; we
    have not explored this
  • We consider a simple approach: an approximate
    solution
14
Dealing with heteroscedasticity 2: a heuristic
  • Distribution of residuals in CPM varies across
    groups: NSSEC, Qualification, Ethnicity
  • Fit a model in each group: which variable to use?
  • Model mixing (new methodology)? Not done
  • Bigger spread in some NSSEC categories: use NSSEC
  • Estimated a model in each NSSEC category using
    OLS: split population group models (SPGM)
  • Models not good in high income and self-employed
    groups
  • Flexible split population group models (FSPGM)
      • Use SPGM in medium and low income categories
      • Use CPM in high income and self-employed
15
Standard errors of predicted values
  • Accuracy of SEs under FSPGM: investigated using
    re-sampling
      • Select a random sample from the dataset: the
        test sample, held fixed
      • Fit model using a bootstrap sample from the
        remaining dataset
      • Predict log income and compute 95% CI for
        cases in the test sample
      • Repeat the bootstrapping, fitting and
        prediction process
  • Proportion of cases where CIs contain the
    returned values
      • 97% overall
      • Varies between 96% and 98% across NSSEC
        categories
      • Varies between 91% and 98% in other groups
        (Tenure, Ethnicity, HH composition)
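
The re-sampling procedure above can be sketched on simulated data as follows. Several simplifications are assumptions of this sketch rather than details from the slides: a single predictor instead of the full model, and an approximate 95% interval of prediction ± 1.96 × residual standard deviation that ignores parameter uncertainty. All names are hypothetical.

```python
import math
import random


def fit_line(x, y):
    """OLS for y = a + b*x; returns (a, b, residual standard deviation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return a, b, math.sqrt(ss_res / (n - 2))


def interval_coverage(x, y, n_test=50, n_boot=200, seed=0):
    """Share of (replicate, test case) pairs whose interval contains y."""
    rng = random.Random(seed)
    index = list(range(len(x)))
    rng.shuffle(index)
    test, rest = index[:n_test], index[n_test:]  # test sample stays fixed
    hits = 0
    for _ in range(n_boot):
        # Bootstrap sample (with replacement) from the remaining cases.
        sample = [rng.choice(rest) for _ in rest]
        a, b, s = fit_line([x[i] for i in sample], [y[i] for i in sample])
        for i in test:
            prediction = a + b * x[i]
            if prediction - 1.96 * s <= y[i] <= prediction + 1.96 * s:
                hits += 1
    return hits / (n_boot * n_test)


# Simulated "log income" data with constant error variance.
data_rng = random.Random(1)
x = [data_rng.uniform(0, 5) for _ in range(300)]
y = [1 + 0.8 * xi + data_rng.gauss(0, 1) for xi in x]
coverage = interval_coverage(x, y)
print(coverage)
```

Coverage close to the nominal 95% suggests the intervals are about the right width; computing the same proportion within subgroups gives the kind of breakdown reported on the slide.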

16
Model Validation
  • Using re-sampling (repeated data splitting)
  • CPM: drop in R² from 0.568 (training datasets) to
    0.564 (validation datasets)
  • SPGM
      • Drop is small in low income group (0.330 to
        0.314) and medium income groups (0.550 to
        0.520)
      • Drop is large in high income group (0.277 to
        0.203) and self-employed (0.259 to 0.184)
  • Assess impact of fitting the model with and
    without outliers
      • CPM, no outliers: slightly lower R² at low to
        medium income groups
      • SPGM, no outliers: slightly higher R² in
        nearly all groups
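
Repeated data splitting, as used above, can be sketched like this: split the data at random many times, fit on the training part, and compare R² on the training and validation parts. The simulated data, the 70/30 split and the single predictor are assumptions of this sketch, not details taken from the slides.

```python
import random


def fit_line(x, y):
    """OLS for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - b * mx, b


def r_squared(a, b, x, y):
    """R-squared of the line a + b*x evaluated on the data (x, y)."""
    my = sum(y) / len(y)
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def repeated_split(x, y, n_splits=100, train_frac=0.7, seed=0):
    """Mean training and validation R-squared over random splits."""
    rng = random.Random(seed)
    index = list(range(len(x)))
    cut = int(train_frac * len(x))
    train_total = valid_total = 0.0
    for _ in range(n_splits):
        rng.shuffle(index)
        train, valid = index[:cut], index[cut:]
        a, b = fit_line([x[i] for i in train], [y[i] for i in train])
        train_total += r_squared(a, b, [x[i] for i in train], [y[i] for i in train])
        valid_total += r_squared(a, b, [x[i] for i in valid], [y[i] for i in valid])
    return train_total / n_splits, valid_total / n_splits


data_rng = random.Random(2)
x = [data_rng.uniform(0, 5) for _ in range(300)]
y = [1 + 0.8 * xi + data_rng.gauss(0, 1) for xi in x]
train_r2, valid_r2 = repeated_split(x, y)
print(train_r2, valid_r2)
```

A small gap between the two averages, as reported for the CPM, suggests little overfitting; a large gap, as in the high income and self-employed groups, suggests the model does not generalise well.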

17
Using the prediction rule in LS 2001 data: group
and area means
  • Sampled units in the LS are individuals
  • Large HHs are represented more than small HHs
  • Need to apply weighting to estimate average HH
    income
  • Use common weighting for all groups
  • Weight of an individual from a HH of size k is:
      (proportion of HHs of size k in Census 2001) /
      (proportion of individuals in HHs of size k in
      LS 2001 data)
  • Estimator is:
      sum over group (weight × predicted income) /
      sum over group (weight)
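
The weighting and the estimator above can be sketched as follows. The household-size shares and the records are made-up numbers chosen so the arithmetic is easy to follow, and the function names are hypothetical.

```python
from collections import defaultdict


def size_weights(census_hh_share, ls_individual_share):
    """Weight for HH size k: Census share of HHs / LS share of individuals."""
    return {k: census_hh_share[k] / ls_individual_share[k] for k in census_hh_share}


def weighted_group_means(records, weights):
    """Weighted mean of predicted income per group.

    Each record is (group, household_size, predicted_income) for one
    sampled individual.
    """
    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for group, hh_size, predicted_income in records:
        w = weights[hh_size]
        numerator[group] += w * predicted_income
        denominator[group] += w
    return {g: numerator[g] / denominator[g] for g in numerator}


# Illustrative shares: half of households have 1 person and half have 2,
# so one third of individuals live in 1-person households.
weights = size_weights({1: 0.5, 2: 0.5}, {1: 1 / 3, 2: 2 / 3})

# One 1-person household (income 200) and one 2-person household
# (income 400) whose two members are both in the sample.
records = [("owner", 1, 200.0), ("owner", 2, 400.0), ("owner", 2, 400.0)]
means = weighted_group_means(records, weights)
print(means["owner"])
```

In this toy case the unweighted mean over individuals is about 333, because the larger household is counted twice, while the weighted estimate is approximately 300, the average over the two households.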

18
Using the prediction rule in LS 2001 data: some
comparisons - NSSEC

19
Comparisons - Tenure

20
Comparisons: Neighbourhood Statistics (NeSS)
based estimates
  • Obtained district level estimates of total weekly
    income based on:
      • NeSS ward published estimates (population
        weighted)
      • Predicted values in LS data
  • 67% of estimates are within 50 of each other
  • 85% of estimates are within 75 of each other
  • Relative differences
      • Between -28% and 25%
      • 97% are between -20% and 20%
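
The comparison summaries above (share of district estimates within a given distance of each other, and the range of relative differences) can be computed along these lines. The district figures are invented, and using the NeSS-based estimate as the base for the relative difference is an assumption of this sketch; the slides do not say which estimate was used as the denominator.

```python
def compare_estimates(ness, ls, thresholds=(50, 75)):
    """Summarise agreement between paired district-level estimates."""
    differences = [l - n for n, l in zip(ness, ls)]
    # Share of districts whose absolute difference is within each threshold.
    within = {
        t: sum(abs(d) <= t for d in differences) / len(differences)
        for t in thresholds
    }
    # Relative difference in percent, with the NeSS estimate as the base.
    relative = [100.0 * d / n for d, n in zip(differences, ness)]
    return within, min(relative), max(relative)


# Invented weekly income estimates for four districts.
ness_estimates = [400.0, 500.0, 600.0, 450.0]
ls_estimates = [430.0, 440.0, 610.0, 520.0]

within, lowest, highest = compare_estimates(ness_estimates, ls_estimates)
print(within)
print(lowest, highest)
```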

21
Comparisons - Ethnicity

22
Further Issues
  • Obtain more accurate estimates of the standard
    errors of the residuals
  • Reduce bias in back-transformation
  • Estimate standard errors of group estimates
  • Model net income and income excluding housing
    costs: consistency between predicted values of
    different income measures
  • Test suitability of predicted income in
    correlation analysis
  • Effect of imputation in the LS
  • Documentation for users