Title: Imputation of numerical data under linear edit restrictions
1Imputation of numerical data under linear edit
restrictions
2Contents
- Edit restrictions
- Imputation and edit restrictions
- Current approach: adjustment of imputed values
- Alternative approaches
- Our algorithm
- Outline
- Fourier-Motzkin elimination
- Statistical distribution
- Example
3Edit restrictions
- Used to define consistent data
- Examples
- T = P + C
- P ≤ 0.5T
- In general, linear numerical edit restrictions can be written as Ax ≤ b
- These restrictions define a feasible region of allowed values
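As an illustration of the form Ax ≤ b, the sketch below encodes the two example edits as rows of a matrix and checks a record against them. The variable order x = (T, P, C), the function name `satisfies_edits` and the test records are assumptions made for this sketch; the equality T = P + C is written as two inequalities.

```python
# Variable order (an assumption for this sketch): x = (T, P, C).
# The equality T = P + C is split into two inequalities.
A = [
    [1.0, -1.0, -1.0],   #  T - P - C <= 0
    [-1.0, 1.0, 1.0],    # -T + P + C <= 0
    [-0.5, 1.0, 0.0],    #  P - 0.5 T <= 0
]
b = [0.0, 0.0, 0.0]


def satisfies_edits(x, A=A, b=b, tol=1e-9):
    """Return True if A x <= b holds row by row (up to a small tolerance)."""
    return all(
        sum(a_ij * x_j for a_ij, x_j in zip(row, x)) <= b_i + tol
        for row, b_i in zip(A, b)
    )


print(satisfies_edits((100.0, 40.0, 60.0)))  # record inside the feasible region
print(satisfies_edits((100.0, 70.0, 30.0)))  # P exceeds 0.5 T
```

Records for which the function returns True lie in the feasible region; any other record is inconsistent with at least one edit.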
4Why edit restrictions?
- Statistical institutes have a responsibility to supply undisputed data for many different users in society
- For most users, inconsistent data are incomprehensible. They may reject the data as an invalid source or make adjustments themselves.
- For simplicity we ensure consistency during the edit and imputation phase rather than during the estimation phase
5Imputation and edit restrictions
- Imputation
- replacement of missing values with values representing a statistical distribution
- Imputation under edit restrictions
- replacement of missing values with values representing a statistical distribution while simultaneously satisfying edit restrictions
6Current approach
- Standard approach at Statistics Netherlands
- Impute first without taking edit restrictions into account
- Adjust imputed values so that the data satisfy the edit restrictions
- Adjustment of imputed values to satisfy the edit restrictions is done in such a way that the adjustments are as small as possible
7Adjustment of imputed data (1/2)
- Minimise the distance between the imputed record (x1, ..., xn) and the adjusted record (y1, ..., yn) under the constraint that the adjusted record satisfies all edit restrictions
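Written out, the adjustment step is a constrained minimisation problem. The slides do not say which distance function is used; the weighted absolute-deviation form below, with weights w_i, is an assumption chosen only to make the structure concrete.

```latex
\begin{aligned}
\min_{y_1,\dots,y_n}\quad & \sum_{i=1}^{n} w_i \,\lvert y_i - x_i \rvert \\
\text{subject to}\quad & A y \le b
\end{aligned}
```

Here x = (x_1, ..., x_n) is the imputed record, y = (y_1, ..., y_n) the adjusted record, and Ay ≤ b the edit restrictions. Presumably only the imputed entries are allowed to change, which would correspond to fixing y_i = x_i for the observed values.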
8Adjustment of imputed data (2/2)
9Problem with current approach
- Adjustment of imputed values leads to a record on the boundary of the feasible region for the variables to be imputed
- An approach that leads to records inside the feasible region for the variables to be imputed would be preferred
10Alternative approaches (1/2)
- Use a truncated multivariate normal distribution with support on the feasible region of the variables to be imputed
- Disadvantage
- The truncated multivariate normal distribution is complicated
- Even determining the mean is complex
11Alternative approaches (2/2)
- Partially incomplete MCMC
- Separate regression imputation model for each variable to be imputed
- Iteratively impute all variables until convergence to a joint distribution
- For each variable to be imputed the edit restrictions reduce to a feasible interval
- Disadvantage
- For each variable to be imputed a separate regression model has to be specified and estimated
- Joint distribution may not exist
12Our approach
- Estimate the model parameters, e.g. by means of the EM algorithm
- Repeat the following steps for each variable i to be imputed
- Fill in the known values (observed or already imputed) in the edit restrictions
- Use Fourier-Motzkin elimination to determine the edit restrictions for variable i
- Draw a value for variable i, using the conditional distribution given all known values (either observed or imputed), until it satisfies the edit restrictions
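The loop over variables can be sketched as a small driver. This is only an outline under stated assumptions: the names `impute_sequentially`, `interval_for` and `draw_conditional` are hypothetical, and the two callables stand in for the elimination step and for the conditional distribution, neither of which is implemented here.

```python
import random


def impute_sequentially(observed, missing, interval_for, draw_conditional,
                        max_tries=10_000):
    """Impute `missing` variables one at a time, redrawing until a draw is feasible.

    interval_for(var, known)     -> (lower, upper) implied by the edits for `var`
    draw_conditional(var, known) -> one draw from the conditional distribution of `var`
    Both callables are placeholders supplied by the caller.
    """
    known = dict(observed)
    for var in missing:
        lower, upper = interval_for(var, known)
        for _ in range(max_tries):
            value = draw_conditional(var, known)
            if lower <= value <= upper:
                known[var] = value  # later variables condition on this value
                break
        else:
            raise RuntimeError(f"no feasible draw found for {var!r}")
    return known


# Toy usage with made-up numbers: one missing variable T, bounded by 0 <= T <= 550 N.
rng = random.Random(0)
result = impute_sequentially(
    observed={"N": 5},
    missing=["T"],
    interval_for=lambda var, known: (0.0, 550.0 * known["N"]),
    draw_conditional=lambda var, known: rng.gauss(1000.0, 500.0),
)
print(result)
```

A variable that is fully determined by an equality edit has a feasible interval of width zero, so in practice it would be filled in directly rather than drawn.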
13Handling edits: Fourier-Motzkin elimination
- Given a set of linear constraints, Fourier-Motzkin elimination can be used to determine constraints for a subset of variables
- If the constraints for a subset can be satisfied, the constraints for the entire set of variables can also be satisfied
- In our case the edit restrictions are the constraints
14Fourier-Motzkin elimination example (1/2)
- Suppose 3 edit restrictions are given
- X ≤ Y
- Y ≤ 5X
- Y ≤ Z
- Eliminating Y leads to
- X ≤ 5X
- X ≤ Z
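A minimal sketch of one Fourier-Motzkin step, applied to the three edits X ≤ Y, Y ≤ 5X and Y ≤ Z. The representation (a dictionary of coefficients plus a right-hand side, read as "sum of coefficient times variable ≤ rhs") and the function name `eliminate` are assumptions for this sketch; it handles inequalities only and does not remove redundant constraints.

```python
from fractions import Fraction
from itertools import product


def eliminate(constraints, var):
    """Remove `var` from constraints of the form sum(coef * x) <= rhs."""
    lower, upper, rest = [], [], []
    for coefs, rhs in constraints:
        c = coefs.get(var, 0)
        if c > 0:
            upper.append((coefs, rhs))   # row gives an upper bound on var
        elif c < 0:
            lower.append((coefs, rhs))   # row gives a lower bound on var
        else:
            rest.append((coefs, rhs))    # row does not involve var
    for (lo, lo_rhs), (up, up_rhs) in product(lower, upper):
        # Scale both rows so var has coefficient -1 and +1, then add them.
        s_lo = Fraction(1) / -lo[var]
        s_up = Fraction(1) / up[var]
        combined = {}
        for v in set(lo) | set(up):
            if v == var:
                continue
            coef = s_lo * lo.get(v, 0) + s_up * up.get(v, 0)
            if coef != 0:
                combined[v] = coef
        rest.append((combined, s_lo * lo_rhs + s_up * up_rhs))
    return rest


edits = [
    ({"X": 1, "Y": -1}, 0),   # X <= Y
    ({"Y": 1, "X": -5}, 0),   # Y <= 5X
    ({"Y": 1, "Z": -1}, 0),   # Y <= Z
]
reduced = eliminate(edits, "Y")
for coefs, rhs in reduced:
    print(coefs, "<=", rhs)
```

The two resulting rows are -4X ≤ 0 and X - Z ≤ 0, which are the edits X ≤ 5X and X ≤ Z in rearranged form.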
15Fourier-Motzkin elimination example (2/2)
- Conversely, given the edit restrictions
- X ≤ 5X
- X ≤ Z
- Hence a value Y exists such that
- X ≤ Y ≤ min(5X, Z)
- That is, a value Y exists such that
- X ≤ Y
- Y ≤ 5X
- Y ≤ Z
16The statistical distribution
- For simplicity we assume the data to be approximately multivariate normally distributed
- All conditional distributions are hence also approximately (multivariate) normally distributed
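For reference, the standard result behind this statement: if the record is partitioned into the variables to be drawn, x_1, and the known variables, x_2, then the conditional distribution of x_1 given x_2 is again normal. The partition notation is introduced here for illustration and is not taken from the slides.

```latex
\begin{pmatrix} x_1 \\ x_2 \end{pmatrix}
\sim N\!\left(
\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},
\begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}
\right)
\quad\Longrightarrow\quad
x_1 \mid x_2 \sim N\!\left(
\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\;
\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}
\right)
```

When a single variable is drawn at a time, x_1 is a scalar, so each draw comes from a univariate normal distribution whose mean depends on the values already known.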
17Example of our algorithm (1/6)
- Edit restrictions given by
- T = P + C
- P ≤ 0.5T
- -0.1T ≤ P
- T ≥ 0
- T ≤ 550N
- N = 5 (observed)
- T, P and C are missing
18Example of our algorithm (2/6)
- Fill in the observed value for N into the edit restrictions
- This leads to the following edit restrictions for T, C and P
- T = P + C
- P ≤ 0.5T
- -0.1T ≤ P
- T ≥ 0
- T ≤ 2750
19Example of our algorithm (3/6)
- Eliminate P
- Edit restrictions for T and C
- T - C ≤ 0.5T
- -0.1T ≤ T - C
- T ≥ 0
- T ≤ 2750
20Example of our algorithm (3/6)
- Eliminate P
- Edit restrictions for T and C
- 0.5T ≤ C
- C ≤ 1.1T
- T ≥ 0
- T ≤ 2750
21Example of our algorithm (4/6)
- Eliminate C
- Edit restrictions for T
- 0.5T ≤ 1.1T
- T ≥ 0
- T ≤ 2750
- Now we draw values for T from the distribution for T given the observed value for N until a value satisfies the edit restrictions, say T = 1200
22Example of our algorithm (5/6)
- We consider edit restrictions for T and C
- 0.5T ≤ C
- C ≤ 1.1T
- T ≥ 0
- T ≤ 2750
- Fill in the imputed value for T (1200)
- 600 ≤ C
- C ≤ 1320
- Draw values for C from the distribution for C given the observed or imputed values for N and T until the edit restrictions are satisfied, say C = 700
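Filling in known values turns the edits into a simple interval for the one remaining variable. The sketch below does this substitution for C with T = 1200; the function name `interval_for` and the dictionary representation of the edits (read as "sum of coefficient times variable ≤ rhs") are assumptions for this sketch.

```python
import math


def interval_for(var, edits, known):
    """Feasible interval for `var` after filling in the `known` values.

    Each edit is (coefs, rhs), read as sum(coefs[v] * v) <= rhs. Every
    variable other than `var` that appears in an edit must be in `known`.
    """
    lower, upper = -math.inf, math.inf
    for coefs, rhs in edits:
        c = coefs.get(var, 0.0)
        # Move the filled-in terms to the right-hand side.
        rest = rhs - sum(k * known[v] for v, k in coefs.items() if v != var)
        if c > 0:
            upper = min(upper, rest / c)
        elif c < 0:
            lower = max(lower, rest / c)
        elif rest < 0:
            raise ValueError("filled-in values already violate an edit")
    return lower, upper


edits_T_C = [
    ({"T": 0.5, "C": -1.0}, 0.0),   # 0.5 T <= C
    ({"C": 1.0, "T": -1.1}, 0.0),   # C <= 1.1 T
    ({"T": -1.0}, 0.0),             # T >= 0
    ({"T": 1.0}, 2750.0),           # T <= 2750
]
low, high = interval_for("C", edits_T_C, {"T": 1200.0})
print(low, high)
```

Up to floating-point rounding, this reproduces the interval 600 ≤ C ≤ 1320 from the slide.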
23Example of our algorithm (6/6)
- We consider edit restrictions for T, C and P
- T = P + C
- P ≤ 0.5T
- -0.1T ≤ P
- T ≥ 0
- T ≤ 2750
- Fill in the imputed values for T (1200) and C (700)
- 1200 = P + 700
- P ≤ 600
- -120 ≤ P
- We impute the only allowed value for P: P = 500
- Imputed record: T = 1200, C = 700, P = 500, N = 5
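The whole example can be strung together in a few lines. This is a sketch under strong assumptions: the normal distributions below use made-up means and standard deviations as stand-ins for the conditional distributions of the fitted model, and the intervals are hard-coded from the eliminations worked out on the slides rather than derived in code. The function names are hypothetical.

```python
import random


def draw_within(mean, sd, lower, upper, rng, max_tries=100_000):
    """Redraw from Normal(mean, sd) until the value lies in [lower, upper]."""
    for _ in range(max_tries):
        value = rng.gauss(mean, sd)
        if lower <= value <= upper:
            return value
    raise RuntimeError("no draw fell inside the feasible interval")


def impute_record(n, rng):
    """Impute T, C and P for an observed N, in the order T, C, P."""
    # T: after eliminating P and C, the edits leave 0 <= T <= 550 N.
    # Mean and sd are placeholder values, not estimates from any data.
    t = draw_within(mean=200.0 * n, sd=100.0 * n,
                    lower=0.0, upper=550.0 * n, rng=rng)
    # C: with T filled in, the edits leave 0.5 T <= C <= 1.1 T.
    c = draw_within(mean=0.8 * t, sd=0.2 * t,
                    lower=0.5 * t, upper=1.1 * t, rng=rng)
    # P: the balance edit T = P + C leaves a single allowed value.
    p = t - c
    return {"T": t, "C": c, "P": p, "N": n}


record = impute_record(n=5, rng=random.Random(2024))
print(record)
```

Because T and C are accepted only inside their feasible intervals and P follows from the balance edit, the resulting record satisfies all edits and will generally lie in the interior of the feasible region rather than on its boundary.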
24Current status of research
- Software has been developed and tested
- Currently, we are carrying out evaluation experiments
- Our evaluation results will be compared to the current approach at Statistics Netherlands (imputation followed by adjustment of imputed values)