Assessing the Quality of Age and Date Reporting


1
Assessing the Quality of Age and Date Reporting
in the Demographic and Health Surveys
Thomas W. Pullum
Department of Sociology
The University of Texas at Austin
tom.pullum_at_mail.utexas.edu
Prepared for UT Population Research Center
brownbag, April 22, 2005
2
This presentation describes a combination of two
things

A report to DHS that will appear in one of their
series of publications. It has been completed
and is currently going through their editing
process.
A paper given at the session on Statistical
Demography at the PAA meetings in Philadelphia.
I hope to revise this and submit it to a journal,
probably Demographic Research.

3
During the past several decades there has been a
general shift of methodological perspective in
demographic analysis. This shift has coincided
with
--the increasing availability of individual-level data,
--the development of statistical methods to analyze such data,
--and vastly increased computing power.
One category of demographic methods that has not
participated in the transition to a statistical
framework is the analysis of the quality of age
and date reporting. The techniques to identify
and adjust for such deficiencies are still
expressed in terms of aggregated data, principally
age ratios, sex ratios, and summary measures such
as Myers' Blended Index.
4
Goal: to maintain the reasoning behind these
methods but to modify and extend them to have the
following features:
--to be calculated from files of individual cases
--to be calculated with a readily available statistical package, such as Stata
--to be calculated with statistical models such as logit regression
--to be accompanied with parameter estimates, standard errors, confidence intervals, and test statistics
--to incorporate the sampling design by using sampling weights and clustering
--to be able to include categorical or interval-level covariates.
5
Why is it important to have good quality
estimates of ages and dates?

Eligibility for inclusion in a sample or for some
questions is determined partly by current age
Several rates depend on ages and dates, especially
recent levels of fertility and infant/child
mortality
The overall quality of a survey's data is
probably indicated by the quality of the age and
date reporting

6
Quality is relative

There is not a clear demarcation between poor
quality and good quality. In DHS surveys, in
particular, the best one can do is to rank the
surveys and identify the ones that have the most
problems.
The analysis described here would ideally be
accompanied by simulations that show the
sensitivity of summary measures, such as infant
mortality rates, to specific levels of heaping
and displacement of children's dates of birth
and death.

7
Demographic and Health Surveys

Household survey
Survey of women age 15-49, including complete
birth histories
Other surveys of men, youth, or service
availability, not part of this project

8
DHS Household Survey

Roster of all household members, with age, sex,
de jure or de facto residence, education,
relation to head
Quality of housing, household possessions
Usually some country-specific questions on
parental survival, health, presence of iodized
salt, use of mosquito nets, etc.

9
Survey of Women

Eligible respondents are ALL de jure resident
women age 15-49 in the household
Complete birth histories
Health questions about children born in the last
five years (some variation in this interval)
Other topics such as contraceptive use,
employment, prenatal care
Some country-specific topics

10
Data for this project

99 Household surveys conducted from 1990 to 2003,
with a mean of 69,143 individuals.
125 Surveys of women 15-49 conducted from 1985 to
2003, with a mean of 9,904 women.
Corresponding 125 child files, with a mean of
30,865 children. These are constructed from the
woman files, with one record per child. Some of
the mother's data is repeated on that record.

11
Data processing strategy

DHS provides the following rectangular text files
for each survey
.dat, the data in fixed format, with a fixed
number of records per case
.do, the variable names, labels, and category
labels
.dct (for dictionary), with the location (record
number and column number) of each variable
DHS has standard names for variables, and
standard categories (usually), but not standard
locations

12
Data processing strategy (cont.)

Identify the variables of interest (ages, dates,
interview information, and possible covariates)
and list them in a text file.
Edit the .do and .dct files to retain only the
variables of interest. This was done with a
Fortran program in a single run.
Construct Stata data files (.dta) for all of the
files of households and of women in two Stata
runs.

13
Data processing strategy (cont.)

Construct all the child files (one record per
child) in one Stata run, by applying the reshape
command to the birth history records in the .dta
files of women
Then apply the procedures to all household files,
all woman files, all child files (three runs).
Each of these three runs produces a new data
file, with one record per data set, giving the
various measures.
Finally, analyze each of these three summary
files with histograms, tables, etc.
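
The step that builds one record per child can be pictured with a minimal sketch. The slides use Stata's reshape command; Python is used here only for illustration, and the field names (caseid, age, birth_1, birth_2) and sample values are hypothetical, not DHS variable names.

```python
# Hypothetical woman records: one row per woman, with birth-history
# fields repeated under a numeric suffix (wide layout).
women = [
    {"caseid": "A1", "age": 31, "birth_1": 1205, "birth_2": 1170},
    {"caseid": "A2", "age": 24, "birth_1": 1221, "birth_2": None},
]


def women_to_children(rows, stub="birth_", max_births=2):
    """Return one record per reported birth (long layout)."""
    children = []
    for row in rows:
        for i in range(1, max_births + 1):
            value = row.get(f"{stub}{i}")
            if value is None:  # no birth recorded in this slot
                continue
            children.append({
                "caseid": row["caseid"],   # link back to the mother
                "birth_index": i,          # position in the birth history
                "mother_age": row["age"],  # mother's data repeated
                "birth_date": value,
            })
    return children


children = women_to_children(women)
print(len(children), children[0])
```

Each child record repeats some of the mother's fields, as described for the child files on slide 10.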

14
The reports of ages and dates in DHS surveys that
are of primary interest include the following:
--ages of all household members, reported in the household survey
--ages and birth dates of women, reported in the survey of women 15-49
--ages at marriage and dates of marriage of these women
--dates of birth of children
--ages at death of children who have died, especially before age two.
DHS surveys include other ages and dates, but the
ones listed above would generally be considered
most important, and the others would be analyzed
similarly.
15
Three principal kinds of problems can arise with
these reports:
--Incompleteness
--Heaping
--Displacement
The omission of eligible cases is not included.
This could be extremely serious, for example if
there was systematic omission of births,
especially of ones that had resulted in an early
child death, but there is very little evidence of
such omission in DHS surveys, at least, so it
will not be considered to be an issue. In some
other applications, such as an evaluation of
vital statistics reporting in a developing
country, it certainly could not be ignored.
16
Incompleteness
First, in the surveys of women,
although not in the household surveys, there can
be incompleteness in the reporting of an age or
date. For example, a woman may report her
current age but not her birth year or birth
month. In this case, or if the responses that
are given are internally inconsistent, DHS uses
automated imputation procedures so that age,
birth year, and birth month are all present and
consistent.
17
Histogram for the incompleteness of the woman's
age data.

The x axis is the proportion of women in a survey
who did not give complete and consistent values
of age, month of birth, and year of birth
(relative to the month and year of interview).
The y axis is the number of surveys with this
level of incompleteness
DHS resolves any incompleteness or
inconsistencies with an imputation procedure

18
(No Transcript)
19
Table to list the surveys with the most serious
incompleteness of age, age at marriage, and ages
of children

Table 3.1.1 Surveys of women 15-49 with highest
levels of incompleteness of age, marriage, or
birth data.
Column (1) Country
Column (2) Median year of survey
Column (3) Proportion of women missing any age
or birth date information
Column (4) Proportion of women missing any
marriage age or date information
Column (5) Sum, across births, of proportions
missing any birth history data
The table lists surveys with (3) > .6 or (4) > .6 or
(5) > 1.0.
(1)             (2)    (3)    (4)    (5)
Bangladesh      1997   0.78   0.10   0.03
Bangladesh      2000   0.92   0.66   0.11
Benin           1996   0.81   0.69   1.74
Benin           2001   0.74   0.67   1.42
Burkina Faso    1993   0.72   0.57   0.94
Burkina Faso    1999   0.82   0.70   1.58
Burundi         1987   0.62   0.30   0.60
.
Sudan           1990   0.84   0.64   1.99
Togo            1988   0.73   0.63   1.58
Togo            1998   0.69   0.64   1.02

20
Heaping
The second type of problem is heaping.
This usually takes the form of excessive numbers
of cases reported at ages ending in 0 or 5, but
heaping can occur at other digits, especially for
persons under age 20, and sometimes takes the
form of heaping on calendar years that end in 0
or 5. Age at the death of a child is given in
months if the child died before the second
birthday. There is usually substantial heaping
at 12 months and often some heaping at 6 and 18
months.
21
Histogram for age heaping in the household survey

Myers' Index is the percentage of cases that
would have to be shifted from one final digit
(0-9) to another in order to get a uniform
distribution across the final digit
The x axis gives the value of the index
The y axis gives the number of surveys with this
value of the index

22
(No Transcript)
23
Table listing the household surveys with the most
age heaping

Table 2.1.1 Household surveys with strongest
evidence of heaping by age.
Column (1) Country
Column (2) Median year of survey
Column (3) Myers Index
Column (4) Percent excess at final digit 0 or 5
The table lists surveys with (3) > 10 or (4) > 10
(1)           (2)    (3)    (4)
Bangladesh    1994   11.6   10.4
Bangladesh    1996   10.4    9.7
Bangladesh    2000   12.4   12.1
Benin         2001   10.6    8.4
Chad          1997   12.4   11.0
India         1993   17.1   15.0
India         1999   17.1   15.3
Niger         1992   15.3   14.3
Niger         1998   13.2   12.8
Nigeria       1990   19.7   19.6
Nigeria       1999   15.6   15.2

24

The list includes ALL the surveys in South Asia,
except for the 1987 survey of Sri Lanka.
The other surveys were in Yemen and West Africa.
Age heaping is probably the type of misreporting
that is least sensitive to interviewer effects
and most sensitive to the cultural meaning of
age.

25
Displacement
Thirdly, there can be net transfers or
displacements of age. Interviewers have some
motivation to shift the ages of women who are
just inside the boundaries of the 15-49 interval
in order to reduce their workload.
--Shifting of eligible respondents to be below the minimum age of eligibility
--Shifting of eligible respondents to be above the maximum age of eligibility
--There may also be some shifting of births to be outside the maximum age of eligibility for the health questions
26
Histograms showing the levels of age displacement
of women

The x axis is the estimated percentage of women
ACTUALLY 15-19 who were misreported as 10-14, or
the estimated percentage of women ACTUALLY 45-49
who were misreported as 50-54.
The y axis is the number of surveys with this
level of displacement.

27
(No Transcript)
28
(No Transcript)
29

Tables list the surveys with estimated downward
transfer levels of 10 percent or more, and surveys
with estimated upward transfer levels of 20
percent or more.
Upward transfers are more common than downward
transfers. Why?
All of these countries are in sub-Saharan Africa
(except for the 1997 survey of Kyrgyzstan). The
tables will not be given here.

30
Transfers of children outside the window for
extra health questions

Interviewers are also motivated to report
children as being older than they actually are,
in order to reduce their workload.
The measure is an estimate of the percentage of
children ACTUALLY one year inside the window who
were misreported as being one year outside the
window.

31
(No Transcript)
32

The threshold for listing surveys with high
levels of birth displacement was 10 percent.
Most of the surveys were in sub-Saharan Africa,
but some were in the Middle East, along with
surveys in Pakistan, Haiti, and Guatemala.

33
Displacement of birthdates can have a very
serious impact

When there are two successive surveys in the same
country, we can estimate fertility and infant
mortality rates in a window before the first
survey, using both the first and the second
survey.
Use the three calendar years before the first
survey for the TFR and the five calendar years
before the first survey for the IMR.

34
The second survey tends to give a higher estimate
of the TFR
35
The second survey also tends to give a higher
estimate of the IMR
36
Implications of getting different estimates for
two successive surveys

It is not possible to generalize about whether
the first survey or the second survey is more
accurate; it depends on whether there was worse
displacement in one survey or the other.
There can also be other explanations of such
differences, related to coverage, quality of the
sample, other interviewer effects, etc.

37
How do you actually calculate these measures of
misreporting?

Go back to the goal I stated earlier, to develop
measures that have statistical properties
Calculated from individual data
Incorporate the sampling design
Use statistical packages
Have standard errors
Can have covariates
It helps to distinguish incompleteness, on the
one hand, from heaping and displacement, on the
other

38
Identifying and assessing incompleteness
Incompleteness of age and date reporting would
traditionally be assessed by tabulating the
different kinds of information that are given
(e.g. age, year of birth, and month of birth)
and calculating the proportion of cases in which
the information is incomplete or inconsistent.
I will illustrate how logit regression can be
applied in this context.
39
Identifying and assessing heaping and
displacement
Traditional methods typically proceed through two
steps.
Step 1: calculation of expected frequencies,
proportions, or ratios.
Step 2: calculation of an index based on
differences that should be close to zero or
ratios that should be close to one if there is
no misreporting.
I will illustrate how multinomial logit
regression, and logit regression, respectively,
can be used instead.
40
Examples of step 1 (the calculation of expected
values)
Myers' Blended Index assumes that, after
adjustment, each final digit 0 through 9 will be
equally likely; the expected proportion at each
final digit will be .10.
Successive age ratios should be approximately
equal. E.g. the ratio of females age 10-14 to
females age 5-9 should be about the same as the
ratio of females age 15-19 to females age 10-14.
But if females have been systematically shifted
downwards across age 15, then the first ratio
should be noticeably larger than the second one.
41
Examples of step 2 (the calculation of a summary
measure of deviations from expected values)
Myers' Blended Index is just the index of
dissimilarity for a comparison of the observed
(but blended) proportions at each final digit
with the expected proportions, uniformly .10; it
is one-half the sum of the absolute deviations.
Rutstein and Bicego (1990) use an overall measure
of age displacement which adds (a) the difference
between the two age ratios around age 15 and (b)
the difference between the two age ratios around
age 50.
42
Note that statistical models generally involve
the same basic logic:
--Calculation of expected values
--Summary measures of the deviations between observed and expected values
43
Focus now on three examples
Example 1: Use logit regression to analyze
incompleteness of age reporting in the Bangladesh
2000 survey of women
Example 2: Use multinomial logit regression to
analyze age heaping in the India 1998/99
household survey (a modification of Myers'
Blended Index)
Example 3: Use logit regression to analyze
transfers below age 15 in the 1990 Nigeria
household survey (a modification of the age ratio
approach)
44
Example 1: Incompleteness of age reporting in
the Bangladesh 2000 survey of women
A variable y, incompleteness, is assigned the
value 1 if the reporting of age and birthdate was
incomplete (v014 > 1) and 0 otherwise. We do a
logit regression of y with no covariates, getting
a coefficient b0 on the logit scale. The
exponential of b0 will be the observed odds of an
incomplete response; the observed proportion will
be given by exp(b0)/[1+exp(b0)]. A confidence
interval for the population proportion is
obtained by applying the same transformation to
the two ends of the confidence interval for the
population value of b0.
45
Logit regression applied to incompleteness of
woman's age

. logit y [pweight=v005], cluster(v001)
(sum of wgt is 1.0544e+10)
Iteration 0: log pseudo-likelihood = -2636.087
Logit estimates                      Number of obs = 10544
                                     Wald chi2(0)  = 0.00
                                     Prob > chi2   = .
Log pseudo-likelihood = -2636.087    Pseudo R2     = 0.0000
(standard errors adjusted for clustering on v001)
----------------------------------------------------------------------
      |              Robust
    y |     Coef.   Std. Err.       z    P>|z|   [95% Conf. Interval]
------+---------------------------------------------------------------
_cons |  2.608361   .0755412    34.53   0.000    2.460303   2.756419
----------------------------------------------------------------------

46
Convert the coefficients in the output to
estimated proportions

exp(2.608361) = 13.5768
13.5768/(1 + 13.5768) = .9314
Point estimate is .9314
95% confidence interval is (.9213, .9403)
These estimates are adjusted for sample weights
and clustering
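
The conversion above can be checked with a short sketch. It is written in Python for illustration (the slides use Stata), and it takes the coefficient and interval endpoints directly from the output on slide 45; the function name inv_logit is introduced here and is not from the slides.

```python
import math


def inv_logit(b):
    """Map a value on the logit scale to a proportion."""
    return math.exp(b) / (1.0 + math.exp(b))


# _cons and the two ends of its 95% confidence interval (slide 45).
b0, b0_low, b0_high = 2.608361, 2.460303, 2.756419

odds = math.exp(b0)             # observed odds of an incomplete response
p = inv_logit(b0)               # point estimate of the proportion
p_low = inv_logit(b0_low)       # lower end of the interval
p_high = inv_logit(b0_high)     # upper end of the interval

print(round(odds, 4), round(p, 4), round(p_low, 4), round(p_high, 4))
```

Because the transformation is monotone, applying it to the two ends of the interval for b0 gives an interval for the proportion, which is the approach described on slide 44.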

47
Multivariate analysis
Then y can be regressed on covariates for a much
more complete description of the pattern of
incompleteness than would otherwise be possible.
We estimate a series of models with four
covariates:
--Type of place of residence
--District
--Age interval (reported or imputed)
--Woman's years of schooling
48

Logit regressions of incompleteness of age and
birthdate reporting on type of place of
residence, district, age, and education. DHS
survey of Bangladesh 2000. n = 10,544.
______________________________________________________________
Incomplete        Model 1         Model 2         Model 3
Age/Birthdate     OR      z       OR      z       OR      z
______________________________________________________________
Type of Place
  urban           .29   -9.26     .28   -9.22     .68   -3.31
  rural          1.00    ----    1.00    ----    1.00    ----
District
  Barisal         .48   -2.96     .47   -2.97     .51   -2.87
  Chittagong      .79   -1.10     .79   -1.07     .97   -0.18
  Dhaka          1.00    ----    1.00    ----    1.00    ----
  Khulna          .37   -4.94     .38   -4.83     .32   -5.62
  Rajashahi       .68   -2.02     .68   -2.01     .56   -2.98
  Sylhet          .64   -1.59     .61   -1.73     .50   -3.02

49
Example 2 Age heaping in the India 1998/99
household survey

488,839 de jure residents age 0-79
This is a modification of Myers' Blended Index.
Myers' Index is traditionally calculated from
aggregated data, that is, from an age
distribution in single years of age, using a
spreadsheet, as illustrated in the following
table.

50
India 1998/99 household survey, unweighted de
jure age distribution

0 11,250
1 10,454
2 10,868
3 10,812
4 12,006
5 13,167
6 12,735
7 11,641
8 13,508
9 10,247
10 14,218
11 9,130
12 14,086
13 10,173
14 10,928
15 11,359
16 11,159
17 8,664
18 13,360

20 13,267
21 6,468
22 10,694
23 7,020
24 7,326
25 13,971
26 7,534
27 6,125
28 9,355
29 4,302
30 15,545
31 3,210
32 7,606
33 3,648
34 3,857
35 15,131
36 4,785
37 3,231
38 6,030

51
India 1998/99 household survey, unweighted de
jure age distribution

40 12,798
41 2,152
42 4,702
43 2,340
44 2,260
45 10,462
46 2,595
47 2,241
48 3,844
49 1,824
50 7,008
51 1,780
52 3,492
53 1,862
54 1,739
55 7,323
56 2,047
57 1,270
58 2,545

60 9,657
61 863
62 1,963
63 879
64 818
65 6,443
66 775
67 648
68 1,178
69 516
70 5,632
71 350
72 877
73 308
74 337
75 2,185
76 342
77 151
78 360

52
Percentage distribution of household residents
across final digit of age

Column (1): Unweighted
Column (2): Weighted by sampling weights
Column (3): Weighted by product of Myers' weights
and sampling weights
Column (4): Absolute deviation of column (3)
from a uniform distribution
y           (1)      (2)      (3)      (4)
----------------------------------------------
0         18.28    18.28    18.10     8.10
1          7.04     7.16     6.06     3.94
2         11.11    11.17    10.73     0.73
3          7.58     7.51     6.91     3.09
4          8.03     8.01     7.57     2.43
5         16.37    16.43    17.17     7.17
6          8.59     8.63     8.78     1.22
7          6.95     6.90     7.18     2.82
8         10.27    10.21    11.12     1.12
9          5.79     5.71     6.39     3.61
----------------------------------------------
Total    100.00   100.00   100.00    34.23
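
As a check on the last step of the spreadsheet approach, the sketch below recomputes column (4) and the index from the rounded column (3) percentages. It is Python for illustration only; the variable names are introduced here. Because the inputs are rounded to two decimals, the result (17.115) can differ in the last digit from the 17.11 reported on slide 54.

```python
# Column (3): blended, weighted percentage at each final digit 0-9.
blended = [18.10, 6.06, 10.73, 6.91, 7.57, 17.17, 8.78, 7.18, 11.12, 6.39]

# Column (4): absolute deviation from the uniform value of 10 percent.
deviations = [abs(pct - 10.0) for pct in blended]
total_abs_dev = sum(deviations)

# Myers' Index is one-half the sum of the absolute deviations.
myers_index = total_abs_dev / 2.0

print(round(total_abs_dev, 2), round(myers_index, 3))
```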

53
Multinomial logit approach to Myers' Blended
Index
--Within a range such as 0-79, a respondent's age
is converted to a tens digit and a ones digit.
--Calculate a multinomial logit regression with
the ones digit as dependent variable y, no
covariates, and weights wt. In Stata: mlogit y
[pweight=wt].
--Construct (using the predict command in Stata)
ten variables that are the estimated
probabilities that y=0, y=1, ..., y=9 for each
case in the file. The estimated probabilities
will be the same for every case and may be
referred to as p0, p1, ..., p9.
--Construct a variable
M = 50 x (|p0 - .10| + |p1 - .10| + ... + |p9 - .10|)
to get Myers' Index as a percentage, that is,
one-half the sum of the absolute deviations.
--M will have the same value for every case.
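
With no covariates, the fitted probabilities from a weighted multinomial logit are simply the weighted proportions at each ones digit, so the calculation can be sketched without a regression routine. The Python below is an illustration under that assumption; the function name and the toy cases are hypothetical, and the weight passed in would be whatever weight the analysis uses (for example the product of the Myers' weight and the sampling weight).

```python
def myers_from_cases(cases):
    """cases: iterable of (age, weight) pairs. Returns M on a 0-100 scale."""
    totals = [0.0] * 10
    for age, weight in cases:
        totals[age % 10] += weight          # ones digit of the reported age
    grand_total = sum(totals)
    probs = [t / grand_total for t in totals]   # p0, p1, ..., p9
    # One-half the sum of absolute deviations from .10, as a percentage.
    return 50.0 * sum(abs(p - 0.10) for p in probs)


# Toy data: one case at every final digit (no heaping) ...
uniform_cases = [(20 + digit, 1.0) for digit in range(10)]
# ... and every case reported at an age ending in 0 (extreme heaping).
heaped_cases = [(30, 2.0), (40, 1.5), (50, 0.5)]

m_uniform = myers_from_cases(uniform_cases)
m_heaped = myers_from_cases(heaped_cases)
print(m_uniform, m_heaped)
```

The two toy inputs bracket the range of the index: 0 when the final digits are uniform and 90 when all cases fall on a single digit.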
54

This will give the same value of Myers' Index,
17.11, as the spreadsheet approach.
The index can be obtained as the average of M for
all cases or just be listed out for the first
case.
It is then possible to add covariates and get M
as a function of one or more other variables.

55
Example: the covariate is the first digit of the
reported age, which takes the values 0, 1, ..., 7.

56
Here the covariate is completed years of
schooling of the household respondent

57
Example 3: Use of logit regression to analyze
transfers below age 15 in the 1990 Nigeria
household survey (a modification of the age ratio
approach)
Standard method: Identify two age intervals below
the boundary and one above it
Age        Observed
Interval   Frequency
 5- 9      a
10-14      b
15-19      c
Downward transfers will tend to reduce c and
inflate b. The difference (b/a)-(c/b) measures
the amount of downward transfer.
58
Modified approach, expressed in terms of
aggregate data
Identify two age intervals below the boundary and
two above it
Age        Observed    Fitted
Interval   Frequency   Frequency
 5- 9      a           a
10-14      b           b'
15-19      c           c'
20-24      d           d
Assume that the only net transfers are between b
and c, and the fitted frequencies follow a
regular pattern
59
The single survey with the strongest evidence of
downward shifts of women was the 1990 survey of
Nigeria.
Age        Observed    Fitted
Interval   Frequency   Frequency
 5- 9      a=3974      a
10-14      b=3259      b'=2832
15-19      c=1733      c'=2159
20-24      d=1760      d
The proportion of true cases shifted downward
from age 15-19 to age 10-14 is
(2159 - 1733) / 2159 = 1 - (1733/2159) = .197,
or 19.7%.
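
The slides do not spell out the "regular pattern" behind the fitted frequencies. The Python sketch below tries one assumption that comes close to the fitted values shown for Nigeria 1990: the total of the two middle intervals is preserved, and the ratio of successive age ratios is constant across the four intervals, which makes the fitted ratio of the third to the second interval equal to (d/a) raised to the power 1/3. This is an illustrative guess, not the author's stated method, and the variable names are introduced here.

```python
# Observed frequencies for ages 5-9, 10-14, 15-19, 20-24 (slide 59).
a, b, c, d = 3974, 3259, 1733, 1760

# Standard measure from slide 57: difference of successive age ratios.
standard = (b / a) - (c / b)

# ASSUMPTION (not stated in the slides): fitted third/second ratio is
# (d/a) ** (1/3), and the fitted second and third intervals sum to b + c.
ratio = (d / a) ** (1.0 / 3.0)
b_fit = (b + c) / (1.0 + ratio)
c_fit = (b + c) - b_fit

# Proportion of true 15-19 cases shifted down to 10-14.
share_shifted = 1.0 - c / c_fit

print(round(standard, 3), round(b_fit, 1), round(c_fit, 1),
      round(share_shifted, 3))
```

Under this assumption the fitted values come out within one case of the 2832 and 2159 shown, and the estimated share shifted downward matches the .197 on the slide.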
60
The calculations described above for aggregate
data can be replicated with a logit regression.
The crucial step is the construction of two
artificial variables. The first one, called x,
distinguishes the first and fourth age intervals
from the second and third. The second, called y,
distinguishes the second interval in the pair
from the first interval.
61
Layout of the four successive age groups into a
2x2 table for logit regression approach to age
transfers.
a = number of cases in first age group (e.g. 5-9),
b = number of cases in second age group (e.g. 10-14),
c = number of cases in third age group (e.g. 15-19),
d = number of cases in fourth age group (e.g. 20-24).

               x
            0     1
         -------------
      0  |  a  |  b  |
  y      -------------
      1  |  d  |  c  |
         -------------
62
Do a logit regression of y on x using the
frequencies as weights. Equivalently, do a
logit regression of y on x using the underlying
individual-level data file.
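
With a single binary x the logit model is saturated, so its maximum likelihood coefficients can be written directly from the four cell frequencies without running a regression. The Python sketch below illustrates this with the Nigeria 1990 frequencies from slide 59; it is not the Stata run used in the presentation, and the names cells, b0, and b1 are introduced here.

```python
import math

# Frequencies for the four age groups (slide 59).
a, b, c, d = 3974, 3259, 1733, 1760

# Cells of the 2x2 layout: (x, y) -> frequency used as a weight.
cells = {(0, 0): a, (0, 1): d, (1, 0): b, (1, 1): c}

# Saturated model: the fit reproduces the observed log odds of y = 1
# within each level of x.
b0 = math.log(cells[(0, 1)] / cells[(0, 0)])        # log(d/a)
b1 = math.log(cells[(1, 1)] / cells[(1, 0)]) - b0   # log(c/b) - log(d/a)

print(round(b0, 4), round(b1, 4))
```

Here b0 is the log odds of the fourth group relative to the first, and b0 + b1 is the log odds of the third group relative to the second.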
63
Manipulation of the coefficients from this logit
regression will produce interpretable measures of
the amount of net transfer and estimates of the
probability of a downward or upward transfer.
By adding covariates to the logit regression (and
interactions between the covariates and x) we can
obtain a multivariate model of age transfers.
64
The Nigeria 1990 survey has strong evidence of
negative transfers across age 15, i.e., many
women age 15-19 were misreported at age 10-14.
This is most severe if the household head has low
education and minimal if the household head has
high education.
65
Summary and conclusions
--Logit and multinomial logit regression can be
used to re-state some of the most common
procedures for assessing the quality of age and
date reporting. These models allow for the
incorporation of sampling weights and clustering.
--The inclusion of covariates in these models
will allow for a better understanding of the
sources of misreporting.
66
Standard errors and test statistics should be
used cautiously.

Even when there is no misreporting at all, Myers'
blended method may spuriously suggest that there
is age heaping.
The assumptions of the model for measuring age
transfers may not be satisfied, leading to
spurious evidence of displacement.
The best evidence of misreporting occurs when a
test is statistically significant AND the
estimated level is above some threshold of
substantive significance.