Title: Assessing the Quality of Age and Date Reporting
1Assessing the Quality of Age and Date Reporting
in the Demographic and Health Surveys
Thomas W. Pullum
Department of Sociology, The University of Texas at Austin
tom.pullum_at_mail.utexas.edu
Prepared for UT Population Research Center brownbag, April 22, 2005
2This presentation describes a combination of two
things:
- A report to DHS that will appear in one of their
series of publications. It has been completed
and is currently going through their editing
process.
- A paper given at the session on Statistical
Demography at the PAA meetings in Philadelphia.
I hope to revise this and submit it to a journal,
probably Demographic Research.
3During the past several decades there has been a
general shift of methodological perspective in
demographic analysis. This shift has coincided
with
--the increasing availability of individual-level data,
--the development of statistical methods to analyze such data,
--and vastly increased computing power.
One category of demographic methods that has not participated
in the transition to a statistical framework is
the analysis of the quality of age and date
reporting. The techniques to identify and adjust
for such deficiencies are still expressed in
terms of aggregated data, principally age ratios,
sex ratios, and summary measures such as Myers'
Blended Index.
4Goal: to maintain the reasoning behind these
methods but to modify and extend them to have the
following features:
--to be calculated from files of individual cases
--to be calculated with a readily available statistical package, such as Stata
--to be calculated with statistical models such as logit regression
--to be accompanied with parameter estimates, standard errors,
confidence intervals, and test statistics
--to incorporate the sampling design by using sampling
weights and clustering
--to be able to include categorical or interval-level covariates.
5Why is it important to have good quality
estimates of ages and dates?
- Eligibility for inclusion in a sample or for some
questions is determined partly by current age
- Several rates depend on ages and dates, especially
recent levels of fertility and infant/child
mortality
- The overall quality of a survey's data is
probably indicated by the quality of the age and
date reporting
6Quality is relative
- There is not a clear demarcation between poor
quality and good quality. In DHS surveys, in
particular, the best one can do is to rank the
surveys and identify the ones that have the most
problems.
- The analysis described here would ideally be
accompanied by simulations that show the
sensitivity of summary measures, such as infant
mortality rates, to specific levels of heaping
and displacement of children's dates
of birth and death.
7Demographic and Health Surveys
- Household survey
- Survey of women age 15-49, including complete
birth histories
- Other surveys of men, youth, or service
availability, not part of this project
8DHS Household Survey
- Roster of all household members, with age, sex,
de jure or de facto residence, education,
relation to head
- Quality of housing, household possessions
- Usually some country-specific questions on
parental survival, health, presence of iodized
salt, use of mosquito nets, etc.
9Survey of Women
- Eligible respondents are ALL de jure resident
women age 15-49 in the household
- Complete birth histories
- Health questions about children born in the last
five years (some variation in this interval)
- Other topics such as contraceptive use,
employment, prenatal care
- Some country-specific topics
10Data for this project
- 99 Household surveys conducted from 1990 to 2003,
with a mean of 69,143 individuals.
- 125 Surveys of women 15-49 conducted from 1985 to
2003, with a mean of 9,904 women.
- Corresponding 125 child files, with a mean of
30,865 children. These are constructed from the
woman files, with one record per child. Some of
the mother's data is repeated on that record.
11Data processing strategy
- DHS provides the following rectangular text files
for each survey:
  - .dat, the data in fixed format, with a fixed
  number of records per case
  - .do, the variable names, labels, and category
  labels
  - .dct (for dictionary), with the location (record
  number and column number) of each variable
- DHS has standard names for variables, and
standard categories (usually), but not standard
locations
12Data processing strategy (cont.)
- Identify the variables of interest (ages, dates,
interview information, and possible covariates)
and list them in a text file.
- Edit the .do and .dct files to retain only the
variables of interest. This was done with a
Fortran program in a single run.
- Construct Stata data files (.dta) for all of the
files of households and of women in two Stata
runs.
13Data processing strategy (cont.)
- Construct all the child files (one record per
child) in one Stata run, by applying the reshape
command to the birth history records in the .dta
files of women
- Then apply the procedures to all household files,
all woman files, all child files (three runs).
- Each of these three runs produces a new data
file, with one record per data set, giving the
various measures.
- Finally, analyze each of these three summary
files with histograms, tables, etc.
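
The construction of a child file from a woman file can be illustrated outside Stata. The following is a minimal Python sketch of the wide-to-long idea, not the Stata code used for this project. The field names (caseid, weight, b3_1, b3_2, ...) and the sample record are hypothetical stand-ins for the birth history variables.

```python
def woman_to_children(woman, max_births=20):
    """Turn one woman record (wide) into one record per child (long).

    Hypothetical layout: the birth history is stored as b3_1, b3_2, ...
    (one date of birth per child); unused slots are absent or None.
    """
    children = []
    for i in range(1, max_births + 1):
        dob = woman.get(f"b3_{i}")
        if dob is None:
            continue  # no birth recorded in this slot
        children.append({
            "caseid": woman["caseid"],  # mother's identifier, repeated
            "weight": woman["weight"],  # mother's sampling weight, repeated
            "birth_index": i,
            "dob": dob,
        })
    return children


# Hypothetical woman with two recorded births.
woman = {"caseid": "001-01", "weight": 1.2, "b3_1": 1205, "b3_2": 1180}
child_file = woman_to_children(woman)
for child in child_file:
    print(child)
```

Each output record carries some of the mother's data, which is what allows the child file to be analyzed with the same weights and covariates as the woman file.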
14The reports of ages and dates in DHS surveys that
are of primary interest include the
following:
--ages of all household members, reported in the household survey
--ages and birth dates of women, reported in the survey of
women 15-49
--ages at marriage and dates of marriage of these women
--dates of birth of children
--ages at death of children who have
died, especially before age two.
DHS surveys
include other ages and dates, but the ones listed
above would generally be considered most
important, and the others would be analyzed
similarly.
15Three principal kinds of problems can arise with
these reports:
--Incompleteness
--Heaping
--Displacement
The omission of eligible cases is
not included. This could be extremely serious,
for example if there was systematic omission of
births, especially of ones that had resulted in
an early child death, but there is very little
evidence of such omission in DHS surveys, so it
will not be treated as an issue here.
In some other applications, such as an
evaluation of vital statistics reporting in a
developing country, it certainly could not be
ignored.
16 Incompleteness
First, in the surveys of women,
although not in the household surveys, there can
be incompleteness in the reporting of an age or
date. For example, a woman may report her
current age but not her birth year or birth
month. In this case, or if the responses that
are given are internally inconsistent, DHS uses
automated imputation procedures so that age,
birth year, and birth month are all present and
consistent.
17Histogram for the incompleteness of the woman's
age data.
- The x axis is the proportion of women in a survey
who did not give complete and consistent values
of age, month of birth, and year of birth
(relative to the month and year of interview).
- The y axis is the number of surveys with this
level of incompleteness
- DHS resolves any incompleteness or
inconsistencies with an imputation procedure
18(No Transcript)
19Table to list the surveys with the most serious
incompleteness of age, age at marriage, and ages
of children
- Table 3.1.1 Surveys of women 15-49 with highest
levels of incompleteness of age, marriage, or
birth data.
- Column (1) Country
- Column (2) Median year of survey
- Column (3) Proportion of women missing any age
or birth date information
- Column (4) Proportion of women missing any
marriage age or date information
- Column (5) Sum, across births, of proportions
missing any birth history data
- The table lists surveys selected with thresholds
of .6 for (3), .6 for (4), or 1.0 for (5).

(1)             (2)     (3)     (4)     (5)
Bangladesh      1997    0.78    0.10    0.03
Bangladesh      2000    0.92    0.66    0.11
Benin           1996    0.81    0.69    1.74
Benin           2001    0.74    0.67    1.42
Burkina Faso    1993    0.72    0.57    0.94
Burkina Faso    1999    0.82    0.70    1.58
Burundi         1987    0.62    0.30    0.60
...
Sudan           1990    0.84    0.64    1.99
Togo            1988    0.73    0.63    1.58
Togo            1998    0.69    0.64    1.02
20 Heaping
The second type of problem is heaping.
This usually takes the form of excessive numbers
of cases reported at ages ending in 0 or 5, but
heaping can occur at other digits, especially for
persons under age 20, and sometimes takes the
form of heaping on calendar years that end in 0
or 5. Age at the death of a child is given in
months if the child died before the second
birthday. There is usually substantial heaping
at 12 months and often some heaping at 6 and 18
months.
21Histogram for age heaping in the household survey
- Myers' Index is the percentage of cases that
would have to be shifted from one final digit
(0-9) to another in order to get a uniform
distribution across the final digit
- The x axis gives the value of the index
- The y axis gives the number of surveys with this
value of the index
22(No Transcript)
23Table listing the household surveys with the most
age heaping
- Table 2.1.1 Household surveys with strongest
evidence of heaping by age.
- Column (1) Country
- Column (2) Median year of survey
- Column (3) Myers' Index
- Column (4) Percent excess at final digit 0 or 5
- The table lists surveys selected with a threshold
of 10 for (3) or 10 for (4).

(1)           (2)     (3)     (4)
Bangladesh    1994    11.6    10.4
Bangladesh    1996    10.4     9.7
Bangladesh    2000    12.4    12.1
Benin         2001    10.6     8.4
Chad          1997    12.4    11.0
India         1993    17.1    15.0
India         1999    17.1    15.3
Niger         1992    15.3    14.3
Niger         1998    13.2    12.8
Nigeria       1990    19.7    19.6
Nigeria       1999    15.6    15.2
24 - The list includes ALL the surveys in South Asia,
except for the 1987 survey of Sri Lanka.
- The other surveys were in Yemen and West Africa.
- Age heaping is probably the type of misreporting
that is least sensitive to interviewer effects
and most sensitive to the cultural meaning of
age.
25 Displacement
Thirdly, there can be net
transfers or displacements of age. Interviewers
have some motivation to shift the ages of women
who are just inside the boundaries of the 15-49
interval in order to reduce their workload.
--Shifting of eligible respondents to be below
the minimum age of eligibility
--Shifting of eligible respondents to be above
the maximum age of eligibility
There may also be some shifting
of births to be outside the maximum age of
eligibility for the health questions.
26Histograms showing the levels of age displacement
of women
- The x axis is the estimated percentage of women
ACTUALLY 15-19 who were misreported as 10-14, or
the estimated percentage of women ACTUALLY 45-49
who were misreported as 50-54.
- The y axis is the number of surveys with this
level of displacement.
27(No Transcript)
28(No Transcript)
29- Tables list the surveys with estimated downward
transfer levels of 10 percent or more, and surveys with
estimated upward transfer levels of 20 percent or more.
Upward transfers are more common than downward
transfers. Why?
- All of these countries are in sub-Saharan Africa
(except for the 1997 survey of Kyrgyzstan). The
tables will not be given here.
30Transfers of children outside the window for
extra health questions
- Interviewers are also motivated to report
children as being older than they actually are,
in order to reduce their workload.
- The measure is an estimate of the percentage of
children ACTUALLY one year inside the window who
were misreported as being one year outside the
window.
31(No Transcript)
32 - The threshold for listing surveys with high levels of
birth displacement was 10 percent.
- Most of the surveys were in sub-Saharan Africa,
but there were also some in the Middle East, as well
as surveys in Pakistan, Haiti, and Guatemala.
33 Displacement of birthdates can have a very
serious impact
- When there are two successive surveys in the same
country, we can estimate fertility and infant
mortality rates in a window before the first
survey, using both the first and the second
survey.
- Use the three calendar years before the first
survey for the TFR and the five calendar years
before the first survey for the IMR.
34The second survey tends to give a higher estimate
of the TFR
35The second survey also tends to give a higher
estimate of the IMR
36Implications of getting different estimates for
two successive surveys
- It is not possible to generalize about whether
the first survey or the second survey is more
accurate; it depends on whether there was worse
displacement in one survey or the other.
- There can also be other explanations of such
differences, related to coverage, quality of the
sample, other interviewer effects, etc.
37How do you actually calculate these measures of
misreporting?
- Go back to the goal I stated earlier, to develop
measures that have statistical properties
- Calculated from individual data
- Incorporate the sampling design
- Use statistical packages
- Have standard errors
- Can have covariates
- It helps to distinguish incompleteness, on the
one hand, from heaping and displacement, on the
other
38 Identifying and assessing incompleteness
Incompleteness of age and date reporting would
traditionally be assessed by tabulating
the different kinds of information that are given
(e.g. age, year of birth, and month of birth)
and calculating the proportion of cases in which
the information is incomplete or inconsistent.
I will illustrate how logit regression can be
applied in this context.
39Identifying and assessing heaping and
displacement
Traditional methods typically
proceed through two steps.
Step 1: calculation of expected frequencies, proportions,
or ratios.
Step 2: calculation of an index
based on differences that should be close to zero
or ratios that should be close to one if there is
no misreporting.
I will illustrate how
multinomial logit regression, and logit
regression, respectively, can be used instead.
40Examples of step 1 (the calculation of expected
values)
Myers' Blended Index assumes that,
after adjustment, each final digit 0 through 9
will be equally likely; the expected proportion
at each final digit will be .10.
Successive
age ratios should be approximately equal. E.g.
the ratio of females age 10-14 to females age 5-9
should be about the same as the ratio of females
age 15-19 to females age 10-14. But if females
have been systematically shifted downwards across
age 15, then the first ratio should be noticeably
larger than the second one.
41Examples of step 2 (the calculation of a summary
measure of deviations from expected
values)
Myers' Blended Index is just the index
of dissimilarity for a comparison of the observed
(but blended) proportions at each final digit
with the expected proportions, uniformly .10; it
is one-half the sum of the absolute
deviations.
Rutstein and Bicego (1990) use an
overall measure of age displacement which
adds (a) the difference between the two age
ratios around age 15 and (b) the difference
between the two age ratios around age 50.
42Note that statistical models generally involve
the same basic logic:
--Calculation of expected values
--Summary measures of the deviations
between observed and expected values
43Focus now on three examples
Example 1: Use
logit regression to analyze incompleteness of age
reporting in the Bangladesh 2000 survey of
women
Example 2: Use multinomial logit
regression to analyze age heaping in the India
1998/99 survey of women (a modification of Myers'
Blended Index)
Example 3: Use logit
regression to analyze transfers below age 15 in
the 1990 Nigeria household survey (a modification
of the age ratio approach)
44Example 1 Incompleteness of age reporting in
the Bangladesh 2000 survey of women
A variable
y, incompleteness, is assigned the value 1 if
the reporting of age and birthdate was incomplete
(v0141) and 0 otherwise. We do a logit
regression of y with no covariates, getting a
coefficient b0 on the logit scale. The
exponential of b0 will be the observed odds of an
incomplete response; the observed proportion will
be given by exp(b0)/[1+exp(b0)]. A confidence
interval for the population proportion is
obtained by applying the same transformation to
the two ends of the confidence interval for the
population value of b0.
45Logit regression applied to incompleteness of
woman's age

. logit y [pweight=v005], cluster(v001)
(sum of wgt is 1.0544e+10)
Iteration 0: log pseudo-likelihood = -2636.087

Logit estimates                      Number of obs = 10544
                                     Wald chi2(0)  = 0.00
                                     Prob > chi2   = .
Log pseudo-likelihood = -2636.087    Pseudo R2     = 0.0000

(standard errors adjusted for clustering on v001)
------------------------------------------------------------------------------
             |               Robust
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |   2.608361   .0755412    34.53   0.000     2.460303    2.756419
------------------------------------------------------------------------------
46Convert the coefficients in the output to
estimated proportions
- exp(2.608361) = 13.5768
- 13.5768/(1+13.5768) = .9314
- Point estimate is .9314
- 95% confidence interval is (.9213, .9403)
- These estimates are adjusted for sample weights
and clustering
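
The conversion on this slide can be checked numerically. The following is a small Python sketch (Python is used here only for illustration; the analysis itself was done in Stata). The three input values are copied from the _cons row of the output, and the helper name inv_logit is an assumption of this sketch.

```python
import math


def inv_logit(b):
    """Convert a value on the logit scale to a proportion."""
    return math.exp(b) / (1.0 + math.exp(b))


# Coefficient and confidence limits copied from the _cons row.
b0, lower, upper = 2.608361, 2.460303, 2.756419

odds = math.exp(b0)                        # observed odds of an incomplete response
p_hat = inv_logit(b0)                      # observed proportion
ci = (inv_logit(lower), inv_logit(upper))  # same transformation at both ends

print(f"odds       = {odds:.4f}")
print(f"proportion = {p_hat:.4f}")
print(f"95% CI     = ({ci[0]:.4f}, {ci[1]:.4f})")
```

Because the transformation is monotone, applying it to the two ends of the interval for b0 gives a valid interval for the proportion, although the interval is no longer symmetric around the point estimate.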
47Multivariate analysis
Then y can be regressed
on covariates for a much more complete
description of the pattern of incompleteness than
would otherwise be possible. We estimate a
series of models with four covariates:
--Type of place of residence
--District
--Age interval (reported or imputed)
--Woman's years of schooling
48 - Logit regressions of incompleteness of age and
birthdate reporting on type of place of residence,
district, age, and education. DHS survey of Bangladesh 2000.
n=10,544.
______________________________________________________________
Incomplete           Model 1         Model 2         Model 3
Age/Birthdate        OR      z       OR      z       OR      z
______________________________________________________________
Type of Place
  urban              .29   -9.26     .28   -9.22     .68   -3.31
  rural             1.00    ----    1.00    ----    1.00    ----

District
  Barisal            .48   -2.96     .47   -2.97     .51   -2.87
  Chittagong         .79   -1.10     .79   -1.07     .97   -0.18
  Dhaka             1.00    ----    1.00    ----    1.00    ----
  Khulna             .37   -4.94     .38   -4.83     .32   -5.62
  Rajshahi           .68   -2.02     .68   -2.01     .56   -2.98
  Sylhet             .64   -1.59     .61   -1.73     .50   -3.02
49Example 2 Age heaping in the India 1998/99
household survey
- 488,839 de jure residents age 0-79
- This is a modification of Myers' Blended Index.
- Myers' Index is traditionally calculated from
aggregated data, that is, from an age
distribution in single years of age, using a
spreadsheet, as illustrated in the following
table.
50India 1998/99 household survey, unweighted de
jure age distribution
- 0 11,250
- 1 10,454
- 2 10,868
- 3 10,812
- 4 12,006
- 5 13,167
- 6 12,735
- 7 11,641
- 8 13,508
- 9 10,247
- 10 14,218
- 11 9,130
- 12 14,086
- 13 10,173
- 14 10,928
- 15 11,359
- 16 11,159
- 17 8,664
- 18 13,360
- 20 13,267
- 21 6,468
- 22 10,694
- 23 7,020
- 24 7,326
- 25 13,971
- 26 7,534
- 27 6,125
- 28 9,355
- 29 4,302
- 30 15,545
- 31 3,210
- 32 7,606
- 33 3,648
- 34 3,857
- 35 15,131
- 36 4,785
- 37 3,231
- 38 6,030
51India 1998/99 household survey, unweighted de
jure age distribution
- 40 12,798
- 41 2,152
- 42 4,702
- 43 2,340
- 44 2,260
- 45 10,462
- 46 2,595
- 47 2,241
- 48 3,844
- 49 1,824
- 50 7,008
- 51 1,780
- 52 3,492
- 53 1,862
- 54 1,739
- 55 7,323
- 56 2,047
- 57 1,270
- 58 2,545
- 60 9,657
- 61 863
- 62 1,963
- 63 879
- 64 818
- 65 6,443
- 66 775
- 67 648
- 68 1,178
- 69 516
- 70 5,632
- 71 350
- 72 877
- 73 308
- 74 337
- 75 2,185
- 76 342
- 77 151
- 78 360
52Percentage distribution of household residents
across final digit of age
- Column (1) Unweighted
- Column (2) Weighted by sampling weights
- Column (3) Weighted by product of Myers weights
and sampling weights
- Column (4) Absolute deviation of column (3)
from a uniform distribution

y          (1)       (2)       (3)       (4)
----------------------------------------------
0        18.28     18.28     18.10      8.10
1         7.04      7.16      6.06      3.94
2        11.11     11.17     10.73      0.73
3         7.58      7.51      6.91      3.09
4         8.03      8.01      7.57      2.43
5        16.37     16.43     17.17      7.17
6         8.59      8.63      8.78      1.22
7         6.95      6.90      7.18      2.82
8        10.27     10.21     11.12      1.12
9         5.79      5.71      6.39      3.61
----------------------------------------------
Total   100.00    100.00    100.00     34.23
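
The last step of the spreadsheet calculation can be sketched in a few lines of Python (used here for illustration only). The input is column (3) of the table; the function name index_of_dissimilarity is an assumption of this sketch, and the blending that produces column (3) is not reproduced. With the rounded percentages shown in the table, the index comes out at about 17.1.

```python
# Column (3) of the table: percentage at each final digit 0..9, weighted by
# the product of the Myers weights and the sampling weights.
blended_pct = [18.10, 6.06, 10.73, 6.91, 7.57, 17.17, 8.78, 7.18, 11.12, 6.39]


def index_of_dissimilarity(pct, expected=10.0):
    """One-half the sum of absolute deviations from the expected percentage."""
    return 0.5 * sum(abs(p - expected) for p in pct)


# Column (4): absolute deviation of each digit from a uniform 10 percent.
abs_dev = [round(abs(p - 10.0), 2) for p in blended_pct]
myers_index = index_of_dissimilarity(blended_pct)

print("absolute deviations:", abs_dev)
print("sum of deviations  :", round(sum(abs_dev), 2))
print("index              :", round(myers_index, 3))
```

Halving the sum of absolute deviations gives the percentage of cases that would have to move to a different final digit to make the distribution uniform.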
53Multinomial logit approach to Myers' Blended
Index
--Within a range such as 0-79, a
respondent's age is converted to a tens digit and
a ones digit.
--Calculate a multinomial logit
regression with the ones digit as dependent
variable y, no covariates, and weights wt. In
Stata, mlogit y [pweight=wt].
--Construct
(using the predict command in Stata) ten
variables that are the estimated probabilities
that y=0, y=1, ..., y=9 for each case in the file.
The estimated probabilities will be the same for
every case and may be referred to as p0, p1, ..., p9.
--Construct a variable
$M = 50 \sum_{j=0}^{9} |p_j - 0.10|$
to get Myers' Index, that is, one-half the sum of the
absolute deviations, expressed as a percentage.
--M will
have the same value for every case.
54 - This will give the same value of Myers' Index,
17.11, as the spreadsheet approach.
- The index can be obtained as the average of M for
all cases or just be listed out for the first
case.
- It is then possible to add covariates and get M
as a function of one or more other variables.
55 Example: the covariate is the first digit of the
reported age, which takes the values 0, 1, ..., 7.
56Here the covariate is completed years of
schooling of the household respondent
57Example 3 Use of logit regression to analyze
transfers below age 15 in the 1990 Nigeria
household survey (a modification of the age ratio
approach)
Standard method: Identify two age
intervals below the boundary and one above
it.

Age Interval    Observed Frequency
 5- 9           a
10-14           b
15-19           c

Downward
transfers will tend to reduce c and inflate b.
The difference (b/a)-(c/b) measures the amount of
downward transfer.
58Modified approach, expressed in terms of
aggregate data
Identify two age intervals
below the boundary and two above it.

Age Interval    Observed Frequency    Fitted Frequency
 5- 9           a                     a
10-14           b                     b'
15-19           c                     c'
20-24           d                     d

Assume
that the only net transfers are between b and c,
and that the fitted frequencies follow a regular
pattern.
59The single survey with the strongest evidence of
downward shifts of women was the 1990 survey of
Nigeria.

Age Interval    Observed Frequency    Fitted Frequency
 5- 9           a = 3974              a
10-14           b = 3259              b' = 2832
15-19           c = 1733              c' = 2159
20-24           d = 1760              d

The
proportion of true cases shifted downward from
age 15-19 to age 10-14 is (2159 - 1733) / 2159 =
1 - (1733/2159) = .197, or 19.7%.
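
The arithmetic for the Nigeria 1990 example can be sketched in Python (for illustration only). The observed frequencies and the fitted value 2159 are taken from the slide. The presentation does not spell out the rule behind the "regular pattern"; the sketch ASSUMES that the log frequencies have a zero third difference, which happens to reproduce the fitted values on the slide to within rounding. That rule is an inference of this sketch, not a statement of the author's method.

```python
# Observed frequencies from the slide (Nigeria 1990 household survey).
a, b, c, d = 3974, 3259, 1733, 1760

# Standard age-ratio measure of downward transfer: (b/a) - (c/b).
ratio_measure = b / a - c / b

# Proportion shifted, using the fitted frequency 2159 quoted on the slide.
slide_shifted = (2159 - c) / 2159

# ASSUMPTION (not stated in the presentation): the fitted log frequencies
# have a zero third difference, ln(a) - 3 ln(b') + 3 ln(c') - ln(d) = 0,
# so c'/b' = (d/a) ** (1/3), and transfers only move cases between the two
# middle intervals, so b' + c' = b + c.
r = (d / a) ** (1.0 / 3.0)
b_fit = (b + c) / (1.0 + r)
c_fit = (b + c) - b_fit
model_shifted = 1.0 - c / c_fit

print(f"(b/a) - (c/b)             = {ratio_measure:.4f}")
print(f"shifted (slide values)    = {slide_shifted:.4f}")
print(f"fitted b', c' (assumed)   = {b_fit:.1f}, {c_fit:.1f}")
print(f"shifted (assumed pattern) = {model_shifted:.4f}")
```

Under this assumed rule the fitted frequencies depend only on the two outer intervals and on the total of the two middle intervals, which is consistent with the statement that the only net transfers are between b and c.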
60The calculations described above for aggregate
data can be replicated with a logit regression.
The crucial step is the construction of two
artificial variables. The first one, called x,
distinguishes the first and fourth age intervals
from the second and third. The second, called y,
distinguishes the second interval in the pair
from the first interval.
61Layout of the four successive age groups into a
2x2 table for logit regression approach to age
transfers.
a = number of cases in first age group (e.g. 5-9),
b = number of cases in second age group (e.g. 10-14),
c = number of cases in third age group (e.g. 15-19),
d = number of cases in fourth age group (e.g. 20-24).

                   x
              0         1
         --------------------
     0   |    a         b
 y       --------------------
     1   |    d         c
         --------------------
62Do a logit regression of y on x using the
frequencies as weights. Equivalently, do a
logit regression of y on x using the underlying
individual-level data file.
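
A logit regression of y on a single binary x is a saturated model, so its maximum likelihood coefficients can be written directly from the four cell frequencies. The following Python sketch (for illustration only, not the Stata run) uses the Nigeria 1990 observed frequencies and the x and y coding from the 2x2 layout. It shows only the coefficients, and does not attempt the further manipulation that turns them into transfer measures.

```python
import math

# 2x2 layout: x = 0 for the outer intervals (a, d), x = 1 for the inner
# intervals (b, c); y = 0 for a and b, y = 1 for d and c.
a, b, c, d = 3974, 3259, 1733, 1760  # Nigeria 1990 observed frequencies

# With frequencies as weights, the fitted log odds equal the observed ones.
b0 = math.log(d / a)                    # log odds of y = 1 when x = 0
b1 = math.log(c / b) - math.log(d / a)  # change in log odds when x = 1


def fitted_prob(x):
    """Fitted P(y = 1 | x) implied by the two coefficients."""
    eta = b0 + b1 * x
    return 1.0 / (1.0 + math.exp(-eta))


print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"P(y=1 | x=0) = {fitted_prob(0):.4f}  (observed {d / (a + d):.4f})")
print(f"P(y=1 | x=1) = {fitted_prob(1):.4f}  (observed {c / (b + c):.4f})")
```

Running the regression on the individual-level file gives the same point estimates, with the advantage that sampling weights, clustering, and covariates can then be added.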
63Manipulation of the coefficients from these logit
regressions will produce interpretable measures of
the amount of net transfer and estimates of the
probability of a downward or upward
transfer.
By adding covariates to the logit
regression (and interactions between the
covariates and x) we can obtain a multivariate
model of age transfers.
64The Nigeria 1990 survey has strong evidence of
negative transfers across age 15, i.e., many
women age 15-19 were misreported at age 10-14.
This is most severe if the household head has low
education and minimal if the household head has
high education.
65Summary and conclusions
--Logit and multinomial
logit regression can be used to re-state some of
the most common procedures for assessing the
quality of age and date reporting. These models
allow for the incorporation of sampling weights
and clustering.
--The inclusion of covariates
in these models will allow for a better
understanding of the sources of misreporting.
66Standard errors and test statistics should be
used cautiously.
- Even when there is no misreporting at all, Myers'
blended method may spuriously suggest that there
is age heaping.
- The assumptions of the model for measuring age
transfers may not be satisfied, leading to
spurious evidence of displacement.
- The best evidence of misreporting occurs when a
test is statistically significant AND the
estimated level is above some threshold of
substantive significance.