1
Assessing the Quality of Age and Date Reporting
in the Demographic and Health Surveys
Thomas W. Pullum
Department of Sociology
The University of Texas at Austin
tom.pullum_at_mail.utexas.edu
Prepared for UT Population Research Center brownbag, April 22, 2005
2
This presentation describes a combination of two
things:
  • A report to DHS that will appear in one of their
    series of publications. It has been completed
    and is currently going through their editing
    process.
  • A paper given at the session on Statistical
    Demography at the PAA meetings in Philadelphia.
    I hope to revise this and submit it to a journal,
    probably Demographic Research.

3
During the past several decades there has been a
general shift of methodological perspective in
demographic analysis. This shift has coincided
with
  --the increasing availability of individual-level data,
  --the development of statistical methods to analyze such data,
  --and vastly increased computing power.
One category of demographic methods that has not
participated in the transition to a statistical
framework is the analysis of the quality of age and
date reporting. The techniques to identify and adjust
for such deficiencies are still expressed in terms of
aggregated data, principally age ratios, sex ratios,
and summary measures such as Myers' Blended Index.
4
Goal: to maintain the reasoning behind these
methods but to modify and extend them to have the
following features
  --to be calculated from files of individual cases
  --to be calculated with a readily available statistical package, such as Stata
  --to be calculated with statistical models such as logit regression
  --to be accompanied with parameter estimates, standard errors, confidence intervals, and test statistics
  --to incorporate the sampling design by using sampling weights and clustering
  --to be able to include categorical or interval-level covariates.
5
Why is it important to have good quality
estimates of ages and dates?
  • Eligibility for inclusion in a sample or for some
    questions is determined partly by current age
  • Several rates depend on ages and dates, especially
    recent levels of fertility and infant/child
    mortality
  • The overall quality of a survey's data is
    probably indicated by the quality of the age and
    date reporting

6
Quality is relative
  • There is not a clear demarcation between poor
    quality and good quality. In DHS surveys, in
    particular, the best one can do is to rank the
    surveys and identify the ones that have the most
    problems.
  • The analysis described here would ideally be
    accompanied by simulations that show the
    sensitivity of summary measures, such as infant
    mortality rates, to specific levels of heaping
    and displacement of children's dates of birth
    and death.

7
Demographic and Health Surveys
  • Household survey
  • Survey of women age 15-49, including complete
    birth histories
  • Other surveys of men, youth, or service
    availability, not part of this project

8
DHS Household Survey
  • Roster of all household members, with age, sex,
    de jure or de facto residence, education,
    relation to head
  • Quality of housing, household possessions
  • Usually some country-specific questions on
    parental survival, health, presence of iodized
    salt, use of mosquito nets, etc.

9
Survey of Women
  • Eligible respondents are ALL de jure resident
    women age 15-49 in the household
  • Complete birth histories
  • Health questions about children born in the last
    five years (some variation in this interval)
  • Other topics such as contraceptive use,
    employment, prenatal care
  • Some country-specific topics

10
Data for this project
  • 99 Household surveys conducted from 1990 to 2003,
    with a mean of 69,143 individuals.
  • 125 Surveys of women 15-49 conducted from 1985 to
    2003, with a mean of 9,904 women.
  • Corresponding 125 child files, with a mean of
    30,865 children. These are constructed from the
    woman files, with one record per child. Some of
    the mother's data is repeated on that record.

11
Data processing strategy
  • DHS provides the following rectangular text files
    for each survey:
  • .dat, the data in fixed format, with a fixed
    number of records per case
  • .do, the variable names, labels, and category
    labels
  • .dct (for dictionary), with the location (record
    number and column number) of each variable
  • DHS has standard names for variables, and
    standard categories (usually), but not standard
    locations

12
Data processing strategy (cont.)
  • Identify the variables of interest (ages, dates,
    interview information, and possible covariates)
    and list them in a text file.
  • Edit the .do and .dct files to retain only the
    variables of interest. This was done with a
    Fortran program in a single run.
  • Construct Stata data files (.dta) for all of the
    files of households and of women in two Stata
    runs.

13
Data processing strategy (cont.)
  • Construct all the child files (one record per
    child) in one Stata run, by applying the reshape
    command to the birth history records in the .dta
    files of women
  • Then apply the procedures to all household files,
    all woman files, all child files (three runs).
  • Each of these three runs produces a new data
    file, with one record per data set, giving the
    various measures.
  • Finally, analyze each of these three summary
    files with histograms, tables, etc.
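
The wide-to-long step that builds the child files can be sketched outside Stata as well. The Python below is only an illustration of the idea, not the author's program: the records are hypothetical, and the field names (caseid, v005, b3_01, b7_01, ...) are assumed to follow the usual DHS pattern of one indexed slot per birth.

```python
def women_to_children(women, max_births=20):
    """Turn one-record-per-woman rows into one-record-per-child rows."""
    children = []
    for woman in women:
        for i in range(1, max_births + 1):
            suffix = f"_{i:02d}"
            birth_date = woman.get("b3" + suffix)
            if birth_date is None:
                continue  # no birth recorded in this slot of the history
            children.append({
                "caseid": woman["caseid"],      # mother's identifier, repeated
                "v005": woman["v005"],          # mother's sampling weight, repeated
                "bidx": i,                      # position in the birth history
                "b3": birth_date,               # child's date of birth
                "b7": woman.get("b7" + suffix)  # age at death in months, if any
            })
    return children


# Hypothetical input: one woman with two births, one woman with none.
women = [
    {"caseid": "1 1 2", "v005": 1000000,
     "b3_01": 1195, "b7_01": None, "b3_02": 1170, "b7_02": 12},
    {"caseid": "1 2 1", "v005": 950000},
]
children = women_to_children(women)
for child in children:
    print(child)
```

Each output row carries the mother's identifier and weight, which is what allows the later child-level analyses to use the sampling design.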

14
The reports of ages and dates in DHS surveys that
are of primary interest include the following
  --ages of all household members, reported in the household survey
  --ages and birth dates of women, reported in the survey of women 15-49
  --ages at marriage and dates of marriage of these women
  --dates of birth of children
  --ages at death of children who have died, especially before age two.
DHS surveys include other ages and dates, but the
ones listed above would generally be considered most
important, and the others would be analyzed
similarly.
15
Three principal kinds of problems can arise with
these reports
  --Incompleteness
  --Heaping
  --Displacement
The omission of eligible cases is not included. This
could be extremely serious, for example if there was
systematic omission of births, especially of ones
that had resulted in an early child death, but there
is very little evidence of such omission in DHS
surveys, at least, so it will not be considered to
be an issue. In some other applications, such as an
evaluation of vital statistics reporting in a
developing country, it certainly could not be
ignored.
16
Incompleteness
First, in the surveys of women, although not in the
household surveys, there can be incompleteness in
the reporting of an age or date. For example, a
woman may report her current age but not her birth
year or birth month. In this case, or if the
responses that are given are internally
inconsistent, DHS uses automated imputation
procedures so that age, birth year, and birth month
are all present and consistent.
17
Histogram for the incompleteness of the woman's
age data.
  • The x axis is the proportion of women in a survey
    who did not give complete and consistent values
    of age, month of birth, and year of birth
    (relative to the month and year of interview).
  • The y axis is the number of surveys with this
    level of incompleteness
  • DHS resolves any incompleteness or
    inconsistencies with an imputation procedure

18
(No Transcript)
19
Table to list the surveys with the most serious
incompleteness of age, age at marriage, and ages
of children
  • Table 3.1.1 Surveys of women 15-49 with highest
    levels of incompleteness of age, marriage, or
    birth data.
  • Column (1): Country
  • Column (2): Median year of survey
  • Column (3): Proportion of women missing any age
    or birth date information
  • Column (4): Proportion of women missing any
    marriage age or date information
  • Column (5): Sum, across births, of proportions
    missing any birth history data
  • The table lists surveys with (3) at least .6,
    (4) at least .6, or (5) at least 1.0.

    (1)             (2)     (3)     (4)     (5)
    Bangladesh      1997    0.78    0.10    0.03
    Bangladesh      2000    0.92    0.66    0.11
    Benin           1996    0.81    0.69    1.74
    Benin           2001    0.74    0.67    1.42
    Burkina Faso    1993    0.72    0.57    0.94
    Burkina Faso    1999    0.82    0.70    1.58
    Burundi         1987    0.62    0.30    0.60
    ...
    Sudan           1990    0.84    0.64    1.99
    Togo            1988    0.73    0.63    1.58
    Togo            1998    0.69    0.64    1.02

20
Heaping
The second type of problem is heaping. This usually
takes the form of excessive numbers of cases
reported at ages ending in 0 or 5, but heaping can
occur at other digits, especially for persons under
age 20, and sometimes takes the form of heaping on
calendar years that end in 0 or 5. Age at the death
of a child is given in months if the child died
before the second birthday. There is usually
substantial heaping at 12 months and often some
heaping at 6 and 18 months.
21
Histogram for age heaping in the household survey
  • Myers' Index is the percentage of cases that
    would have to be shifted from one final digit
    (0-9) to another in order to get a uniform
    distribution across the final digits
  • The x axis gives the value of the index
  • The y axis gives the number of surveys with this
    value of the index

22
(No Transcript)
23
Table listing the household surveys with the most
age heaping
  • Table 2.1.1 Household surveys with strongest
    evidence of heaping by age.
  • Column (1): Country
  • Column (2): Median year of survey
  • Column (3): Myers' Index
  • Column (4): Percent excess at final digit 0 or 5
  • The table lists surveys with (3) at least 10 or
    (4) at least 10

    (1)           (2)     (3)     (4)
    Bangladesh    1994    11.6    10.4
    Bangladesh    1996    10.4     9.7
    Bangladesh    2000    12.4    12.1
    Benin         2001    10.6     8.4
    Chad          1997    12.4    11.0
    India         1993    17.1    15.0
    India         1999    17.1    15.3
    Niger         1992    15.3    14.3
    Niger         1998    13.2    12.8
    Nigeria       1990    19.7    19.6
    Nigeria       1999    15.6    15.2

24
  • The list includes ALL the surveys in South Asia,
    except for the 1987 survey of Sri Lanka.
  • The other surveys were in Yemen and West Africa.
  • Age heaping is probably the type of misreporting
    that is least sensitive to interviewer effects
    and most sensitive to the cultural meaning of
    age.

25
Displacement
Thirdly, there can be net transfers or displacements
of age. Interviewers have some motivation to shift
the ages of women who are just inside the boundaries
of the 15-49 interval in order to reduce their
workload.
  --Shifting of eligible respondents to be below the minimum age of eligibility
  --Shifting of eligible respondents to be above the maximum age of eligibility
There may also be some shifting of births to be
outside the maximum age of eligibility for the
health questions.
26
Histograms showing the levels of age displacement
of women
  • The x axis is the estimated percentage of women
    ACTUALLY 15-19 who were misreported as 10-14, or
    the estimated percentage of women ACTUALLY 45-49
    who were misreported as 50-54.
  • The y axis is the number of surveys with this
    level of displacement.

27
(No Transcript)
28
(No Transcript)
29
  • Tables list the surveys with estimated downward
    transfer levels of 10 percent or more, and surveys
    with estimated upward transfer levels of 20 percent
    or more. Upward transfers are more common than
    downward transfers. Why?
  • All of these countries are in sub-Saharan Africa
    (except for the 1997 survey of Kyrgyzstan). The
    tables will not be given here.

30
Transfers of children outside the window for
extra health questions
  • Interviewers are also motivated to report
    children as being older than they actually are,
    in order to reduce their workload.
  • The measure is an estimate of the percentage of
    children ACTUALLY one year inside the window who
    were misreported as being one year outside the
    window.

31
(No Transcript)
32
  • The threshold for listing surveys with high levels
    of birth displacement was 10 percent.
  • Most of the surveys were in sub-Saharan Africa,
    but there were also some in the Middle East, and
    surveys in Pakistan, Haiti, and Guatemala.

33
Displacement of birthdates can have a very
serious impact
  • When there are two successive surveys in the same
    country, we can estimate fertility and infant
    mortality rates in a window before the first
    survey, using both the first and the second
    survey.
  • Use the three calendar years before the first
    survey for the TFR and the five calendar years
    before the first survey for the IMR.

34
The second survey tends to give a higher estimate
of the TFR
35
The second survey also tends to give a higher
estimate of the IMR
36
Implications of getting different estimates for
two successive surveys
  • It is not possible to generalize about whether
    the first survey or the second survey is more
    accurate; it depends on whether there was worse
    displacement in one survey or the other.
  • There can also be other explanations of such
    differences, related to coverage, quality of the
    sample, other interviewer effects, etc.

37
How do you actually calculate these measures of
misreporting?
  • Go back to the goal I stated earlier, to develop
    measures that have statistical properties
  • Calculated from individual data
  • Incorporate the sampling design
  • Use statistical packages
  • Have standard errors
  • Can have covariates
  • It helps to distinguish incompleteness, on the
    one hand, from heaping and displacement, on the
    other

38
Identifying and assessing incompleteness
Incompleteness of age and date reporting would
traditionally be assessed by tabulating the
different kinds of information that are given
(e.g. age, year of birth, and month of birth)
and calculating the proportion of cases in which
the information is incomplete or inconsistent.
I will illustrate how logit regression can be
applied in this context.
39
Identifying and assessing heaping and displacement
Traditional methods typically proceed through two
steps.
  Step 1: calculation of expected frequencies, proportions, or ratios.
  Step 2: calculation of an index based on differences that should be close to zero or ratios that should be close to one if there is no misreporting.
I will illustrate how multinomial logit regression
(for heaping) and logit regression (for
displacement) can be used instead.
40
Examples of step 1 (the calculation of expected
values)
  --Myers' Blended Index assumes that, after adjustment, each final digit 0 through 9 will be equally likely; the expected proportion at each final digit will be .10.
  --Successive age ratios should be approximately equal. E.g. the ratio of females age 10-14 to females age 5-9 should be about the same as the ratio of females age 15-19 to females age 10-14. But if females have been systematically shifted downwards across age 15, then the first ratio should be noticeably larger than the second one.
41
Examples of step 2 (the calculation of a summary
measure of deviations from expected values)
  --Myers' Blended Index is just the index of dissimilarity for a comparison of the observed (but blended) proportions at each final digit with the expected proportions, uniformly .10; it is one-half the sum of the absolute deviations.
  --Rutstein and Bicego (1990) use an overall measure of age displacement which adds (a) the difference between the two age ratios around age 15 and (b) the difference between the two age ratios around age 50.
42
Note that statistical models generally involve
the same basic logic
  --Calculation of expected values
  --Summary measures of the deviations between observed and expected values
43
Focus now on three examples
  Example 1: Use logit regression to analyze incompleteness of age reporting in the Bangladesh 2000 survey of women
  Example 2: Use multinomial logit regression to analyze age heaping in the India 1998/99 household survey (a modification of Myers' Blended Index)
  Example 3: Use logit regression to analyze transfers below age 15 in the 1990 Nigeria household survey (a modification of the age ratio approach)
44
Example 1: Incompleteness of age reporting in
the Bangladesh 2000 survey of women
A variable y, incompleteness, is assigned the value
1 if the reporting of age and birthdate was
incomplete (v014 > 1) and 0 otherwise. We do a logit
regression of y with no covariates, getting a
coefficient b0 on the logit scale. The exponential
of b0 will be the observed odds of an incomplete
response; the observed proportion will be given by
exp(b0)/[1 + exp(b0)]. A confidence interval for the
population proportion is obtained by applying the
same transformation to the two ends of the
confidence interval for the population value of b0.
45
Logit regression applied to incompleteness of
woman's age
    . logit y [pweight=v005], cluster(v001)
    (sum of wgt is 1.0544e+10)
    Iteration 0: log pseudo-likelihood = -2636.087
    Logit estimates                      Number of obs = 10544
                                         Wald chi2(0)  =  0.00
                                         Prob > chi2   =     .
    Log pseudo-likelihood = -2636.087    Pseudo R2     = 0.0000
    (standard errors adjusted for clustering on v001)
    ----------------------------------------------------------------------
                      Robust
        y     Coef.   Std. Err.      z    P>|z|    [95% Conf. Interval]
    ----------------------------------------------------------------------
    _cons  2.608361   .0755412   34.53   0.000    2.460303   2.756419
    ----------------------------------------------------------------------

46
Convert the coefficients in the output to
estimated proportions
  • exp(2.608361) = 13.5768
  • 13.5768/(1 + 13.5768) = .9314
  • Point estimate is .9314
  • 95% confidence interval is (.9213, .9403)
  • These estimates are adjusted for sample weights
    and clustering
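
The conversion on this slide can be checked with a few lines of Python. This is a sketch that only redoes the arithmetic from the coefficient and interval endpoints quoted in the Stata output; the function name inv_logit is an illustrative choice, not something from the source.

```python
import math


def inv_logit(x):
    """Map a value on the logit scale back to a proportion."""
    return math.exp(x) / (1.0 + math.exp(x))


# Intercept and 95% interval endpoints as quoted in the Stata output.
b0, b0_low, b0_high = 2.608361, 2.460303, 2.756419

odds = math.exp(b0)        # observed odds of an incomplete response
p = inv_logit(b0)          # observed (weighted) proportion incomplete
p_low = inv_logit(b0_low)  # lower end of the interval, transformed
p_high = inv_logit(b0_high)

print(f"odds = {odds:.4f}")
print(f"proportion = {p:.4f}, interval = ({p_low:.4f}, {p_high:.4f})")
```

Because the transformation is monotone, applying it to the two endpoints keeps the interval inside (0, 1), which an interval built directly on the proportion scale would not guarantee.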

47
Multivariate analysis
Then y can be regressed on covariates for a much
more complete description of the pattern of
incompleteness than would otherwise be possible. We
estimate a series of models with four covariates
  --Type of place of residence
  --District
  --Age interval (reported or imputed)
  --Woman's years of schooling
48
 
Logit regressions of incompleteness of age and birthdate
reporting on type of place of residence, district, age, and
education. DHS survey of Bangladesh 2000. n = 10,544.
    ------------------------------------------------------------
    Incomplete         Model 1         Model 2         Model 3
    Age/Birthdate      OR      z       OR      z       OR      z
    ------------------------------------------------------------
    Type of Place
      urban            .29   -9.26     .28   -9.22     .68   -3.31
      rural           1.00    ----    1.00    ----    1.00    ----
    District
      Barisal          .48   -2.96     .47   -2.97     .51   -2.87
      Chittagong       .79   -1.10     .79   -1.07     .97   -0.18
      Dhaka           1.00    ----    1.00    ----    1.00    ----
      Khulna           .37   -4.94     .38   -4.83     .32   -5.62
      Rajashahi        .68   -2.02     .68   -2.01     .56   -2.98
      Sylhet           .64   -1.59     .61   -1.73     .50   -3.02

49
Example 2 Age heaping in the India 1998/99
household survey
  • 488,839 de jure residents age 0-79
  • This is a modification of Myers' Blended Index.
  • Myers' Index is traditionally calculated from
    aggregated data, that is, from an age
    distribution in single years of age, using a
    spreadsheet, as illustrated in the following
    table.

50
India 1998/99 household survey, unweighted de
jure age distribution
  • 0 11,250
  • 1 10,454
  • 2 10,868
  • 3 10,812
  • 4 12,006
  • 5 13,167
  • 6 12,735
  • 7 11,641
  • 8 13,508
  • 9 10,247
  • 10 14,218
  • 11 9,130
  • 12 14,086
  • 13 10,173
  • 14 10,928
  • 15 11,359
  • 16 11,159
  • 17 8,664
  • 18 13,360
  • 20 13,267
  • 21 6,468
  • 22 10,694
  • 23 7,020
  • 24 7,326
  • 25 13,971
  • 26 7,534
  • 27 6,125
  • 28 9,355
  • 29 4,302
  • 30 15,545
  • 31 3,210
  • 32 7,606
  • 33 3,648
  • 34 3,857
  • 35 15,131
  • 36 4,785
  • 37 3,231
  • 38 6,030

51
India 1998/99 household survey, unweighted de
jure age distribution
  • 40 12,798
  • 41 2,152
  • 42 4,702
  • 43 2,340
  • 44 2,260
  • 45 10,462
  • 46 2,595
  • 47 2,241
  • 48 3,844
  • 49 1,824
  • 50 7,008
  • 51 1,780
  • 52 3,492
  • 53 1,862
  • 54 1,739
  • 55 7,323
  • 56 2,047
  • 57 1,270
  • 58 2,545
  • 60 9,657
  • 61 863
  • 62 1,963
  • 63 879
  • 64 818
  • 65 6,443
  • 66 775
  • 67 648
  • 68 1,178
  • 69 516
  • 70 5,632
  • 71 350
  • 72 877
  • 73 308
  • 74 337
  • 75 2,185
  • 76 342
  • 77 151
  • 78 360

52
Percentage distribution of household residents
across final digit of age
  • Column (1): Unweighted
  • Column (2): Weighted by sampling weights
  • Column (3): Weighted by product of Myers' weights
    and sampling weights
  • Column (4): Absolute deviation of column (3)
    from a uniform distribution

    y         (1)       (2)       (3)       (4)
    ---------------------------------------------
    0       18.28     18.28     18.10      8.10
    1        7.04      7.16      6.06      3.94
    2       11.11     11.17     10.73      0.73
    3        7.58      7.51      6.91      3.09
    4        8.03      8.01      7.57      2.43
    5       16.37     16.43     17.17      7.17
    6        8.59      8.63      8.78      1.22
    7        6.95      6.90      7.18      2.82
    8       10.27     10.21     11.12      1.12
    9        5.79      5.71      6.39      3.61
    ---------------------------------------------
    Total  100.00    100.00    100.00     34.23
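
As a check on the table, the last step of the spreadsheet calculation can be redone in Python. This sketch takes the blended percentages in column (3) as given and assumes nothing about how the blending weights were built; the variable names are illustrative.

```python
# Column (3): blended, weighted percentage at each final digit 0..9.
blended = [18.10, 6.06, 10.73, 6.91, 7.57, 17.17, 8.78, 7.18, 11.12, 6.39]

# Column (4): absolute deviation from the uniform value of 10 percent.
deviations = [abs(pct - 10.0) for pct in blended]

# The index is one-half the sum of the absolute deviations.
myers_index = 0.5 * sum(deviations)

print(f"sum of deviations = {sum(deviations):.2f}")
print(f"index = {myers_index:.3f}")
```

Working from the rounded table entries gives a value within rounding error of the 17.11 reported for this survey.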

53
Multinomial logit approach to Myers' Blended Index
  --Within a range such as 0-79, a respondent's age is converted to a tens digit and a ones digit.
  --Calculate a multinomial logit regression with the ones digit as dependent variable y, no covariates, and weights wt. In Stata, mlogit y [pweight=wt].
  --Construct (using the predict command in Stata) ten variables that are the estimated probabilities that y=0, y=1, ..., y=9 for each case in the file. The estimated probabilities will be the same for every case and may be referred to as p0, p1, ..., p9.
  --Construct a variable M = 100 × (1/2) × (|p0 - .10| + |p1 - .10| + ... + |p9 - .10|) to get Myers' Index.
  --M will have the same value for every case.
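
With no covariates, the probabilities fitted by a multinomial logit are simply the weighted shares of cases at each final digit, so the case-level calculation can be sketched without a regression routine. The Python below is an illustration under that observation, not the author's Stata code: the toy ages are hypothetical, and the weights argument stands in for whatever combined weight wt is used.

```python
from collections import defaultdict


def myers_from_cases(ages, weights):
    """Return (M, p): the index in percent and the ten weighted digit shares."""
    totals = defaultdict(float)
    for age, weight in zip(ages, weights):
        totals[age % 10] += weight  # ones digit of the reported age
    grand_total = sum(totals.values())
    p = [totals[digit] / grand_total for digit in range(10)]
    m = 100 * 0.5 * sum(abs(share - 0.10) for share in p)
    return m, p


# Hypothetical toy data: six equally weighted cases, three heaped on digit 0.
ages = [20, 30, 40, 21, 35, 47]
weights = [1.0] * 6
m, p = myers_from_cases(ages, weights)
print(round(m, 2), [round(share, 3) for share in p])
```

The index is 0 when the ten shares are equal and reaches 90 when every case falls on a single digit.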
54
  • This will give the same value of Myers' Index,
    17.11, as the spreadsheet approach.
  • The index can be obtained as the average of M for
    all cases or just be listed out for the first
    case.
  • It is then possible to add covariates and get M
    as a function of one or more other variables.

55
Example covariate is the first digit of the
reported age, which takes the values 0, 1, ..., 7.
                                   
56
Here the covariate is completed years of
schooling of the household respondent            
                                     
57
Example 3: Use of logit regression to analyze
transfers below age 15 in the 1990 Nigeria
household survey (a modification of the age ratio
approach)
Standard method: Identify two age intervals below
the boundary and one above it
    Age         Observed
    Interval    Frequency
    5-9         a
    10-14       b
    15-19       c
Downward transfers will tend to reduce c and
inflate b. The difference (b/a)-(c/b) measures the
amount of downward transfer.
58
Modified approach, expressed in terms of
aggregate data
Identify two age intervals below the boundary and
two above it
    Age         Observed     Fitted
    Interval    Frequency    Frequency
    5-9         a            a
    10-14       b            b'
    15-19       c            c'
    20-24       d            d
Assume that the only net transfers are between b
and c, and the fitted frequencies follow a regular
pattern
59
The single survey with the strongest evidence of
downward shifts of women was the 1990 survey of
Nigeria.
    Age         Observed     Fitted
    Interval    Frequency    Frequency
    5-9         a = 3974     a
    10-14       b = 3259     b' = 2832
    15-19       c = 1733     c' = 2159
    20-24       d = 1760     d
The proportion of true cases shifted downward from
age 15-19 to age 10-14 is
    (2159 - 1733) / 2159 = 1 - (1733/2159) = .197, or 19.7%.
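
The Nigeria figures can be re-derived with a short Python sketch. The first two quantities use only the observed and fitted frequencies quoted on the slide. The last part tries one possible "regular pattern", a constant ratio between successive age groups with b + c held fixed; that rule is an assumption made here for illustration, since the slides do not say how the fitted values were obtained, although it does land within rounding of them.

```python
# Observed frequencies, Nigeria 1990 household survey (from the slide).
a, b, c, d = 3974, 3259, 1733, 1760
# Fitted frequencies for the two middle groups, as quoted on the slide.
b_fit, c_fit = 2832, 2159

# Standard age-ratio measure of downward transfer: (b/a) - (c/b).
ratio_gap = b / a - c / b

# Share of true 15-19 cases that were reported as 10-14.
share_shifted = (c_fit - c) / c_fit

# ASSUMPTION (not stated in the source): successive groups decline by a
# constant ratio r, so that d = a * r**3, and the total b + c is preserved.
r = (d / a) ** (1 / 3)
b_alt = (b + c) / (1 + r)
c_alt = (b + c) - b_alt

print(f"age-ratio gap       : {ratio_gap:.3f}")
print(f"share shifted down  : {share_shifted:.3f}")
print(f"assumed-pattern fit : b' = {b_alt:.0f}, c' = {c_alt:.0f}")
```

If the assumed pattern is replaced by another smoothing rule, only the last three assignments change; the share shifted is always one minus the ratio of observed to fitted cases in the 15-19 group.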
60
The calculations described above for aggregate
data can be replicated with a logit regression.
The crucial step is the construction of two
artificial variables. The first one, called x,
distinguishes the first and fourth age intervals
from the second and third. The second, called y,
distinguishes the second interval in the pair
from the first interval.
61
Layout of the four successive age groups into a
2x2 table for logit regression approach to age
transfers.
  a = number of cases in first age group (e.g. 5-9),
  b = number of cases in second age group (e.g. 10-14),
  c = number of cases in third age group (e.g. 15-19),
  d = number of cases in fourth age group (e.g. 20-24).

                   x
               0       1
            ---------------
        0  |   a       b
    y       ---------------
        1  |   d       c
            ---------------
62
Do a logit regression of y on x using the
frequencies as weights.   Equivalently, do a
logit regression of y on x using the underlying
individual-level data file.  
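
With a single binary covariate the logit model is saturated, so its coefficients can be written directly from the four frequencies without an iterative fit. The Python sketch below shows that closed form for the 2x2 layout (x=0: a at y=0, d at y=1; x=1: b at y=0, c at y=1), using the Nigeria 1990 frequencies quoted in this deck. The function name is illustrative, and the further manipulation of the coefficients into transfer measures is not described in the slides and is not attempted here.

```python
import math


def logit_2x2(a, b, c, d):
    """Coefficients of logit(y) = b0 + b1*x for the saturated 2x2 layout."""
    b0 = math.log(d / a)       # log odds of y=1 when x=0 (outer groups)
    b1 = math.log(c / b) - b0  # change in log odds for x=1 (inner groups)
    return b0, b1


# Nigeria 1990 household survey frequencies for ages 5-9, 10-14, 15-19, 20-24.
a, b, c, d = 3974, 3259, 1733, 1760
b0, b1 = logit_2x2(a, b, c, d)

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"fitted odds at x=0: {math.exp(b0):.4f}")
print(f"fitted odds at x=1: {math.exp(b0 + b1):.4f}")
```

A regression on the individual records, or on the four cells with frequencies as weights, gives the same point estimates; what the regression adds is standard errors that respect the sampling design.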
63
Manipulation of the coefficients from these logit
regressions will produce interpretable measures of
the amount of net transfer and estimates of the
probability of a downward or upward transfer.
By adding covariates to the logit regression (and
interactions between the covariates and x) we can
obtain a multivariate model of age transfers.
64
The Nigeria 1990 survey has strong evidence of
negative transfers across age 15, i.e., many
women age 15-19 were misreported at age 10-14.
This is most severe if the household head has low
education and minimal if the household head has
high education.  
65
Summary and conclusions
  --Logit and multinomial logit regression can be used to re-state some of the most common procedures for assessing the quality of age and date reporting. These models allow for the incorporation of sampling weights and clustering.
  --The inclusion of covariates in these models will allow for a better understanding of the sources of misreporting.
66
Standard errors and test statistics should be
used cautiously.
  • Even when there is no misreporting at all, Myers'
    blended method may spuriously suggest that there
    is age heaping.
  • The assumptions of the model for measuring age
    transfers may not be satisfied, leading to
    spurious evidence of displacement.
  • The best evidence of misreporting occurs when a
    test is statistically significant AND the
    estimated level is above some threshold of
    substantive significance.