Title: Kevin A Henry, Ph.D
1 Estimating the accuracy of different geographical imputation methods
- Kevin A Henry, Ph.D
- New Jersey Cancer Registry
- Cancer Epidemiology Services
- Frank Boscoe, Ph.D
- New York State Cancer Registry
Paper Presentation, NAACCR Annual Meeting, 2007, Detroit, MI
2 Introduction
- Geographical Imputation
  - Methods to assign a case a geographic location that is approximate or accurate given available geographic and demographic data
  - Goal of geo-imputation is to assign a case a location at one geographical aggregate level based on information from one or more known geographical aggregates (Boscoe 2007).
  - Assigned locations can be:
    - Area (e.g. census tract, block group)
    - Point (e.g. latitude/longitude within census tract)
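Geo-imputation Example: Zip code to census tract
Figure: zip code 08648 (Black population 3,260) with a panel labeled "Available Case Information". The population is split across four census tracts: 1,639 (about 50 percent), 692 (about 21 percent), 636 (19.5 percent), and 293 (8.9 percent).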
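The arithmetic behind this example can be sketched in a few lines of Python. The tract labels `A` to `D` are hypothetical placeholders (the slide does not name the tracts), and pairing each count with a percentage is an assumption based on the four counts summing to the zip code's Black population of 3,260.

```python
# Black population of each census tract portion inside zip code 08648.
# Tract labels are hypothetical; the counts are the ones shown on the slide.
tract_pop = {"A": 1639, "B": 692, "C": 636, "D": 293}

total = sum(tract_pop.values())

# Share of the zip code's Black population in each tract. Geo-imputation
# treats this share as the probability that a Black case with only this
# zip code belongs to that tract.
shares = {tract: pop / total for tract, pop in tract_pop.items()}

for tract, share in shares.items():
    print(f"tract {tract}: {share:.1%}")
```

Under this reading, the largest tract would receive roughly half of the imputed cases, and the smallest fewer than one in ten.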
3 Introduction
- Why should we geo-impute?
  - Studies can be biased due to the geographic non-randomness of ungeocoded cases or cases geocoded to zip code centroid (Oliver et al. 2006).
  - Cases geocoded to a zip code centroid may not be located in the correct census tract.
  - Removing cases geocoded by zip code can result in selection bias.
  - Cases geocoded to zip code centroids can inflate case counts at the location where the zip centroid falls.
No systematic evaluation of geo-imputation has been completed to determine which method offers the best predictive power.
4 Study Objective
- Examine the usefulness of geo-imputation for assigning census tracts to cases that have been previously geocoded to only a zip code centroid.
Study Questions
- What census tract demographic information (e.g. race, age) provides the best predictive value to assign a case to the correct census tract?
- Is demographic-based geo-imputation better than two alternatives?
  1) Selecting census tracts within a zip code zone randomly
  2) Using the census tracts originally assigned to cases based on the zip code centroid location.
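To make the comparison concrete, the sketch below computes the expected match rate of each strategy for a single zip code. It rests on a simplifying assumption that is not stated in the slides: that a case's true tract follows the same population shares used for imputation. The shares and the centroid tract are hypothetical.

```python
# Hypothetical tract shares of one zip code's population (they sum to 1).
shares = [0.5, 0.3, 0.2]
centroid_index = 1  # hypothetical: the zip code centroid falls in the second tract

# Expected probability of assigning the correct tract, assuming the true
# tract of a case is distributed according to the same shares.
weighted = sum(p * p for p in shares)            # population-weighted draw
uniform = sum(p / len(shares) for p in shares)   # random tract: 1 / number of tracts
centroid = shares[centroid_index]                # every case sent to the centroid tract

print(f"weighted: {weighted:.2f}, random: {uniform:.2f}, centroid: {centroid:.2f}")
```

Under this assumption the weighted draw can never do worse than a uniformly random tract, while the centroid assignment does well only when the centroid happens to fall in a heavily populated tract.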
5 Background: What is a zip code?
- ZIP (Zone Improvement Plan) codes are linear features associated with specific roads or specific addresses
- Zip code zones are created by digitizing boundaries around street ranges
Figure: map showing the census tracts falling within a zip code zone, the zip code centroid, and the street segments used for geocoding.
6 Background: New Jersey Zip Codes
- 558 zip code zones
- 92% of zip codes have 2 or more potential census tracts
- 1 zip code has 23 potential census tracts
- Average tracts per zip code = 6
Figure: histogram of census tracts per zip code (x-axis: tract frequency, 1 to 23; y-axis: percent, 0 to 25).
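Summary figures of this kind can be derived from a zip-to-tract crosswalk. The following is a minimal sketch on made-up data; the zip and tract labels are hypothetical and do not correspond to New Jersey records.

```python
from collections import Counter

# Hypothetical crosswalk: one (zip code, census tract) pair for every
# tract that overlaps a zip code zone.
crosswalk = [
    ("Z1", "T1"), ("Z1", "T2"), ("Z1", "T3"), ("Z1", "T4"),
    ("Z2", "T4"), ("Z2", "T5"),
    ("Z3", "T6"),
]

# Number of candidate tracts per zip code (set() drops duplicate pairs).
tracts_per_zip = Counter(zip_code for zip_code, _ in set(crosswalk))

n_zips = len(tracts_per_zip)
pct_multi = 100 * sum(n >= 2 for n in tracts_per_zip.values()) / n_zips
mean_tracts = sum(tracts_per_zip.values()) / n_zips
max_tracts = max(tracts_per_zip.values())

# Percent of zip codes at each tract count, i.e. the bars of the histogram.
histogram = {count: 100 * freq / n_zips
             for count, freq in sorted(Counter(tracts_per_zip.values()).items())}
```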
7 Methods: Study Population
- New Jersey residents diagnosed with breast, prostate and colorectal cancer geocoded to a full street address (2000-2004, N=96,852, NJSCR)
- Additional study exclusions (N=4,100)
  - No age or race
  - Invalid zip codes
  - Invalid census tracts
  - Cases geocoded to zip centroids with only one census tract
- Registry Variables
Figure: the census tracts assigned to cases in the imputed case data are compared with the census tracts assigned to cases in the original case data (truth).
8 Methods: Demographic Data
- Creation of Census Tract Populations
  - 2000 Census block populations aggregated into zip codes (Tele Atlas, 2006).
  - Census tract populations created to include only populations within zip code.
- 2000 SF1 Census populations included:
  - Total Population (P001001)
  - White alone (P003003)
  - Black or African Amer. alone (P003004)
  - Asian alone (P003006)
  - Hispanic or Latino (P004002)
  - Total Population by age (P012003-P012049)
- Cumulative probabilities calculated for each tract per zip code.
Figure: zip code 07524 example contrasting total tract population with census block population within the zip code (labeled values 3,101 and 6,774).
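The aggregation described on this slide can be sketched as follows. The block records are hypothetical, and sorting tracts by label before accumulating is an arbitrary choice made here for reproducibility; the slides do not specify an ordering.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical census block records: (zip code, census tract, block population).
blocks = [
    ("Z1", "T1", 120), ("Z1", "T1", 80),
    ("Z1", "T2", 300),
    ("Z1", "T3", 500),
    ("Z2", "T3", 250),  # tract T3 straddles two zip codes
]

# Sum block populations into tract populations restricted to each zip code,
# so a tract that crosses a zip boundary contributes only its inside portion.
tract_pop = defaultdict(dict)
for zip_code, tract, pop in blocks:
    tract_pop[zip_code][tract] = tract_pop[zip_code].get(tract, 0) + pop

def cumulative_probabilities(pops):
    """Return (tract, cumulative probability) pairs for one zip code."""
    tracts = sorted(pops)
    running = list(accumulate(pops[t] for t in tracts))
    total = running[-1]
    return [(t, r / total) for t, r in zip(tracts, running)]

cum_probs = {z: cumulative_probabilities(p) for z, p in tract_pop.items()}
```

The same routine would be run once per population table (total, race-specific, age-specific) to obtain a separate set of cumulative probabilities for each.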
9 Method: Geo-imputation
Step 1: Calculate cumulative probabilities from census tract (CT) population
Step 2: Generate random number for each case (0-1)
Figure: zip code 07001 example with four census tracts (labeled 1 to 4) and the values 18.4, 32.8, 15.9, and 32.7.
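The two steps can be sketched as a small Python function. This is an illustrative sketch, not the authors' implementation: the tract labels are hypothetical, and the four values shown for zip code 07001 are assumed to be tract percentages (they sum to 99.8, so they are renormalized).

```python
import bisect
import random
from itertools import accumulate

def impute_tract(tracts, weights, rng):
    """Pick one tract with probability proportional to its weight."""
    # Step 1: cumulative probabilities; the last value is exactly 1.0.
    running = list(accumulate(weights))
    cum = [r / running[-1] for r in running]
    # Step 2: draw a uniform random number in [0, 1) and take the first
    # tract whose cumulative probability exceeds it.
    u = rng.random()
    return tracts[bisect.bisect_right(cum, u)]

tracts = ["A", "B", "C", "D"]        # hypothetical labels
weights = [18.4, 32.8, 15.9, 32.7]   # values from the 07001 example
rng = random.Random(0)               # seeded for reproducibility
draws = [impute_tract(tracts, weights, rng) for _ in range(10000)]
```

Over many draws, each tract is selected at a frequency close to its share, which is what makes the assignment correct on average even though any single case may be misplaced.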
10 Methods: Test Samples
- Random samples for race and age groups stratified by population density (quintiles).
- Geo-imputations completed for each subset
- Compared imputed census tracts with the tracts from the original case data (truth).
- Each imputation was run 1000 times.
- Results: Boxplots of mean of matches.
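The evaluation loop might look like the sketch below. The tract probabilities and the true tracts are hypothetical, and `random.choices` stands in for the cumulative-probability lookup (it performs an equivalent weighted draw).

```python
import random
from statistics import mean

# Hypothetical tract probabilities for one zip code and the true
# (street-address geocoded) tract of each test case in that zip code.
tract_probs = {"A": 0.5, "B": 0.3, "C": 0.2}
true_tracts = ["A", "A", "B", "C", "A", "B"]

def percent_correct(true_tracts, tract_probs, rng):
    """Impute every case once and return the percent matching the truth."""
    tracts = list(tract_probs)
    weights = list(tract_probs.values())
    imputed = rng.choices(tracts, weights=weights, k=len(true_tracts))
    matches = sum(i == t for i, t in zip(imputed, true_tracts))
    return 100 * matches / len(true_tracts)

rng = random.Random(42)  # seeded for reproducibility
runs = [percent_correct(true_tracts, tract_probs, rng) for _ in range(1000)]
mean_percent_correct = mean(runs)
```

The spread of the values in `runs` is what a boxplot of matches would summarize for each race, age, or density subset.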
11 Results
Figure: boxplots of mean percent correct (y-axis, 10 to 35) by population per square mile by census tract (x-axis, rural to urban; bins shown include 1,133 - 2,882, 2,883 - 5,078, 5,079 - 11,579, and 11,579), with a reference line for no imputation (17.1).
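12 Results
Figure: boxplots of mean percent correct (y-axis, 10 to 30) by population (Asian, Black, Hispanic, White, Total Population), with labeled values 26.3, 24.6, 22.2, and 22; Asian, White, Black, Hispanic combined (24); reference lines for no imputation (17.1) and random (13). Sample sizes shown: N=4,000, N=3,000, N=25,000, N=1,500, N=33,500.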
13 Results
Figure: boxplots of mean percent correct (y-axis, 10 to 30) by age group (40-44, 45-49, 50-54, 55-59, 60-61, 62-64, 65-66, 67-69, 70-74, 75-79, 80-84, 85), with age groups combined (24.9) and reference lines for no imputation (17.1) and random (13).
14 Conclusion
- Geo-imputation provides a higher match rate than no imputation or randomly allocating tracts.
- Percent correct is dependent on population density.
- Imputation based on race-specific population gives a slightly higher match rate than imputation based on total population (24 vs 23.1).
- States with larger rural populations would likely have better match rates than New Jersey.
- Geographic imputation does offer some advantages and no serious drawbacks compared with the alternative of excluding ungeocoded cases from an analysis.
15 Thank you