Title: Geocoding addresses from a large population-based study: Lessons learned and applied
1 Geocoding addresses from a large population-based study: Lessons learned and applied
- Jane McElroy
- University of Wisconsin Comprehensive Cancer Center
- Gaylord E Nelson Institute for Environmental Studies
- 27 Nov 2002
- Monitoring Population Health (PM 803)
- Population Health and Sciences Department
2 Geocoding
- Showing the location of mailing addresses on a map by converting these addresses into their respective latitude and longitude coordinates
3 Requirements
- Authority file
  - e.g., TIGER street map (available online for 1995 and 2000)
- Input file
  - Data of study participants' addresses
- GIS software to compare the two files
  - ArcView 3.2 (used for this presentation)
  - Centrus (used by the State)
  - Dynamap
  - Geographic Data Technology, Inc., etc.
4 Street matching options in ESRI's ArcView 3.2
- US street
- US street w/ zone (zone, i.e., zip code or MCD)
- ZIP+4
- Etc.
Input file choices
Authority file
We used US street with zone for our study, but intersections don't have a zone option
5 Authority file (from Jerry Sullivan, DOA)
- Statutory authority for addressing is vested in the county; however, some counties never exercised this power, leaving it to townships.
- Wisconsin therefore has several dozen different addressing systems. These systems vary based on:
  - point of origin
  - directionality (NSEW, center out)
  - use of prefixes, or not
  - numbers per mile (100, 264, 300, 400, 1000)
  - mixing directions on a single road segment
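To illustrate why the "numbers per mile" setting matters, here is a minimal sketch that converts a house number into an approximate distance from the point of origin. It assumes numbers increase uniformly from the origin, which is a simplification; the function name `miles_from_origin` is hypothetical and not part of any GIS package.

```python
def miles_from_origin(house_number: int, numbers_per_mile: int) -> float:
    """Approximate distance (miles) of a house number from the point of origin.

    Assumption: address numbers are assigned uniformly along the road,
    starting at zero at the origin.
    """
    if numbers_per_mile <= 0:
        raise ValueError("numbers_per_mile must be positive")
    return house_number / numbers_per_mile


if __name__ == "__main__":
    # The same house number implies very different locations
    # under the numbering densities listed on the slide.
    for density in (100, 264, 300, 400, 1000):
        print(density, round(miles_from_origin(2395, density), 2))
```

Under these assumptions, a geocoder that applies the wrong numbering density for a county would misplace the address by miles, even when the street name matches.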
6 Authority File Limitations
- Incomplete information (e.g., missing direction prefix, very small streets not included)
- Spelling mistakes
- Ranges not provided (especially true for county, state, and federal roads)
- Quality varies by county (due to the resources available by county to compile street map information)
- Accuracy ??????
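Missing ranges matter because street-segment authority files such as TIGER generally store an address range per segment, and geocoding software typically places an address by interpolating within that range. The sketch below shows the idea for a single straight segment; the function name and the coordinate layout are assumptions for illustration, not ArcView's actual implementation.

```python
from typing import Tuple

Point = Tuple[float, float]  # (x, y), e.g. (longitude, latitude)


def interpolate_address(house_number: int,
                        range_start: int, range_end: int,
                        seg_start: Point, seg_end: Point) -> Point:
    """Place a house number on a straight segment by linear interpolation.

    range_start is the address at seg_start, range_end the address at seg_end.
    """
    low, high = sorted((range_start, range_end))
    if not low <= house_number <= high:
        raise ValueError("house number outside the segment's address range")
    if range_start == range_end:
        fraction = 0.0  # single-address segment: use the start point
    else:
        fraction = (house_number - range_start) / (range_end - range_start)
    x = seg_start[0] + fraction * (seg_end[0] - seg_start[0])
    y = seg_start[1] + fraction * (seg_end[1] - seg_start[1])
    return (x, y)
```

When a segment has no range at all, there is nothing to interpolate against, so the address cannot be matched by this method even if the street name is correct.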
7 Input File Limitations
- Post office standardized addresses not recorded (e.g., WI Rap for Wisconsin Rapids)
- Use of PO Box, Rural Route, RFD, or fire codes as address
- Spelling mistakes
- Alias road name used, but official name provided in map (e.g., County Road AA / Steed Road)
8 Breast Cancer Population-based Case-Control Study
Phase 1: 1988-1991, age 20-79
Phase 2: 1991-1994, age 50-79
Total number: 14,804
Cases from statewide cancer registry
Controls from DOT driver's license list and Health Care Financing Admin list (65-79 yrs old)
Response rate: 85% (cases) and 87% (controls)
9 Geocoding Flowchart
Mailing addresses: N = 14,804
Step 1: Post office standardized addresses? Yes: n = 12,950; No: n = 1,854
Step 2: Successful match with county-level maps?
Step 3: Successful match using Internet mapping engines?
Step 4: Recontact of study participants for better address successful?
Step 5: Geocode to zip code centroid
Other branch labels and counts, in the order extracted from the chart: No (n = 1,915); Yes (n = 276); yes; No (n = 746); Yes; no
Results:
- GIS software, geocoded addresses: n = 11,311 (77%)
- GIS software, geocoded zip code centroid (Step 5): n = 470 (3.0%)
- Internet mapping engines, geocoded addresses: n = 3,023 (20%)
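The flowchart is essentially a fallback cascade: try the most precise source first and move to the next only on failure. A minimal sketch of that control flow follows; the function `geocode_with_fallbacks` and the matcher labels are hypothetical, and each matcher stands in for a manual or software step in the study rather than a real API.

```python
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, float]
Matcher = Callable[[str], Optional[Point]]


def geocode_with_fallbacks(
    address: str,
    matchers: Sequence[Tuple[str, Matcher]],
) -> Tuple[str, Optional[Point]]:
    """Return (source label, point) from the first matcher that succeeds."""
    for label, matcher in matchers:
        point = matcher(address)
        if point is not None:
            return label, point
    # Nothing matched: the record would go to recontact or a centroid.
    return "unmatched", None
```

Recording the source label with every coordinate keeps the final file auditable, since a zip code centroid is far less precise than a street-level match.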
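10 Why worry about the address quality?
Participants by county rurality (% rural), with the number of counties in each category:
- 0-30% rural: 8,222 participants (56%), 12 counties
- 31-55% rural: 2,679 participants (18%), 14 counties
- 56-70% rural: 1,843 participants (12%), 14 counties
- 71-99% rural: 1,350 participants (9%), 17 counties
- 100% rural: 704 participants (5%), 15 counties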
11 Geocoding Flowchart
Mailing addresses: N = 14,804
12 Geocoding Flowchart
Mailing addresses: N = 14,804
Step 1: Post office standardized addresses?
13 Quality of Addresses from Different Sources
n = 14,804
Chart categories: garbled addresses (not able to fix with software); already standardized; fixed with standardization software
14 Geocoding Flowchart
Mailing addresses: N = 14,804
Step 1: Post office standardized addresses?
Step 2: Successful match with county-level maps?
Branch label shown: yes
15 Study Participants Geocoding Results: Statewide Batch
N = 10,140 (68%)
Bayfield County: n = 5 of 36
6 geocoded in wrong county
Sawyer County: n = 0 of 36
16 (Map; counties labeled: Lincoln, Winnebago, La Crosse, Jefferson, Waukesha, Milwaukee, Racine, Kenosha, Sauk)
17 Dane County: TIGER 2000 vs 1995
18 Ashland County: TIGER 2000 vs 1995
19 Geocoding Flowchart
Mailing addresses: N = 14,804
Step 1: Post office standardized addresses?
Step 2: Successful match with county-level maps?
Step 3: Successful match using Internet mapping engines?
Branch labels shown: yes, no, no
20 Look-up address examples
- N2395 USH 53
- 543 CHICOG ST
- 3223 500ST ,APT 24
- 546 HWY B BX 395 E RR,3
- WINTERGREEN APT 1001, 5603 JANESVILLE ROA
- E13984 TOWN CK LK RD
21 Strategies to Improve the Match Rate in-house
- Look up study participants using Internet mapping engines by 1) their addresses, 2) their telephone numbers, 3) their names
- MapQuest.com for addresses; Anywho.com for phone number and name
- Add better address, intersections, or xy coordinates to input file
22 Look-up address examples
- N2395 USH 53 -> N2395 US HWY 53
- 543 CHICOG ST -> 543 Chicago St
- 3223 500ST ,APT 24 -> 3223 500th St
- 546 HWY B BX 395 E RR,3 -> 546 Cnty Rd B
- WINTERGREEN APT 1001, 5603 JANESVILLE ROA -> 5603 Janesville Rd
- E13984 TOWN CK LK RD -> E13984 Town Creek Lake Rd
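Some of these corrections follow simple token rules that could be applied before a manual look-up. The sketch below is only an illustration: the rule table `TOKEN_RULES` is hypothetical, inferred from the examples on the slide, and misspellings such as CHICOG would still need a human or an Internet mapping engine.

```python
import re

# Hypothetical abbreviation rules inferred from the slide's examples;
# a real project would maintain a longer, reviewed table.
TOKEN_RULES = {"USH": "US HWY", "CK": "CREEK", "LK": "LAKE"}


def normalize_address(raw: str) -> str:
    """Upper-case, drop a trailing apartment designator, expand known tokens."""
    text = raw.upper()
    # Remove a trailing unit such as ",APT 24"; units are not geocodable.
    text = re.sub(r"\s*,?\s*APT\s+\w+\s*$", "", text)
    tokens = [TOKEN_RULES.get(token, token) for token in text.split()]
    return " ".join(tokens)


if __name__ == "__main__":
    print(normalize_address("N2395 USH 53"))
    print(normalize_address("E13984 TOWN CK LK RD"))
```

Rules like these only reduce the manual workload; each rule should be checked against the authority file, because an abbreviation can expand differently from county to county.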
23 Look-up screen
24 Geocoding Flowchart
Mailing addresses: N = 14,804
Step 1: Post office standardized addresses?
Step 2: Successful match with county-level maps?
Step 3: Successful match using Internet mapping engines?
Step 4: Recontact of study participants for better address successful?
Branch labels shown: yes, no, no, yes, no
25 Re-contacted "no hope" participants
There is a clear spatial bias among the study participants for whom we could not find better addresses to geocode. This has implications for any type of spatial analysis.
Map legend: percentage = number recontacted / total number, by county
26 Strategies to Improve the Match Rate: "no hope" file
- 1. Recontact study participants by telephone
- 2. For PO Box addresses, can contact postmaster and request street address associated with that PO box
  - US Title 39, Code of Federal Regulations, Section 265.4(a)(4)(i)
  - From PO Box Rental Form 1093, request name and street address of PO Box holder
27 Response and geocoding rate of study participants recontacted, by mortality status (n = 597)
28 Why obtain intersection info?
- 196 addresses and intersections geocoded
Chart values: 50% match rate; 34% match rate; 23% match rate
Chart series: intersection match status; address match status
11% improvement in match rate w/ intersection
29 Why avoid county, state, federal road as part of intersection?
- 176 addresses obtained from recontacting
participants
30 Geocoding Flowchart
Final results (N = 14,804):
- GIS software, geocoded addresses: n = 11,311 (77%)
- GIS software, geocoded zip code centroid (Step 5): n = 470 (3.0%)
- Internet mapping engines, geocoded addresses: n = 3,023 (20%)
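31 Final Geocode
Map legend, participants by county rurality (% rural):
- 0-30% rural: 8,222 participants (56%), 12 counties
- 31-55% rural: 2,679 participants (18%), 14 counties
- 56-70% rural: 1,843 participants (12%), 14 counties
- 71-99% rural: 1,350 participants (9%), 17 counties
- 100% rural: 704 participants (5%), 15 counties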
32 Things we learned 1: Post office standardize the addresses
- Use post office service or other standardization software packages (e.g., Semaphore Corp) for retrospective data
- Train address gatherers to only accept geocodable addresses (no PO Box, RR, RFD) and enter them in a standardized fashion, or design a screen to facilitate data entry
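A data-entry screen could enforce the "geocodable addresses only" rule with a simple check at the time of entry. The following is a minimal sketch, assuming a hypothetical helper `is_geocodable` and a deliberately small pattern list; it is not the screen used in the study.

```python
import re

# Hypothetical patterns for address types the slide says to refuse:
# PO Box, rural route (RR), and RFD.
NON_GEOCODABLE = re.compile(
    r"\b(P\.?\s*O\.?\s*BOX|BOX\s+\d+|RR\s*\d+|RURAL\s+ROUTE|RFD)\b",
    re.IGNORECASE,
)


def is_geocodable(address: str) -> bool:
    """Return False for blank entries and for PO Box / RR / RFD addresses."""
    if not address.strip():
        return False
    return NON_GEOCODABLE.search(address) is None


if __name__ == "__main__":
    for entry in ("543 Chicago St", "PO Box 395", "RR 3 Box 45", "RFD 2"):
        print(entry, "->", "accept" if is_geocodable(entry) else "ask for street address")
```

Rejecting such entries while the participant is still on the phone is far cheaper than recontacting them later.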
33 Things we learned 2: Geocoding addresses
- Geocode by county
- Geocode both address and intersection
- Use appropriate map by county (TIGER 1995 and
TIGER 2000) for addresses and intersections
34 Things we learned 3: Geocoding Intersections
- Use the study participant's street address as one of the parts of the intersection
- Avoid, as much as possible, county, state, and federal roads as part of the intersection
- Don't assume the study participant understands the word "intersection"
- Don't use "major road" when asking for intersection information
35 Things we learned 4: Internet mapping engines
- MapBlast.com no longer provides x,y coordinates online; MapTech is designing an interface to do the same for a fee ($150 update charge)
- Anywho.com works well for reverse telephone lookups and name lookups
- Need to establish a priori acceptance criteria for reverse lookups and code the decision rule used
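One way to make the acceptance criteria explicit is to write them as a function that returns a rule code stored with each record. The criteria and codes below are hypothetical examples, not the rules the study used; the point is only that the decision is fixed in advance and recorded.

```python
from dataclasses import dataclass


@dataclass
class LookupCandidate:
    """What a reverse look-up returned, compared with the study record."""
    surname_matches: bool
    phone_matches: bool
    same_zip: bool


def decision_rule(candidate: LookupCandidate) -> str:
    """Return the code of the first (hypothetical) rule the candidate satisfies."""
    if candidate.phone_matches and candidate.surname_matches:
        return "A1_PHONE_AND_NAME"
    if candidate.surname_matches and candidate.same_zip:
        return "A2_NAME_AND_ZIP"
    return "REJECT"
```

Storing the rule code next to each updated address lets an analyst later exclude, or run sensitivity analyses on, addresses accepted under the weaker rules.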
36 Things we learned 5: Costs
- Software
  - ArcView for geocoding
  - MapTech for xy coordinates
  - FoxPro to design screen for updating addresses (Excel spreadsheet works too)
  - Post office standardization software
- Personnel
  - Geocoder(s) (w/ skill in geocoding software; we trained ours)
- Time
  - ½ hour / county to geocode input file (done at least 3 times)
  - 20-30 lookups/hour with experienced geocoder
  - Study management estimation: start-up (40 hrs), weekly (5-10 hrs)
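These time figures can be turned into a rough labor estimate. The sketch below rests on two assumptions that are not on the slide: that all 72 Wisconsin counties were geocoded in each pass, and that each of the 3,023 addresses geocoded through Internet mapping engines took one look-up. Failed look-ups are not counted, so the look-up hours are at best a lower bound.

```python
COUNTIES = 72                 # assumption: every Wisconsin county, each pass
HOURS_PER_COUNTY_PASS = 0.5   # from the slide: 1/2 hour per county
PASSES = 3                    # from the slide: done at least 3 times

LOOKUPS = 3_023               # assumption: one look-up per Internet-geocoded address
LOOKUPS_PER_HOUR = (20, 30)   # from the slide: experienced geocoder


def batch_hours() -> float:
    """Hours spent running county-by-county batch geocoding."""
    return COUNTIES * HOURS_PER_COUNTY_PASS * PASSES


def lookup_hours() -> tuple:
    """(fastest, slowest) hours for manual look-ups at 30 and 20 per hour."""
    slow_rate, fast_rate = LOOKUPS_PER_HOUR
    return LOOKUPS / fast_rate, LOOKUPS / slow_rate


if __name__ == "__main__":
    print("batch hours:", batch_hours())
    print("look-up hours (fast, slow):", lookup_hours())
```

Even under these optimistic assumptions, manual look-ups take about as long as all batch geocoding passes combined, or longer.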
37 Questions to ask when reading a study that used geocoding
- What is the matching rate?
  - How is that obtained, e.g., by state, by county, by zip code?
  - 98% metropolitan area, 40% rural, 69% overall, which is OK, but there can be study implications
- What are the accuracy criteria?
  - This means: how reliable are the locations?
  - How much fuzziness was allowed to geocode (called spelling and match sensitivity scores)?
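An overall matching rate can hide large differences between strata, so it helps to compute the rate per group as well. The sketch below assumes each record is a `(stratum, matched)` pair; the function name `match_rates` and the sample data in the usage note are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def match_rates(records: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the match rate per stratum plus an 'overall' entry."""
    totals = defaultdict(int)
    matched = defaultdict(int)
    for stratum, was_matched in records:
        totals[stratum] += 1
        if was_matched:
            matched[stratum] += 1
    if not totals:
        return {}
    rates = {stratum: matched[stratum] / totals[stratum] for stratum in totals}
    rates["overall"] = sum(matched.values()) / sum(totals.values())
    return rates
```

For example, `match_rates([("metro", True), ("metro", True), ("rural", True), ("rural", False)])` reports 1.0 for metro, 0.5 for rural, and 0.75 overall, which is the same pattern the slide warns about.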
38 Questions to ask when reading a study that used geocoding, cont.
- Are the un-geocoded locations randomly distributed over the area of analysis, so that there is no clear spatial bias?
- Are the un-geocoded locations a very small percentage of the total, such that there is minimal impact on the analysis?
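One simple way to check the first question is a chi-square test of independence between stratum (for example, rurality category or county) and geocoding status. The sketch below computes only the Pearson statistic in plain Python; the function name and the input layout are assumptions, and it is a swapped-in textbook technique rather than an analysis reported in this talk.

```python
from typing import Sequence, Tuple


def chi_square_statistic(table: Sequence[Tuple[int, int]]) -> float:
    """Pearson chi-square for rows of (geocoded, ungeocoded) counts per stratum.

    Assumes both column totals are positive so expected counts are non-zero.
    """
    total_geocoded = sum(g for g, _ in table)
    total_ungeocoded = sum(u for _, u in table)
    n = total_geocoded + total_ungeocoded
    statistic = 0.0
    for geocoded, ungeocoded in table:
        row_total = geocoded + ungeocoded
        for observed, column_total in ((geocoded, total_geocoded),
                                       (ungeocoded, total_ungeocoded)):
            expected = row_total * column_total / n
            statistic += (observed - expected) ** 2 / expected
    return statistic
```

The statistic is compared against a chi-square distribution with (number of strata - 1) degrees of freedom; a large value suggests the un-geocoded records are not spread evenly across strata.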
39 Conclusion
- Much more complicated and labor intensive than it seems at first glance to get the matching rate above 68% for our addresses
- Study ramifications based on matching rate
- Weakest link (at this date) is the incomplete authority file, especially for rural areas
- Once locations are geocoded, very interesting analyses can be done (e.g., Robert's, Krieger's, and Rushton's analyses)
40 Acknowledgements
- Drs. Patrick Remington, Amy Trentham-Dietz, Stephanie Robert, and Polly Newcomb for their collaboration on the design and implementation of the project
- Drs. Henry Anderson, Larry Hanrahan, Russell Kirby, Marty Kanarek, Colin Jefcoate, and William Sonzogni for advice and support on this project
- Laura Stephenson of the Wisconsin Cancer Reporting System for assistance with data
- Betty Granda, Christina Kantor, Elizabeth Mannering, Kathy Peck, Lisa Sieczkowski, Jerry Phipps, John Hampton, Nicole Angresano, Mina Kim, and Linda Haskins for data collection and study management
- Ayak Reec, Jeffrey Pearson, Indiana Strombom, Stephanie Holmes, LeAnn Anderson, Kwang Kim, and Luxme Harihan for geocoding
- Mary Pankratz, Math Heinzel, Peter Nepokroeff, John Laedlein, and Gene Hafermann for technical support
- This study was supported by National Cancer Institute grants RO1 CA47147 and U01 CA82004