Title: Andrew Borthwick, PhD
1. The NY Citywide Immunization Registry's MEDD De-Duplication Project
Andrew Borthwick, PhD; Vikki Papadouka, PhD, MPH; Deborah Walker, PhD
ChoiceMaker Technologies, Inc.: Andrew.Borthwick_at_choicemaker.com
New York City Department of Health: vpapadou_at_dohlan.cn.ci.nyc.ny.us, dwalker_at_dohlan.cn.ci.nyc.ny.us
Adapted from a presentation at the 34th National Immunization Conference, Washington, DC, July 7, 2000
2. New York Citywide Immunization Registry, The MEDD De-duplication Project: The NYC CIR
- The New York Citywide Immunization Registry (CIR) was mandated in January 1997
- All health-care providers are required to submit immunizations
- Goals of the system
  - Doctors look up kids' immunization statuses to determine which shots to give
  - Notify parents when their children are due for an appointment
  - Identify citywide immunization trends
- Similar registries are being built at the state and local level around the country
3. NYC CIR Background
- About 122,000 children are born in NYC every year
- Each month the CIR receives
  - 50,000-100,000 patient records and
  - 80,000-200,000 immunization records
  - From more than 1,100 institutions and private providers
- Given this volume, hand-matching each new record before it enters the CIR is unrealistic
4. NYC CIR Background
- Contains 1.8 million records
- Very high duplication rate, estimated at 3 records for every 2 children, because of very strict criteria for automatic merging
- During April-September 1998, CIR staff reviewed and manually de-duplicated about 260,000 record pairs, spending 1,700 hours
5. MEDD: What It Is
- A system for deciding when two records represent the same child
- Fast and accurate
- Replicates the human decision-making process
6. MEDD's Decision-Making Process
- For every record pair, MEDD computes a probability between 0% and 100% that the pair should be merged
- High probabilities -> merge
- Low probabilities -> don't merge
- Intermediate probabilities (close to 50%) indicate "don't know" and require human review
- Thresholds dividing the merge / don't know / don't merge cases are set by the user
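The three-way decision above can be sketched in a few lines. This is only an illustration, not MEDD's code: the function name `triage` and the default thresholds (0.10 and 0.90) are assumptions made for the example, since the slide says only that the user sets the thresholds.

```python
def triage(p_merge, lower=0.10, upper=0.90):
    """Map a merge probability (0.0-1.0) to one of three decisions.

    `lower` and `upper` are hypothetical user-set thresholds.
    """
    if not 0.0 <= p_merge <= 1.0:
        raise ValueError("p_merge must be between 0 and 1")
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if p_merge >= upper:
        return "merge"
    if p_merge <= lower:
        return "don't merge"
    return "don't know"  # held for human review


# Probabilities of the kind shown in the worked examples in this deck.
for p in (0.995, 0.021, 0.539):
    print(p, triage(p))
```

Moving the two thresholds closer together sends fewer pairs to human review at the cost of more automatic errors, which is the effort vs. accuracy trade-off discussed later.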
7. Maximum Entropy Modeling
- MEDD uses Maximum Entropy Modeling
- A new statistical decision-making technique
- Learns the human judgment process by training from examples
- Has been used in sentence parsing, computer vision, financial modeling, and proper-name identification
- Has achieved state-of-the-art results on these problems
8. Maximum Entropy Modeling: Features
- Maximum Entropy uses "features"
- Feature: a function which looks at specific fields in the pair of records to make a "merge" or "don't merge" decision
- MEDD has many different features, each of which is assigned a weight during training
9. Sample MEDD Features
- Mother's Birthday
  - Match of Mom's birthday predicts "Merge"
  - Mismatch of Mom's birthday predicts "No-Merge"
  - Neither feature fires if Mom's birthday wasn't filled in on both records
    - We have no evidence in this case
- Many other features
  - Child's birthday
  - Child's first and last name
  - Medicaid Number
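The mother's-birthday features can be written as one small function over a pair of records. This is a sketch under assumptions: records are plain dictionaries, and the field name `mother_dob` and the feature names are hypothetical, chosen only for this example.

```python
def mothers_dob_features(rec1, rec2):
    """Return the names of the features that fire for a record pair.

    Records are plain dicts; "mother_dob" is a hypothetical field name.
    """
    a = rec1.get("mother_dob")
    b = rec2.get("mother_dob")
    if not a or not b:
        return []  # missing on at least one record: no evidence either way
    if a == b:
        return ["mother_dob_match"]    # predicts merge
    return ["mother_dob_mismatch"]     # predicts no-merge


print(mothers_dob_features({"mother_dob": "12/04/76"}, {"mother_dob": "12/04/76"}))
print(mothers_dob_features({"mother_dob": "12/04/76"}, {}))
```

Returning an empty list when the field is blank on either record keeps missing data from counting as evidence for or against a merge.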
10. Training the System
- Inputs
  - Record pairs hand-marked with merge/no-merge decisions
  - A set of features
- Maximum Entropy Parameter Estimator
- Output
  - A weight for each feature
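To make the training step concrete, here is a toy estimator. It is an assumption-laden sketch: the slides do not say which estimation algorithm MEDD uses, so plain gradient ascent on the log-likelihood stands in for the parameter estimator, and the feature names, the example pairs, and the `steps` and `rate` settings are all invented for illustration.

```python
import math


def p_merge(merge_feats, nomerge_feats, lam):
    """Merge probability from log-weights `lam` (weight = exp(lam))."""
    m = math.exp(sum(lam[f] for f in merge_feats))
    n = math.exp(sum(lam[f] for f in nomerge_feats))
    return m / (m + n)


def train(examples, steps=200, rate=0.1):
    """Fit one weight per feature by gradient ascent (a stand-in estimator).

    Each example is (merge_features, nomerge_features, human_said_merge).
    """
    lam = {}
    for merge_feats, nomerge_feats, _ in examples:
        for f in list(merge_feats) + list(nomerge_feats):
            lam[f] = 0.0  # weight 1.0: the feature starts out neutral
    for _ in range(steps):
        grad = dict.fromkeys(lam, 0.0)
        for merge_feats, nomerge_feats, is_merge in examples:
            p = p_merge(merge_feats, nomerge_feats, lam)
            for f in merge_feats:      # observed minus expected firing on "merge"
                grad[f] += (1.0 if is_merge else 0.0) - p
            for f in nomerge_feats:    # observed minus expected firing on "no merge"
                grad[f] += (0.0 if is_merge else 1.0) - (1.0 - p)
        for f in lam:
            lam[f] += rate * grad[f]
    return {f: math.exp(v) for f, v in lam.items()}


# Invented hand-marked pairs, for illustration only.
examples = [
    (["dob_match"], [], True),
    (["dob_match"], ["phone_mismatch"], True),
    (["dob_match"], ["phone_mismatch"], False),
    ([], ["dob_mismatch", "phone_mismatch"], False),
    (["dob_match"], [], True),
    ([], ["dob_mismatch"], False),
]
weights = train(examples)
for name, w in sorted(weights.items()):
    print(name, round(w, 3))
```

Features that agree with the human decisions more often than not end up with weights above 1; a feature that never helps stays near 1 and has little effect on the product.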
11. Probability Computation
For a pair of records, MEDD computes the probability that the pair should be merged as
$$P(\text{merge}) = \frac{\text{Merge}}{\text{Merge} + \text{NoMerge}}$$
where
- Merge = product of weights of all features predicting "merge" for the pair
- NoMerge = product of weights of all features predicting "no merge" for the pair
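The computation is short enough to check by hand. The sketch below computes Merge / (Merge + NoMerge) using the feature weights listed in the high-probability example in this deck (the Smith, Emily/Emely pair); the function name `merge_probability` is an assumption for the example.

```python
from math import prod


def merge_probability(merge_weights, no_merge_weights):
    """P(merge) = Merge / (Merge + NoMerge), each a product of feature weights."""
    merge = prod(merge_weights)        # an empty product is 1
    no_merge = prod(no_merge_weights)
    return merge / (merge + no_merge)


# Weights of the features that fired in the high-probability example.
merge_w = [1.153, 4.708, 1.138, 4.342, 1.103, 3.013, 6.587]
no_merge_w = [1.350, 2.130]

print(round(prod(merge_w), 1), round(prod(no_merge_w), 1))   # 587.2 2.9
print(f"{merge_probability(merge_w, no_merge_w):.1%}")        # 99.5%
```

Because the weights multiply, a single strongly weighted feature (such as a matching medical record number) can outweigh several weak clues pointing the other way.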
12. High Probability. Human Decision: Merge
Field Name          Record 1        Record 2        Feature   Weight  Prediction
Last name           Smith           Smith           Match     1.153   Merge
First name          Emily           Emely           No-match  1.350   No-merge
                                                    Soundex   4.708   Merge
DOB                 04/28/97        04/28/97        Match     1.138   Merge
Multiple birth      N               N
Mom's Maiden Name   CRUZ (on one record only)
Mother's DOB        12/04/76 (on one record only)
Street              4528 3rd Ave    4528 3rd Ave    Match     4.342   Merge
City                Bronx           Bronx           Match     1.103   Merge
State               NY              NY
Zip                 10462           10462           Match     3.013   Merge
Phone               718-123-4567    718-123-6789    No-match  2.130   No-merge
Med Rec Number      11856437503     11856437503     Match     6.587   Merge
Merge total: 587.2
No-merge total: 2.9
MEDD predicts Merge with 99.5% confidence
13. Low Probability. Human Decision: No-Merge
Field Name          Record 1        Record 2        Feature   Weight  Prediction
Last name           Lopez           Lopez           Match     1.153   Merge
First name          Girl            Susan
DOB                 1/11/97         1/2/97          No-match  28.949  No-merge
Multiple birth      N               N
Mom's Maiden Name   Lopez (on one record only)
Mother's DOB
Street              987 Cornelia    456 Park        No-match  2.937   No-merge
City                Brooklyn        Brooklyn        Match     1.103   Merge
State               NY              NY
Zip                 11211           11211           Match     3.013   Merge
Phone               718-123-4567    718-234-5678    No-match  2.130   No-merge
Med Rec Number      1001002         567435
Merge total: 3.8
No-merge total: 181.1
MEDD predicts No-merge with 97.9% confidence
14. Intermediate Probability. Human Decision: Merge
Field Name          Record 1        Record 2        Feature   Weight  Prediction
Last name           Hernandez       Hernandez       Match     1.153   Merge
First name          Boy             David
DOB                 2/14/97         2/14/97         Match     1.138   Merge
Multiple birth      N               N
Mom's Maiden Name   Hernandez (on one record only)
Mother's DOB        11/4/78 (on one record only)
Street              142 4th Ave     142 4th Ave     Match     4.342   Merge
City                Bronx           Bronx           Match     1.103   Merge
State               NY              NY
Zip                 11051           11052           No-match  2.551   No-merge
Phone               718-524-4879    718-524-4878    No-match  2.130   No-merge
Med Rec Number      1001002         567435
Merge total: 6.3
No-merge total: 5.4
MEDD predicts Merge with 53.9% confidence (human review)
15. Sophisticated MEDD Features: Name Frequency
- Name Frequency
  - Rodriguez is 9 times more common than Walker in NYC
  - Fewer than 3 kids per year are born with the names Borthwick and Papadouka
- Hence we build features categorizing names as "very common", "somewhat common", "very rare", etc.
- Given that we have a name match, the fact that the names are very common is a feature predicting "don't merge"
- A match between rare names is a feature predicting "merge"
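One way to picture the name-frequency features is to bucket a matched surname by how often it occurs and fire a different feature per bucket. Everything specific below is an assumption for illustration: the bucket cut-offs, the feature names, and the births-per-year counts are invented, not CIR figures.

```python
def frequency_bucket(births_per_year):
    """Bucket a surname by yearly births carrying it (hypothetical cut-offs)."""
    if births_per_year >= 500:
        return "very_common"
    if births_per_year >= 50:
        return "somewhat_common"
    return "very_rare"


def last_name_features(name1, name2, births_per_year):
    """Fire a frequency-specific feature only when the two last names match."""
    n1, n2 = name1.strip().lower(), name2.strip().lower()
    if not n1 or n1 != n2:
        return []
    return ["last_name_match_" + frequency_bucket(births_per_year.get(n1, 0))]


# Invented counts, for illustration only.
counts = {"rodriguez": 900, "walker": 100, "borthwick": 2}
print(last_name_features("Rodriguez", "Rodriguez", counts))
print(last_name_features("Borthwick", "Borthwick", counts))
```

During training, each bucket's feature would receive its own weight, so a match on a very common name counts for far less than a match on a very rare one.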
16. Sophisticated MEDD Features: Partial Name Match
- Soundex: a phonetic representation of names
  - Connor, Conor, Conner -> CNR
- When the Soundex representations of two names match, a feature fires predicting "merge"
- Edit Distance: features firing based on two names having an edit distance of 1
  - Borthwich <-> Borthwick <-> Bortwick
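Both partial-match tests are easy to sketch. The code below uses the standard American Soundex rules and the Levenshtein edit distance; it is an illustration, not MEDD's implementation, and the feature names are hypothetical. Note that standard Soundex yields letter-plus-digit codes such as C560, whereas the slide writes the shared key as CNR; either way the three spellings collapse to a single key.

```python
SOUNDEX_CODES = {}
for letters, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = [letters[0]]
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so a repeated code is kept
    return ("".join(out) + "000")[:4]


def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def partial_name_features(n1, n2):
    """Fire partial-match features for two names that differ."""
    fired = []
    if n1.lower() != n2.lower():
        if soundex(n1) == soundex(n2):
            fired.append("name_soundex_match")
        if edit_distance(n1.lower(), n2.lower()) == 1:
            fired.append("name_edit_distance_1")
    return fired


print(soundex("Connor"), soundex("Conor"), soundex("Conner"))
print(edit_distance("Borthwich", "Borthwick"), edit_distance("Borthwick", "Bortwick"))
```

The two tests are complementary: Soundex catches spellings that sound alike, while edit distance catches single-keystroke typos whether or not the sound changes.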
17. Special Situation Features
- Every database has its quirks
- HMO XYZ always sends its data to the CIR with Day of Birth = 1
  - Birthday = July 1, 1998, not July 15, 1998
- We have a special feature
  - If Provider = HMO XYZ AND Day of Birth = 1 AND dates differ only on day of birth, THEN predict "merge"
- We plan to allow users to define these types of features themselves
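The HMO XYZ rule can be expressed as an ordinary feature function. This is a sketch under assumptions: records are dictionaries with hypothetical `provider` and `dob` keys, and the feature name is invented for the example.

```python
from datetime import date


def provider_default_day_feature(rec1, rec2, provider="HMO XYZ"):
    """Fire when one record comes from the quirky provider with day-of-birth 1
    and the two birth dates differ only in the day."""
    for a, b in ((rec1, rec2), (rec2, rec1)):
        da, db = a["dob"], b["dob"]
        if (a["provider"] == provider
                and da.day == 1
                and (da.year, da.month) == (db.year, db.month)
                and da.day != db.day):
            return ["provider_default_day_merge"]  # predicts merge
    return []


hmo = {"provider": "HMO XYZ", "dob": date(1998, 7, 1)}
clinic = {"provider": "Clinic A", "dob": date(1998, 7, 15)}
print(provider_default_day_feature(hmo, clinic))
```

Checking the pair in both orders means the feature fires regardless of which record came from the provider with the default day.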
18. Test Procedure
- MEDD tested on c. 3,000 pairs under NYC DOH supervision
- Pairs were carefully hand-scored by NYC DOH as "Merge"/"Don't Merge"
- ChoiceMaker never saw the test data
19. MEDD Evaluation Results
Requested Accuracy                           % of Records Needing Human Review
1% False Positive, 1% False Negative         1.4%
0.5% False Positive, 0.5% False Negative     2.6%
0.3% False Positive, 0.3% False Negative     3.2%
Even with double-checking, human error rate is no better than 0.3%
20. Summary: What MEDD Offers
- Can be trained on just 3,000 record pairs
- Judges nearly 1,000 record pairs per second
- Achieves very high accuracy by finding the optimal weighting of the different clues (features) indicating merge/don't merge
- Says "merge", "don't merge", or "I don't know"
- Can be rigorously tested
- Registry management can make informed judgments regarding the effort vs. accuracy trade-off
21. The 5 Stages of the De-duplication Process
- Blocking: identify a list of possible duplicates (SmartSearch)
- Decision-Making: identify a definitive list of duplicate records (MEDD)
- Human Review of
  - Records marked as "don't know" by MEDD
  - Records held by special filters (twins, scanty records, etc.)
- Linkage: link records that belong to the same child together (if A=B and B=C then A=C)
- Update the CIR
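The linkage stage (if A=B and B=C then A=C) amounts to taking the transitive closure of the merge decisions. A minimal sketch using a union-find structure follows; the function name and the record identifiers are assumptions for illustration, and the slides do not say how the CIR implements this step.

```python
def link_records(record_ids, merged_pairs):
    """Group records into children, given pairwise merge decisions."""
    parent = {r: r for r in record_ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in merged_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # union the two groups

    groups = {}
    for r in record_ids:
        groups.setdefault(find(r), []).append(r)
    return sorted(sorted(g) for g in groups.values())


print(link_records(["A", "B", "C", "D"], [("A", "B"), ("B", "C")]))
# [['A', 'B', 'C'], ['D']]
```

A and C end up in the same group even though that pair was never compared directly.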
22. Project Avalanche
- Project Avalanche: a project by which we systematically de-duplicate the whole CIR by comparing every record to every record meeting certain criteria
- Uses our querying tool, SmartSearch, and our de-duplication tool, MEDD
- Project Avalanche I: February-April 2000
- Project Avalanche II: May-July 2000
23. Project Avalanche I
- Used strict blocking criteria for finding possible duplicates to be passed on to MEDD, such as
  - Exact match on DOB + Medical Record or
  - Exact match on Medicaid number or
  - First name + gender + DOB + (last name = maiden name, and vice versa) or
  - Last name + First name + DOB
- Used 98% as the cut-off for automatic merging
- Hand-reviewed records produced by the filters
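Blocking of this kind can be sketched by giving each record one key per criterion and pairing only records that share a key. The sketch covers three of the criteria above (DOB + medical record number, Medicaid number, and last name + first name + DOB) and is not SmartSearch itself; the dictionary field names and the sample records are assumptions for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_keys(rec):
    """One key per blocking criterion for which the record has all fields."""
    keys = []
    if rec.get("dob") and rec.get("med_rec"):
        keys.append(("dob+medrec", rec["dob"], rec["med_rec"]))
    if rec.get("medicaid"):
        keys.append(("medicaid", rec["medicaid"]))
    if rec.get("last") and rec.get("first") and rec.get("dob"):
        keys.append(("name+dob", rec["last"].lower(), rec["first"].lower(), rec["dob"]))
    return keys


def candidate_pairs(records):
    """Return the record-id pairs that share at least one blocking key."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        for key in blocking_keys(rec):
            blocks[key].append(rid)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return sorted(pairs)


# Invented records, for illustration only.
records = {
    1: {"last": "Smith", "first": "Emily", "dob": "04/28/97", "med_rec": "11856437503"},
    2: {"last": "Smith", "first": "Emely", "dob": "04/28/97", "med_rec": "11856437503"},
    3: {"last": "Lopez", "first": "Susan", "dob": "1/2/97", "med_rec": "567435"},
}
print(candidate_pairs(records))  # [(1, 2)]
```

Only the pairs returned here would be scored by the decision-making stage, which is what keeps "every record against every record meeting certain criteria" tractable on a registry of this size.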
24. Project Avalanche I Results
[Results table not recovered; the only surviving label is "Estimated"]
25. Project Avalanche II
- In April 2000 we loaded 4 months' worth of data that were held due to Y2K problems
- Used more liberal blocking criteria
  - Medical Record Number
  - month and year of DOB or
  - day and year of DOB or
  - day and month of DOB or
  - first name
- Used 90% as the cut-off for automatic merging
- Currently hand-reviewing records produced by the filters
26. Project Avalanche II Results
[Results table not recovered; the only surviving label is "Estimated"]
27. Project Avalanche Discussion
- Using a very conservative cut-off for automatic merging, we reduced the duplicates by about 27.5% each time, and by more than 30% including human review
- As a result of Project Avalanche, 81% of records now have immunizations vs. 58% 6 months ago
- Since MEDD is not yet implemented on the front end of the CIR, you don't see the total number of duplicates decreasing over time in these early runs
28. Future of MEDD at the CIR
- As part of the Lead and CIR integration, MEDD will be inserted on the front end, thus reducing the number of duplicates being created
- Improving MEDD's performance will enable us to automatically merge more duplicates with the same error rate
- Will continue with Project Avalanche until we bring the duplication rate down to an acceptable level
29. Summary: ChoiceMaker Status
- Currently have two employees
  - Andrew Borthwick, Ph.D.
  - Prof. Arthur Goldberg
- Have several major contracts with New York City Dept. of Health
- Good prospects of finding similar work with other state and municipal health departments
30. Summary: De-duplication Marketplace
- Immunization registries have very difficult duplicate record problems
- Many others have similar problems
  - Medical researchers (correlating birth certificate and maternal death records)
  - Banks, phone companies (correlating clients from different lines of business)
  - Direct marketers (merging mailing lists)
31. Summary: ChoiceMaker's Plans
- Do further research to decrease the amount of consulting time needed to deploy MEDD
- Seeking first-round investors to fund expansion of R&D and marketing
- Have an opening for someone with an M.S. in C.S. or similar qualifications, starting 10/1/2000, and a C.S. Ph.D. starting 11/1/2000