Title: BASUG SAS Challenge
1- BASUG SAS Challenge
- 2009 Quarter 1
- Charles E. B. Jones, PhD
- Director, Statistical Programming
- ARIAD Pharmaceuticals, Inc.
2Here is the Challenge
BASUG has two datasets that must be coordinated
together for tracking members and registrants at
each of our meeting or training events. Here is
some detail Membership Dataset (N397) This
is a single file of information on our members
within one year of membership. It contains year
of membership, contact information (e.g. names,
address, email address), and financial
transaction information on the payment of dues. A
new record is created every year when a person
pays dues for that year. Registration Dataset
(N147) When you register for one of our events,
we insert an observation in a file, with the name
of the event and your contact information (e.g.
names, email address, phone number). The goal
of this challenge is to accurately merge the
records of registrants with their appropriate
record in our membership database. The purpose of
the merge is to identify who is a member and who
is not, for use at our meetings and our training
sessions.
3General Questions
- What was your general approach to the challenge?
- What did you see in the data that intrigued you?
- Why did you choose the methods you chose?
- What methods did you consider, but that you
rejected using? And why?
4The Datasets
mem. forms
Organization Email Fname Lname Phone Zip
City Addr1 Addr2
checks
Both
other docs
Membership data (MD)
MD only
internet
Registration data (RD)
5Variables for Merging
- First Name (F)
- Last Name (L)
- Email (E)
- Phone Number (P)
- Organization (O)
6Data challenges (some)
- In one source you may have indicated a middle
name but not in the other. - In one source you may have indicated your work
address but in the other source you put your home
address. - You may have moved or changed jobs between the
time you entered information in one source and
the other. - Your e-mail address may have changed, or you may
have entered a work e-mail address on one form,
and a personal e-mail address on the
7Examples
- 1
- R Blue Cross Blue Shield of MI,
Incqsiocin.oetrdir_at_ksrucisoco.orgTimothyDe
Kassu832-824-0956 - M 2008AWHqsiocin.oetrdir_at_ksrucisoco.orgTimoth
yDe Kassu02190832-824-0956South
HamiltonLandmark Center963 Park Dr - M 2008Averion Inc.qsiocin.oetrdir_at_ksrucisoco.o
rgTimothyDeKassu02190832-824-0956South
Hamilton876 Plantation Street - 2
- R Verispanipdri_at_opeiisiu.pedqedo.comNarayanaN
acapuy666-666-6666 - M 2008Blue Cross Blue Shield
MIipdri_at_opeiisiu.pedqedo.comNarayanaGraham0213
1832-101-6527WakefieldOIRP366 Morrissey
Boulevard - M 2008USDA-Mid-Atlantic Ag. Statistic
ServiceIPDri_at_Opeiisiu.Pedqedo.comNarayanaNacapu
y02231832-101-6521Wakefield16 Staniford
Street, 9th Floor - M 2008Nashbar Senior Lifedas3_at_pgw.pedqedo.com
NarayanaRodgers02231832-164-4425Wakefield7
Underwood Court - M 2008CalTech Richard SchoolDriu.Ac_at_kokwge.org
NarayanaWright02231832-098-8510Wakefield30
Garvey Road
8Specifying a Match
- FL, LE, LP, LO, FE, FP, EP
- FL first name last name etc.
- Refinements/Difficulties
- First name may include Initials
- Phone number entry not standard
- Datasets not clean, needs standardization -
obvious typos - Pipe delimited datasets with empty fields
- Matching definition cannot be too liberal nor too
stringent
9Approach
REG dataset, RD
MEM dataset, MD
Format, mf
For each RD member, loop through MD and test for
match using mf
10Solution
- R x M x k comparisons
- 2k 1 kC1 kC2 kC3 kC4 kC5
- Br,m,j nAi where Ai ith field for Rr
ith field for Mm - i ? Sj
-
and i 1,2,..k (fields) -
and r 1,2,..R (Registrant) -
and m 1,2,..M (Member) -
and Sj is a subset of 1,2,,k -
where j 2k 1 - The set of fields that defines a match is
specified through Sj.
R147
M397
k5
J 7
11Solution
1 if Br,m,j ? ø 0 if Br,m,j ø
For the rth member, a single match exists if
max (f) 1 for some j and
a only one unique m. Multiple matches exist if
max (f) 1 for some
combinations (mz,jz), z1,2.. No matches
exists otherwise
12Code walkthrough
MD
RD
sas
final dataset
13Contents of merged dataset
14Match Results
15Match Distribution
LF
LE
LP
LO
FP
FE
EP
1
2
Y
2
1
Y
3
2
Y
Y
Y
4
1
Y
5
1
Y
6
2
Y
Y
Y
Y
7
12
8
Y
Y
2
9
Y
Y
Y
10
10
Y
Y
Y
7
Y
11
Y
Y
10
Y
12
Y
Y
Y
Y
3
13
Y
Y
Y
1
Y
29
Y
Y
Y
Y
14
Y
Y
Y
Y
12
15
Y
Y
Y
Y
Y
Y
Y
95
86
57
62
25
60
57
48
Note Based on single matches only
16Questions/Conclusions
17THE END