BASUG SAS Challenge - PowerPoint PPT Presentation

1 / 17
About This Presentation
Title:

BASUG SAS Challenge

Description:

... together for tracking members and registrants at each of our meeting or training events. ... merge the records of registrants with their appropriate record ... – PowerPoint PPT presentation

Number of Views:22
Avg rating:3.0/5.0
Slides: 18
Provided by: bas64
Category:

less

Transcript and Presenter's Notes

Title: BASUG SAS Challenge


1
  • BASUG SAS Challenge
  • 2009 Quarter 1
  • Charles E. B. Jones, PhD
  • Director, Statistical Programming
  • ARIAD Pharmaceuticals, Inc.

2
Here is the Challenge
BASUG has two datasets that must be coordinated
together for tracking members and registrants at
each of our meeting or training events. Here is
some detail Membership Dataset (N397) This
is a single file of information on our members
within one year of membership. It contains year
of membership, contact information (e.g. names,
address, email address), and financial
transaction information on the payment of dues. A
new record is created every year when a person
pays dues for that year. Registration Dataset
(N147) When you register for one of our events,
we insert an observation in a file, with the name
of the event and your contact information (e.g.
names, email address, phone number). The goal
of this challenge is to accurately merge the
records of registrants with their appropriate
record in our membership database. The purpose of
the merge is to identify who is a member and who
is not, for use at our meetings and our training
sessions.
3
General Questions
  • What was your general approach to the challenge?
  • What did you see in the data that intrigued you?
  • Why did you choose the methods you chose?
  • What methods did you consider, but that you
    rejected using? And why?

4
The Datasets
  • Sources are different!

mem. forms
Organization Email Fname Lname Phone Zip
City Addr1 Addr2
checks
Both
other docs
Membership data (MD)
MD only
internet
Registration data (RD)
5
Variables for Merging
  • First Name (F)
  • Last Name (L)
  • Email (E)
  • Phone Number (P)
  • Organization (O)

6
Data challenges (some)
  • In one source you may have indicated a middle
    name but not in the other.
  • In one source you may have indicated your work
    address but in the other source you put your home
    address.
  • You may have moved or changed jobs between the
    time you entered information in one source and
    the other.
  • Your e-mail address may have changed, or you may
    have entered a work e-mail address on one form,
    and a personal e-mail address on the

7
Examples
  • 1
  • R Blue Cross Blue Shield of MI,
    Incqsiocin.oetrdir_at_ksrucisoco.orgTimothyDe
    Kassu832-824-0956
  • M 2008AWHqsiocin.oetrdir_at_ksrucisoco.orgTimoth
    yDe Kassu02190832-824-0956South
    HamiltonLandmark Center963 Park Dr
  • M 2008Averion Inc.qsiocin.oetrdir_at_ksrucisoco.o
    rgTimothyDeKassu02190832-824-0956South
    Hamilton876 Plantation Street
  • 2
  • R Verispanipdri_at_opeiisiu.pedqedo.comNarayanaN
    acapuy666-666-6666
  • M 2008Blue Cross Blue Shield
    MIipdri_at_opeiisiu.pedqedo.comNarayanaGraham0213
    1832-101-6527WakefieldOIRP366 Morrissey
    Boulevard
  • M 2008USDA-Mid-Atlantic Ag. Statistic
    ServiceIPDri_at_Opeiisiu.Pedqedo.comNarayanaNacapu
    y02231832-101-6521Wakefield16 Staniford
    Street, 9th Floor
  • M 2008Nashbar Senior Lifedas3_at_pgw.pedqedo.com
    NarayanaRodgers02231832-164-4425Wakefield7
    Underwood Court
  • M 2008CalTech Richard SchoolDriu.Ac_at_kokwge.org
    NarayanaWright02231832-098-8510Wakefield30
    Garvey Road

8
Specifying a Match
  • FL, LE, LP, LO, FE, FP, EP
  • FL first name last name etc.
  • Refinements/Difficulties
  • First name may include Initials
  • Phone number entry not standard
  • Datasets not clean, needs standardization -
    obvious typos
  • Pipe delimited datasets with empty fields
  • Matching definition cannot be too liberal nor too
    stringent

9
Approach
REG dataset, RD
MEM dataset, MD
Format, mf
For each RD member, loop through MD and test for
match using mf
10
Solution
  • R x M x k comparisons
  • 2k 1 kC1 kC2 kC3 kC4 kC5
  • Br,m,j nAi where Ai ith field for Rr
    ith field for Mm
  • i ? Sj

  • and i 1,2,..k (fields)

  • and r 1,2,..R (Registrant)

  • and m 1,2,..M (Member)

  • and Sj is a subset of 1,2,,k

  • where j 2k 1
  • The set of fields that defines a match is
    specified through Sj.

R147
M397
k5
J 7
11
Solution
  • f (Br,m,j)

1 if Br,m,j ? ø 0 if Br,m,j ø
For the rth member, a single match exists if
max (f) 1 for some j and
a only one unique m. Multiple matches exist if
max (f) 1 for some
combinations (mz,jz), z1,2.. No matches
exists otherwise
12
Code walkthrough
MD
RD
sas
final dataset
13
Contents of merged dataset
14
Match Results
15
Match Distribution
LF
LE
LP
LO
FP
FE
EP
1
2
Y
2
1
Y
3
2
Y
Y
Y
4
1
Y
5
1
Y
6
2
Y
Y
Y
Y
7
12
8
Y
Y
2
9
Y
Y
Y
10
10
Y
Y
Y
7
Y
11
Y
Y
10
Y
12
Y
Y
Y
Y
3
13
Y
Y
Y
1
Y
29
Y
Y
Y
Y
14
Y
Y
Y
Y
12
15
Y
Y
Y
Y
Y
Y
Y
95
86
57
62
25
60
57
48
Note Based on single matches only
16
Questions/Conclusions
17
THE END
Write a Comment
User Comments (0)
About PowerShow.com