Title: SAS Challenge Matching Records
1SAS Challenge Matching Records
- Libing Shi
- Libing.shi_at_bcbsma.com
- Blue Cross Blue Shield of MA
- BASUG Quarter 1 Meeting
- March 19, 2009
2Overview
Final_outdata
Members.txt
Registrations.txt
Fname v Lname v Organization v Addr1 v Addr2
v City v State ? Zip v Phone v Email v Member
(Y/N) Match_Ind (L-sure match M-manual check N-
not matched)
DuesYear Organization Email Fname Lname Zip Phone
City AddressLine1 AddressLine2
Organization Email Fname Lname Phone
3Input datasets
- Method I Use SAS pipe feature to get the file
names - filename indata pipe 'dir/b C\DOCUME1\lshi0001\D
esktop\BASUG\.txt' - data file_list
- length name_of_file 20
- infile indata truncover
- input name_of_file 20.
- call symput ('num_files',_n_)
- run
Path
file_list
missover
4- Method I Automatically read in data using
PROC IMPORT - macro fileread
- do j1 to num_files
- data _null_ set file_list
- if _n_j
- call symput ('filein',name_of_file)
- call symput ('fileout', scan(name_of_file,1,
'.') ) run - PROC IMPORT OUT work.fileout
- DATAFILE "C\Documents and settings\
lshi0001\Desktop\BASUG\filein" - DBMSTAB REPLACE
- DELIMITER''
- GETNAMESYES
- DATAROW2
- RUN
- end
- mend fileread
Bedford VA Hospital"200 Springs Road,
152 Bedford VA Hospital"200 Springs Road, 1
5Log window using Proc import wizard
6Input datasets
- Method II Old fashion way infile, input,
informat and format - data WORK.registrations
- let _EFIERR_ 0
- infile 'C\DOCUME1\lshi0001\Desktop\BASUG\Registr
ations.txt' delimiter '' MISSOVER DSD
lrecl32767 firstobs2 - informat Organization 58.
- informat Email 50.
-
- format Organization 58.
- format Email 50.
-
input
Organization Email
if _ERROR_ then call
symputx('_EFIERR_',1) run
7Compare two read data methods
- Method I
- Code is reusable.
- Dont need to know the file names, variable
names and variable lengths before read them in. - But cant read in data correctly if the data is
messy. - Method II
- extremely flexible, can deal with really messy
data - But the code is not reusable, it is data
specific.
8Explore the datasets
- Dataset Members (397 records)
- Variable Type Len Format
Informat - 9 AddressLine1 Char 50 50.
50. - 10 AddressLine2 Char 37 37.
37. - 8 City Char 20 20.
20. - 1 DuesYear Char 4 4.
4. - 3 Email Char 50 50.
50. - 4 Fname Char 15 15.
15. - 5 Lname Char 15 15.
15. - 2 Organization Char 58 58.
58. - 7 Phone Char 12 12.
12. - 6 Zip Char 10
10. 10. - Dataset Registrations (147 records)
- Variable Type Len Format
Informat - 2 Email Char 50 50.
50. - 3 Fname Char 15 15.
15. - 4 Lname Char 15 15.
15. - 1 Organization Char 58 58.
58. - 5 Phone Char 12 12.
12.
Email Unique Count346 Total 43 records with
email missing
Phone Unique Count342 Total 30 records with
phone missing
Email Unique Count147
Phone Unique Count146, 1 dup
9The challenges
- No unique identifier (single variable or
variables combination) can be used to match the
records - Standardizing variable Fname, Lname and
Organization is an endless task - There is no State information in both
registrations and members files
10Process Steps for Matching
- Same email address, same Lname or same Fname
- Same phone , same Lname and same Fname
- Same phone , same Lname or same Fname
-
- Same organization, same Lname and Fname
- 5. Same Lname and Fname
L
L
M
L
M
11Prepare Data for Matching
- Clean the email address
- 4/6/3835 Asiotscao_at_orgoewn.icn"38845382
- Standardize the phone number
- 832.941.8698 832-875-1600 832-9700640
- Separate names in Fname column
- Marilyn(Mary) Margaret Rose Robert-James
Jack Paul Harri - Clean the Lname
- O''Hearn De Kassu Biggers-Smith
- Standardize the organization name
- Angela and Women's Hospital Angela and
Womens Hospital
12Clean messy address
- Address information in city column
- City AddressLine1
AddressLine2 - 200 Harrison Ave VA Medical Center"200 Springs Rd
Bedford - AddressLine1 and AddressLine2 Switched
- City AddressLine1
AddressLine2 - Wakefield Bldg 919, 01-24
99 Binney Street - Wakefield 326 AAC
321 Forest Street - AddressLine1 has too much information
- City AddressLine1
AddressLine2 - Norwood Bedford VA Hospital"200 Springs Road,
152 Bedford One Kendall Square, Building
200
13Clean the messy data
/clean Registrations and Menbers
files/ let q1 str(") let
q2 str(') macro clean(indata) data
indata._2(droprid) set indata. if
scan(Email,2,'_at_')' ' then Email' '
email1lowcase(scan(Email,1, "q1." )) phone
compress(Phone, "0123456789", "k")
Fname1upcase(scan(Fname,1,'(- '))
Fname2upcase(scan(Fname,2,'()- '))
Fname3upcase(scan(Fname,3,'- '))
FnameMnameupcase(compress(Fname,'()- '))
Lname1translate(Lname,' ', "q1.q2.")
Lname1upcase(compress(Lname1,' - '))
organization compress(organization, ".,q1."
) organizationtranslate(organization,' ',
'/') organizationtranwrd(organization,'',
'and')
14Clean the messy data contd
- organizationtranwrd(organization,'Associates',
'Assoc') - organizationtranwrd(organization,'Associateia
tes', 'Assoc') - organizationtranwrd(organization, 'Univ',
'University') - organizationtranwrd(organization,
'Universityersity', 'University') - organizationtranwrd(organization, 'HealthCare',
'Health Care') - ridprxparse('s/\s/ /')
- call prxchange(rid, -1, organization)
- organization1tranwrd(organization, 'Inc', '
') - organization1tranwrd(organization1,'Corp',
' ') - organization1tranwrd(organization1,'Institutes',
' ') - organization1tranwrd(organization1,'Institute',
' ') - run
- mend clean
- clean(registrations)
- clean(members)
15Clean address
/clean address in members file/ data
members_2(droppos)set members_2 length Zip1
5. new_AddressLine1 50. AddressLine1tranwrd(
AddressLine1,'One ','1 ') AddressLine2tranwrd(A
ddressLine2,'One ','1 ') if
substr(City,1,1) in ( '0', '1',
'2','3','4','5','6','7','8','9') and
AddressLine1' ' then do
AddressLine1City City ' ' end
else if substr(City,1,1) in ( '0', '1',
'2','3','4','5','6','7','8','9') and
AddressLine1 ne ' ' and AddressLine2' ' then
do AddressLine2City City '
' end else if substr(City,1,1) in
( '0', '1', '2','3','4','5','6','7','8','9')
and AddressLine1 ne ' ' and AddressLine2 ne '
' then City ' '
16Clean address contd
if substr(AddressLine2,1,1) in ('0', '1',
'2','3','4', '5', '6', '7', '8', '9') and
scan(AddressLine2,2, ' ') ne 'floor ' and
(substr(AddressLine1,1,1) not in ('0','1',
'2','3','4', '5', '6', '7', '8', '9') or
scan(AddressLine1, 2, ' ')'AAC ') then
do new_AddressLine1AddressLine2 new_AddressLi
ne2AddressLine1 end else do
new_AddressLine1AddressLine1 new_AddressLi
ne2AddressLine2 end if scan(new_AddressLine
1,2,'"') ne ' ' then new_AddressLine1scan(new_Add
ressLine1,2,'"')
17Clean address contd
/Split AddressLine1/ posindexc(new_addressLine1
, ',' ) pos_endlength(new_AddressLine1) if
pos ne 0 then do new_AddressLine2substr(new_Ad
dressLine1, pos1, pos_end - pos)
new_AddressLine1substr(new_AddressLine1,1,pos-1)
end Zip1substr(Zip,1,5) run
18SAS functions for cleaning character variables
- SCAN(string ,n, delimiter(s)) -- Returns a
portion of the string as defined by delimiter. If
you omit delimiter, SAS uses the following
characters blank . lt ( ! ) - / ,
- /clean the wrong email address /
- if scan(Email,2,'_at_')' ' then Email' '
- email1lowcase(scan(Email,1, "q1.q2." ))
4/6/3835 will become empty
Asiotscao_at_orgoewn.icn"38845382 changed to
asiotscao_at_orgoewn.icn
email1lowcase(scan(Email,1,
)) email1lowcase(scan(Email,1, " ))
19SAS functions for cleaning character variables
- COMPRESS(string, chars, modifiers)
- chars a list of characters need to be removed or
kept - Common used modifies
- K (k) keeps the characters in the list
- I (i) ignore the case of the characters in the
list - phone compress(phone, "0123456789", "k")
- Lname1upcase(compress(Lname1,' - '))
- organization compress(organization,
".,q1." ) -
-
20SAS functions for cleaning character variables
- TRANSLATE(string,to-1,from-1lt,...to-n,from-ngt)
Converts every occurrence of a user-supplied
character to another character - organizationtranslate(organization,' ', '/')
- Lname1translate(Lname,' ', "q1.", ' ', "q2.")
- Lname1upcase(compress(Lname1,' - '))
- TRANWRD(string, target, replacement) scans for
words (or patterns of characters) and replaces
those words with a second word (or pattern of
characters). - organizationtranwrd(organization,'', 'and')
- organizationtranwrd(organization,'Associates'
, 'Assoc') - organizationtranwrd(organization,'Associateia
tes', 'Assoc')
Organizationtranwrd(organization, /, )
Lname1translate(Lname,' ', "q1. q2.")
21SAS functions for cleaning character variables
- Prxparse and Call Prxchange to get rid of extra
spaces between words - regular-expression-idPRXPARSE (perl-regular-expre
ssion) - CALL PRXCHANGE (regular-expression-id, times,
old-string , new-string ) - Times is a numeric value that specifies the
number of times to search for a match and replace
a matching pattern. - TipIf the value of times is -1, then all
matching patterns are replaced. - ridprxparse('s/\s/ /')
- call prxchange(rid, -1, organization)
22SAS functions for cleaning character variables
- indexc(string, chars)
- chars a list of characters need to be searched
- Searches string from left to right, returns the
first position of any character present in the
searching list. - pos indexc(new_addressLine1, ',' )
- pos_end length(new_AddressLine1)
- if pos ne 0 then do
- new_AddressLine2substr(new_AddressLine1, pos1,
pos_end - pos) - new_AddressLine1substr(new_AddressLine1,1,pos-
1) - end
23Matching step1by email, either first or last
name
- proc sql create table email_match as
- select distinct a.Fname, a.Lname,
a.Organization, - b.new_AddressLine1 as Addr1,
b.new_AddressLine2 as Addr2, - b.City, b.Zip1 as Zip, a.phone,
- a.email1 as Email, 'Y' as Member,
- case when a.Lname1b.Lname1 or
- (a.Fname1 b.Fname1 or
a.Fname1b.Fname2 or a.Fname1b.Fname3 or - a.Fname2b.Fname1 or (a.Fname2b.Fname2 and
a.Fname2 ne ' ') or (a.Fname2b.Fname3 and
a.Fname2 ne ' ') or - a.FnameMname b.FnameMname)
- then 'L'
- else 'M'
- end as match_ind
- from registrations_2 as a
- join members_2 as b
- on lowcase(a.email1)lowcase(b.email1)
24Find the rest unmatched membersunmatch1
- /get the rest unmatched members/
- proc sql
- create table unmatch1 as
- select
- from registrations_2
- where email1 not in (select email from
email_match) - order by phone, Lname, Fname
- quit
25Matching step 2 by phone, last name and first
name
- proc sql
- create table phone_match as
- select a.Fname, a.Lname, a.organization,
- b.new_AddressLine1 as Addr1,
- b.new_AddressLine2 as Addr2,b.City, b.Zip1
as Zip, a.phone, - a.email1 as Email, 'Y' as Member,
- 'L' as match_ind
- from unmatch1 as a
- join members_2 as b
- on a.Lname1b.Lname1 and
- (a.Fname1 b.Fname1 or a.Fname1b.Fname2
or a.Fname1b.Fname3 or a.Fname2b.Fname1
or (a.Fname2b.Fname2 and a.Fname2 ne ' ') - or (a.Fname2b.Fname3 and a.Fname2 ne '
') or - a.FnameMname b.FnameMname)
- and a.phone b.phone
- order by a.phone, a.Lname, a.Fname
- quit
26Find the rest unmatched members unmach2
- /get the rest unmatched members/
- data unmatch2(dropaddr1 addr2 city zip member
match_ind) - merge unmatch1(ina)
- phone_match(inb)
- by phone Lname Fname
- if a b
- run
27Matching step 3 by phone, last name or first
name
- proc sql
- create table phone_match as
- select a.Fname, a.Lname, a.organization,
- b.new_AddressLine1 as Addr1,
- b.new_AddressLine2 as Addr2,b.City, b.Zip1
as Zip, a.phone, - a.email1 as Email, 'Y' as Member,
- M' as match_ind
- from unmatch2 as a
- join members_2 as b
- on a.Lname1b.Lname1 or
- (a.Fname1 b.Fname1 or a.Fname1b.Fname2
or a.Fname1b.Fname3 or a.Fname2b.Fname1
or (a.Fname2b.Fname2 and a.Fname2 ne ' ') - or (a.Fname2b.Fname3 and a.Fname2 ne '
') or - a.FnameMname b.FnameMname)
- and a.phone b.phone
- order by a.phone, a.Lname, a.Fname
- quit
28Find the rest unmatched members unmatch3
- /get the rest unmatched members/
- data unmatch3(dropaddr1 addr2 city zip member
match_ind) - merge unmatch2(ina)
- phone_match2(inb)
- by phone Lname Fname
- if a b
- proc sort by organization Lname Fname
- run
29Matching step 4by organization, first and name
- proc sql
- create table name_org_match as
- select distinct a.Fname, a.Lname,
- a.organization, b.new_AddressLine1 as
Addr1, - b.new_AddressLine2 as Addr2, b.City,
b.Zip1 as Zip, a.phone, - a.email1 as Email, 'Y' as Member,
- 'L' as match_ind
- from unmatch3 as a
- join members_2 as b
- on (a.Lname1b.Lname1 and
- (a.Fname1 b.Fname1 or a.Fname1b.Fname2
or a.Fname1b.Fname3 - or a.Fname2b.Fname1 or (a.Fname2b.Fname2 and
a.Fname2 ne ' ') or - (a.Fname2b.Fname3 and a.Fname2 ne ' ')
or - a.FnameMname b.FnameMname))
- and upcase(a.organization1)upcase(b.organi
zation1) - order by a.organization, a.Lname, a.Fname
- quit
30Find the rest unmatched members unmatch4
- /get the rest unmatched members/
- data unmatch4(dropaddr1 addr2 city zip member
match_ind) - merge unmatch3(ina)
- name_org_match(inb)
- by organization Lname Fname
- if a b
- proc sort by Lname Fname
- run
31Matching step 5 by first and last name
- proc sql
- create table name_match as
- select distinct a.Fname, a.Lname,
- a.organization,
- b.new_AddressLine1 as Addr1,
- b.new_AddressLine2 as Addr2,b.City, b.Zip1
as Zip, a.phone, - a.email1 as Email, 'Y' as Member,
- 'M' as match_ind
- from unmatch4 as a
- join members_2 as b
- on a.Lname1b.Lname1 and
- (a.Fname1 b.Fname1 or
a.Fname1b.Fname2 or - a.Fname1b.Fname3 or a.Fname2b.Fname1 or
(a.Fname2b.Fname2 and a.Fname2 ne ' ') or
(a.Fname2b.Fname3 and a.Fname2 ne ' ') or - a.FnameMname b.FnameMname)
- order by a.Lname, a.Fname
- quit
32Final unmatched records
- data unmatch (dropaddr1 addr2 city zip Fname1
Fname2 Fname3 FnameMname Lname0 - Lname1
organization1 - rename(email1email))
- merge unmatch4(ina dropemail)
- name_match(inb dropemail)
- by Lname Fname
- if a b
- Member'N'
- Match_ind'N'
- run
33Get state information
- Use sashelp.zipcode in SAS system
- Download most current zip code file from SAS web
and use proc cimport to read it into the system
http//support.sas.com/rnd/datavisualization/mapso
nline/html/misc.html
34Clean set the final data together
- data final set email_match
- phone_match phone_match2
- name_org_match name_match unmatch
- organizationtranwrd(organization, 'University ',
'Univ ') - organizationtranwrd(organization, 'Health',
'Hlth') - organizationtranwrd(organization, 'Services ',
'Srvs ') - organizationtranwrd(organization, 'Center',
'Cntr ') - organizationtranwrd(organization,
'Investigations', 'Ivs ') - organizationtranwrd(organization, 'Inc', '')
- organizationtranwrd(organization, ' and ', '')
- organizationtranwrd(organization, 'Senior ', 'Sr
') - organizationtranwrd(organization, 'School ',
'Schl ') - organizationtranwrd(organization, 'Medical ',
'Med ') - organizationtranwrd(organization, 'Outcomes ',
'Outcms ') - organizationtranwrd(organization, 'Assoc ', 'Ass
') - organizationtranwrd(organization, 'Public ',
'Pblc ') - proc sort by zip run
35Final output file
- data out.SASChallenge_LS( drop _ )
- retain Fname Lname Organization Addr1 Addr2 City
State Zip Phone Email Member Match_Ind - merge final (ina rename(Addr1_Addr1
Addr2_Addr2
Email_Emai Organization_Organizat
ion -
Phone_Phone)) - state (inb rename(StatecodeState
)) - by zip
- if a
- length Organization Addr1 Addr2 30. Phone 10.
Email 50. - Organizationsubstr(_Organization,1,30)
- Addr1substr(_Addr1,1,30)
- Addr2substr(_Addr2,1,30)
- Phone_Phone
- Emailsubstr(_email,1,50) run
36References
-
- A macro for reading multiple text files (SUGI29)
- by Bebbie Miller, Denver, CO
- Address cleaning using the TRANWRD function
(NESUG 2008) - by Rena Jones, Mike Zdeb
- SAS online documents
-
-
- Thank You !
- libing.shi_at_bcbsma.com