Title: RealWorld Data Is Dirty
1Real-World Data Is Dirty
- Data Cleansing and the Merge/Purge Problem
- Hernandez Stolfo Columbia University - 1998
Class Presentation by Rhonda Kost, 06.April 2004
2TOPICS
- Introduction
- A Basic Data Cleansing Solution
- Test Real World Results
- Incremental Merge Purge w/ New Data
- Conclusion
- Recap
3Introduction
4The problem
- Some corporations acquire large amounts of
information every month - The data is stored in many large databases (DB)
- These databases may be heterogeneous
- Variations in schema
- The data may be represented differently across
the various datasets - Data in these DB may simply be inaccurate
5Requirement of the analysis
- The data mining needs to be done
- Quickly
- Efficiently
- Accurately
6Examples of real-world applications
- Credit card companies
- Assess risk of potential new customers
- Find false identities
- Match disparate records concerning a customer
- Mass Marketing companies
- Government agencies
7A Basic Data Cleansing Solution
8Duplicate Elimination
- Sorted-Neighborhood Method (SNM)
- This is done in three phases
- Create a Key for each record
- Sort records on this key
- Merge/Purge records
9SNM Create key
- Compute a key for each record by extracting
relevant fields or portions of fields - Example
10SNM Sort Data
- Sort the records in the data list using the key
in step 1 - This can be very time consuming
- O(NlogN) for a good algorithm,
- O(N2) for a bad algorithm
11SNM Merge records
- Move a fixed size window through the sequential
list of records. - This limits the comparisons to the records in the
window
12SNM Considerations
- What is the optimal window size while
- Maximizing accuracy
- Minimizing computational cost
- Execution time for large DB will be bound by
- Disk I/O
- Number of passes over the data set
13Selection of Keys
- The effectiveness of the SNM highly depends on
the key selected to sort the records - A key is defined to be a sequence of a subset of
attributes - Keys must provide sufficient discriminating power
14Example of Records and Keys
15Equational Theory
- The comparison during the merge phase is an
inferential process - Compares much more information than simply the
key - The more information there is, the better
inferences can be made
16Equational Theory - Example
- Two names are spelled nearly identically and have
the same address - It may be inferred that they are the same person
- Two social security numbers are the same but the
names and addresses are totally different - Could be the same person who moved
- Could be two different people and there is an
error in the social security number
17A simplified rule in English
- Given two records, r1 and r2
- IF the last name of r1 equals the last name of
r2, - AND the first names differ slightly,
- AND the address of r1 equals the address of r2
- THEN
- r1 is equivalent to r2
18The distance function
- A distance function is used to compare pieces
of data (usually text) - Apply distance function to data that differ
slightly - Select a threshold to capture obvious
typographical errors. - Impacts number of successful matches and number
of false positives
19Examples of matched records
20Building an equational theory
- The process of creating a good equational theory
is similar to the process of creating a good
knowledge-base for an expert system - In complex problems, an experts assistance is
needed to write the equational theory
21Transitive Closure
- In general, no single pass (i.e. no single key)
will be sufficient to catch all matching records - An attribute that appears first in the key has
higher discriminating power than those appearing
after them - If an employee has two records in a DB with SSN
193456782 and 913456782, its unlikely they will
fall under the same window
22Transitive Closure
- To increase the number of similar records merged
- Widen the scanning window size, w
- Execute several independent runs of the SNM
- Use a different key each time
- Use a relatively small window
- Call this the Multi-Pass approach
23Transitive Closure
- Each independent run of the Multi-Pass approach
will produce a set of pairs of records - Although one field in a record may be in error,
another field may not - Transitive closure can be applied to those pairs
to be merged
24Multi-pass Matches
- Pass 1 (Lastname discriminates)
- KSNKAT48NRTH789 (Kathi Kason 789912345 )
- KSNKAT48NRTH879 (Kathy Kason 879912345 )
- Pass 2 (Firstname discriminates)
- KATKSN48NRTH789 (Kathi Kason 789912345 )
- KATKSN48NRTH879 (Kathy Kason 879912345 )
- Pass 3 (Address discriminates)
- 48NRTH879KSNKAT (Kathy Kason 879912345 )
- 48NRTH879SMTKAT (Kathy Smith 879912345 )
25Transitive Equality Example
- IF A implies B
- AND B implies C
- THEN A implies C
- From example
- 789912345 Kathi Kason 48 North St. (A)
- 879912345 Kathy Kason 48 North St. (B)
- 879912345 Kathy Smith 48 North St. (C)
26Test Results
27Test Environment
- Test data was created by a database generator
- Names are randomly chosen from a list of 63000
real names - The database generator provides a large number of
parameters - size of the DB,
- percentage of duplicates,
- amount of error
28Correct Duplicate Detection
29Time for each run
30Accuracy for each run
31Real-World Test
- Data was obtained from the Office of Children
Administrative Research (OCAR) of the Department
of Social and Health Services (State of
Washington) - OCARs goals
- How long do children stay in foster care?
- How many different homes do children typically
stay in?
32OCARs Database
- Most of OCARs data is stored in one relation
- The DB contains 6,000,000 total records
- The DB grows by about 50,000 records per month
33Typical Problems in the DB
- Names are frequently misspelled
- SSN or birthdays are either missing or clearly
wrong - Case number often changes when the childs family
moves to another part of the state - Some records use service provider names instead
of the childs - No reliable unique identifier
34OCAR Equational Theory
- Keys for the independent runs
- Last Name, First Name, SSN, Case Number
- First Name, Last Name, SSN, Case Number
- Case Number, First Name, Last Name, SSN
35OCAR Results
36Incremental Merge/Purge w/ New Data
37Incremental Merge/Purge
- Lists are concatenated for first time processing
- Concatenating new data before reapplying the
merge/purge process may be very expensive in both
time and space - An incremental merge/purge approach is needed
Prime Representatives method
38Prime-Representative Definition
- A set of records extracted from each cluster of
records used to represent the information in the
cluster - The Cluster Centroid or base element of
equivalence class
39Prime-Representative creation
- Initially, no PR exists
- After the execution of the first merge/purge
create clusters of similiar records - Correct selection of PR from cluster impacts
accuracy of results - No PR can be the best selection for some clusters
403 Strategies for Choosing PR
- Random Sample
- Select a sample of records at random from each
cluster - N-Latest
- Most recent elements entered in DB
- Syntactic
- Choose the largest or more complete record
41Important Assumption
- No data previously used to select each clusters
PR will be deleted - Deleted records could require restructuring of
clusters (expensive) - No changes in the rule-set will occur after the
first increment of data is processed - Substantial rule change could invalidate
clusters.
42Results
- Cumulative running time for the Incremental
Merge/Purge algorithm is higher than the classic
algorithm - PR selection methodology could improve cumulative
running time - Total running time of the Incremental Merge/Purge
algorithm is always smaller
43Conclusion
44Cleansing of Data
- Sorted-Neighborhood Method is expensive due to
- the sorting phase
- the need for large windows for high accuracy
- Multiple passes with small windows followed by
transitive closure improves accuracy and
performance for level of accuracy - increasing number of successful matches
- decreasing number of false positives
45Recap
46- 2 major reasons merging large databases becomes
a difficult problem - The databases are heterogeneous
- The identifiers or strings differ in how they
are represented within each DB
47- The 3 steps in SNM are
- Creation of key(s)
- Sorting records on this key
- Merge/Purge records
48- Prime representative - set of records from
cluster considered to be representative of data
contained in cluster - 3 strategies for selecting a PR
- Random Sample
- N-Latest
- Syntactic
49Questions