Title: Searching and Integrating Information on the Web
1. Searching and Integrating Information on the Web
- Seminar 3: Data Cleansing
- Professor Chen Li
- UC Irvine
2. Paper Readings
- Efficient merge and purge. Hernandez and Stolfo, SIGMOD 1995.
- Approximate String Joins in a Database (Almost) for Free. Gravano et al., VLDB 2001.
- Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li, Sharad Mehrotra, DASFAA 2003.
- Interactive Deduplication Using Active Learning. Sarawagi and Bhamidipaty, KDD 2002.
3. Motivation
- Correlate data from different data sources (e.g., data integration)
- Data is often dirty
- It needs to be cleansed before being used
- Example
- A hospital needs to merge patient records from different data sources
- They have different formats, typos, and abbreviations
4. Example

Table R
Name           SSN            Addr
Jack Lemmon    430-871-8294   Maple St
Harrison Ford  292-918-2913   Culver Blvd
Tom Hanks      234-762-1234   Main St

Table S
Name           SSN            Addr
Ton Hanks      234-162-1234   Main Street
Kevin Spacey   928-184-2813   Frost Blvd
Jack Lemon     430-817-8294   Maple Street

- Find records from different datasets that could be the same entity
5. Another Example
- P. Bernstein, D. Chiu. Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
- Philip A. Bernstein, Dah-Ming W. Chiu. Using Semi-Joins to Solve Relational Queries. Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
6. Record Linkage
- Problem statement: given two relations, identify the potentially matched records
- Efficiently and
- Effectively
7. Challenges
- How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, ...)
- Domain knowledge is critical
- Names: "Wall Street Journal" and "LA Times"
- Addresses: "Main Street" versus "Main St"
- How to do the matching efficiently
- Offline join version
- Online (interactive) search
- Nearest-neighbor search
- Range search
8. Outline
- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication
9. Edit Distance
- A widely used metric to define string similarity
- ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
- Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1, s2) = 2 (see the sketch below)
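As a concrete reference, here is a minimal dynamic-programming sketch of this definition; it is plain illustrative Python, not code from any of the papers.

    def edit_distance(s1, s2):
        """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            cur = [i]
            for j, c2 in enumerate(s2, 1):
                cur.append(min(prev[j] + 1,                   # deletion
                               cur[j - 1] + 1,                # insertion
                               prev[j - 1] + (c1 != c2)))     # substitution (free on a match)
            prev = cur
        return prev[-1]

    print(edit_distance("Tom Hanks", "Ton Hank"))   # 2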
10. Approximate String Joins
- We want to join tuples with similar string fields
- Similarity measure: edit distance
- Each insertion, deletion, or replacement increases the distance by one

Service A:            Service B:
Jenny Stamatopoulou   Panos Ipirotis
John Paul McDougal    Jonh Smith
Aldridge Rodriguez    Jenny Stamatopulou
Panos Ipeirotis       John P. McDougal
John Smith            Al Dridge Rodriguez

(Figure: lines connect the matching name pairs across the two services, each labeled with the pair's edit distance.)
11. Focus: Approximate String Joins over Relational DBMSs
- Join two tables on string attributes and keep all pairs of strings with edit distance <= K
- Solve the problem in a database-friendly way (if possible, with an existing "vanilla" RDBMS)
12. Current Approaches for Processing Approximate String Joins
- No native support for approximate joins in RDBMSs
- Two existing (straightforward) solutions:
- Join data outside of the DBMS
- Join data via user-defined functions (UDFs) inside the DBMS
13. Approximate String Joins outside of a DBMS
- Export data
- Join outside of the DBMS
- Import the result
- Main advantage
- We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality
- Disadvantages
- Substantial amounts of data to be exported/imported
- Cannot be easily integrated with further processing steps in the DBMS
14. Approximate String Joins with UDFs
- Write a UDF to check if two strings match within distance K
- Write an SQL statement that applies the UDF to the string pairs:

    SELECT R.stringAttr, S.stringAttr
    FROM   R, S
    WHERE  edit_distance(R.stringAttr, S.stringAttr, K)

- Main advantage
- Ease of implementation
- Main disadvantage
- The UDF is applied to the entire cross-product of the two relations
15. Our Approach: Approximate String Joins over an Unmodified RDBMS
- Preprocess data and generate auxiliary tables
- Perform the join exploiting standard RDBMS capabilities
- Advantages
- No modification of the underlying RDBMS needed
- Can leverage the RDBMS query optimizer
- Much more efficient than the approach based on naive UDFs
16. Intuition and Roadmap
- Intuition
- Similar strings have many common substrings
- Use exact joins to perform approximate joins (current DBMSs are good for exact joins)
- A good candidate set can be verified for false positives
- Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977
- Roadmap
- Break strings into substrings of length q (q-grams)
- Perform an exact join on the q-grams
- Find candidate string pairs based on the results
- Check only the candidate pairs with a UDF to obtain the final answer
17. What is a Q-gram?
- Q-gram: a sequence of q characters of the (padded) original string
- Example for q = 3:
- vacations: ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- A string of length L has L + q - 1 q-grams (with q - 1 padding characters on each end)
- Similar strings have many common q-grams (see the sketch below)
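A small sketch of positional q-gram extraction, assuming '#' and '%' as the padding symbols used in the example above (illustrative Python):

    def qgrams(s, q=3):
        """Return (position, q-gram) pairs of the padded string; there are len(s) + q - 1 of them."""
        padded = "#" * (q - 1) + s + "%" * (q - 1)
        return [(i, padded[i:i + q]) for i in range(len(s) + q - 1)]

    print(qgrams("vacations"))
    # 11 positional q-grams for a string of length 9: (0, '##v'), (1, '#va'), (2, 'vac'), ...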
18. Q-grams and Edit Distance Operations
- With no edits: L + q - 1 common q-grams
- Replacement: at least (L + q - 1) - q common q-grams
- vacations:  ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- vacalions:  ##v, #va, vac, aca, cal, ali, lio, ion, ons, ns%, s%%
- Insertion: at least (Lmax + q - 1) - q common q-grams
- vacations:  ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- vacatlions: ##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns%, s%%
- Deletion: at least (Lmax + q - 1) - q common q-grams
- vacations:  ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- vacaions:   ##v, #va, vac, aca, cai, aio, ion, ons, ns%, s%%
19. Number of Common Q-grams and Edit Distance
- For edit distance K, there can be at most K replacements, insertions, and deletions
- Two strings S1 and S2 within edit distance K have at least max(|S1|, |S2|) + q - 1 - K*q q-grams in common
- Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals); see the sketch below
20. Using a DBMS for Q-gram Joins
- If we have the q-grams in the DBMS, we can perform this counting efficiently
- Create auxiliary tables with tuples of the form <sid, strlen, qgram> and join these tables
- A GROUP BY ... HAVING COUNT clause can perform the counting/filtering
21. Eliminating Candidate Pairs: COUNT FILTERING
- SQL for this filter (parts omitted for clarity):

    SELECT R.sid, S.sid
    FROM   R, S
    WHERE  R.qgram = S.qgram
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

- The result is the set of string pairs with enough common q-grams to ensure that we will not have false negatives.
22. Eliminating Candidate Pairs Further: LENGTH FILTERING
- Strings with length difference larger than K cannot be within edit distance K

    SELECT R.sid, S.sid
    FROM   R, S
    WHERE  R.qgram = S.qgram AND abs(R.strlen - S.strlen) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

- We refer to this filter as LENGTH FILTERING
23. Exploiting Q-gram Positions for Filtering
- Consider the strings aabbzzaacczz and aacczzaabbzz
- The strings are at edit distance 4
- The strings have identical q-grams for q = 3
- Problem: matching q-grams are at different positions in the two strings
- Either the q-grams do not "originate" from the same q-gram, or
- Too many edit operations "caused" spurious q-grams at various parts of the strings to match
24. POSITION FILTERING: Filtering Using Positions
- Keep the position of the q-grams: <sid, strlen, pos, qgram>
- Do not match q-grams that are more than K positions apart

    SELECT R.sid, S.sid
    FROM   R, S
    WHERE  R.qgram = S.qgram
      AND  abs(R.strlen - S.strlen) <= K
      AND  abs(R.pos - S.pos) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

- We refer to this filter as POSITION FILTERING
25. The Actual, Complete SQL Statement

    SELECT R1.string, S1.string, R1.sid, S1.sid
    FROM   R1, S1, R, S
    WHERE  R1.sid = R.sid
      AND  S1.sid = S.sid
      AND  R.qgram = S.qgram
      AND  abs(strlen(R1.string) - strlen(S1.string)) <= K
      AND  abs(R.pos - S.pos) <= K
    GROUP BY R1.sid, S1.sid, R1.string, S1.string
    HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q

- (A runnable end-to-end sketch of this pipeline follows below.)
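To make the whole pipeline concrete, here is a hedged end-to-end sketch on SQLite (an assumed stand-in for the unmodified "vanilla" RDBMS; table names and sample strings are illustrative). It builds the auxiliary positional q-gram tables and runs the combined count/length/position filter; the surviving candidate pairs would still be verified with a real edit-distance check:

    import sqlite3

    Q, K = 3, 1

    def qgrams(s, q=Q):
        padded = "#" * (q - 1) + s + "%" * (q - 1)
        return [(i, padded[i:i + q]) for i in range(len(s) + q - 1)]

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R1 (sid INTEGER PRIMARY KEY, string TEXT);
        CREATE TABLE S1 (sid INTEGER PRIMARY KEY, string TEXT);
        CREATE TABLE R  (sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
        CREATE TABLE S  (sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
    """)

    # Populate the base tables and their auxiliary positional q-gram tables.
    for base, aux, strings in [("R1", "R", ["Tom Hanks", "Jack Lemmon"]),
                               ("S1", "S", ["Ton Hanks", "Jack Lemon"])]:
        for sid, s in enumerate(strings):
            conn.execute(f"INSERT INTO {base} VALUES (?, ?)", (sid, s))
            conn.executemany(f"INSERT INTO {aux} VALUES (?, ?, ?, ?)",
                             [(sid, len(s), pos, g) for pos, g in qgrams(s)])

    # Count, length, and position filtering in one standard SQL query.
    # R.strlen/S.strlen are constant within each group, so the bare columns are safe in SQLite.
    candidates = conn.execute("""
        SELECT R1.string, S1.string
        FROM R1, S1, R, S
        WHERE R1.sid = R.sid AND S1.sid = S.sid
          AND R.qgram = S.qgram
          AND abs(R.strlen - S.strlen) <= ?
          AND abs(R.pos - S.pos) <= ?
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING COUNT(*) >= max(R.strlen, S.strlen) + ? - 1 - ? * ?
    """, (K, K, Q, K, Q)).fetchall()
    print(candidates)   # candidate pairs only; verify each with an edit-distance UDF or function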
26. Summary of 1st Paper
- Introduced a technique for mapping approximate string joins into a vanilla SQL expression
- The technique does not require modifying the underlying RDBMS
27. Outline
- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication
28. Single-attribute Case
- Given
- two sets of strings, R and S
- a similarity function f between strings (metric space)
- Reflexive: f(s1, s2) = 0 iff s1 = s2
- Symmetric: f(s1, s2) = f(s2, s1)
- Triangle inequality: f(s1, s2) + f(s2, s3) >= f(s1, s3)
- a threshold k
- Find all pairs of strings (r, s) from R and S such that f(r, s) <= k
29. Nested Loop?
- Not desirable for large data sets
- 5 hours for 30K strings!
30. Our Two-step Approach
- Step 1: map strings (in a metric space) to objects in a Euclidean space
- Step 2: do a similarity join in the Euclidean space
31. Advantages
- Applicable to many metric similarity functions
- Edit distance used as the running example
- Other similarity functions also tried, e.g., q-gram-based similarity
- Open to existing algorithms
- Mapping techniques
- Join techniques
32. Step 1
- Map strings into a high-dimensional Euclidean space
- (Figure: strings in the metric space mapped to points in the Euclidean space)
33. Mapping: StringMap
- Input: a list of strings
- Output: points in a high-dimensional Euclidean space that preserve the original distances well
- A variation of FastMap (see the sketch below)
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal
34. Can It Preserve Distances?
- Data sources
- IMDB star names: 54,000
- German names: 132,000
- (Figure: distribution of string lengths in the two data sets)
35. Can It Preserve Distances?
- Use data set 1 (54K names) as an example
- k = 2, d = 20
- Use a new threshold k' = 5.2 to differentiate similar and dissimilar pairs
36. Choosing the Dimensionality d
- Increase d?
- Good: easier to differentiate similar pairs from dissimilar ones
- Bad:
- Step 1 becomes less efficient
- Step 2 suffers from the curse of dimensionality
37. Choosing the Dimensionality d Using Sampling
- Sample 1K x 1K strings and find their similar pairs (within distance k)
- Calculate the maximum w of their new (Euclidean) distances
- Define the cost of finding a similar pair
38. Choosing the Dimensionality d
- (Figure: cost versus dimensionality; d = 15 to 25 works well)
39. Choosing the New Threshold k'
- Closely related to the mapping property
- Ideally, if ed(r, s) <= k, the Euclidean distance between the two corresponding points is <= k'
- Choose k' using sampling
- Sample 1K x 1K strings and find their similar pairs
- Calculate their maximum new distance as k'
- Repeat multiple times and choose the maximum
40. New Threshold k' in Step 2
- (Figure: measured new threshold k' for d = 20)
41. Step 2: Similarity Join
- Input: two sets of points in the Euclidean space
- Output: pairs of points whose distance is less than the new threshold k'
- Many join algorithms can be used
42. Example
- Adopted an algorithm by Hjaltason and Samet
- Build two R-trees
- Traverse the two trees and find points whose distance is within k'
- Prune during the traversal (e.g., using MinDist); see the sketch below
43. Final Processing
- Among the pairs produced by the similarity-join step, check their edit distance
- Return the pairs satisfying the threshold k
44. Running Time
45. Recall
- Recall = (# of similar pairs found) / (# of all similar pairs)
46. Multi-attribute Linkage
- Example: (title, name, year)
- Different attributes have different similarity functions and thresholds
- Consider merge rules in disjunctive format (an OR of conjunctions)
47. Evaluation Strategies
- Many ways to evaluate rules
- Finding an optimal one is NP-hard
- Heuristics
- Treat different conjuncts independently; pick the most efficient attribute in each conjunct
- Choose the largest threshold for each attribute, then choose the most efficient attribute among these thresholds
48. Summary of 2nd Paper
- A novel two-step approach to record linkage
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics
- Time and space efficient
- Multi-attribute case studied
49. Outline
- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication
50. Problems with Existing Deduplication Methods
- Matching functions
- Calculate similarity scores against thresholds
- Tedious coding
- Learning-based methods
- Require a large training set for accuracy (static training set)
- Difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function
51. New Approach
- Relegate the task of finding the deduplication function to a machine learning algorithm
- Design goals
- Fewer training instances
- Interactive response
- Fast convergence
- High accuracy
- => Design an interactive deduplication system called ALIAS
52. The ALIAS System
- A learning-based method
- Exploits existing similarity functions
- Uses active learning
- An active learner actively picks the unlabeled instances with the most information gain for the training set
- Produces a deduplication function that can identify duplicates
53. Overall Architecture
- (Architecture diagram: the mapper uses the similarity functions F to turn the labeled pairs L into mapped training data Lp/T and the unlabeled database D into mapped pairs Dp; the learner trains classifiers on T, computes a predicate for the uncertain region, and selects n instances S from Dp for the user to label; the user's feedback on S is added back to the training data; similarity indices limit the pairs that are mapped.)
54. Primary Inputs for ALIAS
- A set of initial training pairs (L)
- Fewer than 10 labeled records
- Arranged in pairs of duplicates and non-duplicates
- A set of similarity functions (F)
- E.g., word-match, q-gram-match
- Compute similarity scores between two records based on any subset of attributes
- The learner will find the right way of combining those scores to get the final deduplication function
- A database of unlabeled records (D)
- Number of classifiers (< 5)
55. Mapped Labeled Instances
- Take r1, r2 from the input L
- Record r1 = (a1, a2, a3)
- Record r2 = (a1, a2, a3)
- Use similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
- New record: r1r2 = (s1, s2, ..., sn, y/n)
- y: duplicate
- n: non-duplicate
- Put the new record in Lp
56. Mapped Unlabeled Instances
- Take r1, r2 from D x D
- Record r1 = (a1, a2, a3)
- Record r2 = (a1, a2, a3)
- Use similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
- New record: r1r2 = (s1, s2, ..., sn)
- No y/n field
- Put the new record in Dp (see the sketch below)
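A minimal sketch of the mapping step for both labeled and unlabeled pairs. Records are treated as plain strings here for brevity, and word_match/qgram_match are illustrative similarity functions, not necessarily the ones used in ALIAS:

    def word_match(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    def qgram_match(a, b, q=3):
        ga = {a[i:i + q] for i in range(max(len(a) - q + 1, 1))}
        gb = {b[i:i + q] for i in range(max(len(b) - q + 1, 1))}
        return len(ga & gb) / max(len(ga | gb), 1)

    SIM_FUNCS = [word_match, qgram_match]          # the set F of similarity functions

    def map_pair(r1, r2, label=None):
        """Map a record pair to its similarity-score vector.

        Pairs from L carry a 'y'/'n' label and go to Lp; pairs from D x D have
        label=None and go to Dp."""
        scores = [f(r1, r2) for f in SIM_FUNCS]
        return scores + [label] if label is not None else scores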
57. Active Learner
- (Same architecture diagram, highlighting the learner: it trains classifiers on the training data T and selects the set S of n instances from Dp for the user to label.)
58. The ALIAS Algorithm
1. Input: L, D, F.
2. Create pairs Lp from the labeled data L and F.
3. Create pairs Dp from the unlabeled data D and F.
4. Initial training set T = Lp.
5. Loop until user satisfaction:
   - Train classifier C using T.
   - Use C to select a set S of n instances from Dp for labeling.
   - If S is empty, exit the loop.
   - Collect user feedback on the labels of S.
   - Add S to T and remove S from Dp.
6. Output classifier C.
(A compact sketch of this loop follows below.)
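A compact sketch of steps 1-6 (illustrative Python; it assumes scikit-learn's DecisionTreeClassifier as one possible classifier, simplifies the committee to a single model with probability-based uncertainty, and uses an ask_user callback in place of the interactive GUI):

    from sklearn.tree import DecisionTreeClassifier

    def alias_loop(Lp, Dp, ask_user, n=5, rounds=20):
        """Lp: list of [s1, ..., sk, label]; Dp: list of [s1, ..., sk]; ask_user(x) -> 'y'/'n'."""
        T = list(Lp)                                      # step 4: initial training set
        clf = None
        for _ in range(rounds):                           # step 5: loop until satisfied
            X = [row[:-1] for row in T]
            y = [row[-1] for row in T]
            clf = DecisionTreeClassifier().fit(X, y)      # train classifier C on T
            if not Dp:
                break
            # Select the n instances C is least sure about (uncertainty sampling).
            probs = clf.predict_proba(Dp)
            order = sorted(range(len(Dp)), key=lambda i: max(probs[i]))
            S = order[:n]
            for i in sorted(S, reverse=True):             # collect user feedback on S
                x = Dp.pop(i)
                T.append(x + [ask_user(x)])               # add S to T, remove S from Dp
        return clf                                        # step 6: output classifier C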
59. The Indexing Component
- Purpose
- Avoid mapping all pairs of records in D x D
- Three methods
- Grouping
- Sampling
- Indexing
60. The Indexing Component
- Grouping
- Example: group records in D according to the field "year of publication"
- Mapped pairs are formed only among records within a group
- Sampling
- Sample in units of a group instead of individual records
61. The Indexing Component
- Indexing
- For a similarity function such as "fraction of common words between two text attributes > 0.4",
- we can create an index on the words of the text attributes
62. The Learning Component
- Contains a number of classifiers
- A classifier: a machine learning algorithm, such as a decision tree (D-tree), naïve Bayes (NB), or a Support Vector Machine (SVM), used to classify instances
- A classifier is trained using a training data set
63. Criteria for a Classifier
- Accuracy
- Interpretability
- Indexability
- Efficient training
64. Accuracy of a Classifier
- Measured by F, the harmonic mean of recall r and precision p
- r: fraction of true duplicates correctly classified as duplicates
- p: fraction of correct ones among all instances actually labeled duplicate
65. Accuracy (cont.)
- Example
- A case with 1% duplicates: a classifier labels all pairs as non-duplicates
- recall r = 0 => F = 0 => accuracy 0
- A case with 1% duplicates: a classifier identifies all duplicates correctly but misclassifies 1% of the non-duplicates
- recall r = 1, p = 0.5 => F = 0.667 => accuracy 66.7%
- => If we don't use r and p, both classifiers have 99% plain accuracy! (See the worked numbers below.)
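Working through the two scenarios above with concrete numbers (an assumed population of 10,000 pairs, of which 100, i.e., 1%, are duplicates):

    def f_measure(tp, fp, fn):
        """Harmonic mean F of recall and precision."""
        r = tp / (tp + fn) if tp + fn else 0.0
        p = tp / (tp + fp) if tp + fp else 0.0
        return 0.0 if r + p == 0 else 2 * r * p / (r + p)

    # Classifier 1: labels every pair non-duplicate -> ~99% plain accuracy, but F = 0.
    print(f_measure(tp=0, fp=0, fn=100))       # 0.0
    # Classifier 2: finds all 100 duplicates but also flags 1% of the 9,900
    # non-duplicates (99 pairs) -> ~99% plain accuracy, F about 0.67.
    print(f_measure(tp=100, fp=99, fn=0))      # ~0.67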
66. Criteria for a Classifier (cont.)
- Interpretability
- The final deduplication rule is easy to understand and interpret
- Indexability
- The final deduplication rule has indexable predicates
- Efficient training
- Fast to train
67. Comparison of Different Classifiers
68. Active Learning
- The goal is to seek out from the unlabeled pool the instances which, when labeled, will strengthen the classifier at the fastest possible rate.
69. A Simple Example
- (Figure: points on a line; the most uncertain instance is the midpoint m between the two labeled points.)
- Assume all instances are unlabeled except a and b
- Suppose a has coordinate 0 and b has coordinate 1
- Any unlabeled point x to the left of a or to the right of b will have no effect in reducing the region of uncertainty
- By including m in the training set, the size of the uncertain region is reduced by half
70. How to Select an Unlabeled Instance
- Uncertainty
- The instance about which the learner is most unsure is also the instance for which the expected reduction in confusion is the largest
- Uncertainty score: the disagreement among the predictions the instance gets from a committee of N classifiers
- A sure duplicate or non-duplicate would get the same prediction from all members
- Representativeness
- An uncertain instance representing a larger number of unlabeled instances has a greater impact on the classifier
71. Example
- Three similarity functions: word match f1, q-gram match f2, string edit distance f3
- Take two pairs from the mapped unlabeled instances (Dp)
- r1r2 = (s1, s2, s3)
- r3r4 = (s1, s2, s3)
- s1, s2, s3: scores computed using f1, f2, f3
- Three classifiers: D-tree, Naïve Bayes, SVM

Pair   D-tree     Naïve Bayes    SVM
r1r2   duplicate  non-duplicate  duplicate   <- selected (the classifiers disagree)
r3r4   duplicate  duplicate      duplicate

(A committee-disagreement sketch follows below.)
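A tiny sketch of a disagreement-based uncertainty score for the committee above (illustrative; ALIAS combines this with representativeness as described on the next slide):

    from collections import Counter

    def uncertainty(votes):
        """Fraction of committee votes that differ from the majority prediction."""
        majority = Counter(votes).most_common(1)[0][1]
        return 1 - majority / len(votes)

    print(uncertainty(["dup", "non-dup", "dup"]))   # r1r2 -> 0.33..., uncertain, selected
    print(uncertainty(["dup", "dup", "dup"]))       # r3r4 -> 0.0, a sure duplicate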
72. How to Combine Uncertainty and Representativeness
- Two proposed approaches
- 1st approach: weighted sum
- Cluster the unlabeled instances
- Estimate the density of points around each instance
- Score each instance by a weighted sum of its density and its uncertainty value
- Select the n highest-scoring instances
- 2nd approach: sampling
73. Conclusion
- ALIAS
- Makes deduplication much easier (fewer training instances)
- Provides interactive responses to the user
- High accuracy