Title: Searching and Integrating Information on the Web
1. Searching and Integrating Information on the Web
- Seminar 3: Data Cleansing
- Professor Chen Li
- UC Irvine
2. Paper Readings
- Efficient merge and purge. Hernandez and Stolfo, SIGMOD 1995.
- Approximate String Joins in a Database (Almost) for Free. Gravano et al., VLDB 2001.
- Efficient Record Linkage in Large Data Sets. Liang Jin, Chen Li, Sharad Mehrotra, DASFAA 2003.
- Interactive Deduplication Using Active Learning. Sarawagi and Bhamidipaty, KDD 2002.
3. Motivation
- Correlate data from different data sources (e.g., data integration)
- Data is often dirty
- It needs to be cleansed before being used
- Example
- A hospital needs to merge patient records from different data sources
- They have different formats, typos, and abbreviations
4. Example

Table R
Name           SSN            Addr
Jack Lemmon    430-871-8294   Maple St
Harrison Ford  292-918-2913   Culver Blvd
Tom Hanks      234-762-1234   Main St

Table S
Name           SSN            Addr
Ton Hanks      234-162-1234   Main Street
Kevin Spacey   928-184-2813   Frost Blvd
Jack Lemon     430-817-8294   Maple Street

- Find records from different datasets that could be the same entity
5. Another Example
- P. Bernstein, D. Chiu. Using Semi-Joins to Solve Relational Queries. JACM 28(1): 25-40 (1981)
- Philip A. Bernstein, Dah-Ming W. Chiu. Using Semi-Joins to Solve Relational Queries. Journal of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981
6. Record Linkage
- Problem statement: given two relations, identify the potentially matched records
- Efficiently and
- Effectively
7. Challenges
- How to define good similarity functions?
- Many functions proposed (edit distance, cosine similarity, ...)
- Domain knowledge is critical
- Names: "Wall Street Journal" and "LA Times"
- Addresses: "Main Street" versus "Main St"
- How to do the matching efficiently
- Offline join version
- Online (interactive) search
- Nearest-neighbor search
- Range search
8. Outline
- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication
9. Edit Distance
- A widely used metric to define string similarity
- ed(s1, s2) = minimum number of operations (insertion, deletion, substitution) needed to change s1 into s2
- Example: s1 = "Tom Hanks", s2 = "Ton Hank", ed(s1, s2) = 2 (see the sketch below)
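As a concrete reference, here is a minimal dynamic-programming sketch of this definition; it is plain illustrative Python, not code from any of the papers.

    def edit_distance(s1, s2):
        """Minimum number of insertions, deletions, and substitutions turning s1 into s2."""
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, 1):
            cur = [i]
            for j, c2 in enumerate(s2, 1):
                cur.append(min(prev[j] + 1,                   # deletion
                               cur[j - 1] + 1,                # insertion
                               prev[j - 1] + (c1 != c2)))     # substitution (free on a match)
            prev = cur
        return prev[-1]

    print(edit_distance("Tom Hanks", "Ton Hank"))   # 2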
10. Approximate String Joins
- We want to join tuples with similar string fields
- Similarity measure: edit distance
- Each insertion, deletion, or replacement increases the distance by one

Service A:            Service B:
Jenny Stamatopoulou   Panos Ipirotis
John Paul McDougal    Jonh Smith
Aldridge Rodriguez    Jenny Stamatopulou
Panos Ipeirotis       John P. McDougal
John Smith            Al Dridge Rodriguez

(Figure: lines connect the matching name pairs across the two services, each labeled with the pair's edit distance.)
11. Focus: Approximate String Joins over Relational DBMSs
- Join two tables on string attributes and keep all pairs of strings with edit distance <= K
- Solve the problem in a database-friendly way (if possible, with an existing "vanilla" RDBMS)
12. Current Approaches for Processing Approximate String Joins
- No native support for approximate joins in RDBMSs
- Two existing (straightforward) solutions:
- Join data outside of the DBMS
- Join data via user-defined functions (UDFs) inside the DBMS
13. Approximate String Joins outside of a DBMS
- Export data
- Join outside of the DBMS
- Import the result
- Main advantage
- We can exploit any state-of-the-art string-matching algorithm, without restrictions from DBMS functionality
- Disadvantages
- Substantial amounts of data to be exported/imported
- Cannot be easily integrated with further processing steps in the DBMS
14. Approximate String Joins with UDFs
- Write a UDF to check if two strings match within distance K
- Write an SQL statement that applies the UDF to the string pairs:

    SELECT R.stringAttr, S.stringAttr
    FROM   R, S
    WHERE  edit_distance(R.stringAttr, S.stringAttr, K)

- Main advantage
- Ease of implementation
- Main disadvantage
- The UDF is applied to the entire cross-product of the two relations
15. Our Approach: Approximate String Joins over an Unmodified RDBMS
- Preprocess data and generate auxiliary tables
- Perform the join exploiting standard RDBMS capabilities
- Advantages
- No modification of the underlying RDBMS needed
- Can leverage the RDBMS query optimizer
- Much more efficient than the approach based on naive UDFs
16. Intuition and Roadmap
- Intuition
- Similar strings have many common substrings
- Use exact joins to perform approximate joins (current DBMSs are good for exact joins)
- A good candidate set can be verified for false positives
- Ukkonen 1992, Sutinen and Tarhio 1996, Ullman 1977
- Roadmap
- Break strings into substrings of length q (q-grams)
- Perform an exact join on the q-grams
- Find candidate string pairs based on the results
- Check only the candidate pairs with a UDF to obtain the final answer
17. What is a Q-gram?
- Q-gram: a sequence of q characters of the (padded) original string
- Example for q = 3:
- vacations: ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- A string of length L has L + q - 1 q-grams (with q - 1 padding characters on each end)
- Similar strings have many common q-grams (see the sketch below)
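A small sketch of positional q-gram extraction, assuming '#' and '%' as the padding symbols used in the example above (illustrative Python):

    def qgrams(s, q=3):
        """Return (position, q-gram) pairs of the padded string; there are len(s) + q - 1 of them."""
        padded = "#" * (q - 1) + s + "%" * (q - 1)
        return [(i, padded[i:i + q]) for i in range(len(s) + q - 1)]

    print(qgrams("vacations"))
    # 11 positional q-grams for a string of length 9: (0, '##v'), (1, '#va'), (2, 'vac'), ...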
18. Q-grams and Edit Distance Operations
- With no edits: L + q - 1 common q-grams
- Replacement: at least (L + q - 1) - q common q-grams
- vacations:  ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- vacalions:  ##v, #va, vac, aca, cal, ali, lio, ion, ons, ns%, s%%
- Insertion: at least (Lmax + q - 1) - q common q-grams
- vacations:  ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- vacatlions: ##v, #va, vac, aca, cat, atl, tli, lio, ion, ons, ns%, s%%
- Deletion: at least (Lmax + q - 1) - q common q-grams
- vacations:  ##v, #va, vac, aca, cat, ati, tio, ion, ons, ns%, s%%
- vacaions:   ##v, #va, vac, aca, cai, aio, ion, ons, ns%, s%%
19. Number of Common Q-grams and Edit Distance
- For edit distance K, there can be at most K replacements, insertions, and deletions
- Two strings S1 and S2 within edit distance K have at least max(|S1|, |S2|) + q - 1 - K*q q-grams in common
- Useful filter: eliminate all string pairs without "enough" common q-grams (no false dismissals); see the sketch below
20. Using a DBMS for Q-gram Joins
- If we have the q-grams in the DBMS, we can perform this counting efficiently
- Create auxiliary tables with tuples of the form <sid, strlen, qgram> and join these tables
- A GROUP BY ... HAVING COUNT clause can perform the counting/filtering
21. Eliminating Candidate Pairs: COUNT FILTERING
- SQL for this filter (parts omitted for clarity):

    SELECT R.sid, S.sid
    FROM   R, S
    WHERE  R.qgram = S.qgram
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

- The result is the set of string pairs with enough common q-grams to ensure that we will not have false negatives.
22. Eliminating Candidate Pairs Further: LENGTH FILTERING
- Strings with length difference larger than K cannot be within edit distance K

    SELECT R.sid, S.sid
    FROM   R, S
    WHERE  R.qgram = S.qgram AND abs(R.strlen - S.strlen) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

- We refer to this filter as LENGTH FILTERING
23. Exploiting Q-gram Positions for Filtering
- Consider the strings aabbzzaacczz and aacczzaabbzz
- The strings are at edit distance 4
- The strings have identical q-grams for q = 3
- Problem: matching q-grams are at different positions in the two strings
- Either the q-grams do not "originate" from the same q-gram, or
- Too many edit operations "caused" spurious q-grams at various parts of the strings to match
24. POSITION FILTERING: Filtering Using Positions
- Keep the position of the q-grams: <sid, strlen, pos, qgram>
- Do not match q-grams that are more than K positions apart

    SELECT R.sid, S.sid
    FROM   R, S
    WHERE  R.qgram = S.qgram
      AND  abs(R.strlen - S.strlen) <= K
      AND  abs(R.pos - S.pos) <= K
    GROUP BY R.sid, S.sid
    HAVING COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

- We refer to this filter as POSITION FILTERING
25. The Actual, Complete SQL Statement

    SELECT R1.string, S1.string, R1.sid, S1.sid
    FROM   R1, S1, R, S
    WHERE  R1.sid = R.sid
      AND  S1.sid = S.sid
      AND  R.qgram = S.qgram
      AND  abs(strlen(R1.string) - strlen(S1.string)) <= K
      AND  abs(R.pos - S.pos) <= K
    GROUP BY R1.sid, S1.sid, R1.string, S1.string
    HAVING COUNT(*) >= (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q

- (A runnable end-to-end sketch of this pipeline follows below.)
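To make the whole pipeline concrete, here is a hedged end-to-end sketch on SQLite (an assumed stand-in for the unmodified "vanilla" RDBMS; table names and sample strings are illustrative). It builds the auxiliary positional q-gram tables and runs the combined count/length/position filter; the surviving candidate pairs would still be verified with a real edit-distance check:

    import sqlite3

    Q, K = 3, 1

    def qgrams(s, q=Q):
        padded = "#" * (q - 1) + s + "%" * (q - 1)
        return [(i, padded[i:i + q]) for i in range(len(s) + q - 1)]

    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE R1 (sid INTEGER PRIMARY KEY, string TEXT);
        CREATE TABLE S1 (sid INTEGER PRIMARY KEY, string TEXT);
        CREATE TABLE R  (sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
        CREATE TABLE S  (sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
    """)

    # Populate the base tables and their auxiliary positional q-gram tables.
    for base, aux, strings in [("R1", "R", ["Tom Hanks", "Jack Lemmon"]),
                               ("S1", "S", ["Ton Hanks", "Jack Lemon"])]:
        for sid, s in enumerate(strings):
            conn.execute(f"INSERT INTO {base} VALUES (?, ?)", (sid, s))
            conn.executemany(f"INSERT INTO {aux} VALUES (?, ?, ?, ?)",
                             [(sid, len(s), pos, g) for pos, g in qgrams(s)])

    # Count, length, and position filtering in one standard SQL query.
    # R.strlen/S.strlen are constant within each group, so the bare columns are safe in SQLite.
    candidates = conn.execute("""
        SELECT R1.string, S1.string
        FROM R1, S1, R, S
        WHERE R1.sid = R.sid AND S1.sid = S.sid
          AND R.qgram = S.qgram
          AND abs(R.strlen - S.strlen) <= ?
          AND abs(R.pos - S.pos) <= ?
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING COUNT(*) >= max(R.strlen, S.strlen) + ? - 1 - ? * ?
    """, (K, K, Q, K, Q)).fetchall()
    print(candidates)   # candidate pairs only; verify each with an edit-distance UDF or function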
26. Summary of 1st Paper
- Introduced a technique for mapping approximate string joins into a vanilla SQL expression
- The technique does not require modifying the underlying RDBMS
27. Outline
- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication
28. Single-attribute Case
- Given
- two sets of strings, R and S
- a similarity function f between strings (metric space)
- Reflexive: f(s1, s2) = 0 iff s1 = s2
- Symmetric: f(s1, s2) = f(s2, s1)
- Triangle inequality: f(s1, s2) + f(s2, s3) >= f(s1, s3)
- a threshold k
- Find all pairs of strings (r, s) from R and S such that f(r, s) <= k
29. Nested Loop?
- Not desirable for large data sets
- 5 hours for 30K strings!
30. Our Two-step Approach
- Step 1: map strings (in a metric space) to objects in a Euclidean space
- Step 2: do a similarity join in the Euclidean space
31. Advantages
- Applicable to many metric similarity functions
- Edit distance used as the running example
- Other similarity functions also tried, e.g., q-gram-based similarity
- Open to existing algorithms
- Mapping techniques
- Join techniques
32. Step 1
- Map strings into a high-dimensional Euclidean space
- (Figure: strings in the metric space mapped to points in the Euclidean space)
33. Mapping: StringMap
- Input: a list of strings
- Output: points in a high-dimensional Euclidean space that preserve the original distances well
- A variation of FastMap (see the sketch below)
- Each step greedily picks two strings (pivots) to form an axis
- All axes are orthogonal
34. Can It Preserve Distances?
- Data sources
- IMDB star names: 54,000
- German names: 132,000
- (Figure: distribution of string lengths in the two data sets)
35. Can It Preserve Distances?
- Use data set 1 (54K names) as an example
- k = 2, d = 20
- Use a new threshold k' = 5.2 to differentiate similar and dissimilar pairs
36. Choosing the Dimensionality d
- Increase d?
- Good: easier to differentiate similar pairs from dissimilar ones
- Bad:
- Step 1 becomes less efficient
- Step 2 suffers from the curse of dimensionality
37. Choosing the Dimensionality d Using Sampling
- Sample 1K x 1K strings and find their similar pairs (within distance k)
- Calculate the maximum w of their new (Euclidean) distances
- Define the cost of finding a similar pair
38. Choosing the Dimensionality d
- (Figure: cost versus dimensionality; d = 15 to 25 works well)
39. Choosing the New Threshold k'
- Closely related to the mapping property
- Ideally, if ed(r, s) <= k, the Euclidean distance between the two corresponding points is <= k'
- Choose k' using sampling
- Sample 1K x 1K strings and find their similar pairs
- Calculate their maximum new distance as k'
- Repeat multiple times and choose the maximum
40. New Threshold k' in Step 2
- (Figure: measured new threshold k' for d = 20)
41. Step 2: Similarity Join
- Input: two sets of points in the Euclidean space
- Output: pairs of points whose distance is less than the new threshold k'
- Many join algorithms can be used
42. Example
- Adopted an algorithm by Hjaltason and Samet
- Build two R-trees
- Traverse the two trees and find points whose distance is within k'
- Prune during the traversal (e.g., using MinDist); see the sketch below
43. Final Processing
- Among the pairs produced by the similarity-join step, check their edit distance
- Return the pairs satisfying the threshold k
44. Running Time
45. Recall
- Recall = (# of similar pairs found) / (# of all similar pairs)
46. Multi-attribute Linkage
- Example: (title, name, year)
- Different attributes have different similarity functions and thresholds
- Consider merge rules in disjunctive format (an OR of conjunctions)
47. Evaluation Strategies
- Many ways to evaluate rules
- Finding an optimal one is NP-hard
- Heuristics
- Treat different conjuncts independently; pick the most efficient attribute in each conjunct
- Choose the largest threshold for each attribute, then choose the most efficient attribute among these thresholds
48. Summary of 2nd Paper
- A novel two-step approach to record linkage
- Many existing mapping and join algorithms can be adopted
- Applicable to many distance metrics
- Time and space efficient
- Multi-attribute case studied
49. Outline
- Supporting string-similarity joins using RDBMS
- Using mapping techniques
- Interactive deduplication
50. Problems with Existing Deduplication Methods
- Matching functions
- Calculate similarity scores against thresholds
- Tedious coding
- Learning-based methods
- Require a large training set for accuracy (static training set)
- Difficult to provide a covering and challenging training set that brings out the subtlety of the deduplication function
51. New Approach
- Relegate the task of finding the deduplication function to a machine learning algorithm
- Design goals
- Fewer training instances
- Interactive response
- Fast convergence
- High accuracy
- => Design an interactive deduplication system called ALIAS
52. The ALIAS System
- A learning-based method
- Exploits existing similarity functions
- Uses active learning
- An active learner actively picks the unlabeled instances with the most information gain for the training set
- Produces a deduplication function that can identify duplicates
53. Overall Architecture
- (Architecture diagram: the mapper uses the similarity functions F to turn the labeled pairs L into mapped training data Lp/T and the unlabeled database D into mapped pairs Dp; the learner trains classifiers on T, computes a predicate for the uncertain region, and selects n instances S from Dp for the user to label; the user's feedback on S is added back to the training data; similarity indices limit the pairs that are mapped.)
54. Primary Inputs for ALIAS
- A set of initial training pairs (L)
- Fewer than 10 labeled records
- Arranged in pairs of duplicates and non-duplicates
- A set of similarity functions (F)
- E.g., word-match, q-gram-match
- Compute similarity scores between two records based on any subset of attributes
- The learner will find the right way of combining those scores to get the final deduplication function
- A database of unlabeled records (D)
- Number of classifiers (< 5)
55. Mapped Labeled Instances
- Take r1, r2 from the input L
- Record r1 = (a1, a2, a3)
- Record r2 = (a1, a2, a3)
- Use similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
- New record: r1r2 = (s1, s2, ..., sn, y/n)
- y: duplicate
- n: non-duplicate
- Put the new record in Lp
56. Mapped Unlabeled Instances
- Take r1, r2 from D x D
- Record r1 = (a1, a2, a3)
- Record r2 = (a1, a2, a3)
- Use similarity functions f1, f2, ..., fn to compute similarity scores between r1 and r2
- New record: r1r2 = (s1, s2, ..., sn)
- No y/n field
- Put the new record in Dp (see the sketch below)
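A minimal sketch of the mapping step for both labeled and unlabeled pairs. Records are treated as plain strings here for brevity, and word_match/qgram_match are illustrative similarity functions, not necessarily the ones used in ALIAS:

    def word_match(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    def qgram_match(a, b, q=3):
        ga = {a[i:i + q] for i in range(max(len(a) - q + 1, 1))}
        gb = {b[i:i + q] for i in range(max(len(b) - q + 1, 1))}
        return len(ga & gb) / max(len(ga | gb), 1)

    SIM_FUNCS = [word_match, qgram_match]          # the set F of similarity functions

    def map_pair(r1, r2, label=None):
        """Map a record pair to its similarity-score vector.

        Pairs from L carry a 'y'/'n' label and go to Lp; pairs from D x D have
        label=None and go to Dp."""
        scores = [f(r1, r2) for f in SIM_FUNCS]
        return scores + [label] if label is not None else scores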
57. Active Learner
- (Same architecture diagram, highlighting the learner: it trains classifiers on the training data T and selects the set S of n instances from Dp for the user to label.)
58. The ALIAS Algorithm
1. Input: L, D, F.
2. Create pairs Lp from the labeled data L and F.
3. Create pairs Dp from the unlabeled data D and F.
4. Initial training set T = Lp.
5. Loop until user satisfaction:
   - Train classifier C using T.
   - Use C to select a set S of n instances from Dp for labeling.
   - If S is empty, exit the loop.
   - Collect user feedback on the labels of S.
   - Add S to T and remove S from Dp.
6. Output classifier C.
(A compact sketch of this loop follows below.)
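A compact sketch of steps 1-6 (illustrative Python; it assumes scikit-learn's DecisionTreeClassifier as one possible classifier, simplifies the committee to a single model with probability-based uncertainty, and uses an ask_user callback in place of the interactive GUI):

    from sklearn.tree import DecisionTreeClassifier

    def alias_loop(Lp, Dp, ask_user, n=5, rounds=20):
        """Lp: list of [s1, ..., sk, label]; Dp: list of [s1, ..., sk]; ask_user(x) -> 'y'/'n'."""
        T = list(Lp)                                      # step 4: initial training set
        clf = None
        for _ in range(rounds):                           # step 5: loop until satisfied
            X = [row[:-1] for row in T]
            y = [row[-1] for row in T]
            clf = DecisionTreeClassifier().fit(X, y)      # train classifier C on T
            if not Dp:
                break
            # Select the n instances C is least sure about (uncertainty sampling).
            probs = clf.predict_proba(Dp)
            order = sorted(range(len(Dp)), key=lambda i: max(probs[i]))
            S = order[:n]
            for i in sorted(S, reverse=True):             # collect user feedback on S
                x = Dp.pop(i)
                T.append(x + [ask_user(x)])               # add S to T, remove S from Dp
        return clf                                        # step 6: output classifier C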
59. The Indexing Component
- Purpose
- Avoid mapping all pairs of records in D x D
- Three methods
- Grouping
- Sampling
- Indexing
60. The Indexing Component
- Grouping
- Example: group records in D according to the field "year of publication"
- Mapped pairs are formed only among records within a group
- Sampling
- Sample in units of a group instead of individual records
61. The Indexing Component
- Indexing
- For a similarity function such as "fraction of common words between two text attributes > 0.4",
- we can create an index on the words of the text attributes
62. The Learning Component
- Contains a number of classifiers
- A classifier: a machine learning algorithm, such as a decision tree (D-tree), naïve Bayes (NB), or a Support Vector Machine (SVM), used to classify instances
- A classifier is trained using a training data set
63. Criteria for a Classifier
- Accuracy
- Interpretability
- Indexability
- Efficient training
64. Accuracy of a Classifier
- Measured by F, the harmonic mean of recall r and precision p
- r: fraction of true duplicates correctly classified as duplicates
- p: fraction of correct ones among all instances actually labeled duplicate
65. Accuracy (cont.)
- Example
- A case with 1% duplicates: a classifier labels all pairs as non-duplicates
- recall r = 0 => F = 0 => accuracy 0
- A case with 1% duplicates: a classifier identifies all duplicates correctly but misclassifies 1% of the non-duplicates
- recall r = 1, p = 0.5 => F = 0.667 => accuracy 66.7%
- => If we don't use r and p, both classifiers have 99% plain accuracy! (See the worked numbers below.)
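Working through the two scenarios above with concrete numbers (an assumed population of 10,000 pairs, of which 100, i.e., 1%, are duplicates):

    def f_measure(tp, fp, fn):
        """Harmonic mean F of recall and precision."""
        r = tp / (tp + fn) if tp + fn else 0.0
        p = tp / (tp + fp) if tp + fp else 0.0
        return 0.0 if r + p == 0 else 2 * r * p / (r + p)

    # Classifier 1: labels every pair non-duplicate -> ~99% plain accuracy, but F = 0.
    print(f_measure(tp=0, fp=0, fn=100))       # 0.0
    # Classifier 2: finds all 100 duplicates but also flags 1% of the 9,900
    # non-duplicates (99 pairs) -> ~99% plain accuracy, F about 0.67.
    print(f_measure(tp=100, fp=99, fn=0))      # ~0.67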
66. Criteria for a Classifier (cont.)
- Interpretability
- The final deduplication rule is easy to understand and interpret
- Indexability
- The final deduplication rule has indexable predicates
- Efficient training
- Fast to train
67. Comparison of Different Classifiers
68. Active Learning
- The goal is to seek out from the unlabeled pool the instances which, when labeled, will strengthen the classifier at the fastest possible rate.
69. A Simple Example
- (Figure: points on a line; the most uncertain instance is the midpoint m between the two labeled points.)
- Assume all instances are unlabeled except a and b
- Suppose a has coordinate 0 and b has coordinate 1
- Any unlabeled point x to the left of a or to the right of b will have no effect in reducing the region of uncertainty
- By including m in the training set, the size of the uncertain region is reduced by half
70. How to Select an Unlabeled Instance
- Uncertainty
- The instance about which the learner is most unsure is also the instance for which the expected reduction in confusion is the largest
- Uncertainty score: the disagreement among the predictions the instance gets from a committee of N classifiers
- A sure duplicate or non-duplicate would get the same prediction from all members
- Representativeness
- An uncertain instance representing a larger number of unlabeled instances has a greater impact on the classifier
71. Example
- Three similarity functions: word match f1, q-gram match f2, string edit distance f3
- Take two pairs from the mapped unlabeled instances (Dp)
- r1r2 = (s1, s2, s3)
- r3r4 = (s1, s2, s3)
- s1, s2, s3: scores computed using f1, f2, f3
- Three classifiers: D-tree, Naïve Bayes, SVM

Pair   D-tree     Naïve Bayes    SVM
r1r2   duplicate  non-duplicate  duplicate   <- selected (the classifiers disagree)
r3r4   duplicate  duplicate      duplicate

(A committee-disagreement sketch follows below.)
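A tiny sketch of a disagreement-based uncertainty score for the committee above (illustrative; ALIAS combines this with representativeness as described on the next slide):

    from collections import Counter

    def uncertainty(votes):
        """Fraction of committee votes that differ from the majority prediction."""
        majority = Counter(votes).most_common(1)[0][1]
        return 1 - majority / len(votes)

    print(uncertainty(["dup", "non-dup", "dup"]))   # r1r2 -> 0.33..., uncertain, selected
    print(uncertainty(["dup", "dup", "dup"]))       # r3r4 -> 0.0, a sure duplicate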
72. How to Combine Uncertainty and Representativeness
- Two proposed approaches
- 1st approach: weighted sum
- Cluster the unlabeled instances
- Estimate the density of points around each instance
- Score each instance by a weighted sum of its density and its uncertainty value
- Select the n highest-scoring instances
- 2nd approach: sampling
73. Conclusion
- ALIAS
- Makes deduplication much easier (fewer training instances)
- Provides interactive responses to the user
- High accuracy