Searching and Integrating Information on the Web - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Searching and Integrating Information on the Web


1
Searching and Integrating Information on the Web
  • Seminar 3 Data Cleansing
  • Professor Chen Li
  • UC Irvine

2
Paper readings
  • Efficient merge and purge, Hernandez and Stolfo,
    SIGMOD 1995
  • Approximate String Joins in a Database (Almost)
    for Free, Gravano et al., VLDB 2001
  • Efficient Record Linkage in Large Data Sets,
    Liang Jin, Chen Li, and Sharad Mehrotra, DASFAA 2003
  • Interactive Deduplication Using Active Learning,
    Sarawagi and Bhamidipaty, KDD 2002

3
Motivation
  • Correlate data from different data sources (e.g.,
    for data integration)
  • Data is often dirty
  • It needs to be cleansed before being used
  • Example:
  • A hospital needs to merge patient records from
    different data sources
  • The records have different formats, typos, and
    abbreviations

4
Example
Table R
  Name            SSN            Addr
  Jack Lemmon     430-871-8294   Maple St
  Harrison Ford   292-918-2913   Culver Blvd
  Tom Hanks       234-762-1234   Main St

Table S
  Name            SSN            Addr
  Ton Hanks       234-162-1234   Main Street
  Kevin Spacey    928-184-2813   Frost Blvd
  Jack Lemon      430-817-8294   Maple Street
  • Find records from different datasets that could
    be the same entity

5
Another Example
  • P. Bernstein, D. Chiu Using Semi-Joins to Solve
    Relational Queries. JACM 28(1) 25-40(1981)
  • Philip A. Bernstein, Dah-Ming W. Chiu, Using
    Semi-Joins to Solve Relational Queries, Journal
    of the ACM (JACM), v.28 n.1, p.25-40, Jan. 1981

6
Record linkage
  • Problem statement: given two relations, identify
    the potentially matched records
  • Efficiently and
  • Effectively

7
Challenges
  • How to define good similarity functions?
  • Many functions have been proposed (edit distance,
    cosine similarity, ...)
  • Domain knowledge is critical
  • Names: "Wall Street Journal" and "LA Times"
  • Addresses: "Main Street" versus "Main St"
  • How to do the matching efficiently?
  • Offline (join) version
  • Online (interactive) search
  • Nearest-neighbor search
  • Range search

8
Outline
  • Supporting string-similarity joins using RDBMS
  • Using mapping techniques
  • Interactive deduplication

9
Edit Distance
  • A widely used metric for string similarity
  • ed(s1, s2) = minimum number of operations
    (insertions, deletions, substitutions) needed to
    change s1 into s2
  • Example
  • s1 = Tom Hanks
  • s2 = Ton Hank
  • ed(s1, s2) = 2 (substitute m -> n, delete the
    trailing s)
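The edit distance defined above can be computed with the textbook dynamic program; a minimal Python sketch (not code from the papers):

```python
def edit_distance(s1: str, s2: str) -> int:
    """Classic dynamic-programming edit distance
    (insertions, deletions, substitutions)."""
    m, n = len(s1), len(s2)
    # prev[j] holds the distance between s1[:i-1] and s2[:j]
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution / match
        prev = cur
    return prev[n]
```

For the slide's example, `edit_distance("Tom Hanks", "Ton Hank")` returns 2.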

10
Approximate String Joins
  • We want to join tuples with similar string
    fields
  • Similarity measure: Edit Distance
  • Each insertion, deletion, or replacement
    increases the distance by one

Service A              Service B
Jenny Stamatopoulou    Jenny Stamatopulou
John Paul McDougal     John P. McDougal
Aldridge Rodriguez     Al Dridge Rodriguez
Panos Ipeirotis        Panos Ipirotis
John Smith             Jonh Smith

[Slide graphic: arrows pair each Service A name with
its near-duplicate in Service B, annotated with keys
K1, K2, K3.]
11
Focus: Approximate String Joins over Relational DBMSs
  • Join two tables on string attributes and keep all
    pairs of strings with Edit Distance <= K
  • Solve the problem in a database-friendly way
  • (if possible, with an existing "vanilla" RDBMS)

12
Current Approaches for Processing Approximate
String Joins
  • No native support for approximate joins in RDBMSs
  • Two existing (straightforward) solutions
  • Join data outside of DBMS
  • Join data via user-defined functions (UDFs)
    inside the DBMS

13
Approximate String Joins outside of a DBMS
  • Export data
  • Join outside of DBMS
  • Import the result
  • Main advantage
  • We can exploit any state-of-the-art
    string-matching algorithm, without restrictions
    from DBMS functionality
  • Disadvantages
  • Substantial amounts of data to be
    exported/imported
  • Cannot be easily integrated with further
    processing steps in the DBMS

14
Approximate String Joins with UDFs
  • Write a UDF to check if two strings match within
    distance K
  • Write an SQL statement that applies the UDF to
    the string pairs
        SELECT R.stringAttr, S.stringAttr
        FROM   R, S
        WHERE  edit_distance(R.stringAttr, S.stringAttr, K)
  • Main advantage
  • Ease of implementation
  • Main disadvantage
  • UDF applied to entire cross-product of relations

15
Our Approach Approximate String Joins over an
Unmodified RDBMS
  • Preprocess data and generate auxiliary tables
  • Perform join exploiting standard RDBMS
    capabilities
  • Advantages
  • No modification of underlying RDBMS needed.
  • Can leverage the RDBMS query optimizer.
  • Much more efficient than the approach based on
    naive UDFs

16
Intuition and Roadmap
  • Intuition
  • Similar strings have many common substrings
  • Use exact joins to perform approximate joins
    (current DBMSs are good for exact joins)
  • A good candidate set can be verified for false
    positives
  • Ukkonen 1992, Sutinen and Tarhio 1996, Ullman
    1977
  • Roadmap
  • Break strings into substrings of length q
    (q-grams)
  • Perform an exact join on the q-grams
  • Find candidate string pairs based on the results
  • Check only candidate pairs with a UDF to obtain
    final answer

17
What is a Q-gram?
  • Q-gram: a sequence of q characters of the
    (padded) original string
  • Example for q = 3:
  • vacations:
  • ##v, #va, vac, aca, cat, ati, tio, ion, ons,
    ns#, s##
  • A string of length L has L + q - 1 q-grams (each
    end is padded with q - 1 copies of a character, #,
    that does not occur in the data)
  • Similar strings have many common q-grams
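The padded q-gram extraction above takes only a few lines; a sketch (the pad character # is assumed not to occur in the data):

```python
def qgrams(s: str, q: int = 3) -> list[str]:
    """All q-grams of s, padding both ends with q-1 copies of a
    character (#) that does not occur in the strings, as in the
    slides' "vacations" example."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]
```

A string of length L yields exactly L + q - 1 q-grams: `qgrams("vacations")` produces the 11 grams `##v, #va, vac, ..., ns#, s##` shown on the slide.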

18
Q-grams and Edit Distance Operations
  • With no edits: L + q - 1 common q-grams
  • Replacement: at least (L + q - 1) - q common q-grams
  • Vacations: ##v, #va, vac, aca, cat, ati, tio,
    ion, ons, ns#, s##
  • Vacalions: ##v, #va, vac, aca, cal, ali, lio,
    ion, ons, ns#, s##
  • Insertion: at least (Lmax + q - 1) - q common q-grams
  • Vacations: ##v, #va, vac, aca, cat, ati, tio,
    ion, ons, ns#, s##
  • Vacatlions: ##v, #va, vac, aca, cat, atl, tli,
    lio, ion, ons, ns#, s##
  • Deletion: at least (Lmax + q - 1) - q common q-grams
  • Vacations: ##v, #va, vac, aca, cat, ati, tio,
    ion, ons, ns#, s##
  • Vacaions: ##v, #va, vac, aca, cai, aio, ion,
    ons, ns#, s##

19
Number of Common Q-grams and Edit Distance
  • For Edit Distance K, there can be at most K
    replacements, insertions, and deletions
  • Two strings S1 and S2 within Edit Distance K have
    at least max(S1.len, S2.len) + q - 1 - K*q
    q-grams in common
  • Useful filter: eliminate all string pairs without
    "enough" common q-grams (no false dismissals)
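The count bound above can be sketched as a standalone filter (a hypothetical helper, not the paper's code). Note that it guarantees no false dismissals but does admit false positives, which is why candidates must still be verified:

```python
from collections import Counter

def qgrams(s: str, q: int = 3) -> list[str]:
    # q-grams of s padded at both ends with q-1 copies of '#'
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

def count_filter(s1: str, s2: str, q: int = 3, k: int = 1) -> bool:
    """Candidate test: strings within edit distance k must share at
    least max(|s1|, |s2|) + q - 1 - k*q q-grams (multiset count)."""
    common = sum((Counter(qgrams(s1, q)) & Counter(qgrams(s2, q))).values())
    return common >= max(len(s1), len(s2)) + q - 1 - k * q
```

`count_filter("vacations", "vacatlions", k=1)` passes (the pair shares 9 q-grams), while a clearly dissimilar pair is rejected; the slide-23 pair `aabbzzaacczz` / `aacczzaabbzz` is a false positive that survives this filter.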

20
Using a DBMS for Q-gram Joins
  • If we have the q-grams in the DBMS, we can
    perform this counting efficiently.
  • Create auxiliary tables with tuples of the form
  • <sid, strlen, qgram>
  • and join these tables
  • A GROUP BY ... HAVING COUNT(...) clause can
    perform the counting / filtering

21
Eliminating Candidate Pairs COUNT FILTERING
  • SQL for this filter (parts omitted for clarity):

        SELECT   R.sid, S.sid
        FROM     R, S
        WHERE    R.qgram = S.qgram
        GROUP BY R.sid, S.sid
        HAVING   COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

  • The result is the set of string pairs with enough
    common q-grams to ensure that we will not have
    false negatives.

22
Eliminating Candidate Pairs Further: LENGTH FILTERING
  • Strings with length difference larger than K
    cannot be within Edit Distance K

        SELECT   R.sid, S.sid
        FROM     R, S
        WHERE    R.qgram = S.qgram
          AND    abs(R.strlen - S.strlen) <= K
        GROUP BY R.sid, S.sid
        HAVING   COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

  • We refer to this filter as LENGTH FILTERING

23
Exploiting Q-gram Positions for Filtering
  • Consider strings aabbzzaacczz and aacczzaabbzz
  • The strings are at edit distance 4
  • Yet they have identical q-grams for q = 3
  • Problem: matching q-grams that are at different
    positions in the two strings
  • Either the q-grams do not "originate" from the
    same q-gram, or
  • too many edit operations "caused" spurious
    q-grams at various parts of the strings to match

24
POSITION FILTERING - Filtering using positions
  • Keep the position of the q-grams: <sid, strlen,
    pos, qgram>
  • Do not match q-grams that are more than K
    positions apart

        SELECT   R.sid, S.sid
        FROM     R, S
        WHERE    R.qgram = S.qgram
          AND    abs(R.strlen - S.strlen) <= K
          AND    abs(R.pos - S.pos) <= K
        GROUP BY R.sid, S.sid
        HAVING   COUNT(*) >= (max(R.strlen, S.strlen) + q - 1) - K*q

  • We refer to this filter as POSITION FILTERING

25
The Actual, Complete SQL Statement
        SELECT   R1.string, S1.string, R1.sid, S1.sid
        FROM     R1, S1, R, S
        WHERE    R1.sid = R.sid
          AND    S1.sid = S.sid
          AND    R.qgram = S.qgram
          AND    abs(strlen(R1.string) - strlen(S1.string)) <= K
          AND    abs(R.pos - S.pos) <= K
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING   COUNT(*) >=
                 (max(strlen(R1.string), strlen(S1.string)) + q - 1) - K*q
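As a sketch of the statement above, the whole pipeline (auxiliary q-gram tables plus the filtering query) can be run on an in-memory SQLite database, whose two-argument max() is the scalar max the HAVING clause needs. Table and column names follow the slides; the output is only a candidate set, which would still be verified with an edit-distance UDF:

```python
import sqlite3

def qgrams(s, q=3):
    # (pos, qgram) pairs over the '#'-padded string, as in the slides
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return [(i, padded[i:i + q]) for i in range(len(padded) - q + 1)]

def qgram_join(r_strings, s_strings, q=3, k=1):
    """Candidate pairs via the slides' SQL (count + length + position
    filters) on an in-memory SQLite database."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE R1(sid INTEGER, string TEXT);
        CREATE TABLE S1(sid INTEGER, string TEXT);
        CREATE TABLE R(sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
        CREATE TABLE S(sid INTEGER, strlen INTEGER, pos INTEGER, qgram TEXT);
    """)
    for base, aux, strings in (("R1", "R", r_strings), ("S1", "S", s_strings)):
        for sid, s in enumerate(strings):
            db.execute(f"INSERT INTO {base} VALUES (?, ?)", (sid, s))
            db.executemany(f"INSERT INTO {aux} VALUES (?, ?, ?, ?)",
                           [(sid, len(s), pos, g) for pos, g in qgrams(s, q)])
    return db.execute("""
        SELECT   R1.string, S1.string
        FROM     R1, S1, R, S
        WHERE    R1.sid = R.sid
          AND    S1.sid = S.sid
          AND    R.qgram = S.qgram
          AND    abs(length(R1.string) - length(S1.string)) <= ?
          AND    abs(R.pos - S.pos) <= ?
        GROUP BY R1.sid, S1.sid, R1.string, S1.string
        HAVING   COUNT(*) >=
                 max(length(R1.string), length(S1.string)) + ? - 1 - ? * ?
    """, (k, k, q, k, q)).fetchall()
```

For example, `qgram_join(["vacations"], ["vacatlions", "summer"], q=3, k=1)` keeps only the `vacations`/`vacatlions` candidate; `summer` is eliminated by the length filter alone.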

26
Summary of 1st paper
  • Introduced a technique for mapping approximate
    string joins into a vanilla SQL expression
  • Our technique does not require modifying the
    underlying RDBMS

27
Outline
  • Supporting string-similarity joins using RDBMS
  • Using mapping techniques
  • Interactive deduplication

28
Single-attribute Case
  • Given
  • two sets of strings, R and S
  • a similarity function f between strings (a metric
    space)
  • Reflexive: f(s1,s2) = 0 iff s1 = s2
  • Symmetric: f(s1,s2) = f(s2,s1)
  • Triangle inequality: f(s1,s2) + f(s2,s3) >= f(s1,s3)
  • a threshold k
  • Find all pairs of strings (r, s) from R and S
    such that f(r,s) <= k

29
Nested-loop?
  • Not desirable for large data sets
  • 5 hours for 30K strings!

30
Our 2-step approach
  • Step 1: map strings (in a metric space) to
    objects in a Euclidean space
  • Step 2: do a similarity join in the Euclidean
    space

31
Advantages
  • Applicable to many metric similarity functions
  • Use edit distance as an example
  • Other similarity functions also tried, e.g.,
    q-gram-based similarity
  • Open to existing algorithms
  • Mapping techniques
  • Join techniques

32
Step 1
  • Map strings into a high-dimensional Euclidean
    space

[Slide graphic: strings in a metric space mapped to
points in a Euclidean space]
33
Mapping StringMap
  • Input A list of strings
  • Output Points in a high-dimensional Euclidean
    space that preserve the original distances well
  • A variation of FastMap
  • Each step greedily picks two strings (pivots) to
    form an axis
  • All axes are orthogonal
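The pivot-based projection behind StringMap/FastMap can be sketched for a single axis; this uses only pairwise distances (the cosine-law formula), while the real algorithm additionally picks far-apart pivots greedily and recurses on residual distances to build all d orthogonal axes:

```python
from functools import lru_cache

def ed(s1: str, s2: str) -> int:
    """Plain edit distance, used here as the metric-space distance."""
    @lru_cache(maxsize=None)
    def d(i: int, j: int) -> int:
        if i == 0:
            return j
        if j == 0:
            return i
        return min(d(i - 1, j) + 1,
                   d(i, j - 1) + 1,
                   d(i - 1, j - 1) + (s1[i - 1] != s2[j - 1]))
    return d(len(s1), len(s2))

def fastmap_axis(strings, dist, a, b):
    """One StringMap/FastMap axis: pivot strings a and b define a line,
    and each string is projected onto it via the cosine law, using only
    pairwise distances."""
    d_ab = dist(a, b)
    return [(dist(s, a) ** 2 + d_ab ** 2 - dist(s, b) ** 2) / (2 * d_ab)
            for s in strings]
```

By construction, pivot a lands at coordinate 0 and pivot b at coordinate d(a, b); similar strings land close to each other on each axis.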

34
Can it preserve distances?
  • Data sources
  • IMDB star names: 54,000
  • German names: 132,000
  • [Chart: distribution of string lengths]

35
Can it preserve distances?
  • Use data set 1 (54K names) as an example
  • k = 2, d = 20
  • Use k' = 5.2 to differentiate similar and
    dissimilar pairs

36
Choose Dimensionality d
  • Increase d?
  • Good:
  • easier to differentiate similar pairs from
    dissimilar ones
  • Bad:
  • Step 1 becomes less efficient
  • Step 2 suffers the curse of dimensionality

37
Choose dimensionality d using sampling
  • Sample 1K x 1K strings; find their similar pairs
    (within distance k)
  • Calculate the maximum w of their new distances
  • Define a cost of finding a similar pair

38
Choose Dimensionality d
[Chart: cost as a function of d; d = 15 to 25]
39
Choose new threshold k'
  • Closely related to the mapping property
  • Ideally, if ed(r,s) <= k, the Euclidean distance
    between the two corresponding points should be <= k'
  • Choose k' using sampling
  • Sample 1K x 1K strings; find similar pairs
  • Calculate their maximum new distance as k'
  • Repeat multiple times; choose the maximum

40
New threshold k' in step 2
[Chart: new threshold k' over repeated samples, d = 20]
41
Step 2: Similarity Join
  • Input: two sets of points in Euclidean space
  • Output: pairs of points whose distance is less
    than the new threshold k'
  • Many join algorithms can be used

42
Example
  • Adopted an algorithm by Hjaltason and Samet
  • Build two R-trees
  • Traverse the two trees, finding pairs of points
    whose distance is within k'
  • Prune during traversal (e.g., using MinDist)

43
Final processing
  • Among the pairs produced from the similarity-join
    step, check their edit distance.
  • Return those pairs satisfying the threshold k

44
Running time
45
Recall
  • Recall = (# of found similar pairs) / (# of all
    similar pairs)

46
Multi-attribute linkage
  • Example: (title, name, year)
  • Different attributes have different similarity
    functions and thresholds
  • Consider merge rules in disjunctive format

47
Evaluation strategies
  • Many ways to evaluate rules
  • Finding an optimal one is NP-hard
  • Heuristics
  • Treat different conjuncts independently. Pick the
    most efficient attribute in each conjunct.
  • Choose the largest threshold for each attribute.
    Then choose the most efficient attribute among
    these thresholds.

48
Summary of 2nd paper
  • A novel two-step approach to record linkage.
  • Many existing mapping and join algorithms can be
    adopted
  • Applicable to many distance metrics.
  • Time and space efficient.
  • Multi-attribute case studied

49
Outline
  • Supporting string-similarity joins using RDBMS
  • Using mapping techniques
  • Interactive deduplication

50
Problems with Existing Deduplication Methods
  • Matching functions
  • Calculate similarity scores and compare them to
    thresholds
  • Tedious coding
  • Learning-based methods
  • Require a large training set for accuracy
  • (a static training set)
  • It is difficult to provide a covering and
    challenging training set that brings out the
    subtlety of the deduplication function

51
New Approach
  • Relegate the task of finding the deduplication
    function to a machine learning algorithm
  • Design goals
  • Fewer training instances
  • Interactive response
  • Fast convergence
  • High accuracy
  • → Design an interactive deduplication system
    called ALIAS

52
ALIAS System
  • A learning-based method
  • Exploits existing similarity functions
  • Uses active learning:
  • an active learner actively picks the unlabeled
    instances with the most information gain for the
    training set
  • Produces a deduplication function that can
    identify duplicates

53
Overall Architecture
[Architecture diagram: labeled pairs L and the unlabeled
database D are run through a Mapper, using the similarity
functions F (backed by similarity indices), to produce
mapped sets Lp and Dp. The learner trains classifiers on
training data T, selects n instances S from Dp for
labeling, folds the user's feedback back into T, and
emits a predicate for the uncertain region.]
54
Primary Inputs for ALIAS
  • A set of initial training pairs (L)
  • fewer than 10 labeled records
  • arranged in pairs of duplicates and
    non-duplicates
  • A set of similarity functions (F)
  • e.g., word-match, q-gram-match
  • They compute similarity scores between two
    records based on any subset of attributes
  • The learner will find the right way of combining
    those scores to get the final deduplication
    function
  • A database of unlabeled records (D)
  • Number of classifiers (< 5)
55
Mapped Labeled Instances
  • Take r1, r2 from input L
  • Record r1 = (a1, a2, a3)
  • Record r2 = (a1, a2, a3)
  • Use similarity functions f1, f2, ..., fn to
    compute similarity scores s1, ..., sn between r1
    and r2
  • New record for the pair:
  • r1r2 = (s1, s2, ..., sn, y/n)
  • y: duplicate
  • n: non-duplicate
  • Put the new record in Lp
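The mapping step described above can be sketched as follows; `word_match` is a hypothetical Jaccard-style similarity function standing in for the slides' f1, ..., fn:

```python
def word_match(r1: str, r2: str) -> float:
    """Hypothetical similarity function: Jaccard overlap of word sets."""
    w1, w2 = set(r1.lower().split()), set(r2.lower().split())
    return len(w1 & w2) / max(len(w1 | w2), 1)

def map_pair(r1, r2, fns, label=None):
    """Map a record pair to one instance: a vector of similarity scores
    (one per function in fns), plus the y/n label when the pair is
    labeled; unlabeled pairs (for Dp) get no label field."""
    scores = tuple(f(r1, r2) for f in fns)
    return scores + (label,) if label is not None else scores
```

For instance, `map_pair("John Smith", "Jonh Smith", [word_match], label="n")` yields a one-score vector plus the label, ready to go into Lp; omitting `label` yields a Dp instance.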
56
Mapped Unlabeled Instances
  • Take r1, r2 from D x D
  • Record r1 = (a1, a2, a3)
  • Record r2 = (a1, a2, a3)
  • Use similarity functions f1, f2, ..., fn to
    compute similarity scores between r1 and r2
  • New record for the pair:
  • r1r2 = (s1, s2, ..., sn)
  • No y/n field
  • Put the new record in Dp
57
Active Learner
[Architecture diagram repeated, highlighting the
active-learning loop: the learner trains classifiers on
training data T and selects a set S of n instances from
Dp for the user to label.]
58
ALIAS Algorithm
  • 1. Input: L, D, F.
  • 2. Create pairs Lp from the labeled data L and F.
  • 3. Create pairs Dp from the unlabeled data D and
    F.
  • 4. Initial training set T = Lp.
  • 5. Loop until user satisfaction:
  • Train classifier C using T.
  • Use C to select a set S of n instances from Dp
    for labeling.
  • If S is empty, exit loop.
  • Collect user feedback on the labels of S.
  • Add S to T and remove S from Dp.
  • 6. Output classifier C.
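The loop above can be sketched with caller-supplied stand-ins for the classifier trainer, the selection policy, and the user's labeling — all hypothetical placeholders, not ALIAS's actual components:

```python
def alias(Lp, Dp, train, select, get_labels, max_rounds=20):
    """Skeleton of the ALIAS loop: train on T, pick the most informative
    unlabeled instances, ask the user for labels, fold answers into T.
    `train(T)` returns a classifier; `select(clf, pool)` returns the n
    instances to label (empty list stops the loop); `get_labels(S)`
    returns labeled versions of S (the user's feedback)."""
    T, pool = list(Lp), list(Dp)
    classifier = None
    for _ in range(max_rounds):
        classifier = train(T)                     # step 5a
        S = select(classifier, pool)              # step 5b
        if not S:                                 # step 5c
            break
        T.extend(get_labels(S))                   # steps 5d-5e
        pool = [x for x in pool if x not in S]    # remove S from Dp
    return classifier
```

A toy run with a threshold "classifier" over one similarity score converges in a few rounds, labeling only the instances closest to the current decision boundary.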

59
The Indexing Component
  • Purpose
  • Avoid mapping all pairs of records in D x D
  • 3 Methods
  • Grouping
  • Sampling
  • Indexing

60
The Indexing Component
  • Grouping
  • Example: group records in D according to the
    field "year of publication"
  • Mapped pairs are formed only within the records
    of a group.
  • Sampling
  • Sample in units of a group instead of individual
    records.

61
The Indexing Component
  • Indexing
  • Given a similarity function such as "fraction of
    common words between two text attributes > 0.4",
  • we can create an index on the words of the text
    attributes

62
The learning component
  • Contains a number of classifiers
  • A classifier: a machine learning algorithm, such
    as a decision tree (D-tree), naïve Bayes (NB), or
    a Support Vector Machine (SVM), used to classify
    instances
  • A classifier is trained using a training data set

63
Criteria for a classifier
  • Accuracy
  • Interpretability
  • Indexability
  • Efficient training

64
Accuracy of a Classifier
  • Measured by the mean F of recall r and precision
    p: F = 2rp / (r + p)
  • r: fraction of true duplicates correctly
    classified as duplicates
  • p: fraction correct among all instances actually
    labeled duplicate

65
Accuracy (cont.)
  • Example
  • A case with 1% duplicates: a classifier labels
    all pairs as non-duplicates
  • recall r = 0 → mean F = 0 → accuracy 0
  • A case with 1% duplicates: a classifier
    identifies all duplicates correctly but
    misclassifies 1% of the non-duplicates
  • recall r = 1, p ≈ 0.5 → F ≈ 0.667 → accuracy
    66.7%
  • → If we don't use r and p, both classifiers have
    99% accuracy!
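The mean F here is the usual harmonic mean of recall and precision; a small helper reproduces the slide's numbers:

```python
def f_measure(r: float, p: float) -> float:
    """Harmonic mean of recall r and precision p (0 if both are 0)."""
    return 2 * r * p / (r + p) if r + p > 0 else 0.0
```

`f_measure(0, 1)` is 0 for the label-everything-non-duplicate classifier, while `f_measure(1.0, 0.5)` gives the 0.667 on this slide.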

66
Criteria for a classifier (cont.)
  • Interpretability
  • Final deduplication rule is easy to understand
    and interpret
  • Indexability
  • Final deduplication rule has indexable predicates
  • Efficient training
  • Fast to train

67
Comparison of Different Classifiers
68
Active Learning
The goal is to seek out from the unlabeled pool the
instances which, when labeled, will strengthen the
classifier at the fastest possible rate.
69
A simple Example
[Slide graphic: labeled points a and b on a line, with
the most uncertain instance m midway between them]
  • Assume all points are unlabeled except a and b
  • Suppose a has coordinate 0 and b has coordinate 1
  • Any unlabeled point x to the left of a or to the
    right of b will have no effect in reducing the
    region of uncertainty
  • By including m in the training set, the size of
    the uncertain region will be reduced by half

70
How to select an unlabeled instance
  • Uncertainty
  • The instance about which the learner is most
    unsure is also the instance for which the
    expected reduction in confusion is the largest
  • Uncertainty score:
  • the disagreement among the predictions the
    instance gets from a committee of N classifiers
  • A sure duplicate or non-duplicate would get the
    same prediction from all members
  • Representativeness
  • An uncertain instance representing a larger
    number of unlabeled instances has a greater
    impact on the classifier
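One common way to turn committee disagreement into an uncertainty score is vote entropy (an illustrative choice; the paper's exact formulation may differ):

```python
from math import log

def uncertainty(predictions):
    """Vote entropy of a committee's predictions for one instance:
    0 when all members agree, maximal on an even split."""
    n = len(predictions)
    ent = 0.0
    for label in set(predictions):
        p = predictions.count(label) / n
        ent -= p * log(p)
    return ent
```

A sure duplicate (all members predict "duplicate") scores 0, while a split committee scores highest, so the split instances are the ones selected for labeling.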

71
Example
  • 3 similarity functions: word match (f1), q-gram
    match (f2), string edit distance (f3)
  • Take from the mapped unlabeled instances (Dp):
  • r1r2 = (s1, s2, ..., sn)
  • r3r4 = (s1, s2, ..., sn)
  • where s1, s2, ..., sn are scores from f1, f2, f3
  • 3 classifiers: D-tree, naïve Bayes, SVM

          D-tree      Naïve Bayes     SVM
  r1r2    duplicate   non-duplicate   duplicate   <- selected
  r3r4    duplicate   duplicate       duplicate

  • r1r2 is selected because the committee disagrees
    on it

72
How to Combine Uncertainty and Representativeness
  • Two proposed approaches
  • 1st approach: weighted sum
  • Cluster the unlabeled instances
  • Estimate the density of points around each
    instance
  • Score each instance by a weighted sum of its
    density and uncertainty value
  • Select the n highest-scoring instances
  • 2nd approach: sampling

73
Conclusion
  • ALIAS
  • Makes deduplication much easier (fewer training
    instances)
  • Provides interactive responses to the user
  • High accuracy