Duplicate Detection

Slides: 10

Provided by: Life70

1
Duplicate Detection
2
Exercise 1. Use an Extended Key to do Entity Identification [1]
3

Tables R and S are shown below.

Table R

Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

Table S

Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44
4

Suppose the extended key is (Name, City, HomeAddress) and the following instance-level functional dependencies (ILFDs) hold:
(E.HomeAddress = 'Myskviksvägen 8') -> (E.City = 'INGARÖ')
(E.HomeAddress = 'Myrvägen 2') -> (E.City = 'INGARÖ')
(E.HomeAddress = 'Pilgatan 9') -> (E.City = 'STOCKHOLM')
(E.HomeAddress = 'Nyängsvägen 39A') -> (E.City = 'TULLINGE')
Please construct the integrated table.
-----------------------------------------------------
[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, "Entity Identification in Database Integration", Proceedings of the Ninth International Conference on Data Engineering, pp. 294-301, April 19-23, 1993.

5
Answer to Exercise 1

Integrated Table

Name             City       ZIP    PersonNr     HomeAddress      Telephone
Eva Aadde        INGARÖ     13469  840126-1223  Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  INGARÖ     13463  861227-1227  Myrvägen 2       08-570 290 91
Eva Abrahamsson  STOCKHOLM  NULL   NULL         Pilgatan 9       08-642 61 79
Eva Abrahamsson  TULLINGE   NULL   NULL         Nyängsvägen 39A  08-530 356 44
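
The construction of the integrated table can be sketched in Python. This is only an illustrative sketch, not the procedure from [1]: the names `R`, `S`, `ILFD_CITY`, and `integrate` are hypothetical, the ILFDs are encoded as a simple HomeAddress-to-City lookup, and `None` stands in for NULL. The sketch walks over the tuples of S, derives each City through the ILFDs, and then looks for an R tuple with the same (Name, City). Because it is driven by S, an R tuple with no counterpart in S (Eva Aalto) produces no row, which mirrors the answer slide.

```python
# Table R and Table S from the exercise, as lists of dicts.
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "13469", "PersonNr": "840126-1223"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "14564", "PersonNr": "851201-1225"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "13463", "PersonNr": "861227-1227"},
]
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "08-571 480 27"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "08-570 290 91"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "08-642 61 79"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "08-530 356 44"},
]

# The four ILFDs, encoded (as an assumption) as HomeAddress -> City.
ILFD_CITY = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}


def integrate(r_rows, s_rows, ilfd_city):
    """Extend each S tuple with its derived City, then match R on (Name, City)."""
    r_index = {(r["Name"], r["City"]): r for r in r_rows}
    integrated = []
    for s in s_rows:
        city = ilfd_city.get(s["HomeAddress"])      # City derived via an ILFD
        match = r_index.get((s["Name"], city), {})  # empty dict when R has no match
        integrated.append({
            "Name": s["Name"],
            "City": city,
            "ZIP": match.get("ZIP"),            # None plays the role of NULL
            "PersonNr": match.get("PersonNr"),
            "HomeAddress": s["HomeAddress"],
            "Telephone": s["Telephone"],
        })
    return integrated


integrated = integrate(R, S, ILFD_CITY)
for row in integrated:
    print(row)
```

Name alone cannot tell the three Eva Abrahamsson tuples of S apart; only after the ILFDs supply a City does exactly one of them line up with the R tuple in INGARÖ.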
6
Exercise 2. Use a Priority Queue to do Duplicate Detection [2]
7

Given the conditions below, use the Priority Queue algorithm to find the duplicate clusters in Table R.

Table R, which is already sorted according to an application-specific key

Tuple
T1
T2
T3
T4
T5
T6
T7

Similarities between tuples

     T1   T2   T3   T4   T5   T6   T7
T1   -    0.6  0.1  0.3  0.5  0.1  0.2
T2   0.6  -    0.2  0.4  0.4  0.4  0.2
T3   0.1  0.2  -    0.9  0.4  0.6  0.5
T4   0.3  0.4  0.9  -    0.4  0.6  0.6
T5   0.5  0.4  0.4  0.4  -    0.4  0.8
T6   0.1  0.4  0.6  0.6  0.4  -    0.4
T7   0.2  0.2  0.5  0.6  0.8  0.4  -
8

Method for computing the matching score
Given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
Condition to declare a new cluster: matching score < 0.5
Condition to declare a representative: 0.5 < matching score < 0.8
Size of the priority queue: 2
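
One way to work through the exercise is the Python sketch below. It is not the implementation from [2]; it only follows the rules stated on this slide, plus three assumptions the slide does not spell out: clusters in the queue are checked from head to tail, a tuple joins the first cluster whose matching score is at least 0.5 and that cluster moves to the head, and a new cluster is inserted at the head while the cluster at the tail is dropped once the queue exceeds its size. The names `SIM`, `sim`, `matching_score`, and `priority_queue_clusters` are hypothetical.

```python
# Pairwise similarities from the slide (symmetric, diagonal omitted).
SIM = {
    ("T1", "T2"): 0.6, ("T1", "T3"): 0.1, ("T1", "T4"): 0.3,
    ("T1", "T5"): 0.5, ("T1", "T6"): 0.1, ("T1", "T7"): 0.2,
    ("T2", "T3"): 0.2, ("T2", "T4"): 0.4, ("T2", "T5"): 0.4,
    ("T2", "T6"): 0.4, ("T2", "T7"): 0.2,
    ("T3", "T4"): 0.9, ("T3", "T5"): 0.4, ("T3", "T6"): 0.6, ("T3", "T7"): 0.5,
    ("T4", "T5"): 0.4, ("T4", "T6"): 0.6, ("T4", "T7"): 0.6,
    ("T5", "T6"): 0.4, ("T5", "T7"): 0.8,
    ("T6", "T7"): 0.4,
}


def sim(a, b):
    """Look up the similarity of two tuples in either order."""
    return SIM[(a, b)] if (a, b) in SIM else SIM[(b, a)]


def matching_score(t, representatives):
    """Average similarity of tuple t with all representatives of a cluster."""
    return sum(sim(t, r) for r in representatives) / len(representatives)


def priority_queue_clusters(tuples, queue_size=2, new_below=0.5, rep_below=0.8):
    queue = []     # head = index 0; holds at most queue_size clusters
    clusters = []  # every cluster ever created, in creation order
    for t in tuples:
        for i, cluster in enumerate(queue):
            score = matching_score(t, cluster["reps"])
            if score >= new_below:
                cluster["members"].append(t)
                if new_below < score < rep_below:
                    cluster["reps"].append(t)  # t also becomes a representative
                queue.insert(0, queue.pop(i))  # assumed: matched cluster moves to head
                break
        else:
            # No cluster in the queue matched: declare a new cluster.
            cluster = {"members": [t], "reps": [t]}
            clusters.append(cluster)
            queue.insert(0, cluster)
            del queue[queue_size:]             # assumed: drop clusters at the tail
    return [c["members"] for c in clusters]


result = priority_queue_clusters(["T1", "T2", "T3", "T4", "T5", "T6", "T7"])
print(result)
```

Under these assumptions the sketch yields the clusters {T1, T2}, {T3, T4, T6}, and {T5, T7}. The eviction rule matters: when T5 opens a third cluster, the cluster {T1, T2} sits at the tail and is dropped from the queue, so later tuples are compared only against the clusters of T3 and T5.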
-----------------------------------------------------
[2] A. E. Monge and C. P. Elkan, "An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records", Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997.