Title: Duplicate Detection
Exercise 1. Use an Extended Key for Entity Identification [1]
- Tables R and S are shown below.
- Table R

Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

- Table S

Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44
- Suppose the extended key is {Name, City, HomeAddress} and the following ILFDs (instance-level functional dependencies) hold:
  - (E.HomeAddress = Myskviksvägen 8) -> (E.City = INGARÖ)
  - (E.HomeAddress = Myrvägen 2) -> (E.City = INGARÖ)
  - (E.HomeAddress = Pilgatan 9) -> (E.City = STOCKHOLM)
  - (E.HomeAddress = Nyängsvägen 39A) -> (E.City = TULLINGE)
- Please construct the integrated table.
--------------------------------------------------
[1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson, Entity Identification in Database Integration, Proceedings of the Ninth International Conference on Data Engineering, p. 294-301, April 19-23, 1993
Answer to Exercise 1

Name             City       ZIP    PersonNr     HomeAddress      Telephone
Eva Aadde        INGARÖ     13469  840126-1223  Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  INGARÖ     13463  861227-1227  Myrvägen 2       08-570 290 91
Eva Abrahamsson  STOCKHOLM  NULL   NULL         Pilgatan 9       08-642 61 79
Eva Abrahamsson  TULLINGE   NULL   NULL         Nyängsvägen 39A  08-530 356 44
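
The integration step can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure from [1]: the ILFDs are applied to each tuple of S to derive its City, and the extended tuple is then matched against R on (Name, City). The names `ILFD_CITY` and `integrate` are hypothetical, and, like the answer table, the sketch emits one row per tuple of S and leaves out tuples of R that have no match.

```python
# Table R: Name, City, ZIP, PersonNr (values taken from the exercise).
R = [
    {"Name": "Eva Aadde", "City": "INGARÖ", "ZIP": "13469", "PersonNr": "840126-1223"},
    {"Name": "Eva Aalto", "City": "Norsborg", "ZIP": "14564", "PersonNr": "851201-1225"},
    {"Name": "Eva Abrahamsson", "City": "INGARÖ", "ZIP": "13463", "PersonNr": "861227-1227"},
]

# Table S: Name, HomeAddress, Telephone.
S = [
    {"Name": "Eva Aadde", "HomeAddress": "Myskviksvägen 8", "Telephone": "08-571 480 27"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Myrvägen 2", "Telephone": "08-570 290 91"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Pilgatan 9", "Telephone": "08-642 61 79"},
    {"Name": "Eva Abrahamsson", "HomeAddress": "Nyängsvägen 39A", "Telephone": "08-530 356 44"},
]

# The four ILFDs of the exercise, written as HomeAddress -> City (hypothetical name).
ILFD_CITY = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}


def integrate(r_rows, s_rows, ilfd_city):
    """Derive City for each S tuple via the ILFDs, then match R on (Name, City)."""
    r_by_key = {(r["Name"], r["City"]): r for r in r_rows}
    result = []
    for s in s_rows:
        city = ilfd_city.get(s["HomeAddress"])  # None if no ILFD applies
        match = r_by_key.get((s["Name"], city), {})  # empty dict if R has no match
        result.append({
            "Name": s["Name"],
            "City": city,
            "ZIP": match.get("ZIP"),            # None plays the role of NULL
            "PersonNr": match.get("PersonNr"),
            "HomeAddress": s["HomeAddress"],
            "Telephone": s["Telephone"],
        })
    return result


for row in integrate(R, S, ILFD_CITY):
    print(row)
```

Name alone cannot separate the three Eva Abrahamsson tuples in S; only the City derived from HomeAddress tells which of them corresponds to the tuple in R.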
Exercise 2. Use a Priority Queue for Duplicate Detection [2]
- Given the conditions below, please use the priority queue algorithm to find the duplicate clusters.
- Table R, which is already sorted according to an application-specific key

Tuple
T1
T2
T3
T4
T5
T6
T7

- Similarities between tuples

      T1   T2   T3   T4   T5   T6   T7
T1    -    0.6  0.1  0.3  0.5  0.1  0.2
T2    0.6  -    0.2  0.4  0.4  0.4  0.2
T3    0.1  0.2  -    0.9  0.4  0.6  0.5
T4    0.3  0.4  0.9  -    0.4  0.6  0.6
T5    0.5  0.4  0.4  0.4  -    0.4  0.8
T6    0.1  0.4  0.6  0.6  0.4  -    0.4
T7    0.2  0.2  0.5  0.6  0.8  0.4  -
- Method for computing the matching score
  - Given a cluster, the matching score of a tuple is the average of the tuple's similarities with all of the cluster's representatives.
- Condition for declaring a new cluster
  - matching score < 0.5 (for every cluster in the queue)
- Condition for declaring a representative
  - 0.5 < matching score < 0.8
- Size of the priority queue
  - 2
--------------------------------------------------
[2] A.E. Monge and C.P. Elkan, An Efficient Domain-Independent Algorithm for Detecting Approximately Duplicate Database Records, Proc. ACM-SIGMOD Workshop Research Issues on Knowledge Discovery and Data Mining, 1997
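
The matching-score rule and the two thresholds can be written down as a short Python sketch. The function names `matching_score` and `classify` are hypothetical, and treating a score of exactly 0.5, or of 0.8 and above, as "member only" is an assumption read off the strict inequalities in the exercise.

```python
def matching_score(tuple_id, representatives, sim):
    """Average similarity between one tuple and all representatives of a cluster."""
    return sum(sim[tuple_id][rep] for rep in representatives) / len(representatives)


def classify(score, new_cluster_below=0.5, representative_below=0.8):
    """Apply the exercise thresholds to a matching score."""
    if score < new_cluster_below:
        return "new cluster"                # no match with this cluster
    if new_cluster_below < score < representative_below:
        return "member and representative"  # joins and also represents the cluster
    return "member only"                    # joins, but adds nothing new as a representative


# A few similarities from the exercise matrix, indexed as SIM[tuple][representative].
SIM = {
    "T3": {"T1": 0.1, "T2": 0.2},
    "T4": {"T3": 0.9},
    "T6": {"T3": 0.6},
}

for tuple_id, reps in [("T3", ["T1", "T2"]), ("T4", ["T3"]), ("T6", ["T3"])]:
    score = matching_score(tuple_id, reps, SIM)
    print(tuple_id, reps, round(score, 2), classify(score))
```

A tuple whose score reaches 0.8 or more is so close to the existing representatives that it joins the cluster without becoming a representative itself.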
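Answer to Exercise 2
- Record 1
  - Queue: {1}
- Record 2
  - 2-1: 0.6, which is > 0.5 and < 0.8, so record 2 joins the cluster as a representative
  - Queue: {1, 2}
- Record 3
  - 3-1: 0.1, 3-2: 0.2; over the representatives, (0.1 + 0.2) / 2 = 0.15 < 0.5
  - Queue: {3}, {1, 2}
- Record 4
  - 4-1: 0.3, 4-2: 0.4; over the representatives, (0.3 + 0.4) / 2 = 0.35 < 0.5
  - 4-3: 0.9, which is > 0.5 and > 0.8, so record 4 joins the cluster but is not a representative
  - Queue: {3, 4}, {1, 2}
- Record 5
  - 5-1: 0.5, 5-2: 0.4; over the representatives, (0.5 + 0.4) / 2 = 0.45 < 0.5
  - 5-3: 0.4; over the single representative, 0.4 < 0.5
  - Queue: {5}, {3, 4}, {1, 2} (with a queue size of 2, {1, 2} is no longer compared against)
- Record 6
  - 6-3: 0.6; over the single representative, 0.6 is > 0.5 and < 0.8
  - 6-5: 0.4 < 0.5
  - Queue: {3, 4, 6}, {5}, {1, 2}
- Record 7
  - 7-3: 0.5, 7-6: 0.4; over the representatives, (0.5 + 0.4) / 2 = 0.45 < 0.5
  - 7-5: 0.8 > 0.5
  - Queue: {5, 7}, {3, 4, 6}, {1, 2}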
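
The walk-through above can be replayed with a short Python sketch. It is a simplified illustration of the exercise rules, not the full algorithm of [2]: scanning the queue from its head, moving a matched cluster to the head, and dropping the last cluster once the queue exceeds its size are assumptions, and `detect_clusters` is a hypothetical name.

```python
# Similarity matrix of the exercise (T1..T7). The diagonal is not given
# in the exercise and is never read by the sketch.
SIM = [
    [None, 0.6, 0.1, 0.3, 0.5, 0.1, 0.2],
    [0.6, None, 0.2, 0.4, 0.4, 0.4, 0.2],
    [0.1, 0.2, None, 0.9, 0.4, 0.6, 0.5],
    [0.3, 0.4, 0.9, None, 0.4, 0.6, 0.6],
    [0.5, 0.4, 0.4, 0.4, None, 0.4, 0.8],
    [0.1, 0.4, 0.6, 0.6, 0.4, None, 0.4],
    [0.2, 0.2, 0.5, 0.6, 0.8, 0.4, None],
]


def detect_clusters(sim, queue_size=2, low=0.5, high=0.8):
    """Single pass over the sorted tuples with a bounded queue of clusters."""
    queue = []    # clusters still compared against, head first
    evicted = []  # clusters pushed out of the queue, kept for the final result
    for t in range(len(sim)):
        for i, cluster in enumerate(queue):
            reps = cluster["reps"]
            score = sum(sim[t][r] for r in reps) / len(reps)
            if score >= low:
                cluster["members"].append(t)
                if low < score < high:
                    reps.append(t)             # tuple also becomes a representative
                queue.insert(0, queue.pop(i))  # matched cluster moves to the head
                break
        else:
            # No cluster in the queue matched: start a new cluster at the head.
            queue.insert(0, {"members": [t], "reps": [t]})
            if len(queue) > queue_size:
                evicted.append(queue.pop())    # the last cluster leaves the queue
    # Report 1-based tuple numbers: queue first, then evicted clusters.
    return [[m + 1 for m in c["members"]] for c in queue + evicted]


print(detect_clusters(SIM))
```

With the exercise parameters the sketch reports the clusters {5, 7}, {3, 4, 6} and {1, 2}. Cluster {1, 2} leaves the queue when record 5 starts its own cluster, which is why records 6 and 7 are compared only with the two most recent clusters.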