Duplicate Detection - PowerPoint PPT Presentation

Provided by: Life70

Title: Duplicate Detection


1
Duplicate Detection
2
Exercise 1. Use an Extended Key to Do Entity Identification [1]
3
  • Tables R and S are shown below.

Table R
Name             City      ZIP    PersonNr
Eva Aadde        INGARÖ    13469  840126-1223
Eva Aalto        Norsborg  14564  851201-1225
Eva Abrahamsson  INGARÖ    13463  861227-1227

Table S
Name             HomeAddress      Telephone
Eva Aadde        Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  Myrvägen 2       08-570 290 91
Eva Abrahamsson  Pilgatan 9       08-642 61 79
Eva Abrahamsson  Nyängsvägen 39A  08-530 356 44
4
  • Suppose the extended key is (Name, City, HomeAddress) and the
    following ILFDs (instance-level functional dependencies) hold:
  • (E.HomeAddress = 'Myskviksvägen 8') -> (E.City = 'INGARÖ')
  • (E.HomeAddress = 'Myrvägen 2') -> (E.City = 'INGARÖ')
  • (E.HomeAddress = 'Pilgatan 9') -> (E.City = 'STOCKHOLM')
  • (E.HomeAddress = 'Nyängsvägen 39A') -> (E.City = 'TULLINGE')
  • Please construct the integrated table.
  • -----------------------------------------------------
  • [1] Ee-Peng Lim, Jaideep Srivastava, Satya Prabhakar, James Richardson,
    "Entity Identification in Database Integration," Proceedings of the
    Ninth International Conference on Data Engineering, pp. 294-301,
    April 19-23, 1993.

5
Answer to Exercise 1
  • Integrated Table

Name             City       ZIP    PersonNr     HomeAddress      Telephone
Eva Aadde        INGARÖ     13469  840126-1223  Myskviksvägen 8  08-571 480 27
Eva Abrahamsson  INGARÖ     13463  861227-1227  Myrvägen 2       08-570 290 91
Eva Abrahamsson  STOCKHOLM  NULL   NULL         Pilgatan 9       08-642 61 79
Eva Abrahamsson  TULLINGE   NULL   NULL         Nyängsvägen 39A  08-530 356 44
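
The construction of the integrated table can be sketched in Python. This is a minimal illustration, not the procedure from [1]: the names `ILFDS` and `integrate`, the tuple layout, and the choice to emit one output row per tuple of S (which is what the answer table lists) are assumptions made for this example.

```python
# Table R: (Name, City, ZIP, PersonNr)
R = [
    ("Eva Aadde", "INGARÖ", "13469", "840126-1223"),
    ("Eva Aalto", "Norsborg", "14564", "851201-1225"),
    ("Eva Abrahamsson", "INGARÖ", "13463", "861227-1227"),
]

# Table S: (Name, HomeAddress, Telephone)
S = [
    ("Eva Aadde", "Myskviksvägen 8", "08-571 480 27"),
    ("Eva Abrahamsson", "Myrvägen 2", "08-570 290 91"),
    ("Eva Abrahamsson", "Pilgatan 9", "08-642 61 79"),
    ("Eva Abrahamsson", "Nyängsvägen 39A", "08-530 356 44"),
]

# ILFDs from the exercise, stored as HomeAddress -> City (hypothetical encoding).
ILFDS = {
    "Myskviksvägen 8": "INGARÖ",
    "Myrvägen 2": "INGARÖ",
    "Pilgatan 9": "STOCKHOLM",
    "Nyängsvägen 39A": "TULLINGE",
}


def integrate(r_rows, s_rows, ilfds):
    """Extend each S tuple with a derived City, then match R on (Name, City)."""
    r_by_key = {
        (name, city): (zip_code, person_nr)
        for name, city, zip_code, person_nr in r_rows
    }
    integrated = []
    for name, address, phone in s_rows:
        city = ilfds.get(address)  # None when no ILFD covers this address
        # None stands in for NULL when no R tuple shares (Name, City).
        zip_code, person_nr = r_by_key.get((name, city), (None, None))
        integrated.append((name, city, zip_code, person_nr, address, phone))
    return integrated


for row in integrate(R, S, ILFDS):
    print(row)
```

Because the sketch walks over S only, Eva Aalto (present in R but not in S) produces no output row, which matches the answer table above; a full outer join would add her with NULL HomeAddress and Telephone.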
6
Exercise 2. Use a Priority Queue to Do Duplicate Detection [2]
7
  • Given the conditions below, use the Priority Queue algorithm to find
    the duplicate clusters.
  • Similarities between tuples:

     T1   T2   T3   T4   T5   T6   T7
T1   -    0.6  0.1  0.3  0.5  0.1  0.2
T2   0.6  -    0.2  0.4  0.4  0.4  0.2
T3   0.1  0.2  -    0.9  0.4  0.6  0.5
T4   0.3  0.4  0.9  -    0.4  0.6  0.6
T5   0.5  0.4  0.4  0.4  -    0.4  0.8
T6   0.1  0.4  0.6  0.6  0.4  -    0.4
T7   0.2  0.2  0.5  0.6  0.8  0.4  -

  • Table R, which is already sorted according to an application-specific
    key:

Tuple
T1
T2
T3
T4
T5
T6
T7
8
  • Method for computing the matching score
  • Given a cluster, the matching score of a tuple is the average of the
    tuple's similarities with all of the cluster's representatives.
  • Condition to declare a new cluster:
  • matching score < 0.5 (for every cluster in the queue)
  • Condition to declare a representative:
  • 0.5 < matching score < 0.8
  • Size of the priority queue:
  • 2
  • -----------------------------------------------------
  • [2] A.E. Monge and C.P. Elkan, "An Efficient Domain-Independent
    Algorithm for Detecting Approximately Duplicate Database Records,"
    Proc. ACM-SIGMOD Workshop on Research Issues on Knowledge Discovery
    and Data Mining, 1997.
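
The matching-score rule can be written out as a short Python sketch. The function names `matching_score` and `classify`, the nested-dictionary layout of `sim`, and the handling of a score of exactly 0.5 or 0.8 (treated here as a plain member, since the slide does not say) are assumptions for illustration.

```python
def matching_score(sim, tuple_id, representatives):
    """Average similarity between one tuple and a cluster's representatives."""
    return sum(sim[tuple_id][r] for r in representatives) / len(representatives)


def classify(score, low=0.5, high=0.8):
    """Decide what a score means for one cluster, using the slide's thresholds."""
    if score < low:
        return "no match"            # a new cluster is needed if no cluster matches
    if low < score < high:
        return "representative"      # joins the cluster and also represents it
    return "member"                  # joins the cluster only (assumed for 0.5 and >= 0.8)


# A few similarities taken from the exercise: sim[a][b] for tuples Ta, Tb.
sim = {2: {1: 0.6}, 3: {1: 0.1, 2: 0.2}, 4: {3: 0.9}}

print(classify(matching_score(sim, 2, [1])))     # T2 against cluster {T1}
print(classify(matching_score(sim, 3, [1, 2])))  # T3 against representatives T1, T2
print(classify(matching_score(sim, 4, [3])))     # T4 against cluster {T3}
```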

9
Answer
  • Record 1
  • Queue: {1}
  • Record 2
  • sim(2,1) = 0.6, which is > 0.5 and < 0.8
  • Queue: {1, 2}
  • Record 3
  • sim(3,1) = 0.1, sim(3,2) = 0.2; over the representatives,
    (0.1 + 0.2) / 2 = 0.15 < 0.5
  • Queue: {3}, {1, 2}
  • Record 4
  • sim(4,1) = 0.3, sim(4,2) = 0.4; over the representatives,
    (0.3 + 0.4) / 2 = 0.35 < 0.5
  • sim(4,3) = 0.9, which is > 0.5 and > 0.8
  • Queue: {3, 4}, {1, 2}
  • Record 5
  • sim(5,1) = 0.5, sim(5,2) = 0.4; over the representatives,
    (0.5 + 0.4) / 2 = 0.45 < 0.5
  • sim(5,3) = 0.4; over the representative, 0.4 < 0.5
  • Queue: {5}, {3, 4}, {1, 2}
  • Record 6
  • sim(6,3) = 0.6; over the representative, 0.6 is > 0.5 and < 0.8
  • sim(6,5) = 0.4 < 0.5
  • Queue: {3, 4, 6}, {5}, {1, 2}
  • Record 7
  • sim(7,3) = 0.5, sim(7,6) = 0.4; over the representatives,
    (0.5 + 0.4) / 2 = 0.45 < 0.5
  • sim(7,5) = 0.8 > 0.5
  • Queue: {5, 7}, {3, 4, 6}, {1, 2}
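
The walkthrough can be reproduced with a small Python sketch. It is a simplified reading of the exercise, not the full algorithm of [2]: the function `cluster_tuples`, the list-based queue, the rule that the most recently changed cluster moves to the front, and the rule that the last cluster leaves the queue once it holds more than 2 clusters are all assumptions. The diagonal of the matrix is set to 1.0 as a placeholder and is never read.

```python
# Similarity matrix from the exercise; SIM[i][j] is the similarity of T(i+1), T(j+1).
SIM = [
    [1.0, 0.6, 0.1, 0.3, 0.5, 0.1, 0.2],
    [0.6, 1.0, 0.2, 0.4, 0.4, 0.4, 0.2],
    [0.1, 0.2, 1.0, 0.9, 0.4, 0.6, 0.5],
    [0.3, 0.4, 0.9, 1.0, 0.4, 0.6, 0.6],
    [0.5, 0.4, 0.4, 0.4, 1.0, 0.4, 0.8],
    [0.1, 0.4, 0.6, 0.6, 0.4, 1.0, 0.4],
    [0.2, 0.2, 0.5, 0.6, 0.8, 0.4, 1.0],
]


def cluster_tuples(sim, queue_size=2, low=0.5, high=0.8):
    """Scan tuples in sorted order, comparing each only with queued clusters."""
    queue = []     # front of the list = highest priority
    finished = []  # clusters that have dropped out of the queue
    for t in range(1, len(sim) + 1):
        match, match_score = None, 0.0
        for c in queue:
            reps = c["reps"]
            score = sum(sim[t - 1][r - 1] for r in reps) / len(reps)
            if score >= low:  # first cluster that clears the threshold wins
                match, match_score = c, score
                break
        if match is None:
            # No queued cluster matched: start a new singleton cluster.
            queue.insert(0, {"members": [t], "reps": [t]})
            if len(queue) > queue_size:
                finished.append(queue.pop())
        else:
            match["members"].append(t)
            if match_score < high:
                match["reps"].append(t)  # similar, but different enough to represent
            queue.remove(match)
            queue.insert(0, match)
    return sorted(c["members"] for c in finished + queue)


print(cluster_tuples(SIM))
```

Under these assumptions the sketch returns the same three clusters as the walkthrough: {1, 2}, {3, 4, 6}, and {5, 7}. Cluster {1, 2} leaves the queue when record 5 opens a new cluster, which is consistent with records 6 and 7 being compared only against the clusters of 3 and 5.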