Relational Clustering for Entity Resolution Queries

1
Relational Clustering for Entity Resolution Queries
  • Indrajit Bhattacharya, Louis Licamele, and Lise Getoor
  • University of Maryland, College Park

2
The Entity Resolution Problem
Example papers and their author references:
  • P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
  • P2: A better mouse immunity model, W.Wang, A.Ansari
  • P3: Measuring protein-bound fluxetine, L.Li, C.Chen, W.Wang
  • P4: Autoimmunity in biliary cirrhosis, W.W.Wang, A.Ansari
Underlying author entities: Abdulla Ansari, WeiWei Wang, Chih Chen,
Wenyi Wang, Liyuan Li, Chien-Te Chen
  • Discover the domain entities
  • Map each reference to an entity

3
Query-time ER Motivation
  • Most publicly available databases do not have
    resolved entities
  • PubMed, CiteSeer have many unresolved authors
  • Millions of queries every day require resolved
    entities directly or indirectly
  • Example: "I am looking for all papers by Stuart Russell"
  • How do we address this problem?
  • Leave the burden on the user to do the resolution
  • Ask owners to clean their databases
  • Develop techniques for query-time resolution

4
Entity Resolution Queries
  • Disambiguation Query
  • Among all papers with W Wang as author, find
    those written by WeiWei Wang

Papers shown:
  • P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
  • P2: A better mouse immunity model, W.Wang, A.Ansari
  • P3: Measuring protein-bound fluxetine, L.Li, C.Chen, W.Wang

  • Resolution Query
  • Do disambiguation
  • Also retrieve papers by WeiWei Wang under a
    different author name, e.g. W W Wang

Papers shown:
  • P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
  • P2: A better mouse immunity model, W.Wang, A.Ansari
  • P4: Autoimmunity in biliary cirrhosis, W.W.Wang, A.Ansari

5
Query-time ER using Relations
  • Simple approach for resolving queries
  • Use attributes
  • Quick but not accurate
  • Use best techniques available
  • Collective resolution using relationships
  • How can we localize collective resolution?
  • Two-phase collective resolution for a query
  • Extract minimal set of relevant records
  • Collective resolution on extracted records

6
Cut-based Evaluation of Relational Clustering
  • Vertices embedded in attribute space
  • Additional (hyper)edges represent relationships

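[Figure: two alternative clusterings of the same vertices into clusters C1, C2, C3, C4]

First clustering:
  • Good separation of attributes
  • Many cluster-cluster relationships
  • C1-C3, C1-C4, C2-C4
Second clustering:
  • Worse in terms of attributes
  • Fewer cluster-cluster relationships
  • C1-C3, C2-C4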

7
A Cut-based Objective Function
[Equation: objective function, with the following annotated terms]
  • weight for attributes
  • weight for relations
  • similarity of attributes
  • 1 iff relational edge exists between ci and cj
  • compatibility of ci and cj
  • Greedy clustering algorithm: merge the cluster pair
    with the maximum reduction in the objective function
  • Common cluster neighborhood
  • Jaccard works better than intersection
  • Similarity of attributes
  • Jaro, Levenshtein, TF-IDF
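
A minimal sketch of the greedy merging idea on the four example papers. It is not the authors' objective function, whose exact form is not recoverable from this transcript. As a stand-in, it merges the pair with the highest weighted sum of an attribute similarity and the Jaccard coefficient of the two clusters' neighborhoods. The names `greedy_cluster`, `w_rel` and `threshold`, the use of `difflib.SequenceMatcher` in place of Jaro, Levenshtein or TF-IDF, the rule that two references on one paper are never merged, and the pre-merged A.Ansari cluster used as a bootstrap are all assumptions of this sketch.

```python
from difflib import SequenceMatcher
from itertools import combinations

# (reference id, name as written, paper id) for papers P1-P4
REFS = [
    ("r1", "W.Wang", "P1"), ("r2", "C.Chen", "P1"), ("r3", "A.Ansari", "P1"),
    ("r4", "W.Wang", "P2"), ("r5", "A.Ansari", "P2"),
    ("r6", "L.Li", "P3"), ("r7", "C.Chen", "P3"), ("r8", "W.Wang", "P3"),
    ("r9", "W.W.Wang", "P4"), ("r10", "A.Ansari", "P4"),
]
NAME = {r: name for r, name, _ in REFS}
PAPER = {r: paper for r, _, paper in REFS}


def jaccard(a, b):
    """Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def attr_sim(c1, c2):
    """Best string similarity between any two names of the clusters.
    SequenceMatcher is a stand-in for Jaro, Levenshtein or TF-IDF."""
    return max(SequenceMatcher(None, NAME[x], NAME[y]).ratio()
               for x in c1 for y in c2)


def papers_of(cluster):
    return {PAPER[r] for r in cluster}


def neighborhood(cluster, label):
    """Labels of the clusters whose references share a paper with `cluster`."""
    papers = papers_of(cluster)
    return {label[r] for r in PAPER if PAPER[r] in papers and r not in cluster}


def greedy_cluster(initial, w_rel=0.5, threshold=0.6):
    """Repeatedly merge the most similar cluster pair scoring above `threshold`."""
    clusters = {i: set(c) for i, c in enumerate(initial)}
    while True:
        label = {r: cid for cid, members in clusters.items() for r in members}
        best, best_pair = threshold, None
        for a, b in combinations(list(clusters), 2):
            ca, cb = clusters[a], clusters[b]
            if papers_of(ca) & papers_of(cb):
                continue  # assumption: references on one paper are distinct people
            sim = ((1 - w_rel) * attr_sim(ca, cb)
                   + w_rel * jaccard(neighborhood(ca, label),
                                     neighborhood(cb, label)))
            if sim > best:
                best, best_pair = sim, (a, b)
        if best_pair is None:
            return [sorted(c) for c in clusters.values()]
        a, b = best_pair
        clusters[a] |= clusters.pop(b)


# Assumed bootstrap: the three identical "A.Ansari" references start merged.
ansari = {"r3", "r5", "r10"}
initial = [ansari] + [{r} for r in NAME if r not in ansari]
result = greedy_cluster(initial)
print(result)
```

With these settings the W.W.Wang reference of P4 is merged with the W.Wang references of P2 and P1, because they share the Ansari cluster as a neighbor. The W.Wang reference of P3 stays separate despite its identical name, since it has no neighboring cluster in common with them.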

8
Extracting Relevant Records
[Figure: expansion of relevant records for the query W Wang]
  • Query: W Wang
  • Level 0 (name expansion): P4 W W Wang, P1 W Wang, P2 W Wang, P3 W Wang
  • Level 1 (hyper-edge expansion): P4 A Ansari, P2 A Ansari, P1 A Ansari,
    P1 C Chen, P3 C Chen, P3 L Li
  • Level 2 (name expansion): P A Ansari, P A Ansari, P C Chen, P C Chen,
    P L Li, P L Li
Start with the query name or record
  • Alternate between
  • Name expansion: for any relevant record, include
    other records with that name
  • Hyper-edge expansion: for any relevant record,
    include other related records

Terminate at some depth k
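
The alternation described on this slide can be sketched as follows. This is an illustrative reading, not the authors' implementation. The function names `expand` and `name_key`, the rule that two names match when they share last name and first initial, and the extra papers P5 to P7 are assumptions introduced for the example.

```python
# (paper id, author name) records. P1-P4 follow the running example;
# P5-P7 are hypothetical extra papers so that deeper levels find something.
RECORDS = [
    ("P1", "W Wang"), ("P1", "C Chen"), ("P1", "A Ansari"),
    ("P2", "W Wang"), ("P2", "A Ansari"),
    ("P3", "L Li"), ("P3", "C Chen"), ("P3", "W Wang"),
    ("P4", "W W Wang"), ("P4", "A Ansari"),
    ("P5", "A Ansari"), ("P5", "D Das"),
    ("P6", "C Chen"),
    ("P7", "L Li"),
]


def name_key(name):
    """Assumed notion of 'same name': equal last name and first initial."""
    parts = name.split()
    return (parts[-1], parts[0][0])


def expand(records, query_name, depth):
    """Return the set of records reached from `query_name` within `depth` levels."""
    query_key = name_key(query_name)
    # Level 0: name expansion of the query itself.
    relevant = {r for r in records if name_key(r[1]) == query_key}
    frontier = set(relevant)
    for level in range(1, depth + 1):
        if level % 2 == 1:
            # Hyper-edge expansion: other records on the same papers.
            papers = {paper for paper, _ in frontier}
            found = {r for r in records if r[0] in papers}
        else:
            # Name expansion: other records carrying the same names.
            keys = {name_key(name) for _, name in frontier}
            found = {r for r in records if name_key(r[1]) in keys}
        frontier = found - relevant  # only newly reached records are expanded next
        relevant |= frontier
    return relevant


for k in range(4):
    print(k, len(expand(RECORDS, "W Wang", k)))
```

Only the newly added records are expanded at each level, so the loop stops doing work once no new records are reached, even if `depth` is larger.
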
9
Adaptive Expansion for a Query
  • Too many records with unconstrained expansion
  • Adaptively select records based on ambiguity
  • Example: Chen is more ambiguous than Ansari
  • Adaptive name expansion
  • Expand the more ambiguous records;
    they need extra evidence
  • Adaptive hyper-edge expansion
  • Add fewer ambiguous records;
    they lead to imprecision
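
One simple way to picture the two adaptive rules is a pair of filters over ambiguity scores. The hard thresholds, the function names, and the scores are hypothetical. The slides do not say how the selection is made, so this is only a sketch of the direction of each rule.

```python
# Hypothetical ambiguity scores in [0, 1]; higher means more ambiguous.
AMBIGUITY = {"C Chen": 0.8, "L Li": 0.6, "A Ansari": 0.2}


def select_for_name_expansion(names, ambiguity, min_ambiguity=0.5):
    """Name expansion: keep the MORE ambiguous names, which need extra evidence."""
    return [n for n in names if ambiguity[n] >= min_ambiguity]


def select_for_hyperedge_expansion(names, ambiguity, max_ambiguity=0.5):
    """Hyper-edge expansion: keep the LESS ambiguous neighbors, which add less noise."""
    return [n for n in names if ambiguity[n] <= max_ambiguity]


neighbors = ["A Ansari", "C Chen", "L Li"]
print(select_for_name_expansion(neighbors, AMBIGUITY))
print(select_for_hyperedge_expansion(neighbors, AMBIGUITY))
```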

10
Unsupervised Estimation of Ambiguity
  • Probability of multiple entities sharing an
    attribute value
  • Estimate the ambiguity of a value a of one single-valued
    attribute (A1 = a) using another attribute (A2)
  • Count the number of different values of A2 observed
    for records having A1 = a
  • e.g. different first initials for the last name
    Smith
  • Estimate improves with more independent
    attributes
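
The counting step can be sketched in a few lines. The records are made up for illustration, and dividing by the largest count to get a score in (0, 1] is an assumption of this sketch rather than the estimator used in the talk.

```python
from collections import defaultdict


def distinct_value_counts(records, a1, a2):
    """For each value of attribute `a1`, count the distinct values of `a2` seen with it."""
    seen = defaultdict(set)
    for rec in records:
        seen[rec[a1]].add(rec[a2])
    return {value: len(others) for value, others in seen.items()}


def normalized_ambiguity(counts):
    """Scale the counts by the largest one (assumed normalization)."""
    top = max(counts.values())
    return {value: n / top for value, n in counts.items()}


# Hypothetical author references: last name (A1) and first initial (A2).
authors = [
    {"last": "Smith", "initial": "J"},
    {"last": "Smith", "initial": "A"},
    {"last": "Smith", "initial": "R"},
    {"last": "Smith", "initial": "J"},
    {"last": "Chen", "initial": "C"},
    {"last": "Chen", "initial": "Y"},
    {"last": "Ansari", "initial": "A"},
]

counts = distinct_value_counts(authors, "last", "initial")
scores = normalized_ambiguity(counts)
print(counts)
print(scores)
```

In this toy data Smith is seen with three different first initials, Chen with two and Ansari with one, so Smith receives the highest score.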

11
Evaluation Datasets
  • arXiv High Energy Physics
  • 29,555 publications, 58,515 refs to 9,200 authors
  • Queries: all ambiguous names (75 in total)
  • True authors per name: 2 to 11 (avg. 2.4)
  • Elsevier BioBase
  • 156,156 publications, 831,991 author refs
  • Keywords, topic classifications, language,
    country and affiliation of corresponding author,
    etc.
  • Queries: 100 most frequent names
  • True authors per name: 1 to 100 (avg. 32)

12
Growth Rate of Relevant Records and Query
Processing Time
  • Number of relevant references grows rapidly with
    expansion depth

RC-ER is fast but not good enough for query-time resolution

13
Query-time ER Results
  • Unconstrained expansion
  • Collective resolution more accurate
  • Accuracy improves beyond depth 1

Legend: A = pair-wise attribute similarity; AN = also uses neighbors'
attributes; a further variant applies transitive closure
  • Adaptive expansion
  • Minimal loss in accuracy
  • Dramatic reduction in query processing time

AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1

14
Conclusions
  • Query-centric entity resolution
  • Cut-based evaluation of relational clustering
  • Adaptive selection of relevant references for a
    query
  • Resolution at query-time with minimal loss in
    accuracy

Future Directions
  • Spectral algorithm for relational clustering
  • Stronger coupling between extraction and
    resolution
  • Localized resolution for incoming records

15
References
  • "Query-Time Entity Resolution", Indrajit
    Bhattacharya, Louis Licamele and Lise Getoor, ACM
    SIGKDD, 2006
  • "A Latent Dirichlet Model for Unsupervised Entity
    Resolution", Indrajit Bhattacharya and Lise
    Getoor, SIAM Data Mining, 2006
  • "Entity Resolution in Graphs", Indrajit
    Bhattacharya and Lise Getoor, Chapter in Mining
    Graph Data, Lawrence B. Holder and Diane J. Cook,
    Editors, Wiley, 2006 (to appear).
  • "Relational Clustering for Multi-type Entity
    Resolution", Indrajit Bhattacharya and Lise
    Getoor, SIGKDD Workshop on Multi Relational Data
    Mining (MRDM), 2005