1
Relational Clustering for Entity Resolution Queries

Indrajit Bhattacharya, Louis Licamele and Lise Getoor
University of Maryland, College Park

2
The Entity Resolution Problem
Underlying authors: Abdulla Ansari, WeiWei Wang, Chih Chen, Wenyi Wang, Liyuan Li, Chien-Te Chen

P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
P2: A better mouse immunity model, W.Wang, A.Ansari
P3: Measuring protein-bound fluxetine, L.Li, C.Chen, W.Wang
P4: Autoimmunity in biliary cirrhosis, W.W.Wang, A.Ansari

Discover the domain entities
Map each reference to an entity

3
Query-time ER Motivation

Most publicly available databases do not have resolved entities
PubMed and CiteSeer have many unresolved authors
Millions of queries every day require resolved entities, directly or indirectly
"I am looking for all papers by Stuart Russell"
How do we address this problem?
Leave the burden of resolution on the user
Ask owners to clean their databases
Develop techniques for query-time resolution

4
Entity Resolution Queries

Disambiguation Query
Among all papers with W Wang as author, find those written by WeiWei Wang

P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
P2: A better mouse immunity model, W.Wang, A.Ansari
P3: Measuring protein-bound fluxetine, L.Li, C.Chen, W.Wang

Resolution Query
Do disambiguation
Also retrieve papers by WeiWei Wang under a different author name, e.g. W W Wang

P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
P2: A better mouse immunity model, W.Wang, A.Ansari
P4: Autoimmunity in biliary cirrhosis, W.W.Wang, A.Ansari

5
Query-time ER using Relations

Simple approach for resolving queries
Use attributes
Quick but not accurate
Use best techniques available
Collective resolution using relationships
How can we localize collective resolution?
Two-phase collective resolution for a query
Extract minimal set of relevant records
Collective resolution on extracted records

6
Cut-based Evaluation of Relational Clustering

Vertices embedded in attribute space
Additional (hyper)edges represent relationships

[Figure: two alternative clusterings of the same vertices into clusters C1 to C4]

Good separation of attributes
Many cluster-cluster relationships: C1-C3, C1-C4, C2-C4

Worse in terms of attributes
Fewer cluster-cluster relationships: C1-C3, C2-C4

7
A Cut-based Objective Function
[Equation: objective function, with terms annotated as follows]
weight for attributes
weight for relations
similarity of attributes
1 iff relational edge exists between ci and cj
compatibility of ci and cj

Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function

Common cluster neighborhood
Jaccard works better than intersection

Similarity of attributes
Jaro, Levenshtein, TF-IDF
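
The combination of attribute similarity and common cluster neighborhood can be sketched as follows. This is a minimal illustration, not the authors' implementation: the cluster ids, the simplified hyper-edges, the equal weights, and the use of `difflib` as a stand-in for Jaro, Levenshtein or TF-IDF are all assumptions, and the greedy step simply picks the most similar pair instead of computing the reduction in the objective function.

```python
from difflib import SequenceMatcher
from itertools import combinations

def attr_sim(name_a, name_b):
    # Stand-in attribute similarity in [0, 1] (assumption, not Jaro/TF-IDF).
    return SequenceMatcher(None, name_a, name_b).ratio()

def jaccard(a, b):
    # Jaccard coefficient: shared neighbors normalized by all neighbors.
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def neighborhood(cid, clusters, hyperedges):
    # Ids of other clusters that co-occur with cluster `cid` in a hyper-edge.
    members = clusters[cid]["refs"]
    nbrs = set()
    for edge in hyperedges:
        if members & edge:
            for other, c in clusters.items():
                if other != cid and c["refs"] & edge:
                    nbrs.add(other)
    return nbrs

def similarity(ci, cj, clusters, hyperedges, w_attr=0.5, w_rel=0.5):
    # Weighted sum of attribute similarity and neighborhood similarity.
    a = attr_sim(clusters[ci]["name"], clusters[cj]["name"])
    r = jaccard(neighborhood(ci, clusters, hyperedges),
                neighborhood(cj, clusters, hyperedges))
    return w_attr * a + w_rel * r

def best_merge(clusters, hyperedges):
    # One greedy step: the candidate pair with the highest similarity.
    return max(combinations(sorted(clusters), 2),
               key=lambda p: similarity(p[0], p[1], clusters, hyperedges))

# Hypothetical state: the two A.Ansari references are already merged.
clusters = {
    "ansari":  {"name": "A.Ansari", "refs": {2, 4}},
    "wang_p1": {"name": "W.Wang",   "refs": {1}},
    "wang_p2": {"name": "W.Wang",   "refs": {3}},
    "wang_p3": {"name": "W.Wang",   "refs": {5}},
    "li":      {"name": "L.Li",     "refs": {6}},
}
hyperedges = [{1, 2}, {3, 4}, {5, 6}]  # simplified co-author sets
pair = best_merge(clusters, hyperedges)
```

Here the two W.Wang references that share the Ansari cluster as a neighbor score higher than the W.Wang reference whose only neighbor is L.Li, even though all three names are identical. Replacing `jaccard` with a raw intersection count would drop the normalization by neighborhood size.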

8
Extracting Relevant Records
[Figure: expansion from the query W Wang]
Query: W Wang
Level 0 (name expansion): P4 W W Wang, P1 W Wang, P2 W Wang, P3 W Wang
Level 1 (hyper-edge expansion): P4 A Ansari, P2 A Ansari, P1 A Ansari, P1 C Chen, P3 C Chen, P3 L Li
Level 2 (name expansion): P A Ansari, P A Ansari, P C Chen, P C Chen, P L Li, P L Li

Start with query name or record

Alternate between
Name expansion: for any relevant record, include other records with that name
Hyper-edge expansion: for any relevant record, include other related records

Terminate at some depth k
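
The alternating expansion can be sketched on the four example papers. This is an illustrative sketch under assumptions: `name_key` (same last name and first initial) is a hypothetical matcher chosen so that W W Wang is reached from the query W Wang at level 0, and `expand` is not a function from the original work.

```python
from collections import defaultdict

# Toy data from the running example: paper id -> author names as written.
PAPERS = {
    "P1": ["W Wang", "C Chen", "A Ansari"],
    "P2": ["W Wang", "A Ansari"],
    "P3": ["L Li", "C Chen", "W Wang"],
    "P4": ["W W Wang", "A Ansari"],
}

def name_key(name):
    # Assumed name matcher: same last name and same first initial.
    parts = name.split()
    return (parts[-1], parts[0][0])

def expand(query_name, papers, depth):
    # Return the (paper, name) references reached within `depth` levels.
    by_name = defaultdict(set)
    for pid, names in papers.items():
        for name in names:
            by_name[name_key(name)].add((pid, name))

    # Level 0: name expansion on the query itself.
    relevant = set(by_name.get(name_key(query_name), set()))
    frontier = set(relevant)
    for level in range(1, depth + 1):
        if level % 2 == 1:
            # Hyper-edge expansion: references co-occurring in the same paper.
            reached = {(pid, n) for pid, _ in frontier for n in papers[pid]}
        else:
            # Name expansion: other references carrying a matching name.
            reached = {ref for _, n in frontier for ref in by_name[name_key(n)]}
        frontier = reached - relevant
        relevant |= frontier
    return relevant

level0 = expand("W Wang", PAPERS, 0)
level1 = expand("W Wang", PAPERS, 1)
```

On this toy data, level 0 holds the four Wang references and level 1 adds their six co-author references. Level 2 adds nothing because the toy data contains no further papers by those names; in a real database this is where the relevant set starts to grow quickly.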
9
Adaptive Expansion for a Query

Unconstrained expansion yields too many records
Adaptively select records based on ambiguity
Chen is more ambiguous than Ansari
Adaptive name expansion
Expand the more ambiguous records
They need extra evidence
Adaptive hyper-edge expansion
Add fewer of the ambiguous records
They lead to imprecision
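
One simple way to picture the two adaptive rules is as filters on an ambiguity score. The sketch below is an assumption-laden illustration: the hard threshold, the function names, and the ambiguity values are hypothetical, and the actual selection rule may differ (for example, it could be probabilistic rather than a cutoff).

```python
def adaptive_name_expansion(names, ambiguity, threshold=0.5):
    # Expand only names ambiguous enough to need extra evidence.
    return [n for n in names if ambiguity.get(n, 0.0) >= threshold]

def adaptive_hyperedge_expansion(neighbors, ambiguity, threshold=0.5):
    # Add only neighbor names unambiguous enough to give precise evidence.
    return [n for n in neighbors if ambiguity.get(n, 0.0) < threshold]

# Hypothetical ambiguity scores: Chen is more ambiguous than Ansari.
ambiguity = {"C Chen": 0.8, "A Ansari": 0.1}
to_expand = adaptive_name_expansion(["C Chen", "A Ansari"], ambiguity)
to_add = adaptive_hyperedge_expansion(["C Chen", "A Ansari"], ambiguity)
```

With these scores, C Chen is the name that gets expanded further, while A Ansari is the neighbor that gets added as evidence.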

10
Unsupervised Estimation of Ambiguity

Probability of multiple entities sharing an attribute value
Estimate ambiguity of one single-valued attribute (A1 = a) using another (A2)
Count the number of different values of A2 observed for records having A1 = a
e.g. different first initials for last name Smith
Estimate improves with more independent attributes
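
The counting step can be sketched in a few lines. The record layout and the function name `estimate_ambiguity` are assumptions for illustration, and the sketch returns raw counts; how the counts are turned into a probability is not specified here.

```python
from collections import defaultdict

def estimate_ambiguity(records, a1, a2):
    # For each value of attribute a1, count the distinct a2 values seen with it.
    seen = defaultdict(set)
    for rec in records:
        seen[rec[a1]].add(rec[a2])
    return {value: len(others) for value, others in seen.items()}

# Hypothetical records: last name as A1, first initial as A2.
records = [
    {"last": "Smith", "first": "J"},
    {"last": "Smith", "first": "A"},
    {"last": "Smith", "first": "J"},
    {"last": "Ansari", "first": "A"},
]
counts = estimate_ambiguity(records, "last", "first")
```

A last name seen with many different first initials is likely shared by several people, so it gets a higher count than a name seen with only one initial.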

11
Evaluation Datasets

arXiv High Energy Physics
29,555 publications, 58,515 refs to 9,200 authors
Queries: all ambiguous names (75 in total)
True authors per name: 2 to 11 (avg. 2.4)
Elsevier BioBase
156,156 publications, 831,991 author refs
Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
Queries: 100 most frequent names
True authors per name: 1 to 100 (avg. 32)

12
Growth Rate of Relevant Records and Query Processing Time

Number of relevant references grows rapidly with expansion depth

RC-ER is fast but not good enough for query-time resolution
13
Query-time ER Results

Unconstrained expansion
Collective resolution more accurate
Accuracy improves beyond depth 1

A: pair-wise attribute similarity; AN: also neighbors' attributes; transitive closure

Adaptive expansion
Minimal loss in accuracy
Dramatic reduction in query processing time

AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
14
Conclusions

Query-centric entity resolution
Cut-based evaluation of relational clustering
Adaptive selection of relevant references for a query
Resolution at query-time with minimal loss in accuracy

Future Directions

Spectral algorithm for relational clustering
Stronger coupling between extraction and resolution
Localized resolution for incoming records

15
References

"Query-Time Entity Resolution", Indrajit Bhattacharya, Louis Licamele and Lise Getoor, ACM SIGKDD, 2006
"A Latent Dirichlet Model for Unsupervised Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIAM Data Mining, 2006
"Entity Resolution in Graphs", Indrajit Bhattacharya and Lise Getoor, chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, editors, Wiley, 2006 (to appear)
"Relational Clustering for Multi-type Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIGKDD Workshop on Multi Relational Data Mining (MRDM), 2005