Title: Relational Clustering for Entity Resolution Queries
1. Relational Clustering for Entity Resolution Queries
- Indrajit Bhattacharya, Louis Licamele and Lise Getoor
- University of Maryland, College Park
2. The Entity Resolution Problem
References (author names as written on each paper):
- P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
- P2: A better mouse immunity model, W.Wang, A.Ansari
- P3: Measuring protein-bound fluxetine, L.Li, C.Chen, W.Wang
- P4: Autoimmunity in biliary cirrhosis, W.W.Wang, A.Ansari
Underlying entities: Abdulla Ansari, WeiWei Wang, Chih Chen, Wenyi Wang, Liyuan Li, Chien-Te Chen
- Discover the domain entities
- Map each reference to an entity
3. Query-time ER: Motivation
- Most publicly available databases do not have resolved entities
  - PubMed and CiteSeer have many unresolved authors
- Millions of queries every day require resolved entities, directly or indirectly
  - "I am looking for all papers by Stuart Russell"
- How do we address this problem?
  - Leave the burden of resolution on the user
  - Ask owners to clean their databases
  - Develop techniques for query-time resolution
4. Entity Resolution Queries
- Disambiguation Query
  - Among all papers with W Wang as author, find those written by WeiWei Wang
  - P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
  - P2: A better mouse immunity model, W.Wang, A.Ansari
  - P3: Measuring protein-bound fluxetine, L.Li, C.Chen, W.Wang
- Resolution Query
  - Do disambiguation
  - Also retrieve papers by WeiWei Wang that use a different author name, e.g. W W Wang
  - P1: A mouse immunity model, W.Wang, C.Chen, A.Ansari
  - P2: A better mouse immunity model, W.Wang, A.Ansari
  - P4: Autoimmunity in biliary cirrhosis, W.W.Wang, A.Ansari
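The difference between the two query types can be sketched on the running example. This is only an illustration: the `RESOLVED` mapping is an assumption inferred from the papers listed for each query (in practice it is exactly what entity resolution has to produce), and the function names are hypothetical.

```python
# Toy references from the running example: (paper id, author name as written).
REFERENCES = [
    ("P1", "W.Wang"), ("P1", "C.Chen"), ("P1", "A.Ansari"),
    ("P2", "W.Wang"), ("P2", "A.Ansari"),
    ("P3", "L.Li"), ("P3", "C.Chen"), ("P3", "W.Wang"),
    ("P4", "W.W.Wang"), ("P4", "A.Ansari"),
]

# ASSUMPTION: a resolved reference -> entity mapping for the Wang references.
# A real system would compute this; here it is hard-coded for illustration.
RESOLVED = {
    ("P1", "W.Wang"): "WeiWei Wang",
    ("P2", "W.Wang"): "WeiWei Wang",
    ("P3", "W.Wang"): "Wenyi Wang",
    ("P4", "W.W.Wang"): "WeiWei Wang",
}

def disambiguation_query(name, entity):
    """Papers whose author string equals `name` and resolves to `entity`."""
    return sorted({p for (p, n) in REFERENCES
                   if n == name and RESOLVED.get((p, n)) == entity})

def resolution_query(entity):
    """All papers by `entity`, whatever name string was used."""
    return sorted({p for (p, n) in REFERENCES
                   if RESOLVED.get((p, n)) == entity})

print(disambiguation_query("W.Wang", "WeiWei Wang"))  # ['P1', 'P2']
print(resolution_query("WeiWei Wang"))                # ['P1', 'P2', 'P4']
```

The resolution query returns P4 as well because it is not restricted to one spelling of the name.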
5. Query-time ER using Relations
- Simple approach for resolving queries
  - Use attributes
  - Quick but not accurate
- Use best techniques available
  - Collective resolution using relationships
  - How can we localize collective resolution?
- Two-phase collective resolution for a query
  - Extract minimal set of relevant records
  - Collective resolution on extracted records
6. Cut-based Evaluation of Relational Clustering
- Vertices embedded in attribute space
- Additional (hyper)edges represent relationships
[Figure: two alternative clusterings of the same vertices into clusters C1 to C4]
First clustering:
- Good separation of attributes
- Many cluster-cluster relationships
  - C1-C3, C1-C4, C2-C4
Second clustering:
- Worse in terms of attributes
- Fewer cluster-cluster relationships
  - C1-C3, C2-C4
7. A Cut-based Objective Function
Annotated terms of the objective (equation not recovered): weight for attributes; weight for relations; similarity of attributes; an indicator that is 1 iff a relational edge exists between ci and cj; compatibility of ci and cj
- Greedy clustering algorithm: merge the cluster pair with the maximum reduction in the objective function
- Common cluster neighborhood
  - Jaccard works better than intersection
- Similarity of attributes
  - Jaro, Levenshtein, TF-IDF
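A minimal sketch of greedy relational clustering on the running example follows. Several things are assumptions and not taken from the talk: the objective's exact form is not available here, so the sketch merges the pair with the highest combined similarity `(1 - alpha) * attribute + alpha * relational`; `name_sim` is a hypothetical stand-in for Jaro, Levenshtein or TF-IDF; the `alpha` and `threshold` values are arbitrary; and the Ansari references are assumed to be merged already as a bootstrap step.

```python
from itertools import combinations

# (paper id, author name) references; list index = reference id.
REFS = [("P1", "W.Wang"), ("P1", "C.Chen"), ("P1", "A.Ansari"),
        ("P2", "W.Wang"), ("P2", "A.Ansari"),
        ("P3", "L.Li"), ("P3", "C.Chen"), ("P3", "W.Wang"),
        ("P4", "W.W.Wang"), ("P4", "A.Ansari")]

def name_sim(a, b):
    """Hypothetical stand-in for Jaro / Levenshtein / TF-IDF similarity."""
    if a == b:
        return 1.0
    pa, pb = a.split("."), b.split(".")
    # Same last name and same first initial: compatible but not identical.
    return 0.8 if (pa[-1], pa[0]) == (pb[-1], pb[0]) else 0.0

def attr_sim(ci, cj):
    """Cluster-level attribute similarity: best-matching pair of names."""
    return max(name_sim(REFS[i][1], REFS[j][1]) for i in ci for j in cj)

def neighborhood(c, clusters):
    """Other clusters sharing a hyper-edge (paper) with a reference in c."""
    papers = {REFS[i][0] for i in c}
    return {d for d in clusters
            if d != c and any(REFS[j][0] in papers for j in d)}

def rel_sim(ci, cj, clusters):
    """Jaccard overlap of the two clusters' neighborhoods."""
    ni, nj = neighborhood(ci, clusters), neighborhood(cj, clusters)
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0

def greedy_cluster(clusters, alpha=0.5, threshold=0.6):
    """Repeatedly merge the most similar pair until none reaches threshold."""
    clusters = [frozenset(c) for c in clusters]
    while len(clusters) > 1:
        scored = [((1 - alpha) * attr_sim(a, b)
                   + alpha * rel_sim(a, b, clusters), a, b)
                  for a, b in combinations(clusters, 2)]
        score, ci, cj = max(scored, key=lambda t: t[0])
        if score < threshold:
            break
        clusters = [c for c in clusters if c not in (ci, cj)] + [ci | cj]
    return clusters

# ASSUMPTION: the three A.Ansari references are merged up front (bootstrap).
initial = [{2, 4, 9}] + [{i} for i in (0, 1, 3, 5, 6, 7, 8)]
print(sorted(sorted(c) for c in greedy_cluster(initial)))
# [[0, 3, 8], [1], [2, 4, 9], [5], [6], [7]]
```

With these settings the W.Wang references of P1 and P2 join W.W.Wang of P4 through their shared Ansari neighbor, while the W.Wang and C.Chen references of P3 stay separate because a name match alone does not reach the threshold.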
8. Extracting Relevant Records
[Figure: expansion levels starting from the query]
- Query: W Wang
- Level 0 (name expansion): P1 W Wang, P2 W Wang, P3 W Wang, P4 W W Wang
- Level 1 (hyper-edge expansion): P1 A Ansari, P2 A Ansari, P4 A Ansari, P1 C Chen, P3 C Chen, P3 L Li
- Level 2 (name expansion): P A Ansari, P A Ansari, P C Chen, P C Chen, P L Li, P L Li
Start with the query name or record
- Alternate between
  - Name expansion: for any relevant record, include other records with that name
  - Hyper-edge expansion: for any relevant record, include other related records
Terminate at some depth k
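The alternating expansion can be sketched as a breadth-first traversal. The details here are assumptions for illustration: `name_key` is a hypothetical way to decide that two records carry "that name" (first initial plus last name), paper P5 with J.Doe is made up so that level 2 has something to add, and level numbering follows the figure (level 0 is the name expansion of the query).

```python
from collections import defaultdict

# (paper id, author name) records; list index = record id.
REFS = [("P1", "W.Wang"), ("P1", "C.Chen"), ("P1", "A.Ansari"),
        ("P2", "W.Wang"), ("P2", "A.Ansari"),
        ("P3", "L.Li"), ("P3", "C.Chen"), ("P3", "W.Wang"),
        ("P4", "W.W.Wang"), ("P4", "A.Ansari"),
        ("P5", "C.Chen"), ("P5", "J.Doe")]  # P5 is hypothetical

def name_key(name):
    """Hypothetical matching key: (first initial, last name)."""
    parts = name.replace(".", " ").split()
    return (parts[0][0], parts[-1])

def relevant_records(query_name, k):
    """Record ids reached from the query within expansion depth k."""
    by_name, by_paper = defaultdict(set), defaultdict(set)
    for i, (paper, name) in enumerate(REFS):
        by_name[name_key(name)].add(i)
        by_paper[paper].add(i)

    # Level 0: name expansion of the query itself.
    frontier = set(by_name[name_key(query_name)])
    relevant = set(frontier)
    for level in range(1, k + 1):
        if level % 2 == 1:
            # Odd level: hyper-edge expansion (records on the same paper).
            reached = set().union(*(by_paper[REFS[i][0]] for i in frontier))
        else:
            # Even level: name expansion (records with a matching name).
            reached = set().union(
                *(by_name[name_key(REFS[i][1])] for i in frontier))
        frontier = reached - relevant
        relevant |= frontier
    return relevant

for depth in range(4):
    print(depth, sorted(relevant_records("W Wang", depth)))
# 0 [0, 3, 7, 8]
# 1 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# 2 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# 3 [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
```

Only newly reached records are expanded at the next level, so each record is visited once.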
9. Adaptive Expansion for a Query
- Too many records with unconstrained expansion
- Adaptively select records based on ambiguity
  - Chen is more ambiguous than Ansari
- Adaptive Name Expansion
  - Expand the more ambiguous records
  - They need extra evidence
- Adaptive Hyper-edge Expansion
  - Add fewer ambiguous records
  - They lead to imprecision
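The two adaptive rules can be sketched as simple filters. This is an assumption-laden illustration: the ambiguity scores are invented (they only respect "Chen is more ambiguous than Ansari"), and a hard threshold is used for simplicity, which may differ from how the selection is actually done.

```python
# ASSUMPTION: illustrative ambiguity scores in [0, 1]; not from the talk.
AMBIGUITY = {"A.Ansari": 0.1, "C.Chen": 0.7, "L.Li": 0.6}

def adaptive_hyperedge_expansion(candidates, max_ambiguity=0.5):
    """Add only co-occurring records whose names are not too ambiguous."""
    return [r for r in candidates if AMBIGUITY[r[1]] <= max_ambiguity]

def adaptive_name_expansion(records, min_ambiguity=0.5):
    """Select the records worth expanding by name: the ambiguous ones."""
    return [r for r in records if AMBIGUITY[r[1]] >= min_ambiguity]

# Records that co-occur with the W Wang references in the running example.
level1 = [("P1", "C.Chen"), ("P1", "A.Ansari"), ("P2", "A.Ansari"),
          ("P3", "L.Li"), ("P3", "C.Chen"), ("P4", "A.Ansari")]

print(adaptive_hyperedge_expansion(level1))
# [('P1', 'A.Ansari'), ('P2', 'A.Ansari'), ('P4', 'A.Ansari')]
print(adaptive_name_expansion(level1))
# [('P1', 'C.Chen'), ('P3', 'L.Li'), ('P3', 'C.Chen')]
```

The two filters pull in opposite directions on purpose: unambiguous neighbors are cheap, precise evidence, while ambiguous records are the ones that need more evidence before they can be resolved.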
10. Unsupervised Estimation of Ambiguity
- Probability of multiple entities sharing an attribute value
- Estimate ambiguity of one single-valued attribute (A1 = a) using another (A2)
  - Count the number of different values of A2 observed for records having A1 = a
  - e.g. different first initials for last name Smith
- Estimate improves with more independent attributes
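The counting step can be sketched as follows, taking A1 to be the last name and A2 the first initial, as in the Smith example. The sample names are made up, and the sketch stops at the raw count: how the count is turned into a probability is not specified here, so no normalization is attempted.

```python
from collections import defaultdict

def ambiguity_by_last_name(names):
    """For each last name (A1), count distinct first initials (A2) observed."""
    initials = defaultdict(set)
    for name in names:
        parts = name.replace(".", " ").split()
        initials[parts[-1]].add(parts[0][0])
    return {last: len(vals) for last, vals in initials.items()}

# Hypothetical author strings, for illustration only.
names = ["J.Smith", "A.Smith", "J.Smith", "R.Smith", "A.Ansari", "A.Ansari"]
print(ambiguity_by_last_name(names))  # {'Smith': 3, 'Ansari': 1}
```

A last name seen with many different first initials is more likely to be shared by several entities, so it gets a higher count.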
11. Evaluation Datasets
- arXiv High Energy Physics
  - 29,555 publications, 58,515 refs to 9,200 authors
  - Queries: all ambiguous names (75 in total)
  - True authors per name: 2 to 11 (avg. is 2.4)
- Elsevier BioBase
  - 156,156 publications, 831,991 author refs
  - Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
  - Queries: 100 most frequent names
  - True authors per name: 1 to 100 (avg. is 32)
12. Growth Rate of Relevant Records and Query Processing Time
- Number of relevant references grows rapidly with expansion depth
RC-ER is fast but not good enough for query-time resolution
13. Query-time ER Results
- Unconstrained expansion
  - Collective resolution is more accurate
  - Accuracy improves beyond depth 1
Baselines: A = pair-wise attribute similarity; AN = A plus neighbors' attributes; variants with transitive closure
- Adaptive expansion
  - Minimal loss in accuracy
  - Dramatic reduction in query processing time
AX-2: adaptive expansion at depths 2 and beyond; AX-1: adaptive expansion even at depth 1
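For readers unfamiliar with the transitive-closure baselines, the idea can be sketched with a union-find structure: pair-wise match decisions are made independently, and any chain of matches puts references in the same group. The function name and input format are assumptions for illustration, not the evaluation code.

```python
def transitive_closure(n, matches):
    """Group n references given pair-wise match decisions (union-find)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

# 0 matches 1 and 1 matches 2, so 0 and 2 end up together as well.
print(transitive_closure(5, [(0, 1), (1, 2)]))  # [[0, 1, 2], [3], [4]]
```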
14. Conclusions
- Query-centric entity resolution
- Cut-based evaluation of relational clustering
- Adaptive selection of relevant references for a query
- Resolution at query-time with minimal loss in accuracy
Future Directions
- Spectral algorithm for relational clustering
- Stronger coupling between extraction and resolution
- Localized resolution for incoming records
15. References
- "Query-Time Entity Resolution", Indrajit Bhattacharya, Louis Licamele and Lise Getoor, ACM SIGKDD, 2006
- "A Latent Dirichlet Model for Unsupervised Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIAM Data Mining, 2006
- "Entity Resolution in Graphs", Indrajit Bhattacharya and Lise Getoor, chapter in Mining Graph Data, Lawrence B. Holder and Diane J. Cook, editors, Wiley, 2006 (to appear)
- "Relational Clustering for Multi-type Entity Resolution", Indrajit Bhattacharya and Lise Getoor, SIGKDD Workshop on Multi Relational Data Mining (MRDM), 2005