Title: CLIP Colloquium Series
1Statistical Relational Learning Entity
Resolution
- Lise Getoor
- University of Maryland, College Park
2What is SRL?
- Traditional statistical machine learning approaches assume
- A random sample of homogeneous objects from a single relation
- Traditional relational learning approaches assume
- No noise or uncertainty in data
- Real world data sets
- Multi-relational and heterogeneous
- Noisy and uncertain
- Statistical Relational Learning (SRL)
- newly emerging research area at the intersection of statistical models and relational learning/inductive logic programming
- Sample Domains
- web data, social networks, biological data, communication data, customer networks, sensor networks, natural language, vision, ...
3SRL Theory
- Directed Approaches
- Semantics based on Bayesian Networks
- Frame-based Directed Models
- Rule-based Directed Models
- Undirected Approaches
- Semantics based on Markov Networks
- Frame-based Undirected Models
- Rule-based Undirected Models
modeling logical vs. statistical dependencies, feature construction, instances vs. classes, effective inference, use of labeled & unlabeled data, link prediction, open vs. closed world
Reference: Upcoming book on Statistical Relational Learning, w/ Ben Taskar
4SRL Application: Link Mining
- Data
- Structured Input: Mining graphs and networks
- Structured Output: Extracting entity and relationships from unstructured data
- Taxonomy
- Node Centric
- Labeling/ranking nodes (aka Collective Classification/PageRank)
- Consolidating nodes (aka Entity Resolution)
- Discovering hidden nodes (aka Group Discovery)
- Edge Centric
- Labeling/ranking edges
- Predicting the existence, number of edges
- Discovering new relations/paths
- Graph/Subgraph Centric
- Discovering frequent subpatterns
- Metadata discovery, extraction, and reformulation
Reference: SigKDD Explorations Special Issue on Link Mining, Dec. 2005, w/ Chris Diehl, JHUAPL
5LINQS Group @ UMD
- Members
- myself, Indrajit Bhattacharya, Mustafa Bilgic, Rezarta Islamaj, Hyunmo Kang, Louis Licamele, Galileo Namata, Prithivaraj Sen, Vivek Senghal, Elena Zheleva
- Projects
- Link-based Classification
- Entity Resolution (ER)
- Algorithms
- Query-time ER
- User Interface
- Predictive Models for Social Network Analysis
- Abstraction in Affiliation Networks
- Social Capital in Friendship Event Networks
- Role Identification & Relationship Ranking
- Temporal Analysis of Email Traffic Networks
- Reputation-based Spam Filtering & Privacy
- Feature Gen & Protein Interaction Prediction (biological data)
- Ontology Alignment and Schema Mapping
- Probabilistic Databases
- Cost-sensitive Feature Acquisition
Graduated this fall; starting at IBM Delhi in April
8Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Graph-based Clustering (GBC)
- Probabilistic Model (LDA-ER)
- Query-time Entity Resolution
- ER User Interface
9InfoVis Co-Author Network Fragment
10The Entity Resolution Problem
James Smith
John Smith
John Smith
Jim Smith
J Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith
- Issues
- Identification
- Disambiguation
11Attribute-based Entity Resolution
[Figure: pair-wise classification of name pairs (Jim Smith / James Smith, John Smith / James Smith, James Smith / Jon Smith, James Smith / Jonthan Smith), annotated with match scores 0.8, 0.1, 0.7 and 0.05; the J Smith / James Smith pairs are marked "?"]
- Choosing threshold: precision/recall tradeoff
- Inability to disambiguate
- Perform transitive closure?
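The attribute-only pipeline on this slide can be sketched as follows. This is a minimal illustration, not the speaker's implementation: `name_similarity` uses Python's `difflib` ratio as a stand-in for the string measures named later in the talk, and the thresholds are arbitrary values chosen to show the tradeoff.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Stand-in attribute similarity in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def pairwise_matches(names, threshold):
    """Pair-wise classification: index pairs whose similarity meets the threshold."""
    return [(i, j) for i, j in combinations(range(len(names)), 2)
            if name_similarity(names[i], names[j]) >= threshold]


def transitive_closure(n, pairs):
    """Close the pair-wise decisions into clusters with a small union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in pairs:
        parent[find(i)] = find(j)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


names = ["James Smith", "Jim Smith", "John Smith", "Jon Smith", "J Smith"]
# A loose threshold plus transitive closure collapses every Smith into one entity.
loose = transitive_closure(len(names), pairwise_matches(names, 0.5))
# A very strict threshold keeps every reference separate.
strict = transitive_closure(len(names), pairwise_matches(names, 0.99))
```

With attributes alone there is no setting that both merges "J Smith" with the right James and keeps the Johns apart, which is the motivation for the relational evidence introduced next.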
12Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Graph-based Clustering (GBC)
- Probabilistic Model (LDA-ER)
- Experimental Evaluation
- Query-time Entity Resolution
- ER User Interface
13Relational Entity Resolution
- References not observed independently
- Links between references indicate relations between the entities
- Co-author relations for bibliographic data
- To, cc lists for email
- Use relations to improve identification and disambiguation
14Relational Identification
Very similar names. Added evidence from shared
co-authors
15Relational Disambiguation
Very similar names but no shared collaborators
16Relational Constraints
Co-authors are typically distinct
17Collective Entity Resolution
One resolution provides evidence for another => joint resolution
18Entity Resolution with Relations
- Naïve Relational Entity Resolution
- Also compare attributes of related references
- Two references have co-authors w/ similar names
- Collective Entity Resolution
- Use discovered entities of related references
- Entities cannot be identified independently
- Harder problem to solve
19Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Relational Clustering (RC-ER)
- DMKD04, Wiley06, DE Bulletin06, TKDD07
- Probabilistic Model (LDA-ER)
- Experimental Evaluation
- Query-time Entity Resolution
- ER User Interface
20
P1: "JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines", C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: "Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies", C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: "Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm", C. Walshaw, M. Cross, M. G. Everett
P4: "Code Generation for Machines with Multiregister Operations", Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: "Deterministic Parsing of Ambiguous Grammars", A. Aho, S. Johnson, J. Ullman
P6: "Compilers: Principles, Techniques, and Tools", A. Aho, R. Sethi, J. Ullman
22Relational Clustering (RC-ER)
P1: M. G. Everett, S. Johnson, C. Walshaw, M. Cross
P2: K. McManus, M. Everett, S. Johnson, C. Walshaw, M. Cross
P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: S. Johnson, A. Aho, J. Ullman
26Cut-based Formulation of RC-ER
M. G. Everett
S. Johnson
S. Johnson
M. Everett
S. Johnson
A. Aho
Stephen C. Johnson
Alfred V. Aho
- Good separation of attributes
- Many cluster-cluster relationships
- Aho-Johnson1, Aho-Johnson2, Everett-Johnson1
- Worse in terms of attributes
- Fewer cluster-cluster relationships
- Aho-Johnson1, Everett-Johnson2
27Objective Function
[Equation: objective function over cluster pairs (ci, cj); its terms are annotated as follows]
- weight for attributes
- weight for relations
- similarity of attributes
- 1 iff relational edge exists between ci and cj
28Objective Function
- Greedy clustering algorithm: merge the cluster pair with max reduction in objective function
- [Equation: reduction for a cluster pair; its terms are annotated "similarity of attributes" and "common cluster neighborhood"]
29Measures for Attribute Similarity
- Use best available measure for each attribute
- Name Strings: Soft TF-IDF, Levenshtein, Jaro
- Textual Attributes: TF-IDF
- Aggregate to find similarity between clusters
- Single link, Average link, Complete link
- Cluster representative
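As a rough illustration of the bullets above, the sketch below turns Levenshtein edit distance into a name similarity and aggregates it over two clusters with single, average or complete link. The normalization by the longer string and the function names are assumptions made for this example; Soft TF-IDF and Jaro are not shown.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete from a
                           cur[j - 1] + 1,              # insert into a
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def name_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance normalized by the longer name."""
    longest = max(len(a), len(b)) or 1
    return 1.0 - levenshtein(a, b) / longest


def cluster_sim(c1, c2, link="average"):
    """Aggregate reference-level similarities into a cluster-level similarity."""
    scores = [name_sim(a, b) for a in c1 for b in c2]
    if link == "single":
        return max(scores)
    if link == "complete":
        return min(scores)
    return sum(scores) / len(scores)


cluster_a = ["J Smith", "John Smith"]
cluster_b = ["Jon Smith"]
single = cluster_sim(cluster_a, cluster_b, "single")
average = cluster_sim(cluster_a, cluster_b, "average")
complete = cluster_sim(cluster_a, cluster_b, "complete")
```

Single link is the most permissive of the three and complete link the strictest, so the choice interacts with the merge threshold.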
30Relational Similarity Example 1
[Figure: clusters {A. Aho, Alfred V. Aho}, {Stephen C. Johnson, S. Johnson} and {J. Ullman, Jefferey D. Ullman}, linked through papers P4 and P5]
All neighborhood clusters are shared: high relational similarity
31Relational Similarity Example 2
[Figure: the S. Johnson references of P1, P2 next to the Stephen C. Johnson / S. Johnson references of P4, P5; the first neighborhood contains C. Walshaw, M. Cross, M. G. Everett / M. Everett and K. McManus, the second Alfred V. Aho / A. Aho and Jefferey D. Ullman / J. Ullman]
No neighborhood cluster is shared: no relational similarity
32Comparing Cluster Neighborhoods
- Different measures of set similarity
- Common Neighbors: Intersection size
- Jaccard's Coefficient: Normalize by union size
- Adar Coefficient: Weighted set similarity
- Higher order similarity: Consider nbrs of nbrs
- Also consider neighborhood as multi-set
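The set measures listed above might look like this in code. Neighborhoods are modeled as plain sets of cluster labels. The weighting in `adar` (one over the log of one plus a label's frequency, normalized by the weighted union) is an assumption chosen for illustration; the slide only says "weighted set similarity".

```python
import math


def common_neighbors(n1: set, n2: set) -> int:
    """Intersection size of two cluster neighborhoods."""
    return len(n1 & n2)


def jaccard(n1: set, n2: set) -> float:
    """Intersection size normalized by union size."""
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def adar(n1: set, n2: set, frequency: dict) -> float:
    """Weighted set similarity: rare shared neighbors count more (assumed weighting)."""
    def weight(c):
        return 1.0 / math.log(1.0 + frequency[c])

    total = sum(weight(c) for c in n1 | n2)
    return sum(weight(c) for c in n1 & n2) / total if total else 0.0


# Hypothetical neighborhoods, written as sets of cluster labels.
bell = {"Aho", "Ullman"}
mesh = {"Walshaw", "Cross", "Everett"}
other = {"Aho", "Sethi"}
frequency = {"Aho": 1, "Ullman": 1, "Sethi": 100}  # made-up occurrence counts
```

Here `jaccard(bell, bell)` is 1.0 and `jaccard(bell, mesh)` is 0.0, mirroring the two relational similarity examples, while `adar(bell, other, frequency)` exceeds `jaccard(bell, other)` because the unshared neighbor is a very common one.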
33Relational Clustering Algorithm
- Find similar references using blocking
- Bootstrap clusters using attributes and relations
- Compute similarities for cluster pairs and insert into priority queue
- Repeat until priority queue is empty
- Find closest cluster pair
- Stop if similarity below threshold
- Merge to create new cluster
- Update similarity for related clusters
- O(n k log n) algorithm w/ efficient implementation
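The loop above can be sketched in a few dozen lines. Everything specific here is an assumption made for illustration: blocking is skipped (all cluster pairs are scored), `difflib` stands in for the name measure, Jaccard is used for the neighborhood, the bootstrap rule is "same name and a shared co-author name", and the weights and threshold are arbitrary. It does not reproduce the O(n k log n) bound.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations


def relational_clustering(refs, w_attr=0.5, w_rel=0.5, threshold=0.6):
    """refs: list of (name, paper_id). Returns clusters as sets of reference indices."""
    n = len(refs)
    papers = {}
    for i, (_, paper) in enumerate(refs):
        papers.setdefault(paper, []).append(i)
    coauthors = [{refs[j][0] for j in papers[p] if j != i}
                 for i, (_, p) in enumerate(refs)]

    # Bootstrap: same name plus a shared co-author name start in one cluster.
    cluster_of = list(range(n))
    for i, j in combinations(range(n), 2):
        if refs[i][0] == refs[j][0] and coauthors[i] & coauthors[j]:
            old, new = cluster_of[j], cluster_of[i]
            cluster_of = [new if c == old else c for c in cluster_of]
    members = {}
    for i, c in enumerate(cluster_of):
        members.setdefault(c, set()).add(i)

    def neighborhood(c):
        """Clusters whose references co-occur with a member of cluster c."""
        nbrs = {cluster_of[j] for i in members[c] for j in papers[refs[i][1]]}
        nbrs.discard(c)
        return nbrs

    def similarity(a, b):
        na, nb = neighborhood(a), neighborhood(b)
        if b in na:  # relational constraint: co-authors are typically distinct
            return 0.0
        union = na | nb
        rel = len(na & nb) / len(union) if union else 0.0
        scores = [SequenceMatcher(None, refs[i][0], refs[j][0]).ratio()
                  for i in members[a] for j in members[b]]
        return w_attr * sum(scores) / len(scores) + w_rel * rel

    heap = [(-similarity(a, b), a, b) for a, b in combinations(members, 2)]
    heapq.heapify(heap)
    next_id = n
    while heap:
        neg, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue                          # stale: a cluster was merged away
        current = similarity(a, b)
        if abs(current + neg) > 1e-12:        # stale score: re-queue fresh value
            heapq.heappush(heap, (-current, a, b))
            continue
        if current < threshold:
            break                             # closest pair below threshold
        members[next_id] = members.pop(a) | members.pop(b)
        for i in members[next_id]:
            cluster_of[i] = next_id
        # Update similarities for the new cluster and the clusters related to it.
        for c in {next_id} | neighborhood(next_id):
            for other in members:
                if other != c:
                    heapq.heappush(heap, (-similarity(c, other), c, other))
        next_id += 1
    return sorted(members.values(), key=min)


refs = [("C. Walshaw", "P1"), ("S. Johnson", "P1"),
        ("C. Walshaw", "P2"), ("S. Johnson", "P2"),
        ("A. Aho", "P4"), ("J. Ullman", "P4"), ("Stephen C. Johnson", "P4"),
        ("A. Aho", "P5"), ("J. Ullman", "P5"), ("S. Johnson", "P5")]
clusters = relational_clustering(refs)
```

On this simplified version of the running example, the "S. Johnson" of P5 ends up with "Stephen C. Johnson" of P4 because both share the Aho and Ullman clusters, and stays apart from the "S. Johnson" of P1 and P2 even though those names match exactly.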
34Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Relational Clustering (RC-ER)
- Probabilistic Model (LDA-ER)
- SIAM SDM06, Best Paper Award
- Experimental Evaluation
- Query-time Entity Resolution
- ER User Interface
35Probabilistic Generative Model for Collective
Entity Resolution
- Model how references co-occur in data
- Generation of references from entities
- Relationships between underlying entities
- Groups of entities instead of pair-wise relations
36Discovering Groups from Relations
Bell Labs Group
Parallel Processing Research Group
Stephen C Johnson
Stephen P Johnson
Alfred V Aho
Ravi Sethi
Chris Walshaw
Kevin McManus
Mark Cross
Martin Everett
Jeffrey D Ullman
P1 C. Walshaw, M. Cross, M. G. Everett, S.
Johnson
P4 Alfred V. Aho, Stephen C. Johnson,
Jefferey D. Ullman
P2 C. Walshaw, M. Cross, M. G. Everett, S.
Johnson, K. McManus
P5 A. Aho, S. Johnson, J. Ullman
P6 A. Aho, R. Sethi, J. Ullman
P3 C. Walshaw, M. Cross, M. G. Everett
37LDA-ER Model
- Entity label a and group label z for each reference r
- θ: mixture of groups for each co-occurrence
- Φz: multinomial for choosing entity a for each group z
- Va: multinomial for choosing reference r from entity a
- Dirichlet priors with α and β
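To make the generative story concrete, here is a small forward-sampling sketch. It is only an illustration under simplifying assumptions: the per-group entity distributions (`phi`) and per-entity name distributions (`v`) are fixed, made-up tables rather than being drawn from a Dirichlet prior, and only the group mixture θ is sampled for each co-occurrence.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Symmetric Dirichlet(alpha) draw over k outcomes via normalized gammas."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_cooccurrence(phi, v, alpha, size, rng):
    """Generate one co-occurrence (e.g. an author list) of `size` references."""
    groups = list(phi)
    theta = sample_dirichlet(alpha, len(groups), rng)  # mixture of groups
    refs = []
    for _ in range(size):
        z = rng.choices(groups, weights=theta)[0]      # group label z
        entities = list(phi[z])
        a = rng.choices(entities, weights=[phi[z][e] for e in entities])[0]  # entity a
        names = list(v[a])
        r = rng.choices(names, weights=[v[a][name] for name in names])[0]    # reference r
        refs.append((z, a, r))
    return refs


# Hypothetical parameters loosely based on the groups in the talk's example.
phi = {
    "bell_labs": {"aho": 0.4, "ullman": 0.3, "johnson_sc": 0.3},
    "parallel": {"walshaw": 0.4, "cross": 0.3, "johnson_sp": 0.3},
}
v = {
    "aho": {"A. Aho": 0.5, "Alfred V. Aho": 0.5},
    "ullman": {"J. Ullman": 0.5, "Jefferey D. Ullman": 0.5},
    "johnson_sc": {"S. Johnson": 0.6, "Stephen C. Johnson": 0.4},
    "walshaw": {"C. Walshaw": 1.0},
    "cross": {"M. Cross": 1.0},
    "johnson_sp": {"S. Johnson": 1.0},
}
paper = generate_cooccurrence(phi, v, alpha=0.5, size=4, rng=random.Random(0))
```

Inference runs this process in reverse: only the reference strings r are observed, and the entity and group labels are what the sampler has to recover.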
38Generating References from Entities
- Entities are not directly observed
- Hidden attribute for each entity
- Similarity measure for pairs of attributes
- A distribution over attributes for each entity
39Approx. Inference Using Gibbs Sampling
- Conditional distribution over labels for each ref.
- Sample next labels from conditional distribution
- Repeat over all references until convergence
- Converges to most likely number of entities
40Faster Inference Split-Merge Sampling
- Naïve strategy reassigns references individually
- Alternative: allow entities to merge or split
- For entity ai, find conditional distribution for
- Merging with existing entity aj
- Splitting back to last merged entities
- Remaining unchanged
- Sample next state for ai from distribution
- O(n g + e) time per iteration compared to O(n g + n e)
41Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Relational Clustering (RC-ER)
- Probabilistic Model (LDA-ER)
- Experimental Evaluation
- Query-time Entity Resolution
- ER User Interface
42Evaluation Datasets
- CiteSeer
- 1,504 citations to machine learning papers (Lawrence et al.)
- 2,892 references to 1,165 author entities
- arXiv
- 29,555 publications from High Energy Physics (KDD Cup 03)
- 58,515 refs to 9,200 authors
- Elsevier BioBase
- 156,156 Biology papers (IBM KDD Challenge 05)
- 831,991 author refs
- Keywords, topic classifications, language, country and affiliation of corresponding author, etc.
43Baselines
- A: Pair-wise duplicate decisions w/ attributes only
- Names: Soft-TFIDF with Levenshtein, Jaro, Jaro-Winkler
- Other textual attributes: TF-IDF
- A*: Transitive closure over A
- A+N: Add attribute similarity of co-occurring refs
- A+N*: Transitive closure over A+N
- Evaluate pair-wise decisions over references
- F1-measure (harmonic mean of precision and recall)
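The pair-wise evaluation described above can be sketched as follows, assuming each reference carries one predicted and one true entity label. The function names are invented for this example.

```python
from itertools import combinations


def coreferent_pairs(labels):
    """All reference pairs (i, j), i < j, that carry the same entity label."""
    return {(i, j) for i, j in combinations(range(len(labels)), 2)
            if labels[i] == labels[j]}


def pairwise_f1(predicted, truth):
    """Precision, recall and F1 over pair-wise co-reference decisions."""
    pred, gold = coreferent_pairs(predicted), coreferent_pairs(truth)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(gold) if gold else 1.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Four references; the truth has two entities, the prediction over-merges one reference.
truth = ["e1", "e1", "e2", "e2"]
predicted = ["c1", "c1", "c1", "c2"]
precision, recall, f1 = pairwise_f1(predicted, truth)
```

In this toy case one of three predicted pairs is correct and one of two true pairs is found, so precision is 1/3, recall is 1/2 and F1 is 0.4.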
44ER over Entire Dataset
- RC-ER & LDA-ER outperform baselines in all datasets
- Collective resolution better than naïve relational resolution
- BioBase: Biggest improvement over baselines
- arXiv: 6,500 additional correct resolutions, 20% err. red.
- CiteSeer: Near perfect resolution, 22% error reduction
45ER over Entire Dataset
- RC-ER and baselines require threshold as parameter
- Best achievable performance over all thresholds
- Best RC-ER performance better than LDA-ER
- LDA-ER does not require similarity threshold
46Performance for Specific Names
arXiv: Significantly larger improvements for ambiguous names
47Trends in Synthetic Data
- Bigger improvement with
- bigger % of ambiguous refs
- more refs per co-occurrence
- more neighbors per entity
48Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Relational Clustering (RC-ER)
- Probabilistic Model (LDA-ER)
- Experimental Evaluation
- Query-time Entity Resolution
- KDD06
- ER User Interface
49Query-time ER Motivation
- Most publicly available databases do not have resolved entities
- PubMed, CiteSeer have unresolved authors
- Query processing requires resolved entities
- Retrieve papers by S. Johnson of Bell Labs
50Entity Resolution Queries
P1: Jostle ..., C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: ... Parallel Machine Topologies, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P5: Deterministic Parsing ..., A. Aho, S. Johnson, J. Ullman
- Disambiguation Query
- Among papers with S Johnson as author, find those by the Bell Labs researcher
- Resolution Query
- Do disambiguation
- Also retrieve papers by the Bell Labs researcher with a different author name, e.g. Stephen C Johnson
- P5: Deterministic Parsing ..., A. Aho, S. Johnson, J. Ullman
- P4: Code Generation ..., Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
51Query-time ER using Relations
- Possible directions
- Leave resolution burden on user
- Ask owner to clean database
- Develop techniques for query-time resolution
- Attribute-based query resolution
- Quick but not accurate
- Collective resolution for queries
- Extract relevant records by recursive expansion
- Collective resolution on extracted records
52Extracting Relevant Records
Start with query name or record
- Query: S. Johnson
- Level 0 (attr expansion): P4 Stephen C. Johnson, P5 S. Johnson, P2 S. Johnson, P1 S. Johnson
- Level 1 (relation expansion): P4 Alfred V. Aho, P5 A. Aho, P4 Jefferey D. Ullman, P5 J. Ullman, P2 K. McManus, P2 C. Walshaw, P1 C. Walshaw
- Level 2 (attr expansion): P A. Aho, P Alfred V. Aho, P J. Ullman, P Jefferey D. Ullman, P K. McManus, P K. McManus, P C. Walshaw, P C. Walshaw
- Alternate between
- Attribute expansion: For any relevant record, include other records with that name
- Relation expansion: For any relevant record, include other related records
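The alternating expansion can be sketched as a level-by-level traversal. This is an illustrative simplification: attribute expansion here matches names exactly (a real system would presumably also pull in similar names such as "Stephen C. Johnson"), and the record layout `(record_id, name, paper_id)` is invented for the example.

```python
def expand(records, query_name, depth):
    """Return the relevant records per level, alternating the two expansions."""
    by_name, by_paper = {}, {}
    for rec in records:
        by_name.setdefault(rec[1], []).append(rec)
        by_paper.setdefault(rec[2], []).append(rec)

    levels = [set(by_name.get(query_name, []))]  # level 0: attribute expansion of query
    seen = set(levels[0])
    for level in range(1, depth + 1):
        frontier = set()
        for rec in levels[-1]:
            if level % 2 == 1:
                frontier.update(by_paper[rec[2]])  # relation expansion: same paper
            else:
                frontier.update(by_name[rec[1]])   # attribute expansion: same name
        frontier -= seen                           # keep only newly reached records
        seen |= frontier
        levels.append(frontier)
    return levels


records = [
    (1, "S. Johnson", "P1"), (2, "C. Walshaw", "P1"),
    (3, "S. Johnson", "P2"), (4, "C. Walshaw", "P2"), (5, "K. McManus", "P2"),
    (6, "S. Johnson", "P5"), (7, "A. Aho", "P5"),
    (8, "A. Aho", "P6"), (9, "R. Sethi", "P6"),
]
levels = expand(records, "S. Johnson", depth=2)
level_ids = [sorted(rec[0] for rec in level) for level in levels]
```

For this toy input the levels contain records 1, 3, 6, then 2, 4, 5, 7, then 8: the query name first, then co-authors, then other records bearing the co-authors' names.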
53Adaptive Expansion
- Too many records with unconstrained expansion
- Adaptively select records based on ambiguity
- Smith is more ambiguous than McManus
- Use adaptive expansion
- Expand the more ambiguous records
- They need extra evidence
- When expanding, add fewer ambiguous records
- They lead to imprecision
- Large reduction in number of relevant records
54Ambiguity Estimation
- Probability of multiple entities sharing attribute value
- No. of entities with last name Smith
- No labeled data available
- Estimate last name ambiguity using other attributes
- No. of different first initials for last-name Smith
- Estimate improves with more independent attributes
A. Smith, B. Smith, D. Smith, G. Smith, K. Smith, M. Smith, P. Smith, R. Smith, S. Smith, T. Smith, ...
K. McManus
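The first-initial heuristic above is easy to sketch. The parsing rule (last whitespace-separated token is the last name, first character of the remainder is the initial) is an assumption made for this example and would need care with real author strings.

```python
from collections import defaultdict


def last_name_ambiguity(names):
    """Count distinct first initials per last name as a proxy for ambiguity."""
    initials = defaultdict(set)
    for name in names:
        first, _, last = name.rpartition(" ")
        if first:                        # skip references without a first name
            initials[last].add(first[0].upper())
        else:
            initials[last]               # register the last name with no initials
    return {last: len(found) for last, found in initials.items()}


refs = ["A. Smith", "B. Smith", "D. Smith", "A. Smith", "K. McManus"]
ambiguity = last_name_ambiguity(refs)
```

Here Smith scores 3 and McManus scores 1, so an adaptive expansion would spend its budget on gathering evidence for the Smith references first.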
55QT-ER Evaluation Datasets
- arXiv High Energy Physics
- 29,555 publications, 58,515 refs to 9,200 authors
- Queries: All ambiguous names (75 total)
- True authors per name: 2 to 11 (avg. 2.4)
- Elsevier BioBase
- 156,156 publications, 831,991 author refs
- Queries: 100 most frequent names
- True authors per name: 1 to 100 (avg. 32)
56Growth Rate of Relevant Records and Query
Processing Time
Number of relevant references grows rapidly with
expansion depth
RC-ER is fast but not good enough for query-time
resolution
57QT-ER Results
- Unconstrained expansion
- Collective resolution more accurate
- Accuracy improves beyond depth 1
- Adaptive expansion
- Minimal loss in accuracy
- Dramatic reduction in query processing time
AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
58Entity Resolution
- The Problem
- Relational Entity Resolution
- Algorithms
- Relational Clustering (RC-ER)
- Probabilistic Model (LDA-ER)
- Experimental Evaluation
- Query-time Entity Resolution
- ER User Interface
- VAST06
59D-Dupe: An Interactive Tool for Entity Resolution
http://www.cs.umd.edu/projects/linqs/ddupe
60Current ER Projects
- Entity Resolution in Geospatial Data
- Using spatial information, location name information and location type information
- ACMGIS06
- Name Reference Resolution in Email
- Goal: Figure out who is being talked about
- Make use of traffic patterns to infer social network
- SDM06
- Currently investigating adaptive context construction
- Elsayed, Namata, Oard, under review
- Ontology Alignment (work w/ Octavian Udrea, Renee Miller)
- Combines relational clustering with logical inference (e.g. equivalence and subsumption)
- Results in a 40% improvement in recall on 30 OWL lite ontology pairs
- under review
61ER for GIS Data - Identification
Dataset A
Dataset B
Match
62ER for GIS Data - Disambiguation
Dataset A
Dataset B
Not Match!
63ER for GIS Data - Identification
Not Match
64An Example
location reference li ∈ Dataset A
location reference lj ∈ Dataset B
li.name: Qaryat an Nuaymiyah
lj.name: Qaryat an Naimiyah
li.coordinates: (lati, longi)
lj.coordinates: (latj, longj)
li.type: Populated place
lj.type: City
Match!
65Conclusion
- Projects
- Link-based Classification and Prediction
- Predictive Models for Social Network Analysis
- Temporal Analysis of Email Traffic Network
- Reputation-based Spam filtering
- Ontology Alignment and Schema Mapping
- Feature Generation for Sequences (biological data)
- Protein Interaction Prediction (biological data)
- Probabilistic Databases
- SRL/Link Mining is an emerging research area at the intersection of statistical machine learning, logical reasoning and visualization.
- In reality, want to be able to flexibly combine node, edge and graph-based inferences
- While there are important pitfalls to take into account (confidence and privacy), there are many potential benefits and payoffs
66Thanks!
http://www.cs.umd.edu/getoor
Work sponsored by the National Science
Foundation, KDD program and National Geospatial
Agency