CLIP Colloquium Series

1
Statistical Relational Learning and Entity
Resolution

Lise Getoor
University of Maryland, College Park

2
What is SRL?

Traditional statistical machine learning
approaches assume
A random sample of homogeneous objects from a
single relation
Traditional relational learning approaches
assume
No noise or uncertainty in data
Real world data sets
Multi-relational and heterogeneous
Noisy and uncertain
Statistical Relational Learning (SRL)
newly emerging research area at the intersection
of statistical models and relational
learning/inductive logic programming
Sample Domains
web data, social networks, biological data,
communication data, customer networks, sensor
networks, natural language, vision,

3
SRL Theory

Directed Approaches
Semantics based on Bayesian Networks
Frame-based Directed Models
Rule-based Directed Models
Undirected Approaches
Semantics based on Markov Networks
Frame-based Undirected Models
Rule-based Undirected Models

modeling logical vs. statistical dependencies,
feature construction, instances vs. classes,
effective inference, use of labeled and unlabeled
data, link prediction, open vs. closed world
Reference: Upcoming book on Statistical
Relational Learning w/ Ben Taskar
4
SRL Application: Link Mining

Data
Structured Input: Mining graphs and networks
Structured Output: Extracting entities and
relationships from unstructured data
Taxonomy
Node Centric
Labeling/ranking nodes (aka Collective
Classification/PageRank)
Consolidating nodes (aka Entity Resolution)
Discovering hidden nodes (aka Group Discovery)
Edge Centric
Labeling/ranking edges
Predicting the existence, number of edges
Discovering new relations/paths
Graph/Subgraph Centric
Discovering frequent subpatterns
Metadata discovery, extraction, and reformulation

Reference: SIGKDD Explorations Special Issue on
Link Mining, Dec. 2005, w/ Chris Diehl, JHU APL
5
LINQS Group @ UMD

Members
myself, Indrajit Bhattacharya, Mustafa Bilgic,
Rezarta Islamaj, Hyunmo Kang, Louis Licamele,
Galileo Namata, Prithviraj Sen, Vivek Sehgal,
Elena Zheleva
Projects
Link-based Classification
Entity Resolution (ER)
Algorithms
Query-time ER
User Interface
Predictive Models for Social Network Analysis
Abstraction in Affiliation Networks
Social Capital in Friendship Event Networks
Role Identification and Relationship Ranking
Temporal Analysis of Email Traffic Networks
Reputation-based Spam Filtering and Privacy
Feature Generation and Protein Interaction Prediction
(biological data)
Ontology Alignment and Schema Mapping
Probabilistic Databases
Cost-sensitive Feature Acquisition

Graduated this fall; starting at IBM Delhi in April
8
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Graph-based Clustering (GBC)
Probabilistic Model (LDA-ER)
Query-time Entity Resolution
ER User Interface

9
InfoVis Co-Author Network Fragment
10
The Entity Resolution Problem
James Smith
John Smith
John Smith
Jim Smith
J Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith

Issues
Identification
Disambiguation

11
Attribute-based Entity Resolution
Pair-wise classification (similarity score per pair of references)
J Smith vs. James Smith: ?
Jim Smith vs. James Smith: 0.8
John Smith vs. James Smith: 0.1
James Smith vs. Jon Smith: 0.7
James Smith vs. Jonthan Smith: 0.05

Choosing threshold: precision/recall tradeoff
Inability to disambiguate
Perform transitive closure?
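
A minimal sketch of the attribute-only approach on this slide: score every pair of references, declare a match above a threshold, then take the transitive closure of the matches. The similarity function here (Python's difflib ratio) and the function names are assumptions for illustration, not the measures used in the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations


def name_similarity(a: str, b: str) -> float:
    """Stand-in attribute similarity in [0, 1] (an assumption for this sketch)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def resolve_pairwise(names, threshold):
    """Match pairs with similarity >= threshold, then merge matches
    transitively using a union-find structure."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if name_similarity(names[i], names[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, name in enumerate(names):
        clusters.setdefault(find(i), []).append(name)
    return list(clusters.values())


refs = ["James Smith", "John Smith", "Jim Smith",
        "J Smith", "Jon Smith", "Jonathan Smith"]
for t in (0.95, 0.85, 0.7):
    print(t, resolve_pairwise(refs, t))
```

Lowering the threshold can only merge clusters further, which is the precision/recall tradeoff named above; the closure step is also where one spurious match can chain unrelated references together.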

12
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Graph-based Clustering (GBC)
Probabilistic Model (LDA-ER)
Experimental Evaluation
Query-time Entity Resolution
ER User Interface

13
Relational Entity Resolution

References not observed independently
Links between references indicate relations
between the entities
Co-author relations for bibliographic data
To, cc lists for email
Use relations to improve identification and
disambiguation

14
Relational Identification
Very similar names. Added evidence from shared
co-authors
15
Relational Disambiguation
Very similar names but no shared collaborators
16
Relational Constraints
Co-authors are typically distinct
17
Collective Entity Resolution
One resolution provides evidence for another => joint
resolution
18
Entity Resolution with Relations

Naïve Relational Entity Resolution
Also compare attributes of related references
Two references have co-authors w/ similar names
Collective Entity Resolution
Use discovered entities of related references
Entities cannot be identified independently
Harder problem to solve

19
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Relational Clustering (RC-ER)
DMKD04, Wiley06, DE Bulletin06, TKDD07
Probabilistic Model (LDA-ER)
Experimental Evaluation
Query-time Entity Resolution
ER User Interface

20
P1: JOSTLE: Partitioning of Unstructured Meshes for Massively Parallel Machines, C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: Partitioning & Mapping of Unstructured Meshes to Parallel Machine Topologies, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P3: Dynamic Mesh Partitioning: A Unified Optimisation and Load-Balancing Algorithm, C. Walshaw, M. Cross, M. G. Everett
P4: Code Generation for Machines with Multiregister Operations, Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: Deterministic Parsing of Ambiguous Grammars, A. Aho, S. Johnson, J. Ullman
P6: Compilers: Principles, Techniques, and Tools, A. Aho, R. Sethi, J. Ullman
22
Relational Clustering (RC-ER)
P1: M. G. Everett, S. Johnson, C. Walshaw, M. Cross
P2: K. McManus, M. Everett, S. Johnson, C. Walshaw, M. Cross
P4: Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman
P5: S. Johnson, A. Aho, J. Ullman
26
Cut-based Formulation of RC-ER
M. G. Everett
S. Johnson
S. Johnson
M. Everett
S. Johnson
A. Aho
Stephen C. Johnson
Alfred V. Aho

Good separation of attributes
Many cluster-cluster relationships
Aho-Johnson1, Aho-Johnson2, Everett-Johnson1

Worse in terms of attributes
Fewer cluster-cluster relationships
Aho-Johnson1, Everett-Johnson2

28
Objective Function

Minimize

weight for attributes
weight for relations
similarity of attributes
1 iff relational edge exists between ci and cj

Greedy clustering algorithm: merge the cluster pair
with max reduction in objective function

Common cluster neighborhood
Similarity of attributes
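
The equation itself is missing from this transcript; only its term labels remain. One reading consistent with those labels is sketched below. The symbols w_A, w_R, sim_A, e_R and N(.), and the sum over cluster pairs, are assumptions for illustration rather than the speaker's notation.

```latex
% Objective: attribute similarity and relational edges between distinct clusters
\min_{\{c_1, \dots, c_k\}} \; \sum_{i < j}
  \Big[ w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, e_R(c_i, c_j) \Big],
\qquad
e_R(c_i, c_j) =
\begin{cases}
  1 & \text{if a relational edge exists between } c_i \text{ and } c_j \\
  0 & \text{otherwise}
\end{cases}

% Approximate reduction obtained by merging c_i and c_j
\Delta(c_i, c_j) \approx
  w_A \, \mathrm{sim}_A(c_i, c_j) + w_R \, \lvert N(c_i) \cap N(c_j) \rvert
```

Under this reading, a merge removes the pair's own attribute term and collapses the two edges to each shared neighbor cluster into one, which matches the two labels "similarity of attributes" and "common cluster neighborhood". The approximation ignores how attribute similarity to other clusters changes after the merge.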
29
Measures for Attribute Similarity

Use best available measure for each attribute
Name Strings: Soft TF-IDF, Levenshtein, Jaro
Textual Attributes: TF-IDF
Aggregate to find similarity between clusters
Single link, Average link, Complete link
Cluster representative
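
As an illustration of the two steps on this slide, the sketch below uses a normalized Levenshtein similarity for name strings and then aggregates reference-level scores to the cluster level with single, average, or complete link. The function names and the normalization by the longer string are assumptions of this sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming over two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


def name_sim(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cluster_sim(c1, c2, link="average"):
    """Aggregate pairwise reference similarities between two clusters."""
    sims = [name_sim(a, b) for a in c1 for b in c2]
    if link == "single":
        return max(sims)
    if link == "complete":
        return min(sims)
    if link == "average":
        return sum(sims) / len(sims)
    raise ValueError(f"unknown link type: {link}")


c1 = ["Stephen C. Johnson", "S. Johnson"]
c2 = ["S. Johnson", "Stephen Johnson"]
for link in ("single", "average", "complete"):
    print(link, round(cluster_sim(c1, c2, link), 3))
```

Single link rewards one close pair, complete link requires every pair to be close, and average link sits between the two.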

30
Relational Similarity Example 1
Comparing clusters {A. Aho} (P5) and {Alfred V. Aho} (P4)
Neighborhood clusters: {Stephen C. Johnson (P4), S. Johnson (P5)} and {Jefferey D. Ullman (P4), J. Ullman (P5)}
All neighborhood clusters are shared: high
relational similarity
31
Relational Similarity Example 2
Comparing clusters {S. Johnson, S. Johnson} and {Stephen C. Johnson, S. Johnson}
Neighborhood of the first: K. McManus (P2); C. Walshaw, C. Walshaw (P1, P2); M. G. Everett, M. Everett (P1, P2); M. Cross, M. Cross (P1, P2)
Neighborhood of the second: Alfred V. Aho, A. Aho (P4, P5); Jefferey D. Ullman, J. Ullman (P4, P5)
No neighborhood cluster is shared: no relational
similarity
32
Comparing Cluster Neighborhoods

Different measures of set similarity
Common Neighbors: Intersection size
Jaccard's Coefficient: Normalize by union size
Adar Coefficient: Weighted set similarity
Higher-order similarity: Consider nbrs of nbrs
Also consider neighborhood as multi-set
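
The first three measures can be sketched on plain Python sets of neighbor cluster labels. The Adamic/Adar form below (each shared neighbor weighted by 1 / log of its frequency) is one common definition and is an assumption here; the talk may use a different weighting. The frequency values in the example are made up.

```python
import math


def common_neighbors(n1: set, n2: set) -> int:
    """Size of the intersection of two neighborhoods."""
    return len(n1 & n2)


def jaccard(n1: set, n2: set) -> float:
    """Intersection size normalized by union size."""
    union = n1 | n2
    return len(n1 & n2) / len(union) if union else 0.0


def adamic_adar(n1: set, n2: set, frequency: dict) -> float:
    """Shared neighbors weighted so that rare neighbors count more.
    Neighbors with frequency <= 1 are skipped to avoid dividing by log(1) = 0."""
    return sum(1.0 / math.log(frequency[c])
               for c in n1 & n2 if frequency[c] > 1)


n1 = {"Aho", "Ullman"}
n2 = {"Aho", "Ullman", "Sethi"}
freq = {"Aho": 2, "Ullman": 2, "Sethi": 1}  # hypothetical neighbor frequencies
print(common_neighbors(n1, n2), jaccard(n1, n2), adamic_adar(n1, n2, freq))
```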

33
Relational Clustering Algorithm

Find similar references using blocking
Bootstrap clusters using attributes and relations
Compute similarities for cluster pairs and insert
into priority queue
Repeat until priority queue is empty
Find closest cluster pair
Stop if similarity below threshold
Merge to create new cluster
Update similarity for related
clusters
O(n k log n) algorithm w/ efficient
implementation
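
The loop above can be sketched as follows. This is a simplified illustration, not the O(n k log n) implementation: it skips blocking and bootstrapping, uses difflib as a stand-in attribute measure, Jaccard for the neighborhoods, and a weight `alpha` to combine the two. All names and parameter values are assumptions of this sketch.

```python
import heapq
from difflib import SequenceMatcher
from itertools import combinations


def attr_sim(a, b):
    # Stand-in string similarity in [0, 1] (an assumption for this sketch).
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def relational_clustering(names, papers, alpha=0.5, threshold=0.45):
    """names: {ref_id: name}; papers: lists of co-occurring ref_ids.
    Returns the final clusters as a list of sets of ref_ids."""
    members = {r: {r} for r in names}     # cluster id -> its references
    cluster_of = {r: r for r in names}    # reference -> current cluster id
    co_refs = {r: set() for r in names}   # reference -> co-occurring refs
    for paper in papers:
        for r in paper:
            co_refs[r].update(x for x in paper if x != r)

    def nbrs(c):
        return {cluster_of[x] for r in members[c] for x in co_refs[r]}

    def sim(c1, c2):
        n1, n2 = nbrs(c1), nbrs(c2)
        if c2 in n1:
            return 0.0  # relational constraint: co-authors are distinct
        attr = max(attr_sim(names[a], names[b])
                   for a in members[c1] for b in members[c2])
        union = n1 | n2
        rel = len(n1 & n2) / len(union) if union else 0.0
        return (1 - alpha) * attr + alpha * rel

    heap = [(-sim(a, b), a, b) for a, b in combinations(sorted(members), 2)]
    heapq.heapify(heap)
    while heap:
        neg_score, a, b = heapq.heappop(heap)
        if a not in members or b not in members:
            continue                     # one side was merged away
        score = sim(a, b)
        if abs(score + neg_score) > 1e-12:
            continue                     # stale score; a fresh entry exists
        if score < threshold:
            break                        # closest pair is too dissimilar
        members[a] |= members.pop(b)     # merge b into a
        for r in members[a]:
            cluster_of[r] = a
        for c in {a} | nbrs(a):          # similarities that may have changed
            for other in members:
                if other != c:
                    heapq.heappush(heap, (-sim(c, other), c, other))
    return list(members.values())


names = {"r1": "A. Aho", "r2": "J. Ullman", "r3": "A. Aho", "r4": "J. Ullman"}
papers = [["r1", "r2"], ["r3", "r4"]]
print(sorted(sorted(c) for c in relational_clustering(names, papers)))
# [['r1', 'r3'], ['r2', 'r4']]
```

In this toy input the two Ullman references start at a combined similarity of 0.5; once the two Aho references merge they share their whole neighborhood and the score rises to 1.0, which is the collective effect described earlier.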

34
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Relational Clustering (RC-ER)
Probabilistic Model (LDA-ER)
SIAM SDM06, Best Paper Award
Experimental Evaluation
Query-time Entity Resolution
ER User Interface

35
Probabilistic Generative Model for Collective
Entity Resolution

Model how references co-occur in data
Generation of references from entities
Relationships between underlying entities
Groups of entities instead of pair-wise relations

36
Discovering Groups from Relations
Bell Labs Group
Parallel Processing Research Group
Stephen C Johnson
Stephen P Johnson
Alfred V Aho
Ravi Sethi
Chris Walshaw
Kevin McManus
Mark Cross
Martin Everett
Jeffrey D Ullman
P1 C. Walshaw, M. Cross, M. G. Everett, S.
Johnson
P4 Alfred V. Aho, Stephen C. Johnson,
Jefferey D. Ullman
P2 C. Walshaw, M. Cross, M. G. Everett, S.
Johnson, K. McManus
P5 A. Aho, S. Johnson, J. Ullman
P6 A. Aho, R. Sethi, J. Ullman
P3 C. Walshaw, M. Cross, M. G. Everett
37
LDA-ER Model

Entity label a and group label z for each
reference r

Θ: mixture of groups for each co-occurrence

Φz: multinomial for choosing entity a for each
group z

Va: multinomial for choosing reference r from
entity a

Dirichlet priors with α and β
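
A rough sketch of the generative story these bullets describe, for a single co-occurrence: draw a mixture over groups from a Dirichlet prior, then for each reference draw a group, an entity from that group, and a reference string from that entity. The group and variant tables and their probabilities are made-up illustration values, and holding the per-group entity distributions fixed (instead of drawing them from their own prior) is a simplification of this sketch.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Symmetric Dirichlet draw via normalized Gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_cooccurrence(n_refs, phi, variants, alpha, rng):
    """phi[z]: {entity: prob} for group z; variants[a]: {reference: prob}.
    Returns a list of (group, entity, reference) triples."""
    theta = sample_dirichlet(alpha, len(phi), rng)   # mixture of groups
    out = []
    for _ in range(n_refs):
        z = rng.choices(range(len(phi)), weights=theta)[0]
        a = rng.choices(list(phi[z]), weights=list(phi[z].values()))[0]
        r = rng.choices(list(variants[a]),
                        weights=list(variants[a].values()))[0]
        out.append((z, a, r))
    return out


# Hypothetical groups, entities and name variants for illustration only.
phi = [
    {"Stephen C Johnson": 0.5, "Alfred V Aho": 0.5},
    {"Stephen P Johnson": 0.5, "Chris Walshaw": 0.5},
]
variants = {
    "Stephen C Johnson": {"Stephen C. Johnson": 0.5, "S. Johnson": 0.5},
    "Alfred V Aho": {"Alfred V. Aho": 0.5, "A. Aho": 0.5},
    "Stephen P Johnson": {"S. Johnson": 1.0},
    "Chris Walshaw": {"C. Walshaw": 1.0},
}
rng = random.Random(0)
for triple in generate_cooccurrence(3, phi, variants, alpha=1.0, rng=rng):
    print(triple)
```

Inference runs this story in reverse: only the reference strings are observed, and the entity and group labels are recovered.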

38
Generating References from Entities

Entities are not directly observed
Hidden attribute for each entity
Similarity measure for pairs of attributes
A distribution over attributes for each entity

39
Approx. Inference Using Gibbs Sampling

Conditional distribution over labels for each
ref.
Sample next labels from conditional distribution
Repeat over all references until convergence

Converges to most likely number of entities

40
Faster Inference: Split-Merge Sampling

Naïve strategy reassigns references individually
Alternative: allow entities to merge or split
For entity ai, find conditional distribution for
Merging with existing entity aj
Splitting back to last merged entities
Remaining unchanged
Sample next state for ai from distribution
O(n g e) time per iteration compared to O(n g
n e)

41
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Relational Clustering (RC-ER)
Probabilistic Model (LDA-ER)
Experimental Evaluation
Query-time Entity Resolution
ER User Interface

42
Evaluation Datasets

CiteSeer
1,504 citations to machine learning papers
(Lawrence et al.)
2,892 references to 1,165 author entities
arXiv
29,555 publications from High Energy Physics (KDD
Cup03)
58,515 refs to 9,200 authors
Elsevier BioBase
156,156 Biology papers (IBM KDD Challenge 05)
831,991 author refs
Keywords, topic classifications, language,
country and affiliation of corresponding author,
etc

43
Baselines

A: Pair-wise duplicate decisions w/ attributes
only
Names: Soft TF-IDF with Levenshtein, Jaro,
Jaro-Winkler
Other textual attributes: TF-IDF
A*: Transitive closure over A
A+N: Add attribute similarity of co-occurring
refs
A+N*: Transitive closure over A+N
Evaluate pair-wise decisions over references
F1-measure (harmonic mean of precision and recall)
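
Pair-wise evaluation can be sketched as below: every pair of references counts as a positive when both are assigned to the same cluster, and precision, recall and F1 are computed over those pairs. The function name and the dict-based input format are assumptions of this sketch.

```python
from itertools import combinations


def pairwise_f1(predicted: dict, truth: dict) -> float:
    """predicted, truth: {reference id: cluster or entity label}."""
    ids = sorted(predicted)
    pred_pairs = {(a, b) for a, b in combinations(ids, 2)
                  if predicted[a] == predicted[b]}
    true_pairs = {(a, b) for a, b in combinations(ids, 2)
                  if truth[a] == truth[b]}
    tp = len(pred_pairs & true_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


truth = {1: "x", 2: "x", 3: "x", 4: "y"}
predicted = {1: "a", 2: "a", 3: "b", 4: "b"}
# 1 of 2 predicted pairs is correct (precision 0.5);
# 1 of 3 true pairs is found (recall 1/3).
print(pairwise_f1(predicted, truth))
```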

44
ER over Entire Dataset

RC-ER and LDA-ER outperform baselines in all
datasets
Collective resolution better than naïve
relational resolution
BioBase: Biggest improvement over baselines
arXiv: 6,500 additional correct resolutions, 20%
error reduction
CiteSeer: Near perfect resolution, 22% error
reduction

45
ER over Entire Dataset

RC-ER and baselines require threshold as
parameter
Best achievable performance over all thresholds
Best RC-ER performance better than LDA-ER
LDA-ER does not require similarity threshold

46
Performance for Specific Names
arXiv: Significantly larger improvements for
ambiguous names
47
Trends in Synthetic Data

Bigger improvement with
bigger % of ambiguous refs
more refs per co-occurrence
more neighbors per entity

48
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Relational Clustering (RC-ER)
Probabilistic Model (LDA-ER)
Experimental Evaluation
Query-time Entity Resolution
KDD06
ER User Interface

49
Query-time ER Motivation

Most publicly available databases do not have
resolved entities
PubMed, CiteSeer have unresolved authors
Query processing requires resolved entities
Retrieve papers by S. Johnson of Bell Labs

50
Entity Resolution Queries
P1: Jostle ..., C. Walshaw, M. Cross, M. G. Everett, S. Johnson
P2: ... Parallel Machine Topologies, C. Walshaw, M. Cross, M. G. Everett, S. Johnson, K. McManus
P5: Deterministic Parsing ..., A. Aho, S. Johnson, J. Ullman

Disambiguation Query
Among papers with S Johnson as author, find
those by the Bell Labs researcher

Resolution Query
Do disambiguation
Also retrieve papers by the Bell Labs researcher
with a different author name, e.g. Stephen C
Johnson

P5: Deterministic Parsing ..., A. Aho, S. Johnson, J. Ullman
P4: Code Generation ..., Alfred V. Aho, Stephen C. Johnson, Jefferey D. Ullman

51
Query-time ER using Relations

Possible directions
Leave resolution burden on user
Ask owner to clean database
Develop techniques for query-time resolution
Attribute-based query resolution
Quick but not accurate
Collective resolution for queries
Extract relevant records by recursive expansion
Collective resolution on extracted records

52
Extracting Relevant Records
Query: S. Johnson
Attr expansion -> Level 0: P4 Stephen C. Johnson, P5 S. Johnson, P2 S. Johnson, P1 S. Johnson
Relation expansion -> Level 1: P4 Alfred V. Aho, P5 A. Aho, P4 Jefferey D. Ullman, P5 J. Ullman, P2 K. McManus, P2 C. Walshaw, P1 C. Walshaw
Attr expansion -> Level 2: P A. Aho, P Alfred V. Aho, P J. Ullman, P Jefferey D. Ullman, P K. McManus, P K. McManus, P C. Walshaw, P C. Walshaw
Start with query name or record

Alternate between
Attribute expansion: For any relevant record,
include other records with that name
Relation expansion: For any relevant record,
include other related records
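
A small sketch of the alternating expansion, run on the six example papers from earlier in the talk. Matching names by first initial and last name (`name_key`) is a crude stand-in for "records with that name" and is an assumption of this sketch, as are the function names.

```python
from collections import defaultdict

PAPERS = {
    "P1": ["C. Walshaw", "M. Cross", "M. G. Everett", "S. Johnson"],
    "P2": ["C. Walshaw", "M. Cross", "M. G. Everett", "S. Johnson", "K. McManus"],
    "P3": ["C. Walshaw", "M. Cross", "M. G. Everett"],
    "P4": ["Alfred V. Aho", "Stephen C. Johnson", "Jefferey D. Ullman"],
    "P5": ["A. Aho", "S. Johnson", "J. Ullman"],
    "P6": ["A. Aho", "R. Sethi", "J. Ullman"],
}
RECORDS = [(p, name) for p, authors in PAPERS.items() for name in authors]


def name_key(name):
    # Blocking key (first initial, last name); an assumption for this sketch.
    parts = name.replace(".", " ").split()
    return parts[0][0].lower(), parts[-1].lower()


def expand(query, records, depth):
    """Return the (paper, name) records relevant to `query` after `depth`
    expansion steps beyond the initial attribute expansion."""
    by_key, by_paper = defaultdict(set), defaultdict(set)
    for rec in records:
        by_key[name_key(rec[1])].add(rec)
        by_paper[rec[0]].add(rec)

    relevant = set(by_key[name_key(query)])  # level 0: names matching the query
    frontier = set(relevant)
    for level in range(1, depth + 1):
        reached = set()
        for paper, name in frontier:
            if level % 2 == 1:
                reached |= by_paper[paper]          # relation expansion
            else:
                reached |= by_key[name_key(name)]   # attribute expansion
        frontier = reached - relevant
        relevant |= frontier
    return relevant


for d in range(4):
    print(d, len(expand("S. Johnson", RECORDS, d)))
```

Even on six papers the relevant set grows from 4 records at level 0 to all 21 by depth 3.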

53
Adaptive Expansion

Too many records with unconstrained expansion
Adaptively select records based on ambiguity
Smith is more ambiguous than McManus
Use adaptive expansion
Expand the more ambiguous records
They need extra evidence
When expanding, add fewer ambiguous records
They lead to imprecision
Large reduction in number of relevant records

54
Ambiguity Estimation

Probability of multiple entities sharing
attribute value
No. of entities with last name Smith
No labeled data available
Estimate last name ambiguity using other
attributes
No. of different first initials for last-name
Smith
Estimate improves with more independent
attributes

A. Smith, B. Smith, D. Smith, G. Smith, K.
Smith, M. Smith, P. Smith, R. Smith, S. Smith,
T. Smith,
K. McManus
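
The first-initial heuristic can be sketched as a simple count per last name. Using the raw count as the ambiguity score, and the function name, are assumptions of this sketch; the name list is the example above with the truncated tail omitted.

```python
from collections import defaultdict


def estimate_ambiguity(names):
    """Number of distinct first initials seen for each last name."""
    initials = defaultdict(set)
    for name in names:
        parts = name.replace(".", " ").split()
        initials[parts[-1].lower()].add(parts[0][0].lower())
    return {last: len(firsts) for last, firsts in initials.items()}


names = ["A. Smith", "B. Smith", "D. Smith", "G. Smith", "K. Smith",
         "M. Smith", "P. Smith", "R. Smith", "S. Smith", "T. Smith",
         "K. McManus"]
print(estimate_ambiguity(names))
```

A higher count suggests a more ambiguous last name, so Smith records would be expanded for extra evidence while McManus records would not.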
55
QT-ER Evaluation Datasets

arXiv High Energy Physics
29,555 publications, 58,515 refs to 9,200 authors
Queries: All ambiguous names (75 total)
True authors per name: 2 to 11 (avg. 2.4)
Elsevier BioBase
156,156 publications, 831,991 author refs
Queries: 100 most frequent names
True authors per name: 1 to 100 (avg. 32)

56
Growth Rate of Relevant Records and Query
Processing Time
Number of relevant references grows rapidly with
expansion depth
RC-ER is fast but not good enough for query-time
resolution
57
QT-ER Results

Unconstrained expansion
Collective resolution more accurate
Accuracy improves beyond depth 1

Adaptive expansion
Minimal loss in accuracy
Dramatic reduction in query processing time

AX-2: adaptive expansion at depths 2 and beyond
AX-1: adaptive expansion even at depth 1
58
Entity Resolution

The Problem
Relational Entity Resolution
Algorithms
Relational Clustering (RC-ER)
Probabilistic Model (LDA-ER)
Experimental Evaluation
Query-time Entity Resolution
ER User Interface
VAST06

59
D-Dupe: An Interactive Tool for Entity Resolution
http://www.cs.umd.edu/projects/linqs/ddupe
60
Current ER Projects

Entity Resolution in Geospatial Data
Using spatial information, location name
information and location type information
ACMGIS06
Name Reference Resolution in Email
Goal: Figure out who is being talked about
Make use of traffic patterns to infer social
network
SDM06
Currently investigating adaptive context construction
Elsayed, Namata, Oard, under review
Ontology Alignment (work w/ Octavian Udrea, Renee
Miller)
Combines relational clustering with logical
inference (e.g. equivalence and subsumption)
Results in a 40% improvement in recall on 30 OWL
Lite ontology pairs
under review

61
ER for GIS Data - Identification
Dataset A
Dataset B
Match
62
ER for GIS Data - Disambiguation
Dataset A
Dataset B
Not Match!
63
ER for GIS Data - Identification
Not Match
64
An Example
location reference li ∈ Dataset A
location reference lj ∈ Dataset B
li.name: Qaryat an Nuaymiyah
lj.name: Qaryat an Naimiyah
li.coordinates: (lati, longi)
lj.coordinates: (latj, longj)
li.type: Populated place
lj.type: City
Match!
65
Conclusion

Projects
Link-based Classification and Prediction
Predictive Models for Social Network Analysis
Temporal Analysis of Email Traffic Network
Reputation-based Spam filtering
Ontology Alignment and Schema Mapping
Feature Generation for Sequences (biological
data)
Protein Interaction Prediction (biological data)
Probabilistic Databases
SRL/Link Mining is an emerging research area at
the intersection of statistical machine learning,
logical reasoning and visualization.
In reality, want to be able to flexibly combine
node, edge and graph-based inferences
While there are important pitfalls to take into
account (confidence and privacy), there are many
potential benefits and payoffs