Link Mining and Entity Resolution

Slide footer: JIKD Presentation, 08/17/05

1
Link Mining and Entity Resolution

Lise Getoor
University of Maryland, College Park

Students: Indrajit Bhattacharya, Mustafa Bilgic,
Rezarta Islamaj, Louis Licamele and Prithviraj Sen
2
Roadmap

Link Mining
Projects

3
Link Mining

Traditional machine learning and data mining
approaches assume
A random sample of homogeneous objects from a
single relation
Real-world data sets are
Multi-relational, heterogeneous and
semi-structured
Represented as a graph or network
Statistical Relational Learning (SRL)
A newly emerging research area at the intersection
of research in social network and link analysis,
hypertext and web mining, graph mining,
relational learning and inductive logic
programming
Sample Domains
web data, bibliographic data, epidemiological
data, communication data, customer networks,
collaborative filtering, trust networks,
biological data

4
What is SRL?

Three views

5
View 1: Alphabet Soup
LBN
CLP(BN)
SRM
PRISM
RDBN
RPM
SLR
BLOG
PLL
pRN
PER
PRM
SLP
MLN
HMRF
RMN
RNM
DAPER
RDBN
RDN
BLP
SGLR
6
View 2: Representation Soup

[Figure: Hierarchical Bayesian Model + Relational
Representation; Logic, add probabilities;
Probabilities, add relations; both routes lead to
Statistical Relational Learning]
7
View 3: Data Soup
Training Data
Test Data
13
Link Mining Tasks

Tasks
Object Classification
Object Type Prediction
Link Type Prediction
Predicting Link Existence
Link Cardinality Estimation
Entity Resolution
Group Detection
Subgraph Discovery
Metadata Mining

14
Sample Problem Domain

Research World
Researchers
Papers
Reviewers
Co-authors
Citations
Topics
Aka Tenure World

15
Object Prediction

Object Classification
Predicting the category of an object based on its
attributes and its links and attributes of linked
objects
e.g., predicting the topic of a paper based on
the words used in the paper, the topics of papers
it cites, the research interests of the author
Object Type Prediction
Predicting the type of an object based on its
attributes and its links and attributes of linked
objects
e.g., predict the venue type of a publication
(conference, journal, workshop) based on
properties of the paper

16
Link Prediction

Link Classification
Predicting type or purpose of link based on
properties of the participating objects
e.g., predict whether a citation is to
foundational work, background material,
gratuitous PC reference
Predicting Link Existence
Predicting whether a link exists between two
objects
e.g. predicting whether a paper will cite another
paper
Link Cardinality Estimation
Predicting the number of links to an object or
predicting the number of objects reached along a
path from an object
e.g., predict the number of citations of a paper

17
More complex prediction tasks

Group Detection
Predicting when a set of entities belong to the
same group based on clustering both object
attribute values and link structure
e.g., identifying research communities
Entity Resolution
Predicting when a collection of objects are the
same, based on their attributes and their links
(aka record linkage, identity uncertainty)
e.g., predicting when two citations are referring
to the same paper.
Predicate Invention
Induce a new general relation/link from existing
links and paths
e.g., propose concept of advisor from co-author
and financial support
Subgraph Identification, Metadata Mapping

18
SRL Challenges

Collective Classification
Collective Consolidation
Logical vs. Statistical dependencies
Feature Construction: aggregation, selection
Flexible and Decomposable Combining Rules
Instances vs. Classes
Effective Use of Labeled & Unlabeled Data
Link Prediction
Closed vs. Open World

Challenges common to any SRL approach! Bayesian
Logic Programs, Relational Logic Networks,
Probabilistic Relational Models, Relational
Markov Networks, Relational Probability Trees,
Stochastic Logic Programming, to name a few
19
Collective classification

Using a link-based statistical model for
classification
Inference using learned model is complicated by
the fact that there is correlation between the
object labels
Must find a labeling that maximizes the joint
(conditional) probability
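
One common way to approximate such a joint labeling is to relabel the unknown objects repeatedly, each time using the current labels of their neighbors. The sketch below is only an illustration under assumptions: the toy graph, the labels and the majority-vote local model are hypothetical and are not taken from the slides.

```python
from collections import Counter

def iterative_classification(neighbors, known, default, max_iters=10):
    """Approximate a joint labeling by repeatedly relabeling unknown nodes.

    neighbors: dict node -> list of linked nodes (hypothetical toy graph)
    known:     dict node -> observed label
    default:   label used when no linked node is labeled yet
    """
    labels = dict(known)
    unknown = sorted(n for n in neighbors if n not in known)
    for _ in range(max_iters):
        changed = False
        for node in unknown:
            votes = Counter(labels[m] for m in neighbors[node] if m in labels)
            # Stand-in local model: majority label among linked nodes.
            # sorted() makes tie-breaking deterministic.
            new = max(sorted(votes), key=votes.get) if votes else default
            if labels.get(node) != new:
                labels[node] = new
                changed = True
        if not changed:
            break  # labels are stable, so stop early
    return labels

neighbors = {
    "p1": ["p2", "p3"], "p2": ["p1", "p3"], "p3": ["p1", "p2", "p4"],
    "p4": ["p3", "p5"], "p5": ["p4", "p6"], "p6": ["p5"],
}
known = {"p1": "ML", "p2": "ML", "p6": "DB"}
result = iterative_classification(neighbors, known, default="ML")
```

Because each unknown label depends on other unknown labels, a single pass is not enough; the loop stops once a full pass changes nothing.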

20
Collective consolidation

Using a link-based statistical model for object
consolidation
Consolidation decisions should not be made
independently
Must find a clustering that maximizes the joint
(conditional) probability

21
Logical vs. Statistical Dependence

Coherently handling two types of dependence
structures
Link structure - the logical relationships
between objects
Probabilistic dependence - statistical
relationships between attributes
Challenge: statistical models that support rich
logical relationships
Model search is complicated by the fact that
attributes can depend on arbitrarily linked
attributes. Issue: how to search this huge
space

22
Model Search
[Figure: graph over nodes P1, P2, P3, I1 and A1; a node labeled P is marked "?"]
23
Feature Construction

In many cases, objects are linked to a set of
objects. To construct a single feature from this
set of objects, we may either use
Aggregation
Selection
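
A small sketch of the two options, assuming a hypothetical paper whose cited papers each carry a topic and a citation count (all values below are made up for illustration):

```python
from collections import Counter

def aggregate_mode(values):
    """Aggregation: summarize the whole linked set by its most common value."""
    counts = Counter(values)
    return max(sorted(counts), key=counts.get)

def aggregate_counts(values, categories):
    """Aggregation: one count feature per category."""
    counts = Counter(values)
    return [counts[c] for c in categories]

def select_one(linked, score):
    """Selection: keep the single linked object that scores highest."""
    return max(linked, key=score)

# Hypothetical set of papers cited by the paper being classified.
cited = [
    {"id": "P1", "topic": "theory", "citations": 120},
    {"id": "P2", "topic": "AI", "citations": 15},
    {"id": "P3", "topic": "AI", "citations": 40},
]
topics = [p["topic"] for p in cited]
mode_topic = aggregate_mode(topics)
count_features = aggregate_counts(topics, ["AI", "theory", "DB"])
most_cited = select_one(cited, score=lambda p: p["citations"])
```

Aggregation uses every linked object but blurs them together; selection keeps one object's attributes intact but discards the rest.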

24
Aggregation
[Figure: nodes P1, P2, P3, I1 and A1; a node labeled P is marked "?"]
25
Selection
[Figure: nodes P1, P2, P3, I1 and A1; a node labeled P is marked "?"]
26
Individuals vs. Classes

Does the model refer
explicitly to individuals, or to
classes or generic categories of individuals?
On one hand, we'd like to be able to model that a
connection to a particular individual may be
highly predictive
On the other hand, we'd like our models to
generalize to new situations, with different
individuals

27
Instance-based Dependencies
P3
P3
I1
A1
Papers that cite P3 are likely to be
28
Class-based Dependencies
?
?
I1
A1
Papers that cite are likely to be
29
Labeled & Unlabeled Data

In link-based domains, unlabeled data provide
three sources of information
Helps us infer object attribute distribution
Links between unlabeled data allow us to make use
of attributes of linked objects
Links between labeled data and unlabeled data
(training data and test data) help us make more
accurate inferences

30
Link Prior Probability

The prior probability of any particular link is
typically extraordinarily low
For medium-sized data sets, we have had success
with building explicit models of link existence
It may be more effective to model links at a higher
level; this is required for large data sets!

31
Closed World vs. Open World

The majority of SRL approaches make a closed
world assumption, which assumes that we know all
the potential entities in the domain
In many cases, this is unrealistic
Work by Milch, Marthi, Russell on BLOG

32
SRL Tasks & Challenges Summary

Tasks
Link-based Object Classification
Object Type Prediction
Link Type Prediction
Predicting Link Existence
Link Cardinality Estimation
Entity Resolution
Group Detection
Predicate Invention
Subgraph Discovery
Metadata Mining
Issues & Challenges
Collective Classification
Collective Consolidation
Logical vs. Statistical dependencies
Feature construction
Instances vs. Classes
Effective Use of Labeled & Unlabeled Data
Link Prediction
Closed vs. Open World

33
Current Projects

Link-based Classification
Link-based Entity Resolution
Social Network Analysis
Affiliation Networks
Structural and descriptive modeling
Friendship Event Networks
Definitions of Capital and Benefit
Link Mining for the Semantic Web
Feature Generation for Sequences (biological
data)
Schema Maintenance and Discovery

34
Link-based Classification

Predicting the category of an object based on its
attributes and its links and attributes of linked
objects

[Figure: graph of linked objects labeled A, B and C, with one unlabeled object marked "?"]
35
Our Approach

Investigate use of labeled and unlabeled data for
classification
Learning of models
iterative algorithm
Prediction
links among (unlabeled) test and (labeled)
training data
Requires collective classification
Link-based models
Integrate link features with object attributes
using logistic regression

36
Experiment
37
Projects

Link-based Classification
Link-based Entity Resolution
Social Network Analysis
Affiliation Networks
Structural and descriptive modeling
Friendship Event Networks
Definitions of Capital and Benefit
Link Mining for the Semantic Web
Feature Generation for Sequences (biological
data)
Schema Maintenance and Discovery

38
James Smith
John Smith
Jim Smith
John Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith
39
Generalized Entity Resolution

Discover the domain entities
Map each reference to an entity
Identification
References to the same entity may look different
Jonathan Smith, Jon Smith, Jonthan Smith
Distinction/Disambiguation
References to different entities may look similar
Jon Smith, John Smith

40
Entity Resolution Domains

Databases
Deduplication in Data Cleaning
Data Integration with similarity joins
Natural Language Processing
Noun co-reference
Sense disambiguation
Named entity recognition
Computer Vision
Correspondence Problem

41
Issues

Reference attributes
Multiple Entity Types
Relational Reference Data
Collective Resolution
Group Detection
Entity Ontologies

43
ER using Reference Attributes

Identify references with similar attributes
Define (and learn) attribute similarity measures
Resolve references pairwise, then take the transitive closure
Problem: similarity threshold for resolution
Better identification calls for lower threshold
Better distinction calls for higher threshold
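
The threshold trade-off can be seen in a minimal sketch of attribute-only resolution. The similarity function here is `difflib.SequenceMatcher.ratio` from the Python standard library, chosen only as a stand-in; it is an assumption, not the measure used in the talk.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a, b):
    """Stand-in attribute similarity in [0, 1] (difflib ratio, an assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve(references, threshold):
    """Link pairs at or above the threshold, then take the transitive closure."""
    parent = list(range(len(references)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        if similarity(references[i], references[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref)
    return sorted(clusters.values())

refs = ["Jonathan Smith", "Jonthan Smith", "Jon Smith", "John Smith", "James Smith"]
loose = resolve(refs, threshold=0.0)   # everything is merged
strict = resolve(refs, threshold=1.0)  # only identical strings are merged
```

The two extremes bracket the problem: a low threshold merges references that should stay distinct, and a high one keeps apart references that belong together. Transitive closure amplifies mistakes at low thresholds, since one bad pair joins two whole clusters.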

44
Motivation for Relational ER

References may not be observed independently in
data
References are linked
Link is a set of related references
Represents relations among underlying entities
E.g. parent-dependent or sibling relations among
person records in census database
Links can help in identification and distinction

45
Example References In Census Data
Jon
Jim
Liz
Jon
James
P
John
Gwyneth
Betsy
J
Gwen
Elizabeth
Jonthan
Don
Paul
Jonathan
Laura
Betsy
Sharon
Ron
Kate
L
J
D
Typos: Jon → Jonathan? John?
Initials: J → James? John? Jonathan? Or none?
46
Example Links In Census Data
[Figure: the same census references as on the previous slide, now connected by links]
Links represent family relations
47
Example Inference from Links
[Figure: the same census references and links]
Ambiguity is almost eliminated
Ambiguity is reduced
48
Entity Resolution From Relational Data

References with similar attributes that have
similar relations as well are more likely to be
the same entity
ER Approach 1: Cluster references using relational similarity

49
Entity Resolution Using Group Membership Evidence

Links represent correlations among entities
Some entities more likely to co-occur in links
than others
ER Approach 2: Capture correlations explicitly
with latent group variable
Entities are members of possibly overlapping
groups
Entities in same group more likely to form links

50
Entity Resolution Using Group Membership Evidence
[Figure: the census references partitioned into Familial Group 1 and
Familial Group 2. Annotations: two pairs of references belong to the same
familial group; one pair belongs to different familial groups]
51
Entity Resolution Using Group Membership Evidence

Group Detection is interesting and important
Collaborative groups in social sciences and
bibliometry
Semantic word groups from natural language
corpora
By-product of ER using groups: group detection
from ambiguous references

52
Collective Entity Resolution from Relations

Resolutions cannot be made independently for
different references
Dependency flows between resolution decisions
through reference links

J Smith's wife Betsy is the same as the Betsy who is
the mother of Paul. Paul is the same as P Smith,
who is John Smith's son.

53
Collective Entity Resolution from Relations

Resolutions cannot be made independently for
different references
Dependency flows between resolution decisions
through reference links

When modeling groups, entity resolutions depend
on groups, groups depend on resolved entities

54
Evaluation Domains

Bibliographic Data
Author resolution using co-author links
Relational Clustering (RC-ER)
(DMKD 04, LinkKDD 04,
submitted Book Chapter)
LDA based Group model (LDA-ER)
(under review)
Natural Language
Sense resolution using translation links in
parallel corpora (ACL 04)
Sense Model: senses in different languages depend
directly on each other
Concept Model: semantic sense groups, or concepts,
relate senses from different languages

55
Domain 1: Bibliographic Entity Resolution

Resolve author, paper, venue, publisher entities
from citation strings
R. Agrawal, R. Srikant. Fast algorithms for
mining association rules in large databases. In
VLDB-94, 1994.
Rakesh Agrawal and Ramakrishnan Srikant. Fast
Algorithms for Mining Association Rules. In
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994.

57
Exploiting Bibliographic Links
[Figure: the two citations drawn as linked references]
Citation 1: R. Agrawal and R. Srikant (co-author) each
write "Fast algorithms for mining association rules in
large databases", published in VLDB-94, 1994
Citation 2: Rakesh Agrawal and Ramakrishnan Srikant (co-author) each
write "Fast Algorithms for Mining Association Rules",
published in Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994
59
Exploiting Bibliographic Links
entity 1
R. Agrawal
Rakesh Agrawal
entity 2
Ramakrishnan Srikant
R. Srikant
entity 3
Fast algorithms for mining association rules in
large databases
Fast Algorithms for Mining Association Rules
entity 4
VLDB-94, 1994
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994
61
Approach 1: ER using Relational Clustering (RC-ER)

Iteratively cluster similar references into
entities

[Figure, built up over slides 61 to 65: each reference starts in its own
cluster, c1 to c8, and similar clusters are merged step by step]
Ramakrishnan Srikant, R. Srikant: merged into c9
R. Agrawal, Rakesh Agrawal (c1, c2): merged into c10
Fast algorithms for mining association rules in
large databases, Fast Algorithms for Mining Association Rules (c5, c6):
merged into c11
VLDB-94, 1994, Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994 (c7, c8): merged into c12
66
Similarity Measure For Clustering

Linear combination of attribute and relational
similarity of reference clusters
$\mathrm{sim}(c_i, c_j) = (1 - \alpha)\,\mathrm{sim}_{attr}(c_i, c_j) + \alpha\,\mathrm{sim}_{rel}(c_i, c_j)$
Attribute similarity measure
Several measures available for pairs of strings:
Levenshtein, Smith-Waterman, Jaro
Combine pairwise measures for attribute
similarity of two reference clusters:
single link, average link, complete link
Representative attribute for clusters
67
Relational Similarity Measure

Cluster similarity captures the dependence between
resolution decisions through links
Each reference cluster c_i has its link set H(c_i):
one link for each reference in c_i
Capture similarity of links in two clusters

68
Edge Detail Similarity

Similarity of two links depends on their
references
Consider resolution decisions on the references

Both links connect to cluster 9
69
Edge Detail Similarity

Similarity of two links depends on their
references
Consider resolution decisions on the references
Label set E_h(i) of the i-th link:
multiset of cluster labels of its references
$\mathrm{sim}_h(i, j) = \mathrm{Jaccard}(E_h(i), E_h(j))$
Edge Detail Similarity of two clusters
$\mathrm{sim}_{rel}(c, c') = \min(\mathrm{sim}_h(i), \mathrm{sim}_h(j)),\ i \in H(c),\ j \in H(c')$

70
Neighborhood Similarity

Edge detail similarity is expensive
Ignore explicit link structure
Consider only set of neighborhood clusters
Clusters c1, c2 still similar in terms of
relationships

[Figure: clusters c1 and c2 joined by links 1 to 4 to neighboring clusters c3, c4 and c5]
71
Neighborhood Similarity

Edge detail similarity is expensive
Ignore explicit link structure
Consider only set of neighborhood clusters
N(c): multiset of cluster labels covered by
links in H(c)
Neighborhood similarity of two clusters
$\mathrm{sim}_{rel}(c, c') = \mathrm{Jaccard}(N(c), N(c'))$
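
A minimal sketch of neighborhood similarity on multisets, using `collections.Counter`. Two choices here are assumptions: the multiset Jaccard is taken as the summed minimum counts over the summed maximum counts, and a cluster's own label is left out of its neighborhood. The links and reference ids are made up.

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Jaccard on multisets: summed min counts over summed max counts."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0

def neighborhood(cluster, links_of, label_of):
    """N(c): cluster labels of the other references on links in H(c)."""
    labels = Counter()
    for link in links_of[cluster]:
        labels.update(label_of[ref] for ref in link
                      if label_of[ref] != cluster)
    return labels

# Hypothetical links: each link is a list of reference ids.
label_of = {"r1": "c1", "r2": "c1", "r3": "c2", "r4": "c2",
            "r5": "c3", "r6": "c4", "r7": "c5", "r8": "c3",
            "r9": "c5", "r10": "c4"}
links_of = {
    "c1": [["r1", "r5", "r6"], ["r2", "r7"]],
    "c2": [["r3", "r8", "r9"], ["r4", "r10"]],
}
n1 = neighborhood("c1", links_of, label_of)
n2 = neighborhood("c2", links_of, label_of)
sim_rel = multiset_jaccard(n1, n2)
```

Here c1 and c2 never share a link with identical membership, yet their neighborhoods match, which is the case the neighborhood measure is meant to catch cheaply.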

72
Evaluation Datasets

CiteSeer
Machine Learning Citations
Originally created by Lawrence et al.
2,892 references to 1,165 true authors
1,504 links
arXiv HEP
Papers from High Energy Physics
Used for KDD-Cup 03 Data Cleaning Challenge
58,515 references to 9,200 true authors
29,555 links

73
Baseline

Pairwise duplicate decisions using Soft-TFIDF
(ATTR)
Secondary string similarity: Scaled
Levenshtein (SL), Jaro (JA), Jaro-Winkler (JW)
Transitive closure over pairwise decisions
(ATTR*)
Precision, recall and F1 over pairwise decisions
Both algorithms require a similarity threshold
Report best performance over all thresholds
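
Pairwise precision, recall and F1 can be computed by comparing the set of co-clustered reference pairs in the prediction against the truth. The sketch below assumes clusterings are given as lists of sets of reference ids; the toy clusterings are hypothetical.

```python
from itertools import combinations

def duplicate_pairs(clusters):
    """All unordered pairs of references placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        pairs.update(frozenset(p) for p in combinations(sorted(cluster), 2))
    return pairs

def pairwise_prf(predicted, truth):
    """Precision, recall and F1 over pairwise duplicate decisions."""
    pred, true = duplicate_pairs(predicted), duplicate_pairs(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(true) if true else 1.0
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

truth = [{"r1", "r2", "r3"}, {"r4"}, {"r5", "r6"}]
predicted = [{"r1", "r2"}, {"r3", "r4"}, {"r5", "r6"}]
p, r, f1 = pairwise_prf(predicted, truth)
```

In this toy case the prediction proposes three pairs, two of which are among the four true pairs.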

74
Results: F1 for Different String Similarity
Measures

For each measure, neighborhood similarity does better
than ATTR and ATTR*, and edge detail does better
than neighborhood

75
Results: Varying Combination Weight using
Bootstrapping
77
Results: Execution Time
78
Results: Best F1

Relational measures improve performance over
attribute baseline in terms of precision, recall
and F1
Neighborhood similarity performs almost as well as
edge detail
Neighborhood similarity faster than edge detail

79
Approach 2: Latent Dirichlet Model for ER

Probabilistic model of entity collaboration
groups
Entities (authors) belong to groups
Entities (authors) in a link (document) depend on
the groups that are involved
Latent group variable for each reference
Group labels and entity labels unobserved

80
LDA for Author Entities of Documents

Adapt the LDA model for author entities in
documents
Each document has a distribution θ over groups
Each group z has a distribution φ_z over author
entities
For each author entity, sample a group z from θ,
and sample an entity from φ_z
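
The generative steps above can be sketched as follows. The Dirichlet draw via normalized gamma samples, the parameter name `alpha`, and the toy group distributions are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_authors, group_entity_dist, alpha, rng):
    """Generate the author entities of one document."""
    theta = sample_dirichlet(alpha, rng)  # document's distribution over groups
    groups = list(range(len(group_entity_dist)))
    authors = []
    for _ in range(n_authors):
        z = rng.choices(groups, weights=theta)[0]  # group for this author slot
        entities = list(group_entity_dist[z])
        weights = [group_entity_dist[z][e] for e in entities]
        authors.append((z, rng.choices(entities, weights=weights)[0]))
    return theta, authors

# Hypothetical per-group distributions over author entities.
phi = [
    {"author_a": 0.5, "author_b": 0.5},
    {"author_c": 0.6, "author_d": 0.4},
]
rng = random.Random(0)
theta, authors = generate_document(3, phi, alpha=[1.0, 1.0], rng=rng)
```

Each returned pair is (group, entity), so the latent group that produced every author stays visible for inspection.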

[Figure: plate diagram with variables θ, z, a, φ and β, and plates R_d and D]
81
LDA for Entity Resolution (LDA-ER)

Author entities not directly observed
Generate entity a as before
Entities have attributes v
Generate attribute v_i for the i-th reference from
entity attribute v_a using a noise process

[Figure: plate diagram with variables θ, z, a, φ, β and v, and plates A, R_d and D]
83
LDA-ER Inference With Known Authors

Exact inference is intractable
Use Gibbs Sampling for group and entity labels of
each reference
For the ith reference, sample its group label zi,
looking at all other variables

84
Determining Number of Entities

Search over number of entity labels using
sampling
For each entity label i, sample the next step:
Move all its references to some existing label j
(number of entity labels decreases by 1)
Split its references between i and a new label k
(number of entity labels increases by 1)
Retain all its references
(number of entity labels stays the same)

85
Modeling Entity Attributes

Entity attributes are unknown
Incorporate P(V) into joint distribution
Sample entity attributes from full conditional
distribution

86
Noise Parameters

Consider last, first and middle names
First and middle names may be (incorrectly)
initialized or dropped
Characters may be replaced, deleted or inserted
in last names and retained first and middle names
Iteratively estimate noise parameters from entity
and reference attributes

87
Overall Inference Algorithm

Until convergence
  Until convergence
    Sample group label for each reference
  Continue
  For each entity label, reassign all references
  currently having that label
  Sample attribute value for each entity
  Estimate noise parameters
Continue

88
Experiments: Real Data

CiteSeer
Convergence in 30 iterations (10-20 mins)
arXiv HEP
Convergence in 75 iterations (8-20 hrs)
Precision, recall and F1 of pair-wise duplicate
decisions
Baseline
Pair-wise similarity from noise model
Duplicates if similarity above threshold
Transitive closure

89
Results on Real Data

Std dev of F1: 3×10^-4 for CiteSeer (CS), 1.7×10^-4 for HEP

CiteSeer
Achieves close to highest possible recall with
very high precision
HEP
Over 646,000 true duplicate pairs
1% improvement means 6,460 pairs

90
Performance with Varying Group Numbers

General trend: higher precision, lower recall
with more groups
F1 reasonably stable over range of groups

91
Real Resolution Examples

Successful Distinction
(lu j, liu j)
(chang c, chiang c)
Successful Identification
(elliot g, elliott g l)
(dubnick cezary, dubnicki c)
(kaelbing l p, kaelbling leslie pack)
(minton s, minton andrew b)

92
Structural Difference between Data Sets

Percentage of Ambiguous References
0.5% for CiteSeer
9% for HEP
Average number of collaborators per author
2.15 for Citeseer
4.5 for HEP
Average number of references per author
2.5 for Citeseer
6.4 for HEP

93
Synthetic Data Generator

Data generator mimics real collaborations
Create collaboration graph in Stage 1
Create documents from this graph in Stage 2
Can control
Number of author entities and documents
Average number of collaborators per author entity
Average number of references per author entity
Average number of references per document
Percentage of ambiguous references

94
Trends in Synthetic Data

Improvement increases sharply with higher
ambiguity in references

95
Trends in Synthetic Data

Improvement increases with more references per
author

96
Trends in Synthetic Data

Improvement increases with more references per
document

97
Bibliographic ER Comparison

Two approaches to relational entity resolution
Probabilistic Generative Model
Notion of optimal solution
Group label for references
Can generalize for unseen data
Able to handle noise
Relational Clustering
Efficient
Customizable string similarity measure
Small improvement over probabilistic model
Needs threshold to determine duplicates

98
Domain 2: Word Sense Resolution

Words in natural language corpora may be
ambiguous
Bank: financial institution, shore,
reserve/stockpile
Given word occurrence, determine intended sense
from context
Distinction/Disambiguation problem in ER
References are the word occurrences
Entities are the ambiguous senses of the words

99
Relational WSD from Parallel Corpora

Translations can help resolve senses
"Bank" translated into Spanish as "orilla" probably
means shore
Links in WSD
Aligned translation threads in parallel corpora
(bank, banco, banca, Bank, banque)
Multi-type ER
Each language represents a type
Need to resolve senses in all languages
simultaneously
Semantic Group Detection

100
Bilingual Probabilistic Models for WSD

Motivated by Diab and Resnik
Automatic sense tagging using translations
Probabilistic generative model for translations
Sample related senses, one from each language
Sample a word from each selected sense
Two models for sense relations across languages
Sense Model: relate senses directly
Concept Model: relate senses through latent
semantic groups

101
Generative Model 1: Sense Model

Two level generative model
Select a sense T according to priors
Select English word We according to conditional
for that sense
Select Spanish word Ws, again according to
conditional

[Figure: graphical model T → W_e and T → W_s, with tables P(T), P(W_e | T) and P(W_s | T)]
$P(W_e, W_s, T) = P(T)\, P(W_e \mid T)\, P(W_s \mid T)$
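
Under this factorization, the sense of an observed translation pair can be scored by normalizing the joint over senses. The sketch below uses made-up probability tables and sense names; only the product form comes from the slide.

```python
def sense_posterior(w_e, w_s, prior, p_en, p_es):
    """P(T | We, Ws), proportional to P(T) * P(We | T) * P(Ws | T)."""
    joint = {t: prior[t] * p_en[t].get(w_e, 0.0) * p_es[t].get(w_s, 0.0)
             for t in prior}
    total = sum(joint.values())
    if total == 0.0:
        raise ValueError("word pair has zero probability under the model")
    return {t: v / total for t, v in joint.items()}

# Hypothetical tables for two senses of "bank".
prior = {"bank/finance": 0.7, "bank/shore": 0.3}
p_en = {"bank/finance": {"bank": 1.0},
        "bank/shore": {"bank": 0.6, "shore": 0.4}}
p_es = {"bank/finance": {"banco": 0.9, "orilla": 0.1},
        "bank/shore": {"orilla": 1.0}}
post = sense_posterior("bank", "orilla", prior, p_en, p_es)
```

Even with a prior favoring the financial sense, the Spanish word shifts most of the posterior mass to the shore sense in this toy setup.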
102
Generative Model 2: Concept Model

Three level generative model
Select concept C according to priors
Select a sense for each language according to
conditionals for that concept
Select a word conditionally for each of the two
senses

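[Figure: graphical model C → T_e → W_e and C → T_s → W_s, with tables P(C), P(T_e | C), P(T_s | C), P(W_e | T_e) and P(W_s | T_s)]
$P(W_e, W_s, T_e, T_s, C) = P(C)\, P(T_e \mid C)\, P(T_s \mid C)\, P(W_e \mid T_e)\, P(W_s \mid T_s)$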
103
Constructing the Models

Issues
Choosing dimensionality of hidden variables
Use of available semantic hierarchies
WordNet hierarchy for English
Use WordNet senses for English words
Relational clustering to discover Spanish senses
and concepts

104
Sense Model Construction

Use WordNet senses for both languages
English word belongs to all its senses from
WordNet
Assign Spanish word to all senses for its English
translations

105
Concept Model: Spanish Senses

Use English sense neighborhood for each Spanish
word
Union of senses for its translations
One sense for Spanish word
Each neighborhood defines a Spanish sense
Multiple senses for a Spanish word
Break English neighborhoods into frequently
occurring sub-neighborhoods

106
Concept Model: Concepts

English sense neighborhoods of Spanish senses
capture relations across languages
Cluster English sense neighborhoods to create
concepts
Jaccard similarity of neighborhoods
One concept for each neighborhood cluster
Add the Spanish sense for each neighborhood
Add the English senses from each neighborhood

107
Learning Model Parameters

Select parameters to maximize the joint
probability of observed translation pairs
Expectation Maximization to find model
probabilities
Avoid local maxima
Use synset occurrence frequencies from WordNet
for initialization of model probabilities
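
A compact sketch of EM for the simpler sense model, assuming counted translation pairs and caller-supplied initial tables (for example, ones derived from sense frequencies). The data, sense names and initial values are hypothetical, and this is not claimed to be the training code used in the experiments.

```python
from collections import defaultdict

def em_sense_model(pair_counts, prior, p_en, p_es, iterations=20):
    """EM for P(T), P(We|T), P(Ws|T) from counted (We, Ws) translation pairs."""
    for _ in range(iterations):
        n_t = defaultdict(float)
        n_en = defaultdict(lambda: defaultdict(float))
        n_es = defaultdict(lambda: defaultdict(float))
        for (w_e, w_s), count in pair_counts.items():
            # E-step: posterior over senses T for this translation pair.
            joint = {t: prior[t] * p_en[t].get(w_e, 0.0) * p_es[t].get(w_s, 0.0)
                     for t in prior}
            total = sum(joint.values())
            if total == 0.0:
                continue  # pair cannot be generated by the current model
            for t, value in joint.items():
                expected = count * value / total
                n_t[t] += expected
                n_en[t][w_e] += expected
                n_es[t][w_s] += expected
        mass = sum(n_t.values())
        if mass == 0.0:
            break
        # M-step: renormalize the expected counts.
        prior = {t: n_t[t] / mass for t in prior}
        p_en = {t: {w: c / n_t[t] for w, c in n_en[t].items() if c > 0}
                for t in prior}
        p_es = {t: {w: c / n_t[t] for w, c in n_es[t].items() if c > 0}
                for t in prior}
    return prior, p_en, p_es

pair_counts = {("bank", "banco"): 8, ("bank", "orilla"): 2, ("shore", "orilla"): 5}
prior0 = {"finance": 0.5, "shore": 0.5}
p_en0 = {"finance": {"bank": 1.0}, "shore": {"bank": 0.5, "shore": 0.5}}
p_es0 = {"finance": {"banco": 0.8, "orilla": 0.2},
         "shore": {"banco": 0.2, "orilla": 0.8}}
prior, p_en, p_es = em_sense_model(pair_counts, prior0, p_en0, p_es0)
```

Words given zero probability under a sense at initialization stay at zero, so the initial tables also fix which word-sense assignments are allowed; this is one reason initialization matters beyond avoiding poor local maxima.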

108
Training the Models

Training Corpus constructed from multiple sources
Brown Corpus, Senseval 1, Senseval 2 English
Lexical Sample, Wall Street Journal Sec 18-24
from Penn Treebank
Translated into Spanish using Globalink Pro 6.4
and Systran Professional Premium
GIZA++ for word-level alignments

109
Numbers from Experiments

16,186 English words, 31,862 Spanish words
2,385,574 instances of 41,850 distinct
translation pairs
20,361 WordNet senses
Sense model
154,947 parameters
20,361 senses
Concept model
120,268 parameters
20,361 eng. senses, 11,961 spn. senses, 7,366
concepts
EM convergence in about 20 iterations

110
WSD Senseval Comparison

Evaluation on Senseval 2 English All-words
Focus on nouns: 875 instances

111
Semantic Sense Groups

Semantic structure for Spanish words
automatically created with senses and concepts
Map words to sense entities and group related
sense entities into concepts

112
Example Concepts Discovered

accidente accidentes
muertes(deaths)
casualty
matar(to kill) matanzas(slaughter) muertes-le
slaying
derramamiento-de-sangre (spilling-of-blood)
cachiporra(bludgeon) obligar(force)
obligando(forcing)
asesinato(murder) asesinatos

Spanish senses
Concept
Spanish words in a sense
Relevant English dictionary sense
113
Example Concepts Discovered

linterna-eléctrica linterna(lantern)
faros-automóvil(headlight)
linternas-portuarias(harbor-light)
antorcha(torch) antorchas antorchas-pino-nudo

114
Example Concepts Discovered

manía craze
culto(cult) cultos proto-senility
delirio delirium
rabias(fury) rabia farfulla(do hastily)

115
Example Concepts Discovered

oportunidad oportunidades
ocasión ocasiones
riesgo(risk) riesgos peligro(danger)
destino sino(fate)
fortuna suerte(fate)
probabilidad probabilidades

116
Entity Resolution Summary

Formulated generalized entity resolution problem
addressing
Reference attributes
Relational data
Collective Inference
Group detection for entities
Two types of entities for parallel WSD

117
Future Work

Resolving Multiple Entity Types
Typed relational similarity measures by
projecting onto each type and aggregation (MRDM
05)
Extend group model for multiple types
Objective Functions for RC-ER
Notion of optimal solution
Generalize cut-based co-clustering (Dhillon 01)
Use entity ontologies for resolution
WordNet similarity instead of Jaccard similarity
for sense neighborhoods

118
ER Issues

collective resolution
global vs. local resolutions
multi-entity resolution
structural properties: when to use links
characterization of structural properties of
collaborative data sets benefiting relational
approach
HCI issues
task specific interface for graph data
visualizations which support analytic task

119
Entity Resolution in Enron Email

Message ID: 180231
Datetime: 2001-01-23 09:45:00
Sender: Sara Shackleton
Recipients: Tana Jones
Subject: Hedge Funds
Tana: Other than your email attached, have you
had other discussions with Mark or credit about
hedge funds? Sara
Sara Shackleton
Enron North America Corp.
1400 Smith Street, EB 3801a
Houston, Texas 77002
713-853-5620 (phone)
713-646-3490 (fax)
sara.shackleton@enron.com

[Figure: emails exchanged between Shackleton and potential
candidates]
Joint work with Chris Diehl at JHUAPL
Mark Taylor is the correct association
120
Entity Resolution in Email

Message ID: 182297
Datetime: 1999-12-20 04:41:00
Sender: Sara Shackleton
Recipients: Marie Heard
Subject: Merrill Lynch - Financial Contract
This is the deal that Susan F. worked on on
Friday. I'll forward the Schedule to you. No
one is asking for a revised Schedule yet but we
should make the change and email the parties on
Susan's email so that everyone knows the latest
changes and then ask if anyone has comments. ss

[Figure: emails exchanged between Shackleton and potential
candidates]
More context is needed to resolve the reference
Linking references removes ambiguity in this case
Considering recipient communications with candidates
may remove ambiguity as well
121
Entity Resolution in Email

Message ID: 71707
Datetime: 2001-10-19 14:31:41
Sender: Sara Shackleton
Recipients: Kim Ward, Jason Williams
Subject: FW: FW: Master purchase/sale agreement -
Salt River
Jay: my mistake - Salt River did send a CSA (see
below) Sara

[Figure: emails exchanged between Shackleton and potential
candidates]
Jay is in fact a reference to Jason Williams
Williams often signs emails as Jay
Need framework that supports detection and resolution
of nicknames
122
Entity Resolution in Email

Message ID: 81944
Datetime: 2001-10-19 06:28:50
Sender: Mark Whitt
Recipients: Barry Tycholiz
Subject: FW: hockey
Here is an opportunity to get a box for one of
the games. Detroit on Feb 4th would be great!
That is a Monday. If you and Kim wanted to you
could come up and ski that weekend prior. Let me
know what you think

[Figure: emails exchanged between Whitt and potential
candidates]
Candidates listed are only from within Enron
Exploiting the fact that this communication
is social in nature may be useful in dismissing
an already weak hypothesis
123
Entity Resolution in Enron Data
Email communication network
Employee directory: Jane Adams x3-4555, John Addams x4-3421, ...
Org chart
Mail threads: To: j.smith@enron.com, From: jdoe@enron.com, Subject:
Re: trade, "My friend John says ..."
124
Projects

Link-based Classification
Link-based Entity Resolution
efficient algorithms
visualization tools that support ER
Social Network Analysis
Affiliation Networks
Structural and descriptive modeling
Friendship Event Networks
Definitions of Capital and Benefit
Link Mining for the Semantic Web
Feature Generation for Sequences (biological
data)
Word-sense disambiguation from Parallel Corpora

125
Affiliation Networks

An affiliation network contains
Actors A
Events E
Relationships R(A,E) Actor A participates in
event E
Examples
Executive Corporate Boards (ECN)
66,000 executives, 5400 companies, 76,000 board
memberships
Author Publication Networks (APN)
13,000 authors, 16,000 publications, 39,000
authorships

Joint work with Lisa Singh at Georgetown
126
3 Views
[Figure: the same actors (a1-a5) and events
(e1-e3) drawn in three views: Affiliation Network,
Event Overlap Graph, Co-Membership Graph]
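
The three views named above can be derived from one list of (actor, event) pairs. The sketch below is illustrative only: the function name `project` and the sample pairs are assumptions, and the edge weights (shared-event and shared-actor counts) are one common choice rather than something the slides specify.

```python
from collections import defaultdict
from itertools import combinations

def project(pairs):
    """Derive the two one-mode views from affiliation pairs R(A, E).

    Returns (co_membership, event_overlap):
      co_membership[(a, b)] = number of events actors a and b share
      event_overlap[(e, f)] = number of actors events e and f share
    """
    events_of = defaultdict(set)  # actor -> events attended
    actors_of = defaultdict(set)  # event -> participating actors
    for actor, event in pairs:
        events_of[actor].add(event)
        actors_of[event].add(actor)

    co_membership = defaultdict(int)
    for actors in actors_of.values():
        for a, b in combinations(sorted(actors), 2):
            co_membership[(a, b)] += 1

    event_overlap = defaultdict(int)
    for events in events_of.values():
        for e, f in combinations(sorted(events), 2):
            event_overlap[(e, f)] += 1

    return dict(co_membership), dict(event_overlap)

# Hypothetical affiliation network (not the one in the figure).
pairs = [("a1", "e1"), ("a2", "e1"), ("a2", "e2"), ("a3", "e2")]
co_membership, event_overlap = project(pairs)
print(co_membership)
print(event_overlap)
```

The affiliation network itself is the bipartite pair list; the other two views are projections that discard one node type, which is why they lose some information.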
127
Compressing the networks

Descriptive Pruning
Select actors/events based on attribute values
e.g., consider only CEOs
Structural Properties
Consider actors based on structural properties
such as hubs, brokers, etc.
Evaluation
Does pruned network maintain predictive accuracy
for network attributes?
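
A minimal sketch of the two pruning strategies, assuming the network is stored as (actor, event) pairs. The function names and sample data are hypothetical, and using the number of events per actor as a stand-in for "hub" is an assumption; the slides do not say how hubs or brokers are measured. The evaluation step (checking predictive accuracy on the pruned network) is not shown.

```python
from collections import Counter

def descriptive_prune(pairs, attributes, keep):
    """Keep only pairs whose actor's attributes satisfy `keep`."""
    return [(a, e) for a, e in pairs if keep(attributes.get(a, {}))]

def structural_prune(pairs, min_events):
    """Keep only actors that take part in at least `min_events` events.

    Event count is used here as a simple proxy for being a hub.
    """
    degree = Counter(a for a, _ in set(pairs))
    return [(a, e) for a, e in pairs if degree[a] >= min_events]

# Hypothetical data.
pairs = [("a1", "e1"), ("a2", "e1"), ("a2", "e2"), ("a3", "e2")]
attributes = {"a1": {"title": "CEO"},
              "a2": {"title": "CFO"},
              "a3": {"title": "CEO"}}

ceos_only = descriptive_prune(
    pairs, attributes, lambda attrs: attrs.get("title") == "CEO")
hubs_only = structural_prune(pairs, min_events=2)
print(ceos_only)
print(hubs_only)
```

In this toy example the two strategies keep disjoint sets of actors, which illustrates how descriptive and structural pruning can select quite different parts of the same network.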

128
Predictive Accuracy of Compression Strategies
129
Summary

Can use both descriptive and structural
properties to significantly compress networks
while maintaining accuracy
Descriptive and structural pruning both allow us
to focus on important actors in the network;
however, the sets of actors they prune are quite
different
These pruned networks may be more effective for
understanding and visualization

130
Projects

Link-based Classification
Link-based Entity Resolution
efficient algorithms
visualization tools that support ER
Social Network Analysis
Affiliation Networks
Structural and descriptive modeling
Friendship Event Networks
Definitions of Capital and Benefit
Link Mining for the Semantic Web
Feature Generation for Sequences (biological
data)
Word-sense disambiguation from Parallel Corpora

131
Friendship Event Networks

A friendship event network contains
Actors
Friendships
Events
Event Organizers
Event Participants
example
Author Collaboration Networks
Actors - Researchers
Friendships - CoAuthors
Events - Conferences
Event Organizers - PC Committee
Event Participants - Authors

132
[Figure: overlap between the PC Committee and the
Conference Authors; legend: PC Non Author,
Non PC Author, PC Author]
133
Define

Personal Social Capital - number of friends who
are organizers
Benefit Received - number of publications in
conference
Benefit Given - number of publications of friends
of PC member
Comparison of different event structures
Temporal Evaluation
look at event series
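
The three measures defined above can be computed directly once friendships, organizers, and accepted papers are available. This is a sketch under assumptions: the data layout, names, and sample values are hypothetical, and Benefit Given is read here as the number of conference papers with at least one author who is a friend of the PC member, which is only one possible interpretation of the definition.

```python
def personal_social_capital(actor, friends, organizers):
    """Number of the actor's friends who are event organizers."""
    return len(friends.get(actor, set()) & organizers)

def benefit_received(actor, papers):
    """Number of conference papers the actor authored."""
    return sum(1 for authors in papers if actor in authors)

def benefit_given(pc_member, friends, papers):
    """Number of conference papers with at least one friend of the
    PC member among the authors (assumed interpretation)."""
    circle = friends.get(pc_member, set())
    return sum(1 for authors in papers if circle & set(authors))

# Hypothetical conference: friendships are co-authorships,
# organizers are PC members, papers are sets of authors.
friends = {"ann": {"bob", "cat"}, "bob": {"ann"}, "cat": {"ann"}}
organizers = {"bob"}
papers = [{"ann", "cat"}, {"cat"}, {"dan"}]

print(personal_social_capital("ann", friends, organizers))
print(benefit_received("cat", papers))
print(benefit_given("bob", friends, papers))
```

For the temporal evaluation, the same functions could be applied per year of an event series, with friendships built only from co-authorships that predate each event.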

134
Datasets
Data for the past 10 years of 3 major CS conferences
135
Overall Capital and Benefit
136
C1 Friendship
137
C1 Capital
138
C1 Capital/Friendship Ratio
139
PC/Author Ratio
140
Capital/Benefit Summary

Defined a generic friendship-event network
Identified interesting structural properties
Very preliminary, much more work to be done

141
Link Mining for the Semantic Web

Need to be able to extract multi-relational data,
not just a single table
Semantic Web tasks which could make use of
learning
schema discovery
populating ontology
schema mappings
schema reformulation
SRL capabilities that are needed
link-based object classification
link type prediction
predicting link existence
link cardinality estimation
entity resolution and object consolidation
group detection
predicate invention

142
An Integrated Approach
[Figure: SRL sits between ontologies and data]
Current projects focus on
1. Link Type Prediction
2. Link Ontology Discovery
143
Projects

Link-based Classification
Link-based Entity Resolution
efficient algorithms
visualization tools that support ER
Social Network Analysis
Affiliation Networks
Structural and descriptive modeling
Friendship Event Networks
Definitions of Capital and Benefit
Link Mining for the Semantic Web
Feature Generation for Sequences (biological
data)
Schema Maintenance and Discovery

144
Summary Link Mining

Tasks
Link-based Object Classification
Object Type Prediction
Link Type Prediction
Predicting Link Existence
Link Cardinality Estimation
Entity Resolution
Group Detection
Subgraph Discovery
Metadata Mining

Challenges
Collective Classification
Collective Consolidation
Logical vs. Statistical dependencies
Feature construction
Instances vs. Classes
Effective Use of Labeled and Unlabeled Data
Link Prediction
Closed vs. Open World

These are some of the key capabilities needed to
perform today's complex analytic tasks
145
Recent SRL Activities

Invited Tutorial at ICML/ILP 2005 and Tutorial at
IJCAI
Dagstuhl 2005 workshop on Probabilistic,
Relational and Logical Learning, co-organized w/
Luc De Raedt, Stephen Muggleton and Tom
Dietterich. http://www.dagstuhl.de/05051/
ICML 2004 workshop on Statistical Relational
Learning and its Connections to Other Fields,
co-organized w/ Tom Dietterich and Kevin
Murphy. http://www.cs.umd.edu/projects/srl2004/
IJCAI 2003 workshop on Statistical Relational
Learning, co-organized w/ David
Jensen. http://kdl.cs.umass.edu/srl2003/
AAAI 2000 workshop on Statistical Relational
Learning, co-organized w/ David
Jensen. http://robotics.stanford.edu/srl
Related workshops
KDD MRDM workshops
http://www-ai.ijs.si/SasoDzeroski/MRDM2004/
http://www-ai.ijs.si/SasoDzeroski/MRDM2003/
http://www-ai.ijs.si/SasoDzeroski/MRDM2002/
Benjamin Taskar and I are working on an edited
SRL collection

146
SRL Related Courses

My course at UMD
http://www.cs.umd.edu/class/spring2005/cmsc828g/
Pedro Domingos' course at UWash
Tom Dietterich's course at OSU
http://web.engr.oregonstate.edu/tgd/classes/539/
David Page, Mark Craven and Jude Shavlik at
UWisc
http://www.biostat.wisc.edu/page/838.html
Eric Mjolsness's course at UCI on Probabilistic
Knowledge Representation
http://computableplant.ics.uci.edu/emj/classes/280_04/Syllabus20ICS2028020v2.doc
Stuart Russell's course at Berkeley on Knowledge
Representation and Reasoning
http://www.cs.berkeley.edu/russell/classes/cs289/f04/
Joydeep Ghosh's course at UT Austin on Advanced
Topics in Data Mining
http://www.lans.ece.utexas.edu/course/382v/05sp/
Michael Littman's course at Rutgers on Learned
Representations in AI
http://www.cs.rutgers.edu/mlittman/courses/lightai03/
David Jensen and Andrew McCallum's course at UMass
on Computational Social Network Analysis
http://kdl.cs.umass.edu/courses/csna/

147
References

Deduplication and Group Detection Using Links,
Indrajit Bhattacharya and Lise Getoor. 10th ACM
SIGKDD Workshop on Link Analysis and Group
Detection, Seattle, WA, August 2004.
Word Sense Disambiguation using Probabilistic
Models, Indrajit Bhattacharya, Lise Getoor and
Yoshua Bengio. 42nd Annual Meeting of the
Association for Computational Linguistics,
Barcelona, SP, July 2004.
Iterative Record Linkage for Cleaning and
Integration, Indrajit Bhattacharya and Lise
Getoor. 9th ACM SIGMOD Workshop on Research
Issues in Data Mining and Knowledge Discovery,
Paris, FR, June 2004.
Using the Structure of Web Sites for Automatic
Segmentation of Tables, Kristina Lerman, Lise
Getoor, Steve Minton and Craig Knoblock.
Proceedings of ACM-SIGMOD 2004 International
Conference on Management of Data, Paris, FR, June
2004.
Structure Discovery using Statistical Relational
Learning, Lise Getoor. Data Engineering Bulletin,
vol. 26, No. 3, 2003.
Link Mining: A New Data Mining Challenge, Lise
Getoor. SIGKDD Explorations, volume 5, issue 1,
2003.
Iterative Deduplication, I. Bhattacharya and
L. Getoor.
Link-based Classification, Q. Lu and L. Getoor,
International Conference on Machine Learning,
August, 2003
Labeled and Unlabeled Data for Link-based
Classification, Q. Lu and L. Getoor. ICML
workshop on The Continuum from Labeled to
Unlabeled Data, August, 2003.
Link-based Classification for Text Classification
and Mining, Q. Lu and L. Getoor. IJCAI workshop
on Text Mining and Link Analysis