Title: Link Mining and Entity Resolution
1Link Mining and Entity Resolution
- Lise Getoor
- University of Maryland, College Park
Students: Indrajit Bhattacharya, Mustafa Bilgic, Rezarta Islamaj, Louis Licamele and Prithviraj Sen
2Roadmap
3Link Mining
- Traditional machine learning and data mining approaches assume
- A random sample of homogeneous objects from a single relation
- Real-world data sets
- Multi-relational, heterogeneous and semi-structured
- Represented as a graph or network
- Statistical Relational Learning (SRL)
- Newly emerging research area at the intersection of research in social network and link analysis, hypertext and web mining, graph mining, relational learning and inductive logic programming
- Sample Domains
- Web data, bibliographic data, epidemiological data, communication data, customer networks, collaborative filtering, trust networks, biological data
4What is SRL?
5View 1 Alphabet Soup
LBN, CLP(BN), SRM, PRISM, RDBN, RPM, SLR, BLOG, PLL, pRN, PER, PRM, SLP, MLN, HMRF, RMN, RNM, DAPER, RDN, BLP, SGLR
6View 2 Representation Soup
- Hierarchical Bayesian Model + Relational Representation
[Diagram: Logic, by adding probabilities, and Probabilities, by adding relations, both lead to Statistical Relational Learning]
7View 3 Data Soup
Training Data
Test Data
13Link Mining Tasks
- Tasks
- Object Classification
- Object Type Prediction
- Link Type Prediction
- Predicting Link Existence
- Link Cardinality Estimation
- Entity Resolution
- Group Detection
- Subgraph Discovery
- Metadata Mining
14 Sample Problem Domain
- Research World
- Researchers
- Papers
- Reviewers
- Co-authors
- Citations
- Topics
- Aka Tenure World
15Object Prediction
- Object Classification
- Predicting the category of an object based on its attributes and its links and attributes of linked objects
- e.g., predicting the topic of a paper based on the words used in the paper, the topics of papers it cites, the research interests of the author
- Object Type Prediction
- Predicting the type of an object based on its attributes and its links and attributes of linked objects
- e.g., predict the venue type of a publication (conference, journal, workshop) based on properties of the paper
16Link Prediction
- Link Classification
- Predicting type or purpose of link based on properties of the participating objects
- e.g., predict whether a citation is to foundational work, background material, gratuitous PC reference
- Predicting Link Existence
- Predicting whether a link exists between two objects
- e.g., predicting whether a paper will cite another paper
- Link Cardinality Estimation
- Predicting the number of links to an object or predicting the number of objects reached along a path from an object
- e.g., predict the number of citations of a paper
17More complex prediction tasks
- Group Detection
- Predicting when a set of entities belong to the same group based on clustering both object attribute values and link structure
- e.g., identifying research communities
- Entity Resolution
- Predicting when a collection of objects are the same, based on their attributes and their links (aka record linkage, identity uncertainty)
- e.g., predicting when two citations are referring to the same paper
- Predicate Invention
- Induce a new general relation/link from existing links and paths
- e.g., propose concept of advisor from co-author and financial support
- Subgraph Identification, Metadata Mapping
18SRL Challenges
- Collective Classification
- Collective Consolidation
- Logical vs. Statistical dependencies
- Feature Construction: aggregation, selection
- Flexible and Decomposable Combining Rules
- Instances vs. Classes
- Effective Use of Labeled & Unlabeled Data
- Link Prediction
- Closed vs. Open World
Challenges common to any SRL approach! Bayesian Logic Programs, Relational Logic Networks, Probabilistic Relational Models, Relational Markov Networks, Relational Probability Trees, Stochastic Logic Programming, to name a few
19Collective classification
- Using a link-based statistical model for classification
- Inference using learned model is complicated by the fact that there is correlation between the object labels
- Must find a labeling that maximizes the joint (conditional) probability
20Collective consolidation
- Using a link-based statistical model for object consolidation
- Consolidation decisions should not be made independently
- Must find a clustering that maximizes the joint (conditional) probability
21Logical vs. Statistical Dependence
- Coherently handling two types of dependence structures
- Link structure: the logical relationships between objects
- Probabilistic dependence: statistical relationships between attributes
- Challenge: statistical models that support rich logical relationships
- Model search is complicated by the fact that attributes can depend on arbitrarily linked attributes
- Issue: how to search this huge space
22Model Search
[Figure: example graph with nodes P1, P2, P3, I1 and A1; a node labeled P has an unknown value (?)]
23Feature Construction
- In many cases, objects are linked to a set of objects. To construct a single feature from this set of objects, we may either use
- Aggregation
- Selection
24Aggregation
[Figure: example graph with nodes P1, P2, P3, I1 and A1 and an unknown value (?), illustrating aggregation]
25Selection
[Figure: example graph with nodes P1, P2, P3, I1 and A1 and an unknown value (?), illustrating selection]
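The two feature-construction options can be sketched as follows. This is only an illustration under assumptions: the paper records, the `label` and `citations` fields, and the choice of "most common label" and "most cited paper" are hypothetical, not taken from the slides.

```python
from collections import Counter

def aggregate_mode(linked_labels):
    """Aggregation: summarize the whole linked set, here by its most common label."""
    if not linked_labels:
        return None
    return Counter(linked_labels).most_common(1)[0][0]

def select_one(linked_objects, score):
    """Selection: pick one linked object (the highest-scoring one) and use its label."""
    if not linked_objects:
        return None
    return max(linked_objects, key=score)["label"]

# Hypothetical papers cited by the paper whose topic we want to predict.
cited = [
    {"label": "theory", "citations": 120},
    {"label": "systems", "citations": 15},
    {"label": "systems", "citations": 30},
]
agg_feature = aggregate_mode([p["label"] for p in cited])
sel_feature = select_one(cited, score=lambda p: p["citations"])
```

Aggregation uses every linked object but can blur a single highly predictive link; selection keeps that one link and discards the rest.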
26Individuals vs. Classes
- Does the model refer
- explicitly to individuals?
- to classes or generic categories of individuals?
- On one hand, we'd like to be able to model that a connection to a particular individual may be highly predictive
- On the other hand, we'd like our models to generalize to new situations, with different individuals
27Instance-based Dependencies
[Figure: example graph with nodes P3, I1 and A1]
Papers that cite P3 are likely to be
28Class-based Dependencies
[Figure: example graph with unnamed nodes (?) and nodes I1 and A1]
Papers that cite are likely to be
29Labeled & Unlabeled Data
- In link-based domains, unlabeled data provide three sources of information
- Helps us infer object attribute distribution
- Links between unlabeled data allow us to make use of attributes of linked objects
- Links between labeled data and unlabeled data (training data and test data) help us make more accurate inferences
30Link Prior Probability
- The prior probability of any particular link is typically extraordinarily low
- For medium-sized data sets, we have had success with building explicit models of link existence
- It may be more effective to model links at a higher level; required for large data sets!
31Closed World vs. Open World
- The majority of SRL approaches make a closed world assumption, which assumes that we know all the potential entities in the domain
- In many cases, this is unrealistic
- Work by Milch, Marthi, Russell on BLOG
32SRL Tasks & Challenges Summary
- Tasks
- Link-based Object Classification
- Object Type Prediction
- Link Type Prediction
- Predicting Link Existence
- Link Cardinality Estimation
- Entity Resolution
- Group Detection
- Predicate Invention
- Subgraph Discovery
- Metadata Mining
- Issues & Challenges
- Collective Classification
- Collective Consolidation
- Logical vs. Statistical dependencies
- Feature construction
- Instances vs. Classes
- Effective Use of Labeled & Unlabeled Data
- Link Prediction
- Closed vs. Open World
33Current Projects
- Link-based Classification
- Link-based Entity Resolution
- Social Network Analysis
- Affiliation Networks
- Structural and descriptive modeling
- Friendship Event Networks
- Definitions of Capital and Benefit
- Link Mining for the Semantic Web
- Feature Generation for Sequences (biological data)
- Schema Maintenance and Discovery
34Link-based Classification
- Predicting the category of an object based on its
attributes and its links and attributes of linked
objects
[Figure: graph of linked objects labeled A, B and C, with one unlabeled object (?) to classify]
35Our Approach
- Investigate use of labeled and unlabeled data for classification
- Learning of models
- iterative algorithm
- Prediction
- links among (unlabeled) test and (labeled) training data
- Requires collective classification
- Link-based models
- Integrate link features with object attributes using logistic regression
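A rough sketch of what an iterative collective-classification loop might look like. This is not the authors' algorithm: a simple additive vote (`link_weight` per neighbor label) stands in for the logistic regression mentioned above, and `local_scores`, `neighbors` and the node names are hypothetical.

```python
def iterative_classification(local_scores, neighbors, known, link_weight=1.0, max_iters=10):
    """Label unknown nodes from attribute scores plus the current labels of their neighbors."""
    labels = dict(known)
    # Bootstrap: label each unknown node from its own attributes only.
    for node, scores in local_scores.items():
        if node not in labels:
            labels[node] = max(scores, key=scores.get)
    for _ in range(max_iters):
        changed = False
        for node, scores in local_scores.items():
            if node in known:
                continue
            combined = dict(scores)
            # Link features: each neighbor votes for its current label.
            for nb in neighbors.get(node, []):
                combined[labels[nb]] = combined.get(labels[nb], 0.0) + link_weight
            best = max(combined, key=combined.get)
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable
    return labels

known = {"p1": "A", "p2": "A"}                       # labeled training papers
local_scores = {"p3": {"A": 0.4, "B": 0.6},          # attribute-only scores for test papers
                "p4": {"A": 0.45, "B": 0.55}}
neighbors = {"p3": ["p1", "p2", "p4"], "p4": ["p3"]}
result = iterative_classification(local_scores, neighbors, known)
```

In this toy run both test papers start as B from attributes alone and flip to A once links to the labeled papers are taken into account, which is the kind of label correlation that makes independent prediction insufficient.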
36Experiment
37Projects
- Link-based Classification
- Link-based Entity Resolution
- Social Network Analysis
- Affiliation Networks
- Structural and descriptive modeling
- Friendship Event Networks
- Definitions of Capital and Benefit
- Link Mining for the Semantic Web
- Feature Generation for Sequences (biological data)
- Schema Maintenance and Discovery
38James Smith
John Smith
Jim Smith
John Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith
39Generalized Entity Resolution
- Discover the domain entities
- Map each reference to an entity
- Identification
- References to the same entity may look different
- Jonathan Smith, Jon Smith, Jonthan Smith
- Distinction/Disambiguation
- References to different entities may look similar
- Jon Smith, John Smith
40Entity Resolution Domains
- Databases
- Deduplication in Data Cleaning
- Data Integration with similarity joins
- Natural Language Processing
- Noun co-reference
- Sense disambiguation
- Named entity recognition
- Computer Vision
- Correspondence Problem
41Issues
- Reference attributes
- Multiple Entity Types
- Relational Reference Data
- Collective Resolution
- Group Detection
- Entity Ontologies
42ER using Reference Attributes
- Identify references with similar attributes
- Define (or learn) attribute similarity measures
- Resolve references pairwise, then take the transitive closure
- Problem: similarity threshold for resolution
- Better identification calls for lower threshold
- Better distinction calls for higher threshold
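A minimal sketch of attribute-only resolution: compare references pairwise, link pairs above a threshold, and take the transitive closure with union-find. The similarity function here is `difflib.SequenceMatcher.ratio` from the Python standard library, used as a stand-in for whatever measure is defined or learned; the function names are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations

def name_similarity(a, b):
    """Stand-in attribute similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def resolve_by_attributes(references, threshold):
    """Pairwise decisions followed by transitive closure (union-find)."""
    parent = list(range(len(references)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        if name_similarity(references[i], references[j]) >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref)
    return sorted(clusters.values())

refs = ["Jon Smith", "John Smith", "Jonathan Smith", "J Smith"]
loose = resolve_by_attributes(refs, threshold=0.6)
strict = resolve_by_attributes(refs, threshold=0.9)
```

Lowering the threshold can only merge more references (better identification, worse distinction); raising it can only split them, which is the tension described above.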
44Motivation for Relational ER
- References may not be observed independently in data
- References are linked
- Link is a set of related references
- Represents relations among underlying entities
- E.g., parent-dependent or sibling relations among person records in census database
- Links can help in identification and distinction
45Example References In Census Data
[Figure: person references in census records: Jon, Jim, Liz, Jon, James, P, John, Gwyneth, Betsy, J, Gwen, Elizabeth, Jonthan, Don, Paul, Jonathan, Laura, Betsy, Sharon, Ron, Kate, L, J, D]
Typos: Jon = Jonathan? John?
Initials: J = James? John? Jonathan? Or none?
46Example Links In Census Data
[Figure: the census references of the previous slide, now connected by links]
Links represent family relations
47Example Inference from Links
[Figure: the linked census references of the previous slides]
Ambiguity is almost eliminated
Ambiguity is reduced
48Entity Resolution From Relational Data
- References with similar attributes that have similar relations as well are more likely to be the same entity
- ER Approach 1
- Cluster references using relational similarity
49Entity Resolution Using Group Membership Evidence
- Links represent correlations among entities
- Some entities more likely to co-occur in links than others
- ER Approach 2: capture correlations explicitly with latent group variable
- Entities are members of possibly overlapping groups
- Entities in same group more likely to form links
50Entity Resolution Using Group Membership Evidence
[Figure: the census references partitioned into Familial Group 1 and Familial Group 2; callouts mark references that belong to the same familial group and references that belong to different familial groups]
51Entity Resolution Using Group Membership Evidence
- Group Detection is interesting and important
- Collaborative groups in social sciences and bibliometry
- Semantic word groups from natural language corpora
- By-product of ER using groups: group detection from ambiguous references
52Collective Entity Resolution from Relations
- Resolutions cannot be made independently for different references
- Dependency flows between resolution decisions through reference links
- J Smith's wife Betsy is the same as the Betsy who is the mother of Paul; Paul is the same as P Smith, who is John Smith's son
53Collective Entity Resolution from Relations
- Resolutions cannot be made independently for different references
- Dependency flows between resolution decisions through reference links
- When modeling groups, entity resolutions depend on groups, groups depend on resolved entities
54Evaluation Domains
- Bibliographic Data
- Author resolution using co-author links
- Relational Clustering (RC-ER) (DMKD 04, LinkKDD 04, submitted Book Chapter)
- LDA based Group model (LDA-ER) (under review)
- Natural Language
- Sense resolution using translation links in parallel corpora (ACL 04)
- Sense Model: senses in different languages depend directly on each other
- Concept Model: semantic sense groups, or concepts, relate senses from different languages
55Domain 1 Bibliographic Entity Resolution
- Resolve author, paper, venue, publisher entities from citation strings
- R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, 1994.
- Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
56Exploiting Bibliographic Links
- Resolve author, paper, venue, publisher entities from citation strings
- R. Agrawal, R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB-94, 1994.
- Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994.
57Exploiting Bibliographic Links
[Figure: reference graph for the two citation strings]
- co-author: R. Agrawal and R. Srikant; Rakesh Agrawal and Ramakrishnan Srikant
- writes: each author reference is linked to its paper title, "Fast algorithms for mining association rules in large databases" or "Fast Algorithms for Mining Association Rules"
- published-in: each paper title is linked to its venue string, "VLDB-94, 1994" or "Proc. of the 20th Int'l Conference on Very Large Databases, Santiago, Chile, September 1994"
58Exploiting Bibliographic Links
R. Agrawal
Rakesh Agrawal
Ramakrishnan Srikant
R. Srikant
Fast algorithms for mining association rules in
large databases
Fast Algorithms for Mining Association Rules
VLDB-94, 1994
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994
59Exploiting Bibliographic Links
entity 1
R. Agrawal
Rakesh Agrawal
entity 2
Ramakrishnan Srikant
R. Srikant
entity 3
Fast algorithms for mining association rules in
large databases
Fast Algorithms for Mining Association Rules
entity 4
VLDB-94, 1994
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994
61Approach 1: ER using Relational Clustering (RC-ER)
- Iteratively cluster similar references into entities
[Figure sequence: the eight reference clusters c1 to c8 are merged step by step. The two Srikant references become c9, the two Agrawal references (c1, c2) become c10, the two paper titles (c5, c6) become c11, and the two venue strings (c7, c8) become c12]
66Similarity Measure For Clustering
- Linear combination of attribute and relational similarity of reference clusters
- $sim(c_i, c_j) = (1 - \alpha)\, sim_{attr}(c_i, c_j) + \alpha\, sim_{rel}(c_i, c_j)$
- Attribute similarity measure
- Several measures available for pairs of strings
- Levenshtein, Smith-Waterman, Jaro
- Combine pairwise measures for attribute similarity of two reference clusters
- Single link, average link, complete link
- Representative attribute for clusters
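The pieces of this similarity measure can be sketched as below, under assumptions: a length-normalized Levenshtein distance is used as the string measure, clusters are plain lists of name strings, and the function names and the `alpha` parameter name are illustrative only.

```python
def levenshtein(a, b):
    """Edit distance by dynamic programming over two rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def string_sim(a, b):
    """Map edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def cluster_attr_sim(ci, cj, link="average"):
    """Combine pairwise string similarities of two reference clusters."""
    scores = [string_sim(a, b) for a in ci for b in cj]
    if link == "single":      # most similar pair
        return max(scores)
    if link == "complete":    # least similar pair
        return min(scores)
    return sum(scores) / len(scores)

def combined_sim(attr_sim, rel_sim, alpha):
    """Linear combination of attribute and relational similarity."""
    return (1 - alpha) * attr_sim + alpha * rel_sim
```

For example, `cluster_attr_sim(["Jon"], ["Jon", "John"], "complete")` is governed by the weakest pair, while the single-link variant is governed by the exact match.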
67Relational Similarity Measure
- Cluster similarity captures dependence between resolution decisions through links
- Each reference cluster $c_i$ has its link set $H(c_i)$
- Link for each reference in $c_i$
- Capture similarity of links in two clusters
68Edge Detail Similarity
- Similarity of two links depends on their references
- Consider resolution decisions on the references
Both links connect to cluster 9
69Edge Detail Similarity
- Similarity of two links depends on their references
- Consider resolution decisions on the references
- Label set $E_h(i)$ of the ith link
- multi-set of cluster labels of its references
- $sim_h(i, j) = Jaccard(E_h(i), E_h(j))$
- Edge Detail Similarity of two clusters
- $sim_{rel}(c, c') = \min(sim_h(i), sim_h(j)),\ i \in H(c),\ j \in H(c')$
70Neighborhood Similarity
- Edge detail similarity is expensive
- Ignore explicit link structure
- Consider only set of neighborhood clusters
- Clusters c1, c2 still similar in terms of relationships
[Figure: clusters c1 and c2 connected through links 1 to 4 to neighboring clusters c3, c4 and c5]
71Neighborhood Similarity
- Edge detail similarity is expensive
- Ignore explicit link structure
- Consider only set of neighborhood clusters
- $N(c)$: multiset of cluster labels covered by links in $H(c)$
- Neighborhood similarity of two clusters
- $sim_{rel}(c, c') = Jaccard(N(c), N(c'))$
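Neighborhood similarity can be sketched with `collections.Counter` as the multiset. Two assumptions are made here that the slides do not settle: Jaccard on multisets is taken as summed minimum counts over summed maximum counts, and each link is represented by the labels of the other clusters it touches.

```python
from collections import Counter

def multiset_jaccard(a, b):
    """Jaccard coefficient on multisets: |intersection| / |union| with multiplicities."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0

def neighborhood(links):
    """N(c): multiset of cluster labels covered by the links in H(c)."""
    return Counter(label for link in links for label in link)

def neighborhood_sim(links_c, links_c2):
    return multiset_jaccard(neighborhood(links_c), neighborhood(links_c2))

# Hypothetical link sets: c1 and c2 reach the same neighbors through different links.
H_c1 = [["c3", "c4"], ["c5"]]
H_c2 = [["c3"], ["c4", "c5"]]
sim = neighborhood_sim(H_c1, H_c2)
```

Because the link structure is flattened into one multiset per cluster, the two clusters above come out as fully similar even though no individual link matches.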
72Evaluation Datasets
- CiteSeer
- Machine Learning Citations
- Originally created by Lawrence et al.
- 2,892 references to 1,165 true authors
- 1,504 links
- arXiv HEP
- Papers from High Energy Physics
- Used for KDD-Cup 03 Data Cleaning Challenge
- 58,515 references to 9,200 true authors
- 29,555 links
73Baseline
- Pairwise duplicate decisions using Soft-TFIDF (ATTR)
- Secondary string similarity: Scaled Levenshtein (SL), Jaro (JA), Jaro-Winkler (JW)
- Transitive closure over pairwise decisions (ATTR*)
- Precision, Recall and F1 over pairwise decisions
- Both algorithms require similarity threshold
- Report best performance over all thresholds
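Precision, recall and F1 over pairwise decisions can be computed as in this sketch. It assumes both the predicted and the true resolution are given as lists of clusters of unique reference ids; the function names are illustrative.

```python
from itertools import combinations

def duplicate_pairs(clusters):
    """All unordered pairs of references placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}

def pairwise_prf(predicted, truth):
    pred, true = duplicate_pairs(predicted), duplicate_pairs(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 1.0
    recall = tp / len(true) if true else 1.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

truth = [[1, 2, 3], [4]]
predicted = [[1, 2], [3, 4]]
precision, recall, f1 = pairwise_prf(predicted, truth)
```

Here one of the two predicted pairs is correct and one of the three true pairs is found, so precision is 1/2 and recall is 1/3.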
74Results F1 for Different String Similarity
Measures
- For each measure, neighborhood similarity does better than ATTR and ATTR*, and edge detail does better than neighborhood
75Results Varying Combination Weight using
Bootstrapping
77Results Execution Time
78Results Best F1
- Relational measures improve performance over attribute baseline in terms of precision, recall and F1
- Neighborhood similarity performs almost as well as edge detail
- Neighborhood similarity is faster than edge detail
79Approach 2 Latent Dirichlet Model for ER
- Probabilistic model of entity collaboration groups
- Entities (authors) belong to groups
- Entities (authors) in a link (document) depend on the groups that are involved
- Latent group variable for each reference
- Group labels and entity labels unobserved
80LDA for Author Entities of Documents
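- Adapt the LDA model for author entities in documents
- Each document has a distribution $\theta$ over groups
- Each group z has a distribution $\phi_z$ over author entities
- For each author entity, sample a group z from $\theta$, and sample an entity from $\phi_z$
[Plate diagram: hyperparameters $\alpha$ and $\beta$, per-document group distribution $\theta$, group label z, author entity a, per-group author distribution $\phi$; plates labeled R_d and D]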
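The generative story for one document can be sketched as follows. This is an illustration only: the group names, author names and probabilities are made up, the Dirichlet draw uses normalized gamma variates from the standard library, and the function names are assumptions.

```python
import random

def sample_dirichlet(concentration, keys, rng):
    """Draw a distribution over `keys` from a symmetric Dirichlet prior."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in keys]
    total = sum(draws)
    return {k: d / total for k, d in zip(keys, draws)}

def sample_from(dist, rng):
    """Sample one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys])[0]

def generate_document(theta, phi, num_refs, rng):
    """For each author slot: sample a group z from theta, then an author from phi[z]."""
    refs = []
    for _ in range(num_refs):
        z = sample_from(theta, rng)
        a = sample_from(phi[z], rng)
        refs.append((z, a))
    return refs

rng = random.Random(0)
groups = ["g1", "g2"]
# Hypothetical per-group distributions over author entities.
phi = {"g1": {"agrawal": 0.5, "srikant": 0.5}, "g2": {"getoor": 1.0}}
theta = sample_dirichlet(0.5, groups, rng)   # this document's distribution over groups
doc = generate_document(theta, phi, num_refs=4, rng=rng)
```

Inference runs this story in reverse: only the author references are observed, and the group and entity labels have to be recovered.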
81LDA for Entity Resolution (LDA-ER)
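- Author entities not directly observed
- Generate entity a as before
- Entities have attributes v
- Generate attribute $v_i$ for the ith reference from entity attribute $v_a$ using a noise process
[Plate diagram: the LDA model for author entities extended with entity attributes v and reference attributes v; plates labeled A, R_d and D]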
82LDA-ER Inference With Known Authors
- Exact inference is intractable
- Use Gibbs Sampling for group and entity labels of each reference
- For the ith reference, sample its group label $z_i$, looking at all other variables
84Determining Number of Entities
- Search over number of entity labels using sampling
- For each entity label i, sample next step
- Move all its references to some existing label j
- Split its references between i and a new label k
- Retain all its references
- Number of entity labels
- Decreases by 1
- Increases by 1
- Stays the same
85Modeling Entity Attributes
- Entity attributes are unknown
- Incorporate P(V) into joint distribution
- Sample entity attributes from full conditional
distribution
86Noise Parameters
- Consider last, first and middle names
- First and middle names may be (incorrectly) initialized or dropped
- Characters may be replaced, deleted or inserted in last names and retained first and middle names
- Iteratively estimate noise parameters from entity and reference attributes
87Overall Inference Algorithm
- Until convergence
- Until convergence
- Sample group label for each reference
- Continue
- For each entity label, reassign all references currently having that label
- Sample attribute value for each entity
- Estimate noise parameters
- Continue
88Experiments Real Data
- CiteSeer
- Convergence in 30 iterations (10-20 mins)
- arXiv HEP
- Convergence in 75 iterations (8-20 hrs)
- Precision, recall and F1 of pair-wise duplicate decisions
- Baseline
- Pair-wise similarity from noise model
- Duplicates if similarity above threshold
- Transitive closure
89Results on Real Data
- Std Dev of F1: $3 \times 10^{-4}$ for CiteSeer, $1.7 \times 10^{-4}$ for HEP
- CiteSeer
- Achieves close to highest possible recall with very high precision
- HEP
- Over 646,000 true duplicate pairs
- 1% improvement means 6,460 pairs
90Performance with Varying Group Numbers
- General trend: higher precision, lower recall with more groups
- F1 reasonably stable over range of groups
91Real Resolution Examples
- Successful Distinction
- (lu j, liu j)
- (chang c, chiang c)
- Successful Identification
- (elliot g, elliott g l)
- (dubnick cezary, dubnicki c)
- (kaelbing l p, kaelbling leslie pack)
- (minton s, minton andrew b)
92Structural Difference between Data Sets
- Percentage of Ambiguous References
- 0.5% for Citeseer
- 9% for HEP
- Average number of collaborators per author
- 2.15 for Citeseer
- 4.5 for HEP
- Average number of references per author
- 2.5 for Citeseer
- 6.4 for HEP
93Synthetic Data Generator
- Data generator mimics real collaborations
- Create collaboration graph in Stage 1
- Create documents from this graph in Stage 2
- Can control
- Number of author entities and documents
- Average number of collaborators per author entity
- Average number of references per author entity
- Average number of references per document
- Percentage of ambiguous references
-
94Trends in Synthetic Data
- Improvement increases sharply with higher
ambiguity in references
95Trends in Synthetic Data
- Improvement increases with more references per
author
96Trends in Synthetic Data
- Improvement increases with more references per
document
97Bibliographic ER Comparison
- Two approaches to relational entity resolution
- Probabilistic Generative Model
- Notion of optimal solution
- Group label for references
- Can generalize for unseen data
- Able to handle noise
- Relational Clustering
- Efficient
- Customizable string similarity measure
- Small improvement over probabilistic model
- Needs threshold to determine duplicates
98Domain 2 Word Sense Resolution
- Words in natural language corpora may be ambiguous
- Bank: financial institution, shore, reserve/stockpile
- Given word occurrence, determine intended sense from context
- Distinction/Disambiguation problem in ER
- References are the word occurrences
- Entities are the ambiguous senses of the words
99Relational WSD from Parallel Corpora
- Translations can help resolve senses
- "Bank" translated in Spanish as "orilla" probably means shore
- Links in WSD
- Aligned translation threads in parallel corpora
- (bank, banco, banca, Bank, banque)
- Multi-type ER
- Each language represents a type
- Need to resolve senses in all languages simultaneously
- Semantic Group Detection
100Bilingual Probabilistic Models for WSD
- Motivated by Diab and Resnik
- Automatic sense tagging using translations
- Probabilistic generative model for translations
- Sample related senses, one from each language
- Sample a word from each selected sense
- Two models for sense relations across languages
- Sense Model: relate senses directly
- Concept Model: relate senses through latent semantic groups
101Generative Model 1 Sense Model
- Two level generative model
- Select a sense T according to priors $P(T)$
- Select English word $W_e$ according to conditional $P(W_e \mid T)$ for that sense
- Select Spanish word $W_s$, again according to conditional $P(W_s \mid T)$
[Diagram: sense T generates English word $W_e$ and Spanish word $W_s$]
$P(W_e, W_s, T) = P(T)\, P(W_e \mid T)\, P(W_s \mid T)$
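Given the factorization of the Sense Model, the most likely sense for an observed translation pair follows from Bayes' rule. The sketch below assumes the model is stored as plain dictionaries; every probability value in it is made up for illustration, not a learned parameter.

```python
def sense_posterior(prior, p_en, p_es, w_en, w_es):
    """P(T | W_e, W_s), proportional to P(T) * P(W_e | T) * P(W_s | T)."""
    joint = {t: prior[t] * p_en[t].get(w_en, 0.0) * p_es[t].get(w_es, 0.0)
             for t in prior}
    total = sum(joint.values())
    return {t: v / total for t, v in joint.items()} if total else joint

# Hypothetical parameters for two senses of "bank".
prior = {"shore": 0.3, "finance": 0.7}
p_en = {"shore": {"bank": 0.5}, "finance": {"bank": 0.5}}
p_es = {"shore": {"orilla": 0.8, "banco": 0.05},
        "finance": {"orilla": 0.01, "banco": 0.9}}

post_orilla = sense_posterior(prior, p_en, p_es, "bank", "orilla")
post_banco = sense_posterior(prior, p_en, p_es, "bank", "banco")
```

With these numbers the translation "orilla" pulls "bank" toward the shore sense despite the lower prior, while "banco" pulls it toward the financial sense.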
102Generative Model 2 Concept Model
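- Three level generative model
- Select concept C according to priors $P(C)$
- Select a sense for each language according to conditionals for that concept, $P(T_e \mid C)$ and $P(T_s \mid C)$
- Select a word conditionally for each of the two senses, $P(W_e \mid T_e)$ and $P(W_s \mid T_s)$
[Diagram: concept C generates senses $T_e$ and $T_s$, which generate words $W_e$ and $W_s$]
$P(W_e, W_s, T_e, T_s, C) = P(C)\, P(T_e \mid C)\, P(T_s \mid C)\, P(W_e \mid T_e)\, P(W_s \mid T_s)$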
103Constructing the Models
- Issues
- Choosing dimensionality of hidden variables
- Use of available semantic hierarchies
- WordNet hierarchy for English
- Use WordNet senses for English words
- Relational clustering to discover Spanish senses
and concepts
104Sense Model Construction
- Use WordNet senses for both languages
- English word belongs to all its senses from WordNet
- Assign Spanish word to all senses for its English translations
105Concept Model Spanish Senses
- Use English sense neighborhood for each Spanish word
- Union of senses for its translations
- One sense for Spanish word
- Each neighborhood defines a Spanish sense
- Multiple senses for a Spanish word
- Break English neighborhoods into frequently occurring sub-neighborhoods
106Concept Model Concepts
- English sense neighborhoods for Spanish senses capture relations across languages
- Cluster English sense neighborhoods to create concepts
- Jaccard similarity of neighborhoods
- One concept for each neighborhood cluster
- Add the Spanish sense for each neighborhood
- Add the English senses from each neighborhood
107Learning Model Parameters
- Select parameters to maximize the joint probability of observed translation pairs
- Expectation Maximization to find model probabilities
- Avoid local maxima
- Use synset occurrence frequencies from WordNet for initialization of model probabilities
108Training the Models
- Training corpus constructed from multiple sources
- Brown Corpus, Senseval 1, Senseval 2 English Lexical Sample, Wall Street Journal Sec 18-24 from Penn Treebank
- Translated into Spanish using Globalink Pro 6.4 and Systran Professional Premium
- GIZA++ for word-level alignments
109Numbers from Experiments
- 16,186 English words, 31,862 Spanish words
- 2,385,574 instances of 41,850 distinct translation pairs
- 20,361 WordNet senses
- Sense model
- 154,947 parameters
- 20,361 senses
- Concept model
- 120,268 parameters
- 20,361 English senses, 11,961 Spanish senses, 7,366 concepts
- EM convergence in about 20 iterations
110WSD Senseval Comparison
- Evaluation on Senseval 2 English All-words
- Focus on nouns: 875 instances
111Semantic Sense Groups
- Semantic structure for Spanish words automatically created with senses and concepts
- Map words to sense entities and group related sense entities into concepts
112Example Concepts Discovered
- accidente accidentes
- muertes(deaths)
- casualty
- matar(to kill) matanzas(slaughter) muertes-le
- slaying
- derramamiento-de-sangre (spilling-of-blood)
- cachiporra(bludgeon) obligar(force) obligando(forcing)
- asesinato(murder) asesinatos
Spanish senses
Concept
Spanish words in a sense
Relevant English dictionary sense
113Example Concepts Discovered
- linterna-eléctrica linterna(lantern)
- faros-automóvil(headlight)
- linternas-portuarias(harbor-light)
- antorcha(torch) antorchas antorchas-pino-nudo
114Example Concepts Discovered
- manía craze
- culto(cult) cultos proto-senility
- delirio delirium
- rabias(fury) rabia farfulla(do hastily)
115Example Concepts Discovered
- oportunidad oportunidades
- ocasión ocasiones
- riesgo(risk) riesgos peligro(danger)
- destino sino(fate)
- fortuna suerte(fate)
- probabilidad probabilidades
116Entity Resolution Summary
- Formulated generalized entity resolution problem addressing
- Reference attributes
- Relational data
- Collective Inference
- Group detection for entities
- Two types of entities for parallel WSD
117Future Work
- Resolving Multiple Entity Types
- Typed relational similarity measures by projecting onto each type and aggregation (MRDM 05)
- Extend group model for multiple types
- Objective Functions for RC-ER
- Notion of optimal solution
- Generalize cut-based co-clustering (Dhillon 01)
- Use entity ontologies for resolution
- WordNet similarity instead of Jaccard similarity for sense neighborhoods
118ER Issues
- collective resolution
- global vs. local resolutions
- multi-entity resolution
- structural properties: when to use links
- characterization of structural properties of collaborative data sets benefiting relational approach
- HCI issues
- task specific interface for graph data
- visualizations which support analytic task
119Entity Resolution in Enron Email
- Message ID: 180231
- Datetime: 2001-01-23 09:45:00
- Sender: Sara Shackleton
- Recipients: Tana Jones
- Subject: Hedge Funds
- Tana: Other than your email attached, have you had other discussions with Mark or credit about hedge funds? Sara
- Sara Shackleton
- Enron North America Corp.
- 1400 Smith Street, EB 3801a
- Houston, Texas 77002
- 713-853-5620 (phone)
- 713-646-3490 (fax)
- sara.shackleton_at_enron.com
[Figure: Emails exchanged between Shackleton and potential candidates]
Joint work with Chris Diehl _at_ JHUAPL
Mark Taylor is the correct association
120Entity Resolution in Email
- Message ID: 182297
- Datetime: 1999-12-20 04:41:00
- Sender: Sara Shackleton
- Recipients: Marie Heard
- Subject: Merrill Lynch - Financial Contract
- This is the deal that Susan F. worked on on Friday. I'll forward the Schedule to you. No one is asking for a revised Schedule yet but we should make the change and email the parties on Susan's email so that everyone knows the latest changes and then ask if anyone has comments. ss
[Figure: Emails exchanged between Shackleton and potential candidates]
More context is needed to resolve the reference
Linking references removes ambiguity in this case
Considering recipient communications with candidates may remove ambiguity as well
121Entity Resolution in Email
- Message ID: 71707
- Datetime: 2001-10-19 14:31:41
- Sender: Sara Shackleton
- Recipients: Kim Ward, Jason Williams
- Subject: FW: FW: Master purchase/sale agreement - Salt River
- Jay: my mistake - Salt River did send a CSA (see below) Sara
[Figure: Emails exchanged between Shackleton and potential candidates]
Jay is in fact a reference to Jason Williams
Williams often signs emails as Jay
Need framework that supports detection and resolution of nicknames
122Entity Resolution in Email
- Message ID: 81944
- Datetime: 2001-10-19 06:28:50
- Sender: Mark Whitt
- Recipients: Barry Tycholiz
- Subject: FW: hockey
- Here is an opportunity to get a box for one of the games. Detroit on Feb 4th would be great! That is a Monday. If you and Kim wanted to you could come up and ski that weekend prior. Let me know what you think
[Figure: Emails exchanged between Whitt and potential candidates]
Candidates listed are only from within Enron
Exploiting the fact that this communication is social in nature may be useful in dismissing an already weak hypothesis
123Entity Resolution in Enron Data
[Figure: data sources for entity resolution in the Enron data]
- Email communication network
- Employee directory, e.g., "Jane Adams x3-4555", "John Addams x4-3421"
- Org chart
- Mail threads, e.g., "To: j.smith_at_enron.com, From: jdoe_at_enron.com, Subject: Re: trade, My friend John says ..."
124Projects
- Link-based Classification
- Link-based Entity Resolution
- efficient algorithms
- visualization tools that support ER
- Social Network Analysis
- Affiliation Networks
- Structural and descriptive modeling
- Friendship Event Networks
- Definitions of Capital and Benefit
- Link Mining for the Semantic Web
- Feature Generation for Sequences (biological data)
- Word-sense disambiguation from Parallel Corpora
125Affiliation Networks
- An affiliation network contains
- Actors A
- Events E
- Relationships R(A,E): actor A participates in event E
- Examples
- Executive Corporate Boards (ECN)
- 66,000 executives, 5,400 companies, 76,000 board memberships
- Author Publication Networks (APN)
- 13,000 authors, 16,000 publications, 39,000 authorships
Joint work with Lisa Singh @ Georgetown
126 3 Views
- Affiliation Network (actors a1-a5, events e1-e3)
- Co-Membership Graph (actors a1-a5)
- Event Overlap Graph (events e1-e3)
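The relationship between the three views can be sketched as two projections of the bipartite relation R(A,E): actors are linked when they share an event (co-membership), and events are linked when they share an actor (event overlap). The function name `project` and the edge list below are hypothetical; the actual edges of the slide's example are not recoverable from the transcript.

```python
from collections import defaultdict
from itertools import combinations


def project(memberships):
    """Project (actor, event) pairs into the two one-mode graphs.

    Returns (co_membership, event_overlap), each a dict that maps a sorted
    pair of nodes to the number of events (or actors) the pair shares.
    """
    events_of = defaultdict(set)
    actors_of = defaultdict(set)
    for actor, event in memberships:
        events_of[actor].add(event)
        actors_of[event].add(actor)

    co_membership = defaultdict(int)
    for actors in actors_of.values():
        for a, b in combinations(sorted(actors), 2):
            co_membership[(a, b)] += 1

    event_overlap = defaultdict(int)
    for events in events_of.values():
        for e, f in combinations(sorted(events), 2):
            event_overlap[(e, f)] += 1

    return dict(co_membership), dict(event_overlap)


# Made-up affiliation network.
memberships = [("a1", "e1"), ("a2", "e1"), ("a2", "e2"), ("a3", "e2")]
co, overlap = project(memberships)
print(co)
print(overlap)
```

Edge weights here are shared-event and shared-actor counts; an unweighted view would simply keep the keys.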
127Compressing the networks
- Descriptive Pruning
- Select actors/events based on attribute values
- e.g., consider only CEOs
- Structural Properties
- Consider actors based on structural properties such as hubs, brokers, etc.
- Evaluation
- Does the pruned network maintain predictive accuracy for network attributes?
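The two pruning strategies can be sketched as filters over the (actor, event) pairs. The slides do not define hubs or brokers precisely, so the structural filter below assumes a hub is simply an actor with at least `min_degree` memberships; the function names, attribute layout, and data are all hypothetical.

```python
from collections import Counter


def descriptive_prune(memberships, attributes, keep):
    """Keep memberships whose actor's attributes satisfy the predicate keep."""
    return [(a, e) for a, e in memberships if keep(attributes.get(a, {}))]


def structural_prune(memberships, min_degree):
    """Keep memberships of actors with at least min_degree memberships.

    Degree is an assumed stand-in for "hub"; it is not the slides' definition.
    """
    degree = Counter(a for a, _ in memberships)
    return [(a, e) for a, e in memberships if degree[a] >= min_degree]


# Made-up board memberships.
memberships = [("ann", "b1"), ("ann", "b2"), ("bob", "b1"), ("cy", "b2")]
attributes = {
    "ann": {"title": "CEO"},
    "bob": {"title": "CFO"},
    "cy": {"title": "CEO"},
}
print(descriptive_prune(memberships, attributes, lambda x: x.get("title") == "CEO"))
print(structural_prune(memberships, 2))
```

In this toy input the two filters keep different actors, which is the kind of difference the evaluation question is meant to probe.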
128Predictive Accuracy of Compression Strategies
129Summary
- Can use both descriptive and structural properties to significantly compress networks while maintaining accuracy
- Descriptive and structural pruning allow us to focus on important actors in the network; however, the sets of actors they prune are quite different
- These pruned networks may be more effective for understanding and visualization
130Projects
- Link-based Classification
- Link-based Entity Resolution
- efficient algorithms
- visualization tools that support ER
- Social Network Analysis
- Affiliation Networks
- Structural and descriptive modeling
- Friendship Event Networks
- Definitions of Capital and Benefit
- Link Mining for the Semantic Web
- Feature Generation for Sequences (biological data)
- Word-sense disambiguation from Parallel Corpora
131Friendship Event Networks
- A friendship event network contains
- Actors
- Friendships
- Events
- Event Organizers
- Event Participants
- Example
- Author Collaboration Networks
- Actors - Researchers
- Friendships - CoAuthors
- Events - Conferences
- Event Organizers - PC Committee
- Event Participants - Authors
132PC Committee and Conference Authors (diagram)
- PC Non Author
- PC Author
- Non PC Author
133Define
- Personal Social Capital - number of friends who are organizers
- Benefit Received - number of publications in conference
- Benefit Given - number of publications of friends of PC member
- Comparison of different event structures
- Temporal Evaluation
- look at event series
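A minimal sketch of the three definitions for a single event, assuming each quantity is a plain count (the extracted slide text drops the counting symbol, so this reading is an assumption). The function name `capital_and_benefit`, the input layout, and the data are hypothetical.

```python
def capital_and_benefit(friends, organizers, publications):
    """Compute capital and benefit counts for one event.

    friends: dict mapping each actor to the set of their friends (co-authors)
    organizers: set of actors on the PC for this event
    publications: dict mapping an actor to their paper count at this event
    """
    # Personal social capital: friends who are organizers.
    capital = {a: len(fs & organizers) for a, fs in friends.items()}
    # Benefit received: the actor's own publications at the event.
    received = {a: publications.get(a, 0) for a in friends}
    # Benefit given: publications of the friends of each PC member.
    given = {
        pc: sum(publications.get(f, 0) for f in friends.get(pc, set()))
        for pc in organizers
    }
    return capital, received, given


# Made-up co-author network and paper counts.
friends = {"ann": {"bob", "cy"}, "bob": {"ann"}, "cy": {"ann"}}
organizers = {"bob"}
publications = {"ann": 2, "cy": 1}
print(capital_and_benefit(friends, organizers, publications))
```

For the temporal evaluation, the same function could be applied once per year of an event series and the resulting counts compared over time.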
134Datasets
Data for the past 10 years of 3 major CS conferences
135Overall Capital and Benefit
136C1 Friendship
137C1 Capital
138C1 Capital/Friendship Ratio
139PC/Author Ratio
140Capital/Benefit Summary
- Defined a generic friendship-event network
- Identified interesting structural properties
- Very preliminary, much more work to be done
141Link Mining for the Semantic Web
- Need to be able to extract multi-relational data, not just a single table
- Semantic Web tasks which could make use of learning
- schema discovery
- populating ontology
- schema mappings
- schema reformulation
- SRL capabilities that are needed
- link-based object classification
- link type prediction
- predicting link existence
- link cardinality estimation
- entity resolution and object consolidation
- group detection
- predicate invention
142An Integrated Approach
- Diagram: ontologies, SRL, data
- Current projects focus on
- 1. Link Type Prediction
- 2. Link Ontology Discovery
143Projects
- Link-based Classification
- Link-based Entity Resolution
- efficient algorithms
- visualization tools that support ER
- Social Network Analysis
- Affiliation Networks
- Structural and descriptive modeling
- Friendship Event Networks
- Definitions of Capital and Benefit
- Link Mining for the Semantic Web
- Feature Generation for Sequences (biological data)
- Schema Maintenance and Discovery
144Summary Link Mining
- Tasks
- Link-based Object Classification
- Object Type Prediction
- Link Type Prediction
- Predicting Link Existence
- Link Cardinality Estimation
- Entity Resolution
- Group Detection
- Subgraph Discovery
- Metadata Mining
- Challenges
- Collective Classification
- Collective Consolidation
- Logical vs. Statistical dependencies
- Feature construction
- Instances vs. Classes
- Effective Use of Labeled and Unlabeled Data
- Link Prediction
- Closed vs. Open World
These are some of the key capabilities needed to
perform today's complex analytic tasks
145Recent SRL Activities
- Invited Tutorial at ICML/ILP 2005 and Tutorial at IJCAI
- Dagstuhl 2005 workshop on Probabilistic, Relational and Logical Learning, co-organized w/ Luc De Raedt, Stephen Muggleton and Tom Dietterich. http://www.dagstuhl.de/05051/
- ICML 2004 workshop on Statistical Relational Learning and its Connections to Other Fields, co-organized w/ Tom Dietterich and Kevin Murphy, http://www.cs.umd.edu/projects/srl2004/
- IJCAI 2003 workshop on Statistical Relational Learning, co-organized w/ David Jensen, http://kdl.cs.umass.edu/srl2003/
- AAAI 2000 workshop on Statistical Relational Learning, co-organized w/ David Jensen, http://robotics.stanford.edu/srl
- Related workshops
- KDD MRDM workshops
- http://www-ai.ijs.si/SasoDzeroski/MRDM2004/
- http://www-ai.ijs.si/SasoDzeroski/MRDM2003/
- http://www-ai.ijs.si/SasoDzeroski/MRDM2002/
- Benjamin Taskar and I are working on an edited SRL collection
146SRL Related Courses
- My course at UMD, http://www.cs.umd.edu/class/spring2005/cmsc828g/
- Pedro Domingos' course at UWash
- Tom Dietterich's course at OSU, http://web.engr.oregonstate.edu/tgd/classes/539/
- David Page, Mark Craven and Jude Shavlik at UWisc, http://www.biostat.wisc.edu/page/838.html
- Eric Mjolsness' course at UCI on Probabilistic Knowledge Representation, http://computableplant.ics.uci.edu/emj/classes/280_04/Syllabus20ICS2028020v2.doc
- Stuart Russell's course at Berkeley on Knowledge Representation and Reasoning, http://www.cs.berkeley.edu/russell/classes/cs289/f04/
- Joydeep Ghosh's course at UT Austin on Advanced Topics in Data Mining, http://www.lans.ece.utexas.edu/course/382v/05sp/
- Michael Littman's course at Rutgers on Learned Representations in AI, http://www.cs.rutgers.edu/mlittman/courses/lightai03/
- David Jensen and Andrew McCallum's course at UMass on Computational Social Network Analysis, http://kdl.cs.umass.edu/courses/csna/
147References
- Deduplication and Group Detection Using Links, Indrajit Bhattacharya and Lise Getoor. 10th ACM SIGKDD Workshop on Link Analysis and Group Detection, Seattle, WA, August 2004.
- Word Sense Disambiguation using Probabilistic Models, Indrajit Bhattacharya, Lise Getoor and Yoshua Bengio. 42nd Annual Meeting of the Association for Computational Linguistics, Barcelona, SP, July 2004.
- Iterative Record Linkage for Cleaning and Integration, Indrajit Bhattacharya and Lise Getoor. 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, Paris, FR, June 2004.
- Using the Structure of Web Sites for Automatic Segmentation of Tables, Kristina Lerman, Lise Getoor, Steve Minton and Craig Knoblock. Proceedings of ACM-SIGMOD 2004 International Conference on Management of Data, Paris, FR, June 2004.
- Structure Discovery using Statistical Relational Learning, Lise Getoor. Data Engineering Bulletin, vol. 26, No. 3, 2003.
- Link Mining: A New Data Mining Challenge, Lise Getoor. SIGKDD Explorations, volume 5, issue 1, 2003.
- Iterative Deduplication, I. Bhattacharya and L. Getoor.
- Link-based Classification, Q. Lu and L. Getoor. International Conference on Machine Learning, August 2003.
- Labeled and Unlabeled Data for Link-based Classification, Q. Lu and L. Getoor. ICML workshop on The Continuum from Labeled to Unlabeled Data, August 2003.
- Link-based Classification for Text Classification and Mining, Q. Lu and L. Getoor. IJCAI workshop on Text Mining and Link Analysis.
http://www.cs.umd.edu/getoor
Google: getoor