Link Mining and Entity Resolution - PowerPoint PPT Presentation

08/17/05. JIKD Presentation.

1
Link Mining and Entity Resolution
  • Lise Getoor
  • University of Maryland, College Park

Students: Indrajit Bhattacharya, Mustafa Bilgic,
Rezarta Islamaj, Louis Licamele and Prithviraj Sen
2
Roadmap
  • Link Mining
  • Projects

3
Link Mining
  • Traditional machine learning and data mining
    approaches assume:
  • A random sample of homogeneous objects from a
    single relation
  • Real-world data sets are:
  • Multi-relational, heterogeneous and
    semi-structured
  • Represented as a graph or network
  • Statistical Relational Learning (SRL):
  • A newly emerging research area at the
    intersection of research in social network and
    link analysis, hypertext and web mining, graph
    mining, relational learning and inductive logic
    programming
  • Sample domains:
  • Web data, bibliographic data, epidemiological
    data, communication data, customer networks,
    collaborative filtering, trust networks,
    biological data

4
What is SRL?
  • Three views

5
View 1: Alphabet Soup
LBN, CLP(BN), SRM, PRISM, RDBN, RPM, SLR, BLOG, PLL,
pRN, PER, PRM, SLP, MLN, HMRF, RMN, RNM, DAPER, RDN,
BLP, SGLR
6
View 2: Representation Soup
  • Hierarchical Bayesian Model Relational
    Representation

[Diagram: Logic, with probabilities added, and
Probabilities, with relations added, both lead to
Statistical Relational Learning]
7
View 3: Data Soup
Training Data
Test Data
13
Link Mining Tasks
  • Tasks
  • Object Classification
  • Object Type Prediction
  • Link Type Prediction
  • Predicting Link Existence
  • Link Cardinality Estimation
  • Entity Resolution
  • Group Detection
  • Subgraph Discovery
  • Metadata Mining

14
Sample Problem Domain
  • Research World
  • Researchers
  • Papers
  • Reviewers
  • Co-authors
  • Citations
  • Topics
  • A.k.a. Tenure World

15
Object Prediction
  • Object Classification
  • Predicting the category of an object based on its
    attributes and its links and attributes of linked
    objects
  • e.g., predicting the topic of a paper based on
    the words used in the paper, the topics of papers
    it cites, the research interests of the author
  • Object Type Prediction
  • Predicting the type of an object based on its
    attributes and its links and attributes of linked
    objects
  • e.g., predict the venue type of a publication
    (conference, journal, workshop) based on
    properties of the paper

16
Link Prediction
  • Link Classification
  • Predicting type or purpose of link based on
    properties of the participating objects
  • e.g., predict whether a citation is to
    foundational work, background material,
    gratuitous PC reference
  • Predicting Link Existence
  • Predicting whether a link exists between two
    objects
  • e.g., predicting whether a paper will cite
    another paper
  • Link Cardinality Estimation
  • Predicting the number of links to an object or
    predicting the number of objects reached along a
    path from an object
  • e.g., predict the number of citations of a paper

17
More complex prediction tasks
  • Group Detection
  • Predicting when a set of entities belong to the
    same group based on clustering both object
    attribute values and link structure
  • e.g., identifying research communities
  • Entity Resolution
  • Predicting when a collection of objects are the
    same, based on their attributes and their links
    (aka record linkage, identity uncertainty)
  • e.g., predicting when two citations are referring
    to the same paper.
  • Predicate Invention
  • Induce a new general relation/link from existing
    links and paths
  • e.g., propose concept of advisor from co-author
    and financial support
  • Subgraph Identification, Metadata Mapping

18
SRL Challenges
  • Collective Classification
  • Collective Consolidation
  • Logical vs. Statistical dependencies
  • Feature Construction: aggregation, selection
  • Flexible and Decomposable Combining Rules
  • Instances vs. Classes
  • Effective Use of Labeled & Unlabeled Data
  • Link Prediction
  • Closed vs. Open World

Challenges common to any SRL approach! Bayesian
Logic Programs, Relational Logic Networks,
Probabilistic Relational Models, Relational
Markov Networks, Relational Probability Trees and
Stochastic Logic Programming, to name a few
19
Collective classification
  • Using a link-based statistical model for
    classification
  • Inference using learned model is complicated by
    the fact that there is correlation between the
    object labels
  • Must find a labeling that maximizes the joint
    (conditional) probability

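A minimal sketch of one way to approximate such a joint labeling: repeatedly relabel each unlabeled object from the current labels of its neighbors until nothing changes. The neighbor-majority rule, the function name `iterative_classification` and the toy citation graph are illustrative assumptions, not the model from the talk, and the heuristic does not maximize the joint probability exactly.

```python
from collections import Counter


def iterative_classification(neighbors, known, candidates, rounds=10):
    """Relabel unlabeled nodes by neighbor majority until labels stabilize."""
    labels = dict(known)
    unknown = [node for node in neighbors if node not in known]
    for _ in range(rounds):
        changed = False
        for node in unknown:
            # Count the labels currently assigned to this node's neighbors.
            votes = Counter(labels[m] for m in neighbors[node] if m in labels)
            if not votes:
                continue
            # Highest vote count wins; ties go to the earlier candidate.
            best = max(candidates, key=lambda c: votes[c])
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


# Hypothetical citation graph: p1, p2, p3 and p6 have known topics.
NEIGHBORS = {
    "p1": ["p4"], "p2": ["p4"], "p3": ["p5"], "p6": ["p5"],
    "p4": ["p1", "p2", "p5"], "p5": ["p3", "p6", "p4"],
}
KNOWN = {"p1": "A", "p2": "A", "p3": "B", "p6": "B"}
LABELS = iterative_classification(NEIGHBORS, KNOWN, ["A", "B"])
```

The label of p4 depends on the label chosen for p5 and vice versa, which is the correlation that makes independent per-object prediction insufficient.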
20
Collective consolidation
  • Using a link-based statistical model for object
    consolidation
  • Consolidation decisions should not be made
    independently
  • Must find a clustering that maximizes the joint
    (conditional) probability

21
Logical vs. Statistical Dependence
  • Coherently handling two types of dependence
    structures
  • Link structure - the logical relationships
    between objects
  • Probabilistic dependence - statistical
    relationships between attributes
  • Challenge: statistical models that support rich
    logical relationships
  • Model search is complicated by the fact that
    attributes can depend on arbitrarily linked
    attributes; the issue is how to search this
    huge space

22
Model Search
[Figure: graph with nodes labeled P1, P2, P3, I1
and A1, and a node P marked "?"]
23
Feature Construction
  • In many cases, objects are linked to a set of
    objects. To construct a single feature from this
    set of objects, we may either use
  • Aggregation
  • Selection

24
Aggregation
[Figure: graph with nodes labeled P1, P2, P3, I1
and A1, and a node P marked "?"]
25
Selection
[Figure: graph with nodes labeled P1, P2, P3, I1
and A1, and a node P marked "?"]
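The two options for turning a set of linked objects into a single feature can be sketched as follows. The record fields (`topic`, `cites`) and both helper names are hypothetical; the mode is only one possible aggregate, and "most cited" is only one possible selection rule.

```python
from collections import Counter


def aggregate_mode(values):
    """Aggregation: summarize the whole set, here by its most common value."""
    return Counter(values).most_common(1)[0][0]


def select_by(objects, key):
    """Selection: pick one linked object (the maximum under `key`)."""
    return max(objects, key=key)


# Hypothetical papers cited by the paper whose topic we want to predict.
CITED = [
    {"topic": "ML", "cites": 10},
    {"topic": "DB", "cites": 50},
    {"topic": "ML", "cites": 3},
]

# Feature 1: the most common topic among all cited papers.
MODE_TOPIC = aggregate_mode(p["topic"] for p in CITED)
# Feature 2: the topic of the single most-cited paper.
SELECTED_TOPIC = select_by(CITED, key=lambda p: p["cites"])["topic"]
```

On this toy input the two features disagree, which shows why the choice between aggregation and selection matters.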
26
Individuals vs. Classes
  • Does the model refer
  • explicitly to individuals, or
  • to classes or generic categories of individuals?
  • On one hand, we'd like to be able to model that a
    connection to a particular individual may be
    highly predictive
  • On the other hand, we'd like our models to
    generalize to new situations, with different
    individuals

27
Instance-based Dependencies
[Figure: nodes labeled P3, I1 and A1]
Papers that cite P3 are likely to be
28
Class-based Dependencies
[Figure: nodes labeled "?", I1 and A1]
Papers that cite are likely to be
29
Labeled & Unlabeled Data
  • In link-based domains, unlabeled data provide
    three sources of information
  • Helps us infer object attribute distribution
  • Links between unlabeled data allow us to make use
    of attributes of linked objects
  • Links between labeled data and unlabeled data
    (training data and test data) help us make more
    accurate inferences

30
Link Prior Probability
  • The prior probability of any particular link is
    typically extraordinarily low
  • For medium-sized data sets, we have had success
    with building explicit models of link existence
  • It may be more effective to model links at a
    higher level; this is required for large data
    sets!

31
Closed World vs. Open World
  • The majority of SRL approaches make a closed
    world assumption, which assumes that we know all
    the potential entities in the domain
  • In many cases, this is unrealistic
  • Work by Milch, Marthi, Russell on BLOG

32
SRL Tasks & Challenges Summary
  • Tasks
  • Link-based Object Classification
  • Object Type Prediction
  • Link Type Prediction
  • Predicting Link Existence
  • Link Cardinality Estimation
  • Entity Resolution
  • Group Detection
  • Predicate Invention
  • Subgraph Discovery
  • Metadata Mining
  • Issues & Challenges
  • Collective Classification
  • Collective Consolidation
  • Logical vs. Statistical dependencies
  • Feature construction
  • Instances vs. Classes
  • Effective Use of Labeled & Unlabeled Data
  • Link Prediction
  • Closed vs. Open World

33
Current Projects
  • Link-based Classification
  • Link-based Entity Resolution
  • Social Network Analysis
  • Affiliation Networks
  • Structural and descriptive modeling
  • Friendship Event Networks
  • Definitions of Capital and Benefit
  • Link Mining for the Semantic Web
  • Feature Generation for Sequences (biological
    data)
  • Schema Maintenance and Discovery

34
Link-based Classification
  • Predicting the category of an object based on its
    attributes and its links and attributes of linked
    objects

[Figure: graph of linked objects labeled A, B and
C, with one unlabeled object marked "?"]
35
Our Approach
  • Investigate use of labeled and unlabeled data for
    classification
  • Learning of models
  • iterative algorithm
  • Prediction
  • links among (unlabeled) test and (labeled)
    training data
  • Requires collective classification
  • Link-based models
  • Integrate link features with object attributes
    using logistic regression

36
Experiment
37
Projects
  • Link-based Classification
  • Link-based Entity Resolution
  • Social Network Analysis
  • Affiliation Networks
  • Structural and descriptive modeling
  • Friendship Event Networks
  • Definitions of Capital and Benefit
  • Link Mining for the Semantic Web
  • Feature Generation for Sequences (biological
    data)
  • Schema Maintenance and Discovery

38
James Smith
John Smith
Jim Smith
John Smith
James Smith
Jon Smith
Jonathan Smith
J Smith
Jonthan Smith
39
Generalized Entity Resolution
  • Discover the domain entities
  • Map each reference to an entity
  • Identification
  • References to the same entity may look different
  • Jonathan Smith, Jon Smith, Jonthan Smith
  • Distinction/Disambiguation
  • References to different entities may look similar
  • Jon Smith, John Smith

40
Entity Resolution Domains
  • Databases
  • Deduplication in Data Cleaning
  • Data Integration with similarity joins
  • Natural Language Processing
  • Noun co-reference
  • Sense disambiguation
  • Named entity recognition
  • Computer Vision
  • Correspondence Problem

41
Issues
  • Reference attributes
  • Multiple Entity Types
  • Relational Reference Data
  • Collective Resolution
  • Group Detection
  • Entity Ontologies

43
ER using Reference Attributes
  • Identify references with similar attributes
  • Define (and learn) attribute similarity measures
  • Resolve references pairwise, then take the
    transitive closure
  • Problem: similarity threshold for resolution
  • Better identification calls for lower threshold
  • Better distinction calls for higher threshold

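The attribute-only approach and its threshold problem can be sketched as below. The standard-library `difflib.SequenceMatcher` ratio stands in for whatever string similarity is actually used, and `resolve` and the three-name list are illustrative assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def resolve(references, threshold):
    """Pairwise duplicate decisions followed by transitive closure."""
    parent = list(range(len(references)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        a, b = references[i].lower(), references[j].lower()
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            parent[find(i)] = find(j)  # merge the two clusters

    clusters = {}
    for i, ref in enumerate(references):
        clusters.setdefault(find(i), []).append(ref)
    return list(clusters.values())


REFS = ["Jon Smith", "John Smith", "Jonathan Smith"]
STRICT = resolve(REFS, 0.9)   # higher threshold: fewer merges
LOOSE = resolve(REFS, 0.75)   # lower threshold: more merges
```

With the higher threshold "Jon Smith" is not identified with "Jonathan Smith"; with the lower one it is, but "John Smith" is pulled into the same cluster as well, so no single threshold serves both identification and distinction.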
44
Motivation for Relational ER
  • References may not be observed independently in
    data
  • References are linked
  • Link is a set of related references
  • Represents relations among underlying entities
  • E.g. parent-dependent or sibling relations among
    person records in census database
  • Links can help in identification and distinction

45
Example References In Census Data
[Figure: name references from census records: Jon,
Jim, Liz, Jon, James, P, John, Gwyneth, Betsy, J,
Gwen, Elizabeth, Jonthan, Don, Paul, Jonathan,
Laura, Betsy, Sharon, Ron, Kate, L, J, D]
Typos: Jon: Jonathan? John?
Initials: J: James? John? Jonathan? Or none?
46
Example Links In Census Data
[Figure: the census name references (Jon, Jim, Liz,
James, P, John, Gwyneth, Betsy, J, Gwen, Elizabeth,
Jonthan, Don, Paul, Jonathan, Laura, Sharon, Ron,
Kate, L, D), now connected by links]
Links represent family relations
47
Example Inference from Links
[Figure: the linked census name references, with
two regions annotated]
Ambiguity is almost eliminated
Ambiguity is reduced
48
Entity Resolution From Relational Data
  • References with similar attributes that have
    similar relations as well are more likely to be
    the same entity
  • ER Approach 1
  • Cluster references using relational similarity

49
Entity Resolution Using Group Membership Evidence
  • Links represent correlations among entities
  • Some entities more likely to co-occur in links
    than others
  • ER Approach 2: Capture correlations explicitly
    with latent group variable
  • Entities are members of possibly overlapping
    groups
  • Entities in same group more likely to form links

50
Entity Resolution Using Group Membership Evidence
[Figure: the linked census name references
partitioned into Familial Group 1 and Familial
Group 2]
Belong to same familial group
Belong to same familial group
Belong to different familial groups
51
Entity Resolution Using Group Membership Evidence
  • Group Detection is interesting and important
  • Collaborative groups in social sciences and
    bibliometry
  • Semantic word groups from natural language
    corpora
  • By-product of ER using groups: Group Detection
    from ambiguous references

52
Collective Entity Resolution from Relations
  • Resolutions cannot be made independently for
    different references
  • Dependency flows between resolution decisions
    through reference links
  • J Smith's wife Betsy is the same as Betsy who is
    the mother of Paul; Paul is the same as P Smith,
    who is John Smith's son

53
Collective Entity Resolution from Relations
  • Resolutions cannot be made independently for
    different references
  • Dependency flows between resolution decisions
    through reference links
  • When modeling groups, entity resolutions depend
    on groups, groups depend on resolved entities

54
Evaluation Domains
  • Bibliographic Data
  • Author resolution using co-author links
  • Relational Clustering (RC-ER) (DMKD 04, LinkKDD
    04, submitted book chapter)
  • LDA-based Group Model (LDA-ER) (under review)
  • Natural Language
  • Sense resolution using translation links in
    parallel corpora (ACL 04)
  • Sense Model: senses in different languages depend
    directly on each other
  • Concept Model: semantic sense groups, or
    concepts, relate senses from different languages
55
Domain 1: Bibliographic Entity Resolution
  • Resolve author, paper, venue, publisher entities
    from citation strings
  • R. Agrawal, R. Srikant. Fast algorithms for
    mining association rules in large databases. In
    VLDB-94, 1994.
  • Rakesh Agrawal and Ramakrishnan Srikant. Fast
    Algorithms for Mining Association Rules. In
    Proc. of the 20th Int'l Conference on Very Large
    Databases, Santiago, Chile, September 1994.

57
Exploiting Bibliographic Links
[Figure: two citation graphs. R. Agrawal and R.
Srikant are linked as co-authors, and each writes
"Fast algorithms for mining association rules in
large databases", which is published-in VLDB-94,
1994. Rakesh Agrawal and Ramakrishnan Srikant are
linked as co-authors, and each writes "Fast
Algorithms for Mining Association Rules", which is
published-in Proc. of the 20th Int'l Conference on
Very Large Databases, Santiago, Chile, September
1994.]
58
Exploiting Bibliographic Links
R. Agrawal
Rakesh Agrawal
Ramakrishnan Srikant
R. Srikant
Fast algorithms for mining association rules in
large databases
Fast Algorithms for Mining Association Rules
VLDB-94, 1994
Proc. of the 20th Int'l Conference on Very Large
Databases, Santiago, Chile, September 1994
59
Exploiting Bibliographic Links
entity 1: R. Agrawal; Rakesh Agrawal
entity 2: Ramakrishnan Srikant; R. Srikant
entity 3: Fast algorithms for mining association
rules in large databases; Fast Algorithms for
Mining Association Rules
entity 4: VLDB-94, 1994; Proc. of the 20th Int'l
Conference on Very Large Databases, Santiago,
Chile, September 1994
61
Approach 1: ER using Relational Clustering (RC-ER)
  • Iteratively cluster similar references into
    entities

[Figure: clusters c1, c2, c3, c4, c5, c6, c7, c8]
62
Approach 1: ER using Relational Clustering (RC-ER)
  • Iteratively cluster similar references into
    entities

[Figure: clusters c1 and c2 (R. Agrawal, Rakesh
Agrawal); c9 (Ramakrishnan Srikant, R. Srikant);
c5 and c6 (the two paper titles); c7 and c8 (the
two venue strings)]
63
Approach 1: ER using Relational Clustering (RC-ER)
  • Iteratively cluster similar references into
    entities

[Figure: clusters c10, c9, c5, c6, c7, c8]
64
Approach 1: ER using Relational Clustering (RC-ER)
  • Iteratively cluster similar references into
    entities

[Figure: clusters c10, c9, c11, c7, c8]
65
Approach 1: ER using Relational Clustering (RC-ER)
  • Iteratively cluster similar references into
    entities

[Figure: clusters c10, c9, c11, c12]
66
Similarity Measure For Clustering
  • Linear combination of attribute and relational
    similarity of reference clusters
  • $sim(c_i, c_j) = (1 - \alpha)\, sim_{attr}(c_i, c_j) + \alpha\, sim_{rel}(c_i, c_j)$
  • Attribute similarity measure
  • Several measures available for pairs of strings
  • Levenshtein, Smith-Waterman, Jaro
  • Combine pairwise measures for attribute
    similarity of two reference clusters
  • Single link, average link, complete link
  • Representative attribute for clusters

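A minimal sketch of the combined measure, assuming a weight `alpha` between 0 and 1 on the relational term. `difflib.SequenceMatcher` is a stand-in for the string measures named on the slide, and `attr_sim` and `combined_sim` are hypothetical helper names.

```python
from difflib import SequenceMatcher


def string_sim(a, b):
    """Stand-in pairwise string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def attr_sim(cluster_a, cluster_b, link="average"):
    """Combine pairwise string scores into a cluster-level score."""
    scores = [string_sim(a, b) for a in cluster_a for b in cluster_b]
    if link == "single":      # most similar pair
        return max(scores)
    if link == "complete":    # least similar pair
        return min(scores)
    return sum(scores) / len(scores)  # average link


def combined_sim(attr, rel, alpha):
    """Linear combination of attribute and relational similarity."""
    return (1 - alpha) * attr + alpha * rel


C1 = ["R. Agrawal", "Rakesh Agrawal"]
C2 = ["R. Agrawal", "Rakesh Agarwal"]
SINGLE = attr_sim(C1, C2, "single")
AVERAGE = attr_sim(C1, C2, "average")
COMPLETE = attr_sim(C1, C2, "complete")
```

Setting `alpha` to 0 recovers a purely attribute-based measure, and setting it to 1 uses only the relational evidence.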
67
Relational Similarity Measure
  • Cluster similarity should capture the dependence
    between resolution decisions through links
  • Each reference cluster ci has its link set H(ci)
  • Link for each reference in ci
  • Capture similarity of links in two clusters

68
Edge Detail Similarity
  • Similarity of two links depends on their
    references
  • Consider resolution decisions on the references

Both links connect to cluster 9
69
Edge Detail Similarity
  • Similarity of two links depends on their
    references
  • Consider resolution decisions on the references
  • Label set $E_h(i)$ of the ith link:
  • multiset of cluster labels of its references
  • $sim_h(i, j) = \mathrm{Jaccard}(E_h(i), E_h(j))$
  • Edge detail similarity of two clusters:
  • $sim_{rel}(c, c') = \min(sim_h(i), sim_h(j)),\ i \in H(c),\ j \in H(c')$

70
Neighborhood Similarity
  • Edge detail similarity is expensive
  • Ignore explicit link structure
  • Consider only set of neighborhood clusters
  • Clusters c1, c2 still similar in terms of
    relationships

[Figure: four links (link 1 to link 4) over
clusters c1, c2, c3, c4 and c5]
71
Neighborhood Similarity
  • Edge detail similarity is expensive
  • Ignore explicit link structure
  • Consider only set of neighborhood clusters
  • $N(c)$: multiset of cluster labels covered by
    links in $H(c)$
  • Neighborhood similarity of two clusters:
  • $sim_{rel}(c, c') = \mathrm{Jaccard}(N(c), N(c'))$

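Neighborhood similarity can be sketched with `collections.Counter` as the multiset type. The link and label data are hypothetical, and including a cluster's own label in its neighborhood is an assumption made because the definition above does not say to drop it.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Jaccard coefficient of two multisets: |a & b| / |a | b|."""
    a, b = Counter(a), Counter(b)
    union = sum((a | b).values())
    return sum((a & b).values()) / union if union else 0.0


def neighborhood(cluster_label, links, label_of):
    """N(c): cluster labels covered by every link that touches cluster c."""
    labels = Counter()
    for link in links:
        link_labels = [label_of[ref] for ref in link]
        if cluster_label in link_labels:
            labels.update(link_labels)
    return labels


# Hypothetical references r1..r5, each currently in its own cluster.
LABEL_OF = {"r1": "c1", "r2": "c2", "r3": "c3", "r4": "c4", "r5": "c5"}
LINKS = [["r1", "r3", "r4"], ["r1", "r5"], ["r2", "r3", "r4"], ["r2", "r5"]]

N1 = neighborhood("c1", LINKS, LABEL_OF)
N2 = neighborhood("c2", LINKS, LABEL_OF)
SIM = multiset_jaccard(N1, N2)
```

Only the bag of neighboring cluster labels is compared, so the cost of matching individual links against each other is avoided.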
72
Evaluation Datasets
  • CiteSeer
  • Machine Learning Citations
  • Originally created by Lawrence et al.
  • 2,892 references to 1,165 true authors
  • 1,504 links
  • arXiv HEP
  • Papers from High Energy Physics
  • Used for KDD-Cup 03 Data Cleaning Challenge
  • 58,515 references to 9,200 true authors
  • 29,555 links

73
Baseline
  • Pairwise duplicate decisions using Soft-TFIDF
    (ATTR)
  • Secondary string similarity: Scaled
    Levenshtein (SL), Jaro (JA), Jaro-Winkler (JW)
  • Transitive closure over pairwise decisions
    (ATTR*)
  • Precision, Recall and F1 over pairwise decisions
  • Both algorithms require similarity threshold
  • Report best performance over all thresholds

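Precision, recall and F1 over pairwise decisions can be computed by comparing the sets of co-clustered pairs, as in this sketch. The function names and the toy clusterings are illustrative; returning 0.0 when a denominator is empty is an assumed convention.

```python
from itertools import combinations


def duplicate_pairs(clusters):
    """All unordered pairs of references placed in the same cluster."""
    return {frozenset(p) for c in clusters for p in combinations(c, 2)}


def pairwise_prf(predicted, truth):
    """Precision, recall and F1 of predicted duplicate pairs."""
    pred, true = duplicate_pairs(predicted), duplicate_pairs(truth)
    tp = len(pred & true)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(true) if true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical reference ids: truth has 4 duplicate pairs, prediction has 2.
TRUTH = [[1, 2, 3], [4, 5]]
PREDICTED = [[1, 2], [3], [4, 5]]
PRECISION, RECALL, F1 = pairwise_prf(PREDICTED, TRUTH)
```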
74
Results: F1 for Different String Similarity
Measures
  • For each measure, neighborhood similarity does
    better than ATTR and ATTR*, and edge detail
    does better than neighborhood

75
Results: Varying Combination Weight using
Bootstrapping
77
Results: Execution Time
78
Results: Best F1
  • Relational measures improve performance over
    attribute baseline in terms of precision, recall
    and F1
  • Neighborhood similarity performs almost as well
    as edge detail
  • Neighborhood similarity faster than edge detail

79
Approach 2: Latent Dirichlet Model for ER
  • Probabilistic model of entity collaboration
    groups
  • Entities (authors) belong to groups
  • Entities (authors) in a link (document) depend on
    the groups that are involved
  • Latent group variable for each reference
  • Group labels and entity labels unobserved

80
LDA for Author Entities of Documents
  • Adapt the LDA model for author entities in
    documents
  • Each document has a distribution θ over groups
  • Each group z has a distribution φ_z over author
    entities
  • For each author entity, sample a group z from θ,
    and sample an entity from φ_z

[Figure: plate diagram with nodes θ, z, a, φ and β,
and plates Rd and D]
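The generative story for one document can be sketched with the standard library alone. For brevity the per-group author distributions are fixed by hand instead of being drawn from a prior, and the group contents, author names and the value of `alpha` are all hypothetical.

```python
import random


def sample_dirichlet(rng, alpha, k):
    """Draw from a symmetric Dirichlet via normalized gamma variates."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(rng, num_refs, group_author_dists, alpha):
    """Sample the document's group mixture, then a group and author per reference."""
    mixture = sample_dirichlet(rng, alpha, len(group_author_dists))
    authors = []
    for _ in range(num_refs):
        # Pick a group from the document's distribution over groups.
        z = rng.choices(range(len(mixture)), weights=mixture)[0]
        # Pick an author entity from that group's distribution.
        names, probs = zip(*group_author_dists[z].items())
        authors.append(rng.choices(names, weights=probs)[0])
    return mixture, authors


# Two hypothetical collaboration groups over four author entities.
GROUPS = [
    {"author_a": 0.6, "author_b": 0.4},
    {"author_c": 0.5, "author_d": 0.5},
]
MIXTURE, AUTHORS = generate_document(random.Random(0), 5, GROUPS, alpha=0.5)
```

A small `alpha` tends to concentrate each document on one group, which matches the intuition that the authors of a single paper usually come from the same collaboration group.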
81
LDA for Entity Resolution (LDA-ER)
  • Author entities not directly observed
  • Generate entity a as before
  • Entities have attributes v
  • Generate attribute vi for ith reference from
    entity attribute va using noise process

[Figure: plate diagram with nodes θ, z, a, φ, β and
attribute nodes v, and plates A, Rd and D]
83
LDA-ER Inference With Known Authors
  • Exact inference is intractable
  • Use Gibbs Sampling for group and entity labels of
    each reference
  • For the ith reference, sample its group label zi,
    looking at all other variables

84
Determining Number of Entities
  • Search over number of entity labels using
    sampling
  • For each entity label i, sample next step:
  • Move all its references to some existing label j
    (number of entity labels decreases by 1)
  • Split its references between i and a new label k
    (number of entity labels increases by 1)
  • Retain all its references (number of entity
    labels stays the same)

85
Modeling Entity Attributes
  • Entity attributes are unknown
  • Incorporate P(V) into joint distribution
  • Sample entity attributes from full conditional
    distribution

86
Noise Parameters
  • Consider last, first and middle names
  • First and middle names may be (incorrectly)
    reduced to initials or dropped
  • Characters may be replaced, deleted or inserted
    in last names and retained first and middle names
  • Iteratively estimate noise parameters from entity
    and reference attributes

87
Overall Inference Algorithm
  • Until convergence:
      • Until convergence:
          • Sample group label for each reference
      • For each entity label, reassign all references
        currently having that label
      • Sample attribute value for each entity
      • Estimate noise parameters

88
Experiments: Real Data
  • Citeseer
  • Convergence in 30 iterations (10-20 mins)
  • arXiv HEP
  • Convergence in 75 iterations (8-20 hrs)
  • Precision, recall and F1 of pair-wise duplicate
    decisions
  • Baseline
  • Pair-wise similarity from noise model
  • Duplicates if similarity above threshold
  • Transitive closure

89
Results on Real Data
  • Std. dev. of F1: $3 \times 10^{-4}$ for CS,
    $1.7 \times 10^{-4}$ for HEP
  • CiteSeer
  • Achieves close to highest possible recall with
    very high precision
  • HEP
  • Over 646,000 true duplicate pairs
  • 1% improvement means 6,460 pairs

90
Performance with Varying Group Numbers
  • General trend: higher precision, lower recall
    with more groups
  • F1 reasonably stable over range of groups

91
Real Resolution Examples
  • Successful Distinction
  • (lu j, liu j)
  • (chang c, chiang c)
  • Successful Identification
  • (elliot g, elliott g l)
  • (dubnick cezary, dubnicki c)
  • (kaelbing l p, kaelbling leslie pack)
  • (minton s, minton andrew b)

92
Structural Difference between Data Sets
  • Percentage of Ambiguous References
  • 0.5% for Citeseer
  • 9% for HEP
  • Average number of collaborators per author
  • 2.15 for Citeseer
  • 4.5 for HEP
  • Average number of references per author
  • 2.5 for Citeseer
  • 6.4 for HEP

93
Synthetic Data Generator
  • Data generator mimics real collaborations
  • Create collaboration graph in Stage 1
  • Create documents from this graph in Stage 2
  • Can control
  • Number of author entities and documents
  • Average number of collaborators per author entity
  • Average number of references per author entity
  • Average number of references per document
  • Percentage of ambiguous references

94
Trends in Synthetic Data
  • Improvement increases sharply with higher
    ambiguity in references

95
Trends in Synthetic Data
  • Improvement increases with more references per
    author

96
Trends in Synthetic Data
  • Improvement increases with more references per
    document

97
Bibliographic ER Comparison
  • Two approaches to relational entity resolution
  • Probabilistic Generative Model
  • Notion of optimal solution
  • Group label for references
  • Can generalize for unseen data
  • Able to handle noise
  • Relational Clustering
  • Efficient
  • Customizable string similarity measure
  • Small improvement over probabilistic model
  • Needs threshold to determine duplicates

98
Domain 2: Word Sense Resolution
  • Words in natural language corpora may be
    ambiguous
  • Bank: financial institution, shore,
    reserve/stockpile
  • Given word occurrence, determine intended sense
    from context
  • Distinction/Disambiguation problem in ER
  • References are the word occurrences
  • Entities are the ambiguous senses of the words

99
Relational WSD from Parallel Corpora
  • Translations can help resolve senses
  • "Bank" translated in Spanish as "orilla"
    probably means shore
  • Links in WSD
  • Aligned translation threads in parallel corpora
  • (bank, banco, banca, Bank, banque)
  • Multi-type ER
  • Each language represents a type
  • Need to resolve senses in all languages
    simultaneously
  • Semantic Group Detection

100
Bilingual Probabilistic Models for WSD
  • Motivated by Diab and Resnik
  • Automatic sense tagging using translations
  • Probabilistic generative model for translations
  • Sample related senses, one from each language
  • Sample a word from each selected sense
  • Two models for sense relations across languages
  • Sense Model: relate senses directly
  • Concept Model: relate senses through latent
    semantic groups

101
Generative Model 1: Sense Model
  • Two level generative model
  • Select a sense T according to priors
  • Select English word We according to conditional
    for that sense
  • Select Spanish word Ws, again according to
    conditional

[Figure: sense T generates English word We and
Spanish word Ws]
$P(W_e, W_s, T) = P(T)\, P(W_e \mid T)\, P(W_s \mid T)$
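Given the three factors of the sense model, the sense of an observed translation pair follows from Bayes' rule: the posterior over T is proportional to the product of the prior and the two word conditionals. The probability tables below are made-up toy numbers, and `sense_posterior` is a hypothetical helper name.

```python
def sense_posterior(prior, p_en, p_es, w_en, w_es):
    """Posterior over senses T for one observed (English, Spanish) word pair."""
    joint = {
        t: prior[t] * p_en[t].get(w_en, 0.0) * p_es[t].get(w_es, 0.0)
        for t in prior
    }
    total = sum(joint.values())
    return {t: (v / total if total else 0.0) for t, v in joint.items()}


# Toy parameters for two senses of "bank" (illustrative values only).
PRIOR = {"finance": 0.7, "shore": 0.3}
P_EN = {"finance": {"bank": 0.5}, "shore": {"bank": 0.2}}
P_ES = {
    "finance": {"banco": 0.6},
    "shore": {"orilla": 0.5, "banco": 0.01},
}

POST_ORILLA = sense_posterior(PRIOR, P_EN, P_ES, "bank", "orilla")
POST_BANCO = sense_posterior(PRIOR, P_EN, P_ES, "bank", "banco")
```

Even though the prior favors the financial sense, observing the translation "orilla" moves all the posterior mass to the shore sense in this toy setup.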
102
Generative Model 2: Concept Model
  • Three level generative model
  • Select concept C according to priors
  • Select a sense for each language according to
    conditionals for that concept
  • Select a word conditionally for each of the two
    senses

[Figure: concept C generates senses Te and Ts,
which generate words We and Ws]
$P(W_e, W_s, T_e, T_s, C) = P(C)\, P(T_e \mid C)\, P(T_s \mid C)\, P(W_e \mid T_e)\, P(W_s \mid T_s)$
103
Constructing the Models
  • Issues
  • Choosing dimensionality of hidden variables
  • Use of available semantic hierarchies
  • WordNet hierarchy for English
  • Use WordNet senses for English words
  • Relational clustering to discover Spanish senses
    and concepts

104
Sense Model Construction
  • Use WordNet senses for both languages
  • English word belongs to all its senses from
    WordNet
  • Assign Spanish word to all senses for its English
    translations

105
Concept Model: Spanish Senses
  • Use English sense neighborhood for each Spanish
    word
  • Union of senses for its translations
  • One sense for Spanish word
  • Each neighborhood defines a Spanish sense
  • Multiple senses for a Spanish word
  • Break English neighborhoods into frequently
    occurring sub-neighborhoods

106
Concept Model: Concepts
  • English sense neighborhood for Spanish senses
    capture relations across language
  • Cluster English sense neighborhoods to create
    concepts
  • Jaccard similarity of neighborhoods
  • One concept for each neighborhood cluster
  • Add the Spanish sense for each neighborhood
  • Add the English senses from each neighborhood

107
Learning Model Parameters
  • Select parameters to maximize the joint
    probability of observed translation pairs
  • Expectation Maximization to find model
    probabilities
  • Avoid local maxima
  • Use synset occurrence frequencies from WordNet
    for initialization of model probabilities

108
Training the Models
  • Training Corpus constructed from multiple sources
  • Brown Corpus, Senseval 1, Senseval 2 English
    Lexical Sample, Wall Street Journal Sec 18-24
    from Penn Treebank
  • Translated into Spanish using Globalink Pro 6.4
    and Systran Professional Premium
  • GIZA++ for word-level alignments

109
Numbers from Experiments
  • 16,186 English words, 31,862 Spanish words
  • 2,385,574 instances of 41,850 distinct
    translation pairs
  • 20,361 WordNet senses
  • Sense model
  • 154,947 parameters
  • 20,361 senses
  • Concept model
  • 120,268 parameters
  • 20,361 eng. senses, 11,961 spn. senses, 7,366
    concepts
  • EM convergence in about 20 iterations

110
WSD Senseval Comparison
  • Evaluation on Senseval 2 English All-words
  • Focus on nouns: 875 instances

111
Semantic Sense Groups
  • Semantic structure for Spanish words
    automatically created with senses and concepts
  • Map words to sense entities and group related
    sense entities into concepts

112
Example Concepts Discovered
  • accidente accidentes
  • muertes(deaths)
  • casualty
  • matar(to kill) matanzas(slaughter) muertes-le
  • slaying
  • derramamiento-de-sangre (spilling-of-blood)
  • cachiporra(bludgeon) obligar(force)
    obligando(forcing)
  • asesinato(murder) asesinatos

Spanish senses
Concept
Spanish words in a sense
Relevant English dictionary sense
113
Example Concepts Discovered
  • linterna-eléctrica linterna(lantern)
  • faros-automóvil(headlight)
  • linternas-portuarias(harbor-light)
  • antorcha(torch) antorchas antorchas-pino-nudo

114
Example Concepts Discovered
  • manía craze
  • culto(cult) cultos proto-senility
  • delirio delirium
  • rabias(fury) rabia farfulla(do hastily)

115
Example Concepts Discovered
  • oportunidad oportunidades
  • ocasión ocasiones
  • riesgo(risk) riesgos peligro(danger)
  • destino sino(fate)
  • fortuna suerte(fate)
  • probabilidad probabilidades

116
Entity Resolution Summary
  • Formulated generalized entity resolution problem
    addressing
  • Reference attributes
  • Relational data
  • Collective Inference
  • Group detection for entities
  • Two types of entities for parallel WSD

117
Future Work
  • Resolving Multiple Entity Types
  • Typed relational similarity measures by
    projecting onto each type and aggregation (MRDM
    05)
  • Extend group model for multiple types
  • Objective Functions for RC-ER
  • Notion of optimal solution
  • Generalize cut-based co-clustering (Dhillon 01)
  • Use entity ontologies for resolution
  • WordNet similarity instead of Jaccard similarity
    for sense neighborhoods

118
ER Issues
  • collective resolution
  • global vs. local resolutions
  • multi-entity resolution
  • structural properties: when to use links
  • characterization of structural properties of
    collaborative data sets benefiting relational
    approach
  • HCI issues
  • task specific interface for graph data
  • visualizations which support analytic task

119
Entity Resolution in Enron Email
  • Message ID 180231
  • Datetime 2001-01-23 094500
  • Sender Sara Shackleton
  • Recipients Tana Jones
  • Subject Hedge Funds
  • Tana Other than your email attached, have you
    had other discussions with Mark or credit about
    hedge funds? Sara
  • Sara Shackleton
  • Enron North America Corp.
  • 1400 Smith Street, EB 3801a
  • Houston, Texas 77002
  • 713-853-5620 (phone)
  • 713-646-3490 (fax)
  • sara.shackleton@enron.com

Emails exchanged between Shackleton and potential
candidates
Joint work with Chris Diehl @ JHUAPL
Mark Taylor is the correct association
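One way to read this slide: the ambiguous first name "Mark" is resolved by comparing how many emails the sender has exchanged with each candidate. The sketch below assumes that simple scoring rule and is not necessarily the method used in the joint work. The function name is hypothetical, and every candidate other than Mark Taylor, along with all counts, is made up for illustration.

```python
def rank_candidates(name_mention, exchange_counts):
    """Return (person, count) pairs whose first name matches the mention,
    ordered from most to fewest emails exchanged with the sender."""
    mention = name_mention.lower()
    matches = [
        (person, count)
        for person, count in exchange_counts.items()
        if person.split()[0].lower() == mention
    ]
    return sorted(matches, key=lambda pc: pc[1], reverse=True)


# Made-up email counts between the sender and each person.
counts = {"Mark Taylor": 120, "Mark X": 3, "Mark Y": 0, "Tana Jones": 85}
ranked = rank_candidates("Mark", counts)
best = ranked[0][0]  # the candidate with the most exchanged emails
```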
120
Entity Resolution in Email
  • Message ID: 182297
  • Datetime: 1999-12-20 04:41:00
  • Sender: Sara Shackleton
  • Recipients: Marie Heard
  • Subject: Merrill Lynch - Financial Contract
  • This is the deal that Susan F. worked on on
    Friday. I'll forward the Schedule to you. No
    one is asking for a revised Schedule yet but we
    should make the change and email the parties on
    Susan's email so that everyone knows the latest
    changes and then ask if anyone has comments. ss

Emails exchanged between Shackleton and potential
candidates
More context is needed to resolve the reference.
Linking references removes ambiguity in this case.
Considering recipient communications with
candidates may remove ambiguity as well.
121
Entity Resolution in Email
  • Message ID: 71707
  • Datetime: 2001-10-19 14:31:41
  • Sender: Sara Shackleton
  • Recipients: Kim Ward, Jason Williams
  • Subject: FW: FW: Master purchase/sale agreement -
    Salt River
  • Jay: my mistake - Salt River did send a CSA (see
    below). Sara

Emails exchanged between Shackleton and potential
candidates
Jay is in fact a reference to Jason Williams.
Williams often signs emails as Jay.
Need a framework that supports detection and
resolution of nicknames.
122
Entity Resolution in Email
  • Message ID: 81944
  • Datetime: 2001-10-19 06:28:50
  • Sender: Mark Whitt
  • Recipients: Barry Tycholiz
  • Subject: FW: hockey
  • Here is an opportunity to get a box for one of
    the games. Detroit on Feb 4th would be great!
    That is a Monday. If you and Kim wanted to you
    could come up and ski that weekend prior. Let me
    know what you think

Emails exchanged between Whitt and potential
candidates
Candidates listed are only from within Enron.
Exploiting the fact that this communication is
social in nature may be useful in dismissing an
already weak hypothesis.
123
Entity Resolution in Enron Data
[Figure: data sources for entity resolution in the
Enron data: the email communication network, the
employee directory (e.g., Jane Adams x3-4555, John
Addams x4-3421), the org chart, and mail threads
(e.g., To: j.smith@enron.com, From: jdoe@enron.com,
Subject: Re: trade, "My friend John says ...")]
124
Projects
  • Link-based Classification
  • Link-based Entity Resolution
  • efficient algorithms
  • visualization tools that support ER
  • Social Network Analysis
  • Affiliation Networks
  • Structural and descriptive modeling
  • Friendship Event Networks
  • Definitions of Capital and Benefit
  • Link Mining for the Semantic Web
  • Feature Generation for Sequences (biological
    data)
  • Word-sense disambiguation from Parallel Corpora

125
Affiliation Networks
  • An affiliation network contains
  • Actors A
  • Events E
  • Relationships R(A,E) Actor A participates in
    event E
  • Examples
  • Executive Corporate Boards (ECN)
  • 66,000 executives, 5400 companies, 76,000 board
    memberships
  • Author Publication Networks (APN)
  • 13,000 authors, 16,000 publications, 39,000
    authorships

Joint work with Lisa Singh _at_ Georgetown
126
3 Views
[Figure: a network of actors a1-a5 and events e1-e3
drawn in three views: Affiliation Network, Event
Overlap Graph, Co-Membership Graph]
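The two derived views named on this slide, the co-membership graph and the event overlap graph, can be computed from the affiliation relation R(A,E). The sketch below is one plausible construction, assuming edges are weighted by the number of shared events or shared actors. The function names are hypothetical, and the sample memberships are invented because the edges of the original figure are not recoverable.

```python
from collections import defaultdict
from itertools import combinations


def co_membership(memberships):
    """Actor-actor graph: edge weight = number of events two actors share."""
    actors_by_event = defaultdict(set)
    for actor, event in memberships:
        actors_by_event[event].add(actor)
    weights = defaultdict(int)
    for actors in actors_by_event.values():
        for a, b in combinations(sorted(actors), 2):
            weights[(a, b)] += 1
    return dict(weights)


def event_overlap(memberships):
    """Event-event graph: edge weight = number of actors two events share.

    Computed by swapping the roles of actors and events.
    """
    flipped = [(event, actor) for actor, event in memberships]
    return co_membership(flipped)


# Invented affiliation relation R(A, E) over actors a1-a5 and events e1-e3.
R = [("a1", "e1"), ("a2", "e1"), ("a2", "e2"), ("a3", "e2"),
     ("a4", "e2"), ("a4", "e3"), ("a5", "e3")]
actor_graph = co_membership(R)
event_graph = event_overlap(R)
```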
127
Compressing the networks
  • Descriptive Pruning
  • Select actors/events based on attribute values
  • e.g., consider only CEOs
  • Structural Properties
  • Consider actors based on structural properties
    such as hubs, brokers, etc.
  • Evaluation
  • Does pruned network maintain predictive accuracy
    for network attributes?
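The two pruning strategies above can be sketched on a list of (actor, event) memberships. This is a minimal illustration under stated assumptions: descriptive pruning is modeled as a predicate over actor attributes, and structural pruning uses the number of events an actor attends as a crude stand-in for "hub". The function names, the attribute key `title`, and the sample data are all hypothetical.

```python
from collections import Counter


def descriptive_prune(memberships, actor_attrs, keep):
    """Keep memberships whose actor's attributes satisfy the predicate."""
    return [(a, e) for a, e in memberships if keep(actor_attrs.get(a, {}))]


def structural_prune(memberships, min_events):
    """Keep memberships of actors that take part in at least min_events events."""
    degree = Counter(actor for actor, _ in memberships)
    return [(a, e) for a, e in memberships if degree[a] >= min_events]


# Invented memberships and attributes, for illustration only.
memberships = [("a1", "e1"), ("a2", "e1"), ("a2", "e2"), ("a3", "e2"),
               ("a4", "e2"), ("a4", "e3"), ("a5", "e3")]
attrs = {"a1": {"title": "CEO"}, "a2": {"title": "CFO"}, "a4": {"title": "CEO"}}

ceos_only = descriptive_prune(memberships, attrs, lambda x: x.get("title") == "CEO")
hubs_only = structural_prune(memberships, min_events=2)
```

In this toy data the two strategies keep different actors, which mirrors the point that descriptive and structural pruning need not select the same part of the network.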

128
Predictive Accuracy of Compression Strategies
129
Summary
  • Can use both descriptive and structural
    properties to significantly compress networks
    while maintaining accuracy
  • Descriptive and structural pruning allow us to
    focus on important actors in the network;
    however, the sets of actors they prune are quite
    different
  • These pruned networks may be more effective for
    understanding and visualization

130
Projects
  • Link-based Classification
  • Link-based Entity Resolution
  • efficient algorithms
  • visualization tools that support ER
  • Social Network Analysis
  • Affiliation Networks
  • Structural and descriptive modeling
  • Friendship Event Networks
  • Definitions of Capital and Benefit
  • Link Mining for the Semantic Web
  • Feature Generation for Sequences (biological
    data)
  • Word-sense disambiguation from Parallel Corpora

131
Friendship Event Networks
  • A friendship event network contains
  • Actors
  • Friendships
  • Events
  • Event Organizers
  • Event Participants
  • example
  • Author Collaboration Networks
  • Actors - Researchers
  • Friendships - CoAuthors
  • Events - Conferences
  • Event Organizers - PC Committee
  • Event Participants - Authors

132
- PC Non Author
- Non PC Author
- PC Author
PC Committee
Conference Authors
133
Define
  • Personal Social Capital - number of friends who
    are organizers
  • Benefit Received - number of publications in
    conference
  • Benefit Given - number of publications of
    friends of PC member
  • Comparison of different event structures
  • Temporal Evaluation
  • look at event series
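The three quantities defined above can be sketched directly, reading each definition as a count. This is a minimal illustration assuming friendships are stored as a symmetric adjacency map and publications as per-author counts for one conference. All function names and the sample data are hypothetical.

```python
def social_capital(actor, friends, organizers):
    """Personal social capital: how many of the actor's friends are organizers."""
    return len(friends.get(actor, set()) & organizers)


def benefit_received(actor, pubs):
    """Benefit received: the actor's own publications in the conference."""
    return pubs.get(actor, 0)


def benefit_given(pc_member, friends, pubs):
    """Benefit given: total conference publications by a PC member's friends."""
    return sum(pubs.get(f, 0) for f in friends.get(pc_member, set()))


# Invented co-author friendships, PC membership, and publication counts.
friends = {"ann": {"bob", "cy"}, "bob": {"ann"}, "cy": {"ann"}}
organizers = {"bob"}
pubs = {"ann": 2, "cy": 1}

capital = social_capital("ann", friends, organizers)
received = benefit_received("ann", pubs)
given = benefit_given("bob", friends, pubs)
```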

134
Datasets
Data for the past 10 years of 3 major CS conferences
135
Overall Capital and Benefit
136
C1 Friendship
137
C1 Capital
138
C1 Capital/Friendship Ratio
139
PC/Author Ratio
140
Capital/Benefit Summary
  • Defined a generic friendship-event network
  • Identified interesting structural properties
  • Very preliminary, much more work to be done

141
Link Mining for the Semantic Web
  • Need to be able to extract multi-relational data,
    not just a single table
  • Semantic Web tasks which could make use of
    learning
  • schema discovery
  • populating ontology
  • schema mappings
  • schema reformulation
  • SRL capabilities that are needed
  • link-based object classification
  • link type prediction
  • predicting link existence
  • link cardinality estimation
  • entity resolution and object consolidation
  • group detection
  • predicate invention

142
An Integrated Approach
ontologies
SRL
Current projects focus on: 1. Link Type Prediction,
2. Link Ontology Discovery
data
143
Projects
  • Link-based Classification
  • Link-based Entity Resolution
  • efficient algorithms
  • visualization tools that support ER
  • Social Network Analysis
  • Affiliation Networks
  • Structural and descriptive modeling
  • Friendship Event Networks
  • Definitions of Capital and Benefit
  • Link Mining for the Semantic Web
  • Feature Generation for Sequences (biological
    data)
  • Schema Maintenance and Discovery

144
Summary Link Mining
  • Tasks
  • Link-based Object Classification
  • Object Type Prediction
  • Link Type Prediction
  • Predicting Link Existence
  • Link Cardinality Estimation
  • Entity Resolution
  • Group Detection
  • Subgraph Discovery
  • Metadata Mining
  • Challenges
  • Collective Classification
  • Collective Consolidation
  • Logical vs. Statistical dependencies
  • Feature construction
  • Instances vs. Classes
  • Effective Use of Labeled and Unlabeled Data
  • Link Prediction
  • Closed vs. Open World

These are some of the key capabilities needed to
perform today's complex analytic tasks
145
Recent SRL Activities
  • Invited Tutorial at ICML/ILP 2005 and Tutorial at
    IJCAI
  • Dagstuhl 2005 workshop on Probabilistic,
    Relational and Logical Learning, co-organized w/
    Luc De Raedt, Stephen Muggleton and Tom
    Dietterich: http://www.dagstuhl.de/05051/
  • ICML 2004 workshop on Statistical Relational
    Learning and its Connections to Other Fields,
    co-organized w/ Tom Dietterich and Kevin
    Murphy: http://www.cs.umd.edu/projects/srl2004/
  • IJCAI 2003 workshop on Statistical Relational
    Learning, co-organized w/ David
    Jensen: http://kdl.cs.umass.edu/srl2003/
  • AAAI 2000 workshop on Statistical Relational
    Learning, co-organized w/ David
    Jensen: http://robotics.stanford.edu/srl
  • Related workshops
  • KDD MRDM workshops
  • http://www-ai.ijs.si/SasoDzeroski/MRDM2004/
  • http://www-ai.ijs.si/SasoDzeroski/MRDM2003/
  • http://www-ai.ijs.si/SasoDzeroski/MRDM2002/
  • Benjamin Taskar and I are working on an edited
    SRL collection

146
SRL Related Courses
  • My course at UMD:
    http://www.cs.umd.edu/class/spring2005/cmsc828g/
  • Pedro Domingos' course at UWash
  • Tom Dietterich's course at OSU:
    http://web.engr.oregonstate.edu/tgd/classes/539/
  • David Page, Mark Craven and Jude Shavlik at
    UWisc: http://www.biostat.wisc.edu/page/838.html
  • Eric Mjolsness' course at UCI on Probabilistic
    Knowledge Representation:
    http://computableplant.ics.uci.edu/emj/classes/280_04/Syllabus20ICS2028020v2.doc
  • Stuart Russell's course at Berkeley on Knowledge
    Representation and Reasoning:
    http://www.cs.berkeley.edu/russell/classes/cs289/f04/
  • Joydeep Ghosh's course at UT Austin on Advanced
    Topics in Data Mining:
    http://www.lans.ece.utexas.edu/course/382v/05sp/
  • Michael Littman's course at Rutgers on Learned
    Representations in AI:
    http://www.cs.rutgers.edu/mlittman/courses/lightai03/
  • David Jensen and Andrew McCallum's course at UMass
    on Computational Social Network Analysis:
    http://kdl.cs.umass.edu/courses/csna/

147
References
  • Deduplication and Group Detection Using Links,
    Indrajit Bhattacharya and Lise Getoor. 10th ACM
    SIGKDD Workshop on Link Analysis and Group
    Detection, Seattle, WA, August 2004.
  • Word Sense Disambiguation using Probabilistic
    Models, Indrajit Bhattacharya, Lise Getoor and
    Yoshua Bengio. 42nd Annual Meeting of the
    Association for Computational Linguistics,
    Barcelona, Spain, July 2004.
  • Iterative Record Linkage for Cleaning and
    Integration, Indrajit Bhattacharya and Lise
    Getoor. 9th ACM SIGMOD Workshop on Research
    Issues in Data Mining and Knowledge Discovery,
    Paris, FR, June 2004.
  • Using the Structure of Web Sites for Automatic
    Segmentation of Tables, Kristina Lerman, Lise
    Getoor, Steve Minton and Craig Knoblock.
    Proceedings of ACM-SIGMOD 2004 International
    Conference on Management of Data, Paris, FR, June
    2004.
  • Structure Discovery using Statistical Relational
    Learning, Lise Getoor. Data Engineering Bulletin,
    vol. 26, No. 3, 2003.
  • Link Mining: A New Data Mining Challenge, Lise
    Getoor. SIGKDD Explorations, volume 5, issue 1,
    2003.
  • Iterative Deduplication, I. Bhattacharya,
    L. Getoor.
  • Link-based Classification, Q. Lu and L. Getoor,
    International Conference on Machine Learning,
    August, 2003
  • Labeled and Unlabeled Data for Link-based
    Classification, Q. Lu and L. Getoor. ICML
    workshop on The Continuum from Labeled to
    Unlabeled Data, August, 2003.
  • Link-based Classification for Text Classification
    and Mining, Q. Lu and L. Getoor. IJCAI workshop
    on Text Mining and Link Analysis

http://www.cs.umd.edu/getoor
Google getoor