Entity Disambiguation

1
Entity Disambiguation

By
Angela Maduko
Directed by
Amit Sheth

2
Entity Disambiguation Problem

Emerges mainly when merging information from different sources
Two major levels:
1. Schema/ontology level: determining the similarity of attributes/concepts/classes from the different schemas/ontologies to be merged
2. Instance level: determining which instances of concepts/classes (or tuples in relational databases) refer to the same entity

3
Current approaches for both levels

Feature-based Similarity Approach (FSA)
Set-Theory Similarity Approach (STA)
Information-Theory Similarity Approach (ITA)
Hybrid Approach (HA)
Relationship-based Similarity Approach (RSA)
Hybrid Similarity Approach (HSA)

4
ITA

In [1], Dekang Lin presents a measure for the similarity between two concepts based on both their commonalities and their differences
Intuition 1: The similarity between A and B is related to their commonality. The more commonality they share, the more similar they are.
Intuition 2: The similarity between A and B is related to the differences between them. The more differences they have, the less similar they are.
Intuition 3: The maximum similarity between A and B is reached when A and B are identical, no matter how much commonality they share.

5
ITA

Consider the concept Fruit
A is an Apple
B is an Orange
Commonality of A and B?
common(A, B) = Fruit(A) and Fruit(B)
Measures the commonality between A and B, I(common(A, B)), by the amount of information contained in common(A, B)
where the information content of a proposition S is I(S) = -log P(S)

6
ITA

Differences are measured by I(description(A, B)) - I(common(A, B))
description(A, B) is a proposition that describes what A and B are
Can be applied at both levels 1 and 2
Intuitively, sim(A, B) =
1 when A and B are exactly alike
0 when they share no commonalities
Proposes sim(A, B) = log P(common(A, B)) / log P(description(A, B))
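
A minimal sketch of this measure in Python, assuming the usual statement of Lin's similarity as the ratio of the information in common(A, B) to the information in description(A, B). The probabilities below are hypothetical values chosen only for illustration; they are not from the presentation.

```python
import math


def information(p: float) -> float:
    """Information content I(S) = -log P(S), in nats."""
    return -math.log(p)


def lin_similarity(p_common: float, p_description: float) -> float:
    """Ratio of I(common(A, B)) to I(description(A, B))."""
    i_description = information(p_description)
    if i_description == 0.0:
        # Assumption: a description that carries no information is treated
        # as the "exactly alike" case.
        return 1.0
    return information(p_common) / i_description


# Hypothetical probabilities, for illustration only.
p_both_fruit = 0.1              # P(common(A, B)): A and B are both fruit
p_apple_and_orange = 0.001      # P(description(A, B)): A is an apple, B is an orange

print(round(lin_similarity(p_both_fruit, p_apple_and_orange), 3))
```

When the description adds nothing beyond the commonality, the two probabilities coincide and the ratio is 1; when the commonality is certain (probability 1), it carries no information and the ratio is 0.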

7
ITA

In [2], Resnik measures the similarity between two concepts in an is-a taxonomy based on the information content of their most specific common super-concept
Define P(c) as the probability of encountering an instance of a concept c in the taxonomy
For any two concepts c1 and c2, define S(c1, c2) as the set of concepts that subsume both c1 and c2
Proposes sim(c1, c2) = max over c in S(c1, c2) of [-log P(c)]

8
ITA

100 instances of concept X
4 instances of concept Y
200 instances of concept Z
2000 instances of all concepts
sim(A, B)
sim(C, D)
sim(A, D)
sim(A, E)
sim(C, D) > sim(A, B). Should this be so?
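
The comparison above can be reproduced with a short sketch of Resnik's measure, taking the similarity as the largest -log P(c) over the concepts that subsume both arguments. The figure for this slide is lost, so the taxonomy used here (A and B under X, C and D under Y, E and F under Z, all under a single root) is an assumption; only the instance counts come from the slide.

```python
import math

# ASSUMED taxonomy (child -> parent); the original figure is not recoverable.
PARENT = {
    "A": "X", "B": "X",
    "C": "Y", "D": "Y",
    "E": "Z", "F": "Z",
    "X": "ROOT", "Y": "ROOT", "Z": "ROOT",
}

# Instance counts from the slide; ROOT stands for "all concepts".
COUNT = {"X": 100, "Y": 4, "Z": 200, "ROOT": 2000}
TOTAL = 2000


def subsumers(concept):
    """Return the concept together with every concept above it."""
    chain = [concept]
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def resnik(c1, c2):
    """Largest information content among the shared subsumers S(c1, c2)."""
    shared = set(subsumers(c1)) & set(subsumers(c2))
    # log(TOTAL / count) equals -log P(c); only counted concepts contribute.
    return max(math.log(TOTAL / COUNT[c]) for c in shared if c in COUNT)


for pair in [("A", "B"), ("C", "D"), ("A", "D"), ("A", "E")]:
    print(pair, round(resnik(*pair), 3))
```

Under this assumed layout, C and D score higher than A and B only because Y is rarer than X, which is the behaviour the slide questions.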

[Figure: taxonomy with nodes X, Y, Z, A, B, C, D, E, F]

9
ITA

Define s(w) as the set of concepts that are word senses of word w. Proposes a measure for word similarity as follows:
sim(w1, w2) = max over c1 in s(w1), c2 in s(w2) of sim(c1, c2)
Can be applied at level 1 only
Doctor (medical and PhD)
Nurse (medical and nanny)
sim(Doctor, Nurse)
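
A small sketch of the word-level measure, assuming it is the maximum concept similarity over all pairs of senses of the two words. The sense names and the concept-level scores are hypothetical placeholders; in practice the scores would come from a concept measure such as the one in [2].

```python
from itertools import product

# Word senses s(w), following the Doctor/Nurse example; sense names are hypothetical.
SENSES = {
    "doctor": ["medical_doctor", "phd_holder"],
    "nurse": ["medical_nurse", "nanny"],
}

# Hypothetical concept-level similarities, for illustration only.
CONCEPT_SIM = {
    ("medical_doctor", "medical_nurse"): 8.0,
    ("medical_doctor", "nanny"): 2.0,
    ("phd_holder", "medical_nurse"): 2.5,
    ("phd_holder", "nanny"): 2.0,
}


def concept_sim(c1, c2):
    """Symmetric lookup; unknown pairs default to 0."""
    return CONCEPT_SIM.get((c1, c2), CONCEPT_SIM.get((c2, c1), 0.0))


def word_sim(w1, w2):
    """Best concept similarity over every pair of senses of w1 and w2."""
    return max(concept_sim(c1, c2) for c1, c2 in product(SENSES[w1], SENSES[w2]))


print(word_sim("doctor", "nurse"))
```

Taking the maximum means the medical senses of the two words dominate the result, even though each word has an unrelated second sense.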

10
STA

[3] introduces a set-theoretical notion of a matching function F, based on the following assumptions for classes a, b, c with description sets A, B, C respectively:
Matching: s(a, b) = F(A ∩ B, A - B, B - A)
Monotonicity: s(a, b) ≥ s(a, c) whenever A ∩ B ⊇ A ∩ C, A - B ⊆ A - C, B - A ⊆ C - A

11
STA

Proposes two models
Contrast model: similarity is defined as
An increasing function of common features
A decreasing function of distinctive features (features that apply to one object but not the other)
S(a, b) = θ f(A ∩ B) - α f(A - B) - β f(B - A), with θ, α, β ≥ 0
Function f measures the salience of a set of features
f depends on intensity and context factors
Intensity: physical salience (e.g. physical features)
Context: salience of features varies with context

12
STA

Ratio model
S(a, b) = f(A ∩ B) / (f(A ∩ B) + α f(A - B) + β f(B - A)), with α, β ≥ 0
Can be applied at both levels 1 and 2
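
The contrast and ratio models can be sketched over Python sets as below. The salience function f defaults to set cardinality, which is an assumption made for simplicity (the slides describe f as depending on intensity and context); the feature sets and weights are hypothetical.

```python
def contrast(a, b, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Contrast model: common features add, distinctive features subtract."""
    return theta * f(a & b) - alpha * f(a - b) - beta * f(b - a)


def ratio(a, b, alpha=0.5, beta=0.5, f=len):
    """Ratio model: common features normalised by common plus distinctive ones."""
    common = f(a & b)
    denominator = common + alpha * f(a - b) + beta * f(b - a)
    # Assumption: two empty descriptions are treated as identical.
    return common / denominator if denominator else 1.0


# Hypothetical feature sets, for illustration only.
apple = {"fruit", "round", "sweet", "red"}
orange = {"fruit", "round", "sweet", "orange_coloured", "citrus"}

print(contrast(apple, orange))
print(round(ratio(apple, orange), 3))
```

The ratio model stays within [0, 1] and equals 1 for identical sets, while the contrast model is unbounded and depends on the absolute number of features.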

13
STA

[4] determines the similarity between two entities by the distance between them
Defines Pij as the probability that entities ai and bj have the same value for their kth attribute, for all k
Assigns costs to mismatching errors (i.e. not matching where it should and matching where it should not) and then calculates a cost function based on Pij to be maximized
Shows that the expected distance between ai and bj is dij = 1 - Pij, and substitutes this in the cost function
Calculates the distance between two entities as a linear combination (weighted average) of the distances between their common attributes
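
The weighted-average step can be illustrated with the sketch below. The 0/1 attribute distance, the attribute names and the weights are all assumptions made for the example; the slides only say that the distance is a weighted average over common attributes with user-supplied weights.

```python
def attribute_distance(x, y):
    """Assumed per-attribute distance: 0 if the values match, 1 otherwise."""
    return 0.0 if x == y else 1.0


def entity_distance(a, b, weights):
    """Weighted average of attribute distances over the attributes a and b share."""
    common = [k for k in weights if k in a and k in b]
    total_weight = sum(weights[k] for k in common)
    if total_weight == 0:
        raise ValueError("no weighted attributes in common")
    weighted = sum(weights[k] * attribute_distance(a[k], b[k]) for k in common)
    return weighted / total_weight


# Hypothetical records and weights, for illustration only.
record_a = {"name": "J. Smith", "city": "Athens", "zip": "30602"}
record_b = {"name": "John Smith", "city": "Athens", "zip": "30602"}
weights = {"name": 0.5, "city": 0.25, "zip": 0.25}

print(entity_distance(record_a, record_b, weights))
```

A smoother attribute distance (for example an edit distance scaled to [0, 1]) could replace the 0/1 function without changing the combination step.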

14
STA

Relationships amongst common attributes, such as key attributes and functional dependencies, are exploited
Obtains attribute weights from the user
Can be applied at level 2 only

15
CA

In [5], the authors present a model-based k-means clustering algorithm for name disambiguation in citations
Randomly assigns citations to N clusters
Estimates the prior probability of each cluster
Computes the probability that a cluster produces a given citation c
Assigns c to the cluster with the highest probability of producing it
Applied at level 2 only
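
The steps above can be sketched as a small loop. The generative model used here (a smoothed bag-of-tokens model per cluster) is an assumption chosen to keep the example short, not necessarily the model used in [5]; the function name, parameters and sample citations are hypothetical.

```python
import math
import random
from collections import Counter


def model_based_kmeans(citations, k, iterations=10, seed=0):
    """citations: list of token lists. Returns one cluster index per citation."""
    rng = random.Random(seed)
    vocabulary = {token for citation in citations for token in citation}
    # Step 1: random initial assignment.
    assignment = [rng.randrange(k) for _ in citations]

    for _ in range(iterations):
        # Step 2: estimate cluster priors and per-cluster token counts.
        sizes = Counter(assignment)
        token_counts = [Counter() for _ in range(k)]
        for citation, cluster in zip(citations, assignment):
            token_counts[cluster].update(citation)

        def log_probability(citation, cluster):
            # Step 3: log P(cluster) + log P(citation | cluster), add-one smoothed.
            prior = (sizes[cluster] + 1) / (len(citations) + k)
            total = sum(token_counts[cluster].values()) + len(vocabulary)
            return math.log(prior) + sum(
                math.log((token_counts[cluster][t] + 1) / total) for t in citation
            )

        # Step 4: move each citation to its most probable cluster.
        updated = [
            max(range(k), key=lambda z: log_probability(c, z)) for c in citations
        ]
        if updated == assignment:
            break
        assignment = updated
    return assignment


# Hypothetical citations sharing one author name, for illustration only.
sample = [
    ["smith", "database", "query"],
    ["smith", "query", "index"],
    ["smith", "protein", "gene"],
    ["smith", "gene", "cell"],
]
print(model_based_kmeans(sample, k=2))
```

The result depends on the random initial assignment, so a fixed seed is used to make runs repeatable.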

16
CA

My comments on this paper
The drawback of this approach lies in the estimation of the model parameters: the data necessary for this may not be readily available.
Even if a user supplies estimates of these parameters, the estimates may not be unbiased and consistent

17
HA

[7] combines clustering and information content approaches for entity disambiguation (Scalable Information Bottleneck (LIMBO) method)
Attempts to cluster entities in such a way that the clusters are informative about the entities within them
Model: a set T of n entities (relational tuples), defined on m attributes (A1, A2, ..., Am). The domain of attribute Ai is the set Vi = {Vi,1, Vi,2, ..., Vi,di}
Let T and V be two discrete random variables that can take values from T and V respectively
Initially, assigns each entity to its own cluster. Let Cq denote this initial clustering; then the mutual information of Cq and T, I(Cq, T), equals the mutual information of V and T, I(V, T)

18
HA

Assumes the number of distinct entities k is known
Seeks a clustering Ck of V such that I(Ck, T) remains as large as possible, i.e. the information loss I(V, T) - I(Ck, T) is minimal
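
The information-loss criterion can be illustrated with a generic sketch: compute the mutual information of a joint distribution, merge two rows (two clusters of the variable being clustered), and measure how much mutual information is lost. This shows only the criterion, not the LIMBO algorithm itself; the joint distribution is a hypothetical example in which each row is given probability 1/3, spread evenly over the columns it uses.

```python
import math


def mutual_information(joint):
    """joint[i][j] = p(x_i, y_j). Returns I(X; Y) in nats."""
    px = [sum(row) for row in joint]
    py = [sum(column) for column in zip(*joint)]
    return sum(
        p * math.log(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def merge_rows(joint, i, j):
    """Replace rows i and j by their sum, i.e. merge two clusters."""
    merged = [a + b for a, b in zip(joint[i], joint[j])]
    kept = [row for n, row in enumerate(joint) if n not in (i, j)]
    return kept + [merged]


def information_loss(joint, i, j):
    """Mutual information lost by merging rows i and j."""
    return mutual_information(joint) - mutual_information(merge_rows(joint, i, j))


# Hypothetical joint distribution: rows 0 and 1 use the same columns, row 2 differs.
joint = [
    [1 / 6, 1 / 6, 0.0, 0.0],
    [1 / 6, 1 / 6, 0.0, 0.0],
    [0.0, 0.0, 1 / 6, 1 / 6],
]

print(round(information_loss(joint, 0, 1), 6))
print(round(information_loss(joint, 0, 2), 6))
```

Merging the two rows with identical distributions loses essentially no information, which is why an agglomerative procedure driven by this criterion would merge them first.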

19
HSA

In [8], Kashyap and Sheth introduce the concept of semantic proximity (semPro) between entities to capture their similarity
In addition to context, employs relationships and features of entities in determining their similarity
semPro(O1, O2) = <Context, Abstraction, (D1, D2), (S1, S2)>
Context: the context in which objects O1 and O2 are being compared
Abstraction: abstraction/mappings relating the domains of the objects
(D1, D2): domain definitions of the objects
(S1, S2): states of the objects

20
HSA

Abstractions
Total 1-1 value mapping
Partial many-one mapping.
Generalization/specialization.
Aggregation.
Functional dependencies.
ANY
NONE

21
HSA

Semantic Taxonomy
Defines 5 degrees of similarity between objects
Semantic Equivalence
Semantic Relationship
Semantic Relevance
Semantic Resemblance
Semantic Incompatibility

22
HSA

Semantic Equivalence: strongest measure of semantic proximity
Two objects are said to be semantically equivalent when they represent the same real-world entity, i.e.
semPro(O1, O2) = <ALL, total 1-1 value mapping, (D1, D2), _> (domain semantic equivalence)
semPro(O1, O2) = <ALL, M, (D1, D2), (S1, S2)>, where M is a total 1-1 value mapping between (D1, S1) and (D2, S2) (state semantic equivalence)

23
HSA

Semantic Relationship: weaker than semantic equivalence
semPro(O1, O2) = <ALL, M, (D1, D2), _>, where M is a partial many-one value mapping, generalization or aggregation
The requirement of a 1-1 mapping is relaxed such that, given an instance of O1, we can identify an instance of O2, but not vice versa

24
HSA

Semantic Relevance
Two objects are semantically relevant if there exists any mapping between their domains in some context
semPro(O1, O2) = <SOME, ANY, (D1, D2), _>

25
HSA

Semantic Resemblance: weakest measure of semantic proximity
There does not exist any mapping between their domains in any context
The objects have the same roles in some contexts with coherent definition contexts

26
HSA

Semantic Incompatibility
Asserts semantic dissimilarity
Asserts that there is no context and no abstraction in which the domains of the two objects are related
semPro(O1, O2) = <NONE, NONE, (D1, D2), _>
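
The five degrees can be read as a decision over the Context and Abstraction components of semPro. The sketch below encodes that reading; the class, the string labels and the same_role flag (standing in for "same roles in some contexts") are hypothetical simplifications, and the domain and state components are omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SemPro:
    context: str             # "ALL", "SOME" or "NONE"
    abstraction: str         # e.g. "total 1-1", "generalization", "ANY", "NONE"
    same_role: bool = False  # hypothetical flag used only for resemblance


RELATIONSHIP_MAPPINGS = {"partial many-one", "generalization", "aggregation"}


def degree(sp: SemPro) -> str:
    """Map a (simplified) semPro descriptor to one of the five degrees."""
    if sp.context == "ALL" and sp.abstraction == "total 1-1":
        return "semantic equivalence"
    if sp.context == "ALL" and sp.abstraction in RELATIONSHIP_MAPPINGS:
        return "semantic relationship"
    if sp.context == "SOME" and sp.abstraction != "NONE":
        return "semantic relevance"
    if sp.abstraction == "NONE" and sp.same_role:
        return "semantic resemblance"
    if sp.context == "NONE" and sp.abstraction == "NONE":
        return "semantic incompatibility"
    return "unclassified"


print(degree(SemPro("ALL", "total 1-1")))
print(degree(SemPro("SOME", "ANY")))
print(degree(SemPro("NONE", "NONE")))
```

Checking the stronger degrees first mirrors the ordering of the taxonomy, from equivalence down to incompatibility.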

27
HSA

[6] encompasses both the feature-based and relationship-based approaches
Represents entity classes using 3 components and assesses the similarity of two classes using 3 different similarity measures, one for each component, viz.
Synonym set (to address polysemy and synonymy)
Set of distinguishing features or differentiae (functions, parts and attributes)
Set of semantic inter-relations amongst entity classes (mainly hyponymy and meronymy)

28
HSA

Applied at level 1 only
Defines the semantic neighbourhood of a class a with radius r as N(a, r) = {ci | ∀i d(a, ci) ≤ r}, where d(a, c) is the length of the shortest path connecting the two classes in the ontology
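
The neighbourhood definition amounts to a bounded breadth-first search. A minimal sketch, assuming the ontology is given as an undirected adjacency mapping with unit-length links; the class names are hypothetical.

```python
from collections import deque


def semantic_neighbourhood(graph, a, r):
    """Return every class whose shortest-path distance from a is at most r."""
    distance = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if distance[node] == r:
            continue  # do not expand beyond the radius
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical ontology fragment, for illustration only.
ontology = {
    "building": ["stadium", "house", "structure"],
    "stadium": ["building", "arena"],
    "house": ["building"],
    "structure": ["building"],
    "arena": ["stadium"],
}

print(sorted(semantic_neighbourhood(ontology, "stadium", 1)))
```

Note that the class a itself is included, since d(a, a) = 0 satisfies the radius condition.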

29
HSA

Proposes S(a, b) = ωw Sw(a, b) + ωu Su(a, b) + ωn Sn(a, b)
Sw, Su and Sn are the respective similarities between synonym sets, features and semantic neighbourhoods of classes a and b
ωw, ωu and ωn ≥ 0 are the respective weights of the similarity of each component
S(a, b) is based on a normalization of Tversky's [3] model, with 0 ≤ α ≤ 1

30
HSA

Assumes similarity is asymmetric, with more similarity from a class to its super-class than vice versa
The function α(a, b) evaluates this asymmetry; it is defined in terms of depth, where depth(a) returns the length of the shortest path from class a to an imaginary root connecting the roots of the two ontologies
Applies word matching in the synonym sets of a and b for Sw

31
HSA

For Su, applies matching over corresponding differentiae (functions Sf, parts Sp, attributes Sa, with corresponding weights ωf, ωp and ωa ≥ 0) such that Su(a, b) = ωf Sf(a, b) + ωp Sp(a, b) + ωa Sa(a, b), where ωf + ωp + ωa = 1
For Sn, compares entity classes in semantic neighbourhoods based on synonym sets or differentiae of these classes
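
A rough sketch of how these pieces fit together: a Tversky-style normalized set similarity with an asymmetry weight, and a weighted sum of component similarities. The particular depth-based form of alpha below is an assumption (one plausible choice that makes a class more similar to its super-class than the reverse); the feature sets, depths and weights are hypothetical.

```python
def alpha(depth_a, depth_b):
    """ASSUMED asymmetry weight derived from the depths of the two classes."""
    total = depth_a + depth_b
    if total == 0:
        return 0.5
    share = depth_a / total
    return share if depth_a <= depth_b else 1.0 - share


def set_similarity(a, b, weight):
    """Normalized Tversky-style matching with asymmetry weight in [0, 1]."""
    common = len(a & b)
    denominator = common + weight * len(a - b) + (1.0 - weight) * len(b - a)
    return common / denominator if denominator else 0.0


def combine(similarities, weights):
    """Weighted sum of component similarities, e.g. Sw, Su and Sn."""
    return sum(weights[name] * similarities[name] for name in weights)


# Hypothetical classes: a stadium (depth 3) and its super-class building (depth 2).
stadium = {"roof", "walls", "seats", "field"}
building = {"roof", "walls"}

up = set_similarity(stadium, building, alpha(3, 2))    # class to super-class
down = set_similarity(building, stadium, alpha(2, 3))  # super-class to class
print(round(up, 3), round(down, 3))

# Hypothetical component scores and weights for the overall S(a, b).
scores = {"w": 0.5, "u": up, "n": 0.25}
weights = {"w": 0.2, "u": 0.5, "n": 0.3}
print(round(combine(scores, weights), 3))
```

With these inputs the similarity from the class to its super-class is higher than in the opposite direction, which is the asymmetry the slides describe.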

32
HSA

In [9], Cho et al. propose a model derived from the edge-based approach, employing the information content of the node-based approach, based on these facts:
There exists a correlation between similarity and the number of shared parent concepts in a hierarchy
Link type (hyponymy, meronymy, etc.) → semantic relationship

33
HSA

The conceptual similarity between a node and each of its adjacent child nodes may not be equal
As depth increases in the hierarchy, the conceptual similarity between a node and its adjacent child node decreases
The population of nodes is not uniform over the entire ontological structure (links in a dense part of the hierarchy → less distance than those in a less dense part)

34
HSA

Proposes S(ci, cj) = D(Lj→i) · Σ(k = 0..n) [W(tk) · d(ck+1→k) · f(d)] · max(H(c)), where
f(d) is a function that returns a depth factor (topological location in the hierarchy)
d(ck+1→k) is a density function
D(Lj→i) is a function that returns a distance factor between ci and cj (shortest path from one node to the other)
W(tk) is a weight function that assigns a weight to each link type (W(tk) = 1 for an is-a link)
H(c) is the information content of the super-concepts of ci and cj
For level 1 only

35
References

[1] Dekang Lin, An Information-Theoretic Definition of Similarity, Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296-304, 1998.
[2] Philip Resnik, Using Information Content to Evaluate Semantic Similarity in a Taxonomy, IJCAI, 1995.
[3] Amos Tversky, Features of Similarity, Psychological Review 84(4), pp. 327-352, 1977.
[4] Debabrata Dey, A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases, IEEE Transactions on Knowledge and Data Engineering 14(3), May/June 2002.
[5] Hui Han, Hongyuan Zha and C. Lee Giles, A Model-based K-means Algorithm for Name Disambiguation, in Proceedings of the Second International Semantic Web Conference (ISWC-03) Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, 2003.
[6] M. Andrea Rodriguez and Max J. Egenhofer, Determining Semantic Similarity Among Entity Classes from Different Ontologies, IEEE Transactions on Knowledge and Data Engineering 15(2), pp. 442-456, 2003.
[7] Periklis Andritsos, Renee J. Miller and Panayiotis Tsaparas, Information-Theoretic Tools for Mining Database Structure from Large Data Sets, SIGMOD Conference, pp. 731-742, 2004.
[8] Vipul Kashyap and Amit Sheth, Semantic and Schematic Similarities between Database Objects: A Context-based Approach, VLDB Journal 5(4), pp. 276-304, 1996.
[9] Miyoung Cho, Junho Choi and Pankoo Kim, An Efficient Computational Method for Measuring Similarity between Two Conceptual Entities, WAIM 2003, pp. 381-388.
