Title: Entity Disambiguation


1
Entity Disambiguation
  • By
  • Angela Maduko
  • Directed by
  • Amit Sheth

2
Entity Disambiguation Problem
  • Emerges mainly while merging information from
    different sources
  • Two major levels
  • 1. Schema/Ontology level Determining the
    similarity of attributes/concepts/classes from
    the different schema/ontology to be merged
  • 2. Instance level Which instances of
    concepts/classes (/tuples in relational databases
    ) refer to the same entity

3
Current approaches for both levels
  • Feature-based Similarity Approach (FSA)
  • Set-Theory Similarity Approach (STA)
  • Information-Theory Similarity Approach (ITA)
  • Hybrid Approach (HA)
  • Relationship-based Similarity Approach (RSA)
  • Hybrid Similarity Approach (HSA)

4
ITA
  • In [1], Dekang Lin presents a measure for the
    similarity between two concepts based on both
    their commonalities and differences
  • Intuition 1: The similarity between A and B is
    related to their commonality. The more
    commonality they share, the more similar they
    are.
  • Intuition 2: The similarity between A and B is
    related to the differences between them. The more
    differences they have, the less similar they are.
  • Intuition 3: The maximum similarity between A and
    B is reached when A and B are identical, no
    matter how much commonality they share.

5
ITA
  • Consider the concept Fruit
  • A is an Apple
  • B is an Orange
  • Commonality of A and B?
  • common(A, B) = Fruit(A) and Fruit(B)
  • Measures the commonality between A and B,
    I(common(A, B)), by the amount of information
    contained in common(A, B)
  • where the information content of a proposition S
    is I(S) = -log P(S)

6
ITA
  • The difference is measured by I(description(A, B)) -
    I(common(A, B))
  • description(A, B) is a proposition which describes
    what A and B are
  • Can be applied at both levels 1 and 2
  • Intuitively, sim(A, B) is:
  • 1 when A and B are exactly alike
  • 0 when they share no commonalities
  • Proposes sim(A, B) = log P(common(A, B)) /
    log P(description(A, B))
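
A minimal sketch of the information-theoretic measure, assuming the ratio form from Lin's paper, sim(A, B) = log P(common(A, B)) / log P(description(A, B)). The function names and the probabilities for the apple/orange example are hypothetical and chosen only for illustration.

```python
import math


def information(p: float) -> float:
    """Information content I(S) = -log P(S) of a proposition with probability p."""
    return -math.log(p)


def lin_similarity(p_common: float, p_description: float) -> float:
    """Ratio of the information in the commonality to the information
    in the full description of A and B."""
    if not (0.0 < p_common <= 1.0 and 0.0 < p_description < 1.0):
        raise ValueError("expected 0 < p_common <= 1 and 0 < p_description < 1")
    return math.log(p_common) / math.log(p_description)


# Hypothetical probabilities (not from the slides).
p_common = 0.1         # P(common(A, B)): both A and B are fruit
p_description = 0.001  # P(description(A, B)): A is an apple and B is an orange

sim_value = lin_similarity(p_common, p_description)
```

The two boundary cases from the slide fall out directly: the ratio is 1 when the commonality is the whole description, and 0 when the commonality carries no information (P = 1).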

7
ITA
  • In [2], Resnik measures the similarity between
    two concepts in an is-a taxonomy based on the
    information content of their most specific common
    super-concept
  • Define P(c) as the probability of encountering an
    instance of a concept c in the taxonomy
  • For any two concepts c1 and c2, define S(c1, c2)
    as the set of concepts that subsume both c1 and
    c2
  • Proposes sim(c1, c2) = max over c in S(c1, c2) of
    [-log P(c)]

8
ITA
  • 100 instances of concept X
  • 4 instances of concept Y
  • 200 instances of concept Z
  • 2000 instances of all concepts
  • sim(A, B)
  • sim(C, D)
  • sim(A, D)
  • sim(A, E)
  • sim(C, D) > sim(A, B). Should this be so?

[Figure: taxonomy diagram containing the nodes X, Y, Z and A, B, C, D, E, F]

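
The instance counts on this slide can be plugged into Resnik's measure, taking the similarity as the largest information content -log P(c) among the shared subsumers. The sketch below uses the slide's counts but assumes a taxonomy shape, since the diagram is not recoverable: X is taken to subsume A and B, Y to subsume C and D, and a hypothetical ROOT concept covers all 2000 instances.

```python
import math

TOTAL_INSTANCES = 2000

# Counts from the slide; ROOT is a hypothetical top concept covering everything.
instance_counts = {"ROOT": 2000, "X": 100, "Y": 4, "Z": 200}

# Assumed subsumers of each concept (the original diagram did not survive).
subsumers = {
    "A": {"X", "ROOT"},
    "B": {"X", "ROOT"},
    "C": {"Y", "ROOT"},
    "D": {"Y", "ROOT"},
}


def information_content(concept: str) -> float:
    """-log P(c), with P(c) estimated as the fraction of instances under c."""
    return -math.log(instance_counts[concept] / TOTAL_INSTANCES)


def resnik_similarity(c1: str, c2: str) -> float:
    """Largest information content among concepts subsuming both c1 and c2."""
    shared = subsumers[c1] & subsumers[c2]
    return max(information_content(c) for c in shared)


sim_ab = resnik_similarity("A", "B")  # driven by X (100 instances)
sim_cd = resnik_similarity("C", "D")  # driven by Y (4 instances)
sim_ad = resnik_similarity("A", "D")  # only ROOT is shared
```

Under these assumptions sim(C, D) exceeds sim(A, B) only because Y has fewer instances than X, which is the behaviour the slide questions.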
9
ITA
  • Define s(w) as the set of concepts that are word
    senses of word w. Proposes a measure for word
    similarity as follows:
  • sim(w1, w2) = max over c1 in s(w1), c2 in s(w2) of
    sim(c1, c2)
  • Can be applied at level 1 only
  • Doctor (medical and PhD)
  • Nurse (medical and nanny)
  • Sim(Doctor, Nurse)
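
A small sketch of the word-level measure, assuming it takes the best concept-level similarity over all pairs of word senses, as in Resnik's paper. The sense identifiers and the concept-level scores are hypothetical placeholders, not values from the presentation.

```python
from itertools import product

# s(w): the word senses of each word, as listed on the slide.
senses = {
    "doctor": ["doctor_medical", "doctor_phd"],
    "nurse": ["nurse_medical", "nurse_nanny"],
}

# Hypothetical concept-level similarities for every pair of senses.
concept_similarity = {
    ("doctor_medical", "nurse_medical"): 0.8,
    ("doctor_medical", "nurse_nanny"): 0.2,
    ("doctor_phd", "nurse_medical"): 0.1,
    ("doctor_phd", "nurse_nanny"): 0.1,
}


def word_similarity(w1: str, w2: str) -> float:
    """Maximum concept similarity over all sense pairs of the two words."""
    return max(
        concept_similarity[(c1, c2)]
        for c1, c2 in product(senses[w1], senses[w2])
    )


best = word_similarity("doctor", "nurse")
```

With these placeholder scores the medical senses dominate, so the word-level similarity is decided by the closest pair of senses rather than by an average.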

10
STA
  • [3] introduces a set-theoretical notion of a
    matching function F based on the following
    assumptions for classes a, b, c with description
    sets A, B, C respectively
  • Matching: s(a, b) = F(A ∩ B, A - B, B - A)
  • Monotonicity: s(a, b) ≥ s(a, c) whenever A ∩ B ⊇
    A ∩ C, A - B ⊆ A - C, and B - A ⊆ C - A

11
STA
  • Proposes two models
  • Contrast model: similarity is defined as
  • An increasing function of common features
  • A decreasing function of distinctive features
    (features that apply to one object but not the
    other)
  • S(a, b) = θ f(A ∩ B) - α f(A - B) - β f(B - A), with
    θ, α, β ≥ 0
  • Function f measures the salience of a set of
    features
  • f depends on intensity and context factors
  • Intensity: physical salience (e.g. physical
    features)
  • Context: salience of features varies with
    context
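
The contrast model can be sketched with Python sets, assuming Tversky's form S(a, b) = θ f(A ∩ B) - α f(A - B) - β f(B - A). Here the salience function f defaults to set size, which is a simplification; the feature sets and weights are hypothetical.

```python
def contrast_similarity(A, B, theta=1.0, alpha=0.5, beta=0.5, f=len):
    """Reward common features, penalise features that belong to only one object."""
    if min(theta, alpha, beta) < 0:
        raise ValueError("theta, alpha and beta must be non-negative")
    return theta * f(A & B) - alpha * f(A - B) - beta * f(B - A)


# Hypothetical description sets.
apple = {"fruit", "round", "sweet", "red"}
orange = {"fruit", "round", "sweet", "orange", "citrus"}

score = contrast_similarity(apple, orange)
```

Setting alpha and beta to different values makes the measure asymmetric, so S(a, b) need not equal S(b, a).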

12
STA
  • Ratio model
  • S(a, b) = f(A ∩ B) / (f(A ∩ B) + α f(A - B) +
    β f(B - A)), with α, β ≥ 0
  • Can be applied at both levels 1 and 2
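
A sketch of the ratio model, assuming Tversky's form S(a, b) = f(A ∩ B) / (f(A ∩ B) + α f(A - B) + β f(B - A)) with set size standing in for the salience function f. Returning 0 for two empty description sets is a convention chosen here, not something the presentation specifies.

```python
def ratio_similarity(A, B, alpha=1.0, beta=1.0, f=len):
    """Common features as a fraction of common plus weighted distinctive features."""
    if alpha < 0 or beta < 0:
        raise ValueError("alpha and beta must be non-negative")
    common = f(A & B)
    denominator = common + alpha * f(A - B) + beta * f(B - A)
    # Convention (assumption): two empty descriptions have similarity 0.
    return common / denominator if denominator else 0.0


# Hypothetical description sets.
apple = {"fruit", "round", "sweet", "red"}
orange = {"fruit", "round", "sweet", "orange", "citrus"}

ratio = ratio_similarity(apple, orange)
```

Unlike the contrast model, the result is bounded between 0 and 1; with alpha = beta = 1 and f = len it coincides with the Jaccard coefficient.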

13
STA
  • [4] determines the similarity between two
    entities by the distance between them
  • Defines Pij as the probability that entities ai
    and bj have the same value for their kth
    attribute, for all k
  • Assigns costs to mismatching errors (i.e. not
    matching where it should, and matching where it
    should not) and then calculates a cost function
    based on Pij to be maximized.
  • Shows that the expected distance between ai and
    bj is dij = 1 - Pij, and substitutes this in the
    cost function
  • Calculates the distance between two entities as a
    linear combination (weighted average) of the
    distances between their common attributes

14
STA
  • Relationships amongst common attributes, such as
    key attributes and functional dependencies, are
    exploited
  • Obtains attribute weights from user
  • Can be applied at level 2 only
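
The weighted-average distance can be sketched as follows. This is a simplified illustration rather than the method of [4] itself: it assumes a per-attribute agreement probability is already available, takes 1 minus that probability as the attribute distance, and uses hypothetical user-supplied weights.

```python
def entity_distance(agreement_prob, weights):
    """Weighted average of per-attribute distances, where the distance on an
    attribute is 1 minus the probability that the two values agree."""
    total_weight = sum(weights[attr] for attr in agreement_prob)
    if total_weight <= 0:
        raise ValueError("attribute weights must sum to a positive value")
    weighted = sum(
        weights[attr] * (1.0 - p) for attr, p in agreement_prob.items()
    )
    return weighted / total_weight


# Hypothetical common attributes of two customer records.
agreement_prob = {"name": 0.9, "address": 0.5, "phone": 1.0}
weights = {"name": 3.0, "address": 1.0, "phone": 2.0}  # user-supplied

distance = entity_distance(agreement_prob, weights)
```

Giving key attributes a larger weight makes disagreement on them count more heavily toward the overall distance.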

15
CA
  • In [5], the authors present a model-based k-means
    clustering algorithm for name disambiguation in
    citations
  • Randomly assigns citations to N clusters
  • Estimates prior probabilities of each cluster
  • Computes probability that a cluster produces a
    given citation c
  • Assigns c to the cluster with the highest
    probability of producing it
  • Applied at level 2 only
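
The four steps above can be sketched as a hard-assignment loop. This is not the model from [5]: as a stand-in it assumes each cluster is a smoothed multinomial over the tokens of a citation (a naive-Bayes-style model), and the function name, parameters, and sample citations are all hypothetical.

```python
import math
import random
from collections import Counter


def model_based_kmeans(citations, n_clusters, n_iter=20, seed=0):
    """Cluster token lists by repeatedly re-fitting one model per cluster."""
    rng = random.Random(seed)
    vocab = {tok for cit in citations for tok in cit}
    # Step 1: random initial assignment of citations to clusters.
    labels = [rng.randrange(n_clusters) for _ in citations]
    for _ in range(n_iter):
        # Step 2: estimate cluster priors and token counts from the assignment.
        sizes = Counter(labels)
        token_counts = [Counter() for _ in range(n_clusters)]
        for cit, lab in zip(citations, labels):
            token_counts[lab].update(cit)

        def log_score(cit, k):
            # Step 3: log prior + log likelihood, both with add-one smoothing.
            prior = (sizes[k] + 1) / (len(citations) + n_clusters)
            total = sum(token_counts[k].values()) + len(vocab)
            return math.log(prior) + sum(
                math.log((token_counts[k][tok] + 1) / total) for tok in cit
            )

        # Step 4: move each citation to its most probable cluster.
        new_labels = [
            max(range(n_clusters), key=lambda k: log_score(cit, k))
            for cit in citations
        ]
        if new_labels == labels:
            break
        labels = new_labels
    return labels


# Hypothetical citations by two different "J. Smith" authors.
citations = [
    ["smith", "lee", "databases", "vldb"],
    ["smith", "lee", "databases", "vldb"],
    ["smith", "lee", "query", "sigmod"],
    ["smith", "kumar", "genomics", "bioinformatics"],
    ["smith", "kumar", "proteins", "bioinformatics"],
]
labels = model_based_kmeans(citations, n_clusters=2)
```

The result depends on the random initial assignment, so in practice several seeds would be tried; the number of clusters is assumed to be known in advance.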

16
CA
  • My comments on this paper
  • The drawback of this approach lies in the
    estimation of the model parameters: the data
    necessary for this may not be readily available.
  • Even if a user supplies estimates of these
    parameters, the estimates may not be unbiased
    and consistent

17
HA
  • [7] combines clustering and information content
    approaches for entity disambiguation (Scalable
    Information Bottleneck (LIMBO) method)
  • Attempts to cluster entities in such a way that
    the clusters are informative about the entities
    within them
  • Model: a set T of n entities (relational tuples),
    defined on m attributes (A1, A2, ..., Am). The
    domain of attribute Ai is the set Vi = {Vi,1,
    Vi,2, ..., Vi,di}
  • Let T and V be two discrete random variables that
    can take values from T and V respectively
  • Initially assigns each entity to a cluster, i.e.
    clusters the entities. Let Cq denote this initial
    clustering; then the mutual information of Cq and
    T equals the mutual information of V and T:
    I(Cq, T) = I(V, T)

18
HA
  • Assumes number of distinct entities k is known
  • Seeks a clustering Ck of V such that I(Ck, T)
    remains as large as possible, i.e. the information
    loss I(V, T) - I(Ck, T) is minimal
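
The information loss can be computed directly from a joint distribution. The sketch below follows the slide's notation (a clustering of V, scored by how much information about T it retains); the joint probabilities and the two candidate clusterings are hypothetical, and the code only evaluates a given clustering rather than implementing LIMBO's search.

```python
import math
from collections import defaultdict


def mutual_information(joint):
    """I(X; T) in bits for a joint distribution {(x, t): probability}."""
    px, pt = defaultdict(float), defaultdict(float)
    for (x, t), p in joint.items():
        px[x] += p
        pt[t] += p
    return sum(
        p * math.log2(p / (px[x] * pt[t]))
        for (x, t), p in joint.items()
        if p > 0
    )


def cluster_joint(joint, assignment):
    """Joint distribution after replacing each value v by its cluster."""
    merged = defaultdict(float)
    for (v, t), p in joint.items():
        merged[(assignment[v], t)] += p
    return dict(merged)


def information_loss(joint, assignment):
    """I(V, T) - I(Ck, T) for the clustering described by `assignment`."""
    return mutual_information(joint) - mutual_information(
        cluster_joint(joint, assignment)
    )


# Hypothetical joint distribution over values and tuples.
joint = {("v1", "t1"): 0.25, ("v2", "t1"): 0.25, ("v3", "t2"): 0.5}

# Merging v1 and v2 (which behave identically) loses nothing ...
loss_good = information_loss(joint, {"v1": "c1", "v2": "c1", "v3": "c2"})
# ... while merging v2 and v3 blurs the distinction between t1 and t2.
loss_bad = information_loss(joint, {"v1": "c1", "v2": "c2", "v3": "c2"})
```

An agglomerative procedure would repeatedly pick the merge with the smallest such loss until k clusters remain.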

19
HSA
  • In [8], Kashyap and Sheth introduce the concept
    of semantic proximity (semPro) between entities
    to capture their similarity
  • In addition to context, employs relationships and
    features of entities in determining their
    similarity
  • semPro(O1, O2) = <Context, Abstraction, (D1, D2),
    (S1, S2)>
  • Context: the context in which objects O1 and O2
    are being compared
  • Abstraction: abstraction/mappings relating the
    domains of the objects
  • (D1, D2): domain definitions of the objects
  • (S1, S2): states of the objects

20
HSA
  • Abstractions
  • Total 1-1 value mapping
  • Partial many-one mapping.
  • Generalization/specialization.
  • Aggregation.
  • Functional dependencies.
  • ANY
  • NONE

21
HSA
  • Semantic Taxonomy
  • Defines 5 degrees of similarity between objects
  • Semantic Equivalence
  • Semantic Relationship
  • Semantic Relevance
  • Semantic Resemblance
  • Semantic Incompatibility

22
HSA
  • Semantic Equivalence: strongest measure of
    semantic proximity
  • Two objects are said to be semantically
    equivalent when they represent the same
    real-world entity, i.e.
  • semPro(O1, O2) = <ALL, total 1-1 value mapping,
    (D1, D2), _> (domain Semantic Equivalence)
  • semPro(O1, O2) = <ALL, M, (D1, D2), (S1, S2)>,
    where M = a total 1-1 value mapping between (D1,
    S1) and (D2, S2) (state Semantic Equivalence)

23
HSA
  • Semantic Relationship: weaker than semantic
    equivalence.
  • semPro(O1, O2) = <ALL, M, (D1, D2), _>, where M =
    a partial many-one value mapping,
    generalization, or aggregation
  • Requirement of a 1-1 mapping is relaxed such
    that, given an instance O1, we can identify an
    instance of O2, but not vice versa.

24
HSA
  • Semantic Relevance
  • Two objects are semantically relevant if there
    exists any mapping between their domains in some
    context
  • semPro(O1, O2) = <SOME, ANY, (D1, D2), _>

25
HSA
  • Semantic Resemblance: weakest measure of semantic
    proximity.
  • There does not exist any mapping between their
    domains in any context
  • Have same roles in some contexts with coherent
    definition contexts

26
HSA
  • Semantic Incompatibility
  • Asserts semantic dissimilarity.
  • Asserts that there is no context and no
    abstraction in which the domains of the two
    objects are related.
  • semPro(O1, O2) = <NONE, NONE, (D1, D2), _>

27
HSA
  • [6] encompasses both the feature-based and
    relationship-based approaches
  • Represents entity classes using 3 components and
    assesses the similarity of two classes using 3
    different similarity measures, one for each of
    the 3 components, namely:
  • Synonym set (to address polysemy and synonymy)
  • Set of distinguishing features or differentiae
    (functions, parts and attributes)
  • Set of semantic inter-relations amongst entity
    classes (mainly hyponymy and meronymy)

28
HSA
  • Applied at level 1 only
  • Defines the semantic neighbourhood of a class a
    with radius r as follows: N(a, r) = {ci | ∀i
    d(a, ci) ≤ r}, where d(a, c) is the length of the
    shortest path connecting the two classes in the
    ontology
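
A semantic neighbourhood can be collected with a breadth-first search, assuming the condition is d(a, ci) ≤ r and that the ontology is given as an undirected adjacency map. The class names are hypothetical, and including the class a itself (distance 0) is a choice made for this sketch.

```python
from collections import deque


def semantic_neighbourhood(ontology, a, r):
    """Classes whose shortest-path distance to `a` is at most `r`."""
    distance = {a: 0}
    queue = deque([a])
    while queue:
        node = queue.popleft()
        if distance[node] == r:
            continue  # neighbours of this node would exceed the radius
        for neighbour in ontology.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return set(distance)


# Hypothetical ontology fragment: links are is-a or part-of relations.
ontology = {
    "stadium": ["construction"],
    "construction": ["stadium", "building"],
    "building": ["construction", "theater"],
    "theater": ["building"],
}

neighbourhood = semantic_neighbourhood(ontology, "stadium", 2)
```

Breadth-first search visits classes in order of increasing path length, so the first distance recorded for a class is its shortest one.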

29
HSA
  • Proposes S(a, b) = ωw Sw(a, b) + ωu Su(a, b) +
    ωn Sn(a, b)
  • Sw, Su and Sn are the respective similarities
    between synonym sets, features and semantic
    neighbourhoods of classes a and b
  • ωw, ωu and ωn ≥ 0 are the respective weights of
    the similarity of each component
  • S(a, b) is based on a normalization of Tversky's
    [3] model, with 0 ≤ α ≤ 1
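
A sketch of how the pieces combine, assuming the normalized Tversky form reported in the Rodriguez and Egenhofer paper, |A ∩ B| / (|A ∩ B| + α|A - B| + (1 - α)|B - A|), for each component and a weighted sum across components. The asymmetry factor α is passed in as a plain parameter here, and all sets, scores, and weights are hypothetical.

```python
def normalized_tversky(A, B, alpha):
    """Set similarity in [0, 1]; alpha in [0, 1] splits the penalty between
    features only in A and features only in B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    common = len(A & B)
    denominator = common + alpha * len(A - B) + (1.0 - alpha) * len(B - A)
    return common / denominator if denominator else 0.0


def combined_similarity(s_w, s_u, s_n, w_w, w_u, w_n):
    """Weighted sum of synonym-set, feature, and neighbourhood similarities."""
    if min(w_w, w_u, w_n) < 0:
        raise ValueError("weights must be non-negative")
    return w_w * s_w + w_u * s_u + w_n * s_n


# Hypothetical synonym sets of two classes from different ontologies.
synonyms_a = {"stadium", "sports arena"}
synonyms_b = {"stadium", "arena", "ballpark"}

s_w = normalized_tversky(synonyms_a, synonyms_b, alpha=0.5)
# Hypothetical feature and neighbourhood similarities and weights.
total = combined_similarity(s_w, s_u=0.6, s_n=0.3, w_w=0.4, w_u=0.4, w_n=0.2)
```

With α different from 0.5 the component similarity becomes asymmetric, which is how the model expresses that a class resembles its super-class more than the reverse.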

30
HSA
  • Assumes similarity is asymmetric, with more
    similarity from a class to its super-class than
    vice-versa
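  • The function α evaluates this asymmetry, defined
    thus: α(a, b) = depth(a) / (depth(a) + depth(b))
    if depth(a) ≤ depth(b), and 1 - depth(a) /
    (depth(a) + depth(b)) otherwise
  • where depth(a) returns the length of the shortest
    path from class a to an imaginary root connecting
    the roots of the two ontologies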
  • Applies word matching in the synonym sets of a
    and b for Sw

31
HSA
  • For Su, applies matching over corresponding
    differentiae (functions Sf, parts Sp, attributes
    Sa, with corresponding weights ωf, ωp and ωa ≥ 0)
    such that Su(a, b) = ωf Sf(a, b) + ωp Sp(a, b) +
    ωa Sa(a, b), where ωf + ωp + ωa = 1
  • For Sn, compares entity classes in semantic
    neighbourhoods based on synonym sets or
    differentiae of these classes.

32
HSA
  • In [9], Cho et al. propose a model derived from
    the edge-based approach, employing the information
    content of the node-based approach, based on these
    facts:
  • There exists a correlation between similarity and
    the number of shared parent concepts in a hierarchy
  • Link type (hyponymy, meronymy, etc.) → semantic
    relationship

33
HSA
  • Conceptual similarity between a node and its
    adjacent child node may not be equal
  • As depth increases in the hierarchy, conceptual
    similarity b/w a node and its adjacent child node
    decreases
  • Population of nodes is not uniform over the entire
    ontological structure (links in a dense part of
    the hierarchy → less distance than those in a less
    dense part)

34
HSA
  • Proposes S(ci, cj) = D(L(j→i)) · Σ(0 ≤ k ≤ n)
    [W(tk) · d(c(k+1→k)) · f(d)] · max H(c), where
  • f(d) is a function that returns a depth factor
    (topological location in hierarchy)
  • d(c(k+1→k)) is a density function
  • D(L(j→i)) is a function that returns a distance
    factor between ci and cj (shortest path from one
    node to the other)
  • W(tk) is a weight function that assigns weights
    to each link type (W(tk) = 1 for an is-a link)
  • H(c) is the information content of the
    super-concepts of ci and cj
  • For level 1 only

35
References
  • [1] Dekang Lin, An Information-Theoretic Definition
    of Similarity, Proceedings of the Fifteenth
    International Conference on Machine Learning,
    pp. 296-304, 1998.
  • [2] Philip Resnik, Using Information Content to
    Evaluate Semantic Similarity in a Taxonomy,
    IJCAI, 1995.
  • [3] Amos Tversky, Features of Similarity,
    Psychological Review 84(4), pp. 327-352, 1977.
  • [4] Debabrata Dey, A Distance-Based Approach to
    Entity Reconciliation in Heterogeneous Databases,
    IEEE Transactions on Knowledge and Data
    Engineering 14(3), May/June 2002.
  • [5] Hui Han, Hongyuan Zha and C. Lee Giles, A
    Model-based K-means Algorithm for Name
    Disambiguation, in Proceedings of the Second
    International Semantic Web Conference (ISWC-03)
    Workshop on Semantic Web Technologies for
    Searching and Retrieving Scientific Data, 2003.
  • [6] M. Andrea Rodriguez and Max J. Egenhofer,
    Determining Semantic Similarity Among Entity
    Classes from Different Ontologies, IEEE
    Transactions on Knowledge and Data Engineering
    15(2), pp. 442-456, 2003.
  • [7] Periklis Andritsos, Renee J. Miller and
    Panayiotis Tsaparas, Information-Theoretic Tools
    for Mining Database Structure from Large Data
    Sets, SIGMOD Conference 2004, pp. 731-742.
  • [8] Vipul Kashyap and Amit Sheth, Semantic and
    Schematic Similarities between Database Objects: A
    Context-based Approach, VLDB Journal 5(4),
    pp. 276-304, 1996.
  • [9] Miyoung Cho, Junho Choi and Pankoo Kim, An
    Efficient Computational Method for Measuring
    Similarity between Two Conceptual Entities, WAIM
    2003, pp. 381-388.