Natural Language Processing: word sense disambiguation
1
Natural Language Processing: word sense disambiguation
  • Updated 1/12/2005

2
Overview of the Problem
  • Problem: many words have different meanings or
    senses => there is ambiguity about how they are
    to be interpreted.
  • Task: determine which of the senses of an
    ambiguous word is invoked in a particular use of
    the word. This is done by looking at the context
    of the word's use.
  • Note: more often than not, the different senses of
    a word are closely related.

3
Word Sense
  • Many words have several meanings or senses.
  • Consider the word "bank" (Webster's New
    Collegiate):
  • the rising ground, bordering a lake, river or
    sea
  • an establishment for the custody, loan, exchange
    or issue of money, for the extension of credit,
    and for facilitating the transmission of funds.
  • However, the senses are not always so well
    defined. E.g., "title":
  • An identifying name given to a book, play, film,
    musical composition, or other work.
  • A general or descriptive heading, as of a book
    chapter.
  • Law. A heading that names a document, statute, or
    proceeding.
  • A formal appellation attached to the name of a
    person or family by virtue of office, rank,
    hereditary privilege, noble birth, etc.

4
Overview of our Discussion
  • Methodology:
  • Supervised Disambiguation: based on a labeled
    training set.
  • Dictionary-Based Disambiguation: based on lexical
    resources such as dictionaries and thesauri.
  • Unsupervised Disambiguation: based on unlabeled
    corpora.

5
Methodological Preliminaries
  • Supervised versus Unsupervised Learning: in
    supervised learning, the sense label of each word
    occurrence is known; in unsupervised learning, it
    is not.
  • Pseudowords: used to generate artificial
    evaluation data for the comparison and improvement
    of text-processing algorithms (see the sketch
    after this list).
  • Upper and Lower Bounds on Performance: used to
    find out how well an algorithm performs relative
    to the difficulty of the task.
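
A minimal sketch of how pseudoword data can be built (our own
illustration: two unrelated words, e.g. "banana" and "door", are
conflated into the artificial ambiguous word "banana-door", and the
original word is kept as the gold sense label):

    # Sketch: build pseudoword evaluation data by conflating two words.
    # 'sentences' is a list of tokenized sentences (hypothetical input).
    def make_pseudoword_data(sentences, w1, w2):
        data = []
        for words in sentences:
            for i, w in enumerate(words):
                if w in (w1, w2):
                    context = words[:i] + [w1 + "-" + w2] + words[i + 1:]
                    data.append((w, context))  # (gold sense, ambiguous context)
        return data

    # Every occurrence of 'banana' or 'door' becomes 'banana-door'; a
    # disambiguator is then scored on recovering the original word.
    data = make_pseudoword_data([["the", "banana", "was", "ripe"],
                                 ["close", "the", "door"]], "banana", "door")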

6
Supervised Disambiguation
  • Training set: exemplars where each occurrence of
    the ambiguous word w is annotated with a semantic
    label => a classification problem.
  • Approaches:
  • Bayesian Classification: the context of
    occurrence is treated as a bag of words without
    structure, but it integrates information from
    many words.
  • Information Theory: only looks at informative
    features in the context. These features may be
    sensitive to text structure.
  • There are many more approaches (see Chapter 16 or
    the Machine Learning course).

7
Supervised Disambiguation: Bayesian Classification I
  • (Gale et al., 1992)'s idea: look at the words
    around an ambiguous word in a large context
    window. Each content word contributes potentially
    useful information about which sense of the
    ambiguous word is likely to be used with it. The
    classifier does no feature selection; instead, it
    combines the evidence from all features.
  • Bayes decision rule: decide s' if P(s' | C) >
    P(sk | C) for all sk ≠ s'.
  • P(sk | C) is computed by Bayes' rule:
    P(sk | C) = P(C | sk) P(sk) / P(C).

8
Supervised Disambiguation: Bayesian Classification II
  • Naive Bayes assumption:
  • P(C | sk) = P({vj : vj in C} | sk) = Π vj in C P(vj | sk)
  • The Naive Bayes assumption is incorrect in the
    context of text processing, but it is useful.
  • Two consequences:
  • The structure and linear ordering of words is
    ignored: a bag-of-words model.
  • The presence of one word is treated as independent
    of the presence of another, which is clearly
    untrue in text.

9
Supervised Disambiguation: Bayesian Classification III
  • Decision rule for Naive Bayes:
  • Decide s' = argmax sk [ log P(sk) + Σ vj in C log P(vj | sk) ]
  • P(vj | sk) and P(sk) are computed via
    Maximum-Likelihood Estimation, perhaps with
    appropriate smoothing, from the labeled training
    corpus.
  • Performance:
  • Gale, Church, and Yarowsky obtain 90% correct
    disambiguation on 6 ambiguous nouns in the Hansard
    corpus using this approach (drug, duty, land,
    language, position, sentence); a minimal sketch of
    such a classifier follows below.
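
A minimal Python sketch of this kind of classifier (our own
illustration, not Gale et al.'s implementation; add-one smoothing
stands in for the "appropriate smoothing" mentioned above):

    import math
    from collections import Counter, defaultdict

    def train(labeled_contexts):
        # labeled_contexts: list of (sense, context_words) pairs.
        sense_counts = Counter()
        word_counts = defaultdict(Counter)  # word_counts[sense][word]
        vocab = set()
        for sense, words in labeled_contexts:
            sense_counts[sense] += 1
            for w in words:
                word_counts[sense][w] += 1
                vocab.add(w)
        return sense_counts, word_counts, vocab

    def disambiguate(context, sense_counts, word_counts, vocab):
        total = sum(sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense in sense_counts:
            # score = log P(sk) + sum over vj in C of log P(vj | sk),
            # with add-one smoothing (+1 leaves mass for unseen words).
            score = math.log(sense_counts[sense] / total)
            denom = sum(word_counts[sense].values()) + len(vocab) + 1
            for w in context:
                score += math.log((word_counts[sense][w] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense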

10
Supervised Disambiguation: Bayesian Classification IV
  • Clues for the two senses of drug

Sense              Clues for sense
medication         prices, prescription, patent, increase, consumer
illegal substance  abuse, paraphernalia, illicit, alcohol, cocaine
11
Supervised Disambiguation: An Information-Theoretic Approach
  • (Brown et al., 1991)'s idea: find a single
    contextual feature that reliably indicates which
    sense of the ambiguous word is being used.
  • The Flip-Flop algorithm is used to disambiguate
    between the different senses of a word, using
    mutual information as a measure (see the helper
    sketched below):
  • I(X;Y) = Σ x∈X Σ y∈Y p(x,y) log [ p(x,y) / (p(x) p(y)) ]
  • The algorithm works by searching for a partition
    of senses that maximizes the mutual information,
    and stops when the increase becomes
    insignificant.
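
The mutual information above as a short helper (our own sketch;
'joint' is assumed to be a nested dict giving p(x, y)):

    import math

    def mutual_information(joint):
        # joint[x][y] = p(x, y); returns I(X;Y) in nats.
        px = {x: sum(row.values()) for x, row in joint.items()}
        py = {}
        for row in joint.values():
            for y, p in row.items():
                py[y] = py.get(y, 0.0) + p
        return sum(p * math.log(p / (px[x] * py[y]))
                   for x, row in joint.items()
                   for y, p in row.items() if p > 0)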

12
Flip-Flop algorithm
  • t1, …, tm are translations of an ambiguous
    word, and x1, …, xn are possible values of the
    indicator (a toy Python rendering follows the
    pseudocode below).
  • find a random partition P = {P1, P2} of {t1, …, tm}
  • while (there is a significant improvement) do
  • find the partition Q = {Q1, Q2} of the indicators
    {x1, …, xn} that maximizes I(P;Q)
  • find the partition P = {P1, P2} of the translations
    {t1, …, tm} that maximizes I(P;Q)
  • end
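
A toy brute-force rendering of this loop (our own sketch: it
enumerates all two-way partitions, which is exponential and only
feasible for tiny sets; Brown et al. use an efficient splitting
procedure instead; counts[(t, x)] is an assumed co-occurrence table
of translations and indicator values):

    import math
    from itertools import combinations

    def partitions(items):
        # Yield all 2-way partitions of items as pairs of frozensets.
        items = list(items)
        for r in range(1, len(items)):
            for left in combinations(items, r):
                yield frozenset(left), frozenset(items) - frozenset(left)

    def mi(P, Q, counts, total):
        # I(P;Q) for a partition P of translations, Q of indicator values.
        val = 0.0
        for Pi in P:
            px = sum(c for (t, x), c in counts.items() if t in Pi) / total
            for Qj in Q:
                py = sum(c for (t, x), c in counts.items() if x in Qj) / total
                pxy = sum(c for (t, x), c in counts.items()
                          if t in Pi and x in Qj) / total
                if pxy > 0:
                    val += pxy * math.log(pxy / (px * py))
        return val

    def flip_flop(translations, indicators, counts, tol=1e-9):
        total = sum(counts.values())
        P = next(partitions(translations))  # arbitrary initial partition
        best = float("-inf")
        while True:
            Q = max(partitions(indicators), key=lambda q: mi(P, q, counts, total))
            P = max(partitions(translations), key=lambda p: mi(p, Q, counts, total))
            score = mi(P, Q, counts, total)
            if score - best < tol:  # stop when the improvement is insignificant
                return P, Q
            best = score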

13
Flip-Flop: example
  • Suppose we want to translate "prendre" based on its
    object, and have {t1, …, tm} = {take, make, rise,
    speak} and {x1, …, xn} = {mesure, note, exemple,
    décision, parole}; "prendre" is translated as
    "take" when occurring with the objects mesure,
    note, and exemple, and otherwise as "make", "rise"
    or "speak".
  • Suppose the initial partition is P1 = {take, rise}
    and P2 = {make, speak}. Then choose the partition
    Q of indicator values that maximizes I(P;Q), say
    Q1 = {mesure, note, exemple} and Q2 = {décision,
    parole} (selected if the division gives us the
    most information for distinguishing translations
    in P1 from translations in P2).
  • "prendre la parole" is not translated as "rise to
    speak" when it should be, so repartition as
    P1 = {take} and P2 = {rise, make, speak}, keeping Q
    as before. This is always correct for the "take"
    sense.
  • To distinguish among the others, we would have to
    consider more than two senses.

14
Dictionary-Based Disambiguation Overview
  • We will be looking at three different methods:
  • Disambiguation based on sense definitions
  • Thesaurus-Based Disambiguation
  • Disambiguation based on translations in a
    second-language corpus
  • Also, we will show how a careful examination of
    the distributional properties of senses can lead
    to significant improvements in disambiguation.

15
Disambiguation based on sense definitions
  • (Lesk, 1986)'s idea: a word's dictionary
    definitions are likely to be good indicators for
    the senses they define.
  • Express each dictionary sub-definition of the
    ambiguous word as a bag of words, and pool the
    words occurring in the context of the ambiguous
    word into a single bag of words.
  • Disambiguate the ambiguous word by choosing the
    sub-definition that has the greatest overlap with
    the words occurring in its context (a simplified
    sketch follows below).
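
A simplified Lesk sketch (our own illustration; a real implementation
would at least remove stopwords before computing overlap):

    def lesk(context_words, sense_definitions):
        # sense_definitions: dict mapping sense name -> definition text.
        context = {w.lower() for w in context_words}
        def overlap(sense):
            return len(context & set(sense_definitions[sense].lower().split()))
        return max(sense_definitions, key=overlap)

    # Toy example using the two 'bank' definitions from slide 3:
    senses = {
        "ground":  "the rising ground bordering a lake river or sea",
        "finance": "an establishment for the custody loan exchange "
                   "or issue of money",
    }
    print(lesk("they sat on the grassy bank of the river near the sea".split(),
               senses))  # -> 'ground'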

16
Thesaurus-Based Disambiguation
  • Idea: the semantic categories of the words in a
    context determine the semantic category of the
    context as a whole. This category, in turn,
    determines which word senses are used.
  • (Walker, 87): each word is assigned one or more
    subject codes which correspond to its different
    meanings. For each subject code, we count the
    number of words (from the context) having the
    same subject code, and select the subject code
    with the highest count (see the sketch below).
  • (Yarowsky, 92) adapted the algorithm for words
    that do not occur in the thesaurus but that are
    very informative, e.g. Navratilova --> Sports.
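
A minimal sketch of Walker's counting step (our own illustration;
'word_codes' stands in for a thesaurus lookup of subject codes):

    from collections import Counter

    def walker_disambiguate(target_codes, context_words, word_codes):
        # target_codes: the subject codes (senses) of the ambiguous word.
        # word_codes: dict mapping each word to its set of subject codes.
        votes = Counter()
        for w in context_words:
            for code in word_codes.get(w, ()):
                if code in target_codes:
                    votes[code] += 1
        return votes.most_common(1)[0][0] if votes else None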

17
Disambiguation based on translations in a
second-language corpus
  • (Dagan & Itai, 91)'s idea: words can be
    disambiguated by looking at how they are
    translated in other languages.
  • Example: the word "interest" has two translations
    in German: 1) "Beteiligung" (legal share: "a 50%
    interest in the company"); 2) "Interesse"
    (attention, concern: "her interest in
    Mathematics").
  • To disambiguate the word "interest", we identify
    the phrase it occurs in, search a German corpus
    for instances of that phrase, and assign the
    meaning associated with the German use of the
    word in that phrase (see the sketch below).
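
A minimal sketch of the counting idea (our own illustration;
'corpus_count' is an assumed lookup that returns how often a German
phrase occurs in the second-language corpus):

    def choose_sense(phrase_template, translations, corpus_count):
        # translations: dict mapping sense -> German translation of the
        # word; pick the sense whose translation makes the phrase most
        # frequent in the second-language corpus.
        return max(translations,
                   key=lambda s: corpus_count(
                       phrase_template.format(translations[s])))

    # E.g. for "showed interest": compare the corpus counts of
    # "zeigte Interesse" vs. "zeigte Beteiligung":
    # choose_sense("zeigte {}", {"attention": "Interesse",
    #                            "legal_share": "Beteiligung"}, corpus_count)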

18
One sense per discourse, one sense per collocation
  • (Yarowsky, 1995)'s idea: there are constraints
    between different occurrences of an ambiguous
    word within a corpus that can be exploited for
    disambiguation.
  • One sense per discourse: the sense of a target
    word is highly consistent within any given
    document (a minimal illustration follows below).
  • One sense per collocation: nearby words provide
    strong and consistent clues to the sense of a
    target word, conditional on relative distance,
    order and syntactic relationship.
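
A minimal illustration of the discourse constraint (our own sketch,
not Yarowsky's full bootstrapping algorithm): take a base classifier's
per-occurrence decisions within one document and let the majority
sense override the rest:

    from collections import Counter

    def one_sense_per_discourse(occurrence_senses):
        # occurrence_senses: senses assigned to each occurrence of the
        # target word in a single document; return one document-level sense.
        return Counter(occurrence_senses).most_common(1)[0][0]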

19
Unsupervised Disambiguation
  • Idea: disambiguate word senses without having
    recourse to supporting tools such as dictionaries
    and thesauri, and in the absence of labeled text.
    Simply cluster the contexts of an ambiguous word
    into a number of groups, and discriminate between
    these groups without labeling them.
  • (Schütze, 1998): the probabilistic model is the
    same Bayesian model as the one used for
    supervised classification, but the P(vj | sk) are
    estimated using the EM algorithm.

20
EM algorithm
  • Initialize the parameters θ of the model: these are
    P(vj | sk) and P(sk), j = 1, 2, …, J, k = 1, 2, …, K.
  • Compute the log likelihood of the corpus C given
    the model θ: l(C | θ) = log Π i Σ k P(ci | sk) P(sk)
  • while l(C | θ) increases, repeat:
  • E-step: hik = P(ci | sk) P(sk) / Σ k' P(ci | sk') P(sk')
    (use Naive Bayes to compute P(ci | sk))
  • M-step: re-estimate the parameters P(vj | sk) and
    P(sk) by MLE:
  • P(vj | sk) = Σ ci hik / Zk, where the sum is over
    all contexts ci in which vj occurs and Zk is a
    normalizing constant.
  • P(sk) = Σ i hik / Σ k' Σ i hik' = Σ i hik / I, where
    I is the number of contexts. (A toy implementation
    of the loop follows below.)
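
A toy EM sketch following the update rules above (our own simplified
version; it starts from random soft assignments, so the M-step comes
first, the add-one smoothing is our addition to keep the logs finite,
and Schütze's actual system estimates these quantities in a reduced
vector space):

    import math, random

    def em(contexts, K, iterations=50):
        # contexts: list of token lists; K: number of sense groups.
        vocab = sorted({w for c in contexts for w in c})
        I = len(contexts)
        # Random soft assignments h[i][k], normalized per context.
        h = [[random.random() for _ in range(K)] for _ in range(I)]
        h = [[v / sum(row) for v in row] for row in h]
        for _ in range(iterations):
            # M-step: re-estimate P(sk) and P(vj | sk) from h.
            p_s = [sum(h[i][k] for i in range(I)) / I for k in range(K)]
            p_w = []
            for k in range(K):
                counts = {w: 1.0 for w in vocab}  # add-one smoothing
                for i, c in enumerate(contexts):
                    for w in c:
                        counts[w] += h[i][k]
                Z = sum(counts.values())
                p_w.append({w: n / Z for w, n in counts.items()})
            # E-step: h[i][k] proportional to P(ci | sk) P(sk),
            # computed in log space for numerical stability.
            for i, c in enumerate(contexts):
                logs = [math.log(p_s[k]) + sum(math.log(p_w[k][w]) for w in c)
                        for k in range(K)]
                m = max(logs)
                weights = [math.exp(l - m) for l in logs]
                s = sum(weights)
                h[i] = [w / s for w in weights]
        return h, p_s, p_w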

21
Disambiguation
  • Once the model parameters have been estimated, a
    word w can be disambiguated by computing the
    probability of each sense given the words vj in
    the context.
  • Again, we use the Naive Bayes assumption:
  • Decide s' = argmax sk [ log P(sk) + Σ vj in C log
    P(vj | sk) ]

22
Performance of unsupervised disambiguation
  • It is capable of identifying differences in sense
    as minute as "bank" in a physical sense versus
    "bank" in an abstract sense.
  • Usually the clusters obtained are not identical
    to dictionary senses.
  • Results of unsupervised disambiguation (Schütze,
    1998):

Word    Sense                  Mean accuracy (%)
suit    lawsuit                95
        garment                96
motion  physical movement      85
        proposal for action    88
train   line of railroad cars  79
        teach                  55