Title: Natural Language Processing word sense disambiguation
1. Natural Language Processing: Word Sense Disambiguation
2. Overview of the Problem
- Problem: many words have different meanings or senses => there is ambiguity about how they are to be interpreted.
- Task: to determine which of the senses of an ambiguous word is invoked in a particular use of the word. This is done by looking at the context of the word's use.
- Note: more often than not, the different senses of a word are closely related.
3. Word Sense
- Many words have several meanings or senses.
- Consider the word "bank" (Webster's New Collegiate):
  - the rising ground bordering a lake, river, or sea
  - an establishment for the custody, loan, exchange, or issue of money, for the extension of credit, and for facilitating the transmission of funds
- However, the senses are not always so well defined. E.g., "title":
  - An identifying name given to a book, play, film, musical composition, or other work.
  - A general or descriptive heading, as of a book chapter.
  - Law: a heading that names a document, statute, or proceeding.
  - A formal appellation attached to the name of a person or family by virtue of office, rank, hereditary privilege, noble birth, etc.
4. Overview of our Discussion
- Methodology
- Supervised Disambiguation: based on a labeled training set.
- Dictionary-Based Disambiguation: based on lexical resources such as dictionaries and thesauri.
- Unsupervised Disambiguation: based on unlabeled corpora.
5. Methodological Preliminaries
- Supervised versus Unsupervised Learning: in supervised learning, the sense label of a word occurrence is known. In unsupervised learning, it is not known.
- Pseudowords: used to generate artificial evaluation data for comparing and improving text-processing algorithms.
- Upper and Lower Bounds on Performance: used to find out how well an algorithm performs relative to the difficulty of the task.
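The slide does not say how pseudowords are built. One common construction, sketched here as an assumption, conflates two unrelated natural words into a single artificial ambiguous word and keeps the original word as the gold "sense". The function name `make_pseudoword_corpus` and the example sentence are hypothetical.

```python
def make_pseudoword_corpus(tokens, word_a, word_b):
    """Replace every occurrence of word_a or word_b with the pseudoword
    'word_a-word_b' and record the original word as the gold label."""
    pseudoword = f"{word_a}-{word_b}"
    ambiguous_tokens = []
    gold_labels = []  # (position, original word) pairs
    for position, token in enumerate(tokens):
        if token in (word_a, word_b):
            ambiguous_tokens.append(pseudoword)
            gold_labels.append((position, token))
        else:
            ambiguous_tokens.append(token)
    return ambiguous_tokens, gold_labels


# Toy sentence, invented for illustration only.
tokens = "she ate a banana and then closed the door".split()
ambiguous, gold = make_pseudoword_corpus(tokens, "banana", "door")
print(ambiguous)
print(gold)
```

A disambiguation algorithm can then be scored on how often it recovers the original word, without any hand labeling.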
6. Supervised Disambiguation
- Training set: exemplars where each occurrence of the ambiguous word w is annotated with a semantic label => classification problem.
- Approaches:
  - Bayesian Classification: the context of occurrence is treated as a bag of words without structure, but it integrates information from many words.
  - Information Theory: only looks at informative features in the context. These features may be sensitive to text structure.
- There are many more approaches (see Chapter 16 or the Machine Learning course).
7. Supervised Disambiguation: Bayesian Classification I
- (Gale et al., 1992)'s idea: look at the words around an ambiguous word in a large context window. Each content word contributes potentially useful information about which sense of the ambiguous word is likely to be used with it. The classifier does no feature selection. Instead, it combines the evidence from all features.
- Bayes decision rule: decide $s'$ if $P(s' \mid C) > P(s_k \mid C)$ for all $s_k \neq s'$.
- $P(s_k \mid C)$ is computed by Bayes' Rule: $P(s_k \mid C) = \frac{P(C \mid s_k) P(s_k)}{P(C)}$.
8. Supervised Disambiguation: Bayesian Classification II
- Naive Bayes assumption:
  - $P(C \mid s_k) = P(\{v_j \mid v_j \in C\} \mid s_k) = \prod_{v_j \in C} P(v_j \mid s_k)$
- The Naive Bayes assumption is incorrect in the context of text processing, but it is useful.
- Two consequences:
  - The structure and linear ordering of words is ignored: bag of words model.
  - The presence of one word is assumed to be independent of another, which is clearly untrue in text.
9. Supervised Disambiguation: Bayesian Classification III
- Decision rule for Naive Bayes:
  - Decide $s' = \arg\max_{s_k} \left[ \log P(s_k) + \sum_{v_j \in C} \log P(v_j \mid s_k) \right]$
- $P(v_j \mid s_k)$ and $P(s_k)$ are computed via Maximum-Likelihood Estimation, perhaps with appropriate smoothing, from the labeled training corpus.
- Performance:
  - Gale, Church, and Yarowsky obtain 90% correct disambiguation on 6 ambiguous nouns in the Hansard corpus using this approach (drug, duty, land, language, position, sentence).
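The Naive Bayes decision rule can be sketched in a few lines. This is a minimal illustration, not the setup of Gale et al.: the training contexts are invented, add-alpha smoothing is one possible choice of "appropriate smoothing", and words unseen in training are simply skipped.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_contexts, alpha=1.0):
    """Estimate log P(s_k) and log P(v_j | s_k) with add-alpha smoothing."""
    sense_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for sense, context in labeled_contexts:
        sense_counts[sense] += 1
        word_counts[sense].update(context)
        vocabulary.update(context)
    total = sum(sense_counts.values())
    log_prior = {s: math.log(n / total) for s, n in sense_counts.items()}
    log_likelihood = {}
    for sense in sense_counts:
        denom = sum(word_counts[sense].values()) + alpha * len(vocabulary)
        log_likelihood[sense] = {
            v: math.log((word_counts[sense][v] + alpha) / denom)
            for v in vocabulary
        }
    return log_prior, log_likelihood


def disambiguate(context, log_prior, log_likelihood):
    """Return argmax over s_k of log P(s_k) + sum_j log P(v_j | s_k)."""
    def score(sense):
        # Assumption of this sketch: words never seen in training are skipped.
        return log_prior[sense] + sum(
            log_likelihood[sense][v]
            for v in context
            if v in log_likelihood[sense]
        )
    return max(log_prior, key=score)


# Hypothetical labeled contexts for the ambiguous word "drug".
training = [
    ("medication", ["prices", "prescription", "patent", "consumer"]),
    ("medication", ["prescription", "increase", "prices"]),
    ("illegal substance", ["abuse", "illicit", "cocaine"]),
    ("illegal substance", ["alcohol", "abuse", "paraphernalia"]),
]
log_prior, log_likelihood = train_naive_bayes(training)
print(disambiguate(["prescription", "prices", "rose"], log_prior, log_likelihood))
print(disambiguate(["cocaine", "abuse"], log_prior, log_likelihood))
```

Working in log space avoids numerical underflow when the context window contains many words.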
10. Supervised Disambiguation: Bayesian Classification IV
- Clues for the two senses of "drug":

Sense              Clues for sense
medication         prices, prescription, patent, increase, consumer
illegal substance  abuse, paraphernalia, illicit, alcohol, cocaine
11. Supervised Disambiguation: An Information-Theoretic Approach
- (Brown et al., 1991)'s idea: find a single contextual feature that reliably indicates which sense of the ambiguous word is being used.
- The Flip-Flop algorithm is used to disambiguate between the different senses of a word, using mutual information as a measure.
- $I(X;Y) = \sum_{x \in X} \sum_{y \in Y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
- The algorithm works by searching for a partition of senses that maximizes the mutual information. The algorithm stops when the increase becomes insignificant.
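As a small illustration of the mutual information measure, the sketch below estimates $I(X;Y)$ from observed (x, y) pairs using relative frequencies. The slide leaves the base of the logarithm open; base 2 (bits) is an assumption here, and the sample pairs are invented.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Estimate I(X;Y) in bits from a list of observed (x, y) pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        p_x = x_counts[x] / n
        p_y = y_counts[y] / n
        mi += p_xy * math.log2(p_xy / (p_x * p_y))
    return mi


# The indicator fully determines the translation: 1 bit of information.
perfect = [("take", "mesure"), ("make", "décision")] * 5
# The indicator says nothing about the translation: 0 bits.
independent = [
    ("take", "mesure"), ("take", "décision"),
    ("make", "mesure"), ("make", "décision"),
]
print(mutual_information(perfect))
print(mutual_information(independent))
```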
12. Flip-Flop algorithm
- $t_1, \dots, t_m$ are translations of an ambiguous word, and $x_1, \dots, x_n$ are possible values of the indicator.
- find a random partition $P = \{P_1, P_2\}$ of $\{t_1, \dots, t_m\}$
- while (there is a significant improvement) do
  - find the partition $Q = \{Q_1, Q_2\}$ of indicators $\{x_1, \dots, x_n\}$ that maximizes $I(P;Q)$
  - find the partition $P = \{P_1, P_2\}$ of translations $\{t_1, \dots, t_m\}$ that maximizes $I(P;Q)$
- end
13. Flip-Flop: example
- Suppose we want to translate "prendre" based on its object and have $\{t_1, \dots, t_m\}$ = {take, make, rise, speak} and $\{x_1, \dots, x_n\}$ = {mesure, note, exemple, décision, parole}, and that "prendre" is translated as "take" when it occurs with the objects mesure, note, and exemple, and otherwise as "make", "rise", or "speak".
- Suppose the initial partition is $P_1$ = {take, rise} and $P_2$ = {make, speak}. Then choose the partition $Q$ of indicator values that maximizes $I(P;Q)$, say $Q_1$ = {mesure, note, exemple} and $Q_2$ = {décision, parole} (selected because this division gives us the most information for distinguishing translations in $P_1$ from translations in $P_2$).
- With this partition, "prendre la parole" is not translated as "rise to speak" when it should be. So repartition $P$ as $P_1$ = {take} and $P_2$ = {rise, make, speak}, with $Q$ as previously. This is always correct for the "take" sense.
- To distinguish among the others, we would have to consider more than two senses.
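The Flip-Flop loop can be traced on the "prendre" example with a short sketch. Two assumptions are made loudly here: the co-occurrence counts are invented for illustration, and the best partition is found by brute-force enumeration rather than by the efficient search used by Brown et al. The names `two_way_partitions`, `flip_flop`, and `block_of` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(counts, group_t, group_x):
    """I(P;Q) in bits; group_t and group_x map each item to block 0 or 1."""
    total = sum(counts.values())
    joint, p_marg, q_marg = Counter(), Counter(), Counter()
    for (t, x), c in counts.items():
        joint[(group_t[t], group_x[x])] += c
        p_marg[group_t[t]] += c
        q_marg[group_x[x]] += c
    return sum(
        (c / total) * math.log2(c * total / (p_marg[p] * q_marg[q]))
        for (p, q), c in joint.items()
    )


def two_way_partitions(items):
    """Yield every split of items into two non-empty blocks (item -> 0/1)."""
    items = sorted(items)
    first, rest = items[0], items[1:]
    for size in range(len(rest)):  # block 0 always contains `first`
        for chosen in combinations(rest, size):
            block0 = {first, *chosen}
            yield {item: 0 if item in block0 else 1 for item in items}


def flip_flop(counts, initial_p, tolerance=1e-9):
    """Alternate between re-partitioning indicators (Q) and translations (P)."""
    translations = {t for t, _ in counts}
    indicators = {x for _, x in counts}
    p, best = dict(initial_p), -1.0
    while True:
        q = max(two_way_partitions(indicators),
                key=lambda cand: mutual_information(counts, p, cand))
        p = max(two_way_partitions(translations),
                key=lambda cand: mutual_information(counts, cand, q))
        score = mutual_information(counts, p, q)
        if score - best <= tolerance:  # no significant improvement
            return p, q, score
        best = score


def block_of(partition, member):
    """Return the sorted block of the partition that contains `member`."""
    return sorted(i for i, b in partition.items() if b == partition[member])


# Hypothetical (translation, object) co-occurrence counts.
counts = {
    ("take", "mesure"): 10, ("take", "note"): 8, ("take", "exemple"): 6,
    ("make", "décision"): 9, ("rise", "parole"): 4, ("speak", "parole"): 3,
}
initial_p = {"take": 0, "rise": 0, "make": 1, "speak": 1}
p, q, score = flip_flop(counts, initial_p)
print(block_of(p, "take"), block_of(q, "mesure"), round(score, 3))
```

On these toy counts the loop ends with "take" alone in one block and mesure, note, exemple in one block of indicators, mirroring the repartition described in the example. Exhaustive enumeration is only feasible for tiny sets like these.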
14. Dictionary-Based Disambiguation: Overview
- We will be looking at three different methods:
  - Disambiguation based on sense definitions
  - Thesaurus-Based Disambiguation
  - Disambiguation based on translations in a second-language corpus
- Also, we will show how a careful examination of the distributional properties of senses can lead to significant improvements in disambiguation.
15. Disambiguation based on sense definitions
- (Lesk, 1986)'s idea: a word's dictionary definitions are likely to be good indicators for the senses they define.
- Express each dictionary sub-definition of the ambiguous word as a bag of words, and express the words occurring in the context of the ambiguous word as a single bag of words drawn from their dictionary definitions (all pooled together).
- Disambiguate the ambiguous word by choosing the sub-definition of the ambiguous word that has the greatest overlap with the words occurring in its context.
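A minimal sketch of this overlap computation follows. The dictionary entries, the stopword list, and the function name `lesk` are all invented for illustration; a real system would read definitions from an actual dictionary and would likely weight or normalize the overlap.

```python
def lesk(sense_definitions, context_words, dictionary, stopwords=frozenset()):
    """Pick the sense whose definition overlaps most with the pooled
    dictionary definitions of the context words."""
    # Pool the definitions of every context word into one bag
    # (represented as a set in this sketch).
    context_bag = set()
    for word in context_words:
        for definition in dictionary.get(word, []):
            context_bag.update(definition.lower().split())
    context_bag -= stopwords

    def overlap(sense):
        definition_bag = set(sense_definitions[sense].lower().split()) - stopwords
        return len(definition_bag & context_bag)

    return max(sense_definitions, key=overlap)


# Hypothetical mini-dictionary.
ash_senses = {
    "tree": "a tree of the olive family",
    "residue": "the solid residue left when combustible material is burned",
}
dictionary = {
    "cigar": ["a roll of tobacco leaves that is burned and smoked"],
    "leaf": ["a flat green part of a tree or plant"],
}
stopwords = frozenset({"a", "the", "of", "is", "that", "and", "when", "or"})

print(lesk(ash_senses, ["cigar"], dictionary, stopwords))
print(lesk(ash_senses, ["leaf"], dictionary, stopwords))
```

Removing stopwords matters in practice: otherwise function words such as "a" and "of" dominate the overlap counts.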
16. Thesaurus-Based Disambiguation
- Idea: the semantic categories of the words in a context determine the semantic category of the context as a whole. This category, in turn, determines which word senses are used.
- (Walker, 1987): each word is assigned one or more subject codes, which correspond to its different meanings. For each subject code, we count the number of words (from the context) having the same subject code. We select the subject code corresponding to the highest count.
- (Yarowsky, 1992) adapted the algorithm for words that do not occur in the thesaurus but that are very informative, e.g., Navratilova -> Sports.
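The subject-code counting attributed to Walker can be sketched as follows. The thesaurus entries and subject codes are hypothetical, and the function name `walker_disambiguate` is introduced only for this illustration.

```python
def walker_disambiguate(sense_codes, context_words, thesaurus):
    """Count, for each subject code of the ambiguous word, how many context
    words carry the same code, and return the code with the highest count."""
    scores = {code: 0 for code in sense_codes}
    for word in context_words:
        for code in thesaurus.get(word, ()):
            if code in scores:
                scores[code] += 1
    best = max(sense_codes, key=lambda code: scores[code])
    return best, scores


# Hypothetical thesaurus: word -> set of subject codes.
thesaurus = {
    "money": {"FINANCE"},
    "loan": {"FINANCE"},
    "deposit": {"FINANCE", "GEOGRAPHY"},
    "river": {"GEOGRAPHY"},
    "water": {"GEOGRAPHY"},
}
bank_codes = ["FINANCE", "GEOGRAPHY"]

print(walker_disambiguate(bank_codes, ["loan", "money", "deposit"], thesaurus))
print(walker_disambiguate(bank_codes, ["river", "water"], thesaurus))
```

Context words missing from the thesaurus contribute nothing, which is the gap the Yarowsky adaptation addresses.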
17. Disambiguation based on translations in a second-language corpus
- (Dagan & Itai, 1991)'s idea: words can be disambiguated by looking at how they are translated in other languages.
- Example: the word "interest" has two translations in German: 1) Beteiligung (legal share, e.g., a 50% interest in the company); 2) Interesse (attention, concern, e.g., her interest in Mathematics).
- To disambiguate the word "interest", we identify the phrase it occurs in, search a German corpus for instances of the phrase, and assign the meaning associated with the German use of the word in that phrase.
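The lookup step can be sketched with a table of German phrase counts. Everything below is an assumption made for illustration: the counts are invented, the partner verbs ("zeigen", "erwerben") are supplied by hand rather than by a translation lexicon, and the sketch simply picks the most frequent candidate instead of applying any statistical test.

```python
def disambiguate_by_translation(candidates, partner_translation, pair_counts):
    """candidates maps each German translation of the ambiguous word to a sense.
    Return the sense of the translation seen most often with the partner word,
    or None if no candidate was ever seen with it."""
    def count(german_word):
        return pair_counts.get((partner_translation, german_word), 0)

    best = max(candidates, key=count)
    return candidates[best] if count(best) > 0 else None


# German translations of "interest" and the senses they signal.
candidates = {
    "Beteiligung": "legal share",
    "Interesse": "attention, concern",
}
# Hypothetical counts of (verb, object) pairs in a German corpus.
pair_counts = {
    ("zeigen", "Interesse"): 12,
    ("erwerben", "Beteiligung"): 7,
}

# "show interest": the verb translates to "zeigen".
print(disambiguate_by_translation(candidates, "zeigen", pair_counts))
# "acquire an interest": the verb translates to "erwerben".
print(disambiguate_by_translation(candidates, "erwerben", pair_counts))
```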
18. One sense per discourse, one sense per collocation
- (Yarowsky, 1995)'s idea: there are constraints between different occurrences of an ambiguous word within a corpus that can be exploited for disambiguation:
  - One sense per discourse: the sense of a target word is highly consistent within any given document.
  - One sense per collocation: nearby words provide strong and consistent clues to the sense of a target word, conditional on relative distance, order, and syntactic relationship.
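One simple way to exploit the one-sense-per-discourse constraint, sketched here as an assumption rather than as Yarowsky's full procedure, is to relabel every occurrence of the target word in a document with the majority sense assigned in that document. The sense labels are hypothetical.

```python
from collections import Counter


def apply_one_sense_per_discourse(document_labels):
    """Relabel all occurrences in one document with the majority sense.
    None marks an occurrence the first-pass classifier left unlabeled."""
    labeled = [sense for sense in document_labels if sense is not None]
    if not labeled:
        return list(document_labels)
    majority, _ = Counter(labeled).most_common(1)[0]
    return [majority] * len(document_labels)


# First-pass labels for four occurrences of "plant" in one document.
first_pass = ["living", None, "living", "factory"]
print(apply_one_sense_per_discourse(first_pass))
```

The unlabeled occurrence and the outlier both inherit the document's dominant sense.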
19. Unsupervised Disambiguation
- Idea: disambiguate word senses without having recourse to supporting tools such as dictionaries and thesauri and in the absence of labeled text. Simply cluster the contexts of an ambiguous word into a number of groups and discriminate between these groups without labeling them.
- (Schütze, 1998): the probabilistic model is the same Bayesian model as the one used for supervised classification, but the $P(v_j \mid s_k)$ are estimated using the EM algorithm.
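20. EM algorithm
- Initialize the parameters $\mu$ of the model. These are $P(v_j \mid s_k)$ and $P(s_k)$, for $j = 1, 2, \dots, J$ and $k = 1, 2, \dots, K$.
- Compute the log likelihood of the corpus $C$ given the model $\mu$: $l(C \mid \mu) = \log \prod_{i=1}^{I} \sum_{k=1}^{K} P(c_i \mid s_k) P(s_k)$
- While $l(C \mid \mu)$ increases, repeat:
  - E-step: $h_{ik} = \frac{P(c_i \mid s_k) P(s_k)}{\sum_{k'=1}^{K} P(c_i \mid s_{k'}) P(s_{k'})}$ (use Naive Bayes to compute $P(c_i \mid s_k) = \prod_{v_j \in c_i} P(v_j \mid s_k)$)
  - M-step: re-estimate the parameters $P(v_j \mid s_k)$ and $P(s_k)$ by MLE:
    - $P(v_j \mid s_k) = \frac{\sum_{c_i : v_j \in c_i} h_{ik}}{Z_k}$, where the sum is over all contexts $c_i$ in which $v_j$ occurs and $Z_k$ is a normalizing constant that makes the probabilities for sense $s_k$ sum to 1.
    - $P(s_k) = \frac{\sum_{i=1}^{I} h_{ik}}{\sum_{k'=1}^{K} \sum_{i=1}^{I} h_{ik'}} = \frac{\sum_{i=1}^{I} h_{ik}}{I}$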
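The E-step and M-step can be sketched in plain Python. This is a toy illustration under stated assumptions, not Schütze's system: contexts are treated as sets of words, the parameters are initialized randomly from a fixed seed, a small smoothing constant (hypothetical, not on the slide) keeps probabilities above zero, and the toy contexts are invented. EM only finds a local optimum, so the clusters found depend on the initialization.

```python
import math
import random


def em_sense_clustering(contexts, num_senses, iterations=50, seed=0, smoothing=1e-3):
    """Estimate P(s_k) and P(v_j | s_k) from unlabeled contexts with EM."""
    rng = random.Random(seed)
    contexts = [sorted(set(c)) for c in contexts]
    vocabulary = sorted({v for c in contexts for v in c})

    # Random initialization of P(v_j | s_k); uniform P(s_k).
    word_prob = []
    for _ in range(num_senses):
        weights = [rng.random() + 0.1 for _ in vocabulary]
        total = sum(weights)
        word_prob.append({v: w / total for v, w in zip(vocabulary, weights)})
    prior = [1.0 / num_senses] * num_senses

    previous = -math.inf
    h = []
    for _ in range(iterations):
        # E-step: h[i][k] = P(c_i | s_k) P(s_k) / sum_k' P(c_i | s_k') P(s_k')
        h, log_likelihood = [], 0.0
        for context in contexts:
            joint = [
                prior[k] * math.prod(word_prob[k][v] for v in context)
                for k in range(num_senses)
            ]
            evidence = sum(joint)
            log_likelihood += math.log(evidence)
            h.append([j / evidence for j in joint])
        if log_likelihood - previous < 1e-9:  # likelihood stopped increasing
            break
        previous = log_likelihood

        # M-step: re-estimate P(v_j | s_k) and P(s_k) from the posteriors.
        for k in range(num_senses):
            counts = {v: smoothing for v in vocabulary}
            for i, context in enumerate(contexts):
                for v in context:
                    counts[v] += h[i][k]
            z_k = sum(counts.values())  # normalizing constant for sense k
            word_prob[k] = {v: c / z_k for v, c in counts.items()}
            prior[k] = sum(row[k] for row in h) / len(contexts)
    return prior, word_prob, h


# Hypothetical contexts of the ambiguous word "bank".
contexts = [
    ["money", "loan"], ["loan", "deposit"], ["money", "deposit"],
    ["river", "water"], ["water", "shore"], ["river", "shore"],
]
prior, word_prob, h = em_sense_clustering(contexts, num_senses=2)
for context, posteriors in zip(contexts, h):
    print(context, [round(p, 3) for p in posteriors])
```

Each row printed is the posterior $h_{ik}$ over the two unlabeled clusters for one context. For long contexts the product of probabilities should be replaced by a sum of logs to avoid underflow.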
21. Disambiguation
- Once the model parameters have been estimated, a word w can be disambiguated by computing the probability of each sense given the words $v_j$ in the context.
- Again we use the Naive Bayes assumption:
  - Decide $s' = \arg\max_{s_k} \left[ \log P(s_k) + \sum_{v_j \in C} \log P(v_j \mid s_k) \right]$
22. Performance of unsupervised disambiguation
- Unsupervised disambiguation is capable of identifying minute differences in senses, e.g., "bank" in a physical sense and in an abstract sense.
- Usually the clusters obtained are not identical with dictionary senses.
- Results of unsupervised disambiguation (Schütze, 1998):

Word    Sense                   Mean accuracy (%)
suit    lawsuit                 95
        garment                 96
motion  physical movement       85
        proposal for action     88
train   line of railroad cars   79
        teach                   55