Title: Advances in Word Sense Disambiguation
1Advances in Word Sense Disambiguation
- Tutorial at ACL 2005
- June 25, 2005
- Ted Pedersen
- University of Minnesota, Duluth
- http://www.d.umn.edu/~tpederse
- Rada Mihalcea
- University of North Texas
- http://www.cs.unt.edu/~rada
2Goal of the Tutorial
- Introduce the problem of word sense
disambiguation (WSD), focusing on the range of
formulations and approaches currently practiced.
- Accessible to anyone with an interest in NLP.
- Persuade you to work on word sense disambiguation
- It's an interesting problem
- Lots of good work already done, still more to do
- There is infrastructure to help you get started
- Persuade you to use word sense disambiguation in
your text applications.
3Outline of Tutorial
- Introduction (Ted)
- Methodology (Rada)
- Knowledge Intensive Methods (Rada)
- Supervised Approaches (Ted)
- Minimally Supervised Approaches (Rada) / BREAK
- Unsupervised Learning (Ted)
- How to Get Started (Rada)
- Conclusion (Ted)
4Part 1: Introduction
5Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
6Definitions
- Word sense disambiguation is the problem of
selecting a sense for a word from a set of
predefined possibilities.
- The sense inventory usually comes from a dictionary or thesaurus.
- Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches
- Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
- Unsupervised techniques
7Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
8Computers versus Humans
- Polysemy: most words have many possible meanings.
- A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human.
- Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases.
9Ambiguity for Humans - Newspaper Headlines!
- DRUNK GETS NINE YEARS IN VIOLIN CASE
- FARMER BILL DIES IN HOUSE
- PROSTITUTES APPEAL TO POPE
- STOLEN PAINTING FOUND BY TREE
- RED TAPE HOLDS UP NEW BRIDGE
- DEER KILL 300,000
- RESIDENTS CAN DROP OFF TREES
- INCLUDE CHILDREN WHEN BAKING COOKIES
- MINERS REFUSE TO WORK AFTER DEATH
10Ambiguity for a Computer
- The fisherman jumped off the bank and into the water.
- The bank down the street was robbed!
- Back in the day, we had an entire bank of computers devoted to this problem.
- The bank in that road is entirely too steep and is really dangerous.
- The plane took a bank to the left, and then headed off towards the mountains.
11Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
12Early Days of WSD
- Noted as a problem for Machine Translation (Weaver, 1949)
- A word can often only be translated if you know the specific sense intended (a bill in English could be a pico or a cuenta in Spanish)
- Bar-Hillel (1960) posed the following:
- Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
- Is pen a writing instrument or an enclosure where children play?
- He declared it unsolvable and left the field of MT!
13Since then
- 1970s - 1980s
- Rule based systems
- Rely on hand crafted knowledge sources
- 1990s
- Corpus based approaches
- Dependence on sense tagged text
- (Ide and Véronis, 1998) overview the history from the early days to 1998.
- 2000s
- Hybrid systems
- Minimizing or eliminating the use of sense-tagged text
- Taking advantage of the Web
14Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Interdisciplinary Connections
- Practical Applications
15Interdisciplinary Connections
- Cognitive Science & Psychology
- Quillian (1968), Collins and Loftus (1975): spreading activation
- Hirst (1987) developed a marker passing model
- Linguistics
- Fodor & Katz (1963): selectional preferences
- Resnik (1993) pursued these statistically
- Philosophy of Language
- Wittgenstein (1958): meaning as use
- "For a large class of cases - though not for all - in which we employ the word 'meaning' it can be defined thus: the meaning of a word is its use in the language."
16Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
17Practical Applications
- Machine Translation
- Translate bill from English to Spanish
- Is it a pico or a cuenta?
- Is it a bird jaw or an invoice?
- Information Retrieval
- Find all Web Pages about cricket
- The sport or the insect?
- Question Answering
- What is George Miller's position on gun control?
- The psychologist or US congressman?
- Knowledge Acquisition
- Add to KB: Herb Bergson is the mayor of Duluth.
- Minnesota or Georgia?
18References
- (Bar-Hillel, 1960) The Present Status of
Automatic Translations of Languages. In Advances
in Computers. Volume 1. Alt, F. (editor).
Academic Press, New York, NY. pp 91-163. - (Collins and Loftus, 1975) A Spreading Activation
Theory of Semantic Memory. Psychological Review,
(82) pp. 407-428. - (Fodor and Katz, 1963) The structure of semantic
theory. Language (39). pp 170-210. - (Hirst, 1987) Semantic Interpretation and the
Resolution of Ambiguity. Cambridge University
Press. - (Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24) pp 1-40.
- (Quillian, 1968) Semantic Memory. In Semantic
Information Processing. Minsky, M. (editor). The
MIT Press, Cambridge, MA. pp. 227-270. - (Resnik, 1993) Selection and Information: A
Class-Based Approach to Lexical Relationships.
Ph.D. Dissertation. University of Pennsylvania. - (Weaver, 1949) Translation. In Machine
Translation of Languages: fourteen essays. Locke,
W.N. and Booth, A.D. (editors) The MIT Press,
Cambridge, Mass. pp. 15-23. - (Wittgenstein, 1958) Philosophical
Investigations, 3rd edition. Translated by G.E.M.
Anscombe. Macmillan Publishing Co., New York.
19Part 2: Methodology
20Outline
- General considerations
- All-words disambiguation
- Targeted-words disambiguation
- Word sense discrimination, sense discovery
- Evaluation (granularity, scoring)
21Overview of the Problem
- Many words have several meanings (homonymy /
polysemy) - Determine which sense of a word is used in a
specific sentence
- Note:
- often, the different senses of a word are closely
related
- Ex: title - (a) right of legal ownership; (b) document that is evidence of the legal ownership
- sometimes, several senses can be activated in a single context (co-activation)
- Ex: This could bring competition to the trade
- competition - (a) the act of competing; (b) the people who are competing
- Ex: chair - furniture or person
- Ex: child - young person or human offspring
22Word Senses
- The meaning of a word in a given context
- Word sense representations
- With respect to a dictionary
- chair: a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
- chair: the position of professor; "he was awarded an endowed chair in economics"
- With respect to the translation in a second language
- chair → chaise
- chair → directeur
- With respect to the context where it occurs (discrimination)
- Sit on a chair / Take a seat on this chair
- The chair of the Math Department / The chair of the meeting
23Approaches to Word Sense Disambiguation
- Knowledge-Based Disambiguation
- use of external lexical resources such as
dictionaries and thesauri
- discourse properties
- Supervised Disambiguation
- based on a labeled training set
- the learning system has
- a training set of feature-encoded inputs AND
- their appropriate sense label (category)
- Unsupervised Disambiguation
- based on unlabeled corpora
- The learning system has
- a training set of feature-encoded inputs BUT
- NOT their appropriate sense label (category)
24All Words Word Sense Disambiguation
- Attempt to disambiguate all open-class words in a
text
- He put his suit over the back of the chair
- Knowledge-based approaches
- Use information from dictionaries
- Definitions / Examples for each meaning
- Find similarity between definitions and current
context
- Position in a semantic network
- Find that table is closer to chair/furniture than to chair/person
- Use discourse properties
- A word exhibits the same sense in a discourse /
in a collocation
25All Words Word Sense Disambiguation
- Minimally supervised approaches
- Learn to disambiguate words using small annotated
corpora
- E.g. SemCor, a corpus where all open-class words are disambiguated
- 200,000 running words
- Most frequent sense
26Targeted Word Sense Disambiguation
- Disambiguate one target word
- Take a seat on this chair
- The chair of the Math Department
- WSD is viewed as a typical classification problem
- use machine learning techniques to train a system
- Training
- Corpus of occurrences of the target word, each
occurrence annotated with the appropriate sense
- Build feature vectors
- a vector of relevant linguistic features that represents the context (e.g., a window of words around the target word)
- Disambiguation
- Disambiguate the target word in new unseen text
27Targeted Word Sense Disambiguation
- Take a window of n words around the target word
- Encode information about the words around the
target word - typical features include words, root forms, POS
tags, frequency, etc.
- An electric guitar and bass player stand off to
one side, not really part of the scene, just as a
sort of nod to gringo expectations perhaps. - Surrounding context (local features)
- (guitar, NN1), (and, CJC), (player, NN1),
(stand, VVB) - Frequent co-occurring words (topical features)
- fishing, big, sound, player, fly, rod, pound,
double, runs, playing, guitar, band - 0,0,0,1,0,0,0,0,0,0,1,0
- Other features
- followed by "player", contains "show" in the
sentence, etc.
- yes, no, ...
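To make the encoding concrete, here is a minimal sketch (not from the tutorial; the function name and toy sentence are illustrative) of turning the topical features above into a binary vector:

TOPICAL_WORDS = ["fishing", "big", "sound", "player", "fly", "rod",
                 "pound", "double", "runs", "playing", "guitar", "band"]

def encode_topical(context):
    # mark which topical words occur in the context (1) or not (0)
    tokens = set(context.lower().split())
    return [1 if w in tokens else 0 for w in TOPICAL_WORDS]

sentence = "An electric guitar and bass player stand off to one side"
print(encode_topical(sentence))  # -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]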
28Unsupervised Disambiguation
- Disambiguate word senses
- without supporting tools such as dictionaries and
thesauri
- without a labeled training text
- Without such resources, word senses are not labeled
- We cannot say chair/furniture or chair/person
- We can:
- Cluster/group the contexts of an ambiguous word into a number of groups
- Discriminate between these groups without
actually labeling them
29Unsupervised Disambiguation
- Hypothesis: the same senses of words will have
similar neighboring words - Disambiguation algorithm
- Identify context vectors corresponding to all
occurrences of a particular word - Partition them into regions of high density
- Assign a sense to each such region
- Sit on a chair
- Take a seat on this chair
- The chair of the Math Department
- The chair of the meeting
30Evaluating Word Sense Disambiguation
- Metrics
- Precision: percentage of words that are tagged correctly, out of the words addressed by the system
- Recall: percentage of words that are tagged correctly, out of all words in the test set
- Example:
- Test set of 100 words; the system attempts 75 words; 50 words are correctly disambiguated
- Precision = 50 / 75 = 0.66
- Recall = 50 / 100 = 0.50
- Special tags are possible
- Unknown
- Proper noun
- Multiple senses
- Compare to a gold standard
- SemCor corpus, Senseval corpus, etc.
31Evaluating Word Sense Disambiguation
- Difficulty in evaluation
- Nature of the senses to distinguish has a huge
impact on results - Coarse versus fine-grained sense distinction
- chair: a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
- chair: the position of professor; "he was awarded an endowed chair in economics"
- bank: a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home"
- bank: a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon"
- Sense maps
- Cluster similar senses
- Allow for both fine-grained and coarse-grained
evaluation
32Bounds on Performance
- Upper and Lower Bounds on Performance
- Measure of how well an algorithm performs
relative to the difficulty of the task.
- Upper Bound
- Human performance
- Around 97-99% with few and clearly distinct senses
- Inter-judge agreement
- With words with clear, distinct senses: 95% and up
- With polysemous words with related senses: 65-70%
- Lower Bound (or baseline)
- The assignment of a random sense / the most frequent sense
- 90% is excellent for a word with 2 equiprobable senses
- 90% is trivial for a word with 2 senses with probability ratios of 9 to 1
33References
- (Gale, Church and Yarowsky 1992) Gale, W.,
Church, K., and Yarowsky, D. Estimating upper and
lower bounds on the performance of word-sense
disambiguation programs ACL 1992. - (Miller et. al., 1994) Miller, G., Chodorow, M.,
Landes, S., Leacock, C., and Thomas, R. Using a
semantic concordance for sense identification.
ARPA Workshop 1994. - (Miller, 1995) Miller, G. Wordnet A lexical
database. ACM, 38(11) 1995. - (Senseval) Senseval evaluation exercises
http//www.senseval.org
34Part 3: Knowledge-based Methods for Word Sense
Disambiguation
35Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
36Task Definition
- Knowledge-based WSD: the class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text
- Resources
- Yes
- Machine Readable Dictionaries
- Raw corpora
- No
- Manually annotated corpora
- Scope
- All open-class words
37Machine Readable Dictionaries
- In recent years, most dictionaries have been made available in Machine Readable format (MRD)
- Oxford English Dictionary
- Collins
- Longman Dictionary of Contemporary English (LDOCE)
- Thesauri add synonymy information
- Roget's Thesaurus
- Semantic networks add more semantic relations
- WordNet
- EuroWordNet
38MRD A Resource for Knowledge-based WSD
- For each word in the language vocabulary, an MRD
provides:
- A list of meanings
- Definitions (for all word meanings)
- Typical usage examples (for most word meanings)
39MRD A Resource for Knowledge-based WSD
- A thesaurus adds
- An explicit synonymy relation between word
meanings - A semantic network adds
- Hypernymy/hyponymy (IS-A), meronymy/holonymy
(PART-OF), antonymy, entailment, etc.
WordNet synsets for the noun plant: 1. plant, works, industrial plant; 2. plant, flora, plant life
WordNet related concepts for the meaning plant life (plant, flora, plant life):
hypernym: organism, being
hyponym: house plant, fungus, ...
meronym: plant tissue, plant part
holonym: Plantae, kingdom Plantae, plant kingdom
40Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
41Lesk Algorithm
- (Michael Lesk 1986): identify senses of words in context using definition overlap
- Algorithm:
- Retrieve from MRD all sense definitions of the
words to be disambiguated - Determine the definition overlap for all possible
sense combinations
- Choose the senses that lead to the highest overlap
- Example: disambiguate PINE CONE
- PINE
- 1. kinds of evergreen tree with needle-shaped
leaves
- 2. waste away through sorrow or illness
- CONE
- 1. solid body which narrows to a point
- 2. something of this shape whether solid or
hollow
- 3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0
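A minimal sketch of this overlap computation (assumptions: naive tokenization, a tiny stopword list, and crude plural stripping instead of real stemming; a proper stemmer would also find the shape/needle-shaped match that the slide counts for Pine#1 ∩ Cone#2):

STOP = {"of", "or", "and", "the", "a", "to", "which", "this", "whether", "with"}

def content_words(definition):
    # lowercase, split, strip trailing plural 's', drop stopwords
    return {t.rstrip("s") for t in definition.lower().replace("-", " ").split()
            if t not in STOP}

def overlap(def1, def2):
    return len(content_words(def1) & content_words(def2))

pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
cone = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}
best = max(((p, c) for p in pine for c in cone),
           key=lambda pc: overlap(pine[pc[0]], cone[pc[1]]))
print(best)  # -> (1, 3): pine#1 and cone#3 share "evergreen" and "tree(s)"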
42Lesk Algorithm for More than Two Words?
- I saw a man who is 98 years old and can still
walk and tell jokes
- nine open-class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) (number of senses in parentheses)
- 43,929,600 sense combinations! How to find the optimal sense combination?
- Simulated annealing (Cowie, Guthrie, Guthrie 1992)
- Define a function E over the combination of word senses in a given text
- Find the combination of senses that leads to the highest definition overlap (redundancy)
- 1. Start with E = the most frequent sense for each word
- 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E
- 3. Stop iterating when there is no change in the configuration of senses
43Lesk Algorithm A Simplified Version
- Original Lesk: measure the overlap between sense definitions for all words in context
- Identify simultaneously the correct senses for all words in context
- Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure the overlap between the sense definitions of a word and the current context
- Identify the correct sense for one word at a time
- Search space significantly reduced
44Lesk Algorithm A Simplified Version
- Algorithm for simplified Lesk
- Retrieve from MRD all sense definitions of the
word to be disambiguated
- Determine the overlap between each sense definition and the current context
- Choose the sense that leads to the highest overlap
- Example: disambiguate PINE in
- Pine cones hanging in a tree
- PINE
- 1. kinds of evergreen tree with needle-shaped
leaves
- 2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1    Pine#2 ∩ Sentence = 0
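The same idea as a small self-contained sketch (simplified Lesk over one target word; the tokenization and stopword list are simplistic assumptions, not the tutorial's code):

STOP = {"a", "in", "of", "or", "with", "through"}

def simplified_lesk(sense_definitions, context):
    # score each sense by its definition's overlap with the context words
    ctx = {w for w in context.lower().split() if w not in STOP}
    def score(sense):
        defn = {w for w in sense_definitions[sense].lower().split()
                if w not in STOP}
        return len(ctx & defn)
    return max(sense_definitions, key=score)

pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
print(simplified_lesk(pine, "Pine cones hanging in a tree"))  # -> 1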
45Evaluations of Lesk Algorithm
- Initial evaluation by M. Lesk
- 50-70% on short samples of manually annotated text, with respect to the Oxford Advanced Learner's Dictionary
- Simulated annealing
- 47% on 50 manually annotated sentences
- Evaluation on Senseval-2 all-words data, with back-off to a random sense (Mihalcea & Tarau 2004)
- Original Lesk: 35%
- Simplified Lesk: 47%
- Evaluation on Senseval-2 all-words data, with back-off to the most frequent sense (Vasilescu, Langlais, Lapalme 2004)
- Original Lesk: 42%
- Simplified Lesk: 58%
46Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Preferences
- Measures of Semantic Similarity
- Heuristic-based Methods
47Selectional Preferences
- A way to constrain the possible meanings of words
in a given context
- E.g. Wash a dish vs. Cook a dish
- WASH-OBJECT vs. COOK-FOOD
- Capture information about possible relations between semantic classes
- Common-sense knowledge
- Alternative terminology
- Selectional Restrictions
- Selectional Preferences
- Selectional Constraints
48Acquiring Selectional Preferences
- From annotated corpora
- Circular relationship with the WSD problem
- Need WSD to build the annotated corpus
- Need selectional preferences to derive WSD
- From raw corpora
- Frequency counts
- Information theory measures
- Class-to-class relations
49Preliminaries: Learning Word-to-Word Relations
- An indication of the semantic fit between two
words - 1. Frequency counts
- Pairs of words connected by a syntactic relation
- 2. Conditional probabilities
- Condition on one of the words
50Learning Selectional Preferences (1)
- Word-to-class relations (Resnik 1993)
- Quantify the contribution of a semantic class
using all the concepts subsumed by that class
- where the selectional association of a verb v with a class c can be defined as A(v,c) = (1/S(v)) P(c|v) log(P(c|v)/P(c)), with the normalizing factor S(v) = Σc P(c|v) log(P(c|v)/P(c)) (Resnik 1993)
51Learning Selectional Preferences (2)
- Determine the contribution of a word sense based
on the assumption of equal sense distributions - e.g. plant has two senses ? 50 occurences are
sense 1, 50 are sense 2 - Example learning restrictions for the verb to
drink - Find high-scoring verb-object pairs
- Find prototypical object classes (high
association score)
52Learning Selectional Preferences (3)
- Other algorithms
- Learn class-to-class relations (Agirre and Martinez, 2001)
- E.g. ingest-food is a class-to-class relation for eat-chicken
- Bayesian networks (Ciaramita and Johnson, 2000)
- Tree cut model (Li and Abe, 1998)
53Using Selectional Preferences for WSD
- Algorithm
- 1. Learn a large set of selectional preferences
for a given syntactic relation R - 2. Given a pair of words W1 W2 connected by a
relation R - 3. Find all selectional preferences W1 C
(word-to-class) or C1 C2 (class-to-class) that
apply - 4. Select the meanings of W1 and W2 based on the
selected semantic class
- Example: disambiguate coffee in drink coffee
- 1. (beverage) a beverage consisting of an
infusion of ground coffee beans
- 2. (tree) any of several small trees native to the tropical Old World
- 3. (color) a medium to dark brown color
Given the selectional preference DRINK → BEVERAGE: coffee#1
54Evaluation of Selectional Preferences for WSD
- Data set
- mainly on verb-object, subject-verb relations
extracted from SemCor
- Compare against a random baseline
- Results (Agirre and Martinez, 2000)
- Average results on 8 nouns
- Similar figures reported in (Resnik 1997)
55Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
56Semantic Similarity
- Words in a discourse must be related in meaning,
for the discourse to be coherent (Halliday and Hasan, 1976)
- Use this property for WSD: identify related meanings for words that share a common context
- Context span:
- 1. Local context: semantic similarity between pairs of words
- 2. Global context: lexical chains
57Semantic Similarity in a Local Context
- Similarity determined between pairs of concepts,
or between a word and its surrounding context
- Relies on similarity metrics on semantic networks (Rada et al. 1989)
[Figure: a fragment of the WordNet animal taxonomy, rooted at carnivore, with nodes such as bear; feline, felid; canine, canid; fissiped mammal, fissiped; wild dog (wolf, hyena, hyena dog, dingo); and dog (hunting dog, dachshund, terrier)]
58Semantic Similarity Metrics (1)
- Input: two concepts (same part of speech)
- Output: a similarity measure
- (Leacock and Chodorow 1998): Similarity(C1,C2) = -log(Path(C1,C2) / 2D), where Path(C1,C2) is the length of the shortest path between the two concepts and D is the taxonomy depth
- E.g. Similarity(wolf,dog) = 0.60; Similarity(wolf,bear) = 0.42
- (Resnik 1995)
- Define information content: IC(C) = -log P(C), where P(C) is the probability of seeing a concept of type C in a large corpus
- Probability of seeing a concept = probability of seeing instances of that concept
- Determine the contribution of a word sense based on the assumption of equal sense distributions
- e.g. plant has two senses → 50% of its occurrences are sense 1, 50% are sense 2
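These metrics are easy to experiment with today through NLTK's WordNet interface (a sketch; assumes the nltk package with its wordnet and wordnet_ic data installed, and exact scores depend on the WordNet version, so they will not match the slide's numbers):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog, wolf, bear = (wn.synset(s) for s in ("dog.n.01", "wolf.n.01", "bear.n.01"))
# Leacock-Chodorow: -log(path_length / (2 * taxonomy_depth))
print(wolf.lch_similarity(dog), wolf.lch_similarity(bear))
# Resnik: information content of the least common subsumer
ic = wordnet_ic.ic("ic-brown.dat")
print(wolf.res_similarity(dog, ic))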
59Semantic Similarity Metrics (2)
- Similarity using information content
- (Resnik 1995): define the similarity between two concepts as the information content of their Least Common Subsumer (LCS): Similarity(C1,C2) = IC(LCS(C1,C2))
- Alternatives: (Jiang and Conrath 1997)
- Other metrics
- Similarity using information content (Lin 1998)
- Similarity using gloss-based paths across
different hierarchies (Mihalcea and Moldovan
1999)
- Conceptual density measure between noun semantic hierarchies and the current context (Agirre and Rigau 1995)
- Adapted Lesk algorithm (Banerjee and Pedersen
2002)
60Semantic Similarity Metrics for WSD
- Disambiguate target words based on similarity
with one word to the left and one word to the
right
- (Patwardhan, Banerjee, Pedersen 2003)
- Evaluation:
- 1,723 ambiguous nouns from Senseval-2
- Among 5 similarity metrics, (Jiang and Conrath 1997) provides the best precision (39%)
- Example: disambiguate PLANT in "plant with flowers"
- PLANT
- 1. plant, works, industrial plant
- 2. plant, flora, plant life
- Similarity(plant#1, flower) = 0.2
- Similarity(plant#2, flower) = 1.5
→ plant#2
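A minimal sketch of this selection rule (assumes NLTK's WordNet; path_similarity stands in for the Jiang-Conrath metric used on the slide):

from nltk.corpus import wordnet as wn

def disambiguate(target, neighbor):
    # pick the target sense most similar to any sense of the neighbor
    best, best_score = None, -1.0
    for s in wn.synsets(target, pos=wn.NOUN):
        for n in wn.synsets(neighbor, pos=wn.NOUN):
            score = s.path_similarity(n) or 0.0
            if score > best_score:
                best, best_score = s, score
    return best

print(disambiguate("plant", "flower"))  # expect the flora sense of plant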
61Semantic Similarity in a Global Context
- Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)
- A lexical chain is a sequence of semantically
related words, which creates a context and
contributes to the continuity of meaning and the
coherence of a discourse
- Algorithm for finding lexical chains:
- Select the candidate words from the text. These
are words for which we can compute similarity
measures, and therefore most of the time they
have the same part of speech.
- For each such candidate word, and for each
meaning for this word, find a chain to receive
the candidate word sense, based on a semantic
relatedness measure between the concepts that are
already in the chain, and the candidate word
meaning. - If such a chain is found, insert the word in this
chain otherwise, create a new chain.
62Semantic Similarity of a Global Context
A very long train traveling along the rails with a constant velocity v in a certain direction
train: 1. public transport; 2. order, set of things; 3. piece of cloth
travel: 1. change location; 2. undergo transportation
rail: 1. a barrier; 2. a bar of steel for trains; 3. a small bird
63Lexical Chains for WSD
- Identify lexical chains in a text
- Usually target one part of speech at a time
- Identify the meaning of words based on their
membership in a lexical chain
- Evaluation:
- (Galley and McKeown 2003): lexical chains on 74 SemCor texts give 62.09%
- (Mihalcea and Moldovan 2000): on five SemCor texts, 90% precision with 60% recall
- lexical chains anchored on monosemous words
- (Okumura and Honda 1994): lexical chains on five Japanese texts give 63.4%
64Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
65Most Frequent Sense (1)
- Identify the most often used meaning and use this
meaning by default
- Word meanings exhibit a Zipfian distribution
- E.g. the distribution of word senses in SemCor
Example: plant/flora is used more often than plant/factory → annotate any instance of PLANT as plant/flora
66Most Frequent Sense (2)
- Method 1: find the most frequent sense in an annotated corpus
- Method 2: find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)
- 1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, ..., nk}, with associated similarity scores {dss(w,n1), dss(w,n2), ..., dss(w,nk)}
- 2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score
- 3. Rank the senses wsi of w based on the total similarity score
67Most Frequent Sense (3)
- Word senses:
- pipe #1: tobacco pipe
- pipe #2: tube of metal or plastic
- Distributionally similar words:
- N = tube, cable, wire, tank, hole, cylinder, fitting, tap, ...
- For each word in N, find the similarity with pipe#i (using the sense that maximizes the similarity)
- pipe#1 - tube (sense 3): 0.3
- pipe#2 - tube (sense 1): 0.6
- Compute a score for each sense pipe#i
- score(pipe#1) = 0.25
- score(pipe#2) = 0.73
- Note: results depend on the corpus used to find distributionally similar words → can find domain-specific predominant senses
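A small sketch of this ranking with made-up numbers in the spirit of the pipe example (dss = distributional similarity of a neighbor to pipe; wnss = the maximal WordNet similarity between a pipe sense and any sense of the neighbor; neither table is from the tutorial):

neighbors = {"tube": 0.4, "cable": 0.3, "wire": 0.25}          # dss(w, nj)
wnss = {"pipe#1": {"tube": 0.3, "cable": 0.1, "wire": 0.1},
        "pipe#2": {"tube": 0.6, "cable": 0.5, "wire": 0.5}}

def prevalence(sense):
    total = 0.0
    for nj, dss in neighbors.items():
        norm = sum(wnss[s][nj] for s in wnss)  # normalize over senses of w
        total += dss * wnss[sense][nj] / norm
    return total

for sense in wnss:
    print(sense, round(prevalence(sense), 2))  # pipe#2 (0.73) outranks pipe#1 (0.23)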
68One Sense Per Discourse
- A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
- What does this mean?
- Evaluation:
- 8 words with two-way ambiguity, e.g. plant, crane, etc.
- 98% of the two-word occurrences in the same discourse carry the same meaning
- The grain of salt: performance depends on granularity
- (Krovetz 1998) experiments with words with more than two senses
- Performance of one sense per discourse measured on SemCor is approx. 70%
E.g. The ambiguous word PLANT occurs 10 times in a discourse → all instances of plant carry the same meaning
69One Sense per Collocation
- A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
- Strong for adjacent collocations
- Weaker as the distance between words increases
- An example:
- Evaluation:
- 97% precision on words with two-way ambiguity
- Finer granularity:
- (Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses
- 70% precision on SemCor words
The ambiguous word PLANT preserves its meaning in all its occurrences within the collocation "industrial plant", regardless of the context where this collocation occurs
70References
- (Agirre and Rigau, 1995) Agirre, E. and Rigau, G.
A proposal for word sense disambiguation using
conceptual distance. RANLP 1995.
- (Agirre and Martinez 2001) Agirre, E. and
Martinez, D. Learning class-to-class selectional
preferences. CONLL 2001. - (Banerjee and Pedersen 2002) Banerjee, S. and
Pedersen, T. An adapted Lesk algorithm for word
sense disambiguation using WordNet. CICLING 2002. - (Cowie, Guthrie and Guthrie 1992), Cowie, L. and
Guthrie, J. A. and Guthrie, L. Lexical
disambiguation using simulated annealing. COLING 1992.
- (Gale, Church and Yarowsky 1992) Gale, W.,
Church, K., and Yarowsky, D. One sense per
discourse. DARPA workshop 1992. - (Halliday and Hasan 1976) Halliday, M. and Hasan,
R., (1976). Cohesion in English. Longman. - (Galley and McKeown 2003) Galley, M. and McKeown,
K. (2003) Improving word sense disambiguation in
lexical chaining. IJCAI 2003 - (Hirst and St-Onge 1998) Hirst, G. and St-Onge,
D. Lexical chains as representations of context
in the detection and correction of malaproprisms.
WordNet An electronic lexical database, MIT
Press. - (Jiang and Conrath 1997) Jiang, J. and Conrath,
D. Semantic similarity based on corpus statistics
and lexical taxonomy. COLING 1997. - (Krovetz, 1998) Krovetz, R. More than one sense
per discourse. ACL-SIGLEX 1998. - (Lesk, 1986) Lesk, M. Automatic sense
disambiguation using machine readable
dictionaries How to tell a pine cone from an ice
cream cone. SIGDOC 1986. - (Lin 1998) Lin, D. An information theoretic
definition of similarity. ICML 1998.
71References
- (Martinez and Agirre 2000) Martinez, D. and
Agirre, E. One sense per collocation and
genre/topic variations. EMNLP 2000. - (Miller et. al., 1994) Miller, G., Chodorow, M.,
Landes, S., Leacock, C., and Thomas, R. Using a
semantic concordance for sense identification.
ARPA Workshop 1994. - (Miller, 1995) Miller, G. Wordnet A lexical
database. ACM, 38(11) 1995. - (Mihalcea and Moldovan, 1999) Mihalcea, R. and
Moldovan, D. A method for word sense
disambiguation of unrestricted text. ACL 1999. - (Mihalcea and Moldovan 2000) Mihalcea, R. and
Moldovan, D. An iterative approach to word sense
disambiguation. FLAIRS 2000. - (Mihalcea, Tarau, Figa 2004) R. Mihalcea, P.
Tarau, E. Figa PageRank on Semantic Networks with
Application to Word Sense Disambiguation, COLING
2004. - (Patwardhan, Banerjee, and Pedersen 2003)
Patwardhan, S. and Banerjee, S. and Pedersen, T.
Using Measures of Semantic Relatedness for Word
Sense Disambiguation. CICLING 2003. - (Rada et al 1989) Rada, R. and Mili, H. and
Bicknell, E. and Blettner, M. Development and
application of a metric on semantic nets. IEEE
Transactions on Systems, Man, and Cybernetics,
19(1) 1989. - (Resnik 1993) Resnik, P. Selection and
Information A Class-Based Approach to Lexical
Relationships. University of Pennsylvania 1993.
- (Resnik 1995) Resnik, P. Using information
content to evaluate semantic similarity. IJCAI
1995. - (Vasilescu, Langlais, Lapalme 2004) F. Vasilescu,
P. Langlais, G. Lapalme "Evaluating variants of
the Lesk approach for disambiguating words, LREC
2004. - (Yarowsky, 1993) Yarowsky, D. One sense per
collocation. ARPA Workshop 1993.
72Part 4: Supervised Methods of Word Sense Disambiguation
73Outline
- What is Supervised Learning?
- Task Definition
- Single Classifiers
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
74What is Supervised Learning?
- Collect a set of examples that illustrate the
various possible classifications or outcomes of
an event.
- Identify patterns in the examples associated with each particular class of the event.
- Generalize those patterns into rules.
- Apply the rules to classify a new event.
75Learn from these examples: when do I go to the
store?
Day CLASS Go to Store? F1 Hot Outside? F2 Slept Well? F3 Ate Well?
1 YES YES NO NO
2 NO YES NO YES
3 YES NO NO NO
4 NO NO NO YES
77Outline
- What is Supervised Learning?
- Task Definition
- Single Classifiers
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
78Task Definition
- Supervised WSD: a class of methods that induces a
classifier from manually sense-tagged text using
machine learning techniques. - Resources
- Sense Tagged Text
- Dictionary (implicit source of sense inventory)
- Syntactic Analysis (POS tagger, Chunker, Parser,
...)
- Scope
- Typically one target word per context
- Part of speech of target word resolved
- Lends itself to targeted word formulation
- Reduces WSD to a classification problem where a
target word is assigned the most appropriate
sense from a given set of possibilities based on
the context in which it occurs
79Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can't walk there.
80Two Bags of Words (Co-occurrences in the window
of context)
FINANCIAL_BANK_BAG a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG a an and big campus can't catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
81Simple Supervised Approach
- Given a sentence S containing bank:
- For each word Wi in S:
- If Wi is in FINANCIAL_BANK_BAG then
- Sense_1 = Sense_1 + 1
- If Wi is in RIVER_BANK_BAG then
- Sense_2 = Sense_2 + 1
- If Sense_1 > Sense_2 then print Financial
- else if Sense_2 > Sense_1 then print River
- else print Can't Decide
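A directly runnable rendering of the pseudocode above (the two bags are abbreviated here; the full bags appear on slide 80):

FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "deposit",
                      "overdraft", "robbers"}
RIVER_BANK_BAG = {"catfish", "muddy", "pole", "river", "water", "planted"}

def classify(sentence):
    sense_1 = sense_2 = 0
    for word in sentence.lower().split():
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(classify("I went to the bank to deposit my check"))  # -> Financial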
82Supervised Methodology
- Create a sample of training data where a given
target word is manually annotated with a sense
from a predetermined set of possibilities. - One tagged word per instance/lexical sample
disambiguation
- Select a set of features with which to represent context.
- co-occurrences, collocations, POS tags, verb-object relations, etc.
- Convert sense-tagged training instances to feature vectors.
- Apply a machine learning algorithm to induce a classifier.
- Form: structure or relation among features
- Parameters: strength of feature interactions
- Convert a held-out sample of test data into feature vectors.
- correct sense tags are known but not used
- Apply the classifier to test instances to assign a sense tag.
83From Text to Feature Vectors
- My/pronoun grandfather/noun used/verb to/prep
fish/verb along/adv the/det banks/SHORE of/prep
the/det Mississippi/noun River/noun. (S1) - The/det bank/FINANCE issued/verb a/det check/noun
for/prep the/det amount/noun of/prep
interest/noun. (S2)
P-2 P-1 P1 P2 fish check river interest SENSE TAG
S1 adv det prep det Y N Y N SHORE
S2 det verb det N Y N Y FINANCE
84Supervised Learning Algorithms
- Once data is converted to feature vector form,
any supervised learning algorithm can be used.
Many have been applied to WSD with good results - Support Vector Machines
- Nearest Neighbor Classifiers
- Decision Trees
- Decision Lists
- Naïve Bayesian Classifiers
- Perceptrons
- Neural Networks
- Graphical Models
- Log Linear Models
85Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifier
- Decision Lists and Trees
- Ensembles of Classifiers
86Naïve Bayesian Classifier
- The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a
range of tasks (e.g., Domingos and Pazzani, 1997)
- Word Sense Disambiguation is no exception
- Assumes conditional independence among features,
given the sense of a word. - The form of the model is assumed, but parameters
are estimated from training instances - When applied to WSD, features are often a bag of
words that come from the training data - Usually thousands of binary features that
indicate if a word is present in the context of
the target word (or not)
87Bayesian Inference
- Given the observed features, what is the most likely sense?
- Estimate the probability of the observed features given the sense
- Estimate the unconditional probability of the sense
- The unconditional probability of the features is a normalizing term; it doesn't affect sense classification
88Naïve Bayesian Model
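In symbols: choose ŝ = argmax_s P(s) × Πi P(fi | s), i.e., the sense maximizing the product of the sense prior and the probability of each observed feature given that sense (the features fi are assumed conditionally independent given the sense).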
89The Naïve Bayesian Classifier
- Given 2,000 instances of bank: 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)
- P(S1) = 1,500/2,000 = .75
- P(S2) = 500/2,000 = .25
- Given that credit occurs 200 times with bank/1 and 4 times with bank/2:
- P(F1=credit) = 204/2,000 = .102
- P(F1=credit|S1) = 200/1,500 = .133
- P(F1=credit|S2) = 4/500 = .008
- Given a test instance that has one feature, credit:
- P(S1|F1=credit) = .133 × .75 / .102 = .978
- P(S2|F1=credit) = .008 × .25 / .102 = .020
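The same arithmetic as a quick sketch (computed exactly, the first posterior comes out ≈ .98; the slide's .978 reflects rounded intermediate values):

p_s1, p_s2 = 1500 / 2000, 500 / 2000   # sense priors
p_credit = 204 / 2000                  # P(F1=credit)
p_credit_s1 = 200 / 1500               # P(F1=credit | S1)
p_credit_s2 = 4 / 500                  # P(F1=credit | S2)
print(round(p_credit_s1 * p_s1 / p_credit, 3))  # ~0.98 -> bank/1 wins
print(round(p_credit_s2 * p_s2 / p_credit, 3))  # ~0.02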
90Comparative Results
- (Leacock et al. 1993) compared Naïve Bayes with
a Neural Network and a Context Vector approach
when disambiguating six senses of line - (Mooney, 1996) compared Naïve Bayes with a Neural
Network, Decision Tree/List Learners, Disjunctive
and Conjunctive Normal Form learners, and a
perceptron when disambiguating six senses of
line - (Pedersen, 1998) compared Naïve Bayes with
Decision Tree, Rule Based Learner, Probabilistic
Model, etc. when disambiguating line and 12 other
words - All found that Naïve Bayesian Classifier
performed as well as any of the other methods!
91Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
92Decision Lists and Trees
- Very widely used in Machine Learning.
- Decision trees used very early for WSD research
(e.g., Kelly and Stone, 1975; Black, 1988).
questions (presence of feature) that reveal the
sense of a word. - List decides between two senses after one
positive answer - Tree allows for decision among multiple senses
after a series of answers - Uses a smaller, more refined set of features than
bag of words and Naïve Bayes. - More descriptive and easier to interpret.
93Decision List for WSD (Yarowsky, 1994)
- Identify collocational features from sense tagged
data. - Word immediately to the left or right of target
- I have my bank/1 statement.
- The river bank/2 is muddy.
- Pair of words to immediate left or right of
target - The worlds richest bank/1 is here in New York.
- The river bank/2 is muddy.
- Words found within k positions to left or right
of target, where k is often 10-50 - My credit is just horrible because my bank/1 has
made several mistakes with my account and the
balance is very low.
94Building the Decision List
- Sort the order of collocation tests using the log of conditional probabilities.
- Words most indicative of one sense (and not the other) will be ranked highly.
95Computing DL score
- Given 2,000 instances of bank: 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)
- P(S1) = 1,500/2,000 = .75
- P(S2) = 500/2,000 = .25
- Given that credit occurs 200 times with bank/1 and 4 times with bank/2:
- P(F1=credit) = 204/2,000 = .102
- P(F1=credit|S1) = 200/1,500 = .133
- P(F1=credit|S2) = 4/500 = .008
- From Bayes' Rule:
- P(S1|F1=credit) = .133 × .75 / .102 = .978
- P(S2|F1=credit) = .008 × .25 / .102 = .020
- DL Score = abs(log(.978/.020)) = 3.89
96Using the Decision List
- Sort by DL-score; go through the test instance looking for a matching feature. The first match reveals the sense.
DL-score Feature Sense
3.89 credit within bank Bank/1 financial
2.20 bank is muddy Bank/2 river
1.09 pole within bank Bank/2 river
0.00 of the bank N/A
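A minimal sketch of applying such a list (the features are simple membership tests here; real Yarowsky-style features test collocations and positions):

decision_list = [
    (3.89, lambda ctx: "credit" in ctx, "bank/1 (financial)"),
    (2.20, lambda ctx: "muddy" in ctx, "bank/2 (river)"),
    (1.09, lambda ctx: "pole" in ctx, "bank/2 (river)"),
]

def apply_decision_list(context, default="bank/1"):
    ctx = context.lower().split()
    # tests are tried in decreasing order of DL-score; first match decides
    for _score, test, sense in sorted(decision_list, key=lambda t: t[0],
                                      reverse=True):
        if test(ctx):
            return sense
    return default  # e.g., back off to the most frequent sense

print(apply_decision_list("my bank has made mistakes with my credit"))
# -> bank/1 (financial)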
97Using the Decision List
98Learning a Decision Tree
- Identify the feature that most cleanly divides
the training data into the known senses.
- Cleanly: measured by information gain or gain ratio.
- Create subsets of the training data according to feature values.
- Find another feature that most cleanly divides a subset of the training data.
- Continue until each subset of the training data is pure or as clean as possible.
- Well-known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)
- In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000)
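A sketch of inducing such a tree with scikit-learn, a modern stand-in for ID3/C4.5 (assumes scikit-learn is installed; criterion="entropy" corresponds to the information gain heuristic; the four training sentences are toy data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train = ["my bank charges too much for an overdraft",
         "I went to the bank to deposit my check",
         "the river bank is muddy",
         "he planted his pole in the bank and caught a catfish"]
labels = ["financial", "financial", "river", "river"]

vec = CountVectorizer(binary=True)        # binary bag-of-words features
X = vec.fit_transform(train)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)
test = vec.transform(["the bank holds the mortgage and my check"])
print(tree.predict(test))                 # expect ['financial']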
99Supervised WSD with Individual Classifiers
- Many supervised Machine Learning algorithms have
been applied to Word Sense Disambiguation, most
work reasonably well.
- (Witten and Frank, 2000) is a great intro to supervised learning.
- Features tend to differentiate among methods more
than the learning algorithms. - Good sets of features tend to include
- Co-occurrences or keywords (global)
- Collocations (local)
- Bigrams (local and global)
- Part of speech (local)
- Predicate-argument relations
- Verb-object, subject-verb, etc.
- Heads of Noun and Verb Phrases
100Convergence of Results
- Accuracy of different systems applied to the same
data tends to converge on a particular value, no
one system shockingly better than another.
- Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task
- Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task
- Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task
- What to do next?
101Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
102Ensembles of Classifiers
- Classifier error has two components (Bias and
Variance)
- Some algorithms (e.g., decision trees) try to build a representation of the training data: Low Bias / High Variance
- Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: High Bias / Low Variance
- Combining classifiers with different bias-variance characteristics can lead to improved overall accuracy
- Bagging a decision tree can smooth out the
effect of small variations in the training data
(Breiman, 1996)
- Sample with replacement from the training data to learn multiple decision trees.
- Outliers in the training data will tend to be
obscured/eliminated.
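A sketch of bagging in scikit-learn (an assumed library, not one used in the tutorial; X_train/y_train are placeholders):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 25 trees, each fit on a bootstrap resample of the training data;
# predictions are combined by voting, smoothing out variance from outliers
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                           bootstrap=True)
# bagged.fit(X_train, y_train); bagged.predict(X_test)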
103Ensemble Considerations
- Must choose different learning algorithms with
significantly different bias/variance
characteristics.
- Naïve Bayesian Classifier versus Decision Tree
- Must choose feature representations that yield
significantly different (independent?) views of
the training data.
- Lexical versus syntactic features
- Must choose how to combine classifiers.
- Simple Majority Voting
- Averaging of probabilities across multiple
classifier output - Maximum Entropy combination (e.g., Klein, et.
al., 2002)
104Ensemble Results
- (Pedersen, 2000) achieved the state of the art for interest and line data using an ensemble of Naïve Bayesian Classifiers.
- Many Naïve Bayesian Classifiers trained on varying-sized windows of context / bags of words.
- Classifiers combined by a weighted vote
- (Florian and Yarowsky, 2002) achieved state of
the art for Senseval-1 and Senseval-2 data using
a combination of six classifiers.
- Rich set of collocational and syntactic features.
- Combined via a linear combination of the top three classifiers.
- Many Senseval-2 and Senseval-3 systems employed
ensemble methods.
105References
- (Black, 1988) An experiment in computational
discrimination of English word senses. IBM
Journal of Research and Development (32) pg.
185-194. - (Breiman, 1996) The heuristics of instability in
model selection. Annals of Statistics (24) pg.
2350-2383. - (Domingos and Pazzani, 1997) On the Optimality of
the Simple Bayesian Classifier under Zero-One
Loss, Machine Learning (29) pg. 103-130. - (Domingos, 2000) A Unified Bias Variance
Decomposition for Zero-One and Squared Loss. In
Proceedings of AAAI. Pg. 564-569. - (Florian an dYarowsky, 2002) Modeling Consensus
Classifier Combination for Word Sense
Disambiguation. In Proceedings of EMNLP, pp
25-32. - (Kelly and Stone, 1975). Computer Recognition of
English Word Senses, North Holland Publishing
Co., Amsterdam. - (Klein et al., 2002) Combining Heterogeneous
Classifiers for Word-Sense Disambiguation,
Proceedings of Senseval-2. pg. 87-89. - (Leacock et al. 1993) Corpus-based statistical
sense resolution. In Proceedings of the ARPA
Workshop on Human Language Technology. pg.
260-265. - (Mooney, 1996) Comparative experiments on
disambiguating word senses An illustration of
the role of bias in machine learning. Proceedings
of EMNLP. pg. 82-91.
106References
- (Pedersen, 1998) Learning Probabilistic Models of
Word Sense Disambiguation. Ph.D. Dissertation.
Southern Methodist University. - (Pedersen, 2000) A simple approach to building
ensembles of Naive Bayesian classifiers for word
sense disambiguation. In Proceedings of NAACL.
- (Quinlan, 1986). Induction of Decision Trees. Machine Learning (1). pg. 81-106.
- (Quinlan, 1993). C4.5: Programs for Machine Learning. San Francisco, Morgan Kaufmann.
- (Witten and Frank, 2000). Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations. Morgan-Kaufmann. San Francisco. - (Yarowsky, 1994) Decision lists for lexical
ambiguity resolution: Application to accent
restoration in Spanish and French. In Proceedings
of ACL. pp. 88-95. - (Yarowsky, 2000) Hierarchical decision lists for
word sense disambiguation. Computers and the
Humanities, 34.
107Part 5: Minimally Supervised Methods for Word
Sense Disambiguation
108Outline
- Task definition
- What does minimally supervised mean?
- Bootstrapping algorithms
- Co-training
- Self-training
- Yarowsky algorithm
- Using the Web for Word Sense Disambiguation
- Web as a corpus
- Web as collective mind
109Task Definition
- Supervised WSD: learning sense classifiers starting with annotated data
- Minimally supervised WSD: learning sense classifiers from annotated data, with minimal human supervision
- Examples:
- Automatically bootstrap a corpus starting with a
few human-annotated examples
- Use monosemous relatives / dictionary definitions
to automatically construct sense tagged data - Rely on Web users' active learning for corpus
annotation
110Outline
- Task definition
- What does minimally supervised mean?
- Bootstrapping algorithms
- Co-training
- Self-training
- Yarowsky algorithm
- Using the Web for Word Sense Disambiguation
- Web as a corpus
- Web as collective mind
111Bootstrapping WSD Classifiers
- Build sense classifiers with little training data
- Expand applicability of supervised WSD
- Bootstrapping approaches
- Co-training
- Self-training
- Yarowsky algorithm
112Bootstrapping Recipe
- Ingredients
- (Some) labeled data
- (Large amounts of) unlabeled data
- (One or more) basic classifiers
- Output
- Classifier that improves over the basic
classifiers
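A minimal self-training sketch of this recipe (assumptions: a scikit-learn style classifier with predict_proba, dense numpy arrays for L and U, and an illustrative confidence threshold):

import numpy as np

def self_train(clf, X_l, y_l, X_u, threshold=0.95, max_iter=10):
    for _ in range(max_iter):
        clf.fit(X_l, y_l)                      # train on current labeled set
        if len(X_u) == 0:
            break
        probs = clf.predict_proba(X_u)
        keep = probs.max(axis=1) >= threshold  # confident predictions only
        if not keep.any():
            break
        new_y = clf.classes_[probs[keep].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[keep]])      # move them from U to L
        y_l = np.concatenate([y_l, new_y])
        X_u = X_u[~keep]
    return clf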
113[Figure: unlabeled contexts of the ambiguous word plant - "building the only atomic plant", "plant growth is retarded", "a herb or flowering plant", "a nuclear power plant", "building a new vehicle plant", "the animal and plant life", "the passion-fruit plant" - with two seed-labeled examples: "plants#1 and animals" and "industry plant#2"]
114Co-training / Self-training
- A set L of labeled training examples
- A set U of unlabeled examples