Title: Advances in Word Sense Disambiguation
1Advances in Word Sense Disambiguation
- Tutorial at ACL 2005
- June 25, 2005
- Ted Pedersen
- University of Minnesota, Duluth
- http://www.d.umn.edu/~tpederse
- Rada Mihalcea
- University of North Texas
- http://www.cs.unt.edu/~rada
2Goal of the Tutorial
- Introduce the problem of word sense
disambiguation (WSD), focusing on the range of
formulations and approaches currently practiced.
- Accessible to anyone with an interest in NLP.
- Persuade you to work on word sense disambiguation
- It's an interesting problem
- Lots of good work already done, still more to do
- There is infrastructure to help you get started
- Persuade you to use word sense disambiguation in
your text applications.
3Outline of Tutorial
- Introduction (Ted)
- Methodology (Rada)
- Knowledge Intensive Methods (Rada)
- Supervised Approaches (Ted)
- Minimally Supervised Approaches (Rada) / BREAK
- Unsupervised Learning (Ted)
- How to Get Started (Rada)
- Conclusion (Ted)
4Part 1: Introduction
5Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
6Definitions
- Word sense disambiguation is the problem of
selecting a sense for a word from a set of
predefined possibilities.
- The sense inventory usually comes from a dictionary or thesaurus.
- Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches
- Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
- Unsupervised techniques
7Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
8Computers versus Humans
- Polysemy: most words have many possible meanings.
- A computer program has no basis for knowing which one is appropriate, even if it is obvious to a human.
- Ambiguity is rarely a problem for humans in their day-to-day communication, except in extreme cases.
9Ambiguity for Humans - Newspaper Headlines!
- DRUNK GETS NINE YEARS IN VIOLIN CASE
- FARMER BILL DIES IN HOUSE
- PROSTITUTES APPEAL TO POPE
- STOLEN PAINTING FOUND BY TREE
- RED TAPE HOLDS UP NEW BRIDGE
- DEER KILL 300,000
- RESIDENTS CAN DROP OFF TREES
- INCLUDE CHILDREN WHEN BAKING COOKIES
- MINERS REFUSE TO WORK AFTER DEATH
10Ambiguity for a Computer
- The fisherman jumped off the bank and into the water.
- The bank down the street was robbed!
- Back in the day, we had an entire bank of computers devoted to this problem.
- The bank in that road is entirely too steep and is really dangerous.
- The plane took a bank to the left, and then headed off towards the mountains.
11Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
12Early Days of WSD
- Noted as a problem for Machine Translation (Weaver, 1949)
- A word can often only be translated if you know the specific sense intended (a bill in English could be a pico or a cuenta in Spanish)
- Bar-Hillel (1960) posed the following:
- Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
- Is pen a writing instrument or an enclosure where children play?
- He declared it unsolvable and left the field of MT!
13Since then
- 1970s - 1980s
- Rule based systems
- Rely on hand crafted knowledge sources
- 1990s
- Corpus based approaches
- Dependence on sense tagged text
- (Ide and Véronis, 1998) overview the history from the early days to 1998.
- 2000s
- Hybrid systems
- Minimizing or eliminating the use of sense-tagged text
- Taking advantage of the Web
14Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Interdisciplinary Connections
- Practical Applications
15Interdisciplinary Connections
- Cognitive Science & Psychology
- Quillian (1968), Collins and Loftus (1975): spreading activation
- Hirst (1987) developed a marker passing model
- Linguistics
- Fodor & Katz (1963): selectional preferences
- Resnik (1993) pursued these statistically
- Philosophy of Language
- Wittgenstein (1958): meaning as use
- "For a large class of cases - though not for all - in which we employ the word 'meaning' it can be defined thus: the meaning of a word is its use in the language."
16Outline
- Definitions
- Ambiguity for Humans and Computers
- Very Brief Historical Overview
- Theoretical Connections
- Practical Applications
17Practical Applications
- Machine Translation
- Translate bill from English to Spanish
- Is it a pico or a cuenta?
- Is it a bird jaw or an invoice?
- Information Retrieval
- Find all Web Pages about cricket
- The sport or the insect?
- Question Answering
- What is George Miller's position on gun control?
- The psychologist or US congressman?
- Knowledge Acquisition
- Add to KB: Herb Bergson is the mayor of Duluth.
- Minnesota or Georgia?
18References
- (Bar-Hillel, 1960) The Present Status of
Automatic Translations of Languages. In Advances
in Computers. Volume 1. Alt, F. (editor).
Academic Press, New York, NY. pp 91-163. - (Collins and Loftus, 1975) A Spreading Activation
Theory of Semantic Memory. Psychological Review,
(82) pp. 407-428. - (Fodor and Katz, 1963) The structure of semantic
theory. Language (39). pp 170-210. - (Hirst, 1987) Semantic Interpretation and the
Resolution of Ambiguity. Cambridge University
Press. - (Ide and Véronis, 1998) Word Sense Disambiguation: The State of the Art. Computational Linguistics (24) pp 1-40.
- (Quillian, 1968) Semantic Memory. In Semantic
Information Processing. Minsky, M. (editor). The
MIT Press, Cambridge, MA. pp. 227-270. - (Resnik, 1993) Selection and Information: A
Class-Based Approach to Lexical Relationships.
Ph.D. Dissertation. University of Pennsylvania. - (Weaver, 1949) Translation. In Machine
Translation of Languages: fourteen essays. Locke,
W.N. and Booth, A.D. (editors) The MIT Press,
Cambridge, Mass. pp. 15-23. - (Wittgenstein, 1958) Philosophical
Investigations, 3rd edition. Translated by G.E.M.
Anscombe. Macmillan Publishing Co., New York.
19Part 2: Methodology
20Outline
- General considerations
- All-words disambiguation
- Targeted-words disambiguation
- Word sense discrimination, sense discovery
- Evaluation (granularity, scoring)
21Overview of the Problem
- Many words have several meanings (homonymy /
polysemy) - Determine which sense of a word is used in a
specific sentence
- Note:
- often, the different senses of a word are closely
related
- Ex: title - (a) right of legal ownership; (b) document that is evidence of the legal ownership
- sometimes, several senses can be activated in a single context (co-activation)
- Ex: This could bring competition to the trade
- competition - (a) the act of competing; (b) the people who are competing
- Ex: chair - furniture or person
- Ex: child - young person or human offspring
22Word Senses
- The meaning of a word in a given context
- Word sense representations
- With respect to a dictionary
- chair: a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
- chair: the position of professor; "he was awarded an endowed chair in economics"
- With respect to the translation in a second language
- chair → chaise
- chair → directeur
- With respect to the context where it occurs (discrimination)
- Sit on a chair / Take a seat on this chair
- The chair of the Math Department / The chair of the meeting
23Approaches to Word Sense Disambiguation
- Knowledge-Based Disambiguation
- use of external lexical resources such as
dictionaries and thesauri
- discourse properties
- Supervised Disambiguation
- based on a labeled training set
- the learning system has
- a training set of feature-encoded inputs AND
- their appropriate sense label (category)
- Unsupervised Disambiguation
- based on unlabeled corpora
- The learning system has
- a training set of feature-encoded inputs BUT
- NOT their appropriate sense label (category)
24All Words Word Sense Disambiguation
- Attempt to disambiguate all open-class words in a
text
- He put his suit over the back of the chair
- Knowledge-based approaches
- Use information from dictionaries
- Definitions / Examples for each meaning
- Find similarity between definitions and current
context
- Position in a semantic network
- Find that table is closer to chair/furniture than to chair/person
- Use discourse properties
- A word exhibits the same sense in a discourse /
in a collocation
25All Words Word Sense Disambiguation
- Minimally supervised approaches
- Learn to disambiguate words using small annotated
corpora
- E.g. SemCor, a corpus where all open-class words are disambiguated
- 200,000 running words
- Most frequent sense
26Targeted Word Sense Disambiguation
- Disambiguate one target word
- Take a seat on this chair
- The chair of the Math Department
- WSD is viewed as a typical classification problem
- use machine learning techniques to train a system
- Training
- Corpus of occurrences of the target word, each
occurrence annotated with the appropriate sense
- Build feature vectors
- a vector of relevant linguistic features that represents the context (e.g., a window of words around the target word)
- Disambiguation
- Disambiguate the target word in new unseen text
27Targeted Word Sense Disambiguation
- Take a window of n words around the target word
- Encode information about the words around the
target word - typical features include words, root forms, POS
tags, frequency, etc.
- An electric guitar and bass player stand off to
one side, not really part of the scene, just as a
sort of nod to gringo expectations perhaps. - Surrounding context (local features)
- (guitar, NN1), (and, CJC), (player, NN1),
(stand, VVB) - Frequent co-occurring words (topical features)
- fishing, big, sound, player, fly, rod, pound,
double, runs, playing, guitar, band - 0,0,0,1,0,0,0,0,0,0,1,0
- Other features
- followed by "player", contains "show" in the
sentence, etc.
- yes, no, ...
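To make the encoding concrete, here is a minimal sketch (not from the tutorial; the function name and toy sentence are illustrative) of turning the topical features above into a binary vector:

TOPICAL_WORDS = ["fishing", "big", "sound", "player", "fly", "rod",
                 "pound", "double", "runs", "playing", "guitar", "band"]

def encode_topical(context):
    # mark which topical words occur in the context (1) or not (0)
    tokens = set(context.lower().split())
    return [1 if w in tokens else 0 for w in TOPICAL_WORDS]

sentence = "An electric guitar and bass player stand off to one side"
print(encode_topical(sentence))  # -> [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]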
28Unsupervised Disambiguation
- Disambiguate word senses
- without supporting tools such as dictionaries and
thesauri
- without a labeled training text
- Without such resources, word senses are not labeled
- We cannot say chair/furniture or chair/person
- We can:
- Cluster/group the contexts of an ambiguous word into a number of groups
- Discriminate between these groups without
actually labeling them
29Unsupervised Disambiguation
- Hypothesis: the same senses of words will have
similar neighboring words - Disambiguation algorithm
- Identify context vectors corresponding to all
occurrences of a particular word - Partition them into regions of high density
- Assign a sense to each such region
- Sit on a chair
- Take a seat on this chair
- The chair of the Math Department
- The chair of the meeting
30Evaluating Word Sense Disambiguation
- Metrics
- Precision: percentage of words that are tagged correctly, out of the words addressed by the system
- Recall: percentage of words that are tagged correctly, out of all words in the test set
- Example:
- Test set of 100 words; the system attempts 75 words; 50 words are correctly disambiguated
- Precision = 50 / 75 = 0.66
- Recall = 50 / 100 = 0.50
- Special tags are possible
- Unknown
- Proper noun
- Multiple senses
- Compare to a gold standard
- SemCor corpus, Senseval corpus, etc.
31Evaluating Word Sense Disambiguation
- Difficulty in evaluation
- Nature of the senses to distinguish has a huge
impact on results - Coarse versus fine-grained sense distinction
- chair: a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
- chair: the position of professor; "he was awarded an endowed chair in economics"
- bank: a financial institution that accepts deposits and channels the money into lending activities; "he cashed a check at the bank"; "that bank holds the mortgage on my home"
- bank: a building in which commercial banking is transacted; "the bank is on the corner of Nassau and Witherspoon"
- Sense maps
- Cluster similar senses
- Allow for both fine-grained and coarse-grained
evaluation
32Bounds on Performance
- Upper and Lower Bounds on Performance
- Measure of how well an algorithm performs
relative to the difficulty of the task.
- Upper Bound
- Human performance
- Around 97-99% with few and clearly distinct senses
- Inter-judge agreement
- With words with clear, distinct senses: 95% and up
- With polysemous words with related senses: 65-70%
- Lower Bound (or baseline)
- The assignment of a random sense / the most frequent sense
- 90% is excellent for a word with 2 equiprobable senses
- 90% is trivial for a word with 2 senses with probability ratios of 9 to 1
33References
- (Gale, Church and Yarowsky 1992) Gale, W.,
Church, K., and Yarowsky, D. Estimating upper and
lower bounds on the performance of word-sense
disambiguation programs ACL 1992. - (Miller et. al., 1994) Miller, G., Chodorow, M.,
Landes, S., Leacock, C., and Thomas, R. Using a
semantic concordance for sense identification.
ARPA Workshop 1994. - (Miller, 1995) Miller, G. Wordnet A lexical
database. ACM, 38(11) 1995. - (Senseval) Senseval evaluation exercises
http//www.senseval.org
34Part 3: Knowledge-based Methods for Word Sense
Disambiguation
35Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
36Task Definition
- Knowledge-based WSD: the class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text
- Resources
- Yes
- Machine Readable Dictionaries
- Raw corpora
- No
- Manually annotated corpora
- Scope
- All open-class words
37Machine Readable Dictionaries
- In recent years, most dictionaries have been made available in Machine Readable format (MRD)
- Oxford English Dictionary
- Collins
- Longman Dictionary of Contemporary English (LDOCE)
- Thesauri add synonymy information
- Roget's Thesaurus
- Semantic networks add more semantic relations
- WordNet
- EuroWordNet
38MRD A Resource for Knowledge-based WSD
- For each word in the language vocabulary, an MRD
provides:
- A list of meanings
- Definitions (for all word meanings)
- Typical usage examples (for most word meanings)
39MRD A Resource for Knowledge-based WSD
- A thesaurus adds
- An explicit synonymy relation between word
meanings - A semantic network adds
- Hypernymy/hyponymy (IS-A), meronymy/holonymy
(PART-OF), antonymy, entailment, etc.
WordNet synsets for the noun plant: 1. plant, works, industrial plant; 2. plant, flora, plant life
WordNet related concepts for the meaning plant life (plant, flora, plant life):
hypernym: organism, being
hyponym: house plant, fungus, ...
meronym: plant tissue, plant part
holonym: Plantae, kingdom Plantae, plant kingdom
40Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
41Lesk Algorithm
- (Michael Lesk 1986): identify senses of words in context using definition overlap
- Algorithm:
- Retrieve from MRD all sense definitions of the
words to be disambiguated - Determine the definition overlap for all possible
sense combinations
- Choose the senses that lead to the highest overlap
- Example: disambiguate PINE CONE
- PINE
- 1. kinds of evergreen tree with needle-shaped
leaves
- 2. waste away through sorrow or illness
- CONE
- 1. solid body which narrows to a point
- 2. something of this shape whether solid or
hollow
- 3. fruit of certain evergreen trees
Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0
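A minimal sketch of this overlap computation (assumptions: naive tokenization, a tiny stopword list, and crude plural stripping instead of real stemming; a proper stemmer would also find the shape/needle-shaped match that the slide counts for Pine#1 ∩ Cone#2):

STOP = {"of", "or", "and", "the", "a", "to", "which", "this", "whether", "with"}

def content_words(definition):
    # lowercase, split, strip trailing plural 's', drop stopwords
    return {t.rstrip("s") for t in definition.lower().replace("-", " ").split()
            if t not in STOP}

def overlap(def1, def2):
    return len(content_words(def1) & content_words(def2))

pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
cone = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}
best = max(((p, c) for p in pine for c in cone),
           key=lambda pc: overlap(pine[pc[0]], cone[pc[1]]))
print(best)  # -> (1, 3): pine#1 and cone#3 share "evergreen" and "tree(s)"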
42Lesk Algorithm for More than Two Words?
- I saw a man who is 98 years old and can still
walk and tell jokes
- nine open-class words: see(26), man(11), year(4), old(8), can(5), still(4), walk(10), tell(8), joke(3) (number of senses in parentheses)
- 43,929,600 sense combinations! How to find the optimal sense combination?
- Simulated annealing (Cowie, Guthrie, Guthrie 1992)
- Define a function E over the combination of word senses in a given text
- Find the combination of senses that leads to the highest definition overlap (redundancy)
- 1. Start with E = the most frequent sense for each word
- 2. At each iteration, replace the sense of a random word in the set with a different sense, and measure E
- 3. Stop iterating when there is no change in the configuration of senses
43Lesk Algorithm A Simplified Version
- Original Lesk: measure the overlap between sense definitions for all words in context
- Identify simultaneously the correct senses for all words in context
- Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure the overlap between the sense definitions of a word and the current context
- Identify the correct sense for one word at a time
- Search space significantly reduced
44Lesk Algorithm A Simplified Version
- Algorithm for simplified Lesk
- Retrieve from MRD all sense definitions of the
word to be disambiguated
- Determine the overlap between each sense definition and the current context
- Choose the sense that leads to the highest overlap
- Example: disambiguate PINE in
- Pine cones hanging in a tree
- PINE
- 1. kinds of evergreen tree with needle-shaped
leaves
- 2. waste away through sorrow or illness
Pine#1 ∩ Sentence = 1    Pine#2 ∩ Sentence = 0
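The same idea as a small self-contained sketch (simplified Lesk over one target word; the tokenization and stopword list are simplistic assumptions, not the tutorial's code):

STOP = {"a", "in", "of", "or", "with", "through"}

def simplified_lesk(sense_definitions, context):
    # score each sense by its definition's overlap with the context words
    ctx = {w for w in context.lower().split() if w not in STOP}
    def score(sense):
        defn = {w for w in sense_definitions[sense].lower().split()
                if w not in STOP}
        return len(ctx & defn)
    return max(sense_definitions, key=score)

pine = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
print(simplified_lesk(pine, "Pine cones hanging in a tree"))  # -> 1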
45Evaluations of Lesk Algorithm
- Initial evaluation by M. Lesk
- 50-70% on short samples of manually annotated text, with respect to the Oxford Advanced Learner's Dictionary
- Simulated annealing
- 47% on 50 manually annotated sentences
- Evaluation on Senseval-2 all-words data, with back-off to a random sense (Mihalcea & Tarau 2004)
- Original Lesk: 35%
- Simplified Lesk: 47%
- Evaluation on Senseval-2 all-words data, with back-off to the most frequent sense (Vasilescu, Langlais, Lapalme 2004)
- Original Lesk: 42%
- Simplified Lesk: 58%
46Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Preferences
- Measures of Semantic Similarity
- Heuristic-based Methods
47Selectional Preferences
- A way to constrain the possible meanings of words
in a given context
- E.g. Wash a dish vs. Cook a dish
- WASH-OBJECT vs. COOK-FOOD
- Capture information about possible relations between semantic classes
- Common-sense knowledge
- Alternative terminology
- Selectional Restrictions
- Selectional Preferences
- Selectional Constraints
48Acquiring Selectional Preferences
- From annotated corpora
- Circular relationship with the WSD problem
- Need WSD to build the annotated corpus
- Need selectional preferences to derive WSD
- From raw corpora
- Frequency counts
- Information theory measures
- Class-to-class relations
49Preliminaries: Learning Word-to-Word Relations
- An indication of the semantic fit between two
words - 1. Frequency counts
- Pairs of words connected by a syntactic relation
- 2. Conditional probabilities
- Condition on one of the words
50Learning Selectional Preferences (1)
- Word-to-class relations (Resnik 1993)
- Quantify the contribution of a semantic class
using all the concepts subsumed by that class
- where the selectional association of a verb v with a class c can be defined as A(v,c) = (1/S(v)) P(c|v) log(P(c|v)/P(c)), with the normalizing factor S(v) = Σc P(c|v) log(P(c|v)/P(c)) (Resnik 1993)
51Learning Selectional Preferences (2)
- Determine the contribution of a word sense based
on the assumption of equal sense distributions - e.g. plant has two senses ? 50 occurences are
sense 1, 50 are sense 2 - Example learning restrictions for the verb to
drink - Find high-scoring verb-object pairs
- Find prototypical object classes (high
association score)
52Learning Selectional Preferences (3)
- Other algorithms
- Learn class-to-class relations (Agirre and Martinez, 2001)
- E.g. ingest-food is a class-to-class relation for eat-chicken
- Bayesian networks (Ciaramita and Johnson, 2000)
- Tree cut model (Li and Abe, 1998)
53Using Selectional Preferences for WSD
- Algorithm
- 1. Learn a large set of selectional preferences
for a given syntactic relation R - 2. Given a pair of words W1 W2 connected by a
relation R - 3. Find all selectional preferences W1 C
(word-to-class) or C1 C2 (class-to-class) that
apply - 4. Select the meanings of W1 and W2 based on the
selected semantic class
- Example: disambiguate coffee in drink coffee
- 1. (beverage) a beverage consisting of an
infusion of ground coffee beans
- 2. (tree) any of several small trees native to the tropical Old World
- 3. (color) a medium to dark brown color
Given the selectional preference DRINK → BEVERAGE: coffee#1
54Evaluation of Selectional Preferences for WSD
- Data set
- mainly on verb-object, subject-verb relations
extracted from SemCor
- Compare against a random baseline
- Results (Agirre and Martinez, 2000)
- Average results on 8 nouns
- Similar figures reported in (Resnik 1997)
55Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
56Semantic Similarity
- Words in a discourse must be related in meaning,
for the discourse to be coherent (Halliday and Hasan, 1976)
- Use this property for WSD: identify related meanings for words that share a common context
- Context span:
- 1. Local context: semantic similarity between pairs of words
- 2. Global context: lexical chains
57Semantic Similarity in a Local Context
- Similarity determined between pairs of concepts,
or between a word and its surrounding context
- Relies on similarity metrics on semantic networks (Rada et al. 1989)
[Figure: a fragment of the WordNet animal taxonomy, rooted at carnivore, with nodes such as bear; feline, felid; canine, canid; fissiped mammal, fissiped; wild dog (wolf, hyena, hyena dog, dingo); and dog (hunting dog, dachshund, terrier)]
58Semantic Similarity Metrics (1)
- Input: two concepts (same part of speech)
- Output: a similarity measure
- (Leacock and Chodorow 1998): Similarity(C1,C2) = -log(Path(C1,C2) / 2D), where Path(C1,C2) is the length of the shortest path between the two concepts and D is the taxonomy depth
- E.g. Similarity(wolf,dog) = 0.60; Similarity(wolf,bear) = 0.42
- (Resnik 1995)
- Define information content: IC(C) = -log P(C), where P(C) is the probability of seeing a concept of type C in a large corpus
- Probability of seeing a concept = probability of seeing instances of that concept
- Determine the contribution of a word sense based on the assumption of equal sense distributions
- e.g. plant has two senses → 50% of its occurrences are sense 1, 50% are sense 2
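These metrics are easy to experiment with today through NLTK's WordNet interface (a sketch; assumes the nltk package with its wordnet and wordnet_ic data installed, and exact scores depend on the WordNet version, so they will not match the slide's numbers):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog, wolf, bear = (wn.synset(s) for s in ("dog.n.01", "wolf.n.01", "bear.n.01"))
# Leacock-Chodorow: -log(path_length / (2 * taxonomy_depth))
print(wolf.lch_similarity(dog), wolf.lch_similarity(bear))
# Resnik: information content of the least common subsumer
ic = wordnet_ic.ic("ic-brown.dat")
print(wolf.res_similarity(dog, ic))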
59Semantic Similarity Metrics (2)
- Similarity using information content
- (Resnik 1995): define the similarity between two concepts as the information content of their Least Common Subsumer (LCS): Similarity(C1,C2) = IC(LCS(C1,C2))
- Alternatives: (Jiang and Conrath 1997)
- Other metrics
- Similarity using information content (Lin 1998)
- Similarity using gloss-based paths across
different hierarchies (Mihalcea and Moldovan
1999)
- Conceptual density measure between noun semantic hierarchies and the current context (Agirre and Rigau 1995)
- Adapted Lesk algorithm (Banerjee and Pedersen
2002)
60Semantic Similarity Metrics for WSD
- Disambiguate target words based on similarity
with one word to the left and one word to the
right
- (Patwardhan, Banerjee, Pedersen 2003)
- Evaluation:
- 1,723 ambiguous nouns from Senseval-2
- Among 5 similarity metrics, (Jiang and Conrath 1997) provides the best precision (39%)
- Example: disambiguate PLANT in "plant with flowers"
- PLANT
- 1. plant, works, industrial plant
- 2. plant, flora, plant life
- Similarity(plant#1, flower) = 0.2
- Similarity(plant#2, flower) = 1.5
→ plant#2
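A minimal sketch of this selection rule (assumes NLTK's WordNet; path_similarity stands in for the Jiang-Conrath metric used on the slide):

from nltk.corpus import wordnet as wn

def disambiguate(target, neighbor):
    # pick the target sense most similar to any sense of the neighbor
    best, best_score = None, -1.0
    for s in wn.synsets(target, pos=wn.NOUN):
        for n in wn.synsets(neighbor, pos=wn.NOUN):
            score = s.path_similarity(n) or 0.0
            if score > best_score:
                best, best_score = s, score
    return best

print(disambiguate("plant", "flower"))  # expect the flora sense of plant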
61Semantic Similarity in a Global Context
- Lexical chains (Hirst and St-Onge 1998), (Halliday and Hasan 1976)
- A lexical chain is a sequence of semantically
related words, which creates a context and
contributes to the continuity of meaning and the
coherence of a discourse
- Algorithm for finding lexical chains:
- Select the candidate words from the text. These
are words for which we can compute similarity
measures, and therefore most of the time they
have the same part of speech.
- For each such candidate word, and for each
meaning for this word, find a chain to receive
the candidate word sense, based on a semantic
relatedness measure between the concepts that are
already in the chain, and the candidate word
meaning. - If such a chain is found, insert the word in this
chain otherwise, create a new chain.
62Semantic Similarity of a Global Context
A very long train traveling along the rails with a constant velocity v in a certain direction
train: 1. public transport; 2. order, set of things; 3. piece of cloth
travel: 1. change location; 2. undergo transportation
rail: 1. a barrier; 2. a bar of steel for trains; 3. a small bird
63Lexical Chains for WSD
- Identify lexical chains in a text
- Usually target one part of speech at a time
- Identify the meaning of words based on their
membership in a lexical chain
- Evaluation:
- (Galley and McKeown 2003): lexical chains on 74 SemCor texts give 62.09%
- (Mihalcea and Moldovan 2000): on five SemCor texts, 90% precision with 60% recall
- lexical chains anchored on monosemous words
- (Okumura and Honda 1994): lexical chains on five Japanese texts give 63.4%
64Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
65Most Frequent Sense (1)
- Identify the most often used meaning and use this
meaning by default
- Word meanings exhibit a Zipfian distribution
- E.g. the distribution of word senses in SemCor
Example: plant/flora is used more often than plant/factory → annotate any instance of PLANT as plant/flora
66Most Frequent Sense (2)
- Method 1: find the most frequent sense in an annotated corpus
- Method 2: find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)
- 1. Given a word w, find the top k distributionally similar words Nw = {n1, n2, ..., nk}, with associated similarity scores {dss(w,n1), dss(w,n2), ..., dss(w,nk)}
- 2. For each sense wsi of w, identify the similarity with the words nj, using the sense of nj that maximizes this score
- 3. Rank the senses wsi of w based on the total similarity score
67Most Frequent Sense (3)
- Word senses:
- pipe #1: tobacco pipe
- pipe #2: tube of metal or plastic
- Distributionally similar words:
- N = tube, cable, wire, tank, hole, cylinder, fitting, tap, ...
- For each word in N, find the similarity with pipe#i (using the sense that maximizes the similarity)
- pipe#1 - tube (sense 3): 0.3
- pipe#2 - tube (sense 1): 0.6
- Compute a score for each sense pipe#i
- score(pipe#1) = 0.25
- score(pipe#2) = 0.73
- Note: results depend on the corpus used to find distributionally similar words → can find domain-specific predominant senses
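A small sketch of this ranking with made-up numbers in the spirit of the pipe example (dss = distributional similarity of a neighbor to pipe; wnss = the maximal WordNet similarity between a pipe sense and any sense of the neighbor; neither table is from the tutorial):

neighbors = {"tube": 0.4, "cable": 0.3, "wire": 0.25}          # dss(w, nj)
wnss = {"pipe#1": {"tube": 0.3, "cable": 0.1, "wire": 0.1},
        "pipe#2": {"tube": 0.6, "cable": 0.5, "wire": 0.5}}

def prevalence(sense):
    total = 0.0
    for nj, dss in neighbors.items():
        norm = sum(wnss[s][nj] for s in wnss)  # normalize over senses of w
        total += dss * wnss[sense][nj] / norm
    return total

for sense in wnss:
    print(sense, round(prevalence(sense), 2))  # pipe#2 (0.73) outranks pipe#1 (0.23)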
68One Sense Per Discourse
- A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
- What does this mean?
- Evaluation:
- 8 words with two-way ambiguity, e.g. plant, crane, etc.
- 98% of the two-word occurrences in the same discourse carry the same meaning
- The grain of salt: performance depends on granularity
- (Krovetz 1998) experiments with words with more than two senses
- Performance of one sense per discourse measured on SemCor is approx. 70%
E.g. The ambiguous word PLANT occurs 10 times in a discourse → all instances of plant carry the same meaning
69One Sense per Collocation
- A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
- Strong for adjacent collocations
- Weaker as the distance between words increases
- An example:
- Evaluation:
- 97% precision on words with two-way ambiguity
- Finer granularity:
- (Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses
- 70% precision on SemCor words
The ambiguous word PLANT preserves its meaning in all its occurrences within the collocation "industrial plant", regardless of the context where this collocation occurs
70References
- (Agirre and Rigau, 1995) Agirre, E. and Rigau, G.
A proposal for word sense disambiguation using
conceptual distance. RANLP 1995.
- (Agirre and Martinez 2001) Agirre, E. and
Martinez, D. Learning class-to-class selectional
preferences. CONLL 2001. - (Banerjee and Pedersen 2002) Banerjee, S. and
Pedersen, T. An adapted Lesk algorithm for word
sense disambiguation using WordNet. CICLING 2002. - (Cowie, Guthrie and Guthrie 1992), Cowie, L. and
Guthrie, J. A. and Guthrie, L. Lexical
disambiguation using simulated annealing. COLING 1992.
- (Gale, Church and Yarowsky 1992) Gale, W.,
Church, K., and Yarowsky, D. One sense per
discourse. DARPA workshop 1992. - (Halliday and Hasan 1976) Halliday, M. and Hasan,
R., (1976). Cohesion in English. Longman. - (Galley and McKeown 2003) Galley, M. and McKeown,
K. (2003) Improving word sense disambiguation in
lexical chaining. IJCAI 2003 - (Hirst and St-Onge 1998) Hirst, G. and St-Onge,
D. Lexical chains as representations of context
in the detection and correction of malaproprisms.
WordNet An electronic lexical database, MIT
Press. - (Jiang and Conrath 1997) Jiang, J. and Conrath,
D. Semantic similarity based on corpus statistics
and lexical taxonomy. COLING 1997. - (Krovetz, 1998) Krovetz, R. More than one sense
per discourse. ACL-SIGLEX 1998. - (Lesk, 1986) Lesk, M. Automatic sense
disambiguation using machine readable
dictionaries How to tell a pine cone from an ice
cream cone. SIGDOC 1986. - (Lin 1998) Lin, D. An information theoretic
definition of similarity. ICML 1998.
71References
- (Martinez and Agirre 2000) Martinez, D. and
Agirre, E. One sense per collocation and
genre/topic variations. EMNLP 2000. - (Miller et. al., 1994) Miller, G., Chodorow, M.,
Landes, S., Leacock, C., and Thomas, R. Using a
semantic concordance for sense identification.
ARPA Workshop 1994. - (Miller, 1995) Miller, G. Wordnet A lexical
database. ACM, 38(11) 1995. - (Mihalcea and Moldovan, 1999) Mihalcea, R. and
Moldovan, D. A method for word sense
disambiguation of unrestricted text. ACL 1999. - (Mihalcea and Moldovan 2000) Mihalcea, R. and
Moldovan, D. An iterative approach to word sense
disambiguation. FLAIRS 2000. - (Mihalcea, Tarau, Figa 2004) R. Mihalcea, P.
Tarau, E. Figa PageRank on Semantic Networks with
Application to Word Sense Disambiguation, COLING
2004. - (Patwardhan, Banerjee, and Pedersen 2003)
Patwardhan, S. and Banerjee, S. and Pedersen, T.
Using Measures of Semantic Relatedness for Word
Sense Disambiguation. CICLING 2003. - (Rada et al 1989) Rada, R. and Mili, H. and
Bicknell, E. and Blettner, M. Development and
application of a metric on semantic nets. IEEE
Transactions on Systems, Man, and Cybernetics,
19(1) 1989. - (Resnik 1993) Resnik, P. Selection and
Information A Class-Based Approach to Lexical
Relationships. University of Pennsylvania 1993.
- (Resnik 1995) Resnik, P. Using information
content to evaluate semantic similarity. IJCAI
1995. - (Vasilescu, Langlais, Lapalme 2004) F. Vasilescu,
P. Langlais, G. Lapalme "Evaluating variants of
the Lesk approach for disambiguating words, LREC
2004. - (Yarowsky, 1993) Yarowsky, D. One sense per
collocation. ARPA Workshop 1993.
72Part 4: Supervised Methods of Word Sense Disambiguation
73Outline
- What is Supervised Learning?
- Task Definition
- Single Classifiers
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
74What is Supervised Learning?
- Collect a set of examples that illustrate the
various possible classifications or outcomes of
an event.
- Identify patterns in the examples associated with each particular class of the event.
- Generalize those patterns into rules.
- Apply the rules to classify a new event.
75Learn from these examples: when do I go to the
store?
Day CLASS Go to Store? F1 Hot Outside? F2 Slept Well? F3 Ate Well?
1 YES YES NO NO
2 NO YES NO YES
3 YES NO NO NO
4 NO NO NO YES
77Outline
- What is Supervised Learning?
- Task Definition
- Single Classifiers
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
78Task Definition
- Supervised WSD: a class of methods that induces a
classifier from manually sense-tagged text using
machine learning techniques. - Resources
- Sense Tagged Text
- Dictionary (implicit source of sense inventory)
- Syntactic Analysis (POS tagger, Chunker, Parser,
...)
- Scope
- Typically one target word per context
- Part of speech of target word resolved
- Lends itself to targeted word formulation
- Reduces WSD to a classification problem where a
target word is assigned the most appropriate
sense from a given set of possibilities based on
the context in which it occurs
79Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can't walk there.
80Two Bags of Words (Co-occurrences in the window
of context)
FINANCIAL_BANK_BAG a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG a an and big campus can't catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
81Simple Supervised Approach
- Given a sentence S containing bank:
- For each word Wi in S:
- If Wi is in FINANCIAL_BANK_BAG then
- Sense_1 = Sense_1 + 1
- If Wi is in RIVER_BANK_BAG then
- Sense_2 = Sense_2 + 1
- If Sense_1 > Sense_2 then print Financial
- else if Sense_2 > Sense_1 then print River
- else print Can't Decide
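A directly runnable rendering of the pseudocode above (the two bags are abbreviated here; the full bags appear on slide 80):

FINANCIAL_BANK_BAG = {"atm", "card", "charges", "check", "deposit",
                      "overdraft", "robbers"}
RIVER_BANK_BAG = {"catfish", "muddy", "pole", "river", "water", "planted"}

def classify(sentence):
    sense_1 = sense_2 = 0
    for word in sentence.lower().split():
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(classify("I went to the bank to deposit my check"))  # -> Financial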
82Supervised Methodology
- Create a sample of training data where a given
target word is manually annotated with a sense
from a predetermined set of possibilities. - One tagged word per instance/lexical sample
disambiguation
- Select a set of features with which to represent context.
- co-occurrences, collocations, POS tags, verb-object relations, etc.
- Convert sense-tagged training instances to feature vectors.
- Apply a machine learning algorithm to induce a classifier.
- Form: structure or relation among features
- Parameters: strength of feature interactions
- Convert a held-out sample of test data into feature vectors.
- correct sense tags are known but not used
- Apply the classifier to test instances to assign a sense tag.
83From Text to Feature Vectors
- My/pronoun grandfather/noun used/verb to/prep
fish/verb along/adv the/det banks/SHORE of/prep
the/det Mississippi/noun River/noun. (S1) - The/det bank/FINANCE issued/verb a/det check/noun
for/prep the/det amount/noun of/prep
interest/noun. (S2)
P-2 P-1 P1 P2 fish check river interest SENSE TAG
S1 adv det prep det Y N Y N SHORE
S2 det verb det N Y N Y FINANCE
84Supervised Learning Algorithms
- Once data is converted to feature vector form,
any supervised learning algorithm can be used.
Many have been applied to WSD with good results - Support Vector Machines
- Nearest Neighbor Classifiers
- Decision Trees
- Decision Lists
- Naïve Bayesian Classifiers
- Perceptrons
- Neural Networks
- Graphical Models
- Log Linear Models
85Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifier
- Decision Lists and Trees
- Ensembles of Classifiers
86Naïve Bayesian Classifier
- The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a
range of tasks (e.g., Domingos and Pazzani, 1997)
- Word Sense Disambiguation is no exception
- Assumes conditional independence among features,
given the sense of a word. - The form of the model is assumed, but parameters
are estimated from training instances - When applied to WSD, features are often a bag of
words that come from the training data - Usually thousands of binary features that
indicate if a word is present in the context of
the target word (or not)
87Bayesian Inference
- Given the observed features, what is the most likely sense?
- Estimate the probability of the observed features given the sense
- Estimate the unconditional probability of the sense
- The unconditional probability of the features is a normalizing term; it doesn't affect sense classification
88Naïve Bayesian Model
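In symbols: choose ŝ = argmax_s P(s) × Πi P(fi | s), i.e., the sense maximizing the product of the sense prior and the probability of each observed feature given that sense (the features fi are assumed conditionally independent given the sense).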
89The Naïve Bayesian Classifier
- Given 2,000 instances of bank: 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)
- P(S1) = 1,500/2,000 = .75
- P(S2) = 500/2,000 = .25
- Given that credit occurs 200 times with bank/1 and 4 times with bank/2:
- P(F1=credit) = 204/2,000 = .102
- P(F1=credit|S1) = 200/1,500 = .133
- P(F1=credit|S2) = 4/500 = .008
- Given a test instance that has one feature, credit:
- P(S1|F1=credit) = .133 × .75 / .102 = .978
- P(S2|F1=credit) = .008 × .25 / .102 = .020
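The same arithmetic as a quick sketch (computed exactly, the first posterior comes out ≈ .98; the slide's .978 reflects rounded intermediate values):

p_s1, p_s2 = 1500 / 2000, 500 / 2000   # sense priors
p_credit = 204 / 2000                  # P(F1=credit)
p_credit_s1 = 200 / 1500               # P(F1=credit | S1)
p_credit_s2 = 4 / 500                  # P(F1=credit | S2)
print(round(p_credit_s1 * p_s1 / p_credit, 3))  # ~0.98 -> bank/1 wins
print(round(p_credit_s2 * p_s2 / p_credit, 3))  # ~0.02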
90Comparative Results
- (Leacock et al. 1993) compared Naïve Bayes with
a Neural Network and a Context Vector approach
when disambiguating six senses of line - (Mooney, 1996) compared Naïve Bayes with a Neural
Network, Decision Tree/List Learners, Disjunctive
and Conjunctive Normal Form learners, and a
perceptron when disambiguating six senses of
line - (Pedersen, 1998) compared Naïve Bayes with
Decision Tree, Rule Based Learner, Probabilistic
Model, etc. when disambiguating line and 12 other
words - All found that Naïve Bayesian Classifier
performed as well as any of the other methods!
91Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
92Decision Lists and Trees
- Very widely used in Machine Learning.
- Decision trees used very early for WSD research
(e.g., Kelly and Stone, 1975; Black, 1988).
questions (presence of feature) that reveal the
sense of a word. - List decides between two senses after one
positive answer - Tree allows for decision among multiple senses
after a series of answers - Uses a smaller, more refined set of features than
bag of words and Naïve Bayes. - More descriptive and easier to interpret.
93Decision List for WSD (Yarowsky, 1994)
- Identify collocational features from sense tagged
data. - Word immediately to the left or right of target
- I have my bank/1 statement.
- The river bank/2 is muddy.
- Pair of words to immediate left or right of
target - The worlds richest bank/1 is here in New York.
- The river bank/2 is muddy.
- Words found within k positions to left or right
of target, where k is often 10-50 - My credit is just horrible because my bank/1 has
made several mistakes with my account and the
balance is very low.
94Building the Decision List
- Sort the order of collocation tests using the log of conditional probabilities.
- Words most indicative of one sense (and not the other) will be ranked highly.
95Computing DL score
- Given 2,000 instances of bank: 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)
- P(S1) = 1,500/2,000 = .75
- P(S2) = 500/2,000 = .25
- Given that credit occurs 200 times with bank/1 and 4 times with bank/2:
- P(F1=credit) = 204/2,000 = .102
- P(F1=credit|S1) = 200/1,500 = .133
- P(F1=credit|S2) = 4/500 = .008
- From Bayes' Rule:
- P(S1|F1=credit) = .133 × .75 / .102 = .978
- P(S2|F1=credit) = .008 × .25 / .102 = .020
- DL Score = abs(log(.978/.020)) = 3.89
96Using the Decision List
- Sort by DL-score; go through the test instance looking for a matching feature. The first match reveals the sense.
DL-score Feature Sense
3.89 credit within bank Bank/1 financial
2.20 bank is muddy Bank/2 river
1.09 pole within bank Bank/2 river
0.00 of the bank N/A
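A minimal sketch of applying such a list (the features are simple membership tests here; real Yarowsky-style features test collocations and positions):

decision_list = [
    (3.89, lambda ctx: "credit" in ctx, "bank/1 (financial)"),
    (2.20, lambda ctx: "muddy" in ctx, "bank/2 (river)"),
    (1.09, lambda ctx: "pole" in ctx, "bank/2 (river)"),
]

def apply_decision_list(context, default="bank/1"):
    ctx = context.lower().split()
    # tests are tried in decreasing order of DL-score; first match decides
    for _score, test, sense in sorted(decision_list, key=lambda t: t[0],
                                      reverse=True):
        if test(ctx):
            return sense
    return default  # e.g., back off to the most frequent sense

print(apply_decision_list("my bank has made mistakes with my credit"))
# -> bank/1 (financial)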
97Using the Decision List
98Learning a Decision Tree
- Identify the feature that most cleanly divides
the training data into the known senses.
- Cleanly: measured by information gain or gain ratio.
- Create subsets of the training data according to feature values.
- Find another feature that most cleanly divides a subset of the training data.
- Continue until each subset of the training data is pure or as clean as possible.
- Well-known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)
- In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000)
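A sketch of inducing such a tree with scikit-learn, a modern stand-in for ID3/C4.5 (assumes scikit-learn is installed; criterion="entropy" corresponds to the information gain heuristic; the four training sentences are toy data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

train = ["my bank charges too much for an overdraft",
         "I went to the bank to deposit my check",
         "the river bank is muddy",
         "he planted his pole in the bank and caught a catfish"]
labels = ["financial", "financial", "river", "river"]

vec = CountVectorizer(binary=True)        # binary bag-of-words features
X = vec.fit_transform(train)
tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)
test = vec.transform(["the bank holds the mortgage and my check"])
print(tree.predict(test))                 # expect ['financial']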
99Supervised WSD with Individual Classifiers
- Many supervised Machine Learning algorithms have
been applied to Word Sense Disambiguation, most
work reasonably well.
- (Witten and Frank, 2000) is a great intro to supervised learning.
- Features tend to differentiate among methods more
than the learning algorithms. - Good sets of features tend to include
- Co-occurrences or keywords (global)
- Collocations (local)
- Bigrams (local and global)
- Part of speech (local)
- Predicate-argument relations
- Verb-object, subject-verb, etc.
- Heads of Noun and Verb Phrases
100Convergence of Results
- Accuracy of different systems applied to the same
data tends to converge on a particular value, no
one system shockingly better than another.
- Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task
- Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task
- Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task
- What to do next?
101Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
102Ensembles of Classifiers
- Classifier error has two components (Bias and
Variance)
- Some algorithms (e.g., decision trees) try to build a representation of the training data: Low Bias / High Variance
- Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: High Bias / Low Variance
- Combining classifiers with different bias-variance characteristics can lead to improved overall accuracy
- Bagging a decision tree can smooth out the
effect of small variations in the training data
(Breiman, 1996)
- Sample with replacement from the training data to learn multiple decision trees.
- Outliers in the training data will tend to be
obscured/eliminated.
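A sketch of bagging in scikit-learn (an assumed library, not one used in the tutorial; X_train/y_train are placeholders):

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# 25 trees, each fit on a bootstrap resample of the training data;
# predictions are combined by voting, smoothing out variance from outliers
bagged = BaggingClassifier(DecisionTreeClassifier(), n_estimators=25,
                           bootstrap=True)
# bagged.fit(X_train, y_train); bagged.predict(X_test)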
103Ensemble Considerations
- Must choose different learning algorithms with
significantly different bias/variance
characteristics.
- Naïve Bayesian Classifier versus Decision Tree
- Must choose feature representations that yield
significantly different (independent?) views of
the training data.
- Lexical versus syntactic features
- Must choose how to combine classifiers.
- Simple Majority Voting
- Averaging of probabilities across multiple
classifier output - Maximum Entropy combination (e.g., Klein, et.
al., 2002)
104Ensemble Results
- (Pedersen, 2000) achieved the state of the art for interest and line data using an ensemble of Naïve Bayesian Classifiers.
- Many Naïve Bayesian Classifiers trained on varying-sized windows of context / bags of words.
- Classifiers combined by a weighted vote
- (Florian and Yarowsky, 2002) achieved state of
the art for Senseval-1 and Senseval-2 data using
a combination of six classifiers.
- Rich set of collocational and syntactic features.
- Combined via a linear combination of the top three classifiers.
- Many Senseval-2 and Senseval-3 systems employed
ensemble methods.
105References
- (Black, 1988) An experiment in computational
discrimination of English word senses. IBM
Journal of Research and Development (32) pg.
185-194. - (Breiman, 1996) The heuristics of instability in
model selection. Annals of Statistics (24) pg.
2350-2383. - (Domingos and Pazzani, 1997) On the Optimality of
the Simple Bayesian Classifier under Zero-One
Loss, Machine Learning (29) pg. 103-130. - (Domingos, 2000) A Unified Bias Variance
Decomposition for Zero-One and Squared Loss. In
Proceedings of AAAI. Pg. 564-569. - (Florian an dYarowsky, 2002) Modeling Consensus
Classifier Combination for Word Sense
Disambiguation. In Proceedings of EMNLP, pp
25-32. - (Kelly and Stone, 1975). Computer Recognition of
English Word Senses, North Holland Publishing
Co., Amsterdam. - (Klein et al., 2002) Combining Heterogeneous
Classifiers for Word-Sense Disambiguation,
Proceedings of Senseval-2. pg. 87-89. - (Leacock et al. 1993) Corpus-based statistical
sense resolution. In Proceedings of the ARPA
Workshop on Human Language Technology. pg.
260-265. - (Mooney, 1996) Comparative experiments on
disambiguating word senses An illustration of
the role of bias in machine learning. Proceedings
of EMNLP. pg. 82-91.
106References
- (Pedersen, 1998) Learning Probabilistic Models of
Word Sense Disambiguation. Ph.D. Dissertation.
Southern Methodist University. - (Pedersen, 2000) A simple approach to building
ensembles of Naive Bayesian classifiers for word
sense disambiguation. In Proceedings of NAACL.
- (Quinlan, 1986). Induction of Decision Trees. Machine Learning (1). pg. 81-106.
- (Quinlan, 1993). C4.5: Programs for Machine Learning. San Francisco, Morgan Kaufmann.
- (Witten and Frank, 2000). Data Mining: Practical
Machine Learning Tools and Techniques with Java
Implementations. Morgan-Kaufmann. San Francisco. - (Yarowsky, 1994) Decision lists for lexical
ambiguity resolution: Application to accent
restoration in Spanish and French. In Proceedings
of ACL. pp. 88-95. - (Yarowsky, 2000) Hierarchical decision lists for
word sense disambiguation. Computers and the
Humanities, 34.
107Part 5: Minimally Supervised Methods for Word
Sense Disambiguation
108Outline
- Task definition
- What does minimally supervised mean?
- Bootstrapping algorithms
- Co-training
- Self-training
- Yarowsky algorithm
- Using the Web for Word Sense Disambiguation
- Web as a corpus
- Web as collective mind
109Task Definition
- Supervised WSD: learning sense classifiers starting with annotated data
- Minimally supervised WSD: learning sense classifiers from annotated data, with minimal human supervision
- Examples:
- Automatically bootstrap a corpus starting with a
few human-annotated examples
- Use monosemous relatives / dictionary definitions
to automatically construct sense tagged data - Rely on Web users' active learning for corpus
annotation
110Outline
- Task definition
- What does minimally supervised mean?
- Bootstrapping algorithms
- Co-training
- Self-training
- Yarowsky algorithm
- Using the Web for Word Sense Disambiguation
- Web as a corpus
- Web as collective mind
111Bootstrapping WSD Classifiers
- Build sense classifiers with little training data
- Expand applicability of supervised WSD
- Bootstrapping approaches
- Co-training
- Self-training
- Yarowsky algorithm
112Bootstrapping Recipe
- Ingredients
- (Some) labeled data
- (Large amounts of) unlabeled data
- (One or more) basic classifiers
- Output
- Classifier that improves over the basic
classifiers
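A minimal self-training sketch of this recipe (assumptions: a scikit-learn style classifier with predict_proba, dense numpy arrays for L and U, and an illustrative confidence threshold):

import numpy as np

def self_train(clf, X_l, y_l, X_u, threshold=0.95, max_iter=10):
    for _ in range(max_iter):
        clf.fit(X_l, y_l)                      # train on current labeled set
        if len(X_u) == 0:
            break
        probs = clf.predict_proba(X_u)
        keep = probs.max(axis=1) >= threshold  # confident predictions only
        if not keep.any():
            break
        new_y = clf.classes_[probs[keep].argmax(axis=1)]
        X_l = np.vstack([X_l, X_u[keep]])      # move them from U to L
        y_l = np.concatenate([y_l, new_y])
        X_u = X_u[~keep]
    return clf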
113[Figure: unlabeled contexts of the ambiguous word plant - "building the only atomic plant", "plant growth is retarded", "a herb or flowering plant", "a nuclear power plant", "building a new vehicle plant", "the animal and plant life", "the passion-fruit plant" - with two seed-labeled examples: "plants#1 and animals" and "industry plant#2"]
114Co-training / Self-training
- A set L of labeled training examples
- A set U of unlabeled examples