Title: Advances in Word Sense Disambiguation
1. Advances in Word Sense Disambiguation
- Tutorial at AAAI-2005
- July 9, 2005
- Rada Mihalcea
- University of North Texas
- http://www.cs.unt.edu/~rada
- Ted Pedersen
- University of Minnesota, Duluth
- http://www.d.umn.edu/~tpederse
- Note: slides have been modified/deleted/added
- For those interested in lexical semantics, I suggest getting the entire tutorial
2. Definitions
- Word sense disambiguation is the problem of selecting a sense for a word from a set of predefined possibilities.
  - The sense inventory usually comes from a dictionary or thesaurus.
  - Knowledge-intensive methods, supervised learning, and (sometimes) bootstrapping approaches
- Word sense discrimination is the problem of dividing the usages of a word into different meanings, without regard to any particular existing sense inventory.
  - Unsupervised techniques
3. Word Senses
- The meaning of a word in a given context
- Word sense representations
  - With respect to a dictionary
    - chair: a seat for one person, with a support for the back; "he put his coat over the back of the chair and sat down"
    - chair: the position of professor; "he was awarded an endowed chair in economics"
  - With respect to the translation in a second language
    - chair: chaise
    - chair: directeur
  - With respect to the context where it occurs (discrimination)
    - Sit on a chair / Take a seat on this chair
    - The chair of the Math Department / The chair of the meeting
4. Approaches to Word Sense Disambiguation
- Knowledge-Based Disambiguation
  - use of external lexical resources such as dictionaries and thesauri
  - discourse properties
- Supervised Disambiguation
  - based on a labeled training set
  - the learning system has
    - a training set of feature-encoded inputs AND
    - their appropriate sense label (category)
- Unsupervised Disambiguation
  - based on unlabeled corpora
  - the learning system has
    - a training set of feature-encoded inputs BUT
    - NOT their appropriate sense label (category)
5. All-Words Word Sense Disambiguation
- Attempt to disambiguate all open-class words in a text
  - "He put his suit over the back of the chair"
- Use information from dictionaries
  - Definitions / examples for each meaning
  - Find similarity between definitions and current context
- Position in a semantic network
  - Find that "table" is closer to chair/furniture than to chair/person
- Use discourse properties
  - A word exhibits the same sense in a discourse / in a collocation
6. All-Words Word Sense Disambiguation
- Minimally supervised approaches
  - Learn to disambiguate words using small annotated corpora
  - E.g. the SemCor corpus, where all open-class words are disambiguated
    - 200,000 running words
- Most frequent sense
7. Targeted Word Sense Disambiguation (we saw this in the previous lecture notes)
- Disambiguate one target word
  - Take a seat on this chair
  - The chair of the Math Department
- WSD is viewed as a typical classification problem
  - use machine learning techniques to train a system
- Training
  - Corpus of occurrences of the target word, each occurrence annotated with the appropriate sense
  - Build feature vectors
    - a vector of relevant linguistic features that represents the context (e.g. a window of words around the target word)
- Disambiguation
  - Disambiguate the target word in new unseen text
8. Unsupervised Disambiguation
- Disambiguate word senses
  - without supporting tools such as dictionaries and thesauri
  - without a labeled training text
- Without such resources, word senses are not labeled
  - We cannot say chair/furniture or chair/person
- We can
  - Cluster/group the contexts of an ambiguous word into a number of groups
  - Discriminate between these groups without actually labeling them
9. Unsupervised Disambiguation
- Hypothesis: the same senses of a word will have similar neighboring words
- Disambiguation algorithm
  - Identify context vectors corresponding to all occurrences of a particular word
  - Partition them into regions of high density
  - Assign a sense to each such region
- Examples
  - Sit on a chair
  - Take a seat on this chair
  - The chair of the Math Department
  - The chair of the meeting
10. Bounds on Performance
- Upper and Lower Bounds on Performance
  - Measure of how well an algorithm performs relative to the difficulty of the task
- Upper Bound
  - Human performance
    - Around 97-99% with few and clearly distinct senses
  - Inter-judge agreement
    - With words with clear, distinct senses: 95% and up
    - With polysemous words with related senses: 65-70%
- Lower Bound (or baseline)
  - The assignment of a random sense / the most frequent sense
    - 90% is excellent for a word with 2 equiprobable senses
    - 90% is trivial for a word with 2 senses with probability ratios of 9 to 1
11. References
- (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 1992.
- (Miller et al. 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
- (Miller 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11), 1995.
- (Senseval) Senseval evaluation exercises. http://www.senseval.org
12. Part 3: Knowledge-based Methods for Word Sense Disambiguation
13. Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
14. Task Definition
- Knowledge-based WSD: the class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text
- Resources
  - Yes
    - Machine Readable Dictionaries
    - Raw corpora
  - No
    - Manually annotated corpora
- Scope
  - All open-class words
15. Machine Readable Dictionaries
- In recent years, most dictionaries have been made available in Machine Readable format (MRD)
  - Oxford English Dictionary
  - Collins
  - Longman Dictionary of Contemporary English (LDOCE)
- Thesauri add synonymy information
  - Roget's Thesaurus
- Semantic networks add more semantic relations
  - WordNet
  - EuroWordNet
16. MRD: A Resource for Knowledge-based WSD
- For each word in the language vocabulary, an MRD provides
  - A list of meanings
  - Definitions (for all word meanings)
  - Typical usage examples (for most word meanings)
17. MRD: A Resource for Knowledge-based WSD
- A thesaurus adds
  - An explicit synonymy relation between word meanings
- A semantic network adds
  - Hypernymy/hyponymy (IS-A), meronymy (PART-OF), antonymy, entailment, etc.

WordNet synsets for the noun "plant":
  1. plant, works, industrial plant
  2. plant, flora, plant life

WordNet related concepts for the meaning "plant life":
  plant, flora, plant life
    hypernym: organism, being
    hyponym: house plant, fungus, ...
    meronym: plant tissue, plant part
    holonym: Plantae, kingdom Plantae, plant kingdom
18. Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
19. Lesk Algorithm
- (Michael Lesk 1986): identify senses of words in context using definition overlap
- Algorithm
  - Retrieve from the MRD all sense definitions of the words to be disambiguated
  - Determine the definition overlap for all possible sense combinations
  - Choose the senses that lead to the highest overlap
- Example: disambiguate PINE CONE
  - PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness
  - CONE
    1. solid body which narrows to a point
    2. something of this shape whether solid or hollow
    3. fruit of certain evergreen trees

  Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
  Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
  Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0
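A minimal sketch of the original Lesk computation in Python, with the toy glosses above hard-coded (a real system would read definitions from an MRD and normalize words properly; here a crude plural strip lets "trees" match "tree", so the winning pair matches the table, though not every cell does):

    PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
            2: "waste away through sorrow or illness"}
    CONE = {1: "solid body which narrows to a point",
            2: "something of this shape whether solid or hollow",
            3: "fruit of certain evergreen trees"}
    STOP = {"of", "to", "a", "the", "with", "or", "which", "this", "through"}

    def content_words(gloss):
        # drop stopwords first, then crudely strip plural 's'
        words = set(gloss.lower().split()) - STOP
        return {w.rstrip("s") for w in words}

    def overlap(g1, g2):
        return len(content_words(g1) & content_words(g2))

    # Score every sense combination; Pine#1 + Cone#3 wins with overlap 2
    # ("evergreen" and "tree"), as in the table above.
    best = max(((p, c) for p in PINE for c in CONE),
               key=lambda pc: overlap(PINE[pc[0]], CONE[pc[1]]))
    print(best)   # (1, 3)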
20. Lesk Algorithm: A Simplified Version
- Original Lesk: measure overlap between sense definitions for all words in context
  - Identify simultaneously the correct senses for all words in context
- Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between sense definitions of a word and the current context
  - Identify the correct sense for one word at a time
  - Search space significantly reduced
21. Lesk Algorithm: A Simplified Version
- Algorithm for simplified Lesk
  - Retrieve from the MRD all sense definitions of the word to be disambiguated
  - Determine the overlap between each sense definition and the current context
  - Choose the sense that leads to the highest overlap
- Example: disambiguate PINE in "Pine cones hanging in a tree"
  - PINE
    1. kinds of evergreen tree with needle-shaped leaves
    2. waste away through sorrow or illness

  Pine#1 ∩ Sentence = 1    Pine#2 ∩ Sentence = 0
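For reference, NLTK ships a simplified-Lesk implementation over WordNet glosses; a usage sketch, assuming NLTK plus its 'wordnet' and 'punkt' data packages are installed:

    # Simplified Lesk over WordNet glosses via NLTK.
    from nltk.wsd import lesk
    from nltk.tokenize import word_tokenize

    context = word_tokenize("Pine cones hanging in a tree")
    sense = lesk(context, "pine", pos="n")   # returns an NLTK Synset (or None)
    print(sense, "-", sense.definition())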
22. Evaluations of the Lesk Algorithm
- Initial evaluation by M. Lesk
  - 50-70% on short, manually annotated samples of text, with respect to the Oxford Advanced Learner's Dictionary
- Evaluation on Senseval-2 all-words data, with back-off to a random sense (Mihalcea & Tarau 2004)
  - Original Lesk: 35%
  - Simplified Lesk: 47%
- Evaluation on Senseval-2 all-words data, with back-off to the most frequent sense (Vasilescu, Langlais & Lapalme 2004)
  - Original Lesk: 42%
  - Simplified Lesk: 58%
23. Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Preferences
- Measures of Semantic Similarity
- Heuristic-based Methods
24. Selectional Preferences
- A way to constrain the possible meanings of words in a given context
- E.g. "Wash a dish" vs. "Cook a dish"
  - WASH-OBJECT vs. COOK-FOOD
- Capture information about possible relations between semantic classes
  - Common sense knowledge
- Alternative terminology
  - Selectional Restrictions
  - Selectional Preferences
  - Selectional Constraints
25. Acquiring Selectional Preferences
- From annotated corpora
- We saw this in the previous lecture notes
- From raw corpora
- Frequency counts
- Information theory measures
- Class-to-class relations
26. Preliminaries: Learning Word-to-Word Relations
- An indication of the semantic fit between two words
- 1. Frequency counts
  - Pairs of words connected by a syntactic relation
- 2. Conditional probabilities
  - Condition on one of the words
27. Learning Selectional Preferences (1) (p. 14 in Chapter 19; you won't be responsible for this formula)
- Word-to-class relations (Resnik 1993)
- Quantify the contribution of a semantic class using all the concepts subsumed by that class:

    A(p, c) = P(c|p) * log( P(c|p) / P(c) ) / S(p)

  where S(p) is the selectional preference strength of predicate p:

    S(p) = sum over classes c of P(c|p) * log( P(c|p) / P(c) )
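A toy sketch of these two quantities in Python; the class prior and conditional probabilities are invented numbers, and a real system would estimate them from a parsed corpus plus a WordNet-style class hierarchy:

    from math import log

    def strength(p_c_given_p, p_c):
        """S(p): how strongly predicate p constrains its argument class
        (KL divergence between P(c|p) and the prior P(c))."""
        return sum(q * log(q / p_c[c]) for c, q in p_c_given_p.items())

    def association(c, p_c_given_p, p_c):
        """A(p, c): the share of S(p) contributed by class c."""
        q = p_c_given_p[c]
        return q * log(q / p_c[c]) / strength(p_c_given_p, p_c)

    prior = {"FOOD": 0.1, "ARTIFACT": 0.9}   # P(c), toy numbers
    eat = {"FOOD": 0.8, "ARTIFACT": 0.2}     # P(c | "eat"), toy numbers
    print(association("FOOD", eat, prior))   # FOOD dominates eat's preference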
28. Outline
- Task definition
- Machine Readable Dictionaries
- Algorithms based on Machine Readable Dictionaries
- Selectional Restrictions
- Measures of Semantic Similarity
- Heuristic-based Methods
29. Semantic Similarity
- Words in a discourse must be related in meaning for the discourse to be coherent (Halliday and Hasan, 1976)
- Use this property for WSD: identify related meanings for words that share a common context
30. See Figure 19.6 in the chapter
- Basic idea: the shorter the path between two senses in a semantic network, the more similar they are.
- So, you can see that nickel and dime are closer to budget than they are to Richter scale (in Figure 19.6)
31. Semantic Similarity Metrics (1)
- Input: two concepts (same part of speech)
- Output: similarity measure
- (Leacock and Chodorow 1998): path length scaled by taxonomy depth

    Similarity(C1, C2) = -log( Path(C1, C2) / (2 * D) ),  where D is the taxonomy depth

  - E.g. Similarity(wolf, dog) = 0.60; Similarity(wolf, bear) = 0.42
- (Resnik 1995): define the information content IC(C) = -log P(C), where P(C) = the probability of seeing a concept of type C in a large corpus
  - Probability of seeing a concept = probability of seeing instances of that concept
32. Semantic Similarity Metrics (2)
- Similarity using information content
  - (Resnik 1995): define the similarity between two concepts as the information content of their Least Common Subsumer (LCS):

      Similarity(C1, C2) = IC( LCS(C1, C2) )

  - Alternative (Jiang and Conrath 1997): a distance that also uses the concepts' own information content:

      Distance(C1, C2) = IC(C1) + IC(C2) - 2 * IC( LCS(C1, C2) )

- Other metrics
  - Similarity using information content (Lin 1998):

      Similarity(C1, C2) = 2 * IC( LCS(C1, C2) ) / ( IC(C1) + IC(C2) )

  - Similarity using gloss-based paths across different hierarchies (Mihalcea and Moldovan 1999)
  - Conceptual density measure between noun semantic hierarchies and the current context (Agirre and Rigau 1995)
  - Adapted Lesk algorithm (Banerjee and Pedersen 2002)
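These measures are available in NLTK's WordNet interface; a usage sketch, assuming the 'wordnet' and 'wordnet_ic' data packages have been downloaded:

    from nltk.corpus import wordnet as wn
    from nltk.corpus import wordnet_ic

    dog, wolf = wn.synset("dog.n.01"), wn.synset("wolf.n.01")
    ic = wordnet_ic.ic("ic-brown.dat")      # P(C) estimated from the Brown corpus

    print(dog.lch_similarity(wolf))         # Leacock-Chodorow (path / depth)
    print(dog.res_similarity(wolf, ic))     # Resnik: IC of the LCS
    print(dog.jcn_similarity(wolf, ic))     # Jiang-Conrath
    print(dog.lin_similarity(wolf, ic))     # Lin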
33. Most Frequent Sense (1)
- Identify the most often used meaning and use this meaning by default
- Word meanings exhibit a Zipfian distribution
  - E.g. distribution of word senses in SemCor
- Example: plant/flora is used more often than plant/factory, so annotate any instance of PLANT as plant/flora
34. Most Frequent Sense (2) (you aren't responsible for this)
- Method 1: find the most frequent sense in an annotated corpus
- Method 2: find the most frequent sense using a method based on distributional similarity (McCarthy et al. 2004)
  - 1. Given a word w, find the top k distributionally similar words N_w = {n1, n2, ..., nk}, with associated similarity scores dss(w,n1), dss(w,n2), ..., dss(w,nk)
  - 2. For each sense ws_i of w, identify the similarity with the words nj, using the sense of nj that maximizes this score
  - 3. Rank the senses ws_i of w based on the total similarity score
35. Most Frequent Sense (3)
- Word senses
  - pipe #1 = tobacco pipe
  - pipe #2 = tube of metal or plastic
- Distributionally similar words
  - N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, ...}
- For each word in N, find the similarity with pipe#i (using the sense that maximizes the similarity)
  - pipe#1 - tube (sense 3): 0.3
  - pipe#2 - tube (sense 1): 0.6
- Compute a score for each sense pipe#i
  - score(pipe#1) = 0.25
  - score(pipe#2) = 0.73
- Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses
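A sketch of the McCarthy et al. ranking, with the distributional scores and the sense-to-neighbor similarities stubbed out as toy numbers in the spirit of the pipe example (a real system would compute dss from a thesaurus built over raw text and the similarities from a WordNet measure):

    # neighbors: nj -> dss(w, nj), toy values
    neighbors = {"tube": 0.4, "cable": 0.3, "wire": 0.2}

    # sim[sense][nj]: similarity of each pipe sense to nj, already
    # maximized over nj's senses (step 2 on the previous slide)
    sim = {"pipe#1 (tobacco)": {"tube": 0.3, "cable": 0.1, "wire": 0.1},
           "pipe#2 (tube)":    {"tube": 0.6, "cable": 0.5, "wire": 0.4}}

    def score(sense):
        # step 3: weight each neighbor's similarity by its dss score
        return sum(dss * sim[sense][nj] for nj, dss in neighbors.items())

    ranked = sorted(sim, key=score, reverse=True)
    print(ranked[0])   # pipe#2 (tube) wins, as in the example above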
36. One Sense Per Discourse
- A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
- What does this mean?
- Evaluation
  - 8 words with two-way ambiguity, e.g. plant, crane, etc.
  - 98% of the occurrences of these words within the same discourse carry the same meaning
- The grain of salt: performance depends on granularity
  - (Krovetz 1998) experiments with words with more than two senses
  - Performance of "one sense per discourse" measured on SemCor is approx. 70%

E.g. if the ambiguous word PLANT occurs 10 times in a discourse, all instances of "plant" carry the same meaning
37. One Sense per Collocation
- A word tends to preserve its meaning when used in the same collocation (Yarowsky 1993)
  - Strong for adjacent collocations
  - Weaker as the distance between words increases
- An example
- Evaluation
  - 97% precision on words with two-way ambiguity
- Finer granularity
  - (Martinez and Agirre 2000) tested the "one sense per collocation" hypothesis on text annotated with WordNet senses
  - 70% precision on SemCor words

The ambiguous word PLANT preserves its meaning in all its occurrences within the collocation "industrial plant", regardless of the context where this collocation occurs
38. References
- (Agirre and Rigau 1995) Agirre, E. and Rigau, G. A proposal for word sense disambiguation using conceptual distance. RANLP 1995.
- (Agirre and Martinez 2001) Agirre, E. and Martinez, D. Learning class-to-class selectional preferences. CoNLL 2001.
- (Banerjee and Pedersen 2002) Banerjee, S. and Pedersen, T. An adapted Lesk algorithm for word sense disambiguation using WordNet. CICLing 2002.
- (Cowie, Guthrie and Guthrie 1992) Cowie, J., Guthrie, J., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.
- (Gale, Church and Yarowsky 1992) Gale, W., Church, K., and Yarowsky, D. One sense per discourse. DARPA Workshop 1992.
- (Halliday and Hasan 1976) Halliday, M. and Hasan, R. Cohesion in English. Longman, 1976.
- (Galley and McKeown 2003) Galley, M. and McKeown, K. Improving word sense disambiguation in lexical chaining. IJCAI 2003.
- (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database, MIT Press.
- (Jiang and Conrath 1997) Jiang, J. and Conrath, D. Semantic similarity based on corpus statistics and lexical taxonomy. COLING 1997.
- (Krovetz 1998) Krovetz, R. More than one sense per discourse. ACL-SIGLEX 1998.
- (Lesk 1986) Lesk, M. Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. SIGDOC 1986.
- (Lin 1998) Lin, D. An information-theoretic definition of similarity. ICML 1998.
39. References
- (Martinez and Agirre 2000) Martinez, D. and Agirre, E. One sense per collocation and genre/topic variations. EMNLP 2000.
- (Miller et al. 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
- (Miller 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11), 1995.
- (Mihalcea and Moldovan 1999) Mihalcea, R. and Moldovan, D. A method for word sense disambiguation of unrestricted text. ACL 1999.
- (Mihalcea and Moldovan 2000) Mihalcea, R. and Moldovan, D. An iterative approach to word sense disambiguation. FLAIRS 2000.
- (Mihalcea, Tarau and Figa 2004) Mihalcea, R., Tarau, P., and Figa, E. PageRank on semantic networks with application to word sense disambiguation. COLING 2004.
- (Patwardhan, Banerjee and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using measures of semantic relatedness for word sense disambiguation. CICLing 2003.
- (Rada et al. 1989) Rada, R., Mili, H., Bicknell, E., and Blettner, M. Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man, and Cybernetics, 19(1), 1989.
- (Resnik 1993) Resnik, P. Selection and Information: A Class-Based Approach to Lexical Relationships. University of Pennsylvania, 1993.
- (Resnik 1995) Resnik, P. Using information content to evaluate semantic similarity. IJCAI 1995.
- (Vasilescu, Langlais and Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.
- (Yarowsky 1993) Yarowsky, D. One sense per collocation. ARPA Workshop 1993.
40. Part 4: Supervised Methods of Word Sense Disambiguation
- This section has been deleted
41. Part 5: Minimally Supervised Methods for Word Sense Disambiguation
42. Outline
- Task definition
- What does minimally supervised mean?
- Bootstrapping algorithms
- Co-training
- Self-training
- Yarowsky algorithm
- Using the Web for Word Sense Disambiguation
- Web as a corpus
- Web as collective mind
43. Task Definition
- Supervised WSD: learning sense classifiers starting with annotated data
- Minimally supervised WSD: learning sense classifiers from annotated data, with minimal human supervision
- Examples
  - Automatically bootstrap a corpus starting with a few human-annotated examples
  - Use monosemous relatives / dictionary definitions to automatically construct sense-tagged data
44. Outline
- Task definition
- What does minimally supervised mean?
- Bootstrapping algorithms
- Co-training
- Self-training
- Yarowsky algorithm
- Using the Web for Word Sense Disambiguation
- Web as a corpus
- Web as collective mind
45. Bootstrapping Recipe
- Ingredients
- (Some) labeled data
- (Large amounts of) unlabeled data
- (One or more) basic classifiers
- Output
- Classifier that improves over the basic
classifiers
46. Example contexts for the ambiguous word "plant"
- "building the only atomic plant"
- "plant growth is retarded"
- "a herb or flowering plant"
- "a nuclear power plant"
- "building a new vehicle plant"
- "the animal and plant life"
- "the passion-fruit plant"
- Seed labels: plant#1 (living), as in "plants and animals"; plant#2 (factory), as in "industry plant"
47. Co-training / Self-training
- A set L of labeled training examples
- A set U of unlabeled examples
- Classifiers Ci
- 1. Create a pool of examples U'
  - choose P random examples from U
- 2. Loop for I iterations
  - Train Ci on L and label U'
  - Select the G most confident examples and add them to L
    - maintain the class distribution in L
  - Refill U' with examples from U
    - keep U' at constant size P
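A minimal self-training version of this loop, sketched with scikit-learn; the classifier choice and the P/G/iteration constants are illustrative, and the "maintain distribution in L" step is omitted for brevity:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def self_train(X_l, y_l, X_u, P=100, G=10, iters=5):
        rng = np.random.default_rng(0)
        clf = LogisticRegression(max_iter=1000)
        X_u = list(X_u)                   # residual unlabeled set U
        for _ in range(iters):
            clf.fit(X_l, y_l)
            if not X_u:
                break
            # 1. U': P random examples drawn from U
            idx = rng.choice(len(X_u), size=min(P, len(X_u)), replace=False)
            pool = np.array([X_u[i] for i in idx])
            # 2. label U' and keep the G most confident examples
            proba = clf.predict_proba(pool)
            top = np.argsort(proba.max(axis=1))[-G:]
            X_l = np.vstack([X_l, pool[top]])
            y_l = np.concatenate([y_l, clf.classes_[proba.argmax(axis=1)[top]]])
            # 3. drop the newly labeled examples from U; U' is refilled
            # by the next random draw
            for i in sorted(idx[top], reverse=True):
                X_u.pop(i)
        return clf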
48. Co-training
- (Blum and Mitchell 1998)
- Two classifiers
  - independent views
  - the independence condition can be relaxed
- Co-training in Natural Language Learning
  - Statistical parsing (Sarkar 2001)
  - Co-reference resolution (Ng and Cardie 2003)
  - Part-of-speech tagging (Clark, Curran and Osborne 2003)
  - ...
49. Self-training
- (Nigam and Ghani 2000)
- One single classifier
- Retrain on its own output
- Self-training for Natural Language Learning
  - Part-of-speech tagging (Clark, Curran and Osborne 2003)
  - Co-reference resolution (Ng and Cardie 2003)
    - several classifiers through bagging
50. Yarowsky Algorithm
- (Yarowsky 1995)
- Similar to co-training
- Relies on two heuristics and a decision list
- One sense per collocation
  - Nearby words provide strong and consistent clues as to the sense of a target word
- One sense per discourse
  - The sense of a target word is highly consistent within a single document
51. Learning Algorithm
- A decision list is used to classify instances of the target word
  - e.g. "the loss of animal and plant species through extinction"
- Classification is based on the highest-ranking rule that matches the target context

  LogL   Collocation                   Sense
  9.31   flower (within +/- k words)   A (living)
  9.24   job (within +/- k words)      B (factory)
  9.03   fruit (within +/- k words)    A (living)
  9.02   plant species                 A (living)
  ...    ...                           ...
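Decision-list rules can be ranked by a smoothed log-likelihood ratio over seed counts; a sketch with invented counts (the slide's LogL values come from a large corpus, not from these toy numbers):

    from math import log

    def log_likelihood(count_a, count_b, alpha=0.1):
        """Yarowsky-style rule score: absolute log ratio of the two
        sense probabilities given the collocation, with small-count
        smoothing so zero counts stay finite."""
        return abs(log((count_a + alpha) / (count_b + alpha)))

    # counts of each collocation with sense A (living) / B (factory)
    rules = {"flower within +/-k words": (110, 0),
             "job within +/-k words":    (0, 101),
             "plant species":            (82, 0)}

    for r in sorted(rules, key=lambda r: log_likelihood(*rules[r]),
                    reverse=True):
        print(round(log_likelihood(*rules[r]), 2), r)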
52. Bootstrapping Algorithm
- Sense-A: life
- Sense-B: factory
- All occurrences of the target word are identified
- A small training set of seed data is tagged with word sense
53. Bootstrapping Algorithm
- The seed set grows and the residual set shrinks.
54. Bootstrapping Algorithm
- Convergence: stop when the residual set stabilizes
55. Bootstrapping Algorithm
- Iterative procedure
  - Train the decision list algorithm on the seed set
  - Classify the residual data with the decision list
  - Create a new seed set by identifying samples that are tagged with a probability above a certain threshold
  - Retrain the classifier on the new seed set
- Selecting training seeds
  - The initial training set should accurately distinguish among the possible senses
  - Strategies
    - Select a single, defining seed collocation for each possible sense
      - E.g. "life" and "manufacturing" for the target "plant"
    - Use words from dictionary definitions
    - Hand-label the most frequent collocates
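Put together, the procedure is a short loop; a schematic sketch, where train_decision_list and classify are hypothetical helpers standing in for the pieces above, not a library API:

    def yarowsky_bootstrap(seed, residual, threshold=0.95, max_iters=20):
        """Grow the seed set until the residual set stabilizes.
        seed: list of (context, sense) pairs; residual: list of contexts."""
        for _ in range(max_iters):
            dlist = train_decision_list(seed)           # hypothetical helper
            confident, rest = [], []
            for context in residual:
                sense, prob = classify(dlist, context)  # hypothetical helper
                (confident if prob >= threshold else rest).append(
                    (context, sense))
            if not confident:                           # residual stabilized
                break
            seed = seed + confident
            residual = [c for c, _ in rest]
        return dlist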
56. Evaluation
- Test corpus extracted from a 460 million word corpus of multiple sources (news articles, transcripts, novels, etc.)
- Performance of multiple models compared with
  - supervised decision lists
  - the unsupervised learning algorithm of Schütze (1992), based on alignment of clusters with word senses

  Word     Senses             Supervised   Unsupervised (Schütze)   Unsupervised (Bootstrapping)
  plant    living/factory     97.7%        92%                      98.6%
  space    volume/outer       93.9%        90%                      93.6%
  tank     vehicle/container  97.1%        95%                      96.5%
  motion   legal/physical     98.0%        92%                      97.9%
  Avg.     -                  96.1%        92.2%                    96.5%
57. The Web as a Corpus (this topic has been deleted)
- Use the Web as a large textual corpus
  - Build annotated corpora using monosemous relatives
  - Bootstrap annotated corpora starting with a few seeds
    - Similar to (Yarowsky 1995)
- Use the (semi-)automatically tagged data to train WSD classifiers
58. References
- (Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002.
- (Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998.
- (Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of the ACL 2002 workshop on WSD.
- (Clark, Curran and Osborne 2003) Clark, S., Curran, J.R., and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003.
- (Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999.
- (Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002.
- (Mihalcea 2004) Mihalcea, R. Co-training and self-training for word sense disambiguation. Proceedings of CoNLL 2004.
- (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003.
- (Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000.
- (Sarkar 2001) Sarkar, A. Applying co-training methods to statistical parsing. Proceedings of NAACL 2001.
- (Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.
59. Part 6: Unsupervised Methods of Word Sense Discrimination
60. Outline
- What is Unsupervised Learning?
- Task Definition
- Agglomerative Clustering
- LSI/LSA
- Sense Discrimination Using Parallel Texts
61. What is Unsupervised Learning?
- Unsupervised learning identifies patterns in a large sample of data, without the benefit of any manually labeled examples or external knowledge sources
- These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with any other
- Note! If you remove the manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning
  - Supervised classification identifies features that trigger a sense tag
  - Unsupervised clustering finds similarity between contexts
62. Task Definition
- Word sense discrimination reduces to the problem of finding the occurrences of a target word that appear in the most similar contexts and placing them in the same cluster
63. Agglomerative Clustering
- Create a similarity matrix of the instances to be discriminated
  - Results in a symmetric instance-by-instance matrix, where each cell contains the similarity score between a pair of instances
  - Typically a first-order representation, where similarity is based on the features observed in the pair of instances
- Apply an agglomerative clustering algorithm to the matrix
  - To start, each instance is its own cluster
  - Form a cluster from the most similar pair of instances
  - Repeat until the desired number of clusters is obtained
- Advantages: high-quality clustering
- Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons
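A sketch with scikit-learn; as a simplification it clusters toy feature vectors directly rather than starting from a precomputed instance-by-instance similarity matrix:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Rows are feature vectors for contexts of an ambiguous word
    # (toy 2-feature data; real systems use word co-occurrence features).
    X = np.array([[1, 0], [1, 1], [0, 5], [1, 6]])

    labels = AgglomerativeClustering(n_clusters=2).fit_predict(X)
    print(labels)   # e.g. [0 0 1 1]: two discovered "sense" groups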
64. Measuring Similarity (you don't need to know these)
- Integer values
  - Matching coefficient
  - Jaccard coefficient
  - Dice coefficient
- Real values
  - Cosine
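For reference, the coefficients above in plain Python (sets of context features for the integer-valued measures, real-valued vectors for cosine):

    from math import sqrt

    def matching(a, b):   # |A intersect B|
        return len(a & b)

    def jaccard(a, b):    # |A intersect B| / |A union B|
        return len(a & b) / len(a | b)

    def dice(a, b):       # 2 |A intersect B| / (|A| + |B|)
        return 2 * len(a & b) / (len(a) + len(b))

    def cosine(u, v):     # dot(u, v) / (|u| |v|)
        dot = sum(x * y for x, y in zip(u, v))
        return dot / (sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v)))

    a, b = {"sit", "chair", "seat"}, {"chair", "seat", "take"}
    print(matching(a, b), jaccard(a, b), dice(a, b))   # 2 0.5 0.666...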
65. Evaluation of Unsupervised Methods
- If sense-tagged text is available, it can be used for evaluation
  - Assume that the sense tags represent the true clusters, and compare these to the discovered clusters
  - Find the mapping of clusters to senses that attains maximum accuracy
- Pseudo-words are especially useful, since sense-tagged data is hard to find
  - Pick two words or names from a corpus and conflate them into one name. Then see how well you can discriminate.
  - http://www.d.umn.edu/~kulka020/kanaghaName.html
- Baseline algorithm: group all instances into one cluster; this will reach accuracy equal to the majority classifier
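The cluster-to-sense mapping that maximizes accuracy can be found with the Hungarian algorithm over the cluster/sense confusion matrix; a sketch with invented counts:

    import numpy as np
    from scipy.optimize import linear_sum_assignment

    # Rows: discovered clusters; columns: true senses. Cell (i, j)
    # counts instances of cluster i whose gold tag is sense j.
    confusion = np.array([[40,  5],
                          [10, 45]])

    rows, cols = linear_sum_assignment(-confusion)   # negate to maximize
    accuracy = confusion[rows, cols].sum() / confusion.sum()
    print(accuracy)   # 0.85 under the best cluster-to-sense mapping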
66. Sense Discrimination Using Parallel Texts
- There is controversy as to what exactly a "word sense" is (e.g., Kilgarriff, 1997)
- It is sometimes unclear how fine-grained sense distinctions need to be to be useful in practice
- Parallel text may present a solution to both problems!
  - Text in one language and its translation into another
- Resnik and Yarowsky (1997) suggest that word sense disambiguation concern itself with sense distinctions that manifest themselves across languages
  - A "bill" in English may be a "pico" (bird beak) or a "cuenta" (invoice) in Spanish.
67. Parallel Text
- Parallel text can be found on the Web, and there are several large corpora available (e.g., UN Parallel Text, Canadian Hansards)
- Manual annotation of sense tags is not required! However, the text must be word-aligned (translations identified between the two languages).
- http://www.cs.unt.edu/~rada/wpt/
  - Workshop on Parallel Text, NAACL 2003
- Given word-aligned parallel text, sense distinctions can be discovered (e.g., Li and Li, 2002; Diab, 2002)
68. References
- (Diab, 2002) Diab, M. and Resnik, P. An unsupervised method for word sense tagging using parallel corpora. Proceedings of ACL 2002.
- (Firth, 1957) A synopsis of linguistic theory 1930-1955. In Studies in Linguistic Analysis, Oxford University Press, Oxford.
- (Kilgarriff, 1997) "I don't believe in word senses". Computers and the Humanities, 31, pp. 91-113.
- (Li and Li, 2002) Word translation disambiguation using bilingual bootstrapping. Proceedings of ACL 2002, pp. 343-351.
- (McQuitty, 1966) Similarity analysis by reciprocal pairs for discrete and continuous data. Educational and Psychological Measurement, 26, pp. 825-831.
- (Miller and Charles, 1991) Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), pp. 1-28.
- (Pedersen and Bruce, 1997) Distinguishing word senses in untagged text. Proceedings of EMNLP-2, pp. 197-207.
- (Purandare and Pedersen, 2004) Word sense discrimination by clustering contexts in vector and similarity spaces. Proceedings of CoNLL 2004, pp. 41-48.
- (Resnik and Yarowsky, 1997) A perspective on word sense disambiguation methods and their evaluation. ACL-SIGLEX Workshop "Tagging Text with Lexical Semantics", pp. 79-86.
- (Schütze, 1998) Automatic word sense discrimination. Computational Linguistics, 24(1), pp. 97-123.
69. Outline (most of this section deleted)
- Where to get the required ingredients?
- Machine Readable Dictionaries
- Machine Learning Algorithms
- Sense Annotated Data
- Raw Data
- Where to get WSD software?
- How to get your algorithms tested?
- Senseval
70. Senseval
- Evaluation of WSD systems: http://www.senseval.org
  - Senseval-1: 1999, about 10 teams
  - Senseval-2: 2001, about 30 teams
  - Senseval-3: 2004, about 55 teams
  - Senseval-4: 2007 (?)
- Provides sense-annotated data for many languages, for several tasks
  - Languages: English, Romanian, Chinese, Basque, Spanish, etc.
  - Tasks: lexical sample, all words, etc.
- Provides evaluation software
- Provides results of other participating systems
71. Thank You!
- Rada Mihalcea (rada@cs.unt.edu)
- http://www.cs.unt.edu/~rada
- Ted Pedersen (tpederse@d.umn.edu)
- http://www.d.umn.edu/~tpederse