1
Advances in Word Sense Disambiguation
  • Tutorial at AAAI-2005
  • July 9, 2005
  • Rada Mihalcea
  • University of North Texas
  • http://www.cs.unt.edu/~rada
  • Ted Pedersen
  • University of Minnesota, Duluth
  • http://www.d.umn.edu/~tpederse
  • Note: slides have been modified/deleted/added
  • For those interested in lexical semantics, I suggest getting the entire tutorial

2
Definitions
  • Word sense disambiguation is the problem of
    selecting a sense for a word from a set of
    predefined possibilities.
  • Sense Inventory usually comes from a dictionary
    or thesaurus.
  • Knowledge intensive methods, supervised learning,
    and (sometimes) bootstrapping approaches
  • Word sense discrimination is the problem of
    dividing the usages of a word into different
    meanings, without regard to any particular
    existing sense inventory.
  • Unsupervised techniques

3
Word Senses
  • The meaning of a word in a given context
  • Word sense representations
  • With respect to a dictionary
  • chair = a seat for one person, with a support for the back: "he put his coat over the back of the chair and sat down"
  • chair = the position of professor: "he was awarded an endowed chair in economics"
  • With respect to the translation in a second language
  • chair = chaise
  • chair = directeur
  • With respect to the context where it occurs (discrimination)
  • Sit on a chair / Take a seat on this chair
  • The chair of the Math Department / The chair of the meeting

4
Approaches to Word Sense Disambiguation
  • Knowledge-Based Disambiguation
  • use of external lexical resources such as
    dictionaries and thesauri
  • discourse properties
  • Supervised Disambiguation
  • based on a labeled training set
  • the learning system has
  • a training set of feature-encoded inputs AND
  • their appropriate sense label (category)
  • Unsupervised Disambiguation
  • based on unlabeled corpora
  • The learning system has
  • a training set of feature-encoded inputs BUT
  • NOT their appropriate sense label (category)

5
All Words Word Sense Disambiguation
  • Attempt to disambiguate all open-class words in a
    text
  • He put his suit over the back of the chair
  • Use information from dictionaries
  • Definitions / Examples for each meaning
  • Find similarity between definitions and current
    context
  • Position in a semantic network
  • Find that table is closer to chair/furniture
    than to chair/person
  • Use discourse properties
  • A word exhibits the same sense in a discourse /
    in a collocation

6
All Words Word Sense Disambiguation
  • Minimally supervised approaches
  • Learn to disambiguate words using small annotated
    corpora
  • E.g. SemCor corpus where all open class words
    are disambiguated
  • 200,000 running words
  • Most frequent sense

7
Targeted Word Sense Disambiguation (we saw this
in the previous lecture notes)
  • Disambiguate one target word
  • Take a seat on this chair
  • The chair of the Math Department
  • WSD is viewed as a typical classification problem
  • use machine learning techniques to train a system
  • Training
  • Corpus of occurrences of the target word, each
    occurrence annotated with appropriate sense
  • Build feature vectors
  • a vector of relevant linguistic features that represents the context (e.g., a window of words around the target word)
  • Disambiguation
  • Disambiguate the target word in new unseen text

8
Unsupervised Disambiguation
  • Disambiguate word senses
  • without supporting tools such as dictionaries and
    thesauri
  • without a labeled training text
  • Without such resources, word senses are not
    labeled
  • We cannot say chair/furniture or chair/person
  • We can
  • Cluster/group the contexts of an ambiguous word
    into a number of groups
  • Discriminate between these groups without
    actually labeling them

9
Unsupervised Disambiguation
  • Hypothesis: same senses of words will have similar neighboring words
  • Disambiguation algorithm
  • Identify context vectors corresponding to all
    occurrences of a particular word
  • Partition them into regions of high density
  • Assign a sense to each such region
  • Sit on a chair
  • Take a seat on this chair
  • The chair of the Math Department
  • The chair of the meeting

10
Bounds on Performance
  • Upper and Lower Bounds on Performance
  • Measure of how well an algorithm performs
    relative to the difficulty of the task.
  • Upper Bound
  • Human performance
  • Around 97-99% with few and clearly distinct senses
  • Inter-judge agreement
  • With words with clear, distinct senses: 95% and up
  • With polysemous words with related senses: 65-70%
  • Lower Bound (or baseline)
  • The assignment of a random sense / the most frequent sense
  • 90% is excellent for a word with 2 equiprobable senses
  • 90% is trivial for a word with 2 senses with probability ratios of 9 to 1

11
References
  • (Gale, Church and Yarowsky 1992) Gale, W.,
    Church, K., and Yarowsky, D. Estimating upper and
    lower bounds on the performance of word-sense
    disambiguation programs ACL 1992.
  • (Miller et. al., 1994) Miller, G., Chodorow, M.,
    Landes, S., Leacock, C., and Thomas, R. Using a
    semantic concordance for sense identification.
    ARPA Workshop 1994.
  • (Miller, 1995) Miller, G. Wordnet A lexical
    database. ACM, 38(11) 1995.
  • (Senseval) Senseval evaluation exercises
    http//www.senseval.org

12
Part 3: Knowledge-based Methods for Word Sense Disambiguation
13
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

14
Task Definition
  • Knowledge-based WSD = class of WSD methods relying (mainly) on knowledge drawn from dictionaries and/or raw text
  • Resources
  • Yes
  • Machine Readable Dictionaries
  • Raw corpora
  • No
  • Manually annotated corpora
  • Scope
  • All open-class words

15
Machine Readable Dictionaries
  • In recent years, most dictionaries made available
    in Machine Readable format (MRD)
  • Oxford English Dictionary
  • Collins
  • Longman Dictionary of Contemporary English (LDOCE)
  • Thesauruses add synonymy information
  • Roget's Thesaurus
  • Semantic networks add more semantic relations
  • WordNet
  • EuroWordNet

16
MRD: A Resource for Knowledge-based WSD
  • For each word in the language vocabulary, an MRD
    provides
  • A list of meanings
  • Definitions (for all word meanings)
  • Typical usage examples (for most word meanings)

17
MRD: A Resource for Knowledge-based WSD
  • A thesaurus adds
  • An explicit synonymy relation between word
    meanings
  • A semantic network adds
  • Hypernymy/hyponymy (IS-A), meronymy (PART-OF),
    antonymy, entailment, etc.

WordNet synsets for the noun "plant":
  1. plant, works, industrial plant
  2. plant, flora, plant life
WordNet related concepts for the meaning "plant life" (plant, flora, plant life):
  hypernym: organism, being
  hyponym: house plant, fungus, ...
  meronym: plant tissue, plant part
  holonym: Plantae, kingdom Plantae, plant kingdom
18
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

19
Lesk Algorithm
  • (Michael Lesk 1986) Identify senses of words in
    context using definition overlap
  • Algorithm
  • Retrieve from MRD all sense definitions of the
    words to be disambiguated
  • Determine the definition overlap for all possible
    sense combinations
  • Choose senses that lead to highest overlap
  • Example: disambiguate PINE CONE
  • PINE
  • 1. kinds of evergreen tree with needle-shaped
    leaves
  • 2. waste away through sorrow or illness
  • CONE
  • 1. solid body which narrows to a point
  • 2. something of this shape whether solid or
    hollow
  • 3. fruit of certain evergreen trees

Pine#1 ∩ Cone#1 = 0    Pine#2 ∩ Cone#1 = 0
Pine#1 ∩ Cone#2 = 1    Pine#2 ∩ Cone#2 = 0
Pine#1 ∩ Cone#3 = 2    Pine#2 ∩ Cone#3 = 0
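
Not part of the original slides: a minimal Python sketch of the pairwise overlap computation above, reusing the PINE/CONE glosses from this slide. The stopword list and the trailing-"s" stemming are simplifying assumptions, so the toy counts may not reproduce every number above, but it selects the same highest-overlap pair.

```python
# Minimal sketch (not from the slides) of the original Lesk pairwise overlap,
# using the PINE / CONE glosses shown above. Tokenization is deliberately crude.

STOPWORDS = {"of", "or", "a", "the", "which", "this", "whether", "through",
             "with", "to", "away"}

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}
CONE = {1: "solid body which narrows to a point",
        2: "something of this shape whether solid or hollow",
        3: "fruit of certain evergreen trees"}

def content_words(gloss):
    # lowercase, drop stopwords, strip trailing "s" as a crude stand-in for stemming
    return {w.rstrip("s") for w in gloss.lower().split() if w not in STOPWORDS}

def overlap(gloss_a, gloss_b):
    return len(content_words(gloss_a) & content_words(gloss_b))

scores = {(i, j): overlap(p, c) for i, p in PINE.items() for j, c in CONE.items()}
best_pair = max(scores, key=scores.get)
print(scores)      # overlap count for every sense combination
print(best_pair)   # (1, 3): pine#1 with cone#3, the highest-overlap pair
```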
20
Lesk Algorithm A Simplified Version
  • Original Lesk (definition): measure overlap between sense definitions for all words in context
  • Identify simultaneously the correct senses for all words in context
  • Simplified Lesk (Kilgarriff & Rosenzweig 2000): measure overlap between sense definitions of a word and the current context
  • Identify the correct sense for one word at a time
  • Search space significantly reduced

21
Lesk Algorithm A Simplified Version
  • Algorithm for simplified Lesk
  • Retrieve from MRD all sense definitions of the
    word to be disambiguated
  • Determine the overlap between each sense
    definition and the current context
  • Choose the sense that leads to highest overlap
  • Example: disambiguate PINE in "Pine cones hanging in a tree"
  • PINE
  • 1. kinds of evergreen tree with needle-shaped
    leaves
  • 2. waste away through sorrow or illness

Pine#1 ∩ Sentence = 1    Pine#2 ∩ Sentence = 0
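
Again not from the slides: a sketch of simplified Lesk, which scores each sense of a single target word against the surrounding context rather than against the other words' glosses.

```python
# Minimal sketch (not from the slides) of simplified Lesk: score each sense of
# the target word by its gloss overlap with the surrounding context.

STOPWORDS = {"in", "a", "of", "or", "the", "with", "through"}

def content_words(text):
    return {w.rstrip("s") for w in text.lower().split() if w not in STOPWORDS}

def simplified_lesk(sense_glosses, context):
    ctx = content_words(context)
    scores = {s: len(content_words(g) & ctx) for s, g in sense_glosses.items()}
    return max(scores, key=scores.get), scores

PINE = {1: "kinds of evergreen tree with needle-shaped leaves",
        2: "waste away through sorrow or illness"}

sense, scores = simplified_lesk(PINE, "Pine cones hanging in a tree")
print(sense, scores)   # sense 1 wins: "tree" occurs in both gloss and context
```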
22
Evaluations of Lesk Algorithm
  • Initial evaluation by M. Lesk
  • 50-70% on short samples of manually annotated text, with respect to the Oxford Advanced Learner's Dictionary
  • Evaluation on Senseval-2 all-words data, with back-off to random sense (Mihalcea & Tarau 2004)
  • Original Lesk: 35%
  • Simplified Lesk: 47%
  • Evaluation on Senseval-2 all-words data, with back-off to most frequent sense (Vasilescu, Langlais, Lapalme 2004)
  • Original Lesk: 42%
  • Simplified Lesk: 58%

23
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Preferences
  • Measures of Semantic Similarity
  • Heuristic-based Methods

24
Selectional Preferences
  • A way to constrain the possible meanings of words
    in a given context
  • E.g. Wash a dish vs. Cook a dish
  • WASH-OBJECT vs. COOK-FOOD
  • Capture information about possible relations
    between semantic classes
  • Common sense knowledge
  • Alternative terminology
  • Selectional Restrictions
  • Selectional Preferences
  • Selectional Constraints

25
Acquiring Selectional Preferences
  • From annotated corpora
  • We saw this in the previous lecture notes
  • From raw corpora
  • Frequency counts
  • Information theory measures
  • Class-to-class relations

26
Preliminaries Learning Word-to-Word Relations
  • An indication of the semantic fit between two
    words
  • 1. Frequency counts: pairs of words connected by a syntactic relation
  • 2. Conditional probabilities: condition on one of the words
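
A small illustrative sketch (not from the slides) of both measures over (verb, object) pairs; the pairs themselves are invented and would normally be extracted from a parsed raw corpus.

```python
# Sketch (not from the slides) of word-to-word selectional fit from raw counts.
from collections import Counter

# invented (verb, object) pairs; in practice these come from a parsed corpus
pairs = [("wash", "dish"), ("wash", "car"), ("wash", "dish"),
         ("cook", "dish"), ("cook", "meal"), ("cook", "dinner")]

pair_counts = Counter(pairs)
verb_counts = Counter(v for v, _ in pairs)

def p_obj_given_verb(obj, verb):
    # conditional probability P(object | verb), conditioning on the verb
    return pair_counts[(verb, obj)] / verb_counts[verb]

print(pair_counts[("wash", "dish")])      # 2: raw frequency count
print(p_obj_given_verb("dish", "wash"))   # 0.666...: conditional probability
```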

27
Learning Selectional Preferences (1) (p. 14 in Chapter 19; you won't be responsible for this formula)
  • Word-to-class relations (Resnik 1993)
  • Quantify the contribution of a semantic class using all the concepts subsumed by that class
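
The formula image did not survive the slide export. For reference, Resnik's selectional preference strength and selectional association are usually written as below; this is reconstructed from Resnik (1993), not copied from the slide.

```latex
% Selectional preference strength (a KL divergence) and selectional association,
% as commonly stated in Resnik (1993); reconstruction, not taken from the slide.
S_R(v)   = \sum_{c} P(c \mid v)\,\log \frac{P(c \mid v)}{P(c)}
\qquad
A_R(v,c) = \frac{1}{S_R(v)}\; P(c \mid v)\,\log \frac{P(c \mid v)}{P(c)}
```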

28
Outline
  • Task definition
  • Machine Readable Dictionaries
  • Algorithms based on Machine Readable Dictionaries
  • Selectional Restrictions
  • Measures of Semantic Similarity
  • Heuristic-based Methods

29
Semantic Similarity
  • Words in a discourse must be related in meaning for the discourse to be coherent (Halliday and Hasan, 1976)
  • Use this property for WSD Identify related
    meanings for words that share a common context

30
See Figure 19.6 in the chapter
  • Basic idea the shorter the path between two
    senses in a semantic network, the more similar
    they are.
  • So, you can see that nickel, dime are closer to
    budget than they are to Richter scale (in Figure
    19.6)

31
Semantic Similarity Metrics (1)
  • Input two concepts (same part of speech)
  • Output similarity measure
  • (Leacock and Chodorow 1998): Similarity(C1,C2) = -log( pathlength(C1,C2) / (2D) ), where D is the taxonomy depth
  • E.g. Similarity(wolf,dog) = 0.60; Similarity(wolf,bear) = 0.42
  • (Resnik 1995)
  • Define information content IC(C) = -log P(C), where P(C) = probability of seeing a concept of type C in a large corpus
  • Probability of seeing a concept = probability of seeing instances of that concept

32
Semantic Similarity Metrics (2)
  • Similarity using information content
  • (Resnik 1995): define similarity between two concepts as the information content of their Least Common Subsumer (LCS): sim(C1,C2) = IC(LCS(C1,C2))
  • Alternatives (Jiang and Conrath 1997)
  • Other metrics
  • Similarity using information content (Lin 1998)
  • Similarity using gloss-based paths across
    different hierarchies (Mihalcea and Moldovan
    1999)
  • Conceptual density measure between noun semantic
    hierarchies and current context (Agirre and Rigau
    1995)
  • Adapted Lesk algorithm (Banerjee and Pedersen
    2002)
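
A toy sketch (not from the slides) of two of these measures, the Leacock-Chodorow path measure and Resnik's information-content measure, over an invented mini taxonomy; real systems use WordNet and corpus counts, so the numbers here are illustrative only.

```python
# Sketch (not from the slides) of path-based and IC-based similarity over a toy IS-A tree.
import math

PARENT = {"dog": "canine", "wolf": "canine", "canine": "carnivore",
          "bear": "carnivore", "carnivore": "animal", "animal": None}
DEPTH = 4  # taxonomy depth D of this toy hierarchy

def path_to_root(c):
    path = [c]
    while PARENT[c] is not None:
        c = PARENT[c]
        path.append(c)
    return path

def path_length(c1, c2):
    p1, p2 = path_to_root(c1), path_to_root(c2)
    for i, node in enumerate(p1):
        if node in p2:
            return i + p2.index(node) + 1   # nodes on the shortest connecting path
    raise ValueError("no common ancestor")

def lch_similarity(c1, c2):
    # Leacock & Chodorow: -log( pathlen / (2 * D) )
    return -math.log(path_length(c1, c2) / (2 * DEPTH))

# Resnik: information content of the least common subsumer (LCS)
P = {"dog": 0.05, "wolf": 0.01, "canine": 0.06, "bear": 0.02,
     "carnivore": 0.10, "animal": 0.5}   # invented corpus probabilities

def lcs(c1, c2):
    p1 = path_to_root(c1)
    for node in path_to_root(c2):
        if node in p1:
            return node

def resnik_similarity(c1, c2):
    return -math.log(P[lcs(c1, c2)])      # IC(LCS) = -log P(LCS)

print(lch_similarity("wolf", "dog"), lch_similarity("wolf", "bear"))
print(resnik_similarity("wolf", "dog"), resnik_similarity("wolf", "bear"))
```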

33
Most Frequent Sense (1)
  • Identify the most often used meaning and use this
    meaning by default
  • Word meanings exhibit a Zipfian distribution
  • E.g. distribution of word senses in SemCor

Example: plant/flora is used more often than plant/factory => annotate any instance of PLANT as plant/flora
34
Most Frequent Sense (2) (you aren't responsible for this)
  • Method 1 Find the most frequent sense in an
    annotated corpus
  • Method 2 Find the most frequent sense using a
    method based on distributional similarity
    (McCarthy et al. 2004)
  • 1. Given a word w, find the top k
    distributionally similar words
  • Nw = {n1, n2, ..., nk}, with associated similarity scores dss(w,n1), dss(w,n2), ..., dss(w,nk)
  • 2. For each sense wsi of w, identify the
    similarity with the words nj, using the sense of
    nj that maximizes this score
  • 3. Rank senses wsi of w based on the total
    similarity score

35
Most Frequent Sense (3)
  • Word senses
  • pipe#1 = tobacco pipe
  • pipe#2 = tube of metal or plastic
  • Distributionally similar words
  • N = {tube, cable, wire, tank, hole, cylinder, fitting, tap, ...}
  • For each word in N, find similarity with pipe#i (using the sense that maximizes the similarity)
  • pipe#1 - tube (sense 3): 0.3
  • pipe#2 - tube (sense 1): 0.6
  • Compute score for each sense pipe#i
  • score(pipe#1) = 0.25
  • score(pipe#2) = 0.73
  • Note: results depend on the corpus used to find distributionally similar words => can find domain-specific predominant senses (see the sketch below)
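
A sketch (not from the slides) of the bookkeeping behind this ranking. The distributional scores and most of the similarity values are invented; only the two pipe-tube similarities come from the slide, so the absolute scores differ from 0.25/0.73, but the ranking comes out the same.

```python
# Sketch (not from the slides) of McCarthy et al. (2004)-style predominant-sense
# ranking for "pipe". dss(w, n_j) and sim(sense, n_j) values are partly from the
# slide, partly invented, just to show how the scores are combined.

dss = {"tube": 0.3, "cable": 0.2, "wire": 0.2}          # assumed dss(w, n_j)
sim = {                                                  # assumed sim(pipe_i, n_j)
    ("pipe1", "tube"): 0.3, ("pipe2", "tube"): 0.6,      # these two are from the slide
    ("pipe1", "cable"): 0.2, ("pipe2", "cable"): 0.4,
    ("pipe1", "wire"): 0.2, ("pipe2", "wire"): 0.5,
}
senses = ["pipe1", "pipe2"]

def prevalence(sense):
    score = 0.0
    for neighbor, d in dss.items():
        total = sum(sim[(s, neighbor)] for s in senses)
        # each neighbor votes with its distributional weight, split across senses
        score += d * sim[(sense, neighbor)] / total
    return score

ranking = sorted(senses, key=prevalence, reverse=True)
print({s: round(prevalence(s), 2) for s in senses}, ranking)
# pipe2 (tube of metal/plastic) ranks first, as on the slide
```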

36
One Sense Per Discourse
  • A word tends to preserve its meaning across all its occurrences in a given discourse (Gale, Church, Yarowsky 1992)
  • What does this mean?
  • Evaluation
  • 8 words with two-way ambiguity, e.g. plant, crane, etc.
  • 98% of the two-word occurrences in the same discourse carry the same meaning
  • The grain of salt: performance depends on granularity
  • (Krovetz 1998) experiments with words with more than two senses
  • Performance of one sense per discourse measured on SemCor is approx. 70%

E.g. the ambiguous word PLANT occurs 10 times in a discourse: all instances of "plant" carry the same meaning
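
A minimal sketch (not from the slides) of how the heuristic is typically applied: re-label every occurrence of the word in a discourse with the document-level majority sense predicted by some base classifier. The predictions below are invented.

```python
# Sketch (not from the slides) of the one-sense-per-discourse heuristic.
from collections import Counter

# (word, predicted_sense) pairs for one discourse; the predictions are assumed
predictions = [("plant", "flora"), ("plant", "flora"), ("plant", "factory"),
               ("plant", "flora")]

majority = Counter(sense for _, sense in predictions).most_common(1)[0][0]
smoothed = [(w, majority) for w, _ in predictions]
print(smoothed)   # every occurrence of "plant" now carries the majority sense
```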
37
One Sense per Collocation
  • A word tends to preserve its meaning when used in
    the same collocation (Yarowsky 1993)
  • Strong for adjacent collocations
  • Weaker as the distance between words increases
  • An example
  • Evaluation
  • 97% precision on words with two-way ambiguity
  • Finer granularity
  • (Martinez and Agirre 2000) tested the one sense per collocation hypothesis on text annotated with WordNet senses
  • 70% precision on SemCor words

The ambiguous word PLANT preserves its meaning in
all its occurrences within the collocation
industrial plant, regardless of the context
where this collocation occurs
38
References
  • (Agirre and Rigau, 1995) Agirre, E. and Rigau, G.
    A proposal for word sense disambiguation using
    conceptual distance. RANLP 1995.  
  • (Agirre and Martinez 2001) Agirre, E. and
    Martinez, D. Learning class-to-class selectional
    preferences. CONLL 2001.
  •  (Banerjee and Pedersen 2002) Banerjee, S. and
    Pedersen, T. An adapted Lesk algorithm for word
    sense disambiguation using WordNet. CICLING 2002.
  • (Cowie, Guthrie and Guthrie 1992) Cowie, J., Guthrie, J. A., and Guthrie, L. Lexical disambiguation using simulated annealing. COLING 1992.
  • (Gale, Church and Yarowsky 1992) Gale, W.,
    Church, K., and Yarowsky, D. One sense per
    discourse. DARPA workshop 1992.
  • (Halliday and Hasan 1976) Halliday, M. and Hasan,
    R., (1976). Cohesion in English. Longman.
  • (Galley and McKeown 2003) Galley, M. and McKeown,
    K. (2003) Improving word sense disambiguation in
    lexical chaining. IJCAI 2003
  • (Hirst and St-Onge 1998) Hirst, G. and St-Onge, D. Lexical chains as representations of context in the detection and correction of malapropisms. WordNet: An electronic lexical database, MIT Press.
  • (Jiang and Conrath 1997) Jiang, J. and Conrath,
    D. Semantic similarity based on corpus statistics
    and lexical taxonomy. COLING 1997.
  • (Krovetz, 1998) Krovetz, R. More than one sense
    per discourse. ACL-SIGLEX 1998.
  • (Lesk, 1986) Lesk, M. Automatic sense
    disambiguation using machine readable
    dictionaries How to tell a pine cone from an ice
    cream cone. SIGDOC 1986.
  • (Lin 1998) Lin, D. An information-theoretic definition of similarity. ICML 1998.

39
References
  • (Martinez and Agirre 2000) Martinez, D. and
    Agirre, E. One sense per collocation and
    genre/topic variations. EMNLP 2000.
  • (Miller et al., 1994) Miller, G., Chodorow, M., Landes, S., Leacock, C., and Thomas, R. Using a semantic concordance for sense identification. ARPA Workshop 1994.
  • (Miller, 1995) Miller, G. WordNet: A lexical database. Communications of the ACM, 38(11), 1995.
  • (Mihalcea and Moldovan, 1999) Mihalcea, R. and
    Moldovan, D. A method for word sense
    disambiguation of unrestricted text. ACL 1999.
  • (Mihalcea and Moldovan 2000) Mihalcea, R. and
    Moldovan, D. An iterative approach to word sense
    disambiguation. FLAIRS 2000.
  • (Mihalcea, Tarau, Figa 2004) R. Mihalcea, P.
    Tarau, E. Figa PageRank on Semantic Networks with
    Application to Word Sense Disambiguation, COLING
    2004.
  • (Patwardhan, Banerjee, and Pedersen 2003) Patwardhan, S., Banerjee, S., and Pedersen, T. Using Measures of Semantic Relatedness for Word Sense Disambiguation. CICLING 2003.
  • (Rada et al 1989) Rada, R. and Mili, H. and
    Bicknell, E. and Blettner, M. Development and
    application of a metric on semantic nets. IEEE
    Transactions on Systems, Man, and Cybernetics,
    19(1) 1989.
  • (Resnik 1993) Resnik, P. Selection and
    Information A Class-Based Approach to Lexical
    Relationships. University of Pennsylvania 1993.  
  • (Resnik 1995) Resnik, P. Using information
    content to evaluate semantic similarity. IJCAI
    1995.
  • (Vasilescu, Langlais, Lapalme 2004) Vasilescu, F., Langlais, P., and Lapalme, G. Evaluating variants of the Lesk approach for disambiguating words. LREC 2004.
  • (Yarowsky, 1993) Yarowsky, D. One sense per
    collocation. ARPA Workshop 1993.

40
  • Part 4: Supervised Methods of Word Sense Disambiguation
  • This section has been deleted

41
Part 5: Minimally Supervised Methods for Word Sense Disambiguation
42
Outline
  • Task definition
  • What does minimally supervised mean?
  • Bootstrapping algorithms
  • Co-training
  • Self-training
  • Yarowsky algorithm
  • Using the Web for Word Sense Disambiguation
  • Web as a corpus
  • Web as collective mind

43
Task Definition
  • Supervised WSD: learning sense classifiers starting with annotated data
  • Minimally supervised WSD: learning sense classifiers from annotated data, with minimal human supervision
  • Examples
  • Automatically bootstrap a corpus starting with a
    few human annotated examples
  • Use monosemous relatives / dictionary definitions
    to automatically construct sense tagged data

44
Outline
  • Task definition
  • What does minimally supervised mean?
  • Bootstrapping algorithms
  • Co-training
  • Self-training
  • Yarowsky algorithm
  • Using the Web for Word Sense Disambiguation
  • Web as a corpus
  • Web as collective mind

45
Bootstrapping Recipe
  • Ingredients
  • (Some) labeled data
  • (Large amounts of) unlabeled data
  • (One or more) basic classifiers
  • Output
  • Classifier that improves over the basic
    classifiers

46
Example occurrences of the ambiguous word "plant":
  ... building the only atomic plant ...
  ... plant growth is retarded ...
  ... a herb or flowering plant ...
  ... a nuclear power plant ...
  ... building a new vehicle plant ...
  ... the animal and plant life ...
  ... the passion-fruit plant ...
Seed collocations: "plants and animals" (sense 1), "industry plant" (sense 2)
47
Co-training/Self-training
  • A set L of labeled training examples
  • A set U of unlabeled examples
  • Classifiers Ci
  • 1. Create a pool of examples U'
  • choose P random examples from U
  • 2. Loop for I iterations
  • Train Ci on L and label U'
  • Select G most confident examples and add to L
  • maintain distribution in L
  • Refill U' with examples from U
  • keep U' at constant size P
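
A compact sketch (not from the slides) of this loop. `train` and `predict` are placeholders for any base learner; with one classifier the loop is self-training, and with two classifiers trained on independent feature views it becomes co-training. Maintaining the label distribution in L is skipped for brevity.

```python
# Sketch (not from the slides) of the co-/self-training loop outlined above.
# `predict` is assumed to return a (label, confidence) pair for one example.
import random

def bootstrap(L, U, train, predict, pool_size=100, grow=10, iterations=5):
    """L: list of (example, label) pairs; U: list of unlabeled examples."""
    U = list(U)
    random.shuffle(U)
    pool, U = U[:pool_size], U[pool_size:]            # 1. pool U' of P examples
    for _ in range(iterations):                       # 2. loop for I iterations
        model = train(L)                              #    train on L, label U'
        labeled = [(x,) + predict(model, x) for x in pool]
        labeled.sort(key=lambda t: t[2], reverse=True)
        confident = labeled[:grow]                    #    G most confident examples
        L = L + [(x, y) for x, y, _ in confident]     #    add them to L
        chosen = {id(x) for x, _, _ in confident}
        pool = [x for x in pool if id(x) not in chosen]
        need = pool_size - len(pool)                  #    refill U' from U,
        pool, U = pool + U[:need], U[need:]           #    keeping it at size P
    return train(L)
```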

48
Co-training
  • (Blum and Mitchell 1998)
  • Two classifiers
  • independent views
  • independence condition can be relaxed
  • Co-training in Natural Language Learning
  • Statistical parsing (Sarkar 2001)
  • Co-reference resolution (Ng and Cardie 2003)
  • Part of speech tagging (Clark, Curran and Osborne
    2003)
  • ...

49
Self-training
  • (Nigam and Ghani 2000)
  • One single classifier
  • Retrain on its own output
  • Self-training for Natural Language Learning
  • Part of speech tagging (Clark, Curran and Osborne
    2003)
  • Co-reference resolution (Ng and Cardie 2003)
  • several classifiers through bagging

50
Yarowsky Algorithm
  • (Yarowsky 1995)
  • Similar to co-training
  • Relies on two heuristics and a decision list
  • One sense per collocation
  • Nearby words provide strong and consistent clues
    as to the sense of a target word
  • One sense per discourse
  • The sense of a target word is highly consistent
    within a single document

51
Learning Algorithm
  • A decision list is used to classify instances of
    target word

the loss of animal and plant species through
extinction
  • Classification is based on the highest ranking
    rule that matches the target context

LogL   Collocation                    Sense
9.31   flower (within +/- k words)    A (living)
9.24   job (within +/- k words)       B (factory)
9.03   fruit (within +/- k words)     A (living)
9.02   plant species                  A (living)
...    ...                            ...
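
A sketch (not from the slides) of decision-list classification: apply the highest-ranked rule whose collocation matches the context. The rules reuse the table above; the window size k = 5 is an assumption.

```python
# Sketch (not from the slides) of decision-list classification as used in the
# Yarowsky algorithm: the highest-ranked matching rule assigns the sense.

RULES = [  # (log-likelihood, collocation word, sense), already sorted by score
    (9.31, "flower",  "A (living)"),
    (9.24, "job",     "B (factory)"),
    (9.03, "fruit",   "A (living)"),
    (9.02, "species", "A (living)"),   # "plant species" reduced to one token here
]
K = 5  # assumed context window size

def classify(tokens, target_index):
    window = tokens[max(0, target_index - K): target_index + K + 1]
    for _, collocation, sense in RULES:   # RULES are ranked by LogL
        if collocation in window:
            return sense
    return None                            # fall back (e.g., to the seed sense)

sent = "the loss of animal and plant species through extinction".split()
print(classify(sent, sent.index("plant")))   # -> 'A (living)', via "species"
```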

52
Bootstrapping Algorithm
Sense-A: life
Sense-B: factory
  • All occurrences of the target word are identified
  • A small training set of seed data is tagged with
    word sense

53
Bootstrapping Algorithm
  • Seed set grows and residual set shrinks.

54
Bootstrapping Algorithm
  • Convergence: stop when the residual set stabilizes

55
Bootstrapping Algorithm
  • Iterative procedure
  • Train decision list algorithm on seed set
  • Classify residual data with decision list
  • Create new seed set by identifying samples that
    are tagged with a probability above a certain
    threshold
  • Retrain classifier on new seed set
  • Selecting training seeds
  • Initial training set should accurately
    distinguish among possible senses
  • Strategies
  • Select a single, defining seed collocation for
    each possible sense.
  • E.g. "life" and "manufacturing" for target "plant"
  • Use words from dictionary definitions
  • Hand-label most frequent collocates

56
Evaluation
  • Test corpus extracted from 460 million word
    corpus of multiple sources (news articles,
    transcripts, novels, etc.)
  • Performance of multiple models compared with
  • supervised decision lists
  • unsupervised learning algorithm of Schütze
    (1992), based on alignment of clusters with word
    senses

Word     Senses              Supervised   Unsupervised (Schütze)   Unsupervised Bootstrapping
plant    living/factory      97.7         92                       98.6
space    volume/outer        93.9         90                       93.6
tank     vehicle/container   97.1         95                       96.5
motion   legal/physical      98.0         92                       97.9
Avg.     -                   96.1         92.2                     96.5
(figures are accuracy, %)
57
The Web as a Corpus (this topic has been deleted)
  • Use the Web as a large textual corpus
  • Build annotated corpora using monosemous
    relatives
  • Bootstrap annotated corpora starting with few
    seeds
  • Similar to (Yarowsky 1995)
  • Use the (semi)automatically tagged data to train
    WSD classifiers

58
References
  • (Abney 2002) Abney, S. Bootstrapping. Proceedings
    of ACL 2002.
  • (Blum and Mitchell 1998) Blum, A. and Mitchell,
    T. Combining labeled and unlabeled data with
    co-training. Proceedings of COLT 1998.
  • (Chklovski and Mihalcea 2002) Chklovski, T. and
    Mihalcea, R. Building a sense tagged corpus with
    Open Mind Word Expert. Proceedings of ACL 2002
    workshop on WSD.
  • (Clark, Curran and Osborne 2003) Clark, S. and
    Curran, J.R. and Osborne, M. Bootstrapping POS
    taggers using unlabelled data. Proceedings of
    CoNLL 2003.
  • (Mihalcea 1999) Mihalcea, R. An automatic method
    for generating sense tagged corpora. Proceedings
    of AAAI 1999.
  • (Mihalcea 2002) Mihalcea, R. Bootstrapping large
    sense tagged corpora. Proceedings of LREC 2002.
  • (Mihalcea 2004) Mihalcea, R. Co-training and
    Self-training for Word Sense Disambiguation.
    Proceedings of CoNLL 2004.
  • (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly
    supervised natural language learning without
    redundant views. Proceedings of HLT-NAACL 2003.
  • (Nigam and Ghani 2000) Nigam, K. and Ghani, R.
    Analyzing the effectiveness and applicability of
    co-training. Proceedings of CIKM 2000.
  • (Sarkar 2001) Sarkar, A. Applying cotraining
    methods to statistical parsing. Proceedings of
    NAACL 2001.
  • (Yarowsky 1995) Yarowsky, D. Unsupervised word
    sense disambiguation rivaling supervised methods.
    Proceedings of ACL 1995.

59
  • Part 6: Unsupervised Methods of Word Sense Discrimination

60
Outline
  • What is Unsupervised Learning?
  • Task Definition
  • Agglomerative Clustering
  • LSI/LSA
  • Sense Discrimination Using Parallel Texts

61
What is Unsupervised Learning?
  • Unsupervised learning identifies patterns in a
    large sample of data, without the benefit of any
    manually labeled examples or external knowledge
    sources
  • These patterns are used to divide the data into clusters, where each member of a cluster has more in common with the other members of its own cluster than with the members of any other cluster
  • Note! If you remove manual labels from supervised data and cluster, you may not discover the same classes as in supervised learning
  • Supervised classification identifies features that trigger a sense tag
  • Unsupervised clustering finds similarity between contexts

62
Task Definition
  • Word sense discrimination reduces to the problem of finding the occurrences of a target word that appear in the most similar contexts and placing them in the same cluster

63
Agglomerative Clustering
  • Create a similarity matrix of instances to be
    discriminated
  • Results in a symmetric instance by instance
    matrix, where each cell contains the similarity
    score between a pair of instances
  • Typically a first order representation, where
    similarity is based on the features observed in
    the pair of instances
  • Apply Agglomerative Clustering algorithm to
    matrix
  • To start, each instance is its own cluster
  • Form a cluster from the most similar pair of
    instances
  • Repeat until the desired number of clusters is
    obtained
  • Advantages: high-quality clustering
  • Disadvantages: computationally expensive; must carry out exhaustive pairwise comparisons (see the sketch below)
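
A sketch (not from the slides) of average-link agglomerative clustering over a precomputed similarity matrix; the 4-instance matrix is invented.

```python
# Sketch (not from the slides) of average-link agglomerative clustering over a
# symmetric instance-by-instance similarity matrix.

def agglomerative(sim, k):
    """sim: similarity matrix (list of lists); k: desired number of clusters."""
    clusters = [[i] for i in range(len(sim))]          # each instance starts alone
    def cluster_sim(a, b):                              # average pairwise similarity
        return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))
    while len(clusters) > k:
        # find and merge the most similar pair of clusters (O(n^2) per step)
        pairs = [(cluster_sim(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = max(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# toy 4-instance matrix: instances 0/1 and 2/3 occur in similar contexts
SIM = [[1.0, 0.9, 0.1, 0.2],
       [0.9, 1.0, 0.2, 0.1],
       [0.1, 0.2, 1.0, 0.8],
       [0.2, 0.1, 0.8, 1.0]]
print(agglomerative(SIM, 2))   # -> [[0, 1], [2, 3]]
```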

64
Measuring Similarity (you don't need to know these)
  • Integer Values
  • Matching Coefficient
  • Jaccard Coefficient
  • Dice Coefficient
  • Real Values
  • Cosine
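
A sketch (not from the slides) of these coefficients: matching, Jaccard, and Dice over binary feature sets, and cosine over real-valued context vectors. The feature sets and vectors are invented.

```python
# Sketch (not from the slides) of the similarity coefficients listed above.
import math

def matching(a, b):   return len(a & b)
def jaccard(a, b):    return len(a & b) / len(a | b)
def dice(a, b):       return 2 * len(a & b) / (len(a) + len(b))

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

A, B = {"plant", "water", "growth"}, {"plant", "factory", "water"}
print(matching(A, B), jaccard(A, B), dice(A, B))   # 2, 0.5, 0.666...
print(cosine([1, 2, 0], [2, 1, 1]))                # real-valued context vectors
```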

65
Evaluation of Unsupervised Methods
  • If sense-tagged text is available, it can be used for evaluation
  • Assume that sense tags represent true clusters, and compare these to discovered clusters
  • Find the mapping of clusters to senses that attains maximum accuracy (see the sketch below)
  • Pseudo-words are especially useful, since it is hard to find data that has already been sense-discriminated
  • Pick two words or names from a corpus, and conflate them into one name. Then see how well you can discriminate.
  • http://www.d.umn.edu/~kulka020/kanaghaName.html
  • Baseline algorithm: group all instances into one cluster; this will reach accuracy equal to the majority classifier
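
A sketch (not from the slides) of the cluster-to-sense mapping step: try every assignment of senses to clusters and keep the one with the highest accuracy, which is feasible when the number of senses is small. The clusters and gold tags below are invented.

```python
# Sketch (not from the slides) of evaluating discovered clusters against sense
# tags by searching for the best cluster-to-sense mapping.
from itertools import permutations

def best_mapping_accuracy(cluster_ids, gold_senses, senses):
    # assumes (at most) as many clusters as senses; brute force over mappings
    n = len(gold_senses)
    best = 0.0
    for perm in permutations(senses):
        mapping = dict(zip(sorted(set(cluster_ids)), perm))
        correct = sum(mapping[c] == g for c, g in zip(cluster_ids, gold_senses))
        best = max(best, correct / n)
    return best

clusters = [0, 0, 1, 1, 1, 0]                          # discovered clusters
gold     = ["flora", "flora", "factory", "factory", "flora", "flora"]
print(best_mapping_accuracy(clusters, gold, ["flora", "factory"]))   # 0.833...
```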

66
Sense Discrimination Using Parallel Texts
  • There is controversy as to what exactly a word sense is (e.g., Kilgarriff, 1997)
  • It is sometimes unclear how fine-grained sense distinctions need to be in order to be useful in practice.
  • Parallel text may present a solution to both
    problems!
  • Text in one language and its translation into
    another
  • Resnik and Yarowsky (1997) suggest that word
    sense disambiguation concern itself with sense
    distinctions that manifest themselves across
    languages.
  • A bill in English may be a pico (bird jaw) or a cuenta (invoice) in Spanish.

67
Parallel Text
  • Parallel Text can be found on the Web and there
    are several large corpora available (e.g., UN
    Parallel Text, Canadian Hansards)
  • Manual annotation of sense tags is not required!
    However, text must be word aligned (translations
    identified between the two languages).
  • http://www.cs.unt.edu/~rada/wpt/
  • Workshop on Parallel Text, NAACL 2003
  • Given word-aligned parallel text, sense distinctions can be discovered (e.g., Li and Li, 2002; Diab, 2002)

68
References
  • (Diab, 2002) Diab, Mona and Philip Resnik, An
    Unsupervised Method for Word Sense Tagging using
    Parallel Corpora, Proceedings of ACL, 2002.
  • (Firth, 1957) A Synopsis of Linguistic Theory
    1930-1955. In Studies in Linguistic Analysis,
    Oxford University Press, Oxford.
  • (Kilgarriff, 1997) I don't believe in word senses. Computers and the Humanities (31), pp. 91-113.
  • (Li and Li, 2002) Word Translation Disambiguation
    Using Bilingual Bootstrapping. Proceedings of
    ACL. Pp. 343-351.
  • (McQuitty, 1966) Similarity Analysis by
    Reciprocal Pairs for Discrete and Continuous
    Data. Educational and Psychological Measurement
    (26) pp. 825-831.
  • (Miller and Charles, 1991) Contextual correlates
    of semantic similarity. Language and Cognitive
    Processes, 6 (1) pp. 1 - 28.
  • (Pedersen and Bruce, 1997) Distinguishing Word
    Sense in Untagged Text. In Proceedings of EMNLP2.
    pp 197-207.
  • (Purandare and Pedersen, 2004) Word Sense
    Discrimination by Clustering Contexts in Vector
    and Similarity Spaces. Proceedings of the
    Conference on Natural Language and Learning. pp.
    41-48.
  • (Resnik and Yarowsky, 1997) A Perspective on
    Word Sense Disambiguation Methods and their
    Evaluation. The ACL-SIGLEX Workshop Tagging Text
    with Lexical Semantics. pp. 79-86.
  • (Schutze, 1998) Automatic Word Sense
    Discrimination. Computational Linguistics, 24 (1)
    pp. 97-123.

69
Outline (most of this section deleted)
  • Where to get the required ingredients?
  • Machine Readable Dictionaries
  • Machine Learning Algorithms
  • Sense Annotated Data
  • Raw Data
  • Where to get WSD software?
  • How to get your algorithms tested?
  • Senseval

70
Senseval
  • Evaluation of WSD systems: http://www.senseval.org
  • Senseval-1: 1999, about 10 teams
  • Senseval-2: 2001, about 30 teams
  • Senseval-3: 2004, about 55 teams
  • Senseval-4: 2007 (?)
  • Provides sense annotated data for many languages,
    for several tasks
  • Languages English, Romanian, Chinese, Basque,
    Spanish, etc.
  • Tasks Lexical Sample, All words, etc.
  • Provides evaluation software
  • Provides results of other participating systems

71
Thank You!
  • Rada Mihalcea (rada@cs.unt.edu)
  • http://www.cs.unt.edu/~rada
  • Ted Pedersen (tpederse@d.umn.edu)
  • http://www.d.umn.edu/~tpederse