Supervised Methods of Word Sense Disambiguation - PowerPoint PPT Presentation

About This Presentation
Title:

Supervised Methods of Word Sense Disambiguation

Description:

Title: Notebook Author: Rada Last modified by: umdcs Created Date: 9/17/2004 4:54:54 AM Document presentation format: On-screen Show Company: UNT Other titles – PowerPoint PPT presentation

Slides: 32
Provided by: Rada54

Transcript and Presenter's Notes

Title: Supervised Methods of Word Sense Disambiguation


1
  • Part 4
  • Supervised Methods of Word Sense Disambiguation

2
Outline
  • What is Supervised Learning?
  • Task Definition
  • Single Classifiers
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

3
What is Supervised Learning?
  • Collect a set of examples that illustrate the
    various possible classifications or outcomes of
    an event.
  • Identify patterns in the examples associated with
    each particular class of the event.
  • Generalize those patterns into rules.
  • Apply the rules to classify a new event.

4
Learn from these examples: when do I go to the store?

Day   Go to Store?   Hot Outside?   Slept Well?   Ate Well?
1     YES            YES            NO            NO
2     NO             YES            NO            YES
3     YES            NO             NO            NO
4     NO             NO             NO            YES
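
The four steps of supervised learning can be tried on the table above. The following is only a minimal sketch: it assumes the "rule" is a one-feature lookup (predict the majority answer for each feature value), and the names `rule_accuracy`, `scores`, and `best_feature` are hypothetical, not part of the original slides.

```python
# Toy training data transcribed from the table above.
# Each row: (go_to_store, hot_outside, slept_well, ate_well)
examples = [
    ("YES", "YES", "NO", "NO"),
    ("NO",  "YES", "NO", "YES"),
    ("YES", "NO",  "NO", "NO"),
    ("NO",  "NO",  "NO", "YES"),
]
feature_names = ["Hot Outside?", "Slept Well?", "Ate Well?"]


def rule_accuracy(feature_index):
    """Training accuracy of the best rule that looks at one feature only."""
    correct = 0
    for value in ("YES", "NO"):
        labels = [row[0] for row in examples if row[feature_index + 1] == value]
        if labels:
            # Predict the majority label observed for this feature value.
            correct += max(labels.count("YES"), labels.count("NO"))
    return correct / len(examples)


scores = {name: rule_accuracy(i) for i, name in enumerate(feature_names)}
best_feature = max(scores, key=scores.get)
print(best_feature, scores)
```

On these four days, "Ate Well?" separates the examples perfectly (go to the store exactly when the answer is NO), while the other two features do no better than chance.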
5
Outline
  • What is Supervised Learning?
  • Task Definition
  • Single Classifiers
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

6
Task Definition
  • Supervised WSD: a class of methods that induces a
    classifier from manually sense-tagged text using
    machine learning techniques.
  • Resources
      • Sense-tagged text
      • Dictionary (implicit source of sense inventory)
      • Syntactic analysis (POS tagger, chunker, parser,
        etc.)
  • Scope
      • Typically one target word per context
      • Part of speech of target word resolved
      • Lends itself to lexical sample formulation
  • Reduces WSD to a classification problem where a
    target word is assigned the most appropriate
    sense from a given set of possibilities, based on
    the context in which it occurs.

7
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I cant walk there.
8
Two Bags of Words (Co-occurrences in the window of context)
FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
9
Simple Supervised Approach
  • Given a sentence S containing bank:
  • For each word Wi in S:
      • If Wi is in FINANCIAL_BANK_BAG then
        Sense_1 = Sense_1 + 1
      • If Wi is in RIVER_BANK_BAG then
        Sense_2 = Sense_2 + 1
  • If Sense_1 > Sense_2 then print "Financial"
  • else if Sense_2 > Sense_1 then print "River"
  • else print "Can't Decide"
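
The counting procedure on this slide can be sketched in Python as follows. The two bags are copied from the previous slide; the tokenization (whitespace split, trailing punctuation stripped, case-sensitive matching) and the function name `disambiguate_bank` are assumptions made for illustration.

```python
FINANCIAL_BANK_BAG = set(
    "a an and are ATM Bonnie card charges check Clyde criminals deposit "
    "famous for get I much My new overdraft really robbers the they think "
    "to too two went were".split()
)
RIVER_BANK_BAG = set(
    "a an and big campus cant catfish East got grandfather great has his I "
    "in is Minnesota Mississippi muddy My of on planted pole pretty right "
    "River The the there University walk West".split()
)


def disambiguate_bank(sentence):
    """Count overlaps with each bag and report the sense with more hits."""
    sense_1 = 0  # financial
    sense_2 = 0  # river
    for word in sentence.split():
        word = word.strip(".,!?")  # assumed, very rough tokenization
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"


print(disambiguate_bank("I need to deposit a check at the bank"))
print(disambiguate_bank("The bank is muddy."))
```

Words that appear in both bags (a, an, and, I, My, the) add to both counters and so never change the decision; only sense-specific words such as deposit or muddy matter.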

10
Supervised Methodology
  • Create a sample of training data where a given
    target word is manually annotated with a sense
    from a predetermined set of possibilities.
      • One tagged word per instance (lexical sample
        disambiguation)
  • Select a set of features with which to represent
    context.
      • Co-occurrences, collocations, POS tags, verb-obj
        relations, etc.
  • Convert sense-tagged training instances to
    feature vectors.
  • Apply a machine learning algorithm to induce a
    classifier.
      • Form: structure or relation among features
      • Parameters: strength of feature interactions
  • Convert a held-out sample of test data into
    feature vectors.
      • Correct sense tags are known but not used
  • Apply classifier to test instances to assign a
    sense tag.
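
The "convert instances to feature vectors" steps might look like the sketch below. It assumes binary bag-of-words features, lowercased whitespace tokens, and two invented training contexts; `build_vocabulary` and `to_feature_vector` are hypothetical helper names.

```python
def build_vocabulary(contexts, target="bank"):
    """Collect every context word (except the target) seen in training."""
    return sorted({w for c in contexts for w in c.lower().split() if w != target})


def to_feature_vector(context, vocab):
    """Binary vector: 1 if the vocabulary word occurs in the context, else 0."""
    words = set(context.lower().split())
    return [1 if w in words else 0 for w in vocab]


# Illustrative sense-tagged training instances: (context, sense number).
training = [
    ("my bank charges too much", 1),
    ("the bank is pretty muddy", 2),
]

vocab = build_vocabulary([context for context, _ in training])
X_train = [to_feature_vector(context, vocab) for context, _ in training]
y_train = [sense for _, sense in training]

# A held-out test instance uses the same vocabulary; its sense tag is not used.
x_test = to_feature_vector("the muddy bank", vocab)
print(vocab)
print(X_train, y_train, x_test)
```

The vocabulary is fixed from the training data, so words that only occur in test contexts are simply ignored.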

11
Outline
  • What is Supervised Learning?
  • Task Definition
  • Naïve Bayesian Classifier
  • Decision Lists and Trees
  • Ensembles of Classifiers

12
Naïve Bayesian Classifier
  • Naïve Bayesian Classifier well known in Machine
    Learning community for good performance across a
    range of tasks (e.g., Domingos and Pazzani, 1997)
  • Word Sense Disambiguation is no exception
  • Assumes conditional independence among features,
    given the sense of a word.
  • The form of the model is assumed, but parameters
    are estimated from training instances
  • When applied to WSD, features are often a bag of
    words that come from the training data
  • Usually thousands of binary features that
    indicate if a word is present in the context of
    the target word (or not)

13
Bayesian Inference
  • Given observed features, what is most likely
    sense?
  • Estimate probability of observed features given
    sense
  • Estimate unconditional probability of sense
  • Unconditional probability of features is a
    normalizing term, doesn't affect sense
    classification
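
Written out, the inference described in these bullets is Bayes' rule followed by dropping the normalizing term. The notation below (features F_1 ... F_n, chosen sense \hat{S}) is assumed for illustration; the last step additionally uses the conditional independence assumption of the Naïve Bayesian Classifier.

```latex
\begin{align*}
\hat{S} &= \arg\max_{S} P(S \mid F_1, \ldots, F_n) \\
        &= \arg\max_{S} \frac{P(F_1, \ldots, F_n \mid S)\, P(S)}{P(F_1, \ldots, F_n)} \\
        &= \arg\max_{S} P(F_1, \ldots, F_n \mid S)\, P(S) \\
        &= \arg\max_{S} P(S) \prod_{i=1}^{n} P(F_i \mid S)
\end{align*}
```

The denominator is the same for every sense, which is why it can be removed without changing which sense wins.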

14
Naïve Bayesian Model
15
The Naïve Bayesian Classifier
  • Given 2,000 instances of bank: 1,500 for bank/1
    (financial sense) and 500 for bank/2 (river
    sense)
      • P(S1) = 1,500/2,000 = .75
      • P(S2) = 500/2,000 = .25
  • Given credit occurs 200 times with bank/1 and 4
    times with bank/2:
      • P(F1=credit) = 204/2,000 = .102
      • P(F1=credit | S1) = 200/1,500 = .133
      • P(F1=credit | S2) = 4/500 = .008
  • Given a test instance that has one feature,
    credit:
      • P(S1 | F1=credit) = (.133 × .75)/.102 = .978
      • P(S2 | F1=credit) = (.008 × .25)/.102 = .020
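
The arithmetic on this slide can be reproduced directly from the counts. This is a sketch using only the numbers given above; the variable names are illustrative.

```python
# Counts from the slide.
n_total = 2000
n_s1, n_s2 = 1500, 500          # instances of bank/1 and bank/2
credit_s1, credit_s2 = 200, 4   # instances containing "credit", per sense

# Priors and likelihoods estimated by relative frequency.
p_s1 = n_s1 / n_total
p_s2 = n_s2 / n_total
p_credit = (credit_s1 + credit_s2) / n_total
p_credit_given_s1 = credit_s1 / n_s1
p_credit_given_s2 = credit_s2 / n_s2

# Bayes' rule for a test instance whose only feature is "credit".
posterior_s1 = p_credit_given_s1 * p_s1 / p_credit
posterior_s2 = p_credit_given_s2 * p_s2 / p_credit

print(round(posterior_s1, 3), round(posterior_s2, 3))
```

Without intermediate rounding the posterior for bank/1 is 200/204, about .980; the slide's .978 comes from rounding the likelihood to .133 before multiplying. Either way, bank/1 is chosen.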

16
Comparative Results
  • (Leacock et al., 1993) compared Naïve Bayes with
    a Neural Network and a Context Vector approach
    when disambiguating six senses of line
  • (Mooney, 1996) compared Naïve Bayes with a Neural
    Network, Decision Tree/List Learners, Disjunctive
    and Conjunctive Normal Form learners, and a
    perceptron when disambiguating six senses of
    line
  • (Pedersen, 1998) compared Naïve Bayes with
    Decision Tree, Rule Based Learner, Probabilistic
    Model, etc. when disambiguating line and 12 other
    words
  • All found that Naïve Bayesian Classifier
    performed as well as any of the other methods!

17
Outline
  • What is Supervised Learning?
  • Task Definition
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

18
Decision Lists and Trees
  • Very widely used in Machine Learning.
  • Decision trees used very early for WSD research
    (e.g., Kelly and Stone, 1975; Black, 1988).
  • Represent disambiguation problem as a series of
    questions (presence of feature) that reveal the
    sense of a word.
  • List decides between two senses after one
    positive answer
  • Tree allows for decision among multiple senses
    after a series of answers
  • Uses a smaller, more refined set of features than
    bag of words and Naïve Bayes.
  • More descriptive and easier to interpret.

19
Decision List for WSD (Yarowsky, 1994)
  • Identify collocational features from sense-tagged
    data.
  • Word immediately to the left or right of target
      • I have my bank/1 statement.
      • The river bank/2 is muddy.
  • Pair of words to immediate left or right of
    target
      • The world's richest bank/1 is here in New York.
      • The river bank/2 is muddy.
  • Words found within k positions to left or right
    of target, where k is often 10-50
      • My credit is just horrible because my bank/1 has
        made several mistakes with my account and the
        balance is very low.
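
The three kinds of collocational feature listed above could be extracted along these lines. This is a sketch only: the feature labels (`left_word`, `right_pair`, `in_window`, and so on), the function name, and the default k=10 are assumptions for illustration.

```python
def collocation_features(tokens, target_index, k=10):
    """Return a set of (feature_type, value) pairs around the target word."""
    features = set()
    last = len(tokens) - 1

    # Word immediately to the left or right of the target.
    if target_index > 0:
        features.add(("left_word", tokens[target_index - 1]))
    if target_index < last:
        features.add(("right_word", tokens[target_index + 1]))

    # Pair of words to the immediate left or right of the target.
    if target_index > 1:
        features.add(("left_pair", tuple(tokens[target_index - 2:target_index])))
    if target_index < last - 1:
        features.add(("right_pair", tuple(tokens[target_index + 1:target_index + 3])))

    # Any word within k positions of the target.
    lo = max(0, target_index - k)
    hi = min(len(tokens), target_index + k + 1)
    for i in range(lo, hi):
        if i != target_index:
            features.add(("in_window", tokens[i]))
    return features


tokens = "the river bank is muddy".split()
features = collocation_features(tokens, tokens.index("bank"))
print(sorted(features))
```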

20
Building the Decision List
  • Sort order of collocation tests using log of
    conditional probabilities.
  • Words most indicative of one sense (and not the
    other) will be ranked highly.

21
Computing DL score
  • Given 2,000 instances of bank: 1,500 for bank/1
    (financial sense) and 500 for bank/2 (river
    sense)
      • P(S1) = 1,500/2,000 = .75
      • P(S2) = 500/2,000 = .25
  • Given credit occurs 200 times with bank/1 and 4
    times with bank/2:
      • P(F1=credit) = 204/2,000 = .102
      • P(F1=credit | S1) = 200/1,500 = .133
      • P(F1=credit | S2) = 4/500 = .008
  • From Bayes' Rule:
      • P(S1 | F1=credit) = (.133 × .75)/.102 = .978
      • P(S2 | F1=credit) = (.008 × .25)/.102 = .020
  • DL Score = abs(log(.978/.020)) = 3.89
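
The score on this slide can be checked with a few lines of Python. The sketch assumes the natural logarithm, since that is what reproduces 3.89 from the two posteriors; `dl_score` is a hypothetical helper name.

```python
import math


def dl_score(p_sense1_given_feature, p_sense2_given_feature):
    """Absolute log-ratio of the two sense probabilities for one feature."""
    return abs(math.log(p_sense1_given_feature / p_sense2_given_feature))


# Posteriors for the feature "credit", as rounded on the slide.
score = dl_score(0.978, 0.020)
print(round(score, 2))
```

Taking the absolute value makes the score symmetric: a feature that strongly indicates bank/2 ranks as highly as one that strongly indicates bank/1. A feature never seen with one of the senses would give a zero probability and an undefined ratio, so some form of smoothing is needed in practice.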

22
Using the Decision List
  • Sort by DL-score, then go through the test instance
    looking for a matching feature. The first match
    reveals the sense.

DL-score   Feature               Sense
3.89       credit within bank    Bank/1 financial
2.20       bank is muddy         Bank/2 river
1.09       pole within bank      Bank/2 river
0.00       of the bank           N/A
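
Applying the list above to a test instance might be sketched as follows. The scores and features come from the table; how each feature is tested is an assumption ("within bank" is treated as "anywhere in the sentence", and multi-word features as exact phrases), and the helper names are hypothetical.

```python
def contains_word(word):
    """Test: the word occurs anywhere in the (assumed) context window."""
    return lambda tokens: word in tokens


def contains_phrase(phrase):
    """Test: the words of the phrase occur consecutively in the context."""
    target = phrase.split()
    n = len(target)
    return lambda tokens: any(
        tokens[i:i + n] == target for i in range(len(tokens) - n + 1)
    )


# (DL-score, test, sense); None means the feature does not reveal a sense.
decision_list = [
    (3.89, contains_word("credit"), "bank/1 (financial)"),
    (2.20, contains_phrase("bank is muddy"), "bank/2 (river)"),
    (1.09, contains_word("pole"), "bank/2 (river)"),
    (0.00, contains_phrase("of the bank"), None),
]


def classify(sentence):
    """Walk the list from highest to lowest score; the first match decides."""
    tokens = sentence.lower().replace(".", " ").replace(",", " ").split()
    ranked = sorted(decision_list, key=lambda entry: entry[0], reverse=True)
    for _score, test, sense in ranked:
        if test(tokens):
            return sense
    return None  # no feature matched


print(classify("The bank is muddy."))
print(classify("He left his pole at the bank after checking his credit."))
```

In the second example both "pole" and "credit" match, but "credit" has the higher score and is tested first, so the financial sense is returned.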
23
Using the Decision List
24
Learning a Decision Tree
  • Identify the feature that most cleanly divides
    the training data into the known senses.
      • "Cleanly" is measured by information gain or gain
        ratio.
  • Create subsets of training data according to
    feature values.
  • Find another feature that most cleanly divides a
    subset of the training data.
  • Continue until each subset of training data is
    pure or as clean as possible.
  • Well-known decision tree learning algorithms
    include ID3 and C4.5 (Quinlan, 1986, 1993)
  • In Senseval-1 a modified decision list (which
    supported some conditional branching) was most
    accurate for English Lexical Sample task
    (Yarowsky, 2000)
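
Information gain, the measure named above, can be illustrated with the bank/credit counts used earlier in the deck. This is a sketch: it assumes the counts refer to instances (so 1,300 bank/1 and 496 bank/2 instances do not contain credit), and the function names are hypothetical.

```python
import math


def entropy(counts):
    """Entropy in bits of a sense distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(parent_counts, subsets):
    """Entropy of the parent minus the size-weighted entropy of its subsets."""
    total = sum(parent_counts)
    remainder = sum(sum(s) / total * entropy(s) for s in subsets)
    return entropy(parent_counts) - remainder


parent = [1500, 500]          # bank/1, bank/2 over all 2,000 instances
with_credit = [200, 4]        # instances where "credit" is present
without_credit = [1300, 496]  # derived: 1500 - 200 and 500 - 4

gain = information_gain(parent, [with_credit, without_credit])
print(round(entropy(parent), 3), gain)
```

A tree learner would compute this gain for every candidate feature, split on the one with the highest value, and then repeat the process inside each resulting subset.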

25
Supervised WSD with Individual Classifiers
  • Most supervised Machine Learning algorithms have
    been applied to Word Sense Disambiguation; most
    work reasonably well.
  • Features tend to differentiate among methods more
    than the learning algorithms.
  • Good sets of features tend to include:
      • Co-occurrences or keywords (global)
      • Collocations (local)
      • Bigrams (local and global)
      • Part of speech (local)
      • Predicate-argument relations
          • Verb-object, subject-verb, etc.
      • Heads of Noun and Verb Phrases

26
Convergence of Results
  • Accuracy of different systems applied to the same
    data tends to converge on a particular value, no
    one system shockingly better than another.
      • Senseval-1: a number of systems in range of
        74-78% accuracy for English Lexical Sample task.
      • Senseval-2: a number of systems in range of
        61-64% accuracy for English Lexical Sample task.
      • Senseval-3: a number of systems in range of
        70-73% accuracy for English Lexical Sample task.
  • What to do next?

27
Outline
  • What is Supervised Learning?
  • Task Definition
  • Naïve Bayesian Classifiers
  • Decision Lists and Trees
  • Ensembles of Classifiers

28
Ensembles of Classifiers
  • Classifier error has two components (Bias and
    Variance)
      • Some algorithms (e.g., decision trees) try to
        build a representation of the training data: Low
        Bias/High Variance
      • Others (e.g., Naïve Bayes) assume a parametric
        form and don't represent the training data: High
        Bias/Low Variance
  • Combining classifiers with different bias/variance
    characteristics can lead to improved
    overall accuracy
  • Bagging a decision tree can smooth out the
    effect of small variations in the training data
    (Breiman, 1996)
      • Sample with replacement from the training data to
        learn multiple decision trees.
      • Outliers in training data will tend to be
        obscured/eliminated.
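
The bagging procedure can be sketched in a few lines. To stay short, the sketch swaps in a one-word decision stump for a full decision tree; the toy training instances, the stump learner, the number of models, and all function names are assumptions for illustration.

```python
import random
from collections import Counter


def train_stump(sample):
    """One-level 'tree': pick the single word test with the fewest errors."""
    default = Counter(sense for _, sense in sample).most_common(1)[0][0]
    best = None  # (errors, word, sense_if_present, sense_if_absent)
    for word in sorted({w for words, _ in sample for w in words}):
        present = Counter(s for ws, s in sample if word in ws)
        absent = Counter(s for ws, s in sample if word not in ws)
        sense_p = present.most_common(1)[0][0]
        sense_a = absent.most_common(1)[0][0] if absent else default
        errors = sum(
            1 for ws, s in sample if (sense_p if word in ws else sense_a) != s
        )
        if best is None or errors < best[0]:
            best = (errors, word, sense_p, sense_a)
    if best is None:  # no words at all: always predict the majority sense
        return lambda words: default
    _, word, sense_p, sense_a = best
    return lambda words: sense_p if word in words else sense_a


def bagged_classifier(data, n_models=11, seed=0):
    """Train each model on a bootstrap sample, then combine by majority vote."""
    rng = random.Random(seed)
    models = [
        train_stump([rng.choice(data) for _ in data])  # sample with replacement
        for _ in range(n_models)
    ]

    def predict(words):
        votes = Counter(model(words) for model in models)
        return votes.most_common(1)[0][0]

    return predict


# Invented toy instances: (set of context words, sense).
data = [
    ({"charges", "overdraft"}, "bank/1"),
    ({"deposit", "check", "atm"}, "bank/1"),
    ({"robbers", "criminals"}, "bank/1"),
    ({"muddy", "walk"}, "bank/2"),
    ({"pole", "catfish"}, "bank/2"),
    ({"river", "campus"}, "bank/2"),
]
predict = bagged_classifier(data)
print(predict({"muddy", "pole"}))
```

Because each bootstrap sample omits some instances and repeats others, an unusual training example influences only some of the models, and the vote dilutes its effect.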

29
Ensemble Considerations
  • Must choose different learning algorithms with
    significantly different bias/variance
    characteristics.
      • Naïve Bayesian Classifier versus Decision Tree
  • Must choose feature representations that yield
    significantly different (independent?) views of
    the training data.
      • Lexical versus syntactic features
  • Must choose how to combine classifiers.
      • Simple Majority Voting
      • Averaging of probabilities across multiple
        classifier output
      • Maximum Entropy combination (e.g., Klein et al.,
        2002)
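
The first two combination schemes can give different answers on the same instance, as the sketch below shows. The three probability distributions are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Each classifier casts one vote for its top sense."""
    return Counter(predictions).most_common(1)[0][0]


def average_probabilities(distributions):
    """Average each sense's probability across classifiers, then pick the max."""
    senses = set().union(*distributions)
    avg = {
        s: sum(d.get(s, 0.0) for d in distributions) / len(distributions)
        for s in senses
    }
    return max(avg, key=avg.get), avg


# Hypothetical outputs of three classifiers for one test instance.
outputs = [
    {"bank/1": 0.90, "bank/2": 0.10},
    {"bank/1": 0.40, "bank/2": 0.60},
    {"bank/1": 0.45, "bank/2": 0.55},
]

votes = [max(d, key=d.get) for d in outputs]
voted_sense = majority_vote(votes)
averaged_sense, averages = average_probabilities(outputs)
print(voted_sense, averaged_sense, averages)
```

Here two of three classifiers narrowly prefer bank/2, so voting picks bank/2, while averaging picks bank/1 because one classifier is very confident. Which behaviour is preferable depends on how well calibrated the individual classifiers are.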

30
Ensemble Results
  • (Pedersen, 2000) achieved state of the art for
    interest and line data using an ensemble of Naïve
    Bayesian Classifiers.
      • Many Naïve Bayesian Classifiers trained on
        varying sized windows of context / bags of words.
      • Classifiers combined by a weighted vote
  • (Florian and Yarowsky, 2002) achieved state of
    the art for Senseval-1 and Senseval-2 data using
    a combination of six classifiers.
      • Rich set of collocational and syntactic features.
      • Combined via linear combination of top three
        classifiers.
  • Many Senseval-2 and Senseval-3 systems employed
    ensemble methods.

31
References
  • (Black, 1988) E. Black (1988) An experiment in
    computational discrimination of English word
    senses. IBM Journal of Research and Development
    (32) pg. 185-194.
  • (Breiman, 1996) L. Breiman. (1996) The heuristics
    of instability in model selection. Annals of
    Statistics (24) pg. 2350-2383.
  • (Domingos and Pazzani, 1997) P. Domingos and M.
    Pazzani. (1997) On the Optimality of the Simple
    Bayesian Classifier under Zero-One Loss, Machine
    Learning (29) pg. 103-130.
  • (Domingos, 2000) P. Domingos. (2000) A Unified
    Bias Variance Decomposition for Zero-One and
    Squared Loss. In Proceedings of AAAI. Pg.
    564-569.
  • (Florian and Yarowsky, 2002) R. Florian and D.
    Yarowsky. (2002) Modeling Consensus: Classifier
    Combination for Word Sense Disambiguation. In
    Proceedings of EMNLP, pp 25-32.
  • (Kelly and Stone, 1975). E. Kelly and P. Stone.
    (1975) Computer Recognition of English Word
    Senses, North Holland Publishing Co., Amsterdam.
  • (Klein et al., 2002) D. Klein, K. Toutanova, H.
    Tolga Ilhan, S. Kamvar, and C. Manning. (2002)
    Combining Heterogeneous Classifiers for Word-Sense
    Disambiguation. In Proceedings of Senseval-2. pg.
    87-89.
  • (Leacock et al., 1993) C. Leacock, J. Towell, E.
    Voorhees. (1993) Corpus based statistical sense
    resolution. In Proceedings of the ARPA Workshop
    on Human Language Technology. pg. 260-265.
  • (Mooney, 1996) R. Mooney. (1996) Comparative
    experiments on disambiguating word senses: An
    illustration of the role of bias in machine
    learning. In Proceedings of EMNLP. pg. 82-91.
  • (Pedersen, 1998) T. Pedersen. (1998) Learning
    Probabilistic Models of Word Sense
    Disambiguation. Ph.D. Dissertation. Southern
    Methodist University.
  • (Pedersen, 2000) T. Pedersen (2000) A simple
    approach to building ensembles of Naive Bayesian
    classifiers for word sense disambiguation. In
    Proceedings of NAACL.
  • (Quinlan, 1986) J.R. Quinlan. (1986) Induction
    of Decision Trees. Machine Learning (1). pg.
    81-106.
  • (Quinlan, 1993) J.R. Quinlan. (1993) C4.5:
    Programs for Machine Learning. San Francisco,
    Morgan Kaufmann.
  • (Yarowsky, 1994) D. Yarowsky. (1994) Decision
    lists for lexical ambiguity resolution:
    Application to accent restoration in Spanish and
    French. In Proceedings of ACL. pp. 88-95.
  • (Yarowsky, 2000) D. Yarowsky. (2000)
    Hierarchical decision lists for word sense
    disambiguation. Computers and the Humanities, 34.