Title: Supervised Methods of Word Sense Disambiguation
1. Part 4: Supervised Methods of Word Sense Disambiguation
2. Outline
- What is Supervised Learning?
- Task Definition
- Single Classifiers
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
3. What is Supervised Learning?
- Collect a set of examples that illustrate the various possible classifications or outcomes of an event.
- Identify patterns in the examples associated with each particular class of the event.
- Generalize those patterns into rules.
- Apply the rules to classify a new event.
4. Learn from these examples: when do I go to the store?

Day  Go to Store?  Hot Outside?  Slept Well?  Ate Well?
1    YES           YES           NO           NO
2    NO            YES           NO           YES
3    YES           NO            NO           NO
4    NO            NO            NO           YES
5. Outline
- What is Supervised Learning?
- Task Definition
- Single Classifiers
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
6. Task Definition
- Supervised WSD: a class of methods that induce a classifier from manually sense-tagged text using machine learning techniques.
- Resources
  - Sense tagged text
  - Dictionary (implicit source of the sense inventory)
  - Syntactic analysis (POS tagger, chunker, parser, ...)
- Scope
  - Typically one target word per context
  - Part of speech of the target word resolved
  - Lends itself to the lexical sample formulation
- Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities, based on the context in which it occurs.
7. Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can't walk there.
8. Two Bags of Words (co-occurrences in the window of context)

FINANCIAL_BANK_BAG: a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were

RIVER_BANK_BAG: a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
9. Simple Supervised Approach
- Given a sentence S containing "bank":

  For each word Wi in S:
    If Wi is in FINANCIAL_BANK_BAG then
      Sense_1 = Sense_1 + 1
    If Wi is in RIVER_BANK_BAG then
      Sense_2 = Sense_2 + 1

  If Sense_1 > Sense_2 then print "Financial"
  else if Sense_2 > Sense_1 then print "River"
  else print "Can't Decide"
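The pseudocode above can be written as runnable Python. This is a minimal sketch: the two bags are abbreviated subsets of the bags on the previous slide, and the function name is my own.

```python
# A runnable sketch of the simple supervised approach. The bags here are
# abbreviated subsets of the two bags shown earlier.
FINANCIAL_BANK_BAG = {"ATM", "card", "charges", "check", "deposit",
                      "overdraft", "robbers", "went"}
RIVER_BANK_BAG = {"campus", "catfish", "muddy", "planted", "pole",
                  "River", "walk"}

def disambiguate(sentence):
    """Count overlaps with each sense's bag and report the winner."""
    sense_1 = sense_2 = 0
    for word in sentence.split():
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    elif sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(disambiguate("I went to the bank to deposit my check"))  # Financial
```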
10. Supervised Methodology
- Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.
  - One tagged word per instance: lexical sample disambiguation
- Select a set of features with which to represent context.
  - Co-occurrences, collocations, POS tags, verb-object relations, etc.
- Convert sense-tagged training instances to feature vectors.
- Apply a machine learning algorithm to induce a classifier.
  - Form: structure or relation among features
  - Parameters: strength of feature interactions
- Convert a held-out sample of test data into feature vectors.
  - Correct sense tags are known but not used.
- Apply the classifier to test instances to assign a sense tag.
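The "convert instances to feature vectors" step above can be sketched with binary bag-of-words features. The training instances and helper names below are illustrative, not from a real sense-tagged corpus.

```python
# A minimal sketch of converting sense-tagged instances into binary
# bag-of-words feature vectors. The tiny training set is made up.
def build_vocabulary(instances):
    """Collect every word seen in any training context, in sorted order."""
    return sorted({w for text, _ in instances for w in text.lower().split()})

def to_feature_vector(text, vocab):
    """1 if the vocabulary word occurs in this context, else 0."""
    words = set(text.lower().split())
    return [1 if w in words else 0 for w in vocab]

training = [
    ("my bank charges too much", "bank/1"),
    ("the river bank is muddy", "bank/2"),
]
vocab = build_vocabulary(training)
vectors = [(to_feature_vector(text, vocab), sense) for text, sense in training]
print(vocab)
print(vectors[0])
```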
11. Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifier
- Decision Lists and Trees
- Ensembles of Classifiers
12. Naïve Bayesian Classifier
- The Naïve Bayesian Classifier is well known in the machine learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997); word sense disambiguation is no exception.
- Assumes conditional independence among features, given the sense of a word.
- The form of the model is assumed, but parameters are estimated from training instances.
- When applied to WSD, features are often a bag of words that come from the training data.
- Usually thousands of binary features that indicate whether a word is present in the context of the target word (or not).
13. Bayesian Inference
- Given observed features, what is the most likely sense?
- Estimate the probability of the observed features given the sense.
- Estimate the unconditional probability of the sense.
- The unconditional probability of the features is a normalizing term; it doesn't affect sense classification.
14. Naïve Bayesian Model
15. The Naïve Bayesian Classifier
- Given 2,000 instances of "bank", 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
  - P(S1) = 1,500/2,000 = .75
  - P(S2) = 500/2,000 = .25
- Given that "credit" occurs 200 times with bank/1 and 4 times with bank/2:
  - P(F1=credit) = 204/2,000 = .102
  - P(F1=credit | S1) = 200/1,500 = .133
  - P(F1=credit | S2) = 4/500 = .008
- Given a test instance that has one feature, "credit":
  - P(S1 | F1=credit) = .133 * .75 / .102 = .978
  - P(S2 | F1=credit) = .008 * .25 / .102 = .020
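The worked example above can be checked in a few lines of Python. Working from the raw counts rather than the rounded .133 gives .980 instead of the slide's .978; the small difference is only rounding.

```python
# The slide's worked example from the raw counts.
n_s1, n_s2 = 1500, 500            # bank/1 and bank/2 training instances
n_total = n_s1 + n_s2
p_s1, p_s2 = n_s1 / n_total, n_s2 / n_total   # priors: .75 and .25

f_s1, f_s2 = 200, 4               # "credit" co-occurrence counts per sense
p_f = (f_s1 + f_s2) / n_total     # P(F1=credit) = .102
p_f_given_s1 = f_s1 / n_s1        # P(F1=credit | S1) ~= .133
p_f_given_s2 = f_s2 / n_s2        # P(F1=credit | S2) = .008

# Bayes rule: P(S | F) = P(F | S) * P(S) / P(F)
p_s1_given_f = p_f_given_s1 * p_s1 / p_f
p_s2_given_f = p_f_given_s2 * p_s2 / p_f
print(round(p_s1_given_f, 3), round(p_s2_given_f, 3))
```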
16. Comparative Results
- (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of "line".
- (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of "line".
- (Pedersen, 1998) compared Naïve Bayes with a Decision Tree, a Rule Based learner, a Probabilistic Model, etc., when disambiguating "line" and 12 other words.
- All found that the Naïve Bayesian Classifier performed as well as any of the other methods!
17. Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
18. Decision Lists and Trees
- Very widely used in machine learning.
- Decision trees were used very early in WSD research (e.g., Kelly and Stone, 1975; Black, 1988).
- Represent the disambiguation problem as a series of questions (presence of a feature) that reveal the sense of a word.
  - A list decides between two senses after one positive answer.
  - A tree allows for a decision among multiple senses after a series of answers.
- Use a smaller, more refined set of features than bag-of-words approaches such as Naïve Bayes.
- More descriptive and easier to interpret.
19. Decision List for WSD (Yarowsky, 1994)
- Identify collocational features from sense tagged data.
- Word immediately to the left or right of the target:
  - I have my bank/1 statement.
  - The river bank/2 is muddy.
- Pair of words to the immediate left or right of the target:
  - The world's richest bank/1 is here in New York.
  - The river bank/2 is muddy.
- Words found within k positions to the left or right of the target, where k is often 10-50:
  - My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.
20. Building the Decision List
- Sort the order of collocation tests using the log of conditional probabilities.
- Words most indicative of one sense (and not the other) will be ranked highly.
21. Computing the DL Score
- Given 2,000 instances of "bank", 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense):
  - P(S1) = 1,500/2,000 = .75
  - P(S2) = 500/2,000 = .25
- Given that "credit" occurs 200 times with bank/1 and 4 times with bank/2:
  - P(F1=credit) = 204/2,000 = .102
  - P(F1=credit | S1) = 200/1,500 = .133
  - P(F1=credit | S2) = 4/500 = .008
- From Bayes' rule:
  - P(S1 | F1=credit) = .133 * .75 / .102 = .978
  - P(S2 | F1=credit) = .008 * .25 / .102 = .020
- DL Score = abs(log(.978/.020)) = 3.89
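The final score above follows directly from the two posteriors (using the natural log, which reproduces the slide's 3.89):

```python
# The DL score from the slide, computed from the two rounded posteriors.
from math import log

p_s1_given_f = 0.978   # P(S1 | F1=credit)
p_s2_given_f = 0.020   # P(S2 | F1=credit)

dl_score = abs(log(p_s1_given_f / p_s2_given_f))
print(round(dl_score, 2))  # 3.89
```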
22. Using the Decision List
- Sort by DL score; go through the test instance looking for a matching feature. The first match reveals the sense.

DL Score  Feature             Sense
3.89      credit within bank  Bank/1 (financial)
2.20      bank is muddy       Bank/2 (river)
1.09      pole within bank    Bank/2 (river)
0.00      of the bank         N/A
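A minimal sketch of this lookup, using the scores from the table above; the feature tests are simplified to substring checks, which is an illustration rather than Yarowsky's actual collocation matching.

```python
# A decision list as (score, test, sense) triples, sorted by score.
# Feature tests are simplified to substring checks for illustration.
decision_list = [
    (3.89, "credit", "bank/1 (financial)"),
    (2.20, "is muddy", "bank/2 (river)"),
    (1.09, "pole", "bank/2 (river)"),
]
decision_list.sort(key=lambda entry: entry[0], reverse=True)

def classify(context, default="bank/1 (financial)"):
    """Return the sense of the first (highest-scoring) matching test."""
    for score, test, sense in decision_list:
        if test in context:
            return sense
    return default  # back off when nothing matches

print(classify("my bank raised my credit limit"))
```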
23. Using the Decision List
24. Learning a Decision Tree
- Identify the feature that most cleanly divides the training data into the known senses.
  - "Cleanly" is measured by information gain or gain ratio.
- Create subsets of the training data according to feature values.
- Find another feature that most cleanly divides a subset of the training data.
- Continue until each subset of the training data is pure, or as clean as possible.
- Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986; 1993).
- In Senseval-1, a modified decision list (which supported some conditional branching) was the most accurate system for the English Lexical Sample task (Yarowsky, 2000).
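The "most cleanly divides" criterion can be sketched as the information gain of a binary word feature. The tiny labelled data set below is illustrative only.

```python
# Information gain of a binary word feature over sense-labelled instances.
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of sense labels."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def information_gain(instances, feature):
    """instances: (set_of_context_words, sense) pairs; feature: a word."""
    labels = [sense for _, sense in instances]
    with_f = [sense for words, sense in instances if feature in words]
    without_f = [sense for words, sense in instances if feature not in words]
    remainder = (len(with_f) / len(labels)) * entropy(with_f) \
              + (len(without_f) / len(labels)) * entropy(without_f)
    return entropy(labels) - remainder

data = [
    ({"credit", "loan"}, "bank/1"),
    ({"credit", "account"}, "bank/1"),
    ({"river", "muddy"}, "bank/2"),
    ({"river", "fishing"}, "bank/2"),
]
print(information_gain(data, "credit"))  # splits the senses perfectly
print(information_gain(data, "loan"))    # splits them poorly
```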
25. Supervised WSD with Individual Classifiers
- Most supervised machine learning algorithms have been applied to word sense disambiguation, and most work reasonably well.
- Features tend to differentiate among methods more than the learning algorithms do.
- Good sets of features tend to include:
  - Co-occurrences or keywords (global)
  - Collocations (local)
  - Bigrams (local and global)
  - Part of speech (local)
  - Predicate-argument relations
    - Verb-object, subject-verb, ...
  - Heads of noun and verb phrases
26. Convergence of Results
- The accuracy of different systems applied to the same data tends to converge on a particular value; no one system is shockingly better than another.
- Senseval-1: a number of systems in the range of 74-78% accuracy for the English Lexical Sample task.
- Senseval-2: a number of systems in the range of 61-64% accuracy for the English Lexical Sample task.
- Senseval-3: a number of systems in the range of 70-73% accuracy for the English Lexical Sample task.
- What to do next?
27. Outline
- What is Supervised Learning?
- Task Definition
- Naïve Bayesian Classifiers
- Decision Lists and Trees
- Ensembles of Classifiers
28. Ensembles of Classifiers
- Classifier error has two components: bias and variance.
  - Some algorithms (e.g., decision trees) try to build a representation of the training data: low bias / high variance.
  - Others (e.g., Naïve Bayes) assume a parametric form and don't represent the training data: high bias / low variance.
- Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy.
- Bagging a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996).
  - Sample with replacement from the training data to learn multiple decision trees.
  - Outliers in the training data will tend to be obscured/eliminated.
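Bagging as described above can be sketched in a few lines. To stay self-contained, the "classifier" here is a trivial majority-sense stub rather than a real decision tree learner, and the training data is made up.

```python
# A minimal bagging sketch: draw bootstrap samples with replacement and
# train one classifier per sample, then combine by vote.
import random
from collections import Counter

def majority_sense(sample):
    """Stand-in for a tree learner: predict the sample's majority sense."""
    senses = Counter(sense for _, sense in sample)
    return senses.most_common(1)[0][0]

def bag(training, n_classifiers=25, seed=0):
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_classifiers):
        # bootstrap sample: same size as the training data, with replacement
        sample = [rng.choice(training) for _ in range(len(training))]
        predictions.append(majority_sense(sample))
    # the ensemble prediction is a vote over the individual classifiers
    return Counter(predictions).most_common(1)[0][0]

training = [("ctx", "bank/1")] * 8 + [("ctx", "bank/2")] * 2
print(bag(training))
```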
29. Ensemble Considerations
- Must choose learning algorithms with significantly different bias/variance characteristics.
  - Naïve Bayesian Classifier versus Decision Tree
- Must choose feature representations that yield significantly different (independent?) views of the training data.
  - Lexical versus syntactic features
- Must choose how to combine classifiers.
  - Simple majority voting
  - Averaging of probabilities across multiple classifier outputs
  - Maximum Entropy combination (e.g., Klein et al., 2002)
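The first two combination strategies listed above can be sketched directly; the per-classifier outputs below are made-up illustrations, not real system output.

```python
# Two simple combination strategies: majority voting over hard labels,
# and averaging of per-sense probability distributions.
from collections import Counter

def majority_vote(predictions):
    """Pick the sense predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def average_probabilities(distributions):
    """Average per-sense probabilities across classifiers; pick the max."""
    senses = distributions[0].keys()
    avg = {s: sum(d[s] for d in distributions) / len(distributions)
           for s in senses}
    return max(avg, key=avg.get)

votes = ["bank/1", "bank/2", "bank/1"]           # e.g. NB, DT, DL outputs
dists = [{"bank/1": 0.9, "bank/2": 0.1},
         {"bank/1": 0.4, "bank/2": 0.6},
         {"bank/1": 0.7, "bank/2": 0.3}]

print(majority_vote(votes))          # 2 of 3 classifiers agree
print(average_probabilities(dists))  # mean 0.667 vs 0.333
```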
30. Ensemble Results
- (Pedersen, 2000) achieved state of the art for the "interest" and "line" data using an ensemble of Naïve Bayesian Classifiers.
  - Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.
  - Classifiers combined by a weighted vote.
- (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using a combination of six classifiers.
  - Rich set of collocational and syntactic features.
  - Combined via a linear combination of the top three classifiers.
- Many Senseval-2 and Senseval-3 systems employed ensemble methods.
31. References
- (Black, 1988) E. Black. (1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32), pg. 185-194.
- (Breiman, 1996) L. Breiman. (1996) The heuristics of instability in model selection. Annals of Statistics (24), pg. 2350-2383.
- (Domingos and Pazzani, 1997) P. Domingos and M. Pazzani. (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29), pg. 103-130.
- (Domingos, 2000) P. Domingos. (2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI, pg. 564-569.
- (Florian and Yarowsky, 2002) R. Florian and D. Yarowsky. (2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP, pg. 25-32.
- (Kelly and Stone, 1975) E. Kelly and P. Stone. (1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.
- (Klein et al., 2002) D. Klein, K. Toutanova, H. Tolga Ilhan, S. Kamvar, and C. Manning. (2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. In Proceedings of Senseval-2, pg. 87-89.
- (Leacock et al., 1993) C. Leacock, J. Towell, and E. Voorhees. (1993) Corpus-based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology, pg. 260-265.
- (Mooney, 1996) R. Mooney. (1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Proceedings of EMNLP, pg. 82-91.
- (Pedersen, 1998) T. Pedersen. (1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation, Southern Methodist University.
- (Pedersen, 2000) T. Pedersen. (2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.
- (Quinlan, 1986) J.R. Quinlan. (1986) Induction of Decision Trees. Machine Learning (1), pg. 81-106.
- (Quinlan, 1993) J.R. Quinlan. (1993) C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco.
- (Yarowsky, 1994) D. Yarowsky. (1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL, pg. 88-95.
- (Yarowsky, 2000) D. Yarowsky. (2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities (34).