Title: Collective Word Sense Disambiguation
1. Collective Word Sense Disambiguation
- David Vickrey
- Ben Taskar
- Daphne Koller
2. Word Sense Disambiguation
"The electricity plant supplies 500 homes with power."  (clues: "electricity", "power")
vs.
"A plant requires water and sunlight to survive."  (clues: "water", "sunlight")
Tricky: "That plant produces bottled water."
3. WSD as Classification
- Senses s1, s2, ..., sk correspond to classes c1, c2, ..., ck
- Features: properties of the context of the word occurrence
  - The subject or verb of the sentence
  - Any word occurring within 4 words of the occurrence (see the sketch below)
- Document: the set of features corresponding to one occurrence
The electricity plant supplies 500 homes with
power.
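As a concrete illustration, here is a minimal sketch of the ±4-word window features for the sentence above (the function and variable names are ours, not from the slides; the subject/verb features would additionally require a parser):

    def context_features(tokens, i, window=4):
        # Collect the words within `window` positions of the occurrence at index i.
        feats = set()
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                feats.add(tokens[j].lower())
        return feats

    tokens = "The electricity plant supplies 500 homes with power".split()
    print(context_features(tokens, tokens.index("plant")))
    # {'the', 'electricity', 'supplies', '500', 'homes', 'with'}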
4. Simple Approaches
- Only features are which words appear in the context
- Naïve Bayes (sketch below)
- Discriminative models, e.g., SVM
- Problems:
  - Feature set not rich enough
  - Data extremely sparse
  - "space" occurs only 38 times in a corpus of 200,000 words
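A minimal sketch of the Naïve Bayes baseline over bag-of-context-words features; scikit-learn and the toy sentences are our choices, not part of the original slides:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    # Toy labeled contexts of "plant" (hypothetical training data).
    contexts = [
        "the electricity plant supplies homes with power",
        "the plant requires water and sunlight to survive",
    ]
    senses = ["plant/factory", "plant/flora"]

    vec = CountVectorizer()
    clf = MultinomialNB().fit(vec.fit_transform(contexts), senses)
    print(clf.predict(vec.transform(["the power plant was shut down"])))
    # likely ['plant/factory'] on this toy data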
5. Available Data
- WordNet: an electronic thesaurus
  - Words grouped by meaning into synsets
  - Slightly over 100,000 synsets
  - For nouns and verbs, a hierarchy over the synsets (example tree and sketch below)
    Animal
    ├── Mammal
    │     └── Dog, Hound, Canine
    │           ├── Retriever
    │           └── Terrier
    └── Bird
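The synsets and hypernym hierarchy can be browsed with NLTK's WordNet interface (our tooling choice; requires nltk.download('wordnet') first):

    from nltk.corpus import wordnet as wn

    # All synsets containing the word "dog".
    for syn in wn.synsets("dog"):
        print(syn.name(), "-", syn.definition())

    # Step up the hierarchy from the canine sense of "dog".
    dog = wn.synset("dog.n.01")
    print([h.name() for h in dog.hypernyms()])  # e.g., ['canine.n.02', 'domestic_animal.n.01']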
6. Available Data
- Around 400,000-word corpus labeled with synsets from WordNet
- Sample sentences from WordNet
- Very sparse for most words
7. What Hasn't Worked
- Intuition: the context of "dog" is similar to the context of "retriever"
- Use the hierarchy to determine possibly useful data
- Using cross-validation, learn which data is actually useful
- This hasn't worked out very well
8. Why?
- Lots of parameters (not even counting parameters estimated using MLE)
  - > 100K for one model, 20K for another
- Not much data (400K words)
  - "a", "the", "and", "of", "to" occur 65K times (combined)
- Hierarchy may not be very useful
  - Hand-built; not designed for this task
- Features not very expressive
- Luke is looking at this more closely using an SVM
9. Collective WSD
- Ideas:
  - Determine the senses of all words in a document simultaneously; this allows for richer features
  - Train on unlabeled data as well as labeled; lots and lots of unlabeled text is available
10. Model
- Variables:
  - S1, S2, ..., Sn: synsets
  - W1, W2, ..., Wn: words, always observed
[Figure: chain-structured model over synsets S1 ... S5, each synset Si emitting an observed word Wi]
11. Model
- Each synset is generated from the preceding context; the size of the context is a parameter (here 4). A code sketch follows the equations.
P(S, W) = \prod_{i=1}^{n} P(W_i \mid S_i) \, P(S_i \mid S_{i-3}, S_{i-2}, S_{i-1})

P(S_i = s \mid s_{i-3}, s_{i-2}, s_{i-1}) = \frac{1}{Z(s_{i-3}, s_{i-2}, s_{i-1})} \exp\big(\lambda_s(s_{i-3}) + \lambda_s(s_{i-2}) + \lambda_s(s_{i-1}) + \lambda_s\big)

P(W) = \sum_{S} P(S, W)
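A sketch of the log-linear transition distribution above; the nested-dict weight layout and all names are our assumptions:

    import numpy as np
    from collections import defaultdict

    def transition_probs(lam_ctx, lam_bias, context, domain):
        # P(S_i = s | s_{i-3}, s_{i-2}, s_{i-1}) for each candidate s in domain.
        # lam_ctx[s][c] is lambda_s(c); lam_bias[s] is the bias term lambda_s.
        scores = np.array([sum(lam_ctx[s][c] for c in context) + lam_bias[s]
                           for s in domain])
        scores -= scores.max()          # numerical stability
        p = np.exp(scores)
        return p / p.sum()              # dividing by p.sum() plays the role of Z

    lam_ctx = {"plant/factory": defaultdict(float, {"power/energy": 2.0}),
               "plant/flora": defaultdict(float)}
    lam_bias = {"plant/factory": 0.0, "plant/flora": 0.5}
    context = ["power/energy", "home/dwelling", "supply/provide"]
    print(transition_probs(lam_ctx, lam_bias, context,
                           ["plant/factory", "plant/flora"]))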
12. Learning
- Two sets of parameters:
  - P(Wi | Si): given current estimates of the marginals P(Si), update with expected counts
  - λ_s(s'): for s' ∈ Domain(S_{i-1}), s ∈ Domain(S_i), gradient ascent on the log-likelihood gives (sketch below)

\frac{\partial \log P(W)}{\partial \lambda_s(s')} = \sum_{s_{i-3},\, s_{i-2}} \Big[ P(w, s_{i-3}, s_{i-2}, s', s) - P(w, s_{i-3}, s_{i-2}, s') \, P(s \mid s_{i-3}, s_{i-2}, s') \Big]
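In code, this is the usual "empirical minus expected counts" gradient; a sketch assuming inference has already produced the joint marginals (the dictionary layout is our assumption):

    def grad_lambda(joint_marg, trans_prob, s_prev, s):
        # d log P(w) / d lambda_s(s_prev), summed over s_{i-3} and s_{i-2}.
        # joint_marg[(a, b, c, d)] = P(w, s_{i-3}=a, s_{i-2}=b, s_{i-1}=c, s_i=d)
        # trans_prob[(a, b, c, d)] = P(s_i=d | a, b, c)
        g = 0.0
        for (a, b, c, d), p in joint_marg.items():
            if c != s_prev:
                continue
            if d == s:
                g += p                          # empirical term
            g -= p * trans_prob[(a, b, c, s)]   # expected term (sums to the marginal)
        return g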
13. Efficiency
- Only need to calculate marginals over contexts
  - Forward-backward (sketch below)
- Issue: some words have many possible synsets (40-50), and we want very fast inference
  - Possibly prune values?
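For intuition, a forward-backward sketch on a first-order chain; the actual model conditions on three predecessors, which can be folded into the state at the cost of a larger state space:

    import numpy as np

    def forward_backward(init, trans, emit):
        # init: (K,) initial distribution; trans[a, b] = P(S_i=b | S_{i-1}=a)
        # emit[i, s] = P(W_i | S_i=s); returns posterior marginals P(S_i | W).
        n, K = emit.shape
        alpha, beta = np.zeros((n, K)), np.ones((n, K))
        alpha[0] = init * emit[0]
        for i in range(1, n):
            alpha[i] = (alpha[i - 1] @ trans) * emit[i]
        for i in range(n - 2, -1, -1):
            beta[i] = trans @ (beta[i + 1] * emit[i + 1])
        post = alpha * beta
        return post / post.sum(axis=1, keepdims=True)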
14. WordNet and Synsets
- The model uses WordNet to determine the domain of Si
  - Synset information should be more reliable
- This allows us to learn without any labeled data
  - Consider the synsets {eagle, hawk} (the birds), {eagle} (the golf shot), and {hawk} (to sell)
  - Since the parameters depend only on the synset, the correct clustering can be found even without labeled data
15. Richer Features
- Heuristic: one sense per discourse; usually, within a document, any given word takes only one of its possible senses
- Can capture this using long-range links (a crude approximation is sketched below)
  - Could assume each word is independent of all occurrences besides the ones immediately before and after
  - Or could use approximate inference (Kikuchi)
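A crude stand-in for the long-range links: run independent predictions, then force each word to its majority sense within the document (purely illustrative, not the model itself):

    from collections import Counter, defaultdict

    def one_sense_per_discourse(words, senses):
        # Reassign every occurrence of a word to its most frequent predicted sense.
        by_word = defaultdict(list)
        for w, s in zip(words, senses):
            by_word[w].append(s)
        majority = {w: Counter(ss).most_common(1)[0][0]
                    for w, ss in by_word.items()}
        return [majority[w] for w in words]

    print(one_sense_per_discourse(["plant", "plant", "plant"],
                                  ["factory", "flora", "factory"]))
    # ['factory', 'factory', 'factory']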
16. Richer Features
- Can reduce feature sparsity using the hierarchy, e.g., replacing all occurrences of "dog" and "cat" with "animal" (sketch below)
  - Need collective classification to do this
- Could add global hidden variables to try to capture the document subject
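A sketch of the hierarchy-based generalization using NLTK's WordNet interface; it assumes sense-disambiguated input, since the right hypernym depends on the sense, which is exactly why collective classification is needed:

    from nltk.corpus import wordnet as wn

    def generalize(synset_name, levels=1):
        # Replace a synset by an ancestor `levels` steps up the hierarchy.
        syn = wn.synset(synset_name)
        for _ in range(levels):
            parents = syn.hypernyms()
            if not parents:
                break
            syn = parents[0]   # WordNet allows multiple parents; take the first
        return syn.name()

    print(generalize("dog.n.01", levels=2))   # e.g., 'carnivore.n.01'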
17. Advanced Parameters
- Lots of parameters
  - Regularization likely helpful (sketch below)
- Could tie parameters together based on similarity in the WordNet hierarchy
  - Ties in with what I was working on before
  - More data in this (unlabeled) setting
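A minimal sketch of both ideas: an L2 penalty added to the gradient, and one simple tying scheme that replaces each WordNet-similarity group with its mean (the grouping itself is our placeholder; the slides leave the scheme open):

    import numpy as np

    def regularized_grad(grad, weights, l2=0.1):
        # Gradient of [log-likelihood - (l2/2) * ||weights||^2].
        return grad - l2 * weights

    def tie_parameters(weights, groups):
        # Force parameters in the same similarity group to share their mean value.
        tied = weights.copy()
        for idx in groups:
            tied[idx] = weights[idx].mean()
        return tied

    w = np.array([0.5, 1.5, -2.0])
    print(tie_parameters(w, [[0, 1]]))   # [ 1.   1.  -2. ]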
18. Experiments