Title: Measures of Distributional Similarity
1. Measures of Distributional Similarity
Lillian Lee, Department of Computer Science, Cornell University
- Presenter: Cosmin Adrian Bejan
2. Overview
- Goal: improve probability estimation for unseen cooccurrences.
- Contributions of this paper:
  - an empirical comparison of a broad range of measures
  - a classification of similarity functions based on the information that they incorporate
  - a new function for evaluating proxy distributions
3. Introduction
- How to estimate the conditional cooccurrence probability P(v|n) of an unseen word pair (n,v) drawn from some finite set N×V?
- Normal approaches:
  - Katz back-off method
  - Jelinek-Mercer interpolation method
- An alternative approach:
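  - distance-weighted averaging:
    $\hat{P}(v|n) = \frac{\sum_{m \in S(n)} \mathrm{sim}(n,m)\, P(v|m)}{\sum_{m \in S(n)} \mathrm{sim}(n,m)}$
    where S(n) is a set of candidate similar words and sim(n,m) is a function of similarity between n and m.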
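Distance-weighted averaging can be sketched as a similarity-weighted average of P(v|m) over the neighbors m in S(n), normalized by the total similarity. This is a minimal sketch: the dict layout `cond_prob`, the function name, the fallback to zero when no neighbor has weight, and the toy numbers are all assumptions made for illustration.

```python
def distance_weighted_average(v, n, cond_prob, neighbors, sim):
    """Estimate P(v|n) from the neighbors of n.

    cond_prob: dict noun -> dict verb -> P(verb|noun)  (hypothetical layout)
    neighbors: the candidate similar nouns S(n)
    sim:       function sim(n, m) returning a non-negative weight
    """
    weights = {m: sim(n, m) for m in neighbors}
    total = sum(weights.values())
    if total == 0:
        # Assumption: with no usable neighbors, fall back to zero.
        return 0.0
    weighted = sum(w * cond_prob[m].get(v, 0.0) for m, w in weights.items())
    return weighted / total


# Toy data with made-up probabilities and similarities.
cond_prob = {
    "apple": {"eat": 0.6, "buy": 0.4},
    "pear": {"eat": 0.5, "buy": 0.5},
}
toy_sim = {("peach", "apple"): 3.0, ("peach", "pear"): 1.0}
estimate = distance_weighted_average(
    "eat", "peach", cond_prob, ["apple", "pear"], lambda n, m: toy_sim[(n, m)]
)
```

Here the unseen pair (peach, eat) borrows its estimate from "apple" and "pear", with "apple" counting three times as much as "pear".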
4. Distributional Similarity Functions
Notation:
- N: set of nouns
- V: set of transitive verbs
- (n,v): cooccurrence pair where n is the head of the direct object of v
- n, m: two nouns whose distributional similarity is to be determined
- q(v) = P(v|n), r(v) = P(v|m)
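Euclidean distance:
  $L_2(q,r) = \sqrt{\sum_v (q(v) - r(v))^2}$  (1)
L1 norm:
  $L_1(q,r) = \sum_v |q(v) - r(v)|$  (2)
cosine:
  $\cos(q,r) = \frac{\sum_v q(v)\, r(v)}{\sqrt{\sum_v q(v)^2}\, \sqrt{\sum_v r(v)^2}}$  (3)
Jaccard's coefficient:
  $\mathrm{Jac}(q,r) = \frac{|\{v : q(v) > 0 \text{ and } r(v) > 0\}|}{|\{v : q(v) > 0 \text{ or } r(v) > 0\}|}$  (4)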
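The four measures on this slide can be sketched in plain Python using their standard definitions. The sketch assumes each conditional distribution is stored as a dict from verb to probability (a hypothetical layout, with missing verbs treated as probability zero), and takes Jaccard's coefficient over the sets of verbs with nonzero probability.

```python
import math


def euclidean(q, r):
    """Euclidean (L2) distance between two verb distributions."""
    verbs = set(q) | set(r)
    return math.sqrt(sum((q.get(v, 0.0) - r.get(v, 0.0)) ** 2 for v in verbs))


def l1_norm(q, r):
    """L1 distance: sum of absolute differences."""
    verbs = set(q) | set(r)
    return sum(abs(q.get(v, 0.0) - r.get(v, 0.0)) for v in verbs)


def cosine(q, r):
    """Cosine of the angle between q and r viewed as vectors."""
    dot = sum(p * r.get(v, 0.0) for v, p in q.items())
    norm_q = math.sqrt(sum(p * p for p in q.values()))
    norm_r = math.sqrt(sum(p * p for p in r.values()))
    return dot / (norm_q * norm_r)


def jaccard(q, r):
    """Shared verbs divided by all verbs seen with either noun."""
    support_q = {v for v, p in q.items() if p > 0}
    support_r = {v for v, p in r.items() if p > 0}
    return len(support_q & support_r) / len(support_q | support_r)


# Toy distributions with made-up probabilities.
q = {"eat": 0.5, "buy": 0.5}
r = {"eat": 0.5, "peel": 0.5}
```

Note that the first two are distances (smaller means more similar) while cosine and Jaccard's coefficient are similarities (larger means more similar), so a ranking of neighbors has to sort in the matching direction.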
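5. Distributional Similarity Functions
Jensen-Shannon divergence:
  $JS(q,r) = \frac{1}{2}\left[ D\left(q \,\|\, \frac{q+r}{2}\right) + D\left(r \,\|\, \frac{q+r}{2}\right) \right]$  (5)
Kullback-Leibler divergence:
  $D(q \,\|\, r) = \sum_v q(v) \log \frac{q(v)}{r(v)}$
confusion probability:
  $\mathrm{conf}(q,r,P(m)) = \sum_v q(v)\, r(v)\, \frac{P(m)}{P(v)}$  (6)
Kendall's τ:
  $\tau(q,r) = \frac{\sum_{v_1,v_2} \mathrm{sign}\left[(q(v_1) - q(v_2))(r(v_1) - r(v_2))\right]}{2\binom{|V|}{2}}$  (7)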
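A minimal sketch of the measures on this slide, using their standard definitions and assuming each distribution is a dict from verb to probability (a hypothetical layout). The arguments `p_m` and `p_verb` (the marginal probabilities P(m) and P(v)) and all function names are assumptions introduced for illustration.

```python
import math
from itertools import combinations


def kl(q, r):
    """D(q || r); infinite when r(v) = 0 for some verb with q(v) > 0."""
    total = 0.0
    for v, qv in q.items():
        if qv == 0:
            continue
        rv = r.get(v, 0.0)
        if rv == 0:
            return math.inf
        total += qv * math.log(qv / rv)
    return total


def jensen_shannon(q, r):
    """Average KL divergence of q and r to their midpoint distribution."""
    verbs = set(q) | set(r)
    avg = {v: 0.5 * (q.get(v, 0.0) + r.get(v, 0.0)) for v in verbs}
    return 0.5 * (kl(q, avg) + kl(r, avg))


def confusion(q, r, p_m, p_verb):
    """sum_v q(v) r(v) P(m) / P(v), over verbs seen with both nouns."""
    return sum(
        qv * r[v] * p_m / p_verb[v]
        for v, qv in q.items()
        if r.get(v, 0.0) > 0
    )


def kendall_tau(q, r, verbs):
    """Agreement of the two rankings over unordered verb pairs.

    Summing over unordered pairs and dividing by their count equals the
    ordered-pair sum divided by 2 * C(|V|, 2). Needs at least two verbs.
    """
    def sign(x):
        return (x > 0) - (x < 0)

    pairs = list(combinations(verbs, 2))
    score = sum(
        sign((q.get(a, 0.0) - q.get(b, 0.0)) * (r.get(a, 0.0) - r.get(b, 0.0)))
        for a, b in pairs
    )
    return score / len(pairs)


# Toy distributions with made-up probabilities.
q = {"eat": 0.5, "buy": 0.5}
r = {"eat": 0.5, "peel": 0.5}
```

The KL divergence is infinite whenever r assigns zero probability to a verb that q has seen, which is common with sparse cooccurrence data; the Jensen-Shannon divergence avoids this by comparing both distributions to their average.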
6. The Evaluation Method
- Evaluation of similarity functions on a binary decision task
- Data: verb-object cooccurrence pairs involving the 1000 most frequent nouns
- Training/testing split: 80% / 20%
- Testing set:
  - discard the pairs occurring in the training data
  - split the remaining pairs into five partitions
  - replace each (n,v1) with a (n,v1,v2) triple such that P(v1) ≈ P(v2)
- The task: reconstruct which of (n,v1) and (n,v2) was the original cooccurrence.
- The error rate measured for test-set performance:
  $\frac{1}{T}\left(\#\text{ of incorrect choices} + \frac{\#\text{ of ties}}{2}\right)$
  where T is the number of test triple tokens in the set
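The error-rate computation can be sketched as follows, assuming that a tie between the two verb alternatives is charged as half an error. The outcome labels and the function name are hypothetical.

```python
def error_rate(outcomes):
    """Fraction of test triples decided wrongly, with ties counted as half.

    outcomes: one label per test triple token, each of
              "correct", "incorrect", or "tie" (hypothetical labels).
    """
    total = len(outcomes)  # T, the number of test triple tokens
    incorrect = outcomes.count("incorrect")
    ties = outcomes.count("tie")
    return (incorrect + ties / 2) / total


# Four made-up test outcomes: one wrong choice and two ties.
rate = error_rate(["correct", "incorrect", "tie", "tie"])
```

Charging ties at one half matches the expected error of breaking each tie with a fair coin flip.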
7. The Evaluation Method
- Incorporate a similarity function into a decision rule as follows:
  - (n,v1,v2): test instance
  - S_{f,k}(n): the k most similar words to n according to f
  - evidence E_{f,k}(n,v1) for v1: the number of neighbors m ∈ S_{f,k}(n) such that P(v1|m) > P(v2|m)
  - the decision rule: choose the verb alternative with the greatest evidence
- For two functions f and g, if E_{f,k}(n,v1) > E_{g,k}(n,v1), then the k most similar words according to f are on the whole better predictors than the k most similar words according to g; hence f induces an inherently better similarity ranking for distance-weighted averaging.
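The decision rule can be sketched as a vote among the k nearest neighbors. This is a minimal sketch: the dict layout `cond_prob`, the helper names, the `higher_is_more_similar` flag (needed because some measures are distances), and returning `None` on a tie are all assumptions made for illustration.

```python
def k_nearest(n, candidates, f, k, higher_is_more_similar=True):
    """Return the k candidates ranked most similar to n under f."""
    ranked = sorted(
        (m for m in candidates if m != n),
        key=lambda m: f(n, m),
        reverse=higher_is_more_similar,
    )
    return ranked[:k]


def evidence(v1, v2, neighbors, cond_prob):
    """Count neighbors m with P(v1|m) > P(v2|m)."""
    return sum(
        1
        for m in neighbors
        if cond_prob[m].get(v1, 0.0) > cond_prob[m].get(v2, 0.0)
    )


def decide(v1, v2, neighbors, cond_prob):
    """Choose the verb with more evidence; None signals a tie."""
    e1 = evidence(v1, v2, neighbors, cond_prob)
    e2 = evidence(v2, v1, neighbors, cond_prob)
    if e1 > e2:
        return v1
    if e2 > e1:
        return v2
    return None


# Toy data with made-up probabilities and similarity scores.
cond_prob = {
    "apple": {"eat": 0.6, "buy": 0.4},
    "pear": {"eat": 0.5, "buy": 0.5},
    "car": {"eat": 0.1, "buy": 0.9},
}
toy_scores = {"apple": 0.9, "pear": 0.8, "car": 0.1}
neighbors = k_nearest("peach", list(cond_prob), lambda n, m: toy_scores[m], 2)
choice = decide("eat", "buy", neighbors, cond_prob)
```

With k = 2 the neighbors are "apple" and "pear"; only "apple" prefers one verb strictly, so "eat" wins. Adding "car" as a third neighbor would produce a tie.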
8. Similarity Metric Performance
10. The Skew Divergence
- Remark: it is desirable to have a similarity function that focuses on the verbs that cooccur with both of the nouns being compared.
- α-skew divergence:
  $s_\alpha(q,r) = D(r \,\|\, \alpha q + (1-\alpha) r)$, for $0 \le \alpha \le 1$
- the skew divergence is asymmetric
- s_α depends only on the verbs in V_{qr}.
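A minimal sketch of the skew divergence, assuming the definition s_α(q,r) = D(r || αq + (1-α)r), where D is the Kullback-Leibler divergence and each distribution is a dict from verb to probability (a hypothetical layout). Restricting α to values below 1 is a choice made here so that the mixture is never zero where r is positive.

```python
import math


def skew_divergence(q, r, alpha):
    """KL divergence of r from the mixture alpha*q + (1 - alpha)*r."""
    if not 0 <= alpha < 1:
        # Assumption of this sketch: alpha = 1 is excluded because it
        # reduces to plain KL divergence, which can be infinite.
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    total = 0.0
    for v, rv in r.items():
        if rv == 0:
            continue  # verbs unseen with the second noun contribute nothing
        mix = alpha * q.get(v, 0.0) + (1 - alpha) * rv
        total += rv * math.log(rv / mix)
    return total


# Toy distributions with made-up probabilities.
q = {"eat": 1.0}
r = {"eat": 0.5, "buy": 0.5}
forward = skew_divergence(q, r, 0.5)
backward = skew_divergence(r, q, 0.5)
```

Swapping the arguments gives a different value, which illustrates the asymmetry noted on the slide.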
11. Performance of the Skew Divergence