Title: Frequency Estimates for Statistical Word Similarity Measures
1. Frequency Estimates for Statistical Word Similarity Measures
Egidio Terra and C.L.A. Clarke, School of Computer Science,
University of Waterloo
- Presenter: Cosmin Adrian Bejan
2. Introduction
- A comparative study of two methods for estimating the
word co-occurrence frequencies required by word
similarity measures used to solve human-oriented
language tests.
- Examples of such tests:
- determine the best synonym in a set of
alternatives A = {A1, A2, A3, A4} for a specific
target word TW in a context C = {w1, w2, ..., wn} \ {TW}.
- determine the best synonym when no context is
available.
3. Measuring Word Similarity
- The co-occurrence of two words can be
described by a contingency table.
- Each dimension represents a discrete random
variable W_i with range A = {w_i, ¬w_i}.
- Each cell represents the joint frequency
f_{w_1,w_2} = N_max · P(w_1, w_2),
where N_max is the maximum number of
co-occurrences.
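The contingency table described above can be sketched in a few lines. This is only an illustration: the function name `contingency_table` and its parameters are hypothetical, and it assumes that the marginal frequencies of the two words are counted in the same units as N_max (for example, windows or documents).

```python
def contingency_table(f_w1, f_w2, f_w1_w2, n_max):
    """Return a 2x2 table of joint frequencies for words w1 and w2.

    Rows index W1 in (w1, not w1); columns index W2 in (w2, not w2).
    All counts are assumed to be expressed in the same units as n_max.
    """
    if f_w1_w2 > min(f_w1, f_w2) or f_w1 + f_w2 - f_w1_w2 > n_max:
        raise ValueError("inconsistent counts")
    both = f_w1_w2                            # w1 and w2 together
    only_w1 = f_w1 - f_w1_w2                  # w1 without w2
    only_w2 = f_w2 - f_w1_w2                  # w2 without w1
    neither = n_max - f_w1 - f_w2 + f_w1_w2   # neither word
    return [[both, only_w1], [only_w2, neither]]


# Made-up counts, purely for illustration.
table = contingency_table(f_w1=30, f_w2=20, f_w1_w2=10, n_max=1000)
```

The four cells always sum to `n_max`, which is a quick sanity check on the input counts.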
4. Similarity between two words
Pointwise Mutual Information
χ² test
Likelihood ratio
Average Mutual Information
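The formulas for these measures are not preserved in the slide text. As a rough sketch, two of them can be written using their standard textbook definitions, which may differ in detail from the exact forms used in the talk. The function names are hypothetical, and the chi-square function assumes a 2x2 table of observed counts with no empty row or column.

```python
import math


def pmi(p_w1, p_w2, p_w1_w2):
    """Pointwise mutual information, in bits (standard definition)."""
    return math.log2(p_w1_w2 / (p_w1 * p_w2))


def chi_square(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts.

    Uses the closed form n * (ad - bc)^2 / (row and column totals).
    Assumes every row total and column total is non-zero.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# Made-up probabilities and counts, purely for illustration.
score = pmi(p_w1=0.25, p_w2=0.25, p_w1_w2=0.25)
stat = chi_square([[10, 20], [10, 960]])
```

A PMI of zero means the two words co-occur exactly as often as independence would predict; larger chi-square values indicate stronger departure from independence.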
5. Context-supported similarity
Cosine of Pointwise Mutual Information
L1 norm
Contextual Average Mutual Information
Contextual Jensen-Shannon Divergence
Pointwise Mutual Information of Multiple Words
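The slide lists these measures by name only. As a hedged illustration of the first one, the sketch below assumes that each word is represented by a vector of PMI values computed against a shared, ordered list of context words, and that the similarity is the ordinary cosine between the two vectors. Both the representation and the function name `cosine` are assumptions, not taken from the slides.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors.

    Here u and v are assumed to hold PMI values of two candidate
    words against the same ordered list of context words.
    """
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Made-up PMI vectors over three hypothetical context words.
similarity = cosine([3.0, 4.0, 0.0], [3.0, 4.0, 0.0])
```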
6. Window-oriented approach
- f_{w_i}: frequency of w_i
- f_{w_1,w_2}: co-occurrence frequency of w_1 and w_2
- N: size of the corpus in words
- P(w_i) = f_{w_i} / N
- f_{w_1,w_2} is estimated by the number of windows
in which the two words co-occur.
- N_{wt}: number of windows of size t
- P(w_1, w_2) = f_{w_1,w_2} / N_{wt}
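The window-oriented estimates above might be computed as in the following sketch. It assumes that windows are all contiguous spans of t tokens obtained by sliding one position at a time, so that the number of windows is N - t + 1; the slides do not say how windows are laid out, so this is an assumption, as is the function name `window_estimates`.

```python
def window_estimates(tokens, w1, w2, t):
    """Estimate P(w1), P(w2) and P(w1, w2) with a sliding window.

    tokens: the corpus as a list of words.
    t: window size in words (assumed: windows slide one token at a time).
    """
    n = len(tokens)
    n_windows = n - t + 1
    if t < 1 or n_windows < 1:
        raise ValueError("window size must be between 1 and the corpus length")
    f_w1 = tokens.count(w1)
    f_w2 = tokens.count(w2)
    # Count the windows that contain both words at least once.
    f_w1_w2 = 0
    for start in range(n_windows):
        window = tokens[start:start + t]
        if w1 in window and w2 in window:
            f_w1_w2 += 1
    return f_w1 / n, f_w2 / n, f_w1_w2 / n_windows


# Toy corpus, purely for illustration.
tokens = "the cat sat on the mat near the cat".split()
p_w1, p_w2, p_joint = window_estimates(tokens, "cat", "the", t=3)
```

This direct scan is quadratic in the worst case and is meant only to make the definitions concrete; a large corpus would need an index of word positions.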
7. Document-oriented approach
- df_{w_i}: frequency of a word w_i, i.e. the number
of documents in which the word appears.
- D: the number of documents
- P(w_i) = df_{w_i} / D
- df_{w_1,w_2}: co-occurrence frequency of two words,
i.e. the number of documents in which both words
occur.
- P(w_1, w_2) = df_{w_1,w_2} / D
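The document-oriented estimates can be sketched in the same spirit. The function name `document_estimates` is hypothetical, and the sketch assumes each document is already tokenized into a list of words.

```python
def document_estimates(documents, w1, w2):
    """Estimate P(w1), P(w2) and P(w1, w2) from document frequencies.

    documents: a list of documents, each a list of words.
    """
    d = len(documents)
    if d == 0:
        raise ValueError("need at least one document")
    vocabularies = [set(doc) for doc in documents]
    df_w1 = sum(1 for vocab in vocabularies if w1 in vocab)
    df_w2 = sum(1 for vocab in vocabularies if w2 in vocab)
    # Documents that contain both words, regardless of distance.
    df_w1_w2 = sum(1 for vocab in vocabularies if w1 in vocab and w2 in vocab)
    return df_w1 / d, df_w2 / d, df_w1_w2 / d


# Toy collection, purely for illustration.
documents = [["cat", "sat"], ["the", "cat"], ["the", "mat"], ["dog"]]
p_w1, p_w2, p_joint = document_estimates(documents, "cat", "the")
```

Unlike the window-oriented approach, this counts two words as co-occurring no matter how far apart they are within a document.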
8. Results for TOEFL test set
9. Results for TS1 and context