Title: Overview of Peter D. Turney
1 Overview of Peter D. Turney's Work on Similarity
2 Similarity
- Attributional similarity (2001 - 2003)
- the degree to which two words are synonymous
- also known as
- Semantic relatedness and semantic association
- Relational similarity (2005 - 2008)
- the degree to which two relations are analogous
3 Objective evaluation of the approaches by:
- Attributional similarity
- 80 TOEFL synonym questions
- Relational similarity
- 374 SAT analogy questions
4 2001: Mining the Web for Synonyms: PMI-IR versus LSA on TOEFL
- In Proceedings of the 12th European Conference on Machine Learning, pages 491-502, Springer, Berlin, 2001.
5 1 Introduction
- Synonymy can be detected statistically: words with similar meanings tend to appear in similar contexts
- so synonymy can be measured through co-occurrence
- "a word is characterized by the company it keeps" (Firth)
6 1 Introduction: idea
- Given a problem word and candidate answers choice1, choice2, ..., choicen
- compute score(choice_i) for each choice and select the highest-scoring one
- uses Pointwise Mutual Information (PMI) to analyze statistical data collected by Information Retrieval (IR)
7 2 Formula
- Score 1: co-occurrence counted with the AND operator
- Score 2: co-occurrence counted with the NEAR operator (within ten words)
8 2 Formula (cont.)
- Score 3: filters out antonym pairs, e.g. big vs. small, which often co-occur with negation
- Score 4: takes the question's context into account
- a context word helps select the intended word sense
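The first two scores above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `hits` function stands in for live search-engine queries, and the hit counts are invented.

```python
# Sketch of PMI-IR scoring (scores 1 and 2 above). `hits` is a stand-in
# for a search engine's hit count; the counts below are invented.
def score1(problem, choice, hits):
    # Score 1: co-occurrence counted with the AND operator
    return hits(f"{problem} AND {choice}") / hits(choice)

def score2(problem, choice, hits):
    # Score 2: the NEAR operator (co-occurrence within ten words)
    return hits(f"{problem} NEAR {choice}") / hits(choice)

# Toy hit counts for illustration only
counts = {"big AND large": 500, "big NEAR large": 300, "large": 10000}
hits = lambda q: counts.get(q, 1)

print(score1("big", "large", hits))  # 0.05
print(score2("big", "large", hits))  # 0.03
```

Dividing by `hits(choice)` penalizes choices that are simply frequent everywhere, which is the PMI idea: co-occurrence relative to baseline frequency.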
9 3 Experiments
- Compare with
- LSA (Latent Semantic Analysis)
- a 61,000 × 30,473 word-by-document matrix X, built from an encyclopedia corpus
- reduced with SVD
- elements: tf-idf weights
- similarity: cosine of the angle between word vectors
- evaluated on the TOEFL questions
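The LSA pipeline just described (tf-idf matrix, truncated SVD, cosine similarity) can be sketched on a toy matrix. The 3 × 3 matrix below is invented for illustration; the real matrix is 61,000 × 30,473 reduced to 300 dimensions.

```python
import numpy as np

# Minimal LSA sketch: tf-idf matrix -> truncated SVD -> cosine similarity.
X = np.array([[2.0, 0.0, 1.0],
              [0.0, 3.0, 0.0],
              [1.0, 0.0, 2.0]])  # pretend these are tf-idf weights

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2                      # keep only the top-k singular values
W = U[:, :k] * s[:k]       # word vectors in the k-dimensional latent space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Words 0 and 2 appear in the same contexts, so their latent vectors
# end up (nearly) parallel; word 1 does not.
print(cosine(W[0], W[2]))  # close to 1.0
```

Truncating the SVD is what gives LSA its "smoothing": words that never co-occur directly can still end up close in the latent space.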
10 Dataset
- 80 TOEFL questions
- 50 ESL questions
11 3 Experiments: PMI-IR vs. LSA
- Running time
- PMI-IR issues live search queries
- about 2 s/query, 8 queries per question
- LSA requires heavy preprocessing
- reducing the 61,000 × 30,473 matrix to 61,000 × 300 with SVD is computationally expensive on a UNIX workstation
12 3 Experiments: results
- 80 TOEFL questions, 50 ESL questions
- PMI-IR: 73.75% (59/80), 74% (37/50)
- average human test-taker: 64.5% (51.6/80)
- LSA: 64.4% (51.5/80)
- PMI-IR wins by about 10 points
- Why?
- the NEAR operator, and a smaller chunk size
- LSA: 64.4%
- PMI-IR with AND: 62.5%
- PMI-IR with NEAR: 72.5%
13 4 Conclusion
- PMI-IR combines PMI with IR
- a simple, unsupervised corpus-based approach that outperforms LSA on TOEFL
- PMI
- easy to compute from co-occurrence counts
- depends on the coverage of the underlying corpus
14 2003: Combining Independent Modules in Lexical Multiple-Choice Problems
- In RANLP-03, pages 482-489, Borovets, Bulgaria
- (RANLP: Recent Advances in Natural Language Processing)
15 1 Introduction
- There are several approaches to natural language problems
- no single approach is best for all problem instances
- why not combine them?
16 1 Introduction
- two main contributions
- introduces and evaluates several new modules
- for answering multiple-choice synonym questions and analogy questions
- 3 merging rules
- presents a novel product rule
- compares it with 2 similar existing merging rules
17 2 Merging rules: notation
- The parameter of the rules: the weight vector w
- p_{h,i,j} > 0 is the probability assigned by module i to choice j of instance h
- i indexes modules, 1 ≤ i ≤ n
- h indexes instances, 1 ≤ h ≤ m
- j indexes choices, 1 ≤ j ≤ k
- D_{h,j}(w) is the probability assigned by the merging rule to choice j of training instance h when the weights are set to w
- 1 ≤ a(h) ≤ k is the correct answer for instance h
18 2 Merging rules: existing
- mixture rule (very common)
- a weighted arithmetic average of the module probabilities
- logarithmic rule
- a weighted geometric average
19 2 Merging rules: the novel product rule
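The three rules can be sketched as follows. This is a reconstruction from the descriptions above, not the paper's code; in particular, the exact form of the product rule (mixing each module's distribution with the uniform distribution over the k choices) is an assumption here.

```python
import math

# p[i][j]: probability module i assigns to choice j; w[i]: module weight.
def mixture(p, w):
    # weighted arithmetic average, then normalize
    scores = [sum(wi * pi[j] for wi, pi in zip(w, p)) for j in range(len(p[0]))]
    total = sum(scores)
    return [s / total for s in scores]

def logarithmic(p, w):
    # weighted geometric average: exp of the weighted sum of logs
    scores = [math.exp(sum(wi * math.log(pi[j]) for wi, pi in zip(w, p)))
              for j in range(len(p[0]))]
    total = sum(scores)
    return [s / total for s in scores]

def product(p, w):
    # assumed form: each module is mixed with the uniform distribution 1/k
    k = len(p[0])
    scores = [math.prod(wi * pi[j] + (1 - wi) / k for wi, pi in zip(w, p))
              for j in range(k)]
    total = sum(scores)
    return [s / total for s in scores]

# Two modules, three choices (toy numbers)
p = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1]]
w = [1.0, 0.5]
print(mixture(p, w))
print(logarithmic(p, w))
print(product(p, w))
```

All three agree on the top choice here; the paper's point is that they differ in how sharply they concentrate probability on it.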
20 3 Synonyms: dataset
- a training set of 431 4-choice synonym questions
- randomly divided into 331 training questions and 100 testing questions
- optimize w on the training set
21 3 Synonyms: modules
- LSA
- PMI-IR
- Thesaurus
- queries Wordsmyth (www.wordsmyth.net)
- creates synonym lists for both the stem and the choices
- scores them by their overlap
- Connector
- uses summary pages from querying Google with a pair of words
- weighted sum of
- the number of times the words appear separated by a connecting symbol or word (e.g. "means", "defined", "equals", "synonym", or whitespace)
- the number of times "dictionary" or "thesaurus" appears
22 3 Synonyms: combined results
- the 3 rules' accuracies are nearly identical
- the product and logarithmic rules assign higher probabilities to correct answers
- as evidenced by the mean likelihood
23 3 Synonyms: comparison with other approaches
24 4 Analogies: dataset
- 374 5-choice instances
- randomly split into 274 training instances and 100 testing instances
- E.g. cat:meow
- (a) mouse:scamper
- (b) bird:peck
- (c) dog:bark
- (d) horse:groom
- (e) lion:scratch
25 4 Analogies: modules
- Phrase vectors
- create a vector r to represent the relationship between X and Y
- phrases built with 128 patterns
- e.g. "X for Y", "Y with X", "X in the Y", "Y on X"
- query a search engine and record the number of hits
- measure similarity by cosine
- Thesaurus paths (WordNet)
- degree of similarity between paths
26 4 Analogies: modules (cont.)
- Lexical relation modules
- a set of more specific modules using WordNet
- 9 modules, each checking one relationship
- Synonym, Antonym, Hypernym, Hyponym, Meronym:substance, Meronym:part, Meronym:member, Holonym:substance, Holonym:member
- check the stem first, then the choices
- Similarity
- makes use of definitions
- Similarity:dict uses dictionary.com
- Similarity:wordsmyth uses wordsmyth.net
- given A:B::C:D, combines sim(A, C) and sim(B, D)
27 (No transcript)
28 5 Conclusion
- applied three trained merging rules to TOEFL questions
- accuracy: 97.5%
- provided first results on a challenging analogy task, with a set of novel modules that use both lexical databases and statistical information
- accuracy: 45%
- the popular mixture rule was consistently weaker than the logarithmic and product rules at assigning high probabilities to correct answers
29 State of the art (accuracy, %)

                    LSA    HUMAN   PMI-IR (2001)   HYBRID (2003)
Synonym questions   64.4   64.5    73.75           97.5

                    HYBRID  HUMAN
Analogies           45      57
30 2005: Corpus-based Learning of Analogies and Semantic Relations
- IJCAI 2005
- Proceedings of the Nineteenth International Joint
Conference on Artificial Intelligence, Edinburgh,
Scotland, UK, July 30-August 5, 2005.
31 1 Introduction
- Verbal analogy: VSM
- A:B :: C:D
- The novelty of the paper is the application of the VSM to measure the similarity between relationships
- Noun-modifier pair relations: supervised nearest neighbour algorithm
- Dataset: Nastase and Szpakowicz (2003), 600 noun-modifier pairs
32 1 Introduction: examples
- Analogy
- Noun-modifier pair relations
- laser printer
- relation: instrument
33 2 Solving Analogy Problems
- assign scores to candidate analogies A:B::C:D
- for multiple-choice questions, guess the highest-scoring choice
- Sim(R1, R2)
- the difficulty is that R1 and R2 are implicit
- attempt to learn R1 and R2 using unsupervised learning from a very large corpus
34 2 Solving Analogy Problems: Vector Space Model
- create vectors r1 and r2 that represent features of R1 and R2
- measure the similarity of R1 and R2 by the cosine of the angle θ between r1 and r2
35 2 Solving Analogy Problems: building the vectors
- Generate a vector for each word pair
- joining terms
- "X for Y", "Y with X", "X in the Y", "Y on X"
- vector
- (log(hits_1), log(hits_2), ..., log(hits_128))
- [Diagram: word pair A:B → phrases from 64 joining terms → search → hit counts → log → vector]
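The vector-building step above can be sketched as follows. Only 4 of the paper's 64 joining terms are used, and `hits` is a stand-in for querying a search engine; the "+1" inside the log is an assumption to avoid log(0).

```python
import math

# Map a word pair A:B to a vector of log hit-counts over joining-term
# phrases (both orders, as in the paper's 64 x 2 = 128 patterns).
JOINING_TERMS = ["for", "with", "in the", "on"]  # 64 terms in the paper

def pair_vector(a, b, hits):
    phrases = [f"{a} {t} {b}" for t in JOINING_TERMS] + \
              [f"{b} {t} {a}" for t in JOINING_TERMS]
    return [math.log(hits(ph) + 1) for ph in phrases]  # +1: assumed smoothing

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

# Invented hit counts for illustration only
toy_hits = lambda ph: {"mason for stone": 50, "stone for mason": 5}.get(ph, 0)
r1 = pair_vector("mason", "stone", toy_hits)
r2 = pair_vector("carpenter", "wood",
                 lambda ph: 40 if ph == "carpenter for wood" else 0)
print(cosine(r1, r2))  # high: both pairs co-occur mainly with "for"
```

Two word pairs whose relations are expressed by the same joining terms end up with similar vectors, which is exactly what the cosine then measures.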
36 2 Solving Analogy Problems: experiment
37 2 Solving Analogy Problems: experiment (cont.)
38 3 Noun-Modifier Semantic Relations
- first attempt to classify semantic relations without a lexicon
39 30 Semantic Relations in the training data
40 3 Noun-Modifier Semantic Relations: algorithm
- nearest neighbour supervised learning
- nearest neighbour measured by cosine
- cosine(training pair, testing pair)
- vectors of 128 elements, same joining terms as before
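The nearest-neighbour step above is simple enough to sketch directly: a testing pair gets the class of the training pair with the highest cosine. The vectors and labels below are invented for illustration.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def nearest_neighbour(test_vec, training):
    # training: list of (vector, class_label); return the label of the
    # training pair most similar (highest cosine) to the testing pair
    return max(training, key=lambda t: cosine(test_vec, t[0]))[1]

# Toy 3-element vectors standing in for the 128-element pattern vectors
training = [([1.0, 0.0, 2.0], "instrument"),
            ([0.0, 3.0, 0.0], "purpose")]
print(nearest_neighbour([2.0, 0.1, 3.0], training))  # "instrument"
```

A single nearest neighbour with leave-one-out evaluation needs no parameter fitting at all, which is why it is a natural baseline here.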
41 3 Noun-Modifier Semantic Relations: experiment for the 30 classes
42 30 Semantic Relations
- F, when precision and recall are balanced
- 26.5%
- F for random guessing
- 3.3%
- much better than random guessing
- but still much room for improvement
- 30 classes is hard
- too many possibilities for confusing classes
- try 5 classes instead
- group classes together
43 5 Semantic Relations
44 F for the 5 Classes
45 5 Semantic Relations
- F, when precision and recall are balanced
- 43.2%
- F for random guessing
- 20.0%
- better than random guessing
- better than 30 classes
- 26.5%
- but still room for improvement
46 Execution Time
- the experiments presented here required 76,800 queries to AltaVista
- 600 word pairs
- 128 queries per word pair
- 76,800 queries
- as a courtesy to AltaVista, a five-second delay was inserted between queries
- processing 76,800 queries took about five days
47 Conclusion
- The cosine metric in the VSM is used to
- solve analogies
- classify semantic relations
- It performs much better than random guessing, but below human levels
48 State of the art

accuracy (%)   HYBRID (2003)  VSM (2005)  HUMAN
Analogies      45             47          57

F-measure (%)              VSM (2005)
Noun-Modifier (5 classes)  43.2
49 2006a: Similarity of Semantic Relations
- Computational Linguistics, 32(3):379-416.
50 1 Introduction
- Latent Relational Analysis (LRA)
- LRA extends the VSM approach of Turney and Littman (2005) in three ways
- the connecting patterns are derived automatically from the corpus, instead of using a fixed set of patterns
- Singular Value Decomposition (SVD) is used to smooth the frequency data
- automatically generated synonyms are used to explore variations of the word pairs
51 2 A short description of LRA
- Generate a vector for each word pair
- [Diagram: word pair A:B → alternate pairs A':B' from automatically generated synonyms → patterns derived from the corpus (instead of 64 fixed joining terms) → search → hit counts → log of hits → SVD → calculate avg(cosine)]
52 3 Experiment: Word Analogy Questions, baseline LRA
- matrix: 17,232 × 8,000, density 5.8%
- time required: 209:49:36, about 9 days
- performance
53 Experiment: Word Analogy Questions, LRA vs. VSM
- Corpus size
- AltaVista: 5 × 10^11 English words
- WMTS: 5 × 10^10 English words
54 Experiment: Word Analogy Questions, varying the parameters
55 Experiment: Word Analogy Questions, ablation experiments
- no SVD: not significant, but may be significant with more word pairs
- no synonyms: recall drops
- neither: recall drops
- the drop to VSM performance is significant
56 Experiments with Noun-Modifier Relations
- Dataset
- 600 noun-modifier pairs, hand-labeled with 30 classes of semantic relations
- Algorithm
- baseline: LRA with a single nearest neighbour
- LRA provides the distance (nearness) measure
57 (No transcript)
58 Discussion
- For Word Analogy Questions
- performance is not yet adequate for practical application
- speed is an issue
- For noun-modifier classification
- more hand-labeled data would help, but it's expensive
- the choice of classification scheme for the semantic relations matters
- Hybrid approach
- combine the corpus-based approach of LRA with the lexicon-based approach of Veale (2004)
59 Conclusion of 2006a
- LRA extends the VSM (2005) in three ways
- patterns are derived automatically
- SVD is used to smooth and compress the data
- automatically generated synonyms are used to explore variations of the word pairs
60 State of the art

accuracy (%)   HYBRID (2003)  VSM (2005)  LRA (2006a)  HUMAN
Analogies      45             47          56.8         57

F-measure (%)              VSM (2005)  LRA (2006a)
Noun-Modifier (5 classes)  43.2        54.6
61 2006b: Expressing Implicit Semantic Relations without Supervision
62 Introduction
- Hearst (1992): pattern → X:Y
- the pattern "Y such as the X" can be used to mine large text corpora for hypernym-hyponym pairs
- search using the pattern "Y such as the X": if we find the string "bird such as the ostrich", we can infer that "ostrich" is a hyponym of "bird"
- Here we consider the inverse of this problem: X:Y → pattern
- can we mine a large text corpus for patterns that express the implicit relations between X and Y?
63 Introduction (cont.)
- Discovering high-quality patterns
- Pertinence: a measure of quality
- reliable for mining further word pairs with the same semantic relations
64 2 Pertinence
- the first formal measure of quality for text-mining patterns
- given a set of word pairs
- and a set of patterns
- P_i is pertinent to X_j:Y_j
- if highly typical word pairs X_k:Y_k for the pattern P_i tend to be relationally similar to X_j:Y_j
- pertinence tends to be highest with unambiguous patterns
65 2 Pertinence: definition
- f_{k,i} is the number of occurrences in a corpus of the word pair X_k:Y_k with the pattern P_i
- the frequencies are smoothed before estimating probabilities
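Putting the pieces above together, the definition has the following reconstructed form; the exact smoothing scheme applied to the frequencies $f_{k,i}$ is left as an assumption here.

```latex
\mathrm{pertinence}(X_j{:}Y_j,\, P_i)
  = \sum_{k} p(X_k{:}Y_k \mid P_i)\;
            \mathrm{sim}_r(X_j{:}Y_j,\; X_k{:}Y_k)
\qquad
p(X_k{:}Y_k \mid P_i) = \frac{f_{k,i}}{\sum_{k'} f_{k',i}}
```

Here $\mathrm{sim}_r$ is the relational similarity (the cosine described in the related-work slide), so a pattern is pertinent to a pair exactly when the pairs it typically occurs with are relationally similar to that pair.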
66 3 Related Work
- Hearst (1992)
- describes a method for finding patterns like "Y such as the X"
- but her method requires human judgment
- Riloff and Jones (1999)
- use a mutual bootstrapping technique that can find patterns automatically
- but the bootstrapping requires an initial seed of manually chosen examples
- other work all requires training examples or initial seed patterns for each relation
67 3 Related Work (cont.)
- Turney (2006a): LRA
- maps each pair X:Y to a high-dimensional vector v, then calculates the cosine
- pertinence is based on it
- a limitation
- the semantic content of the vectors is difficult to interpret
68 The Algorithm
- 1. Find phrases
- 2. Generate patterns
- note pattern frequency (TF)
- a local frequency count
- 3. Count pair frequency
- a global frequency count (DF)
- 4. Map pairs to rows
- both for X_j:Y_j and Y_j:X_j
- 5. Map patterns to columns
- drop all patterns with a pair frequency less than 10
- 1,706,845 distinct patterns → 42,032 patterns
69 The Algorithm (cont.)
- 6. Build a sparse matrix
- elements are frequencies
- 7. Apply log and entropy weighting
- gives more weight to patterns that vary substantially in frequency across pairs
- 8. Apply SVD
- 9. Calculate cosines
- 10. Calculate conditional probabilities
- for every word pair and every pattern
- 11. Calculate pertinence
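Step 7 can be sketched as follows. This is a common log-entropy weighting, reconstructed from the one-line description above; computing the entropy per pattern (column) is an assumption, as is the exact normalization.

```python
import math

# Log-entropy weighting: element (i, j) becomes log(f_ij + 1) scaled by
# (1 - H_j / H_max), where H_j is the entropy of pattern j's frequency
# distribution across pairs. Patterns that are uniform across all pairs
# carry no information and get weight near 0.
def log_entropy(matrix):
    n_rows, n_cols = len(matrix), len(matrix[0])
    weighted = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        col = [matrix[i][j] for i in range(n_rows)]
        total = sum(col)
        ent = 0.0
        for f in col:
            if f > 0 and total > 0:
                p = f / total
                ent -= p * math.log(p)
        max_ent = math.log(n_rows)
        weight = 1.0 - (ent / max_ent if max_ent > 0 else 0.0)
        for i in range(n_rows):
            weighted[i][j] = math.log(matrix[i][j] + 1) * weight
    return weighted

# Toy matrix: pattern 0 occurs only with pair 0; pattern 1 is uniform
m = [[10, 1], [0, 1], [0, 1]]
w = log_entropy(m)
print(w)  # the uniform second column is weighted down to (nearly) zero
```

The log dampens raw frequency differences, and the entropy factor is what realizes "more weight to patterns that vary substantially in frequency".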
70 The Algorithm: flow
- [Diagram: word pairs → a ranked pattern list per pair (pattern list 1, ..., pattern list n) → pertinence]
71 5 Experiments with Word Analogies
- Dataset
- 374 college-level multiple-choice word analogies, taken from the SAT test
- 6 × 374 = 2,244 pairs
- 4,194 rows, 84,064 columns
- the sparse matrix density is 0.91%
- Score = (rank_stem + rank_choice) / 2
72 (No transcript)
73 - the four highest-ranking patterns for the stem and solution of the first example
74 - the top five pairs matching the pattern "Y such as the X"
75 Comparing with other measures
76 Experiments with Noun-Modifiers
77 Method and Result
- Method
- a single nearest neighbour algorithm with leave-one-out cross-validation
- the distance between two noun-modifier pairs is measured by the average rank of their best shared pattern
- Result
78 More
- For the 5 general classes
79 Comparing with other measures
80 Discussion
- Time
- word analogies: 5 hours, vs. 5 days (2005) and 9 days (2006a)
- noun-modifiers: 9 hours
- the majority of the time was spent searching
- Performance
- near the level of the average senior high school student (54.6% vs. 57%)
- for applications such as building a thesaurus, lexicon, or ontology, this level of performance suggests that the algorithm could assist, but not replace, a human expert
81 Conclusion
- LRA is a black box
- the main contribution of this paper is the idea of pertinence
- use it to find patterns that express the implicit semantic relations between two words
82 State of the art

accuracy (%)   HYBRID (2003)  VSM (2005)  LRA (2006a)  Pertinence (2006b)  HUMAN
Analogies      45             47          56.8         55.7                57

F-measure (%)              VSM (2005)  LRA (2006a)  Pertinence (2006b)
Noun-Modifier (5 classes)  43.2        54.6         50.2
83 2008: A Uniform Approach to Analogies, Synonyms, Antonyms, and Associations
- Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), August 2008, Manchester, UK, pages 905-912
84 1 Introduction
- Many kinds of semantic relations hold between words
- we restrict our attention to
- analogous
- synonymous
- antonymous
- associated
- As far as we know, the algorithm proposed here is the first attempt to deal with all four tasks using a uniform approach.
85 1 Introduction: idea
- Reduce each task to analogy
- Synonymous
- X:Y is analogous to the pair levied:imposed
- Antonymous
- X:Y is analogous to the pair black:white
- Associated
- X:Y is analogous to the pair doctor:hospital
86 1 Introduction: Why not WordNet?
- WordNet contains all of the needed relations, but
- a corpus-based algorithm is BETTER than a lexicon
- answering 374 multiple-choice SAT analogy questions
- WordNet (Veale, 2004): 43%
- corpus-based (Turney, 2006a): 56%
- less human labor
- easy to extend to other languages
87 1 Introduction: experiments
- SAT (college entrance test)
- TOEFL
- ESL
- a set of word pairs labeled similar, associated, and both, developed for experiments in cognitive psychology
88 2 Algorithm: PairClass
- view the task of recognizing word analogies as a problem of classifying word pairs
- a standard classification problem for supervised machine learning
89 2 Algorithm: resources
- Corpus
- 5 × 10^10 words, consisting of web pages gathered by a web crawler (Charles L. A. Clarke, 2003)
- Wumpus
- an efficient search engine for passage retrieval from large corpora (http://www.wumpus-search.org/)
- built to study issues that arise when indexing dynamic text collections in multi-user environments
90 2 Algorithm: PairClass steps
- Input: word pairs, split into a training set and a testing set, e.g. mason:stone
- Step 1: generate morphological variations
- e.g. masons:stones
- Step 2: search a large corpus for all phrases containing the pair
- phrases of the form [0 to 1 words] X [0 to 3 words] Y [0 to 1 words]
- e.g. "the mason cut the stone with"
- Step 3: generate patterns from each phrase
- e.g. "the X cut Y with"
- 2^(n-2) patterns per phrase
- Step 4: reduce the number of patterns
- keep the top kN patterns, k = 20
- Step 5: generate feature vectors
- Step 6: apply a standard supervised learning algorithm
- SMO with an RBF kernel (Weka)
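The 2^(n-2) count in step 3 comes from optionally replacing each word other than X and Y with a wildcard. A minimal sketch (the wildcard symbol "*" is an assumption; the paper's pattern syntax may differ):

```python
from itertools import product

# From a retrieved n-word phrase, generate the 2^(n-2) patterns obtained
# by keeping X and Y fixed and replacing each other word with "*" or not.
def generate_patterns(phrase_words, x, y):
    slots = [[w] if w in (x, y) else [w, "*"] for w in phrase_words]
    return {" ".join(combo) for combo in product(*slots)}

phrase = ["the", "mason", "cut", "the", "stone", "with"]
pats = generate_patterns(phrase, "mason", "stone")
print(len(pats))  # 6 words, 2 fixed -> 2^4 = 16 patterns
```

With n up to 6 (X, Y, and 0-3 intervening plus flanking words), each phrase yields at most 16 patterns, so the pattern space stays manageable before the top-kN cut in step 4.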
91 PairClass vs. LRA (Turney, 2006a)
- PairClass does not use a lexicon to find synonyms for the input word pairs
- a pure corpus-based algorithm can handle synonyms without a lexicon
- PairClass uses a support vector machine (SVM) instead of a nearest neighbour (NN) learning algorithm
- PairClass does not use SVD to smooth the feature vectors
- it has been our experience that SVD is not necessary with SVMs
92
- Measure of similarity
- PairClass: probability estimates, more useful
- Turney (2006a): cosine
- the automatically generated patterns are slightly more general
- PairClass: [0 to 1 words] X [0 to 3 words] Y [0 to 1 words]
- Turney (2006a): X [0 to 3 words] Y
- the morphological processing in PairClass (Minnen et al., 2001) is more sophisticated than in Turney (2006a)
93 3 Experiment: SAT Analogies
- uses a set of 374 multiple-choice questions from the SAT college entrance exam
- E.g.
- treated as a binary classification problem
94 3 Experiment: SAT Analogies (cont.)
- 1st DIFFICULTY: no negative examples
- the training set consists of one positive example (the stem pair) and the testing set consists of five unlabeled examples (the five choice pairs)
- Solution
- randomly choose one of the other 373 questions to be a negative example
- use PairClass to estimate the probability that each testing example is positive, and guess the testing example with the highest probability
95 (No transcript)
96 3 Experiment: SAT Analogies (cont.)
- 2nd DIFFICULTY
- the algorithm is very unstable, for lack of examples
- Solution
- to increase stability, repeat the learning process 10 times, using a different randomly chosen negative training example each time
- average the 10 probabilities
97 PairClass: accuracy of 52.1%
98 3 Experiment: TOEFL Synonyms
- Recognizing synonyms
- a set of 80 multiple-choice synonym questions from the TOEFL
99 - View it as a binary classification problem
100 3 Experiment: TOEFL Synonyms (cont.)
- 80 questions → 80 positive and 240 negative pairs
- apply PairClass using ten-fold cross-validation
- in each random fold, 90% of the pairs are used for training and 10% for testing
- for each fold, the model learned from the training set is used to assign probabilities to the pairs in the testing set
- the folds are non-overlapping, so they cover the whole dataset
- choose the choice with the highest probability
76.1
1023 Experiment Synonyms and Antonyms
- a set of 136 ESL practice questions
1033 Experiment Synonyms and Antonyms
- By patterns hand-coded
- Lin et al. (2003)
- two patterns, from X to Y and either X or Y
. - Antonyms they occasionally appear in a large
corpus in one of these two patterns - Synonyms very rare to appear in these patterns.
- PairClass
- automatically
1043 Experiment Synonyms and Antonyms
- RESULT
- PairClass ten-fold cross-validation
- accuracy of 75.0 (ten-fold cross-validation)
- Baseline
- accuracy of 65.4 (Always guessing the majority
class) - NO COMPARISON
105 3 Experiment: Similar, Associated, and Both
- Lund et al. (1995) evaluated their corpus-based algorithm for measuring word similarity with word pairs labeled similar, associated, or both
- these 144 labeled pairs were originally created for cognitive psychology experiments with human subjects
106 3 Experiment: Similar, Associated, and Both (cont.)
- Lund et al. (1995)
- did not measure accuracy
- showed that their algorithm's similarity scores were correlated with the response times of human subjects in priming tests
- PairClass with ten-fold cross-validation
- accuracy of 77.1%
- Baseline
- guessing the majority class, or random guessing: 33.3%
- since the three classes are of equal size
107 3 Experiment: summary
- For the first two experiments
- PairClass is not the best
- but it performs competitively
- For the second two experiments
- PairClass performs significantly above the baselines
108 State of the art

Year   Method     Type          Synonym (%)  Analogy (%)
2001   PMI-IR     Corpus-based  73.75
2003   PR         Hybrid        97.50
2005   VSM        Corpus-based               47.1
2006a  LRA        Corpus-based               56.1
2006b  PERT       Corpus-based               53.5
2008   PairClass  Corpus-based  76.1         52.1
       HUMAN                    64.5         57.0
109 Thanks! o_0