Title: Discriminating Word Senses Using McQuitty's Similarity Analysis
1 Discriminating Word Senses Using McQuitty's Similarity Analysis
- Amruta Purandare
- University of Minnesota, Duluth
- Advisor: Dr. Ted Pedersen
- Research supported by National Science Foundation (NSF) Faculty Early Career Development Award (0092784)
2 Discriminating line
- They will begin line formation before ceremony
- Connect modem to any jack on your line
- Quit printing after the last line of each file
- Your line will not get tied while you are connected to net
- Stand balanced and comfortable during line up
- Lines that do not fit a page are truncated
- New line service provides reliable connections
- Pages are separated by line feed characters
- They stand far right when in line formation
3
- They will begin line formation before ceremony
- Stand balanced and comfortable during line up
- They stand far right when in line formation

- Your line will not get tied while you are connected to net
- Connect modem to any jack on your line
- New line service provides reliable connections

- Quit printing after the last line of each file
- Lines that do not fit a page are truncated
- Pages are separated by line feed characters
4 Introduction
- What is Word Sense Discrimination?
- Unsupervised learning
[Diagram labels: Training, Features, Test, Feature Vectors, similarity matrix, Clusters, evaluate]
5 Representing context
- Features (from training)
  - Bigrams
  - Unigrams
  - Second Order Co-occurrences/SOCs (Schütze, 1998)
  - Mixture
- Feature vectors (Binary)
- Measuring similarity
  - Cosine
  - Match
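A minimal sketch of binary feature vectors and the two similarity measures. The slides do not define the measures, so this assumes that Match counts the features two contexts share and that Cosine divides that count by the geometric mean of the two vector sizes. The feature list and the two contexts are hypothetical.

```python
import math

def binary_vector(context_tokens, features):
    """Return a 0/1 vector marking which features occur in the context."""
    present = set(context_tokens)
    return [1 if f in present else 0 for f in features]

def match(u, v):
    """Match (assumed definition): number of features both vectors share."""
    return sum(a & b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine for binary vectors: shared features scaled by vector sizes."""
    norm = math.sqrt(sum(u) * sum(v))
    return match(u, v) / norm if norm else 0.0

# Hypothetical unigram features and two hypothetical contexts.
features = ["modem", "jack", "service", "page", "printing"]
a = binary_vector("connect modem to any jack".split(), features)
b = binary_vector("modem service is reliable".split(), features)
print(match(a, b), cosine(a, b))
```

Under these assumptions the only difference between the measures is the scaling: Match grows with the number of features in a context, while Cosine stays between 0 and 1.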
6 Feature examples
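7 McQuitty's method
- Pedersen & Bruce, 1997
- Agglomerative
- UPGMA / Average Link
- Stopping rules
  - Number of clusters
  - Score cutoff
[Diagram: the similarity of a merged cluster to another cluster is (x + y)/2, where x and y are the similarities of the two merged clusters to it]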
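The agglomerative procedure can be sketched as follows, assuming a symmetric similarity matrix as input. The function name `mcquitty_cluster`, its parameters, and the toy matrix are hypothetical, and this is an illustration rather than the implementation used in the experiments. Both stopping rules from the slide appear as parameters: a target number of clusters and an optional score cutoff.

```python
def mcquitty_cluster(sim, num_clusters, min_score=None):
    """Agglomerative clustering on a similarity matrix.

    After merging two clusters, the similarity of the new cluster to any
    other cluster is the plain average (x + y) / 2 of the two old values.
    """
    # Each active cluster id maps to the item indices it contains.
    clusters = {i: [i] for i in range(len(sim))}
    # Pairwise similarities keyed by (smaller id, larger id).
    s = {(i, j): sim[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(sim)
    while len(clusters) > max(num_clusters, 1):
        a, b = max(s, key=s.get)  # most similar pair of clusters
        if min_score is not None and s[(a, b)] < min_score:
            break  # score cutoff: nothing left that is similar enough
        del s[(a, b)]
        merged = clusters.pop(a) + clusters.pop(b)
        for c in clusters:
            x = s.pop((min(a, c), max(a, c)))
            y = s.pop((min(b, c), max(b, c)))
            s[(c, next_id)] = (x + y) / 2
        clusters[next_id] = merged
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical similarities among four contexts.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(mcquitty_cluster(sim, num_clusters=2))
```

Note that the update averages the two similarities directly, without weighting by cluster size, so the full matrix never needs to be revisited after a merge.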
8 Evaluation
[Diagram: confusion table of clusters (c2, c3, c1, c4) against senses (sense1, sense2, sense3, sense4), with the majority sense marked (Maj)]
9 Evaluation
Accuracy = 38/55 = 0.69
[Diagram labels: sense3, sense4, sense1, sense2]
10 Majority Sense Classifier
Maj. = 17/55 = 0.31
[Diagram label: sense2]
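The two scores on these slides can be sketched as follows, assuming that accuracy is computed by assigning each cluster to a distinct sense so that the number of agreeing instances is as large as possible, and that the majority classifier labels every instance with the most frequent sense. The confusion table below is hypothetical and is not the table from the slides.

```python
from itertools import permutations

def best_mapping_accuracy(confusion):
    """confusion[c][s] = instances in cluster c whose true sense is s.

    Tries every one-to-one cluster-to-sense assignment (fine for a handful
    of senses) and returns the best fraction of agreeing instances.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    best = max(
        sum(confusion[c][perm[c]] for c in range(n))
        for perm in permutations(range(n))
    )
    return best / total

def majority_baseline(confusion):
    """Accuracy of always predicting the most frequent sense."""
    total = sum(sum(row) for row in confusion)
    sense_totals = [sum(column) for column in zip(*confusion)]
    return max(sense_totals) / total

# Hypothetical 3-cluster, 3-sense table with 25 instances.
confusion = [
    [2, 6, 0],
    [1, 2, 4],
    [8, 1, 1],
]
print(best_mapping_accuracy(confusion), majority_baseline(confusion))
```

Comparing the two numbers mirrors the comparison made on the results slides: an experiment is counted as a success for a word when its accuracy exceeds the majority score.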
11 Experimental Data
12 Scope of the experiments
- 584 experiments (73 × 4 × 2)
- 73 words: 72 from Senseval-2, plus LINE
- 4 features: Bigrams, Unigrams, SOCs, Mix
- 2 similarity measures: Match, Cosine
- Window: 5
  - for Bigrams and SOCs
- Frequency cutoff: 2
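A rough sketch of how the window and frequency cutoff might apply to bigram features. Two assumptions are made here that the slides do not state: a window of 5 means the two words of a bigram span at most 5 token positions (up to 3 intervening words), and a cutoff of 2 keeps bigrams seen at least twice in the training data. The function names and the two contexts are hypothetical.

```python
from collections import Counter

def windowed_bigrams(tokens, window=5):
    """Yield ordered word pairs that fit inside a window of `window` tokens."""
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + window]:
            yield (first, second)

def select_bigrams(contexts, window=5, min_count=2):
    """Keep bigrams that occur at least `min_count` times across contexts."""
    counts = Counter()
    for tokens in contexts:
        counts.update(windowed_bigrams(tokens, window))
    return {bigram for bigram, n in counts.items() if n >= min_count}

# Hypothetical training contexts, already tokenized.
contexts = [
    ["connect", "modem", "to", "any", "jack"],
    ["modem", "in", "the", "wall", "jack"],
]
print(select_bigrams(contexts))
```

With these settings only the pair ("modem", "jack") survives, since it is the only bigram that appears in both contexts.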
13 Senseval-2 Results: POS wise
29 nouns
28 verbs
15 adjectives
Maj = 0.57
Maj = 0.51
Maj = 0.64
Number of words of each POS for which an experiment obtained accuracy above the majority baseline
14 Senseval-2 Results: Feature wise
SOC
UNI
BI
32
18
38
72 words × 2 measures = 144
15 Senseval-2 Results: Measure wise
COS
MAT
49
39
72 words × 3 features = 216
16 Line Results
Maj = 0.16
On a uniform distribution of 6 senses
17 Sample Confusion Table (fine.soc.cos)
S0 = elegant, S1 = small grained, S2 = superior, S3 = satisfactory, S4 = thin
Total: 60
Precision = 36/60 = 60.00%
18 Conclusions
- Small set of SOCs was powerful
  - Half the number of unigrams/bigrams
- Scaling done by Cosine helps!
- Need more training data!
- Need to improve feature
  - selection (Tests of association)
  - extraction (Stemming)
  - matching (Fuzzy matching)
  - strategies for bigrams
- Explore new features
  - POS
  - Collocations
19 Recent work
- PDL implementation
- Cluto - Clustering Toolkit
  - http://www-users.cs.umn.edu/~karypis/cluto
  - 6 clustering methods, 12 merging criteria
- Plans
  - Comparing clustering in similarity space vs. vector space (Schütze, 1998)
  - Stopping rules
20 Sense labeling
- They will begin line formation before ceremony
- Stand balanced and comfortable during line up
- They stand far right when in line formation
Label: formation

- Your line will not get tied while you are connected to net
- Connect modem to any jack on your line
- New line service provides reliable connections
Label: phone

- Quit printing after the last line of each file
- Lines that do not fit a page are truncated
- Pages are separated by line feed characters
Label: text
21 Software Packages
- SenseClusters (Our Discrimination Toolkit)
  - http://www.d.umn.edu/~tpederse/senseclusters.html
- PDL (Used to implement clustering algorithms)
  - http://pdl.perl.org/
- NSP (Used for extracting features)
  - http://www.d.umn.edu/~tpederse/nsp.html
- SenseTools (Used for preprocessing, feature matching)
  - http://www.d.umn.edu/~tpederse/sensetools.html
- Cluto (Clustering Toolkit)
  - http://www-users.cs.umn.edu/~karypis/cluto