Title: Word Sense Disambiguation
1. Word Sense Disambiguation

German Rigau i Claramunt
http://www.lsi.upc.es/rigau
TALP Research Center
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya

2. WSD: Outline
- Setting
- Unsupervised WSD systems
- Supervised WSD systems
- Using the Web and EWN for WSD
3. Using the Web and EWN for WSD: Setting
- Word Sense Disambiguation
- is the problem of assigning the appropriate meaning (sense) to a given word in a text
- WSD is perhaps the great open problem at the lexical level of NLP (Resnik & Yarowsky 97)
- WSD resolution would allow
- acquisition of subcategorisation structure for parsing
- improving existing Information Retrieval
- Machine Translation
- Natural Language Understanding
4. Using the Web and EWN for WSD: Setting
- Example
- Senses (WordNet 1.5)
- age 1: the length of time something (or someone) has existed ("his age was 71"; "it was replaced because of its age")
- age 2: a historic period ("the Victorian age"; "we live in a litigious age")
- DSO Corpus examples (Ng 96)
- He was mad about stars at the >> age 1 << of nine.
- About 20,000 years ago the last ice >> age 2 << ended.
5. Using the Web and EWN for WSD: Setting
- Knowledge-Driven WSD (Unsupervised)
- knowledge-based WSD
- 100% coverage
- 55% accuracy (SensEval-1)
- No Training Process
- Large-scale lexical knowledge resources
- WordNet
- MRDs
- Thesauri
6. Using the Web and EWN for WSD: Setting
- Data-Driven WSD (Supervised)
- corpus-based WSD
- statistical WSD
- Machine-Learning WSD
- no full coverage
- 75% accuracy (SensEval-1)
- Training Process
- learning from large amounts of sense-annotated corpora
- (Ng 97): an estimated effort of 16 person-years per language
7. Unsupervised Word Sense Disambiguation Systems

German Rigau i Claramunt
http://www.lsi.upc.es/rigau
TALP Research Center
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
8. Unsupervised WSD Systems: Outline
- Setting
- Knowledge-driven WSD methods
- MRDs
- Thesauri + Corpus
- LKBs
- LKBs: Conceptual Distance
- LKBs: Conceptual Density
- LKBs + Corpus
- Experiments: Genus Sense Disambiguation
- Future Work
9. Unsupervised WSD Systems: Setting
- Knowledge-Driven (Unsupervised)
- No need of large annotated corpora
- Tested on unrestricted domains (words and senses)
- (-) Worse results
10. Unsupervised WSD Systems: MRDs
- Lesk Method (see the sketch below)
- (Lesk 86)
- counting word overlap between the context and the dictionary senses of the word
- (Cowie et al. 92)
- simulated annealing for overcoming the combinatorial explosion, using LDOCE
- (Wilks & Stevenson 97)
- simulated annealing
- 57% accuracy at the sense level
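A minimal Python sketch of the Lesk overlap heuristic; the toy sense inventory (built from the WordNet glosses of slide 4) and the test sentence are illustrative, not the LDOCE data cited above:

    # Lesk (86): pick the sense whose dictionary definition shares
    # the most words with the context.
    def lesk(context_words, sense_definitions):
        context = set(w.lower() for w in context_words)
        best_sense, best_overlap = None, -1
        for sense, definition in sense_definitions.items():
            overlap = len(context & set(definition.lower().split()))
            if overlap > best_overlap:
                best_sense, best_overlap = sense, overlap
        return best_sense

    senses = {
        "age 1": "the length of time something or someone has existed",
        "age 2": "a historic period",
    }
    print(lesk("we live in a litigious historic period".split(), senses))  # age 2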
11. Unsupervised WSD Systems: MRDs
- Co-occurrence Word Vectors
- (Wilks et al. 93)
- word-context vectors from LDOCE
- testing a large set of relatedness functions
- 13 senses of the word bank
- 45% accuracy
- (Rigau et al. 97)
- (Noun) Genus Sense Disambiguation
- 60% accuracy
12. Unsupervised WSD Systems: MRDs
Example: strongest co-occurrence pairs for "queso" (cheese) out of 371,616 connections.
(The columns appear to be: association score, a secondary score, pair count, word pair, and the individual word frequencies.)

 11.8004  9.8  16  elaborado    queso      35  113
 10.8938  8.0  23  pasta        queso     178  113
 10.4846  7.5  25  leche        queso     274  113
 10.2483  9.2  13  oveja        queso      45  113
  9.1513  7.6  16  queso        sabor     113  160
  7.4956  8.3   8  queso        tortilla  113   51
  6.7732  7.5   8  queso        vaca      113   84
  6.5830  6.1  12  maíz         queso     347  113
  6.2208  8.9   5  queso        suero     113   21
  6.1509  8.8   5  mantequilla  queso      22  113
  6.1474  7.9   6  compacta     queso      50  113
  5.9918  7.7   6  picante      queso      55  113
  5.9002  9.8   4  manchego     queso       9  113
  5.6805  7.3   6  cabra        queso      75  113
  5.6300  5.9   9  pan          queso     287  113
13. Unsupervised WSD Systems: Thesauri + Corpus
- (Yarowsky 92)
- uses Roget's Thesaurus categories to partition Grolier's Encyclopedia
- 1042 categories
- 92% accuracy for 12 polysemous words
- (Yarowsky 95)
- seed words
- (Liddy & Paik 92)
- subject-code correlation matrix
- 122 LDOCE semantic codes
- 166 sentences of the Wall Street Journal
- 89% correct subject codes
14. Unsupervised WSD Systems: LKBs Conceptual Distance
- (Rada et al. 92)
- length of the shortest path
- (Sussna 93)
- (Agirre et al. 94)
- (Rigau 94; Rigau et al. 95, 97; Atserias et al. 97)
- length of the shortest path (formulas below)
- specificity of the concepts
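In LaTeX notation, a sketch of the two measures mentioned here; the depth-weighted variant follows my reading of (Agirre et al. 94; Rigau et al. 95):

    % Shortest-path conceptual distance between two concepts:
    \mathrm{dist}(c_1, c_2) = \min_{\text{paths from } c_1 \text{ to } c_2} \mathrm{length}(\text{path})

    % Variant weighting each node on the path by its specificity (depth),
    % taken over all sense pairs of the two words:
    \mathrm{dist}(w_1, w_2) = \min_{c_i \in w_1,\; c_j \in w_2}
        \sum_{c_k \in \mathrm{path}(c_i, c_j)} \frac{1}{\mathrm{depth}(c_k)}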
15. Unsupervised WSD Systems: LKBs Conceptual Density
16. Unsupervised WSD Systems: LKBs Conceptual Density
- (Agirre & Rigau 95, 96), formula sketched below
- length of the shortest path
- the depth in the hierarchy
- concepts in a dense part of the hierarchy are relatively closer than those in a sparser region
- the measure should be independent of the number of concepts involved
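For reference, the basic Conceptual Density formula of (Agirre & Rigau 96) for a subhierarchy rooted at concept c containing m senses of the context words (the paper further tunes the exponent for smoothing):

    CD(c, m) = \frac{\sum_{i=0}^{m-1} \mathrm{nhyp}^{\,i}}{\mathrm{descendants}_c}

where nhyp is the mean number of hyponyms per node under c and descendants_c the total number of concepts below c.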
17. Unsupervised WSD Systems: LKBs + Corpus
- (Resnik 95)
- Information Content (formulas below)
- (Richardson et al. 94)
- (Jiang & Conrath 97)
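A sketch of the corpus-based measures behind these references, in LaTeX notation:

    % Information content of a concept, from corpus frequencies (Resnik 95):
    IC(c) = -\log P(c)

    % Similarity as the IC of the most informative common subsumer:
    \mathrm{sim}(c_1, c_2) = \max_{c \,\in\, S(c_1, c_2)} IC(c)

    % (Jiang & Conrath 97) distance, combining both concepts with their
    % lowest common subsumer:
    \mathrm{dist}(c_1, c_2) = IC(c_1) + IC(c_2) - 2\, IC(\mathrm{lcs}(c_1, c_2))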
18. Unsupervised WSD Systems: Experiments on Genus Sense Disambiguation
- Unsupervised WSD
- Unrestricted WSD (100% coverage)
- Eight heuristics (McRoy 92)
- Combining several lexical resources
- Combining several methods
19. Unsupervised WSD Systems: Experiments on Genus Sense Disambiguation
- 0) Monosemous Genus Term
- 1) Entry Sense Ordering
- 2) Explicit Semantic Domain
- 3) Word Matching (Lesk 86)
- 4) Simple Concordance
- 5) Co-occurrence Word Vectors
- 6) Semantic Vectors
- 7) Conceptual Distance
20. Unsupervised WSD Systems: Experiments on Genus Sense Disambiguation
21. Unsupervised WSD Systems: Experiments on Genus Sense Disambiguation
- Knowledge provided by each heuristic
22. Supervised Word Sense Disambiguation Systems

German Rigau i Claramunt
http://www.lsi.upc.es/rigau
TALP Research Center
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
23. WSD using ML algorithms: Outline
- Setting
- Methodology
- Machine Learning algorithms
- Naive Bayes (Mooney 98)
- Snow (Dagan et al. 97)
- Exemplar-based (Ng 97)
- LazyBoosting (Escudero et al. 00)
- Experimental Results
- Naive Bayes vs. Exemplar Based
- Portability and Tuning of Supervised WSD
- Future Work
24. WSD using ML algorithms: Setting
- Data-Driven (Supervised)
- Better results
- (-) Need of large corpora
- knowledge acquisition bottleneck (Gale et al. 93; Ng 97)
- (-) Tested on limited domains (words and senses)
25. WSD using ML algorithms: Setting
- Current research lines for opening the bottleneck
- Design of efficient example sampling methods (Engelson & Dagan 96; Fujii et al. 98)
- Use of WordNet and the Web to automatically obtain examples (Leacock et al. 98; Mihalcea & Moldovan 99)
- Use of unsupervised methods for estimating parameters (Pedersen & Bruce 98)
26. WSD using ML algorithms: Setting
- Contradictory Previous Work
- (Mooney 98)
- Student's t-test of significance
- n-fold cross-validation
- (-) Only the word "line", with 4,149 examples and 6 senses (Leacock et al. 93)
- (-) Neither parameter setting nor algorithm tuning
- (Ng 97)
- Large corpora (192,800 occurrences of 191 words)
- (-) Direct test (no n-fold cross-validation)
- (-) Small set of features
27. WSD using ML algorithms: Outline
- Setting
- Methodology
- Machine Learning algorithms
- Naive Bayes (Mooney 98)
- Snow (Dagan et al. 97)
- Exemplar-based (Ng 97)
- LazyBoosting (Escudero et al. 00)
- Experimental Results
- Naive Bayes vs. Exemplar Based
- Portability and Tuning of Supervised WSD
- Future Work
28. WSD using ML algorithms: Methodology
- Main goals
- Study supervised methods for WSD
- Use them with examples automatically extracted from the Web using WordNet
- Rigorous direct comparisons
- Supervised WSD Methods
- Naive Bayes
- state-of-the-art accuracy (Mooney 98)
- Snow
- from Text Categorization (Dagan et al. 97)
- Exemplar-based
- state-of-the-art accuracy (Ng 97)
- Boosting
- from Text Categorization (Schapire & Singer, to appear; Escudero, Màrquez & Rigau 2000)
29. WSD using ML algorithms: Methodology
- Evaluation (Dietterich 98), as sketched below
- 10-fold cross-validation
- Student's t-test of significance
- Data
- DSO corpus from the LDC (Ng 96)
- 192,800 occurrences of 191 words (121 nouns, 70 verbs)
- avg. number of senses: 7.2 (N), 12.6 (V), 9.2 (all)
- WSJ Corpus (Corpus A)
- Brown Corpus (Corpus B)
- Sets of attributes
- Set A (Ng 97)
- small set of features
- no broad-context attributes
- Set B (Ng 96)
- large set of features
- broad-context attributes
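A hedged Python sketch of this evaluation protocol; `evaluate_a` and `evaluate_b` are dummy stand-ins for training and testing two WSD systems on one fold:

    import numpy as np
    from scipy.stats import ttest_rel

    def kfold_accuracies(evaluate, n_examples, k=10, seed=0):
        """evaluate(train_idx, test_idx) -> accuracy on the held-out fold."""
        idx = np.random.default_rng(seed).permutation(n_examples)
        folds = np.array_split(idx, k)
        accs = []
        for i, test in enumerate(folds):
            train = np.concatenate([f for j, f in enumerate(folds) if j != i])
            accs.append(evaluate(train, test))
        return accs

    # Dummy stand-ins; a real system would train on `train`, test on `test`.
    rng = np.random.default_rng(1)
    evaluate_a = lambda train, test: 0.66 + rng.normal(0, 0.01)  # e.g. NB
    evaluate_b = lambda train, test: 0.68 + rng.normal(0, 0.01)  # e.g. EB

    acc_a = kfold_accuracies(evaluate_a, n_examples=1000)
    acc_b = kfold_accuracies(evaluate_b, n_examples=1000)
    t, p = ttest_rel(acc_a, acc_b)  # paired t-test over the 10 folds
    print(f"t = {t:.2f}, p = {p:.3f}")  # difference significant if p < 0.05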
30. WSD using ML algorithms: Outline
- Setting
- Methodology
- Machine Learning algorithms
- Naive Bayes (Mooney 98)
- Snow (Dagan et al. 97)
- Exemplar-based (Ng 97)
- LazyBoosting (Escudero et al. 00)
- Experimental Results
- Naive Bayes vs. Exemplar Based
- Portability and Tuning of Supervised WSD
- Future Work
31. WSD using ML algorithms: Naive Bayes
- Based on Bayes' theorem (Duda & Hart 73), as in the formula below
- Frequencies used as probability estimates
- Assumes independence of the example features
- Smoothing technique (Ng 97)
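The standard decision rule, for the sense candidates s of word w and context features f_1 .. f_n (the exact smoothing of zero counts follows Ng 97):

    \hat{s} = \arg\max_{s \,\in\, \mathrm{senses}(w)} P(s) \prod_{i=1}^{n} P(f_i \mid s),
    \qquad
    P(s) = \frac{\mathrm{count}(s)}{N}, \quad
    P(f_i \mid s) = \frac{\mathrm{count}(f_i, s)}{\mathrm{count}(s)}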
32. WSD using ML algorithms: Exemplar-based WSD
- k-NN approach (Ng 96; Ng 97), as in the sketch below
- Distances
- Hamming
- Modified Value Difference Metric, MVDM (Cost & Salzberg 93)
- Variants
- example weighting
- attribute weighting (RLM 91)
(figure: k-NN classification example with k = 3)
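A minimal sketch of the exemplar-based classifier with Hamming distance; the MVDM and weighting variants above would replace or modulate `hamming`. The data structures are illustrative:

    from collections import Counter

    def hamming(a, b):
        """Hamming distance between two equal-length feature vectors."""
        return sum(x != y for x, y in zip(a, b))

    def knn_classify(example, memory, k=3):
        """memory: (feature_vector, sense) pairs stored verbatim."""
        nearest = sorted(memory, key=lambda m: hamming(example, m[0]))[:k]
        return Counter(sense for _, sense in nearest).most_common(1)[0][0]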
33. WSD using ML algorithms: Snow
- Snow (Golding & Roth 99)
- Sparse Network of Winnows
- on-line learning system
- Winnow (Littlestone 88), as in the sketch below
- linear threshold unit
- mistake-driven (updates only when the predicted class is wrong)
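A sketch of one on-line step of a single Winnow node (Snow keeps one such node per sense); the promotion factor alpha and threshold theta are illustrative values:

    def winnow_step(weights, active_features, label, alpha=2.0, theta=1.0):
        """Predict from active features only; update weights only on a mistake."""
        score = sum(weights.get(f, 1.0) for f in active_features)
        predicted = score >= theta
        if predicted != label:                  # mistake-driven
            factor = alpha if label else 1.0 / alpha
            for f in active_features:           # multiplicative promote/demote
                weights[f] = weights.get(f, 1.0) * factor
        return predicted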
34. WSD using ML algorithms: Snow
(figure: Snow architecture, with one Winnow node per sense feeding a MAX decision; active features such as w-1 = "average", w+2 = "42", w+1 = "of", w+2 = "nuclear" come from contexts like "... an average <age_1> of 42 ..." and "... in this <age_2> of nuclear ...")
35. WSD using ML algorithms: Boosting
- AdaBoost.MH (Freund & Schapire 00)
- combines many simple weak classifiers (hypotheses)
- weak classifiers are trained sequentially
- each iteration concentrates on the most difficult cases
- Results: better than NB and EB
- (-) Problem: computational complexity
- time and space grow linearly with the number of examples
- Solution: LazyBoosting! (see the sketch below)
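A hedged sketch of the LazyBoosting idea: at each boosting round, the weak learner scores only a random fraction of the attributes instead of all of them. The weak-rule form here (predict a sense from the presence of a single attribute) is a simplification:

    import random

    def lazy_weak_rule(examples, weights, attributes, fraction=0.10):
        """examples: (feature_set, sense) pairs; weights: parallel floats."""
        sampled = random.sample(attributes,
                                max(1, int(fraction * len(attributes))))
        senses = {s for _, s in examples}
        best, best_err = None, float("inf")
        for attr in sampled:                    # only ~10% of the attributes
            for sense in senses:
                # weak rule: predict `sense` iff `attr` occurs in the context
                err = sum(w for (feats, s), w in zip(examples, weights)
                          if (attr in feats) != (s == sense))
                if err < best_err:
                    best, best_err = (attr, sense), err
        return best, best_err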
36. WSD using ML algorithms: Outline
- Setting
- Methodology
- Machine Learning algorithms
- Naive Bayes (Mooney 98)
- Snow (Dagan et al. 97)
- Exemplar-based (Ng 97)
- LazyBoosting (Escudero et al. 00)
- Experimental Results
- Naive Bayes vs. Exemplar Based
- Portability and Tuning of Supervised WSD
- Future Work
37. WSD using ML algorithms: Experimental Results (LazyBoosting)
- Features from Set A (Ng 97): w-2, w-1, w1, w2, (w-2, w-1), (w-1, w1), (w1, w2)
- 15 reference words (10 N, 5 V)

Averages per word:
              senses  examples  attributes
 nouns (121)     8.6      1040        3978
 verbs (70)     17.9      1266        4432
 total (191)    12.1      1115        4150

Accuracy (%):
              MFS   NB    EB1   EB15  AB750  ABSC
 nouns (121)  57.4  71.7  65.8  71.1  73.5   73.4
 verbs (70)   46.6  57.6  51.1  58.1  59.3   59.1
 total (191)  53.3  66.4  60.2  66.2  68.1   68.0
38. WSD using ML algorithms: Experimental Results (LazyBoosting)
- Accelerating the weak learner
- Reducing the feature space
- Frequency filtering (Freq)
- discard features occurring fewer than N times
- Local frequency filtering (LFreq)
- select the N most frequent features of each sense
- RLM ranking (López de Mántaras 91)
- select the N most relevant features
- Reducing the number of attributes examined
- LazyBoosting
- a small proportion of attributes is randomly selected at each iteration
39. WSD using ML algorithms: Experimental Results (LazyBoosting)
- Accelerating the weak learner
- all methods perform quite well
- many irrelevant attributes in the domain
- LFreq is slightly better than Freq
- RLM performs better than LFreq and Freq
- LazyBoosting is better than all the other methods
- acceptable performance exploring only 1% of the attributes when looking for a weak rule
- exploring 10% achieves the same performance as 100%
- 7 times faster!
40. WSD using ML algorithms: Experimental Results (LazyBoosting)
- 7 features from Set A (Ng 97): w-2, w-1, w1, w2, (w-2, w-1), (w-1, w1), (w1, w2)
- 15 reference words (10 N, 5 V)

Averages per word:
              senses  examples  attributes
 nouns (121)     8.6      1040        3978
 verbs (70)     17.9      1266        4432
 total (191)    12.1      1115        4150

Accuracy (%):
              MFS   NB    EB15  LB10SC
 nouns (121)  56.4  68.7  68.0  70.8
 verbs (70)   46.7  64.8  64.9  67.5
 total (191)  52.3  67.1  66.7  69.5
41. WSD using ML algorithms: Experimental Results (NB vs EB)
- Experiments on Set A with 15 words
- Results
- Conclusions
- NB and EB are better than MFS
- k-NN performs better with k > 1
- the variants improve basic EB
- the MVDM metric (Cost & Salzberg) is better than the Hamming distance
- EB performs better than NB
42. WSD using ML algorithms: Experimental Results (NB vs EB)
- Experiments on Set B with 15 words
- Results
- What happened?
- problem with the binary representation of the broad-context attributes
- examples are represented with sparse vectors (5,000 positions)
- any two examples coincide in the majority of values
- this biases the similarity measure in favour of shorter sentences
- Related work clarified
- (Mooney 98): poor results of the k-NN algorithm
- (Ng 96; Ng 97): lower results of a system with a large number of attributes
43. WSD using ML algorithms: Experimental Results (NB vs EB)
- Improving both methods, NB and EB (Escudero et al. 00b)
- use only positive information (see the sketch below)
- treat the broad-context attributes as multivalued attributes
- the similarity S between two values has to be redefined accordingly
- this representation allows a very computationally efficient implementation
- Positive Naive Bayes (PNB)
- Positive Exemplar-based (PEB)
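A sketch of the positive-information representation under my reading of this slide: each broad context is kept as the set of words that actually occur, so absent-absent agreements no longer count and the stored vectors stay small:

    def positive_similarity(context_a, context_b):
        """Overlap of *present* words; one plausible reading of the
        redefined similarity S over multivalued attributes."""
        return len(set(context_a) & set(context_b))

    # Two short contexts no longer look alike just because both lack
    # most of the 5,000 possible words:
    positive_similarity(["party", "election", "vote"],
                        ["match", "team", "vote"])   # -> 1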
44. WSD using ML algorithms: Experimental Results (NB vs EB)
- Experiments on Set B with 15 words
- Results
- Conclusions
- PEB improves the accuracy of EB by 12.2 points
- PEB is higher than on Set A, except for PEBh,10,e,a
- PNB is at least as accurate as NB
- the positive approach greatly increases the efficiency of the algorithms (80 times faster for NB, 15 times for EB)
- PEB accuracy is higher than PNB's
45. WSD using ML algorithms: Experimental Results (NB vs EB)
- Global Results (191 words)
- Conclusions
- On Set A, the best option is Exemplar-based with the MVDM metric
- On Set B, the best option is Exemplar-based with Hamming distance and example weighting
- the MVDM metric has higher accuracy but is currently computationally prohibitive
- Positive Exemplar-based allows adding unordered contextual attributes with an accuracy improvement
- positive information greatly improves efficiency
46. WSD using ML algorithms: Experimental Results (Portability)
- 15 features from Set A (Ng 96): p-3, p-2, p-1, p1, p2, p3, w-1, w1, (w-2, w-1), (w-1, w1), (w1, w2), (w-3, w-2, w-1), (w-2, w-1, w1), (w-1, w1, w2), (w1, w2, w3)
- 21 reference words (13 N, 8 V)
- DSO Corpus
- Wall Street Journal (Corpus A)
- Brown Corpus (Corpus B)
- 7 combinations of training-test sets
- AB-AB, AB-A, AB-B
- A-A, B-B, A-B, B-A
- forcing the number of examples of corpora A and B to be the same (reducing the size to the smallest)
47. WSD using ML algorithms: Experimental Results (Portability)
First Experiment (% accuracy):

 Method  AB-AB  AB-A  AB-B
 MFC     46.6   53.0  39.2
 NB      61.6   67.3  55.9
 EB      63.0   69.0  57.0
 Snow    60.1   65.6  56.3
 LB      66.3   71.8  60.9

 Method  A-A   B-B   A-B   B-A
 MFC     56.0  45.5  36.4  38.7
 NB      65.9  56.8  41.4  47.7
 EB      69.0  57.4  45.3  51.1
 Snow    67.1  56.1  44.1  49.8
 LB      71.3  59.0  47.1  52.0
48. WSD using ML algorithms: Experimental Results (Portability)
- Conclusions of the First Experiment
- LazyBoosting outperforms all the other methods in all cases
- the knowledge acquired from a single corpus almost covers the knowledge obtained by combining both corpora
- very disappointing results!
- Looking at Kappa values
- NB is most similar to MFC
- LB is most similar to DSO
- LB is most dissimilar to MFC
49. WSD using ML algorithms: Experimental Results (Portability)
- Second Experiment
- adding tuning material
- BA-A, AB-B, A-A, B-B
- ranging from 10% to 50% (the remaining 50% for test)
- For NB, EB and Snow it is not worth keeping the original corpus
- LB shows a moderate (but consistent) improvement when retaining the original training set
50. WSD using ML algorithms: Experimental Results (Portability)
- Third Experiment
- Two main reasons
- corpora A and B have very different distributions of senses
- examples from corpora A and B contain different information
- New sense-balanced corpus
- forcing the number of examples of each sense of corpora A and B to be the same (reducing the size to the smallest)
51. WSD using ML algorithms: Experimental Results (Portability)
Third Experiment (% accuracy):

 Method  AB-AB  AB-A  AB-B
 MFC     48.6   48.6  48.5
 LB      64.4   66.2  62.5

 Method  A-A   B-B   A-B   B-A
 MFC     48.6  48.5  48.7  48.7
 LB      65.2  61.7  56.1  58.0

- Even when the same distribution of senses is preserved between training and test examples, portability is not guaranteed!
52. WSD using ML algorithms: Outline
- Setting
- Methodology
- Machine Learning algorithms
- Naive Bayes (Mooney 98)
- Snow (Dagan et al. 97)
- Exemplar-based (Ng 97)
- LazyBoosting (Escudero et al. 00)
- Experimental Results
- Naive Bayes vs. Exemplar Based
- Portability and Tuning of Supervised WSD
- Future Work
53. WSD using ML algorithms: Future Work
- Other methods (SVMs, DLs, ...)
- Other corpora (SemCor, Senseval, Bruce, ...)
- Comparison with unsupervised methods
- Combination of classifiers
- Search for the optimum set of features for each method
- Try new sets of features (semantic features, ...)
- The 3 research lines for solving the knowledge acquisition bottleneck
- Other tagsets (synsets, semantic fields, base concepts, groups of synsets, ...)
54. Using the Web and EuroWordNet for Word Sense Disambiguation

German Rigau i Claramunt
http://www.lsi.upc.es/rigau
TALP Research Center
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya
55. Using the Web and EWN for WSD: Outline
- Setting
- Exploiting EWN Semantic Relations
- Collecting a training corpus from the Web
56. Using the Web and EWN for WSD: Setting
- Our approach
- unsupervised
- automatically obtain training corpora
- using the Web or on-line corpora
- to feed a supervised ML WSD system
57. Using the Web and EWN for WSD: Outline
- Setting
- Exploiting EWN Semantic Relations
- Collecting a training corpus from the Web
58. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- WordNet
- WordNet is organized conceptually
- 123,497 content words
- 11,514 polysemous
- 99,642 synsets

wine, vino -- (fermented juice (of grapes especially))
  => sake, saki -- (Japanese beverage from fermented rice ...)
  => vintage -- (a season's yield of wine from a vineyard)
  => red wine -- (wine having a red color derived from skins ...)
    => Pinot noir -- (dry red California table wine ...)
    => claret, red Bordeaux -- (dry red Bordeaux or Bordeaux-like wine)
      => Saint Emilion -- (full-bodied red wine from ...)
    => Chianti -- (dry red Italian table wine from the Chianti ...)
    => Cabernet, Cabernet Sauvignon -- (superior Bordeaux-type red wine)
    => Rioja -- (dry red table wine from the Rioja ...)
    => zinfandel -- (dry fruity red wine from California)
59. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations

 SR          PoS   Examples
 Synonymy    Noun  coche, automóvil ('car')
             Verb  salir, pasear
             Adj   feliz, contento ('happy')
             Adv   duramente, severamente ('harshly, severely')
 Hyponymy    Noun  coche ('car') -> vehículo ('vehicle')
 Meronymy    Noun  motor ('engine') -> coche ('car')
 Troponymy   Verb  marchar ('to march') -> caminar ('to walk')
 Entailment  Verb  roncar ('to snore') -> dormir ('to sleep')
60. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
61. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
partido 1 ('political party'):
  Todos los partidos piden reformas legales para TV3.
  La derecha planea agruparse en un partido.
  El diputado reiteró que ni él ni UDC, como partido, han recibido dinero de Pellerols.
partido 2 ('match, game'):
  Pero España puso al partido intensidad, ritmo y coraje.
  El seleccionador cree que el partido de hoy contra Italia dará la medida de España.
  El Racing no gana en su campo desde hace seis partidos.
62. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
partido 1 ('political party'):
  No negociaremos nunca con un partido político que sea partidario de la independencia de Taiwan.
  Una vez más es noticia la desviación de fondos destinados a la formación ocupacional hacia la financiación de un partido político.
  Estas leyes fueron votadas gracias a un consenso general de los partidos políticos.
partido 2 ('match, game'):
  Rivera pide el soporte de la afición para encarrilar las semifinales.
  Sólo el equipo de Valero Ribera puede sentenciar una semifinal como lo hizo ayer en un Palau Blaugrana completamente entregado.
  El Racing ganó los cuartos de final en su campo.
63. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- 11,514 polysemous words
- 1 sense

         synonym  brother  father  daughter  grandchild
 1 step     2095     8903    3894       759         116
 2 step                 3    1331        16           3
 3 step                        512
 4 step                        147
 5 step                         43
 total      2905     8906    5927       775         119
64. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- 11,514 polysemous words
- 2 senses

         synonym  brother  father  daughter  grandchild
 1 step      479     6988     584       408          87
 2 step                24      97         8           2
 3 step                         9
 4 step                         3
 total       479     7012     693       417          89
65. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- 11,514 polysemous words
- 3 senses

         synonym  brother  father  daughter  grandchild
 1 step      108     5640      76       239          59
 2 step                22                  6           1
 total       108     5662      76       245          60
66. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- 11,514 polysemous words
- 1 sense
- (SB, SD, SBD, SBDF, SBDFC presumably denote cumulative combinations of the relations above: Synonym, Brother, Daughter, Father, grandChild)

         SB    SD    SBD   SBDF   SBDFC
 1 step  8903  3461  9257  10284  10284
 2 step     3    34   188   1068   1068
 3 step           2    30    137    137
 4 step                 4     19     19
 total   8906  3487  9479  11508  11508
67. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- 11,514 polysemous words
- 2 senses

         SB    SD    SBD   SBDF   SBDFC
 1 step  7580  1282  8048   8891   8899
 2 step   281    16   461   1196   1213
 3 step    11     1    33    264    245
 4 step                 2     80     74
 5 step                       13     13
 6 step                        2      2
 total   7872  1299  8544  10446  10446
68. Using the Web and EWN for WSD: Exploiting EWN Semantic Relations
- 11,514 polysemous words
- 3 senses

         SB    SD    SBD   SBDF   SBDFC
 1 step  6116   568  6691   7657   7673
 2 step   274     5   482   1030   1039
 3 step     5          46    295    311
 4 step                 7     91     78
 5 step                 1     28     12
 6 step                        3      3
 total   6395   573  7230   9104   9113
69. Using the Web and EWN for WSD: Outline
- Setting
- Exploiting EWN Semantic Relations
- Collecting a training corpus from the Web
70. Using the Web and EWN for WSD: Collecting a training corpus from the Web
- (Mihalcea & Moldovan 99), see the sketch below
- search engine: AltaVista
- complex queries
- synonyms
- definitions
- 120 word senses
- 91% precision
- Example
- <grow, raise, farm, produce> (cultivate by growing)
- query: cultivate NEAR growing AND (grow OR raise OR farm OR produce)
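A sketch of how such a query could be assembled from a synset's synonyms and gloss keywords; the function and its inputs are illustrative, with operators as in the AltaVista example above:

    def build_query(gloss_keywords, synonyms):
        """Boolean query in the style of (Mihalcea & Moldovan 99)."""
        near_part = " NEAR ".join(gloss_keywords)
        or_part = " OR ".join(synonyms)
        return f"{near_part} AND ({or_part})"

    print(build_query(["cultivate", "growing"],
                      ["grow", "raise", "farm", "produce"]))
    # cultivate NEAR growing AND (grow OR raise OR farm OR produce)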