Title: Domain-Specific Sense Distributions and Predominant Sense Acquisition
1. Domain-Specific Sense Distributions and Predominant Sense Acquisition
- Rob Koeling, Diana McCarthy, John Carroll
- University of Sussex
2. Overview
- Motivation
- Finding Predominant Senses
- Creating the 3 Gold Standards
- Predominant Sense Evaluation
- Conclusions / Future Work
3. Motivation
- Distributions of word senses are often highly skewed
- Manually sense-tagged data (SemCor) is used for WSD
  - Either as training data for a statistical model
  - Or as back-off if the primary model fails
- The first sense heuristic is powerful
  - 61.5% on the Senseval-3 all-words task
- Information about the domain of a document can help WSD
4. Motivation
- Consider the word goal (WordNet 1.7.1)
- Synonyms/hypernyms (ordered by estimated frequency) of the noun goal; 4 senses of goal:
  - Sense 1: goal, end, objective => content, cognitive content, mental object
  - Sense 2: goal => score
  - Sense 3: goal => game equipment
  - Sense 4: destination, goal => point
5. Motivation
- Sense distributions (for some words) are domain specific
- Beneficial for WSD
- No existing domain-specific sense-tagged corpora
- Automatic ranking applied to specific domains
6. Goal (mental object)
- Create sense-tagged data for a selection of 40 words for different domains
- Characterise the annotated data
- Apply the automatic ranking method to these domains and evaluate on the sense-tagged data
7. Finding Predominant Senses
- Ingredients (sketched in code below)
  - Automatically created thesaurus (e.g. Lin 1998)
    - goal: aim (0.3), win (0.2), ..., victory (0.1)
  - Sense inventory (e.g. WordNet)
    - 1) objective, 2) score, 3) game equipment, 4) destination
  - Sense similarity score (e.g. as defined in the WordNet Similarity package)
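To make the ingredients concrete, here is a minimal sketch (not the authors' code) of how the three resources could be represented, using NLTK's WordNet interface as a stand-in for the sense inventory and the similarity measure; the neighbour scores are the toy figures from the slide.

```python
from nltk.corpus import wordnet as wn

# 1) Automatically created distributional thesaurus (e.g. Lin 1998):
#    each target word maps to its distributionally similar neighbours and scores.
#    The values below are the toy figures from the slide, not real output.
thesaurus = {
    "goal": [("aim", 0.3), ("win", 0.2), ("victory", 0.1)],
}

# 2) Sense inventory: the noun senses of 'goal' in WordNet
#    (objective, score, game equipment, destination).
senses = wn.synsets("goal", pos=wn.NOUN)

# 3) Sense similarity score: any WordNet-based measure; Wu-Palmer similarity is
#    used here purely as an illustration (the talk refers to the
#    WordNet Similarity package).
def sense_similarity(synset_a, synset_b):
    return synset_a.wup_similarity(synset_b) or 0.0
```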
8-12. Calculating the Prevalence of Word Senses
GOAL
Neighbours: aim (0.3), win (0.2), ..., victory (0.1)
Senses: 1) objective, 2) score, 3) game equipment, 4) destination
- Calculate the semantic similarity score between the first sense of goal (objective) and the first neighbour (aim)
- Normalise it and multiply by the distributional similarity score (0.3)
- Repeat the procedure for the first sense and the second neighbour, the third neighbour, etc.
- Add up all the scores to compute the ranking score for sense 1 (objective)
- Repeat the procedure for sense 2 (score), sense 3, etc. (a code sketch of the whole procedure follows)
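Putting the steps above together, a minimal sketch of the ranking computation might look as follows. It follows the procedure described on these slides rather than the authors' implementation; the Wu-Palmer measure and the three-neighbour thesaurus entry are stand-ins.

```python
from nltk.corpus import wordnet as wn

def sense_similarity(synset_a, synset_b):
    # Stand-in WordNet similarity measure (Wu-Palmer); the talk refers to a
    # measure from the WordNet Similarity package instead.
    return synset_a.wup_similarity(synset_b) or 0.0

def max_sim_to_neighbour(sense, neighbour):
    # Similarity between one sense of the target and a neighbour word:
    # take the best score over the neighbour's own noun senses.
    return max((sense_similarity(sense, ns)
                for ns in wn.synsets(neighbour, pos=wn.NOUN)), default=0.0)

def ranking_scores(word, neighbours):
    """Prevalence (ranking) score for each noun sense of `word`, given
    (neighbour, distributional-similarity) pairs from a thesaurus."""
    senses = wn.synsets(word, pos=wn.NOUN)
    scores = {sense: 0.0 for sense in senses}
    for neighbour, dss in neighbours:
        sims = {sense: max_sim_to_neighbour(sense, neighbour) for sense in senses}
        total = sum(sims.values())
        if total == 0.0:
            continue  # this neighbour contributes nothing to any sense
        for sense in senses:
            # Normalise the semantic similarity over the target's senses and
            # weight it by the distributional similarity score, then accumulate.
            scores[sense] += dss * sims[sense] / total
    return scores

# Toy neighbour list from the slides; a real thesaurus entry would have many more neighbours.
print(ranking_scores("goal", [("aim", 0.3), ("win", 0.2), ("victory", 0.1)]))
```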
13-14. Finding the Predominant Sense
- Ranking (Sports) for goal
  - Score (1.2345)
  - Objective (0.8345)
  - Destination (0.5434)
  - Game equipment (0.4536)
- Ranking (Finance) for goal
  - Objective (1.3452)
  - Score (0.9450)
  - Destination (0.4374)
  - Game equipment (0.3536)
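Given rankings like these, the domain-specific predominant (first) sense is just the top-ranked sense for that domain. A small illustrative snippet, using the example scores from the slide:

```python
# Example rankings for 'goal' from the slide (scores are illustrative).
rankings = {
    "Sports":  {"score": 1.2345, "objective": 0.8345,
                "destination": 0.5434, "game equipment": 0.4536},
    "Finance": {"objective": 1.3452, "score": 0.9450,
                "destination": 0.4374, "game equipment": 0.3536},
}

# The predominant sense for each domain is simply the top-ranked (argmax) sense.
for domain, scores in rankings.items():
    predominant = max(scores, key=scores.get)
    print(f"{domain}: predominant sense of 'goal' = {predominant}")
```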
15. Creating the 3 Gold Standards
- Corpora used:
  - BNC (90M words, mixed)
  - Reuters Finance (32M words)
  - Reuters Sports (9M words)
- We computed a thesaurus for each of these corpora
16. Creating the 3 Gold Standards
- Word selection
  - 40 nouns, not completely random
  - 2 sets of words, chosen via:
    - Subject Field Codes (domain labels for WordNet 1.6)
    - Domain salience
17. Subject Field Codes
- 38 words have at least 1 sense labelled Sports and 1 labelled Finance
- Not all usable; three criteria (sketched in code below):
  - Frequency in the BNC of at least 1000
  - At most 12 senses
  - At least 75 examples in each corpus
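As a sketch of the filtering step, the three criteria could be applied like this; the BNC frequency and per-corpus example counts are assumed to come from precomputed tables, and the sense count uses NLTK's current WordNet rather than WordNet 1.6, so the numbers are illustrative only.

```python
from nltk.corpus import wordnet as wn

def usable(word, bnc_freq, examples_per_corpus,
           min_freq=1000, max_senses=12, min_examples=75):
    """Apply the three selection criteria from the slide.

    bnc_freq: precomputed frequency of `word` in the BNC (hypothetical input).
    examples_per_corpus: dict of corpus name -> number of examples of `word`.
    """
    return (bnc_freq >= min_freq
            and len(wn.synsets(word, pos=wn.NOUN)) <= max_senses
            and all(count >= min_examples for count in examples_per_corpus.values()))

# Hypothetical counts, for illustration only.
print(usable("goal", bnc_freq=5000,
             examples_per_corpus={"BNC": 300, "Finance": 120, "Sports": 400}))
```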
18. Subject Field Codes
- Resulting set (17 words):
  - club, manager, record, right, bill, check, competition, conversion, crew, delivery, division, fishing, reserve, return, score, receiver, running
- Includes high-frequency, mid-frequency, and low-frequency words in the BNC
19. Domain salience
- Resulting sets
  - Sport: fan, star, transfer, striker, goal, title, tie, coach
  - Finance: package, chip, bond, market, strike, bank, share, target
  - Equal: will, phase, half, top, performance, level, country
20. The Annotation Task
- Set up as an Open Mind Word Expert task
- 10 annotators
- 125 sentences randomly sampled from each corpus
- Some noise filtered
- First 100 selected
- Most sentences triple annotated
22. Characterisation of the Annotated Data
- 33,225 tagging acts
- 65% inter-annotator agreement
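The slide does not state which agreement measure was used; as a rough sketch, pairwise agreement over the tagging acts could be computed along these lines (the data layout and sense labels are assumptions).

```python
from itertools import combinations

def pairwise_agreement(annotations):
    """annotations: dict mapping sentence id -> {annotator: chosen sense label}.
    Returns the proportion of annotator pairs (per item) that agree on the sense."""
    agree = total = 0
    for labels in annotations.values():
        for a, b in combinations(sorted(labels), 2):
            total += 1
            agree += labels[a] == labels[b]
    return agree / total if total else 0.0

# Toy example: three annotators, two sentences (labels are illustrative).
print(pairwise_agreement({
    "s1": {"ann1": "score", "ann2": "score", "ann3": "objective"},
    "s2": {"ann1": "objective", "ann2": "objective", "ann3": "objective"},
}))  # 4 agreeing pairs out of 6 -> ~0.67
```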
23-25. Sense Distributions
26-30. Predominant Sense Evaluation
- Disambiguation using the predominant sense
31. Predominant Sense Evaluation
- Best results when trained on a domain-relevant corpus
- The random baseline and the SemCor baseline are always comfortably beaten
- For words that are pertinent to the domain, it pays to use domain-specific training data (see the sketch below)
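The evaluation itself reduces to checking, for each gold-tagged instance, whether the acquired predominant sense matches the annotated sense. A minimal sketch, with a hypothetical data layout and sense labels:

```python
def first_sense_accuracy(gold_instances, predicted_first_sense):
    """gold_instances: list of (word, set of gold sense labels) pairs.
    predicted_first_sense: dict mapping word -> acquired predominant sense label."""
    correct = sum(predicted_first_sense.get(word) in gold_labels
                  for word, gold_labels in gold_instances)
    return correct / len(gold_instances) if gold_instances else 0.0

# Hypothetical gold annotations for two instances of 'goal' in the Sports corpus.
gold = [("goal", {"goal:score"}), ("goal", {"goal:objective"})]
print(first_sense_accuracy(gold, {"goal": "goal:score"}))  # 0.5
```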
32-37. Discussion/Conclusions
- Sense-tagged corpora for 2 domains
- The predominant sense is much more dominant within a domain than in the general case
- Quantitative evaluation of automatic ranking
  - Automatic acquisition of predominant senses can outperform the SemCor baseline
- Choosing the predominant sense will be hard to beat for some words within a specific domain
  - Others remain highly ambiguous
38-40. Discussion/Conclusions
- When should the domain-specific predominant sense be selected for WSD?
- Look at differences in the automatic rankings: domain corpus vs. balanced corpus (see the sketch below)
  - E.g. a different predominant sense
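One simple way to operationalise this, as a sketch: compare the ranking obtained from the domain corpus with the one from the balanced corpus, and prefer the domain-specific predominant sense when the two disagree on the top sense. Only the Sports figures come from the earlier slide; the balanced-corpus scores below are made up for illustration.

```python
def prefer_domain_sense(domain_ranking, balanced_ranking):
    """Return True if the domain-specific predominant sense should be used,
    i.e. the two rankings disagree on the top-ranked sense.
    A score-divergence threshold could be substituted for this simple test."""
    top_domain = max(domain_ranking, key=domain_ranking.get)
    top_balanced = max(balanced_ranking, key=balanced_ranking.get)
    return top_domain != top_balanced

# Sports ranking for 'goal' from the earlier slide; the balanced (BNC) ranking
# is hypothetical.
sports = {"score": 1.2345, "objective": 0.8345,
          "destination": 0.5434, "game equipment": 0.4536}
bnc = {"objective": 1.10, "score": 0.90, "destination": 0.40, "game equipment": 0.30}
print(prefer_domain_sense(sports, bnc))  # True: 'score' vs 'objective'
```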
41. Future work
- Best method for quantifying substantial change
- Beyond the predominant sense: use the full ranking (sense distributions) for improved WSD models
- Influence of corpus size
- Influence of noise (robustness)
42. Thank you!