Title: CPSC 503 Computational Linguistics
1 CPSC 503 Computational Linguistics
- Word-Sense Disambiguation
- Information Retrieval
- Lecture 18
- Giuseppe Carenini
2 Semantics Summary
- What meaning is and how to represent it
- How to map sentences into their meaning
- Meaning of individual words
- Tasks
- Information Extraction
- Word Sense Disambiguation
- Information Retrieval
3 Today 24/3
- Word-Sense Disambiguation
- Machine Learning Approaches
- Information Retrieval (ad hoc)
4 Supervised ML Approaches to WSD
5 Training Data Example
((word + context) → sense)_i
- ...after the soup she had bass with a big salad
6 WordNet: "bass", music vs. fish
- The noun "bass" has 8 senses in WordNet:
  1. bass - (the lowest part of the musical range)
  2. bass, bass part - (the lowest part in polyphonic music)
  3. bass, basso - (an adult male singer with ...)
  4. sea bass, bass - (flesh of lean-fleshed saltwater fish of the family Serranidae)
  5. freshwater bass, bass - (any of various North American lean-fleshed ...)
  6. bass, bass voice, basso - (the lowest adult male singing voice)
  7. bass - (the member with the lowest range of a family of musical instruments)
  8. bass - (nontechnical name for any of numerous edible marine and freshwater spiny-finned fishes)
7 Representations for Context
- GOAL: an informative characterization of the window of text surrounding the target word
- TASK: select relevant linguistic information and encode it as a feature vector
8 Relevant Linguistic Information (1)
- Collocational info: the words that appear in specific positions to the right and left of the target word
  - Typically the words themselves and their POS:
    [word in position -n, part-of-speech in position -n, ..., word in position +n, part-of-speech in position +n]
  - Assume a window of +/-2 around the target
- Example text (WSJ)
  - "An electric guitar and bass player stand off to one side, not really part of the scene, ..."
  - [guitar, NN, and, CJC, player, NN, stand, VVB]
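A minimal sketch of extracting such a collocational feature vector, assuming the sentence is already POS-tagged (the tags for the words outside the slide's example vector are illustrative guesses):

    from typing import List, Tuple

    def collocational_features(tagged: List[Tuple[str, str]],
                               target_idx: int,
                               window: int = 2) -> List[str]:
        """Collect [word, POS] pairs for each position within +/-window
        of the target word, padding at sentence boundaries."""
        feats = []
        for offset in range(-window, window + 1):
            if offset == 0:
                continue  # skip the target word itself
            i = target_idx + offset
            word, pos = tagged[i] if 0 <= i < len(tagged) else ("<pad>", "<pad>")
            feats.extend([word, pos])
        return feats

    # The WSJ example from the slide; "bass" is at index 4.
    sent = [("An", "AT0"), ("electric", "AJ0"), ("guitar", "NN"),
            ("and", "CJC"), ("bass", "NN"), ("player", "NN"),
            ("stand", "VVB"), ("off", "AVP")]
    print(collocational_features(sent, target_idx=4))
    # ['guitar', 'NN', 'and', 'CJC', 'player', 'NN', 'stand', 'VVB']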
9 Relevant Linguistic Information (2)
- Co-occurrence info: the words that occur anywhere in the window, regardless of position
- Find the k content words that most frequently co-occur with the target in the corpus (for bass: fishing, big, sound, player, fly, ..., guitar, band)
  - Vector for one case: [c(fishing), c(big), c(sound), c(player), c(fly), ..., c(guitar), c(band)]
- Example text (WSJ)
  - "An electric guitar and bass player stand off to one side, not really part of the scene, ..."
  - [0,0,0,1,0,0,0,0,0,0,1,0]
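A sketch of building that count vector. The slide names only 7 of the 12 vocabulary words; the filler words below follow the standard textbook bass example and should be treated as an assumption:

    from collections import Counter
    from typing import List

    def cooccurrence_vector(context: List[str], vocab: List[str]) -> List[int]:
        """Count how often each chosen content word occurs anywhere
        in the window around the target, regardless of position."""
        counts = Counter(w.lower().strip(".,") for w in context)
        return [counts[w] for w in vocab]

    # 12-word content vocabulary for "bass" (filler words are assumed).
    vocab = ["fishing", "big", "sound", "player", "fly",
             "rod", "pound", "double", "runs", "playing", "guitar", "band"]
    window = "An electric guitar and bass player stand off to one side".split()
    print(cooccurrence_vector(window, vocab))
    # [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0] -- matches the slide's example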
10 ML for Classifiers
- Training data: collocational and co-occurrence feature vectors
- A machine-learning algorithm trained on that data outputs a classifier
- Candidate learners:
  - Naïve Bayes
  - Decision lists
  - Decision trees
  - Neural nets
  - Support vector machines
  - Nearest-neighbor methods
11 Naïve Bayes
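The formula on this slide did not survive extraction. The standard Naïve Bayes decision rule for WSD, which the slide presumably showed, picks the sense s that maximizes the posterior given the feature vector f = (f_1, ..., f_n), assuming the features are conditionally independent given the sense:

    \hat{s} = \arg\max_{s \in S} P(s \mid \vec{f})
            = \arg\max_{s \in S} \frac{P(\vec{f} \mid s)\,P(s)}{P(\vec{f})}
            \approx \arg\max_{s \in S} P(s) \prod_{j=1}^{n} P(f_j \mid s)

P(s) and the P(f_j | s) are estimated from the sense-labeled training data by (smoothed) relative frequencies.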
12 Naïve Bayes Evaluation
- Experiment comparing different classifiers (Mooney, 1996)
- Naïve Bayes and a neural network achieved the highest performance: 73% in assigning one of six senses to "line"
13 Bootstrapping
- What if you don't have enough data to train a system?
14 Bootstrapping: how to pick the seeds
- Hand-labeling
  - Likely correct
  - Likely to be prototypical
- One sense per collocation: search for words or phrases strongly associated with the target senses, then label automatically (see the sketch after this slide)
- E.g., bass
  - "play" is strongly associated with the music sense, whereas "fish" is strongly associated with the fish sense
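A minimal sketch of that automatic seed labeling, the first step of Yarowsky-style bootstrapping; the seed words and example contexts are hypothetical:

    def seed_label(instances, seeds):
        """Label each instance whose context contains exactly one seed
        collocation; everything else stays unlabeled and is handled
        by later bootstrapping iterations."""
        labeled = []
        for context in instances:
            words = {w.lower() for w in context.split()}
            senses = {sense for word, sense in seeds.items() if word in words}
            labeled.append((context, senses.pop() if len(senses) == 1 else None))
        return labeled

    seeds = {"play": "MUSIC", "fish": "FISH"}
    data = ["we like to play bass in a band",
            "went out to fish for bass on the lake",
            "the bass was huge"]
    print(seed_label(data, seeds))
    # [(..., 'MUSIC'), (..., 'FISH'), (..., None)]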
15 Unsupervised Methods (Schütze, 1998)
- Training data: (word vector)_1, ..., (word vector)_n
- A machine-learning (clustering) algorithm groups the vectors into K clusters c_i
16 Agglomerative Clustering
- Assign each instance to its own cluster
- Repeat
  - Merge the two clusters that are most similar (see the sketch below)
- Until the specified number of clusters is reached
- If there are too many training instances → random sampling
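A naive sketch of the loop, assuming average-link cosine similarity between clusters (the slide does not specify the linkage or the similarity measure):

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a*a for a in u)) * math.sqrt(sum(b*b for b in v)))

    def agglomerative(vectors, k):
        """O(n^3) agglomerative clustering down to k clusters,
        merging the pair with highest average-link cosine similarity."""
        clusters = [[i] for i in range(len(vectors))]   # one instance per cluster
        while len(clusters) > k:
            best, pair = -1.0, None
            for a in range(len(clusters)):              # find the most similar pair
                for b in range(a + 1, len(clusters)):
                    sim = sum(cosine(vectors[i], vectors[j])
                              for i in clusters[a] for j in clusters[b])
                    sim /= len(clusters[a]) * len(clusters[b])
                    if sim > best:
                        best, pair = sim, (a, b)
            a, b = pair
            clusters[a].extend(clusters[b])             # merge the winning pair
            del clusters[b]
        return clusters

    print(agglomerative([(1, 0), (0.9, 0.1), (0, 1), (0.1, 0.9)], k=2))
    # [[0, 1], [2, 3]]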
17 Problems
- Given these general ML approaches, how many classifiers do we need to perform WSD robustly?
  - One for each ambiguous word in the language
- How do you decide what set of tags/labels/senses to use for a given word?
  - Depends on the application
18 Recent Work on WSD
- "Word Sense Disambiguation: Recent Successes and Future Directions", a SIGLEX/SENSEVAL workshop at ACL 2002, University of Pennsylvania
19 Today 24/3
- Word-Sense Disambiguation
- Machine Learning Approaches
- Information Retrieval (ad hoc)
20 Information Retrieval
- Retrieving relevant documents from document repositories
- Sub-areas
  - Ad hoc retrieval (query → list of documents)
  - Text categorization (document → category)
    - E.g., business news → (OIL, ACQ, ...)
  - Filtering (a special case of TC with 2 categories: relevant/non-relevant)
21 Information Retrieval
- Bag-of-words assumption in modern IR: the meaning of a document is captured by analyzing (counting) the words that occur in it
  - Efficient
  - Works in practice
- Tobias Scheffer and Stefan Wrobel. Text classification beyond the bag-of-words representation. In Proceedings of the ICML Workshop on Text Learning, 2002.
22 IR Terminology
- Documents
  - Any contiguous bunch of text (e.g., a news article, web page, or paragraph)
- Collection
  - A bunch of documents
- Terms
  - Words that occur in a collection (but they may include common phrases, e.g., "car insurance")
- Query
  - Terms that express an information need
23 Term Selection and Creation
- Stop list? A list of frequent, largely content-free words that are not considered ("of", "the", "a", "to", etc.)
- Stemming? Are terms stems or words?
  - E.g., are "dog" and "dogs" separate terms, or are they collapsed to "dog"?
- Phrases? Include the most frequent bigrams as phrases
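A toy sketch of these choices, with a tiny stop list and a crude suffix stripper standing in for a real stemmer such as Porter's (the stop list and suffix rules are illustrative):

    STOP_LIST = {"of", "the", "a", "to", "and", "in", "is"}  # tiny illustrative stop list

    def naive_stem(word):
        """Crude suffix stripping as a stand-in for a real stemmer
        (e.g., the Porter stemmer); collapses 'dogs' -> 'dog'."""
        for suffix in ("ing", "es", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def terms(text):
        tokens = [t.lower().strip(".,?!") for t in text.split()]
        return [naive_stem(t) for t in tokens if t and t not in STOP_LIST]

    print(terms("The dogs ran to the park."))  # ['dog', 'ran', 'park']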
24 Ad hoc Ranked Retrieval
- Documents in the collection are ranked by relevance: d_1, d_2, ..., d_M
- Each document is represented as a vector of term weights (t_1, ..., t_N)
- What should a t_i express?
25 First Approximation: Bit Vector
- t_i = 1 if the corresponding word type occurs in the document (t_i = 0 otherwise)
- Is this a satisfying solution?
26 Better Term Weighting
- Local weight: how important is this term to the meaning of this document?
- Global weight: how well does this term discriminate among the documents in the collection?
  - The more documents a term occurs in, the less important it is
- SOLUTION: combine local and global weights (see the formula below)
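The slide leaves the combination unspecified, but the standard instantiation is tf-idf: the term frequency (local weight) scaled by the inverse document frequency (global weight):

    w_{i,j} = \mathrm{tf}_{i,j} \times \log \frac{N}{\mathrm{df}_i}

where tf_{i,j} is the number of times term i occurs in document j, df_i is the number of documents that contain term i, and N is the total number of documents in the collection.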
27 New Similarity: the Cosine Measure
- The dot product of the length-normalized query and document vectors (formula below)
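The formula itself was an image on the slide; the cosine measure it refers to is:

    \mathrm{sim}(\vec{q},\vec{d})
      = \frac{\vec{q} \cdot \vec{d}}{|\vec{q}|\,|\vec{d}|}
      = \frac{\sum_{i=1}^{N} q_i d_i}{\sqrt{\sum_{i=1}^{N} q_i^2}\,\sqrt{\sum_{i=1}^{N} d_i^2}}

Dividing by the vector lengths is the normalization the slide highlights: it keeps long documents from dominating the ranking simply because they contain more terms.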
28 Ad Hoc Retrieval Summary
- Given a user's query, find all the documents that contain any of the terms in the query
  - Why only those documents?
- Convert the query to a vector
- Compute the cosine between the query vector and all the candidate document vectors, then sort (a sketch follows)
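A self-contained sketch of that pipeline, using raw term counts in place of tf-idf weights to stay short (the document names and texts are made up):

    import math
    from collections import Counter

    def cosine(q, d):
        """Cosine between two sparse term-weight vectors (dicts)."""
        dot = sum(w * d.get(t, 0.0) for t, w in q.items())
        norm = math.sqrt(sum(w*w for w in q.values())) * math.sqrt(sum(w*w for w in d.values()))
        return dot / norm if norm else 0.0

    def rank(query, docs):
        """Keep only documents sharing a term with the query,
        score each against the query vector, and sort by cosine."""
        qterms = set(query.lower().split())
        qvec = Counter(query.lower().split())
        candidates = {name: Counter(text.lower().split())
                      for name, text in docs.items()
                      if qterms & set(text.lower().split())}
        return sorted(((cosine(qvec, c), name) for name, c in candidates.items()),
                      reverse=True)

    docs = {"d1": "bass guitar and bass player",
            "d2": "fishing for sea bass",
            "d3": "stock market news"}
    print(rank("bass player", docs))  # d1 ranks above d2; d3 has no query terms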
29 IR Evaluation (1)
- What do we want?
  - We want documents relevant to the query to be near the top of the ranked list d_1, d_2, ..., d_M
- Use a test collection where you have
  - A set of documents
  - A set of queries
  - A set of relevance judgments that tell you which documents are relevant to each query
30 IR Evaluation (2)
- Can we use precision and recall?
  - Precision = relevant docs returned / docs returned
  - Recall = relevant docs returned / total relevant docs
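Both measures depend on where the ranked list is cut off; a small sketch with made-up relevance judgments:

    def precision_recall(ranked, relevant, k):
        """Precision and recall over the top-k of a ranked list,
        given the set of documents judged relevant for the query."""
        returned = ranked[:k]
        hits = sum(1 for d in returned if d in relevant)
        return hits / len(returned), hits / len(relevant)

    ranked = ["d3", "d1", "d7", "d2", "d5"]
    relevant = {"d1", "d2", "d9"}
    print(precision_recall(ranked, relevant, k=4))
    # (0.5, 0.666...): 2 of the 4 returned docs are relevant, 2 of 3 relevant docs returned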
31 Precision and Recall Plots
- [Plot: precision (y-axis, 0 to 1) against recall (x-axis, 0 to 1); the operating point moves along the curve as the cut-off is raised]
32 IR Current Research
- TREC (Text Retrieval Conference)
  - Large document sets for testing
  - Uniform scoring systems
- Different tracks
  - Interactive Track: studying user interaction with text retrieval systems
  - Question Answering Track
  - Web Track
  - Terabyte Track
  - ...
33 Next Time
- Discourse and Dialogue: Chapters 18 and 19