Title: Unsupervised Methods
1. Unsupervised Methods
2. Association Measures
- Association between items: assoc(x,y)
  - term-term, term-document, term-category, ...
- Simple measures: freq(x,y), log(freq(x,y)) + 1
- Based on a contingency table
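To make the bullets concrete, here is a minimal sketch of building a 2x2 contingency table and the simple association measures from co-occurrence pairs; the toy corpus and function names are illustrative assumptions, not from the slides.

```python
from collections import Counter
from math import log

# Toy corpus of (x, y) co-occurrence pairs; purely illustrative data.
pairs = [("strong", "tea"), ("strong", "tea"), ("strong", "coffee"),
         ("powerful", "computer"), ("powerful", "tea")]

pair_freq = Counter(pairs)              # freq(x, y)
x_freq = Counter(x for x, _ in pairs)   # marginal freq(x)
y_freq = Counter(y for _, y in pairs)   # marginal freq(y)
N = len(pairs)

def contingency(x, y):
    """2x2 contingency table for (x, y): counts of
    (x, y), (x, not-y), (not-x, y), (not-x, not-y)."""
    a = pair_freq[(x, y)]
    b = x_freq[x] - a
    c = y_freq[y] - a
    d = N - a - b - c
    return a, b, c, d

def simple_assoc(x, y):
    """The simple measures from the slide: freq and log-freq + 1."""
    f = pair_freq[(x, y)]
    return f, (log(f) + 1 if f > 0 else 0.0)

print(contingency("strong", "tea"))   # (2, 1, 1, 1)
print(simple_assoc("strong", "tea"))  # (2, 1.693...)
```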
3. Mutual Information
- Pointwise mutual information: the term corresponding to x,y in the mutual information of X,Y, i.e. log [ p(x,y) / (p(x) p(y)) ]
- Disadvantage: the MI value is inflated for low freq(x,y)
- Example: results for two NLP articles
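A short sketch of pointwise MI from counts, illustrating the low-frequency inflation noted above; the counts are illustrative assumptions.

```python
from math import log2

def pmi(f_xy, f_x, f_y, N):
    """Pointwise mutual information log2[ p(x,y) / (p(x) p(y)) ],
    with probabilities estimated by relative frequency."""
    p_xy = f_xy / N
    p_x, p_y = f_x / N, f_y / N
    return log2(p_xy / (p_x * p_y))

# Inflation for low-frequency pairs: a pair seen once between two
# rare words gets a very high score.
print(pmi(f_xy=2, f_x=3, f_y=3, N=5))       # common pair, modest PMI
print(pmi(f_xy=1, f_x=1, f_y=1, N=10_000))  # rare pair, inflated PMI
```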
4. Log-Likelihood Ratio Test
- Compares the likelihood of the data under two competing hypotheses (Dunning, 1993)
- Does not depend heavily on normality assumptions; can be applied to small samples
- Used to test whether p(x|y) = p(x|¬y) = p(x) (independence), by comparing it to the general case (inequality)
- A high log-likelihood score indicates that the data is much less likely under the equality assumption
5. Log-Likelihood (cont.)
- Likelihood function L(H; data)
- The likelihood ratio: λ = max L under H0 / max L under H1
- -2 log λ is asymptotically χ²-distributed
- High -2 log λ: the data is less likely given H0
6. Log-Likelihood for Bigrams
7. Log-Likelihood for Binomial
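The formulas of these two slides did not survive extraction; below is a hedged sketch of Dunning's statistic for a bigram, assuming the standard binomial formulation (the counts in the example are illustrative).

```python
from math import log

def log_l(k, n, p):
    """Binomial log-likelihood log L(p; k, n) = k log p + (n - k) log(1 - p),
    taking 0 * log(0) as 0 so degenerate estimates do not blow up."""
    if p == 0.0 or p == 1.0:
        return (k * log(p) if k else 0.0) + ((n - k) * log(1 - p) if n - k else 0.0)
    return k * log(p) + (n - k) * log(1 - p)

def llr_bigram(c12, c1, c2, N):
    """Dunning's -2 log lambda for the bigram (w1, w2).
    H0: p(w2|w1) = p(w2|~w1) = p(w2); H1: the two probabilities differ."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, N - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, N - c1, p))

# Illustrative counts: c12 = freq(w1 w2), c1 = freq(w1), c2 = freq(w2).
print(llr_bigram(c12=20, c1=100, c2=200, N=10_000))
```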
8. Measuring Term Topicality
- For query relevance ranking: Inverse Document Frequency (IDF)
- For term extraction:
  - Frequency
  - Frequency ratio for a specialized vs. a general corpus
  - Entropy of the term co-occurrence distribution
- Burstiness:
  - Entropy of the term's distribution (frequency) across documents
  - Proportion of topical documents for the term (freq > 1) within all documents containing the term (Katz, 1996)
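A small sketch of two of these signals, IDF and the Katz-style proportion of "topical" documents (freq > 1); the toy document collection is an illustrative assumption.

```python
from math import log

# Illustrative document collection: each document is a list of tokens.
docs = [
    ["printer", "driver", "printer", "install"],
    ["printer", "paper", "jam", "printer", "printer"],
    ["weather", "rain", "printer"],
    ["weather", "sun", "forecast"],
]

def idf(term):
    """Inverse document frequency: log(N / df(term))."""
    df = sum(term in d for d in docs)
    return log(len(docs) / df) if df else 0.0

def topical_proportion(term):
    """Burstiness in the spirit of Katz (1996): among documents containing
    the term, the fraction in which it occurs more than once."""
    containing = [d for d in docs if term in d]
    topical = [d for d in containing if d.count(term) > 1]
    return len(topical) / len(containing) if containing else 0.0

print(idf("printer"), topical_proportion("printer"))    # frequent, bursty
print(idf("forecast"), topical_proportion("forecast"))  # rare, non-bursty
```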
9. Similarity Measures
- Cosine
- Min/Max
- KL to Average
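A hedged sketch of the three measures over sparse word vectors; the vectors are illustrative assumptions, and "KL to average" is taken here as the Jensen-Shannon-style divergence of each distribution to their average.

```python
from math import sqrt, log

# Illustrative sparse vectors: attribute -> association weight.
u = {"drink": 3.0, "pour": 2.0, "hot": 1.0}
v = {"drink": 2.0, "pour": 1.0, "cold": 2.0}

def cosine(u, v):
    dot = sum(u[a] * v[a] for a in u if a in v)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def min_max(u, v):
    atts = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    return num / den

def kl_to_average(u, v):
    """Average KL divergence of each normalized vector to their mean."""
    def normalize(w):
        s = sum(w.values())
        return {a: x / s for a, x in w.items()}
    p, q = normalize(u), normalize(v)
    avg = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in set(p) | set(q)}
    def kl(x, m):
        return sum(x[a] * log(x[a] / m[a]) for a in x if x[a] > 0)
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

print(cosine(u, v), min_max(u, v), kl_to_average(u, v))
```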
10. A Unifying Schema of Similarity (with Erez Lotan)
- A general schema encoding most measures
- Identifies explicitly the important factors that determine (word) similarity
- Provides the basis for:
  - a general and efficient similarity computation procedure
  - evaluating and comparing alternative measures and components
11. Mapping to the Unified Similarity Scheme
12. Association and Joint Association
- assoc(u,att): quantifies association strength
  - mutual information, weighted log frequency, conditional probability (orthogonal to the scheme)
- joint(assoc(u,att), assoc(v,att)): quantifies the similarity of the two associations
  - ratio, difference, min, product
13. Normalization
- Global weight of a word vector
  - For cosine: the vector norm sqrt(Σ_att assoc(u,att)²)
- Normalization factor
  - For cosine: the product of the two global weights (vector norms)
14. The General Similarity Scheme
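The scheme's formula did not survive extraction on this slide; the following is a plausible reconstruction from the assoc/joint/normalization components named on the surrounding slides (an assumption about the notation, not the authors' exact formula):

```latex
\mathrm{sim}(u,v) \;=\;
\frac{\sum_{att \,\in\, atts(u)\,\cap\,atts(v)}
      \mathrm{joint}\bigl(\mathrm{assoc}(u,att),\,\mathrm{assoc}(v,att)\bigr)}
     {\mathrm{norm}\bigl(\mathrm{weight}(u),\,\mathrm{weight}(v)\bigr)}
```

Under this reading, the Min/Max measures of the next slide correspond to joint = min with normalization by the sum of the attribute-wise maxima, so that sim(u,u) = 1.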
15. Min/Max Measures
16. Associations Used with Min/Max
- Log-frequency and global entropy weight (Grefenstette, 1994)
- Mutual information (Dagan et al., 1993/5)
17. Cosine Measure
- Used for word similarity (Ruge, 1992) with assoc(u,att) = ln(freq(u,att))
- Popular for document ranking (vector space model)
18. Methodological Benefits
- Joint work with Erez Lotan (Dagan 2000 and in preparation)
- Uniform understanding of similarity measure structure
- Modular evaluation/comparison of measure components
- Modular implementation architecture: easy experimentation by plugging in alternative measure combinations (see the sketch below)
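A minimal sketch of such a modular architecture, assuming dict-based sparse vectors; the component and variable names are illustrative, not from the slides.

```python
from math import log, sqrt

# Each measure is assembled from pluggable components:
#   assoc(freq)         -> association weight for one (word, attribute) cell
#   joint(wu, wv)       -> combined score for an attribute shared by u and v
#   norm(u_vec, v_vec)  -> normalization factor for the pair of vectors

def log_freq_assoc(freq):
    return 1 + log(freq)

def min_joint(wu, wv):
    return min(wu, wv)

def product_joint(wu, wv):
    return wu * wv

def max_norm(u_vec, v_vec):
    atts = set(u_vec) | set(v_vec)
    return sum(max(u_vec.get(a, 0.0), v_vec.get(a, 0.0)) for a in atts)

def cosine_norm(u_vec, v_vec):
    return sqrt(sum(w * w for w in u_vec.values())) * \
           sqrt(sum(w * w for w in v_vec.values()))

def similarity(u_freqs, v_freqs, assoc, joint, norm):
    """General procedure: weight the vectors, combine shared attributes, normalize."""
    u_vec = {a: assoc(f) for a, f in u_freqs.items()}
    v_vec = {a: assoc(f) for a, f in v_freqs.items()}
    shared = set(u_vec) & set(v_vec)
    return sum(joint(u_vec[a], v_vec[a]) for a in shared) / norm(u_vec, v_vec)

# Swap components without touching the core procedure:
u = {"drink": 5, "pour": 2}
v = {"drink": 3, "cold": 4}
print(similarity(u, v, log_freq_assoc, min_joint, max_norm))         # Min/Max
print(similarity(u, v, log_freq_assoc, product_joint, cosine_norm))  # cosine-like
```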
19. Empirical Evaluation
- Precision and comparative recall at each point in the list
20. Comparing Measure Combinations
- (Precision-recall plot comparing measure combinations)
- Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20 points), and were stable across association measures
21. Effect of Co-occurrence Type on Semantic Similarity
22. Computational Benefits
- Complexity reduced by the sparseness factor: non-zero cells / total cells
- Two orders of magnitude reduction on corpus data
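A sketch of why sparseness helps, assuming an inverted index from attributes to the words carrying them: only word pairs that share at least one non-zero attribute are ever scored (the data and names are illustrative).

```python
from collections import defaultdict

# word -> {attribute: weight}; illustrative sparse vectors.
vectors = {
    "tea":    {"drink": 3.0, "hot": 1.0},
    "coffee": {"drink": 2.0, "hot": 2.0},
    "stone":  {"throw": 1.0},
}

# Inverted index: attribute -> words carrying it.
index = defaultdict(list)
for word, vec in vectors.items():
    for att in vec:
        index[att].append(word)

def candidates(word):
    """Words sharing at least one non-zero attribute with `word`.
    Cost is proportional to the non-zero cells, not to |words| x |attributes|."""
    cands = set()
    for att in vectors[word]:
        cands.update(index[att])
    cands.discard(word)
    return cands

print(candidates("tea"))   # {'coffee'}; 'stone' is never considered
```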
23. General Scheme - Conclusions
- A general mathematical scheme
- Identifies the important factors for measuring similarity
- Efficient general procedure based on the scheme
- Empirical comparison of different measure components (measure structure and assoc)
- Successful application in an Internet crawler for thesaurus construction (small corpora)
24. Clustering Methods
- Input: a set of objects (words, documents)
- Output: a set of clusters (sets of elements)
- Based on a criterion for the quality of a class, which guides cluster split/merge/modification:
  - a distance function between objects/classes
  - a global quality function
25. Clustering Types
- Soft / hard
- Hierarchical / flat
- Top-down / bottom-up
- Predefined number of clusters or not
- Input:
  - all point-to-point distances, or
  - the original vector representation of the points, computing needed distances during clustering
26. Applications of Clustering
- Word clustering
  - Constructing a hierarchical thesaurus
  - Compactness and generalization in word co-occurrence modeling (discussed later)
- Document clustering
  - Browsing of document collections and search query output
  - Assistance in defining a set of supervised categories
27. Hierarchical Agglomerative Clustering Methods (HACM)
1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best-scoring merge
4. Compute the merge score between the new cluster and all other clusters
5. If more than one cluster remains, return to step 3
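A compact sketch of the loop above, using single-link distance as the merge score; the point set and distance function are illustrative assumptions.

```python
# Hierarchical agglomerative clustering with a single-link merge score.
points = [0.0, 0.2, 0.3, 5.0, 5.1, 9.0]          # illustrative 1-D points

def single_link(c1, c2):
    """Single link: distance between the two nearest points of the clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

clusters = [[p] for p in points]                  # 1. every point is a cluster
merges = []
while len(clusters) > 1:                          # 5. repeat until one cluster remains
    # 2./4. evaluate the merge score for all cluster pairs
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    merges.append((clusters[i], clusters[j]))     # 3. perform the best merge
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for left, right in merges:                        # the resulting merge history
    print(left, "+", right)
```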
28. Types of Merge Score
- Minimal distance between the two merge candidates; alternatives for cluster distance:
  - Single link: distance between the two nearest points
  - Complete link: distance between the two furthest points
  - Group average: average pairwise distance over all points
  - Centroid: distance between the two cluster centroids
- Based on the quality of the merged class
  - Ward's method: minimal increase in the total within-group sum of squares (average squared distance to the centroid)
- Based on a global criterion (in Brown et al., 1992: minimal reduction in average mutual information)
29. Unsupervised Statistics and Generalizations for Classification
- Many supervised methods use co-occurrence statistics as features or probability estimates
  - eat a peach / beach
  - fire a missile vs. fire the prime minister
- Sparse data problem: if alternative co-occurrences never occurred, how do we estimate their probabilities, or their relative strength as features?
30. Application: Semantic Disambiguation
31. Statistical Approach
- Semantic judgment from a corpus (text collection):
  - <verb-object: throw-grenade> 20 times
  - <verb-object: throw-pocket> 1 time
32. What about sense disambiguation? (for translation)
33. Solution: Mapping to Another Language
- English(-English)-Hebrew dictionary:
  - bar1 → chafisa, soap → sabon, window → chalon, bar2 → sorag
- Exploiting differences in ambiguity between the languages
- Principle: intersecting redundancies (Dagan and Itai, 1994)
34. Selection Model Highlights
- Multinomial model, under certain linguistic assumptions
- Selection confidence: lower bound for the odds ratio
- Overlapping ambiguous constructs are resolved through constraint propagation, in decreasing confidence order
- Results (Hebrew→English): coverage 70%, precision within coverage 90%
- 20% improvement over choosing the most frequent translation (the common baseline)
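The exact confidence formula is not preserved here; the following is a hedged sketch of one standard way to bound the log odds ratio of the two competing alternatives (the threshold, smoothing constant, and counts are illustrative assumptions, not the model's published parameters).

```python
from math import log, sqrt

def selection_confidence(c1, c2, z=1.645, smoothing=0.5):
    """Lower confidence bound on the log odds ratio ln(p1/p2) of the two
    competing alternatives, using a normal approximation; z = 1.645 gives
    a one-sided 95% bound. Counts c1 >= c2 are co-occurrence frequencies."""
    c1, c2 = c1 + smoothing, c2 + smoothing      # avoid division by zero
    return log(c1 / c2) - z * sqrt(1.0 / c1 + 1.0 / c2)

def select(c1, c2, threshold=0.5):
    """Select a translation only if the bound exceeds a threshold; otherwise
    abstain (abstention is what limits coverage while keeping precision high)."""
    return "select alternative 1" if selection_confidence(c1, c2) >= threshold else "abstain"

print(select(20, 1))   # confident selection
print(select(3, 2))    # abstain: counts too close / too small
```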
35. Data Sparseness and Similarity
- <verb-object: hidpis-tikiya> ?
  - <verb-object: print-folder> 0 times
  - <verb-object: print-file_cabinet> 0 times
- Standard approach: back off to single-term frequency
- Similarity-based inference
36. Computing Distributional Similarity
- (Diagram: "folder" and "file" are distributionally similar)
37. Disambiguation Algorithm
- Selection of the preferred alternative
- Hypothesized similarity-based frequency, derived from the average association of similar words (incorporating single-term frequency)
- Comparison of the hypothesized frequencies
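A hedged sketch of the idea: the hypothesized frequency of an unseen pair is a similarity-weighted average of the frequencies observed with similar words; the word lists, counts, and weights below are illustrative assumptions.

```python
# freq(verb, noun) observed in the corpus; illustrative counts.
pair_freq = {("print", "file"): 40, ("print", "document"): 25,
             ("print", "cabinet"): 0, ("print", "closet"): 0}

# k most similar words to each candidate noun, with similarity weights.
similar = {
    "folder":       [("file", 0.8), ("document", 0.6)],
    "file_cabinet": [("cabinet", 0.7), ("closet", 0.5)],
}

def hypothesized_freq(verb, noun):
    """Similarity-weighted average of the frequencies of the similar words."""
    neighbors = similar[noun]
    total_sim = sum(s for _, s in neighbors)
    return sum(s * pair_freq.get((verb, w), 0) for w, s in neighbors) / total_sim

# Compare the hypothesized frequencies of the competing translations.
for noun in ("folder", "file_cabinet"):
    print(noun, hypothesized_freq("print", noun))
# -> 'folder' is preferred for the unseen pair
```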
38. Computation and Evaluation
- Heuristic search used to speed up computation of the k most similar words
- Results (Hebrew→English):
  - 15% coverage increase, while decreasing precision by 2%
  - Accuracy 15% better than backing off to single-word frequency (Dagan, Marcus and Markovitch 1995)
39. Probabilistic Framework - Smoothing
40. Smoothing Conditional Attribute Probability
41. Similarity/Distance Functions for Probability Distributions
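The formulas on these slides are not preserved in the extraction; the following is a plausible sketch of the similarity-based smoothing scheme in its general form (an assumption about what the slides showed, not the exact notation):

```latex
P_{\mathrm{SIM}}(att \mid u) \;=\; \sum_{v \in S(u)} W(u,v)\, P(att \mid v),
\qquad
W(u,v) \;=\; \frac{\exp\bigl(-\beta\, D(u,v)\bigr)}
                  {\sum_{v' \in S(u)} \exp\bigl(-\beta\, D(u,v')\bigr)}
```

Here S(u) is the set of words most similar to u, D is a distance between the word distributions (e.g. KL to average), and β controls the influence of less similar words; the next slide notes that results were insensitive to the exact value of β.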
42. Sample Results
- Most similar words to "guy", per measure:
  - A:  guy kid thing lot man mother doctor friend boy son
  - L:  guy kid lot thing man doctor girl rest son bit
  - PC: role people fire guy man year lot today way part
- Typical common verb contexts: see, get, give, tell, take
- PC was an earlier attempt at similarity-based smoothing
- Several smoothing experiments (A performed best):
  - Language modeling for speech (hunt bears / pears)
  - Perplexity (predicting test corpus likelihood)
  - Data recovery task (similar to sense disambiguation)
- Insensitive to the exact value of β
43. Class-Based Generalization
- Obtain a co-occurrence-based clustering of words and model a word co-occurrence by word-class or class-class co-occurrence
- Brown et al., CL 1992: mutual information clustering; class-based model interpolated with an n-gram model
- Pereira, Tishby, Lee, ACL 1993: soft, top-down distributional clustering for bigram modeling
- The general effectiveness of similarity/class-based methods is yet to be shown
44. Conclusions
- (Relatively) simple models cover a wide range of applications
- Useful in (hybrid) systems for automatic processing and knowledge acquisition
45. Discussion