Title: Unsupervised Methods
1. Unsupervised Methods
2. Association Measures
- Association between items: assoc(x,y)
  - term-term, term-document, term-category, ...
- Simple measures: freq(x,y), log(freq(x,y)) + 1
- Based on a contingency table
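To make the bullets concrete, here is a minimal sketch of building a 2x2 contingency table and the simple association measures from co-occurrence pairs; the toy corpus and function names are illustrative assumptions, not from the slides.

```python
from collections import Counter
from math import log

# Toy corpus of (x, y) co-occurrence pairs; purely illustrative data.
pairs = [("strong", "tea"), ("strong", "tea"), ("strong", "coffee"),
         ("powerful", "computer"), ("powerful", "tea")]

pair_freq = Counter(pairs)              # freq(x, y)
x_freq = Counter(x for x, _ in pairs)   # marginal freq(x)
y_freq = Counter(y for _, y in pairs)   # marginal freq(y)
N = len(pairs)

def contingency(x, y):
    """2x2 contingency table for (x, y): counts of
    (x, y), (x, not-y), (not-x, y), (not-x, not-y)."""
    a = pair_freq[(x, y)]
    b = x_freq[x] - a
    c = y_freq[y] - a
    d = N - a - b - c
    return a, b, c, d

def simple_assoc(x, y):
    """The simple measures from the slide: freq and log-freq + 1."""
    f = pair_freq[(x, y)]
    return f, (log(f) + 1 if f > 0 else 0.0)

print(contingency("strong", "tea"))   # (2, 1, 1, 1)
print(simple_assoc("strong", "tea"))  # (2, 1.693...)
```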
3. Mutual Information
- Pointwise mutual information: the term corresponding to x,y in the mutual information of X,Y, i.e. log [ p(x,y) / (p(x) p(y)) ]
- Disadvantage: the MI value is inflated for low freq(x,y)
- Example: results for two NLP articles
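A short sketch of pointwise MI from counts, illustrating the low-frequency inflation noted above; the counts are illustrative assumptions.

```python
from math import log2

def pmi(f_xy, f_x, f_y, N):
    """Pointwise mutual information log2[ p(x,y) / (p(x) p(y)) ],
    with probabilities estimated by relative frequency."""
    p_xy = f_xy / N
    p_x, p_y = f_x / N, f_y / N
    return log2(p_xy / (p_x * p_y))

# Inflation for low-frequency pairs: a pair seen once between two
# rare words gets a very high score.
print(pmi(f_xy=2, f_x=3, f_y=3, N=5))       # common pair, modest PMI
print(pmi(f_xy=1, f_x=1, f_y=1, N=10_000))  # rare pair, inflated PMI
```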
4. Log-Likelihood Ratio Test
- Compares the likelihood of the data under two competing hypotheses (Dunning, 1993)
- Does not depend heavily on normality assumptions; can be applied to small samples
- Used to test whether p(x|y) = p(x|¬y) = p(x) (independence), by comparing it to the general case (inequality)
- A high log-likelihood score indicates that the data is much less likely under the equality assumption
5. Log-Likelihood (cont.)
- Likelihood function L(H; data)
- The likelihood ratio: λ = max L under H0 / max L under H1
- -2 log λ is asymptotically χ²-distributed
- High -2 log λ: the data is less likely given H0
6. Log-Likelihood for Bigrams
7. Log-Likelihood for Binomial
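The formulas of these two slides did not survive extraction; below is a hedged sketch of Dunning's statistic for a bigram, assuming the standard binomial formulation (the counts in the example are illustrative).

```python
from math import log

def log_l(k, n, p):
    """Binomial log-likelihood log L(p; k, n) = k log p + (n - k) log(1 - p),
    taking 0 * log(0) as 0 so degenerate estimates do not blow up."""
    if p == 0.0 or p == 1.0:
        return (k * log(p) if k else 0.0) + ((n - k) * log(1 - p) if n - k else 0.0)
    return k * log(p) + (n - k) * log(1 - p)

def llr_bigram(c12, c1, c2, N):
    """Dunning's -2 log lambda for the bigram (w1, w2).
    H0: p(w2|w1) = p(w2|~w1) = p(w2); H1: the two probabilities differ."""
    p = c2 / N
    p1 = c12 / c1
    p2 = (c2 - c12) / (N - c1)
    return 2 * (log_l(c12, c1, p1) + log_l(c2 - c12, N - c1, p2)
                - log_l(c12, c1, p) - log_l(c2 - c12, N - c1, p))

# Illustrative counts: c12 = freq(w1 w2), c1 = freq(w1), c2 = freq(w2).
print(llr_bigram(c12=20, c1=100, c2=200, N=10_000))
```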
8. Measuring Term Topicality
- For query relevance ranking: Inverse Document Frequency (IDF)
- For term extraction:
  - Frequency
  - Frequency ratio for a specialized vs. a general corpus
  - Entropy of the term co-occurrence distribution
- Burstiness:
  - Entropy of the term's distribution (frequency) across documents
  - Proportion of topical documents for the term (freq > 1) within all documents containing the term (Katz, 1996)
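A small sketch of two of these signals, IDF and the Katz-style proportion of "topical" documents (freq > 1); the toy document collection is an illustrative assumption.

```python
from math import log

# Illustrative document collection: each document is a list of tokens.
docs = [
    ["printer", "driver", "printer", "install"],
    ["printer", "paper", "jam", "printer", "printer"],
    ["weather", "rain", "printer"],
    ["weather", "sun", "forecast"],
]

def idf(term):
    """Inverse document frequency: log(N / df(term))."""
    df = sum(term in d for d in docs)
    return log(len(docs) / df) if df else 0.0

def topical_proportion(term):
    """Burstiness in the spirit of Katz (1996): among documents containing
    the term, the fraction in which it occurs more than once."""
    containing = [d for d in docs if term in d]
    topical = [d for d in containing if d.count(term) > 1]
    return len(topical) / len(containing) if containing else 0.0

print(idf("printer"), topical_proportion("printer"))    # frequent, bursty
print(idf("forecast"), topical_proportion("forecast"))  # rare, non-bursty
```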
9. Similarity Measures
- Cosine
- Min/Max
- KL to Average
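A hedged sketch of the three measures over sparse word vectors; the vectors are illustrative assumptions, and "KL to average" is taken here as the Jensen-Shannon-style divergence of each distribution to their average.

```python
from math import sqrt, log

# Illustrative sparse vectors: attribute -> association weight.
u = {"drink": 3.0, "pour": 2.0, "hot": 1.0}
v = {"drink": 2.0, "pour": 1.0, "cold": 2.0}

def cosine(u, v):
    dot = sum(u[a] * v[a] for a in u if a in v)
    return dot / (sqrt(sum(x * x for x in u.values())) *
                  sqrt(sum(x * x for x in v.values())))

def min_max(u, v):
    atts = set(u) | set(v)
    num = sum(min(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    den = sum(max(u.get(a, 0.0), v.get(a, 0.0)) for a in atts)
    return num / den

def kl_to_average(u, v):
    """Average KL divergence of each normalized vector to their mean."""
    def normalize(w):
        s = sum(w.values())
        return {a: x / s for a, x in w.items()}
    p, q = normalize(u), normalize(v)
    avg = {a: 0.5 * (p.get(a, 0.0) + q.get(a, 0.0)) for a in set(p) | set(q)}
    def kl(x, m):
        return sum(x[a] * log(x[a] / m[a]) for a in x if x[a] > 0)
    return 0.5 * kl(p, avg) + 0.5 * kl(q, avg)

print(cosine(u, v), min_max(u, v), kl_to_average(u, v))
```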
10. A Unifying Schema of Similarity (with Erez Lotan)
- A general schema encoding most measures
- Identifies explicitly the important factors that determine (word) similarity
- Provides the basis for:
  - a general and efficient similarity computation procedure
  - evaluating and comparing alternative measures and components
11. Mapping to the Unified Similarity Scheme
12. Association and Joint Association
- assoc(u,att): quantifies association strength
  - mutual information, weighted log frequency, conditional probability (orthogonal to the scheme)
- joint(assoc(u,att), assoc(v,att)): quantifies the similarity of the two associations
  - ratio, difference, min, product
13. Normalization
- Global weight of a word vector
  - For cosine: the vector norm sqrt(Σ_att assoc(u,att)²)
- Normalization factor
  - For cosine: the product of the two global weights (vector norms)
14. The General Similarity Scheme
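The scheme's formula did not survive extraction on this slide; the following is a plausible reconstruction from the assoc/joint/normalization components named on the surrounding slides (an assumption about the notation, not the authors' exact formula):

```latex
\mathrm{sim}(u,v) \;=\;
\frac{\sum_{att \,\in\, atts(u)\,\cap\,atts(v)}
      \mathrm{joint}\bigl(\mathrm{assoc}(u,att),\,\mathrm{assoc}(v,att)\bigr)}
     {\mathrm{norm}\bigl(\mathrm{weight}(u),\,\mathrm{weight}(v)\bigr)}
```

Under this reading, the Min/Max measures of the next slide correspond to joint = min with normalization by the sum of the attribute-wise maxima, so that sim(u,u) = 1.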
15. Min/Max Measures
16. Associations Used with Min/Max
- Log-frequency and global entropy weight (Grefenstette, 1994)
- Mutual information (Dagan et al., 1993/5)
17. Cosine Measure
- Used for word similarity (Ruge, 1992) with assoc(u,att) = ln(freq(u,att))
- Popular for document ranking (vector space model)
18. Methodological Benefits
- Joint work with Erez Lotan (Dagan 2000 and in preparation)
- Uniform understanding of similarity measure structure
- Modular evaluation/comparison of measure components
- Modular implementation architecture: easy experimentation by plugging in alternative measure combinations (see the sketch below)
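A minimal sketch of such a modular architecture, assuming dict-based sparse vectors; the component and variable names are illustrative, not from the slides.

```python
from math import log, sqrt

# Each measure is assembled from pluggable components:
#   assoc(freq)         -> association weight for one (word, attribute) cell
#   joint(wu, wv)       -> combined score for an attribute shared by u and v
#   norm(u_vec, v_vec)  -> normalization factor for the pair of vectors

def log_freq_assoc(freq):
    return 1 + log(freq)

def min_joint(wu, wv):
    return min(wu, wv)

def product_joint(wu, wv):
    return wu * wv

def max_norm(u_vec, v_vec):
    atts = set(u_vec) | set(v_vec)
    return sum(max(u_vec.get(a, 0.0), v_vec.get(a, 0.0)) for a in atts)

def cosine_norm(u_vec, v_vec):
    return sqrt(sum(w * w for w in u_vec.values())) * \
           sqrt(sum(w * w for w in v_vec.values()))

def similarity(u_freqs, v_freqs, assoc, joint, norm):
    """General procedure: weight the vectors, combine shared attributes, normalize."""
    u_vec = {a: assoc(f) for a, f in u_freqs.items()}
    v_vec = {a: assoc(f) for a, f in v_freqs.items()}
    shared = set(u_vec) & set(v_vec)
    return sum(joint(u_vec[a], v_vec[a]) for a in shared) / norm(u_vec, v_vec)

# Swap components without touching the core procedure:
u = {"drink": 5, "pour": 2}
v = {"drink": 3, "cold": 4}
print(similarity(u, v, log_freq_assoc, min_joint, max_norm))         # Min/Max
print(similarity(u, v, log_freq_assoc, product_joint, cosine_norm))  # cosine-like
```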
19. Empirical Evaluation
- Precision and comparative recall at each point in the list
20. Comparing Measure Combinations
- (Precision-recall plot comparing measure combinations)
- Min/Max schemes worked better than cosine and Jensen-Shannon (by almost 20 points), and were stable across association measures
21. Effect of Co-occurrence Type on Semantic Similarity
22. Computational Benefits
- Complexity reduced by the sparseness factor: non-zero cells / total cells
- Two orders of magnitude reduction on corpus data
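A sketch of why sparseness helps, assuming an inverted index from attributes to the words carrying them: only word pairs that share at least one non-zero attribute are ever scored (the data and names are illustrative).

```python
from collections import defaultdict

# word -> {attribute: weight}; illustrative sparse vectors.
vectors = {
    "tea":    {"drink": 3.0, "hot": 1.0},
    "coffee": {"drink": 2.0, "hot": 2.0},
    "stone":  {"throw": 1.0},
}

# Inverted index: attribute -> words carrying it.
index = defaultdict(list)
for word, vec in vectors.items():
    for att in vec:
        index[att].append(word)

def candidates(word):
    """Words sharing at least one non-zero attribute with `word`.
    Cost is proportional to the non-zero cells, not to |words| x |attributes|."""
    cands = set()
    for att in vectors[word]:
        cands.update(index[att])
    cands.discard(word)
    return cands

print(candidates("tea"))   # {'coffee'}; 'stone' is never considered
```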
23. General Scheme - Conclusions
- A general mathematical scheme
- Identifies the important factors for measuring similarity
- Efficient general procedure based on the scheme
- Empirical comparison of different measure components (measure structure and assoc)
- Successful application in an Internet crawler for thesaurus construction (small corpora)
24. Clustering Methods
- Input: a set of objects (words, documents)
- Output: a set of clusters (sets of elements)
- Based on a criterion for the quality of a class, which guides cluster split/merge/modification:
  - a distance function between objects/classes
  - a global quality function
25. Clustering Types
- Soft / hard
- Hierarchical / flat
- Top-down / bottom-up
- Predefined number of clusters or not
- Input:
  - all point-to-point distances, or
  - the original vector representation of the points, computing needed distances during clustering
26. Applications of Clustering
- Word clustering
  - Constructing a hierarchical thesaurus
  - Compactness and generalization in word co-occurrence modeling (discussed later)
- Document clustering
  - Browsing of document collections and search query output
  - Assistance in defining a set of supervised categories
27. Hierarchical Agglomerative Clustering Methods (HACM)
1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best-scoring merge
4. Compute the merge score between the new cluster and all other clusters
5. If more than one cluster remains, return to step 3
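A compact sketch of the loop above, using single-link distance as the merge score; the point set and distance function are illustrative assumptions.

```python
# Hierarchical agglomerative clustering with a single-link merge score.
points = [0.0, 0.2, 0.3, 5.0, 5.1, 9.0]          # illustrative 1-D points

def single_link(c1, c2):
    """Single link: distance between the two nearest points of the clusters."""
    return min(abs(a - b) for a in c1 for b in c2)

clusters = [[p] for p in points]                  # 1. every point is a cluster
merges = []
while len(clusters) > 1:                          # 5. repeat until one cluster remains
    # 2./4. evaluate the merge score for all cluster pairs
    i, j = min(((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters))),
               key=lambda ij: single_link(clusters[ij[0]], clusters[ij[1]]))
    merges.append((clusters[i], clusters[j]))     # 3. perform the best merge
    clusters[i] = clusters[i] + clusters[j]
    del clusters[j]

for left, right in merges:                        # the resulting merge history
    print(left, "+", right)
```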
28. Types of Merge Score
- Minimal distance between the two merge candidates; alternatives for cluster distance:
  - Single link: distance between the two nearest points
  - Complete link: distance between the two furthest points
  - Group average: average pairwise distance over all points
  - Centroid: distance between the two cluster centroids
- Based on the quality of the merged class
  - Ward's method: minimal increase in the total within-group sum of squares (average squared distance to the centroid)
- Based on a global criterion (in Brown et al., 1992: minimal reduction in average mutual information)
29. Unsupervised Statistics and Generalizations for Classification
- Many supervised methods use co-occurrence statistics as features or probability estimates
  - eat a peach / beach
  - fire a missile vs. fire the prime minister
- Sparse data problem: if alternative co-occurrences never occurred, how do we estimate their probabilities, or their relative strength as features?
30. Application: Semantic Disambiguation
31. Statistical Approach
- Semantic judgment from a corpus (text collection):
  - <verb-object: throw-grenade> 20 times
  - <verb-object: throw-pocket> 1 time
32. What about sense disambiguation? (for translation)
33. Solution: Mapping to Another Language
- English(-English)-Hebrew dictionary:
  - bar1 → chafisa, soap → sabon, window → chalon, bar2 → sorag
- Exploiting differences in ambiguity between the languages
- Principle: intersecting redundancies (Dagan and Itai, 1994)
34. Selection Model Highlights
- Multinomial model, under certain linguistic assumptions
- Selection confidence: lower bound for the odds ratio
- Overlapping ambiguous constructs are resolved through constraint propagation, in decreasing confidence order
- Results (Hebrew→English): coverage 70%, precision within coverage 90%
- 20% improvement over choosing the most frequent translation (the common baseline)
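The exact confidence formula is not preserved here; the following is a hedged sketch of one standard way to bound the log odds ratio of the two competing alternatives (the threshold, smoothing constant, and counts are illustrative assumptions, not the model's published parameters).

```python
from math import log, sqrt

def selection_confidence(c1, c2, z=1.645, smoothing=0.5):
    """Lower confidence bound on the log odds ratio ln(p1/p2) of the two
    competing alternatives, using a normal approximation; z = 1.645 gives
    a one-sided 95% bound. Counts c1 >= c2 are co-occurrence frequencies."""
    c1, c2 = c1 + smoothing, c2 + smoothing      # avoid division by zero
    return log(c1 / c2) - z * sqrt(1.0 / c1 + 1.0 / c2)

def select(c1, c2, threshold=0.5):
    """Select a translation only if the bound exceeds a threshold; otherwise
    abstain (abstention is what limits coverage while keeping precision high)."""
    return "select alternative 1" if selection_confidence(c1, c2) >= threshold else "abstain"

print(select(20, 1))   # confident selection
print(select(3, 2))    # abstain: counts too close / too small
```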
35. Data Sparseness and Similarity
- <verb-object: hidpis-tikiya> ?
  - <verb-object: print-folder> 0 times
  - <verb-object: print-file_cabinet> 0 times
- Standard approach: back off to single-term frequency
- Similarity-based inference
36. Computing Distributional Similarity
- (Diagram: "folder" and "file" are distributionally similar)
37. Disambiguation Algorithm
- Selection of the preferred alternative
- Hypothesized similarity-based frequency, derived from the average association of similar words (incorporating single-term frequency)
- Comparison of the hypothesized frequencies
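A hedged sketch of the idea: the hypothesized frequency of an unseen pair is a similarity-weighted average of the frequencies observed with similar words; the word lists, counts, and weights below are illustrative assumptions.

```python
# freq(verb, noun) observed in the corpus; illustrative counts.
pair_freq = {("print", "file"): 40, ("print", "document"): 25,
             ("print", "cabinet"): 0, ("print", "closet"): 0}

# k most similar words to each candidate noun, with similarity weights.
similar = {
    "folder":       [("file", 0.8), ("document", 0.6)],
    "file_cabinet": [("cabinet", 0.7), ("closet", 0.5)],
}

def hypothesized_freq(verb, noun):
    """Similarity-weighted average of the frequencies of the similar words."""
    neighbors = similar[noun]
    total_sim = sum(s for _, s in neighbors)
    return sum(s * pair_freq.get((verb, w), 0) for w, s in neighbors) / total_sim

# Compare the hypothesized frequencies of the competing translations.
for noun in ("folder", "file_cabinet"):
    print(noun, hypothesized_freq("print", noun))
# -> 'folder' is preferred for the unseen pair
```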
38. Computation and Evaluation
- Heuristic search used to speed up computation of the k most similar words
- Results (Hebrew→English):
  - 15% coverage increase, while decreasing precision by 2%
  - Accuracy 15% better than backing off to single-word frequency (Dagan, Marcus and Markovitch 1995)
39. Probabilistic Framework - Smoothing
40. Smoothing Conditional Attribute Probability
41. Similarity/Distance Functions for Probability Distributions
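The formulas on these slides are not preserved in the extraction; the following is a plausible sketch of the similarity-based smoothing scheme in its general form (an assumption about what the slides showed, not the exact notation):

```latex
P_{\mathrm{SIM}}(att \mid u) \;=\; \sum_{v \in S(u)} W(u,v)\, P(att \mid v),
\qquad
W(u,v) \;=\; \frac{\exp\bigl(-\beta\, D(u,v)\bigr)}
                  {\sum_{v' \in S(u)} \exp\bigl(-\beta\, D(u,v')\bigr)}
```

Here S(u) is the set of words most similar to u, D is a distance between the word distributions (e.g. KL to average), and β controls the influence of less similar words; the next slide notes that results were insensitive to the exact value of β.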
42. Sample Results
- Most similar words to "guy", per measure:
  - A:  guy kid thing lot man mother doctor friend boy son
  - L:  guy kid lot thing man doctor girl rest son bit
  - PC: role people fire guy man year lot today way part
- Typical common verb contexts: see, get, give, tell, take
- PC was an earlier attempt at similarity-based smoothing
- Several smoothing experiments (A performed best):
  - Language modeling for speech (hunt bears / pears)
  - Perplexity (predicting test corpus likelihood)
  - Data recovery task (similar to sense disambiguation)
- Insensitive to the exact value of β
43. Class-Based Generalization
- Obtain a co-occurrence-based clustering of words and model a word co-occurrence by word-class or class-class co-occurrence
- Brown et al., CL 1992: mutual information clustering; class-based model interpolated with an n-gram model
- Pereira, Tishby, Lee, ACL 1993: soft, top-down distributional clustering for bigram modeling
- The general effectiveness of similarity/class-based methods is yet to be shown
44. Conclusions
- (Relatively) simple models cover a wide range of applications
- Useful in (hybrid) systems for automatic processing and knowledge acquisition
45. Discussion