Transcript and Presenter's Notes

Title: Unsupervised Methods


1
Unsupervised Methods
2
Association Measures
  • Association between items: assoc(x,y)
  • term-term, term-document, term-category, ...
  • Simple measures: freq(x,y), log(freq(x,y) + 1)
  • Based on a contingency table

3
Mutual Information
  • The term corresponding to the pair (x,y) within the
    Mutual Information of the variables X,Y
  • Disadvantage: the MI value is inflated for low
    freq(x,y)
  • Example: results for two NLP articles
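The formula itself is not preserved in this transcript; the quantity described is presumably the pointwise mutual information term for the pair (x,y):

  PMI(x,y) = log [ P(x,y) / ( P(x) P(y) ) ]

Estimated from counts as log [ N * freq(x,y) / ( freq(x) freq(y) ) ], which is why the score is inflated when freq(x,y) is very small: a single cooccurrence of two rare items already yields a large value.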

4
Log-Likelihood Ratio Test
  • Comparing the likelihood of the data given two
    competing hypotheses (Dunning, 1993)
  • Does not depend heavily on assumptions of
    normality; can be applied to small samples
  • Used to test whether p(x|y) = p(x|not y) = p(x)
    (independence), by comparing it to the general case
    (inequality)
  • A high log-likelihood score indicates that the data
    is much less likely under the equality assumption

5
Log-Likelihood (cont.)
  • Likelihood function (binomial):
    L(p; k, n) = p^k (1 - p)^(n - k)
  • The likelihood ratio: λ = max L under H0 (equality) /
    max L under H1 (inequality)
  • -2 log λ is asymptotically chi-square distributed
  • High -2 log λ: the data is much less likely given H0

6
Log-Likelihood for Bigrams
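The formulas on this slide did not survive the transcript. Below is a minimal Python sketch of the standard Dunning-style computation for a bigram (w1, w2), assuming the 2x2 contingency table is summarized by c12 = freq(w1 w2), c1 = freq(w1), c2 = freq(w2) and the total bigram count n; the helper names are illustrative, not from the slides.

  from math import log

  def binomial_log_likelihood(k, n, p):
      # log L(p; k, n) = k*log(p) + (n-k)*log(1-p), guarding p = 0 or p = 1
      eps = 1e-12
      p = min(max(p, eps), 1 - eps)
      return k * log(p) + (n - k) * log(1 - p)

  def bigram_llr(c12, c1, c2, n):
      # Occurrences of w2 right after w1, and after anything other than w1
      k1, n1 = c12, c1
      k2, n2 = c2 - c12, n - c1
      p = c2 / n                   # H0: w2 equally likely in both contexts
      p1, p2 = k1 / n1, k2 / n2    # H1: two separate probabilities
      return 2 * (binomial_log_likelihood(k1, n1, p1)
                  + binomial_log_likelihood(k2, n2, p2)
                  - binomial_log_likelihood(k1, n1, p)
                  - binomial_log_likelihood(k2, n2, p))

For example, bigram_llr(20, 50, 60, 100000) yields a much higher score than bigram_llr(1, 50, 60, 100000), matching the -2 log λ interpretation above.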
7
Log-Likelihood for Binomial
  • Maximum of the likelihood obtained for p = k/n
    (the maximum-likelihood estimate)

8
Measuring Term Topicality
  • For query relevance ranking: Inverse Document
    Frequency (IDF)
  • For term extraction:
  • Frequency
  • Frequency ratio for a specialized vs. a general
    corpus
  • Entropy of the term co-occurrence distribution
  • Burstiness:
  • Entropy of the term's distribution (frequency) across
    documents
  • Proportion of topical documents for the term (freq > 1)
    within all documents containing the term (Katz, 1996)
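A minimal sketch of two of the listed measures, IDF and the Katz-style topicality proportion; the function names and the input representation are assumptions made for illustration.

  from math import log

  def idf(doc_freq, num_docs):
      # Inverse document frequency: rarer terms receive higher weight
      return log(num_docs / doc_freq)

  def topical_proportion(freqs_in_containing_docs):
      # Katz (1996)-style burstiness: share of 'topical' documents (freq > 1)
      # among all documents that contain the term at all
      topical = sum(1 for f in freqs_in_containing_docs if f > 1)
      return topical / len(freqs_in_containing_docs)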

9
Similarity Measures
  • Cosine
  • Min/Max
  • KL to Average
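The definitions behind these bullets are not shown in the transcript; writing a_u(att) for assoc(u,att) and p_u(att) for P(att|u), the standard forms are presumably:

  Cosine(u,v)  = sum_att a_u(att) a_v(att) / sqrt( sum_att a_u(att)^2 * sum_att a_v(att)^2 )
  Min/Max(u,v) = sum_att min(a_u(att), a_v(att)) / sum_att max(a_u(att), a_v(att))
  KL-to-Average(u,v) = D( p_u || (p_u + p_v)/2 ) + D( p_v || (p_u + p_v)/2 )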

10
A Unifying Schema of Similarity (with Erez Lotan)
  • A general schema encoding most measures
  • Identifies explicitly the important factors that
    determine (word) similarity
  • Provides the basis for
  • a general and efficient similarity computation
    procedure
  • evaluating and comparing alternative measures and
    components

11
Mapping to Unified Similarity Scheme
12
Association and Joint Association
  • assoc(u,att): quantifies association strength
  • mutual information, weighted log frequency,
    conditional probability (orthogonal to the scheme)
  • joint(assoc(u,att), assoc(v,att)): quantifies the
    similarity of the two associations
  • ratio, difference, min, product
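A minimal sketch of how the scheme composes assoc, joint and a normalization into one similarity procedure, assuming word vectors are sparse attribute-to-assoc maps; the function names are illustrative, not the original implementation.

  def unified_sim(u_vec, v_vec, joint, norm):
      # Sum a joint-association score over the attributes shared by u and v,
      # then divide by a measure-specific normalization factor
      shared = set(u_vec) & set(v_vec)
      total = sum(joint(u_vec[att], v_vec[att]) for att in shared)
      return total / norm(u_vec, v_vec)

  # Min/Max instance: joint = min, normalization = sum of max over all attributes
  def minmax_norm(u_vec, v_vec):
      atts = set(u_vec) | set(v_vec)
      return sum(max(u_vec.get(a, 0.0), v_vec.get(a, 0.0)) for a in atts)

  # sim = unified_sim(u_vec, v_vec, joint=min, norm=minmax_norm)

Only attributes with non-zero assoc for both words enter the sum, which is where the sparseness speedup mentioned on the Computational Benefits slide comes from.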

13
Normalization
  • Global weight of a word vector
  • For cosine: the vector norm, sqrt(sum over att of assoc(u,att)^2)
  • Normalization factor
  • For cosine: the product of the two vectors' global weights

14
The General Similarity Scheme
15
Min/Max Measures
  • May be viewed as a weighted Jaccard coefficient:
    sum of min(assoc(u,att), assoc(v,att)) divided by
    sum of max(assoc(u,att), assoc(v,att))

16
Associations Used with Min/Max
  • Log-frequency and Global Entropy Weight
    (Grefenstette, 1994)
  • Mutual information (Dagan et al., 1993/5)

17
Cosine Measure
  • Used for word similarity (Ruge, 1992) with
    assoc(u,att) = ln(freq(u,att))
  • Popular for document ranking (vector space model)

18
Methodological Benefits
  • Joint work with Erez Lotan (Dagan 2000 and in
    preparation)
  • Uniform understanding of similarity measure
    structure
  • Modular evaluation/comparison of measure
    components
  • Modular implementation architecture; easy
    experimentation by plugging in alternative measure
    combinations

19
Empirical Evaluation
  • Precision and comparative recall at each point in
    the ranked list

20
Comparing Measure Combinations
(Precision-recall plot comparing the measure combinations)
  • Min/Max schemes worked better than cosine and
    Jensen-Shannon (by almost 20 points), and were stable
    over association measures

21
Effect of Co-occurrence Type on Semantic
Similarity
22
Computational Benefits
  • Complexity reduced by the sparseness factor:
    non-zero cells / total cells
  • Two orders of magnitude in corpus data

23
General Scheme - Conclusions
  • A general mathematical scheme
  • Identifies the important factors for measuring
    similarity
  • Efficient general procedure based on the scheme
  • Empirical comparison of different measure
    components (measure structure and assoc)
  • Successful application in an Internet crawler for
    thesaurus construction (small corpora)

24
Clustering Methods
  • Input: a set of objects (words, documents)
  • Output: a set of clusters (sets of elements)
  • Based on a criterion for the quality of a class,
    which guides cluster splits/merges/modifications:
  • a distance function between objects/classes
  • a global quality function

25
Clustering Types
  • Soft / Hard
  • Hierarchical / Flat
  • Top-down / bottom-up
  • Predefined number of clusters or not
  • Input: either
  • all point-to-point distances, or
  • the original vector representation of the points,
    computing needed distances during clustering

26
Applications of Clustering
  • Word clustering
  • Constructing a hierarchical thesaurus
  • Compactness and generalization in word
    cooccurrence modeling (will be discussed later)
  • Document clustering
  • Browsing of document collections and search query
    output
  • Assistance in defining a set of supervised
    categories

27
Hierarchical Agglomerative Clustering Methods
(HACM)
1. Initialize every point as a cluster
2. Compute a merge score for all cluster pairs
3. Perform the best scoring merge
4. Compute the merge score between the new cluster
   and all other clusters
5. If more than one cluster remains, return to 3
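A naive single-link instantiation of these steps in Python, assuming a point-to-point distance function is given; this is a sketch for clarity, not an efficient implementation.

  def hacm(points, dist):
      # 1. Initialize every point as its own cluster
      clusters = [frozenset([p]) for p in points]

      def cluster_dist(a, b):
          # Single link: distance between the two nearest points
          return min(dist(x, y) for x in a for y in b)

      merges = []
      while len(clusters) > 1:
          # 2./4. Compute merge scores for all current cluster pairs
          pairs = [(i, j) for i in range(len(clusters))
                          for j in range(i + 1, len(clusters))]
          i, j = min(pairs, key=lambda ij: cluster_dist(clusters[ij[0]],
                                                        clusters[ij[1]]))
          # 3. Perform the best scoring merge
          merges.append((clusters[i], clusters[j]))
          merged = clusters[i] | clusters[j]
          clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
          clusters.append(merged)
          # 5. Repeat until a single cluster remains
      return merges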
28
Types of Merge Score
  • Minimal distance between the two candidates for
    the merge. Alternatives for cluster distance:
  • Single link: distance between the two nearest points
  • Complete link: distance between the two furthest
    points
  • Group average: average pairwise distance over all
    points
  • Centroid: distance between the two cluster
    centroids
  • Based on the quality of the merged class:
  • Ward's method: minimal increase in total
    within-group sum of squares (average squared
    distance to the centroid)
  • Based on a global criterion (in Brown et al.,
    1992: minimal reduction in average mutual
    information)

29
Unsupervised Statistics and Generalizations for
Classification
  • Many supervised methods use cooccurrence
    statistics as features or probability estimates
  • eat a peach / beach
  • fire a missile vs. fire the prime minister
  • Sparse data problem: if alternative cooccurrences
    never occurred, how can we estimate their
    probabilities, or their relative strength as
    features?

30
Application: Semantic Disambiguation
31
Statistical Approach
Semantic Judgment
Corpus (text collection):
<verb-object: throw-grenade> 20 times
<verb-object: throw-pocket> 1 time
32
What about sense disambiguation? (for translation)
33
Solution: Mapping to Another Language
English(-English)-Hebrew Dictionary:
bar (sense 1) → chafisa   soap → sabon   window → chalon   bar (sense 2) → sorag
  • Exploiting differences in ambiguity between the two
    languages
  • Principle: intersecting redundancies (Dagan
    and Itai, 1994)

34
Selection Model Highlights
  • Multinomial model, under certain linguistic
    assumptions
  • Selection confidence: a lower bound for the
    odds ratio
  • Overlapping ambiguous constructs are resolved
    through constraint propagation, in decreasing
    confidence order
  • Results (Hebrew→English): coverage
    70%, precision within coverage 90%
  • 20% improvement over choosing the most frequent
    translation (the common baseline)
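The bound itself is not shown in the transcript; in the spirit of Dagan and Itai (1994), a one-sided confidence bound on the log odds ratio of the two best candidate translations has roughly the form (this is a reconstruction, not the exact slide content):

  B = ln(n1 / n2) - Z_(1-alpha) * sqrt(1/n1 + 1/n2)

where n1 and n2 are the corpus counts of the two competing target-language constructs and Z_(1-alpha) is the normal quantile; a selection is made only when B exceeds a preset threshold.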

35
Data Sparseness and Similarity
<verb-object: hidpis-tikiya>
?
<verb-object: print-folder> 0 times
<verb-object: print-file_cabinet> 0 times
  • Standard approach: back-off to single-term
    frequency
  • Similarity-based inference

36
Computing Distributional Similarity
(Diagram: the cooccurrence vectors of "folder" and "file" are similar)
37
Disambiguation Algorithm
  • Selection of the preferred alternative
  • Hypothesized similarity-based frequency, derived
    from the average association for similar
    words (incorporating single-term frequency)
  • Comparing the hypothesized frequencies

38
Computation and Evaluation
  • Heuristic search used to speed up computation of the
    k most similar words
  • Results (Hebrew→English):
  • 15% coverage increase, while decreasing precision
    by 2%
  • Accuracy 15% better than back-off to single-word
    frequency (Dagan, Marcus and Markovitch, 1995)

39
Probabilistic Framework - Smoothing
40
Smoothing Conditional Attribute Probability
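The formula on this slide is not preserved; the usual similarity-based estimate in this line of work (Dagan, Lee and Pereira) averages the conditional attribute probabilities of the most similar words, weighting each by its similarity:

  P_SIM(att | u) = sum over v in S(u) of [ W(u,v) / sum over v' in S(u) of W(u,v') ] * P(att | v)

where S(u) is the set of the k most similar words to u and W(u,v) is a similarity-derived weight, e.g. exp(-beta * D(u || v)); this is where the beta parameter mentioned in the results slide enters. P_SIM is typically combined with the standard back-off estimate.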
41
Similarity/Distance Functions for Probability
Distributions
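The functions themselves are missing from the transcript; the measures labelled A, L and PC in the results slide below presumably correspond to the total divergence to the average, the L1 norm, and the confusion probability, alongside the KL divergence:

  D(p || q) = sum_att p(att) log( p(att) / q(att) )
  A(p, q)   = D( p || (p+q)/2 ) + D( q || (p+q)/2 )      (total divergence to the average)
  L1(p, q)  = sum_att | p(att) - q(att) |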
42
Sample Results
  • Most similar words to guy

Measure   Closest words to "guy"
A         guy kid thing lot man mother doctor friend boy son
L         guy kid lot thing man doctor girl rest son bit
PC        role people fire guy man year lot today way part

Typical common verb contexts: see, get, give, tell, take
PC: an earlier attempt at similarity-based smoothing
  • Several smoothing experiments (A performed best)
  • Language modeling for speech (hunt bears / pears)
  • Perplexity (predicting test-corpus likelihood)
  • Data recovery task (similar to sense
    disambiguation)
  • Insensitive to the exact value of β

43
Class-Based Generalization
  • Obtain a cooccurrence-based clustering of words
    and model a word cooccurrence by word-class or
    class-class cooccurrence
  • Brown et al., CL 1992: mutual-information
    clustering; class-based model interpolated with an
    n-gram model
  • Pereira, Tishby and Lee, ACL 1993: soft, top-down
    distributional clustering for bigram modeling
  • The general effectiveness of similarity/class-based
    methods is yet to be shown
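For the word-class / class-class modeling mentioned above, the class-based bigram model of Brown et al. (1992) has the form

  P(w_i | w_(i-1)) = P(w_i | c(w_i)) * P(c(w_i) | c(w_(i-1)))

where c(w) is the (hard) class assigned to word w; this class-based estimate is then interpolated with the word n-gram model.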

44
Conclusions
  • (Relatively) simple models cover a wide range of
    applications
  • Usefulness in (hybrid) systems: automatic
    processing and knowledge acquisition

45
Discussion