Chapter 8 Collocations

Title: Chapter 8 Collocations
Provided by: hsinhs
Slides: 63

Transcript and Presenter's Notes
1
Chapter 8 Collocations
2
Introduction
  • Collocations, consisting of two or more words,
    are characterized by limited compositionality.
  • There is a large overlap between the concepts of
    collocation and term, technical term, and
    terminological phrase.
  • Collocations sometimes reflect interesting
    attitudes (in English) towards different types of
    substances: strong cigarettes, tea, coffee versus
    powerful drug (e.g., heroin)
  • Applications:
  • natural language generation
  • computational lexicography
  • parsing
  • terminology extraction
3
Definition (w.r.t. Computational and Statistical
Literature)
  • A collocation is defined as a sequence of two
    or more consecutive words that has
    characteristics of a syntactic and semantic unit,
    and whose exact and unambiguous meaning or
    connotation cannot be derived directly from the
    meaning or connotation of its components.
    (Choueka, 1988)
  • strong tea
  • basic sense of strong: having great physical strength
  • meaning in strong tea: rich in some active agent

4
Other Definitions/Notions (w.r.t. Linguistic
Literature) I
  • Collocations are not necessarily adjacent
  • Typical criteria for collocations:
    non-compositionality, non-substitutability,
    non-modifiability.
  • Collocations cannot be translated word for word
    into other languages.
  • Generalization to weaker cases: strong
    association of words, but not necessarily fixed
    occurrence.

e.g., green house must be translated as a single
unit ("greenhouse", O), not word by word ("a house
that is green", X)
5
Linguistic Subclasses of Collocations
  • Light verbs: verbs with little semantic content;
    the noun contributes argument structure to the
    predicate (e.g.):
  • The man took a walk. vs. The man took a radio.
  • Verb particle constructions or Phrasal Verbs
  • A phrasal verb is a verb plus a particle that
    creates a meaning different from the original
    verb.
  • take in → deceive
  • Proper Nouns/Names
  • Terminological Expressions

6
Overview of the Collocation Detecting Techniques
Surveyed
  • Selection of Collocations by Frequency
  • Selection of Collocations based on the Mean and
    Variance of the distance between a focal word and
    a collocating word.
  • Hypothesis Testing
  • Mutual Information

7
Reference Corpus
  • New York Times
  • August through November 1990
  • 115MB of text, 14M words
  • Examples drawn from bigrams

8
Frequency (Justeson & Katz, 1995)
  • 1. Select the most frequently occurring
    bigrams (refer to 8-9) → quantitative technique
  • 2. Pass the results through a part-of-speech
    filter (refer to 8-10) → linguistic knowledge
  • A simple method that works very well (refer to
    8-11).
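The two steps above can be sketched in a few lines. This is a minimal illustration, not Justeson & Katz's implementation: the hand-tagged sentence, the tiny tag set (A = adjective, N = noun, etc.), and the minimum count are all assumptions; a real system would run a POS tagger over a large corpus.

```python
from collections import Counter

# Illustrative pre-tagged token stream (tags are assumptions).
tagged = [
    ("the", "DET"), ("New", "A"), ("York", "N"), ("Stock", "N"),
    ("Exchange", "N"), ("reported", "V"), ("strong", "A"),
    ("support", "N"), ("for", "P"), ("the", "DET"), ("New", "A"),
    ("York", "N"), ("Stock", "N"), ("Exchange", "N"),
]

# Justeson & Katz patterns for bigrams: adjective-noun or noun-noun.
BIGRAM_PATTERNS = {("A", "N"), ("N", "N")}

def candidate_bigrams(tokens, min_count=2):
    """Step 1: count all bigrams. Step 2: keep only frequent ones
    whose tag sequence matches a likely-phrase pattern."""
    counts = Counter(
        ((w1, t1), (w2, t2))
        for (w1, t1), (w2, t2) in zip(tokens, tokens[1:])
    )
    return {
        (w1, w2): c
        for ((w1, t1), (w2, t2)), c in counts.items()
        if c >= min_count and (t1, t2) in BIGRAM_PATTERNS
    }

print(candidate_bigrams(tagged))
# New York, York Stock, and Stock Exchange each occur twice and
# pass the A-N / N-N filter; "the New" is frequent but filtered out.
```

Note how the POS filter removes frequent but uninteresting pairs such as determiner-adjective, which pure frequency ranking would keep.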

9
Selecting the most frequently occurring bigrams by
itself is not very useful: the top-ranked bigrams
are mostly pairs of function words.
10
Keep only those patterns that are likely to be
phrases.
A = adjective, P = preposition, N = noun
A stop list excludes words whose most frequent tag
is not a verb, noun, or adjective.
11
Most of them are non-compositional phrases. Some
results are not regarded as non-compositional
phrases; these can be filtered out by keeping only
the longest sequence.
12
(No Transcript)
13
Mean and Variance (Smadja et al., 1993)
  • Frequency-based search works well for fixed
    phrases. However, many collocations consist of
    two words in more flexible relationships.
  • She knocked on his door.
  • They knocked at the door.
  • 100 women knocked on Donaldson's door.
  • A man knocked on the metal front door.
  • How can we extend bigrams to bigrams at a
    distance?

14
  • Define a collocational window (e.g., 3-4 words on
    each side).
  • Enter every word pair within the window as a
    collocational bigram.
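As a minimal sketch of this windowing step (the sentence and the window size of 3 come from the slides; the function name is my own):

```python
def window_pairs(tokens, window=3):
    """Pair each word with every word within `window` positions
    on each side, recording the signed offset."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((w, tokens[j], j - i))  # (focal, collocate, offset)
    return pairs

tokens = "she knocked on his door".split()
pairs = [(a, b, d) for a, b, d in window_pairs(tokens) if a == "knocked"]
print(pairs)
# knocked pairs with she (-1), on (+1), his (+2), door (+3)
```

Every pair produced this way becomes a "bigram at a distance" whose offsets can then be analyzed.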

15
Mean and Variance (Smadja et al., 1993)
  • The method computes the mean and variance of the
    offset (signed distance) between the two words in
    the corpus.
  • She knocked on his door.
  • They knocked at the door.
  • 100 women knocked on Donaldson's door.
  • A man knocked on the metal front door.

(Donaldson's)
The offset is -3 in "the door that she knocked on",
where door occurs three words before knocked.
16
Mean and Variance (Smadja et al., 1993)
  • Variance:
  • how much the individual offsets deviate from the
    mean
  • Interpretation of the standard deviation:
  • 0: the two words always occur at exactly the
    same distance.
  • Low: the two words usually occur at about the
    same distance.
  • High: the two words occur together essentially
    by chance.
  • Find the pairs with low deviation.
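The mean/deviation computation can be sketched on the four knocked/door sentences from the slides. Note the exact numbers depend on tokenization (here a plain whitespace split, an assumption), so they may differ slightly from published figures.

```python
from statistics import mean, stdev

# The four example sentences from the slides, lowercased and
# with punctuation removed for simple whitespace tokenization.
sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "100 women knocked on donaldsons door",
    "a man knocked on the metal front door",
]

def offsets(sentences, w1, w2):
    """Signed distance from w1 to w2 in each sentence containing both."""
    out = []
    for s in sentences:
        toks = s.split()
        if w1 in toks and w2 in toks:
            out.append(toks.index(w2) - toks.index(w1))
    return out

d = offsets(sentences, "knocked", "door")
print(d, mean(d), stdev(d))  # offsets [3, 3, 3, 5], mean 3.5, sd 1.0
```

The low standard deviation signals that knocked and door tend to occur at a characteristic distance, i.e., they are a candidate flexible collocation.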

17
[Position histograms: strong opposition (low
deviation); strong support, with inserted words as
in strong leftist support and strong business
support; a fixed phrase would have deviation always
0; widely scattered offsets are noise.]
18
If the offsets are randomly distributed (i.e.,
no collocation), then the variance/sample
deviation will be high.
e.g., for and strong do not form an interesting
collocation.
19
[Position histograms over distances -4..4:
- strong support, also with words inserted in
  between (strong business support): low deviation;
- powerful (lobbying) organizations: medium
  deviation, roughly normal distribution;
- Richard M. Nixon (Richard M, M Nixon): deviation
  near 0, fixed distances;
- said Garrison / Garrison said, of millions of:
  high deviation, no interesting relationship.
Note: distance - 1 = the number of words inserted
between the pair.]
20
Smadja's Approach
  • An additional constraint filters out flat peaks
    in the position histogram, i.e., peaks that are
    not surrounded by deep valleys.
21
Hypothesis Testing I Overview
  • High frequency and low variance can be
    accidental.
  • whether the co-occurrence is random, or
  • whether it occurs more often than chance
  • This is a classical problem in Statistics called
    Hypothesis Testing.
  • Formulate a null hypothesis H0 (no association
    beyond chance occurrences).
  • Calculate the probability p that a collocation
    would occur if H0 were true; reject H0 if p is
    too low, and retain H0 as possible otherwise.
  • Two issues:
  • Look for particular patterns in the data.
  • How much data have we seen?

significance level of p < 0.05, 0.01, 0.005, or
0.001
22
The t-Test
  • The t-test assesses whether the means of two
    groups are statistically different from each
    other.

23
The t-Test (Continued)
  • When we are looking at the differences between
    scores for two groups, we have to judge the
    difference between their means relative to the
    spread or variability of their scores.

[In the figure, the difference between the means is
the same in all three cases.]
24
(No Transcript)
25
Look the ratio up in a table of significance to test
whether it is large enough to say that the
difference between the groups is unlikely to have
been a chance finding.
df = the sum of the sample sizes in both groups
minus 2
Given the alpha level, the df (degrees of freedom),
and the t-value, you can look the t-value up in a
standard table of significance.
For alpha = 0.005 in each tail (critical values -t
and t) and df = 99, the critical values are -2.576
and 2.576.
If the ratio is larger than 2.576, we say the two
groups are different with 99% confidence.
26
Hypothesis Testing I Overview
  • Null hypothesis:
  • Each of the words w1 and w2 is generated
    completely independently of the other, i.e.,
    there is no association; the words are randomly
    distributed (a normal distribution is adopted).
  • The chance of their coming together is
    P(w1 w2) = P(w1) P(w2).

27
Hypothesis Testing II The t test
  • The t test looks at the mean and variance of a
    sample of measurements, where the null hypothesis
    is that the sample is drawn from a distribution
    with mean μ.
  • The test looks at the difference between the
    observed and expected means, scaled by the
    variance of the data, and tells us how likely one
    is to get a sample of that mean and variance
    assuming that the sample is drawn from a normal
    distribution with mean μ.
  • To apply the t test to collocations, we think of
    the text corpus as a long sequence of N bigrams.

μ is the expected mean
28
Hypothesis Testing II The t test
    t = (x̄ - μ) / sqrt(s² / N)
  • x̄ is the sample mean
  • s² is the sample variance
  • N is the sample size
  • μ is the mean of the distribution

If the t statistic is large enough, then we can
reject the null hypothesis (here, null means no
difference).
29
  • Null hypothesis:
  • the sample is from the general population, i.e.,
    there is no difference
  • The mean height of the population of men is
    158 cm.
  • Sample:
  • 200 men with sample mean x̄ = 169 cm and sample
    variance s² = 2600
  • Is this sample from the general population or
    from a different population?
  • t = (169 - 158) / sqrt(2600 / 200) ≈ 3.05
  • The sample is not drawn from a population with
    mean 158 cm, and the probability of error is
    less than 0.005.
  • (confidence level) α = 0.005
  • The critical value of t is 2.576.
  • Because 3.05 > 2.576, reject the
    null hypothesis with 99.5%
    confidence.

30
Example
  • a text corpus has 14,307,668 tokens
  • new occurs 15,828 times
  • companies occurs 4,675 times
  • new companies occurs 8 times
  • Does new companies form a collocation?
  • What is the sample over which we measure the mean
    and variance?

31
  • new occurs 15,828 times, companies 4,675 times,
    in 14,307,668 tokens
  • Null hypothesis: the occurrences of new and
    companies are independent, i.e., they are in no
    relationship:
    H0: P(new companies) = P(new) P(companies)
      = (15828 / 14307668) x (4675 / 14307668)
      ≈ 3.615 x 10^-7
    (maximum likelihood estimates of the observed
    probabilities; each bigram is a Bernoulli trial,
    so s² = p(1 - p) ≈ p when p is small)
32
  • 8 occurrences of new companies among the
    14,307,668 bigrams, so the sample mean is
    x̄ = 8 / 14307668 ≈ 5.591 x 10^-7
  • t = (x̄ - μ) / sqrt(s² / N)
      ≈ (5.591 x 10^-7 - 3.615 x 10^-7) /
        sqrt(5.591 x 10^-7 / 14307668)
      ≈ 0.999932
  • Because 0.999932 < 2.576, we cannot reject the
    null hypothesis; i.e., new and companies occur
    independently and do not form a collocation.
  • new companies is completely compositional, and
    there is no element of added meaning.
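The computation on this slide can be reproduced numerically from the counts given:

```python
import math

# t-test for the bigram "new companies" in the NYT corpus,
# using the counts from the slides.
N = 14307668
c_new, c_companies, c_bigram = 15828, 4675, 8

mu = (c_new / N) * (c_companies / N)   # expected P under H0 ≈ 3.615e-7
x_bar = c_bigram / N                   # observed P ≈ 5.591e-7
s2 = x_bar * (1 - x_bar)               # Bernoulli variance ≈ x_bar

t = (x_bar - mu) / math.sqrt(s2 / N)
print(round(t, 4))  # ≈ 0.9999, well below the 2.576 critical value
```

Since t is far below the critical value, the test gives no evidence that "new companies" is a collocation.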

33
Bigrams that reject the null hypothesis are good
candidates for collocations: t is high when a high
proportion of the occurrences of both words, or at
least a very high proportion of the occurrences of
one of the words, occurs in the bigram
(e.g., Ayatollah Ruhollah Khomeini).

[Table residue: bigrams with the same frequency,
e.g., min(C(w1), C(w2)) = 20, for which the
frequency-based method fails to discriminate.]

Bigrams that fail the test for significance are not
good candidates for collocations.
34
Hypothesis Testing II Hypothesis testing of
differences (Church & Hanks, 1989)
  • We may also want to find words whose
    co-occurrence patterns best distinguish between
    two words.
  • E.g., to find the words that best differentiate
    the meanings of strong and powerful.
  • This application can be useful for lexicography:
  • which word (strong or powerful) is suitable to
    modify computer, for a lexicographer?

35
[Table: words co-occurring with powerful vs. strong]
36
  • The t test is extended to the comparison of the
    means of two normal populations.
  • Here, the null hypothesis is that the average
    difference is 0 (μ = 0):
    t = (x̄1 - x̄2) / sqrt(s1²/n1 + s2²/n2)
37
Assume a Bernoulli distribution; s² = p(1 - p) ≈ p
when p is small, so the statistic simplifies to
    t ≈ (C(v1 w) - C(v2 w)) / sqrt(C(v1 w) + C(v2 w))
where v1 and v2 are the words we are comparing
(e.g., powerful and strong), w is the collocate of
interest (e.g., computers), and C(x) is the number
of times x occurs in the corpus.
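A sketch of the simplified statistic. The counts for "powerful computers" and "strong computers" below are made-up illustrations, not corpus figures:

```python
import math

def diff_t(c_v1w, c_v2w):
    """Simplified difference test from the slide:
    t ≈ (C(v1 w) - C(v2 w)) / sqrt(C(v1 w) + C(v2 w))."""
    return (c_v1w - c_v2w) / math.sqrt(c_v1w + c_v2w)

# Hypothetical counts for (powerful, computers) and (strong, computers).
t = diff_t(10, 2)
print(round(t, 3))  # ≈ 2.309, suggesting powerful is the better modifier
```

A large positive t indicates that w co-occurs significantly more often with v1 than with v2.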
38
[Table: words co-occurring with powerful vs. strong]
39
Pearson's Chi-Square Test I Method
  • Use of the t test has been criticized because it
    assumes that probabilities are approximately
    normally distributed (not true, in general).
  • The X² (Chi-Square) test does not make this
    assumption.
  • The essence of the test is to compare observed
    frequencies with the frequencies expected under
    independence. If the difference between observed
    and expected frequencies is large, then we can
    reject the null hypothesis of independence.

40
Chi square distribution
  • The chi square distribution has one parameter,
    its degrees of freedom (df).
  • The mean of a chi square distribution is its df.
  • The mode is df - 2 and the median is
    approximately df - 0.7.

41
Given C(new) = 15,828, C(companies) = 4,675,
C(new companies) = 8, and a total of 14,307,668
tokens, we can derive:
C(¬new, companies) = 4675 - 8 = 4667
C(new, ¬companies) = 15828 - 8 = 15820
C(¬new, ¬companies) = 14307668 - 15820 - 4667
                    = 14287181
42
Oij = observed value for cell (i,j); Eij = expected
value for cell (i,j):
    X² = Σij (Oij - Eij)² / Eij
Under the null hypothesis that new and companies
occur independently, we expect about 5.2
occurrences of new companies.
43
For α = 0.05, the critical value is X² = 3.841.
Because 1.55 < 3.841, we cannot reject the null
hypothesis that new and companies occur
independently of each other; new companies is not a
good candidate for a collocation.
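The X² value above can be verified with the closed-form formula for a 2x2 contingency table, using the cell counts derived two slides earlier:

```python
# 2x2 chi-square for the new / companies table:
#   X² = N (O11 O22 - O12 O21)² /
#        ((O11+O12)(O11+O21)(O12+O22)(O21+O22))
O11, O12 = 8, 4667          # (new, companies), (¬new, companies)
O21, O22 = 15820, 14287181  # (new, ¬companies), (¬new, ¬companies)
N = O11 + O12 + O21 + O22

chi2 = (N * (O11 * O22 - O12 * O21) ** 2 /
        ((O11 + O12) * (O11 + O21) * (O12 + O22) * (O21 + O22)))
print(round(chi2, 2))  # ≈ 1.55 < 3.841, so H0 is not rejected
```

The closed form avoids computing the expected counts explicitly; it is algebraically identical to summing (O - E)²/E over the four cells.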
44
t statistic vs. X2 statistic
  • For the problem of finding collocations, the
    differences between the t statistic and the X2
    statistic do not seem to be large.
  • The 20 bigrams with the highest t scores in the
    test corpus are also the 20 bigrams with the
    highest X2 scores.
  • The X² statistic is also appropriate for large
    probabilities, for which the normality assumption
    of the t test fails.

45
Pearson's Chi-Square Test II Applications
  • One of the early uses of the Chi square test in
    Statistical NLP was the identification of
    translation pairs in aligned corpora (Church &
    Gale, 1991).

X² = 456,400
46
  • A more recent application is to use Chi square
    as a metric for corpus similarity (Kilgarriff
    and Rose, 1998).
  • Nevertheless, the Chi-Square test should not be
    used with small counts.

Null hypothesis: the two corpora are in no
relationship. If the X² score is high, then the two
corpora have a high degree of similarity.
47
Likelihood Ratios I Within a single corpus
(Dunning, 1993)
  • Likelihood ratios are more appropriate for sparse
    data than the Chi-Square test. In addition, they
    are easier to interpret than the Chi-Square
    statistic.
  • In applying the likelihood ratio test to
    collocation discovery, we examine the following
    two alternative explanations for the occurrence
    frequency of a bigram w1 w2:
  • Hypothesis 1: The occurrence of w2 is
    independent of the previous occurrence of w1.
  • Hypothesis 2: The occurrence of w2 is dependent
    on the previous occurrence of w1
    (evidence for a collocation).
48
maximum likelihood estimates:
Hypothesis 1: P(w2 | w1) = p = P(w2 | ¬w1),
    with p = c2 / N
Hypothesis 2: P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1),
    with p1 = c12 / c1 and p2 = (c2 - c12) / (N - c1)
where c1, c2, and c12 are the numbers of occurrences
of w1, w2, and w1 w2 in the corpus of N bigrams.
49
binomial distribution:
    b(k; n, x) = C(n, k) x^k (1 - x)^(n - k)
Under Hypothesis 1, c12 out of c1 bigrams are w1 w2
and c2 - c12 out of N - c1 bigrams are ¬w1 w2, both
with probability p:
    L(H1) = b(c12; c1, p) b(c2 - c12; N - c1, p)
50
binomial distribution:
Under Hypothesis 2, p1 is the probability that w2
occurs when w1 precedes it, and p2 the probability
that w2 occurs when a different word precedes it;
c12 out of c1 bigrams are w1 w2, and c2 - c12 out of
N - c1 bigrams are ¬w1 w2:
    L(H2) = b(c12; c1, p1) b(c2 - c12; N - c1, p2)
51
The log of the likelihood ratio λ:
    log λ = log [L(H1) / L(H2)]
          = log L(c12, c1, p) + log L(c2 - c12, N - c1, p)
            - log L(c12, c1, p1) - log L(c2 - c12, N - c1, p2)
where L(k, n, x) = x^k (1 - x)^(n - k)
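A sketch of this log-likelihood ratio for the same new / companies counts used in the t-test and X² slides (the binomial coefficients cancel in the ratio, so only the x^k (1-x)^(n-k) part is needed):

```python
import math

def log_L(k, n, x):
    """log of x^k (1 - x)^(n - k)."""
    return k * math.log(x) + (n - k) * math.log(1 - x)

N = 14307668
c1, c2, c12 = 15828, 4675, 8   # C(new), C(companies), C(new companies)

p = c2 / N                  # H1: P(w2|w1) = P(w2|¬w1) = p
p1 = c12 / c1               # H2: P(w2|w1)
p2 = (c2 - c12) / (N - c1)  # H2: P(w2|¬w1)

log_lambda = (log_L(c12, c1, p) + log_L(c2 - c12, N - c1, p)
              - log_L(c12, c1, p1) - log_L(c2 - c12, N - c1, p2))
print(-2 * log_lambda)  # asymptotically chi-square distributed
```

Here -2 log λ comes out small, consistent with the earlier tests: there is no evidence that new companies is a collocation.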
52
rare bigrams
53
Analysis
  • We do not have to look up values in a table to
    interpret likelihood ratios.
  • Likelihood ratios can be more appropriate for
    sparse data than the X² test.
  • -2 log λ is asymptotically X² distributed.

54
Likelihood Ratios II Between two or more corpora
(Damerau, 1993)
  • Ratios of relative frequencies between two or
    more different corpora can be used to discover
    collocations that are characteristic of a corpus
    when compared to other corpora.
  • This approach is most useful for the discovery of
    subject-specific collocations.

55
[Example (original annotation in Chinese, partly
garbled): a July 31, 1989 news story concerning the
Hizbollah leader Sheikh Abdel Karim Obeid.]
Applications: compare a general text with a
subject-specific text → specific dictionary
56
Mutual Information
  • An information-theoretic measure for discovering
    collocations is pointwise mutual information
    (Church et al., 1989, 1991):
    I(w1, w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ]

random variables vs. values of random variables
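As a sketch, pointwise mutual information can be estimated from the same new / companies counts used in the earlier tests (the function name is my own):

```python
import math

def pmi(c1, c2, c12, N):
    """Pointwise mutual information from bigram counts:
    I(w1, w2) = log2 [ P(w1 w2) / (P(w1) P(w2)) ]."""
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# C(new), C(companies), C(new companies), corpus size
score = pmi(15828, 4675, 8, 14307668)
print(round(score, 3))  # close to 0, as expected under independence
```

A score near 0 indicates near-independence; a large positive score would indicate that the two words co-occur far more often than chance.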
57
(Ayatollah Ruhollah Khomeini)
58
Mutual Information
  • Pointwise Mutual Information is roughly a measure
    of how much one word tells us about the other.
  • Pointwise mutual information works particularly
    badly in sparse environments.

59
House of Commons ↔ Chambre de communes
(communes, house): incorrect pair
(chambre, house): correct pair
[Score-table residue: under one measure the two
pairs receive similar scores, while the other
measure clearly separates them (>, <), preferring
the correct pair.]
60
first 1000 documents
6
??
8
only occur once in 23 times larger corpus
entire corpus
5 6
61
  • Sparseness is a particularly difficult problem
    for mutual information.
  • Perfect dependence: P(w1 w2) = P(w1) = P(w2), so
    I(w1, w2) = log2 [1 / P(w1)].
  • As perfectly dependent bigrams get rarer, their
    mutual information increases → MI is a bad
    measure of dependence; i.e., the score depends
    on the frequency of the individual words.
  • Perfect independence: P(w1 w2) = P(w1) P(w2), so
    I(w1, w2) = log2 1 = 0 → MI is a good measure of
    independence.

62
With MI, bigrams composed of low-frequency words
receive a higher score than bigrams composed of
high-frequency words. This is the opposite of what
we want: higher frequency means more evidence, and
we would prefer a higher rank for the bigrams about
which we have more evidence.