Title: Chapter 8 Collocations
Slide 1: Chapter 8 Collocations

Slide 2: Introduction
- Collocations, consisting of two or more words, are characterized by limited compositionality.
- There is a large overlap between the concept of a collocation and the concepts of term, technical term, and terminological phrase.
- Collocations sometimes reflect interesting attitudes (in English) towards different types of substances: strong cigarettes, tea, coffee versus powerful drug (e.g., heroin).
- Applications:
  - natural language generation
  - computational lexicography
  - parsing
  - terminology extraction
Slide 3: Definition (w.r.t. the Computational and Statistical Literature)
- A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components. (Choueka, 1988)
- strong tea
  - basic sense: having great physical strength
  - meaning in strong tea: rich in some active agent
Slide 4: Other Definitions/Notions (w.r.t. the Linguistic Literature) I
- Collocations are not necessarily adjacent.
- Typical criteria for collocations: non-compositionality, non-substitutability, non-modifiability.
- Collocations cannot be translated word for word into other languages.
  - e.g., green house: the correct translation is the target language's own term for "greenhouse" (O), not a literal word-by-word rendering, "a house that is green" (X).
- Generalization to weaker cases: strong association of words, but not necessarily fixed occurrence.
Slide 5: Linguistic Subclasses of Collocations
- Light verbs: verbs with little semantic content; the noun contributes the argument structure to the predicate.
  - e.g., The man took a walk. vs. The man took a radio.
- Verb particle constructions, or phrasal verbs
  - A phrasal verb is a verb plus a particle (preposition) that creates a meaning different from that of the original verb.
  - e.g., take in → deceive
- Proper nouns/names
- Terminological expressions
Slide 6: Overview of the Collocation Detection Techniques Surveyed
- Selection of collocations by frequency
- Selection of collocations based on the mean and variance of the distance between the focal word and the collocating word
- Hypothesis testing
- Mutual information
Slide 7: Reference Corpus
- New York Times
- August-November 1990
- 115 MB of text, ~14M words
- Examples drawn from bigrams
Slide 8: Frequency (Justeson & Katz, 1995)
- 1. Select the most frequently occurring bigrams (see slide 9) → a quantitative technique
- 2. Pass the results through a part-of-speech filter (see slide 10) → linguistic knowledge
- A simple method that works very well (see slide 11); a code sketch follows below.
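The following is a minimal sketch of steps 1-2 in Python, assuming the corpus is already tokenized and part-of-speech tagged. The pattern set is restricted to two bigram patterns, and the corpus, tags, and threshold are illustrative assumptions rather than the original setup.

```python
from collections import Counter

# Bigram tag patterns likely to be phrases (A = adjective, N = noun);
# the full Justeson & Katz filter also covers longer patterns.
GOOD_PATTERNS = {("A", "N"), ("N", "N")}

def frequent_collocations(tagged_tokens, min_count=2):
    """Count adjacent bigrams and keep frequent ones matching a pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if (t1, t2) in GOOD_PATTERNS:
            counts[(w1, w2)] += 1
    return [(pair, c) for pair, c in counts.most_common() if c >= min_count]

# Toy tagged corpus (tags are assumed, for illustration only).
corpus = [("New", "A"), ("York", "N"), ("is", "V"), ("in", "P"),
          ("New", "A"), ("York", "N"), ("State", "N")]
print(frequent_collocations(corpus))  # [(('New', 'York'), 2)]
```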
Slide 9: Selecting the most frequently occurring bigrams is not always meaningful.
[Table of the most frequent bigrams omitted; nearly all of them are pairs of function words, e.g., "of the", "in the".]
Slide 10: Keep those patterns that are likely to be phrases.
- Tag patterns (A = adjective, P = preposition, N = noun), e.g., A N, N N, A A N, A N N, N A N, N N N, N P N.
- Alternatively, use a stop list to exclude words whose most frequent tag is not a verb, noun, or adjective.
Slide 11: Most of the bigrams that survive the filter are non-compositional phrases.
- Some survivors are not regarded as non-compositional phrases; these can be filtered out by keeping only the longest matching sequence.
Slide 13: Mean and Variance (Smadja et al., 1993)
- Frequency-based search works well for fixed phrases. However, many collocations consist of two words in a more flexible relationship.
  - She knocked on his door.
  - They knocked at the door.
  - 100 women knocked on Donaldson's door.
  - A man knocked on the metal front door.
- How do we extend bigrams to bigrams at a distance?
Slide 14:
- Define a collocational window (e.g., 3-4 words on each side).
- Enter every word pair in the window as a collocational bigram, as in the sketch below.
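A minimal sketch of the window extraction, assuming a tokenized sentence and a 3-word window; the function and variable names are illustrative.

```python
def window_pairs(tokens, window=3):
    """Yield (focal word, collocate, signed offset) for all pairs in the window."""
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                yield w, tokens[j], j - i   # offset > 0: collocate follows w

sent = "she knocked on his door".split()
print([(a, b, d) for a, b, d in window_pairs(sent)
       if a == "knocked" and b == "door"])   # [('knocked', 'door', 3)]
```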
Slide 15: Mean and Variance (Smadja et al., 1993)
- The method computes the mean and variance of the offset (signed distance) between the two words in the corpus.
  - She knocked on his door.
  - They knocked at the door.
  - 100 women knocked on Donaldson's door.
  - A man knocked on the metal front door.
- The offset is negative when the order is reversed, e.g., -3 for "the door that she knocked on."
Slide 16: Mean and Variance (Smadja et al., 1993)
- Variance: how much the individual offsets deviate from the mean.
- Interpretation of the standard deviation:
  - 0: the two words always occur at exactly the same distance.
  - Low: the two words usually occur at about the same distance.
  - High: the two words occur at random distances, i.e., by chance.
- Find the pairs with low deviation; a worked computation follows below.
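A minimal sketch of the mean/deviation computation, assuming the signed offsets for one word pair have already been collected (the offset values below are illustrative, not computed from the slides' exact sentences).

```python
from statistics import mean, stdev

# Hypothetical signed offsets of "door" relative to "knocked",
# collected over a corpus with the window extraction above.
offsets = [3, 3, 5, 5]

d_bar = mean(offsets)    # 4.0: "door" typically follows 4 positions later
s = stdev(offsets)       # ~1.15: low deviation, so a collocation candidate
print(d_bar, s)
```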
Slide 17: [Position histograms for "strong" and its collocates.] e.g., strong opposition shows a sharp peak (a deviation of 0 would mean the offset is always the same); strong support peaks with some spread, allowing variants such as strong leftist support and strong business support; occurrences at other offsets are noise.
Slide 18: If the offsets are randomly distributed (i.e., there is no collocation), the variance/sample deviation will be high.
- e.g., the offset histogram for "strong" and "for" is flat: for and strong don't form an interesting collocation.
Slide 19: [Table: mean offset and deviation for sample word pairs.]
- Low deviation with a mean offset around 1 and a roughly normal histogram: an interesting relationship, e.g., strong support, which also occurs with intervening words as in strong business support (cf. powerful lobbying organizations).
- Deviation near 0: fixed phrases, e.g., Richard M. Nixon (always the same offset) and Garrison said / said Garrison.
- High deviation, offsets spread across distances 1, 2, 3, 4: no interesting relationship; this also holds for frequent function-word pairs such as of ... millions ... of.
- Note: an offset of d means d - 1 words are inserted between the two words; word counts in the table range from rare (24, 27, 30, 59) to very frequent (9,017 to 15,019).
Slide 20: Smadja's Approach
- Additional constraint: filter out flat peaks in the position histogram, i.e., peaks that are not surrounded by deep valleys.
Slide 21: Hypothesis Testing I: Overview
- High frequency and low variance can be accidental; we want to know
  - whether the co-occurrence is random, or
  - whether it occurs more often than chance.
- This is a classical problem in statistics called hypothesis testing.
  - Formulate a null hypothesis H0 (no association beyond chance occurrences).
  - Calculate the probability p that the collocation would occur if H0 were true; reject H0 if p is too low, and retain H0 as possible otherwise.
- Two issues:
  - which particular patterns to look for in the data
  - how much data we have seen
- Typical significance levels: p < 0.05, 0.01, 0.005, or 0.001.
Slide 22: The t-Test
- The t-test assesses whether the means of two groups are statistically different from each other.
Slide 23: The t-Test (Continued)
- When we look at the difference between the scores of two groups, we have to judge the difference between their means relative to the spread, or variability, of their scores.
[Figure: three pairs of distributions; the difference between the means is the same in all three, but the variability differs.]
Slide 25: Look the ratio up in a table of significance to test whether it is large enough to say that the difference between the groups is unlikely to have been a chance finding.
- Degrees of freedom (df): the total number of samples in both groups minus 2.
- Given the alpha level, the df, and the t-value, you can look the t-value up in a standard table of significance.
[Figure: t distribution with rejection regions beyond -t(0.005) = -2.576 and t(0.005) = 2.576.]
- If the ratio is larger than 2.576, we say the two groups are different with 99% confidence.
Slide 26: Hypothesis Testing I: Overview
- Null hypothesis: each of the words w1 and w2 is generated completely independently of the other, i.e., there is no association; their occurrences are treated as random (a normal approximation is adopted).
- The chance of their coming together is then P(w1 w2) = P(w1) P(w2).
Slide 27: Hypothesis Testing II: The t test
- The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with (expected) mean μ.
- The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely it is to get a sample of that mean and variance assuming that the sample is drawn from a normal distribution with mean μ.
- To apply the t test to collocations, we think of the text corpus as a long sequence of N bigrams.
Slide 28: Hypothesis Testing II: The t test

    t = (x̄ - μ) / √(s² / N)

- x̄ is the sample mean
- s² is the sample variance
- N is the sample size
- μ is the mean of the distribution under the null hypothesis
- If the t statistic is large enough, we can reject the null hypothesis (here, "null" means no difference).
Slide 29:
- Null hypothesis: the sample is from the general population, i.e., there is no difference.
  - The mean height of the population of men is 158 cm.
- Sample: 200 men with x̄ = 169 and s² = 2600.
- Is this sample from the general population, or from a different population of taller men?

    t = (169 - 158) / √(2600 / 200) ≈ 3.05

- At a confidence level of α = 0.005, the critical value of t is 2.576.
- Because 3.05 > 2.576, we reject the null hypothesis with 99.5% confidence: the sample is not drawn from a population with mean 158 cm, and the probability of error is less than 0.005.
Slide 30: Example
- A text corpus has 14,307,668 tokens.
- new occurs 15,828 times.
- companies occurs 4,675 times.
- new companies occurs 8 times.
- Does new companies form a collocation?
- What is the sample over which we measure the mean and variance?
Slide 31:
- new occurs 15,828 times, companies 4,675 times, in 14,307,668 tokens.
- Null hypothesis: the occurrences of new and companies are independent, i.e., they stand in no relationship.
- Maximum likelihood estimates of the observed probabilities:

    P(new) = 15828 / 14307668 ≈ 1.106 × 10⁻³
    P(companies) = 4675 / 14307668 ≈ 3.267 × 10⁻⁴

- Under H0, each bigram position is a Bernoulli trial with success probability

    p = P(new) P(companies) ≈ 3.615 × 10⁻⁷

- The variance of a Bernoulli trial is σ² = p(1 - p) ≈ p when p is small.
Slide 32:
- There are 8 occurrences of new companies among the 14,307,668 bigrams, so the sample mean is x̄ = 8 / 14307668 ≈ 5.591 × 10⁻⁷, and

    t = (x̄ - μ) / √(s² / N) ≈ (5.591 × 10⁻⁷ - 3.615 × 10⁻⁷) / √(5.591 × 10⁻⁷ / 14307668) ≈ 0.999932

- Because 0.999932 < 2.576, we cannot reject the null hypothesis: new and companies occur independently and do not form a collocation.
- new companies is completely compositional; there is no element of added meaning. A code sketch of this computation follows below.
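A minimal sketch of the t computation in Python, using the counts given on the slides.

```python
from math import sqrt

N = 14_307_668                               # corpus size in bigrams
c_new, c_companies, c_bigram = 15_828, 4_675, 8

mu = (c_new / N) * (c_companies / N)         # H0 mean, ~3.615e-7
x_bar = c_bigram / N                         # sample mean, ~5.591e-7
s2 = x_bar * (1 - x_bar)                     # Bernoulli variance, ~x_bar

t = (x_bar - mu) / sqrt(s2 / N)
print(t)                                     # ~0.9999, below 2.576
```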
Slide 33: [Table: t test applied to bigrams that all occur 20 times in the corpus.]
- Bigrams for which we reject the null hypothesis are good candidates for collocations: a high proportion of the occurrences of both words, or at least a very high proportion of the occurrences of one of the words, occurs in the bigram, so t is high. e.g., Ayatollah Ruhollah (from Ayatollah Ruhollah Khomeini), where min(C(w1), C(w2)) = 20, i.e., every occurrence of the rarer word is in the bigram; the good candidates have component counts as low as 24, 27, 30, 59.
- Bigrams whose component words are individually very frequent (counts such as 9,017; 10,570; 13,478; 14,093; 15,019) fail the test for significance and are not good candidates for collocations.
- All of these bigrams have the same frequency, so a purely frequency-based method fails to distinguish them.
Slide 34: Hypothesis Testing II: Hypothesis testing of differences (Church & Hanks, 1989)
- We may also want to find words whose co-occurrence patterns best distinguish between two words, e.g., words that best differentiate the meanings of strong and powerful.
- This application can be useful in lexicography: e.g., which word, strong or powerful, is the suitable modifier for computer?
Slide 35: [Table: the words that co-occur most frequently with powerful vs. strong.]
Slide 36:
- The t test is extended to the comparison of the means of two normal populations.
- Here, the null hypothesis is that the average difference is 0 (μ = 0):

    t = (x̄₁ - x̄₂) / √(s₁²/N₁ + s₂²/N₂)
Slide 37: Assume a Bernoulli distribution, so that s² = p(1 - p) ≈ p when p is small. The t statistic then simplifies to

    t ≈ (C(v₁ w) - C(v₂ w)) / √(C(v₁ w) + C(v₂ w))

where v₁ and v₂ are the words we are comparing (e.g., powerful and strong), w is the collocate of interest (e.g., computers), and C(x) is the number of times x occurs in the corpus.
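A minimal sketch of this difference test; the counts in the example call are made up for illustration, not taken from the slides.

```python
from math import sqrt

def t_diff(c_v1_w, c_v2_w):
    """t for comparing how often w follows v1 vs. v2 (counts as arguments)."""
    return (c_v1_w - c_v2_w) / sqrt(c_v1_w + c_v2_w)

# Hypothetical counts: does "computers" prefer "powerful" over "strong"?
print(t_diff(c_v1_w=10, c_v2_w=2))   # ~2.31: a clear preference for v1
```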
Slide 38: [Table: t scores for the differences between words co-occurring with powerful vs. strong.]
Slide 39: Pearson's Chi-Square Test I: Method
- Use of the t test has been criticized because it assumes that probabilities are approximately normally distributed, which is generally not true.
- The χ² (chi-square) test does not make this assumption.
- The essence of the test is to compare observed frequencies with the frequencies expected under independence. If the difference between observed and expected frequencies is large, we can reject the null hypothesis of independence.
Slide 40: The chi-square distribution
- The chi-square distribution has one parameter, its degrees of freedom (df).
- The mean of a chi-square distribution is its df.
- The mode is df - 2, and the median is approximately df - 0.7.
Slide 41: Given C(new) = 15,828, C(companies) = 4,675, C(new companies) = 8, and a total of 14,307,668 words, we can derive the 2x2 contingency table:

                     w2 = companies         w2 ≠ companies
    w1 = new               8              15828 - 8 = 15820
    w1 ≠ new      4675 - 8 = 4667    14307668 - 15820 - 4667 = 14287181
Slide 42: With O_ij the observed value for cell (i, j) and E_ij the expected value under independence,

    X² = Σ_ij (O_ij - E_ij)² / E_ij

If new and companies occurred independently, we would expect E₁₁ = (15828 × 4675) / 14307668 ≈ 5.2 occurrences of new companies.
Slide 43: At α = 0.05, the critical value for df = 1 is χ² = 3.841. Because X² ≈ 1.55 < 3.841, we cannot reject the null hypothesis that new and companies occur independently of each other: new companies is not a good candidate for a collocation. A code sketch follows below.
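A minimal sketch of the X² computation for the 2x2 table from slide 41.

```python
def chi_square_2x2(o11, o12, o21, o22):
    """X^2 = sum of (O - E)^2 / E, with E computed from the marginals."""
    n = o11 + o12 + o21 + o22
    rows, cols = [o11 + o12, o21 + o22], [o11 + o21, o12 + o22]
    return sum((o - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i, row in enumerate([[o11, o12], [o21, o22]])
               for j, o in enumerate(row))

#                  w2=companies  w2!=companies   (slide 41's table)
print(chi_square_2x2(8, 15_820, 4_667, 14_287_181))   # ~1.55 < 3.841
```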
Slide 44: t statistic vs. X² statistic
- For the problem of finding collocations, the differences between the t statistic and the X² statistic do not seem to be large: the 20 bigrams with the highest t scores in the test corpus are also the 20 bigrams with the highest X² scores.
- The X² statistic is also appropriate for large probabilities, for which the normality assumption of the t test fails.
Slide 45: Pearson's Chi-Square Test II: Applications
- One of the early uses of the chi-square test in statistical NLP was the identification of translation pairs in aligned corpora (Church & Gale, 1991); a correct translation pair yields a very large score (X² = 456,400 in their example).
Slide 46:
- A more recent application is to use chi-square as a metric for corpus similarity (Kilgarriff and Rose, 1998).
  - Null hypothesis: the two corpora are random samples from the same underlying source. A low X² score means we cannot reject this hypothesis, i.e., the two corpora have a high degree of similarity.
- Nevertheless, the chi-square test should not be used on small corpora.
Slide 47: Likelihood Ratios I: Within a single corpus (Dunning, 1993)
- Likelihood ratios are more appropriate for sparse data than the chi-square test. In addition, they are easier to interpret than the chi-square statistic.
- In applying the likelihood ratio test to collocation discovery, we examine two alternative explanations for the occurrence frequency of a bigram w1 w2:
  - Hypothesis 1: the occurrence of w2 is independent of the previous occurrence of w1.
  - Hypothesis 2: the occurrence of w2 is dependent on the previous occurrence of w1 (evidence for a collocation).
Slide 48: Maximum likelihood estimates, where c1, c2, and c12 are the numbers of occurrences of w1, w2, and w1 w2 in a corpus of N bigrams:

    Hypothesis 1 (independence):  P(w2 | w1) = P(w2 | ¬w1) = p = c2 / N
    Hypothesis 2 (dependence):    P(w2 | w1) = p1 = c12 / c1
                                  P(w2 | ¬w1) = p2 = (c2 - c12) / (N - c1)
Slide 49: Under Hypothesis 1, the counts follow binomial distributions b(k; n, x) = C(n, k) xᵏ (1 - x)ⁿ⁻ᵏ:
- c12 of the c1 bigrams beginning with w1 are w1 w2: likelihood b(c12; c1, p)
- c2 - c12 of the N - c1 bigrams beginning with a word other than w1 are ¬w1 w2: likelihood b(c2 - c12; N - c1, p)

    L(H1) = b(c12; c1, p) · b(c2 - c12; N - c1, p)
Slide 50: Under Hypothesis 2, the same counts are generated with two different probabilities, p1 (the probability that w2 occurs when w1 precedes it) and p2 (the probability that w2 occurs when a different word precedes it):

    L(H2) = b(c12; c1, p1) · b(c2 - c12; N - c1, p2)
Slide 51: The log of the likelihood ratio λ is

    log λ = log [L(H1) / L(H2)]
          = log b(c12; c1, p) + log b(c2 - c12; N - c1, p)
            - log b(c12; c1, p1) - log b(c2 - c12; N - c1, p2)

A code sketch follows below.
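A minimal sketch of the log likelihood ratio computation, using the estimates from slide 48; it assumes 0 < p, p1, p2 < 1, which holds for the counts used here.

```python
from math import log

def log_l(k, n, x):
    """log of x^k (1 - x)^(n - k); binomial coefficients cancel in the ratio."""
    return k * log(x) + (n - k) * log(1 - x)

def log_lambda(c1, c2, c12, N):
    """log [L(H1) / L(H2)] for the bigram w1 w2."""
    p, p1, p2 = c2 / N, c12 / c1, (c2 - c12) / (N - c1)
    return (log_l(c12, c1, p) + log_l(c2 - c12, N - c1, p)
            - log_l(c12, c1, p1) - log_l(c2 - c12, N - c1, p2))

# -2 log lambda for "new companies"; asymptotically chi-square distributed,
# and it comes out small here, consistent with the earlier tests.
print(-2 * log_lambda(c1=15_828, c2=4_675, c12=8, N=14_307_668))
```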
Slide 52: [Table: bigrams ranked by -2 log λ; the measure remains usable even for rare bigrams.]
Slide 53: Analysis
- Likelihood ratios have a direct interpretation (how much more likely one hypothesis is than the other), so we do not have to look the value up in a table of significance.
- Likelihood ratios can be more appropriate for sparse data than the χ² test.
- -2 log λ is asymptotically χ² distributed, so a significance test is still available if desired.
Slide 54: Likelihood Ratios II: Between two or more corpora (Damerau, 1993)
- Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of one corpus when compared to the others.
- This approach is most useful for the discovery of subject-specific collocations.
Slide 55: [Example: a news story of July 31, 1989, concerning the Hizbollah leader Sheikh Abdel Karim Obeid; words from this story score high relative-frequency ratios against a general corpus.]
- Application: compare a general text with a subject-specific text to build a subject-specific dictionary.
Slide 56: Mutual Information
- An information-theoretic measure for discovering collocations is pointwise mutual information (Church et al., 1989, 1991):

    I(w1, w2) = log₂ [ P(w1 w2) / (P(w1) P(w2)) ]

- Note: this is pointwise MI, defined between values of random variables, as opposed to mutual information proper, defined between the random variables themselves. A code sketch follows below.
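A minimal sketch of pointwise MI with maximum likelihood probability estimates, reusing the new companies counts from the earlier slides.

```python
from math import log2

def pmi(c1, c2, c12, N):
    """Pointwise MI with maximum likelihood estimates of the probabilities."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

# "new companies" again: only ~0.63 bits, i.e., barely above independence.
print(pmi(15_828, 4_675, 8, 14_307_668))
```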
Slide 57: [Worked example: pointwise mutual information for the bigram Ayatollah Ruhollah, from Ayatollah Ruhollah Khomeini.]
Slide 58: Mutual Information
- Pointwise mutual information is roughly a measure of how much one word tells us about the other.
- Pointwise mutual information works particularly badly in sparse environments.
Slide 59: House of Commons ↔ Chambre des communes
- (communes, house) is an incorrect alignment pair; (chambre, house) is the correct pair.
- Their pointwise MI scores are similar, so MI struggles to separate them; the difference in their co-occurrence counts is what actually distinguishes the correct pair.
Slide 60: [Table: MI scores computed on the first 1,000 documents vs. on the entire corpus, which is 23 times larger; word pairs that occur only once receive inflated scores.]
Slide 61:
- Sparseness is a particularly difficult problem for mutual information.
- Perfect dependence: P(w1 w2) = P(w1) = P(w2), so

    I(w1, w2) = log₂ [ P(w1 w2) / (P(w1) P(w2)) ] = log₂ [ 1 / P(w2) ]

  As perfectly dependent bigrams get rarer, their mutual information increases, so MI is a bad measure of dependence: the score depends on the frequency of the individual words (see the sketch below).
- Perfect independence: P(w1 w2) = P(w1) P(w2), so I(w1, w2) = log₂ 1 = 0, which makes MI a good measure of independence.
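A small numeric illustration of the perfect-dependence case; the probabilities are chosen arbitrarily.

```python
from math import log2

# Perfect dependence: P(w1 w2) = P(w1) = P(w2) = p, so PMI = log2(1/p);
# the rarer the pair, the higher the score.
for p in (1e-2, 1e-4, 1e-6):
    print(p, log2(p / (p * p)))   # ~6.6, ~13.3, ~19.9 bits
```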
Slide 62: With MI, bigrams composed of low-frequency words receive a higher score than bigrams composed of high-frequency words. But higher frequency means more evidence, and we would prefer a higher rank for bigrams about which we have more evidence; MI does exactly the opposite.