Title: COMP791A: Statistical Language Processing
1. COMP791A: Statistical Language Processing
Collocations (Chap. 5)
2. A collocation
- is an expression of 2 or more words that corresponds to a conventional way of saying things
- Ex: broad daylight
- Why not ?bright daylight or ?narrow darkness?
- big mistake but not ?large mistake
- overlaps with the concepts of terms, technical terms, and terminological phrases
- collocations extracted from technical domains
- Ex: hydraulic oil filter, file transfer protocol
3. Examples of Collocations
- strong tea
- weapons of mass destruction
- to make up
- to check in
- heard it through the grapevine
- he knocked at the door
- I made it all up
4. Definition of a collocation
- (Choueka, 1988)
- "A collocation is defined as a sequence of two or more consecutive words that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components."
- Criteria:
- non-compositionality
- non-substitutability
- non-modifiability
- non-translatable word for word
5. Non-Compositionality
- A phrase is compositional if its meaning can be predicted from the meaning of its parts
- Collocations have limited compositionality
- there is usually an element of meaning added to the combination
- Ex: strong tea
- Idioms are the most extreme examples of non-compositionality
- Ex: to hear it through the grapevine
6. Non-Substitutability
- We cannot substitute near-synonyms for the components of a collocation
- strong is a near-synonym of powerful
- strong tea but ?powerful tea
- yellow is as good a description of the color of white wines
- yet white wine but ?yellow wine
7. Non-modifiability
- Many collocations cannot be freely modified with additional lexical material or through grammatical transformations
- weapons of mass destruction → ?weapons of massive destruction
- to be fed up to the back teeth → ?to be fed up to the teeth in the back
8. Non-translatable (word for word)
- English: make a decision, not ?take a decision
- French: prendre une décision, not ?faire une décision
- To test whether a group of words is a collocation:
- translate it into another language
- if we cannot translate it word by word
- then it probably is a collocation
9. Linguistic Subclasses of Collocations
- Phrases with light verbs
- verbs with little semantic content in the collocation: make, take, do
- Verb particle / phrasal verb constructions
- to go down, to check out, ...
- Proper nouns
- John Smith
- Terminological expressions
- concepts and objects in technical domains
- Ex: hydraulic oil filter
10. Why study collocations?
- In NLG (natural language generation)
- the output should be natural
- make a decision, not ?take a decision
- In lexicography
- to identify collocations and list them in a dictionary
- to distinguish the usage of synonyms or near-synonyms
- In parsing
- to give preference to the most natural attachments
- plastic (can opener) vs. (plastic can) opener
- In corpus linguistics and psycholinguistics
- Ex: to study social attitudes towards different types of substances
- strong cigarettes/tea/coffee
- powerful drug
11. A note on (near-)synonymy
- To determine if 2 words are synonyms -- principle of substitutability:
- 2 words are synonyms if they can be substituted for one another in some?/any? sentence without changing the meaning or acceptability of the sentence
- How big/large is this plane?
- Would I be flying on a big/large or small plane?
- Miss Nelson became a kind of big / ??large sister to Tom.
- I think I made a big / ??large mistake.
12. A note on (near-)synonymy (cont)
- True synonyms are rare...
- Whether substitution works depends on:
- shades of meaning
- words may share a central core meaning but have different sense accents
- register / social factors
- speaking to a 4-yr-old vs. to graduate students!
- collocations
- conventional ways of saying something / fixed expressions
13. Approaches to finding collocations
- Frequency
- Mean and Variance
- Hypothesis Testing
- t-test
- χ²-test
- Mutual Information
14. Approaches to finding collocations
- → Frequency
- Mean and Variance
- Hypothesis Testing
- t-test
- χ²-test
- Mutual Information
15. Frequency
- (Justeson & Katz, 1995)
- Hypothesis:
- if 2 words occur together very often, they are interesting candidates for a collocation
- Method:
- select the most frequently occurring bigrams (sequences of 2 adjacent words)
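This counting step can be sketched in a few lines of Python; the one-sentence corpus below is a made-up example:

```python
from collections import Counter

def most_frequent_bigrams(tokens, n=5):
    """Count all pairs of adjacent words and return the n most frequent."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    return bigrams.most_common(n)

# Toy corpus (hypothetical): note that function-word pairs dominate,
# which is exactly the problem discussed on the next slide.
text = "the dog sat on the mat and the cat sat on the mat"
tokens = text.split()
print(most_frequent_bigrams(tokens, 3))
```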
16. Results
- Not very interesting
- Except for New York, all bigrams are pairs of function words
- So, let's pass the results through a part-of-speech filter
17. Frequency + POS filter
- Simple method that works very well
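A minimal sketch of such a filter: keep only bigrams whose tag sequence matches a candidate pattern (adjective-noun, noun-noun, etc., in the spirit of the Justeson & Katz patterns; the tagged bigrams and simplified tags below are made-up examples):

```python
# Bigram tag patterns that may form collocations (assumption: a simplified
# subset of the Justeson & Katz patterns; A = adjective, N = noun).
PATTERNS = {("A", "N"), ("N", "N")}

def pos_filter(tagged_bigrams):
    """Keep only bigrams whose tag sequence matches a collocation pattern."""
    return [(w1, w2) for (w1, t1), (w2, t2) in tagged_bigrams
            if (t1, t2) in PATTERNS]

# Hypothetical tagged bigrams: ((word, tag), (word, tag))
bigrams = [(("of", "P"), ("the", "Det")),
           (("New", "A"), ("York", "N")),
           (("oil", "N"), ("filter", "N"))]
print(pos_filter(bigrams))  # function-word pairs are dropped
```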
18. Strong versus powerful
- On a 14 million word corpus from the New York Times (Aug.-Nov. 1990)
19. Frequency: Conclusion
- Advantages:
- works well for fixed phrases
- simple method, accurate results
- requires little linguistic knowledge
- But many collocations consist of two words in more flexible relationships:
- she knocked on his door
- they knocked at the door
- 100 women knocked on Donaldson's door
- a man knocked on the metal front door
20. Approaches to finding collocations
- Frequency
- → Mean and Variance
- Hypothesis Testing
- t-test
- χ²-test
- Mutual Information
21. Mean and Variance
- (Smadja et al., 1993)
- Looks at the distribution of distances between two words in a corpus
- looking for pairs of words with low variance
- a low variance means that the two words usually occur at about the same distance
- low variance → good candidate for collocation
- Needs a collocational window to capture collocations of variable distances
22. Collocational Window
- This is an example of a three word window.
- To capture 2-word collocations:
- this is, this an
- is an, is example
- an example, an of
- example of, example a
- of a, of three
- a three, a word
- three word, three window
- word window
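The pair extraction above can be sketched as follows (a three-word window pairs each word with the next one and two words):

```python
def window_pairs(tokens, window=3):
    """All ordered word pairs whose distance is less than the window size
    (window=3 pairs each word with the next 1 and 2 words)."""
    pairs = []
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((w, tokens[j]))
    return pairs

tokens = "this is an example of a three word window".split()
print(window_pairs(tokens))  # the 15 pairs listed on this slide
```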
23. Mean and Variance (cont)
- The mean d̄ is the average offset (signed distance) between two words in a corpus:
  d̄ = (1/n) Σ dᵢ
- The variance measures how much the individual offsets deviate from the mean:
  s² = Σ (dᵢ − d̄)² / (n − 1)
- n is the number of times the two words (the two candidates) co-occur
- dᵢ is the offset of the ith pair of candidates
- d̄ is the mean offset of all pairs of candidates
- If the offsets dᵢ are the same in all co-occurrences:
- → variance is zero
- → definitely a collocation
- If the offsets dᵢ are randomly distributed:
- → variance is high
- → not a collocation
24. An Example
- window size 11 around knock (5 left, 5 right)
- she knocked on his door
- they knocked at the door
- 100 women knocked on Donaldson's door
- a man knocked on the metal front door
- Mean d̄
- Std. deviation s
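The offsets, mean, and deviation for the four sentences above can be computed directly (a minimal sketch using the sample standard deviation, i.e. dividing by n − 1):

```python
from statistics import mean, stdev

sentences = [
    "she knocked on his door",
    "they knocked at the door",
    "100 women knocked on Donaldsons door",
    "a man knocked on the metal front door",
]

# Signed offset of "door" relative to "knocked" in each sentence
offsets = []
for s in sentences:
    toks = s.split()
    offsets.append(toks.index("door") - toks.index("knocked"))

print(offsets)        # [3, 3, 3, 5]
print(mean(offsets))  # 3.5
print(stdev(offsets)) # 1.0 (sample standard deviation)
```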
25. Position histograms
- strong opposition
- variance is low
- → interesting collocation
- strong support
- strong for
- variance is high
- → not interesting collocations
26Mean and variance versus Frequency
std. dev. 0 mean offset 1 --gt would be found
by frequency method
std. dev. 0 high mean offset --gt very
interesting, but would not be found by frequency
method
high deviation --gt not interesting
27. Mean & Variance: Conclusion
- good for finding collocations that have:
- a looser relationship between words
- intervening material and variable relative position
28. Approaches to finding collocations
- Frequency
- Mean and Variance
- → Hypothesis Testing
- t-test
- χ²-test
- Mutual Information
29. Hypothesis Testing
- If 2 words are frequent, they will frequently occur together even by chance
- frequent bigrams and low variance can be accidental (two words can co-occur by chance)
- We want to determine whether the co-occurrence is random or whether it occurs more often than chance
- This is a classical problem in statistics called hypothesis testing
- When two words co-occur, hypothesis testing measures how confident we can be that the co-occurrence was or was not due to chance
30. Hypothesis Testing (cont)
- We formulate a null hypothesis H0
- H0: no real association (just chance)
- H0 states what should be true if two words do not form a collocation
- if 2 words w1 and w2 do not form a collocation, then w1 and w2 occur independently of each other
- We need a statistical test that tells us how probable or improbable it is that a certain combination occurs
- Statistical tests:
- t-test
- χ²-test
31. Approaches to finding collocations
- Frequency
- Mean and Variance
- Hypothesis Testing
- → t-test
- χ²-test
- Mutual Information
32. Hypothesis Testing: the t-test
- (or Student's t-test)
- H0 states that P(w1 w2) = P(w1) P(w2)
- We calculate the probability (p-value) that the collocation would occur if H0 were true
- If the p-value is too low, we reject H0
- typically if it is under a significance level of p < 0.05, 0.01, or 0.001
- Otherwise, we retain H0 as possible
33. Some intuition
- Assume we want to compare the heights of men and women
- H0: women and men are equally tall, on average
- We gather data from 10 men and 10 women
- we cannot measure the height of every adult
- so we take a sample of the population
- and make inferences about the whole population
- by comparing the sample means and the variation of each mean
34. Some intuition (cont)
- The t-test compares:
- the sample mean (computed from observed values)
- to an expected mean
- and determines the likelihood (p-value) that the difference between the 2 means occurs by chance
- a p-value close to 1 → it is very likely that the expected and sample means are the same
- a small p-value (e.g., 0.01) → it is unlikely (only a 1 in 100 chance) that such a difference would occur by chance
- so the lower the p-value → the more certain we are that there is a significant difference between the observed and expected means, and we reject H0
35. Some intuition (cont)
- The t-test assigns a probability that describes the likelihood that the null hypothesis is true
- high p-value → accept H0
- low p-value → reject H0
- Critical value c: the value of t beyond which we decide to reject H0
- Confidence level α: the probability that the t-score > critical value c
- [Figure: t distributions (1-tailed and 2-tailed), frequency vs. value of t, with the rejection region(s) beyond the critical value c]
36. Some intuition (cont)
- Compute the t score
- Consult the table of critical values with df = 18 (10 + 10 − 2)
- If t > the critical value (the value in the table), then the 2 samples are significantly different at the probability level listed
- Assume t = 2.7
- if there is no difference in height between women and men (H0 is true), then the probability of finding t = 2.7 is between 0.025 and 0.01
- that's not much
- so we reject the null hypothesis H0
- and conclude that there is a difference in height between men and women
- [Probability table based on the t distribution (2-tailed test)]
37. The t-Test
- looks at the mean and variance of a sample of measurements
- the null hypothesis is that the sample is drawn from a distribution with mean μ
- The test:
- looks at the difference between the observed and expected means, scaled by the variance of the data
- tells us how likely one is to get a sample of that mean and variance
- assuming that the sample is drawn from a normal distribution with mean μ
38The t-Statistic
Difference between the observed mean and the
expected mean
is the sample mean ? is the expected mean of
the distribution s2 is the sample variance N is
the sample size
- the higher the value of t, the greater the
confidence that - there is a significant difference
- its not due to chance
- the 2 words are not independent
39. t-Test for finding Collocations
- We think of a corpus of N words as a long sequence of N bigrams
- the samples are seen as random variables that:
- take the value 1 when the bigram of interest occurs
- take the value 0 otherwise
40. t-Test: Example with collocations
- In a corpus:
- new occurs 15,828 times
- companies occurs 4,675 times
- new companies occurs 8 times
- there are 14,307,668 tokens overall
- Is new companies a collocation?
- Null hypothesis (independence assumption):
- P(new companies) = P(new) P(companies)
41. Example (cont)
- If the null hypothesis is true, then:
- if we randomly generate bigrams of words
- assign 1 to the outcome new companies
- assign 0 to any other outcome
- this is in effect a Bernoulli trial
- then the probability of having new companies is expected to be 3.615 × 10⁻⁷
- So the expected mean is μ = 3.615 × 10⁻⁷
- The variance is s² = p(1 − p) ≈ p, since for most bigrams p is small
- in a binomial distribution s² = np(1 − p), but here n = 1
42. Example (cont)
- But we counted 8 occurrences of the bigram new companies
- So the observed mean is x̄ = 8 / 14,307,668 ≈ 5.591 × 10⁻⁷
- By applying the t-test, we get t ≈ 1
- With a confidence level α = 0.005, the critical value is 2.576 (t should be at least 2.576)
- Since t ≈ 1 < 2.576:
- we cannot reject H0
- so we cannot claim that new and companies form a collocation
43. t-test: Some results
- t-test applied to 10 bigrams that all occur with frequency = 20
- Notes:
- the frequency-based method could not have seen the difference between these bigrams, because they all have the same frequency
- the t-test takes into account the frequency of a bigram relative to the frequencies of its component words
- if a high proportion of the occurrences of both words occur in the bigram, then its t value is high
- The t-test is mostly used to rank collocations
- bigrams that fail the t-test (t < 2.576):
- we cannot reject the null hypothesis
- so they do not form a collocation
- bigrams that pass the t-test (t > 2.576):
- we can reject the null hypothesis
- so they form a collocation
44. Hypothesis testing of differences
- Used to see if 2 words (near-synonyms) are used in the same context or not
- strong vs. powerful
- can be useful in lexicography
- We want to test:
- if there is a difference between 2 populations
- Ex: height of women / height of men
- the null hypothesis is that there is no difference
- i.e. the average difference is 0 (μ = 0)
t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
- x̄1 is the sample mean of population 1
- x̄2 is the sample mean of population 2
- s1² is the sample variance of population 1
- s2² is the sample variance of population 2
- n1 is the sample size of population 1
- n2 is the sample size of population 2
45. Difference test example
- Is there a difference in how we use powerful and how we use strong?
46. Approaches to finding collocations
- Frequency
- Mean and Variance
- Hypothesis Testing
- t-test
- → χ²-test
- Mutual Information
47. Hypothesis testing: the χ²-test
- A problem with the t-test is that it assumes that probabilities are approximately normally distributed
- the χ²-test does not make this assumption
- The essence of the χ²-test is the same as the t-test:
- compare the observed frequencies with the frequencies expected under independence
- if the difference is large
- then we can reject the null hypothesis of independence
48?2-test
- In its simplest form, it is applied to a 2x2
table of observed frequencies - The ?2 statistic
- sums the differences between observed frequencies
(in the table) - and expected values for independence
- scaled by the magnitude of the expected values
49. χ²-test: Example
- Observed frequencies Obsij
50. χ²-test: Example (cont)
- Expected frequencies Expij under independence:
- computed from the marginal probabilities (the totals of the rows and columns converted into proportions)
- Ex: the expected frequency for cell (1,1) (new companies) is:
- the marginal probability of new occurring as the first part of a bigram, times the marginal probability of companies occurring as the second part of a bigram, times the total number of bigrams
- If new and companies occurred completely independently of each other:
- we would expect 5.17 occurrences of new companies on average
51. χ²-test: Example (cont)
- But is the difference significant?
- df in an r×c table = (r−1)(c−1) = (2−1)(2−1) = 1 (degrees of freedom)
- At a probability level of α = 0.05, the critical value is 3.84
- Since χ² = 1.55 < 3.84:
- we cannot reject H0 (that new and companies occur independently of each other)
- So new companies is not a good candidate for a collocation
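The χ² value for the new/companies table can be computed with the standard closed form for a 2x2 table (the four cell counts are derived from the corpus counts on slide 40):

```python
def chi2_2x2(o11, o12, o21, o22):
    """Chi-squared for a 2x2 contingency table, closed form:
    chi2 = N * (O11*O22 - O12*O21)^2 / (R1 * R2 * C1 * C2)."""
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o21 + o22) * (o11 + o21) * (o12 + o22)
    return num / den

# Cells: (new companies), (new, not companies), (not new, companies),
# (not new, not companies), from 8 / 15,828 / 4,675 / 14,307,668
chi2 = chi2_2x2(8, 15_828 - 8, 4_675 - 8,
                14_307_668 - 15_828 - 4_675 + 8)
print(chi2)  # ~1.55 < 3.84, so independence is not rejected
```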
52. χ²-test: Conclusion
- The differences between the t statistic and the χ² statistic do not seem to be large
- But:
- the χ²-test is appropriate for large probabilities
- where the t-test fails because of its normality assumption
- the χ²-test is not appropriate with sparse data (if the numbers in the 2-by-2 table are small)
- The χ²-test has been applied to a wider range of problems:
- machine translation
- corpus similarity
53. χ²-test for machine translation
- (Church & Gale, 1991)
- To identify translation word pairs in aligned corpora
- Ex: number of aligned sentence pairs containing cow in English and vache in French:

Observed frequency | cow | ¬cow | TOTAL
vache | 59 | 6 | 65
¬vache | 8 | 570,934 | 570,942
TOTAL | 67 | 570,940 | 571,007

- χ² = 456,400 >> 3.84 (with α = 0.05)
- So vache and cow are not independent, and so are translations of each other
54. χ²-test for corpus similarity
- (Kilgarriff & Rose, 1998)
- Ex:
- compute χ² for the 2 populations (corpus 1 and corpus 2)
- H0: the 2 corpora have the same word distribution

Observed frequency | Corpus 1 | Corpus 2 | Ratio
Word1 | 60 | 9 | 60/9 = 6.7
Word2 | 500 | 76 | 6.6
Word3 | 124 | 20 | 6.2
... | | |
Word500 | | |
55. Collocations across corpora
- Ratios of relative frequencies between two or more different corpora
- can be used to discover collocations that are characteristic of one corpus when compared to another corpus
56. Collocations across corpora (cont)
- Most useful for the discovery of subject-specific collocations
- compare a general text with a subject-specific text
- words and phrases that (on a relative basis) occur most often in the subject-specific text are likely to be part of the vocabulary that is specific to the domain
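The ratio computation can be sketched as follows; the counts mirror the table on slide 54, while the corpus sizes are made-up assumptions for illustration:

```python
def relative_frequency_ratios(counts1, size1, counts2, size2):
    """Ratio of relative frequencies of each word in two corpora,
    sorted so corpus-1-specific words come first."""
    ratios = {w: (counts1[w] / size1) / (counts2[w] / size2)
              for w in counts1 if counts2.get(w, 0) > 0}
    return sorted(ratios.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical counts (as on slide 54) and hypothetical corpus sizes
counts1 = {"word1": 60, "word2": 500, "word3": 124}
counts2 = {"word1": 9, "word2": 76, "word3": 20}
print(relative_frequency_ratios(counts1, 1_000_000, counts2, 1_000_000))
```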
57. Approaches to finding collocations
- Frequency
- Mean and Variance
- Hypothesis Testing
- t-test
- χ²-test
- → Mutual Information
58. Pointwise Mutual Information
- Uses a measure from information theory
- The pointwise mutual information between 2 events x and y (in our case, the occurrence of 2 words) is roughly:
- a measure of how much one event (word) tells us about the other
- or a measure of the independence of 2 events (or 2 words):
I(x, y) = log2 [ P(x, y) / (P(x) P(y)) ]
- If the 2 events x and y are independent, then I(x, y) = 0
59. Example
- Assume:
- c(Ayatollah) = 42
- c(Ruhollah) = 20
- c(Ayatollah Ruhollah) = 20
- N = 14,307,668
- Then I(Ayatollah, Ruhollah) = log2 [ (20/N) / ((42/N)(20/N)) ] ≈ 18.38
- So? The information we have about the occurrence of Ayatollah at position i increases by 18.38 bits if Ruhollah occurs at position i+1
- PMI works particularly badly with sparse data
60. Pointwise Mutual Information (cont)
- With pointwise mutual information
- With the t-test (see slide 43)
- Same ranking as the t-test
61. Pointwise Mutual Information (cont)
- A good measure of independence
- values close to 0 → independence
- A bad measure of dependence
- because the score depends on frequency
- all things being equal, bigrams of low-frequency words will receive a higher score than bigrams of high-frequency words
- so sometimes we use C(w1 w2) · I(w1, w2) instead
62. Automatic vs manual detection of collocations
- Manual detection finds a wider variety of grammatical patterns
- Ex: in the BBI combinatory dictionary of English
- The quality of manually detected collocations is better than that of computer-generated ones
- But manual detection is slow and requires expertise

strength | power
to build up ~ | to assume ~
to find ~ | emergency ~
to save ~ | discretionary ~
to sap somebody's ~ | fire ~
brute ~ | supernatural ~
tensile ~ | to turn off the ~
the ~ to do X | the ~ to do X