Title: LIN 3098 Corpus Linguistics Lecture 6
2In this lecture
- More on corpora for lexicography
- collocations as a window on lexical semantics
- uses of collocations
- distinguishing near-synonyms
- cross-register variation
- case study: synonymy and the contextual view of meaning
3Part 1
4The empiricist tradition in lexical semantics
- Main exponent: Firth (1957)
- Fundamental position:
- the meaning of words is best discovered through an analysis of the context in which they occur
- Contrast to more traditional, rationalist approaches:
- meaning is usually defined in terms of concepts or features
- words can be distinguished based on distinctions among their features
5Collocations and collocational strength
- Example 1: Adjective-noun combinations
- large number / big number
- large distinction / big distinction
- Why are large and big not equally acceptable with different nouns?
- Example 2: Noun compounds
- computer scientist
- computer terminal
- computer desk
- Are these compounds equally well-established?
6Uses of collocations
- Collocations can tell us something about:
- distinctions in word meaning between apparently synonymous words
- whether certain expressions should be considered as frozen or nearly so
- We should view such phrases as falling on a continuum:
- one extreme: simple, syntactic combination (kick the door)
- other extreme: fully frozen idiomatic expressions (kick the bucket)
- plenty of intermediate cases
7Properties of collocations
- Frequency and regularity
- Textual proximity
- Limited compositionality
- Non-substitutability
- Non-modifiability
- Category restrictions
8Frequency and regularity
- We know that language is regular (non-random) and rule-based.
- this aspect is emphasised by rationalist approaches to grammar
- We also need to acknowledge that frequency of usage is an important factor in language development.
- why do big and large collocate differently with different nouns?
9Regularity/frequency
- f(strong tea) > f(powerful tea)
- f(credit card) > f(credit bankruptcy)
- f(white wine) > f(yellow wine)
- (even though white wine is actually yellowish)
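The frequency comparisons above can be checked mechanically by counting adjacent word pairs. The following is a minimal sketch in Python, assuming a whitespace-tokenised text; the toy sentence and the helper name `bigram_counts` are invented for illustration and are not taken from the lecture.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs (bigrams) in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Toy corpus, invented purely for illustration.
text = (
    "she drank strong tea and he drank strong tea too "
    "but nobody ordered powerful tea with white wine"
)
tokens = text.split()
f = bigram_counts(tokens)

# f(strong tea) versus f(powerful tea) in the toy text
print(f[("strong", "tea")], f[("powerful", "tea")])
```

A `Counter` returns 0 for pairs that never occur, so unattested combinations such as yellow wine can be looked up without special handling.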
10Narrow window (textual proximity)
- Usually, we specify an n-gram window within which to analyse collocations:
- bigram: credit card, credit crunch
- trigram: credit card fraud, credit card expiry
- The idea is to look at co-occurrence of words within a specific n-gram window.
- We can also count n-grams with intervening words:
- federal (.) subsidy
- matches federal farm subsidy, federal manufacturing subsidy
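Matching a pattern such as federal (.) subsidy amounts to fixing the first and last word and allowing a set number of intervening tokens. Below is a minimal sketch, assuming a pre-tokenised text; the function name `gapped_matches`, its `gap` parameter and the example sentence are hypothetical.

```python
def gapped_matches(tokens, first, last, gap=1):
    """Return every span that starts with `first`, ends with `last`,
    and has exactly `gap` intervening tokens."""
    span = gap + 2
    return [
        tuple(tokens[i:i + span])
        for i in range(len(tokens) - span + 1)
        if tokens[i] == first and tokens[i + span - 1] == last
    ]

# Toy text, invented for illustration.
tokens = (
    "the federal farm subsidy and the federal manufacturing subsidy "
    "were cut but the federal subsidy office stayed open"
).split()

# One intervening word: federal (.) subsidy
matches = gapped_matches(tokens, "federal", "subsidy", gap=1)
print(matches)
```

Setting `gap=0` gives the plain bigram case, so the same function covers both adjacent and gapped patterns.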
11Textual proximity (continued)
- Usually collocates of a word occur close to that word.
- may still occur across a span
- Examples:
- bigram: white wine, powerful tea
- > bigram: knock on the door, knock on X's door
12Non-compositionality
- white wine
- not really white; meaning not fully predictable from component words and syntax
- signal interpretation
- a term used in Intelligent Signal Processing; connotations go beyond compositional meaning
- Similarly:
- regression coefficient
- good practice guidelines
- Extreme cases:
- idioms such as kick the bucket
- meaning is completely frozen
13Non-substitutability
- If a phrase is a collocation, we can't replace a word in the phrase with a near-synonym and still have the same overall meaning.
- E.g.:
- white wine vs. yellow wine
- powerful tea vs. strong tea
14Non-modifiability
- Often, there are restrictions on inserting additional lexical items into the collocation, especially in the case of idioms.
- Example:
- kick the bucket vs. ?kick the large bucket
- NB:
- this is a matter of degree!
- non-idiomatic collocations are more flexible
15Category restrictions
- Frequency alone doesn't indicate collocational strength:
- by the is a very frequent phrase in English
- not a collocation
- Collocations tend to be formed from content words:
- A+N: powerful tea
- N+N: regression coefficient, mass demonstration
- N+PREP+N: degrees of freedom
16Part 2
- Distinguishing near-synonyms: a case study (from Biber et al. 1993)
17Near-synonyms
- What's the difference between:
- big, large, great
- A traditional dictionary (OED online):
- large adj.: of considerable or relatively great size, extent, or capacity
- big adj.: of considerable size, physical power, or extent
- great adj.: of an extent, amount, or intensity considerably above average
- Is this informative enough?
18The frequency of the adjectives
(Longman-Lancaster corpus)
- overall (5.7m words): large >> great >> big
- f(big) = 1,319
- f(large) = 2,342
- f(great) = 2,254
- academic prose (2.7m words): large >> great >> big
- f(big) = 84
- f(large) = 1,641
- f(great) = 772
- fiction (3m words): great >> big >> large
- f(big) = 1,235
- f(large) = 701
- f(great) = 1,482
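Because the two sub-corpora differ in size (2.7m vs. 3m words), raw counts are easier to compare once they are normalised to a rate per million words. The sketch below uses only the counts and corpus sizes given on the slide; the variable and function names are illustrative.

```python
# Raw counts and sub-corpus sizes as reported on the slide.
counts = {
    "academic prose": {"big": 84, "large": 1641, "great": 772},
    "fiction": {"big": 1235, "large": 701, "great": 1482},
}
sizes = {"academic prose": 2_700_000, "fiction": 3_000_000}

def per_million(count, corpus_size):
    """Normalise a raw count to a rate per million words."""
    return count / corpus_size * 1_000_000

rates = {
    register: {w: round(per_million(c, sizes[register]), 1) for w, c in words.items()}
    for register, words in counts.items()
}

for register, words in rates.items():
    # Rank the adjectives from most to least frequent in this register.
    ranking = sorted(words, key=words.get, reverse=True)
    print(register, words, " > ".join(ranking))
```

Normalising does not change the ranking within a register, but it makes the cross-register contrast visible: large is far more frequent per million words in academic prose than in fiction, while big shows the opposite pattern.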
19Immediate right collocate big
- academic prose:
- big enough (2.2 / m)
- big traders (1.1 / m)
- fiction:
- big man (9.6 / m)
- big enough (8.9 / m)
- big house (7.6 / m)
- Seems to be used for the physical size of an object, person, or organisation
- big enough is usually used for size as well
- the house is big enough
- Also occurs often with descriptive adjectives: big black X etc.
20Immediate right collocates large
- Academic prose
- large number (48.3/m)
- large numbers (31.3/m)
- large scale (29.4/m)
- Fiction
- large black (4.3 / m)
- large enough (3.6 / m)
- large room (2.7 / m)
- large number (2.3 / m)
- Used more often than big for quantities or proportions.
- large enough is usually used for such quantities too
- the ratio is large enough
- NB: lemmatisation would allow us to combine large number and large numbers
21Immediate right collocates great
- academic prose:
- great deal (44.6 / m)
- great importance (12.6 / m)
- great variety (7.0 / m)
- great detail (2.6 / m)
- fiction:
- great deal (40.4 / m)
- great man (6.6 / m)
- In academic prose, mostly used for amount or quantity. Rather like large, but also occurs with deal.
- Great also used for intensity: great importance, great care
- In fiction, mostly used for amounts
- a great deal of apple juice
22Salient differences
- This is a very brief overview of uses and senses of the three adjectives.
- It helps explain the different frequency distribution across registers:
- fiction often contains physical descriptions (thus, big is more frequent than in academic prose)
- academic prose is more often concerned with proportions, amounts, quantities (thus, large is more frequent here)
23Widening the window
- Two words can co-occur regularly even with a few words between them
- academic prose:
- large X of
- large X in
- large X open
- large X that
- fiction:
- large X of
- large X and
- large X in
- large X eyes
24Widening the window - II
- The most frequent collocate of large in a three-word window is of.
- What nouns intervene between large and of?
- large amounts of, large numbers of
- again, typically quantities or proportions
- Large X eyes is very frequent in fiction (not academic prose)
- his large hazel eyes
- confirms earlier conclusion that fiction has more physical descriptions
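The widened-window analysis can be sketched in two steps: count everything that follows large within the window, then look at which words fill the X slot in large X of. The example below is only an illustration on an invented toy text; it assumes that the "three-word window" includes large itself, so the default is two positions to the right, and the function names are hypothetical.

```python
from collections import Counter

def right_window_collocates(tokens, node, window=2):
    """Count words occurring up to `window` positions to the right of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1:i + 1 + window])
    return counts

def fillers(tokens, node, collocate):
    """Count the words X in the pattern `node X collocate`."""
    return Counter(
        tokens[i + 1]
        for i in range(len(tokens) - 2)
        if tokens[i] == node and tokens[i + 2] == collocate
    )

# Toy text, invented for illustration.
tokens = (
    "large numbers of students and large amounts of data "
    "while his large hazel eyes saw large numbers of birds"
).split()

print(right_window_collocates(tokens, "large").most_common(1))
print(fillers(tokens, "large", "of"))
print(fillers(tokens, "large", "eyes"))
```

On a real corpus the same two functions would be run separately on each register, so that the fillers of large X of and large X eyes can be compared across academic prose and fiction.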
25Interim summary
- This brief overview shows that:
- collocations help to distinguish between near-synonyms
- they can also help to discover patterns of variation across registers
26Part 3
- The contextual theory of synonymy and similarity
- Corpus-based and psycholinguistic evidence
27Synonymy
- Different phonological words with highly related meanings:
- sofa / couch
- boy / lad
- zghir (small) / ckejken (little)
- Traditional definition:
- w1 is synonymous with w2 if w1 can replace w2 in a sentence, salva veritate
- Is this ever the case? Can we replace one word with another and keep the meaning of our sentence identical?
28Imperfect synonymy
- Synonyms often exhibit slight differences, especially in connotations
- zghir (small) is fairly neutral with respect to the thing spoken of
- ckejken (small/little) might be used for a little child, but not a teenager
- may carry connotations of dependence, cuteness, etc.
29The importance of register
- With near-synonyms, there are often register-governed conditions of use.
- E.g. naive vs. gullible vs. ingenuous
- gullible / naive seem critical, or even offensive
- ingenuous more likely in a formal context
30Synonymy vs. Similarity
- The contextual theory of synonymy:
- based on the work of Wittgenstein (1953) and Firth (1957)
- "You shall know a word by the company it keeps" (Firth 1957)
- Under this view, perfect synonyms might not exist.
- But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight.
31Synonymy vs. similarity example
- Miller & Charles (1991)
- Weak contextual hypothesis: the similarity of the contexts in which two words appear contributes to the semantic similarity of those words.
- E.g. snake is similar to (or a synonym of) serpent to the extent that we find snake and serpent in the same linguistic contexts.
- It is much more likely that snake/serpent will occur in similar contexts than snake/toad
- NB: this is not a discrete notion of synonymy, but a continuous definition of similarity
32The Miller/Charles experiment
- Subjects were given sentences with missing words and asked to place words they felt were OK in each context.
- Method to compare words A and B:
- find sentences containing A
- find sentences containing B
- delete A and B from the sentences and shuffle them
- ask people to choose which sentences to place A and B in
- Showed that people will put similar words in the same context, and this is highly correlated with occurrence in similar contexts in corpora.
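The corpus side of this comparison ("occurrence in similar contexts") can be approximated by comparing the words that surround A with the words that surround B. The sketch below is not the Miller and Charles procedure, which relied on human judgements; it is one simple way to quantify contextual overlap, assuming sentence-level contexts and cosine similarity over raw counts. The sentences and function names are invented.

```python
import math
from collections import Counter

def context_vector(sentences, target):
    """Count the words that share a sentence with `target` (target excluded)."""
    vec = Counter()
    for sent in sentences:
        words = sent.split()
        if target in words:
            vec.update(w for w in words if w != target)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# Toy sentences, invented for illustration.
sentences = [
    "the snake slid through the grass",
    "the serpent slid through the grass",
    "a snake bit the farmer",
    "a serpent bit the hero",
    "the toad sat on a wet stone",
]

snake, serpent, toad = (context_vector(sentences, w) for w in ("snake", "serpent", "toad"))
print(cosine(snake, serpent) > cosine(snake, toad))
```

The score is continuous rather than a yes/no verdict, which matches the view of similarity as a matter of degree. Note that function words such as the inflate the overlap; a fuller treatment would weight or filter them.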
33Issues with similarity
- Similar is a much broader concept than synonymous:
- Contextually related, though differing in meaning:
- man / woman
- boy / girl
- master / pupil
- Contextually related, but with opposite meanings:
- big / small
- clever / stupid
34Part 4
- Bonus topic: Mutual Information for ranking collocations
35General idea
- Suppose we identify several multiword units in a corpus (e.g. several N-N compounds).
- We would like to know to what extent the words making them up are strongly collocated.
- It could be that these words occur together purely by chance.
36An analogy
- Suppose Tom and Viv are an item: they never turn up anywhere unless they're together.
- From your point of view:
- Seeing Tom increases your certainty (your information) that Viv is around.
- Seeing Viv does the same with respect to Tom.
- Your assumptions would be very different if:
- you knew that Tom and Viv had never been able to stand each other
- or you only knew them separately, and had no idea they had a relationship
37The reasoning (I)
- Example: collocations involving post
- A search through a corpus throws up lots of co-occurring words, e.g.:
- the post
- post in
- post office
- post mortem
- We don't want to call all these collocations.
- E.g. the is extremely frequent, and this is probably why it occurs very frequently with post.
- (remember Zipf's law)
38The reasoning (II)
- Suppose we suspect a strong relationship between post and mortem.
- There are two possibilities:
- post and mortem just co-occur randomly, so they're as likely to occur together as separately (i.e. no more often than chance would predict).
- post mortem is indeed a collocation, so finding mortem should increase our certainty that we'll also find post in its immediate environment.
- thus, the two words have high mutual information
- Given a word w (here, mortem), the mutual information score tells us how much our certainty increases that post is in the vicinity.
39Mutual information post mortem
- compute the frequency of post, mortem and post mortem
- denoted f(post), f(mortem), f(post mortem)
- compute the probability of these
- this is just the frequency divided by the corpus size N: f(post)/N etc.
- a better basis for comparison than raw frequency, because it is proportional to corpus size
- we denote these p(post), p(mortem) etc.
40Mutual information post mortem
- Compare the probability of finding post mortem to the probability of finding either word on its own:
- ratio = p(post mortem) / (p(post) × p(mortem))
- numerator: probability of finding the two words together within a certain window
- denominator: probability of finding the two words independently
41Mutual information post mortem
- Finally, we turn this probability ratio into a measure of information.
- Information is measured in bits.
- A probability estimate is turned into an information value by taking its log (usually to base 2).
- MI(post, mortem) = log2 [ p(post mortem) / (p(post) × p(mortem)) ]
- The amount of information about post increases by this amount if we know that there is an accompanying word mortem.
42Interpreting MI
- If MI is positive, and reasonably high (usually 2 or higher), then the two words are strongly collocated.
- If MI is negative, then the two words are actually unlikely to occur together.
- If MI is approximately zero, then the two words tend to occur independently.
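The whole calculation, from raw frequencies to an MI score in bits, fits in a few lines. The sketch below follows the steps described in this part of the lecture (this measure is often called pointwise mutual information); all counts are hypothetical numbers chosen for illustration, not figures from a real corpus, and the function assumes the pair occurs at least once, since the log of zero is undefined.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI score (in bits) of a word pair, computed from raw counts.

    f_xy: co-occurrence count; f_x, f_y: counts of each word; n: corpus size.
    Assumes f_xy > 0.
    """
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts in a hypothetical 1,000,000-word corpus.
N = 1_000_000
mi_post_mortem = mutual_information(f_xy=40, f_x=2_000, f_y=50, n=N)
mi_the_post = mutual_information(f_xy=150, f_x=60_000, f_y=2_000, n=N)

print(round(mi_post_mortem, 2))  # well above 2: strongly collocated
print(round(mi_the_post, 2))     # close to 0: roughly what chance predicts
```

With these invented counts, the post co-occurs more often in absolute terms than post mortem (150 vs. 40), yet its MI is near zero because the is so frequent overall. This is the sense in which frequency alone does not indicate collocational strength.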
43Summary
- This lecture has focused on another use of corpora for lexicography:
- dominant paradigm is the empiricist view of meaning and language
- takes a very different approach to issues of synonymy than rationalist approaches
- We have also introduced the concept of Mutual Information, as a way of measuring collocational strength.