LIN 3098 Corpus Linguistics Lecture 6

1
LIN 3098 Corpus Linguistics: Lecture 6
  • Albert Gatt

2
In this lecture
  • More on corpora for lexicography
  • collocations as a window on lexical semantics
  • uses of collocations
  • distinguishing near-synonyms
  • cross-register variation
  • case study: synonymy and the contextual view of
    meaning

3
Part 1
  • What is a collocation?

4
The empiricist tradition in lexical semantics
  • Main exponent: Firth (1957)
  • Fundamental position:
  • the meaning of words is best discovered through
    an analysis of the context in which they occur
  • Contrast with more traditional, rationalist
    approaches:
  • meaning is usually defined in terms of concepts
    or features
  • words can be distinguished based on distinctions
    among their features

5
Collocations and collocational strength
  • Example 1: Adjective-noun combinations
  • large number vs. big number
  • large distinction vs. big distinction
  • Why are large and big not equally acceptable with
    different nouns?
  • Example 2: Noun compounds
  • computer scientist
  • computer terminal
  • computer desk
  • Are these compounds equally well-established?

6
Uses of collocations
  • Collocations can tell us something about:
  • distinctions in word meaning between apparently
    synonymous words
  • whether certain expressions should be considered
    as frozen or nearly so
  • We should view such phrases as falling on a
    continuum:
  • one extreme: simple syntactic combination (kick
    the door)
  • other extreme: fully frozen idiomatic expressions
    (kick the bucket)
  • plenty of intermediate cases

7
Properties of collocations
  1. Frequency and regularity
  2. Textual proximity
  3. Limited compositionality
  4. Non-substitutability
  5. Non-modifiability
  6. Category restrictions

8
Frequency and regularity
  • We know that language is regular (non-random) and
    rule-based.
  • this aspect is emphasised by rationalist
    approaches to grammar
  • We also need to acknowledge that frequency of
    usage is an important factor in language
    development.
  • why do big and large collocate differently with
    different nouns?

9
Regularity/frequency
  • f(strong tea) > f(powerful tea)
  • f(credit card) > f(credit bankruptcy)
  • f(white wine) > f(yellow wine)
  • (even though white wine is actually yellowish)

10
Narrow window (textual proximity)
  • Usually, we specify an n-gram window within which
    to analyse collocations
  • bigram: credit card, credit crunch
  • trigram: credit card fraud, credit card expiry
  • The idea is to look at co-occurrence of words
    within a specific n-gram window
  • We can also count n-grams with intervening words
  • federal (.) subsidy
  • matches: federal farm subsidy, federal
    manufacturing subsidy
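
The two ideas on this slide can be sketched in a few lines of Python: counting contiguous n-grams, and matching a pattern with exactly one intervening word. This is only an illustration. The toy text and the helper name ngrams are assumptions made for the example, and the gap is written as \w+ (one word), which is one possible reading of the (.) placeholder above.

```python
import re
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list as tuples."""
    return list(zip(*(tokens[i:] for i in range(n))))

# Toy text, invented for illustration only (not corpus data).
text = ("the federal farm subsidy and the federal manufacturing subsidy "
        "were cut after the credit card fraud and the credit crunch")
tokens = text.split()

# Frequency of every bigram and trigram in the toy text.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# 'federal', exactly one intervening word, then 'subsidy'.
gapped = re.findall(r"\bfederal (\w+) subsidy\b", text)

print(bigram_counts[("credit", "card")])
print(trigram_counts[("credit", "card", "fraud")])
print(gapped)
```

On a real corpus the same counts would be taken over tokenised (and ideally lemmatised) text rather than a whitespace split.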

11
Textual proximity (continued)
  • Usually collocates of a word occur close to that
    word.
  • but they may still occur across a wider span
  • Examples:
  • bigram: white wine, powerful tea
  • > bigram: knock on the door, knock on X's door

12
Non-compositionality
  • white wine
  • not really white; meaning not fully predictable
    from component words + syntax
  • signal interpretation
  • a term used in Intelligent Signal Processing;
    connotations go beyond compositional meaning
  • Similarly
  • regression coefficient
  • good practice guidelines
  • Extreme cases
  • idioms such as kick the bucket
  • meaning is completely frozen

13
Non-substitutability
  • If a phrase is a collocation, we can't replace
    a word in the phrase with a near-synonym and
    still have the same overall meaning.
  • E.g.
  • white wine vs. yellow wine
  • powerful tea vs. strong tea

14
Non-modifiability
  • Often, there are restrictions on inserting
    additional lexical items into the collocation,
    especially in the case of idioms.
  • Example
  • kick the bucket vs. ?kick the large bucket
  • NB
  • this is a matter of degree!
  • non-idiomatic collocations are more flexible

15
Category restrictions
  • Frequency alone doesn't indicate collocational
    strength
  • "by the" is a very frequent phrase in English
  • not a collocation
  • Collocations tend to be formed from content
    words
  • A+N: powerful tea
  • N+N: regression coefficient, mass demonstration
  • N+PREP+N: degrees of freedom

16
Part 2
  • Distinguishing near-synonyms: a case study (from
    Biber et al 1993)

17
Near-synonyms
  • What's the difference between:
  • big, large, great
  • A traditional dictionary (OED online):
  • large adj. of considerable or relatively great
    size, extent, or capacity
  • big adj. of considerable size, physical power, or
    extent
  • great adj. of an extent, amount, or intensity
    considerably above average
  • Is this informative enough?

18
The frequency of the adjectives
(Longman-Lancaster corpus)
  • overall (5.7m words): large >> great >> big
  • f(big) = 1,319
  • f(large) = 2,342
  • f(great) = 2,254
  • academic prose (2.7m words): large >> great >> big
  • f(big) = 84
  • f(large) = 1,641
  • f(great) = 772
  • fiction (3m words): great >> big >> large
  • f(big) = 1,235
  • f(large) = 701
  • f(great) = 1,482
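
Because the two sub-corpora differ in size, raw counts are easier to compare once they are normalised to occurrences per million words. The sketch below does this for the counts on this slide. The sub-corpus sizes are the rounded figures given above, so the resulting rates are approximate; the function name per_million is just an illustrative choice.

```python
# Raw counts and (rounded) sub-corpus sizes copied from the slide.
corpus_size = {"academic prose": 2_700_000, "fiction": 3_000_000}
raw_freq = {
    "academic prose": {"big": 84, "large": 1641, "great": 772},
    "fiction": {"big": 1235, "large": 701, "great": 1482},
}

def per_million(count, size):
    """Normalise a raw count to occurrences per million words."""
    return count / size * 1_000_000

normalised = {
    register: {word: round(per_million(count, corpus_size[register]), 1)
               for word, count in counts.items()}
    for register, counts in raw_freq.items()
}

for register, freqs in normalised.items():
    # Rank the adjectives from most to least frequent in this register.
    ranking = sorted(freqs, key=freqs.get, reverse=True)
    print(register, freqs, " >> ".join(ranking))
```

The per-million rates are directly comparable with the collocate rates (e.g. 48.3 / m) reported on the following slides.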
19
Immediate right collocates: big
  • academic prose
  • big enough (2.2 / m), big traders (1.1 / m)
  • fiction
  • big man (9.6 / m), big enough (8.9 / m)
  • big house (7.6 / m)
  • Seems to be used for the physical size of an object,
    person, or organisation
  • big enough is usually used for size as well
  • the house is big enough
  • Also occurs often with descriptive adjectives:
    big black X etc.

20
Immediate right collocates: large
  • Academic prose
  • large number (48.3/m)
  • large numbers (31.3/m)
  • large scale (29.4/m)
  • Fiction
  • large black (4.3 / m)
  • large enough (3.6 / m)
  • large room (2.7 / m)
  • large number (2.3 / m)
  • Used more often than big for quantities or
    proportions.
  • large enough is usually used for such quantities
    too
  • the ratio is large enough

Note: lemmatisation would allow us to combine large number and large numbers.
21
Immediate right collocates: great
  • academic prose
  • great deal (44.6 / m), great importance (12.6 / m)
  • great variety (7.0 / m), great detail (2.6 / m)
  • fiction
  • great deal (40.4 / m), great man (6.6 / m)
  • In academic prose, mostly used for amount or
    quantity. Rather like large, but also occurs with
    deal.
  • Great is also used for intensity: great importance,
    great care
  • In fiction, mostly used for amounts:
  • a great deal of apple juice

22
Salient differences
  • This is a very brief overview of uses and senses
    of the three adjectives.
  • It helps explain the different frequency
    distribution across registers
  • fiction often contains physical descriptions
    (thus, big is more frequent than in academic
    prose)
  • academic prose is more often concerned with
    proportions, amounts, quantities (thus, large and
    great are far more frequent than big here)

23
Widening the window
  • Two words can co-occur regularly even with a few
    words between them
  • academic prose
  • large X of, large X in
  • large X open, large X that
  • fiction
  • large X of, large X and
  • large X in, large X eyes

24
Widening the window - II
  • The most frequent collocate of large in a
    three-word window is of.
  • What nouns intervene between large and of?
  • large amounts of, large numbers of
  • again, typically quantities or proportions
  • Large X eyes is very frequent in fiction (not
    academic prose)
  • his large hazel eyes
  • confirms earlier conclusion that fiction has more
    physical descriptions
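
A small Python sketch of the question asked on this slide: which words X fill the frame large X of (or large X eyes)? The sentences are invented for illustration and the function name fillers is hypothetical; a real study would run the same count over corpus text, one register at a time.

```python
from collections import Counter

def fillers(tokens, left, right):
    """Count the words X that occur in the three-word frame: left X right."""
    return Counter(
        x for l, x, r in zip(tokens, tokens[1:], tokens[2:])
        if l == left and r == right
    )

# Invented toy text, not corpus data.
text = ("large numbers of students read large amounts of text while "
        "large numbers of visitors admired his large hazel eyes")
tokens = text.split()

print(fillers(tokens, "large", "of").most_common())
print(fillers(tokens, "large", "eyes").most_common())
```

Sorting the fillers by frequency gives the kind of list shown above (large amounts of, large numbers of).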

25
Interim summary
  • This brief overview shows that:
  • collocations help to distinguish between
    near-synonyms
  • they can also help to discover patterns of variation
    across registers

26
Part 3
  • The contextual theory of synonymy and similarity
  • Corpus-based and psycholinguistic evidence

27
Synonymy
  • Different phonological words with highly related
    meanings
  • sofa / couch
  • boy / lad
  • żgħir (small) / ċkejken (little)
  • Traditional definition:
  • w1 is synonymous with w2 if w1 can replace w2 in
    a sentence, salva veritate (i.e. without changing
    its truth value)
  • Is this ever the case? Can we substitute one word
    for another and keep the meaning of the sentence
    identical?

28
Imperfect synonymy
  • Synonyms often exhibit slight differences,
    especially in connotations
  • żgħir (small) is fairly neutral with respect to
    the thing spoken of
  • ċkejken (small/little) might be used for a
    little child, but not a teenager
  • may carry connotations of dependence, cuteness,
    etc.

29
The importance of register
  • With near-synonyms, there are often
    register-governed conditions of use.
  • E.g. naive vs gullible vs ingenuous
  • gullible / naive seem critical, or even offensive
  • ingenuous more likely in a formal context

30
Synonymy vs. Similarity
  • The contextual theory of synonymy
  • based on the work of Wittgenstein (1953), and
    Firth (1957)
  • "You shall know a word by the company it keeps"
    (Firth 1957)
  • Under this view, perfect synonyms might not
    exist.
  • But words can be judged as highly similar if
    people put them into the same linguistic
    contexts, and judge the change to be slight.

31
Synonymy vs. similarity example
  • Miller & Charles (1991)
  • Weak contextual hypothesis: The similarity of the
    contexts in which two words appear contributes to
    the semantic similarity of those words.
  • E.g. snake is similar to (or a synonym of)
    serpent to the extent that we find snake and
    serpent in the same linguistic contexts.
  • It is much more likely that snake/serpent will
    occur in similar contexts than snake/toad
  • NB: this is not a discrete notion of synonymy,
    but a continuous definition of similarity
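
One common way to make "similarity of contexts" concrete is to represent each word by the counts of its neighbouring words and compare the resulting vectors with cosine similarity. The sketch below is an illustration of that general idea, not the procedure used by Miller & Charles. The sentences are invented, and the names context_vector and cosine and the window size of 3 are assumptions of the example.

```python
import math
from collections import Counter

def context_vector(target, sentences, window=3):
    """Count the words within `window` tokens of each occurrence of target."""
    vec = Counter()
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                # Add every neighbour in the window except the target itself.
                vec.update(t for j, t in enumerate(toks[lo:hi], lo) if j != i)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

# Invented toy sentences, not corpus data.
sentences = [
    "the snake slid through the grass",
    "the serpent slid through the grass",
    "a snake bit the farmer",
    "a serpent bit the hero",
    "the toad sat on a stone",
    "a toad croaked in the pond",
]
snake, serpent, toad = (context_vector(w, sentences)
                        for w in ("snake", "serpent", "toad"))

print("snake/serpent:", round(cosine(snake, serpent), 3))
print("snake/toad:   ", round(cosine(snake, toad), 3))
```

The score is continuous rather than yes/no, in line with the NB above. Note that frequent function words (the, a) inflate every pair's similarity, so practical systems usually down-weight or filter them.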

32
The Miller/Charles experiment
  • Subjects were given sentences with missing words
    and asked to place words they felt were OK in each
    context.
  • Method to compare words A and B:
  • find sentences containing A
  • find sentences containing B
  • delete A and B from sentences and shuffle them
  • ask people to choose which sentences to place A
    and B in.
  • Showed that people will put similar words in the
    same context, and this is highly correlated with
    occurrence in similar contexts in corpora.

33
Issues with similarity
  • Similar is a much broader concept than
    synonymous
  • Contextually related, though differing in
    meaning
  • man / woman
  • boy / girl
  • master / pupil
  • Contextually related, but with opposite
    meanings
  • big / small
  • clever / stupid

34
Part 4
  • Bonus Topic: Mutual Information for ranking
    collocations

35
General idea
  • Suppose we identify several multiword units in a
    corpus (e.g. several N-N compounds).
  • We would like to know to what extent the words
    making them up are strongly collocated.
  • Could be that these words occur together purely
    by chance.

36
An analogy
  • Suppose Tom and Viv are an item: they never turn up
    anywhere unless they're together.
  • From your point of view:
  • Seeing Tom increases your certainty (your
    information) that Viv is around.
  • Seeing Viv does the same with respect to Tom.
  • Your assumptions would be very different if:
  • you knew that Tom and Viv had never been able to
    stand each other
  • or you only knew them separately, and had no idea
    they had a relationship

37
The reasoning (I)
  • Example: collocations involving post
  • A search through a corpus throws up lots of
    co-occurring words, e.g.
  • the post
  • post in
  • post office
  • post mortem
  • We don't want to call all these collocations.
  • E.g. the is extremely frequent, and this is
    probably why it occurs very frequently with post.
  • (remember Zipf's law)

38
The reasoning (II)
  • Suppose we suspect a strong relationship between
    post and mortem.
  • There are two possibilities:
  • post and mortem just co-occur randomly, so they're
    no more likely to occur together than chance
    predicts.
  • post mortem is indeed a collocation, so finding
    mortem should increase our certainty that we'll
    also find post in its immediate environment.
  • thus, the two words have high mutual information
  • Given the word w, the mutual information score
    tells us how much our certainty increases that
    post is in the vicinity.

39
Mutual information: post mortem
  • compute the frequency of post, mortem and post
    mortem.
  • Denoted f(post), f(mortem), f(post mortem)
  • compute the probability of these
  • this is just the frequency divided by the corpus
    size N: f(post)/N etc
  • a better basis for comparison than raw frequency,
    because it is proportional to corpus size
  • we denote these p(post), p(mortem) etc

40
Mutual information: post mortem
  • Compare the probability of finding post mortem
    to the probability of finding either word on its
    own:

$$\frac{p(\text{post mortem})}{p(\text{post}) \cdot p(\text{mortem})}$$

Numerator: probability of finding the two words together
within a certain window.
Denominator: probability of the two words occurring
independently.
41
Mutual information: post mortem
  1. Finally, we turn this probability ratio into a
    measure of information.
  2. Information is measured in bits.
  3. A probability estimate is turned into an
    information value by taking its log (usually to
    base 2).

$$MI(\text{post}, \text{mortem}) = \log_2 \frac{p(\text{post mortem})}{p(\text{post}) \cdot p(\text{mortem})}$$

The amount of information about post increases by
this amount if we know that there is an
accompanying word mortem.
42
Interpreting MI
  • If MI is positive, and reasonably high (usually 2
    or higher), then the two words are strongly
    collocated.
  • If MI is negative, then the two words are
    actually unlikely to occur together.
  • If MI is approximately zero, then the two words
    tend to occur independently.
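
The calculation described on the last few slides can be sketched as follows. The measure computed here, the log (base 2) of the ratio between the joint probability and the product of the individual probabilities, is usually called pointwise mutual information. All counts below are hypothetical, chosen only to illustrate the three cases on this slide; they are not taken from any corpus.

```python
import math

def mutual_information(f_xy, f_x, f_y, n):
    """MI in bits: log2( p(x,y) / (p(x) * p(y)) ), with p = f / n."""
    p_xy = f_xy / n
    p_x = f_x / n
    p_y = f_y / n
    return math.log2(p_xy / (p_x * p_y))

# Hypothetical counts in a corpus of N words (illustration only).
N = 1_000_000
f_post, f_mortem, f_the = 400, 25, 60_000

# 'post mortem': the pair is far more frequent than chance predicts.
mi_post_mortem = mutual_information(f_xy=20, f_x=f_post, f_y=f_mortem, n=N)

# 'the post': frequent, but only because 'the' is frequent everywhere.
mi_the_post = mutual_information(f_xy=30, f_x=f_the, f_y=f_post, n=N)

print(f"MI(post, mortem) = {mi_post_mortem:.2f} bits")
print(f"MI(the, post)    = {mi_the_post:.2f} bits")
```

With these made-up numbers, post mortem scores well above the threshold of 2 mentioned above, while the post scores close to zero even though its raw frequency is higher. One known limitation is that MI estimates become unreliable for very low counts, so a minimum frequency cut-off is often applied first.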

43
Summary
  • This lecture has focused on another use of
    corpora for lexicography
  • dominant paradigm is the empiricist view of
    meaning and language
  • takes a very different approach to issues of
    synonymy than rationalist approaches
  • We have also introduced the concept of Mutual
    Information, as a way of measuring collocational
    strength.