1
COMP791A: Statistical Language Processing
Collocations (Chap. 5)
2
A collocation
  • is an expression of 2 or more words that
    correspond to a conventional way of saying
    things
  • broad daylight
  • why not ?bright daylight or ?narrow darkness?
  • big mistake but not ?large mistake
  • collocations overlap with the concepts of
    terms, technical terms, and terminological phrases
  • collocations extracted from technical domains
  • Ex: hydraulic oil filter, file transfer protocol

3
Examples of Collocations
  • strong tea
  • weapons of mass destruction
  • to make up
  • to check in
  • heard it through the grapevine
  • he knocked at the door
  • I made it all up

4
Definition of a collocation
  • (Choueka, 1988)
  • "A collocation is defined as a sequence of two
    or more consecutive words, that has
    characteristics of a syntactic and semantic unit,
    and whose exact and unambiguous meaning or
    connotation cannot be derived directly from the
    meaning or connotation of its components."
  • Criteria:
  • non-compositionality
  • non-substitutability
  • non-modifiability
  • non-translatable word for word

5
Non-Compositionality
  • A phrase is compositional if its meaning can be
    predicted from the meaning of its parts
  • Collocations have limited compositionality
  • there is usually an element of meaning added to
    the combination
  • Ex: strong tea
  • Idioms are the most extreme examples of
    non-compositionality
  • Ex: to hear it through the grapevine

6
Non-Substitutability
  • We cannot substitute near-synonyms for the
    components of a collocation
  • strong is a near-synonym of powerful
  • strong tea but ?powerful tea
  • yellow describes the color of white wine as
    well as white does
  • white wine but ?yellow wine

7
Non-modifiability
  • Many collocations cannot be freely modified with
    additional lexical material or through
    grammatical transformations
  • weapons of mass destruction → ?weapons of
    massive destruction
  • to be fed up to the back teeth → ?to be fed up
    to the teeth in the back

8
Non-translatable (word for word)
  • English
  • make a decision but ?take a decision
  • French
  • ?faire une décision but prendre une décision
    (lit. "take a decision")
  • To test whether a group of words is a
    collocation:
  • translate it into another language
  • if we cannot translate it word for word
  • then it probably is a collocation

9
Linguistic Subclasses of Collocations
  • Phrases with light verbs
  • verbs with little semantic content in the
    collocation
  • make, take, do
  • Verb-particle (phrasal verb) constructions
  • to go down, to check out, ...
  • Proper nouns
  • John Smith
  • Terminological expressions
  • concepts and objects in technical domains
  • hydraulic oil filter

10
Why study collocations?
  • In natural language generation (NLG)
  • the output should be natural
  • make a decision vs. ?take a decision
  • In lexicography
  • identify collocations to list them in a
    dictionary
  • to distinguish the usage of synonyms or
    near-synonyms
  • In parsing
  • to give preference to the most natural attachments
  • plastic (can opener) vs. (plastic can)
    opener
  • In corpus linguistics and psycholinguistics
  • Ex: to study social attitudes towards different
    types of substances
  • strong cigarettes/tea/coffee
  • powerful drug

11
A note on (near-)synonymy
  • To determine if 2 words are synonyms -- the
    Principle of Substitutability
  • 2 words are synonyms if they can be substituted
    for one another in some?/any? sentence without
    changing the meaning or acceptability of the
    sentence
  • How big/large is this plane?
  • Would I be flying on a big/large or small plane?
  • Miss Nelson became a kind of big / ??large
    sister to Tom.
  • I think I made a big / ??large mistake.

12
A note on (near-)synonymy (cont)
  • True synonyms are rare...
  • Substitutability depends on
  • shades of meaning
  • words may share a central core meaning but have
    different sense accents
  • register/social factors
  • speaking to a 4-yr-old vs. to graduate students!
  • collocations
  • conventional ways of saying something / fixed
    expressions

13
Approaches to finding collocations
  • Frequency
  • Mean and Variance
  • Hypothesis Testing
  • t-test
  • χ²-test
  • Mutual Information

14
Approaches to finding collocations
  • → Frequency
  • Mean and Variance
  • Hypothesis Testing
  • t-test
  • χ²-test
  • Mutual Information

15
Frequency
  • (Justeson & Katz, 1995)
  • Hypothesis
  • if 2 words occur together very often, they must
    be interesting candidates for a collocation
  • Method
  • Select the most frequently occurring bigrams
    (sequence of 2 adjacent words)

16
Results
  • Not very interesting
  • except for New York, all top bigrams are pairs of
    function words (of the, in the, ...)
  • So, let's pass the results through a
    part-of-speech filter

17
Frequency POS filter
  • Simple method that works very well: keep only
    those bigrams whose part-of-speech pattern is
    likely to be a phrase (e.g., adjective-noun,
    noun-noun); a sketch follows below
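
A minimal sketch of this frequency-plus-POS-filter approach in Python, assuming NLTK and its tokenizer/tagger data are installed; the accepted tag patterns (adjective-noun, noun-noun) and the function name are illustrative choices, not the slides' exact setup:

    from collections import Counter
    import nltk

    def candidate_bigrams(text, top_n=10):
        # Tokenize and POS-tag the text
        tagged = nltk.pos_tag(nltk.word_tokenize(text))
        # Count adjacent (word, tag) pairs
        counts = Counter(zip(tagged, tagged[1:]))
        # Keep only adjective-noun and noun-noun bigrams
        good = {("JJ", "NN"), ("JJ", "NNS"), ("NN", "NN"), ("NN", "NNS")}
        filtered = Counter({
            (w1, w2): c
            for ((w1, t1), (w2, t2)), c in counts.items()
            if (t1, t2) in good
        })
        return filtered.most_common(top_n)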

18
Strong versus powerful
  • On a 14 million word corpus from the New York
    Times (Aug.-Nov. 1990)

19
Frequency Conclusion
  • Advantages
  • works well for fixed phrases
  • simple method, accurate results
  • requires little linguistic knowledge
  • But many collocations consist of two words in
    more flexible relationships
  • she knocked on his door
  • they knocked at the door
  • 100 women knocked on Donaldson's door
  • a man knocked on the metal front door

20
Approaches to finding collocations
  • Frequency
  • → Mean and Variance
  • Hypothesis Testing
  • t-test
  • χ²-test
  • Mutual Information

21
Mean and Variance
  • (Smadja et al., 1993)
  • looks at the distribution of distances between
    two words in a corpus
  • looking for pairs of words with low variance
  • a low variance means that the two words usually
    occur at about the same distance
  • low variance → good candidate for a collocation
  • we need a collocational window to capture
    collocations of variable distance

22
Collocational Window
  • This is an example of a three word window.
  • To capture 2-word collocations, pair each word
    with the next two words (the sketch below
    generates these pairs):
  • this is, this an
  • is an, is example
  • an example, an of
  • example of, example a
  • of a, of three
  • a three, a word
  • three word, three window
  • word window
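
A small sketch of how these window pairs can be generated (the function name is illustrative):

    def window_pairs(tokens, window=3):
        # Pair each word with the words that follow it inside the window
        pairs = []
        for i, w in enumerate(tokens):
            for d in range(1, window):  # offsets 1 .. window-1
                if i + d < len(tokens):
                    pairs.append((w, tokens[i + d]))
        return pairs

    print(window_pairs("this is an example of a three word window".split()))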

23
Mean and Variance (cont)
  • The mean is the average offset (signed distance)
    between two words in a corpus:
    d̄ = (1/n) Σi di
  • The variance measures how much the individual
    offsets deviate from the mean (sketched below):
    s² = Σi (di − d̄)² / (n − 1)
  • n is the number of times the two words (two
    candidates) co-occur
  • di is the offset of the i-th pair of candidates
  • d̄ is the mean offset of all pairs of candidates
  • If the offsets (di) are the same in all
    co-occurrences
  • → variance is zero
  • → definitely a collocation
  • If the offsets (di) are randomly distributed
  • → variance is high
  • → not a collocation
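
A sketch of this computation over the four knocked ... door sentences from the frequency slide (sample variance with n − 1, as in the formula above):

    import math

    def offset_stats(sentences, w1, w2):
        # Collect the signed offsets of w2 relative to w1
        offsets = []
        for s in sentences:
            toks = s.lower().split()
            for i, a in enumerate(toks):
                for j, b in enumerate(toks):
                    if a == w1 and b == w2 and i != j:
                        offsets.append(j - i)
        n = len(offsets)
        mean = sum(offsets) / n
        var = sum((d - mean) ** 2 for d in offsets) / (n - 1)
        return mean, math.sqrt(var)

    sents = ["she knocked on his door",
             "they knocked at the door",
             "100 women knocked on Donaldson's door",
             "a man knocked on the metal front door"]
    print(offset_stats(sents, "knocked", "door"))  # (3.5, 1.0)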

24
An Example
  • window size 11 around knocked (5 left, 5 right)
  • she knocked on his door (offset 3)
  • they knocked at the door (offset 3)
  • 100 women knocked on Donaldson's door (offset 3)
  • a man knocked on the metal front door (offset 5)
  • Mean d̄ = (3 + 3 + 3 + 5)/4 = 3.5
  • Std. deviation s = √(((3 − 3.5)² + (3 − 3.5)² +
    (3 − 3.5)² + (5 − 3.5)²)/3) = 1.0

25
Position histograms
[Position histograms of strong relative to opposition, support, and for]
  • strong opposition (and strong support)
  • variance is low
  • → interesting collocations
  • strong for
  • variance is high
  • → not an interesting collocation

26
Mean and variance versus Frequency
  • std. dev. ≈ 0 and mean offset = 1 → would be found
    by the frequency method
  • std. dev. ≈ 0 and high mean offset → very
    interesting, but would not be found by the
    frequency method
  • high std. dev. → not interesting
27
Mean Variance Conclusion
  • good for finding collocations that have
  • a looser relationship between their words
  • intervening material and variable relative position

28
Approaches to finding collocations
  • Frequency
  • Mean and Variance
  • → Hypothesis Testing
  • t-test
  • χ²-test
  • Mutual Information

29
Hypothesis Testing
  • Even if 2 words are independent, when both are
    frequent they will frequently occur together
  • frequent bigrams and low variance can be
    accidental (two words can co-occur by chance)
  • We want to determine whether the co-occurrence is
    random or whether it occurs more often than
    chance would predict
  • This is a classical problem in statistics called
    hypothesis testing
  • When two words co-occur, hypothesis testing lets
    us measure how confident we can be that the
    co-occurrence was due to chance or not

30
Hypothesis Testing (cont)
  • We formulate a null hypothesis H0
  • H0: no real association (just chance)
  • H0 states what should be true if two words do not
    form a collocation
  • if 2 words w1 and w2 do not form a collocation,
    then w1 and w2 occur independently of each other
  • We need a statistical test that tells us how
    probable or improbable it is that a certain
    combination occurs
  • Statistical tests:
  • t-test
  • χ²-test

31
Approaches to finding collocations
  • Frequency
  • Mean and Variance
  • Hypothesis Testing
  • → t-test
  • χ²-test
  • Mutual Information

32
Hypothesis Testing the t-test
  • (or Student's t-test)
  • H0 states that P(w1 w2) = P(w1) P(w2)
  • We calculate the probability (p-value) that the
    observed co-occurrence would arise if H0 were true
  • If the p-value is too low, we reject H0
  • typically at a significance level of p <
    0.05, 0.01, or 0.001
  • Otherwise, we retain H0 as possible

33
Some intuition
  • Assume we want to compare the heights of men and
    women
  • we cannot measure the height of every adult
  • so we take a sample of the population
  • and make inferences about the whole population
  • by comparing the sample means and the variation
    of each mean
  • H0: women and men are equally tall, on average
  • We gather data from 10 men and 10 women

34
Some intuition (con't)
  • the t-test compares
  • the sample mean (computed from observed values)
  • to an expected mean
  • it determines the likelihood (p-value) that the
    difference between the 2 means occurs by chance
  • a p-value close to 1 → it is very likely that
    the expected and sample means are the same
  • a small p-value (e.g., 0.01) → it is unlikely
    (only a 1 in 100 chance) that such a difference
    would occur by chance
  • so the lower the p-value → the more certain we
    are that there is a significant difference
    between the observed and expected means, and we
    reject H0

35
Some intuition (cont)
  • the t-test assigns a probability that describes
    the likelihood that the null hypothesis is true

[Figure: 1-tailed and 2-tailed t distributions
(frequency vs. value of t). A high p-value → accept H0;
a low p-value → reject H0. The critical value c is the
value of t beyond which we decide to reject H0; the
confidence level α is the probability that the t-score
exceeds the critical value c.]
36
Some intuition (cont)
  • Compute the t score
  • Consult the table of critical values with df = 18
    (10 + 10 − 2)
  • If t > the critical value (value in the table),
    then the 2 samples are significantly different at
    the probability level that is listed
  • Assume t = 2.7
  • if there is no difference in height between women
    and men (H0 is true), then the probability of
    finding t = 2.7 is between 0.025 and 0.01
  • that's not much
  • so we reject the null hypothesis H0
  • and conclude that there is a difference in height
    between men and women

[Probability table based on the t distribution
(2-tailed test)]
37
The t-Test
  • looks at the mean and variance of a sample of
    measurements
  • the null hypothesis is that the sample is drawn
    from a distribution with mean μ
  • The test
  • looks at the difference between the observed and
    expected means, scaled by the variance of the
    data
  • tells us how likely one is to get a sample of
    that mean and variance
  • assuming that the sample is drawn from a normal
    distribution with mean μ

38
The t-Statistic
t = (x̄ − μ) / √(s² / N)
(the difference between the observed mean and the
expected mean, scaled by the variance of the data)
x̄ is the sample mean, μ is the expected mean of the
distribution, s² is the sample variance, and N is the
sample size
  • the higher the value of t, the greater the
    confidence that
  • there is a significant difference
  • it's not due to chance
  • the 2 words are not independent

39
t-Test for finding Collocations
  • We think of a corpus of N words as a long
    sequence of N bigrams
  • the samples are seen as random variables that
  • take the value 1 when the bigram of interest
    occurs
  • take the value 0 otherwise

40
t-Test Example with collocations
  • In a corpus:
  • new occurs 15,828 times
  • companies occurs 4,675 times
  • new companies occurs 8 times
  • there are 14,307,668 tokens overall
  • Is new companies a collocation?
  • Null hypothesis (independence assumption):
  • P(new companies) = P(new) × P(companies)

41
Example (Cont.)
  • If the null hypothesis is true, then:
  • if we randomly generate bigrams of words
  • assign 1 to the outcome new companies
  • assign 0 to any other outcome
  • in effect a Bernoulli trial
  • then the probability of having new companies is
    expected to be
    p = (15,828/14,307,668) × (4,675/14,307,668)
      ≈ 3.615 × 10⁻⁷
  • So the expected mean is μ = 3.615 × 10⁻⁷
  • The variance is s² = p(1 − p) ≈ p, since for most
    bigrams p is small
  • (in a binomial distribution s² = np(1 − p), but
    here n = 1)

42
Example (Cont.)
  • But we counted 8 occurrences of the bigram new
    companies
  • So the observed mean is
    x̄ = 8/14,307,668 ≈ 5.591 × 10⁻⁷
  • Applying the t-test (a worked version follows
    below):
    t = (x̄ − μ) / √(s²/N)
      = (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / √(5.591 × 10⁻⁷ / 14,307,668)
      ≈ 1
  • With a confidence level α = 0.005, the critical
    value is 2.576 (t should be at least 2.576)
  • Since t ≈ 1 < 2.576
  • we cannot reject H0
  • so we cannot claim that new and companies form a
    collocation
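
A worked version of this calculation in plain Python (counts as on slide 40):

    import math

    N = 14_307_668                        # tokens in the corpus
    c_new, c_companies, c_bigram = 15_828, 4_675, 8

    mu = (c_new / N) * (c_companies / N)  # expected mean under H0, ≈ 3.615e-7
    x_bar = c_bigram / N                  # observed mean, ≈ 5.591e-7
    s2 = x_bar * (1 - x_bar)              # sample variance, ≈ x_bar for small p

    t = (x_bar - mu) / math.sqrt(s2 / N)
    print(round(t, 4))                    # ≈ 1.0, below the critical value 2.576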

43
t test Some results
  • t-test applied to 10 bigrams that all occur with
    frequency = 20
  • Notes:
  • a frequency-based method could not have seen the
    difference between these bigrams, because they
    all have the same frequency
  • the t-test takes into account the frequency of a
    bigram relative to the frequencies of its
    component words
  • if a high proportion of the occurrences of both
    words occur in the bigram, then its t is high
  • The t-test is mostly used to rank collocations
  • bigrams that fail the t-test (t < 2.576):
  • we cannot reject the null hypothesis
  • so they do not form a collocation
  • bigrams that pass the t-test (t > 2.576):
  • we can reject the null hypothesis
  • so they form a collocation

44
Hypothesis testing of differences
  • Used to see if 2 words (near-synonyms) are used
    in the same contexts or not
  • strong vs. powerful
  • can be useful in lexicography
  • we want to test
  • if there is a difference between 2 populations
  • Ex: height of women / height of men
  • the null hypothesis is that there is no
    difference
  • i.e. the average difference is 0 (μ = 0)

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)
x̄1 is the sample mean of population 1, x̄2 is the
sample mean of population 2, s1² and s2² are the
sample variances of populations 1 and 2, and n1 and n2
are the sample sizes of populations 1 and 2 (sketched
below)
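
A sketch of this two-sample t statistic; the height samples are invented purely to illustrate the call:

    import math

    def t_diff(x1, x2):
        # t statistic for the difference of two sample means (H0: difference = 0)
        n1, n2 = len(x1), len(x2)
        m1, m2 = sum(x1) / n1, sum(x2) / n2
        v1 = sum((v - m1) ** 2 for v in x1) / (n1 - 1)
        v2 = sum((v - m2) ** 2 for v in x2) / (n2 - 1)
        return (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)

    # Hypothetical height samples (cm) for the men/women example
    men = [178, 182, 175, 180, 171, 177, 183, 176, 179, 174]
    women = [165, 170, 162, 168, 158, 166, 171, 163, 167, 161]
    print(round(t_diff(men, women), 2))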
45
Difference test example
  • Is there a difference in how we use powerful
    and how we use strong?

46
Approaches to finding collocations
  • Frequency
  • Mean and Variance
  • Hypothesis Testing
  • t-test
  • → χ²-test
  • Mutual Information

47
Hypothesis testing the ?2-test
  • a problem with the t-test is that it assumes that
    probabilities are approximately normally
    distributed
  • the χ²-test does not make this assumption
  • The essence of the χ²-test is the same as that of
    the t-test:
  • compare observed frequencies with the frequencies
    expected under independence
  • if the difference is large
  • then we can reject the null hypothesis of
    independence

48
?2-test
  • In its simplest form, it is applied to a 2x2
    table of observed frequencies
  • The χ² statistic
  • sums the differences between observed frequencies
    (in the table)
  • and the values expected under independence
  • scaled by the magnitude of the expected values:
    χ² = Σi,j (Oij − Eij)² / Eij

49
?2-test- Example
  • Observed frequencies Oij (derived from the counts
    on slide 40):

Observed     w2 = companies   w2 ≠ companies
w1 = new                  8           15,820
w1 ≠ new              4,667       14,287,173

50
?2-test- Example (cont)
  • Expected frequencies Eij under independence:
  • computed from the marginal probabilities (the
    totals of the rows and columns converted into
    proportions):
    Eij = (row i total × column j total) / N
  • Ex: expected frequency for cell (1,1) (new
    companies):
  • marginal probability of new occurring as the
    first part of a bigram, times the marginal
    probability of companies occurring as the second
    part of a bigram, times N:
    E11 = (15,828 × 4,675) / 14,307,668 ≈ 5.17
  • If new and companies occurred completely
    independently of each other
  • we would expect 5.17 occurrences of new
    companies on average

51
?2-test- Example (cont)
  • Summing over the four cells gives χ² ≈ 1.55
  • But is the difference significant?
  • df in an n×c table = (n − 1)(c − 1) =
    (2 − 1)(2 − 1) = 1 (degrees of freedom)
  • At the probability level α = 0.05, the critical
    value is 3.84
  • Since 1.55 < 3.84
  • we cannot reject H0 (that new and companies
    occur independently of each other)
  • So new companies is not a good candidate for a
    collocation (see the sketch below)
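
A sketch of the same computation using the standard shortcut χ² formula for 2x2 tables, with the cell values from slide 49:

    def chi_square_2x2(o11, o12, o21, o22):
        # Shortcut chi-square formula for a 2x2 contingency table
        n = o11 + o12 + o21 + o22
        num = n * (o11 * o22 - o12 * o21) ** 2
        den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
        return num / den

    # new companies: chi-square ≈ 1.55 < 3.84, so independence is not rejected
    print(round(chi_square_2x2(8, 15_820, 4_667, 14_287_173), 2))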

52
?2-test Conclusion
  • Differences between the t statistic and the χ²
    statistic do not seem to be large
  • But:
  • the χ²-test is appropriate for large
    probabilities
  • where the t-test fails because of the normality
    assumption
  • the χ²-test is not appropriate with sparse data
    (if the numbers in the 2-by-2 table are small)
  • the χ²-test has been applied to a wider range of
    problems:
  • machine translation
  • corpus similarity
53
?2-test for machine translation
  • (Church & Gale, 1991)
  • To identify translation word pairs in aligned
    corpora
  • Ex:
  • χ² ≈ 456,400 >> 3.84 (with α = 0.05)
  • So vache and cow are not independent, and so they
    are translations of each other

Number of aligned sentence pairs containing cow in
English and vache in French:

Observed     cow       ¬cow      TOTAL
vache         59          6         65
¬vache         8    570,934    570,942
TOTAL         67    570,940    571,007
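
Plugging this table into the chi_square_2x2 sketch from slide 51 gives chi_square_2x2(59, 6, 8, 570_934) ≈ 456,400, far above the critical value of 3.84.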
54
?2-test for corpus similarity
  • (Kilgarriff & Rose, 1998)
  • Ex:
  • compute χ² for the 2 populations (corpus 1 and
    corpus 2)
  • H0: the 2 corpora have the same word distribution

Observed    Corpus 1   Corpus 2   Ratio
word 1            60          9   60/9 ≈ 6.7
word 2           500         76   6.6
word 3           124         20   6.2
...
word 500
55
Collocations across corpora
  • Ratios of relative frequencies between two or
    more different corpora
  • can be used to discover collocations that are
    characteristic of one corpus when compared to
    another corpus

56
Collocations across corpora (cont)
  • most useful for the discovery of subject-specific
    collocations
  • Compare a general text with a subject-specific
    text
  • words and phrases that (on a relative basis)
    occur most often in the subject-specific text are
    likely to be part of the vocabulary that is
    specific to the domain

57
Approaches to finding collocations
  • Frequency
  • Mean and Variance
  • Hypothesis Testing
  • t-test
  • χ²-test
  • → Mutual Information

58
Pointwise Mutual Information
  • Uses a measure from information theory
  • The pointwise mutual information between 2 events
    x and y (in our case the occurrence of 2 words)
    is:
    I(x, y) = log2 ( P(x, y) / (P(x) P(y)) )
  • roughly, a measure of how much one event (word)
    tells us about the other
  • or a measure of the independence of 2 events (or
    2 words)
  • If the 2 events x and y are independent, then
    I(x, y) = 0

59
Example
  • Assume:
  • c(Ayatollah) = 42
  • c(Ruhollah) = 20
  • c(Ayatollah, Ruhollah) = 20
  • N = 14,307,668
  • Then:
    I(Ayatollah, Ruhollah)
      = log2 ( (20/N) / ((42/N) × (20/N)) )
      = log2 (N/42) ≈ 18.38
  • So? The amount of information we have about the
    occurrence of Ayatollah at position i increases
    by 18.38 bits if we are told that Ruhollah occurs
    at position i+1 (see the sketch below)
  • works particularly badly with sparse data
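
A one-function sketch of this PMI computation with the counts above:

    import math

    def pmi(c_xy, c_x, c_y, n):
        # I(x, y) = log2( P(x, y) / (P(x) P(y)) )
        return math.log2((c_xy / n) / ((c_x / n) * (c_y / n)))

    print(round(pmi(20, 42, 20, 14_307_668), 2))  # ≈ 18.38 bits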

60
Pointwise Mutual Information (cont)
  • With pointwise mutual information
  • With the t-test (see slide 43)
  • Same ranking as the t-test

61
Pointwise Mutual Information (cont)
  • good measure of independence
  • values close to 0 → independence
  • bad measure of dependence
  • because the score depends on frequency
  • all things being equal, bigrams of low-frequency
    words will receive a higher score than bigrams of
    high-frequency words
  • so sometimes we use C(w1 w2) · I(w1, w2) instead

62
Automatic vs manual detection of collocations
  • Manual detection finds a wider variety of
    grammatical patterns
  • Ex: in the BBI Combinatory Dictionary of English
  • The quality of the collocations is better than
    that of computer-generated ones
  • But it is slow and requires expertise

strength                 power
to build up ~            to assume ~
to find ~                emergency ~
to save ~                discretionary ~
to sap somebody's ~      fire ~
brute ~                  supernatural ~
tensile ~                to turn off the ~
the ~ to do X            the ~ to do X