LIN 3098 - PowerPoint PPT Presentation

About This Presentation
Title:

LIN 3098

Description:

LIN 3098 Corpus Linguistics Lecture 5 Albert Gatt So what are the high-frequency words? Top 5 ranked words in the Maltese data: li ( that ), l- (DEF), il ... – PowerPoint PPT presentation

Slides: 42
Provided by: staffUmE8

Transcript and Presenter's Notes

Title: LIN 3098


1
LIN 3098 Corpus Linguistics: Lecture 5
  • Albert Gatt

2
In this lecture
  • Corpora and the Lexicon
  • uses of corpora in lexicography
  • Counting words
  • lemmatisation and other issues
  • types versus tokens
  • word frequency distributions in corpora

3
Part 1
  • Corpora and lexicography

4
Why corpora are useful
  • Lexicographic work has long relied on contextual
    cues to identify meanings.
  • e.g. Samuel Johnson used examples from literature
    to exemplify uses of a word.
  • Corpora make this procedure much easier
  • not only to provide examples but
  • to actually identify meanings of a word given its
    context
  • definitions of word meanings should therefore be
    more precise, if based on large amounts of data

5
Specific applications
  • Grammatical alternations of words
  • E.g. Verb diathesis alternations
  • Atkins and Levin (1995) found that verbs such as
    quiver and quake have both intransitive and
    transitive uses. (see Lecture 1)
  • E.g. uses of prepositions such as on, with
  • Regional variations in word use
  • relying on corpora which include
    gender/region/dialect/date information

6
Specific applications - II
  • Identification of occurrences of a specific
    homograph, e.g. house (Verb)
  • examination of the contexts in which it occurs
  • relies on POS tagging
  • Keeping track of changes in a language through a
    monitor corpus
  • Identifying how common a word is, through
    frequency counts.
  • many dictionaries include such information now
  • this shall be our starting point

7
Part 2
  • Counting words in corpora: types versus tokens

8
Running example
  • Throughout this lecture, reference is made to
    data from a corpus of Maltese texts
  • ca. 51,000 words
  • all from Maltese-language newspapers
  • various topics and article types

9
How to count words: types versus tokens
  • token: any occurrence of a word in the corpus
  • (words that occur more than once are counted each time)
  • type: each individual, different word in the
    corpus
  • (grouping occurrences of a word together as
    representatives of a single type)
  • Example:
  • I spoke to the chap who spoke to the child
  • 10 tokens
  • 7 types (I, spoke, to, the, chap, who, child)
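
The count for the example sentence can be checked with a few lines of Python. This is only a minimal sketch: it assumes that splitting on whitespace is an adequate tokenisation, which holds for this sentence but not for real corpus text.

```python
from collections import Counter

sentence = "I spoke to the chap who spoke to the child"

# Naive tokenisation: split on whitespace only
# (no case folding, no punctuation handling).
tokens = sentence.split()

# A Counter maps each distinct word (type) to its number of occurrences.
type_counts = Counter(tokens)

num_tokens = len(tokens)       # every occurrence counts
num_types = len(type_counts)   # each distinct word counts once

print(num_tokens, num_types)   # 10 7
print(sorted(type_counts))     # the seven types, in alphabetical order
```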

10
More on types and tokens
  • The number of tokens in the corpus is an estimate
    of overall corpus size
  • Maltese corpus: ca. 51,000 tokens
  • The number of types is an estimate of vocabulary
    size
  • gives an idea of the lexical richness of the
    corpus
  • Maltese corpus: 8193 types

11
Type/token ratio
  • A (rough!) way of measuring the amount of
    variation in the vocabulary in the corpus.
  • Roughly, can be interpreted as the rate at which
    new types are introduced, as a function of number
    of tokens

12
Difficult decisions - I
  • Do we distinguish upper- and lower-case words?
  • is New in New York the same as new in new car?
  • but what of New in New cars are expensive?
    (sentence-initial caps)
  • in practice, it's not straightforward to
    distinguish the two accurately, but it can be done

13
Difficult decisions - II
  • What about morphological variants?
  • man vs. men: one type or two?
  • go vs. went: one type or two?
  • If we map all morphological (inflectional)
    variants to a single type, our counts will be
    cleaner (lemmatisation).
  • depends on availability of automated methods to
    do this
  • Maltese also presents problems with variants of
    the definite article (ir-, is-, ix- etc)
  • ir-ragel (DEF-man): one token or two?

14
Difficult decisions - III
  • Do numbers count?
  • e.g. is 1,500 a word?
  • may artificially inflate frequency counts
  • one approach is to treat all numbers as tokens of
    a single type (e.g. NUMBER)
  • Punctuation
  • can compromise frequency counts
  • a computer will treat woman! as a different type
    from woman
  • so punctuation needs to be stripped
  • but this is problematic for languages that rely on
    non-alphabetic symbols, e.g. Maltese l (to) vs l-
    (the)
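
The decisions on the last three slides (case, numbers, punctuation) can be made explicit as options of a normalisation step. The sketch below is an illustration under stated assumptions, not a recommended tokeniser: the function names, the NUMBER placeholder and the set of punctuation characters are all choices made for this example. Hyphens are deliberately not stripped, so that a form like Maltese l- stays distinct from l.

```python
import re
from collections import Counter

# Digits, optionally grouped with "," or "." (e.g. 1,500 or 3.14).
NUMBER_RE = re.compile(r"\d+(?:[.,]\d+)*")

def normalise(token, fold_case=True, map_numbers=True):
    """Apply some of the (debatable) counting decisions to one raw token."""
    # Strip punctuation at the edges only. Hyphens and apostrophes are
    # left alone because they can be part of the word form itself.
    token = token.strip(".,;:!?\"()[]")
    if map_numbers and NUMBER_RE.fullmatch(token):
        return "NUMBER"          # all numbers become one type
    if fold_case:
        # Blunt: this also merges "New" (as in New York) with "new".
        token = token.lower()
    return token

def count_types(text, **options):
    """Count types after normalising each whitespace-separated token."""
    tokens = [normalise(t, **options) for t in text.split()]
    return Counter(t for t in tokens if t)  # drop pure-punctuation tokens

counts = count_types("Woman! woman 1,500 2,000 New new l- l")
print(counts)
```

Calling `count_types(text, fold_case=False)` keeps New and new apart, which shows how much the type count depends on these decisions.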

15
Part 2
  • Representing word frequencies

16
Raw frequency lists (data from Maltese)
  • A simple list, pairing each word with its
    frequency

word                      frequency
ahhar (last)              97
jkun (be.IMPERF.3SG)      96
ukoll (also)              93
bhala (as)                91
dak (that.SGM)            86
tat- (of.DEF)             86

17
Frequency ranks
  • Word counts can get very big.
  • most frequent word in the Maltese corpus occurs
    2195 times (and the corpus is small)
  • Raw frequency lists can be hard to process.
  • Useful to represent words in terms of rank
  • count the words
  • sort by frequency (most frequent first)
  • assign a rank to the words
  • rank 1: most frequent
  • rank 2: next most frequent
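
The three steps above (count, sort, rank) can be sketched as follows. The function name is hypothetical, and the handling of ties is an assumption: words with equal frequency simply receive consecutive ranks in the order in which they were first seen.

```python
from collections import Counter

def rank_frequency_list(tokens):
    """Return (rank, word, frequency) triples, most frequent word first."""
    counts = Counter(tokens)                      # step 1: count the words
    # step 2: most_common() sorts by descending frequency
    # step 3: enumerate() assigns ranks starting from 1
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

tokens = "I spoke to the chap who spoke to the child".split()
for rank, word, freq in rank_frequency_list(tokens):
    print(rank, word, freq)
```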

18
Rank-frequency list example (data from Maltese)
rank    frequency
1       2195
2       2080
3       1277
4       1264
(rank: rank of the type, according to frequency; frequency: number of times the type occurs)

19
Frequency spectrum (data from Maltese)
  • A representation that shows, for each frequency
    value, the number of different types that occur
    with that frequency.

frequency    types
1            4382
2            1253
3            661
4            356

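
A frequency spectrum is a count of counts: first count how often each type occurs, then count how many types share each frequency value. A minimal sketch, using the example sentence from earlier in the lecture rather than the Maltese data (the function name is hypothetical):

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types occurring that often."""
    type_counts = Counter(tokens)              # word -> frequency
    spectrum = Counter(type_counts.values())   # frequency -> number of types
    return dict(sorted(spectrum.items()))

tokens = "I spoke to the chap who spoke to the child".split()
print(frequency_spectrum(tokens))  # {1: 4, 2: 3}
```

Here four types occur once (I, chap, who, child) and three occur twice (spoke, to, the).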
20
Normalised frequency counts
  • A raw frequency for a word isn't necessarily
    informative.
  • E.g. difficult to compare the frequency of the
    word in corpora of different sizes.
  • We often take a normalised count.
  • typical to divide the frequency by the corpus size
    and multiply by some constant, such as 10,000 or
    1,000,000
  • this gives the frequency of the word per 10,000 or
    per million words, rather than a raw count.
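
One common way to compute a normalised count is to scale the raw frequency by corpus size, giving occurrences per fixed number of tokens. The sketch below assumes this definition; the function and parameter names are illustrative.

```python
def normalised_frequency(raw_count, corpus_size, per=1_000_000):
    """Occurrences of a word per `per` tokens of running text."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return raw_count * per / corpus_size

# ahhar (last) occurs 97 times in the Maltese corpus of ca. 51,000 tokens.
print(round(normalised_frequency(97, 51_000, per=10_000), 1))  # 19.0
```

The same word occurring 97 times in a corpus of a million tokens would score only 0.97 per 10,000, which is what makes the two corpora comparable.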

21
Type/token ratio revisited
  • (no. of types)/(no. of tokens)
  • Another way of estimating vocabulary richness
    of a corpus, instead of just looking at
    vocabulary size.
  • E.g. if a corpus consists of 1000 words, and
    there are 400 types, then the TTR is 0.4 (40%)
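
The ratio is a single division. A minimal sketch, applied to the example on this slide and to the Maltese figures given earlier (8193 types, ca. 51,000 tokens); the function name is hypothetical.

```python
def type_token_ratio(num_types, num_tokens):
    """Number of types divided by number of tokens."""
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    return num_types / num_tokens

print(type_token_ratio(400, 1000))               # 0.4
print(round(type_token_ratio(8193, 51_000), 3))  # 0.161
```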

22
Type/token ratio
  • Ratio varies enormously depending on corpus size!
  • If the corpus is 1000 words, it's easy to see a
    TTR of, say, 40%.
  • With 4 million words, it's more likely to be in
    the region of 2%.
  • Reasons:
  • vocab size grows with corpus size, but
  • large corpora will contain a lot of types that
    occur many times

23
Standardised type/token ratio
  • One way to account for TTR variations due to
    corpus size is to compute an average TTR for
    chunks of a constant size. Example:
  • compute the TTR for every 1000 words of running
    text
  • then, take an average over all the 1000-word
    chunks
  • This is the approach used, for example, in
    WordSmith.
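
The chunk-and-average procedure can be sketched as below. This is not WordSmith's implementation: the function name is hypothetical, and discarding a trailing chunk shorter than the chunk size is an assumption made here so that all chunks are the same length.

```python
def standardised_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive, non-overlapping chunks of equal size."""
    ratios = []
    # A trailing partial chunk is ignored (assumption, see lead-in).
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy input with a chunk size of 4: the chunk TTRs are 2/4 and 3/4.
toy = ["a", "b", "a", "a", "c", "d", "e", "e", "x"]
print(standardised_ttr(toy, chunk_size=4))  # 0.625
```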

24
Part 3
  • Frequency distributions, or
  • few giants, many midgets

25
Non-linguistic case study
  • Suppose we are interested in measuring people's
    height.
  • population: adult, male/female, European
  • sample: N people from the relevant population
  • measure the height of each person in the sample
  • Results:
  • person 1: 1.6 m
  • person 2: 1.5 m

26
Measures of central tendency
  • Given the height of individuals in our sample, we
    can calculate some summary statistics
  • mean (average): sum of all heights in sample,
    divided by N
  • mode: most frequent value
  • median: the middle value
  • What are your expectations?

27
The data (example)
  • Mean: 158.8 cm
  • This is the expected value in the long run.
  • If our sample is good, we would expect that most
    people would have a height at or around the mean.
  • Mode: 160 cm
  • Median: 160 cm

person    height (cm)
1         135
2         159
3         160
4         160
5         180

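
The three summary statistics for the five heights in the table can be reproduced with Python's standard `statistics` module:

```python
import statistics

heights_cm = [135, 159, 160, 160, 180]

print(statistics.mean(heights_cm))    # 158.8  (794 / 5)
print(statistics.median(heights_cm))  # 160    (middle value when sorted)
print(statistics.mode(heights_cm))    # 160    (most frequent value)
```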
28
Plotting height/frequency
  • Observations:
  • 1. Extreme values are less frequent.
  • 2. Most people fall at or near the mean
  • 3. Mode is approximately the same as the mean
  • 4. Bell-shaped curve (normal distribution)

29
Plotting height/frequency
  • This shape characterises the Normal Distribution.
  • A bell curve
  • Quite typical for a lot of data sampled from
    humans (but not all data)

30
What about language?
  • Typical observations about word frequencies in
    corpora
  • there are a few words with extremely high
    frequency
  • there are many more words with extremely low
    frequency
  • the mean is not a good indicator: most words will
    have an actual value that is very far above or
    below the mean

31
A closer look at the Maltese data
  • Out of ca. 51,000 tokens:
  • 8016 tokens belong to just the 5 most frequent
    types (the types at ranks 1 to 5)
  • ca. 15% of our corpus size is made up of only 5
    different words!
  • Out of 8193 types:
  • 4382 are hapax legomena, occurring only once
    (bottom ranks)
  • 1253 occur only twice
  • In this data, the mean won't tell us very much.
  • it hides huge variations!
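
The figures on this slide are enough to see why the mean is uninformative. The sketch below uses only numbers quoted in the lecture; the corpus size is approximate, so the results are approximate too.

```python
num_tokens = 51_000   # approximate corpus size
num_types = 8193
top5_tokens = 8016    # tokens belonging to the 5 most frequent types
hapaxes = 4382        # types occurring exactly once
top_frequency = 2195  # frequency of the rank-1 type

top5_share = top5_tokens / num_tokens   # share of the corpus covered by 5 types
hapax_share = hapaxes / num_types       # share of the vocabulary occurring once
mean_frequency = num_tokens / num_types # mean number of tokens per type

print(f"top-5 share of tokens: {top5_share:.1%}")
print(f"hapax share of types:  {hapax_share:.1%}")
print(f"mean frequency:        {mean_frequency:.1f}")
```

The mean comes out at roughly 6 occurrences per type, yet more than half of the types occur only once and the top-ranked type occurs 2195 times, so almost no type has a frequency close to the mean.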

32
Ranks and frequencies (Maltese)
  • frequencies at the top ranks: 2195, 2080, 1277
  • frequencies at the bottom ranks: 1, 1
  • Among the top ranks, frequency drops very dramatically
  • Among the bottom ranks, frequency drops very gradually

33
General observations
  • In corpora
  • there are always a few very high-frequency words,
    and many low-frequency words
  • among the top ranks, frequency differences are
    big
  • among bottom ranks, frequency differences are
    very small

34
So what are the high-frequency words?
  • Top 5 ranked words in the Maltese data:
  • li (that), l- (DEF), il- (DEF), u (and), ta
    (of), tal- (of the)
  • Bottom ranked words:
  • zona (zone), f = 1
  • yankee, f = 1
  • zwieten (Zejtun residents), f = 1
  • xortih (luck.POSS-3SGM), f = 1
  • widnejhom (ear.POSS-3PL), f = 1

35
Zipf's law
  • George K. Zipf (1902-1950) established a
    mathematical model for describing frequency data
  • Frequency decreases with rank. More precisely,
    frequency is inversely proportional to rank.
  • We can plot this in a chart
  • Y-axis: frequency
  • X-axis: rank
  • each dot on the chart represents the lexical item
    (type) at a given rank
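
"Inversely proportional to rank" means that the frequency at rank r is roughly C / r for some constant C. On a chart with logarithmic axes this becomes a straight line with slope -1, which gives a simple check. The sketch below is an illustration, not part of the lecture: the function names are hypothetical, and the `exponent` parameter is a commonly used generalisation that the slide does not mention.

```python
import math

def zipf_expected(rank, constant, exponent=1.0):
    """Frequency predicted for a rank: constant / rank ** exponent."""
    return constant / rank ** exponent

def loglog_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    `frequencies` must be positive and sorted from most to least
    frequent, so that position i corresponds to rank i + 1.
    """
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Idealised data that follows the law exactly: the slope is -1.
ideal = [zipf_expected(rank, 1000.0) for rank in range(1, 51)]
print(round(loglog_slope(ideal), 3))  # -1.0
```

Real corpus data will not fit this exactly; a slope near -1 over most of the rank range is what one would look for.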

36
How Zipf's law pans out (Maltese data)
  • A few high-frequency, low-rank words
  • Hundreds of low-frequency, high-rank words

37
Zipf's law cross-linguistically
  • Empirical work has shown that the Zipfian
    distribution is observable
  • independent of the language
  • irrespective of corpus size (for reasonably large
    corpora)
  • The bigger your corpus:
  • the bigger your vocabulary size (no. of types)
  • the more words of frequency 1 (hapax legomena)
  • Why?

38
Some reasons
  • If words were drawn completely at random, with no
    preferences, every word would be equally likely.
  • Our plot would be completely flat: all words at
    all ranks would have the same frequency.
  • Language is absolutely non-random
  • occurrence of words governed by
  • syntax
  • author/speaker intentions
  • ...
  • Some words are the basic skeleton for our
    sentences. They are the most frequent.
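
The flat plot predicted for uniformly random words can be checked with a small simulation. Everything here is made up for illustration: a hypothetical 10-word vocabulary, 100,000 tokens, and a fixed random seed so the run is repeatable.

```python
import random
from collections import Counter

rng = random.Random(0)                       # fixed seed for repeatability
vocabulary = [f"w{i}" for i in range(10)]    # hypothetical 10-word vocabulary
tokens = rng.choices(vocabulary, k=100_000)  # every word equally likely

# Frequencies sorted from rank 1 (most frequent) to the bottom rank.
ranked = [freq for _, freq in Counter(tokens).most_common()]

# Ratio of top-rank to bottom-rank frequency; close to 1 means "flat".
flatness = ranked[0] / ranked[-1]
print(ranked[0], ranked[-1], round(flatness, 3))
```

In the Maltese corpus the same ratio is 2195 to 1, which is the difference between random text and language.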

39
Implications
  • Traditional measures of central tendency (mean
    etc.) are not very useful.
  • Two corpora cannot be directly compared if they
    are of different sizes
  • vocab size increases with corpus size
  • most of the vocab made up of hapax legomena
  • most of the corpus size (no. tokens) made up of a
    few, very frequent types, typically function
    words.

40
Summary
  • We've introduced some of the uses of corpora for
    lexicography.
  • Focused today on word frequencies, especially
    Zipf's law
  • looked at some of the implications
  • Next up:
  • collocations and why they're useful

41
References
  • Baroni, M. (2007). Distributions in text. In A.
    Lüdeling and M. Kytö (eds.), Corpus linguistics:
    An international handbook. Berlin: Mouton de
    Gruyter.