Title: LIN 3098
1LIN 3098 Corpus Linguistics: Lecture 5
2In this lecture
- Corpora and the Lexicon
- uses of corpora in lexicography
- Counting words
- lemmatisation and other issues
- types versus tokens
- word frequency distributions in corpora
3Part 1
4Why corpora are useful
- Lexicographic work has long relied on contextual cues to identify meanings.
- e.g. Samuel Johnson used examples from literature to exemplify uses of a word.
- Corpora make this procedure much easier
- not only to provide examples but
- to actually identify meanings of a word given its context
- definitions of word meanings should therefore be more precise, if based on large amounts of data
5Specific applications
- Grammatical alternations of words
- E.g. Verb diathesis alternations
- Atkins and Levin (1995) found that verbs such as quiver and quake have both intransitive and transitive uses (see Lecture 1).
- E.g. uses of prepositions such as on, with
- Regional variations in word use
- relying on corpora which include gender/region/dialect/date information
6Specific applications - II
- Identification of occurrences of a specific homograph, e.g. house (Verb)
- examination of the contexts in which it occurs
- relies on POS tagging
- Keeping track of changes in a language through a monitor corpus
- Identifying how common a word is, through frequency counts.
- many dictionaries include such information now
- this shall be our starting point
7Part 2
- Counting words in corpora: types versus tokens
8Running example
- Throughout this lecture, reference is made to data from a corpus of Maltese texts
- ca. 51,000 words
- all from Maltese-language newspapers
- various topics and article types
9How to count words: types versus tokens
- token: any word in the corpus
- (also counting words that occur more than once)
- type: all the individual, different words in the corpus
- (grouping occurrences of a word together as representatives of a single type)
- Example
- I spoke to the chap who spoke to the child
- 10 tokens
- 7 types (I, spoke, to, the, chap, who, child)
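The count on this slide can be reproduced with a minimal sketch. It assumes the simplest possible tokenisation (splitting on whitespace, keeping case as written); the variable names are illustrative only.

```python
sentence = "I spoke to the chap who spoke to the child"

# Tokens: every running word counts, repeats included.
tokens = sentence.split()

# Types: each distinct word form counts once.
types = set(tokens)

n_tokens = len(tokens)
n_types = len(types)
print(n_tokens, "tokens,", n_types, "types")
```

Real corpora need more careful tokenisation than `split()`, as the "Difficult decisions" slides discuss.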
10More on types and tokens
- The number of tokens in the corpus is an estimate of overall corpus size
- Maltese corpus: 51,000 tokens
- The number of types is an estimate of vocabulary size
- gives an idea of the lexical richness of the corpus
- Maltese corpus: 8193 types
11Type/token ratio
- A (rough!) way of measuring the amount of variation in the vocabulary in the corpus.
- Roughly, can be interpreted as the rate at which new types are introduced, as a function of number of tokens
12Difficult decisions - I
- Do we distinguish upper- and lower-case words?
- is New in New York the same as new in new car?
- but what of New in New cars are expensive? (sentence-initial caps)
- in practice, it's not straightforward to distinguish the two accurately, but it can be done
13Difficult decisions - II
- What about morphological variants?
- man / men: one type or two?
- go / went: one type or two?
- If we map all morphological (inflectional) variants to a single type, our counts will be cleaner (lemmatisation).
- depends on availability of automated methods to do this
- Maltese also presents problems with variants of the definite article (ir-, is-, ix- etc)
- ir-ragel (DEF-man): one token or two?
14Difficult decisions - III
- Do numbers count?
- e.g. is 1,500 a word?
- may artificially inflate frequency counts
- one approach is to treat all numbers as tokens of a single type (e.g. NUMBER)
- Punctuation
- can compromise frequency counts
- computer will treat woman! as different from woman
- needs to be stripped
- problematic for languages that rely on non-alphabetic symbols: Maltese 'l (to) vs l- (the)
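One possible way to implement the decisions on these three slides is sketched below. The function name `normalise`, the set of stripped punctuation marks, and the number pattern are all assumptions made for illustration, not a standard recipe. Hyphens and apostrophes are deliberately kept so that forms such as Maltese ir- or l- are not destroyed.

```python
import re

def normalise(text, lowercase=True):
    """Split on whitespace, strip edge punctuation, map numbers to NUMBER."""
    tokens = []
    for raw in text.split():
        # Strip punctuation at the edges only; hyphens and apostrophes
        # are not in this set, so they survive inside and at the edges.
        word = raw.strip(".,;:!?\"()[]")
        if not word:
            continue
        if re.fullmatch(r"\d+(?:[.,]\d+)*", word):
            # Treat every number as a token of the single type NUMBER.
            word = "NUMBER"
        elif lowercase:
            word = word.lower()
        tokens.append(word)
    return tokens

print(normalise("Woman! The woman paid 1,500."))
print(normalise("ir-ragel"))
```

Lowercasing everything is the crude option: it merges sentence-initial New with new, but it also merges New in New York with new in new car. Pass `lowercase=False` to keep case distinctions.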
15Part 2
- Representing word frequencies
16Raw frequency lists (data from Maltese)
- A simple list, pairing each word with its frequency
word                    frequency
ahhar (last)            97
jkun (be.IMPERF.3SG)    96
ukoll (also)            93
bhala (as)              91
dak (that.SGM)          86
tat- (of.DEF)           86
17Frequency ranks
- Word counts can get very big.
- most frequent word in the Maltese corpus occurs 2195 times (and the corpus is small)
- Raw frequency lists can be hard to process.
- Useful to represent words in terms of rank
- count the words
- sort by frequency (most frequent first)
- assign a rank to the words
- rank 1: most frequent
- rank 2: next most frequent
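The three steps (count, sort, rank) can be sketched as follows, using the earlier example sentence rather than the Maltese corpus. Breaking ties alphabetically and giving tied words consecutive ranks are simplifying assumptions of this sketch.

```python
from collections import Counter

tokens = "I spoke to the chap who spoke to the child".split()

# Step 1: count the words.
counts = Counter(tokens)

# Step 2: sort by frequency, most frequent first (ties: alphabetical).
ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

# Step 3: assign ranks, starting from 1 for the most frequent word.
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)
```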
18Rank-frequency list example (data from Maltese)
rank    frequency
1       2195
2       2080
3       1277
4       1264
- rank: rank of the type, according to frequency
- frequency: number of times the type occurs
19Frequency spectrum (data from Maltese)
- A representation that shows, for each frequency
value, the number of different types that occur
with that frequency.
frequency types
1 4382
2 1253
3 661
4 356
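A frequency spectrum can be derived from a word count by counting the counts. The sketch below does this for the earlier example sentence (the Maltese corpus itself is not available here); variable names are illustrative.

```python
from collections import Counter

tokens = "I spoke to the chap who spoke to the child".split()

# First pass: type -> frequency.
word_counts = Counter(tokens)

# Second pass: frequency -> number of types with that frequency.
spectrum = Counter(word_counts.values())

for freq in sorted(spectrum):
    print(freq, spectrum[freq])
```

A useful sanity check: multiplying each frequency by its number of types and summing gives back the number of tokens.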
20Normalised frequency counts
- A raw frequency for a word isn't necessarily informative.
- E.g. difficult to compare the frequency of the word in corpora of different sizes.
- We often take a normalised count.
- typical to divide the frequency by the corpus size (no. of tokens) and multiply by some constant, such as 10,000 or 1,000,000
- this gives frequency of the word per 10,000 or per million words, rather than a raw count.
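A minimal sketch of normalisation, assuming the usual reading: the relative frequency (raw count over corpus size in tokens) is scaled to a fixed base such as one million. The function name `per_million` and its parameters are illustrative.

```python
def per_million(raw_freq, corpus_size, base=1_000_000):
    """Scale a raw count to occurrences per `base` tokens."""
    return raw_freq / corpus_size * base

# The most frequent Maltese type: 2195 occurrences in ca. 51,000 tokens.
top_word_pm = per_million(2195, 51_000)
print(round(top_word_pm), "occurrences per million tokens")
```

Because the Maltese corpus size is only given approximately, the resulting figure is approximate too.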
21Type/token ratio revisited
- (no. of types)/(no. of tokens)
- Another way of estimating vocabulary richness of a corpus, instead of just looking at vocabulary size.
- E.g. if a corpus consists of 1000 words, and there are 400 types, then the TTR is 400/1000 = 0.4, i.e. 40%
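As a quick illustration, the ratio can be computed directly from a token list. The sketch uses the earlier example sentence with whitespace tokenisation; the function name is an assumption. A corpus of 1000 tokens with 400 types would give 0.4, i.e. 40%, in the same way.

```python
def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "I spoke to the chap who spoke to the child".split()
ttr = type_token_ratio(tokens)   # 7 types over 10 tokens
print(f"TTR = {ttr:.2f} ({ttr:.0%})")
```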
22Type/token ratio
- Ratio varies enormously depending on corpus size!
- If the corpus is 1000 words, it's easy to see a TTR of, say, 40%.
- With 4 million words, it's more likely to be in the region of 2%.
- Reasons
- vocab size grows with corpus size but
- large corpora will contain a lot of types that occur many times, so tokens accumulate faster than new types
23Standardised type/token ratio
- One way to account for TTR variations due to corpus size is to compute an average TTR for chunks of a constant size.
- Example
- compute the TTR for every 1000 words of running text
- then, take an average over all the 1000-word chunks
- This is the approach used, for example, in WordSmith.
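The chunk-and-average procedure can be sketched as below. This is an illustration of the idea only, not WordSmith's actual implementation; in particular, dropping any incomplete final chunk is an assumption of this sketch, and the function name is hypothetical.

```python
def standardised_ttr(tokens, chunk_size=1000):
    """Average the TTR over consecutive full chunks of `chunk_size` tokens."""
    ratios = []
    # Only full chunks are used; a shorter trailing chunk is ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("corpus is shorter than one chunk")
    return sum(ratios) / len(ratios)

# Toy data: the same four tokens repeated three times.
toy = ["a", "b", "a", "c"] * 3
plain_ttr = len(set(toy)) / len(toy)          # shrinks as the text grows
sttr = standardised_ttr(toy, chunk_size=4)    # stays stable per chunk
print(plain_ttr, sttr)
```

On the toy data the plain TTR falls as the text is repeated, while the standardised value stays the same for every chunk, which is the point of the measure.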
24Part 3
- Frequency distributions, or
- few giants, many midgets
25Non-linguistic case study
- Suppose we are interested in measuring people's height.
- population: adult, male/female, European
- sample: N people from the relevant population
- measure height of each person in the sample
- Results
- person 1: 1.6 m
- person 2: 1.5 m
- ...
26Measures of central tendency
- Given the height of individuals in our sample, we can calculate some summary statistics
- mean (average): sum of all heights in sample, divided by N
- mode: most frequent value
- median: the middle value
- What are your expectations?
27The data (example)
- Mean: 158.8cm
- This is the expected value in the long run.
- If our sample is good, we would expect that most people would have a height at or around the mean.
- Mode: 160cm
- Median: 160cm
person  height (cm)
1       135
2       159
3       160
4       160
5       180
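The three summary statistics for this five-person sample can be checked with Python's standard `statistics` module; the variable names are illustrative.

```python
import statistics

heights_cm = [135, 159, 160, 160, 180]

mean = statistics.mean(heights_cm)      # sum of heights divided by N
median = statistics.median(heights_cm)  # middle value of the sorted sample
mode = statistics.mode(heights_cm)      # most frequent value

print(mean, median, mode)
```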
28Plotting height/frequency
- Observations
- 1. Extreme values are less frequent.
- 2. Most people fall on or near the mean
- 3. Mode is approximately same as mean
- 4. Bell-shaped curve (normal distribution)
29Plotting height/frequency
- This shape characterises the Normal Distribution.
- A bell curve
- Quite typical for a lot of data sampled from
humans (but not all data)
30What about language?
- Typical observations about word frequencies in corpora
- there are a few words with extremely high frequency
- there are many more words with extremely low frequency
- the mean is not a good indicator: most words will have an actual value that is very far above or below the mean
31A closer look at the Maltese data
- Out of 51,000 tokens
- 8016 tokens belong to just the 5 most frequent types (the types at ranks 1-5)
- ca. 15% of our corpus size is made up of only 5 different words!
- Out of 8193 types
- 4382 are hapax legomena, occurring only once (bottom ranks)
- 1253 occur only twice
- ...
- In this data, the mean won't tell us very much.
- it hides huge variations!
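The figures on this slide can be turned into proportions with a few lines of arithmetic. The sketch below uses only the numbers quoted above; note that the corpus size is approximate (ca. 51,000), so the results are approximate as well.

```python
n_tokens = 51_000     # approximate corpus size
n_types = 8193        # vocabulary size
top5_tokens = 8016    # tokens belonging to the 5 most frequent types
hapaxes = 4382        # types occurring exactly once

top5_share = top5_tokens / n_tokens   # share of the corpus taken by 5 types
hapax_share = hapaxes / n_types       # share of the vocabulary seen only once
mean_freq = n_tokens / n_types        # mean frequency per type

print(f"top 5 types: {top5_share:.1%} of tokens")
print(f"hapax legomena: {hapax_share:.1%} of types")
print(f"mean frequency per type: {mean_freq:.1f}")
```

The mean of roughly six occurrences per type describes almost no actual word: more than half of the types occur once, while the top type occurs 2195 times.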
32Ranks and frequencies (Maltese)
Among top ranks, frequency drops very dramatically
Among bottom ranks, frequency drops very gradually
33General observations
- In corpora
- there are always a few very high-frequency words, and many low-frequency words
- among the top ranks, frequency differences are big
- among bottom ranks, frequency differences are very small
34So what are the high-frequency words?
- Top 5 ranked words in the Maltese data
- li (that), l- (DEF), il- (DEF), u (and), ta' (of), tal- (of the)
- Bottom ranked words
- zona (zone): f = 1
- yankee: f = 1
- zwieten (Zejtun residents): f = 1
- xortih (luck.POSS-3SGM): f = 1
- widnejhom (ear.POSS-3PL): f = 1
35Zipf's law
- George K. Zipf (1902-1950) established a mathematical model for describing frequency data
- Frequency decreases with rank. More precisely, frequency is inversely proportional to rank.
- We can plot this in a chart
- Y-axis: frequency
- X-axis: rank
- each dot on the chart represents the lexical item (type) at a given rank
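To make "inversely proportional" concrete, the sketch below compares the observed top frequencies from the Maltese rank-frequency list with what the simplest form of the law, f(r) = C / r, would predict. Taking the constant C to be the observed rank-1 frequency is an assumption of this sketch, as is the function name.

```python
def zipf_predicted(rank, top_freq):
    """Frequency predicted at `rank` if f(rank) = top_freq / rank."""
    return top_freq / rank

# Observed frequencies at ranks 1-4 (Maltese data from the earlier slide).
observed = {1: 2195, 2: 2080, 3: 1277, 4: 1264}

for rank, freq in observed.items():
    predicted = zipf_predicted(rank, observed[1])
    print(rank, freq, round(predicted))
```

Strict inverse proportionality is an idealisation: in this small sample the observed frequencies at the very top ranks fall off more slowly than the prediction. The law describes the overall shape of the curve rather than each individual rank.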
36How Zipf's law pans out (Maltese data)
A few high frequency, low-rank words
Hundreds of low-frequency, high-rank words
37Zipf's law cross-linguistically
- Empirical work has shown that the Zipfian distribution is observable
- independent of the language
- irrespective of corpus size (for reasonably large corpora)
- The bigger your corpus
- the bigger your vocabulary size (no. types)
- the more words of frequency 1 (hapax legomena)
- Why?
38Some reasons
- If words were completely random, every word would be equally likely.
- Our plot would be completely flat: all words at all ranks would have the same frequency.
- Language is absolutely non-random
- occurrence of words governed by
- syntax
- author/speaker intentions
- ...
- Some words are the basic skeleton for our
sentences. They are the most frequent.
39Implications
- Traditional measures of central tendency (mean etc) not very useful.
- No two corpora can be directly compared if they are of different size
- vocab size increases with corpus size
- most of the vocab made up of hapax legomena
- most of the corpus size (no. tokens) made up of a few, very frequent types, typically function words.
40Summary
- We've introduced some of the uses of corpora for lexicography.
- Focused today on word frequencies, especially Zipf's law
- looked at some of the implications
- Next up
- collocations and why they're useful
41References
- Baroni, M. (2007). Distributions in text. In A. Lüdeling and M. Kytö (eds.), Corpus linguistics: An international handbook. Berlin: Mouton de Gruyter.