Title: Corpus Linguistics Lecture 1
1Corpus Linguistics: Lecture 1
2Contact details
- My email: albert.gatt_at_um.edu.mt
- Drop me a line with queries etc, and to arrange
meetings.
3Course web page
- Course web page: http://staff.um.edu.mt/albert.gatt/home/teaching/corpusLing.html
- Details of tutorials, lectures etc. will always be on the web page.
- Readings for the lecture
- Downloadable lecture notes (available after the lecture)
4Suggested text
- T. McEnery and A. Wilson (2001). Corpus Linguistics. Edinburgh University Press.
- NB: Over the course of these lectures, other readings will also be proposed and made available, usually online.
5Lectures and assessment
- Structure of lectures
- all lectures will take place in the lab
- usually, about half the lecture (1hr) will be devoted to practical work
- Course assessment: assignment
- Final essay (ca. 1500-2000 words)
- Essay topics will involve research on corpora!
6Questions
7What is corpus linguistics?
- A new theory of language?
- No. In principle, any theory of language is compatible with corpus-based research.
- A separate branch of linguistics (in addition to syntax, semantics)?
- No. Most aspects of language can be studied using a corpus (in principle).
- A methodology to study language in all its aspects?
- Yes! The most important principle is that aspects of language are studied empirically by analysing natural data using a corpus.
- A corpus is an electronic, machine-readable collection of texts that represent real-life language use.
8Goals of this lecture
- To define the terms
- corpus linguistics
- corpus
- To give an overview of the history of corpus linguistics
- To contrast the corpus-based approach with other methodologies used in the study of language
9An initial example
- Suppose you're a linguist interested in the syntax of verb phrases.
- Some verbs are transitive, some intransitive
- I ate the meat pie (transitive)
- I swam (intransitive)
- What about
- quiver
- quake
- Are these really intransitive?
Most traditional grammars characterise these as
intransitive
10One possible methodology
- The standard method relies on the linguist's intuition
- I never use quiver/quake with a direct object.
- I am a native speaker of this language.
- All native speakers have a common mental grammar or competence (Chomsky).
- Therefore, my mental grammar is the same as everyone else's.
- Therefore, my intuition accurately reflects English speakers' competence.
- Therefore, quiver/quake are intransitive.
- NB: The above is a gross simplification! E.g. linguists often rely on judgements elicited from other native speakers.
11Another possible methodology
- This one relies on data
- I may never use quiver/quake with a direct object, but other people might
- Therefore, I'll get my hands on a large sample of written and/or spoken English and check.
12Quiver/quake: the corpus linguist's answer
- A study by Atkins and Levin (1995) found that quiver and quake do occur in transitive constructions
- the insect quivered its wings
- it quaked his bowels (with fear)
- They used a corpus of 50 million words to find examples of the verbs.
- With sufficient data, you can find examples that your own intuition won't give you
13Example II: lexical semantics
- Quasi-synonymous lexical items exhibit subtle differences in context.
- strong
- powerful
- A fine-grained theory of lexical semantics would
benefit from data about these contextual cues to
meaning.
14Example II (continued)
- Some differences between strong and powerful (source: British National Corpus)
- strong
- powerful
- The differences are subtle, but examining their
collocates helps.
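One way to examine collocates is to count which words occur immediately after each adjective. The sketch below is only an illustration: the sentences are invented toy data (not BNC extracts), and the function name `right_collocates` is hypothetical.

```python
from collections import Counter

# Toy sentences invented for illustration; they are NOT taken from the BNC.
TOY_SENTENCES = [
    "she made a strong cup of tea",
    "he drinks strong tea every morning",
    "they bought a powerful car",
    "it is a powerful car engine",
    "there was strong support for the plan",
]


def right_collocates(sentences, target):
    """Count the words that occur immediately after `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        # Pair every token with the token that follows it.
        for current, following in zip(tokens, tokens[1:]):
            if current == target:
                counts[following] += 1
    return counts


strong = right_collocates(TOY_SENTENCES, "strong")
powerful = right_collocates(TOY_SENTENCES, "powerful")
print(strong.most_common())
print(powerful.most_common())
```

With a real corpus the same counts would be computed over millions of tokens, and a wider window than one word to the right would usually be used.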
15Some preliminary definitions
- The second approach is typical of the corpus-based methodology
- Corpus: A large, machine-readable collection of texts.
- Often, in addition to the texts themselves, a corpus is annotated with relevant linguistic information.
- Corpus-based methodology: An approach to Natural Language analysis that relies on generalisations made from data.
16Example (British National Corpus)
- British National Corpus (BNC)
- 100 million words of English
- 90% written, 10% spoken
- Designed to be representative and balanced.
- Texts from different genres (literature, news, academic writing)
- Annotated: Every single word is accompanied by part-of-speech information.
17Example (continued)
- A sentence in the BNC
- Explosives found on Hampstead Heath.
- <s>
- <w NN2>Explosives
- <w VVD>found
- <w PRP>on
- <w NP0>Hampstead
- <w NP0>Heath
- <PUN>.
18Example (continued)
- Explosives found on Hampstead Heath
- <s> (new sentence)
- <w NN2>Explosives (plural noun)
- <w VVD>found (past tense verb)
- <w PRP>on (preposition)
- <w NP0>Hampstead (proper noun)
- <w NP0>Heath (proper noun)
- <PUN>. (punctuation)
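To show how such annotation can be read by a program, here is a minimal sketch that turns the tagged sentence into (word, tag) pairs. It assumes the simplified layout used on the slide (`<w TAG>word`, one token per line, `<s>` marking a new sentence); the actual BNC files use richer markup, and the names `TOKEN_PATTERN` and `parse_tagged` are hypothetical.

```python
import re

# Simplified rendering of the slide's example, one token per line.
TAGGED_SENTENCE = """<s>
<w NN2>Explosives
<w VVD>found
<w PRP>on
<w NP0>Hampstead
<w NP0>Heath
<PUN>."""

# Matches "<w TAG>word" as well as the bare "<PUN>." form shown on the slide.
# Tags are upper-case letters and digits, so the lower-case <s> marker is skipped.
TOKEN_PATTERN = re.compile(r"<(?:w\s+)?([A-Z0-9]+)>(\S+)")


def parse_tagged(text):
    """Return a list of (word, tag) pairs found in the tagged text."""
    return [(word, tag) for tag, word in TOKEN_PATTERN.findall(text)]


tokens = parse_tagged(TAGGED_SENTENCE)
for word, tag in tokens:
    print(f"{word}\t{tag}")
```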
19Important to note
- This is not raw text.
- Annotation means we can search for particular patterns.
- E.g. for the quiver/quake study: find all occurrences of quiver which are verbs, followed by a determiner and a noun
- The collection is very large
- Only in very large collections are we likely to find rare occurrences.
- Corpus search is done by computer. You can't trawl through 100 million words manually!
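The quiver/quake search described above can be sketched as a scan over tagged tokens. Everything here is an assumption made for illustration: the tokens are hand-tagged toy data rather than corpus output, the tags follow the BNC style (VVD, AT0, DPS, DT0, NN1, NN2), matching on the prefix "quiver" is a crude stand-in for proper lemmatisation, and `find_verb_det_noun` is a hypothetical name.

```python
# Hand-tagged toy data as (word, tag) pairs; not actual corpus output.
TAGGED = [
    ("the", "AT0"), ("insect", "NN1"), ("quivered", "VVD"),
    ("its", "DPS"), ("wings", "NN2"), (".", "PUN"),
    ("the", "AT0"), ("leaves", "NN2"), ("quivered", "VVD"),
    ("in", "PRP"), ("the", "AT0"), ("wind", "NN1"), (".", "PUN"),
]

# Tags treated as determiners in this sketch (article, possessive, general).
DETERMINER_TAGS = {"AT0", "DPS", "DT0"}


def find_verb_det_noun(tokens, stem):
    """Find verb forms starting with `stem` followed by a determiner and a noun."""
    hits = []
    for i in range(len(tokens) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tokens[i:i + 3]
        is_target_verb = w1.lower().startswith(stem) and t1.startswith("VV")
        if is_target_verb and t2 in DETERMINER_TAGS and t3.startswith("NN"):
            hits.append((w1, w2, w3))
    return hits


hits = find_verb_det_noun(TAGGED, "quiver")
print(hits)
```

Only the first occurrence of "quivered" is followed by a determiner and a noun, so the intransitive use ("quivered in the wind") is not returned.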
20The practical objections
- But we're linguists, not computer scientists! Do I have to write programs?
- No, there are literally dozens of available tools to search in a corpus.
- Are all corpora good for all purposes?
- No. Some are general-purpose, like the BNC. Others are designed to address specific issues.
21The theoretical objections
- What guarantee do we have that the texts in our corpus are good data, quality texts, written by people we can trust?
- How do I know that what I find isn't just a small, exceptional case? E.g. quiver in a transitive construction could really be a one-off!
- Just because there are a few examples of something doesn't mean that all native speakers use a certain construction!
- Do we throw intuition out of the window?
22Part 2
- A brief history of corpus linguistics
23Language and the cognitive revolution
- Before the 1950s, the linguist's task was
- to collect data about a language
- to make generalisations from the data (e.g. In Maltese, the verb always agrees in number and gender with the subject NP)
- The basic idea: language is out there, the sum total of things people say and write.
- After the 1950s
- the so-called cognitive revolution
- language treated as a mental phenomenon
- no longer about collecting data, but explaining
what mental capabilities speakers have
24The 19th and early 20th century
- Many early studies relied on corpora.
- Language acquisition research was based on collections of child data.
- Anthropologists collected samples of unknown languages.
- Comparative linguists used large samples from different languages.
- A lot of work was done on frequencies
- frequency of words
- frequency of grammatical patterns
- frequency of different spellings
- All of this was interrupted around 1955.
25Chomsky and the cognitive turn
- Chomsky (1957) was primarily responsible for the new, cognitive view of language.
- He distinguished (1965)
- Descriptive adequacy: describing language, making generalisations such as X occurs more often than Y
- Explanatory adequacy: explaining why some things are found in a language, but not others, by appealing to speakers' competence, their mental grammar
- He made several criticisms of corpus-based approaches.
26Criticisms of corpora (I)
- Competence vs. performance
- To explain language, we need to focus on the competence of an idealised speaker-hearer.
- Competence: internalised, tacit knowledge of language
- Performance: the language we speak/write is not a good mirror of our knowledge
- it depends on situations
- it can be degraded
- it can be influenced by other cognitive factors
beyond linguistic knowledge
27Criticisms of corpora (II)
- Early work using corpora assumed that
- the number of sentences of a language is finite (so we can get to know everything about language if the sample is large enough)
- But actually, it is impossible to count the number of sentences in a language.
- Syntactic rules make the possibilities literally infinite
- the man in the house (NP -> NP PP)
- the man in the house on the beach (PP -> PREP NP)
- the man in the house on the beach by the lake
- ...
- So what use is a corpus? We're never going to have an infinite corpus.
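The recursion argument can be made concrete with a small generator that keeps reapplying NP -> NP PP. This is a sketch under simplifying assumptions: the prepositional phrases are fixed strings taken from the slide and are simply cycled, and the function name `noun_phrases` is hypothetical.

```python
from itertools import islice


def noun_phrases(head="the man",
                 modifiers=("in the house", "on the beach", "by the lake")):
    """Yield ever longer noun phrases by reapplying NP -> NP PP.

    The loop never terminates on its own: every phrase can be extended
    by one more prepositional phrase, so there is no longest phrase.
    """
    phrase = head
    step = 0
    while True:
        yield phrase
        # Cycle through the fixed PPs; a real grammar would build each PP
        # from PP -> PREP NP instead of reusing ready-made strings.
        phrase = f"{phrase} {modifiers[step % len(modifiers)]}"
        step += 1


# Take only the first few phrases; the generator itself is unbounded.
first_four = list(islice(noun_phrases(), 4))
for np in first_four:
    print(np)
```

However many phrases are drawn from the generator, each is longer than the last, which is why no finite sample can list them all.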
28Criticisms of corpora (III)
- A corpus is always skewed, i.e. biased in favour of certain things.
- Certain obvious things are simply never said. E.g. We probably won't find "a dog is a dog" in our corpus.
- A corpus is always partial: We will only find things in a corpus if they are frequent enough.
- A corpus is necessarily only a sample.
- Rare things are likely to be omitted from a sample.
29Criticisms of corpora (IV)
- Why use a corpus if we already know things by introspection?
- How can a corpus tell us what is ungrammatical?
- Corpora won't contain disallowed structures, because these are by definition not part of the language.
- So a corpus contains exclusively positive evidence: you only get the allowed things
- But if X is not in the corpus, this doesn't mean it's not allowed.
- It might just be rare, and your corpus isn't big enough. (Skewness)
30Refutations
- Corpora can be better than introspective evidence because
- They are public: other people can verify and replicate your results (the essence of scientific method).
- Some kinds of data are simply not available to introspection. E.g. people aren't good at estimating the frequency of words or structures.
- Skewness can itself be informative: If X occurs more frequently than Y in a corpus, that in itself is an interesting fact.
31Refutations (II)
- By the way, nobody's saying "throw introspection out the window"
- There is no reason not to combine the corpus-based and the introspection-based method.
- Many other objections can be overcome by using large enough corpora.
- Pre-1950, most corpus work was done manually, so it was error-prone.
- Machine-readable corpora mean we have a great new tool to analyse language very efficiently!
32Corpora in the late 20th Century
- Corpus linguistics enjoyed a revival with the advent of the digital personal computer.
- Kucera and Francis: the Brown Corpus, one of the first
- Svartvik: the London-Lund Corpus, which built on Brown
- These were rapidly followed by others. Today, corpora are firmly back on the linguistic landscape.
33Summary
- Introduced the notion of corpus and corpus-based research
- Gave a quick overview of the history of this methodology
- Looked at some possible objections to corpus-based methods, and some possible counter-arguments
34Next lecture
- We look more closely at some important properties of a corpus
- Machine-readability
- Balance
- Representativeness