Title: Corpus linguistics: an introduction
1Corpus linguistics: an introduction
2Czech National Corpus
Hellenic National Corpus
Croatian National Corpus
Hrvatski nacionalni korpus
http://pkukmweb.ukm.my/iha/skblcl.pdf
3Hungarian National Corpus
Slovak National Corpus
National Corpus of Irish
Maltese National Corpus
The Reference Corpus of Polish
4The Bank of English
Lancaster-Oslo/Bergen (LOB)
British National Corpus (BNC)
International Corpus of English (ICE)
5What is a corpus?
- A collection of naturally occurring language
text, chosen to characterise a state or variety
of language (Sinclair)
- A collection of linguistic data, either written
text or a transcription of recorded data, which
can be used as a starting point for linguistic
description or as a means of verifying hypotheses
about a language (Dictionary of linguistics and
phonetics)
6What is a corpus? (II)
- Large body of evidence typically composed of
attested language use (McEnery)
- Usually a corpus is in machine-readable format
and is ideally viewable and analysable through (a
single) software package
- The word corpus comes from the Latin for "body";
the plural is corpora
7What is not a corpus
- Lists of words
- Lists of sentences produced with the purpose of
creating a corpus
- Archive: a repository of readable electronic
texts not linked in any coordinated way
(http://www.archive.org): "The Internet Archive
is building a digital library of Internet sites
and other cultural artifacts in digital form.
Like a paper library, we provide free access to
researchers, historians, scholars, and the
general public."
8What can we do with a corpus?
- Corpus-based approaches: hypotheses are checked
against a corpus
- Corpus-driven approaches: hypotheses are drawn
from the corpus
9What can we do with a corpus? (II)
- 'Alright,' said the computer Deep Thought. 'The
Answer to the Great Question...'
- 'Yes...!'
- 'Of Life, the Universe and Everything ... ' said
Deep Thought.
- 'Yes ... !'
- 'Is ...'
- 'Yes...!!!...?'
- 'Forty-two,' said Deep Thought, with infinite
majesty and calm.
- It was a long time before anyone spoke.
- 'Forty-two!' yelled someone in the audience. 'Is
that all you've got to show for seven and a half
million years' work?'
- 'I checked it very thoroughly,' said the
computer, 'and that quite definitely is the
answer. I think the problem, to be quite honest
with you, is that you've never actually known
what the question is.'
- The Hitchhiker's Guide to the Galaxy by Douglas Adams
10Fields where corpora are used
- Lexicography: to design dictionaries
- Language studies (relations between languages,
differences between genres, evolution of the
language)
- Computational linguistics (training and testing
methods)
- Language teaching (learner corpora)
- Cultural studies, psycholinguistics
11The characteristics of analysis using corpora (Biber, 1998)
- It is empirical, analysing the actual patterns
of use from natural texts
- It utilises a large and principled collection of
natural texts as the basis for analysis
- It makes extensive use of computers for
analysis, using both automatic and interactive
techniques
- It depends on both quantitative and qualitative
analytical techniques
12History
- The history has to be split into two periods:
before Chomsky and after Chomsky
- Before Chomsky, methods similar to the ones in
corpus linguistics were used (empiricism)
http://www.ling.lancs.ac.uk/monkey/ihe/linguistics/corpus1/1fra1.htm
13Early corpus linguistics
- Before Chomsky
- Computers were not available so it was difficult
to analyse large collections of text
- Studies of child language using diaries kept by
parents
- Spelling conventions in a German corpus of 11
million words
- Foreign language pedagogy
14Early corpus linguistics (II)
- All the work of early corpus linguistics was
underpinned by two fundamental, yet flawed
assumptions:
  - The sentences of a natural language are finite.
  - The sentences of a natural language can be
collected and enumerated.
- Most linguists saw the corpus as the only source
of linguistic evidence in the formation of
linguistic theories
16Chomsky
- Between 1957 and 1965 Chomsky changed the
direction of linguistics from empiricism towards
rationalism
- "Any natural corpus will be skewed. Some
sentences won't occur because they are obvious,
others because they are false, still others
because they are impolite. The corpus, if
natural, will be so wildly skewed that the
description would be no more than a mere list"
(Chomsky, 1962)
- Introspection started to be used instead
17Problems with introspection
- Naturally occurring data is observable and
verifiable by everyone.
- Introspective data is artificial.
- Human beings have only the vaguest notion of the
frequency of a construct or a word.
18The revival of corpus linguistics
- Research in corpus linguistics was continued
in small centres
- The hardware still imposed some restrictions;
the real development started only in the 1980s
- Fields like computational linguistics were not
interested in using corpora
19The revival of corpus linguistics (II)
- 1960s: Brown Corpus (at Brown University,
American English)
- 1970s: LOB corpus (British English)
- 1980s: Bank of English in Birmingham
- 1990s: BNC, LDC, ICE corpus, ELRA, TRACTOR,
ICAME
20Why bother with corpora?
- Even expert speakers have only a partial
knowledge of a language. A corpus can be more
comprehensive and balanced.
- Even expert speakers tend to notice the unusual
and think of what is possible. A corpus can show
us what is common and typical.
- Even expert speakers cannot quantify their
knowledge of language. A corpus can give us
accurate statistics.
21Why bother with corpora? (II)
- Even expert speakers cannot remember everything
they know. A corpus can store and recall all the
information that has been input.
- Even expert speakers cannot make up natural
examples. A corpus can provide us with a vast
number of real examples.
- Even expert speakers have prejudices and
preferences, and every language has cultural
connotations and underlying ideology. A corpus can
give you more objective evidence.
22Why bother with corpora? (III)
- Even expert speakers are not always available to
be consulted. A corpus can be made permanently
accessible to all.
- Even expert speakers cannot keep up with language
change. A constantly updated corpus can reflect
even recent changes in the language.
- Even expert speakers lack authority: they can be
challenged by other expert speakers. A corpus can
encompass the actual language use of many expert
speakers.
23Parameters of a corpus
- Language
  - Monolingual
  - Multilingual (comparable corpora)
  - Parallel
- Type of source
  - Written
  - Spoken
  - Mix
24Parameters of a corpus (II)
- Size: the size of the corpus is not all-important
and depends very much on the type of texts used
- Annotated/not annotated (type of encoding used:
plain text, SGML/XML encoded)
- Static corpus/monitor corpus
- Corpus/sub-corpus
- Number of words/types
25Type/token ratio
- From Brown corpus: 1m tokens (written only) -
50,406 types
- From 1980s Birmingham/Cobuild corpora: 1m tokens
(spoken only) - 36,807 types - 17,459 occur only
once
- NB: fewer types than Brown (written only);
spoken language is more repetitive, a smaller
vocabulary is used
- 4m tokens (Times newspapers only) - 122,773 types
- 54,144 occur only once
- 18m tokens (general corpus) - 228,323 types -
131,299 occur only once
26Type/token ratio (II)
- 121m tokens (general corpus) - 475,633 types -
213,684 occur only once
- 211m tokens (general corpus) - 638,901 types
- 323m tokens (general corpus) - 812,467 types
- 418m tokens (general corpus) - 938,914 types -
438,647 occur only once
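The three figures reported on these slides (tokens, types, and words that occur only once) can be computed for any text. The following is a minimal sketch, assuming a deliberately crude tokeniser (lower-cased runs of letters and apostrophes); the corpora cited above will have used their own tokenisation rules, so their counts would not be reproduced exactly. The function name `type_token_stats` is hypothetical.

```python
import re
from collections import Counter

def type_token_stats(text):
    """Count tokens, types and once-only words, and compute the type/token ratio."""
    # Assumption: a token is a run of letters or apostrophes, case-folded.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    once_only = sum(1 for freq in counts.values() if freq == 1)
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "once_only": once_only,
        "ttr": len(counts) / len(tokens) if tokens else 0.0,
    }

sample = "the cat sat on the mat and the dog sat too"
stats = type_token_stats(sample)
print(stats)
```

Because the ratio falls as a corpus grows (new tokens keep arriving faster than new types), ratios are only comparable between samples of similar size, which is why the slides pair the 1m-token Brown sample with a 1m-token spoken sample.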
27Ways to exploit a corpus
- Word (token) / types frequency lists
- N-grams
- Concordances
- Collocations/colligations
- Specially designed programs (especially when the
corpus is annotated)
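Of the techniques listed, n-grams are the simplest to illustrate: an n-gram is a sequence of n adjacent tokens. A minimal sketch, assuming the text is already split into tokens (whitespace splitting is used here purely for illustration; `ngrams` is a hypothetical helper name):

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n adjacent tokens as a tuple, in text order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "to be or not to be".split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```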
28Frequency lists
- are lists which indicate the words which appear
in a corpus and their frequency
- they provide a survey of the corpus
- a frequency list becomes more meaningful when
compared with other lists
- they remove a word from its contexts
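Comparing frequency lists only makes sense if the counts are normalised, since corpora differ in size. A minimal sketch, assuming whitespace tokenisation and a normalisation base of 1,000 tokens (both are illustrative choices, and `relative_frequencies` is a hypothetical helper name):

```python
from collections import Counter

def relative_frequencies(text, per=1000):
    """Map each word to its frequency per `per` tokens of the text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    return {word: freq * per / len(tokens) for word, freq in counts.items()}

written = "the results of the analysis are shown in the table"
spoken = "well you know the thing is you know it is fine"

written_freq = relative_frequencies(written)
spoken_freq = relative_frequencies(spoken)

# Compare the same word across the two lists.
print("the:", written_freq["the"], "vs", round(spoken_freq["the"], 1))
```

With real corpora the usual base is per million words, and the two toy sentences here stand in for a written and a spoken sample.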
29Concordances
- show words in the context in which they appear
- usually they are obtained using special programs
which allow the lists of concordances to be
manipulated
- KWIC (Key Word In Context) is the most common
format
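A KWIC display lines up every occurrence of the keyword with a few words of context on each side. A minimal sketch, assuming a pre-tokenised text and a context measured in tokens rather than characters (dedicated concordancers offer sorting and much more; `kwic` and its parameters are hypothetical names):

```python
def kwic(tokens, keyword, width=3):
    """Return one formatted line per occurrence of keyword, with context."""
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25}  [{token}]  {right}")
    return lines

text = ("a corpus can show us what is common and "
        "a corpus can give us accurate statistics")
concordance = kwic(text.split(), "corpus", width=2)
for line in concordance:
    print(line)
```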
30Collocations
- collocation: the occurrence of two or more
words within a short space of each other in text
- the collocates are extracted using a window to
the left and right of a specified word
- can be used to further analyse the context of a
word
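The window-based extraction described above can be sketched as follows. This is an illustration under simple assumptions: tokens are matched exactly, the window is counted in tokens, and only raw co-occurrence counts are returned, whereas collocation studies normally go on to rank candidates with an association measure. The name `collocates` is hypothetical.

```python
from collections import Counter

def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens either side of `node`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts

tokens = "strong tea and strong coffee but never powerful tea".split()
tea_collocates = collocates(tokens, "tea", window=2)
print(tea_collocates.most_common())
```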
31The word learning
32Building corpora
- Ways to acquire corpora
  - Direct conversion from electronic format
  - Optical scanning
  - Keyboarding
  - Speech transcription
33Building corpora (II)
- Criteria in corpus design
  - Size (small corpora are for genre-specific
studies, whereas big corpora support robust, general
statements about a language)
  - Genre (domain, distribution, age, ...)
- The structure of the corpus can be decided
  - A priori (Brown, LOB, ...)
  - A posteriori
- Old material is replaced with new material (monitor
corpus)
34Building corpora (III)
- Selection, permission, acquisition
- Data conversion, optical scanning, keyboarding,
speech transcription
- Cleaning, spell-checking, encoding (annotation),
indexing
- Writing documentation
- Evaluation of corpora
- Distribution of corpora
35Possible problems when building a corpus
- A sampling frame designed to allow the
exploitation of certain linguistic properties
- Balance and representativeness
- Information can be lost through cleaning
- Duplication
- When working with speech, information can be lost
through transcription
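Of the problems listed, duplication is the one most easily checked mechanically. A minimal sketch, assuming that two documents count as duplicates when they are identical after case-folding and whitespace normalisation; near-duplicates (boilerplate differences, small edits) would need a fuzzier comparison. The function name and the file names are hypothetical.

```python
import hashlib

def find_duplicates(documents):
    """Return (duplicate_id, first_seen_id) pairs for identical documents."""
    seen = {}
    duplicates = []
    for doc_id, text in documents.items():
        # Normalise case and whitespace so trivial differences are ignored.
        normalised = " ".join(text.lower().split())
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates.append((doc_id, seen[digest]))
        else:
            seen[digest] = doc_id
    return duplicates

docs = {
    "a.txt": "The Web as a corpus.",
    "b.txt": "the web  as a corpus.",
    "c.txt": "Something else entirely.",
}
print(find_duplicates(docs))
```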
36Web as a corpus
- The Web can be a very useful source of texts
- The Web is very helpful for languages other than
English
- Quite often there is no control over the language
which is investigated, therefore filtering (if
possible) is necessary
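The slides do not say how the filtering is done. One simple possibility, offered here only as an assumption, is to keep a page when enough of its tokens are common function words of the target language. The word list, the threshold and both function names below are illustrative, not taken from the source.

```python
def share_of_known_words(text, function_words):
    """Fraction of the tokens in text that appear in function_words."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    return sum(1 for t in tokens if t in function_words) / len(tokens)

def keep_page(text, function_words, threshold=0.2):
    """Keep a page if its share of target-language function words is high enough."""
    return share_of_known_words(text, function_words) >= threshold

# Tiny illustrative list; a real filter would use a much longer one.
english_function_words = {"the", "of", "and", "to", "in", "is", "that", "it"}

pages = [
    "the size of the corpus is important",
    "korpus je zbirka tekstova",
]
kept = [p for p in pages if keep_page(p, english_function_words)]
print(kept)
```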
40Corpus annotation
- Enrichment of a corpus with various types of
information
- It can be done at every level
  - Word: part of speech, sense
  - Sentence: sentence boundaries, syntactic tree
  - Discourse: coreferential chains, discourse
segments
  - Certain expressions: named entities
41Annotation scheme
- A standard used to annotate certain
characteristics
- Gives meaning to a tag
- Nowadays it is usually in XML
- Usually, in addition to an annotation scheme, a
set of guidelines is produced to assist the
annotation
42Examples (II)
- <P><S><W POS="PRON" NUM="PL" LEMMA="we">We</W>
<W POS="V" LEMMA="have">have</W>
<W POS="EN" LEMMA="develop">developed</W>
<NP><W POS="DET" LEMMA="a">a</W>
<W POS="A" LEMMA="computational">computational</W>
<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>
<W POS="PUNCT">,</W> ...</NP> ... </S></P>
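Once annotation is encoded in XML, standard tools can query it. A minimal sketch using Python's standard library, assuming a well-formed fragment with the element names (P, S, NP, W) and attributes (POS, NUM, LEMMA) of the example above; the elided parts of the example are left out rather than guessed, so the string below is a shortened illustration, not the original sentence.

```python
import xml.etree.ElementTree as ET

annotated = (
    '<P><S>'
    '<W POS="PRON" NUM="PL" LEMMA="we">We</W>'
    '<W POS="V" LEMMA="have">have</W>'
    '<W POS="EN" LEMMA="develop">developed</W>'
    '<NP>'
    '<W POS="DET" LEMMA="a">a</W>'
    '<W POS="A" LEMMA="computational">computational</W>'
    '<W POS="N" NUM="SG" LEMMA="paradigm">paradigm</W>'
    '</NP>'
    '</S></P>'
)

root = ET.fromstring(annotated)

# iter("W") visits every word element in document order, however deeply nested.
words = [(w.text, w.get("POS"), w.get("LEMMA")) for w in root.iter("W")]
nouns = [form for form, pos, _ in words if pos == "N"]

for form, pos, lemma in words:
    print(f"{form}\t{pos}\t{lemma}")
```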
43What are the advantages of corpus annotation?
- Ease of exploitation
- Reusability
- Multi-functionality
- Explicit analyses
- Once a corpus is annotated it can be used in
further research
44Annotation of a corpus
- Can be done automatically, semi-automatically
or manually
- Sometimes the method is automatic and the
results are then post-processed
- Usually special tools are used to minimise
human error
45Criticism of corpus annotation
- Corpus annotation produces impure corpora
- Sometimes annotation can hide certain features
- Consistency versus accuracy
- Measures to compute the reliability of an
annotation
- Sometimes the annotation scheme can cover a
phenomenon only partially.
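The slides do not name a particular reliability measure. One widely used option, chosen here as an assumption, is Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance: kappa = (observed - expected) / (1 - expected). A minimal sketch with invented labels; `cohen_kappa` is a hypothetical helper name.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (observed - expected) / (1 - expected)

# Invented part-of-speech labels for six tokens.
annotator_1 = ["N", "V", "N", "DET", "N", "V"]
annotator_2 = ["N", "V", "A", "DET", "N", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

Here the annotators agree on four of six items (0.67), chance agreement is 0.33, and kappa is therefore lower than the raw agreement figure suggests.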
46Existing corpora
- Brown Corpus/LOB corpus
- Bank of English
- Wall Street Journal, Penn Treebank, BNC, ANC,
ICE, WBE, Reuters Corpus
- Canadian Hansard: parallel corpus, English-French
- York-Helsinki Parsed Corpus of Old English Poetry
- Tiger corpus: German
- CORIS/CODIS: contemporary written Italian
- MULTEXT: 1984 and The Republic in many languages
47Distributors of corpora
- LDC (Linguistic Data Consortium)
- ELRA (European Language Resources Association)
- TRACTOR (TELRI Research Archive of Computational
Tools and Resources)
- ICAME (International Computer Archive of Modern
and Medieval English)
48References
- Karin Aijmer and Bengt Altenberg (1991) English
corpus linguistics, Longman
- Douglas Biber, Susan Conrad and Randi Reppen (1998)
Corpus linguistics, Cambridge University Press
- Graeme D. Kennedy (1998) An introduction to
corpus linguistics, Longman
- Tony McEnery and Andrew Wilson (1996) Corpus
linguistics, Edinburgh University Press
49References (II)
- Geoff Barnbrook (1996) Language and Computers,
Edinburgh University Press
- Tony McEnery (2003) Corpus linguistics. In
Ruslan Mitkov (ed.) The Oxford Handbook of
Computational Linguistics, Oxford University
Press