Title: Corpus Linguistics with and for Computational Linguistics
1Corpus Linguistics with and for Computational Linguistics
- Martin Volk
- Universität Zürich
- Eurospider Information Technology AG
2Sources for linguistic information
- Introspection (own usage and judgement)
- Usage and judgement by others
- Questioning (goal-driven)
- interview
- questionnaire
- Observation ('involuntary' utterances)
- spoken utterances (→ corpora)
- written utterances (→ corpora)
3What is a corpus?
- a text collection
- a representative text collection
- a representative and structured text collection
- a representative, structured and annotated text collection
- ...
4Example
- Is 'ob' used as a preposition in German?
- Introspection
- Rothenburg ob der Tauber
- Dictionary (Wahrig, Deutsches Wörterbuch, 1996)
- preposition with dative, archaic: 'ob dem Wasserfall'
- Web: Google search for 'ob dem'
- Legend: 'Der Wilde Jäger ob dem Neuenburgersee'
- Corpus
5Corpus Examples
- CZ94: ... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...
- CZ94: Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.
- CZ94: ... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.
- 'ob' is a preposition with genitive!
- In the CZ corpus 'ob' is tagged as a preposition 21 times (obviously some of them incorrectly)
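A count like the one above can be reproduced on any PoS-tagged corpus. The following is a minimal sketch, assuming the corpus is available as a list of (token, tag) pairs and that it uses STTS-style tags (APPR for prepositions, KOUS for subordinating conjunctions). The tagset of the CZ corpus is not stated on the slide, and the sample data is invented for illustration.

```python
from collections import Counter

def tag_distribution(tagged_tokens, word):
    """Count the tags assigned to `word` (case-insensitive) in a tagged corpus."""
    target = word.lower()
    return Counter(tag for token, tag in tagged_tokens if token.lower() == target)

# Toy data, invented for illustration (not taken from the CZ corpus).
# Tags follow the STTS convention; this is an assumption.
sample = [
    ("Er", "PPER"), ("fragte", "VVFIN"), (",", "$,"), ("ob", "KOUS"),
    ("sie", "PPER"), ("kommt", "VVFIN"), (".", "$."),
    ("Euphorie", "NN"), ("ob", "APPR"), ("der", "ART"),
    ("Ergebnisse", "NN"), (".", "$."),
]

counts = tag_distribution(sample, "ob")
print("preposition:", counts["APPR"], "conjunction:", counts["KOUS"])
```

Since some of the tags are wrong, such a count is only a starting point; the hits still have to be inspected by hand.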
6History of Corpus Linguistics
- Collections of text were widely used in the 19th century and in the first half of the 20th century
- language acquisition
- orthography (letter frequency)
- field linguistics
- → American Structuralism (influential until 1960)
7History of Corpus Linguistics
- Chomsky's criticism: Speakers produce and understand infinitely many new sentences/words.
- Therefore the new research goal is to describe the underlying language faculty of a speaker (universal grammar), competence rather than performance
8History of Corpus Linguistics
- Chomsky's criticism: Every collection of texts is a collection of performance data, and so many factors contribute to it that it cannot be used to model competence.
- A corpus is necessarily skewed. Some sentences won't occur because they are obvious, false or impolite.
9History of Corpus Linguistics
- theoretical linguistics
- competence (what is grammatical?)
- introspection
- indefinitely many types, productivity
- grammatical vs. ungrammatical
- corpus linguistics
- performance (what is attested?)
- instances
- finite number of types
- degrees of grammaticality
10Corpus research in Linguistics
- Lexicography (Dictionaries)
- Grammaticography (Reference grammars)
- Learner corpora: Language acquisition
- Parallel corpora: Translation
11Construction of Corpora
- Written text is easier to obtain than spoken text. Some examples:
- Newspapers
- Fiction (e.g. fairy tales)
- Technical Literature (e.g. manuals, medicine)
- Personal letters, email
- Advertising (incl. political propaganda)
- Belief and Thought (e.g. the Bible)
12Corpora of spoken language
- Spontaneous spoken language
- recording of dialogues (e.g. telephone conversation)
- Prepared spoken language
- Public speeches (e.g. in parliament)
- Radio or TV news
- Spoken utterances must be transcribed for
linguistic research.
13Size of corpora
- Brown Corpus for English (1964, 1 million words)
- LIMAS-Corpus for German (1970, 1 million words)
- British National Corpus (1995, 100 million words)
- Cosmas corpus (2002, > 100 million words)
14Brown Corpus (1964)
- 500 texts
- out of 15 different text types
- with 2000 words each
15British National Corpus
- 90% written English, 10% spoken English
- 3209 texts
- drawn from 10 different written text types and 6 spoken text types
- with < 40'000 words each
- → multi-purpose corpus
16Other considerations
- Time frame of the corpus
- Native and non-native speakers
- Sociolinguistic variables
- Gender
- Age
- Education
- Dialect
- Social context and relationships
17Types of corpora
- Raw texts
- Automatically annotated corpora
- Texts with Part-of-Speech tags
- Partially parsed texts
- Manually annotated corpora
- Treebank
- FrameNet
18Types of Corpora
- Balanced Corpora vs. special corpora
- Spoken vs. written language
- Monolingual vs. Multilingual Corpora
- Parallel vs. comparable corpora
19Corpora in Computational Linguistics
[Diagram; labels: Corpora, annotation, learning, and Facts / Rules / Preferences]
20My Motivation for Corpus Linguistics
- Attempt to build a parser for German
- But problems with ambiguities!!
- Therefore: Learn attachment preferences from a corpus!
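To make the idea of learning attachment preferences concrete, here is a minimal sketch of a relative-frequency baseline. It is not the method used in the author's parser, which the slides do not describe. It assumes that the corpus has already been reduced to hand-annotated tuples (verb, noun, preposition, attachment), where attachment is "verb" or "noun"; the tuple format, the function names and the toy data are all hypothetical.

```python
from collections import Counter, defaultdict

def learn_preferences(examples):
    """Count, per preposition, how often the PP attaches to the verb or the noun."""
    counts = defaultdict(Counter)
    for _verb, _noun, prep, attachment in examples:
        counts[prep][attachment] += 1
    return counts

def preferred_attachment(counts, prep, default="noun"):
    """Return the more frequent attachment site for `prep`, or `default` if unseen."""
    if prep not in counts:
        return default
    return counts[prep].most_common(1)[0][0]

# Toy annotated examples, invented for illustration.
examples = [
    ("sehen", "Mann", "mit", "verb"),
    ("essen", "Pizza", "mit", "verb"),
    ("treffen", "Frau", "mit", "noun"),
    ("kaufen", "Buch", "von", "noun"),
    ("lesen", "Buch", "von", "noun"),
]

prefs = learn_preferences(examples)
print(preferred_attachment(prefs, "mit"))
print(preferred_attachment(prefs, "von"))
```

This sketch conditions only on the preposition. A realistic system would also take the verb and the nouns into account and would need a strategy for sparse counts.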
21Corpora vs. Test suites
- A test suite
- is a collection of manually constructed and selected sentences.
- is used for testing computational grammars and parsers.
- reduces the amount of testing.
- points to specific problems of the NLP system.
22Basic problems in CL
- Knowledge is missing (too little information)
- e.g. unknown words
- Ambiguities (too much information)
- e.g. in syntax: attachment preferences
23Corpora in Computational Linguistics
- Widespread use of (manually) annotated material for measuring progress!
- Some examples from COLING 2002:
- Treebanks to train and test probabilistic grammars
- Enriching treebanks with dependency information
- Automatic error detection in PoS-tagged corpora
- SENSEVAL data to train and test word sense disambiguation programs
24Possible Student Tasks
- Which German prepositions take a noun without a determiner? (e.g. pro, via)
- When is 'mit' used as an adverb? (e.g. )
- What is the distribution of separable verb prefixes in German?
- How often are relative clauses introduced with 'welche(r)'?
- How often are present participle forms used in German?
- What kind of foreign language material is in the corpus?
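A question such as the one about 'welche(r)' can be approached with a simple pattern search on raw text. The sketch below counts inflected forms of 'welch-' that directly follow a comma. This is only a heuristic and an assumption of this example: it also matches indirect questions ("Er fragte, welche ...") and misses relative clauses with a preposition before the pronoun, so the hits are candidates that need manual checking. The sample text is invented.

```python
import re
from collections import Counter

# Inflected forms: welche, welcher, welchem, welchen, welches.
WELCH_AFTER_COMMA = re.compile(r",\s+(welche[rmns]?)\b", re.IGNORECASE)

def count_welch_candidates(text):
    """Count forms of 'welch-' that follow a comma (relative clause candidates)."""
    return Counter(m.group(1).lower() for m in WELCH_AFTER_COMMA.finditer(text))

# Toy text, invented for illustration.
text = (
    "Das Programm, welches wir testen, ist neu. "
    "Die Firma, welche den Auftrag erhielt, wächst. "
    "Welche Version ist aktuell?"
)

candidates = count_welch_candidates(text)
print(sum(candidates.values()), dict(candidates))
```

With a PoS-tagged corpus the same question can be asked more precisely, because relative pronouns then carry their own tag.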
25Possible Student Tasks
- Create a small parallel corpus (e.g. with various versions of 'Alice in Wonderland' or National Geographic)
- Create a small corpus of spoken language (e.g. by transcribing one episode of 'Big Brother').
- Create a small treebank with the ANNOTATE tool.
26What corpora do we have for German?
- Raw text
- ComputerZeitung 1993-97 (about 1.3 million words per year)
- ComputerZeitung iX
- Tages-Anzeiger 2000
27Information in Tages-Anzeiger
- Date
- Category (Sport, Politics, Culture, Economics etc.)
- Author
- Title vs. Text
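Metadata of this kind makes it possible to select sub-corpora, for example by category or by year. The following sketch shows one way to represent such records. The class name, the field names and the sample articles are hypothetical, since the slides do not describe the actual storage format of the Tages-Anzeiger corpus.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    """One newspaper article with the metadata fields listed on the slide."""
    published: date
    category: str
    author: str
    title: str
    text: str

def words_per_category(articles):
    """Sum the number of whitespace-separated tokens in the text per category."""
    totals = Counter()
    for article in articles:
        totals[article.category] += len(article.text.split())
    return totals

# Toy records, invented for illustration.
articles = [
    Article(date(2000, 1, 3), "Sport", "A. Muster", "Sieg im Derby",
            "Der Klub gewann das Spiel klar."),
    Article(date(2000, 1, 3), "Politics", "B. Beispiel", "Neue Vorlage",
            "Der Rat berät heute."),
    Article(date(2000, 1, 4), "Sport", "A. Muster", "Trainer geht",
            "Der Trainer tritt zurück."),
]

totals = words_per_category(articles)
print(dict(totals))
```

Keeping the title separate from the text matters because headlines often follow their own syntax and can distort frequency counts.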
28What corpora do we have for German?
- Syntactically Annotated Text (Treebanks)
- NEGRA treebank (20'000 sentences)
- ComputerZeitung treebank (3'000 sentences)
- Text with manually corrected PoS tags
- 50'000 sentences from University speeches
- others
29The goal
- If you can walk, you can dance.
- If you can talk, you can sing.
- If you can parse, you can understand.
- (Hans Uszkoreit, COLING 2002)
30Acknowledgement
- Some slides were strongly influenced by or even copied from Anke Lüdeling's course "Introduction to Corpus Linguistics" at http://www.cl-ki.uni-osnabrueck.de/aluedeli/Corpuslinguistik.html