Title: Corpus Linguistics and Ontologies
1Corpus Linguistics and Ontologies
- Martin Volk
- Stockholm University
2Overview of today's lecture
- What is Corpus Linguistics?
- Why do Computational Linguists work with corpora?
- What is an Ontology?
- Why do Computational Linguists work with / on Ontologies?
- Your task for next week!
3Sources for linguistic information
- Introspection (own usage and judgement)
- Usage and judgement by others
- Questioning (goal-driven)
- interview
- questionnaire
- Observation ('involuntary' utterances)
- spoken utterances (→ corpora)
- written utterances (→ corpora)
4What is a corpus?
- a text collection
- a representative text collection
- a representative and structured text collection
- a representative, structured and annotated text collection
- ...
5Example
- 'ob' is a conjunction in German.
- Is 'ob' also used as a preposition?
- Introspection
- Rothenburg ob der Tauber
- Dictionary (Wahrig. Deutsches Wörterbuch. 1996): "Präp. mit Dativ, veraltet" (preposition with dative, obsolete), e.g. 'ob dem Wasserfall'
- Web: Google 'ob dem'
- Sage (legend): 'Der Wilde Jäger ob dem Neuenburgersee'
- Corpus
6Corpus Examples
- CZ_94: ... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...
- CZ_94: Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.
- CZ_94: ... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.
- 'ob' is a preposition with genitive!
- In the CZ corpus, 'ob' is tagged as a preposition 21 times (but some of these tags are incorrect).
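The corpus lookup behind these examples can be sketched as a small keyword-in-context (KWIC) search. This is only an illustration under stated assumptions: it runs over the three CZ_94 sentences quoted on the slide (not the full corpus), and the function names `kwic` and `next_word` are hypothetical.

```python
import re

# The three CZ_94 examples quoted on the slide (ellipses as in the original).
SENTENCES = [
    "... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...",
    "Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.",
    "... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.",
]


def kwic(sentences, keyword, width=25):
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for sentence in sentences:
        for m in pattern.finditer(sentence):
            left = sentence[max(0, m.start() - width):m.start()]
            right = sentence[m.end():m.end() + width]
            hits.append((left, m.group(), right))
    return hits


def next_word(right_context):
    """First word after the keyword, e.g. the article that hints at the case."""
    words = right_context.split()
    return words[0] if words else ""


hits = kwic(SENTENCES, "ob")
for left, match, right in hits:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>25} [{match}] {right}")

following = [next_word(right) for _, _, right in hits]
```

The word after 'ob' is what a linguist would inspect: 'des' is unambiguously genitive, while 'der' needs the following noun to decide ('der Ergebnisse' is a genitive plural). A search over raw text cannot separate the preposition from the conjunction 'ob'; that distinction needs Part-of-Speech tags.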
7What is corpus linguistics?
- Source: http://www.engl.polyu.edu.hk/corpuslinguist/corpus.htm
- Corpus linguistics is simply the study of
language through corpus-based research, but it
differs from traditional linguistics in its
insistence on the systematic study of authentic
examples of language in use.
- Text linguistics vs corpus linguistics
- Illustration vs evidence
- Introspection and informant testing vs
observation of text
8Corpus research in Linguistics
- Lexicography (Dictionaries)
- collocations → idioms, support verb units
- Grammaticography (Reference grammars)
- Learner corpora: language acquisition
- Parallel corpora: translation
- Historical corpora: language development
9Construction of Corpora
- Written text is easier to obtain than spoken text. Some examples:
- Newspapers
- Fiction (e.g. fairy tales)
- Technical Literature (e.g. manuals, medicine)
- Personal letters, email
- Advertising (incl. political propaganda)
- Belief and Thought (e.g. the Bible)
10Corpora of spoken language
- Spontaneous spoken language
- recordings of dialogues (e.g. telephone conversations)
- Prepared spoken language
- Public speeches (e.g. in parliament)
- Radio or TV news
- Spoken utterances must be transcribed for
linguistic research.
11Brown Corpus (1964)
- 1 million words
- 500 texts
- out of 15 different text types
- with 2000 words each
12British National Corpus (BNC)
- 1995, 100 million words
- 90% written English, 10% spoken English
- 3209 texts
- out of 10 different text types written and
- 6 text types spoken
- with < 40,000 words each
- → multi-purpose corpus
13The Stockholm-Umeå-Corpus (SUC)
- a one-million word corpus
- with 500 text samples
- from different text genres
- with Part-of-Speech information,
- morpho-syntactic information,
- and base forms,
- all manually checked.
- Example sentence: SUC_example_sent.htm
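A token in a corpus with Part-of-Speech tags, morpho-syntactic information and base forms can be pictured as a small record. The sketch below is an assumption-laden illustration: the sentence is invented (it is not taken from SUC), the class name `Token` is hypothetical, and the tag strings are written in the style of the SUC tagset and should be checked against the corpus documentation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    form: str   # word form as it appears in the text
    pos: str    # Part-of-Speech tag
    morph: str  # morpho-syntactic features (empty if the tag has none)
    lemma: str  # base form


# Invented example sentence, NOT from SUC; tags are illustrative assumptions.
SENTENCE = [
    Token("Vi", "PN", "UTR PLU DEF SUB", "vi"),
    Token("arbetar", "VB", "PRS AKT", "arbeta"),
    Token("alltid", "AB", "", "alltid"),
    Token("med", "PP", "", "med"),
    Token("myndigheten", "NN", "UTR SIN DEF NOM", "myndighet"),
]

# Typical first queries on an annotated corpus: tag frequencies and lemma lists.
pos_counts = Counter(token.pos for token in SENTENCE)
lemmas = [token.lemma for token in SENTENCE]

for token in SENTENCE:
    print(f"{token.form}\t{token.pos}\t{token.morph}\t{token.lemma}")
```

Keeping the base form separate from the word form is what lets a query for 'myndighet' also find 'myndigheten'.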
14Types of corpora
- Raw texts
- Automatically annotated corpora
- Texts with Part-of-Speech tags
- Partially parsed texts
- Manually annotated corpora
- Treebank (e.g. the Penn Treebank)
- FrameNet
15Types of Corpora
- Balanced Corpora vs. special corpora
- Spoken vs. written language
- Monolingual vs. Multilingual Corpora
- Parallel vs. comparable corpora
16Parallel Corpora
- Translated texts that are aligned on the sentence level.
- Example from the Europarl corpus
17Europarl examples
- Swedish-English
- the preposition med
- the verb uppmana
- the noun myndighet
- the adverb alltid
- Swedish-French
- the noun myndighet
- the adverb alltid
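Looking up a word in a sentence-aligned parallel corpus can be sketched as follows. The sentence pairs are invented toy data, not actual Europarl sentences, and the names `ALIGNED` and `find_aligned` are hypothetical; a real corpus would be read from aligned files.

```python
import re

# Toy Swedish-English sentence pairs, invented for illustration only.
ALIGNED = [
    ("Jag vill uppmana kommissionen att agera.",
     "I would urge the Commission to act."),
    ("Vi måste alltid arbeta med denna myndighet.",
     "We must always work with this authority."),
    ("Parlamentet bör uppmana rådet att svara.",
     "Parliament should call on the Council to reply."),
]


def tokens(sentence):
    """Lower-cased word tokens; punctuation is dropped."""
    return re.findall(r"\w+", sentence.lower())


def find_aligned(pairs, word):
    """Return all (source, target) pairs whose source side contains `word`."""
    word = word.lower()
    return [(src, tgt) for src, tgt in pairs if word in tokens(src)]


for query in ("uppmana", "alltid", "med", "myndighet"):
    print(query)
    for src, tgt in find_aligned(ALIGNED, query):
        print("   ", src, "|", tgt)
```

In this toy data the verb 'uppmana' shows up next to two different English renderings ('urge' and 'call on'), which is the kind of translation variation a parallel corpus makes visible. The lookup matches exact word forms only; finding inflected forms would need base forms.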
18Corpora in Computational Linguistics
- [Diagram with the labels: Corpora; annotation; learning; Facts, Rules, Preferences]
19The goal of Natural Language Processing in CL
- Build a system that simulates human language understanding
- with input: a natural language utterance
- with output: answers to questions as to
- who did (or will do) what to whom, when, where and why.
- NLP is not necessarily concerned with how people understand language → Cognitive Linguistics
20Basic problems in CL
- Knowledge is missing (too little information)
- e.g. unknown words
- e.g. unknown relations between concepts
- Ambiguities (too much information)
- e.g. in syntax: attachment preferences
- e.g. in semantics: word senses
21Acknowledgement
- Some slides were highly influenced by or even
copied from Anke Lüdeling's course "Introduction
to Corpus Linguistics" at
http://www.cl-ki.uni-osnabrueck.de/aluedeli/Corpuslinguistik.html
22Take a breath
- Questions on Corpus Linguistics?
- Still to come:
- An introduction to Ontologies
23What is an ontology?
- An ontology holds information about what
categories exist in the domain, what properties
they have, and how they are related to one
another. (Chandrasekaran et al. 1999)
24Types of ontologies
- Top-level ontology
- Lexical ontology (e.g. WordNet)
- WordNet is a lexical database for English (with synonyms and hyperonyms)
- General ontology (e.g. Cyc)
- Cyc is a formalized representation of fundamental
human knowledge: facts, rules of thumb, and
heuristics for reasoning about the objects and
events of everyday life
- Domain ontology (e.g. UMLS)
- The Unified Medical Language System
- Task ontology (e.g. CPV)
- The Common Procurement Vocabulary
26Entities in Ontologies
- Concepts
- Objects
- Events and processes
- Properties
- Attributes
- Relations (hold between entities)
27Hierarchies in Ontologies
- Type hierarchy (type-subtype, isa, hypernym-hyponym)
- Cat isa Animal.
- Part hierarchy (part-whole, is-part-of, meronymy)
- Finger is a part of Hand.
- Broader-narrower hierarchy (mixes part hierarchy
and type hierarchy)
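The three hierarchies can be sketched with plain dictionaries. 'Cat isa Animal' and 'Finger is a part of Hand' come from the slide; the extra entries ('Animal isa LivingThing', 'Hand part of Arm') and all function names are assumptions added only to show transitivity. Each node has a single parent here, which is a simplification.

```python
# child -> parent links; entries beyond the two slide examples are invented.
ISA = {"Cat": "Animal", "Animal": "LivingThing"}
PART_OF = {"Finger": "Hand", "Hand": "Arm"}


def ancestors(hierarchy, node):
    """Follow parent links upwards and return every node reached, in order."""
    chain = []
    seen = {node}
    while node in hierarchy:
        node = hierarchy[node]
        if node in seen:  # guard against accidental cycles in the data
            break
        seen.add(node)
        chain.append(node)
    return chain


def is_a(sub, sup):
    """Type hierarchy: is `sub` a (transitive) subtype of `sup`?"""
    return sup in ancestors(ISA, sub)


def is_part_of(part, whole):
    """Part hierarchy: is `part` a (transitive) part of `whole`?"""
    return whole in ancestors(PART_OF, part)


def broader_terms(node):
    """Broader-narrower hierarchy: mixes type and part links."""
    return sorted(set(ancestors(ISA, node)) | set(ancestors(PART_OF, node)))


print(is_a("Cat", "LivingThing"))
print(is_part_of("Finger", "Arm"))
print(broader_terms("Finger"))
```

Keeping the two relations in separate tables preserves the distinction that a broader-narrower hierarchy gives up: a Finger is part of a Hand, but it is not a kind of Hand.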
28Ontology Terminology (Michael Denny)
- The following are synonym classes:
- Concept, class, category, type, term, entity, set and thing.
- Instance, individual, resource, extension, description, object and entity.
- Relation, relationship, property, function, role,
slot, attribute, association, criterion,
constraint of, feature and predicate.
29Why ontologies in CL?
- Give structure to knowledge.
- Improve human knowledge access.
- Model human associations.
- Improve the computer's knowledge processing.
- Get closer to the meaning of an utterance.
- Get closer to the goal of NLP.
30Ontology and its neighbors
- Lexicon
- A list of words is not an ontology.
- But cross-references (see also) or reading
distinctions can be seen as ontological relations.
- Thesaurus (? is an ontology)
- a synonym (and hyperonym) lexicon.
- general vocabulary (not domain specific).
- mostly atomic concepts (single word entries).
- Terminology database (? is an ontology; Siemens TermDB)
- domain-specific vocabulary
- most often multi-lingual
- with definitions, context examples and source documentation
- single word and complex entries
31Ontology and its neighbors
- Encyclopedia
- A collection of short descriptions about people,
geographical entities, and popular terms.
- with cross-references.
- Knowledge base
- repository of formal knowledge representation.
- facts and rules (in Prolog).
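A knowledge base of facts and rules can be sketched in a few lines. The slide mentions Prolog; the sketch below uses Python instead and applies a single hand-written rule by naive forward chaining. All facts, relation names and function names are invented for illustration.

```python
# Facts as (relation, subject, object) triples; all invented for illustration.
FACTS = {
    ("isa", "cat", "animal"),
    ("isa", "animal", "living_thing"),
    ("has_property", "living_thing", "is_alive"),
    ("has_property", "animal", "can_move"),
}


def apply_rule(facts):
    """One pass of the inheritance rule.

    In Prolog notation: has_property(X, P) :- isa(X, Y), has_property(Y, P).
    """
    derived = set()
    for rel1, x, y in facts:
        if rel1 != "isa":
            continue
        for rel2, owner, prop in facts:
            if rel2 == "has_property" and owner == y:
                derived.add(("has_property", x, prop))
    return derived


def forward_chain(facts):
    """Apply the rule repeatedly until no new fact can be derived."""
    known = set(facts)
    while True:
        new = apply_rule(known) - known
        if not new:
            return known
        known |= new


KB = forward_chain(FACTS)
for fact in sorted(KB - FACTS):
    print(fact)
```

The derived facts (for example that a cat is alive) were never stated explicitly; they follow from the rule. A Prolog system would typically answer such a query by backward chaining rather than by materialising all facts first.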
32Related terms to be explained later
- The semantic web
- Knowledge management
- Text mining
- Dublin Core
- RDF, OWL