Corpus Linguistics and Ontologies - PowerPoint PPT Presentation

Description:

Why do Computational Linguists work with corpora? What is an ... Sage: Der Wilde Jäger ob dem Neuenburgersee. Corpus. 7/22/09. Martin Volk. 6. Corpus Examples ...

Slides: 33
Provided by: marti85

1
Corpus Linguistics and Ontologies
  • Martin Volk
  • Stockholm University

2
Overview of today's lecture
  • What is Corpus Linguistics?
  • Why do Computational Linguists work with corpora?
  • What is an Ontology?
  • Why do Computational Linguists work with / on
    Ontologies?
  • Your task for next week!

3
Sources for linguistic information
  • Introspection (own usage and judgement)
  • Usage and judgement by others
  • Questioning (goal-driven)
  • interview
  • questionnaire
  • Observation ('involuntary' utterances)
  • spoken utterances (→ corpora)
  • written utterances (→ corpora)

4
What is a corpus?
  • a text collection
  • a representative text collection
  • a representative and structured text collection
  • a representative, structured and annotated text
    collection
  • ...

5
Example
  • 'ob' is a conjunction in German.
  • Is 'ob' also used as a preposition?
  • Introspection: Rothenburg ob der Tauber
  • Dictionary (Wahrig, Deutsches Wörterbuch, 1996):
    "Präp. mit Dativ, veraltet" (preposition with dative,
    archaic), e.g. ob dem Wasserfall
  • Web: Google search for 'ob dem'
  • Sage (legend): Der Wilde Jäger ob dem Neuenburgersee
  • Corpus

6
Corpus Examples
  • CZ_94: ... fiel schier vom Stuhl ob der Äusserung
    eines Ozeanologen ...
  • CZ_94: Bei manchem Ölgiganten kam ob der
    Ergebnisse gar Euphorie auf.
  • CZ_94: ... rieben sich vergnügt die Hände ob des
    zu erwartenden Schlagabtauschs.
  • → 'ob' is a preposition with genitive!
  • In the CZ corpus, 'ob' is tagged as a preposition 21
    times (though some of these tags are incorrect).
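
The corpus lookup above can be sketched as a small concordance search. This is only an illustration, not the tool used for the CZ corpus: the three lines are the ones quoted on the slide (elisions kept as given), while the pattern, the function name `concordance` and the context width are assumptions made for this sketch.

```python
import re
from collections import Counter

# Example lines quoted on the slide (CZ_94); elisions ("...") are kept as given.
LINES = [
    "... fiel schier vom Stuhl ob der Äusserung eines Ozeanologen ...",
    "Bei manchem Ölgiganten kam ob der Ergebnisse gar Euphorie auf.",
    "... rieben sich vergnügt die Hände ob des zu erwartenden Schlagabtauschs.",
]

# 'ob' as a whole word, followed by a form of the definite article.
PATTERN = re.compile(r"\bob\s+(dem|der|des)\b")


def concordance(lines, pattern, width=25):
    """Return (left context, match, right context, article) for every hit."""
    hits = []
    for line in lines:
        for m in pattern.finditer(line):
            left = line[max(0, m.start() - width):m.start()]
            right = line[m.end():m.end() + width]
            hits.append((left, m.group(0), right, m.group(1)))
    return hits


hits = concordance(LINES, PATTERN)
# How often each article form follows 'ob'.
article_counts = Counter(article for _, _, _, article in hits)

# Print the hits in keyword-in-context layout.
for left, match, right, _ in hits:
    print(f"{left:>25} [{match}] {right}")
```

A pattern like this finds candidates only. It cannot tell the preposition from the conjunction 'ob', and of the article forms only 'des' is unambiguously genitive, so the hits still have to be inspected (or the corpus has to be tagged).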

7
What is corpus linguistics?
http://www.engl.polyu.edu.hk/corpuslinguist/corpus.htm
  • Corpus linguistics is simply the study of
    language through corpus-based research, but it
    differs from traditional linguistics in its
    insistence on the systematic study of authentic
    examples of language in use.
  • Text linguistics vs corpus linguistics
  • Illustration vs evidence
  • Introspection and informant testing vs
    observation of text

8
Corpus research in Linguistics
  • Lexicography (Dictionaries)
  • collocations → idioms, support verb units
  • Grammaticography (Reference grammars)
  • Learner corpora: language acquisition
  • Parallel corpora: translation
  • Historical corpora: language development

9
Construction of Corpora
  • Written text is easier to obtain than spoken
    text. Some examples:
  • Newspapers
  • Fiction (e.g. fairy tales)
  • Technical Literature (e.g. manuals, medicine)
  • Personal letters, email
  • Advertising (incl. political propaganda)
  • Belief and Thought (e.g. the Bible)

10
Corpora of spoken language
  • Spontaneous spoken language
  • recording of dialogues (e.g. telephone
    conversation)
  • Prepared spoken language
  • Public speeches (e.g. in parliament)
  • Radio or TV news
  • Spoken utterances must be transcribed for
    linguistic research.

11
Brown Corpus (1964)
  • 1 million words
  • 500 texts
  • out of 15 different text types
  • with 2000 words each

12
British National Corpus (BNC)
  • 1995, 100 million words
  • 90% written English, 10% spoken English
  • 3209 texts
  • out of 10 different text types written and
  • 6 text types spoken
  • with < 40,000 words each
  • → multi-purpose corpus

13
The Stockholm-Umeå-Corpus (SUC)
  • a one-million word corpus
  • with 500 text samples
  • from different text genres
  • with Part-of-Speech information,
  • morpho-syntactic information,
  • and base forms,
  • all manually checked.
  • SUC_example_sent.htm
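
A token annotated with the layers listed above (part of speech, morpho-syntactic information, base form) could be represented roughly as follows. This is a sketch under stated assumptions: the `Token` class, the example sentence and the tag values are invented for illustration and have not been checked against SUC's actual tagset or file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    word: str   # surface form as it appears in the text
    pos: str    # part-of-speech tag
    morph: str  # morpho-syntactic features
    lemma: str  # base form


# Hypothetical annotation of "Hundarna sover." (The dogs are sleeping.);
# the tag values are illustrative only.
SENTENCE = [
    Token("Hundarna", "NN", "UTR PLU DEF NOM", "hund"),
    Token("sover", "VB", "PRS AKT", "sova"),
    Token(".", "MAD", "", "."),
]


def lemmas_with_pos(tokens, pos):
    """Return the base forms of all tokens carrying the given POS tag."""
    return [t.lemma for t in tokens if t.pos == pos]


def to_line(token):
    """Render one token as a tab-separated line: word, tag, features, base form."""
    return "\t".join([token.word, token.pos, token.morph, token.lemma])


for tok in SENTENCE:
    print(to_line(tok))
```

Keeping the layers as separate fields makes queries such as "all noun base forms" a one-line filter, e.g. `lemmas_with_pos(SENTENCE, "NN")`.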

14
Types of corpora
  • Raw texts
  • Automatically annotated corpora
  • Texts with Part-of-Speech tags
  • Partially parsed texts
  • Manually annotated corpora
  • Treebank (e.g. the Penn Treebank)
  • FrameNet

15
Types of Corpora
  • Balanced Corpora vs. special corpora
  • Spoken vs. written language
  • Monolingual vs. Multilingual Corpora
  • Parallel vs. comparable corpora

16
Parallel Corpora
  • Translated texts that are aligned on the sentence
    level.
  • Example from the Europarl corpus

17
Europarl examples
  • Swedish-English
  • the preposition med
  • the verb uppmana
  • the noun myndighet
  • the adverb alltid
  • Swedish-French
  • the noun myndighet
  • the adverb alltid
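
Looking up a word in a sentence-aligned parallel corpus can be sketched like this. The sentence pairs below are invented for illustration and are not taken from Europarl; `find_aligned` is a hypothetical helper, and the prefix match is a crude stand-in for proper lemmatisation.

```python
import re

# Invented Swedish-English sentence pairs, aligned one-to-one (not Europarl data).
ALIGNED = [
    ("Vi måste alltid respektera lagen.",
     "We must always respect the law."),
    ("Varje myndighet har sitt ansvar.",
     "Every authority has its responsibility."),
    ("Den här myndigheten arbetar alltid med kommissionen.",
     "This authority always works with the Commission."),
]


def find_aligned(pairs, source_word):
    """Return the pairs whose source side contains a word starting with source_word.

    Prefix matching catches inflected forms such as 'myndigheten',
    but it will also over-match unrelated words sharing the prefix.
    """
    pattern = re.compile(r"\b" + re.escape(source_word), re.IGNORECASE)
    return [(src, tgt) for src, tgt in pairs if pattern.search(src)]


for src, tgt in find_aligned(ALIGNED, "myndighet"):
    print(src, "|", tgt)
```

The search only returns the aligned target sentence; finding which target word corresponds to the source word would additionally need word alignment.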

18
Corpora in Computational Linguistics
  • (Diagram with the labels: Corpora, annotation,
    learning, Facts, Rules, Preferences)

19
The goal of Natural Language Processing in CL
  • Build a system that simulates human language
    understanding
  • with input: a natural language utterance
  • with output: answers to questions as to
  • who did (or will do) what to whom, when, where
    and why.
  • NLP is not necessarily concerned with how people
    understand language (→ Cognitive Linguistics)

20
Basic problems in CL
  • Knowledge is missing (too little information)
  • e.g. unknown words
  • e.g. unknown relations between concepts
  • Ambiguities (too much information)
  • e.g. in syntax: attachment preferences
  • e.g. in semantics: word senses

21
Acknowledgement
  • Some slides were highly influenced by or even
    copied from Anke Lüdeling's course "Introduction
    to Corpus Linguistics" at
    http://www.cl-ki.uni-osnabrueck.de/aluedeli/Corpuslinguistik.html

22
Take a breath
  • Questions on Corpus Linguistics?
  • Still to come:
  • An introduction to Ontologies

23
What is an ontology?
  • An ontology holds information about what
    categories exist in the domain, what properties
    they have, and how they are related to one
    another. (Chandrasekaran et al. 1999)

24
Types of ontologies
  • Top-level ontology
  • Lexical ontology (e.g. WordNet)
  • WordNet is a lexical database for English (with
    synonyms and hyperonyms)
  • General ontology (e.g. Cyc)
  • Cyc is a formalized representation of fundamental
    human knowledge: facts, rules of thumb, and
    heuristics for reasoning about the objects and
    events of everyday life
  • Domain ontology (e.g. UMLS)
  • The Unified Medical Language System
  • Task ontology (e.g. CPV)
  • The Common Procurement Vocabulary

25
(No Transcript)
26
Entities in Ontologies
  • Concepts
  • Objects
  • Events and processes
  • Properties
  • Attributes
  • Relations (hold between entities)

27
Hierarchies in Ontologies
  • Type hierarchy (type-subtype, isa,
    hypernym-hyponym)
  • Cat isa Animal.
  • Part hierarchy (part-whole, is-part-of,
    meronymy)
  • Finger is a part of Hand.
  • Broader-narrower hierarchy (mixes part hierarchy
    and type hierarchy)
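
The two hierarchies can be sketched as simple link tables that are followed upwards. Only "Cat isa Animal" and "Finger part-of Hand" come from the slide; the other entries, the single-parent simplification and the function names are assumptions made for this sketch.

```python
# Direct links only, one parent per node (a simplification).
# "Cat -> Animal" and "Finger -> Hand" are from the slide;
# "Animal -> LivingThing" and "Hand -> Arm" are added for illustration.
ISA = {"Cat": "Animal", "Animal": "LivingThing"}
PART_OF = {"Finger": "Hand", "Hand": "Arm"}


def ancestors(node, links):
    """Follow the links upwards from node; return everything reached, nearest first."""
    found = []
    while node in links:
        node = links[node]
        if node in found:  # guard against cycles in the link table
            break
        found.append(node)
    return found


def holds(child, parent, links):
    """True if parent is reachable from child, i.e. the relation holds transitively."""
    return parent in ancestors(child, links)


print(ancestors("Cat", ISA))
print(holds("Finger", "Arm", PART_OF))
```

Keeping the type hierarchy and the part hierarchy in separate tables avoids the mixing that a broader-narrower hierarchy introduces: `holds("Cat", "Hand", ISA)` is false, and no chain ever crosses from one relation into the other.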

28
Ontology Terminology (Michael Denny)
  • The following are synonym classes:
  • Concept, class, category, type, term, entity, set
    and thing.
  • Instance, individual, resource, extension,
    description, object and entity.
  • Relation, relationship, property, function, role,
    slot, attribute, association, criterion,
    constraint of, feature and predicate.

29
Why ontologies in CL?
  • Give structure to knowledge.
  • Improve human knowledge access.
  • Model human associations.
  • Improve the computer's knowledge processing.
  • Get closer to the meaning of an utterance.
  • Get closer to the goal of NLP.

30
Ontology and its neighbors
  • Lexicon
  • A list of words is not an ontology.
  • But cross-references (see also) or reading
    distinctions can be seen as ontological
    relations.
  • Thesaurus (→ is an ontology)
  • a synonym (and hyperonym) lexicon.
  • general vocabulary (not domain specific).
  • mostly atomic concepts (single word entries).
  • Terminology database (→ is an ontology: Siemens
    TermDB)
  • domain-specific vocabulary
  • most often multi-lingual
  • with definitions, context examples and source
    documentation
  • single word and complex entries

31
Ontology and its neighbors
  • Encyclopedia
  • A collection of short descriptions about people,
    geographical entities, and popular terms.
  • with cross-references.
  • Knowledge base
  • repository of formal knowledge representation.
  • facts and rules (in Prolog).
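
The idea of a knowledge base as facts plus rules can be sketched in a few lines. The slide mentions Prolog; the sketch below uses Python instead and shows the corresponding Prolog rule only in a comment. The facts and the function name `inherit_properties` are invented for illustration.

```python
# Facts as (subject, relation, object) triples; all invented for illustration.
FACTS = {
    ("cat", "isa", "animal"),
    ("animal", "has", "metabolism"),
    ("tom", "isa", "cat"),
}


def inherit_properties(facts):
    """Apply one rule until nothing new can be derived.

    The rule, in Prolog notation:  has(X, P) :- isa(X, Y), has(Y, P).
    """
    known = set(facts)
    while True:
        new = {
            (x, "has", p)
            for (x, r1, y) in known if r1 == "isa"
            for (y2, r2, p) in known if r2 == "has" and y2 == y
        } - known
        if not new:
            return known
        known |= new


derived = inherit_properties(FACTS) - FACTS
for fact in sorted(derived):
    print(fact)
```

The loop repeats because a derived fact can enable further derivations: "cat has metabolism" must be derived before "tom has metabolism" can follow from it.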

32
Related terms to be explained later
  • The semantic web
  • Knowledge management
  • Text mining
  • Dublin Core
  • RDF, OWL