LIN 3098 Corpus Linguistics

1
LIN 3098 Corpus Linguistics: Lecture 4
  • Albert Gatt

2
In this lecture
  • Levels of annotation
  • Corpus typology
  • classification based on type and levels of
    annotation
  • multilingual corpora

3
Part 1
  • Levels of corpus annotation (cont'd)

4
Levels of linguistic annotation
  • part-of-speech (word-level)
  • lemmatisation (word-level)
  • parsing (phrase- and sentence-level)
  • semantics (multi-level)
  • semantic relationships between words and phrases
  • semantic features of words
  • discourse features (supra-sentence level)
  • phonetic transcription
  • prosody

5
Lemmatisation
  • Groups morphological variants of a word under the
    head word:
  • mexa (walk)
  • imxejt (I walked)
  • imxejna (we walked)
  • nimxu (we walk)
  • ...
  • Together, these form a lemma.
  • Increasingly common these days.

6
Lemmatisation example: the SUSANNE corpus
  • Format: reference, tag, word, lemma
  • A05:0030.33 - VVDv said say
  • A05:0030.33 is the reference (corpus
    file:sentence.word)
  • VVDv is the POS tag
  • said is the actual word
  • say is the head word (lemma)
  • Every word in the corpus is on a separate line.
  • Extremely useful for lexicography

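
The one-word-per-line layout is easy to process. The following is a minimal Python sketch, assuming each line has exactly the five whitespace-separated fields shown on the slide (the released corpus files may carry further fields). The second sample line and its reference are invented for illustration.

```python
from collections import defaultdict
from typing import NamedTuple


class Token(NamedTuple):
    ref: str     # corpus file:sentence.word reference, kept as an opaque string
    status: str  # second field; "-" in the slide's example
    tag: str     # POS tag
    word: str    # actual word form
    lemma: str   # head word


def parse_line(line: str) -> Token:
    """Split one corpus line into its five fields (assumed layout)."""
    ref, status, tag, word, lemma = line.split()
    return Token(ref, status, tag, word, lemma)


def group_by_lemma(lines):
    """Collect every word form seen under each lemma."""
    forms = defaultdict(set)
    for line in lines:
        tok = parse_line(line)
        forms[tok.lemma].add(tok.word)
    return dict(forms)


sample = [
    "A05:0030.33 - VVDv said say",
    "X00:0000.01 - VVZv says say",  # hypothetical line, not from the corpus
]
print(group_by_lemma(sample))
```

Grouping by the last field is what makes such a corpus useful for lexicography: all attested forms of a head word can be retrieved in one pass.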
7
Automatic morphological analysis
  • For some languages, there are reasonably good
    lemmatisers/ morphological analysers
  • Examples for English:
  • morpha: built at the University of Sussex
  • EngTwol: commercial, by Lingsoft.

8
EngTwol output
  • Input: undeniable
  • "undeniable" <DER:ble> A ABS
  • <DER:ble>: derived with the -ble suffix
  • A: adjective
  • ABS: absolute form
  • This is a rule-based analyser. There are others
    which use corpus-derived statistical patterns.
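
A reading of this kind can be split into its parts with a few lines of Python. This is only a sketch: it assumes a reading consists of a quoted base form followed by space-separated items, with derivational information in angle brackets, as in the single example on the slide. It does not call the analyser itself.

```python
import re

# Quoted base form, then the rest of the reading.
READING = re.compile(r'"(?P<base>[^"]+)"\s*(?P<rest>.*)')


def parse_reading(reading: str) -> dict:
    """Split one analyser reading into base form, derivations and tags."""
    match = READING.match(reading.strip())
    if match is None:
        raise ValueError(f"not a recognisable reading: {reading!r}")
    derivations, tags = [], []
    for item in match.group("rest").split():
        if item.startswith("<") and item.endswith(">"):
            derivations.append(item[1:-1])  # strip the angle brackets
        else:
            tags.append(item)
    return {"base": match.group("base"),
            "derivations": derivations,
            "tags": tags}


print(parse_reading('"undeniable" <DER:ble> A ABS'))
```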

9
Semantic annotation I: two types
  • markup of semantic relations (e.g.
    predicate-argument structure)
  • currently used in parsed corpora, to mark up
    function-argument structures etc.
  • markup of features of word meaning (mainly, word
    senses)
  • has origins in content analysis to arrive at
    conclusions about how prominent particular
    concepts are
  • Now used in a lot of work on word sense
    disambiguation

10
Example of type 1 semantic markup (Penn Treebank)
  • (S (NP-SBJ-1 Chris)
       (VP wants
           (S (NP-SBJ *-1)
              (VP to
                  (VP throw
                      (NP the ball))))))
  • Predicate-argument structure:
  • wants(Chris, throw(Chris, ball))
  • The empty embedded subject (*-1) is linked to NP
    subject no. 1

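
The link between the empty subject and its antecedent can be followed mechanically. Below is a small Python sketch, assuming the bracketing uses a trailing -N on a label to index a constituent and *-N for the co-indexed empty element. The function names are made up for this illustration and the code is not part of any Treebank tool.

```python
import re


def parse(bracketing: str):
    """Parse a bracketed tree into (label, children); leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketing)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()


def leaves(tree):
    """Return the leaf strings of a tree, left to right."""
    out = []
    for child in tree[1]:
        out.extend(leaves(child) if isinstance(child, tuple) else [child])
    return out


def antecedents(tree, table=None):
    """Map each index N to the words of the constituent labelled ...-N."""
    table = {} if table is None else table
    label, children = tree
    index = re.search(r"-(\d+)$", label)
    words = [w for w in leaves(tree) if not w.startswith("*")]
    if index and words:
        table[index.group(1)] = words
    for child in children:
        if isinstance(child, tuple):
            antecedents(child, table)
    return table


def resolve(tree):
    """Replace every empty element *-N by the words of its antecedent."""
    table = antecedents(tree)
    out = []
    for word in leaves(tree):
        trace = re.fullmatch(r"\*-(\d+)", word)
        out.extend(table.get(trace.group(1), [word]) if trace else [word])
    return out


TREE = ("(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) "
        "(VP to (VP throw (NP the ball))))))")
print(" ".join(resolve(parse(TREE))))
```

Resolving the empty element makes explicit that Chris is the subject of both "wants" and "throw", which is the information the predicate-argument structure encodes.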
11
Semantic markup type 2: lexical features
  • Most common type:
  • word-sense tagged corpora
  • Main idea:
  • disambiguate a word in context by tagging its
    sense
  • Often uses WordNet (Miller et al. 1993)
  • WordNet is a lexical taxonomy which represents
    lexical relations among a large number of words,
  • including hyponymy (IS-A) relations etc.
  • For each entry, all the (supposed) senses of the
    word are given.
  • Main use: identify senses of words in context and
    mark them up with a pointer to a WordNet sense.

12
WordNet senses: move (noun)
  • (377) move -- (the act of deciding to do
    something; "he didn't make a move to help"; "his
    first move was to hire a lawyer")
  • (70) move, relocation -- (the act of changing
    your residence or place of business; "they say
    that three moves equal one fire")
  • (57) motion, movement, move, motility -- (a
    change of position that does not entail a change
    of location; "the reflex motion of his eyebrows
    revealed his surprise"; "movement is a sign of
    life"; "an impatient move of his hand";
    "gastrointestinal motility")
  • (30) motion, movement, move -- (the act of
    changing location from one place to another;
    "police controlled the motion of the crowd"; "the
    movement of people from the farms to the cities";
    "his move put him directly in my path")
  • (5) move -- ((game) a player's turn to take some
    action permitted by the rules of the game)
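
To make the idea of sense tagging concrete, here is a toy Python sketch that picks one of the five noun senses for a context. It uses a simplified gloss-overlap heuristic (in the spirit of Lesk-style methods) with the listed counts as a tie-breaker. This is an illustration only, not how any particular sense-tagged corpus was built. The sense labels and the stop-word list are invented; the counts and glosses are copied from the listing.

```python
import re

# label: (count from the listing, gloss without the example sentences)
SENSES = {
    "move/decision": (377, "the act of deciding to do something"),
    "move/relocation": (70, "the act of changing your residence or place "
                            "of business"),
    "move/motion": (57, "a change of position that does not entail a "
                        "change of location"),
    "move/travel": (30, "the act of changing location from one place to "
                        "another"),
    "move/game": (5, "a player's turn to take some action permitted by "
                     "the rules of the game"),
}

# Small hand-picked stop-word list (an assumption of this sketch).
STOP = {"the", "a", "an", "of", "to", "that", "or", "and", "in", "from",
        "one", "some", "by", "your", "not", "does", "do"}


def content_words(text: str) -> set:
    """Lower-cased word types of a text, minus stop words."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP}


def tag_sense(context: str) -> str:
    """Pick the sense whose gloss shares most words with the context.

    Ties (including zero overlap) fall back on the most frequent sense.
    """
    ctx = content_words(context)

    def score(item):
        _label, (count, gloss) = item
        return (len(ctx & content_words(gloss)), count)

    return max(SENSES.items(), key=score)[0]


print(tag_sense("It is your move, the rules of the game say so"))
print(tag_sense("a good move"))
```

In an annotated corpus, the returned label would be replaced by a pointer to the corresponding WordNet sense.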

13
WordNet senses: move (verb)
  • (130) travel, go, move, locomote -- (change
    location; move, travel, or proceed; "How fast
    does your new car go?"; "We travelled from Rome
    to Naples by bus"; "The policemen went from door
    to door looking for the suspect"; "The soldiers
    moved towards the city in an attempt to take it
    before night fell")
  • (60) move, displace -- (cause to move, both in a
    concrete and in an abstract sense; "Move those
    boxes into the corner, please"; "I'm moving my
    money to another bank"; "The director moved more
    responsibilities onto his new assistant")
  • (52) move -- (move so as to change position,
    perform a nontranslational motion; "He moved his
    hand slightly to the right")
  • (20) move -- (change residence, affiliation, or
    place of employment; "We moved from Idaho to
    Nebraska"; "The basketball player moved from one
    team to another")

14
Check it out!
  • WordNet is freely available for download
  • http://wordnet.princeton.edu/

15
Word sense annotation: other uses
  • tagging words with their semantic field (Wilson
    1996), e.g.
  • plant life
  • men's clothing
  • tagging words with their emotional content
    (Campbell & Pennebaker 2002), based on a
    dictionary of categories, e.g.
  • social processes
  • negative emotions
  • This approach underlies Pennebaker's Linguistic
    Inquiry and Word Count (LIWC) system, which
  • analyses a text and comes up with a profile of
    its personal/emotional content
  • relates this to some features of its author
    (gender, age)
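
Dictionary-based tagging of this kind boils down to counting how many tokens fall into each category. The Python sketch below shows the principle only. The two word lists are invented for the example and are not taken from the LIWC dictionary, and the category names are shortened labels chosen here.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> words belonging to it.
CATEGORIES = {
    "social": {"friend", "family", "talk", "we", "they"},
    "negative_emotion": {"sad", "angry", "hate", "afraid", "hurt"},
}


def profile(text: str) -> dict:
    """Return, per category, the proportion of tokens that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, vocabulary in CATEGORIES.items():
            if token in vocabulary:
                counts[category] += 1
    total = len(tokens) or 1          # avoid dividing by zero on empty text
    return {category: counts[category] / total for category in CATEGORIES}


print(profile("We talk to family. I hate it."))
```

A profile like this, computed over many texts, is the kind of measure that can then be compared across author groups.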

16
Discourse annotation
  • Most common:
  • text-level things such as paragraphs
  • Less common:
  • anaphoric NPs and reference (cf. example from
    lecture 3)
  • Even less common:
  • annotation of words which function as discourse
    cues (Stenström 1984)
  • apology (sorry), hedges (sort of), etc.
  • annotation of rhetorical structure

17
Discourse: annotating rhetorical structure (I)
  • Rhetorical Structure Theory (Mann and Thompson
    1988)
  • views text as made up of discourse units
  • units stand in various rhetorical relations,
    which reflect their role in constructing an
    argument, a narrative, etc.
  • Example: CONCESSION/CONTRAST relation
  • Although Mr. Freeman is retiring, he will
    continue to work as a consultant for American
    Express on a project basis.
  • The second unit is the main one (nucleus)
  • The first unit (satellite) concedes a fact that
    appears to contradict what the main unit says.
  • A recent corpus (Marcu et al. 2003) is annotated
    with this information.
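
One way to picture such an annotation is as a small tree in which every relation has a nucleus and a satellite. The Python sketch below is a made-up data structure for illustration, not the file format of the annotated corpus. It encodes the example sentence and follows the nuclei down to the main unit.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Segment:
    """An elementary discourse unit: a stretch of text."""
    text: str


@dataclass(frozen=True)
class Relation:
    """A rhetorical relation holding between a nucleus and a satellite."""
    name: str
    nucleus: "Node"
    satellite: "Node"


Node = Union[Segment, Relation]


def nuclear_text(node: Node) -> str:
    """Follow nuclei downwards and return the text of the main unit."""
    while isinstance(node, Relation):
        node = node.nucleus
    return node.text


example = Relation(
    name="CONCESSION",
    satellite=Segment("Although Mr. Freeman is retiring,"),
    nucleus=Segment("he will continue to work as a consultant for "
                    "American Express on a project basis."),
)
print(nuclear_text(example))
```

Because units can themselves be relations, the same structure covers longer texts in which whole paragraphs stand in a relation to each other.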

18
Phonetic transcription
  • Not many phonetically transcribed corpora.
  • MARSEC corpus is one of the best known. This is a
    version of the Lancaster/IBM Spoken English
    Corpus.
  • Several databases of transcribed speech, however.
    Mostly used for statistical speech technology
    applications (e.g. text-to-speech synthesis).

19
Annotating suprasegmentals
  • Aim: capture suprasegmental features such as
    stress, intonation and pauses in speech.
  • Some transcription systems exist:
  • ToBI (American)
  • Tonic Stress Marker (TSM; British)
  • They define ways of annotating suprasegmentals
    such as start/end of tone group, simultaneous
    speech, rise-fall tone, falling tone, etc.

20
Problem-oriented tagging
  • If you're interested in a particular problem, and
    no suitable corpus exists, build your own!
  • Many corpora define problem-specific annotation
    schemes.

21
Example: the TUNA Corpus
  • Problem: How do people refer to objects using
    definite NPs?
  • Main interest: visual properties (colour, size
    etc.)
  • Focus: semantics of definite NPs, i.e. what
    people choose to include in their description.
  • Method:
  • experiment to get people to describe objects,
    distinguishing them from other objects in the
    same visual scene
  • annotation of descriptions based on semantics

22
TUNA Corpus description
  • <DESCRIPTION NUM="SINGULAR">
  • <ATTRIBUTE NAME="colour" VALUE="red"> red
    </ATTRIBUTE>
  • <ATTRIBUTE NAME="type" VALUE="sofa"> sofa
    </ATTRIBUTE>
  • <ATTRIBUTE NAME="size" VALUE="large"> bigger
    version </ATTRIBUTE>
  • </DESCRIPTION>
  • Red sofa, bigger version.
  • Features of the corpus:
  • represents the target referent
  • also represents the distractors (from which the
    target must be distinguished)
  • semantically transparent annotation goes beyond
    language
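
Because the annotation is XML, the semantic content of a description can be read off with a standard parser. The Python sketch below uses the standard library and the markup exactly as it appears on the slide; real corpus files may contain further elements and attributes that are ignored here.

```python
import xml.etree.ElementTree as ET

DESCRIPTION = """
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
"""


def semantic_content(xml_text: str) -> dict:
    """Map each annotated attribute name to its normalised value."""
    root = ET.fromstring(xml_text.strip())
    return {a.get("NAME"): a.get("VALUE") for a in root.iter("ATTRIBUTE")}


def surface_words(xml_text: str) -> str:
    """Recover the words the speaker used, with whitespace normalised."""
    root = ET.fromstring(xml_text.strip())
    return " ".join("".join(root.itertext()).split())


print(semantic_content(DESCRIPTION))
print(surface_words(DESCRIPTION))
```

The two functions show the split the annotation makes: "bigger version" is what the speaker said, while size=large is the language-independent property it expresses.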

23
Part 2
  • Multilingual corpora

24
Why multilingual corpora?
  • comparative studies
  • syntax
  • morphology
  • the cornerstone of most research in automatic
    machine translation nowadays
  • most MT systems are statistical, trained on large
    repositories of parallel (e.g. English-Chinese)
    text.

25
Parallel corpora
  • Represent a text in its original language (L1),
    with a translation in another language (L2)
  • Long history: medieval polyglot bibles were among
    the first parallel corpora
  • Alignment:
  • Many parallel corpora align L1 and L2 at sentence
    level, sometimes also at word level
  • Sentence-level alignment can be achieved
    automatically with very high accuracy!
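
The slides do not say how automatic sentence alignment works. One well-known family of methods relies on the observation that long sentences tend to be translated by long sentences (Gale & Church 1993 is the classic reference). The Python sketch below is a much simplified version of that idea, not their probabilistic model: it uses dynamic programming over character lengths, and the bead penalties are arbitrary values chosen for the example.

```python
# Allowed alignment "beads" (source sentences, target sentences) with a
# fixed penalty each. The penalty values are guesses made for this sketch.
BEADS = {(1, 1): 0.0, (2, 1): 1.0, (1, 2): 1.0, (1, 0): 4.0, (0, 1): 4.0}


def align(src, tgt):
    """Align two lists of sentences; return a list of (src_group, tgt_group)."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                len_s = sum(len(s) for s in src[i:ni])
                len_t = sum(len(t) for t in tgt[j:nj])
                # Relative length difference: 0 for equal lengths, 1 at worst.
                mismatch = abs(len_s - len_t) / max(len_s, len_t, 1)
                candidate = cost[i][j] + penalty + mismatch
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Walk back from the end to recover the cheapest sequence of beads.
    beads, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    return beads[::-1]


english = ["Sophie was on her way home.", "She was tired."]
german = ["Sofie war auf dem Heimweg und sie war sehr müde."]
for pair in align(english, german):
    print(pair)
```

Here the two English sentences are grouped against the single German one, because a 2-to-1 bead with matching lengths is cheaper than leaving one sentence unaligned. The German sentence is a made-up translation for the example.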

26
Example: SMULTRON corpus
  • Developed and released in 2007-8
  • Relatively small
  • Aligned texts in English, Swedish and German
  • E.g. Sophie's World is one of the texts
  • Annotated with syntax, POS, morphology
  • Comes with a tool to view parallel syntactic
    trees.

27
SMULTRON example: English (Sophie's World)
  • <s id="s3">
  • <terminals>
  • <t id="s3_1" word="Sophie" pos="NNP"
    morph="--"/>
  • <t id="s3_2" word="Amundsen" pos="NNP"
    morph="--"/>
  • <t id="s3_3" word="was" pos="VBD"
    morph="--"/>
  • <t id="s3_4" word="on" pos="IN" morph="--"/>
  • <t id="s3_5" word="her" pos="PRP" morph="--"/>
  • <t id="s3_6" word="way" pos="NN" morph="--"/>
  • <t id="s3_7" word="home" pos="RB" morph="--"/>
  • <t id="s3_8" word="from" pos="IN" morph="--"/>
  • <t id="s3_9" word="school" pos="NN" morph="--"/>
  • <t id="s3_10" word="." pos="." morph="--"/>
  • </terminals>
  • </s>
  • This shows terminal nodes only. The corpus also
    represents syntactic non-terminals (NP, VP etc.)
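
The terminal nodes can be read with any XML parser. The following Python sketch uses the standard library on the fragment shown on the slide; it assumes nothing about the rest of the corpus files beyond the `s`, `terminals` and `t` elements and their attributes seen above.

```python
import xml.etree.ElementTree as ET

SENTENCE = """
<s id="s3">
  <terminals>
    <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
    <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
    <t id="s3_3" word="was" pos="VBD" morph="--"/>
    <t id="s3_4" word="on" pos="IN" morph="--"/>
    <t id="s3_5" word="her" pos="PRP" morph="--"/>
    <t id="s3_6" word="way" pos="NN" morph="--"/>
    <t id="s3_7" word="home" pos="RB" morph="--"/>
    <t id="s3_8" word="from" pos="IN" morph="--"/>
    <t id="s3_9" word="school" pos="NN" morph="--"/>
    <t id="s3_10" word="." pos="." morph="--"/>
  </terminals>
</s>
"""


def tagged_tokens(xml_text: str):
    """Return the sentence as a list of (word, POS tag) pairs."""
    root = ET.fromstring(xml_text.strip())
    return [(t.get("word"), t.get("pos")) for t in root.iter("t")]


def words_with_tag(xml_text: str, tag: str):
    """Return the words carrying a given POS tag, in sentence order."""
    return [word for word, pos in tagged_tokens(xml_text) if pos == tag]


print(tagged_tokens(SENTENCE))
print(words_with_tag(SENTENCE, "NN"))
```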

28
SMULTRON: same sentence in German
  • <s id="s3">
  • <terminals>
  • <t id="s3_1" word="Sofie" pos="NE"
    morph="FEM" lemma="Sofie"/>
  • <t id="s3_2" word="Amundsen" pos="NE"
    morph="--" lemma="Amundsen"/>
  • <t id="s3_3" word="war" pos="VAFIN"
    morph="--" lemma="sein"/>
  • <t id="s3_4" word="auf" pos="APPR" morph="--"
    lemma="auf"/>
  • <t id="s3_5" word="dem" pos="ART" morph="--"
    lemma="der"/>
  • <t id="s3_6" word="Heimweg" pos="NN"
    morph="MASK" lemma="Heimweg"/>
  • <t id="s3_7" word="von" pos="APPR" morph="--"
    lemma="von"/>
  • <t id="s3_8" word="der" pos="ART" morph="--"
    lemma="die"/>
  • <t id="s3_9" word="Schule" pos="NN"
    morph="FEM" lemma="Schule"/>
  • <t id="s3_10" word="." pos="." morph="--"
    lemma="--"/>
  • </terminals>
  • </s>
  • Note the richer morphology and the representation
    of lemmas.

29
Translation corpora
  • Not parallel.
  • Contain different texts of the same genre in two
    or more languages.
  • Example:
  • The PAROLE corpus is a translation corpus for EU
    languages

30
Why translation corpora?
  • Parallel corpora, by definition, contain
    translated text (L2)
  • translation can give rise to errors
  • artificiality and translation quality can be an
    issue
  • e.g. McEnery & Wilson report a study on an
    English-Polish corpus in which the Polish text
    reads like a translation
  • The problem can be overcome if the texts used are
    professionally translated.
  • Translation corpora have texts in two or more
    languages, each in the original.
  • The data is more natural.

31
Summary
  • We have now concluded our initial incursion into:
  • corpus construction
  • corpus annotation
  • corpus typology
  • Next up:
  • using corpora for linguistic research