Title: LIN 3098 Corpus Linguistics
1. LIN 3098 Corpus Linguistics: Lecture 4
2. In this lecture
- Levels of annotation
- Corpus typology
  - classification based on type and levels of annotation
  - multilingual corpora
3. Part 1
- Levels of corpus annotation (cont'd)
4. Levels of linguistic annotation
- part-of-speech (word-level)
- lemmatisation (word-level)
- parsing (phrase and sentence level)
- semantics (multi-level)
  - semantic relationships between words and phrases
  - semantic features of words
- discourse features (supra-sentence level)
- phonetic transcription
- prosody
5. Lemmatisation
- Groups morphological variants of a word under the head word. Together, these form a lemma.
  - mexa (walk)
  - imxejt (I walked)
  - imxejna (we walked)
  - nimxu (we walk)
  - ...
- Increasingly common these days.
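A minimal sketch of the grouping idea, assuming a hand-built lookup table. The `FORM_TO_LEMMA` mapping and both function names are hypothetical and only cover the Maltese forms on this slide; real lemmatisers rely on morphological rules or corpus statistics rather than a word list.

```python
from collections import defaultdict

# Hypothetical lookup table: inflected form -> head word (lemma).
# Only the forms listed on the slide are included.
FORM_TO_LEMMA = {
    "mexa": "mexa",
    "imxejt": "mexa",
    "imxejna": "mexa",
    "nimxu": "mexa",
}

def lemmatise(token):
    """Return the head word for a token, or the lowercased token if unknown."""
    return FORM_TO_LEMMA.get(token.lower(), token.lower())

def group_by_lemma(tokens):
    """Group the distinct word forms in a token list under their lemma."""
    groups = defaultdict(set)
    for token in tokens:
        groups[lemmatise(token)].add(token.lower())
    return {lemma: sorted(forms) for lemma, forms in groups.items()}

groups = group_by_lemma(["Imxejt", "nimxu", "imxejna", "mexa", "nimxu"])
print(groups)
```

All four forms end up under the single head word, which is what makes lemma-based frequency counts and searches possible.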
6. Lemmatisation example: the SUSANNE corpus
- Format: word, tag, lemma
  - A050030.33 - VVDv said say
- The fields of the example line:
  - A050030.33: reference (corpus file, sentence and word)
  - VVDv: POS tag
  - said: actual word
  - say: head word (lemma)
- Every word in the corpus is on a separate line.
- Extremely useful for lexicography
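A line in this one-word-per-line format can be read with a few lines of code. The sketch below assumes exactly the five whitespace-separated fields shown in the slide example; the class name, the function name and the field name `status` (for the second field, shown as "-") are assumptions, and the released corpus files may carry further fields.

```python
from typing import NamedTuple

class SusanneToken(NamedTuple):
    reference: str  # corpus file, sentence and word reference
    status: str     # second field, "-" in the slide example (name is an assumption)
    tag: str        # POS tag
    word: str       # actual word as it appears in the text
    lemma: str      # head word

def parse_line(line):
    """Split one corpus line into its five fields."""
    fields = line.split()
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}: {line!r}")
    return SusanneToken(*fields)

token = parse_line("A050030.33 - VVDv said say")
print(token.word, "->", token.lemma)
```

Because each token carries its lemma, a lexicographer can collect every attested form of a head word by grouping tokens on `token.lemma`.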
7. Automatic morphological analysis
- For some languages, there are reasonably good lemmatisers/morphological analysers.
- Examples for English:
  - morpha: built at the University of Sussex
  - EngTwol: commercial, by LingSoft.
8. EngTwol output
- undeniable
  - "undeniable" <DER:ble> A ABS
  - (derived with "ble" suffix)
  - adjective (A)
  - absolute (ABS) form
- This is a rule-based analyser. There are others which use corpus-derived statistical patterns.
9. Semantic annotation I: Two types
- markup of semantic relations (e.g. predicate-argument structure)
  - currently used in parsed corpora, to mark up function-argument structures etc.
- markup of features of word meaning (mainly, word senses)
  - has origins in content analysis, which aims to arrive at conclusions about how prominent particular concepts are
  - now used in a lot of work on word sense disambiguation
10. Example of type 1 semantic markup (Penn Treebank)
(S (NP-SBJ-1 Chris)
   (VP wants
       (S (NP-SBJ *-1)
          (VP to
              (VP throw
                  (NP the ball))))))
- Predicate-argument structure:
  - wants(Chris, throw(Chris, ball))
- The empty embedded subject (*-1) is linked to NP subject no. 1.
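The link between the empty embedded subject and its antecedent can be followed mechanically. The sketch below assumes Penn Treebank notation in which a co-index is the final "-N" of a node label (NP-SBJ-1) and an empty element is written "*-N"; all function names are hypothetical. It only resolves the co-indexed subject, which is the step that makes the reading wants(Chris, throw(Chris, ball)) recoverable; it does not build the predicate-argument structure itself.

```python
import re

TREE = "(S (NP-SBJ-1 Chris) (VP wants (S (NP-SBJ *-1) (VP to (VP throw (NP the ball))))))"

def parse_tree(text):
    """Parse a bracketed tree into (label, children) tuples; leaves are strings."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                      # skip "("
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                      # skip ")"
        return (label, children)

    return node()

def words(tree):
    """Return the leaves of a tree, left to right."""
    out = []
    for child in tree[1]:
        out.extend([child] if isinstance(child, str) else words(child))
    return out

def antecedents(tree, found=None):
    """Map each co-index (the final -N of a label) to the words of that node."""
    found = {} if found is None else found
    label, children = tree
    match = re.search(r"-(\d+)$", label)
    if match:
        found[match.group(1)] = " ".join(words(tree))
    for child in children:
        if not isinstance(child, str):
            antecedents(child, found)
    return found

def resolve_traces(tree, index):
    """Replace each empty element *-N with the words of its antecedent."""
    label, children = tree
    resolved = []
    for child in children:
        if isinstance(child, str):
            match = re.fullmatch(r"\*-(\d+)", child)
            resolved.append(index[match.group(1)] if match else child)
        else:
            resolved.append(resolve_traces(child, index))
    return (label, resolved)

tree = parse_tree(TREE)
index = antecedents(tree)
print(words(resolve_traces(tree, index)))
```

After resolution the embedded clause has an explicit subject, so both "wants" and "throw" can be read off with "Chris" as their first argument.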
11. Semantic markup type 2: lexical features
- Most common type:
  - word-sense tagged corpora
- Main idea:
  - disambiguate a word in context by tagging its sense
- Often uses WordNet (Miller et al 1993)
  - WordNet is a lexical taxonomy which represents lexical relations among a large number of words, including hyponymy (IS-A) relations etc.
  - For each entry, all the (supposed) senses of the word are given.
- Main use: identify senses of words in context, mark them up with a pointer to a WordNet sense.
12. WordNet senses: Move (noun)
- (377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
- (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
- (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
- (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
- (5) move -- ((game) a player's turn to take some action permitted by the rules of the game)
13. WordNet senses: Move (verb)
- (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
- (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
- (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
- (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
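To make the idea of tagging a word with a sense concrete, here is a toy disambiguator for the noun "move". It is a simplified gloss-overlap heuristic (in the spirit of Lesk-style methods), not the procedure used to build any sense-tagged corpus. The glosses are copied from the noun senses listed earlier; the short sense labels, the stopword list and the function names are hypothetical, and the bracketed numbers are assumed to be frequency counts, used here only to break ties.

```python
import re

# (hypothetical label, bracketed count from the slide, gloss from the slide)
SENSES = [
    ("decision", 377, "the act of deciding to do something"),
    ("relocation", 70, "the act of changing your residence or place of business"),
    ("motion", 57, "a change of position that does not entail a change of location"),
    ("change_of_location", 30, "the act of changing location from one place to another"),
    ("game", 5, "a player's turn to take some action permitted by the rules of the game"),
]

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "the", "of", "to", "or", "by", "your", "some", "do", "that",
             "does", "not", "from", "one", "and", "was", "it", "her", "his", "in"}

def content_words(text):
    """Lowercase a text and return its set of non-stopword tokens."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def disambiguate(context):
    """Pick the sense whose gloss shares most content words with the context.

    Ties, including the case of no overlap at all, go to the sense with the
    highest count."""
    ctx = content_words(context)
    best = max(SENSES, key=lambda s: (len(ctx & content_words(s[2])), s[1]))
    return best[0]

print(disambiguate("It was her turn, and her move broke the rules of the game"))
print(disambiguate("The move to a new place of business took a week"))
print(disambiguate("That was a clever move"))
```

The last call has no overlap with any gloss and falls back to the most frequent sense, which is a common baseline in word sense disambiguation work.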
14. Check it out!
- WordNet is freely available for download
  - http://wordnet.princeton.edu/
15. Word sense annotation: other uses
- tagging words with their semantic field (Wilson 1996)
  - plant life
  - men's clothing
- tagging words with their emotional content (Campbell & Pennebaker 2002), based on a dictionary
  - social processes
  - negative emotions
- This approach underlies Pennebaker's Linguistic Inquiry and Word Count (LIWC) system, which:
  - analyses a text and comes up with a profile of its personal/emotional content
  - relates this to some features of its author (gender, age)
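Dictionary-based tagging of this kind reduces to looking each word up in a list of categories and counting. The sketch below is not the LIWC dictionary or its scoring method: the two category names come from the slide, but the word lists and the function name `profile` are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-dictionary: category -> words assigned to it.
CATEGORY_WORDS = {
    "social processes": {"friend", "talk", "family", "we"},
    "negative emotions": {"sad", "angry", "hate", "afraid"},
}

def profile(text):
    """Return, per category, the share of tokens in the text that belong to it."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for token in tokens:
        for category, category_words in CATEGORY_WORDS.items():
            if token in category_words:
                counts[category] += 1
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {category: counts[category] / total for category in CATEGORY_WORDS}

result = profile("We talk to a friend when we are sad or angry")
print(result)
```

Comparing such profiles across many texts is what allows them to be related to properties of the authors.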
16. Discourse annotation
- Most common:
  - text-level things such as paragraphs
- Less common:
  - anaphoric NPs and reference (cf. example from lecture 3)
- Even less common:
  - annotation of words which function as discourse cues (Stenstrom 1984)
    - apology (sorry), hedges (sort of), etc.
  - annotation of rhetorical structure
17. Discourse: Annotating rhetorical structure (I)
- Rhetorical Structure Theory (Mann and Thompson 1988)
  - views text as made up of discourse units
  - units stand in various rhetorical relations, which reflect their role in constructing an argument, a narrative, etc.
- CONCESSION/CONTRAST relation
  - Although Mr. Freeman is retiring, he will continue to work as a consultant for American Express on a project basis.
  - Second unit is the main one (nucleus)
  - First unit (satellite) concedes a fact that appears to contradict what the main unit says.
- A recent corpus (Marcu et al 2003) is annotated with this information.
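One way to picture a rhetorical relation as data is a record holding the two units, the relation name and which unit is the nucleus. This is only an illustrative sketch; the class and field names are hypothetical and do not reflect the file format of any annotated corpus.

```python
from dataclasses import dataclass

@dataclass
class RhetoricalRelation:
    """One relation between two adjacent discourse units."""
    name: str
    first_unit: str
    second_unit: str
    nucleus: int  # 1 or 2: which of the two units is the main one

    def nucleus_text(self):
        return self.first_unit if self.nucleus == 1 else self.second_unit

    def satellite_text(self):
        return self.second_unit if self.nucleus == 1 else self.first_unit

relation = RhetoricalRelation(
    name="CONCESSION",
    first_unit="Although Mr. Freeman is retiring,",
    second_unit="he will continue to work as a consultant for American Express on a project basis.",
    nucleus=2,
)
print(relation.name, "| nucleus:", relation.nucleus_text())
```

Keeping the nucleus explicit means the main point of each relation can be pulled out without re-reading the text.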
18. Phonetic transcription
- Not many phonetically transcribed corpora.
- MARSEC corpus is one of the best known. This is a version of the Lancaster/IBM Spoken English Corpus.
- Several databases of transcribed speech, however. Mostly used for statistical speech technology applications (e.g. text-to-speech synthesis).
19. Annotating suprasegmentals
- Aims: capture suprasegmental features such as stress, intonation and pauses in speech.
- Some transcription systems exist:
  - TOBI (American)
  - Tonic Stress Marker (TSM; British)
- These define ways of annotating suprasegmentals such as start/end of tone group, simultaneous speech, rise-fall tone, falling tone, etc.
20. Problem-oriented tagging
- If you're interested in a particular problem, and no corpus exists, build your own!
- Many corpora define problem-specific annotation schemes.
21. Example: the TUNA Corpus
- Problem: How do people refer to objects using definite NPs?
  - Main interest: visual properties (colour, size etc)
  - Focus: semantics of definite NPs, i.e. what people choose to include in their description.
- Method:
  - experiment to get people to describe objects, distinguishing them from other objects in the same visual scene
  - annotation of descriptions based on semantics
22. TUNA Corpus description
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
- Red sofa, bigger version.
- Features of the corpus:
  - represents the target referent
  - also represents the distractors (from which the target must be distinguished)
  - semantically transparent annotation goes beyond language
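The point of semantically transparent annotation is that the meaning of a description can be read off independently of its wording. A small sketch using Python's standard `xml.etree.ElementTree`, assuming the description is stored as well-formed XML with the element and attribute names shown on the slide (the two function names are hypothetical):

```python
import xml.etree.ElementTree as ET

DESCRIPTION_XML = """
<DESCRIPTION NUM="SINGULAR">
  <ATTRIBUTE NAME="colour" VALUE="red"> red </ATTRIBUTE>
  <ATTRIBUTE NAME="type" VALUE="sofa"> sofa </ATTRIBUTE>
  <ATTRIBUTE NAME="size" VALUE="large"> bigger version </ATTRIBUTE>
</DESCRIPTION>
"""

def semantic_content(xml_text):
    """Map each annotated attribute name to its normalised value."""
    root = ET.fromstring(xml_text.strip())
    return {a.get("NAME"): a.get("VALUE") for a in root.iter("ATTRIBUTE")}

def surface_words(xml_text):
    """Map each annotated attribute name to the words the speaker actually used."""
    root = ET.fromstring(xml_text.strip())
    return {a.get("NAME"): (a.text or "").strip() for a in root.iter("ATTRIBUTE")}

print(semantic_content(DESCRIPTION_XML))
print(surface_words(DESCRIPTION_XML))
```

The speaker said "bigger version", but the annotation records the underlying property size = large, so descriptions with different wordings can be compared on the same attributes.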
23. Part 2
24. Why multilingual corpora?
- comparative studies
  - syntax
  - morphology
- the cornerstone of most research in machine translation nowadays
  - most MT systems are statistical, trained on large repositories of parallel (e.g. English-Chinese) text.
25. Parallel corpora
- Represent a text in its original language (L1), with a translation in another language (L2)
  - long history: Medieval polyglot bibles were among the first parallel corpora
- Alignment
  - Many parallel corpora align L1 and L2 at sentence level, sometimes also at word level
  - Sentence-level alignment can be achieved automatically with very high accuracy!
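One reason automatic sentence alignment works well is that a sentence and its translation tend to have similar lengths. The toy aligner below illustrates that intuition with dynamic programming. It is a deliberately simplified sketch: the cost is the plain difference in character length, only 1-1, 1-2 and 2-1 pairings are allowed, and the example sentences and the function name `align` are invented. Published length-based aligners use a probabilistic cost and more pairing types.

```python
def align(src, tgt):
    """Align two lists of sentences using 1-1, 1-2 and 2-1 pairings.

    The cost of a pairing is the absolute difference in character length
    between its two sides. Returns a list of (source_indices, target_indices).
    """
    inf = float("inf")
    n, m = len(src), len(tgt)
    # best[i][j] = minimal cost of aligning src[:i] with tgt[:j]
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in ((1, 1), (1, 2), (2, 1)):
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                src_len = sum(len(s) for s in src[i:ni])
                tgt_len = sum(len(t) for t in tgt[j:nj])
                cost = best[i][j] + abs(src_len - tgt_len)
                if cost < best[ni][nj]:
                    best[ni][nj] = cost
                    back[ni][nj] = (i, j)
    if best[n][m] == inf:
        raise ValueError("no alignment possible with the allowed pairings")
    # Walk back from the end to recover the chosen pairings.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return pairs[::-1]

english = ["Sophie walked home.", "It was late.",
           "The garden was quiet and the gate stood open."]
german = ["Sofie ging nach Hause, es war spät.",
          "Der Garten war still und das Tor stand offen."]
print(align(english, german))
```

Here the first two English sentences are paired with the single German sentence that translates both, and the third sentences are paired one to one.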
26. Example: SMULTRON corpus
- Developed and released in 2007-8
- Relatively small
- Aligned texts in English, Swedish and German
  - E.g. Sophie's World is one of the texts
- Annotated with syntax, POS, morphology
- Comes with a tool to view parallel syntactic trees.
27. SMULTRON example: English (Sophie's World)
<s id="s3">
  <terminals>
    <t id="s3_1" word="Sophie" pos="NNP" morph="--"/>
    <t id="s3_2" word="Amundsen" pos="NNP" morph="--"/>
    <t id="s3_3" word="was" pos="VBD" morph="--"/>
    <t id="s3_4" word="on" pos="IN" morph="--"/>
    <t id="s3_5" word="her" pos="PRP" morph="--"/>
    <t id="s3_6" word="way" pos="NN" morph="--"/>
    <t id="s3_7" word="home" pos="RB" morph="--"/>
    <t id="s3_8" word="from" pos="IN" morph="--"/>
    <t id="s3_9" word="school" pos="NN" morph="--"/>
    <t id="s3_10" word="." pos="." morph="--"/>
  </terminals>
</s>
- This shows terminal nodes only. The corpus also represents syntactic non-terminals (NP, VP etc).
28. SMULTRON: Same sentence in German
<s id="s3">
  <terminals>
    <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie"/>
    <t id="s3_2" word="Amundsen" pos="NE" morph="--" lemma="Amundsen"/>
    <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
    <t id="s3_4" word="auf" pos="APPR" morph="--" lemma="auf"/>
    <t id="s3_5" word="dem" pos="ART" morph="--" lemma="der"/>
    <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg"/>
    <t id="s3_7" word="von" pos="APPR" morph="--" lemma="von"/>
    <t id="s3_8" word="der" pos="ART" morph="--" lemma="die"/>
    <t id="s3_9" word="Schule" pos="NN" morph="FEM" lemma="Schule"/>
    <t id="s3_10" word="." pos="." morph="--" lemma="--"/>
  </terminals>
</s>
- Note the richer morphology and the representation of lemmas.
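Terminal nodes stored this way are easy to read programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` on three of the German terminals, assuming the well-formed attribute syntax `name="value"`; the function name `terminals` is hypothetical, and the snippet is not the viewing tool distributed with the corpus.

```python
import xml.etree.ElementTree as ET

SENTENCE_XML = """
<s id="s3">
  <terminals>
    <t id="s3_1" word="Sofie" pos="NE" morph="FEM" lemma="Sofie"/>
    <t id="s3_3" word="war" pos="VAFIN" morph="--" lemma="sein"/>
    <t id="s3_6" word="Heimweg" pos="NN" morph="MASK" lemma="Heimweg"/>
  </terminals>
</s>
"""

def terminals(xml_text):
    """Return (word, pos, lemma) for every terminal node, in document order.

    A missing attribute, such as lemma on the English terminals, comes back as None.
    """
    root = ET.fromstring(xml_text.strip())
    return [(t.get("word"), t.get("pos"), t.get("lemma")) for t in root.iter("t")]

for word, pos, lemma in terminals(SENTENCE_XML):
    print(word, pos, lemma)
```

Reading the word and lemma side by side shows what the German annotation adds: "war" is recorded under its head word "sein".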
29. Translation corpora
- Not parallel.
- Have different texts in two or more different languages, of the same genre.
- Examples:
  - PAROLE corpus is a translation corpus for EU languages
30. Why translation corpora?
- Parallel corpora, by definition, contain translated text (L2)
  - can give rise to errors
  - artificiality and translation quality can be an issue
  - e.g. McEnery & Wilson report a study on an English-Polish corpus. The Polish text reads like a translation
  - Problem can be overcome if the texts used are professionally translated.
- Translation corpora have texts in two or more languages, in the original.
  - Data is more natural.
31. Summary
- We have now concluded our initial incursion into:
  - corpus construction
  - corpus annotation
  - corpus typology
- Next up:
  - using corpora for linguistic research