Title: LELA 30922 Lecture 5
- Corpus annotation and SGML
- See esp.
  - R Garside, G Leech & A McEnery (eds) Corpus Annotation, London (1997) Longman, ch. 1 "Introduction" by G Leech; something similar available at http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf
  - CM Sperberg-McQueen and L Burnard (eds) Guidelines for Electronic Text Encoding and Interchange, ch. 2 "A Gentle Introduction to SGML", available at http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html
Annotation
- Difference between a corpus and a mere collection of texts is mainly due to the value added by annotation
  - Includes generic information about the text, usually stored in a header
  - But more significantly, annotations within the text itself
Why annotate?
- Adds information
- Reflects some analysis of text
  - Inasmuch as this may reflect commitment to some theoretical approach, this can sometimes be a barrier (but see later)
- Increases usefulness/reusability of text
- Multi-functionality
  - May make corpus usable for something not originally foreseen by its compilers
Golden rules of annotation
- Recoverability
  - It should always be possible to ignore the annotation and reconstruct the corpus in its raw form
- Extricability
  - Correspondingly, annotations should be easily accessible so they can be stored separately if necessary ("before" and "after" versions)
- Transparency: documentation
  - Purpose and meaning of annotations
  - How (e.g. manually or automatically), where and by whom annotations were done
  - If automatic, information about the programs used
- Quality indication
  - Annotations almost inevitably include some errors or inconsistencies
  - To what extent have annotations been checked?
  - What is the measured accuracy rate, and against what benchmark?
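Recoverability and extricability can be illustrated with the simple word_TAG style used in the LOB example later in these slides. The following is only a minimal sketch: the function name `split_annotation` and the choice to key the separated tags by token position are assumptions made for illustration, not part of any corpus toolkit.

```python
def split_annotation(tagged_text, sep="_"):
    """Split word_TAG tokens into parallel lists of words and tags."""
    words, tags = [], []
    for token in tagged_text.split():
        # Split on the last separator, so a token such as ",_," still works.
        word, found, tag = token.rpartition(sep)
        if not found:              # untagged token: keep the word, no tag
            word, tag = token, None
        words.append(word)
        tags.append(tag)
    return words, tags


tagged = "hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_,"
words, tags = split_annotation(tagged)

# Recoverability: the annotation can be ignored to get the tokens back.
raw_text = " ".join(words)

# Extricability: the tags can be stored separately, keyed by token position.
standoff = dict(enumerate(tags))
```

Note that this recovers only the tokenised text. The original spacing (for instance, the comma attached directly to "virtue") would have to be recorded somewhere for the raw form to be fully reconstructed.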
Theory-neutrality
- Schools of thought
  - Annotations may reflect a particular theoretical approach, and this should be acknowledged
- Consensus
  - Corpus annotations which are more (rather than less) theory-neutral will be more widely used
  - Given the amount of work involved, it pays to be aware of the descriptive traditions of the relevant field
- Standards
  - There are very few absolute standards, but some schemes can become de facto standards through widespread use
  - For example, BNC designers were aware of the likely side effects of any decisions (regarding annotation) that they took
Types of annotation
- Plain corpus: it appears in its existing raw state of plain text
- Corpus marked up for formatting attributes, e.g. page breaks, paragraphs, font sizes
- Corpus annotated with identifying information, such as title, author, genre, register, edition date
- Corpus annotated with linguistic information
- Corpus annotated with additional interpretive information, e.g. error analysis in learner corpus
Levels of linguistic annotation
- Paragraph and sentence-boundary disambiguation
  - Naive "full stop + space + capital" rule is unreliable for genuine texts
  - May also involve distinguishing titles/headings from running text
- Tokenization: identification of lexical units
  - multi-word units, cliticised words (e.g. can't)
- Lemmatisation: identification of lemmas (or lexemes)
  - Makes accessible variants of lexemes for more generic searches
  - May involve some disambiguation (e.g. rose)
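To see why the naive sentence-boundary rule is unreliable, and what splitting off a clitic involves, here is a small sketch. The regular expressions, the function names and the example sentence are all invented for illustration. The clitic list is deliberately short, and cutting can't as ca + n't is one common convention rather than the only possibility.

```python
import re

# Naive rule: a full stop, then whitespace, then a capital letter.
NAIVE_BOUNDARY = re.compile(r"(?<=\.)\s+(?=[A-Z])")

# A few English clitics that may be split off the end of a word.
CLITIC = re.compile(r"^(.+?)(n't|'re|'s|'ll|'ve|'d|'m)$")


def naive_sentences(text):
    """Split text wherever the naive boundary rule fires."""
    return NAIVE_BOUNDARY.split(text)


def split_clitic(word):
    """Return [host, clitic] if the word ends in a known clitic, else [word]."""
    match = CLITIC.match(word)
    return [match.group(1), match.group(2)] if match else [word]


text = "Dr. Smith arrived at 5 p.m. on Monday. He left early."
print(naive_sentences(text))
# The rule wrongly splits after "Dr.", giving three pieces for two sentences.

print(split_clitic("they're"), split_clitic("can't"), split_clitic("men"))
```

Abbreviations such as "Dr." are exactly the kind of case that makes the simple rule fail on genuine text, so real sentence splitters need abbreviation lists or statistical models.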
Levels of linguistic annotation
- POS tagging (grammatical tagging)
  - Assigning to each lexical unit a code indicating its part of speech
  - Most basic type of linguistic corpus annotation; forms an essential foundation for further forms of analysis
- Parsing (treebanking)
  - Identification of syntactic relationships between words
- Semantic tagging
  - Marking of word senses (sense resolution)
  - Marking of semantic relationships, e.g. agent, patient
  - Marking with semantic categories, e.g. human, animate
Levels of linguistic annotation
- Discourse annotation
  - Especially for transcribed speech
  - Identifying discourse function of text, e.g. apology, greeting
  - Or other pragmatic aspects, e.g. politeness level
- Anaphoric annotation
  - Identification of pronoun reference
  - And other anaphoric links (e.g. different references to the same entity)
- Phonetic transcription (only in spoken language corpora)
  - Indication of details of pronunciation not otherwise reflected in transcription, e.g. weak forms
  - Explicit indication of accent/dialect features, e.g. vowel qualities, allophonic variation
- Prosodic annotation (only in spoken language corpora)
  - Suprasegmental information, e.g. stress, intonation, rhythm
Some examples
PROSODIC ANNOTATION, LONDON-LUND CORPUS
well very nice of you to ((come and)) _spare the !t\/ime and come and !t\alk - tell me about the - !pr\oblems And incidentally . I _at_ do do t\ell me anything you want about the college in !g\eneral
Source: Leech chapter in Garside et al. 1997
EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS
hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_!
the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN was_BEDZ cut_VBN at_IN the_ATI last_AP moment_NN ,_, had_HVD comparatively_RB little_AP to_TO sing_VB
'_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_' roared_VBD Rollinson_NP ._.
EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH CORPUS
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_, [Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ [P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1 Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II [N Windsor_NP1 safari_NN1 park_NNL1 N]P]N]P]V] ._. S]
Source: http://ucrel.lancs.ac.uk/annotation.html
ANAPHORIC ANNOTATION OF AP NEWSWIRE
S.1 The state Supreme Court has refused to release Rahway State Prison inmate James Scott on bail.
S.2 The fighter is serving 30-40 years for a 1975 armed robbery conviction.
S.3 Scott had asked for freedom while he waits for an appeal decision.
S.4 Meanwhile, his promoter, Murad Muhammed, said Wednesday he netted only $15,250 for Scott's nationally televised light heavyweight fight against ranking contender Yaqui Lopez last Saturday.
S.5 The fight, in which Scott won a unanimous decision over Lopez, grossed $135,000 for Muhammed's firm, Triangle Productions of Newark, he said.
S.1 (0) The state Supreme Court has refused to release 1 2 Rahway State Prison 2 inmate 1 (1 James Scott 1) on bail .
S.2 (1 The fighter 1) is serving 30-40 years for a 1975 armed robbery conviction .
S.3 (1 Scott 1) had asked for freedom while <1 he waits for an appeal decision .
S.4 Meanwhile , 3 <1 his promoter 3 , 3 Murad Muhammed 3 , said Wednesday <3 he netted only $15,250 for (4 1 Scott 1 's nationally televised light heavyweight fight against 5 ranking contender 5 (5 Yaqui Lopez 5) last Saturday 4) .
S.5 (4 The fight , in which 1 Scott 1 won a unanimous decision over (5 Lopez 5) 4) , grossed $135,000 for 6 3 Muhammed 3 's firm 6, 6 Triangle Productions of Newark 6 , <3 he said .
Source: http://ucrel.lancs.ac.uk/annotation.html
SGML
- Although none of the examples just shown use it, for all but the simplest of mark-up schemes, SGML is widely recommended and used
- SGML = Standard Generalized Mark-up Language
- Actually suitable for all sorts of things, including web pages (HTML is SGML-conformant)
What is a mark-up language?
- Mark-up historically referred to printer's marks on a manuscript to indicate typesetting requirements.
- Now covers all sorts of codes inserted into electronic texts to govern formatting, printing, or other information.
- Mark-up, or (synonymously) encoding, is defined as any means of making explicit an interpretation of a text.
- By mark-up language we mean a set of mark-up conventions used together for encoding texts. The language must specify:
  - what mark-up is allowed
  - what mark-up is required
  - how mark-up is to be distinguished from text
  - what the mark-up means
- SGML provides the means for doing the first three
- Separate documentation/software is required for the last
  - e.g. (1) the difference between identifying something as <emph> and how that appears in print; (2) why something may or may not be tagged as a relative clause
Rules of SGML
- SGML allows us to define
  - Elements
  - Specific features of elements
  - Hierarchical/structural relations between elements
- These are specified in a document type definition (DTD)
- DTD allows software to be written to
  - Help annotators annotate consistently
  - Explore marked-up documents
Elements in SGML
- Have a (unique) name
- Semantics of name are application dependent
  - Up to designer to choose appropriate name, but nothing automatically follows from the choice of any particular name
- Each element must be explicitly marked or tagged in some way
  - Most usual is with <element> and </element> pairs, called start- and end-tags
  - Much SGML-compliant software seems to allow start-only tags
  - &element; (esp. useful for single words or characters)
  - _tag suffix
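The tagging styles listed above are different surface forms of the same information, so one can be converted into another mechanically. As a rough sketch, the following turns the _tag suffix style into start-/end-tag pairs. The function name `suffix_to_tags`, the element name `w` and the attribute name `pos` are assumptions made for this example only.

```python
from html import escape


def suffix_to_tags(tagged_text, element="w"):
    """Rewrite word_TAG tokens as <w pos="TAG">word</w> elements."""
    pieces = []
    for token in tagged_text.split():
        word, found, tag = token.rpartition("_")
        if not found:
            raise ValueError(f"token without a tag suffix: {token!r}")
        # Escape characters that would otherwise be read as mark-up.
        pieces.append(
            f'<{element} pos="{escape(tag)}">{escape(word, quote=False)}</{element}>'
        )
    return " ".join(pieces)


print(suffix_to_tags("is_BEZ an_AT excellent_JJ virtue_NN"))
```

Quoting the attribute value keeps the output unambiguous even when the tag is a punctuation mark, as in the token ",_,".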
Attributes
- Elements can have named attributes with associated values
- When defined, values can be identified as
  - #REQUIRED: must be specified
  - #IMPLIED: optional
  - #CURRENT: inferred to be the same as the last specified value for that attribute
- Values can be from a predefined list, or can be of a general type (string, integer, etc.)
DTD (Document type definition)
- Helps to impose uniformity over the corpus
- Defines the (expected or to-be-imposed) structure of the document
- For each element, defines
  - How it appears (whether end tags are required)
  - What its substructure is, i.e. what elements, how many of them, whether compulsory or not
Example of DTD
<!ELEMENT anthology      - - (poem+)>
<!ELEMENT poem           - - (title?, (stanza+ | couplet+))>
<!ELEMENT title          - O (#PCDATA)>
<!ELEMENT stanza         - O (line+)>
<!ELEMENT couplet        - O (cline, cline)>
<!ELEMENT (line | cline) O O (#PCDATA)>
- Start and end tags necessary (-) or optional (O)
- Anthology consists of 1 or more poems
- Poem has an optional title, then 1 or more stanzas or 1 or more couplets
- Title consists of parsed character data, i.e. normal text
- Stanza has one or more lines, couplet has two lines
- Both lines and clines have the same definition: normal text
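One way to read a content model is as a regular expression over the names of an element's children. The sketch below checks child sequences against the models described above. The patterns are translated by hand, whereas a real SGML parser would derive them from the DTD, and the names `CONTENT_MODELS` and `children_ok` are invented for the example.

```python
import re

# Content models from the DTD, rewritten by hand as regular expressions
# over a sequence of child element names, each followed by one space.
CONTENT_MODELS = {
    "anthology": r"(poem )+",
    "poem": r"(title )?((stanza )+|(couplet )+)",
    "stanza": r"(line )+",
    "couplet": r"cline cline ",
}


def children_ok(parent, children):
    """True if the list of child element names satisfies the parent's model."""
    encoded = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[parent], encoded) is not None


print(children_ok("poem", ["title", "stanza", "stanza"]))  # optional title, then stanzas
print(children_ok("poem", ["stanza", "couplet"]))          # mixes stanzas and couplets
print(children_ok("couplet", ["cline"]))                   # a couplet needs two clines
```

The occurrence indicators map directly: `?` is optional, `+` is one or more, a comma is sequence and `|` is alternation.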
Attributes
<!ATTLIST poem
  id     ID                            #IMPLIED
  status (draft | revised | published) draft>
- DTD defines the attributes expected/required for each element
- A poem has an id and a status
- Value of id is any identifier, and is optional
- Status is one of three values, default draft
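The effect of the attribute declarations can be sketched as a small function that works out the effective attribute values for one start-tag. This is an illustrative approximation only: `POEM_ATTLIST` is a hypothetical in-memory version of the declaration for poem, and `resolve_attributes` is not part of any SGML library.

```python
REQUIRED, IMPLIED, CURRENT = "#REQUIRED", "#IMPLIED", "#CURRENT"

# Hypothetical in-memory form of the attribute list for <poem>:
# attribute name -> (set of allowed values, or None for any value; default)
POEM_ATTLIST = {
    "id": (None, IMPLIED),
    "status": ({"draft", "revised", "published"}, "draft"),
}


def resolve_attributes(attlist, given, previous=None):
    """Return the effective attribute values for one start-tag."""
    previous = previous or {}
    for name in given:
        if name not in attlist:
            raise ValueError(f"undeclared attribute {name!r}")
    resolved = {}
    for name, (allowed, default) in attlist.items():
        if name in given:
            value = given[name]
        elif default == REQUIRED:
            raise ValueError(f"missing required attribute {name!r}")
        elif default == IMPLIED:
            continue                  # optional and absent: nothing to record
        elif default == CURRENT:
            if name not in previous:
                raise ValueError(f"no earlier value for {name!r}")
            value = previous[name]    # reuse the last specified value
        else:
            value = default           # literal default from the declaration
        if allowed is not None and value not in allowed:
            raise ValueError(f"{value!r} is not allowed for {name!r}")
        resolved[name] = value
    return resolved


print(resolve_attributes(POEM_ATTLIST, {"id": "12", "status": "revised"}))
print(resolve_attributes(POEM_ATTLIST, {"id": "13"}))  # status falls back to draft
```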
<anthology>
<poem id=12 status=revised>
<title>It's a grand old team</title>
<stanza>
<line>It's a grand old team to play for
<line>It's a grand old team to support
<line>And if you know your history
<line>It's enough to make your heart go Whoooooah
</stanza>
</poem>
<poem id=13> ... </poem>
</anthology>
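A document like this can be read by program once the tag delimiters are in place. Python 3 has no SGML parser in its standard library, so the sketch below uses `html.parser.HTMLParser` as a stand-in, because it tolerates unknown element names and unquoted attribute values. It does not read the DTD, so the omitted line end-tags are inferred by hand. The class name `PoemReader` and the shortened sample are assumptions for illustration.

```python
from html.parser import HTMLParser


class PoemReader(HTMLParser):
    """Collect the attributes, title and lines of each <poem>."""

    def __init__(self):
        super().__init__()
        self.poems = []        # one dict per <poem>
        self._target = None    # "title" or "line" while inside that element

    def handle_starttag(self, tag, attrs):
        if tag == "poem":
            self.poems.append({"attrs": dict(attrs), "title": "", "lines": []})
        elif tag == "title":
            self._target = "title"
        elif tag == "line":
            # A new <line> implicitly closes the previous one (end-tag omitted).
            self.poems[-1]["lines"].append("")
            self._target = "line"

    def handle_endtag(self, tag):
        if tag in ("title", "stanza", "poem"):
            self._target = None

    def handle_data(self, data):
        if not self.poems or self._target is None:
            return
        poem = self.poems[-1]
        if self._target == "title":
            poem["title"] += data
        else:
            poem["lines"][-1] += data


SAMPLE = """<anthology>
<poem id=12 status=revised>
<title>It's a grand old team</title>
<stanza>
<line>It's a grand old team to play for
<line>It's a grand old team to support
</stanza>
</poem>
</anthology>"""

reader = PoemReader()
reader.feed(SAMPLE)
reader.close()
poem = reader.poems[0]
lines = [line.strip() for line in poem["lines"]]
print(poem["attrs"], poem["title"], lines)
```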
Mark-up exemplified
RAW TEXT
Two men retained their marbles, and as luck would have it they're both roughie-toughie types as well as military scientists - a cross between Albert Einstein and Action Man!
TOKENIZED TEXT
<w orth=CAP>Two</w> <w>men</w> <w>retained</w> <w>their</w> <w>marbles</w><c PUN>,</c>
<w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w>
<w>they</w><w>'re</w> <w>both</w> <w>roughie-toughie</w> <w>types</w>
<w>as</w> <w>well</w> <w>as</w> <w>military</w> <w>scientists</w> <c PUN>&mdash;</c>
<w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w>
<w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>
LEMMATIZED TEXT
<w orth=CAP>Two</w> <w lem=man>men</w> <w lem=retain>retained</w> <w>their</w> <w lem=marble>marbles</w><c PUN>,</c>
<w>and</w> <w>as</w> <w>luck</w> <w>would</w> <w>have</w> <w>it</w>
<w>they</w><w lem=be>'re</w> <w>both</w> <w>roughie-toughie</w> <w lem=type>types</w>
<w>as</w> <w>well</w> <w>as</w> <w>military</w> <w lem=scientist>scientists</w> <c PUN>&mdash;</c>
<w>a</w> <w>cross</w> <w>between</w> <w orth=CAP>Albert</w> <w orth=CAP>Einstein</w>
<w>and</w> <w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>
POS TAGGED TEXT
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> <w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w><c PUN>,</c>
<w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w>
<w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w>
<w AV0>as</w> <w AV0>well</w> <w CJS>as</w> <w AJ0>military</w> <w NN2>scientists</w> <c PUN>&mdash;</c>
<w AT0>a</w> <w NN1>cross</w> <w PRP>between</w> <w NP0>Albert</w> <w NP0>Einstein</w>
<w CJC>and</w> <w NN1>Action</w> <w NN1-NP0>Man</w><c PUN>!</c>
POS TAGGED TEXT with idioms and named entities
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>
<phrase type=idiom><w VVD lem=retain>retained</w> <w DPS>their</w> <w NN2 lem=marble>marbles</w></phrase><c PUN>,</c>
<w CJC>and</w>
<phrase type=idiom><w CJS>as</w> <w NN1-VVB>luck</w> <w VM0>would</w> <w VHI>have</w> <w PNP>it</w></phrase>
<w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w> <w AJ0>roughie-toughie</w> <w NN2>types</w>
<phrase type=compound pos=CJS><w AV0>as</w> <w AV0>well</w> <w CJS>as</w></phrase>
<phrase type=compound pos=NN2><w AJ0>military</w> <w NN2>scientists</w></phrase> <c PUN>&mdash;</c>
<w AT0>a</w> <w NN1>cross</w> <w PRP>between</w>
<phrase type=compound pos=NP0><w NP0>Albert</w> <w NP0>Einstein</w></phrase>
<w CJC>and</w>
<phrase type=compound pos=NP0><w NN1>Action</w> <w NN1-NP0>Man</w></phrase><c PUN>!</c>
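Mark-up of this kind is what makes searches by lemma or by part of speech possible. The sketch below assumes the mark-up is written out with explicit angle brackets and = signs, for example `<w NN2 lem=man>men</w>`, and treats a bare value inside the tag as the POS code. The names `read_tokens` and `find_lemma`, and the fallback to the lower-cased word form when no lem attribute is given, are assumptions made for this example.

```python
import re

# A <w ...>form</w> element; the attribute text is optional.
W_ELEMENT = re.compile(r"<w(\s[^>]*)?>([^<]*)</w>")


def read_tokens(markup):
    """Return one dict per <w> element with its form, pos, lem and orth."""
    tokens = []
    for attr_text, form in W_ELEMENT.findall(markup):
        token = {"form": form, "pos": None, "lem": None, "orth": None}
        for item in attr_text.split():
            if "=" in item:
                name, value = item.split("=", 1)
                token[name] = value
            else:
                token["pos"] = item   # bare value: taken to be the POS code
        tokens.append(token)
    return tokens


def find_lemma(tokens, lemma):
    """Word forms whose lemma (or, failing that, lower-cased form) matches."""
    return [t["form"] for t in tokens
            if (t["lem"] or t["form"].lower()) == lemma]


SAMPLE = ('<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> '
          '<w VVD lem=retain>retained</w> <w DPS>their</w> '
          '<w NN2 lem=marble>marbles</w><c PUN>,</c>')

tokens = read_tokens(SAMPLE)
print(find_lemma(tokens, "man"))                           # finds the form "men"
print([t["form"] for t in tokens if t["pos"] == "NN2"])    # all plural nouns
```

A search for the lemma man retrieves the form men, which a plain string search on the raw text would miss.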