LELA 30922 Lecture 5 - PowerPoint PPT Presentation

Provided by: Har57

1
LELA 30922 Lecture 5
  • Corpus annotation and SGML
  • See esp.
  • R Garside, G Leech & A McEnery (eds) Corpus
    Annotation, London (1997) Longman, ch. 1
    Introduction by G Leech; something similar
    available at
    http://llc.oxfordjournals.org/cgi/reprint/8/4/275.pdf
  • CM Sperberg-McQueen and L Burnard (eds)
    Guidelines for Electronic Text Encoding and
    Interchange, ch. 2 A Gentle Introduction to
    SGML, available at
    http://www-sul.stanford.edu/tools/tutorials/html2.0/gentle.html

2
Annotation
  • Difference between a corpus and a mere
    collection of texts is mainly due to the value
    added by annotation
  • Includes generic information about the text,
    usually stored in a header
  • But more significantly, annotations within the
    text itself

3
Why annotate?
  • Adds information
  • Reflects some analysis of text
  • Inasmuch as this may reflect commitment to some
    theoretical approach, this can be a barrier
    sometimes (but see later)
  • Increases usefulness/reusability of text
  • Multi-functionality
  • May make corpus usable for something not
    originally foreseen by its compilers

4
Golden rules of annotation
  • Recoverability
  • It should always be possible to ignore the
    annotation and reconstruct the corpus in its raw
    form
  • Extricability
  • Correspondingly, annotations should be easily
    accessible so they can be stored separately if
    necessary (Before and after versions)
  • Transparency: documentation
  • Purpose and meaning of annotations
  • How (eg manually or automatically), where and by
    whom annotations were done
  • If automatic, information about the programs used
  • Quality indication
  • Annotations almost inevitably include some errors
    or inconsistencies
  • To what extent have annotations been checked?
  • What is the measured accuracy rate, and against
    what benchmark?

5
Theory-neutrality
  • Schools of thought
  • Annotations may reflect a particular theoretical
    approach, and this should be acknowledged
  • Consensus
  • corpus annotations which are more (rather than
    less) theory-neutral will be more widely used
  • given the amount of work involved, it pays to be
    aware of the descriptive traditions of the
    relevant field
  • Standards
  • There are very few absolute standards, but some
    schemes can become de facto standards through
    widespread use
  • For example, BNC designers were aware of the
    likely side effects of any decisions (regarding
    annotation) that they took

6
Types of annotation
  • Plain corpus: the text appears in its existing
    raw state as plain text
  • Corpus marked up for formatting attributes, e.g.
    page breaks, paragraphs, font sizes
  • Corpus annotated with identifying information,
    such as title, author, genre, register, edition
    date
  • Corpus annotated with linguistic information
  • Corpus annotated with additional interpretive
    information, eg error analysis in a learner corpus

7
Levels of linguistic annotation
  • Paragraph and sentence-boundary disambiguation
  • Naive "full stop + space + capital" rule is
    unreliable for genuine texts
  • May also involve distinguishing titles/headings
    from running text
  • Tokenization: identification of lexical units
  • multi-word units, cliticised words (eg can't)
  • Lemmatisation: identification of lemmas (or
    lexemes)
  • Makes variants of lexemes accessible for more
    generic searches
  • May involve some disambiguation (eg rose)
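
A minimal Python sketch of why the naive sentence-boundary rule is unreliable. The function name, the regular expression and the sample sentence are illustrative assumptions, not part of the lecture.

```python
import re

def naive_split(text):
    """Split at a space that follows '.', '!' or '?' and precedes a capital letter."""
    return re.split(r"(?<=[.!?]) (?=[A-Z])", text)

# Invented sample: two real sentences, but "Dr." also fits the naive pattern.
sample = "Dr. Smith arrived at 5 p.m. on Friday. He was late."
sentences = naive_split(sample)
for s in sentences:
    print(s)
```

The abbreviation "Dr." is wrongly treated as a sentence end, so the sketch returns three pieces for two sentences. A real segmenter would need at least an abbreviation list.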

8
Levels of linguistic annotation
  • POS tagging (grammatical tagging)
  • assigning to each lexical unit a code indicating
    its part of speech
  • most basic type of linguistic corpus annotation
    and forms an essential foundation for further
    forms of analysis
  • Parsing (treebanking)
  • Identification of syntactic relationships between
    words
  • Semantic tagging
  • Marking of word senses (sense resolution)
  • Marking of semantic relationships eg agent,
    patient
  • Marking with semantic categories eg human, animate

9
Levels of linguistic annotation
  • Discourse annotation
  • especially for transcribed speech
  • Identifying discourse function of text, eg
    apology, greeting
  • or other pragmatic aspects, eg politeness level
  • Anaphoric annotation
  • Identification of pronoun reference
  • and other anaphoric links (eg different
    references to the same entity)
  • Phonetic transcription (only in spoken language
    corpora)
  • Indication of details of pronunciation not
    otherwise reflected in transcription, eg weak
    forms
  • Explicit indication of accent/dialect features, eg
    vowel qualities, allophonic variation
  • Prosodic annotation (only in spoken language
    corpora)
  • Suprasegmental information, eg stress, intonation,
    rhythm

10
Some examples
PROSODIC ANNOTATION, LONDON-LUND CORPUS
well very nice of you to ((come and)) _spare the
!t\/ime and come and !t\alk - tell me about the -
!pr\oblems And incidentally . I _at_ do do t\ell me
anything you want about the college in !g\eneral
Source: Leech chapter in Garside et al. 1997
11
EXAMPLE OF PART-OF-SPEECH TAGGING, LOB CORPUS
hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_,
but_CC not_XNOT when_WRB the_ATI guests_NNS have_HV
to_TO sleep_VB in_IN rows_NNS in_IN the_ATI cellar_NN !_!
the_ATI lovers_NNS ,_, whose_WP$ chief_JJB scene_NN
was_BEDZ cut_VBN at_IN the_ATI last_AP moment_NN ,_,
had_HVD comparatively_RB little_AP to_TO sing_VB
'_' he_PP3A stole_VBD my_PP$ wallet_NN !_! '_'
roared_VBD Rollinson_NP ._.
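
A small Python sketch of how word_TAG annotation of this kind can be read, and how the raw words can be recovered (the recoverability rule). The function name is an assumption for illustration, and the input is only the opening of the LOB example.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs at the last underscore."""
    pairs = []
    for token in text.split():
        word, _sep, tag = token.rpartition("_")
        pairs.append((word, tag))
    return pairs

tagged = "hospitality_NN is_BEZ an_AT excellent_JJ virtue_NN ,_, but_CC not_XNOT"
pairs = parse_tagged(tagged)

# Ignoring the annotation gives back the word sequence.
raw = " ".join(word for word, _tag in pairs)
print(pairs)
print(raw)
```

The words come back intact, but the original spacing around punctuation does not. A stricter notion of recoverability would need the scheme to record that too.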
EXAMPLE OF SKELETON PARSING, FROM THE SPOKEN ENGLISH CORPUS
[S[N Nemo_NP1 ,_, [N the_AT killer_NN1 whale_NN1 N] ,_,
[Fr[N who_PNQS N][V 'd_VHD grown_VVN [J too_RG big_JJ
[P for_IF [N his_APP$ pool_NN1 [P on_II [N Clacton_NP1
Pier_NNL1 N]P]N]P]J]V]Fr]N] ,_, [V has_VHZ arrived_VVN
safely_RR [P at_II [N his_APP$ new_JJ home_NN1 [P in_II
[N Windsor_NP1 safari_NN1 park_NNL1 N]P]N]P]V] ._. S]
Source: http://ucrel.lancs.ac.uk/annotation.html
12
ANAPHORIC ANNOTATION OF AP NEWSWIRE
S.1 The state Supreme Court has refused to release
Rahway State Prison inmate James Scott on bail.
S.2 The fighter is serving 30-40 years for a 1975
armed robbery conviction.
S.3 Scott had asked for freedom while he waits for
an appeal decision.
S.4 Meanwhile, his promoter, Murad Muhammed, said
Wednesday he netted only $15,250 for Scott's
nationally televised light heavyweight fight against
ranking contender Yaqui Lopez last Saturday.
S.5 The fight, in which Scott won a unanimous
decision over Lopez, grossed $135,000 for Muhammed's
firm, Triangle Productions of Newark, he said.
S.1 (0) The state Supreme Court has refused to
release 1 2 Rahway State Prison 2 inmate 1
(1 James Scott 1) on bail .
S.2 (1 The fighter 1) is serving 30-40 years for a
1975 armed robbery conviction .
S.3 (1 Scott 1) had asked for freedom while <1 he
waits for an appeal decision .
S.4 Meanwhile , 3 <1 his promoter 3 , 3 Murad
Muhammed 3 , said Wednesday <3 he netted only
$15,250 for (4 1 Scott 1 's nationally televised
light heavyweight fight against 5 ranking contender 5
(5 Yaqui Lopez 5) last Saturday 4) .
S.5 (4 The fight , in which 1 Scott 1 won a
unanimous decision over (5 Lopez 5) 4) , grossed
$135,000 for 6 3 Muhammed 3 's firm 6, 6 Triangle
Productions of Newark 6 , <3 he said .
Source: http://ucrel.lancs.ac.uk/annotation.html
13
SGML
  • Although none of the examples just shown use it,
    SGML is widely recommended and used for all but
    the simplest of mark-up schemes
  • SGML = Standard Generalized Mark-up Language
  • Actually suitable for all sorts of things,
    including web pages (HTML is SGML-conformant)

14
What is a mark-up language?
  • Mark-up historically referred to printers' marks
    on a manuscript to indicate typesetting
    requirements.
  • Now covers all sorts of codes inserted into
    electronic texts to govern formatting, printing,
    or other information.
  • Mark-up, or (synonymously) encoding, is defined
    as any means of making explicit an interpretation
    of a text.
  • By mark-up language we mean a set of mark-up
    conventions used together for encoding texts.
    The language must specify:
  • what mark-up is allowed
  • what mark-up is required
  • how mark-up is to be distinguished from text
  • what the mark-up means
  • SGML provides the means for doing the first three
  • Separate documentation/software is required for
    the last
  • eg (1) difference between identifying something
    as <emph> and how that appears in print; (2) why
    something may or may not be tagged as a relative
    clause

15
Rules of SGML
  • SGML allows us to define
  • Elements
  • Specific features of elements
  • Hierarchical/structural relations between
    elements
  • These are specified in a document type definition
    (DTD)
  • DTD allows software to be written to
  • Help annotators annotate consistently
  • Explore marked-up documents

16
Elements in SGML
  • Have a (unique) name
  • Semantics of name are application dependent
  • up to designer to choose appropriate name, but
    nothing automatically follows from the choice of
    any particular name
  • Each element must be explicitly marked or tagged
    in some way
  • Most usual is with <element> and </element> pairs,
    called start- and end-tags
  • Much SGML-compliant software seems to allow
    start-only tags
  • &element; (esp. useful for single words or
    characters)
  • _tag suffix

17
Attributes
  • Elements can have named attributes with
    associated values
  • When defined, values can be identified as
  • #REQUIRED: must be specified
  • #IMPLIED: optional
  • #CURRENT: inferred to be the same as the last
    specified value for that attribute
  • Values can be from a predefined list, or can be
    of a general type (string, integer, etc)

18
DTD (Document type definition)
  • Helps to impose uniformity over the corpus
  • Defines the (expected or to-be-imposed) structure
    of the document
  • For each element, defines
  • How it appears (whether end tags are required)
  • What its substructure is, ie what elements, how
    many of them, whether compulsory or not

19
Example of DTD
<!ELEMENT anthology      - - (poem+)>
<!ELEMENT poem           - - (title?, (stanza+ | couplet+))>
<!ELEMENT title          - O (#PCDATA)>
<!ELEMENT stanza         - O (line+)>
<!ELEMENT couplet        - O (cline, cline)>
<!ELEMENT (line | cline) O O (#PCDATA)>
  • Start and end tags necessary (-) or optional (O)
  • Anthology consists of 1 or more poems
  • Poem has an optional title, then 1 or more
    stanzas or 1 or more couplets
  • Title consists of parsed character data, ie
    normal text
  • Stanza has one or more lines, couplet has two
    lines
  • Both lines and clines have the same definition:
    normal text
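
As a rough Python illustration of what checking against these content models involves, each model can be rewritten as a regular expression over the sequence of child element names. This is a simplification assumed for illustration (the dictionary and function names are invented), not how an SGML parser works internally.

```python
import re

# Content models from the slide, as regular expressions over
# space-terminated child element names.
CONTENT_MODELS = {
    "anthology": r"(poem )+",
    "poem": r"(title )?((stanza )+|(couplet )+)",
    "stanza": r"(line )+",
    "couplet": r"cline cline ",
}

def children_ok(element, children):
    """Return True if the list of child element names fits the content model."""
    sequence = "".join(name + " " for name in children)
    return re.fullmatch(CONTENT_MODELS[element], sequence) is not None

print(children_ok("poem", ["title", "stanza", "stanza"]))  # title, then stanzas
print(children_ok("poem", ["stanza", "couplet"]))          # stanzas mixed with couplets
print(children_ok("couplet", ["cline"]))                   # only one cline
```

Only the first call succeeds: a poem may not mix stanzas with couplets, and a couplet needs exactly two clines.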

20
Attributes
<!ATTLIST poem
          id     ID                           #IMPLIED
          status (draft | revised | published) draft >
  • DTD defines the attributes expected/required for
    each element
  • A poem has an id and a status
  • Value of id is any identifier, and is optional
  • Status is one of three values, default draft
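
A minimal Python sketch of how software might apply this attribute declaration to a poem start tag. The function and constant names are assumptions for illustration; a real SGML parser does this as part of validation.

```python
ALLOWED_STATUS = ("draft", "revised", "published")
DEFAULT_STATUS = "draft"

def resolve_poem_attributes(given):
    """Return the effective attributes for a poem, given those actually written."""
    # status falls back to the declared default when it is not written
    status = given.get("status", DEFAULT_STATUS)
    if status not in ALLOWED_STATUS:
        raise ValueError(f"status must be one of {ALLOWED_STATUS}, got {status!r}")
    # id is optional and has no default, so it may stay unset (None)
    return {"id": given.get("id"), "status": status}

print(resolve_poem_attributes({"id": "12", "status": "revised"}))
print(resolve_poem_attributes({"id": "13"}))
```

The second call shows the default at work: a poem written with only an id is treated as a draft.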

21
<anthology>
<poem id=12 status=revised>
<title>It's a grand old team</title>
<stanza>
<line>It's a grand old team to play for
<line>It's a grand old team to support
<line>And if you know your history
<line>It's enough to make your heart go Whoooooah
</stanza>
</poem>
<poem id=13> ... </poem>
</anthology>
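
The example leaves out the end tags for line, as the DTD permits. The Python sketch below shows one way software could still recover the lines. It uses the standard-library html.parser purely as a lenient tag tokenizer: it is not an SGML parser and does not read the DTD, so the rule "any new tag closes an open line" is hard-coded here as an assumption. The class name is invented.

```python
from html.parser import HTMLParser

class LineCollector(HTMLParser):
    """Collect the text of each <line>, whose end tag may be omitted."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self.in_line = False

    def handle_starttag(self, tag, attrs):
        # Any new start tag implicitly closes an open <line>.
        self.in_line = (tag == "line")
        if self.in_line:
            self.lines.append("")

    def handle_endtag(self, tag):
        # Any end tag (here </stanza>) also closes an open <line>.
        self.in_line = False

    def handle_data(self, data):
        if self.in_line:
            self.lines[-1] += data

doc = """<anthology><poem id=12 status=revised>
<title>It's a grand old team</title>
<stanza>
<line>It's a grand old team to play for
<line>It's a grand old team to support
<line>And if you know your history
<line>It's enough to make your heart go Whoooooah
</stanza></poem></anthology>"""

collector = LineCollector()
collector.feed(doc)
collector.close()
lines = [text.strip() for text in collector.lines]
print(lines)
```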
22
Mark-up exemplified
RAW TEXT
Two men retained their marbles, and as luck would
have it they're both roughie-toughie types as well
as military scientists - a cross between Albert
Einstein and Action Man!
TOKENIZED TEXT
<w orth=CAP>Two</w> <w>men</w> <w>retained</w>
<w>their</w> <w>marbles</w><c PUN>,</c> <w>and</w>
<w>as</w> <w>luck</w> <w>would</w> <w>have</w>
<w>it</w> <w>they</w><w>'re</w> <w>both</w>
<w>roughie-toughie</w> <w>types</w> <w>as</w>
<w>well</w> <w>as</w> <w>military</w>
<w>scientists</w> <c PUN>&mdash;</c> <w>a</w>
<w>cross</w> <w>between</w> <w orth=CAP>Albert</w>
<w orth=CAP>Einstein</w> <w>and</w>
<w orth=CAP>Action</w> <w orth=CAP>Man</w><c PUN>!</c>
23
LEMMATIZED TEXT
<w orth=CAP>Two</w> <w lem=man>men</w>
<w lem=retain>retained</w> <w>their</w>
<w lem=marble>marbles</w><c PUN>,</c> <w>and</w>
<w>as</w> <w>luck</w> <w>would</w> <w>have</w>
<w>it</w> <w>they</w><w lem=be>'re</w> <w>both</w>
<w>roughie-toughie</w> <w lem=type>types</w>
<w>as</w> <w>well</w> <w>as</w> <w>military</w>
<w lem=scientist>scientists</w> <c PUN>&mdash;</c>
<w>a</w> <w>cross</w> <w>between</w>
<w orth=CAP>Albert</w> <w orth=CAP>Einstein</w>
<w>and</w> <w orth=CAP>Action</w>
<w orth=CAP>Man</w><c PUN>!</c>
24
POS TAGGED TEXT
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>
<w VVD lem=retain>retained</w> <w DPS>their</w>
<w NN2 lem=marble>marbles</w><c PUN>,</c>
<w CJC>and</w> <w CJS>as</w> <w NN1-VVB>luck</w>
<w VM0>would</w> <w VHI>have</w> <w PNP>it</w>
<w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w>
<w AJ0>roughie-toughie</w> <w NN2>types</w>
<w AV0>as</w> <w AV0>well</w> <w CJS>as</w>
<w AJ0>military</w> <w NN2>scientists</w>
<c PUN>&mdash;</c> <w AT0>a</w> <w NN1>cross</w>
<w PRP>between</w> <w NP0>Albert</w>
<w NP0>Einstein</w> <w CJC>and</w> <w NN1>Action</w>
<w NN1-NP0>Man</w><c PUN>!</c>
25
POS TAGGED TEXT with idioms and named entities
<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w>
<phrase type=idiom><w VVD lem=retain>retained</w>
<w DPS>their</w>
<w NN2 lem=marble>marbles</w></phrase><c PUN>,</c>
<w CJC>and</w>
<phrase type=idiom><w CJS>as</w> <w NN1-VVB>luck</w>
<w VM0>would</w> <w VHI>have</w>
<w PNP>it</w></phrase>
<w PNP>they</w><w VBB lem=be>'re</w> <w AV0>both</w>
<w AJ0>roughie-toughie</w> <w NN2>types</w>
<phrase type=compound pos=CJS><w AV0>as</w>
<w AV0>well</w> <w CJS>as</w></phrase>
<phrase type=compound pos=NN2><w AJ0>military</w>
<w NN2>scientists</w></phrase>
<c PUN>&mdash;</c> <w AT0>a</w> <w NN1>cross</w>
<w PRP>between</w>
<phrase type=compound pos=NP0><w NP0>Albert</w>
<w NP0>Einstein</w></phrase> <w CJC>and</w>
<phrase type=compound pos=NP0><w NN1>Action</w>
<w NN1-NP0>Man</w></phrase><c PUN>!</c>
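
A short Python sketch of how annotation in this style might be read back by software. It assumes the tags are written as <w NN2 lem=man>men</w>, with the bare value taken as the part-of-speech code and lem= as the lemma. The regular expression and function name are illustrative assumptions, and a regex is no substitute for a real SGML parser on a full corpus.

```python
import re

sample = ("<w orth=CAP CRD>Two</w> <w NN2 lem=man>men</w> "
          "<w VVD lem=retain>retained</w> <w DPS>their</w> "
          "<w NN2 lem=marble>marbles</w><c PUN>,</c>")

# One <w> or <c> element: tag name, attribute text, then the word form.
TOKEN = re.compile(r"<(w|c)\b([^>]*)>([^<]*)</\1>")

def read_tokens(text):
    """Return one dict per token with its form, POS code and lemma (if given)."""
    tokens = []
    for _kind, attr_text, form in TOKEN.findall(text):
        attrs = attr_text.split()
        pos = next((a for a in attrs if "=" not in a), None)
        named = dict(a.split("=", 1) for a in attrs if "=" in a)
        tokens.append({"form": form, "pos": pos, "lemma": named.get("lem")})
    return tokens

tokens = read_tokens(sample)
# Recoverability: deleting every tag gives back the raw text.
raw = re.sub(r"<[^>]+>", "", sample)
print(tokens)
print(raw)
```

Here the raw text comes back exactly, spacing included, because the mark-up is kept distinct from the text itself.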