Annotation of corpora - PowerPoint PPT Presentation


Slides: 31
Provided by: Elk110



Title: Annotation of corpora

Annotation of corpora
  • A. Part-of-speech tagging
  • B. Syntactic annotation
  • C. Semantic annotation
  • D. Discourse annotation
  • E. Pragmatic annotation

Annotation of corpora
  • perfectly plain: produced by scanning; no
    information about the text (usually, not even
  • marked up for formatting attributes e.g. page
    breaks, paragraphs, font sizes, italics, etc.
  • annotated with identifying information, e.g.
    edition date, author, genre, register, etc.
  • annotated for part of speech, syntactic
    structure, discourse information, etc.

A. Part-of-speech tagging
  • LOB sample with POS tagging
  • A01 2 '_' stop_VB electing_VBG life_NN
    peers_NNS '_' ._.
  • A01 3 by_IN Trevor_NP Williams_NP ._.
  • A01 4 a_AT move_NN to_TO stop_VB \0Mr_NPT
    Gaitskell_NP from_IN
  • A01 4 nominating_VBG any_DTI more_AP labour_NN
  • A01 5 life_NN peers_NNS is_BEZ to_TO be_BE
    made_VBN at_IN a_AT meeting_NN
  • A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
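
A minimal sketch of how such a horizontally tagged line could be read, assuming the first two fields are the text id and line number and every remaining field is a word_TAG token. The function name parse_tagged_line is hypothetical.

```python
def parse_tagged_line(line):
    """Split a horizontal 'word_TAG' line into (word, tag) pairs.

    Assumption: the first two fields are the text id and the line
    number (e.g. 'A01 3'); all remaining fields are word_TAG tokens.
    """
    fields = line.split()
    text_id, line_no = fields[0], int(fields[1])
    pairs = []
    for token in fields[2:]:
        # Split on the LAST underscore, so '._.' gives ('.', '.').
        word, _, tag = token.rpartition("_")
        pairs.append((word, tag))
    return text_id, line_no, pairs


text_id, line_no, pairs = parse_tagged_line("A01 3 by_IN Trevor_NP Williams_NP ._.")
print(text_id, line_no, pairs)
```

Splitting on the last underscore rather than the first keeps punctuation tokens such as ._. intact.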

A. Part-of-speech tagging
  • Main steps:
  • Divide the text into word tokens (tokenization)
  • Select a set of tags
  • Apply the tag set to the tokens
  • Tokenization:
  • is an orthographic word always a morpho-syntactic
    unit?
  • multiwords, e.g., "in spite of": label as
    in_PREP31 spite_PREP32 of_PREP33
  • mergers, e.g., clitics as in "hasn't", "je t'aime",
    "vendetelo": label as vendete_VERB lo_PRON
  • compounds, e.g., "tag set": label as
    tagset_NOUN or tag_NOUN set_NOUN?
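
The tokenization issues above can be sketched in a few lines. This is an illustration under stated assumptions, not a real tokenizer: the multiword list and the clitic pattern are hypothetical and cover only a handful of English cases.

```python
import re

# Hypothetical resources, for illustration only.
MULTIWORDS = {("in", "spite", "of"): "PREP"}
CLITIC_RE = re.compile(r"^(\w+)(n't|'re|'s|'ll|'ve|'d|'m)$", re.IGNORECASE)


def tokenize(text):
    """Split on whitespace, peel off trailing punctuation, split clitics."""
    tokens = []
    for raw in text.split():
        m = re.match(r"^(.*?)([.,;:!?]*)$", raw)
        core, punct = m.group(1), m.group(2)
        if core:
            c = CLITIC_RE.match(core)
            if c:
                tokens.extend([c.group(1), c.group(2)])  # has + n't
            else:
                tokens.append(core)
        tokens.extend(punct)  # one token per punctuation character
    return tokens


def label_multiwords(tokens):
    """Give known multiword units numbered tags (PREP31, PREP32, PREP33)."""
    out, i = [], 0
    while i < len(tokens):
        for mw, tag in MULTIWORDS.items():
            n = len(mw)
            if tuple(t.lower() for t in tokens[i:i + n]) == mw:
                out.extend(f"{t}_{tag}{n}{k}"
                           for k, t in enumerate(tokens[i:i + n], 1))
                i += n
                break
        else:
            out.append(tokens[i])
            i += 1
    return out


print(label_multiwords(tokenize("She hasn't left, in spite of it.")))
```

The two digits in a tag such as PREP31 record the length of the unit and the position of the word within it, following the labeling shown on the slide.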

A. Part-of-speech tagging
  • Choice of tag set
  • sophisticated, linguistically well grounded set
    of tags
  • BUT not automatically applicable without loss of
  • example: "come" can be present plural indicative,
    imperative, or subjunctive; the Lancaster corpus
    distinguishes these from the to-infinitive, while
    LOB and the Brown corpus don't

A. Part-of-speech tagging
  • tag: word class
  • label: alphanumeric characters
  • examples:
    preposition: preposition, prep, IN
    singular proper noun: NOUNpropsing, N-p-sg, NP1
  • logically organized (taxonomy), e.g., in
    Lancaster, BNC, C7
  • presentation: horizontal or vertical
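
To illustrate the two presentation styles, here is a small sketch that converts between a horizontal word_TAG line and a vertical one-token-per-line layout. The tab-separated vertical format and both function names are assumptions made for the example.

```python
def to_vertical(horizontal):
    # 'a_AT move_NN' becomes one 'word<TAB>tag' pair per line.
    rows = (tok.rpartition("_") for tok in horizontal.split())
    return "\n".join(f"{word}\t{tag}" for word, _, tag in rows)


def to_horizontal(vertical):
    # Inverse of to_vertical: join 'word<TAB>tag' lines into word_TAG tokens.
    pairs = (line.split("\t") for line in vertical.splitlines() if line)
    return " ".join(f"{word}_{tag}" for word, tag in pairs)


print(to_vertical("a_AT move_NN to_TO stop_VB"))
```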

A. Part-of-speech tagging
  • encoding of tags
  • TEI (SGML), e.g., BNC: <w AV0>Even <w AT0>the
    <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,
    <w AV0>just <w CJS>as <w PNP>they <w VVB>'re
    <w VVG>passing <w PNP>you <c (Garside et
    al., 1997)
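
A rough sketch of reading this kind of encoding, assuming each word is preceded by `<w TAG>` and each punctuation mark by `<c TAG>`, with no closing tags. The regular expression and the function name parse_bnc_sgml are illustrative only, not part of any BNC tool.

```python
import re

# Matches '<w TAG>word' or '<c TAG>punctuation'.
ELEMENT_RE = re.compile(r"<([wc])\s+([A-Z0-9]+)>([^<]*)")


def parse_bnc_sgml(text):
    """Return (element, tag, content) triples from SGML-style tagged text."""
    return [(kind, tag, content.strip())
            for kind, tag, content in ELEMENT_RE.findall(text)]


sample = "<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,"
for kind, tag, content in parse_bnc_sgml(sample):
    print(kind, tag, content)
```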

A. Part-of-speech tagging
  • Applying tags to words
  • the tagging scheme should include a procedure for
    assigning tags to words (both for humans and
  • a lexicon is needed: it says which tags are
    assignable to which words
  • again, ambiguity is a problem
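
The role of the lexicon, and why ambiguity arises, can be sketched as a simple lookup. The toy lexicon below is invented for illustration; a real tagger would add a disambiguation step on top of it.

```python
# Hypothetical toy lexicon: word -> set of tags it may take.
LEXICON = {
    "a": {"AT"},
    "move": {"NN", "VB"},
    "to": {"TO", "IN"},
    "stop": {"VB", "NN"},
}


def candidate_tags(tokens, lexicon, unknown="UNK"):
    """Look up every token; unknown words get a placeholder tag."""
    return [(t, sorted(lexicon.get(t.lower(), {unknown}))) for t in tokens]


def ambiguous(candidates):
    """Words for which the lexicon alone cannot decide the tag."""
    return [word for word, tags in candidates if len(tags) > 1]


cands = candidate_tags(["a", "move", "to", "stop"], LEXICON)
print(cands)
print(ambiguous(cands))
```

Every word returned by ambiguous() needs context (or a human annotator) to pick one tag.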

B. Syntactic annotation
  • syntactic annotation: parsed corpora
  • purposes:
  • training automatic parsers (computational
    linguistics, e.g., probabilistic parsers:
    inductive training through extraction of
    frequency counts)
  • extracting information (linguistics, e.g.,
    building a lexicon, investigating
    subcategorization frames, collocations or other
    linguistic phenomena, describing sublanguages)

B. Syntactic annotation
  • a parsing scheme needs (cf. POS tagging):
  • a list of symbols
  • definitions of the symbols
  • a description of how to apply the symbols to text
  • syntactically annotated corpora: treebanks
  • examples of treebanks: Penn Treebank, Nijmegen
    Treebank, Susanne Corpus, Helsinki Constraint
    Grammar (ENGCG), Lancaster/IBM SEC treebank

B. Syntactic annotation
  • Parsing
  • the (automatic) analysis of texts (sentences) in
    terms of syntactic categories

[Figure: parse tree for "Pierre Vinken, 61 years old, will join the board as an executive director Nov 29"]

B. Syntactic annotation
  • Penn Treebank
  • skeleton parsing: a partial parse, leaving out the
    hard things (such as PP-attachment)
  • phrase structure model (Garside et al., 1997,

((S (NP (NP Pierre Vinken) ,
        (ADJP (NP 61 years) old ,))
    will
    (VP join
        (NP the board)
        (PP as (NP a nonexecutive director))
        (NP Nov 29)))
 .)
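
A bracketed skeleton parse like the one above can be read into a nested structure and its phrase labels counted, which is the simplest form of the frequency information a probabilistic parser is trained on. The sketch below assumes that the first bare token after an opening bracket is the phrase label and that a bracket starting with another bracket is unlabeled. All function names are hypothetical.

```python
import re
from collections import Counter


def parse_brackets(text):
    """Parse a bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        label = None
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]           # assumed to be the phrase label
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    return node()


def label_counts(tree, counts=None):
    """Count how often each phrase label occurs in the tree."""
    counts = Counter() if counts is None else counts
    label, children = tree
    if label is not None:
        counts[label] += 1
    for child in children:
        if isinstance(child, tuple):
            label_counts(child, counts)
    return counts


def words(tree):
    """Return the words (leaves) of the tree in order."""
    out = []
    for child in tree[1]:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


SAMPLE = ("((S (NP (NP Pierre Vinken) , (ADJP (NP 61 years) old ,)) will "
          "(VP join (NP the board) (PP as (NP a nonexecutive director)) "
          "(NP Nov 29))) .)")
tree = parse_brackets(SAMPLE)
print(label_counts(tree))
print(" ".join(words(tree)))
```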
B. Syntactic annotation
  • Penn Treebank
  • available through LDC
  • size: 3,300,000 words (Feb 97)
  • sources: Brown corpus, Wall Street Journal
  • in the current phase:
  • add function labels (Subj, Obj, etc.)
  • add null constituents or traces (e.g., It's easy
    t to eat)
  • add indices for coreference (e.g., Mary_i saw
    herself_i in the mirror)
  • discontinuous constituents
  • add semantic roles (Agent, Goal, etc.)
  • may get too complex for large-scale reliable

B. Syntactic annotation
  • Susanne Corpus
  • part of the Brown corpus, 128,000 words
  • result of manual analysis
  • parsing scheme specified in great detail
  • available from Oxford Text Archive
  • (http)
  • (ftp)

A./B. Demo

C. Semantic annotation
  • problem (1): more than one way of referring to a
    concept, e.g.,
  • text analysis: choice of expression may reflect
    ideologies in the text or relationships between
    participants in a conversation, for example, in
    doctor-patient interaction: abdomen --- tummy
  • information retrieval: a fashion historian seeks
    information about trousers: trousers ---
    slacks, shorts, leggings, breeches -->
    cf. RECALL in IR

C. Semantic annotation
  • problem (2): one single word can refer to
    different concepts, e.g.,
  • information retrieval: a fashion historian wants
    to know about boots: boot --- may refer to
    shoe, computer, kick, car --> cf.
  • so:
  • need to identify related words (problem 1)
  • need to identify the different senses of a word
    (problem 2)

C. Semantic annotation
  • labeling words according to semantic field (word
    senses) so that you can:
  • extract all the related words by querying on
    the semantic field
  • extract only those instances of ambiguous words
    with the specific senses you want by querying on
    the combination of word and semantic field
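
The two kinds of query can be sketched on a toy annotated corpus. The tokens and the semantic field names below are invented for the example and do not come from any real tagging scheme.

```python
# Hypothetical annotated tokens: (word, semantic_field).
CORPUS = [
    ("trousers", "CLOTHING"), ("slacks", "CLOTHING"), ("boot", "CLOTHING"),
    ("boot", "COMPUTING"), ("boot", "VEHICLE"), ("breeches", "CLOTHING"),
]


def by_field(corpus, field):
    """Problem 1: all related words, found by querying the field alone."""
    return sorted({word for word, f in corpus if f == field})


def by_word_and_field(corpus, word, field):
    """Problem 2: only the instances of a word with the wanted sense."""
    return [(i, w, f) for i, (w, f) in enumerate(corpus)
            if w == word and f == field]


print(by_field(CORPUS, "CLOTHING"))
print(by_word_and_field(CORPUS, "boot", "COMPUTING"))
```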

C. Semantic annotation
  • semantic fields: sense relations and other kinds
    of relations (e.g., part-of, related-to, etc.)
  • annotation (cf. PoS tagging):
  • definition of the tagging scheme (labels and
    their meanings)
  • guidelines for applying the tagging scheme
  • in semantics this is not as easy and
    straightforward as for PoS tagging!
  • requirements:
  • should make linguistic/psycholinguistic sense
  • should be able to account for the vocabulary in
    the corpus exhaustively
  • should be suitable for texts from different
    periods and register (comprehensiveness)
  • should preferably have a hierarchical structure

C. Semantic annotation
  • multiple membership, e.g., "deepened": color
    and change/remain
  • multiword units, e.g., "stubbed out": encoded
    as two separate words, but belonging together
  • one recent ambitious attempt at a taxonomy of
    such semantic relations (sense relations,
    thesaurus-type relations, semantic fields, etc.):
    WORDNET at
  • you can try it online
C. Semantic annotation
  • How to do it?
  • manually
  • computer-assisted (need at least a
    computer-readable lexicon and a disambiguation
    process - similar to PoS tagging)
  • fully automatic (not really feasible)
  • semantic analysis is even harder than syntactic
  • no integrated parse of meaning possible at the
    present time

D. Discourse annotation
  • discourse features: what are they?
  • typically cohesion and coherence
  • coherence: what makes a text hang together in
    terms of content
  • cohesion: the means of making a text hang
  • reference, substitution, ellipsis, conjunctive
    relations (cause, result, effect, etc.), thematic
  • Halliday & Hasan, 1976

D. Discourse annotation
  • example: anaphoric relations in the IBM/Lancaster
    corpus (UCREL)
  • an attempt to build up something like an anaphoric
    treebank
  • what are anaphoric relations?
  • links between a proform and an antecedent
  • example: The married couple said that they
    were happy with their lot. (antecedent: The
    married couple; proforms: they, their)

D. Discourse annotation
  • anaphoric annotation in UCREL: categories used
    are based on Halliday & Hasan, 1976
  • example of annotation: (1 Feodor Baumenk
    1), a former Nazi death camp guard, has asked the
    U.S. Supreme Court to allow <REF=1 him to retain
    <REF=1 his American citizenship. (2 The Hartford
    Courant 2) said
  • symbols: (1), (2): antecedent; <: anaphoric
    (>: cataphoric); REF: central pronoun
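
A sketch of how links could be pulled out of text annotated this way. It assumes antecedents are written as (n ... n) and proforms as <REF=n or >REF=n followed by the word; the "=" between REF and the index is an assumption about the original markup. The patterns and the function name are illustrative only.

```python
import re

# Assumed notation: '(1 some phrase 1)' marks antecedent number 1.
ANTECEDENT_RE = re.compile(r"\((\d+) (.+?) \1\)")
# Assumed notation: '<REF=1 him' points back, '>REF=1 him' points forward.
ANAPHOR_RE = re.compile(r"([<>])REF=(\d+) (\w+)")


def anaphoric_links(text):
    """Return (proform, direction, antecedent phrase) triples."""
    antecedents = {int(n): phrase for n, phrase in ANTECEDENT_RE.findall(text)}
    links = []
    for direction, n, proform in ANAPHOR_RE.findall(text):
        kind = "anaphoric" if direction == "<" else "cataphoric"
        links.append((proform, kind, antecedents.get(int(n))))
    return links


sample = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has asked the "
          "U.S. Supreme Court to allow <REF=1 him to retain <REF=1 his American "
          "citizenship. (2 The Hartford Courant 2) said")
for link in anaphoric_links(sample):
    print(link)
```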

D. Discourse annotation
  • few corpora are annotated for discourse features
  • how to do it?
  • manually
  • computer-assisted: either interactive hand
    annotation, using some kind of specialized editor,
    or automatic annotation with the possibility of
    hand correction or disambiguation
  • a tool supporting annotation of anaphora: XANADU
    in Lancaster

E. Pragmatic annotation
  • anything beyond sentences and discourse: contexts
    of situation and culture
  • examples of things people look at in pragmatics:
  • carry-on signals in conversation (e.g.,
    Stenstroem, 1987): which functions do carry-on
    signals such as "well", "you know", etc. have in
  • speech acts (e.g., Stiles, 1992): speech act types
    in conversation, e.g., in doctor-patient
    interactions:
    PATIENT: I have the headaches to the point
    that I have to vomit (D)
    DOCTOR: Mm-hm (K)
    PATIENT: Then I have to go to bed and I sleep
    for a while (E)
    D: Disclosure, K: Acknowledgment, E

E. Pragmatic annotation
  • how to do it?
  • manually
  • computer-assisted ?
  • fully-automatic -
  • You have to use your imagination!
  • Stenstroem example: can be done with a
    concordance program because it's essentially
  • Stiles example: would probably have to be done
    manually (then use a concordance program on the
    annotated texts?)
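
As a rough idea of what a concordance program does for the carry-on signal case, here is a minimal keyword-in-context sketch. The function name kwic and the bracketed output layout are choices made for this example, not features of any particular concordancer.

```python
import re


def kwic(text, phrase, width=20):
    """Return one keyword-in-context line per occurrence of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so the keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right:<{width}}")
    return lines


talk = "Well, I think, you know, it was fine. You know what I mean."
for line in kwic(talk, "you know"):
    print(line)
```

The same function could be run over manually annotated texts, searching for an annotation code instead of a word.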

Higher-level annotation tools
  • Tools that support specialized analysis, such as
  • specialized editors, e.g., Xanadu for anaphoric
  • specialized in terms of linguistic models,
  • e.g., Sys-Tools for Systemic Functional Grammar
  • (http//
  • e.g., RSTTools for rhetorical relations analysis
  • Tools that support various kinds of analysis (but
    not quite everything you might want to do)
  • TATOE (

References
  • Garside R., G. Leech & A. McEnery (eds.), 1997.
    Corpus Annotation: Linguistic Information from
    Computer Text Corpora. Longman, London.
  • Fellbaum C. (ed.), 1998. WordNet: An Electronic
    Lexical Database. MIT Press.
  • Halliday M.A.K. & R. Hasan, 1976. Cohesion in
    English. Longman, London.
  • Mindt, 1991. Syntactic evidence for semantic
    distinctions in English. In Aijmer & Altenberg
    (eds.), English Corpus Linguistics: Studies in
    Honour of Jan Svartvik. Longman, London.
  • Stenstroem, 1987. Carry-on signals in English
    conversation. In Meijs (ed.), Corpus Linguistics
    and Beyond. Rodopi, Amsterdam.
  • Stiles, 1992. Describing Talk: A Taxonomy of
    Verbal Response Modes. Sage, Beverly Hills.