1
Annotation of corpora
  • A. Part-of-speech tagging
  • B. Syntactic annotation
  • C. Semantic annotation
  • D. Discourse annotation
  • E. Pragmatic annotation

2
Annotation of corpora
  • perfectly plain: produced by scanning; no
    information about the text (usually, not even
    the edition)
  • marked up for formatting attributes, e.g. page
    breaks, paragraphs, font sizes, italics, etc.
  • annotated with identifying information, e.g.
    edition, date, author, genre, register, etc.
  • annotated for part of speech, syntactic
    structure, discourse information, etc.

3
A. Part-of-speech tagging
  • LOB sample with POS tagging
  • A01 2 '_' stop_VB electing_VBG life_NN
    peers_NNS '_' ._.
  • A01 3 by_IN Trevor_NP Williams_NP ._.
  • A01 4 a_AT move_NN to_TO stop_VB \0Mr_NPT
    Gaitskell_NP from_IN
  • A01 4 nominating_VBG any_DTI more_AP labour_NN
  • A01 5 life_NN peers_NNS is_BEZ to_TO be_BE
    made_VBN at_IN a_AT meeting_NN
  • A01 5 of_IN labour_NN \0MPs_NPTS tomorrow_NR ._.
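
A minimal Python sketch (not part of the original slides) of how such
word_TAG lines can be read back into (word, tag) pairs; it assumes the
layout shown above, with a line reference like "A01 3" up front:

  def parse_lob_line(line):
      """Split one LOB-style line into (word, tag) pairs."""
      tokens = line.split()[2:]                 # drop the "A01 3" reference
      pairs = []
      for token in tokens:
          word, _, tag = token.rpartition("_")  # split on the LAST underscore
          pairs.append((word, tag))
      return pairs

  print(parse_lob_line("A01 3 by_IN Trevor_NP Williams_NP ._."))
  # [('by', 'IN'), ('Trevor', 'NP'), ('Williams', 'NP'), ('.', '.')]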

4
A. Part-of-speech tagging
  • Main steps
  • Divide the text into word tokens (tokenization)
  • Select a set of tags
  • Apply the tag set to the tokens (a code sketch
    follows this list)
  • Tokenization
  • orthographic word = morpho-syntactic unit?
  • multiwords, e.g., "in spite of": label as
    in_PREP31 spite_PREP32 of_PREP33
  • mergers, e.g., clitics as in "hasn't", "je
    t'aime", "vendetelo": label as vendete_VERB
    lo_PRON
  • compounds, e.g., "tag set": label as
    tagset_NOUN or tag_NOUN set_NOUN?
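
A minimal sketch of the three steps using NLTK (an assumed tool choice;
the slides do not prescribe one). Note that NLTK's tokenizer splits the
clitic in "hasn't" into has + n't, illustrating the merger problem:

  import nltk
  # one-time model downloads (exact names depend on your NLTK version):
  # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

  text = "In spite of the weather, the meeting hasn't been cancelled."
  tokens = nltk.word_tokenize(text)  # step 1: tokenization
  tagged = nltk.pos_tag(tokens)      # steps 2-3: apply the Penn tag set
  print(tagged[:4])
  # e.g. [('In', 'IN'), ('spite', 'NN'), ('of', 'IN'), ('the', 'DT')]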

5
A. Part-of-speech tagging
  • Choice of tag set
  • a sophisticated, linguistically well-grounded
    set of tags
  • BUT not automatically applicable without loss
    of accuracy
  • example: "come" can be present indicative
    (plural), imperative, or subjunctive; the
    Lancaster corpus distinguishes these from the
    to-infinitive, while the LOB and Brown corpora
    don't

6
A. Part-of-speech tagging
  • tag = word class
  • label = alphanumeric characters
  • examples:
  • preposition: "preposition", "prep", "IN"
  • singular proper noun: "NOUNpropsing",
    "N-p-sg", "NP1"
  • logically organized (taxonomy), e.g., in
    Lancaster, BNC, C7
  • presentation: horizontal or vertical (see the
    sketch below)
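
A minimal sketch (assuming the word_TAG format from the LOB sample)
converting the horizontal presentation, tags inline, into the vertical
one, one token per line with word and tag in columns:

  horizontal = "a_AT move_NN to_TO stop_VB"

  for token in horizontal.split():
      word, _, tag = token.rpartition("_")
      print(f"{word}\t{tag}")
  # a     AT
  # move  NN
  # to    TO
  # stop  VB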

7
A. Part-of-speech tagging
  • encoding of tags
  • TEI (SGML), e.g., BNC: <w AV0>Even <w AT0>the
    <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,
    <w AV0>just <w CJS>as <w PNP>they <w VVB>'re
    <w VVG>passing <w PNP>you <c PUN>. (Garside et
    al., 1997)
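
The SGML markup can be read back with a regular expression; a minimal
sketch assuming the simplified one-line format above (real BNC files
carry more structure):

  import re

  sgml = "<w AV0>Even <w AT0>the <w AJ0>old <w NN2>women <w VVB>manage <c PUN>,"
  pairs = re.findall(r"<[wc] ([A-Z0-9]+)>([^<]+)", sgml)
  print([(tag, word.strip()) for tag, word in pairs])
  # [('AV0', 'Even'), ('AT0', 'the'), ('AJ0', 'old'), ('NN2', 'women'),
  #  ('VVB', 'manage'), ('PUN', ',')]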

8
A. Part-of-speech tagging
  • Applying tags to words
  • the tagging scheme should include a procedure
    for how to assign tags to words (both for
    humans and machines)
  • need a lexicon: it says which tags are
    assignable to which words (a sketch follows
    this list)
  • again, ambiguity is a problem
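
A minimal sketch of such a lexicon (the entries are hypothetical): a
mapping from word forms to their assignable tags, where the multi-tag
entries are exactly the ones a disambiguator must resolve:

  lexicon = {
      "move":  {"NN", "VB"},    # ambiguous: noun or verb
      "stop":  {"NN", "VB"},    # ambiguous: noun or verb
      "the":   {"AT"},          # unambiguous
      "peers": {"NNS", "VBZ"},  # ambiguous: plural noun or verb
  }

  ambiguous = {w for w, tags in lexicon.items() if len(tags) > 1}
  print(ambiguous)  # {'move', 'stop', 'peers'} (set order may vary)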

9
B. Syntactic annotation
  • syntactic annotation → parsed corpora
  • purposes:
  • training automatic parsers (computational
    linguistics, e.g. probabilistic parsers -
    inductive training through extraction of
    frequency counts)
  • extracting information (linguistics, e.g.,
    building a lexicon, investigating
    subcategorization frames, collocations or other
    linguistic phenomena, describing sublanguages);
    a frame-extraction sketch follows this list
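
A minimal sketch of the frame-extraction purpose: reading a verb's
subcategorization frame (the sisters of the verb inside its VP) off a
bracketed parse with NLTK's Tree class (the tool choice and the toy
parse are assumptions):

  from nltk import Tree

  parse = Tree.fromstring(
      "(S (NP (NNP Vinken)) (VP (VB join) (NP (DT the) (NN board))))")

  for vp in parse.subtrees(lambda t: t.label() == "VP"):
      verb = vp[0].leaves()[0]
      frame = [sister.label() for sister in vp[1:]]
      print(verb, "->", frame)  # join -> ['NP']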

10
B. Syntactic annotation
  • a parsing scheme needs (cf. POS tagging):
  • a list of symbols
  • definitions of the symbols
  • a description of how to apply the symbols to
    text
  • syntactically annotated corpora = treebanks
  • examples of treebanks: Penn Treebank, Nijmegen
    Treebank, Susanne Corpus, Helsinki Constraint
    Grammar (ENGCG), Lancaster/IBM SEC treebank

11
B. Syntactic annotation
  • Parsing
  • the (automatic) analysis of texts (sentences) in
    terms of syntactic categories

[Figure: parse tree for "Pierre Vinken, 61 years
old, will join the board as an executive director
Nov 29", with S, NP, VP, PP, and ADJP nodes]
12
B. Syntactic annotation
  • Penn Treebank
  • skeleton parsing: partial parse, leaving out
    the hard things (such as PP-attachment)
  • phrase structure model (Garside et al., 1997,
    p. 42)

((S (NP (NP Pierre Vinken) ,
        (ADJP (NP 61 years) old ,))
    will
    (VP join
        (NP the board)
        (PP as (NP a nonexecutive director))
        (NP Nov 29)))
 .)
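
The bracketed notation can be loaded directly with NLTK's Tree class; a
minimal sketch (the tool choice is an assumption; punctuation is
dropped for simplicity):

  from nltk import Tree

  skeleton = """
  (S (NP (NP Pierre Vinken) (ADJP (NP 61 years) old))
     will
     (VP join (NP the board)
         (PP as (NP a nonexecutive director))
         (NP Nov 29)))
  """
  tree = Tree.fromstring(skeleton)
  print(tree.leaves()[:5])           # ['Pierre', 'Vinken', '61', 'years', 'old']
  print(len(list(tree.subtrees())))  # 10 constituents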
13
B. Syntactic annotation
  • Penn Treebank
  • available through the LDC
  • size: 3,300,000 words (Feb 97)
  • Brown corpus, Wall Street Journal
  • in the current phase:
  • add function labels (Subj, Obj, etc.)
  • add null constituents or traces (e.g., "It's
    easy t to eat", where t marks the trace)
  • add indices for coreference (e.g., "Mary_i saw
    herself_i in the mirror")
  • discontinuous constituents
  • add semantic roles (Agent, Goal, etc.)
  • may get too complex for large-scale reliable
    analysis

14
B. Syntactic annotation
  • Susanne Corpus
  • part of the Brown corpus, 128,000 words
  • the result of manual analysis
  • parsing scheme specified in great detail
  • available from the Oxford Text Archive
  • sable.ox.ac.uk/ota (http)
  • ota.ox.ac.uk/pub/ota/public (ftp)

15
A./B. Demo
  • TIGER
  • NEGRA

16
C. Semantic annotation
  • problem (1): more than one way of referring to
    a concept, e.g.,
  • text analysis: the choice of expression may
    reflect ideologies in the text or relationships
    between participants in conversation; for
    example, in doctor-patient interaction:
    abdomen vs. tummy
  • information retrieval: a historian of fashion
    seeks information about trousers: trousers,
    slacks, shorts, leggings, breeches →
    cf. RECALL in IR

17
C. Semantic annotation
  • problem (2): one single word can refer to
    different concepts, e.g.,
  • information retrieval: a historian of fashion
    wants to know about boots: "boot" may refer to
    a shoe, a computer, a kick, or a car →
    cf. PRECISION in IR
  • so:
  • need to identify related words (problem 1)
  • need to identify the different senses of a word
    (problem 2)

18
C. Semantic annotation
  • labeling words according to semantic field
    (word senses), so that you can
  • extract all the related words by querying on
    the semantic field
  • extract only those instances of ambiguous words
    with the specific senses you want by querying
    on the combination of word and semantic field
    (both queries are sketched below)
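
Once each running word carries a semantic-field tag, both queries
become simple filters; a minimal sketch with hypothetical field labels:

  tagged = [
      ("trousers", "CLOTHING"), ("boot", "CLOTHING"),
      ("boot", "VEHICLE_PART"), ("slacks", "CLOTHING"),
      ("boot", "COMPUTING"),
  ]

  # recall-style query: everything in the field, whatever the word
  print([w for w, field in tagged if field == "CLOTHING"])
  # ['trousers', 'boot', 'slacks']

  # precision-style query: one word restricted to one sense
  print([(w, f) for w, f in tagged if w == "boot" and f == "CLOTHING"])
  # [('boot', 'CLOTHING')]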

19
C. Semantic annotation
  • semantic fields: sense relations and other
    kinds of relations (e.g., part-of, related-to,
    etc.)
  • annotation (cf. PoS tagging):
  • definition of the tagging scheme (labels and
    their meanings)
  • guidelines for applying the tagging scheme
  • in semantics this is not as easy and
    straightforward as for PoS tagging!
  • requirements:
  • should make linguistic/psycholinguistic sense
  • should be able to account for the vocabulary in
    the corpus exhaustively
  • should be suitable for texts from different
    periods and registers (comprehensiveness)
  • should preferably have a hierarchical structure

20
C. Semantic annotation
  • multiple membership, e.g., "deepened" belongs
    to both the color field and the change/remain
    field
  • multiword units, e.g., "stubbed out": encoded
    as two separate words, but belonging together
  • one recent ambitious attempt at a taxonomy of
    such semantic relations (sense relations,
    thesaurus-type relations, semantic fields,
    etc.): WORDNET at www.cogsci.princeton.edu/wn/
  • you can try it online:
    www.cogsci.princeton.edu/wn/online/ (a query
    sketch follows below)
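
A minimal sketch of querying WordNet through NLTK's interface (an
assumed access route; requires nltk.download("wordnet") once). It shows
both problems from slides 16-17 at once:

  from nltk.corpus import wordnet as wn

  # problem (2), polysemy: one word, many senses
  for synset in wn.synsets("boot")[:4]:
      print(synset.name(), "-", synset.definition())

  # problem (1), synonymy: many words, one sense
  print(wn.synsets("trousers")[0].lemma_names())
  # e.g. ['trouser', 'pant']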

22
C. Semantic annotation
  • How to do it?
  • manually
  • computer-assisted (needs at least a
    computer-readable lexicon and a disambiguation
    process - similar to PoS tagging)
  • fully automatic (not really feasible):
  • semantic analysis is even harder than syntactic
    parsing
  • no integrated parse of meaning is possible at
    the present time

23
D. Discourse annotation
  • discourse features: what are they?
  • typically cohesion and coherence
  • coherence: what makes a text hang together in
    terms of content
  • cohesion: the means of making a text hang
    together
  • reference, substitution, ellipsis, conjunctive
    relations (cause, result, effect, etc.),
    thematic development
  • (Halliday & Hasan, 1976)

24
D. Discourse annotation
  • example: anaphoric relations in the
    IBM/Lancaster corpus (UCREL)
  • try to build up something like an anaphoric
    treebank
  • what are anaphoric relations?
  • links between a proform and an antecedent
  • example: "The married couple said that they
    were happy with their lot." ("they" and "their"
    link back to "the married couple")

25
D. Discourse annotation
  • anaphoric annotation in UCREL: the categories
    used are based on Halliday & Hasan (1976)
  • example of annotation: (1 Feodor Baumenk
    1), a former Nazi death camp guard, has asked
    the U.S. Supreme Court to allow <REF=1 him to
    retain <REF=1 his American citizenship. (2 The
    Hartford Courant 2) said ...
  • symbols: (1 ... 1), (2 ... 2) = antecedent;
    < = anaphoric (> = cataphoric); REF = central
    pronoun
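
A minimal sketch of pulling anaphor-antecedent links out of such
markup; since the exact UCREL notation is reconstructed here, the
regular expressions are assumptions:

  import re

  text = ("(1 Feodor Baumenk 1), a former Nazi death camp guard, has "
          "asked the U.S. Supreme Court to allow <REF=1 him to retain "
          "<REF=1 his American citizenship.")

  antecedents = dict(re.findall(r"\((\d+) (.+?) \1\)", text))
  for index, pronoun in re.findall(r"<REF=(\d+) (\w+)", text):
      print(pronoun, "->", antecedents[index])
  # him -> Feodor Baumenk
  # his -> Feodor Baumenk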

26
D. Discourse annotation
  • few corpora are annotated for discourse
    features
  • how to do it?
  • manually
  • computer-assisted: either interactive hand
    annotation, using some kind of specialized
    editor, or automatic annotation with the
    possibility of hand correction or
    disambiguation
  • a tool supporting annotation of anaphora:
    XANADU in Lancaster

27
E. Pragmatic annotation
  • anything beyond sentences and discourse:
    contexts of situation and culture
  • examples of things people look at in pragmatics
  • carry-on signals in conversation (e.g.,
    Stenstroem, 1987): which functions do carry-on
    signals such as "well", "you know", etc. have
    in conversation?
  • speech acts (e.g., Stiles, 1992): speech act
    types in conversation, e.g., in doctor-patient
    interactions (a counting sketch follows this
    list):
    PATIENT: I have the headaches to the point
    that I have to vomit (D)
    DOCTOR: Mm-hm (K)
    PATIENT: Then I have to go to bed and I sleep
    for a while (E)
    (D = Disclosure, K = Acknowledgment,
    E = Edification)
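
A minimal sketch (the data layout is an assumption) of storing
Stiles-style coded utterances and tallying speech act types per
speaker:

  from collections import Counter

  utterances = [
      ("PATIENT", "I have the headaches to the point that I have to vomit", "D"),
      ("DOCTOR",  "Mm-hm", "K"),
      ("PATIENT", "Then I have to go to bed and I sleep for a while", "E"),
  ]

  counts = Counter((speaker, code) for speaker, _, code in utterances)
  print(counts)
  # Counter({('PATIENT', 'D'): 1, ('DOCTOR', 'K'): 1, ('PATIENT', 'E'): 1})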

28
E. Pragmatic annotation
  • how to do it?
  • manually
  • computer-assisted: ?
  • fully automatic: -
  • You have to use your imagination!
  • Stenstroem example: can be done with a
    concordance program, because it's essentially
    word-based (a minimal concordance sketch
    follows this list)
  • Stiles example: would probably have to be done
    manually (then use a concordance program on the
    annotated texts?)
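
A minimal concordance (KWIC) sketch of the word-based kind the
Stenstroem study needs; the sample utterance is invented:

  def kwic(tokens, keyword, width=3):
      """Print each hit of `keyword` with `width` words of context."""
      for i, token in enumerate(tokens):
          if token.lower() == keyword:
              left = " ".join(tokens[max(0, i - width):i])
              right = " ".join(tokens[i + 1:i + 1 + width])
              print(f"{left:>20} [{token}] {right}")

  tokens = "Well I mean you know it was well worth the wait".split()
  kwic(tokens, "well")
  #                      [Well] I mean you
  #          know it was [well] worth the wait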

29
Higher-level annotation tools
  • Tools that support specialized analysis, such
    as
  • specialized editors, e.g., Xanadu for anaphoric
    relations
  • editors specialized in terms of linguistic
    models,
  • e.g., Sys-Tools for Systemic Functional Grammar
    (minerva.ling.mq.edu.au/)
  • (http://cirrus.dai.ed.ac.uk:8000/Coder/index.html)
  • e.g., RSTTools for rhetorical relations analysis
    (www.dai.ed.ac.uk/daidb/people/homes/micko/RSTTool/index.html)
  • Tools that support various kinds of analysis
    (but not quite everything you might want to do)
  • TATOE (www.darmstadt.gmd.de/rostek/tatoe.htm)

30
References
  • Garside, R., G. Leech & A. McEnery (eds.), 1997.
    Corpus Annotation: Linguistic Information from
    Computer Text Corpora. London: Longman.
  • Fellbaum, C. (ed.), 1998. WordNet: An Electronic
    Lexical Database. Cambridge, MA: MIT Press.
  • Halliday, M.A.K. & R. Hasan, 1976. Cohesion in
    English. London: Longman.
  • Mindt, 1991. Syntactic evidence for semantic
    distinctions in English. In Aijmer & Altenberg
    (eds.), English Corpus Linguistics: Studies in
    Honour of Jan Svartvik. London: Longman.
  • Stenstroem, 1987. Carry-on signals in English
    conversation. In Meijs (ed.), Corpus Linguistics
    and Beyond. Amsterdam: Rodopi.
  • Stiles, 1992. Describing Talk: A Taxonomy of
    Verbal Response Modes. Beverly Hills: Sage.