Title: Corpus Mark-up
1Corpus Mark-up
- UoL Summer Institute in Corpus Linguistics
- Matthew Brook O'Donnell
2Aims
- Introduce the concepts of corpus mark-up and annotation
- Consider why we would want to add extra non-textual information to corpus texts
- Use a POS tagger and tagged text
3What is Corpus Annotation?
- "the practice of adding interpretative linguistic information to a corpus" (Leech 2005)
- interpretative
- linguistic
- results in -> value-added corpus
4Terminology
- Corpus Markup
- processing/formatting information
- metadata/text classifications
- structural representation
- Tagging
- (usually) inline addition of category to word(s)
- Parsing
- higher-level, multiword units (constituents)
- chunking/shallow vs. full syntactical parsing
- needn't just be syntactical analysis
- XML
- eXtensible Markup Language
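As a minimal sketch of how the kinds of mark-up listed above (metadata, structural representation, inline tagging) could sit together in XML, the following Python uses the standard library's xml.etree.ElementTree. The element and attribute names (text, header, title, s, w, pos) are assumptions made for illustration, not the scheme of any particular corpus.

```python
import xml.etree.ElementTree as ET


def build_annotated_text(title, tagged_sentence):
    """Wrap (word, tag) pairs in a small XML document.

    Element and attribute names are hypothetical; they only
    illustrate metadata, structure and inline tagging.
    """
    text = ET.Element("text")
    header = ET.SubElement(text, "header")         # metadata / text classification
    ET.SubElement(header, "title").text = title
    sentence = ET.SubElement(text, "s", n="1")     # structural unit
    for word, tag in tagged_sentence:
        w = ET.SubElement(sentence, "w", pos=tag)  # inline category on a word
        w.text = word
    return text


doc = build_annotated_text(
    "Corpus Mark-up",
    [("Corpus", "NN1"), ("annotation", "NN1"), ("is", "VBZ")],
)
xml_string = ET.tostring(doc, encoding="unicode")
print(xml_string)
```

Because the categories live in attributes, the plain words can still be recovered by reading only the text of each w element.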
5Why Annotate?
- Manual examination of corpus
- Automatic analysis of corpus
- Reusability of annotations
- Multi-functionality
- Objective record of analysis
- Annotation process is corpus analysis
Leech 2005
McEnery 2003
O'Donnell 1999
6Types of Corpus Annotation
- Part-of-speech (POS)
- Lemmatization
- Syntactical (parsing)
- Semantic (domain classifications)
- Coreference (Discourse)
- Pragmatic (speech acts, dialogue)
- Stylistic
- Research specific (ad hoc)
7POS Tagging: CLAWS C5
- Corpus_NN1 annotation_NN1 is_VBZ
- the_AT0 practice_NN1 of_PRF
- adding_VVG interpretative_AJ0
- linguistic_AJ0 information_NN1
- to_PRP a_AT0 corpus_NN1 ._.
- NN1 singular noun
- AJ0 adjective (unmarked)
- VBZ -s form of the verb "BE"
- PRF the preposition "OF"
- VVG -ing form of lexical verb
- AT0 article
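The word_TAG format above is easy to take apart again. The sketch below splits the example sentence into (word, tag) pairs and looks each tag up in the legend from the slide. The function name split_tagged and the dictionary C5_LEGEND are assumptions for illustration; the legend is limited to the tags glossed on the slide, so PRP and the punctuation tag come back as "(not in legend)".

```python
# Legend copied from the slide; it does not gloss every tag in the example.
C5_LEGEND = {
    "NN1": "singular noun",
    "AJ0": "adjective (unmarked)",
    "VBZ": '-s form of the verb "BE"',
    "PRF": 'the preposition "OF"',
    "VVG": "-ing form of lexical verb",
    "AT0": "article",
}


def split_tagged(text, sep="_"):
    """Split 'word_TAG' tokens into (word, tag) pairs.

    rpartition splits on the last separator, so the token '._.'
    yields the word '.' with the tag '.'.
    """
    pairs = []
    for token in text.split():
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


tagged = (
    "Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF "
    "adding_VVG interpretative_AJ0 linguistic_AJ0 information_NN1 "
    "to_PRP a_AT0 corpus_NN1 ._."
)

pairs = split_tagged(tagged)
for word, tag in pairs:
    print(f"{word:15} {tag:4} {C5_LEGEND.get(tag, '(not in legend)')}")
```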
8POS Tagging: CLAWS C7
- Corpus_NN1 annotation_NN1 is_VBZ
- the_AT practice_NN1 of_IO
- adding_VVG interpretative_JJ
- linguistic_JJ information_NN1
- to_II a_AT1 corpus_NN1 ._.
http://www.comp.lancs.ac.uk/ucrel/claws/trial.html
9POS Tagging: POSTagger
- Corpus/NN annotation/NN is/VBZ
- the/DT practice/NN of/IN
- adding/VBG interpretative/JJ
- linguistic/JJ information/NN
- to/TO a/DT corpus/NN ./.
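The same sentence has now been tagged three ways, so the tagsets can be lined up word by word. The sketch below does that and counts how many distinct tags each scheme uses for this one sentence. The helper name split_tagged and the label "slash" for the third tagger's output are assumptions for illustration; its tags look like Penn Treebank-style tags, but the slide does not say so.

```python
def split_tagged(text, sep):
    """Split 'word<sep>TAG' tokens into (word, tag) pairs on the last separator."""
    return [tuple(token.rsplit(sep, 1)) for token in text.split()]


c5 = split_tagged(
    "Corpus_NN1 annotation_NN1 is_VBZ the_AT0 practice_NN1 of_PRF "
    "adding_VVG interpretative_AJ0 linguistic_AJ0 information_NN1 "
    "to_PRP a_AT0 corpus_NN1 ._.", "_")
c7 = split_tagged(
    "Corpus_NN1 annotation_NN1 is_VBZ the_AT practice_NN1 of_IO "
    "adding_VVG interpretative_JJ linguistic_JJ information_NN1 "
    "to_II a_AT1 corpus_NN1 ._.", "_")
slash = split_tagged(
    "Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN "
    "adding/VBG interpretative/JJ linguistic/JJ information/NN "
    "to/TO a/DT corpus/NN ./.", "/")

# Side-by-side view: one row per word, one column per tagset.
print(f"{'word':15} {'C5':5} {'C7':5} slash")
for (word, t5), (_, t7), (_, ts) in zip(c5, c7, slash):
    print(f"{word:15} {t5:5} {t7:5} {ts}")

# How many different categories does each scheme need for this sentence?
distinct = {
    name: len({tag for _, tag in pairs})
    for name, pairs in (("C5", c5), ("C7", c7), ("slash", slash))
}
print(distinct)
```

In this sentence the C7 output separates the from a (AT vs. AT1), which the other two schemes tag alike.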
10Parsing: Chunking
- NP (NN Corpus) (NN annotation)
- (VBZ is)
- NP (DT the) (NN practice)
- (IN of) (VBG adding)
- NP (JJ interpretative) (JJ linguistic) (NN information)
- PP (TO to) NP (DT a) (NN corpus)
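The chunks above can be approximated with a very small rule. The sketch below groups every maximal run of DT, JJ and NN tags into an NP and leaves all other tokens on their own. This rule is an assumption chosen for illustration, not how any particular chunker works: it does not build the PP shown on the slide, and it would wrongly merge two adjacent noun phrases.

```python
NP_TAGS = {"DT", "JJ", "NN"}  # assumption: an NP is any run of these tags


def chunk_noun_phrases(pairs):
    """Group maximal runs of determiner/adjective/noun tags into NP chunks."""
    chunks = []
    current = []
    for word, tag in pairs:
        if tag in NP_TAGS:
            current.append((word, tag))
            continue
        if current:                      # a non-NP tag closes the open chunk
            chunks.append(("NP", current))
            current = []
        chunks.append((tag, [(word, tag)]))
    if current:                          # flush a chunk that ends the input
        chunks.append(("NP", current))
    return chunks


tagged = (
    "Corpus/NN annotation/NN is/VBZ the/DT practice/NN of/IN "
    "adding/VBG interpretative/JJ linguistic/JJ information/NN "
    "to/TO a/DT corpus/NN ./."
)
pairs = [tuple(token.rsplit("/", 1)) for token in tagged.split()]

chunks = chunk_noun_phrases(pairs)
for label, tokens in chunks:
    print(label, " ".join(f"({tag} {word})" for word, tag in tokens))

noun_phrases = [" ".join(w for w, _ in toks) for label, toks in chunks if label == "NP"]
```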
11Parsing
- (S
- (NP Corpus annotation)
- (VP is
- (NP
- (NP the practice)
- (PP of
- (S (VP adding
- (NP interpretative linguistic information)
- (PP to (NP a corpus))
- ))
- )
- )
- )
- .)
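A bracketed parse like the one above is a nested structure, and it can be read back into one with a few lines of code. The sketch below is a minimal bracket reader, assuming that the first item after each opening bracket is the node label and that everything else is a word or a sub-tree. The function names parse_brackets, leaves and phrases are hypothetical.

```python
import re


def parse_brackets(text):
    """Read a bracketed parse into nested lists: [label, child, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    stack = [[]]
    for token in tokens:
        if token == "(":
            stack.append([])
        elif token == ")":
            if len(stack) == 1:
                raise ValueError("unbalanced brackets")
            node = stack.pop()
            stack[-1].append(node)
        else:
            stack[-1].append(token)
    if len(stack) != 1:
        raise ValueError("unbalanced brackets")
    return stack[0][0]


def leaves(node):
    """Return the words under a node, skipping node labels."""
    words = []
    for child in node[1:]:
        words.extend(leaves(child) if isinstance(child, list) else [child])
    return words


def phrases(node, label):
    """Collect the text of every constituent with the given label."""
    found = [" ".join(leaves(node))] if node[0] == label else []
    for child in node[1:]:
        if isinstance(child, list):
            found.extend(phrases(child, label))
    return found


TREE = """
(S
  (NP Corpus annotation)
  (VP is
    (NP
      (NP the practice)
      (PP of
        (S (VP adding
             (NP interpretative linguistic information)
             (PP to (NP a corpus)))))))
  .)
"""

tree = parse_brackets(TREE)
print(" ".join(leaves(tree)))
print(phrases(tree, "NP"))
print(phrases(tree, "PP"))
```

Unlike the flat chunks on the previous slide, the full parse lets a query return nested units, such as the noun phrase that contains the whole of-phrase.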
12Semantic Annotation
- Each word given code from thesaurus-style dictionary
- Also called Word Sense Tagging
- Examples
- UCREL Semantic Analysis System
- http://www.comp.lancs.ac.uk/ucrel/usas/
- WordNet
- http://wordnet.princeton.edu/
13Semantic Annotation
- The noun move has 5 senses (first 5 from tagged texts)
- 1. (377) move -- (the act of deciding to do something; "he didn't make a move to help"; "his first move was to hire a lawyer")
- 2. (70) move, relocation -- (the act of changing your residence or place of business; "they say that three moves equal one fire")
- 3. (57) motion, movement, move, motility -- (a change of position that does not entail a change of location; "the reflex motion of his eyebrows revealed his surprise"; "movement is a sign of life"; "an impatient move of his hand"; "gastrointestinal motility")
- 4. (30) motion, movement, move -- (the act of changing location from one place to another; "police controlled the motion of the crowd"; "the movement of people from the farms to the cities"; "his move put him directly in my path")
- 5. (5) move -- ((game) a player's turn to take some action permitted by the rules of the game)
14Semantic Annotation
- The verb move has 16 senses (first 13 from tagged texts)
- 1. (130) travel, go, move, locomote -- (change location; move, travel, or proceed; "How fast does your new car go?"; "We travelled from Rome to Naples by bus"; "The policemen went from door to door looking for the suspect"; "The soldiers moved towards the city in an attempt to take it before night fell")
- 2. (60) move, displace -- (cause to move, both in a concrete and in an abstract sense; "Move those boxes into the corner, please"; "I'm moving my money to another bank"; "The director moved more responsibilities onto his new assistant")
- 3. (52) move -- (move so as to change position, perform a nontranslational motion; "He moved his hand slightly to the right")
- 4. (20) move -- (change residence, affiliation, or place of employment; "We moved from Idaho to Nebraska"; "The basketball player moved from one team to another")
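Each sense line in the listings above has the same shape: a sense number, a count from tagged texts in parentheses, a list of synonyms, and a gloss after the double hyphen. The sketch below pulls those fields apart with a regular expression so that sense codes and frequencies can be used as annotation values. The pattern and the function name parse_sense are assumptions based only on the lines shown on the slides, not on a documented output format.

```python
import re

# number. (count) synonyms -- (gloss and examples)
SENSE_LINE = re.compile(r"^(\d+)\.\s+\((\d+)\)\s+(.+?)\s+--\s+\((.*)\)$")


def parse_sense(line):
    """Split one sense line into number, tagged-text count, synonyms and definition."""
    match = SENSE_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sense line: {line!r}")
    number, count, words, gloss = match.groups()
    # The definition is whatever precedes the first quoted example.
    definition = gloss.split('"', 1)[0].rstrip(" ;")
    return {
        "sense": int(number),
        "count": int(count),
        "synonyms": [w.strip() for w in words.split(",")],
        "definition": definition,
    }


lines = [
    '1. (377) move -- (the act of deciding to do something; '
    '"he didn\'t make a move to help"; "his first move was to hire a lawyer")',
    '2. (70) move, relocation -- (the act of changing your residence or '
    'place of business; "they say that three moves equal one fire")',
    "5. (5) move -- ((game) a player's turn to take some action permitted "
    "by the rules of the game)",
]

senses = [parse_sense(line) for line in lines]
for sense in senses:
    print(sense["sense"], sense["count"], sense["synonyms"], "|", sense["definition"])
```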
15Tools
- XML
- Annotation Editors
- GATE
- WordSmith
16The Great Annotation Debate
- Leech et al.: annotation as value added
- Sinclair: annotation as a perilous activity
- Scott: beware of the POS prison!
17Sinclair on the perils of corpus annotation
- The interspersing of tags in a language text is
a perilous activity, because the text thereby
loses integrity
Current Issues in Corpus Linguistics (Sinclair 2004: 191)
18Sinclair on the perils of corpus annotation
- ...one cosy consequence of using tagged text is
that the description which produces the tags in
the first place is not challenged; it is
protected. The corpus data can only be observed
through the tags; that is to say, anything the
tags are not sensitive to will be missed
Current Issues in Corpus Linguistics (Sinclair 2004: 191)
19Sinclair on the perils of corpus annotation
- In corpus-driven linguistics you do not use
pre-tagged text, but you process the raw text
directly and then patterns of this uncontaminated
text are able to be observed.
Current Issues in Corpus Linguistics (Sinclair 2004: 191)
20Hunston: annotation as double-edged sword
- the categories used to annotate a corpus are
typically determined before any corpus analysis
is carried out, which in turn tends to limit, not
the kind of question that can be asked, but the
kind of question that usually is asked.
(Hunston 2002: 93)
21Hunston: annotation as double-edged sword
- Most of the work that is done using annotated
corpora uses categories that have been developed
in pre-corpus days, such as nominal clauses,
anaphoric reference ... Phenomena such as frames or
semantic prosody tend to have been identified
from plain text corpora and word-based studies.
(Hunston 2002: 93)
22Corpus-based approach
[Diagram: plain corpus -> ANALYSIS (categorization) -> annotated corpus -> CORPUS METHODS -> DATA -> ANALYSIS (generalization) -> RESULTS]
- Annotate Corpus
- POS
- Parsing
- Semantic
- Reference
23Corpus-driven approach
[Diagram: plain corpus -> CORPUS METHODS -> DATA -> ANALYSIS (generalization, categorization) -> RESULTS]
24Problem for both CB & CD Approach
- Serial/Sequential process
- CB (corpus-based): analysis before (annotation) and after processing
- CD (corpus-driven): analysis only after processing (so no need for annotation)
- Empirical process is cyclic
- analysis feeds back into process and around again and again
25So what if...
- Hunston: "Most of the work that is done using annotated corpora uses categories that have been developed in pre-corpus days."
- we annotate categories that have come out of corpus analysis instead of/as well as traditional categories?
(Hunston 2002: 93)
26New uses for corpus annotation
- Cyclic investigation process
- 1. KWIC/Frequency list/Collocates etc.
- 2. Annotate results
- 3. Go to 1
- How should we annotate
- collocates
- lexical items
- semantic associations/prosodies
- Local textual functions
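The cyclic investigation process above can be sketched in a few lines: run a KWIC search and a collocate count on plain text, annotate the hits with a category that came out of that search, then treat the annotations as data for the next pass. Everything here is an assumption for illustration: the function names kwic and collocates, the three-word window, the label format, and the tiny sample text, which reuses example phrases from the WordNet slides. The window also crosses sentence boundaries, which a real study would probably not allow.

```python
from collections import Counter


def kwic(tokens, node, width=3):
    """Return (position, left context, node, right context) for each hit."""
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == node:
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((i, left, token, right))
    return hits


def collocates(tokens, node, width=3):
    """Count the words that occur within the window around the node."""
    counts = Counter()
    for _, left, _, right in kwic(tokens, node, width):
        counts.update(w.lower() for w in left + right)
    return counts


text = (
    "his first move was to hire a lawyer . "
    "his move put him directly in my path . "
    "an impatient move of his hand ."
)
tokens = text.split()

# Step 1: KWIC and collocates on the plain text.
hits = kwic(tokens, "move")
top = collocates(tokens, "move").most_common(1)[0][0]

# Step 2: annotate the hits with a category that came out of step 1.
annotations = {}
for position, left, node, right in hits:
    if top in (w.lower() for w in left + right):
        annotations[position] = f"{top}+{node}"

# Step 3 (go to 1): the annotations are now searchable data themselves.
for position, left, node, right in hits:
    label = annotations.get(position, "-")
    print(f"{' '.join(left):>20} [{node}] {' '.join(right):<22} {label}")
print(Counter(annotations.values()))
```

The annotations are kept in a separate dictionary keyed by token position, so the text itself is left untouched between passes.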
27References
- Leech, G.
- 2005 "Adding Linguistic Annotation", in M. Wynne (ed.), Developing Linguistic Corpora: a Guide to Good Practice (Oxford: Oxbow Books), pp. 17-29
- http://ahds.ac.uk/linguistic-corpora/
- Hunston, S.
- 2002 Corpora in Applied Linguistics (Cambridge: Cambridge University Press)
- McEnery, A.
- 2003 "Corpus Linguistics", in R. Mitkov (ed.), The Oxford Handbook of Computational Linguistics (Oxford: Oxford University Press), pp. 448-463
28References
- O'Donnell, M.B.
- 1999 "The Use of Annotated Corpora for New Testament Discourse Analysis: A Survey of Current Practice and Future Prospects", in S.E. Porter and J.T. Reed (eds.), Discourse Analysis and the New Testament: Results and Applications (Sheffield: Sheffield Academic Press), pp. 71-117
- Sinclair, J.
- 2004 Trust the Text: Language, Corpus and Discourse (London: Routledge)