Title: Coreferencing Treebank data using CESAC
1. Coreferencing Treebank data using CESAC
- Annotating and analysing IS in corpora of historical English
- Berlin, 13-14 November 2009
2. Overview
Contents
Coreferencing using Cesac
- CESAC
- - Goals
- - Coreference types operationalizing IS
- - Input and output
- Inter-rater agreement
- Example
- Summary and conclusion
3. CESAC goals
Goal of CESAC
- Overall goal
- - Referring from any one constituent to any other constituent
- More specifically
- - Source and destination: IP/phrase/node or DP/lexeme/endnode
- - Attributes
- - - Coreference type
- - - Distance measure
- - - (NP type: definite, indefinite, etc.)
- - - (Animacy)
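The attributes listed on this slide could be modelled as a small record. The following is a minimal sketch, not CESAC's actual data model: the class name `CorefLink` and all field names are assumptions made for illustration. The two attributes that the slide puts in parentheses are treated as optional. The sample values are taken from the source node example on the "CESAC output 1" slide.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CorefLink:
    """One coreference link from a source constituent to a destination.

    Hypothetical model: names are illustrative, not taken from CESAC.
    """
    source_id: int                  # ID of the referring (source) node
    destination_id: int             # ID of the node referred to
    coref_type: str                 # e.g. "Identity" or "CrossSpeech"
    distance: int                   # distance measure between the two nodes
    np_type: Optional[str] = None   # optional: definite, indefinite, etc.
    animacy: Optional[str] = None   # optional


link = CorefLink(source_id=20, destination_id=21,
                 coref_type="Identity", distance=16)
print(link)
```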
4. CESAC coreference types operationalizing IS
- Two basic rules
- - Do not omit possible coreference information
- - A source should be linked to the nearest possible destination
- Labels encoding different forms of anaphoricity
- - Identity
- - - Jacqueline plays the cello. She is an amazing musician.
- - Cross Speech
- - - John said to Paul: "Why don't you play the guitar?"
- - Inferred
- - - Do you see that house? They say the kitchen is extremely spacious.
- - World knowledge (separate category)
- - - According to Burt Reynolds, all dogs go to heaven.
- Cross Speech > Identity > Inferred
- Encoding facts vs. encoding interpretations: objective data
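The nearest-destination rule and the ordering "Cross Speech > Identity > Inferred" can be illustrated with a short sketch. It rests on several assumptions that the slides do not confirm: that the ordering is a precedence used when more than one label could apply, that destinations precede their source in the text, and that positions are simple linear indices. The function names, and the spelling "Inferred" as a label value, are hypothetical.

```python
# Assumed precedence, read off "Cross Speech > Identity > Inferred".
# World knowledge is a separate category and is left out here.
PRECEDENCE = ["CrossSpeech", "Identity", "Inferred"]


def pick_label(applicable):
    """Return the highest-precedence label among the applicable ones."""
    for label in PRECEDENCE:
        if label in applicable:
            return label
    return None


def nearest_destination(source_pos, candidate_positions):
    """Return the closest candidate position before the source, if any."""
    earlier = [p for p in candidate_positions if p < source_pos]
    return max(earlier) if earlier else None


print(pick_label({"Identity", "Inferred"}))
print(nearest_destination(10, [2, 7, 12]))
```

Under these assumptions the first call prefers "Identity" over "Inferred", and the second picks position 7 because it is the closest candidate that precedes position 10.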
5. CESAC input 1
The input: Penn-Treebank
- Standard Penn-Treebank format
- - Collection of <nodes>
- - Each <node> consists of
- - - Brackets: ()
- - - Label: (NP )
- - - Other node: (NP (N ) )
- - - Lexeme: (N man)
- - - Possibly <lexeme> <node>: (P to (NP him))
- - Attributes in label
- - - (NP-ACC (PROA hine))
- - Extra-textual data in CODE nodes
- - - (CODE <TEXT tylaste>)
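The node structure described on this slide (brackets, a label, then nested nodes and/or a lexeme) can be read with a small recursive parser. This is a minimal sketch and not the reader CESAC itself uses. The function names `parse_treebank` and `split_label` are hypothetical. It assumes that lexemes contain no whitespace or parentheses and that the input is balanced (unbalanced input raises an `IndexError`).

```python
def parse_treebank(text):
    """Parse bracketed text into nested lists.

    Each node becomes [label, child, ...]; lexemes stay plain strings.
    The unlabeled wrapper "( ... )" around a sentence becomes a list
    whose first element is itself a node.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "(", "node must start with an opening bracket"
        pos += 1
        node = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())   # nested node
            else:
                node.append(tokens[pos])    # label or lexeme
                pos += 1
        pos += 1                            # skip the closing bracket
        return node

    trees = []
    while pos < len(tokens):
        trees.append(parse_node())
    return trees


def split_label(label):
    """Split 'NP-ACC' into ('NP', ['ACC']); indices such as '1' are kept too."""
    category, *attributes = label.split("-")
    return category, attributes


trees = parse_treebank(
    "( (IP-MAT (CONJ and) (NP-ACC (PROA hine)) (P to (NP him))))"
)
print(trees[0][0][0])          # label of the sentence node
print(split_label("NP-ACC"))
```

Nested lists keep the sketch short; a real tool would more likely use a node class that also stores node IDs.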
6. CESAC input 2
( (CODE <T06080009600,11.4>)
  (IP-MAT (CONJ And)
          (NP-NOM (DN tat) (NN folc))
          (NP-ACC (PROA hine))
          (ADVP-TMP (ADVT ta))
          (PP (P mid)
              (NP-DAT (ADJD unasecgendlicre) (ND wurdmynte)))
          (PP (P to)
              (NP-DAT (ND scipe)))
          (VBDI geladdon)
          (. ,))
  (ID coapollo,ApT11.4.183))
( (IP-MAT (CONJ and)
          (NP-NOM (NRN Apollonius))
          (NP-ACC-1 (PROA hi))
          (VBDI bad)
          (IP-INF (NP-ACC-SBJ *ICH*-1)
                  (QP-ACC (QA ealle))
                  (VB gretan)))
  (ID coapollo,ApT11.4.184))
7. CESAC output 1
enriched Penn-Treebank
- Penn-Treebank format
- Enriched with coreference information
- - Source node ID
- - Destination node ID
- - Coreference type
- - Coreference distance (derivable)
- Destination node example
- - (NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))
- Source node example
- - (NP-OB1 (CODE <Coref_Id="20"_Ref="21"_Type="Identity"_NdDist="16"_/>) (PRO hem))
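Because the coreference information sits in ordinary CODE nodes, it can be pulled out of the enriched file with a regular expression. The sketch below is one possible way to do this, not a description of CESAC's own tooling. The function name `extract_links` and the dictionary keys are assumptions. The pattern accepts attribute values with or without an "=" sign, since the exact attribute syntax in the files is not certain from the slide text.

```python
import re

# One attribute: name, optional "=", quoted value.
ATTR = re.compile(r'(Id|Ref|Type|NdDist)=?"([^"]*)"')
# A whole coreference CODE node, e.g. (CODE <Coref_Id="339"_/>)
CODE = re.compile(r'\(CODE\s+<Coref_([^>]*)/>\)')


def extract_links(treebank_text):
    """Return one dict per coreference CODE node found in the text."""
    links = []
    for match in CODE.finditer(treebank_text):
        attrs = dict(ATTR.findall(match.group(1)))
        links.append({
            "id": int(attrs["Id"]),
            # A pure destination node carries only an Id, so these may be None.
            "ref": int(attrs["Ref"]) if "Ref" in attrs else None,
            "type": attrs.get("Type"),
            "distance": int(attrs["NdDist"]) if "NdDist" in attrs else None,
        })
    return links


source = ('(NP-OB1 (CODE <Coref_Id="20"_Ref="21"_Type="Identity"'
          '_NdDist="16"_/>) (PRO hem))')
destination = '(NP-SBJ (CODE <Coref_Id="339"_/>) (NPR Crist))'

print(extract_links(source))
# [{'id': 20, 'ref': 21, 'type': 'Identity', 'distance': 16}]
print(extract_links(destination))
# [{'id': 339, 'ref': None, 'type': None, 'distance': None}]
```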
8. CESAC output 2
Destination node
<node> one-or-more <node> OR <lexeme>
9. CESAC output 3
( (IP-MAT (CONJ and)
          (NP-NOM *con* (CODE <Coref_Id="1488"_Ref="1489"_Type="Identity"_NdDist="12"_/>))
          (VBD ladde)
          (NP-ACC (CODE <Coref_Id="1476"_Ref="1477"_Type="Identity"_NdDist="8"_/>) (PROA hine))
          (PP (P mid)
              (NP-DAT-RFL (CODE <Coref_Id="1487"_Ref="1488"_Type="Identity"_NdDist="6"_/>) (PROD him)))
          (PP (P to)
              (NP-DAT (PRO his (CODE <Coref_Id="1486"_Ref="1487"_Type="Identity"_NdDist="5"_/>))
                      (ND huse))))
  (ID coapollo,ApT12.16.209))
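Since each source points to one destination, the links in this fragment form chains that can be followed from node to node. The sketch below shows one way to do that. The function name `chain_from` and the dictionary layout are assumptions; the Id and Ref pairs are copied from the fragment above.

```python
def chain_from(source_id, ref_by_id):
    """Follow Ref pointers from a source until a node has no outgoing link."""
    chain = [source_id]
    seen = {source_id}
    while chain[-1] in ref_by_id:
        nxt = ref_by_id[chain[-1]]
        if nxt in seen:          # guard against accidental cycles
            break
        chain.append(nxt)
        seen.add(nxt)
    return chain


# Id -> Ref pairs as they appear in the fragment
ref_by_id = {1488: 1489, 1476: 1477, 1487: 1488, 1486: 1487}

print(chain_from(1486, ref_by_id))   # [1486, 1487, 1488, 1489]
print(chain_from(1476, ref_by_id))   # [1476, 1477]
```

In the fragment, the chain starting at 1486 runs from "his" through "him" and the empty subject to node 1489, which lies outside the fragment shown.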
10. CESAC output 4
Source node
<node> one-or-more <node> OR <node> <lexeme> OR <lexeme>
11. Inter-rater agreement 1
- Two features measured
- - Coreference destination (node ID)
- - Coreference type
- Adapted version of Cohen's kappa: κ > .6
- Two important problems
- - Identity vs. cross speech
- - Omission of link
- Solutions
- - Create new rule(s)
- - Adapt/specify existing rule(s)
12. Inter-rater agreement 2
- Tool used to calculate inter-rater agreement concerning
- - Coreference destination (feature 1): κ = .67
- - Coreference type (feature 2): κ = .66
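For orientation, standard Cohen's kappa compares the observed agreement p_o between two raters with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes this standard version only. The slides mention an adapted version whose details are not given, so this is not the calculation behind the figures above. The function name and the toy labels are invented for illustration and are not data from the study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Standard (unadapted) Cohen's kappa for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:          # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented toy annotations for the coreference type feature
rater_1 = ["Identity", "Identity", "Inferred", "CrossSpeech"]
rater_2 = ["Identity", "CrossSpeech", "Inferred", "CrossSpeech"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 2))   # 0.64
```

Here p_o = 3/4 and p_e = 5/16, which gives κ = 7/11, roughly .64.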
13. Example 1
14. Example 2
- Clean text fragment with translation
- - Ant warschipe hire easked. Hweonene cumest tu fearlac deades munegunge. Ich cume he seid of helle.
- - And Worship him asked, From where come you, Fearlac, death's reminder? I come, he said, from hell.
- Text fragment in CESAC coreference file
- - 170.64 2031 ant2033 warschipe2035 hire easked. Hweonene20422043 cumest 2045 tu2047 fearlac2049 deades munegunge .
- - 170.65 20532054 Ich cume20572058 he seid of2063 helle .
- Text fragment in Penn-Treebank file
( (IP-MAT (CONJ ant)
          (NP-SBJ (N warschipe))
          (NP-OB1 (PRO hire))
          (VBP easked)
          (, .)
          (CP-QUE-SPE (WADVP-1 (WADV Hweonene))
                      (IP-SUB-SPE (ADVP-DIR *T*-1)
                                  (VBP cumest)
                                  (NP-SBJ (CODE <Coref_Id="346"_Ref="345"_Type="CrossSpeech"_NdDist="10"_/>) (PRO tu))
                                  (NP-VOC (CODE <Coref_Id="50"_Ref="346"_Type="Identity"_NdDist="2"_/>) (N fearlac)
15. Summary and conclusion
- Annotation program CESAC
- - Input: standard Penn-Treebank
- - Output: relatively easy to analyse
- - Inter-rater agreement measured
- Operationalizing IS
- - 4 coreference types
- - As objective as possible: facts vs. interpretations
- Plans
- - Fixed set of coreference types
- - Larger corpus of coreferenced texts
16. Thank you for your attention!