1 A Bilingual Corpus of Inter-linked Events
Tommaso Caselli*, Nancy Ide†, Roberto Bartolini*
* Istituto di Linguistica Computazionale, ILC-CNR Pisa
† Department of Computer Science, Vassar College, USA
tommaso.caselli_at_ilc.cnr.it, ide_at_cs.vassar.edu, roberto.bartolini_at_ilc.cnr.it
LREC 08 Marrakech, 30 May 2008
2 Outline
- Motivations
- A (gentle) introduction to TimeML, TimeBank and Italian TimeBank Corpora
- The Bilingual Corpus: linking events in TimeBank and Italian TimeBank by means of the Inter-Lingual Index
- Evaluation and Experiments
  - similar events in the two corpora
  - the ILI: a bootstrapping device for creating comparable corpora
- Conclusion
3 Motivations
- Retrieving the temporal relations between events in texts is required to improve the performance of I.R. and Open Domain Q.A. systems
- one of the most challenging tasks is event identification
- can we facilitate event recognition by linking two corpora that are comparable in size, content and annotation, in two different languages, by means of the Inter-Lingual Index (ILI), which links ItalWordNet (IWN) and WordNet (WN)?
- are events encoded in the same way in Italian and English?
- can we import layers of annotation from one corpus to another in two different languages by exploiting the ILI?
4 TimeML, TimeBank and Italian TimeBank
- TimeML (Pustejovsky et al., 2003) is a specification language for annotating the core elements of a temporal framework:
  - temporal expressions (<TIMEX3>), e.g. December 1st, three years
  - a wide range of linguistic expressions (verbs, nouns, nominalizations, stative adjectives...) realizing eventualities (<EVENT>), i.e. events and states, which it classifies into 7 classes (ASPECTUAL, REPORTING, I_ACTION, I_STATE, PERCEPTION, STATE, OCCURRENCE) according to semantic and syntactic criteria
  - connectives and temporal prepositions (<SIGNAL>), which make explicit the relation holding between two entities
- it creates dependencies between events (<SLINK>, <ALINK>, <TLINK>) and between events and times (<TLINK>).
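The tags listed above can be illustrated with a small sketch. The sentence, the identifiers and the date value below are invented for illustration and are not taken from TimeBank; the attribute set is a simplified subset of TimeML 1.2, in which a TLINK refers to an event through a MAKEINSTANCE element.

```python
import xml.etree.ElementTree as ET

# Invented, simplified TimeML-style sentence (not a TimeBank excerpt).
SAMPLE = (
    '<s>The company <EVENT eid="e1" class="OCCURRENCE">bought</EVENT> the plant '
    '<SIGNAL sid="s1">on</SIGNAL> '
    '<TIMEX3 tid="t1" type="DATE" value="2007-12-01">December 1st</TIMEX3>.'
    '<MAKEINSTANCE eiid="ei1" eventID="e1" tense="PAST" aspect="NONE"/>'
    '<TLINK lid="l1" eventInstanceID="ei1" relatedToTime="t1" '
    'relType="IS_INCLUDED" signalID="s1"/></s>'
)


def summarize(xml_text):
    """Collect events, times and event-time links from one annotated sentence."""
    root = ET.fromstring(xml_text)
    events = {e.get("eid"): (e.text, e.get("class")) for e in root.iter("EVENT")}
    times = {t.get("tid"): (t.text, t.get("value")) for t in root.iter("TIMEX3")}
    # MAKEINSTANCE maps an event instance id back to the EVENT it realizes.
    instances = {m.get("eiid"): m.get("eventID") for m in root.iter("MAKEINSTANCE")}
    links = [
        (instances[l.get("eventInstanceID")], l.get("relType"), l.get("relatedToTime"))
        for l in root.iter("TLINK")
    ]
    return events, times, links


events, times, links = summarize(SAMPLE)
for event_id, rel_type, time_id in links:
    print(events[event_id][0], rel_type, times[time_id][0])
```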
5 TimeML, TimeBank and Italian TimeBank (2)
TimeBank 1.2
- first available corpus annotated with TimeML
- 183 news articles from different sources, including the Penn TreeBank2 Wall Street Journal, for a total of 61K words
- 7,935 events; K = 0.81 on partial match for event identification; K = 0.67 for event classification
Italian TimeBank
- Italian corpus comparable in size (62K words), content and annotation to TB 1.2 (171 articles from the Italian TreeBank and the PAROLE corpus)
- under development: >13K words annotated, 1,755 events
- customization of TimeML to Italian (ISO-TimeML): imperfect value for TENSE; two new attributes, V_FORM and MOOD, for the <EVENT> tag; modification of the <EVENT> tag text span
- mapping of the 7 TimeML event classes to the SIMPLE Ontology to improve event classification (K = 0.84)
6 The Bilingual Corpus: Linking Events
Linkage between the TimeBank (TB) and the Italian TimeBank is accomplished through the Inter-Lingual Index (ILI), developed in the EuroWordNet Project (1999).
The ILI is effectively an unstructured version of WN, used as a hub through which WN synsets are associated with synsets in the WNs of other languages.
- In IWN the ILI is augmented with several semantic relations, such as eq_synonym, eq_hyperonym, eq_cause..., which give specific information on the relations between English and Italian synsets.
- 1,835 events (1,777 verbs, 658 nominalizations): manual annotation of WN 2.0 senses by 2 native speakers; 91% inter-annotator agreement; 1,686 events
- 1,253 events (778 verbs, 462 nominalizations and nouns): semi-automatic annotation of IWN senses.
7 The Bilingual Corpus: Linking Events (2)
[Diagram labels: WN 2.0; IWD 1.5; Auto-Generated Mapping from WD 2.0 to IWD 1.5; WN SENSE; IWN SENSE; ILI; Augmented TimeBank SENSE (WN 2.0); Italian TimeBank SENSE (from IWN); ILI LINK; ILI (IWN)]
The ILI link is automatically determined and restricted to the eq_synonym and eq_near_synonym relations: only events with exactly or approximately the same meaning are linked.
1,103 events in TB with 115 event synsets; 1,250 events in Italian TB with 653 event synsets
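A minimal sketch of the linking step, assuming that each TB event carries a WN sense, each Italian TB event carries an IWN sense, and both can be looked up in ILI tables. The function name, the table layout and the buy1/comprare1 entries are hypothetical and only serve the illustration; seek3, cercare2 and ILI 1432563 are the pair used as an example in this talk.

```python
from collections import namedtuple

# Hypothetical layout of one IWN-to-ILI row: Italian synset, relation, ILI record.
IliRow = namedtuple("IliRow", "iwn_synset relation ili_id")

# Only these relations are accepted for a link, as stated on the slide.
ALLOWED = {"eq_synonym", "eq_near_synonym"}


def link_events(tb_events, itb_events, wn_to_ili, ili_rows):
    """Return {ili_id: (TB event ids, Italian TB event ids)} for shared ILI records."""
    iwn_to_ili = {}
    for row in ili_rows:
        if row.relation in ALLOWED:
            iwn_to_ili.setdefault(row.iwn_synset, set()).add(row.ili_id)

    by_ili = {}
    for event_id, wn_sense in tb_events:
        ili = wn_to_ili.get(wn_sense)
        if ili is not None:
            by_ili.setdefault(ili, ([], []))[0].append(event_id)
    for event_id, iwn_sense in itb_events:
        for ili in iwn_to_ili.get(iwn_sense, ()):
            by_ili.setdefault(ili, ([], []))[1].append(event_id)

    # Keep only ILI records attested in both corpora.
    return {ili: pair for ili, pair in by_ili.items() if pair[0] and pair[1]}


tb_events = [("tb_e1", "seek3"), ("tb_e2", "buy1")]             # buy1: invented
itb_events = [("it_e1", "cercare2"), ("it_e2", "comprare1")]    # comprare1: invented
wn_to_ili = {"seek3": "1432563", "buy1": "ili_buy"}
ili_rows = [
    IliRow("cercare2", "eq_synonym", "1432563"),
    IliRow("comprare1", "eq_hyperonym", "ili_buy"),  # filtered out: relation not allowed
]
linked = link_events(tb_events, itb_events, wn_to_ili, ili_rows)
print(linked)
```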
8 Evaluation: Similar Events
- To what extent is the introduction of WN senses useful for event identification?
- Verify the Semantic Homogeneity Hypothesis: events with (almost) the same meaning are assigned the same TimeML class, i.e. they are semantically homogeneous.
Automatic extraction of all events (nouns and verbs) with the same ILI from both corpora
DATA SPARSENESS
- 35 common event synsets for verbs vs. 11 common event synsets for nouns
9 Evaluation: Similar Events - Verbs
Analysis of common event synsets with a significant number of occurrences in both languages: 25 event synsets, each with at least 5 occurrences
- for each event token we analyzed its semantic pattern
  - basic argument structure, e.g. ARG0 E ARG1 ARG2
  - semantic class of each argument and thematic role, e.g. ARG0PersonAgent
  - subvalency features, e.g. PersonAgent Def_Np E EventThemeClause
  and its TimeML class.
- 30 different patterns have been identified for the 25 common synsets
- 93.22% of cases support the Semantic Homogeneity Hypothesis: same meaning, same semantic pattern, same TimeML class
- instances of event subcategorization (5 cases), i.e. more than one pattern.
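The class comparison behind the Semantic Homogeneity Hypothesis can be sketched as follows. This is a simplification under stated assumptions: it compares only the sets of TimeML classes per shared ILI record and ignores the semantic patterns, so it does not reproduce the figures above. The function name and the record "ili_a" are hypothetical.

```python
from collections import defaultdict


def homogeneity(tb_tokens, itb_tokens):
    """Compare TimeML classes per shared ILI record.

    Each argument is an iterable of (ili_id, timeml_class) pairs, one per event token.
    Returns the share of shared ILI records with identical class sets, plus the conflicts.
    """
    tb, itb = defaultdict(set), defaultdict(set)
    for ili, timeml_class in tb_tokens:
        tb[ili].add(timeml_class)
    for ili, timeml_class in itb_tokens:
        itb[ili].add(timeml_class)

    common = sorted(set(tb) & set(itb))
    conflicts = {ili: (tb[ili], itb[ili]) for ili in common if tb[ili] != itb[ili]}
    share = (len(common) - len(conflicts)) / len(common) if common else 0.0
    return share, conflicts


# Toy input: "ili_a" is invented; the second record carries conflicting classes.
tb_tokens = [("ili_a", "OCCURRENCE"), ("1432563", "I_ACTION")]
itb_tokens = [("ili_a", "OCCURRENCE"), ("1432563", "I_STATE")]
share, conflicts = homogeneity(tb_tokens, itb_tokens)
print(share, conflicts)
```

Conflicting records are returned rather than discarded, since each one has to be inspected by hand before it can be counted as a real counterexample.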
10 Evaluation: Similar Events - Verbs (2)
< 10% of cases seem to question the validity of Semantic Homogeneity
NOT A COUNTEREXAMPLE
ILI 1432563: WN seek3, IWN cercare2; same semantic pattern (person/organization E event); TimeBank class I_ACTION, Italian TB class I_STATE
The inconsistency of the data is due to the exploitation of the SIMPLE-TimeML Mapping and Heuristics (Caselli et al. 2007)
- SIMPLE-TimeML Mapping: SIMPLE Semantic_type Modal Event: I_STATE
- cercare2: Modal Event: I_STATE; Purpose Act: I_ACTION
All other possible counterexamples we have found can be explained in terms of factors other than a real difference between event realizations in the 2 languages
11 Evaluation: Similar Events - Nouns
All 11 common types have been analyzed. They are all instances of nominalization of a corresponding event verb.
The presence of WN senses is useful for identifying incorrect or inconsistent annotations in the source and target corpora, and for more easily identifying those instances which satisfy the criteria for an event in TimeML.
Incorrect annotations in Italian TB: missing semantic types in SIMPLE, e.g. aumento_n has 3 senses in IWN but 1 semantic type in SIMPLE
Incorrect annotations in TB: over-extension of the notion nominalization = event, e.g. payment_n: 8/10 occurrences are marked as EVENT when their meaning is ''a sum of money''.
BUT WN senses are not always sufficient to determine whether a nominal realizes an event or not, because the lexicon contains cases where the (non-)eventive reading is, somehow, always possible.
12 Experiments: the ILI as Bootstrapping Device
- Can the ILI and wordnet senses be used as a bootstrapping strategy for the creation of comparable corpora?
- Key idea: if the Semantic Homogeneity Hypothesis holds, one layer of annotation can be imported from a source corpus to a target one.
To verify the validity of this hypothesis we developed a system which takes as input the TB events augmented with WN senses, and gives as output an additional layer of annotation, i.e. it creates the EVENT tag in Italian.
[Diagram labels: Italian Corpus, IWN sense; TB, WN sense; ILI, P.O.S. of TB EVENT; Italian Corpus, IWN sense, (partial) EVENT annotation]
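The projection idea can be sketched as below. This is not the system described in the talk: it assumes that an Italian token is tagged as EVENT when its IWN sense maps to an ILI record that, with the same part of speech, was seen on a TB EVENT. The function name, the data layout and the casa1 token are hypothetical, and partial matches and ''probable events'' are not modelled.

```python
def project_events(italian_tokens, tb_event_keys, iwn_to_ili):
    """Add an 'event' flag to Italian tokens by projecting TB EVENT tags via the ILI.

    italian_tokens: list of dicts with 'form', 'pos' and 'iwn_sense' (may be None).
    tb_event_keys:  set of (ili_id, pos) pairs observed on TB EVENT tags.
    iwn_to_ili:     dict from IWN sense to a list of ILI records.
    """
    tagged = []
    for token in italian_tokens:
        ili_ids = iwn_to_ili.get(token["iwn_sense"], ())
        is_event = any((ili, token["pos"]) in tb_event_keys for ili in ili_ids)
        tagged.append({**token, "event": is_event})
    return tagged


tb_event_keys = {("1432563", "V")}            # seek3 tagged as EVENT in TB
iwn_to_ili = {"cercare2": ["1432563"]}
italian_tokens = [
    {"form": "cerca", "pos": "V", "iwn_sense": "cercare2"},
    {"form": "casa", "pos": "N", "iwn_sense": "casa1"},   # invented, no ILI entry here
    {"form": "di", "pos": "E", "iwn_sense": None},        # token without a sense
]
tagged = project_events(italian_tokens, tb_event_keys, iwn_to_ili)
print([(t["form"], t["event"]) for t in tagged])
```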
13 Experiments: the ILI as Bootstrapping Device (2)
- To evaluate the reliability of this approach we used the entire corpus of the Italian TreeBank, where a total of 62,522 words (9,832 verbs and 44,957 nouns) are manually assigned a sense from IWD.
Our system identified 3,700 events (6.7%), 1,183 of which are considered ''probable events'' that need human post-processing. 58 new event synsets have been retrieved.
- identification of annotation inconsistencies, i.e. over-extension of the notion of event for nominalizations (e.g. movement4, ''social movement'')
- sense assignment is not sufficient to disambiguate the eventive/non-eventive readings of nominals, e.g. indication1, segnale1
- partial matches occur due to the way sense annotation is performed with WN
- significant reduction of manual effort: only the set of probable events requires validation, and it is restricted to those words whose event reading is not present in WN senses.
14 Conclusion
- Identification of a new methodology to link
comparable corpora in different languages by
means of WN senses and the ILI
- Data from the resulting resource can be used for
contrastive analysis of events as well as
multilingual temporal analysis of texts
- There is a semantic homogeneity between similar
events in different languages, including semantic
preferences for thematic roles and TimeML
classes
- Sense assignment to events improves accuracy in annotation, in particular for event identification, and is useful for revealing inconsistencies and errors
- A modification to TimeML is suggested: the introduction of a tag for ambiguous cases where a double reading (eventive/non-eventive) is always possible
- The ILI can be used as a semi-automatic
bootstrapping device to create resources by
importing layers of annotation for words with
similar sense
16 Experiments and Evaluation: Similar Events - Nouns (2)
Identification of the senses is not enough to determine whether a nominal may realize an event or not.
- the pair ''agreement1 eq_synonym intesa3, accordo3'' does not have a clear-cut eventive sense in either wordnet, BUT
  - in TB 31/32 occurrences are tagged as events: over-extension of the event reading
  - in Italian TB only 7/16 occurrences are tagged as events: in Italian, intesa3 and accordo3 cannot be systematically interpreted as events
  - no difference in WN and IWN senses is signalled between the eventive and non-eventive readings!
- This calls for a refinement of annotation schemes for events, to provide explicit means to mark ambiguous cases where the double reading is, somehow, always possible.