Slides: 17

Provided by: Tomm221

Learn more at: http://www.lrec-conf.org

1
A Bilingual Corpus of Inter-linked Events
Tommaso Caselli (1), Nancy Ide (2), Roberto Bartolini (1)
(1) Istituto di Linguistica Computazionale, ILC-CNR, Pisa
(2) Department of Computer Science, Vassar College, USA
tommaso.caselli_at_ilc.cnr.it, ide_at_cs.vassar.edu,
roberto.bartolini_at_ilc.cnr.it
LREC 08, Marrakech, 30 May 2008
2
Outline

Motivations
A (gentle) introduction to TimeML, TimeBank and
Italian TimeBank Corpora
The Bilingual Corpus: linking events in TimeBank
and Italian TimeBank by means of the Inter-Lingual
Index
Evaluation and Experiments:
similar events in the two corpora
the ILI as a bootstrapping device for creating
comparable corpora
Conclusion

3
Motivations

Retrieving the temporal relations between events
in texts is required to improve the performance
of I.R. and Open Domain Q.A. systems
One of the most challenging tasks is event
identification

Can we facilitate event recognition by linking
two corpora in two different languages,
comparable in size, content and annotation, by
means of the Inter-Lingual Index (ILI), which
links IWN and WN?

Are events encoded in the same way in Italian
and English?

Can we import layers of annotation from one
corpus to another in a different language by
exploiting the ILI?

4
TimeML, TimeBank and Italian TimeBank

TimeML (Pustejovsky et al., 2003) is a
specification language for annotating the core
elements of a temporal framework:

temporal expressions (<TIMEX3>), e.g. December
1st, three years

a wide range of linguistic expressions (verbs,
nouns, nominalizations, stative adjectives...)
realizing eventualities (<EVENT>), i.e. events
and states, classified into 7 classes (ASPECTUAL,
REPORTING, I_ACTION, I_STATE, PERCEPTION, STATE,
OCCURRENCE) according to semantic and syntactic
criteria

connectives and temporal prepositions (<SIGNAL>),
which make explicit the relation holding between
two entities

dependencies between events (<SLINK>, <ALINK>,
<TLINK>) and between events and times (<TLINK>).
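
As a rough illustration of the tag inventory above, the following sketch reads a small TimeML-style fragment and lists its events, temporal expressions and signals. The sentence, the attribute values and the helper name `summarize` are invented for illustration, and only a simplified subset of attributes is shown; it is not taken from TimeBank.

```python
import xml.etree.ElementTree as ET

# The seven TimeML event classes listed on the slide.
EVENT_CLASSES = {
    "ASPECTUAL", "REPORTING", "I_ACTION", "I_STATE",
    "PERCEPTION", "STATE", "OCCURRENCE",
}

# Hypothetical, simplified fragment: the sentence and all attribute
# values are invented; real TimeBank markup carries more attributes.
SAMPLE = (
    '<s>The company <EVENT eid="e1" class="REPORTING">announced</EVENT> '
    'the <EVENT eid="e2" class="OCCURRENCE">merger</EVENT> '
    '<SIGNAL sid="s1">on</SIGNAL> '
    '<TIMEX3 tid="t1" type="DATE">December 1st</TIMEX3>.</s>'
)


def summarize(fragment):
    """Return (events, times, signals) found in a TimeML-style fragment."""
    root = ET.fromstring(fragment)
    events = [(e.get("eid"), e.get("class"), e.text) for e in root.iter("EVENT")]
    times = [(t.get("tid"), t.text) for t in root.iter("TIMEX3")]
    signals = [(s.get("sid"), s.text) for s in root.iter("SIGNAL")]
    # Reject any event whose class is not one of the seven listed classes.
    unknown = [eid for eid, cls, _ in events if cls not in EVENT_CLASSES]
    if unknown:
        raise ValueError(f"unknown event class for: {unknown}")
    return events, times, signals


events, times, signals = summarize(SAMPLE)
for eid, cls, text in events:
    print(eid, cls, text)
```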

5
TimeML, TimeBank and Italian TimeBank (2)
TimeBank 1.2

first available corpus annotated with TimeML
183 news articles from different sources,
including the Penn TreeBank2 Wall Street Journal,
for a total of 61K words
7,935 events; K=0.81 on partial match for event
identification; K=0.67 on event classification

Italian TimeBank

Italian corpus comparable in size (62K words),
content and annotation to TB 1.2 (171 articles
from the Italian TreeBank and the PAROLE corpus)
under development: >13K words annotated, 1,755
events
customization of TimeML to Italian (ISO-TimeML):
IMPERFECT value for TENSE; two new attributes,
V_FORM and MOOD, for the <EVENT> tag;
modification of the <EVENT> tag text span
mapping of the 7 TimeML event classes to the
SIMPLE Ontology to improve event classification
(K=0.84)

6
The Bilingual Corpus: Linking Events
Linkage between the TimeBank (TB) and the Italian
TimeBank is accomplished through the
Inter-Lingual Index (ILI), developed in the
EuroWordNet Project (1999).
The ILI is effectively an unstructured version of
WN, used as a hub through which WN synsets are
associated with synsets in the WNs of other
languages.

In IWN the ILI is augmented with several
semantic relations, such as eq_synonym,
eq_hyperonym, eq_cause..., which give

specific information on the synset relations
between English and Italian.

1,835 events (1,777 verbs, 658 nominalizations):
manual annotation of WN 2.0 senses by 2 native
speakers, 91% inter-annotator agreement

1,686 events

1,253 events (778 verbs, 462 nominalizations
and nouns): semi-automatic annotation of IWN
senses.

7
The Bilingual Corpus: Linking Events (2)
[Diagram: WN 2.0 and IWD 1.5, connected by an
auto-generated mapping from WD 2.0 to IWD 1.5;
WN SENSE and IWN SENSE each point to the ILI; the
Augmented TimeBank SENSE (WN 2.0) and the Italian
TimeBank SENSE (from IWN) are joined by the ILI
LINK via ILI (IWN).]
The ILI link is automatically determined and
restricted to the eq_synonym and eq_near_synonym
relations:
only events with exactly or approximately the
same meaning
1,103 events in TB with 115 event synsets
1,250 events in Italian TB with 653 event synsets
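
The restriction on relation types can be sketched as a simple filter over the IWN-to-ILI records. Only the pairing of seek3, cercare2 and ILI 1432563 appears in the slides; its relation type, the second record, the data layout and the function name `link_synsets` are assumptions made for illustration.

```python
from collections import defaultdict

# Relations that count as "exactly or approximately the same meaning".
ALLOWED_RELATIONS = {"eq_synonym", "eq_near_synonym"}

# (IWN synset, relation to the ILI record, ILI record).
# The first row pairs items named in the slides (relation type assumed);
# the second row is entirely hypothetical.
IWN_TO_ILI = [
    ("cercare2", "eq_synonym", "1432563"),
    ("aumento1", "eq_hyperonym", "0000001"),
]

# WN synset -> ILI record ("rise1" / "0000001" are hypothetical).
WN_TO_ILI = {"seek3": "1432563", "rise1": "0000001"}


def link_synsets(wn_to_ili, iwn_to_ili, allowed=ALLOWED_RELATIONS):
    """Map each WN synset to the IWN synsets sharing its ILI record,
    keeping only IWN records linked by an allowed relation."""
    by_ili = defaultdict(list)
    for iwn_synset, relation, ili in iwn_to_ili:
        if relation in allowed:
            by_ili[ili].append(iwn_synset)
    return {
        wn: sorted(by_ili[ili])
        for wn, ili in wn_to_ili.items()
        if ili in by_ili
    }


links = link_synsets(WN_TO_ILI, IWN_TO_ILI)
print(links)
```

The hypothetical aumento1 record is dropped because its relation is not one of the two allowed types, so rise1 receives no link.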
8
Evaluation: Similar Events

To what extent is the introduction of WN senses
useful for event identification?
Verify the Semantic Homogeneity Hypothesis:
events with (almost) the same meaning are
assigned the same TimeML class, i.e. they are
semantically homogeneous.

Automatic extraction of all events (nouns and
verbs) with the same ILI from both corpora

56 common event synsets

DATA SPARSENESS

35 common event synsets for verbs vs. 11 common
event synsets for nouns
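
The extraction step amounts to intersecting the ILI records attested in the two corpora. A minimal sketch, assuming each event is reduced to an (ILI record, part of speech) pair and that a match requires the same part of speech on both sides; all identifiers except 1432563 are invented.

```python
from collections import defaultdict


def common_event_synsets(tb_events, it_events):
    """Group the ILI records found in both corpora by part of speech.

    Each event is an (ili, pos) pair; a record counts as common only
    if it occurs with the same part of speech in both corpora.
    """
    shared = set(tb_events) & set(it_events)
    by_pos = defaultdict(set)
    for ili, pos in shared:
        by_pos[pos].add(ili)
    return dict(by_pos)


# Hypothetical toy data: "v" = verb, "n" = noun.
tb = [("1432563", "v"), ("0000002", "v"), ("0000003", "n")]
it = [("1432563", "v"), ("0000003", "n"), ("0000004", "n")]

common = common_event_synsets(tb, it)
for pos, synsets in sorted(common.items()):
    print(pos, len(synsets))
```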

9
Evaluation: Similar Events - Verbs
Analysis of common event synsets with a
significant number of occurrences in both
languages: 25 event synsets, each with at least
5 occurrences

for each event token we analyzed its semantic
pattern:

basic argument structure, e.g. ARG0 E ARG1
ARG2
semantic class of each argument and thematic
role, e.g. ARG0PersonAgent
subvalency features, e.g. PersonAgent Def_Np
E EventThemeClause

and its TimeML class.

30 different patterns have been identified for
the 25 common synsets
93.22% of cases support the Semantic Homogeneity
Hypothesis: same meaning, same semantic pattern,
same TimeML class
instances of event subcategorization (5 cases),
i.e. more than one pattern.
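
The slides do not say how the 93.22% figure was computed. One possible way to operationalize the check, offered only as an assumption, is to measure the share of Italian event tokens whose (semantic pattern, TimeML class) pair is also attested for the same ILI record in English. The function name and the toy data are hypothetical.

```python
from collections import defaultdict


def homogeneity_rate(tb_tokens, it_tokens):
    """Share of Italian tokens whose (pattern, class) pair is attested
    for the same ILI record in the English corpus.

    Tokens are (ili, semantic_pattern, timeml_class) triples. Italian
    tokens whose ILI record never occurs in English are ignored.
    Returns None when nothing is comparable.
    """
    attested = defaultdict(set)
    for ili, pattern, cls in tb_tokens:
        attested[ili].add((pattern, cls))
    comparable = [tok for tok in it_tokens if tok[0] in attested]
    if not comparable:
        return None
    matching = sum(
        1 for ili, pattern, cls in comparable if (pattern, cls) in attested[ili]
    )
    return matching / len(comparable)


# Hypothetical toy data; "x1" and "zz" are invented ILI records.
tb_tokens = [
    ("1432563", "person/organization E event", "I_ACTION"),
    ("x1", "person E", "OCCURRENCE"),
]
it_tokens = [
    ("1432563", "person/organization E event", "I_STATE"),
    ("x1", "person E", "OCCURRENCE"),
    ("x1", "person E", "OCCURRENCE"),
    ("zz", "person E", "STATE"),
]

rate = homogeneity_rate(tb_tokens, it_tokens)
print(rate)
```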

10
Evaluation: Similar Events - Verbs (2)
< 10% of cases seem to question the validity of
the Semantic Homogeneity Hypothesis
NOT A COUNTEREXAMPLE
ILI 1432563: WN seek3, IWN cercare2
same semantic pattern: person/organization E
event
TimeBank class: I_ACTION; Italian TB class:
I_STATE
The inconsistency in the data is due to the use
of the SIMPLE-TimeML Mapping and Heuristics
(Caselli et al. 2007)
- SIMPLE-TimeML Mapping: SIMPLE Semantic_type
Modal Event -> I_STATE
- cercare2: Modal Event -> I_STATE; Purpose Act
-> I_ACTION
All other possible counterexamples we have found
can be explained by factors other than a real
difference between event realizations in the 2
languages
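
The cercare2 case can be replayed with a small lookup table. The two mapping rows come from the slide; reading cercare2 as carrying both SIMPLE semantic types, and the function names, are assumptions made for this sketch.

```python
# SIMPLE semantic type -> TimeML class (the two rows shown on the slide).
SIMPLE_TO_TIMEML = {
    "Modal Event": "I_STATE",
    "Purpose Act": "I_ACTION",
}

# Assumption: cercare2 is associated with both semantic types in SIMPLE.
SIMPLE_TYPES = {"cercare2": ["Modal Event", "Purpose Act"]}


def candidate_classes(sense):
    """TimeML classes that the mapping allows for an IWN sense."""
    return [
        SIMPLE_TO_TIMEML[sem_type]
        for sem_type in SIMPLE_TYPES.get(sense, [])
        if sem_type in SIMPLE_TO_TIMEML
    ]


def mapping_explains_mismatch(sense, tb_class, it_class):
    """True when two differing classes are both licensed by the mapping,
    i.e. the mismatch can come from the mapping, not from the languages."""
    candidates = candidate_classes(sense)
    return tb_class != it_class and tb_class in candidates and it_class in candidates


print(candidate_classes("cercare2"))
print(mapping_explains_mismatch("cercare2", "I_ACTION", "I_STATE"))
```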
11
Evaluation: Similar Events - Nouns
All 11 common types have been analyzed. They are
all instances of nominalization of a
corresponding event verb.
The presence of WN senses is useful for
identifying incorrect or inconsistent annotations
in the source and target corpora, and for more
easily identifying those instances which satisfy
the criteria for an event in TimeML
Incorrect annotations in Italian TB: missing
semantic types in SIMPLE, e.g. aumento_n has 3
senses in IWN but 1 semantic type in SIMPLE
Incorrect annotations in TB: over-extension of
the notion nominalization = event, e.g.
payment_n: 8/10 occurrences are marked as EVENT
when their meaning is ''a sum of money''.
BUT WN senses are not always sufficient to
determine whether a nominal realizes an event or
not, due to the existence in the lexicon of cases
where the (non-)eventive reading is, somehow,
always possible.
12
Experiments: the ILI as Bootstrapping Device

Can the ILI and wordnet senses be used as a
bootstrapping strategy for the creation of
comparable corpora?

Key idea: if the Semantic Homogeneity Hypothesis
holds, it enables the import of one layer of
annotation from a source corpus to a target
one.

To verify the validity of this hypothesis we
developed a system which takes as input the
events augmented with WN senses from the TB, and
gives as output an additional layer of
annotation, i.e. it creates the EVENT tag in
Italian.
[Diagram: input: TB with WN senses (ILI and
P.O.S. of each TB EVENT) and Italian Corpus with
IWN senses; output: Italian Corpus with IWN
senses and (partial) EVENT annotation.]
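
The import step described above can be sketched as a lookup: an Italian token receives the EVENT tag when its IWN sense maps to an ILI record that, with the same part of speech, marks an event in TB. This is a minimal sketch under those assumptions, not the authors' system; the data layout, the function name `project_events` and the second token are hypothetical.

```python
def project_events(tb_events, italian_tokens, iwn_to_ili):
    """Copy the EVENT layer from TB onto sense-tagged Italian tokens.

    tb_events: iterable of (ili, pos) pairs for TB events.
    italian_tokens: list of dicts with "form", "pos" and "sense" keys.
    iwn_to_ili: dict mapping an IWN sense to its ILI record.
    Returns new token dicts with an added "tag" key ("EVENT" or None).
    """
    event_keys = set(tb_events)
    tagged = []
    for tok in italian_tokens:
        ili = iwn_to_ili.get(tok["sense"])
        is_event = ili is not None and (ili, tok["pos"]) in event_keys
        tagged.append({**tok, "tag": "EVENT" if is_event else None})
    return tagged


# Toy input: the first token uses the sense named in the slides,
# the second token and its sense label are invented.
tb_events = [("1432563", "v")]
iwn_to_ili = {"cercare2": "1432563"}
italian_tokens = [
    {"form": "cerca", "pos": "v", "sense": "cercare2"},
    {"form": "casa", "pos": "n", "sense": "casa1"},
]

tagged = project_events(tb_events, italian_tokens, iwn_to_ili)
for tok in tagged:
    print(tok["form"], tok["tag"])
```

Tokens whose sense has no ILI record, or whose part of speech differs from the TB event, are left untagged rather than guessed.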
13
Experiments: the ILI as Bootstrapping Device (2)

To evaluate the reliability of this approach we
used the entire corpus of the Italian TreeBank,
where a total of 62,522 words (9,832 verbs and
44,957 nouns) are manually assigned a sense from
IWD.

Our system identified 3,700 events (6.7%), 1,183
of which are considered ''probable events'' that
need human post-processing. 58 new event synsets
have been retrieved.
- identification of annotation inconsistencies,
i.e. over-extension of the notion of event for
nominalizations (e.g. movement4, ''social
movement'')
- sense assignment is not sufficient to
disambiguate the eventive/non-eventive reading of
nominals, e.g. indication1, segnale1
- partial matches occur due to the way sense
annotation is performed with WN
- significant reduction of manual effort: only
the set of probable events requires validation,
and it is restricted to those words whose event
reading is not present in WN senses.
14
Conclusion

Identification of a new methodology to link
comparable corpora in different languages by
means of WN senses and the ILI

Data from the resulting resource can be used for
contrastive analysis of events as well as
multilingual temporal analysis of texts

There is a semantic homogeneity between similar
events in different languages, including semantic
preferences for thematic roles and TimeML
classes

Sense assignment to events improves accuracy in
annotation, in particular for event
identification, and is useful for revealing
inconsistencies and errors

A modification to TimeML is suggested: the
introduction of a tag for those ambiguous cases
where a double reading (eventive/non-eventive)
is always possible

The ILI can be used as a semi-automatic
bootstrapping device to create resources by
importing layers of annotation for words with
similar sense

Thank You!!

16
Experiments and Evaluation: Similar Events -
Nouns (2)
Identification of the senses is not enough to
determine whether a nominal may realize an event
or not.

the pair ''agreement1 eq_synonym intesa3,
accordo3'' does not have a clear-cut eventive
sense in either wordnet, BUT
in TB 31/32 occurrences are tagged as events:
over-extension of the event reading
in Italian TB only 7/16 occurrences are tagged
as events: in Italian, intesa3 and accordo3
cannot be systematically interpreted as events
no difference between the eventive and
non-eventive readings is signalled in the WN and
IWN senses!!

This calls for a refinement of annotation
schemes for events, to provide explicit means to
mark ambiguous cases where the double reading is,
somehow, always possible.
