Title: semantic markup
1 From PAROLE to FLaReNet and beyond
some historical notes to show a path along an
evolving vision
the link between ILC initiatives
priorities and challenges in the international
scenario
Nicoletta Calzolari Istituto di Linguistica
Computazionale - CNR - Pisa glottolo_at_ilc.cnr.it
2 Why are such needed LRs lacking after 30 years
of R&D in the field?
Old slide with Antonio Zampolli (80s/early 90s)
- 1) Because the main trend until the mid-80s was to
privilege the processing of critical phenomena,
studied by the dominating linguistic theories,
rather than focusing on the deep analysis of the
real uses of a language
- As a result CL was focusing on
- few examples, often artificially built
- lexicons made of few entries (toy lexicons)
- grammars with poor coverage
- 2) Because large-scale LRs are costly: their
production requires a big organizing effort
Why do we still lack them?
3 Back from the 70s/80s
Historical notes
Automatic acquisition of lexical information from
MRDs
Pioneering Research
- It was my first research and became central in the
Pisa group (ACQUILEX)
- And also Amsler, Briscoe, Boguraev, Wilks group,
IBM, then Japanese groups, ...
- The trend was large-scale computational methods
for the transformation of machine readable
dictionaries into machine tractable dictionaries
- It became evident that
- Part of the results of meaning extraction, e.g.
many meaning distinctions, which could be
generalised over lexicographic definitions and
automatically captured,
- were unmanageable at the formal representation
level, and had to be blurred into unique features
and values
- Unfortunately, it is still difficult today to
constrain word-meanings within a rigorously
defined organization: by their very nature they
tend to evade any strict boundaries
4 Since then: after the Grosseto Workshop (1985)
Historical notes
- LRs have acquired larger resonance in the last 2
decades, when many activities, in Europe and
world-wide, have contributed to substantial
advances in
- knowledge and capability of how to represent,
create, acquire, access, exploit, harmonise,
tune, maintain, distribute, etc. large lexical
and textual repositories
- In Europe an essential role was played by the EC,
through initiatives
- ACQUILEX
- NERC
- EAGLES
- PAROLE
- SIMPLE
- EuroWordNet
- ISLE
- ELSNET
- RELATOR -> ELRA
- SPARKLE
- ENABLER
- ...
- that saw the participation of many EU groups,
linked over the years by sharing common
approaches and visions
Start AZ breakfast meeting
EAGLES acronym by Roberto
5 Automatic acquisition of info from texts
Historical notes
back from the late 80s
After acquisition from MRDs,
- This trend has become today a consolidated
pervasive fact, and we have moved
- from focusing on acquisition of linguistic
information (as at the beginning)
- to broader acquisition of general knowledge,
with more data-intensive, robust, reliable methods
6 We started building
Historical notes
- LRs as necessary infrastructure
- both for research & applications
- LRs give NLP systems the knowledge needed for
the various kinds of linguistic processing
- Realising that most of the needed information
- escapes individual introspection
- can only be acquired by analysing large textual
corpora attesting language use in different
fields/communicative contexts
- BUT need of adequate models to handle actual
usage of language
- Sub-product -> Importance of statistical methods
- Lesson
- Going from core sets to large coverage has
implications not just in quantitative terms, but
more interestingly in terms of changes to the
models and the strategies of processes
7 PAROLE-SIMPLE-CLIPS: 4 levels of linguistic
description
[Diagram: four linked description levels]
- Phonological Unit: Accent position, Vowel openness,
Consonant pronunciation
- Correspondence Ufon.-Umorf.
- Morphological Unit: Grammatical Cat., Subcat.,
Inflectional Paradigm
- Correspondence Umorf.-Usint.
- Syntactic Unit: Syntactic structure 1, Link,
Syntactic structure 2
- Correspondence Usint.-Usem (Syntax - Semantics
Correspondence)
- Semantic Unit: Ontological type, Event type,
Semantic features, Semantic relations, Extended
Qualia Structure, Regular polysemy, Link type,
Predicative representation
8 Semantic entry
USem3527vaporizzatore
semantic type: Instrument
unification_path: Concrete_entity ArtifactAgentive Telic
apparecchio usato per vaporizzare (device used for vaporising)
un vaporizzatore per piante (a vaporiser for plants)
eventype
cleaning, gardening, cosmetics
USem3527vaporizzatore synonymy USem72288nebulizzatore
USem3527vaporizzatore instrumentverb Usem5239vaporizzare
USem3527vaporizzatore isa Usem3479apparecchio
USem3527vaporizzatore has_as_part Usem61633pulsante
USem3527vaporizzatore created_by UsemD387fabbricare
USem3527vaporizzatore used_for UsemD66019nebulizzare
regular polysemy
Very innovative: too much??
from Nilda Ruimy
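The relations listed for USem3527vaporizzatore can be read as (source, relation, target) triples. The following is only a minimal sketch of how such an entry might be held in memory and queried; the helper names `index_relations` and `targets` are hypothetical and are not part of the SIMPLE/CLIPS tooling.

```python
from collections import defaultdict

# Relation triples copied from the slide: (source USem, relation, target USem).
TRIPLES = [
    ("USem3527vaporizzatore", "synonymy", "USem72288nebulizzatore"),
    ("USem3527vaporizzatore", "instrumentverb", "Usem5239vaporizzare"),
    ("USem3527vaporizzatore", "isa", "Usem3479apparecchio"),
    ("USem3527vaporizzatore", "has_as_part", "Usem61633pulsante"),
    ("USem3527vaporizzatore", "created_by", "UsemD387fabbricare"),
    ("USem3527vaporizzatore", "used_for", "UsemD66019nebulizzare"),
]


def index_relations(triples):
    """Group target units by (source, relation) for quick lookup (hypothetical helper)."""
    index = defaultdict(list)
    for source, relation, target in triples:
        index[(source, relation)].append(target)
    return index


def targets(index, source, relation):
    """Return all targets of `relation` for `source`, or an empty list."""
    return list(index.get((source, relation), []))


INDEX = index_relations(TRIPLES)
print(targets(INDEX, "USem3527vaporizzatore", "used_for"))
```

Keeping the relation name as part of the key makes it easy to ask for one qualia relation at a time (e.g. only `used_for` or only `created_by`).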
9 The SIMPLE Ontology
TELIC
AGENTIVE
CONSTITUTIVE
ENTITY
CONCRETE_ENTITY
ABSTRACT_ENTITY
PROPERTY
REPRESENTATION
EVENT
- Artifact Material
- Furniture
- Clothing
- Container
- Artwork
- Instrument
- Money
- Vehicle
- Semiotic Artifact
Multidimensionality
from Nilda Ruimy
10 Template for the Sem. Type Instrument
ontological information
predicative representation
extended qualia structure
from Nilda Ruimy
11 Constitutive
Agentive
Telic
Formal
made_of is_a_follower_of has_as_member is_a_member_of
has_as_part instrument kinship is_a_part_of
resulting_state relates uses causes concerns affects
constitutive_activity contains
has_as_colour has_as_effect has_as_property
measured_by measures produces produced_by
property_of quantifies related_to successor_of
precedes typical_of feeling is_in lives_in
typical_location
result_of agentive_prog agentive_cause
agentive_experience caused_by source created_by
derived_from
used_for used_as used_by used_against
AGENTIVE
CONSTITUTIVE
is_a antonym_comp antonym_grad mult_opposition
INSTRUMENTAL
indirect_telic purpose
TELIC
ARTIFACTUAL AGENTIVE
is_the_activity_of is_the_ability_of is_the_habit_of
ACTIVITY
DIRECT TELIC
object_of_activity
Extended Qualia Structure
PROPERTY
T-cell, Blood Stem Cell
Ribose, Nucleotide
Catalyze, Enzyme
regulates is_regulated_by ..
NEW!
LOCATION
12 Ontologisation of SIMPLE
- Automatically converting and enriching a
computational lexicon into a formal Ontology
- For NLP semantic tasks
- Potential of ontologies in NLP as
- Backbone in LKBs
- Pivot in multilingual architectures (e.g. KYOTO)
- Reasoning capabilities
- Ontologisation of SIMPLE into OWL
- Conversion of the SIMPLE ontology
- Bottom-up enrichment promoting lexicon knowledge
to the ontology level
- Language independent knowledge from Italian
lexico-semantic information
from Antonio Toral
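To make the idea of converting the SIMPLE ontology into OWL more concrete, here is a minimal sketch that turns is_a pairs into OWL class and subclass statements in Turtle syntax. It is not the actual conversion procedure: the `ex:` namespace, the `to_turtle` helper and the three parent-child pairs are assumptions chosen for illustration, using type names that appear on the slides.

```python
# Standard OWL/RDFS prefixes plus a made-up example namespace (assumption).
PREFIXES = (
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
    "@prefix ex: <http://example.org/simple#> .\n"
)

# Illustrative (child, parent) pairs; the exact hierarchy is assumed here.
IS_A = [
    ("Concrete_entity", "Entity"),
    ("Abstract_entity", "Entity"),
    ("Instrument", "Concrete_entity"),
]


def to_turtle(is_a_pairs):
    """Emit one owl:Class declaration per type and one rdfs:subClassOf per pair."""
    classes = sorted({name for pair in is_a_pairs for name in pair})
    lines = [PREFIXES]
    for name in classes:
        lines.append(f"ex:{name} a owl:Class .")
    for child, parent in is_a_pairs:
        lines.append(f"ex:{child} rdfs:subClassOf ex:{parent} .")
    return "\n".join(lines)


print(to_turtle(IS_A))
```

A bottom-up enrichment step, as mentioned on the slide, would add further statements derived from lexicon entries; that part is deliberately left out of this sketch.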
13 Use of SIMPLE Lexicon & Ontology for Time and
Event detection/annotation (TimeML)
- Different PoS may realise an event: verbs,
nouns, adjectives, prep. phrases
- The SIMPLE Lexicon helps in identifying &
classifying Events (eventive nouns & adjectives)
-> in a 10K Words Annotation Experiment
- each event is associated with an Ontological Type
- the Event-Type from the SIMPLE-Ontology can be
used as default value to provide event
composition, and consequently to instantiate a
temporal representation for each Event
- improvement both in identification &
classification of Events by annotators: 81.17%
accuracy (vs. 72.35%) and K-coefficient 0.84
(vs. 0.7)
SIMPLE Lexicon
Morpho-Syntactic Analysis
Event Detection & Classification
from Tommaso Caselli
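The slide reports agreement as a K-coefficient without saying which statistic was used. Assuming it is Cohen's kappa for two annotators (an assumption, not stated in the source), the computation can be sketched as follows; the event-class labels are toy data, not figures from the experiment.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations (invented for illustration only).
annotator_1 = ["STATE", "OCCURRENCE", "STATE", "STATE"]
annotator_2 = ["STATE", "OCCURRENCE", "OCCURRENCE", "STATE"]
print(cohen_kappa(annotator_1, annotator_2))
```

Unlike raw accuracy, kappa discounts the agreement two annotators would reach by chance, which is why the slide reports both figures.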
14 GLML: Generative Lexicon Markup Language
with James Pustejovsky, Olga Batiukova, Anna
Rumshisky, Marc Verhagen
- Annotating texts with Argument Selection,
Argument Coercion, Qualia Roles
- The corpus brings reality to the model, provides
statistical cues to improve language models
- Lexical semantic info, like type
coercion/selection, required for applications
such as WSD, categorisation, IR (query
reformulation, filtering), IE (coreference
resolution, relation extraction), entailment, ...
- Predicate Argument constructions
- Predicate Sense Disambiguation
- Argument selection: type selection/coercion
- Qualia role/relation selection
- Modification constructions
- Noun Sense Disambiguation
- Qualia role/relation selection in Adjectival
Modification
- Qualia role/relation selection in Nominal
Modification
- Complex Types
- Type selection in modification of Dot Objects
from Valeria Quochi
15 GLML - Using Existing Resources for Italian
- SIMPLE Lexicon & Ontology / ItalWordNet
- Sense Disambiguation
- Type selection/coercion
- Type selection in Dot Objects
- SIMPLE Extended Qualia Structure
- Selection of Qualia roles/relations, e.g.
- Constitutive Relations
- e.g. Is_a_part_of, Is_a_member_of
- Telic Relations
- e.g. Purpose, Object_of_the_activity
- Agentive Relations
- e.g. Source, Result_of
from Valeria Quochi
16 ISO LMF: Lexical Markup Framework
Builds on EAGLES/ISLE
The field is mature
Structural skeleton, with the basic hierarchy of
information in a lexical entry
various extensions
- Modular framework
- LMF specs comply with UML modelling principles
- an XML DTD allows implementation
New initiatives
LIRICS
ICT KYOTO
NEDO Asian Lang.
NICT Language-Grid Service Ontology
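As a rough illustration of the "structural skeleton" of a lexical entry serialised to XML, the sketch below builds a tiny LMF-style document with Python's standard library. The element and attribute names (LexicalResource, Lexicon, LexicalEntry, Lemma, Sense, feat with att/val) follow the LMF DTD as commonly published, but they should be treated as an assumption and checked against the standard; the `feat` and `build_entry` helpers are hypothetical, and the lemma is taken from the semantic-entry slide.

```python
import xml.etree.ElementTree as ET


def feat(parent, att, val):
    """Attach an attribute-value pair as a <feat att=... val=...> child."""
    return ET.SubElement(parent, "feat", {"att": att, "val": val})


def build_entry(written_form, pos, sense_id):
    """Build a one-entry lexical resource: Lexicon > LexicalEntry > Lemma + Sense."""
    resource = ET.Element("LexicalResource")
    lexicon = ET.SubElement(resource, "Lexicon")
    feat(lexicon, "language", "it")
    entry = ET.SubElement(lexicon, "LexicalEntry")
    feat(entry, "partOfSpeech", pos)
    lemma = ET.SubElement(entry, "Lemma")
    feat(lemma, "writtenForm", written_form)
    ET.SubElement(entry, "Sense", {"id": sense_id})
    return resource


xml_text = ET.tostring(
    build_entry("vaporizzatore", "noun", "USem3527vaporizzatore"),
    encoding="unicode",
)
print(xml_text)
```

Extensions (morphology, syntax, semantics, multilingual notations) would hang further elements off the same core skeleton; none of them are modelled here.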
17 BioLexicon: SIMPLE model, ISO-LMF standard
A unique resource among large-scale computational
lexicons within the biomedical domain in terms of
coverage and typology of contained information
Designed to meet bio-Text Mining requirements
BL
Populated with info from available biomedical
resources and texts
Including both domain-specific and general
language words
Semi-automatically populated from
corpora; population toolkit available
Rich linguistic information ranging over
different linguistic description levels
Conformant to international lexical
representation standards
from Monica Monachini
18 The BioLexicon: where from
Incremental population process
Existing repositories
chemical compounds, species names, disease,
enzymes
genes/proteins
Subclustering of term variants
BioLexicon
new genes/proteins names
MEDLINE
Named Entity Recognition
Term Mapping by Normalisation
Verbs, nouns, adjs, advs (variants, inflected
forms, derivative relations, ...)
Manual curation
Subcat extraction
Linguistic pre-processing
Syn-sem mapping
Manual annotation of a bio-event corpus
Bio-event extraction
from Simonetta Montemagni
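The "Term Mapping by Normalisation" and "Subclustering of term variants" steps can be pictured with a toy example. This is only a sketch under strong assumptions: the normalisation rule (lowercase, strip everything except letters and digits), the function names and the sample variants are all invented for illustration and do not describe the actual BioLexicon population toolkit.

```python
import re
from collections import defaultdict


def normalise(term):
    """Assumed normalisation: lowercase, then drop hyphens, spaces and punctuation."""
    return re.sub(r"[^a-z0-9]", "", term.lower())


def cluster_variants(terms):
    """Group surface variants that share the same normalised form."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[normalise(term)].append(term)
    return dict(clusters)


# Toy list of gene/protein name variants (illustrative only).
variants = ["IL-2", "IL 2", "il2", "NF-kappa B", "NF-kappaB"]
for key, members in cluster_variants(variants).items():
    print(key, members)
```

A real pipeline would need more careful rules (Greek letters, roman numerals, species prefixes) and manual curation, as the slide indicates.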
19 Environmental organizations
from Piek Vossen
20 KYOTO SYSTEM
Linear MAF/SYNAF
Term extraction Tybot
Semantic annotation
Linear SEMAF
Generic TMF
Fact extraction Kybot
Domain editing Wikyoto
Fact User
Concept User
LMF API
OWL API
Linear Generic FACTAF
Domain Wordnet
Domain ontology
Wordnet
Ontology
from Piek Vossen
21 Contribution of KYOTO
- hundreds of thousands of sources in the environment
domain
- in many different languages
- spread all over the world
- changing every day
- KYOTO learns terms and concepts from text
documents
- Stored as structures that people and computers
understand
- KYOTO delivers a Web 2.0 environment for
community based control
- Connects people across languages and cultures
- Establishes consensus and knowledge transition
- KYOTO enables semantic search and fact
extraction
- Software can partially understand language and
exploit web 1 data
- Understanding is helped by the terms and
concepts defined for each language
html
pdf
xls
KYBOT
WIKYOTO
TYBOT
from Piek Vossen
22 For the past few years
A new paradigm of R&D in LRs & LT
- Open & distributed linguistic infrastructures
for LRs & LT
- adopting the paradigm of accumulation of
knowledge so successful in more mature
disciplines, based on sharing LRs & tools
- ability to build on each other's achievements,
results accessible to various systems, allowing
controlled & effective cooperation of many groups
on common tasks (see Human Genome Project, HGP)
- Emerging concept of collective intelligence
- Emphasize interoperability among LRs, LT &
knowledge bases
- e.g. initiatives aimed at achieving
international consensus on annotation guidelines
to merge annotation efforts, produce coherent,
comprehensive linguistic annotations to be
readily disseminated throughout the community
- New ways of extending large-scale LRs and
knowledge bases relying on volunteer labour,
wiki-mode?
interoperability
23 Some steps for a new generation of LRs
- From huge efforts in building static,
large-scale, general-purpose LRs
- To dynamic LRs rapidly built on-demand, tailored
to specific user needs
- From closed, locally developed and centralized
resources
- To LRs residing over distributed places,
accessible on the web, choreographed by agents
acting over them
- From Language Resources
- To Language Services
- Need of tools to make this vision operational &
concrete
24 Lexical WEB & Content Interoperability ->
Standards
- As a critical step for semantic mark-up in the
SemWeb
Global WordNet GRID
WordNets
NomLex
WordNets
ComLex
WordNets
with intelligent agents
SIMPLE-WEB
SIMPLE
LMF
Lex_x
BioLexicon
FrameNet
Lex_y
Standards for Interoperability
Enough??
25 Distributed Language Services
- A long-term scenario implying
- content interoperability standards,
- supra-national cooperation and
- development of architectures enabling
accessibility
- Create new resources on the basis of existing ones
- Exchange and integrate information across
repositories
- Compose new services on demand
- ...
- Collaborative & collective/social development and
validation, cross-resource integration and
exchange of information
Language Grid
Wiki
26 In the Semantic Web vision ...
- need to tackle the twofold challenge of
- content availability
- multilinguality
- Natural convergence with HLT
- multilingual semantic processing
- ontologies
- semantic-syntactic computational lexicons
27 BUT
Lack of communication between the communities of
HLT & Semantic Web (SW)/Ontologists
The SW needs HLT & HLT will highly benefit from
the SW
IAAI
otherwise, risk of re-discovery of what was done
20 years ago
see first issue of the International Journal on
Semantic Web & Information Systems, 2005, with
statements identical to ours in papers of the
80s!!!
28 Today
Why a Network of LRs & LTs? Many dimensions
around the notion of language
finally
- We need to consider together
- technical
- organisational
- strategic
- economic
- cultural
- legal
- political issues wrt LRs & LTs
- EU Network FLaReNet
- Next: to build the ORI
Need of bodies for a broad research agenda &
strategic actions for LT & LRs (W/S/MM) based on
all the dimensions
Sensitive
Political issues, e.g. a commonly agreed list of
minimal requirements for national LRs (BLARK)
Multilingualism
Inter- & Multidisciplinarity
- Economic, social issues
- Applications
- Services
Technical, scientific issues
- Cultural issues
- Language and cultural identity
29 The wealth of data & of basic technologies is
such that
Today
- We should reflect again on the field as a whole &
ask if
- Standards
- Creation of LRs
- Automatic acquisition
- Distribution
- are still the important components,
- or how they have changed/must change
- Collaborative creation & Manag.
- Distributed architectures/infrastr.
Which new challenges/initiatives towards a new
more mature infrastructure of LRs & LTs?
30 ILC Strategic - Coordination Activities
- ENABLER Network of National Projects
- EAGLES-ISLE (EU-US & Asia) to define standards
for LRs
- PAROLE-SIMPLE to create LRs
- ACQUILEX - SPARKLE to acquire LRs
- RELATOR -> ELRA (European Language Resources
Association)
- ELSNET (Network of Excellence)
- Forum TAL - Founding Member
- ...
- LREC
- LRE Journal
- ...
- CLARIN Research Infrastructure for the Humanities
& Social Sciences
- FLaReNet - Fostering Language Resources Network
->
- The ORI in the new NoE T4ME
31 LRs: knowledge of the past to build the future
- The trends
- From EACL in Geneva with the Leech episode
- To current ACL Exec discussion on diversifying
papers
- A turning point
- LREC (AZ & ELRA)
- Now also LRE Journal (NC & NI for ELRA)
- LR infrastructure for HLT: a new turning point
for a new paradigm
- Past attempts at proposing LR infrastructures
(6th FP ELITE)
- Now the idea is taken
- Requires a change of mentality
- From "my approach" to some compromise allowing
to go for big amounts/integration/building on
each other/...
- Building the ORI ...
32 Which Communities?
Many LRs & LTs exist, but a global vision, policy &
strategy is still missing
core
for
- Language Resources
- Language Technologies
- Standardisation
- Content/Ontologies
- System developers
- Integrators
-
- Many tasks & application domains
- MT
- CLIR
-
- e-government
- content industry
- intelligence
- e-culture
- e-health
- domotics
EU Forum
FLaReNet Network
with
Multilinguality
- EC
- National funding agencies
- Industry
Focus on cooperation
33 Fostering Language Resources Network: FLaReNet at
a glance
- The largest Network of LR & LT players (more than
200 from all over the world)
- Structure the area of LR & LT of the future by
discussing new strategies to
- Recast the definition of LRs in the light of
recent scientific, methodological, technological,
social developments
- Create a shared policy in the field of LRs and LT
for the next years
- Foster a European strategy for consolidating the
sector and enhancing competitiveness
- Address also multicultural & multilingual aspects
- Consolidate methods and approaches, common
practices, frameworks and architectures
- Integrate so far partial solutions into broader
infrastructures
- Convert existing LTs related to LRs into useful
economic & societal benefits
- Anticipate the needs of new types of LRs/LTs or
Infrastructures
- FLaReNet promotes international cooperation
- For a world-wide effort to build consensus about
sharing of data & technologies
http://www.flarenet.eu
34 Expected Outcome & Impact
- The outcomes will be of a directive nature
- A roadmap
- Identifying areas where consensus is
achieved/emerging vs. areas where more discussion
& testing is required
- Indicating priorities
- Recommendations in the form of a plan of coherent
actions for the EU, national organisations &
industry
- As input to policy development at EU and national
level
- Identifying new language policies supporting
linguistic diversity
- Strengthening the language product market, e.g.
for new products & innovative services
- A (EU) model for the LRs/LTs of the next years
Ambitious!
35 FLaReNet Steering Committee
- Nicoletta Calzolari, ILC-CNR, Italy (Coordinator)
- Khalid Choukri, ELDA, France
- Stelios Piperidis, ILSP, Greece
- Joseph Mariani, LIMSI-CNRS IMMI, France
- Núria Bel, Universitat Pompeu Fabra, Spain
- Gerhard Budin, Universität Wien, Austria
- Jan Odijk, Universiteit Utrecht, The Netherlands
36 The Network is increasing!
Individual Subscribers
- 78 Institutional Members
- 198 Individual Subscribers
Institutional Members
- Calls for international cooperation also outside
Europe
- Worldwide Forum for LRs & LTs
Coordinator Steering Committee
- Essential: Community mobilisation around the ORI
37 Results from Vienna Forum: International
Cooperation
Shaping the Future of the Multilingual Digital
Europe
- Standards & Interoperability
- Standards, interoperability & metadata are topics
to be approached in cooperation. A metadata
catalogue should involve every party
- Common repositories for tools & language data
should be established that are universally and
easily accessible by everyone
- Try to connect ongoing work done by many groups
- The creation of a shared repository with data
formats, annotations, etc., where the most
frequently used and preferred schemes can be
found, is proposed as a major help to achieve and
promote standardisation
- Coordinate input to ISO standardisation work also
from Asian countries
38 Results from Vienna: International Cooperation
- For a new world-wide language infrastructure
- The issue of access to LRT is a critical one that
should involve and have impact on all the
community
- Need to create the means to plug together
different LRs & LTs, in a web-based resource and
technology grid
- With the possibility to easily create new
workflows
- Create conditions to easily share and re-use
technologies, and to have more open source tools
made available for use also by under-funded
groups
- A platform for cooperation could be thought of
around the notions of BLARK and ELARK
39 Results from Vienna: International Cooperation
- Networking & International Forum
- A focus of FLaReNet has to be to find a way to
really bring everyone on board, making sure all
the players around the world are involved
- Networking & support actions must be conducted
more intensively, with establishment of
international committees that have formal
recognition, and organisation of and participation
in common workshops
- International Forum (a meta-body) to share
information, discuss strategies and declare that
there are common objectives
40 Some Actions for FLaReNet
- Use its collaborative website to create a pool of
ideas on which to have a joint reflection
- Promote & help in the standardisation-oriented
tasks and efforts toward harmonisation, sharing
and distribution
- Assemble a broad community of relevant people &
institutions around the world into a
collaborative network where the institutions &
individuals involved are committed
- Promote a new worldwide language infrastructure
for easy access to LRT, in a web-based resource
and technology grid
- Act as a communication vector for open source
resources and tools: this could be in wiki mode
- Produce a White paper summarising ideas for
directors of programs of funding agencies, and
organise a Forum of directors of funding agencies
- Must establish an International Advisory Board:
this group can constitute the nucleus & act as
the needed International Forum
- Prepare a MoU with the main issues discussed and
ask members of FLaReNet to sign it when joining
the Network
41 FLaReNet for the ORI: first actions
- Mobilisation of the community: FLaReNet brand
already accepted/well-known by the community
- LREC
- Workshops to define together & get feedback at
various events
- FLaReNet wiki site
- Community built & maintained ORI
- Organise workshops with relevant groups &
initiatives
- Europeana, Wikipedia, Genome, ...
- Creating links & cooperation with parallel
initiatives (US, Asia, ...)
- Cyberling NSF
- LanguageGrid NICT NACTEC
- AFNLP
- CLARIN
- LDC
- Endorsement of standards, links with relevant
initiatives: ISO, W3C, ...
- Synergies & interactions with new projects in
negotiation phase?
42 FLaReNet for the ORI: first actions
- Draft concrete usage scenarios
- To highlight & define main basic principles,
various dimensions & options
- Discuss basic principles with the community(ies)
- First definitions (governance, architecture,
services, sustainability, ...) & receive bottom-up
feedback (avoid mistake of top-down imposition
from a small group)
- Create consensus & acceptance
- Start a wikipedia of LRs
- Implications for harmonisation, metadata, ...
- Start a shared repository of data formats,
annotation guidelines, standards, best practices,
...
- -> FLaReNet as the ORI self-organised
community
- Ensure broad coverage of LRs & LTs for the ORI
- LREC Map
43 FLaReNet & ORI at LREC
- Special Highlight: Contribute to building the
LREC2010 Map!
- Time is ripe to launch an important initiative,
the LREC2010 Map of Language Resources,
Technologies and Evaluation.
- The Map will be a collective enterprise of the
LREC community, as a first step towards the
creation of a very broad, community-built, Open
Resource Infrastructure.
- First in a series, it will become an essential
instrument to monitor the field and to identify
shifts in the production, use and evaluation of
LRs and LTs over the years.
- When submitting a paper (< 900!), from the START
page fill in a very simple template to provide
essential information about resources (in a broad
sense, also technologies, standards, evaluation
kits, ...) either used for the work described or a
new result of your research
- The Map will be disclosed at LREC, where some
event(s) will be organised around this initiative