Title: semantic markup


1
From PAROLE to FLaReNet and beyond
some historical notes to show a path along an
evolving vision
the link between ILC initiatives
priorities and challenges in the international
scenario
Nicoletta Calzolari, Istituto di Linguistica
Computazionale - CNR - Pisa, glottolo_at_ilc.cnr.it
2
Why are such needed LRs still lacking after 30 years
of R&D in the field?
Old slide with Antonio Zampolli (80s/early 90s)
  • 1) Because the main trend until the mid-80s was to
    privilege the processing of critical phenomena,
    studied by the dominating linguistic theories,
    rather than focusing on the deep analysis of the
    real uses of a language
  • As a result, CL was focusing on
  • few examples - often artificially built
  • lexicons made of few entries (toy lexicons)
  • grammars with poor coverage
  • 2) Because large-scale LRs are costly: their
    production requires a big organizing effort

Why do we still lack them?
3
back from the 70s/80s
Historical notes
Automatic acquisition of lexical information from
MRDs
Pioneering Research
  • It was my first research and became central in the
    Pisa group (ACQUILEX)
  • And also Amsler, Briscoe, Boguraev, Wilks' group,
    IBM, then Japanese groups, ...
  • The trend was large-scale computational methods
    for the transformation of machine readable
    dictionaries into machine tractable dictionaries
  • It became evident that
  • part of the results of meaning extraction, e.g.
    many meaning distinctions, which could be
    generalised over lexicographic definitions and
    automatically captured, were unmanageable at the
    formal representation level, and had to be
    blurred into unique features and values
  • Unfortunately, it is still difficult today to
    constrain word-meanings within a rigorously
    defined organization: by their very nature they
    tend to evade any strict boundaries

4
Since then: after the Grosseto Workshop (1985)
Historical notes
  • LRs have acquired larger resonance in the last 2
    decades, when many activities, in Europe and
    world-wide, have contributed to substantial
    advances in
  • knowledge and capability of how to represent,
    create, acquire, access, exploit, harmonise,
    tune, maintain, distribute, etc. large lexical
    and textual repositories
  • In Europe an essential role was played by the EC,
    through initiatives
  • ACQUILEX
  • NERC
  • EAGLES
  • PAROLE
  • SIMPLE
  • EuroWordNet
  • ISLE
  • ELSNET
  • RELATOR → ELRA
  • SPARKLE
  • ENABLER
  • that saw the participation of many EU groups,
    linked over the years by sharing common
    approaches and visions

Start AZ breakfast meeting
EAGLES acronym by Roberto
5
Automatic acquisition of info from texts
Historical notes
back from the late 80s
After acquisition from MRDs,
  • This trend has become today a consolidated
    pervasive fact, and we have moved
  • from focusing on acquisition of linguistic
    information (as at the beginning)
  • to broader acquisition of general knowledge,
    with more data intensive, robust, reliable methods

6
We started building
Historical notes
  • LRs as necessary infrastructure
  • both for research & applications
  • LRs give to NLP systems the knowledge needed for
    the various linguistic processing
  • Realising that most of the needed information
  • escapes individual introspection
  • can only be acquired analysing large textual
    corpora attesting language use in different
    fields/communicative contexts
  • BUT need of adequate models to handle actual
    usage of language
  • Sub-product? Importance of statistical methods
  • Lesson
  • Going from core sets to large coverage has
    implications not just in quantitative terms, but
    more interestingly in terms of changes to the
    models and the strategies of processes

7
PAROLE-SIMPLE-CLIPS: 4 levels of linguistic
description
  • Phonological Unit: Accent position, Vowel
    openness, Consonant pronunciation
  • Correspondence Ufon.-Umorf.
  • Morphological Unit: Grammatical Cat., Subcat.,
    Inflectional Paradigm
  • Correspondence Umorf.-Usint.
  • Syntactic Unit: Syntactic structure 1, Link,
    Syntactic structure 2
  • Correspondence Usint.-Usem (Syntax - Semantics
    Correspondence)
  • Semantic Unit: Ontological type, Event type,
    Semantic features, Semantic relations, Extended
    Qualia Structure, Regular polysemy, Link type,
    Predicative representation
8
Semantic entry
USem3527vaporizzatore
  • semantic type: Instrument
  • unification_path: Concrete_entity ArtifactAgentive Telic
  • apparecchio usato per vaporizzare (device used for
    vaporizing)
  • un vaporizzatore per piante (a vaporizer for plants)
  • eventype
  • cleaning, gardening, cosmetics
  • USem3527vaporizzatore synonymy USem72288nebulizzatore
  • USem3527vaporizzatore instrumentverb Usem5239vaporizzare
  • USem3527vaporizzatore isa Usem3479apparecchio
  • USem3527vaporizzatore has_as_part Usem61633pulsante
  • USem3527vaporizzatore created_by UsemD387fabbricare
  • USem3527vaporizzatore used_for UsemD66019nebulizzare
  • regular polysemy
Very innovative: too much??
from Nilda Ruimy
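The relations listed in the vaporizzatore entry above can be read as (source, relation, target) triples. The following is only a minimal sketch of that reading, not the SIMPLE database format: the `Relation` type, the `group_by_role` helper and the relation-to-role mapping in `QUALIA_ROLE` are assumptions made for illustration (the mapping follows the usual formal/constitutive/agentive/telic reading of the four relations it covers).

```python
from collections import defaultdict
from typing import NamedTuple


class Relation(NamedTuple):
    """One semantic relation of the entry, read as a triple."""
    source: str
    name: str
    target: str


# Relations copied from the vaporizzatore entry on the slide.
RELATIONS = [
    Relation("USem3527vaporizzatore", "synonymy", "USem72288nebulizzatore"),
    Relation("USem3527vaporizzatore", "instrumentverb", "Usem5239vaporizzare"),
    Relation("USem3527vaporizzatore", "isa", "Usem3479apparecchio"),
    Relation("USem3527vaporizzatore", "has_as_part", "Usem61633pulsante"),
    Relation("USem3527vaporizzatore", "created_by", "UsemD387fabbricare"),
    Relation("USem3527vaporizzatore", "used_for", "UsemD66019nebulizzare"),
]

# Assumed mapping from relation name to qualia role (illustrative only).
QUALIA_ROLE = {
    "isa": "formal",
    "has_as_part": "constitutive",
    "created_by": "agentive",
    "used_for": "telic",
}


def group_by_role(relations):
    """Group (relation name, target) pairs by qualia role.

    Relations without a role in QUALIA_ROLE are collected under "other".
    """
    grouped = defaultdict(list)
    for rel in relations:
        role = QUALIA_ROLE.get(rel.name, "other")
        grouped[role].append((rel.name, rel.target))
    return dict(grouped)


if __name__ == "__main__":
    for role, pairs in group_by_role(RELATIONS).items():
        print(role, pairs)
```

Keeping the relations as flat triples and deriving the qualia grouping from a separate table means the grouping can be changed without touching the entry data.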
9
The SIMPLE Ontology
TELIC
AGENTIVE
CONSTITUTIVE
ENTITY
CONCRETE_ENTITY
ABSTRACT_ENTITY
PROPERTY
REPRESENTATION
EVENT
  • Artifact Material
  • Furniture
  • Clothing
  • Container
  • Artwork
  • Instrument
  • Money
  • Vehicle
  • Semiotic Artifact

Multidimensionality
from Nilda Ruimy
10
Template for the Sem. Type Instrument
ontological information
predicative representation
extended qualia structure
from Nilda Ruimy
11



Extended Qualia Structure
  • Formal: is_a, antonym_comp, antonym_grad,
    mult_opposition
  • Constitutive: made_of, is_a_follower_of,
    has_as_member, is_a_member_of, has_as_part,
    instrument, kinship, is_a_part_of,
    resulting_state, relates, uses, causes, concerns,
    affects, constitutive_activity, contains,
    has_as_colour, has_as_effect, has_as_property,
    measured_by, measures, produces, produced_by,
    property_of, quantifies, related_to,
    successor_of, precedes, typical_of, feeling,
    is_in, lives_in, typical_location
  • Agentive: result_of, agentive_prog,
    agentive_cause, agentive_experience, caused_by,
    source, created_by, derived_from
  • Telic: used_for, used_as, used_by, used_against,
    indirect_telic, purpose, is_the_activity_of,
    is_the_ability_of, is_the_habit_of,
    object_of_activity
  • Sub-groups shown on the slide: PROPERTY, LOCATION,
    ARTIFACTUAL AGENTIVE, INSTRUMENTAL, ACTIVITY,
    DIRECT TELIC
  • NEW! regulates, is_regulated_by, ..
  • T-cell, Blood Stem Cell
  • Ribose, Nucleotide
  • Catalyze, Enzyme
12
Ontologisation of SIMPLE
  • Automatically converting and enriching a
    computational lexicon into a formal Ontology
  • For NLP semantic tasks
  • Potential of ontologies in NLP as Backbone in
    LKBs
  • Pivot in multilingual architectures (e.g. KYOTO)
  • Reasoning capabilities
  • Ontologisation of SIMPLE into OWL
  • Conversion of the SIMPLE ontology
  • Bottom-up enrichment promoting lexicon knowledge
    to the ontology level
  • Language independent knowledge from Italian
    lexico-semantic information

from Antonio Toral
13
Use of the SIMPLE Lexicon & Ontology for Time and
Event detection/annotation (TimeML)
  • Different PoS may realise an event: verbs,
    nouns, adjectives, prep. phrases
  • The SIMPLE Lexicon helps in identifying &
    classifying Events (eventive nouns & adjectives)
    → in a 10K Words Annotation Experiment
  • each event is associated with an Ontological Type
  • the Event-Type from the SIMPLE-Ontology can be
    used as default value to provide event
    composition, and consequently to instantiate a
    temporal representation for each Event
  • improvement both in identification &
    classification of Events by annotators: 81.17%
    accuracy (vs. 72.35%) and K-coefficient 0.84
    (vs. 0.7)

SIMPLE Lexicon
Morpho-Syntactic Analysis
Event Detection & Classification
from Tommaso Caselli
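The slide reports agreement as a K-coefficient without saying which variant was used. As a hedged illustration only, the sketch below assumes Cohen's kappa for two annotators; the `cohen_kappa` function and the toy labels are hypothetical and are not data from the annotation experiment.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e the agreement expected by chance from each annotator's
    label distribution.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Hypothetical toy labels, invented for this example.
    annotator_1 = ["EVENT", "EVENT", "STATE", "STATE"]
    annotator_2 = ["EVENT", "EVENT", "STATE", "EVENT"]
    print(cohen_kappa(annotator_1, annotator_2))
```

In the toy data the annotators agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, which is why kappa is lower than raw accuracy.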
14
GLML: Generative Lexicon Markup Language
with James Pustejovsky, Olga Batiukova, Anna
Rumshisky, Marc Verhagen
  • Annotating texts with Argument Selection,
    Argument Coercion, Qualia Roles
  • The corpus brings reality to the model, provides
    statistical cues to improve language models
  • Lexical semantic info, like type
    coercion/selection, required for applications
    such as WSD, categorisation, IR (query
    reformulation, filtering), IE (coreference
    resolution, relation extraction), entailment,
  • Predicate Argument constructions
  • Predicate Sense Disambiguation
  • Argument selection: type selection/coercion
  • Qualia role/relation selection
  • Modification constructions
  • Noun Sense Disambiguation
  • Qualia role/relation selection in Adjectival
    Modification
  • Qualia role/relation selection in Nominal
    Modification
  • Complex Types
  • Type selection in modification of Dot Objects

from Valeria Quochi
15
GLML - Using Existing Resources for Italian
  • SIMPLE Lexicon & Ontology / ItalWordNet
  • Sense Disambiguation
  • Type selection /coercion
  • Type selection in Dot Objects
  • SIMPLE Extended Qualia Structure
  • Selection of Qualia roles/relations, e.g.
  • Constitutive Relations
  • e.g. Is_a_part_of, Is_a_member_of
  • Telic Relations
  • e.g. Purpose, Object_of_the_activity
  • Agentive Relations
  • e.g. Source, Result_of

from Valeria Quochi
16
ISO LMF: Lexical Markup Framework
Builds on EAGLES/ISLE
The field is mature
Structural skeleton, with the basic hierarchy of
information in a lexical entry
various extensions
  • Modular framework
  • LMF specs comply with UML modelling principles
  • an XML DTD allows implementation

New initiatives
LIRICS
ICT KYOTO
NEDO Asian Lang.
NICT Language-Grid Service Ontology
17
BioLexicon: SIMPLE model & ISO-LMF standard
A unique resource among large-scale computational
lexicons within the biomedical domain in terms of
coverage and typology of contained information
Designed to meet bio-Text Mining requirements
BL
Populated with info from available biomedical
resources and texts
Including both domain-specific and general
language words
Semi-automatically populated from
corpora; population toolkit available
Rich linguistic information ranging over
different linguistic descriptions levels
Conformant to international lexical
representation standards
from Monica Monachini
18
The BioLexicon: where from
Incremental population process
Existing repositories
chemical compounds, species names, disease,
enzymes
genes/proteins
Subclustering of term variants
BioLexicon
new genes/proteins names
MEDLINE
Named Entity Recognition
Term Mapping by Normalisation
Verbs, nouns, adjs, advs (variants, inflected
forms, derivative relations, ...)
Manual curation
Subcat extraction
Linguistic pre-processing
Syn-sem mapping
Manual annotation of a bio-event corpus
Bio-event extraction
from Simonetta Montemagni
19
Environmental organizations
from Piek Vossen
20
KYOTO SYSTEM
Linear MAF/SYNAF
Term extraction Tybot
Semantic annotation
Linear SEMAF
Generic TMF
Fact extraction Kybot
Domain editing Wikyoto
Fact User
Concept User
LMF API
OWL API
Linear Generic FACTAF
Domain Wordnet
Domain ontology
Wordnet
Ontology
from Piek Vossen
21
Contribution of KYOTO
  • Hundreds of thousands of sources in the environment
    domain
  • in many different languages
  • spread all over the world
  • changing every day
  • KYOTO learns terms and concepts from text
    documents,
  • Stored as structures that people and computers
    understand
  • KYOTO delivers a Web 2.0 environment for
    community based control
  • Connects people across language and cultures
  • Establish consensus and knowledge transition
  • KYOTO enables semantic search and fact
    extraction
  • Software can partially understand language and
    exploit web 1 data
  • Understanding is helped by the terms and
    concepts defined for each language

html
pdf
xls
KYBOT
WIKYOTO
TYBOT
from Piek Vossen
22
For a few years now
A new paradigm of R&D in LRs & LT
  • Open & distributed linguistic infrastructures
    for LRs & LT
  • adopting the paradigm of accumulation of
    knowledge so successful in more mature
    disciplines, based on sharing LRs & tools
  • ability to build on each other's achievements,
    results accessible to various systems, allowing
    controlled & effective cooperation of many groups
    on common tasks (see Human Genome Project, HGP)
  • Emerging concept of collective intelligence
  • Emphasize interoperability among LRs, LT &
    knowledge bases
  • e.g. initiatives aimed at achieving
    international consensus on annotation guidelines
    to merge annotation efforts, produce coherent,
    comprehensive linguistic annotations to be
    readily disseminated throughout the community
  • New ways of extending large-scale LRs and
    knowledge bases relying on volunteer labour,
    wiki-mode?

interoperability
23
Some steps for a new generation of LRs
  • From huge efforts in building static,
    large-scale, general-purpose LRs
  • To dynamic LRs rapidly built on-demand, tailored
    to specific user needs
  • From closed, locally developed and centralized
    resources
  • To LRs residing over distributed places,
    accessible on the web, choreographed by agents
    acting over them
  • From Language Resources
  • To Language Services
  • Need of tools to make this vision operational &
    concrete

24
Lexical WEB & Content Interoperability →
Standards
  • As a critical step for semantic mark-up in the
    SemWeb

Global WordNet GRID
WordNets
NomLex
WordNets
ComLex
WordNets
with intelligent agents
SIMPLE-WEB
SIMPLE
LMF
Lex_x
BioLexicon
FrameNet
Lex_y
Standards for Interoperability
Enough??
25
Distributed Language Services
  • A long-term scenario implying
  • content interoperability standards,
  • supra-national cooperation and
  • development of architectures enabling
    accessibility
  • Create new resources on the basis of existing
  • Exchange and integrate information across
    repositories
  • Compose new services on demand
  • Collaborative collective/social development and
    validation, cross-resource integration and
    exchange of information

Language Grid
Wiki
26
In the Semantic Web vision ...
  • need to tackle the twofold challenge of
  • content availability
  • multilinguality
  • Natural convergence with HLT
  • multilingual semantic processing
  • ontologies
  • semantic-syntactic computational lexicons

27
BUT
Lack of communication betw. the communities of
HLT & Semantic Web (SW)/Ontologists
The SW needs HLT & HLT will highly benefit from
the SW
IAAI
otherwise, risk of re-discovery of what was done
20 years ago
see first issue of the International Journal on
Semantic Web Information Systems, 2005, with
statements identical to ours in papers of the
80s!!!
28
Today
Why a Network of LRs & LTs? Many dimensions
around the notion of language
finally
  • We need to consider together
  • technical
  • organisational
  • strategic
  • economic
  • cultural
  • legal
  • political issues wrt LRs & LTs
  • EU Network FLaReNet
  • Next: to build the ORI (Open Resource
    Infrastructure)

Need of bodies for a broad research agenda &
strategic actions for LT & LRs (W/S/MM) based on
all the dimensions
Sensitive
Political issues, e.g. a commonly agreed list of
minimal requirements for national LRs: BLARK
Multilingualism
Inter- & Multidisciplinarity
  • Economic,
  • social issues
  • Applications
  • Services

Technical, scientific issues
  • Cultural issues
  • Language and cultural identity

29
The wealth of data & basic technologies is
such that
Today
  • We should reflect again on the field as a whole
    & ask if
  • Standards
  • Creation of LRs
  • Automatic acquisition
  • Distribution
  • are still the important components,
  • or how they have changed/must change
  • Content interoperability
  • Collaborative creation & management
  • Dynamic LRs
  • Sharing

  • Distributed architectures/infrastructures

Which new challenges/initiatives towards a new &
more mature infrastructure of LRs & LTs?
30
ILC Strategic - Coordination Activities
  • ENABLER Network of National Projects
  • EAGLES-ISLE (EU-US & Asia) to define standards
    for LRs
  • PAROLE-SIMPLE to create LRs
  • ACQUILEX - SPARKLE to acquire LRs
  • RELATOR → ELRA (European Language Resources
    Association)
  • ELSNET (Network of Excellence)
  • Forum TAL - Founding Member
  • LREC
  • LRE Journal
  • .
  • CLARIN: Research Infrastructure for the Humanities
    & Social Sciences
  • FLaReNet - Fostering Language Resources Network
    → the ORI in the new NoE T4ME

31
LRs knowledge of the past to build the future
  • The trends
  • From EACL in Geneva with the Leech episode
  • To current ACL Exec discussion on diversifying
    papers
  • A turning point
  • LREC (AZ & ELRA)
  • Now also LRE Journal (NC & NI for ELRA)
  • LR infrastructure for HLT: a new turning point
    for a new paradigm
  • Past attempts of proposing LR infrastructures
    (6th FP ELITE)
  • Now the idea is taken
  • Requires a change of mentality
  • From 'my approach' to some compromise allowing
    to go for big amounts/integration/building on
    each other/...
  • Building the ORI ...

32
Which Communities?
Many LRs & LTs exist, but a global vision, policy
& strategy is still missing
core
for
  • Language Resources
  • Language Technologies
  • Standardisation
  • Content/Ontologies
  • System developers
  • Integrators
  • Many tasks & application domains
  • MT
  • CLIR
  • e-government
  • content industry
  • intelligence
  • e-culture
  • e-health
  • domotics

EU Forum
FLaReNet Network
with
Multilinguality
  • EC
  • National funding agencies
  • Industry

Focus on cooperation
33
Fostering Language Resources Network: FLaReNet at
a glance
  • The largest Network of LR & LT players (more than
    200 from all over the world)
  • Structure the area of LR & LT of the future by
    discussing new strategies to
  • Recast the definition of LRs in the light of
    recent scientific, methodological, technological,
    social developments
  • Create a shared policy in the field of LRs and LT
    for the next years
  • Foster a European strategy for consolidating the
    sector and enhancing competitiveness
  • Address also multicultural & multilingual aspects
  • Consolidate methods and approaches, common
    practices, frameworks and architectures
  • Integrate so far partial solutions into broader
    infrastructures
  • Convert existing LTs related to LRs into useful
    economic & societal benefits
  • Anticipate the needs of new types of LRs/LTs or
    Infrastructures
  • FLaReNet promotes international cooperation
  • For a world-wide effort to build consensus about
    sharing of data & technologies

http://www.flarenet.eu
34
Expected Outcome & Impact
  • The outcomes will be of a directive nature
  • A roadmap
  • Identifying areas where consensus is
    achieved/emerging vs. areas where more discussion
    & testing is required,
  • Indicating priorities
  • Recommendations in the form of a plan of coherent
    actions for the EU, national organisations &
    industry
  • As input to policy development at EU and national
    level
  • Identifying new language policies supporting
    linguistic diversity
  • Strengthening the language product market, e.g.
    for new products & innovative services
  • A (EU) model for the LRs/LTs of the next years

Ambitious!
35
FLaReNet Steering Committee
  • Nicoletta Calzolari, ILC-CNR, Italy (Coordinator)
  • Khalid Choukri, ELDA, France
  • Stelios Piperidis, ILSP, Greece
  • Joseph Mariani, LIMSI-CNRS & IMMI, France
  • Núria Bel, Universitat Pompeu Fabra, Spain
  • Gerhard Budin, Universität Wien, Austria
  • Jan Odijk, Universiteit Utrecht, The Netherlands

36
The Network is increasing!
Individual Subscribers
  • 78 Institutional Members
  • 198 Individual Subscribers

Institutional Members
  • Calls for international cooperation also outside
    Europe
  • Worldwide Forum for LRs & LTs

Coordinator & Steering Committee
  • Essential: community mobilisation around the ORI

37
Results from Vienna Forum: International
Cooperation
Shaping the Future of the Multilingual Digital
Europe
  • Standards & Interoperability
  • Standards, interoperability & metadata are topics
    to be approached in cooperation. A metadata
    catalogue should involve every party
  • Common repositories for tools & language data
    should be established that are universally and
    easily accessible by everyone
  • Try to connect ongoing work done by many groups
  • The creation of a shared repository with data
    formats, annotations, etc. where to find the
    most frequently used and preferred schemes is
    proposed as a major help to achieve and promote
    standardisation
  • Coordinate input to ISO standardisation work also
    from Asian countries

38
Results from Vienna: International Cooperation
  • For a new world-wide language infrastructure
  • The issue of access to LRT is a critical one that
    should involve and have impact on all the
    community
  • Need to create the means to plug together
    different LR & LT, in a web-based resource and
    technology grid
  • With the possibility to easily create new
    workflows
  • Create conditions to easily share and re-use
    technologies, to have more open source tools to
    be made available for use also to under-funded
    groups.
  • A platform for cooperation could be thought of
    around the notions of BLARK and ELARK

39
Results from Vienna: International Cooperation
  • Networking & International Forum
  • A focus of FLaReNet has to be to find a way to
    really bring everyone on board, making sure all
    the players around the world are involved
  • Networking & support actions must be conducted
    more intensively, with establishment of
    international committees that have formal
    recognition, organisation of and participation
    in common workshops
  • International Forum (a meta-body) to share
    information, discuss strategies and declare that
    there are common objectives

40
Some Actions for FLaReNet
  • Use its collaborative website to create a pool of
    ideas on which to have a joint reflection
  • Promote & help in the standardisation-oriented
    tasks and efforts toward harmonisation, sharing
    and distribution
  • Assemble a broad community of relevant people &
    institutions around the world into a
    collaborative network where institutions &
    individuals involved are committed
  • Promote a new worldwide language infrastructure
    for easy access to LRT, in a web-based resource
    and technology grid
  • Act as a communication vector for open source
    resources and tools; this could be in wiki mode
  • Produce a White paper summarising ideas for
    directors of programs of funding agencies, and
    organise a Forum of directors of funding agencies
  • Must establish an International Advisory Board;
    this group can constitute the nucleus & act as
    the needed International Forum
  • Prepare a MoU with the main issues discussed and
    ask members of FLaReNet to sign it when joining
    the Network

41
FLaReNet for the ORI: first actions
  • Mobilisation of the community: FLaReNet brand
    already accepted/well-known by the community
  • LREC
  • Workshops to define together & get feedback at
    various events
  • FLaReNet wiki site
  • Community built & maintained ORI
  • Organise workshops with relevant groups &
    initiatives
  • Europeana, Wikipedia, Genome, ...
  • Creating links & cooperation with parallel
    initiatives (US, Asia, ...)
  • Cyberling NSF
  • LanguageGrid NICT NACTEC
  • AFNLP
  • CLARIN
  • LDC
  • Endorsement of standards, links with relevant
    initiatives: ISO, W3C, ...
  • Synergies & interactions with new projects in
    negotiation phase?

42
FLaReNet for the ORI: first actions
  • Draft concrete usage scenarios
  • To highlight & define main basic principles,
    various dimensions & options
  • Discuss basic principles with the community(ies)
  • First definitions (governance, architecture,
    services, sustainability, ...) & receive bottom-up
    feedback (avoid mistake of top-down imposition
    from a small group)
  • Create consensus & acceptance
  • Start a wikipedia of LRs
  • Implications for harmonisation, metadata, ...
  • Start a shared repository of data formats,
    annotation guidelines, standards, best practices, ...
  • → FLaReNet as the ORI self-organised
    community
  • Ensure broad coverage of LRs & LTs for the ORI
  • LREC Map

43
FLaReNet & ORI at LREC
  • Special Highlight: Contribute to building the
    LREC2010 Map!
  • Time is ripe to launch an important initiative,
    the LREC2010 Map of Language Resources,
    Technologies and Evaluation.
  • The Map will be a collective enterprise of the
    LREC community, as a first step towards the
    creation of a very broad, community-built, Open
    Resource Infrastructure.
  • First in a series, it will become an essential
    instrument to monitor the field and to identify
    shifts in the production, use and evaluation of
    LRs and LTs over the years.
  • When submitting a paper (< 900!), from the START
    page fill in a very simple template to provide
    essential information about resources (in a broad
    sense, also technologies, standards, evaluation
    kits, ...) either used for the work described or a
    new result of your research
  • The Map will be disclosed at LREC, where some
    event(s) will be organised around this initiative