Title: MEANING: a Roadmap to Knowledge Technologies
1 Meaning
- MEANING: a Roadmap to Knowledge Technologies
- German Rigau. TALP Research Center. UPC. Barcelona. rigau_at_lsi.upc.es
- Bernardo Magnini. ITC-IRST. Povo-Trento. magnini_at_itc.it
- Eneko Agirre. IXA group. EHU. Donostia. eneko_at_si.ehu.es
- Piek Vossen. Irion Technologies. Delft. Piek.Vossen_at_irion.nl
- John Carroll. COGS. U. Sussex. Brighton. johnca_at_cogs.susx.ac.uk
- http://www.lsi.upc.es/nlp/meaning/meaning.html
2 Meaning: Introduction
- Knowledge technologies (semantic web) make sense of petabytes of information
- Range of techniques to automate the knowledge lifecycle
  - Lexical KB (ontologies)
  - Text understanding (IE or other)
    - extract high-level meaning
    - represent and manage in a KB
- HLT to enable knowledge technologies
3 Meaning: Introduction
- Building large and rich KBs by hand
  - Expensive, e.g. CYC, WordNet (EuroWordNet)
  - Introspection fails to reflect reality in texts and domains
    - Is a saint an animate being? Not always: it can be an image.
  - Contradictions
- → Hampers applications of HLT and KT
- Richer KBs (ontologies)
  - Domain knowledge
  - Contradictory subsets
- → Semi-automatic means
4 Meaning: Introduction
- Crucial intermediate tasks
  - Word Sense Disambiguation → from words to concepts (word sense = concept in KB)
  - Large-scale enrichment of (multilingual) Lexical KB → enable semantic processing
- Goal
  - → Large-scale extraction of shallow meaning relations among concepts
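5 Meaning: Shallow semantics
[Figure: shallow semantic frame. The act "Invite" (s456) is linked through the roles object, source and destination to the concepts s378, s412 and s933.]
- Spanish: (Chirac) (invita) (al Dalai_Lama) (a un almuerzo oficial)
- English: (Chirac) (invites) (the Dalai_Lama) (to an official lunch)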
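As a rough illustration of the frame on this slide, the sketch below stores an act concept plus role-to-concept links, and checks that the Spanish and English sentences reduce to the same structure. The names `ShallowFrame`, `frame_from_chunks` and `LEXICON` are invented for the example. The pairing of s378, s412 and s933 with particular roles and chunks is also an assumption, since the slide does not state which concept fills which role.

```python
from dataclasses import dataclass, field

# Toy lexicon mapping surface chunks to concept identifiers.
# ASSUMPTION: the ids appear on the slide, but which chunk maps to which id
# is chosen here only for illustration.
LEXICON = {
    "invita": "s456", "invites": "s456",
    "Chirac": "s412",
    "al Dalai_Lama": "s378", "the Dalai_Lama": "s378",
    "a un almuerzo oficial": "s933", "to an official lunch": "s933",
}

@dataclass
class ShallowFrame:
    act: str                                    # concept id of the event
    roles: dict = field(default_factory=dict)   # role name -> concept id

def frame_from_chunks(chunks):
    """Build a frame from chunks ordered as (source, act, object, destination)."""
    source, act, obj, destination = (LEXICON[c] for c in chunks)
    return ShallowFrame(
        act=act,
        roles={"source": source, "object": obj, "destination": destination},
    )

spanish = frame_from_chunks(["Chirac", "invita", "al Dalai_Lama", "a un almuerzo oficial"])
english = frame_from_chunks(["Chirac", "invites", "the Dalai_Lama", "to an official lunch"])

# Both sentences yield the same language-independent frame.
print(spanish == english)
```

The fixed chunk order is a simplification; a real system would need a parser to decide which chunk fills which role.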
6 Meaning: Introduction
- Crucial intermediate tasks
  - Word Sense Disambiguation
  - Large-scale enrichment of (multilingual) Lexical KB
- Problems (research goals)
  - Enriching LKBs, acquisition of linguistic knowledge
    - Corpora need to be accurately tagged with concepts
  - Accurate WSD needs
    - Hand-tagged data OR richer LKB
  - Multilinguality
    - Words in several languages linked to common concepts
7 Meaning: Outline
- Major research goals
  - Knowledge acquisition into LKBs
  - WSD into LKB concepts
  - Multilingualism
- Meaning roadmap
- Overview of the project
8 Meaning: Knowledge acquisition into LKBs
- Semi-automatic acquisition of linguistic knowledge from corpora is working
  - Subcategorization information
  - Selectional preferences
  - Thematic role assignments
  - Diathesis alternations
  - Domain information
  - Topic signatures
  - Rich lexico-semantic relations between words (dictionaries)
- Large bodies of text with (fast) shallow processors
9 Meaning: Knowledge acquisition into LKBs
- Knowledge for words is not enough
  - Verb senses have different selectional preferences, e.g. for the subject
    - The car ate all the petrol (WN)
  - Verb senses may have different subcat. frames
- Better to key into word senses: source corpora should be tagged
  - Better reflect linguistic phenomena
  - Detect new senses
  - Clustering senses
  - Integrate easily into the multilingual LKB
10 Meaning: WSD into LKB concepts
- Senseval-2 uses word senses (concepts) from WN 1.7
- No large-scale broad-coverage WSD system is available
- Accuracy around 60-70% (V/A/N) when hand-tagged data is available
- Use hand-tagged data to train ML systems
  - Ng's estimate: 16 person-years (short)
- Promising research lines
  - Automatically create training corpus using semantic relations in the LKB (WN)
  - Use untagged data to improve performance
  - Higher precision if more knowledgeable features are used (subcat, sel. preferences, domains)
  - Coarse-grained domain tagging / clusters of senses
11 Meaning: Exploiting EWN Semantic Relations: WSD
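12 Meaning: Exploiting EWN Semantic Relations
- partido 1
  - Pero España puso al partido intensidad, ritmo y coraje. (But Spain brought intensity, pace and courage to the match.)
  - El seleccionador cree que el partido de hoy contra Italia dará la medida de España. (The national coach believes that today's match against Italy will show what Spain is capable of.)
  - El Racing no gana en su campo desde hace seis partidos. (Racing has not won at home for six matches.)
- partido 2
  - Todos los partidos piden reformas legales para TV3. (All parties are calling for legal reforms for TV3.)
  - La derecha planea agruparse en un partido. (The right plans to group together into one party.)
  - El diputado reiteró que ni él ni UDC, como partido, han recibido dinero de Pellerols. (The deputy reiterated that neither he nor UDC, as a party, has received money from Pellerols.)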
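13 Meaning: Exploiting EWN Semantic Relations
- partido 1
  - Rivera pide el soporte de la afición para encarrilar las semifinales. (Rivera asks for the fans' support to get the semifinals on track.)
  - Sólo el equipo de Valero Ribera puede sentenciar una semifinal como lo hizo ayer en un Palau Blaugrana completamente entregado. (Only Valero Ribera's team can settle a semifinal the way it did yesterday in a completely devoted Palau Blaugrana.)
  - El Racing ganó los cuartos de final en su campo. (Racing won the quarter-finals at home.)
- partido 2
  - No negociaremos nunca con un partido político que sea partidario de la independencia de Taiwan. (We will never negotiate with a political party that supports the independence of Taiwan.)
  - Una vez más es noticia la desviación de fondos destinados a la formación ocupacional hacia la financiación de un partido político. (Once again the diversion of funds intended for occupational training towards the financing of a political party is in the news.)
  - Estas leyes fueron votadas gracias a un consenso general de los partidos políticos. (These laws were passed thanks to a general consensus among the political parties.)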
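The examples above suggest how sense-tagged sentences can be collected without hand tagging: sentences that mention an expression related to one sense (semifinal, cuartos de final, partido político) are taken as examples of that sense. The following is a minimal sketch of that idea, not the project's actual tool. The `RELATIVES` table is a hypothetical stand-in for relations read from EWN, and `harvest_examples` is a name invented here.

```python
import re

# ASSUMPTION: a toy stand-in for EWN relations. Each sense lists related
# expressions that are treated as unambiguous cues for that sense.
RELATIVES = {
    "partido_1": ["semifinal", "semifinales", "cuartos de final"],
    "partido_2": ["partido político", "partidos políticos"],
}

def harvest_examples(sentences, relatives=RELATIVES):
    """Assign each sentence to every sense whose related expressions it contains."""
    examples = {sense: [] for sense in relatives}
    for sentence in sentences:
        lowered = sentence.lower()
        for sense, cues in relatives.items():
            # Whole-word match so that "semifinal" does not fire inside other words.
            if any(re.search(r"\b" + re.escape(cue) + r"\b", lowered) for cue in cues):
                examples[sense].append(sentence)
    return examples

corpus = [
    "Rivera pide el soporte de la afición para encarrilar las semifinales.",
    "El Racing ganó los cuartos de final en su campo.",
    "Estas leyes fueron votadas gracias a un consenso general de los partidos políticos.",
    "La derecha planea agruparse en un partido.",  # ambiguous: no cue, not harvested
]
examples = harvest_examples(corpus)
for sense, sents in examples.items():
    print(sense, len(sents))
```

Sentences containing only the ambiguous word itself are deliberately left out; they are the cases a WSD system trained on the harvested examples would later have to resolve.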
14 Meaning: Multilingualism
- Language diversity is a barrier
- Language diversity is helpful
  - Languages realize meaning in different ways
- Use EuroWN multilingual architecture
  - Interlingual Index (ILI) links translation equivalents via interlingual concepts
    - head ---------- s984574 --------- cabeza
    - head ---------- s984557 --------- jefe
- Research on how linguistic knowledge behaves when ported to another language (e.g. subcat information)
- Very important for resource-poor languages
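To make the ILI idea concrete, here is a small sketch of looking up translation equivalents through shared interlingual records. It uses the two records shown on the slide and assumes that English "head" is linked to both of them. The table layout and the function name `translations` are illustrative and do not reflect the EuroWordNet database format.

```python
# Word-to-ILI links per language, using the two records from the slide.
# ASSUMPTION: English "head" is linked to both records.
ILI_LINKS = {
    "en": {"head": ["s984574", "s984557"]},
    "es": {"cabeza": ["s984574"], "jefe": ["s984557"]},
}

def translations(word, source, target, links=ILI_LINKS):
    """Return target-language words sharing an ILI record with `word`, keyed by record."""
    result = {}
    for record in links[source].get(word, []):
        result[record] = sorted(
            w for w, records in links[target].items() if record in records
        )
    return result

print(translations("head", "en", "es"))
# {'s984574': ['cabeza'], 's984557': ['jefe']}
```

Because the lookup goes through concept records rather than word pairs, choosing the right record for an occurrence of "head" (that is, doing WSD) also selects the right Spanish equivalent.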
15 Meaning: Multilingualism
- Selectional preference for the object of the first sense of know
  - sense 1: know, cognize -- (be cognizant or aware of a fact or a specific piece of information; possess knowledge or information about)
    - 0.1128 <communication>
    - 0.0615 <measure quantity amount quantum>
    - 0.0535 <attribute>
    - 0.0389 <object physical_object>
    - 0.0307 <cognition knowledge>
- In EuroWordNet (http://ixa.si.ehu.es)
  - antzeman_1, jakin_2 and ezagutu_1 in Basque
  - conocer_1 and saber_1 in Spanish
  - conèixer_1 and saber_1 in Catalan
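One simple way to picture porting is to copy the preferences acquired for the English sense onto the senses the slide lists as its equivalents in Basque, Spanish and Catalan. The sketch below does exactly that, using the weights from the slide. Copying weights unchanged is an assumption made for illustration; how such knowledge actually behaves when ported is one of the stated research questions. The function name `port_preferences` is invented here.

```python
# Object preferences for know_1, copied from the slide: (weight, semantic class).
KNOW_1_OBJECT_PREFS = [
    (0.1128, "communication"),
    (0.0615, "measure quantity amount quantum"),
    (0.0535, "attribute"),
    (0.0389, "object physical_object"),
    (0.0307, "cognition knowledge"),
]

# Senses the slide lists as equivalents of know_1 in other wordnets.
EQUIVALENTS = {
    "eu": ["antzeman_1", "jakin_2", "ezagutu_1"],
    "es": ["conocer_1", "saber_1"],
    "ca": ["conèixer_1", "saber_1"],
}

def port_preferences(prefs, equivalents):
    """Copy one sense's preferences to every equivalent sense in each language."""
    return {
        (lang, sense): list(prefs)
        for lang, senses in equivalents.items()
        for sense in senses
    }

ported = port_preferences(KNOW_1_OBJECT_PREFS, EQUIVALENTS)
for (lang, sense), prefs in ported.items():
    top_weight, top_class = prefs[0]
    print(lang, sense, top_class, top_weight)
```

Keying the result by (language, sense) keeps Spanish saber_1 and Catalan saber_1 apart even though the sense labels coincide.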
16 Meaning: MEANING roadmap
- Solutions have been tried with relative success in isolation
  - Combination for significant advances (which?)
  - Web as corpus: the BNC (100 Mw) is small for many phenomena
- Incremental design
  - WSD using whatever knowledge is available at the time, for bootstrapping
  - Acquisition of linguistic knowledge using the WSD available at the time (may discard low-accuracy examples)
  - Integrating acquired knowledge in the Multilingual Central Repository and porting knowledge from one language to the other
- Series of cycles: WSD0, WSD1, WSD2, ACQ0, ACQ1, ACQ2, PORT0, PORT1, PORT2
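The incremental design can be pictured as a loop in which each round of WSD uses the knowledge gathered so far, acquisition keeps only confidently tagged examples, and porting feeds the result back for the next round. The sketch below is a schematic under several assumptions: the interleaved ordering WSD0, ACQ0, PORT0, WSD1 and so on is inferred from the description, the confidence threshold is arbitrary, and `run_cycles` and the `toy_*` functions are placeholders rather than project components.

```python
def run_cycles(n_cycles, wsd, acquire, port, knowledge, threshold=0.5):
    """Run WSD -> ACQ -> PORT n_cycles times; return the step log and final knowledge."""
    log = []
    for i in range(n_cycles):
        tagged = wsd(knowledge)                      # WSDi: use whatever is known so far
        log.append(f"WSD{i}")
        # Discard low-confidence examples before acquisition.
        reliable = [(ex, conf) for ex, conf in tagged if conf >= threshold]
        knowledge = knowledge | acquire(reliable)    # ACQi: learn from reliable examples
        log.append(f"ACQ{i}")
        knowledge = knowledge | port(knowledge)      # PORTi: share across languages
        log.append(f"PORT{i}")
    return log, knowledge

# Stub components, purely for demonstration.
def toy_wsd(knowledge):
    return [("ex_high", 0.9), ("ex_low", 0.2)]

def toy_acquire(examples):
    return {f"learned:{ex}" for ex, _ in examples}

def toy_port(knowledge):
    return {f"ported:{k}" for k in knowledge if k.startswith("learned:")}

log, knowledge = run_cycles(3, toy_wsd, toy_acquire, toy_port, knowledge={"seed"})
print(log)
print(sorted(knowledge))
```

Passing the three stages in as functions keeps the loop independent of any particular WSD or acquisition method, which matches the idea of using whatever is available at the time.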
17 Meaning: Architecture
[Figure: architecture diagram. Components shown: web corpora and EWNs for Italian, English, Basque, Spanish and Catalan; WSD, ACQ, UPLOAD and PORT steps; and the Multilingual Central Repository.]
18 Meaning: Project overview
- 3-year research project (started March 2002)
- 1.610 M Euro
- 2 contracted people per site
- Consortium
  - TALP, UPC (German Rigau)
  - ITC-IRST (Bernardo Magnini)
  - IXA, UPV/EHU (Eneko Agirre)
  - University of Sussex (John Carroll)
  - Irion Technologies (Piek Vossen)
19 Meaning: Project results
- A tool set that, using the semantic knowledge of EWN, will automatically obtain from the web large collections of examples for each particular word sense.
- A tool set for enriching EWN using the knowledge acquired automatically from the web.
- A tool set for accurately selecting the senses of the open-class words for the languages involved in the project.
- Multilingual Central Repository to maintain compatibility between WordNets of different languages and versions, past and new.
- A semantically annotated corpus for each WordNet word sense, that is, a multilingual, semantically annotated web corpus.
- Demonstration: CLIR, Q/A system.
- The results of MEANING will be public and free for research.
20 Meaning: Why now?
- Huge amounts of data: throw out the unreliable
- Syntactic dependencies with high enough accuracy
- Supervised WSD with high enough accuracy
  - Coarser grains, sense domain tagging
  - Bootstrapping
- Success coping with multilingualism
  - Porting linguistic knowledge from one language to another using MT / comparable corpora
  - CLIR as good as monolingual IR