Title: Post PAROLE/SIMPLE lexical resources and initiatives in Sweden
1Post PAROLE/SIMPLE lexical resources and
initiatives in Sweden
- Maria Toporowska Gronostaj
- Maria.Gronostaj_at_svenska.gu.se
New Horizons for Linguistic Resources in a Global Context, 7-8 July 2009, Barcelona
2The main aims of my talk
- present work in progress on building a free full-scale lexical resource, the Swedish FrameNet (SFN), conducted by the Swedish Language Bank
- give an overview of the lexical resources contributing to the development of SFN
- describe the core of SFN, the SALDO lexicon
- reflect on merging lexical data from different resources to acquire information on frames
3The overall objectives of the SFN
- create a robust lexical resource aimed at LT applications, with an exhaustive morphological, syntactic and semantic description of lexical units, incl. information on frames and world knowledge relevant for word/text understanding
- produce it cost-effectively by merging data from free lexical resources and re-using free software tools
- ensure its content interoperability
- create an interactive text-lexicon block with morphological and semantic annotations on the fly
4Content interoperability: a challenge for SFN
- The contributing lexicons are heterogeneous in several respects
- have partly different types of content
- were developed for different purposes
- were or are developed by different groups of contributors
- language experts
- a collective effort of both language engineers and users of web-lexicons
5Free lexical resources behind SFN (1)
- SALDO: Swedish monolingual lexicon with semantic and morphological layers
- 76,750 entries, 74,000 distinct semantic units
- The Swedish Associative Thesaurus by L. Lönngren (1992), reincarnated by L. Borin
- enhanced with a complete morphological description by L. Borin and M. Forsberg
- People's Synonym Dictionary (web-lexicon)
- 80,000 Swedish synonym pairs
- synonymy graded from 5 to 0 by lexicon users
- collective effort of web-lexicon users
- language engineering: Viggo Kann
6Free lexical resources behind SFN (2)
- The People's Dictionary (Swedish/English)
- collective effort of web-lexicon users
- equivalents are graded by lexicon users
- language engineering: Viggo Kann
- SemNet
- 52,800 hyperonymy/hyponymy relations automatically retrieved from the definitions of nouns and verbs in GLDB
- PAROLE/SIMPLE lexicons
- 29,000 syntactic units (valency) and 8,500 semantic units encoded with mandatory information
8SALDO: an unusual semantic network
- Lexemes are arranged in a hierarchical network according to the principle of centrality, which captures the semantic closeness between two lexemes
- Semantic relations are postulated for both open and closed classes and can cross word-class boundaries
- There are 51 primitive, semantically unrelated concepts that form the top nodes of the hierarchies capturing centrality. These nodes are connected to an artificial top node, PRIM, to form a tree
- There are no synsets in the sense of WordNet. Neither glosses of the lexemes nor semantic relations such as hyponymy, hyperonymy or qualia relations are explicitly specified.
9Semantic centrality in SALDO
- Each lexical unit is given an obligatory main descriptor, the mother, which can be complemented by an optional determinative descriptor, the father
- bröd (bread): mat + mjöl (food + flour)
- brud (bride): gifta sig + hon (get married + she)
- bröllop (wedding): gifta sig (get married)
- gifta sig (get married): par (pair)
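The descriptor scheme on this slide can be pictured as a small record type. The following is a minimal sketch, not SALDO's actual data format: the class name `Entry`, its field names and the helper `describe` are illustrative assumptions, and the four entries are the examples from the slide, read as "mother + father".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    """Hypothetical record for one lexical unit (not SALDO's real schema)."""
    lemma: str
    gloss: str
    mother: str                   # obligatory main descriptor
    father: Optional[str] = None  # optional determinative descriptor


# Examples taken from the slide.
ENTRIES = [
    Entry("bröd", "bread", mother="mat", father="mjöl"),
    Entry("brud", "bride", mother="gifta sig", father="hon"),
    Entry("bröllop", "wedding", mother="gifta sig"),
    Entry("gifta sig", "get married", mother="par"),
]


def describe(entry: Entry) -> str:
    """Render an entry as 'lemma (gloss): mother + father'."""
    parts = [entry.mother] + ([entry.father] if entry.father else [])
    return f"{entry.lemma} ({entry.gloss}): " + " + ".join(parts)


for e in ENTRIES:
    print(describe(e))
```

Making `father` optional mirrors the slide's distinction between the obligatory mother and the optional father descriptor.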
10Semantic relations in SALDO
- The mother descriptor is usually
- semantically closer to the keyword
- semantically and/or morphologically less complex than the keyword
- more frequent
- stylistically less marked
- acquired earlier in first and second language acquisition
- Father descriptors are used mainly to differentiate lexemes having the same mother
- They are assigned to ca. 50% of words
11Associative sets, assets
- Keywords can function as mother or father descriptors for other lexemes and thus form the basis of any number of derived relations, referred to as assets
- brud (bride): get married + she
- kronbrud (crown bride): bride + chastity
- brudbukett (bride bouquet): bouquet + bride
- brudklänning (bride dress): dress + bride
- brudkrona (bride crown): crown + bride
- brudgum (bridegroom): get married + he
- no assets
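One way to read the bride and bridegroom examples is that assets are obtained by inverting the descriptor links: the assets of a keyword are the lexemes that name it as mother or father. The sketch below works under that reading. The function name `build_assets` is an assumption, and the English glosses from the slide stand in for the Swedish lemmas.

```python
from collections import defaultdict

# (lexeme, mother, father) triples from the slide, using the English glosses.
DESCRIPTORS = [
    ("bride", "get married", "she"),
    ("crown bride", "bride", "chastity"),
    ("bride bouquet", "bouquet", "bride"),
    ("bride dress", "dress", "bride"),
    ("bride crown", "crown", "bride"),
    ("bridegroom", "get married", "he"),
]


def build_assets(descriptors):
    """Map each keyword to the lexemes that use it as mother or father."""
    assets = defaultdict(list)
    for lexeme, mother, father in descriptors:
        assets[mother].append(lexeme)
        if father is not None:
            assets[father].append(lexeme)
    return dict(assets)


ASSETS = build_assets(DESCRIPTORS)
print(ASSETS["bride"])               # lexemes described via "bride"
print(ASSETS.get("bridegroom", []))  # no lexeme uses "bridegroom"
```

In this toy data "bride" collects four assets while "bridegroom" collects none, which matches the contrast shown on the slide.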
12Assets sharing mother relations build natural
semantic groupings
- sol (sun): lysa + himmel (shine + sky)
- comet, moon, star (shine + sky)
- blinka (blink): lysa + snabbt (shine + quickly)
- ljus (candle): lysa + brinna (shine + burn)
- Lexemes sharing both mother and father descriptors are more closely related to each other than those having different father descriptors
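The grouping described on this slide can be sketched as two passes over the same entries: one keyed on the mother alone, one keyed on the mother and father together. The helper `group_by` and the tuple layout are assumptions made for illustration, and the English glosses from the slide are used as lexeme names.

```python
from collections import defaultdict

# (lexeme, mother, father) triples from the slide, using the English glosses.
ENTRIES = [
    ("sun", "shine", "sky"),
    ("comet", "shine", "sky"),
    ("moon", "shine", "sky"),
    ("star", "shine", "sky"),
    ("blink", "shine", "quickly"),
    ("candle", "shine", "burn"),
]


def group_by(entries, key):
    """Collect lexemes under the value that key(entry) returns."""
    groups = defaultdict(list)
    for entry in entries:
        groups[key(entry)].append(entry[0])
    return dict(groups)


# Coarse grouping: everything whose mother is the same.
by_mother = group_by(ENTRIES, key=lambda e: e[1])
# Finer grouping: same mother and same father.
by_both = group_by(ENTRIES, key=lambda e: (e[1], e[2]))

print(by_mother["shine"])
print(by_both[("shine", "sky")])
```

The coarse group under "shine" holds all six lexemes, while the finer key separates the celestial bodies from "blink" and "candle".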
13SALDO world knowledge (1)
- SALDO is an intrinsic network capturing the world knowledge underlying lexical-semantic relations
- The network relations are based on the notion of centrality, measured by the depth of an entry, i.e. its distance down from the PRIM root node
- The deeper an entry lies in the tree, the less central it is
- PRIM
- one
- unit
- two
- pair
- get married
- bride
- The average depth of entries in SALDO is 5.7
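Depth as a centrality measure can be sketched by following mother links up to PRIM and counting the steps. The sketch assumes that the list on this slide (PRIM, one, unit, two, pair, get married, bride) is a single chain in which each item is the mother of the next. The slide does not state this explicitly, and the names `MOTHER` and `depth` are illustrative.

```python
# Assumed mother links, reading the slide's list as one chain from PRIM down.
MOTHER = {
    "one": "PRIM",
    "unit": "one",
    "two": "unit",
    "pair": "two",
    "get married": "pair",
    "bride": "get married",
}


def depth(lexeme, mother=MOTHER, root="PRIM"):
    """Count mother links between a lexeme and the root node."""
    steps = 0
    seen = set()
    while lexeme != root:
        if lexeme in seen:  # guard against malformed (cyclic) data
            raise ValueError(f"cycle detected at {lexeme!r}")
        seen.add(lexeme)
        lexeme = mother[lexeme]
        steps += 1
    return steps


def average_depth(mother=MOTHER):
    """Mean depth over all entries that have a mother link."""
    return sum(depth(lexeme, mother) for lexeme in mother) / len(mother)


print(depth("bride"), depth("one"), average_depth())
```

Under this assumption "bride" lies six links below PRIM and is therefore less central than "pair" or "one".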
14SALDO world knowledge (2)
- SALDO supports the recognition of entailments by pointing to the mother of a keyword, which aids word and text understanding
- It provides explicit information on the distribution of the associative sets among lexemes (e.g. bride vs. bridegroom)
- It includes named entities as entries
- Bulgakov: författare + rysk (writer + Russian)
15Approaches towards frame acquisition in SFN
- Merging relevant lexical data from available free lexical resources
- Cross-language transfer of lexical units, with information on the frames and frame elements, from FN to SFN
- Automatic acquisition of frames from corpora using a software tool, the FrameNet labeler for Swedish text
16Merging lexical data with SALDO involves
- interlinking the morphological units from the component lexicons (based on lemma form, part of speech and inflectional patterns, whenever possible)
- augmenting SALDO's lexical units with the semantic content from SemNet, SIMPLE and the People's Synonym Dictionary, and with English equivalents from the People's Dictionary (Swedish/English)
- adding syntactic information from the PAROLE lexicon to SALDO
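The interlinking step can be sketched as a keyed merge in which entries from the contributing lexicons are attached to a SALDO entry when lemma form and part of speech match. This is a simplified illustration, not the project's implementation: inflectional patterns are left out of the key, and the field names (`mother`, `valency`, `semantic_type`), the function `merge_lexicons` and the non-matching entry are all hypothetical.

```python
def merge_lexicons(core, *others):
    """Augment core entries with fields from other lexicons.

    Each lexicon maps a (lemma, part_of_speech) key to a dict of fields.
    Entries missing from the core are skipped, and fields already present
    in the core are never overwritten.
    """
    merged = {key: dict(fields) for key, fields in core.items()}
    for lexicon in others:
        for key, fields in lexicon.items():
            if key in merged:
                for name, value in fields.items():
                    merged[key].setdefault(name, value)
    return merged


# Toy data; only "gifta sig" and its descriptions come from the slides.
saldo = {("gifta sig", "verb"): {"mother": "par"}}
parole = {
    ("gifta sig", "verb"): {"valency": "Sub. V (refl.) PrepObj med"},
    ("exempelord", "noun"): {"valency": ""},  # invented, absent from the core
}
simple = {("gifta sig", "verb"): {"semantic_type": "Cooperative activity"}}

merged = merge_lexicons(saldo, parole, simple)
print(merged[("gifta sig", "verb")])
```

Keeping SALDO as the core and only adding missing fields reflects the slide's wording that the other resources augment SALDO's lexical units.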
17Frame acquisition supported by PAROLE/SIMPLE
- V: gifta sig (to marry/get married)
- PAROLE
- Sub. (Anim.) V (refl.) PrepObj (Anim.) med (with)
- Sub. (Plural) (Anim.) V (refl.)
- SIMPLE
- Semantic type of V: Cooperative activity
- Selection restrictions: Human V Human; Human V
- Human V (Cooperative activity) Human > Partner(s)
- In FN the Partner role is a core FE in the frames Collaboration, Forming Relationship, Personal relations
- Based on the semantic and syntactic data in the P/S lexicon, the frame Forming Relationship is selected for the verb marry
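The first part of this reasoning, going from a SIMPLE-style pattern to a role and from that role to candidate frames, can be sketched as two lookups. The slide does not say how the final frame is chosen among the candidates, so the sketch stops at the candidate list. The tables `ROLE_RULES` and `CORE_FES`, the function `candidate_frames` and the contrast frame are illustrative assumptions, and frame names are spelled as on the slide.

```python
# Pattern-to-role rule as given on the slide.
ROLE_RULES = {
    ("Human", "Cooperative activity", "Human"): "Partner",
}

# Toy frame inventory: only the Partner entries come from the slide.
CORE_FES = {
    "Collaboration": {"Partner"},
    "Forming Relationship": {"Partner"},
    "Personal relations": {"Partner"},
    "Hypothetical other frame": {"Agent", "Theme"},  # invented for contrast
}


def candidate_frames(subject_type, verb_type, object_type):
    """Return frames whose core FEs include the role implied by the pattern."""
    role = ROLE_RULES.get((subject_type, verb_type, object_type))
    if role is None:
        return []
    return sorted(frame for frame, fes in CORE_FES.items() if role in fes)


print(candidate_frames("Human", "Cooperative activity", "Human"))
```

For gifta sig this yields the three frames named on the slide, from which Forming Relationship is then selected.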
18Automatic acquisition of frames and FEs
- a software tool, the FrameNet labeler for Swedish text
- developed by R. Johansson and P. Nugues
- trained on a semantically annotated corpus produced by cross-language transfer
- 75% accuracy in classification of FEs
19Populating the frames in SFN with lexical units
- re-using the lexical data retrieved from corpora by the FrameNet labeler
- cross-language transfer of lexical units from FN to SFN
- semantic mining and refining of lexical data in the SIMPLE lexicon
- enhancing the repository of lexical units with synonyms, hyponyms and siblings
20Conclusions (1)
- Lexicons can be re-purposed and re-used for the task of SFN creation
- Content integration and interoperability seem feasible to achieve
- SFN can be augmented with
- synsets to compensate for the lack of glosses (data from the People's Synonym Dictionary)
- hyperonymy/hyponymy relations from SemNet
- world knowledge from the SALDO lexicon
- Creation of a text-lexicon block with SALDO annotations on the fly is in progress
21Conclusions (2)
- Desirable further extensions of SFN
- valency information
- explicit semantic typing of lexical units
- multi-word expressions
- broader coverage of different domains
- creation of a text-lexicon block with semantic role annotations
- SFN will be a Swedish contribution to BLARK/CLARIN, available under the Creative Commons Attribution-ShareAlike licence and LGPL 3.0
22References
- Borin, L., Forsberg, M. 2009. All in the Family: A comparison of SALDO and WordNet. Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Odense.
- Johansson, R., Nugues, P. 2007. Construction of a FrameNet labeler for Swedish Text. NODALIDA 2007. Helsinki.
- Kann, V., Rosell, M. 2005. Free Construction of a Swedish Dictionary of Synonyms. NODALIDA 2005. Joensuu.
- Lönngren, L. 1989. Svensk associationslexikon. Del I-IV. Institutionen för lingvistik. Uppsala universitet. Rapport UCDL-R-89-1.
- Lönngren, L. 1998. A Swedish associative thesaurus. In Euralex 98 proceedings, Vol. 2, pp. 467-474.
- SALDO: http://spraakbanken.gu.se/sal/