Title: Semi-automatic methods for WordNet construction
1Semi-automatic methods for WordNet construction
- German Rigau i Claramunt
- http://www.lsi.upc.es/rigau
- TALP Research Center
- Universitat Politècnica de Catalunya
- Eneko Agirre
- http://www.ji.si.upc.es/users/eneko
- IxA NLP Group
- University of the Basque Country
2002 International WordNet Conference
2Setting
- NLP and the Lexicon
- Theoretical: WG, GPSG, HPSG.
- Practical: realistic complexity and coverage
- Lexical bottleneck (Briscoe 91)
- Even worse for languages other than English
3Setting
- Which LK is needed by a concrete NLP system?
- Where is this LK located?
- Which procedures can be applied?
4Setting
- Which LK is needed by a concrete NLP system?
- Phonology: phonemes, stress, etc.
- Morphology: POS, etc.
- Syntactic: category, subcat., etc.
- Semantic: class, SRs, etc.
- Pragmatic: usage, registers, TDs, etc.
- Translations: translation links
5Setting
- Where is this LK located?
- Human brain
- Structured Lexical Resources
- Monolingual and bilingual MRDs
- Thesauri
- Unstructured Lexical Resources
- Monolingual and bilingual Corpora
- Mixing resources
6Setting
- Which procedures can be applied?
- Prescriptive approach
- Machine-aided manual construction
- Descriptive approach
- Automatic acquisition from pre-existing Lexical Resources
- Mixed approach
7Outline
- Setting
- Words and Works
- Merge approach
- Taxonomy construction: monolingual MRDs
- Mapping taxonomies: bilingual MRDs
- Expand approach
- Translation of synsets: bilingual MRDs
- Interface for manual revision
- Conclusions
8Words and WorksWhere is this Lexical Knowledge
located?
- Human brain
- Linguistic String Project (Fox et al. 88)
- Lexical Information for 10,000 entries
- WordNet (Miller et al. 90)
- Semantic information: v1.6 with 99,642 synsets
- Comlex (Grishman et al. 94)
- Syntactic information: 38,000 English words
- CYC Ontology (Lenat 95)
- a person-century of effort to produce 100,000 terms
- LDOCE3-NLP
- dictionary with 80,000 senses
9Words and WorksWhere is this Lexical Knowledge
located?
- Structured Lexical Resources
- Monolingual MRDs
- LDOCE
- learner's dictionary
- 35,956 entries and 76,059 definitions
- 86 semantic and 44 pragmatic codes
- controlled vocabulary of 2,000 words
- (Boguraev & Briscoe 89)
- (Vossen & Serail 90)
- (Bruce & Guthrie 92), (Wilks et al. 93)
- (Dolan et al. 93), (Richardson 97)
10Words and WorksWhere is this Lexical Knowledge
located?
- Structured Lexical Resources
- Other Monolingual MRDs
- Webster's (Jensen & Ravin 87)
- LPPL (Artola 93)
- DGILE (Castellón 93), (Taulé 95), (Rigau 98)
- CIDE (Harley & Glennon 97)
- AHD (Richardson 97)
- WordNet (Harabagiu 98)
- Bilingual MRDs
- Collins Spanish/English (Knight & Luk 94)
- Vox/Harraps Spanish/English (Rigau 98)
11Words and WorksWhere is this Lexical Knowledge
located?
- Structured Lexical Resources
- Thesauri
- Roget's Thesaurus
- 60,071 words in 1,000 categories
- (Yarowsky 92), (Grefenstette 93), (Resnik 95)
- Roget's II and The New Collins Thesaurus
- (Byrd 89)
- Macquarie's thesaurus
- (Grefenstette 93)
- Bunrui Goi Hyou Japanese thesaurus
- (Utsuro et al. 93)
12Words and WorksWhere is this Lexical Knowledge
located?
- Structured Lexical Resources
- Encyclopaedia
- Grolier's Encyclopaedia (Yarowsky 92)
- Encarta (Richardson et al. 98)
- Others
- Telephonic Guides
- Mixing structured lexical resources
- Roget's Thesaurus and Grolier's (Yarowsky 92)
- LDOCE, WN, Collins, ONTOS, UM (Knight & Luk 94)
- Japanese MRD to WN (Okumura & Hovy 94)
- LLOCE, LDOCE (Chen & Chang 98)
13Words and WorksWhere is this Lexical Knowledge
located?
- Unstructured Lexical Resources
- Corpora
- WSJ, Brown Corpus (SemCor), Hansard
- Proper Nouns (Hearst & Schütze 95)
- Idiosyncratic Collocations (Church et al. 91)
- Preposition preferences (Resnik and Hearst 93)
- Subcategorization structures (Briscoe and Carroll 97)
- Selectional restrictions (Resnik 93), (Ribas 95)
- Thematic structure (Basili et al. 92)
- Word semantic classes (Dagan et al. 94)
- Bilingual Lexicons for MT (Fung 95)
14Words and WorksWhere is this Lexical Knowledge
located?
- Using both structured and non-structured Lexical Resources
- MRDs and Corpora
- (Liddy & Paik 92)
- (Klavans & Tzoukermann 96)
- WordNet and Corpora
- (Resnik 93), (Ribas 95), (Li & Abe 95), (McCarthy 01)
15Words and WorksInternational Projects on Lexical
Acquisition
- Japanese Projects
- EDR (Yokoi 95)
- Nine years project oriented to MT
- Bilingual Corpora with 250,000 words
- Monolingual, bilingual and co-occurrence dictionaries
- 200,000 general vocabulary
- 100,000 technical terminology
- 400,000 concepts
16Words and Works International Projects on
Lexical Acquisition
- American Projects
- Comlex (Grishman et al. 94)
- Syntactic information for 38,000 words
- WordNet (Miller 90)
- Semantic Information
- more than 123,000 words organised in 99,000 synsets
- more than 116,000 relations between synsets
- Pangloss (Knight & Luk 94)
- PUM, ONTOS, LDOCE semantic categories, WordNet
- Cyc (Lenat 95)
- common-sense knowledge
- 100,000 concepts and 1,000,000 axioms
17Words and Works International Projects on
Lexical Acquisition
- European Projects
- Acquilex I and II
- LA from monolingual and bilingual MRDs and corpora
- LE-Parole
- Large-scale harmonised set of corpora and lexicons for all the EU languages
- EuroWordNet
- Multilingual WordNet for several European Languages
- Meaning
- Large-scale acquisition of LK from the web
- Large-scale WSD
18Words and WorksLexical Acquisition from MRDs
- Syntactic Disambiguation (Dolan et al. 93)
- Semantic Processing (Vanderwende 95)
- WSD (Lesk 86), (Wilks & Stevenson 97), (Rigau 98)
- IR (Krovetz & Croft 92)
- MT (Knight and Luk 94), (Tanaka & Umemura 94)
- Semantically enriching MRDs
- (Yarowsky 92), (Knight 93), (Chen & Chang 98)
- Building LKBs
- (Bruce & Guthrie 92)
- (Dolan et al. 93)
- (Artola 93)
- (Castellón 93), (Taulé 95), (Rigau 98)
19Words and WorksAcquisition of LK from MRDs
- This tutorial focuses on
- the massive acquisition of LK
- from MRDs (conventional, in any language)
- using (semi) automatic methodologies
- Why MRDs?
The conventional dictionaries for human use usually contain spelling, pronunciation, hyphenation, capitalization, usage notes for semantic domains, geographic regions, and propriety; etymological, syntactic and semantic information about the most basic units of the language (Amsler 81)
20Words and WorksMain Problems of MRDs
- Conventional dictionaries are not systematic
- Dictionaries are built for human use
- Implicit Knowledge
- words are described/translated in terms of words
21Words and WorksMRDs and Semantic Knowledge
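- jardín_1_1: Terreno donde se cultivan plantas y flores ornamentales. (land where ornamental plants and flowers are grown)
- florero_1_4: Maceta con flores. (flowerpot with flowers)
- ramo_1_3: Conjunto natural o artificial de flores, ramas o hierbas. (natural or artificial bunch of flowers, branches or herbs)
- pétalo_1_1: Hoja que forma la corola de la flor. (leaf that forms the corolla of the flower)
- tálamo_1_3: Receptáculo de la flor. (receptacle of the flower)
- miel_1_1: Substancia viscosa y muy dulce que elaboran las abejas, en una distensión del esófago, con el jugo de las flores y luego depositan en las celdillas de sus panales. (viscous, very sweet substance that bees produce, in a distension of the oesophagus, from the juice of flowers and then deposit in the cells of their honeycombs)
- florería_1_1: Floristería; tienda o puesto donde se venden flores. (florist's; shop or stall where flowers are sold)
- florista_1_1: Persona que tiene por oficio hacer o vender flores. (person whose trade is making or selling flowers)
- camelia_1_1: Arbusto cameliáceo de jardín, originario de Oriente, de hojas perennes y lustrosas, y flores grandes, blancas, rojas o rosadas (Camellia japonica). (garden shrub of the camellia family, native to the East, with glossy evergreen leaves and large white, red or pink flowers)
- camelia_1_2: Flor de este arbusto. (flower of this shrub)
- rosa_1_1: Flor del rosal. (flower of the rosebush)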
23Merge approachMain Methodology
24Merge approachMain Methodology
- Taxonomy construction (Rigau et al. 98, 97)
- monolingual MRDs
- Step 1: Selection of the main top beginners for a semantic primitive
- Step 2: Exploiting genus, construction of taxonomies for each semantic primitive
- Mapping taxonomies (Daudé et al. 99)
- bilingual MRDs
- Step 3: Creation of translation links
25Merge approach Taxonomy ConstructionMethodology
- Problems following a pure descriptive approach
- Circularity
- Errors and inconsistencies
- Definitions with omitted genus
- Top dictionary senses do not usually represent useful knowledge for the LKB
- Too general
- Too specific
27Merge approach Taxonomy Construction
Methodology
Mixed Methodology
Prescriptive approach: manual construction of the Top Structure
Descriptive approach: acquiring implicit information from MRDs
29Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
Word sense: zumo_1_1
Attached-to: c_art_subst type.
Definition: líquido que se extrae de las flores, hierbas, frutos, etc. (liquid extracted from flowers, herbs, fruits, etc.)
30Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- A) Attaching DGILE senses to semantic primitives
- 1) First labelling
- Conceptual Distance (Rigau 94)
- 2) Second labelling
- Salient Words (Yarowsky 92)
- B) Filtering Process
31Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- A.1) First labelling
- Conceptual Distance (Agirre et al. 94)
- length of the shortest path
- specificity of the concepts
- using WordNet
- Bilingual dictionary
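A minimal sketch of how a path-based conceptual distance could be computed over a toy hypernym hierarchy. The hierarchy fragment, the function names and the exact weighting (each node on the shortest path contributes 1/depth, so paths through more specific concepts cost less) are illustrative assumptions, not the implementation used in the tutorial.

```python
# Toy hypernym links (child -> parent); an illustrative fragment, not WordNet data.
PARENT = {
    "object": "entity",
    "artifact": "object",
    "structure": "artifact",
    "building": "structure",
    "church": "building",
    "monastery": "building",
    "food": "object",
}

def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def depth(concept):
    """Depth in the hierarchy; the root has depth 1."""
    return len(ancestors(concept))

def conceptual_distance(c1, c2):
    """Sum 1/depth over the nodes of the shortest path between c1 and c2."""
    up1, up2 = ancestors(c1), ancestors(c2)
    common = next(c for c in up1 if c in up2)  # lowest common ancestor
    path = up1[:up1.index(common) + 1] + up2[:up2.index(common)]
    return sum(1.0 / depth(c) for c in path)
```

With this weighting, two deep siblings such as "church" and "monastery" come out closer than "church" and "food", whose connecting path climbs to a very general concept.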
33Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
<entity>
<object, ...>
<artifact, artefact>
<structure, construction>
<house, lodging>
<building, edifice>
<place of worship, ...>
<religious residence, cloister>
<church, church building>
<convent>
<monastery>
<abbey> 06 ARTIFACT
<abbey>
<abbey>
abadía_1_2: Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
34Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- A.1) First labelling (Results)
- 29,205 labelled definitions (31%)
- 61% accuracy at a sense level
- 64% accuracy at a file level
35Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- A.2) Second labelling
- Salient Words (Yarowsky 92)
- Importance
- local frequency
- appears significantly more often in the corpus of a semantic category than at other points in the whole corpus
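A minimal sketch of the salient-words idea: a word is salient for a semantic category when it is relatively more frequent in that category's text than in the whole corpus. The log-ratio weight, the function names and the toy data are simplifying assumptions; Yarowsky's original formulation includes smoothing and context windows that are omitted here.

```python
import math
from collections import Counter

def salience(category_docs):
    """category_docs: dict mapping a category to a list of token lists.
    Returns, per category, log(P(word | category) / P(word)) for each word."""
    global_counts = Counter()
    cat_counts = {}
    for cat, docs in category_docs.items():
        counts = Counter(tok for doc in docs for tok in doc)
        cat_counts[cat] = counts
        global_counts.update(counts)
    total = sum(global_counts.values())
    weights = {}
    for cat, counts in cat_counts.items():
        cat_total = sum(counts.values())
        weights[cat] = {
            w: math.log((n / cat_total) / (global_counts[w] / total))
            for w, n in counts.items()
        }
    return weights

def label(definition, weights):
    """Pick the category whose salient words best cover the definition tokens."""
    scores = {
        cat: sum(max(w.get(tok, 0.0), 0.0) for tok in definition)
        for cat, w in weights.items()
    }
    return max(scores, key=scores.get)
```

In the setting of this step, the "corpus" of a category would be the set of definitions already attached to it by the first labelling.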
36Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- A.2) Second labelling (Results)
- 86,759 labelled definitions (93%)
- 80% accuracy at a file level
biberón_1_1 ARTIFACT 4.8399 Frasco de cristal ... (glass flask ...)
biberón_1_2 FOOD 7.4443 Leche que contiene este frasco ... (milk contained in that flask ...)
37Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- B) Filtering process (FOODs)
- removes all genus terms
- FILTER 1: not FOODs by the bilingual mapping
- FILTER 2: appear more often as genus in other Semantic Primitive
- FILTER 3: with a low frequency
38Merge approach Taxonomy Construction Step 1
Selection of the main top beginners
- B) Filtering process (FOOD Results)
39Merge approach Taxonomy Construction Step 2
Exploiting Genus
Word sense: vino_1_1
Hypernym: zumo_1_1.
Definition: zumo de uvas fermentado. (fermented juice of grapes).
Word sense: rueda_2_1
Hypernym: vino_1_1.
Definition: vino procedente de la región de Rueda (Valladolid). (wine from the region of Rueda).
40Merge approach Taxonomy Construction Step 2
Exploiting Genus
- Genus Sense Identification
- 97% accuracy for nouns
- Genus Sense Disambiguation
- Unrestricted WSD (coverage 100%)
- Knowledge-based WSD (not supervised)
- Eight Heuristics (McRoy 92)
- Combining several lexical resources
- Combining several methods
41Merge approach Taxonomy Construction Step 2
Exploiting Genus
Results
42Merge approach Taxonomy Construction Step 2
Exploiting Genus
Knowledge provided by each heuristic
43Merge approach Taxonomy Construction Step 2
Exploiting Genus
F2F3 > 9: 35,099 definitions
F2F3 > 4: 40,754 definitions
No filters: 111,624 definitions
44Merge approach Taxonomy Construction Step 2
Exploiting Genus
...
zumo_1_1 vino_1_1 quianti_1_1
zumo_1_1 vino_1_1 raya_1_8
zumo_1_1 vino_1_1 requena_1_1
zumo_1_1 vino_1_1 reserva_1_12
zumo_1_1 vino_1_1 ribeiro_1_1
zumo_1_1 vino_1_1 rioja_1_1
zumo_1_1 vino_1_1 roete_1_1
zumo_1_1 vino_1_1 rosado_1_3
zumo_1_1 vino_1_1 rueda_2_1
zumo_1_1 vino_1_1 sherry_1_1
zumo_1_1 vino_1_1 tarragona_1_1
zumo_1_1 vino_1_1 tintilla_1_1
zumo_1_1 vino_1_1 tintorro_1_1
zumo_1_1 vino_1_1 toro_3_1
...
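A minimal sketch of how taxonomy chains like the ones above could be produced once each sense has a disambiguated genus (hypernym) sense. The three links are taken from the examples in this tutorial; the function name and the cycle guard are illustrative assumptions.

```python
# Hypernym links from the tutorial's examples (sense -> genus sense).
HYPERNYM = {
    "vino_1_1": "zumo_1_1",
    "rueda_2_1": "vino_1_1",
    "rioja_1_1": "vino_1_1",
}

def chain(sense, hypernym=HYPERNYM):
    """Follow hypernym links to the top; stop on a cycle (circular definitions)."""
    path, seen = [sense], {sense}
    while path[-1] in hypernym:
        parent = hypernym[path[-1]]
        if parent in seen:       # circularity guard
            break
        path.append(parent)
        seen.add(parent)
    return list(reversed(path))  # top beginner first

print(" ".join(chain("rueda_2_1")))  # zumo_1_1 vino_1_1 rueda_2_1
```

The guard matters because circular definitions are one of the known problems of a purely descriptive approach.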
45Merge approach Mapping Taxonomies Step 3
Creation of translation links
[Figure: diagram with nodes labelled C1 to C6]
47Merge approach Mapping Taxonomies Step 3
Creation of translation links
- Connecting already existing Hierarchies
- Relaxation labelling Algorithm
- Constraints
- Between
- Spanish taxonomy automatically derived from an MRD (Rigau et al. 98)
- WordNet
- using a bilingual MRD
48Merge approach Mapping Taxonomies Step 3
Creation of translation links
49Merge approach Mapping Taxonomies Step 3
Relaxation Labelling algorithm
- Iterative algorithm for function optimisation based on local information
- it can deal with any kind of constraints
- variables (senses of the taxonomy)
- labels (synsets)
- Finds a weight assignment for each possible label for each variable
- weights for the labels of the same variable add up to one
- weight assignment satisfies (to the maximum possible extent) the set of constraints
50Merge approach Mapping Taxonomies Step 3
Relaxation Labelling algorithm
- 1) Start with a random weight assignment
- 2) Compute the support value for each label of each variable (according to the constraints)
- 3) Increase the weights of the labels more compatible with context and decrease those of the less compatible labels.
- 4) If a stopping/convergence criterion is satisfied, stop,
- otherwise go to step 2.
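The four steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the update rule (multiply each weight by 1 + support, then renormalise) is one common choice, the initial assignment is uniform rather than random to keep the example reproducible, and the synset names and the compatibility set are hypothetical.

```python
def relax(labels, support, iterations=50, eps=1e-6):
    """labels: dict variable -> list of candidate labels.
    support(var, label, weights) -> float in [-1, 1], derived from the constraints."""
    # Step 1: initial weight assignment (uniform here instead of random).
    weights = {v: {l: 1.0 / len(ls) for l in ls} for v, ls in labels.items()}
    for _ in range(iterations):
        new = {}
        for v, ls in labels.items():
            # Step 2: support of each label under the current weights.
            raw = {l: weights[v][l] * (1.0 + support(v, l, weights)) for l in ls}
            # Step 3: renormalise so the weights of one variable add up to one.
            z = sum(raw.values())
            new[v] = {l: (x / z if z else weights[v][l]) for l, x in raw.items()}
        change = max(abs(new[v][l] - weights[v][l]) for v in labels for l in labels[v])
        weights = new
        # Step 4: stop once the assignment no longer changes.
        if change < eps:
            break
    return weights

# Toy usage: two candidate synsets for one sense, one for its hypernym.
LABELS = {"vino_1_1": ["wine#1", "wine#2"], "zumo_1_1": ["juice#1"]}
COMPATIBLE = {("wine#1", "juice#1")}  # hypothetical hypernymy-preserving pair

def support(var, label, weights):
    """Weight of compatible labels on the other variables, capped at 1."""
    total = sum(
        weights[u][m]
        for u in LABELS if u != var
        for m in LABELS[u]
        if (label, m) in COMPATIBLE or (m, label) in COMPATIBLE
    )
    return min(total, 1.0)

result = relax(LABELS, support)
```

In this toy case the weight of "wine#1" grows at each iteration because it is the only candidate supported by the label of the hypernym sense.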
51Merge approach Mapping Taxonomies Step 3
Constraints
- Rely on the taxonomy structure
- Coded with three characters XYZ
- X: Spanish Taxonomy, I (immediate) or A (ancestor)
- Y: English Taxonomy, I (immediate) or A (ancestor)
- Z: Relation, E (hypernym), O (hyponym), B (both)
- Examples: IIE, AAB
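A small sketch of how such three-character codes could be expanded into readable form. It assumes the first character refers to the Spanish taxonomy, the second to the English taxonomy and the third to the relation; the function and key names are illustrative.

```python
SCOPE = {"I": "immediate", "A": "ancestor"}
RELATION = {"E": "hypernym", "O": "hyponym", "B": "both"}

def decode(code):
    """Expand a three-character constraint code such as 'IIE'."""
    x, y, z = code.upper()
    return {
        "spanish_taxonomy": SCOPE[x],
        "english_taxonomy": SCOPE[y],
        "relation": RELATION[z],
    }
```

Under this reading, IIE requires an immediate hypernym match in both taxonomies, while AAB accepts any ancestor in either taxonomy and checks both hypernyms and hyponyms.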
52Merge approach Mapping Taxonomies Step 3
Results
- Poly | TOK, FOK | TOK, FNOK | total
- animal | 279 (90%) | 30 (91%) | 209 (90%)
- food | 166 (94%) | 3 (100%) | 169 (94%)
- cognition | 198 (67%) | 27 (90%) | 225 (69%)
- communication | 533 (77%) | 40 (97%) | 573 (78%)
- all | TOK, FOK | TOK, FNOK | total
- animal | 424 (93%) | 62 (95%) | 486 (90%)
- food | 166 (94%) | 83 (100%) | 249 (96%)
- cognition | 200 (67%) | 245 (90%) | 445 (82%)
- communication | 536 (77%) | 234 (97%) | 760 (81%)
53Merge approach Mapping Taxonomies Step 3
Example
piel
(substance <skin, fur, peel>)
marta
(substance <sable, marte, coal_back>)
visón
(substance <mink, mink_coat>)
55Expand approach
- Take one WordNet as starting point
- Translate synsets
- English <car, automobile>
- Basque <auto, berebil>
- We obtain a structurally similar WordNet in another language, but some of the synsets will be missing
- Use bilingual dictionary
- maintien n.m. (attitude) bearing (conservation) maintenance
- 1. Keep bilingual senses (Agirre & Rigau 95)
- maintien1 (attitude) bearing; maintien2 (conservation) maintenance
- 2. Produce all translation pairs (Atserias et al. 97)
- maintien - bearing
- maintien - maintenance
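The two ways of reading a bilingual entry can be sketched as follows, using the "maintien" example. The dictionary layout of the entry and the function names are assumptions made for illustration.

```python
# One bilingual entry: each sense has a cue and a list of English translations.
# This data structure is an assumption, not the dictionary's actual format.
ENTRY = {
    "headword": "maintien",
    "senses": [
        {"cue": "attitude", "translations": ["bearing"]},
        {"cue": "conservation", "translations": ["maintenance"]},
    ],
}

def bilingual_senses(entry):
    """Option 1: keep one item per bilingual sense, together with its cue."""
    return [
        (f"{entry['headword']}{i}", s["cue"], s["translations"])
        for i, s in enumerate(entry["senses"], start=1)
    ]

def all_pairs(entry):
    """Option 2: flatten the entry into headword-translation pairs."""
    return [
        (entry["headword"], t)
        for s in entry["senses"]
        for t in s["translations"]
    ]
```

Option 1 keeps the cue available for later disambiguation, while option 2 discards the sense structure and yields a flat list of pairs.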
56Expand approach - produce all pairings
- Used to produce the first version of the nominal part of the Spanish WordNet
- Based on WN 1.5
- Both directions in bilingual dictionary merged
- Spanish/English: 19,443 translation pairs
- English/Spanish: 16,324 translation pairs
- Harmonized bilingual: 28,131 translation pairs
- Overlap with WordNet: 12,665 nouns (14%)
- Two methods
- class methods: consider only pairings
- conceptual distance methods: consider similarity of synsets
57Expand approach - produce all pairings
- Ten class methods
- Four monosemic criteria
- Four polysemic criteria
- Two hybrid criteria
- Three conceptual distance methods
- CD1: using pairwise word co-occurrences
- CD2: using headword and genus
- CD3: using bilingual Spanish entries with multiple translations
58Expand approach - produce all pairings
- Class methods
- Four possible configurations for pairs which either share an English word or a Spanish word (connected graph).
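A minimal sketch of how translation pairs could be grouped into connected graphs and sorted into four configurations. The labels 1:1, 1:N, N:1 and M:N (Spanish words : English words) are an assumed reading of the four configurations, and the word pairs are made up for illustration.

```python
from collections import defaultdict

def configurations(pairs):
    """Group (SW, EW) translation pairs into connected components and
    label each component 1:1, 1:N, N:1 or M:N (Spanish:English)."""
    sw_to_ew, ew_to_sw = defaultdict(set), defaultdict(set)
    for sw, ew in pairs:
        sw_to_ew[sw].add(ew)
        ew_to_sw[ew].add(sw)
    kinds = {(True, True): "1:1", (True, False): "1:N",
             (False, True): "N:1", (False, False): "M:N"}
    seen, result = set(), []
    for start in sw_to_ew:
        if start in seen:
            continue
        sws, ews, stack = set(), set(), [start]
        while stack:  # depth-first walk over the bipartite graph
            sw = stack.pop()
            if sw in sws:
                continue
            sws.add(sw)
            for ew in sw_to_ew[sw]:
                ews.add(ew)
                stack.extend(ew_to_sw[ew] - sws)
        seen |= sws
        kind = kinds[(len(sws) == 1, len(ews) == 1)]
        result.append((kind, sorted(sws), sorted(ews)))
    return result
```

Each component can then be passed to the monosemous or polysemous variant of the matching criterion, depending on how many WordNet senses its English words have.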
59Expand approach - produce all pairings
- 4 monosemous class methods
- All English words involved are monosemous in WN
60Expand approach - produce all pairings
- 4 polysemous class methods
- At least 1 English word involved is polysemous
61Expand approach - produce all pairings
- 2 other class methods
- Variant criterion (VC): two synonyms share a single SW
- Field criterion (FC): use field indicators in bilingual entry when available
VC: <..., EW, ..., EW, ...> SW
FC: <..., headword-EW, ..., Ind-EW, ...> SW
62Expand approach - produce all pairings
- Ten class methods (results)
63Expand approach - produce all pairings
- Conceptual Distance Methods (Agirre et al. 94)
- length of the shortest path
- specificity of the concepts
- Using WordNet
- Bilingual dictionary
64Expand approach - produce all pairings
- Three conceptual distance methods
- CD1: using pairwise word co-occurrences from monolingual dict.
- CD2: using headword and genus from monolingual def.
- CD3: using bilingual Spanish entries with multiple translations
65Expand approach - produce all pairings
<entity>
CD2
<object, ...>
<artifact, artefact>
<house, lodging>
<religious residence, cloister>
abadía_1_2: Iglesia o monasterio regido por un abad o abadesa (abbey, a church or a monastery ruled by an abbot or an abbess)
66Expand approach - produce all pairings
- Three conceptual distance methods
67Expand approach - produce all pairings
- Keep SW-synset pairs produced by methods with precision above 85%
- mono1
- mono2
- mono3
- mono4
- variant
- But if two different methods propose the same SW-synset pair, it could get better confidence
- try pairwise combinations of methods
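A minimal sketch of the pairwise combination idea: keep the SW-synset pairs proposed by both methods of a pair, then estimate the precision of each intersection against a manually checked sample. The function names, the synset identifiers and the data are hypothetical.

```python
from itertools import combinations

def pairwise_combinations(proposals):
    """proposals: dict mapping a method name to the set of (SW, synset)
    pairs it proposes. Returns the pairs shared by each pair of methods."""
    return {
        (a, b): proposals[a] & proposals[b]
        for a, b in combinations(sorted(proposals), 2)
    }

def precision(pairs, gold):
    """Fraction of proposed pairs that appear in a manually checked sample."""
    return len(pairs & gold) / len(pairs) if pairs else 0.0
```

An intersection whose estimated precision clears the threshold can be kept even when neither method clears it on its own.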
68Expand approach - produce all pairings
- Combinations of methods: higher precision in some cases
69Expand approach - produce all pairings
- Results
- SpWN v 0.1
- BasqueWN v 0.1
- 2 bilingual dictionaries
- apply first 8 class methods only
70Expand approach - bilingual senses
- Smaller experiment with French bilingual dictionary
- Based on WN 1.5
- Keep structure of bilingual dictionary: bilingual senses
- 21322 entries, 31502 subentries (senses)
- 16917 nominal subentries
- Disambiguation is possible if
- 1) one of the translation words is monosemous in WordNet.
- 2) the translation is given by a list of words.
- 3) a cue in French is provided alongside the translation.
- 4) a semantic field is provided.
- folie 1 n.f. madness
- provision 1 n.f. supply, store
- trésor 2 n.m. (ressources) (comm.) finances
71Expand approach - bilingual senses
- Possible disambiguation case by case
72Expand approach - bilingual senses
- Disambiguation: Conceptual Density (Agirre & Rigau 95)
- The relatedness of a certain word-sense to the words in the context (cue, other translations and/or semantic field) allows us to select that sense over the others
- Bilingual dictionary, English WordNet
73Expand approach - summary
- all pairings
- coverage and precision
- produce a good starting point for manual revision
- bilingual senses
- keeping bilingual sense might help precision
- very low coverage
75Interface for manual revision
76Interface for manual revision
77Interface for manual revision
- Client/Server architecture
- Data base
- EWN design implemented on SQL tables
- English, Spanish, Catalan and Basque
- Interface
- Perl CGIs that access the data bases
79Conclusions
- methods to automatically produce preliminary versions
- methods mainly for nouns
- need to manually revise
- merge approach
- method to produce native hierarchies and word senses
- trusts lexicographers' hierarchies
- need to map to ILI in independent process
- expand approach
- method to translate English WN's synsets
- trusts WN's hierarchies, sense distinctions
- mapping to ILI for free
80Conclusions
- merge approach
- manual work
- revising and re-organizing the automatic hierarchies (hard)
- revising automatic mapping (very hard)
- allows for integration of data from monolingual dictionary
- definition text itself
- lexico-semantic relations from definitions
- expand approach
- manual work
- revise proposed translations (fast)
- review the rest of the synsets (many)
- include glosses
81Conclusions
- Interface to speed up manual work
- Downloadable soon
- WN 1.5 in data-base format
- Interface
- WordNets can be checked at
- http://www.lsi.upc.es/nlp
- http://ixa.si.ehu.es/wei3.html
- These slides will (shortly) be available at
- http:// ...
- http://www.ji.si.ehu.es/users/eneko
82Bibliography