Title: Bilingual term extraction revisited
1Bilingual term extraction revisited
- Špela Vintar.
- University of Ljubljana
- spela.vintar_at_ff.uni-lj.si
2Extracting terms from the Acquis corpus
- Using a bilingual subcorpus on Nuclear Energy (EN-SL)
- No linguistic preprocessing, only stop lists
- Universal terms and collocations
- Council regulation
- European Union
- Member State
- Commission directive
- Article
- Having regard to
- Danger of Acquis stoplists: European Atomic Energy Community
3keyness
- Measures of keyness
- subcorpus vs. general language corpus (here Acquis): relative corpus frequency
- document vs. document collection: tf.idf
- Applied to single or multi-word units.
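A minimal sketch of the two keyness measures, for illustration only. The slides do not give the exact relative-frequency formula; the version below (`p_sub / (p_sub + p_ref)`) is an assumption, chosen because it yields 1 for words absent from the reference corpus. The tf.idf variant is the textbook one (relative term frequency times `log(N / df)`), which may differ from the one used for the scores shown on the slides. Function names are hypothetical.

```python
import math
from collections import Counter


def relative_keyness(sub_tokens, ref_tokens):
    """Score each subcorpus word by p_sub / (p_sub + p_ref) (assumed formula)."""
    sub, ref = Counter(sub_tokens), Counter(ref_tokens)
    n_sub, n_ref = len(sub_tokens), len(ref_tokens)
    scores = {}
    for word, count in sub.items():
        p_sub = count / n_sub
        p_ref = ref[word] / n_ref if n_ref else 0.0
        # A word absent from the reference corpus gets the maximum score 1.0.
        scores[word] = p_sub / (p_sub + p_ref)
    return scores


def tf_idf(doc_tokens, collection):
    """Textbook tf.idf: relative term frequency times log(N / df)."""
    doc_sets = [set(d) for d in collection]
    tf = Counter(doc_tokens)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for d in doc_sets if word in d)
        idf = math.log(len(doc_sets) / df) if df else 0.0
        scores[word] = (count / len(doc_tokens)) * idf
    return scores
```

Both functions take plain token lists; `collection` is a list of token lists, one per document, and should include the document being scored.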
4Examples of unigrams extracted through rel. freq
- Words with high rel.freq.
- 0,54 Radiological
- 0,49 concerned
- 0,21 Board
- 0,11 aid
- 0,10 Potential
- 0,08 reasonably
- 0,08 Reconstruction
- 0,08 give
- 0,08 extend
- 0,07 alia
- 0,04 CHAPTER
- 0,01 qualified
- 0,01 measurement
- 0,01 Nuclear
- 0,01 materials
- 0,01 steps
- 0,01 energy
- 0,01 declared
- Words not found in the reference corpus
- 1 sievert
- 1 gray
- 1 Sv
- 1 wT
- 1 radon
- 1 becquerel
- 1 wTHT
- 1 DT
- 1 EDA
- 1 aboveground
- 1 APPRENTICES
- 1 Thermonuclear
- 1 wR
- 1 dN
- 1 ankles
- 1 mSv
- 1 after-effects
- 1 DOSE
5Tf.idf
- radiological 0,67
- exposures 0,25
- JRC 0,20
- lens 0,19
- radiation 0,17
- apprentices 0,14
- ionizing 0,14
- serviceable 0,13
- dose 0,13
- nuclear 0,12
- doses 0,12
- workplaces 0,11
- EXPOSURE 0,11
- radioactive 0,10
- joule 0,10
- Resolutions 0,10
- Governors 0,10
- Dose 0,08
- students 0,08
- Cabinet 0,076
- Nuclear 0,067
- exposure 0,067
- non-Member 0,059
- gender 0,056
- workers 0,052
- Reactor 0,050
- Euratom 0,049
- proceeds 0,047
- disregarded 0,043
- Exchanges 0,042
- Optimization 0,042
- PRACTICES 0,042
- dosimetric 0,042
- exposed 0,037
- population 0,036
- contaminating 0,033
6Tf.idf - Slovene
- sevanju 0,19082
- radiološkega 0,17864
- dozimetrijo 0,17052
- sivert 0,13804
- radionuklidov 0,13804
- sevanja 0,13195
- Dana 0,12992
- Černobil 0,12180
- Izpostavljenost 0,12180
- Jedrska 0,11368
- dozo 0,09473
- prebivalstva 0,09256
- sevanjem 0,08932
- ITER 0,08120
- Oddelkom 0,07308
- inovativnosti 0,07308
- študente 0,07308
- izpostavljenosti 0,07308
- radioaktivne 0,06766
- cepitve 0,05684
- nivoji 0,05684
- efektivno 0,05684
- medicine 0,05278
- fuzije 0,05075
- zaposlitvijo 0,04872
- termonuklearni 0,04872
- študentov 0,04872
- guvernerjev 0,04872
- prioritete 0,04872
- reaktorja 0,04872
- jedrske 0,04872
- delodajalca 0,04669
- izpostavljenih 0,04601
- ionizirajočemu 0,04466
- ekvivalentno 0,04263
- dosegljive 0,04060
- ionizirajočega 0,04060
- jedrskem 0,04060
7Other indicators of termhood
- Acronyms (NPP, SG, RBB ...)
- Unknown words
- not found in the reference corpus
- unknown to the lemmatizer
- Named entities
- Cognates
- radioactive radioaktivna 1.0
- radioactive Radioaktivna 1.0
- Radioactive Radioaktivna 1.0
- radioactive radioaktivne 1.0
- radioactive radioaktivnih 1.0
- radioactive radioaktivnimi 1.0
- radioactive radioaktivno 1.0
- radioactive radioaktivnosti 1.0
- radioactive radiokativnega 1.0
- radiography radiografijo 1.0
- radionuclide radionuklid 1.0
- radionuclide radionuklida 1.0
- radionuclide radionuklidov 1.0
- radionuclides radionuklide 1.0
- radionuclides radionuklidov 1.0
- ratify ratificirajo 1.0
- Reactor reaktorja 1.0
- reactor reaktorjev 1.0
- reactors reaktorji 1.0
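The slides do not say how the cognate scores were computed. As a rough illustration only, the sketch below scores an English-Slovene pair by the length of their shared prefix after a crude spelling normalisation. The normalisation rules, the `min_prefix` threshold and the function names are all assumptions, and the scores are not expected to reproduce the values listed above.

```python
def normalise(word):
    """Lowercase and apply a few crude EN/SL spelling correspondences (assumed)."""
    w = word.lower()
    for src, dst in (("ph", "f"), ("c", "k"), ("y", "i")):
        w = w.replace(src, dst)
    return w


def cognate_score(en_word, sl_word, min_prefix=5):
    """Shared-prefix length divided by the length of the shorter normalised word."""
    a, b = normalise(en_word), normalise(sl_word)
    prefix = 0
    for x, y in zip(a, b):
        if x != y:
            break
        prefix += 1
    if prefix < min_prefix:
        return 0.0  # too little overlap to count as a cognate candidate
    return prefix / min(len(a), len(b))
```

A prefix-based score tolerates the rich Slovene inflection (reaktorja, reaktorjev, reaktorji), since the variation sits at the end of the word.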
8Identifying multi-word units
- Collocation extraction techniques
- Mutual Information (Church & Hanks 1990)
- Log-likelihood ratio (Dunning 1993)
- Entropy-based (Shimohata et al. 1997)
- Semantic non-compositionality (Pearce 2001)
- Daille (1994): LL is the most appropriate measure
- for n > 3, n-gram frequency (with stopword filtering) also works
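A short sketch of the log-likelihood ratio for bigrams, computed from a 2x2 contingency table as G² = 2 Σ O ln(O/E). The function names and the token-list input are assumptions for illustration; this is not the code used to produce the results in these slides.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """G2 statistic for a 2x2 contingency table of observed counts."""
    def h(*counts):
        total = sum(counts)
        return sum(k * math.log(k / total) for k in counts if k > 0)

    # G2 = 2 * (h(cells) - h(row sums) - h(column sums))
    return 2 * (h(k11, k12, k21, k22)
                - h(k11 + k12, k21 + k22)
                - h(k11 + k21, k12 + k22))


def bigram_llr(tokens):
    """Score every adjacent word pair in a token list."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    left = Counter(tokens[:-1])    # counts of words in first position
    right = Counter(tokens[1:])    # counts of words in second position
    n = len(tokens) - 1            # total number of bigrams
    scores = {}
    for (w1, w2), k11 in bigrams.items():
        k12 = left[w1] - k11       # w1 followed by something else
        k21 = right[w2] - k11      # w2 preceded by something else
        k22 = n - k11 - k12 - k21  # neither
        scores[(w1, w2)] = log_likelihood(k11, k12, k21, k22)
    return scores
```

Sorting the resulting dictionary by score gives the ranking of collocation candidates.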
9N-gram term weighting
- statistically extracted n-grams are not necessarily terms → need for filtering / weighting
- Stopword filtering
- Weighting with tf.idf, ll-rank/core frequency
w(t = w1 w2 w3) = (tf.idf(w1) + tf.idf(w2) + tf.idf(w3)) / n + 1/rank
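One possible reading of the weighting on this slide is the mean of the unigram tf.idf scores plus the reciprocal of the n-gram's rank in the log-likelihood list. The sketch below implements that reading; the grouping of terms is an assumption, and the function name and the default score of 0 for unseen words are hypothetical.

```python
def ngram_weight(ngram, unigram_scores, rank):
    """Mean unigram score of the n-gram's words plus 1/rank (assumed reading)."""
    n = len(ngram)
    mean_score = sum(unigram_scores.get(w, 0.0) for w in ngram) / n
    return mean_score + 1.0 / rank


# Example with two tf.idf values taken from the English tf.idf slide
# and a hypothetical log-likelihood rank of 2.
tfidf = {"ionizing": 0.14, "radiation": 0.17}
weight = ngram_weight(("ionizing", "radiation"), tfidf, rank=2)
```

The 1/rank term rewards n-grams that the collocation measure ranks highly, while the mean keeps the unigram contribution independent of n-gram length.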
10 2-grams, weighted with rel.freq.
- Thermonuclear Experimental 1.91766291545192
- International Thermonuclear 1.90047962704222
- wR values 1.74111305022281
- cosmic radiation 1.68720469442766
- non-Member States 1.67427461796584
- Atomic Energy 0.996377043841846
- European Atomic 0.995366262170687
- Energy Community 0.995029334946967
- Member States 0.994692407723247
- Member State 0.994355480499528
- exposed workers 0.990312353814892
- radiation protection 0.988290790472574
- ionizing radiation 0.985847228548466
- nuclear power 0.975824483194946
- Nuclear Safety 0.97077057483915
11 3-grams
- Thermonuclear Experimental Reactor 2.83532583090384
- International Thermonuclear Experimental 2.81814254249414
- mSv per year 2.73507410483208
- APPRENTICES AND STUDENTS 2.69461709334949
- exceed 1 mSv 2.46078960008804
- feet and ankles 2.2734580636999
- European Atomic Energy 1.99321785686789
- Atomic Energy Community 1.99288092964417
- DECIDED AS FOLLOWS 1.95055494141597
- nuclear power stations 1.94785693049428
- Nuclear Safety Account 1.94301570053366
- controlled nuclear fusion 1.88877041751479
- Energy Community represented 1.87461947411856
- natural radiation sources 1.87453104455193
- nuclear power station 1.87309490609042
- apprentices and students 1.86800777160465
- Chernobyl nuclear power 1.86257180721416
- establishing the European 1.85670559767151
12Treatment of nested terms
- C-value (Frantzi & Ananiadou 1996)
- C-value(a) = (length(a) - 1)(freq(a) - t(a)/c(a))
- n-gram C-value
- Chernobyl nuclear 7,3
- nuclear power plant 15,2
- Chernobyl nuclear power plant 20,4
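A simplified sketch of the C-value computation, where t(a) is taken as the summed frequency of the longer candidates that contain a, and c(a) as the number of such candidates. The frequencies in the usage example are hypothetical and do not correspond to the scores on the slide; the function names are illustrative.

```python
def contains(longer, shorter):
    """True if the word tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter
               for i in range(len(longer) - n + 1))


def c_values(freqs):
    """freqs maps candidate terms (tuples of words) to corpus frequencies."""
    scores = {}
    for a, freq_a in freqs.items():
        # Frequencies of all longer candidates in which `a` is nested.
        nests = [f for b, f in freqs.items()
                 if len(b) > len(a) and contains(b, a)]
        penalty = sum(nests) / len(nests) if nests else 0.0  # t(a) / c(a)
        scores[a] = (len(a) - 1) * (freq_a - penalty)
    return scores


# Hypothetical frequencies, for illustration only.
freqs = {
    ("nuclear", "power"): 12,
    ("nuclear", "power", "plant"): 10,
    ("Chernobyl", "nuclear", "power", "plant"): 4,
}
scores = c_values(freqs)
```

Longer candidates are favoured through the length factor, while a nested candidate is penalised in proportion to how often it occurs only as part of longer terms.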
13Bilingual lexicon extraction
- using Twente (Hiemstra 1998)
- based on the Iterative Proportional Fitting Procedure (IPFP), word-to-word translation model
- outputs translation candidates and scores for each word in the corpus, both ways
- using stopword-filtered corpora to improve results
- bilingual lexicon expanded with cognates
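For orientation, a generic sketch of the Iterative Proportional Fitting Procedure on a small two-way table: rows and columns are rescaled in turn until the table matches the given marginal totals. This is not Twente's implementation, and the slides do not describe how the translation model builds on it; the function name and signature are assumptions.

```python
def ipfp(table, row_targets, col_targets, iterations=100):
    """Rescale a non-negative table until its margins match the targets."""
    t = [row[:] for row in table]  # work on a copy
    for _ in range(iterations):
        # Row step: scale each row to its target total.
        for i, row in enumerate(t):
            s = sum(row)
            if s > 0:
                t[i] = [v * row_targets[i] / s for v in row]
        # Column step: scale each column to its target total.
        for j, target in enumerate(col_targets):
            s = sum(row[j] for row in t)
            if s > 0:
                for row in t:
                    row[j] *= target / s
    return t


fitted = ipfp([[1.0, 2.0], [3.0, 4.0]], row_targets=[4.0, 6.0], col_targets=[5.0, 5.0])
```

The row and column targets must sum to the same total for the procedure to converge to a table that satisfies both.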
14Output of Twente lexicon extraction
16Term alignment
- for each source term candidate we collect all single-word equivalents from the bilingual lexicon
- jedrska elektrarna Černobil
- jedrska: nuclear 1.00
- elektrarna: power 0.50, plant 0.50
- Černobil: Chernobyl 1.00
- Scores of target term candidates:
- Nuclear power plant 2.00
- Power plant 1.00
- Chernobyl nuclear power plant 3.00
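The scores on this slide are consistent with summing, for each word of a target term candidate, the lexicon score of that word as an equivalent of some source word. The sketch below follows that reading; the summation rule is inferred from the example, and the data structure and function name are assumptions.

```python
def alignment_score(source_term, target_term, lexicon):
    """Sum the lexicon scores of target words that translate a source word."""
    equivalents = {}
    for src_word in source_term.split():
        for tgt_word, score in lexicon.get(src_word, {}).items():
            key = tgt_word.lower()
            # Keep the best score if several source words share an equivalent.
            equivalents[key] = max(score, equivalents.get(key, 0.0))
    return sum(equivalents.get(w.lower(), 0.0) for w in target_term.split())


# Lexicon entries as shown on the slide.
lexicon = {
    "jedrska": {"nuclear": 1.00},
    "elektrarna": {"power": 0.50, "plant": 0.50},
    "Černobil": {"Chernobyl": 1.00},
}
source = "jedrska elektrarna Černobil"
best = max(
    ["Nuclear power plant", "Power plant", "Chernobyl nuclear power plant"],
    key=lambda cand: alignment_score(source, cand, lexicon),
)
```

Under this rule the candidate covering the most source words wins, which matches the ranking on the slide.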
17Term candidates
| Slovene | English | Equivalence |
|---|---|---|
| doznih mej | dose limits | 1.49 |
| nadzorovane jedrske fuzije | controlled nuclear fusion | 1.89 |
| varstvo pred sevanjem | radiation protection | 2.00 |
| mednarodnega termonuklearnega poskusnega | International thermonuclear experimental | 2.49 |
| poskusnega reaktorja | experimental reactor | 1.49 |
| študenti in pripravniki | Students and apprentices | 1.50 |
| izpostavljenost ionizirajočemu sevanju | emitting ionizing radiation | 1.99 |
| zdravstvenimi službami | approved medical practitioners | 0.75 |
| izpostavljenih delavcev | exposed workers | 1.78 |
| države članice | Member states require | 1.49 |
18Outcome
- Corpus 17.000 tokens
- Extracted: 193 Slovene and 199 English term candidates
- Bilingual (aligned): 112
- What we miss
- Term variation
- disposal of waste / emplacement of waste
- safety levels / levels of safety
- Hapax
- radiation weighting factor, tissue weighting factor
19Purpose of term extraction
- Extraction vs. annotation
| Application | Term form | Level of specificity | Accuracy |
|---|---|---|---|
| Keyword assignment, document classification | Short, NPs preferred, not too specialised, resolution of variants | low | medium |
| Translation, terminography | All, variation wanted, collocations wanted | high | medium |
| Information extraction | Short-medium length, NPs preferred, resolution of nesting | high | high |
| Semantic processing (IR, TM) | Short-medium length, ambiguity resolution | high | high |
20Problems
- Distinguishing between generic and text-specific terms (same form, same frequency!)
- Capturing low frequency terms in inflected languages
- We want to capture domain-specific terms. But most texts are multi-domain!