Bilingual term extraction revisited

1
Bilingual term extraction revisited

Špela Vintar.
University of Ljubljana
spela.vintar_at_ff.uni-lj.si

2
Extracting terms from the Acquis corpus

Using a bilingual subcorpus on Nuclear Energy
(EN-SL)
No linguistic preprocessing, only stop lists
Universal terms and collocations
Council regulation
European Union
Member State
Commission directive
Article
Having regard to
Danger of Acquis stoplists: European Atomic
Energy Community

3
Keyness

Measures of keyness
subcorpus vs. general language corpus (here
Acquis): relative corpus frequency
document vs. document collection: tf.idf
Applied to single or multi-word units.
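
The two keyness measures can be sketched as follows. This is a minimal illustration, not the code behind the slides: the function names are hypothetical, and the relative-frequency formula (1 minus the ratio of reference to subcorpus relative frequency, floored at 0) is an assumption, chosen only because it gives a score of 1 to words absent from the reference corpus. The tf.idf variant shown is the common tf × log(N/df) form.

```python
import math
from collections import Counter

def relative_frequency_keyness(sub_tokens, ref_tokens):
    """Score subcorpus words against a reference corpus.

    Assumed formula: 1 - p_ref(w) / p_sub(w), floored at 0, so that a word
    never seen in the reference corpus scores exactly 1.
    """
    sub, ref = Counter(sub_tokens), Counter(ref_tokens)
    n_sub, n_ref = len(sub_tokens), len(ref_tokens)
    scores = {}
    for word, count in sub.items():
        p_sub = count / n_sub
        p_ref = ref[word] / n_ref  # Counter returns 0 for unseen words
        scores[word] = max(0.0, 1.0 - p_ref / p_sub)
    return scores

def tf_idf(documents):
    """Return one {word: tf.idf} dict per document.

    tf is the relative frequency within the document,
    idf is log(number of documents / document frequency).
    Documents are non-empty lists of tokens.
    """
    n_docs = len(documents)
    df = Counter(word for doc in documents for word in set(doc))
    result = []
    for doc in documents:
        tf = Counter(doc)
        result.append({w: (c / len(doc)) * math.log(n_docs / df[w])
                       for w, c in tf.items()})
    return result

# Toy data, invented for illustration only.
keyness = relative_frequency_keyness(
    ["sievert", "dose", "dose", "the"],
    ["the", "the", "dose", "a"],
)
weights = tf_idf([["dose", "radiation"], ["dose", "council"]])
```

A word that occurs in every document gets tf.idf 0, which is why generic Acquis vocabulary drops out of a tf.idf ranking.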

4
Examples of unigrams extracted through rel. freq

Words with high rel.freq.
0,54 Radiological
0,49 concerned
0,21 Board
0,11 aid
0,10 Potential
0,08 reasonably
0,08 Reconstruction
0,08 give
0,08 extend
0,07 alia
0,04 CHAPTER
0,01 qualified
0,01 measurement
0,01 Nuclear
0,01 materials
0,01 steps
0,01 energy
0,01 declared

Words not found in the reference corpus
1 sievert
1 gray
1 Sv
1 wT
1 radon
1 becquerel
1 wTHT
1 DT
1 EDA
1 aboveground
1 APPRENTICES
1 Thermonuclear
1 wR
1 dN
1 ankles
1 mSv
1 after-effects
1 DOSE

5
Tf.idf

radiological 0,67
exposures 0,25
JRC 0,20
lens 0,19
radiation 0,17
apprentices 0,14
ionizing 0,14
serviceable 0,13
dose 0,13
nuclear 0,12
doses 0,12
workplaces 0,11
EXPOSURE 0,11
radioactive 0,10
joule 0,10
Resolutions 0,10
Governors 0,10
Dose 0,08
students 0,08

Cabinet 0,076
Nuclear 0,067
exposure 0,067
non-Member 0,059
gender 0,056
workers 0,052
Reactor 0,050
Euratom 0,049
proceeds 0,047
disregarded 0,043
Exchanges 0,042
Optimization 0,042
PRACTICES 0,042
dosimetric 0,042
exposed 0,037
population 0,036
contaminating 0,033

6
Tf.idf - Slovene

sevanju 0,19082
radiološkega 0,17864
dozimetrijo 0,17052
sivert 0,13804
radionuklidov 0,13804
sevanja 0,13195
Dana 0,12992
Černobil 0,12180
Izpostavljenost 0,12180
Jedrska 0,11368
dozo 0,09473
prebivalstva 0,09256
sevanjem 0,08932
ITER 0,08120
Oddelkom 0,07308
inovativnosti 0,07308
študente 0,07308
izpostavljenosti 0,07308
radioaktivne 0,06766

cepitve 0,05684
nivoji 0,05684
efektivno 0,05684
medicine 0,05278
fuzije 0,05075
zaposlitvijo 0,04872
termonuklearni 0,04872
študentov 0,04872
guvernerjev 0,04872
prioritete 0,04872
reaktorja 0,04872
jedrske 0,04872
delodajalca 0,04669
izpostavljenih 0,04601
ionizirajočemu 0,04466
ekvivalentno 0,04263
dosegljive 0,04060
ionizirajočega 0,04060
jedrskem 0,04060

7
Other indicators of termhood

Acronyms (NPP, SG, RBB ...)
Unknown words
not found in the reference corpus
unknown to the lemmatizer
Cognates
Named entities

radioactive radioaktivna 1.0
radioactive Radioaktivna 1.0
Radioactive Radioaktivna 1.0
radioactive radioaktivne 1.0
radioactive radioaktivnih 1.0
radioactive radioaktivnimi 1.0
radioactive radioaktivno 1.0
radioactive radioaktivnosti 1.0
radioactive radiokativnega 1.0
radiography radiografijo 1.0
radionuclide radionuklid 1.0
radionuclide radionuklida 1.0
radionuclide radionuklidov 1.0
radionuclides radionuklide 1.0
radionuclides radionuklidov 1.0
ratify ratificirajo 1.0
Reactor reaktorja 1.0
reactor reaktorjev 1.0
reactors reaktorji 1.0
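
The slides do not say how cognate pairs were detected or scored, so the following is only one plausible sketch. It assumes that both words are lowercased and stripped of diacritics, that the longer word is truncated to the length of the shorter one (to ignore Slovene inflectional endings), and that the two strings are compared with `difflib.SequenceMatcher`. The threshold and minimum length are arbitrary illustration values, and the scores will not match the 1.0 values listed above.

```python
import unicodedata
from difflib import SequenceMatcher

def normalize(word):
    """Lowercase and strip diacritics, e.g. 'Černobil' -> 'cernobil'."""
    decomposed = unicodedata.normalize("NFKD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def cognate_score(source, target):
    """Similarity of the two words over the length of the shorter one."""
    a, b = normalize(source), normalize(target)
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return SequenceMatcher(None, a[:n], b[:n]).ratio()

def is_cognate(source, target, threshold=0.7, min_length=4):
    """Hypothetical decision rule: long enough and similar enough."""
    if min(len(source), len(target)) < min_length:
        return False
    return cognate_score(source, target) >= threshold
```

Truncating to the shorter word is a crude stand-in for lemmatization; it lets `reactor` match `reaktorja` despite the case ending.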

8
Identifying multi-word units

Collocation extraction techniques
Mutual Information (Church & Hanks 1990)
Log-likelihood ratio (Dunning 1993)
Entropy-based (Shimohata et al. 1997)
Semantic non-compositionality (Pearce 2001)
Daille (1994): LL is the most appropriate measure
for n > 3, n-gram frequency (+ stopword
filtering) also works
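
As an illustration of the log-likelihood ratio for two-word units, here is a minimal sketch of the standard G² statistic over a 2×2 contingency table, applied to adjacent word pairs. The function names and the toy sentence are invented for the example; no stopword filtering or frequency cutoff is applied.

```python
import math
from collections import Counter

def log_likelihood(k11, k12, k21, k22):
    """G^2 = 2 * sum(O * ln(O / E)) over the four cells of a 2x2 table.

    k11: w1 followed by w2      k12: w1 followed by another word
    k21: another word then w2   k22: neither
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

def rank_bigrams(tokens):
    """Return ((w1, w2), score) pairs sorted by descending log-likelihood."""
    pairs = list(zip(tokens, tokens[1:]))
    pair_counts = Counter(pairs)
    first = Counter(w1 for w1, _ in pairs)    # w1 in first position
    second = Counter(w2 for _, w2 in pairs)   # w2 in second position
    total = len(pairs)
    scored = []
    for (w1, w2), k11 in pair_counts.items():
        k12 = first[w1] - k11
        k21 = second[w2] - k11
        k22 = total - k11 - k12 - k21
        scored.append(((w1, w2), log_likelihood(k11, k12, k21, k22)))
    return sorted(scored, key=lambda item: item[1], reverse=True)

ranked = rank_bigrams(
    "ionizing radiation protects ionizing radiation levels of ionizing radiation".split()
)
```

The statistic is two-sided: a pair that co-occurs much less often than chance would also score high, so in practice the ranking is usually combined with a minimum frequency.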

9
N-gram term weighting

statistically extracted n-grams are not
necessarily terms → need for filtering /
weighting
Stopword filtering
Weighting with tf.idf, ll-rank/core frequency
$w(t_{w_1 w_2 w_3}) = \frac{\mathrm{tf.idf}_{w_1} + \mathrm{tf.idf}_{w_2} + \mathrm{tf.idf}_{w_3}}{n} + \frac{1}{\mathrm{rank}}$
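
The operators in the weighting formula were lost in the transcript, so the following sketch shows only one reading of it: the average unit score (tf.idf or another keyness value) of the component words, plus the reciprocal of the n-gram's rank in the log-likelihood list. The function name, the sample scores, and the treatment of unknown words as 0 are assumptions.

```python
def ngram_weight(words, unit_scores, ll_rank):
    """Assumed reading: mean unit score of the words + 1 / log-likelihood rank.

    words:       the component words of the n-gram
    unit_scores: {word: tf.idf or other keyness score}; missing words count as 0
    ll_rank:     1-based rank of the n-gram in the log-likelihood list
    """
    n = len(words)
    mean_unit = sum(unit_scores.get(w, 0.0) for w in words) / n
    return mean_unit + 1.0 / ll_rank

# Invented scores, for illustration only.
sample_scores = {"nuclear": 0.12, "station": 0.06}
weight = ngram_weight(["nuclear", "power", "station"], sample_scores, ll_rank=4)
```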

10
2-grams, weighted with rel.freq.

Thermonuclear Experimental 1.91766291545192
International Thermonuclear 1.90047962704222
wR values 1.74111305022281
cosmic radiation 1.68720469442766
non-Member States 1.67427461796584
Atomic Energy 0.996377043841846
European Atomic 0.995366262170687
Energy Community 0.995029334946967
Member States 0.994692407723247
Member State 0.994355480499528
exposed workers 0.990312353814892
radiation protection 0.988290790472574
ionizing radiation 0.985847228548466
nuclear power 0.975824483194946
Nuclear Safety 0.97077057483915

11
3-grams

Thermonuclear Experimental Reactor 2.83532583090384
International Thermonuclear Experimental 2.81814254249414
mSv per year 2.73507410483208
APPRENTICES AND STUDENTS 2.69461709334949
exceed 1 mSv 2.46078960008804
feet and ankles 2.2734580636999
European Atomic Energy 1.99321785686789
Atomic Energy Community 1.99288092964417
DECIDED AS FOLLOWS 1.95055494141597
nuclear power stations 1.94785693049428
Nuclear Safety Account 1.94301570053366
controlled nuclear fusion 1.88877041751479
Energy Community represented 1.87461947411856
natural radiation sources 1.87453104455193
nuclear power station 1.87309490609042
apprentices and students 1.86800777160465
Chernobyl nuclear power 1.86257180721416
establishing the European 1.85670559767151

12
Treatment of nested terms

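C-value (Frantzi & Ananiadou 1996)
$\text{C-value}(a) = (\mathrm{length}(a) - 1)\left(\mathrm{freq}(a) - \frac{t(a)}{c(a)}\right)$
where t(a) is the frequency of a inside longer candidate terms and c(a) is the number of those longer candidates
n-gram | C-value
Chernobyl nuclear | 7,3
nuclear power plant | 15,2
Chernobyl nuclear power plant | 20,4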
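
A minimal sketch of the C-value computation, assuming candidates are given as word tuples with corpus frequencies. Here t(a) is taken as the summed frequency of all longer candidates that contain a, and c(a) as the number of such candidates; a candidate that is not nested in any longer one simply gets (length − 1) × freq. The frequencies below are invented and are not the ones behind the slide's table.

```python
def contains(longer, shorter):
    """True if `shorter` occurs as a contiguous sub-sequence of a strictly longer tuple."""
    k = len(shorter)
    if len(longer) <= k:
        return False
    return any(longer[i:i + k] == shorter for i in range(len(longer) - k + 1))

def c_values(freq):
    """freq maps each candidate term (tuple of words) to its corpus frequency."""
    scores = {}
    for a, f_a in freq.items():
        nesting = [b for b in freq if contains(b, a)]
        if nesting:
            t_a = sum(freq[b] for b in nesting)  # frequency of a inside longer candidates
            c_a = len(nesting)                   # number of longer candidates
            scores[a] = (len(a) - 1) * (f_a - t_a / c_a)
        else:
            scores[a] = (len(a) - 1) * f_a
    return scores

# Hypothetical frequencies, for illustration only.
example = c_values({
    ("power", "plant"): 12,
    ("nuclear", "power", "plant"): 10,
    ("Chernobyl", "nuclear", "power", "plant"): 4,
})
```

With these numbers, `power plant` is penalized for appearing mostly inside longer candidates, while the four-word term keeps its full length-weighted frequency.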

13
Bilingual lexicon extraction

using Twente (Hiemstra 1998)
based on the Iterative Proportional Fitting
Procedure (IPFP), word-to-word translation model
outputs translation candidates with scores for each
word in the corpus, in both directions
using stopword-filtered corpora to improve
results
bilingual lexicon expanded with cognates
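
For readers unfamiliar with IPFP, here is a generic sketch of the procedure on a small table: rows and columns are rescaled in turn until the table matches the target marginals. This is not the Twente implementation and makes no claim about how Hiemstra's model builds or interprets its table; the function name and the numbers are purely illustrative.

```python
def iterative_proportional_fitting(table, row_targets, col_targets, iterations=50):
    """Rescale a non-negative table so its row and column sums approach the targets.

    The targets must have the same grand total for the procedure to converge.
    """
    cells = [list(row) for row in table]  # work on a copy
    for _ in range(iterations):
        # Step 1: scale each row to its target sum.
        for i, row in enumerate(cells):
            row_sum = sum(row)
            if row_sum > 0:
                cells[i] = [v * row_targets[i] / row_sum for v in row]
        # Step 2: scale each column to its target sum.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in cells)
            if col_sum > 0:
                for row in cells:
                    row[j] *= target / col_sum
    return cells

fitted = iterative_proportional_fitting([[1, 2], [3, 4]], [1, 1], [1, 1])
```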

14
Output of Twente lexicon extraction
16
Term alignment

for each source term candidate we collect all
single-word equivalents from the bilingual
lexicon, e.g. jedrska elektrarna Černobil
power 0.50, plant 0.50
Chernobyl 1.00
nuclear 1.00
Nuclear power plant 2.00
Power plant 1.00
Chernobyl nuclear power plant 3.00
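
The scores in this example are consistent with simply summing the lexicon scores of the words in each target candidate. The sketch below reproduces that reading. The assignment of English equivalents to individual Slovene words (jedrska → nuclear, elektrarna → power/plant, Černobil → Chernobyl) is inferred rather than stated on the slide, and the function and variable names are hypothetical.

```python
def alignment_score(source_term, target_term, lexicon):
    """Sum the lexicon scores of target words that translate any source word."""
    equivalents = {}
    for word in source_term.split():
        for target_word, score in lexicon.get(word, {}).items():
            key = target_word.lower()
            # Keep the best score if several source words propose the same equivalent.
            equivalents[key] = max(score, equivalents.get(key, 0.0))
    return sum(equivalents.get(w.lower(), 0.0) for w in target_term.split())

# Single-word equivalents as they might come out of the bilingual lexicon.
lexicon = {
    "jedrska": {"nuclear": 1.0},
    "elektrarna": {"power": 0.5, "plant": 0.5},
    "Černobil": {"Chernobyl": 1.0},
}
source = "jedrska elektrarna Černobil"
candidates = ["Nuclear power plant", "Power plant", "Chernobyl nuclear power plant"]
best = max(candidates, key=lambda c: alignment_score(source, c, lexicon))
```

Picking the highest-scoring candidate favors the longest English term that is fully covered by the lexicon, here the four-word one.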
17
Term candidates

Slovene | English | Equivalence
doznih mej | dose limits | 1.49
nadzorovane jedrske fuzije | controlled nuclear fusion | 1.89
varstvo pred sevanjem | radiation protection | 2.00
mednarodnega termonuklearnega poskusnega | International thermonuclear experimental | 2.49
poskusnega reaktorja | experimental reactor | 1.49
študenti in pripravniki | Students and apprentices | 1.50
izpostavljenost ionizirajočemu sevanju | emitting ionizing radiation | 1.99
zdravstvenimi službami | approved medical practitioners | 0.75
izpostavljenih delavcev | exposed workers | 1.78
države članice | Member states require | 1.49
18
Outcome

Corpus: 17.000 tokens
Extracted: 193 Slovene and 199 English term
candidates
Bilingual (aligned): 112
What we miss
Term variation:
disposal of waste / emplacement of waste
safety levels / levels of safety
Hapax:
radiation weighting factor, tissue weighting
factor

19
Purpose of term extraction

Extraction vs. annotation

Application | Term form | Level of specificity | Accuracy
Keyword assignment, document classification | Short, NPs preferred, not too specialised, resolution of variants | low | medium
Translation, terminography | All, variation wanted, collocations wanted | high | medium
Information extraction | Short-medium length, NPs preferred, resolution of nesting | high | high
Semantic processing (IR, TM) | Short-medium length, ambiguity resolution | high | high
20
Problems

Distinguishing between generic and text-specific
terms (same form, same frequency!)
Capturing low frequency terms in inflected
languages
We want to capture domain-specific terms. But
most texts are multi-domain!
