Bilingual term extraction revisited

Bilingual term extraction revisited. Špela Vintar, University of Ljubljana, spela.vintar_at_ff.uni-lj.si

Transcript and Presenter's Notes

1
Bilingual term extraction revisited
  • Špela Vintar.
  • University of Ljubljana
  • spela.vintar_at_ff.uni-lj.si

2
Extracting terms from the Acquis corpus
  • Using a bilingual subcorpus on Nuclear Energy
    (EN-SL)
  • No linguistic preprocessing, only stop lists
  • Universal terms and collocations
  • Council regulation
  • European Union
  • Member State
  • Commission directive
  • Article
  • Having regard to
  • Danger of Acquis stoplists: European Atomic
    Energy Community

3
Keyness
  • Measures of keyness
  • subcorpus vs. general language corpus (here
    Acquis): relative corpus frequency
  • document vs. document collection: tf.idf
  • Applied to single or multi-word units.
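
A minimal sketch of a relative-frequency keyness score for single words. The slides do not give the exact formula, so the ratio used here (subcorpus share divided by the sum of subcorpus and reference shares) is an assumption. It was chosen only because it yields 1 for words absent from the reference corpus, as in the examples on the next slide. The function name and the toy token lists are hypothetical.

```python
from collections import Counter

def relative_frequency_keyness(sub_tokens, ref_tokens):
    """Score each subcorpus word in (0, 1]; 1.0 means the word is absent from the reference corpus.

    Assumed formula (not taken from the slides): p_sub / (p_sub + p_ref).
    """
    sub_counts = Counter(sub_tokens)
    ref_counts = Counter(ref_tokens)
    sub_size = len(sub_tokens)
    ref_size = len(ref_tokens)
    scores = {}
    for word, count in sub_counts.items():
        p_sub = count / sub_size
        p_ref = ref_counts[word] / ref_size  # Counter returns 0 for unseen words
        scores[word] = p_sub / (p_sub + p_ref)
    return scores

# Toy data, invented for illustration only.
sub = "ionizing radiation dose the radiation dose limits the".split()
ref = "the council the commission the dose the article".split()
scores = relative_frequency_keyness(sub, ref)

for word, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{score:.2f} {word}")
```

Function words such as "the" end up near the bottom of the ranking, while words that never occur in the reference corpus receive the maximum score.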

4
Examples of unigrams extracted through rel. freq.
  • Words with high rel.freq.
  • 0,54 Radiological
  • 0,49 concerned
  • 0,21 Board
  • 0,11 aid
  • 0,10 Potential
  • 0,08 reasonably
  • 0,08 Reconstruction
  • 0,08 give
  • 0,08 extend
  • 0,07 alia
  • 0,04 CHAPTER
  • 0,01 qualified
  • 0,01 measurement
  • 0,01 Nuclear
  • 0,01 materials
  • 0,01 steps
  • 0,01 energy
  • 0,01 declared
  • Words not found in the reference corpus
  • 1 sievert
  • 1 gray
  • 1 Sv
  • 1 wT
  • 1 radon
  • 1 becquerel
  • 1 wTHT
  • 1 DT
  • 1 EDA
  • 1 aboveground
  • 1 APPRENTICES
  • 1 Thermonuclear
  • 1 wR
  • 1 dN
  • 1 ankles
  • 1 mSv
  • 1 after-effects
  • 1 DOSE

5
Tf.idf
  • radiological 0,67
  • exposures 0,25
  • JRC 0,20
  • lens 0,19
  • radiation 0,17
  • apprentices 0,14
  • ionizing 0,14
  • serviceable 0,13
  • dose 0,13
  • nuclear 0,12
  • doses 0,12
  • workplaces 0,11
  • EXPOSURE 0,11
  • radioactive 0,10
  • joule 0,10
  • Resolutions 0,10
  • Governors 0,10
  • Dose 0,08
  • students 0,08
  • Cabinet 0,076
  • Nuclear 0,067
  • exposure 0,067
  • non-Member 0,059
  • gender 0,056
  • workers 0,052
  • Reactor 0,050
  • Euratom 0,049
  • proceeds 0,047
  • disregarded 0,043
  • Exchanges 0,042
  • Optimization 0,042
  • PRACTICES 0,042
  • dosimetric 0,042
  • exposed 0,037
  • population 0,036
  • contaminating 0,033
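
A rough sketch of how tf.idf scores of this kind could be computed for one document against a document collection. The slides do not say which tf.idf variant was used, so the weighting below (relative term frequency times the natural log of inverse document frequency) is an assumption, and the toy documents are invented.

```python
import math
from collections import Counter

def tf_idf(doc_tokens, collection):
    """tf.idf for each word of doc_tokens.

    `collection` is a list of tokenised documents and is assumed to
    include doc_tokens itself, so every document frequency is >= 1.
    """
    n_docs = len(collection)
    doc_freq = Counter()
    for tokens in collection:
        doc_freq.update(set(tokens))  # count each word once per document
    counts = Counter(doc_tokens)
    return {
        word: (count / len(doc_tokens)) * math.log(n_docs / doc_freq[word])
        for word, count in counts.items()
    }

# Toy collection, invented for illustration only.
docs = [
    ["radiological", "dose", "the", "dose"],
    ["the", "council", "regulation"],
    ["the", "member", "state"],
]
tfidf_scores = tf_idf(docs[0], docs)
```

A word that occurs in every document (here "the") gets a score of zero, which is why tf.idf needs no separate stop list for the most common function words.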

6
Tf.idf - Slovene
  • sevanju 0,19082
  • radiološkega 0,17864
  • dozimetrijo 0,17052
  • sivert 0,13804
  • radionuklidov 0,13804
  • sevanja 0,13195
  • Dana 0,12992
  • Černobil 0,12180
  • Izpostavljenost 0,12180
  • Jedrska 0,11368
  • dozo 0,09473
  • prebivalstva 0,09256
  • sevanjem 0,08932
  • ITER 0,08120
  • Oddelkom 0,07308
  • inovativnosti 0,07308
  • študente 0,07308
  • izpostavljenosti 0,07308
  • radioaktivne 0,06766
  • cepitve 0,05684
  • nivoji 0,05684
  • efektivno 0,05684
  • medicine 0,05278
  • fuzije 0,05075
  • zaposlitvijo 0,04872
  • termonuklearni 0,04872
  • študentov 0,04872
  • guvernerjev 0,04872
  • prioritete 0,04872
  • reaktorja 0,04872
  • jedrske 0,04872
  • delodajalca 0,04669
  • izpostavljenih 0,04601
  • ionizirajočemu 0,04466
  • ekvivalentno 0,04263
  • dosegljive 0,04060
  • ionizirajočega 0,04060
  • jedrskem 0,04060

7
Other indicators of termhood
  • Acronyms (NPP, SG, RBB ...)
  • Unknown words
  • not found in the reference corpus
  • unknown to the lemmatizer
  • Cognates, named entities
  • radioactive radioaktivna 1.0
  • radioactive Radioaktivna 1.0
  • Radioactive Radioaktivna 1.0
  • radioactive radioaktivne 1.0
  • radioactive radioaktivnih 1.0
  • radioactive radioaktivnimi 1.0
  • radioactive radioaktivno 1.0
  • radioactive radioaktivnosti 1.0
  • radioactive radiokativnega 1.0
  • radiography radiografijo 1.0
  • radionuclide radionuklid 1.0
  • radionuclide radionuklida 1.0
  • radionuclide radionuklidov 1.0
  • radionuclides radionuklide 1.0
  • radionuclides radionuklidov 1.0
  • ratify ratificirajo 1.0
  • Reactor reaktorja 1.0
  • reactor reaktorjev 1.0
  • reactors reaktorji 1.0
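
The slides do not name the similarity measure behind the cognate scores above. As an illustration only, the sketch below uses the longest common subsequence ratio (LCSR), a common string measure for cognate detection. It is an assumed stand-in and would not reproduce the 1.0 scores listed on this slide.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def lcsr(a, b):
    """Longest common subsequence ratio, case-insensitive, in [0, 1]."""
    a, b = a.lower(), b.lower()
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))

for en, sl in [("Reactor", "reaktorja"), ("radionuclide", "radionuklid"), ("ratify", "sevanja")]:
    print(en, sl, round(lcsr(en, sl), 2))
```

In practice a threshold on such a score, possibly combined with spelling normalisation such as c/k, would decide which word pairs are added to the bilingual lexicon.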

8
Identifying multi-word units
  • Collocation extraction techniques
  • Mutual Information (Church & Hanks 1990)
  • Log-likelihood ratio (Dunning 1993)
  • Entropy-based (Shimohata et al. 1997)
  • Semantic non-compositionality (Pearce 2001)
  • Daille (1994): LL is the most appropriate measure
  • for n > 3, n-gram frequency (+ stopword
    filtering) also works
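
Two of the measures listed above can be sketched for bigrams from four counts: the bigram frequency, the frequencies of its two words, and the total number of bigrams. This follows the usual textbook formulations of pointwise mutual information and the log-likelihood ratio over a 2x2 contingency table. The function names and counts are hypothetical and nothing here is taken from the author's implementation.

```python
import math

def pmi(c12, c1, c2, n):
    """Pointwise mutual information (base 2) of a bigram.

    c12: bigram count, c1: count of word 1 in first position,
    c2: count of word 2 in second position, n: total number of bigrams.
    """
    return math.log2(c12 * n / (c1 * c2))

def log_likelihood(c12, c1, c2, n):
    """Log-likelihood ratio (G2) over the 2x2 contingency table of a bigram."""
    observed = [
        [c12, c1 - c12],                # word 1 followed by word 2 / by something else
        [c2 - c12, n - c1 - c2 + c12],  # other word followed by word 2 / by something else
    ]
    row_totals = [c1, n - c1]
    col_totals = [c2, n - c2]
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # empty cells contribute nothing
                e = row_totals[i] * col_totals[j] / n
                g2 += o * math.log(o / e)
    return 2.0 * g2

# Hypothetical counts: a strongly associated pair vs. an independent pair.
strong = (50, 60, 55, 1000)
independent = (10, 100, 100, 1000)
print(pmi(*strong), log_likelihood(*strong))
print(pmi(*independent), log_likelihood(*independent))
```

When the observed bigram count equals what independence predicts, both measures are zero. The log-likelihood ratio is generally considered less prone than mutual information to overrating rare pairs.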

9
N-gram term weighting
  • statistically extracted n-grams are not
    necessarily terms → need for filtering /
    weighting
  • Stopword filtering
  • Weighting with tf.idf, ll-rank/score, frequency:
    w(t_{w1,w2,w3}) = (tf.idf_{w1} + tf.idf_{w2} + tf.idf_{w3})/n + 1/rank

10
2-grams, weighted with rel.freq.
  • Thermonuclear Experimental 1.91766291545192
  • International Thermonuclear 1.90047962704222
  • wR values 1.74111305022281
  • cosmic radiation 1.68720469442766
  • non-Member States 1.67427461796584
  • Atomic Energy 0.996377043841846
  • European Atomic 0.995366262170687
  • Energy Community 0.995029334946967
  • Member States 0.994692407723247
  • Member State 0.994355480499528
  • exposed workers 0.990312353814892
  • radiation protection 0.988290790472574
  • ionizing radiation 0.985847228548466
  • nuclear power 0.975824483194946
  • Nuclear Safety 0.97077057483915

11
3-grams
  • Thermonuclear Experimental Reactor 2.83532583090384
  • International Thermonuclear Experimental 2.81814254249414
  • mSv per year 2.73507410483208
  • APPRENTICES AND STUDENTS 2.69461709334949
  • exceed 1 mSv 2.46078960008804
  • feet and ankles 2.2734580636999
  • European Atomic Energy 1.99321785686789
  • Atomic Energy Community 1.99288092964417
  • DECIDED AS FOLLOWS 1.95055494141597
  • nuclear power stations 1.94785693049428
  • Nuclear Safety Account 1.94301570053366
  • controlled nuclear fusion 1.88877041751479
  • Energy Community represented 1.87461947411856
  • natural radiation sources 1.87453104455193
  • nuclear power station 1.87309490609042
  • apprentices and students 1.86800777160465
  • Chernobyl nuclear power 1.86257180721416
  • establishing the European 1.85670559767151
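
The 2-gram and 3-gram scores listed above appear consistent with simply adding up the rel.freq weights of the component words. This is an inference from the numbers, not something the slides state. The sketch below checks it on one example. "Thermonuclear" has weight 1 (not found in the reference corpus), while the weights for "Experimental" and "International" are back-derived from the 2-gram scores and are therefore assumptions.

```python
def ngram_weight(ngram, word_weights):
    """Weight of an n-gram as the plain sum of its word weights (assumed scheme)."""
    return sum(word_weights[word] for word in ngram.split())

word_weights = {
    "Thermonuclear": 1.0,                      # listed among words not found in the reference corpus
    "Experimental": 1.91766291545192 - 1.0,    # inferred from "Thermonuclear Experimental"
    "International": 1.90047962704222 - 1.0,   # inferred from "International Thermonuclear"
}

score = ngram_weight("International Thermonuclear Experimental", word_weights)
print(score)  # compare with the listed 3-gram score 2.81814254249414
```

Because a plain sum grows with n-gram length, scores from this scheme are only comparable among n-grams of the same length.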

12
Treatment of nested terms
  • C-value (Frantzi & Ananiadou 1996)
  • C-value(a) = (length(a) - 1)(freq(a) - t(a)/c(a))
  • n-gram C-value
  • Chernobyl nuclear 7,3
  • nuclear power plant 15,2
  • Chernobyl nuclear power plant 20,4
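
A small sketch of the C-value computation. Here t(a) is taken to be the total frequency of the longer candidates that contain a, and c(a) the number of such candidates. A candidate that is not nested in any longer one is scored as (length(a) - 1) * freq(a). The frequencies are invented, so the output does not reproduce the values on this slide.

```python
def contains(longer, shorter):
    """True if the token tuple `shorter` occurs contiguously inside `longer`."""
    n = len(shorter)
    return any(longer[i:i + n] == shorter for i in range(len(longer) - n + 1))

def c_values(freqs):
    """C-value for each candidate term in `freqs` (term string -> corpus frequency)."""
    tokenised = {term: tuple(term.split()) for term in freqs}
    result = {}
    for term, tokens in tokenised.items():
        longer = [
            other for other, other_tokens in tokenised.items()
            if len(other_tokens) > len(tokens) and contains(other_tokens, tokens)
        ]
        adjusted = freqs[term]
        if longer:
            t_a = sum(freqs[other] for other in longer)  # t(a)
            adjusted = freqs[term] - t_a / len(longer)   # freq(a) - t(a)/c(a)
        result[term] = (len(tokens) - 1) * adjusted
    return result

# Hypothetical frequencies, for illustration only.
freqs = {
    "nuclear power": 12,
    "nuclear power plant": 10,
    "Chernobyl nuclear power plant": 4,
}
cv = c_values(freqs)
for term, value in cv.items():
    print(term, value)
```

The nested candidate "nuclear power" is penalised because most of its occurrences are explained by the longer terms that contain it.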

13
Bilingual lexicon extraction
  • using Twente (Hiemstra 1998)
  • based on the Iterative Proportional Fitting
    Procedure (IPFP), word-to-word translation model
  • outputs translation candidates with scores for
    each word in the corpus, both ways
  • using stopword-filtered corpora to improve
    results
  • bilingual lexicon expanded with cognates
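
To give an idea of the fitting step named above, here is a generic sketch of iterative proportional fitting on a small table of word co-occurrence counts. It only illustrates the row and column rescaling and is not the translation model implemented in Twente. The matrix, the target marginals and the function name are all hypothetical.

```python
def ipfp(matrix, row_targets, col_targets, iterations=100):
    """Rescale a positive matrix until its row and column sums match the targets."""
    fitted = [row[:] for row in matrix]
    for _ in range(iterations):
        # Scale every row to its target sum.
        for i, row in enumerate(fitted):
            row_sum = sum(row)
            fitted[i] = [value * row_targets[i] / row_sum for value in row]
        # Scale every column to its target sum.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in fitted)
            for row in fitted:
                row[j] *= target / col_sum
    return fitted

# Hypothetical co-occurrence counts: rows are source words, columns are target words.
cooccurrence = [
    [4.0, 1.0],
    [1.0, 3.0],
]
fitted = ipfp(cooccurrence, row_targets=[1.0, 1.0], col_targets=[1.0, 1.0])
for row in fitted:
    print([round(value, 3) for value in row])
```

After fitting, each row can be read as a distribution over translation candidates for one source word, and each column as the same for one target word.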

14
Output of Twente lexicon extraction

15
Term alignment
  • for each source term candidate we collect all
    single-word equivalents from the bilingual
    lexicon, e.g. for jedrska elektrarna Černobil:
  • jedrska: nuclear 1.00
  • elektrarna: power 0.50, plant 0.50
  • Černobil: Chernobyl 1.00

16
Term alignment
  • for each source term candidate we collect all
    single-word equivalents from the bilingual
    lexicon, e.g. for jedrska elektrarna Černobil:
  • jedrska: nuclear 1.00
  • elektrarna: power 0.50, plant 0.50
  • Černobil: Chernobyl 1.00
  • Nuclear power plant 2.00
  • Power plant 1.00
  • Chernobyl nuclear power plant 3.00
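
The candidate scores in this example match the sum of the lexicon scores of the words each English candidate contains. A minimal sketch of that reading follows. The grouping of equivalents by Slovene word and the function names are assumptions, and the slides do not say how ties or missing words are handled.

```python
# Lexicon entries from the slide; assigning them to individual Slovene words is assumed.
lexicon = {
    "jedrska": {"nuclear": 1.00},
    "elektrarna": {"power": 0.50, "plant": 0.50},
    "Černobil": {"Chernobyl": 1.00},
}

def equivalents(source_term, lexicon):
    """Pool the single-word equivalents of all words in the source term."""
    pooled = {}
    for word in source_term.split():
        for target, score in lexicon.get(word, {}).items():
            key = target.lower()
            pooled[key] = max(score, pooled.get(key, 0.0))
    return pooled

def alignment_score(target_term, pooled):
    """Sum the lexicon scores of the target words found among the equivalents."""
    return sum(pooled.get(word.lower(), 0.0) for word in target_term.split())

pooled = equivalents("jedrska elektrarna Černobil", lexicon)
candidates = ["Nuclear power plant", "Power plant", "Chernobyl nuclear power plant"]
best = max(candidates, key=lambda term: alignment_score(term, pooled))
for term in candidates:
    print(term, alignment_score(term, pooled))
```

The candidate covering all three source words scores highest and would be selected as the translation equivalent.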

17
Term candidates
Slovene | English | Equivalence
doznih mej | dose limits | 1.49
nadzorovane jedrske fuzije | controlled nuclear fusion | 1.89
varstvo pred sevanjem | radiation protection | 2.00
mednarodnega termonuklearnega poskusnega | International thermonuclear experimental | 2.49
poskusnega reaktorja | experimental reactor | 1.49
študenti in pripravniki | Students and apprentices | 1.50
izpostavljenost ionizirajočemu sevanju | emitting ionizing radiation | 1.99
zdravstvenimi službami | approved medical practitioners | 0.75
izpostavljenih delavcev | exposed workers | 1.78
države članice | Member states require | 1.49

18
Outcome
  • Corpus: 17.000 tokens
  • Extracted: 193 Slovene and 199 English term
    candidates
  • Bilingual (aligned): 112
  • What we miss:
  • Term variation: disposal of waste / emplacement
    of waste, safety levels / levels of safety
  • Hapax: radiation weighting factor, tissue
    weighting factor

19
Purpose of term extraction
  • Extraction vs. annotation

Application | Term form | Level of specificity | Accuracy
Keyword assignment, document classification | Short, NPs preferred, not too specialised, resolution of variants | low | medium
Translation, terminography | All, variation wanted, collocations wanted | high | medium
Information extraction | Short-medium length, NPs preferred, resolution of nesting | high | high
Semantic processing (IR, TM) | Short-medium length, ambiguity resolution | high | high

20
Problems
  • Distinguishing between generic and text-specific
    terms (same form, same frequency!)
  • Capturing low frequency terms in inflected
    languages
  • We want to capture domain-specific terms. But
    most texts are multi-domain!