Title: Diapositive 1
1 A multi-word term extraction program for Arabic language
LREC, 28-30 May 2008, Marrakech
Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine
LINA, University of Nantes; GSCM_LRIT, University of Rabat
2 Outline
- Multi-word terms
- Motivation
- Approach
- Comparing statistical methods
- Conclusion and future work
3 Terms
- Refer to a defined concept ... (ISO 704).
- Belong to a limited number of parts of speech: nouns, verbs, adjectives, and adverbs.
- Given subject domain
4 Multi-word terms
- ????? ?????? ?????????? ????? ????? ?????? ???????? ???? ??? ?? ????? ??????? ??????? (Wikipedia)
- Nitrogen oxides are formed in all combustion processes taking place at high temperature
- MWTs extracted
- ?????? ??????????
- ????? ??????? ???????
- ?????? ????????
5 Motivation
- MWTs are frequent
- Applications
- for building an index from unstructured documents
- for enhancing document retrieval systems
6 MWT extraction system: concept extraction
- Corpus
- Identification of term candidates: linguistic filtering (shallow parsing)
- Filtering of term candidates: statistical significance (LLR, FLR, MI3, t-score)
- Candidate list
7 MWT evaluation
- Unithood measures the strength of association between the constituents of a multi-word unit (MWU)
- United Nations (environment domain): unithood
- Termhood measures relatedness to existing domain-specific concepts
- Soil degradation (environment domain): termhood and unithood
8 MWT patterns
Pattern | Sub-pattern | Arabic MWT | English translation
N ADJ | | ?????? ????????? | Chemical pollution
N1 N2 | | ???? ????? | Water pollution
N1 PREP N2 | N1 ? N2 | ?????? ??????? | Pollution with lead
N1 PREP N2 | N1 ? N2 | ?????? ??????? | Exposure to diseases
N1 PREP N2 | N1 ?? N2 | ?????? ?? ???????? | Waste disposal
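The three patterns above can be illustrated with a small matcher over POS-tagged tokens. This is only a sketch, not the authors' implementation: the tag names (NN, JJ, IN, DT) are taken from the tagger output shown on the preprocessing slide, while the `COARSE` mapping, the function name `find_candidates`, and the choice to skip determiners are assumptions made for illustration.

```python
# Hypothetical sketch: match the patterns N ADJ, N1 N2 and N1 PREP N2
# against one POS-tagged sentence given as (word, tag) pairs.

PATTERNS = {
    "N ADJ": ("N", "ADJ"),
    "N1 N2": ("N", "N"),
    "N1 PREP N2": ("N", "PREP", "N"),
}

# Assumed mapping from Penn-style tags to the coarse classes used above.
COARSE = {"NN": "N", "NNS": "N", "JJ": "ADJ", "IN": "PREP"}


def find_candidates(tagged):
    """Return (pattern_name, words) for every pattern match in the sentence."""
    # Assumption: determiners (Al/DT) are skipped so that definite and
    # indefinite forms of the same sequence line up.
    tokens = [(w, COARSE.get(t, "O")) for w, t in tagged if t != "DT"]
    found = []
    for name, shape in PATTERNS.items():
        n = len(shape)
        for i in range(len(tokens) - n + 1):
            window = tokens[i:i + n]
            if tuple(c for _, c in window) == shape:
                found.append((name, tuple(w for w, _ in window)))
    return found


# Fragment of the tagged sentence from the preprocessing slide
# (not an environment term; it only shows the mechanics).
fragment = [("Al", "DT"), ("Hkm", "NN"), ("Al", "DT"), ("mjry", "JJ")]
print(find_candidates(fragment))
```

Every match is only a candidate; the statistical filtering step decides which candidates are kept.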
9 MWT variations
- Multiple forms for the same concept
- Variation types
- Inflexional morphology
- Number
- N1 N2 / N1 N2 suffix(??, ??)
- ???? ?????? ocean pollution
- ???? ???????? oceans pollution
- Definite form
- N Adj / Prefix(??) N prefix(??) Adj
- ???? ??????? chemical pollution
- ?????? ????????? the chemical pollution
- Derivational morphosyntactic phenomena
- N1 ADJ /N1 PREP N2
- ??? ???? > ??? ?? ????? oil well
- Syntactically (modification postposition)
- N1 N2 / N1 N2 ADJ
- ???? ??????? degree of temperature
- ???? ??????? ??????? high degree of temperature
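One way to treat such variants as a single term is to map each form to a shared key before counting. The sketch below covers only the definite-form variation and works on Buckwalter-transliterated text, where the definite article is written `Al`. The function names, the minimum-length guard, and the example transliterations are assumptions for illustration; blind prefix stripping will mis-handle words whose stem itself begins with `Al`.

```python
from collections import Counter


def canonical(term):
    """Naive variant key: remove the definite article 'Al' from each word."""
    words = []
    for w in term.split():
        if w == "Al":
            continue  # article already split off by the tokenizer
        # Assumed guard: only strip when enough of the word remains.
        if w.startswith("Al") and len(w) > 4:
            w = w[2:]
        words.append(w)
    return " ".join(words)


def merge_variants(counts):
    """Sum the frequencies of all variants that share a canonical key."""
    merged = Counter()
    for term, freq in counts.items():
        merged[canonical(term)] += freq
    return merged


# Illustrative counts; the transliterations are the editor's, not the slides'.
observed = {"tlwv kymyA}y": 3, "Altlwv AlkymyA}y": 5}
print(merge_variants(observed))
```

Merging variants before scoring raises the frequency of the underlying term, which matters for frequency-sensitive measures.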
10 Comparing statistical filtering
- Mutual Information (MI3) (Daille, 1994) as baseline
- Log-likelihood ratio (LLR) (Dunning, 1993)
- t-score (Church, 1991)
- FLR (Nakagawa and Mori, 2003)
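Three of these measures can be computed from a 2x2 contingency table for a word pair (w1, w2). The sketch below uses common textbook formulations, which are an assumption here since the slides do not give the formulas: a = count of w1 followed by w2, b = w1 without w2, c = w2 without w1, d = neither. FLR is left out because it relies on left/right neighbour counts rather than this table.

```python
import math


def expected(a, b, c, d):
    """Expected co-occurrence count under independence."""
    n = a + b + c + d
    return (a + b) * (a + c) / n


def mi3(a, b, c, d):
    """Cubic mutual information, assumed form: log2(a^3 / ((a+b)(a+c)))."""
    return math.log2(a ** 3 / ((a + b) * (a + c)))


def t_score(a, b, c, d):
    """t-score: (observed - expected) / sqrt(observed)."""
    return (a - expected(a, b, c, d)) / math.sqrt(a)


def llr(a, b, c, d):
    """Log-likelihood ratio: 2 * sum(O * ln(O / E)) over the four cells."""
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                e = rows[i] * cols[j] / n
                total += o * math.log(o / e)
    return 2 * total


# Hypothetical counts: a strongly associated pair vs. an independent one.
print(llr(50, 5, 5, 940), llr(10, 90, 90, 810))
```

Candidates are then ranked by the chosen score, and the top of each ranking is what gets evaluated.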
11 Experiment data
- Arabic domain-specific corpus on the environment
- Compiled from the web: Al-Khat Alakhdar and Akhbar Albiae, 2004-2006
- 475,148 words
- Motivation
- No Arabic domain-specific corpora were available
12 Gold standard
- Reference list
- Arabic environment terminology (Agrovoc)
- Total: 65,000 unique known terms (single-word terms and MWTs)
- Dynamic search
- Eurodicautom
13 Preprocessing
- Removing diacritics
- Buckwalter's transliteration
- Diab's parsing (Diab, 2004)
- Input
- wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA' SHyHp Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw.
- Output
- w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN qbl/NN Al/DT ysAndrw/NNP ./PUNC
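Two of these steps are easy to sketch: stripping Arabic diacritics from the raw text, and reading the tagger's `word/TAG` output into pairs for later pattern matching. The function names are hypothetical, and the diacritic range (U+064B to U+0652, fathatan through sukun) is an assumption about which marks are removed; the transliteration and tagging themselves are done by the external tools named above and are not reproduced here.

```python
import re

# Arabic short-vowel and related marks, U+064B (fathatan) to U+0652 (sukun).
DIACRITICS = re.compile("[\u064B-\u0652]")


def strip_diacritics(text):
    """Remove Arabic diacritic marks from a Unicode string."""
    return DIACRITICS.sub("", text)


def parse_tagged(line):
    """Split 'word/TAG word/TAG ...' into a list of (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the last slash, so './PUNC' parses correctly.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


print(parse_tagged("Al/DT Hkm/NN Al/DT mjry/JJ ./PUNC"))
```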
14 Evaluation and results
Measure | Precision (%)
LLR | 85
FLR | 60
MI3 | 26
t-score | 57
- For each association score
- Examine the top-ranked term candidates
- Compute precision (termhood) for the first 100 term candidates
- Precision (termhood) is the ratio of attested MWTs to all extracted sequences
- Log-likelihood is the best measure
LLR list (Arabic terms lost in extraction), Eng. transl.: Acidity degree, Dioxide, Waste water, Sewage, Climate change, Nervous system
Agrovoc / Eurodicautom: Acidity degree, Dioxide, Waste water, Sewage, Climate change, Nervous system
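The precision figure described on this slide is the share of the top-ranked candidates that are attested in the reference lists. A minimal sketch, assuming the ranked candidates and the gold standard are available as plain Python collections (the function name and example terms are hypothetical):

```python
def precision_at_k(ranked, gold, k=100):
    """Fraction of the first k ranked candidates found in the gold standard."""
    top = ranked[:k]
    if not top:
        return 0.0
    hits = sum(1 for term in top if term in gold)
    return hits / len(top)


# Hypothetical data: 2 of the 4 top candidates are attested terms.
ranked = ["waste water", "climate change", "last year", "said that"]
gold = {"waste water", "climate change", "nervous system"}
print(precision_at_k(ranked, gold, k=4))
```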
15 Summary and future work
- Developed MWT extraction for Arabic
- Defined MWT patterns and variations
- Obtained better results than for European languages
- Improvement of the system
- Adding new variations
- Improving lemmatisation
16 Introduction
- MWTs are sufficiently informative to help human readers get a feel for the essential topics
- Used in many text-related applications
- Text clustering
- Document similarity
- Document summarization
17 Related work
- Linguistic approach
- Based on linguistic pre-processing and annotations (output of taggers, shallow parsers)
- Detects recurrent syntactic term formation patterns
- Noun Noun
- (Adj Noun) Noun
18 Systems based on linguistics
- Ananiadou, S. (1994) recognises single-word terms from the domain of immunology, based on morphological analysis of term formation patterns (internal term make-up)
- Justeson and Katz (1995, TERMS) extract complex terms based on two characteristics that distinguish them from non-terms
- the syntactic patterns are restricted
- terms appear with the same form throughout the text; omissions of modifiers are avoided
19 Systems based on linguistics
- The text is tagged and a filter is applied to extract terms
- ((A | N)+ | ((A | N)* (N P)?) (A | N)*) N
- AN / NN / AAN / ANN / NAN / NNN / NPN
- Filtering based on simple POS patterns
- A pattern must occur above a certain threshold to be considered a valid term pattern
- Recall 71%, precision 71-96%
- LEXTER (Bourigault, 1994)
- Extracts French compound terms based on surface syntactic analysis and text heuristics
- Terms are identified according to certain syntactic patterns
20
- Uses a boundary method to identify the extent of terms
- Categories or sequences of categories that are not found in term patterns form the boundaries, e.g. verbs, or any preposition (except de and à) followed by a determiner
- Non-productive sequences become boundaries
- Precision 95%, although tests have shown that a lot of noise is generated
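The Justeson and Katz part-of-speech filter described above can be sketched as a regular expression over one-letter tag strings. This assumes the usual reading of the filter, ((A | N)+ | ((A | N)* (N P)?) (A | N)*) N, with A = adjective, N = noun, P = preposition, plus a minimum length of two words; the function name and the one-letter encoding are illustrative choices.

```python
import re

# One character per token: A = adjective, N = noun, P = preposition.
JK_FILTER = re.compile(r"((A|N)+|((A|N)*(NP)?)(A|N)*)N")


def is_jk_candidate(tags):
    """True if the whole tag string matches the filter and has 2+ tokens."""
    return len(tags) >= 2 and JK_FILTER.fullmatch(tags) is not None


for tags in ["AN", "NN", "AAN", "ANN", "NAN", "NNN", "NPN", "PN"]:
    print(tags, is_jk_candidate(tags))
```

Sequences that pass the filter are then kept only if they occur often enough, which is the frequency threshold mentioned on the slide.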
21 Approaches using statistical information
- Main measures used
- Frequency of occurrence
- Mutual Information
- C/NC value
- Experiments also with the log-likelihood coefficient (Dunning, 1993)
22 Frequency of occurrence
- Simplest and most popular method: domain independent, requires no external resources
- Some filtering is used in the form of syntactic patterns
- Systems using frequency of occurrence
- Dagan and Church (TERMIGHT, 1994)
- Enguehard and Pantera (1994)
- Lauriston (TERMINO, 1996)
23 Mutual Information
- The amount of information provided by the occurrence of the event represented by $y_i$ about the occurrence of the event represented by $x_k$ is defined as
- $I(x_k, y_i) \equiv \log \frac{P(x_k, y_i)}{P(x_k)\,P(y_i)}$ (Fano, 1961: 27-28)
- This measure is about how much one word tells us about the other
- Problems for MI come from data sparseness
- Damerau (1993) and Daille (1994) used MI for the extraction of candidate terms (only for two-word candidate terms)
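The sparseness problem can be seen with a small calculation. The sketch below estimates the probabilities by relative frequency and uses a base-2 logarithm; both are assumptions, since the slide only writes "log". The counts are invented for illustration.

```python
import math


def pmi(pair_count, x_count, y_count, n):
    """log2(P(x, y) / (P(x) * P(y))) with relative-frequency estimates."""
    p_xy = pair_count / n
    p_x = x_count / n
    p_y = y_count / n
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical corpus of one million tokens.
rare = pmi(1, 1, 1, 1_000_000)               # two words seen once, together
frequent = pmi(1000, 2000, 3000, 1_000_000)  # a well-attested pair
print(rare > frequent)
```

A pair observed a single time outranks a pair observed a thousand times, because MI rewards low marginal counts regardless of how little evidence supports them.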
24 C/NC value (Frantzi and Ananiadou)
- C-value
- total frequency of occurrence of the string in the corpus
- frequency of the string as part of longer candidate terms
- number of these longer candidate terms
- length of the string (in number of words)
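These four ingredients combine in the published C-value formula: a candidate's frequency is reduced by the average frequency of the longer candidates that contain it, and the result is weighted by the base-2 logarithm of its length. A minimal sketch under that formulation; the data layout (`freq`, `nested_in`) and the English example terms are assumptions for illustration.

```python
import math


def c_value(term, freq, nested_in):
    """
    term:      tuple of words
    freq:      dict mapping each candidate term to its corpus frequency
    nested_in: dict mapping a term to the longer candidates that contain it
    """
    length_weight = math.log2(len(term))
    longer = nested_in.get(term, [])
    if not longer:
        # Not nested in any longer candidate: plain weighted frequency.
        return length_weight * freq[term]
    # Subtract the average frequency of the containing candidates.
    penalty = sum(freq[t] for t in longer) / len(longer)
    return length_weight * (freq[term] - penalty)


# Hypothetical counts.
short = ("soil", "degradation")
long_ = ("severe", "soil", "degradation")
freq = {short: 10, long_: 4}
nested_in = {short: [long_]}
print(c_value(short, freq, nested_in), c_value(long_, freq, nested_in))
```

Note that the logarithm makes the score zero for single-word strings, so this form applies to multi-word candidates only.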
25 NC-value
- $\text{NC-value}(a) = 0.8 \cdot \text{C-value}(a) + 0.2 \cdot \text{CF}(a)$
- a is the candidate term
- C-value(a) is the C-value for the candidate term a
- CF(a) is the context factor for the candidate term a
- the CF is obtained by summing the weights of its term context words, each multiplied by its frequency of appearing with this candidate term
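The combination is a weighted sum and can be sketched directly from the definitions on this slide. The function names, the dictionary layout, and the numbers are hypothetical; how the context-word weights are derived is not covered on the slide and is not reproduced here.

```python
def context_factor(context_counts, weights):
    """CF(a): sum over context words of (frequency with a) * (word weight)."""
    return sum(count * weights.get(word, 0.0)
               for word, count in context_counts.items())


def nc_value(c_val, cf):
    """NC-value(a) = 0.8 * C-value(a) + 0.2 * CF(a)."""
    return 0.8 * c_val + 0.2 * cf


# Hypothetical context words seen with one candidate term.
counts = {"severe": 3, "prevent": 2}
weights = {"severe": 0.5, "prevent": 0.25}
cf = context_factor(counts, weights)
print(cf, nc_value(6.0, cf))
```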
26 Hybrid approaches
- Combination of linguistic information (filters), shallow parsing results and statistical measures
- Daille, B.; Frantzi and Ananiadou