1
A multi-word term extraction program for Arabic
language
LREC 28-30 May 2008 Marrakech
Siham Boulaknadel, Béatrice Daille and Driss Aboutajdine
LINA, University of Nantes; GSCM_LRIT, University of Rabat
2
Outline

Multi-word term
Motivation
Approach
Comparing statistical methods
Conclusion and future work

3
Terms

Refer to a defined concept ... (ISO 704).
Represent a limited number of parts of speech:
nouns, verbs, adjectives, and adverbs.
Given subject domain

4
Multi-word terms

????? ?????? ?????????? ????? ????? ??????
???????? ???? ??? ?? ????? ??????? ???????
Wikipedia
Nitrogen oxides are produced by all combustion
processes taking place at high temperature
MWTs extracted
?????? ??????????
????? ??????? ???????
?????? ????????

5
Motivation

Frequent MWTs
Application
for building index from unstructured documents
for enhancing document retrieval system

6
MWT extraction system Concept extraction
Corpus
Identification of Term Candidates - linguistic
filtering (shallow parsing)
Filtering of Term Candidates - statistical
significance (LLR, FLR, MI3, T-score)
Candidate list
7
MWT evaluation

Unithood measures the strength of association between
the constituents of a multi-word unit (MWU).
Example: United Nations (environment domain)
has unithood only.
Termhood measures relatedness to existing
domain-specific concepts.
Example: Soil degradation (environment domain)
has both termhood and unithood.

8
MWT patterns
Pattern | Sub-pattern | Arabic MWT | English translation
N ADJ | - | ?????? ????????? | Chemical pollution
N1 N2 | - | ???? ????? | Water pollution
N1 PREP N2 | N1 ? N2 | ?????? ??????? | Pollution with lead
N1 PREP N2 | N1 ? N2 | ?????? ??????? | Exposure to diseases
N1 PREP N2 | N1 ?? N2 | ?????? ?? ???????? | Waste disposal
9
MWT variations

Multiple forms for the same concept
Variations types
Inflexional morphology
Number
N1 N2 / N1 N2 suffix(??, ??)
???? ?????? ocean pollution
???? ???????? oceans pollution
Definite form
N Adj / Prefix(??) N prefix(??) Adj
???? ??????? chemical pollution
?????? ????????? the chemical pollution
Derivational morphosyntactic phenomena
N1 ADJ /N1 PREP N2
??? ???? > ??? ?? ????? oil well
Syntactically (modification postposition)
N1 N2 / N1 N2 ADJ
???? ??????? degree of temperature
???? ??????? ??????? high degree of temperature

10
Comparing statistical filtering

Mutual Information (MI3) (Daille, 1994) as
baseline
Log-likelihood ratio, LLR (Dunning, 1993)
t-Score (Church, 1991)
FLR (Nakagawa and Mori, 2003)
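
As a rough illustration of two of the measures listed above, the sketch below computes the log-likelihood ratio and the t-score from a 2x2 contingency table for a candidate pair (w1, w2). The function names, the table layout (a = w1 followed by w2, b = w1 without w2, c = w2 without w1, d = neither) and the toy counts are assumptions made for the example; they are not taken from the presentation.

```python
import math


def _term(observed: float, expected: float) -> float:
    """One cell of the G^2 sum; an empty cell contributes nothing."""
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def llr(a: int, b: int, c: int, d: int) -> float:
    """Log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    # Expected counts under the hypothesis that w1 and w2 are independent.
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return 2.0 * sum(_term(o, e) for o, e in zip(observed, expected))


def t_score(a: int, b: int, c: int, d: int) -> float:
    """t-score approximation: (observed - expected) / sqrt(observed)."""
    if a == 0:
        return 0.0
    n = a + b + c + d
    expected = (a + b) * (a + c) / n
    return (a - expected) / math.sqrt(a)


# Toy counts (hypothetical): a strongly associated pair vs. an independent one.
associated = (30, 5, 5, 960)
independent = (10, 10, 10, 10)
print(llr(*associated), t_score(*associated))
print(llr(*independent), t_score(*independent))
```

Both scores are computed per candidate, and candidates are then ranked by decreasing score.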

11
Experiment Data

Arabic domain-specific corpus on the environment
Compiled from the web (Al-Khat Alakhdar, Akhbar
Albiae), 2004-2006
475,148 words
Motivation
The unavailability of Arabic domain-specific
corpora

12
Gold standard

Reference list
Arabic environment terminology Agrovoc
Total: 65,000 unique known terms (single-word terms
and MWTs)
Dynamic search
Eurodicautom

13
Preprocessing

Removing diacritics
Buckwalter's transliteration
Diab's parsing (Diab, 2004)
Input
wlm yHtsb AlHkm Almjry sAndwr bwl rklp jzA' SHyHp
Avr Erqlp dAxl AlmnTqp mn qbl AlysAndrw.
Output
w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ
sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ
Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN
qbl/NN Al/DT ysAndrw/NNP ./PUNC
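
A minimal sketch of how candidate terms might be read off tagged output of this shape. It assumes Penn-style tags as in the example, a hypothetical mapping from those tags to the coarse classes N, ADJ and PREP, and that determiner tokens (Al/DT) are skipped before matching; none of these choices is confirmed by the presentation.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (transliterated word, POS tag)

# Hypothetical mapping from Penn-style tags to coarse pattern classes.
COARSE = {"NN": "N", "NNS": "N", "NNP": "N", "JJ": "ADJ", "IN": "PREP"}

# The three base patterns: N ADJ, N1 N2, N1 PREP N2.
PATTERNS = [("N", "ADJ"), ("N", "N"), ("N", "PREP", "N")]


def parse_tagged(line: str) -> List[Token]:
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    tokens = []
    for item in line.split():
        word, _, tag = item.rpartition("/")
        tokens.append((word, tag))
    return tokens


def extract_candidates(tokens: List[Token], skip_tags=("DT",)) -> List[str]:
    """Return every token window whose coarse tags match a pattern."""
    # Assumption: determiners are dropped so that 'Al/DT Hkm/NN' counts as a noun.
    kept = [(w, COARSE.get(t, "O")) for w, t in tokens if t not in skip_tags]
    found = []
    for i in range(len(kept)):
        for pattern in PATTERNS:
            window = kept[i:i + len(pattern)]
            if tuple(cls for _, cls in window) == pattern:
                found.append(" ".join(w for w, _ in window))
    return found


tagged = ("w/CC lm/RP yHtsb/VBP Al/DT Hkm/NN Al/DT mjry/JJ "
          "sAndwr/NNP bwl/NNP rklp/NN jzA'/NN SHyHp/JJ "
          "Avr/IN Erqlp/NN dAxl/IN Al/DT mnTqp/NN mn/IN "
          "qbl/NN Al/DT ysAndrw/NNP ./PUNC")
candidates = extract_candidates(parse_tagged(tagged))
print(candidates)
```

Purely linguistic filtering of this kind over-generates (proper-noun sequences also match N1 N2), which is why a statistical filtering step follows it.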

14
Evaluation and results
Precision (%)
LLR 85
FLR 60
MI3 26
T-score 57

For each association score:
Examine the top-ranked candidate terms
Compute precision (termhood) over the first 100
candidate terms
Precision (termhood) is the ratio of attested MWTs
to all extracted sequences.
The log-likelihood ratio is the best measure.
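
The evaluation step described above can be sketched as a precision-at-k computation. The function name, the toy candidate list and the toy reference set are illustrative assumptions; in the experiment the reference would be the gold-standard term list.

```python
from typing import Iterable, List, Set


def precision_at_k(ranked_candidates: List[str],
                   reference_terms: Iterable[str],
                   k: int = 100) -> float:
    """Share of the top-k ranked candidates attested in the reference list."""
    reference: Set[str] = set(reference_terms)
    top = ranked_candidates[:k]
    if not top:
        return 0.0
    attested = sum(1 for term in top if term in reference)
    return attested / len(top)


# Toy data (hypothetical): candidates sorted by decreasing association score.
ranked = ["climate change", "waste water", "last year", "nervous system"]
gold = {"climate change", "waste water", "nervous system", "soil degradation"}
print(precision_at_k(ranked, gold, k=4))
```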

LLR list (Arabic): ???? ??????? ?????? ??????? ?????? ??????? ???? ??????? ?????? ??????? ?????? ??????
English translation: Acidity degree, Dioxide, Waste water, Sewage, Climate change, Nervous system
Reference resources: Agrovoc, Eurodicautom
15
Summary and future work

Develop MWT extraction for Arabic
Define MWT patterns and variations
Obtain better results than for European languages
Improvement of system
Adding new variation
Improve lemmatisation

16
Introduction

MWTs are sufficiently informative to help human
readers get a feel of the essential topics
Use in many text related applications
Text clustering
Document similarity
Document summarization

17
Related Work

Linguistic Approach
Based on linguistic pre-processing and
annotations (result of taggers, shallow parsers)
Detect recurrent syntactic term formation
patterns
Noun Noun
(Adj Noun) Noun

18
Systems based on linguistics

Ananiadou, S. (1994) recognises single-word terms
from the domain of immunology based on morphological
analysis of term formation patterns (internal
term make-up)
Justeson & Katz (1995, TERMS) extract complex
terms based on two characteristics (which
distinguish them from non-terms):
the syntactic patterns are restricted
terms appear with the same form throughout the
text; omissions of modifiers are avoided

19
Systems based on linguistics

The text is tagged, then a filter is applied to
extract terms:
((A | N)+ | ((A | N)* (N P)?) (A | N)*) N
AN / NN / AAN / ANN / NAN / NNN / NPN
Filtering based on simple POS patterns
A pattern must occur above a certain threshold to
be considered a valid term pattern.
Recall: 71%; precision: 71% to 96%
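
A part-of-speech filter of this kind can be sketched as a regular expression over a string of one-letter tags (A = adjective, N = noun, P = preposition), followed by a frequency threshold. The helper names, the toy phrases and the threshold value of 2 are assumptions made for the example.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# One letter per token; the sequence must end in a noun.
TERM_PATTERN = re.compile(r"((A|N)+|((A|N)*(NP)?)(A|N)*)N")


def matches_filter(tags: str) -> bool:
    """True if a multi-word tag sequence fits the term pattern."""
    return len(tags) >= 2 and TERM_PATTERN.fullmatch(tags) is not None


def frequent_candidates(tagged_phrases: List[Tuple[str, str]],
                        min_freq: int = 2) -> Dict[str, int]:
    """Keep phrases that pass the filter and occur at least min_freq times."""
    counts = Counter(phrase for phrase, tags in tagged_phrases
                     if matches_filter(tags))
    return {phrase: n for phrase, n in counts.items() if n >= min_freq}


# Toy input (hypothetical): (phrase, tag string) pairs from a tagged text.
phrases = [
    ("water pollution", "NN"),
    ("water pollution", "NN"),
    ("degree of temperature", "NPN"),
    ("in water", "PN"),
]
print(frequent_candidates(phrases))
```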
LEXTER (Bourigault, 1994)
Extracts French compound terms based on surface
syntactic analysis and text heuristics
Terms are identified according to certain
syntactic patterns

Uses a boundary method to identify the extent of
terms
Categories or sequences of categories that are
not found in term patterns form the boundaries,
e.g. verbs, or any preposition (except de and à)
followed by a determiner. Non-productive
sequences become boundaries.
Precision: 95%, although tests have shown that
a lot of noise is generated
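
The boundary idea can be illustrated with a heavily simplified sketch: a tagged sentence is cut wherever a boundary category occurs, and the remaining chunks are the zones in which terms may be found. This is not LEXTER's actual implementation; the one-letter tag set (V, P, D, N) and the two boundary rules coded here are assumptions taken only from the description above.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, tag) with hypothetical tags V, P, D, N, A

KEPT_PREPOSITIONS = ("de", "à")  # prepositions that never act as boundaries


def split_on_boundaries(tokens: List[Token]) -> List[List[str]]:
    """Cut the token sequence at verbs and at 'preposition + determiner'."""
    chunks: List[List[str]] = []
    current: List[str] = []
    for i, (word, tag) in enumerate(tokens):
        next_tag = tokens[i + 1][1] if i + 1 < len(tokens) else None
        is_boundary = tag == "V" or (
            tag == "P" and next_tag == "D"
            and word.lower() not in KEPT_PREPOSITIONS
        )
        if is_boundary:
            if current:
                chunks.append(current)
                current = []
        else:
            current.append(word)
    if current:
        chunks.append(current)
    return chunks


# Toy French sentence (hypothetical tagging).
sentence = [("le", "D"), ("traitement", "N"), ("de", "P"), ("texte", "N"),
            ("utilise", "V"), ("un", "D"), ("analyseur", "N"),
            ("dans", "P"), ("le", "D"), ("système", "N")]
chunks = split_on_boundaries(sentence)
print(chunks)
```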

21
Approaches using statistical information

Main measures used
Frequency of occurrence
Mutual Information
C/NC value
Experiments also with the log-likelihood coefficient
(Dunning, 1993)

22
Frequency of occurrence

Simplest and most popular method
Domain independent; requires no external resources
Some filtering is used in the form of syntactic
patterns
Systems using frequency of occurrence
Dagan & Church (TERMIGHT, 1994)
Enguehard & Pantera (1994)
Lauriston (TERMINO, 1996)

23
Mutual Information

The amount of information provided by the
occurrence of the event represented by $y_i$ about
the occurrence of the event represented by $x_k$ is
defined as
$I(x_k, y_i) = \log \frac{P(x_k, y_i)}{P(x_k)\,P(y_i)}$
Fano (1961: 27-28)
This measure is about how much one word tells us
about the other.
Problems for MI come from data sparseness
Damerau (1993) and Daille (1994) used MI for the
extraction of candidate terms (only for two-word
candidate terms)
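
The sparseness problem can be made concrete with a small worked example. The sketch below computes mutual information from raw counts and, for comparison, a cubic variant in one commonly cited count-based form, log2(f_xy^3 / (f_x * f_y)); the function names, that exact form of the cubic variant and the toy counts are assumptions for illustration.

```python
import math


def mutual_information(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """log2 of P(x, y) / (P(x) * P(y)), with probabilities estimated from counts."""
    return math.log2((f_xy / n) / ((f_x / n) * (f_y / n)))


def cubic_mi(f_xy: int, f_x: int, f_y: int) -> float:
    """Cubic variant: the joint count is cubed to favour frequent pairs."""
    return math.log2(f_xy ** 3 / (f_x * f_y))


N = 1_000_000  # hypothetical corpus size in word pairs

# A pair seen once, made of two words that each occur once.
rare = dict(f_xy=1, f_x=1, f_y=1)
# A pair seen 100 times, made of two words that each occur 200 times.
frequent = dict(f_xy=100, f_x=200, f_y=200)

print(mutual_information(n=N, **rare), mutual_information(n=N, **frequent))
print(cubic_mi(**rare), cubic_mi(**frequent))
```

With these toy counts, plain mutual information ranks the one-off pair above the well-attested one, while the cubic variant reverses the order.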

24
C/NC value (Frantzi & Ananiadou)

C-value
total frequency of occurrence of string in
corpus
frequency of string as part of longer candidate
terms
number of these longer candidate terms
length of string (in number of words)
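
A minimal sketch of how these four quantities are usually combined into a C-value: the frequency of the string is discounted by its average frequency inside longer candidate terms, and the result is scaled by the log of its length. The function and parameter names and the toy numbers are assumptions for illustration.

```python
import math


def c_value(length: int, total_freq: int,
            nested_freq: int = 0, n_longer_terms: int = 0) -> float:
    """C-value of a candidate string.

    length          -- number of words in the string
    total_freq      -- total frequency of the string in the corpus
    nested_freq     -- frequency of the string as part of longer candidate terms
    n_longer_terms  -- number of those longer candidate terms
    """
    if n_longer_terms == 0:
        # The string never appears nested inside a longer candidate.
        return math.log2(length) * total_freq
    return math.log2(length) * (total_freq - nested_freq / n_longer_terms)


# Toy numbers (hypothetical).
print(c_value(length=2, total_freq=10, nested_freq=6, n_longer_terms=2))
print(c_value(length=4, total_freq=5))
```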

25
NC value

$\text{NC-value}(a) = 0.8 \cdot \text{C-value}(a) + 0.2 \cdot CF(a)$
a is the candidate term,
C-value(a) is the C-value for the candidate term
a,
CF(a) is the context factor for the candidate
term a
CF is obtained by summing the weights of its
term context words, each multiplied by its
frequency of co-occurrence with this candidate term.
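
The context factor and the weighted combination can be sketched as follows. The context-word weights are taken as given here (how they are estimated is outside this sketch), and the function names and toy values are assumptions for illustration.

```python
from typing import Dict


def context_factor(context_freqs: Dict[str, int],
                   weights: Dict[str, float]) -> float:
    """Sum over context words of (weight * frequency with the candidate)."""
    return sum(freq * weights.get(word, 0.0)
               for word, freq in context_freqs.items())


def nc_value(c_val: float, cf: float) -> float:
    """Weighted combination of C-value (0.8) and context factor (0.2)."""
    return 0.8 * c_val + 0.2 * cf


# Toy data (hypothetical): context words seen with one candidate term.
context_freqs = {"pollution": 3, "level": 2}
weights = {"pollution": 0.5, "level": 0.25}

cf = context_factor(context_freqs, weights)
print(cf, nc_value(7.0, cf))
```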

26
Hybrid approaches