Title: TKE 2005
1. Towards a Text Mining Driven Approach for Terminology Construction
- Valentina Ceausu, Sylvie Desprès
- CRIP 5, René Descartes University
2. Overview
3. Why a terminology of road accidents?
- Exploited by a case-based reasoning system
- CBR
- Case base (collection of source cases)
- Created from accident scenarios
- Accident scenarios: natural language descriptions of sets of similar accidents
- Created by experts in road safety
- New problem (target case)
- Created from accident reports
- Accident reports created by policemen
4. Scope and available resources
- Scope
- To compare cases created from accident reports with cases created from accident scenarios
- Problem: scenarios and reports are created by different communities
- Available resources
- Meta-model to represent accidents
- Ontology of road accidents (Protégé-2000)
- To solve the problem
- Create a terminology of road accidents from a set of accident reports
5. Knowledge extraction: pattern recognition algorithm
- Available corpora: 250 reports of accidents in and around Lille
- Goal: to extract knowledge from natural language corpora
- Recognition of lexical patterns
- Pattern: association of lexical types
- Nominal (Noun, Preposition, Noun)
- Verbal (Verb, Preposition, Noun)
- Input
- Annotated corpora (TreeTagger, Cordial)
- Output
- Large number of word regroupings
- Refining approaches
Extract of Accident Report
Le cycle de marque GO
SPORT conduit par M XXXXXXXXXXXXXXXXld d'Auteuil,
vient du carrefour des Anciens Combattants et se
dirige vers l'ave Robert Schuman. Au niveau du N°
31 du dit boulevard le cycle s'arrête sur le côté
droit du côté des num XXXXXXXXXXXXXXXXXe long des
véhicules en stationnement se préparant à
traverser vers le num XXXXXXXXXXXXXXXXXcycle et
sur le passage piétons. Lorsque le cycle commence
sa manoeuvre la voiture de marque Volkswagen N°
381 LTL 75 conduite par Me XXXXXXXXXXXXXXXXcule,
vient et se dirige dans le même sens de
progression que le cycle, heurte de son avant la
roue arrière du vélo. Suite au choc le cycliste
est blessé légèrement. Transport à l'hôpital
A.Paré à Boulogne par les sapeurs pompiers
locaux. Non admis. Le changement de direction
sans précaution de la part du cycliste et la non
maîtrise de son véhicule de la part de
l'automobiliste semblent être à l'origine de
l'accident.
6. Lexical patterns and corresponding regroupings
- Lexical Patterns
- Noun, Noun
- Noun, Preposition, Noun
- Noun, Preposition, Adjective
- Verb, Preposition, Noun
- Verb, Preposition, Adjective
- Corresponding regroupings
- accident, agent (accident, policeman)
- usager de route (road user)
- groupe de piéton (group of pedestrians)
- trottoir de droite (right side pavement)
- diriger vers place (direct to square)
- virer à gauche (turn left)
- virer à droite (turn right)
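The matching of lexical patterns against an annotated corpus can be sketched as follows. This is a minimal illustration, not the authors' implementation: the coarse tag names (`NOUN`, `PREP`, `ADJ`, `VERB`, `DET`), the `IGNORED_TAGS` set, and the function name `extract_regroupings` are assumptions, and the input is assumed to be already lemmatized and tagged (for example by a tagger such as TreeTagger, whose real tagset differs from the labels used here).

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (lemma, coarse part-of-speech tag); tag names are hypothetical

# The five lexical patterns listed on the slide, as tag sequences.
PATTERNS = [
    ("NOUN", "NOUN"),
    ("NOUN", "PREP", "NOUN"),
    ("NOUN", "PREP", "ADJ"),
    ("VERB", "PREP", "NOUN"),
    ("VERB", "PREP", "ADJ"),
]

# Assumption: determiners are skipped, so "diriger vers la place"
# yields the regrouping "diriger vers place".
IGNORED_TAGS = {"DET"}


def extract_regroupings(tokens: List[Token]) -> List[Tuple[str, ...]]:
    """Return every lemma sequence whose tag sequence matches a pattern."""
    kept = [tok for tok in tokens if tok[1] not in IGNORED_TAGS]
    found = []
    for start in range(len(kept)):
        for pattern in PATTERNS:
            window = kept[start:start + len(pattern)]
            if len(window) != len(pattern):
                continue  # not enough tokens left for this pattern
            if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
                found.append(tuple(lemma for lemma, _ in window))
    return found


sentence = [
    ("usager", "NOUN"), ("de", "PREP"), ("route", "NOUN"),
    ("diriger", "VERB"), ("vers", "PREP"), ("le", "DET"), ("place", "NOUN"),
]
regroupings = extract_regroupings(sentence)
print(regroupings)
```

Because every window of the text is tested against every pattern, a real corpus produces a large number of regroupings, which is why the later slides introduce refinement steps.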
7. Apriori algorithm (1/3)
- Association rules extraction
- Agrawal & Srikant, 1994
- Adaptation to text mining: Maedche & Staab, 2000
- Basic association rules algorithm
- Set of transactions
- Set of words
- véhicule, conducteur (vehicle, driver)
- Association (X => Y)
- X and Y are word regroupings
- X: conducteur (driver), Y: de véhicule (of vehicle)
8. Apriori algorithm (2/3)
- Linguistic rule: word co-occurrences
- Quality measures
- Thresholds defined by user
- Intervention of an expert to select threshold values
- Support and confidence exceed user-defined thresholds => association rule
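As an illustration of the two quality measures, the sketch below uses the standard definitions: the support of X => Y is the fraction of transactions containing both X and Y, and its confidence is that count divided by the number of transactions containing X. It is a brute-force computation over all ordered word pairs, not the level-wise candidate generation of Apriori itself, and it does not restrict candidates to lexical patterns as the slides describe. The function name, the toy transactions, and the threshold values are assumptions.

```python
from itertools import permutations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[str, str]  # (X, Y) stands for the association X => Y


def association_rules(
    transactions: List[FrozenSet[str]],
    min_support: float,
    min_confidence: float,
) -> Dict[Rule, Tuple[float, float]]:
    """Map each retained rule X => Y to its (support, confidence)."""
    n = len(transactions)
    vocabulary = set().union(*transactions)
    rules = {}
    for x, y in permutations(sorted(vocabulary), 2):
        count_x = sum(1 for t in transactions if x in t)
        count_xy = sum(1 for t in transactions if x in t and y in t)
        support = count_xy / n
        confidence = count_xy / count_x  # count_x >= 1: x comes from the vocabulary
        # Keep the rule only if both measures reach the user-defined thresholds.
        if support >= min_support and confidence >= min_confidence:
            rules[(x, y)] = (support, confidence)
    return rules


# Hypothetical transactions: one set of words per sentence.
transactions = [
    frozenset({"conducteur", "véhicule", "route"}),
    frozenset({"conducteur", "véhicule"}),
    frozenset({"conducteur", "camion"}),
    frozenset({"piéton", "trottoir"}),
]
rules = association_rules(transactions, min_support=0.5, min_confidence=0.6)
for (x, y), (support, confidence) in rules.items():
    print(f"{x} => {y}: support={support:.2f}, confidence={confidence:.2f}")
```

With these toy thresholds only the pair conducteur/véhicule survives, in both directions. Raising or lowering the thresholds changes how many rules the expert has to review, which is why the slides leave the choice of values to an expert.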
9. Apriori algorithm (3/3)
- Steps of Apriori algorithm
- Generate the association set (according to patterns)
- For each association
- Determine support
- Determine confidence
- Output association rules that exceed user-defined confidence and support thresholds
- Apriori output
- véhicule, automobile (vehicle, car)
- volant, véhicule (steering wheel, vehicle)
- conducteur, véhicule (driver, vehicle)
- conducteur, camion (driver, truck)
- conducteur, cyclomoteur (driver, moped)
- Output interpretation
- Terms of the field
- trottoir de droite (right side pavement)
- Relations
- conducteur, véhicule (driver, vehicle)
- Type of relations
- IS-A
- véhicule, automobile (vehicle, car)
- PART-OF
- volant, véhicule (steering wheel, vehicle)
- Functional
- conducteur, propriétaire (driver, owner)
- conducteur, véhicule (driver, vehicle)
- Particular form
- conducteur, camion (driver, truck)
10. Refining the set of verbal syntagms (1/4)
- Verbal syntagms: instances of verbal patterns
- Verb classes identification
- Class of verbs: a set of regroupings generated by the same verb
- Two-term regroupings: diriger vers (direct to), venir de (come from)
- Three-term regroupings
- Instances of Verb, Preposition, (Argument) patterns
- Extensions of two-term regroupings
- venir de gauche (come from left), diriger vers infrastructure (direct to infrastructure)
- Large number of three-term regroupings
- Extremely fine level of granularity
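Grouping regroupings into verb classes can be sketched as below. The nested-dictionary layout (verb, then preposition, then list of arguments) and the function name `verb_classes` are assumptions chosen for illustration; the slides do not specify a data structure.

```python
from typing import Dict, List, Tuple


def verb_classes(regroupings: List[Tuple[str, ...]]) -> Dict[str, Dict[str, List[str]]]:
    """Group verbal regroupings by verb, then by preposition.

    A two-term regrouping (verb, preposition) creates the entry;
    a three-term regrouping (verb, preposition, argument) extends it.
    """
    classes: Dict[str, Dict[str, List[str]]] = {}
    for regrouping in regroupings:
        verb, preposition = regrouping[0], regrouping[1]
        arguments = classes.setdefault(verb, {}).setdefault(preposition, [])
        if len(regrouping) == 3:
            arguments.append(regrouping[2])
    return classes


regroupings = [
    ("venir", "de"),
    ("venir", "de", "gauche"),
    ("venir", "de", "rue"),
    ("venir", "par"),
    ("diriger", "vers"),
    ("diriger", "vers", "infrastructure"),
]
classes = verb_classes(regroupings)
print(classes)
```

Each distinct argument adds a new three-term regrouping to its class, which illustrates why the number of regroupings grows quickly and the granularity becomes very fine.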
11. Refining the set of verbal syntagms (2/4)
- Using a domain model to refine the set of verbal syntagms
- Extensions of three-term associations can be organized in homogeneous lists
- Direction (direction): droite (right), gauche (left), devant (in front of)
- Lieu (place): usine (factory), parc (park), domicile (home)
- Humain (human): enfant (child), piéton (pedestrian), personne (person)
- Associating each list to a concept of the ontology of road accidents
- Ontology previously created from experts' knowledge
- Manual intervention to assign lists to concepts
12. Refining the set of verbal syntagms (3/4): Venir (to come) class
- venir de hau bourdin (come from hau bourdin)
- venir de i (come from i)
- venir de abbaye (come from abbey)
- venir de résidence (come from residence)
- venir de rue (come from street)
- venir de gauche (come from left)
- venir par (come by)
- venir par droite (come by right)
- venir vers enfant (come to child)
- Noise and instances are eliminated
- venir de lieu (come from place)
- venir de infrastructure (come from infrastructure)
- venir de direction (come from direction)
- venir par direction (come by direction)
- venir vers humain (come towards human)
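The refinement of the venir class can be sketched as a lookup of each argument in the concept lists. This is an illustrative sketch under stated assumptions: the list contents extend the examples from the slides with hypothetical members ("abbaye", "résidence", "rue"), the assignment of lists to concepts is hard-coded here whereas the slides describe it as manual, and the function name `generalize` is invented for the example.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical concept lists; in the approach described on the slides,
# the lists are assigned to ontology concepts by hand.
CONCEPT_LISTS: Dict[str, Set[str]] = {
    "direction": {"droite", "gauche", "devant"},
    "lieu": {"usine", "parc", "domicile", "abbaye", "résidence"},
    "infrastructure": {"rue"},
    "humain": {"enfant", "piéton", "personne"},
}


def generalize(
    regroupings: List[Tuple[str, str, str]],
    concept_lists: Dict[str, Set[str]],
) -> List[Tuple[str, str, str]]:
    """Replace each argument by its concept; drop arguments found in no list."""
    term_to_concept = {
        term: concept
        for concept, terms in concept_lists.items()
        for term in terms
    }
    generalized = set()
    for verb, preposition, argument in regroupings:
        concept = term_to_concept.get(argument)
        if concept is not None:  # unknown arguments are treated as noise
            generalized.add((verb, preposition, concept))
    return sorted(generalized)


venir_class = [
    ("venir", "de", "i"),
    ("venir", "de", "abbaye"),
    ("venir", "de", "résidence"),
    ("venir", "de", "rue"),
    ("venir", "de", "gauche"),
    ("venir", "par", "droite"),
    ("venir", "vers", "enfant"),
]
refined = generalize(venir_class, CONCEPT_LISTS)
print(refined)
```

In this toy run seven regroupings collapse to five, and "venir de i" disappears because "i" belongs to no list. The same mechanism would also discard a valid term that is simply missing from the lists.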
13. Refining the set of verbal syntagms (4/4)
- Decreasing the number of three-term regroupings
- Many arguments assigned to the same concept
- Eliminate parasitic regroupings and noise
- Created lists will not contain terms outside the field
- diriger vers 12 (direct to 12): 12 will not be included in a list
- Valuable regroupings may be eliminated if the created lists are incomplete
14. Text mining driven terminology construction
15. Linguistic analysis integrating text mining results
- Input of linguistic analysis phase
- Syntex and Cordial output
- Goal of this phase
- Selection of domain terms
- Identification of lexical relations
- Difficulties of this phase
- Manual treatment difficult for large corpora
- No information available to guide the selection
- To solve difficulties
- Integrate Apriori results
- Selection of terms
- Identification of lexical relations
16. Linguistic analysis
17. Normalization phase integrating text mining results
- Input of normalization phase
- Previously selected terms
- Lexical relations between terms
- Goal
- Definition of terminological concepts
- Semantic relations modeling
- Difficulties
- No information for semantic relations
- To solve difficulties
- Integrate lexical relations
- Integrate previously identified verb classes
- Integrate non-taxonomic relations provided by
Apriori
18. Formalization phase integrating text mining results
19. Conclusion
- Semi-automatic approach to build a terminology
- Construction process supported by text mining results
- Association rules results to guide selection of terms
- Lexical patterns improve work with Linguae module
- Identify non-taxonomic relations
- Results obtained are more general
- Syntex output: SE DIRIGER vers la Commune de Wahagnies (direct to Wahagnies village)
- Text mining output: diriger vers lieu (direct to a place)
- Semantic relation modeling
- Guided by verbs of domain
- Apriori output
20. Future work
- Tools in the preprocessing phase
- Definition and identification of syntactic patterns
- New heuristics to generate associations
- Using other quality measures to rank extracted rules
- Towards an automatic approach to assign lists of terms to ontology concepts
- Towards identifying functional and structural properties
21. Thank you
- ceausu_at_math-info.univ-paris5.fr