Title: Automatic term categorization by extracting knowledge from the Web
1 Automatic term categorization by extracting knowledge from the Web
- Leonardo Rigutini, Ernesto Di Iorio, Marco Ernandes and Marco Maggini
- Dipartimento di Ingegneria dell'Informazione
- Università degli studi di Siena
- rigutini,diiorio,ernandes,maggini_at_dii.unisi.it
2 Text mining
- Aims to provide semantic information for entities extracted from text documents
- Relies on thesauri, gazetteers and domain-specific lexicons
- Maintaining these resources is problematic
  - a large amount of human effort is needed to track changes and add new lexical entities
3 Term Categorization
- A key task in the text mining research area
- A lexicon, or a more articulated structure such as an ontology, can be automatically populated by associating each unknown lexical entity with one or more semantic categories
- The goal of term categorization is to label lexical entities with a set of semantic themes (disciplines, domains)
4 Lexicons
- Domain-specific lexicons have been used in several tasks
- Word-sense disambiguation
  - the semantic categories of the terms surrounding the target word help to disambiguate it
- Query expansion
  - adding semantic information to the query makes it more specific and focused, increasing the precision of the answers
5 Lexicons
- Cross-lingual text categorization
  - ontologies are used to replace particular entities with their semantic category, thus reducing the temporal and geographic dependency of the document content
  - entities like proper names or brand names depend on the country and on the time in which the document was produced
  - replacing them with their semantic category (politician or singer, computer or shoe manufacturer) improves the categorization of text documents
6 Automatic term categorization
- Several attempts to address the problem of automatically expanding ontologies and thesauri have been proposed in the literature
- F. Cerbah [1] proposed two possible approaches to the problem
  - Exogenous, where the sense of a term is inferred from the context in which it appears
  - Endogenous, in which the sense of a term relies only on statistical information extracted from the sequence of characters constituting the entity
[1] F. Cerbah, "Exogenous and endogenous approaches to semantic categorization of unknown technical terms", in Proceedings of the 18th International Conference on Computational Linguistics (COLING)
7 Automatic term categorization
- Sebastiani et al. proposed an exogenous approach that treated the problem as the dual of text categorization [2]
  - the Reuters corpus provided the knowledge base for classifying terms
  - they tried to replicate the WordNet Domains ontology, selecting only the terms appearing in the Reuters corpus
- Their approach showed low F1 values
  - high precision but a very small recall (0.4)
[2] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani, R. Zanoli, "Expanding domain-specific lexicons by term categorization", in Proceedings of the 2003 ACM Symposium on Applied Computing (SAC '03)
8 The proposed system
- We propose a system to automatically categorize entities that exploits the Web to build an enriched representation of each entity, the Entity Context Lexicon (ECL)
  - the ECL is a list of all the words appearing in the context of the entity
  - for each word, some statistics are stored (term frequency, snippet frequency, etc.)
  - basically, an ECL is a bag-of-words representation of the words appearing in the context of the entity (a toy sketch of this structure follows below)
- The idea is that entities of the same semantic category should appear in similar contexts
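As a rough illustration (not part of the original slides), an ECL can be represented as a mapping from context words to their counts; the field names tf and sf and the whitespace tokenization below are assumptions made for the sketch.

    from collections import defaultdict

    def build_ecl(snippets, stop_words=frozenset()):
        """Toy Entity Context Lexicon: word -> {"tf": term frequency over all
        snippets, "sf": number of snippets containing the word}."""
        ecl = defaultdict(lambda: {"tf": 0, "sf": 0})
        for snippet in snippets:
            words = [w for w in snippet.lower().split() if w not in stop_words]
            for w in words:                 # term frequency
                ecl[w]["tf"] += 1
            for w in set(words):            # snippet frequency
                ecl[w]["sf"] += 1
        return dict(ecl)

    # Tiny usage example with two hand-written snippets for the entity "Siena"
    snippets = ["Siena is a city in Tuscany", "the University of Siena is in Tuscany"]
    print(build_ecl(snippets, stop_words={"is", "a", "in", "the", "of"}))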
9 System description
- The term classification system is composed of two modules
  - the training module is used to train the classifier from a set of labeled examples
  - the entity classification module is applied to predict the appropriate category for a given input entity
- Both modules exploit the Web to build the ECL representation of the entities
- They are composed of a few sub-modules
  - two ECL generators, which build the ECLs
  - the classifier, which is trained to classify unknown ECLs
10 System block diagram
11 The ECL generator
- We chose to use the Web as the knowledge base for building the ECLs
- When the entity is submitted as a query, the snippets returned by the search engine report the contexts in which the query terms appear
- The ECL of an entity e is simply the set of context terms extracted from the snippets
12 The ECL generator
- Given an entity e
  - it is submitted as a query to a search engine
  - the top-scored S snippets are collected
  - the terms in the snippets are used to build the ECL
  - for each word, the term frequency and the snippet frequency are stored
- To avoid including non-significant terms, a stop-word list or a feature selection technique can be used (a sketch of this generation loop follows below)
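A minimal sketch of the generation loop just described, assuming a hypothetical fetch_snippets helper that wraps whatever search engine API is available (the experiments later in the deck use Google with S = 10) and reusing the toy build_ecl counter sketched earlier.

    def fetch_snippets(query, s=10):
        """Hypothetical placeholder: return the top-s snippets returned by a
        search engine for `query`; no specific API is implied here."""
        raise NotImplementedError

    def ecl_for_entity(entity, s=10, stop_words=frozenset()):
        # 1. submit the entity as a query and keep the top-scored S snippets
        snippets = fetch_snippets(entity, s)
        # 2. count term and snippet frequencies, dropping stop-words
        return build_ecl(snippets, stop_words)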
13 The classifier
- Each entity e is characterized by the corresponding ECL_e
  - thus a set of labeled ECLs can be used to train an automatic classifier
  - the trained classifier can then be used to label unlabeled ECLs
- The most common classifier models can be used
  - SVM, Naive Bayes, Complement Naive Bayes and profile-based models (e.g. Rocchio); a hedged training sketch follows below
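As an illustration of this step (the slides do not specify an implementation), labeled ECLs can be turned into sparse vectors and fed to any off-the-shelf model; the choice of scikit-learn's DictVectorizer and ComplementNB below is an assumption.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import ComplementNB

    def train_on_ecls(labeled_ecls):
        """labeled_ecls: list of (ecl, category) pairs, where each ecl is the
        word -> {"tf", "sf"} dict built by the ECL generator sketch."""
        X_dicts = [{w: s["tf"] for w, s in ecl.items()} for ecl, _ in labeled_ecls]
        y = [label for _, label in labeled_ecls]
        vectorizer = DictVectorizer()
        X = vectorizer.fit_transform(X_dicts)
        return vectorizer, ComplementNB().fit(X, y)

    def classify_ecl(vectorizer, clf, ecl):
        X = vectorizer.transform([{w: s["tf"] for w, s in ecl.items()}])
        return clf.predict(X)[0]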
14 The CCL classifier
- Following the idea that similar entities appear in similar contexts, we introduce a new type of profile-based classifier
  - a profile for each class is built by merging the training ECLs associated with that class
  - a weight is computed for each term in the profile using a weighting function W
  - the resulting lexicon is called the Class Context Lexicon (CCL)
  - a similarity function is used to measure the similarity between an unlabeled ECL and each CCL
- When an unlabeled ECL is passed to the classifier
  - it is assigned to the class with the highest similarity score (see the sketch below)
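A compact sketch of the CCL idea under the simplest possible choices: the weighting function W is taken to be the summed term frequency and the similarity is cosine; the functions actually explored in the paper are listed on the next slide.

    import math
    from collections import defaultdict

    def build_ccls(labeled_ecls):
        """Merge the training ECLs of each class into a Class Context Lexicon:
        class -> {word: weight}. Here W is simply the summed term frequency."""
        ccls = defaultdict(lambda: defaultdict(float))
        for ecl, label in labeled_ecls:
            for w, stats in ecl.items():
                ccls[label][w] += stats["tf"]
        return {c: dict(profile) for c, profile in ccls.items()}

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def classify(ecl, ccls):
        """Assign an unlabeled ECL to the class whose CCL is most similar."""
        vec = {w: stats["tf"] for w, stats in ecl.items()}
        return max(ccls, key=lambda c: cosine(vec, ccls[c]))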
15 The CCL classifier
- Weighting functions
  - tf
  - tf-idf
  - snippet-frequency inverse class frequency (sficf), which assigns a high score to a word that is very frequent in one class and infrequent in the remaining classes (an assumed form is sketched below)
- Similarity functions
  - Euclidean similarity
  - Cosine similarity
  - Gravity similarity
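The slide only describes sficf qualitatively and does not give its formula or the definition of the gravity similarity, so the snippet below is just one plausible reading: snippet frequency within the class scaled by a log inverse class frequency. It is an assumption, not the paper's exact definition, and gravity similarity is not sketched.

    import math

    def sficf(word, cls, class_sf, classes):
        """Assumed form of snippet-frequency inverse class frequency.
        class_sf: class -> {word: snippet frequency summed over its training ECLs}."""
        sf = class_sf[cls].get(word, 0)
        n_with_word = sum(1 for c in classes if word in class_sf[c])
        icf = math.log(len(classes) / n_with_word) if n_with_word else 0.0
        return sf * icf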
16 Experimental results
- We selected 8 categories
  - soccer, music, location, computer, politics, food, philosophy, medicine
- For each category we collected predefined gazetteers from the Web and sampled 200 entities per class. We performed tests varying the size of the learning set L_M, where M indicates the number of learning entities per class
- We used Google as the search engine and set the number of snippets used to build each ECL to 10 (S = 10)
17 Experimental results
- We tested all the classifiers listed previously
  - SVM, NB, CNB and CCL
  - and used F1 values to measure the performance of the system
- First, we tested the CCL classifier with every combination of the weighting and similarity functions listed previously
- We selected the CCL configuration with the best performance and then compared it with the SVM, NB and CNB classifiers
18 Performance using the CCL classifiers
- We selected the CCL-sficf-gravity configuration as the best-performing CCL classifier
19 Overall performance
- The CNB classifier showed the best performance, although the CCL model results are comparable
20 Conclusions
- We propose a system for Web-based term categorization oriented towards automatic thesaurus construction
- The idea is that terms from the same semantic category should appear in very similar contexts, i.e. contexts containing approximately the same words
  - the system builds an Entity Context Lexicon (ECL) for each entity using the Web as the knowledge base
  - this enriched representation is used to train an automatic classifier
- We tested the most common classifier models (SVM, Naive Bayes and Complement Naive Bayes)
- Moreover, we propose a profile-based classifier, called CCL, that builds the class profiles by merging the learning ECLs
21 Conclusions
- The experimental results show that the CNB classifier achieves the best performance
- However, the CCL classifier results are very promising and comparable with those of CNB
- Additional tests are planned to consider a multi-label classification task and to verify the robustness of the system in out-of-topic cases