Title: Automatic term categorization by extracting knowledge from the Web
1 Automatic term categorization by extracting knowledge from the Web
- Leonardo Rigutini, Ernesto Di Iorio, Marco Ernandes and Marco Maggini
- Dipartimento di Ingegneria dell'Informazione
- Università degli studi di Siena
- rigutini,diiorio,ernandes,maggini_at_dii.unisi.it
2 Text mining
- Aims to provide semantic information for entities extracted from text documents
- Relies on thesauri, gazetteers and domain-specific lexicons
- Maintaining these resources is problematic
  - a large amount of human effort is needed to track changes and add new lexical entities
3 Term Categorization
- A key task in the text mining research area
- A lexicon, or a more articulated structure such as an ontology, can be automatically populated by associating each unknown lexical entity with one or more semantic categories
- The goal of term categorization is to label lexical entities with a set of semantic themes (disciplines, domains)
4 Lexicons
- Domain-specific lexicons have been used in several tasks
- Word-sense disambiguation
  - the semantic categories of the terms surrounding the target word help to disambiguate it
- Query expansion
  - adding semantic information to the query makes it more specific and focused, increasing the precision of the answers
5 Lexicons
- Cross-lingual text categorization
  - ontologies are used to replace particular entities with their semantic category, thus reducing the temporal and geographic dependency of the document content
  - entities like proper names or brand names depend on the country and on the time in which the document was produced
  - replacing them with their semantic category (politician or singer, computer or shoe manufacturer) improves the categorization of text documents
6 Automatic term categorization
- Several attempts to address the problem of automatically expanding ontologies and thesauri have been proposed in the literature
- F. Cerbah [1] proposed two possible approaches to the problem
  - Exogenous, where the sense of a term is inferred from the context in which it appears
  - Endogenous, in which the sense of a term relies only on statistical information extracted from the sequence of characters constituting the entity
[1] F. Cerbah, "Exogenous and endogenous approaches to semantic categorization of unknown technical terms", in Proceedings of the 18th International Conference on Computational Linguistics (COLING)
7 Automatic term categorization
- Sebastiani et al. proposed an exogenous approach that treated the problem as the dual of text categorization [2]
  - the Reuters corpus provided the knowledge base for classifying terms
  - they tried to replicate the WordNet Domains ontology, selecting only the terms appearing in the Reuters corpus
- Their approach showed low F1 values
  - high precision but a very small recall (0.4)
[2] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani, R. Zanoli, "Expanding domain-specific lexicons by term categorization", in Proceedings of the 2003 ACM Symposium on Applied Computing (SAC '03)
8 The proposed system
- We propose a system to automatically categorize entities that exploits the Web to build an enriched representation of each entity, the Entity Context Lexicon (ECL)
  - the ECL is a list of all the words appearing in the context of the entity
  - for each word, some statistics are stored (term frequency, snippet frequency, etc.)
  - basically, an ECL is a bag-of-words representation of the words appearing in the context of the entity (a toy sketch of this structure follows below)
- The idea is that entities of the same semantic category should appear in similar contexts
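As a rough illustration (not part of the original slides), an ECL can be represented as a mapping from context words to their counts; the field names tf and sf and the whitespace tokenization below are assumptions made for the sketch.

    from collections import defaultdict

    def build_ecl(snippets, stop_words=frozenset()):
        """Toy Entity Context Lexicon: word -> {"tf": term frequency over all
        snippets, "sf": number of snippets containing the word}."""
        ecl = defaultdict(lambda: {"tf": 0, "sf": 0})
        for snippet in snippets:
            words = [w for w in snippet.lower().split() if w not in stop_words]
            for w in words:                 # term frequency
                ecl[w]["tf"] += 1
            for w in set(words):            # snippet frequency
                ecl[w]["sf"] += 1
        return dict(ecl)

    # Tiny usage example with two hand-written snippets for the entity "Siena"
    snippets = ["Siena is a city in Tuscany", "the University of Siena is in Tuscany"]
    print(build_ecl(snippets, stop_words={"is", "a", "in", "the", "of"}))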
9 System description
- The term classification system is composed of two modules
  - the training module is used to train the classifier from a set of labeled examples
  - the entity classification module is applied to predict the appropriate category for a given input entity
- Both modules exploit the Web to build the ECL representation of the entities
- They are composed of a few sub-modules
  - two ECL generators, which build the ECLs
  - the classifier, which is trained to classify unknown ECLs
10 System block diagram
11 The ECL generator
- We chose to use the Web as the knowledge base for building the ECLs
- When the entity is submitted as a query, the snippets returned by the search engine report the contexts in which the query terms appear
- The ECL of an entity e is simply the set of context terms extracted from the snippets
12 The ECL generator
- Given an entity e
  - it is submitted as a query to a search engine
  - the top-scored S snippets are collected
  - the terms in the snippets are used to build the ECL
  - for each word, the term frequency and the snippet frequency are stored
- To avoid including non-significant terms, a stop-word list or a feature selection technique can be used (a sketch of this generation loop follows below)
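A minimal sketch of the generation loop just described, assuming a hypothetical fetch_snippets helper that wraps whatever search engine API is available (the experiments later in the deck use Google with S = 10) and reusing the toy build_ecl counter sketched earlier.

    def fetch_snippets(query, s=10):
        """Hypothetical placeholder: return the top-s snippets returned by a
        search engine for `query`; no specific API is implied here."""
        raise NotImplementedError

    def ecl_for_entity(entity, s=10, stop_words=frozenset()):
        # 1. submit the entity as a query and keep the top-scored S snippets
        snippets = fetch_snippets(entity, s)
        # 2. count term and snippet frequencies, dropping stop-words
        return build_ecl(snippets, stop_words)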
13 The classifier
- Each entity e is characterized by the corresponding ECL_e
  - thus a set of labeled ECLs can be used to train an automatic classifier
  - the trained classifier can then be used to label unlabeled ECLs
- The most common classifier models can be used
  - SVM, Naive Bayes, Complement Naive Bayes and profile-based models (e.g. Rocchio); a hedged training sketch follows below
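As an illustration of this step (the slides do not specify an implementation), labeled ECLs can be turned into sparse vectors and fed to any off-the-shelf model; the choice of scikit-learn's DictVectorizer and ComplementNB below is an assumption.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.naive_bayes import ComplementNB

    def train_on_ecls(labeled_ecls):
        """labeled_ecls: list of (ecl, category) pairs, where each ecl is the
        word -> {"tf", "sf"} dict built by the ECL generator sketch."""
        X_dicts = [{w: s["tf"] for w, s in ecl.items()} for ecl, _ in labeled_ecls]
        y = [label for _, label in labeled_ecls]
        vectorizer = DictVectorizer()
        X = vectorizer.fit_transform(X_dicts)
        return vectorizer, ComplementNB().fit(X, y)

    def classify_ecl(vectorizer, clf, ecl):
        X = vectorizer.transform([{w: s["tf"] for w, s in ecl.items()}])
        return clf.predict(X)[0]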
14 The CCL classifier
- Following the idea that similar entities appear in similar contexts, we introduce a new type of profile-based classifier
  - a profile for each class is built by merging the training ECLs associated with that class
  - a weight is computed for each term in the profile using a weighting function W
  - the resulting lexicon is called the Class Context Lexicon (CCL)
  - a similarity function is used to measure the similarity between an unlabeled ECL and each CCL
- When an unlabeled ECL is passed to the classifier
  - it is assigned to the class with the highest similarity score (see the sketch below)
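A compact sketch of the CCL idea under the simplest possible choices: the weighting function W is taken to be the summed term frequency and the similarity is cosine; the functions actually explored in the paper are listed on the next slide.

    import math
    from collections import defaultdict

    def build_ccls(labeled_ecls):
        """Merge the training ECLs of each class into a Class Context Lexicon:
        class -> {word: weight}. Here W is simply the summed term frequency."""
        ccls = defaultdict(lambda: defaultdict(float))
        for ecl, label in labeled_ecls:
            for w, stats in ecl.items():
                ccls[label][w] += stats["tf"]
        return {c: dict(profile) for c, profile in ccls.items()}

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a if w in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def classify(ecl, ccls):
        """Assign an unlabeled ECL to the class whose CCL is most similar."""
        vec = {w: stats["tf"] for w, stats in ecl.items()}
        return max(ccls, key=lambda c: cosine(vec, ccls[c]))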
15 The CCL classifier
- Weighting functions
  - tf
  - tf-idf
  - snippet-frequency inverse class frequency (sficf), which assigns a high score to a word that is very frequent in one class and infrequent in the remaining classes (an assumed form is sketched below)
- Similarity functions
  - Euclidean similarity
  - Cosine similarity
  - Gravity similarity
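The slide only describes sficf qualitatively and does not give its formula or the definition of the gravity similarity, so the snippet below is just one plausible reading: snippet frequency within the class scaled by a log inverse class frequency. It is an assumption, not the paper's exact definition, and gravity similarity is not sketched.

    import math

    def sficf(word, cls, class_sf, classes):
        """Assumed form of snippet-frequency inverse class frequency.
        class_sf: class -> {word: snippet frequency summed over its training ECLs}."""
        sf = class_sf[cls].get(word, 0)
        n_with_word = sum(1 for c in classes if word in class_sf[c])
        icf = math.log(len(classes) / n_with_word) if n_with_word else 0.0
        return sf * icf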
16 Experimental results
- We selected 8 categories
  - soccer, music, location, computer, politics, food, philosophy, medicine
- For each category we collected predefined gazetteers from the Web and sampled 200 entities per class. We performed tests varying the size of the learning set L_M, where M indicates the number of learning entities per class
- We used Google as the search engine and set the number of snippets used to build each ECL to 10 (S = 10)
17 Experimental results
- We tested all the classifiers listed previously
  - SVM, NB, CNB and CCL
  - and used F1 values to measure the performance of the system
- First, we tested the CCL classifier with every combination of the weighting and similarity functions listed previously
- We selected the CCL configuration with the best performance and then compared it with the SVM, NB and CNB classifiers
18 Performance using the CCL classifiers
- We selected the CCL-sficf-gravity configuration as the best-performing CCL classifier
19 Overall performance
- The CNB classifier showed the best performance, although the CCL model results are comparable
20 Conclusions
- We propose a system for Web-based term categorization oriented towards automatic thesaurus construction
- The idea is that terms from the same semantic category should appear in very similar contexts, i.e. contexts containing approximately the same words
  - the system builds an Entity Context Lexicon (ECL) for each entity using the Web as the knowledge base
  - this enriched representation is used to train an automatic classifier
- We tested the most common classifier models (SVM, Naive Bayes and Complement Naive Bayes)
- Moreover, we propose a profile-based classifier, called CCL, that builds the class profiles by merging the learning ECLs
21 Conclusions
- The experimental results show that the CNB classifier achieves the best performance
- However, the CCL classifier results are very promising and comparable with those of CNB
- Additional tests are planned to consider a multi-label classification task and to verify the robustness of the system in out-of-topic cases