Title: Integrate Text Clustering Features in Text Categorization System
1 Integrate Text Clustering Features in Text Categorization System
- ZHOU Qiang, ZHENG Yabin
- Computer Science and Artificial Intelligence Division, National Laboratory for Information Science and Technology
- Dept. of Computer Science and Technology
- Tsinghua University, Beijing 100084
2 Outline
- Background researches
- Basic algorithms
- The Integration algorithm
- Experimental results
- Conclusions
3 Automatic Text Categorization
- Task description
  - To build software tools capable of classifying text documents under predefined categories or topic codes.
- Dominant techniques: machine learning and statistical models
  - Rocchio method [2]
  - Naïve Bayes model [3]
  - k-NN algorithm [4]
  - Decision tree model [5]
  - Neural network [6]
  - Support Vector Machine (SVM) [7]
- A comparative study by Yiming Yang [9]
  - Among the above ML models, SVM and k-NN show the best categorization performance
4 A New Idea for ATC
- Two key techniques for a typical ATC system
  - How to select suitable, highly discriminative features to represent the topical characteristics of a document?
  - How to aggregate these features into several suitable categories for the document?
- Our method: integrate supportive features extracted from a text clustering model to improve the classification performance of an ATC system
  - Text categorization algorithm: FIFA
    - Developed by Dr. Jingbo Zhu at Northeast Univ., China
  - Text clustering algorithm: MPCA
    - Developed by Dr. Wray Buntine at HIIT, Finland
5 Basic Algorithm 1: FIFA
- A supervised text categorization algorithm
- Basic functions: Features Identification and Features Aggregation
- Based on a large-scale knowledge base with about 400,000 entries manually annotated with detailed topic and domain information.

[Diagram: Chinese Texts → FIFA → Topic Tags]
6 Output of FIFA
- Gives 10 topic tags and their weights for a document
  - The top-3 topic tags are reliable
  - The other, lower-weighted tags may be noise
- About 900 topic tags are used in the FIFA algorithm
7 Basic Algorithm 2: MPCA
- An unsupervised text clustering algorithm
- Basic function: a multinomial variation of discrete Principal Component Analysis

[Diagram: Text corpora → MPCA → Clusters, with cluster number N ≥ 2]
8 Output of MPCA
- Gives several clusters and, for each, the documents it contains along with their probabilities
  - The documents with the highest probabilities are reliable
  - The documents with the lowest probabilities may be useless
9 The Integration Algorithm
- Topic pre-selection
- Document pre-selection
- Cluster labeling with suitable topic tags
- Topic information feedback from the clusters
- Information combination
10 Step 1: Topic pre-selection
- Goal: to generate a good DBT
  - DBT: a Document-By-Topic matrix
  - Each element (DBT)ij represents the probability of document Di having topic Tj
- Method: remove noise based on the following heuristics
  - If a document has one salient topic tag, its other tags can be regarded as noise.
  - If a topic appears in almost all the documents, or in only a small fraction of them, it can be regarded as a noisy topic.
  - Thresholds were set for the above removal operations
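The two denoising heuristics can be sketched in NumPy as follows. The threshold values (`salient_ratio`, `min_df`, `max_df`) and the function name are illustrative assumptions; the slides do not give the actual thresholds used.

```python
import numpy as np

def topic_preselect(dbt, salient_ratio=0.8, min_df=0.05, max_df=0.95):
    """Denoise a Document-By-Topic matrix with the two heuristics above.

    Threshold values are illustrative guesses, not the paper's settings.
    """
    dbt = dbt.copy().astype(float)
    n_docs, _ = dbt.shape
    # Heuristic 1: if one topic holds most of a document's probability
    # mass, treat every other tag of that document as noise.
    for i in range(n_docs):
        top = dbt[i].max()
        if top >= salient_ratio * dbt[i].sum():
            dbt[i, dbt[i] < top] = 0.0
    # Heuristic 2: topics appearing in almost all documents, or in almost
    # none, carry little discriminative information -- drop them entirely.
    df = (dbt > 0).mean(axis=0)          # document frequency per topic
    noisy = (df < min_df) | (df > max_df)
    dbt[:, noisy] = 0.0
    return dbt
```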
11 Step 2: Document pre-selection
- Goal: to generate a good DBC
  - DBC: a Document-By-Cluster matrix
  - Each element (DBC)ij represents the probability that document Di belongs to cluster Cj
- Method
  - Remove the documents with low probabilities from a cluster.
  - The threshold is set to the average of the probabilities in the cluster.
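Since the threshold is simply the per-cluster average, this step is a one-line filter; a minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def doc_preselect(dbc):
    """Zero out documents whose membership probability in a cluster
    falls below that cluster's average probability (the Step 2 rule)."""
    dbc = dbc.astype(float)
    avg = dbc.mean(axis=0)               # per-cluster average, shape (n_clusters,)
    return np.where(dbc >= avg, dbc, 0.0)
```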
12 Step 3: Label the clusters
- Goal: to compute a CBT
  - CBT: a Cluster-By-Topic matrix
  - Each element (CBT)ij represents the probability of cluster Ci having topic Tj
- Computational formula: CBT = DBC^T × DBT
- Where
  - DBC is the Document-By-Cluster matrix
  - DBT is the Document-By-Topic matrix
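The labeling step multiplies the two matrices from Steps 1 and 2. A minimal sketch; the row normalisation (so each cluster's topic weights sum to one) is an assumption, since the slides show only the matrix product:

```python
import numpy as np

def cluster_by_topic(dbc, dbt):
    """Step 3: CBT = DBC^T x DBT, mapping (docs x clusters) and
    (docs x topics) to a (clusters x topics) matrix, then row-normalised
    (normalisation assumed, not stated on the slide)."""
    cbt = dbc.T @ dbt
    row_sums = cbt.sum(axis=1, keepdims=True)
    # Avoid division by zero for clusters that lost all their documents.
    return np.divide(cbt, row_sums, out=np.zeros_like(cbt), where=row_sums > 0)
```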
13 Step 4: Topic information feedback
- Goal: to compute an FDBT
  - FDBT: a Feedback Document-By-Topic matrix
  - Each element (FDBT)ij represents the fed-back probability of document Di having topic Tj
- Computational formula: FDBT = DBC × CBT
- Where
  - DBC is the Document-By-Cluster matrix
  - CBT is the Cluster-By-Topic matrix
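The feedback step propagates each cluster's topic distribution back to its member documents, weighted by cluster membership; it is again a single matrix product (function name illustrative):

```python
import numpy as np

def feedback_dbt(dbc, cbt):
    """Step 4: FDBT = DBC x CBT, mapping (docs x clusters) and
    (clusters x topics) to a feedback (docs x topics) matrix."""
    return dbc @ cbt
```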
14 Step 5: Information combination
- Goal: to integrate the following information
  - The original FIFA topic tags
  - The feedback topic tags from the labeled clusters
- Computational formula: DBT' = α × DBT + β × FDBT
- Where
  - DBT is the original Document-By-Topic matrix
  - FDBT is the Feedback Document-By-Topic matrix
  - α and β are parameters for tuning the weight between the original and feedback information.
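The combination is a weighted sum of the two document-by-topic matrices. The α/β values below are illustrative only; the slides say these parameters are tuned, not what they are:

```python
import numpy as np

def combine(dbt, fdbt, alpha=0.7, beta=0.3):
    """Step 5: blend the original FIFA topic weights with the feedback
    weights from the labeled clusters. alpha/beta here are placeholder
    values, not the tuned parameters from the experiments."""
    return alpha * dbt + beta * fdbt
```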
15 Outline
- Background researches
- Basic algorithms
- The Integration algorithm
- Experimental results
- Conclusions
16 Experiment 1: the 863 data set
- Uses the Chinese library classification system
  - 37 domain tags and 885 topic tags
  - Designed mainly for human experts, so there are many ambiguous boundaries between different tags
  - It may not be suitable for ATC.
- 3600 articles were manually annotated with different domain tags
- Automatic mapping from FIFA's topic tags to the domain tags used in the 863 tag set
  - 885 topic tags → 37 domain tags
17 Experimental result (1): 863 data set
- Initial FIFA vs. result after topic pre-selection
18 Experimental result (2): 863 data set
- Result after topic pre-selection vs. result after feedback and information combination
19 863 data set: some examples
- Correct domain tag: philosophy
20 863 data set: some examples
[Example output of the Integration Algorithm]
21 863 data set: some examples
- Ambiguous tags: Agriculture, Business, …
22 863 data set: some examples
[Example output of the Integration Algorithm]
23 Experimental result analysis
- ATC precision improvement
  - After topic pre-selection: 46.4 → 54.9 (top-1 tag)
  - After feedback and combination: 54.9 → 57.4
- The result is not very good and falls far below our expectation, for the following reasons
  - FIFA uses a very large topic tag set (about 900 tags)
    - This is a big challenge for an ATC system
    - Even for a human expert, it is not an easy task
  - MPCA is sensitive to its cluster number N
    - We set N = 20 in our experiment
  - There is an information gap between FIFA and MPCA
- Question: can the new algorithm perform better on a smaller tag set?
24 Experiment 2: the FD data set
- Uses a topic tag set with 9 tags
  - computer, traffic, education, economy, military, sports, medicine, arts, and politics
- Data set: 2615 articles extracted from the Internet and manually annotated with the above 9 topic tags
  - Training set: 450 articles
  - Test set: 2165 articles
- Processing procedure: stages 2-4 of our algorithm
  - Use the MPCA algorithm to cluster all 2615 documents into 9 clusters
  - Label the clusters with suitable tags based on the correct tag information of the training documents they contain
  - Feed the cluster tag back to all the test documents in the cluster.
25 Experimental result: precision
26 Experimental result: recall
27 Experimental result: F1-measure
28 Experimental result: overall
[Charts: overall performance of INT-ALG]
29 Conclusions
- We proposed a new algorithm to integrate text clustering features into a text categorization system
- The integrated algorithm showed a small performance improvement on a larger topic tag set
- Its overall performance lies between those of SVM and KNN on a smaller topic tag set
- New techniques will be explored to further improve the performance of the integration algorithm in the future