Title: Integrate Text Clustering Features in Text Categorization System
1 Integrate Text Clustering Features in Text Categorization System
- ZHOU Qiang, ZHENG Yabin
- Computer Science and Artificial Intelligence Division, National Laboratory for Information Science and Technology
- Dept. of Computer Science and Technology
- Tsinghua University, Beijing 100084
2 Outline
- Background researches
- Basic algorithms
- The Integration algorithm
- Experimental results
- Conclusions
3 Automatic Text Categorization
- Task description
  - To build software tools capable of classifying text documents under predefined categories or topic codes.
- Dominant techniques: machine learning and statistical models
  - Rocchio method [2]
  - Naïve Bayes model [3]
  - k-NN algorithm [4]
  - Decision tree model [5]
  - Neural network [6]
  - Support Vector Machine (SVM) [7]
- A comparative study by Yiming Yang [9]
  - Among the above ML models, SVM and k-NN show the best categorization performance
4 A New Idea for ATC
- Two key techniques for a typical ATC system
  - How to select suitable, highly discriminative features to represent the topical characteristics of a document?
  - How to aggregate these features into several suitable categories for the document?
- Our method: integrate supportive features extracted from a text clustering model to improve the classification performance of an ATC system
  - Text categorization algorithm: FIFA
    - Developed by Dr. Jingbo Zhu at Northeast Univ., China
  - Text clustering algorithm: MPCA
    - Developed by Dr. Wray Buntine at HIIT, Finland
5 Basic Algorithm 1: FIFA
- A supervised text categorization algorithm
- Basic functions: Features Identification and Features Aggregation
- Based on a large-scale knowledge base with about 400,000 entries manually annotated with detailed topic and domain information.

[Diagram: Chinese Texts → FIFA → Topic Tags]
6 Output of FIFA
- Gives 10 topic tags and their weights for a document
  - The top-3 topic tags are reliable
  - The other, lower-weighted tags may be noise
- About 900 topic tags are used in the FIFA algorithm
7 Basic Algorithm 2: MPCA
- An unsupervised text clustering algorithm
- Basic function: a multinomial variation of discrete Principal Component Analysis

[Diagram: Text corpora → MPCA → Clusters, with cluster number N ≥ 2]
8 Output of MPCA
- Gives several clusters and, for each, the documents it contains along with their probabilities
  - The documents with the highest probabilities are reliable
  - The documents with the lowest probabilities may be useless
9 The Integration Algorithm
- Topic pre-selection
- Document pre-selection
- Cluster labeling with suitable topic tags
- Topic information feedback from the clusters
- Information combination
10 Step 1: Topic pre-selection
- Goal: to generate a good DBT
  - DBT: a Document-By-Topic matrix
  - Each element (DBT)ij represents the probability of document Di having topic Tj
- Method: remove noise based on the following heuristics
  - If a document has one salient topic tag, its other tags can be regarded as noise.
  - If a topic appears in almost all the documents, or in only a small fraction of them, it can be regarded as a noisy topic.
  - Thresholds were set for the above removal operations
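The two denoising heuristics can be sketched in NumPy as follows. The threshold values (`salient_ratio`, `min_df`, `max_df`) and the function name are illustrative assumptions; the slides do not give the actual thresholds used.

```python
import numpy as np

def topic_preselect(dbt, salient_ratio=0.8, min_df=0.05, max_df=0.95):
    """Denoise a Document-By-Topic matrix with the two heuristics above.

    Threshold values are illustrative guesses, not the paper's settings.
    """
    dbt = dbt.copy().astype(float)
    n_docs, _ = dbt.shape
    # Heuristic 1: if one topic holds most of a document's probability
    # mass, treat every other tag of that document as noise.
    for i in range(n_docs):
        top = dbt[i].max()
        if top >= salient_ratio * dbt[i].sum():
            dbt[i, dbt[i] < top] = 0.0
    # Heuristic 2: topics appearing in almost all documents, or in almost
    # none, carry little discriminative information -- drop them entirely.
    df = (dbt > 0).mean(axis=0)          # document frequency per topic
    noisy = (df < min_df) | (df > max_df)
    dbt[:, noisy] = 0.0
    return dbt
```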
11 Step 2: Document pre-selection
- Goal: to generate a good DBC
  - DBC: a Document-By-Cluster matrix
  - Each element (DBC)ij represents the probability that document Di belongs to cluster Cj
- Method
  - Remove the documents with low probabilities from a cluster.
  - The threshold is set to the average of the probabilities in the cluster.
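Since the threshold is simply the per-cluster average, this step is a one-line filter; a minimal NumPy sketch (function name is illustrative):

```python
import numpy as np

def doc_preselect(dbc):
    """Zero out documents whose membership probability in a cluster
    falls below that cluster's average probability (the Step 2 rule)."""
    dbc = dbc.astype(float)
    avg = dbc.mean(axis=0)               # per-cluster average, shape (n_clusters,)
    return np.where(dbc >= avg, dbc, 0.0)
```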
12 Step 3: Label the clusters
- Goal: to compute a CBT
  - CBT: a Cluster-By-Topic matrix
  - Each element (CBT)ij represents the probability of cluster Ci having topic Tj
- Computational formula: CBT = DBC^T × DBT
- Where
  - DBC is the Document-By-Cluster matrix
  - DBT is the Document-By-Topic matrix
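The labeling step multiplies the two matrices from Steps 1 and 2. A minimal sketch; the row normalisation (so each cluster's topic weights sum to one) is an assumption, since the slides show only the matrix product:

```python
import numpy as np

def cluster_by_topic(dbc, dbt):
    """Step 3: CBT = DBC^T x DBT, mapping (docs x clusters) and
    (docs x topics) to a (clusters x topics) matrix, then row-normalised
    (normalisation assumed, not stated on the slide)."""
    cbt = dbc.T @ dbt
    row_sums = cbt.sum(axis=1, keepdims=True)
    # Avoid division by zero for clusters that lost all their documents.
    return np.divide(cbt, row_sums, out=np.zeros_like(cbt), where=row_sums > 0)
```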
13 Step 4: Topic information feedback
- Goal: to compute an FDBT
  - FDBT: a Feedback Document-By-Topic matrix
  - Each element (FDBT)ij represents the fed-back probability of document Di having topic Tj
- Computational formula: FDBT = DBC × CBT
- Where
  - DBC is the Document-By-Cluster matrix
  - CBT is the Cluster-By-Topic matrix
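The feedback step propagates each cluster's topic distribution back to its member documents, weighted by cluster membership; it is again a single matrix product (function name illustrative):

```python
import numpy as np

def feedback_dbt(dbc, cbt):
    """Step 4: FDBT = DBC x CBT, mapping (docs x clusters) and
    (clusters x topics) to a feedback (docs x topics) matrix."""
    return dbc @ cbt
```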
14 Step 5: Information combination
- Goal: to integrate the following information
  - The original FIFA topic tags
  - The feedback topic tags from the labeled clusters
- Computational formula: DBT' = α × DBT + β × FDBT
- Where
  - DBT is the original Document-By-Topic matrix
  - FDBT is the Feedback Document-By-Topic matrix
  - α and β are parameters for tuning the weight between the original and feedback information.
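The combination is a weighted sum of the two document-by-topic matrices. The α/β values below are illustrative only; the slides say these parameters are tuned, not what they are:

```python
import numpy as np

def combine(dbt, fdbt, alpha=0.7, beta=0.3):
    """Step 5: blend the original FIFA topic weights with the feedback
    weights from the labeled clusters. alpha/beta here are placeholder
    values, not the tuned parameters from the experiments."""
    return alpha * dbt + beta * fdbt
```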
15 Outline
- Background researches
- Basic algorithms
- The Integration algorithm
- Experimental results
- Conclusions
16 Experiment 1: the 863 data set
- Uses the Chinese library classification system
  - 37 domain tags and 885 topic tags
  - Designed mainly for human experts, so there are many ambiguous boundaries between different tags
  - It may not be suitable for ATC.
- 3600 articles were manually annotated with different domain tags
- Automatic mapping from FIFA's topic tags to the domain tags used in the 863 tag set
  - 885 topic tags → 37 domain tags
17 Experimental result (1): 863 data set
- Initial FIFA vs. result after topic pre-selection
18 Experimental result (2): 863 data set
- Result after topic pre-selection vs. result after feedback and information combination
19 863 data set: some examples
- Correct domain tag: philosophy
20 863 data set: some examples
[Example output of the Integration Algorithm]
21 863 data set: some examples
- Ambiguous tags: Agriculture, Business, …
22 863 data set: some examples
[Example output of the Integration Algorithm]
23 Experimental result analysis
- ATC precision improvement
  - After topic pre-selection: 46.4 → 54.9 (top-1 tag)
  - After feedback and combination: 54.9 → 57.4
- The result is not very good and falls far below our expectation, for the following reasons
  - FIFA uses a very large topic tag set (about 900 tags)
    - This is a big challenge for an ATC system
    - Even for a human expert, it is not an easy task
  - MPCA is sensitive to its cluster number N
    - We set N = 20 in our experiment
  - There is an information gap between FIFA and MPCA
- Question: can the new algorithm perform better on a smaller tag set?
24 Experiment 2: the FD data set
- Uses a topic tag set with 9 tags
  - computer, traffic, education, economy, military, sports, medicine, arts, and politics
- Data set: 2615 articles extracted from the Internet and manually annotated with the above 9 topic tags
  - Training set: 450 articles
  - Test set: 2165 articles
- Processing procedure: stages 2-4 of our algorithm
  - Use the MPCA algorithm to cluster all 2615 documents into 9 clusters
  - Label the clusters with suitable tags based on the correct tag information of the training documents they contain
  - Feed the cluster tag back to all the test documents in the cluster.
25 Experimental result: precision
26 Experimental result: recall
27 Experimental result: F1-measure
28 Experimental result: overall
[Charts: overall performance of INT-ALG]
29 Conclusions
- We proposed a new algorithm to integrate text clustering features into a text categorization system
- The integrated algorithm showed a small performance improvement on a larger topic tag set
- Its overall performance lies between those of SVM and KNN on a smaller topic tag set
- New techniques will be explored to further improve the performance of the integration algorithm in the future