Title: Classification Algorithms for NETNEWS Articles
1Classification Algorithms for NETNEWS Articles
Master's Thesis by Wen-Lin Hsu
Advisory Committee: Dr. Sheau-Dong Lang (Chairman), Dr. Ronald Dutton, Dr. Mostafa Bassiouni
2Overview
- Introduction
- Related work
- Basic method
- Improved methods and results
- Conclusion
- Future research
3Introduction
- Text categorization (classification)
- definition
- the process of deciding the appropriate categories for a given document
- applications
- determining the area of an essay
- directing an email message to a proper folder
- routing the NETNEWS articles to their newsgroups
4Related Work
- NewsWeeder project using user feedback
- Classification of NETNEWS articles using AI learning techniques
- Automatic web page categorization for IR systems using Knowledge Based (KB) techniques
- Filtering junk mail using Bayesian approaches
5Existing Techniques
- Use batch updating in B-trees
- Use of inverted lists to update database incrementally
- Remove redundant words using corpus statistics
- Select most relevant articles to train
- Clustering
6Baseline Algorithm
- SMART (Salton, 1970s; still the best)
- Vector space model
- Term weighting scheme
- Inverted file
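As a rough illustration of the inverted file listed above, the following sketch maps each term to the newsgroups in which it occurs, together with a term frequency. The function name, the (newsgroup, text) input format, and whitespace tokenization are assumptions made for this example, not details taken from the thesis.

```python
from collections import Counter, defaultdict

def build_inverted_file(training_docs):
    """Map each term to {newsgroup: term frequency}.

    training_docs is an iterable of (newsgroup, text) pairs
    (a hypothetical input format chosen for this sketch).
    """
    inverted = defaultdict(Counter)
    for newsgroup, text in training_docs:
        # Naive tokenization: lowercase and split on whitespace.
        for term in text.lower().split():
            inverted[term][newsgroup] += 1
    return inverted

docs = [
    ("rec/sport/volleyball", "looking for open court volleyball"),
    ("rec/games/video/nintendo", "nintendo game includes box"),
    ("rec/sport/volleyball", "indoor volleyball court"),
]
inverted = build_inverted_file(docs)
postings = dict(inverted["court"])  # newsgroups containing "court"
```

Looking up a term returns only the newsgroups that contain it, so an incoming article is compared against the groups sharing at least one term rather than against every group.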
7Baseline Algorithm (contd)
8Baseline Algorithm (contd)
- Term list for each incoming article
9Baseline Algorithm (contd)
- Normalized similarity measure
- tf-idf weighting scheme
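A minimal sketch of routing with tf-idf weights and a normalized (cosine) similarity is shown below. It assumes the common tf * log(N / df) form, with N the number of newsgroups and df the number of newsgroups containing the term; the exact weighting variant used in the thesis is not stated on the slide, and all function names are hypothetical.

```python
import math
from collections import Counter

def tfidf_vector(term_counts, doc_freq, num_groups):
    """Weight each term by tf * log(N / df); terms unseen in training are dropped."""
    return {
        t: tf * math.log(num_groups / doc_freq[t])
        for t, tf in term_counts.items()
        if doc_freq.get(t, 0) > 0
    }

def cosine(u, v):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def route(article_text, group_counts):
    """Return the best-matching newsgroup and all similarity scores."""
    num_groups = len(group_counts)
    # df: in how many newsgroups does each term occur?
    doc_freq = Counter(t for counts in group_counts.values() for t in counts)
    article = tfidf_vector(Counter(article_text.lower().split()), doc_freq, num_groups)
    scores = {
        g: cosine(article, tfidf_vector(c, doc_freq, num_groups))
        for g, c in group_counts.items()
    }
    return max(scores, key=scores.get), scores

group_counts = {
    "rec/sport/volleyball": Counter("open court indoor volleyball court".split()),
    "rec/aviation/hang-gliding": Counter("fly glider launch wind".split()),
    "rec/games/video/nintendo": Counter("nintendo game box fly".split()),
}
best, scores = route("looking for open court indoor", group_counts)
```

Normalizing by vector length keeps long newsgroup vectors from dominating the ranking simply because they contain more terms.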
10Data Set
11Sample Articles
--------------------------------------------------
I found this number 1-800-IAM-RICH
rec/travel/cruise --> rec/arts/disney/animation
--------------------------------------------------
Please send email if interested. Includes box, docs and shipping in the U.S.
rec/games/video/nintendo --> rec/radio/amateur/swap
--------------------------------------------------
No, you just fly around.
rec/games/video/nintendo --> rec/aviation/hang-gliding
--------------------------------------------------
Looking for open court, indoor or out in the Wilsonville-Portland, Oregon area.
rec/sport/volleyball --> rec/sport/basketball/women
12Statistics (June, July, August 1997)
13Experimental Results
- show the competitiveness of our system
- compare our results to those reported by Weiss at Johns Hopkins, 1996
- achieve comparable results (88 vs. 89) using the same approach and similar data
- results using our data
14Methods Used to Improve the Baseline Algorithm
- 1. Batch Routing
- 2. Batch Updating
- 3. Feature Reduction
- 4. Top-k Approach
- 5. Multi-level Routing
- 6. Multiple Representatives
15Improved Method 1
- Batch routing
- slightly improves efficiency
- routing accuracy unchanged
- slightly higher storage
16Results of Batch Routing
17Improved Method 2
- Batch updating
- 2.1 adding new terms and newsgroups
- 2.2 2.1 plus removing unwanted terms and newsgroups
18Improved Method 2.1
- Batch updating
- adding new terms and newsgroups to the inverted file after n1 articles
- improves accuracy
- increases runtime and storage requirements
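The batch updating idea above could be sketched as follows: incoming articles are buffered and merged into the inverted file only once a full batch has accumulated. The class name, the batch_size parameter, and the assumption that each article arrives with its newsgroup label are illustrative choices, not details from the thesis.

```python
from collections import Counter, defaultdict

class BatchUpdater:
    """Buffer articles and merge them into the inverted file in batches."""

    def __init__(self, batch_size):
        self.batch_size = batch_size            # hypothetical update interval
        self.inverted = defaultdict(Counter)    # term -> {newsgroup: tf}
        self.pending = []                       # articles not yet merged

    def add(self, newsgroup, text):
        self.pending.append((newsgroup, text))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        # New terms and new newsgroups enter the index only at this point.
        for newsgroup, text in self.pending:
            for term in text.lower().split():
                self.inverted[term][newsgroup] += 1
        self.pending.clear()

updater = BatchUpdater(batch_size=2)
updater.add("rec/sport/volleyball", "looking for open court")
visible_after_first = "court" in updater.inverted   # still buffered
updater.add("rec/sport/volleyball", "indoor court")
visible_after_second = "court" in updater.inverted  # batch merged
```

A larger batch means fewer index rebuilds but a longer delay before new terms can influence routing, which is the trade-off between accuracy and runtime noted above.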
19Results of Batch Updating
- Routing accuracy vs. updating with new terms and
newsgroups
20Increased time and storage when new terms and newsgroups are included in the updating scheme after every 100 articles (1000 for rec).
21Improved Method 2.2
- Batch updating
- 1. adding new terms and newsgroups
- 2. removing unwanted terms and newsgroups
- improves the efficiency and storage requirements of step 1 without losing accuracy
22Routing Accuracy vs. Updates and term removal
23Overall Performance of Updating
24Improved Method 3
- Feature reduction
- reduce the size of the training set
- pre-manipulate the training data
- select articles based on their similarity values
- retrain using selected articles
- improves efficiency, storage, and accuracy
25Results for Feature Reduction
- BL: all articles in June
- I: all correctly routed articles (64 in I)
- II: articles with similarity greater than mean (32 in I)
- III: articles with similarity within one standard deviation of mean (32 in I)
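The three selection criteria (I, II, III) can be sketched as a filter over the routing results of the training articles. The tuple layout, the function name, and the use of the population standard deviation over the whole candidate pool are assumptions made for this example; the slides do not say over which set the mean and deviation are computed.

```python
from statistics import mean, pstdev

def select_training_articles(results, criterion):
    """Pick the articles to retrain on.

    results is a list of (article_id, similarity, correctly_routed)
    tuples (a hypothetical format for this sketch).
    """
    sims = [s for _, s, _ in results]
    mu, sigma = mean(sims), pstdev(sims)
    if criterion == "I":        # all correctly routed articles
        keep = lambda s, ok: ok
    elif criterion == "II":     # similarity greater than the mean
        keep = lambda s, ok: s > mu
    elif criterion == "III":    # similarity within one standard deviation of the mean
        keep = lambda s, ok: abs(s - mu) <= sigma
    else:
        raise ValueError(f"unknown criterion: {criterion}")
    return [a for a, s, ok in results if keep(s, ok)]

results = [
    ("a1", 0.9, True),
    ("a2", 0.2, False),
    ("a3", 0.6, True),
    ("a4", 0.5, True),
]
set_i = select_training_articles(results, "I")
set_ii = select_training_articles(results, "II")
set_iii = select_training_articles(results, "III")
```

Each criterion yields a smaller training set than the baseline, which is where the savings in time and storage come from.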
26Feature Reduction Updating
- BLU: all articles in June
- I: all correctly routed articles
- II: articles with similarity greater than mean
- III: articles with similarity within one standard deviation
27Improvement in Routing Accuracy with Feature
Reduction Updating
28Improved Method 4
- Top-k Approach
- re-evaluate the system performance
- give suggestions to users
- get feedback from users
- needs no extra storage
- requires more time
- improves accuracy
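One way to read the top-k approach is that an article counts as correctly routed if its true newsgroup appears among the k highest-scoring candidates, which can then be offered to the user as suggestions. The sketch below assumes similarity scores are already available per article; the function names and data layout are hypothetical.

```python
def top_k(scores, k):
    """Return the k newsgroups with the highest similarity scores."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def top_k_accuracy(scored_articles, k):
    """Fraction of articles whose true newsgroup is among the top k."""
    hits = sum(
        1 for true_group, scores in scored_articles
        if true_group in top_k(scores, k)
    )
    return hits / len(scored_articles)

scored_articles = [
    ("rec/sport/volleyball", {
        "rec/sport/volleyball": 0.8,
        "rec/sport/basketball/women": 0.6,
        "rec/travel/cruise": 0.1,
    }),
    ("rec/games/video/nintendo", {
        "rec/radio/amateur/swap": 0.7,
        "rec/games/video/nintendo": 0.5,
        "rec/travel/cruise": 0.2,
    }),
]
acc_top1 = top_k_accuracy(scored_articles, 1)
acc_top2 = top_k_accuracy(scored_articles, 2)
```

No extra index storage is needed because the same scores are reused; the added cost is the time spent ranking the candidates and asking the user.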
29Results for Top-k Approach
30Routing accuracy using the top-k ranks. Each figure shows the results w/o updates.
31Improved Method 5
- Multi-level Routing
- accuracy? efficiency? storage?
32Results for Multi-level Routing
33Improved Method 6
- Multiple representatives
- improves accuracy significantly
- requires little extra storage
- takes time to cluster articles before training
34One Representative per Newsgroup
35Two Representatives per Newsgroup
36K-means Clustering Algorithm
- 1. Randomly select k articles as the first cluster centers
- 2. Distribute the articles among the cluster centers
- 3. Update the cluster centers of the new clusters
- 4. If any cluster center has changed in this iteration, then go back to 2.
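The four steps above can be sketched as follows. The example uses dense numeric vectors and squared Euclidean distance for simplicity; the thesis may well cluster tf-idf article vectors with a different similarity measure, and the seed and max_iter parameters are additions for this sketch.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: randomly select k points as the first cluster centers.
    centers = [list(p) for p in rng.sample(points, k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Step 2: distribute the points among the nearest cluster centers.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Step 3: update each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        # Step 4: stop when no center changed, otherwise repeat from step 2.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
centers, clusters = kmeans(points, k=2)
```

Each resulting center can serve as one representative of a newsgroup, so a group with k clusters is matched against k vectors instead of one.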
37Results for Multiple Representatives
38Improvements in Accuracy using Multiple
Representatives
39Multiple Representatives Top-k Approach
- Routing accuracy for the top-10 rank
40Multiple Representatives Feature Reduction
41Multiple Representatives Updating
42Conclusion
43Future Research
- Updating
- find optimal frequencies to update
- Multiple representatives
- find optimal number of representatives for each group
- Updating Multiple representatives
- find a suitable updating scheme