Document Classification Comparison (presentation transcript)

1
Document Classification Comparison
  • Evangel Sarwar, Josh Woolever, Rebecca Zimmerman

2
Overview
  • What we did
  • How we did it
  • Results
  • Why does this matter?
  • Conclusions
  • Questions?

3
What did we do?
  • Compared the document classification accuracy of
    three pieces of software on data from 20
    newsgroups (a stand-in sketch of the comparison
    follows this slide)
  • Rainbow (Naïve Bayes)
  • C4.5 (Decision Tree)
  • Neural Network (Back-propagation)
  • Initially planned on taking a single document and
    locating other documents similar to it
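
For reference, a minimal sketch of the same kind of comparison, using present-day scikit-learn stand-ins for Rainbow (Naive Bayes) and C4.5 (a decision tree); the vectorizer settings and classifiers below are illustrative, not the software we actually ran:

    # Sketch only: scikit-learn stand-ins for the classifier comparison.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    train = fetch_20newsgroups(subset='train')   # standard 20-newsgroups split
    test = fetch_20newsgroups(subset='test')

    # Word-count features over the top 1000 words, similar to the inputs we built.
    vec = CountVectorizer(max_features=1000, stop_words='english')
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    for name, clf in [('Naive Bayes', MultinomialNB()),
                      ('Decision tree', DecisionTreeClassifier())]:
        clf.fit(X_train, train.target)
        acc = accuracy_score(test.target, clf.predict(X_test))
        print(f'{name}: {acc:.2%} test accuracy')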

4
How did we do it?
  • Used Rainbow as benchmark
  • Used it to create a model of the data
  • Was trained and tested with a common set of data
  • Used Perl scripts to separate the data into
    training/testing sets and to create input files for
    C4.5 and the neural network software (sketched
    after this slide)
  • Rainbow's ability to output word counts for the
    top N words was used to create the input files
  • Initially wanted to use word probabilities, but
    Rainbow can only produce these for whole classes,
    not for single documents
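
Our actual preprocessing was a set of Perl scripts driven by Rainbow's word counts; the Python sketch below only illustrates the shape of that step, and the directory name, split ratio, and output format are hypothetical:

    # Sketch only: split documents into train/test and write top-N word counts.
    import collections
    import os
    import random
    import re

    def load_docs(root):
        """Yield (newsgroup, text) pairs from a 20-newsgroups style directory tree."""
        for group in os.listdir(root):
            for fname in os.listdir(os.path.join(root, group)):
                with open(os.path.join(root, group, fname), errors='ignore') as f:
                    yield group, f.read()

    docs = list(load_docs('20_newsgroups'))      # hypothetical data directory
    random.shuffle(docs)
    split = int(0.8 * len(docs))                 # illustrative 80/20 split
    train_docs, test_docs = docs[:split], docs[split:]

    # Choose the top-N words over the training set, then write one row of
    # per-document counts (plus the class label) per document.
    N = 1000
    word_totals = collections.Counter(w for _, text in train_docs
                                      for w in re.findall(r'[a-z]+', text.lower()))
    vocab = [w for w, _ in word_totals.most_common(N)]

    for name, subset in [('train.data', train_docs), ('test.data', test_docs)]:
        with open(name, 'w') as out:
            for group, text in subset:
                counts = collections.Counter(re.findall(r'[a-z]+', text.lower()))
                out.write(','.join([str(counts[w]) for w in vocab] + [group]) + '\n')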

5
How did we do it? (continued)
  • Modified the image neural network from a previous
    assignment so that it would look at documents
    instead of images (a stand-in sketch follows this
    slide)
  • Needed to have 20 output nodes, one for each
    newsgroup
  • Took in 1000 words (initially at least)
  • Started with the default number of hidden nodes (4)
    and went all the way up to approximately 2000 (2x
    the number of inputs)
  • http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
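
A minimal present-day sketch of this setup, with scikit-learn's MLPClassifier standing in for the modified back-propagation code; the hidden-layer size, learning rate, and momentum below are illustrative starting points, not the exact values we used:

    # Sketch only: 1000 word-count inputs, one hidden layer, 20 output classes.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neural_network import MLPClassifier

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    vec = CountVectorizer(max_features=1000)      # 1000 input words
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    net = MLPClassifier(hidden_layer_sizes=(4,),  # started at 4 hidden units, later up to ~2000
                        solver='sgd',             # gradient-descent (back-propagation) training
                        learning_rate_init=0.1,   # one of the knobs we varied
                        momentum=0.9,             # another knob we varied
                        max_iter=50)
    net.fit(X_train, train.target)
    print(f'test accuracy: {net.score(X_test, test.target):.2%}')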

6
Results
  • The Decision Tree software got between 15% and 40%
    accuracy (depending on whether the tree was pruned
    and whether training or test data was used)
  • Training set was about 17% after pruning
  • Test set was about 40% after pruning
  • Neural Network proved to be much more difficult
    than we first thought
  • Very, very slow (on the full training data, it took
    approximately 1 hour per epoch on a 1.2 GHz Linux
    machine)
  • Accuracy did not increase over many trials
  • Spent a great amount of time experimenting with
    the various parameters
  • Learning Rate, Momentum, Hidden Units
  • Never got better than about 5% accuracy

7
Results (continued)
  • Rainbow
  • Approximately 80% accuracy
  • C4.5 and Rainbow made similar errors
  • Misclassified documents within the similar groups
    (a confusion-matrix sketch follows this slide)
  • alt.atheism, talk.religion.misc,
    talk.politics.misc
  • The comp.* groups
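
A minimal sketch of how this error pattern can be checked with a confusion matrix; the Naive Bayes classifier and vectorizer below are present-day stand-ins, not the output of the tools we actually ran:

    # Sketch only: find the pair of newsgroups that gets confused most often.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import confusion_matrix
    from sklearn.naive_bayes import MultinomialNB

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    vec = CountVectorizer(max_features=1000)
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    pred = MultinomialNB().fit(X_train, train.target).predict(X_test)
    cm = confusion_matrix(test.target, pred)     # rows = true class, columns = predicted
    np.fill_diagonal(cm, 0)                      # ignore the correct predictions
    true_i, pred_i = np.unravel_index(cm.argmax(), cm.shape)
    print('Most confused:', test.target_names[true_i], '->', test.target_names[pred_i])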

8
Why is text classification important?
  • Spam detection
  • General mail filtering into folders
  • Automatically placing documents at the proper
    location in a file system

9
Conclusions
  • Naïve Bayes empirically seems to be the best for
    classifying documents
  • At least for newsgroup data
  • Still made errors similar to C4.5, which used only
    word counts
  • If we had pre-processed the data better, perhaps by
    removing outliers and normalizing the inputs, we
    might have gotten better results with the Neural
    Network
  • Word counts alone are not enough to characterize a
    document; C4.5 seemed to create a tree that did not
    generalize well to the test data
  • Neural Networks are definitely not plug and chug;
    every application is specific and needs its own
    parameters
  • Hard to know how much data to use, or how many
    features
  • Most people don't have 10,000 emails to train
    with
  • Should investigate the minimum amount of training
    data needed for accurate results (a learning-curve
    sketch follows this slide)
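
One way to look for such a threshold is a learning curve over increasing training-set sizes; a minimal sketch, assuming a Naive Bayes classifier and illustrative size steps:

    # Sketch only: cross-validated accuracy as the amount of training data grows.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import MultinomialNB

    data = fetch_20newsgroups(subset='train')
    X = CountVectorizer(max_features=1000).fit_transform(data.data)

    sizes, _, test_scores = learning_curve(
        MultinomialNB(), X, data.target,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f'{n} training documents -> {score:.2%} held-out accuracy')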

10
Fin.
  • Questions?