Document Classification Comparison (presentation transcript)

1
Document Classification Comparison
  • Evangel Sarwar, Josh Woolever, Rebecca Zimmerman

2
Overview
  • What we did
  • How we did it
  • Results
  • Why does this matter?
  • Conclusions
  • Questions?

3
What did we do?
  • Compared the document classification accuracy of
    three pieces of software on data from 20
    newsgroups (a stand-in sketch of the comparison
    follows this slide)
  • Rainbow (Naïve Bayes)
  • C4.5 (Decision Tree)
  • Neural Network (Back-propagation)
  • Initially planned on taking a single document and
    locating other documents similar to it
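
For reference, a minimal sketch of the same kind of comparison, using present-day scikit-learn stand-ins for Rainbow (Naive Bayes) and C4.5 (a decision tree); the vectorizer settings and classifiers below are illustrative, not the software we actually ran:

    # Sketch only: scikit-learn stand-ins for the classifier comparison.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    train = fetch_20newsgroups(subset='train')   # standard 20-newsgroups split
    test = fetch_20newsgroups(subset='test')

    # Word-count features over the top 1000 words, similar to the inputs we built.
    vec = CountVectorizer(max_features=1000, stop_words='english')
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    for name, clf in [('Naive Bayes', MultinomialNB()),
                      ('Decision tree', DecisionTreeClassifier())]:
        clf.fit(X_train, train.target)
        acc = accuracy_score(test.target, clf.predict(X_test))
        print(f'{name}: {acc:.2%} test accuracy')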

4
How did we do it?
  • Used Rainbow as benchmark
  • Used it to create a model of the data
  • Was trained and tested with a common set of data
  • Used Perl scripts to separate the data into
    training/testing sets and to create input files for
    C4.5 and the neural network software (sketched
    after this slide)
  • Rainbow's ability to output word counts for the
    top N words was used to create the input files
  • Initially wanted to use word probabilities, but
    Rainbow can only produce these for whole classes,
    not for single documents
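
Our actual preprocessing was a set of Perl scripts driven by Rainbow's word counts; the Python sketch below only illustrates the shape of that step, and the directory name, split ratio, and output format are hypothetical:

    # Sketch only: split documents into train/test and write top-N word counts.
    import collections
    import os
    import random
    import re

    def load_docs(root):
        """Yield (newsgroup, text) pairs from a 20-newsgroups style directory tree."""
        for group in os.listdir(root):
            for fname in os.listdir(os.path.join(root, group)):
                with open(os.path.join(root, group, fname), errors='ignore') as f:
                    yield group, f.read()

    docs = list(load_docs('20_newsgroups'))      # hypothetical data directory
    random.shuffle(docs)
    split = int(0.8 * len(docs))                 # illustrative 80/20 split
    train_docs, test_docs = docs[:split], docs[split:]

    # Choose the top-N words over the training set, then write one row of
    # per-document counts (plus the class label) per document.
    N = 1000
    word_totals = collections.Counter(w for _, text in train_docs
                                      for w in re.findall(r'[a-z]+', text.lower()))
    vocab = [w for w, _ in word_totals.most_common(N)]

    for name, subset in [('train.data', train_docs), ('test.data', test_docs)]:
        with open(name, 'w') as out:
            for group, text in subset:
                counts = collections.Counter(re.findall(r'[a-z]+', text.lower()))
                out.write(','.join([str(counts[w]) for w in vocab] + [group]) + '\n')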

5
How did we do it? (continued)
  • Modified the image neural network from a previous
    assignment so that it would look at documents
    instead of images (a stand-in sketch follows this
    slide)
  • Needed to have 20 output nodes, one for each
    newsgroup
  • Took in 1000 words (initially at least)
  • Started with the default number of hidden nodes (4)
    and went all the way up to approximately 2000 (2x
    the number of inputs)
  • http://www.faqs.org/faqs/ai-faq/neural-nets/part3/section-10.html
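
A minimal present-day sketch of this setup, with scikit-learn's MLPClassifier standing in for the modified back-propagation code; the hidden-layer size, learning rate, and momentum below are illustrative starting points, not the exact values we used:

    # Sketch only: 1000 word-count inputs, one hidden layer, 20 output classes.
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.neural_network import MLPClassifier

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    vec = CountVectorizer(max_features=1000)      # 1000 input words
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    net = MLPClassifier(hidden_layer_sizes=(4,),  # started at 4 hidden units, later up to ~2000
                        solver='sgd',             # gradient-descent (back-propagation) training
                        learning_rate_init=0.1,   # one of the knobs we varied
                        momentum=0.9,             # another knob we varied
                        max_iter=50)
    net.fit(X_train, train.target)
    print(f'test accuracy: {net.score(X_test, test.target):.2%}')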

6
Results
  • The Decision Tree software got between 15% and 40%
    accuracy (depending on whether the tree was pruned
    and whether training or test data was used)
  • Training set was about 17% after pruning
  • Test set was about 40% after pruning
  • Neural Network proved to be much more difficult
    than we first thought
  • Very, very slow (on the full training data, it took
    approximately 1 hour per epoch on a 1.2 GHz Linux
    machine)
  • Accuracy did not increase over many trials
  • Spent a great amount of time experimenting with
    the various parameters
  • Learning Rate, Momentum, Hidden Units
  • Never got better than about 5% accuracy

7
Results (continued)
  • Rainbow
  • Approximately 80% accuracy
  • C4.5 and Rainbow made similar errors
  • Misclassified documents within the similar groups
    (a confusion-matrix sketch follows this slide)
  • alt.atheism, talk.religion.misc,
    talk.politics.misc
  • The comp.* groups
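
A minimal sketch of how this error pattern can be checked with a confusion matrix; the Naive Bayes classifier and vectorizer below are present-day stand-ins, not the output of the tools we actually ran:

    # Sketch only: find the pair of newsgroups that gets confused most often.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.metrics import confusion_matrix
    from sklearn.naive_bayes import MultinomialNB

    train = fetch_20newsgroups(subset='train')
    test = fetch_20newsgroups(subset='test')
    vec = CountVectorizer(max_features=1000)
    X_train = vec.fit_transform(train.data)
    X_test = vec.transform(test.data)

    pred = MultinomialNB().fit(X_train, train.target).predict(X_test)
    cm = confusion_matrix(test.target, pred)     # rows = true class, columns = predicted
    np.fill_diagonal(cm, 0)                      # ignore the correct predictions
    true_i, pred_i = np.unravel_index(cm.argmax(), cm.shape)
    print('Most confused:', test.target_names[true_i], '->', test.target_names[pred_i])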

8
Why is text classification important?
  • Spam detection
  • General mail filtering into folders
  • Automatically placing documents at the proper
    location in a file system

9
Conclusions
  • Naïve Bayes empirically seems to be the best for
    classifying documents
  • At least for newsgroup data
  • Still made errors similar to C4.5, which used only
    word counts
  • If we had pre-processed the data better, perhaps by
    removing outliers and normalizing the inputs, we
    might have gotten better results with the Neural
    Network
  • Word counts alone are not enough to characterize a
    document; C4.5 seemed to create a tree that did not
    generalize well to the test data
  • Neural Networks are definitely not plug and chug;
    every application is specific and needs its own
    parameters
  • Hard to know how much data to use, or how many
    features
  • Most people don't have 10,000 emails to train
    with
  • Should investigate the minimum amount of training
    data needed for accurate results (a learning-curve
    sketch follows this slide)
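
One way to look for such a threshold is a learning curve over increasing training-set sizes; a minimal sketch, assuming a Naive Bayes classifier and illustrative size steps:

    # Sketch only: cross-validated accuracy as the amount of training data grows.
    import numpy as np
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import MultinomialNB

    data = fetch_20newsgroups(subset='train')
    X = CountVectorizer(max_features=1000).fit_transform(data.data)

    sizes, _, test_scores = learning_curve(
        MultinomialNB(), X, data.target,
        train_sizes=np.linspace(0.1, 1.0, 5), cv=3)
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f'{n} training documents -> {score:.2%} held-out accuracy')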

10
Fin.
  • Questions?