Title: Text Classification With Support Vector Machines
1Text Classification With Support Vector Machines
Presenter: Aleksandar Milisic
Supervisor: Dr. David Albrecht
2Overview
- Text Classification: What and Why?
- Text Clustering
- Support Vector Machines
- Current Techniques
- Project Aim and Plan
3Text Classification: What and Why?
- Text classification: assigning documents to predefined classes (categories).
- Example: Web pages can be assigned to politics, sport, business, entertainment, etc.
- There are thousands of categories associated with web pages.
- Labeling manually is time-consuming and sometimes impossible; the process needs to be automated!
4Text Classification: What and Why?
- Automated text classifiers need to be able to learn from:
  - A small set of labeled documents
  - A large set of unlabeled documents
- Otherwise a lot of labeling would have to be done by humans.
- So how is it done?
5Representing Text
Example text: "With paperless offices becoming more common, companies start using document databases with classification schemes"
Feature vector (word: count):
Companies: 1
Document: 3
Distance: 0
...
Offices: 1
Unix: 0
Match: 0
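The word-count representation on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's implementation: the function name `feature_vector`, the six-word vocabulary, and the simple lowercase tokenizer are assumptions made for the example. The counts on the slide presumably come from a longer or differently preprocessed text than the excerpt shown, so the output here need not match them.

```python
import re
from collections import Counter

def feature_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    # Lowercase and keep alphabetic runs only; a real system would
    # probably also apply stemming and stop-word removal (assumption).
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary taken from the words visible on the slide.
vocabulary = ["companies", "document", "distance", "offices", "unix", "match"]
text = ("With paperless offices becoming more common, companies start using "
        "document databases with classification schemes")

vector = feature_vector(text, vocabulary)
print(dict(zip(vocabulary, vector)))
```

Every document is mapped onto the same fixed vocabulary, so all documents end up as vectors of equal length that a clustering method or an SVM can compare directly.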
6Clustering
[Figure: feature vectors, e.g. (1, 0, 1) and (2, 4, 0), of labeled and unlabeled documents]
7Support Vector Machines (SVM)
- Binary classifiers
- Maximize the distance (margin) between the two classes by finding the Optimal Separating Hyperplane (OSH)
- Support vectors are the points closest to the OSH
[Figure: the OSH separating Class 1 from Not Class 1, with the support vectors marked]
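To make the geometry concrete, the sketch below takes a hyperplane as given (it does not train an SVM) and measures how far each point lies from it. The points closest to the hyperplane play the role of the support vectors, and their distance is the margin. The data, the hyperplane `w`, `b`, and all function names are hypothetical values chosen for illustration.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def classify(w, b, x):
    """Binary decision: +1 on one side of the hyperplane, -1 on the other."""
    return 1 if signed_distance(w, b, x) >= 0 else -1

def closest_points(w, b, points, tol=1e-9):
    """Return the margin and the points that lie at that distance."""
    distances = [abs(signed_distance(w, b, x)) for x in points]
    margin = min(distances)
    return margin, [x for x, d in zip(points, distances) if d - margin <= tol]

# Hypothetical 2-D data: the first two points are "Class 1",
# the last two are "Not Class 1".
points = [(3.0, 3.0), (4.0, 4.0), (1.0, 1.0), (0.0, 0.0)]
w, b = (1.0, 1.0), -4.0  # assumed hyperplane x1 + x2 = 4

labels = [classify(w, b, x) for x in points]
margin, support = closest_points(w, b, points)
print(labels, margin, support)
```

For this toy data the assumed hyperplane is the perpendicular bisector of the two nearest points from opposite classes, which is why it leaves the same distance to both sides.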
8Current Techniques
- Clustering methods
  - Rasmussen's single-pass algorithm (as described by Raskutti et al. (2002))
  - Reallocation method
  - Hierarchical methods
- Classification methods
  - Support Vector Machines
  - Co-training algorithm (Blum and Mitchell, 1998)
- Raskutti et al. (2002) describe an interesting approach combining SVMs with Rasmussen's clustering algorithm
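As a rough illustration of the single-pass idea, the sketch below reads the documents once, adds each one to the most similar existing cluster if the similarity reaches a threshold, and otherwise starts a new cluster. This is a generic single-pass scheme; the use of cosine similarity, centroid representatives, and the threshold value are assumptions, and the variant described by Raskutti et al. (2002) may differ in its details.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def single_pass(vectors, threshold):
    """Cluster vectors in one pass; returns a list of clusters."""
    clusters = []  # each cluster: {"centroid": [...], "members": [indices]}
    for i, v in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(v, cluster["centroid"])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            # No cluster is similar enough: start a new one.
            clusters.append({"centroid": list(v), "members": [i]})
        else:
            # Join the best cluster and update its centroid (running mean).
            best["members"].append(i)
            n = len(best["members"])
            best["centroid"] = [(c * (n - 1) + x) / n
                                for c, x in zip(best["centroid"], v)]
    return clusters

# Hypothetical feature vectors for four documents.
vectors = [(1, 0, 1), (2, 0, 2), (0, 4, 0), (0, 3, 1)]
clusters = single_pass(vectors, threshold=0.8)
print([c["members"] for c in clusters])
```

Because each document is examined only once, the result depends on the order in which documents arrive, which is the usual trade-off for the method's speed.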
9Combining SVM With Clustering
[Figure: combining SVM with clustering through added features. Legend: labeled documents (Class 1), labeled documents (Not Class 1), unlabeled documents, support vectors, separating hyperplane]
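The slide does not say which features are added. One plausible reading, assumed here purely for illustration, is that each document's feature vector is extended with one value per cluster, such as its similarity to that cluster's centroid. The helper name `add_cluster_features` and the centroid values are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def add_cluster_features(vector, centroids):
    """Append one similarity score per cluster to the original features.

    Assumption: the added feature is the cosine similarity to each centroid.
    """
    return list(vector) + [cosine(vector, c) for c in centroids]

# Hypothetical centroids from clustering labeled and unlabeled documents.
centroids = [(2, 0, 2), (0, 4, 0)]
augmented = add_cluster_features((1, 0, 1), centroids)
print(augmented)
```

The augmented vectors could then be passed to the SVM in place of the original ones, so that information from the unlabeled documents reaches the classifier through the cluster features.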
10Project Aim
- Can combining SVMs with other techniques
improve performance?
- Documents have thousands of features
- Can different feature representation (selection)
techniques improve performance without affecting
accuracy?
- Documents can belong to multiple classes, but SVMs are binary classifiers!
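One common workaround for this mismatch is a one-vs-rest decomposition: train one binary classifier per class and assign every class whose classifier says "yes". The slides do not commit to this approach, so the sketch below is only an example of how the labels could be prepared and the binary outputs combined. The function names, the example labels, and the classifier scores are hypothetical.

```python
def one_vs_rest_labels(doc_labels, classes):
    """Build one +1/-1 label list per class from multi-label documents."""
    return {c: [1 if c in labels else -1 for labels in doc_labels]
            for c in classes}

def predict_classes(scores, threshold=0.0):
    """Assign every class whose binary classifier score exceeds the threshold."""
    return sorted(c for c, s in scores.items() if s > threshold)

# Hypothetical training labels: each document may have several classes.
doc_labels = [{"politics"}, {"sport", "business"}, {"business"}]
classes = ["politics", "sport", "business"]

binary_problems = one_vs_rest_labels(doc_labels, classes)
print(binary_problems)

# Hypothetical outputs of three trained binary classifiers for a new document.
predicted = predict_classes({"politics": -0.7, "sport": 0.4, "business": 1.2})
print(predicted)
```

Each entry in `binary_problems` is an ordinary two-class training set, so a standard binary SVM can be trained on it without modification.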
11Project Plan
- Currently implementing the clustering technique described in Raskutti et al. (2002)
- Plan to implement other clustering techniques
- Investigate different feature representation (selection) techniques
  - For example, different weights for words in different positions in the document
- Investigate the multi-class problem
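The idea of weighting words by their position in the document could look roughly like the sketch below, where a word in the title counts more than the same word in the body. The section names, the weight values, and the function `weighted_counts` are all assumptions chosen for illustration; the plan does not specify a particular scheme.

```python
import re
from collections import defaultdict

def weighted_counts(sections, weights):
    """Sum per-word weights, where the weight depends on the section."""
    totals = defaultdict(float)
    for name, text in sections.items():
        weight = weights.get(name, 1.0)  # unknown sections count once
        for token in re.findall(r"[a-z]+", text.lower()):
            totals[token] += weight
    return dict(totals)

# Hypothetical document split into sections, with made-up weights.
sections = {
    "title": "Document databases",
    "body": "Companies start using document databases",
}
weights = {"title": 3.0, "body": 1.0}

features = weighted_counts(sections, weights)
print(features)
```

With all weights set to 1.0 this reduces to plain word counting, which makes it easy to compare a position-weighted representation against the unweighted baseline.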
12References
- Blum, A. and T. Mitchell (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory, Morgan Kaufmann Publishers.
- Raskutti, B., H. Ferra, and A. Kowalczyk (2002). Using unlabeled data for text classification through addition of cluster parameters. In International Conference on Machine Learning (Accepted).