Text Classification With Support Vector Machines

1
Text Classification With Support Vector Machines
Presenter: Aleksandar Milisic
Supervisor: Dr. David Albrecht
2
Overview

Text Classification: What and Why?
Text Clustering
Support Vector Machines
Current Techniques
Project Aim and Plan

3
Text Classification: What and Why?

Text classification: assigning documents to predefined classes (categories).
Example: Web pages can be assigned to politics, sport, business, entertainment, etc.
There are thousands of categories associated with web pages.
Labeling manually is time-consuming and sometimes impossible, so the process needs to be automated!

4
Text Classification: What and Why?

Automated text classifiers need to be able to learn from:
A small set of labeled documents
A large set of unlabeled documents
Otherwise a lot of labeling would have to be done by humans.
So how is it done?

5
Representing Text
Example text: "With paperless offices becoming more common, companies start using document databases with classification schemes"
Feature vector (word counts):
Companies: 1
Document: 3
Distance: 0
. . .
Offices: 1
Unix: 0
Match: 0
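The word-count representation above can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `feature_vector`, the crude letters-only tokenizer, and the six-word vocabulary are assumptions. Because it does no stemming and sees only the one sentence, its counts need not match the slide (which shows 3 for "Document").

```python
import re
from collections import Counter

def feature_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    # Assumption: lowercase the text and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    # Counter returns 0 for words that never occur.
    return [counts[word] for word in vocabulary]

text = ("With paperless offices becoming more common, companies start "
        "using document databases with classification schemes")
# Hypothetical vocabulary, taken from the words shown on the slide.
vocabulary = ["companies", "document", "distance", "offices", "unix", "match"]
vector = feature_vector(text, vocabulary)
print(vector)
```

Each document becomes a fixed-length vector with one entry per vocabulary word, which is the input format the later clustering and SVM steps work on.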
6
Clustering
Feature vectors: (1, 0, 1) and (2, 4, 0)
[Figure: labeled documents and unlabeled documents]

7
Support Vector Machines (SVM)

Binary classifiers
Maximize the distance (margin) between two classes by finding the Optimal Separating Hyperplane (OSH)
Support vectors are the training points closest to the OSH

[Figure: OSH separating Class 1 from Not Class 1, with the support vectors marked]
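As a rough illustration of how a separating hyperplane can be found, the sketch below trains a linear soft-margin SVM by batch subgradient descent on the regularized hinge loss. This is one simple training scheme chosen for brevity, not the method used in the presentation. The function names, the hyperparameters (`lam`, `lr`, `epochs`), and the toy 2-D data are all assumptions.

```python
def decision(w, b, x):
    """Signed score of x; its sign gives the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(points, labels, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss with subgradient steps."""
    n, dim = len(points), len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]  # gradient of the regularizer
        grad_b = 0.0
        for x, y in zip(points, labels):
            # Only points inside the margin (or misclassified) contribute.
            if y * decision(w, b, x) < 1:
                for i in range(dim):
                    grad_w[i] -= y * x[i] / n
                grad_b -= y / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

# Toy, linearly separable data: +1 is "Class 1", -1 is "Not Class 1".
points = [(2, 2), (3, 3), (2, 3), (0, 0), (1, 0), (0, 1)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [1 if decision(w, b, x) >= 0 else -1 for x in points]
print(predictions)
```

The training points whose score ends up closest to the hyperplane play the role of the support vectors; points far from it have no influence on the final `w` and `b`.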
8
Current Techniques

Clustering methods:
Rasmussen's Single Pass Algorithm (as described by Raskutti et al. (2002))
Reallocation Method
Hierarchical Methods
Classification methods:
Support Vector Machines
Co-Training Algorithm (Blum and Mitchell, 1998)
Raskutti et al. (2002) describe an interesting approach combining SVMs with Rasmussen's clustering algorithm
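A generic single-pass clustering scheme can be sketched as follows. Each document is visited once and joins the most similar existing cluster if the similarity reaches a threshold, otherwise it starts a new cluster. This may differ in detail from the variant described by Raskutti et al. (2002): the cosine similarity measure, the 0.8 threshold, and the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def single_pass_cluster(vectors, threshold=0.8):
    """Return one cluster index per vector, visiting each vector once."""
    sums = []          # per-cluster sum of member vectors
    assignments = []
    for v in vectors:
        best, best_sim = None, threshold
        for idx, s in enumerate(sums):
            # Cosine against the sum equals cosine against the mean.
            sim = cosine(v, s)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            sums.append(list(v))           # start a new cluster
            best = len(sums) - 1
        else:
            sums[best] = [si + vi for si, vi in zip(sums[best], v)]
        assignments.append(best)
    return assignments

# Hypothetical word-count vectors for five documents.
docs = [[1, 0, 1], [2, 0, 2], [0, 4, 0], [0, 3, 1], [1, 0, 0]]
clusters = single_pass_cluster(docs)
print(clusters)
```

Because each document is processed once, the result depends on the order of the input, which is a known trade-off of single-pass methods against reallocation and hierarchical methods.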

9
Combining SVM With Clustering
[Figure: feature vectors with added features. Legend: labeled documents (Class 1), labeled documents (Not Class 1), unlabeled documents, support vectors, separating hyperplane]
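One way to picture the "added features" is to extend every document's feature vector with information about the cluster it fell into, so that unlabeled documents influence the classifier through the cluster structure. The slides do not say which cluster-derived features are added, so the one-hot membership block below is purely an assumption, as are the function name and the example cluster assignments.

```python
def add_cluster_features(vectors, assignments, n_clusters):
    """Append a one-hot cluster-membership block to each feature vector."""
    augmented = []
    for v, cluster in zip(vectors, assignments):
        # Assumption: one extra feature per cluster, 1 for the document's own cluster.
        one_hot = [1 if k == cluster else 0 for k in range(n_clusters)]
        augmented.append(list(v) + one_hot)
    return augmented

# Hypothetical word-count vectors and cluster indices for three documents.
vectors = [[1, 0, 1], [0, 4, 0], [2, 0, 2]]
assignments = [0, 1, 0]
augmented = add_cluster_features(vectors, assignments, n_clusters=2)
print(augmented)
```

The SVM would then be trained on the augmented vectors of the labeled documents only, while the clustering that produced `assignments` used labeled and unlabeled documents together.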

10
Project Aim

Resolve the following issues:

Can combining SVMs with other techniques improve performance?

Documents have thousands of features. Can different feature representation (selection) techniques improve performance without affecting accuracy?

Documents can belong to multiple classes, but SVMs are binary classifiers!
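A common way around the binary restriction is one-vs-rest decomposition: train one binary classifier per class ("this class" versus "everything else") and assign every class whose classifier fires. The sketch below shows only that decomposition and is not necessarily the approach the project adopted. To stay self-contained it uses a simple centroid-based scorer as a stand-in for an SVM, and all function names and the toy data are assumptions.

```python
def train_centroid_scorer(vectors, labels):
    """Stand-in binary learner (NOT an SVM): linear score through the
    midpoint of the positive and negative class means.
    Assumes both classes have at least one example."""
    dim = len(vectors[0])
    def mean(group):
        return [sum(v[i] for v in group) / len(group) for i in range(dim)]
    pos = mean([v for v, y in zip(vectors, labels) if y == 1])
    neg = mean([v for v, y in zip(vectors, labels) if y == -1])
    w = [p - q for p, q in zip(pos, neg)]
    b = -sum(wi * (p + q) / 2 for wi, p, q in zip(w, pos, neg))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

def train_one_vs_rest(vectors, label_sets, classes, train_binary):
    """Train one binary scorer per class on +1/-1 relabeled data."""
    scorers = {}
    for c in classes:
        binary = [1 if c in labels else -1 for labels in label_sets]
        scorers[c] = train_binary(vectors, binary)
    return scorers

def predict_labels(scorers, x):
    """A document receives every class whose scorer is non-negative."""
    return {c for c, score in scorers.items() if score(x) >= 0}

# Hypothetical features: [count of sport words, count of politics words].
vectors = [[3, 0], [0, 3], [3, 3], [0, 0]]
label_sets = [{"sport"}, {"politics"}, {"sport", "politics"}, set()]
scorers = train_one_vs_rest(vectors, label_sets, ["sport", "politics"],
                            train_centroid_scorer)
print(predict_labels(scorers, [4, 4]))
```

Swapping `train_centroid_scorer` for a function that trains an SVM leaves the rest of the scheme unchanged, and a document can still receive several labels or none.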

11
Project Plan

Currently implementing the clustering technique described in Raskutti et al. (2002)
Plan to implement other clustering techniques
Investigate different feature representation (selection) techniques, for example different weights for words in different positions in the document
Investigate the multi-class problem
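Position-dependent weighting could look like the sketch below, where words in the title count more than words in the body. The slides give no concrete scheme, so the title/body split, the weights 3.0 and 1.0, and the function name are hypothetical.

```python
import re
from collections import Counter

def weighted_counts(title, body, title_weight=3.0, body_weight=1.0):
    """Word counts where each occurrence is weighted by where it appears."""
    counts = Counter()
    for text, weight in ((title, title_weight), (body, body_weight)):
        # Same crude tokenizer assumption: lowercase runs of letters.
        for token in re.findall(r"[a-z]+", text.lower()):
            counts[token] += weight
    return counts

counts = weighted_counts("Document databases",
                         "Companies use document databases")
print(counts["document"], counts["companies"])
```

The resulting weighted counts can replace the raw counts in the feature vector without changing anything else in the pipeline.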

12
References

Blum, A. and T. Mitchell (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory. Morgan Kaufmann Publishers.
Raskutti, B., H. Ferra, and A. Kowalczyk (2002). Using unlabeled data for text classification through addition of cluster parameters. In International Conference on Machine Learning (Accepted).