Faceted Classification using SVM - PowerPoint PPT Presentation

About This Presentation

Description:

News Articles. Image Collection. American Political History. State Department Collection ... News Articles: 500 documents. 8 categories. Large feature set. ... – PowerPoint PPT presentation

Slides: 31

Provided by: lawanyach

Transcript and Presenter's Notes

1
Faceted Classification using SVM

Deepthi Singh Rajput
dsingh_at_cs.odu.edu
Advisors: Dr. Zubair, Dr. Maly

2
Overview

Proposal
Approach
SVM overview
Implementation
Integration
Conclusion

3
Collaborative Faceted Classification

Faceted classification
Multiple perspectives.
Mutually exclusive facets
Each facet is a hierarchy
Simple and easy to explore
Example: a file such as C//subfolder/test.txt
can be classified by type, size, and date of modification
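
As a rough illustration of the file example above, a document can be modeled as one value per facet, where each value is a path in that facet's hierarchy. This is only a sketch: the facet names, the values, and the helper `matches` are hypothetical and do not come from the presentation's system (which was written in Java).

```python
# Hypothetical sketch: one document described independently under several facets.
# Each facet value is a path in that facet's hierarchy, root first.
document_facets = {
    "type": ("document", "text"),
    "size": ("small",),
    "date_modified": ("2006", "10"),
}


def matches(facets, facet, prefix):
    """Return True if the document's value under `facet` lies at or below `prefix`."""
    value = facets.get(facet, ())
    return value[:len(prefix)] == tuple(prefix)
```

Browsing by one facet does not depend on the others: `matches(document_facets, "type", ("document",))` holds no matter how the same file is filed under size or date.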

4
Collaborative Faceted Classification

Collaborative Classification
Collaboratively build faceted classification
Add, modify facets
Categorize, reclassify documents
Automated Classification (Supervised Learning Problem)
Collaboration dynamically changes the facet schema
This calls for automatic reclassification of existing documents
Manual classification / reclassification of huge collections is time-consuming.

5
Objective

Use a Support Vector Machine (SVM) to implement a system that:
Automates document classification into predefined facets.
Suggests classifications for new documents.
Reclassifies documents when the faceted schema changes.

6
Approach

Bootstrapping
Start with an unclassified collection of documents.
A facet for this collection is defined manually.
A subset of the collection is manually classified into positive and negative sets (with respect to the facet).
The system is trained, using automated learning, to identify documents as positive or negative.

7
Approach

Soft classification: the rest of the collection is tentatively classified by the system, using the model learned from the training set.
Soft classifications are made available to users for exploration.
Soft classifications can be hardened or rejected by the users.
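
The life cycle of a soft classification described above can be sketched as a small state holder. The class name, the status strings, and the choice to hide rejected suggestions from exploration are all assumptions made for illustration. The slides do not describe the actual data model.

```python
from dataclasses import dataclass

# Hypothetical status values for a (document, category) assignment.
SOFT, HARD, REJECTED = "soft", "hard", "rejected"


@dataclass
class Classification:
    doc_id: str
    category: str
    status: str = SOFT  # system suggestions start out tentative

    def harden(self):
        """A user confirms the system's suggestion."""
        self.status = HARD

    def reject(self):
        """A user discards the system's suggestion."""
        self.status = REJECTED


def visible_to_users(classifications):
    """Assumption: soft and hard classifications are explorable, rejected ones are not."""
    return [c for c in classifications if c.status != REJECTED]
```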

8
Approach

Real Time Classification
Users may contribute new documents to the
collection.
The system tentatively classifies the document.
Users can:
Approve the suggested classification.
Classify the document as they wish within the existing facet schema.
Change the schema by creating a new category or facet for the document.
When the facet schema changes, the system reclassifies the documents.

9
Example Collection Schema

News Articles
Image Collection
American Political History
State Department Collection
African American Activists

10
Technologies Used

JDK 1.5.0_05
IDE: Eclipse
Maven 2.0.4
LibSVM

11
SVM Overview

Grounded in Statistical Learning Theory
Widely used in pattern recognition
An important and active area of machine learning research.
An instance of Kernel Machines, a large class of learning algorithms.

12
Good Decision Boundary

Consider a two-class, linearly separable
classification problem
Many decision boundaries!
The Perceptron algorithm can be used to find such
a boundary
Are all decision boundaries equally good?
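
To make the point about the Perceptron concrete, here is a minimal sketch of the classic perceptron update rule in plain Python. It assumes labels in {-1, +1}. The function names and the toy data in the usage note are illustrative and not taken from the presentation.

```python
def train_perceptron(points, labels, epochs=100, lr=1.0):
    """Find some separating hyperplane w.x + b = 0 for labels in {-1, +1}."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * activation <= 0:  # misclassified, or exactly on the boundary
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:  # every point is on its correct side
            break
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane it falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

On separable data such as `[(2, 2), (3, 3), (-2, -2), (-3, -1)]` with labels `[1, 1, -1, -1]`, the loop stops as soon as it finds any boundary that separates the classes. Which boundary it finds depends on the order of the data, which is why the question of a "good" boundary arises.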

[Figure: Class 1, Class 2]

13
Large-margin Decision Boundary

The decision boundary should be as far away from
the data of both classes as possible
We should maximize the margin, m
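
As background (this is the standard hard-margin formulation, not something shown on the slides), suppose the labels are y_i in {-1, +1} and the boundary w.x + b = 0 is scaled so that the closest points satisfy y_i (w.x_i + b) = 1. The margin m and the resulting optimization problem can then be written as:

```latex
m = \frac{2}{\lVert w \rVert}, \qquad
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w^{\top} x_i + b) \ge 1, \quad i = 1, \dots, n
```

Maximizing m is therefore equivalent to minimizing the norm of w under the constraint that every training point lies on the correct side with at least unit functional margin.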

14
Kernel Mapping: Computing feature space

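
The slide's figure is not available, so the following is only a generic sketch of the idea behind kernel mapping: a kernel returns the dot product in a higher-dimensional feature space without constructing that space. The degree-2 polynomial kernel and the function names are chosen for illustration and are not from the presentation.

```python
import math


def poly2_kernel(x, z):
    """Degree-2 homogeneous polynomial kernel on 2-D inputs: (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def phi(x):
    """Explicit 3-D feature map whose dot product equals poly2_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))
```

For any pair of 2-D points, `poly2_kernel(x, z)` and `dot(phi(x), phi(z))` agree up to floating-point error, but the kernel never builds the mapped vectors.
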
15
SVM Implementation

Transform the data to the format of the SVM software
Conduct simple scaling/normalization on the data
Consider the RBF kernel
Find the best parameters C and gamma
Use the best C and gamma to train on the training set
Test
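
The parameter search step above can be sketched as a grid search over exponentially spaced values. The slides do not say how the search was done, so everything here is an assumption: the exponent ranges follow a commonly suggested coarse grid, and `cv_accuracy` is a hypothetical caller-supplied function that would train an SVM and return, for example, a cross-validation accuracy.

```python
def grid_search(cv_accuracy,
                c_exponents=range(-5, 16, 2),
                gamma_exponents=range(-15, 4, 2)):
    """Return (best_c, best_gamma, best_score) over an exponential grid.

    cv_accuracy is a hypothetical function (C, gamma) -> score supplied by
    the caller; this sketch does not train any SVM itself.
    """
    best = None
    for ce in c_exponents:
        for ge in gamma_exponents:
            c, gamma = 2.0 ** ce, 2.0 ** ge
            score = cv_accuracy(c, gamma)
            if best is None or score > best[2]:
                best = (c, gamma, score)
    return best
```

The returned C and gamma would then be used for the final training run on the training set.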

16
Transformation

The collection has to be transformed to the format of the SVM library (LibSVM)
Every training collection contains a number of different documents; the number of documents in which a term occurs is its Document Frequency (DF).
Documents contain different terms, each with its own Term Frequency (TF).
TF and DF are used to compute the feature values.
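
One common way to combine TF and DF into a feature value is TF-IDF weighting. The slides do not state the exact formula used, so the weighting below (tf * log(N / df)) is an assumption, as are the function name and the pre-tokenized input.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        tf = Counter(doc)  # term frequency within this document
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors
```

A term that occurs in every document gets weight 0 under this scheme, so it contributes nothing to telling the positive set from the negative set.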

17
LibSVM format

Each document is represented by one line in the LibSVM data file
Each line starts with a class label (0 or 1) followed by feature (id, value) pairs
label featureId:featureValue featureId:featureValue ...
Training and Testing files have the same format.
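
A minimal sketch of writing one document in this format follows. The helper name `to_libsvm_line` is hypothetical. The sketch assumes integer feature ids, writes them in ascending order, and omits zero-valued features because the format is sparse.

```python
def to_libsvm_line(label, features):
    """Format one document as 'label id:value id:value ...'.

    features maps integer feature ids to numeric values. Ids are written in
    ascending order and zero-valued features are skipped.
    """
    pairs = " ".join(
        f"{fid}:{value:g}"
        for fid, value in sorted(features.items())
        if value != 0
    )
    return f"{label} {pairs}".rstrip()
```

For example, `to_libsvm_line(1, {3: 0.5, 1: 2.0})` gives `1 1:2 3:0.5`.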

18
Scaling and Kernel

Scaling
Prevents attributes in greater numeric ranges from dominating those in smaller numeric ranges.
RBF Kernel
The number of hyperparameters influences the complexity of model selection.
The polynomial kernel has more hyperparameters than the RBF kernel.
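
Both ideas on this slide can be sketched in a few lines of plain Python. The function names are illustrative, and the [0, 1] target range is an assumption, since the slides do not state which range was used. The RBF kernel is written in its usual form, exp(-gamma * ||x - z||^2), with gamma as its single kernel parameter.

```python
import math


def scale_columns(rows, lower=0.0, upper=1.0):
    """Linearly rescale every feature (column) to the range [lower, upper]."""
    columns = list(zip(*rows))
    mins = [min(c) for c in columns]
    maxs = [max(c) for c in columns]
    scaled = []
    for row in rows:
        scaled.append([
            # A constant column carries no information; map it to `lower`.
            lower if hi == lo else lower + (upper - lower) * (v - lo) / (hi - lo)
            for v, lo, hi in zip(row, mins, maxs)
        ])
    return scaled


def rbf_kernel(x, z, gamma):
    """RBF kernel K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)
```

Without scaling, a feature ranging over thousands would dominate the squared distance inside the kernel. In practice, the minimum and maximum computed on the training data should also be applied to the test data.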

19
Training