Faceted Classification using SVM

Slides: 31
Provided by: lawanyach
Transcript and Presenter's Notes

1
Faceted Classification using SVM
  • Deepthi Singh Rajput
  • dsingh_at_cs.odu.edu
  • Advisors: Dr. Zubair, Dr. Maly

2
Overview
  • Proposal
  • Approach
  • SVM overview
  • Implementation
  • Integration
  • Conclusion

3
Collaborative Faceted Classification
  • Faceted classification
  • Multiple perspectives.
  • Mutually exclusive facets
  • Each facet is a hierarchy
  • Simple and easy to explore
  • Example: the file C:/subfolder/test.txt
  • Its facets: type, size, date of modification
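
As a small illustration of the file example above, each facet can be treated as an independent attribute of the same item, so a collection can be explored from any perspective. This is only a sketch: the second file, the facet values, and the `browse` helper are hypothetical and not part of the presentation.

```python
# Hypothetical facet values; each facet describes the same item
# from a different, mutually exclusive perspective.
documents = {
    "C:/subfolder/test.txt": {"type": "text", "size": "small", "date_modified": "2007"},
    "C:/subfolder/photo.jpg": {"type": "image", "size": "large", "date_modified": "2007"},
}

def browse(docs, **facet_values):
    """Return the items whose facets match every requested value."""
    return sorted(
        name for name, facets in docs.items()
        if all(facets.get(f) == v for f, v in facet_values.items())
    )

text_files = browse(documents, type="text")
from_2007 = browse(documents, date_modified="2007")
```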

4
Collaborative Faceted Classification
  • Collaborative Classification
  • Collaboratively build faceted classification
  • Add, modify facets
  • Categorize, reclassify documents
  • Automated Classification (Supervised Learning
    Problem)
  • Collaboration dynamically changes the facet
    schema
  • Calls for automatic reclassifying of existing
    documents
  • Manual classification / reclassification of huge
    collections is time-consuming.

5
Objective
  • Use a Support Vector Machine (SVM) to implement
    a system that:
  • Automates document classification into predefined
    facets.
  • Suggests classifications for new documents.
  • Reclassifies documents when the facet schema
    changes.

6
Approach
  • Bootstrapping
  • Start with an unclassified collection of documents.
  • A facet for this collection is defined manually.
  • A subset of the collection is manually
    classified into positive and negative sets (with
    respect to the facet).
  • The system is trained to identify the documents
    of each set as positive or negative using
    automated learning.
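
The bootstrapping step above could be sketched as follows, assuming one binary classifier is trained per category so that "positive" means "belongs to this category" and "negative" means everything else. The document ids, the "topic" facet, and the `split_for_category` helper are hypothetical; the presentation does not describe its own data structures.

```python
def split_for_category(labeled_docs, category):
    """Split manually classified documents into positive and negative sets.

    labeled_docs maps a document id to the category a person assigned
    to it within one facet.
    """
    positive = sorted(d for d, c in labeled_docs.items() if c == category)
    negative = sorted(d for d, c in labeled_docs.items() if c != category)
    return positive, negative

# Hypothetical hand-labeled subset for a "topic" facet.
labeled = {"doc1": "sports", "doc2": "politics", "doc3": "sports", "doc4": "business"}
positive, negative = split_for_category(labeled, "sports")
```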

7
Approach
  • Soft classification: the rest of the collection
    is tentatively classified by the system using the
    training set.
  • Soft classifications are made available to users
    for exploration.
  • Soft classifications can be hardened or
    rejected by the users.
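
One way to picture the soft/hard distinction is as a status attached to each suggested classification. The sketch below is an assumption about how such a workflow might be modeled; the `Suggestion` class, the status strings, and the `review` function are all hypothetical names.

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    doc_id: str
    category: str
    status: str = "soft"  # "soft" = proposed by the system, "hard" = approved by a user

def review(suggestions, decisions):
    """Apply user decisions: True hardens, False rejects, no decision stays soft."""
    kept = []
    for s in suggestions:
        decision = decisions.get(s.doc_id)
        if decision is False:
            continue  # rejected suggestions are dropped
        if decision is True:
            s.status = "hard"
        kept.append(s)
    return kept

suggestions = [Suggestion("a", "sports"), Suggestion("b", "sports"), Suggestion("c", "politics")]
result = review(suggestions, {"a": True, "b": False})
```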

8
Approach
  • Real Time Classification
  • Users may contribute new documents to the
    collection.
  • The system tentatively classifies the document.
  • Users can:
  • Approve the suggested classification.
  • Classify as they wish, into the existing facet
    schema.
  • Change the schema by creating a new category or
    facet for the document.
  • When the facet schema changes, the system
    reclassifies the documents.

9
Example Collection Schema
  • News Articles
  • Image Collection
  • American Political History
  • State Department Collection
  • African American Activists

10
Technologies Used
  • JDK 1.5.0_05
  • IDE: Eclipse
  • Maven 2.0.4
  • LibSVM

11
SVM Overview
  • Statistical Learning Theory
  • Widely used in pattern recognition
  • An important and active field of machine
    learning research.
  • An instance of Kernel Machines, a large class of
    learning algorithms.

12
Good Decision Boundary
  • Consider a two-class, linearly separable
    classification problem
  • Many decision boundaries!
  • The Perceptron algorithm can be used to find such
    a boundary
  • Are all decision boundaries equally good?

[Figure: data points of Class 1 and Class 2]

13
Large-margin Decision Boundary
  • The decision boundary should be as far away from
    the data of both classes as possible
  • We should maximize the margin, m
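
For reference, the textbook hard-margin formulation of this idea is given below. The symbols w, b, x_i, y_i, and n do not appear on the slide and are introduced here as assumptions: w and b define the boundary w^T x + b = 0, and y_i in {-1, +1} is the class of training point x_i.

```latex
\[
  m = \frac{2}{\lVert \mathbf{w} \rVert}
\]
\[
  \min_{\mathbf{w},\, b} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
  \quad \text{subject to} \quad
  y_i\,(\mathbf{w}^{\top}\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
\]
```

Maximizing the margin m is therefore equivalent to minimizing the norm of w while keeping every training point on the correct side of the boundary.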

14
Kernel Mapping: Computing feature space
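
The slide's figure is not available, so the following is only a generic illustration of kernel mapping, not a reconstruction of it. It shows that a degree-2 polynomial kernel returns the same value as an inner product in an explicitly computed feature space, and defines the RBF kernel used later in the talk. The function names are chosen for this sketch.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly2_kernel(x, y):
    """Same inner product as dot(phi(x), phi(y)), without building phi."""
    return dot(x, y) ** 2

def rbf_kernel(x, y, gamma):
    """RBF kernel exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((p - q) ** 2 for p, q in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = (1.0, 2.0), (3.0, 0.5)
explicit = dot(phi(x), phi(y))   # inner product in the mapped space
implicit = poly2_kernel(x, y)    # kernel evaluated in the input space
```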
15
SVM Implementation
  • Transform the data to the format of the SVM software
  • Conduct simple scaling/normalization on the data
  • Consider the RBF kernel
  • Find the best parameters C and gamma
  • Use the best C and gamma to train on the whole
    training set
  • Test

16
Transformation
  • The collection has to be transformed to a format
    of the SVM library (LibSVM)
  • Every training collection contains a number of
    different documents (Document Frequency)
  • Documents contain different terms and term
    frequencies.
  • Term Frequency (TF) and Document Frequency are
    used to compute the features.
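
The slide does not give the exact weighting formula, so the sketch below assumes the common TF-IDF form tf * log(N / df), where N is the number of documents and df is the number of documents containing the term. The tokenized input and the `tfidf_vectors` function are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocab, vectors); each vector maps a 1-based feature id to tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    vocab = {term: i + 1 for i, term in enumerate(sorted(df))}
    vectors = []
    for tokens in tokenized_docs:
        tf = Counter(tokens)
        vec = {vocab[t]: count * math.log(n_docs / df[t]) for t, count in tf.items()}
        vectors.append(vec)
    return vocab, vectors

docs = [["svm", "facet", "svm"], ["facet", "news"], ["news", "image"]]
vocab, vectors = tfidf_vectors(docs)
```

With this weighting, a term that occurs in every document gets weight 0, since it carries no information for telling documents apart.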

17
LibSVM format
  • Each document is represented by one line in the
    LibSVM data file
  • Each line starts with a class label (0 or 1)
    followed by featureId:featureValue pairs
  • label featureId:featureValue
    featureId:featureValue ...
  • Training and Testing files have the same format.
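
A minimal sketch of writing one document in this format is shown below. LibSVM's sparse format is `label index:value index:value ...` with indices in ascending order; the `to_libsvm_line` helper and the sample values are hypothetical.

```python
def to_libsvm_line(label, features):
    """Format one document as 'label id:value id:value ...'.

    Feature ids are written in ascending order and zero values are
    omitted, as LibSVM's sparse format expects. The ':g' format keeps
    about six significant digits.
    """
    pairs = [f"{fid}:{value:g}" for fid, value in sorted(features.items()) if value != 0]
    return " ".join([str(label)] + pairs)

line = to_libsvm_line(1, {4: 2.5, 1: 0.75, 3: 0.0})
```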

18
Scaling and Kernel
  • Scaling
  • Prevents attributes in greater numeric ranges
    from dominating those in smaller numeric ranges.
  • RBF Kernel
  • The number of hyperparameters influences the
    complexity of model selection.
  • The polynomial kernel has more hyperparameters
    than the RBF kernel.
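
The scaling step could look like the sketch below, which assumes linear scaling of each feature to [0, 1]. The ranges are learned from the training data only and then reused for test data, so both sets are transformed the same way. As a simplification, implicit zeros of sparse vectors are ignored when computing ranges; the function names are illustrative.

```python
def fit_ranges(vectors):
    """Record the (min, max) of every feature id seen in the training vectors."""
    ranges = {}
    for vec in vectors:
        for fid, value in vec.items():
            lo, hi = ranges.get(fid, (value, value))
            ranges[fid] = (min(lo, value), max(hi, value))
    return ranges

def scale(vec, ranges):
    """Linearly map each feature to [0, 1] using the training ranges."""
    scaled = {}
    for fid, value in vec.items():
        if fid not in ranges:
            continue  # feature never seen in training
        lo, hi = ranges[fid]
        scaled[fid] = 0.0 if hi == lo else (value - lo) / (hi - lo)
    return scaled

train = [{1: 2.0, 2: 100.0}, {1: 4.0, 2: 300.0}]
ranges = fit_ranges(train)
scaled_test = scale({1: 3.0, 2: 200.0}, ranges)
```

Test values outside the training range map outside [0, 1], which is expected when the training ranges are reused.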

19
Training
  • Identify a good (C, γ) so that the classifier can
    accurately predict unknown data
  • Train the system with a set of positive and
    negative documents.
  • Calculate the training accuracy.
  • Iterate with different sample set sizes.
  • Pick the best one to avoid overfitting and
    underfitting.
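
The search for a good (C, gamma) pair is often done as a grid search over exponentially growing values, for example C = 2^-5, 2^-3, ..., 2^15 and gamma = 2^-15, 2^-13, ..., 2^3. The sketch below only shows the search loop: `evaluate` is a hypothetical callable that would train an SVM with the given parameters and return accuracy on held-out data, and the toy scorer stands in for it.

```python
def exp_grid(start, stop, step):
    """Return [2**start, 2**(start + step), ..., 2**stop]."""
    return [2.0 ** e for e in range(start, stop + 1, step)]

def grid_search(evaluate, c_values, gamma_values):
    """Return (best_score, best_C, best_gamma) for a caller-supplied scorer."""
    best = None
    for c in c_values:
        for gamma in gamma_values:
            score = evaluate(c, gamma)
            if best is None or score > best[0]:
                best = (score, c, gamma)
    return best

c_values = exp_grid(-5, 15, 2)       # 2^-5, 2^-3, ..., 2^15
gamma_values = exp_grid(-15, 3, 2)   # 2^-15, 2^-13, ..., 2^3

# Toy scorer for demonstration only; a real one would train and validate an SVM.
def toy_score(c, gamma):
    return -abs(c - 8.0) - abs(gamma - 0.125)

best = grid_search(toy_score, c_values, gamma_values)
```

Scoring on held-out data rather than on the training set itself is what lets the search notice overfitting.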

20
Testing
  • Test with a set of positive and negative samples
  • Determine the accuracy of prediction
  • P = number of positive documents
  • N = number of negative documents
  • RP = positive documents rightly classified
  • RN = negative documents rightly classified
  • Accuracy = (RP + RN) / (P + N)
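
The accuracy measure, read as (RP + RN) / (P + N), can be computed as in this short sketch. It assumes labels are encoded as 1 for positive and 0 for negative; the sample label lists are made up.

```python
def accuracy(true_labels, predicted_labels):
    """Return (RP + RN) / (P + N) for labels where 1 is positive and 0 is negative."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    pairs = list(zip(true_labels, predicted_labels))
    rp = sum(1 for t, p in pairs if t == 1 and p == 1)  # positives rightly classified
    rn = sum(1 for t, p in pairs if t == 0 and p == 0)  # negatives rightly classified
    return (rp + rn) / len(pairs)

truth = [1, 1, 1, 0, 0]
predicted = [1, 0, 1, 0, 1]
score = accuracy(truth, predicted)
```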

21
Integration
  • Integration with the Collaborative Faceted
    Classification to achieve
  • Bootstrapping
  • Real Time Classification

22
Experiments and Results
  • Image Collection
  • 518 documents
  • 18 categories
  • Sparse feature set
  • News Articles
  • 500 documents
  • 8 categories
  • Large feature set.
  • Training set sizes: 40, 80, 100
  • Test set sizes: 30, 40

23
News Articles 40-40
24
Overfitting 100-40
25
Image Collection 40-30 (underfitting)
26
Image Collection 80-30
27
Impact of sample size
28
Conclusions
  • Unstable on small training sets
  • Overfitting
  • Large training sets with far fewer feature
    dimensions (Image Collection)
  • Small training sets with a small set of feature
    dimensions.
  • Underfitting
  • Feature dimensions larger than the training set.
  • Need to choose a subset of the features before
    giving the data to the SVM.
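
The presentation does not say how the feature subset would be chosen. One simple option, shown here purely as an assumption, is to keep only terms that occur in at least a minimum number of documents and to cap the total number of features. The function name and thresholds are illustrative.

```python
from collections import Counter

def select_by_document_frequency(tokenized_docs, min_df=2, max_features=None):
    """Keep terms that occur in at least min_df documents, most frequent first."""
    df = Counter()
    for tokens in tokenized_docs:
        df.update(set(tokens))  # count each term once per document
    kept = [t for t in df if df[t] >= min_df]
    kept.sort(key=lambda t: (-df[t], t))  # ties broken alphabetically
    return kept[:max_features]

docs = [["svm", "facet", "svm"], ["facet", "news"], ["news", "image"], ["facet"]]
selected = select_by_document_frequency(docs)
```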

29
Future Work
  • Re-classification
  • Approval of the schema
  • Point of Reclassification
  • Training set

30
References
  • http://128.82.7.230/categorization/developers/svmhmm.ppt
  • http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf
  • http://www.csie.ntu.edu.tw/~cjlin/libsvm
  • http://128.82.7.230/categorization/joachims.pdf
  • http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html