Title: Machine Learning in Text Categorization
1. Machine Learning in Text Categorization
- Presenter: Muluwork Geremew
- Course: ENEE 752
2. Overview
- Introduction
- Text Classification
- Text Categorization (TC)
- Application areas of TC
- Automatic classifier construction
- Evaluation of text classifiers
- Conclusion
3. Introduction
- Availability of large sets of digital data
  - Biological information, songs, TV broadcasts
  - The Web, text collections, etc.
- Increased demand for methods to
  - sort and retrieve
  - filter and manage digital resources
- Information Retrieval (IR) techniques
  - Text search provides tools for searching for relevant documents within these large collections.
  - Text classification offers tools for converting unstructured text collections into structured ones.
  - In so doing, storage and search become easier.
4. Text Classification
- An important ingredient for organizing documents.
  - E.g., Web directories such as Yahoo!
- The two variants of text classification:
  - Text clustering finds a latent but as-yet-unspecified group structure.
    - Classes are not known in advance.
  - Text categorization structures the collection according to a scheme provided as input.
    - Classes are predefined.
5. Text Categorization
- Definition
  - A process of assigning to a text document d from a given domain D one or more class labels from a finite set of predefined categories.
[Figure: a document d3 is mapped to one or more of the predefined categories C1, C2, …, Ck]
6. Automatic Text Categorization
- Formal definition
  - Given a set of previously unseen documents D = {d1, d2, d3, …} and a set of predefined classes or categories C = {c1, c2, c3, …, ck}, a classifier (categorizer) is a function Φ that maps a document from the set D to the set of all subsets of C, i.e. Φ : D → 2^C (a toy sketch follows below).
- Gains in efficiency and manpower from automation may significantly aid the business processes of any organization.
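A toy Python sketch of this definition (the category names and keyword rules below are illustrative, not from the slides): a categorizer is simply a function from documents to subsets of C.

```python
from typing import Callable, Set

# The finite category set C = {c1, ..., ck} (placeholder names).
CATEGORIES = {"grain", "finance", "sports"}

# A categorizer is a function D -> 2^C: it may assign any subset of C.
Categorizer = Callable[[str], Set[str]]

def toy_categorizer(document: str) -> Set[str]:
    """Toy keyword rules standing in for a learned classifier."""
    keywords = {"grain": ["wheat", "corn"],
                "finance": ["stocks", "bank"],
                "sports": ["match", "goal"]}
    text = document.lower()
    return {c for c, kws in keywords.items() if any(k in text for k in kws)}

print(toy_categorizer("Corn and wheat stocks rose"))  # {'grain', 'finance'}
```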
7. Automatic Text Categorization
- Single-label vs. multi-label TC
  - Single-label: exactly one category must be assigned to a document d.
    - Binary categorization is the special case with exactly two categories.
  - Multi-label: any number of categories can be assigned to d.
  - Most text categorization problems are multi-label.
- Soft (ranking) vs. hard TC
  - Soft: partially automated systems that generate ranked categories.
  - Hard: fully automated; the classifier decides on a document's category.
  - This presentation deals with hard text categorization.
8. Applications of TC
- Document organization
  - Newswire filtering: grouping news stories produced by news agencies into thematic classes of interest.
  - Patent classification
- Text filtering
  - Classifying an incoming stream of documents
  - Spam e-mail filtering
    - Single-label TC: classifying into two disjoint categories, relevant and irrelevant
- Authorship attribution, etc.
9. Construction of Automatic Classifiers
- From the late 80s to the early 90s, two approaches have been popular:
- Knowledge engineering
  - Define rules to classify documents.
  - All the rules are defined manually.
- Supervised machine learning
  - Builds an automatic text classifier by learning the characteristics of the categories from a training set.
  - Saves a lot of time and skilled manpower.
  - Classification accuracy is better than that of classifiers built by knowledge engineering methods.
10. Automatic Text Categorization Process
- To build a categorizer, we need a set of pre-classified documents.
  - Training set: pre-classified documents for training the categorizer.
  - Test set: pre-classified documents used for testing the effectiveness of the classifier.
- Basic outline for construction (a minimal split sketch follows below).
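A minimal sketch of the split, assuming scikit-learn (the slides name no library; the documents and labels below are invented):

```python
from sklearn.model_selection import train_test_split

# Toy pre-classified documents (placeholders for a real corpus).
docs = ["wheat prices rose", "corn crop grew", "stocks fell", "bank rates up",
        "goal in final match", "team won the cup", "harvest was poor", "shares rallied"]
labels = ["grain", "grain", "finance", "finance",
          "sports", "sports", "grain", "finance"]

# Hold out 25% of the pre-classified documents as the test set.
train_docs, test_docs, train_y, test_y = train_test_split(
    docs, labels, test_size=0.25, random_state=0)

print(len(train_docs), "training /", len(test_docs), "test documents")
```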
11. Automatic Text Categorization: Phase 1
- Building internal representations for documents
  - Transforms a document d to a compact vector form.
  - Tokenization of documents.
  - The dimension of the vector corresponds to the number of distinct words or tokens in the training set.
  - Each entry in the vector represents the weight of a term: a document di = (f1i, …, fmi), where m is the dimension.
  - The tf-idf function (tf × idf) is used to compute the weights.
  - The weights are normalized to limit them to [0, 1] (see the sketch after this list).
- Dimensionality reduction methods
  - Feature selection
  - Feature extraction
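A minimal Phase 1 sketch, assuming scikit-learn's TfidfVectorizer (an assumption; the slides do not prescribe a toolkit): tokenize the training documents, weight each term with tf-idf, and normalize each vector so no weight exceeds 1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["wheat prices rose sharply",          # toy training documents
        "new corn crop reported",
        "central bank raised interest rates"]

# Tokenizes, builds the term vocabulary, computes tf-idf weights,
# and L2-normalizes each document vector.
vectorizer = TfidfVectorizer(lowercase=True, norm="l2")
X = vectorizer.fit_transform(docs)

print(X.shape)                                 # (3 documents, m distinct terms)
print(vectorizer.get_feature_names_out()[:5])  # first few vocabulary terms
```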
12. Automatic Text Categorization: Phase 2
- After Phase 1, a learner automatically builds a text classifier for the provided categories by observing pre-classified samples in the training set.
- Various machine learning techniques:
  - Naïve Bayes
  - Support Vector Machines
  - K-Nearest Neighbors
  - Neural Networks
  - Decision Trees, etc.
13. Machine Learning Algorithms
- Naïve Bayes
  - Probabilistic classifier.
  - Based on Bayes' theorem; assumes conditional independence of the terms given the class.
  - Both the class priors P(c) and the term probabilities P(w|c) are estimated from the training data during the training phase (see the sketch below).
  - Training scales linearly with the size of the training set.
  - Categorization scales linearly with the number of classes.
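A from-scratch sketch of these training-phase estimates on toy data, with add-one (Laplace) smoothing as one common choice (an assumption; the slides do not specify the smoothing):

```python
import math
from collections import Counter, defaultdict

train = [("wheat prices rose", "grain"),
         ("corn crop grew", "grain"),
         ("stocks fell today", "finance")]

# Training: estimate P(c) from class frequencies and P(w|c) from
# term frequencies per class -- one pass, linear in the training set.
priors = Counter(c for _, c in train)
term_counts = defaultdict(Counter)
for text, c in train:
    term_counts[c].update(text.split())
vocab = {w for counts in term_counts.values() for w in counts}

def classify(text):
    """Pick argmax_c of log P(c) + sum_w log P(w|c), Laplace-smoothed."""
    def log_posterior(c):
        lp = math.log(priors[c] / len(train))
        total = sum(term_counts[c].values())
        for w in text.split():
            lp += math.log((term_counts[c][w] + 1) / (total + len(vocab)))
        return lp
    return max(priors, key=log_posterior)

print(classify("wheat prices fell"))  # 'grain'
```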
14. Machine Learning Algorithms
- Support Vector Machines (SVM)
  - Use polynomial and RBF kernels.
  - SVMlight decomposes the big QP problem into smaller ones and solves them iteratively until the solution converges.
  - One binary SVM classifier is trained for each pair of categories, k(k-1)/2 in total (the one-vs-one approach; see the sketch after this list).
  - Training scales poorly with training set size.
- K-Nearest Neighbors (KNN)
  - Lazy learner: no offline learning phase.
  - A new document is assigned the class of the k documents most similar to it.
  - Simple but computationally intensive.
  - Slow categorization phase.
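A hedged sketch, substituting scikit-learn for SVMlight (an assumption): SVC trains one binary classifier per category pair, i.e. the one-vs-one scheme, and KNeighborsClassifier is the lazy learner whose work happens at query time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

docs = ["wheat prices rose", "corn crop grew",
        "stocks fell today", "bank rates climbed"]
labels = ["grain", "grain", "finance", "finance"]

vec = TfidfVectorizer().fit(docs)
X = vec.transform(docs)

# SVC trains one binary classifier per category pair (one-vs-one):
# k(k-1)/2 classifiers for k categories.
svm_poly = SVC(kernel="poly", degree=3).fit(X, labels)     # polynomial kernel, degree d
svm_rbf = SVC(kernel="rbf", gamma="scale").fit(X, labels)  # RBF kernel, width gamma

# KNN is lazy: fit() only stores the training vectors; the expensive
# similarity search happens at categorization time.
knn = KNeighborsClassifier(n_neighbors=3).fit(X, labels)

query = vec.transform(["corn prices fell"])
print(svm_rbf.predict(query), knn.predict(query))
```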
15. Evaluation of Text Categorizers
- Evaluation measures
  - Effectiveness: the ability to take the right classification decisions.
  - Efficiency
    - Training efficiency: the average time it takes to build a classifier for a category from a training set.
    - Classification efficiency: the average time it takes to classify previously unseen documents.
- Measures of TC effectiveness
  - Precision (π): the probability that dj was supposed to be classified under ci, given that it is classified under ci.
  - Recall (ρ): the probability that dj is classified under ci, given that it was supposed to be classified under ci.
16. Evaluation of Text Categorizers
- How do we estimate precision and recall?
  - From a contingency table that summarizes the categorization results for a given category.
- A classifier should be evaluated by a measure that combines precision and recall.
  - Breakeven point: the value at which precision = recall.
  - Microaveraging computes recall and precision globally, over all categories or over the most frequent categories (see the sketch below).
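A small sketch of microaveraging from per-category contingency counts; the TP/FP/FN numbers below are invented for illustration.

```python
# Per-category contingency counts: (true positives, false positives,
# false negatives); the numbers are made up.
contingency = {"c1": (40, 10, 5),
               "c2": (25, 5, 15),
               "c3": (10, 2, 8)}

# Microaveraging: pool the counts over all categories, then compute
# precision and recall once from the global totals.
TP = sum(tp for tp, fp, fn in contingency.values())
FP = sum(fp for tp, fp, fn in contingency.values())
FN = sum(fn for tp, fp, fn in contingency.values())

precision = TP / (TP + FP)  # P(correct | document was assigned to the category)
recall = TP / (TP + FN)     # P(assigned | document belongs to the category)
print(f"micro-precision = {precision:.3f}, micro-recall = {recall:.3f}")
```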
17. Experimental Design (Joachims, 1997)
- Learning methods compared
  - Naïve Bayes
  - SVMs with polynomial and RBF kernels
    - Training carried out with the SVMlight package
  - KNN
- Benchmarks for text categorization
  - Reuters: a collection of newswire stories compiled by Lewis
    - 12,902 Reuters newswire stories
    - Training set: 9,603 documents (75%); test set: 3,299 documents (25%)
    - 90 categories
    - 9,962 distinct terms (after preprocessing)
18. Experimental Design
- Ohsumed: titles plus abstracts from medical journals, compiled by Hersh
  - 10,000 training and 10,000 test documents
  - 23 MeSH "diseases" categories
  - 15,561 distinct terms (after preprocessing)
- Dimensionality reduction
  - The 500, 1,000, or 5,000 best features, or all features, are selected using information gain (see the sketch below).
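A hedged sketch of this feature-selection step, using scikit-learn's mutual-information scorer as a stand-in for information gain (toy corpus and k; the actual experiments used Reuters and Ohsumed with k = 500, 1,000, or 5,000):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Toy labeled corpus; class labels encoded as integers.
docs = ["wheat prices rose", "corn harvest grew",
        "stocks fell sharply", "bank rates rose"]
y = [0, 0, 1, 1]  # 0 = grain, 1 = finance

X = CountVectorizer().fit_transform(docs)

# Keep the k terms with the highest mutual information with the class
# label (information gain), discarding the rest of the vocabulary.
selector = SelectKBest(mutual_info_classif, k=3)
X_reduced = selector.fit_transform(X, y)
print(X.shape, "->", X_reduced.shape)
```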
19. Results
- SVMs outperform KNN and Naïve Bayes.
  - This holds for all parameter settings (polynomial degree d, RBF width γ).
  - No overfitting, even for complex hypothesis spaces.
  - The RBF kernel performs better than the polynomial kernel.
- KNN outperforms Naïve Bayes.
20. Results
- On the Ohsumed data set:
  - SVMs outperform KNN and Naïve Bayes.
    - Polynomial SVM: 65.9
    - RBF SVM: 66.0
  - KNN achieves better performance than Naïve Bayes.
    - KNN: 59.1 (with k = 30)
    - Naïve Bayes: 57.0
- Training efficiency
  - SVMs are more expensive to train than both of the other classifiers.
- Classification efficiency
  - All classifiers are fairly fast; KNN is the slowest.
21. Conclusion
- Automated text categorization has many important applications.
  - It reduces the time and skilled manpower needed for TC.
  - It produces classifications comparable to those of experts.
- Challenges
  - Building classifiers with high accuracy in every applicative context.
  - Classifying documents manually for use in the training phase is costly.
    - Semi-supervised methods can reduce the amount of manual labeling required.
22. Questions?