Machine Learning in Text Categorization - PowerPoint PPT Presentation

Slides: 23
Provided by: mger7

1
Machine Learning in Text Categorization
  • Presenter Muluwork Geremew
  • Course ENEE 752

2
Overview
  • Introduction
  • Text Classification
  • Text Categorization (TC)
  • Application areas of TC
  • Automatic classifier construction
  • Evaluation of text classifiers
  • Conclusion

3
Introduction
  • Availability of large sets of digital data
  • Biological information, songs, TV broadcasts
  • Web pages, text collections, etc.
  • Increased demand for methods to
  • Sort, retrieve
  • Filter and manage digital resources
  • Information Retrieval (IR) techniques
  • Text search: provides tools for finding
    relevant documents within these large collections.
  • Text classification: offers tools for converting
    unstructured text collections into structured
    ones.
  • In doing so, storage and search become easier.

4
Text Classification
  • It is an important ingredient for organizing
    documents.
  • E.g., Web directories such as Yahoo
  • The two variants of text classification
  • Text clustering: finds a latent, as yet
    unspecified group structure.
  • Classes are not known in advance.
  • Text categorization: structures the collection
    according to a scheme provided as input.
  • Classes are predefined.

5
Text Categorization
  • Definition
  • A process of assigning to a text document d from
    a given domain D one or more class labels from a
    finite set of predefined categories.

[Figure: a document d3 assigned to one or more of the categories C1, C2, ..., Ck]

6
Automatic Text Categorization
  • Formal definition
  • Given a set of previously unseen documents
    D = {d1, d2, d3, ...} and a set of pre-defined
    classes or categories C = {c1, c2, c3, ..., ck},
    a classifier (categorizer) is a function Φ that
    maps each document in D to an element of the set
    of all subsets of C, i.e. Φ: D → 2^C.
  • The efficiency and manpower gains from automation
    can significantly aid the business processes of
    any organization.
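
As a minimal sketch of this definition, a classifier can be typed as a function from a document to a subset of the category set. The category names and trigger words below are hypothetical and chosen only for illustration; they do not come from the presentation.

```python
from typing import Callable, FrozenSet

Document = str
Category = str
# A classifier maps each document to a subset of the category set C.
Classifier = Callable[[Document], FrozenSet[Category]]

# Hypothetical category set C with hand-picked trigger words (illustration only).
CATEGORIES = {"sports": {"match", "goal"}, "finance": {"stock", "bank"}}


def keyword_classifier(doc: Document) -> FrozenSet[Category]:
    """Return every category whose trigger words appear in the document."""
    tokens = set(doc.lower().split())
    return frozenset(c for c, words in CATEGORIES.items() if tokens & words)
```

The return value may be empty, contain one category, or contain several, which is exactly what mapping into the set of all subsets of C allows.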

7
Automatic Text Categorization
  • Single-label vs. multi-label TC
  • Single-label: exactly one category must be
    assigned to a document d.
  • Binary categorization is the special case with
    two categories.
  • Multi-label: any number of categories can be
    assigned to d.
  • Most text categorization problems are
    multi-class.
  • Soft (ranking) vs. hard TC
  • Soft: partially automated; the system generates a
    ranked list of categories.
  • Hard: fully automated; the classifier decides on
    the document's category.
  • This presentation deals with hard text
    categorization.
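
The distinctions above can be sketched as three ways of turning per-category scores into an output. This assumes the classifier already produces a score per category; the function names and the 0.5 threshold are illustrative assumptions.

```python
from typing import Dict, List, Set


def rank_categories(scores: Dict[str, float]) -> List[str]:
    """Soft TC: return categories ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)


def hard_single_label(scores: Dict[str, float]) -> str:
    """Hard, single-label TC: pick exactly one category."""
    return max(scores, key=scores.get)


def hard_multi_label(scores: Dict[str, float], threshold: float = 0.5) -> Set[str]:
    """Hard, multi-label TC: keep every category scoring at or above the threshold."""
    return {c for c, s in scores.items() if s >= threshold}
```

For scores such as `{"grain": 0.7, "trade": 0.6, "oil": 0.1}`, the ranking leaves the final choice to a human, while the two hard variants commit to `"grain"` or to the set containing `"grain"` and `"trade"`.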

8
Application of TC
  • Document organization
  • Newswire filtering: grouping news stories
    produced by news agencies into thematic classes
    of interest.
  • Patent classification
  • Text Filtering
  • Classifying incoming stream of documents
  • Spam e-mail filtering
  • Single-label TC
  • Classifying into two disjoint categories
    relevant and irrelevant
  • Authorship attribution, etc.

9
Construction of Automatic Classifiers
  • Starting from late 80s to early 90s, there have
    been two popular approaches.
  • Knowledge engineering
  • Define rules to classify documents
  • All the rules are defined manually.
  • Supervised machine learning
  • Builds an automatic text classifier by learning
    the characteristics of the categories from a
    training set.
  • Saves a lot of time and skilled manpower.
  • Classification accuracy is better than that of
    classifiers built by knowledge engineering
    methods.

10
Automatic Text Categorization Process
  • To build a categorizer, we need a set of
    pre-classified documents.
  • Training set - pre-classified documents used for
    training the categorizer.
  • Test set - pre-classified documents used for
    testing the effectiveness of the classifier
  • Basic outline for construction
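
A minimal sketch of the train/test division, assuming a random split with a hypothetical helper `split_corpus`. Benchmark collections usually come with a fixed split instead, so the random shuffle here is only an assumption for illustration.

```python
import random
from typing import List, Sequence, Tuple

LabeledDoc = Tuple[str, str]  # (text, category)


def split_corpus(docs: Sequence[LabeledDoc], train_fraction: float = 0.75,
                 seed: int = 0) -> Tuple[List[LabeledDoc], List[LabeledDoc]]:
    """Shuffle the pre-classified documents and split them into train and test sets."""
    shuffled = list(docs)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

The test documents must stay out of training entirely, otherwise the measured effectiveness is overly optimistic.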

11
Automatic Text Categorization Phase-1
  • Building internal representations for documents
  • Transforms document d to compact vector form.
  • Tokenization of documents
  • Dimension of the vector corresponds to number of
    distinct words or tokens in the training set.
  • Each entry in the vector represents the weight of
    one term. A document di = (f1i, ..., fmi), where m
    is the dimension.
  • The tf-idf function (tf × idf)
  • The weights are normalized to limit them to the
    interval [0, 1].
  • Dimensionality reduction methods
  • Feature selection
  • Feature extraction
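
The representation step can be sketched as follows. This assumes one common tf-idf variant (raw term frequency times log(N / document frequency)) with cosine normalization; the presentation does not say which variant it uses, and the function name is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one cosine-normalized tf-idf vector (term -> weight) per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # Dividing by the vector length keeps every weight in [0, 1].
        vectors.append({t: (w / norm if norm else 0.0) for t, w in weights.items()})
    return vectors
```

A term that occurs in every document gets weight 0, since it carries no information for telling documents apart.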

12
Automatic Text Categorization Phase-2
  • After phase1, a learner automatically builds a
    text classifier for provided categories by
    observing pre-classified samples in the training
    set.
  • Various machine learning techniques
  • Naïve Bayesian
  • Support Vector Machine
  • K-Nearest-Neighbors
  • Neural Network
  • Decision trees, etc.

13
Machine Learning Algorithm
  • Naïve Bayes
  • Probabilistic classifier
  • Based on Bayes theorem and assumes conditional
    independence.
  • Both the class prior probabilities and the
    class-conditional term probabilities are estimated
    from the training data during the training phase.
  • Training scales linearly with the size of the
    training set.
  • Categorization scales linearly with the number of
    classes.
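
A minimal sketch of a Naïve Bayes text classifier, assuming the multinomial model with add-one (Laplace) smoothing. The presentation does not specify the event model or the smoothing, so both are assumptions, as is the class name.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs: List[Tuple[List[str], str]]) -> "NaiveBayes":
        self.class_counts = Counter(label for _, label in docs)
        self.term_counts: Dict[str, Counter] = defaultdict(Counter)
        for tokens, label in docs:  # a single pass over the training set
            self.term_counts[label].update(tokens)
        self.vocab = {t for tokens, _ in docs for t in tokens}
        self.n_docs = len(docs)
        return self

    def predict(self, tokens: List[str]) -> str:
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():  # one pass over the classes
            total = sum(self.term_counts[label].values()) + len(self.vocab)
            score = math.log(count / self.n_docs)  # log prior of the class
            for t in tokens:
                if t in self.vocab:  # terms never seen in training are ignored
                    score += math.log((self.term_counts[label][t] + 1) / total)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

Summing log probabilities term by term is where the conditional independence assumption enters: each term contributes to the score independently of the others.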

14
Machine Learning Algorithms
  • Support Vector Machines (SVM)
  • Uses polynomial and RBF kernels
  • SVMlight decomposes the large quadratic
    programming (QP) problem into smaller ones and
    solves them iteratively until the solution
    converges.
  • One SVM classifier for each pair of categories:
    k(k-1)/2 classifiers in total
  • One-against-one approach
  • Training scales poorly with training set size
  • K-Nearest-Neighbor (KNN)
  • Lazy learner: no offline learning phase
  • A new document is assigned the majority class
    among its k most similar training documents.
  • Simple but computationally intensive
  • Slow categorization phase
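
A minimal KNN sketch, assuming cosine similarity over sparse term-weight vectors and an unweighted majority vote. Both choices are assumptions, since the presentation does not state the similarity measure or the voting scheme.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

Vector = Dict[str, float]  # sparse term -> weight


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def knn_classify(query: Vector, training: List[Tuple[Vector, str]], k: int = 3) -> str:
    """Assign the majority class among the k most similar training documents."""
    # No training phase: every query is compared against the whole training set.
    neighbors = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

Because each query is compared with every training document, the cost of categorization grows with the size of the training set, which is why the categorization phase is slow.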

15
Evaluation of Text Categorizers
  • Evaluation measures
  • Effectiveness - ability to make the right
    classification decisions.
  • Efficiency
  • Training efficiency - average time it takes to
    build a classifier for a category from a training
    set
  • Classification efficiency -average time it takes
    to classify previously unseen documents.
  • Measures of TC effectiveness
  • Precision p - the probability that dj was
    supposed to be classified under ci given that it
    is classified under ci
  • Recall r - the probability that dj is classified
    under ci given that it was supposed to be
    classified under ci
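
In terms of counts for a single category, precision is the share of the documents assigned to the category that truly belong to it, and recall is the share of the documents belonging to the category that were assigned to it. A minimal sketch follows; returning 0.0 for an empty denominator is a convention assumed here.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of documents assigned to the category that belong to it."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of documents belonging to the category that were assigned to it."""
    return tp / (tp + fn) if tp + fn else 0.0
```

Here `tp` counts correct assignments, `fp` counts documents wrongly assigned to the category, and `fn` counts documents of the category that the classifier missed.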

16
Evaluation of Text Categorizers
  • How do we estimate precision and recall ?
  • Contingency table that summarizes the
    categorization result for given category.
  • Classifier should be evaluated by a measure that
    combines precision and recall.
  • Breakeven point - the value at which precision
    equals recall
  • Microaveraging - computes precision and recall
    over all categories, or over the most frequent
    categories, by pooling their individual decisions
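
Microaveraging can be sketched as summing the contingency counts of all categories before computing precision and recall. The data layout (category mapped to a tuple of true positives, false positives, false negatives) and the example numbers used with it are hypothetical.

```python
from typing import Dict, Tuple

# Per-category contingency counts: category -> (TP, FP, FN).
Counts = Dict[str, Tuple[int, int, int]]


def micro_average(counts: Counts) -> Tuple[float, float]:
    """Pool the counts of all categories, then compute (precision, recall)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro_precision = tp / (tp + fp) if tp + fp else 0.0
    micro_recall = tp / (tp + fn) if tp + fn else 0.0
    return micro_precision, micro_recall
```

Because the counts are pooled, frequent categories dominate the result: with `{"frequent": (90, 10, 10), "rare": (1, 4, 4)}` both values are 91/105, close to the frequent category's own 0.9 despite the rare category's 0.2.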

17
Experimental Design (Joachims, 1997)
  • Learning methods compared
  • Naïve Bayesian
  • SVMs - polynomial and RBF kernels
  • Training carried out with SVMlight package
  • KNN
  • Benchmarks for text categorization
  • Reuters - collection of newswire stories compiled
    by Lewis
  • 12902 Reuters newswire stories
  • Training set - 9603 (75%), test set - 3299
    (25%)
  • 90 categories
  • 9962 distinct terms (after preprocessing)

18
Experimental Design
  • OHSUMED - titles plus abstracts from medical
    journals, compiled by Hersh
  • 10000 training and 10000 test documents
  • 23 MeSH disease categories
  • 15561 distinct terms (after preprocessing)
  • Dimensionality reduction
  • The 500, 1000, or 5000 best features, or all
    features, are selected using information gain.
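
A minimal sketch of feature selection by information gain, assuming binary term presence per document and single-label documents. The function names are hypothetical, and this is not necessarily the exact formulation used in the experiments.

```python
import math
from collections import Counter
from typing import List, Set, Tuple


def entropy(labels: List[str]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values()) if n else 0.0


def information_gain(term: str, docs: List[Tuple[Set[str], str]]) -> float:
    """Reduction in class entropy from knowing whether `term` occurs in a document."""
    all_labels = [label for _, label in docs]
    with_term = [label for terms, label in docs if term in terms]
    without_term = [label for terms, label in docs if term not in terms]
    n = len(docs)  # assumed non-empty
    conditional = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(all_labels) - conditional


def select_features(docs: List[Tuple[Set[str], str]], n_best: int) -> List[str]:
    """Keep the n_best terms with the highest information gain."""
    vocab = sorted({t for terms, _ in docs for t in terms})
    return sorted(vocab, key=lambda t: information_gain(t, docs), reverse=True)[:n_best]
```

A term that appears in only one class scores high, while a term spread evenly across classes scores near zero and is dropped first.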

19
Results
  • SVMs outperform KNN and Naïve Bayes
  • Holds for all parameter settings (d, γ)
  • No overfitting, even for complex hypothesis spaces
  • RBF is better than polynomial kernel
  • KNN outperforms Naïve Bayes

20
Results
  • On the OHSUMED data set
  • SVMs outperform KNN and Naïve Bayes.
  • Polynomial SVM 65.9
  • RBF SVM 66.0
  • KNN achieves better performance than Naïve Bayes
  • KNN 59.1, where k = 30
  • Naïve Bayes 57.0
  • Training efficiency
  • SVMs are more expensive to train than the other
    two classifiers.
  • Classification efficiency
  • All classifiers are pretty fast.
  • KNN is the slowest.

21
Conclusion
  • Automated text categorization has many important
    applications.
  • Reduces time and skilled manpower for TC.
  • Produces comparable classifications to those of
    experts.
  • Challenges
  • Building classifiers with high accuracy in all
    application contexts.
  • Classifying documents manually for use in
    training phase is costly.
  • Semi-supervised methods

22
Questions?
  • Thank you!