A Survey on Text Categorization with Machine Learning

1
A Survey on Text Categorization with Machine Learning
  • Chikayama lab.
  • Dai Saito

2
Introduction: Text Categorization
  • Many digital texts are available
  • E-mail, online news, blogs
  • The need for automatic text categorization is
    increasing
  • No human labor required
  • Savings in time and cost

3
Introduction: Text Categorization
  • Application
  • Spam filter
  • Topic Categorization

4
Introduction: Machine Learning
  • Builds categorization rules automatically from
    features of the text
  • Types of Machine Learning (ML)
  • Supervised Learning
  • Labeling
  • Unsupervised Learning
  • Clustering

5
Introduction: Flow of ML
  • Prepare training text data with labels
  • Extract features of the text
  • Learn
  • Categorize new text

[Figure: training texts labeled Label1 and Label2; a new text marked "?" is to be categorized]
6
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

7
Number of labels
  • Binary-label
  • True or False (Ex. spam or not)
  • Can be applied to the other types
  • Multi-label
  • Many labels, but one text has one label
  • Overlapping-label
  • One text has several labels

[Figure: binary labels (Yes / No); multi-label (L1 to L4); overlapping labels (L1 to L4)]
8
Types of labels
  • Topic Categorization
  • Basic Task
  • Compare individual words
  • Author Categorization
  • Sentiment Categorization
  • Ex) Review of products
  • Need more linguistic information

9
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

10
Feature of Text
  • How to express the features of a text?
  • Bag of Words
  • Ignores the order of words
  • Structure
  • Ex) "I like this car." vs. "I don't like this car."
  • Bag of Words will not work well here
  • (d: document = text)
  • (t: term = word)
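
As a rough illustration of the slide's example, the sketch below builds a bag of words with Python's standard library. The tokenizer regex and the function name `bag_of_words` are assumptions made for this example, not part of the original slides.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count them.
    Word order is discarded, which is the defining property of the model."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

positive = bag_of_words("I like this car.")
negative = bag_of_words("I don't like this car.")

# The two bags differ only by the single token "don't", even though the
# sentences express opposite opinions.
difference = negative - positive
print(positive)
print(difference)
```

Because the two representations are nearly identical, a classifier that only sees these counts has little to separate the sentences, which is why sentiment tasks tend to need more structure.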

11
Preprocessing
  • Remove stop words
  • the, a, for
  • Stemming
  • relational -> relate, truly -> true
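
A minimal sketch of both preprocessing steps follows. The stop-word list holds only the slide's three examples, and `STEM_TABLE` is a hypothetical lookup table standing in for a real stemming algorithm such as Porter's; both are assumptions for illustration.

```python
import re

# Tiny illustrative stop-word list (the slide's examples); real lists are longer.
STOP_WORDS = {"the", "a", "for"}

# Hypothetical lookup table standing in for a real stemmer.
STEM_TABLE = {"relational": "relate", "truly": "true"}

def preprocess(text):
    """Tokenize, drop stop words, then map each remaining token to its stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [t for t in tokens if t not in STOP_WORDS]
    return [STEM_TABLE.get(t, t) for t in kept]

result = preprocess("The relational model is truly a model for data")
print(result)
```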

12
Term Weighting
  • Term Frequency
  • Number of occurrences of a term in a document
  • Terms that are frequent in a document seem to be
    important for categorization
  • tf-idf
  • Terms appearing in many documents are not useful
    for categorization
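
The weighting idea above can be sketched as follows. The slide does not preserve its exact formula, so this assumes one common variant, tf(t, d) · log(N / df(t)); the function name `tf_idf` and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per tokenized document,
    using weight = tf(t, d) * log(N / df(t))."""
    n_docs = len(documents)
    # df[t]: number of documents that contain term t at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

docs = [
    ["cheap", "pills", "cheap", "offer"],
    ["meeting", "offer", "agenda"],
    ["meeting", "agenda", "offer"],
]
weights = tf_idf(docs)
print(weights[0])
```

In this toy corpus "offer" occurs in every document, so its weight is zero, while "cheap" is both frequent in the first document and rare elsewhere, so it gets the largest weight.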

13
Sentiment Weighting
  • For sentiment classification, weight a word as
    positive or negative
  • Constructing a sentiment dictionary
  • WordNet [04 Kamps et al.]
  • Synonym database
  • Using the distance from "good" and "bad"

d(good, happy) = 2
d(bad, happy) = 4
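
One way to read the distance idea is as a shortest-path length in a synonym graph. The sketch below uses breadth-first search on a tiny hand-made graph; the edges are invented so that the two distances on the slide come out, and they are not taken from WordNet. The `orientation` score (distance to "bad" minus distance to "good") is a simplification assumed for this example.

```python
from collections import deque

# Invented synonym edges for illustration only (not real WordNet data).
EDGES = [("good", "glad"), ("glad", "happy"),
         ("good", "fair"), ("fair", "bad"), ("bad", "sad")]

GRAPH = {}
for a, b in EDGES:
    GRAPH.setdefault(a, set()).add(b)
    GRAPH.setdefault(b, set()).add(a)

def distance(graph, start, goal):
    """Shortest path length between two words (BFS); None if unreachable."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        word, dist = queue.popleft()
        if word == goal:
            return dist
        for nxt in graph.get(word, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def orientation(word):
    """Positive if the word is closer to "good" than to "bad"."""
    return distance(GRAPH, "bad", word) - distance(GRAPH, "good", word)

print(distance(GRAPH, "good", "happy"), distance(GRAPH, "bad", "happy"))
print(orientation("happy"), orientation("sad"))
```
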
14
Dimension Reduction
  • Size of the feature vectors is (#terms) × (#documents)
  • #terms ≈ size of the dictionary
  • High calculation cost
  • Risk of overfitting
  • Best for training data ≠ best for real data
  • Choose effective features
  • to improve accuracy and reduce calculation cost

15
Dimension Reduction
  • df-threshold
  • Terms appearing in very few documents (e.g., only
    one) are not important
  • Score
  • If t and cj are independent, the score is equal
    to zero
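
The slide's score formula is not preserved in this transcript. As an assumption, the sketch below uses the chi-square statistic, one widely used term-selection score that is zero when a term t and a category cj are independent. The argument names are chosen for this example.

```python
def chi_square(a, b, c, d):
    """Chi-square score of a term t for a category cj.

    a: documents in cj that contain t
    b: documents outside cj that contain t
    c: documents in cj that do not contain t
    d: documents outside cj that do not contain t
    """
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # degenerate table: the term or category never varies
    return n * (a * d - c * b) ** 2 / denominator

# t occurs in 20% of documents inside and outside cj: independent, score 0.
independent = chi_square(10, 10, 40, 40)
# t occurs mostly inside cj: strongly associated, large score.
associated = chi_square(40, 5, 10, 45)
print(independent, associated)
```

Terms would then be ranked by score and only the top-scoring ones kept as features.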

16
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

17
Learning Algorithm
  • Many (almost all?) learning algorithms have been
    used for text categorization
  • Simple approaches
  • Naïve Bayes
  • k-Nearest Neighbor
  • High-performance approaches
  • Boosting
  • Support Vector Machine
  • Hierarchical Learning

18
Naïve Bayes
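  • Bayes rule: $P(c_j \mid d) = \frac{P(d \mid c_j)\,P(c_j)}{P(d)}$
  • $P(d \mid c_j)$ is hard to calculate directly
  • Assumption: each term occurs independently, so
    $P(d \mid c_j) \approx \prod_i P(t_i \mid c_j)$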
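
A minimal sketch of a Naïve Bayes text classifier built on the independence assumption follows. It assumes a multinomial model with add-one (Laplace) smoothing, which the slide does not specify; the function names and the toy spam/ham data are made up for the example.

```python
import math
from collections import Counter, defaultdict

def train(labeled_docs):
    """labeled_docs: list of (tokens, label). Collect the counts used later."""
    class_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        term_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, term_counts, vocabulary

def predict(model, tokens):
    """Return the label maximizing log P(c) + sum_i log P(t_i | c)."""
    class_counts, term_counts, vocabulary = model
    n_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        total_terms = sum(term_counts[label].values())
        score = math.log(count / n_docs)
        for t in tokens:
            if t not in vocabulary:
                continue  # ignore terms never seen in training
            # Add-one smoothing avoids zero probabilities
            p = (term_counts[label][t] + 1) / (total_terms + len(vocabulary))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

training = [
    (["cheap", "pills", "offer"], "spam"),
    (["cheap", "offer", "now"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train(training)
print(predict(model, ["cheap", "offer"]))
print(predict(model, ["meeting", "notes"]))
```

Working in log space keeps the product of many small probabilities from underflowing.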

19
k-Nearest Neighbor
  • Define a distance between two texts
  • Ex) $\mathrm{Sim}(d_1, d_2) = \frac{d_1 \cdot d_2}{\lVert d_1 \rVert\,\lVert d_2 \rVert} = \cos\theta$
  • Check the k most similar texts and categorize
    by majority vote
  • The larger the stored training data, the higher
    the memory and search cost

[Figure: k-NN with k = 3; texts d1 and d2 as vectors, with a new text marked "?"]
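
The procedure can be sketched with cosine similarity over term-count vectors and a majority vote among the k nearest training texts. The function names and the toy data are assumptions for illustration.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_classify(training, query, k=3):
    """training: list of (vector, label). Vote among the k most similar texts."""
    ranked = sorted(training, key=lambda item: cosine(item[0], query),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

training = [
    (Counter(["cheap", "pills"]), "spam"),
    (Counter(["cheap", "offer"]), "spam"),
    (Counter(["meeting", "agenda"]), "ham"),
    (Counter(["meeting", "notes"]), "ham"),
    (Counter(["project", "notes"]), "ham"),
]
print(knn_classify(training, Counter(["cheap", "offer", "pills"])))
print(knn_classify(training, Counter(["meeting", "notes"])))
```

Every query is compared against all stored training texts, which is where the memory and search cost comes from.
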
20
Boosting
  • BoosTexter [00 Schapire et al.]
  • AdaBoost
  • Builds many weak learners with different
    parameters
  • The K-th weak learner checks the performance of
    learners 1..K-1 and tries to correctly classify
    the training data they scored worst on
  • BoosTexter uses a decision stump as the weak learner
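
The following is a small sketch of binary AdaBoost with decision stumps that test whether a single term is present. It is a textbook-style simplification, not BoosTexter itself (which handles multi-label data and real-valued predictions); all names and the toy data are assumptions.

```python
import math
from itertools import combinations

def stump_output(term, sign, doc):
    """Predict `sign` if the term is present in the document, else `-sign`."""
    return sign if term in doc else -sign

def best_stump(docs, labels, weights, vocabulary):
    """Return (weighted error, term, sign) of the lowest-error stump."""
    best = None
    for term in sorted(vocabulary):
        for sign in (+1, -1):
            error = sum(w for doc, y, w in zip(docs, labels, weights)
                        if stump_output(term, sign, doc) != y)
            if best is None or error < best[0]:
                best = (error, term, sign)
    return best

def adaboost(docs, labels, rounds):
    vocabulary = set().union(*docs)
    weights = [1.0 / len(docs)] * len(docs)
    ensemble = []
    for _ in range(rounds):
        error, term, sign = best_stump(docs, labels, weights, vocabulary)
        error = min(max(error, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - error) / error)
        ensemble.append((alpha, term, sign))
        # Misclassified examples gain weight, so the next stump focuses on them
        weights = [w * math.exp(-alpha * y * stump_output(term, sign, doc))
                   for doc, y, w in zip(docs, labels, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, doc):
    score = sum(alpha * stump_output(term, sign, doc)
                for alpha, term, sign in ensemble)
    return 1 if score >= 0 else -1

# Toy task: a text is spam (+1) if it contains at least two of three terms.
TERMS = ["cheap", "offer", "pills"]
docs = [set(c) for r in range(4) for c in combinations(TERMS, r)]
labels = [1 if len(d) >= 2 else -1 for d in docs]

one_round = adaboost(docs, labels, rounds=1)
three_rounds = adaboost(docs, labels, rounds=3)
print(sum(predict(one_round, d) == y for d, y in zip(docs, labels)), "of 8")
print(sum(predict(three_rounds, d) == y for d, y in zip(docs, labels)), "of 8")
```

No single stump solves this toy task, but the weighted vote of three stumps does, which is the point of combining weak learners.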

21
Simple example of Boosting
22
Support Vector Machine
  • Text Categorization with SVM [98 Joachims]
  • Maximize margin
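
For reference, margin maximization is usually written as the following optimization problem. This is the standard hard-margin formulation, given here as an assumption since the slide's own notation is not preserved; $x_i$ are the training feature vectors, $y_i \in \{-1, +1\}$ their labels, and $w$, $b$ define the separating hyperplane.

```latex
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i
```

The margin between the two classes equals $2 / \lVert w \rVert$, so minimizing $\lVert w \rVert$ maximizes the margin.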

23
Text Categorization with SVM
  • SVM works well for text categorization
  • Robust to high dimensionality
  • Robust to overfitting
  • Most text categorization problems are linearly
    separable
  • All of OHSUMED (MEDLINE collection)
  • Most of Reuters-21578 (news collection)

24
Comparison of these methods
  • [02 Sebastiani]
  • Reuters-21578 (2 versions)
  • The versions differ in the number of categories

Method        Ver.1 (90)   Ver.2 (10)
k-NN          .860         .823
Naïve Bayes   .795         .815
Boosting      .878         -
SVM           .870         .920
25
Hierarchical Learning
  • TreeBoost [06 Esuli et al.]
  • Boosting algorithm for hierarchical labels
  • Training data: hierarchical labels and texts
    with labels
  • Applies AdaBoost recursively
  • Better classifier than flat AdaBoost
  • Accuracy: 2-3 up
  • Time: training and categorization time down
  • Hierarchical SVM [04 Cai et al.]

26
TreeBoost
[Figure: label tree. root has children L1, L2, L3, L4; L1 has children L11, L12; L4 has children L41, L42, L43; L42 has children L421, L422]
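
The sketch below shows generic top-down categorization over a label tree. The tree shape is inferred from the label names on this slide, and the keyword-based `keyword_chooser` is a hypothetical stand-in for the local classifiers; it is not the TreeBoost algorithm, which trains boosted classifiers at each node.

```python
# Label tree inferred from the label names: parent -> children
TREE = {
    "root": ["L1", "L2", "L3", "L4"],
    "L1": ["L11", "L12"],
    "L4": ["L41", "L42", "L43"],
    "L42": ["L421", "L422"],
}

# Hypothetical keywords standing in for trained per-node classifiers.
KEYWORDS = {
    "L1": {"politics"}, "L11": {"election"},
    "L4": {"sports"}, "L42": {"football"}, "L421": {"league"},
}

def keyword_chooser(node, children, doc):
    """Pick the child whose keywords overlap the document most
    (ties go to the first child)."""
    return max(children, key=lambda c: len(KEYWORDS.get(c, set()) & doc))

def classify_top_down(tree, choose_child, doc, node="root"):
    """Descend from the root, choosing one child per level until a leaf."""
    path = []
    while node in tree:
        node = choose_child(node, tree[node], doc)
        path.append(node)
    return path

print(classify_top_down(TREE, keyword_chooser, {"sports", "football", "league"}))
print(classify_top_down(TREE, keyword_chooser, {"politics", "election"}))
```

Each decision only compares sibling labels, so far fewer classifiers are consulted per text than in a flat scheme over all leaves.
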
27
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

28
Conclusion
  • Overview of text categorization with machine
    learning
  • Feature of Text
  • Learning Algorithm
  • Future Work
  • Natural language processing with machine
    learning, especially in Japanese
  • Calculation cost