Text Classification from Labeled and Unlabeled Documents using EM

Slides: 59

Provided by: infEdAcUk

1
Text Classification from Labeled and Unlabeled
Documents using EM

Kamal Nigam
Andrew Kachites McCallum
Sebastian Thrun
Tom Mitchell
Presented by
Yuan Fang, Fengyuan Hu and Sandhya Prabhakaran

2
Job Hunting?
3
Roadmap

Part I: Text Classification
Part II: Incorporating Unlabeled Data with EM
Part III: Results and Recap

4
Part I: Text Classification
5
Text Classification: the Definition

Text classification systems categorize documents
into one (or several) of a set of pre-defined
topics of interest

6
How Are Automatic Text Classifiers Created?

Before: manual construction of rule sets (painful
and time-consuming)
Present: supervised learning to construct a
classifier (efficient and successful)

7
What To Provide

Provide an algorithm with an example set of
documents for each class, and let it find a
representation or decision rule for classifying
future documents automatically
This approach will
- give high-accuracy classifiers
- be significantly less expensive than writing
rules by hand

8
What Data is Available

Key difficulty: a large number of labeled
training examples are required to learn
accurately, which is what we need but don't have
One would obviously prefer algorithms that can
provide accurate classifications after hand
labeling only a dozen articles, rather than
thousands
What other sources of information can reduce the
need for labeled data?

9
Unlabeled data

How unlabeled data can be used to increase
classification accuracy, especially when labeled
data are scarce
An intuitive example

10
Goal And Merit

The goal:
To demonstrate that supervised learning
algorithms can use a small number of labeled
examples with a large number of unlabeled
examples to create high-accuracy text classifiers
The merit:
Unlabeled examples are much less expensive and
more easily available

11
Parametric Generative Model Overview

Assumption: a statistical process generates the
documents (words and class labels)
Statistical process = parametric generative model

12
Incorporating Unlabeled Data with Generative
Models

Using EM to find high-probability parameters of
the model given a combination of labeled and
unlabeled data
Experimental evidence shows that using unlabeled
data with EM can increase classification accuracy

13
Assumptions In the Model

(1) Documents are generated by a mixture of
multinomials model, where each mixture component
corresponds to a class (one class to one
component)
(2) The mixture components are multinomial
distributions over individual words: the words
are produced independently of each other given
the class

14
Two Multisided Dice

Let there be C classes and a vocabulary of size
V; each document d has |d| words in it.
First, we roll a biased C-sided die to
determine the class of our document.
Then we roll the biased V-sided die that
corresponds to the chosen class |d| times and
write down the indicated words. These words form
the generated document.
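
The two-dice story can be sketched in a few lines of Python. This is only an illustration under assumed values: the four-word vocabulary, the class weights and the per-class word probabilities are all made up, and the document length is simply passed in rather than sampled.

```python
import random

def generate_document(class_probs, word_probs, vocab, length, rng):
    """Sample (class index, list of words) following the two-dice story."""
    # Roll the biased C-sided die: pick a class according to the mixture weights.
    c = rng.choices(range(len(class_probs)), weights=class_probs, k=1)[0]
    # Roll that class's biased V-sided die `length` times; each word is drawn
    # independently of the others given the class.
    words = rng.choices(vocab, weights=word_probs[c], k=length)
    return c, words

# Hypothetical toy parameters (not from the paper).
rng = random.Random(0)
vocab = ["ball", "game", "vote", "law"]
class_probs = [0.5, 0.5]
word_probs = [[0.45, 0.45, 0.05, 0.05],   # class 0: sports-like words
              [0.05, 0.05, 0.45, 0.45]]   # class 1: politics-like words
label, doc = generate_document(class_probs, word_probs, vocab, 6, rng)
```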

15
Parametric Generative Model

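$\theta$ - parameters for the mixture model
$c_j \in C = \{c_1, \ldots, c_{|C|}\}$ - mixture
components
$P(c_j|\theta)$ - mixture weights or class
probabilities
$P(d_i|c_j;\theta)$ - document distribution of
the selected class
Equation (1): $P(d_i|\theta) = \sum_{j=1}^{|C|} P(c_j|\theta)\, P(d_i|c_j;\theta)$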

16
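Notation

$c_j$ - the jth mixture component, as well as the
jth class.
$y_i$ - the class label for a particular document
($y_i = c_j$)
A document $d_i$ is considered to be an ordered
list of word events,
$\langle w_{d_i,1}, w_{d_i,2}, \ldots \rangle$
We write $w_{d_i,k}$ for the word in position k
of $d_i$
$w_t$ - a word in the vocabulary
$V = \langle w_1, w_2, \ldots, w_{|V|} \rangle$
$|d_i|$ - document length, chosen independently
of the component, with its own probability
$P(|d_i|)$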

17
Parametric Generative Model

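Expanding Equation (1) with the document length
and the words in the document:
Equation (2): $P(d_i|c_j;\theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k}|c_j;\theta; w_{d_i,q}, q<k)$
The words of a document are generated
independently of context:
Equation (3): $P(w_{d_i,k}|c_j;\theta; w_{d_i,q}, q<k) = P(w_{d_i,k}|c_j;\theta)$
Combining these last two equations gives the
naive Bayes expression for the probability of a
document given its class:
Equation (4): $P(d_i|c_j;\theta) = P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k}|c_j;\theta)$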

18
Model Parameters

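Collection of word probabilities, each written
$\theta_{w_t|c_j} \equiv P(w_t|c_j;\theta)$
Document length is identically distributed for
all classes, so it need not be parameterized for
classification
$\theta_{c_j} \equiv P(c_j|\theta)$ denotes the
mixture weights (class probabilities)
The complete collection of model parameters:
$\theta = \{\theta_{w_t|c_j} : w_t \in V, c_j \in C;\ \theta_{c_j} : c_j \in C\}$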

19
Naive Bayes Text Classification

Using a collection of labeled documents for
training
Finding the most probable parameters for the
statistical model introduced

20
Training A Naive Bayes Classifier With Labeled
Data

Estimating the parameters of the generative model
by using a set of labeled training data
(the estimate of the parameters $\theta$ is
written $\hat\theta$)
Finding $\hat\theta = \arg\max_\theta P(\theta|D)$
(MAP), the value of $\theta$ that is most probable
given the evidence of the training data and a
prior.

21
Training A Naive Bayes Classifier With Labeled
Data

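The word probability estimates
$\hat\theta_{w_t|c_j}$ are given by Equation (6):
$\hat\theta_{w_t|c_j} = \frac{1 + \sum_{i=1}^{|D|} N(w_t,d_i)\, P(y_i=c_j|d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s,d_i)\, P(y_i=c_j|d_i)}$
where $N(w_t,d_i)$ is the number of times word
$w_t$ occurs in document $d_i$
Class probabilities $\hat\theta_{c_j}$ are given
by Equation (7):
$\hat\theta_{c_j} = \frac{1 + \sum_{i=1}^{|D|} P(y_i=c_j|d_i)}{|C| + |D|}$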
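
A minimal Python sketch of these smoothed estimates for fully labeled data, where each document counts entirely towards its own class. The function name `train_nb`, the token-list input format and the toy documents are assumptions made for illustration only.

```python
from collections import Counter

def train_nb(docs, labels, vocab, n_classes):
    """docs: list of token lists; labels: class index of each document."""
    word_counts = [Counter() for _ in range(n_classes)]
    class_counts = [0] * n_classes
    for tokens, y in zip(docs, labels):
        class_counts[y] += 1
        word_counts[y].update(t for t in tokens if t in vocab)
    # Class probabilities: (1 + documents in class) / (|C| + |D|).
    priors = [(1 + class_counts[j]) / (n_classes + len(docs))
              for j in range(n_classes)]
    # Word probabilities: (1 + count of word in class) / (|V| + all word counts in class).
    word_probs = []
    for j in range(n_classes):
        total = sum(word_counts[j].values())
        word_probs.append({w: (1 + word_counts[j][w]) / (len(vocab) + total)
                           for w in vocab})
    return priors, word_probs

# Hypothetical toy data.
vocab = ["ball", "game", "vote", "law"]
docs = [["ball", "game"], ["vote", "law", "law"]]
priors, word_probs = train_nb(docs, [0, 1], vocab, n_classes=2)
```

The "+1" in each numerator is what keeps a word that never appeared in a class from getting probability zero.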

22
Classifying New Documents with Naive Bayes

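Equation (8): $P(y_i=c_j|d_i;\hat\theta) = \frac{P(c_j|\hat\theta) \prod_{k=1}^{|d_i|} P(w_{d_i,k}|c_j;\hat\theta)}{\sum_{r=1}^{|C|} P(c_r|\hat\theta) \prod_{k=1}^{|d_i|} P(w_{d_i,k}|c_r;\hat\theta)}$
If the task is to classify a test document $d_i$
into a single class, then the class with the
highest posterior probability,
$\arg\max_j P(y_i=c_j|d_i;\hat\theta)$,
is selected.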
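
The posterior computation can be sketched as follows. Working in log space is an implementation choice made here to avoid numerical underflow on long documents, not something the slides prescribe; the parameter values are invented, and out-of-vocabulary words are simply skipped.

```python
import math

def posterior(tokens, priors, word_probs):
    """Return P(class | document) for every class under naive Bayes."""
    log_scores = []
    for prior, probs in zip(priors, word_probs):
        score = math.log(prior)
        for t in tokens:
            if t in probs:            # ignore words outside the vocabulary
                score += math.log(probs[t])
        log_scores.append(score)
    # Normalize: subtract the maximum before exponentiating for stability.
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    z = sum(exps)
    return [e / z for e in exps]

# Hypothetical parameters for two classes over a two-word vocabulary.
priors = [0.5, 0.5]
word_probs = [{"ball": 0.8, "vote": 0.2}, {"ball": 0.2, "vote": 0.8}]
post = posterior(["ball", "ball"], priors, word_probs)
predicted = max(range(len(post)), key=lambda j: post[j])
```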

23
Part II: Incorporating Unlabeled Data with EM
24
The Problem

The case where only labeled data is given has
already been explained:
MAP estimation maximizes the posterior
probability of the parameters.
Naive Bayes is trained on the labeled data and
used for classification.
Now the case is that both labeled and unlabeled
data are given.
Searching for a solution? Here it is!

25
Revision of EM

Recall what you know about EM from PMR: might be
painful, but helpful
Mixture Model
Hidden variable z indicates which component is
active

26
Revision of EM

EM applied to a Gaussian Mixture Model
Maximum likelihood estimation of the parameters
µ and Σ
E step: evaluate the responsibilities using the
current parameter estimates
M step: re-estimate the parameters using the
current responsibilities
Run the demo

27
Back to the paper
28
Back to the paper

Collection D of labeled and unlabeled documents.
MAP
Try to maximize P(θ|D)
Bayesian method: P(θ|D) ∝ P(θ) P(D|θ)

29
Back to the paper

Log likelihood
Incomplete-data log likelihood equation

30
Back to the paper

$z_{ij}$: binary indicator variables, set to 1 if
$y_i = c_j$, else 0.
The problem of maximizing the incomplete log
probability can then be replaced by maximizing
the complete log probability of the parameters.
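
For reference, the two objectives contrasted on these slides can be written out as below. This is a reconstruction in the notation of the earlier slides, not a copy of the slide images; $D^l$ and $D^u$ (the labeled and unlabeled documents) are symbols assumed here, and in the labeled term $c_j$ stands for the class given by that document's own label.

```latex
% Incomplete-data log probability: the class of an unlabeled document is
% unknown, so a sum over classes sits inside the logarithm.
l(\theta \mid D) = \log P(\theta)
  + \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
  + \sum_{d_i \in D^l} \log \big( P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta) \big)

% Complete-data log probability: with the indicators z_{ij} known, the sum
% over classes moves outside the logarithm.
l_c(\theta \mid D; z) = \log P(\theta)
  + \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij} \log \big( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \big)
```

The log of a sum in the first expression is what makes direct maximization hard; EM works with the second form, replacing each unknown $z_{ij}$ by its expected value.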

31
Back to the paper

Methods used in the paper
Basic EM
Augmented EM
(1) Weighting the unlabeled data
(2) Multiple mixture components per class

32
Basic EM

Initialize the NB classifier $\hat\theta$ using
MAP parameter estimation, from the labeled
dataset only.
E step: estimate the component membership of
each unlabeled document by calculating its
expected value, $P(c_j|d_i;\hat\theta)$, under
the current classifier $\hat\theta$.
M step: re-estimate the classifier on the whole
data set, using MAP, then loop back to the E step
Look at the complete log probability
$l_c(\theta|D;z)$ to measure the improvement of
the parameters and decide when to stop the loop
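
The loop can be sketched in Python as below. It is a simplified illustration rather than the authors' implementation: it runs a fixed number of iterations instead of monitoring the log probability, skips out-of-vocabulary words, and all names and toy documents are invented.

```python
import math

def m_step(docs, resp, vocab, n_classes):
    """MAP estimates (with +1 smoothing) from soft class memberships."""
    priors = [(1 + sum(r[j] for r in resp)) / (n_classes + len(docs))
              for j in range(n_classes)]
    word_probs = []
    for j in range(n_classes):
        counts = {w: 1.0 for w in vocab}          # the "+1" prior count
        for tokens, r in zip(docs, resp):
            for t in tokens:
                if t in counts:
                    counts[t] += r[j]             # fractional word count
        total = sum(counts.values())
        word_probs.append({w: c / total for w, c in counts.items()})
    return priors, word_probs

def e_step(tokens, priors, word_probs):
    """Posterior class membership of one document under the current model."""
    logs = [math.log(p) + sum(math.log(wp[t]) for t in tokens if t in wp)
            for p, wp in zip(priors, word_probs)]
    m = max(logs)
    exps = [math.exp(s - m) for s in logs]
    z = sum(exps)
    return [e / z for e in exps]

def basic_em(labeled, labels, unlabeled, vocab, n_classes, n_iter=10):
    one_hot = [[1.0 if y == j else 0.0 for j in range(n_classes)] for y in labels]
    # Initialize from the labeled documents only.
    priors, word_probs = m_step(labeled, one_hot, vocab, n_classes)
    for _ in range(n_iter):
        # E step: soft labels for the unlabeled documents.
        soft = [e_step(d, priors, word_probs) for d in unlabeled]
        # M step: re-estimate on labeled plus unlabeled documents.
        priors, word_probs = m_step(labeled + unlabeled, one_hot + soft,
                                    vocab, n_classes)
    return priors, word_probs

# Hypothetical toy data: "game" and "law" never occur in a labeled document.
vocab = ["ball", "game", "vote", "law"]
labeled, labels = [["ball"], ["vote"]], [0, 1]
unlabeled = [["ball", "game"], ["game", "game"], ["vote", "law"], ["law", "law"]]
priors, word_probs = basic_em(labeled, labels, unlabeled, vocab, 2)
game_post = e_step(["game"], priors, word_probs)
law_post = e_step(["law"], priors, word_probs)
```

In this toy run the unlabeled documents are what tie "game" to the class of "ball" and "law" to the class of "vote", since neither word appears in the labeled set.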

33
Restrictions of Basic EM

Assumptions/Restrictions
Large unlabeled data set, small labeled data set
→ if not true, unlabeled data can hurt the
accuracy.
One-to-one correspondence of components and
classes → not so accurate because subtopics exist.

34
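Augmented EM -- weighting unlabeled data

Method: weaken the contribution of unlabeled
data when the labeled set is already good enough
for classification.
Equation: $l_c(\theta|D;z) = \log P(\theta) + \sum_{d_i \in D^l} \sum_{j=1}^{|C|} z_{ij} \log(P(c_j|\theta) P(d_i|c_j;\theta)) + \lambda \sum_{d_i \in D^u} \sum_{j=1}^{|C|} z_{ij} \log(P(c_j|\theta) P(d_i|c_j;\theta))$
where $D^l$ and $D^u$ are the labeled and
unlabeled documents and $0 \le \lambda \le 1$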

35
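Augmented EM -- weighting unlabeled data

λ is decided by leave-one-out cross validation.
Λ(i) is defined to tell whether document $d_i$ is
labeled or unlabeled: Λ(i) = 1 if labeled, λ if
unlabeled.
Modified MAP parameters: each document's counts
are multiplied by Λ(i)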
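
A sketch of what such a weighted M step might look like, assuming every labeled document gets weight 1 and every unlabeled document gets weight λ. The function name, argument layout and toy inputs are illustrative assumptions.

```python
def weighted_m_step(docs, resp, is_labeled, lam, vocab, n_classes):
    """M step in which unlabeled documents count with weight lam (0 <= lam <= 1)."""
    weights = [1.0 if lab else lam for lab in is_labeled]
    # Denominator: |C| + (number of labeled docs) + lam * (number of unlabeled docs).
    denom = n_classes + sum(weights)
    priors = [(1 + sum(w * r[j] for w, r in zip(weights, resp))) / denom
              for j in range(n_classes)]
    word_probs = []
    for j in range(n_classes):
        counts = {t: 1.0 for t in vocab}          # the "+1" prior count
        for tokens, r, w in zip(docs, resp, weights):
            for t in tokens:
                if t in counts:
                    counts[t] += w * r[j]         # down-weighted fractional count
        total = sum(counts.values())
        word_probs.append({t: c / total for t, c in counts.items()})
    return priors, word_probs

# Hypothetical toy input: one labeled and one unlabeled document.
vocab = ["ball", "vote"]
docs = [["ball"], ["vote", "vote"]]
resp = [[1.0, 0.0], [0.0, 1.0]]
flags = [True, False]
priors0, probs0 = weighted_m_step(docs, resp, flags, 0.0, vocab, 2)
priors1, probs1 = weighted_m_step(docs, resp, flags, 1.0, vocab, 2)
```

With λ = 0 the unlabeled document is ignored and the result is plain naive Bayes on the labeled data; with λ = 1 it reduces to the basic EM update.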

36
Augmented EM -- multiple mixture components per
class

Method: relax the assumption of a one-to-one
correspondence between components and classes.
Allow a many-to-one relationship between
components and classes.

37
Augmented EM -- multiple mixture components per
class

How?
Decide the number of components per class, again
by cross-validation.
Mapping from components to classes
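
With a many-to-one mapping, one natural way to score a class is to add up the posterior probabilities of its components. The sketch below assumes a fixed, known mapping and made-up component posteriors.

```python
def class_posterior(component_post, component_to_class, n_classes):
    """Sum component posteriors into class posteriors via a many-to-one mapping."""
    out = [0.0] * n_classes
    for p, c in zip(component_post, component_to_class):
        out[c] += p
    return out

# Hypothetical example: components 0 and 1 are subtopics of class 0,
# component 2 is the single component of class 1.
post = class_posterior([0.1, 0.2, 0.7], [0, 0, 1], n_classes=2)
```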

38
The complete algorithm

Inputs: collections of labeled and unlabeled
documents.
Set λ by cross-validation.
Set the number of components per class.
Randomly assign $P(c_j|d_i;\hat\theta)$ for the
mixture components of each labeled document's
class.
Initialize the parameters θ of the NB classifier
using MAP.
Loop until the complete log likelihood of the
labeled and unlabeled data stops improving.
E step: estimate the component membership of each
doc using θ
M step: re-estimate θ given the membership, still
using MAP.

39
Comparison

Basic EM performs well compared with the naive
Bayes classifier alone, given a large unlabeled
dataset and a small set of labeled data
EM-λ can improve the accuracy when the
assumption above does not hold.
Multiple components dramatically outperform
basic EM.

40
Part III: Results and Recap
41
Experimental Results

Empirical evidence that combining labeled with
unlabeled data using EM outperforms naive Bayes.
Data sets: 20 Newsgroups, WebKB, Reuters
Improvements in accuracy due to unlabeled data
are dramatic, especially when the amount of
labeled data is low.
Augmented EM can increase performance even when
basic EM performs poorly due to a large amount of
unlabeled data.

42
Data sets and Protocols

20 Newsgroups
20017 articles divided evenly among 20 different
UseNet discussion groups.
Task: classify an article into the one
newsgroup to which it was posted.
Many categories fall into confusable clusters.
Stop words are removed, leaving 62258 unique
words.
Word counts are normalized and scaled so that
each document has constant length.

43
Data sets and Protocols

WebKB
8145 Web pages gathered from university computer
science departments.
4199 pages are chosen, covering the categories
student, faculty, course and project.
Task: classify a web page into one of the
four categories.
Stemming and a stoplist are not used.
Vocabulary is limited to the 300 most informative
words using leave-one-out cross validation.

44
Data sets and Protocols

Reuters
12902 articles and 90 topic categories.
Task: build a binary classifier for each of
the ten most populous classes to identify the
news topic.
Words inside <TEXT> tags are used; the REUTERS
and &#3; tags are not used.
A stoplist is used, but no stemming.
Metrics are recall and precision instead of
accuracy.

45
Precision-Recall breakeven point

Standard information retrieval measure
Recall = (number of correct positive
predictions) / (number of positive examples)
Precision = (number of correct positive
predictions) / (number of positive predictions)
The breakeven point is the value at which
precision and recall are equal.
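
Both measures, and one simple way to locate a breakeven point, can be sketched as follows. The top-k cut-off used in `breakeven` is an assumption chosen for illustration (predicting exactly as many positives as there are positive examples makes the two denominators equal); it is not necessarily how the paper computes the value, and the scores are invented.

```python
def precision_recall(predicted, actual):
    """predicted, actual: parallel lists of 0/1 labels for one binary classifier."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    n_pred_pos = sum(predicted)
    n_actual_pos = sum(actual)
    precision = true_pos / n_pred_pos if n_pred_pos else 0.0
    recall = true_pos / n_actual_pos if n_actual_pos else 0.0
    return precision, recall

def breakeven(scores, actual):
    """Label the k highest-scoring documents positive, k = number of positives."""
    k = sum(actual)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    predicted = [1 if i in top else 0 for i in range(len(scores))]
    return precision_recall(predicted, actual)

# Hypothetical classifier outputs.
p, r = precision_recall([1, 1, 0, 0, 0], [1, 0, 1, 1, 0])
bp, br = breakeven([0.9, 0.8, 0.7, 0.4, 0.2], [1, 0, 1, 1, 0])
```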

46
Wall-clock timing

EM usually converges after about 10 iterations
Less than 1 minute for WebKB
Less than 15 minutes for 20 Newsgroups (much
larger vocabulary and more documents)

47
EM with unlabeled data increases Accuracy

Figure 1 - Accuracy versus number of labeled
documents. (20 Newsgroups)
48
Effect of varying the number of unlabeled documents
Figure 2 - Accuracy versus number of unlabeled
documents. (20 Newsgroups)
49
EM algorithm in action
Figure 3- Course class for WebKB
dataset
50
EM performance degradation
Figure 4 - As the number of labeled documents
increases, classifier accuracy falls when more
unlabeled data is added. Importance of the
weighting factor λ. (WebKB)
51
Effects of different EM
Figure 5 - Comparison between EM, CV EM-λ and
EM-λ (WebKB)
52
Performance of EM with different numbers of
mixture components
Figure 6 - Too few or too many mixture components
result in poor performance. Unlabeled data is
used. (Reuters)
53
Precision-Recall breakeven points
Figure 7- Comparison between NB and EM on
Reuters dataset
54
Related Work

EM is a well-known family of algorithms that
works by treating unclassified data as
incomplete.
Miller et al.: EM on non-textual tasks using a
mixture of Gaussians, assuming the unlabeled data
to be sufficient to estimate parameter values.
Castelli and Cover: unlabeled data does not
improve the classification results in the absence
of labeled data.
EM can be combined with active learning to
improve performance: only slightly more than
half of the labeled data was then enough!
EM can be applied with other machine learning
algorithms like SVM and kNN.

55
Punchwords

Text classification
Naive Bayes
Expectation Maximisation Algorithm
EM-λ
Multiple Mixture models for subclass
Leave-one-out cross validation
Stemming and stoplist words
Accuracy, Precision, Recall

56
Recap

A family of algorithms has been presented to
address text classification using voluminous
unlabeled data and scarce labeled data.
When the data is consistent with the assumptions,
basic EM performs well.
When the data is not consistent, two extensions
help:
- EM-λ: controlling the contribution of
unlabeled data.
- Multiple mixture components per class:
many-to-one mapping from components to classes.

57
References

Using Unlabeled Data to Improve Text
Classification, May 2001,
www.kamalnigam.com/papers/thesis-nigam.pdf
Netlab toolkit, www.ncrg.aston.ac.uk/netlab/
Validation lecture, Intelligent Sensor Systems,
Ricardo Gutierrez-Osuna, Wright State University

58
Question Time!!