Title: Text Classification from Labeled and Unlabeled Documents using EM
1Text Classification from Labeled and Unlabeled
Documents using EM
- Kamal Nigam
- Andrew Kachites McCallum
- Sebastian Thrun
- Tom Mitchell
- Presented by
- Yuan Fang, Fengyuan Hu and Sandhya Prabhakaran
2Job Hunting?
3Roadmap
- Part 1: Text Classification
- Part 2: Incorporating Unlabeled Data with EM
- Part 3: Results and Recap
4Part I: Text Classification
5Text Classification: the Definition
- Text classification systems categorize documents
into one (or several) of a set of pre-defined
topics of interest
6How Are Automatic Text Classifiers Created
- Before: manual construction of rule sets (painful
and time-consuming)
- Present: supervised learning to construct a
classifier (efficient and successful)
7What To Provide
- Provide an algorithm with an example set of documents
for each class, and allow it to find a representation
or decision rule for classifying future documents
automatically
- This approach will
- - give high-accuracy classifiers
- - be significantly less expensive
8What Data is Available
- Key difficulty: a large number of labeled
training examples are required to learn
accurately
- What we need but don't have:
- - One would obviously prefer algorithms that can
provide accurate classifications after hand
labeling only a dozen articles, rather than
thousands
- What other sources of information can reduce the
need for labeled data?
9Unlabeled data
- How unlabeled data can be used to increase
classification accuracy, especially when labeled
data are scarce
- An intuitive example
10Goal And Merit
- The goal
- - To demonstrate that supervised learning
algorithms can use a small number of labeled
examples together with a large number of unlabeled
examples to create high-accuracy text classifiers
- The merit
- - Unlabeled examples are much less
expensive and easily available
11Parametric Generative Model Overview
- Assumption: a statistical process generates the
documents (words and class labels)
- The statistical process is modeled as a parametric
generative model
12Incorporating Unlabeled Data with Generative
Models
- Using EM to find high-probability parameters of
the model given a combination of labeled and
unlabeled data
- Experimental evidence shows that using unlabeled
data with EM can increase classification accuracy
13Assumptions In the Model
- (1) Documents are generated by a mixture of
multinomials model, where each mixture
component corresponds to a class (one class
to one component)
- (2) The mixture components are multinomial
distributions of individual words: the
words are produced independently of each other
given the class
14Two Multisided Dice
- Let there be $|C|$ classes and a vocabulary of size
$|V|$; each document $d_i$ has $|d_i|$ words in it.
- First, we roll a biased $|C|$-sided die to
determine the class of our document.
- Then we roll the biased $|V|$-sided die that corresponds
to the chosen class $|d_i|$ times and write down the
indicated words. These words form the generated
document.
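The two-dice process above can be sketched in a few lines of Python. This is a minimal illustration only: the class names, vocabulary, and probabilities are hypothetical toy values, not taken from the paper, and the document length is passed in directly rather than sampled.

```python
import random

def generate_document(class_probs, word_probs, length, rng):
    """Sample (class, words) by rolling the two biased dice."""
    classes = list(class_probs)
    # Roll the biased |C|-sided die to pick the class.
    c = rng.choices(classes, weights=[class_probs[k] for k in classes])[0]
    # Roll that class's biased |V|-sided die `length` times.
    vocab = list(word_probs[c])
    words = rng.choices(vocab, weights=[word_probs[c][w] for w in vocab], k=length)
    return c, words

# Hypothetical toy parameters (assumptions for illustration).
class_probs = {"course": 0.5, "faculty": 0.5}
word_probs = {
    "course": {"homework": 0.6, "exam": 0.3, "research": 0.1},
    "faculty": {"homework": 0.1, "exam": 0.1, "research": 0.8},
}

rng = random.Random(0)
label, doc = generate_document(class_probs, word_probs, length=5, rng=rng)
print(label, doc)
```

Each word is drawn independently given the class, which is exactly the naive Bayes independence assumption stated on the previous slide.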
15Parametric Generative Model
- $\theta$ - parameters for the mixture model
- $c_j \in C = \{c_1, \ldots, c_{|C|}\}$ - mixture
components
- $P(c_j \mid \theta)$ - mixture weights or class
probabilities
- $P(d_i \mid c_j; \theta)$ - document distribution of
the selected class
- Equation (1)
16Denotation
- $c_j$ - the jth mixture component, as well as the
jth class.
- $y_i$ - the class label for a particular document
($d_i$)
- A document is considered to be an ordered
list of word events, $\langle w_{d_i,1}, w_{d_i,2}, \ldots \rangle$
- We write $w_{d_i,k}$ for the word in
position k of $d_i$
- $w_t$ - a word in the vocabulary $V$
- $|d_i|$ - document length, chosen independently
of the component, with its own probability $P(|d_i|)$
17Parametric Generative Model
- Expanding Equation (1) with the document length
and the words in the document: Equation (2)
- The words of a document are generated
independently of context: Equation (3)
- Combining these last two equations gives the
naive Bayes expression for the probability of a
document given its class: Equation (4)
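For reference, Equations (1) to (4) can be written out as follows. This is a reconstruction assuming the standard mixture-of-multinomials formulation, with $\theta$ the model parameters, $c_j$ the jth class, $d_i$ a document of length $|d_i|$, and $w_{d_i,k}$ the word at position $k$ of $d_i$:

```latex
\begin{align}
P(d_i \mid \theta) &= \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \tag{1} \\
P(d_i \mid c_j; \theta) &= P(|d_i|) \prod_{k=1}^{|d_i|}
    P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q}, q < k) \tag{2} \\
P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q}, q < k) &= P(w_{d_i,k} \mid c_j; \theta) \tag{3} \\
P(d_i \mid c_j; \theta) &= P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta) \tag{4}
\end{align}
```

Substituting (3) into (2) removes the dependence on earlier words, which gives (4).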
18Model Parameters
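- Collection of word probabilities, each written
$\theta_{w_t \mid c_j} \equiv P(w_t \mid c_j; \theta)$
- Document length is identically distributed across
classes, so it need not be parameterized for classification
- $\theta_{c_j} \equiv P(c_j \mid \theta)$ denotes the mixture
weights (class probabilities)
- The complete collection of model parameters:
$\theta = \{\theta_{w_t \mid c_j} : w_t \in V, c_j \in C;\ \theta_{c_j} : c_j \in C\}$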
19Naive Bayes Text Classification
- Using a collection of labeled documents for
training
- Finding the most probable parameters for the
statistical model introduced
20Training A Naive Bayes Classifier With Labeled
Data
- Estimating the parameters of the generative model
by using a set of labeled training data $D$
- (the estimate of the parameters $\theta$ is written
$\hat\theta$)
- Finding $\arg\max_\theta P(\theta \mid D)$ (MAP), the
value of $\theta$ that is most probable given the
evidence of the training data and a prior.
21Training A Naive Bayes Classifier With Labeled
Data
- The word probability estimates $\hat\theta_{w_t \mid c_j}$ are
given by Equation (6)
- Class probabilities $\hat\theta_{c_j}$:
Equation (7)
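Equations (6) and (7) can be written out as below. This is a reconstruction assuming the Laplace-smoothed MAP estimates of the paper, where $N(w_t, d_i)$ is the number of times word $w_t$ occurs in document $d_i$ and $P(y_i = c_j \mid d_i) \in \{0, 1\}$ is given by the label of each training document:

```latex
\begin{align}
\hat\theta_{w_t \mid c_j} &=
  \frac{1 + \sum_{i=1}^{|D|} N(w_t, d_i)\, P(y_i = c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} N(w_s, d_i)\, P(y_i = c_j \mid d_i)} \tag{6} \\
\hat\theta_{c_j} &=
  \frac{1 + \sum_{i=1}^{|D|} P(y_i = c_j \mid d_i)}{|C| + |D|} \tag{7}
\end{align}
```

The added 1 in each numerator is the smoothing term that keeps unseen words and classes from receiving zero probability.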
22Classifying New Documents with Naive Bayes
- Equation (8)
- If the task is to classify a test document $d_i$
into a single class, then the class with the
highest posterior probability,
$\arg\max_j P(y_i = c_j \mid d_i; \hat\theta)$, is selected.
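Training and classification with labeled data only can be sketched as follows. This is a minimal illustration under stated assumptions: documents are lists of tokens, smoothing is add-one as in the estimates above, the posterior is computed in log space to avoid underflow, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def train_nb(docs, labels, classes, vocab):
    """Add-one smoothed estimates of class and word probabilities."""
    word_counts = {c: Counter() for c in classes}
    class_counts = Counter(labels)
    for words, y in zip(docs, labels):
        word_counts[y].update(w for w in words if w in vocab)
    theta_w = {}
    for c in classes:
        total = sum(word_counts[c].values())
        theta_w[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab}
    theta_c = {c: (1 + class_counts[c]) / (len(classes) + len(docs)) for c in classes}
    return theta_c, theta_w

def posterior(words, theta_c, theta_w):
    """Class posterior for one document; out-of-vocabulary words are skipped."""
    log_joint = {}
    for c in theta_c:
        lp = math.log(theta_c[c])
        for w in words:
            if w in theta_w[c]:
                lp += math.log(theta_w[c][w])
        log_joint[c] = lp
    m = max(log_joint.values())
    z = sum(math.exp(v - m) for v in log_joint.values())
    return {c: math.exp(v - m) / z for c, v in log_joint.items()}

# Hypothetical toy data.
classes = ["course", "faculty"]
vocab = {"homework", "exam", "research", "grant"}
docs = [["homework", "exam"], ["research", "grant"]]
labels = ["course", "faculty"]

theta_c, theta_w = train_nb(docs, labels, classes, vocab)
post = posterior(["exam", "homework", "exam"], theta_c, theta_w)
predicted = max(post, key=post.get)
print(predicted, post)
```

The document length term is omitted because it is the same for every class and cancels in the posterior.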
23Part II: Incorporating Unlabeled Data with EM
24The Problem
- The case where only labeled data are given has
already been explained.
- MAP to maximize the posterior probability.
- Naïve Bayes performs classification with labeled data.
- Now we are given both labeled and unlabeled
data.
- Searching for a solution? Here it is!
25Revision of EM
- Recall the EM material from PMR: might be
painful, but helpful
- Mixture model
- Hidden variable z to activate the components
26Revision of EM
- EM applied to a Gaussian Mixture Model
- Maximum likelihood estimation: parameters µ and Σ
- E step: evaluate the responsibilities using the
current estimators/parameters
- M step: re-estimate the parameters using the maximum a
posteriori estimates
- Run the demo
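As a reminder of the E and M steps, here is a minimal sketch of EM for a one-dimensional Gaussian mixture. It is an illustration under assumptions, not the demo from the talk: the updates are plain maximum-likelihood ones, a variance floor `min_var` is added for numerical safety, the number of iterations is fixed, and the data and starting values are made up.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, iterations=20, min_var=1e-3):
    """EM for a 1-D Gaussian mixture with a fixed number of iterations."""
    means, variances, weights = list(means), list(variances), list(weights)
    k = len(means)
    for _ in range(iterations):
        # E step: responsibility of each component for each point.
        resp = []
        for x in data:
            joint = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M step: re-estimate parameters from the responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(min_var, var)
            weights[j] = nj / len(data)
    return means, variances, weights

# Hypothetical data: two well-separated clusters around 1 and 5.
data = [0.9, 1.0, 1.1, 4.9, 5.0, 5.1]
means, variances, weights = em_gmm_1d(data, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(means, variances, weights)
```

The same alternation carries over to the text setting, with multinomial components in place of Gaussians.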
27Back to the paper
28Back to the paper
- Collection of labeled and unlabeled documents.
- MAP
- Try to maximize $P(\theta \mid D)$
- Bayesian method -- $P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)$
29Back to the paper
- Log likelihood
- Incomplete equation
30Back to the paper
- $z_{ij}$: binary indicator variables, set to
1 if $y_i = c_j$, else zero.
- The incomplete log probability can then be
replaced by the complete log probability of the
parameters.
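The two quantities can be written out as follows. This is a reconstruction assuming the paper's notation, with $D^l$ the labeled documents, $D^u$ the unlabeled documents, $D = D^l \cup D^u$, and $z_{ij}$ the indicator that document $d_i$ belongs to class $c_j$:

```latex
\begin{align}
l(\theta \mid D; Y) &= \log P(\theta)
  + \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \notag \\
  &\quad + \sum_{d_i \in D^l} \log \bigl( P(y_i = c_j \mid \theta)\,
      P(d_i \mid y_i = c_j; \theta) \bigr) \\
l_c(\theta \mid D; z) &= \log P(\theta)
  + \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij}
      \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)
\end{align}
```

The first expression contains a log of a sum over classes for the unlabeled documents, which is what makes direct maximization hard. With the indicators $z_{ij}$ known, the sum moves outside the log.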
31Back to the paper
- Methods used in the paper
- Basic EM
- Augmented EM
- (1) Weighting the unlabeled data
- (2) Multiple mixture components per class
32Basic EM
- Initialize the NB classifier using MAP parameter
estimation, from the labeled dataset only.
- E step: estimate the component membership
$z_{ij}$ of each unlabeled document by calculating its
expected value under the current $\hat\theta$.
- M step: re-estimate the classifier for the whole
data set, using MAP; loop from the E step.
- Look at the change in $l_c(\theta \mid D; z)$ (the complete
log probability) to measure the
improvement of the parameters and decide when to
stop the loop.
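The loop above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the authors' implementation: documents are lists of tokens, smoothing is add-one, the loop runs for a fixed number of iterations instead of monitoring the complete log probability, and all names and data are hypothetical.

```python
import math

def m_step(docs, resp, classes, vocab):
    """Add-one smoothed re-estimation from (possibly soft) class memberships."""
    theta_c, theta_w = {}, {}
    for c in classes:
        counts = {w: 0.0 for w in vocab}
        for words, r in zip(docs, resp):
            for w in words:
                if w in counts:
                    counts[w] += r[c]
        total = sum(counts.values())
        theta_w[c] = {w: (1 + counts[w]) / (len(vocab) + total) for w in vocab}
        theta_c[c] = (1 + sum(r[c] for r in resp)) / (len(classes) + len(docs))
    return theta_c, theta_w

def e_step(words, theta_c, theta_w):
    """Posterior class membership of one document, computed in log space."""
    logp = {}
    for c in theta_c:
        logp[c] = math.log(theta_c[c]) + sum(
            math.log(theta_w[c][w]) for w in words if w in theta_w[c])
    m = max(logp.values())
    z = sum(math.exp(v - m) for v in logp.values())
    return {c: math.exp(v - m) / z for c, v in logp.items()}

def basic_em(labeled, labels, unlabeled, classes, vocab, iterations=10):
    # Labeled documents keep fixed 0/1 memberships throughout.
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for y in labels]
    # Initialize from the labeled documents only.
    theta_c, theta_w = m_step(labeled, hard, classes, vocab)
    for _ in range(iterations):
        soft = [e_step(d, theta_c, theta_w) for d in unlabeled]
        theta_c, theta_w = m_step(labeled + unlabeled, hard + soft, classes, vocab)
    return theta_c, theta_w

# Hypothetical toy data: "lecture" and "paper" never occur in labeled documents.
classes = ["course", "faculty"]
vocab = {"homework", "exam", "research", "grant", "lecture", "paper"}
labeled = [["homework", "exam"], ["research", "grant"]]
labels = ["course", "faculty"]
unlabeled = [["homework", "lecture"], ["exam", "lecture"],
             ["grant", "paper"], ["research", "paper"]]

theta_c, theta_w = basic_em(labeled, labels, unlabeled, classes, vocab)
lecture_post = e_step(["lecture"], theta_c, theta_w)
print(lecture_post)
```

In this toy run the word "lecture" is learned as a course word purely from its co-occurrence with known course words in unlabeled documents, which is the intuition behind using unlabeled data.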
33Restrictions of Basic EM
- Assumptions/Restrictions
- - Large unlabeled data set, small labeled data set
→ if not true, unlabeled data will hurt the
accuracy.
- - One-to-one correspondence of components and
classes → not so accurate because subtopics exist.
34Augmented EM -- weighting unlabeled data
- Method: weaken the contribution of unlabeled
data when the labeled set is already good enough
for classification.
- Equation
35Augmented EM -- weighting unlabeled data
- $\lambda$ is decided by leave-one-out cross validation.
- $\Lambda(i)$ is defined to tell whether document $d_i$ is labeled
or unlabeled.
- Modified MAP parameters
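The weighted objective and the modified estimates can be written out as below. This is a reconstruction assuming the paper's notation, with $0 \le \lambda \le 1$ the weight on unlabeled documents, $D^l$ and $D^u$ the labeled and unlabeled sets, and $N(w_t, d_i)$ the count of word $w_t$ in $d_i$:

```latex
\begin{align}
l_c(\theta \mid D; z) &= \log P(\theta)
  + \sum_{d_i \in D^l} \sum_{j=1}^{|C|} z_{ij}
      \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr) \notag \\
  &\quad + \lambda \sum_{d_i \in D^u} \sum_{j=1}^{|C|} z_{ij}
      \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr) \\
\Lambda(i) &= \begin{cases} \lambda & \text{if } d_i \in D^u \\ 1 & \text{if } d_i \in D^l \end{cases} \\
\hat\theta_{w_t \mid c_j} &=
  \frac{1 + \sum_{i=1}^{|D|} \Lambda(i)\, N(w_t, d_i)\, P(y_i = c_j \mid d_i)}
       {|V| + \sum_{s=1}^{|V|} \sum_{i=1}^{|D|} \Lambda(i)\, N(w_s, d_i)\, P(y_i = c_j \mid d_i)} \\
\hat\theta_{c_j} &=
  \frac{1 + \sum_{i=1}^{|D|} \Lambda(i)\, P(y_i = c_j \mid d_i)}
       {|C| + |D^l| + \lambda |D^u|}
\end{align}
```

Setting $\lambda = 1$ recovers basic EM, and $\lambda = 0$ ignores the unlabeled data and recovers naive Bayes trained on labeled data alone.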
36Augmented EM -- multiple mixture components per
class
- Method: relax the assumption of a one-to-one
correspondence between components and classes.
- Many-to-one relationship between components and
classes.
37Augmented EM -- multiple mixture components per
class
- How?
- Decide the number of components per class,
again by cross-validation.
- Mapping from components to classes
38The complete algorithm
- Collections of labeled, unlabeled documents.
- Set $\lambda$ by cross-validation.
- Set the number of components per class.
- Randomly assign each labeled document's membership
among the mixture components of its class.
- Initialize the parameters $\theta$ of the NB classifier
using MAP.
- Loop until the complete log likelihood of labeled and
unlabeled data no longer improves.
- - E step: estimate the component membership of each
doc using $\theta$
- - M step: re-estimate $\theta$ given the membership, still
MAP.
39Comparison
- Basic EM performs well compared with the naïve
Bayes classifier alone, given a large unlabeled
dataset and a small set of labeled data
- EM-λ can improve the accuracy when the
assumption above doesn't hold.
- Multiple components dramatically outperform
basic EM.
40Part III: Results and Recap
41Experimental Results
- Empirical evidence that combining labeled with
unlabeled data using EM outperforms naive Bayes.
- 20 Newsgroups, WebKB, Reuters
- Improvements in accuracy due to unlabeled data
are dramatic, especially when the amount of
labeled data is low.
- Augmented EM can increase performance even when
basic EM performs poorly due to a large number of
unlabeled data.
42Data sets and Protocols
- 20 Newsgroups
- 20017 articles divided evenly among 20 different
UseNet discussion groups.
- Task: to classify an article into the one
newsgroup to which it was posted.
- Many categories fall into confusable clusters.
- Stop words are removed: 62258 unique words
- Word counts are normalized and scaled: each
document has constant length.
43Data sets and Protocols
- WebKB
- 8145 Web pages gathered from university computer
science departments.
- Choosing 4199 pages covering the categories student,
faculty, course and project.
- Task: to classify a web page into one of the
four categories.
- Stemming and stoplist are not used.
- Vocabulary is limited to the 300 most informative
words using leave-one-out cross validation.
44Data sets and Protocols
- Reuters
- 12902 articles and 90 topic categories.
- Task: to build a binary classifier for each of
the ten most populous classes to identify the
news topic.
- Words inside <TEXT> tags are used; the REUTERS and
&#3; tags are not used.
- A stoplist is used, but no stemming.
- Metrics are recall and precision instead of
accuracy.
45Precision-Recall breakeven point
- Standard information retrieval measure
- Recall = (number of correct positive
predictions) / (number of positive
examples)
- Precision = (number of correct positive
predictions) / (number of positive
predictions)
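The two ratios can be computed directly from binary labels and predictions. This is a minimal sketch with hypothetical data; the breakeven point is the value at which the two measures are equal as the classifier's decision threshold is varied, which this snippet does not search for.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = positive, 0 = negative)."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    predicted_pos = sum(1 for p in y_pred if p)
    actual_pos = sum(1 for t in y_true if t)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two of the three positive predictions are correct and two of the three positive examples are found, so precision and recall happen to coincide.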
46Wall-clock timing
- EM usually converges after 10 iterations
- Less than 1 minute for the WebKB
- Less than 15 minutes for 20 Newsgroups: huge
vocabulary and more documents
47EM with unlabeled data increases Accuracy
Figure 1: Accuracy versus number of labeled
documents. (20 Newsgroups)
48Effect of varying the number of unlabeled documents
Figure 2: Accuracy versus number of unlabeled
documents. (20 Newsgroups)
49EM algorithm in action
Figure 3: Course class for WebKB
dataset
50EM performance degradation
Figure 4: As the number of labeled documents increases,
accuracy of the classifier falls with more
unlabeled data. Importance of weighting factor λ.
(WebKB)
51Effects of different EM
Figure 5: Comparison between EM, CV EM-λ and
EM-λ (WebKB)
52Performance of EM with different numbers of mixture
components
Figure 6: Too few or too many mixture components
result in poor performance. Unlabeled data is
used. (Reuters)
53Precision-Recall breakeven points
Figure 7: Comparison between NB and EM on the
Reuters dataset
54Related Work
- EM is a well-known family of algorithms that
works by treating unclassified data as
incomplete.
- Miller et al.: EM on non-textual
tasks using a mixture of Gaussians assumed
unlabeled data to be sufficient to estimate
parameter values.
- Castelli and Cover: unlabeled data does not
improve the classification results in the absence
of labeled data.
- EM can be combined with active learning to
improve performance: only slightly more than
half of the labeled data was enough!
- EM can be applied with other machine learning
algorithms like SVM, kNN.
55 Punchwords
- Text classification
- Naive Bayes
- Expectation Maximisation Algorithm
- EM-λ
- Multiple Mixture models for subclass
- Leave-one-out cross validation
- Stemming and stoplist words
- Accuracy, Precision, Recall
56Recap
- A family of algorithms has been presented to
address text classification using voluminous
unlabeled data and scarce labeled data.
- When data is consistent with the assumptions:
basic EM performs well.
- When data is not consistent: 2 extensions hold
valid
- - EM-λ: controlling the contribution of
unlabeled data.
- - Multiple mixture components per class:
many-to-one constraint.
57References
- Using Unlabeled Data to Improve Text
Classification, May 2001: www.kamalnigam.com/papers/thesis-nigam.pdf
- Netlab toolkit: www.ncrg.aston.ac.uk/netlab/
- Validation lecture, Intelligent Sensor Systems,
Ricardo Gutierrez-Osuna, Wright State University
58Question Time!!
- Route further questions to ...
- Ryan - 0789317
- Neo - 0785401
- Sandhya - 0671562
Thank you !!