Text Classification from Labeled and Unlabeled Documents using EM - PowerPoint PPT Presentation

Slides: 59
Provided by: infEdAcUk
Transcript and Presenter's Notes

1
Text Classification from Labeled and Unlabeled
Documents using EM
  • Kamal Nigam
  • Andrew Kachites Mccallum
  • Sebastian Thrun
  • Tom Mitchell
  • Presented by
  • Yuan Fang, Fengyuan Hu and Sandhya Prabhakaran

2
Job Hunting?
3
Roadmap
  • Part I: Text Classification
  • Part II: Incorporating Unlabeled Data with EM
  • Part III: Results and Recap

4
Part I: Text Classification
5
Text Classification: the Definition
  • Text classification systems categorize documents
    into one (or several) of a set of pre-defined
    topics of interest

6
How Are Automatic Text Classifiers Created?
  • Before: manual construction of rule sets (painful
    and time-consuming)
  • Present: supervised learning to construct a
    classifier (efficient and successful)

7
What To Provide
  • An algorithm with an example set of documents for
    each class and allow it to find a representation
    or decision rule for classifying future documents
    automatically
  • This approach will:
    - give high-accuracy classifiers
    - be significantly less expensive

8
What Data is Available
  • Key difficulty: a large number of labeled
    training examples is required to learn
    accurately, which is exactly what we need but don't have
  • One would obviously prefer algorithms that can
    provide accurate classifications after hand
    labeling only a dozen articles, rather than
    thousands
  • What other sources of information can reduce the
    need for labeled data?

9
Unlabeled data
  • How unlabeled data can be used to increase
    classification accuracy, especially when labeled
    data are scarce
  • An intuitive example

10
Goal And Merit
  • The goal
  • To demonstrate that supervised learning
    algorithms can use a small number of labeled
    examples with a large number of unlabeled
    examples to create high-accuracy text classifiers
  • The merit
  • Unlabeled examples are much less expensive
    and more easily available

11
Parametric Generative Model Overview
  • Assumption: a statistical process generates the
    documents (words and class labels)
  • This statistical process is a parametric generative model

12
Incorporating Unlabeled Data with Generative
Models
  • Using EM to find high-probability parameters of
    the model given a combination of labeled and
    unlabeled data
  • Experimental evidence shows that using unlabeled
    data with EM can increase classification accuracy

13
Assumptions In the Model
  • (1) Documents are generated by a mixture-of-multinomials
    model, where each mixture component corresponds to
    a class (one class to one component)
  • (2) The mixture components are multinomial
    distributions over individual words: the words are
    produced independently of each other given the class

14
Two Multisided Dice
  • Let there be $|C|$ classes and a vocabulary of size
    $|V|$; each document $d_i$ has $|d_i|$ words in it.
  • First, we roll a biased $|C|$-sided die to
    determine the class of our document.
  • Then we roll the biased $|V|$-sided die that corresponds
    to the chosen class $|d_i|$ times and write down the
    indicated words. These words form the generated
    document.
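
The two-dice story can be sketched in a few lines of Python. This is only an illustration of the generative process described on the slide: the function name `generate_document`, the toy vocabulary and all probabilities are made-up assumptions, not values from the paper.

```python
import random

def generate_document(class_priors, word_probs, vocab, length, rng):
    """Roll the class die once, then roll that class's word die `length` times."""
    classes = list(class_priors)
    # Biased |C|-sided die: pick the class according to the mixture weights.
    label = rng.choices(classes, weights=[class_priors[c] for c in classes], k=1)[0]
    # Biased |V|-sided die of the chosen class: each word is drawn independently.
    words = rng.choices(vocab, weights=word_probs[label], k=length)
    return label, words

# Hypothetical toy model: two classes, four words.
vocab = ["ball", "game", "vote", "law"]
class_priors = {"sports": 0.5, "politics": 0.5}
word_probs = {
    "sports":   [0.45, 0.45, 0.05, 0.05],
    "politics": [0.05, 0.05, 0.45, 0.45],
}
rng = random.Random(0)
label, words = generate_document(class_priors, word_probs, vocab, 6, rng)
print(label, words)
```

The document length is passed in directly here, which mirrors the assumption that length is chosen independently of the class.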

15
Parametric Generative Model
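  • $\theta$ - parameters for the mixture model
  • $c_j \in C = \{c_1, \dots, c_{|C|}\}$ - mixture of
    components
  • $P(c_j \mid \theta)$ - mixture weights or class
    probabilities
  • $P(d_i \mid c_j; \theta)$ - document distribution of selected
    class
  • Equation (1): $P(d_i \mid \theta) = \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)$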

16
Denotation
  • $c_j$ - the jth mixture component, as well as the
    jth class.
  • $y_i$ - the class label for a particular document
    $d_i$
  • A document is considered to be an ordered
    list of word events, $\langle w_{d_i,1}, w_{d_i,2}, \dots \rangle$
  • We write $w_{d_i,k}$ for the word in
    position k of $d_i$
  • $w_t$ - a word in the vocabulary $V$
  • $|d_i|$ - document length, chosen independently
    of the component, with its own probability

17
Parametric Generative Model
  • Expanding Equation (1) with the document length
    and the words in the document: Equation (2)
  • The words of a document are generated
    independently of context: Equation (3)
  • Combining these last two equations gives the
    naive Bayes expression for the probability of a
    document given its class: Equation (4)
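
The equations named on this slide were images and did not survive in the transcript. As a reconstruction in the notation of the Nigam et al. paper (an assumption about what the slides showed, with $d_i$ a document, $c_j$ a class, $w_{d_i,k}$ the word in position $k$ and $\theta$ the parameters), they read roughly as follows:

```latex
\begin{align}
P(d_i \mid c_j; \theta)
  &= P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q},\, q < k) \tag{2} \\
P(w_{d_i,k} \mid c_j; \theta; w_{d_i,q},\, q < k)
  &= P(w_{d_i,k} \mid c_j; \theta) \tag{3} \\
P(d_i \mid c_j; \theta)
  &= P(|d_i|) \prod_{k=1}^{|d_i|} P(w_{d_i,k} \mid c_j; \theta) \tag{4}
\end{align}
```

Equation (4) is simply Equation (2) with the context dependence removed by the independence assumption of Equation (3).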

18
Model Parameters
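  • Collection of word probabilities, each written
    $\theta_{w_t|c_j} \equiv P(w_t \mid c_j; \theta)$
  • Document length is identically distributed for all classes, so it
    does not need to be parameterized for classification
  • $\theta_{c_j} \equiv P(c_j \mid \theta)$ denotes the mixture weights (class
    probabilities)
  • The complete collection of model parameters:
    $\theta = \{\theta_{w_t|c_j} : w_t \in V, c_j \in C;\ \theta_{c_j} : c_j \in C\}$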

19
Naive Bayes Text Classification
  • Using a collection of labeled documents for
    training
  • Finding the most probable parameters for the
    statistical model introduced

20
Training A Naive Bayes Classifier With Labeled
Data
  • Estimating the parameters of the generative model
    by using a set of labeled training data $D$
  • (the estimate of the parameters $\theta$ is written
    $\hat\theta$)
  • Finding $\hat\theta = \arg\max_\theta P(\theta \mid D)$ (MAP), the
    value of $\theta$ that is most probable given the
    evidence of the training data and a prior.

21
Training A Naive Bayes Classifier With Labeled
Data
  • The word probability estimates $\hat\theta_{w_t|c_j}$ are
    given by Equation (6)
  • The class probabilities $\hat\theta_{c_j}$ are
    given by Equation (7)
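
The two estimates can be sketched as smoothed counts. The sketch below assumes a Laplace (add-one) prior and hard class labels, which is how the MAP estimates are usually presented for this model; the function name `train_naive_bayes` and the toy documents are illustrative assumptions.

```python
from collections import Counter

def train_naive_bayes(docs, labels, classes, vocab):
    """MAP estimates with an add-one prior, for documents with known labels."""
    word_counts = {c: Counter() for c in classes}
    doc_counts = Counter(labels)
    for words, label in zip(docs, labels):
        word_counts[label].update(w for w in words if w in vocab)
    # Class probabilities: (1 + documents in class) / (|C| + |D|).
    class_probs = {c: (1 + doc_counts[c]) / (len(classes) + len(docs)) for c in classes}
    # Word probabilities: (1 + count of w in class) / (|V| + words in class).
    word_probs = {}
    for c in classes:
        total = sum(word_counts[c].values())
        word_probs[c] = {w: (1 + word_counts[c][w]) / (len(vocab) + total) for w in vocab}
    return class_probs, word_probs

# Hypothetical toy training set.
docs = [["ball", "game", "game"], ["vote", "law"], ["ball", "ball"]]
labels = ["sports", "politics", "sports"]
classes = ["sports", "politics"]
vocab = ["ball", "game", "vote", "law"]
class_probs, word_probs = train_naive_bayes(docs, labels, classes, vocab)
print(class_probs)
```

The add-one terms keep every word probability strictly positive, so a word unseen in one class never forces a document's probability to zero.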

22
Classifying New Documents with Naive Bayes
  • Equation (8)
  • If the task is to classify a test document $d_i$
    into a single class, then the class with the
    highest posterior probability,
    $\arg\max_j P(y_i = c_j \mid d_i; \hat\theta)$, is selected.
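
A minimal sketch of this classification rule, assuming the parameters are already estimated: the posterior of each class is its prior times the product of the word probabilities, normalized over classes. The function name `posterior` and the parameter values are hypothetical; log space is used only to avoid numerical underflow on long documents.

```python
import math

def posterior(words, class_probs, word_probs):
    """Return P(class | document) for each class, computed in log space."""
    log_scores = {}
    for c, prior in class_probs.items():
        # log P(c) + sum over word positions of log P(w | c); unknown words are skipped.
        log_scores[c] = math.log(prior) + sum(
            math.log(word_probs[c][w]) for w in words if w in word_probs[c]
        )
    # Normalize with the log-sum-exp trick.
    top = max(log_scores.values())
    norm = top + math.log(sum(math.exp(s - top) for s in log_scores.values()))
    return {c: math.exp(s - norm) for c, s in log_scores.items()}

# Hypothetical, hand-picked parameters.
class_probs = {"sports": 0.6, "politics": 0.4}
word_probs = {
    "sports":   {"ball": 0.5, "game": 0.3, "vote": 0.1, "law": 0.1},
    "politics": {"ball": 0.1, "game": 0.1, "vote": 0.4, "law": 0.4},
}
post = posterior(["ball", "game", "ball"], class_probs, word_probs)
predicted = max(post, key=post.get)
print(predicted, post)
```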

23
Part II: Incorporating Unlabeled Data with EM
24
The Problem
  • The case where only labeled data is given has
    already been explained.
  • MAP estimation maximizes the posterior probability of the parameters.
  • Naive Bayes is trained on the labeled data and then classifies documents.
  • Now consider the case where both labeled and unlabeled
    data are given.
  • Searching for a solution? Here it is!

25
Revision of EM
  • Recall the EM material from PMR: might be
    painful, but helpful
  • Mixture model
  • Hidden variable z to activate the components

26
Revision of EM
  • EM applied to a Gaussian mixture model
  • Maximum likelihood estimation: parameters µ and Σ
  • E step: evaluate the responsibilities using the
    current parameter estimates
  • M step: re-estimate the parameters from those
    responsibilities (maximum a posteriori when a prior is used)
  • Run the demo
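
The E and M steps can be sketched for the simplest case, a one-dimensional mixture of two Gaussians fitted by maximum likelihood. This is an illustrative sketch only, not the demo run in the presentation; the function name `em_gmm_1d`, the toy data, the starting values and the floor on the standard deviation are all assumptions.

```python
import math

def em_gmm_1d(xs, mus, sigmas, weights, iterations=50):
    """EM for a 1-D Gaussian mixture (maximum likelihood, no prior)."""
    k = len(mus)
    for _ in range(iterations):
        # E step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [
                weights[j] * math.exp(-0.5 * ((x - mus[j]) / sigmas[j]) ** 2)
                / (sigmas[j] * math.sqrt(2 * math.pi))
                for j in range(k)
            ]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(xs)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)  # floor avoids a zero-width component
    return mus, sigmas, weights

# Toy data: two well-separated clusters around 1 and 5.
xs = [0.9, 1.0, 1.1, 1.2, 4.8, 5.0, 5.1, 5.2]
mus, sigmas, weights = em_gmm_1d(xs, [0.0, 6.0], [1.0, 1.0], [0.5, 0.5])
print(mus, sigmas, weights)
```

The same alternation carries over to text: the responsibilities become class posteriors of unlabeled documents, and the M step becomes the smoothed count estimates of naive Bayes.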

27
Back to the paper
28
Back to the paper
  • Collection $D$ of labeled and unlabeled documents.
  • MAP
  • Try to maximize $P(\theta \mid D)$
  • Bayes' rule: $P(\theta \mid D) \propto P(\theta)\, P(D \mid \theta)$

29
Back to the paper
  • Log likelihood
  • Incomplete equation

30
Back to the paper
  • $z_{ij}$: binary indicator variables, set to
    1 if $y_i = c_j$ and 0 otherwise.
  • The problem of maximizing the incomplete log probability
    can then be replaced by the complete log probability of
    the parameters.
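
The two log probabilities mentioned on these slides were lost with the equation images. A reconstruction in the paper's notation, assuming $D^l$ denotes the labeled documents, $D^u$ the unlabeled ones and $D = D^l \cup D^u$, would look like this:

```latex
\begin{align*}
l(\theta \mid D) &= \log P(\theta)
  + \sum_{d_i \in D^u} \log \sum_{j=1}^{|C|} P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)
  + \sum_{d_i \in D^l} \log \bigl( P(y_i = c_j \mid \theta)\, P(d_i \mid y_i = c_j; \theta) \bigr) \\
l_c(\theta \mid D; z) &= \log P(\theta)
  + \sum_{d_i \in D} \sum_{j=1}^{|C|} z_{ij} \log \bigl( P(c_j \mid \theta)\, P(d_i \mid c_j; \theta) \bigr)
\end{align*}
```

The first expression contains a log of a sum over classes for the unlabeled documents, which is what makes direct maximization hard; with the indicators $z_{ij}$ known, the second expression has only sums of logs.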

31
Back to the paper
  • Methods used in the paper
  • Basic EM
  • Augmented EM
  • (1) Weighting the unlabeled data
  • (2) Multiple mixture components per class

32
Basic EM
  • Initialize the NB classifier using MAP parameter
    estimation, from the labeled data set only.
  • E step: estimate the component membership $z_{ij}$
    by calculating its expected value under the current
    parameters $\hat\theta$, for the unlabeled data only.
  • M step: re-estimate the classifier on the whole
    data set using MAP, then loop back to the E step.
  • Look at the change in the complete log likelihood
    $l_c(\theta \mid D; z)$ to measure the
    improvement of the parameters and decide when to
    stop the loop.
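
The loop described above can be sketched compactly. This is a simplified sketch under stated assumptions: it runs a fixed number of iterations instead of monitoring the complete log likelihood, it assumes every word is in the vocabulary, and the names `m_step`, `e_step`, `basic_em` and the toy documents are hypothetical.

```python
import math

def m_step(docs, resp, classes, vocab):
    """MAP re-estimate from soft class memberships (add-one smoothing)."""
    class_probs, word_probs = {}, {}
    for c in classes:
        class_probs[c] = (1 + sum(r[c] for r in resp)) / (len(classes) + len(docs))
        counts = {w: 1.0 for w in vocab}  # the "1 +" contributed by the prior
        for words, r in zip(docs, resp):
            for w in words:
                counts[w] += r[c]
        total = sum(counts.values())
        word_probs[c] = {w: n / total for w, n in counts.items()}
    return class_probs, word_probs

def e_step(words, class_probs, word_probs):
    """Posterior P(c | d) under the current parameters."""
    logs = {c: math.log(class_probs[c]) + sum(math.log(word_probs[c][w]) for w in words)
            for c in class_probs}
    top = max(logs.values())
    exps = {c: math.exp(v - top) for c, v in logs.items()}
    norm = sum(exps.values())
    return {c: v / norm for c, v in exps.items()}

def basic_em(labeled, unlabeled, classes, vocab, iterations=10):
    hard = [{c: 1.0 if c == y else 0.0 for c in classes} for _, y in labeled]
    labeled_docs = [d for d, _ in labeled]
    # Initialize from the labeled documents only.
    class_probs, word_probs = m_step(labeled_docs, hard, classes, vocab)
    for _ in range(iterations):
        # E step: soft labels for the unlabeled documents only.
        soft = [e_step(d, class_probs, word_probs) for d in unlabeled]
        # M step: re-estimate from labeled (hard) plus unlabeled (soft) documents.
        class_probs, word_probs = m_step(labeled_docs + unlabeled, hard + soft, classes, vocab)
    return class_probs, word_probs

vocab = ["ball", "game", "vote", "law"]
classes = ["sports", "politics"]
labeled = [(["ball", "game"], "sports"), (["vote", "law"], "politics")]
unlabeled = [["ball", "ball", "game"], ["law", "vote", "vote"], ["game", "ball"], ["law", "law"]]
class_probs, word_probs = basic_em(labeled, unlabeled, classes, vocab)
post = e_step(["game", "game"], class_probs, word_probs)
print(post)
```

Labeled documents keep their hard labels throughout; only the unlabeled documents receive new soft labels in each E step.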

33
Restrictions of Basic EM
  • Assumptions/Restrictions
  • Large unlabeled data set, small labeled data set
    → if not true, unlabeled data can hurt
    accuracy.
  • One-to-one correspondence of components and
    classes → not so accurate, because subtopics exist.

34
Augmented EM: weighting unlabeled data
  • Method: weaken the contribution of the unlabeled
    data when the labeled set is already good enough
    for classification.
  • Equation

35
Augmented EM: weighting unlabeled data
  • $\lambda$ is chosen by leave-one-out cross-validation.
  • $\Lambda(i)$ is defined to tell whether document $d_i$ is labeled
    or unlabeled.
  • Modified MAP parameters
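
One way to read the modified MAP estimates: every count contributed by a document is multiplied by a weight that is 1 for labeled documents and λ for unlabeled ones. The sketch below follows that reading; the function name `weighted_m_step`, the argument layout and the toy data are assumptions for illustration.

```python
def weighted_m_step(docs, resp, is_labeled, lam, classes, vocab):
    """MAP re-estimate in which each unlabeled document counts only `lam` times."""
    # Weight per document: 1 for labeled documents, lam for unlabeled ones.
    weights = [1.0 if flag else lam for flag in is_labeled]
    class_probs, word_probs = {}, {}
    for c in classes:
        mass = sum(wt * r[c] for wt, r in zip(weights, resp))
        class_probs[c] = (1 + mass) / (len(classes) + sum(weights))
        counts = {w: 1.0 for w in vocab}  # add-one prior
        for words, r, wt in zip(docs, resp, weights):
            for w in words:
                counts[w] += wt * r[c]
        total = sum(counts.values())
        word_probs[c] = {w: n / total for w, n in counts.items()}
    return class_probs, word_probs

# One labeled "s" document and two unlabeled documents currently assigned to "p".
docs = [["ball"], ["vote"], ["vote"]]
resp = [{"s": 1.0, "p": 0.0}, {"s": 0.0, "p": 1.0}, {"s": 0.0, "p": 1.0}]
class_probs, word_probs = weighted_m_step(
    docs, resp, [True, False, False], 0.5, ["s", "p"], ["ball", "vote"]
)
print(class_probs, word_probs)
```

With `lam = 0` the unlabeled documents are ignored and the estimate reduces to plain naive Bayes; with `lam = 1` it reduces to basic EM.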

36
Augmented EM -- multiple mixture components per
class
  • Method: relax the assumption of a one-to-one
    correspondence between components and classes.
  • Many-to-one relationship between components and
    classes.

37
Augmented EM: multiple mixture components per
class
  • How?
  • Decide the number of components per class,
    again by cross-validation.
  • Mapping from components to classes
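
With a many-to-one mapping, the posterior of a class can be obtained by summing the posteriors of the components that map to it. A minimal sketch, in which the component names, the mapping and the probabilities are invented for illustration:

```python
def class_posterior(component_posterior, component_to_class):
    """Sum component posteriors over the components that belong to each class."""
    totals = {}
    for component, prob in component_posterior.items():
        cls = component_to_class[component]
        totals[cls] = totals.get(cls, 0.0) + prob
    return totals

# Hypothetical binary task: one component for the positive class, three for the negative.
component_to_class = {"pos_0": "pos", "neg_0": "neg", "neg_1": "neg", "neg_2": "neg"}
component_posterior = {"pos_0": 0.35, "neg_0": 0.25, "neg_1": 0.30, "neg_2": 0.10}
totals = class_posterior(component_posterior, component_to_class)
print(totals)
```

Here no single negative component beats the positive one, yet the negative class as a whole is more probable, which is the situation the many-to-one mapping is meant to handle.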

38
The complete algorithm
  • Inputs: collections of labeled and unlabeled documents.
  • Set $\lambda$ by cross-validation.
  • Set the number of components per class.
  • Randomly assign $P(c_j \mid d_i; \hat\theta)$ for mixture
    components.
  • Initialize the parameters $\hat\theta$ of the NB classifier
    using MAP.
  • Loop until the complete log likelihood of the labeled and
    unlabeled data stops improving sufficiently.
  • E step: estimate the component membership of each
    document using $\hat\theta$.
  • M step: re-estimate $\hat\theta$ given the membership, still
    using MAP.

39
Comparison
  • Basic EM performs well compared with the naive
    Bayes classifier alone, given a large unlabeled
    data set and a small set of labeled data.
  • EM-λ can improve the accuracy when the
    assumption above does not hold.
  • Multiple components can dramatically outperform
    basic EM.

40
Part III: Results and Recap
41
Experimental Results
  • Empirical evidence that combining labeled with
    unlabeled data using EM outperforms naive Bayes.
  • 20 Newsgroups, WebKB, Reuters
  • Improvements in accuracy due to unlabeled data
    are dramatic, especially when the number of
    labeled documents is small.
  • Augmented EM can increase performance even when
    basic EM performs poorly due to a large amount of
    unlabeled data.

42
Data sets and Protocols
  • 20 Newsgroups
  • 20017 articles divided evenly among 20 different
    UseNet discussion groups.
  • Task: classify an article into the one
    newsgroup to which it was posted.
  • Many categories fall into confusable clusters.
  • Stop words are removed, leaving 62258 unique words.
  • Word counts are normalized and scaled so that each
    document has constant length.

43
Data sets and Protocols
  • WebKB
  • 8145 Web pages gathered from university computer
    science departments.
  • 4199 pages are chosen, covering the categories student,
    faculty, course and project.
  • Task: classify a web page into one of the
    four categories.
  • Stemming and a stoplist are not used.
  • Vocabulary is limited to the 300 most informative
    words, using leave-one-out cross-validation.

44
Data sets and Protocols
  • Reuters
  • 12902 articles and 90 topic categories.
  • Task: build a binary classifier for each of
    the ten most populous classes to identify the
    news topic.
  • Words inside <TEXT> tags are used; the REUTERS and
    &# tags are not used.
  • A stoplist is used, but no stemming.
  • Metrics are recall and precision instead of
    accuracy.

45
Precision-Recall breakeven point
  • Standard information retrieval measure
  • Recall = (number of correct positive predictions)
    / (number of positive examples)
  • Precision = (number of correct positive predictions)
    / (number of positive predictions)
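
Both ratios, and an approximate breakeven point, can be computed directly from predictions. The sketch below takes the breakeven point to be where precision and recall are closest as the decision threshold is swept; that procedure, the function names and the toy scores are assumptions for illustration.

```python
def precision_recall(predicted, actual):
    """Precision and recall for binary labels (True = positive)."""
    true_pos = sum(1 for p, a in zip(predicted, actual) if p and a)
    n_pred_pos = sum(1 for p in predicted if p)
    n_actual_pos = sum(1 for a in actual if a)
    precision = true_pos / n_pred_pos if n_pred_pos else 0.0
    recall = true_pos / n_actual_pos if n_actual_pos else 0.0
    return precision, recall

def breakeven(scores, actual):
    """Sweep the threshold and return the mean of P and R where |P - R| is smallest."""
    best = None
    for threshold in sorted(set(scores)):
        p, r = precision_recall([s >= threshold for s in scores], actual)
        if best is None or abs(p - r) < abs(best[0] - best[1]):
            best = (p, r)
    return (best[0] + best[1]) / 2

# Hypothetical classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.4, 0.2]
actual = [True, True, False, True, False]
point = breakeven(scores, actual)
print(point)
```

For these toy scores, precision and recall coincide at the threshold 0.7, where three documents are predicted positive and two of them are correct.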

46
Wall-clock timing
  • EM usually converges after 10 iterations
  • Less than 1 minute for WebKB
  • Less than 15 minutes for 20 Newsgroups (huge
    vocabulary and more documents)

47
EM with unlabeled data increases Accuracy

Figure 1: Accuracy versus number of labeled
documents (20 Newsgroups)
48
Effect of varying the number of unlabeled documents
Figure 2: Accuracy versus number of unlabeled
documents (20 Newsgroups)
49
EM algorithm in action
Figure 3: Course class for the WebKB
dataset
50
EM performance degradation
Figure 4: As the number of labeled documents increases,
accuracy of the classifier falls with a larger number of
unlabeled documents. Importance of weighting factor λ.
(WebKB)
51
Effects of different EM
Figure 5: Comparison between EM, CV EM-λ and
EM-λ (WebKB)
52
Performance of EM with different numbers of mixture
components
Figure 6: Too few or too many mixture components
result in poor performance. Unlabeled data is
used. (Reuters)
53
Precision-Recall breakeven points
Figure 7: Comparison between NB and EM on
the Reuters dataset
54
Related Work
  • EM is a well-known family of algorithms that
    works by treating unclassified data as
    incomplete.
  • Miller et al.: EM on non-textual
    tasks using mixtures of Gaussians, assuming
    unlabeled data to be sufficient to estimate
    parameter values.
  • Castelli and Cover: unlabeled data does not
    improve the classification results in the absence
    of labeled data.
  • EM can be combined with active learning to
    improve performance: only slightly more than
    half of the labeled data was then enough!
  • EM can be applied with other machine learning
    algorithms such as SVM and kNN.

55
Punchwords
  • Text classification
  • Naive Bayes
  • Expectation Maximisation Algorithm
  • EM-λ
  • Multiple Mixture models for subclass
  • Leave-one-out cross validation
  • Stemming and stoplist words
  • Accuracy, Precision, Recall

56
Recap
  • A family of algorithms has been presented to
    address text classification using voluminous
    unlabeled data and scarce labeled data.
  • When the data is consistent with the assumptions,
    basic EM performs well.
  • When the data is not consistent, two extensions
    remain valid:
  • - EM-λ: controlling the contribution of
    unlabeled data.
  • - Multiple mixture components per class:
    many-to-one correspondence.

57
References
  • Using Unlabeled Data to Improve Text
    Classification, May 2001, at
    www.kamalnigam.com/papers/thesis-nigam.pdf
  • Netlab toolkit: www.ncrg.aston.ac.uk/netlab/
  • Validation lecture, Intelligent Sensor Systems,
    Ricardo Gutierrez-Osuna, Wright State University

58
Question Time!!
  • Route further questions to ...
  • Ryan - 0789317
  • Neo - 0785401
  • Sandhya - 0671562

Thank you !!