Text Classification from Labeled and Unlabeled Documents using EM

1
Text Classification from Labeled and Unlabeled
Documents using EM
Kamal Nigam, Andrew McCallum, Sebastian Thrun,
Tom Mitchell, 1999
  • Eleni Foteinopoulou s0969664
  • Efthymios Kouloumpis s0928744

2
Overview
  • Introduction
  • Motivation
  • Naïve Bayes Learning
  • Combination of NB and EM
  • EM Extensions
  • Experiments
  • Summary

3
Text Classification
Bag of Words
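A toy sketch of the bag-of-words representation assumed throughout: each
document becomes a vector of word counts, ignoring word order (the example
documents below are made up for illustration).

from collections import Counter

# Toy documents (illustrative only)
docs = ["the cat sat on the mat", "the dog chased the cat"]
bags = [Counter(d.lower().split()) for d in docs]

# Fixed vocabulary and one count vector per document
vocab = sorted({w for bag in bags for w in bag})
vectors = [[bag.get(w, 0) for w in vocab] for bag in bags]
# vectors[i][j] = how many times vocab[j] occurs in document i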
4
Need for an intermediate approach
  • Unsupervised vs. supervised learning
  • Unsupervised learning
  • collection of documents without any labels
  • easy to collect: free, inexpensive, large pool available
  • Supervised learning
  • each document tagged with a class
  • labeling is a laborious, time-consuming process
  • Semi-supervised learning
  • needed for real-life applications

5
Challenges
  • How to reduce the number of labeled examples?
  • Can unlabeled examples increase the
    classification accuracy?
  • Any ideas...?
  • Semi-Supervised Learning

6
Motivation
  • Document collection D
  • A subset D_l ⊂ D (with |D_l| ≪ |D|) has known
    labels
  • Goal: label the rest of the collection, D_u = D \ D_l.
  • Approach
  • Train a supervised learner on the labeled subset
    D_l → Naïve Bayes (NB)
  • Apply the trained learner to the remaining
    documents in D_u → EM
  • Idea
  • Harness the information in the unlabeled subset D_u.

7
The Generative Model
  • Probabilistic generative model
  • Every document is generated according to a
    probability distribution defined by the model
  • Assumptions
  • Mixture model
  • One-to-one correspondence between
    mixture components and classes
  • Document length is distributed independently of the class
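A sketch of the document likelihood under these assumptions (notation assumed
here: c_j are the mixture components / classes, w_t the vocabulary words, and
N(w_t, d_i) the count of word w_t in document d_i):

    P(d_i \mid \theta) = \sum_j P(c_j \mid \theta)\, P(d_i \mid c_j; \theta)

    P(d_i \mid c_j; \theta) \propto \prod_t P(w_t \mid c_j; \theta)^{N(w_t, d_i)}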

8
Naïve Bayes Learning
  • Assign each document to a particular mixture
    component.
  • The parameters of an individual mixture
    component form a multinomial distribution over
    words
  • Estimate the model parameters θ → maximum a posteriori
    (MAP) estimation (see the sketch below)
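A sketch of the MAP estimates with Laplace (add-one) priors, in the spirit of
the paper (D^l is the labeled set, V the vocabulary, C the set of classes;
P(c_j | d_i) is 1 for a labeled document's class and 0 otherwise):

    \hat{\theta}_{w_t \mid c_j} =
      \frac{1 + \sum_{d_i \in D^l} N(w_t, d_i)\, P(c_j \mid d_i)}
           {|V| + \sum_{s=1}^{|V|} \sum_{d_i \in D^l} N(w_s, d_i)\, P(c_j \mid d_i)}

    \hat{\theta}_{c_j} =
      \frac{1 + \sum_{d_i \in D^l} P(c_j \mid d_i)}{|C| + |D^l|}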

9
Naïve Bayes Learning
  • A maximum a posteriori estimate of the model
    parameters from only a small set of labeled data →
    high variance
  • How to improve parameter estimates?
  • Incorporate unlabeled documents

10
EM Algorithm
  • Iterative algorithm for parameter estimation
    (maximum a posteriori)
  • Incomplete data → the missing class labels
  • Estimate initial parameters θ from the labeled subset
  • Iterate
  • E-step: calculate probabilistic labels for the
    unlabeled documents using the current parameter
    estimate θ.
  • M-step: maximize the complete-data likelihood → a new
    maximum a posteriori estimate of θ using the current
    probabilistic labels.
  • Continue until convergence → θ at a local maximum
    (see the sketch below)
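A compact sketch of this loop for multinomial Naïve Bayes, assuming documents
are given as word-count matrices; all function and variable names here are
illustrative, not from the paper, and a fixed iteration count stands in for a
convergence test.

import numpy as np

def one_hot(y, k):
    Z = np.zeros((len(y), k))
    Z[np.arange(len(y)), y] = 1.0
    return Z

def m_step(X, resp):
    # MAP (Laplace-smoothed) multinomial NB from (possibly fractional) class counts
    prior = (resp.sum(axis=0) + 1) / (X.shape[0] + resp.shape[1])
    wc = resp.T @ X                               # expected word counts per class
    word_prob = (wc + 1) / (wc.sum(axis=1, keepdims=True) + X.shape[1])
    return prior, word_prob

def e_step(X, prior, word_prob):
    # log P(c) + sum_t N(w_t, d) log P(w_t | c), normalized per document
    log_post = np.log(prior) + X @ np.log(word_prob).T
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    return post / post.sum(axis=1, keepdims=True)

def em_nb(X_lab, y_lab, X_unl, k, n_iters=20):
    # Initialize theta from the labeled documents only
    prior, word_prob = m_step(X_lab, one_hot(y_lab, k))
    X_all = np.vstack([X_lab, X_unl])
    for _ in range(n_iters):
        # E-step: probabilistic labels for unlabeled docs; labeled docs keep their labels
        resp = np.vstack([one_hot(y_lab, k),
                          e_step(X_unl, prior, word_prob)])
        # M-step: new MAP estimate of theta from all documents
        prior, word_prob = m_step(X_all, resp)
    return prior, word_prob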

11
EM Issues
  • Generative model vs. real-world text data
  • Mixture model: one-to-one correspondence
    between mixture components and classes
  • the same parametric model is used for both
    generation and classification → violated by real text
  • Word conditional independence
  • the NB assumption
  • leads to extreme class probability estimates

12
EM Extensions
  • Real world data?
  • Weighting factor
  • Multiple mixture components

13
EM: Reducing belief in unlabeled data
  • Problems due to unlabeled data
  • noise in the term distribution of the documents in D_u
  • mistakes in the E-step
  • Solution
  • attenuate the contribution from the documents in D_u
  • add a damping factor λ ∈ [0, 1] to the E-step
    contribution from D_u (sketched below)
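A minimal change to the loop body of the em_nb sketch above (λ is the
weighting factor: λ = 1 recovers plain EM, λ = 0 ignores the unlabeled
documents entirely):

lam = 0.5                                    # weighting factor in [0, 1]
resp = np.vstack([one_hot(y_lab, k),
                  lam * e_step(X_unl, prior, word_prob)])   # damp unlabeled evidence
prior, word_prob = m_step(X_all, resp)
prior = prior / prior.sum()                  # renormalize: damped counts no longer sum to |D|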

14
EM: Modeling labels with many mixture components
  • The previous extension → reduces the effect of the
    mixture-model assumption
  • Goal: relax the assumption of a one-to-one
    correspondence between mixture components and
    class labels.
  • Introduce a many-to-one mapping from components to
    labels → the component assignments become additional
    missing values for EM (see the sketch below)
  • E.g. for the two-class case football vs. not
    football:
  • documents not about football are actually about
    a variety of other things
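A sketch of how a many-to-one mapping could be used at classification time
(the mapping and names are illustrative): each class's posterior is the summed
posterior of the mixture components mapped to it.

import numpy as np

# Illustrative mapping: component index -> class label
# (e.g. 1 component for "football", 3 components for "not football")
comp_to_class = np.array([0, 1, 1, 1])

def class_posterior(comp_post, comp_to_class, n_classes):
    # comp_post: (n_docs, n_components) posteriors over mixture components
    out = np.zeros((comp_post.shape[0], n_classes))
    for comp, cls in enumerate(comp_to_class):
        out[:, cls] += comp_post[:, comp]    # sum the mass of the class's components
    return out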

15
EM: Modeling labels with many mixture components
  • Lower accuracy with one mixture component per
    label → the label is not naturally modeled by a
    single component
  • Higher accuracy with more mixture components per
    label → captures word dependencies within a label
  • Too many mixture components → overfitting and poor
    performance

16
Experiments
  • Unlabeled data + EM (newsgroup articles)

17
Experiments
  • Unlabeled data + EM (web pages)

18
Experiments
  • Varying the weights on Unlabeled Data

19
Summary - Conclusions
  • Labels are expensive
  • Unlabeled data can supplement scarce labeled data
  • reduces classification error by up to 30%
  • Data can be inconsistent with the generative-model
    assumptions
  • Extensions of EM
  • weighting the unlabeled data prevents a decrease in
    accuracy
  • many-to-one mapping of mixture components to classes
  • Future Work
  • an incremental learning algorithm that uses the
    unlabeled data of the test phase