Title: Co-training
1. Co-training
- LING 572
- Fei Xia
- 02/21/06
2. Overview
- Proposed by Blum and Mitchell (1998)
- Important work:
  - (Nigam and Ghani, 2000)
  - (Goldman and Zhou, 2000)
  - (Abney, 2002)
  - (Sarkar, 2002)
- Used in document classification, parsing, etc.
3. Outline
- Basic concept (Blum and Mitchell, 1998)
- Relation with other SSL algorithms (Nigam and Ghani, 2000)
4. An example
- Web-page classification, e.g., finding homepages of faculty members.
- Page text: words occurring on that page
  - e.g., "research interest", "teaching"
- Hyperlink text: words occurring in hyperlinks that point to that page
  - e.g., "my advisor"
5. Two views
- Features can be split into two sets.
- X = X1 × X2: the instance space
- Each example x = (x1, x2)
- D: the distribution over X
- C1: the set of target functions over X1
- C2: the set of target functions over X2
6. Assumption 1: compatibility
- The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2).
- The compatibility of f with D ⇒ each set of features is sufficient for classification.
7. Assumption 2: conditional independence
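Roughly, and in the notation of the previous slides, the assumption in (Blum and Mitchell, 1998) is that the two views of an example are conditionally independent given its label:

    % x = (x1, x2) is drawn from D and y = f(x) is its label.
    % Once the label is known, one view carries no information about the other:
    \Pr_D\big(x_1, x_2 \mid f(x) = y\big)
      \;=\; \Pr_D\big(x_1 \mid f(x) = y\big)\,\Pr_D\big(x_2 \mid f(x) = y\big)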
8. Co-training algorithm
9. Co-training algorithm (cont)
- Why use U' (a smaller pool drawn from U), in addition to U?
  - Using U' yields better results.
  - Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U.
- Choosing p and n: the ratio p/n should match the ratio of positive examples to negative examples in D.
- Choosing the number of iterations and the size of U' (a sketch of the whole loop follows below).
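A minimal sketch of this loop, assuming scikit-learn-style Naïve Bayes classifiers over word counts and 0/1 labels; the function and argument names are illustrative, and p, n, the pool size u, and the iteration count k are the knobs discussed above.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def cotrain(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, k=30, u=75, seed=0):
        """Blum-and-Mitchell-style co-training (illustrative sketch).

        X1_*/X2_* are the two feature views (word-count vectors); the *_l rows
        are labeled (y_l holds 0/1 labels) and the *_u rows are unlabeled.
        """
        rng = np.random.default_rng(seed)
        X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
        U = list(range(len(X1_u)))
        pool = list(rng.choice(U, size=min(u, len(U)), replace=False))  # the pool U'
        U = [i for i in U if i not in pool]

        for _ in range(k):
            h1 = MultinomialNB().fit(X1_l, y_l)        # view-1 classifier
            h2 = MultinomialNB().fit(X2_l, y_l)        # view-2 classifier

            chosen = {}                                # pool index -> assigned label
            for h, X_view in ((h1, X1_u), (h2, X2_u)):
                probs = h.predict_proba([X_view[i] for i in pool])[:, 1]
                order = np.argsort(probs)              # ascending P(positive)
                picks = [(pool[j], 0) for j in order[:n]] + \
                        [(pool[j], 1) for j in order[len(order) - p:]]
                for i, label in picks:                 # each h labels p pos, n neg
                    chosen.setdefault(i, label)

            for i, label in chosen.items():            # grow the labeled set
                X1_l.append(X1_u[i]); X2_l.append(X2_u[i]); y_l.append(label)

            pool = [i for i in pool if i not in chosen]
            refill = min(len(chosen), len(U))          # replenish U' from U
            pool, U = pool + U[:refill], U[refill:]
            if not pool:                               # nothing left to label
                break

        return h1, h2

With p = 1, n = 3, u = 75, and k = 30 this mirrors the settings reported on the experiment-results slide.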
10. Intuition behind the co-training algorithm
- h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa.
- If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
11. Experiment setting
- 1051 web pages from 4 CS departments
  - 263 pages (25%) as test data
  - The remaining 75% of the pages:
    - Labeled data: 3 positive and 9 negative examples
    - Unlabeled data: the rest (776 pages)
  - Manually labeled into a number of categories, e.g., course home page.
- Two views
  - View 1 (page-based): words in the page
  - View 2 (hyperlink-based): words in the hyperlinks
- Learner: Naïve Bayes
12. Naïve Bayes classifier (Nigam and Ghani, 2000)
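A standard statement of the multinomial Naïve Bayes classifier used in these experiments (roughly as in Nigam and Ghani, 2000, with add-one smoothing over a vocabulary V; the exact notation here is mine):

    % Classification: choose the class with the highest posterior.
    P(c_j \mid d_i) \;\propto\; P(c_j) \prod_{w_k \in d_i} P(w_k \mid c_j)

    % Smoothed parameter estimates from the labeled documents D:
    \hat{P}(w_k \mid c_j) = \frac{1 + N(w_k, c_j)}{|V| + \sum_{w \in V} N(w, c_j)},
    \qquad
    \hat{P}(c_j) = \frac{1 + |\{ d_i \in D : y_i = c_j \}|}{|C| + |D|}

where N(w, c) is the number of occurrences of word w in documents of class c, |V| the vocabulary size, |C| the number of classes, and |D| the number of labeled documents.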
13. Experiment results
- p = 1, n = 3, number of iterations = 30, |U'| = 75
14. Questions
- Can co-training algorithms be applied to datasets without natural feature divisions?
- How sensitive are the co-training algorithms to the correctness of the assumptions?
- What is the relation between co-training and other SSL methods (e.g., self-training)?
15. (Nigam and Ghani, 2000)
16. EM
- Pool the features together.
- Use the initial labeled data to get initial parameter estimates.
- In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters.
- Repeat until convergence (a sketch of this loop follows below).
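A minimal sketch of this procedure for a multinomial Naïve Bayes model over the pooled features (NumPy only; the names and the smoothing constant are illustrative):

    import numpy as np

    def nb_em(X_l, y_l, X_u, n_classes=2, n_iter=20, alpha=1.0):
        """EM for semi-supervised multinomial Naive Bayes (illustrative sketch).

        X_l: labeled count matrix, y_l: labels in {0..n_classes-1},
        X_u: unlabeled count matrix. Returns log priors and log word probs.
        """
        X_l, X_u = np.asarray(X_l, float), np.asarray(X_u, float)
        V = X_l.shape[1]
        R_l = np.eye(n_classes)[np.asarray(y_l)]   # fixed one-hot rows for labeled docs

        def m_step(R, X):
            # re-estimate parameters from (fractionally) labeled documents
            class_mass, word_mass = R.sum(0), R.T @ X
            log_prior = np.log((class_mass + alpha) / (class_mass.sum() + alpha * n_classes))
            log_pw = np.log((word_mass + alpha) / (word_mass.sum(1, keepdims=True) + alpha * V))
            return log_prior, log_pw

        # initial parameter estimates from the labeled data only
        log_prior, log_pw = m_step(R_l, X_l)

        for _ in range(n_iter):   # a fixed iteration count stands in for "until convergence"
            # E-step: probabilistically label the unlabeled documents
            log_post = X_u @ log_pw.T + log_prior
            log_post -= log_post.max(1, keepdims=True)
            R_u = np.exp(log_post)
            R_u /= R_u.sum(1, keepdims=True)
            # M-step: re-estimate from ALL documents (labeled + soft-labeled)
            log_prior, log_pw = m_step(np.vstack([R_l, R_u]), np.vstack([X_l, X_u]))

        return log_prior, log_pw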
17. Experimental results: WebKB course database
- EM performs better than co-training.
- Both are close to the supervised method when trained on more labeled data.
18. Another experiment: the News 2x2 dataset
- A semi-artificial dataset
  - The conditional independence assumption holds.
- Co-training outperforms EM and the oracle result.
19. Co-training vs. EM
- Co-training splits the features; EM does not.
- Co-training incrementally uses the unlabeled data.
- EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data.
20. Co-EM: EM with feature split
- Repeat until convergence:
  - Train the A-feature-set classifier using the labeled data and the unlabeled data with B's labels.
  - Use classifier A to probabilistically label all the unlabeled data.
  - Train the B-feature-set classifier using the labeled data and the unlabeled data with A's labels.
  - B re-labels the data for use by A (a sketch of this alternation follows below).
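A sketch of this alternation, assuming scikit-learn's MultinomialNB; the soft labels are handled by replicating each unlabeled document once per class with a probability weight, which is one implementation choice, not necessarily the one in (Nigam and Ghani, 2000):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_em(XA_l, XB_l, y_l, XA_u, XB_u, n_classes=2, n_iter=10):
        """Co-EM sketch: alternate soft labeling between the two feature views.

        XA_*/XB_* are the two views (count vectors); *_l labeled, *_u unlabeled.
        """
        XA_l, XB_l = np.asarray(XA_l), np.asarray(XB_l)
        XA_u, XB_u = np.asarray(XA_u), np.asarray(XB_u)
        y_l, classes = np.asarray(y_l), np.arange(n_classes)

        def fit_soft(X_l, X_u, soft):
            """Train NB on labeled data plus soft-labeled unlabeled data."""
            if soft is None:                        # first round: labeled data only
                return MultinomialNB().fit(X_l, y_l)
            # replicate each unlabeled doc once per class, weighted by P(class | doc)
            X = np.vstack([X_l] + [X_u] * n_classes)
            y = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
            w = np.concatenate([np.ones(len(X_l))] + [soft[:, c] for c in classes])
            return MultinomialNB().fit(X, y, sample_weight=w)

        soft_B = None                               # B has produced no labels yet
        for _ in range(n_iter):
            h_A = fit_soft(XA_l, XA_u, soft_B)      # train A with B's current labels
            soft_A = h_A.predict_proba(XA_u)        # A probabilistically labels U
            h_B = fit_soft(XB_l, XB_u, soft_A)      # train B with A's labels
            soft_B = h_B.predict_proba(XB_u)        # B re-labels U for A's next round
        return h_A, h_B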
21. Four SSL methods
- Results on the News 2x2 dataset
22. Random feature split
- Co-training: 3.7 → 5.5; Co-EM: 3.3 → 5.1
- When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.
23. Assumptions
- Assumptions made by the underlying classifier (supervised learner)
  - Naïve Bayes: words occur independently of each other, given the class of the document.
  - Co-training uses the classifier to rank the unlabeled examples by confidence.
  - EM uses the classifier to assign probabilities to each unlabeled example.
- Assumptions made by the SSL method
  - Co-training: the conditional independence assumption.
  - EM: maximizing likelihood correlates with reducing classification errors.
24. Summary of (Nigam and Ghani, 2000)
- Comparison of four SSL methods: self-training, co-training, EM, co-EM.
- The performance of the SSL methods depends on how well the underlying assumptions are met.
- Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.
25. Variations of co-training
- Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set.
- Zhou and Li (2005) use three learners. If two agree, the data is used to teach the third learner.
- Balcan et al. (2005) relax the conditional independence assumption with a much weaker expansion condition.
26. An alternative?
- L → L1, L → L2
- U → U1, U → U2
- Repeat:
  - Train h1 using L1 on feature set 1
  - Train h2 using L2 on feature set 2
  - Classify U2 with h1 and let U2' be the subset with the most confident scores; L2 ∪ U2' → L2, U2 - U2' → U2
  - Classify U1 with h2 and let U1' be the subset with the most confident scores; L1 ∪ U1' → L1, U1 - U1' → U1 (a sketch of this variant follows below)
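For comparison with the earlier co-training sketch, a compact sketch of this variant with per-view labeled sets L1/L2 and disjoint unlabeled pools U1/U2; the batch size and the use of the maximum class probability as the confidence score are illustrative choices:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def cotrain_alt(X1, X2, y, labeled_idx, unlabeled_idx, k=30, batch=4):
        """Sketch of the 'alternative' loop: each view keeps its own labeled set.

        X1, X2: the two feature views over the same examples; y holds labels
        for the rows listed in labeled_idx (other entries are ignored).
        """
        X1, X2, y = np.asarray(X1), np.asarray(X2), np.asarray(y)
        L1, L2 = list(labeled_idx), list(labeled_idx)          # L -> L1, L -> L2
        half = len(unlabeled_idx) // 2
        U1, U2 = list(unlabeled_idx[:half]), list(unlabeled_idx[half:])  # U -> U1, U -> U2
        y1 = dict(zip(L1, y[L1])); y2 = dict(zip(L2, y[L2]))   # per-view label stores

        for _ in range(k):
            h1 = MultinomialNB().fit(X1[L1], [y1[i] for i in L1])  # view-1 learner
            h2 = MultinomialNB().fit(X2[L2], [y2[i] for i in L2])  # view-2 learner

            if U2:   # h1 labels its most confident picks from U2 and hands them to L2
                conf = h1.predict_proba(X1[U2]).max(1)
                picks = [U2[j] for j in np.argsort(conf)[-batch:]]
                for i in picks:
                    y2[i] = int(h1.predict(X1[[i]])[0]); L2.append(i)
                U2 = [i for i in U2 if i not in picks]

            if U1:   # h2 labels its most confident picks from U1 and hands them to L1
                conf = h2.predict_proba(X2[U1]).max(1)
                picks = [U1[j] for j in np.argsort(conf)[-batch:]]
                for i in picks:
                    y1[i] = int(h2.predict(X2[[i]])[0]); L1.append(i)
                U1 = [i for i in U1 if i not in picks]

        return h1, h2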
27. Yarowsky's algorithm
- One sense per discourse
  - → View 1: the ID of the document that a word is in
- One sense per collocation
  - → View 2: the local context of the word in the document
- Yarowsky's algorithm is a special case of co-training (Blum and Mitchell, 1998).
  - Is this correct? No, according to (Abney, 2002).
28. Summary of co-training
- The original paper (Blum and Mitchell, 1998)
  - Two independent views: split the features into two sets.
  - Train a classifier on each view.
  - Each classifier labels data that can be used to train the other classifier.
- Extensions
  - Relax the conditional independence assumption.
  - Instead of using two views, use two or more classifiers trained on the whole feature set.
29. Summary of SSL
- Goal: use both labeled and unlabeled data.
- Many algorithms: EM, co-EM, self-training, co-training, etc.
- Each algorithm is based on some assumptions.
- SSL works well when the assumptions are satisfied.
30. Additional slides
31. Rule independence
- H1 (H2) consists of rules that are functions of X1 (X2, resp.) only.
32.
- EM: the data is generated according to some simple known parametric model.
- Ex: the positive examples are generated according to an n-dimensional Gaussian D centered around the point