Title: Co-training
1. Co-training
- LING 572
- Fei Xia
- 02/21/06
2. Overview
- Proposed by Blum and Mitchell (1998)
- Important work:
  - (Nigam and Ghani, 2000)
  - (Goldman and Zhou, 2000)
  - (Abney, 2002)
  - (Sarkar, 2002)
- Used in document classification, parsing, etc.
3. Outline
- Basic concept (Blum and Mitchell, 1998)
- Relation with other SSL algorithms (Nigam and Ghani, 2000)
4. An example
- Web-page classification, e.g., finding homepages of faculty members.
- Page text: words occurring on that page
  - e.g., "research interest", "teaching"
- Hyperlink text: words occurring in hyperlinks that point to that page
  - e.g., "my advisor"
5. Two views
- Features can be split into two sets.
- X = X1 × X2: the instance space
- Each example x = (x1, x2)
- D: the distribution over X
- C1: the set of target functions over X1
- C2: the set of target functions over X2
6. Assumption 1: compatibility
- The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2).
- The compatibility of f with D ⇒ each set of features is sufficient for classification.
7. Assumption 2: conditional independence
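Roughly, and in the notation of the previous slides, the assumption in (Blum and Mitchell, 1998) is that the two views of an example are conditionally independent given its label:

    % x = (x1, x2) is drawn from D and y = f(x) is its label.
    % Once the label is known, one view carries no information about the other:
    \Pr_D\big(x_1, x_2 \mid f(x) = y\big)
      \;=\; \Pr_D\big(x_1 \mid f(x) = y\big)\,\Pr_D\big(x_2 \mid f(x) = y\big)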
8. Co-training algorithm
9. Co-training algorithm (cont)
- Why use U' (a smaller pool drawn from U), in addition to U?
  - Using U' yields better results.
  - Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U.
- Choosing p and n: the ratio p/n should match the ratio of positive examples to negative examples in D.
- Choosing the number of iterations and the size of U' (a sketch of the whole loop follows below).
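A minimal sketch of this loop, assuming scikit-learn-style Naïve Bayes classifiers over word counts and 0/1 labels; the function and argument names are illustrative, and p, n, the pool size u, and the iteration count k are the knobs discussed above.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def cotrain(X1_l, X2_l, y_l, X1_u, X2_u, p=1, n=3, k=30, u=75, seed=0):
        """Blum-and-Mitchell-style co-training (illustrative sketch).

        X1_*/X2_* are the two feature views (word-count vectors); the *_l rows
        are labeled (y_l holds 0/1 labels) and the *_u rows are unlabeled.
        """
        rng = np.random.default_rng(seed)
        X1_l, X2_l, y_l = list(X1_l), list(X2_l), list(y_l)
        U = list(range(len(X1_u)))
        pool = list(rng.choice(U, size=min(u, len(U)), replace=False))  # the pool U'
        U = [i for i in U if i not in pool]

        for _ in range(k):
            h1 = MultinomialNB().fit(X1_l, y_l)        # view-1 classifier
            h2 = MultinomialNB().fit(X2_l, y_l)        # view-2 classifier

            chosen = {}                                # pool index -> assigned label
            for h, X_view in ((h1, X1_u), (h2, X2_u)):
                probs = h.predict_proba([X_view[i] for i in pool])[:, 1]
                order = np.argsort(probs)              # ascending P(positive)
                picks = [(pool[j], 0) for j in order[:n]] + \
                        [(pool[j], 1) for j in order[len(order) - p:]]
                for i, label in picks:                 # each h labels p pos, n neg
                    chosen.setdefault(i, label)

            for i, label in chosen.items():            # grow the labeled set
                X1_l.append(X1_u[i]); X2_l.append(X2_u[i]); y_l.append(label)

            pool = [i for i in pool if i not in chosen]
            refill = min(len(chosen), len(U))          # replenish U' from U
            pool, U = pool + U[:refill], U[refill:]
            if not pool:                               # nothing left to label
                break

        return h1, h2

With p = 1, n = 3, u = 75, and k = 30 this mirrors the settings reported on the experiment-results slide.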
10. Intuition behind the co-training algorithm
- h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa.
- If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
11. Experiment setting
- 1051 web pages from 4 CS departments
  - 263 pages (25%) as test data
  - The remaining 75% of the pages:
    - Labeled data: 3 positive and 9 negative examples
    - Unlabeled data: the rest (776 pages)
  - Manually labeled into a number of categories, e.g., course home page.
- Two views
  - View 1 (page-based): words in the page
  - View 2 (hyperlink-based): words in the hyperlinks
- Learner: Naïve Bayes
12. Naïve Bayes classifier (Nigam and Ghani, 2000)
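A standard statement of the multinomial Naïve Bayes classifier used in these experiments (roughly as in Nigam and Ghani, 2000, with add-one smoothing over a vocabulary V; the exact notation here is mine):

    % Classification: choose the class with the highest posterior.
    P(c_j \mid d_i) \;\propto\; P(c_j) \prod_{w_k \in d_i} P(w_k \mid c_j)

    % Smoothed parameter estimates from the labeled documents D:
    \hat{P}(w_k \mid c_j) = \frac{1 + N(w_k, c_j)}{|V| + \sum_{w \in V} N(w, c_j)},
    \qquad
    \hat{P}(c_j) = \frac{1 + |\{ d_i \in D : y_i = c_j \}|}{|C| + |D|}

where N(w, c) is the number of occurrences of word w in documents of class c, |V| the vocabulary size, |C| the number of classes, and |D| the number of labeled documents.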
13. Experiment results
- p = 1, n = 3, number of iterations = 30, |U'| = 75
14. Questions
- Can co-training algorithms be applied to datasets without natural feature divisions?
- How sensitive are the co-training algorithms to the correctness of the assumptions?
- What is the relation between co-training and other SSL methods (e.g., self-training)?
15. (Nigam and Ghani, 2000)
16. EM
- Pool the features together.
- Use the initial labeled data to get initial parameter estimates.
- In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters.
- Repeat until convergence (a sketch of this loop follows below).
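A minimal sketch of this procedure for a multinomial Naïve Bayes model over the pooled features (NumPy only; the names and the smoothing constant are illustrative):

    import numpy as np

    def nb_em(X_l, y_l, X_u, n_classes=2, n_iter=20, alpha=1.0):
        """EM for semi-supervised multinomial Naive Bayes (illustrative sketch).

        X_l: labeled count matrix, y_l: labels in {0..n_classes-1},
        X_u: unlabeled count matrix. Returns log priors and log word probs.
        """
        X_l, X_u = np.asarray(X_l, float), np.asarray(X_u, float)
        V = X_l.shape[1]
        R_l = np.eye(n_classes)[np.asarray(y_l)]   # fixed one-hot rows for labeled docs

        def m_step(R, X):
            # re-estimate parameters from (fractionally) labeled documents
            class_mass, word_mass = R.sum(0), R.T @ X
            log_prior = np.log((class_mass + alpha) / (class_mass.sum() + alpha * n_classes))
            log_pw = np.log((word_mass + alpha) / (word_mass.sum(1, keepdims=True) + alpha * V))
            return log_prior, log_pw

        # initial parameter estimates from the labeled data only
        log_prior, log_pw = m_step(R_l, X_l)

        for _ in range(n_iter):   # a fixed iteration count stands in for "until convergence"
            # E-step: probabilistically label the unlabeled documents
            log_post = X_u @ log_pw.T + log_prior
            log_post -= log_post.max(1, keepdims=True)
            R_u = np.exp(log_post)
            R_u /= R_u.sum(1, keepdims=True)
            # M-step: re-estimate from ALL documents (labeled + soft-labeled)
            log_prior, log_pw = m_step(np.vstack([R_l, R_u]), np.vstack([X_l, X_u]))

        return log_prior, log_pw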
17. Experimental results: WebKB course database
- EM performs better than co-training.
- Both are close to the supervised method when trained on more labeled data.
18. Another experiment: the News 2x2 dataset
- A semi-artificial dataset
  - The conditional independence assumption holds.
- Co-training outperforms EM and the oracle result.
19. Co-training vs. EM
- Co-training splits the features; EM does not.
- Co-training incrementally uses the unlabeled data.
- EM probabilistically labels all the data at each round; EM iteratively uses the unlabeled data.
20. Co-EM: EM with feature split
- Repeat until convergence:
  - Train the A-feature-set classifier using the labeled data and the unlabeled data with B's labels.
  - Use classifier A to probabilistically label all the unlabeled data.
  - Train the B-feature-set classifier using the labeled data and the unlabeled data with A's labels.
  - B re-labels the data for use by A (a sketch of this alternation follows below).
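A sketch of this alternation, assuming scikit-learn's MultinomialNB; the soft labels are handled by replicating each unlabeled document once per class with a probability weight, which is one implementation choice, not necessarily the one in (Nigam and Ghani, 2000):

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def co_em(XA_l, XB_l, y_l, XA_u, XB_u, n_classes=2, n_iter=10):
        """Co-EM sketch: alternate soft labeling between the two feature views.

        XA_*/XB_* are the two views (count vectors); *_l labeled, *_u unlabeled.
        """
        XA_l, XB_l = np.asarray(XA_l), np.asarray(XB_l)
        XA_u, XB_u = np.asarray(XA_u), np.asarray(XB_u)
        y_l, classes = np.asarray(y_l), np.arange(n_classes)

        def fit_soft(X_l, X_u, soft):
            """Train NB on labeled data plus soft-labeled unlabeled data."""
            if soft is None:                        # first round: labeled data only
                return MultinomialNB().fit(X_l, y_l)
            # replicate each unlabeled doc once per class, weighted by P(class | doc)
            X = np.vstack([X_l] + [X_u] * n_classes)
            y = np.concatenate([y_l] + [np.full(len(X_u), c) for c in classes])
            w = np.concatenate([np.ones(len(X_l))] + [soft[:, c] for c in classes])
            return MultinomialNB().fit(X, y, sample_weight=w)

        soft_B = None                               # B has produced no labels yet
        for _ in range(n_iter):
            h_A = fit_soft(XA_l, XA_u, soft_B)      # train A with B's current labels
            soft_A = h_A.predict_proba(XA_u)        # A probabilistically labels U
            h_B = fit_soft(XB_l, XB_u, soft_A)      # train B with A's labels
            soft_B = h_B.predict_proba(XB_u)        # B re-labels U for A's next round
        return h_A, h_B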
21. Four SSL methods
- Results on the News 2x2 dataset
22. Random feature split
- Co-training: 3.7 → 5.5; Co-EM: 3.3 → 5.1
- When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.
23. Assumptions
- Assumptions made by the underlying classifier (supervised learner)
  - Naïve Bayes: words occur independently of each other, given the class of the document.
  - Co-training uses the classifier to rank the unlabeled examples by confidence.
  - EM uses the classifier to assign probabilities to each unlabeled example.
- Assumptions made by the SSL method
  - Co-training: the conditional independence assumption.
  - EM: maximizing likelihood correlates with reducing classification errors.
24. Summary of (Nigam and Ghani, 2000)
- Comparison of four SSL methods: self-training, co-training, EM, co-EM.
- The performance of the SSL methods depends on how well the underlying assumptions are met.
- Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.
25. Variations of co-training
- Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set.
- Zhou and Li (2005) use three learners. If two agree, the data is used to teach the third learner.
- Balcan et al. (2005) relax the conditional independence assumption with a much weaker expansion condition.
26. An alternative?
- L → L1, L → L2
- U → U1, U → U2
- Repeat:
  - Train h1 using L1 on feature set 1
  - Train h2 using L2 on feature set 2
  - Classify U2 with h1 and let U2' be the subset with the most confident scores; L2 ∪ U2' → L2, U2 - U2' → U2
  - Classify U1 with h2 and let U1' be the subset with the most confident scores; L1 ∪ U1' → L1, U1 - U1' → U1 (a sketch of this variant follows below)
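For comparison with the earlier co-training sketch, a compact sketch of this variant with per-view labeled sets L1/L2 and disjoint unlabeled pools U1/U2; the batch size and the use of the maximum class probability as the confidence score are illustrative choices:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    def cotrain_alt(X1, X2, y, labeled_idx, unlabeled_idx, k=30, batch=4):
        """Sketch of the 'alternative' loop: each view keeps its own labeled set.

        X1, X2: the two feature views over the same examples; y holds labels
        for the rows listed in labeled_idx (other entries are ignored).
        """
        X1, X2, y = np.asarray(X1), np.asarray(X2), np.asarray(y)
        L1, L2 = list(labeled_idx), list(labeled_idx)          # L -> L1, L -> L2
        half = len(unlabeled_idx) // 2
        U1, U2 = list(unlabeled_idx[:half]), list(unlabeled_idx[half:])  # U -> U1, U -> U2
        y1 = dict(zip(L1, y[L1])); y2 = dict(zip(L2, y[L2]))   # per-view label stores

        for _ in range(k):
            h1 = MultinomialNB().fit(X1[L1], [y1[i] for i in L1])  # view-1 learner
            h2 = MultinomialNB().fit(X2[L2], [y2[i] for i in L2])  # view-2 learner

            if U2:   # h1 labels its most confident picks from U2 and hands them to L2
                conf = h1.predict_proba(X1[U2]).max(1)
                picks = [U2[j] for j in np.argsort(conf)[-batch:]]
                for i in picks:
                    y2[i] = int(h1.predict(X1[[i]])[0]); L2.append(i)
                U2 = [i for i in U2 if i not in picks]

            if U1:   # h2 labels its most confident picks from U1 and hands them to L1
                conf = h2.predict_proba(X2[U1]).max(1)
                picks = [U1[j] for j in np.argsort(conf)[-batch:]]
                for i in picks:
                    y1[i] = int(h2.predict(X2[[i]])[0]); L1.append(i)
                U1 = [i for i in U1 if i not in picks]

        return h1, h2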
27. Yarowsky's algorithm
- One sense per discourse
  - → View 1: the ID of the document that a word is in
- One sense per collocation
  - → View 2: the local context of the word in the document
- Yarowsky's algorithm is a special case of co-training (Blum and Mitchell, 1998).
  - Is this correct? No, according to (Abney, 2002).
28. Summary of co-training
- The original paper (Blum and Mitchell, 1998)
  - Two independent views: split the features into two sets.
  - Train a classifier on each view.
  - Each classifier labels data that can be used to train the other classifier.
- Extensions
  - Relax the conditional independence assumption.
  - Instead of using two views, use two or more classifiers trained on the whole feature set.
29. Summary of SSL
- Goal: use both labeled and unlabeled data.
- Many algorithms: EM, co-EM, self-training, co-training, etc.
- Each algorithm is based on some assumptions.
- SSL works well when the assumptions are satisfied.
30. Additional slides
31. Rule independence
- H1 (H2) consists of rules that are functions of X1 (X2, resp.) only.
32.
- EM: the data is generated according to some simple known parametric model.
- Ex: the positive examples are generated according to an n-dimensional Gaussian D centered around the point