Title: Co-Training and Expansion: Towards Bridging Theory and Practice
1 Co-Training and Expansion: Towards Bridging Theory and Practice
- Maria-Florina Balcan, Avrim Blum, Ke Yang
- Carnegie Mellon University, Computer Science Department
2 Combining Labeled and Unlabeled Data (a.k.a. Semi-supervised Learning)
- Many applications have lots of unlabeled data, but labeled data is rare or expensive
  - Web page and document classification
  - OCR, image classification
- Several methods have been developed to try to use unlabeled data to improve performance, e.g.
  - Transductive SVM
  - Co-training
  - Graph-based methods
3 Co-training: method for combining labeled and unlabeled data
- Works in scenarios where examples have distinct, yet sufficient feature sets
  - An example x = (x1, x2) has two views, one per feature set
- Belief is that the two parts of the example are consistent, i.e. there exist c1, c2 such that c1(x1) = c2(x2) = c(x)
  - Each view is sufficient for correct classification
- Works by using unlabeled data to propagate learned information.
4 Co-Training: method for combining labeled and unlabeled data
- For example, if we want to classify web pages: one view is the text on the page itself, the other is the text of the links pointing to the page
5 Iterative Co-Training
- Have learning algorithms A1, A2 on each of the two views.
- Use labeled data to learn two initial hypotheses h1, h2.
- Look through unlabeled data to find examples where one of the hi is confident but the other is not.
- Have the confident hi label such examples for algorithm A(3-i).
- Repeat. (A minimal sketch of this loop is given below.)
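The loop above can be written as a short sketch. The code below is illustrative only, not the authors' implementation: it assumes two scikit-learn-style learners with fit/predict_proba, a confidence threshold tau, and a maximum round count, all of which are our own choices.

```python
# Minimal sketch of iterative co-training (illustrative, not the authors' code).
# Assumes scikit-learn-style learners A1, A2 and data split into two views.
import numpy as np

def co_train(A1, A2, X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab,
             tau=0.95, max_rounds=30):
    """Return hypotheses (h1, h2) trained by iterative co-training."""
    # Labeled pools for each view; both start from the same labeled data.
    pools = [[X1_lab.copy(), y_lab.copy()], [X2_lab.copy(), y_lab.copy()]]
    views = [X1_unlab.copy(), X2_unlab.copy()]
    unlabeled = np.ones(len(X1_unlab), dtype=bool)   # mask of still-unlabeled examples

    for _ in range(max_rounds):
        # Learn h1, h2 from the current labeled pools.
        h = [A1.fit(pools[0][0], pools[0][1]), A2.fit(pools[1][0], pools[1][1])]
        added = False
        for i in (0, 1):                              # view i teaches view 1-i
            if not unlabeled.any():
                break
            probs = h[i].predict_proba(views[i][unlabeled])
            conf = probs.max(axis=1) >= tau           # examples where h_i is confident
            if not conf.any():
                continue
            idx = np.where(unlabeled)[0][conf]
            labels = h[i].predict(views[i][idx])
            # Hand the confidently labeled examples to the other view's learner.
            pools[1 - i][0] = np.vstack([pools[1 - i][0], views[1 - i][idx]])
            pools[1 - i][1] = np.concatenate([pools[1 - i][1], labels])
            unlabeled[idx] = False
            added = True
        if not added:                                 # no confident examples left
            break
    return h[0], h[1]
```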
6 Iterative Co-Training, A Simple Example: Learning Intervals
[Figure: target intervals on the two views, with initial hypotheses h1 and h2]
- Use labeled data to learn initial hypotheses h1 and h2
- Use unlabeled data to bootstrap (a small simulation of this example follows below)
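As a concrete illustration of the intervals example, here is a small self-contained simulation (our own sketch, not from the talk; the interval bounds, sample sizes, and round count are made up). Each view is a point on the real line, the target on each view is an interval, and each side learns the tightest interval around the points it currently trusts, so it never labels a true negative as positive.

```python
# Tiny co-training simulation with interval concepts (illustrative sketch only).
import random

random.seed(0)
LO, HI = 0.3, 0.7          # the target interval, identical on both views here

def sample_neg():
    x = random.uniform(0, 1)
    while LO <= x <= HI:
        x = random.uniform(0, 1)
    return x

def sample(label, n):
    """Draw n consistent two-view examples with the given label."""
    if label:   # both views land inside the target interval
        return [(random.uniform(LO, HI), random.uniform(LO, HI)) for _ in range(n)]
    return [(sample_neg(), sample_neg()) for _ in range(n)]

labeled_pos = sample(True, 2)                     # a couple of labeled positives
unlabeled = sample(True, 500) + sample(False, 500)

# Confident positive sets for each view start from the labeled positives.
conf = [set(x for x, _ in labeled_pos), set(y for _, y in labeled_pos)]

for round_ in range(10):
    # Each learner outputs the tightest interval around its confident points,
    # so it never calls a true negative "positive".
    h = [(min(c), max(c)) for c in conf]
    for x1, x2 in unlabeled:
        if h[0][0] <= x1 <= h[0][1]:
            conf[1].add(x2)                       # view 1 is confident, teaches view 2
        if h[1][0] <= x2 <= h[1][1]:
            conf[0].add(x1)                       # view 2 is confident, teaches view 1
    covered = sum(h[0][0] <= x1 <= h[0][1] or h[1][0] <= x2 <= h[1][1]
                  for x1, x2 in unlabeled[:500]) / 500
    print(f"round {round_}: fraction of positives covered = {covered:.2f}")
```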
7 Theoretical/Conceptual Question
- What properties do we need for co-training to work well?
- Need assumptions about
  - the underlying data distribution
  - the learning algorithms on the two sides
8 Theoretical/Conceptual Question
- What property of the data do we need for co-training to work well?
- Previous work
  - (1) Independence given the label
  - (2) Weak rule dependence
- Our work: a much weaker assumption about how the data should behave
  - an expansion property of the underlying distribution
  - though we will need a stronger assumption on the learning algorithms compared to (1)
9 Co-Training, Formal Setting
- Assume that examples are drawn from a distribution D over the instance space X = X1 x X2.
- Let c be the target function; assume that each view is sufficient for correct classification
  - c can be decomposed into c1, c2 over the two views, s.t. D has no probability mass on examples x with c1(x1) ≠ c2(x2)
- Let X+ and X- denote the positive and negative regions of X.
- Let D+ and D- be the marginal distributions of D over X+ and X- respectively.
- Think of D+ as a bipartite graph with X1+ on one side and X2+ on the other, with an edge for every positive example (x1, x2) of nonzero probability.
[Figure: the instance space split into the positive region (D+) and the negative region (D-)]
10 Expansion (Formalization)
- We assume that D+ is expanding.
- Expansion: for some ε > 0, every pair of sets S1 ⊆ X1+, S2 ⊆ X2+ satisfies
  Pr(S1 ⊕ S2) ≥ ε · min[ Pr(S1 ∧ S2), Pr(¬S1 ∧ ¬S2) ]
  where probabilities are over D+, and S1 ⊕ S2 is the event that exactly one of "x1 ∈ S1", "x2 ∈ S2" holds. (A small brute-force check of this condition is sketched below.)
- This is a natural analog of the graph-theoretic notions of conductance and expansion.
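To make the definition concrete, here is a brute-force check of the expansion constant on a tiny finite positive distribution. The example distribution and domain names are made up for illustration; this is not part of the paper.

```python
# Brute-force check of the expansion condition on a small finite D+ (sketch).
from itertools import chain, combinations

# D+ as a dict: (x1, x2) -> probability, over small view domains.
X1 = ["a", "b", "c"]
X2 = ["u", "v", "w"]
D_pos = {("a", "u"): 0.3, ("a", "v"): 0.2, ("b", "v"): 0.2,
         ("b", "w"): 0.2, ("c", "w"): 0.1}

def subsets(xs):
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

def prob(event):
    return sum(p for ex, p in D_pos.items() if event(ex))

def expansion_constant():
    """Largest eps with Pr(S1 xor S2) >= eps * min(Pr(S1 and S2), Pr(~S1 and ~S2))."""
    eps = float("inf")
    for S1 in map(set, subsets(X1)):
        for S2 in map(set, subsets(X2)):
            both = prob(lambda ex: ex[0] in S1 and ex[1] in S2)
            neither = prob(lambda ex: ex[0] not in S1 and ex[1] not in S2)
            xor = prob(lambda ex: (ex[0] in S1) != (ex[1] in S2))
            denom = min(both, neither)
            if denom > 0:
                eps = min(eps, xor / denom)
    return eps

print("empirical expansion constant:", expansion_constant())
```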
11 Expansion: Property of the Underlying Distribution
- Necessary condition for co-training to work well
  - If S1 and S2 (our confident sets) do not expand, then we might never see examples for which one hypothesis could help the other.
- We show it is also sufficient for co-training to generalize well in a relatively small number of iterations, under some assumptions
  - the data is perfectly separable
  - we have strong learning algorithms on the two sides
12 Expansion, Examples: Learning Intervals
[Figure: a non-expanding distribution vs. an expanding distribution in the intervals example]
13 Expansion
- Weaker than independence given the label and than weak rule dependence.
  - E.g., w.h.p. a random degree-3 bipartite graph is expanding, but would NOT have independence given the label, or weak rule dependence. (A quick check of the first comparison is sketched below.)
[Figure: positive region D+ drawn as a bipartite graph, alongside the negative region D-]
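To see why independence given the label is the stronger condition, here is a short derivation of ours (not from the slides): if the two views are independent under D+, then for any S1 ⊆ X1+, S2 ⊆ X2+, writing p = Pr(x1 ∈ S1) and q = Pr(x2 ∈ S2) over D+, expansion holds with ε = 1.

```latex
% Conditional independence over D^+ implies expansion (our sketch).
% With p = \Pr(S_1), q = \Pr(S_2) and the two views independent under D^+:
\begin{align*}
\Pr(S_1 \oplus S_2) &= p(1-q) + (1-p)q = p + q - 2pq,\\
\min\big[\Pr(S_1 \wedge S_2),\, \Pr(\bar S_1 \wedge \bar S_2)\big]
  &= \min\big[pq,\ (1-p)(1-q)\big].
\end{align*}
% Assume w.l.o.g. that pq <= (1-p)(1-q), i.e. p + q <= 1. Then
% pq <= (p+q)^2/4 <= (p+q)/4, so 3pq <= p + q, hence
\[
\Pr(S_1 \oplus S_2) = p + q - 2pq \;\ge\; pq
  \;=\; \min\big[\Pr(S_1 \wedge S_2),\, \Pr(\bar S_1 \wedge \bar S_2)\big],
\]
% i.e. the expansion condition holds with \varepsilon = 1. A random degree-3
% bipartite graph can be expanding without any such independence.
```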
14 Main Result
- Assume D+ is ε-expanding.
- Assume that on each of the two views we have algorithms A1 and A2 for learning from positive data only.
- Assume that we have initial confident sets S1^0 ⊆ X1+ and S2^0 ⊆ X2+ whose joint probability mass Pr(S1^0 ∧ S2^0) under D+ is bounded away from zero.
- Then, after a relatively small number of co-training iterations, the confident sets grow to cover nearly all of D+.
15 Main Result, Interpretation
- The assumption on A1, A2 implies that they never generalize incorrectly.
- The question is what needs to be true for them to actually generalize to the whole of D+?
16 Main Result, Proof Idea
- Expansion implies that at each iteration, there is reasonable probability mass on "new, useful" data. (A rough sketch of the growth argument follows below.)
- Algorithms generalize to most of this new region.
- See the paper for the real proof.
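Roughly, in our paraphrase of the growth argument (the paper's proof also handles finite samples and the learners' behavior): let S1^i, S2^i be the confident sets after round i and p_i = Pr over D+ of S1^i ∧ S2^i. If each side correctly absorbs the region where only the other side was confident, one round gives:

```latex
% One-round growth under \varepsilon-expansion (informal sketch).
\[
  p_{i+1} \;\ge\; \Pr\!\big(S_1^i \vee S_2^i\big)
          \;=\; p_i + \Pr\!\big(S_1^i \oplus S_2^i\big)
          \;\ge\; p_i + \varepsilon \cdot
            \min\!\big[\,p_i,\; 1 - \Pr(S_1^i \vee S_2^i)\,\big].
\]
% While coverage is small, p_i grows by roughly a (1+\varepsilon) factor per
% round; once coverage passes one half, the uncovered mass shrinks at a
% comparable geometric rate, so on the order of
% (1/\varepsilon)\log\!\big(1/(\varepsilon_{\mathrm{fin}}\, p_0)\big)
% rounds suffice to reach coverage 1-\varepsilon_{\mathrm{fin}}.
```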
17 What if assumptions are violated?
- What if our algorithms can make incorrect generalizations and/or there is no perfect separability?
18 What if assumptions are violated?
- Expect "leakage" into the negative region.
- If the negative region is expanding too, then incorrect generalizations will grow at an exponential rate.
- Correct generalizations are growing at an exponential rate too, but will slow down first (having started larger, they saturate the positive region sooner).
- Expect overall accuracy to go up, then down.
19 Synthetic Experiments
- Create a 2n-by-2n bipartite graph
  - nodes 1 to n on each side represent positive clusters
  - nodes n+1 to 2n on each side represent negative clusters
- Connect each node on the left to 3 nodes on the right
  - each neighbor is chosen with prob. 1-p to be a random node of the same class, and with prob. p to be a random node of the opposite class
- Begin with an initial confident set and then propagate confidence through rounds of co-training
- Monitor the percentage of the positive class covered, the percentage of the negative class mistakenly covered, and the overall accuracy (a sketch of this simulation is given below)
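A compact simulation along these lines (our own sketch; the parameter names, the single-node initial confident set, and the confidence-propagation rule are our reading of the bullets above, not the authors' code):

```python
# Sketch of the synthetic co-training experiment on a random bipartite graph.
import random

def run(n=5000, d=3, p=0.01, rounds=20, seed=0):
    rng = random.Random(seed)
    # On each side, nodes 0..n-1 are positive clusters, n..2n-1 negative clusters.
    def neighbors(u):
        same = range(0, n) if u < n else range(n, 2 * n)
        diff = range(n, 2 * n) if u < n else range(0, n)
        return [rng.choice(same if rng.random() > p else diff) for _ in range(d)]

    edges = {u: neighbors(u) for u in range(2 * n)}     # left node -> right nodes

    conf_left, conf_right = {0}, set()                  # start confident on one positive cluster
    for r in range(rounds):
        # Propagate confidence across edges in both directions.
        conf_right |= {v for u in conf_left for v in edges[u]}
        conf_left |= {u for u, vs in edges.items() if any(v in conf_right for v in vs)}

        pos_covered = sum(1 for u in conf_left if u < n) / n
        neg_covered = sum(1 for u in conf_left if u >= n) / n
        accuracy = (pos_covered + (1 - neg_covered)) / 2  # classes weighted equally
        print(f"round {r:2d}: pos {pos_covered:.3f}  neg {neg_covered:.3f}  acc {accuracy:.3f}")

run()
```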
20 Synthetic Experiments
[Plots for p = 0.01, n = 5000, d = 3 and for p = 0.001, n = 5000, d = 3]
- solid line indicates overall accuracy
- green curve is accuracy on positives
- red curve is accuracy on negatives
21 Conclusions
- We propose a much weaker expansion assumption on the underlying data distribution.
- It seems to be the right condition on the distribution for co-training to work well.
- It directly motivates the iterative nature of many of the practical co-training-based algorithms.