Title: Analysis of perceptron-based active learning
1 - Analysis of perceptron-based active learning
- Sanjoy Dasgupta, UCSD
- Adam Tauman Kalai, TTI-Chicago
- Claire Monteleoni, MIT
2 - Selective sampling, online constraints
- Selective sampling framework
- Unlabeled examples x_t are received one at a time.
- The learner makes a prediction at each time step.
- A noiseless oracle for the label y_t can be queried, at a cost.
- Goal: minimize the number of labels needed to reach error ε.
- ε is the error rate (w.r.t. the target) on the sampling distribution.
- Online constraints
- Space: the learner cannot store all previously seen examples (and then perform batch learning).
- Time: the running time of the learner's belief-update step should not scale with the number of seen examples/mistakes.
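The constraints above pin down the interface: predict on every point, pay only for queried labels, and keep the update constant-time in the history. A minimal sketch of that loop follows; the names `filter_rule`, `oracle`, and `update` are placeholders for this summary, not notation from the paper.

```python
import numpy as np

def selective_sampling(stream, filter_rule, update, oracle, v0):
    """Generic selective-sampling loop: predict on every x_t, query the
    oracle only when the filtering rule asks, and update only on that label."""
    v, num_labels = v0, 0
    for x in stream:                     # unlabeled examples, one at a time
        y_hat = np.sign(v @ x)           # prediction at this time step
        if filter_rule(v, x):            # decide whether to pay for a label
            y = oracle(x)                # noiseless but costly label y_t
            num_labels += 1
            if y != y_hat:
                v = update(v, x, y)      # constant time/space belief update
    return v, num_labels
```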
3 - AC Milan v. Inter Milan
4 - Problem framework
- Target u, current hypothesis v_t, and the error region ξ_t between them, with angle θ_t.
- Assumptions:
- Separability.
- u is a hyperplane through the origin.
- x is uniform on the unit sphere S.
- Error rate ε_t = θ_t / π.
[Figure: target u, hypothesis v_t, angle θ_t, and error region ξ_t]
5 - Related work
- Analysis, under the selective sampling model, of the Query By Committee algorithm [Seung, Opper & Sompolinsky '92].
- Theorem [Freund, Seung, Shamir & Tishby '97]: Under selective sampling from the uniform distribution, QBC can learn a half-space through the origin to generalization error ε using Õ(d log 1/ε) labels.
- BUT: the space required, and the time complexity of the update, both scale with the number of seen mistakes!
6 - Related work
- Perceptron: a simple online algorithm.
- Filtering rule: update only if y_t ≠ SGN(v_t · x_t).
- Update step: v_{t+1} = v_t + y_t x_t.
- Distribution-free mistake bound O(1/γ²), if a margin γ exists.
- Theorem [Baum '89]: Perceptron, given sequential labeled examples from the uniform distribution, can converge to generalization error ε after Õ(d/ε²) mistakes.
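For concreteness, the classical rule from the bullets above as a small sketch (standard Perceptron, supervised setting):

```python
import numpy as np

def perceptron_step(v, x, y):
    """Standard Perceptron: update only when the current prediction is wrong."""
    if y * (v @ x) <= 0:        # filtering rule: y_t != SGN(v_t . x_t)
        v = v + y * x           # update step: v_{t+1} = v_t + y_t x_t
    return v
```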
7 - Our contributions
- A lower bound for Perceptron, in the active learning context, of Ω(1/ε²) labels.
- A modified Perceptron update with an Õ(d log 1/ε) mistake bound.
- An active learning rule and a label bound of Õ(d log 1/ε).
- A bound of Õ(d log 1/ε) on total errors (labeled or not).
8 - Perceptron
- Perceptron update: v_{t+1} = v_t + y_t x_t.
- The error (angle θ_t) does not decrease monotonically.
[Figure: geometry of a Perceptron update, showing u, v_t, x_t, and v_{t+1}]
9 - Lower bound on labels for Perceptron
- Theorem 1: The Perceptron algorithm, using any active learning rule, requires Ω(1/ε²) labels to reach generalization error ε w.r.t. the uniform distribution.
- Proof idea:
- Lemma: For small θ_t, the Perceptron update will increase θ_t unless ||v_t|| is large: Ω(1/sin θ_t).
- But ||v_t|| grows slowly (||v_t||² increases by at most 1 per update), so we need t ≥ 1/sin²θ_t.
- Under the uniform distribution, ε_t ∝ θ_t ≈ sin θ_t for small θ_t.
[Figure: geometry of a Perceptron update, showing u, v_t, x_t, and v_{t+1}]
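One can watch the two quantities in the argument directly by running the standard Perceptron on uniformly distributed data; the sketch below (dimension, horizon, and seed are arbitrary choices, not values from the paper) simply prints ||v_t|| and θ_t as t grows.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
u = np.zeros(d); u[0] = 1.0                 # target halfspace through the origin
v = rng.standard_normal(d)                  # arbitrary initial hypothesis

for t in range(1, 100001):
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)                  # x_t uniform on the unit sphere
    y = 1.0 if u @ x >= 0 else -1.0         # noiseless label
    if y * (v @ x) <= 0:                    # mistake: standard Perceptron update
        v = v + y * x
    if t % 20000 == 0:
        theta = np.arccos(np.clip(u @ v / np.linalg.norm(v), -1.0, 1.0))
        print(t, np.linalg.norm(v), theta)  # ||v_t|| grows at most like sqrt(t)
```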
10 - A modified Perceptron update
- Standard Perceptron update: v_{t+1} = v_t + y_t x_t.
- Instead, weight the update by the confidence w.r.t. the current hypothesis v_t:
- v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t   (with v_1 = y_0 x_0)
- (similar to an update in Blum et al. '96 for noise-tolerant learning)
- Unlike Perceptron:
- The error decreases monotonically:
- cos(θ_{t+1}) = u · v_{t+1} = u · v_t + 2 |v_t · x_t| |u · x_t| ≥ u · v_t = cos(θ_t)
- ||v_t|| = 1 (due to the factor of 2)
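As a sketch, the update above can equivalently be written as a reflection of v_t about the hyperplane orthogonal to x_t, which is why the norm stays 1:

```python
import numpy as np

def modified_perceptron_step(v, x, y):
    """Modified (confidence-weighted) Perceptron update from the slide:
    on a mistake, v_{t+1} = v_t + 2*y_t*|v_t.x_t|*x_t, which (since
    y_t != sign(v_t.x_t)) equals the reflection v_t - 2*(v_t.x_t)*x_t."""
    if y * (v @ x) <= 0:                 # mistake
        v = v - 2.0 * (v @ x) * x        # reflection: ||v|| is preserved
    return v
```

Initializing with v_1 = y_0 x_0, as on the slide, keeps ||v_t|| = 1 for all t when the x_t are unit vectors.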
11 - A modified Perceptron update
- Perceptron update: v_{t+1} = v_t + y_t x_t.
- Modified Perceptron update: v_{t+1} = v_t + 2 y_t |v_t · x_t| x_t.
[Figure: the two updates compared, showing u, v_t, x_t, and the resulting v_{t+1} in each case]
12 - Mistake bound
- Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes.
- Proof idea: The exponential convergence follows from a multiplicative decrease in the error θ_t.
- On an update, 1 − u · v_{t+1} = (1 − u · v_t) − 2 |v_t · x_t| |u · x_t|.
- We lower bound 2 |v_t · x_t| |u · x_t|, with high probability, using our distributional assumption.
13 - Mistake bound
- Theorem 2: In the supervised setting, the modified Perceptron converges to generalization error ε after Õ(d log 1/ε) mistakes.
- Lemma (band): For any fixed a with ||a|| = 1, any γ ≤ 1, and x uniform on S, the probability that |a · x| ≤ γ/√d is Θ(γ).
- Apply this to v_t · x and u · x ⇒ 2 |v_t · x_t| |u · x_t| is large enough in expectation (using the size of ξ_t).
[Figure: the band {x : |a · x| ≤ γ/√d} around the hyperplane orthogonal to a]
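The scaling of the band mass is easy to check empirically; the snippet below is only a Monte Carlo sanity check of the lemma's statement (dimension and sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 50, 200000
a = np.zeros(d); a[0] = 1.0                      # fixed unit vector a

x = rng.standard_normal((n, d))
x /= np.linalg.norm(x, axis=1, keepdims=True)    # x uniform on the unit sphere S

for gamma in (0.1, 0.2, 0.4, 0.8):
    p = np.mean(np.abs(x @ a) <= gamma / np.sqrt(d))
    print(gamma, p)                              # band mass should scale like gamma
```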
14 - Active learning rule
- Goal: filter so as to label just those points in the error region.
- But θ_t, and thus ξ_t, is unknown!
- Define the labeling region L = {x : |v_t · x| ≤ s_t}.
- Tradeoff in choosing the threshold s_t:
- If too high, we may wait too long for an error.
- If too low, the resulting update is too small.
- An appropriate choice of s_t (on the order of θ_t/√d) makes the fraction of queried points that fall in the error region constant.
- But θ_t is unknown! So choose s_t adaptively:
- Start high. Halve it if there is no error in R consecutive labels (see the sketch below).
[Figure: the labeling region L, of width s_t around the hyperplane defined by v_t, together with the target u]
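Putting the pieces together, a sketch of the resulting active learner: query only inside the labeling region, apply the modified Perceptron update on labeled mistakes, and halve s_t after R consecutive labels without a mistake. The initial threshold 1/√d and the handling of the first labeled point are assumptions made for this sketch; the precise constants come from the analysis.

```python
import numpy as np

def active_modified_perceptron(stream, oracle, d, R, s_init=None):
    """Sketch of the adaptive filtering rule with the modified Perceptron update."""
    s = s_init if s_init is not None else 1.0 / np.sqrt(d)  # assumed starting threshold
    v = None
    streak = 0                              # queried labels since the last mistake
    for x in stream:
        if v is None:
            v = oracle(x) * x               # initialize v_1 = y_0 x_0
            continue
        if abs(v @ x) <= s:                 # x lies in the labeling region L
            y = oracle(x)
            if y * (v @ x) <= 0:            # labeled mistake
                v = v - 2.0 * (v @ x) * x   # modified Perceptron update
                streak = 0
            else:
                streak += 1
                if streak >= R:             # R consecutive labels, no error
                    s /= 2.0                # halve the threshold s_t
                    streak = 0
    return v
```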
15 - Label bound
- Theorem 3: In the active learning setting, the modified Perceptron, using the adaptive filtering rule, will converge to generalization error ε after Õ(d log 1/ε) labels.
- Corollary: The total number of errors (labeled and unlabeled) will be Õ(d log 1/ε).
16 - Proof technique
- Proof outline: we show that the following lemmas hold with sufficient probability.
- Lemma 1: s_t does not decrease too quickly.
- Lemma 2: We query labels on a constant fraction of the error region ξ_t.
- Lemma 3: With constant probability, the update is good.
- By the algorithm, at least a 1/R fraction of labels are mistakes, and ∃ R = Õ(1).
- ⇒ We can thus bound labels and total errors in terms of mistakes.
17 - Proof technique
- Lemma 1: s_t is large enough.
- Proof (by contradiction): Let t be the first time s_t falls below the required level.
- Then a halving event has just occurred: we saw R labels with no mistakes, so this history has probability at most (3/4)^R.
- Lemma 1a: For any particular queried label i, this event (no mistake) happens with probability at most 3/4.
18 - Proof technique
- Lemma 1a, proof idea: Using this value of s_t, the band lemma in R^(d−1) gives a constant probability of x′ falling in an appropriately defined band w.r.t. u′, where x′ is the component of x orthogonal to v_t and u′ is the component of u orthogonal to v_t.
[Figure: hypothesis v_t, target u, and the threshold s_t]
19 - Proof technique
- Lemma 2: We query labels on a constant fraction of the error region ξ_t.
- Proof: Assume Lemma 1 for the lower bound on s_t; then apply Lemma 1a and the band lemma.
- Lemma 3: With constant probability, the update is good.
- Proof: Assuming Lemma 1, by Lemma 2 each error is labeled with constant probability. From the mistake-bound proof, each update is good (a multiplicative decrease in error) with constant probability.
- Finally, solve for R: every R labels there is at least one update, or we halve s_t, so there exists R = Õ(1) for which the label bound follows.
20 - Summary of contributions

|                                  | samples            | mistakes           | labels        | total errors  | online? |
| PAC complexity [Long03, Long95]  | Õ(d/ε), Ω(d/ε)     |                    |               |               |         |
| Perceptron [Baum97]              | Õ(d/ε³)            | Ω(1/ε²), Õ(d/ε²)   | Ω(1/ε²)       | Ω(1/ε²)       | yes     |
| QBC [FSST97]                     | Õ((d/ε) log 1/ε)   | Õ(d log 1/ε)       | Õ(d log 1/ε)  |               | no      |
| DKM05                            | Õ((d/ε) log 1/ε)   | Õ(d log 1/ε)       | Õ(d log 1/ε)  | Õ(d log 1/ε)  | yes     |
21 - Conclusions and open problems
- Achieves the optimal label complexity for this problem.
- Unlike QBC, a fully online algorithm.
- A matching bound on total errors (labeled and unlabeled).
- Future work:
- Relax the distributional assumptions.
- Uniform is sufficient but not necessary for the proof.
- Note: this bound is not possible under arbitrary distributions [Dasgupta '04].
- Relax the separability assumption: allow a margin of tolerated error.
- Analyze a margin version, for exponential convergence without dependence on d.