

1
Analysis of greedy active learning
  • Sanjoy Dasgupta
  • UC San Diego

2
Standard learning model
  • Given m labeled points, want to learn a
    classifier with misclassification rate < ε, chosen
    from a hypothesis class H with VC dimension d < ∞.

VC theory: need m to be roughly d/ε, in the
realizable case.
3
Active learning
  • Unlabeled data is easy to come by, but there is a
    charge for each label.

What is the minimum number of labels needed to
achieve the target error rate?
4
Can adaptive querying help?
Simple hypothesis class: threshold functions on
the real line, hw(x) = 1(x ≥ w), H = {hw : w ∈ R}.
Start with m ≈ 1/ε unlabeled points.
Binary search: need just log m labels, from
which the rest can be inferred! An exponential
improvement in sample complexity.
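To make this concrete, here is a minimal Python sketch
of the binary-search active learner. The names
active_learn_threshold and query_label are illustrative
assumptions, not from the slides; query_label stands in
for the annotator who is charged per label.

def active_learn_threshold(xs, query_label):
    """Infer all labels of the sorted pool xs with about
    log2(m) queries, assuming the target is a threshold
    hw(x) = 1(x >= w) for some w. query_label is a
    hypothetical oracle returning the true 0/1 label."""
    lo, hi = 0, len(xs)  # invariant: xs[:lo] are 0, xs[hi:] are 1
    while lo < hi:
        mid = (lo + hi) // 2
        if query_label(xs[mid]) == 1:
            hi = mid       # threshold lies at or before xs[mid]
        else:
            lo = mid + 1   # threshold lies after xs[mid]
    # All m labels are now determined by the threshold's position.
    return [0] * lo + [1] * (len(xs) - lo)

With m ≈ 1/ε points, this asks only about log2(1/ε)
labels: the exponential improvement claimed above.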
5
Binary search
[Figure: binary-search query tree over points x1, …, x8]
m data points: there are effectively m+1
different hypotheses. The query tree has m+1 leaves,
depth ≈ log m.
Question: Is this a general phenomenon? For other
hypothesis classes, is a generalized binary
search possible?
6
Bad news I
H = linear separators in R¹: active learning
reduces sample complexity from m to log m.
But H = linear separators in R²: there are some
target hypotheses for which all m labels need to
be queried! (No matter how benign the input
distribution.)
In this case, learning to accuracy ε requires 1/ε
labels.
7
The benefit of averaging
For linear separators in R²: in the worst case
over target hypotheses, active learning offers no
improvement in sample complexity. But there is a
query tree in which the depths of the O(m²)
target hypotheses are spread almost evenly over
[log m, m]. The average depth is just log m.
Question: does active learning help only in a
Bayesian model?
8
Degrees of Bayesian-ity
Two ways to use a prior, with different stopping
criteria (how far must the remaining version space
be narrowed?):
  • Prior π over hypotheses
  • Pseudo-Bayesian model:
  • the prior is used only to count queries
  • Bayesian model:
  • the prior is used for counting queries and also
    for the generalization bound

9
Effective hypothesis class
Fix a hypothesis class H of VC dimension d < ∞,
and a set of unlabeled examples x1, x2, …, xm,
where m ≈ d/ε. Sauer's lemma: H can label these
points in at most m^d different ways; the
effective hypothesis class Heff = {(h(x1),
h(x2), …, h(xm)) : h ∈ H} has size |Heff| ≤
m^d. Goal (in the realizable case): pick the
element of Heff which is consistent with all the
hidden labels, while asking for just a small
subset of these labels.
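As a hedged illustration (hypotheses represented as
Python callables, an assumption of this sketch since
the slides fix no representation), Heff is just the
set of distinct label vectors on the pool:

def effective_hypothesis_class(H, xs):
    """Distinct labelings induced on the pool xs by
    hypotheses in H. By Sauer's lemma, at most O(m^d)
    of them for VC dimension d."""
    return {tuple(h(x) for x in xs) for h in H}

# Thresholds on the line: m = 4 points yield at most
# m+1 = 5 labelings, however many thresholds we try.
xs = [0.1, 0.4, 0.7, 0.9]
H = [lambda x, w=w: int(x >= w) for w in (0.0, 0.2, 0.5, 0.8, 1.0)]
print(sorted(effective_hypothesis_class(H, xs)))
# (0,0,0,0), (0,0,0,1), (0,0,1,1), (0,1,1,1), (1,1,1,1)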
10
Model of an active learner
Query tree:
Each leaf is annotated with an element of
Heff. Weights π over Heff. Goal: a tree T of
small average depth, Q(T, π) = Σ_h π(h) · depth(h).
(Can also use random coin flips at internal
nodes.)
[Figure: query tree with internal nodes x1?, x6?,
x8?, x3? and leaves h1, …, h6]
Question: in this averaged model, can we always
find a tree of depth o(m)?
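The objective Q(T, π) is straightforward to compute.
This sketch assumes a tuple encoding of the query
tree, ('query', i, left, right) for internal nodes and
('leaf', h) for leaves; the encoding and names are
illustrative.

def average_depth(tree, pi, depth=0):
    """Q(T, pi) = sum_h pi(h) * depth(h): the expected
    number of queries when the target hypothesis is
    drawn from the weights pi."""
    if tree[0] == "leaf":
        return pi[tree[1]] * depth
    _, _, left, right = tree  # ('query', i, left, right)
    return (average_depth(left, pi, depth + 1) +
            average_depth(right, pi, depth + 1))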
11
Bad news II
Pick any d > 0 and m ≥ 2d. There is an input
space of size m and a hypothesis class H of VC
dimension d such that (for uniform π) any active
learning strategy requires m/8 queries on
average. Choose: input space = any x1, …,
xm; H = all concepts which are positive on
exactly d inputs.
12
A revised goal
Depending on the choice of π, the hypothesis
class, perhaps the input distribution, the average
number of labels needed by an optimal active
learner is somewhere in the range [d log m,
m]. Ideal case, d log m: perfect binary
search. Worst case, m: randomly chosen
labels (within constants). Is there a generic
active learning strategy which always achieves
close to the optimal number of queries, no matter
what it might be?
13
Heuristics for active learning
  • A common strategy in many heuristics:
  • Greedy strategy. After seeing t labels, the
    remaining version space is some Ht. Always choose
    the point which most evenly divides Ht, according
    to π-mass (see the sketch below).
  • For instance, Tong-Koller (2000): linear
    separators, with π ∝ volume.

Question: How good is this greedy scheme? And how
does its performance depend on the choice of π?
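Here is a minimal sketch of the greedy rule, assuming
hypotheses are stored as 0/1 label vectors over the
pool (i.e., elements of Heff) and π as a dict of
weights; all names are illustrative.

def greedy_query(version_space, pi, num_points):
    """Index of the unlabeled point whose split of the
    current version space is most even in pi-mass, per
    the greedy rule above."""
    total = sum(pi[h] for h in version_space)
    best_i, best_imbalance = None, float("inf")
    for i in range(num_points):
        pos = sum(pi[h] for h in version_space if h[i] == 1)
        imbalance = abs(2 * pos - total)  # 0 = perfectly even split
        if imbalance < best_imbalance:
            best_i, best_imbalance = i, imbalance
    return best_i

After querying point best_i, the learner keeps only
the hypotheses that agree with the returned label,
and repeats.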
14
Greedy active learning
Choose any π. How does the greedy query tree TG
compare to the optimal tree T*?
Upper bound:
Q(TG, π) ≤ 4 · Q(T*, π) · log(1/min_h π(h)).
Example: for uniform π, the approximation
ratio is log |Heff| ≈ d log m.
Lower bounds:
(1) Uniform π: we have an example in
which Q(TG, π) ≥ Q(T*, π) · Ω(log |Heff| / log
log |Heff|).
(2) Non-uniform π: an example where π
ranges between 1/2 and 1/2^n, and Q(TG, π) ≥
Q(T*, π) · Ω(n).
15
Sub-optimality of greedy scheme
(1) The case of uniform π. There are simple
examples in which the greedy scheme uses Ω(log
n / log log n) times the optimal number of
labels:
(a) the hypothesis class consists of
several clusters;
(b) each cluster is efficiently
searchable;
(c) but first the version space must
be narrowed down to one of these clusters, an
inefficient process.
Invoke this construction
recursively. The optimal strategy reduces entropy
only gradually at first, then ramps it up later;
an over-eager greedy scheme is fooled.
16
Sub-optimality, cont'd
  • (2) The case of general π.
  • For any n ≥ 2:
  • there is a hypothesis class H of size 2n+1 and a
    distribution π over H such that
  • π ranges from 1/2 to 1/2^(n+1),
  • the optimal expected number of queries is < 3,
  • the greedy strategy uses n/2 queries on average.

[Figure: the class H, with π proportional to area]
17
Sub-optimality, cont'd
Three types of queries:
(i) Is the target some h1i?
(ii) Is it some h2i?
(iii) Is it h1j or h2j?
18
Upper bound overview
  • Upper bound: Q(TG, π) ≤ 4 · Q(T*, π) ·
    log(1/min_h π(h)).
  • If the optimal tree is short, then
  • either there is a query which (in expectation)
    cuts off a good chunk of the version space,
  • or some particular hypothesis has high weight.
  • At least in the first case, the greedy scheme
    gets off to a good start; cf. Johnson's argument
    for set cover.

19
Quality of a query
  • Need a notion of query quality which can only
    decrease with time.
  • If S is a version space, and query xi splits it
    into S+, S−, we'll say that xi shrinks (S, π)
    by
  • 2 π(S+) π(S−) / π(S)
    (computed in the sketch below).
  • Claim: If xi shrinks (Heff, π) by Δ, then it
    shrinks (S, π) by at most Δ, for any S ⊆ Heff.
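A small numeric sketch of the shrinkage quantity,
with π given as a dict of hypothesis masses (an
illustrative encoding):

def shrinkage(S_plus, S_minus, pi):
    """Shrinkage of version space S = S_plus + S_minus
    by the query that splits it: 2 pi(S+) pi(S-) / pi(S)."""
    p = sum(pi[h] for h in S_plus)
    q = sum(pi[h] for h in S_minus)
    return 2 * p * q / (p + q)  # pi(S) = pi(S+) + pi(S-)

An even split of a version space of mass 1 shrinks it
by 2 · (1/2)(1/2) / 1 = 1/2, while a very lopsided
split shrinks it by nearly 0.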

20
When is the optimal tree short?
  • Claim: Pick any S ⊆ Heff, and any tree T whose
    leaves include all of S. Then there must be a
    query which shrinks (S, πS) by at least
  • (1 − CP(πS)) / Q(T, πS)
    (see the sketch below).
  • Here:
  • πS is π restricted to S
  • CP(π) = Σ_h π(h)² (collision probability)

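The claim's threshold is a one-liner; here pi_S is
assumed to be a dict of normalized masses over S, and
Q_T the average depth Q(T, πS), both illustrative names:

def collision_probability(pi_S):
    """CP(pi) = sum_h pi(h)^2."""
    return sum(p * p for p in pi_S.values())

def guaranteed_shrinkage(pi_S, Q_T):
    """Per the claim: some query shrinks (S, pi_S) by
    at least (1 - CP(pi_S)) / Q(T, pi_S)."""
    return (1 - collision_probability(pi_S)) / Q_T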
21
Main argument
If the optimal tree has small average depth, then
there are two possible cases.
Case one: there is
some query which shrinks the version space
significantly. In this case, the greedy strategy
will find such a query, and clear progress will be
made. The resulting subtrees, considered
together, will also require few queries.
22
Proof, cont'd
Case two: some classifier h has very high
π-mass. In this case, the version space might
shrink by just an insignificant amount in one
round. But in roughly the number of queries
that the optimal strategy requires for target h,
the greedy strategy will either eliminate h or
declare it to be the answer. In the former
case, by the time h is eliminated, the version
space will have shrunk significantly. These two
cases form the basis of an inductive argument.
23
An open problem
  • Just about the only positive result in active
    learning:
  • [FSST97] Query by committee: if the data
    distribution is uniform over the unit sphere, one
    can learn homogeneous linear separators using just
    O(d log 1/ε) labels.
  • But the minute we allow non-homogeneous
    hyperplanes, the query complexity increases to
    1/ε. What's going on?