Coarse sample complexity bounds for active learning
1
Coarse sample complexity bounds for active
learning
  • Sanjoy Dasgupta
  • UC San Diego

2
Supervised learning
  • Given access to labeled data (drawn iid from an
    unknown underlying distribution P), we want to learn
    a classifier, chosen from a hypothesis class H, with
    misclassification rate < ε.

Sample complexity is characterized by d, the VC
dimension of H. If the data is separable, roughly
d/ε labeled samples are needed.
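As a point of reference, the standard realizable-case PAC bound behind the "roughly d/ε" figure (the δ term is the usual confidence parameter, not stated on the slide):

```latex
% Realizable (separable) PAC bound for a class of VC dimension d:
% with probability at least 1 - \delta, error < \epsilon after
m \;=\; O\!\left(\frac{1}{\epsilon}\Bigl(d\,\log\frac{1}{\epsilon} + \log\frac{1}{\delta}\Bigr)\right)
% labeled examples -- i.e. roughly d/\epsilon, up to log factors.
```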
3
Active learning
  • In many situations, such as speech recognition
    and document retrieval, unlabeled data is easy to
    come by, but there is a charge for each label.

What is the minimum number of labels needed to
achieve the target error rate?
4
Our result
  • A parameter which coarsely characterizes the
    label complexity of active learning in the
    separable setting

5
Can adaptive querying really help?
[CAL92, D04] Threshold functions on the real
line: h_w(x) = 1(x ≥ w), H = {h_w : w ∈ R}

[Figure: the real line, labeled − to the left of the threshold w and + to its right]

Start with 1/ε unlabeled points.
Binary search: need just log 1/ε labels, from
which the rest can be inferred! An exponential
improvement in sample complexity.
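A minimal sketch of this binary-search scheme, under the slide's separability assumption (function and variable names are mine; `label` stands in for the human annotator):

```python
def active_learn_threshold(points, label):
    """Active learner for thresholds h_w(x) = 1(x >= w) on separable data.

    points: ~1/eps unlabeled draws from P; label: oracle for the true label.
    Binary search finds the leftmost positive point with O(log 1/eps)
    queries; the labels of all remaining points are then inferred for free.
    """
    xs = sorted(points)
    lo, hi = 0, len(xs)              # invariant: boundary lies in xs[lo:hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1
        if label(xs[mid]) == 1:      # mid is right of w: discard right half
            hi = mid
        else:                        # mid is left of w: discard left half
            lo = mid + 1
    w_hat = xs[lo] if lo < len(xs) else float("inf")
    return w_hat, queries
```

For example, `active_learn_threshold(xs, lambda x: int(x >= 0.37))` recovers the boundary between the last − point and the first + point after about log2(len(xs)) queries.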
6
More general hypothesis classes
For a general hypothesis class with VC dimension
d, is a generalized binary search possible?

Random choice of queries: d/ε labels
Perfect binary search: d log 1/ε labels

Where in this large range does the label
complexity of active learning lie? We've already
handled linear separators in 1-d.
7
Linear separators in R2
For linear separators in R1, we need just log 1/ε
labels. But when H = linear separators in R2,
some target hypotheses require 1/ε labels!

Consider any distribution over the circle in R2.
We need 1/ε labels to distinguish between h_0, h_1,
h_2, …, h_{1/ε}!
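To make the construction concrete, here is a sketch of such a hard family (the angle parameterization and all names are mine): each h_i is positive only on its own arc of probability mass ε, so a single label query can eliminate at most one candidate, forcing ~1/ε queries in the worst case.

```python
import math

def bad_hypotheses(eps):
    """Hard family for the circle: h_i labels only the i-th arc positive.

    Points are angles in [0, 2*pi); arc i has probability mass eps under
    the uniform distribution. A query outside arc i returns 0 under h_i,
    so each label reveals information about at most one hypothesis.
    """
    k = int(1 / eps)
    def make(i):
        lo, hi = 2 * math.pi * eps * i, 2 * math.pi * eps * (i + 1)
        return lambda x: int(lo <= x < hi)
    return [make(i) for i in range(k)]
```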
8
A fuller picture
For linear separators in R2: some bad target
hypotheses require 1/ε labels, but most require
just O(log 1/ε) labels.

[Figure: the space of hypotheses, mostly a 'good' region with small 'bad' regions]
9
A view of the hypothesis space
H = linear separators in R2

[Figure: the hypothesis space drawn as a circle; small 'bad' regions around the all-positive and all-negative hypotheses, a 'good' region everywhere else]
10
Geometry of hypothesis space
H = any hypothesis class of VC dimension d < ∞.
P = underlying distribution of the data.

[Figure: the space H with two hypotheses h, h′ marked]

(i) Non-Bayesian setting: no probability measure on H.
(ii) But there is a natural (pseudo)metric:
d(h, h′) = P(h(x) ≠ h′(x)).
(iii) Each point x defines a cut through H.
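Because d(h, h′) compares only the two hypotheses' predictions, it can be estimated from unlabeled data alone; a minimal sketch (names are mine):

```python
def estimate_distance(h1, h2, unlabeled):
    """Monte Carlo estimate of the pseudometric d(h1, h2) = P(h1(x) != h2(x)).

    Uses only unlabeled draws from P: no true labels are ever consulted,
    which is what makes the geometry of H accessible to an active learner.
    """
    disagree = sum(h1(x) != h2(x) for x in unlabeled)
    return disagree / len(unlabeled)
```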
11
The learning process
[Figure: the space H, shrinking toward the target hypothesis h_0]

(h_0 = target hypothesis.) Keep asking for labels
until the diameter of the remaining version space
is at most ε.
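Schematically, for a finite pool of hypotheses, the process is the loop below (a sketch under my own assumptions: it reuses `estimate_distance` from above and leaves the query-selection rule `pick_query` abstract, since choosing queries well is exactly what the searchability index of the next slides measures):

```python
from itertools import combinations

def active_learn(version_space, pool, label, eps, pick_query, unlabeled):
    """Generic version-space active learner (schematic; finite H assumed).

    Query labels and discard inconsistent hypotheses until the surviving
    version space has diameter at most eps under the pseudometric d.
    """
    def diameter(hs):
        return max((estimate_distance(g, h, unlabeled)
                    for g, h in combinations(hs, 2)), default=0.0)

    while diameter(version_space) > eps:
        x = pick_query(version_space, pool)   # choose an informative point
        y = label(x)                          # one query to the oracle
        version_space = [h for h in version_space if h(x) == y]
    return version_space[0]  # separable case: any survivor is eps-close to h_0
```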
12
Searchability index
Inputs: accuracy ε, data distribution P, amount
of unlabeled data.

Each hypothesis h ∈ H has a searchability index ρ(h).

ε ≤ ρ(h) ≤ 1; bigger is better.

Example: linear separators in R2, data on a circle.

[Figure: the hypothesis space as a circle, annotated with ρ values: 1/2 at the most balanced hypotheses, falling through 1/3, 1/4, 1/5 toward ε near the all-positive and all-negative hypotheses]

ρ(h) ∝ min(positive mass of h, negative mass of h),
but never < ε.
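In symbols, the rule of thumb read off this picture (the proportionality constant is left implicit):

```latex
\rho(h) \;\asymp\; \max\Bigl(\epsilon,\ \min\bigl(P(h = +),\, P(h = -)\bigr)\Bigr)
```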
13
Searchability index
Inputs: accuracy ε, data distribution P, amount
of unlabeled data.

Each hypothesis h ∈ H has a searchability index ρ(h).

The searchability index lies in the range ε ≤ ρ(h) ≤ 1.

Upper bound: there is an active learning scheme
which identifies any target hypothesis h ∈ H
(within accuracy ε) with a label complexity of
at most [formula].

Lower bound: for any h ∈ H, any active learning
scheme for the neighborhood B(h, ρ(h)) has a
label complexity of at least [formula].

When ρ(h) ≫ ε, active learning helps a lot.
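The two formulas on this slide were images and did not survive extraction; as a hedged reconstruction consistent with the talk's own examples (ρ = 1/2 giving log 1/ε labels for thresholds, ρ(h) = ε giving 1/ε labels on the circle), they scale, up to polylog factors, as:

```latex
\text{upper bound:}\quad \tilde{O}\!\left(\frac{d}{\rho(h)}\,\log\frac{1}{\epsilon}\right)
\qquad\qquad
\text{lower bound:}\quad \Omega\!\left(\frac{1}{\rho(h)}\right)
```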
14
Linear separators in Rd
Previous sample complexity results for active
learning have focused on the following case:

H = homogeneous (through the origin) linear separators in Rd
Data distributed uniformly over the unit sphere

1. Query by committee [SOS92, FSST97]: Bayesian
setting; average-case over target hypotheses
picked uniformly from the unit sphere.
2. Perceptron-based active learner [DKM05]:
non-Bayesian setting; worst-case over target
hypotheses.

In either case, just O(d log 1/ε) labels are needed!
15
Example: linear separators in Rd
H = homogeneous linear separators in Rd,
P = uniform distribution.

ρ(h) is the same for all h, and is 1/8.

This sample complexity is realized by many schemes:
[SOS92, FSST97] Query by committee
[DKM05] Perceptron-based active learner
Simplest of all, [CAL92]: pick a random point
whose label is not completely certain (with
respect to the current version space).

The label complexity is O(d log 1/ε), as before.
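The [CAL92] rule drops directly into the `pick_query` slot of the earlier loop; a minimal version (names are mine):

```python
import random

def cal_pick_query(version_space, pool):
    """[CAL92]-style rule: query a random point whose label is still
    uncertain, i.e. on which the current version space disagrees.
    Points labeled unanimously by the version space carry no information.
    Assumes at least one uncertain point remains in the pool."""
    uncertain = [x for x in pool
                 if len({h(x) for h in version_space}) > 1]
    return random.choice(uncertain)
```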
16
Linear separators in Rd
The uniform distribution on the sphere is
concentrated near the equator (any equator).

[Figure: sphere with mass concentrated in a band around an equator]
17
Linear separators in Rd
Instead: a distribution P with a different
vertical marginal.

Say that for λ ≥ 1, U(x)/λ ≤ P(x) ≤ λ·U(x)
(U = uniform).

[Figure: sphere with + and − regions under the reweighted distribution]

Result: ρ ≥ 1/32, provided the amount of
unlabeled data grows by a factor of λ.

Do the schemes [CAL92, SOS92, FSST97, DKM05]
achieve this label complexity?
18
What next
  • Make this algorithmic!
  • 1. Linear separators: is some kind of querying
    near the current boundary a reasonable
    approximation?
  • 2. Nonseparable data: need a robust base learner!

[Figure: labeled points on either side of the true boundary]
19
Thanks
For helpful discussions: Peter Bartlett, Yoav
Freund, Adam Kalai, John Langford, Claire Monteleoni.
20
Star-shaped configurations
[Figure: data space and hypothesis space side by side]

In the vicinity of the bad hypothesis h_0, the
hypothesis space has a star structure: h_0 sits at
the center, close to each of h_1, h_2, h_3, …,
h_{1/ε}, which disagree with it on essentially
disjoint regions, so each query can rule out at
most one spoke of the star.
21
Example: the 1-d line
Searchability index lies in the range ε ≤ ρ(h) ≤ 1.
Theorem: [formula] labels needed.

Example: threshold functions on the line.
Result: ρ = 1/2 for any target hypothesis and
any input distribution.
22
Linear separators in Rd
Data lies on the rim of two slabs, distributed
uniformly.

[Figure: two parallel slabs through the origin, data on their rims]

Result: ρ = Θ(1) for most target hypotheses, but
ρ = ε for the hypothesis that makes one slab +,
the other − (the most natural one)!