Title: Active Learning
1 Active Learning
2 Learning from Examples
- Passive learning
- The learner receives a random set of labeled examples
3 Active Learning
- Active learning
- The learner chooses the specific examples to be labeled
- ⇒ The learner works harder, in order to use fewer examples
4 Membership Queries
- The learner constructs the examples from basic units
- Some problems are only solvable under this setting (e.g., finite automata)
- ⇒ Problem: queries might reach irrelevant regions of the input space
5 Selective Sampling
- Two oracles are available:
- Sample: returns unlabeled examples drawn according to the input distribution
- Label: given an unlabeled example, returns its label
- Query filtering: from the set of unlabeled examples, choose the most informative ones and query for their labels
6 Selecting the Most Informative Queries
- Input: x ~ D (in R^d)
- Concepts: c : X → {0,1}
- Bayesian model: the target concept is drawn from a prior P over the concept class C
- Version space: V_i = V(<x_1, c(x_1)>, ..., <x_i, c(x_i)>), the concepts consistent with the first i labeled examples
7 Selecting the Most Informative Queries
- Instantaneous information gain from the i-th example
8 Selecting the Most Informative Queries
- For the next example x_i
- P_0 = Pr(c(x_i) = 0)
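The gain formulas on slides 7-8 appear to have been lost in the export; a plausible reconstruction from the definitions above (the names I_i and G are introduced here for illustration, not taken from the slides):

I_i = -log( Pr(V_i) / Pr(V_(i-1)) )        (the log-shrinkage of the version space caused by the i-th label)
E[I_i] = G(x_i) = H(P_0) = -P_0·log(P_0) - (1-P_0)·log(1-P_0)

G(x_i) is maximized at P_0 = 1/2, i.e. for examples that split the version space into two equally probable halves.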
9 Example
- X = [0, 1]
- W ~ U[0, 1] (a threshold concept: c_w(x) = 1 iff x ≥ w)
- V_i = [the max x value labeled 0, the min x value labeled 1]
10 Example - Expected Prediction Error
- The final predictor's error is proportional to the length of the VS segment
- Both W_final and W_target are selected uniformly from the VS (density 1/L)
- The error of each such pair is |W_final - W_target|
- ⇒ Using n random examples, the expected error is on the order of 1/n
- But by querying points that cut the VS in the middle, the expected error decreases exponentially in the number of queries
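A small simulation can make this comparison concrete. The sketch below (function names, trial counts, and the target threshold are illustrative, not from the slides) estimates the expected error of the Gibbs-style final predictor for random sampling versus midpoint (bisection) queries on the threshold class:

import random

def passive_error(w_target, n, trials=2000):
    """Expected error of a passive learner that sees n random labeled points.
    The VS is the interval between the largest point labeled 0 and the
    smallest point labeled 1; W_final is drawn uniformly from it."""
    total = 0.0
    for _ in range(trials):
        xs = [random.random() for _ in range(n)]
        lo = max([x for x in xs if x < w_target], default=0.0)   # max x labeled 0
        hi = min([x for x in xs if x >= w_target], default=1.0)  # min x labeled 1
        total += abs(random.uniform(lo, hi) - w_target)
    return total / trials

def active_error(w_target, n_queries, trials=2000):
    """Expected error after n_queries bisection queries (always query the
    midpoint of the current version space)."""
    total = 0.0
    for _ in range(trials):
        lo, hi = 0.0, 1.0
        for _ in range(n_queries):
            mid = (lo + hi) / 2
            if mid < w_target:   # label 0 -> threshold lies to the right
                lo = mid
            else:                # label 1 -> threshold lies to the left
                hi = mid
        total += abs(random.uniform(lo, hi) - w_target)
    return total / trials

if __name__ == "__main__":
    w = 0.37  # arbitrary target threshold
    for n in (5, 10, 20):
        print(n, round(passive_error(w, n), 4), round(active_error(w, n), 6))

The passive error shrinks roughly like 1/n, while the bisection error shrinks roughly like 2^-n, matching the slide's claim.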
11 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
- Uses the oracles:
- Gibbs(V, x)
- h ← rand_P(V)  (draw a hypothesis from V according to the prior P)
- Return h(x)
- Sample()
- Label(x)
12 Query by Committee (Seung, Opper & Sompolinsky 1992; Freund, Seung, Shamir & Tishby)
- While (t < T_n)
- x = Sample()
- y1 = Gibbs(V, x)
- y2 = Gibbs(V, x)
- If (y1 ≠ y2) then
- Label(x) (and use it to learn and to get the new VS)
- t = 0
- Update T_n
- Else t = t + 1
- endif
- End
- Return Gibbs(V_n, x)
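As a concrete illustration, here is a minimal runnable version of this loop on the 1-D threshold class from slide 9. It is a sketch under stated assumptions: the fixed patience t_max stands in for the growing bound T_n, and all names are illustrative.

import random

def gibbs(version_space, x):
    """Gibbs oracle: draw a hypothesis uniformly from the version space
    (here an interval of thresholds) and return its prediction on x."""
    lo, hi = version_space
    w = random.uniform(lo, hi)
    return int(x >= w)

def qbc(w_target, t_max=30):
    """Query-by-Committee on thresholds: stream points from U[0,1], query a
    label only when two Gibbs hypotheses disagree, stop after t_max
    consecutive un-queried examples."""
    vs = (0.0, 1.0)          # version space: interval of consistent thresholds
    t, n_samples, n_labels = 0, 0, 0
    while t < t_max:
        x = random.random()               # Sample()
        n_samples += 1
        if gibbs(vs, x) != gibbs(vs, x):  # the two-member committee disagrees
            y = int(x >= w_target)        # Label(x)
            n_labels += 1
            lo, hi = vs                   # shrink the version space
            vs = (x, hi) if y == 0 else (lo, x)
            t = 0
        else:
            t += 1
    return vs, n_samples, n_labels

if __name__ == "__main__":
    vs, m, k = qbc(w_target=0.42)
    print(f"version space {vs}, {m} samples seen, {k} labels queried")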
13 QBC Finds Better Queries than Random
- The probability of querying an example x which divides the VS into fractions F and 1-F is 2F(1-F)
- Reminder: the information gain of such an example is H(F)
- But this alone is not enough
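The 2F(1-F) factor follows directly from the two independent Gibbs draws:

Pr(query x) = Pr(y1 ≠ y2) = F·(1-F) + (1-F)·F = 2F(1-F)

which is maximized at F = 1/2, exactly where the information gain H(F) is also largest; so QBC preferentially queries balanced splits.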
14 Example
- W is in [0,1]^2
- X is a line parallel to one of the axes
- The error is proportional to the perimeter of the VS rectangle
- ⇒ Queries can keep halving the rectangle's area while its perimeter, and hence the error, barely shrinks
15
- If, for a concept class C:
- VCdim(C) < ∞
- The expected information gain of queries made by QBC is uniformly lower bounded by some g > 0
- Then, with probability larger than 1-δ over the target concept, the sequence of examples, and the choices made by QBC:
- N_Sample is bounded
- N_Label is proportional to log(N_Sample)
- The error probability of Gibbs(V_QBC, x) is < ε
16 QBC will Always Stop
- The cumulative information gain of all the samples (I_samples) grows more slowly as the number of samples m grows (proportional to d·log(em/d))
- The information gain from queries (I_queries) is lower bounded (by g per query) and thus grows linearly in the number of queries
- I_samples ≥ I_queries
- ⇒ The time between two query events grows exponentially
- ⇒ The algorithm will pass the T_n bound and stop
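Putting the two growth rates together (a sketch, with g as on slide 15 and k the number of queries among the first m samples):

g·k ≤ I_queries ≤ I_samples ≤ d·log(em/d)
⇒ k ≤ (d/g)·log(em/d)

so the number of labels grows only logarithmically in the number of samples, and the gaps between consecutive queries must grow exponentially.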
18 I_samples
- Cumulative information gain
- The expected cumulative info. gain
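The equations on this slide were lost in the export; a plausible reconstruction from the version-space definitions above (V_i and m as before, assuming Pr(V_0) = 1):

I_samples(m) = Σ_(i=1..m) -log( Pr(V_i) / Pr(V_(i-1)) ) = -log Pr(V_m)

and the expected cumulative gain is the expectation of -log Pr(V_m) over the random sample and the target concept.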
19 I_samples
- Sauer's Lemma: the number of different label assignments to m examples is at most (em/d)^d
- The uniform distribution over n outcomes has the maximum entropy
- ⇒ The max expected cumulative info. gain is d·log(em/d)
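The bound follows by taking the entropy of a uniform distribution over all achievable labelings (here d is VCdim(C) and m the number of samples):

max expected cumulative info gain ≤ log( (em/d)^d ) = d·log(em/d)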
20 The Error Probability
- Definition: Pr(h(x) ≠ c(x)), where h and c are drawn independently from the prior P restricted to the VS
- This is exactly the probability of querying a sample in QBC (two Gibbs draws disagree)
- ⇒ This is the stopping condition in QBC
21 Before We Go Further
- The basic intuition: gain more information by choosing examples that cut the VS into parts of similar size
- This condition alone is not sufficient
- If there exists a lower bound on the expected info. gain, QBC will work
- The error bound in QBC is based on the analogy between the error definition and the Gibbs disagreement probability, not on the VS cutting
22 But in Practice
- Proved for linear separators, if the sample-space and VS distributions are uniform
- Is the setting realistic?
- Implementation of Gibbs by sampling from convex bodies
23 Kernel QBC
24 What about Noise?
- In practice labels might be noisy
- Active learners are sensitive to noise, since they try to minimize redundancy
25 Noise Tolerant QBC
- do
- Let x be a random instance
- θ1 = rand(posterior)
- θ2 = rand(posterior)
- If argmax_y p(y|x, θ1) ≠ argmax_y p(y|x, θ2) then
- ask for the label of x
- update the posterior
- Until no labels were requested for t consecutive instances
- Return rand(posterior)
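A minimal runnable sketch of this loop, assuming a toy parametric model (thresholds on a grid with a known label-flip rate ETA); the grid, noise rate, and patience are illustrative choices, not from the slides:

import random

GRID = [i / 100 for i in range(101)]   # candidate thresholds theta
ETA = 0.1                              # known label-noise rate

def likelihood(theta, x, y):
    clean = int(x >= theta)
    return 1 - ETA if y == clean else ETA

def predict(theta, x):
    return int(x >= theta)             # argmax_y p(y | x, theta)

def sample_posterior(posterior):
    return random.choices(GRID, weights=posterior, k=1)[0]

def noise_tolerant_qbc(true_theta=0.42, patience=50):
    posterior = [1.0 / len(GRID)] * len(GRID)     # uniform prior over thetas
    quiet = 0
    while quiet < patience:
        x = random.random()                       # a random instance
        t1, t2 = sample_posterior(posterior), sample_posterior(posterior)
        if predict(t1, x) != predict(t2, x):      # the committee disagrees
            clean = int(x >= true_theta)
            y = clean if random.random() > ETA else 1 - clean   # noisy label
            weights = [p * likelihood(th, x, y) for p, th in zip(posterior, GRID)]
            z = sum(weights)
            posterior = [w / z for w in weights]  # Bayesian posterior update
            quiet = 0
        else:
            quiet += 1
    return sample_posterior(posterior)            # Return rand(posterior)

if __name__ == "__main__":
    print("estimated threshold:", noise_tolerant_qbc())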
26 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
- Setting: pool-based active learning
- Aim: fast reduction of the VS's size
- Identifying the query that (approximately) halves the VS:
- Simple Margin: choose as the next query the pool point closest to the current separator, min |w_i · Φ(x)| (see the sketch after this list)
- MaxMin Margin: choose the point maximizing min(m+, m-), the margins obtained by labeling it +1 or -1, to get a more balanced split
- Ratio Margin: choose the point whose two margins are most nearly equal, to get an equal split
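A hedged, self-contained sketch of the Simple Margin strategy; synthetic 2-D blobs stand in for the text data, scikit-learn's SVC provides the SVM, and the names and query budget are illustrative, not the authors' code:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic pool: two Gaussian blobs (a stand-in for the document vectors).
pool = np.vstack([rng.normal(-1.0, 1.0, (200, 2)), rng.normal(1.0, 1.0, (200, 2))])
labels = np.array([0] * 200 + [1] * 200)      # oracle labels, hidden from the learner

queried = [0, 200]                            # seed with one example of each class

for _ in range(20):                           # label budget
    clf = SVC(kernel="linear")
    clf.fit(pool[queried], labels[queried])
    # Simple Margin: query the unlabeled pool point closest to the hyperplane.
    dist = np.abs(clf.decision_function(pool))
    dist[queried] = np.inf                    # never re-query a labeled point
    queried.append(int(np.argmin(dist)))

clf = SVC(kernel="linear").fit(pool[queried], labels[queried])
print(f"{len(queried)} labels queried, pool accuracy {clf.score(pool, labels):.3f}")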
27 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
- The VS in SVM is a set of unit weight vectors (a region on the unit sphere in W)
- (The data must be separable in the feature space)
- Points in F ↔ hyperplanes in W (duality)
28 SVM Active Learning with Applications to Text Classification (Tong & Koller, 2001)
29 Results
- Reuters and newsgroups data
- Each document is represented by a ~10^5-dimensional vector of word frequencies
30 Results
31 Results
32 What's Next
- Theory meets practice
- New methods (other than cutting the VS)
- Generative setting (committee-based sampling for training probabilistic classifiers, Dagan & Engelson, 1995)
- Interesting applications
33 Thank You!