Title: Active Learning, Experimental Design
1. Active Learning, Experimental Design
- CS294 Practical Machine Learning
- December 4, 2006
- Barbara Engelhardt
2. Problem Setup
- Unlabeled data are available, but labels are expensive
- We would like to choose which data to label
  - to maximize the value of that data to our problem
  - to minimize the cost of labeling
- Today's lecture covers algorithms to solve this problem and ways to measure the value of data
3. Toy Example: threshold function
[Figure: unlabeled data points along a line]
- Unlabeled data: labels are all 0, then all 1 (left to right)
- Classifier is a threshold function: h_w(x) = 1 if x > w, 0 otherwise
- Goal: find the transition between the 0 and 1 labels in the minimum number of steps
- Naive method: choose points to label at random on the line
- Better method: binary search for the transition between 0 and 1 (sketch below)
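A minimal sketch of the toy example: locating the threshold by binary search over the sorted unlabeled points. The data, hidden threshold, and pool size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, 1.0, size=100))   # unlabeled points on a line
w_true = 0.62                                    # hidden threshold (unknown to the learner)
label = lambda x: int(x > w_true)                # the expensive labeling oracle

# Binary search: each query halves the interval that can contain the transition.
lo, hi, n_queries = 0, len(xs) - 1, 0
while lo < hi:
    mid = (lo + hi) // 2
    n_queries += 1
    if label(xs[mid]) == 0:
        lo = mid + 1          # transition lies to the right of xs[mid]
    else:
        hi = mid              # transition lies at or to the left of xs[mid]

print(f"transition located at index {lo} after {n_queries} labels "
      f"(~log2(100) = 7, versus ~100 labels for random querying)")
```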
4. Example: Sequencing genomes
- What genome should be sequenced next?
- Criteria for selection?
  - Optimal species to detect functional elements across genomes
  - Breadth of species encompassing the biological phenomena of interest
  - (Not the same as the most diverged set of species)
- Marsupials should be sequenced next
McAuliffe et al., 2004
5. Example: collaborative filtering
- Users rate only a few movies; ratings are usually expensive
- Which movies do you show users to best extrapolate their movie preferences?
- Also known as questionnaire design
- Baseline questionnaires
  - Random: m movies chosen at random
  - Most Popular Movies: the m most frequently rated movies
- The most-popular-movies design is no better than the random design!
  - Popular movies are rated highly by all users and do not discriminate tastes
Yu et al. 2006
6. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
7. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
8. Entropy Function
- A measure of the information in a random event X with possible outcomes x1, ..., xn (sketch below)
- Comments on the entropy function
  - Entropy of an event is zero when the outcome is known
  - Entropy is maximal when all outcomes are equally likely
  - The average minimum number of yes/no questions needed to determine the outcome (connection to binary search)
H(X) = -Σi p(xi) log2 p(xi)
Shannon, 1948
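A small numpy sketch of the entropy function H(X) = -Σi p(xi) log2 p(xi); the two example distributions are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (zero-probability outcomes are ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # 0 * log(0) is taken to be 0
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 : the outcome is known
print(entropy([0.25] * 4))                # 2.0 : maximal for 4 equally likely outcomes
```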
9. Kullback-Leibler divergence
- P is the true distribution; the distribution Q is used to encode the data instead of P
- KL divergence is the expected extra message length per datum that must be transmitted when using Q
- A measure of how wrong Q is with respect to the true distribution P
DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi))
            = -Σi P(xi) log Q(xi) + Σi P(xi) log P(xi)
            = H(P,Q) - H(P)
            = cross-entropy - entropy
10. KL divergence properties
- Non-negative: D(P || Q) ≥ 0
- Divergence is 0 if and only if P and Q are equal
  - D(P || Q) = 0 iff P = Q
- Non-symmetric: D(P || Q) ≠ D(Q || P)
- Does not satisfy the triangle inequality
  - D(P || Q) ≤ D(P || R) + D(R || Q) need not hold (sketch of the divergence and its asymmetry below)
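A minimal sketch of DKL(P || Q) = Σi P(xi) log(P(xi)/Q(xi)) and its asymmetry; the two example distributions are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

P = np.array([0.5, 0.4, 0.1])
Q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(P, Q))    # >= 0
print(kl_divergence(Q, P))    # differs from D(P||Q): KL divergence is not symmetric
print(kl_divergence(P, P))    # 0 exactly when the distributions are equal
```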
11. KL divergence as gain
- The KL divergence between posteriors measures the amount of information gain expected from a query (where x', θ' are the data and parameters after the query)
- Goal: choose a query that maximizes the KL divergence between posterior and prior
- Basic idea: the largest KL divergence between the updated posterior and the current posterior represents the largest gain (sketch below)
D( p(θ' | x') || p(θ | x) )
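A minimal sketch of choosing a query by expected KL gain, using the earlier toy threshold setting: a discrete grid of candidate thresholds θ with a uniform prior, where querying x returns the label y = 1[x > θ]. The grid and the candidate queries are illustrative assumptions.

```python
import numpy as np

thetas = np.linspace(0.0, 1.0, 101)               # candidate thresholds θ
prior = np.full(len(thetas), 1.0 / len(thetas))   # uniform prior p(θ)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def expected_gain(x, prior):
    """E_y[ D( p(θ | y, x) || p(θ) ) ] for the deterministic label y = 1[x > θ]."""
    gain = 0.0
    for y in (0, 1):
        lik = (thetas < x).astype(float) if y == 1 else (thetas >= x).astype(float)
        p_y = np.sum(lik * prior)                 # predictive probability of response y
        if p_y == 0:
            continue
        posterior = lik * prior / p_y
        gain += p_y * kl(posterior, prior)
    return gain

candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
print({x: round(expected_gain(x, prior), 3) for x in candidates})
# The mid-point query x = 0.5 gives the largest expected gain, matching binary search.
```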
12. Loss Functions
- A function L that maps an event to a real number representing the cost or regret associated with that event
- E.g., in regression problems, L(y, θ^T f(x)) maps to the reals
- Examples (sketch below)
  - Quadratic (least squares) loss
  - Linear (absolute value) loss
  - 0-1 (binary) loss
  - Exponential loss
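A short sketch of the listed loss functions; writing the 0-1 and exponential losses in terms of the margin y * f is an assumption about the intended classification setting.

```python
import numpy as np

def quadratic_loss(y, f):     # least squares
    return (y - f) ** 2

def absolute_loss(y, f):      # linear / absolute value
    return np.abs(y - f)

def zero_one_loss(y, f):      # binary: y in {-1, +1}, predict sign(f)
    return float(np.sign(f) != y)

def exponential_loss(y, f):   # as used in boosting, y in {-1, +1}
    return np.exp(-y * f)

y, f = 1.0, 0.3
print(quadratic_loss(y, f), absolute_loss(y, f), zero_one_loss(y, f), exponential_loss(y, f))
```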
13. Risk Function
- Risk is also known as expected loss
- The (frequentist) risk function is explicitly the expected loss over the data distribution
- Bayes risk is defined as the posterior expected loss
- Trade-off: Bayes risk performs well when p(θ | x) is accurate
- Gain here: choose x to minimize expected loss
R(θ, X) = Σx L(θ, x) p(x | θ)      (frequentist risk)
R(θ, X) = Σθ L(θ, x) p(θ | x)      (Bayes risk)
[Figure: loss function and posterior p(θ | x) plotted over X]
14. Minimax loss
- Wald's (1950) alternative: minimize the maximum (expected) loss
- Assume the response x is the worst-case scenario (gives the greatest expected loss)
- Our problem can be thought of as maximizing the minimum gain (maximin); see the sketch below
Minimax(X, θ) = min_θ max_x Loss(X, θ)
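A small numpy sketch of posterior expected loss (Bayes risk) and the minimax choice on a discrete grid; the loss table and posterior are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# loss[i, j] = L(θ_i, decision_j): rows are parameter values, columns are decisions
loss = np.array([[0.0,  1.0],
                 [10.0, 2.0]])
posterior = np.array([0.9, 0.1])      # p(θ | data)

bayes_risk = posterior @ loss         # expected loss of each decision under the posterior
worst_case = loss.max(axis=0)         # maximum loss of each decision over θ

# The two criteria can disagree: Bayes risk favors decision 0, minimax favors decision 1.
print("Bayes-optimal decision:", np.argmin(bayes_risk), bayes_risk)
print("Minimax decision:      ", np.argmin(worst_case), worst_case)
```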
15. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
16. What is Active Learning?
- Unlabeled data are readily available; labels are expensive
- Want to use adaptive decisions to choose which labels to acquire for a given dataset
- Goal: an accurate classifier at minimal cost
17. Active learning warning
- The choice of data is only as good as the model itself
- Assume a linear model; then two data points are sufficient
- What happens when the data are not linear?
18. Active Learning
- An active learner is able to query the world and receive a response before outputting a classifier
- The learner selects queries (but cannot influence the response)
- Two general methods
  - Select the most uncertain data given the model and parameters
  - Select the most informative data to optimize expected gain
- Given a model M with parameters θ and a loss function L
- A query q with response x updates the model posterior to θ'
L(θ, X) = E_x[ L(θ') ]
19. Query by Committee
- Maintain a prior distribution over hypotheses
- Sample a set of classifiers from the distribution
- Query an example based on the degree of disagreement between the committee of classifiers (sketch below)
[Figure: data points on a line with three sampled committee classifiers A, B, C disagreeing near the boundary]
Seung et al. 1992, Freund et al. 1997
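A minimal sketch of query-by-committee for the 1-D threshold setting: sample a committee of thresholds consistent with the labels seen so far, then query the pool point the committee disagrees about most (vote entropy). The data and committee size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
pool = np.sort(rng.uniform(0, 1, 50))                 # unlabeled pool
labeled_x = np.array([0.05, 0.95])                    # a few labels already acquired
labeled_y = np.array([0, 1])

# Version space for a threshold classifier: any w between the largest x labeled 0
# and the smallest x labeled 1 is consistent with the data seen so far.
lo = labeled_x[labeled_y == 0].max()
hi = labeled_x[labeled_y == 1].min()
committee = rng.uniform(lo, hi, size=11)              # sample 11 consistent hypotheses

votes_for_1 = (pool[None, :] > committee[:, None]).mean(axis=0)   # fraction voting label 1

def vote_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

query_idx = np.argmax(vote_entropy(votes_for_1))
print("query the point x =", pool[query_idx])         # lies where the committee is split
```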
20. Query by Committee: Application
- Used a naïve Bayes model for text classification in a Bayesian learning setting (20 Newsgroups dataset)
McCallum & Nigam, 1998
21. Uncertainty Sampling
- Query the event that the current classifier is most uncertain about (sketch below)
- Used trivially in SVMs, graphical models, etc.
[Figure: if uncertainty is measured as Euclidean distance, the query is the point closest to the current decision boundary]
Lewis & Gale, 1994
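A minimal sketch of uncertainty sampling with a probabilistic classifier: query the pool point whose predicted probability is closest to 0.5. It uses scikit-learn's LogisticRegression; the synthetic data are an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_labeled = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y_labeled = np.array([0] * 10 + [1] * 10)
X_pool = rng.uniform(-4, 4, (200, 2))                 # unlabeled pool

clf = LogisticRegression().fit(X_labeled, y_labeled)
p1 = clf.predict_proba(X_pool)[:, 1]                  # P(y = 1 | x) for each pool point

query_idx = np.argmin(np.abs(p1 - 0.5))               # most uncertain point
print("query:", X_pool[query_idx], "with P(y=1) =", round(p1[query_idx], 3))
```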
22. Information-based Loss Function
- Maximize the KL divergence between posterior and prior
- Maximize the reduction in model entropy between posterior and prior
- Minimize the cross-entropy between posterior and prior
- Many other possibilities
- All of these are notions of information gain
- All of these can be extended to optimal design algorithms
- Must decide how to handle uncertainty about the query response and the model parameters
MacKay, 1992
23. Response Surface Methods
- Estimate the effects of, and interactions between, local changes to the experiments
- Given a set of data points, interpolate a local surface
- (This local surface is called the response surface)
- Hill-climb on the response surface to find the next x
- Use the next x to interpolate the subsequent response surface
- Note: this is the original model for which optimal experimental designs were developed
Kiefer, 1959
24. Example: Response Surface
- Goal: approximate the function f(c) = score(minimize(c)); see the sketch below
- 1. Fit a smoothed response surface to the data points
- 2. Minimize the response surface to find a new candidate
- 3. Use the method to find a nearby local minimum of the score function
- 4. Add the candidate to the data points
- 5. Re-fit the surface, repeat
Blum, unpublished
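A small sketch of the response-surface loop: fit a smooth (quadratic) surface to the scored points, minimize the surface to propose the next candidate, score it, and refit. The 1-D score function and starting points are illustrative assumptions.

```python
import numpy as np

def score(c):                                # expensive black-box score we want to minimize
    return (c - 0.7) ** 2 + 0.1 * np.sin(8 * c)

cs = [0.0, 0.5, 1.0]                         # initial design points
ys = [score(c) for c in cs]
grid = np.linspace(0, 1, 501)

for _ in range(5):
    coeffs = np.polyfit(cs, ys, deg=2)       # fit a quadratic response surface
    surface = np.polyval(coeffs, grid)
    c_next = grid[np.argmin(surface)]        # minimize the surface to get a candidate
    cs.append(c_next)                        # run the "experiment" and add the point
    ys.append(score(c_next))

best = cs[int(np.argmin(ys))]
print("best candidate so far:", round(best, 3), "score:", round(min(ys), 4))
```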
25. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
26. What is Experimental Design?
- Choose among a menu of possible experiments x
- Each experiment has an error e_i
- The goal is to select experiments in order to optimize a specific design criterion
- Equivalently, the goal is to choose experiments that minimize the error covariance matrix
27. Example: experimental design
- Design an experiment to show an enzyme reacting with substrate S
- Problem: what concentration(s) of the substrate should be tested?
- The experimental result is the velocity of the reaction, as modeled by a non-linear equation
Flaherty et al. 2005
28. Optimal Experimental Design: Assumptions
- Unbiasedness of experiments: E[e_i] = 0
- Uncorrelatedness of experiments: E[e_i e_j] = 0 for i ≠ j
- Variance homogeneity: E[e_j^2] = σ^2 > 0, j = 1, ..., N
- Linearity of the parameterization
  - η(x, θ) = θ^T f(x) for f(x) = (f1(x), ..., fm(x))^T
- The functions fi(x) are continuous and linearly independent on x
Atkinson, 1996
29. Experimental Design
- Select a set of (possibly noisy) queries (or experiments) from x that together are maximally informative
- Let X be the matrix with rows f(x_i)^T, i = 1, ..., n
- Let w be a Boolean weight vector that selects rows from X, with W = diag(w)
- Using the squared-error criterion, the error covariance matrix is given by σ^2 (X^T W X)^-1
- (X^T W X)^-1 is the inverted Hessian of the squared error, i.e., the inverted Fisher information matrix of the labeled data
- (X^T W X)^-1 measures the informativeness of the selected experiments (sketch below)
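A minimal numpy sketch of the error covariance σ^2 (X^T W X)^-1 for a Boolean selection w, where the rows of X are the feature vectors f(x_i). The candidate experiments, noise level, and selection are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))           # 8 candidate experiments, 3 model parameters
sigma2 = 0.5                          # noise variance
w = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)   # Boolean selection of 4 experiments

W = np.diag(w)
info = X.T @ W @ X                    # Fisher information of the selected experiments
cov = sigma2 * np.linalg.inv(info)    # error covariance of the least-squares estimate

print("trace (A-criterion):", cov.trace())
print("log det (D-criterion):", np.linalg.slogdet(cov)[1])
print("max eigenvalue (E-criterion):", np.linalg.eigvalsh(cov).max())
```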
30. Relaxed Experimental Design
- The relaxed problem allows w_i ≥ 0, Σi w_i = 1
[Figure: comparison of the relaxed problem and the Boolean problem for N = 3]
31. Experimental Design Types
- A-optimal design minimizes the trace of (X^T W X)^-1
- D-optimal design minimizes the log determinant of (X^T W X)^-1
- E-optimal design minimizes the maximum eigenvalue of (X^T W X)^-1
- All of these designs can be found with convex optimization when w is relaxed to lie in [0, 1] (sketch below)
- Computational complexity is polynomial for the resulting semi-definite programs (A- and E-optimal designs)
Boyd & Vandenberghe, 2004
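A sketch of the relaxed A-, D-, and E-optimal design problems using the cvxpy package (an assumption; any convex solver would do), with a synthetic menu of candidate experiments. Each problem chooses weights w ≥ 0 with Σ w = 1 over the menu.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
V = rng.normal(size=(20, 3))                     # 20 candidate experiments f(x_i) in R^3
n, d = V.shape

def info_matrix(w):
    # X^T W X = sum_i w_i f(x_i) f(x_i)^T, affine in w
    return sum(w[i] * np.outer(V[i], V[i]) for i in range(n))

def solve(objective_fn):
    w = cp.Variable(n, nonneg=True)
    cp.Problem(objective_fn(w), [cp.sum(w) == 1]).solve()
    return w.value

# A: minimize trace of the inverse; D: maximize log det (= minimize log det of the inverse);
# E: maximize the smallest eigenvalue (= minimize the largest eigenvalue of the inverse).
w_A = solve(lambda w: cp.Minimize(cp.matrix_frac(np.eye(d), info_matrix(w))))
w_D = solve(lambda w: cp.Maximize(cp.log_det(info_matrix(w))))
w_E = solve(lambda w: cp.Maximize(cp.lambda_min(info_matrix(w))))

for name, w in [("A", w_A), ("D", w_D), ("E", w_E)]:
    print(name, np.round(w, 2))
```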
32. A-Optimal Design
- A-optimal design minimizes the trace of (X^T W X)^-1
- Minimizing the trace (the sum of the diagonal elements) essentially chooses maximally independent columns
- Tends to choose points on the border of the dataset
- Example: mixture of four Gaussians
Yu et al., 2006
33. A-Optimal Design
- A-optimal design minimizes the trace of (X^T W X)^-1
- Can be cast as a semi-definite program
- Example: 20 data points, with the ellipsoid of the dual problem
Boyd & Vandenberghe, 2004
34. D-Optimal Design
- D-optimal design minimizes the log determinant of (X^T W X)^-1
- The matrix determinant loosely measures the independence of the columns (hence the design maximizes the independence of the experiments)
- The dual of the problem chooses the ellipsoid with minimum volume
- Note that det(exp(A)) = exp(trace(A))
- Note also that the non-zero experiment weights are equal
Boyd & Vandenberghe, 2004
35. E-Optimal Design
- E-optimal design minimizes the largest eigenvalue of (X^T W X)^-1
- Equivalently, minimizes the norm of (X^T W X)^-1
- Can be cast as a semi-definite program
- Minimizes the diameter of the ellipsoid in the dual problem
Boyd & Vandenberghe, 2004
36. Extensions to optimal design
- Cost associated with each experiment
  - Add a cost vector and constrain the total cost by a budget B (one additional constraint)
- Multiple samples from a single experiment
  - Each x_i is now a matrix instead of a vector
  - The optimization (covariance matrix) is identical to before
- Timeline/ordering of experiments
  - Add a time dimension to each experiment vector x_i
Atkinson, 1996
Boyd & Vandenberghe, 2004
37. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
38. Optimal design in non-linear models
- Given a non-linear model y = g(x, θ)
- The design matrix is described by a Taylor expansion around a value θ0
  - f_j(x, θ) = ∂g(x, θ) / ∂θ_j, evaluated at θ0
- Maximization of the information matrix is now the same as in the linear model (sketch below)
- Yields a locally optimal design, optimal for the particular value θ0
- Yields no information on the (lack of) fit of the model
Atkinson, 1996
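A sketch of the locally optimal design matrix for a non-linear model: rows are the sensitivities f_j(x, θ) = ∂g/∂θ_j evaluated at a guess θ0, computed here by finite differences. The Michaelis-Menten-style velocity model g and the candidate substrate concentrations are assumptions made for illustration.

```python
import numpy as np

def g(x, theta):
    vmax, km = theta
    return vmax * x / (km + x)            # reaction velocity at substrate concentration x

def sensitivity_row(x, theta0, eps=1e-6):
    """Finite-difference gradient of g with respect to theta at theta0."""
    theta0 = np.asarray(theta0, float)
    row = np.zeros_like(theta0)
    for j in range(len(theta0)):
        step = np.zeros_like(theta0)
        step[j] = eps
        row[j] = (g(x, theta0 + step) - g(x, theta0 - step)) / (2 * eps)
    return row

theta0 = np.array([2.0, 0.5])             # prior guess for (Vmax, Km)
candidates = np.array([0.05, 0.1, 0.5, 1.0, 5.0, 10.0])
X = np.array([sensitivity_row(x, theta0) for x in candidates])

# With X in hand, the A-/D-/E-optimal machinery for the linear model applies as before.
print(np.round(X, 3))
```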
39. Optimal design in non-linear models
- Problem: the parameter value θ, used to choose the experiments X, is unknown
- Three general techniques address this problem and are useful for many possible notions of gain
  - Sequential experimental design: iterate between choosing an experiment x and updating the parameter estimate θ
  - Bayesian experimental design: put a prior distribution on the parameter θ, then choose a best set x
  - Maximin experimental design: assume the worst-case scenario for the parameter θ, then choose a best set x
40. Sequential Experimental Design
- Model parameter values are not known exactly
- Multiple experiments are possible
- The learner assumes that only one experiment is possible and makes a best guess at the optimal data point for the given θ
- Each iteration (sketch below):
  - Select the data point to collect via experimental design using θ
  - A single experiment is performed
  - The model parameters θ are updated based on all data x
- Similar in spirit to Expectation Maximization
- Active learning methods are all examples of sequential experimental design
Pronzato & Thierry, 2000
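A minimal sketch of a sequential design loop for a linear model y = θ^T f(x): at each step, pick from the remaining menu the experiment that most increases det(X^T X) (a greedy D-optimal step), run it, and re-estimate θ by least squares. The model, menu, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true = np.array([1.5, -2.0, 0.7])
menu = rng.normal(size=(30, 3))                       # menu of candidate experiments f(x_i)
noise = 0.1

def run_experiment(f_x):                              # the (simulated) expensive experiment
    return f_x @ theta_true + noise * rng.normal()

chosen = [0, 1, 2]                                    # start from a few seed experiments
ys = [run_experiment(menu[i]) for i in chosen]

for _ in range(5):
    X = menu[chosen]
    remaining = [i for i in range(len(menu)) if i not in chosen]
    # greedy D-optimal step: maximize the determinant of the updated information matrix
    best = max(remaining,
               key=lambda i: np.linalg.det(X.T @ X + np.outer(menu[i], menu[i])))
    chosen.append(best)
    ys.append(run_experiment(menu[best]))
    theta_hat, *_ = np.linalg.lstsq(menu[chosen], np.array(ys), rcond=None)

print("estimate:", np.round(theta_hat, 3), "true:", theta_true)
```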
41. Bayesian Experimental Design
- Effective when knowledge of the distribution of θ is available (sketch below)
- Example: KL divergence between posterior and prior
  - x* = argmax_w ∫∫_{θ∈Θ} D( p(θ | w, x) || p(θ) ) p(x | w) dθ dx
- Example: A-optimal design
  - x* = argmin_w ∫∫_{θ∈Θ} tr[(X^T W X)^-1] p(θ | w, x) p(x | w) dθ dx
- Often sensitive to the chosen distributions
- This is the only situation within Bayesian theory in which we average over the sample space x
Chaloner & Verdinelli, 1995
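One simple (pseudo-Bayesian) sketch of this idea: draw θ from a prior, evaluate a local design criterion (here the A-criterion, the trace of the inverse information matrix) at each draw, and pick the candidate design with the best average. The prior, the Michaelis-Menten-style model, and the two candidate designs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def sensitivities(xs, theta):
    vmax, km = theta                          # dg/dVmax and dg/dKm for g = Vmax*x/(Km+x)
    return np.stack([xs / (km + xs), -vmax * xs / (km + xs) ** 2], axis=1)

def a_criterion(xs, theta):
    X = sensitivities(xs, theta)
    return np.trace(np.linalg.inv(X.T @ X))  # smaller is better

# two candidate three-point designs (substrate concentrations)
designs = {"low": np.array([0.05, 0.1, 0.2]), "spread": np.array([0.1, 1.0, 10.0])}

prior_draws = np.column_stack([rng.uniform(1.0, 3.0, 200),    # Vmax prior
                               rng.uniform(0.2, 2.0, 200)])   # Km prior

for name, xs in designs.items():
    expected = np.mean([a_criterion(xs, th) for th in prior_draws])
    print(name, "expected A-criterion:", round(expected, 2))
```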
42. Maximin Experimental Design
- Maximize the minimum gain (sketch below)
- Example: D-optimal design
  - argmax_w min_{θ∈Θ} log det (X^T W X)
- Example: KL divergence
  - argmax_w min_{θ∈Θ} D( p(θ | w, x) || p(θ) )
- Does not require prior/empirical knowledge of θ
- Good when very little is known about the distribution of the parameter θ
Pronzato & Walter, 1988
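A sketch of maximin design: for each candidate design, compute the D-criterion (log det of the information matrix) at every θ in a plausible range, keep the worst case, and choose the design whose worst case is best. The Michaelis-Menten-style sensitivities and candidate designs are the same assumptions as in the sketch above.

```python
import numpy as np

def sensitivities(xs, theta):
    vmax, km = theta
    return np.stack([xs / (km + xs), -vmax * xs / (km + xs) ** 2], axis=1)

def d_criterion(xs, theta):                      # larger is better (more information)
    X = sensitivities(xs, theta)
    return np.linalg.slogdet(X.T @ X)[1]

designs = {"low": np.array([0.05, 0.1, 0.2]), "spread": np.array([0.1, 1.0, 10.0])}
theta_grid = [(vmax, km) for vmax in (1.0, 2.0, 3.0) for km in (0.2, 1.0, 2.0)]

scores = {name: min(d_criterion(xs, th) for th in theta_grid) for name, xs in designs.items()}
best = max(scores, key=scores.get)
print(scores, "-> maximin choice:", best)
```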
43. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
44. Related ML Problems
- Deciding where to sample data
  - No longer a menu of experiments
- Function optimization
  - Find parameters that maximize a specific function
- Feature selection
  - Select the features that enable the best generalization from the data
- Model selection
  - Select the model that best generalizes from the data
45. Summary
- Active learning
  - Query by committee: distribution over parameter; probabilistic; sequential
  - Uncertainty sampling: distribution over parameter; distance function; sequential
  - Information-based loss functions: maximize gain; sequential
  - Response surface technique: interpolate a local function; sequential
- Optimal experimental design
  - A-optimal design: minimize trace of (X^T W X)^-1
  - D-optimal design: minimize log det of (X^T W X)^-1
  - E-optimal design: minimize largest eigenvalue of (X^T W X)^-1
  - Sequential experimental design: multiple-shot experiments; little known about the parameter
  - Bayesian experimental design: single-shot experiment; some idea of the parameter distribution
  - Maximin experimental design: single-shot experiment; little known about the parameter distribution (range known)