Title: Active Learning, Experimental Design
1. Active Learning, Experimental Design
- CS294 Practical Machine Learning
- December 4, 2006
- Barbara Engelhardt
2. Problem Setup
- Unlabeled data are available, but labels are expensive
- We would like to choose which data to label
  - to maximize the value of that data to our problem
  - to minimize the cost of labeling
- Today's lecture covers algorithms to solve this problem and ways to measure the value of data
3. Toy Example: threshold function
[Figure: unlabeled data points along a line]
- Unlabeled data: labels are all 0, then all 1 (left to right)
- Classifier is a threshold function: h_w(x) = 1 if x > w, 0 otherwise
- Goal: find the transition between the 0 and 1 labels in the minimum number of steps
- Naive method: choose points to label at random on the line
- Better method: binary search for the transition between 0 and 1 (sketch below)
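A minimal sketch of the toy example: locating the threshold by binary search over the sorted unlabeled points. The data, hidden threshold, and pool size below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
xs = np.sort(rng.uniform(0.0, 1.0, size=100))   # unlabeled points on a line
w_true = 0.62                                    # hidden threshold (unknown to the learner)
label = lambda x: int(x > w_true)                # the expensive labeling oracle

# Binary search: each query halves the interval that can contain the transition.
lo, hi, n_queries = 0, len(xs) - 1, 0
while lo < hi:
    mid = (lo + hi) // 2
    n_queries += 1
    if label(xs[mid]) == 0:
        lo = mid + 1          # transition lies to the right of xs[mid]
    else:
        hi = mid              # transition lies at or to the left of xs[mid]

print(f"transition located at index {lo} after {n_queries} labels "
      f"(~log2(100) = 7, versus ~100 labels for random querying)")
```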
4. Example: Sequencing genomes
- What genome should be sequenced next?
- Criteria for selection?
  - Optimal species to detect functional elements across genomes
  - Breadth of species encompassing the biological phenomena of interest
  - (Not the same as the most diverged set of species)
- Marsupials should be sequenced next
McAuliffe et al., 2004
5. Example: collaborative filtering
- Users rate only a few movies; ratings are usually expensive
- Which movies do you show users to best extrapolate their movie preferences?
- Also known as questionnaire design
- Baseline questionnaires
  - Random: m movies chosen at random
  - Most Popular Movies: the m most frequently rated movies
- The most-popular-movies design is no better than the random design!
  - Popular movies are rated highly by all users and do not discriminate tastes
Yu et al. 2006
6. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
7. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
8. Entropy Function
- A measure of the information in a random event X with possible outcomes x1, ..., xn (sketch below)
- Comments on the entropy function
  - Entropy of an event is zero when the outcome is known
  - Entropy is maximal when all outcomes are equally likely
  - The average minimum number of yes/no questions needed to determine the outcome (connection to binary search)
H(X) = -Σi p(xi) log2 p(xi)
Shannon, 1948
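A small numpy sketch of the entropy function H(X) = -Σi p(xi) log2 p(xi); the two example distributions are illustrative assumptions.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a discrete distribution p (zero-probability outcomes are ignored)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                          # 0 * log(0) is taken to be 0
    return -np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 : the outcome is known
print(entropy([0.25] * 4))                # 2.0 : maximal for 4 equally likely outcomes
```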
9. Kullback-Leibler divergence
- P is the true distribution; the distribution Q is used to encode the data instead of P
- KL divergence is the expected extra message length per datum that must be transmitted when using Q
- A measure of how wrong Q is with respect to the true distribution P
DKL(P || Q) = Σi P(xi) log (P(xi)/Q(xi))
            = -Σi P(xi) log Q(xi) + Σi P(xi) log P(xi)
            = H(P,Q) - H(P)
            = cross-entropy - entropy
10. KL divergence properties
- Non-negative: D(P || Q) ≥ 0
- Divergence is 0 if and only if P and Q are equal
  - D(P || Q) = 0 iff P = Q
- Non-symmetric: D(P || Q) ≠ D(Q || P)
- Does not satisfy the triangle inequality
  - D(P || Q) ≤ D(P || R) + D(R || Q) need not hold (sketch of the divergence and its asymmetry below)
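A minimal sketch of DKL(P || Q) = Σi P(xi) log(P(xi)/Q(xi)) and its asymmetry; the two example distributions are illustrative assumptions.

```python
import numpy as np

def kl_divergence(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

P = np.array([0.5, 0.4, 0.1])
Q = np.array([1/3, 1/3, 1/3])

print(kl_divergence(P, Q))    # >= 0
print(kl_divergence(Q, P))    # differs from D(P||Q): KL divergence is not symmetric
print(kl_divergence(P, P))    # 0 exactly when the distributions are equal
```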
11. KL divergence as gain
- The KL divergence between posteriors measures the amount of information gain expected from a query (where x', θ' are the data and parameters after the query)
- Goal: choose a query that maximizes the KL divergence between posterior and prior
- Basic idea: the largest KL divergence between the updated posterior and the current posterior represents the largest gain (sketch below)
D( p(θ' | x') || p(θ | x) )
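A minimal sketch of choosing a query by expected KL gain, using the earlier toy threshold setting: a discrete grid of candidate thresholds θ with a uniform prior, where querying x returns the label y = 1[x > θ]. The grid and the candidate queries are illustrative assumptions.

```python
import numpy as np

thetas = np.linspace(0.0, 1.0, 101)               # candidate thresholds θ
prior = np.full(len(thetas), 1.0 / len(thetas))   # uniform prior p(θ)

def kl(p, q):
    mask = p > 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

def expected_gain(x, prior):
    """E_y[ D( p(θ | y, x) || p(θ) ) ] for the deterministic label y = 1[x > θ]."""
    gain = 0.0
    for y in (0, 1):
        lik = (thetas < x).astype(float) if y == 1 else (thetas >= x).astype(float)
        p_y = np.sum(lik * prior)                 # predictive probability of response y
        if p_y == 0:
            continue
        posterior = lik * prior / p_y
        gain += p_y * kl(posterior, prior)
    return gain

candidates = [0.1, 0.3, 0.5, 0.7, 0.9]
print({x: round(expected_gain(x, prior), 3) for x in candidates})
# The mid-point query x = 0.5 gives the largest expected gain, matching binary search.
```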
12. Loss Functions
- A function L that maps an event to a real number representing the cost or regret associated with that event
- E.g., in regression problems, L(y, θ^T f(x)) maps to the reals
- Examples (sketch below)
  - Quadratic (least squares) loss
  - Linear (absolute value) loss
  - 0-1 (binary) loss
  - Exponential loss
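A short sketch of the listed loss functions; writing the 0-1 and exponential losses in terms of the margin y * f is an assumption about the intended classification setting.

```python
import numpy as np

def quadratic_loss(y, f):     # least squares
    return (y - f) ** 2

def absolute_loss(y, f):      # linear / absolute value
    return np.abs(y - f)

def zero_one_loss(y, f):      # binary: y in {-1, +1}, predict sign(f)
    return float(np.sign(f) != y)

def exponential_loss(y, f):   # as used in boosting, y in {-1, +1}
    return np.exp(-y * f)

y, f = 1.0, 0.3
print(quadratic_loss(y, f), absolute_loss(y, f), zero_one_loss(y, f), exponential_loss(y, f))
```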
13. Risk Function
- Risk is also known as expected loss
- The (frequentist) risk function is explicitly the expected loss over the data distribution
- Bayes risk is defined as the posterior expected loss
- Trade-off: Bayes risk performs well when p(θ | x) is accurate
- Gain here: choose x to minimize expected loss
R(θ, X) = Σx L(θ, x) p(x | θ)      (frequentist risk)
R(θ, X) = Σθ L(θ, x) p(θ | x)      (Bayes risk)
[Figure: loss function and posterior p(θ | x) plotted over X]
14. Minimax loss
- Wald's (1950) alternative: minimize the maximum (expected) loss
- Assume the response x is the worst-case scenario (gives the greatest expected loss)
- Our problem can be thought of as maximizing the minimum gain (maximin); see the sketch below
Minimax(X, θ) = min_θ max_x Loss(X, θ)
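A small numpy sketch of posterior expected loss (Bayes risk) and the minimax choice on a discrete grid; the loss table and posterior are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# loss[i, j] = L(θ_i, decision_j): rows are parameter values, columns are decisions
loss = np.array([[0.0,  1.0],
                 [10.0, 2.0]])
posterior = np.array([0.9, 0.1])      # p(θ | data)

bayes_risk = posterior @ loss         # expected loss of each decision under the posterior
worst_case = loss.max(axis=0)         # maximum loss of each decision over θ

# The two criteria can disagree: Bayes risk favors decision 0, minimax favors decision 1.
print("Bayes-optimal decision:", np.argmin(bayes_risk), bayes_risk)
print("Minimax decision:      ", np.argmin(worst_case), worst_case)
```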
15. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
16. What is Active Learning?
- Unlabeled data are readily available; labels are expensive
- Want to use adaptive decisions to choose which labels to acquire for a given dataset
- Goal: an accurate classifier at minimal cost
17. Active learning warning
- The choice of data is only as good as the model itself
- Assume a linear model; then two data points are sufficient
- What happens when the data are not linear?
18. Active Learning
- An active learner is able to query the world and receive a response before outputting a classifier
- The learner selects queries (but cannot influence the response)
- Two general methods
  - Select the most uncertain data given the model and parameters
  - Select the most informative data to optimize expected gain
- Given a model M with parameters θ and a loss function L
- A query q with response x updates the model posterior to θ'
L(θ, X) = E_x[ L(θ') ]
19. Query by Committee
- Maintain a prior distribution over hypotheses
- Sample a set of classifiers from the distribution
- Query an example based on the degree of disagreement between the committee of classifiers (sketch below)
[Figure: data points on a line with three sampled committee classifiers A, B, C disagreeing near the boundary]
Seung et al. 1992, Freund et al. 1997
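A minimal sketch of query-by-committee for the 1-D threshold setting: sample a committee of thresholds consistent with the labels seen so far, then query the pool point the committee disagrees about most (vote entropy). The data and committee size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
pool = np.sort(rng.uniform(0, 1, 50))                 # unlabeled pool
labeled_x = np.array([0.05, 0.95])                    # a few labels already acquired
labeled_y = np.array([0, 1])

# Version space for a threshold classifier: any w between the largest x labeled 0
# and the smallest x labeled 1 is consistent with the data seen so far.
lo = labeled_x[labeled_y == 0].max()
hi = labeled_x[labeled_y == 1].min()
committee = rng.uniform(lo, hi, size=11)              # sample 11 consistent hypotheses

votes_for_1 = (pool[None, :] > committee[:, None]).mean(axis=0)   # fraction voting label 1

def vote_entropy(p):
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

query_idx = np.argmax(vote_entropy(votes_for_1))
print("query the point x =", pool[query_idx])         # lies where the committee is split
```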
20. Query by Committee: Application
- Used a naïve Bayes model for text classification in a Bayesian learning setting (20 Newsgroups dataset)
McCallum & Nigam, 1998
21. Uncertainty Sampling
- Query the event that the current classifier is most uncertain about (sketch below)
- Used trivially in SVMs, graphical models, etc.
[Figure: if uncertainty is measured as Euclidean distance, the query is the point closest to the current decision boundary]
Lewis & Gale, 1994
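A minimal sketch of uncertainty sampling with a probabilistic classifier: query the pool point whose predicted probability is closest to 0.5. It uses scikit-learn's LogisticRegression; the synthetic data are an assumption.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X_labeled = np.vstack([rng.normal(-2, 1, (10, 2)), rng.normal(2, 1, (10, 2))])
y_labeled = np.array([0] * 10 + [1] * 10)
X_pool = rng.uniform(-4, 4, (200, 2))                 # unlabeled pool

clf = LogisticRegression().fit(X_labeled, y_labeled)
p1 = clf.predict_proba(X_pool)[:, 1]                  # P(y = 1 | x) for each pool point

query_idx = np.argmin(np.abs(p1 - 0.5))               # most uncertain point
print("query:", X_pool[query_idx], "with P(y=1) =", round(p1[query_idx], 3))
```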
22. Information-based Loss Function
- Maximize the KL divergence between posterior and prior
- Maximize the reduction in model entropy between posterior and prior
- Minimize the cross-entropy between posterior and prior
- Many other possibilities
- All of these are notions of information gain
- All of these can be extended to optimal design algorithms
- Must decide how to handle uncertainty about the query response and the model parameters
MacKay, 1992
23. Response Surface Methods
- Estimate the effects of, and interactions between, local changes to the experiments
- Given a set of data points, interpolate a local surface
- (This local surface is called the response surface)
- Hill-climb on the response surface to find the next x
- Use the next x to interpolate the subsequent response surface
- Note: this is the original model for which optimal experimental designs were developed
Kiefer, 1959
24. Example: Response Surface
- Goal: approximate the function f(c) = score(minimize(c)); see the sketch below
- 1. Fit a smoothed response surface to the data points
- 2. Minimize the response surface to find a new candidate
- 3. Use the method to find a nearby local minimum of the score function
- 4. Add the candidate to the data points
- 5. Re-fit the surface, repeat
Blum, unpublished
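A small sketch of the response-surface loop: fit a smooth (quadratic) surface to the scored points, minimize the surface to propose the next candidate, score it, and refit. The 1-D score function and starting points are illustrative assumptions.

```python
import numpy as np

def score(c):                                # expensive black-box score we want to minimize
    return (c - 0.7) ** 2 + 0.1 * np.sin(8 * c)

cs = [0.0, 0.5, 1.0]                         # initial design points
ys = [score(c) for c in cs]
grid = np.linspace(0, 1, 501)

for _ in range(5):
    coeffs = np.polyfit(cs, ys, deg=2)       # fit a quadratic response surface
    surface = np.polyval(coeffs, grid)
    c_next = grid[np.argmin(surface)]        # minimize the surface to get a candidate
    cs.append(c_next)                        # run the "experiment" and add the point
    ys.append(score(c_next))

best = cs[int(np.argmin(ys))]
print("best candidate so far:", round(best, 3), "score:", round(min(ys), 4))
```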
25. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
26. What is Experimental Design?
- Choose among a menu of possible experiments x
- Each experiment has an error e_i
- The goal is to select experiments in order to optimize a specific design criterion
- Equivalently, the goal is to choose experiments that minimize the error covariance matrix
27. Example: experimental design
- Design an experiment to show an enzyme reacting with substrate S
- Problem: what concentration(s) of the substrate should be tested?
- The experimental result is the velocity of the reaction, as modeled by a non-linear equation
Flaherty et al. 2005
28. Optimal Experimental Design: Assumptions
- Unbiasedness of experiments: E[e_i] = 0
- Uncorrelatedness of experiments: E[e_i e_j] = 0 for i ≠ j
- Variance homogeneity: E[e_j^2] = σ^2 > 0, j = 1, ..., N
- Linearity of the parameterization
  - η(x, θ) = θ^T f(x) for f(x) = (f1(x), ..., fm(x))^T
- The functions fi(x) are continuous and linearly independent on x
Atkinson, 1996
29. Experimental Design
- Select a set of (possibly noisy) queries (or experiments) from x that together are maximally informative
- Let X be the matrix with rows f(x_i)^T, i = 1, ..., n
- Let w be a Boolean weight vector that selects rows from X, with W = diag(w)
- Using the squared-error criterion, the error covariance matrix is given by σ^2 (X^T W X)^-1
- (X^T W X)^-1 is the inverted Hessian of the squared error, i.e., the inverted Fisher information matrix of the labeled data
- (X^T W X)^-1 measures the informativeness of the selected experiments (sketch below)
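A minimal numpy sketch of the error covariance σ^2 (X^T W X)^-1 for a Boolean selection w, where the rows of X are the feature vectors f(x_i). The candidate experiments, noise level, and selection are assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 3))           # 8 candidate experiments, 3 model parameters
sigma2 = 0.5                          # noise variance
w = np.array([1, 0, 1, 1, 0, 0, 1, 0], dtype=float)   # Boolean selection of 4 experiments

W = np.diag(w)
info = X.T @ W @ X                    # Fisher information of the selected experiments
cov = sigma2 * np.linalg.inv(info)    # error covariance of the least-squares estimate

print("trace (A-criterion):", cov.trace())
print("log det (D-criterion):", np.linalg.slogdet(cov)[1])
print("max eigenvalue (E-criterion):", np.linalg.eigvalsh(cov).max())
```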
30. Relaxed Experimental Design
- The relaxed problem allows w_i ≥ 0, Σi w_i = 1
[Figure: comparison of the relaxed problem and the Boolean problem for N = 3]
31. Experimental Design Types
- A-optimal design minimizes the trace of (X^T W X)^-1
- D-optimal design minimizes the log determinant of (X^T W X)^-1
- E-optimal design minimizes the maximum eigenvalue of (X^T W X)^-1
- All of these designs can be found with convex optimization when w is relaxed to lie in [0, 1] (sketch below)
- Computational complexity is polynomial for the resulting semi-definite programs (A- and E-optimal designs)
Boyd & Vandenberghe, 2004
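A sketch of the relaxed A-, D-, and E-optimal design problems using the cvxpy package (an assumption; any convex solver would do), with a synthetic menu of candidate experiments. Each problem chooses weights w ≥ 0 with Σ w = 1 over the menu.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
V = rng.normal(size=(20, 3))                     # 20 candidate experiments f(x_i) in R^3
n, d = V.shape

def info_matrix(w):
    # X^T W X = sum_i w_i f(x_i) f(x_i)^T, affine in w
    return sum(w[i] * np.outer(V[i], V[i]) for i in range(n))

def solve(objective_fn):
    w = cp.Variable(n, nonneg=True)
    cp.Problem(objective_fn(w), [cp.sum(w) == 1]).solve()
    return w.value

# A: minimize trace of the inverse; D: maximize log det (= minimize log det of the inverse);
# E: maximize the smallest eigenvalue (= minimize the largest eigenvalue of the inverse).
w_A = solve(lambda w: cp.Minimize(cp.matrix_frac(np.eye(d), info_matrix(w))))
w_D = solve(lambda w: cp.Maximize(cp.log_det(info_matrix(w))))
w_E = solve(lambda w: cp.Maximize(cp.lambda_min(info_matrix(w))))

for name, w in [("A", w_A), ("D", w_D), ("E", w_E)]:
    print(name, np.round(w, 2))
```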
32. A-Optimal Design
- A-optimal design minimizes the trace of (X^T W X)^-1
- Minimizing the trace (the sum of the diagonal elements) essentially chooses maximally independent columns
- Tends to choose points on the border of the dataset
- Example: mixture of four Gaussians
Yu et al., 2006
33. A-Optimal Design
- A-optimal design minimizes the trace of (X^T W X)^-1
- Can be cast as a semi-definite program
- Example: 20 data points, with the ellipsoid of the dual problem
Boyd & Vandenberghe, 2004
34. D-Optimal Design
- D-optimal design minimizes the log determinant of (X^T W X)^-1
- The matrix determinant loosely measures the independence of the columns (hence the design maximizes the independence of the experiments)
- The dual of the problem chooses the ellipsoid with minimum volume
- Note that det(exp(A)) = exp(trace(A))
- Note also that the non-zero experiment weights are equal
Boyd & Vandenberghe, 2004
35. E-Optimal Design
- E-optimal design minimizes the largest eigenvalue of (X^T W X)^-1
- Equivalently, minimizes the norm of (X^T W X)^-1
- Can be cast as a semi-definite program
- Minimizes the diameter of the ellipsoid in the dual problem
Boyd & Vandenberghe, 2004
36. Extensions to optimal design
- Cost associated with each experiment
  - Add a cost vector and constrain the total cost by a budget B (one additional constraint)
- Multiple samples from a single experiment
  - Each x_i is now a matrix instead of a vector
  - The optimization (covariance matrix) is identical to before
- Timeline/ordering of experiments
  - Add a time dimension to each experiment vector x_i
Atkinson, 1996
Boyd & Vandenberghe, 2004
37. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
38. Optimal design in non-linear models
- Given a non-linear model y = g(x, θ)
- The design matrix is described by a Taylor expansion around a value θ0
  - f_j(x, θ) = ∂g(x, θ) / ∂θ_j, evaluated at θ0
- Maximization of the information matrix is now the same as in the linear model (sketch below)
- Yields a locally optimal design, optimal for the particular value θ0
- Yields no information on the (lack of) fit of the model
Atkinson, 1996
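A sketch of the locally optimal design matrix for a non-linear model: rows are the sensitivities f_j(x, θ) = ∂g/∂θ_j evaluated at a guess θ0, computed here by finite differences. The Michaelis-Menten-style velocity model g and the candidate substrate concentrations are assumptions made for illustration.

```python
import numpy as np

def g(x, theta):
    vmax, km = theta
    return vmax * x / (km + x)            # reaction velocity at substrate concentration x

def sensitivity_row(x, theta0, eps=1e-6):
    """Finite-difference gradient of g with respect to theta at theta0."""
    theta0 = np.asarray(theta0, float)
    row = np.zeros_like(theta0)
    for j in range(len(theta0)):
        step = np.zeros_like(theta0)
        step[j] = eps
        row[j] = (g(x, theta0 + step) - g(x, theta0 - step)) / (2 * eps)
    return row

theta0 = np.array([2.0, 0.5])             # prior guess for (Vmax, Km)
candidates = np.array([0.05, 0.1, 0.5, 1.0, 5.0, 10.0])
X = np.array([sensitivity_row(x, theta0) for x in candidates])

# With X in hand, the A-/D-/E-optimal machinery for the linear model applies as before.
print(np.round(X, 3))
```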
39. Optimal design in non-linear models
- Problem: the parameter value θ, used to choose the experiments X, is unknown
- Three general techniques address this problem and are useful for many possible notions of gain
  - Sequential experimental design: iterate between choosing an experiment x and updating the parameter estimate θ
  - Bayesian experimental design: put a prior distribution on the parameter θ, then choose a best set x
  - Maximin experimental design: assume the worst-case scenario for the parameter θ, then choose a best set x
40. Sequential Experimental Design
- Model parameter values are not known exactly
- Multiple experiments are possible
- The learner assumes that only one experiment is possible and makes a best guess at the optimal data point for the given θ
- Each iteration (sketch below):
  - Select the data point to collect via experimental design using θ
  - A single experiment is performed
  - The model parameters θ are updated based on all data x
- Similar in spirit to Expectation Maximization
- Active learning methods are all examples of sequential experimental design
Pronzato & Thierry, 2000
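A minimal sketch of a sequential design loop for a linear model y = θ^T f(x): at each step, pick from the remaining menu the experiment that most increases det(X^T X) (a greedy D-optimal step), run it, and re-estimate θ by least squares. The model, menu, and noise level are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
theta_true = np.array([1.5, -2.0, 0.7])
menu = rng.normal(size=(30, 3))                       # menu of candidate experiments f(x_i)
noise = 0.1

def run_experiment(f_x):                              # the (simulated) expensive experiment
    return f_x @ theta_true + noise * rng.normal()

chosen = [0, 1, 2]                                    # start from a few seed experiments
ys = [run_experiment(menu[i]) for i in chosen]

for _ in range(5):
    X = menu[chosen]
    remaining = [i for i in range(len(menu)) if i not in chosen]
    # greedy D-optimal step: maximize the determinant of the updated information matrix
    best = max(remaining,
               key=lambda i: np.linalg.det(X.T @ X + np.outer(menu[i], menu[i])))
    chosen.append(best)
    ys.append(run_experiment(menu[best]))
    theta_hat, *_ = np.linalg.lstsq(menu[chosen], np.array(ys), rcond=None)

print("estimate:", np.round(theta_hat, 3), "true:", theta_true)
```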
41. Bayesian Experimental Design
- Effective when knowledge of the distribution of θ is available (sketch below)
- Example: KL divergence between posterior and prior
  - x* = argmax_w ∫∫_{θ∈Θ} D( p(θ | w, x) || p(θ) ) p(x | w) dθ dx
- Example: A-optimal design
  - x* = argmin_w ∫∫_{θ∈Θ} tr[(X^T W X)^-1] p(θ | w, x) p(x | w) dθ dx
- Often sensitive to the chosen distributions
- This is the only situation within Bayesian theory in which we average over the sample space x
Chaloner & Verdinelli, 1995
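One simple (pseudo-Bayesian) sketch of this idea: draw θ from a prior, evaluate a local design criterion (here the A-criterion, the trace of the inverse information matrix) at each draw, and pick the candidate design with the best average. The prior, the Michaelis-Menten-style model, and the two candidate designs are assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)

def sensitivities(xs, theta):
    vmax, km = theta                          # dg/dVmax and dg/dKm for g = Vmax*x/(Km+x)
    return np.stack([xs / (km + xs), -vmax * xs / (km + xs) ** 2], axis=1)

def a_criterion(xs, theta):
    X = sensitivities(xs, theta)
    return np.trace(np.linalg.inv(X.T @ X))  # smaller is better

# two candidate three-point designs (substrate concentrations)
designs = {"low": np.array([0.05, 0.1, 0.2]), "spread": np.array([0.1, 1.0, 10.0])}

prior_draws = np.column_stack([rng.uniform(1.0, 3.0, 200),    # Vmax prior
                               rng.uniform(0.2, 2.0, 200)])   # Km prior

for name, xs in designs.items():
    expected = np.mean([a_criterion(xs, th) for th in prior_draws])
    print(name, "expected A-criterion:", round(expected, 2))
```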
42. Maximin Experimental Design
- Maximize the minimum gain (sketch below)
- Example: D-optimal design
  - argmax_w min_{θ∈Θ} log det (X^T W X)
- Example: KL divergence
  - argmax_w min_{θ∈Θ} D( p(θ | w, x) || p(θ) )
- Does not require prior/empirical knowledge of θ
- Good when very little is known about the distribution of the parameter θ
Pronzato & Walter, 1988
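A sketch of maximin design: for each candidate design, compute the D-criterion (log det of the information matrix) at every θ in a plausible range, keep the worst case, and choose the design whose worst case is best. The Michaelis-Menten-style sensitivities and candidate designs are the same assumptions as in the sketch above.

```python
import numpy as np

def sensitivities(xs, theta):
    vmax, km = theta
    return np.stack([xs / (km + xs), -vmax * xs / (km + xs) ** 2], axis=1)

def d_criterion(xs, theta):                      # larger is better (more information)
    X = sensitivities(xs, theta)
    return np.linalg.slogdet(X.T @ X)[1]

designs = {"low": np.array([0.05, 0.1, 0.2]), "spread": np.array([0.1, 1.0, 10.0])}
theta_grid = [(vmax, km) for vmax in (1.0, 2.0, 3.0) for km in (0.2, 1.0, 2.0)]

scores = {name: min(d_criterion(xs, th) for th in theta_grid) for name, xs in designs.items()}
best = max(scores, key=scores.get)
print(scores, "-> maximin choice:", best)
```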
43. Topics for today
- Introduction: information theory
- Active learning
- Query by committee
- Uncertainty sampling
- Information-based loss functions
- Response surface technique
- Optimal experimental design
- A-optimal design
- D-optimal design
- E-optimal design
- Sequential experimental design
- Bayesian experimental design
- Maximin experimental design
- Summary
44. Related ML Problems
- Deciding where to sample data
  - No longer a menu of experiments
- Function optimization
  - Find parameters that maximize a specific function
- Feature selection
  - Select the features that enable the best generalization from the data
- Model selection
  - Select the model that best generalizes from the data
45. Summary
- Active learning
  - Query by committee: distribution over parameter; probabilistic; sequential
  - Uncertainty sampling: distribution over parameter; distance function; sequential
  - Information-based loss functions: maximize gain; sequential
  - Response surface technique: interpolate a local function; sequential
- Optimal experimental design
  - A-optimal design: minimize trace of (X^T W X)^-1
  - D-optimal design: minimize log det of (X^T W X)^-1
  - E-optimal design: minimize largest eigenvalue of (X^T W X)^-1
  - Sequential experimental design: multiple-shot experiments; little known about the parameter
  - Bayesian experimental design: single-shot experiment; some idea of the parameter distribution
  - Maximin experimental design: single-shot experiment; little known about the parameter distribution (range known)