1
Active Learning, Experimental Design
  • CS294 Practical Machine Learning
  • December 4, 2006
  • Barbara Engelhardt

2
Problem Setup
  • Unlabeled data are available, but labels are expensive
  • I would like to choose which data to label
  • to maximize the value of that data to my problem
  • to minimize the cost of labeling
  • Today's lecture covers algorithms to solve this
    problem and ways to measure the value of data

3
Toy Example: threshold function
[Figure: unlabeled points on a line]
Unlabeled data: labels are all 0 and then all 1 (left to right)
Classifier is a threshold function:
h_w(x) = 1 if x > w (0 otherwise)
Goal: find the transition between the 0 and 1 labels in a
minimum number of steps
Naïve method: choose points to label at random on the line
Better method: binary search for the transition
between 0 and 1 (see the sketch below)
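
A minimal sketch of the two labeling strategies on this toy problem (Python; the data and the hidden threshold are made up, and oracle_label stands in for the expensive labeling step):

```python
import numpy as np

# Toy 1-D problem: points on a line, labels are 0 below a hidden threshold w, 1 above it.
rng = np.random.default_rng(0)
points = np.sort(rng.uniform(0.0, 1.0, size=100))
true_w = 0.37  # hidden; the learner never sees this directly

def oracle_label(x):
    """Expensive labeling step: returns h_w(x) = 1 if x > w else 0."""
    return int(x > true_w)

def random_queries(points, budget):
    """Naive method: label points chosen uniformly at random."""
    idx = rng.choice(len(points), size=budget, replace=False)
    return {int(i): oracle_label(points[i]) for i in idx}

def binary_search_queries(points):
    """Better method: binary search for the 0 -> 1 transition; O(log n) labels."""
    lo, hi, n_labels = 0, len(points) - 1, 0
    while lo < hi:
        mid = (lo + hi) // 2
        n_labels += 1
        if oracle_label(points[mid]) == 0:
            lo = mid + 1      # transition is to the right of mid
        else:
            hi = mid          # transition is at mid or to its left
    return points[lo], n_labels

random_labels = random_queries(points, budget=10)   # 10 labels, no guarantee of being near w
estimate, n_labels = binary_search_queries(points)
print(f"binary search: threshold estimate {estimate:.3f} found with {n_labels} labels")
```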
4
Example: sequencing genomes
  • What genome should be sequenced next?
  • Criteria for selection?
  • Optimal species to detect functional elements
    across genomes
  • Breadth of species encompassing biological
    phenomena of interest
  • (Not the same as the most diverged set of
    species)
  • Marsupials should be sequenced next

McAuliffe et al., 2004
5
Example: collaborative filtering
  • Users rate only a few movies; usually ratings are
    expensive
  • Which movies do you show users to best
    extrapolate movie preferences?
  • Also known as questionnaire design
  • Baseline questionnaires:
  • Random: m movies chosen at random
  • Most Popular Movies: the m most frequently rated
    movies
  • The most-popular design is not better than the random
    design!
  • Popular movies rated highly by all users do not
    discriminate between tastes

Yu et al. 2006
6
Topics for today
  • Introduction: information theory
  • Active learning
  • Query by committee
  • Uncertainty sampling
  • Information-based loss functions
  • Response surface technique
  • Optimal experimental design
  • A-optimal design
  • D-optimal design
  • E-optimal design
  • Sequential experimental design
  • Bayesian experimental design
  • Maximin experimental design
  • Summary

7
Topics for today
  • Introduction: information theory
  • Active learning
  • Query by committee
  • Uncertainty sampling
  • Information-based loss functions
  • Response surface technique
  • Optimal experimental design
  • A-optimal design
  • D-optimal design
  • E-optimal design
  • Sequential experimental design
  • Bayesian experimental design
  • Maximin experimental design
  • Summary

8
Entropy Function
  • A measure of information in a random event X with
    possible outcomes x_1,…,x_n
  • Comments on the entropy function
  • Entropy of an event is zero when the outcome is
    known
  • Entropy is maximal when all outcomes are equally
    likely
  • The average minimum number of yes/no questions needed
    to determine the outcome (connection to binary search)

H(X) = -Σ_i p(x_i) log₂ p(x_i)
Shannon, 1948
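
A small illustration of the entropy formula and the comments above (Python/numpy; an illustration, not from the slides): entropy is zero for a known outcome and maximal for a uniform distribution.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_i p(x_i) * log2 p(x_i); terms with p = 0 contribute 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

print(entropy([1.0, 0.0, 0.0, 0.0]))        # 0.0  -> outcome is known
print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0  -> maximal: 2 yes/no questions on average
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 -> an optimal question tree uses 1.75 questions
```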
9
Kullback-Leibler divergence
  • P is the true distribution; Q is the distribution
    used to encode data instead of P
  • KL divergence is the expected extra message
    length per datum that must be transmitted using Q
  • Measure of how wrong Q is with respect to the true
    distribution P

D_KL(P ‖ Q) = Σ_i P(x_i) log ( P(x_i) / Q(x_i) )
            = -Σ_i P(x_i) log Q(x_i) + Σ_i P(x_i) log P(x_i)
            = H(P, Q) - H(P)
            = cross-entropy - entropy
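
A short numerical check of the identities above (numpy; an illustration, not from the deck): D_KL(P‖Q) equals cross-entropy minus entropy, and it is not symmetric.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_i P(x_i) log2( P(x_i) / Q(x_i) ); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def cross_entropy(p, q):
    """H(P, Q) = -sum_i P(x_i) log2 Q(x_i): expected message length when encoding P with Q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

def entropy(p):
    p = np.asarray(p, float)
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

P = np.array([0.5, 0.25, 0.25])
Q = np.array([0.4, 0.4, 0.2])
print(kl_divergence(P, Q))                        # expected extra bits per symbol
print(cross_entropy(P, Q) - entropy(P))           # same value: H(P,Q) - H(P)
print(kl_divergence(Q, P), kl_divergence(P, Q))   # not symmetric
```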
10
KL divergence properties
  • Non-negative: D(P‖Q) ≥ 0
  • Divergence is zero if and only if P and Q are equal:
    D(P‖Q) = 0 iff P = Q
  • Non-symmetric: D(P‖Q) ≠ D(Q‖P)
  • Does not satisfy the triangle inequality:
    D(P‖Q) ≤ D(P‖R) + D(R‖Q) need not hold

11
KL divergence as gain
  • The KL divergence between the posteriors
    measures the amount of information gain expected
    from a query (where x′ denotes the data after the
    query)
  • Goal: choose a query that maximizes the KL
    divergence between posterior and prior
  • Basic idea: the largest KL divergence between the updated
    posterior probability and the current posterior
    probability represents the largest gain

D( p(θ | x′) ‖ p(θ | x) )
12
Loss Functions
  • A function L that maps an event to a real number,
    representing cost or regret associated with event
  • E.g., in regression problems, L(y, θᵀf(x)) maps
    to the reals
  • Examples
  • Quadratic (least squares) loss
  • Linear (absolute value) loss
  • 0-1 (binary) loss
  • Exponential

13
Risk Function
  • Risk is also known as expected loss
  • The (frequentist) risk function is the expected loss
    over possible responses x
  • Bayes risk is defined as the posterior expected loss
  • Trade-off: Bayes risk performs well when p(θ | x) is
    accurate
  • The gain here: choose x to minimize the expected loss
    (see the worked example below)

Frequentist risk: R(θ) = Σ_x L(θ, x) p(x | θ)
Bayes (posterior expected) loss: R(x) = Σ_θ L(θ, x) p(θ | x)
[Figure: loss L and posterior p(θ | x) plotted over X]
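
A toy calculation of the two quantities above (Python; the likelihood, posterior, and loss values are invented for illustration): the frequentist risk averages the loss over responses x for a fixed θ, while the posterior expected loss averages over p(θ | x) for a fixed x.

```python
import numpy as np

# Hypothetical discrete problem: 3 parameter values, 4 possible responses.
loss = np.array([[0.0, 1.0, 2.0, 4.0],               # L(theta_i, x_j)
                 [1.0, 0.0, 1.0, 2.0],
                 [3.0, 1.0, 0.0, 1.0]])
p_x_given_theta = np.array([[0.7, 0.2, 0.1, 0.0],     # p(x_j | theta_i), rows sum to 1
                            [0.1, 0.6, 0.2, 0.1],
                            [0.0, 0.1, 0.3, 0.6]])
prior = np.array([0.5, 0.3, 0.2])

# Frequentist risk: R(theta_i) = sum_j L(theta_i, x_j) p(x_j | theta_i)
freq_risk = np.sum(loss * p_x_given_theta, axis=1)

# Posterior after observing x_j: p(theta | x_j) proportional to p(x_j | theta) p(theta)
joint = p_x_given_theta * prior[:, None]
posterior = joint / joint.sum(axis=0, keepdims=True)  # each column sums to 1

# Posterior expected loss for each possible response: R(x_j) = sum_i L(theta_i, x_j) p(theta_i | x_j)
bayes_loss = np.sum(loss * posterior, axis=0)

print("frequentist risk per theta:", freq_risk)
print("posterior expected loss per x:", bayes_loss)
```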
14
Minimax loss
  • Wald's (1950) alternative: minimize the
    maximum (expected) loss
  • Assume the response x is the worst-case scenario
    (gives the greatest expected loss)
  • Our problem can be thought of as maximizing the
    minimum gain (maximin)

Minimax(X, θ) = min_θ max_x Loss(X, θ)
15
Topics for today
  • Introduction: information theory
  • Active learning
  • Query by committee
  • Uncertainty sampling
  • Information-based loss functions
  • Response surface technique
  • Optimal experimental design
  • A-optimal design
  • D-optimal design
  • E-optimal design
  • Sequential experimental design
  • Bayesian experimental design
  • Maximin experimental design
  • Summary

16
What is Active Learning?
  • Unlabeled data are readily available, but labels are
    expensive
  • Want to use adaptive decisions to choose which
    labels to acquire for a given dataset
  • Goal is accurate classifier with minimal cost

17
Active learning: a warning
  • The choice of data is only as good as the model
    itself
  • Assume a linear model; then two data points are
    sufficient
  • What happens when the data are not linear?

18
Active Learning
  • An active learner is able to query the world and receive
    a response before outputting a classifier
  • The learner selects queries (but cannot influence the
    response)
  • Two general methods:
  • Select the most uncertain data given the model and
    parameters
  • Select the most informative data to optimize the
    expected gain
  • Given a model M with parameters θ and a loss function
    L
  • A query q with response x updates the model
    posterior to θ′

Expected loss of a query: L(q, X) = E_x[ L(θ′) ]
19
Query by Committee
  • Prior distribution over hypotheses
  • Sample a set of classifiers from the distribution
  • Query an example based on the degree of
    disagreement among the committee of classifiers

[Figure: committee of classifiers A, B, C and their disagreement region over unlabeled points]
Seung et al. 1992, Freund et al. 1997
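
A rough sketch of query by committee, assuming the 1-D threshold hypotheses from slide 3 and a committee sampled from the hypotheses consistent with the current labels; vote entropy is used here as the disagreement measure (one common choice, not necessarily the one in the cited papers).

```python
import numpy as np

rng = np.random.default_rng(1)
pool = np.sort(rng.uniform(0.0, 1.0, 200))     # unlabeled pool of 1-D points
true_w = 0.42                                  # hidden threshold

def oracle_label(x):
    return int(x > true_w)                     # expensive labeling step

labeled = {0: oracle_label(pool[0]), 199: oracle_label(pool[-1])}

def sample_committee(labeled, pool, size=9):
    """Sample threshold hypotheses consistent with the labels seen so far (the version space)."""
    lo = max(pool[i] for i, y in labeled.items() if y == 0)
    hi = min(pool[i] for i, y in labeled.items() if y == 1)
    return rng.uniform(lo, hi, size=size)

def vote_entropy(x, committee):
    """Disagreement of the committee on candidate x, measured by vote entropy."""
    p1 = np.mean([x > w for w in committee])
    if p1 in (0.0, 1.0):
        return 0.0
    return -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))

for step in range(8):
    committee = sample_committee(labeled, pool)
    unlabeled = [i for i in range(len(pool)) if i not in labeled]
    # Query the pool point the committee disagrees about most.
    query = max(unlabeled, key=lambda i: vote_entropy(pool[i], committee))
    labeled[query] = oracle_label(pool[query])

lo = max(pool[i] for i, y in labeled.items() if y == 0)
hi = min(pool[i] for i, y in labeled.items() if y == 1)
print(f"after {len(labeled)} labels the threshold is pinned to [{lo:.3f}, {hi:.3f}]")
```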
20
Query by Committee Application
  • Used naïve Bayes model for text classification in
    a Bayesian learning setting (20 Newsgroups
    dataset)

McCallum & Nigam, 1998
21
Uncertainty Sampling
  • Query the event that the current classifier is
    most uncertain about
  • Used trivially in SVMs, graphical models, etc.

If uncertainty is measured by Euclidean distance
[Figure: unlabeled points; the query is the point closest to the current decision boundary]
Lewis & Gale, 1994
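
A short uncertainty-sampling sketch using logistic regression (scikit-learn is assumed to be available; any probabilistic classifier would do), querying the pool point whose predicted class probability is closest to 0.5.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic 2-class problem; only a few labels are "purchased".
X, y = make_classification(n_samples=300, n_features=2, n_redundant=0,
                           n_informative=2, random_state=2)

# Small initial labeled set containing both classes.
labeled = list(np.where(y == 0)[0][:5]) + list(np.where(y == 1)[0][:5])
pool = [i for i in range(len(X)) if i not in labeled]

for step in range(20):
    clf = LogisticRegression().fit(X[labeled], y[labeled])
    # Uncertainty of each pool point: how close p(y=1|x) is to 0.5.
    proba = clf.predict_proba(X[pool])[:, 1]
    query = pool[int(np.argmin(np.abs(proba - 0.5)))]
    labeled.append(query)        # "pay" for this label
    pool.remove(query)

print("accuracy on the full set after 30 labels:", clf.score(X, y))
```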
22
Information-based Loss Function
  • Maximize KL divergence between posterior and
    prior
  • Maximize reduction in model entropy between
    posterior and prior
  • Minimize cross-entropy between posterior and
    prior
  • Many other possibilities
  • All of these are notions of information gain
  • All of these can be extended to optimal design
    algorithms
  • Must decide how to handle uncertainty about query
    response, model parameters

MacKay, 1992
23
Response Surface Methods
  • Estimate effects of and interactions between
    local changes to the experiments
  • Given a set of datapoints, interpolate a local
    surface
  • (This local surface is called the response
    surface)
  • Hill-climb on the response surface to find next x
  • Use next x to interpolate subsequent response
    surface
  • Note: this is the original setting for which optimal
    experimental designs were developed (Kiefer,
    1959)

24
Example: Response Surface
  • Goal: approximate the function f(c) =
    score(minimize(c))
  • 1. Fit a smoothed response surface to the data
    points
  • 2. Minimize response surface to find new
    candidate
  • 3. Use method to find nearby local minimum of
    score function
  • 4. Add candidate to data points
  • 5. Re-fit surface, repeat

Blum, unpublished
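
A schematic version of the loop above (Python; the 1-D objective and the quadratic surface fit are placeholders chosen for brevity, not the actual setup from the citation).

```python
import numpy as np

def expensive_score(c):
    """Stand-in for f(c) = score(minimize(c)): cheap here, expensive in practice."""
    return (c - 0.6) ** 2 + 0.05 * np.sin(25 * c)

# 1. Start from a few evaluated design points.
cs = list(np.linspace(0.0, 1.0, 5))
ys = [expensive_score(c) for c in cs]

for it in range(10):
    # 2. Fit a smoothed response surface (here: a quadratic in c) to the data points.
    coeffs = np.polyfit(cs, ys, deg=2)
    # 3. Minimize the response surface over a grid to propose the next candidate.
    grid = np.linspace(0.0, 1.0, 401)
    candidate = grid[np.argmin(np.polyval(coeffs, grid))]
    # 4. Evaluate the candidate (the expensive step) and add it to the data points.
    cs.append(candidate)
    ys.append(expensive_score(candidate))
    # 5. Re-fit on the enlarged data set and repeat.

best = cs[int(np.argmin(ys))]
print(f"best c found: {best:.3f}, score {min(ys):.4f}")
```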
25
Topics for today
  • Introduction: information theory
  • Active learning
  • Query by committee
  • Uncertainty sampling
  • Information-based loss functions
  • Response surface technique
  • Optimal experimental design
  • A-optimal design
  • D-optimal design
  • E-optimal design
  • Sequential experimental design
  • Bayesian experimental design
  • Maximin experimental design
  • Summary

26
What is Experimental Design?
  • Choose among a menu of possible experiments x
  • Each experiment has an associated error ε_i
  • Goal is to select experiments in order to
    optimize a specific design criterion
  • Equivalently, the goal is to choose experiments that
    minimize the error covariance matrix

27
Example: experimental design
  • Design an experiment to show an enzyme reacting with
    substrate S
  • Problem: what concentration(s) of the substrate
    to test?
  • The experimental result is the velocity of the reaction, as
    modeled by a non-linear equation

Flaherty et al. 2005
28
Optimal Experimental Design Assumptions
  • Unbiasedness of the errors: E[ε_i] = 0
  • Uncorrelatedness of the errors: E[ε_i ε_j] = 0 for i ≠ j
  • Variance homogeneity: E[ε_j²] = σ² > 0, j = 1,…,N
  • Linearity of the parameterization:
    h(x, θ) = θᵀf(x) for f(x) = (f_1(x),…,f_m(x))ᵀ
  • The functions f_i(x) are continuous and linearly independent

Atkinson, 1996
29
Experimental Design
  • Select a set of (possibly noisy) queries (or
    experiments) from x that together are maximally
    informative
  • Let X be the matrix whose rows are f(x_i)ᵀ, i = 1,…,n
  • Let w be a Boolean weight vector that selects rows of
    X, and let W = diag(w)
  • Using the squared-error criterion, the error
    covariance matrix is given by σ²(XᵀWX)⁻¹
  • (XᵀWX)⁻¹ is the inverted Hessian of the squared
    error, i.e., the inverse Fisher information matrix
    of the labeled data
  • Making (XᵀWX)⁻¹ small (under some criterion) corresponds
    to choosing informative experiments

30
Relaxed Experimental Design
  • The relaxed problem allows w_i ≥ 0, Σ_i w_i = 1
    (rather than w_i ∈ {0, 1})

[Figure: feasible sets of the relaxed problem vs. the Boolean problem, N = 3]
31
Experimental Design Types
  • A-optimal design minimizes the trace of (XᵀWX)⁻¹
  • D-optimal design minimizes the log determinant of
    (XᵀWX)⁻¹ (equivalently, maximizes log det (XᵀWX))
  • E-optimal design minimizes the maximum eigenvalue of
    (XᵀWX)⁻¹
  • All of these design methods can use convex
    optimization techniques for evaluation when w is
    relaxed to lie in [0, 1]
  • Computational complexity: polynomial for
    semi-definite programs (A- and E-optimal designs)
    (see the criteria sketch below)

Boyd & Vandenberghe, 2004
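
A small numpy illustration of the three criteria, evaluated on the error covariance matrix (XᵀWX)⁻¹ for Boolean selections w. The candidate experiments are random and the search is brute force, standing in for the semi-definite programs mentioned above.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(3)
X = rng.normal(size=(12, 3))       # 12 candidate experiments, m = 3 parameters (rows are f(x_i)^T)
budget = 4                         # number of experiments we may run

def error_covariance(X, chosen):
    W = np.diag([1.0 if i in chosen else 0.0 for i in range(len(X))])
    return np.linalg.inv(X.T @ W @ X)   # sigma^2 factor omitted

def criteria(cov):
    return {
        "A (trace)": np.trace(cov),                         # A-optimal: minimize
        "D (log det)": np.linalg.slogdet(cov)[1],           # D-optimal: minimize
        "E (max eigenvalue)": np.linalg.eigvalsh(cov)[-1],  # E-optimal: minimize
    }

# Brute-force search over Boolean designs (fine for a toy problem, infeasible in general).
best = {name: (np.inf, None) for name in ("A (trace)", "D (log det)", "E (max eigenvalue)")}
for chosen in combinations(range(len(X)), budget):
    cov = error_covariance(X, set(chosen))
    for name, value in criteria(cov).items():
        if value < best[name][0]:
            best[name] = (value, chosen)

for name, (value, chosen) in best.items():
    print(f"{name}: best value {value:.3f} with experiments {chosen}")
```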
32
A-Optimal Design
  • A-optimal design minimizes the trace of (XᵀWX)⁻¹
  • Minimizing the trace (the sum of the diagonal elements,
    i.e., the variances of the parameter estimates)
    essentially chooses maximally independent columns
  • Tends to choose points on the border of the
    dataset
  • Example: mixture of four Gaussians

Yu et al., 2006
33
A-Optimal Design
  • A-optimal design minimizes the trace of (XᵀWX)⁻¹
  • Can be cast as a semi-definite program
  • Example: 20 datapoints, ellipsoid of the dual problem

Boyd & Vandenberghe, 2004
34
D-Optimal design
  • D-optimal design minimizes the log determinant of
    (XᵀWX)⁻¹ (equivalently, maximizes det (XᵀWX))
  • The matrix determinant loosely measures the independence
    of the columns (hence this maximizes the independence of
    the chosen experiments)
  • The dual problem chooses the ellipsoid with minimum
    volume
  • Note that det(exp(A)) = exp(trace(A))
  • Note also that the non-zero experiment weights are
    equal

Boyd & Vandenberghe, 2004
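
The relaxed D-optimal problem can also be attacked without an SDP solver. Below is a sketch of the classical multiplicative update (a Fedorov/Titterington-style algorithm, not taken from these slides), which maximizes log det(XᵀWX) over weights w_i ≥ 0, Σ_i w_i = 1.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(20, 3))              # 20 candidate experiments, 3 parameters
n, m = X.shape

w = np.full(n, 1.0 / n)                   # start from the uniform design
for it in range(200):
    M = X.T @ (w[:, None] * X)            # information matrix X^T diag(w) X
    Minv = np.linalg.inv(M)
    # Leverage of each candidate under the current design: f(x_i)^T M^{-1} f(x_i).
    leverage = np.einsum("ij,jk,ik->i", X, Minv, X)
    w *= leverage / m                     # multiplicative update; its fixed point is the D-optimal design
    w /= w.sum()                          # guard against numerical drift (the sum is 1 in exact arithmetic)

print("non-zero design weights:", np.round(w[w > 1e-3], 3))
print("log det of information matrix:", np.linalg.slogdet(X.T @ (w[:, None] * X))[1])
```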
35
E-Optimal design
  • E-optimal design minimizes the largest eigenvalue of
    (XᵀWX)⁻¹
  • Equivalently, minimizes the (spectral) norm of (XᵀWX)⁻¹
  • Can be cast as a semi-definite program
  • Minimizes the diameter of the ellipsoid in the
    dual problem

Boyd & Vandenberghe, 2004
36
Extensions to optimal design
  • Cost associated with each experiment
  • Add a cost vector, constrain total cost by a
    budget B (one additional constraint)
  • Multiple samples from single experiment
  • Each xi is now a matrix instead of a vector
  • Optimization (covariance matrix) is identical to
    before
  • Timeline/ordering of experiments
  • Add time dimension to each experiment vector xi

Atkinson, 1996
Boyd & Vandenberghe, 2004
37
Topics for today
  • Introduction: information theory
  • Active learning
  • Query by committee
  • Uncertainty sampling
  • Information-based loss functions
  • Response surface technique
  • Optimal experimental design
  • A-optimal design
  • D-optimal design
  • E-optimal design
  • Sequential experimental design
  • Bayesian experimental design
  • Maximin experimental design
  • Summary

38
Optimal design in non-linear models
  • Given a non-linear model y = g(x, θ)
  • The design matrix is described by a Taylor
    expansion around a working value θ_0:
    f_j(x, θ) = ∂g(x, θ) / ∂θ_j, evaluated at θ_0
  • Maximization of the information matrix is now the
    same as in the linear model
  • Yields a locally optimal design, optimal for the
    particular value θ_0
  • Yields no information on the (lack of) fit of the
    model

Atkinson, 1996
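
As an illustration (not from the slides), take a Michaelis-Menten-style velocity model v = θ_1 S / (θ_2 + S), one common form for the enzyme-kinetics example on slide 27; the rows of the locally linearized design matrix are the partial derivatives ∂g/∂θ_j evaluated at a working value θ_0.

```python
import numpy as np

def g(S, theta):
    """Assumed non-linear response: v = v_max * S / (k_m + S) (Michaelis-Menten form)."""
    v_max, k_m = theta
    return v_max * S / (k_m + S)

def sensitivity_row(S, theta0):
    """f_j(x, theta) = dg/dtheta_j evaluated at theta0 (computed analytically here)."""
    v_max, k_m = theta0
    dv_dvmax = S / (k_m + S)
    dv_dkm = -v_max * S / (k_m + S) ** 2
    return np.array([dv_dvmax, dv_dkm])

theta0 = np.array([2.0, 0.5])                                  # working guess for the parameters
substrate_levels = np.array([0.1, 0.25, 0.5, 1.0, 2.0, 5.0])   # candidate concentrations

# Locally linearized design matrix: one row per candidate experiment.
X = np.vstack([sensitivity_row(S, theta0) for S in substrate_levels])
info = X.T @ X                                                 # information matrix for the uniform design
print("design matrix X:\n", np.round(X, 3))
print("log det of X^T X (locally D-optimal criterion):", np.linalg.slogdet(info)[1])
```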
39
Optimal design in non-linear models
  • Problem: the parameter value θ, used to choose
    experiments X, is unknown
  • Three general techniques address this problem, and they
    are useful for many possible notions of gain
  • Sequential experimental design: iterate between
    choosing an experiment x and updating the parameter
    estimate θ
  • Bayesian experimental design: put a prior
    distribution on the parameter θ, then choose a best set x
  • Maximin experimental design: assume the worst-case
    scenario for the parameter θ, then choose a best set x

40
Sequential Experimental Design
  • Model parameter values are not known exactly
  • Multiple experiments are possible
  • The learner assumes that only one experiment is
    possible and makes the best guess at the optimal data
    point for the given θ
  • Each iteration:
  • Select the data point to collect via experimental
    design using θ
  • A single experiment is performed
  • The model parameters θ are updated based on all data
    x
  • Similar in spirit to Expectation Maximization
  • Active learning methods are all examples of
    sequential experimental design (see the loop sketched below)

Pronzato & Thierry, 2000
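
A schematic of the sequential loop for a nonlinear regression (Python; the model, noise level, candidate grid, and initial experiments are all invented): pick the next experiment that is locally D-optimal at the current estimate, run it, then re-fit the parameters on all data.

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(5)
true_theta = np.array([2.0, 0.5])

def g(S, v_max, k_m):
    return v_max * S / (k_m + S)

def run_experiment(S):
    """Perform one (noisy) experiment at substrate level S."""
    return g(S, *true_theta) + rng.normal(scale=0.05)

def sensitivities(S, theta):
    v_max, k_m = theta
    return np.array([S / (k_m + S), -v_max * S / (k_m + S) ** 2])

candidates = np.linspace(0.05, 5.0, 50)
S_done = [0.1, 1.0, 4.0]                        # a few initial experiments
y_done = [run_experiment(S) for S in S_done]
theta_hat = np.array([1.0, 1.0])                # initial parameter guess

for step in range(10):
    # 1. Choose the next experiment: the candidate that most increases log det of the
    #    locally linearized information matrix at the current estimate theta_hat.
    X_done = np.vstack([sensitivities(S, theta_hat) for S in S_done])
    base_info = X_done.T @ X_done
    gains = [np.linalg.slogdet(base_info + np.outer(sensitivities(S, theta_hat),
                                                    sensitivities(S, theta_hat)))[1]
             for S in candidates]
    S_next = candidates[int(np.argmax(gains))]
    # 2. Perform the single experiment.
    S_done.append(S_next)
    y_done.append(run_experiment(S_next))
    # 3. Update the parameter estimate from all data collected so far.
    theta_hat, _ = curve_fit(g, np.array(S_done), np.array(y_done), p0=theta_hat)

print("estimated theta:", np.round(theta_hat, 3), "true theta:", true_theta)
```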
41
Bayesian Experimental Design
  • Effective when knowledge of a distribution for θ is
    available
  • Example: KL divergence between posterior and prior
    argmax_w ∫_x ∫_{θ∈Θ} D( p(θ | w, x) ‖ p(θ) ) p(x | w) dθ dx
  • Example: A-optimal design
    argmin_w ∫_x ∫_{θ∈Θ} tr[(XᵀWX)⁻¹] p(θ | w, x) p(x | w) dθ dx
  • Often sensitive to the assumed distributions
  • This is the only situation within Bayesian theory in
    which we average over the sample space x

Chaloner & Verdinelli, 1995
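
A Monte Carlo sketch of the Bayesian idea (the prior, model, and candidate menu are invented): instead of designing for a single parameter value, average the design criterion over draws of θ from its prior. For an information-matrix criterion the inner integral over responses drops out, so only the average over θ remains.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(6)

def sensitivities(S, theta):
    """Rows of the locally linearized design matrix for v = theta1*S/(theta2+S)."""
    v_max, k_m = theta
    return np.array([S / (k_m + S), -v_max * S / (k_m + S) ** 2])

candidates = np.linspace(0.1, 5.0, 10)          # menu of possible substrate levels
budget = 3                                      # pick 3 of them
prior_draws = np.column_stack([rng.normal(2.0, 0.3, 300),                # prior over v_max
                               rng.lognormal(np.log(0.5), 0.3, 300)])    # prior over k_m

def expected_logdet(design):
    """E_theta[ log det sum_i f(S_i, theta) f(S_i, theta)^T ] under the prior."""
    vals = []
    for theta in prior_draws:
        X = np.vstack([sensitivities(S, theta) for S in design])
        vals.append(np.linalg.slogdet(X.T @ X)[1])
    return np.mean(vals)

best = max(combinations(candidates, budget), key=expected_logdet)
print("Bayesian D-optimal choice of concentrations:", np.round(best, 2))
```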
42
Maximin Experimental Design
  • Maximize the minimum gain
  • Example: D-optimal design
    argmax_w min_{θ∈Θ} log det (XᵀWX)
  • Example: KL divergence
    argmax_w min_{θ∈Θ} D( p(θ | w, x) ‖ p(θ) )
  • Does not require prior/empirical knowledge
  • Good when very little is known about the distribution
    of the parameter θ

Pronzato & Walter, 1988
43
Topics for today
  • Introduction: information theory
  • Active learning
  • Query by committee
  • Uncertainty sampling
  • Information-based loss functions
  • Response surface technique
  • Optimal experimental design
  • A-optimal design
  • D-optimal design
  • E-optimal design
  • Sequential experimental design
  • Bayesian experimental design
  • Maximin experimental design
  • Summary

44
Related ML Problems
  • Decide where to sample data
  • (no longer a fixed menu of experiments)
  • Function Optimization
  • Find parameters that maximize a specific function
  • Feature selection
  • Select features that enable best generalization
    of data
  • Model selection
  • Select a model that best generalizes data

45
Summary
  • Active learning
  • Query by committee: distribution over the parameter; probabilistic; sequential
  • Uncertainty sampling: distribution over the parameter; distance function; sequential
  • Information-based loss functions: maximize gain; sequential
  • Response surface technique: interpolate a local function; sequential
  • Optimal experimental design
  • A-optimal design: minimize the trace of the error covariance matrix
  • D-optimal design: minimize the log determinant of the error covariance matrix
  • E-optimal design: minimize the largest eigenvalue of the error covariance matrix
  • Sequential experimental design: multiple-shot experiments; little known about the parameter
  • Bayesian experimental design: single-shot experiment; some idea of the parameter distribution
  • Maximin experimental design: single-shot experiment; little known about the parameter distribution (range known)