Title: Bayesian Learning for Conditional Models
1. Bayesian Learning for Conditional Models
- Alan Qi
- MIT CSAIL
- September, 2005
- Joint work with T. Minka, Z. Ghahramani, M. Szummer, and R. W. Picard
2. Motivation
- Two types of graphical models: generative and conditional
- Conditional models
  - Make no assumptions about data generation
  - Enable the use of flexible features
- Learning conditional models: estimating (distributions of) model parameters
  - Maximum likelihood approaches: overfitting
  - Bayesian learning
3. Outline
- Background
  - Conditional models for independent and relational data classification
  - Bayesian learning
- Bayesian classification and Predictive ARD
  - Feature selection
  - Fast kernel learning
- Bayesian conditional random fields
  - Contextual object recognition/segmentation
- Conclusions
4. Outline
- Background
  - Conditional models
  - Bayesian learning
- Bayesian classification and Predictive ARD
- Bayesian conditional random fields
- Conclusions
5. Graphical Models
- Conditional models
  - Logistic/probit regression: classification of independent data
  - Conditional random fields: model relational data, such as natural language and images
6. Bayesian learning
- Simple in principle: given prior distributions and data likelihoods, estimate the posterior distributions of the model parameters or the predictive posterior of a new data point (written out below).
- Difficult in practice: calculating these posterior distributions.
  - Randomized methods: Markov chain Monte Carlo, importance sampling
  - Deterministic approximations: variational methods, expectation propagation
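For reference, the two quantities in the first bullet are the standard Bayesian posterior and predictive distribution (standard notation, not taken from the slides):

$$
p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{\int p(\mathcal{D} \mid \mathbf{w}')\, p(\mathbf{w}')\, d\mathbf{w}'},
\qquad
p(t_\ast \mid \mathbf{x}_\ast, \mathcal{D}) = \int p(t_\ast \mid \mathbf{x}_\ast, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w}.
$$

The normalizing and predictive integrals are what make exact Bayesian learning hard and motivate the sampling and deterministic approximations listed above.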
7. Outline
- Background
- Bayesian classification and Predictive ARD
  - Feature selection
  - Fast kernel learning
- Bayesian conditional random fields
- Conclusions
8. Goal
- Task 1: Classify high-dimensional datasets with many irrelevant features, e.g., normal vs. cancer microarray data.
- Task 2: Build sparse Bayesian kernel classifiers for fast test performance.
9. Part 1 Roadmap
- Automatic relevance determination (ARD)
  - Risk of overfitting by optimizing hyperparameters
- Predictive ARD by expectation propagation (EP)
  - Approximate prediction error
  - EP approximation
- Experiments
- Conclusions
10. Bayesian Classification Model
- Labels t, inputs X, parameters w.
- Likelihood for the data set: a probit model, in which each label's likelihood is a cumulative distribution function of a standard Gaussian evaluated at the classifier output (reconstructed below).
- Prior on the classifier weights w (reconstructed below).
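A plausible reconstruction of the two missing formulas, in my notation (the symbols Ψ, φ, and α are not taken from the slides):

$$
p(\mathbf{t} \mid \mathbf{w}, X) = \prod_{i=1}^{n} \Psi\!\bigl(t_i\, \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_i)\bigr),
\qquad
p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{d} \mathcal{N}\!\bigl(w_d;\, 0,\, \alpha_d^{-1}\bigr),
$$

where Ψ(·) is the cumulative distribution function of a standard Gaussian, t_i ∈ {−1, +1}, φ(x_i) is the feature vector of input x_i, and α_d is the prior precision of weight w_d.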
11. Evidence and Predictive Distribution
- The evidence, i.e., the marginal likelihood of the hyperparameters (reconstructed below).
- The predictive posterior distribution of the label for a new input (reconstructed below).
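Written out in the notation introduced above (a reconstruction of equations that did not survive the conversion):

$$
p(\mathbf{t} \mid X, \boldsymbol{\alpha}) = \int p(\mathbf{t} \mid \mathbf{w}, X)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w},
\qquad
p(t_\ast \mid \mathbf{x}_\ast, \mathbf{t}, X, \boldsymbol{\alpha}) = \int p(t_\ast \mid \mathbf{x}_\ast, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{t}, X, \boldsymbol{\alpha})\, d\mathbf{w}.
$$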
12. Automatic Relevance Determination (ARD)
- Give the classifier weights independent Gaussian priors whose variances control how far away from zero each weight is allowed to go.
- Maximize the marginal likelihood of the model with respect to these prior hyperparameters.
- Outcome: many of the hyperparameters go to infinity, which naturally prunes irrelevant features in the data (see the sketch below).
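In the precision parameterization used above (again my notation), ARD is type-II maximum likelihood:

$$
\hat{\boldsymbol{\alpha}} = \arg\max_{\boldsymbol{\alpha}}\; p(\mathbf{t} \mid X, \boldsymbol{\alpha}),
\qquad
\alpha_d \to \infty \;\Longrightarrow\; \mathcal{N}\bigl(w_d;\, 0,\, \alpha_d^{-1}\bigr) \to \delta(w_d),
$$

so features whose hyperparameter diverges are effectively removed from the classifier.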
13. Two Types of Overfitting
- Classical maximum likelihood
  - Optimizing the classifier weights w can directly fit noise in the data, resulting in a complicated model.
- Type II maximum likelihood (ARD)
  - Optimizing the hyperparameters corresponds to choosing which variables are irrelevant. Choosing one out of exponentially many models can also overfit if we maximize the model marginal likelihood.
14. Risk of Optimizing
15. Predictive-ARD
- Choose the model with the best estimated predictive performance instead of the most probable model.
- Expectation propagation (EP) estimates the leave-one-out predictive performance without performing any expensive cross-validation.
16. Estimate Predictive Performance
- Predictive posterior given a test data point.
- EP can estimate the predictive leave-one-out error probability, where q(w | t\i) is the approximate posterior obtained by leaving out the ith label.
- EP can also estimate the predictive leave-one-out error count (both estimates are sketched below).
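A plausible form of the two estimates (my notation; q(w | t\i) is the EP posterior with the i-th site removed, as in the slide):

$$
\hat{\varepsilon}_{\text{prob}} = \frac{1}{n} \sum_{i=1}^{n} \Bigl( 1 - \int p(t_i \mid \mathbf{x}_i, \mathbf{w})\, q(\mathbf{w} \mid \mathbf{t}_{\backslash i})\, d\mathbf{w} \Bigr),
\qquad
\hat{\varepsilon}_{\text{count}} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\!\Bigl[ \int p(t_i \mid \mathbf{x}_i, \mathbf{w})\, q(\mathbf{w} \mid \mathbf{t}_{\backslash i})\, d\mathbf{w} < \tfrac{1}{2} \Bigr].
$$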
17. Expectation Propagation in a Nutshell
- Approximate a probability distribution by simpler parametric terms.
- Each approximation term lives in an exponential family (e.g., Gaussian).
18. EP in a Nutshell
- Three key steps:
  - Deletion step: approximate the leave-one-out predictive posterior for the ith point.
  - Moment matching: minimize a KL divergence between the leave-one-out posterior combined with the exact term and the new approximation.
  - Inclusion step: incorporate the updated approximate term back into the posterior.
- The key observation: we can use the approximate predictive posterior, obtained in the deletion step, for model selection. No extra computation! (A code sketch of the moment-matching update follows.)
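To make the moment-matching step concrete, here is a minimal numpy sketch (not the authors' code) of one assumed-density-filtering sweep for the probit model above; full EP keeps a per-point approximate term and repeatedly deletes, refits, and re-includes it. All function and variable names are mine.

```python
# Minimal ADF sketch for the Bayesian probit classifier (illustration only).
import numpy as np
from scipy.stats import norm

def adf_probit_sweep(X, t, prior_var=1.0):
    """One moment-matching pass: absorb each probit term into a Gaussian q(w) = N(m, V).

    X: (n, d) inputs; t: (n,) labels in {-1, +1}; prior is w ~ N(0, prior_var * I).
    """
    d = X.shape[1]
    m = np.zeros(d)              # posterior mean, initialised at the prior mean
    V = prior_var * np.eye(d)    # posterior covariance, initialised at the prior
    for x, ti in zip(X, t):
        Vx = V @ x
        s2 = 1.0 + x @ Vx        # variance of w^T x under q, plus the probit unit noise
        z = ti * (m @ x) / np.sqrt(s2)
        rho = norm.pdf(z) / np.clip(norm.cdf(z), 1e-12, None)
        m = m + Vx * (ti * rho / np.sqrt(s2))                # rank-one mean update (moment matching)
        V = V - np.outer(Vx, Vx) * (rho * (z + rho) / s2)    # rank-one covariance update
    return m, V

# Toy usage: two linearly separable clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.5, (20, 2)), rng.normal(1.0, 0.5, (20, 2))])
t = np.concatenate([-np.ones(20), np.ones(20)])
m, V = adf_probit_sweep(X, t)
print("approximate posterior mean of w:", m)
```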
19. Comparison of Different Model Selection Criteria for ARD Training
The estimated leave-one-out error probabilities and counts are better correlated with the test error than the evidence and the sparsity level.
- 1st row: test error
- 2nd row: estimated leave-one-out error probability
- 3rd row: estimated leave-one-out error counts
- 4th row: evidence (model marginal likelihood)
- 5th row: fraction of selected features
20. Gene Expression Classification
- Task: classify gene expression datasets into different categories, e.g., normal vs. cancer.
- Challenge: thousands of genes are measured in the microarray data, and probably only a small subset of them is correlated with the classification task.
21. Classifying Leukemia Data
- The task: distinguish acute myeloid leukemia (AML) from acute lymphoblastic leukemia (ALL).
- The dataset: 47 and 25 samples of type ALL and AML, respectively, with 7129 features per sample.
- The dataset was randomly split 100 times into 36 training and 36 testing samples.
22. Classifying Colon Cancer Data
- The task: distinguish normal from cancer samples.
- The dataset: 22 normal and 40 cancer samples with 2000 features per sample.
- The dataset was randomly split 100 times into 50 training and 12 testing samples.
- SVM results are from Li et al. (2002).
23. Bayesian Sparse Kernel Classifiers
- Use feature/kernel expansions defined on the training data points (see the sketch below).
- Predictive-ARD-EP trains a classifier that depends on only a small subset of the training set.
- Result: fast test performance.
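One common way to set up such an expansion, consistent with the probit model above (a sketch, not necessarily the slides' exact definition):

$$
f(\mathbf{x}) = w_0 + \sum_{i=1}^{n} w_i\, k(\mathbf{x}, \mathbf{x}_i),
\qquad
p(t \mid \mathbf{x}, \mathbf{w}) = \Psi\!\bigl(t\, f(\mathbf{x})\bigr),
$$

with an ARD prior on each w_i; once most of the w_i are pruned, only the few surviving kernel functions k(·, x_i) have to be evaluated at test time.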
24. Test Error Rates and Numbers of Relevance or Support Vectors on the Breast Cancer Dataset
- 50 partitionings of the data were used. All methods use the same Gaussian kernel with kernel width 5. The trade-off parameter C of the SVM is chosen via 10-fold cross-validation for each partition.
25. Part 1 Conclusions
- Maximizing the marginal likelihood can lead to overfitting in the model space if there are a lot of features.
- We propose Predictive-ARD based on EP for
  - feature selection
  - sparse kernel learning
- In practice, Predictive-ARD works better than traditional ARD.
26. Outline
- Background
- Bayesian classification and Predictive ARD
- Bayesian conditional random fields
  - Contextual object recognition/segmentation
- Conclusions
28. Bayesian Conditional Networks
- Bayesian training to avoid overfitting
- Need efficient training
- The exact posterior of w
- The Gaussian approximate posterior of w
29. Learning the Parameters w by ML/MAP
- Maximum likelihood (ML): maximize the data likelihood (reconstructed below).
- Maximum a posteriori (MAP): additionally place a Gaussian prior on w.
- ML/MAP problem: overfitting to the noise in the data.
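A plausible form of the missing objectives, using edge potentials indexed by edges k = (i, j) as on the later partition-function slide (the symbols are my reconstruction):

$$
p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}) = \frac{1}{Z(\mathbf{w})} \prod_{k=(i,j)} g_k(t_i, t_j, \mathbf{x}; \mathbf{w}),
\qquad
Z(\mathbf{w}) = \sum_{\mathbf{t}} \prod_{k=(i,j)} g_k(t_i, t_j, \mathbf{x}; \mathbf{w}),
$$

$$
\hat{\mathbf{w}}_{\text{ML}} = \arg\max_{\mathbf{w}} \sum_{n} \log p(\mathbf{t}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w}),
\qquad
\hat{\mathbf{w}}_{\text{MAP}} = \arg\max_{\mathbf{w}} \Bigl[ \log \mathcal{N}(\mathbf{w}; \mathbf{0}, \sigma^2 I) + \sum_{n} \log p(\mathbf{t}^{(n)} \mid \mathbf{x}^{(n)}, \mathbf{w}) \Bigr].
$$

The exact posterior mentioned on the previous slide is then p(w | data) ∝ p(w) ∏_n p(t⁽ⁿ⁾ | x⁽ⁿ⁾, w), which BCRF training approximates with a Gaussian q(w).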
30. EP in a Nutshell
- Approximate a probability distribution by simpler parametric terms (Minka 2001):
  - for Bayesian networks
  - for Markov networks
  - for conditional classification
  - for conditional random fields
- Each approximation term lives in an exponential family (such as Gaussian or multinomial).
31. EP in a Nutshell (2)
- Each approximate term is chosen so that the new posterior minimizes a KL divergence by moment matching, where the leave-one-out approximation is the current posterior with that term removed (see below).
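A standard way to write this objective (my notation; g_i is the exact i-th term and g̃_i its approximation): with the leave-one-out approximation q^{\i}(w) ∝ q(w)/g̃_i(w) and the tilted distribution p̂_i(w) ∝ q^{\i}(w) g_i(w),

$$
q^{\text{new}} = \arg\min_{q \,\in\, \text{exp. family}} \mathrm{KL}\bigl( \hat{p}_i \,\|\, q \bigr),
\qquad
\tilde{g}_i^{\text{new}}(\mathbf{w}) \propto \frac{q^{\text{new}}(\mathbf{w})}{q^{\backslash i}(\mathbf{w})},
$$

and for an exponential-family q this minimization reduces to matching moments.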
32. EP in a Nutshell (3)
- Three key steps:
  - Deletion step: approximate the leave-one-out predictive posterior for the ith point.
  - Moment matching: minimize the KL divergence above by moment matching (assumed-density filtering).
  - Inclusion step: incorporate the updated approximate term back into the posterior.
33. Two Difficulties for Bayesian Training
- The partition function Z(w) appears in the denominator of the likelihood, so regular EP does not apply.
- The partition function is a complicated function of w.
34. Turn Denominator to Numerator (1)
- Transformed EP, with the usual three steps:
  - Deletion
  - ADF (moment matching)
  - Inclusion
35. Turn Denominator to Numerator (2)
- Power EP, with the same three steps:
  - Deletion
  - ADF (moment matching)
  - Inclusion
- Power EP minimizes an α-divergence (see below).
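For reference, the α-divergence that Power EP minimizes, in its standard form (the particular power used for the 1/Z(w) term is not shown on the slide):

$$
D_{\alpha}(p \,\|\, q) = \frac{\int \alpha\, p(\mathbf{w}) + (1-\alpha)\, q(\mathbf{w}) - p(\mathbf{w})^{\alpha}\, q(\mathbf{w})^{1-\alpha}\, d\mathbf{w}}{\alpha(1-\alpha)}.
$$

Raising a term to a negative power is how a factor that appears in the denominator, such as 1/Z(w), can be handled like an ordinary numerator term.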
36. Approximating the Partition Function
- The parameters w and the labels t are intertwined in Z(w), where k = (i, j) is the index of an edge.
- The joint distribution of w and t is handled with a factorized approximation (sketched below).
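A plausible reading of the missing equations (my notation, matching the edge potentials above): Z(w) couples w with a sum over all labelings, so the joint of w and t is approximated with a fully factorized form,

$$
Z(\mathbf{w}) = \sum_{\mathbf{t}} \prod_{k=(i,j)} g_k(t_i, t_j, \mathbf{x}; \mathbf{w}),
\qquad
q(\mathbf{w}, \mathbf{t}) \approx q(\mathbf{w}) \prod_{j} q_j(t_j),
$$

with a Gaussian factor for w and a discrete (multinomial) factor for each label.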
37. Flatten Approximation Structure
Increased efficiency, stability, and accuracy!
38. Model Averaging for Prediction
- Bayesian training provides a set of estimated models.
- Bayesian model averaging combines predictions from all the models to eliminate overfitting (see below).
- Approximate model averaging: weighted belief propagation.
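In symbols (a sketch; q(w) denotes the Gaussian posterior approximation produced by BCRF training):

$$
p(\mathbf{t}_\ast \mid \mathbf{x}_\ast, \mathcal{D}) = \int p(\mathbf{t}_\ast \mid \mathbf{x}_\ast, \mathbf{w})\, p(\mathbf{w} \mid \mathcal{D})\, d\mathbf{w} \;\approx\; \int p(\mathbf{t}_\ast \mid \mathbf{x}_\ast, \mathbf{w})\, q(\mathbf{w})\, d\mathbf{w},
$$

which the slides approximate with a weighted form of belief propagation rather than plugging in a single point estimate of w.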
39. Results on Synthetic Data
- Data generation: first randomly sample the inputs x, fix the true parameters w, and then sample the labels t.
- Graphical structure: four nodes in a simple loop.
- Comparison of a maximum-likelihood-trained CRF with BCRFs: 10 trials, with 100 training examples and 1000 test examples.
40. FAQs Labeling
- The dataset consists of 47 files belonging to 7 Usenet newsgroup FAQs. Each file has multiple lines, which can be the header (H), a question (Q), an answer (A), or the tail (T).
- Task: label the lines that are questions or answers.
41. FAQs Features
42. Results
BCRFs outperform MAP-trained CRFs on FAQs labeling with high statistical significance.
43. Ink Application: Analyzing Handwritten Organization Charts
- Parsing a graph into different components: containers vs. connectors.
44. Comparing Results
Results from Bayes Point Machine
Results from MAP-trained CRF
Results from BCRF
45. Results
BCRF outperforms ML- and MAP-trained CRFs. BCRF-ARD further improves test accuracy. The results are averaged over 20 runs.
46. Part 2 Conclusions
- Bayesian CRFs model relational data.
- BCRFs improve the predictive performance over ML- and MAP-trained CRFs, especially through approximate model averaging.
- ARD for CRFs enables feature selection.
- More applications: image segmentation, joint scene analysis, etc.
47. Outline
- Background
- Bayesian classification and Predictive ARD
- Bayesian conditional random fields
- Conclusions
48. Conclusions
- Predictive ARD by EP
  - Gene expression classification: outperformed traditional ARD and SVMs with feature selection.
- Bayesian conditional random fields
  - FAQs labeling and joint diagram analysis: beats ML- and MAP-trained CRFs.
- Future work
49. END
50. Appendix: Sequential Updates
- EP approximates the true likelihood terms by Gaussian virtual observations.
- Based on Gaussian virtual observations, the classification model becomes a regression model.
- Then we can achieve efficient sequential updates without maintaining and updating a full covariance matrix (Faul & Tipping, 2002), as sketched below.
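In the notation used earlier, a plausible form of the virtual-observation idea (my reconstruction, not the slides' equations):

$$
\Psi\!\bigl(t_i\, \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_i)\bigr) \;\approx\; \tilde{g}_i(\mathbf{w}) \propto \exp\!\Bigl( -\tfrac{1}{2 v_i} \bigl( \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_i) - \mu_i \bigr)^2 \Bigr),
$$

so each classification term acts like a linear-regression observation μ_i with noise variance v_i. The approximate posterior over w then has the conjugate form of Bayesian linear regression, and basis functions can be added, deleted, or updated one at a time, in the spirit of Faul and Tipping's fast marginal-likelihood updates, without ever forming the full covariance matrix.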