Title: CSC2535 Lecture 12: Learning Multiplicative Interactions
1 CSC2535 Lecture 12: Learning Multiplicative Interactions
- Geoffrey Hinton
- Department of Computer Science
- University of Toronto
2 Two different meanings of multiplicative
- If we take two density models and multiply together their probability distributions at each point in data-space, we get a product of experts.
- The product of two Gaussian experts is a Gaussian.
- If we take two variables and we multiply them together to provide input to a third variable, we get a multiplicative interaction.
- The distribution of the product of two Gaussian-distributed variables is NOT Gaussian distributed. It is a heavy-tailed distribution: one Gaussian determines the standard deviation of the other Gaussian (see the sketch below).
- Heavy-tailed distributions are the signatures of multiplicative interactions between latent variables.
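A minimal numerical check of the last two points (a sketch, not from the slides; NumPy assumed). Each factor is Gaussian, but their product has large excess kurtosis, i.e. heavy tails:

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    y = rng.standard_normal(1_000_000)
    z = x * y  # one Gaussian effectively scales the other

    def excess_kurtosis(v):
        v = v - v.mean()
        return (v**4).mean() / (v**2).mean()**2 - 3.0

    print(excess_kurtosis(x))  # ~0 for a Gaussian
    print(excess_kurtosis(z))  # ~6: clearly heavy-tailed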
3 The heavy-tailed world
- The prediction errors for financial time-series are typically heavy-tailed. This is mainly because the variance is much higher in times of uncertainty.
- The prediction errors made by linear dynamical systems are usually heavy-tailed on real data.
- Occasional very weird things happen. This violates the conditions of the central limit theorem.
- The outputs of linear filters applied to images are heavy-tailed.
- Gabor filters nearly always output almost exactly zero, but occasionally they have large outputs.
4 But first: one final product of experts
- Each hidden unit of a binary RBM defines an expert.
- The expert is a mixture of a uniform distribution (if the hidden unit is off) and a multivariate Bernoulli in which the expert specifies the log odds for each visible unit.
- The visible biases are like an additional expert.
- We could replace each hidden unit by an HMM and replace the data-vector by a sequence of data-vectors.
- The two binary states of an RBM hidden unit are then replaced by all possible paths through the hidden nodes of an HMM.
5 Why products of HMMs should be better than ordinary HMMs
- An HMM requires 2^N hidden states to generate strings in which the past has N bits of mutual information with the future (see the counting sketch below).
- The number of hidden nodes required scales exponentially with the mutual information.
- A product of k HMMs allows k times as much mutual information as one HMM.
- The number of fixed-size HMMs required scales linearly with the mutual information.
- So many small HMMs are much better at capturing componential structure through time than one big HMM.
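The counting argument, written out (notation assumed; the hidden state at any time step is the only channel between past and future, so its size bounds the mutual information):

    \[
      I(\text{past};\text{future}) \;\le\; \log_2 K
      \quad\Longrightarrow\quad
      K \;\ge\; 2^{N} \ \text{states for } N \text{ bits},
    \]
    \[
      I_{\text{product of } k \text{ HMMs}} \;\le\; k \log_2 M
      \quad\Longrightarrow\quad
      k \;\ge\; N / \log_2 M \ \text{fixed-size (}M\text{-state) HMMs}.
    \]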
6 Inference in a PoHMM
- The HMMs are all conditionally independent given the data sequence.
- So we run the forward-backward algorithm in each HMM separately to get the sufficient statistics of the posterior over paths.
- To get statistics for CD1 learning:
  - First get sufficient statistics with the data.
  - Then reconstruct the data.
  - Then get sufficient statistics with the reconstruction.
- Adjust the parameters to raise the log prob of the data and lower the log prob of the reconstruction.
7 How to reconstruct from a PoHMM
- First pick a path through the hidden nodes of each HMM from the posterior distribution given the data.
- Then reconstruct each time step separately by sampling from the product of the output distributions specified by all the nodes chosen for that time step.
- This is easy with a diagonal Gaussian output model or with a discrete output model (see the sketch below).
- It is harder with a full-covariance output model because we need to add the inverse covariance matrices of the different experts.
- It can be very hard if each node uses a mixture-of-Gaussians output model. So don't.
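A minimal sketch (assumed names, not the authors' code) of the per-time-step reconstruction when every HMM has a discrete output model: the product of experts over symbols is just the renormalized elementwise product of the chosen states' output distributions.

    import numpy as np

    def reconstruct_symbol(output_dists, rng=None):
        """output_dists: list of length-V probability vectors, one per HMM,
        each taken from the hidden node chosen for this time step."""
        rng = np.random.default_rng() if rng is None else rng
        log_p = np.sum([np.log(d + 1e-12) for d in output_dists], axis=0)
        p = np.exp(log_p - log_p.max())
        p /= p.sum()                      # renormalize the product of experts
        return rng.choice(len(p), p=p)    # sample the reconstructed symbol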
8 How well do PoHMMs work?
- We now know how to evaluate the partition function of an HMM by using annealed importance sampling.
- So we can finally evaluate how well they work.
- So far they work significantly better than single HMMs (Graham Taylor's recent results).
- For very big datasets with highly componential data, a single HMM needs a huge number of nodes.
- PoHMMs have an exponential win in representational power, so they can have many fewer parameters.
- This means they should be faster to train.
9 Back to multiplicative interactions
- It is fairly easy to learn multiplicative interactions if all of the variables are observed.
- This is possible if we control the variables used to create a training set (e.g. pose, lighting, identity).
- It is also easy to learn energy-based models in which all but one of the terms in each multiplicative interaction are observed.
- Inference is still easy.
- If more than one of the terms in each multiplicative interaction are unobserved, the interactions between hidden variables make inference difficult.
- Alternating Gibbs can be used if the latent variables form a bipartite graph.
10 Learning how style and content interact
- Tenenbaum and Freeman (2000) describe a model in which a style vector and a content vector interact multiplicatively to determine a datavector (e.g. an image).
- The outer product of the style and content vectors determines a set of coefficients for basis functions.
- This is not at all like the way a user vector and a movie vector interact to determine a rating. The rating is the inner product (see the equations below).
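The contrast in symbols (notation assumed, following Tenenbaum and Freeman's symmetric bilinear form):

    \[
      y_i \;=\; \sum_{j}\sum_{k} W_{ijk}\, s_j\, c_k
      \qquad \text{(bilinear: the outer product } s_j c_k \text{ weights the basis functions } W_{ijk}\text{)}
    \]
    \[
      r \;=\; \sum_{j} u_j m_j
      \qquad \text{(collaborative filtering: a rating is just the inner product of user and movie vectors)}
    \]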
11 It is an unfortunate coincidence that the number of components in each pose vector is equal to the number of different pose vectors. The model is only really interesting if we have fewer components per style or content vector than style or content vectors.
12 Some ways to use the bilinear model
13 A simpler asymmetric model, in which the pose vectors and basis vectors have already been multiplied together. Learning is easier in the asymmetric model.
14 The other asymmetric model, in which the person vectors and basis vectors have been pre-multiplied. (Figure axes: style, content.)
15 Fitting the asymmetric model
- Each training case is a column vector labeled with its discrete style and content classes.
- For multiple examples of the same style and content, just average the images, since we are minimizing squared reconstruction error.
- Concatenate all the cases that have the same content class to get a single matrix with as many columns as content classes.
- Then perform SVD on this matrix.
(Figure: the matrix Y, with blocks labeled style1, style2, style3.)
16 Decomposing the matrix
Desired decomposition: Y ≈ W B.
Throw out the columns of W and the rows of B that correspond to the smallest singular values. This makes the model non-trivial.
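A minimal sketch (assumed shapes, not the authors' code) of fitting the asymmetric model by truncated SVD: Y is the matrix assembled on the previous slide, and only the J largest singular values are kept.

    import numpy as np

    def fit_asymmetric(Y, J):
        """Y: (num_styles * pixels) x num_content_classes data matrix.
        Returns W (style-specific basis) and B (content vectors) with Y ~ W @ B."""
        U, s, Vt = np.linalg.svd(Y, full_matrices=False)
        W = U[:, :J] * s[:J]      # keep columns for the J largest singular values
        B = Vt[:J, :]             # one J-dimensional content vector per class
        return W, B

    # Usage with random stand-in data: 3 styles of 64-pixel images, 5 content classes.
    Y = np.random.randn(3 * 64, 5)
    W, B = fit_asymmetric(Y, J=2)
    print(np.linalg.norm(Y - W @ B))   # reconstruction error of the rank-2 fit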
17 Fitting the symmetric model
- Fix the style vectors and then fit the content vectors of the asymmetric model.
- Then freeze the content vectors and fit the style vectors of the other asymmetric model.
- Alternate until a local optimum is reached.
- Then analytically fit the basis vectors with the style and pose vectors fixed.
- See Tenenbaum and Freeman (2000) for details.
18 Higher-order Boltzmann machines (Sejnowski, 1986)
- The usual energy function is quadratic in the states.
- But we could use higher-order interactions (standard forms of both energies are sketched below).
- Hidden unit h acts as a switch. When h is on, it switches in the pairwise interaction between unit i and unit j.
- Units i and j can also be viewed as switches that control the pairwise interactions between j and h or between i and h.
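The two energies were shown as equations on the original slide; the standard forms (binary states s, weights w, bias terms omitted) are assumed here:

    \[
      E(\mathbf{s}) \;=\; -\sum_{i<j} w_{ij}\, s_i s_j
      \qquad\text{(usual quadratic energy)}
    \]
    \[
      E(\mathbf{s}) \;=\; -\sum_{i<j<h} w_{ijh}\, s_i s_j s_h
      \qquad\text{(third-order: } s_h \text{ gates the } i\text{--}j \text{ interaction)}
    \]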
19 A higher-order Boltzmann machine with one visible group and two hidden groups
(Figure: retina-based features, object-based features, viewing transform. "Is this an I or an H?")
- We can view it as a Boltzmann machine in which the inputs create interactions between the other variables.
- This type of model is now called a conditional random field.
- Inference can be hard in this model.
- Inference is much easier with two visible groups and one hidden group.
20 Using higher-order Boltzmann machines to model image transformations (Memisevic and Hinton, 2007)
- A global transformation specifies which pixel goes to which other pixel.
- Conversely, each pair of similar-intensity pixels, one in each image, votes for a particular global transformation (see the energy below).
(Figure: image(t), image(t+1), image transformation.)
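In symbols (notation assumed; biases omitted), the unfactored three-way energy that couples the two images and the transformation units is

    \[
      E(\mathbf{y},\mathbf{h};\mathbf{x})
      \;=\; -\sum_{i}\sum_{j}\sum_{k} w_{ijk}\, x_i\, y_j\, h_k ,
    \]
    % where x = image(t), y = image(t+1), and h are the transformation units;
    % conditioning on x makes this a higher-order conditional model.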
21 Making the reconstruction easier
- Condition on the first image so that only one visible group needs to be reconstructed.
- Given the hidden states and the previous image, the pixels in the second image are conditionally independent.
(Figure: image(t), image(t+1), image transformation.)
22 Roland's unfactorized model
23 The main problem with 3-way interactions
- There are far too many of them.
- We can reduce the number in several straightforward ways:
  - Do dimensionality reduction on each group before the three-way interactions.
  - Use spatial locality to limit the range of the three-way interactions.
- A much more interesting approach (which can be combined with the other two) is to factor the interactions so that they can be specified with fewer parameters.
- This leads to a novel type of learning module.
24 Factoring three-way interactions
- If three-way interactions are being used to model a nice regular multi-linear structure, we may not need cubically many degrees of freedom.
- For modelling effects like viewpoint and illumination, many fewer degrees of freedom may be sufficient.
- There are many ways to factor 3-D interaction tensors.
- We use factors that correspond to three-way outer products (see the equation below).
- Each factor only has 3N parameters.
- By using about N/3 factors we get quadratically many parameters, which is the same as a simple weight matrix.
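In symbols (notation assumed; the slide's figures carry the same content): each factor f contributes a rank-one, three-way outer product, so

    \[
      w_{ijh} \;=\; \sum_{f=1}^{F} w_{if}\, w_{jf}\, w_{hf}.
    \]
    % Each factor has 3N parameters (one N-vector per group), so F ~ N/3 factors
    % give roughly 3N * N/3 = N^2 parameters -- the same order as an ordinary
    % weight matrix, instead of the N^3 entries of the full tensor.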
25 A picture of factor f
A three-way interaction tensor can be represented as a sum of factorized three-way tensors. Each factorized tensor is a three-way outer product of three vectors.
26 Another picture of factor f
The unfactored version only has a single connection to each vertex of the factor, but it has lots of factors.
27 Factoring the three-way interactions
(Figure: unfactored vs. factored diagrams, with the equation for how changing the binary state of unit j changes the energy contributed by factor f, i.e. what unit j needs to know in order to do Gibbs sampling.)
28 The dynamics
- The visible and hidden units get weighted input from the factors and use this input in the usual stochastic way.
- They have stochastic binary states (or a mean-field approximation to stochastic binary states).
- The factors are deterministic and implement a type of belief propagation. They do not have states.
- Each factor computes three separate sums by adding up the input it gets from each separate group of units.
- Then it sends the product of the summed inputs from two groups to the third group (see the sketch after the next slide).
29 Belief propagation
The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.
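A minimal sketch (assumed variable names, not the authors' code) of the deterministic factor computation on the last two slides: each factor forms one weighted sum per group and sends each group the product of the other two sums, re-weighted by that group's own weights.

    import numpy as np

    def factor_messages(x, y, h, Wx, Wy, Wh):
        """x, y, h: unit states of the three groups (1-D arrays).
        Wx, Wy, Wh: (units x factors) weight matrices into the factor vertices."""
        sx, sy, sh = x @ Wx, y @ Wy, h @ Wh    # one weighted sum per factor and group
        msg_to_x = (sy * sh) @ Wx.T            # product of the other two sums,
        msg_to_y = (sx * sh) @ Wy.T            # sent back through each group's weights
        msg_to_h = (sx * sy) @ Wh.T
        return msg_to_x, msg_to_y, msg_to_h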
30 The learning
- All the pairwise weights can be learned in the usual way by lowering the energy of the data and raising the energy of the reconstructions (the gradient is sketched below).
(Equation captions from the slide: the input that a hidden unit receives from a factor; the input that a visible unit receives from a factor.)
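The gradient written out (notation assumed: groups x, y, h with weight vectors w^x, w^y, w^h into factor f):

    \[
      -\frac{\partial E}{\partial w^{x}_{if}}
      \;=\; x_i \Big(\sum_j y_j w^{y}_{jf}\Big)\Big(\sum_k h_k w^{h}_{kf}\Big)
      \;=\; x_i\, s^{y}_f\, s^{h}_f ,
    \]
    \[
      \Delta w^{x}_{if} \;\propto\;
      \big\langle x_i\, s^{y}_f s^{h}_f \big\rangle_{\text{data}}
      \;-\;
      \big\langle x_i\, s^{y}_f s^{h}_f \big\rangle_{\text{recon}} .
    \]
    % i.e. raise the statistic on the data and lower it on the reconstruction.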
31 A nasty numerical problem
- In a standard Boltzmann machine the gradient of a weight on a training case always lies between 1 and -1.
- With factored three-way interactions, the gradient contains the product of two sums, each of which can be large, so the gradient can explode.
- We can keep a running average of each sum over many training cases and divide the gradient by this average (or its square). This helps (see the sketch below).
- For any particular weight, we must divide the gradient by the same quantity on all training cases to guarantee a positive correlation with the true gradient.
- Updating the weights on every training case may also help because we get feedback faster when weights are blowing up.
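A rough sketch (assumed hyper-parameters, not the authors' code) of the rescaling trick on this slide: keep a running average of the magnitude of each factor's summed input and divide the gradient by its square, using the same slowly-changing divisor for every training case.

    import numpy as np

    class GradientScaler:
        def __init__(self, num_factors, decay=0.99):
            self.avg = np.ones(num_factors)   # running average of |factor sum|
            self.decay = decay

        def scale(self, grad, factor_sums):
            # grad: (units, factors) raw gradient; factor_sums: (factors,) current sums
            self.avg = self.decay * self.avg + (1 - self.decay) * np.abs(factor_sums)
            return grad / self.avg**2         # same divisor applied to every case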
32 Roland's experiments
33 A principle of hierarchical systems
- Each level in the hierarchy should not try to micro-manage the level below.
- Instead, it should create an objective function for the level below and leave the level below to optimize it.
- This allows the fine details of the solution to be decided locally, where the detailed information is available.
- Objective functions are a good way to do abstraction.
34 Why hierarchical generative models require lateral interactions
(Figure: a square and its pose parameters; sloppy top-down activation of parts; features with top-down support; clean-up using known interactions.)
- One way to maintain the constraints between the parts is to generate each part very accurately.
- But this would require a lot of communication bandwidth.
- Sloppy top-down specification of the parts is less demanding, but it messes up the relationships between features, so use redundant features and use lateral interactions to clean up the mess.
- Each transformed feature helps to locate the others.
- This allows a noisy channel.
- It's like soldiers on a parade ground.
35 Restricted Boltzmann Machines with multiplicative interactions
- In a standard RBM, the states of the hidden units determine the effective biases of the visible units.
- In a multiplicative RBM the hidden units can control pairwise interactions between visible units.
- For modelling motion we have two different images at two different times.
- For modelling static images we let each factor see the pixels twice (a sketch of the resulting energy follows).
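An assumed form of the resulting energy (the slide gives it pictorially): with visible units v, hidden units h, filter weights C into the factors and gating weights P from factors to hiddens, each factor sees the pixels twice, so its filter output is squared and gated by the hiddens:

    \[
      E(\mathbf{v},\mathbf{h}) \;=\;
      -\sum_{f}\Big(\sum_i C_{if}\, v_i\Big)^{2}\sum_{k} P_{fk}\, h_k .
    \]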
36 A picture of factor f
Each layer is a scaled version of the same inverse covariance matrix. The basis inverse covariance matrix is specified as an outer product (its typical term is given in the figure). So each active hidden unit contributes a scalar times the inverse covariance matrix of factor f.
37 Factoring the three-way interactions
(Figure: unfactored vs. factored diagrams; the hidden unit gates the squared output of a linear filter, providing top-down gain control.)
38 An advantage of modelling correlations between pixels rather than pixel intensities
- During generation, a vertical-edge unit can turn off the horizontal interpolation in a region without worrying about exactly where the intensity discontinuity will be.
- This gives some translational invariance.
- It also gives a lot of invariance to brightness and contrast, so the vertical-edge unit is like a complex cell.
- By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.
- This is important for denoising images.
39 Keeping perceptual inference tractable
- We want the hiddens to modulate the correlations between the visibles.
- So the visibles are NOT conditionally independent given the hiddens.
- But we also want the hiddens to be conditionally independent given the visibles, so that bottom-up inference is simple and fast.
- It's not obvious that two hiddens remain conditionally independent when they are both connected to the same factor.
- They would not be conditionally independent if we replaced the factors by a layer of hidden units that had their own states.
40 Why the hiddens remain conditionally independent
- If the states of the visibles are fixed, the hiddens make additive contributions to the energy (i.e. multiplicative contributions to the probability).
- This ensures that they are conditionally independent (see the sketch below).
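A sketch of the argument, using the assumed notation of the energy above: with v fixed, the filter outputs s_f = sum_i C_{if} v_i are constants, so

    \[
      E(\mathbf{v},\mathbf{h}) \;=\; -\sum_k h_k \sum_f P_{fk}\, s_f^{2}
      \;=\; -\sum_k h_k\, a_k(\mathbf{v}),
    \]
    % which is additive over the hidden units. Hence
    \[
      p(\mathbf{h}\mid\mathbf{v}) \;\propto\; e^{-E}
      \;=\; \prod_k e^{\,h_k a_k(\mathbf{v})},
      \qquad
      p(h_k = 1 \mid \mathbf{v}) = \sigma\!\big(a_k(\mathbf{v})\big).
    \]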
41 Where does the asymmetry in the independence relations of visibles and hiddens come from?
- Each clique contains two visibles and one hidden.
- The states in any one group are independent given the states of the other two groups.
42 Summary of the learning procedure
- Activate the hidden units using the squared outputs of the linear filters defined by each factor.
- Then compute the positive-phase correlations between units and the messages they receive from factors.
- With the hidden states fixed, run a few mean-field iterations to update the pixels.
- The effective pairwise weights between pixels are determined by the hidden activities.
- Then activate the hidden units again.
- Compute the negative-phase correlations between units and the messages they receive from factors.
43 Learning a factored Boltzmann Machine
1. Start with a training vector on the visible units.
2. Update all of the hidden units in parallel via the factors.
3. Compute the effective lateral weights between visible units.
4. Repeatedly update all of the visible units in parallel using mean-field updates (with hiddens fixed) to get a reconstruction.
5. Update all of the hidden units again.
(Figure: units i, j, k at t = 0 (data) and t = 1 (reconstruction).)
The learning signal is the difference in the pairwise correlations between unit states and factor messages with data and with reconstructions (a code sketch follows).
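A minimal sketch of one CD-1 step for the factored model described above (assumed names, shapes, and sign conventions; not the authors' code). C holds the filter weights from pixels to factors, P gates factors with hidden units, and visibles and hiddens are treated as binary with mean-field values.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v_data, C, P, lr=1e-3, mf_steps=5):
        # Step 2: hidden activities from squared filter outputs (positive phase)
        s_pos = v_data @ C                       # one filter output per factor
        h_pos = sigmoid(s_pos**2 @ P)            # p(h = 1 | v) uses squared outputs
        # Steps 3-4: mean-field reconstruction of the visibles with hiddens fixed;
        # the hiddens fix a top-down gain per factor, i.e. effective lateral weights.
        gain = P @ h_pos
        v = v_data.copy()
        for _ in range(mf_steps):
            lateral_input = 2.0 * ((v @ C) * gain) @ C.T
            v = sigmoid(lateral_input)
        # Step 5: hidden activities from the reconstruction (negative phase)
        s_neg = v @ C
        h_neg = sigmoid(s_neg**2 @ P)
        # Learning signal: correlations between unit states and factor messages,
        # for data minus reconstruction.
        grad_C = 2.0 * (np.outer(v_data, s_pos * (P @ h_pos))
                        - np.outer(v, s_neg * (P @ h_neg)))
        grad_P = np.outer(s_pos**2, h_pos) - np.outer(s_neg**2, h_neg)
        C += lr * grad_C
        P += lr * grad_P
        return C, P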
44 Linear filters learned by the factors on MNIST digits
45 Linear-linear blow-up
- If the visible units are rectified linear instead of binary, the energy can easily be made infinitely negative by making the activities very big and positive in a set of units that mutually excite one another.
- To keep the energy under control we need inhibitory interactions between the visibles that have a super-quadratic energy term.
- We can achieve this using the same factoring trick. It gives factors that look very like inhibitory inter-neurons.
46 Three-way interactions between pixels
The outgoing message at each vertex of the factor is the product of the weighted sums at the other two vertices.
If the real weights are all negative, the belief propagation can be implemented by having positive weights going into a factor and negative weights coming out.
47 Factoring the three-way interactions between pixels
(Figure: the factored diagram for pixel-pixel interactions. All of the three-way weights are negative; the weight on the squared output of the linear filter keeps its minus sign, while the weights going into the factor do not need the minus sign because the output of the linear filter is squared.)
48 How to create the reconstructions for linear visible units
- The top-down effects from the factors determine an inverse covariance matrix.
- We could invert this matrix and then sample from the full-covariance Gaussian (see the sketch below).
- Alternatively, we could start at the data and do stochastic gradient descent in the energy function. For CD learning we only need to get closer to equilibrium than the data.
- Hybrid Monte Carlo is a good way to do the stochastic gradient descent.
- If we include inhibitory inter-neurons, the covariance matrix of the visibles changes every time the visible states change, so we need to use hybrid Monte Carlo.
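A minimal sketch (assumed, not the authors' code) of the first option: draw a sample from the zero-mean Gaussian whose inverse covariance (precision) matrix P comes from the top-down effects, without explicitly forming P^{-1}.

    import numpy as np

    def sample_given_precision(P, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        L = np.linalg.cholesky(P)               # P = L L^T
        z = rng.standard_normal(P.shape[0])
        # Solving L^T x = z gives x with covariance (L L^T)^{-1} = P^{-1}.
        return np.linalg.solve(L.T, z)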
49 A very similar model
- If we make all of the hidden units deterministic, it is possible to learn nice topographic maps using contrastive divergence.
- Hybrid Monte Carlo is used to do the reconstructions.
- The outputs of linear filters are squared before being sent to the second hidden layer.
- Is this just a coincidence?
50 How to learn a topographic map
The outputs of the linear filters are squared and locally pooled. This makes it cheaper to put filters that are violated at the same time next to each other.
(Figure: image -> linear filters (global connectivity, cost of first violation) -> pooled squared filters (local connectivity, cost of second violation).)
51 (No transcript)
52 Two models with similar energy functions
(Figure: energy E plotted against the squared output of a linear filter, for the contrastive backprop model and for the factored RBM model.)
53 THE END