CSC2535 Lecture 12: Learning Multiplicative Interactions

1
CSC2535 Lecture 12: Learning Multiplicative Interactions
  • Geoffrey Hinton
  • Department of Computer Science
  • University of Toronto

2
Two different meanings of multiplicative
  • If we take two density models and multiply
    together their probability distributions at each
    point in data-space, we get a product of
    experts.
  • The product of two Gaussian experts is a
    Gaussian.
  • If we take two variables and we multiply them
    together to provide input to a third variable we
    get a multiplicative interaction.
  • The distribution of the product of two
    Gaussian-distributed variables is NOT Gaussian
    distributed. It is a heavy-tailed distribution.
    One Gaussian determines the standard deviation of
    the other Gaussian.
  • Heavy-tailed distributions are the signatures of
    multiplicative interactions between latent
    variables.
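
A quick numerical check of this claim (a NumPy sketch, not from the lecture): the product of two independent standard Gaussians has excess kurtosis of about 6, the signature of heavy tails, while a Gaussian on its own has excess kurtosis near 0.

import numpy as np

# A minimal sketch: the product of two independent Gaussian variables
# is heavy-tailed, unlike either variable on its own.
rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)
y = rng.standard_normal(1_000_000)
z = x * y  # one Gaussian scales the other

def excess_kurtosis(v):
    v = v - v.mean()
    return np.mean(v**4) / np.mean(v**2)**2 - 3.0

print(excess_kurtosis(x))  # close to 0 for a Gaussian
print(excess_kurtosis(z))  # close to 6 for the product: heavy tails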

3
The heavy-tailed world
  • The prediction errors for financial time-series
    are typically heavy-tailed. This is mainly
    because the variance is much higher in times of
    uncertainty.
  • The prediction errors made by a linear dynamical system are usually heavy-tailed on real data.
  • Occasional very weird things happen. This
    violates the conditions of the central limit
    theorem.
  • The outputs of linear filters applied to images
    are heavy-tailed.
  • Gabor filters nearly always output almost exactly
    zero. But occasionally they have large outputs.

4
But first: one final product of experts
  • Each hidden unit of a binary RBM defines an
    expert.
  • The expert is a mixture of a uniform distribution
    (if the hidden unit is off) and a multivariate
    Bernoulli in which the expert specifies the log
    odds for each visible unit.
  • The visible biases are like an additional expert.
  • We could replace each hidden unit by an HMM and
    replace the data-vector by a sequence of
    data-vectors.
  • The two binary states of an RBM hidden unit are
    then replaced by all possible paths through the
    hidden nodes of an HMM.

5
Why products of HMMs should be better than
ordinary HMMs.
  • An HMM requires 2^N hidden states to generate strings in which the past has N bits of mutual information with the future.
  • The number of hidden nodes required scales
    exponentially with the mutual information.
  • A product of k HMMs allows k times as much
    mutual information as one HMM.
  • The number of fixed-size HMMs required scales
    linearly with the mutual information.
  • So many small HMMs are much better at capturing
    componential structure through time than one big
    HMM.

6
Inference in a PoHMM
  • The HMMs are all conditionally independent given
    the data sequence.
  • So we run the forward-backward algorithm in each
    HMM separately to get the sufficient statistics
    of the posterior over paths.
  • To get statistics for CD1 learning
  • First get sufficient statistics with the data.
  • Then reconstruct the data.
  • Then get sufficient statistics with the
    reconstruction.
  • Adjust the parameters to raise the log prob of
    the data and lower the log prob of the
    reconstruction.

7
How to reconstruct from a PoHMM
  • First pick a path through the hidden nodes of
    each HMM from the posterior distribution given
    the data.
  • Then reconstruct each time step separately by
    sampling from the product of the output
    distributions specified by all the nodes chosen
    for that time step.
  • This is easy with a diagonal Gaussian output
    model or with a discrete output model.
  • It is harder with a full covariance output model
    because we need to add the inverse covariance
    matrices of the different experts.
  • It can be very hard if each node uses a mixture of Gaussians output model, so don't.
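
For the diagonal Gaussian case, here is a minimal sketch (illustrative code, not the lecture's) of sampling from the product of the output distributions chosen by the HMM nodes: precisions add, and the combined mean is the precision-weighted average of the experts' means.

import numpy as np

def sample_product_of_diag_gaussians(means, variances, rng):
    # means, variances: arrays of shape (n_experts, dim), one row per
    # output distribution chosen at this time step.
    # The product of Gaussian densities is Gaussian: precisions add, and
    # the mean is the precision-weighted average of the experts' means.
    precisions = 1.0 / variances
    combined_var = 1.0 / precisions.sum(axis=0)
    combined_mean = (precisions * means).sum(axis=0) * combined_var
    return combined_mean + np.sqrt(combined_var) * rng.standard_normal(means.shape[1])

rng = np.random.default_rng(1)
means = np.array([[0.0, 1.0], [2.0, -1.0]])       # one row per chosen HMM node
variances = np.array([[1.0, 4.0], [1.0, 1.0]])
print(sample_product_of_diag_gaussians(means, variances, rng))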

8
How well do PoHMMs work?
  • We now know how to evaluate the partition
    function of an HMM by using annealed importance
    sampling.
  • So we can finally evaluate how well they work.
  • So far they work significantly better than single HMMs (Graham Taylor's recent results).
  • For very big datasets with highly componential data, a single HMM needs a huge number of nodes.
  • PoHMMs have an exponential win in representational power, so they can have many fewer parameters.
  • This means they should be faster to train.

9
Back to multiplicative interactions
  • It is fairly easy to learn multiplicative
    interactions if all of the variables are
    observed.
  • This is possible if we control the variables used to create a training set (e.g. pose, lighting, identity).
  • It is also easy to learn energy-based models in
    which all but one of the terms in each
    multiplicative interaction are observed.
  • Inference is still easy.
  • If more than one of the terms in each
    multiplicative interaction are unobserved, the
    interactions between hidden variables make
    inference difficult.
  • Alternating Gibbs can be used if the latent
    variables form a bi-partite graph.

10
Learning how style and content interact
  • Tenenbaum and Freeman (2000) describe a model in which a style vector and a content vector interact multiplicatively to determine a data vector (e.g. an image).
  • The outer-product of the style and content
    vectors determines a set of coefficients for
    basis functions.
  • This is not at all like the way a user vector and
    a movie vector interact to determine a rating.
    The rating is the inner-product.
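
A brief NumPy illustration of the contrast (the array names and sizes are made up for the example): a rating is a single inner product, whereas the bilinear model uses every entry of the outer product of the style and content vectors as the coefficient of a basis image.

import numpy as np

rng = np.random.default_rng(0)

# Collaborative filtering: a rating is the inner product of two vectors.
user = rng.standard_normal(5)
movie = rng.standard_normal(5)
rating = user @ movie                          # a single scalar

# Bilinear style/content model: each product style[i] * content[j] is the
# coefficient of a basis image W[i, j, :], so the data vector is a sum of
# basis images weighted by the outer product of the two vectors.
I, J, D = 3, 4, 100                            # style dims, content dims, pixels
style = rng.standard_normal(I)
content = rng.standard_normal(J)
W = rng.standard_normal((I, J, D))             # basis images
image = np.einsum('i,j,ijd->d', style, content, W)
print(rating, image.shape)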

11
It is an unfortunate coincidence that the number of components in each pose vector is equal to the number of different pose vectors. The model is only really interesting if we have fewer components per style or content vector than there are style or content vectors.
12
Some ways to use the bilinear model
13
A simpler asymmetric model in which the pose
vectors and basis vectors have already been
multiplied together. Learning is easier in the
asymmetric model.
14
The other asymmetric model, in which person vectors and basis vectors have been pre-multiplied. (Figure axes: style, content.)
15
Fitting the asymmetric model
  • Each training case is a column vector labeled
    with its discrete style and content classes.
  • For multiple examples of the same style and
    content, just average the images since we are
    minimizing squared reconstruction error.
  • Concatenate all the cases that have the same
    content class to get a single matrix with as many
    columns as content classes.
  • Then perform an SVD on this matrix.

(Figure: the data matrix Y, with a block of rows for each of style 1, style 2, and style 3 and one column per content class.)
16
Decomposing the matrix
Desired decomposition: Y ≈ W B.
Throw out the columns of W and the rows of B that correspond to the smallest singular values. This makes the model non-trivial.
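
A minimal sketch of this fitting procedure on synthetic data (the sizes S, C, D and the number of kept components K are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
S, C, D, K = 3, 5, 64, 2       # styles, content classes, pixels, kept components

# Y stacks the (averaged) image for every (style, content) pair:
# one block of D rows per style, one column per content class.
Y = rng.standard_normal((S * D, C))    # stand-in for real training images

U, s, Vt = np.linalg.svd(Y, full_matrices=False)
W = U[:, :K] * s[:K]    # style-specific basis functions, stacked over styles
B = Vt[:K, :]           # one K-dimensional content vector per content class

print(np.linalg.norm(Y - W @ B))   # reconstruction error of the truncated model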
17
Fitting the symmetric model
  • Fix the style vectors and then fit the content
    vectors of the asymmetric model.
  • Then freeze the content vectors and fit the style
    vectors of the other asymmetric model.
  • Alternate until a local optimum is reached.
  • Then analytically fit the basis vectors with the
    style and pose vectors fixed.
  • See Tenenbaum and Freeman (2000) for details.

18
Higher order Boltzmann machines (Sejnowski, 1986)
  • The usual energy function is quadratic in the
    states
  • But we could use higher order interactions
  • Hidden unit h acts as a switch. When h is on, it
    switches in the pairwise interaction between unit
    i and unit j.
  • Units i and j can also be viewed as switches that
    control the pairwise interactions between j and h
    or between i and h.
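
For reference, the quadratic energy and its three-way generalization can be written as follows (standard forms, reconstructed here rather than copied from the slide images):

% Quadratic (pairwise) energy of a standard Boltzmann machine:
E(\mathbf{s}) = -\sum_{i<j} w_{ij}\, s_i s_j \;-\; \sum_i b_i s_i

% Third-order interaction term: hidden unit h gates the pairwise
% interaction between units i and j:
E(\mathbf{s}) = -\sum_{i<j,\,h} w_{ijh}\, s_i s_j s_h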

19
A higher-order Boltzmann machine with one visible
group and two hidden groups
(Figure: the two hidden groups are object-based features and a viewing transform.)
  • We can view it as a Boltzmann machine in which
    the inputs create interactions between the other
    variables.
  • This type of model is now called a conditional
    random field.
  • Inference can be hard in this model.
  • Inference is much easier with two visible groups
    and one hidden group

(Figure: the visible group is the retina-based features. Is this an I or an H?)
20
Using higher-order Boltzmann machines to model
image transformations (Memisevic and Hinton,
2007)
  • A global transformation specifies which pixel
    goes to which other pixel.
  • Conversely, each pair of similar intensity
    pixels, one in each image, votes for a particular
    global transformation.

(Figure: the three groups are image(t), image(t+1), and the image transformation.)
21
Making the reconstruction easier
  • Condition on the first image so that only one
    visible group needs to be reconstructed.
  • Given the hidden states and the previous image,
    the pixels in the second image are conditionally
    independent.

(Figure: image(t), image(t+1), and the image transformation.)
22
Roland's unfactorized model
23
The main problem with 3-way interactions
  • There are far too many of them.
  • We can reduce the number in several straightforward ways:
  • Do dimensionality reduction on each group before
    the three way interactions.
  • Use spatial locality to limit the range of the
    three-way interactions.
  • A much more interesting approach (which can be
    combined with the other two) is to factor the
    interactions so that they can be specified with
    fewer parameters.
  • This leads to a novel type of learning module.

24
Factoring three-way interactions
  • If three-way interactions are being used to model
    a nice regular multi-linear structure, we may not
    need cubically many degrees of freedom.
  • For modelling effects like viewpoint and
    illumination many fewer degrees of freedom may be
    sufficient.
  • There are many ways to factor 3-D interaction
    tensors.
  • We use factors that correspond to 3-way
    outer-products.
  • Each factor only has 3N parameters.
  • By using about N/3 factors we get quadratically
    many parameters which is the same as a simple
    weight matrix.
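
Written out (with notation assumed to match the factor pictures on the next slides), the factorization is a sum of F rank-one, three-way outer products:

% Each factor f contributes a three-way outer product with only 3N
% parameters; with F on the order of N/3 the total is of order N^2.
w_{ijh} \;=\; \sum_{f=1}^{F} w_{if}\, w_{jf}\, w_{hf}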

25
A picture of factor f
A three-way interaction tensor can be represented
as a sum of factorized three-way tensors. Each
factorized tensor is a three-way outer-product
of three vectors.
26
Another picture of factor f
The unfactored version only has a single
connection to each vertex of the factor, but it
has lots of factors.
27
Factoring the three-way interactions
unfactored
factored
How changing the binary state of unit j changes
the energy contributed by factor f.
What unit j needs to know in order to do Gibbs
sampling
28
The dynamics
  • The visible and hidden units get weighted input
    from the factors and use this input in the usual
    stochastic way.
  • They have stochastic binary states (or a
    mean-field approximation to stochastic binary
    states).
  • The factors are deterministic and implement a
    type of belief propagation. They do not have
    states.
  • Each factor computes three separate sums by
    adding up the input it gets from each separate
    group of units.
  • Then it sends the product of the summed inputs
    from two groups to the third group.

29
Belief propagation
The outgoing message at each vertex of the factor
is the product of the weighted sums at the other
two vertices.
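
A small NumPy sketch of this message computation for a set of F factors connecting three groups of binary units (the matrix names Bx, By, Wh and all sizes are illustrative, not from the lecture):

import numpy as np

rng = np.random.default_rng(0)
Nx, Ny, Nh, F = 20, 20, 10, 8        # sizes of the three groups, number of factors

Bx = rng.standard_normal((Nx, F))    # weights from group 1 into the factors
By = rng.standard_normal((Ny, F))    # weights from group 2 into the factors
Wh = rng.standard_normal((Nh, F))    # weights from the hidden group into the factors

x = rng.integers(0, 2, Nx)           # binary states of group 1
y = rng.integers(0, 2, Ny)           # binary states of group 2
h = rng.integers(0, 2, Nh)           # binary states of the hidden group

# Each factor sums the input it gets from each group separately ...
sx, sy, sh = x @ Bx, y @ By, h @ Wh  # three vectors of length F

# ... and sends the product of two of the sums, weighted by its outgoing
# weights, to each unit in the third group.
input_to_h = Wh @ (sx * sy)
input_to_x = Bx @ (sy * sh)
input_to_y = By @ (sx * sh)
print(input_to_h.shape, input_to_x.shape, input_to_y.shape)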
30
The learning
  • All the pairwise weights can be learned in the
    usual way by lowering the energy of the data and
    raising the energy of the reconstructions.

This is the input that a hidden unit receives
from a factor
This is the input that a visible unit receives
from a factor
31
A nasty numerical problem
  • In a standard Boltzmann machine the gradient of a
    weight on a training case always lies between 1
    and -1.
  • With factored three-way interactions, the
    gradient contains the product of two sums each
    of which can be large, so the gradient can
    explode.
  • We can keep a running average of each sum over
    many training cases and divide the gradient by
    this average (or its square). This helps.
  • For any particular weight, we must divide the
    gradient by the same quantity on all training
    cases to guarantee a positive correlation with
    the true gradient.
  • Updating the weights on every training case may
    also help because we get feedback faster when
    weights are blowing up.
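
One possible implementation of this stabilizer (a sketch; the class name, decay constant, and shapes are assumptions, not the lecture's code):

import numpy as np

class GradientScaler:
    # Keep a slowly adapting running average of the magnitude of each
    # factor's summed input and divide the gradient by it, so that the
    # product of two large sums cannot blow up the weight updates.

    def __init__(self, num_factors, decay=0.99):
        self.avg = np.ones(num_factors)
        self.decay = decay

    def update(self, factor_sums):
        # factor_sums: shape (batch_size, num_factors)
        self.avg = self.decay * self.avg + \
                   (1.0 - self.decay) * np.abs(factor_sums).mean(axis=0)

    def scale(self, grad):
        # grad: shape (num_units, num_factors). The same divisor is used for
        # every training case, so the scaled gradient stays positively
        # correlated with the true gradient.
        return grad / self.avg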

32
Roland's experiments
33
A principle of hierarchical systems
  • Each level in the hierarchy should not try to
    micro-manage the level below.
  • Instead, it should create an objective function
    for the level below and leave the level below to
    optimize it.
  • This allows the fine details of the solution to
    be decided locally where the detailed information
    is available.
  • Objective functions are a good way to do
    abstraction.

34
Why hierarchical generative models require
lateral interactions
(Figure: a square and its pose parameters.)
  • One way to maintain the constraints between the
    parts is to generate each part very accurately
  • But this would require a lot of communication
    bandwidth.
  • Sloppy top-down specification of the parts is
    less demanding
  • but it messes up relationships between features
  • so use redundant features and use lateral
    interactions to clean up the mess.
  • Each transformed feature helps to locate the
    others
  • This allows a noisy channel

(Figure: sloppy top-down activation of parts; features with top-down support; clean-up using known interactions.)
It's like soldiers on a parade ground.
35
Restricted Boltzmann Machines with multiplicative
interactions
  • In a standard RBM, the states of the hidden units
    determine the effective biases of the visible
    units.
  • In a multiplicative RBM the hidden units can
    control pair-wise interactions between visible
    units.
  • For modelling motion we have two different images
    at two different times.
  • For modelling static images we let each factor
    see the pixels twice.

36
A picture of factor f
Each layer is a scaled version of the same inverse covariance matrix. The basis inverse covariance matrix of factor f is specified as an outer product, with typical term w_if * w_jf. So each active hidden unit contributes a scalar, w_hf, times the inverse covariance matrix of factor f.
37
Factoring the three-way interactions
factored
unfactored
squared output of linear filter
Top-down gain control.
38
An advantage of modeling correlations between
pixels rather than pixel intensities
  • During generation, a vertical edge unit can
    turn off the horizontal interpolation in a region
    without worrying about exactly where the
    intensity discontinuity will be.
  • This gives some translational invariance
  • It also gives a lot of invariance to brightness
    and contrast.
  • So the vertical edge unit is like a complex
    cell.
  • By modulating the correlations between pixels rather than the pixel intensities, the generative model can still allow interpolation parallel to the edge.
  • This is important for denoising images.

39
Keeping perceptual inference tractable
  • We want the hiddens to modulate the correlations
    between the visibles.
  • So the visibles are NOT conditionally independent
    given the hiddens.
  • But we also want the hiddens to be conditionally
    independent given the visibles so that bottom-up
    inference is simple and fast.
  • It's not obvious that two hiddens remain conditionally independent when they are both connected to the same factor.
  • They would not be conditionally independent if we
    replaced the factors by a layer of hidden units
    that had their own states.

40
Why the hiddens remain conditionally independent
  • If the states of the visibles are fixed, the
    hiddens make additive contributions to the energy
    (i.e. multiplicative contributions to the
    probability).
  • This ensures that they are conditionally
    independent.

41
Where does the asymmetry in the independence
relations of visibles and hiddens come from?
  • Each clique contains two visibles and one hidden.
  • The states in any one group are independent given
    the states of the other two groups.

42
Summary of the learning procedure
  • Activate the hidden units using the squared
    outputs of the linear filters defined by each
    factor.
  • Then compute the positive-phase correlations
    between units and the messages they receive from
    factors.
  • With the hidden states fixed, run a few
    mean-field iterations to update the pixels.
  • The effective pairwise weights between pixels are
    determined by the hidden activities.
  • Then activate the hidden units again.
    Compute the negative-phase correlations between
    units and the messages they receive from factors.

43
Learning a factored Boltzmann Machine
1. Start with a training vector on the visible units.
2. Update all of the hidden units in parallel via the factors.
3. Compute the effective lateral weights between visible units.
4. Repeatedly update all of the visible units in parallel using mean-field updates (with hiddens fixed) to get a reconstruction.
5. Update all of the hidden units again.
(Figure: units i, j, k at t = 0 (data) and t = 1 (reconstruction).)
The learning signal is the difference in the
pairwise correlations between unit states and
factor messages with data and with
reconstructions.
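
Putting the five steps together, here is a compact sketch of one CD-1 update for a factored model with binary pixels, under the assumed energy E(v, h) = -sum_f (B^T v)_f^2 (W^T h)_f. The function and variable names are illustrative; this is not the code behind these slides.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v_data, B, W, bh, bv, lr=1e-3, mf_steps=5, rng=None):
    # One CD-1 update for a factored model with binary pixels in which each
    # factor sees the pixels twice and the hiddens gate their interactions.
    rng = rng or np.random.default_rng()

    # 1. Activate the hiddens from the squared outputs of the linear filters.
    fv = v_data @ B                               # filter outputs, shape (F,)
    ph_data = sigmoid(fv**2 @ W.T + bh)
    h = (rng.random(ph_data.shape) < ph_data).astype(float)

    # 2-4. With the hiddens fixed, mean-field updates of the visibles; the
    # factor message to pixel i is B[i, f] * (B^T v)_f * (W^T h)_f.
    gate = h @ W                                  # (W^T h)_f, shape (F,)
    v = v_data.astype(float).copy()
    for _ in range(mf_steps):
        v = sigmoid(B @ ((v @ B) * gate) + bv)

    # 5. Activate the hiddens again on the reconstruction.
    fv_rec = v @ B
    ph_rec = sigmoid(fv_rec**2 @ W.T + bh)

    # Learning signal: difference of the correlations between unit states
    # and the messages they receive from the factors.
    dW = np.outer(ph_data, fv**2) - np.outer(ph_rec, fv_rec**2)
    dB = np.outer(v_data, fv * (ph_data @ W)) - np.outer(v, fv_rec * (ph_rec @ W))
    W += lr * dW
    B += lr * dB
    bh += lr * (ph_data - ph_rec)
    bv += lr * (v_data - v)
    return B, W, bh, bv

# Toy usage: D pixels, F factors, H hidden units.
rng = np.random.default_rng(0)
D, F, H = 16, 8, 4
B = 0.01 * rng.standard_normal((D, F))
W = 0.01 * rng.standard_normal((H, F))
v = rng.integers(0, 2, D).astype(float)
cd1_step(v, B, W, np.zeros(H), np.zeros(D), rng=rng)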
44
Linear filters learned by the factors on MNIST
digits
45
Linear-Linear blow-up
  • If the visible units are rectified linear instead
    of binary, the energy can easily be made
    infinitely negative by making the activities very
    big and positive in a set of units that mutually
    excite one another.
  • To keep the energy under control we need
    inhibitory interactions between the visibles that
    have a super-quadratic energy term
  • We can achieve this using the same factoring
    trick. It gives factors that look very like
    inhibitory inter-neurons.

46
Three-way interactions between pixels
The outgoing message at each vertex of the factor
is the product of the weighted sums at the other
two vertices.
If the real weights are all negative, the belief
propagation can be implemented by having positive
weights going into a factor and negative weights
coming out.
47
Factoring the three-way interactions between
pixels
All of these weights are negative
squared output of linear filter
This weight needs to keep its minus sign.
These weights do not need the minus sign because
the output of the linear filter is squared.
48
How to create the reconstructions for linear
visible units
  • The top-down effects from the factors determine
    an inverse covariance matrix.
  • We could invert this matrix and then sample from
    the full covariance Gaussian.
  • Alternatively, we could start at the data and do
    stochastic gradient descent in the energy
    function. For CD learning we only need to get
    closer to equilibrium than the data.
  • Hybrid Monte Carlo is a good way to do the
    stochastic gradient descent.
  • If we include inhibitory inter-neurons, the covariance matrix of the visibles changes every time the visible states change, so we need to use Hybrid Monte Carlo.
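
For the first option, a short sketch of sampling from a Gaussian that is specified by its precision (inverse covariance) matrix, using a Cholesky factor so the covariance never has to be formed explicitly (toy values, not the lecture's code):

import numpy as np

def sample_gaussian_from_precision(J, b, rng):
    # Sample from N(mu, J^{-1}) where J is the precision (inverse covariance)
    # determined by the top-down factor effects and mu solves J mu = b.
    L = np.linalg.cholesky(J)            # J = L L^T
    mu = np.linalg.solve(J, b)
    z = rng.standard_normal(J.shape[0])
    return mu + np.linalg.solve(L.T, z)  # covariance = L^{-T} L^{-1} = J^{-1}

rng = np.random.default_rng(0)
J = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # toy precision matrix
b = np.array([1.0, -1.0])
print(sample_gaussian_from_precision(J, b, rng))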

49
A very similar model
  • If we make all of the hidden units deterministic,
    it is possible to learn nice topographic maps
    using contrastive divergence.
  • Hybrid Monte Carlo is used to do the reconstructions.
  • The outputs of linear filters are squared before being sent to the second hidden layer.
  • Is this just a coincidence?

50
How to learn a topographic map
The outputs of the linear filters are squared and
locally pooled. This makes it cheaper to put
filters that are violated at the same time next
to each other.
(Figure: image → linear filters with global connectivity → pooled squared filters with local connectivity; the cost of a second violation in a pool is lower than the cost of the first.)
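
A toy sketch of the pooling computation described above (the grid size, neighbourhood, and square-root pooling cost are all illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
D, side = 256, 10                     # pixels, and a 10 x 10 grid of filters
filters = rng.standard_normal((side * side, D))
image = rng.standard_normal(D)

# Square the outputs of the linear filters ...
sq = (filters @ image) ** 2

# ... and pool them locally on the grid of filters (wrap-around 3 x 3
# neighbourhood here, purely for illustration).
grid = sq.reshape(side, side)
pooled = sum(np.roll(np.roll(grid, dx, axis=0), dy, axis=1)
             for dx in (-1, 0, 1) for dy in (-1, 0, 1))

# A concave cost of the pooled activity makes the second violation in a
# neighbourhood cheaper than the first, which rewards placing filters that
# are violated at the same time next to each other.
cost = np.sqrt(pooled).sum()
print(cost)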
52
Two models with similar energy functions
(Figure: energy E as a function of the squared output of a linear filter, for the contrastive backprop model and the factored RBM model.)
53
THE END