CSC2535: Advanced Machine Learning Lecture 11 Learning by maximizing agreement between outputs

1
CSC2535 Advanced Machine Learning Lecture 11
Learning by maximizing agreement between outputs
  • Geoffrey Hinton

2
The aims of unsupervised learning
  • We would like to extract a representation of the
    sensory input that is useful for later
    processing.
  • We want to do this without requiring labeled
    data.
  • Prior ideas about what the internal
    representation should look like ought to be
    helpful. So what would we like in a
    representation?
  • Hidden causes that explain high-order
    correlations?
  • Constraints that often hold?
  • A low-dimensional manifold that contains all the
    data?
  • Properties that are invariant across space or
    time?

3
Temporally invariant properties
  • Consider a rigid object that is moving relative
    to the retina.
  • Its retinal image changes in predictable ways.
  • Its true 3-D shape stays exactly the same. It is
    invariant over time.
  • Its angular momentum also stays the same if it is
    in free fall.
  • Properties that are invariant over time are
    usually interesting.
  • Similarly for properties invariant over space.

4
Learning temporal invariances
[Diagram: two modules, each a stack of hidden layers mapping an image to non-linear features; one module sees the image at time t, the other at time t+1, and the objective is to maximize agreement between the two feature outputs.]
5
Some obvious measures of agreement
  • Minimize the average squared difference between
    the two output vectors.
  • This is easy to achieve. Both modules always
    output the same vector.
  • That is not what we meant by agreement!
  • Minimize the variance of the difference in the
    two output vectors relative to the variance of
    each output vector separately.
  • This works well for learning linear
    transformations.
  • It is a version of canonical correlation.
  • There is a subtle reason why it does not work
    well for learning non-linear transformations.
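
A short worked note (my own, not from the slide): for zero-mean outputs normalized to unit variance,

$$\mathrm{Var}(a-b) = 2\,(1 - \rho_{ab}),$$

so minimizing the relative variance of the difference is the same as maximizing the correlation $\rho_{ab}$, which is exactly what canonical correlation computes when the two transformations are linear.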

6
A new way to get a teaching signal
  • Each module uses the output of the other module
    as the teaching signal.
  • This does not work if the two modules can see the
    same data. They just report one component of the
    data and agree perfectly.
  • It also fails if a module always outputs a
    constant. The modules can just ignore the data
    and agree on what constant to output.
  • We need a sensible definition of the amount of
    agreement between the outputs.

7
Mutual information
  • Two variables, a and b, have high mutual
    information if you can predict a lot about one
    from the other.

$$I(a;b) = H(a) + H(b) - H(a,b)$$

(the mutual information is the sum of the individual entropies minus the joint entropy)
  • There is also an asymmetric way to define mutual
    information: I(a;b) = H(a) - H(a|b) = H(b) - H(b|a).
  • Compute derivatives of I w.r.t. the feature
    activities, then backpropagate to get derivatives
    for all the weights in the network.
  • The network at time t is using the network at
    time t+1 as its teacher (and vice versa).
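
As an illustration of the backpropagation step, here is a minimal sketch in PyTorch (my own code, not the lecture's; the tiny weight-shared architecture and scalar outputs are assumptions, and the objective is the shared-signal form described on a later slide):

```python
import torch

# Minimal sketch: two weight-shared views of the same network see
# neighbouring time frames, and we backpropagate the derivatives of a
# scalar agreement objective into all the weights.
net = torch.nn.Sequential(
    torch.nn.Linear(256, 64), torch.nn.Tanh(), torch.nn.Linear(64, 1))

def neg_mutual_info(x_t, x_t1):
    a = net(x_t).squeeze(-1)    # feature at time t
    b = net(x_t1).squeeze(-1)   # feature at time t+1
    # Shared-signal form of the objective: I = 0.5 * log V(a+b) / V(a-b).
    return -0.5 * torch.log(torch.var(a + b) / torch.var(a - b))

x_t, x_t1 = torch.randn(100, 256), torch.randn(100, 256)
neg_mutual_info(x_t, x_t1).backward()  # each copy teaches the other
```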

8
Some advantages of mutual information
  • If the modules output constants the mutual
    information is zero.
  • If the modules each output a vector, the mutual
    information is maximized by making the components
    of each vector be as independent as possible.
  • Mutual information exactly captures what we mean
    by agreeing.

9
A problem
  • We can never have more mutual information between
    the two output vectors than there is between the
    two input vectors.
  • So why not just use the input vector as the
    output?
  • We want to preserve as much mutual information as
    possible whilst also achieving something else:
  • Dimensionality reduction?
  • A simple form for the prediction of one output
    from the other?

10
Simple forms for the relationship
  • Assumption: the output of module a equals the
    output of module b plus noise.
  • Alternative assumption: a and b are both noisy
    versions of the same underlying signal.
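
In symbols (my own notation for the two assumptions):

$$\text{(1)}\ \ a = b + \epsilon \qquad\qquad \text{(2)}\ \ a = s + \epsilon_a,\ \ b = s + \epsilon_b .$$

Under the shared-signal assumption (2), the scalar form of the agreement objective used by Becker and Hinton is

$$I = \tfrac{1}{2}\,\log \frac{V(a+b)}{V(a-b)},$$

since $a+b$ is dominated by the shared signal while $a-b$ contains only noise.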

11
Learning temporal invariances
[Diagram: the same two-module architecture: the image at time t and the image at time t+1 each pass through hidden layers to non-linear features; the mutual information between the two feature outputs is maximized, and the derivatives are backpropagated through both modules.]
12
Spatially invariant properties
  • Consider a smooth surface covered in random dots
    that is viewed from two different directions.
  • Each image is just a set of random dots.
  • A stereo pair of images has disparity that
    changes smoothly over space. Nearby regions of
    the image pair have very similar disparities.

[Figure: a surface viewed by the left eye and the right eye, with the plane of fixation marked.]
13
Maximizing mutual information between a local
region and a larger context
[Diagram: five modules, each with its own hidden layers, look at neighbouring patches of the left-eye and right-eye images of a surface. A contextual prediction with weights w1, w2, w3, w4 combines the outputs of the four surrounding modules, and the objective is to maximize the MI between that prediction and the output of the middle module.]
14
How well does it work?
  • If we use weight sharing between modules and
    plenty of hidden units, it works really well.
  • It extracts the depth of the surface fairly
    accurately.
  • It simultaneously learns the optimal weights of
    -1/6, 4/6, 4/6, -1/6 for interpolating the
    depths of the context to predict the depth at the
    middle module.
  • If the data is noisy or the modules are
    unreliable, it learns a more robust interpolator
    that uses smaller weights in order not to amplify
    noise.
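
The interpolation weights can be checked quickly; here is a sketch (my own, assuming the four context modules sit at positions -2, -1, +1, +2 and the middle module at 0):

```python
import numpy as np

# Check that -1/6, 4/6, 4/6, -1/6 are the cubic-interpolation weights for
# predicting the depth at position 0 from depths at -2, -1, +1, +2:
# evaluate each Lagrange basis polynomial at 0.
nodes = np.array([-2.0, -1.0, 1.0, 2.0])
weights = [np.prod([(0 - m) / (n - m) for m in nodes if m != n]) for n in nodes]
print(np.round(weights, 4))  # [-0.1667  0.6667  0.6667 -0.1667]
```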

15
But what about discontinuities?
  • Real surfaces are mostly smooth but also have
    sharp discontinuities in depth.
  • How can we preserve the high mutual information
    between local depth and contextual depth?
  • Discontinuities cause occasional high residual
    errors. The Gaussian model of residuals requires
    high variance to accommodate these large errors.

16
A simple mixture approach
  • We assume that there are continuity cases in
    which there is high MI and discontinuity cases
    in which there is no MI.
  • The variance of the residual is only computed on
    the continuity cases so it can stay small.
  • The residual can be used to compute the posterior
    probability of each type of case.
  • Aim to maximize the mixing proportion of the
    continuity cases times the MI in those cases.
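
A minimal sketch of the posterior computation (my own; the small-variance Gaussian for continuity residuals and the flat outlier density for discontinuities are illustrative assumptions):

```python
import numpy as np

# Residuals on "continuity" cases follow a small-variance Gaussian;
# "discontinuity" cases get a broad (here uniform) outlier density.
# pi is the mixing proportion of continuity cases.
def continuity_posterior(residual, pi=0.8, sigma=0.1, outlier_density=0.05):
    gauss = np.exp(-residual**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return pi * gauss / (pi * gauss + (1 - pi) * outlier_density)

print(continuity_posterior(0.05))  # small residual -> posterior near 1
print(continuity_posterior(1.0))   # large residual -> posterior near 0
```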

17
Mixtures of expert interpolators
  • Instead of just giving up on discontinuity cases,
    we can use a different interpolator that ignores
    the surface beyond the discontinuity.
  • To predict the depth at c from a and b,
    extrapolate: c ≈ 2b - a.
  • To choose this interpolator, find the location of
    the discontinuity.

[Figure: five neighbouring surface locations a, b, c, d, e, with a depth discontinuity somewhere among them.]
18
The mixture of interpolators net
  • There are five interpolators, each with its own
    controller.
  • Each controller is a neural net that looks at the
    outputs of all 5 modules and learns to detect a
    discontinuity at a particular location.
  • The exception is the controller for the full
    interpolator, which checks that there is no
    discontinuity.
  • The mixture of expert interpolators trains the
    controllers, the interpolators, and the local
    depth modules all together.

19
Mutual Information with multi-dimensional output
  • For a multidimensional Gaussian, the entropy is
    given (up to an additive constant) by the log of
    the determinant of the covariance matrix; the
    determinant measures the volume of the Gaussian.
  • If we use the shared-signal assumption to measure
    the information between the outputs of two
    modules, we get the expression below.
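
In symbols (standard results, my notation): the entropy of a $d$-dimensional Gaussian with covariance $\Sigma$ is

$$H = \tfrac{1}{2}\,\log\!\big((2\pi e)^d\,\lvert\Sigma\rvert\big),$$

and under the shared-signal assumption the multi-dimensional analogue of the objective becomes

$$I = \tfrac{1}{2}\,\log \frac{\lvert\Sigma_{a+b}\rvert}{\lvert\Sigma_{a-b}\rvert}.$$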

20
Optimizing non-linear transformations to maximize
mutual information between multi-dimensional
outputs
  • Assume that each output is a multi-dimensional
    Gaussian and also assume that the joint
    distribution of both outputs is Gaussian.
  • If we back-propagate the derivatives of this MI,
    we get bizarre results. (The same problems occur
    with one-dimensional outputs, but they are less
    obvious in 1-D.)
  • What is wrong?

21
Beware of Gaussian assumptions
  • We need to maximize the true entropy of the
    outputs, but we actually maximize its Gaussian
    estimate, which is an upper bound, so there can be
    a gap between the bound and the quantity we care
    about.
  • We need to minimize the entropy of the residual,
    and we actually minimize its Gaussian estimate;
    because this too is an upper bound, pushing it
    down drags the true entropy with it, so the two
    stay linked.
  • Maximizing an upper bound encourages the network
    to make the bound looser instead of increasing the
    true entropy.
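
The fact behind the gap/linked distinction (a standard result, my phrasing): among all distributions with covariance $\Sigma$, the Gaussian has maximal entropy, so

$$H(p) \;\le\; \tfrac{1}{2}\,\log\!\big((2\pi e)^{d}\,\lvert\Sigma_p\rvert\big).$$

Minimizing the right-hand side forces the true entropy down, but maximizing it can simply widen the gap.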
22
Violating the Gaussian Assumption (experiments by
Russ Salakhutdinov)
  • Suppose we use pairs of images that only have one
    scalar in common (the orientation of a face).
  • Suppose we let each module have two-dimensional
    output.
  • What does it do?

In the experiment, a1 is uncorrelated with a2, so
the determinant of the output covariance is big; but
the true entropy of a is low because the outputs lie
on a one-dimensional curve in the 2-D space. The
distribution is extremely non-Gaussian.
23
A lucky escape
  • What if we do a fixed non-linear expansion of the
    input and then learn a linear mapping from the
    non-linear expansion to the output?
  • A linear mapping cannot change the
    Gaussian-ness of a distribution (i.e., the ratio
    of entropy to variance).
  • So it cannot cheat by making the bound looser.
  • It is easy to find linear mappings that
    optimize quadratic constraints.
  • We can use the kernel trick to allow a huge
    non-linear expansion.

[Diagram: image → fixed non-linear hidden layer → adaptive linear mapping → output.]
24
Kernel Canonical Correlation (Bach and Jordan)
  • Canonical correlation finds a fixed linear
    transformation of each input to maximize the
    correlation of the outputs.
  • It can be kernelized by using a kernel in input
    space to allow efficient computation of the best
    linear mapping in a very high-dimensional
    non-linear expansion of the input space.
  • The Gaussian-ness of the distribution in the high
    dimensional space is not affected by adapting the
    linear mapping.

25
Slow Feature Analysis (Berkes & Wiskott; Wiskott &
Sejnowski)
  • Use three consecutive time frames from a fake
    video sequence as the two inputs: (t-1, t) and
    (t, t+1).
  • The sequence is made from a large, still, natural
    image by translating, expanding, and rotating a
    square window and then pixelating to get
    sequences of 16x16 images.
  • Two 256-pixel images are reduced to 100
    dimensions using PCA, then non-linearly expanded
    by taking pairwise products of components. This
    provides the 5,050-dimensional input to one module.

26
The SFA objective function
The solution can be found by solving a
generalized eigenvalue problem.
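
The standard SFA formulation (Wiskott & Sejnowski, 2002; my notation): for each output component $y_j(t)$, minimize the slowness

$$\Delta(y_j) = \big\langle \dot y_j^{\,2} \big\rangle_t \quad \text{subject to} \quad \langle y_j \rangle_t = 0,\;\; \langle y_j^2 \rangle_t = 1,\;\; \langle y_j y_{j'} \rangle_t = 0 \ \ (j' < j).$$

For linear functions $y = Wx$ of the (expanded) input, this reduces to the generalized eigenvalue problem $\langle \dot x \dot x^{\top} \rangle\, w = \lambda\, \langle x x^{\top} \rangle\, w$, whose smallest eigenvalues give the slowest features.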
27
The slow features
  • They have a lot of similarities to the features
    found in the first stage of visual cortex.
  • They can be displayed by showing the pair of
    temporally adjacent images that excite them most
    and the pair that inhibit them most.

28
The most excitatory pair of images and the most
inhibitory pair of images for some slow features
29
(No Transcript)
30
(No Transcript)
31
Relationship to linear dynamical system
[Diagram: the image at time t and the image at time t+1 are each mapped to linear features; a linear model (which could be the identity plus noise) predicts the features at time t+1 from the features at time t (the past). We predict in the feature domain, so the model cannot cheat.]
32
A way to learn non-linear transformations that
maximize agreement between the outputs of two
modules
  • We want to explain why we observe particular
    pairs of images rather than other pairings of the
    same set of images.
  • This captures the non-iid-ness of the data.
  • We can formulate this probabilistically using
    disagreement energies.

33
An energy-based model of agreement
[Diagram: two networks A and B, each with hidden layers, map their inputs to codes a and b; when the two inputs come from the same case c, the codes are required to agree.]
34
It's the same cost as symmetric SNE!
  • Model the joint probability of picking pairs of
    images. Temporal or spatial adjacency is now used
    to define a set of desired probabilities for
    pairs.
  • In the model, the joint probability of a pair is
    proportional to the exponential of minus the
    squared distance between the codes for i and j.
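
In symbols (standard symmetric SNE, my notation):

$$q_{ij} = \frac{\exp\!\big(-\lVert y_i - y_j\rVert^2\big)}{\sum_{k \neq l} \exp\!\big(-\lVert y_k - y_l\rVert^2\big)}, \qquad C = \sum_{i \neq j} p_{ij}\,\log \frac{p_{ij}}{q_{ij}},$$

where the $p_{ij}$ are the desired pair probabilities obtained from temporal or spatial adjacency and $y_i$ is the code for image $i$.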

35
The forces acting on the output vectors
  • Output vectors from a correct pair are pulled
    towards each other with a force that depends on
    their squared difference.
  • Output vectors from an incorrect pair are
    repelled with a force that falls off as the
    vectors get far apart, relative to the correct
    pairs.
36
Combining symmetric SNE with a feedforward neural
net
  • The aim of the net is to make the codes similar
    for the pairs it is given.
  • Use pairs of face images that have similar
    orientations and scales but are otherwise quite
    different.
  • Use a feedforward net to map the image to a 2-D
    code.
  • The SNE derivatives are back-propagated through
    the net.
  • This regularizes the embedding and also makes it
    easy to apply to new data.
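
A minimal sketch of this setup (my own code; the net sizes and the exact normalization of the desired probabilities are assumptions):

```python
import torch

# Backpropagate the symmetric-SNE cost through a feedforward net that
# maps face images to 2-D codes.
net = torch.nn.Sequential(
    torch.nn.Linear(256, 100), torch.nn.Tanh(), torch.nn.Linear(100, 2))

def sne_loss(images, P):
    """P[i, j]: desired pair probabilities (zero diagonal, sums to 1)."""
    Y = net(images)                          # N x 2 codes
    d2 = torch.cdist(Y, Y).pow(2)            # squared code distances
    logits = -d2 - 1e9 * torch.eye(len(Y))   # exclude i == j pairs
    Q = torch.softmax(logits.flatten(), 0).view_as(d2)
    return (P * (P.clamp_min(1e-12) / Q.clamp_min(1e-12)).log()).sum()

images = torch.randn(8, 256)
P = torch.rand(8, 8)
P = P + P.t(); P.fill_diagonal_(0); P = P / P.sum()
sne_loss(images, P).backward()
```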

[Diagram: Face i and Face j pass through the same feedforward net to produce Code i and Code j.]
37
[Images: example face pairs at a large scale and at a small scale.]
38
Each color is for a different band of
orientations (from -45 to 45)
39
Each color is for a different scale (from small
to large)
40
A non-probabilistic version
  • Hadsell, Chopra and LeCun (2006) use a
    non-probabilistic version of NCA.
  • They need to use a complicated heuristic to force
    the outputs from dissimilar pairs to be far
    apart.
  • They get similar results when they map images of
    objects to a low dimensional space.

41
Neighborhood Components Analysis
  • The idea is to map datapoints to a low-dimensional
    space in such a way that nearest-neighbor
    classification works well.
  • If we restrict the mapping from inputs to outputs
    to be linear, we get an alternative to Fisher's
    Linear Discriminant Analysis.
  • LDA maximizes the ratio of between-class variance
    to within-class variance.
  • This is the wrong thing to do if the classes
    naturally form extended low-dimensional manifolds.
42
An objective function for NCA
$$p_{ij} = \frac{\exp\!\big(-\lVert f(x_i) - f(x_j)\rVert^2\big)}{\sum_{k \neq i} \exp\!\big(-\lVert f(x_i) - f(x_k)\rVert^2\big)}, \qquad \text{maximize } \sum_i \sum_{j:\,c_j = c_i} p_{ij}$$

where $f$ maps the high-D input vector $x_i$ to a low-D output vector and $c_i$ is the class of case $i$.
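
A minimal sketch of this objective in code (my own; the linear map f is the linear-NCA case, and an MLP can be swapped in for the non-linear version discussed on the next slide):

```python
import torch

# NCA: maximize the expected number of correctly classified points under
# the soft nearest-neighbour rule in the low-D code space.
f = torch.nn.Linear(100, 2)   # linear NCA; swap in an MLP for non-linear NCA

def nca_loss(X, labels):
    Y = f(X)                                  # low-D codes
    d2 = torch.cdist(Y, Y).pow(2)
    logits = -d2 - 1e9 * torch.eye(len(Y))    # a point cannot pick itself
    P = torch.softmax(logits, dim=1)          # p_ij, rows sum to 1
    same = (labels[:, None] == labels[None, :]).float()
    return -(P * same).sum()                  # maximize same-class p_ij

X, labels = torch.randn(20, 100), torch.randint(0, 3, (20,))
nca_loss(X, labels).backward()
```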
43
Non-linear NCA
  • This should be much more powerful than linear
    NCA, but it is harder to optimize.
  • Maybe it would help to initialize the mapping by
    learning a multilayer model of the inputs using
    RBMs.
  • Maybe it would help to combine the NCA objective
    function with an autoencoder.
  • Maybe the autoencoder would take care of the
    collapse problem, so that we could avoid the
    quadratically expensive consideration of all the
    pairs from different classes.