Title: CSC2535 Advanced Machine Learning, Lecture 11: Learning by maximizing agreement between outputs
1 CSC2535 Advanced Machine Learning, Lecture 11
Learning by maximizing agreement between outputs
2 The aims of unsupervised learning
- We would like to extract a representation of the sensory input that is useful for later processing.
- We want to do this without requiring labeled data.
- Prior ideas about what the internal representation should look like ought to be helpful. So what would we like in a representation?
  - Hidden causes that explain high-order correlations?
  - Constraints that often hold?
  - A low-dimensional manifold that contains all the data?
  - Properties that are invariant across space or time?
3 Temporally invariant properties
- Consider a rigid object that is moving relative to the retina:
  - Its retinal image changes in predictable ways.
  - Its true 3-D shape stays exactly the same. It is invariant over time.
  - Its angular momentum also stays the same if it is in free fall.
- Properties that are invariant over time are usually interesting.
- Similarly for properties invariant over space.
4 Learning temporal invariances
[Diagram: two identical modules, one receiving the image at time t and the other the image at time t+1; each maps its image through hidden layers to non-linear features, and the objective is to maximize agreement between the two feature vectors.]
5 Some obvious measures of agreement
- Minimize the average squared difference between the two output vectors.
  - This is easy to achieve: both modules always output the same vector.
  - That is not what we meant by agreement!
- Minimize the variance of the difference between the two output vectors relative to the variance of each output vector separately.
  - This works well for learning linear transformations.
  - It is a version of canonical correlation (see the sketch after this list).
  - There is a subtle reason why it does not work well for learning non-linear transformations.
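To make the second measure concrete, here is a minimal numpy sketch (my own illustration, not code from the lecture) for scalar outputs: after each output is standardized, the variance of the difference equals 2(1 - correlation), so minimizing it is the same as maximizing the correlation, which is why the measure amounts to canonical correlation in the linear case.

```python
import numpy as np

def normalized_difference_variance(a, b):
    """Variance of (a - b) after scaling each output to unit variance.

    For standardized scalar outputs this equals 2 * (1 - corr(a, b)),
    so minimizing it is the same as maximizing the correlation.
    """
    a = (a - a.mean()) / a.std()
    b = (b - b.mean()) / b.std()
    return np.var(a - b)

# Toy check: two noisy views of the same underlying signal.
rng = np.random.default_rng(0)
s = rng.normal(size=10000)
a = s + 0.1 * rng.normal(size=10000)
b = s + 0.1 * rng.normal(size=10000)
print(normalized_difference_variance(a, b))   # small: the outputs agree
print(2 * (1 - np.corrcoef(a, b)[0, 1]))      # matches the value above
```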
6 A new way to get a teaching signal
- Each module uses the output of the other module as the teaching signal.
- This does not work if the two modules can see the same data: they just report one component of the data and agree perfectly.
- It also fails if a module always outputs a constant: the modules can just ignore the data and agree on what constant to output.
- We need a sensible definition of the amount of agreement between the outputs.
7 Mutual information
- Two variables, a and b, have high mutual information if you can predict a lot about one from the other.
- In terms of the individual entropies and the joint entropy:
    I(a; b) = H(a) + H(b) - H(a, b)
- There is also an asymmetric way to define mutual information:
    I(a; b) = H(a) - H(a | b)
- Compute derivatives of I with respect to the feature activities, then backpropagate to get derivatives for all the weights in the network (a worked Gaussian case is given below).
- The network at time t is using the network at time t+1 as its teacher (and vice versa).
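As a concrete worked case (a standard result, not shown in the transcript): if the two outputs a and b are jointly Gaussian scalars with correlation coefficient rho, then

    I(a; b) = H(a) + H(b) - H(a, b) = -\tfrac{1}{2}\log\bigl(1 - \rho^{2}\bigr),

so backpropagating the derivatives of I is, in this special case, backpropagating the derivatives of a simple function of the correlation between the two feature activities.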
8 Some advantages of mutual information
- If the modules output constants, the mutual information is zero.
- If the modules each output a vector, the mutual information is maximized by making the components of each vector be as independent as possible.
- Mutual information exactly captures what we mean by agreeing.
9 A problem
- We can never have more mutual information between the two output vectors than there is between the two input vectors.
  - So why not just use the input vector as the output?
- We want to preserve as much mutual information as possible whilst also achieving something else:
  - Dimensionality reduction?
  - A simple form for the prediction of one output from the other?
10 Simple forms for the relationship
- Assumption: the output of module a equals the output of module b plus noise.
- Alternative assumption: a and b are both noisy versions of the same underlying signal (a sketch of the corresponding agreement measure is given below).
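Under the second assumption, a standard estimator for scalar outputs (the one used in Becker and Hinton's original work on this idea, if I remember it correctly) is I ≈ ½ log(V(a+b) / V(a-b)): the sum is dominated by the shared signal and the difference by the noise. A minimal numpy sketch, my own illustration rather than code from the lecture:

```python
import numpy as np

def shared_signal_mi(a, b):
    """Estimate I(a; b) as 0.5 * log(V(a + b) / V(a - b)).

    Assumes a and b are (roughly Gaussian) noisy versions of the same
    underlying signal, so the variance ratio measures how much they agree.
    """
    return 0.5 * np.log(np.var(a + b) / np.var(a - b))

rng = np.random.default_rng(1)
s = rng.normal(size=50000)                # shared underlying signal
a = s + 0.2 * rng.normal(size=50000)      # module a's noisy output
b = s + 0.2 * rng.normal(size=50000)      # module b's noisy output
print(shared_signal_mi(a, b))                         # clearly positive: outputs agree
print(shared_signal_mi(a, rng.normal(size=50000)))    # near zero: no agreement
```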
11 Learning temporal invariances
[Diagram: the same two-module architecture as before (image at time t and image at time t+1, each mapped through hidden layers to non-linear features), but now the objective is to maximize the mutual information between the two feature vectors, and its derivatives are backpropagated through both networks.]
12 Spatially invariant properties
- Consider a smooth surface covered in random dots that is viewed from two different directions.
  - Each image is just a set of random dots.
- A stereo pair of images has disparity that changes smoothly over space. Nearby regions of the image pair have very similar disparities.
[Diagram: a surface viewed by the left eye and the right eye, with the plane of fixation in between.]
13 Maximizing mutual information between a local region and a larger context
[Diagram: five modules, each with its own hidden layer, look at adjacent patches of the left-eye and right-eye images of the surface. The outputs of the four context modules are combined with weights w1, w2, w3, w4 to form a contextual prediction, and the objective is to maximize the mutual information between this prediction and the output of the middle module.]
14 How well does it work?
- If we use weight sharing between modules and plenty of hidden units, it works really well.
  - It extracts the depth of the surface fairly accurately.
- It simultaneously learns the optimal weights of -1/6, 4/6, 4/6, -1/6 for interpolating the depths of the context to predict the depth at the middle module (these are exactly the cubic interpolation weights; see the check after this slide).
- If the data is noisy or the modules are unreliable, it learns a more robust interpolator that uses smaller weights in order not to amplify noise.
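A quick check (my own sketch, not code from the lecture) that -1/6, 4/6, 4/6, -1/6 are exactly the weights that cubic (Lagrange) interpolation assigns for predicting the midpoint from four equally spaced context locations:

```python
import numpy as np

# Context depths are sampled at positions -2, -1, +1, +2 and we predict
# the depth at position 0. For any cubic polynomial, the prediction with
# weights [-1/6, 4/6, 4/6, -1/6] is exact.
positions = np.array([-2.0, -1.0, 1.0, 2.0])
weights = np.array([-1/6, 4/6, 4/6, -1/6])

rng = np.random.default_rng(2)
coeffs = rng.normal(size=4)                 # random cubic a0 + a1 x + a2 x^2 + a3 x^3
poly = np.polynomial.Polynomial(coeffs)

prediction = weights @ poly(positions)
print(prediction, poly(0.0))                # identical up to rounding
```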
15 But what about discontinuities?
- Real surfaces are mostly smooth but also have sharp discontinuities in depth.
- How can we preserve the high mutual information between local depth and contextual depth?
- Discontinuities cause occasional high residual errors. The Gaussian model of residuals requires high variance to accommodate these large errors.
16 A simple mixture approach
- We assume that there are "continuity" cases, in which there is high MI, and "discontinuity" cases, in which there is no MI.
- The variance of the residual is only computed on the continuity cases, so it can stay small.
- The residual can be used to compute the posterior probability of each type of case (see the sketch below).
- Aim to maximize the mixing proportion of the continuity cases times the MI in those cases.
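A minimal sketch (my own illustration, with made-up parameter values) of how the residual can be turned into a posterior probability of the two cases, assuming a narrow Gaussian for continuity cases and a much broader Gaussian for discontinuity cases:

```python
import numpy as np

def continuity_posterior(residual, pi_cont=0.8, sigma_cont=0.1, sigma_disc=2.0):
    """Posterior probability that a residual came from the continuity case.

    Mixture of two zero-mean Gaussians: a narrow one for continuity cases
    (weight pi_cont) and a broad one for discontinuity cases (weight 1 - pi_cont).
    """
    def gauss(x, sigma):
        return np.exp(-0.5 * (x / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    p_cont = pi_cont * gauss(residual, sigma_cont)
    p_disc = (1 - pi_cont) * gauss(residual, sigma_disc)
    return p_cont / (p_cont + p_disc)

print(continuity_posterior(0.05))   # small residual: almost certainly continuity
print(continuity_posterior(1.5))    # large residual: almost certainly a discontinuity
```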
17 Mixtures of expert interpolators
- Instead of just giving up on discontinuity cases, we can use a different interpolator that ignores the surface beyond the discontinuity.
  - For example, if the discontinuity lies between c and d, predict the depth at c by extrapolating linearly from a and b (i.e. use 2b - a).
- To choose this interpolator, find the location of the discontinuity.
[Diagram: five adjacent locations on the surface, labeled a, b, c, d, e.]
18 The mixture of interpolators net
- There are five interpolators, each with its own controller.
- Each controller is a neural net that looks at the outputs of all 5 modules and learns to detect a discontinuity at a particular location.
  - Except for the controller for the full interpolator, which checks that there is no discontinuity.
- The mixture of expert interpolators trains the controllers, the interpolators, and the local depth modules all together.
19 Mutual Information with multi-dimensional output
- For a multidimensional Gaussian, the entropy is determined by the log of the determinant of the covariance matrix (the log volume of the Gaussian).
- If we use the shared-signal assumption to measure the information between the outputs of two modules, we get the expression sketched below.
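The slide's equation is not in the transcript. The standard Gaussian entropy formula, and my reconstruction of the multi-dimensional version of the shared-signal estimator used earlier (an assumption, not necessarily the exact expression on the slide):

    H(\mathbf{a}) = \tfrac{1}{2}\log\bigl((2\pi e)^{d}\,\lvert\Sigma_{\mathbf{a}}\rvert\bigr)

    I(\mathbf{a};\mathbf{b}) \approx \tfrac{1}{2}\log\frac{\lvert\Sigma_{\mathbf{a}+\mathbf{b}}\rvert}{\lvert\Sigma_{\mathbf{a}-\mathbf{b}}\rvert}

where Sigma_{a+b} and Sigma_{a-b} are the covariance matrices of the sum and of the difference of the two output vectors.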
20 Optimizing non-linear transformations to maximize mutual information between multi-dimensional outputs
- Assume that each output is a multi-dimensional Gaussian, and also assume that the joint distribution of both outputs is Gaussian.
- If we back-propagate the derivatives of this MI, we get bizarre results. (The same problems occur with one-dimensional outputs, but they are less obvious in 1-D.)
- What is wrong?
21 Beware of Gaussian assumptions
- We need to maximize the entropy of the agreed (summed) signal and minimize the entropy of the disagreement (the residual).
- What we actually maximize and minimize are the Gaussian estimates of these entropies, which are upper bounds on the true entropies: there is a gap between each bound and the true value, and the two terms are linked.
- Maximizing an upper bound encourages the network to make the bound looser.
22 Violating the Gaussian Assumption (experiments by Russ Salakhutdinov)
- Suppose we use pairs of images that only have one scalar in common (the orientation of a face).
- Suppose we let each module have two-dimensional output. What does it do?
- The two output components a1 and a2 are uncorrelated, so the determinant of the covariance is big. But the true entropy of a is low because a is effectively one-dimensional: its distribution is extremely non-Gaussian.
23 A lucky escape
- What if we do a fixed non-linear expansion of the input and then learn a linear mapping from the non-linear expansion to the output?
- A linear mapping cannot change the Gaussian-ness of a distribution (i.e. the ratio of entropy to variance), so it cannot cheat by making the bound looser.
- It is easy to find linear mappings that optimize quadratic constraints.
- We can use the kernel trick to allow a huge non-linear expansion.
[Diagram: image -> fixed non-linear hidden layer -> adaptive linear mapping -> output.]
24 Kernel Canonical Correlation (Bach and Jordan)
- Canonical correlation finds a fixed linear transformation of each input to maximize the correlation of the outputs (a plain linear sketch is given below).
- It can be kernelized by using a kernel in input space to allow efficient computation of the best linear mapping in a very high-dimensional non-linear expansion of the input space.
- The Gaussian-ness of the distribution in the high-dimensional space is not affected by adapting the linear mapping.
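For concreteness, here is a toy scikit-learn sketch of plain (linear) canonical correlation on two views that share a signal. This is my own illustration, not the kernelized method of Bach and Jordan, but it shows the agreement objective that the kernel version optimizes in the expanded space.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Two "views" that share one underlying signal plus view-specific noise,
# each passed through its own random invertible linear mixing.
rng = np.random.default_rng(3)
s = rng.normal(size=(2000, 1))                        # shared signal
x = np.hstack([s, rng.normal(size=(2000, 4))]) @ rng.normal(size=(5, 5))
y = np.hstack([s, rng.normal(size=(2000, 4))]) @ rng.normal(size=(5, 5))

# Find one projection of each view that maximizes their correlation.
cca = CCA(n_components=1)
a, b = cca.fit_transform(x, y)
print(np.corrcoef(a[:, 0], b[:, 0])[0, 1])   # close to 1: each view contains the shared signal
```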
25 Slow Feature Analysis (Berkes & Wiskott; Wiskott & Sejnowski)
- Use three consecutive time frames from a fake video sequence to form the two inputs: (t-1, t) and (t, t+1).
- The sequence is made from a large, still, natural image by translating, expanding, and rotating a square window and then pixelating to get sequences of 16x16 images.
- The two 256-pixel images in a pair are reduced to 100 dimensions using PCA and then non-linearly expanded by taking pairwise products of components. This provides the 5050-dimensional input to one module (see the dimension check below).
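A quick check (my own sketch) that taking all pairwise products of 100 components, squares included, yields 100 * 101 / 2 = 5050 dimensions:

```python
import numpy as np

def quadratic_expansion(x):
    """All pairwise products x_i * x_j with i <= j (squares included)."""
    i, j = np.triu_indices(len(x))
    return x[i] * x[j]

x = np.random.default_rng(4).normal(size=100)   # 100 PCA components
print(quadratic_expansion(x).shape)             # (5050,)
```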
26 The SFA objective function
The solution can be found by solving a generalized eigenvalue problem (a standard statement of the objective is given below).
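The slide's equation is missing from the transcript; the standard SFA formulation (as given by Wiskott and Sejnowski, to the best of my knowledge) is: for each output feature y_j(t),

    \text{minimize } \Delta(y_j) = \langle \dot{y}_j^{2} \rangle_t
    \quad \text{subject to} \quad
    \langle y_j \rangle_t = 0, \;\; \langle y_j^{2} \rangle_t = 1, \;\; \langle y_i y_j \rangle_t = 0 \;\; (i < j).

For features that are linear in the (expanded) input, y = w^{\top} x, this becomes the generalized eigenvalue problem

    \langle \dot{x}\dot{x}^{\top} \rangle \, w = \lambda \, \langle x x^{\top} \rangle \, w,

and the slowest features are the eigenvectors with the smallest eigenvalues.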
27 The slow features
- They have a lot of similarities to the features found in the first stage of visual cortex.
- They can be displayed by showing the pair of temporally adjacent images that excites them most and the pair that inhibits them most.
28 The most excitatory pair of images and the most inhibitory pair of images for some slow features
29 (Image-only slide)
30 (Image-only slide)
31 Relationship to linear dynamical system
[Diagram: the same two-module architecture with the image at time t and the image at time t+1, but the features are linear and the relationship between the two feature vectors is a linear model (which could be the identity plus noise). The prediction from the past is made in the feature domain, so we cannot cheat.]
32 A way to learn non-linear transformations that maximize agreement between the outputs of two modules
- We want to explain why we observe particular pairs of images rather than observing other pairings of the same set of images.
  - This captures the non-iid-ness of the data.
- We can formulate this probabilistically using disagreement energies.
33 An energy-based model of agreement
[Diagram: two networks A and B, each with hidden layers, map the two images of the same case c to codes a and b, which are trained to agree.]
34 It's the same cost as symmetric SNE!
- Model the joint probability of picking pairs of images. Temporal or spatial adjacency is now used to get a set of desired probabilities for pairs.
- In the model, the joint probability of picking a pair falls off exponentially with the squared distance between the codes for i and j (the cost is written out below).
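For reference, my reconstruction of the standard symmetric SNE cost, which is presumably what the missing slide equation showed:

    q_{ij} = \frac{\exp\!\bigl(-\lVert y_i - y_j \rVert^{2}\bigr)}{\sum_{k \neq l} \exp\!\bigl(-\lVert y_k - y_l \rVert^{2}\bigr)},
    \qquad
    C = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}},

where y_i is the code for image i and the p_{ij} are the desired pair probabilities obtained from temporal or spatial adjacency.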
35 The forces acting on the output vectors
- Output vectors from a correct pair are pulled towards each other with a force that depends on their squared difference.
- Output vectors from an incorrect pair are repelled, with a force that falls off as the vectors get far apart relative to the correct pairs.
36 Combining symmetric SNE with a feedforward neural net
- The aim of the net is to make the codes similar for the pairs it is given.
- Use pairs of face images that have similar orientations and scales but are otherwise quite different.
- Use a feedforward net to map each image to a 2-D code.
- The SNE derivatives are back-propagated through the net (a sketch of the SNE gradient is given below).
- This regularizes the embedding and also makes it easy to apply to new data.
[Diagram: face i and face j are each mapped by the shared feedforward net to code i and code j.]
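A minimal numpy sketch (my own illustration) of the symmetric SNE gradient with respect to the codes; in the combined model these code gradients are the error signals that get back-propagated through the feedforward net that produced the codes.

```python
import numpy as np

def symmetric_sne_grad(Y, P):
    """Gradient of C = sum_ij P_ij log(P_ij / Q_ij) with respect to the codes Y.

    Y: (n, d) array of codes. P: (n, n) desired pair probabilities
    (symmetric, zero diagonal, summing to 1). Uses the standard result
    dC/dy_i = 4 * sum_j (P_ij - Q_ij) * (y_i - y_j).
    """
    diff = Y[:, None, :] - Y[None, :, :]          # (n, n, d) pairwise differences
    sq_dist = (diff ** 2).sum(-1)                 # squared distances
    W = np.exp(-sq_dist)
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                               # model pair probabilities
    return 4.0 * ((P - Q)[:, :, None] * diff).sum(axis=1)
```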
37 [Figure: example image pairs, one labeled "Large pair" and one labeled "Small pair".]
38 Each color is for a different band of orientations (from -45 to +45 degrees).
39 Each color is for a different scale (from small to large).
40 A non-probabilistic version
- Hadsell, Chopra and LeCun (2006) use a non-probabilistic version of NCA.
- They need to use a complicated heuristic to force the outputs from dissimilar pairs to be far apart.
- They get similar results when they map images of objects to a low-dimensional space.
41 Neighborhood Components Analysis
- The idea is to map datapoints to a low-dimensional space in such a way that nearest-neighbors classification works well.
- If we restrict the mapping from inputs to outputs to be linear, we get an alternative to Fisher's Linear Discriminant Analysis.
  - LDA maximizes the ratio of between-class variance to within-class variance.
  - This is the wrong thing to do if the classes naturally form extended low-dimensional manifolds.
42 An objective function for NCA
The slide's equation is annotated with the low-D output vector, the high-D input vector, and the class; a standard statement of the objective is given below.
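My reconstruction of the standard NCA objective (following Goldberger et al.), with the annotations mentioned above:

    p_{ij} = \frac{\exp\!\bigl(-\lVert f(x_i) - f(x_j) \rVert^{2}\bigr)}{\sum_{k \neq i} \exp\!\bigl(-\lVert f(x_i) - f(x_k) \rVert^{2}\bigr)},
    \qquad
    O = \sum_{i} \sum_{j : c_j = c_i} p_{ij},

where x_i is the high-D input vector, f(x_i) is its low-D output vector, and c_i is its class. O is the expected number of points that would be classified correctly by a stochastic nearest-neighbor rule in the output space.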
43 Non-linear NCA
- This should be much more powerful than linear NCA, but it is harder to optimize.
- Maybe it would help to initialize the mapping by learning a multilayer model of the inputs using RBMs.
- Maybe it would help to combine the NCA objective function with an autoencoder.
  - Maybe the autoencoder would take care of the collapse problem, so that we could avoid the quadratically expensive consideration of all the pairs from different classes.