Title: Blind Source Separation by Independent Components Analysis
1. Blind Source Separation by Independent Components Analysis
- Professor Dr. Barrie W. Jervis
- School of Engineering
- Sheffield Hallam University
- England
- B.W.Jervis_at_shu.ac.uk
2. The Problem
- Temporally independent unknown source signals are linearly mixed in an unknown system to produce a set of measured output signals.
- It is required to determine the source signals.
3.
- Methods of solving this problem are known as Blind Source Separation (BSS) techniques.
- In this presentation the method of Independent Components Analysis (ICA) will be described.
- The arrangement is illustrated in the next slide.
4. Arrangement for BSS by ICA
[Diagram: the source signals s1 ... sn pass through the mixing matrix A to give the measured signals x1 ... xn; these pass through the unmixing matrix W to give the estimated sources u1 ... un, which are passed through the nonlinearities g(.) to give the outputs y1 = g1(u1), y2 = g2(u2), ..., yn = gn(un).]
5. Neural Network Interpretation
- The si are the independent source signals,
- A is the linear mixing matrix,
- The xi are the measured signals,
- W ≈ A^-1 is the estimated unmixing matrix,
- The ui are the estimated source signals or activations, i.e. ui ≈ si,
- The gi(ui) are monotonic nonlinear functions (sigmoids, hyperbolic tangents),
- The yi are the network outputs.
6. Principles of the Neural Network Approach
- Use information theory to derive an algorithm which minimises the mutual information between the outputs y = g(u).
- This minimises the mutual information between the source signal estimates, u, since g(u) introduces no dependencies.
- The different u are then temporally independent and are the estimated source signals.
7. Cautions I
- The magnitudes and signs of the estimated source signals are unreliable, since
  - the magnitudes are not scaled,
  - the signs are undefined,
  because magnitude and sign information is shared between the source signal vector and the unmixing matrix, W.
- The order of the outputs is permuted compared with the inputs.
8. Cautions II
- Similar overlapping source signals may not be properly extracted.
- If the number of output channels < the number of source signals, the source signals of lowest variance will not be extracted. This is a problem when these signals are important.
9. Information Theory I
- If X is a vector of variables (messages) xi which occur with probabilities P(xi), then the average information content of a stream of N messages is

  H(X) = -Σi P(xi) log2 P(xi) bits

  and is known as the entropy of the random variable, X (a small numerical sketch follows).
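As a small numerical illustration, added here rather than taken from the slides, the entropy of a discrete distribution follows directly from the definition above; the example probabilities are arbitrary:

```python
import numpy as np

def entropy(p, base=2.0):
    """Shannon entropy H(X) = -sum_i P(x_i) log P(x_i), in bits for base 2."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # terms with P(x_i) = 0 contribute nothing
    return -np.sum(p * np.log(p)) / np.log(base)

# Example: a four-message source with arbitrary probabilities.
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits
# A uniform distribution over 4 messages gives the maximum, log2(4) = 2 bits.
print(entropy([0.25] * 4))                  # 2.0 bits
```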
10. Information Theory II
- Note that the entropy is expressible in terms of probability.
- Given the probability density function (pdf) of X we can find the associated entropy.
- This link between entropy and pdf is of the greatest importance in ICA theory.
11. Information Theory III
- The joint entropy of two random variables X and Y is given by

  H(X,Y) = -Σx Σy P(x,y) log2 P(x,y)

- For independent variables,

  H(X,Y) = H(X) + H(Y)
12. Information Theory IV
- The conditional entropy of Y given X measures the average uncertainty remaining about y when x is known, and is

  H(Y|X) = H(X,Y) - H(X) = -Σx Σy P(x,y) log2 P(y|x)

- The mutual information between Y and X is

  I(Y,X) = H(Y) - H(Y|X)

- In ICA, X represents the measured signals, which are applied to the nonlinear function g(u) to obtain the outputs Y. (A numerical check of these relations is sketched below.)
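The sketch below is an addition: it checks these relations numerically for an arbitrary 2x2 joint distribution.

```python
import numpy as np

def H(p):
    """Entropy in bits of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

# Arbitrary example joint distribution P(x, y): rows index x, columns index y.
Pxy = np.array([[0.3, 0.1],
                [0.2, 0.4]])

Px = Pxy.sum(axis=1)            # marginal P(x)
Py = Pxy.sum(axis=0)            # marginal P(y)

H_X, H_Y, H_XY = H(Px), H(Py), H(Pxy)
H_Y_given_X = H_XY - H_X        # conditional entropy H(Y|X)
I_YX = H_Y - H_Y_given_X        # mutual information I(Y,X)
print(H_Y_given_X, I_YX)

# For independent variables Pxy = outer(Px, Py), so H(X,Y) = H(X) + H(Y) and I = 0.
print(H(np.outer(Px, Py)) - (H_X + H_Y))    # ~0
```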
13. Bell and Sejnowski's ICA Theory (1995)
- Aim to maximise the amount of mutual information between the inputs X and the outputs Y of the neural network:

  I(Y,X) = H(Y) - H(Y|X)

  where H(Y) is the uncertainty about Y when X is unknown, and H(Y|X) is the uncertainty remaining when X is known.
- Y is a function of W and g(u).
- Here we seek to determine the W which produces the ui ≈ si, assuming the correct g(u).
14. Differentiating
- Differentiating with respect to W,

  ∂I(Y,X)/∂W = ∂H(Y)/∂W - ∂H(Y|X)/∂W

  where ∂H(Y|X)/∂W = 0, since H(Y|X) did not come through W from X.
- So, maximising this mutual information is equivalent to maximising the joint output entropy, H(Y), which is seen to be equivalent to minimising the mutual information between the outputs, and hence between the ui, as desired.
15. The Functions g(u)
- The outputs yi are amplitude-bounded random variables, and so the marginal entropies H(yi) are maximum when the yi are uniformly distributed - a known statistical result.
- With the H(yi) maximised (so that the mutual information between the outputs is ≈ 0) and the yi uniformly distributed, the nonlinearity gi(ui) has the form of the cumulative distribution function of the probability density function of the si - a proven result (illustrated in the sketch below).
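As an illustration added here, not part of the original slides, the following sketch passes samples of a Laplacian (super-Gaussian) source through the cumulative distribution function of its own pdf and shows that the output is approximately uniformly distributed, as the result above requires.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Laplacian source (a simple super-Gaussian example) with unit scale.
s = rng.laplace(loc=0.0, scale=1.0, size=100_000)

def laplace_cdf(u, scale=1.0):
    """Cumulative distribution function of a zero-mean Laplacian."""
    return np.where(u < 0, 0.5 * np.exp(u / scale), 1.0 - 0.5 * np.exp(-u / scale))

y = laplace_cdf(s)                        # g(u) chosen as the source CDF

# The histogram of y is approximately flat on [0, 1], i.e. uniform (maximum entropy).
counts, _ = np.histogram(y, bins=10, range=(0.0, 1.0))
print(counts / counts.sum())              # each bin ~0.1
```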
16. Pause and Review g(u) and W
- W has to be chosen to maximise the joint output entropy, H(Y), which minimises the mutual information between the estimated source signals, ui.
- The g(u) should be the cumulative distribution functions of the source signals, si.
- Determining the g(u) is a major problem.
17. One Input and One Output
- For a monotonic nonlinear function, g(x), the output pdf is related to the input pdf by

  py(y) = px(x) / |∂y/∂x|

  so the output entropy is

  H(y) = -E[ln py(y)] = E[ln |∂y/∂x|] - E[ln px(x)]

  The second term is independent of W, so we only need to maximise the first term, E[ln |∂y/∂x|].
18.
- A stochastic gradient ascent learning rule is adopted to maximise H(y) by assuming

  Δw ∝ ∂H(y)/∂w = ∂/∂w ( ln |∂y/∂x| )

- Further progress requires knowledge of g(u). Assume for now, after Bell and Sejnowski, that g(u) is sigmoidal, i.e.

  y = g(u) = 1 / (1 + e^-u),  with u = wx + w0
19. Learning Rule: One Input, One Output
- Hence, we find

  Δw ∝ 1/w + x(1 - 2y),   Δw0 ∝ 1 - 2y
20. Learning Rule: N Inputs, N Outputs
- Assuming g(u) is sigmoidal again, we obtain

  ΔW ∝ [W^T]^-1 + (1 - 2y)x^T

  (a single update step is sketched below).
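As an aside not in the original slides, this is a minimal Python sketch of one stochastic update of the rule above, assuming the sigmoidal g(u); the learning rate and the initial W are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def infomax_step(W, x, lr=0.01):
    """One stochastic gradient ascent step of the sigmoidal infomax rule:
    dW ∝ inv(W.T) + (1 - 2y) x.T, where u = W x and y = g(u)."""
    x = x.reshape(-1, 1)                    # column vector of measurements
    u = W @ x                               # current source estimates
    y = sigmoid(u)
    dW = np.linalg.inv(W.T) + (1.0 - 2.0 * y) @ x.T
    return W + lr * dW

# Example: a single 3-channel sample and an initial unmixing matrix near identity.
rng = np.random.default_rng(1)
W = np.eye(3) + 0.01 * rng.standard_normal((3, 3))
W = infomax_step(W, rng.standard_normal(3))
```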
21.
- The network is trained until the changes in the weights become acceptably small at each iteration.
- Thus the unmixing matrix W is found.
22. The Natural Gradient
- The computation of the inverse matrix [W^T]^-1 is time-consuming, and may be avoided by rescaling the entropy gradient by multiplying it by W^T W.
- Thus, for a sigmoidal g(u) we obtain

  ΔW ∝ [I + (1 - 2y)u^T] W

- This is the "natural gradient", introduced by Amari (1998), and now widely adopted (a batch version is sketched below).
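A minimal Python sketch of the natural-gradient update applied to a batch of samples, again an addition rather than part of the slides; the surrogate Laplacian sources, learning rate, batch size and stopping threshold are arbitrary choices.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def natural_gradient_step(W, X, lr=0.001):
    """Natural-gradient infomax update for sigmoidal g(u):
    dW ∝ [I + (1 - 2y) u.T] W, averaged over the columns (samples) of X."""
    n, T = X.shape
    U = W @ X                                   # estimated sources, one column per sample
    Y = sigmoid(U)
    dW = (np.eye(n) + (1.0 - 2.0 * Y) @ U.T / T) @ W
    return W + lr * dW

# Example: train on a batch of mixed signals until the weight changes are small.
rng = np.random.default_rng(2)
S = rng.laplace(size=(3, 5000))                 # surrogate super-Gaussian sources
A = rng.standard_normal((3, 3))                 # unknown mixing matrix
X = A @ S                                       # measured signals
W = np.eye(3)
for _ in range(2000):
    W_new = natural_gradient_step(W, X)
    if np.max(np.abs(W_new - W)) < 1e-7:        # changes acceptably small
        break
    W = W_new
```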
23. The Nonlinearity, g(u)
- We have already learnt that the g(u) should be the cumulative distribution functions of the individual source distributions.
- So far the g(u) have been assumed to be sigmoidal, so what are the pdfs of the si?
- The corresponding pdfs of the si are super-Gaussian.
24. Super- and Sub-Gaussian pdfs
[Figure: a Gaussian pdf compared with a super-Gaussian pdf (sharper peak, heavier tails) and a sub-Gaussian pdf (flatter, lighter tails).]
- Note there are no mathematical definitions of super- and sub-Gaussians.
25. Super- and Sub-Gaussians
- Super-Gaussians: kurtosis (fourth-order central moment, measures the flatness of the pdf) > 0; infrequent signals of short duration, e.g. evoked brain signals.
- Sub-Gaussians: kurtosis < 0; signals mainly "on", e.g. the 50/60 Hz electrical mains supply, but also eye blinks.
26. Kurtosis
- Kurtosis is based on the 4th-order central moment; for zero-mean ui,

  kurt(ui) = E[ui^4] / (E[ui^2])^2 - 3

  and is seen to be calculated from the current estimates of the source signals (computed in the sketch below).
- To separate the independent sources, information about their pdfs such as skewness (3rd moment) and flatness (kurtosis) is required.
- First and 2nd moments (mean and variance) are insufficient.
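A short Python sketch, added for illustration, of this kurtosis estimate applied to samples of Gaussian, super-Gaussian and sub-Gaussian surrogate signals:

```python
import numpy as np

def excess_kurtosis(u):
    """Excess kurtosis E[(u - mean)^4] / var^2 - 3; zero for a Gaussian."""
    u = np.asarray(u, dtype=float)
    u = u - u.mean()
    return np.mean(u**4) / np.mean(u**2)**2 - 3.0

rng = np.random.default_rng(3)
print(excess_kurtosis(rng.standard_normal(100_000)))       # ~0    (Gaussian)
print(excess_kurtosis(rng.laplace(size=100_000)))          # ~+3   (super-Gaussian)
print(excess_kurtosis(rng.uniform(-1, 1, size=100_000)))   # ~-1.2 (sub-Gaussian)
```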
27. A More Generalised Learning Rule
- Girolami (1997) showed that tanh(ui) and -tanh(ui) could be used for super- and sub-Gaussians respectively.
- Cardoso and Laheld (1996) developed a stability analysis to determine whether the source signals were to be considered super- or sub-Gaussian.
- Lee, Girolami, and Sejnowski (1998) applied these findings to develop their extended infomax algorithm for super- and sub-Gaussians using a kurtosis-based switching rule.
28. Extended Infomax Learning Rule
- With super-Gaussians modelled as a Gaussian density weighted by sech^2,

  p(u) ∝ pG(u) sech^2(u)

  and sub-Gaussians as a Pearson mixture model,

  p(u) = ½ [ N(μ, σ^2) + N(-μ, σ^2) ]

  the new extended learning rule is

  ΔW ∝ [ I - K tanh(u)u^T - uu^T ] W
29. Switching Decision
- In the extended rule, ki = +1 for super-Gaussian and ki = -1 for sub-Gaussian sources; the ki are the elements of the N-dimensional diagonal matrix, K, and

  ki = sign( E[sech^2(ui)] E[ui^2] - E[ui tanh(ui)] )

  (the rule and the switching decision are sketched together below).
- Modifications of the formula for ki exist, but in our experience the extended algorithm has been unsatisfactory.
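The following Python sketch, an addition, combines the extended learning rule with the kurtosis-based switching decision; the two surrogate sources (Laplacian and uniform) and the learning rate are arbitrary illustrative choices.

```python
import numpy as np

def extended_infomax_step(W, X, lr=0.001):
    """Extended infomax update dW ∝ [I - K tanh(u) u.T - u u.T] W,
    with k_i = sign(E[sech^2(u_i)] E[u_i^2] - E[u_i tanh(u_i)]),
    i.e. +1 for super-Gaussian and -1 for sub-Gaussian source estimates."""
    n, T = X.shape
    U = W @ X
    tU = np.tanh(U)
    sech2 = 1.0 / np.cosh(U) ** 2
    # Switching decision, one k_i per estimated source.
    k = np.sign(np.mean(sech2, axis=1) * np.mean(U**2, axis=1)
                - np.mean(U * tU, axis=1))
    K = np.diag(k)
    dW = (np.eye(n) - K @ (tU @ U.T) / T - (U @ U.T) / T) @ W
    return W + lr * dW, k

# Example: one update on a batch mixing a super- and a sub-Gaussian source.
rng = np.random.default_rng(4)
S = np.vstack([rng.laplace(size=5000),          # super-Gaussian source
               rng.uniform(-1, 1, size=5000)])  # sub-Gaussian source
X = rng.standard_normal((2, 2)) @ S
W, k = extended_infomax_step(np.eye(2), X)
print(k)    # signs chosen by the switching rule
```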
30. Reasons for the Unsatisfactory Extended Algorithm
- 1) Initial assumptions about the super- and sub-Gaussian distributions may be too inaccurate.
- 2) The switching criterion may be inadequate.
Alternatives:
- Postulate vague distributions for the source signals which are then developed iteratively during training.
- Use an alternative approach, e.g. a statistically based method such as JADE (Cardoso).
31. Summary So Far
- We have seen how W may be obtained by training the network, and the extended algorithm for switching between super- and sub-Gaussians has been described.
- Alternative approaches have been mentioned.
- Next we consider how to obtain the source signals knowing W and the measured signals, x.
32. Source Signal Determination
[Diagram: the unknown sources si pass through the mixing matrix A to give the measured signals xi; the unmixing matrix W and the nonlinearities g(u) then give the estimated sources ui ≈ si and the outputs yi.]
- Hence U = W.x and x = A.S, where A ≈ W^-1 and U ≈ S.
- The rows of U are the estimated source signals, known as activations (as functions of time).
- The rows of x are the time-varying measured signals.
33. Source Signals
[Figure: the estimated source signal activations, one trace per channel number, plotted against time (sample number).]
34. Expressions for the Activations
- Since U = W.x, the ith activation at sample t is

  ui(t) = Σj wij xj(t)

- We see that consecutive values of ui are obtained by filtering consecutive columns of x by the same row of W.
- The ith row of U is the ith row of W multiplied by the columns of x (see the sketch below).
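A short numerical check, added for illustration, that the ith row of U is indeed the ith row of W applied to the columns of x:

```python
import numpy as np

rng = np.random.default_rng(5)
W = rng.standard_normal((3, 3))       # unmixing matrix
x = rng.standard_normal((3, 1000))    # measured signals, one column per time point

U = W @ x                             # all activations at once
u2 = W[2, :] @ x                      # ith row of W times the columns of x (i = 2)
print(np.allclose(U[2, :], u2))       # True: identical to the ith row of U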
35. Procedure
- Record N time points from each of M sensors, where N ≥ 5M.
- Pre-process the data, e.g. filtering, trend removal.
- Sphere the data using Principal Components Analysis (PCA). This is not essential, but it speeds up the computation by first removing the first- and second-order moments.
- Compute the ui ≈ si. Include desphering.
- Analyse the results. (An end-to-end sketch of these steps follows.)
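An end-to-end Python sketch of the procedure, added for illustration: it generates surrogate data (so the pre-processing step is omitted), spheres with PCA, trains W with the natural-gradient sigmoidal rule from the earlier slides, and folds the sphering back in when forming the activations; all numerical settings are arbitrary.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

rng = np.random.default_rng(6)

# Surrogate data: M = 3 sensors, N = 5000 time points (N >= 5M).
S = rng.laplace(size=(3, 5000))                 # unknown super-Gaussian sources
A = rng.standard_normal((3, 3))                 # unknown mixing matrix
X = A @ S                                       # measured signals

# Sphere the data with PCA: remove the first- and second-order moments.
X0 = X - X.mean(axis=1, keepdims=True)
eigval, eigvec = np.linalg.eigh(np.cov(X0))
sphering = eigvec @ np.diag(eigval ** -0.5) @ eigvec.T
Z = sphering @ X0                               # sphered data (identity covariance)

# Train W on the sphered data with the natural-gradient sigmoidal rule.
W = np.eye(3)
for _ in range(2000):
    U = W @ Z
    Y = sigmoid(U)
    W += 0.001 * (np.eye(3) + (1.0 - 2.0 * Y) @ U.T / Z.shape[1]) @ W

# Form the activations; "desphering" folds the sphering matrix into W.
W_total = W @ sphering                          # unmixing matrix for the raw data
U = W_total @ X0                                # rows are the estimated sources ui ≈ si
```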
36. Optional Procedures I
- The contribution of each activation at a sensor may be found by back-projecting it to the sensor.
37. Optional Procedures II
- A measured signal which is contaminated by artefacts or noise may be extracted by back-projecting all the signal activations to the measurement electrode, setting the other activations to zero. (An artefact and noise removal method; see the sketch below.)
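A Python sketch of both back-projection procedures, added for illustration; W and x here are random stand-ins for an estimated unmixing matrix and measured signals, and the choice of which activation is treated as an artefact is purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)

# Stand-ins for quantities obtained earlier: an unmixing matrix W and measurements x.
W = rng.standard_normal((3, 3))         # estimated unmixing matrix (including sphering)
x = rng.standard_normal((3, 1000))      # measured signals, one row per electrode

U = W @ x                               # activations, one row per estimated source
A_est = np.linalg.inv(W)                # estimated mixing matrix, A ≈ W^-1

# Contribution of activation j at every electrode: back-project that activation alone.
j = 1
contribution_j = np.outer(A_est[:, j], U[j, :])

# Artefact/noise removal: suppose (hypothetically) that activation 2 is an artefact.
U_clean = U.copy()
U_clean[2, :] = 0.0                     # set the unwanted activation to zero
x_clean = A_est @ U_clean               # back-project the remaining activations
```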
38. Current Developments
- Overcomplete representations - more signal sources than sensors.
- Nonlinear mixing.
- Nonstationary sources.
- General formulation of g(u).
39. Conclusions
- It has been shown how to extract temporally independent unknown source signals from their linear mixtures at the outputs of an unknown system using Independent Components Analysis.
- Some of the limitations of the method have been mentioned.
- Current developments have been highlighted.