Title: Information Theoretic Learning: Finding Structure in Data

1. Information Theoretic Learning: Finding Structure in Data
- Jose Principe and Sudhir Rao
- University of Florida
- principe_at_cnel.ufl.edu
- www.cnel.ufl.edu
2. Outline
- Structure
- Connection to Learning
- Learning Structure: the old view
- A new framework
- Applications
3. Structure
- Patterns / regularities
- Amorphous / chaos
- Interdependence between subsystems
- White noise
4. Connection to Learning
5. Types of Learning
- Supervised learning
  - Data
  - Desired signal / teacher
- Reinforcement learning
  - Data
  - Rewards / punishments
- Unsupervised learning
  - Only the data
6. Unsupervised Learning
- What can be done with only the data?

First principles and examples:
- Preserve maximum information: auto-associative memory, ART, PCA, Linsker's infomax rule
- Extract independent features: Barlow's minimum redundancy principle, ICA, etc.
- Learn the probability distribution: Gaussian mixture models, the EM algorithm, parametric density estimation
7. Connection to Self-Organization
"If cell 1 is one of the cells providing input to cell 2, and if cell 1's activity tends to be high whenever cell 2's activity is high, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase."
- Donald Hebb, 1949, neuropsychologist

What is the purpose?
"Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network's functioning as an information processing system?"
- Linsker, 1988

[Diagram: three nodes A, B, C; increase the weight w_B in proportion to the activity of B and C.]
8. Linsker's Infomax Principle
[Diagram: linear network with inputs X1, ..., XL, weights w1, ..., wL, additive noise, and output Y.]
Under Gaussian assumptions and uncorrelated noise, the rate of a linear network is R = (1/2) log( var(Y) / var(noise) ).
Maximize the rate → maximize the Shannon rate I(X, Y) → Hebbian rule!
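To make the chain of implications concrete, here is a short derivation sketch (my notation, not taken from the slides), assuming a single linear unit y = wᵀx + n with Gaussian input of covariance C and uncorrelated Gaussian noise of variance σ_n²:

```latex
% Sketch: why maximizing the Shannon rate of a linear Gaussian unit yields a Hebbian rule.
% Assumptions (mine): y = w^T x + n, x Gaussian with covariance C, noise n independent of x.
\begin{align}
I(X;Y) &= \tfrac{1}{2}\log\frac{\operatorname{var}(y)}{\sigma_n^2}
        = \tfrac{1}{2}\log\frac{\mathbf{w}^{\top} C\,\mathbf{w} + \sigma_n^2}{\sigma_n^2},\\
\frac{\partial I}{\partial \mathbf{w}} &\propto C\,\mathbf{w} = E[\mathbf{x}\,(\mathbf{w}^{\top}\mathbf{x})],\\
\Delta \mathbf{w} &\propto E[\mathbf{x}\, y] \quad \text{(input times output: a Hebbian update).}
\end{align}
```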
9. Barlow's Redundancy Principle
Independent features → no redundancy → ICA!
Converting an M-dimensional problem into N one-dimensional problems:
- N conditional probabilities required for an event V: P(V | feature i)
- 2^M conditional probabilities required for an event V: P(V | stimuli)
(For example, M = 20 binary stimuli already require about a million joint conditionals, versus 20 for independent features.)
10. Summary 1
- Global objective function (example: infomax): extracting the desired signal from the data itself.
- Self-organizing rule (example: the Hebbian rule): revealing the structure through the interaction of the data points.
- Unsupervised learning (example: PCA): discovering structure in the data.
11. Questions
- Can we go beyond these preprocessing stages?
- Can we create a global cost function which extracts goal-oriented structures from the data?
- Can we derive a self-organizing principle from such a cost function?
A big YES!
12. What is Information Theoretic Learning?
- ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
- Its centerpiece is a non-parametric estimator of entropy that
  - does not require an explicit estimation of the pdf
  - uses the Parzen window method, which is known to be consistent and efficient
  - is smooth
  - is readily integrated in conventional gradient descent learning
  - provides a link to kernel learning and SVMs
  - allows an extension to random processes
13. ITL is a Different Way of Thinking About Data Quantification
- Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb's postulate of learning) into 2nd-order statistical equivalents.
- ITL replaces 2nd-order moments with a geometric statistical interpretation of data in probability spaces:
  - variance by entropy
  - correlation by correntropy
  - mean square error (MSE) by minimum error entropy (MEE)
  - distances in data space by distances in probability spaces
14. Information Theoretic Learning: Entropy
- Entropy quantifies the degree of uncertainty in a random variable. Claude Shannon defined entropy as H(X) = -∫ p(x) log p(x) dx.
- Not all random variables (r.v.) are equally random!
15. Information Theoretic Learning: Renyi's Entropy
Renyi's entropy of order α, H_α(X) = 1/(1-α) log ∫ p^α(x) dx, equals Shannon's entropy as α → 1.
16. Information Theoretic Learning: Parzen Windowing
- Given only samples {x_i} drawn from a distribution, the Parzen estimate of the pdf is p̂(x) = (1/N) Σ_i κ_σ(x - x_i).
- Convergence: the estimate is consistent, approaching the true pdf as N → ∞ with a suitably shrinking kernel size.
17. Information Theoretic Learning: Renyi's Quadratic Entropy
- Order-2 entropy with Gaussian kernels: H_2(X) = -log V_2(X), with V_2(X) = (1/N^2) Σ_i Σ_j G_{σ√2}(x_j - x_i).
- Pairwise interactions between samples: O(N^2).
- The information potential V_2(X) provides a potential field over the space of the samples, parameterized by the kernel size σ.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
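As a concrete illustration of the estimator above, here is a minimal NumPy sketch (my own code, not the authors' toolbox) that evaluates the information potential V2(X) and the quadratic entropy estimate; the kernel size sigma is a free parameter.

```python
import numpy as np

def information_potential(X, sigma=1.0):
    """Parzen-based information potential V2(X) = (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_i - x_j)."""
    X = np.atleast_2d(X)                            # shape (N, d)
    N, d = X.shape
    diffs = X[:, None, :] - X[None, :, :]           # all pairwise differences, (N, N, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)           # squared Euclidean distances
    s2 = 2.0 * sigma ** 2                           # (sigma*sqrt(2))^2
    norm = (2.0 * np.pi * s2) ** (d / 2.0)          # Gaussian normalization constant
    G = np.exp(-sq_dist / (2.0 * s2)) / norm        # kernel evaluated on every pair
    return G.mean()                                 # (1/N^2) * sum over all pairs

def renyi_quadratic_entropy(X, sigma=1.0):
    """H2(X) = -log V2(X)."""
    return -np.log(information_potential(X, sigma))

# Usage: a tight cluster has lower quadratic entropy than a spread-out cloud
rng = np.random.default_rng(0)
tight = 0.1 * rng.standard_normal((200, 2))
wide = 2.0 * rng.standard_normal((200, 2))
print(renyi_quadratic_entropy(tight, sigma=0.5))    # smaller entropy
print(renyi_quadratic_entropy(wide, sigma=0.5))     # larger entropy
```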
18. Information Theoretic Learning: Information Forces
- In adaptation, samples become information particles that interact through information forces.
- Information potential: V_2(X) as defined above.
- Information force on sample x_i: the derivative of the information potential, F(x_i) = ∂V_2(X)/∂x_i.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000. Erdogmus, Principe, Hild, Natural Computing, 2002.
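Continuing the sketch, the information force on each sample is the gradient of V2(X) with respect to that sample; again this is an illustrative NumPy reconstruction under the same kernel convention, not the authors' code.

```python
import numpy as np

def information_forces(X, sigma=1.0):
    """Force on each sample: F_i = dV2(X)/dx_i, pointing toward regions of higher sample density."""
    X = np.atleast_2d(X)
    N, d = X.shape
    diffs = X[None, :, :] - X[:, None, :]           # diffs[i, j] = x_j - x_i
    sq_dist = np.sum(diffs ** 2, axis=-1)
    s2 = 2.0 * sigma ** 2                           # (sigma*sqrt(2))^2
    norm = (2.0 * np.pi * s2) ** (d / 2.0)
    G = np.exp(-sq_dist / (2.0 * s2)) / norm        # pairwise kernel values
    # F_i = (2 / (N^2 * s2)) * sum_j G(x_i - x_j) (x_j - x_i)
    return 2.0 / (N ** 2 * s2) * np.einsum('ij,ijk->ik', G, diffs)

# These forces attract samples toward each other; letting samples drift along them
# (the question on the next slide) concentrates the data and lowers its entropy.
```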
19. What will happen if we allow the particles to move under the influence of these forces?
[Figure: information forces within a dataset arising due to H(X).]
20. Information Theoretic Learning: Backpropagation of Information Forces
- Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
21. Information Theoretic Learning: Quadratic Divergence Measures
- Kullback-Leibler divergence
- Renyi's divergence
- Euclidean distance
- Cauchy-Schwarz distance
- Mutual information is a special case (the divergence between the joint and the product of the marginals)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
22. Information Theoretic Learning: A Unifying Criterion for Learning from Samples
23. Training the ADALINE Sample by Sample: The Stochastic Information Gradient (SIG)
- Theorem: the expected value of the stochastic information gradient (SIG) is the gradient of Shannon's entropy estimated from the samples using Parzen windowing.
- For the Gaussian kernel and M = 1, the update reduces to a particularly simple form (see the sketch after this slide).
- The form is the same as for LMS, except that entropy learning works with differences in samples.
- The SIG works implicitly with the L1 norm of the error.
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
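A hedged sketch of the sample-by-sample rule this slide describes, assuming the minimum-error-entropy setting with a Gaussian kernel and M = 1: the update then has the LMS form, but applied to differences of consecutive errors and inputs. The function name, step size, and toy data are mine.

```python
import numpy as np

def train_adaline_sig(X, d, eta=0.05, sigma=1.0, epochs=10):
    """Sample-by-sample ADALINE training with the stochastic information gradient (SIG),
    Gaussian kernel, M = 1: an LMS-like update on differences of consecutive errors and
    inputs (a sketch; the kernel size sigma is a free parameter)."""
    N, p = X.shape
    w = np.zeros(p)
    e_prev, x_prev = 0.0, np.zeros(p)
    for _ in range(epochs):
        for n in range(N):
            e = d[n] - w @ X[n]                               # current error
            # SIG step: descend the Parzen estimate of the error entropy
            w += eta / sigma ** 2 * (e - e_prev) * (X[n] - x_prev)
            e_prev, x_prev = e, X[n].copy()
    return w

# Usage on a toy system-identification problem (assumed setup, for illustration only)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.05 * rng.laplace(size=500)                 # impulsive (non-Gaussian) noise
print(train_adaline_sig(X, d))                                # approaches w_true
```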
24. SIG: Hebbian Updates
- In a linear network the Hebbian update is Δw ∝ y·x (output times input).
- The update maximizing Shannon output entropy with the SIG instead uses differences of consecutive outputs and inputs.
- Which is more powerful and biologically plausible?
Experiment: generated 50 samples of a 2D distribution where the x axis is uniform and the y axis is Gaussian, with the sample covariance matrix equal to the identity. Hebbian updates would converge to any direction, but the SIG consistently found the 90-degree direction!
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
25. ITL Applications
www.cnel.ufl.edu → the ITL page has examples and Matlab code
26. Renyi's Cross Entropy
Let X and Y be two random variables with i.i.d. samples. Then Renyi's cross entropy is given by the expression below; using the Parzen estimate for the pdfs gives a sample estimator.
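Since the slide's equations did not survive the conversion, here is the standard form I believe is intended (notation assumed): the quadratic cross entropy and its Parzen plug-in estimate, the cross information potential.

```latex
% Renyi's quadratic cross entropy between X ~ p and Y ~ q (assumed notation)
\begin{align}
H_2(X;Y) &= -\log \int p(z)\,q(z)\,dz,\\
\hat H_2(X;Y) &= -\log\underbrace{\frac{1}{N_X N_Y}\sum_{i=1}^{N_X}\sum_{j=1}^{N_Y}
G_{\sigma\sqrt{2}}(x_i - y_j)}_{\text{cross information potential } V_2(X;Y)}.
\end{align}
```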
27. Cross Information Potential and Cross Information Force
The force between particles of two datasets.
28. Cross information force between two datasets arising due to H(X;Y)
29. Cauchy-Schwarz Divergence
A measure of similarity between two datasets; it vanishes when the two probability density functions are the same.
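A small NumPy sketch of the Cauchy-Schwarz divergence estimated from samples through information potentials (my own reconstruction; names and toy data are illustrative).

```python
import numpy as np

def _potential(A, B, sigma):
    """Cross information potential: (1/(N*M)) sum_ij G_{sigma*sqrt(2)}(a_i - b_j)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    d = A.shape[1]
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    s2 = 2.0 * sigma ** 2
    return np.mean(np.exp(-sq / (2.0 * s2))) / (2.0 * np.pi * s2) ** (d / 2.0)

def cauchy_schwarz_divergence(X, Y, sigma=1.0):
    """D_CS(X, Y) = log V(X,X) + log V(Y,Y) - 2 log V(X,Y); zero iff the estimated pdfs coincide."""
    return (np.log(_potential(X, X, sigma)) + np.log(_potential(Y, Y, sigma))
            - 2.0 * np.log(_potential(X, Y, sigma)))

# Usage: near zero for two samples of the same density, clearly positive for shifted data
rng = np.random.default_rng(2)
A = rng.standard_normal((300, 2))
B = rng.standard_normal((300, 2))
print(cauchy_schwarz_divergence(A, B))           # near zero
print(cauchy_schwarz_divergence(A, B + 3.0))     # clearly positive
```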
30. A New ITL Framework: Information Theoretic Mean Shift
STATEMENT
Consider a dataset X0 with i.i.d. samples. We wish to find a new dataset X which captures interesting structures of the original dataset X0.
FORMULATION
Cost = redundancy reduction term + similarity measure term (a weighted combination).
31. Information Theoretic Mean Shift
Form 1 (a sketch of the assumed cost follows this slide): the cost looks like a reaction-diffusion equation; the entropy term implements diffusion and the Cauchy-Schwarz term implements attraction to the original data.
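A reconstruction of what Form 1 is assumed to be, pieced together from the surrounding slides (the exact weighting the authors use may differ):

```latex
% Assumed Form 1 of the ITMS cost (my reconstruction from the surrounding slides):
% a redundancy-reduction term (entropy) plus a weighted similarity term (Cauchy-Schwarz).
\begin{align}
\min_{X}\; J(X) &= H_2(X) \;+\; \lambda\, D_{CS}(X; X_0),\\
H_2(X) &= -\log V_2(X), \qquad
D_{CS}(X;X_0) = \log\frac{V_2(X)\,V_2(X_0)}{V_2(X;X_0)^{2}}.
\end{align}
% The entropy term diffuses (blurs) the samples; the D_CS term pulls them back toward X_0.
```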
32. Analogy
The weighting parameter λ squeezes the information flow through a bottleneck, extracting different levels of structure in the data.
- We can also visualize λ as a slope parameter. The previous methods used only λ = 1 or λ = 0.
33. Self-Organizing Rule
Rewriting the cost function, differentiating with respect to x_k, k = 1, 2, ..., N, and rearranging gives a fixed-point update!
34. An Example: Crescent-Shaped Dataset
35. Effect of λ
36. Summary 2
Starting with the data:
- λ → ∞: back to the data
- λ = 1: the modes
- λ = 0: a single point
37. Applications: Clustering
Statement: segment the data into different groups such that samples belonging to the same group are closer to each other than samples of different groups.
The idea: mode-finding ability → clustering.
38. Mean Shift: A Review
Modes are stationary points of the Parzen density estimate, i.e., points where its gradient vanishes.
39. Two Variants: GBMS and GMS
- Gaussian Mean Shift (GMS): two datasets X and X0; initialize X = X0 and keep X0 fixed.
- Gaussian Blurring Mean Shift (GBMS): a single dataset X; initialize X = X0 and blur X itself at every iteration.
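A minimal NumPy sketch of the two fixed-point updates described above (my own code; kernel size, iteration counts, and toy data are illustrative choices). The next slide relates them to the λ = 1 and λ = 0 settings of ITMS.

```python
import numpy as np

def _kernel_weights(A, B, sigma):
    """Gaussian kernel weights w[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def gms(X0, sigma=1.0, iters=50):
    """Gaussian Mean Shift: X0 stays fixed; each point of X (initialized at X0)
    moves to a kernel-weighted average of the ORIGINAL samples -> converges to modes."""
    X = X0.copy()
    for _ in range(iters):
        W = _kernel_weights(X, X0, sigma)
        X = W @ X0 / W.sum(axis=1, keepdims=True)    # fixed-point update
    return X

def gbms(X0, sigma=1.0, iters=10):
    """Gaussian Blurring Mean Shift: a single dataset blurred in place;
    each iteration replaces every point by a weighted average of the CURRENT points."""
    X = X0.copy()
    for _ in range(iters):
        W = _kernel_weights(X, X, sigma)
        X = W @ X / W.sum(axis=1, keepdims=True)
    return X

# Usage: two well-separated Gaussian blobs collapse onto their two mode locations under GMS
rng = np.random.default_rng(3)
X0 = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 0.3, (100, 2))])
print(np.round(gms(X0, sigma=0.5), 1)[:3])           # first rows converge near the (0, 0) mode
```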
40. Connection to ITMS
- λ = 0 → GBMS
- λ = 1 → GMS
41. Applications: Clustering
[Figure: 10 random Gaussian clusters and their pdf plot.]
42. GBMS Result vs. GMS Result
[Figures.]
43. Image Segmentation
44. GBMS vs. GMS Segmentation Results
[Figures.]
45. Applications: Principal Curves
- A nonlinear extension of PCA.
- "Self-consistent smooth curves which pass through the middle of a d-dimensional probability distribution or data cloud."
A new definition (Erdogmus et al.): a point is an element of the d-dimensional principal set, denoted P^d, iff the gradient of the pdf at that point is orthogonal to at least (n - d) eigenvectors of the Hessian of the pdf, and the point is a strict local maximum in the subspace spanned by these eigenvectors.
46. Principal Curves, Continued
- P^0 is the 0-dimensional principal set, corresponding to the modes of the data; P^1 is the 1-dimensional principal curve; P^2 is a 2-dimensional principal surface, and so on.
- Hierarchical structure: P^0 ⊂ P^1 ⊂ P^2 ⊂ ...
- ITMS satisfies this definition (experimentally).
- ITMS gives the principal curve for an appropriate setting of λ.
47. Principal Curve of Spiral Data Passing Through the Modes
[Figure.]
48. Denoising: Chain-of-Rings Dataset
[Figure.]
49. Applications: Vector Quantization
- The limiting case of ITMS (λ → ∞).
- D_CS(X; X0) can be seen as a distortion measure between X and X0.
- Initialize X with far fewer points than X0.
50. Comparison: ITVQ vs. LBG
[Figures.]
51. Unsupervised Learning Tasks: Choose a Point in a 2D Space!
- λ → different tasks: vector quantization, principal curves, clustering
- σ → different scales
52. Conclusions
- Goal-oriented structures go beyond preprocessing stages and help us extract abstract representations of the data.
- A common framework binds these interesting structures as different levels of information extraction from the data. ITMS achieves this and can be used for
  - clustering
  - principal curves
  - vector quantization, and more...
53. What's Next?
54. Correntropy: A New Generalized Similarity Measure
- Correlation is one of the most widely used functions in signal processing and pattern recognition.
- But correlation only quantifies similarity fully if the random variables are Gaussian distributed.
- Can we define a new function that measures similarity but is not restricted to second-order statistics? Use the ITL framework.
55. Correntropy: A New Generalized Similarity Measure
- Define the correntropy of a random process x_t as V(t, s) = E[κ(x_t - x_s)].
- We can easily estimate correntropy using kernels (see the sketch after this slide).
- The name correntropy comes from the fact that the average over the lags (or the dimensions) is the information potential (the argument of Renyi's entropy).
- For strictly stationary and ergodic random processes, the time average converges to the ensemble definition.
Santamaria I., Pokharel P., Principe J., "Generalized Correlation Function: Definition, Properties and Application to Blind Equalization", IEEE Trans. Signal Proc., vol. 54, no. 6, pp. 2187-2197, 2006.
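A short NumPy sketch of the kernel estimator suggested above, assuming a strictly stationary scalar process so that correntropy depends only on the lag; the function name, kernel size, and toy signal are mine.

```python
import numpy as np

def correntropy(x, max_lag, sigma=1.0):
    """Sample correntropy V(tau) = mean_n kappa_sigma(x[n] - x[n - tau]) for tau = 0..max_lag,
    with a Gaussian kernel (a sketch assuming a strictly stationary scalar process)."""
    x = np.asarray(x, dtype=float)
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)      # Gaussian kernel value at zero
    V = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        diff = x[tau:] - x[:len(x) - tau]            # lag-tau differences
        V[tau] = norm * np.mean(np.exp(-diff ** 2 / (2.0 * sigma ** 2)))
    return V

# Usage: correntropy of a noisy sine wave (cf. the sine-wave example on the next slide)
n = np.arange(1000)
x = np.sin(2 * np.pi * n / 50) + 0.1 * np.random.default_rng(4).standard_normal(1000)
print(np.round(correntropy(x, max_lag=5, sigma=0.5), 3))
```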
56. Correntropy: A New Generalized Similarity Measure
- What does it look like? The sine wave example. [Figure.]
57. Correntropy: A New Generalized Similarity Measure
- Properties of correntropy:
  - It has a maximum at the origin (equal to the kernel value at zero).
  - It is a symmetric positive function.
  - Its mean value is the information potential.
  - Correntropy includes higher-order moments of the data.
  - The matrix whose elements are the correntropy values at different lags is Toeplitz.
58. Correntropy: A New Generalized Similarity Measure
- Correntropy as a cost function versus MSE.
59. Correntropy: A New Generalized Similarity Measure
- Correntropy induces a metric (CIM) in the sample space, defined by CIM(X, Y) = ( κ(0) - V(X, Y) )^(1/2).
- Therefore correntropy can be used as an alternative similarity criterion in the space of samples.
Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", accepted in IEEE Trans. Sig. Proc.
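A hedged NumPy sketch of the CIM between two sample vectors, following the definition above with a Gaussian kernel (variable names and the outlier demonstration are illustrative):

```python
import numpy as np

def cim(x, y, sigma=1.0):
    """Correntropy induced metric between two sample vectors of equal length:
    CIM(x, y) = sqrt(kappa(0) - (1/N) sum_i kappa_sigma(x_i - y_i)), Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    k0 = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)                  # kernel value at zero
    v = k0 * np.mean(np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2)))
    return np.sqrt(k0 - v)

# Unlike the Euclidean distance, CIM saturates: one gross outlier barely changes it.
a = np.zeros(100)
b = np.zeros(100); b[0] = 1000.0                               # a single large outlier
print(np.linalg.norm(a - b), cim(a, b, sigma=1.0))             # 1000.0 vs. a small number
```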
60. RKHS Induced by Correntropy: Definition
- For a stochastic process {X_t, t ∈ T}, with T an index set, the correntropy defined as V_X(t, s) = E[K(X_t - X_s)] is symmetric and positive definite.
- Thus it induces a new RKHS, denoted VRKHS (H_V). There is a kernel mapping Φ such that V_X(t, s) = <Φ(t), Φ(s)>_{H_V}.
- Any symmetric non-negative definite kernel is the covariance kernel of a random function and vice versa (Parzen).
- Therefore, given a random function {X_t, t ∈ T} there exists another random function {f_t, t ∈ T} such that E[f_t f_s] = V_X(t, s).
61. RKHS Induced by Correntropy: Definition
- This RKHS seems very appropriate for nonlinear signal processing.
- In this space we can compute, using linear algorithms, solutions that are nonlinear in the input space, such as:
  - Matched filters
  - Wiener filters
  - Principal component analysis
  - Constrained optimization problems
  - Adaptive filtering and controls