Title: Information Theoretic Learning: Finding Structure in Data

1. Information Theoretic Learning: Finding Structure in Data
- Jose Principe and Sudhir Rao
- University of Florida
- principe_at_cnel.ufl.edu
- www.cnel.ufl.edu
2. Outline
- Structure
- Connection to Learning
- Learning Structure: the old view
- A new framework
- Applications
3. Structure
- Patterns / regularities
- Amorphous / chaos
- Interdependence between subsystems
- White noise
4. Connection to Learning
5. Types of Learning
- Supervised learning
  - Data
  - Desired signal / teacher
- Reinforcement learning
  - Data
  - Rewards / punishments
- Unsupervised learning
  - Only the data
6. Unsupervised Learning
- What can be done with only the data?

First principles and examples:
- Preserve maximum information: auto-associative memory, ART, PCA, Linsker's infomax rule
- Extract independent features: Barlow's minimum redundancy principle, ICA, etc.
- Learn the probability distribution: Gaussian mixture models, the EM algorithm, parametric density estimation
7. Connection to Self-Organization
"If cell 1 is one of the cells providing input to cell 2, and if cell 1's activity tends to be high whenever cell 2's activity is high, then the future contributions that the firing of cell 1 makes to the firing of cell 2 should increase."
- Donald Hebb, 1949, neuropsychologist

What is the purpose?
"Does the Hebb-type algorithm cause a developing perceptual network to optimize some property that is deeply connected with the mature network's functioning as an information processing system?"
- Linsker, 1988

[Diagram: three nodes A, B, C; increase the weight w_B in proportion to the activity of B and C.]
8. Linsker's Infomax Principle
[Diagram: linear network with inputs X1, ..., XL, weights w1, ..., wL, additive noise, and output Y.]
Under Gaussian assumptions and uncorrelated noise, the rate of a linear network is R = (1/2) log( var(Y) / var(noise) ).
Maximize the rate → maximize the Shannon rate I(X, Y) → Hebbian rule!
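To make the chain of implications concrete, here is a short derivation sketch (my notation, not taken from the slides), assuming a single linear unit y = wᵀx + n with Gaussian input of covariance C and uncorrelated Gaussian noise of variance σ_n²:

```latex
% Sketch: why maximizing the Shannon rate of a linear Gaussian unit yields a Hebbian rule.
% Assumptions (mine): y = w^T x + n, x Gaussian with covariance C, noise n independent of x.
\begin{align}
I(X;Y) &= \tfrac{1}{2}\log\frac{\operatorname{var}(y)}{\sigma_n^2}
        = \tfrac{1}{2}\log\frac{\mathbf{w}^{\top} C\,\mathbf{w} + \sigma_n^2}{\sigma_n^2},\\
\frac{\partial I}{\partial \mathbf{w}} &\propto C\,\mathbf{w} = E[\mathbf{x}\,(\mathbf{w}^{\top}\mathbf{x})],\\
\Delta \mathbf{w} &\propto E[\mathbf{x}\, y] \quad \text{(input times output: a Hebbian update).}
\end{align}
```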
9. Barlow's Redundancy Principle
Independent features → no redundancy → ICA!
Converting an M-dimensional problem into N one-dimensional problems:
- N conditional probabilities required for an event V: P(V | feature i)
- 2^M conditional probabilities required for an event V: P(V | stimuli)
(For example, M = 20 binary stimuli already require about a million joint conditionals, versus 20 for independent features.)
10. Summary 1
- Global objective function (example: infomax): extracting the desired signal from the data itself.
- Self-organizing rule (example: the Hebbian rule): revealing the structure through the interaction of the data points.
- Unsupervised learning (example: PCA): discovering structure in the data.
11. Questions
- Can we go beyond these preprocessing stages?
- Can we create a global cost function which extracts goal-oriented structures from the data?
- Can we derive a self-organizing principle from such a cost function?
A big YES!
12. What is Information Theoretic Learning?
- ITL is a methodology to adapt linear or nonlinear systems using criteria based on the information descriptors of entropy and divergence.
- Its centerpiece is a non-parametric estimator of entropy that
  - does not require an explicit estimation of the pdf
  - uses the Parzen window method, which is known to be consistent and efficient
  - is smooth
  - is readily integrated in conventional gradient descent learning
  - provides a link to kernel learning and SVMs
  - allows an extension to random processes
13. ITL is a Different Way of Thinking About Data Quantification
- Moment expansions, in particular second-order moments, are still today the workhorse of statistics. We automatically translate deep concepts (e.g. similarity, Hebb's postulate of learning) into 2nd-order statistical equivalents.
- ITL replaces 2nd-order moments with a geometric statistical interpretation of data in probability spaces:
  - variance by entropy
  - correlation by correntropy
  - mean square error (MSE) by minimum error entropy (MEE)
  - distances in data space by distances in probability spaces
14. Information Theoretic Learning: Entropy
- Entropy quantifies the degree of uncertainty in a random variable. Claude Shannon defined entropy as H(X) = -∫ p(x) log p(x) dx.
- Not all random variables (r.v.) are equally random!
15. Information Theoretic Learning: Renyi's Entropy
Renyi's entropy of order α, H_α(X) = 1/(1-α) log ∫ p^α(x) dx, equals Shannon's entropy as α → 1.
16. Information Theoretic Learning: Parzen Windowing
- Given only samples {x_i} drawn from a distribution, the Parzen estimate of the pdf is p̂(x) = (1/N) Σ_i κ_σ(x - x_i).
- Convergence: the estimate is consistent, approaching the true pdf as N → ∞ with a suitably shrinking kernel size.
17. Information Theoretic Learning: Renyi's Quadratic Entropy
- Order-2 entropy with Gaussian kernels: H_2(X) = -log V_2(X), with V_2(X) = (1/N^2) Σ_i Σ_j G_{σ√2}(x_j - x_i).
- Pairwise interactions between samples: O(N^2).
- The information potential V_2(X) provides a potential field over the space of the samples, parameterized by the kernel size σ.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
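As a concrete illustration of the estimator above, here is a minimal NumPy sketch (my own code, not the authors' toolbox) that evaluates the information potential V2(X) and the quadratic entropy estimate; the kernel size sigma is a free parameter.

```python
import numpy as np

def information_potential(X, sigma=1.0):
    """Parzen-based information potential V2(X) = (1/N^2) sum_ij G_{sigma*sqrt(2)}(x_i - x_j)."""
    X = np.atleast_2d(X)                            # shape (N, d)
    N, d = X.shape
    diffs = X[:, None, :] - X[None, :, :]           # all pairwise differences, (N, N, d)
    sq_dist = np.sum(diffs ** 2, axis=-1)           # squared Euclidean distances
    s2 = 2.0 * sigma ** 2                           # (sigma*sqrt(2))^2
    norm = (2.0 * np.pi * s2) ** (d / 2.0)          # Gaussian normalization constant
    G = np.exp(-sq_dist / (2.0 * s2)) / norm        # kernel evaluated on every pair
    return G.mean()                                 # (1/N^2) * sum over all pairs

def renyi_quadratic_entropy(X, sigma=1.0):
    """H2(X) = -log V2(X)."""
    return -np.log(information_potential(X, sigma))

# Usage: a tight cluster has lower quadratic entropy than a spread-out cloud
rng = np.random.default_rng(0)
tight = 0.1 * rng.standard_normal((200, 2))
wide = 2.0 * rng.standard_normal((200, 2))
print(renyi_quadratic_entropy(tight, sigma=0.5))    # smaller entropy
print(renyi_quadratic_entropy(wide, sigma=0.5))     # larger entropy
```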
18. Information Theoretic Learning: Information Forces
- In adaptation, samples become information particles that interact through information forces.
- Information potential: V_2(X) as defined above.
- Information force on sample x_i: the derivative of the information potential, F(x_i) = ∂V_2(X)/∂x_i.
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000. Erdogmus, Principe, Hild, Natural Computing, 2002.
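Continuing the sketch, the information force on each sample is the gradient of V2(X) with respect to that sample; again this is an illustrative NumPy reconstruction under the same kernel convention, not the authors' code.

```python
import numpy as np

def information_forces(X, sigma=1.0):
    """Force on each sample: F_i = dV2(X)/dx_i, pointing toward regions of higher sample density."""
    X = np.atleast_2d(X)
    N, d = X.shape
    diffs = X[None, :, :] - X[:, None, :]           # diffs[i, j] = x_j - x_i
    sq_dist = np.sum(diffs ** 2, axis=-1)
    s2 = 2.0 * sigma ** 2                           # (sigma*sqrt(2))^2
    norm = (2.0 * np.pi * s2) ** (d / 2.0)
    G = np.exp(-sq_dist / (2.0 * s2)) / norm        # pairwise kernel values
    # F_i = (2 / (N^2 * s2)) * sum_j G(x_i - x_j) (x_j - x_i)
    return 2.0 / (N ** 2 * s2) * np.einsum('ij,ijk->ik', G, diffs)

# These forces attract samples toward each other; letting samples drift along them
# (the question on the next slide) concentrates the data and lowers its entropy.
```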
19. What will happen if we allow the particles to move under the influence of these forces?
[Figure: information forces within a dataset arising due to H(X).]
20. Information Theoretic Learning: Backpropagation of Information Forces
- Information forces become the injected error to the dual or adjoint network that determines the weight updates for adaptation.
21. Information Theoretic Learning: Quadratic Divergence Measures
- Kullback-Leibler divergence
- Renyi's divergence
- Euclidean distance
- Cauchy-Schwarz distance
- Mutual information is a special case (the divergence between the joint and the product of the marginals)
Principe, Fisher, Xu, Unsupervised Adaptive Filtering (S. Haykin, ed.), Wiley, 2000.
22. Information Theoretic Learning: A Unifying Criterion for Learning from Samples
23. Training the ADALINE Sample by Sample: The Stochastic Information Gradient (SIG)
- Theorem: the expected value of the stochastic information gradient (SIG) is the gradient of Shannon's entropy estimated from the samples using Parzen windowing.
- For the Gaussian kernel and M = 1, the update reduces to a particularly simple form (see the sketch after this slide).
- The form is the same as for LMS, except that entropy learning works with differences in samples.
- The SIG works implicitly with the L1 norm of the error.
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
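A hedged sketch of the sample-by-sample rule this slide describes, assuming the minimum-error-entropy setting with a Gaussian kernel and M = 1: the update then has the LMS form, but applied to differences of consecutive errors and inputs. The function name, step size, and toy data are mine.

```python
import numpy as np

def train_adaline_sig(X, d, eta=0.05, sigma=1.0, epochs=10):
    """Sample-by-sample ADALINE training with the stochastic information gradient (SIG),
    Gaussian kernel, M = 1: an LMS-like update on differences of consecutive errors and
    inputs (a sketch; the kernel size sigma is a free parameter)."""
    N, p = X.shape
    w = np.zeros(p)
    e_prev, x_prev = 0.0, np.zeros(p)
    for _ in range(epochs):
        for n in range(N):
            e = d[n] - w @ X[n]                               # current error
            # SIG step: descend the Parzen estimate of the error entropy
            w += eta / sigma ** 2 * (e - e_prev) * (X[n] - x_prev)
            e_prev, x_prev = e, X[n].copy()
    return w

# Usage on a toy system-identification problem (assumed setup, for illustration only)
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 3))
w_true = np.array([0.5, -1.0, 2.0])
d = X @ w_true + 0.05 * rng.laplace(size=500)                 # impulsive (non-Gaussian) noise
print(train_adaline_sig(X, d))                                # approaches w_true
```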
24. SIG: Hebbian Updates
- In a linear network the Hebbian update is Δw ∝ y·x (output times input).
- The update maximizing Shannon output entropy with the SIG instead uses differences of consecutive outputs and inputs.
- Which is more powerful and biologically plausible?
Experiment: generated 50 samples of a 2D distribution where the x axis is uniform and the y axis is Gaussian, with the sample covariance matrix equal to the identity. Hebbian updates would converge to any direction, but the SIG consistently found the 90-degree direction!
Erdogmus, Principe, Hild, IEEE Signal Processing Letters, 2003.
25. ITL Applications
www.cnel.ufl.edu → the ITL page has examples and Matlab code
26. Renyi's Cross Entropy
Let X and Y be two random variables with i.i.d. samples. Then Renyi's cross entropy is given by the expression below; using the Parzen estimate for the pdfs gives a sample estimator.
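Since the slide's equations did not survive the conversion, here is the standard form I believe is intended (notation assumed): the quadratic cross entropy and its Parzen plug-in estimate, the cross information potential.

```latex
% Renyi's quadratic cross entropy between X ~ p and Y ~ q (assumed notation)
\begin{align}
H_2(X;Y) &= -\log \int p(z)\,q(z)\,dz,\\
\hat H_2(X;Y) &= -\log\underbrace{\frac{1}{N_X N_Y}\sum_{i=1}^{N_X}\sum_{j=1}^{N_Y}
G_{\sigma\sqrt{2}}(x_i - y_j)}_{\text{cross information potential } V_2(X;Y)}.
\end{align}
```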
27. Cross Information Potential and Cross Information Force
The force between particles of two datasets.
28. Cross information force between two datasets arising due to H(X;Y)
29. Cauchy-Schwarz Divergence
A measure of similarity between two datasets; it vanishes when the two probability density functions are the same.
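A small NumPy sketch of the Cauchy-Schwarz divergence estimated from samples through information potentials (my own reconstruction; names and toy data are illustrative).

```python
import numpy as np

def _potential(A, B, sigma):
    """Cross information potential: (1/(N*M)) sum_ij G_{sigma*sqrt(2)}(a_i - b_j)."""
    A, B = np.atleast_2d(A), np.atleast_2d(B)
    d = A.shape[1]
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    s2 = 2.0 * sigma ** 2
    return np.mean(np.exp(-sq / (2.0 * s2))) / (2.0 * np.pi * s2) ** (d / 2.0)

def cauchy_schwarz_divergence(X, Y, sigma=1.0):
    """D_CS(X, Y) = log V(X,X) + log V(Y,Y) - 2 log V(X,Y); zero iff the estimated pdfs coincide."""
    return (np.log(_potential(X, X, sigma)) + np.log(_potential(Y, Y, sigma))
            - 2.0 * np.log(_potential(X, Y, sigma)))

# Usage: near zero for two samples of the same density, clearly positive for shifted data
rng = np.random.default_rng(2)
A = rng.standard_normal((300, 2))
B = rng.standard_normal((300, 2))
print(cauchy_schwarz_divergence(A, B))           # near zero
print(cauchy_schwarz_divergence(A, B + 3.0))     # clearly positive
```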
30. A New ITL Framework: Information Theoretic Mean Shift
STATEMENT
Consider a dataset X0 with i.i.d. samples. We wish to find a new dataset X which captures interesting structures of the original dataset X0.
FORMULATION
Cost = redundancy reduction term + similarity measure term (a weighted combination).
31. Information Theoretic Mean Shift
Form 1 (a sketch of the assumed cost follows this slide): the cost looks like a reaction-diffusion equation; the entropy term implements diffusion and the Cauchy-Schwarz term implements attraction to the original data.
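A reconstruction of what Form 1 is assumed to be, pieced together from the surrounding slides (the exact weighting the authors use may differ):

```latex
% Assumed Form 1 of the ITMS cost (my reconstruction from the surrounding slides):
% a redundancy-reduction term (entropy) plus a weighted similarity term (Cauchy-Schwarz).
\begin{align}
\min_{X}\; J(X) &= H_2(X) \;+\; \lambda\, D_{CS}(X; X_0),\\
H_2(X) &= -\log V_2(X), \qquad
D_{CS}(X;X_0) = \log\frac{V_2(X)\,V_2(X_0)}{V_2(X;X_0)^{2}}.
\end{align}
% The entropy term diffuses (blurs) the samples; the D_CS term pulls them back toward X_0.
```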
32. Analogy
The weighting parameter λ squeezes the information flow through a bottleneck, extracting different levels of structure in the data.
- We can also visualize λ as a slope parameter. The previous methods used only λ = 1 or λ = 0.
33. Self-Organizing Rule
Rewriting the cost function, differentiating with respect to x_k, k = 1, 2, ..., N, and rearranging gives a fixed-point update!
34. An Example: Crescent-Shaped Dataset
35. Effect of λ
36. Summary 2
Starting with the data:
- λ → ∞: back to the data
- λ = 1: the modes
- λ = 0: a single point
37. Applications: Clustering
Statement: segment the data into different groups such that samples belonging to the same group are closer to each other than samples of different groups.
The idea: mode-finding ability → clustering.
38. Mean Shift: A Review
Modes are stationary points of the Parzen density estimate, i.e., points where its gradient vanishes.
39. Two Variants: GBMS and GMS
- Gaussian Mean Shift (GMS): two datasets X and X0; initialize X = X0 and keep X0 fixed.
- Gaussian Blurring Mean Shift (GBMS): a single dataset X; initialize X = X0 and blur X itself at every iteration.
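A minimal NumPy sketch of the two fixed-point updates described above (my own code; kernel size, iteration counts, and toy data are illustrative choices). The next slide relates them to the λ = 1 and λ = 0 settings of ITMS.

```python
import numpy as np

def _kernel_weights(A, B, sigma):
    """Gaussian kernel weights w[i, j] = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def gms(X0, sigma=1.0, iters=50):
    """Gaussian Mean Shift: X0 stays fixed; each point of X (initialized at X0)
    moves to a kernel-weighted average of the ORIGINAL samples -> converges to modes."""
    X = X0.copy()
    for _ in range(iters):
        W = _kernel_weights(X, X0, sigma)
        X = W @ X0 / W.sum(axis=1, keepdims=True)    # fixed-point update
    return X

def gbms(X0, sigma=1.0, iters=10):
    """Gaussian Blurring Mean Shift: a single dataset blurred in place;
    each iteration replaces every point by a weighted average of the CURRENT points."""
    X = X0.copy()
    for _ in range(iters):
        W = _kernel_weights(X, X, sigma)
        X = W @ X / W.sum(axis=1, keepdims=True)
    return X

# Usage: two well-separated Gaussian blobs collapse onto their two mode locations under GMS
rng = np.random.default_rng(3)
X0 = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(4, 0.3, (100, 2))])
print(np.round(gms(X0, sigma=0.5), 1)[:3])           # first rows converge near the (0, 0) mode
```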
40. Connection to ITMS
- λ = 0 → GBMS
- λ = 1 → GMS
41. Applications: Clustering
[Figure: 10 random Gaussian clusters and their pdf plot.]
42. GBMS Result vs. GMS Result
[Figures.]
43. Image Segmentation
44. GBMS vs. GMS Segmentation Results
[Figures.]
45. Applications: Principal Curves
- A nonlinear extension of PCA.
- "Self-consistent smooth curves which pass through the middle of a d-dimensional probability distribution or data cloud."
A new definition (Erdogmus et al.): a point is an element of the d-dimensional principal set, denoted P^d, iff the gradient of the pdf at that point is orthogonal to at least (n - d) eigenvectors of the Hessian of the pdf, and the point is a strict local maximum in the subspace spanned by these eigenvectors.
46. Principal Curves, Continued
- P^0 is the 0-dimensional principal set, corresponding to the modes of the data; P^1 is the 1-dimensional principal curve; P^2 is a 2-dimensional principal surface, and so on.
- Hierarchical structure: P^0 ⊂ P^1 ⊂ P^2 ⊂ ...
- ITMS satisfies this definition (experimentally).
- ITMS gives the principal curve for an appropriate setting of λ.
47. Principal Curve of Spiral Data Passing Through the Modes
[Figure.]
48. Denoising: Chain-of-Rings Dataset
[Figure.]
49. Applications: Vector Quantization
- The limiting case of ITMS (λ → ∞).
- D_CS(X; X0) can be seen as a distortion measure between X and X0.
- Initialize X with far fewer points than X0.
50. Comparison: ITVQ vs. LBG
[Figures.]
51. Unsupervised Learning Tasks: Choose a Point in a 2D Space!
- λ → different tasks: vector quantization, principal curves, clustering
- σ → different scales
52. Conclusions
- Goal-oriented structures go beyond preprocessing stages and help us extract abstract representations of the data.
- A common framework binds these interesting structures as different levels of information extraction from the data. ITMS achieves this and can be used for
  - clustering
  - principal curves
  - vector quantization, and more...
53. What's Next?
54. Correntropy: A New Generalized Similarity Measure
- Correlation is one of the most widely used functions in signal processing and pattern recognition.
- But correlation only quantifies similarity fully if the random variables are Gaussian distributed.
- Can we define a new function that measures similarity but is not restricted to second-order statistics? Use the ITL framework.
55. Correntropy: A New Generalized Similarity Measure
- Define the correntropy of a random process x_t as V(t, s) = E[κ(x_t - x_s)].
- We can easily estimate correntropy using kernels (see the sketch after this slide).
- The name correntropy comes from the fact that the average over the lags (or the dimensions) is the information potential (the argument of Renyi's entropy).
- For strictly stationary and ergodic random processes, the time average converges to the ensemble definition.
Santamaria I., Pokharel P., Principe J., "Generalized Correlation Function: Definition, Properties and Application to Blind Equalization", IEEE Trans. Signal Proc., vol. 54, no. 6, pp. 2187-2197, 2006.
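A short NumPy sketch of the kernel estimator suggested above, assuming a strictly stationary scalar process so that correntropy depends only on the lag; the function name, kernel size, and toy signal are mine.

```python
import numpy as np

def correntropy(x, max_lag, sigma=1.0):
    """Sample correntropy V(tau) = mean_n kappa_sigma(x[n] - x[n - tau]) for tau = 0..max_lag,
    with a Gaussian kernel (a sketch assuming a strictly stationary scalar process)."""
    x = np.asarray(x, dtype=float)
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)      # Gaussian kernel value at zero
    V = np.empty(max_lag + 1)
    for tau in range(max_lag + 1):
        diff = x[tau:] - x[:len(x) - tau]            # lag-tau differences
        V[tau] = norm * np.mean(np.exp(-diff ** 2 / (2.0 * sigma ** 2)))
    return V

# Usage: correntropy of a noisy sine wave (cf. the sine-wave example on the next slide)
n = np.arange(1000)
x = np.sin(2 * np.pi * n / 50) + 0.1 * np.random.default_rng(4).standard_normal(1000)
print(np.round(correntropy(x, max_lag=5, sigma=0.5), 3))
```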
56. Correntropy: A New Generalized Similarity Measure
- What does it look like? The sine wave example. [Figure.]
57. Correntropy: A New Generalized Similarity Measure
- Properties of correntropy:
  - It has a maximum at the origin (equal to the kernel value at zero).
  - It is a symmetric positive function.
  - Its mean value is the information potential.
  - Correntropy includes higher-order moments of the data.
  - The matrix whose elements are the correntropy values at different lags is Toeplitz.
58. Correntropy: A New Generalized Similarity Measure
- Correntropy as a cost function versus MSE.
59. Correntropy: A New Generalized Similarity Measure
- Correntropy induces a metric (CIM) in the sample space, defined by CIM(X, Y) = ( κ(0) - V(X, Y) )^(1/2).
- Therefore correntropy can be used as an alternative similarity criterion in the space of samples.
Liu W., Pokharel P., Principe J., "Correntropy: Properties and Applications in Non-Gaussian Signal Processing", accepted in IEEE Trans. Sig. Proc.
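A hedged NumPy sketch of the CIM between two sample vectors, following the definition above with a Gaussian kernel (variable names and the outlier demonstration are illustrative):

```python
import numpy as np

def cim(x, y, sigma=1.0):
    """Correntropy induced metric between two sample vectors of equal length:
    CIM(x, y) = sqrt(kappa(0) - (1/N) sum_i kappa_sigma(x_i - y_i)), Gaussian kernel."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    k0 = 1.0 / (np.sqrt(2.0 * np.pi) * sigma)                  # kernel value at zero
    v = k0 * np.mean(np.exp(-(x - y) ** 2 / (2.0 * sigma ** 2)))
    return np.sqrt(k0 - v)

# Unlike the Euclidean distance, CIM saturates: one gross outlier barely changes it.
a = np.zeros(100)
b = np.zeros(100); b[0] = 1000.0                               # a single large outlier
print(np.linalg.norm(a - b), cim(a, b, sigma=1.0))             # 1000.0 vs. a small number
```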
60. RKHS Induced by Correntropy: Definition
- For a stochastic process {X_t, t ∈ T}, with T an index set, the correntropy defined as V_X(t, s) = E[K(X_t - X_s)] is symmetric and positive definite.
- Thus it induces a new RKHS, denoted VRKHS (H_V). There is a kernel mapping Φ such that V_X(t, s) = <Φ(t), Φ(s)>_{H_V}.
- Any symmetric non-negative definite kernel is the covariance kernel of a random function and vice versa (Parzen).
- Therefore, given a random function {X_t, t ∈ T} there exists another random function {f_t, t ∈ T} such that E[f_t f_s] = V_X(t, s).
61. RKHS Induced by Correntropy: Definition
- This RKHS seems very appropriate for nonlinear signal processing.
- In this space we can compute, using linear algorithms, solutions that are nonlinear in the input space, such as:
  - Matched filters
  - Wiener filters
  - Principal component analysis
  - Constrained optimization problems
  - Adaptive filtering and controls