Title: Feature Selection as Relevant Information Encoding
Slide 1: Feature Selection as Relevant Information Encoding
- Naftali Tishby
- School of Computer Science and Engineering
- The Hebrew University, Jerusalem, Israel
- NIPS 2001
Slide 2: Many thanks to
- Noam Slonim
- Amir Globerson
- Bill Bialek
- Fernando Pereira
- Nir Friedman
Slide 3: Feature Selection?
- NOT generative modeling!
- No assumptions about the source of the data
- Extracting relevant structure from data: functions of the data (statistics) that preserve information
- Information about what?
- Approximate Sufficient Statistics
- Need a principle that is both general and precise.
- Good principles survive longer!
Slide 4: A Simple Example...
Slide 5: Simple Example
Slide 6: A new compact representation
The document clusters preserve the relevant information between the documents and the words.
Slide 7: [Figure: Documents and Words]
Slide 8: Mutual information
- How much does X tell about Y?
- I(X;Y) is a function of the joint probability distribution p(x,y)
- The minimal number of yes/no questions (bits) needed to ask about x in order to learn all we can about Y
- The uncertainty removed about X when we know Y (a short numerical computation follows below):
  I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
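Purely as an illustration (this snippet is not from the original slides), the quantity above can be computed directly from a discrete joint distribution; the helper name mutual_information and the toy joint p below are my own:

    import numpy as np

    def mutual_information(p_xy):
        """I(X;Y) in bits, for a joint distribution given as a 2-D array."""
        p_xy = p_xy / p_xy.sum()                      # normalize, just in case
        p_x = p_xy.sum(axis=1, keepdims=True)         # marginal p(x)
        p_y = p_xy.sum(axis=0, keepdims=True)         # marginal p(y)
        mask = p_xy > 0                               # 0 * log 0 = 0 by convention
        return float(np.sum(p_xy[mask] * np.log2(p_xy[mask] / (p_x @ p_y)[mask])))

    # Toy joint distribution (rows: x, columns: y) -- an invented example.
    p = np.array([[0.30, 0.05],
                  [0.05, 0.30],
                  [0.15, 0.15]])
    print(mutual_information(p))                      # about 0.29 bits for this joint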
Slide 9: Relevant Coding
- What are the questions that we need to ask about X in order to learn about Y?
- We need to partition X into relevant domains, or clusters, between which we really need to distinguish...
[Figure: the conditionals p(x|y1) and p(x|y2) induce a partition of X into the domains X|y1 and X|y2]
Slide 10: Bottlenecks and Neural Nets
- Auto-association: forcing compact representations
- The compressed (bottleneck) layer is a relevant code of the input w.r.t. the output
[Figure: bottleneck networks with the paired variables Input/Output, Sample 1/Sample 2, and Past/Future]
Slide 11
- Q: How many bits are needed to determine the relevant representation?
- A: We need to index the maximal number of non-overlapping green blobs inside the blue blob (mutual information!); roughly 2^{I(X;\hat{X})} such blobs, i.e. about I(X;\hat{X}) bits.
Slide 12: Information Bottleneck
- The distortion function determines the relevant part of the pattern.
- But what if we don't know the distortion function, but rather a relevance variable?
- Examples:
  - Speech vs. its transcription
  - Images vs. object names
  - Faces vs. expressions
  - Stimuli vs. spike trains (neural codes)
  - Protein sequences vs. structure/function
  - Documents vs. text categories
  - Input vs. responses
  - etc...
Slide 13
- The idea: find a compressed signal \hat{X} of X
- that needs a short encoding (small I(X;\hat{X}))
- while preserving as much as possible of the information on the relevant signal Y (large I(\hat{X};Y))
Slide 14: A Variational Principle
- We want a short representation of X that keeps the information about another variable, Y, if possible (the trade-off is written out below).
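The functional itself did not survive the transcript; in the standard Information Bottleneck formulation (Tishby, Pereira, and Bialek) the principle is to minimize, over the encoder p(\hat{x}|x), the Lagrangian

    \mathcal{L}\big[p(\hat{x}\mid x)\big] \;=\; I(X;\hat{X}) \;-\; \beta\, I(\hat{X};Y),

where the multiplier \beta > 0 sets the trade-off between compressing X and preserving the information about Y.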
Slide 15: The Self-Consistent Equations
- p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\big(-\beta\, D_{KL}[p(y|x)\,\|\,p(y|\hat{x})]\big)
- Marginal: p(\hat{x}) = \sum_x p(\hat{x}|x)\, p(x)
- Markov condition: p(y|\hat{x}) = \sum_x p(y|x)\, p(x|\hat{x})
- Bayes rule: p(x|\hat{x}) = p(\hat{x}|x)\, p(x) / p(\hat{x})
Slide 16: The emergent effective distortion measure
- d(x,\hat{x}) = D_{KL}[p(y|x)\,\|\,p(y|\hat{x})]
- Regular if p(y|x) is absolutely continuous w.r.t. p(y|\hat{x})
- Small if \hat{x} predicts y as well as x does
Slide 17: The iterative algorithm (generalized Blahut-Arimoto)
- Alternate the self-consistent equations of Slide 15 until convergence (a minimal code sketch follows).
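A minimal sketch of that iteration, assuming a finite joint distribution p(x,y) given as a matrix and a fixed number of clusters; the function and variable names here are my own, not from the talk:

    import numpy as np

    def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
        """Iterative IB (Blahut-Arimoto style): returns the encoder p(t|x)."""
        rng = np.random.default_rng(seed)
        p_xy = p_xy / p_xy.sum()
        p_x = p_xy.sum(axis=1)                       # p(x)
        p_y_x = p_xy / p_x[:, None]                  # p(y|x)
        q_t_x = rng.random((len(p_x), n_clusters))   # p(t|x), random init
        q_t_x /= q_t_x.sum(axis=1, keepdims=True)
        for _ in range(n_iter):
            q_t = q_t_x.T @ p_x                                        # p(t) = sum_x p(x) p(t|x)
            q_x_t = (q_t_x * p_x[:, None]) / np.maximum(q_t, 1e-12)    # Bayes: p(x|t)
            q_y_t = q_x_t.T @ p_y_x                                    # Markov: p(y|t)
            # effective distortion d(x,t) = KL[p(y|x) || p(y|t)] for every pair (x, t)
            kl = np.sum(p_y_x[:, None, :] *
                        (np.log(p_y_x[:, None, :] + 1e-12) -
                         np.log(q_y_t[None, :, :] + 1e-12)), axis=2)
            q_t_x = q_t[None, :] * np.exp(-beta * kl)                  # exponential update
            q_t_x /= q_t_x.sum(axis=1, keepdims=True)
        return q_t_x

Annealing beta from small to large values traces out the information curve of Slide 19; this sketch keeps beta fixed for simplicity.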
Slide 18: The Information Bottleneck Algorithm
- The alternating updates decrease the IB free energy F = I(X;\hat{X}) + \beta\, \langle D_{KL}[p(y|x)\,\|\,p(y|\hat{x})] \rangle and converge, as in Blahut-Arimoto.
Slide 19
- The information plane: the optimal I(\hat{X};Y) for a given I(X;\hat{X}) is a concave function.
[Figure: the information curve separating the impossible region from the possible phase]
Slide 20: Regression as relevant encoding
- Extracting relevant information is fundamental for many problems in learning.
- Regression: knowing the parametric class, we can calculate p(X,Y) without sampling (spelled out below).
[Figure: regression with Gaussian noise]
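To make that concrete in my own notation (assuming, as the figure label suggests, a regression function f(x;\theta) from the known parametric class plus additive Gaussian noise of variance \sigma^2):

    p(y \mid x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\Big(-\frac{(y - f(x;\theta))^2}{2\sigma^2}\Big),
    \qquad p(x,y) = p(x)\, p(y \mid x),

so the joint distribution needed by the bottleneck is available in closed form, without sampling.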
Slide 21: Manifold of relevance
- The self-consistent equations
- Assuming a continuous manifold for \hat{x}
- Coupled (local in \hat{x}) eigenfunction equations, with \beta as an eigenvalue.
Slide 22: Generalization as relevant encoding
- The two-sample problem: the probability that two samples come from one source
- Knowing the function class, we can estimate p(X,Y).
- Convergence depends on the class complexity.
Slide 23: Document classification - information curves
Slide 24: Multivariate Information Bottleneck
- Complex relationships between many variables
- Multiple unrelated dimensionality-reduction schemes
- Trade between known and desired dependencies
- Express IB in the language of graphical models
- Multivariate extension of rate-distortion theory
Slide 25: Multivariate Information Bottleneck - extending the dependency graphs
- Multi-information: I(X_1;\ldots;X_n) = D_{KL}\big[\,p(x_1,\ldots,x_n)\,\big\|\,\prod_i p(x_i)\,\big] = \sum_i H(X_i) - H(X_1,\ldots,X_n)
Slide 26: [No transcript]
Slide 27: Sufficient Dimensionality Reduction (with Amir Globerson)
- Exponential families have sufficient statistics.
- Given a joint distribution p(x,y), find an approximation of the exponential form (written out below).
- This can be done by alternating maximization of entropy under the constraints.
- The resulting functions are our relevant features at rank d.
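For concreteness, the rank-d exponential form referred to above can be written as (my own notation; the exact parameterization in the Globerson-Tishby SDR work may also include marginal terms):

    \tilde{p}(x,y) \;=\; \frac{1}{Z} \exp\!\Big(\sum_{i=1}^{d} \phi_i(x)\, \psi_i(y)\Big),

with the functions \phi_i and \psi_i found by the alternating maximization; these play the role of the rank-d relevant features.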
Slide 28: Conclusions
- There may be a single principle behind...
  - noise filtering
  - time-series prediction
  - categorization and classification
  - feature extraction
  - supervised and unsupervised learning
  - visual and auditory segmentation
  - clustering
  - self-organized representation
  - ...
Slide 29: Summary
- We present a general information-theoretic approach for extracting relevant information.
- It is a natural generalization of rate-distortion theory, with similar convergence and optimality proofs.
- It unifies learning, feature extraction, filtering, and prediction...
- Applications (so far) include:
  - word-sense disambiguation
  - document classification and categorization
  - spectral analysis
  - neural codes
  - bioinformatics
  - data clustering based on multi-distance distributions