1
Feature Selection as Relevant Information
Encoding
  • Naftali Tishby
  • School of Computer Science and Engineering
  • The Hebrew University, Jerusalem, Israel
  • NIPS 2001

2
  • Many thanks to
  • Noam Slonim
  • Amir Globerson
  • Bill Bialek
  • Fernando Pereira
  • Nir Friedman

3
Feature Selection?
  • NOT generative modeling!
  • no assumptions about the source of the data
  • Extracting relevant structure from data
  • functions of the data (statistics) that preserve
    information
  • Information about what?
  • Approximate Sufficient Statistics (made
    precise in the note after this slide)
  • Need a principle that is both general and
    precise.
  • Good Principles survive longer!
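The "approximate sufficient statistics" idea can be stated information-theoretically (a standard characterization, added here for reference, not taken from the slide): a statistic T(X) is sufficient for the relevance variable Y exactly when it loses none of the information X carries about Y,

    I(T(X); Y) = I(X; Y),

and an approximate sufficient statistic is a compact T(X) for which I(T(X);Y) stays close to I(X;Y) while T(X) is much simpler than X.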

4
A Simple Example...
5
Simple Example
6
A new compact representation
The document clusters preserve the relevant
information between the documents and words
7
(Figure: Documents vs. Words)
8
Mutual information
  • How much is X telling about Y?
  • I(X;Y) is a function of the joint probability
    distribution p(x,y) -
  • the minimal number of yes/no questions (bits)
    needed to ask about X in order to learn all we
    can about Y.
  • Uncertainty removed about X when we know Y
  • I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)

(Venn diagram: H(X|Y), I(X;Y), H(Y|X))
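Written out explicitly (standard definition, added here for reference), the mutual information above is

    I(X;Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}
           = H(X) - H(X|Y) = H(Y) - H(Y|X).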
9
Relevant Coding
  • What are the questions that we need to ask about
    X in order to learn about Y?
  • Need to partition X into relevant domains, or
    clusters, between which we really need to
    distinguish...

(Figure: X partitioned into the relevant domains X_{y1} and X_{y2}, induced by p(x|y1) and p(x|y2) for the two values y1, y2 of Y)
10
Bottlenecks and Neural Nets
  • Auto-association: forcing compact
    representations
  • the bottleneck layer is a relevant code of the
    input w.r.t. the output

(Figure: bottleneck networks for Input vs. Output, Sample 1 vs. Sample 2, and Past vs. Future)
11
  • Q: How many bits are needed to determine the
    relevant representation?
  • We need to index the max number of
    non-overlapping green blobs inside the blue blob
  • (mutual information!)

12
Information Bottleneck
  • The distortion function determines the relevant
    part of the pattern
  • but what if we don't know the distortion
    function, but rather a relevance variable?
  • Examples: Speech vs. its transcription
  • Images vs. object names
  • Faces vs. expressions
  • Stimuli vs. spike trains (neural codes)
  • Protein sequences vs. structure/function
  • Documents vs. text categories
  • Inputs vs. responses
  • etc...

13
  • The idea: find a compressed signal \hat{X}
  • that needs short encoding (small I(X;\hat{X}))
  • while preserving as much as possible the
    information on the relevant signal Y
    (large I(\hat{X};Y))

14
A Variational Principle
  • We want a short representation of X that keeps
    the information about another variable, Y, if
    possible.
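The formula on this slide did not survive extraction; the variational principle it refers to is the standard information bottleneck trade-off, minimized over the stochastic map p(\hat{x}|x):

    \mathcal{L}[p(\hat{x}|x)] = I(X;\hat{X}) - \beta\, I(\hat{X};Y),

where the Lagrange multiplier \beta sets the trade-off between compression of X and preservation of the relevant information about Y.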

15
The Self Consistent Equations
  • Marginal
  • Markov condition
  • Bayes rule
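The equations next to these three bullets were images in the original slides; the standard self-consistent IB equations they refer to are

    p(\hat{x}|x) = \frac{p(\hat{x})}{Z(x,\beta)} \exp\big(-\beta\, D_{KL}[\, p(y|x)\, \| \, p(y|\hat{x})\,]\big)
    p(\hat{x})   = \sum_x p(x)\, p(\hat{x}|x)                                  (marginal)
    p(y|\hat{x}) = \frac{1}{p(\hat{x})} \sum_x p(y|x)\, p(\hat{x}|x)\, p(x)    (Markov condition + Bayes rule)

with Z(x,\beta) a normalization factor.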

16
The emergent effective distortion measure
  • Regular if p(y|x) is absolutely continuous
    w.r.t. p(y|\hat{x})
  • Small if \hat{x} predicts y as well as x does
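Concretely (the expression itself is missing from the extracted slide text), the distortion that emerges from the variational principle is the KL divergence between the two predictors of Y:

    d(x,\hat{x}) = D_{KL}[\, p(y|x)\, \| \, p(y|\hat{x})\,].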

17
The iterative algorithm (Generalized
Blahut-Arimoto)
(Figure: the generalized BA algorithm cycling through the self-consistent equations)
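The update equations on this slide were rendered as images; as a rough illustration only, here is a minimal NumPy sketch of the generalized Blahut-Arimoto iteration for the information bottleneck (function and variable names are mine, not from the talk):

  import numpy as np

  def information_bottleneck(p_xy, n_clusters, beta, n_iter=200, seed=0):
      """Iterative (Blahut-Arimoto style) information bottleneck.

      p_xy: joint distribution over X and Y, shape (|X|, |Y|), entries sum to 1
            and every row of X is assumed to have positive marginal.
      Returns the soft assignment p(t|x), the cluster marginal p(t),
      and the relevant decoder p(y|t).
      """
      rng = np.random.default_rng(seed)
      nx, ny = p_xy.shape
      p_x = p_xy.sum(axis=1)                      # p(x)
      p_y_given_x = p_xy / p_x[:, None]           # p(y|x)

      # random soft initialization of p(t|x)
      p_t_given_x = rng.random((nx, n_clusters))
      p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

      eps = 1e-12
      for _ in range(n_iter):
          # p(t) = sum_x p(x) p(t|x)
          p_t = p_x @ p_t_given_x
          # p(y|t) = (1/p(t)) sum_x p(y|x) p(t|x) p(x)
          p_y_given_t = (p_t_given_x * p_x[:, None]).T @ p_y_given_x
          p_y_given_t /= p_t[:, None]
          # d(x,t) = KL[ p(y|x) || p(y|t) ]
          log_ratio = np.log(p_y_given_x[:, None, :] + eps) - np.log(p_y_given_t[None, :, :] + eps)
          d = (p_y_given_x[:, None, :] * log_ratio).sum(axis=2)
          # p(t|x) proportional to p(t) exp(-beta d(x,t)), normalized over t
          p_t_given_x = p_t[None, :] * np.exp(-beta * d)
          p_t_given_x /= p_t_given_x.sum(axis=1, keepdims=True)

      return p_t_given_x, p_t, p_y_given_t

For instance, p_xy could be a normalized document-word co-occurrence matrix, in which case p(t|x) gives soft document clusters in the spirit of the example earlier in the talk.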
18
The Information Bottleneck Algorithm
(Formula: the IB free energy)
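The free-energy expression is not preserved in the extracted text; in the IB literature the functional that the iterative algorithm decreases monotonically is usually written as

    F = I(X;\hat{X}) + \beta \big\langle D_{KL}[\, p(y|x)\, \| \, p(y|\hat{x})\,] \big\rangle_{p(x,\hat{x})},

each of the three self-consistent updates lowering F until convergence.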
19
  • The information plane: the optimal I(\hat{X};Y)
    for a given I(X;\hat{X})
    is a concave function

(Figure: the information curve in the information plane, with the impossible region above the curve and the possible phase below it)
20
Regression as relevant encoding
  • Extracting relevant information is fundamental
    for many problems in learning
  • Regression
  • Knowing the parametric class we can calculate
    p(X,Y), without sampling!

(Formula: the regression model with additive Gaussian noise)
21
Manifold of relevance
  • The self-consistent equations
  • Assuming a continuous manifold for \hat{X}
  • Coupled (local in \hat{x}) eigenfunction
    equations, with \beta as an eigenvalue.

22
Generalization as relevant encoding
  • The two-sample problem:
  • the probability that two samples come from one
    source
  • Knowing the function class we can estimate
    p(X,Y).
  • Convergence depends on the class complexity.

23
Document classification - information curves
24
Multivariate Information Bottleneck
  • Complex relationship between many variables
  • Multiple unrelated dimensionality reduction
    schemes
  • Trade between known and desired dependencies
  • Express IB in the language of Graphical Models
  • Multivariate extension of Rate-Distortion Theory

25
Multivariate Information Bottleneck: extending
the dependency graphs
(Formula: multi-information)
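For reference (standard definition, not preserved in the extracted slide), the multi-information generalizes mutual information to many variables:

    I(X_1; \ldots; X_n) = D_{KL}\Big[\, p(x_1,\ldots,x_n) \,\Big\|\, \prod_{i=1}^{n} p(x_i) \Big] = \sum_i H(X_i) - H(X_1,\ldots,X_n).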
26
(No Transcript)
27
Sufficient Dimensionality Reduction (with Amir
Globerson)
  • Exponential families have sufficient statistics
  • Given a joint distribution p(x,y),
    find an approximation of the exponential form

This can be done by alternating maximization of
entropy under the constraints.
The resulting functions are our relevant features
at rank d.
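The exponential form itself was an image in the original; in the SDR framework the approximation sought is of the generic type (written here with hypothetical feature functions \phi_k, \psi_k and normalizer Z; details such as marginal-matching terms are omitted)

    \tilde{p}(x,y) = \frac{1}{Z} \exp\Big( \sum_{k=1}^{d} \phi_k(x)\, \psi_k(y) \Big),

and the alternating entropy maximization under expectation constraints yields the rank-d feature functions \phi_k, \psi_k.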
28
Conclusions
  • There may be a single principle behind ...
  • Noise filtering
  • time series prediction
  • categorization and classification
  • feature extraction
  • supervised and unsupervised learning
  • visual and auditory segmentation
  • clustering
  • self organized representation
  • ...

29
Summary
  • We present a general information theoretic
    approach for extracting relevant information.
  • It is a natural generalization of
    Rate-Distortion theory with similar convergence
    and optimality proofs.
  • Unifies learning, feature extraction, filtering,
    and prediction...
  • Applications (so far) include
  • Word sense disambiguation
  • Document classification and categorization
  • Spectral analysis
  • Neural codes
  • Bioinformatics
  • Data clustering based on multi-distance
    distributions