Title: Feature selection based on information theory, consistency and separability indices
1. Feature selection based on information theory, consistency and separability indices
- Wlodzislaw Duch, Tomasz Winiarski, Krzysztof Grabczewski, Jacek Biesiada, Adam Kachel
- Dept. of Informatics, Nicholas Copernicus University, Torun, Poland - http://www.phys.uni.torun.pl/duch
- ICONIP Singapore, 18-22.11.2002
2. What am I going to say
- Selection of information
- Information theory - filters
- Information theory - selection
- Consistency indices
- Separability indices
- Empirical comparison - artificial data
- Empirical comparison - real data
- Conclusions, or what have we learned?
3. Selection of information
- Attention, a basic cognitive skill: find the relevant information,
- discard attributes that do not contain information,
- use weights to express the relative importance of attributes,
- create new, more informative attributes,
- reduce dimensionality by aggregating information.
- Ranking: treat each feature as independent. Selection: search for subsets, remove redundant features.
- Filters: universal, model-independent criteria. Wrappers: criteria specific to the data models used.
- Here: filters for ranking and selection.
4. Information theory - filters
- X - vectors, Xj - attributes, Xj = f - attribute values, Ci - classes, i = 1..K; p(C, Xj) - joint probability distribution.
- The amount of information contained in this joint distribution, summed over all classes, gives an estimate of feature importance (a reconstruction of this index is sketched below).
For continuous attribute values the integrals are approximated by sums. This implies discretization into rk(f) regions, an issue in itself. Alternative: fitting the p(Ci, f) densities using Gaussian or other kernels. Which method is more accurate, and what are the expected errors?
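The formula itself was an image on the original slide and did not survive extraction; a plausible LaTeX reconstruction, assuming the index is the entropy of the joint distribution estimated over the discretized regions rk(f):

```latex
% Joint information of classes C and attribute X_j over discretized regions r_k(f)
% (a reconstruction, not the slide's original rendering):
I(C, X_j) = - \sum_{i=1}^{K} \sum_{k} p\big(C_i, r_k(f)\big)\,\log_2 p\big(C_i, r_k(f)\big)
```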
5. Information gain
- Information gained by considering the joint probability distribution p(C, f) is the difference between the information contained in the class and attribute distributions taken separately and the information in their joint distribution.
- A feature is more important if its information gain is larger.
- Modifications of the information gain, frequently used as criteria in decision trees, include:
  IGR(C,Xj) = IG(C,Xj)/I(Xj), the gain ratio;
  IGn(C,Xj) = IG(C,Xj)/I(C), an asymmetric dependency coefficient;
  DM(C,Xj) = 1 - IG(C,Xj)/I(C,Xj), the normalized Mantaras distance.
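The gain formula was an image on the slide; a sketch of the standard definitions referred to above, assuming IG is the usual mutual-information form:

```latex
\begin{align*}
IG(C,X_j)  &= I(C) + I(X_j) - I(C,X_j) \\
IGR(C,X_j) &= IG(C,X_j)/I(X_j)        && \text{gain ratio}\\
IGn(C,X_j) &= IG(C,X_j)/I(C)          && \text{asymmetric dependency coefficient}\\
D_M(C,X_j) &= 1 - IG(C,X_j)/I(C,X_j)  && \text{normalized Mantaras distance}
\end{align*}
```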
6. Information indices
- Information gained by considering attribute Xj and the classes C together is also known as mutual information, equal to the Kullback-Leibler divergence between the joint and the product probability distributions.
- The entropy distance measure is a sum of conditional informations.
- The symmetrical uncertainty coefficient is obtained from the entropy distance.
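The three formulas were again images on the original slide; a reconstruction under the standard definitions (the symmetrical uncertainty form shown here is the commonly used one and is an assumption):

```latex
\begin{align*}
MI(C,X_j) &= \sum_{i,k} p(C_i, r_k)\,\log_2\frac{p(C_i, r_k)}{p(C_i)\,p(r_k)}
           = D_{KL}\big(p(C,X_j)\,\|\,p(C)\,p(X_j)\big)\\
D(C,X_j)  &= I(C|X_j) + I(X_j|C)\\
U(C,X_j)  &= \frac{2\,MI(C,X_j)}{I(C)+I(X_j)}
\end{align*}
```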
7. Weighted I(C,X)
- Joint information should be weighted by p(rk(f)).
For continuous attribute values the integrals are approximated by sums. This implies discretization into rk(f) regions, an issue in itself. Alternative: fitting the p(Ci, f) densities using Gaussian or other kernels. Which method is more accurate and how large are the expected errors?
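The weighted index WI was defined by a formula lost in extraction; one plausible reading, offered purely as an assumption, weights each region's contribution to the joint information by p(rk(f)):

```latex
% One possible form of the weighted index (an assumption, not the slide's formula):
WI(C, X_j) = - \sum_{k} p\big(r_k(f)\big) \sum_{i=1}^{K} p\big(C_i, r_k(f)\big)\,\log_2 p\big(C_i, r_k(f)\big)
```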
8. Purity indices
- Many information-based quantities may be used to evaluate attributes. Consistency or purity-based indices are one alternative.
For selection of a subset F of the attributes Xi, the sum runs over all Cartesian products, i.e. multidimensional partitions rk(F). Advantages: the simplest approach, usable for both ranking and selection. Hashing techniques are used to calculate the p(rk(F)) probabilities (see the sketch below).
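A minimal sketch of a purity index over a multidimensional partition, with the cell counts held in a dictionary (the "hashing technique" mentioned above); the exact index used in the paper is not reproduced here, and names such as purity_index are illustrative only:

```python
import numpy as np
from collections import defaultdict

def purity_index(X_discrete, y):
    """Purity of a multidimensional partition: for every occupied cell
    (a tuple of discretized attribute values) take the majority-class count,
    sum over cells and normalize.  Cells live in a hash map, so only occupied
    cells of the Cartesian product are ever stored.  A sketch, not the
    authors' exact index."""
    cells = defaultdict(lambda: defaultdict(int))
    for row, label in zip(X_discrete, y):
        cells[tuple(row)][label] += 1
    majority = sum(max(counts.values()) for counts in cells.values())
    return majority / len(y)

# Tiny usage example: two features discretized into 8 equal-width bins each.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 2))
X[:, 0] += 2.0 * y                      # only the first feature is informative
X_disc = np.stack([np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins=8))
                   for j in range(X.shape[1])], axis=1)
print(purity_index(X_disc, y))
```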
9. 4 Gaussians in 8D
- Artificial data set of 4 Gaussians in 8D, 1000 points per Gaussian, each Gaussian forming a separate class. Dimensions 1-4 are independent, with Gaussians centered at (0,0,0,0), (2,1,0.5,0.25), (4,2,1,0.5), (6,3,1.5,0.75). Ranking and overlapping strength are inversely related: ranking X1 > X2 > X3 > X4. Attributes X(i+4) = 2*Xi + uniform noise of amplitude 1.5. Best ranking: X1 > X5 > X2 > X6 > X3 > X7 > X4 > X8. Best selection: X1 > X2 > X3 > X4 > X5 > X6 > X7 > X8. A data-generation sketch follows.
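A minimal sketch of how such a data set could be generated; the unit variance and the noise amplitude of 1.5 are read off the slide and may differ from the original experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0],
                    [2, 1, 0.5, 0.25],
                    [4, 2, 1, 0.5],
                    [6, 3, 1.5, 0.75]])
n_per_class = 1000

# Dimensions 1-4: independent Gaussians (unit variance assumed) around the class centers.
X4 = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 4)) for c in centers])
y = np.repeat(np.arange(len(centers)), n_per_class)

# Dimensions 5-8: noisy linear copies X_{i+4} = 2*X_i + uniform noise
# (amplitude 1.5 assumed, following the slide's "uniform noise 1.5").
noise = rng.uniform(-1.5, 1.5, size=X4.shape)
X = np.hstack([X4, 2 * X4 + noise])
print(X.shape, y.shape)   # (4000, 8) (4000,)
```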
10. Dim X1 vs. X2 (scatter plot)
11. Dim X1 vs. X5 (scatter plot)
12. Ranking algorithms
- WI(C,f) - information from the weighted p(r(f))p(C,r(f)) distribution
- MI(C,f) - mutual information (information gain)
- IGR(C,f) - information gain ratio
- IC(C,f) - information from the max-C posterior distribution
- GD(C,f) - transinformation matrix with Mahalanobis distance
- 7 other methods based on IC and correlation-based distances, Markov blanket and ReliefF selection methods.
A small MI-ranking sketch follows this list.
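A minimal sketch of MI-based ranking with equal-width discretization; the helper names are illustrative and this is not the authors' code:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """MI(C, f) estimated from an equal-width discretization of feature x;
    y is a vector of non-negative integer class labels."""
    x_disc = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    joint = np.zeros((bins + 2, y.max() + 1))
    for xb, c in zip(x_disc, y):
        joint[xb, c] += 1
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)          # p(r_k(f))
    pc = p.sum(axis=0, keepdims=True)          # p(C_i)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ pc)[nz])).sum())

def rank_features(X, y, bins=16):
    scores = [mutual_information(X[:, j], y, bins) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1]), scores

# Usage: the informative feature X1 should come out on top.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 3))
X[:, 0] += 2.0 * y
order, scores = rank_features(X, y)
print("ranking:", [f"X{j + 1}" for j in order])
```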
13. Selection algorithms
- Maximize the evaluation criterion for single attributes, remove redundant features.
- 1. MI(C,f) - b*MI(f,g) algorithm (Battiti 1994)
- 2. IC(C,f) - b*IC(f,g), the same algorithm but with the IC criterion
- 3. Max IC(C,F), adding the single attribute that maximizes IC
- 4. Max MI(C,F), adding the single attribute that maximizes MI
- 5. SSV decision tree based on the separability criterion.
A sketch of scheme 1 follows this list.
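A minimal sketch of selection scheme 1, Battiti's 1994 MIFS-style greedy criterion MI(C,f) - b*MI(f,g), where the redundancy term is summed over the already-selected features g; it reuses the mutual_information helper and the toy X, y from the ranking sketch above, and b = 0.5 mirrors the value quoted for the 8D Gaussian experiment:

```python
import numpy as np

def mifs_select(X, y, n_select, beta=0.5, bins=16):
    """Greedy forward selection: at each step add the feature f that maximizes
    MI(C, f) - beta * sum_g MI(f, g) over the already-selected features g.
    A sketch of Battiti's scheme, not the authors' implementation."""
    relevance = [mutual_information(X[:, f], y, bins) for f in range(X.shape[1])]
    selected, remaining = [], list(range(X.shape[1]))

    def mi_ff(f, g):
        # MI between two continuous features: discretize g and reuse the estimator.
        g_disc = np.digitize(X[:, g], np.histogram_bin_edges(X[:, g], bins=bins))
        return mutual_information(X[:, f], g_disc, bins)

    while remaining and len(selected) < n_select:
        best = max(remaining,
                   key=lambda f: relevance[f] - beta * sum(mi_ff(f, g) for g in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

print(mifs_select(X, y, n_select=2))   # with the toy data above, feature 0 comes first
```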
14. Ranking for 8D Gaussians
- Partitions of each attribute into 4, 8, 16, 24, 32 parts of equal width.
- Methods that found the perfect ranking: MI(C,f), IGR(C,f), WI(C,f), and the GD transinformation distance.
- IC(f) correct, except for P8, where features 2 and 6 are reversed (6 is the noisy version of 2).
- Other, more sophisticated algorithms made more errors.
- Selection for Gaussian distributions is rather easy using any evaluation measure.
- Simpler algorithms work better.
15. Selection for 8D Gaussians
- Partitions of each attribute into 4, 8, 16, 24, 32 parts of equal width.
- Ideal selection: subsets {1}, {1,2}, {1,2,3}, or {1,2,3,4}.
- 1. The MI(C,f) - 0.5*MI(f,g) algorithm: P24 no errors; for P8, P16, P32 a small error (8 selected instead of 4).
- Max MI(C,F): P8-P24 no errors; P32 (7, 8 selected instead of 3, 4).
- Max IC(C,F): P24 no errors; P8 (6 instead of 2), P16 (7 instead of 3), P32 (7, 8 instead of 3, 4).
- The SSV decision tree, based on the separability criterion, creates its own discretization. It selects 1, 2, 6, 3, 7; the others are not important.
- Univariate trees have a bias for slanted distributions. Selection should take into account the type of classification system that will be used.
16. Hypothyroid - equal bins
- Mutual information for different numbers of equal-width partitions, ordered from largest to smallest, for the hypothyroid data (6 continuous and 15 binary attributes).
17. Hypothyroid - SSV bins
- Mutual information for different numbers of SSV decision tree partitions, ordered from largest to smallest, for the hypothyroid data. Values are twice as large since the bins are more pure.
18. Hypothyroid ranking
- Best ranking: the largest area under the curve of accuracy(best n features).
SBL: evaluating and adding one attribute at a time (costly). Best 2 - SBL, best 3 - SSV BFS, best 4 - SSV beam; BA - a failure.
19. Hypothyroid ranking
- Results from the FSM neurofuzzy system.
Best 2 - SBL, best 3 - SSV BFS, best 4 - SSV beam; BA - a failure. Global correlation misses local usefulness ...
20. Hypothyroid SSV ranking
- More results using FSM and selection based on SSV.
SSV with beam search (P24) finds the best small subsets; depending on the search depth, the best results here are achieved for 5 attributes.
21. Conclusions
- About 20 ranking and selection methods have been checked.
- The actual feature evaluation index (information, consistency, correlation) is not so important.
- Discretization is very important: naive equi-width or equidistance discretization may give unpredictable results; entropy-based discretization is fine, but separability-based discretization is less expensive.
- Continuous, kernel-based approximations to the calculation of feature evaluation indices are a useful alternative.
- Ranking is easy if a global evaluation is sufficient, but different sets of features may be important for the separation of different classes, and some are important only in small regions (cf. decision trees).
- Selection requires the calculation of multidimensional evaluation indices, done effectively using hashing techniques.
- Local selection and ranking is the most promising technique.
22. Open questions
- Is the best selection method based on filters possible?
- Perhaps it depends on the ability of different methods to use the information contained in the selected attributes.
- Discretization or kernel estimation?
- Best discretization: V-opt histograms, entropy, separability?
- How useful is fuzzy partitioning?
- Use of feature weighting from ranking/selection to scale input data.
- How to create an evaluation index that includes local information?
- How to use selection methods to find combinations of attributes?
- These and other ranking/selection methods will be integrated into the GhostMiner data mining package - Google "GhostMiner".