Title: Feature selection based on information theory, consistency and separability indices
1. Feature selection based on information theory, consistency and separability indices
- Wlodzislaw Duch, Tomasz Winiarski, Krzysztof Grabczewski, Jacek Biesiada, Adam Kachel
- Dept. of Informatics, Nicholas Copernicus University, Torun, Poland - http://www.phys.uni.torun.pl/duch
- ICONIP Singapore, 18-22.11.2002
2. What am I going to say
- Selection of information
- Information theory - filters
- Information theory - selection
- Consistency indices
- Separability indices
- Empirical comparison - artificial data
- Empirical comparison - real data
- Conclusions, or what have we learned?
3. Selection of information
- Attention, a basic cognitive skill: find the relevant information,
- discard attributes that do not contain information,
- use weights to express the relative importance of attributes,
- create new, more informative attributes,
- reduce dimensionality by aggregating information.
- Ranking: treat each feature as independent. Selection: search for subsets, remove redundant features.
- Filters: universal, model-independent criteria. Wrappers: criteria specific to the data models used.
- Here: filters for ranking and selection.
4. Information theory - filters
- X - vectors, Xj - attributes, Xj = f - attribute values, Ci - classes, i = 1..K; p(C, Xj) - joint probability distribution.
- The amount of information contained in this joint distribution, summed over all classes, gives an estimate of feature importance (a reconstruction of this index is sketched below).
For continuous attribute values the integrals are approximated by sums. This implies discretization into rk(f) regions, an issue in itself. Alternative: fitting the p(Ci, f) densities using Gaussian or other kernels. Which method is more accurate, and what are the expected errors?
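The formula itself was an image on the original slide and did not survive extraction; a plausible LaTeX reconstruction, assuming the index is the entropy of the joint distribution estimated over the discretized regions rk(f):

```latex
% Joint information of classes C and attribute X_j over discretized regions r_k(f)
% (a reconstruction, not the slide's original rendering):
I(C, X_j) = - \sum_{i=1}^{K} \sum_{k} p\big(C_i, r_k(f)\big)\,\log_2 p\big(C_i, r_k(f)\big)
```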
5. Information gain
- Information gained by considering the joint probability distribution p(C, f) is the difference between the information contained in the class and attribute distributions taken separately and the information in their joint distribution.
- A feature is more important if its information gain is larger.
- Modifications of the information gain, frequently used as criteria in decision trees, include:
  IGR(C,Xj) = IG(C,Xj)/I(Xj), the gain ratio;
  IGn(C,Xj) = IG(C,Xj)/I(C), an asymmetric dependency coefficient;
  DM(C,Xj) = 1 - IG(C,Xj)/I(C,Xj), the normalized Mantaras distance.
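The gain formula was an image on the slide; a sketch of the standard definitions referred to above, assuming IG is the usual mutual-information form:

```latex
\begin{align*}
IG(C,X_j)  &= I(C) + I(X_j) - I(C,X_j) \\
IGR(C,X_j) &= IG(C,X_j)/I(X_j)        && \text{gain ratio}\\
IGn(C,X_j) &= IG(C,X_j)/I(C)          && \text{asymmetric dependency coefficient}\\
D_M(C,X_j) &= 1 - IG(C,X_j)/I(C,X_j)  && \text{normalized Mantaras distance}
\end{align*}
```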
6. Information indices
- Information gained by considering attribute Xj and the classes C together is also known as mutual information, equal to the Kullback-Leibler divergence between the joint and the product probability distributions.
- The entropy distance measure is a sum of conditional informations.
- The symmetrical uncertainty coefficient is obtained from the entropy distance.
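The three formulas were again images on the original slide; a reconstruction under the standard definitions (the symmetrical uncertainty form shown here is the commonly used one and is an assumption):

```latex
\begin{align*}
MI(C,X_j) &= \sum_{i,k} p(C_i, r_k)\,\log_2\frac{p(C_i, r_k)}{p(C_i)\,p(r_k)}
           = D_{KL}\big(p(C,X_j)\,\|\,p(C)\,p(X_j)\big)\\
D(C,X_j)  &= I(C|X_j) + I(X_j|C)\\
U(C,X_j)  &= \frac{2\,MI(C,X_j)}{I(C)+I(X_j)}
\end{align*}
```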
7. Weighted I(C,X)
- Joint information should be weighted by p(rk(f)).
For continuous attribute values the integrals are approximated by sums. This implies discretization into rk(f) regions, an issue in itself. Alternative: fitting the p(Ci, f) densities using Gaussian or other kernels. Which method is more accurate and how large are the expected errors?
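The weighted index WI was defined by a formula lost in extraction; one plausible reading, offered purely as an assumption, weights each region's contribution to the joint information by p(rk(f)):

```latex
% One possible form of the weighted index (an assumption, not the slide's formula):
WI(C, X_j) = - \sum_{k} p\big(r_k(f)\big) \sum_{i=1}^{K} p\big(C_i, r_k(f)\big)\,\log_2 p\big(C_i, r_k(f)\big)
```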
8. Purity indices
- Many information-based quantities may be used to evaluate attributes. Consistency or purity-based indices are one alternative.
For selection of a subset F of the attributes Xi, the sum runs over all Cartesian products, i.e. multidimensional partitions rk(F). Advantages: the simplest approach, usable for both ranking and selection. Hashing techniques are used to calculate the p(rk(F)) probabilities (see the sketch below).
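A minimal sketch of a purity index over a multidimensional partition, with the cell counts held in a dictionary (the "hashing technique" mentioned above); the exact index used in the paper is not reproduced here, and names such as purity_index are illustrative only:

```python
import numpy as np
from collections import defaultdict

def purity_index(X_discrete, y):
    """Purity of a multidimensional partition: for every occupied cell
    (a tuple of discretized attribute values) take the majority-class count,
    sum over cells and normalize.  Cells live in a hash map, so only occupied
    cells of the Cartesian product are ever stored.  A sketch, not the
    authors' exact index."""
    cells = defaultdict(lambda: defaultdict(int))
    for row, label in zip(X_discrete, y):
        cells[tuple(row)][label] += 1
    majority = sum(max(counts.values()) for counts in cells.values())
    return majority / len(y)

# Tiny usage example: two features discretized into 8 equal-width bins each.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 2))
X[:, 0] += 2.0 * y                      # only the first feature is informative
X_disc = np.stack([np.digitize(X[:, j], np.histogram_bin_edges(X[:, j], bins=8))
                   for j in range(X.shape[1])], axis=1)
print(purity_index(X_disc, y))
```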
9. 4 Gaussians in 8D
- Artificial data set of 4 Gaussians in 8D, 1000 points per Gaussian, each Gaussian forming a separate class. Dimensions 1-4 are independent, with Gaussians centered at (0,0,0,0), (2,1,0.5,0.25), (4,2,1,0.5), (6,3,1.5,0.75). Ranking and overlapping strength are inversely related: ranking X1 > X2 > X3 > X4. Attributes X(i+4) = 2*Xi + uniform noise of amplitude 1.5. Best ranking: X1 > X5 > X2 > X6 > X3 > X7 > X4 > X8. Best selection: X1 > X2 > X3 > X4 > X5 > X6 > X7 > X8. A data-generation sketch follows.
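A minimal sketch of how such a data set could be generated; the unit variance and the noise amplitude of 1.5 are read off the slide and may differ from the original experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
centers = np.array([[0, 0, 0, 0],
                    [2, 1, 0.5, 0.25],
                    [4, 2, 1, 0.5],
                    [6, 3, 1.5, 0.75]])
n_per_class = 1000

# Dimensions 1-4: independent Gaussians (unit variance assumed) around the class centers.
X4 = np.vstack([rng.normal(loc=c, scale=1.0, size=(n_per_class, 4)) for c in centers])
y = np.repeat(np.arange(len(centers)), n_per_class)

# Dimensions 5-8: noisy linear copies X_{i+4} = 2*X_i + uniform noise
# (amplitude 1.5 assumed, following the slide's "uniform noise 1.5").
noise = rng.uniform(-1.5, 1.5, size=X4.shape)
X = np.hstack([X4, 2 * X4 + noise])
print(X.shape, y.shape)   # (4000, 8) (4000,)
```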
10. Dim X1 vs. X2 (scatter plot)
11. Dim X1 vs. X5 (scatter plot)
12. Ranking algorithms
- WI(C,f) - information from the weighted p(r(f))p(C,r(f)) distribution
- MI(C,f) - mutual information (information gain)
- IGR(C,f) - information gain ratio
- IC(C,f) - information from the max-C posterior distribution
- GD(C,f) - transinformation matrix with Mahalanobis distance
- 7 other methods based on IC and correlation-based distances, Markov blanket and ReliefF selection methods.
A small MI-ranking sketch follows this list.
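A minimal sketch of MI-based ranking with equal-width discretization; the helper names are illustrative and this is not the authors' code:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """MI(C, f) estimated from an equal-width discretization of feature x;
    y is a vector of non-negative integer class labels."""
    x_disc = np.digitize(x, np.histogram_bin_edges(x, bins=bins))
    joint = np.zeros((bins + 2, y.max() + 1))
    for xb, c in zip(x_disc, y):
        joint[xb, c] += 1
    p = joint / joint.sum()
    px = p.sum(axis=1, keepdims=True)          # p(r_k(f))
    pc = p.sum(axis=0, keepdims=True)          # p(C_i)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px @ pc)[nz])).sum())

def rank_features(X, y, bins=16):
    scores = [mutual_information(X[:, j], y, bins) for j in range(X.shape[1])]
    return list(np.argsort(scores)[::-1]), scores

# Usage: the informative feature X1 should come out on top.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
X = rng.normal(size=(1000, 3))
X[:, 0] += 2.0 * y
order, scores = rank_features(X, y)
print("ranking:", [f"X{j + 1}" for j in order])
```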
13. Selection algorithms
- Maximize the evaluation criterion for single attributes, remove redundant features.
- 1. MI(C,f) - b*MI(f,g) algorithm (Battiti 1994)
- 2. IC(C,f) - b*IC(f,g), the same algorithm but with the IC criterion
- 3. Max IC(C,F), adding the single attribute that maximizes IC
- 4. Max MI(C,F), adding the single attribute that maximizes MI
- 5. SSV decision tree based on the separability criterion.
A sketch of scheme 1 follows this list.
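A minimal sketch of selection scheme 1, Battiti's 1994 MIFS-style greedy criterion MI(C,f) - b*MI(f,g), where the redundancy term is summed over the already-selected features g; it reuses the mutual_information helper and the toy X, y from the ranking sketch above, and b = 0.5 mirrors the value quoted for the 8D Gaussian experiment:

```python
import numpy as np

def mifs_select(X, y, n_select, beta=0.5, bins=16):
    """Greedy forward selection: at each step add the feature f that maximizes
    MI(C, f) - beta * sum_g MI(f, g) over the already-selected features g.
    A sketch of Battiti's scheme, not the authors' implementation."""
    relevance = [mutual_information(X[:, f], y, bins) for f in range(X.shape[1])]
    selected, remaining = [], list(range(X.shape[1]))

    def mi_ff(f, g):
        # MI between two continuous features: discretize g and reuse the estimator.
        g_disc = np.digitize(X[:, g], np.histogram_bin_edges(X[:, g], bins=bins))
        return mutual_information(X[:, f], g_disc, bins)

    while remaining and len(selected) < n_select:
        best = max(remaining,
                   key=lambda f: relevance[f] - beta * sum(mi_ff(f, g) for g in selected))
        selected.append(best)
        remaining.remove(best)
    return selected

print(mifs_select(X, y, n_select=2))   # with the toy data above, feature 0 comes first
```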
14. Ranking for 8D Gaussians
- Partitions of each attribute into 4, 8, 16, 24, 32 parts of equal width.
- Methods that found the perfect ranking: MI(C,f), IGR(C,f), WI(C,f), and the GD transinformation distance.
- IC(f) correct, except for P8, where features 2 and 6 are reversed (6 is the noisy version of 2).
- Other, more sophisticated algorithms made more errors.
- Selection for Gaussian distributions is rather easy using any evaluation measure.
- Simpler algorithms work better.
15. Selection for 8D Gaussians
- Partitions of each attribute into 4, 8, 16, 24, 32 parts of equal width.
- Ideal selection: subsets {1}, {1,2}, {1,2,3}, or {1,2,3,4}.
- 1. The MI(C,f) - 0.5*MI(f,g) algorithm: P24 no errors; for P8, P16, P32 a small error (8 selected instead of 4).
- Max MI(C,F): P8-P24 no errors; P32 (7, 8 selected instead of 3, 4).
- Max IC(C,F): P24 no errors; P8 (6 instead of 2), P16 (7 instead of 3), P32 (7, 8 instead of 3, 4).
- The SSV decision tree, based on the separability criterion, creates its own discretization. It selects 1, 2, 6, 3, 7; the others are not important.
- Univariate trees have a bias for slanted distributions. Selection should take into account the type of classification system that will be used.
16. Hypothyroid - equal bins
- Mutual information for different numbers of equal-width partitions, ordered from largest to smallest, for the hypothyroid data (6 continuous and 15 binary attributes).
17. Hypothyroid - SSV bins
- Mutual information for different numbers of SSV decision tree partitions, ordered from largest to smallest, for the hypothyroid data. Values are twice as large since the bins are more pure.
18. Hypothyroid ranking
- Best ranking: the largest area under the curve of accuracy(best n features).
SBL: evaluating and adding one attribute at a time (costly). Best 2 - SBL, best 3 - SSV BFS, best 4 - SSV beam; BA - a failure.
19. Hypothyroid ranking
- Results from the FSM neurofuzzy system.
Best 2 - SBL, best 3 - SSV BFS, best 4 - SSV beam; BA - a failure. Global correlation misses local usefulness ...
20. Hypothyroid SSV ranking
- More results using FSM and selection based on SSV.
SSV with beam search (P24) finds the best small subsets; depending on the search depth, the best results here are achieved for 5 attributes.
21. Conclusions
- About 20 ranking and selection methods have been checked.
- The actual feature evaluation index (information, consistency, correlation) is not so important.
- Discretization is very important: naive equi-width or equidistance discretization may give unpredictable results; entropy-based discretization is fine, but separability-based discretization is less expensive.
- Continuous, kernel-based approximations to the calculation of feature evaluation indices are a useful alternative.
- Ranking is easy if a global evaluation is sufficient, but different sets of features may be important for the separation of different classes, and some are important only in small regions (cf. decision trees).
- Selection requires the calculation of multidimensional evaluation indices, done effectively using hashing techniques.
- Local selection and ranking is the most promising technique.
22. Open questions
- Is the best selection method based on filters possible?
- Perhaps it depends on the ability of different methods to use the information contained in the selected attributes.
- Discretization or kernel estimation?
- Best discretization: V-opt histograms, entropy, separability?
- How useful is fuzzy partitioning?
- Use of feature weighting from ranking/selection to scale input data.
- How to create an evaluation index that includes local information?
- How to use selection methods to find combinations of attributes?
- These and other ranking/selection methods will be integrated into the GhostMiner data mining package - Google "GhostMiner".