Feature selection based on information theory, consistency and separability indices

1
Feature selection based on information theory,
consistency and separability indices
  • Wlodzislaw Duch, Tomasz Winiarski, Krzysztof
    Grabczewski, Jacek Biesiada, Adam Kachel
  • Dept. of Informatics, Nicholas Copernicus
    University, Torun, Poland
  • http://www.phys.uni.torun.pl/duch
  • ICONIP Singapore, 18-22.11.2002

2
What am I going to say
  • Selection of information
  • Information theory - filters
  • Information theory - selection
  • Consistency indices
  • Separability indices
  • Empirical comparison: artificial data
  • Empirical comparison: real data
  • Conclusions, or what have we learned?

3
Selection of information
  • Attention: a basic cognitive skill.
  • Find relevant information:
  • discard attributes that do not contain information,
  • use weights to express their relative importance,
  • create new, more informative attributes,
  • reduce dimensionality by aggregating information.
  • Ranking: treat each feature as independent. Selection: search for subsets, remove redundant features.
  • Filters: universal, model-independent criteria. Wrappers: criteria specific to the data models used.
  • Here: filters for ranking and selection.

4
Information theory - filters
  • X - vectors, Xj - attributes, Xjf - attribute values,
  • Ci - class, i = 1..K; joint probability distribution p(C, Xj).
  • The amount of information contained in this joint distribution, summed over all classes, gives an estimation of feature importance (see the sketch below).

For continuous attribute values the integrals are approximated by sums. This implies discretization into rk(f) regions, an issue in itself. An alternative is fitting the p(Ci, f) density using Gaussian or other kernels. Which method is more accurate, and what are the expected errors?
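
The formula itself was an image on the original slide; a minimal reconstruction, assuming the standard Shannon form over the discretized regions rk(Xj):

    I(C, X_j) = - \sum_{i=1}^{K} \sum_{k} p(C_i, r_k(X_j)) \log_2 p(C_i, r_k(X_j))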
5
Information gain
  • Information gained by considering the joint probability distribution p(C, f) is the difference between the information in the marginal distributions and the information in the joint distribution (reconstructed below).
  • A feature is more important if its information gain is larger.
  • Modifications of the information gain, frequently used as criteria in decision trees, include:
    IGR(C,Xj) = IG(C,Xj)/I(Xj), the gain ratio;
    IGn(C,Xj) = IG(C,Xj)/I(C), an asymmetric dependency coefficient;
    DM(C,Xj) = 1 - IG(C,Xj)/I(C,Xj), the normalized Mantaras distance.
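
The missing slide equation, reconstructed under the usual identity relating information gain to the individual and joint informations:

    IG(C, X_j) = I(C) + I(X_j) - I(C, X_j) = I(C) - I(C|X_j)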

6
Information indices
  • Information gained by considering attribute Xj and classes C together is also known as mutual information, equal to the Kullback-Leibler divergence between the joint and product probability distributions.

The entropy distance measure is a sum of conditional informations. The symmetrical uncertainty coefficient is obtained from the entropy distance.
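
The slide equations were images; the standard forms of the three quantities named above, given here as a reconstruction:

    MI(C, X_j) = D_{KL}\big(p(C, X_j) \,\|\, p(C)\,p(X_j)\big)
               = \sum_{i,k} p(C_i, r_k(X_j)) \log_2 \frac{p(C_i, r_k(X_j))}{p(C_i)\, p(r_k(X_j))}

    D(C, X_j) = I(C|X_j) + I(X_j|C)

    U(C, X_j) = \frac{2\, MI(C, X_j)}{I(C) + I(X_j)} = 1 - \frac{D(C, X_j)}{I(C) + I(X_j)}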
7
Weighted I(C,X)
  • The joint information should be weighted by p(rk(f)).

For continuous attribute values the integrals are approximated by sums. This implies discretization into rk(f) regions, an issue in itself. An alternative is fitting the p(Ci, f) density using Gaussian or other kernels. Which method is more accurate, and how large are the expected errors?
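
To make the discretization step concrete, here is a small illustrative sketch (not the authors' code): it bins one continuous attribute into equal-width regions rk(f) and estimates MI(C, f) from the resulting counts. All names are made up for this example.

    import numpy as np

    def mutual_information(values, classes, n_bins=8):
        """Estimate MI(C, f) for one continuous attribute via equal-width binning."""
        values, classes = np.asarray(values, dtype=float), np.asarray(classes)
        # Discretize the attribute into n_bins equal-width regions r_k(f).
        edges = np.linspace(values.min(), values.max(), n_bins + 1)
        bins = np.digitize(values, edges[1:-1])          # bin indices 0 .. n_bins-1
        # Joint counts over (class, bin) estimate p(C_i, r_k(f)).
        labels = np.unique(classes)
        joint = np.zeros((len(labels), n_bins))
        for i, c in enumerate(labels):
            joint[i] = np.bincount(bins[classes == c], minlength=n_bins)
        p_joint = joint / joint.sum()
        p_c = p_joint.sum(axis=1, keepdims=True)         # p(C_i)
        p_f = p_joint.sum(axis=0, keepdims=True)         # p(r_k(f))
        nz = p_joint > 0
        return float(np.sum(p_joint[nz] * np.log2(p_joint[nz] / (p_c @ p_f)[nz])))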
8
Purity indices
  • Many information-based quantities may be used to evaluate attributes. Consistency or purity-based indices are one alternative.

For selection of a subset F of attributes Xi the sum runs over all Cartesian products, i.e. multidimensional partitions rk(F). Advantages: the simplest approach to both ranking and selection. Hashing techniques are used to calculate the p(rk(F)) probabilities.
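
A minimal sketch (not from the paper) of how hashing, here a Python dict, accumulates the multidimensional cells rk(F) of a feature subset and computes a simple purity index: the p(rk(F))-weighted fraction of the majority class in each cell. Names are illustrative.

    from collections import defaultdict

    def purity_index(binned_features, classes):
        """Purity of a feature subset: sum_k p(r_k(F)) * max_i p(C_i | r_k(F))."""
        # Hash each multidimensional cell r_k(F) (a tuple of bin indices) to its class counts.
        cells = defaultdict(lambda: defaultdict(int))
        for row, c in zip(binned_features, classes):
            cells[tuple(row)][c] += 1
        # Weighted majority-class fraction; equals 1.0 for a perfectly consistent subset.
        return sum(max(counts.values()) for counts in cells.values()) / len(classes)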
9
4 Gaussians in 8D
  • Artificial data set of 4 Gaussians in 8D, 1000 points per Gaussian, each Gaussian treated as a separate class (a generator sketch follows below).

Dimensions 1-4 are independent Gaussians centered at (0,0,0,0), (2,1,0.5,0.25), (4,2,1,0.5), (6,3,1.5,0.75). Ranking and overlapping strength are inversely related: ranking X1 > X2 > X3 > X4. Attributes Xi+4 = 2Xi + uniform noise of amplitude 1.5. Best ranking: X1 > X5 > X2 > X6 > X3 > X7 > X4 > X8. Best selection: X1 > X2 > X3 > X4 > X5 > X6 > X7 > X8.
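
A generator sketch under the assumptions stated above; details the slide does not give, such as unit variance of the Gaussians, are assumptions of this example.

    import numpy as np

    def make_4gauss_8d(n_per_class=1000, noise=1.5, seed=0):
        """4 Gaussian classes in 8D: dims 1-4 informative, dims 5-8 noisy linear copies."""
        rng = np.random.default_rng(seed)
        centers = np.array([[0, 0, 0, 0],
                            [2, 1, 0.5, 0.25],
                            [4, 2, 1, 0.5],
                            [6, 3, 1.5, 0.75]])
        # Unit variance assumed; the slide only specifies the centers.
        X4 = np.vstack([rng.normal(c, 1.0, size=(n_per_class, 4)) for c in centers])
        # X_{i+4} = 2 * X_i + uniform noise (amplitude 1.5 as stated on the slide).
        X8 = np.hstack([X4, 2 * X4 + rng.uniform(-noise, noise, size=X4.shape)])
        y = np.repeat(np.arange(len(centers)), n_per_class)
        return X8, y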
10
Dim X1 vs. X2
11
Dim X1 vs. X5
12
Ranking algorithms
  • WI(C,f): information from the weighted p(r(f)) p(C,r(f)) distribution
  • MI(C,f): mutual information (information gain)
  • IGR(C,f): information gain ratio
  • IC(C,f): information from the maxC posterior distribution
  • GD(C,f): transinformation matrix with Mahalanobis distance
  • 7 other methods based on IC and correlation-based distances, Markov blanket and ReliefF selection methods.

13
Selection algorithms
  • Maximize the evaluation criterion for single attributes, removing redundant features.
  • 1. MI(C,f) - β MI(f,g) algorithm (Battiti 1994), sketched below.

2. IC(C,f) - β IC(f,g): the same algorithm, but with the IC criterion.
3. Max IC(C,F): add the single attribute that maximizes IC.
4. Max MI(C,F): add the single attribute that maximizes MI.
5. SSV decision tree, based on the separability criterion.
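
A minimal sketch of the greedy rule in algorithm 1 (Battiti's MIFS), assuming a precomputed vector of class-feature MI values and a matrix of pairwise feature MI values; an illustration, not the authors' implementation.

    import numpy as np

    def mifs_select(mi_class, mi_feat, n_select, beta=0.5):
        """Greedy MIFS: repeatedly add the f maximizing MI(C,f) - beta * sum_g MI(f,g)."""
        # mi_class: shape (n_features,), MI between each feature and the class.
        # mi_feat:  shape (n_features, n_features), pairwise MI between features.
        selected, remaining = [], list(range(len(mi_class)))
        for _ in range(n_select):
            scores = {f: mi_class[f] - beta * sum(mi_feat[f, g] for g in selected)
                      for f in remaining}
            best = max(scores, key=scores.get)
            selected.append(best)
            remaining.remove(best)
        return selected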
14
Ranking for 8D Gaussians
  • Partitions of each attribute into 4, 8, 16, 24,
    32 parts, with equal width.
  • Methods that found the perfect ranking: MI(C,f), IGR(C,f), WI(C,f), GD transinformation distance.
  • IC(f): correct, except for P8, where features 2 and 6 are reversed (6 is the noisy version of 2).
  • Other, more sophisticated algorithms made more errors.
  • Selection for Gaussian distributions is rather
    easy using any evaluation measure.
  • Simpler algorithms work better.

15
Selection for 8D Gaussians
  • Partitions of each attribute into 4, 8, 16, 24,
    32 parts, with equal width.
  • Ideal selection: subsets with {1}, {1,2}, {1,2,3}, or {1,2,3,4} attributes.
  • 1. The MI(C,f) - 0.5 MI(f,g) algorithm: P24 no errors; for P8, P16, P32 a small error (4 ↔ 8).
  • Max MI(C,F): P8-P24 no errors; P32 (3,4 ↔ 7,8).
  • Max IC(C,F): P24 no errors; P8 (2 ↔ 6), P16 (3 ↔ 7), P32 (3,4 ↔ 7,8).
  • The SSV decision tree, based on the separability criterion, creates its own discretization. It selects 1, 2, 6, 3, 7; the others are not important.
  • Univariate trees have a bias for slanted distributions. Selection should take into account the type of classification system that will be used.

16
Hypothyroid equal bins
  • Mutual information for different numbers of equal-width partitions, ordered from largest to smallest, for the hypothyroid data: 6 continuous and 15 binary attributes.

17
Hypothyroid SSV bins
  • Mutual information for different numbers of SSV decision tree partitions, ordered from largest to smallest, for the hypothyroid data. Values are twice as large, since the bins are purer.

18
Hypothyroid ranking
  • Best ranking: the largest area under the curve of accuracy(best n features); see the sketch below.

SBL: evaluating and adding one attribute at a time (costly). Best 2: SBL; best 3: SSV BFS; best 4: SSV beam search; BA: failure.
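
A tiny sketch of the ranking-quality criterion mentioned above, assuming a list of accuracies obtained by retraining a classifier on the top-n ranked features; purely illustrative.

    import numpy as np

    def ranking_area(accuracies):
        """Area under the accuracy(best n features) curve; larger = better ranking."""
        # accuracies[n-1] = accuracy of a classifier trained on the n top-ranked features.
        n = np.arange(1, len(accuracies) + 1)
        return float(np.trapz(accuracies, n))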
19
Hypothyroid ranking
  • Results from the FSM neurofuzzy system.

Best 2: SBL; best 3: SSV BFS; best 4: SSV beam search; BA: failure. Global correlation misses local usefulness...
20
Hypothyroid SSV ranking
  • More results using FSM and selection based on
    SSV.

SSV with beam search (P24) finds the best small subsets; depending on the search depth, the best results here are achieved with 5 attributes.
21
Conclusions
  • About 20 ranking and selection methods have been
    checked.
  • The actual feature evaluation index (information,
    consistency, correlation) is not so important.
  • Discretization is very important: naive equi-width or equi-distance discretization may give unpredictable results; entropy-based discretization is fine, but separability-based discretization is less expensive.
  • Continuous, kernel-based approximations to the calculation of feature evaluation indices are a useful alternative.
  • Ranking is easy if a global evaluation is sufficient, but different sets of features may be important for the separation of different classes, and some are important only in small regions (cf. decision trees).
  • Selection requires calculation of
    multidimensional evaluation indices, done
    effectively using hashing techniques.
  • Local selection and ranking is the most promising
    technique.

22
Open questions
  • Is a best selection method based on filters possible?
  • Perhaps it depends on the ability of different methods to use the information contained in the selected attributes.
  • Discretization or kernel estimation?
  • Best discretization: V-optimal histograms, entropy, separability?
  • How useful is fuzzy partitioning?
  • Use of feature weighting from ranking/selection to scale input data.
  • How to construct an evaluation index that includes local information?
  • How to use selection methods to find combinations of attributes?
  • These and other ranking/selection methods will be integrated into the GhostMiner data mining package.
  • Google: GhostMiner