Title: Analyzing Attribute Dependencies
1. Analyzing Attribute Dependencies
- Aleks Jakulin, Ivan Bratko
- Faculty of Computer and Information Science
- University of Ljubljana
- Slovenia
2. Overview
- Problem
  - Generalize the notion of correlation from two variables to three or more variables.
- Approach
  - Use Shannon's entropy as the foundation for quantifying interaction.
- Application
  - Visualization, with a focus on supervised learning domains.
- Result
  - We can explain several mysteries of machine learning through higher-order dependencies.
3. Problem: Attribute Dependencies
4. Approach: Shannon's Entropy
(Figure: entropy diagram over attributes A and C.)
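As an aside not on the slides, here is a minimal sketch of the quantity everything below builds on: Shannon's entropy estimated from empirical counts, in bits.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy H (bits) of a distribution given as raw counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

print(entropy([1, 1]))   # 1.0 bit: a fair coin
print(entropy([9, 1]))   # ~0.469 bits: a biased coin is more predictable
```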
5. Interaction Information
I(A;B;C) = I(AB;C) - I(B;C) - I(A;C) = I(A;B|C) - I(A;B)
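A hedged sketch of this definition (my illustration, not code from the talk): the same quantity expands into entropies as I(A;B;C) = H(AB) + H(AC) + H(BC) - H(A) - H(B) - H(C) - H(ABC), which is straightforward to compute from a three-dimensional joint probability table.

```python
import numpy as np

def marginal_entropy(joint, axes):
    """Entropy (bits) of the marginal of `joint` kept over `axes`."""
    drop = tuple(i for i in range(joint.ndim) if i not in axes)
    p = joint.sum(axis=drop)   # marginalize out the other axes
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def interaction_information(pabc):
    """I(A;B;C) = I(A;B|C) - I(A;B) for a normalized 3-d joint table."""
    h = lambda *axes: marginal_entropy(pabc, axes)
    return (h(0, 1) + h(0, 2) + h(1, 2)   # pairwise joint entropies
            - h(0) - h(1) - h(2)          # single-attribute entropies
            - h(0, 1, 2))                 # full joint entropy

pabc = np.full((2, 2, 2), 1 / 8)          # three independent fair bits
print(interaction_information(pabc))      # ~0.0: independence, no interaction
```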
(Partial) history of independent reinventions:
- McGill '54 (Psychometrika) - interaction information
- Han '80 (Information and Control) - multiple mutual information
- Yeung '91 (IEEE Trans. on Inf. Theory) - mutual information
- Grabisch & Roubens '99 (Int. J. of Game Theory) - Banzhaf interaction index
- Matsuda '00 (Physical Review E) - higher-order mutual inf.
- Brenner et al. '00 (Neural Computation) - average synergy
- Demšar '02 (a thesis in machine learning) - relative information gain
- Bell '03 (NIPS02, ICA2003) - co-information
- Jakulin '03 - interaction gain
6. Properties
- Invariance with respect to attribute/label division:
  I(A;B;C) = I(A;C;B) = I(C;A;B) = I(B;A;C) = I(C;B;A) = I(B;C;A).
- Decomposition of mutual information (verified numerically in the sketch below):
  I(AB;C) = I(A;C) + I(B;C) + I(A;B;C)
  - I(A;B;C) is synergistic information.
- A, B, C are independent ⇒ I(A;B;C) = 0.
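A quick numerical check of the decomposition, assuming NumPy; the random joint table is arbitrary, so this only demonstrates the identity, not any particular domain.

```python
import numpy as np

def H(p):
    """Entropy (bits) of any probability table."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

rng = np.random.default_rng(0)
pabc = rng.random((3, 4, 2))
pabc /= pabc.sum()                      # arbitrary joint over A, B, C

pab, pac, pbc = pabc.sum(2), pabc.sum(1), pabc.sum(0)
pa, pb, pc = pabc.sum((1, 2)), pabc.sum((0, 2)), pabc.sum((0, 1))

i_ab_c = H(pab) + H(pc) - H(pabc)       # I(AB;C): joined attributes vs. C
i_a_c = H(pa) + H(pc) - H(pac)          # I(A;C)
i_b_c = H(pb) + H(pc) - H(pbc)          # I(B;C)
i_abc = (H(pab) + H(pac) + H(pbc)       # interaction information I(A;B;C)
         - H(pa) - H(pb) - H(pc) - H(pabc))

assert np.isclose(i_ab_c, i_a_c + i_b_c + i_abc)  # the decomposition holds
```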
7. Positive and Negative Interactions
- If any pair of the attributes is conditionally independent given the third attribute, the 3-information neutralizes the 2-information: I(A;B|C) = 0 ⇒ I(A;B;C) = -I(A;B).
- Interaction information may be positive or negative (examples in the sketch below):
  - Positive: the XOR problem (C = A ⊕ B), synergy.
  - Negative: conditional independence, redundant attributes, redundancy.
  - Zero: independence of one of the attributes, or a mix of synergy and redundancy.
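Both signs can be reproduced with the helper from the slide-5 sketch, restated compactly here so the snippet runs standalone.

```python
import numpy as np

def H(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def ii(p):
    """I(A;B;C) for a 3-d joint table (same formula as the slide-5 sketch)."""
    return (H(p.sum(2)) + H(p.sum(1)) + H(p.sum(0))
            - H(p.sum((1, 2))) - H(p.sum((0, 2))) - H(p.sum((0, 1))) - H(p))

# Synergy: C = A XOR B with A, B independent fair bits.
xor = np.zeros((2, 2, 2))
for a in (0, 1):
    for b in (0, 1):
        xor[a, b, a ^ b] = 0.25
print(ii(xor))   # +1.0 bit: neither attribute alone tells us anything about C

# Redundancy: B and C are both exact copies of A.
dup = np.zeros((2, 2, 2))
dup[0, 0, 0] = dup[1, 1, 1] = 0.5
print(ii(dup))   # -1.0 bit: B is entirely redundant given A
```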
8. Applications
- Visualization
  - Interaction graphs
  - Interaction dendrograms
- Model construction
  - Feature construction
  - Feature selection
  - Ensemble construction
- Evaluation on the CMC domain: predicting contraceptive method from demographics.
9. Interaction Graphs
10. CMC
11. Application: Feature Construction

NBC model      Predictive perf. (Brier score)
-              0.2157 ± 0.0013
Wedu, Hedu     0.2087 ± 0.0024
Wedu           0.2068 ± 0.0019
Wedu×Hedu      0.2067 ± 0.0019
Age, Child     0.1951 ± 0.0023
Age×Child      0.1918 ± 0.0026
A×C×W×H        0.1873 ± 0.0027
A, C, W, H     0.1870 ± 0.0030
A, C, W        0.1850 ± 0.0027
A×C, W×H       0.1831 ± 0.0032
A×C, W         0.1814 ± 0.0033

(Here A = Age, C = Child, W = Wedu, H = Hedu; × denotes joining two attributes into a single Cartesian-product attribute.)
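The × rows denote Cartesian-product attributes. Below is a minimal sketch of that construction feeding scikit-learn's CategoricalNB; the data is random stand-in data, not the CMC domain, and `join` is a hypothetical helper.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

def join(a, b):
    """Cartesian-product attribute: one distinct value per (a, b) pair."""
    a, b = np.asarray(a), np.asarray(b)
    return a * (b.max() + 1) + b

# Random stand-ins for integer-coded CMC attributes (4 levels each).
rng = np.random.default_rng(1)
wedu = rng.integers(0, 4, size=200)
hedu = rng.integers(0, 4, size=200)
y = rng.integers(0, 3, size=200)       # 3-class label, as in CMC

X = join(wedu, hedu).reshape(-1, 1)    # the NBC sees one joined attribute
model = CategoricalNB().fit(X, y)
print(model.predict_proba(X[:3]))
```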
12. Alternatives

Model                                         Brier score
TAN                                           0.1874 ± 0.0032
NBC                                           0.1849 ± 0.0028
BEST (>100,000 models): A×C, W×H, MediaExp    0.1811 ± 0.0032
GBN                                           0.1815 ± 0.0029
13. Dissimilarity Measures
- The relationships between attributes are to some extent transitive.
- Algorithm:
  - Define a dissimilarity measure between two attributes in the context of the label C.
  - Apply hierarchical clustering to summarize the dissimilarity matrix.
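A sketch of the clustering step with SciPy, assuming the attribute-dissimilarity matrix `D` has already been computed; the values and attribute names below are made up for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

names = ["Age", "Child", "Wedu", "Hedu"]       # illustrative attributes
D = np.array([[0.0, 0.2, 0.7, 0.8],            # made-up pairwise
              [0.2, 0.0, 0.6, 0.9],            # dissimilarities: symmetric,
              [0.7, 0.6, 0.0, 0.1],            # zero on the diagonal
              [0.8, 0.9, 0.1, 0.0]])

Z = linkage(squareform(D), method="average")   # agglomerative clustering
dendrogram(Z, labels=names)                    # the interaction dendrogram
plt.show()
```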
14. Interaction Dendrogram

(Figure: interaction dendrogram; attributes range from weakly to strongly interacting, and cluster tightness from loose to tight.)
15. Application: Feature Selection
- Soybean domain:
  - predict disease from symptoms;
  - predominantly negative interactions.
- Global optimization procedure for feature selection: >5,000 NBC models tested (B-Course).
- Selected features balance dissimilarity and importance (see the sketch after this list).
- We can understand what the global optimization did from the dendrogram.
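One plausible greedy reading of "balance dissimilarity and importance" (my illustration only; the slides used a global optimization over >5,000 NBC models, not this procedure): repeatedly add the attribute whose importance, penalized by its similarity to attributes already chosen, is highest.

```python
def greedy_select(importance, dissim, k, alpha=0.5):
    """Pick k attributes, trading importance off against redundancy.

    importance: dict, attribute -> usefulness score (e.g. information gain)
    dissim: dict, (attr_a, attr_b) -> dissimilarity in [0, 1], symmetric
    alpha: hypothetical knob weighting the redundancy penalty
    """
    chosen, rest = [], set(importance)
    while rest and len(chosen) < k:
        def penalized(a):
            # Similarity to the closest already-chosen attribute.
            redundancy = max((1 - dissim[a, s] for s in chosen), default=0.0)
            return importance[a] - alpha * redundancy
        best = max(rest, key=penalized)
        chosen.append(best)
        rest.remove(best)
    return chosen
```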
16. Application: Ensembles
17. Implication: Assumptions in Machine Learning
18. Work in Progress
- Overfitting: the interaction information computations do not account for the increase in complexity.
- Support for numerical and ordered attributes.
- Inductive learning algorithms which use these heuristics automatically.
- Models that are based on the real relationships in the data, not on our assumptions about them.
19. Summary
- There are relationships exclusive to groups of n attributes.
- Interaction information is a heuristic for quantifying relationships with entropy.
- Two visualization methods:
  - Interaction graphs
  - Interaction dendrograms