Title: Finding low-entropy sets and trees from binary data
1Finding low-entropy sets and trees from binary
data
- Hannes Heikinheimo, Eino Hinkkanen
- Heikki Mannila, Taneli Mielikäinen, Jouni K. Seppänen
- HIIT Basic Research Unit
- University of Helsinki, Helsinki University of Technology
- Finland
2Summary (1)
- Low-entropy set: a set X of variables such that the data on X has small entropy
- Such attribute sets are simple
- A generalization of frequent sets
- Task: find all low-entropy sets
- Monotone concept, thus levelwise search
3Summary (2)
- D-trees: low-entropy sets such that the dependency structure of the variables in X is a tree with edges going away from the root
- U-trees: low-entropy sets such that the dependency structure of the variables in X is a tree with edges going towards the root
- Modified levelwise search
4Summary (3)
- Experimental results on generated and real data
- The concepts produce intuitive results
- Efficiency is OK
5Outline of the talk
- Low entropy sets
- D-trees
- U-trees
- Experiments
6Low entropy attribute sets
- A 0-1 dataset
- The entropy of the data projected to a set X of attributes
- The entropy of X, denoted H(X)
- A threshold s
- A set X of attributes has low entropy if the
entropy H(X) of the data on X is less than s
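The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes H(X) is the empirical Shannon entropy of the value combinations observed on X, measured in bits (the slides do not state the logarithm base), and the names `entropy` and `is_low_entropy` as well as the toy data are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Empirical entropy in bits of the rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def is_low_entropy(rows, attrs, s):
    """True if the attribute set has entropy strictly below the threshold s."""
    return entropy(rows, attrs) < s

# Toy 0-1 data, invented for illustration: columns 0 and 1 always agree,
# column 2 varies independently of them.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
h01 = entropy(rows, (0, 1))      # two equally likely patterns
h012 = entropy(rows, (0, 1, 2))  # four equally likely patterns
```

On this toy data, columns 0 and 1 together carry no more entropy than either column alone, while adding column 2 adds a full bit.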
7Example
A dataset with lots of attributes
Low entropy set X = {A, B, C}
High entropy set Y = {D, E, F}
A B C ... D E F ...
0 1 0     0 1 0
0 1 0     1 0 1
0 1 0     0 1 1
0 1 1     1 1 0
0 1 1     0 0 1
0 1 0     1 1 1
0 1 1     0 0 0
0 1 0     0 1 1
8Problem definition
- Given a dataset and a threshold s
- Find all sets X of attributes such that H(X) < s
- As H(X) is monotone, levelwise search works
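A levelwise search of this kind might look as follows. This is a sketch in the style of Apriori-type candidate generation, not the authors' implementation; it relies only on the fact that adding attributes never decreases entropy, so every subset of a low-entropy set is itself low-entropy. The function names and the toy data are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(rows, attrs):
    """Empirical entropy in bits of the rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def low_entropy_sets(rows, s):
    """Return all attribute sets X (as sorted index tuples) with H(X) < s."""
    n_attrs = len(rows[0])
    level = [(a,) for a in range(n_attrs) if entropy(rows, (a,)) < s]
    found = []
    while level:
        found.extend(level)
        current = set(level)
        candidates = set()
        for x in level:
            # Extend each set with a larger attribute index to avoid duplicates.
            for a in range(x[-1] + 1, n_attrs):
                cand = x + (a,)
                # Prune: all subsets one element smaller must be low-entropy.
                if all(sub in current for sub in combinations(cand, len(x))):
                    candidates.add(cand)
        level = sorted(c for c in candidates if entropy(rows, c) < s)
    return found

# Toy data, invented for illustration: columns 0 and 1 always agree.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
result = low_entropy_sets(rows, s=1.5)
```

With threshold 1.5, every single column qualifies, but only the pair (0, 1) survives to the second level, so no three-attribute candidate is ever generated.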
9D-trees
- A low-entropy set X is simple
- A set can be simple in many ways: the connections between attributes can still be complex
- D-trees and U-trees: low-entropy sets of restricted type
- D-tree T: a tree-shaped Bayes net on a subset X of attributes such that the entropy H_T(X) of the data on X when viewed by T is low.
10Example of a D-tree
H_T(A,B,C) = H(A) + H(B | A) + H(C | A)
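The tree entropy in this example can be computed from single and pairwise entropies alone, using H(B | A) = H(A,B) - H(A). The sketch below assumes a D-tree is stored as a mapping from each node to its parent (with `None` for the root); that representation, the function names, and the data are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Empirical entropy in bits of the rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def d_tree_entropy(rows, parent):
    """H_T for a D-tree given as {node: parent}, with None for the root."""
    total = 0.0
    for node, par in parent.items():
        if par is None:
            total += entropy(rows, (node,))
        else:
            # H(node | par) = H(par, node) - H(par)
            total += entropy(rows, (par, node)) - entropy(rows, (par,))
    return total

# Invented data: B copies A, C is independent of A.
rows = [
    {"A": 0, "B": 0, "C": 0},
    {"A": 0, "B": 0, "C": 1},
    {"A": 1, "B": 1, "C": 0},
    {"A": 1, "B": 1, "C": 1},
]
tree = {"A": None, "B": "A", "C": "A"}
h_tree = d_tree_entropy(rows, tree)  # H(A) + H(B | A) + H(C | A)
```

Here H(A) is one bit, H(B | A) is zero because B is determined by A, and H(C | A) is one bit, giving a tree entropy of two bits.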
11U-trees
- As D-trees, but the direction of the links has
been reversed
H_T(A,B,C) = H(B) + H(C) + H(A | B,C)
Example U-tree
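To see how the two tree types can differ on the same attributes, the following sketch evaluates both formulas on invented data where A is the exclusive-or of B and C. It assumes, generalizing from the slide's example, that a U-tree's entropy sums over each node the entropy of that node conditioned on all nodes whose edges point to it; the mapping `u_tree` and the function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log2

def entropy(rows, attrs):
    """Empirical entropy in bits of the rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def u_tree_entropy(rows, preds):
    """H_T for a U-tree given as {node: tuple of nodes pointing to it}."""
    total = 0.0
    for node, incoming in preds.items():
        # H(node | incoming) = H(incoming, node) - H(incoming);
        # the entropy of the empty attribute set is 0.
        total += entropy(rows, incoming + (node,)) - entropy(rows, incoming)
    return total

# Invented data: B and C are uniform and independent, A = B XOR C.
rows = [{"A": b ^ c, "B": b, "C": c} for b, c in product((0, 1), repeat=2)]

u_tree = {"A": ("B", "C"), "B": (), "C": ()}
h_u = u_tree_entropy(rows, u_tree)  # H(B) + H(C) + H(A | B,C)

# D-tree rooted at A on the same data: H(A) + H(B | A) + H(C | A)
h_a = entropy(rows, ("A",))
h_d = h_a + (entropy(rows, ("A", "B")) - h_a) + (entropy(rows, ("A", "C")) - h_a)
```

On this data the U-tree has entropy 2 bits, equal to the joint entropy, because A is fully determined by B and C together. The D-tree rooted at A scores 3 bits, since neither B nor C alone is predictable from A.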
12Problem definitions
- Given data and a threshold s, find all D-trees T such that H_T(X) < s
- Given data and a threshold s, find all U-trees T such that H_T(X) < s
13Algorithm for finding low-entropy trees
- First phase: generate all height-one trees with entropy below threshold
  - D-trees: pairwise entropies
  - U-trees: breadth-first search
- Second phase
  - combine pairs of trees, forming new trees
  - stop when all new trees have entropy above threshold
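One plausible reading of the D-tree half of the first phase is sketched below. For a height-one D-tree with root r and children c1, ..., ck, the tree entropy is H(r) + H(c1 | r) + ... + H(ck | r), so a table of pairwise conditional entropies is enough to score every such tree. The enumeration strategy, the function name `height_one_d_trees`, and the toy data are assumptions for illustration; the slides do not give the authors' procedure, and the breadth-first search for U-trees is not covered here.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Empirical entropy in bits of the rows projected onto attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def height_one_d_trees(rows, s):
    """List (root, children, entropy) for height-one D-trees with entropy < s."""
    n_attrs = len(rows[0])
    h1 = {a: entropy(rows, (a,)) for a in range(n_attrs)}
    # Pairwise table: cond[r][c] = H(c | r) = H(r, c) - H(r).
    cond = {
        r: {c: entropy(rows, (r, c)) - h1[r] for c in range(n_attrs) if c != r}
        for r in range(n_attrs)
    }
    trees = []

    def grow(r, others, start, kids, total):
        # Adding a child never lowers the entropy, so a child set that
        # reaches the threshold is not extended further.
        for i in range(start, len(others)):
            c = others[i]
            t = total + cond[r][c]
            if t < s:
                trees.append((r, kids + (c,), t))
                grow(r, others, i + 1, kids + (c,), t)

    for r in range(n_attrs):
        grow(r, sorted(cond[r]), 0, (), h1[r])
    return trees

# Toy data, invented for illustration: columns 0 and 1 always agree.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
trees = height_one_d_trees(rows, s=1.5)
shapes = [(root, kids) for root, kids, _ in trees]
```

With threshold 1.5, only the two trees linking columns 0 and 1 qualify; any tree involving column 2 costs at least two bits.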
14Experimental results
- Generated data
  - The algorithms find the trees used to generate the data
- Real data about
  - Courses taken by students at the CS department of University of Helsinki.
  - Terms used in a bibliography on theory and foundations of computer science.
15Experimental results: Course data
- 2405 observations (students) and 5021 attributes (columns) corresponding to courses.
- Preprocessing: rare attributes removed.
16Experimental results: Course data
The lowest-entropy 5-node U-tree (left) and
7-node D-tree (right) found in the course data
17Related work
- Lots of work on finding frequent sets and variations.
- Siebes et al. (2006): selecting itemsets that compress the data.
  - only all-1s items.
- Knobbe and Ho (2006): maximally informative k-itemsets, itemsets with high entropy.
  - fixed number k of elements.
- Heikinheimo et al. (2006): local tree patterns of general and more specific items.
  - U-trees are a generalization of this idea.
- Decision trees and Bayes networks
  - Complete models for entire data
  - Here, lots of small trees for different subsets of attributes
18Concluding remarks
- We considered the problem of finding local structure in binary data in the form of low-entropy sets and trees.
- The idea is a natural generalization of frequent sets and association rules.
- We gave effective algorithms for finding both low-entropy sets and trees.
- Experiments showed that the approach is feasible and can find interesting structure in data.
19Algorithm: Low-Entropy U-trees
- A U-tree T and a U-tree U of height 1 may be
combined to produce a U-tree V, if the root of U
is a leaf of T.
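The combination rule can be illustrated with a small sketch. It assumes a U-tree is stored as a mapping from each node to the tuple of nodes whose edges point to it (leaves map to an empty tuple). It also assumes that an attribute may appear only once in a tree, a restriction the slide does not state. The helper names `root_of`, `leaves_of`, and `combine` are hypothetical.

```python
def root_of(tree):
    """The unique node that does not point to any other node."""
    pointing = {k for kids in tree.values() for k in kids}
    (root,) = [n for n in tree if n not in pointing]
    return root

def leaves_of(tree):
    """Nodes with no incoming edges."""
    return {n for n, kids in tree.items() if not kids}

def combine(t, u):
    """Attach the height-one U-tree u to the U-tree t, or return None.

    The combination is allowed when the root of u is a leaf of t.
    """
    r = root_of(u)
    if r not in leaves_of(t):
        return None
    # Assumption: the new leaves must not already occur in t.
    if set(u[r]) & set(t):
        return None
    v = dict(t)
    v[r] = u[r]
    for leaf in u[r]:
        v[leaf] = ()
    return v

# T: B and C point to A. U: D and E point to B, and B is a leaf of T.
t = {"A": ("B", "C"), "B": (), "C": ()}
u = {"B": ("D", "E"), "D": (), "E": ()}
v = combine(t, u)

# A height-one tree rooted at A cannot be attached, since A is not a leaf of T.
rejected = combine(t, {"A": ("D",), "D": ()})
```

Under the conditional-entropy reading of U-trees, the combined tree's entropy differs from that of T only in the term for the shared node: H(B) is replaced by H(B | D,E), so the candidate can be scored without recomputing the rest of the tree.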