Finding low-entropy sets and trees from binary data

1
Finding low-entropy sets and trees from binary data

Hannes Heikinheimo, Eino Hinkkanen, Heikki Mannila, Taneli Mielikäinen, Jouni K. Seppänen
HIIT Basic Research Unit
University of Helsinki and Helsinki University of Technology
Finland

2
Summary (1)

Low-entropy set: a set X of variables such that the data on X has small entropy
Such attribute sets are simple
A generalization of frequent sets
Task: find all low-entropy sets
Monotone concept, thus levelwise search

3
Summary (2)

D-trees: low-entropy sets such that the dependency structure of the variables in X is a tree with edges going away from the root
U-trees: low-entropy sets such that the dependency structure of the variables in X is a tree with edges going towards the root
Modified levelwise search

4
Summary (3)

Experimental results on generated and real data
The concepts produce intuitive results
Efficiency is OK

5
Outline of the talk

Low entropy sets
D-trees
U-trees
Experiments

6
Low entropy attribute sets

A 0-1 dataset
The entropy of the data projected to a set X of
attributes
The entropy of X
A threshold s
A set X of attributes has low entropy if the
entropy H(X) of the data on X is less than s
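
As a minimal sketch of this definition, assuming H(X) is the empirical Shannon entropy (in bits) of the rows restricted to the columns in X: the function names, the toy rows, and the threshold value below are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical entropy (bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def has_low_entropy(rows, attrs, s):
    """True if the projection onto attrs has entropy strictly below the threshold s."""
    return entropy(rows, attrs) < s


# Hypothetical toy data: columns 0 and 1 always agree, column 2 varies freely.
data = [
    (0, 0, 0),
    (0, 0, 1),
    (1, 1, 0),
    (1, 1, 1),
]

print(entropy(data, (0, 1)))               # only two distinct patterns occur
print(entropy(data, (0, 2)))               # all four patterns occur equally often
print(has_low_entropy(data, (0, 1), 1.5))
```

In this toy data the pair of columns (0, 1) takes only two of the four possible patterns, so it has lower entropy than the pair (0, 2).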

7
Example
A dataset with lots of attributes
Low-entropy set: X = {A, B, C}
High-entropy set: Y = {D, E, F}
[Example table: 0-1 rows over columns A, B, C, ..., D, E, F, ...]
8
Problem definition

Given a dataset and a threshold s
Find all sets X of attributes such that H(X) < s
As H(X) is monotone (adding attributes never decreases the entropy), levelwise search works
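
The levelwise idea can be sketched as follows. This is an Apriori-style illustration under the assumption that H is the empirical entropy in bits; the function names and toy data are hypothetical and this is not the authors' implementation. Because entropy never decreases when attributes are added, a candidate set is only evaluated if every subset with one attribute fewer was already found to have low entropy.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical entropy (bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def low_entropy_sets(rows, n_attrs, s):
    """Return {frozenset of columns: entropy} for every set X with H(X) < s."""
    # Level 1: single attributes below the threshold.
    level = {}
    for a in range(n_attrs):
        h = entropy(rows, (a,))
        if h < s:
            level[frozenset([a])] = h
    found = dict(level)

    while level:
        # Candidate generation: extend each set by one attribute and keep the
        # candidate only if all of its one-smaller subsets are in this level.
        candidates = set()
        for x in level:
            for a in range(n_attrs):
                if a not in x:
                    c = x | {a}
                    if all(c - {b} in level for b in c):
                        candidates.add(c)
        # Evaluation: compute the entropy of each surviving candidate.
        level = {}
        for c in candidates:
            h = entropy(rows, sorted(c))
            if h < s:
                level[c] = h
        found.update(level)
    return found


# Hypothetical toy data: columns 0 and 1 always agree, column 2 varies freely.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
result = low_entropy_sets(data, n_attrs=3, s=1.5)
for attrs in sorted(result, key=lambda x: (len(x), sorted(x))):
    print(sorted(attrs), result[attrs])
```

On the toy data the set {0, 1, 2} is never evaluated, because its subsets {0, 2} and {1, 2} already exceed the threshold.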

9
D-trees

A low-entropy set X is simple
A set can be simple in many ways; the connections between attributes can still be complex
D-trees and U-trees: low-entropy sets of restricted type
D-tree T: a Bayes net on a subset X of attributes such that the entropy H_T(X) of the data on X when viewed by T is low.

10
Example of a D-tree
H_T(A, B, C) = H(A) + H(B | A) + H(C | A)
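
A small sketch of how such a D-tree entropy could be evaluated, assuming empirical entropies in bits and the identity H(B | A) = H(A, B) - H(A). The tree representation (a root plus a child-to-parent map), the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical entropy (bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def cond_entropy(rows, target, given):
    """H(target | given), computed as H(given, target) - H(given)."""
    return entropy(rows, list(given) + [target]) - entropy(rows, list(given))


def dtree_entropy(rows, root, parent):
    """Entropy of a D-tree: H(root) plus H(child | parent) for every edge.

    parent maps each non-root node to its parent (edges point away from the root).
    """
    total = entropy(rows, [root])
    for child, p in parent.items():
        total += cond_entropy(rows, child, [p])
    return total


# Hypothetical toy data with columns A=0, B=1, C=2: B copies A, C is independent.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]

# The D-tree of the example: root A with children B and C.
h_tree = dtree_entropy(data, root=0, parent={1: 0, 2: 0})
h_joint = entropy(data, [0, 1, 2])
print(h_tree, h_joint)
```

Since H(C | A) >= H(C | A, B), the tree entropy is never smaller than the joint entropy H(A, B, C); in this toy data the two coincide because C is independent of B given A.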
11
U-trees

As D-trees, but the direction of the links has
been reversed

H_T(A, B, C) = H(B) + H(C) + H(A | B, C)
Example U-tree
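
The U-tree entropy can be sketched in the same way: each node is conditioned jointly on the nodes whose edges point into it, and leaves contribute their marginal entropy. The dictionary representation, the function names, and the XOR toy data below are illustrative assumptions.

```python
from collections import Counter
from math import log2


def entropy(rows, attrs):
    """Empirical entropy (bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def utree_entropy(rows, children):
    """Entropy of a U-tree.

    children maps every node to the list of nodes whose edges point into it
    (leaves map to an empty list). Each node contributes
    H(node | its children) = H(children, node) - H(children).
    """
    total = 0.0
    for node, kids in children.items():
        total += entropy(rows, kids + [node]) - entropy(rows, kids)
    return total


# Hypothetical toy data with columns A=0, B=1, C=2, where A = B XOR C.
data = [(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)]

# The U-tree of the example: edges from B and C towards the root A.
h_utree = utree_entropy(data, {0: [1, 2], 1: [], 2: []})
print(h_utree)
```

In the XOR data, A is fully determined by B and C together but not by either one alone, which is the kind of dependency a U-tree can capture.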
12
Problem definitions

Given data and a threshold s, find all D-trees T such that H_T(X) < s
Given data and a threshold s, find all U-trees T such that H_T(X) < s

13
Algorithm for finding low-entropy trees

First phase: generate all height-one trees with entropy below threshold
D-trees: pairwise entropies
U-trees: breadth-first search
Second phase:
combine pairs of trees, forming new trees
stop when all new trees have entropy above threshold
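
The first phase for D-trees might look roughly like the sketch below. It assumes that a height-one D-tree is a root with one or more children, and that its entropy is H(root) plus the sum of the pairwise conditional entropies H(child | root). The enumeration is brute force and only suitable for a handful of attributes; the function names and toy data are hypothetical, and the second (combination) phase is not shown.

```python
from collections import Counter
from itertools import combinations
from math import log2


def entropy(rows, attrs):
    """Empirical entropy (bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())


def height_one_dtrees(rows, n_attrs, s):
    """List (root, children, entropy) for all height-one D-trees with entropy < s."""
    # Single-attribute entropies H(a).
    h1 = [entropy(rows, [a]) for a in range(n_attrs)]
    # Pairwise table: cond[r, c] = H(c | r) = H(r, c) - H(r).
    cond = {
        (r, c): entropy(rows, [r, c]) - h1[r]
        for r in range(n_attrs)
        for c in range(n_attrs)
        if r != c
    }
    trees = []
    for r in range(n_attrs):
        others = [c for c in range(n_attrs) if c != r]
        # Brute-force enumeration of child sets (an assumption of this sketch).
        for k in range(1, len(others) + 1):
            for kids in combinations(others, k):
                h = h1[r] + sum(cond[r, c] for c in kids)
                if h < s:
                    trees.append((r, kids, h))
    return trees


# Hypothetical toy data: columns 0 and 1 always agree, column 2 varies freely.
data = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
trees = height_one_dtrees(data, n_attrs=3, s=1.5)
for root, kids, h in trees:
    print(root, kids, h)
```

Because the tree entropy is a sum of the root entropy and per-edge terms from the pairwise table, no further passes over the data are needed once the table is built.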

14
Experimental results

Generated data:
The algorithms find the trees used to generate the data
Real data about:
Courses taken by students at the CS department of University of Helsinki.
Terms used in a bibliography on theory and foundations of computer science.

15
Experimental results: Course data

2405 observations (students) and 5021 attributes (columns) corresponding to courses.
Preprocessing: rare attributes removed.

16
Experimental results: Course data
The lowest-entropy 5-node U-tree (left) and
7-node D-tree (right) found in the course data
17
Related work

Lots of work on finding frequent sets and variations.
Siebes et al. (2006): selecting itemsets that compress the data.
  Only all-1s items.
Knobbe and Ho (2006): maximally informative k-itemsets, itemsets with high entropy.
  Fixed number k of elements.
Heikinheimo et al. (2006): local tree patterns of general and more specific items.
  U-trees are a generalization of this idea.
Decision trees and Bayes networks:
  Complete models for entire data.
  Here, lots of small trees for different subsets of attributes.

18
Concluding remarks

We considered the problem of finding local
structure in binary data in the form of
low-entropy sets and trees.
The idea is a natural generalization of frequent
sets and association rules.
We gave effective algorithms for finding both
low-entropy sets and trees.
Experiments showed that the approach is feasible
and can find interesting structure in data.

19
Algorithm: Low-Entropy U-trees