Finding low-entropy sets and trees from binary data

1
Finding low-entropy sets and trees from binary
data
  • Hannes Heikinheimo, Eino Hinkkanen,
    Heikki Mannila, Taneli Mielikäinen,
    Jouni K. Seppänen
  • HIIT Basic Research Unit
  • University of Helsinki and Helsinki University of
    Technology
  • Finland

2
Summary (1)
  • Low-entropy set: a set X of variables such that
    the data on X has small entropy
  • Such attribute sets are simple
  • A generalization of frequent sets
  • Task: find all low-entropy sets
  • Monotone concept, thus levelwise search

3
Summary (2)
  • D-trees: low-entropy sets such that the dependency
    structure of the variables in X is a tree with
    edges going away from the root
  • U-trees: low-entropy sets such that the dependency
    structure of the variables in X is a tree with
    edges going towards the root
  • Modified levelwise search

4
Summary (3)
  • Experimental results on generated and real data
  • The concepts produce intuitive results
  • Efficiency is OK

5
Outline of the talk
  • Low entropy sets
  • D-trees
  • U-trees
  • Experiments

6
Low entropy attribute sets
  • A 0-1 dataset
  • The entropy of the data projected to a set X of
    attributes
  • The entropy of X
  • A threshold s
  • A set X of attributes has low entropy if the
    entropy H(X) of the data on X is less than s
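
The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the helper name `entropy`, the toy rows, the threshold value, and the choice of base-2 logarithms are all assumptions made here.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Empirical entropy (in bits) of the rows projected onto the columns in attrs."""
    # Count how often each 0/1 pattern occurs on the selected columns.
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy 0-1 data invented for illustration; columns are indexed 0..3.
rows = [
    (0, 0, 0, 1),
    (0, 0, 1, 0),
    (1, 1, 0, 0),
    (1, 1, 1, 1),
]
s = 1.5  # hypothetical threshold

h_low = entropy(rows, (0, 1))   # columns 0 and 1 always agree
h_high = entropy(rows, (2, 3))  # columns 2 and 3 take all four combinations
print(h_low < s, h_high < s)
```

On this toy data the set {0, 1} has entropy 1 bit and counts as low-entropy under the assumed threshold, while {2, 3} has entropy 2 bits and does not.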

7
Example
A dataset with lots of attributes
Low-entropy set X = {A, B, C}
High-entropy set Y = {D, E, F}
[Data table with columns A B C ... D E F ...; the 0/1 rows are not recoverable from the transcript]

8
Problem definition
  • Given a dataset and a threshold s
  • Find all sets X of attributes such that H(X) < s
  • As H(X) is monotone (adding attributes to X never
    decreases H(X)), levelwise search works
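
A levelwise search of this kind might look like the sketch below. It is an Apriori-style illustration written for this transcript, not the authors' code; the function names, the candidate-generation details, and the toy data are assumptions. Because entropy never decreases when attributes are added, a set can be discarded as soon as one of its subsets fails the threshold.

```python
from collections import Counter
from itertools import combinations
from math import log2

def entropy(rows, attrs):
    """Empirical entropy (in bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def low_entropy_sets(rows, s):
    """Return {attribute tuple: entropy} for every nonempty set X with H(X) < s."""
    n_attrs = len(rows[0])
    found = {}
    level = [(a,) for a in range(n_attrs)]  # start from single attributes
    while level:
        kept = {}
        for x in level:
            h = entropy(rows, x)
            if h < s:
                kept[x] = h
        found.update(kept)
        # Extend each kept set by a larger column index; keep a candidate only
        # if every subset one element smaller also passed the threshold.
        candidates = set()
        for x in kept:
            for a in range(x[-1] + 1, n_attrs):
                cand = x + (a,)
                if all(sub in kept for sub in combinations(cand, len(cand) - 1)):
                    candidates.add(cand)
        level = sorted(candidates)
    return found

# Toy data invented for illustration: column 1 copies column 0, column 2 is unrelated.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
result = low_entropy_sets(rows, 1.5)
print(sorted(result))  # [(0,), (0, 1), (1,), (2,)]
```

The candidate (0, 1, 2) is never evaluated here, since its subsets (0, 2) and (1, 2) already exceed the threshold.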

9
D-trees
  • A low-entropy set X is simple
  • A set can be simple in many ways; the connections
    between attributes can still be complex
  • D-trees and U-trees: low-entropy sets of
    restricted type
  • D-tree T: a Bayes net on a subset X of attributes
    such that the entropy H_T(X) of the data on X
    when viewed by T is low

10
Example of a D-tree
H_T(A, B, C) = H(A) + H(B | A) + H(C | A)

11
U-trees
  • As D-trees, but the direction of the links has
    been reversed

Example U-tree:
H_T(A, B, C) = H(B) + H(C) + H(A | B, C)

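
Both example formulas are instances of one sum: each node contributes its entropy conditioned on the nodes with edges pointing into it. The sketch below computes that sum on toy data. The representation (a dict from node to the tuple of nodes pointing into it), the function names, and the data are assumptions made for illustration only.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Empirical entropy (in bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def tree_entropy(rows, parents):
    """Sum over nodes v of H(v | parents[v]), using H(v | P) = H(v, P) - H(P)."""
    total = 0.0
    for v, pa in parents.items():
        total += entropy(rows, (v,) + tuple(pa)) - entropy(rows, tuple(pa))
    return total

# Columns: A = 0, B = 1, C = 2. Toy data invented here, with A = B XOR C.
rows = [(0, 0, 0), (1, 0, 1), (1, 1, 0), (0, 1, 1)]

d_tree = {0: (), 1: (0,), 2: (0,)}   # A -> B, A -> C: H(A) + H(B | A) + H(C | A)
u_tree = {1: (), 2: (), 0: (1, 2)}   # B -> A, C -> A: H(B) + H(C) + H(A | B, C)

h_d = tree_entropy(rows, d_tree)
h_u = tree_entropy(rows, u_tree)
print(h_d, h_u)
```

On this XOR data the U-tree scores 2 bits, equal to the joint entropy H(A, B, C), while the D-tree scores 3 bits, because neither B nor C alone says anything about A. In general H_T(X) is never smaller than H(X), so a low-entropy tree implies a low-entropy set.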
12
Problem definitions
  • Given data and a threshold s, find all D-trees T
    such that H_T(X) < s
  • Given data and a threshold s, find all U-trees T
    such that H_T(X) < s

13
Algorithm for finding low-entropy trees
  • First phase: generate all height-one trees with
    entropy below threshold
  • D-trees: pairwise entropies
  • U-trees: breadth-first search
  • Second phase:
  • combine pairs of trees, forming new trees
  • stop when all new trees have entropy above
    threshold
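
One plausible reading of "D-trees: pairwise entropies" is that a height-one D-tree with root r and children c1, ..., ck has entropy H(r) + H(c1 | r) + ... + H(ck | r), and each term H(c | r) = H(r, c) - H(r) needs only a pairwise entropy. The sketch below follows that reading; it is an interpretation of the slide, not the authors' algorithm, and all names and data are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(rows, attrs):
    """Empirical entropy (in bits) of the rows projected onto the columns in attrs."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def pairwise_tables(rows):
    """Return (single, cond): single[a] = H(a), cond[(r, c)] = H(c | r) for r != c."""
    n_attrs = len(rows[0])
    single = {a: entropy(rows, (a,)) for a in range(n_attrs)}
    cond = {}
    for r in range(n_attrs):
        for c in range(n_attrs):
            if r != c:
                cond[(r, c)] = entropy(rows, (r, c)) - single[r]
    return single, cond

def height_one_d_tree_entropy(single, cond, root, children):
    """Entropy of the D-tree with edges root -> child for each child."""
    return single[root] + sum(cond[(root, c)] for c in children)

# Toy data invented for illustration: column 1 copies column 0, column 2 is unrelated.
rows = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
single, cond = pairwise_tables(rows)
h_small = height_one_d_tree_entropy(single, cond, 0, (1,))
h_large = height_one_d_tree_entropy(single, cond, 0, (1, 2))
print(h_small, h_large)
```

Since every added child contributes a nonnegative term, the entropy of a height-one D-tree grows with its child set, so the same threshold pruning used for sets applies.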

14
Experimental results
  • Generated data:
  • The algorithms find the trees used to generate
    the data
  • Real data on:
  • Courses taken by students at the CS department of
    the University of Helsinki.
  • Terms used in a bibliography on theory and
    foundations of computer science.

15
Experimental results: Course data
  • 2405 observations (students) and 5021 attributes
    (columns) corresponding to courses.
  • Preprocessing: rare attributes removed.

16
Experimental results: Course data
Figure: the lowest-entropy 5-node U-tree (left) and
7-node D-tree (right) found in the course data

17
Related work
  • Lots of work on finding frequent sets and
    variations.
  • Siebes et al. (2006): selecting itemsets that
    compress the data.
  • Only all-1s items.
  • Knobbe and Ho (2006): maximally informative
    k-itemsets, itemsets with high entropy.
  • Fixed number k of elements.
  • Heikinheimo et al. (2006): local tree patterns of
    general and more specific items.
  • U-trees are a generalization of this idea.
  • Decision trees and Bayes networks:
  • Complete models for entire data
  • Here, lots of small trees for different subsets
    of attributes

18
Concluding remarks
  • We considered the problem of finding local
    structure in binary data in the form of
    low-entropy sets and trees.
  • The idea is a natural generalization of frequent
    sets and association rules.
  • We gave effective algorithms for finding both
    low-entropy sets and trees.
  • Experiments showed that the approach is feasible
    and can find interesting structure in data.

19
Algorithm: Low-Entropy U-trees
  • A U-tree T and a U-tree U of height 1 may be
    combined to produce a U-tree V, if the root of U
    is a leaf of T.
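
The combination rule can be sketched as follows. This is a hypothetical illustration: the dict representation (node mapped to the tuple of nodes with edges pointing into it), the function name, and the extra check that U's leaves are not already in T are assumptions, not details given on the slide.

```python
def combine_u_trees(t, u):
    """Attach height-one U-tree u below a leaf of U-tree t; return the new tree or None.

    Trees are dicts: node -> tuple of nodes with an edge pointing into that node.
    Leaves map to the empty tuple.
    """
    roots = [v for v, sources in u.items() if sources]
    if len(roots) != 1:
        raise ValueError("u must be a U-tree of height one")
    root = roots[0]
    # The slide's condition: the root of u must be a leaf of t.
    if root not in t or t[root]:
        return None
    # Assumption added here: u's leaves must be new nodes, so the result stays a tree.
    if (set(u) - {root}) & set(t):
        return None
    v = dict(t)
    v[root] = u[root]
    for leaf in u[root]:
        v[leaf] = ()
    return v

# Hypothetical node labels: t has root 0 with leaves 1 and 2; u has root 1 with leaves 3 and 4.
t = {0: (1, 2), 1: (), 2: ()}
u = {1: (3, 4), 3: (), 4: ()}
v = combine_u_trees(t, u)
print(v)
```

In a full search the combined tree V would then be kept only if its entropy stays below the threshold s.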