Title: Lattice Representation of Data
1. Lattice Representation of Data
- Dr. Alex Pogel
- Physical Science Laboratory
- New Mexico State University
2. Basic Idea
- Replace tabular representation by lattice representation in order to reveal hierarchical structure
- Basic definitions
- Information in the lattice
- Carving up epidemiological data
Ganter & Wille: Formal Concept Analysis (FCA)
Barwise & Seligman: Information Flow
3. Input data
- Base data structure is a {0,1}-table
- A set G of objects (represented by rows) and
- A set M of attributes (represented by columns)
- An entry of 1 indicates that object g has attribute m
[Figure: table with rows labeled G and columns labeled M]
4. Input data, mathematically
- Mathematically speaking:
- a binary relation I from G to M, i.e., a subset of G × M,
- interpreted as an indication of which objects g have which attributes m,
- via (g, m) ∈ I
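A minimal sketch of the table and of the relation I it encodes. The four animals and their attribute values are a hypothetical toy context chosen for illustration, not the data set used in the talk.

```python
# Hypothetical toy context: objects G (rows) and attributes M (columns).
G = ["sparrow", "dog", "trout", "dolphin"]
M = ["warm-blooded", "airbreather", "lives in water", "fourlegged"]

# 0/1 table: table[i][j] == 1 means object G[i] has attribute M[j].
table = [
    [1, 1, 0, 0],  # sparrow
    [1, 1, 0, 1],  # dog
    [0, 0, 1, 0],  # trout
    [1, 1, 1, 0],  # dolphin
]

def has_attribute(g, m):
    """Return True if object g has attribute m according to the table."""
    return table[G.index(g)][M.index(m)] == 1

# The same data as a binary relation I, a subset of G x M.
I = {(g, m) for g in G for m in M if has_attribute(g, m)}
```

The table and the set of pairs carry the same information; the set form is convenient for the mappings defined on the next slides.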
5. Key Definitions
- The notion of formal concept is based on natural mappings that arise from the binary relation I
- Interpret G and M as before
- To each subset H of G, we associate the set a(H) of all attributes that the objects in H satisfy in common
- a: P(G) → P(M)
- To each subset N of M, we associate the set o(N) of all objects satisfying every attribute in N
- o: P(M) → P(G)
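The two mappings can be sketched directly from their definitions. The relation I below is a hypothetical toy example (illustrative only), and the function names a and o simply mirror the notation of the slide.

```python
# Hypothetical toy relation I: pairs (object, attribute).
I = {
    ("sparrow", "warm-blooded"), ("sparrow", "airbreather"),
    ("dog", "warm-blooded"), ("dog", "airbreather"), ("dog", "fourlegged"),
    ("trout", "lives in water"),
    ("dolphin", "warm-blooded"), ("dolphin", "airbreather"),
    ("dolphin", "lives in water"),
}
G = {g for g, _ in I}  # objects (every toy object has at least one attribute)
M = {m for _, m in I}  # attributes

def a(H):
    """Attributes shared by every object in H, i.e. a: P(G) -> P(M)."""
    return {m for m in M if all((g, m) in I for g in H)}

def o(N):
    """Objects having every attribute in N, i.e. o: P(M) -> P(G)."""
    return {g for g in G if all((g, m) in I for m in N)}
```

For instance, a({"sparrow", "dolphin"}) collects the attributes both animals share, and a(set()) returns all of M, since the condition is vacuous for an empty set of objects.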
6. Key Definitions
- The attribute subsets N of M such that a(o(N)) = N
- are called formal concepts in FCA
- and are called closed sets in mathematics, as a(o(·)) is a closure operator on M
- A formal concept can be identified geometrically within a data table by reshuffling rows and columns such that
- object-attribute relations are maintained and
- a maximal rectangle of 1s appears.
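A brute-force sketch of finding all closed attribute sets: apply a(o(·)) to every subset of M and keep the distinct results. The toy relation is hypothetical, and this exhaustive enumeration is only an illustration for small contexts, not the algorithm used by any particular FCA tool.

```python
from itertools import combinations

# Hypothetical toy relation I: pairs (object, attribute).
I = {
    ("sparrow", "warm-blooded"), ("sparrow", "airbreather"),
    ("dog", "warm-blooded"), ("dog", "airbreather"), ("dog", "fourlegged"),
    ("trout", "lives in water"),
    ("dolphin", "warm-blooded"), ("dolphin", "airbreather"),
    ("dolphin", "lives in water"),
}
G = sorted({g for g, _ in I})
M = sorted({m for _, m in I})

def a(H):
    """Attributes common to all objects in H."""
    return frozenset(m for m in M if all((g, m) in I for g in H))

def o(N):
    """Objects satisfying every attribute in N."""
    return frozenset(g for g in G if all((g, m) in I for m in N))

def closure(N):
    """The closure operator a(o(N)) on attribute sets."""
    return a(o(N))

def closed_sets():
    """All N with a(o(N)) == N, found by closing every subset of M."""
    return {closure(N) for k in range(len(M) + 1) for N in combinations(M, k)}
```

Each closed set N, paired with its objects o(N), is a rectangle of 1s in the table: every object in o(N) has every attribute in N, and no further row or column can be added without introducing a 0.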
7. Animal Context
8. Shuffling Reveals a Concept
9. BIRD is the (formal) concept
10. Closure System Arises
- Taking all closed sets together we obtain a closure system
- aka a topped intersection structure, in Davey-Priestley
- which is always a complete lattice: an ordered set for which every subset has both a supremum and an infimum in the set
- Examples:
- the extended reals R ∪ {−∞, +∞} with ≤ (R itself is complete only after adding these bounds),
- P(S) with inclusion,
- any topology with inclusion
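A small sketch of why a closure system is a complete lattice: the infimum is the intersection, and the supremum is the smallest member of the family containing the union. The family below is a hypothetical example (six closed attribute sets of a toy context), and the helper names are illustrative.

```python
from itertools import combinations

# Hypothetical closure system on M: a family of subsets containing M itself.
M = frozenset({"warm-blooded", "airbreather", "fourlegged", "lives in water"})
family = {
    frozenset(),
    frozenset({"warm-blooded", "airbreather"}),
    frozenset({"warm-blooded", "airbreather", "fourlegged"}),
    frozenset({"lives in water"}),
    frozenset({"warm-blooded", "airbreather", "lives in water"}),
    M,
}

def is_closure_system(family, top):
    """Topped intersection structure: contains top, closed under intersection.

    For a finite family, closure under pairwise intersection is enough.
    """
    return top in family and all(x & y in family for x, y in combinations(family, 2))

def infimum(sets):
    """Meet: the intersection of the given sets (M for an empty collection)."""
    result = M
    for s in sets:
        result = result & s
    return result

def supremum(sets):
    """Join: the smallest member of the family containing every given set."""
    union = frozenset().union(*sets)
    return infimum(c for c in family if union <= c)
```

Note that the supremum is generally larger than the plain union: joining {fourlegged, ...} with {lives in water} in this toy family gives all of M, because no smaller closed set contains both.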
11. Focus on attribute logic
12. Full list difficult, redundant
- All implications that hold for the data with up to three attributes in their premise: 125 with positive support
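A sketch of how such a full list could be produced by brute force: for each premise of up to three attributes, the conclusion is whatever the closure adds, and the support is the number of objects satisfying the premise. The toy relation is hypothetical, support is assumed here to be an object count, and this naive enumeration yields the redundant full list, not a basis.

```python
from itertools import combinations

# Hypothetical toy relation I: pairs (object, attribute).
I = {
    ("sparrow", "warm-blooded"), ("sparrow", "airbreather"),
    ("dog", "warm-blooded"), ("dog", "airbreather"), ("dog", "fourlegged"),
    ("trout", "lives in water"),
    ("dolphin", "warm-blooded"), ("dolphin", "airbreather"),
    ("dolphin", "lives in water"),
}
G = sorted({g for g, _ in I})
M = sorted({m for _, m in I})

def a(H):
    """Attributes common to all objects in H."""
    return frozenset(m for m in M if all((g, m) in I for g in H))

def o(N):
    """Objects satisfying every attribute in N."""
    return frozenset(g for g in G if all((g, m) in I for m in N))

def implications(max_premise=3):
    """List (premise, conclusion, support) with non-empty conclusion and support > 0."""
    found = []
    for k in range(1, max_premise + 1):
        for premise in combinations(M, k):
            extent = o(premise)
            conclusion = a(extent) - frozenset(premise)
            if extent and conclusion:
                found.append((frozenset(premise), conclusion, len(extent)))
    # Highest support first.
    return sorted(found, key=lambda rule: -rule[2])
```

Even this four-attribute toy context produces several implications that follow from one another, which is the redundancy the slide title refers to.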
13. Duquenne-Guigues Basis
- 20 implications generate the full list and serve as a basis (by analogy with linear algebra), ordered by support value
14. Full list, basis, and original data
15. Implication Reads Upwards
- At top right: warm-blooded implies airbreather
- First in the basis; high support is indicated in lime green
16. A Subinterval of the lattice
- fourlegged implies airbreather
- pet implies warm-blooded (iguana?)
- and
- fur implies fourlegged and warm-blooded (platypus?)
17-18. Original data preserved
- Animals 26 and 27 share the attributes
- lives in water, is warm-blooded and is an airbreather
19. Color-coded support
- The similarity in color between livestock and the concept node below it yields the association rule
- livestock implies fur
- with 79% confidence
- and 11 support (bottom)
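A sketch of how support and confidence of a rule such as "livestock implies fur" can be computed. The rows are hypothetical toy data (not the animals from the talk), and the definitions used (support as the fraction of objects satisfying premise and conclusion, confidence as the share of premise objects that also satisfy the conclusion) are the common ones and are assumed here.

```python
# Hypothetical toy data: each row is the attribute set of one object.
rows = [
    {"livestock", "fur"},
    {"livestock", "fur"},
    {"livestock", "fur"},
    {"livestock"},
    {"fur"},
    {"fur", "pet"},
    {"pet"},
    set(),
]

def count(rows, attrs):
    """Number of rows containing every attribute in attrs."""
    return sum(1 for row in rows if set(attrs) <= row)

def support(rows, premise, conclusion):
    """Fraction of all rows satisfying both premise and conclusion."""
    return count(rows, set(premise) | set(conclusion)) / len(rows)

def confidence(rows, premise, conclusion):
    """Among rows satisfying the premise, the fraction also satisfying the conclusion."""
    return count(rows, set(premise) | set(conclusion)) / count(rows, premise)

rule_support = support(rows, {"livestock"}, {"fur"})        # 3 of 8 rows
rule_confidence = confidence(rows, {"livestock"}, {"fur"})  # 3 of 4 livestock rows
```

An implication in the strict sense is the special case with confidence 1; association rules relax this to a threshold below 1.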
20. Visual Vocabulary
- Small subdiagrams (specifically, meet-subsemilattices) can be recognized as complex sentences
21. 3 unordered attribute concepts
[Diagram: attribute concepts labeled a, b, c]
Note: the top element is really irrelevant, but adding it makes everything we'll look at a lattice instead of just a meet semilattice (definition: an ordered structure closed under finite meet (glb)).
22. Here's the best known outcome
No non-trivial implications
[Diagram: attribute concepts labeled a, b, c]
23. W over V: a ∧ c → b
[Diagram: attribute concepts labeled a, b, c]
24. Diamond in diamond
Under condition c, a and b are equivalent
[Diagram: attribute concepts labeled a, b, c]
25. Convergence
Any two imply the third
[Diagram: attribute concepts labeled a, b, c]
26. Two Complex Sentences
- So, we can read that
- for nocturnal animals and pets, the attributes fourlegged and warm-blooded are equivalent,
- and
- the only implication between the attributes nocturnal, fur and pet is: pet and nocturnal implies fur.
27. The Hague, Netherlands
28. Before Freese improvement
29. After Freese improvement
30. Apparent Splits
31. Eliminating Light Smokers
32. Why no object names?
33. Lung Cancer and Smoking
- Nearly half of these 30-year smokers have lung cancer
34. Bird-keeping and Smoking
- Association rules involving bird-keeping and smoking
35. Limitations as KDD Process
- Needs attention given to data preparation
- Needs more built-in verification of discovered rules
- No domain-specific constructions (advantage?)
- Does not scale without clustering (universal?)
36. Epidemiological functions
- Plan to add odds ratio calculation, via click
OR = 3.9
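A sketch of the odds ratio calculation for an exposure attribute and an outcome attribute, using the standard 2×2 formula OR = (a·d)/(b·c). The rows and counts are hypothetical and are not meant to reproduce the value shown on the slide; the attribute names and the function name are illustrative assumptions.

```python
# Hypothetical toy data: each row is the attribute set of one patient.
rows = (
    [{"smoker", "lung cancer"}] * 30   # exposed cases
    + [{"smoker"}] * 10                # exposed non-cases
    + [{"lung cancer"}] * 10           # unexposed cases
    + [set()] * 30                     # unexposed non-cases
)

def odds_ratio(rows, exposure, outcome):
    """Odds ratio of outcome for exposed versus unexposed rows."""
    a = sum(1 for r in rows if exposure in r and outcome in r)
    b = sum(1 for r in rows if exposure in r and outcome not in r)
    c = sum(1 for r in rows if exposure not in r and outcome in r)
    d = sum(1 for r in rows if exposure not in r and outcome not in r)
    if b == 0 or c == 0:
        raise ValueError("odds ratio undefined: a zero cell in the denominator")
    return (a * d) / (b * c)

toy_or = odds_ratio(rows, "smoker", "lung cancer")  # (30 * 30) / (10 * 10)
```

The four counts are exactly the sizes of four object sets that can be read off the lattice, which is presumably what makes a click-based calculation feasible.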
37. Clustering for too large lattices
38. Support for improvement
- Traditional diagram improvement algorithms are based solely upon the order structure
- We are now moving towards the inclusion of support values in these algorithms
- I will talk about this topic in detail in July, here at DIMACS, as part of the Applications of Lattice Theory workshop
END