Title: Concept Extraction based on Optimal Clique Searches
1Concept Extraction based on Optimal Clique
Searches
1st Franco-Japanese Workshop on Information
Search, Integration and Personalization, June 30
to July 2, 2003, at Hokudai VBL
Makoto HARAGUCHI and Yoshiaki OKUBO
{makoto, yoshiaki}_at_db-ei.eng.hokudai.ac.jp
http://mhjcc3-ei.eng.hokudai.ac.jp/index-e.html
- An English paper is now being submitted to an international conference.
- Abstract in English: http://mhjcc3-ei.eng.hokudai.ac.jp/web/makoto/hp/pdfs/ida03.ps
- Japanese report: "Concept Learning based on Optimal Clique Searches", JSAI SIG report SIG-FAI-A202-11 (12/19), 63-66, 2002.
2Data Abstraction, an information theoretical
extension of attribute-oriented induction
[Figure: Large DB -> Generalization -> Generalized DB -> Classification algorithm -> rules. A set of abstractions, called data abstractions, is selected from a machine-readable dictionary (MRD) and controls the generalization.]
MRD: used as a generator of possible data abstractions for categorical attributes.
The effect of data abstraction:
- Improvement of readability and efficiency
- Reduction of the number of output rules
3MRD as a generator of candidate clusters of
attribute values
A cluster of categorical values dominated by a term
is a candidate.
Domain of a categorical attribute
Possible partition of a1, a2, a3, a5, a6, ...,
where each cell is a candidate generated by MRD.
Select the partition that minimizes the conditional entropy
after the abstraction.
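The selection criterion above can be illustrated with a minimal sketch. It assumes the data is a list of (attribute value, class label) pairs and that a candidate partition is given as a mapping from attribute value to cell name; the function name `conditional_entropy` and this data layout are illustrative assumptions, not taken from the slides.

```python
from collections import Counter
from math import log2

def conditional_entropy(rows, partition):
    """H(class | cell) for rows of (attribute_value, class_label) pairs.

    partition maps each attribute value to the name of its cell
    (hypothetical representation chosen for this sketch).
    """
    cell_counts = Counter()
    joint_counts = Counter()
    for value, label in rows:
        cell = partition[value]
        cell_counts[cell] += 1
        joint_counts[(cell, label)] += 1
    n = len(rows)
    h = 0.0
    for (cell, label), k in joint_counts.items():
        # -P(cell, label) * log2 P(label | cell)
        h -= (k / n) * log2(k / cell_counts[cell])
    return h

rows = [("a1", "x"), ("a1", "x"), ("a2", "y"), ("a2", "y")]
fine = {"a1": "A", "a2": "B"}    # each value in its own cell
coarse = {"a1": "A", "a2": "A"}  # both values abstracted into one cell
print(conditional_entropy(rows, fine), conditional_entropy(rows, coarse))
```

Among candidate partitions, the one with the smallest returned value would be selected; here the coarse abstraction loses all class information, so it scores worse than the fine one.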
4Entropy Minimization Principle
IBM DM group: the optimal binary partition (1997)
Prior distribution
- The attribute is numerical or boolean.
- They provide a very efficient way to compute the optimal binary partition.
An extension
K. Takahata: optimal clustering of class
distributions; considers an optimal
m-partition consisting of m cells, given a
parameter m.
5The optimal partition is not always a good one
[Figure: class distributions on the line x + y = 1, e.g. (0.1, 0.9), with probability labels 0.32, 0.04, 0.32, 0.32.]
The optimal 3-partition: optimality forms a cluster of
the closer distributions.
A cluster of major distributions,
whose probability, 0.64, is sufficiently large
compared with the probability of the minor
distribution, 0.04.
An admissible cluster (whose entropy is not so
high) of major distributions and some minor
distributions as well.
6Empirical condition for data abstraction to work
well
A metric for class distributions
Core: many distributions close to each other.
Exceptions: some distributions whose probability is relatively
low compared with the core, or that are relatively close to the
core of the cluster.
Information loss by adding the exceptions is relatively
small. (Entropy does not increase so much.)
7Entropy as a Constraint
- Without a dictionary.
- Never consider a partition with optimal entropy.
- Attain a maximal probability (support) under an
entropy condition, to extract a major, reasonable
concept of attribute values w.r.t. a given target
class.
Maximize P(g), where H(g) ≤ d
H(g): entropy of cluster g
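To make the objective concrete, here is a small brute-force sketch that enumerates every cluster of attribute values and keeps the one with the greatest probability whose merged class distribution has entropy at most d. The names `best_cluster`, `prob` and `post`, and the example numbers, are assumptions made for illustration; exhaustive enumeration is only feasible for tiny domains and is not the search method of the talk.

```python
from itertools import combinations
from math import log2

def entropy(dist):
    """Shannon entropy (base 2) of a class distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)

def merge(values, prob, post):
    """Probability of a cluster and its class distribution.

    prob[a] is P(a); post[a] is the class distribution given a.
    """
    total = sum(prob[a] for a in values)
    k = len(post[values[0]])
    dist = [sum(prob[a] * post[a][c] for a in values) / total for c in range(k)]
    return total, dist

def best_cluster(prob, post, d):
    """Exhaustively find the cluster g maximizing P(g) subject to H(g) <= d."""
    best, best_p = (), 0.0
    names = sorted(prob)
    for r in range(1, len(names) + 1):
        for g in combinations(names, r):
            p, dist = merge(g, prob, post)
            if entropy(dist) <= d and p > best_p:
                best, best_p = g, p
    return set(best), best_p

# Hypothetical numbers: three attribute values, two classes.
prob = {"a1": 0.4, "a2": 0.4, "a3": 0.2}
post = {"a1": (0.9, 0.1), "a2": (0.8, 0.2), "a3": (0.1, 0.9)}
print(best_cluster(prob, post, d=0.7))
```

With these numbers a2 alone violates the bound, yet the cluster {a1, a2} satisfies it and has the greatest probability, which already hints at why a naive bottom-up construction can fail.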
8Basic Defs
- Schema and its instances
- Target class; attribute domain
- Cluster
- Atomic distribution: the posterior class
distribution, given an attribute value
- Posterior class distribution, given a cluster of
attribute values
- Distribution obtained by merging two
distributions: a linear combination
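The last definition can be written out as follows. The symbols P(a), P(C | a), g and λ are notation assumed here for illustration, and g1 and g2 are assumed to be disjoint clusters.

```latex
P(C \mid g) = \frac{\sum_{a \in g} P(a)\, P(C \mid a)}{\sum_{a \in g} P(a)},
\qquad
P(C \mid g_1 \cup g_2) = \lambda\, P(C \mid g_1) + (1 - \lambda)\, P(C \mid g_2),
\quad
\lambda = \frac{P(g_1)}{P(g_1) + P(g_2)}
```

Since 0 ≤ λ ≤ 1, the merged distribution always lies on the segment between the two original distributions.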
9Entropy is not monotone w.r.t. the addition of
a new distribution
- The entropy H is concave.
- The entropy of a smaller set of values is not
necessarily within a given bound d, even when a
larger set containing it is within d.
[Figure: entropy over the distributions (p,1-p), from (1,0) to (0,1); labels: d = log 2 = 1, Zd (the upper bound), M12; {a1,a2}: NG; {a1,a2,a3}: OK.]
No bottom-up construction of a solution is possible
in general. So, we introduce a new distance
notion that makes the bottom-up construction
possible.
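A small worked example of this non-monotonicity, with numbers chosen purely for illustration (two classes, bound d = 0.9, and hypothetical values a1, a2, a3):

```latex
P(a_1) = P(a_2) = 0.1,\quad P(a_3) = 0.8,\qquad
P(C \mid a_1) = (1, 0),\quad P(C \mid a_2) = (0, 1),\quad P(C \mid a_3) = (1, 0)

P(C \mid \{a_1, a_2\}) = (0.5,\, 0.5)
\;\Rightarrow\; H = 1 > 0.9 \quad \text{(NG)}

P(C \mid \{a_1, a_2, a_3\}) = (0.9,\, 0.1)
\;\Rightarrow\; H = -0.9 \log_2 0.9 - 0.1 \log_2 0.1 \approx 0.469 \le 0.9 \quad \text{(OK)}
```

The pair {a1, a2} violates the bound although the larger set {a1, a2, a3} satisfies it, so a search that discards every violating subset would never reach the admissible cluster.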
10Distance Notion
Basic strategy: collect two or more distributions
that are close to each other and whose integrated
distribution has entropy less than d.
The convex hull of close distributions and the red
region should be separated. In order to guarantee
this condition, we introduce the following definition.
p1 and p2 are close ⇔ either A(p1,p2) or A(p2,p1)
A(pi,pj) ⇔ consider the perpendicular dropped from pj to
the region d ≤ H(q), with foot N; pi and pj are on the same
side of the tangent at N.
[Figure: points p1 and p2, the foot N, the tangent line at N, and the region d ≤ H(q).]
11Separation by Tangent Line(Hyper Plane)
G: a set of points (distributions) that satisfy
the entropy constraint and are close to each other.
The red region and the convex hull of G can be separated by
the tangent line.
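Why such a separation is enough can be seen from the concavity of entropy. The derivation below is a sketch under the assumptions that N has strictly positive components, that H(N) = d, and that every point of G lies on the side of the tangent hyperplane at N away from the high-entropy region; the notation is introduced here for illustration.

```latex
% Concavity of H gives the first-order bound
H(q) \le H(N) + \nabla H(N) \cdot (q - N) \quad \text{for all distributions } q .

% Assume H(N) = d and \nabla H(N) \cdot (p - N) \le 0 for every p \in G.
% For any convex combination q = \sum_i \lambda_i p_i,\ \lambda_i \ge 0,\ \sum_i \lambda_i = 1,\ p_i \in G:
\nabla H(N) \cdot (q - N) = \sum_i \lambda_i \, \nabla H(N) \cdot (p_i - N) \le 0
\;\Rightarrow\; H(q) \le d .
```

Because a merged distribution is a convex combination of the distributions it merges, every cluster built from G then satisfies the entropy constraint.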
12Definition of Graph
- A set of vertices: an attribute domain (their
corresponding distributions)
- Edges: there is no edge between p and q ⇔
p and q satisfy the entropy constraint and are
not close
Complete graph G1 outside of the red zone
Red region: the region of q s.t. d ≤ H(q)
Some G1 with its exceptions G2 s.t.
their probability is greatest
13Branch-and-Bound Search (depth-first)
- R (candidate node set)
- Q (node of the search tree)
- Inequality for branch-and-bound search
- Tighter condition
- Addition of a candidate node q: Q ∪ {q}
- Approximation of the minimal chromatic number under
some order of nodes
- Refined inequality for branch-and-bound search
[Figure labels: C1, C2.]
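As an illustration of this kind of search, here is a minimal depth-first branch-and-bound sketch for a maximum-weight clique. It assumes the standard coloring bound: a greedy coloring of the candidate set R gives independent classes, a clique can take at most one node per class, so the sum of the heaviest node of each class bounds what R can still add. This is a common textbook bound and is not claimed to be the exact inequality of the slides; `greedy_coloring`, `max_weight_clique`, `adj` and `weight` are hypothetical names.

```python
def greedy_coloring(nodes, adj):
    """Greedily partition nodes into independent sets (color classes)."""
    classes = []
    for v in nodes:
        for cls in classes:
            if all(u not in adj[v] for u in cls):
                cls.append(v)
                break
        else:
            classes.append([v])
    return classes

def max_weight_clique(adj, weight):
    """Depth-first branch-and-bound over cliques; adj maps node -> set of neighbors."""
    best = {"w": 0.0, "clique": []}

    def bound(R):
        # A clique uses at most one node from each color class.
        return sum(max(weight[v] for v in cls) for cls in greedy_coloring(R, adj))

    def expand(Q, w_q, R):
        if w_q > best["w"]:
            best["w"], best["clique"] = w_q, list(Q)
        if not R or w_q + bound(R) <= best["w"]:
            return  # nothing in R can improve on the best clique found so far
        for i, q in enumerate(R):
            # Keep only later candidates adjacent to q, so Q + [q] stays a clique.
            new_r = [v for v in R[i + 1:] if v in adj[q]]
            expand(Q + [q], w_q + weight[q], new_r)

    order = sorted(adj, key=lambda v: -weight[v])
    expand([], 0.0, order)
    return best["clique"], best["w"]

adj = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
weight = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.55}
print(max_weight_clique(adj, weight))
```

In the setting of the talk the weights would be the probabilities of attribute values, so the heaviest clique corresponds to the cluster with the greatest probability.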
14Entropy-based Pruning
- The order of distributions
- Entropy-based pruning:
d ≤ H(Q ∪ {q}), where q ∈ R ⇒ never
generate the successor node Q ∪ {q}
The safety:
if [...] is a clique, and [...],
then, for any j, [...].
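The pruning rule can be sketched as a filter on the candidate set. The sketch assumes the current node Q is summarized by its per-class joint probability sums, so that adding a candidate q is a component-wise addition; a candidate is kept when the merged entropy does not exceed d (the treatment of the boundary case is an assumption). The names `entropy_of_sums` and `admissible_successors` are hypothetical.

```python
from math import log2

def entropy_of_sums(sums):
    """Entropy of the distribution obtained by normalizing non-negative sums."""
    total = sum(sums)
    return -sum((s / total) * log2(s / total) for s in sums if s > 0)

def admissible_successors(q_sums, candidates, joint, d):
    """Return the candidates q for which H(Q ∪ {q}) <= d.

    q_sums[c]   : sum over a in Q of P(a, class c)
    joint[q][c] : P(q, class c) for candidate q
    """
    kept = []
    for q in candidates:
        merged = [s + j for s, j in zip(q_sums, joint[q])]
        if entropy_of_sums(merged) <= d:
            kept.append(q)  # successor Q ∪ {q} may be generated
    return kept

# Q = {a1} with P(a1) = 0.4 and class distribution (0.9, 0.1).
q_sums = [0.36, 0.04]
joint = {"a2": [0.32, 0.08], "a3": [0.02, 0.18]}
print(admissible_successors(q_sums, ["a2", "a3"], joint, d=0.7))
```

Keeping running sums rather than normalized distributions makes each check cost only as much as the number of classes.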
15Experimentation (not yet enough!)
- Census DB of 199523 tuples with 42 attributes
- Target class: marital status (never married,
married, divorced), with a single condition
attribute, country of birth (self), of 42 values.
H ≤ d = 0.9
- Branch-and-bound search is successful for graphs
of 1000 nodes >> 42.
- The clusters formed are very similar to those computed
by NN given an allowable distance, but NN might
miss the greatest cluster.
- In addition, objects near the boundary, which
may be added for a lower threshold, are calculated as
exceptions.
[Figure: clusters for H < d; labels: d,95; Hungary, Yugoslavia; Italy, Greece; d = 1.]
Cluster g of <A=a, B=b, C=c> s.t. H(g) ≤ d
- Clique search is much more meaningful when
applied to combined conditions of more than two
attributes whose domain has about 1000-2000
vectors.
Greatest probability.
16For two or more attributes
- Condition attributes: Education (16 values) and
Gender (2 values)
- Target attribute: Workclass (8 classes)
- Cluster with the greatest probability
The clique found:
{Preschool, 1st-6th, 9th, 10th, 11th, 12th,
Prof-school} x {male, female}
- It might be interpreted as Primary Education.
10th, 11th, 12th and Prof-school are exceptions.
- It tells us that the Gender attribute is
negligible (abstractable), given Education, for
Workclass.
- 7th and 8th are excluded. This might be due to the
fact that the DB is relatively small.
17End