Title: Data Discretization Unification
1. Data Discretization Unification
- Ruoming Jin
- Yuri Breitbart
- Chibuike Muoh
- Kent State University, Kent, USA
2. Outline
- Motivation
- Problem Statement
- Prior Art and Our Contributions
- Goodness Function Definition
- Unification
- Parameterized Discretization
- Discretization Algorithm and Its Performance
3. Motivation
Patients Table

Age   Success   Failure   Total
18         10         1      19
25          5         2       7
41        100         5     105
...       ...       ...     ...
51        250        10     260
52        360         5     365
53        249        10     259
...       ...       ...     ...
4. Motivation
- Possible implication of the table:
- If a person is between 18 and 25, the probability of procedure success is much higher than if the person is between 45 and 55.
- Is that a good rule, or is this one better: if a person is between 18 and 30, the probability of procedure success is much higher than if the person is between 46 and 61?
- What is the best interval?
5. Motivation
- Without data discretization, some rules would be difficult to establish.
- Several existing data mining systems cannot handle continuous variables without discretization.
- Data discretization significantly improves the quality of the discovered knowledge.
- New methods of discretization are needed for tables with rare events.
- Data discretization significantly improves the performance of data mining algorithms; some studies report a ten-fold increase in performance.
- However, any discretization process generally leads to a loss of information. Minimizing such a loss is the mark of a good discretization method.
6. Problem Statement
Given an input table:

Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           r11       r12       ...   r1J       N1
S2           r21       r22       ...   r2J       N2
...          ...       ...       ...   ...       ...
SI           rI1       rI2       ...   rIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
7. Problem Statement
Obtain an output table:

Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           C11       C12       ...   C1J       N1
S2           C21       C22       ...   C2J       N2
...          ...       ...       ...   ...       ...
SI           CI1       CI2       ...   CIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
where each Si is a union of k consecutive input intervals. The quality of the discretization is measured by

cost(model) = cost(data | model) + penalty(model)
8. Prior Art
- Unsupervised discretization: no class information is provided
- Equal-width
- Equal-frequency
- Supervised discretization: class information is provided with each attribute value
- MDLP
- Methods based on Pearson's X2 or Wilks' G2 statistic
- Dougherty and Kohavi (1995) compare the unsupervised and supervised methods of Holte (1993) and the entropy-based methods of Fayyad and Irani (1993), and conclude that supervised methods give fewer classification errors than unsupervised ones, and that supervised methods based on entropy are better than the other supervised methods.
9. Prior Art
- Several recent (2003-2006) papers introduced new discretization algorithms: Yang and Webb, Kurgan and Cios (CAIM), Boulle (Khiops).
- CAIM attempts to minimize the number of discretization intervals and, at the same time, to minimize the information loss.
- Khiops uses Pearson's X2 statistic to select mergings of consecutive intervals that minimize the value of X2.
- Yang and Webb studied discretization for naïve Bayesian classifiers. They report that their method produces fewer classification errors than the alternative discretization methods that have appeared in the literature.
10. Our Results
- There is a strong connection between discretization methods based on statistics and those based on entropy.
- There is a parametric function such that any prior discretization method is derivable from this function by choosing at most two parameters.
- There is an optimal dynamic programming method, derived from our discretization approach, that mostly outperforms prior discretization methods in the experiments we conducted.
11. Goodness Function Definition (Preliminaries)

Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           C11       C12       ...   C1J       N1
S2           C21       C22       ...   C2J       N2
...          ...       ...       ...   ...       ...
SI           CI1       CI2       ...   CIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
12. Goodness Function Definition (Preliminaries)
Entropy of the i-th row of the contingency table:
  H(Si) = - Σ_j (Cij / Ni) log(Cij / Ni)
Total entropy of all intervals (entropy of the contingency table):
  H(S1, S2, ..., SI) = Σ_i (Ni / N) H(Si)
- Binary encoding of a row requires H(Si) binary characters per entry.
- Binary encoding of the set of rows requires H(S1, S2, ..., SI) binary characters per entry.
- Binary encoding of the table requires SL binary characters, so that
  N * H(S1, S2, ..., SI) = SL
(A small numeric sketch of this bookkeeping follows below.)
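As a concrete illustration, here is a minimal Python sketch (the toy counts and helper names such as row_entropy are mine, not from the deck) that computes the row entropies H(Si), the total encoding length SL = Σ_i Ni * H(Si), and checks that N * H(S1, ..., SI) equals SL.

```python
import numpy as np

def row_entropy(row):
    """Shannon entropy (base 2) of one row of counts, H(S_i)."""
    n = row.sum()
    p = row[row > 0] / n              # drop empty cells; 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

# toy contingency table: rows = intervals S_i, columns = classes
table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

N_i = table.sum(axis=1)                                    # row sums N_i
N = N_i.sum()
SL = sum(n * row_entropy(r) for r, n in zip(table, N_i))   # total encoding length
H_all = sum((n / N) * row_entropy(r) for r, n in zip(table, N_i))  # H(S_1,...,S_I)

print(f"SL = {SL:.2f} bits,  N * H(S_1,...,S_I) = {N * H_all:.2f} bits")  # equal
```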
13. Goodness Function Definition
- Cost(Model, T) = Cost(T | Model) + Penalty(Model)   (Mannila et al.)
- Cost(T | Model) is the complexity of encoding the table in the given model.
- Penalty(Model) reflects the complexity of the resulting table.
- GF_Model(T) = Cost(Model, T0) - Cost(Model, T)
14. Goodness Function Definition: Models To Be Considered
- MDLP (Information Theory)
- Statistical Model Selection
- Confidence Level of Row Independence
- Gini Index
15. Goodness Function Examples
- Entropy
- Statistical Akaike (AIC)
- Statistical Bayesian (BIC)
16. Goodness Function Examples (Cont.)
- Confidence-level-based goodness functions
- Pearson's X2 statistic:
  X2 = Σ_i Σ_j (Cij - Eij)^2 / Eij, where Eij = Ni * Mj / N
- Wilks' G2 statistic:
  G2 = 2 Σ_i Σ_j Cij log(Cij / Eij)
- The table's degrees of freedom: df = (I - 1)(J - 1)
- It is known in statistics that, asymptotically, both Pearson's X2 and Wilks' G2 statistics have a chi-square distribution with df degrees of freedom. (Both statistics are computed in the sketch below.)
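A small sketch of both statistics on a toy table; the counts and helper names are illustrative only. It computes X2, G2, df = (I-1)(J-1), and the chi-square tail probabilities.

```python
import numpy as np
from scipy.stats import chi2

table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

N = table.sum()
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N    # expected counts E_ij = N_i * M_j / N
df = (table.shape[0] - 1) * (table.shape[1] - 1)

X2 = ((table - E) ** 2 / E).sum()                          # Pearson's X^2
mask = table > 0                                           # 0 * log(0) = 0
G2 = 2.0 * (table[mask] * np.log(table[mask] / E[mask])).sum()  # Wilks' G^2

print(f"X2 = {X2:.2f}, G2 = {G2:.2f}, df = {df}")
print(f"tail probabilities: {chi2.sf(X2, df):.3g} (X2), {chi2.sf(G2, df):.3g} (G2)")
```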
17. Unification
- The following relationship holds between G2 and the goodness functions for MDLP, AIC, and BIC:
  G2 / 2 = N * H(S1 U ... U SI) - SL
- Thus, the goodness functions for MDLP, AIC, and BIC can be rewritten accordingly. (A numeric check of the identity follows below.)
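A quick numeric check of the identity above on a toy table (variable names are mine); entropies are taken with the natural logarithm so that both sides are in the same units as G2.

```python
import numpy as np

def H(counts):
    """Natural-log entropy of a count vector."""
    n = counts.sum()
    p = counts[counts > 0] / n
    return float(-(p * np.log(p)).sum())

table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

N = table.sum()
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
m = table > 0
G2 = 2.0 * (table[m] * np.log(table[m] / E[m])).sum()

SL = sum(row.sum() * H(row) for row in table)   # sum_i N_i * H(S_i)
H_union = H(table.sum(axis=0))                  # entropy of the merged row S1 U ... U SI

print(f"G2 / 2            = {G2 / 2:.4f}")
print(f"N*H(union) - SL   = {N * H_union - SL:.4f}")   # should match G2 / 2
```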
18. Unification
- The normal deviate is the difference between the variable and its mean, divided by the standard deviation.
- Consider t, a chi-square distributed random variable, and let u(t) be the normal deviate such that
  P(chi2_df > t) = P(N(0,1) > u(t))
- For this normal deviate function u(t), the following theorem holds (see Wallace 1959, 1960):
- For all t > df, all df > 0.37, and with w(t) = (t - df - df * log(t / df))^(1/2),
  0 < w(t) <= u(t) <= w(t) + 0.6 * df^(-1/2)
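A small numeric check of the Wallace bound just quoted; the function names u and w and the test values of t and df are illustrative.

```python
import numpy as np
from scipy.stats import chi2, norm

def u(t, df):
    """Normal deviate u(t): P(chi2_df > t) = P(N(0,1) > u(t))."""
    return norm.isf(chi2.sf(t, df))

def w(t, df):
    """Wallace's lower bound w(t) = sqrt(t - df - df*log(t/df)), valid for t > df."""
    return np.sqrt(t - df - df * np.log(t / df))

df = 10
for t in [15.0, 25.0, 50.0, 100.0]:
    lo, mid, hi = w(t, df), u(t, df), w(t, df) + 0.6 / np.sqrt(df)
    print(f"t={t:6.1f}:  w(t)={lo:6.3f}  <=  u(t)={mid:6.3f}  <=  w(t)+0.6/sqrt(df)={hi:6.3f}")
```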
19. Unification
- From this theorem it follows that, as df goes to infinity and w(t) >> 0,
  u(t) / w(t) -> 1
- Finally, u^2(t) ≈ w^2(t) = t - df - df * log(t / df), and therefore
  GF_G2(T) = u^2(G2) = G2 - df * (1 + log(G2 / df))
- The goodness function GF_X2(T) is asymptotically the same.
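A sketch of evaluating GF_G2(T) = G2 - df * (1 + log(G2 / df)) on a toy table, assuming G2 > df; the counts and helper names are illustrative.

```python
import numpy as np

def g2_and_df(table):
    """Wilks' G^2 (natural log) and df = (I-1)(J-1) for a contingency table."""
    N = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
    m = table > 0
    g2 = 2.0 * (table[m] * np.log(table[m] / E[m])).sum()
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return g2, df

table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

G2, df = g2_and_df(table)
GF = G2 - df * (1.0 + np.log(G2 / df))      # penalized statistic (requires G2 > df)
print(f"G2 = {G2:.2f}, df = {df}, GF_G2 = {GF:.2f}")
```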
20. Unification
- G2 estimate
- If G2 > df, then G2 < 2N * log J. This follows from the upper bound log J on H(S1 U ... U SI) and the lower bound 0 on the entropy of any specific row of the contingency table.
- Recall that u^2(G2) = G2 - df * (1 + log(G2 / df)).
- Thus, the penalty term of u^2(G2) is between O(df) and O(df * log N):
- If G2 = c * df with a constant c > 1, then the penalty is O(df).
- If G2 = c * df and N / df ≈ N / (I * J) = c, then the penalty is also O(df).
- If G2 = c * df, N -> infinity, and N / (I * J) -> infinity, then the penalty is O(df * log N).
21. Unification
- GF_MDLP = G2 - O(df) or G2 - O(df * log N), depending on the ratio N/I
- GF_AIC = G2 - df
- GF_BIC = G2 - df * (log N) / 2
- GF_G2 = G2 - df * (either a constant or log N, depending on the ratio between N and I and J)
- In general, GF = G2 - df * f(N, I, J). (The sketch below evaluates these penalties on a toy table.)
- To unify the Gini function as one of the cost functions, we resort to a parametric approach to the goodness of discretization.
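An illustrative sketch of the unified form GF = G2 - df * f(N, I, J): the same G2 combined with the AIC-, BIC-, and G2-style penalties listed above. The toy table and function names are mine; the G2-based penalty assumes G2 > df.

```python
import numpy as np

def g2_df(table):
    """G^2 (natural log), df = (I-1)(J-1), and total count N."""
    N = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
    m = table > 0
    g2 = 2.0 * (table[m] * np.log(table[m] / E[m])).sum()
    return g2, (table.shape[0] - 1) * (table.shape[1] - 1), N

table = np.array([[10.0, 1.0], [5.0, 2.0], [100.0, 5.0]])
G2, df, N = g2_df(table)

penalties = {
    "AIC": df,                                # GF_AIC = G2 - df
    "BIC": df * np.log(N) / 2.0,              # GF_BIC = G2 - df*log(N)/2
    "G2":  df * (1.0 + np.log(G2 / df)),      # GF_G2, assuming G2 > df
}
for name, pen in penalties.items():
    print(f"GF_{name} = G2 - {pen:.2f} = {G2 - pen:.2f}")
```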
22. Gini-Based Goodness Function
- Let Si be a row of the contingency table. The Gini index of Si is defined as
  Gini(Si) = 1 - Σ_j (Cij / Ni)^2
- Cost(Gini, T)
- Gini goodness function
(A minimal computation of the row Gini index follows below.)
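A minimal computation of the Gini index of a single row, matching the definition above; the example counts are illustrative.

```python
import numpy as np

def gini(row):
    """Gini index of one row of counts: 1 - sum_j (c_ij / N_i)^2."""
    p = row / row.sum()
    return float(1.0 - (p ** 2).sum())

row = np.array([10.0, 1.0])            # one interval: 10 successes, 1 failure
print(f"Gini(S_i) = {gini(row):.3f}")  # 0 = pure row, values near 1 = highly mixed row
```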
23. Parametrized Discretization
- Parametrized Entropy
- Entropy of the row
- Gini of the row
- Parametrized Data Cost
24. Parametrized Discretization
- Parametrized Cost of T0
- Parametrized Goodness Function
25. Parameters for Known Goodness Functions
26. Parametrized Dynamic Programming Algorithm
27. Dynamic Programming Algorithm
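The transcript carries no text for these two algorithm slides, so the following is only an illustrative sketch, under assumptions of my own, of how a goodness-driven dynamic program for discretization can look: the data cost of a partition is taken to be Σ_s Ns * H(Ss), the penalty depends only on the number of merged intervals k (a BIC-style (k-1)(J-1)*log(N)/2 is used as an example), and an O(I^2 * K) optimal-partition recursion searches over all k. This is not the authors' algorithm, only one standard way to realize such a search.

```python
import numpy as np

def row_entropy(counts):
    """Natural-log entropy of a count vector (0 for an empty row)."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts[counts > 0] / n
    return float(-(p * np.log(p)).sum())

def discretize(table, penalty, max_k=None):
    """table: I x J counts for the base intervals; penalty(k, N, J) -> float.
    Returns (best total cost, indices where a new merged interval starts)."""
    I, J = table.shape
    N = table.sum()
    max_k = max_k or I
    prefix = np.vstack([np.zeros(J), np.cumsum(table, axis=0)])  # prefix row sums

    def merged_cost(a, b):
        c = prefix[b] - prefix[a]          # base intervals a..b-1 merged into one bin
        return c.sum() * row_entropy(c)    # data cost N_s * H(S_s)

    INF = float("inf")
    dp = [[INF] * (I + 1) for _ in range(max_k + 1)]   # dp[k][i]: best cost, first i rows, k bins
    cut = [[0] * (I + 1) for _ in range(max_k + 1)]
    dp[0][0] = 0.0
    for k in range(1, max_k + 1):
        for i in range(k, I + 1):
            for j in range(k - 1, i):
                cand = dp[k - 1][j] + merged_cost(j, i)
                if cand < dp[k][i]:
                    dp[k][i], cut[k][i] = cand, j
    # choose the number of bins k that minimizes data cost + penalty
    best_k = min(range(1, max_k + 1), key=lambda k: dp[k][I] + penalty(k, N, J))
    bounds, i = [], I
    for k in range(best_k, 0, -1):
        bounds.append(cut[k][i])
        i = cut[k][i]
    return dp[best_k][I] + penalty(best_k, N, J), sorted(bounds)[1:]

# toy base intervals (e.g. ages) x classes (success, failure)
table = np.array([[10, 1], [5, 2], [100, 5], [250, 10], [360, 5]], dtype=float)
bic_like = lambda k, N, J: (k - 1) * (J - 1) * np.log(N) / 2.0   # df-style penalty
cost, cuts = discretize(table, bic_like)
print(f"total cost = {cost:.2f}, cuts before base intervals {cuts}")
```

Because the recursion only needs the data cost to be additive over merged intervals, any of the penalty terms from the preceding slides could be plugged in for the penalty argument.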
28. Experiments
31. Classification Errors for the Glass Dataset (Naïve Bayesian)
32. Iris: C4.5
33. Experiments: C4.5 Validation
34. Experiments: Naïve Bayesian Validation
35. Conclusions
- We considered several seemingly different approaches to discretization and demonstrated that they can be unified through a notion of generalized entropy.
- Each of the methods discussed in the literature can be derived from generalized entropy by selecting at most two parameters.
- For a given pair of parameters, the dynamic programming algorithm selects an optimal discretization (in terms of the discretization goodness function).
36. What Remains To Be Done
- How can one find, analytically, a relationship between the goodness function defined in terms of the model and the number of classification errors?
- What is the algorithm for selecting the best parameters for a given dataset?