Title: Data Discretization Unification
1. Data Discretization Unification
- Ruoming Jin
- Yuri Breitbart
- Chibuike Muoh
- Kent State University, Kent, USA
2. Outline
- Motivation
- Problem Statement
- Prior Art and Our Contributions
- Goodness Function Definition
- Unification
- Parameterized Discretization
- Discretization Algorithm and Its Performance
3. Motivation
Patients Table

Age   Success   Failure   Total
18         10         1      19
25          5         2       7
41        100         5     105
...       ...       ...     ...
51        250        10     260
52        360         5     365
53        249        10     259
...       ...       ...     ...
4. Motivation
- Possible implication of the table:
- If a person is between 18 and 25, the probability of procedure success is much higher than if the person is between 45 and 55.
- Is that a good rule, or is this one better: if a person is between 18 and 30, the probability of procedure success is much higher than if the person is between 46 and 61?
- What is the best interval?
5. Motivation
- Without data discretization, some rules would be difficult to establish.
- Several existing data mining systems cannot handle continuous variables without discretization.
- Data discretization significantly improves the quality of the discovered knowledge.
- New methods of discretization are needed for tables with rare events.
- Data discretization significantly improves the performance of data mining algorithms; some studies report a ten-fold increase in performance.
- However, any discretization process generally leads to a loss of information. Minimizing such a loss is the mark of a good discretization method.
6. Problem Statement
Given an input table:

Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           r11       r12       ...   r1J       N1
S2           r21       r22       ...   r2J       N2
...          ...       ...       ...   ...       ...
SI           rI1       rI2       ...   rIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
7. Problem Statement
Obtain an output table:

Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           C11       C12       ...   C1J       N1
S2           C21       C22       ...   C2J       N2
...          ...       ...       ...   ...       ...
SI           CI1       CI2       ...   CIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
where each Si is a union of k consecutive input intervals. The quality of the discretization is measured by

cost(model) = cost(data | model) + penalty(model)
8. Prior Art
- Unsupervised discretization: no class information is provided
- Equal-width
- Equal-frequency
- Supervised discretization: class information is provided with each attribute value
- MDLP
- Methods based on Pearson's X2 or Wilks' G2 statistic
- Dougherty and Kohavi (1995) compare the unsupervised and supervised methods of Holte (1993) and the entropy-based methods of Fayyad and Irani (1993), and conclude that supervised methods give fewer classification errors than unsupervised ones, and that supervised methods based on entropy are better than the other supervised methods.
9. Prior Art
- Several recent (2003-2006) papers introduced new discretization algorithms: Yang and Webb, Kurgan and Cios (CAIM), Boulle (Khiops).
- CAIM attempts to minimize the number of discretization intervals and, at the same time, to minimize the information loss.
- Khiops uses Pearson's X2 statistic to select mergings of consecutive intervals that minimize the value of X2.
- Yang and Webb studied discretization for naïve Bayesian classifiers. They report that their method produces fewer classification errors than the alternative discretization methods that have appeared in the literature.
10. Our Results
- There is a strong connection between discretization methods based on statistics and those based on entropy.
- There is a parametric function such that any prior discretization method is derivable from this function by choosing at most two parameters.
- There is an optimal dynamic programming method, derived from our discretization approach, that mostly outperforms prior discretization methods in the experiments we conducted.
11. Goodness Function Definition (Preliminaries)

Intervals    Class 1   Class 2   ...   Class J   Row Sum
S1           C11       C12       ...   C1J       N1
S2           C21       C22       ...   C2J       N2
...          ...       ...       ...   ...       ...
SI           CI1       CI2       ...   CIJ       NI
Column Sum   M1        M2        ...   MJ        N (Total)
12. Goodness Function Definition (Preliminaries)
Entropy of the i-th row of the contingency table:
  H(Si) = - Σ_j (Cij / Ni) log(Cij / Ni)
Total entropy of all intervals (entropy of the contingency table):
  H(S1, S2, ..., SI) = Σ_i (Ni / N) H(Si)
- Binary encoding of a row requires H(Si) binary characters per entry.
- Binary encoding of the set of rows requires H(S1, S2, ..., SI) binary characters per entry.
- Binary encoding of the table requires SL binary characters, so that
  N * H(S1, S2, ..., SI) = SL
(A small numeric sketch of this bookkeeping follows below.)
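As a concrete illustration, here is a minimal Python sketch (the toy counts and helper names such as row_entropy are mine, not from the deck) that computes the row entropies H(Si), the total encoding length SL = Σ_i Ni * H(Si), and checks that N * H(S1, ..., SI) equals SL.

```python
import numpy as np

def row_entropy(row):
    """Shannon entropy (base 2) of one row of counts, H(S_i)."""
    n = row.sum()
    p = row[row > 0] / n              # drop empty cells; 0 * log 0 is taken as 0
    return float(-(p * np.log2(p)).sum())

# toy contingency table: rows = intervals S_i, columns = classes
table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

N_i = table.sum(axis=1)                                    # row sums N_i
N = N_i.sum()
SL = sum(n * row_entropy(r) for r, n in zip(table, N_i))   # total encoding length
H_all = sum((n / N) * row_entropy(r) for r, n in zip(table, N_i))  # H(S_1,...,S_I)

print(f"SL = {SL:.2f} bits,  N * H(S_1,...,S_I) = {N * H_all:.2f} bits")  # equal
```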
13. Goodness Function Definition
- Cost(Model, T) = Cost(T | Model) + Penalty(Model)   (Mannila et al.)
- Cost(T | Model) is the complexity of encoding the table in the given model.
- Penalty(Model) reflects the complexity of the resulting table.
- GF_Model(T) = Cost(Model, T0) - Cost(Model, T)
14. Goodness Function Definition: Models To Be Considered
- MDLP (Information Theory)
- Statistical Model Selection
- Confidence Level of Row Independence
- Gini Index
15. Goodness Function Examples
- Entropy
- Statistical Akaike (AIC)
- Statistical Bayesian (BIC)
16. Goodness Function Examples (Cont.)
- Confidence-level-based goodness functions
- Pearson's X2 statistic:
  X2 = Σ_i Σ_j (Cij - Eij)^2 / Eij, where Eij = Ni * Mj / N
- Wilks' G2 statistic:
  G2 = 2 Σ_i Σ_j Cij log(Cij / Eij)
- The table's degrees of freedom: df = (I - 1)(J - 1)
- It is known in statistics that, asymptotically, both Pearson's X2 and Wilks' G2 statistics have a chi-square distribution with df degrees of freedom. (Both statistics are computed in the sketch below.)
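A small sketch of both statistics on a toy table; the counts and helper names are illustrative only. It computes X2, G2, df = (I-1)(J-1), and the chi-square tail probabilities.

```python
import numpy as np
from scipy.stats import chi2

table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

N = table.sum()
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N    # expected counts E_ij = N_i * M_j / N
df = (table.shape[0] - 1) * (table.shape[1] - 1)

X2 = ((table - E) ** 2 / E).sum()                          # Pearson's X^2
mask = table > 0                                           # 0 * log(0) = 0
G2 = 2.0 * (table[mask] * np.log(table[mask] / E[mask])).sum()  # Wilks' G^2

print(f"X2 = {X2:.2f}, G2 = {G2:.2f}, df = {df}")
print(f"tail probabilities: {chi2.sf(X2, df):.3g} (X2), {chi2.sf(G2, df):.3g} (G2)")
```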
17. Unification
- The following relationship holds between G2 and the goodness functions for MDLP, AIC, and BIC:
  G2 / 2 = N * H(S1 U ... U SI) - SL
- Thus, the goodness functions for MDLP, AIC, and BIC can be rewritten accordingly. (A numeric check of the identity follows below.)
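A quick numeric check of the identity above on a toy table (variable names are mine); entropies are taken with the natural logarithm so that both sides are in the same units as G2.

```python
import numpy as np

def H(counts):
    """Natural-log entropy of a count vector."""
    n = counts.sum()
    p = counts[counts > 0] / n
    return float(-(p * np.log(p)).sum())

table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

N = table.sum()
E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
m = table > 0
G2 = 2.0 * (table[m] * np.log(table[m] / E[m])).sum()

SL = sum(row.sum() * H(row) for row in table)   # sum_i N_i * H(S_i)
H_union = H(table.sum(axis=0))                  # entropy of the merged row S1 U ... U SI

print(f"G2 / 2            = {G2 / 2:.4f}")
print(f"N*H(union) - SL   = {N * H_union - SL:.4f}")   # should match G2 / 2
```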
18. Unification
- The normal deviate is the difference between the variable and its mean, divided by the standard deviation.
- Consider t, a chi-square distributed random variable, and let u(t) be the normal deviate such that
  P(chi2_df > t) = P(N(0,1) > u(t))
- For this normal deviate function u(t), the following theorem holds (see Wallace 1959, 1960):
- For all t > df, all df > 0.37, and with w(t) = (t - df - df * log(t / df))^(1/2),
  0 < w(t) <= u(t) <= w(t) + 0.6 * df^(-1/2)
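A small numeric check of the Wallace bound just quoted; the function names u and w and the test values of t and df are illustrative.

```python
import numpy as np
from scipy.stats import chi2, norm

def u(t, df):
    """Normal deviate u(t): P(chi2_df > t) = P(N(0,1) > u(t))."""
    return norm.isf(chi2.sf(t, df))

def w(t, df):
    """Wallace's lower bound w(t) = sqrt(t - df - df*log(t/df)), valid for t > df."""
    return np.sqrt(t - df - df * np.log(t / df))

df = 10
for t in [15.0, 25.0, 50.0, 100.0]:
    lo, mid, hi = w(t, df), u(t, df), w(t, df) + 0.6 / np.sqrt(df)
    print(f"t={t:6.1f}:  w(t)={lo:6.3f}  <=  u(t)={mid:6.3f}  <=  w(t)+0.6/sqrt(df)={hi:6.3f}")
```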
19. Unification
- From this theorem it follows that, as df goes to infinity and w(t) >> 0,
  u(t) / w(t) -> 1
- Finally, u^2(t) ≈ w^2(t) = t - df - df * log(t / df), and therefore
  GF_G2(T) = u^2(G2) = G2 - df * (1 + log(G2 / df))
- The goodness function GF_X2(T) is asymptotically the same.
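A sketch of evaluating GF_G2(T) = G2 - df * (1 + log(G2 / df)) on a toy table, assuming G2 > df; the counts and helper names are illustrative.

```python
import numpy as np

def g2_and_df(table):
    """Wilks' G^2 (natural log) and df = (I-1)(J-1) for a contingency table."""
    N = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
    m = table > 0
    g2 = 2.0 * (table[m] * np.log(table[m] / E[m])).sum()
    df = (table.shape[0] - 1) * (table.shape[1] - 1)
    return g2, df

table = np.array([[10.0, 1.0],
                  [5.0, 2.0],
                  [100.0, 5.0]])

G2, df = g2_and_df(table)
GF = G2 - df * (1.0 + np.log(G2 / df))      # penalized statistic (requires G2 > df)
print(f"G2 = {G2:.2f}, df = {df}, GF_G2 = {GF:.2f}")
```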
20. Unification
- G2 estimate
- If G2 > df, then G2 < 2N * log J. This follows from the upper bound log J on H(S1 U ... U SI) and the lower bound 0 on the entropy of any specific row of the contingency table.
- Recall that u^2(G2) = G2 - df * (1 + log(G2 / df)).
- Thus, the penalty term of u^2(G2) is between O(df) and O(df * log N):
- If G2 = c * df with a constant c > 1, then the penalty is O(df).
- If G2 = c * df and N / df ≈ N / (I * J) = c, then the penalty is also O(df).
- If G2 = c * df, N -> infinity, and N / (I * J) -> infinity, then the penalty is O(df * log N).
21. Unification
- GF_MDLP = G2 - O(df) or G2 - O(df * log N), depending on the ratio N/I
- GF_AIC = G2 - df
- GF_BIC = G2 - df * (log N) / 2
- GF_G2 = G2 - df * (either a constant or log N, depending on the ratio between N and I and J)
- In general, GF = G2 - df * f(N, I, J). (The sketch below evaluates these penalties on a toy table.)
- To unify the Gini function as one of the cost functions, we resort to a parametric approach to the goodness of discretization.
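An illustrative sketch of the unified form GF = G2 - df * f(N, I, J): the same G2 combined with the AIC-, BIC-, and G2-style penalties listed above. The toy table and function names are mine; the G2-based penalty assumes G2 > df.

```python
import numpy as np

def g2_df(table):
    """G^2 (natural log), df = (I-1)(J-1), and total count N."""
    N = table.sum()
    E = np.outer(table.sum(axis=1), table.sum(axis=0)) / N
    m = table > 0
    g2 = 2.0 * (table[m] * np.log(table[m] / E[m])).sum()
    return g2, (table.shape[0] - 1) * (table.shape[1] - 1), N

table = np.array([[10.0, 1.0], [5.0, 2.0], [100.0, 5.0]])
G2, df, N = g2_df(table)

penalties = {
    "AIC": df,                                # GF_AIC = G2 - df
    "BIC": df * np.log(N) / 2.0,              # GF_BIC = G2 - df*log(N)/2
    "G2":  df * (1.0 + np.log(G2 / df)),      # GF_G2, assuming G2 > df
}
for name, pen in penalties.items():
    print(f"GF_{name} = G2 - {pen:.2f} = {G2 - pen:.2f}")
```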
22. Gini-Based Goodness Function
- Let Si be a row of the contingency table. The Gini index of Si is defined as
  Gini(Si) = 1 - Σ_j (Cij / Ni)^2
- Cost(Gini, T)
- Gini goodness function
(A minimal computation of the row Gini index follows below.)
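A minimal computation of the Gini index of a single row, matching the definition above; the example counts are illustrative.

```python
import numpy as np

def gini(row):
    """Gini index of one row of counts: 1 - sum_j (c_ij / N_i)^2."""
    p = row / row.sum()
    return float(1.0 - (p ** 2).sum())

row = np.array([10.0, 1.0])            # one interval: 10 successes, 1 failure
print(f"Gini(S_i) = {gini(row):.3f}")  # 0 = pure row, values near 1 = highly mixed row
```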
23. Parametrized Discretization
- Parametrized Entropy
- Entropy of the row
- Gini of the row
- Parametrized Data Cost
24. Parametrized Discretization
- Parametrized Cost of T0
- Parametrized Goodness Function
25. Parameters for Known Goodness Functions
26. Parametrized Dynamic Programming Algorithm
27. Dynamic Programming Algorithm
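The transcript carries no text for these two algorithm slides, so the following is only an illustrative sketch, under assumptions of my own, of how a goodness-driven dynamic program for discretization can look: the data cost of a partition is taken to be Σ_s Ns * H(Ss), the penalty depends only on the number of merged intervals k (a BIC-style (k-1)(J-1)*log(N)/2 is used as an example), and an O(I^2 * K) optimal-partition recursion searches over all k. This is not the authors' algorithm, only one standard way to realize such a search.

```python
import numpy as np

def row_entropy(counts):
    """Natural-log entropy of a count vector (0 for an empty row)."""
    n = counts.sum()
    if n == 0:
        return 0.0
    p = counts[counts > 0] / n
    return float(-(p * np.log(p)).sum())

def discretize(table, penalty, max_k=None):
    """table: I x J counts for the base intervals; penalty(k, N, J) -> float.
    Returns (best total cost, indices where a new merged interval starts)."""
    I, J = table.shape
    N = table.sum()
    max_k = max_k or I
    prefix = np.vstack([np.zeros(J), np.cumsum(table, axis=0)])  # prefix row sums

    def merged_cost(a, b):
        c = prefix[b] - prefix[a]          # base intervals a..b-1 merged into one bin
        return c.sum() * row_entropy(c)    # data cost N_s * H(S_s)

    INF = float("inf")
    dp = [[INF] * (I + 1) for _ in range(max_k + 1)]   # dp[k][i]: best cost, first i rows, k bins
    cut = [[0] * (I + 1) for _ in range(max_k + 1)]
    dp[0][0] = 0.0
    for k in range(1, max_k + 1):
        for i in range(k, I + 1):
            for j in range(k - 1, i):
                cand = dp[k - 1][j] + merged_cost(j, i)
                if cand < dp[k][i]:
                    dp[k][i], cut[k][i] = cand, j
    # choose the number of bins k that minimizes data cost + penalty
    best_k = min(range(1, max_k + 1), key=lambda k: dp[k][I] + penalty(k, N, J))
    bounds, i = [], I
    for k in range(best_k, 0, -1):
        bounds.append(cut[k][i])
        i = cut[k][i]
    return dp[best_k][I] + penalty(best_k, N, J), sorted(bounds)[1:]

# toy base intervals (e.g. ages) x classes (success, failure)
table = np.array([[10, 1], [5, 2], [100, 5], [250, 10], [360, 5]], dtype=float)
bic_like = lambda k, N, J: (k - 1) * (J - 1) * np.log(N) / 2.0   # df-style penalty
cost, cuts = discretize(table, bic_like)
print(f"total cost = {cost:.2f}, cuts before base intervals {cuts}")
```

Because the recursion only needs the data cost to be additive over merged intervals, any of the penalty terms from the preceding slides could be plugged in for the penalty argument.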
28. Experiments
31. Classification Errors for the Glass Dataset (Naïve Bayesian)
32. Iris: C4.5
33. Experiments: C4.5 Validation
34. Experiments: Naïve Bayesian Validation
35. Conclusions
- We considered several seemingly different approaches to discretization and demonstrated that they can be unified through a notion of generalized entropy.
- Each of the methods discussed in the literature can be derived from generalized entropy by selecting at most two parameters.
- For a given pair of parameters, the dynamic programming algorithm selects an optimal discretization (in terms of the discretization goodness function).
36. What Remains To Be Done
- How can one find, analytically, a relationship between the goodness function defined in terms of the model and the number of classification errors?
- What is the algorithm for selecting the best parameters for a given dataset?