Title: Introduction to Decomposition Methods in Classification Models

1. Introduction to Decomposition Methods in Classification Models
- Lior Rokach and Oded Maimon
- Department of Industrial Engineering
- Tel-Aviv University
CsStat 2001 Haifa Winter Workshop on Computer Science and Statistics
2. Agenda
- Decomposition Concepts.
- Attribute Decomposition.
- The DOT Algorithm.
- Generalization Error of DOT.
- Benchmark Testing.
- Conclusions and Future Work.
3. Decomposition
- The purpose of decomposition methodology is to break a complex problem down into several smaller, less complex, and more manageable sub-problems that are solvable by existing tools, and then to join their solutions to obtain a solution to the original problem.
4. Decomposition Advantages
- Increased performance (classification accuracy).
- Reduced execution time.
- Clearer, more understandable results.
- Suitability for parallel or distributed computation.
- Ability to use different solution techniques for individual sub-problems.
- Modularity: easier maintenance and support of the evolutionary computation concept.
- Feasibility of solving large problems.
5. Issues in Decomposition
- What basic types of decomposition methods exist in supervised learning?
- Given a certain problem and a certain inducer, which decomposition method performs best?
- How should the sub-problems be recomposed to represent the original concept learning task?
- How can we utilize prior knowledge for decomposing the learning task?
6. Various Elementary Decomposition Approaches
7. How Is the Decomposition Structure Obtained?
- Manually, based on an expert's knowledge of a specific domain (Michie, 1995).
- Arbitrarily (Domingos, 1995).
- Due to some restriction (e.g., distributed learning).
- Induced by a suitable algorithm (Zupan, 1997).
8. Mutually Exclusive or Partially Overlapping?
- Mutually exclusive (pure) decomposition forms a restriction on the problem space.
- However, mutually exclusive decomposition:
- has a greater tendency to reduce execution time;
- yields smaller models, with better comprehensibility and easier maintenance of the solution;
- helps avoid some of the error-correlation problems that characterize non-mutually-exclusive decompositions.
9. Our Long-Term Goal
Develop a meta-algorithm that recursively
decomposes a classification problem using
elementary decomposition methods.
10. Illustrative Example
11. Attribute Decomposition
Model using the attributes Ownership and Volume:
if Ownership = House then Gold
if Ownership = Tenement and Volume > 1000 and Volume <= 1300 then None
if Ownership = None then Silver
if Ownership = Tenement and (Volume <= 1000 or Volume > 1300) then Silver
Model using the attributes Employment and Education:
if Employment = Employee and Education > 12 then Silver
if Employment = Employee and Education <= 12 then None
if Employment = Self then Gold
if Employment = None then Silver
12. Sample Decomposition
Model induced from the first half of the training set:
if Employment = Self then Gold
if Volume <= 1100 then None
if Volume > 1100 and Employment = Employee then Silver
if Employment = None then Silver
Model induced from the second half of the training set:
if Employment = Self then Gold
if Education <= 12 then None
if Education > 12 and Employment = Employee then Silver
if Employment = None then Silver
13. Space Decomposition
Model induced for Education > 15:
if Volume > 1000 then Gold
if Volume <= 1000 then Silver
Model induced for Education <= 15:
if Employment = Employee and Ownership = House then None
if Employment = Employee and Ownership = None then Silver
if Employment = Employee and Ownership = Tenement and Volume <= 1300 then None
if Employment = Employee and Ownership = Tenement and Volume > 1300 then Silver
if Employment = None then Silver
if Employment = Self then Gold
14. Concept Aggregation Decomposition
Initially we check whether the customer is willing to purchase:
if Ownership = Tenement and Volume ... and Employment = Employee then No, else Yes
Then, what type of insurance:
if Employment = Self then Gold
if Ownership = House and Aggregation = Yes then Gold
if Employment = Employee then Silver
if Ownership = Tenement and Employment = None then Silver
if Ownership = None and Employment = None then Silver
15. Function Decomposition
First, a new intermediate concept named "wealth" is defined as follows.
Then the original concept is considered:
if Wealth = Rich then Gold
if Wealth = Poor and Employment = Employee then Silver
if Wealth = Poor and Employment = None then None
else Silver
16. Classification Using Attribute Decomposition
17. Simple Example (Illustrating the Concepts)
- A training set containing 20 examples was created with the following properties:
- Uniformly distributed.
- No noise.
- No irrelevant/redundant attributes.
19. Optimal Decision Tree
The minimal optimal tree (the solution is not unique).
Classification accuracy: 100%
20. Actual Decision Tree Generated by C4.5
Classification accuracy: 93.75%
21. Two Decision Trees Generated Using Attribute Decomposition
Classification accuracy: 100%
22. Naive Bayes in Attribute Decomposition
Terminology:
Classification accuracy: 68.75%
23. Notation
24. The Bayesian Approach for Classification Problems
25. The Bayesian Approach (cont.)
- Duda and Hart (1973) showed that the Bayesian classifier has the highest possible accuracy.
- The problem: it is hard to estimate the actual probability distribution.
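The slide's equations did not survive extraction; as a standard reference point (not necessarily the slide's own notation), the Bayes-optimal decision rule can be written as:

\[
c^{*}(x) \;=\; \arg\max_{c \in \mathrm{dom}(y)} P(y = c \mid x)
\;=\; \arg\max_{c \in \mathrm{dom}(y)} P(x \mid y = c)\, P(y = c)
\]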
26. Naïve/Simple Bayes
The well-known representation of Naïve Bayes.
A representation suitable for Attribute Decomposition (both shown below).
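The formulas themselves were lost in extraction. A minimal sketch of both representations, assuming the usual conditional-independence form for naïve Bayes and the per-subset combination described on slide 31 (the subset notation G_k is ours):

\[
v_{NB}(x) \;=\; \arg\max_{c}\; \hat{P}(y=c) \prod_{i=1}^{n} \hat{P}(a_i \mid y=c)
\]

\[
v(x) \;=\; \arg\max_{c}\; \hat{P}(y=c) \prod_{k=1}^{\omega} \frac{\hat{P}(y=c \mid x_{G_k})}{\hat{P}(y=c)}
\]

Here G_1, ..., G_omega are the mutually exclusive attribute subsets. With omega = n singleton subsets, the second form reduces to the first up to a class-independent factor.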
27. Justification for Using Naïve/Simple Bayes
- Suitable for Attribute Decomposition.
- Understandable.
- Despite its simplicity, it tends (in many cases) to outperform more complicated methods such as Decision Trees or Neural Networks.
28. Why? The Bias-Variance Tradeoff
- The bias is the persistent or systematic error that the learning algorithm is expected to make.
- The variance captures random variation in the algorithm from one training set to another, due to noise in the training data or to random behavior in the learning algorithm itself.
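The slide gives these definitions in words only; for squared loss, the standard decomposition behind them reads (classification analogues exist but differ in details):

\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big] \;=\;
\underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
\;+\; \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
\;+\; \sigma^2
\]

where sigma^2 is the irreducible noise.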
29. Bias-Variance
- Simple methods (like Naïve Bayes) tend to have high bias error and low variance error.
- Complex methods (like Decision Trees) tend to have low bias error and high variance error.
30. Bias-Variance Tradeoff
[Chart comparing "Attribute Decomposition" against the "Optimal" tradeoff point.]
Attribute Decomposition can be better, but one needs to find the right decomposition.
31. Attribute Decomposition Approach with Simple Bayesian Combination
- Decompose the original input attribute set into mutually exclusive subsets.
- Run an inducer on the training data for each subset independently.
- Combine the generated models with naïve Bayes (a sketch follows below).
32. Generalization Error
- Let h represent the model generated by inducer I on training set S.
- The generalization error of the model h is the probability that h misclassifies an instance drawn from the labeled instance space according to the distribution D.
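In symbols (standard notation; the slide's own formula was lost):

\[
\varepsilon(h) \;=\; \Pr_{(x,y)\sim D}\big[\, h(x) \neq y \,\big]
\]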
33. Problem Definition
Note: this is an extension of the feature selection problem when ...
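The formal statement did not survive extraction; a hedged reconstruction from slides 31-32 (the symbols Z, G_k, and omega are ours): find the mutually exclusive decomposition whose combined model has minimal generalization error,

\[
Z^{*} \;=\; \arg\min_{Z=\{G_1,\dots,G_\omega\}} \;\varepsilon\big( NB(h_{G_1},\dots,h_{G_\omega}) \big)
\]

Under this reading, feature selection would be the special case in which a single subset is kept and the remaining attributes are discarded.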
34. Attribute Decomposition
Problem: it is very hard to find the optimal decomposition (NP-hard).
Conclusion: we need a heuristic algorithm.
35. Definition: Complete Equivalence
36. Lemma 1: A Sufficient Condition
37. Lemma 2: The k-CNF Problem
38. Lemma 3: The XOR Problem
39. Oblivious Decision Tree
(In an oblivious decision tree, all nodes at a given level test the same attribute.)
41. Generalization Error Using the VC Dimension
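The bound shown on this slide was lost in extraction; a standard VC-style bound of the kind used here (Vapnik, 1995) states that, with probability 1 - delta over a sample of size m,

\[
\varepsilon(h) \;\le\; \hat{\varepsilon}(h) \;+\;
\sqrt{\frac{d\left(\ln\frac{2m}{d}+1\right) + \ln\frac{4}{\delta}}{m}}
\]

where d is the VC dimension of the hypothesis class and \hat{\varepsilon}(h) is the training error.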
42. Generalization Error of DOT
43. Generalization Error of DOT (cont.)
- We use the estimate of the generalization error to decide whether adding a certain attribute to a certain subset improves the entire decomposition (see the sketch after this list).
- Our experiments have shown that using the lower bound of the VC dimension is more reliable.
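A hedged sketch of this greedy use of the error estimate; estimate_error stands in for the paper's VC-based bound, and the exact search order is our assumption, not necessarily DOT's:

    def grow_decomposition(attributes, estimate_error):
        """Greedily place each attribute where the estimated generalization
        error of the whole decomposition (e.g., a VC lower-bound estimate)
        improves the most; attributes that never help are dropped."""
        subsets, best = [], estimate_error([])
        for a in attributes:
            best_trial, best_err = None, best
            # candidate moves: add `a` to an existing subset, or open a new one
            trials = [subsets[:i] + [subsets[i] + [a]] + subsets[i+1:]
                      for i in range(len(subsets))] + [subsets + [[a]]]
            for trial in trials:
                err = estimate_error(trial)
                if err < best_err:
                    best_trial, best_err = trial, err
            if best_trial is not None:      # keep `a` only if the bound improves
                subsets, best = best_trial, best_err
        return subsets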
44. Artificial Dataset I
- A fabricated dataset with four input attributes that form two independent groups given the target attribute.
- The aim of this problem is to check whether the algorithm can identify the groups and reduce the error rate.
45. Artificial Dataset I: Results (Lemma 1)
46. Artificial Dataset II: k-CNF (Lemma 2)
47. UCI Repository Results
48. The Link between Error Reduction and Complexity
49. Main Result
- Attribute Decomposition contributes to classification accuracy.
50. Conclusion: Advantages
- Increases classification accuracy.
- Decreases model complexity.
- Enables effective treatment of databases with high dimensionality.
51. Conclusion: Disadvantages
- Deriving if-then rules is difficult.
- Potential loss of complex models.