Title: Decision Tree Algorithms
1. Decision Tree Algorithms
2. What will be discussed
- Some definitions.
- Algorithms.
- There is no perfect algorithm.
- The importance of software that can implement multiple algorithms for the same dataset (the meta-learner and the meta-meta-learner).
- The software being developed.
3. Some definitions
- Variables:
  - Continuous: its measured values are real numbers (ex. 73.827, 23).
  - Categorical: takes values in a finite set with no natural ordering (ex. black, red, green).
  - Ordered: takes values in a finite set with some way of sorting its elements (ex. age in years, an interval of integers, 01/09/2004).
- Dependent variable, or set of classes: the aspect of the data to be studied.
- Independent variables, or set of attributes: variables that are manipulated to explain the dependent variable.
- Regression-type problems. Ex: house selling price (a value).
- Classification-type problems. Ex: who will graduate (yes, no).
4. CRT
- CRT family: CRT, tree (S), etc.
- Motivation: classification-type and regression-type problems.
- Exactly two branches from each nonterminal node.
- The split attribute can be continuous or categorical.
- Independent variables can be categorical, ordered or continuous.
- Splitting a node into more than two branches often creates more parsimonious models.
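As a rough illustration of CRT's binary splitting, here is a minimal sketch of searching for the best two-way split on a continuous attribute. The use of Gini impurity is an assumption (the slides do not name the impurity measure), and all names are illustrative:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a collection of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_binary_split(values, labels):
    """Scan candidate thresholds on one continuous attribute and return
    (weighted impurity, threshold) for the best two-way split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[0]:
            best = (score, threshold)
    return best

print(best_binary_split([1.0, 2.0, 3.0, 4.0], ["no", "no", "yes", "yes"]))
# (0.0, 2.5): splitting at 2.5 gives two pure branches
```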
5. C4.5
- CLS family: CLS, ID3, C4.5, etc.
- Motivation: concept learning (classification-type problems).
- Usually creates parsimonious trees.
- Great for categorical dependent variables.
- In the original version, if the independent variable being split is not continuous, the number of branches equals the number of values that independent variable can take. However, there are other adaptations of this.
- Independent variables are nominal only. In some circumstances it is acceptable to divide continuous variables into discrete bands as a workaround for that issue. Ex: LOS bands A [0, 0.5), B [0.5, 1), C [1, 2), etc. (A sketch of this banding follows.)
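A minimal sketch of the banding workaround, using the LOS band edges from the example above (the function name and the catch-all band are illustrative):

```python
def band(value, edges=(0.0, 0.5, 1.0, 2.0), labels=("A", "B", "C")):
    """Map a continuous value to a discrete band label.
    Band i covers the half-open interval [edges[i], edges[i+1])."""
    for lo, hi, lab in zip(edges, edges[1:], labels):
        if lo <= value < hi:
            return lab
    return "D+"  # illustrative catch-all for values beyond the last edge

print(band(0.3))  # A
print(band(1.7))  # C
```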
6. C4.5
- Idea:
  - Select a leaf node with an inhomogeneous sample set.
  - Replace that leaf node with a test node that divides the inhomogeneous sample set into minimally inhomogeneous subsets, according to an entropy calculation.
7. C4.5
- Entropy formulae
- Entropy, a measure from information theory, characterizes the (im)purity, or homogeneity, of an arbitrary collection of examples.
- Given:
  - $n_b$, the number of instances in branch b.
  - $n_{bc}$, the number of instances in branch b of class c (of course, $n_{bc} \le n_b$).
  - $n_t$, the total number of instances in all branches.
- The proportion of instances of class c in branch b is $p_{bc} = n_{bc}/n_b$, and the entropy of branch b is $E_b = -\sum_c p_{bc} \log_2 p_{bc}$.
- For a two-class problem, let $P_b$ be the proportion of positive instances in branch b, so $E_b = -P_b \log_2 P_b - (1 - P_b)\log_2(1 - P_b)$:
  - If all the instances in the branch are positive, then $P_b = 1$ (homogeneous positive).
  - If all the instances in the branch are negative, then $P_b = 0$ (homogeneous negative).
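A minimal sketch of the entropy calculation defined above (names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E_b = -sum_c p_bc * log2(p_bc) over the classes in one branch."""
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

print(entropy(["yes"] * 4))        # 0.0: homogeneous branch
print(entropy(["yes", "no"] * 2))  # 1.0: perfectly balanced branch
```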
8. C4.5
- As you move between perfect homogeneity and perfect balance, entropy varies smoothly between zero and one.
- The entropy is zero when the set is perfectly homogeneous.
- The entropy is one when the set is perfectly inhomogeneous (for two classes, a 50/50 split).
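Putting slides 6-8 together, a minimal sketch of the split-selection step: choose the attribute whose branches have the lowest weighted entropy (equivalently, the highest information gain). This is a simplified ID3-style version; all names and the toy data are illustrative:

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum(-c / n * log2(c / n) for c in Counter(labels).values())

def weighted_entropy(rows, labels, attr):
    """Weighted entropy of the branches created by splitting on `attr`."""
    branches = {}
    for row, lab in zip(rows, labels):
        branches.setdefault(row[attr], []).append(lab)
    n = len(labels)
    return sum(len(b) / n * entropy(b) for b in branches.values())

def best_attribute(rows, labels, attrs):
    """The attribute whose split leaves the subsets minimally inhomogeneous."""
    return min(attrs, key=lambda a: weighted_entropy(rows, labels, a))

rows = [{"outlook": "sunny", "windy": True},
        {"outlook": "rain",  "windy": True},
        {"outlook": "sunny", "windy": False},
        {"outlook": "rain",  "windy": False}]
labels = ["no", "yes", "no", "yes"]
print(best_attribute(rows, labels, ["outlook", "windy"]))  # outlook
```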
9. There is no perfect algorithm
- Examples:
  - C4.5, THAID and QUEST are classification algorithms only.
  - AID, MAID and XAID are for quantitative responses only.
  - CRT does both. However, it is a slow algorithm and it always yields binary trees, which sometimes cannot be summarised efficiently.
  - QUEST does not do regression. It is very fast, but unfortunately uses a lot of memory for large datasets.
10. CRT and C4.5 comparison using the golf dataset (AnswerTree vs. Spartacus)
11. CRT (golf dataset)
12. [Tree diagram; no recoverable text.]
13. Basic comparison results
14. Software which can implement multiple algorithms
- The software will be able to run the different algorithms on the same dataset.
- Trees generated by the different algorithms will be created and compared. The user will be able to compare them visually, or to pick the one with the lowest misclassification rate.
- Depending on the nature of the problem (classification or regression), a specific algorithm can be much more efficient.
15. Software which can implement multiple algorithms
- The meta-learner:
  - The user will choose the dataset and the variables.
  - A trial of different runs, using combinations of different methods, will be the input of a neural network (the meta-learner). A sketch of the run-and-compare step follows.
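A minimal sketch of running several algorithms on the same dataset and keeping the best tree. The fitting interface is an assumption (the actual system's API is not described), and `majority_fit` is only a toy stand-in for a real tree learner:

```python
from collections import Counter

def misclassification_rate(predict, rows, labels):
    """Fraction of rows the fitted model labels incorrectly."""
    return sum(predict(r) != lab for r, lab in zip(rows, labels)) / len(labels)

def pick_best_model(fitters, train, test):
    """Fit every candidate algorithm on the same dataset and keep the
    model with the lowest misclassification rate on held-out data."""
    models = [(name, fit(*train)) for name, fit in fitters]
    return min(models, key=lambda m: misclassification_rate(m[1], *test))

def majority_fit(rows, labels):
    """Toy stand-in for a real learner (ID3, C4.5, CRT, ...):
    always predicts the training majority class."""
    majority = Counter(labels).most_common(1)[0][0]
    return lambda row: majority

train = ([{"x": 0}, {"x": 1}], ["yes", "yes"])
test = ([{"x": 2}], ["yes"])
name, model = pick_best_model([("majority", majority_fit)], train, test)
print(name, misclassification_rate(model, *test))  # majority 0.0
```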
16. Set of rules
[Diagram: the dataset is run through candidate classifiers (C1: CRT, C2: QUEST), each reporting data quality, CPU time and memory utilisation. These measurements feed a neural network, the meta-learner, which weighs optimal data quality, simpler rules and total CPU time and outputs a set of rules. A per-run score that appears to read S = (sum over classifiers c of memory(c) / CPU(c)) / total time accompanies the diagram.]
17. The meta-meta-learners
[Diagram: the dataset feeds three candidate meta-learners (Meta-Learner 1: CRT; Meta-Learner 2: neural network / linear discriminant; Meta-Learner 3: relation rules / C4.5 / STR-Tree), optionally followed by a further neural network (probably not necessary). The best meta-learner is user-defined and could be chosen by a function like: A * DataQuality + B * SimplerRules - C * Memory - D * Time.]
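A minimal sketch of that user-defined scoring function; the weights, metric names and candidate figures below are placeholders, not values from the system:

```python
def meta_learner_score(m, a, b, c, d):
    """Score = A*DataQuality + B*SimplerRules - C*Memory - D*Time."""
    return (a * m["data_quality"] + b * m["simpler_rules"]
            - c * m["memory"] - d * m["time"])

candidates = [  # illustrative measurements for two meta-learners
    {"name": "ML1 (CRT)", "data_quality": 80, "simpler_rules": 60,
     "memory": 40, "time": 30},
    {"name": "ML3 (relation rules)", "data_quality": 70, "simpler_rules": 90,
     "memory": 20, "time": 50},
]
best = max(candidates, key=lambda m: meta_learner_score(m, a=2, b=1, c=1, d=1))
print(best["name"])  # ML3 (relation rules)
```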
18. The meta-meta-learners: user input and output
- Input:
  - Dataset name? NHS
  - Dependent variables? LOS, OUTCOME, STROKE
  - How much do you care about: data quality (0-99), parsimonious models (0-99), time to process (0-99), memory utilisation (0-99)?
- Output:
  - The best meta-learner for you is a combination of C4.5, ANN and relation rules. These are the best rules:
    1. IF HEART ATTACK AND AGE > 90 THEN DEAD (error 3)
    2. Everybody that has STROKE also has HIGH BLOOD PRESSURE.
    3. 2.3 * AGE + 0.4 * APACHE2 -> LOS (error 25)
19. Software which can implement multiple algorithms
- Once the best meta-learner is found for a given situation, dataset and dependent variable, the user can define this meta-learner as the one to be executed in similar situations.
- Ex: to find out the patients' LOS in the ICU datasets, ML3 (CRT) will be used. However, to find out the outcome of the patient (died or survived), ML103 (C4.5, relation rules) will be used.
20. Work in progress
- Algorithm fully implemented in the system:
  - ID3.
- Algorithm partially implemented in the system:
  - C4.5 (still missing: grouping of categorical attributes, pruning, classification error, and handling of missing attributes).
- Algorithms to be implemented:
  - CRT, CHAID and Spartacus (PhD thesis).
- Future implementations of neural network aspects, such as automatic tree adaptation based on recent inputs. Various neural network architectures are also applicable to regression-type problems.
- Any other suggestions?
21. Work in progress
- Capacity to handle large datasets (memory optimisation).
- VirtualTable concept: no unnecessary data copies for nodes (see the sketch after this list).
- Sub-datasets on the fly (speed optimisation).
  - Instead of creating a sub-VirtualTable for each set of data used to test a split, the software tests for splits in the parent node on the fly.
  - This makes the tests a little more complex, but speeds up the system.
- In-memory access to items.
  - No I/O delay when the dataset is under 350 MB on computers with 512 MB of RAM. (Theoretical; it has never been tested due to lack of time.)
- Node data visualisation.
- Support for comma-separated values (.csv) files, dBase tables and MS Excel spreadsheets.
- In the near future:
  - Tree pruning.
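A minimal sketch of the VirtualTable idea as described above; the class shape and names are assumptions, not the actual implementation. Child nodes hold index views into the shared dataset instead of copying rows:

```python
class VirtualTable:
    """A view over a shared dataset: each node keeps row indices,
    never copies of the rows themselves."""

    def __init__(self, data, indices=None):
        self.data = data  # shared by every node, never copied
        self.indices = list(indices) if indices is not None else list(range(len(data)))

    def split(self, predicate):
        """Partition this view into two child views by a split test."""
        left = [i for i in self.indices if predicate(self.data[i])]
        right = [i for i in self.indices if not predicate(self.data[i])]
        return VirtualTable(self.data, left), VirtualTable(self.data, right)

rows = [{"age": 30}, {"age": 95}, {"age": 60}]
root = VirtualTable(rows)
old, young = root.split(lambda r: r["age"] > 90)
print(old.indices, young.indices)  # [1] [0, 2]
```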
22. Work in progress
- Reports.
- Grouping attribute values in binary splits.
- Manually moving data across nodes.
- Costs associated with misclassification.
- C4.5 gain ratio.
- Special treatment for missing attributes.
- Bug fixes.
- Allowing trees to be saved (XML).
- Look-up tables for codes.
- Translation of leaves into rules.
- Relation rules, ANN.
- Meta-learners and meta-meta-learners.