Decision Tree Algorithms - PowerPoint PPT Presentation

Provided by: evandrom

1
Decision Tree Algorithms
  • Brief review

2
What will be discussed
  • Some definitions.
  • Algorithms.
  • There is no perfect algorithm.
  • The importance of software that can implement
    multiple algorithms on the same dataset (the
    meta-learner and the meta-meta-learner).
  • The software being developed.

3
Some definitions
  • Variables:
  • Continuous: its measured values are real numbers
    (e.g. 73.827, 23).
  • Categorical: takes values in a finite set that has
    no natural ordering (e.g. black, red, green).
  • Ordered: a finite set with some way of sorting its
    elements (e.g. age in years, an interval of
    integers, 01/09/2004).
  • Dependent variable, or set of classes: the aspect
    of the data to be studied.
  • Independent variables, or set of attributes: the
    variables manipulated to explain the dependent
    variable.
  • Regression-type problems. Ex: house selling price
    (a numeric value).
  • Classification-type problems. Ex: who will
    graduate (yes, no).

4
C&RT
  • C&RT family: C&RT, tree (S), etc.
  • Motivation: classification-type and
    regression-type problems.
  • Exactly two branches from each nonterminal node.
  • The split attribute can be continuous or
    categorical.
  • Independent variables can be categorical, ordered
    or continuous.
  • Splitting a node into more than two branches often
    creates more parsimonious models.

5
C4.5
  • CLS family: CLS, ID3, C4.5, etc.
  • Motivation: concept learning (classification-type
    problems).
  • Usually creates parsimonious trees.
  • Great for categorical dependent variables.
  • In the original version, if the independent
    variable being split is not continuous, the number
    of branches equals the number of distinct values
    of that variable. Other adaptations of this exist.
  • Independent variables are nominal only. In some
    circumstances it is acceptable to divide
    continuous variables into discrete bands as a
    workaround for that issue. Ex: LOS bands
    A [0, 0.5), B [0.5, 1), C [1, 2), etc.
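The banding workaround can be sketched as a small mapping function. The edges follow the LOS example above; the final band D, covering values of 2 or more, is an illustrative catch-all added here, not something stated on the slide.

```python
def to_band(value, edges=(0.5, 1, 2), labels=("A", "B", "C", "D")):
    """Map a continuous value (e.g. length of stay) onto half-open bands
    A [0, 0.5), B [0.5, 1), C [1, 2), D [2, inf) so that a nominal-only
    learner such as classic C4.5 can treat it as a categorical attribute.
    The D band is an assumed catch-all for values beyond the last edge."""
    for label, edge in zip(labels, edges):
        if value < edge:
            return label
    return labels[len(edges)]

los = [0.2, 0.5, 1.7, 6.0]
bands = [to_band(v) for v in los]   # one nominal label per patient stay
```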

6
C4.5
  • Idea:
  • Select a leaf node with an inhomogeneous sample
    set.
  • Replace that leaf node with a test node that
    divides the inhomogeneous sample set into
    minimally inhomogeneous subsets, according to an
    entropy calculation.
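The "minimally inhomogeneous subsets" step can be sketched as follows: branch on each candidate attribute (one branch per value, as in ID3/C4.5), compute the weighted entropy of the resulting subsets, and keep the attribute with the lowest value. The tiny sample below is illustrative, not the slides' data.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_entropy(rows, labels, attr):
    """Weighted entropy of the subsets produced by branching on `attr`,
    with one branch per distinct attribute value."""
    branches = defaultdict(list)
    for row, y in zip(rows, labels):
        branches[row[attr]].append(y)
    n = len(labels)
    return sum(len(b) / n * entropy(b) for b in branches.values())

# Illustrative sample: 'outlook' separates the classes perfectly,
# so it yields the minimally inhomogeneous subsets.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rain", "windy": "yes"},
    {"outlook": "rain", "windy": "no"},
]
labels = ["play", "play", "dont", "dont"]
best = min(["outlook", "windy"], key=lambda a: split_entropy(rows, labels, a))
```

Branching on `outlook` gives two pure subsets (weighted entropy 0), while `windy` leaves both subsets perfectly mixed (weighted entropy 1), so `outlook` is chosen.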

7
C4.5
  • Entropy formula
  • Entropy, a measure from information theory,
    characterizes the (im)purity, or homogeneity, of
    an arbitrary collection of examples.
  • Given:
  • nb, the number of instances in branch b.
  • nbc, the number of instances in branch b of class
    c (so nbc is less than or equal to nb).
  • nt, the total number of instances in all
    branches.
  • For a two-class branch, let Pb be the proportion
    of positive instances; the branch entropy is
    -Pb log2(Pb) - (1 - Pb) log2(1 - Pb).
  • If all the instances on the branch are positive,
    then Pb = 1 (homogeneous positive).
  • If all the instances on the branch are negative,
    then Pb = 0 (homogeneous negative).

8
C4.5
  • Between perfect homogeneity and perfect balance,
    entropy varies smoothly between zero and one.
  • The entropy is zero when the set is perfectly
    homogeneous.
  • The entropy is one when the set is perfectly
    balanced (maximally inhomogeneous).
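These boundary cases can be checked numerically, assuming the usual Shannon entropy over class proportions:

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a collection of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

homogeneous = entropy(["+", "+", "+", "+"])   # perfectly pure     -> 0.0
balanced = entropy(["+", "+", "-", "-"])      # perfectly balanced -> 1.0
skewed = entropy(["+", "+", "+", "-"])        # in between: ~0.811
```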

9
There is no perfect algorithm.
  • Examples
  • C4.5, THAID and QUEST are classification
    algorithms only.
  • AID, MAID and XAID are for quantitative
    responses only.
  • C&RT does both. However, it is a slow algorithm,
    and it always yields binary trees, which sometimes
    cannot be summarised efficiently.
  • QUEST does not do regression. It is very fast but
    unfortunately uses a lot of memory on large
    datasets.

10
C&RT and C4.5 comparison using the golf dataset
(Answer Tree vs Spartacus)
11
C&RT (golf dataset)
12

13
Basic comparison results
14
Software that can implement multiple algorithms
  • The software will be able to run the different
    algorithms on the same dataset.
  • Trees generated by the different algorithms will
    be created and compared. The user will be able to
    compare them visually, or to pick the one with the
    lowest misclassification rate.
  • Depending on the nature of the problem
    (classification or regression), a specific
    algorithm can be much more efficient.
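The pick-the-lowest-misclassification step might look like this sketch. The two candidate predictors are hypothetical stand-ins for trees grown by different algorithms (e.g. C&RT vs C4.5), and the data is illustrative.

```python
def misclassification_rate(predict, rows, labels):
    """Fraction of examples a candidate model gets wrong."""
    wrong = sum(predict(r) != y for r, y in zip(rows, labels))
    return wrong / len(labels)

# Two hypothetical candidate models standing in for fitted trees:
majority = lambda row: "no"                                  # always predict one class
stump = lambda row: "no" if row["humidity"] > 75 else "yes"  # one-split tree

rows = [{"humidity": 70}, {"humidity": 90}, {"humidity": 85}, {"humidity": 60}]
labels = ["yes", "no", "no", "yes"]

candidates = {"majority": majority, "stump": stump}
best_name = min(candidates,
                key=lambda k: misclassification_rate(candidates[k], rows, labels))
```

On this data the stump is never wrong, so it is selected over the majority-class baseline.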

15
Software that can implement multiple algorithms
  • The meta-learner:
  • The user will choose the dataset and the
    variables.
  • A set of trial runs, using combinations of
    different methods, will be the input of a neural
    network (the meta-learner).

16
[Diagram: the dataset feeds candidate classifiers (C1: C&RT, C2: QUEST). Each run produces a set of rules plus measurements of data quality, CPU time and memory utilisation. These feed a neural network, the meta-learner, which trades off optimal data quality, simpler rules, total CPU time and memory utilisation (scored roughly as the sum over classifiers c of memory(c) / CPU(c), divided by total time).]
17
The meta-meta-learners
[Diagram: the dataset feeds several meta-learners (Meta-Learner 1: C&RT; Meta-Learner 2: neural network / linear discriminant; Meta-Learner 3: relation rules, C4.5, STR-Tree). A user-defined function, e.g. Best meta-learner = DataQuality*A + SimplerRules*B - Memory*C - Time*D, selects among them; a further neural network is probably not necessary.]
18
The meta-meta-learner's user input and output
Input:
  Dataset name? NHS
  Dependent variables? LOS, OUTCOME, STROKE
  How much do you care about:
    Data quality (0-99)
    Parsimonious models (0-99)
    Time to process (0-99)
    Memory utilisation (0-99)
Output:
  The best meta-learner for you is a combination of
  C4.5, ANN and relation rules. These are the best
  rules:
  1- IF HEART ATTACK and AGE > 90 THEN DEAD (error 3%)
  2- Everybody who has STROKE also has HIGH BLOOD
     PRESSURE
  3- AGE * 2.3 + APACHE2 * 0.4 = LOS (error 25%)
19
Software that can implement multiple algorithms
  • Once the best meta-learner is found for a given
    situation, dataset and dependent variable, the
    user can define that meta-learner as the one to be
    executed in similar situations.
  • Ex: to find out a patient's LOS in the ICU
    datasets, ML3 (C&RT) will be used; however, to
    find out the patient's outcome (died or survived),
    ML103 (C4.5, relation rules) will be used.

20
Work in progress
  • Algorithm fully implemented in the system:
  • ID3.
  • Algorithm partially implemented in the system:
  • C4.5 (missing: grouping of categorical
    attributes, pruning, classification error, and
    handling of missing attributes).
  • Algorithms to be implemented:
  • C&RT, CHAID and Spartacus (PhD thesis).
  • Future implementations of neural-network aspects,
    such as automatic tree adaptation based on recent
    inputs. Various neural network architectures are
    also applicable to regression-type problems.
  • Any other suggestions?

21
Work in progress
  • Capacity to handle large datasets (memory
    optimisation).
  • VirtualTable concept: no unnecessary data copies
    for nodes.
  • Sub-datasets on the fly (speed optimisation).
  • Instead of creating a sub-VirtualTable for each
    set of data used to test a split, the software
    tests for splits in the parent node on the fly.
  • This makes the tests a little more complex, but
    speeds up the system.
  • In-memory access for items.
  • No I/O delay when the dataset is smaller than
    350 MB on computers with 512 MB of RAM
    (theoretical; never tested due to lack of time).
  • Node data visualisation.
  • Support for comma-separated values (.csv) files,
    dBase tables and MS Excel spreadsheets.
  • In the near future:
  • Tree pruning.

22
Work in progress
  • Reports.
  • Grouping attribute values in binary splits.
  • Manually moving data across nodes.
  • Costs associated with misclassification.
  • C4.5 gain ratio.
  • Special treatment of missing attributes.
  • Bug fixes.
  • Allow trees to be saved (XML).
  • Look-up tables for codes.
  • Translation of leaves into rules.
  • Relation rules, ANN.
  • Meta-learners and meta-meta-learners.