Title: Decision Tree Models in Data Mining
1. Decision Tree Models in Data Mining
- Matthew J. Liberatore
- Thomas Coghlan
2. Decision Trees in Data Mining
- Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case)
- Like logistic regression and neural networks, decision trees can be applied for classification and prediction
- Unlike these methods, no equations are estimated
- A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
- The rules are of an IF-THEN form, for example:
- If Risk = Low, then predict on-time payment of a loan
3. Decision Tree Approach
- A decision tree represents a hierarchical segmentation of the data
- The original segment is called the root node and is the entire data set
- The root node is partitioned into two or more segments by applying a series of simple rules over an input variable
- For example: risk low, risk not low
- Each rule assigns the observations to a segment based on its input value
- Each resulting segment can be further partitioned into sub-segments, and so on
- For example, risk low can be partitioned into income low and income not low
- The segments are also called nodes, and the final segments are called leaf nodes or leaves
4. Decision Tree Example: Loan Payment
- Root split on Income: < 30k vs. > 30k
- If Income < 30k, split on Age: Age < 25 -> not on-time; Age > 25 -> on-time
- If Income > 30k, split on Credit Score: Score < 600 -> not on-time; Score > 600 -> on-time
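The IF-THEN rule structure of this example tree can be restated as a small sketch in code; the thresholds and field names (income, age, credit_score) come from the slide above and are purely illustrative.

```python
def predict_on_time(income, age, credit_score):
    """Illustrative restatement of the example tree's IF-THEN rules."""
    if income < 30_000:          # root split on Income
        if age < 25:             # left branch splits on Age
            return "not on-time"
        return "on-time"
    else:                        # Income > 30k: right branch splits on Credit Score
        if credit_score < 600:
            return "not on-time"
        return "on-time"

print(predict_on_time(income=25_000, age=30, credit_score=550))  # -> "on-time"
```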
5. Growing the Decision Tree
- Growing the tree involves successively (recursively) partitioning the data
- If an input variable is binary, then the two categories can be used to split the data
- If an input variable is interval, a splitting value is used to classify the data into two segments
- For example, if household income is interval and there are 100 possible incomes in the data set, then there are 100 possible splitting values
- For example, income < 30k and income > 30k
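As a rough sketch of how candidate splitting values arise for an interval input, each distinct observed value can serve as a cut point; the variable name and data below are hypothetical.

```python
incomes = [18_000, 22_000, 30_000, 45_000, 61_000, 30_000]

# Each distinct observed value is a candidate cut point of the form
# income < v  vs.  income >= v
candidate_splits = sorted(set(incomes))
print(candidate_splits)   # [18000, 22000, 30000, 45000, 61000]
```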
6. Evaluating the Partitions
- When the target is categorical, for each partition of an input variable a chi-square statistic is computed
- A contingency table is formed that maps responders and non-responders against the partitioned input variable
- For example, the null hypothesis might be that there is no difference between people with income < 30k and those with income > 30k in making an on-time loan payment
- The lower the p-value, the more likely we are to reject this hypothesis, meaning that this income split is a discriminating factor
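A minimal sketch of this test using scipy; the 2x2 counts are made-up numbers standing in for on-time and not-on-time payers in each income segment.

```python
from scipy.stats import chi2_contingency

# rows: payment on-time / not on-time; columns: income < 30k / income > 30k
table = [[14, 36],   # on-time
         [26, 24]]   # not on-time

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the income split discriminates
```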
7. Contingency Table

                        < 30k    > 30k    Total
  Payment on-time
  Payment not on-time
  Total
8. Chi-Square Statistic
- The chi-square statistic measures how different the number of observations in each of the four cells is from the expected number
- The p-value associated with the null hypothesis is computed
- Enterprise Miner then computes the logworth of the p-value: logworth = -log10(p-value)
- The split that generates the highest logworth for a given input variable is selected
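A sketch of the logworth calculation over a set of candidate splits of one input; the chi-square call and the 2x2 counts are the same kind of assumption as above.

```python
import numpy as np
from scipy.stats import chi2_contingency

def logworth(table):
    """logworth = -log10(p-value) of the chi-square test for one candidate split."""
    _, p_value, _, _ = chi2_contingency(table)
    return -np.log10(p_value)

# hypothetical 2x2 tables for two candidate cut points of the same input
split_tables = {"income < 30k": [[14, 36], [26, 24]],
                "income < 45k": [[22, 28], [30, 20]]}

best = max(split_tables, key=lambda s: logworth(split_tables[s]))
print(best)   # the split with the highest logworth is kept for this variable
```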
9. Growing the Tree
- In our loan payment example, we have three interval-valued input variables: income, age, and credit score
- We compute the logworth of the best split for each of these variables
- We then select the variable that has the highest logworth and use its split (suppose it is income)
- Under each of the two income nodes, we then find the logworth of the best split of age and credit score, and continue the process
- subject to meeting the threshold on the significance of the chi-square value for splitting and other stopping criteria (described later)
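Continuing the sketch, the same comparison is repeated across the three input variables, and the one with the highest logworth wins the split; the numbers here are invented.

```python
# best logworth found for each input variable (hypothetical values)
best_logworth = {"income": 3.2, "age": 1.4, "credit_score": 2.1}

split_variable = max(best_logworth, key=best_logworth.get)
print(split_variable)   # "income" -> the tree splits on income first,
                        # then the search repeats inside each income node
```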
10. Other Splitting Criteria for a Categorical Target
- The Gini and entropy measures are based on how heterogeneous the observations are at a given node
- relates to the mix of responders and non-responders at the node
- Let p1 and p0 represent the proportion of responders and non-responders at a node, respectively
- If two observations are chosen (with replacement) from a node, the probability that they are either both responders or both non-responders is p1^2 + p0^2
- The Gini index = 1 - p1^2 - p0^2, the probability that the two observations are different
- Best case is a Gini index of 0 (all observations are the same)
- An index of ½ means both groups are equally represented
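A small sketch of the Gini index as defined above, for a node described only by its responder proportion.

```python
def gini_index(p1):
    """Gini = 1 - p1^2 - p0^2 for a binary target; 0 = pure node, 0.5 = even mix."""
    p0 = 1.0 - p1
    return 1.0 - p1**2 - p0**2

print(gini_index(1.0))   # 0.0 -> all observations the same (best case)
print(gini_index(0.5))   # 0.5 -> responders and non-responders equally represented
```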
11. Other Splitting Criteria for a Categorical Target
- The rarity of an event with proportion pi is defined as -log2(pi)
- Entropy sums up the rarity of response and non-response over all observations: Entropy = -p1·log2(p1) - p0·log2(p0)
- Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (an equal mix of responders and non-responders)
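The entropy measure, sketched in the same style; log base 2 keeps the value between 0 and 1 for a binary target.

```python
import math

def entropy(p1):
    """Entropy = -p1*log2(p1) - p0*log2(p0); 0 = pure node, 1 = equal mix."""
    p0 = 1.0 - p1
    terms = [p * -math.log2(p) for p in (p1, p0) if p > 0]  # treat 0*log(0) as 0
    return sum(terms)

print(entropy(0.5))   # 1.0 -> worst case, equal mix
print(entropy(1.0))   # 0.0 -> best case, all one class
```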
12. Splitting Criteria for a Continuous (Interval) Target
- An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue
- Similar to the sum-of-squares discussion under multiple regression, the F-statistic is based on the ratio of the sum of squares between the groups to the sum of squares within the groups, both adjusted for the number of degrees of freedom
- The null hypothesis is that there is no difference in the target mean between the two groups
- As before, the logworth of the p-value is computed
13. Some Adjustments
- The more possible splits of an input variable, the less accurate the p-value (a greater chance of falsely rejecting the null hypothesis)
- If there are m splits, the Bonferroni adjustment adjusts the p-value of the best split by subtracting log10(m) from the logworth
- If the Time of Kass Adjustment property is set to Before, then the p-values of the candidate splits are compared with the Bonferroni adjustment applied
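A worked sketch of the Bonferroni adjustment described above; m and the p-value are invented.

```python
import numpy as np

p_value = 0.001   # p-value of the best candidate split (hypothetical)
m = 100           # number of candidate splits examined for this input

logworth = -np.log10(p_value)               # 3.0
adjusted_logworth = logworth - np.log10(m)  # 3.0 - 2.0 = 1.0
print(adjusted_logworth)
```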
14. Some Adjustments
- Setting the Split Adjustment property to Yes means that the significance of the p-value can be adjusted by the depth of the tree
- For example, at the fourth split, a calculated p-value of 0.04 becomes 0.04 × 2^4 = 0.64, making the split statistically insignificant (see the sketch after this list)
- This leads to rejecting more splits, limiting the size of the tree
- Tree growth can also be controlled by setting:
- Leaf Size property (minimum number of observations in a leaf)
- Split Size property (minimum number of observations to allow a node to be split)
- Maximum Depth property (maximum number of generations of nodes)
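The depth adjustment in the example above can be sketched the same way: following the slide, the p-value is multiplied by 2 raised to the depth of the split, which here pushes 0.04 past a 0.05 threshold.

```python
p_value = 0.04
depth = 4

adjusted_p = p_value * 2**depth   # 0.04 * 16 = 0.64
print(adjusted_p > 0.05)          # True -> the split is rejected as insignificant
```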
15. Some Results
- The posterior probabilities are the proportions of responders and non-responders at each node
- A node is classified as a responder or non-responder depending on which posterior probability is the largest
- In selecting the best tree, one can use Misclassification, Lift, or Average Squared Error
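A small sketch of how a node's posterior probabilities turn into a classification; the counts are hypothetical.

```python
responders, non_responders = 18, 7           # observations in one leaf (hypothetical)
total = responders + non_responders

posterior = {"responder": responders / total,          # 0.72
             "non-responder": non_responders / total}  # 0.28

label = max(posterior, key=posterior.get)
print(label, posterior[label])   # the leaf is classified by the larger posterior
```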
16. Creating a Decision Tree Model in Enterprise Miner
- Open the bankrupt project, and create a new diagram called Bankrupt_DecTree
- Drag and drop the bankrupt data node and the Decision Tree node (from the Model tab) onto the diagram
- Connect the nodes
17. Select ProbChisq for the Criterion under Splitting Rule. Change Use Input Once to Yes (otherwise, the same variable can appear more than once in the tree).
18. Under Subtree, select Misclassification for Assessment Measure. Keep the defaults under P-Value Adjustment and Output Variables. Under Score, set Variable Selection to No (otherwise, variables with importance values greater than 0.05 are set as rejected and not considered by the tree).
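Enterprise Miner's ProbChisq splitting is specific to that tool, but a roughly analogous tree can be sketched with scikit-learn (a swapped-in library, not what the slides use); the file name and column names below are hypothetical stand-ins for the bankrupt data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical stand-in for the bankrupt data set used in the slides
data = pd.read_csv("bankrupt.csv")
X = data[["RE/TA", "EBIT/TA"]]        # assumed predictor columns
y = data["bankrupt"]                   # assumed binary target

# entropy-based splitting as a rough analogue; scikit-learn has no ProbChisq criterion
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5, max_depth=6)
tree.fit(X, y)
print(tree.score(X, y))   # training accuracy = 1 - misclassification rate
```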
19. The Decision Tree has only one split, on RE/TA. The misclassification rate is 0.15 (3/20), with 2 false negatives and 1 false positive. The cumulative lift is somewhat lower than the best cumulative lift, starting out at 1.777 vs. the best value of 2.000.
20. Under Subtree, set Method to Largest and rerun. The results show that another split is added, using EBIT/TA. However, the misclassification rate is unchanged at 0.15. This result shows that setting Method to Assessment and Assessment Measure to Misclassification finds the smallest tree having the lowest misclassification rate.
21. Model Comparison
- The Model Comparison node under the Assess tab can be used to compare several different models
- Create a diagram called Full Model that includes the bankrupt data node connected into the Regression, Decision Tree, and Neural Network nodes
- Connect the three model nodes into the Model Comparison node, and connect it and the bankrupt_score data node into a Score node
22. For Regression, set Selection Model to None; for Neural Network, set Model Selection Criterion to Average Error and the Network properties as before; for Decision Tree, set Assessment Measure to Average Squared Error and the other properties as before. This puts each of the models on a similar basis for fit. For Model Comparison, set Selection Criterion to Average Squared Error.
23. Neural Network is selected, although Regression is nearly identical in average squared error. The Receiver Operating Characteristic (ROC) curve shows sensitivity (true positives) vs. 1-specificity (false positives) for various cutoff probabilities of a response. The chart shows that no matter what the cutoff probabilities are, Regression and Neural Network classify 100% of responders as responders (sensitivity) and 0% of non-responders as responders (1-specificity). Decision Tree performs reasonably well, as indicated by the area above the diagonal line.
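The ROC curve described here can be sketched with scikit-learn (again an assumption about tooling, not what the slides use); y_true and the predicted probabilities are placeholder values.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                    # placeholder actual responses
y_prob = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.65, 0.1]   # placeholder predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)     # 1-specificity vs. sensitivity
print(roc_auc_score(y_true, y_prob))                 # area under the ROC curve
```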