Title: Decision Tree Models in Data Mining
1. Decision Tree Models in Data Mining
- Matthew J. Liberatore
- Thomas Coghlan
2. Decision Trees in Data Mining
- Decision trees can be used to predict a categorical or a continuous target (called regression trees in the latter case)
- Like logistic regression and neural networks, decision trees can be applied for classification and prediction
- Unlike these methods, no equations are estimated
- A tree structure of rules over the input variables is used to classify or predict the cases according to the target variable
- The rules are of an IF-THEN form, for example:
- If Risk = Low, then predict on-time payment of a loan
3. Decision Tree Approach
- A decision tree represents a hierarchical segmentation of the data
- The original segment is called the root node and is the entire data set
- The root node is partitioned into two or more segments by applying a series of simple rules over an input variable
- For example: risk low, risk not low
- Each rule assigns the observations to a segment based on its input value
- Each resulting segment can be further partitioned into sub-segments, and so on
- For example, risk low can be partitioned into income low and income not low
- The segments are also called nodes, and the final segments are called leaf nodes or leaves
4. Decision Tree Example: Loan Payment
- Root split on Income: < 30k vs. > 30k
- If Income < 30k, split on Age: Age < 25 -> not on-time; Age > 25 -> on-time
- If Income > 30k, split on Credit Score: Score < 600 -> not on-time; Score > 600 -> on-time
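The IF-THEN rule structure of this example tree can be restated as a small sketch in code; the thresholds and field names (income, age, credit_score) come from the slide above and are purely illustrative.

```python
def predict_on_time(income, age, credit_score):
    """Illustrative restatement of the example tree's IF-THEN rules."""
    if income < 30_000:          # root split on Income
        if age < 25:             # left branch splits on Age
            return "not on-time"
        return "on-time"
    else:                        # Income > 30k: right branch splits on Credit Score
        if credit_score < 600:
            return "not on-time"
        return "on-time"

print(predict_on_time(income=25_000, age=30, credit_score=550))  # -> "on-time"
```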
5. Growing the Decision Tree
- Growing the tree involves successively (recursively) partitioning the data
- If an input variable is binary, then the two categories can be used to split the data
- If an input variable is interval, a splitting value is used to classify the data into two segments
- For example, if household income is interval and there are 100 possible incomes in the data set, then there are 100 possible splitting values
- For example, income < 30k and income > 30k
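As a rough sketch of how candidate splitting values arise for an interval input, each distinct observed value can serve as a cut point; the variable name and data below are hypothetical.

```python
incomes = [18_000, 22_000, 30_000, 45_000, 61_000, 30_000]

# Each distinct observed value is a candidate cut point of the form
# income < v  vs.  income >= v
candidate_splits = sorted(set(incomes))
print(candidate_splits)   # [18000, 22000, 30000, 45000, 61000]
```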
6. Evaluating the Partitions
- When the target is categorical, for each partition of an input variable a chi-square statistic is computed
- A contingency table is formed that maps responders and non-responders against the partitioned input variable
- For example, the null hypothesis might be that there is no difference between people with income < 30k and those with income > 30k in making an on-time loan payment
- The lower the p-value, the more likely we are to reject this hypothesis, meaning that this income split is a discriminating factor
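A minimal sketch of this test using scipy; the 2x2 counts are made-up numbers standing in for on-time and not-on-time payers in each income segment.

```python
from scipy.stats import chi2_contingency

# rows: payment on-time / not on-time; columns: income < 30k / income > 30k
table = [[14, 36],   # on-time
         [26, 24]]   # not on-time

chi2, p_value, dof, expected = chi2_contingency(table)
print(chi2, p_value)   # a small p-value suggests the income split discriminates
```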
7. Contingency Table

                        < 30k    > 30k    Total
  Payment on-time
  Payment not on-time
  Total
8. Chi-Square Statistic
- The chi-square statistic measures how different the number of observations in each of the four cells is from the expected number
- The p-value associated with the null hypothesis is computed
- Enterprise Miner then computes the logworth of the p-value: logworth = -log10(p-value)
- The split that generates the highest logworth for a given input variable is selected
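A sketch of the logworth calculation over a set of candidate splits of one input; the chi-square call and the 2x2 counts are the same kind of assumption as above.

```python
import numpy as np
from scipy.stats import chi2_contingency

def logworth(table):
    """logworth = -log10(p-value) of the chi-square test for one candidate split."""
    _, p_value, _, _ = chi2_contingency(table)
    return -np.log10(p_value)

# hypothetical 2x2 tables for two candidate cut points of the same input
split_tables = {"income < 30k": [[14, 36], [26, 24]],
                "income < 45k": [[22, 28], [30, 20]]}

best = max(split_tables, key=lambda s: logworth(split_tables[s]))
print(best)   # the split with the highest logworth is kept for this variable
```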
9. Growing the Tree
- In our loan payment example, we have three interval-valued input variables: income, age, and credit score
- We compute the logworth of the best split for each of these variables
- We then select the variable that has the highest logworth and use its split (suppose it is income)
- Under each of the two income nodes, we then find the logworth of the best split of age and credit score, and continue the process
- subject to meeting the threshold on the significance of the chi-square value for splitting and other stopping criteria (described later)
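Continuing the sketch, the same comparison is repeated across the three input variables, and the one with the highest logworth wins the split; the numbers here are invented.

```python
# best logworth found for each input variable (hypothetical values)
best_logworth = {"income": 3.2, "age": 1.4, "credit_score": 2.1}

split_variable = max(best_logworth, key=best_logworth.get)
print(split_variable)   # "income" -> the tree splits on income first,
                        # then the search repeats inside each income node
```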
10. Other Splitting Criteria for a Categorical Target
- The Gini and entropy measures are based on how heterogeneous the observations are at a given node
- relates to the mix of responders and non-responders at the node
- Let p1 and p0 represent the proportion of responders and non-responders at a node, respectively
- If two observations are chosen (with replacement) from a node, the probability that they are either both responders or both non-responders is p1^2 + p0^2
- The Gini index = 1 - p1^2 - p0^2, the probability that the two observations are different
- Best case is a Gini index of 0 (all observations are the same)
- An index of ½ means both groups are equally represented
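A small sketch of the Gini index as defined above, for a node described only by its responder proportion.

```python
def gini_index(p1):
    """Gini = 1 - p1^2 - p0^2 for a binary target; 0 = pure node, 0.5 = even mix."""
    p0 = 1.0 - p1
    return 1.0 - p1**2 - p0**2

print(gini_index(1.0))   # 0.0 -> all observations the same (best case)
print(gini_index(0.5))   # 0.5 -> responders and non-responders equally represented
```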
11. Other Splitting Criteria for a Categorical Target
- The rarity of an event with proportion pi is defined as -log2(pi)
- Entropy sums up the rarity of response and non-response over all observations: Entropy = -p1·log2(p1) - p0·log2(p0)
- Entropy ranges from the best case of 0 (all responders or all non-responders) to 1 (an equal mix of responders and non-responders)
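The entropy measure, sketched in the same style; log base 2 keeps the value between 0 and 1 for a binary target.

```python
import math

def entropy(p1):
    """Entropy = -p1*log2(p1) - p0*log2(p0); 0 = pure node, 1 = equal mix."""
    p0 = 1.0 - p1
    terms = [p * -math.log2(p) for p in (p1, p0) if p > 0]  # treat 0*log(0) as 0
    return sum(terms)

print(entropy(0.5))   # 1.0 -> worst case, equal mix
print(entropy(1.0))   # 0.0 -> best case, all one class
```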
12. Splitting Criteria for a Continuous (Interval) Target
- An F-statistic is used to measure the degree of separation of a split for an interval target, such as revenue
- Similar to the sum-of-squares discussion under multiple regression, the F-statistic is based on the ratio of the sum of squares between the groups to the sum of squares within the groups, both adjusted for the number of degrees of freedom
- The null hypothesis is that there is no difference in the target mean between the two groups
- As before, the logworth of the p-value is computed
13. Some Adjustments
- The more possible splits of an input variable, the less accurate the p-value (a greater chance of falsely rejecting the null hypothesis)
- If there are m splits, the Bonferroni adjustment adjusts the p-value of the best split by subtracting log10(m) from the logworth
- If the Time of Kass Adjustment property is set to Before, then the p-values of the candidate splits are compared with the Bonferroni adjustment applied
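A worked sketch of the Bonferroni adjustment described above; m and the p-value are invented.

```python
import numpy as np

p_value = 0.001   # p-value of the best candidate split (hypothetical)
m = 100           # number of candidate splits examined for this input

logworth = -np.log10(p_value)               # 3.0
adjusted_logworth = logworth - np.log10(m)  # 3.0 - 2.0 = 1.0
print(adjusted_logworth)
```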
14. Some Adjustments
- Setting the Split Adjustment property to Yes means that the significance of the p-value can be adjusted by the depth of the tree
- For example, at the fourth split, a calculated p-value of 0.04 becomes 0.04 × 2^4 = 0.64, making the split statistically insignificant (see the sketch after this list)
- This leads to rejecting more splits, limiting the size of the tree
- Tree growth can also be controlled by setting:
- Leaf Size property (minimum number of observations in a leaf)
- Split Size property (minimum number of observations to allow a node to be split)
- Maximum Depth property (maximum number of generations of nodes)
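The depth adjustment in the example above can be sketched the same way: following the slide, the p-value is multiplied by 2 raised to the depth of the split, which here pushes 0.04 past a 0.05 threshold.

```python
p_value = 0.04
depth = 4

adjusted_p = p_value * 2**depth   # 0.04 * 16 = 0.64
print(adjusted_p > 0.05)          # True -> the split is rejected as insignificant
```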
15. Some Results
- The posterior probabilities are the proportions of responders and non-responders at each node
- A node is classified as a responder or non-responder depending on which posterior probability is the largest
- In selecting the best tree, one can use Misclassification, Lift, or Average Squared Error
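A small sketch of how a node's posterior probabilities turn into a classification; the counts are hypothetical.

```python
responders, non_responders = 18, 7           # observations in one leaf (hypothetical)
total = responders + non_responders

posterior = {"responder": responders / total,          # 0.72
             "non-responder": non_responders / total}  # 0.28

label = max(posterior, key=posterior.get)
print(label, posterior[label])   # the leaf is classified by the larger posterior
```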
16. Creating a Decision Tree Model in Enterprise Miner
- Open the bankrupt project, and create a new diagram called Bankrupt_DecTree
- Drag and drop the bankrupt data node and the Decision Tree node (from the Model tab) onto the diagram
- Connect the nodes
17. Select ProbChisq for the Criterion under Splitting Rule. Change Use Input Once to Yes (otherwise, the same variable can appear more than once in the tree).
18. Under Subtree, select Misclassification for Assessment Measure. Keep the defaults under P-Value Adjustment and Output Variables. Under Score, set Variable Selection to No (otherwise, variables with importance values greater than 0.05 are set as rejected and not considered by the tree).
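Enterprise Miner's ProbChisq splitting is specific to that tool, but a roughly analogous tree can be sketched with scikit-learn (a swapped-in library, not what the slides use); the file name and column names below are hypothetical stand-ins for the bankrupt data.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# hypothetical stand-in for the bankrupt data set used in the slides
data = pd.read_csv("bankrupt.csv")
X = data[["RE/TA", "EBIT/TA"]]        # assumed predictor columns
y = data["bankrupt"]                   # assumed binary target

# entropy-based splitting as a rough analogue; scikit-learn has no ProbChisq criterion
tree = DecisionTreeClassifier(criterion="entropy", min_samples_leaf=5, max_depth=6)
tree.fit(X, y)
print(tree.score(X, y))   # training accuracy = 1 - misclassification rate
```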
19. The Decision Tree has only one split, on RE/TA. The misclassification rate is 0.15 (3/20), with 2 false negatives and 1 false positive. The cumulative lift is somewhat lower than the best cumulative lift, starting out at 1.777 vs. the best value of 2.000.
20. Under Subtree, set Method to Largest and rerun. The results show that another split is added, using EBIT/TA. However, the misclassification rate is unchanged at 0.15. This result shows that setting Method to Assessment and Assessment Measure to Misclassification finds the smallest tree having the lowest misclassification rate.
21. Model Comparison
- The Model Comparison node under the Assess tab can be used to compare several different models
- Create a diagram called Full Model that includes the bankrupt data node connected into the Regression, Decision Tree, and Neural Network nodes
- Connect the three model nodes into the Model Comparison node, and connect it and the bankrupt_score data node into a Score node
22. For Regression, set Selection Model to None; for Neural Network, set Model Selection Criterion to Average Error and the Network properties as before; for Decision Tree, set Assessment Measure to Average Squared Error and the other properties as before. This puts each of the models on a similar basis for fit. For Model Comparison, set Selection Criterion to Average Squared Error.
23. Neural Network is selected, although Regression is nearly identical in average squared error. The Receiver Operating Characteristic (ROC) curve shows sensitivity (true positives) vs. 1-specificity (false positives) for various cutoff probabilities of a response. The chart shows that no matter what the cutoff probabilities are, Regression and Neural Network classify 100% of responders as responders (sensitivity) and 0% of non-responders as responders (1-specificity). Decision Tree performs reasonably well, as indicated by the area above the diagonal line.
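The ROC curve described here can be sketched with scikit-learn (again an assumption about tooling, not what the slides use); y_true and the predicted probabilities are placeholder values.

```python
from sklearn.metrics import roc_curve, roc_auc_score

y_true = [1, 1, 1, 0, 0, 0, 1, 0]                    # placeholder actual responses
y_prob = [0.9, 0.8, 0.7, 0.6, 0.3, 0.2, 0.65, 0.1]   # placeholder predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_prob)     # 1-specificity vs. sensitivity
print(roc_auc_score(y_true, y_prob))                 # area under the ROC curve
```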