Title: Decision Trees: Overview
Slide 1: DSCI 4520/5240 Data Mining, Fall 2013
Dr. Nick Evangelopoulos
- Lecture 4
- Decision Trees Overview
Some slide material taken from SAS Education
Slide 2: On the News: The Rise of the Numerati
BusinessWeek, Sep 8, 2008. With the explosion of data from the Internet, cell phones, and credit cards, the people who can make sense of it all are changing our world. An excerpt from the introduction of the book The Numerati by Stephen Baker:
Imagine you're in a café, perhaps the noisy one
I'm sitting in at this moment. A young woman at a
table to your right is typing on her laptop. You
turn your head and look at her screen. She surfs
the Internet. You watch. Hours pass. She reads an
online newspaper. You notice that she reads three
articles about China. She scouts movies for
Friday night and watches the trailer for Kung Fu
Panda. She clicks on an ad that promises to
connect her to old high school classmates. You
sit there taking notes. With each passing minute,
you're learning more about her. Now imagine that
you could watch 150 million people surfing at the
same time. That's what is happening today in the business world.
Slide 3: On the News: The Rise of the Numerati
By building mathematical models of its own
employees, IBM aims to improve productivity and
automate management.
In 2005, IBM embarked on research to harvest massive data on employees and to build mathematical models of 50,000 of the company's consultants. The goal was to optimize them, using operations research, so that they could be deployed with ever more efficiency. Data on IBM employees include:
- Allergies
- Number of interns managed
- Client visits
- Computer languages
- Number of words per e-mail
- Amount spent entertaining clients
- Number of weekends worked
- Time spent in meetings
- Social network participation
- Time spent surfing the Web
- Response time to e-mails
- Amount of sales
- Marital status
- Ratio of personal to work e-mails
Slide 4: Agenda
- Introduce the concept of the Curse of Dimensionality
- Benefits and pitfalls of decision tree modeling
- Consequences of a decision
Slide 5: The Curse of Dimensionality
The dimension of a problem refers to the number of input variables (more precisely, degrees of freedom). Data mining problems are often massive in both the number of cases and the dimension.
[Figure: the same eight points plotted in 1D, 2D, and 3D]
The curse of dimensionality refers to the exponential increase in data required to densely populate space as the dimension increases. For example, the eight points fill the one-dimensional space but become more separated as the dimension increases. In 100-dimensional space, they would be like distant galaxies. The curse of dimensionality limits our practical ability to fit a flexible model to noisy (real) data when there are a large number of input variables, because a densely populated input space is required to fit highly complex models.
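To make the "distant galaxies" point concrete, here is a minimal NumPy sketch (the point count and dimension values are illustrative choices, not from the slides) that draws the same number of points uniformly in the unit cube and tracks how their average pairwise distance grows with the dimension:

```python
# Distance concentration: the same eight points, drawn uniformly in
# [0,1]^d, drift apart as the dimension d grows.
import numpy as np

rng = np.random.default_rng(0)
n_points = 8
for d in (1, 2, 3, 10, 100):
    x = rng.uniform(size=(n_points, d))
    # pairwise Euclidean distances between the eight points
    diffs = x[:, None, :] - x[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    upper = dists[np.triu_indices(n_points, k=1)]
    print(f"d={d:4d}  mean pairwise distance = {upper.mean():.2f}")
```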
Slide 6: Addressing the Curse of Dimensionality: Reduce the Dimensions
- Redundancy
- Irrelevancy
The two principal reasons for eliminating a variable are redundancy and irrelevancy. A redundant input does not give any new information that has not already been explained; useful methods for detecting redundancy include principal components, factor analysis, and variable clustering. An irrelevant input is not useful in explaining variation in the target. Interactions and partial associations make irrelevancy more difficult to detect than redundancy. It is often useful to eliminate redundant dimensions first and then tackle irrelevancy.
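A hedged illustration of redundancy detection, assuming NumPy and scikit-learn are available: in the sketch below, two of four synthetic inputs are noisy copies of the others, and principal components reveal that the data carry only about two dimensions of information. The variable names are invented for the example.

```python
# Redundancy via principal components: near-copies of existing inputs
# add almost no variance beyond the first two components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
n = 500
income = rng.normal(50, 10, n)
debt = rng.normal(20, 5, n)
X = np.column_stack([
    income,
    debt,
    income + rng.normal(0, 1, n),   # redundant: near-copy of income
    debt + rng.normal(0, 1, n),     # redundant: near-copy of debt
])
pca = PCA().fit(X)
# the first two components explain almost all the variance, suggesting
# the four inputs carry only two dimensions of information
print(pca.explained_variance_ratio_.round(3))
```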
Slide 7: Model Complexity
[Figure: two fitted curves, one too flexible and one not flexible enough]
A naïve modeler might assume that the most complex model should always outperform the others, but this is not the case. An overly complex model might be too flexible, leading to overfitting: accommodating nuances of the random noise in the particular sample (high variance). A model with just enough flexibility gives the best generalization.
Slide 8: Overfitting
[Figure: an overfit model shown on the training set and on the test set]
Slide 9: Better Fitting
[Figure: a better-fitting model shown on the training set and on the test set]
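The trade-off pictured on slides 7 through 9 can be reproduced with a small experiment, sketched below under the assumption that scikit-learn is available: an unpruned decision tree memorizes the training sample, while a depth-limited tree gives up some training accuracy but generalizes better. The data and parameter values are arbitrary illustrations.

```python
# An unpruned tree (max_depth=None) overfits noisy data; a shallow tree
# scores worse on training data but better on the held-out test set.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for depth in (None, 3):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    print(f"max_depth={depth}:  train={tree.score(X_tr, y_tr):.2f}  "
          f"test={tree.score(X_te, y_te):.2f}")
```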
Slide 10: The Cultivation of Trees
- Split Search
- Which splits are to be considered?
- Splitting Criterion
- Which split is best?
- Stopping Rule
- When should the splitting stop?
- Pruning Rule
- Should some branches be lopped off?
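As a toy illustration of how the four questions fit together, here is a minimal, runnable sketch of recursive partitioning on one numeric input and a binary target, using the Gini index as the splitting criterion. It is not any particular commercial algorithm, it omits pruning (covered on slides 14 and 15), and all names and parameter values are invented for the example.

```python
import numpy as np

def gini(y):
    p = y.mean() if len(y) else 0.0
    return 2 * p * (1 - p)

def best_split(x, y):
    """Split Search + Splitting Criterion: try midpoints, keep the best."""
    best = (None, gini(y))                      # (threshold, impurity)
    for t in (np.unique(x)[:-1] + np.unique(x)[1:]) / 2:
        l, r = y[x <= t], y[x > t]
        worth = (len(l) * gini(l) + len(r) * gini(r)) / len(y)
        if worth < best[1]:
            best = (t, worth)
    return best[0]

def grow(x, y, depth=0, max_depth=2, min_size=20):
    # Stopping Rule: node is pure, too small, or tree is too deep
    if depth == max_depth or len(y) < min_size or gini(y) == 0:
        return round(y.mean(), 2)               # leaf: predicted P(target=1)
    t = best_split(x, y)
    if t is None:                               # no worthwhile split found
        return round(y.mean(), 2)
    return {"x <= %.2f" % t: grow(x[x <= t], y[x <= t], depth + 1),
            "x >  %.2f" % t: grow(x[x > t], y[x > t], depth + 1)}

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, 300)                    # e.g., a debt-to-income ratio
y = (x > 45).astype(int) ^ (rng.uniform(size=300) < 0.1)   # noisy target
print(grow(x, y))
```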
Slide 11: Possible Splits to Consider: an Enormous Number
[Figure: number of possible splits, up to about 500,000, plotted against the number of input levels (1 to 20) for a nominal input and an ordinal input]
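The counts behind the chart can be reproduced directly. For a binary split of a nominal input with L levels, any grouping of the levels into two nonempty sets is a candidate, giving 2^(L-1) - 1 splits; an ordinal input must respect the ordering, leaving only L - 1 cut points. At L = 20 the nominal count is 524,287, which matches the roughly 500,000 scale of the chart.

```python
# Candidate binary splits as a function of the number of input levels L.
for L in (2, 4, 8, 16, 20):
    nominal = 2 ** (L - 1) - 1    # any two-set grouping of the levels
    ordinal = L - 1               # only order-respecting cut points
    print(f"L={L:2d}  nominal splits={nominal:7d}  ordinal splits={ordinal:2d}")
```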
Slide 12: Splitting Criteria
How is the best split determined? In some situations, the worth of a split is obvious. If the expected target is the same in the child nodes as in the parent node, no improvement was made, and the split is worthless! In contrast, if a split results in pure child nodes, the split is indisputably best. For classification trees, the three most widely used splitting criteria are based on the Pearson chi-squared test, the Gini index, and entropy. All three measure the difference in class distributions across the child nodes, and they usually give similar results.
Slide 13: Splitting Criteria
A two-way split on Debt-to-Income Ratio < 45:

             Left    Right   Total
  Not Bad    3196    1304    4500
  Bad         154     346     500

A competing three-way split:

             Left    Center  Right   Total
  Not Bad    2521    1188     791    4500
  Bad         115     162     223     500

A perfect split:

             Left    Right   Total
  Not Bad    4500       0    4500
  Bad           0     500     500
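A short sketch, assuming NumPy and SciPy are available, that applies two of the slide-12 criteria to these tables; each node is a (not bad, bad) count pair taken from above. A larger chi-squared statistic and a larger Gini reduction both indicate a more worthwhile split, and the perfect split dominates as expected.

```python
# Worth of each candidate split: Gini reduction and Pearson chi-squared.
import numpy as np
from scipy.stats import chi2_contingency

def gini(counts):
    p = np.asarray(counts, float) / sum(counts)
    return 1 - (p ** 2).sum()

def worth(nodes):
    n = sum(sum(node) for node in nodes)
    parent = [sum(c) for c in zip(*nodes)]          # class totals in parent
    child = sum(sum(node) / n * gini(node) for node in nodes)
    chi2 = chi2_contingency(np.array(nodes).T, correction=False)[0]
    return gini(parent) - child, chi2

for name, nodes in [("two-way",   [(3196, 154), (1304, 346)]),
                    ("three-way", [(2521, 115), (1188, 162), (791, 223)]),
                    ("perfect",   [(4500, 0), (0, 500)])]:
    dg, chi2 = worth(nodes)
    print(f"{name:9s}  Gini reduction={dg:.4f}  chi-squared={chi2:8.1f}")
```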
Slide 14: Controlling Tree Growth: Stunting
A universally accepted rule is to stop growing if the node is pure. Two other popular rules for stopping tree growth are to stop if the number of cases in a node falls below a specified limit, or to stop when the split is not statistically significant at a specified level. This is called pre-pruning, or stunting.
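In scikit-learn's trees, stunting corresponds to pre-pruning parameters set before growth. The sketch below uses a minimum node size and a minimum impurity improvement; scikit-learn has no chi-squared significance stop, so the impurity threshold is only an analogue, and the parameter values are arbitrary examples.

```python
# Pre-pruning (stunting): growth stops when a rule would be violated.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
stunted = DecisionTreeClassifier(
    min_samples_leaf=50,         # stop: node would fall below the size limit
    min_impurity_decrease=0.01,  # stop: split does not improve enough
    random_state=0,
).fit(X, y)
print("leaves in the stunted tree:", stunted.get_n_leaves())
```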
Slide 15: Controlling Tree Growth: Pruning
Pruning (also called post-pruning) creates a sequence of trees of increasing complexity. An assessment criterion is needed for deciding on the best (sub)tree. The assessment criteria are usually based on performance on holdout samples (validation data, or cross-validation). Cost or profit considerations can be incorporated into the assessment.
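A sketch of post-pruning with minimal cost-complexity pruning in scikit-learn: grow a large tree, compute the nested sequence of subtrees indexed by alpha, and keep the subtree that performs best on a holdout sample. Data and settings are illustrative.

```python
# Post-pruning: pick the subtree that assesses best on validation data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, flip_y=0.2, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

# the pruning path gives one alpha per subtree in the nested sequence
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_tr, y_tr)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),        # assess on holdout data
)
print("chosen subtree:", best.get_n_leaves(), "leaves,",
      f"validation accuracy={best.score(X_val, y_val):.2f}")
```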
Slide 16: Benefits of Trees
- Interpretability
- tree-structured presentation
- Mixed Measurement Scales
- nominal, ordinal, interval
- Regression trees
- Robustness
- Missing Values
Slide 17: Benefits of Trees
- Automatically:
  - detects interactions (AID)
  - accommodates nonlinearity
  - selects input variables
[Figure: a tree's fitted model as a multivariate step function]
Slide 18: Drawbacks of Trees
- Roughness
- Poor fit to linear, main-effects structure
- Instability
Slide 19: Building and Interpreting Decision Trees
- Explore the types of decision tree models available in Enterprise Miner.
- Build a decision tree model.
- Examine the model results and interpret these results.
- Choose a decision threshold theoretically and empirically.
Slide 20: Consequences of a Decision

             Decision 1        Decision 0
  Actual 1   True Positive     False Negative
  Actual 0   False Positive    True Negative
Slide 21: Example
- Recall the home equity line of credit scoring
example. Presume that every two dollars loaned
eventually returns three dollars if the loan is
paid off in full.
Slide 22: Consequences of a Decision

             Decision 1                   Decision 0
  Actual 1   True Positive                False Negative (cost = 2)
  Actual 0   False Positive (cost = 1)    True Negative
Slide 23: Bayes Rule: Optimal Threshold
- Using the cost structure defined for the home equity example, the optimal threshold is 1/(1 + (2/1)) = 1/3. That is,
- reject all applications whose predicted probability of default exceeds 0.33.
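A worked check of that arithmetic: approving an eventual defaulter loses the two dollars loaned, while rejecting a good applicant forgoes the one dollar of profit (two dollars loaned return three). Rejecting is the better decision exactly when the predicted default probability exceeds 1/3.

```python
# Bayes-rule threshold from the cost structure of slide 22.
cost_fn = 2.0   # approve an eventual defaulter: lose the 2 dollars loaned
cost_fp = 1.0   # reject a good applicant: forgo the 1 dollar profit
threshold = 1 / (1 + cost_fn / cost_fp)
print(f"reject when P(default) > {threshold:.3f}")   # 0.333

for p in (0.2, 0.4):
    approve = p * cost_fn          # expected cost of approving
    reject = (1 - p) * cost_fp     # expected cost of rejecting
    print(f"P(default)={p}: approve costs {approve:.2f},"
          f" reject costs {reject:.2f}")
```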
Slide 24: Consequences of a Decision: Profit Matrix (SAS EM)

             Decision 1                     Decision 0
  Actual 1   True Positive (profit = 2)     False Negative
  Actual 0   False Positive (profit = -1)   True Negative
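The profit-matrix formulation reaches the same decisions by maximizing expected profit rather than minimizing expected cost. A minimal sketch follows; the matrix values are the slide's, and everything else is illustrative.

```python
# For each case, choose the decision with the highest expected profit.
import numpy as np

profit = np.array([[2.0, 0.0],    # actual 1: profit of decision 1, decision 0
                   [-1.0, 0.0]])  # actual 0: profit of decision 1, decision 0

for p in (0.2, 0.4, 0.6):                       # P(actual = 1) for a case
    expected = p * profit[0] + (1 - p) * profit[1]
    decision = 1 - int(np.argmax(expected))     # column 0 is decision 1
    print(f"P(default)={p}: E[profit] per decision={expected.round(2)},"
          f" choose decision {decision}")
# decision flips to 1 once P(default) exceeds 1/3, matching slide 23
```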
Slide 25: Decision Tree Algorithms
- Read Lecture 5 notes (Tree Algorithms) before coming to class next week.
- Focus on rule induction using entropy and information gain.