Title: Data Mining With Decision Trees
1 Data Mining With Decision Trees
- Craig A. Struble, Ph.D.
- Marquette University
2 Overview
- Decision Trees
- Rules and Language Bias
- Constructing Decision Trees
- Some Analyses
- Heuristics
- Quality Assessment
- Extensions
3 Goals
- Explore the complete data mining process
- Understand decision trees as a model
- Understand how to construct a decision tree
- Recognize the language bias, search bias, and overfitting avoidance bias for decision trees
- Be able to assess the performance of decision trees
4 Decision Trees
- A graph (tree) based model used primarily for classification
- Extensively studied
- Quinlan is the primary contributor to the field
- Applications are wide ranging
- Data mining
- Aircraft flying
- Medical diagnosis
- Etc.
5 Decision Trees
6 What kind of data?
- Initially, we will restrict the data to having only nominal values
- We'll explore numeric/continuous values later
- The number of attributes doesn't matter
- Beware of the curse of dimensionality, though
- We'll see this later
7 Classification Rules
- It is relatively straightforward to convert a decision tree into a set of rules for classification (a sketch of the conversion follows below)
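The following is a minimal, illustrative sketch of that conversion, not taken from the slides: the Node structure, attribute names, and leaf labels are assumptions chosen only to show how each root-to-leaf path becomes one rule.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str = None                           # attribute tested here (None at a leaf)
    children: dict = field(default_factory=dict)    # attribute value -> child Node
    label: str = None                               # class label if this is a leaf

def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit one classification rule per leaf."""
    if node.label is not None:                      # leaf: the accumulated path is the rule body
        lhs = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"if {lhs} then {node.label}"]
    rules = []
    for value, child in node.children.items():
        rules += tree_to_rules(child, conditions + ((node.attribute, value),))
    return rules

# Toy two-level tree over the restaurant attributes Alt and Res (made up for illustration).
tree = Node("Alt", {
    "No":  Node(label="Yes"),
    "Yes": Node("Res", {"Yes": Node(label="Yes"), "No": Node(label="No")}),
})
for rule in tree_to_rules(tree):
    print(rule)
```

Each leaf contributes exactly one rule whose conditions are the attribute tests on the path leading to it.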
8 Language Bias
- Decision trees are restricted to functions that can be represented by rules of the form
- if X and Y then A
- if X and W and V then B
- if Y and V then A
- That is, decision trees represent collections of implications
- The rules can be combined with or
- if Y and (X or V) then A
9 Language Bias
- Examples of functions not well represented by decision trees
- Parity: output is true if an even number of attributes are true
- Majority: output is true if more than half of the attributes are true
10 Propositional Logic
- Essentially, decision trees can represent any function in propositional logic
- A, B, C: propositional variables
- and, or, not, → (implies), ↔ (equivalent): connectives
- A proposition is a statement that is either true or false
- "The sky is blue." (color of sky = blue)
- Hence, decision trees are an example of a propositional learner.
11 Constructing Decision Trees
12 Select an Attribute
[Figure: the attribute Alt is selected as the first test]
13 Partition The Data
[Figure: splitting on Alt partitions the instances; the No branch receives instances 3,6,7,8,9,11 and the Yes branch receives instances 1,2,4,5,10,12]
14 Select Next Attribute
[Figure: under the Alt = Yes branch (instances 1,2,4,5,10,12), the attribute Res is selected next; its Yes branch receives instances 1,5,10 and its No branch receives instances 2,4,12. The Alt = No branch still holds instances 3,6,7,8,9,11]
15 Continue Selecting Attributes
[Figure: the tree so far (Alt at the root, Res under Alt = Yes) is extended by splitting the Res = Yes subset (instances 1,5,10) on Fri; the Fri = Yes branch (instances 5,10) becomes a No leaf and the Fri = No branch (instance 1) becomes a Yes leaf]
This process continues along a subtree until all instances have the same label.
16 Basic Algorithm
algorithm LearnDecisionTree(examples, attributes, default) returns a decision tree
  inputs: examples, a set of examples
          attributes, a set of attributes
          default, default value for the goal attribute
  if examples is empty then return default
  else if all examples have the same value for the goal attribute then return that value
  else
    best ← ChooseAttribute(attributes, examples)
    tree ← a new decision tree with root test best
    for each value vi of best do
      examplesi ← elements of examples with best = vi
      subtree ← LearnDecisionTree(examplesi, attributes − best, MajorityValue(examples))
      add a branch to tree with label vi and subtree subtree
    return tree
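A runnable Python sketch of this algorithm, under a few assumptions not in the slides: examples are dicts mapping attribute names to values, the goal attribute is passed explicitly, and ChooseAttribute is supplied as a parameter so any heuristic (such as information gain) can be plugged in.

```python
from collections import Counter

def majority_value(examples, goal):
    """Most common value of the goal attribute among the examples."""
    return Counter(e[goal] for e in examples).most_common(1)[0][0]

def learn_decision_tree(examples, attributes, default, goal, choose_attribute):
    # examples: list of dicts mapping attribute name -> value (including the goal)
    # attributes: non-goal attribute names still available for testing
    if not examples:
        return default                                   # leaf: default value
    values = {e[goal] for e in examples}
    if len(values) == 1:
        return values.pop()                              # leaf: all examples agree
    if not attributes:
        return majority_value(examples, goal)            # guard added here: no tests left
    best = choose_attribute(attributes, examples, goal)  # heuristic choice
    tree = (best, {})                                    # internal node: (test, branches)
    for vi in {e[best] for e in examples}:
        examples_i = [e for e in examples if e[best] == vi]
        subtree = learn_decision_tree(
            examples_i,
            [a for a in attributes if a != best],
            majority_value(examples, goal),
            goal,
            choose_attribute,
        )
        tree[1][vi] = subtree
    return tree
```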
17 Analysis of Basic Algorithm
- Let m be the number of attributes
- Let n be the number of instances
- Assumption: depth of the tree is O(log n)
- For each level of the tree, all n instances are considered (best = vi)
- O(n log n) work for a single attribute over the entire tree
- Total cost is O(mn log n) since all attributes are eventually considered.
18 How Many Possible Decision Trees?
- Assume a set of m non-goal boolean attributes
- We can construct a decision tree for each boolean function over the m non-goal attributes
- There are 2^m possible ways to assign values to the attributes (rows of a truth table)
- The number of different functions is the number of subsets of the rows: assign the rows in the subset a value of true.
- So, there must be 2^(2^m) possible decision trees! (a quick numeric check follows below)
- How do we select the best one?
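As a quick check of that count (my own illustration, not from the slides):

```python
# Number of truth-table rows and boolean functions for m boolean attributes.
for m in range(1, 7):
    rows = 2 ** m
    functions = 2 ** rows          # one function per subset of rows
    print(f"m={m}: {rows} rows, {functions} distinct boolean functions")
# m=6 already gives 2**64, roughly 1.8e19 functions, so exhaustive search is hopeless.
```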
19 Applying Heuristics
- In the basic algorithm, the ChooseAttribute function makes an arbitrary choice of an attribute to build the tree.
- We can make this function try to choose the "best" attribute to avoid making poor choices
- This in effect biases the search.
20 Information Theory
- One method for assessing attribute quality
- Described by Shannon and Weaver (1949)
- Measurement of the expected amount of information in terms of bits
- These are not your ordinary computer bits
- Often information is fractional
- Other applications
- Compression
- Feature selection
- Using this heuristic gives the ID3 algorithm for decision tree construction.
21 Notation
- Let vi be a possible answer (value of an attribute)
- Let P(vi) be the probability of getting answer vi from a random data element
- The information content I of knowing the actual answer is
  I(P(v1), ..., P(vn)) = Σi −P(vi) log2 P(vi)
22 Example
- Consider a fair coin: P(heads) = P(tails) = 1/2
- Consider an unfair coin: P(heads) = 0.99 and P(tails) = 0.01
- The value of the actual answer is reduced if you know there is a bias (see the computation below)
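A small illustration of the two coin examples using the information content formula from the previous slide (the helper function is mine, not the slides'):

```python
from math import log2

def information(probabilities):
    """I(p1, ..., pn) = -sum(pi * log2(pi)), the expected information in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))    # fair coin   -> 1.0 bit
print(information([0.99, 0.01]))  # biased coin -> about 0.08 bits
```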
23 Application to Decision Trees
- Measure the value of information after splitting the instances by an attribute A
- Attribute A splits the instances E into subsets E1, ..., Ea, where a is the number of values A can have
  Remainder(A) = Σi (|Ei| / |E|) · I(P(v1i), ..., P(vni))
- where P(v1i) is the probability of an element in Ei having value v1 for the goal attribute, etc.
- i.e., the number of elements in Ei having v1, divided by |Ei|
24 Application to Decision Trees
- The information gain of an attribute A is
  Gain(A) = I(P(v1), ..., P(vn)) − Remainder(A)
- or the amount of information before selecting the attribute minus how much is still needed afterwards (the values are for the goal attribute)
- Heuristic: select the attribute with the highest gain
25 Example
- Calculate the gain for Patrons and Type
- Which attribute would be chosen?
- Exercise: calculate the information gain of Alt (a gain sketch follows below)
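A hedged sketch of Gain(A) for nominal attributes, assuming the same dict-of-attribute-values representation as the earlier learner sketch; none of this code comes from the slides.

```python
from collections import Counter
from math import log2

def information(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

def goal_information(examples, goal):
    """I(...) over the distribution of goal values in this set of examples."""
    counts = Counter(e[goal] for e in examples)
    total = len(examples)
    return information([c / total for c in counts.values()])

def gain(attribute, examples, goal):
    """Gain(A) = I(goal distribution) - Remainder(A)."""
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * goal_information(subset, goal)
    return goal_information(examples, goal) - remainder
```

With the usual 12-instance restaurant data this gives Gain(Patrons) ≈ 0.54 bits and Gain(Type) = 0 bits, so Patrons would be chosen.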
26 Carrying On
- When you use information gain in lower levels of the tree, remember your set of instances under consideration changes
- The decision tree construction procedure is recursive
- This is the single most common mistake when calculating information gain by hand
27 Highly Branching Attributes
- Highly branching attributes might have spuriously high gain
- Correct for this by using the gain ratio
- Calculate the information of the split, Split(A)
- Calculate Gain(A)/Split(A)
- Choose the attribute with the highest gain ratio (a sketch follows below)
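The split information formula itself is not shown on the slide; the sketch below uses the standard definition Split(A) = −Σ (|Ei|/|E|) log2(|Ei|/|E|) and reuses gain() from the previous sketch.

```python
from collections import Counter
from math import log2

def split_information(attribute, examples):
    """Split(A): the information content of the partition induced by A itself."""
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(attribute, examples, goal):
    """Gain(A) / Split(A); gain() is the function from the previous sketch."""
    split = split_information(attribute, examples)
    return gain(attribute, examples, goal) / split if split > 0 else 0.0
```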
28 Assessing Decision Trees
- Two kinds of assessments that we may want
- Assess the performance of a single model
- Assess the performance of a data mining technique
- What kinds of metrics can we use?
- Model size
- Accuracy
29 Comparing Model Size
- Suppose two models with the same accuracy
- Choose the model with smaller size
- Ockham's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
- Can be used as a heuristic (in other data mining techniques as well)
- Why?
- Efficiency
- Generality
- The problem of finding the smallest model is often intractable
- NP-complete for decision tree learning
30 Accuracy
- Measurement of the correctness of the technique
- Success rate
- Definitions
- True positive: a positive instance that is correctly classified
- True negative: a negative instance correctly classified
- False positive: a negative instance classified as a positive one
- False negative: a positive instance classified as a negative one
- Accuracy is f = (tp + tn) / |E| (see the sketch after this list)
- Sometimes we're more accepting of some errors
- Spam filter
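A tiny illustration (mine, not the slides') of tallying these four counts and the accuracy f = (tp + tn) / |E| from paired true and predicted labels; the "Yes" positive label is an assumption.

```python
def accuracy_report(true_labels, predicted_labels, positive="Yes"):
    """Count tp, tn, fp, fn and compute accuracy over a labeled test set."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(true_labels)   # f = (tp + tn) / |E|
    return tp, tn, fp, fn, accuracy
```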
31 Testing Procedures
- In general, instances are split into two disjoint sets
- Training set: the set of instances used to build the model
- Test set: the set of instances used to test the accuracy
- In both sets, the correct labeling is known
[Figure: the available instances are divided into a Training Set and a Test Set]
32 Testing Dilemma
- We'd like both sets to be as large as possible
- Try to create sets that are representative of possible data
- As the number of attributes grows, the size of a representative set grows exponentially. (Why?)
33 Assessing a Single Model
- Each test instance constitutes a Bernoulli trial of the model.
- The mean and variance of a single trial are p and p(1−p)
- For N instances, f is a random variable with mean p and variance p(1−p)/N
- For large N (roughly 100 or more), the distribution of f approaches a normal distribution (bell curve)
- Calculate P(−z ≤ X ≤ z) = c, where [−z, z] is the confidence interval and c defines the confidence
34 Assessing a Single Model
- The accuracy f needs to be standardized to have 0 mean and unit variance
- Values for c and z can be found in standard statistical texts
- Solve for p, which is shown in the text (a sketch follows below)
35 Assessing a Single Model
- Two models are significantly different if their confidence intervals for p do not overlap
- Choose the model with a better confidence interval for p
36 Assessing a Method
- n-fold cross-validation (sketched below)
- Split the instances into n equal-sized partitions
- Make sure each partition is as representative as possible
- Run n training and testing sessions, treating each partition as a testing set during one session
- Calculate accuracy and error rates
- Means and standard deviation
- 10-fold tests are common
- Leave-one-out (or jackknife)
- Special case of n-fold cross-validation
- Use for small datasets
- Each instance is its own test set.
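A minimal sketch of n-fold cross-validation; train_fn and accuracy_fn are hypothetical callables standing in for whatever learner and scoring function are being assessed, and stratification (keeping each partition representative) is omitted for brevity.

```python
import random

def cross_validation(examples, n_folds, train_fn, accuracy_fn, seed=0):
    """n-fold cross-validation: each fold serves as the test set exactly once."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::n_folds] for i in range(n_folds)]   # n roughly equal partitions
    scores = []
    for i, test_set in enumerate(folds):
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train_fn(training_set)
        scores.append(accuracy_fn(model, test_set))
    mean = sum(scores) / n_folds
    variance = sum((s - mean) ** 2 for s in scores) / n_folds
    return mean, variance ** 0.5      # mean accuracy and its standard deviation
```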
37 WEKA Output
38 WEKA Output
39 Extensions to Basic Algorithm
- Numeric Attributes
- Missing Values
- Overfitting Avoidance (Pruning)
- Interpreting Decision Trees
40 Handling Numeric Attributes
- Recall that decision trees work for nominal attributes
- Can't have an infinite number of branches
- Our approach is to convert numeric attributes into ordinal (nominal) attributes
- This process is called discretization
41 Discretization
- Binary split (weather data)
- Select a breakpoint between values with maximum information gain (equivalently, lowest Remainder)
- For each breakpoint, calculate the gain for less than and greater than the breakpoint.
- For n values, this is an O(n) process (assuming instances are sorted already; see the sketch below).
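A sketch of the binary-split search, reusing goal_information() from the gain sketch above. For clarity it recomputes class counts at every candidate breakpoint; keeping running counts while scanning the sorted values gives the O(n) behavior the slide mentions.

```python
def best_breakpoint(examples, attribute, goal):
    """Binary split of a numeric attribute: try a breakpoint between each pair of
    adjacent values and keep the one with the lowest remainder (highest gain)."""
    ordered = sorted(examples, key=lambda e: e[attribute])
    best = (None, float("inf"))                       # (breakpoint, remainder)
    for i in range(1, len(ordered)):
        lo, hi = ordered[i - 1][attribute], ordered[i][attribute]
        if lo == hi:
            continue                                  # no breakpoint between equal values
        breakpoint_value = (lo + hi) / 2
        below, above = ordered[:i], ordered[i:]
        remainder = (len(below) * goal_information(below, goal)
                     + len(above) * goal_information(above, goal)) / len(ordered)
        if remainder < best[1]:
            best = (breakpoint_value, remainder)
    return best
```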
42 Discretization
- Example
- You can reuse continuous attributes, but this causes difficulty in interpreting the results.
43 Discretization
- Equal-interval (equiwidth) binning splits the range into n equal-sized ranges
- (max − min) / n is the range width
- Often distributes the instances unevenly
- Equal-frequency (equidepth) binning splits into n bins containing an equal (or close to equal) number of instances
- Identify splits until the histogram is flat; both schemes are sketched below
44 Discretization
45 Discretization
- Entropy (information content) based
- Requires class labeling (goal attribute)
- Recursively apply the approach on slide 41
- Select the breakpoint B with lowest Remainder
- Recursively select the breakpoint with lowest remainder on each of the two partitions
- Stop splitting when some criterion is met
- Minimum description length (section 5.9)
- If Gain(B) < t, stop splitting
- A formula for determining t is given in the book.
46 Handling Missing Values
- Ignore instances with missing values
- Pretty harsh, and the missing value might not be important
- Ignore attributes with missing values
- Again, may not be feasible
- Treat the missing value as another nominal value
- Fine if a missing value has significant meaning
- Estimate missing values
- Data imputation: regression, nearest neighbor, mean, mode, etc.
- We'll cover this in more detail later in the semester
47 Handling Missing Values
- Follow the leader
- An instance with a missing value for a tested attribute is sent down the branch with the most instances
[Figure: a node testing Temp against 75; one branch holds 5 instances, the other 3, so an instance with Temp missing is included on the left (5-instance) branch]
48 Handling Missing Values
- Partition the instance (branches show the number of instances; a classification sketch follows below)
[Figure: the instance with the missing Temp value is split across both branches in proportion to the observed counts, giving branch weights of 3 3/8 and 5 5/8 instead of 3 and 5; the fractional pieces continue down later tests such as Wind and Sunny with weights like 2 5/8 and 1 3/8]
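A sketch of classification with a partitioned instance, using the (attribute, branches) tuple trees from the earlier learner sketch. A full implementation would split the instance in proportion to the training counts down each branch (the 3/8 and 5/8 in the figure); this toy tree stores no counts, so equal shares are used purely for illustration.

```python
def classify_with_missing(tree, instance, weight=1.0):
    """Return a dict of class label -> weight, splitting the instance
    fractionally whenever the tested attribute's value is missing."""
    if not isinstance(tree, tuple):              # leaf: a class label
        return {tree: weight}
    attribute, branches = tree
    value = instance.get(attribute)
    if value in branches:                        # value known: follow one branch
        targets = [(branches[value], weight)]
    else:                                        # missing: split across all branches
        # Real partitioning uses the proportion of training instances per branch;
        # equal shares are used here only because this toy tree stores no counts.
        share = weight / len(branches)
        targets = [(child, share) for child in branches.values()]
    result = {}
    for child, w in targets:
        for label, lw in classify_with_missing(child, instance, w).items():
            result[label] = result.get(label, 0.0) + lw
    return result
```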
49 Pruning
- To avoid overfitting, we can prune or simplify a decision tree.
- More efficient; Ockham's Razor
- Prepruning tries to decide a priori when to stop creating subtrees
- This turns out to be fairly difficult to do well in practice
- Postpruning simplifies an existing decision tree
50 Postpruning
- Subtree replacement replaces a subtree with a single leaf node
[Figure: before and after subtree replacement; in a tree rooted at Alt, a Price subtree whose leaves (No, Yes, Yes) classify 4/5, 1/2, and 7/8 instances correctly is replaced by a single Yes leaf that classifies 12/15 correctly]
51 Postpruning
- Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent
[Figure: before and after subtree raising; a Price subtree with leaves 4/5, 1/2, 7/8 sitting below a Res test (whose other branch is a 4/4 leaf) is raised to replace Res, its instances are redistributed, and the raised Price node's leaves become 4/5, 4/5, 7/9]
52 Postpruning
- When do we want to perform subtree replacement or subtree raising?
- Consider the estimated error of the pruning operation
- Estimating error
- With a test set: similar to accuracy, except replace f = (tp + tn)/|E| with f = (fp + fn)/|E|, the error rate, and use a confidence of 25%
- The confidence can be tweaked to achieve better performance
- Without a test set: consider the number of misclassified training instances as errors, and take a pessimistic estimate of the error rate.
53 Using Error Estimate
- To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of the children. Replace the node if its error rate is less than the combined rates of its children.
[Figure: a Price node with leaves No (4/5), Yes (1/2), Yes (7/8)]
- Combined estimate for the children: 5/15 · err(1/5, 5) + 8/15 · err(1/8, 8) + 2/15 · err(1/2, 2) ≈ 0.33
- Estimate for the node as a single leaf: err(3/15, 15) ≈ 0.28
- 0.28 < 0.33, so the Price subtree is replaced (a sketch of err follows below)
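The slides leave err(f, N) to the book; the sketch below uses the standard pessimistic estimate (the upper confidence limit of the binomial at 25% confidence, z ≈ 0.69), which does reproduce the 0.33 and 0.28 shown above.

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on the error rate for N instances with observed
    error rate f; z = 0.69 corresponds to the 25% confidence used above."""
    center = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return (center + spread) / (1 + z * z / N)

children = 5/15 * pessimistic_error(1/5, 5) \
         + 8/15 * pessimistic_error(1/8, 8) \
         + 2/15 * pessimistic_error(1/2, 2)
node = pessimistic_error(3/15, 15)
print(round(children, 2), round(node, 2))   # about 0.33 vs 0.28 -> prune
```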
54 Interpreting Decision Trees
- Although the decision tree is used for classification, you can use the classification rules from the decision tree to describe concepts
55 Interpreting Decision Trees
- A description of hard contact wearers, appropriate for regular people
- In general, a nearsighted person with an astigmatism and normal tear production should be prescribed hard contacts.
56 Summary
- Decision trees are a classification technique
- They can represent any function representable with propositional logic
- Heuristics such as information content are used to select relevant attributes
- Pruning is used to avoid overfitting
- The output of decision trees can be used for descriptive as well as predictive purposes