Title: Data Mining With Decision Trees
1 Data Mining With Decision Trees
- Craig A. Struble, Ph.D.
- Marquette University
2 Overview
- Decision Trees
- Rules and Language Bias
- Constructing Decision Trees
- Some Analyses
- Heuristics
- Quality Assessment
- Extensions
3 Goals
- Explore the complete data mining process
- Understand decision trees as a model
- Understand how to construct a decision tree
- Recognize the language bias, search bias, and overfitting avoidance bias for decision trees
- Be able to assess the performance of decision trees
4 Decision Trees
- A graph (tree) based model used primarily for classification
- Extensively studied
- Quinlan is the primary contributor to the field
- Applications are wide ranging
- Data mining
- Aircraft flying
- Medical diagnosis
- Etc.
5 Decision Trees
6 What kind of data?
- Initially, we will restrict the data to having only nominal values
- We'll explore numeric/continuous values later
- The number of attributes doesn't matter
- Beware of the curse of dimensionality, though
- We'll see this later
7 Classification Rules
- It is relatively straightforward to convert a decision tree into a set of rules for classification (a sketch of the conversion follows below)
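The following is a minimal, illustrative sketch of that conversion, not taken from the slides: the Node structure, attribute names, and leaf labels are assumptions chosen only to show how each root-to-leaf path becomes one rule.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    attribute: str = None                           # attribute tested here (None at a leaf)
    children: dict = field(default_factory=dict)    # attribute value -> child Node
    label: str = None                               # class label if this is a leaf

def tree_to_rules(node, conditions=()):
    """Walk every root-to-leaf path and emit one classification rule per leaf."""
    if node.label is not None:                      # leaf: the accumulated path is the rule body
        lhs = " and ".join(f"{a} = {v}" for a, v in conditions) or "true"
        return [f"if {lhs} then {node.label}"]
    rules = []
    for value, child in node.children.items():
        rules += tree_to_rules(child, conditions + ((node.attribute, value),))
    return rules

# Toy two-level tree over the restaurant attributes Alt and Res (made up for illustration).
tree = Node("Alt", {
    "No":  Node(label="Yes"),
    "Yes": Node("Res", {"Yes": Node(label="Yes"), "No": Node(label="No")}),
})
for rule in tree_to_rules(tree):
    print(rule)
```

Each leaf contributes exactly one rule whose conditions are the attribute tests on the path leading to it.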
8 Language Bias
- Decision trees are restricted to functions that can be represented by rules of the form
- if X and Y then A
- if X and W and V then B
- if Y and V then A
- That is, decision trees represent collections of implications
- The rules can be combined with or
- if Y and (X or V) then A
9 Language Bias
- Examples of functions not well represented by decision trees
- Parity: output is true if an even number of attributes are true
- Majority: output is true if more than half of the attributes are true
10 Propositional Logic
- Essentially, decision trees can represent any function in propositional logic
- A, B, C: propositional variables
- and, or, not, → (implies), ↔ (equivalent): connectives
- A proposition is a statement that is either true or false
- "The sky is blue." (color of sky = blue)
- Hence, decision trees are an example of a propositional learner.
11 Constructing Decision Trees
12 Select an Attribute
[Figure: the attribute Alt is selected as the first test]
13 Partition The Data
[Figure: splitting on Alt partitions the instances; the No branch receives instances 3,6,7,8,9,11 and the Yes branch receives instances 1,2,4,5,10,12]
14 Select Next Attribute
[Figure: under the Alt = Yes branch (instances 1,2,4,5,10,12), the attribute Res is selected next; its Yes branch receives instances 1,5,10 and its No branch receives instances 2,4,12. The Alt = No branch still holds instances 3,6,7,8,9,11]
15 Continue Selecting Attributes
[Figure: the tree so far (Alt at the root, Res under Alt = Yes) is extended by splitting the Res = Yes subset (instances 1,5,10) on Fri; the Fri = Yes branch (instances 5,10) becomes a No leaf and the Fri = No branch (instance 1) becomes a Yes leaf]
This process continues along a subtree until all instances have the same label.
16 Basic Algorithm
algorithm LearnDecisionTree(examples, attributes, default) returns a decision tree
  inputs: examples, a set of examples
          attributes, a set of attributes
          default, default value for the goal attribute
  if examples is empty then return default
  else if all examples have the same value for the goal attribute then return that value
  else
    best ← ChooseAttribute(attributes, examples)
    tree ← a new decision tree with root test best
    for each value vi of best do
      examplesi ← elements of examples with best = vi
      subtree ← LearnDecisionTree(examplesi, attributes − best, MajorityValue(examples))
      add a branch to tree with label vi and subtree subtree
    return tree
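A runnable Python sketch of this algorithm, under a few assumptions not in the slides: examples are dicts mapping attribute names to values, the goal attribute is passed explicitly, and ChooseAttribute is supplied as a parameter so any heuristic (such as information gain) can be plugged in.

```python
from collections import Counter

def majority_value(examples, goal):
    """Most common value of the goal attribute among the examples."""
    return Counter(e[goal] for e in examples).most_common(1)[0][0]

def learn_decision_tree(examples, attributes, default, goal, choose_attribute):
    # examples: list of dicts mapping attribute name -> value (including the goal)
    # attributes: non-goal attribute names still available for testing
    if not examples:
        return default                                   # leaf: default value
    values = {e[goal] for e in examples}
    if len(values) == 1:
        return values.pop()                              # leaf: all examples agree
    if not attributes:
        return majority_value(examples, goal)            # guard added here: no tests left
    best = choose_attribute(attributes, examples, goal)  # heuristic choice
    tree = (best, {})                                    # internal node: (test, branches)
    for vi in {e[best] for e in examples}:
        examples_i = [e for e in examples if e[best] == vi]
        subtree = learn_decision_tree(
            examples_i,
            [a for a in attributes if a != best],
            majority_value(examples, goal),
            goal,
            choose_attribute,
        )
        tree[1][vi] = subtree
    return tree
```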
17 Analysis of Basic Algorithm
- Let m be the number of attributes
- Let n be the number of instances
- Assumption: depth of the tree is O(log n)
- For each level of the tree, all n instances are considered (best = vi)
- O(n log n) work for a single attribute over the entire tree
- Total cost is O(mn log n) since all attributes are eventually considered.
18 How Many Possible Decision Trees?
- Assume a set of m non-goal boolean attributes
- We can construct a decision tree for each boolean function over the m non-goal attributes
- There are 2^m possible ways to assign values to the attributes (rows of a truth table)
- The number of different functions is the number of subsets of the rows: assign the rows in the subset a value of true.
- So, there must be 2^(2^m) possible decision trees! (a quick numeric check follows below)
- How do we select the best one?
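As a quick check of that count (my own illustration, not from the slides):

```python
# Number of truth-table rows and boolean functions for m boolean attributes.
for m in range(1, 7):
    rows = 2 ** m
    functions = 2 ** rows          # one function per subset of rows
    print(f"m={m}: {rows} rows, {functions} distinct boolean functions")
# m=6 already gives 2**64, roughly 1.8e19 functions, so exhaustive search is hopeless.
```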
19 Applying Heuristics
- In the basic algorithm, the ChooseAttribute function makes an arbitrary choice of an attribute to build the tree.
- We can make this function try to choose the "best" attribute to avoid making poor choices
- This in effect biases the search.
20 Information Theory
- One method for assessing attribute quality
- Described by Shannon and Weaver (1949)
- Measurement of the expected amount of information in terms of bits
- These are not your ordinary computer bits
- Often information is fractional
- Other applications
- Compression
- Feature selection
- Using this heuristic gives the ID3 algorithm for decision tree construction.
21 Notation
- Let vi be a possible answer (value of an attribute)
- Let P(vi) be the probability of getting answer vi from a random data element
- The information content I of knowing the actual answer is
  I(P(v1), ..., P(vn)) = Σi −P(vi) log2 P(vi)
22 Example
- Consider a fair coin: P(heads) = P(tails) = 1/2
- Consider an unfair coin: P(heads) = 0.99 and P(tails) = 0.01
- The value of the actual answer is reduced if you know there is a bias (see the computation below)
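A small illustration of the two coin examples using the information content formula from the previous slide (the helper function is mine, not the slides'):

```python
from math import log2

def information(probabilities):
    """I(p1, ..., pn) = -sum(pi * log2(pi)), the expected information in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))    # fair coin   -> 1.0 bit
print(information([0.99, 0.01]))  # biased coin -> about 0.08 bits
```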
23 Application to Decision Trees
- Measure the value of information after splitting the instances by an attribute A
- Attribute A splits the instances E into subsets E1, ..., Ea, where a is the number of values A can have
  Remainder(A) = Σi (|Ei| / |E|) · I(P(v1i), ..., P(vni))
- where P(v1i) is the probability of an element in Ei having value v1 for the goal attribute, etc.
- i.e., the number of elements in Ei having v1, divided by |Ei|
24 Application to Decision Trees
- The information gain of an attribute A is
  Gain(A) = I(P(v1), ..., P(vn)) − Remainder(A)
- or the amount of information before selecting the attribute minus how much is still needed afterwards (the values are for the goal attribute)
- Heuristic: select the attribute with the highest gain
25 Example
- Calculate the gain for Patrons and Type
- Which attribute would be chosen?
- Exercise: calculate the information gain of Alt (a gain sketch follows below)
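A hedged sketch of Gain(A) for nominal attributes, assuming the same dict-of-attribute-values representation as the earlier learner sketch; none of this code comes from the slides.

```python
from collections import Counter
from math import log2

def information(probabilities):
    return -sum(p * log2(p) for p in probabilities if p > 0)

def goal_information(examples, goal):
    """I(...) over the distribution of goal values in this set of examples."""
    counts = Counter(e[goal] for e in examples)
    total = len(examples)
    return information([c / total for c in counts.values()])

def gain(attribute, examples, goal):
    """Gain(A) = I(goal distribution) - Remainder(A)."""
    total = len(examples)
    remainder = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        remainder += len(subset) / total * goal_information(subset, goal)
    return goal_information(examples, goal) - remainder
```

With the usual 12-instance restaurant data this gives Gain(Patrons) ≈ 0.54 bits and Gain(Type) = 0 bits, so Patrons would be chosen.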
26 Carrying On
- When you use information gain in lower levels of the tree, remember your set of instances under consideration changes
- The decision tree construction procedure is recursive
- This is the single most common mistake when calculating information gain by hand
27 Highly Branching Attributes
- Highly branching attributes might have spuriously high gain
- Correct for this by using the gain ratio
- Calculate the information of the split, Split(A)
- Calculate Gain(A)/Split(A)
- Choose the attribute with the highest gain ratio (a sketch follows below)
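The split information formula itself is not shown on the slide; the sketch below uses the standard definition Split(A) = −Σ (|Ei|/|E|) log2(|Ei|/|E|) and reuses gain() from the previous sketch.

```python
from collections import Counter
from math import log2

def split_information(attribute, examples):
    """Split(A): the information content of the partition induced by A itself."""
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def gain_ratio(attribute, examples, goal):
    """Gain(A) / Split(A); gain() is the function from the previous sketch."""
    split = split_information(attribute, examples)
    return gain(attribute, examples, goal) / split if split > 0 else 0.0
```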
28 Assessing Decision Trees
- Two kinds of assessments that we may want
- Assess the performance of a single model
- Assess the performance of a data mining technique
- What kinds of metrics can we use?
- Model size
- Accuracy
29 Comparing Model Size
- Suppose two models with the same accuracy
- Choose the model with smaller size
- Ockham's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
- Can be used as a heuristic (in other data mining techniques as well)
- Why?
- Efficiency
- Generality
- The problem of finding the smallest model is often intractable
- NP-complete for decision tree learning
30 Accuracy
- Measurement of the correctness of the technique
- Success rate
- Definitions
- True positive: a positive instance that is correctly classified
- True negative: a negative instance correctly classified
- False positive: a negative instance classified as a positive one
- False negative: a positive instance classified as a negative one
- Accuracy is f = (tp + tn) / |E| (see the sketch after this list)
- Sometimes we're more accepting of some errors
- Spam filter
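A tiny illustration (mine, not the slides') of tallying these four counts and the accuracy f = (tp + tn) / |E| from paired true and predicted labels; the "Yes" positive label is an assumption.

```python
def accuracy_report(true_labels, predicted_labels, positive="Yes"):
    """Count tp, tn, fp, fn and compute accuracy over a labeled test set."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == positive and pred == positive:
            tp += 1
        elif truth != positive and pred != positive:
            tn += 1
        elif truth != positive and pred == positive:
            fp += 1
        else:
            fn += 1
    accuracy = (tp + tn) / len(true_labels)   # f = (tp + tn) / |E|
    return tp, tn, fp, fn, accuracy
```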
31 Testing Procedures
- In general, instances are split into two disjoint sets
- Training set: the set of instances used to build the model
- Test set: the set of instances used to test the accuracy
- In both sets, the correct labeling is known
[Figure: the available instances are divided into a Training Set and a Test Set]
32 Testing Dilemma
- We'd like both sets to be as large as possible
- Try to create sets that are representative of possible data
- As the number of attributes grows, the size of a representative set grows exponentially. (Why?)
33 Assessing a Single Model
- Each test instance constitutes a Bernoulli trial of the model.
- The mean and variance of a single trial are p and p(1−p)
- For N instances, f is a random variable with mean p and variance p(1−p)/N
- For large N (roughly 100 or more), the distribution of f approaches a normal distribution (bell curve)
- Calculate P(−z ≤ X ≤ z) = c, where [−z, z] is the confidence interval and c defines the confidence
34 Assessing a Single Model
- The accuracy f needs to be standardized to have 0 mean and unit variance
- Values for c and z can be found in standard statistical texts
- Solve for p, which is shown in the text (a sketch follows below)
35 Assessing a Single Model
- Two models are significantly different if their confidence intervals for p do not overlap
- Choose the model with a better confidence interval for p
36 Assessing a Method
- n-fold cross-validation (sketched below)
- Split the instances into n equal-sized partitions
- Make sure each partition is as representative as possible
- Run n training and testing sessions, treating each partition as a testing set during one session
- Calculate accuracy and error rates
- Means and standard deviation
- 10-fold tests are common
- Leave-one-out (or jackknife)
- Special case of n-fold cross-validation
- Use for small datasets
- Each instance is its own test set.
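A minimal sketch of n-fold cross-validation; train_fn and accuracy_fn are hypothetical callables standing in for whatever learner and scoring function are being assessed, and stratification (keeping each partition representative) is omitted for brevity.

```python
import random

def cross_validation(examples, n_folds, train_fn, accuracy_fn, seed=0):
    """n-fold cross-validation: each fold serves as the test set exactly once."""
    examples = list(examples)
    random.Random(seed).shuffle(examples)
    folds = [examples[i::n_folds] for i in range(n_folds)]   # n roughly equal partitions
    scores = []
    for i, test_set in enumerate(folds):
        training_set = [e for j, fold in enumerate(folds) if j != i for e in fold]
        model = train_fn(training_set)
        scores.append(accuracy_fn(model, test_set))
    mean = sum(scores) / n_folds
    variance = sum((s - mean) ** 2 for s in scores) / n_folds
    return mean, variance ** 0.5      # mean accuracy and its standard deviation
```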
37 WEKA Output
38 WEKA Output
39 Extensions to Basic Algorithm
- Numeric Attributes
- Missing Values
- Overfitting Avoidance (Pruning)
- Interpreting Decision Trees
40 Handling Numeric Attributes
- Recall that decision trees work for nominal attributes
- Can't have an infinite number of branches
- Our approach is to convert numeric attributes into ordinal (nominal) attributes
- This process is called discretization
41 Discretization
- Binary split (weather data)
- Select a breakpoint between values with maximum information gain (equivalently, lowest Remainder)
- For each breakpoint, calculate the gain for less than and greater than the breakpoint.
- For n values, this is an O(n) process (assuming instances are sorted already; see the sketch below).
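A sketch of the binary-split search, reusing goal_information() from the gain sketch above. For clarity it recomputes class counts at every candidate breakpoint; keeping running counts while scanning the sorted values gives the O(n) behavior the slide mentions.

```python
def best_breakpoint(examples, attribute, goal):
    """Binary split of a numeric attribute: try a breakpoint between each pair of
    adjacent values and keep the one with the lowest remainder (highest gain)."""
    ordered = sorted(examples, key=lambda e: e[attribute])
    best = (None, float("inf"))                       # (breakpoint, remainder)
    for i in range(1, len(ordered)):
        lo, hi = ordered[i - 1][attribute], ordered[i][attribute]
        if lo == hi:
            continue                                  # no breakpoint between equal values
        breakpoint_value = (lo + hi) / 2
        below, above = ordered[:i], ordered[i:]
        remainder = (len(below) * goal_information(below, goal)
                     + len(above) * goal_information(above, goal)) / len(ordered)
        if remainder < best[1]:
            best = (breakpoint_value, remainder)
    return best
```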
42 Discretization
- Example
- You can reuse continuous attributes, but this causes difficulty in interpreting the results.
43 Discretization
- Equal-interval (equiwidth) binning splits the range into n equal-sized ranges
- (max − min) / n is the range width
- Often distributes the instances unevenly
- Equal-frequency (equidepth) binning splits into n bins containing an equal (or close to equal) number of instances
- Identify splits until the histogram is flat; both schemes are sketched below
44 Discretization
45 Discretization
- Entropy (information content) based
- Requires class labeling (goal attribute)
- Recursively apply the approach on slide 41
- Select the breakpoint B with lowest Remainder
- Recursively select the breakpoint with lowest remainder on each of the two partitions
- Stop splitting when some criterion is met
- Minimum description length (section 5.9)
- If Gain(B) < t, stop splitting
- A formula for determining t is given in the book.
46 Handling Missing Values
- Ignore instances with missing values
- Pretty harsh, and the missing value might not be important
- Ignore attributes with missing values
- Again, may not be feasible
- Treat the missing value as another nominal value
- Fine if a missing value has significant meaning
- Estimate missing values
- Data imputation: regression, nearest neighbor, mean, mode, etc.
- We'll cover this in more detail later in the semester
47 Handling Missing Values
- Follow the leader
- An instance with a missing value for a tested attribute is sent down the branch with the most instances
[Figure: a node testing Temp against 75; one branch holds 5 instances, the other 3, so an instance with Temp missing is included on the left (5-instance) branch]
48 Handling Missing Values
- Partition the instance (branches show the number of instances; a classification sketch follows below)
[Figure: the instance with the missing Temp value is split across both branches in proportion to the observed counts, giving branch weights of 3 3/8 and 5 5/8 instead of 3 and 5; the fractional pieces continue down later tests such as Wind and Sunny with weights like 2 5/8 and 1 3/8]
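A sketch of classification with a partitioned instance, using the (attribute, branches) tuple trees from the earlier learner sketch. A full implementation would split the instance in proportion to the training counts down each branch (the 3/8 and 5/8 in the figure); this toy tree stores no counts, so equal shares are used purely for illustration.

```python
def classify_with_missing(tree, instance, weight=1.0):
    """Return a dict of class label -> weight, splitting the instance
    fractionally whenever the tested attribute's value is missing."""
    if not isinstance(tree, tuple):              # leaf: a class label
        return {tree: weight}
    attribute, branches = tree
    value = instance.get(attribute)
    if value in branches:                        # value known: follow one branch
        targets = [(branches[value], weight)]
    else:                                        # missing: split across all branches
        # Real partitioning uses the proportion of training instances per branch;
        # equal shares are used here only because this toy tree stores no counts.
        share = weight / len(branches)
        targets = [(child, share) for child in branches.values()]
    result = {}
    for child, w in targets:
        for label, lw in classify_with_missing(child, instance, w).items():
            result[label] = result.get(label, 0.0) + lw
    return result
```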
49 Pruning
- To avoid overfitting, we can prune or simplify a decision tree.
- More efficient; Ockham's Razor
- Prepruning tries to decide a priori when to stop creating subtrees
- This turns out to be fairly difficult to do well in practice
- Postpruning simplifies an existing decision tree
50 Postpruning
- Subtree replacement replaces a subtree with a single leaf node
[Figure: before and after subtree replacement; in a tree rooted at Alt, a Price subtree whose leaves (No, Yes, Yes) classify 4/5, 1/2, and 7/8 instances correctly is replaced by a single Yes leaf that classifies 12/15 correctly]
51 Postpruning
- Subtree raising moves a subtree to a higher level in the decision tree, subsuming its parent
[Figure: before and after subtree raising; a Price subtree with leaves 4/5, 1/2, 7/8 sitting below a Res test (whose other branch is a 4/4 leaf) is raised to replace Res, its instances are redistributed, and the raised Price node's leaves become 4/5, 4/5, 7/9]
52 Postpruning
- When do we want to perform subtree replacement or subtree raising?
- Consider the estimated error of the pruning operation
- Estimating error
- With a test set: similar to accuracy, except replace f = (tp + tn)/|E| with f = (fp + fn)/|E|, the error rate, and use a confidence of 25%
- The confidence can be tweaked to achieve better performance
- Without a test set: consider the number of misclassified training instances as errors, and take a pessimistic estimate of the error rate.
53 Using Error Estimate
- To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of the children. Replace the node if its error rate is less than the combined rates of its children.
[Figure: a Price node with leaves No (4/5), Yes (1/2), Yes (7/8)]
- Combined estimate for the children: 5/15 · err(1/5, 5) + 8/15 · err(1/8, 8) + 2/15 · err(1/2, 2) ≈ 0.33
- Estimate for the node as a single leaf: err(3/15, 15) ≈ 0.28
- 0.28 < 0.33, so the Price subtree is replaced (a sketch of err follows below)
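The slides leave err(f, N) to the book; the sketch below uses the standard pessimistic estimate (the upper confidence limit of the binomial at 25% confidence, z ≈ 0.69), which does reproduce the 0.33 and 0.28 shown above.

```python
from math import sqrt

def pessimistic_error(f, N, z=0.69):
    """Upper confidence limit on the error rate for N instances with observed
    error rate f; z = 0.69 corresponds to the 25% confidence used above."""
    center = f + z * z / (2 * N)
    spread = z * sqrt(f / N - f * f / N + z * z / (4 * N * N))
    return (center + spread) / (1 + z * z / N)

children = 5/15 * pessimistic_error(1/5, 5) \
         + 8/15 * pessimistic_error(1/8, 8) \
         + 2/15 * pessimistic_error(1/2, 2)
node = pessimistic_error(3/15, 15)
print(round(children, 2), round(node, 2))   # about 0.33 vs 0.28 -> prune
```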
54 Interpreting Decision Trees
- Although the decision tree is used for classification, you can use the classification rules from the decision tree to describe concepts
55 Interpreting Decision Trees
- A description of hard contact wearers, appropriate for regular people
- In general, a nearsighted person with an astigmatism and normal tear production should be prescribed hard contacts.
56 Summary
- Decision trees are a classification technique
- They can represent any function representable with propositional logic
- Heuristics such as information content are used to select relevant attributes
- Pruning is used to avoid overfitting
- The output of decision trees can be used for descriptive as well as predictive purposes