Data Mining With Decision Trees (presentation transcript)
1
Data Mining With Decision Trees
  • Craig A. Struble, Ph.D.
  • Marquette University

2
Overview
  • Decision Trees
  • Rules and Language Bias
  • Constructing Decision Trees
  • Some Analyses
  • Heuristics
  • Quality Assessment
  • Extensions

3
Goals
  • Explore the complete data mining process
  • Understand decision trees as a model
  • Understand how to construct a decision tree
  • Recognize the language bias, search bias, and
    overfitting avoidance bias for decision trees
  • Be able to assess the performance of decision
    trees

4
Decision Trees
  • A graph (tree) based model used primarily for
    classification
  • Extensively studied
  • Quinlan is the primary contributor to the field
  • Applications are wide ranging
  • Data mining
  • Aircraft flying
  • Medical diagnosis
  • Etc.

5
Decision Trees
6
What kind of data?
  • Initially, we will restrict the data to having
    only nominal values
  • We'll explore numeric/continuous values later
  • Number of attributes doesn't matter
  • Beware of the curse of dimensionality, though
  • We'll see this later

7
Classification Rules
  • It is relatively straightforward to convert a decision tree into a set of rules for classification

8
Language Bias
  • Decision trees are restricted to functions that
    can be represented by rules of the form
  • if X and Y then A
  • if X and W and V then B
  • if Y and V then A
  • That is, decision trees represent collections of
    implications
  • The rules can be combined with "or"
  • if Y and (X or V) then A

9
Language Bias
  • Examples of functions not well represented by
    decision trees
  • Parity: the output is true if an even number of attributes are true
  • Majority: the output is true if more than half of the attributes are true

10
Propositional Logic
  • Essentially, decision trees can represent any
    function in propositional logic
  • A, B, C: propositional variables
  • and, or, not, → (implies), ↔ (equivalent): connectives
  • A proposition is a statement that is either true or false
  • "The sky is blue." (color of sky = blue)
  • Hence, decision trees are an example of a
    propositional learner.

11
Constructing Decision Trees
12
Select an Attribute
[Diagram: the attribute Alt is selected as the root test.]
13
Partition The Data
[Diagram: splitting on Alt partitions the instances into two branches, one holding instances 3,6,7,8,9,11 and the other instances 1,2,4,5,10,12.]
14
Select Next Attribute
[Diagram: within the Alt branch holding instances 1,2,4,5,10,12, the attribute Res is selected next, partitioning them into 1,5,10 and 2,4,12.]
15
Continue Selecting Attributes
[Diagram: the partial tree is extended; within a remaining mixed subset, the attribute Fri is tested next, separating instances 5,10 from instance 1.]
This process continues along a subtree until all instances have the same label.
16
Basic Algorithm
algorithm LearnDecisionTree(examples, attributes, default) returns a decision tree
  inputs: examples, a set of examples
          attributes, a set of attributes
          default, the default value for the goal attribute
  if examples is empty then return default
  else if all examples have the same value for the goal attribute then return that value
  else
    best <- ChooseAttribute(attributes, examples)
    tree <- a new decision tree with root test best
    for each value vi of best do
      examplesi <- elements of examples with best = vi
      subtree <- LearnDecisionTree(examplesi, attributes - best, MajorityValue(examples))
      add a branch to tree with label vi and subtree subtree
    return tree
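
A minimal Python sketch of this algorithm, assuming each example is a dict of nominal attribute values that includes a 'goal' key; choose_attribute is passed in as a parameter, so it can be an arbitrary choice (as in the basic algorithm) or the information-gain heuristic introduced later:

from collections import Counter

def majority_value(examples, goal='goal'):
    # Most common value of the goal attribute among the examples
    return Counter(e[goal] for e in examples).most_common(1)[0][0]

def learn_decision_tree(examples, attributes, default, choose_attribute, goal='goal'):
    if not examples:
        return default
    labels = {e[goal] for e in examples}
    if len(labels) == 1:                      # all examples share one label
        return labels.pop()
    if not attributes:                        # no tests left: fall back to the majority label
        return majority_value(examples, goal)
    best = choose_attribute(attributes, examples)
    tree = (best, {})                         # a node is (attribute, {value: subtree})
    for v in {e[best] for e in examples}:
        examples_v = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]
        subtree = learn_decision_tree(examples_v, remaining,
                                      majority_value(examples, goal),
                                      choose_attribute, goal)
        tree[1][v] = subtree
    return tree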
17
Analysis of Basic Algorithm
  • Let m be the number of attributes
  • Let n be the number of instances
  • Assumption Depth of tree is O(log n)
  • For each level of the tree, all n instances are considered (in the best = vi tests)
  • O(n log n) work for a single attribute over the
    entire tree
  • Total cost is O(mn log n) since all attributes
    are eventually considered.

18
How Many Possible Decision Trees?
  • Assume a set of m non-goal boolean attributes
  • We can construct a decision tree for each boolean
    function with m non-goal attributes
  • There are 2^m possible ways to assign values to the attributes (the rows of a truth table)
  • The number of different functions is the number of subsets of the rows; assign the rows in the subset a value of true
  • So, there must be 2^(2^m) possible decision trees!
  • How do we select the best one?

19
Applying Heuristics
  • In the basic algorithm, the ChooseAttribute
    function makes an arbitrary choice of an
    attribute to build the tree.
  • We can make this function try to choose the
    best attribute to avoid making poor choices
  • This in effect biases the search.

20
Information Theory
  • One method for assessing attribute quality
  • Described by Shannon and Weaver (1949)
  • Measurement of the expected amount of information
    in terms of bits
  • These are not your ordinary computer bits
  • Often information is fractional
  • Other Applications
  • Compression
  • Feature selection
  • Choosing attributes by information gain is the basis of the ID3 algorithm for decision tree construction.

21
Notation
  • Let vi be a possible answer (value of attribute)
  • Let P(vi) be the probability of getting answer vi
    from a random data element
  • The information content I of knowing the actual answer is
  • I(P(v1), ..., P(vn)) = Σi -P(vi) log2 P(vi)

22
Example
  • Consider a fair coin, P(heads) = P(tails) = ½
  • Consider an unfair coin, P(heads) = 0.99 and P(tails) = 0.01
  • The value of the actual answer is reduced if you know there is a bias
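
A quick check of these two cases, as a sketch in Python using the information formula from the previous slide:

import math

def information(probabilities):
    # I(P(v1), ..., P(vn)) = sum over i of -P(vi) * log2(P(vi))
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

print(information([0.5, 0.5]))    # fair coin: 1.0 bit
print(information([0.99, 0.01]))  # biased coin: roughly 0.08 bits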

23
Application to Decision Trees
  • Measure the amount of information still needed after splitting the instances by an attribute A
  • Attribute A splits the instances E into subsets E1, ..., Ea, where a is the number of values A can have
  • Remainder(A) = Σi (|Ei| / |E|) I(P(v1i), ..., P(vki))
  • where P(v1i) is the probability of an element in Ei having value v1 for the goal attribute, etc.
  • i.e., the number of elements in Ei having v1 divided by |Ei|

24
Application to Decision Trees
  • The information gain of an attribute A is
  • Gain(A) = I(P(v1), ..., P(vk)) - Remainder(A)
  • or the amount of information needed before selecting the attribute minus how much is still needed afterwards (the values here are values of the goal attribute)
  • Heuristic: select the attribute with the highest gain
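
A sketch of Remainder and Gain in Python, reusing the dict-of-values representation from the earlier sketch:

import math
from collections import Counter

def information(probabilities):
    return sum(-p * math.log2(p) for p in probabilities if p > 0)

def info_of(examples, goal='goal'):
    # Information needed to identify the goal value of a random example
    counts = Counter(e[goal] for e in examples)
    total = len(examples)
    return information([c / total for c in counts.values()])

def remainder(attribute, examples, goal='goal'):
    # Weighted information still needed after splitting on the attribute
    total = len(examples)
    rem = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        rem += len(subset) / total * info_of(subset, goal)
    return rem

def gain(attribute, examples, goal='goal'):
    return info_of(examples, goal) - remainder(attribute, examples, goal)

# The heuristic: pick the attribute with the highest gain, e.g.
# choose_attribute = lambda attrs, ex: max(attrs, key=lambda a: gain(a, ex))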

25
Example
  • Calculate the information gain for Patrons and Type
  • Which attribute would be chosen?
  • Exercise: calculate the information gain of Alt

26
Carrying On
  • When you use information gain in lower levels of
    the tree, remember your set of instances under
    consideration changes
  • The decision tree construction procedure is
    recursive
  • This is the single most common mistake when
    calculating information gain by hand

27
Highly Branching Attributes
  • Attributes with many values can produce splits with spuriously high gain
  • Correct for this by using the gain ratio
  • Calculate the information of the split itself: Split(A) = I(|E1|/|E|, ..., |Ea|/|E|)
  • Calculate Gain(A) / Split(A)
  • Choose the attribute with the highest gain ratio (see the sketch below)
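
Continuing the same sketch, the gain ratio only needs the information of the split itself; information() and gain() are the helpers defined after slide 24:

from collections import Counter

def split_info(attribute, examples):
    # Split(A) = I(|E1|/|E|, ..., |Ea|/|E|)
    total = len(examples)
    counts = Counter(e[attribute] for e in examples)
    return information([c / total for c in counts.values()])

def gain_ratio(attribute, examples, goal='goal'):
    s = split_info(attribute, examples)
    return gain(attribute, examples, goal) / s if s > 0 else 0.0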

28
Assessing Decision Trees
  • Two kinds of assessments that we may want
  • Assess the performance of a single model
  • Assess the performance of a data mining technique
  • What kinds of metrics can we use?
  • Model size
  • Accuracy

29
Comparing Model Size
  • Suppose two models with the same accuracy
  • Choose the model with smaller size
  • Ockham's razor: the most likely hypothesis is the simplest one that is consistent with all observations.
  • Can be used as a heuristic (other data mining
    techniques)
  • Why?
  • Efficiency
  • Generality
  • The problem of finding the smallest model is
    often intractable
  • NP-complete for decision tree learning

30
Accuracy
  • Measurement of the correctness of the technique
  • Success rate
  • Definitions
  • True positive: a positive instance that is correctly classified
  • True negative: a negative instance correctly classified
  • False positive: a negative instance classified as a positive one
  • False negative: a positive instance classified as a negative one
  • Accuracy is f = (tp + tn) / |E|
  • Sometimes we're more accepting of some errors
  • Example: a spam filter
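
A minimal sketch of these counts, assuming parallel lists of actual and predicted labels and a designated positive class (the labels below are made up for illustration):

def accuracy(actual, predicted, positive):
    # Count true/false positives/negatives over paired labels
    tp = sum(a == positive and p == positive for a, p in zip(actual, predicted))
    tn = sum(a != positive and p != positive for a, p in zip(actual, predicted))
    fp = sum(a != positive and p == positive for a, p in zip(actual, predicted))
    fn = sum(a == positive and p != positive for a, p in zip(actual, predicted))
    return (tp + tn) / len(actual)            # f = (tp + tn) / |E|

print(accuracy(['spam', 'ok', 'ok', 'spam'],
               ['spam', 'ok', 'spam', 'spam'], positive='spam'))  # 0.75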

31
Testing Procedures
  • In general, instances are split into two disjoint
    sets
  • Training set: the set of instances used to build the model
  • Test set: the set of instances used to test the accuracy
  • In both sets, the correct labeling is known

[Diagram: the instances are divided into a disjoint training set and test set.]
32
Testing Dilemma
  • We'd like both sets to be as large as possible
  • Try to create sets that are representative of
    possible data
  • As the number of attributes grows, the size of a
    representative set grows exponentially. (Why?)

33
Assessing a Single Model
  • Each test instance constitutes a Bernoulli trial of the model.
  • The mean and variance of a single trial are p and p(1-p)
  • For N instances, the observed accuracy f is a random variable with mean p and variance p(1-p)/N
  • For large N (around 100 or more), the distribution of f approaches a normal distribution (bell curve)
  • Calculate P(-z <= (f - p) / sqrt(p(1-p)/N) <= z) = c, where [-z, z] defines the confidence interval and c defines the confidence
34
Assessing a Single Model
  • The accuracy f is standardized so it has zero mean and unit variance before looking up z
  • Values for c and z can be found in standard statistical texts
  • Solve for p, which is shown in the text (a sketch follows below)
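
As a sketch, solving the resulting quadratic for p under the normal approximation gives the interval below; z = 1.96 corresponds to roughly 95% confidence:

import math

def accuracy_interval(f, n, z=1.96):
    # Interval for the true accuracy p given observed accuracy f on n test instances
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    denom = 1 + z * z / n
    return (centre - spread) / denom, (centre + spread) / denom

print(accuracy_interval(0.85, 100))  # roughly (0.77, 0.91)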

35
Assessing a Single Model
  • Two models are significantly different if their
    confidence intervals for p do not overlap
  • Choose the model with a better confidence
    interval for p

36
Assessing a Method
  • n-fold cross-validation
  • Split the instances into n equal sized partitions
  • Make sure each partition is as representative as
    possible
  • Run n training and testing sessions, treating
    each partition as a testing set during one
    session
  • Calculate accuracy and error rates
  • Report the means and standard deviations across the folds
  • 10-fold tests are common
  • Leave-one-out (or jackknife)
  • Special case of n-fold cross validation
  • Use for small datasets
  • Each instance is its own test set.
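
A sketch of n-fold cross-validation; train_and_test is a placeholder callback that builds a model on the training partition and returns its accuracy on the test partition, and stratification (keeping each partition representative) is omitted:

import random

def cross_validate(instances, n_folds, train_and_test):
    shuffled = list(instances)
    random.shuffle(shuffled)
    folds = [shuffled[i::n_folds] for i in range(n_folds)]   # n roughly equal partitions
    scores = []
    for i in range(n_folds):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        scores.append(train_and_test(train, test))
    mean = sum(scores) / n_folds
    std = (sum((s - mean) ** 2 for s in scores) / (n_folds - 1)) ** 0.5 if n_folds > 1 else 0.0
    return mean, std

# Leave-one-out is the special case n_folds = len(instances).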

37
WEKA Output
38
WEKA Output
39
Extensions to Basic Algorithm
  • Numeric Attributes
  • Missing Values
  • Overfitting Avoidance (Pruning)
  • Interpreting Decision Trees

40
Handling Numeric Attributes
  • Recall that decision trees work for nominal
    attributes
  • Can't have an infinite number of branches
  • Our approach is to convert numeric attributes
    into ordinal (nominal) attributes
  • This process is called discretization

41
Discretization
  • Binary split (weather data)
  • Select a breakpoint between values with maximum
    information gain (equivalently, lowest Remainder)
  • For each breakpoint, calculate the gain of splitting into instances below and above the breakpoint
  • For n values, this is an O(n) process (assuming the instances are already sorted).
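
A sketch of the breakpoint search over parallel lists of numeric values and goal labels; for clarity it recomputes the label counts at every candidate breakpoint, whereas the O(n) version described above would maintain running counts over the sorted instances:

import math
from collections import Counter

def label_info(labels):
    total = len(labels)
    counts = Counter(labels)
    return sum(-c / total * math.log2(c / total) for c in counts.values())

def best_breakpoint(values, labels):
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    best_b, best_rem = None, float('inf')
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                    # no breakpoint between equal values
        b = (pairs[i - 1][0] + pairs[i][0]) / 2         # midpoint between adjacent values
        left = [lab for v, lab in pairs if v < b]
        right = [lab for v, lab in pairs if v >= b]
        rem = (len(left) * label_info(left) +
               len(right) * label_info(right)) / len(pairs)
        if rem < best_rem:                              # lowest remainder = highest gain
            best_b, best_rem = b, rem
    return best_b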

42
Discretization
  • Example
  • You can reuse continuous attributes further down the tree, but this makes the results harder to interpret.

43
Discretization
  • Equal-interval (equiwidth) binning splits the range into n equal-sized intervals
  • (max - min) / n is the bin width
  • Often distributes the instances unevenly
  • Equal-frequency (equidepth) binning splits into n bins containing an equal (or close to equal) number of instances
  • Identify splits until the histogram is flat
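
A sketch of both binning schemes, each returning the interior cut points:

def equal_width_cuts(values, n):
    # Split [min, max] into n equal-width bins; width = (max - min) / n
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    return [lo + i * width for i in range(1, n)]

def equal_frequency_cuts(values, n):
    # Choose cut points so each bin holds roughly the same number of values
    ordered = sorted(values)
    step = len(ordered) / n
    return [ordered[int(i * step)] for i in range(1, n)]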

44
Discretization
45
Discretization
  • Entropy (information content) based
  • Requires class labeling (the goal attribute)
  • Recursively apply the approach from slide 41
  • Select the breakpoint B with the lowest Remainder
  • Recursively select the breakpoint with the lowest remainder on each of the two partitions
  • Stop splitting when some criterion is met
  • Minimum description length, covered in section 5.9
  • Stop when Gain(B) falls below a threshold t
  • A formula for determining t is given in the book.

46
Handling Missing Values
  • Ignore instances with missing values
  • Pretty harsh, and the missing value might not be important
  • Ignore attributes with missing values
  • Again, may not be feasible
  • Treat a missing value as another nominal value
  • Fine if missing a value has significant meaning
  • Estimate missing values
  • Data imputation: regression, nearest neighbor, mean, mode, etc.
  • We'll cover this in more detail later in the semester
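
A sketch of the two simplest imputation options for a single attribute, assuming the dict-of-values representation from the earlier sketches with None marking a missing value:

from collections import Counter

def impute_mode(examples, attribute):
    # Replace missing nominal values with the most common observed value
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    for e in examples:
        if e[attribute] is None:
            e[attribute] = mode

def impute_mean(examples, attribute):
    # Replace missing numeric values with the mean of the observed values
    observed = [e[attribute] for e in examples if e[attribute] is not None]
    mean = sum(observed) / len(observed)
    for e in examples:
        if e[attribute] is None:
            e[attribute] = mean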

47
Handling Missing Values
  • Follow the leader
  • An instance with a missing value for a tested
    attribute is sent down the branch with the most
    instances

[Diagram: a test on Temp with breakpoint 75 sends 5 instances down the left branch and 3 down the right; an instance with a missing Temp value is included on the left branch, the one with the most instances.]
48
Handling Missing Values
  • Partition the instance (the branch labels show the fraction of the instance sent down each branch)

[Diagram: an instance with a missing Temp value is split fractionally, with weight 5/8 sent down the branch holding 5 instances and 3/8 down the branch holding 3 instances; the fractional weights carry through subsequent tests (such as the later Wind test in the original figure).]
49
Pruning
  • To avoid overfitting, we can prune or simplify a
    decision tree.
  • More efficient, and in keeping with Ockham's razor
  • Prepruning tries to decide a priori when to stop
    creating subtrees
  • This turns out to be fairly difficult to do well
    in practice
  • Postpruning simplifies an existing decision tree

50
Postpruning
  • Subtree replacement replaces a subtree with a
    single leaf node

[Diagram: under the Alt root, a Price subtree whose leaves are No (4/5), Yes (1/2), and Yes (7/8) is replaced by a single Yes leaf that classifies 12/15 of its instances correctly.]
51
Postpruning
  • Subtree raising moves a subtree to a higher level
    in the decision tree, subsuming its parent

[Diagram: under the Alt root, the Price subtree is raised to replace its parent Res node; the instances from the eliminated branch are redistributed among the Price leaves, whose counts change from 4/4, 4/5, 1/2, 7/8 before raising to 4/5, 4/5, 7/9 afterwards.]
52
Postpruning
  • When do we want to perform subtree replacement or subtree raising?
  • Consider the estimated error of the pruning operation
  • Estimating error
  • With a test set: similar to accuracy, except replace f = (tp + tn) / |E| with f = (fp + fn) / |E|, the error rate, and use a confidence of 25%
  • The confidence can be tweaked to achieve better performance
  • Without a test set: count misclassified training instances as errors, and take a pessimistic estimate of the error rate.

53
Using Error Estimate
  • To determine if a node should be replaced, compare the error rate estimate for the node with the combined error rates of its children. Replace the node if its error estimate is less than the combined estimate of its children.

Example: for a Price node whose leaves are No (4/5), Yes (1/2), and Yes (7/8), the combined estimate for the children is (5/15) err(1/5, 5) + (8/15) err(1/8, 8) + (2/15) err(1/2, 2), roughly 0.33, while the estimate for the node as a single leaf is err(3/15, 15), roughly 0.28, so the subtree is replaced.
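
The err(f, N) values above are consistent with taking the upper limit of the normal-approximation confidence interval on the error rate at 25% confidence (z of about 0.69); a sketch:

import math

def err(f, n, z=0.69):
    # Pessimistic (upper-limit) estimate of the true error rate given an
    # observed error rate f over n instances; z = 0.69 matches 25% confidence
    centre = f + z * z / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return (centre + spread) / (1 + z * z / n)

children = (5/15) * err(1/5, 5) + (8/15) * err(1/8, 8) + (2/15) * err(1/2, 2)
node = err(3/15, 15)
print(round(children, 2), round(node, 2))   # 0.33 0.28 -> replace the subtree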
54
Interpreting Decision Trees
  • Although the decision tree is used for classification, you can use the classification rules derived from it to describe concepts

55
Interpreting Decision Trees
  • A description of hard contact lens wearers, phrased for a general audience:
  • In general, a nearsighted person with astigmatism and normal tear production should be prescribed hard contacts.

56
Summary
  • Decision trees are a classification technique
  • They can represent any function representable
    with propositional logic
  • Heuristics such as information content are used
    to select relevant attributes
  • Pruning is used to avoid overfitting
  • The output of decision trees can be used for
    descriptive as well as predictive purposes