Discovering Knowledge in Data, Daniel T. Larose, Ph.D. - PowerPoint PPT Presentation

1
Discovering Knowledge in Data
Daniel T. Larose, Ph.D.
  • Chapter 6: Decision Trees
  • Prepared by James Steck, Graduate Assistant

2
Objectives
  • Explore the general concept of decision trees
  • Discuss the benefits and drawbacks of decision
    tree models

3
Decision Trees
  • Decision Trees
  • Popular classification method in data mining
  • A decision tree is a collection of decision nodes,
    connected by branches, extending downward from the
    root node to terminating leaf nodes
  • Beginning with the root node, attributes are
    tested at decision nodes, and each possible
    outcome results in a branch
  • Each branch leads either to another decision node
    or to a leaf node
  • Requirements
  • A decision tree is a supervised classification
    method

4
Decision Trees (cont'd)
  • A pre-classified target variable must be included
    in the training set
  • The target variable must be categorical
  • Decision trees learn by example, so the training
    set should contain records with varied attribute
    values
  • If the training set systematically lacks definable
    subsets, classification becomes problematic
  • Classification and Regression Trees (CART) and
    C4.5 are two leading algorithms used in data
    mining

5
Decision Trees (cont'd)
  • Example
  • Credit Risk is the target variable
  • Customers are classified as either Good Risk or
    Bad Risk
  • Predictor variables are Savings (Low, Med, High),
    Assets (Low, High), and Income

[Decision tree diagram for the credit risk example omitted]
6
Decision Trees (cont'd)
  • The highest-level decision node is the root node,
    which tests whether a record has Savings = Low,
    Med, or High
  • Records are split according to the value of
    Savings. For example, records with Savings = Low
    go down the leftmost branch to the next decision
    node
  • Records with Savings = Med proceed down the middle
    branch to a leaf node. This terminates the branch,
    with all records having Savings = Med classified
    as Good Risk
  • Additional decision nodes are not required because
    these records are classified with 100% accuracy

7
Decision Trees (cont'd)
  • Records with Savings = Low are tested at a
    second-level decision node to determine whether
    Assets = Low
  • Those with low assets are classified Bad Risk,
    while the others are classified Good Risk
  • The second-level decision node in the right branch
    tests whether customers with Savings = High have
    Income ≤ $30,000
  • Those with Income less than or equal to $30,000
    are classified Bad Risk. The others are classified
    Good Risk
  • If no further splits are possible, the algorithm
    terminates
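The tree traced through the last two slides can be written out directly as nested tests. A minimal sketch, with a record represented as a Python dict (the key names are an assumption of this sketch):

```python
def classify_credit_risk(record):
    """Apply the example tree: the root node tests Savings; the
    second-level nodes test Assets (left branch) or Income (right)."""
    if record["Savings"] == "Med":
        # Pure leaf: all Savings=Med records are Good Risk.
        return "Good Risk"
    if record["Savings"] == "Low":
        # Second-level node: Assets = Low?
        return "Bad Risk" if record["Assets"] == "Low" else "Good Risk"
    # Savings = High: test whether Income <= $30,000.
    return "Bad Risk" if record["Income"] <= 30_000 else "Good Risk"

print(classify_credit_risk(
    {"Savings": "Low", "Assets": "Low", "Income": 25_000}))  # Bad Risk
```

This is exactly the rule set a decision tree encodes: each root-to-leaf path is one if/then rule.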

8
Decision Trees (cont'd)
  • Ideally, all branches terminate at pure leaf
    nodes, where every record arriving at a given
    leaf node has the same target class value
  • A diverse leaf node contains records with
    different target class values (Good Risk and Bad
    Risk), and the algorithm may be unable to split
    it further
  • For example, one subset of records has Savings =
    High, Income ≤ $30,000, and Assets = Low. Its
    leaf node contains 2 Good Risk and 3 Bad Risk
    records
  • All of these records contain the same predictor
    values, so there is no way to split further
    toward a pure leaf node
  • The 3/5 majority of records are classified Bad
    Risk, with 60% confidence
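The 60% confidence figure is simply the majority class's share of the records at the diverse leaf. A sketch:

```python
from collections import Counter

def leaf_classification(labels):
    """Classify a diverse leaf by majority vote, reporting confidence
    as the majority class's fraction of the leaf's records."""
    counts = Counter(labels)
    majority, n = counts.most_common(1)[0]
    return majority, n / len(labels)

# The leaf described above: 2 Good Risk and 3 Bad Risk records.
label, conf = leaf_classification(
    ["Good Risk", "Good Risk", "Bad Risk", "Bad Risk", "Bad Risk"])
print(label, conf)  # Bad Risk 0.6
```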

9
Decision Rule
  • Create Models to
  • Extol the Obvious
  • Ignore the Extraneous

[Scatter plot of example training data omitted]
10
Tree Algorithm: Partition with Best Split
[Scatter plot of training data in (x1, x2) showing the best split omitted]
11
Tree Algorithm: Partition with Best Split
[Scatter plot of training data in (x1, x2) showing further splits omitted]
12
Recursive Partitioning
[Chart omitted: training and validation accuracy (50-100%) on the
training data versus partition count]
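The training-versus-validation accuracy curve from slide 12 can be reproduced on synthetic data: as the tree is allowed more partitions (greater depth), training accuracy keeps climbing while validation accuracy eventually suffers. A sketch using scikit-learn; the data set and depth values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data, so an over-grown tree can overfit.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5,
                                          random_state=0)

results = {}
for depth in (1, 3, 6, None):  # None = grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))
    print(depth, results[depth])
```

The fully grown tree reaches 100% training accuracy (every leaf is pure), but on the held-out validation split it does worse than that, which is the gap the slide's diverging curves depict.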