Discovering Knowledge in Data, Daniel T. Larose, Ph.D. - PowerPoint PPT Presentation

1
Discovering Knowledge in Data
Daniel T. Larose, Ph.D.
  • Chapter 6: Decision Trees
  • Prepared by James Steck, Graduate Assistant

2
Objectives
  • Explore the general concept of decision trees
  • Discuss the benefits and drawbacks of decision
    tree models

3
Decision Trees
  • Decision Trees
  • Popular classification method in data mining
  • A decision tree is a collection of decision nodes,
    connected by branches, extending downward from the
    root node to terminating leaf nodes
  • Beginning with the root node, attributes are
    tested at decision nodes, and each possible
    outcome results in a branch
  • Each branch leads either to another decision node
    or to a leaf node
  • Requirements
  • A decision tree is a supervised classification
    method

4
Decision Trees (cont'd)
  • A pre-classified target variable must be included
    in the training set
  • The target variable must be categorical
  • Decision trees learn by example, so the training
    set should contain records with varied attribute
    values
  • If the training set systematically lacks definable
    subsets, classification becomes problematic
  • Classification and Regression Trees (CART) and
    C4.5 are two leading algorithms used in data
    mining

5
Decision Trees (cont'd)
  • Example
  • Credit Risk is the target variable
  • Customers are classified as either Good Risk or
    Bad Risk
  • Predictor variables are Savings (Low, Med, High),
    Assets (Low, High), and Income

[Decision tree diagram for the credit risk example omitted]
6
Decision Trees (cont'd)
  • The highest-level decision node is the root node,
    which tests whether a record has Savings = Low,
    Med, or High
  • Records are split according to the value of
    Savings. For example, records with Savings = Low
    go down the leftmost branch to the next decision
    node
  • Records with Savings = Med proceed down the middle
    branch to a leaf node. This terminates the branch,
    with all records having Savings = Med classified
    as Good Risk
  • Additional decision nodes are not required because
    these records are classified with 100% accuracy

7
Decision Trees (cont'd)
  • Records with Savings = Low are tested at a
    second-level decision node to determine whether
    Assets = Low
  • Those with low assets are classified Bad Risk,
    while the others are classified Good Risk
  • The second-level decision node in the right branch
    tests whether customers with Savings = High have
    Income ≤ $30,000
  • Those with Income less than or equal to $30,000
    are classified Bad Risk. The others are classified
    Good Risk
  • If no further splits are possible, the algorithm
    terminates
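The tree traced through the last two slides can be written out directly as nested tests. A minimal sketch, with a record represented as a Python dict (the key names are an assumption of this sketch):

```python
def classify_credit_risk(record):
    """Apply the example tree: the root node tests Savings; the
    second-level nodes test Assets (left branch) or Income (right)."""
    if record["Savings"] == "Med":
        # Pure leaf: all Savings=Med records are Good Risk.
        return "Good Risk"
    if record["Savings"] == "Low":
        # Second-level node: Assets = Low?
        return "Bad Risk" if record["Assets"] == "Low" else "Good Risk"
    # Savings = High: test whether Income <= $30,000.
    return "Bad Risk" if record["Income"] <= 30_000 else "Good Risk"

print(classify_credit_risk(
    {"Savings": "Low", "Assets": "Low", "Income": 25_000}))  # Bad Risk
```

This is exactly the rule set a decision tree encodes: each root-to-leaf path is one if/then rule.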

8
Decision Trees (cont'd)
  • Ideally, all branches terminate at pure leaf
    nodes, where every record arriving at a given
    leaf node has the same target class value
  • A diverse leaf node contains records with
    different target class values (Good Risk and Bad
    Risk), and the algorithm may be unable to split
    it further
  • For example, one subset of records has Savings =
    High, Income ≤ $30,000, and Assets = Low. Its
    leaf node contains 2 Good Risk and 3 Bad Risk
    records
  • All of these records contain the same predictor
    values, so there is no way to split further
    toward a pure leaf node
  • The 3/5 majority of records are classified Bad
    Risk, with 60% confidence
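The 60% confidence figure is simply the majority class's share of the records at the diverse leaf. A sketch:

```python
from collections import Counter

def leaf_classification(labels):
    """Classify a diverse leaf by majority vote, reporting confidence
    as the majority class's fraction of the leaf's records."""
    counts = Counter(labels)
    majority, n = counts.most_common(1)[0]
    return majority, n / len(labels)

# The leaf described above: 2 Good Risk and 3 Bad Risk records.
label, conf = leaf_classification(
    ["Good Risk", "Good Risk", "Bad Risk", "Bad Risk", "Bad Risk"])
print(label, conf)  # Bad Risk 0.6
```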

9
Decision Rule
  • Create Models to
  • Extol the Obvious
  • Ignore the Extraneous

[Scatter plot of example training data omitted]
10
Tree Algorithm: Partition with Best Split
[Scatter plot of training data in (x1, x2) showing the best split omitted]
11
Tree Algorithm: Partition with Best Split
[Scatter plot of training data in (x1, x2) showing further splits omitted]
12
Recursive Partitioning
[Chart omitted: training and validation accuracy (50-100%) on the
training data versus partition count]
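The training-versus-validation accuracy curve from slide 12 can be reproduced on synthetic data: as the tree is allowed more partitions (greater depth), training accuracy keeps climbing while validation accuracy eventually suffers. A sketch using scikit-learn; the data set and depth values are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class data, so an over-grown tree can overfit.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3,
                           flip_y=0.2, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.5,
                                          random_state=0)

results = {}
for depth in (1, 3, 6, None):  # None = grow until every leaf is pure
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0)
    clf.fit(X_tr, y_tr)
    results[depth] = (clf.score(X_tr, y_tr), clf.score(X_va, y_va))
    print(depth, results[depth])
```

The fully grown tree reaches 100% training accuracy (every leaf is pure), but on the held-out validation split it does worse than that, which is the gap the slide's diverging curves depict.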