Random Decision Trees - PowerPoint PPT Presentation

Title:

Random Decision Trees

Description:

Search simplest best hypothesis NP-hard. Various ... xrandom.sh: shell script for N-fold evaluation. runten.sh: shell script to run rt-train a number of times ... – PowerPoint PPT presentation

Slides: 21
Provided by: xinw2
Category:
Tags: decision | random | trees


Transcript and Presenter's Notes

Title: Random Decision Trees


1
Random Decision Trees
  • COSC 6412 Project
  • Xinwei Li

2
Outline
  • Motivation
  • Random Decision Trees
  • Software Implementation
  • Experimental Results and Analysis
  • Conclusions and Future Work

3
Motivation
  • Classical decision tree algorithms
      • Searching for the simplest best hypothesis is NP-hard
      • Various heuristics: Information Gain, Gini Index, etc.
      • Learning is inefficient and unscalable
  • Doubt on simplest models
      • Complicated hypotheses
      • Bagging and boosting
      • Better performance but lower efficiency

4
Motivation
  • One of the complicated Trees
  • Random Decision Trees
  • Complicated but efficient
  • Implementation
  • Performance and Scalability study
  • Improvement

5
Random Decision Trees
  • Proposed by Wei Fan et al. in 2003
  • Feature selection
      • Random
      • Categorical: only once in a path
      • Continuous: multiple times, with different thresholds
      • No information gain calculation
      • Without using training data
  • Tree depth
      • Predefined limit: half the number of attributes
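
Note: the construction rules on this slide can be sketched in a few lines. This is a minimal illustration, not the project's C code. The dictionary layout, the function names, and the value ranges used to draw thresholds for continuous features are all assumptions; the feature names are only in the spirit of the Golf data mentioned later.

```python
import random

def build_random_tree(features, depth_limit, rng, used=frozenset(), depth=0):
    """Build a tree skeleton by random feature choice; no training data is read.

    `features` maps a name to a list of values (categorical) or to a
    (low, high) tuple (continuous). Both conventions are assumptions.
    """
    # A categorical feature already used on this path is no longer a candidate.
    candidates = [name for name in features if name not in used]
    if depth >= depth_limit or not candidates:
        return {"counts": {}}  # leaf; statistics are filled in later
    name = rng.choice(candidates)
    spec = features[name]
    if isinstance(spec, tuple):
        # Continuous: random threshold; the feature may be chosen again deeper down.
        low, high = spec
        return {
            "feature": name,
            "threshold": rng.uniform(low, high),
            "counts": {},
            "children": {
                side: build_random_tree(features, depth_limit, rng, used, depth + 1)
                for side in ("le", "gt")
            },
        }
    # Categorical: one child per value, and the feature is retired on this path.
    return {
        "feature": name,
        "counts": {},
        "children": {
            value: build_random_tree(features, depth_limit, rng, used | {name}, depth + 1)
            for value in spec
        },
    }

def tree_depth(node):
    """Number of tests on the longest root-to-leaf path."""
    if "children" not in node:
        return 0
    return 1 + max(tree_depth(child) for child in node["children"].values())

features = {
    "outlook": ["sunny", "overcast", "rain"],
    "windy": ["true", "false"],
    "temperature": (64.0, 85.0),
    "humidity": (65.0, 96.0),
}
# Depth limit: half the number of attributes, as stated on the slide.
tree = build_random_tree(features, depth_limit=len(features) // 2, rng=random.Random(0))
print(tree_depth(tree))
```

The seed is fixed only to make the sketch repeatable; a group of trees would use different random draws for each tree.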

6
Random Decision Trees
  • Training data
      • Classified into leaf nodes
      • Update the statistics of the nodes
  • Pruning
      • Insignificant difference among child nodes
  • Number of trees
      • > 10
  • Classification
      • Average of the posterior probabilities of a group of trees
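
Note: the training and classification steps on this slide can be sketched as follows. This is an illustration under assumptions: trees are hand-written one-split stumps on continuous features, only leaf counts are kept, pruning is left out, and an empty leaf falls back to a uniform posterior. Only two trees are used to keep the arithmetic visible, whereas the slide recommends more than 10.

```python
from collections import Counter

def stump(feature, threshold):
    """A one-split tree with empty class counts in both leaves (illustrative layout)."""
    return {"feature": feature, "threshold": threshold,
            "children": {"le": {"counts": Counter()}, "gt": {"counts": Counter()}}}

def find_leaf(node, x):
    """Follow the tests from the root until a leaf is reached."""
    while "children" in node:
        side = "le" if x[node["feature"]] <= node["threshold"] else "gt"
        node = node["children"][side]
    return node

def update_statistics(tree, examples):
    """Pass each training example down the tree and count its class at the leaf."""
    for x, label in examples:
        find_leaf(tree, x)["counts"][label] += 1

def posterior(tree, x, classes):
    """Class frequencies at the leaf that x falls into."""
    counts = find_leaf(tree, x)["counts"]
    total = sum(counts.values())
    if total == 0:
        # Assumption: an empty leaf gives a uniform estimate.
        return {c: 1.0 / len(classes) for c in classes}
    return {c: counts[c] / total for c in classes}

def classify(trees, x, classes):
    """Average the per-tree posteriors and return the most probable class."""
    average = {c: sum(posterior(t, x, classes)[c] for t in trees) / len(trees)
               for c in classes}
    return max(average, key=average.get), average

classes = ["play", "dont"]
examples = [
    ({"humidity": 70, "temperature": 72}, "play"),
    ({"humidity": 75, "temperature": 68}, "play"),
    ({"humidity": 90, "temperature": 80}, "dont"),
    ({"humidity": 95, "temperature": 65}, "dont"),
]
trees = [stump("humidity", 80), stump("temperature", 70)]
for t in trees:
    update_statistics(t, examples)

label, average = classify(trees, {"humidity": 72, "temperature": 71}, classes)
print(label, average)
```

Here the humidity stump is certain about the query while the temperature stump is undecided, and the average still favours one class.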

7
Random Decision Trees
  • Error-tolerance property of probabilistic decision making
      • Two-class problem: P(C|x) = 0.51 and P(C|x) = 0.90 lead to the same decision
  • Related work
      • Randomized Decision Trees (Y. Amit et al., 1997)
      • Bagging and Boosting (J. R. Quinlan, 1996)
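
Note: one reading of the error-tolerance bullet on this slide is that a probability estimate only has to fall on the correct side of the decision threshold. The sketch below assumes a two-class problem with a threshold of 0.5; the function name and the threshold are illustrative.

```python
def decide(p_positive, threshold=0.5):
    """Predict the positive class when its estimated probability exceeds the threshold."""
    return "positive" if p_positive > threshold else "negative"

# A rough estimate (0.51) and a confident one (0.90) produce the same prediction.
print(decide(0.51), decide(0.90))
```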

8
Software Implementation
  • Development environment
      • OS: Linux
      • Language: C
  • Acknowledgement
      • Ross Quinlan: source code of C4.5 Release 8

9
System Architecture
[Architecture diagram not preserved; surviving label: Main]
10
Data Flow
[Data flow diagram not preserved; surviving component labels: Output, Classification, Tree Builder, Memory Pool, Cost Loader, Names Loader, Data Loader, Tree Manager, Disk]
11
Algorithm Flow Chart
[Flow chart not preserved; surviving node labels: Begin, > Depth?, Build Tree, N trees?, Random Feature, Data?, Construct Child Nodes, Update Statistics; branches labeled y / n]
12
Experimental Results
13
Analysis
  • Golf
      • Too simple: 14 cases
  • Hype
      • Extremely imbalanced distribution
      • All cases are assigned to the dominating class

14
Experimental Results (10-fold)
15
Analysis
  • Extremely imbalanced distribution
      • Dominating class
  • Continuous attributes
      • Random threshold
  • Many classes
      • More likely to make erroneous decisions

16
Improvement
  • For continuous type
      • Dependent and independent threshold
      • No apparent difference
  • For extremely imbalanced data
      • Cost matrix
      • Overall accuracy drops
      • Precision and recall for the small class increase
      • Active chemical compounds database: 2 -> 4.1
      • Precision 0 -> 11.7, recall 0 -> 18

0 1 1 -40
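
Note: a cost matrix can be combined with the averaged posterior by predicting the class with the lowest expected cost. The sketch below shows that standard rule; it is not confirmed to be the project's exact procedure, and the class names and cost values are illustrative rather than the matrix used in the experiment.

```python
def min_cost_class(posterior, cost):
    """Return the prediction with the lowest expected cost.

    `cost[actual][predicted]` is the penalty for predicting `predicted`
    when the true class is `actual` (layout is an assumption).
    """
    expected = {
        predicted: sum(posterior[actual] * cost[actual][predicted] for actual in posterior)
        for predicted in posterior
    }
    return min(expected, key=expected.get), expected

# Illustrative numbers: the rare class is 40 times more expensive to miss.
posterior = {"inactive": 0.95, "active": 0.05}
cost = {
    "inactive": {"inactive": 0.0, "active": 1.0},
    "active": {"inactive": 40.0, "active": 0.0},
}
label, expected = min_cost_class(posterior, cost)
print(label, expected)
```

With these numbers the rare class is predicted even at 5% probability, which matches the trade-off on the slide: more errors on the dominating class, better precision and recall on the small one.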
17
Improvement
  • Ideas from bagging and boosting
      • A random part of the training data for each tree
      • Assign a weight to each tree according to its accuracy on the training dataset
      • Enough training data
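
Note: the two ideas on this slide can be sketched as a sampling helper and a weighted average of per-tree posteriors. The function names, the sampling fraction, and the use of training accuracy directly as the weight are assumptions made for illustration.

```python
import random

def subsample(examples, fraction, rng):
    """Draw a random part of the training data (without replacement) for one tree."""
    size = max(1, int(len(examples) * fraction))
    return rng.sample(examples, size)

def weighted_average(posteriors, weights):
    """Combine per-tree posteriors, weighting each tree by its training accuracy."""
    total = sum(weights)
    classes = posteriors[0].keys()
    return {c: sum(w * p[c] for p, w in zip(posteriors, weights)) / total
            for c in classes}

part = subsample(list(range(10)), fraction=0.5, rng=random.Random(0))

# Two trees disagree; the more accurate tree (0.9 vs 0.6) has more say.
posteriors = [{"a": 0.8, "b": 0.2}, {"a": 0.4, "b": 0.6}]
combined = weighted_average(posteriors, weights=[0.9, 0.6])
print(len(part), combined)
```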

18
Scalability
  • Scalable in the size of the training dataset
      • No need to hold data in memory
      • Scan data examples once
  • How about the tree depth? Exponential!
      • Adult dataset: 14 attributes
      • 9 -> crash
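
Note: the exponential growth can be made concrete by counting the nodes of a fully expanded tree. The sketch assumes every internal node has the same number of children and that the root is at depth 0; real trees mix binary splits on continuous features with wider splits on categorical ones, so these are only orders of magnitude.

```python
def full_tree_nodes(branching, depth):
    """Total nodes in a complete tree: 1 + b + b^2 + ... + b^depth."""
    return sum(branching ** level for level in range(depth + 1))

for depth in (7, 9, 14):
    print(depth, full_tree_nodes(2, depth), full_tree_nodes(3, depth))
```

Even with binary splits only, going from depth 7 to depth 14 multiplies the node count by more than a hundred.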
19
Conclusions and Future Work
  • Implementation of Random Decision Trees
      • Better than some other methods on some datasets
      • Similar performance on some datasets
      • Worse on some datasets
  • Some improvements
      • Incorporate cost matrix
      • Sampling of training data
      • Adding weights to trees

20
Conclusions and Future Work
  • Handle continuous values
      • Try discretization
  • Handle imbalanced datasets

Thank you!
Program List
  • rt-train: training and testing program
  • rt-test: testing program
  • xrandom.sh: shell script for N-fold evaluation
  • runten.sh: shell script to run rt-train a number of times
  • average: an auxiliary program to calculate the average from results
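
Note: N-fold evaluation splits the data into N parts and holds each part out once for testing. The sketch below shows one simple way to generate such splits; it is not the logic of xrandom.sh, and the function name and the round-robin assignment are assumptions.

```python
def n_fold_splits(n_examples, n_folds):
    """Yield (train_indices, test_indices) pairs, holding out one fold at a time."""
    # Round-robin assignment: example i goes to fold i % n_folds.
    folds = [list(range(start, n_examples, n_folds)) for start in range(n_folds)]
    for held_out, test in enumerate(folds):
        train = [i for k, fold in enumerate(folds) if k != held_out for i in fold]
        yield train, test

# 14 examples (the size of the Golf data) in 10 folds.
splits = list(n_fold_splits(n_examples=14, n_folds=10))
for train, test in splits:
    print(len(train), test)
```

In practice the examples would usually be shuffled first, and the accuracies from all folds averaged at the end.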