
1
C4.5 Demo
  • Andrew Rosenberg
  • CS4701 11/30/04

2
What is c4.5?
  • c4.5 is a program that creates a decision tree
    based on a set of labeled input data.
  • This decision tree can then be tested against
    unseen labeled test data to quantify how well it
    generalizes.

3
Running c4.5
  • On cunix.columbia.edu:
  • amr2104/c4.5/bin/c4.5 -u -f filestem
  • On cluster.cs.columbia.edu:
  • amaxwell/c4.5/bin/c4.5 -u -f filestem
  • c4.5 expects to find 3 files:
  • filestem.names
  • filestem.data
  • filestem.test
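
The invocation can be wrapped in a small helper. The Python sketch below is not part of the original slides: the function name `c45_command` and its parameters are hypothetical, and it assumes the standard c4.5 options, where `-f` gives the filestem and `-u` asks c4.5 to evaluate the tree on the unseen cases in filestem.test. It only checks that the expected files exist and builds the argument list.

```python
from pathlib import Path


def c45_command(filestem, binary="c4.5", use_test=True):
    """Return the argument list for a c4.5 run (hypothetical helper).

    Raises FileNotFoundError if an expected input file is missing.
    """
    required = [".names", ".data"]
    if use_test:
        required.append(".test")  # only needed when -u is passed
    missing = [filestem + ext for ext in required
               if not Path(filestem + ext).is_file()]
    if missing:
        raise FileNotFoundError("missing input files: " + ", ".join(missing))

    cmd = [binary]
    if use_test:
        cmd.append("-u")         # evaluate on the unseen cases in filestem.test
    cmd += ["-f", filestem]      # filestem shared by .names, .data and .test
    return cmd
```

The returned list can be passed to `subprocess.run`; `binary` should be set to wherever c4.5 is installed on the machine in use.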

4
File Format: .names
  • The file begins with a comma-separated list of
    classes ending with a period, followed by a blank
    line.
  • E.g., gt50K, lt50K.
  • The remaining lines have the following format
    (note the end-of-line period):
  • attribute: {ignore | discrete n | continuous | list of values}.
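
A .names file in this format can be generated rather than typed by hand. The following is a minimal Python sketch, not taken from the slides; the function `format_names` and its argument layout are hypothetical. It assumes each attribute line is the attribute name, a colon, the type or value list, and a closing period.

```python
def format_names(classes, attributes):
    """Build the text of a .names file (hypothetical helper).

    classes:    list of class labels, e.g. ["gt50K", "lt50K"]
    attributes: list of (name, spec) pairs, where spec is a string such as
                "continuous", "ignore" or "discrete 5", or a list of values
    """
    # First line: the classes, comma separated, ending with a period,
    # followed by a blank line.
    lines = [", ".join(classes) + ".", ""]
    for name, spec in attributes:
        if not isinstance(spec, str):
            spec = ", ".join(spec)  # explicit list of discrete values
        lines.append(f"{name}: {spec}.")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(format_names(
        ["gt50K", "lt50K"],
        [("age", "continuous"), ("sex", ["Female", "Male"])],
    ))
```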

5
Example: census.names
  • gt50K, lt50K.
  • age: continuous.
  • workclass: Private, Self-emp-not-inc,
    Self-emp-inc, Federal-gov, etc.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th,
    HS-grad, Prof-school, etc.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced,
    Never-married, etc.
  • occupation: Tech-support, Craft-repair,
    Other-service, Sales, etc.
  • relationship: Wife, Own-child, Husband,
    Not-in-family, Unmarried.
  • race: White, Asian-Pac-Islander,
    Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England,
    Puerto-Rico, Canada, etc.

6
File Format: .data, .test
  • Each line in these data files is a comma-separated
    list of attribute values ending with a class label
    followed by a period.
  • The attributes must be in the same order as
    described in the .names file.
  • Unavailable values can be entered as "?".
  • When creating a test set, make sure that you
    remove its data points from the training data.
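
The rules above can be sketched in Python as follows. This is an illustration only: `format_case` and `shared_cases` are hypothetical names, `None` is an assumed stand-in for an unavailable value, and the overlap check compares whole lines, so two distinct individuals with identical attributes would also be flagged.

```python
def format_case(values, label):
    """Render one case as a .data/.test line; None marks an unavailable value."""
    fields = ["?" if v is None else str(v) for v in values]
    return ", ".join(fields + [label]) + "."


def shared_cases(train_lines, test_lines):
    """Return the test lines that also occur in the training data.

    An empty result means no test case was left in the training set
    (as far as exact line matches can tell).
    """
    seen = {line.strip() for line in train_lines}
    return [line for line in test_lines if line.strip() in seen]


if __name__ == "__main__":
    row = format_case([18, None, 103497, "Some-college"], "lt50K")
    print(row)
    print(shared_cases([row], [row + "\n"]))
```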

7
Example: adult.test
  • 25, Private, 226802, 11th, 7, Never-married,
    Machine-op-inspct, Own-child, Black, Male, 0, 0,
    40, United-States, lt50K.
  • 38, Private, 89814, HS-grad, 9,
    Married-civ-spouse, Farming-fishing, Husband,
    White, Male, 0, 0, 50, United-States, lt50K.
  • 28, Local-gov, 336951, Assoc-acdm, 12,
    Married-civ-spouse, Protective-serv, Husband,
    White, Male, 0, 0, 40, United-States, gt50K.
  • 44, Private, 160323, Some-college, 10,
    Married-civ-spouse, Machine-op-inspct, Husband,
    Black, Male, 7688, 0, 40, United-States, gt50K.
  • 18, ?, 103497, Some-college, 10, Never-married,
    ?, Own-child, White, Female, 0, 0, 30,
    United-States, lt50K.
  • 34, Private, 198693, 10th, 6, Never-married,
    Other-service, Not-in-family, White, Male, 0, 0,
    30, United-States, lt50K.
  • 29, ?, 227026, HS-grad, 9, Never-married, ?,
    Unmarried, Black, Male, 0, 0, 40, United-States,
    lt50K.
  • 63, Self-emp-not-inc, 104626, Prof-school, 15,
    Married-civ-spouse, Prof-specialty, Husband,
    White, Male, 3103, 0, 32, United-States, gt50K.
  • 24, Private, 369667, Some-college, 10,
    Never-married, Other-service, Unmarried, White,
    Female, 0, 0, 40, United-States, lt50K.
  • 55, Private, 104996, 7th-8th, 4,
    Married-civ-spouse, Craft-repair, Husband, White,
    Male, 0, 0, 10, United-States, lt50K.
  • 65, Private, 184454, HS-grad, 9,
    Married-civ-spouse, Machine-op-inspct, Husband,
    White, Male, 6418, 0, 40, United-States, gt50K.
  • 36, Federal-gov, 212465, Bachelors, 13,
    Married-civ-spouse, Adm-clerical, Husband, White,
    Male, 0, 0, 40, United-States, lt50K.
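
Lines like these can be checked before they are handed to c4.5. The Python sketch below is an assumption-laden illustration, not part of the slides: `parse_case` is a hypothetical helper that splits one line into attribute values and a class label, maps "?" to `None`, and rejects lines whose field count does not match the number of attributes declared in the .names file (14 in the census example).

```python
def parse_case(line, n_attributes):
    """Split one .data/.test line into (values, label).

    Unavailable values ("?") are returned as None. Raises ValueError when
    the line does not hold n_attributes values plus one class label.
    """
    text = line.strip()
    if text.endswith("."):
        text = text[:-1]  # drop the period that ends the case
    fields = [f.strip() for f in text.split(",")]
    if len(fields) != n_attributes + 1:
        raise ValueError(
            f"expected {n_attributes + 1} fields, found {len(fields)}")
    values = [None if f == "?" else f for f in fields[:-1]]
    return values, fields[-1]


if __name__ == "__main__":
    case = ("18, ?, 103497, Some-college, 10, Never-married, ?, Own-child, "
            "White, Female, 0, 0, 30, United-States, lt50K.")
    print(parse_case(case, 14))
```

A check of this kind catches two cases that were accidentally joined on one line, because the joined line has too many fields.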

8
c4.5 Output
  • The decision tree proper.
  • Each leaf is followed by (weighted training
    examples/weighted training errors).
  • Tables of training error and testing error.
  • Confusion matrix.
  • You'll want to redirect the output of c4.5 to a text
    file for later viewing.
  • E.g., c4.5 -u -f filestem > filestem.results
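
The same redirection can be done from a script. This is a minimal Python sketch under stated assumptions: `run_and_save` is a hypothetical helper, and the command list passed to it (for instance `["c4.5", "-u", "-f", "census"]`) must point at a c4.5 binary that is actually installed.

```python
import subprocess


def run_and_save(cmd, results_path):
    """Run cmd and write its standard output to results_path.

    Equivalent in spirit to `cmd > results_path` in a shell.
    Returns the exit status of the command.
    """
    with open(results_path, "w") as out:
        completed = subprocess.run(cmd, stdout=out, check=False)
    return completed.returncode
```

For example, `run_and_save(["c4.5", "-u", "-f", "census"], "census.results")` would keep the tree, the error tables and the confusion matrix in census.results for later viewing.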

9
Example output
  • capital-gain > 6849 : gt50K (203.0/6.2)
  • capital-gain <= 6849 :
  • capital-gain > 6514 : lt50K (7.0/1.3)
  • capital-gain <= 6514 :
  • marital-status = Married-civ-spouse: gt50K (18.0/1.3)
  • marital-status = Divorced: lt50K (2.0/1.0)
  • marital-status = Never-married: gt50K (0.0)
  • marital-status = Separated: gt50K (0.0)
  • marital-status = Widowed: gt50K (0.0)
  • marital-status = Married-spouse-absent: gt50K (0.0)
  • marital-status = Married-AF-spouse: gt50K (0.0)
  • Tree saved
  • Evaluation on training data (4660 items)
  • Before Pruning After Pruning
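
The numbers in parentheses at each leaf can be pulled out of a saved results file with a regular expression. The sketch below is illustrative only; `leaf_counts` is a hypothetical name, and it assumes a leaf line ends with either `(N/E)` or `(N)`, where N is the weighted number of training examples reaching the leaf and E the weighted number of those that the leaf misclassifies (taken as 0 when omitted).

```python
import re

# Matches a trailing "(N)" or "(N/E)" annotation at the end of a line.
LEAF = re.compile(r"\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)\s*$")


def leaf_counts(line):
    """Return (examples, errors) for a leaf line, or None for other lines."""
    match = LEAF.search(line)
    if match is None:
        return None
    examples = float(match.group(1))
    errors = float(match.group(2)) if match.group(2) else 0.0
    return examples, errors


if __name__ == "__main__":
    for text in ("capital-gain > 6849 : gt50K (203.0/6.2)",
                 "marital-status = Widowed: gt50K (0.0)",
                 "capital-gain <= 6849 :"):
        print(text, "->", leaf_counts(text))
```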

10
k-fold Cross Validation
  • Start with one large data set.
  • Using a script, randomly divide this data set
    into k sets.
  • At each iteration, use k-1 sets to train the
    decision tree, and the remaining set to test the
    model.
  • Repeat this k times and take the average testing
    error.
  • The average error estimates how well the learning
    algorithm generalizes on this data set.
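
The splitting script mentioned above could look like the following Python sketch. It is one possible approach, not the script used in the course: `k_fold_splits` and `mean_error` are hypothetical names, the fixed `seed` is an assumption made for repeatability, and the cases are whatever lines were read from the single large data file.

```python
import random


def k_fold_splits(cases, k, seed=0):
    """Yield k (train, test) pairs from a random k-way partition of cases."""
    shuffled = list(cases)
    random.Random(seed).shuffle(shuffled)      # random division into k sets
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]                        # the one held-out set
        train = [case for j, fold in enumerate(folds) if j != i
                 for case in fold]             # the other k-1 sets
        yield train, test


def mean_error(errors):
    """Average the k testing errors into a single estimate."""
    return sum(errors) / len(errors)


if __name__ == "__main__":
    for train, test in k_fold_splits(range(10), k=5):
        print(len(train), "training cases,", len(test), "test cases")
```

In each iteration, `train` would be written to filestem.data and `test` to filestem.test before running c4.5, and the testing error read from each run would be collected and passed to `mean_error`.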