1
Final Project - Mining Mushroom World
2
Agenda
  • Motivation and Background
  • Determine the Data Set (2)
  • 10 DM Methodology steps (19)
  • Conclusion

3
Motivation and Background
  • To distinguish between edible mushrooms and
    poisonous ones by how they look
  • To know whether we can eat a mushroom, so we can
    survive in the wild
  • To survive outside the computer world

4
Determine the Data Set (1/2)
  • Source of data: UCI Machine Learning Repository
  • Mushrooms Database
  • From the Audubon Society Field Guide
  • Documentation: complete, but missing statistical information
  • Described in terms of physical characteristics
  • Classification: poisonous or edible
  • All attributes are nominal-valued
  • Large database: 8,124 instances, with 2,480 missing values for the stalk-root attribute (loaded in the sketch below)

5
Determine the Data Set (2/2)
  • 1. Past Usage
  • Schlimmer, J. S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).
  • Iba, W., Wogulis, J., & Langley, P. (1988). ICML, 73-79.
  • 2. No other mushroom data sets

6
10 DM Methodology steps
  • Step 1. Translate the Business Problem into a Data Mining Problem
  • Data Mining Goal: separate edible mushrooms from poisonous ones
  • How will the Results be Used: to increase the survival rate
  • How will the Results be Delivered: Decision Tree, Naïve Bayes, Ripper, NeuralNet

7
10 DM Methodology steps
  • Step 2. Select Appropriate Data
  • Data Source:
  • The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
  • Jeff Schlimmer donated these data on April 27, 1987
  • Volumes of Data (checked in the sketch below):
  • Total: 8,124 instances
  • 4,208 (51.8%) edible; 3,916 (48.2%) poisonous
  • 2,480 (30.5%) missing values in the attribute stalk-root
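A short check of these proportions, continuing from the DataFrame df loaded in the earlier sketch:

    # Class balance: roughly 51.8% edible (e) vs. 48.2% poisonous (p).
    print(df["class"].value_counts(normalize=True))

    # Share of instances with a missing stalk-root value: roughly 30.5%.
    print(df["stalk-root"].isna().mean())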

8
10 DM Methodology steps
  • Step 2. Select Appropriate Data
  • How Many Variables: 22 attributes
  • cap-shape, cap-color, odor, population, habitat, and so on
  • How Much History is Required: no seasonality
  • As long as we can eat them when we see them

9
10 DM Methodology steps
  • Step 3. Get to Know the Data
  • Examine Distributions: use Weka to visualize all 22 attributes with histograms (a rough matplotlib equivalent is sketched below)
  • Class: edible=e, poisonous=p
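The slides use Weka's visualization panel; a rough pandas/matplotlib equivalent, assuming the df from the loading sketch and matplotlib installed:

    # Bar charts of value counts for each of the 22 nominal attributes.
    import matplotlib.pyplot as plt

    attributes = [c for c in df.columns if c != "class"]
    fig, axes = plt.subplots(5, 5, figsize=(20, 16))
    for ax, attr in zip(axes.flat, attributes):
        # fillna("?") restores the missing-value marker as its own bar.
        df[attr].fillna("?").value_counts().plot.bar(ax=ax, title=attr)
    for ax in axes.flat[len(attributes):]:
        ax.set_visible(False)   # hide the three unused panels
    plt.tight_layout()
    plt.show()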

10
Step 3. Get to Know the Data
  • 1. Examine Distributions: there are 2 types of histograms
  • First type: all possible values appear
  • (Attribute 21) population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y

11
Step 3. Get to Know the Data
  • 1. Examine Distributions: there are 2 types of histograms
  • Second type: only some of the possible values appear
  • (Attribute 7) gill-spacing: close=c, crowded=w, distant=d

12
Step 3. Get to Know the Data
  • 1. Examine Distributions: there are exceptions
  • Exception 1: missing values in an attribute
  • (Attribute 11) stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
  • 2,480 of the 8,124 instances have missing values for this attribute

13
Step 3. Get to Know the Data
  • 1. Examine Distributions: there are exceptions
  • Exception 2: an attribute that cannot distinguish anything, because only one of its values actually appears in the data
  • (Attribute 16) veil-type: partial=p, universal=u (both exceptions are flagged in the sketch below)
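Both kinds of exception described on the last two slides can be flagged programmatically. A small sketch, again using the df loaded earlier:

    # Exception 1: attributes with missing values (expected: stalk-root).
    with_missing = [c for c in df.columns if df[c].isna().any()]

    # Exception 2: attributes where only one value actually appears, so
    # they cannot help distinguish anything (expected: veil-type).
    single_valued = [c for c in df.columns if df[c].nunique(dropna=True) == 1]

    print("missing values in:", with_missing)
    print("single-valued attributes:", single_valued)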

14
Step 3. Get to Know the Data
  • 2. Compare Values with Descriptions
  • No unexpected values except for the missing values

15
10 DM Methodology steps
  • Step 4. Create a Model Set
  • Creating a Balanced Sample: 75% (6,093 instances) as training data, 25% (2,031) as test data
  • RapidMiner's cross-validation function: k-1 folds as training data, 1 fold as test data (both are sketched below)
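A scikit-learn sketch of this step: the 75/25 split plus a k-fold analogue of RapidMiner's cross-validation (k = 10 is an illustrative choice, not stated on the slide):

    from sklearn.model_selection import StratifiedKFold, train_test_split

    X = df.drop(columns="class")
    y = df["class"]

    # 75% training / 25% test, stratified so both parts keep the class balance.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)
    print(len(X_train), len(X_test))   # roughly 6093 and 2031

    # Cross-validation: each of the k folds is held out once while the
    # remaining k-1 folds are used for training.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)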

16
10 DM Methodology steps
  • Step 5. Fix Problems with the Data
  • Dealing with Missing Values: the attribute stalk-root has 2,480 missing values
  • Replace all missing values with the most frequent stalk-root value
  • We replaced ? with the modal value, b (bulbous); see the sketch below
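A sketch of the imputation described above. The "?" marks were read in as NaN by the loading sketch, and b (bulbous) is the most frequent stalk-root value in this data set:

    # Replace missing stalk-root values with the most frequent value.
    mode_value = df["stalk-root"].mode()[0]      # 'b' (bulbous)
    df["stalk-root"] = df["stalk-root"].fillna(mode_value)
    print(df["stalk-root"].isna().sum())         # now 0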

17
10 DM Methodology steps
  • Step 6. Transform Data to Bring Information to the Surface
  • All attributes are nominal, so no numerical transformation is needed in this step

18
10 DM Methodology steps
  • Step 7. Build Model
  • 1. Decision Tree
  • Performance (confusion matrix below; a scikit-learn sketch follows the table)
  • Accuracy: 99.11%
  • Lift: 189.81

               True p    True e    Class precision
Pred. p           961         0            100.00%
Pred. e            18      1052             98.32%
Class recall   98.16%   100.00%
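The slides build the models in RapidMiner; the sketch below shows the same idea for the decision tree in scikit-learn. Because scikit-learn trees need numeric input, the nominal attributes are one-hot encoded inside a pipeline, and the split is rebuilt from the imputed DataFrame, so the exact figures will not necessarily match the RapidMiner results above:

    from sklearn.metrics import accuracy_score, confusion_matrix
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.tree import DecisionTreeClassifier

    # Rebuild the split after imputation so the filled-in values are used.
    X = df.drop(columns="class")
    y = df["class"]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=42)

    tree = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                         DecisionTreeClassifier(random_state=42))
    tree.fit(X_train, y_train)
    pred = tree.predict(X_test)

    print(accuracy_score(y_test, pred))                       # close to 1.0
    # Rows are the true classes (p, e), columns the predicted classes;
    # note the slide's table is laid out the other way around.
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))

The Naïve Bayes and neural-network models can be sketched the same way by swapping in, for example, CategoricalNB (with an ordinal rather than one-hot encoding) or MLPClassifier; scikit-learn has no built-in RIPPER implementation.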
19
10 DM Methodology steps
  • Step 7. Build Model
  • 2. Naïve Bayes
  • Performance
  • Accuracy: 95.77%
  • Lift: 179.79

               True p    True e    Class precision
Pred. p           902         9             99.01%
Pred. e            77      1043             93.12%
Class recall   92.13%    99.14%
20
10 DM Methodology steps
  • Step 7. Build Model
  • 3. Ripper
  • Performance
  • Accuracy: 100%
  • Lift: 193.06

               True p    True e    Class precision
Pred. p           979         0            100.00%
Pred. e             0      1052            100.00%
Class recall  100.00%   100.00%
21
10 DM Methodology steps
  • Step 7. Build Model
  • 4. NeuralNet
  • Performance
  • Accuracy: 91.04%
  • Lift: 179.35

               True p    True e    Class precision
Pred. p           907       110             89.18%
Pred. e            72       942             92.90%
Class recall   92.65%    89.54%
22
10 DM Methodology steps
  • Step 8. Assess Models
  • Accuracy: Ripper and the Decision Tree have the better performance

23
10 DM Methodology steps
  • Step 8. Assess Models
  • Lift (used to compare the performance of different classification models): Ripper and the Decision Tree have the higher lift values; a sketch of one common lift definition follows
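RapidMiner's exact lift formula is not given on the slides, so the numbers above cannot be reproduced exactly here. The sketch below shows one common definition (precision on the poisonous class divided by that class's base rate), reusing y_test and pred from the decision-tree sketch:

    def lift(y_true, y_pred, positive="p"):
        """Precision on the positive class divided by its base rate."""
        true_for_predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
        precision = sum(t == positive for t in true_for_predicted_pos) / len(true_for_predicted_pos)
        base_rate = sum(t == positive for t in y_true) / len(y_true)
        return precision / base_rate

    # A lift above 1.0 means the model identifies poisonous mushrooms
    # better than labelling mushrooms poisonous at random.
    print(lift(list(y_test), list(pred)))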

24
10 DM Methodology steps
  • Step 9. Deploy Models
  • We have not gone out to find real mushrooms yet
  • Step 10. Assess Results
  • Conclusion and questions:
  • Ripper and decision trees may be better models for nominal data
  • Open question: how does RapidMiner separate training data from test data?
