Title: Final Project - Mining Mushroom World
Final Project - Mining Mushroom World
Agenda
- Motivation and Background
- Determine the Data Set (2)
- 10 DM Methodology steps (19)
- Conclusion
Motivation and Background
- To distinguish between edible mushrooms and poisonous ones by how they look
- To know whether we can eat the mushroom, to survive in the wild
- To survive outside the computer world
Determine the Data Set (1/2)
- Source of data: UCI Machine Learning Repository
- Mushrooms Database
- From the Audubon Society Field Guide
- Documentation: complete, but missing statistical information
- Described in terms of physical characteristics
- Classification: poisonous or edible
- All attributes are nominal-valued
- Large database: 8124 instances (2480 missing values for attribute 11, stalk-root); the loading sketch below reads this file
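As a concrete illustration of working with this data set, here is a minimal sketch that loads it with pandas. The file name and the column list are assumptions taken from the UCI documentation rather than from the slides, which used Weka and RapidMiner.

    import pandas as pd

    # Attribute names follow the UCI Mushroom documentation; the class label
    # ('e' = edible, 'p' = poisonous) is the first column of the raw file.
    columns = [
        "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
        "gill-attachment", "gill-spacing", "gill-size", "gill-color",
        "stalk-shape", "stalk-root", "stalk-surface-above-ring",
        "stalk-surface-below-ring", "stalk-color-above-ring",
        "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
        "ring-type", "spore-print-color", "population", "habitat",
    ]

    # '?' marks the missing stalk-root values mentioned above.
    df = pd.read_csv("agaricus-lepiota.data", header=None, names=columns,
                     na_values="?")

    print(df.shape)                     # expected (8124, 23): 22 attributes + class
    print(df["class"].value_counts())   # edible vs. poisonous counts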
Determine the Data Set (2/2)
- 1. Past Usage
- Schlimmer, J. S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).
- Iba, W., Wogulis, J., Langley, P. (1988). ICML, 73-79.
- 2. No other mushroom data sets
10 DM Methodology steps
- Step 1. Translate the Business Problem into a Data Mining Problem
- Data Mining Goal: separate edible mushrooms from poisonous ones
- How will the Results be Used: increase the survival rate
- How will the Results be Delivered: Decision Tree, Naïve Bayes, Ripper, and NeuralNet models
10 DM Methodology steps
- Step 2. Select Appropriate Data
- Data Source
- The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
- Jeff Schlimmer donated these data on April 27th, 1987
- Volumes of Data
- Total 8124 instances
- 4208 (51.8%) edible, 3916 (48.2%) poisonous
- 2480 (30.5%) missing values in the attribute stalk-root
10 DM Methodology steps
- Step 2. Select Appropriate Data
- How Many Variables: 22 attributes
- cap-shape, cap-color, odor, population, habitat, and so on
- How Much History is Required: none, there is no seasonality
- As long as we can eat them when we see them
10 DM Methodology steps
- Step 3. Get to Know the Data
- Examine Distributions: use Weka to visualize all 22 attributes with histograms (a pandas sketch of the same check follows below)
- Class: edible = e, poisonous = p
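As a rough stand-in for the Weka histograms, the same distribution check can be sketched in pandas, continuing from the loading sketch above; this is an illustration, not the tool the project actually used.

    # One frequency table per nominal attribute, split by class, mirroring
    # the per-attribute histograms inspected in Weka.
    for col in df.columns.drop("class"):
        print(pd.crosstab(df[col], df["class"]))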
Step 3. Get to Know the Data
- Examine Distributions: there are 2 types of histograms
- First: all possible values appear
- (Attribute 21) population: abundant = a, clustered = c, numerous = n, scattered = s, several = v, solitary = y
Step 3. Get to Know the Data
- 1. Examine Distributions: there are 2 types of histograms
- Second: only some of the possible values appear
- (Attribute 7) gill-spacing: close = c, crowded = w, distant = d
Step 3. Get to Know the Data
- 1. Examine Distributions: there are exceptions
- Exception 1: missing values in an attribute
- (Attribute 11) stalk-root: bulbous = b, club = c, cup = u, equal = e, rhizomorphs = z, rooted = r, missing = ?
- 2480 of the 8124 instances have missing values for this attribute
Step 3. Get to Know the Data
- 1. Examine Distributions: there are exceptions
- Exception 2: an attribute that does not distinguish anything, because only one of its values occurs in the data (both exceptions are checked in the sketch below)
- (Attribute 16) veil-type: partial = p, universal = u
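Both exceptions can be confirmed programmatically; a small sketch continuing from the loading code above (the expected outputs come from the slides and the UCI documentation):

    # Exception 1: stalk-root has missing values (read in as NaN via na_values='?').
    print(df["stalk-root"].isna().sum())        # expected: 2480

    # Exception 2: veil-type does not discriminate, because only one of its
    # documented values actually occurs in the data.
    print(df["veil-type"].value_counts())       # expected: a single value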
Step 3. Get to Know the Data
- 2. Compare Values with Descriptions
- no unexpected values except for missing values
10 DM Methodology steps
- Step 4. Create a Model Set
- Creating a Balanced Sample: 75% (6093 instances) as training data, 25% (2031 instances) as test data
- RapidMiner's cross-validation function: k-1 folds as training data, 1 fold as test data (a scikit-learn sketch follows below)
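A scikit-learn sketch of the same model-set construction. The project used RapidMiner's split and cross-validation operators; the 75/25 ratio comes from the slide, while the stratification and random seed are assumptions.

    from sklearn.model_selection import StratifiedKFold, train_test_split

    X = df.drop(columns="class")
    y = df["class"]

    # 75% training (about 6093 instances), 25% test (about 2031 instances).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    # k-fold cross-validation: k-1 folds train the model, 1 fold tests it, rotated k times.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)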
10 DM Methodology steps
- Step 5. Fix Problems with the Data
- Dealing with Missing Values: the attribute stalk-root has 2480 missing values
- Replace all missing values with the "average" of stalk-root, which for a nominal attribute is its most frequent value
- We replaced ? with the value b (see the sketch below)
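A short pandas sketch of the same imputation. The slide replaced '?' with b across the whole data set; here the fill is applied to the training/test frames from the previous sketch, using the training mode (expected to be b, matching the slide).

    # Fill the missing stalk-root values with the attribute's most frequent value.
    mode_value = X_train["stalk-root"].mode()[0]        # expected 'b' (bulbous)
    X_train = X_train.fillna({"stalk-root": mode_value})
    X_test = X_test.fillna({"stalk-root": mode_value})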
10 DM Methodology steps
- Step 6. Transform Data to Bring Information to the Surface
- All attributes are nominal, so no numerical transformation is applied in this step
10 DM Methodology steps
- Step 7. Build Model
- 1. Decision Tree
- Performance
- Accuracy: 99.11%
- Lift: 189.81
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           961         0            100.00%
  Pred. e            18      1052             98.32%
  Class recall    98.16%   100.00%
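A hedged scikit-learn sketch of the Decision Tree step. The project built this model in RapidMiner, so the exact numbers above will not be reproduced; the one-hot encoding of the nominal attributes is an assumption needed by scikit-learn.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    # One-hot encode the nominal attributes; align test columns with training columns.
    X_train_enc = pd.get_dummies(X_train)
    X_test_enc = pd.get_dummies(X_test).reindex(columns=X_train_enc.columns,
                                                fill_value=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train_enc, y_train)
    pred = tree.predict(X_test_enc)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))   # rows = true p / e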
10 DM Methodology steps
- Step 7. Build Model
- 2. Naïve Bayes
- Performance
- Accuracy: 95.77%
- Lift: 179.79
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           902         9             99.01%
  Pred. e            77      1043             93.12%
  Class recall    92.13%    99.14%
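An analogous sketch for Naïve Bayes. BernoulliNB on the one-hot indicators is a stand-in assumption for RapidMiner's Naïve Bayes operator, which handles nominal attributes directly.

    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import accuracy_score, confusion_matrix

    nb = BernoulliNB().fit(X_train_enc, y_train)
    pred = nb.predict(X_test_enc)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))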
10 DM Methodology steps
- Step 7. Build Model
- 3. Ripper
- Performance
- Accuracy: 100%
- Lift: 193.06
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           979         0            100.00%
  Pred. e             0      1052            100.00%
  Class recall   100.00%   100.00%
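scikit-learn has no RIPPER learner, so this sketch assumes the third-party wittgenstein package as a stand-in for RapidMiner's Ripper operator; the package's availability and exact API are assumptions here.

    import wittgenstein as lw

    # RIPPER learns an ordered rule set and can work on the nominal attributes directly.
    train_df = X_train.copy()
    train_df["class"] = y_train

    ripper = lw.RIPPER()
    ripper.fit(train_df, class_feat="class", pos_class="p")

    print(ripper.score(X_test, y_test))    # accuracy on the held-out 25%
    ripper.out_model()                     # print the learned rules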
10 DM Methodology steps
- Step 7. Build Model
- 4. NeuralNet
- Performance
- Accuracy: 91.04%
- Lift: 179.35
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           907       110             89.18%
  Pred. e            72       942             92.90%
  Class recall    92.65%    89.54%
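A sketch of the NeuralNet step using scikit-learn's multilayer perceptron on the one-hot encoded attributes; the layer size and iteration count are illustrative assumptions, not settings taken from the slides.

    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
    mlp.fit(X_train_enc, y_train)
    pred = mlp.predict(X_test_enc)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))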
10 DM Methodology steps
- Step 8. Assess Models
- Accuracy: Ripper and Decision Tree have the better performance
10 DM Methodology steps
- Step 8. Assess Models
- Lift (used to compare the performance of different classification models): Ripper and Decision Tree have the higher lifts (a sketch of the lift computation follows below)
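A sketch of the lift idea used here: how much more concentrated the poisonous class is among a model's "poisonous" predictions than in the data overall. RapidMiner reports lift from its own chart, so its scaling may differ from this plain definition.

    import numpy as np

    def lift(y_true, y_pred, positive="p"):
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        base_rate = (y_true == positive).mean()                       # P(poisonous)
        precision = (y_true[y_pred == positive] == positive).mean()   # P(poisonous | predicted poisonous)
        return precision / base_rate

    print(lift(y_test, pred))   # pred from any of the model sketches above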
10 DM Methodology steps
- Step 9. Deploy Models
- We have not yet gone out to find real mushrooms
- Step 10. Assess Results
- Conclusion and questions
- Maybe Ripper and Decision Tree are better models for nominal data
- How does RapidMiner separate training data from test data?