Title: Final Project - Mining Mushroom World
Final Project - Mining Mushroom World
Agenda
- Motivation and Background
- Determine the Data Set (2)
- 10 DM Methodology steps (19)
- Conclusion
Motivation and Background
- To distinguish between edible mushrooms and poisonous ones by how they look
- To know whether we can eat the mushroom, to survive in the wild
- To survive outside the computer world
Determine the Data Set (1/2)
- Source of data: UCI Machine Learning Repository
- Mushrooms Database
- From the Audubon Society Field Guide
- Documentation: complete, but missing statistical information
- Described in terms of physical characteristics
- Classification: poisonous or edible
- All attributes are nominal-valued
- Large database: 8124 instances (2480 missing values for attribute 11, stalk-root); the loading sketch below reads this file
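As a concrete illustration of working with this data set, here is a minimal sketch that loads it with pandas. The file name and the column list are assumptions taken from the UCI documentation rather than from the slides, which used Weka and RapidMiner.

    import pandas as pd

    # Attribute names follow the UCI Mushroom documentation; the class label
    # ('e' = edible, 'p' = poisonous) is the first column of the raw file.
    columns = [
        "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
        "gill-attachment", "gill-spacing", "gill-size", "gill-color",
        "stalk-shape", "stalk-root", "stalk-surface-above-ring",
        "stalk-surface-below-ring", "stalk-color-above-ring",
        "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
        "ring-type", "spore-print-color", "population", "habitat",
    ]

    # '?' marks the missing stalk-root values mentioned above.
    df = pd.read_csv("agaricus-lepiota.data", header=None, names=columns,
                     na_values="?")

    print(df.shape)                     # expected (8124, 23): 22 attributes + class
    print(df["class"].value_counts())   # edible vs. poisonous counts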
Determine the Data Set (2/2)
- 1. Past Usage
- Schlimmer, J. S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).
- Iba, W., Wogulis, J., Langley, P. (1988). ICML, 73-79.
- 2. No other mushroom data sets
10 DM Methodology steps
- Step 1. Translate the Business Problem into a Data Mining Problem
- Data Mining Goal: separate edible mushrooms from poisonous ones
- How will the Results be Used: increase the survival rate
- How will the Results be Delivered: Decision Tree, Naïve Bayes, Ripper, and NeuralNet models
10 DM Methodology steps
- Step 2. Select Appropriate Data
- Data Source
- The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
- Jeff Schlimmer donated these data on April 27th, 1987
- Volumes of Data
- Total 8124 instances
- 4208 (51.8%) edible, 3916 (48.2%) poisonous
- 2480 (30.5%) missing values in the attribute stalk-root
10 DM Methodology steps
- Step 2. Select Appropriate Data
- How Many Variables: 22 attributes
- cap-shape, cap-color, odor, population, habitat, and so on
- How Much History is Required: none, there is no seasonality
- As long as we can eat them when we see them
10 DM Methodology steps
- Step 3. Get to Know the Data
- Examine Distributions: use Weka to visualize all 22 attributes with histograms (a pandas sketch of the same check follows below)
- Class: edible = e, poisonous = p
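As a rough stand-in for the Weka histograms, the same distribution check can be sketched in pandas, continuing from the loading sketch above; this is an illustration, not the tool the project actually used.

    # One frequency table per nominal attribute, split by class, mirroring
    # the per-attribute histograms inspected in Weka.
    for col in df.columns.drop("class"):
        print(pd.crosstab(df[col], df["class"]))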
Step 3. Get to Know the Data
- Examine Distributions: there are 2 types of histograms
- First: all possible values appear
- (Attribute 21) population: abundant = a, clustered = c, numerous = n, scattered = s, several = v, solitary = y
Step 3. Get to Know the Data
- 1. Examine Distributions: there are 2 types of histograms
- Second: only some of the possible values appear
- (Attribute 7) gill-spacing: close = c, crowded = w, distant = d
Step 3. Get to Know the Data
- 1. Examine Distributions: there are exceptions
- Exception 1: missing values in an attribute
- (Attribute 11) stalk-root: bulbous = b, club = c, cup = u, equal = e, rhizomorphs = z, rooted = r, missing = ?
- 2480 of the 8124 instances have missing values for this attribute
Step 3. Get to Know the Data
- 1. Examine Distributions: there are exceptions
- Exception 2: an attribute that does not distinguish anything, because only one of its values occurs in the data (both exceptions are checked in the sketch below)
- (Attribute 16) veil-type: partial = p, universal = u
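Both exceptions can be confirmed programmatically; a small sketch continuing from the loading code above (the expected outputs come from the slides and the UCI documentation):

    # Exception 1: stalk-root has missing values (read in as NaN via na_values='?').
    print(df["stalk-root"].isna().sum())        # expected: 2480

    # Exception 2: veil-type does not discriminate, because only one of its
    # documented values actually occurs in the data.
    print(df["veil-type"].value_counts())       # expected: a single value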
Step 3. Get to Know the Data
- 2. Compare Values with Descriptions
- no unexpected values except for missing values
10 DM Methodology steps
- Step 4. Create a Model Set
- Creating a Balanced Sample: 75% (6093 instances) as training data, 25% (2031 instances) as test data
- RapidMiner's cross-validation function: k-1 folds as training data, 1 fold as test data (a scikit-learn sketch follows below)
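A scikit-learn sketch of the same model-set construction. The project used RapidMiner's split and cross-validation operators; the 75/25 ratio comes from the slide, while the stratification and random seed are assumptions.

    from sklearn.model_selection import StratifiedKFold, train_test_split

    X = df.drop(columns="class")
    y = df["class"]

    # 75% training (about 6093 instances), 25% test (about 2031 instances).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0)

    # k-fold cross-validation: k-1 folds train the model, 1 fold tests it, rotated k times.
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)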
10 DM Methodology steps
- Step 5. Fix Problems with the Data
- Dealing with Missing Values: the attribute stalk-root has 2480 missing values
- Replace all missing values with the "average" of stalk-root, which for a nominal attribute is its most frequent value
- We replaced ? with the value b (see the sketch below)
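A short pandas sketch of the same imputation. The slide replaced '?' with b across the whole data set; here the fill is applied to the training/test frames from the previous sketch, using the training mode (expected to be b, matching the slide).

    # Fill the missing stalk-root values with the attribute's most frequent value.
    mode_value = X_train["stalk-root"].mode()[0]        # expected 'b' (bulbous)
    X_train = X_train.fillna({"stalk-root": mode_value})
    X_test = X_test.fillna({"stalk-root": mode_value})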
10 DM Methodology steps
- Step 6. Transform Data to Bring Information to the Surface
- All attributes are nominal, so no numerical transformation is applied in this step
10 DM Methodology steps
- Step 7. Build Model
- 1. Decision Tree
- Performance
- Accuracy: 99.11%
- Lift: 189.81
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           961         0            100.00%
  Pred. e            18      1052             98.32%
  Class recall    98.16%   100.00%
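A hedged scikit-learn sketch of the Decision Tree step. The project built this model in RapidMiner, so the exact numbers above will not be reproduced; the one-hot encoding of the nominal attributes is an assumption needed by scikit-learn.

    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    # One-hot encode the nominal attributes; align test columns with training columns.
    X_train_enc = pd.get_dummies(X_train)
    X_test_enc = pd.get_dummies(X_test).reindex(columns=X_train_enc.columns,
                                                fill_value=0)

    tree = DecisionTreeClassifier(random_state=0).fit(X_train_enc, y_train)
    pred = tree.predict(X_test_enc)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))   # rows = true p / e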
10 DM Methodology steps
- Step 7. Build Model
- 2. Naïve Bayes
- Performance
- Accuracy: 95.77%
- Lift: 179.79
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           902         9             99.01%
  Pred. e            77      1043             93.12%
  Class recall    92.13%    99.14%
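An analogous sketch for Naïve Bayes. BernoulliNB on the one-hot indicators is a stand-in assumption for RapidMiner's Naïve Bayes operator, which handles nominal attributes directly.

    from sklearn.naive_bayes import BernoulliNB
    from sklearn.metrics import accuracy_score, confusion_matrix

    nb = BernoulliNB().fit(X_train_enc, y_train)
    pred = nb.predict(X_test_enc)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))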
10 DM Methodology steps
- Step 7. Build Model
- 3. Ripper
- Performance
- Accuracy: 100%
- Lift: 193.06
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           979         0            100.00%
  Pred. e             0      1052            100.00%
  Class recall   100.00%   100.00%
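scikit-learn has no RIPPER learner, so this sketch assumes the third-party wittgenstein package as a stand-in for RapidMiner's Ripper operator; the package's availability and exact API are assumptions here.

    import wittgenstein as lw

    # RIPPER learns an ordered rule set and can work on the nominal attributes directly.
    train_df = X_train.copy()
    train_df["class"] = y_train

    ripper = lw.RIPPER()
    ripper.fit(train_df, class_feat="class", pos_class="p")

    print(ripper.score(X_test, y_test))    # accuracy on the held-out 25%
    ripper.out_model()                     # print the learned rules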
10 DM Methodology steps
- Step 7. Build Model
- 4. NeuralNet
- Performance
- Accuracy: 91.04%
- Lift: 179.35
- Confusion matrix:
                 True p    True e    Class precision
  Pred. p           907       110             89.18%
  Pred. e            72       942             92.90%
  Class recall    92.65%    89.54%
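A sketch of the NeuralNet step using scikit-learn's multilayer perceptron on the one-hot encoded attributes; the layer size and iteration count are illustrative assumptions, not settings taken from the slides.

    from sklearn.neural_network import MLPClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix

    mlp = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500, random_state=0)
    mlp.fit(X_train_enc, y_train)
    pred = mlp.predict(X_test_enc)

    print(accuracy_score(y_test, pred))
    print(confusion_matrix(y_test, pred, labels=["p", "e"]))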
10 DM Methodology steps
- Step 8. Assess Models
- Accuracy: Ripper and Decision Tree have the better performance
10 DM Methodology steps
- Step 8. Assess Models
- Lift (used to compare the performance of different classification models): Ripper and Decision Tree have the higher lifts (a sketch of the lift computation follows below)
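A sketch of the lift idea used here: how much more concentrated the poisonous class is among a model's "poisonous" predictions than in the data overall. RapidMiner reports lift from its own chart, so its scaling may differ from this plain definition.

    import numpy as np

    def lift(y_true, y_pred, positive="p"):
        y_true = np.asarray(y_true)
        y_pred = np.asarray(y_pred)
        base_rate = (y_true == positive).mean()                       # P(poisonous)
        precision = (y_true[y_pred == positive] == positive).mean()   # P(poisonous | predicted poisonous)
        return precision / base_rate

    print(lift(y_test, pred))   # pred from any of the model sketches above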
10 DM Methodology steps
- Step 9. Deploy Models
- We have not yet gone out to find real mushrooms
- Step 10. Assess Results
- Conclusion and questions
- Maybe Ripper and Decision Tree are better models for nominal data
- How does RapidMiner separate training data from test data?