Title: An Excel-based Data Mining Tool
An Excel-based Data Mining Tool
- iDA
ESX: A Multipurpose Tool for Data Mining
The Algorithmic Logic Behind ESX
- Given:
  - A set of existing concept-level nodes C1, ..., Cn
  - An average class resemblance score S
  - A new instance I to be classified
- Classify I with the concept class that will improve S the most, or hurt S the least.
- If learning is unsupervised, create a new concept node with I alone if that results in a better S score.
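A minimal Python sketch of this placement rule; the similarity function is a caller-supplied placeholder, and scoring a singleton cluster as 1.0 is our assumption, not a formula taken from ESX:
    def resemblance(cluster, similarity):
        # Average pairwise similarity of the instances in one concept node.
        pairs = [(a, b) for i, a in enumerate(cluster) for b in cluster[i + 1:]]
        if not pairs:
            return 1.0  # assumed convention: a lone instance resembles itself
        return sum(similarity(a, b) for a, b in pairs) / len(pairs)

    def average_score(clusters, similarity):
        # S: the mean class resemblance over all concept-level nodes.
        return sum(resemblance(c, similarity) for c in clusters) / len(clusters)

    def place_instance(clusters, instance, similarity, unsupervised=True):
        # Try I in each existing concept class and keep the placement that
        # improves S the most (or hurts it the least).
        candidates = []
        for i in range(len(clusters)):
            trial = [c + [instance] if j == i else c for j, c in enumerate(clusters)]
            candidates.append((average_score(trial, similarity), trial))
        if unsupervised:
            # Also consider a brand-new concept node holding I alone.
            trial = clusters + [[instance]]
            candidates.append((average_score(trial, similarity), trial))
        return max(candidates, key=lambda pair: pair[0])[1]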
iDAV Format for Data Mining
- iDA attribute/value format:
  - First row: attribute names
  - Second row: attribute type identifier (C = categorical, R = real, where real stands for any numeric field)
  - Third row: attribute usage identifier (I = input, O = output, U = unused, D = display only)
  - Fourth row: test set data
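For illustration, a tiny hypothetical sheet in this layout (all attribute names and values below are invented) would look like:
    Age   Income   Sex   Risk     <- row 1: attribute names
    R     R        C     C        <- row 2: types (R = real, C = categorical)
    I     I        I     O        <- row 3: usage (I = input, O = output)
    34    52000    F     low      <- row 4 onward: the data
    51    18000    M     high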
A Five-Step Approach for Unsupervised Clustering
- Step 1: Enter the Data to Be Mined
- Step 2: Perform a Data Mining Session
- Step 3: Read and Interpret Summary Results
- Step 4: Read and Interpret Individual Class Results
- Step 5: Visualize Individual Class Rules
Step 1: Enter the Data to Be Mined
Step 2: Perform a Data Mining Session
- iDA -> Begin Mining Session
- Select the instance similarity and real-valued tolerance settings
RuleMaker Settings
Step 3: Read and Interpret Summary Results
- Class Resemblance Scores
  - Similarity of the instances in the class
- Domain Resemblance Score
  - Similarity of the instances in the entire set
- Cluster Quality
  - Class resemblance with reference to domain resemblance (clusters should be at least as good as the domain)
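For example, with a hypothetical domain resemblance of 0.60, a class whose resemblance is 0.75 is a tight cluster, while a class scoring 0.55 holds together less well than the undifferentiated data and is a questionable cluster.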
Step 3: Results about Attributes
- Categorical
  - Domain Predictability
    - Given categorical attribute A with possible values v1, ..., vn, domain predictability gives the percentage of instances that have A equal to vi. (If the domain predictability score is close to 100, most of the instances share the same value, and the attribute is not very valuable for learning purposes.)
- Numeric
  - Attribute Significance
    - Given attribute A, find the range of the class means and divide by the domain standard deviation (higher values are better for differentiation purposes).
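A minimal Python sketch of the numeric significance score, assuming "range of class means" is the maximum class mean minus the minimum class mean:
    from statistics import mean, pstdev

    def attribute_significance(classes, domain_values):
        # classes: one list of numeric values per class for attribute A.
        # Significance = (range of class means) / (domain standard deviation).
        class_means = [mean(values) for values in classes]
        return (max(class_means) - min(class_means)) / pstdev(domain_values)

    # Example: class means of 22 and 32 against a domain std of about 5.3
    # give a significance near 1.9.
    print(attribute_significance([[20, 22, 24], [30, 32, 34]],
                                 [20, 22, 24, 30, 32, 34]))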
Step 4: Read and Interpret Individual Class Results
- Class Predictability is a within-class measure.
  - Given class C and categorical attribute A with possible values v1, ..., vn, class predictability gives the percentage of instances in C that have A equal to vi.
- Class Predictiveness is a between-class measure.
  - Given class C and categorical attribute A with possible values v1, ..., vn, class predictiveness for vi is the probability that an instance belongs to C given that it has value vi for A.
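A small Python sketch of the two measures over dictionary-style instances (the data, attribute, and class names are illustrative, not from ESX):
    def class_predictability(instances, cls, attr, value):
        # P(A = value | class = cls): a within-class measure.
        members = [x for x in instances if x["class"] == cls]
        return sum(x[attr] == value for x in members) / len(members)

    def class_predictiveness(instances, cls, attr, value):
        # P(class = cls | A = value): a between-class measure.
        holders = [x for x in instances if x[attr] == value]
        return sum(x["class"] == cls for x in holders) / len(holders)

    data = [{"class": "A", "color": "red"}, {"class": "A", "color": "red"},
            {"class": "B", "color": "red"}, {"class": "B", "color": "blue"}]
    class_predictability(data, "A", "color", "red")   # 1.0: every A is red
    class_predictiveness(data, "A", "color", "red")   # 0.67: not every red is A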
Necessary and Sufficient Conditions
- A predictiveness score of 1.0 tells us that all instances with the particular attribute value belong to this particular class.
  => Attribute value v is a sufficient condition for membership in this class.
- A predictability score of 1.0 tells us that all the instances in this class have attribute value v.
  => Attribute value v is a necessary condition for membership in this class.
Necessary and/or Sufficient Conditions
- If both the predictability and predictiveness scores are 1.0, the particular value for the attribute is necessary and sufficient for class membership.
- ESX outputs attribute values whose scores meet a particular cut-off (0.80) as highly necessary and highly sufficient.
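A sketch of how such a cut-off could be applied; the 0.80 default comes from the text above, while the function itself is our own illustration:
    def condition_labels(predictability, predictiveness, cutoff=0.80):
        labels = []
        if predictiveness >= cutoff:
            labels.append("highly sufficient")  # the value (nearly) implies the class
        if predictability >= cutoff:
            labels.append("highly necessary")   # the class (nearly) implies the value
        return labels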
Step 5: Visualize Individual Class Rules
RuleMaker Settings
- Recall that we used the setting that asks RuleMaker to generate all rules. This is a good way to learn about the nature of the problem at hand.
A Six-Step Approach for Supervised Learning
- Step 1: Choose an Output Attribute
- Step 2: Perform the Mining Session
- Step 3: Read and Interpret Summary Results
- Step 4: Read and Interpret Test Set Results
- Step 5: Read and Interpret Class Results
- Step 6: Visualize and Interpret Class Rules
Perform the Mining Session
- Decide on the size of the training set.
- The remaining items will be used by the software to test the model that is developed (and the evaluation results will be reported).
Read and Interpret Summary Results
- The worksheet RES SUM contains summary information.
- Class resemblance scores, attribute summary information (categorical and numeric), and the most commonly occurring attribute values for each class are given.
Read and Interpret Test Set Results
- Worksheets: RES TST, RES MTX
- These report performance on the test set (which was not part of model training).
- RES MTX reports the confusion matrix.
- RES TST reports, for each instance in the test set, the model's classification and whether it is accurate or not.
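The kind of matrix RES MTX reports can be rebuilt in a few lines; the class labels in this sketch are made up:
    def confusion_matrix(actual, predicted):
        # Rows: actual class; columns: the model's predicted class.
        labels = sorted(set(actual) | set(predicted))
        counts = {(a, p): 0 for a in labels for p in labels}
        for a, p in zip(actual, predicted):
            counts[(a, p)] += 1
        return labels, counts

    labels, counts = confusion_matrix(["yes", "no", "yes", "yes"],
                                      ["yes", "no", "no", "yes"])
    # Diagonal cells such as (yes, yes) count correct classifications;
    # off-diagonal cells such as (yes, no) count the model's errors.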
Read and Interpret Class Results
- Just as individual clusters are of interest in unsupervised learning, information about individual classes is relevant in supervised learning.
- The worksheet RES CLS contains this information.
- The most and least typical instances are also given here.
- The worksheet RUL TYP gives typicality scores for all of the instances in the test set.
Visualize and Interpret Class Rules
- All rules or a covering set of rules?
- The worksheet RES RUL contains the rules generated by RuleMaker.
- If all rules are generated, there may be overlapping coverage.
- The covering set algorithm works iteratively, identifying the best covering rule and updating the set of instances still to be covered (see the sketch below).
- It is possible to run RuleMaker without running the mining algorithm again. This menu item can be used to change the RuleMaker settings and generate alternative rule sets.
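RuleMaker's internals are not spelled out here, but the iterative idea reads as a greedy set cover. A sketch, assuming covers(rule) returns the set of instance ids a rule matches:
    def covering_rules(rules, instances, covers):
        # Greedy covering: repeatedly pick the rule that covers the most
        # still-uncovered instances, then remove the instances it covers.
        uncovered, chosen = set(instances), []
        while uncovered and rules:
            best = max(rules, key=lambda r: len(covers(r) & uncovered))
            gained = covers(best) & uncovered
            if not gained:
                break  # no remaining rule covers anything new
            chosen.append(best)
            uncovered -= gained
        return chosen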
Generating Rules: The General Idea
- Step 1: Choose the attribute that best differentiates all domain/subclass instances.
- Step 2: Use the attribute to subdivide the instances into subclasses.
- Step 3: For each subclass:
  - If the instances meet a predefined criterion, generate a defining rule for the subclass.
  - If the predefined criterion is not met, return to Step 1.
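A generic recursive sketch of this loop; the purity test (is_pure) and the attribute scorer (best_attribute) are caller-supplied placeholders rather than ESX's own functions:
    def generate_rules(instances, attributes, is_pure, best_attribute, path=()):
        # Steps 1-2: pick the most differentiating attribute and split on it.
        # Step 3: emit a defining rule for each pure subclass, else recurse.
        if is_pure(instances) or not attributes:
            return [(path, instances)]
        attr = best_attribute(instances, attributes)
        rules = []
        for value in {x[attr] for x in instances}:
            subset = [x for x in instances if x[attr] == value]
            remaining = [a for a in attributes if a != attr]
            rules += generate_rules(subset, remaining, is_pure,
                                    best_attribute, path + ((attr, value),))
        return rules
Each returned path is a conjunction of attribute/value tests that defines one subclass.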
Techniques for Generating Rules
- Define the scope of the rules.
- Choose the instances.
- Set the minimum rule correctness.
- Define the minimum rule coverage.
- Choose an attribute significance value.
Instance Typicality
- Typicality scores:
  - Identify prototypical and outlier instances.
  - Select a best set of training instances.
  - Compute individual instance classification confidence scores.
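A sketch assuming typicality is an instance's average similarity to the other members of its class, which matches its use here for spotting prototypes and outliers:
    def typicality(index, cluster, similarity):
        # Mean similarity between one instance and the rest of its class.
        others = [x for i, x in enumerate(cluster) if i != index]
        if not others:
            return 1.0  # assumed: a lone instance is trivially typical
        return sum(similarity(cluster[index], x) for x in others) / len(others)
High scores mark prototypical instances (good training candidates); low scores flag outliers.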
Special Considerations and Features
- Avoid Mining Delays
- The Quick Mine Feature
- Erroneous and Missing Data