Title: DATA MINING: CLASSIFICATION
1 DATA MINING: CLASSIFICATION
2 Classification Definition
- Classification is a supervised learning technique.
- It uses a training set that has the correct answers (class label attributes).
- A model is created by running the algorithm on the training data.
- Test the model. If accuracy is low, regenerate the model after changing features or reconsidering samples.
- Identify a class label for incoming new data.
3 Applications
- Classifying credit card transactions as legitimate or fraudulent.
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil.
- Categorizing news stories as finance, weather, entertainment, sports, etc.
4 Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes.
  - Each sample is assumed to belong to a predefined class, as determined by the class label attribute.
  - The set of samples used for model construction is the training set.
  - The model is represented as classification rules, decision trees, or mathematical formulae.
5
- Model usage: for classifying future or unknown objects.
  - Estimate the accuracy of the model.
    - The known label of each test sample is compared with the classified result from the model.
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model.
    - The test set is independent of the training set.
  - If the accuracy is acceptable, use the model to classify data samples whose class labels are not known.
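The two-step process above can be illustrated with a minimal sketch using scikit-learn on a synthetic dataset; the dataset, the 80/20 split, and the choice of a decision tree as the model are assumptions made only for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Step 1: model construction on a training set with known class labels.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)   # test set independent of training set

model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)                  # build the model from the training data

# Step 2: model usage -- estimate accuracy on the independent test set,
# then classify new samples whose labels are unknown.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"accuracy rate on the test set: {accuracy:.2%}")

new_sample = X_test[:1]                      # stands in for incoming, unlabeled data
print("predicted class label:", model.predict(new_sample)[0])
```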
6 Model Construction
Classification Algorithms
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
7 Classification Process (2): Use the Model in Prediction
(Jeff, Professor, 4)
Tenured?
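The learned rule from the model-construction slide and its use in prediction can be written as a small sketch; the tuple layout (name, rank, years) and the helper function below are assumed purely for illustration.

```python
def tenured(rank: str, years: int) -> str:
    # Learned rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

# New, unseen record from the slide: (Jeff, Professor, 4)
name, rank, years = ("Jeff", "Professor", 4)
print(f"Tenured({name})? ->", tenured(rank, years))   # rank matches, so 'yes'
```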
8 Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Neural Networks
- Bayesian Classification
- Support Vector Machines
9 Algorithm for Decision Tree Induction
- Basic algorithm:
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  - At the start, all the training examples are at the root.
  - Attributes are categorical (if continuous-valued, they are discretized in advance).
  - Examples are partitioned recursively based on selected attributes.
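Below is a minimal sketch of this top-down, recursive, divide-and-conquer induction for categorical attributes; the tiny dataset and the use of information gain as the attribute-selection measure are assumptions for illustration (the slides do not fix a particular measure).

```python
from collections import Counter
import math

def entropy(labels):
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def best_attribute(rows, labels, attributes):
    # Pick the categorical attribute with the highest information gain.
    def gain(attr):
        remainder = 0.0
        for value in {row[attr] for row in rows}:
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            remainder += len(subset) / len(labels) * entropy(subset)
        return entropy(labels) - remainder
    return max(attributes, key=gain)

def build_tree(rows, labels, attributes):
    # Stop: all examples in one class, or no attributes left -> leaf with majority class.
    if len(set(labels)) == 1 or not attributes:
        return Counter(labels).most_common(1)[0][0]
    attr = best_attribute(rows, labels, attributes)
    node = {attr: {}}
    remaining = [a for a in attributes if a != attr]
    for value in {row[attr] for row in rows}:
        sub_rows = [row for row in rows if row[attr] == value]
        sub_labels = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        node[attr][value] = build_tree(sub_rows, sub_labels, remaining)
    return node

# Tiny hypothetical training set (categorical attributes only).
rows = [
    {"age": "<=30", "student": "no"},
    {"age": "<=30", "student": "yes"},
    {"age": "31..40", "student": "no"},
    {"age": ">40", "student": "no"},
]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))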
10 Example of Decision Tree
Training Dataset
11 Output: A Decision Tree for buys_computer
12 Advantages of Decision Tree Based Classification
- Inexpensive to construct.
- Extremely fast at classifying unknown records.
- Easy to interpret for small-sized trees.
- Accuracy is comparable to other classification techniques for many simple data sets.
13 Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes:
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values:
  - Assign the most common value of the attribute.
  - Assign a probability to each of the possible values.
- Attribute construction:
  - Create new attributes based on existing ones that are sparsely represented.
  - This reduces fragmentation, repetition, and replication.
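As a small illustration of the first two enhancements, the sketch below discretizes a continuous attribute into intervals and fills missing values with the most common value; the column names and bin edges are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical data: a continuous attribute ("income") and one with missing values.
df = pd.DataFrame({
    "income":        [23_000, 47_500, 61_000, 88_000, 35_000],
    "credit_rating": ["fair", None, "excellent", "fair", None],
})

# Continuous-valued attribute -> discrete set of intervals (a new discrete attribute).
df["income_band"] = pd.cut(df["income"],
                           bins=[0, 30_000, 60_000, float("inf")],
                           labels=["low", "medium", "high"])

# Missing attribute values -> assign the most common value of the attribute.
most_common = df["credit_rating"].mode()[0]
df["credit_rating"] = df["credit_rating"].fillna(most_common)

print(df)
```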
14 Potential Problem
- Overfitting: the generated model does not apply to new incoming data. Possible causes:
  - The training data is too small and does not cover many cases.
  - Wrong assumptions were made.
- Overfitting results in decision trees that are more complex than necessary.
- Training error no longer provides a good estimate of how well the tree will perform on previously unseen records.
- New ways of estimating errors are needed.
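The gap between training error and error on unseen records can be seen in a minimal sketch using scikit-learn; the synthetic dataset and its noise level are assumptions chosen so the effect is visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: a fully grown tree can memorize it.
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Training accuracy is (near) perfect, but accuracy on unseen records is much lower,
# so training error alone is a poor estimate of generalization error.
print("training accuracy:", full_tree.score(X_train, y_train))
print("test accuracy:    ", full_tree.score(X_test, y_test))
```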
15 How to Avoid Overfitting
- Two ways to avoid overfitting are:
  - Pre-pruning
  - Post-pruning
- Pre-pruning:
  - Stop the algorithm before it becomes a fully grown tree.
  - Stop if all instances belong to the same class.
  - Stop if the number of instances is less than some user-specified threshold.
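In scikit-learn, pre-pruning-style stopping conditions are expressed through constructor parameters; a minimal sketch follows, with the threshold values chosen arbitrarily for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2, random_state=0)

# Pre-pruning: stop growing before the tree is fully grown, e.g. cap the depth
# and require a user-specified minimum number of instances to split a node.
pre_pruned_tree = DecisionTreeClassifier(
    max_depth=4,            # stop once the tree reaches a fixed depth
    min_samples_split=20,   # stop if a node holds fewer than 20 instances
    random_state=0,
).fit(X, y)

print("depth:", pre_pruned_tree.get_depth(), "leaves:", pre_pruned_tree.get_n_leaves())
```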
16
- Post-pruning:
  - Grow the decision tree to its entirety.
  - Trim the nodes of the decision tree in a bottom-up fashion.
  - If the generalization error improves after trimming, replace the sub-tree with a leaf node.
  - The class label of the leaf node is determined from the majority class of instances in the sub-tree.
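scikit-learn's built-in post-pruning is minimal cost-complexity pruning rather than the exact generalization-error test described above; the hedged sketch below shows how it is typically applied, with the validation-set-based choice of alpha being an assumption of this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Grow the tree to its entirety, then compute the cost-complexity pruning path.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
alphas = full_tree.cost_complexity_pruning_path(X_train, y_train).ccp_alphas

# Keep the pruned tree that generalizes best on held-out data.
best = max(
    (DecisionTreeClassifier(ccp_alpha=a, random_state=0).fit(X_train, y_train)
     for a in alphas),
    key=lambda tree: tree.score(X_val, y_val),
)
print("pruned tree leaves:", best.get_n_leaves(),
      "validation accuracy:", best.score(X_val, y_val))
```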
17 Bayesian Classification Algorithm
- Let X be a data sample whose class label is unknown.
- Let H be a hypothesis that X belongs to class C.
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
- P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data, reflecting the background knowledge).
- P(X): probability that the sample data is observed.
- P(X|H): probability of observing the sample X, given that the hypothesis holds.
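These four quantities are related by Bayes' theorem, which is what the classifier evaluates: P(H|X) = P(X|H) * P(H) / P(X). The class whose hypothesis H maximizes P(H|X) is chosen; since P(X) is the same for every class, only P(X|H) * P(H) needs to be compared.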
18 Training Dataset for Bayesian Classification
Class C1: buys_computer = yes; Class C2: buys_computer = no
Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
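The training table itself appears only as an image on the original slide, so the sketch below uses a small hypothetical table with the same attributes to show how the computation proceeds: estimate P(Ci) and P(X|Ci) from counts (under the naive independence assumption), then pick the class that maximizes P(X|Ci)P(Ci).

```python
from collections import Counter

# Hypothetical stand-in for the buys_computer training table.
training = [
    ({"age": "<=30",   "income": "high",   "student": "no",  "credit_rating": "fair"},      "no"),
    ({"age": "<=30",   "income": "medium", "student": "yes", "credit_rating": "fair"},      "yes"),
    ({"age": "31..40", "income": "medium", "student": "yes", "credit_rating": "excellent"}, "yes"),
    ({"age": ">40",    "income": "low",    "student": "no",  "credit_rating": "fair"},      "yes"),
    ({"age": ">40",    "income": "medium", "student": "no",  "credit_rating": "excellent"}, "no"),
]

def naive_bayes_posteriors(sample, data):
    class_counts = Counter(label for _, label in data)
    scores = {}
    for c, n_c in class_counts.items():
        score = n_c / len(data)                              # P(Ci)
        for attr, value in sample.items():                   # naive independence assumption
            matches = sum(1 for row, label in data
                          if label == c and row[attr] == value)
            score *= matches / n_c                           # P(attr = value | Ci)
        scores[c] = score                                    # proportional to P(Ci | X)
    return scores

X = {"age": "<=30", "income": "medium", "student": "yes", "credit_rating": "fair"}
print(naive_bayes_posteriors(X, training))   # the class with the larger score is predicted
```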
19 Advantages and Disadvantages of Bayesian Classification
- Advantages:
  - Easy to implement.
  - Good results obtained in most cases.
- Disadvantages:
  - Due to the independence assumption, there is a loss of accuracy.
  - Practically, dependencies exist among variables.
    - E.g., hospital patients: Profile (age, family history, etc.), Symptoms (fever, cough, etc.), Disease (lung cancer, diabetes, etc.).
    - Dependencies among these cannot be modeled by the Bayesian classifier.
20 Conclusion
- Training data is an important factor in building a model with supervised algorithms.
- The classification results generated by each of the algorithms (Naïve Bayes, Decision Tree, Neural Networks) are not considerably different from each other.
- Different classification algorithms can take different amounts of time to train and build models.
- Mechanical (automated) classification is faster than classifying records by hand.
21 Thank you!