Title: Chap. 6 Classification and Prediction
1. Chap. 6 Classification and Prediction
2. Classification vs. Prediction
- Classification
  - Predicts categorical class labels of data
  - Constructs a model based on the training set, tests the model using the test set, and uses it to classify new data
- Prediction
  - Models continuous-valued functions
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
3. Classification
- Model construction
  - The set of tuples used for model construction is the training set
  - Each tuple/sample belongs to a predefined class, as determined by the class label
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage
  - Classify future or unknown data
  - Estimate the accuracy of the model: a test set with known class labels is classified, and the result is compared
  - Accuracy rate: the percentage of test-set samples that are correctly classified by the model
4. Model Construction
[Figure: training data is fed to a classification algorithm, which produces the learned classifier, e.g.:]
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5. Use the Model in Prediction
[Figure: the learned classifier is applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured? → Yes]
6. Supervised vs. Unsupervised
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are not given
  - Given a set of measurements, observations, etc., establish the existence of classes or clusters in the data
7. Preparing the Data
- Data cleaning
  - Preprocess the data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
  - Ex: income → {low, medium, high}
  - Ex: income, age (integer) → [0.0, 1.0]
8. Comparing Classification Methods
- Predictive accuracy
- Speed
- Time to construct the model / use the model
- Robustness
- Handling noise and missing values
- Scalability
- Efficiency in disk-resident large databases
- Interpretability
- Understanding and insight provided by the model
9. Decision Tree Induction
- Decision tree
  - Internal node: a test on an attribute
  - Branch: an outcome of the test
  - Leaf node: a class label
- Decision tree generation: two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
10. Training Dataset (buys_computer)
  RID  age     income  student  credit_rating  buys_computer
  1    <30     high    no       fair           no
  2    <30     high    no       excellent      no
  3    30..40  high    no       fair           yes
  4    >40     medium  no       fair           yes
  5    >40     low     yes      fair           yes
  6    >40     low     yes      excellent      no
  7    30..40  low     yes      excellent      yes
  8    <30     medium  no       fair           no
  9    <30     low     yes      fair           yes
  10   >40     medium  yes      fair           yes
  11   <30     medium  yes      excellent      yes
  12   30..40  medium  no       excellent      yes
  13   30..40  high    yes      fair           yes
  14   >40     medium  no       excellent      no
11. Output: A Decision Tree for buys_computer
[Figure: the learned tree splits on age at the root; the age < 30 branch splits on student, the age 30..40 branch is a Yes leaf, and the age > 40 branch splits on credit_rating]
- X = (age < 30, income = medium, student = yes, credit = fair) → Yes
12. Algorithm for DT Induction
- Basic algorithm
  - Top-down, recursive, divide-and-conquer manner
  - At start, all the training examples are at the root
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Examples are partitioned recursively based on the selected attributes
- Conditions for stopping partitioning
  - All samples belong to the same class
  - No samples left → label with the majority class of the parent node
  - No attributes left → label with the majority class of the remaining samples
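The divide-and-conquer loop above fits in a few lines. The sketch below is a minimal ID3-style version in Python; the nested-dict tree format, the dataset-as-list-of-dicts encoding, and the pluggable choose_attribute helper are illustrative assumptions, with the gain-based attribute selection defined on the next slides.

```python
# Minimal ID3-style decision tree induction (illustrative sketch).
from collections import Counter

def majority_class(samples, target):
    return Counter(s[target] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, target, choose_attribute):
    classes = {s[target] for s in samples}
    if len(classes) == 1:                      # all samples in the same class
        return classes.pop()
    if not attributes:                         # no attributes left -> majority class
        return majority_class(samples, target)
    best = choose_attribute(samples, attributes, target)   # e.g. by information gain
    tree = {best: {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target, choose_attribute)
    return tree

def classify(tree, sample, default=None):
    while isinstance(tree, dict):              # walk down until a leaf is reached
        attr = next(iter(tree))
        branch = tree[attr].get(sample.get(attr))
        if branch is None:                     # unseen attribute value -> fall back
            return default
        tree = branch
    return tree                                # leaf = class label
```

Leaves are plain class labels and internal nodes are single-key dicts, which keeps the later rule-extraction and classification steps straightforward.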
13. Algorithm for DT Induction - Example
[Figure: splitting on age first partitions the 14 training tuples:
  age < 30   → 9, 11 (Yes); 1, 2, 8 (No)
  age 30..40 → 3, 7, 12, 13 (Yes)
  age > 40   → 4, 5, 10 (Yes); 6, 14 (No)
The impure branches are then split further: the age < 30 branch on student (n → 1, 2, 8 (No); y → 9, 11 (Yes)) and the age > 40 branch on credit_rating (fair → 4, 5, 10 (Yes); excellent → 6, 14 (No)); the age 30..40 branch becomes a Yes leaf.]
14. Information Gain (ID3/C4.5)
- S contains s data samples from n classes C1, C2, ..., Cn
- Number of samples in class Ci: si
- Probability that a sample belongs to Ci: pi = si / s
- Information required to classify a sample known to be in Ci: -log2(pi)
- Expected information to classify a sample in S (entropy of S):
  I(s1, s2, ..., sn) = -Σi pi log2(pi)
15. Information Gain (ID3/C4.5)
- Expected entropy after checking the value of attribute A
  - Average entropy of the subsets S1, S2, ..., Sv obtained by partitioning S using attribute A with values a1, a2, ..., av:
    E(A) = Σj (|Sj| / s) · I(Sj)
- Information gained by branching on attribute A:
  Gain(A) = I(S) - E(A)
- Select the attribute with the largest gain
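A minimal Python rendering of these formulas; the function names and the class-counts-as-lists encoding are illustrative choices, not from the slides. The printed values correspond to the buys_computer example on the next slide.

```python
import math

def info(class_counts):
    """Expected information (entropy): I(s1, ..., sn) = -sum pi * log2(pi)."""
    s = sum(class_counts)
    return -sum((si / s) * math.log2(si / s) for si in class_counts if si > 0)

def expected_info(partitions):
    """E(A): size-weighted average entropy of the partitions induced by attribute A."""
    s = sum(sum(p) for p in partitions)
    return sum((sum(p) / s) * info(p) for p in partitions)

def gain(class_counts, partitions):
    return info(class_counts) - expected_info(partitions)

# buys_computer example: 9 'yes' / 5 'no', partitioned by age into (2,3), (4,0), (3,2)
print(round(info([9, 5]), 3))                             # 0.940
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247 (the slide reports
                                                          # 0.246 after rounding I and E)
```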
16. Attribute Selection by Information Gain - Example
- C1: buys_computer = yes, C2: buys_computer = no
- I(S) = I(9, 5) = 0.940
- E(age) = (5/14)·I(2, 3) + (4/14)·I(4, 0) + (5/14)·I(3, 2) = 0.694
- Gain(age) = 0.940 - 0.694 = 0.246
- Gain(income) = 0.029, Gain(student) = 0.151
17. Information Gain for Continuous-Valued Attributes
- Let A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered a possible split point
    - (ai + ai+1) / 2 is the midpoint between the values of ai and ai+1
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split
  - D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
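A small sketch of the split-point search described above, assuming a plain list of numeric values paired with their class labels; the toy income figures at the end are made up for illustration.

```python
import math

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and keep the
    split with the minimum expected information requirement."""
    pairs = sorted(zip(values, labels))
    best = (float('inf'), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                 # identical values: no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x <= split]   # D1: A <= split-point
        right = [y for x, y in pairs if x > split]   # D2: A > split-point
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best[0]:
            best = (e, split)
    return best[1]

# e.g. yearly income (in K) vs. a yes/no class
print(best_split_point([20, 30, 45, 60, 80], ['no', 'no', 'yes', 'yes', 'yes']))  # 37.5
```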
18. Gain Ratio for Attribute Selection
- The information gain measure is biased towards attributes with a large number of values
- C4.5 uses the gain ratio to overcome the problem (normalization of information gain)
  - GainRatio(A) = Gain(A) / SplitInfo(A)
  - SplitInfo(A) = -Σj (|Dj| / |D|) · log2(|Dj| / |D|)
- Ex:
  - gain_ratio(income) = 0.029 / 0.926 = 0.031
- The attribute with the maximum gain ratio is selected as the splitting attribute
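A sketch of the gain-ratio computation, assuming the partition sizes induced by the attribute are already known. The final line checks the ratio for age, reusing Gain(age) = 0.246 from slide 16; the resulting value is illustrative and not taken from the slides.

```python
import math

def split_info(partition_sizes):
    """SplitInfo(A) = -sum (|Dj|/|D|) * log2(|Dj|/|D|)."""
    d = sum(partition_sizes)
    return -sum((dj / d) * math.log2(dj / d) for dj in partition_sizes if dj > 0)

def gain_ratio(gain_a, partition_sizes):
    return gain_a / split_info(partition_sizes)

# age splits the 14 tuples into partitions of size 5 (<30), 4 (30..40), 5 (>40)
print(round(gain_ratio(0.246, [5, 4, 5]), 3))   # ~0.156
```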
19. Extracting Classification Rules
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
  - IF age < 30 AND student = no THEN buys_computer = no
  - IF age < 30 AND student = yes THEN buys_computer = yes
  - IF age = 30..40 THEN buys_computer = yes
  - IF age > 40 AND credit_rating = excellent THEN buys_computer = no
  - IF age > 40 AND credit_rating = fair THEN buys_computer = yes
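Rule extraction is just an enumeration of root-to-leaf paths. The sketch below assumes the nested-dict tree format used in the earlier induction sketch and the buys_computer tree of slide 11; it prints one IF-THEN rule per leaf.

```python
def extract_rules(tree, path=()):
    """Walk every root-to-leaf path of a nested-dict tree and emit one rule per path."""
    if not isinstance(tree, dict):               # leaf: holds the class prediction
        cond = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {cond} THEN buys_computer = {tree}"]
    rules = []
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        rules.extend(extract_rules(subtree, path + ((attr, value),)))
    return rules

# The buys_computer tree from slide 11
tree = {"age": {"<30": {"student": {"no": "no", "yes": "yes"}},
                "30..40": "yes",
                ">40": {"credit_rating": {"fair": "yes", "excellent": "no"}}}}
for rule in extract_rules(tree):
    print(rule)
```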
20. Avoid Overfitting
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy for unseen samples
- Prepruning
  - Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
- Postpruning
  - Remove branches from a fully grown tree
  - If pruning a node leads to a smaller error rate (on a test set), prune it
21. Discussion on DT
- Advantages
  - Convertible to understandable classification rules
  - Relatively fast learning/classification speed
- Disadvantages
  - Sensitive (not robust) to noise
  - Continuous-valued attributes must be handled by dynamically partitioning the continuous attribute values into a discrete set of intervals
22. Bayes Theorem
- Given a data sample X, we want to know P(h|X)
  - h: the hypothesis that X belongs to a class C
- Posterior probability of a hypothesis h:
  P(h|X) = P(X|h) P(h) / P(X)
- MAP (maximum a posteriori) hypothesis
  - Assign X to the h with maximum P(h|X): Bayesian classifier
23. Naïve Bayesian Classifier
- Assumption: attributes are conditionally independent
- Steps
  - Compute P(Ck) and P(xi|Ck) for all xi and Ck from the training samples
  - To classify an unknown sample X = (x1, x2, ..., xn), compute P(Ck|X) ∝ P(Ck) · Πi P(xi|Ck)
  - Assign X to the Ck with maximum probability
24. Naïve Bayesian Classifier - Example
25. Naïve Bayesian Classifier - Example
- Compute P(Ck) and P(xi|Ck)
  - P(Yes) = 9/14 = 0.643
  - P(No) = 5/14 = 0.357
  - P(age < 30 | Yes) = 2/9 = 0.222
  - P(age < 30 | No) = 3/5 = 0.600
  - P(income = medium | Yes) = 4/9 = 0.444
  - P(income = medium | No) = 2/5 = 0.400
  - P(student = yes | Yes) = 6/9 = 0.667
  - P(student = yes | No) = 1/5 = 0.200
  - P(credit = fair | Yes) = 6/9 = 0.667
  - P(credit = fair | No) = 2/5 = 0.400
  - ...
26. Naïve Bayesian Classifier - Example
- Classify the unknown sample
  - X = (age < 30, income = medium, student = yes, credit = fair)
- P(Yes|X) ∝ P(Yes) · P(X|Yes)
  = 0.643 x 0.222 x 0.444 x 0.667 x 0.667 = 0.028
- P(No|X) ∝ P(No) · P(X|No)
  = 0.357 x 0.600 x 0.400 x 0.200 x 0.400 = 0.007
- Classify X as Yes
27. Discussion on Naïve Bayesian
- Advantages
  - Optimal classifier if all the joint probabilities P(X|h) are known (without the independence assumption)
  - Easy to apply
- Disadvantages
  - Needs a large number of training examples
  - Low accuracy when there are strong dependencies between attributes
- Laplace correction
  - Eliminates overly strong probability estimates (0 or 1)
28. Bayesian Belief Networks
- The independence hypothesis
  - Makes computation possible and yields optimal classifiers when satisfied
  - But it is seldom satisfied in practice, as attributes are often correlated
- Bayesian networks
  - A graphical model of causal relationships
  - Combine Bayesian reasoning with causal relationships between attributes: a joint conditional probability distribution
  - Allow a subset of the variables to be conditionally independent
29. Bayesian Belief Networks
30. Bayesian Belief Networks
- Computing probability
  - Compute any P(Ck | x1, ..., xn) with the joint probability distribution
    P(x1, ..., xn) = Πi P(xi | Parents(xi))
31. Bayesian Belief Networks
- Example: compute P(LungCancer | F, S, PX, D)
- Using a Naive Bayesian model
  - Need P(LC), P(F|LC), P(S|LC), P(PX|LC), P(D|LC)
  - → 2 + 4 + 4 + 4 + 4 = 18 probabilities
- Using a Bayesian network model
  - Need P(F), P(S), P(LC|F, S), P(PX|LC), P(D|LC)
  - → 2 + 2 + 8 + 4 + 4 = 20 probabilities
- Using the full joint probability table
  - Need P(LC, F, S, PX, D) → 2^5 = 32 probabilities
32. Linear Classification
- Classification as a mathematical mapping
  - x ∈ X = R^n, y ∈ Y = {+1, -1}
  - We want a function f: X → Y
- Binary classification problem
  - The data above the red line belongs to class 'x'
  - The data below the red line belongs to class 'o'
- Examples: SVM, Perceptron, Probabilistic Classifiers
33. Perceptron
- Vectors X, W
- Input: (X1, y1), ...
- Output: a classification function f(X)
  - f(Xi) > 0 for yi = +1
  - f(Xi) < 0 for yi = -1
- Decision boundary: f(x) = W·X + b = 0, i.e. w1x1 + w2x2 + b = 0
- Perceptron: update W additively
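A minimal additive-update perceptron in Python; the toy 2-D data, learning rate, and epoch count are illustrative.

```python
def train_perceptron(data, epochs=100, lr=1.0):
    """Additive perceptron updates: if a sample (X, y) with y in {+1, -1} is
    misclassified by f(X) = W.X + b, move W in the direction of y*X."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:                       # wrong side of the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data in R^2
data = [((2.0, 1.0), +1), ((1.5, 2.0), +1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w, b = train_perceptron(data)
print(all((sum(wi * xi for wi, xi in zip(w, x)) + b) * y > 0 for x, y in data))  # True
```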
34. Neural Networks
[Figure: a single neuron. Input vector X = (x0, x1, ..., xn) and weight vector W = (w0, w1, ..., wn) are combined into a weighted sum (with bias/threshold θj); an activation function applied to this sum produces the output o]
35. Learning of a Neuron
- Given (X, t), minimize the error E = (1/2)(t - o)^2 by gradient descent on the weights
36. Multi-Layer Perceptron
[Figure: a feed-forward network. Input vector X = (x1, x2, ...) feeds the input layer; weights wij connect input to hidden nodes and weights wjk connect hidden to output nodes, producing the output vector O]
37. Network Training
- The objective of training
  - Obtain a set of weights that makes almost all the samples in the training data classified correctly
- Steps
  - Initialize the weights wij with random values
  - Feed the training samples X into the network one by one
  - For each unit
    - Compute the output value O using the activation function
    - Compute the error E
    - Update the weights wij and the biases
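The training loop above, written out for a one-hidden-layer network with sigmoid units and squared error (backpropagation, as named on the next slide). The layer sizes, learning rate, and XOR-style toy data are illustrative assumptions.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(samples, n_hidden=4, epochs=5000, lr=1.0, seed=0):
    """One-hidden-layer network trained with backpropagation (single sigmoid output)."""
    rnd = random.Random(seed)
    n_in = len(samples[0][0])
    w_ih = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_ho = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]     # last weight = bias
    for _ in range(epochs):
        for x, t in samples:
            xb = list(x) + [1.0]                                     # input plus bias
            h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_ih]
            hb = h + [1.0]
            o = sigmoid(sum(w * hi for w, hi in zip(w_ho, hb)))
            # backward pass: error terms for the output and hidden units
            delta_o = (t - o) * o * (1 - o)
            delta_h = [hj * (1 - hj) * delta_o * w_ho[j] for j, hj in enumerate(h)]
            w_ho = [w + lr * delta_o * hi for w, hi in zip(w_ho, hb)]
            for j in range(n_hidden):
                w_ih[j] = [w + lr * delta_h[j] * xi for w, xi in zip(w_ih[j], xb)]
    def predict(x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_ih] + [1.0]
        return sigmoid(sum(w * hi for w, hi in zip(w_ho, h)))
    return predict

# Toy data just to exercise the training loop
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predict = train_mlp(data)
print([round(predict(x), 2) for x, _ in data])
```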
38. Backpropagation Learning
[Figure: the same multi-layer network; given a training pair (X, T), the error at the output vector O is propagated backwards to update the weights wjk and wij]
39. Neural Network Classifier - Example
40. Neural Network Classifier - Example
- Train with the 14 examples, 1000 times
- Classify the unknown sample
  - X = (age < 30, income = medium, student = yes, credit = fair) → (0, 0.5, 1, 0)
  - O = 0.85
  - Classify X as Yes
41. Discussion on NN
- Advantages
  - Robust: works when training examples contain errors
  - Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - Fast evaluation of the learned target function
- Criticism
  - Long training time
  - Difficult to understand the learned function (weights)
  - Not easy to incorporate domain knowledge
42. Classification Based on Association
- Uses association rule mining
- Associative classification
  - Mines high-support and high-confidence rules of the form cond_set => y (y: a class label)
  - If several rules have the same cond_set, the rule with the highest confidence is selected for the possible rule set
  - Rules are organized in decreasing confidence order
- Classification of new data
  - The first rule that is satisfied by the data is applied
43. Associative Classification - Example
- Possible rule set
  - (age = 30-40) → Yes (c = 100%, s = 21%)
  - (age < 30) ∧ (student = No) → No (c = 100%, s = 14%)
  - ...
  - (student = yes) → Yes (c = 86%, s = 43%)
  - (income = low) → Yes (c = 75%, s = 21%)
  - ...
- Classify the unknown sample
  - X = (income = medium, student = yes)
  - Classify X as Yes
44. k-Nearest Neighbor Algorithm
- Store all training examples (instances)
- All instances correspond to points in the n-D space
- Classify a new instance by finding the nearest examples
- The nearest neighbors are defined in terms of Euclidean distance
- The target function may be discrete- or real-valued
[Figure: a query point X among stored training instances in the feature space]
45. k-Nearest Neighbor Algorithm
- For a discrete-valued target
  - k-NN returns the most common value among the k training examples nearest to X
  - Ex: classify Yes or No with 10-NN; 7 Yes, 3 No → Yes
- For a continuous-valued target
  - Calculate the mean value of the k nearest neighbors
- Distance-weighted method
  - Weight the k neighbors according to their distance to X
  - Larger weight for closer neighbors
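A sketch of distance-weighted k-NN for a numeric (0/1) target; the 1/d weighting mirrors the worked example on the next slide, whose three neighbors are reused here as the toy training set.

```python
import math

def knn_predict(train, x, k=3, weighted=True):
    """train: list of (point, label) pairs with numeric labels (e.g. 1 = yes, 0 = no).
    Returns the (optionally distance-weighted) average of the k nearest labels."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda p: dist(p[0], x))[:k]
    if weighted:
        weights = [1.0 / max(dist(p, x), 1e-9) for p, _ in neighbors]  # closer -> larger
    else:
        weights = [1.0] * len(neighbors)
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# The 3-NN example from the next slide: query X = (0, 0.5, 1, 0)
train = [((0, 0, 1, 0), 1), ((0, 0.5, 0, 0), 0), ((0, 0.5, 1, 1), 1)]
print(knn_predict(train, (0, 0.5, 1, 0), k=3))   # 0.75 -> classify as yes
```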
46. k-NN - Example
- Given 14 examples → map to 4-D space
- Classify the unknown sample
  - X = (age < 30, income = medium, student = yes, credit = fair) → (0, 0.5, 1, 0)
- 3-NN:
  - (0, 0, 1, 0): 1 (yes), d = 0.5, w = 1/0.5
  - (0, 0.5, 0, 0): 0 (no), d = 1.0, w = 1/1.0
  - (0, 0.5, 1, 1): 1 (yes), d = 1.0, w = 1/1.0
- W = 4
- y = 2/4 x 1 (yes) + 1/4 x 0 (no) + 1/4 x 1 (yes) = 0.75
- Classify X as Yes
47. Discussion on k-NN
- Advantages
  - Robust to noisy data (averaging over the k nearest neighbors)
  - No training time
- Disadvantages
  - Classification time can be long when there are too many instances (each query requires distance computations against all stored instances)
  - Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
48. Case-Based Reasoning
- Similar to k-NN
  - Instances (cases) are symbolic descriptions (not points in a Euclidean space)
  - Customer service help desks, law cases, technical designs
- Methodology
  - Instances are represented by symbolic descriptions
  - Find cases with similar descriptions
  - Solutions of multiple retrieved cases are combined
- Research issues
  - Finding a similarity measure
  - Combining cases
49. Lazy vs. Eager Learning
- Eager learning
  - Construct a generalization model before receiving new samples to classify
  - Decision tree, Bayesian classifier, neural network
- Lazy learning
  - Do not build a model until a new sample to classify is given
  - k-nearest neighbor classifier, case-based reasoning
- Difference
  - Training: lazy learning is faster
  - Classifying: eager learning is faster
50. Fuzzy Set Approaches
- Fuzzy logic
  - Uses truth values in [0.0, 1.0] (not just F, T)
  - Represented as a membership function
- Ex: IF (income > 50K) THEN yes → false for income = 49K
- Ex: IF (income = high) THEN yes → 0.9 yes for income = 49K
51. Fuzzy Set Approaches
- Using fuzzy logic in a rule-based system
  - Rules are represented with fuzzy categories
    - IF (income = high) THEN yes
    - IF (income = medium) THEN no
  - For a given new sample, attribute values are converted to fuzzy values
    - income = 49K → 0.1 medium, 0.9 high
  - Each applicable rule contributes a vote for membership in the categories
    - 0.9 yes, 0.1 no, ...
  - The truth values for each predicted category are summed
52. Prediction
- Prediction is similar to classification
  - Construct a model → use the model to predict an unknown value
- Linear regression: Y = α + βX
  - α and β are estimated by applying the least squares criterion to the known data (X1, Y1), (X2, Y2), ..., (Xn, Yn)
  - Minimize Σi (Yi - α - βXi)^2 → find α and β
- Multiple regression: Y = α + β1X1 + β2X2
  - Y is modeled as a linear function of multiple attributes
- Non-linear regression: Y = α + β1X + β2X^2
  - Y is modeled as a non-linear function of a single attribute
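A least-squares sketch for the simple linear case Y = α + βX, using the closed-form estimates; the toy data are made up.

```python
def linear_regression(xs, ys):
    """Least-squares estimates for Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Toy data roughly following Y = 2 + 3X
xs, ys = [1, 2, 3, 4, 5], [5.1, 7.9, 11.2, 13.8, 17.0]
alpha, beta = linear_regression(xs, ys)
print(round(alpha, 2), round(beta, 2))   # approx. 2.09 and 2.97
```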
53. Classification Accuracy
- Holdout
  - Partition the dataset into a training set and a test set
  - Training set: derive the classifier
  - Test set: estimate the accuracy
- k-fold cross-validation
  - Divide the data set into k subsets
  - Use k-1 subsets as training data and 1 subset as test data
  - Repeat k times, and average the accuracy
  - Ex (k = 5): S1, S2, S3, S4, S5 → hold out S1 and train on the rest; then hold out S2; ...
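A generic k-fold cross-validation sketch; train_fn, classify_fn, and the majority-class stub in the demo are illustrative placeholders for any of the classifiers in this chapter.

```python
import random
from collections import Counter

def k_fold_accuracy(samples, k, train_fn, classify_fn, target, seed=0):
    """Split the data into k folds; train on k-1 folds, test on the held-out fold,
    and average the k accuracy estimates."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(train)
        correct = sum(classify_fn(model, s) == s[target] for s in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Tiny demo with a majority-class "model"
data = [{"x": i, "label": "yes" if i % 3 else "no"} for i in range(30)]
train_fn = lambda train: Counter(s["label"] for s in train).most_common(1)[0][0]
classify_fn = lambda model, s: model
print(round(k_fold_accuracy(data, 5, train_fn, classify_fn, "label"), 2))   # 0.67
```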
54. Classification Accuracy
- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement
    - i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- .632 bootstrap
  - Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a training set of d samples
  - The data tuples that did not make it into the training set end up forming the test set
  - About 36.8% of the tuples will form the test set: (1 - 1/d)^d ≈ e^-1 = 0.368
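A sketch of the .632 bootstrap split described above; with a reasonably large d, the held-out fraction printed at the end comes out near 0.368.

```python
import random

def bootstrap_split(data, seed=0):
    """Sample d tuples with replacement as the training set; tuples never drawn
    form the test set (about 36.8% of the data on average)."""
    rnd = random.Random(seed)
    d = len(data)
    chosen = [rnd.randrange(d) for _ in range(d)]
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]
    test = [data[i] for i in range(d) if i not in chosen_set]
    return train, test

data = list(range(1000))
train, test = bootstrap_split(data)
print(len(test) / len(data))   # close to (1 - 1/d)^d ~ e^-1 = 0.368
```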
55. Ensemble Methods
- Ensemble
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M
- Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
56. Bagging
- Analogy
  - Diagnosis based on multiple doctors' majority vote
- Training
  - Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
  - A classifier model Mi is learned for each training set Di
- Classification
  - Each classifier Mi returns its class prediction
  - The bagged classifier M counts the votes and assigns the class with the most votes to an unknown sample X
- Accuracy
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, more robust
  - Proved improved accuracy in prediction
57. Boosting
- Analogy
  - Consult several doctors, and base the diagnosis on a combination of their diagnoses
  - A weight is assigned to each doctor based on previous diagnosis accuracy
- Training
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
- Classification
  - The final model M combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy
- Accuracy
  - Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
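An AdaBoost-style sketch of the weighting scheme described above, assuming labels in {+1, -1} and a learn(data, weights) procedure that can use per-tuple weights; the error clamp and the 0.5·ln((1-err)/err) classifier weight follow the standard formulation rather than anything given on the slide.

```python
import math

def boost_train(data, learn, k=5):
    """data: list of (x, y) with y in {+1, -1}; learn(data, weights) must return a
    classifier h with h(x) in {+1, -1}."""
    n = len(data)
    weights = [1.0 / n] * n
    ensemble = []                                     # (alpha, classifier) pairs
    for _ in range(k):
        h = learn(data, weights)
        err = sum(w for w, (x, y) in zip(weights, data) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)         # avoid division by zero / log(0)
        alpha = 0.5 * math.log((1 - err) / err)       # classifier weight from its accuracy
        # increase the weights of misclassified tuples for the next round
        weights = [w * math.exp(-alpha * y * h(x)) for w, (x, y) in zip(weights, data)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((alpha, h))
    return ensemble

def boost_classify(ensemble, x):
    """Weighted vote of the individual classifiers."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```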
58. References
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
- W. Cohen. Fast effective rule induction. ICML'95.
- G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.
- A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.
- G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and Sons, 2001.
- U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI'94.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. VLDB'98.
59. References
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. SIGMOD'99.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
- D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. RIDE'97.
- B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
- T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
60. References
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1: 81-106, 1986.
- J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
- J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
- P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.
- X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
- H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.