Title: Chap. 6 Classification and Prediction
1. Chap. 6 Classification and Prediction
2. Classification vs. Prediction
- Classification
  - Predicts categorical class labels of data
  - Constructs a model based on the training set, tests the model using the test set, and uses it to classify new data
- Prediction
  - Models continuous-valued functions
- Typical applications
  - Credit approval
  - Target marketing
  - Medical diagnosis
3. Classification
- Model construction
  - The set of tuples used for model construction is the training set
  - Each tuple/sample belongs to a predefined class, as determined by the class label
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage
  - Classify future or unknown data
  - Estimate the accuracy of the model: a test set with known class labels is classified, and the result is compared
  - Accuracy rate: the percentage of test-set samples that are correctly classified by the model
4. Model Construction
[Figure: training data is fed to a classification algorithm, which produces the learned classifier, e.g.:]
IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
5. Use the Model in Prediction
[Figure: the learned classifier is applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured? → Yes]
6. Supervised vs. Unsupervised
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class
  - New data is classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are not given
  - Given a set of measurements, observations, etc., establish the existence of classes or clusters in the data
7. Preparing the Data
- Data cleaning
  - Preprocess the data in order to reduce noise and handle missing values
- Relevance analysis (feature selection)
  - Remove irrelevant or redundant attributes
- Data transformation
  - Generalize and/or normalize data
  - Ex: income → {low, medium, high}
  - Ex: income, age (integer) → [0.0, 1.0]
8. Comparing Classification Methods
- Predictive accuracy
- Speed
- Time to construct the model / use the model
- Robustness
- Handling noise and missing values
- Scalability
- Efficiency in disk-resident large databases
- Interpretability
- Understanding and insight provided by the model
9. Decision Tree Induction
- Decision tree
  - Internal node: a test on an attribute
  - Branch: an outcome of the test
  - Leaf node: a class label
- Decision tree generation: two phases
  - Tree construction
    - At start, all the training examples are at the root
    - Partition examples recursively based on selected attributes
  - Tree pruning
    - Identify and remove branches that reflect noise or outliers
- Classifying an unknown sample
  - Test the attribute values of the sample against the decision tree
10. Training Dataset (buys_computer)
  RID  age     income  student  credit_rating  buys_computer
  1    <30     high    no       fair           no
  2    <30     high    no       excellent      no
  3    30..40  high    no       fair           yes
  4    >40     medium  no       fair           yes
  5    >40     low     yes      fair           yes
  6    >40     low     yes      excellent      no
  7    30..40  low     yes      excellent      yes
  8    <30     medium  no       fair           no
  9    <30     low     yes      fair           yes
  10   >40     medium  yes      fair           yes
  11   <30     medium  yes      excellent      yes
  12   30..40  medium  no       excellent      yes
  13   30..40  high    yes      fair           yes
  14   >40     medium  no       excellent      no
11. Output: A Decision Tree for buys_computer
[Figure: the learned tree splits on age at the root; the age < 30 branch splits on student, the age 30..40 branch is a Yes leaf, and the age > 40 branch splits on credit_rating]
- X = (age < 30, income = medium, student = yes, credit = fair) → Yes
12. Algorithm for DT Induction
- Basic algorithm
  - Top-down, recursive, divide-and-conquer manner
  - At start, all the training examples are at the root
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
  - Examples are partitioned recursively based on the selected attributes
- Conditions for stopping partitioning
  - All samples belong to the same class
  - No samples left → label with the majority class of the parent node
  - No attributes left → label with the majority class of the remaining samples
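The divide-and-conquer loop above fits in a few lines. The sketch below is a minimal ID3-style version in Python; the nested-dict tree format, the dataset-as-list-of-dicts encoding, and the pluggable choose_attribute helper are illustrative assumptions, with the gain-based attribute selection defined on the next slides.

```python
# Minimal ID3-style decision tree induction (illustrative sketch).
from collections import Counter

def majority_class(samples, target):
    return Counter(s[target] for s in samples).most_common(1)[0][0]

def build_tree(samples, attributes, target, choose_attribute):
    classes = {s[target] for s in samples}
    if len(classes) == 1:                      # all samples in the same class
        return classes.pop()
    if not attributes:                         # no attributes left -> majority class
        return majority_class(samples, target)
    best = choose_attribute(samples, attributes, target)   # e.g. by information gain
    tree = {best: {}}
    for value in {s[best] for s in samples}:
        subset = [s for s in samples if s[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, target, choose_attribute)
    return tree

def classify(tree, sample, default=None):
    while isinstance(tree, dict):              # walk down until a leaf is reached
        attr = next(iter(tree))
        branch = tree[attr].get(sample.get(attr))
        if branch is None:                     # unseen attribute value -> fall back
            return default
        tree = branch
    return tree                                # leaf = class label
```

Leaves are plain class labels and internal nodes are single-key dicts, which keeps the later rule-extraction and classification steps straightforward.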
13. Algorithm for DT Induction - Example
[Figure: splitting on age first partitions the 14 training tuples:
  age < 30   → 9, 11 (Yes); 1, 2, 8 (No)
  age 30..40 → 3, 7, 12, 13 (Yes)
  age > 40   → 4, 5, 10 (Yes); 6, 14 (No)
The impure branches are then split further: the age < 30 branch on student (n → 1, 2, 8 (No); y → 9, 11 (Yes)) and the age > 40 branch on credit_rating (fair → 4, 5, 10 (Yes); excellent → 6, 14 (No)); the age 30..40 branch becomes a Yes leaf.]
14. Information Gain (ID3/C4.5)
- S contains s data samples from n classes C1, C2, ..., Cn
- Number of samples in class Ci: si
- Probability that a sample belongs to Ci: pi = si / s
- Information required to classify a sample known to be in Ci: -log2(pi)
- Expected information to classify a sample in S (entropy of S):
  I(s1, s2, ..., sn) = -Σi pi log2(pi)
15. Information Gain (ID3/C4.5)
- Expected entropy after checking the value of attribute A
  - Average entropy of the subsets S1, S2, ..., Sv obtained by partitioning S using attribute A with values a1, a2, ..., av:
    E(A) = Σj (|Sj| / s) · I(Sj)
- Information gained by branching on attribute A:
  Gain(A) = I(S) - E(A)
- Select the attribute with the largest gain
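A minimal Python rendering of these formulas; the function names and the class-counts-as-lists encoding are illustrative choices, not from the slides. The printed values correspond to the buys_computer example on the next slide.

```python
import math

def info(class_counts):
    """Expected information (entropy): I(s1, ..., sn) = -sum pi * log2(pi)."""
    s = sum(class_counts)
    return -sum((si / s) * math.log2(si / s) for si in class_counts if si > 0)

def expected_info(partitions):
    """E(A): size-weighted average entropy of the partitions induced by attribute A."""
    s = sum(sum(p) for p in partitions)
    return sum((sum(p) / s) * info(p) for p in partitions)

def gain(class_counts, partitions):
    return info(class_counts) - expected_info(partitions)

# buys_computer example: 9 'yes' / 5 'no', partitioned by age into (2,3), (4,0), (3,2)
print(round(info([9, 5]), 3))                             # 0.940
print(round(gain([9, 5], [[2, 3], [4, 0], [3, 2]]), 3))   # ~0.247 (the slide reports
                                                          # 0.246 after rounding I and E)
```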
16. Attribute Selection by Information Gain - Example
- C1: buys_computer = yes, C2: buys_computer = no
- I(S) = I(9, 5) = 0.940
- E(age) = (5/14)·I(2, 3) + (4/14)·I(4, 0) + (5/14)·I(3, 2) = 0.694
- Gain(age) = 0.940 - 0.694 = 0.246
- Gain(income) = 0.029, Gain(student) = 0.151
17. Information Gain for Continuous-Valued Attributes
- Let A be a continuous-valued attribute
- Must determine the best split point for A
  - Sort the values of A in increasing order
  - Typically, the midpoint between each pair of adjacent values is considered a possible split point
    - (ai + ai+1) / 2 is the midpoint between the values of ai and ai+1
  - The point with the minimum expected information requirement for A is selected as the split point for A
- Split
  - D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
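A small sketch of the split-point search described above, assuming a plain list of numeric values paired with their class labels; the toy income figures at the end are made up for illustration.

```python
import math

def entropy(labels):
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def best_split_point(values, labels):
    """Try the midpoint between each pair of adjacent sorted values and keep the
    split with the minimum expected information requirement."""
    pairs = sorted(zip(values, labels))
    best = (float('inf'), None)
    for i in range(len(pairs) - 1):
        if pairs[i][0] == pairs[i + 1][0]:
            continue                                 # identical values: no midpoint
        split = (pairs[i][0] + pairs[i + 1][0]) / 2
        left = [y for x, y in pairs if x <= split]   # D1: A <= split-point
        right = [y for x, y in pairs if x > split]   # D2: A > split-point
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if e < best[0]:
            best = (e, split)
    return best[1]

# e.g. yearly income (in K) vs. a yes/no class
print(best_split_point([20, 30, 45, 60, 80], ['no', 'no', 'yes', 'yes', 'yes']))  # 37.5
```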
18. Gain Ratio for Attribute Selection
- The information gain measure is biased towards attributes with a large number of values
- C4.5 uses the gain ratio to overcome the problem (normalization of information gain)
  - GainRatio(A) = Gain(A) / SplitInfo(A)
  - SplitInfo(A) = -Σj (|Dj| / |D|) · log2(|Dj| / |D|)
- Ex:
  - gain_ratio(income) = 0.029 / 0.926 = 0.031
- The attribute with the maximum gain ratio is selected as the splitting attribute
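A sketch of the gain-ratio computation, assuming the partition sizes induced by the attribute are already known. The final line checks the ratio for age, reusing Gain(age) = 0.246 from slide 16; the resulting value is illustrative and not taken from the slides.

```python
import math

def split_info(partition_sizes):
    """SplitInfo(A) = -sum (|Dj|/|D|) * log2(|Dj|/|D|)."""
    d = sum(partition_sizes)
    return -sum((dj / d) * math.log2(dj / d) for dj in partition_sizes if dj > 0)

def gain_ratio(gain_a, partition_sizes):
    return gain_a / split_info(partition_sizes)

# age splits the 14 tuples into partitions of size 5 (<30), 4 (30..40), 5 (>40)
print(round(gain_ratio(0.246, [5, 4, 5]), 3))   # ~0.156
```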
19. Extracting Classification Rules
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example
  - IF age < 30 AND student = no THEN buys_computer = no
  - IF age < 30 AND student = yes THEN buys_computer = yes
  - IF age = 30..40 THEN buys_computer = yes
  - IF age > 40 AND credit_rating = excellent THEN buys_computer = no
  - IF age > 40 AND credit_rating = fair THEN buys_computer = yes
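Rule extraction is just an enumeration of root-to-leaf paths. The sketch below assumes the nested-dict tree format used in the earlier induction sketch and the buys_computer tree of slide 11; it prints one IF-THEN rule per leaf.

```python
def extract_rules(tree, path=()):
    """Walk every root-to-leaf path of a nested-dict tree and emit one rule per path."""
    if not isinstance(tree, dict):               # leaf: holds the class prediction
        cond = " AND ".join(f"{a} = {v}" for a, v in path)
        return [f"IF {cond} THEN buys_computer = {tree}"]
    rules = []
    attr = next(iter(tree))
    for value, subtree in tree[attr].items():
        rules.extend(extract_rules(subtree, path + ((attr, value),)))
    return rules

# The buys_computer tree from slide 11
tree = {"age": {"<30": {"student": {"no": "no", "yes": "yes"}},
                "30..40": "yes",
                ">40": {"credit_rating": {"fair": "yes", "excellent": "no"}}}}
for rule in extract_rules(tree):
    print(rule)
```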
20. Avoid Overfitting
- The generated tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - The result is poor accuracy for unseen samples
- Prepruning
  - Halt tree construction early: do not split a node if this would result in the goodness measure falling below a threshold
- Postpruning
  - Remove branches from a fully grown tree
  - If pruning a node leads to a smaller error rate (on a test set), prune it
21. Discussion on DT
- Advantages
  - Convertible to understandable classification rules
  - Relatively fast learning/classification speed
- Disadvantages
  - Sensitive (not robust) to noise
  - Continuous-valued attributes must be handled by dynamically partitioning the continuous attribute values into a discrete set of intervals
22. Bayes Theorem
- Given a data sample X, we want to know P(h|X)
  - h: the hypothesis that X belongs to a class C
- Posterior probability of a hypothesis h:
  P(h|X) = P(X|h) P(h) / P(X)
- MAP (maximum a posteriori) hypothesis
  - Assign X to the h with maximum P(h|X): Bayesian classifier
23. Naïve Bayesian Classifier
- Assumption: attributes are conditionally independent
- Steps
  - Compute P(Ck) and P(xi|Ck) for all xi and Ck from the training samples
  - To classify an unknown sample X = (x1, x2, ..., xn), compute P(Ck|X) ∝ P(Ck) · Πi P(xi|Ck)
  - Assign X to the Ck with maximum probability
24. Naïve Bayesian Classifier - Example
25. Naïve Bayesian Classifier - Example
- Compute P(Ck) and P(xi|Ck)
  - P(Yes) = 9/14 = 0.643
  - P(No) = 5/14 = 0.357
  - P(age < 30 | Yes) = 2/9 = 0.222
  - P(age < 30 | No) = 3/5 = 0.600
  - P(income = medium | Yes) = 4/9 = 0.444
  - P(income = medium | No) = 2/5 = 0.400
  - P(student = yes | Yes) = 6/9 = 0.667
  - P(student = yes | No) = 1/5 = 0.200
  - P(credit = fair | Yes) = 6/9 = 0.667
  - P(credit = fair | No) = 2/5 = 0.400
  - ...
26. Naïve Bayesian Classifier - Example
- Classify the unknown sample
  - X = (age < 30, income = medium, student = yes, credit = fair)
- P(Yes|X) ∝ P(Yes) · P(X|Yes)
  = 0.643 x 0.222 x 0.444 x 0.667 x 0.667 = 0.028
- P(No|X) ∝ P(No) · P(X|No)
  = 0.357 x 0.600 x 0.400 x 0.200 x 0.400 = 0.007
- Classify X as Yes
27. Discussion on Naïve Bayesian
- Advantages
  - Optimal classifier if all the joint probabilities P(X|h) are known (without the independence assumption)
  - Easy to apply
- Disadvantages
  - Needs a large number of training examples
  - Low accuracy when there are strong dependencies between attributes
- Laplace correction
  - Eliminates overly strong probability estimates (0 or 1)
28. Bayesian Belief Networks
- The independence hypothesis
  - Makes computation possible and yields optimal classifiers when satisfied
  - But it is seldom satisfied in practice, as attributes are often correlated
- Bayesian networks
  - A graphical model of causal relationships
  - Combine Bayesian reasoning with causal relationships between attributes: a joint conditional probability distribution
  - Allow a subset of the variables to be conditionally independent
29. Bayesian Belief Networks
30. Bayesian Belief Networks
- Computing probability
  - Compute any P(Ck | x1, ..., xn) with the joint probability distribution
    P(x1, ..., xn) = Πi P(xi | Parents(xi))
31. Bayesian Belief Networks
- Example: compute P(LungCancer | F, S, PX, D)
- Using a Naive Bayesian model
  - Need P(LC), P(F|LC), P(S|LC), P(PX|LC), P(D|LC)
  - → 2 + 4 + 4 + 4 + 4 = 18 probabilities
- Using a Bayesian network model
  - Need P(F), P(S), P(LC|F, S), P(PX|LC), P(D|LC)
  - → 2 + 2 + 8 + 4 + 4 = 20 probabilities
- Using the full joint probability table
  - Need P(LC, F, S, PX, D) → 2^5 = 32 probabilities
32. Linear Classification
- Classification as a mathematical mapping
  - x ∈ X = R^n, y ∈ Y = {+1, -1}
  - We want a function f: X → Y
- Binary classification problem
  - The data above the red line belongs to class 'x'
  - The data below the red line belongs to class 'o'
- Examples: SVM, Perceptron, Probabilistic Classifiers
33. Perceptron
- Vectors X, W
- Input: (X1, y1), ...
- Output: a classification function f(X)
  - f(Xi) > 0 for yi = +1
  - f(Xi) < 0 for yi = -1
- Decision boundary: f(x) = W·X + b = 0, i.e. w1x1 + w2x2 + b = 0
- Perceptron: update W additively
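A minimal additive-update perceptron in Python; the toy 2-D data, learning rate, and epoch count are illustrative.

```python
def train_perceptron(data, epochs=100, lr=1.0):
    """Additive perceptron updates: if a sample (X, y) with y in {+1, -1} is
    misclassified by f(X) = W.X + b, move W in the direction of y*X."""
    n = len(data[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in data:
            f = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * f <= 0:                       # wrong side of the hyperplane
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

# Linearly separable toy data in R^2
data = [((2.0, 1.0), +1), ((1.5, 2.0), +1), ((-1.0, -1.5), -1), ((-2.0, -0.5), -1)]
w, b = train_perceptron(data)
print(all((sum(wi * xi for wi, xi in zip(w, x)) + b) * y > 0 for x, y in data))  # True
```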
34. Neural Networks
[Figure: a single neuron. Input vector X = (x0, x1, ..., xn) and weight vector W = (w0, w1, ..., wn) are combined into a weighted sum (with bias/threshold θj); an activation function applied to this sum produces the output o]
35. Learning of a Neuron
- Given (X, t), minimize the error E = (1/2)(t - o)^2 by gradient descent on the weights
36. Multi-Layer Perceptron
[Figure: a feed-forward network. Input vector X = (x1, x2, ...) feeds the input layer; weights wij connect input to hidden nodes and weights wjk connect hidden to output nodes, producing the output vector O]
37. Network Training
- The objective of training
  - Obtain a set of weights that makes almost all the samples in the training data classified correctly
- Steps
  - Initialize the weights wij with random values
  - Feed the training samples X into the network one by one
  - For each unit
    - Compute the output value O using the activation function
    - Compute the error E
    - Update the weights wij and the biases
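The training loop above, written out for a one-hidden-layer network with sigmoid units and squared error (backpropagation, as named on the next slide). The layer sizes, learning rate, and XOR-style toy data are illustrative assumptions.

```python
import math, random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_mlp(samples, n_hidden=4, epochs=5000, lr=1.0, seed=0):
    """One-hidden-layer network trained with backpropagation (single sigmoid output)."""
    rnd = random.Random(seed)
    n_in = len(samples[0][0])
    w_ih = [[rnd.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_ho = [rnd.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]     # last weight = bias
    for _ in range(epochs):
        for x, t in samples:
            xb = list(x) + [1.0]                                     # input plus bias
            h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_ih]
            hb = h + [1.0]
            o = sigmoid(sum(w * hi for w, hi in zip(w_ho, hb)))
            # backward pass: error terms for the output and hidden units
            delta_o = (t - o) * o * (1 - o)
            delta_h = [hj * (1 - hj) * delta_o * w_ho[j] for j, hj in enumerate(h)]
            w_ho = [w + lr * delta_o * hi for w, hi in zip(w_ho, hb)]
            for j in range(n_hidden):
                w_ih[j] = [w + lr * delta_h[j] * xi for w, xi in zip(w_ih[j], xb)]
    def predict(x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * xi for w, xi in zip(row, xb))) for row in w_ih] + [1.0]
        return sigmoid(sum(w * hi for w, hi in zip(w_ho, h)))
    return predict

# Toy data just to exercise the training loop
data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
predict = train_mlp(data)
print([round(predict(x), 2) for x, _ in data])
```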
38. Backpropagation Learning
[Figure: the same multi-layer network; given a training pair (X, T), the error at the output vector O is propagated backwards to update the weights wjk and wij]
39. Neural Network Classifier - Example
40. Neural Network Classifier - Example
- Train with the 14 examples, 1000 times
- Classify the unknown sample
  - X = (age < 30, income = medium, student = yes, credit = fair) → (0, 0.5, 1, 0)
  - O = 0.85
  - Classify X as Yes
41. Discussion on NN
- Advantages
  - Robust: works when training examples contain errors
  - Output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  - Fast evaluation of the learned target function
- Criticism
  - Long training time
  - Difficult to understand the learned function (weights)
  - Not easy to incorporate domain knowledge
42. Classification Based on Association
- Uses association rule mining
- Associative classification
  - Mines high-support and high-confidence rules of the form cond_set => y (y: a class label)
  - If several rules have the same cond_set, the rule with the highest confidence is selected for the possible rule set
  - Rules are organized in decreasing confidence order
- Classification of new data
  - The first rule that is satisfied by the data is applied
43. Associative Classification - Example
- Possible rule set
  - (age = 30-40) → Yes (c = 100%, s = 21%)
  - (age < 30) ∧ (student = No) → No (c = 100%, s = 14%)
  - ...
  - (student = yes) → Yes (c = 86%, s = 43%)
  - (income = low) → Yes (c = 75%, s = 21%)
  - ...
- Classify the unknown sample
  - X = (income = medium, student = yes)
  - Classify X as Yes
44. k-Nearest Neighbor Algorithm
- Store all training examples (instances)
- All instances correspond to points in the n-D space
- Classify a new instance by finding the nearest examples
- The nearest neighbors are defined in terms of Euclidean distance
- The target function may be discrete- or real-valued
[Figure: a query point X among stored training instances in the feature space]
45. k-Nearest Neighbor Algorithm
- For a discrete-valued target
  - k-NN returns the most common value among the k training examples nearest to X
  - Ex: classify Yes or No with 10-NN; 7 Yes, 3 No → Yes
- For a continuous-valued target
  - Calculate the mean value of the k nearest neighbors
- Distance-weighted method
  - Weight the k neighbors according to their distance to X
  - Larger weight for closer neighbors
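A sketch of distance-weighted k-NN for a numeric (0/1) target; the 1/d weighting mirrors the worked example on the next slide, whose three neighbors are reused here as the toy training set.

```python
import math

def knn_predict(train, x, k=3, weighted=True):
    """train: list of (point, label) pairs with numeric labels (e.g. 1 = yes, 0 = no).
    Returns the (optionally distance-weighted) average of the k nearest labels."""
    dist = lambda a, b: math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    neighbors = sorted(train, key=lambda p: dist(p[0], x))[:k]
    if weighted:
        weights = [1.0 / max(dist(p, x), 1e-9) for p, _ in neighbors]  # closer -> larger
    else:
        weights = [1.0] * len(neighbors)
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# The 3-NN example from the next slide: query X = (0, 0.5, 1, 0)
train = [((0, 0, 1, 0), 1), ((0, 0.5, 0, 0), 0), ((0, 0.5, 1, 1), 1)]
print(knn_predict(train, (0, 0.5, 1, 0), k=3))   # 0.75 -> classify as yes
```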
46. k-NN - Example
- Given 14 examples → map to 4-D space
- Classify the unknown sample
  - X = (age < 30, income = medium, student = yes, credit = fair) → (0, 0.5, 1, 0)
- 3-NN:
  - (0, 0, 1, 0): 1 (yes), d = 0.5, w = 1/0.5
  - (0, 0.5, 0, 0): 0 (no), d = 1.0, w = 1/1.0
  - (0, 0.5, 1, 1): 1 (yes), d = 1.0, w = 1/1.0
- W = 4
- y = 2/4 x 1 (yes) + 1/4 x 0 (no) + 1/4 x 1 (yes) = 0.75
- Classify X as Yes
47. Discussion on k-NN
- Advantages
  - Robust to noisy data (averaging over the k nearest neighbors)
  - No training time
- Disadvantages
  - Classification time can be long when there are too many instances (each query requires distance computations against all stored instances)
  - Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes
48. Case-Based Reasoning
- Similar to k-NN
  - Instances (cases) are symbolic descriptions (not points in a Euclidean space)
  - Customer service help desks, law cases, technical designs
- Methodology
  - Instances are represented by symbolic descriptions
  - Find cases with similar descriptions
  - Solutions of multiple retrieved cases are combined
- Research issues
  - Finding a similarity measure
  - Combining cases
49. Lazy vs. Eager Learning
- Eager learning
  - Construct a generalization model before receiving new samples to classify
  - Decision tree, Bayesian classifier, neural network
- Lazy learning
  - Do not build a model until a new sample to classify is given
  - k-nearest neighbor classifier, case-based reasoning
- Difference
  - Training: lazy learning is faster
  - Classifying: eager learning is faster
50. Fuzzy Set Approaches
- Fuzzy logic
  - Uses truth values in [0.0, 1.0] (not just F, T)
  - Represented as a membership function
- Ex: IF (income > 50K) THEN yes → false for income = 49K
- Ex: IF (income = high) THEN yes → 0.9 yes for income = 49K
51. Fuzzy Set Approaches
- Using fuzzy logic in a rule-based system
  - Rules are represented with fuzzy categories
    - IF (income = high) THEN yes
    - IF (income = medium) THEN no
  - For a given new sample, attribute values are converted to fuzzy values
    - income = 49K → 0.1 medium, 0.9 high
  - Each applicable rule contributes a vote for membership in the categories
    - 0.9 yes, 0.1 no, ...
  - The truth values for each predicted category are summed
52. Prediction
- Prediction is similar to classification
  - Construct a model → use the model to predict an unknown value
- Linear regression: Y = α + βX
  - α and β are estimated by applying the least squares criterion to the known data (X1, Y1), (X2, Y2), ..., (Xn, Yn)
  - Minimize Σi (Yi - α - βXi)^2 → find α and β
- Multiple regression: Y = α + β1X1 + β2X2
  - Y is modeled as a linear function of multiple attributes
- Non-linear regression: Y = α + β1X + β2X^2
  - Y is modeled as a non-linear function of a single attribute
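A least-squares sketch for the simple linear case Y = α + βX, using the closed-form estimates; the toy data are made up.

```python
def linear_regression(xs, ys):
    """Least-squares estimates for Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
           sum((x - mean_x) ** 2 for x in xs)
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Toy data roughly following Y = 2 + 3X
xs, ys = [1, 2, 3, 4, 5], [5.1, 7.9, 11.2, 13.8, 17.0]
alpha, beta = linear_regression(xs, ys)
print(round(alpha, 2), round(beta, 2))   # approx. 2.09 and 2.97
```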
53. Classification Accuracy
- Holdout
  - Partition the dataset into a training set and a test set
  - Training set: derive the classifier
  - Test set: estimate the accuracy
- k-fold cross-validation
  - Divide the data set into k subsets
  - Use k-1 subsets as training data and 1 subset as test data
  - Repeat k times, and average the accuracy
  - Ex (k = 5): S1, S2, S3, S4, S5 → hold out S1 and train on the rest; then hold out S2; ...
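A generic k-fold cross-validation sketch; train_fn, classify_fn, and the majority-class stub in the demo are illustrative placeholders for any of the classifiers in this chapter.

```python
import random
from collections import Counter

def k_fold_accuracy(samples, k, train_fn, classify_fn, target, seed=0):
    """Split the data into k folds; train on k-1 folds, test on the held-out fold,
    and average the k accuracy estimates."""
    data = samples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        model = train_fn(train)
        correct = sum(classify_fn(model, s) == s[target] for s in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k

# Tiny demo with a majority-class "model"
data = [{"x": i, "label": "yes" if i % 3 else "no"} for i in range(30)]
train_fn = lambda train: Counter(s["label"] for s in train).most_common(1)[0][0]
classify_fn = lambda model, s: model
print(round(k_fold_accuracy(data, 5, train_fn, classify_fn, "label"), 2))   # 0.67
```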
54. Classification Accuracy
- Bootstrap
  - Works well with small data sets
  - Samples the given training tuples uniformly with replacement
    - i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
- .632 bootstrap
  - Given a data set of d tuples, the data set is sampled d times, with replacement, resulting in a training set of d samples
  - The data tuples that did not make it into the training set end up forming the test set
  - About 36.8% of the tuples will form the test set: (1 - 1/d)^d ≈ e^-1 = 0.368
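A sketch of the .632 bootstrap split described above; with a reasonably large d, the held-out fraction printed at the end comes out near 0.368.

```python
import random

def bootstrap_split(data, seed=0):
    """Sample d tuples with replacement as the training set; tuples never drawn
    form the test set (about 36.8% of the data on average)."""
    rnd = random.Random(seed)
    d = len(data)
    chosen = [rnd.randrange(d) for _ in range(d)]
    chosen_set = set(chosen)
    train = [data[i] for i in chosen]
    test = [data[i] for i in range(d) if i not in chosen_set]
    return train, test

data = list(range(1000))
train, test = bootstrap_split(data)
print(len(test) / len(data))   # close to (1 - 1/d)^d ~ e^-1 = 0.368
```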
55. Ensemble Methods
- Ensemble
  - Use a combination of models to increase accuracy
  - Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M
- Popular ensemble methods
  - Bagging: averaging the prediction over a collection of classifiers
  - Boosting: weighted vote with a collection of classifiers
56. Bagging
- Analogy
  - Diagnosis based on multiple doctors' majority vote
- Training
  - Given a set D of d tuples, at each iteration i, a training set Di of d tuples is sampled with replacement from D (i.e., bootstrap)
  - A classifier model Mi is learned for each training set Di
- Classification
  - Each classifier Mi returns its class prediction
  - The bagged classifier M counts the votes and assigns the class with the most votes to an unknown sample X
- Accuracy
  - Often significantly better than a single classifier derived from D
  - For noisy data: not considerably worse, more robust
  - Proved improved accuracy in prediction
57. Boosting
- Analogy
  - Consult several doctors, and base the diagnosis on a combination of their diagnoses
  - A weight is assigned to each doctor based on previous diagnosis accuracy
- Training
  - Weights are assigned to each training tuple
  - A series of k classifiers is iteratively learned
  - After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
- Classification
  - The final model M combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy
- Accuracy
  - Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
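An AdaBoost-style sketch of the weighting scheme described above, assuming labels in {+1, -1} and a learn(data, weights) procedure that can use per-tuple weights; the error clamp and the 0.5·ln((1-err)/err) classifier weight follow the standard formulation rather than anything given on the slide.

```python
import math

def boost_train(data, learn, k=5):
    """data: list of (x, y) with y in {+1, -1}; learn(data, weights) must return a
    classifier h with h(x) in {+1, -1}."""
    n = len(data)
    weights = [1.0 / n] * n
    ensemble = []                                     # (alpha, classifier) pairs
    for _ in range(k):
        h = learn(data, weights)
        err = sum(w for w, (x, y) in zip(weights, data) if h(x) != y)
        err = min(max(err, 1e-10), 1 - 1e-10)         # avoid division by zero / log(0)
        alpha = 0.5 * math.log((1 - err) / err)       # classifier weight from its accuracy
        # increase the weights of misclassified tuples for the next round
        weights = [w * math.exp(-alpha * y * h(x)) for w, (x, y) in zip(weights, data)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((alpha, h))
    return ensemble

def boost_classify(ensemble, x):
    """Weighted vote of the individual classifiers."""
    return 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```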
58. References
- C. Apte and S. Weiss. Data mining with decision trees and decision rules. Future Generation Computer Systems, 13, 1997.
- C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
- L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth International Group, 1984.
- C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2): 121-168, 1998.
- P. K. Chan and S. J. Stolfo. Learning arbiter and combiner trees from partitioned data for scaling machine learning. KDD'95.
- W. Cohen. Fast effective rule induction. ICML'95.
- G. Cong, K.-L. Tan, A. K. H. Tung, and X. Xu. Mining top-k covering rule groups for gene expression data. SIGMOD'05.
- A. J. Dobson. An Introduction to Generalized Linear Models. Chapman and Hall, 1990.
- G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2ed. John Wiley and Sons, 2001.
- U. M. Fayyad. Branching on attribute values in decision tree generation. AAAI'94.
- Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Computer and System Sciences, 1997.
- J. Gehrke, R. Ramakrishnan, and V. Ganti. RainForest: A framework for fast decision tree construction of large datasets. VLDB'98.
59. References
- J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT -- Optimistic Decision Tree Construction. SIGMOD'99.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
- D. Heckerman, D. Geiger, and D. M. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 1995.
- M. Kamber, L. Winstone, W. Gong, S. Cheng, and J. Han. Generalization and decision tree induction: Efficient classification in data mining. RIDE'97.
- B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
- W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
- T.-S. Lim, W.-Y. Loh, and Y.-S. Shih. A comparison of prediction accuracy, complexity, and training time of thirty-three old and new classification algorithms. Machine Learning, 2000.
- J. Magidson. The CHAID approach to segmentation modeling: Chi-squared automatic interaction detection. In R. P. Bagozzi, editor, Advanced Methods of Marketing Research, Blackwell Business, 1994.
- M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A fast scalable classifier for data mining. EDBT'96.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- S. K. Murthy. Automatic Construction of Decision Trees from Data: A Multi-Disciplinary Survey. Data Mining and Knowledge Discovery, 2(4): 345-389, 1998.
60. References
- J. R. Quinlan. Induction of decision trees. Machine Learning, 1: 81-106, 1986.
- J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. ECML'93.
- J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
- J. R. Quinlan. Bagging, boosting, and C4.5. AAAI'96.
- R. Rastogi and K. Shim. PUBLIC: A decision tree classifier that integrates building and pruning. VLDB'98.
- J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining. VLDB'96.
- J. W. Shavlik and T. G. Dietterich. Readings in Machine Learning. Morgan Kaufmann, 1990.
- P. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
- S. M. Weiss and C. A. Kulikowski. Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, 1991.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1997.
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2ed. Morgan Kaufmann, 2005.
- X. Yin and J. Han. CPAR: Classification based on predictive association rules. SDM'03.
- H. Yu, J. Yang, and J. Han. Classifying large data sets using SVM with hierarchical clusters. KDD'03.